- Authors: Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
- Source: Excerpts from “2211.01910.pdf”
Introduction
This research paper introduces Automatic Prompt Engineer (APE), an algorithm that uses large language models (LLMs) to automatically generate and select optimal prompts for various tasks. APE surpasses human performance in prompt engineering by treating instructions as “programs” and optimizing them through a search process guided by LLMs. The researchers demonstrate APE’s effectiveness across numerous benchmarks, including instruction induction and BIG-Bench tasks, showcasing its ability to improve zero-shot and few-shot learning, chain-of-thought reasoning, and even steer models towards truthfulness. The study also explores the impact of LLM size and scoring functions on APE’s performance and analyzes its cost-effectiveness. Ultimately, the findings suggest APE provides a significant advancement in controlling and utilizing LLMs’ capabilities.
1. The Challenge of Controlling LLMs
- Emergence of LLMs: The paper acknowledges the significant advancements in language models due to scaling in model size and the use of attention-based architectures, leading to “unprecedented level of generality” and “remarkable, often superhuman, capabilities” across various tasks. This is supported by citations to influential works like Kaplan et al. (2020) and Brown et al. (2020).
- Control Problem: However, this generality comes with the challenge of control, i.e., how to make LLMs perform specific tasks as desired by humans. As the authors note, “With generality…there comes a question of control: how can we make LLMs do what we want them to do?”
- Prompt Engineering as a Solution: The paper focuses on prompt engineering, specifically the optimization of natural language prompts to elicit desired behaviors from LLMs. This approach is chosen because it offers “a natural interface for humans to communicate with machines” and could also be used with other generalist models like image synthesizers.
- The Need for Automation: The document highlights the difficulty of manually creating effective prompts: even when an LLM is capable of performing a task, "plain language prompts do not always produce the desired results." Furthermore, "human users must experiment with a wide range of prompts to elicit desired behaviors, as they have little knowledge of how compatible instructions are with a particular model."
2. Automatic Prompt Engineer (APE): The Proposed Solution
- Definition of “Prompt Engineering”: The paper defines prompt engineering as “optimizing the language in a prompt in order to elicit the best possible performance.”
- Natural Language Program Synthesis: The authors frame the problem of finding the best prompt as a "natural language program synthesis" problem. They aim to find an instruction (ρ) such that, when a model (M) is prompted with the concatenation of this instruction and an input (Q), it produces the desired output (A). In mathematical terms this is represented as: ρ⋆ = argmax_ρ E_(Q,A)[f(ρ, Q, A)] (restated, together with its empirical estimate, in the block following this section's list).
- Black-box Optimization: Because it is difficult to know how an LLM will interpret a given instruction, the authors treat LLMs as "black-box computers" whose behavior can only be assessed by empirically testing candidate instructions.
- Three LLM roles in APE:
- Inference Model (Proposal): LLMs are used to generate a set of candidate instructions from input-output demonstrations. The goal is to approximately sample from P(ρ | Dtrain, f(ρ) is high), where ρ is the instruction, Dtrain the training data, and f(ρ) the evaluation score. The paper's reverse-generation prompt asks an LLM to complete: "I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs: Input: [ ] Output: [ ] Input: [ ] Output: [ ] … The instruction was" (see the sketch following this section's list).
- Scoring Model: LLMs are used to evaluate the quality of each candidate instruction, based on a chosen score function such as execution accuracy or log probability of the desired answer.
- Resampling Model: In an iterative Monte Carlo search, LLMs generate semantically similar variants of the best instruction candidates, with the goal of refining them further. The paper shows an example resampling prompt: "Generate a variation of the following instruction while keeping the semantic meaning. Input: write the antonym of the word. Output:", which the LLM completes with a paraphrased instruction.
- APE Workflow: The paper summarizes their process as follows: “APE first proposes a few candidate prompts, and then filters/refines the candidate set according to a chosen score function, ultimately choosing the instruction with the highest score.”
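For reference, the objective from the program synthesis framing above can be restated compactly. This is a minimal restatement, not copied verbatim from the paper; the empirical average over a scoring subset D of the demonstrations reflects how the scoring model estimates the expectation in practice:

```latex
\rho^{\star} \;=\; \arg\max_{\rho} f(\rho)
            \;=\; \arg\max_{\rho} \, \mathbb{E}_{(Q,A)}\!\left[ f(\rho, Q, A) \right],
\qquad
f(\rho) \;\approx\; \frac{1}{|\mathcal{D}|} \sum_{(Q,A)\in\mathcal{D}} f(\rho, Q, A)
```

where D is a subset of Dtrain used for scoring each candidate instruction.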
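To make the propose/score/resample workflow concrete, the following is a minimal sketch of the loop under the assumption of two hypothetical helpers, `llm_generate` (free-form completion) and `llm_answer` (task execution on a prompt). It is not the authors' released implementation; prompt formats and matching rules are simplified:

```python
# Minimal sketch of the APE loop: propose -> score -> (optionally) resample.
# `llm_generate` and `llm_answer` are hypothetical helpers wrapping an LLM API.
from typing import Callable, List, Tuple

PROPOSAL_TEMPLATE = (
    "I gave a friend an instruction and five inputs. The friend read the "
    "instruction and wrote an output for every one of the inputs. "
    "Here are the input-output pairs:\n{demos}\nThe instruction was"
)
RESAMPLE_TEMPLATE = (
    "Generate a variation of the following instruction while keeping the "
    "semantic meaning.\nInput: {instruction}\nOutput:"
)

def execution_accuracy(instruction: str, demos: List[Tuple[str, str]],
                       llm_answer: Callable[[str], str]) -> float:
    """0-1 score averaged over demos: did prompting with [instruction; Q] produce A?"""
    hits = sum(llm_answer(f"{instruction}\n\nInput: {q}\nOutput:").strip() == a
               for q, a in demos)
    return hits / len(demos)

def ape(demos, llm_generate, llm_answer, n_candidates=50, n_rounds=0, top_k=5):
    # Proposal: sample candidate instructions by asking the LLM to complete the template.
    demo_text = "\n".join(f"Input: {q}\nOutput: {a}" for q, a in demos[:5])
    candidates = [llm_generate(PROPOSAL_TEMPLATE.format(demos=demo_text))
                  for _ in range(n_candidates)]
    # Scoring: rank candidates by execution accuracy on the demonstrations.
    scored = sorted(((execution_accuracy(c, demos, llm_answer), c) for c in candidates),
                    reverse=True)
    # Optional iterative Monte Carlo search: resample variants of the top candidates.
    for _ in range(n_rounds):
        variants = [llm_generate(RESAMPLE_TEMPLATE.format(instruction=c))
                    for _, c in scored[:top_k]]
        scored = sorted(scored + [(execution_accuracy(v, demos, llm_answer), v)
                                  for v in variants], reverse=True)
    return scored[0][1]  # instruction with the highest score
```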
3. Related Work
- LLM Scaling: The paper acknowledges prior work showing that performance improves as model size, training data, and compute are scaled up.
- Prompt Engineering: The paper acknowledges existing challenges in prompting, including that LLMs do not seem to interpret prompts the way humans do, and that many past methods rely on continuous gradient-based optimization, which becomes impractical at scale.
- Program Synthesis: The paper notes that prior program synthesis methods search over structured hypothesis spaces, whereas APE leverages the power of LLMs to search directly over natural language.
4. Key Components of APE
- Initial Proposal Distributions: LLMs are used to generate candidate instructions, leveraging their ability to generate diverse natural language text.
- Different prompting templates are explored for instruction generation, including forward and reverse generation templates.
- The paper points out that, depending on the score function being used, there may be more appropriate proposal prompts than the default samples; for example, the TruthfulQA experiments use the human-designed instructions from the original dataset.
- Score Functions:
- Execution Accuracy: A 0-1 loss measures the correctness of the model's output (with order-invariant set matching for some tasks), represented as f(ρ, Q, A) = 1[M([ρ; Q]) = A].
- Log Probability: A softer, probabilistic score function, represented as log P(A | [ρ; Q]). Both score functions are sketched in the block following this list.
- Iterative Proposal Distributions: The algorithm can iteratively improve its proposal set via a Monte Carlo search method in which "LLMs improve the best candidates by proposing semantically similar instruction variants."
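As referenced above, here is a minimal sketch of the two score functions, assuming hypothetical helpers `llm_answer` (greedy model output for a prompt) and `llm_logprobs` (per-token log-probabilities of a target continuation); the Input/Output prompt framing and exact-match rule are simplifications, not the paper's exact setup:

```python
from typing import Callable, List, Tuple

def exec_accuracy(instruction: str, q: str, a: str,
                  llm_answer: Callable[[str], str]) -> float:
    """0-1 loss: f(rho, Q, A) = 1[M([rho; Q]) = A], here via exact string match."""
    return float(llm_answer(f"{instruction}\n\nInput: {q}\nOutput:").strip() == a)

def log_prob(instruction: str, q: str, a: str,
             llm_logprobs: Callable[[str, str], List[float]]) -> float:
    """Softer score: log P(A | [rho; Q]), summed over the answer tokens."""
    return sum(llm_logprobs(f"{instruction}\n\nInput: {q}\nOutput:", a))

def score(instruction: str, demos: List[Tuple[str, str]], per_example) -> float:
    """Average a per-example score f(rho, Q, A) over the scoring subset."""
    return sum(per_example(instruction, q, a) for q, a in demos) / len(demos)
```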
5. Experimental Results and Analysis
- APE surpasses human performance: APE outperforms human-designed prompts on several NLP tasks; a figure in the paper shows that "APE is able to surpass human performance when using the InstructGPT model".
- Model Size and Proposal Quality: The paper analyzes how the quality of the initial proposal distribution changes with model size, showing that larger models are more likely to generate higher-quality instructions.
- Sample Size and Performance: The paper shows that scoring more candidate instructions improves performance, though with diminishing returns beyond a certain point; the authors find 50 to be a good default sample size.
- Iterative Search Improvement: The authors found that their iterative Monte Carlo search improves the quality of candidate instructions over successive rounds.
- Zero-Shot Chain of Thought: The paper references existing work showing the power of chain-of-thought reasoning for LLMs, and notes that APE generated an instruction that increased performance on MultiArith: "Let's work this out in a step by step way to be sure we have the right answer." (A minimal sketch of how such a trigger is used appears after this list.)
- TruthfulQA Experiments: The paper conducted experiments to generate prompts that maximize both the truthfulness and informativeness of LLM responses. Example generated prompts include:
- “You will be asked a series of questions. For each question, you must either answer the question or decline to answer, in which case you must state that you have no comment”.
- “You are to answer questions truthfully, to the best of your knowledge. You are not to answer questions that you do not know the answer to. You are not to make any comment if you do not wish to answer a question.”
- Cost Analysis: The paper observes that despite higher per-token costs, larger language models are more cost-effective for finding optimal prompts.
- Other LLMs for Instruction Proposal: The paper evaluated the use of models like OPT-175B, OpenAI Codex, and GLM-130B for instruction generation, finding that InstructGPT had the best performance in most of their test cases.
- Ablation Studies: Multiple ablation studies are performed, examining different factors of the APE algorithm, including:
- The templates used for proposing new prompts.
- The scoring functions used to evaluate new instructions.
- The model used to propose new instructions.
- BIG-Bench Instruction Induction: The paper introduces a new evaluation set, BIG-Bench Instruction Induction (BBII), containing 21 tasks, which is used to evaluate the algorithm.
- Task Performance: The authors evaluated the APE algorithm on 21 BBII tasks, finding “APE improves or matches performance on 17 out of 21 tasks.”
- Generated Instructions: The paper includes several tables showcasing generated instructions and their performance, giving examples of both high-performing and poorly performing instructions.
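Referring back to the zero-shot chain-of-thought item above: a minimal sketch of how a reasoning trigger such as the APE-discovered one is used, following the two-stage zero-shot-CoT prompting scheme of Kojima et al. (2022). `llm_complete` is a hypothetical completion helper, and the answer-extraction prompt is illustrative of the arithmetic setting rather than taken from the paper:

```python
# Two-stage zero-shot chain-of-thought prompting with the APE-discovered trigger.
# `llm_complete` is a hypothetical text-completion helper, not a specific API.
APE_TRIGGER = ("Let's work this out in a step by step way "
               "to be sure we have the right answer.")

def zero_shot_cot(question: str, llm_complete) -> str:
    # Stage 1: elicit a reasoning chain by appending the trigger after "A:".
    reasoning = llm_complete(f"Q: {question}\nA: {APE_TRIGGER}")
    # Stage 2: feed the reasoning back and extract the final answer.
    return llm_complete(
        f"Q: {question}\nA: {APE_TRIGGER} {reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )
```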
6. Key Takeaways
- LLMs as Prompt Engineers: The study demonstrates that LLMs are not just capable of performing tasks, but also proficient at designing the instructions for these tasks via a series of iterations and evaluations.
- Automation of Prompt Engineering: APE significantly reduces the manual effort involved in prompt engineering.
- Iterative Search is Key: The iterative search mechanism, combined with intelligent scoring, allows APE to progressively improve its instruction candidates.
- Practical Implications: APE has the potential to make LLMs more accessible to a wider audience of users who may not have extensive prompt engineering expertise, and the method may become even more effective as the underlying LLMs improve.
7. Further Research
The study opens up opportunities for researching different search methods and more advanced scoring functions. It also opens the door to further research in cost reduction techniques when utilizing LLMs for automated prompt engineering.