- Authors: Eshaan Agarwal, Joykirat Singh, Vivek Dani, Raghav Magazine, Tanuja Ganu, Akshay Nambi
- Source: Excerpts from “2405.18369.pdf”
Introduction
This document reviews the key concepts and findings from the excerpted source on PromptWizard, a prompt optimization framework developed by Microsoft Research. The excerpts highlight the limitations of existing prompt optimization techniques, particularly for closed-source Large Language Models (LLMs), and introduce PromptWizard as a novel approach that iteratively refines prompts using feedback.
Key Themes and Ideas
Challenge of Black-Box LLM Prompt Optimization:
- Many existing prompt optimization methods, especially gradient-based ones, require access to the
internal mechanics of a model, making them unusable for closed-source LLMs like GPT-4 and Gemini.
- “Some approaches, such as gradient-based methods, have been used to optimize prompts by leveraging token probabilities and model gradients… However, these methods are limited to white-box (open-source) models, as they require direct access to the model’s internal mechanics.”
- Optimizing prompts for black-box LLMs requires gradient-free approaches, which can be broadly
categorized into continuous and discrete methods.
- “This necessitates gradient-free prompt optimization strategies.”
- “These strategies can be broadly classified into two types: continuous and discrete prompt optimization.”
Limitations of Existing Gradient-Free Methods:
- Continuous methods (e.g., InstructZero, Instinct), which rely on “soft prompts” and additional training of neural networks, often see performance vary with the complexity of the task and the open-source model used.
- “However, these methods require additional training of NNs and their performance often varies based on the open-source model and task complexity. For more complex tasks, learning the optimal prompt-performance mapping becomes challenging.”
- Discrete methods (e.g., PromptBreeder, EvoPrompt) generate multiple prompts but lack a guided refinement process, resulting in less task-specific prompts.
- “On the other hand, discrete methods like PromptBreeder and EvoPrompt generate multiple prompt…”
PromptWizard’s Iterative Optimization Framework:
- PromptWizard employs an iterative optimization approach that refines both the prompt instruction and the few-shot examples. This is achieved through a four-step process that is repeated for a specified number of rounds:
- Mutation: Generates variations of the current prompt by applying different “thinking styles.”
- Scoring: Evaluates the mutated prompts against training examples to determine the best-performing
prompts.
- “We begin by extracting candidate examples from the dataset and employ a scoring mechanism to assess the current prompt’s effectiveness against these examples, classifying them into positive and negative categories.”
- Critique: Provides feedback on the best-scoring prompt, identifying weaknesses and areas for
improvement. This feedback is targeted towards specific weaknesses, providing a focused
improvement rather than general changes.
- “This targeted feedback is critical in refining the prompt, as it provides insights into specific weaknesses, allowing for focused improvements rather than general changes.”
- Synthesis: Refines the prompt based on the critique, creating a more task-specific and optimized
instruction.
- “Finally, PW synthesize component uses the critique’s feedback to refine the best prompt. It rephrases and enhances the instruction based on the critique, producing a more task-specific and optimized prompt.”
- This process is used both to optimize the initial instruction and, subsequently, to optimize the few-shot examples (a minimal code sketch of the full loop appears at the end of this section).
- “By combining these steps—mutation, scoring, critique, and synthesis—PW ensures that the prompts are not only diverse and creative but also highly tailored to the specific task at hand, outperforming prior methods that lack this guided refinement process.”
PromptWizard identifies diverse examples and leverages Chain-of-Thought reasoning for enhanced performance:
- “Next, we focus on identifying a diverse set of candidate examples to enhance prompt effectiveness… With the optimized prompt and few-shot examples, we further enhance model performance by in-corporating chain-of-thought (CoT) reasoning.”
- The identified few-shot examples also go through iterative improvement. In this step, the prompt’s effectiveness is evaluated against a dataset, and questions are classified into positive and negative examples based on whether the prompt answers them correctly.
- “Positive examples demonstrate where the prompt succeeds, while negative examples highlight areas for improvement.”
- Both the prompts and few-shot examples are refined using critique and synthesis to ultimately yield optimized results.
- Reasoning chains are automatically generated for examples and then validated.
- The system also attempts to determine the human intent behind the given task and to develop an expert persona.
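To make the loop described above concrete, here is a minimal Python sketch of one optimization round as summarized in the excerpts: mutation over thinking styles, scoring against training examples, critique of failures, synthesis of a refined instruction, and chain-of-thought enrichment of selected few-shot examples. Every name here (llm, mutate, score, critique, synthesize, add_reasoning_chain) is an illustrative assumption, not the actual PromptWizard API.

```python
# Illustrative sketch of the PromptWizard-style loop described above.
# `llm(prompt)` is a stand-in for any chat-completion call; every helper
# here is hypothetical, not the real PromptWizard implementation.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def mutate(instruction: str, thinking_styles: list[str]) -> list[str]:
    # Mutation: rewrite the instruction under different "thinking styles".
    return [llm(f"Rewrite this task instruction using the style '{s}':\n{instruction}")
            for s in thinking_styles]

def score(instruction: str, examples: list[dict]) -> tuple[float, list[dict], list[dict]]:
    # Scoring: run the candidate prompt on training examples and split them into
    # positives (answered correctly) and negatives (answered incorrectly).
    positives, negatives = [], []
    for ex in examples:
        answer = llm(f"{instruction}\n\nQ: {ex['question']}\nA:")
        (positives if ex["answer"] in answer else negatives).append(ex)
    return len(positives) / max(len(examples), 1), positives, negatives

def critique(instruction: str, negatives: list[dict]) -> str:
    # Critique: targeted feedback derived from the examples the prompt failed on.
    failures = "\n".join(ex["question"] for ex in negatives)
    return llm(f"This instruction failed on the cases below. Point out its specific "
               f"weaknesses.\n\nInstruction:\n{instruction}\n\nFailed cases:\n{failures}")

def synthesize(instruction: str, feedback: str) -> str:
    # Synthesis: rephrase and enhance the instruction based on the critique.
    return llm(f"Improve this instruction using the feedback.\n\n"
               f"Instruction:\n{instruction}\n\nFeedback:\n{feedback}")

def add_reasoning_chain(instruction: str, example: dict) -> dict:
    # Generate a step-by-step reasoning chain for a selected few-shot example.
    chain = llm(f"{instruction}\n\nQ: {example['question']}\n"
                f"Explain step by step, then give the final answer.")
    return {**example, "reasoning": chain}

def optimize(instruction: str, train: list[dict], thinking_styles: list[str], rounds: int = 3):
    best, best_score, fewshot = instruction, 0.0, []
    for _ in range(rounds):
        candidates = [best] + mutate(best, thinking_styles)
        scored = [(score(c, train), c) for c in candidates]
        (best_score, positives, negatives), best = max(scored, key=lambda t: t[0][0])
        feedback = critique(best, negatives)
        best = synthesize(best, feedback)
        # Keep a small set of examples the prompt handles well and enrich them with CoT.
        fewshot = [add_reasoning_chain(best, ex) for ex in positives[:5]]
    return best, fewshot, best_score
```

In this sketch the critique is built only from failed examples, mirroring the paper's point that targeted feedback on specific weaknesses drives the refinement rather than general prompt changes.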
Performance Evaluation and Results:
- PromptWizard was tested on various datasets, including the BIG-Bench Instruction Induction (BBII) dataset, arithmetic reasoning tasks (GSM8k, AQUARAT, SVAMP), and domain-specific tasks from BigBench Hard (BBH).
- PromptWizard outperforms existing state-of-the-art methods across various evaluation metrics and
datasets.
- “PromptWizard outperforms the baselines, achieving the highest accuracy on 13 out of 19 tasks (68%), compared to Instinct’s 8 tasks (42%).”
- It achieves superior performance in zero-shot settings with GPT3.5Turbo and also maintains this lead in one-shot settings.
- Significant gains are seen in arithmetic reasoning tasks (GSM8k, AQUARAT, SVAMP) and the BBH
tasks.
- “…achieving significant gains in accuracy on arithmetic reasoning tasks… These tasks, often requiring detailed multi-step reasoning, which PW addresses through its iterative synthesis of prompts enriched with intermediate reasoning steps and examples.”
- PromptWizard’s advantage is consistent across different base LLMs including GPT-4, GPT3.5Turbo, and Llama-70B.
- An ablation study highlights the contribution of each stage in the PW pipeline; generating reasoning chains for the few-shot examples provided the most significant gains.
- “This emerges as one of the most significant contributors, indicating that generating detailed reasoning chains for few-shot examples is critical for task accuracy.”
Computational Efficiency:
- PromptWizard demonstrates significantly reduced API calls and token usage compared to methods like
Instinct and PromptBreeder.
- “We can see that PW has significant lower number of API calls compared to Instinct, thus resulting in 5x reduction in overall tokens per task.”
- The framework can utilize smaller LLMs, such as Llama-70B, for prompt generation, further reducing
costs while maintaining high task accuracy through the use of GPT-4 for inference.
- “This approach reduce computational costs during prompt optimization by leveraging the efficiency of smaller models while still maximizing task accuracy with powerful model during inference.”
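As a rough illustration of this split, the sketch below routes the many optimization-time calls to a smaller model and reserves the stronger model for inference. The model identifiers and the complete helper are placeholders, not PromptWizard's actual configuration.

```python
# Illustrative two-model split: a smaller model generates and refines prompts,
# a stronger model answers at inference time. Names are placeholders.

OPTIMIZER_MODEL = "llama-70b"   # used for mutation, critique, and synthesis calls
INFERENCE_MODEL = "gpt-4"       # used only to answer end-user queries

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def optimize_prompt(task_description: str) -> str:
    # The bulk of API traffic happens during optimization, so it goes to the cheaper model.
    return complete(OPTIMIZER_MODEL, f"Refine this task instruction:\n{task_description}")

def answer(optimized_prompt: str, question: str) -> str:
    # A single inference call per query goes to the stronger model.
    return complete(INFERENCE_MODEL, f"{optimized_prompt}\n\nQ: {question}\nA:")
```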
Task Specificity & Detailed Example Refinement:
- PromptWizard is designed to improve performance on complex tasks by focusing on optimizing both the instruction prompt and the provided examples.
- The refined prompts often include detailed guidance, such as step-by-step instructions, handling
of specific scenarios or operations, and explicit problem-solving strategies.
- “Analyze the given real-world mathematical problem step-by-step, identifying key information, relationships between different pieces of data, and the context… Finally, verify your answer against the original problem to ensure it is logical and accurate.”
- The system also generates tailored few-shot examples that include step-by-step reasoning chains,
enhancing the LLM’s understanding of the task and guiding its generation of accurate responses.
- “Specifically, we automatically generate a detailed reasoning chain for each selected few-shot examples.”
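The excerpts describe the final prompt as combining an expert persona, a detailed optimized instruction, and few-shot examples carrying step-by-step reasoning chains. The sketch below shows how such a prompt might be assembled; the layout and the build_final_prompt helper are assumptions for illustration, not the exact template PromptWizard produces.

```python
# Illustrative assembly of a final prompt in the shape described above.

def build_final_prompt(persona: str, instruction: str, examples: list[dict], question: str) -> str:
    fewshot = "\n\n".join(
        f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}"
        for ex in examples
    )
    return (
        f"{persona}\n\n"        # e.g. an expert persona such as "You are an expert math tutor."
        f"{instruction}\n\n"    # the optimized, step-by-step task instruction
        f"{fewshot}\n\n"        # CoT-enriched few-shot examples
        f"Q: {question}\nA:"
    )
```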
Key Facts
- Iterative Optimization: PromptWizard iteratively refines prompts and examples.
- Four-Step Process: Mutation, Scoring, Critique, and Synthesis.
- Gradient-Free: Operates without needing access to the internal model mechanics of closed-source LLMs.
- Chain-of-Thought Reasoning: Uses chain-of-thought to enhance problem-solving abilities.
- Diverse Example Selection: Selects examples that maximize the impact of refinement.
- Strong Performance: Outperforms SOTA algorithms in instruction and task performance.
- Reduced Cost: Significantly lower API calls and token usage.
- Adaptable: Works with different base LLMs for both prompt optimization and inference.
Conclusion
PromptWizard represents a significant advancement in prompt optimization for black-box LLMs. By combining iterative feedback-driven refinement, chain-of-thought reasoning, diverse example selection, and detailed expert prompts, it overcomes the limitations of existing methods and achieves superior performance across a range of tasks. The framework’s efficiency and adaptability make it a promising tool for practical applications involving complex tasks and large language models.