Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection
Conference: ACL 2025
arXiv: 2411.07446
Code: None
Area: LLM/NLP
Keywords: prompt optimization, feedback memory, exemplar retrieval, automatic prompt engineering, LLM
TL;DR
Proposes ERM, which improves feedback quality by using meta-prompts to generate exemplars with detailed problem-solving processes. It also introduces two long-term memory mechanisms, Feedback Memory and Exemplar Factory, to store and reuse historical feedback and exemplars efficiently. ERM surpasses SOTA prompt optimization methods on multiple tasks with approximately half the optimization steps.
Background & Motivation
Background: Automatic prompt optimization aims to find the optimal prompt without human intervention. Mainstream methods include evolutionary (EvoPrompt), trajectory-based (OPRO, GPO), and feedback-based (ProTeGi) approaches.
Limitations of Prior Work: Feedback-based methods suffer from two core problems: (a) they only utilize the feedback from the current step, discarding historical and non-selected feedback, which leads to more optimization steps required for convergence; (b) during inference, exemplar retrieval is solely based on semantic similarity, without evaluating its impact on actual task performance.
Key Challenge: Valuable feedback information is wasted, and the selection of exemplars is disconnected from task performance.
Goal: How to efficiently utilize all historical feedback? How to select exemplars that truly contribute to task performance?
Key Insight: Inspired by human memory (the Ebbinghaus forgetting curve), the method builds long-term memory stores for feedback and exemplars, assigns each item a priority score, and dynamically adjusts that priority and selectively forgets items based on how effective they prove to be.
Core Idea: Manage feedback and exemplars using memory mechanisms, ensuring that valuable information is continuously utilized while unhelpful information is forgotten.
Method
Overall Architecture
ERM (Exemplar-Guided Reflection with Memory) has three core components: Exemplar-Guided Reflection, Feedback Memory, and Exemplar Factory. Each optimization step takes a set of error samples \(\mathcal{B}\) and the current prompt \(p^t\) as input. Exemplar-Guided Reflection generates exemplars and feedback, which are stored in the two memory stores (Feedback Memory and Exemplar Factory), and the step outputs the optimized prompt \(p^{t+1}\). During inference, exemplars are retrieved from the Exemplar Factory and concatenated with the prompt to improve prediction accuracy.
Key Designs
- Exemplar-Guided Reflection:
  - Function: Designing a guided meta-prompt to direct the prompt optimizer to select typical samples from error instances and provide detailed problem-solving processes (CoT style), then generating more informative feedback based on these exemplars.
  - Mechanism: \(\mathcal{E}^t = M_e(p^t, \mathcal{B}; p^{meta}_{ref*})\) is first used to generate a set of exemplars (including question, answer, and CoT), and then feedback \(\mathcal{F}^t = M_e(p^t, \mathcal{B}, \mathcal{E}^t; p^{meta}_{ref*})\) is generated based on the exemplars.
  - Design Motivation: Traditional methods generate limited feedback directly on error samples. Incorporating detailed problem-solving processes makes the feedback more targeted, offering a more precise direction for subsequent prompt optimization.
- Feedback Memory:
  - Function: Storing historical feedback, assigning a priority score to each item, and periodically retrieving high-priority feedback to guide prompt optimization.
  - Mechanism: During storage, feedback is filtered (only feedback that brings performance gains is stored) and deduplicated. During retrieval, feedback is sampled with probability \(P_f = \text{softmax}(\{e^{s_p(f_i)/\tau_f}\})\). After use, the priority is updated as \(s_p^t(f) = (1-\beta)s_p^{t-1}(f) + \beta \mathbb{I}(f)\), where \(\mathbb{I}(f)\) indicates whether the feedback proved effective, and feedback whose score falls below the threshold \(\theta\) is forgotten.
  - Design Motivation: Avoiding the loss of valuable historical feedback while ensuring only effective information is retained in memory through a selective forgetting mechanism, thereby accelerating optimization convergence.
- Exemplar Factory:
  - Function: Storing, evaluating, and retrieving exemplars, and choosing the best exemplars to concatenate with the prompt to enhance predictions during inference.
  - Mechanism: Also managed using priority scores. Retrieval jointly considers an exemplar's priority and its semantic similarity to the current question: \(P_e^r = \text{softmax}(\{e^{s_p(e_i) \cdot s_s^j(e_i)/\tau_e}\})\). During storage, the correctness of the problem-solving process is validated and duplicates are removed; after use, priorities are updated based on how helpful the exemplar was to predictions.
  - Design Motivation: Retrieval based purely on semantic similarity does not guarantee selecting the most helpful exemplars for the task. Thus, the retrieval strategy needs to be optimized through real-world performance feedback.
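The priority bookkeeping described for Feedback Memory and Exemplar Factory can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The class name `PriorityMemory`, the default values of `beta`, `theta`, and `tau`, the initial score of 1.0, and exact-string deduplication are all assumptions. The sampling rule is read here as a standard temperature softmax over \(s_p/\tau\) (or \(s_p \cdot s_s/\tau\) when similarities are supplied), which is one interpretation of the notation above.

```python
import math
import random
from dataclasses import dataclass, field


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class PriorityMemory:
    """Toy priority store (hypothetical name) for feedback or exemplars."""
    beta: float = 0.5        # update rate (assumed default)
    theta: float = 0.2       # forgetting threshold (assumed default)
    tau: float = 1.0         # sampling temperature (assumed default)
    init_score: float = 1.0  # starting priority (assumed)
    scores: dict = field(default_factory=dict)

    def store(self, item, improved):
        """Keep only items that brought a gain; dict keys drop exact duplicates."""
        if improved and item not in self.scores:
            self.scores[item] = self.init_score

    def update(self, item, helped):
        """s_p^t = (1 - beta) * s_p^{t-1} + beta * I(item); forget below theta."""
        indicator = 1.0 if helped else 0.0
        new = (1 - self.beta) * self.scores[item] + self.beta * indicator
        if new < self.theta:
            del self.scores[item]
        else:
            self.scores[item] = new

    def probabilities(self, similarity=None):
        """Softmax over s_p / tau, or s_p * s_s / tau if similarities are given."""
        items = list(self.scores)
        sims = similarity or {}
        logits = [self.scores[i] * sims.get(i, 1.0) / self.tau for i in items]
        return dict(zip(items, softmax(logits))) if items else {}

    def sample(self, k, similarity=None, rng=random):
        """Draw k items (with replacement) according to the probabilities."""
        probs = self.probabilities(similarity)
        items = list(probs)
        if not items:
            return []
        return rng.choices(items, weights=[probs[i] for i in items], k=k)


memory = PriorityMemory(beta=0.5, theta=0.2)
memory.store("check the speaker's job title", improved=True)
memory.store("ignore punctuation", improved=True)
memory.store("unhelpful hint", improved=False)  # filtered out at storage time
for _ in range(3):
    # Priority decays 1.0 -> 0.5 -> 0.25 -> 0.125, which is below theta.
    memory.update("ignore punctuation", helped=False)
```

After the loop, only the first piece of feedback remains in `memory.scores`. Passing a `similarity` dictionary to `probabilities` or `sample` mimics the Exemplar Factory case, where priority and semantic similarity are combined before sampling.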
Loss & Training
- Employs beam search to select the \(k\) best-performing candidate prompts on the validation set for the next optimization step.
- Similarity computation uses the BGE-M3 model.
- The task model uses Doubao-Pro, and the prompt optimizer uses GPT-4o.
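The beam-search selection step amounts to keeping the top \(k\) candidates by validation score. A minimal sketch follows, assuming a hypothetical `score_fn` that returns a validation metric for a prompt; the prompts and scores are made-up placeholders, not results from the paper.

```python
def select_beam(candidates, score_fn, k):
    """Return the k candidates with the highest validation score, best first."""
    scored = [(score_fn(prompt), prompt) for prompt in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in scored[:k]]


# Toy validation scores standing in for a real evaluation run (made-up numbers).
toy_scores = {"prompt A": 0.61, "prompt B": 0.68, "prompt C": 0.55, "prompt D": 0.66}
beam = select_beam(list(toy_scores), toy_scores.get, k=2)
print(beam)  # ['prompt B', 'prompt D']
```

In a real run, `score_fn` would evaluate each candidate prompt with the task model on the validation set, which is the expensive part of each step.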
Key Experimental Results
Main Results
| Dataset | Metric | ERM | Prev. SOTA | Gain |
|---|---|---|---|---|
| LIAR | F1 | 68.6 | 58.5 (ProTeGi) | +10.1 |
| BBH | F1 | 86.1 | 81.9 (CoT) | +4.2 |
| ETHOS | F1 | 98.0 | 96.5 (ProTeGi) | +1.5 |
| WebNLG | Rouge-L | 59.6 | 55.7 (ProTeGi) | +3.9 |
| GSM8K | Acc. | 93.3 | 91.7 (Promptbreeder) | +1.6 |
| WSC | Acc. | 86.0 | 84.0 (GPO) | +2.0 |
Ablation Study
| Configuration | LIAR F1 | BBH F1 | LIAR F1 gain vs. baseline |
|---|---|---|---|
| Baseline (ProTeGi) | 58.5 | 73.6 | n/a (no added components) |
| +Exemplar-Guided Reflection | 62.9 | 75.7 | +4.4 |
| +Reflection +Feedback Memory | 67.2 | 84.7 | +8.7 |
| +Reflection +Exemplar Factory | 66.6 | 82.6 | +8.1 |
| Full ERM | 68.6 | 86.1 | +10.1 |
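The incremental contribution of each memory component can be read off the table by subtraction. The short check below uses only the numbers from the ablation table; the dictionary keys and the `gain` helper are illustrative names.

```python
# Scores copied from the ablation table above.
liar = {"baseline": 58.5, "reflection": 62.9, "reflection+fm": 67.2,
        "reflection+ef": 66.6, "full": 68.6}
bbh = {"baseline": 73.6, "reflection": 75.7, "reflection+fm": 84.7,
       "reflection+ef": 82.6, "full": 86.1}


def gain(scores, config, reference):
    """Difference in score between two configurations, rounded to one decimal."""
    return round(scores[config] - scores[reference], 1)


# Gain from adding each memory on top of Exemplar-Guided Reflection alone.
print(gain(liar, "reflection+fm", "reflection"), gain(bbh, "reflection+fm", "reflection"))  # 4.3 9.0
print(gain(liar, "reflection+ef", "reflection"), gain(bbh, "reflection+ef", "reflection"))  # 3.7 6.9
print(gain(liar, "full", "baseline"))  # 10.1
```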
Key Findings
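- Feedback Memory contributes the most: added on top of Exemplar-Guided Reflection, it yields +4.3 F1 on LIAR (62.9 to 67.2) and +9.0 on BBH (75.7 to 84.7), compared with +3.7 and +6.9 for Exemplar Factory. This shows that reusing historical feedback matters on both tasks.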
- Both filtering and selective forgetting are indispensable in Exemplar Factory; retrieving without filtering is actually ineffective.
- ERM requires approximately half the optimization steps of ProTeGi to achieve optimal performance (LIAR: 7 steps vs. 13 steps).
- ERM consistently outperforms other methods in the few-shot setting.
Highlights & Insights
- Introduces the concept of the human memory "forgetting curve" into prompt optimization, dynamically managing the lifecycle of feedback/exemplars with priority scores; this approach can be transferred to other scenarios requiring long-term knowledge management.
- The design of Exemplar-Guided Reflection is elegant: it first generates CoT-style exemplary solutions and then uses them to assist feedback generation, forming a closed loop of mutual reinforcement.
Limitations & Future Work
- The scenario of human-in-the-loop optimization is not explored (when the model continues to fail, human-provided solutions might be more efficient).
- Experiments are constrained by budget, resulting in a limited variety of task types.
- Hyperparameters (\(\beta\), \(\theta\), \(\tau\)) in Feedback Memory and Exemplar Factory require tuning.
Related Work & Insights
- vs ProTeGi: ProTeGi only uses feedback from the current step, whereas ERM reuses historical feedback via the memory mechanism, achieving +10.1 on LIAR.
- vs OPRO/GPO: Trajectory-based methods optimize based on historical prompts and scores but fail to learn concrete improvement directions from errors.
- vs EvoPrompt: Evolutionary methods randomly mutate prompts without targeted feedback guidance.
Rating
- Novelty: ⭐⭐⭐⭐ Applying the memory mechanism to prompt optimization is an innovative combination, though the individual components are not entirely novel on their own.
- Experimental Thoroughness: ⭐⭐⭐⭐ 7 datasets, extensive ablation studies, and efficiency analyses.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and well-illustrated figures.
- Value: ⭐⭐⭐⭐ Highly practical, with a significant increase in prompt optimization efficiency.