Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection

Conference: ACL 2025
arXiv: 2411.07446
Code: None
Area: LLM/NLP
Keywords: prompt optimization, feedback memory, exemplar retrieval, automatic prompt engineering, LLM

TL;DR

Proposes ERM, which improves feedback quality by using a guided meta-prompt to generate exemplars with detailed problem-solving processes. It adds two long-term memory mechanisms, Feedback Memory and Exemplar Factory, to store and reuse historical feedback and exemplars. ERM surpasses SOTA prompt optimization methods on multiple tasks while needing roughly half the optimization steps.

Background & Motivation

Background: Automatic prompt optimization aims to find the optimal prompt without human intervention. Mainstream methods include evolutionary (EvoPrompt), trajectory-based (OPRO, GPO), and feedback-based (ProTeGi) approaches.

Limitations of Prior Work: Feedback-based methods suffer from two core problems: (a) they use only the feedback from the current step and discard historical and non-selected feedback, so convergence takes more optimization steps; (b) at inference time, exemplars are retrieved purely by semantic similarity, with no check on whether they actually improve task performance.

Key Challenge: Valuable feedback information is wasted, and the selection of exemplars is disconnected from task performance.

Goal: How to efficiently utilize all historical feedback? How to select exemplars that truly contribute to task performance?

Key Insight: Inspired by human memory (the Ebbinghaus forgetting curve), the method keeps long-term memory stores for feedback and exemplars. Each entry carries a priority score that is adjusted dynamically, and entries are selectively forgotten based on how effective they prove to be.

Core Idea: Manage feedback and exemplars using memory mechanisms, so that valuable information keeps being reused while unhelpful information is forgotten.

Method

Overall Architecture

ERM (Exemplar-Guided Reflection with Memory) has three core components: Exemplar-Guided Reflection, Feedback Memory, and Exemplar Factory. The input is a set of error samples \(\mathcal{B}\) and the current prompt \(p^t\). Exemplar-Guided Reflection generates exemplars and feedback from them, which are stored in the two memory systems (Feedback Memory and Exemplar Factory), and the step outputs the optimized prompt \(p^{t+1}\). During inference, exemplars are retrieved from the Exemplar Factory and concatenated with the prompt to improve prediction accuracy.

Key Designs

Exemplar-Guided Reflection:
- Function: A guided meta-prompt directs the prompt optimizer to select typical samples from the error instances and write out detailed problem-solving processes (CoT style) for them; more informative feedback is then generated based on these exemplars.
- Mechanism: A set of exemplars (each with question, answer, and CoT) is generated first, \(\mathcal{E}^t = M_e(p^t, \mathcal{B}; p^{meta}_{ref*})\), and the feedback is then generated conditioned on those exemplars, \(\mathcal{F}^t = M_e(p^t, \mathcal{B}, \mathcal{E}^t; p^{meta}_{ref*})\).
- Design Motivation: Traditional methods generate limited feedback directly on error samples. Incorporating detailed problem-solving processes makes the feedback more targeted, offering a more precise direction for subsequent prompt optimization.
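The two-stage flow described above (exemplars first, then feedback conditioned on them) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template strings, the function name `exemplar_guided_reflection`, and the `optimizer` callable (any function that maps a text request to a text reply) are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical templates: the paper's actual meta-prompt wording is not reproduced here.
EXEMPLAR_TEMPLATE = (
    "Current prompt:\n{prompt}\n\nFailed samples:\n{errors}\n\n"
    "Pick typical samples and write a step-by-step solution for each."
)
FEEDBACK_TEMPLATE = (
    "Current prompt:\n{prompt}\n\nFailed samples:\n{errors}\n\n"
    "Worked exemplars:\n{exemplars}\n\n"
    "Explain what the prompt is missing and how to improve it."
)


def exemplar_guided_reflection(
    prompt: str,
    error_batch: List[str],
    optimizer: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (exemplars, feedback) from two sequential optimizer calls."""
    errors = "\n".join(f"- {sample}" for sample in error_batch)

    # Stage 1: ask for exemplars with detailed solutions for typical failures.
    exemplars = optimizer(EXEMPLAR_TEMPLATE.format(prompt=prompt, errors=errors))

    # Stage 2: ask for feedback, now conditioned on the exemplars from stage 1.
    feedback = optimizer(
        FEEDBACK_TEMPLATE.format(prompt=prompt, errors=errors, exemplars=exemplars)
    )
    return exemplars, feedback
```

In practice `optimizer` would wrap a call to the prompt-optimizer LLM; keeping it as a plain callable makes the flow testable with a stub.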
Feedback Memory:
- Function: Storing historical feedback, assigning a priority score to each, and periodically retrieving high-priority feedback to guide prompt optimization.
- Mechanism: Filtering during storage (only feedback that brings a performance gain is stored, after deduplication); sampling by priority during retrieval, \(P_f = \text{softmax}(\{e^{s_p(f_i)/\tau_f}\})\); priority update after use, \(s_p^{t}(f) = (1-\beta)\,s_p^{t-1}(f) + \beta\,\mathbb{I}(f)\), where \(\mathbb{I}(f)\) indicates whether the feedback proved effective; feedback whose score falls below the threshold \(\theta\) is forgotten.
- Design Motivation: Avoiding the loss of valuable historical feedback while ensuring only effective information is retained in memory through a selective forgetting mechanism, thereby accelerating optimization convergence.
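The store, sample, update, and forget cycle of Feedback Memory might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: the class and method names, the default values of `beta`, `theta`, and `tau`, the initial priority of 1.0, exact-string matching as a stand-in for deduplication, and reading the sampling formula as the usual temperature softmax \(\exp(s_i/\tau) / \sum_j \exp(s_j/\tau)\).

```python
import math
import random


def softmax_weights(scores, tau):
    """Temperature softmax: exp(s_i / tau), normalised to sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class FeedbackMemory:
    """Toy priority store; all defaults are illustrative assumptions."""

    def __init__(self, beta=0.5, theta=0.3, tau=1.0, init_priority=1.0):
        self.beta = beta            # weight of the newest effectiveness signal
        self.theta = theta          # forgetting threshold
        self.tau = tau              # sampling temperature
        self.init_priority = init_priority
        self.priority = {}          # feedback text -> priority score

    def store(self, feedback, improved):
        """Keep feedback only if it brought a gain and is not a duplicate."""
        if improved and feedback not in self.priority:
            self.priority[feedback] = self.init_priority

    def sample(self, n, rng=random):
        """Draw up to n distinct items, weighted by softmax over priority."""
        items = list(self.priority)
        chosen = []
        for _ in range(min(n, len(items))):
            weights = softmax_weights([self.priority[f] for f in items], self.tau)
            pick = rng.choices(items, weights=weights, k=1)[0]
            chosen.append(pick)
            items.remove(pick)
        return chosen

    def update(self, feedback, helped):
        """Moving-average priority update, then forget if below theta."""
        old = self.priority[feedback]
        new = (1 - self.beta) * old + self.beta * (1.0 if helped else 0.0)
        if new < self.theta:
            del self.priority[feedback]
        else:
            self.priority[feedback] = new
```

With `beta=0.5` and `theta=0.3`, an item that starts at priority 1.0 and fails to help twice in a row drops to 0.5 and then 0.25, at which point it is forgotten.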
Exemplar Factory:
- Function: Storing, evaluating, and retrieving exemplars, and choosing the optimal exemplar to concatenate with the prompt to enhance predictions during inference.
- Mechanism: Also managed using priority scores; retrieval jointly considers an exemplar's priority and its semantic similarity to the current question, \(P_e^r = \text{softmax}(\{e^{s_p(e_i) \cdot s_s^j(e_i)/\tau_e}\})\). During storage, exemplars are checked for the correctness of their problem-solving process and deduplicated; after use, priorities are updated based on how much the exemplar helped the prediction.
- Design Motivation: Retrieval purely based on semantic similarity does not guarantee selecting the most helpful exemplars for the task. Thus, the retrieval strategy needs to be optimized through real-world performance feedback.
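To make the retrieval rule concrete, here is a small sketch that turns priority and similarity scores into retrieval probabilities. It assumes the formula is a temperature softmax over the product \(s_p(e_i) \cdot s_s^j(e_i)\), and that similarity scores have already been computed by an embedding model; the function name `retrieval_probs` and the numbers in the usage note are made up for illustration.

```python
import math
from typing import List


def retrieval_probs(
    priorities: List[float],
    similarities: List[float],
    tau: float,
) -> List[float]:
    """Softmax over priority * similarity / tau, one probability per exemplar."""
    logits = [p * s / tau for p, s in zip(priorities, similarities)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Exemplar 0 is very similar to the question but has rarely helped;
# exemplar 1 is less similar but has a high priority score.
probs = retrieval_probs(priorities=[0.2, 0.9], similarities=[0.9, 0.6], tau=0.1)
best = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the products are 0.18 and 0.54, so `best` is exemplar 1: similarity alone would have picked exemplar 0, while the priority term shifts the choice toward the exemplar that has actually helped.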

Loss & Training

Employs beam search to select the \(k\) best-performing candidate prompts on the validation set for the next optimization step.
Similarity computation uses the BGE-M3 model.
The task model uses Doubao-Pro, and the prompt optimizer uses GPT-4o.
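The beam-search selection step can be sketched as keeping the \(k\) highest-scoring candidates. The function name `select_beam` and the `score_fn` callable (assumed to return a validation-set score for a prompt) are hypothetical; how candidates are generated and scored is outside this sketch.

```python
import heapq
from typing import Callable, List


def select_beam(
    candidates: List[str],
    score_fn: Callable[[str], float],
    k: int,
) -> List[str]:
    """Return the k candidates with the highest validation score, best first."""
    return heapq.nlargest(k, candidates, key=score_fn)


# Toy usage with made-up validation scores.
toy_scores = {"prompt A": 0.5, "prompt B": 0.9, "prompt C": 0.7}
beam = select_beam(list(toy_scores), toy_scores.get, k=2)
```

The surviving prompts in `beam` would then each be expanded again in the next optimization step.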

Key Experimental Results

Main Results

Dataset	Metric	ERM	Prev. SOTA	Gain
LIAR	F1	68.6	58.5 (ProTeGi)	+10.1
BBH	F1	86.1	81.9 (CoT)	+4.2
ETHOS	F1	98.0	96.5 (ProTeGi)	+1.5
WebNLG	Rouge-L	59.6	55.7 (ProTeGi)	+3.9
GSM8K	Acc.	93.3	91.7 (Promptbreeder)	+1.6
WSC	Acc.	86.0	84.0 (GPO)	+2.0

Ablation Study

Configuration	LIAR F1	BBH F1	LIAR F1 gain vs. baseline
Baseline (ProTeGi)	58.5	73.6	0.0 (no components)
+Exemplar-Guided Reflection	62.9	75.7	+4.4
+Reflection +Feedback Memory	67.2	84.7	+8.7
+Reflection +Exemplar Factory	66.6	82.6	+8.1
Full ERM	68.6	86.1	+10.1
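The incremental contribution of each memory component can be read off the table by differencing rows. The short script below only re-does that arithmetic on the numbers listed above; the dictionary keys and the helper `gain` are names chosen for this example.

```python
# F1 scores copied from the ablation table above.
f1 = {
    "baseline": {"LIAR": 58.5, "BBH": 73.6},
    "reflection": {"LIAR": 62.9, "BBH": 75.7},
    "reflection+feedback_memory": {"LIAR": 67.2, "BBH": 84.7},
    "reflection+exemplar_factory": {"LIAR": 66.6, "BBH": 82.6},
    "full": {"LIAR": 68.6, "BBH": 86.1},
}


def gain(config: str, reference: str, dataset: str) -> float:
    """F1 difference between two configurations, rounded to one decimal."""
    return round(f1[config][dataset] - f1[reference][dataset], 1)


for dataset in ("LIAR", "BBH"):
    for config in ("reflection+feedback_memory", "reflection+exemplar_factory", "full"):
        # Gain of each memory configuration over reflection alone.
        print(dataset, config, gain(config, "reflection", dataset))
```

Differencing against the reflection-only row isolates what each memory component adds, while differencing against the baseline reproduces the last column of the table.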

Key Findings

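Feedback Memory is the larger of the two memory contributions: added on top of Exemplar-Guided Reflection, it gains +4.3 F1 on LIAR (62.9 to 67.2), versus +3.7 for Exemplar Factory, which shows that reusing historical feedback matters.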
Both filtering and selective forgetting are indispensable in Exemplar Factory; retrieving without filtering is actually ineffective.
ERM requires approximately half the optimization steps of ProTeGi to achieve optimal performance (LIAR: 7 steps vs. 13 steps).
ERM consistently and significantly outperforms other methods in the few-shot setting.

Highlights & Insights

Introduces the concept of the human memory "forgetting curve" into prompt optimization, dynamically managing the lifecycle of feedback/exemplars with priority scores; this approach can be transferred to other scenarios requiring long-term knowledge management.
The design of Exemplar-Guided Reflection is elegant: it first generates CoT-style exemplary solutions and then uses them to assist feedback generation, forming a closed loop of mutual reinforcement.

Limitations & Future Work

The scenario of human-in-the-loop optimization is not explored (when the model continues to fail, human-provided solutions might be more efficient).
Experiments are constrained by budget, resulting in a limited variety of task types.
Hyperparameters (\(\beta\), \(\theta\), \(\tau\)) in Feedback Memory and Exemplar Factory require tuning.

vs ProTeGi: ProTeGi only uses feedback from the current step, whereas ERM reuses historical feedback via the memory mechanism, achieving +10.1 on LIAR.
vs OPRO/GPO: Trajectory-based methods optimize based on historical prompts and scores but fail to learn concrete improvement directions from errors.
vs EvoPrompt: Evolutionary methods randomly mutate prompts without targeted feedback guidance.

Rating

Novelty: ⭐⭐⭐⭐ Applying the memory mechanism to prompt optimization is an innovative combination, though the individual components are not entirely novel on their own.
Experimental Thoroughness: ⭐⭐⭐⭐ 7 datasets, extensive ablation studies, and efficiency analyses.
Writing Quality: ⭐⭐⭐⭐ Clear structure and well-illustrated figures.
Value: ⭐⭐⭐⭐ Highly practical, with a significant increase in prompt optimization efficiency.