SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization
Conference: ACL 2025
arXiv: 2402.11347
Code: —
Area: LLM/NLP
Keywords: prompt optimization, metaheuristic optimization, exploration and exploitation, joint instruction-example optimization, adaptive operators
TL;DR
SEE is the first prompt optimization framework that jointly optimizes instructions and examples as a cohesive whole. It designs a four-phase exploration-exploitation strategy based on metaheuristic optimization principles, coupled with the adaptive selection of five LLM operators, significantly outperforming 9 state-of-the-art (SOTA) methods across 35 benchmark tasks.
Background & Motivation
- Difficulty of Prompt Design: Manually designing LLM prompts requires substantial human effort and domain expertise, making automatic optimization a critical direction. However, prompt optimization is inherently a combinatorial optimization problem in a discrete, high-dimensional space, which is highly challenging.
- Limitations of Prior Work: Existing work treats instruction optimization and example selection as two separate tasks: frameworks such as APE, APO, and OPRO focus on zero-shot instruction optimization, while others target few-shot example selection. This overlooks the cohesiveness between instructions and examples, leading to sub-optimal performance.
- Inefficiency of Traditional Metaheuristic Algorithms: Directly applying genetic algorithms or similar methods results in unnecessary randomness, computational waste, and slow convergence, as they repeatedly and uniformly apply mutation and crossover operations without adapting to the specific demands of the optimization process.
- Goal: Design a framework that simultaneously optimizes instructions and examples to generate flexible prompts ranging from simple zero-shot to complex CoT few-shot, while maintaining computational efficiency.
Method
Overall Architecture
Designed under a metaheuristic optimization framework, SEE consists of four phases: Phase 0 (Global Initialization) \(\rightarrow\) Phase 1 (Local Feedback Operation / Exploitation) \(\rightarrow\) Phase 2 (Global Fusion Operation / Exploration) \(\rightarrow\) Phase 3 (Local Semantic Operation / Exploitation). Each phase utilizes the most suitable LLM operators combined with adaptive stopping criteria.
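The idea of running an operator until an adaptive stopping criterion fires can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `run_phase`, its parameters `min_gain`, `tolerance`, and `max_iters`, and the toy operator and scorer are all hypothetical names chosen here. The sketch assumes a phase stops once the best score has failed to improve by at least `min_gain` for `tolerance` consecutive iterations.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a candidate prompt (any type in this sketch)


def run_phase(
    pool: Sequence[C],
    operator: Callable[[List[C]], List[C]],
    score: Callable[[C], float],
    min_gain: float = 0.01,   # hypothetical improvement-gain threshold
    tolerance: int = 2,       # hypothetical number of stalled iterations allowed
    max_iters: int = 20,      # hard cap so the loop always terminates
) -> Tuple[List[C], int]:
    """Apply one operator repeatedly until the best score stops improving."""
    if not pool:
        raise ValueError("pool must contain at least one candidate")
    current = sorted(pool, key=score, reverse=True)
    size = len(current)
    best = score(current[0])
    stalled = 0
    iters = 0
    while iters < max_iters and stalled < tolerance:
        iters += 1
        # Generate new candidates, then keep only the top `size` of old + new.
        current = sorted(current + operator(current), key=score, reverse=True)[:size]
        top = score(current[0])
        if top - best >= min_gain:
            best, stalled = top, 0
        else:
            stalled += 1
    return current, iters


# Toy stand-ins: candidates are integers, the score saturates at 1.0,
# and the "operator" proposes each candidate incremented by one.
toy_score = lambda c: min(c, 5) / 5
toy_operator = lambda cands: [c + 1 for c in cands]

final_pool, n_iters = run_phase([0, 1], toy_operator, toy_score)
print(final_pool, n_iters)
```

Under these assumptions, the phases would be chained by calling `run_phase` once per phase with a different operator, passing the returned pool to the next call.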
Key Designs
- Systematic Analysis of Five LLM Operators:
    - Global Operators: Lamarckian (reverse-engineering instructions from input-output pairs), EDA (generating new candidates by learning distributions from a set of candidates), Crossover (mixing characteristics from two parents).
    - Local Operators: Feedback (improving candidates based on error cases using an Examiner+Improver dual-agent system), Semantic (making lexical alterations while preserving semantics).
    - The authors evaluated each operator along 5 dimensions, including improvement probability, convergence speed, and API cost, in 100 experiments.
- Four-Phase Exploration-Exploitation Strategy:
    - Phase 0: Create a diverse initial candidate pool using Lamarckian and Semantic operators.
    - Phase 1: Rapidly drive each candidate toward a local optimum (high convergence speed) using the Feedback operator.
    - Phase 2: Fuse characteristics of different candidates using EDA/Crossover operators to escape local optima.
    - Phase 3: Perform fine-grained "last-mile" convergence using the Semantic operator.
- Performance Vector + Hamming Distance: Instead of using cosine similarity, the correct/incorrect records of each candidate on the development set are structured as binary vectors (e.g., \([1,1,1,0,0]\)). Hamming distance is used to select the most distinct parents for fusion to guarantee diversity.
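The performance-vector idea can be illustrated with a short sketch. It assumes each candidate already has a binary correct/incorrect vector over the same development examples, and it simply picks the pair with the largest Hamming distance. The function names and the candidate labels are hypothetical, and the paper's actual parent-selection rule may differ (for example, it might also weigh candidate scores).

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming(a: List[int], b: List[int]) -> int:
    """Number of development examples on which two candidates disagree."""
    if len(a) != len(b):
        raise ValueError("performance vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def most_distinct_pair(vectors: Dict[str, List[int]]) -> Tuple[str, str]:
    """Return the two candidates whose performance vectors differ the most."""
    return max(
        combinations(vectors, 2),
        key=lambda pair: hamming(vectors[pair[0]], vectors[pair[1]]),
    )


# Hypothetical performance vectors: 1 = correct, 0 = incorrect on a dev example.
perf = {
    "cand_a": [1, 1, 1, 0, 0],
    "cand_b": [1, 1, 0, 0, 0],
    "cand_c": [0, 0, 1, 1, 1],
}

print(hamming(perf["cand_a"], perf["cand_c"]))  # 4
print(most_distinct_pair(perf))                 # ('cand_b', 'cand_c')
```

Two candidates with the same accuracy can still solve different examples, which is what this distance captures and what makes such pairs useful parents for fusion.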
Loss & Training
The evaluation function \(\mathcal{F}\) is the accuracy on the development set \(\mathcal{D}_{dev}\). The optimization objective is \(\mathcal{P}^* = \arg\max_{\mathcal{P} \in \mathcal{X}} \mathbb{E}_{(\mathcal{Q}, \mathcal{A})}[\mathcal{F}(\mathcal{P}; \mathcal{Q}, \mathcal{A}) | \mathcal{L}]\).
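As a rough illustration of this objective, the sketch below scores candidate prompts by development-set accuracy and keeps the best one. It is an assumption-laden toy: `answer_fn` is a hypothetical stand-in for the LLM call, exact string match is assumed as the correctness check, and the candidate set is a small explicit list rather than the full prompt space \(\mathcal{X}\).

```python
from typing import Callable, List, Sequence, Tuple

DevSet = Sequence[Tuple[str, str]]          # (question, gold answer) pairs
AnswerFn = Callable[[str, str], str]        # hypothetical: (prompt, question) -> answer


def performance_vector(prompt: str, dev_set: DevSet, answer_fn: AnswerFn) -> List[int]:
    """1 where the prompted model answers correctly (exact match assumed), else 0."""
    return [int(answer_fn(prompt, q).strip() == a.strip()) for q, a in dev_set]


def dev_accuracy(prompt: str, dev_set: DevSet, answer_fn: AnswerFn) -> float:
    """Evaluation function F: fraction of development examples answered correctly."""
    vec = performance_vector(prompt, dev_set, answer_fn)
    return sum(vec) / len(vec) if vec else 0.0


def toy_answer(prompt: str, question: str) -> str:
    """Toy stand-in for an LLM: uppercases only if the prompt asks for it."""
    return question.upper() if "uppercase" in prompt.lower() else question


DEV = [("a", "A"), ("b", "B"), ("C", "C")]
candidates = ["echo", "Return uppercase"]

# Approximate the arg max by scoring every candidate in the (tiny) pool.
best = max(candidates, key=lambda p: dev_accuracy(p, DEV, toy_answer))
print(best, dev_accuracy(best, DEV, toy_answer))
```

In practice the search space is far too large to enumerate, which is why the candidates are proposed by LLM operators rather than listed up front.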
Experiments
Main Results: 8 BBH Tasks
| Method | Causal | Formal Fallacies | Disambiguation | Hyperbaton | Logical Five | Color Reasoning | Salient | Translation | Average |
|---|---|---|---|---|---|---|---|---|---|
| OPRO | 71.94 | 71.53 | 36.73 | 49.51 | 75.92 | 50.00 | 65.55 | 43.88 | 58.13 |
| EvoPrompt | 67.24 | 53.70 | 47.96 | 50.81 | 74.79 | 61.40 | 60.90 | 47.58 | 58.05 |
| AELP | 77.77 | 64.79 | 10.67 | 58.25 | 53.74 | 73.49 | 68.14 | 41.43 | 56.04 |
| SEE-io | 72.13 | 72.37 | 8.06 | 58.87 | 86.02 | 48.19 | 60.52 | 49.19 | 56.92 |
| SEE-example | 89.09 | 68.47 | 46.77 | 58.65 | 87.50 | 86.29 | 80.64 | 47.59 | 70.63 |
On average, SEE-example improves by +14.59 over AELP (70.63 vs. 56.04) and +12.50 over OPRO (70.63 vs. 58.13).
Generalizability Validation: Cross-Model
| Model | Disambiguation | Formal Fallacies | Hyperbaton | Salient Translation |
|---|---|---|---|---|
| GPT-4 | 79.34 | 75.91 | 90.58 | 70.45 |
| GPT-3.5 | 69.99 | 58.49 | 84.35 | 48.39 |
| Claude 2 | 72.95 | 49.46 | 83.32 | 61.82 |
| Llama3-8B | 62.63 | 71.50 | 57.52 | 37.09 |
| Llama2-7B | 42.74 | 56.72 | 53.23 | 21.23 |
GPT-4 scores highest on all four tasks. Among open-source models, Mistral-7B (not listed in the table above) and Llama3-8B show comparable performance.
Key Findings
- Joint Optimization Significantly Outperforms Separate Optimization: SEE can generate flexible prompts ranging from zero-shot to complex few-shot+CoT, adaptively selecting the optimal format based on the task.
- Strategic Operator Selection is Crucial: Iteration history shows that the Feedback operator quickly improves performance in single iterations, while EDA/Crossover helps escape local optima to achieve leapfrog improvements.
- On 24 Instruction Induction Tasks: SEE outperforms APE and MoP on 87.5% of tasks, and outperforms EvoPrompt and OPRO on 100% of tasks.
- Computational Efficiency: On BBH tasks, SEE reduces API call costs by an average of 58.67% compared to SOTA methods.
- Performance Vector + Hamming Distance outperforms traditional embedding cosine similarity, facilitating candidate diversity more effectively.
Highlights & Insights
- Achieves the joint optimization of instructions and examples for the first time, breaking the artificial divide between zero-shot and few-shot in prompt optimization.
- The four-phase design deeply integrates metaheuristic optimization principles with the characteristics of LLM operators, strategically scheduling exploration and exploitation.
- Systematic quantitative analysis of the five operators (improvement probability, convergence speed, cost) provides a solid basis for adaptive selection.
- Comprehensive evaluation across 35 tasks, 9 baselines, and 6 models provides strong empirical support.
Limitations & Future Work
- The framework relies on a development set to evaluate candidate performance; thus, the quality and size of the development set directly impact the optimization outcome.
- Phase transitions depend on preset stopping criteria (improvement gain threshold and tolerance), and tuning these hyperparameters increases the complexity of the method.
- Primarily uses GPT-3.5-turbo as the operator executor; its generalizability to other LLMs requires further validation.
- For extremely simple tasks (such as sentiment classification), the gains from joint optimization might not justify the additional computational overhead.
- Constructing the performance vector requires evaluating all candidates on the development set, which can be costly when the candidate pool is large.
Related Work & Insights
- Instruction Optimization: APE (Zhou et al., 2023), APO (Pryzant et al., 2023), and OPRO (Yang et al., 2023a) only optimize zero-shot instructions.
- Example Selection: Liu et al. (2021), Lu et al. (2022), etc., select optimal few-shot examples under a fixed instruction.
- Evolutionary Optimization: PromptBreeder (Fernando et al., 2023) and EvoPrompt (Guo et al., 2023) use genetic algorithms for prompt evolution but do not jointly optimize instructions and examples.
- Joint Optimization Efforts: AELP (Hsieh et al., 2023) and MoP (Wang et al., 2024) make preliminary attempts, but their effectiveness and efficiency remain limited.
Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall Recommendation | ⭐⭐⭐⭐ |