# MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization
- **Conference**: AAAI 2026
- **arXiv**: 2503.16874
- **Code**: https://github.com/exoskeletonzj/MARS
- **Area**: Reinforcement Learning
- **Keywords**: Automated Prompt Optimization, Socratic Dialogue, POMDP, Teacher-Critic-Student, Pseudo-Gradient
## TL;DR
This paper proposes MARS, a five-agent framework for automated prompt optimization (APO): a Planner generates task-specific optimization trajectories; a Teacher-Critic-Student triad conducts Socratic dialogue for iterative prompt refinement, simulating pseudo-gradient descent in text space; and a Target agent executes the prompt and provides feedback. The entire process is modeled as a partially observable Markov decision process (POMDP). MARS outperforms the previous SOTA (PE2) by 6.04% on general tasks and 6.42% on domain-specific tasks across 17 datasets, while requiring only a single training example (1-shot).
## Background & Motivation
- **Background**: Automated prompt optimization (APO) aims to overcome the cognitive biases inherent in manually crafted prompts by automatically exploring superior prompt design spaces. Existing approaches fall into two categories: generation-and-search methods (e.g., APE and ProTeGi, which generate candidate prompts and then search for the best one) and meta-prompt methods (e.g., OPRO and PE2, which design elaborate meta-prompts to guide optimization).
- **Limitations of Prior Work**: (a) Template rigidity: fixed meta-prompt templates cannot adapt dynamically to diverse task requirements, making it difficult to capture task-specific optimization directions. (b) Inefficient exploration: generation-and-search methods perform only local search near the initial candidates, risking premature convergence or missing better regions of the prompt space.
- **Key Challenge**: The search space for prompt optimization is discrete, high-dimensional, and non-differentiable, precluding direct gradient descent; yet gradient-like directional guidance is needed to avoid blind search.
- **Key Insight**: Inspired by the Socratic teaching method, which guides students to discover answers through questioning rather than direct instruction, the paper models prompt optimization as a POMDP and employs multi-agent collaboration to simulate gradient-style iterative refinement.
- **Core Idea**: A five-agent POMDP framework: a Planner charts the optimization path; Teacher, Critic, and Student conduct Socratic dialogue for pseudo-gradient refinement; and a Target agent evaluates and provides feedback.
## Method
### Overall Architecture
Five LLM agents collaborate within a POMDP \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \mathcal{O} \rangle\):

- **Planner**: decomposes the optimization objective into a sub-goal sequence \(\mathbf{ST} = [st_1, \ldots, st_n]\)
- **Teacher**: formulates Socratic questions \(q_i\) based on the current sub-goal and the previous prompt version
- **Critic**: evaluates the quality and directional validity of each question, producing feedback \(c_i\)
- **Student**: integrates the question and critique to update its internal state and generate a new prompt version \(p_i\)
- **Target**: executes the prompt on the downstream task and returns a performance reward \(\mathcal{R}\)
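A minimal Python sketch of the five roles and the POMDP bookkeeping, under stated assumptions: the `llm` helper, its prompt templates, and all function signatures are hypothetical (the paper does not specify an API); only the role responsibilities and the symbols \(q_i, c_i, p_i, \mathcal{H}_{<i}, \mathcal{R}\) mirror the text.

```python
from dataclasses import dataclass, field


def llm(role_instructions: str, context: str) -> str:
    """Hypothetical backbone call (e.g., DeepSeek-V2.5 or GPT-4o)."""
    raise NotImplementedError("plug in a model client here")


@dataclass
class DialogueState:
    """POMDP bookkeeping: the Student's internal reasoning is the hidden
    state, the current prompt the observation, the task score the reward."""
    prompt: str                                       # observation o_t
    history: list[str] = field(default_factory=list)  # dialogue H_{<i}


def planner(goal: str, example: str, p0: str) -> list[str]:
    """Decompose the optimization goal into sub-goals ST = [st_1, ..., st_n]."""
    plan = llm("Plan an ordered list of prompt-improvement sub-goals, one per line.",
               f"goal={goal}\nexample={example}\ninitial prompt={p0}")
    return plan.splitlines()


def teacher(sub_goal: str, state: DialogueState) -> str:
    """Pose a Socratic question q_i steering the Student (the POMDP action)."""
    return llm("Ask one probing question about the prompt's current weakness.",
               f"sub-goal={sub_goal}\nprompt={state.prompt}\nhistory={state.history}")


def critic(question: str, state: DialogueState) -> str:
    """Assess question quality and direction, producing feedback c_i."""
    return llm("Judge whether this question moves the optimization forward.",
               f"question={question}\nhistory={state.history}")


def student(question: str, feedback: str, state: DialogueState) -> str:
    """Integrate (q_i, c_i) and emit a revised prompt p_i."""
    return llm("Rewrite the prompt in light of the question and feedback.",
               f"question={question}\nfeedback={feedback}\nprompt={state.prompt}")


def target(prompt: str, task_input: str) -> float:
    """Execute the prompt on the downstream task and return a reward R."""
    raise NotImplementedError("task-specific scoring, e.g., accuracy")
```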
### Key Designs
- **Planner (Optimization Trajectory Planning)**
  - Function: decomposes the abstract objective of "optimizing a prompt" into a concrete, ordered sequence of sub-goals.
  - Mechanism: \(\mathbf{ST} = \pi_{\text{plan}}(g, x, p_0)\); a latent variable \(z\) models task semantics, and a structured plan is generated via \(\arg\max_{\mathbf{ST}} \mathbb{E}_{z \sim q(z|g,x)}[\log P(\mathbf{ST}|z, p_0)]\).
  - Design Motivation: static meta-prompts apply a one-size-fits-all recipe; the Planner tailors the optimization path to each individual task, enabling adaptive optimization.
- **Teacher-Critic-Student Socratic Dialogue**
  - Function: refines the prompt through an iterative question-critique-revision cycle (see the loop sketch after this list).
  - Mechanism: at each step \(i\), the Teacher poses a question \(q_i = \pi_t(st_i, p_{i-1}, \mathcal{H}_{<i})\) to steer the Student's reasoning in a specific direction; the Critic evaluates \(c_i = \pi_c(q_i, \mathcal{H}_{<i})\) to verify question quality and directional correctness; the Student updates the prompt via \(p_i = \pi_s((q_i, c_i), p_{i-1}, \mathcal{H}_{<i})\). All agents have full access to the dialogue history \(\mathcal{H}_{<i}\).
  - Design Motivation: this simulates "pseudo-gradient descent" in discrete prompt space: the Teacher's question corresponds to the gradient direction, the Critic ensures directional correctness, and the Student executes the update step. Proposition 1 formally proves that the cumulative improvement is lower-bounded by \(\sum_i (\bar{A}_i - \sigma^2/(2\lambda))\).
- **Adaptive Termination**
  - Function: automatically determines when to stop optimization based on marginal returns.
  - Mechanism: optimization continues while \(\Delta\mathcal{R}^{(t)} = \mathcal{R}^{(t)} - \mathcal{R}^{(t-1)} > \delta\) and \(t < I\); it halts once the per-round gain falls to \(\delta\) or below, or the iteration cap \(I\) is reached. Proposition 2 proves that under Lipschitz conditions the reward change per step is bounded, guaranteeing convergence for small step sizes.
  - Design Motivation: prevents over-refinement and unnecessary computational expenditure.
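To make the control flow concrete, here is a minimal sketch of the outer loop with the adaptive termination rule, reusing the hypothetical helpers from the architecture sketch above. The symbols \(\delta\), the iteration cap \(I\), and the per-round reward check follow the paper's notation; the default values and everything else are illustrative.

```python
def optimize(goal: str, example: str, p0: str,
             delta: float = 0.01, max_iters: int = 10) -> str:
    """One MARS-style run: plan sub-goals, then refine via Socratic
    dialogue until the marginal gain drops to delta or below (or t >= I)."""
    state = DialogueState(prompt=p0)
    prev_reward = target(p0, example)

    for t, sub_goal in enumerate(planner(goal, example, p0)):
        if t >= max_iters:                    # iteration cap: t < I
            break
        q = teacher(sub_goal, state)          # pseudo-gradient direction
        c = critic(q, state)                  # directional correction
        new_prompt = student(q, c, state)     # textual "update step"
        reward = target(new_prompt, example)  # feedback from the Target

        state.history += [q, c, new_prompt]
        improvement = reward - prev_reward    # ΔR^(t)
        if improvement > 0:                   # keep only improving steps
            state.prompt, prev_reward = new_prompt, reward
        if improvement <= delta:              # ΔR^(t) ≤ δ: stop refining
            break

    return state.prompt
```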
### Training Efficiency Highlights
Only 1 training sample is required for optimization. The Planner can infer task structure and semantics from a single example, because the core of APO is understanding "what the task is" rather than memorizing "what the data is."
## Key Experimental Results
### Main Results: General Tasks (BBH + MMLU, 6+6=12 tasks)
| Method | BBH Avg. | MMLU Avg. | Overall Avg. |
|---|---|---|---|
| Origin (original prompt) | 53.71 | 76.39 | 64.95 |
| CoT (Zero-Shot) | 61.40 | 78.20 | 69.79 |
| PE2 (Prev. SOTA) | 69.45 | 88.44 | 78.81 |
| MARS | 79.52 | 90.94 | 85.11 |
### Main Results: Domain Tasks (C-Eval + LSAT + GSM8K, 5 tasks)
Only three of the five domain tasks are shown as columns; the Avg. column is computed over all five.

| Method | C-Eval | GSM8K | LSAT-AR | Avg. (5 tasks) |
|---|---|---|---|---|
| PE2 | 66.47 | 83.46 | 34.50 | 69.39 |
| MARS | 77.13 | 89.22 | 38.42 | 75.81 |
### Ablation Study
| Configuration | BBH Avg. | MMLU Avg. | Δ BBH |
|---|---|---|---|
| MARS (full) | 79.52 | 90.94 | — |
| w/o Socratic | 68.28 | — | −11.31 |
| w/o Planner | 72.82 | — | −6.77 |
| w/o Critic | 76.04 | — | −3.55 |
### Key Findings
- MARS achieves state-of-the-art results across all 17 datasets, surpassing the previous SOTA (PE2) by 6.04% on general tasks and 6.42% on domain-specific tasks.
- The Socratic dialogue mechanism contributes the most: removing it causes an average drop of 11.31%, far exceeding the impact of removing the Planner (−6.77%) or the Critic (−3.55%).
- 1-shot training is sufficient: 0-shot achieves 77.77%, 1-shot achieves 79.59%, and 3-shot achieves 79.81%—the marginal gain from additional training data is negligible.
- Fast convergence: optimization typically converges within 5 rounds (vs. OPRO, which has not converged after 10 rounds), substantially reducing inference cost.
- Cross-model generalization: prompts optimized on DeepSeek-V2.5 transfer directly to GPT-4o without performance degradation, indicating that the optimized prompts are model-agnostic.
- Inference-time scaling law: MARS achieves the highest performance under an equivalent token budget, and reaches a given performance level with the lowest token cost.
## Highlights & Insights
- Modeling APO as a POMDP is the key theoretical innovation: the Student's internal reasoning state serves as the hidden state, Teacher/Critic interactions are the actions, the prompt is the observation, and task performance is the reward, forming a complete mathematical framework.
- The analogy between Socratic teaching and pseudo-gradient descent is particularly elegant: the Teacher's question corresponds to the gradient direction, the Critic's evaluation to gradient correction, and the Student's update to a parameter step, with a formal proof of the cumulative improvement lower bound (restated in the block after this list).
- Requiring only one training sample is a striking result, suggesting that the essence of APO is "understanding task specifications" rather than "fitting training data," and that the Planner's task-comprehension capability is the core driver.
- The appendix provides the final optimized prompts for all 17 tasks, which are directly reusable.
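Restating the analogy compactly: the update operator \(\oplus\) and the textual pseudo-gradient \(\hat{g}_i\) below are illustrative notation, not the paper's exact formulation; only the lower bound comes from Proposition 1.

```latex
% Teacher question q_i ~ descent direction, Critic feedback c_i ~ correction,
% Student revision ~ the step itself (illustrative notation):
p_i \;\approx\; p_{i-1} \oplus \lambda\, \hat{g}_i,
\qquad \hat{g}_i = \hat{g}(q_i, c_i)

% Proposition 1: the cumulative improvement is lower-bounded as
\sum_{i=1}^{n} \big( \mathcal{R}(p_i) - \mathcal{R}(p_{i-1}) \big)
\;\ge\; \sum_{i=1}^{n} \Big( \bar{A}_i - \frac{\sigma^{2}}{2\lambda} \Big)
```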
## Limitations & Future Work
- The five-agent architecture incurs substantial inference overhead (5 LLM calls per round), making it sensitive to computational budgets.
- The framework relies on DeepSeek-V2.5 / GPT-4o as agent backbones; smaller models may be inadequate for the Teacher and Planner roles.
- The POMDP hidden-state transition \(\mathcal{T}\) is implicitly realized by the LLM rather than being precisely modeled; the theoretical assumptions (Lipschitz continuity, bounded variance) may not hold strictly in practice.
- Evaluation is limited to text classification, QA, and mathematical reasoning tasks; generative tasks (e.g., summarization, translation) are not assessed.
- There is no comparison against human expert prompt engineers, so whether the Planner-generated sub-goal sequences are genuinely superior to manually designed ones remains open.
## Related Work & Insights
- vs. OPRO: OPRO uses a meta-prompt to directly instruct an LLM to generate an optimized prompt, a single-agent approach. MARS instead employs multi-agent Socratic dialogue for iterative refinement, converging faster (5 rounds vs. 10+) with higher performance.
- vs. PE2: PE2 is the previous SOTA meta-prompt method; MARS surpasses it by more than 6% on average on both general and domain-specific tasks, with greater computational efficiency.
- Implications for agent research: The Teacher-Critic-Student triad pattern is transferable to other agent tasks requiring iterative refinement, such as code debugging, text revision, and solution optimization.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ POMDP modeling combined with a Socratic five-agent framework represents a paradigm-level innovation in APO, with both rigorous theory and complete methodology.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 17 datasets, ablation studies, convergence analysis, cross-model validation, 1-shot analysis, and inference-time scaling law—comprehensively evaluated.
- Writing Quality: ⭐⭐⭐⭐⭐ POMDP formalization is rigorous, Propositions are fully proven, and the appendix is exceptionally detailed (all prompts and complete optimization trajectories included).
- Value: ⭐⭐⭐⭐⭐ An APO method achieving SOTA with only 1-shot training data has high practical value; the multi-agent Socratic paradigm is broadly transferable.