Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
Conference: ACL 2026 | arXiv: 2604.10923 | Code: https://buaa-irip-llm.github.io/Mem2Evolve | Area: LLM Agents | Keywords: Self-Evolving Agent, Dual-Memory Mechanism, Capability Expansion, Experience Distillation, Co-Evolution
TL;DR
This paper proposes Mem²Evolve, a self-evolving agent framework that achieves co-evolutionary capability expansion and experience distillation via a dual-memory mechanism (Asset Memory + Experience Memory). The framework attains an average Pass@1 of 70.24% across 8 benchmarks spanning 6 task categories, outperforming the strongest experience-centric and capability-centric baselines by 11.80 and 6.46 percentage points, respectively.
Background & Motivation
Background: LLM agents are evolving from static, task-specific systems toward self-evolving systems that can leverage past experiences and autonomously expand their capabilities. Existing self-evolving frameworks follow two main paradigms: experience-centric evolution (optimizing execution strategies, prompts, or building experience repositories by accumulating experience) and capability-centric evolution (expanding capability boundaries by dynamically creating new tools or expert agents).
Limitations of Prior Work: These two evolutionary processes are treated in isolation by existing frameworks. Experience-centric evolution is constrained by a predefined static toolset and cannot handle tasks beyond existing capability boundaries. Capability-centric evolution creates new assets from scratch without experiential guidance, failing to leverage validated strategies or avoid known pitfalls, which leads to irreproducible successes and repeated errors.
Key Challenge: Capability expansion and experience accumulation are inherently interdependent — new capabilities enable agents to complete more tasks and thereby acquire more experience, while experience in turn guides better capability expansion — yet existing methods overlook this intrinsic synergy.
Goal: To design a self-evolving agent framework that unifies capability expansion and experience distillation within the same evolutionary loop, enabling their co-evolution.
Key Insight: Inspired by Piaget's equilibration theory — wherein intelligence evolves through the interplay of assimilation (integrating new experiences) and accommodation (adapting internal structures) — agent evolution is analogized to cognitive development.
Core Idea: Through a dual-memory mechanism (Asset Memory storing reusable capabilities, Experience Memory storing strategic experiences), co-evolution of capabilities and experiences is realized within a forward-inference and backward-evolution loop.
Method
Overall Architecture
The core of Mem²Evolve is a two-phase task cycle of forward inference and backward evolution. Forward inference: task planning → asset recruitment (reuse first, create on demand) → execution. Backward evolution: trajectory evaluation → asset memory evolution (retaining and refining newly created high-quality assets) → experience memory evolution (distilling strategic experience from successes and failures). Both memory stores are updated after each task execution, forming a stable self-evolving loop.
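The two-phase cycle can be sketched in miniature. The following is a toy Python sketch, not the authors' implementation: every name here (`plan`, `recruit`, `judge`, `run_task`) is a hypothetical placeholder, and the planner/judge logic is stubbed out where the real system would call an LLM.

```python
# Minimal sketch of the forward-inference / backward-evolution cycle.
# All names (plan, recruit, judge, run_task) are illustrative placeholders,
# not the authors' actual API.

def plan(task):
    # Task planning: decompose the task into subtasks (toy logic).
    return [f"{task}/step{i}" for i in range(2)]

def recruit(subtask, asset_memory):
    # Reuse first: return an existing asset; create on demand otherwise.
    if subtask not in asset_memory:
        asset_memory[subtask] = f"tool_for({subtask})"  # stand-in for LLM tool creation
    return asset_memory[subtask]

def judge(trajectory):
    # Stand-in for LLM-as-a-Judge: succeed if every step recruited an asset.
    return all(asset is not None for _, asset in trajectory["steps"])

def run_task(task, asset_memory, experience_memory):
    # Forward inference: planning -> asset recruitment -> execution.
    subtasks = plan(task)
    steps = [(s, recruit(s, asset_memory)) for s in subtasks]
    trajectory = {"task": task, "steps": steps}
    # Backward evolution: evaluate, then update both memory stores.
    success = judge(trajectory)
    experience_memory.append({"task": task, "success": success})
    return success
```

The key structural point the sketch preserves is that both memory stores are touched on every task: assets may be created during the forward pass, and experience is always appended during the backward pass.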
Key Designs
- Dual-Memory Mechanism:
  - Function: Separately stores the agent's reusable capabilities and strategic experiences.
  - Mechanism: Asset Memory \(\mathcal{M}_A = \mathcal{B}_{agt} \cup \mathcal{B}_{tool}\) comprises an Agent Bank (storing expert agents' roles, expertise, behavioral strategies, and available tools) and a Tool Bank (storing executable tools compliant with the MCP protocol, including names, functional descriptions, implementation code, and documentation). Experience Memory \(\mathcal{M}_E = \mathcal{E}_{agt} \cup \mathcal{E}_{tool}\) stores strategic experiences distilled from past successes and failures; each entry contains a title, description, applicable scenarios, and core knowledge.
  - Design Motivation: Asset Memory extends capability boundaries while Experience Memory provides guiding knowledge; the two are complementary, as capability expansion without experience is blind, and experience accumulation without capability expansion is constrained by a fixed toolset.
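The dual-memory structure described above can be captured with a few plain data containers. This is a schematic rendering of the entry fields the paper lists; the class and field names are my own labels, not identifiers from the authors' code.

```python
# Schematic dual-memory layout; class/field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Tool:
    # Tool Bank entry: name, functional description, implementation code, docs.
    name: str
    description: str
    code: str
    docs: str

@dataclass
class ExpertAgent:
    # Agent Bank entry: role, expertise, behavioral strategy, available tools.
    role: str
    expertise: str
    strategy: str
    tools: list

@dataclass
class Experience:
    # Experience Memory entry: title, description, applicable scenarios, core knowledge.
    title: str
    description: str
    scenarios: str
    core_knowledge: str

@dataclass
class DualMemory:
    # M_A = agent bank U tool bank; M_E = distilled experiences.
    agent_bank: list = field(default_factory=list)
    tool_bank: list = field(default_factory=list)
    experiences: list = field(default_factory=list)
```

Keeping assets (executable things) and experiences (guidance text) in separate stores is what lets the two evolve on different triggers: assets change when a capability gap is hit, experiences change after every judged trajectory.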
- Experience-Guided Asset Creation:
  - Function: When new capabilities are required, leverages past experience to guide the creation of high-quality assets.
  - Mechanism: When the maximum similarity between a subtask \(s_i\) and the assets in memory falls below a threshold \(\delta\), creation is triggered instead of reuse. Tool creation uses experience-augmented generation, \(m_{tool}^{new} \sim \pi_\theta(\cdot \mid s_i, \text{Retrieve}(s_i, \mathcal{E}_{tool}), \text{Web}(s_i))\), jointly conditioning on the subtask, retrieved relevant experiences, and web-search information. After creation, a Self-Correction Loop performs validation: the LLM synthesizes test cases from review feedback, and only assets passing all tests are retained.
  - Design Motivation: Experience guidance improves the first-pass validation rate from 53.1% to 72.4% (a relative gain of 36.3%) and reduces average debugging iterations from 1.01 to 0.48 (a 52.5% reduction).
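The reuse-vs-create decision and the self-correction loop can be sketched as follows. This is a minimal illustration under assumed details: vectors stand in for embeddings, the tool bank is a plain list, and `repair`/`tests` stand in for LLM-driven debugging and synthesized test cases; none of these names come from the paper.

```python
# Sketch of threshold-gated recruitment plus a self-correction loop.
# Embeddings, repair, and tests are stand-ins for LLM components.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def recruit_or_create(subtask_vec, tool_bank, delta=0.8):
    # Reuse first: take the best-matching existing tool if similarity >= delta;
    # otherwise trigger (experience-guided) creation of a new tool.
    best = max(tool_bank, key=lambda t: cosine(subtask_vec, t["vec"]), default=None)
    if best is not None and cosine(subtask_vec, best["vec"]) >= delta:
        return best, False                       # reused an existing asset
    new_tool = {"vec": subtask_vec, "name": "created"}  # placeholder for generation
    tool_bank.append(new_tool)
    return new_tool, True                        # created on demand

def self_correct(tool_code, tests, repair, max_iters=3):
    # Validate against synthesized tests; repair on failure; discard if it
    # never passes within the iteration budget.
    for _ in range(max_iters):
        if all(t(tool_code) for t in tests):
            return tool_code
        tool_code = repair(tool_code)
    return None
```

The threshold \(\delta\) is the single knob separating the two regimes: raising it makes the agent create more aggressively, lowering it favors reuse.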
- Backward Experience Distillation:
  - Function: Extracts transferable strategic knowledge from the execution trajectory of each task.
  - Mechanism: Upon task completion, an LLM-as-a-Judge evaluates execution quality and assigns a success/failure label along with review comments. Successes trigger Success Generalization (abstracting high-level strategy guidelines), while failures trigger Failure Diagnosis (encoding anti-patterns and failure–repair pairs). Distilled experiences are merged into the experience memory: \(\mathcal{M}_E \leftarrow \mathcal{M}_E \cup \{e_{\text{new}}\}\).
  - Design Motivation: Learning from both successes and failures lets the agent reproduce effective strategies and avoid known pitfalls.
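The success/failure branching of distillation can be sketched in a few lines. The dictionary fields below (`type`, `title`, `core_knowledge`, `repair`) are hypothetical labels for illustration; in the actual framework the distilled content would be produced by the LLM from the trajectory and the judge's review.

```python
# Sketch of backward experience distillation; field names are illustrative.

def distill(trajectory, success, review):
    # Success -> generalize a strategy guideline;
    # failure -> encode an anti-pattern and its repair.
    if success:
        return {"type": "strategy",
                "title": f"How to: {trajectory['task']}",
                "core_knowledge": review}
    return {"type": "anti_pattern",
            "title": f"Pitfall in: {trajectory['task']}",
            "core_knowledge": review,
            "repair": trajectory.get("fix", "")}

def evolve_experience_memory(memory, trajectory, success, review):
    # M_E <- M_E U {e_new}
    memory.append(distill(trajectory, success, review))
    return memory
```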
Loss & Training
Mem²Evolve is an inference-time framework that does not involve model parameter training. Asset recruitment uses embedding similarity retrieval (threshold \(\delta\)), and task evaluation employs LLM-as-a-Judge. All baselines and Mem²Evolve uniformly use GPT-5-chat as the LLM backbone.
Key Experimental Results
Main Results
| Method | GAIA Total | ALFWorld | HotpotQA | AIME24 | AIME25 | Avg. (8 benchmarks) |
|---|---|---|---|---|---|---|
| GPT-5 (ReAct) | 18.47 | 86.87 | 41.40 | 66.67 | 60.00 | 48.27 |
| AFLOW (experience-centric) | 19.75 | 93.40 | 60.80 | 66.67 | 63.33 | 58.44 |
| Alita (capability-centric) | 72.73 | 86.13 | 58.80 | 70.00 | 66.67 | 63.78 |
| Mem²Evolve | 76.31 | 94.31 | 60.80 | 76.70 | 73.33 | 70.24 |
Ablation Study
| Configuration | Avg. Pass@1 | Drop |
|---|---|---|
| Full Mem²Evolve | 70.24 | – |
| w/o Tool Creation | 59.96 | ↓10.28 |
| w/o Agent Memory | 65.51 | ↓4.73 |
| w/o Tool Memory | 67.11 | ↓3.13 |
| w/o Expert Agent Creation | 68.52 | ↓1.72 |
Key Findings
- Dynamic tool creation is the most critical component (its removal causes a 10.28-point drop in average Pass@1), indicating that expanding the toolset is essential for handling complex tasks.
- Experience guidance improves the first-pass tool creation validation rate from 53.1% to 72.4%, reducing debugging iterations by more than half.
- Cross-task initialization (using memories accumulated from GAIA to initialize other tasks) consistently improves performance, approaching the gains of 25-sample same-task initialization, demonstrating good transferability of the memory.
- On GAIA, Mem²Evolve achieves 76.31% Pass@1, surpassing even OpenAI DeepResearch (67.36%), a proprietary system, demonstrating the framework's strong potential.
Highlights & Insights
- The co-evolutionary paradigm of dual memory is the paper's most significant contribution. Inspired by Piaget's theory of cognitive development, it unifies "assimilation" (experience accumulation) and "accommodation" (capability adaptation) within a single framework. This analogy is both theoretically grounded and practically valuable, lending the framework's design logic exceptional clarity.
- The "Reuse first, Create on demand" forward inference strategy is highly practical. The similarity threshold \(\delta\) automatically determines whether a current task exceeds capability boundaries, avoiding unnecessary asset creation overhead while enabling on-demand capability expansion.
- The cross-task memory transfer results are impressive: memories accumulated from GAIA data consistently yield improvements on entirely different tasks such as HotpotQA and AIME without negative transfer, indicating that distilled experiences possess strong abstraction and generality.
Limitations & Future Work
- The framework relies on a sandbox environment to execute auto-generated code, limiting deployment in open-world settings that require direct access to local file systems or unrestricted network access.
- The continuous growth of asset and experience memories may introduce retrieval efficiency issues and noise; long-term memory management strategies (e.g., forgetting, compression) are not discussed.
- The reliability of LLM-as-a-Judge evaluation in the absence of ground-truth labels may affect the quality of backward evolution.
- The quality of tool creation is bounded by the LLM's code generation capability; complex tools may require multiple iterations to reach usable quality.
Related Work & Insights
- vs. Alita (Qiu et al., 2025): Alita supports dynamic tool creation but lacks experiential guidance. Mem²Evolve augments it with experience-guided creation and distillation mechanisms, achieving an average gain of 6.46 percentage points.
- vs. AFLOW (Zhang et al., 2025): AFLOW optimizes module composition via search algorithms but is constrained by a fixed toolset and cannot extend capability boundaries. Mem²Evolve dynamically expands the toolset while accumulating experience, achieving an average gain of 11.80 percentage points.
Rating
- Novelty: ⭐⭐⭐⭐ First to propose a co-evolutionary paradigm for capability expansion and experience distillation, with clear theoretical motivation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 8 benchmarks across 6 task categories with comprehensive ablation, single-task, and cross-task analyses.
- Writing Quality: ⭐⭐⭐⭐ Framework diagrams are clear; the analogy to Piaget's theory is thought-provoking.
- Overall Recommendation: ⭐⭐⭐⭐ Provides a practical framework foundation for building general-purpose self-evolving agents.