Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
Conference: ACL 2026 | arXiv: 2604.10923 | Code: https://buaa-irip-llm.github.io/Mem2Evolve | Area: LLM Agents | Keywords: Self-Evolving Agent, Dual-Memory Mechanism, Capability Expansion, Experience Distillation, Co-Evolution
TL;DR
This paper proposes Mem²Evolve, a self-evolving agent framework that achieves co-evolutionary capability expansion and experience distillation via a dual-memory mechanism (Asset Memory + Experience Memory). The framework attains an average Pass@1 of 70.24% across 8 benchmarks spanning 6 task categories, outperforming the strongest experience-centric and capability-centric baselines by 11.80 and 6.46 percentage points, respectively.
Background & Motivation
Background: LLM agents are evolving from static, task-specific systems toward self-evolving systems that can leverage past experiences and autonomously expand their capabilities. Existing self-evolving frameworks follow two main paradigms: experience-centric evolution (optimizing execution strategies, prompts, or building experience repositories by accumulating experience) and capability-centric evolution (expanding capability boundaries by dynamically creating new tools or expert agents).
Limitations of Prior Work: These two evolutionary processes are treated in isolation by existing frameworks. Experience-centric evolution is constrained by a predefined static toolset and cannot handle tasks beyond existing capability boundaries. Capability-centric evolution creates new assets from scratch without experiential guidance, failing to leverage validated strategies or avoid known pitfalls, which leads to irreproducible successes and repeated errors.
Key Challenge: Capability expansion and experience accumulation are inherently interdependent — new capabilities enable agents to complete more tasks and thereby acquire more experience, while experience in turn guides better capability expansion — yet existing methods overlook this intrinsic synergy.
Goal: To design a self-evolving agent framework that unifies capability expansion and experience distillation within the same evolutionary loop, enabling their co-evolution.
Key Insight: Inspired by Piaget's equilibration theory — wherein intelligence evolves through the interplay of assimilation (integrating new experiences) and accommodation (adapting internal structures) — agent evolution is analogized to cognitive development.
Core Idea: Through a dual-memory mechanism (Asset Memory storing reusable capabilities, Experience Memory storing strategic experiences), co-evolution of capabilities and experiences is realized within a forward-inference and backward-evolution loop.
Method
Overall Architecture
The core of Mem²Evolve is a two-phase task cycle of forward inference and backward evolution. Forward inference: task planning → asset recruitment (reuse first, create on demand) → execution. Backward evolution: trajectory evaluation → asset memory evolution (retaining and refining newly created high-quality assets) → experience memory evolution (distilling strategic experience from successes and failures). Both memory stores are updated after each task execution, forming a stable self-evolving loop.
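The two-phase cycle can be sketched in miniature. The following is a toy Python sketch, not the authors' implementation: every name here (`plan`, `recruit`, `judge`, `run_task`) is a hypothetical placeholder, and the planner/judge logic is stubbed out where the real system would call an LLM.

```python
# Minimal sketch of the forward-inference / backward-evolution cycle.
# All names (plan, recruit, judge, run_task) are illustrative placeholders,
# not the authors' actual API.

def plan(task):
    # Task planning: decompose the task into subtasks (toy logic).
    return [f"{task}/step{i}" for i in range(2)]

def recruit(subtask, asset_memory):
    # Reuse first: return an existing asset; create on demand otherwise.
    if subtask not in asset_memory:
        asset_memory[subtask] = f"tool_for({subtask})"  # stand-in for LLM tool creation
    return asset_memory[subtask]

def judge(trajectory):
    # Stand-in for LLM-as-a-Judge: succeed if every step recruited an asset.
    return all(asset is not None for _, asset in trajectory["steps"])

def run_task(task, asset_memory, experience_memory):
    # Forward inference: planning -> asset recruitment -> execution.
    subtasks = plan(task)
    steps = [(s, recruit(s, asset_memory)) for s in subtasks]
    trajectory = {"task": task, "steps": steps}
    # Backward evolution: evaluate, then update both memory stores.
    success = judge(trajectory)
    experience_memory.append({"task": task, "success": success})
    return success
```

The key structural point the sketch preserves is that both memory stores are touched on every task: assets may be created during the forward pass, and experience is always appended during the backward pass.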
Key Designs
- Dual-Memory Mechanism:
  - Function: Separately stores the agent's reusable capabilities and strategic experiences.
  - Mechanism: Asset Memory \(\mathcal{M}_A = \mathcal{B}_{agt} \cup \mathcal{B}_{tool}\) comprises an Agent Bank (storing expert agents' roles, expertise, behavioral strategies, and available tools) and a Tool Bank (storing executable tools compliant with the MCP protocol, including names, functional descriptions, implementation code, and documentation). Experience Memory \(\mathcal{M}_E = \mathcal{E}_{agt} \cup \mathcal{E}_{tool}\) stores strategic experiences distilled from past successes and failures; each entry contains a title, description, applicable scenarios, and core knowledge.
  - Design Motivation: Asset Memory extends capability boundaries while Experience Memory provides guiding knowledge; the two are complementary, as capability expansion without experience is blind, and experience accumulation without capability expansion is constrained by a fixed toolset.
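The dual-memory structure described above can be captured with a few plain data containers. This is a schematic rendering of the entry fields the paper lists; the class and field names are my own labels, not identifiers from the authors' code.

```python
# Schematic dual-memory layout; class/field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Tool:
    # Tool Bank entry: name, functional description, implementation code, docs.
    name: str
    description: str
    code: str
    docs: str

@dataclass
class ExpertAgent:
    # Agent Bank entry: role, expertise, behavioral strategy, available tools.
    role: str
    expertise: str
    strategy: str
    tools: list

@dataclass
class Experience:
    # Experience Memory entry: title, description, applicable scenarios, core knowledge.
    title: str
    description: str
    scenarios: str
    core_knowledge: str

@dataclass
class DualMemory:
    # M_A = agent bank U tool bank; M_E = distilled experiences.
    agent_bank: list = field(default_factory=list)
    tool_bank: list = field(default_factory=list)
    experiences: list = field(default_factory=list)
```

Keeping assets (executable things) and experiences (guidance text) in separate stores is what lets the two evolve on different triggers: assets change when a capability gap is hit, experiences change after every judged trajectory.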
- Experience-Guided Asset Creation:
  - Function: When new capabilities are required, leverages past experience to guide the creation of high-quality assets.
  - Mechanism: When the maximum similarity between a subtask \(s_i\) and the assets in memory falls below a threshold \(\delta\), creation is triggered instead of reuse. Tool creation uses experience-augmented generation, \(m_{tool}^{new} \sim \pi_\theta(\cdot \mid s_i, \text{Retrieve}(s_i, \mathcal{E}_{tool}), \text{Web}(s_i))\), jointly conditioning on the subtask, retrieved relevant experiences, and web-search information. After creation, a Self-Correction Loop performs validation: the LLM synthesizes test cases from review feedback, and only assets passing all tests are retained.
  - Design Motivation: Experience guidance improves the first-pass validation rate from 53.1% to 72.4% (a relative gain of 36.3%) and reduces average debugging iterations from 1.01 to 0.48 (a 52.5% reduction).
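The reuse-vs-create decision and the self-correction loop can be sketched as follows. This is a minimal illustration under assumed details: vectors stand in for embeddings, the tool bank is a plain list, and `repair`/`tests` stand in for LLM-driven debugging and synthesized test cases; none of these names come from the paper.

```python
# Sketch of threshold-gated recruitment plus a self-correction loop.
# Embeddings, repair, and tests are stand-ins for LLM components.
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def recruit_or_create(subtask_vec, tool_bank, delta=0.8):
    # Reuse first: take the best-matching existing tool if similarity >= delta;
    # otherwise trigger (experience-guided) creation of a new tool.
    best = max(tool_bank, key=lambda t: cosine(subtask_vec, t["vec"]), default=None)
    if best is not None and cosine(subtask_vec, best["vec"]) >= delta:
        return best, False                       # reused an existing asset
    new_tool = {"vec": subtask_vec, "name": "created"}  # placeholder for generation
    tool_bank.append(new_tool)
    return new_tool, True                        # created on demand

def self_correct(tool_code, tests, repair, max_iters=3):
    # Validate against synthesized tests; repair on failure; discard if it
    # never passes within the iteration budget.
    for _ in range(max_iters):
        if all(t(tool_code) for t in tests):
            return tool_code
        tool_code = repair(tool_code)
    return None
```

The threshold \(\delta\) is the single knob separating the two regimes: raising it makes the agent create more aggressively, lowering it favors reuse.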
- Backward Experience Distillation:
  - Function: Extracts transferable strategic knowledge from the execution trajectory of each task.
  - Mechanism: Upon task completion, an LLM-as-a-Judge evaluates execution quality and assigns a success/failure label along with review comments. Successes trigger Success Generalization (abstracting high-level strategy guidelines), while failures trigger Failure Diagnosis (encoding anti-patterns and failure–repair pairs). Distilled experiences are merged into the experience memory: \(\mathcal{M}_E \leftarrow \mathcal{M}_E \cup \{e_{\text{new}}\}\).
  - Design Motivation: Learning from both successes and failures lets the agent reproduce effective strategies and avoid known pitfalls.
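The success/failure branching of distillation can be sketched in a few lines. The dictionary fields below (`type`, `title`, `core_knowledge`, `repair`) are hypothetical labels for illustration; in the actual framework the distilled content would be produced by the LLM from the trajectory and the judge's review.

```python
# Sketch of backward experience distillation; field names are illustrative.

def distill(trajectory, success, review):
    # Success -> generalize a strategy guideline;
    # failure -> encode an anti-pattern and its repair.
    if success:
        return {"type": "strategy",
                "title": f"How to: {trajectory['task']}",
                "core_knowledge": review}
    return {"type": "anti_pattern",
            "title": f"Pitfall in: {trajectory['task']}",
            "core_knowledge": review,
            "repair": trajectory.get("fix", "")}

def evolve_experience_memory(memory, trajectory, success, review):
    # M_E <- M_E U {e_new}
    memory.append(distill(trajectory, success, review))
    return memory
```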
Loss & Training
Mem²Evolve is an inference-time framework that does not involve model parameter training. Asset recruitment uses embedding similarity retrieval (threshold \(\delta\)), and task evaluation employs LLM-as-a-Judge. All baselines and Mem²Evolve uniformly use GPT-5-chat as the LLM backbone.
Key Experimental Results
Main Results
| Method | GAIA Total | ALFWorld | HotpotQA | AIME24 | AIME25 | Avg. (8 benchmarks) |
|---|---|---|---|---|---|---|
| GPT-5 (ReAct) | 18.47 | 86.87 | 41.40 | 66.67 | 60.00 | 48.27 |
| AFLOW (experience-centric) | 19.75 | 93.40 | 60.80 | 66.67 | 63.33 | 58.44 |
| Alita (capability-centric) | 72.73 | 86.13 | 58.80 | 70.00 | 66.67 | 63.78 |
| Mem²Evolve | 76.31 | 94.31 | 60.80 | 76.70 | 73.33 | 70.24 |
Ablation Study
| Configuration | Avg. Pass@1 | Drop |
|---|---|---|
| Full Mem²Evolve | 70.24 | – |
| w/o Tool Creation | 59.96 | ↓10.28 |
| w/o Agent Memory | 65.51 | ↓4.73 |
| w/o Tool Memory | 67.11 | ↓3.13 |
| w/o Expert Agent Creation | 68.52 | ↓1.72 |
Key Findings
- Dynamic tool creation is the most critical component (its removal causes a 10.28-point drop in average Pass@1), indicating that expanding the toolset is essential for handling complex tasks.
- Experience guidance improves the first-pass tool creation validation rate from 53.1% to 72.4%, reducing debugging iterations by more than half.
- Cross-task initialization (using memories accumulated from GAIA to initialize other tasks) consistently improves performance, approaching the gains of 25-sample same-task initialization, demonstrating good transferability of the memory.
- On GAIA, Mem²Evolve achieves 76.31% Pass@1, surpassing even OpenAI DeepResearch (67.36%), a proprietary system, demonstrating the framework's strong potential.
Highlights & Insights
- The co-evolutionary paradigm of dual memory is the paper's most significant contribution. Inspired by Piaget's theory of cognitive development, it unifies "assimilation" (experience accumulation) and "accommodation" (capability adaptation) within a single framework. This analogy is both theoretically grounded and practically valuable, lending the framework's design logic exceptional clarity.
- The "Reuse first, Create on demand" forward inference strategy is highly practical. The similarity threshold \(\delta\) automatically determines whether a current task exceeds capability boundaries, avoiding unnecessary asset creation overhead while enabling on-demand capability expansion.
- The cross-task memory transfer results are impressive: memories accumulated from GAIA data consistently yield improvements on entirely different tasks such as HotpotQA and AIME without negative transfer, indicating that distilled experiences possess strong abstraction and generality.
Limitations & Future Work
- The framework relies on a sandbox environment to execute auto-generated code, limiting deployment in open-world settings that require direct access to local file systems or unrestricted network access.
- The continuous growth of asset and experience memories may introduce retrieval efficiency issues and noise; long-term memory management strategies (e.g., forgetting, compression) are not discussed.
- The reliability of LLM-as-a-Judge evaluation in the absence of ground-truth labels may affect the quality of backward evolution.
- The quality of tool creation is bounded by the LLM's code generation capability; complex tools may require multiple iterations to reach usable quality.
Related Work & Insights
- vs. Alita (Qiu et al., 2025): Alita supports dynamic tool creation but lacks experiential guidance. Mem²Evolve augments it with experience-guided creation and distillation mechanisms, achieving an average gain of 6.46 percentage points.
- vs. AFLOW (Zhang et al., 2025): AFLOW optimizes module composition via search algorithms but is constrained by a fixed toolset and cannot extend capability boundaries. Mem²Evolve dynamically expands the toolset while accumulating experience, achieving an average gain of 11.80 percentage points.
Rating
- Novelty: ⭐⭐⭐⭐ First to propose a co-evolutionary paradigm for capability expansion and experience distillation, with clear theoretical motivation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 8 benchmarks across 6 task categories with comprehensive ablation, single-task, and cross-task analyses.
- Writing Quality: ⭐⭐⭐⭐ Framework diagrams are clear; the analogy to Piaget's theory is thought-provoking.
- Overall Recommendation: ⭐⭐⭐⭐ Provides a practical framework foundation for building general-purpose self-evolving agents.