# Distilling LLM Agent into Small Models with Retrieval and Code Tools
- Conference: NeurIPS 2025
- arXiv: 2505.17612
- Code: https://github.com/Nardien/agent-distillation
- Area: LLM Agent / Knowledge Distillation
- Keywords: agent distillation, first-thought prefix, self-consistent action generation, small language model, CodeAct
## TL;DR
This paper proposes an Agent Distillation framework that distills the complete reason-act-observe interactive behaviors of LLM agents (rather than static CoT) into small models ranging from 0.5B to 7B parameters. Combined with a first-thought prefix to improve teacher trajectory quality and self-consistent action generation to enhance inference robustness, the framework enables small models to achieve performance comparable to CoT-distilled models 2–4× their size.
## Background & Motivation
Background: CoT distillation—training small models on reasoning traces from large models—is the dominant paradigm for compressing LLM reasoning capabilities, and has been widely adopted by Llama3, Qwen2.5, DeepSeek-R1, and others.
Limitations of Prior Work: CoT distillation imparts only static reasoning—small models must memorize both factual knowledge and computational procedures within their parameters. When encountering unseen knowledge or complex computations at inference time, these models are prone to hallucination. For example, answering "how much would a $100 investment in Apple stock in 2010 be worth in 2020" requires both factual knowledge (historical stock prices) and precise calculation, both of which are failure-prone in CoT-distilled small models.
Key Challenge: Small models have limited parameter capacity and cannot simultaneously memorize large amounts of factual knowledge while maintaining precise computational ability. However, if small models can be taught to use tools (retrieval + code execution), knowledge storage and computation can be delegated to external tools, requiring the model to learn only the behavioral pattern of "how to reason and invoke tools."
Goal: Distill the complete agentic behaviors of LLM agents (reasoning + tool use + environment interaction) into small models with ≤3B parameters, enabling them to serve as compact agents capable of using retrieval and code tools.
Key Insight: (1) Instruction-tuned LLMs exhibit degraded reasoning quality when prompted as agents, particularly on mathematical tasks, due to a distributional mismatch between agent instructions and CoT training; (2) Code actions generated by small models frequently contain syntax errors or fail during execution.
Core Idea: A first-thought prefix (prepending the first CoT step to the agent trajectory) is used to correct the reasoning quality degradation in the teacher agent, while self-consistent action generation (sampling multiple candidates and selecting the most consistent result) is used to address code execution failures in the student agent.
## Method

### Overall Architecture
Agent Distillation proceeds in two stages: (1) Training: a 32B teacher LLM generates reason-act-observe trajectories in CodeAct format, and the student model is fine-tuned via SFT, with loss computed only over thought and action tokens (observations are excluded); (2) Inference: the student agent interacts with the environment over multiple steps, using retrieval tools for knowledge acquisition and code tools for computation at each step.
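For intuition, here is a minimal sketch of the reason-act-observe loop in CodeAct style. The `llm_generate` and `run_python` callables, the prompt format, and the `final_answer()` convention are illustrative assumptions, not the paper's released implementation:

```python
# Minimal sketch of a CodeAct-style reason-act-observe rollout.
# llm_generate, run_python, and the final_answer() convention are stand-ins.

def agent_rollout(question, llm_generate, run_python, max_steps=10):
    """At each step the model emits a Thought and a Code action; the
    executor's output is fed back as an Observation until the generated
    code calls final_answer()."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues with "Thought: ...\nCode: ...".
        step = llm_generate(transcript, stop=["Observation:"])
        transcript += step
        code = step.split("Code:", 1)[-1]  # crude action parser for the sketch
        result = run_python(code)          # retrieval is exposed inside the
        if "final_answer" in code:         # sandbox as a callable Python tool
            return result
        transcript += f"Observation: {result}\n"
    return None  # no answer within the step budget
```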
### Key Designs
- First-Thought Prefix (FTP) (a construction sketch appears after this list):
    - Function: Corrects reasoning degradation in instruction-tuned teacher models operating in agent mode.
    - Mechanism: The teacher model first generates an initial reasoning step \(y_1\) under a CoT prompt; \(y_1\) is then prepended as the prefix of the first thought in the agent prompt, after which the full agent trajectory is generated. This steers the agent toward a sound initial reasoning direction.
    - Design Motivation: Prior work has shown that the first step of an LLM's reasoning chain has a decisive influence on its final conclusion. Agent instructions (e.g., "follow a Thought/Code/Observation loop") may override the model's native CoT reasoning pattern. FTP is analogous to prefix injection in adversarial jailbreak attacks, but applied constructively to guide reasoning.
    - Caveat: FTP is used only during teacher trajectory generation and is not required at student inference time. However, FTP can sometimes cause the model to answer from internal knowledge rather than invoking retrieval tools, increasing the risk of hallucination.
- Self-Consistent Action Generation (SAG) (a selection sketch appears after this list):
    - Function: Improves the robustness of code generation in small model agents during inference.
    - Mechanism: At each step, \(N=8\) thought-action sequences are sampled via high-temperature nucleus sampling. Candidates with parsing or execution failures are filtered out, and majority voting over the remaining valid results selects the most consistent action. When all candidates fail, one failed action is retained at random and its error message is passed back as an observation, enabling the model to self-correct.
    - Design Motivation: Models in the 0.5B–3B range have seen code during pre-training but generate valid Python only with low probability. SAG exploits this nonzero success rate: repeated sampling raises the likelihood that at least one valid action is produced and selected.
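A minimal sketch of how the first-thought prefix could be wired up, assuming a generic `llm_generate(prompt, stop=...)` interface; the prompt strings and function names are illustrative, not the paper's released code:

```python
# Sketch of first-thought prefix (FTP) construction for teacher trajectories.
# COT_PROMPT, AGENT_PROMPT, and llm_generate are illustrative assumptions.

COT_PROMPT = "Answer the question, reasoning step by step."          # placeholder
AGENT_PROMPT = "Solve the task in Thought/Code/Observation cycles."  # placeholder

def generate_with_ftp(question, llm_generate):
    # 1. Elicit the first reasoning step y_1 under a plain CoT prompt,
    #    stopping at the end of the first step (first blank line here).
    y1 = llm_generate(f"{COT_PROMPT}\nQuestion: {question}\n", stop=["\n\n"])

    # 2. Prepend y_1 as the prefix of the agent's first Thought, then let the
    #    teacher complete the full reason-act-observe trajectory from there.
    prompt = f"{AGENT_PROMPT}\nQuestion: {question}\nThought: {y1}"
    return "Thought: " + y1 + llm_generate(prompt)
```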
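And a sketch of self-consistent action generation at a single agent step; the sampling parameters shown and the `parse_and_execute` helper are assumptions consistent with the description above:

```python
import random
from collections import Counter

def sag_step(transcript, sample_action, parse_and_execute, n=8):
    """Self-consistent action generation (sketch): sample N candidate
    thought-action pairs, drop those that fail to parse or execute, and
    majority-vote over the execution results of the survivors."""
    candidates = [sample_action(transcript, temperature=1.0, top_p=0.95)
                  for _ in range(n)]
    valid, failed = [], []  # (action_text, result_or_error) pairs
    for action in candidates:
        ok, result = parse_and_execute(action)
        (valid if ok else failed).append((action, result))

    if not valid:
        # All candidates failed: keep one failed action at random so its error
        # message is returned as the observation, letting the model self-correct.
        return random.choice(failed)

    # Select the result produced most often, and an action that produced it.
    top_result, _ = Counter(result for _, result in valid).most_common(1)[0]
    return next((a, r) for a, r in valid if r == top_result)
```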
### Loss & Training
- Standard SFT loss computed only over thought and action tokens; observation tokens are excluded.
- LoRA (rank 64) applied to all linear layers; lr = 2e-4, batch size 8, 2 epochs.
- Training data: 1,000 HotPotQA + 2,000 MATH examples; approximately 2,000 correct trajectories retained after filtering.
- Hardware: 4× A100 80GB.
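A rough sketch of this setup using a Hugging Face `peft`-style LoRA config; the alpha value and the observation-masking helper are assumptions for illustration:

```python
from peft import LoraConfig

# Sketch of the reported training setup. The exact argument names in the
# authors' code may differ; lora_alpha is not reported in this summary.

lora_config = LoraConfig(
    r=64,                           # rank 64, applied to all linear layers
    lora_alpha=128,                 # assumption: 2*r is a common choice
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
# Reported optimization settings: lr = 2e-4, batch size 8, 2 epochs.

def mask_observations(labels, observation_spans):
    """Compute loss only over thought/action tokens: positions inside each
    Observation span get label -100, which cross-entropy loss ignores."""
    for start, end in observation_spans:
        labels[start:end] = [-100] * (end - start)
    return labels
```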
## Key Experimental Results

### Main Results
| Method | 0.5B Avg | 1.5B Avg | 3B Avg | 7B Avg |
|---|---|---|---|---|
| CoT Distill | 13.64 | 21.28 | 27.72 | 33.54 |
| CoT Distill + RAG | 15.90 | 24.64 | 28.53 | 32.16 |
| Agent Distill | 19.24 | 28.06 | 33.60 | 39.85 |
| +FTP+SAG | 21.90 | 30.55 | 36.60 | 42.68 |
Key finding: 0.5B Agent ≈ 1.5B CoT, 1.5B Agent ≈ 3B CoT, 3B Agent > 7B CoT, 7B Agent > 32B CoT. Agent distillation enables small models to match the performance of CoT-distilled models 2–4× their size.
### Ablation Study
| Component | Contribution | Notes |
|---|---|---|
| FTP on teacher trajectory quality | MATH hard: 58.4→67.1, MATH medium: 78.4→83.4 | Significant improvement on difficult problems |
| SAG on code errors | Parsing errors halved on 0.5B AIME | Multi-sampling effectively filters invalid actions |
| FTP on retrieval invocation | Fewer retrieval calls | FTP causes the model to rely more on internal knowledge, potentially increasing hallucination |
| LoRA vs. Full FT | LoRA: 29.11 vs. Full FT: 26.24 (1.5B) | LoRA generalizes better; full fine-tuning is prone to overfitting |
| Code-specific model | Marginal effect | Using Qwen2.5-Coder as teacher provides slight but non-significant improvement |
### Key Findings
- Agent Distillation vs. RAG: Static RAG helps on factual reasoning but hurts on mathematical reasoning due to mismatched retrieved documents. Agent distillation allows the model to autonomously decide when to retrieve, affording greater flexibility.
- FTP is more beneficial on complex problems: The largest gains are observed on MATH level 5 and AIME, where correct initial reasoning direction is most critical.
- Token overhead is roughly neutral: Agent models generate more tokens on factual reasoning (multiple retrievals) but fewer on mathematical reasoning (replacing verbose calculations with for-loops), with no significant overall difference.
- Cross-architecture generalization: Consistent improvements are also observed on Llama-3.2-1B and Phi-4-mini.
## Highlights & Insights
- Paradigm shift from "distilling knowledge" to "distilling behavior": Traditional CoT distillation trains small models to memorize reasoning processes; Agent distillation trains them to interact with external tools. The former is constrained by model parameter capacity, whereas the latter offloads knowledge storage and computation to external tools.
- The dual nature of First-Thought Prefix: FTP significantly improves mathematical reasoning by guiding the initial reasoning direction, but may cause the model to "answer confidently" on factual tasks rather than invoking retrieval tools, thereby increasing hallucination. This reveals the inherent tension between internal reasoning and external tool use in agentic systems.
- Even 0.5B models can serve as agents: This is a practically significant finding. Under standard prompting, 0.5B models are nearly incapable of producing valid agent outputs, yet after distillation they yield meaningful results across multiple benchmarks.
## Limitations & Future Work
- Only retrieval and code tools are distilled; more complex agent scenarios involving web browsing or simulator interaction remain unexplored.
- Training uses only a single teacher trajectory per example; increasing the number of sampled trajectories may further improve performance.
- Agent distillation does not directly improve the core reasoning capacity of small models; RL post-training could provide additional gains.
- Code execution carries security risks; sandboxing solutions are not sufficiently discussed.
## Related Work & Insights
- vs. Search-R1 / ToolRL: These works use RL to train LLMs to use search and tools; this paper employs SFT-based distillation to transfer agentic capabilities to very small models (0.5B–3B) at lower cost, without involving policy optimization.
- vs. GiGPO (2505.10978): GiGPO addresses the credit assignment problem in agent RL training, while this paper addresses the distillation of agent capabilities from large to small models. The two are complementary—distillation followed by RL fine-tuning is a viable pipeline.
- vs. FireAct / AgentTuning: These works primarily focus on agent fine-tuning of 7B+ models. This paper is the first to systematically investigate agent distillation for extremely small models (≤3B) and proposes solutions specifically targeting small-model failure modes (code generation failures and reasoning direction drift).
## Rating
- Novelty: ⭐⭐⭐⭐ — Distilling agentic behaviors into extremely small models is a valuable new direction; FTP and SAG are simple yet effective designs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four model scales, eight benchmarks, cross-architecture validation, and extensive ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured; Figure 1's performance comparison and Figure 2's conceptual diagram are intuitive.
- Value: ⭐⭐⭐⭐ — Directly applicable to building low-resource deployable agents, though constrained by the SFT paradigm.