Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities
Conference: ACL 2026
arXiv: 2508.20324
Code: https://github.com/omron-sinicx/dgpo
Area: Information Retrieval / LLM Agent
Keywords: Agentic RAG, Knowledge Distillation, Reinforcement Learning, Compact Models, PPO
TL;DR
This paper proposes DGPO (Distillation-Guided Policy Optimization): teacher demonstrations provide a cold-start KD initialization, and a KL distillation penalty is then applied only to incorrect samples during the PPO phase. This lets a 0.5B compact model acquire agentic RAG capabilities, raising average EM across seven QA benchmarks from 0.006 to 0.329 and surpassing the 3B teacher on some datasets.
Background & Motivation
Background: Agentic RAG (e.g., Search-R1, ReAct) has become the mainstream paradigm for LLMs to invoke external retrieval. Models need to execute interleaved <think>, <search>, and <answer> actions to complete multi-hop question answering. Such systems demand high reasoning capabilities from the LLM; thus, existing works almost exclusively rely on models with several billion parameters or more.
Limitations of Prior Work: The authors attempted to directly transplant the PPO training used in Search-R1 to 0.5–1B compact models and found that both mainstream paths failed. First, the RL path: compact models start with an EM of nearly 0 (Qwen2.5-0.5B averaged only 0.006 across seven datasets), so PPO/GRPO rarely obtains a positive reward, leading to slow convergence or early collapse. Second, the KD path: pure distillation on teacher-generated outputs (TGO) suffers from exposure bias, while pure on-policy distillation on student-generated outputs (SGO) is misled by noisy samples. Dynamic scheduling methods such as DistiLLM and TAID are also sensitive to the student-teacher capacity gap.
Key Challenge: The SGO quality of compact models is too poor to support either RL exploration (due to lack of reward signals) or on-policy distillation (due to noisy targets). Offline TGO distillation cannot solve the training-inference distribution mismatch. The fundamental contradiction is the inability to achieve both "cold-start quality" and "exploration capability" simultaneously.
Goal: To train 0.5–1B compact models into retrieval models capable of multi-turn searching like an agent, while ensuring training stability and providing a fine-grained evaluation framework to pinpoint specific shortfalls in agentic capabilities.
Key Insight: The authors rethink the role of the reference model in PPO. Traditional PPO treats it as a KL regularization anchor (to prevent policy drift), whereas this paper transforms the teacher model into an "active instructor": allowing the student to explore freely when it answers correctly, but using KL distillation to "pull it back" to the teacher's trajectory when it answers incorrectly.
Core Idea: Integrate distillation into the PPO process itself via "KD cold-start + selective KL distillation penalty," turning the reference model from a passive regularizer into an active educator. The result is stable agentic RAG training for compact models, which on some datasets even surpass the teacher.
Method
Overall Architecture
DGPO consists of two stages. (1) Cold-start KD initialization: the student is distilled offline on high-quality TGO trajectories generated by the teacher (only correct ones are kept), so that it learns a reasonable behavioral skeleton for <think>/<search>/<answer>. (2) Distillation-guided RL: the cold-started student serves as the initial policy for PPO, with a redesigned reward. Correct answers receive a scalar +1, and incorrect answers receive \(-\beta D_{\text{KL}}[\pi_\theta(y\mid x;\mathcal{R})\|\pi_g(y\mid x;\mathcal{R})]\), which provides a dense learning signal through teacher imitation.
The entire process does not require a manual scheduler (unlike TAID/DistiLLM which require tuning \(\alpha\) interpolation coefficients); the transition between the two stages is triggered by a performance threshold.
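The stage-2 reward described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the KL term is approximated by a single-sample estimate (the sum of per-token log-probability differences on the student's own response), which may not match how the authors compute it.

```python
from typing import Sequence


def sequence_kl_estimate(student_logps: Sequence[float],
                         teacher_logps: Sequence[float]) -> float:
    """Single-sample estimate of KL[student || teacher] on a student-sampled response.

    Assumption: both lists hold log-probabilities of the same generated tokens,
    scored by the student and the teacher respectively.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("student and teacher must score the same tokens")
    return sum(s - t for s, t in zip(student_logps, teacher_logps))


def dgpo_reward(is_correct: bool,
                student_logps: Sequence[float],
                teacher_logps: Sequence[float],
                beta: float = 0.001) -> float:
    """Selective reward: +1 if the answer is correct, else -beta * KL estimate."""
    if is_correct:
        return 1.0  # correct samples are left free to explore
    return -beta * sequence_kl_estimate(student_logps, teacher_logps)


# Example: an incorrect response where the student is far more confident
# than the teacher about the tokens it produced.
print(dgpo_reward(False, [-1.0, -2.0], [-3.0, -4.0]))
```

Note that a single-sample estimate is noisy and can be negative for an individual response, so a practical implementation might prefer a token-level KL computed over the full vocabulary.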
Key Designs
- Cold-start KD Initialization (Using Correct TGO Only):
  - Function: Pulls the student from "near-zero performance" to a starting point where it can generate meaningful rewards.
  - Mechanism: Uses a hybrid loss \(\mathcal{L}_{\text{distill}} = \mathcal{L}_{\text{CE}}(\pi_g,\pi_\theta) + \lambda D_{\text{KL}}[\pi_g(\cdot\mid x)\|\pi_\theta(\cdot\mid x)]\) on the teacher's correct trajectories to learn both hard labels and soft distributions. The authors verified that using only correct TGO is more effective than including incorrect TGO, as the latter propagates faulty retrieval decisions.
  - Design Motivation: SFT warm-start only learns hard labels (SFT→PPO reached only 0.289 in experiments). This method reuses the teacher's full soft distribution, containing fine-grained information on how the teacher weighs ambiguities, allowing the KD initialization alone to reach 0.298.
- Selective KL Penalty:
  - Function: Transforms the teacher from a "passive regularization anchor" to an "active error corrector" during PPO training.
  - Mechanism: Standard PPO reward is \(r_{\text{answer}}=\mathbb{1}[y=y^*]\), where incorrect samples receive a constant 0 reward, yielding no learning signal. This is modified to \(r_\phi(x,y)=1\) if correct, else \(-\beta D_{\text{KL}}[\pi_\theta(y\mid x;\mathcal{R})\|\pi_g(y\mid x;\mathcal{R})]\). Correct samples maintain room for free exploration, while incorrect samples are strongly pulled toward the teacher.
  - Design Motivation: Changing this to a uniform KL penalty (applying distillation to all samples) dropped the average EM from 0.329 to 0.314. Removing teacher guidance (standard PPO) dropped it to 0.306. This suggests that selective correction preserves RL exploration while avoiding being misled by noisy SGO.
- KD→PPO Two-Stage Sequence:
  - Function: Ensures that PPO starts with reasonable agentic behavior to avoid calculating policy gradients on a broken policy.
  - Mechanism: Five epochs of KD initialization are followed by a maximum of 1000 PPO steps of distillation-guided RL. Reversing the sequence (PPO→KD) causes PPO to collapse on a weak policy, which KD cannot subsequently recover.
  - Design Motivation: Inverting the pipeline (PPO→KD) resulted in an average EM of only 0.286, 4.3 points lower than DGPO, indicating that initialization order is critical for compact models.
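As a rough illustration of the hybrid cold-start loss from the first design above, the following pure-Python sketch combines a hard-label cross-entropy term with a forward KL term (teacher relative to student) at each token position. The function name, the per-position averaging, and the `eps` smoothing are assumptions for readability, not details taken from the paper.

```python
import math
from typing import Sequence


def hybrid_distill_loss(teacher_probs: Sequence[Sequence[float]],
                        student_probs: Sequence[Sequence[float]],
                        target_ids: Sequence[int],
                        lam: float = 1.0,
                        eps: float = 1e-12) -> float:
    """Mean over positions of CE(hard teacher token) + lam * KL[teacher || student].

    teacher_probs / student_probs: one vocabulary distribution per position.
    target_ids: the token the teacher actually emitted at each position.
    """
    total = 0.0
    for p_t, p_s, y in zip(teacher_probs, student_probs, target_ids):
        # Hard-label term: negative log-likelihood of the teacher's token.
        ce = -math.log(p_s[y] + eps)
        # Soft-label term: forward KL from the teacher to the student.
        kl = sum(pt * (math.log(pt + eps) - math.log(ps + eps))
                 for pt, ps in zip(p_t, p_s) if pt > 0.0)
        total += ce + lam * kl
    return total / len(target_ids)


# When the student matches the teacher exactly, only the CE term remains.
matched = hybrid_distill_loss([[0.5, 0.5]], [[0.5, 0.5]], [0])
# A mismatched student pays an additional KL cost.
mismatched = hybrid_distill_loss([[0.9, 0.1]], [[0.5, 0.5]], [0])
print(matched, mismatched)
```

With \(\lambda=1.0\), as reported for the KD phase, both terms carry equal weight; the SFT warm-start baseline corresponds to keeping only the first term.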
Loss & Training
KD phase: \(\lambda=1.0\) to balance CE and KL. PPO phase: \(\beta=0.001\) to control distillation intensity for incorrect samples. Rollouts use a maximum of 4 dialogue turns with top-3 document retrieval (E5 retriever, 2018 Wikipedia dump). Token masking is applied so that gradients are computed only on LLM-generated tokens, not on retrieved <information> segments. Training took approximately 1 day on 8× NVIDIA H200 GPUs.
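The rollout settings above (turn limit, top-3 retrieval, masking of retrieved text) can be pictured with a small sketch. Everything here is hypothetical scaffolding: `generate` and `retrieve` stand in for the policy model and the retriever, a "turn" is assumed to mean one generation call, and masking is tracked per text segment rather than per token.

```python
import re
from typing import Callable, List, Optional, Tuple

Segment = Tuple[str, bool]  # (text, trainable): trainable=False means masked out

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str, int], List[str]],
            max_turns: int = 4,
            top_k: int = 3) -> Tuple[Optional[str], List[Segment]]:
    """Run an interleaved think/search/answer episode and record a loss mask."""
    segments: List[Segment] = []
    context = question
    answer: Optional[str] = None
    for _ in range(max_turns):
        step = generate(context)
        segments.append((step, True))  # model-generated text receives gradients
        context += step
        found = ANSWER_RE.search(step)
        if found:
            answer = found.group(1).strip()
            break
        query = SEARCH_RE.search(step)
        if query is None:
            break  # neither a search nor an answer: stop the episode
        docs = retrieve(query.group(1).strip(), top_k)
        info = "<information>" + "\n".join(docs) + "</information>"
        segments.append((info, False))  # retrieved text is excluded from the loss
        context += info
    return answer, segments


# Usage with scripted stand-ins for the model and the retriever.
script = iter(["<think>need facts</think><search>capital of France</search>",
               "<think>found it</think><answer>Paris</answer>"])
ans, segs = rollout("Q: What is the capital of France?",
                    lambda ctx: next(script),
                    lambda q, k: ["Paris is the capital of France."][:k])
print(ans, [trainable for _, trainable in segs])
```

In a real training loop the per-segment flags would be expanded into a per-token mask after tokenization.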
Key Experimental Results
Main Results
EM comparison for Qwen2.5 (3B teacher → 0.5B student). Five of the seven QA benchmarks are shown; Avg is the mean over all seven:
| Method | NQ | TriviaQA | HotpotQA | MuSiQue | Bamboogle | Avg | Notes |
|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.000 | 0.000 | 0.006 | Untrained |
| Teacher-3B | 0.365 | 0.569 | 0.340 | 0.135 | 0.298 | 0.353 | Upper bound ref |
| PPO (Search-R1) | 0.306 | 0.444 | 0.205 | 0.041 | 0.073 | 0.238 | RL baseline |
| SFT→PPO | 0.338 | 0.415 | 0.296 | 0.088 | 0.250 | 0.289 | warm start |
| KD (Hinton) | 0.331 | 0.431 | 0.286 | 0.091 | 0.290 | 0.298 | Offline distillation |
| DistiLLM | 0.333 | 0.442 | 0.288 | 0.095 | 0.209 | 0.287 | Adaptive distillation |
| TAID | 0.325 | 0.427 | 0.290 | 0.079 | 0.218 | 0.282 | Scheduled distillation |
| DGPO | 0.378 | 0.481 | 0.342 | 0.120 | 0.274 | 0.329 | Surpasses 3B teacher on NQ/HotpotQA |
Generalization was also verified across teacher sizes and model families: Qwen 7B→0.5B raised average EM from 0.238 (PPO) to 0.323; Llama-3 8B→1B raised it from 0.250 to 0.389, only 4.9 points behind the 8B teacher.
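A quick check of the main table makes two points concrete: where DGPO overtakes the teacher, and why the Avg column cannot be recomputed from the five displayed datasets. The snippet only re-uses numbers printed in the table; the variable names are illustrative.

```python
DATASETS = ["NQ", "TriviaQA", "HotpotQA", "MuSiQue", "Bamboogle"]
TEACHER = [0.365, 0.569, 0.340, 0.135, 0.298]
PPO = [0.306, 0.444, 0.205, 0.041, 0.073]
DGPO = [0.378, 0.481, 0.342, 0.120, 0.274]

# Datasets on which the 0.5B DGPO student exceeds the 3B teacher.
beats_teacher = [d for d, s, t in zip(DATASETS, DGPO, TEACHER) if s > t]

# Gain of DGPO over the PPO baseline, in EM points (EM * 100).
gain_over_ppo = {d: round((s - p) * 100, 1) for d, s, p in zip(DATASETS, DGPO, PPO)}

# Mean over the five displayed columns only; this differs from the reported
# Avg (0.353 teacher, 0.329 DGPO), which covers all seven benchmarks.
teacher_five_mean = round(sum(TEACHER) / len(TEACHER), 3)
dgpo_five_mean = round(sum(DGPO) / len(DGPO), 3)

print(beats_teacher)
print(gain_over_ppo)
print(teacher_five_mean, dgpo_five_mean)
```

The largest gains over PPO fall on the multi-hop sets HotpotQA and Bamboogle.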
Ablation Study
| Configuration | Avg EM | Explanation |
|---|---|---|
| DGPO Full | 0.329 | KD cold-start + KD→PPO + Selective KL |
| w/o cold-start init | 0.320 | Training collapsed after step 800; highest score before collapse taken |
| w/o selective KL (uniform) | 0.314 | Distillation on correct samples as well; constraint too strong |
| w/o teacher guidance (std PPO) | 0.306 | No learning signal for incorrect samples |
| invert pipeline (PPO→KD) | 0.286 | Sequence reversed; policy already collapsed during PPO |
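The point differences quoted in these notes (for example 1.5 points for uniform KL and 4.3 points for the inverted pipeline) follow directly from the table. A small sketch of the arithmetic, using only the Avg EM values above:

```python
FULL_DGPO = 0.329

ABLATIONS = {
    "w/o cold-start init": 0.320,
    "w/o selective KL (uniform)": 0.314,
    "w/o teacher guidance (std PPO)": 0.306,
    "invert pipeline (PPO->KD)": 0.286,
}

# Drop relative to full DGPO, expressed in EM points (EM * 100).
drops = {name: round((FULL_DGPO - em) * 100, 1) for name, em in ABLATIONS.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop} points")
```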
Key Findings
- Cold-start KD is key to stability: Without it, training collapses after step 800, indicating that compact models cannot reliably run PPO from scratch. However, the peak score before collapse (0.320) is close to full DGPO (0.329), suggesting that most of the performance comes from the distillation-guided RL stage while stability comes from KD initialization.
- Selective KL > Uniform KL: Applying distillation only to incorrect samples outperformed applying it to all samples by 1.5 points, indicating that the combination of "free exploration + error correction" is superior to pure imitation.
- Fine-grained ARCap breakdown: On NQ (single-hop), the Query Rewriting Hit ratio was actually highest for standard PPO (0.711 > Teacher 0.682), while DGPO matched the teacher at 0.682. On MuSiQue (multi-hop), DGPO’s Hit ratio (0.583) and search steps (2.64) were the highest, showing compact models compensate for weaker single-step reasoning with "repeated searches."
- GRPO is unsuitable for compact models: GRPO converges fast but collapses early; even with KD initialization and teacher guidance, it remains unstable. All primary experiments use PPO.
Highlights & Insights
- Redefining the role of the reference model: Shifted from a KL anchor (passive) to an instructor (active). This conceptual shift could be useful for any PPO-style fine-tuning—whenever a reference stronger than the current policy exists, it can be upgraded from a "regularization term" to an "error correction term."
- "Class-based rewards" is a simple yet powerful trick: Giving scalars for correct answers and distillation penalties for incorrect ones essentially fills the void left by sparse binary rewards with dense distillation rewards.
- Reusable ARCap evaluation framework: The design of isolating agentic sub-capabilities (e.g., "testing pure answering with ground-truth context," "testing first-round query Hit ratio") by splitting actions into thinking, query rewriting, and source referencing is a valuable methodology.
- 0.5B + CPU deployment: This moves agentic RAG from the cloud toward laptops and phones. Average EM rises roughly 55× (0.006 → 0.329), approaching the 3B teacher (0.353).
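To make the ARCap idea of a first-round query Hit ratio more tangible, here is a minimal sketch. The definition used is an assumption, not the paper's specification: a question counts as a hit when any document retrieved by the first search query contains one of the gold answers as a case-insensitive substring. The field names `answers` and `first_round_docs` are hypothetical.

```python
from typing import Dict, List


def first_round_hit_ratio(examples: List[Dict[str, List[str]]]) -> float:
    """Fraction of questions whose first-round retrieval contains a gold answer.

    Assumed example format:
      {"answers": [...gold strings...], "first_round_docs": [...retrieved texts...]}
    """
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        gold = [a.lower() for a in ex["answers"]]
        docs = [d.lower() for d in ex["first_round_docs"]]
        if any(a in d for a in gold for d in docs):
            hits += 1
    return hits / len(examples)


sample = [
    {"answers": ["Paris"], "first_round_docs": ["Paris is the capital of France."]},
    {"answers": ["Canberra"], "first_round_docs": ["Sydney is a large city."]},
]
print(first_round_hit_ratio(sample))
```

Measuring this on the first query alone isolates query rewriting from the later search rounds, which is the kind of separation the framework is after.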
Limitations & Future Work
- Model families were only verified on Qwen2.5 and Llama-3, without extension to Mistral, Phi, or others.
- The largest teacher tested was 8B; because of compute constraints, it remains unknown whether KD can still bridge the capacity gap with much larger teachers (70B+).
- Distillation introduces ~9.5% extra training time (teacher inference), which is small but may increase significantly with larger teachers.
- Agentic behavior was only verified on QA tasks; the transferability to code, math, or tool-use scenarios is left for future work.
- Personal note: The "correct/incorrect" boundary for selective KL relies on EM, which is unsuitable for free-form generation (e.g., summarization). Transferring this method to non-QA tasks would require redefining the reward splitting logic.
Related Work & Insights
- vs Search-R1 (Jin et al. 2025): Search-R1 runs PPO effectively on 7B+, but rewards are too sparse for 0.5B; DGPO uses distillation to provide dense rewards for incorrect samples, which addresses this compact-model failure mode.
- vs DistiLLM (Ko et al. 2024) / TAID (Shing et al. 2025): These use \(\alpha\)-interpolation for dynamic scheduling between teacher/student distributions, which is hyperparameter-sensitive. DGPO uses a two-stage process with selective KL, requiring no scheduler.
- vs GKD (Agarwal et al. 2024): Pure on-policy SGO distillation, but compact model SGO is noisy, leading GKD to average only 0.240. DGPO decouples on-policy exploration and off-policy KD into two stages.
- vs DeepSeek-R1 cold-start: R1 uses SFT for cold-start; DGPO replaces SFT with full KD (including soft distributions), and its results suggest that soft targets are more valuable than hard targets for small models.
Rating
- Novelty: ⭐⭐⭐⭐ The perspective of "reference model as teacher" is fresh, though individual components like KD initialization and selective KL are not entirely original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 QA datasets × 3 model configurations × 5 ablation dimensions, plus ARCap sub-capability evaluation.
- Writing Quality: ⭐⭐⭐⭐ Figures 1 and 5 clarify the ideas well; Limitations are honest. Individual formulas (like the PPO objective in Eq. 2) are slightly cramped.
- Value: ⭐⭐⭐⭐⭐ Bringing agentic RAG to the 0.5B scale for edge deployment is highly practical. The method is directly applicable to other PPO-on-small-model scenarios.