# Language Model Distillation: A Temporal Difference Imitation Learning Perspective
Conference: AAAI 2026 · arXiv: 2505.20335 · Code: N/A · Area: Reinforcement Learning · Keywords: knowledge distillation, imitation learning, temporal difference learning, top-p action space, inverse reinforcement learning
## TL;DR
This paper revisits language model distillation from an imitation learning / inverse reinforcement learning perspective. It exploits the sparsity of teacher output distributions (a small set of top tokens covers over 96% of the probability mass) to construct a top-p MDP for temporal difference (TD) learning, proves that the optimal policy in the reduced action space admits a bounded suboptimality guarantee, and demonstrates that the resulting Bellman Distill (BD) method, built on the IQL algorithm, outperforms existing distillation methods across multiple model families.
## Background & Motivation
- Background: LLM distillation has become standard practice for compressing large-capacity teacher models into efficient student models. Most existing distillation methods can be viewed as Behavior Cloning (BC): directly imitating the teacher's output distribution at each step.
- Limitations of Prior Work: BC suffers from compounding errors, also known as exposure bias in autoregressive models: during training the student observes teacher-generated contexts, whereas at inference it operates on its own (potentially erroneous) contexts, so errors accumulate as the sequence grows.
- Key Challenge: The imitation learning literature offers rich tools to address BC's shortcomings, particularly TD learning, which mitigates compounding errors by estimating the long-term impact of actions given a state. However, directly applying TD learning to LLMs faces the challenge of an enormous action space: vocabulary size \(|\mathcal{V}|\) typically ranges from tens of thousands to over a hundred thousand tokens.
- Goal: Identify and exploit structure specific to the distillation setting. The key observation is that LLM output distributions are highly sparse (the top-50 tokens account for 96% of the probability mass, and the top-7 tokens contribute ≥90%), motivating the idea that TD learning only needs to consider a small set of high-probability candidate actions.
- Key Insight: Does a distillation-specific exploitable structure exist? Yes: the sparsity of the teacher distribution, a property unavailable in pure IL settings, where the expert's full policy distribution generally cannot be queried.
## Method
### Overall Architecture
- Define the MDP for language model generation: state = prompt + generated sequence so far; action = next token.
- Exploit teacher distribution sparsity to define the top-p candidate set.
- Construct the top-p MDP and prove bounded suboptimality.
- Implement the IQL algorithm on the top-p MDP for distillation.
### Key Designs
#### 1. Top-p Candidate Set and Top-p MDP
Given teacher policy \(\pi^\star\) and state \(s\), the top-p candidate set \(\mathcal{A}_p^\star(s)\) is the smallest set of highest-probability tokens whose cumulative probability mass under \(\pi^\star(\cdot \mid s)\) reaches \(p\):

\[
\mathcal{A}_p^\star(s) = \operatorname*{arg\,min}_{A \subseteq \mathcal{V}} |A| \quad \text{s.t.} \quad \sum_{a \in A} \pi^\star(a \mid s) \geq p.
\]
The top-p MDP is defined as \(\mathcal{M}_p = (\mathcal{S}, \mathcal{A}_p^\star, \mathbb{P}, r, \gamma)\), reducing the action space from the full vocabulary \(\mathcal{V}\) to \(\mathcal{A}_p^\star\).
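A minimal sketch, assuming a PyTorch teacher that exposes per-step logits, of how the top-p candidate set could be computed; the paper's code is not released, so the function name and tensor shapes below are illustrative:

```python
import torch

def top_p_candidate_mask(teacher_logits: torch.Tensor, p: float = 0.8) -> torch.Tensor:
    """Boolean mask over the vocabulary marking the top-p candidate set A_p*(s).

    teacher_logits: [batch, vocab_size] teacher logits at a given state.  The mask
    keeps the smallest set of highest-probability tokens whose cumulative mass
    reaches p; the token that crosses the threshold is included, so at least one
    token is always kept.
    """
    probs = torch.softmax(teacher_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cum_mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    keep_sorted = cum_mass_before < p             # keep tokens until mass p is covered
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(-1, sorted_idx, keep_sorted)    # map back to vocabulary order
    return mask
```

This is the same cutoff rule as nucleus sampling, but applied during training to define the action space rather than at decoding time.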
#### 2. Theoretical Suboptimality Guarantee
A core contribution of this paper is proving that the gap between the optimal policy learned in the reduced action space and the original optimal policy is controllable.
The top-p Bellman operator \(\mathcal{B}_p^\pi\) is the soft Bellman evaluation operator in which the expectation over next tokens is taken under \(\text{proj}_p(\pi)\), the projection of policy \(\pi\) onto \(\mathcal{A}_p^\star\) obtained by renormalizing its probability mass over the top-p set.
Proposition 1 (Contraction): \(\mathcal{B}_p^\pi\) is a contraction mapping under the supported \(\infty\)-norm.
Proposition 2 (Suboptimality Bound): \(\|Q^\star - \bar{Q}^{\text{proj}_p \pi^\star}\|_{\infty, \mathcal{A}_p^\star} \leq \kappa(p)\), where \(\kappa(p) = -\frac{\gamma}{1-\gamma}\log p\).
Proposition 3 (Sandwich Condition): \(\bar{Q}^{\text{proj}_p \pi^\star}(s,a) \leq \bar{Q}_p^\star(s,a) \leq Q^\star(s,a)\)
Proposition 4 (Bounded Suboptimality, Main Theorem): Combining Propositions 2 and 3, the optimal Q-function of the top-p MDP satisfies \(\|Q^\star - \bar{Q}_p^\star\|_{\infty, \mathcal{A}_p^\star} \leq \kappa(p)\); that is, the optimal policy of the reduced action space is at most \(\kappa(p)\)-suboptimal.
At \(p=0.8\), \(\kappa(0.8) \approx 0.22 \cdot \gamma/(1-\gamma)\), indicating small suboptimality. This justifies running IRL algorithms on a substantially reduced action space.
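Plugging in the discount factor used in the experiments (\(\gamma = 0.99\)) makes the trade-off concrete; a quick check of the bound for a few cutoffs:

```python
import math

gamma = 0.99  # discount factor used in the experiments
for p in (0.5, 0.8, 0.95):
    kappa = -gamma / (1 - gamma) * math.log(p)
    print(f"p = {p:.2f}  ->  kappa(p) = {kappa:.2f}")
# p = 0.50  ->  kappa(p) = 68.62
# p = 0.80  ->  kappa(p) = 22.09
# p = 0.95  ->  kappa(p) = 5.08
```

Larger \(p\) tightens the bound but enlarges the action space, which is the trade-off discussed in the highlights below.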
#### 3. Bellman Distill via IQL
IQL (Inverse soft Q-Learning) is adopted as the base algorithm.
Q-function masking: the top-p mask \(\mathcal{F}_p^\star\) restricts Q-value updates to tokens inside \(\mathcal{A}_p^\star\); tokens outside the candidate set are excluded from the objective.
Policy projection: \((\text{proj}_p \pi_Q)(a \mid s) = \exp Q(s,a) \,/\, \sum_{a' \in \mathcal{A}_p^\star} \exp Q(s,a')\) for \(a \in \mathcal{A}_p^\star\), and 0 otherwise.
The final objective combines Q-masking and policy projection: the IQL objective is optimized with both the Q-function and the induced policy restricted to \(\mathcal{A}_p^\star\).
Key implementation details:
- Student model logits serve directly as Q-values (up to a constant shift), so a single model functions simultaneously as Q-function and policy.
- \(\chi^2\) regularization is applied: \(\phi(x) = x - x^2/(4\alpha)\) with \(\alpha = 0.1\).
- Q-value clipping at \(Q_{\min} = -10\) for numerical stability.
- Offline training: a dataset is first generated from the teacher, then used to train the student via Bellman Distill (BD).
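Since the code is not released, the following is only a rough sketch of how these pieces (logits as Q-values, top-p masking, the \(\chi^2\) regularizer, Q-clipping) might combine into a per-token IQ-Learn-style loss on teacher-generated data; the tensor layout, terminal-value handling, and exact form of the objective are assumptions for illustration rather than the paper's implementation:

```python
import torch

def bellman_distill_loss(
    student_logits: torch.Tensor,  # [B, T, V] student logits, used directly as Q-values
    teacher_tokens: torch.Tensor,  # [B, T] teacher-generated response tokens ("expert" actions)
    topp_mask: torch.Tensor,       # [B, T, V] bool mask of the top-p candidate set A_p*
    gamma: float = 0.99,
    alpha: float = 0.1,
    q_min: float = -10.0,
) -> torch.Tensor:
    """Sketch of an IQ-Learn-style distillation loss restricted to the top-p MDP."""
    q = student_logits.clamp(min=q_min)                      # Q-value clipping for stability
    # Q(s, a) at the token the teacher actually produced.
    q_a = q.gather(-1, teacher_tokens.unsqueeze(-1)).squeeze(-1)           # [B, T]
    # Soft value over the top-p set: V(s) = log sum_{a in A_p*} exp Q(s, a),
    # i.e. the log-normalizer of the projected policy proj_p(pi_Q).
    v = torch.logsumexp(q.masked_fill(~topp_mask, float("-inf")), dim=-1)  # [B, T]
    # V(s') for the next state; V = 0 after the final token (simplified terminal handling).
    v_next = torch.cat([v[:, 1:], torch.zeros_like(v[:, :1])], dim=1)
    # chi^2-regularized term phi(x) = x - x^2 / (4 * alpha), with x = Q(s, a) - gamma * V(s').
    td = q_a - gamma * v_next
    phi = td - td.pow(2) / (4 * alpha)
    # IQ-Learn-style objective on expert data: maximize E[phi] - E[V(s) - gamma * V(s')].
    return -(phi - (v - gamma * v_next)).mean()
```

The language-modeling term \(\mathcal{J}_{PT}\) mentioned below would be added on top of this loss, and padded positions would additionally need to be masked out.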
### Loss & Training
- A language modeling objective \(\mathcal{J}_{PT}\) is jointly maximized to preserve NLP benchmark performance.
- Two-phase training: Phase 1 SFT initialization → Phase 2 BD distillation.
- Batch size 64, learning rate 5e-6, \(p=0.8\) (\(p=0.5\) for OPT 1.3B).
- Discount factor \(\gamma=0.99\).
- The teacher generates 8 responses per query.
- Training is conducted on 4× A40 GPUs.
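For quick reference, the reported hyperparameters collected into a single hypothetical config; the class and field names are illustrative, the values are those listed above:

```python
from dataclasses import dataclass

@dataclass
class BellmanDistillConfig:
    # Values as reported in the write-up above; names are illustrative, not from the paper.
    batch_size: int = 64
    learning_rate: float = 5e-6
    top_p: float = 0.8                    # 0.5 for the OPT 1.3B student
    gamma: float = 0.99                   # discount factor
    chi2_alpha: float = 0.1               # chi^2 regularization strength
    q_min: float = -10.0                  # Q-value clipping threshold
    teacher_responses_per_query: int = 8  # offline dataset generation
```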
## Key Experimental Results
### Main Results
Rouge-L scores (mean over 5 random seeds). Teacher → student pairs: GPT-2 1.5B → 120M, OPT 6.7B → 125M, Qwen-2.5 3B → 0.5B.

| Method | GPT-2 Dolly | GPT-2 SelfInst | GPT-2 Vicuna | OPT Dolly | OPT SelfInst | OPT Vicuna | Qwen-2.5 Dolly | Qwen-2.5 SelfInst | Qwen-2.5 Vicuna |
|---|---|---|---|---|---|---|---|---|---|
| SFT | 23.2 | 9.9 | 14.3 | 23.2 | 9.9 | 14.3 | 24.5 | 16.5 | 17.6 |
| KD | 22.8 | 10.8 | 13.4 | 21.9 | 9.7 | 14.0 | 24.4 | 14.7 | 17.8 |
| SeqKD | 22.7 | 10.1 | 14.3 | 22.0 | 10.1 | 13.7 | 24.7 | 15.3 | 17.4 |
| MiniLLM | 24.6 | 13.2 | 16.9 | 23.8 | 10.2 | 15.3 | 26.7 | 19.1 | 20.5 |
| BD (Ours) | 24.7 | 13.3 | 16.2 | 25.1 | 11.4 | 14.8 | 27.8 | 19.5 | 20.7 |
### Ablation Study

Effect of the top-p value (Rouge-L; students: GPT-2 340M, OPT 125M, Qwen-2.5 0.5B; gains are relative to \(p = 1.0\)):

| p | GPT-2 Dolly | GPT-2 SelfInst | GPT-2 Vicuna | OPT Dolly | OPT SelfInst | OPT Vicuna | Qwen-2.5 Dolly | Qwen-2.5 SelfInst | Qwen-2.5 Vicuna |
|---|---|---|---|---|---|---|---|---|---|
| 1.0 (no mask) | 24.9 | 15.2 | 15.7 | 23.6 | 10.5 | 14.2 | 26.6 | 18.1 | 18.6 |
| 0.8 | 25.8 (+0.9) | 15.4 (+0.2) | 16.5 (+0.8) | 25.1 (+1.5) | 11.4 (+0.9) | 14.8 (+0.6) | 27.8 (+1.2) | 19.5 (+1.4) | 20.7 (+2.1) |
| 0.5 | 25.6 (+0.7) | 15.5 (+0.3) | 16.1 (+0.4) | 24.4 (+0.8) | 11.0 (+0.5) | 14.7 (+0.5) | 27.6 (+1.0) | 19.2 (+1.1) | 20.2 (+1.6) |
Training time comparison (wall-clock time to reach best validation performance):
| Model | MiniLLM | BD (Ours) | Speedup |
|---|---|---|---|
| GPT-2 1.5B→340M | 11.2h | 1.3h | 8.6× |
| OPT 6.7B→350M | 12.8h | 3.5h | 3.7× |
| Qwen-2.5 3B→0.5B | 10.7h | 0.8h | 13.4× |
\(\chi^2\) regularization ablation (Rouge-L; students: GPT-2 760M and Qwen-2.5 0.5B):

| Config | GPT-2 Dolly | GPT-2 SelfInst | GPT-2 Vicuna | Qwen-2.5 Dolly | Qwen-2.5 SelfInst | Qwen-2.5 Vicuna |
|---|---|---|---|---|---|---|
| w/ \(\chi^2\) | 26.2 | 16.1 | 17.3 | 27.8 | 19.5 | 20.7 |
| w/o \(\chi^2\) | 25.9 | 15.7 | 17.1 | 27.4 | 19.2 | 20.3 |
### Key Findings

- Top-p masking is consistently effective: At \(p=0.8\), significant improvements over \(p=1.0\) (no mask) are observed across all three model families, with a maximum gain of 2.1 Rouge-L points on Qwen-2.5. This directly validates the value of exploiting teacher distribution sparsity.
- Efficiency advantage of offline training: BD reaches its best validation performance 3.7–13.4× faster than the online method MiniLLM by avoiding costly online autoregressive generation.
- Win rate under GPT-4o-mini evaluation: BD consistently achieves higher win rates against the KD, SeqKD, and MiniLLM baselines, indicating that generation quality improvements extend beyond Rouge-L.
- Scaling consistency: Rouge-L improves consistently as student model size increases, demonstrating favorable scalability.
- \(\chi^2\) regularization is effective: Incorporating \(\chi^2\) regularization yields consistent gains of 0.2–0.5 Rouge-L points across all settings.
## Highlights & Insights
- Value of perspective shift: Recasting distillation as an IL/IRL problem naturally introduces the TD learning toolkit to address BC's compounding error problem — an elegant theoretical contribution.
- Exploiting LLM-specific structure: Teacher distribution sparsity is an LLM-specific structural property; converting it into a theoretical guarantee for action space reduction bridges RL theory and LLM practice.
- Practical significance of the suboptimality bound: \(\kappa(p) = -\gamma\log p/(1-\gamma)\) provides an explicit trade-off — enabling a quantifiable decision on the degree of action space reduction.
- Clever design of Q-values as logits: Student model logits serve directly as Q-values, requiring no additional value network — a single model simultaneously serves as both policy and value function.
## Limitations & Future Work
- Shared vocabulary assumption: Teacher and student must share a vocabulary for distribution matching, limiting cross-architecture distillation.
- Performance ceiling of offline training: While offline training is efficient, online training could in principle reach higher final performance, since offline BD does not fully resolve the distribution-shift problem.
- Limited task diversity: Evaluation is restricted to instruction-following tasks; more complex settings such as code generation and mathematical reasoning are not covered.
- MiniLLM remains competitive on GPT-2: BD's advantage on the GPT-2 family is less pronounced than on the OPT and Qwen families.
- Hyperparameter sensitivity: Performance varies with the choice of \(p\) (\(p=0.5\) vs. \(p=0.8\)), and no adaptive strategy for selecting \(p\) is proposed.
- Theory–practice gap: The theoretical analysis assumes a tabular setting; contraction properties and related guarantees may not hold strictly for parameterized models.
## Related Work & Insights
- SeqKD, KD, and MiniLLM are unified under the IL/IRL framework as offline off-policy BC, online mixed-policy BC, and policy gradient, respectively, providing a clean taxonomy.
- IQL is chosen because it reparameterizes the saddle-point IRL objective as a simple maximization, avoiding complex min-max optimization.
- The top-p candidate set is conceptually aligned with nucleus sampling, but is applied during training rather than inference.
- The theoretical framework is general — any soft IRL algorithm can be adapted to the reduced action space via top-p projection.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The IL perspective combined with the top-p MDP theoretical framework constitutes a genuinely novel contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three model families with multiple ablations, though task diversity is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical derivations are rigorous, motivation is clearly articulated, and the IL/IRL background is well presented.
- Value: ⭐⭐⭐⭐ — Provides a new theoretical framework and practical approach for LLM distillation.