AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Conference: ACL 2025
arXiv: 2503.02832
Code: None
Area: Model Compression
Keywords: LLM Alignment, DPO, Distillation, Token-Level Reward, Preference Optimization
TL;DR
AlignDistil theoretically proves the equivalence between the RLHF objective and a token-level distillation process. Based on this, it designs a simple distillation method: constructing a teacher distribution through a linear combination of logit distributions from a DPO model and a reverse DPO model, and combining this with a token-adaptive extrapolation mechanism to achieve token-level reward optimization. It outperforms existing methods on AlpacaEval 2.0, MT-Bench, and Arena-Hard while achieving faster convergence.
Background & Motivation
Background: LLM alignment is primarily achieved through RLHF and DPO, but these methods utilize sparse response-level reward/preference annotations to optimize all tokens.
Limitations of Prior Work: Response-level feedback is coarse-grained and fails to reflect the individual contribution of each token, which may incorrectly penalize high-quality tokens or encourage low-quality tokens, leading to suboptimal performance and slow convergence.
Key Challenge: Fine-grained token-level reward signals are needed, but human annotations can only provide response-level preferences.
Goal: To theoretically decompose the response-level RLHF objective into token-level optimization and achieve efficient token-level alignment.
Key Insight: Leveraging the token-level decomposability of DPO reward to prove that the RLHF objective is equivalent to a token-level distillation process.
Core Idea: RLHF = token-level distillation, and the teacher distribution = a linear combination of DPO logits and reference model logits.
Method
Overall Architecture
Starting from the RLHF objective, the method substitutes the token-level decomposition of the DPO reward, derives an equivalent token-level distillation objective, and then trains the student policy \(\pi_\theta\) to match a teacher distribution \(\pi^*\). The teacher is built by token-adaptive logit extrapolation from a DPO model and a reverse DPO model.
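The step from the RLHF objective to a distillation target can be sketched at a single token position. This is a simplified sketch that treats each position independently and writes \(\beta'\) for the KL coefficient of the RLHF objective; \(\beta'\), \(J_t\), and \(Z\) are symbols introduced here for illustration. With the token-level DPO reward \(r(s_t, a) = \beta \log \frac{\pi_{DPO}(a|s_t)}{\pi_{ref}(a|s_t)}\) and \(\alpha = \beta / \beta'\):

\[
\begin{aligned}
J_t(\pi) &= \mathbb{E}_{a \sim \pi(\cdot|s_t)}\big[r(s_t, a)\big] - \beta'\,\text{KL}\big(\pi(\cdot|s_t)\,\|\,\pi_{ref}(\cdot|s_t)\big) \\
&= -\beta' \sum_a \pi(a|s_t)\Big[\log \pi(a|s_t) - (1-\alpha)\log \pi_{ref}(a|s_t) - \alpha \log \pi_{DPO}(a|s_t)\Big] \\
&= -\beta'\,\text{KL}\big(\pi(\cdot|s_t)\,\|\,\pi^*(\cdot|s_t)\big) + \beta' \log Z(s_t),
\end{aligned}
\]

where \(\pi^*(a|s_t) = \pi_{ref}(a|s_t)^{1-\alpha}\,\pi_{DPO}(a|s_t)^{\alpha} / Z(s_t)\) and \(Z(s_t)\) is the normalizer. Because \(Z(s_t)\) does not depend on \(\pi\), \(J_t\) is maximized exactly when \(\pi = \pi^*\), so matching the teacher at every position recovers the per-position optimum. In logit space, \(\pi^*\) is a softmax over a linear combination of the reference and DPO logits with weights \(1-\alpha\) and \(\alpha\).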
Key Designs
- RLHF-Distillation Equivalence:
  - Function: To prove that the sequence-level RLHF objective can be decomposed into token-level KL divergence distillation.
  - Mechanism: The implicit DPO reward decomposes over token positions as \(r_t = \beta \log \frac{\pi_{DPO}(a_t|s_t)}{\pi_{ref}(a_t|s_t)}\). Substituting this into the KL-regularized RLHF objective, the optimal policy at each token position is the teacher distribution \(\pi^*(t) \propto \exp((1-\alpha) \cdot \text{logit}_{ref}(t) + \alpha \cdot \text{logit}_{DPO}(t))\), where the combination weight \(\alpha\) is set by the ratio of the reward scale \(\beta\) to the KL coefficient of the RLHF objective. A value \(\alpha > 1\) extrapolates beyond the DPO logits.
  - Design Motivation: To convert an intractable RL optimization problem into a simple distillation task.
- Contrastive DPO Reward:
  - Function: To improve the accuracy of the implicit DPO reward.
  - Mechanism: Training a standard DPO model and a reverse DPO model (swapping chosen/rejected), and constructing a more robust reward utilizing their contrast: the forward DPO strengthens good tokens, while the reverse DPO weakens bad tokens.
  - Design Motivation: A standalone DPO reward has worse accuracy than a pure reward model, and the contrastive strategy bridges this gap.
- Token-Adaptive Logit Extrapolation:
  - Function: To construct a teacher distribution with an appropriate strength for each token position.
  - Mechanism: The extrapolation weight \(\alpha_t\) is adjusted dynamically according to how far the DPO model diverges from the reference model at each token position. Tokens with larger divergence receive smaller weights to avoid over-optimization, while tokens with smaller divergence receive larger weights to strengthen alignment.
  - Design Motivation: Using a uniform weight across all positions leads to over-optimization of some tokens and under-optimization of others.
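The three designs above can be illustrated with a minimal, dependency-free sketch for one token position. Everything here is an assumption made for illustration, not the paper's implementation: the function names, the use of total variation distance as the divergence measure, the linear rule mapping divergence to \(\alpha_t\), the `alpha_min`/`alpha_max` bounds, and the way the reverse DPO logits enter the contrastive variant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    """Total variation distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def adaptive_alpha(ref_logits, dpo_logits, alpha_min=1.0, alpha_max=3.0):
    """HYPOTHETICAL rule: the weight shrinks linearly as divergence grows.

    Zero divergence gives alpha_max; maximal divergence gives alpha_min.
    """
    tv = total_variation(softmax(ref_logits), softmax(dpo_logits))
    return alpha_max - (alpha_max - alpha_min) * tv


def teacher_logits(ref_logits, pos_logits, neg_logits, alpha):
    """Extrapolate from the reference logits: ref + alpha * (pos - neg).

    With neg == ref this equals (1 - alpha) * ref + alpha * pos, i.e. the
    linear combination of reference and DPO logits. Passing reverse-DPO
    logits as neg is an ASSUMED form of the contrastive variant.
    """
    return [r + alpha * (p - n)
            for r, p, n in zip(ref_logits, pos_logits, neg_logits)]


# Toy 3-token vocabulary at a single position (made-up numbers).
ref = [1.0, 0.5, 0.0]
dpo = [1.5, 0.4, -0.2]
rev = [0.6, 0.6, 0.1]   # reverse DPO model (trained on swapped pairs)

alpha_t = adaptive_alpha(ref, dpo)
teacher = softmax(teacher_logits(ref, dpo, ref, alpha_t))
contrastive_teacher = softmax(teacher_logits(ref, dpo, rev, alpha_t))
```

In this toy setting the DPO model raises the first token's logit relative to the reference, so the extrapolated teacher puts more mass on that token than the reference does. A position where the DPO and reference distributions disagree more would get a smaller `alpha_t` under the assumed rule.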
Loss & Training
Distillation loss: \(\mathcal{L} = \text{KL}(\pi^*(t) \| \pi_\theta(t))\), summed over all token positions. Training can switch between on-policy mode (responses sampled from the student itself) and off-policy mode (responses taken from existing data).
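A minimal sketch of this loss in plain Python, assuming the teacher and student logits for each response position are already available as lists of floats. The function names are hypothetical, and a real training loop would use a tensor library with automatic differentiation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq, student_seq):
    """Sum of per-position KL terms over all response token positions."""
    return sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))


# Toy response with two positions and a 3-token vocabulary (made-up numbers).
teacher_seq = [[2.0, 0.3, -0.6], [0.1, 1.2, 0.0]]
student_seq = [[1.0, 0.5, 0.0], [0.0, 0.8, 0.2]]
loss = distillation_loss(teacher_seq, student_seq)
```

The loss itself does not depend on where the response tokens come from. In this sketch, on-policy versus off-policy only changes which sequence is fed through the teacher and the student to produce the logits.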
Key Experimental Results
Main Results
Base model: Llama-3-8B-Instruct. Preference data: UltraFeedback.
| Method | AlpacaEval 2.0 LC WR ↑ | MT-Bench ↑ | Arena-Hard ↑ |
|---|---|---|---|
| DPO | 25.4 | 7.82 | 28.1 |
| SimPO | 30.2 | 7.88 | 33.5 |
| TDPO | 28.9 | 7.85 | 31.2 |
| AlignDistil | 34.7 | 8.05 | 37.8 |
Ablation Study
| Configuration | AlpacaEval 2.0 LC WR | Description |
|---|---|---|
| AlignDistil (Full) | 34.7 | Contrastive DPO + adaptive extrapolation |
| Without contrast (forward DPO only) | 31.5 | Affected by the inaccuracy of DPO rewards |
| Without adaptation (fixed \(\alpha\)) | 32.3 | Some tokens are over- or under-optimized |
| Response-level reward distillation | 29.1 | Validates the advantage of token-level alignment |
Key Findings
- Token-level distribution reward > Token-level scalar reward > Response-level reward: Distribution-based rewards provide the richest gradient signals.
- Significantly faster convergence: AlignDistil reaches the final performance of DPO in approximately half the training steps.
- The contrastive DPO reward effectively compensates for the shortcomings of DPO acting as a reward model.
Highlights & Insights
- Theoretical equivalence of RLHF and distillation: Elegantly converts complex RL optimization into a standard knowledge distillation problem, greatly simplifying the implementation.
- Clever use of the reverse DPO model: By swapping chosen/rejected to train a model that does the "opposite," the contrast between the two models enhances the accuracy of the reward signal.
Limitations & Future Work
- Requires training two DPO models (forward and reverse), which increases the upfront computing cost.
- The theoretical equivalence relies on the assumption of token-level decomposition of DPO rewards, which may be imprecise in practice.
- The method has not been compared with recent approaches such as GRPO/RLVR.
Related Work & Insights
- vs DPO: DPO directly optimizes via response-level preferences, while AlignDistil achieves token-level optimization through distillation, offering finer granularity.
- vs TDPO: TDPO also performs token-level DPO, but lacks the distillation perspective and the adaptive mechanism.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The theoretical finding of the RLHF-distillation equivalence is highly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Benchmark comparisons and ablation experiments are comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ The theoretical derivations are rigorous, and equations are clear.
- Value: ⭐⭐⭐⭐ Provides a fresh theoretical perspective and a practical methodology for LLM alignment.