
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation

Conference: ACL 2025
arXiv: 2503.02832
Code: None
Area: Model Compression
Keywords: LLM Alignment, DPO, Distillation, Token-Level Reward, Preference Optimization

TL;DR

AlignDistil theoretically proves the equivalence between the RLHF objective and a token-level distillation process. Based on this, it designs a simple distillation method: constructing a teacher distribution through a linear combination of logit distributions from a DPO model and a reverse DPO model, and combining this with a token-adaptive extrapolation mechanism to achieve token-level reward optimization. It outperforms existing methods on AlpacaEval 2.0, MT-Bench, and Arena-Hard while achieving faster convergence.

Background & Motivation

Background: LLM alignment is primarily achieved through RLHF and DPO, but these methods utilize sparse response-level reward/preference annotations to optimize all tokens.

Limitations of Prior Work: Response-level feedback is coarse-grained and does not reflect each token's individual contribution, so optimization may penalize high-quality tokens or reward low-quality ones, leading to suboptimal performance and slow convergence.

Key Challenge: Fine-grained token-level reward signals are needed, but human annotations can only provide response-level preferences.

Goal: To theoretically decompose the response-level RLHF objective into token-level optimization and achieve efficient token-level alignment.

Key Insight: Leveraging the token-level decomposability of DPO reward to prove that the RLHF objective is equivalent to a token-level distillation process.

Core Idea: RLHF = token-level distillation, and the teacher distribution = a linear combination of DPO logits and reference model logits.

Method

Overall Architecture

Starting from the RLHF objective, the method (1) substitutes the token-level decomposition of the DPO reward, (2) derives an equivalent token-level distillation objective, and (3) trains the student policy \(\pi_\theta\) to match a teacher distribution \(\pi^*\), which is built by adaptive logit extrapolation from a DPO model and a reverse DPO model.
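
The derivation chain can be sketched for a single token position, using the reference model as the base of the combination. This is a sketch under stated assumptions rather than the paper's exact statement: \(\beta'\) is an assumed name for the KL coefficient of the RLHF objective, and \(z\) denotes pre-softmax logits; neither symbol is taken from the source.

```latex
\begin{align*}
% Per-token KL-regularized objective with the DPO token reward plugged in
\pi^*(\cdot \mid s_t) &= \arg\max_{\pi}\;
  \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\!\left[\beta \log \frac{\pi_{DPO}(a \mid s_t)}{\pi_{ref}(a \mid s_t)}\right]
  - \beta'\, \mathrm{KL}\!\left(\pi(\cdot \mid s_t) \,\|\, \pi_{ref}(\cdot \mid s_t)\right) \\
% Closed-form maximizer of a KL-regularized reward
\pi^*(a \mid s_t) &\propto \pi_{ref}(a \mid s_t)
  \left(\frac{\pi_{DPO}(a \mid s_t)}{\pi_{ref}(a \mid s_t)}\right)^{\beta/\beta'} \\
% Same statement in logit space, with \alpha = \beta/\beta'
z^*_t &= z^{ref}_t + \alpha\,\bigl(z^{DPO}_t - z^{ref}_t\bigr)
      = (1-\alpha)\, z^{ref}_t + \alpha\, z^{DPO}_t
\end{align*}
```

With \(\alpha = 1\) the teacher coincides with the DPO model, and \(\alpha > 1\) extrapolates past it along the direction in which DPO moved away from the reference.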

Key Designs

  1. RLHF-Distillation Equivalence:

    • Function: To prove that the sequence-level RLHF objective can be decomposed into token-level KL divergence distillation.
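    • Mechanism: The implicit DPO reward decomposes over token positions as \(r_t = \beta \log \frac{\pi_{DPO}(a_t|s_t)}{\pi_{ref}(a_t|s_t)}\). Substituting this into the RLHF objective, the optimal policy at each token position equals the teacher distribution \(\pi^*(t) \propto \exp\big(\text{logit}_{ref}(t) + \alpha \cdot (\text{logit}_{DPO}(t) - \text{logit}_{ref}(t))\big)\), where \(\alpha\) is the ratio of \(\beta\) to the KL coefficient of the RLHF objective. The teacher is therefore a linear combination of the two logit vectors, with weights \(1-\alpha\) and \(\alpha\).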
    • Design Motivation: To convert an intractable RL optimization problem into a simple distillation task.
  2. Contrastive DPO Reward:

    • Function: To improve the accuracy of the implicit DPO reward.
    • Mechanism: Training a standard DPO model and a reverse DPO model (swapping chosen/rejected), and constructing a more robust reward utilizing their contrast: the forward DPO strengthens good tokens, while the reverse DPO weakens bad tokens.
    • Design Motivation: A standalone DPO reward has worse accuracy than a pure reward model, and the contrastive strategy bridges this gap.
  3. Token-Adaptive Logit Extrapolation:

    • Function: To construct a teacher distribution with an appropriate strength for each token position.
    • Mechanism: The extrapolation weight \(\alpha_t\) is adjusted dynamically according to how far the DPO model diverges from the reference model at each token position: tokens with larger divergence receive smaller weights to avoid over-optimization, while tokens with smaller divergence receive larger weights to strengthen alignment.
    • Design Motivation: Using a uniform weight across all positions leads to over-optimization of some tokens and under-optimization of others.
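
The three designs above can be illustrated with a small pure-Python sketch that builds a teacher distribution at one token position. Several pieces are assumptions made for illustration and are not taken from the source: total variation distance as the divergence measure, the linear schedule in `adaptive_alpha`, the bounds `alpha_min` and `alpha_max`, and all function names. The reference logits serve as the base of the combination; in the contrastive variant, the reverse DPO model's logits would presumably take that role.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def total_variation(p, q):
    """Total variation distance between two distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def adaptive_alpha(dpo_logits, ref_logits, alpha_min=1.0, alpha_max=4.0):
    """Hypothetical schedule: the weight shrinks linearly as divergence grows."""
    tv = total_variation(softmax(dpo_logits), softmax(ref_logits))
    return alpha_max - (alpha_max - alpha_min) * tv

def teacher_distribution(dpo_logits, ref_logits, alpha):
    """Teacher = softmax(z_ref + alpha * (z_dpo - z_ref))."""
    mixed = [r + alpha * (d - r) for d, r in zip(dpo_logits, ref_logits)]
    return softmax(mixed)

# Toy vocabulary of three tokens at two different positions.
ref = [1.0, 1.0, 1.0]
near = [1.1, 1.0, 0.9]   # DPO barely disagrees with the reference
far = [4.0, 0.0, -2.0]   # DPO strongly disagrees with the reference

alpha_near = adaptive_alpha(near, ref)
alpha_far = adaptive_alpha(far, ref)
teacher_near = teacher_distribution(near, ref, alpha_near)
teacher_far = teacher_distribution(far, ref, alpha_far)
print(alpha_near, alpha_far)
```

In this toy setup the position where DPO and the reference nearly agree receives the larger weight, which matches the stated intent of strengthening alignment where the signal is weak and limiting extrapolation where it is already strong.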

Loss & Training

Distillation loss: \(\mathcal{L} = \sum_t \text{KL}(\pi^*(t) \| \pi_\theta(t))\), i.e., the KL divergence between teacher and student summed over all token positions. Training can switch between on-policy (self-sampled responses) and off-policy (existing data) modes.
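
A minimal pure-Python sketch of this loss for one response is shown below. The helper names and toy logits are hypothetical; a real implementation would operate on batched tensors, and the choice of on-policy or off-policy mode only changes where the response tokens (and hence the positions being summed) come from.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at a single position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def distillation_loss(teacher_seq, student_seq):
    """Sum of per-token KL terms across all positions of one response."""
    return sum(token_kl(t, s) for t, s in zip(teacher_seq, student_seq))

# Two positions, three-token vocabulary; the student matches the teacher
# at the second position only.
teacher_seq = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student_seq = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
loss = distillation_loss(teacher_seq, student_seq)
print(loss)
```

Because every position contributes its own KL term over the full vocabulary, the student receives a dense signal at each token rather than a single scalar per response.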

Key Experimental Results

Main Results

Base model: Llama-3-8B-Instruct. Preference data: UltraFeedback.

| Method | AlpacaEval 2.0 LC WR ↑ | MT-Bench ↑ | Arena-Hard ↑ |
| --- | --- | --- | --- |
| DPO | 25.4 | 7.82 | 28.1 |
| SimPO | 30.2 | 7.88 | 33.5 |
| TDPO | 28.9 | 7.85 | 31.2 |
| AlignDistil | 34.7 | 8.05 | 37.8 |

Ablation Study

| Configuration | AlpacaEval 2.0 LC WR | Description |
| --- | --- | --- |
| AlignDistil (Full) | 34.7 | Contrastive DPO + adaptive extrapolation |
| Without contrast (forward DPO only) | 31.5 | Affected by the inaccuracy of DPO rewards |
| Without adaptation (fixed \(\alpha\)) | 32.3 | Some tokens over- or under-optimized |
| Response-level reward distillation | 29.1 | Validates the advantage of token-level alignment |

Key Findings

  • Token-level distribution reward > Token-level scalar reward > Response-level reward: Distribution-based rewards provide the richest gradient signals.
  • Significantly faster convergence: AlignDistil reaches the final performance of DPO in approximately half the training steps.
  • The contrastive DPO reward effectively compensates for the shortcomings of DPO acting as a reward model.

Highlights & Insights

  • Theoretical equivalence of RLHF and distillation: Elegantly converts complex RL optimization into a standard knowledge distillation problem, greatly simplifying the implementation.
  • Clever use of the reverse DPO model: By swapping chosen/rejected to train a model that does the "opposite," the contrast between the two models enhances the accuracy of the reward signal.

Limitations & Future Work

  • Requires training two DPO models (forward and reverse), which increases the upfront computing cost.
  • The theoretical equivalence relies on the assumption of token-level decomposition of DPO rewards, which may be imprecise in practice.
  • Has not been compared with recent methods such as GRPO/RLVR.

Comparison with Related Methods

  • vs. DPO: DPO optimizes directly on response-level preferences, while AlignDistil achieves token-level optimization through distillation, offering finer granularity.
  • vs. TDPO: TDPO also performs token-level DPO, but lacks the distillation perspective and the adaptive mechanism.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The theoretical finding of the RLHF-distillation equivalence is highly elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Benchmark comparisons and ablation experiments are comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ The theoretical derivations are rigorous, and equations are clear.
  • Value: ⭐⭐⭐⭐ Provides a fresh theoretical perspective and a practical methodology for LLM alignment.