
Token-Importance Guided Direct Preference Optimization (TI-DPO)

Conference: ICLR 2026 · arXiv: 2505.19653 · Code: https://github.com/gracefulning/TIDPO

Area: Alignment / RLHF / DPO

Keywords: token-level DPO, gradient attribution, hybrid weighting, triplet loss, fine-grained alignment

TL;DR

TI-DPO precisely quantifies each token's contribution to the preference signal via a hybrid weighting mechanism that combines gradient attribution with a Gaussian positional prior, and adds a triplet loss to guide optimization in a continuous semantic space. The method achieves state-of-the-art performance, averaging 62.3 across 6 benchmarks, while providing interpretable token-level control.

Background & Motivation

Background: DPO optimizes preferences at the sequence level, ignoring the differential importance of individual tokens. Existing token-level methods (TDPO/TIS-DPO) assess importance via probability surrogates, which introduce bias.

Limitations of Prior Work:

  • DPO's coarse-grained, sequence-level optimization is sensitive to data noise and suffers from severe distribution shift
  • Probability surrogates in existing token-level methods yield inconsistent outputs
  • The binary "good/bad" contrastive framework cannot finely adjust generation behavior in a continuous semantic space

Key Challenge: precisely identifying critical tokens while, at the same time, guiding preference adjustment in a continuous semantic space.

Core Idea: Gradient attribution to localize critical tokens + Gaussian prior to correct positional bias + triplet loss for continuous-space guidance

Method

Overall Architecture

\(\mathcal{L}_{\text{TI-DPO}} = \mathcal{L}_{\text{DPO-w}} + \gamma \mathcal{L}_{\text{triplet}}\)

Key Designs

  1. Hybrid Weighting Mechanism:

    • Gradient attribution: \(I_i = \|\nabla_{e_i}\mathcal{L}_{\text{target}}\|_1\), computing each token embedding's gradient contribution to the final prediction
    • Gaussian prior: \(\mathcal{P}(t) = \exp(-\frac{1}{2}(\frac{t-\mu}{\sigma})^2)\), where \(\mu=(T-1)/2\), \(\sigma=T/4\), correcting the model's U-shaped attention bias (over-attention to leading and trailing tokens)
    • Convex combination: \(W = \lambda \cdot \mathcal{I}_{\text{norm}} + (1-\lambda) \cdot \mathcal{P}\)
    • Weights are computed independently for \(y_w\) and \(y_l\)
  2. Weighted Token-Level DPO:

    • \(\Delta r_{\text{token}} = \sum_t w_t^w \log\frac{\pi_\theta(y_w^t|\cdot)}{\pi_{\text{ref}}(y_w^t|\cdot)} - \sum_t w_t^l \log\frac{\pi_\theta(y_l^t|\cdot)}{\pi_{\text{ref}}(y_l^t|\cdot)}\)
    • The contributions of critical tokens are amplified while noisy tokens are suppressed
  3. Triplet Loss:

    • An anchor response \(y\) is generated from the policy model; in the implicit reward space, \(y\) is pulled closer to \(y_w\) and pushed away from \(y_l\)
    • \(\mathcal{L}_{\text{triplet}} = \max(0, d(y, y_w) - d(y, y_l) + \alpha)\)
    • Provides finer-grained guidance than binary contrastive learning in a continuous semantic space
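Putting the three designs together, the objective can be sketched numerically as below. All function names, tensor shapes, and hyperparameter defaults (`lam`, `beta`, `gamma`, `alpha`) are illustrative assumptions rather than the authors' implementation; the attribution scores `grad_attr` are taken as given, since in practice \(I_i = \|\nabla_{e_i}\mathcal{L}_{\text{target}}\|_1\) requires a backward pass through the policy model, and the anchor distances are assumed precomputed in the implicit reward space.

```python
import numpy as np

def gaussian_prior(T):
    """Positional prior P(t) = exp(-0.5 * ((t - mu) / sigma)^2),
    with mu = (T - 1) / 2 and sigma = T / 4 as in the paper."""
    t = np.arange(T, dtype=float)
    mu, sigma = (T - 1) / 2, T / 4
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2)

def hybrid_weights(grad_attr, lam=0.5):
    """Convex combination W = lam * I_norm + (1 - lam) * P.
    grad_attr: (T,) array of L1 norms of per-token embedding gradients."""
    I_norm = grad_attr / (grad_attr.sum() + 1e-8)  # normalized attribution
    P = gaussian_prior(len(grad_attr))
    P = P / P.sum()
    return lam * I_norm + (1 - lam) * P

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))  # numerically stable log(sigmoid(x))

def ti_dpo_loss(logp_w, ref_logp_w, w_w, logp_l, ref_logp_l, w_l,
                d_anchor_w, d_anchor_l, beta=0.1, gamma=0.1, alpha=0.5):
    """Weighted token-level DPO term plus triplet term.
    logp_*: (T,) per-token log-probs under the policy; ref_logp_*: same
    under the reference model; w_*: hybrid weights for y_w / y_l;
    d_anchor_*: distances from the anchor y to y_w / y_l in reward space."""
    # Delta r_token: weighted sums of per-token log-ratios
    delta_r = (w_w * (logp_w - ref_logp_w)).sum() \
            - (w_l * (logp_l - ref_logp_l)).sum()
    dpo_w = -log_sigmoid(beta * delta_r)
    # Triplet: pull the anchor toward y_w, push it away from y_l
    triplet = max(0.0, d_anchor_w - d_anchor_l + alpha)
    return dpo_w + gamma * triplet
```

Note how `hybrid_weights` recovers a (near-)uniform weighting only when both the attribution is flat and `lam` is small, which is consistent with the "Uniform weights" row in the ablation acting as the degenerate baseline.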

Key Experimental Results

Main Results (Average over 3 Models)

| Method | MMLU | GSM8K | HumanEval | TruthfulQA | IFEval | Avg |
|--------|------|-------|-----------|------------|--------|-----|
| DPO    | 65.3 | 69.3  | 61.0      | 56.7       | 70.0   | 57.7 |
| SimPO  | 63.5 | 64.7  | 58.2      | 54.2       | 64.7   | 54.5 |
| GRPO   | 70.7 | 75.7  | 64.3      | 59.9       | 74.0   | 62.1 |
| TI-DPO | 70.0 | 73.0  | 67.0      | 62.0       | 75.7   | 62.3 |

Ablation Study (Llama-3.2-3B)

| Configuration      | General | Math | Code | Reliability |
|--------------------|---------|------|------|-------------|
| Full TI-DPO        | 65.4    | 80.7 | 33.0 | 86.8        |
| w/o triplet loss   | 64.0    | 79.0 | 31.0 | 83.0        |
| Uniform weights    | 64.0    | 78.2 | 29.0 | 80.0        |
| w/o Gaussian prior | 64.5    | 79.7 | 31.5 | 82.5        |

Key Findings

  • TI-DPO matches GRPO: average 62.3 vs. 62.1, with TI-DPO leading on HumanEval (67.0 vs. 64.3) and IFEval (75.7 vs. 74.0)
  • Weight distribution adapts to task: weights concentrate in [0.2, 0.5] for math tasks (few critical symbols) and shift toward [0.6, 0.8] for safety tasks (requiring comprehensive attention)
  • Noise robustness: TI-DPO exhibits the least performance degradation as label noise increases
  • Interpretability: token-level weights can be visualized; e.g., in medical scenarios "medical attention" receives high weight while "painkillers" is down-weighted
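The interpretability point can be illustrated with a toy rendering of per-token weights. The tokens and weight values here are invented for the example (following the medical scenario above, where "medical attention" is up-weighted and "painkillers" is down-weighted); in TI-DPO they would come from the hybrid weighting mechanism.

```python
def render_weights(tokens, weights, width=10):
    """Return one line per token: the token, its weight, and a text bar."""
    lines = []
    for tok, w in zip(tokens, weights):
        bar = "#" * round(w * width)
        lines.append(f"{tok:>12s} {w:4.2f} {bar}")
    return lines

# Invented weights for illustration only
for line in render_weights(
    ["seek", "medical", "attention", "or", "take", "painkillers"],
    [0.10, 0.85, 0.90, 0.05, 0.20, 0.15],
):
    print(line)
```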

Highlights & Insights

  • Complementary design of gradient attribution and positional prior: gradient attribution captures semantic importance but suffers from positional bias, while the Gaussian prior corrects this bias — the two are mutually complementary
  • Triplet loss breaks the binary framework: extends from "good/bad" contrast to continuous-space guidance that aligns with positive samples and distances from negative samples
  • Interpretable token-level control: beyond performance gains, critical tokens can be visualized — offering direct value for safety auditing

Limitations & Future Work

  • Computational overhead: gradient attribution requires additional forward and backward passes
  • Gaussian prior assumption: assumes important tokens are uniformly distributed within the sequence, which may not hold for certain tasks
  • Future direction: could integrate Uni-DPO's quality weighting to enable dual-level dynamic re-weighting at both the data level and token level
  • vs. TDPO/TIS-DPO: probability surrogates introduce bias; TI-DPO achieves greater accuracy through gradient attribution combined with a Gaussian prior
  • vs. Uni-DPO: Uni-DPO re-weights at the data level, while TI-DPO re-weights at the token level — the two are orthogonal and can be combined
  • vs. GRPO: GRPO leverages RL-based exploration, whereas TI-DPO refines supervised signals at the token level — comparable performance achieved through fundamentally different mechanisms

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of hybrid weighting mechanism and triplet loss is elegantly designed
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 benchmarks × 3 models × comprehensive ablation and noise experiments
  • Writing Quality: ⭐⭐⭐⭐ Theoretical motivation is clearly articulated
  • Value: ⭐⭐⭐⭐ A practical improvement to token-level DPO; interpretability serves as a distinctive advantage