Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs¶
Conference: ICLR 2026 · arXiv: 2506.10054 · Code: https://github.com/pspdada/Uni-DPO · Area: Alignment (RLHF / DPO) · Keywords: DPO improvement, dynamic weighting, quality-aware, focal loss, preference optimization
TL;DR¶
Uni-DPO unifies dynamic reweighting of preference pairs through three components: quality-aware weighting (prioritizing pairs with large score margins), performance-aware weighting (a focal loss that concentrates on underfitted samples), and a calibrated NLL loss. It consistently outperforms DPO and SimPO on text-understanding and mathematical-reasoning benchmarks; with Gemma-2-9B it reaches 67.1% on Arena-Hard, surpassing Claude 3 Opus (60.4%).
Background & Motivation¶
Background: DPO optimizes policies directly from preference data via implicit rewards, and has become a standard approach for LLM alignment. SimPO further simplifies this by removing the reference model.
Limitations of Prior Work:
- Standard DPO treats all preference pairs equally, despite large variance in data quality: high-quality pairs exhibit clear chosen/rejected distinctions, while low-quality pairs are noisy or ambiguous.
- A mismatch exists between data quality and the model's current learning state: high-quality pairs may already be well learned, and over-emphasizing them leads to overfitting.
- DPO lacks fine-grained external reward signals (unlike PPO/GRPO).
Key Challenge: How to dynamically reweight preference pairs by jointly considering both the intrinsic quality of data and the model's current learning state?
Core Idea: Quality weights differentiate high- and low-quality data; performance weights focus on hard samples; calibrated NLL loss prevents the probability of chosen responses from degrading.
Method¶
Overall Architecture¶
Key Designs¶
- Quality-Aware Weight \(w_{\text{qual}}\):
  - Function: Assigns weights based on external score differences, giving higher weight to pairs with larger score margins.
  - \(w_{\text{qual}}(y_w, y_l) = \sigma(\eta \cdot (S_w - S_l))\), where \(S_w, S_l\) are sourced from human annotations, GPT-4, or reward models.
  - Effect: Filters noisy/ambiguous preference pairs and retains high signal-to-noise data.
- Performance-Aware Weight \(w_{\text{perf}}\) (Calibrated Focal):
  - Function: Down-weights already well-learned samples and focuses training on hard samples where the current model underperforms.
  - \(w_{\text{perf}} = \left[1 - \sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x) - \tau_{\text{ref}}\right)\right]^\gamma\)
  - Key improvement: A fixed threshold \(\tau_{\text{ref}}\) replaces the dependence on a reference model, avoiding the per-sample constraints that cause training instability; length normalization (LN) is incorporated to prevent length bias.
  - \(\gamma\) controls the focal intensity; \(\tau_{\text{ref}}\) controls the expected margin.
- Calibrated NLL Loss \(\mathcal{L}_{\text{c-NLL}}\):
  - Function: Prevents the absolute probability of chosen responses from decreasing during DPO training.
  - Activated only when the policy underperforms the reference model and the sample is of high quality.
  - Effect: Reinforces the model's confidence on difficult, high-quality positive samples (a combined sketch of all three components follows this list).
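To make the three components concrete, below is a minimal PyTorch-style sketch. It assumes a SimPO-style length-normalized, reference-free margin (matching the \(w_{\text{perf}}\) formula above), combines the two weights multiplicatively with a logistic preference loss, and gates the calibrated NLL term by a quality-gap threshold \(\tau_{\text{good}}\) (the exact gate is not specified in this summary); the function name, argument layout, and these combination choices are illustrative and may differ from the authors' released code.

```python
import torch
import torch.nn.functional as F


def uni_dpo_loss(
    policy_chosen_logps,    # sum of token log-probs of y_w under the policy, shape (B,)
    policy_rejected_logps,  # sum of token log-probs of y_l under the policy, shape (B,)
    ref_chosen_logps,       # sum of token log-probs of y_w under the frozen reference model
    chosen_lengths,         # |y_w| in tokens, shape (B,)
    rejected_lengths,       # |y_l| in tokens, shape (B,)
    chosen_scores,          # external quality scores S_w (human / GPT-4 / reward model)
    rejected_scores,        # external quality scores S_l
    beta=2.0,               # length-normalized reward scale (assumed; not given in this summary)
    eta=0.7,                # quality-weight temperature
    gamma=3.0,              # focal intensity
    tau_ref=1.0,            # fixed expected-margin threshold (reported range [0.5, 2.0])
    lam=0.001,              # coefficient on the calibrated NLL term
    tau_good=1.0,           # score gap required to call a pair "high quality" (assumed gate)
):
    # Length-normalized, reference-free implicit rewards (SimPO-style),
    # matching the terms inside the w_perf formula.
    r_w = beta * policy_chosen_logps / chosen_lengths
    r_l = beta * policy_rejected_logps / rejected_lengths
    margin = r_w - r_l

    # (1) Quality-aware weight: w_qual = sigmoid(eta * (S_w - S_l)).
    w_qual = torch.sigmoid(eta * (chosen_scores - rejected_scores))

    # (2) Performance-aware (calibrated focal) weight:
    #     w_perf = [1 - sigmoid(margin - tau_ref)]^gamma, so pairs the model
    #     already separates by more than tau_ref contribute little gradient.
    w_perf = (1.0 - torch.sigmoid(margin - tau_ref)) ** gamma

    # Preference term: both weights modulate a logistic loss on the margin
    # (simplest multiplicative combination; the paper's exact form may differ).
    pref_loss = -w_qual * w_perf * F.logsigmoid(margin)

    # (3) Calibrated NLL: active only when the policy underperforms the reference
    #     model on the chosen response AND the pair is high quality; it pushes
    #     the (length-normalized) chosen log-prob back up.
    underperforms = (policy_chosen_logps < ref_chosen_logps).float()
    high_quality = ((chosen_scores - rejected_scores) > tau_good).float()
    c_nll = -(policy_chosen_logps / chosen_lengths) * underperforms * high_quality

    loss = (pref_loss + lam * c_nll).mean()
    return loss, w_qual.detach(), w_perf.detach()
```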
Loss & Training¶
- Hyperparameters: \(\eta = 0.7\) (quality-weight temperature), \(\gamma = 3.0\) (focal intensity), \(\tau_{\text{ref}} \in [0.5, 2.0]\) (expected-margin threshold), \(\lambda = 0.001\) (weight on \(\mathcal{L}_{\text{c-NLL}}\))
- Supports diverse quality score sources (human annotation, GPT-4, ArmoRM, and other reward models).
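A hypothetical invocation of the sketch above with the reported settings (\(\eta = 0.7\), \(\lambda = 0.001\), \(\gamma = 3.0\), \(\tau_{\text{ref}}\) picked from \([0.5, 2.0]\)); the random tensors stand in for per-sequence log-probabilities, token lengths, and reward-model quality scores.

```python
import torch

B = 4  # toy batch of preference pairs
loss, w_qual, w_perf = uni_dpo_loss(
    policy_chosen_logps=torch.randn(B) * 10 - 60,
    policy_rejected_logps=torch.randn(B) * 10 - 70,
    ref_chosen_logps=torch.randn(B) * 10 - 60,
    chosen_lengths=torch.randint(50, 300, (B,)).float(),
    rejected_lengths=torch.randint(50, 300, (B,)).float(),
    chosen_scores=torch.rand(B) * 10,    # e.g. ArmoRM or GPT-4 quality scores
    rejected_scores=torch.rand(B) * 10,
    eta=0.7, gamma=3.0, tau_ref=1.0, lam=0.001,
)
print(f"loss={loss.item():.4f}, w_qual={w_qual}, w_perf={w_perf}")
```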
Key Experimental Results¶
Main Results: Text Understanding¶
| Model | Method | AlpacaEval2 LC | Arena-Hard | IFEval Loose | SedarEval |
|---|---|---|---|---|---|
| Llama3-8B-Base | DPO | 15.5 | 15.9 | 45.5 | 31.80 |
| Llama3-8B-Base | SimPO | 19.4 | 23.4 | 45.7 | 32.43 |
| Llama3-8B-Base | Uni-DPO | 23.8 | 23.9 | 47.9 | 38.49 |
| Gemma-2-9B-IT | SimPO | 53.2 | 59.1 | 67.7 | 57.7 |
| Gemma-2-9B-IT | Uni-DPO | 54.7 | 67.1 | 72.8 | 57.5 |
Main Results: Mathematical Reasoning (Qwen2.5-Math-7B)¶
| Method | GSM8K | MATH | AIME24 | AMC23 | Avg |
|---|---|---|---|---|---|
| Baseline | 64.3 | 65.8 | 23.3 | 47.5 | 39.11 |
| DPO | 83.2 | 75.8 | 26.7 | 57.5 | 51.55 |
| SimPO | 85.7 | 76.4 | 26.7 | 57.5 | 53.73 |
| Uni-DPO | 88.9 | 78.2 | 26.7 | 67.5 | 56.80 |
Ablation Study¶
| Configuration | AlpacaEval2 WR | Arena-Hard | SedarEval |
|---|---|---|---|
| Full Uni-DPO | 20.5 | 23.9 | 38.49 |
| w/o \(w_{\text{qual}}\) | 15.9 | 22.8 | 37.43 |
| w/o \(w_{\text{perf}}\) | 18.5 | 21.4 | 40.46 |
| w/o LN | 3.8 | 2.7 | 28.18 |
| w/o \(\mathcal{L}_{\text{c-NLL}}\) | 19.4 | 23.3 | 37.73 |
Key Findings¶
- Length normalization (LN) is critical: Removing it causes a sharp performance collapse (SedarEval −10.31) and training instability.
- Quality weighting most affects AlpacaEval: Removing it drops WR from 20.5 to 15.9 (−4.6).
- Gemma-2-9B + Uni-DPO surpasses Claude 3 Opus: Arena-Hard 67.1 vs. 60.4.
- Significant gains on mathematical reasoning: on Qwen2.5-Math-7B, Uni-DPO improves the benchmark average by +3.07 points over SimPO (56.80 vs. 53.73).
Highlights & Insights¶
- Unified dual-perspective dynamic weighting: The joint consideration of data quality (external signal) and learning difficulty (internal dynamics) is more effective than either perspective alone.
- Improved calibrated focal loss design: Replacing reference model dependency with a fixed threshold and incorporating length normalization resolves the training instability of naive focal DPO.
- Transferability to mathematical reasoning: The framework generalizes beyond dialogue and instruction following, yielding consistent gains on mathematical tasks.
Limitations & Future Work¶
- Dependency on external scores: Quality weights require reward model or GPT-4 scoring, increasing data preparation cost.
- Many hyperparameters: \(\eta, \gamma, \tau_{\text{ref}}, \lambda, \tau_{\text{good}}\) all require tuning.
- Potential directions: Self-reward could replace external scoring; null-space constraints from NSPO could be integrated to add a safety dimension.
Related Work & Insights¶
- vs. DPO: DPO applies uniform weighting → Uni-DPO introduces dual-dimensional dynamic reweighting, yielding consistent improvements.
- vs. SimPO: SimPO simplifies by removing the reference model → Uni-DPO augments SimPO with quality and performance weights, achieving additive gains.
- vs. Standard focal loss: Naive focal DPO is unstable; Uni-DPO's calibrated variant (fixed threshold + LN) resolves this issue.
Rating¶
- Novelty: ⭐⭐⭐⭐ Dual-perspective dynamic weighting is natural but not a breakthrough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four models × multiple benchmarks × mathematical reasoning, with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Method motivation is clearly articulated.
- Value: ⭐⭐⭐⭐ A practical improvement to DPO that is easy to integrate into existing pipelines.