ICML 2025 Recommender Systems DPO Preference Optimization on-policy off-policy data mixing RLHF Language Model Alignment

tags: - ICML 2025 - Recommender Systems - DPO - on-policy - off-policy - RLHF date: 2026-05-08 content_hash: 54f64f6cc05ae373

SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning¶

Conference: ICML 2025

arXiv: 2505.02363

Authors: Tianjian Li, Daniel Khashabi

Area: Recommendation Systems / Preference Learning / Language Model Alignment

Keywords: DPO, Preference Optimization, on-policy, off-policy, data mixing, RLHF, Language Model Alignment

TL;DR¶

SIMPLEMIX finds that on-policy data excels at reasoning tasks while off-policy data excels at open-ended tasks. Simply mixing the two data sources improves over both on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0, and outperforms more complex methods such as HyPO and DPO-Mix-P by an average of 3.05%.

Background & Motivation¶

Core Problem¶

Language model alignment relies on pairwise preference datasets for preference optimization (e.g., DPO). The data sources are categorized into two types:

Off-policy data: Responses are generated by other models (not the model currently being trained), such as the UltraFeedback dataset.
On-policy data: Responses are generated by sampling from the current target model itself.

Limitations of Prior Work¶

The academic community holds conflicting views regarding the advantages of these two types of data:

Perspective	Representative Work	Conclusion
On-policy is better	Multiple studies	On-policy data consistently outperforms off-policy data
Task-dependent	Other studies	The advantage of on-policy may depend on the specific task type

This contradiction indicates a need to systematically study the interaction between the two types of data.

Motivation¶

Existing methods such as HyPO and DPO-Mix-P attempt to combine both types of data, but their designs are complex.
Is there a simpler mixing strategy that can achieve better results?
Can the complementarity of the two data sources on different task types be quantitatively verified?

Method¶

Overall Architecture¶

The core idea of SIMPLEMIX is extremely simple: directly mix on-policy and off-policy preference data in a certain ratio, and then train with standard DPO.

Data mixing pipeline:
1. Collect off-policy preference data (e.g., UltraFeedback)
2. Sample the target model to generate on-policy preference pairs
3. Merge the two types of data according to a mixing ratio α
4. Perform standard DPO training on the mixed dataset
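
Step 2 leaves open how the on-policy pairs are built. One common recipe, assumed here rather than taken from the paper, is to sample several responses per prompt from the current policy, score them with a reward model or judge, and keep the best and worst as the chosen and rejected responses. In the sketch below, `generate` and `score` are hypothetical callables standing in for the policy's sampler and the scorer; `n_samples` is likewise an illustrative parameter.

```python
from typing import Callable, Dict, List


def build_on_policy_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: n samples from the current policy
    score: Callable[[str, str], float],         # hypothetical: reward model or judge score
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Build (prompt, chosen, rejected) records from the policy's own samples."""
    pairs = []
    for prompt in prompts:
        responses = generate(prompt, n_samples)
        # Rank the sampled responses from lowest to highest score.
        ranked = sorted(responses, key=lambda r: score(prompt, r))
        worst, best = ranked[0], ranked[-1]
        # If every sample ties, there is no preference signal for this prompt.
        if score(prompt, best) == score(prompt, worst):
            continue
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

The resulting records have the same shape as a typical off-policy preference dataset, which is what allows the two sources to be merged directly in step 3.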

Key Findings: Complementarity of Task Types¶

The motivation for SIMPLEMIX is an empirical finding about how the two data sources behave on different task types:

Task Type	On-policy Performance	Off-policy Performance	Best Strategy
Reasoning tasks (math, coding)	Strong	Weaker	Use on-policy
Open-ended tasks (creative writing, personalized recommendation)	Weaker	Strong	Use off-policy
Comprehensive evaluation	Imbalanced	Imbalanced	Mixed (SIMPLEMIX)

This finding explains why previous studies reached contradictory conclusions—different studies focused on different types of evaluation tasks.

Key Designs¶

Data Mixing Strategy¶

The core operation of SIMPLEMIX is a simple dataset-level mixture:

\[\mathcal{D}_{\text{mix}} = \alpha \cdot \mathcal{D}_{\text{on}} \cup (1 - \alpha) \cdot \mathcal{D}_{\text{off}}\]

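Where:
- \(\mathcal{D}_{\text{on}}\): on-policy preference data (chosen/rejected pairs generated by the current model)
- \(\mathcal{D}_{\text{off}}\): off-policy preference data (pre-existing preference datasets)
- \(\alpha \in [0, 1]\): mixing ratio hyperparameter. The products in the formula are shorthand for the share of the mixture drawn from each source (a fraction \(\alpha\) from on-policy data and \(1 - \alpha\) from off-policy data), not scalar multiplication of a set.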

No additional weighting mechanisms, curriculum learning strategies, or modifications to the DPO loss are required.
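
A minimal sketch of the dataset-level mixture, under one assumption not stated in this summary: the mixture has a fixed total size, with `round(alpha * total)` examples sampled without replacement from the on-policy pool and the remainder from the off-policy pool. The function name `simple_mix` and the `total` and `seed` parameters are illustrative, not the authors' code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def simple_mix(on_policy: Sequence[T], off_policy: Sequence[T],
               alpha: float, total: int, seed: int = 0) -> List[T]:
    """Return a shuffled mixture containing round(alpha * total) on-policy examples."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_on = round(alpha * total)
    n_off = total - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough examples for the requested mixture")
    rng = random.Random(seed)
    # Sample without replacement from each source, then interleave by shuffling.
    mixed = rng.sample(list(on_policy), n_on) + rng.sample(list(off_policy), n_off)
    rng.shuffle(mixed)
    return mixed
```

For example, `simple_mix(on_pairs, off_pairs, alpha=0.5, total=10_000)` would return 5,000 pairs from each source in random order. Setting `alpha` to 0 or 1 recovers pure off-policy or pure on-policy training data, so the two single-source baselines are special cases of the same function.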

Loss & Training¶

Standard DPO loss is used without any modification:

\[\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{mix}}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]\]

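where \(y_w\) and \(y_l\) denote the preferred (chosen) and dispreferred (rejected) responses in a preference pair, \(\pi_{\text{ref}}\) is the frozen reference policy, \(\sigma\) is the logistic sigmoid, and \(\beta\) is a scaling coefficient that controls how strongly the policy is kept close to the reference.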
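
To make the loss concrete, here is a small pure-Python sketch that evaluates it for one preference pair, given sequence-level log-probabilities under the policy and the reference model. The function names and the default `beta=0.1` are assumptions for illustration; a real training loop would compute the same quantity on tensors in a deep learning framework.

```python
import math
from typing import Sequence, Tuple


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence-level log-probabilities."""
    # Margin = beta * (log-ratio of the chosen response - log-ratio of the rejected one).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_dpo_loss(batch: Sequence[Tuple[float, float, float, float]],
                  beta: float = 0.1) -> float:
    """Average the per-example loss over a batch drawn from the mixed dataset."""
    return sum(dpo_loss(*example, beta=beta) for example in batch) / len(batch)
```

When the policy equals the reference, the margin is zero and the loss is \(\log 2\); it falls as the policy raises the chosen response's log-ratio relative to the rejected one. Because the expectation is taken over the mixture, if a fraction \(\alpha\) of the pairs is on-policy the loss splits as \(\alpha\) times the on-policy DPO loss plus \(1 - \alpha\) times the off-policy DPO loss, which is why no change to the objective is needed.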

Key Experimental Results¶

Main Results: Performance Comparison on Alpaca Eval 2.0¶

Method	Data Type	Alpaca Eval 2.0 LC Win Rate	Gain
Off-policy DPO	Off-policy only	baseline_off	—
On-policy DPO	On-policy only	baseline_on	—
HyPO	Mixed (complex method)	Good	Baseline
DPO-Mix-P	Mixed (complex method)	Good	Baseline
SIMPLEMIX	Simple mixture	Optimal	+6.03% vs single-strategy DPO, +3.05% vs HyPO/DPO-Mix-P

Ablation Study: Data Performance across Different Task Types¶

Task Category	Evaluation Metric	On-policy DPO	Off-policy DPO	SIMPLEMIX
Mathematical Reasoning	Accuracy	High	Lower	Highest
Code Generation	Pass@1	High	Lower	Highest
Creative Writing	Human Preference	Lower	High	Highest
Personalized Recommendation	Preference Rate	Lower	High	Highest

Key Findings¶

Complementarity quantitatively verified: On-policy data shows a clear advantage in reasoning tasks, while off-policy is stronger in open-ended tasks.
Simple mixing is sufficient: No complex weighting, sampling strategies, or loss function modifications are needed.
Consistently outperforming complex methods: SIMPLEMIX surpasses HyPO and DPO-Mix-P by an average of 3.05% with lower methodological complexity.
Robustness to mixing ratio: Adjusting \(\alpha\) within a reasonable range has minimal impact on the final performance.
Generalizability: Consistently effective across multiple benchmarks (such as Alpaca Eval 2.0).

Highlights & Insights¶

"Simplicity is the ultimate sophistication" philosophy: In the field of preference learning, data diversity is more important than algorithmic complexity. The success of simple mixing challenges the necessity of complex data integration methods.
Unified explanation for contradictory literature: Different studies reached contradictory conclusions regarding on-policy vs off-policy data; this work reveals that the root cause lies in the evaluation task type bias.
Practical guidance value: For engineering practitioners aligning LLMs, simply mixing existing off-policy data with self-sampled on-policy data yields substantial improvements.
Relevance to recommendation systems: The advantage of off-policy data in "personalized recommendation" tasks hints at the unique value of historical data in recommendation scenarios.
Reduced cost of on-policy generation: It is not necessary to fully rely on expensive on-policy data; mixing can achieve better performance while saving computational costs.

Limitations & Future Work¶

Selection of mixing ratio \(\alpha\): Although results are relatively robust to \(\alpha\), the optimal ratio may vary based on model size and task distribution, and there is a lack of an adaptive method to determine \(\alpha\).
Dependency on off-policy data quality: When there is a significant distribution gap between off-policy data and the target model, mixing performance might decline.
Limited evaluation benchmarks: Primarily validated on Alpaca Eval 2.0; generalization to other alignment benchmarks (such as MT-Bench, Arena-Hard) requires further confirmation.
Data scale ratio not discussed: The impact of the absolute scale difference between the two data types on the mixing effect has not been analyzed thoroughly.
Insufficient theoretical explanation: Lacks theoretical analysis on "why simple mixing works", leaving the explanation of complementarity at an empirical level.

Related Work & Insights¶

DPO (Rafailov et al., 2023): The foundational preference optimization algorithm for this work. SIMPLEMIX does not modify the DPO loss.
HyPO: Integrates on- and off-policy data through a more complex mixing strategy; SIMPLEMIX outperforms it with a simpler approach.
DPO-Mix-P: Another method for combining on- and off-policy data, also outperformed by SIMPLEMIX.
InCo-DPO (2025): Balances distribution shift and data quality, focusing on issues similar to those addressed by SIMPLEMIX.
Insights: In recommendation systems, combining online exploration data (on-policy) and user historical behavior data (off-policy) might leverage a similarly simple mixing strategy.

Rating¶

Dimension	Score (1-5)	Description
Novelty	3.5	The method is extremely simple (data mixing only), but identifying the task complementarity of on/off-policy data is a valuable new insight.
Value	5	Zero additional implementation cost; any team using DPO can adopt it immediately.
Experimental Thoroughness	4	Quantitatively validates complementarity across multiple tasks, though the coverage of evaluation benchmarks could be broader.
Writing Quality	4	Clearly written; the naming "frustratingly simple" accurately conveys the core message.
Overall	4.0	Achieves significant improvements using a minimalist approach, providing important practical guidance for preference learning.