tags:
  - ICML 2025
  - Recommender Systems
  - DPO
  - on-policy
  - off-policy
  - RLHF
date: 2026-05-08
content_hash: 54f64f6cc05ae373
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Conference: ICML 2025
arXiv: 2505.02363
Authors: Tianjian Li, Daniel Khashabi
Area: Recommendation Systems / Preference Learning / Language Model Alignment
Keywords: DPO, Preference Optimization, on-policy, off-policy, data mixing, RLHF, Language Model Alignment
TL;DR
SIMPLEMIX finds that on-policy data is strongest on reasoning tasks while off-policy data is strongest on open-ended tasks. Simply mixing the two data sources improves over single-source DPO (on-policy only or off-policy only) by an average of 6.03% on Alpaca Eval 2.0, and outperforms more complex mixing methods such as HyPO and DPO-Mix-P by 3.05%.
Background & Motivation
Core Problem
Language model alignment relies on pairwise preference datasets for preference optimization (e.g., DPO). The data sources are categorized into two types:
- Off-policy data: Responses are generated by other models (not the model currently being trained), such as the UltraFeedback dataset.
- On-policy data: Responses are generated by sampling from the current target model itself.
Limitations of Prior Work
The academic community holds conflicting views regarding the advantages of these two types of data:
| Perspective | Representative Work | Conclusion |
|---|---|---|
| On-policy is better | Multiple studies | On-policy data consistently outperforms off-policy data |
| Task-dependent | Other studies | The advantage of on-policy may depend on the specific task type |
This contradiction indicates a need to systematically study the interaction between the two types of data.
Motivation
- Existing methods such as HyPO and DPO-Mix-P attempt to combine both types of data, but their designs are complex.
- Is there a simpler mixing strategy that can achieve better results?
- Can the complementarity of the two data sources on different task types be quantitatively verified?
Method
Overall Architecture
The core idea of SIMPLEMIX is extremely simple: directly mix on-policy and off-policy preference data in a certain ratio, and then train with standard DPO.
Data mixing pipeline:
1. Collect off-policy preference data (e.g., UltraFeedback)
2. Sample the target model to generate on-policy preference pairs
3. Merge the two types of data according to a mixing ratio α
4. Perform standard DPO training on the mixed dataset
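The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `sample_fn` (draws responses from the target model), `score_fn` (a reward model or judge), the best-versus-worst pairing rule, and the reading of α as the on-policy fraction are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List

# A preference pair; the "source" key is only bookkeeping for this sketch.
Pair = Dict[str, str]


def build_on_policy_pairs(
    prompts: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical: samples the target model
    score_fn: Callable[[str, str], float],       # hypothetical: reward model or judge
    n_samples: int = 4,
) -> List[Pair]:
    """Step 2: sample the target model and pair the best and worst responses."""
    pairs: List[Pair] = []
    for prompt in prompts:
        responses = sample_fn(prompt, n_samples)
        ranked = sorted(responses, key=lambda r: score_fn(prompt, r))
        # Skip prompts where no contrast between responses is available.
        if len(ranked) >= 2 and ranked[0] != ranked[-1]:
            pairs.append({
                "prompt": prompt,
                "chosen": ranked[-1],
                "rejected": ranked[0],
                "source": "on",
            })
    return pairs


def mix_preference_data(
    on_policy: List[Pair],
    off_policy: List[Pair],
    alpha: float,
    total: int,
    seed: int = 0,
) -> List[Pair]:
    """Step 3: draw round(alpha * total) on-policy pairs, the rest off-policy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_on = round(alpha * total)
    n_off = total - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough pairs in one of the sources")
    rng = random.Random(seed)
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)  # interleave the two sources before training
    return mixed
```

The list returned by `mix_preference_data` can then be handed to any standard DPO trainer unchanged (step 4), since the mixing happens entirely at the dataset level.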
Key Findings: Complementarity of Task Types
The motivation for SIMPLEMIX is an empirical finding:
| Task Type | On-policy Performance | Off-policy Performance | Best Strategy |
|---|---|---|---|
| Reasoning tasks (math, coding) | Strong | Weaker | Use on-policy |
| Open-ended tasks (creative writing, personalized recommendation) | Weaker | Strong | Use off-policy |
| Comprehensive evaluation | Imbalanced | Imbalanced | Mixed (SIMPLEMIX) |
This finding explains why previous studies reached contradictory conclusions—different studies focused on different types of evaluation tasks.
Key Designs
Data Mixing Strategy
The core operation of SIMPLEMIX is a simple dataset-level mixture of two preference datasets, controlled by a single ratio:
- \(\mathcal{D}_{\text{on}}\): on-policy preference data (chosen/rejected pairs generated by the current model)
- \(\mathcal{D}_{\text{off}}\): off-policy preference data (pre-existing preference datasets)
- \(\alpha\): mixing ratio hyperparameter
No additional weighting mechanisms, curriculum learning strategies, or modifications to the DPO loss are required.
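One way to write such a mixture explicitly is given below. It rests on two assumptions that are not confirmed by this note: \(\alpha\) is taken to be the fraction of on-policy pairs, and \(N\) (a symbol introduced here) is the total number of training pairs.

\[
\mathcal{D}_{\text{mix}} = \mathcal{S}_{\text{on}} \cup \mathcal{S}_{\text{off}}, \qquad
\mathcal{S}_{\text{on}} \subseteq \mathcal{D}_{\text{on}},\ |\mathcal{S}_{\text{on}}| = \alpha N, \qquad
\mathcal{S}_{\text{off}} \subseteq \mathcal{D}_{\text{off}},\ |\mathcal{S}_{\text{off}}| = (1 - \alpha) N
\]

As an illustrative calculation under this convention, \(\alpha = 0.5\) and \(N = 60{,}000\) would give 30,000 on-policy pairs and 30,000 off-policy pairs.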
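Loss & Training
Standard DPO loss is used without any modification:
\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
where \(x\) is the prompt, \(y_w\) and \(y_l\) denote the preferred (chosen) and dispreferred (rejected) responses in the preference pairs, \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is the frozen reference model, \(\sigma\) is the logistic sigmoid, \(\mathcal{D}\) is the mixed preference dataset, and \(\beta\) is the temperature parameter.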
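For concreteness, the per-pair DPO loss, \(-\log \sigma(\beta \cdot \text{margin})\), can be computed from four sequence log-probabilities as in the sketch below. The function and argument names are illustrative rather than taken from any library, and the default `beta=0.1` is just a common choice, not a value reported in this note.

```python
import math
from typing import List, Tuple


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dpo_batch_loss(
    batch: List[Tuple[float, float, float, float]], beta: float = 0.1
) -> float:
    """Mean DPO loss over a batch; on- and off-policy pairs are treated alike."""
    return sum(dpo_pair_loss(*row, beta=beta) for row in batch) / len(batch)
```

When the policy equals the reference model the margin is zero and the loss is \(\log 2\); it falls as the policy raises the chosen response's likelihood relative to the rejected one. Nothing in the computation depends on which source a pair came from, which is why the mixed dataset needs no change to the loss.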
Key Experimental Results
Main Results: Performance Comparison on Alpaca Eval 2.0
| Method | Data Type | Alpaca Eval 2.0 LC Win Rate | Gain |
|---|---|---|---|
| Off-policy DPO | Off-policy only | baseline_off | — |
| On-policy DPO | On-policy only | baseline_on | — |
| HyPO | Mixed (complex method) | Good | Baseline |
| DPO-Mix-P | Mixed (complex method) | Good | Baseline |
| SIMPLEMIX | Simple mixture | Best | +6.03% on average vs. single-source DPO; +3.05% vs. HyPO and DPO-Mix-P |
Ablation Study: Data Performance across Different Task Types
| Task Category | Evaluation Metric | On-policy DPO | Off-policy DPO | SIMPLEMIX |
|---|---|---|---|---|
| Mathematical Reasoning | Accuracy | High | Lower | Highest |
| Code Generation | Pass@1 | High | Lower | Highest |
| Creative Writing | Human Preference | Lower | High | Highest |
| Personalized Recommendation | Preference Rate | Lower | High | Highest |
Key Findings
- Complementarity quantitatively verified: On-policy data shows a clear advantage in reasoning tasks, while off-policy is stronger in open-ended tasks.
- Simple mixing is optimal: No need for complex weighting, sampling strategies, or loss function modifications.
- Consistently outperforms complex methods: SIMPLEMIX surpasses HyPO and DPO-Mix-P (by 3.05% on Alpaca Eval 2.0) with lower methodological complexity.
- Robustness to mixing ratio: Adjusting \(\alpha\) within a reasonable range has minimal impact on the final performance.
- Generalizability: Consistently effective across multiple benchmarks (such as Alpaca Eval 2.0).
Highlights & Insights
- "Simplicity is the ultimate sophistication" philosophy: In the field of preference learning, data diversity is more important than algorithmic complexity. The success of simple mixing challenges the necessity of complex data integration methods.
- Unified explanation for contradictory literature: Earlier studies disagreed on whether on-policy or off-policy data is better; this work traces the disagreement to differences in the task types each study used for evaluation.
- Practical guidance value: For engineering practitioners aligning LLMs, simply mixing existing off-policy data with self-sampled on-policy data yields substantial improvements.
- Relevance to recommendation systems: The advantage of off-policy data in "personalized recommendation" tasks hints at the unique value of historical data in recommendation scenarios.
- Reduced cost of on-policy generation: It is not necessary to fully rely on expensive on-policy data; mixing can achieve better performance while saving computational costs.
Limitations & Future Work
- Selection of mixing ratio \(\alpha\): Although results are relatively robust to \(\alpha\), the optimal ratio may vary with model size and task distribution, and no adaptive method for choosing \(\alpha\) is provided.
- Dependency on off-policy data quality: When there is a significant distribution gap between off-policy data and the target model, mixing performance might decline.
- Limited evaluation benchmarks: Primarily validated on Alpaca Eval 2.0; generalization to other alignment benchmarks (such as MT-Bench, Arena-Hard) requires further confirmation.
- Data scale ratio not discussed: The impact of the absolute scale difference between the two data types on the mixing effect has not been analyzed thoroughly.
- Insufficient theoretical explanation: Lacks theoretical analysis on "why simple mixing works", leaving the explanation of complementarity at an empirical level.
Related Work & Insights
- DPO (Rafailov et al., 2023): The foundational preference optimization algorithm for this work. SIMPLEMIX does not modify the DPO loss.
- HyPO: Integrates on- and off-policy data through a more complex mixing strategy; SIMPLEMIX outperforms it with a much simpler procedure.
- DPO-Mix-P: Another mixing method, also outperformed by SIMPLEMIX.
- InCo-DPO (2025): Balances distribution shift and data quality, focusing on issues similar to those addressed by SIMPLEMIX.
- Insights: In recommendation systems, combining online exploration data (on-policy) and user historical behavior data (off-policy) might leverage a similarly simple mixing strategy.
Rating
| Dimension | Score (1-5) | Description |
|---|---|---|
| Novelty | 3.5 | The method is extremely simple (data mixing only), but identifying the task complementarity of on/off-policy data is a valuable new insight. |
| Value | 5 | Zero additional implementation cost; any team using DPO can adopt it immediately. |
| Experimental Thoroughness | 4 | Quantitatively validates complementarity across multiple tasks, though the coverage of evaluation benchmarks could be broader. |
| Writing Quality | 4 | Clearly written; the naming "frustratingly simple" accurately conveys the core message. |
| Overall | 4.0 | Achieves significant improvements using a minimalist approach, providing important practical guidance for preference learning. |