
tags: ICML 2025, Recommender Systems, DPO, on-policy, off-policy, RLHF
date: 2026-05-08
content_hash: 54f64f6cc05ae373


SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Conference: ICML 2025

arXiv: 2505.02363

Authors: Tianjian Li, Daniel Khashabi

Area: Recommendation Systems / Preference Learning / Language Model Alignment

Keywords: DPO, Preference Optimization, on-policy, off-policy, data mixing, RLHF, Language Model Alignment


TL;DR

SIMPLEMIX finds that on-policy data excels at reasoning tasks while off-policy data excels at open-ended tasks. Simply mixing the two data sources improves Alpaca Eval 2.0 results by an average of 6.03% over DPO trained on either source alone, and by 3.05% over more complex mixing methods such as HyPO and DPO-Mix-P.


Background & Motivation

Core Problem

Language model alignment relies on pairwise preference datasets for preference optimization (e.g., DPO). The data sources are categorized into two types:

  • Off-policy data: Responses are generated by other models (not the model currently being trained), such as the UltraFeedback dataset.
  • On-policy data: Responses are generated by sampling from the current target model itself.

Limitations of Prior Work

The academic community holds conflicting views regarding the advantages of these two types of data:

| Perspective | Representative Work | Conclusion |
| --- | --- | --- |
| On-policy is better | Multiple studies | On-policy data consistently outperforms off-policy data |
| Task-dependent | Other studies | The advantage of on-policy data may depend on the specific task type |

This contradiction indicates a need to systematically study the interaction between the two types of data.

Motivation

  • Existing methods such as HyPO and DPO-Mix-P attempt to combine both types of data, but their designs are complex.
  • Is there a simpler mixing strategy that can achieve better results?
  • Can the complementarity of the two data sources on different task types be quantitatively verified?

Method

Overall Architecture

The core idea of SIMPLEMIX is extremely simple: directly mix on-policy and off-policy preference data in a certain ratio, and then train with standard DPO.

Data mixing pipeline:
1. Collect off-policy preference data (e.g., UltraFeedback)
2. Sample the target model to generate on-policy preference pairs
3. Merge the two types of data according to a mixing ratio α
4. Perform standard DPO training on the mixed dataset
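Step 2 of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' procedure: the summary does not say how on-policy responses are labeled, so the sketch assumes several responses are sampled per prompt and ranked by an external scorer. `build_on_policy_pair`, `generate`, `score`, and `num_samples` are hypothetical names, and the toy stand-ins replace a real model and reward model.

```python
import random
from typing import Callable, Dict


def build_on_policy_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the target model
    score: Callable[[str, str], float],  # hypothetical: rates a (prompt, response) pair
    num_samples: int = 4,                # assumption: number of samples per prompt
) -> Dict[str, str]:
    """Sample several responses and keep the best and worst as a preference pair."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(candidates, key=lambda response: score(prompt, response))
    return {
        "prompt": prompt,
        "chosen": ranked[-1],   # highest-scoring sample
        "rejected": ranked[0],  # lowest-scoring sample
        "source": "on",
    }


# Toy stand-ins so the sketch runs without a model: responses differ only in
# length, and the scorer simply prefers longer responses.
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return prompt + " " + "x" * _rng.randint(1, 20)


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


pair = build_on_policy_pair("Explain DPO.", toy_generate, toy_score)
```

The resulting record has the same `prompt` / `chosen` / `rejected` shape as an off-policy preference pair, so the two sources can be merged directly in step 3.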

Key Findings: Complementarity of Task Types

The motivation for SIMPLEMIX stems from an important empirical finding:

| Task Type | On-policy Performance | Off-policy Performance | Best Strategy |
| --- | --- | --- | --- |
| Reasoning tasks (math, coding) | Strong | Weaker | Use on-policy |
| Open-ended tasks (creative writing, personalized recommendation) | Weaker | Strong | Use off-policy |
| Comprehensive evaluation | Imbalanced | Imbalanced | Mixed (SIMPLEMIX) |

This finding explains why previous studies reached contradictory conclusions—different studies focused on different types of evaluation tasks.

Key Designs

Data Mixing Strategy

The core operation of SIMPLEMIX is a simple dataset-level mixture:

\[\mathcal{D}_{\text{mix}} = \alpha \cdot \mathcal{D}_{\text{on}} \cup (1 - \alpha) \cdot \mathcal{D}_{\text{off}}\]

Where:

  • \(\mathcal{D}_{\text{on}}\): on-policy preference data (chosen/rejected pairs generated by the current model)
  • \(\mathcal{D}_{\text{off}}\): off-policy preference data (pre-existing preference datasets)
  • \(\alpha\): mixing ratio hyperparameter

No additional weighting mechanisms, curriculum learning strategies, or modifications to the DPO loss are required.
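One way to read the mixture formula is as a sampling rule: a fraction \(\alpha\) of the training set is drawn from the on-policy pool and the remaining \(1 - \alpha\) from the off-policy pool. The sketch below implements that reading under stated assumptions. The function name `mix_preference_data`, the fixed `total_size`, and sampling without replacement are illustrative choices, not details confirmed by the source.

```python
import random
from typing import Dict, List


def mix_preference_data(
    on_policy: List[Dict],
    off_policy: List[Dict],
    alpha: float,
    total_size: int,
    seed: int = 0,
) -> List[Dict]:
    """Draw round(alpha * total_size) on-policy pairs and fill the rest with off-policy pairs."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_on = round(alpha * total_size)
    n_off = total_size - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle so that
    # training batches interleave both sources.
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed


# Toy pools; real entries would also carry prompt / chosen / rejected fields.
on_pool = [{"id": i, "source": "on"} for i in range(100)]
off_pool = [{"id": i, "source": "off"} for i in range(100)]
mixed = mix_preference_data(on_pool, off_pool, alpha=0.3, total_size=50)
n_on_in_mix = sum(1 for ex in mixed if ex["source"] == "on")
```

Setting `alpha=1.0` or `alpha=0.0` recovers the on-policy-only and off-policy-only baselines, which makes the same function usable for a ratio sweep.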

Loss & Training

Standard DPO loss is used without any modification:

\[\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{mix}}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]\]

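where \(y_w\) and \(y_l\) denote the preferred (chosen) and dispreferred (rejected) responses in each preference pair, \(\pi_{\text{ref}}\) is the frozen reference policy, \(\sigma\) is the logistic sigmoid, and \(\beta\) is a temperature-like hyperparameter that controls how strongly the policy is kept close to the reference.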
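The loss above can be evaluated directly from sequence log-probabilities. The following is a minimal pure-Python sketch for illustration, not a training implementation. It assumes the four log-probabilities (policy and reference, chosen and rejected) have already been computed by summing token log-probabilities, and `beta=0.1` is only a commonly used default, not a value taken from the source.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: illustrative default
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)


def batch_dpo_loss(batch: List[Tuple[float, float, float, float]], beta: float = 0.1) -> float:
    """Average the per-example loss over a batch drawn from the mixed dataset."""
    return sum(dpo_loss(*example, beta=beta) for example in batch) / len(batch)


# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-probability relative to the reference lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Because the mixed dataset only changes which \((x, y_w, y_l)\) triples are fed in, this function is identical for on-policy, off-policy, and mixed training.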


Key Experimental Results

Main Results: Performance Comparison on Alpaca Eval 2.0

| Method | Data Type | Alpaca Eval 2.0 LC Win Rate | Gain |
| --- | --- | --- | --- |
| Off-policy DPO | Off-policy only | baseline_off | |
| On-policy DPO | On-policy only | baseline_on | |
| HyPO | Mixed (complex method) | Good | Baseline |
| DPO-Mix-P | Mixed (complex method) | Good | Baseline |
| SIMPLEMIX | Simple mixture | Optimal | +6.03% vs single-strategy DPO, +3.05% vs HyPO/DPO-Mix-P |

Ablation Study: Data Performance across Different Task Types

| Task Category | Evaluation Metric | On-policy DPO | Off-policy DPO | SIMPLEMIX |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | Accuracy | High | Lower | Highest |
| Code Generation | Pass@1 | High | Lower | Highest |
| Creative Writing | Human Preference | Lower | High | Highest |
| Personalized Recommendation | Preference Rate | Lower | High | Highest |

Key Findings

  1. Complementarity quantitatively verified: On-policy data shows a clear advantage in reasoning tasks, while off-policy is stronger in open-ended tasks.
  2. Simple mixing is optimal: No need for complex weighting, sampling strategies, or loss function modifications.
  3. Consistently outperforms complex methods: SIMPLEMIX surpasses HyPO and DPO-Mix-P by 3.05% on average, with lower methodological complexity.
  4. Robustness to mixing ratio: Adjusting \(\alpha\) within a reasonable range has minimal impact on the final performance.
  5. Generalizability: Consistently effective across multiple benchmarks (such as Alpaca Eval 2.0).

Highlights & Insights

  1. "Simplicity is the ultimate sophistication" philosophy: In the field of preference learning, data diversity is more important than algorithmic complexity. The success of simple mixing challenges the necessity of complex data integration methods.
  2. Unified explanation for contradictory literature: Prior studies disagreed on whether on-policy or off-policy data is better; this work attributes the disagreement to differences in the task types each study evaluated.
  3. Practical guidance value: For engineering practitioners aligning LLMs, simply mixing existing off-policy data with self-sampled on-policy data yields substantial improvements.
  4. Relevance to recommendation systems: The advantage of off-policy data in "personalized recommendation" tasks hints at the unique value of historical data in recommendation scenarios.
  5. Reduced cost of on-policy generation: It is not necessary to fully rely on expensive on-policy data; mixing can achieve better performance while saving computational costs.

Limitations & Future Work

  1. Selection of mixing ratio \(\alpha\): Although results are relatively robust to \(\alpha\), the optimal ratio may vary based on model size and task distribution, and there is a lack of an adaptive method to determine \(\alpha\).
  2. Dependency on off-policy data quality: When there is a significant distribution gap between off-policy data and the target model, mixing performance might decline.
  3. Limited evaluation benchmarks: Primarily validated on Alpaca Eval 2.0; generalization to other alignment benchmarks (such as MT-Bench, Arena-Hard) requires further confirmation.
  4. Data scale ratio not discussed: The impact of the absolute scale difference between the two data types on the mixing effect has not been analyzed thoroughly.
  5. Insufficient theoretical explanation: Lacks theoretical analysis on "why simple mixing works", leaving the explanation of complementarity at an empirical level.

Related Work & Insights

  • DPO (Rafailov et al., 2023): The foundational preference optimization algorithm for this work. SIMPLEMIX does not modify the DPO loss.
  • HyPO: Integrates on- and off-policy data through a more complex mixing strategy; SIMPLEMIX outperforms it with a simpler approach.
  • DPO-Mix-P: Another mixing method, also outperformed.
  • InCo-DPO (2025): Balances distribution shift and data quality, focusing on issues similar to those addressed by SIMPLEMIX.
  • Insights: In recommendation systems, combining online exploration data (on-policy) and user historical behavior data (off-policy) might leverage a similarly simple mixing strategy.

Rating

| Dimension | Score (1-5) | Description |
| --- | --- | --- |
| Novelty | 3.5 | The method is extremely simple (data mixing only), but identifying the task complementarity of on/off-policy data is a valuable new insight. |
| Value | 5 | Zero additional implementation cost; any team using DPO can adopt it immediately. |
| Experimental Thoroughness | 4 | Quantitatively validates complementarity across multiple tasks, though the coverage of evaluation benchmarks could be broader. |
| Writing Quality | 4 | Clearly written; the naming "frustratingly simple" accurately conveys the core message. |
| Overall | 4.0 | Achieves significant improvements using a minimalist approach, providing important practical guidance for preference learning. |