Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
Conference: ACL 2026
arXiv: 2507.15640
Code: None
Area: Reinforcement Learning
Keywords: Data Mixing, Domain Re-weighting, Continual Pre-training, Reinforcement Learning, Catastrophic Forgetting
TL;DR
This paper proposes Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. A small agent is trained with Conservative Q-Learning (CQL), an offline reinforcement learning method, on a large set of data mixing trajectories so that it learns generalizable data mixing heuristics. In continual pre-training for mathematical reasoning, the agent balances performance between source and target domains, and it generalizes to unseen source domains, target models, and domain spaces.
Background & Motivation
Background: Although Large Language Models (LLMs) acquire general capabilities through large-scale pre-training, they still require continual pre-training to enhance performance in knowledge-intensive domains (e.g., mathematics, code). However, direct training on target domain data leads to catastrophic forgetting.
Limitations of Prior Work: (1) Common solutions involve mixing source and target domain data, but determining mixing ratios typically relies on human-designed heuristics or empirical findings; (2) The heuristic space for data mixing is extremely rich (different domains, ratios, schedules), making manual exploration highly inefficient; (3) Existing methods (e.g., DoReMi, DSIR) are based on specific assumptions and have limited generalization.
Key Challenge: The optimal data mixing strategy is high-dimensional, dynamic, and task-dependent, yet manual heuristics cover only a tiny fraction of the strategy space. Many potentially effective heuristics remain undiscovered and underutilized.
Goal: Train a small agent model to learn generalizable domain re-weighting heuristics from a large number of data mixing trajectories, automatically adjusting data mixing ratios during continual pre-training.
Key Insight: First, randomly sample numerous data mixing trajectories on a small proxy model and collect environmental feedback (benchmark performance). Then, use offline reinforcement learning to train the agent to map trajectory states to optimal mixing ratios.
Core Idea: Data mixing heuristics can be parameterized as a small agent, learned from trajectory data via RL, and the learned heuristics demonstrate cross-model and cross-domain generalization capabilities.
Method
Overall Architecture
A three-stage framework: (1) Data Collection—randomly sampling numerous data mixing trajectories, training and evaluating on a proxy model; (2) Agent Training—using CQL (Conservative Q-Learning) to train the data mixing agent on collected trajectories and feedback; (3) Deployment—during the continual pre-training of the target model, the agent directly predicts the domain distribution for the next re-weighting step.
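The deployment stage can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' code: `uniform_agent` is a hypothetical stand-in for the trained agent, `train_and_evaluate` is a hypothetical stub that returns fake benchmark scores instead of training a real model, the state is simplified to the current mixture plus the latest feedback, and the domain and step counts are arbitrary.

```python
import numpy as np

def uniform_agent(state, num_domains):
    """Hypothetical stand-in for the trained agent: always proposes a uniform mixture."""
    return np.full(num_domains, 1.0 / num_domains)

def train_and_evaluate(mixture, step):
    """Hypothetical stub for one training segment plus evaluation.

    Returns fake (general, target) benchmark scores; a real environment
    would train the target model on data drawn from `mixture`.
    """
    return np.array([0.30 + 0.001 * step, 0.10 + 0.002 * step])

def run_deployment(agent, num_domains, num_steps):
    """At every re-weighting step, ask the agent for the next domain distribution."""
    mixture = np.full(num_domains, 1.0 / num_domains)  # assumed starting mixture
    feedback = train_and_evaluate(mixture, 0)
    history = [mixture]
    for step in range(1, num_steps):
        # Simplified state: current mixture plus the most recent feedback.
        state = np.concatenate([mixture, feedback])
        mixture = agent(state, num_domains)
        mixture = mixture / mixture.sum()  # keep the prediction on the simplex
        feedback = train_and_evaluate(mixture, step)
        history.append(mixture)
    return np.stack(history)  # shape (num_steps, num_domains)

schedule = run_deployment(uniform_agent, num_domains=52, num_steps=80)
```

The point of the sketch is the control flow: the target model is never used to train the agent, and the agent is only queried for the next mixture between training segments.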
Key Designs
- Trajectory Collection and Evaluation Environment:
    - Function: Generate the empirical data required for training the agent.
    - Mechanism: Randomly sample 20 data mixing trajectories (80 re-weighting steps each) on a 50M-parameter proxy model, evaluating MMLU and MATH benchmark performance at each checkpoint. The domain space is 52-dimensional, covering DCLM general data and Dolmino math data.
    - Design Motivation: Small proxy models are cheap to train and allow extensive exploration of the strategy space; environmental feedback provides the supervisory signal linking mixing strategies to performance.
- CQL Reinforcement Learning Training:
    - Function: Learn the optimal mixing strategy from offline trajectory data.
    - Mechanism: State = current domain distribution + historical environmental feedback, Action = next domain distribution, Reward = change in benchmark performance. Conservative Q-Learning is used to avoid over-optimism toward unseen actions.
    - Design Motivation: Online RL requires repeated training of large models, which is impractical; offline RL can learn efficiently from pre-collected trajectories.
- Cross-Setting Generalization:
    - Function: Enable the learned heuristics to transfer to new scenarios.
    - Mechanism: The agent is trained on a 50M model with math domains and deployed directly to target models of different sizes (1B+), different source domains (general \(\rightarrow\) science/code, etc.), and different domain spaces.
    - Design Motivation: If heuristics represent general knowledge about "what domain distributions help balance performance," they should possess cross-setting transferability.
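The state, action, and reward definitions above can be made concrete with a small sketch of how one evaluated trajectory might be turned into offline RL transitions. This is an illustration under stated assumptions, not the paper's pipeline: the Dirichlet sampler, the function names, the use of only the latest feedback in the state (the method uses historical feedback), and the summed score difference as reward are all choices made here. The benchmark scores are random placeholders standing in for real proxy-model evaluations.

```python
import numpy as np

def sample_trajectory(num_steps, num_domains, rng, concentration=1.0):
    """Draw one random mixing trajectory: one point on the simplex per step."""
    alpha = np.full(num_domains, concentration)
    return rng.dirichlet(alpha, size=num_steps)  # shape (num_steps, num_domains)

def build_transitions(mixtures, scores):
    """Convert one evaluated trajectory into (state, action, reward, next_state).

    mixtures: (T, D) domain distributions used at each re-weighting step.
    scores:   (T, K) benchmark scores measured after each step.
    """
    transitions = []
    for t in range(len(mixtures) - 1):
        # Simplified state: current mixture plus the latest feedback only.
        state = np.concatenate([mixtures[t], scores[t]])
        action = mixtures[t + 1]  # the next domain distribution
        # Assumed reward: summed change in benchmark scores between checkpoints.
        reward = float(np.sum(scores[t + 1] - scores[t]))
        next_state = np.concatenate([mixtures[t + 1], scores[t + 1]])
        transitions.append((state, action, reward, next_state))
    return transitions

rng = np.random.default_rng(0)
mixtures = sample_trajectory(num_steps=80, num_domains=52, rng=rng)
# Placeholder scores for two benchmarks (e.g. MMLU and MATH), not real results.
scores = rng.uniform(0.2, 0.6, size=(80, 2))
transitions = build_transitions(mixtures, scores)
```

An 80-step trajectory yields 79 transitions, so 20 trajectories of this length give a fairly small offline dataset, which is one reason a conservative offline method is a natural fit.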
Loss & Training
CQL Loss = standard Q-learning loss + conservative regularization term (penalizing high Q-value estimates for actions outside the dataset). The agent is a small MLP model that takes the current state as input and outputs a continuous domain distribution.
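For reference, a commonly used form of the CQL objective from the offline RL literature is written out below. Whether the paper uses exactly this variant is not stated in this note, and the symbols \(\alpha\) (penalty weight), \(\gamma\) (discount factor), \(Q_\theta\) (Q-network), \(Q_{\bar\theta}\) (target network), and \(\mathcal{D}\) (offline dataset of transitions) are introduced here for illustration.

\[
\mathcal{L}_{\text{CQL}}(\theta) =
\alpha \left( \mathbb{E}_{s \sim \mathcal{D}}\left[ \log \sum_{a} \exp Q_\theta(s, a) \right]
- \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ Q_\theta(s, a) \right] \right)
+ \frac{1}{2}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[ \left( Q_\theta(s, a) - \left( r + \gamma \max_{a'} Q_{\bar\theta}(s', a') \right) \right)^2 \right]
\]

The first term is the conservative regularizer: it pushes Q-values down across all actions while pushing them up on actions that appear in the dataset. The second term is the standard Bellman error. With a continuous action space such as a domain distribution, the sum over actions becomes an integral that is typically approximated by sampling actions.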
Key Experimental Results
Main Results
Continual Pre-training for Mathematical Reasoning (Balancing MMLU and MATH Performance)
| Method | MMLU Retention | MATH Gain | Overall |
|---|---|---|---|
| Uniform Mixing | Medium | Medium | Baseline |
| DoReMi | Good | Good | Improved |
| Manual Heuristic | Variable | Variable | Experience-dependent |
| Data Mixing Agent | Optimal | Optimal | Optimal |
Ablation Study
| Generalization Test | Effect | Description |
|---|---|---|
| Unseen Source Domain | Effective | Heuristic transfer across domains |
| Different Size Target Models | Effective | Successful 50M \(\rightarrow\) 1B+ transfer |
| Unseen Domain Space | Effective | Remains effective with different domain classifications |
| Code Generation Domain | Effective | Adaptation across target domains |
Key Findings
- Data Mixing Agent outperforms all baseline methods in balancing source and target domain performance.
- Heuristics learned by the agent align closely with human intuition—e.g., science domain data benefits MMLU.
- The agent achieves better model performance using less source domain data, indicating it has learned a more efficient data utilization strategy.
- Strategies learned from a 50M model transfer directly to 1B+ models, suggesting that the learned data mixing heuristics are largely insensitive to model scale within the tested range.
Highlights & Insights
- First to demonstrate that data mixing heuristics can be parameterized and learned via RL.
- Impressive cross-setting generalization—strategies learned on extremely small models apply to large models.
- The strategies learned by the agent are interpretable and align with human intuition, increasing reliability.
Limitations & Future Work
- The trajectory collection phase still incurs significant computational costs (20 trajectories \(\times\) 80 steps \(\times\) evaluation).
- The conservatism of CQL might prevent the agent from exploring more aggressive mixing strategies.
- The evaluation environment uses simplified benchmarks (MMLU/MATH), which may not fully capture real-world performance.
- The division of domain space depends on external classifiers; classification quality affects agent learning.
Related Work & Insights
- vs DoReMi: DoReMi is based on gradient optimization of domain weights; Data Mixing Agent learns heuristics end-to-end.
- vs Manual Tuning: Manual methods cover only a tiny strategy space; the agent can automatically explore and utilize a vast number of heuristics.
- vs Online Methods: Online methods require repeated training of large models; the agent can be deployed multiple times after learning once.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First model-based end-to-end data mixing method using RL to learn heuristics.
- Experimental Thoroughness: ⭐⭐⭐⭐ Includes various generalization tests and ablation analyses, though benchmarks are limited.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and systematic methodology.
- Value: ⭐⭐⭐⭐ Provides an automated tool for data engineering in large-scale pre-training.