Explaining Data Mixing Scaling Laws

Conference: ICML 2026
arXiv: 2606.08167
Code: https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws
Area: LLM Pre-training / Data Mixing / Scaling Laws
Keywords: Data Mixing, Scaling Laws, Capacity Competition, Quantization Model, Bilevel Optimization

TL;DR

This paper provides the long-missing theoretical explanation for "multi-domain data mixing scaling laws." By extending two classic theories of single-domain scaling laws (the quantization model and the projection linear regression model) to multiple domains, it proposes a "shared head, disjoint tail" distribution hypothesis. It identifies two mechanisms governing the loss of each domain: capacity competition (limited model capacity is contested by domain-specific skills, globally coupling all domain losses) and data quantity noise (losses in harder-to-learn domains decrease more slowly, biasing the optimal ratio toward them). The resulting model achieves lower fitting errors using fewer parameters and enables cross-scale extrapolation, using small-scale fitted parameters to predict optimal ratios for large models.

Background & Motivation

Background: Large models are trained on multi-domain data, where "data mixing proportions" (the ratio of each domain) significantly impact performance. One mainstream approach is offline, using empirical scaling laws to predict the loss landscape (expressing domain test loss as a function of mixing weights \(f_i(h,N,D)\)) and solving for the optimal ratio. Another is online, dynamically adjusting weights during training.

Limitations of Prior Work: While practical, offline empirical laws are black boxes derived purely from curve fitting—the fitting itself is computationally expensive, and it is unclear if they extrapolate to larger scales or different datasets, or why those functional forms exist. Online methods suffer from training overhead and theoretical opacity. A fundamental fact is that domain loss depends not only on its own weight but also on the weights of other domains, indicating non-trivial inter-domain interactions. However, a first-principles explanation for why these interactions exist and what laws they follow has been missing.

Key Challenge: While single-domain scaling laws (Kaplan, Chinchilla) have theoretical explanations (quantization models, linear regression models), they are restricted to single domains. Multi-domain data mixing laws rely on empirical fitting without theory. The field needs a "mechanism of domain interaction" rather than just another fitting function.

Goal: Establish a unified theoretical framework to explain the underlying mechanisms of data mixing, thereby (i) improving loss landscape fitting, (ii) reliably predicting optimal ratios, (iii) enabling cross-scale extrapolation, and (iv) utilizing fewer parameters.

Key Insight: The authors extend the single-domain quantization model (Michaud: skills follow a power law; models learn the \(N\) most frequent ones) and the projection linear regression model (Lin/Bordelon: training dynamics characterized by spectral decay) to multiple domains. By introducing a structural hypothesis (different domains overlap in fundamental skills but diverge in specialized ones), they attribute domain interaction to two interpretable mechanisms.
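
Why a power law over skills turns into a power law in capacity can be seen from a one-line tail integral. The sketch below assumes the continuous normalization \(p(k)=(\alpha-1)k^{-\alpha}\) on \(k\ge 1\) and a constant error \(c\) per unlearned skill; both are simplifications for illustration, not the paper's full derivation.

```latex
% Assumptions: continuous skill density p(k) = (\alpha - 1) k^{-\alpha} on k \ge 1,
% the model has learned the N most frequent skills, each unlearned skill costs c.
L(N) \;\approx\; c \int_{N}^{\infty} (\alpha - 1)\, k^{-\alpha}\, dk
      \;=\; c \left[ -k^{-(\alpha - 1)} \right]_{N}^{\infty}
      \;=\; c\, N^{-(\alpha - 1)}, \qquad \alpha > 1 .
```

Under these assumptions the loss exponent is \(\alpha-1\), which matches the relation \(b_i=\alpha_i-1\) used in the per-domain capacity term.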

Method

Overall Architecture

The framework progresses through two layers. The first layer, the Extended Quantization Model, treats training as a "capacity allocation" problem: the total model capacity \(N\) is finite, and domain-specific skills compete for it. The weight \(h\) determines the capacity allocated to each domain, thereby determining its loss. This layer reveals capacity competition as a source of domain coupling but has a fatal flaw: the optimal ratio trivially equals the target distribution. The second layer, the Extended Linear Regression Model, adds SGD training dynamics on top of the first layer in the form of a data quantity noise term. This term breaks the symmetry, causing the optimal training ratio to deviate from the target distribution and lean toward "hard-to-learn domains," in line with real-world observations. Both layers share the "shared head, disjoint tail" distribution hypothesis: "skills" in the quantization model correspond to "eigenvectors" of the covariance matrix in the linear model, with skill frequency/loss corresponding to eigenvalues. Finally, loss prediction is formulated as a convex program, and the search for optimal ratios is formulated as a bilevel optimization solved via Online Mirror Descent.

Key Designs

1. "Shared Head, Disjoint Tail" Structural Hypothesis: Making Multi-domain Analyzable

To extend single-domain theory, one must define how knowledge overlaps across domains. The authors propose that within each domain, skills follow a power law \(p_i(k_i)=(\alpha_i-1)k_i^{-\alpha_i}\). Different domains overlap significantly at the head (high-frequency, fundamental skills like basic grammar or arithmetic) but become increasingly independent or orthogonal at the tail (rare, specialized skills). Spectrally, this means covariance matrices share a set of orthogonal eigenvectors where the first \(H\) components (head) have non-zero variance across all domains, while for the tail (\(k>H\)), each domain only has variance \(\lambda_k^{(i)}=k^{-\alpha_i}\) in its unique eigenvectors and zero in others. This idealization allows the tail of the mixed covariance \(\mathbf{H}(h)=\sum_j h_j \mathbf{H}_j\) to decouple—the tail eigenvalues for domain \(i\) after mixing simply become \(h_i k^{-\alpha_i}\) (scaling by its own ratio). Synthetic stress tests show that even with 40% tail overlap, the fitting MRE remains stable, indicating robustness.
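
The decoupling of the tail can be checked on a toy spectrum. The following is a minimal sketch, assuming every domain covariance is diagonal in one shared orthonormal basis; the names `domain_spectrum` and `mixed_spectrum`, the tail length `T`, and the choice of \(k^{-\alpha_i}\) for head eigenvalues are illustrative assumptions, not taken from the paper's code.

```python
def domain_spectrum(i, alphas, H, T):
    """Eigenvalues of domain i, keyed by direction in the shared orthonormal basis.

    Head directions ("head", k) are common to every domain; tail directions
    ("tail", i, k) belong to domain i alone (the idealized disjoint tail).
    Using k**(-alpha_i) for the head as well is an assumption of this sketch.
    """
    spec = {("head", k): k ** -alphas[i] for k in range(1, H + 1)}
    for k in range(H + 1, H + T + 1):
        spec[("tail", i, k)] = k ** -alphas[i]
    return spec


def mixed_spectrum(h, alphas, H, T):
    """Eigenvalues of H(h) = sum_j h_j H_j, with all H_j diagonal in the shared basis."""
    mixed = {}
    for j, h_j in enumerate(h):
        for direction, lam in domain_spectrum(j, alphas, H, T).items():
            mixed[direction] = mixed.get(direction, 0.0) + h_j * lam
    return mixed


h = [0.6, 0.4]          # mixing weights (hypothetical values)
alphas = [1.5, 2.0]     # per-domain power-law exponents (hypothetical values)
spec = mixed_spectrum(h, alphas, H=3, T=5)

# Tail direction k=5 of domain 1 only sees its own weight: h_1 * 5**(-alpha_1).
print(spec[("tail", 1, 5)], h[1] * 5 ** -alphas[1])
# Head direction k=2 mixes contributions from both domains.
print(spec[("head", 2)])
```

Because each tail direction receives variance from exactly one domain, the mixed tail eigenvalue is just that domain's eigenvalue scaled by its own ratio, while head eigenvalues are weighted sums over all domains.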

2. Extended Quantization Model: Attributing Domain Coupling to "Capacity Competition"

Under this hypothesis, training is formalized as capacity allocation: the model selects a coverage threshold \(x_i\ge H\) for each domain (learning high-frequency skills \(k_i\le x_i\) and discarding the rest). Each unlearned skill contributes a constant error \(c_i\), leading to a domain training loss of \(c_i x_i^{-b_i}\) (where \(b_i=\alpha_i-1\)). The optimal thresholds solve

\[\min_{x}\ \sum_{i=1}^{K} h_i c_i x_i^{-b_i}\quad \text{s.t.}\ \sum_{i=1}^{K}(x_i-H)\le N-H,\ x_i\ge H.\]

The constraint \(\sum_i(x_i-H)\le N-H\) forces all domains into a global competition for limited capacity, which is the source of domain interaction. When the \(b_i\) are similar, a Lagrange multiplier yields an approximate closed-form solution in which the loss of domain \(i\) depends on a "total demand" denominator \(\sum_j (b_j c_j h_j)^{1/(b_j+1)}\). Thus, the loss of domain \(i\) is determined not just by its own weight but is globally coupled to the weights and complexities of all competing domains. However, this layer leads to a contradiction: when solving for the optimal training ratio \(h^*\) via bilevel optimization, the outer and inner problems minimize the same kind of weighted sum of losses (weighted by \(w\) and by \(h\) respectively), so the outer objective is minimized by the allocation the inner problem picks at \(h=w\). This trivially yields \(h^*\equiv w\) (optimal training ratio equals target distribution), which contradicts empirical findings.
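
The capacity-allocation program can be solved numerically from its KKT conditions. The sketch below is one possible approach, not the authors' solver: it assumes \(H\ge 1\) and bisects on the Lagrange multiplier \(\lambda\), using the stationarity condition \(h_i c_i b_i x_i^{-(b_i+1)}=\lambda\) for every domain with \(x_i>H\). The function name `allocate_capacity` and all numeric values are hypothetical.

```python
import math


def allocate_capacity(h, c, b, N, H, iters=200):
    """Solve min_x sum_i h_i c_i x_i^(-b_i)  s.t.  sum_i (x_i - H) <= N - H, x_i >= H.

    Sketch: for a multiplier lam, stationarity gives
    x_i(lam) = max(H, (h_i c_i b_i / lam)^(1/(b_i+1))); the capacity used is
    decreasing in lam, so lam is found by bisection in log space.
    """
    budget = N - H

    def x_of(lam):
        return [max(H, (h_i * c_i * b_i / lam) ** (1.0 / (b_i + 1.0)))
                for h_i, c_i, b_i in zip(h, c, b)]

    def used(lam):
        return sum(x_i - H for x_i in x_of(lam))

    lam_lo, lam_hi = 1e-30, 1e30   # bracket: too much capacity vs. none
    for _ in range(iters):
        lam_mid = math.sqrt(lam_lo * lam_hi)
        if used(lam_mid) > budget:
            lam_lo = lam_mid
        else:
            lam_hi = lam_mid
    return x_of(lam_hi)            # feasible side of the bracket


c = [1.0, 1.0, 1.0]
b = [0.5, 0.5, 0.5]
x_a = allocate_capacity([0.5, 0.3, 0.2], c, b, N=1000.0, H=10.0)
x_b = allocate_capacity([0.6, 0.2, 0.2], c, b, N=1000.0, H=10.0)

# Domain 2 keeps weight 0.2 in both mixtures, yet its threshold (and hence
# its loss c * x**(-b)) changes because the other domains' demand changed.
print(x_a[2], x_b[2])
```

The last two lines illustrate the global coupling: shifting weight between domains 0 and 1 changes the capacity left for domain 2 even though its own weight is unchanged.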

3. Extended Linear Regression Model: Breaking Symmetry with the "Data Quantity Noise Term"

To resolve this, the authors introduce the stochasticity of single-pass SGD: the loss of each learned skill also depends on "how many times it was seen," which is proportional to the domain sample size \(D h_i\). This leads to the core conclusion (Theorem 4.1):

\[L_i(h,N,D)\approx c_i\, x_i^{*}(h,N)^{-b_i} + A_i (D h_i)^{-a_i} + E_i.\]

The first term accounts for capacity competition (global coupling), while the second is a data quantity noise term depending only on \(h_i\). This noise term breaks the symmetry of the first layer: when seeking \(h^*\) for a target \(w\), \(A_i(Dh_i)^{-a_i}\) causes \(h^*\) to deviate from \(w\) and pushes weight toward "harder-to-learn" domains (those with larger \(A_i\) or smaller \(\alpha_i\)). The intuition is that harder domains see their loss decrease more slowly with increased weight, requiring more data to suppress noise, thus naturally biasing the optimal ratio toward them. This explains the observed deviation of optimal training ratios from target distributions. Overlap in the tail also has minimal impact on the noise term, making the approximation robust.
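
The two terms of the loss can be evaluated separately to see which part couples domains. This is a minimal sketch under strong simplifying assumptions: a single shared exponent \(b\) and a negligible head (\(H=0\)), so that the capacity shares have the closed form \(x_i^*=N\,s_i/\sum_j s_j\) with \(s_i=(b\,c_i h_i)^{1/(b+1)}\). Function names and parameter values are hypothetical.

```python
def capacity_shares(h, c, b, N):
    """Closed-form x_i* assuming one shared exponent b and H = 0 (sketch assumption)."""
    s = [(b * c_i * h_i) ** (1.0 / (b + 1.0)) for c_i, h_i in zip(c, h)]
    total = sum(s)
    return [N * s_i / total for s_i in s]


def domain_losses(h, N, D, c, b, A, a, E):
    """L_i = c_i x_i*^(-b) + A_i (D h_i)^(-a_i) + E_i for every domain i."""
    x = capacity_shares(h, c, b, N)
    return [c_i * x_i ** -b + A_i * (D * h_i) ** -a_i + E_i
            for c_i, x_i, A_i, a_i, E_i, h_i in zip(c, x, A, a, E, h)]


h = [0.5, 0.3, 0.2]
kw = dict(N=1000.0, c=[1.0, 1.0, 1.0], b=0.5, a=[0.5, 0.5, 0.5], E=[0.1, 0.1, 0.1])

full = domain_losses(h, D=1e4, A=[1.0, 1.0, 4.0], **kw)
no_noise = domain_losses(h, D=1e4, A=[0.0, 0.0, 0.0], **kw)
more_data = domain_losses(h, D=2e4, A=[1.0, 1.0, 4.0], **kw)

# The gap between the two evaluations is exactly the noise term A_i (D h_i)^(-a_i).
noise = [f - n for f, n in zip(full, no_noise)]
print(noise)
```

In this toy setting the noise term of domain \(i\) depends only on \(D h_i\), whereas the capacity term changes whenever any weight changes.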

Loss & Training

Given fitted parameters, the loss for any ratio \(h\) is predicted by numerically solving the convex program (Eq. 1). Finding the optimal ratio is a bilevel optimization: the outer level minimizes the target-weighted loss \(\sum_i w_i L_i(h,N,D)\) over \(h\), while the inner level solves the capacity allocation \(x^*(h)\). The authors characterize the outer gradient (Proposition 4.2) and solve the problem efficiently using Online Mirror Descent. For parameter fitting, simple laws (power/exponential) use scipy curve_fit, while complex laws and the proposed model use Basin-Hopping + L-BFGS (optimizing MSE with the inner convex program solved each time), repeated with multiple random initializations.
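
The search for the optimal ratio can be illustrated with mirror descent on the probability simplex. The sketch below makes several assumptions that are not from the paper: it uses the entropic mirror map (exponentiated-gradient updates), replaces the analytic gradient of Proposition 4.2 with central finite differences, and uses a closed-form inner solution that is valid only for a shared exponent \(b\) and \(H=0\). All names and numbers are hypothetical.

```python
import math


def weighted_loss(h, w, N, D, c, b, A, a):
    """Outer objective sum_i w_i L_i(h), with closed-form capacity shares
    (sketch assumption: shared exponents b and a, negligible head H = 0)."""
    s = [(b * c_i * h_i) ** (1.0 / (b + 1.0)) for c_i, h_i in zip(c, h)]
    total = sum(s)
    loss = 0.0
    for w_i, h_i, c_i, s_i, A_i in zip(w, h, c, s, A):
        x_i = N * s_i / total
        loss += w_i * (c_i * x_i ** -b + A_i * (D * h_i) ** -a)
    return loss


def mirror_descent(objective, K, eta=5.0, steps=4000, eps=1e-6):
    """Exponentiated-gradient updates on the simplex, finite-difference gradients."""
    h = [1.0 / K] * K
    for _ in range(steps):
        grad = []
        for i in range(K):
            up = list(h); up[i] += eps
            dn = list(h); dn[i] -= eps
            grad.append((objective(up) - objective(dn)) / (2 * eps))
        h = [h_i * math.exp(-eta * g_i) for h_i, g_i in zip(h, grad)]
        z = sum(h)
        h = [h_i / z for h_i in h]      # renormalize back onto the simplex
    return h


w = [0.5, 0.3, 0.2]                     # target distribution
params = dict(N=1000.0, D=1e4, c=[1.0, 1.0, 1.0], b=0.5, a=0.5)

h_no_noise = mirror_descent(lambda h: weighted_loss(h, w, A=[0.0, 0.0, 0.0], **params), 3)
h_noise = mirror_descent(lambda h: weighted_loss(h, w, A=[1.0, 1.0, 4.0], **params), 3)

print(h_no_noise)   # without the noise term the optimum sits at the target w
print(h_noise)      # with the noise term weight moves toward the high-A domain
```

In this toy problem, dropping the noise term recovers \(h^*=w\), while a larger \(A_i\) on the third domain pulls its optimal training ratio above its target share, mirroring the qualitative behaviour described above.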

Key Experimental Results

Validation focuses on three goals: fitting accuracy, optimal ratio prediction, and cross-scale extrapolation.

Main Results: Fitting Accuracy (64 1B models, K=17 Pile domains, 25B tokens)

| Method | MRE (%) ↓ | MAE ↓ | #Param |
|---|---|---|---|
| Additive (Shukor 2026) | 2.209 | 0.052 | \(K(2K+1)\) |
| Exponential (Ye 2025) | 6.990 | 0.059 | \(K(K+2)\) |
| BiMix (Ge 2025) | 2.963 | 0.144 | \(2K\) |
| RegMix (Liu 2025a) | 6.480 | 0.136 | \(K^2\) |
| Ours Eq(1) Ext. Quantization | 2.064 | 0.051 | \(3K\) |
| Ours Eq(3) Ext. Linear | 1.533 | 0.034 | \(5K\) |

The two proposed models achieve the lowest and second-lowest MRE (1.533% and 2.064%), both below the strongest empirical baseline, Additive (2.209%). At the same time, the parameter count drops from \(K(2K+1)\) (595 for \(K=17\)) to \(5K\) or \(3K\) (85 or 51), i.e. roughly 7 to 12 times fewer parameters than Additive at higher accuracy.
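
The parameter counts follow directly from the #Param column at \(K=17\). The short sketch below just evaluates those formulas; the dictionary keys are labels chosen for illustration.

```python
K = 17  # number of Pile domains in the fitting experiment

param_counts = {
    "Additive": K * (2 * K + 1),
    "Exponential": K * (K + 2),
    "BiMix": 2 * K,
    "RegMix": K ** 2,
    "Ext. Quantization": 3 * K,
    "Ext. Linear": 5 * K,
}

for name, count in param_counts.items():
    print(f"{name:>18}: {count}")

# Reduction relative to the Additive law.
ratio_linear = param_counts["Additive"] / param_counts["Ext. Linear"]
ratio_quant = param_counts["Additive"] / param_counts["Ext. Quantization"]
print(ratio_linear, ratio_quant)
```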

Ablation Study: Progression of the Two Models

| Model | Loss Form | Explains \(h^*\neq w\)? | MRE (%) |
|---|---|---|---|
| Ext. Quantization (Eq 1) | \(c_i x_i^{*-b_i}+E_i\) | No (\(h^*\equiv w\)) | 2.064 |
| Ext. Linear (Eq 3) | \(+\,A_i(Dh_i)^{-a_i}\) Noise | Yes | 1.533 |

Adding the data quantity noise term reduces the MRE from 2.064% to 1.533% and theoretically resolves the contradiction of "optimal ratio equals target distribution," quantifying the dual contribution of the noise term design.

Key Findings

  • The noise term is critical: Removing it (reverting to the quantization model) worsens fitting and leads to the contradictory \(h^*\equiv w\). It is the theoretical root of the "hard-domain bias" in optimal proportions.
  • Cross-scale extrapolation holds: The tests extrapolate 4 domains from 200M/8B to 700M/16B and 7 domains from 122M/10B to 1B/30B (model parameters / training tokens). With parameters fitted only on small-scale proxy losses, the predicted ratios achieve the lowest (or tied-lowest) test loss at the large scale. At 1B/30B, the proposed model matches the strong Additive baseline even though Additive had access to large-scale loss data.
  • Superior optimal ratio prediction: Across settings like 4-domain 200M, 7-domain 122M/310M, and 17-domain Pile, models trained with predicted ratios consistently yield the lowest test losses using fewer free parameters.

Highlights & Insights

  • Providing "Mechanisms" for Black-box Scaling Laws: This is the first work to decompose domain interaction in data mixing into two interpretable mechanisms: "capacity competition" and "data quantity noise," answering why domain losses are coupled.
  • Elegant Two-layer Progression: Identifying the flaw in the extended quantization model (\(h^*\equiv w\)) and fixing it with the linear model's noise term makes the necessity of the noise term extremely clear.
  • Few Parameters + Extrapolatable = Practical: Theory-driven modeling reduces free parameters from \(O(K^2)\) to \(O(K)\). The ability to use small-model parameters to predict large-model optimal ratios directly reduces the computational cost of ratio searching, which is highly valuable for pre-training.
  • Transferable logic: The mapping of "skills = eigenvectors, frequency = eigenvalues" could potentially explain trade-offs in other multi-distribution or multi-task learning contexts.

Limitations & Future Work

  • Idealized Assumptions: The "strictly disjoint tails" and "negligible head error" do not hold for real data. While stress tests suggest stability up to 40% overlap, real-world data possesses more complex structures, and the boundaries of these assumptions need more systematic characterization.
  • Gap between Linear Proxy and Real LLMs: The theory assumes projected linear regression and single-pass SGD, whereas real pre-training is non-linear and multi-epoch. Theorem 4.1 is noted as informal, and the identifiability of constants \(A_i, a_i, E_i\) in real Transformers remains an open question.
  • Loss vs. Downstream Performance: The framework predicts domain test loss, but optimal loss does not guarantee optimal downstream performance. The mapping from predicted domain loss to downstream tasks is currently unclear.
  • Scale Limitations: Experiments are primarily conducted at scales of \(\le 1\text{B}\) parameters and \(\le 30\text{B}\) tokens. Whether extrapolation stays valid at frontier scales (billions to trillions of parameters/tokens) remains to be verified.

Related Work

  • vs Additive / Exponential / BiMix (Empirical Laws): These directly fit functional forms (e.g., \(E_i+(\sum C_{ij}h_j^{\gamma_{ij}})^{-1}\)) which are black-box and parameter-heavy. Ours derives the form from first principles, achieving higher accuracy and fewer parameters with extrapolation capabilities.
  • vs RegMix / Proxy Models: RegMix uses LightGBM on small proxy models to learn ratios with \(O(K^2)\) parameters and low interpretability. Ours provides a closed-form mechanism with \(O(K)\) parameters.
  • vs Single-domain Theory: This work is a direct multi-domain extension of existing single-domain theories, generalizing "skill frequency/spectral decay" to "shared head + disjoint tail."
  • vs Online Dynamic Weighting (ODM / Skill-it / Aioli / PiKE): Online methods adjust weights during training with extra overhead and opaque logic. Ours serves as a complementary offline, extrapolatable route with theoretical grounding.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First unified theoretical explanation for data mixing scaling laws; "capacity competition + data quantity noise" provides deep insight.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive experiments across fitting, prediction, and extrapolation, though scales are \(\le 1\text{B}\) and core theorems remain informal.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear progression through the two layers; effectively communicates the contradiction-resolution narrative.
  • Value: ⭐⭐⭐⭐⭐ Significantly reduces ratio search costs through parameter efficiency and extrapolation; directly applicable to pre-training data engineering.