An Information-Theoretic Criterion for Efficient Data Synthesis
Conference: ICML 2026
arXiv: 2605.16379
Code: None
Area: LLM Pre-training
Keywords: Synthetic Data, Information Theory, Data Processing Inequality, External Verifiers, Reward Hacking
TL;DR
This paper uses the Data Processing Inequality (DPI) to explain why synthetic data sometimes helps and sometimes leads to model collapse: synthetic-data training is information-open only when the training loop continuously receives stable external signals, and high meta-level verification signals are more sample-efficient and generalize more easily than instance-level imitation.
Background & Motivation
Background: Large Language Model (LLM) training increasingly depends on synthetic data. High-quality real text is running short, while tasks such as mathematics, coding, tool use, and long-range reasoning require stronger supervisory signals than typical web corpora provide. Successful cases include verifier-guided synthesis, RLVR, program test feedback, formal proof checkers, and fixed rubric evaluators.
Limitations of Prior Work: Synthetic data is a double-edged sword. Repeated self-training using the model's own outputs often causes distribution collapse, loss of long-tail patterns, and performance degradation. However, synthetic pipelines with verifiers or environmental feedback can yield significant gains. While empirical evidence is abundant, a unified explanation is lacking regarding when synthetic data injects new information and when it merely cycles through the model's existing distribution.
Key Challenge: Data generated by the model itself cannot create information about the real task out of thin air. If training samples are derived entirely from the current model, with no feedback that is independent of the model state, the Data Processing Inequality implies that task-relevant information can only stay constant or decrease. Yet real-world RLVR, code testing, and proof checking do improve models, which suggests that successful pipelines are not closed systems.
Goal: The authors aim to provide an information-theoretic criterion to determine the effectiveness of synthetic data pipelines; furthermore, they seek to explain why some external signals produce cross-domain generalization with very few samples while others require massive data for limited gains.
Key Insight: The paper formalizes training as a set of relationships between random variables: the real task structure \(X\), the existing data \(D\), the model state \(Z\), and the external signal \(S\). If \(X\to D\to Z\) or \(X\to Z_t\to D_t^{syn}\to Z_{t+1}\) holds as a Markov chain, the pipeline is information-closed and DPI implies that task information cannot increase along the chain. If a stable \(S\) is introduced, the upper bound rises to \(I(X;D,S)\) or \(I(X;Z_t,S)\).
Core Idea: Whether synthetic data is effective depends not on "whether it is model-generated," but on whether the training loop is information-open and whether the external signal injects task-relevant information at a high meta-level.
Method
The paper does not propose a new training algorithm; it offers a set of explanatory and design criteria. It first uses DPI to distinguish information-closed from information-open pipelines, then analyzes how external signals enter SFT, RFT, and RLVR, and finally uses task-relevant partitions to decompose a supervisory signal into effective information and intra-class noise, which explains sample efficiency, generalization, the value of data diversity, and reward hacking.
Overall Architecture
In the most basic abstraction, the training process maps data \(D\) and internal randomness \(R\) to the model state \(Z=f_{train}(D,R)\). If the training procedure has no additional side information about the task structure \(X\) (in particular, \(R\) is independent of \(X\)), then \(X\to D\to Z\) is a Markov chain and the Data Processing Inequality gives \(I(X;Z)\leq I(X;D)\). Training on \(D\) alone therefore cannot give the model more task information than \(D\) already carries, and closed-loop self-training inherits the same bound.
For iterative synthetic data, if \(D_t^{syn}\) is generated entirely by \(Z_t\) without verifiers, environments, fixed judges, or new data, then \(X\to Z_t\to D_t^{syn}\to Z_{t+1}\), implying \(I(X;Z_{t+1})\leq I(X;Z_t)\). In practice, sampling error, optimization error, and distribution collapse often make the inequality strict, which is the degradation observed in self-training.
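The closed-loop bound can be checked numerically on a toy chain. The sketch below is an illustration rather than anything from the paper: it assumes a uniform discrete \(X\) and collapses each "generate synthetic data, then retrain" round into a single symmetric noisy channel (the function names, `k`, and `eps` are all hypothetical choices). Under those assumptions, \(I(X;Z_t)\) shrinks with every round.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def symmetric_channel(k, eps):
    """Row-stochastic k x k channel: keep the symbol w.p. 1-eps, else move uniformly."""
    return (1 - eps) * np.eye(k) + eps / (k - 1) * (1 - np.eye(k))

def closed_loop_information(k=4, eps=0.1, rounds=5):
    """Return [I(X;Z_0), I(X;Z_1), ...] for X -> Z_0 -> Z_1 -> ... with no external signal."""
    channel = symmetric_channel(k, eps)
    joint = np.diag(np.full(k, 1.0 / k)) @ channel   # P(X, Z_0) with uniform X
    history = [mutual_information(joint)]
    for _ in range(rounds):
        # One closed round (assumed): P(X, Z_{t+1}) = sum_z P(X, z) P(Z_{t+1} | z)
        joint = joint @ channel
        history.append(mutual_information(joint))
    return history

if __name__ == "__main__":
    for t, bits in enumerate(closed_loop_information()):
        print(f"round {t}: I(X;Z_t) = {bits:.4f} bits")
```

Because every round only post-processes the previous model state, no choice of `eps` can make the sequence increase; a larger `eps` only makes it fall faster.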
When a pipeline introduces an external signal \(S\), the relevant object is the augmented observation \((D,S)\) or \((Z_t,S)\). The bound becomes \(I(X;Z)\leq I(X;D,S)=I(X;D)+I(X;S\mid D)\), where the equality is the chain rule for mutual information. New information can come only from the conditional mutual information \(I(X;S\mid D)\). The external signal must therefore be relevant to the task structure and must not drift freely with the student model; otherwise the loop closes again.
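The chain-rule identity behind the open-loop bound can be verified on a small example. The following sketch is a toy construction, not the paper's setup: \(X\) is assumed to be two uniform bits, \(D\) a noisy view of the high bit, and \(S\) a noisy view of the low bit whose noise level is a free parameter (`build_joint` and `signal_noise` are hypothetical names). With `signal_noise=0.5` the signal is pure noise and contributes nothing; with `signal_noise=0.0` it contributes one full bit beyond \(D\).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi(joint_xy):
    """I(X;Y) in bits from a 2-D joint table indexed [x, y]."""
    return entropy(joint_xy.sum(axis=1)) + entropy(joint_xy.sum(axis=0)) - entropy(joint_xy)

def conditional_mi(joint_xds):
    """I(X;S|D) = H(X,D) + H(D,S) - H(D) - H(X,D,S) for a table indexed [x, d, s]."""
    return (entropy(joint_xds.sum(axis=2)) + entropy(joint_xds.sum(axis=0))
            - entropy(joint_xds.sum(axis=(0, 2))) - entropy(joint_xds))

def build_joint(signal_noise):
    """Toy P(X, D, S): D sees the high bit of X (10% flips), S sees the low bit."""
    joint = np.zeros((4, 2, 2))
    for x in range(4):
        hi, lo = x >> 1, x & 1
        for d in range(2):
            p_d = 0.9 if d == hi else 0.1
            for s in range(2):
                p_s = 1 - signal_noise if s == lo else signal_noise
                joint[x, d, s] = 0.25 * p_d * p_s
    return joint

if __name__ == "__main__":
    for noise in (0.5, 0.2, 0.0):
        j = build_joint(noise)
        i_xd = mi(j.sum(axis=2))          # I(X;D)
        i_xs_d = conditional_mi(j)        # I(X;S|D)
        i_x_ds = mi(j.reshape(4, 4))      # I(X;D,S), treating (D,S) as one variable
        print(f"noise={noise}: I(X;D)={i_xd:.3f}  I(X;S|D)={i_xs_d:.3f}  I(X;D,S)={i_x_ds:.3f}")
```

The uninformative case corresponds to a judge that carries no task information of its own: the pipeline has an extra component but is still information-closed.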
Key Designs
- Information-open criterion:
  - Function: To determine if synthetic data training can yield sustained gains.
  - Mechanism: If the external signal \(S\) satisfies \(I(X;S\mid D)>0\) or \(I(X;S\mid Z_t)>0\) during iteration, it carries task-relevant information outside the current model/data, making the pipeline information-open. Without such an \(S\), random sampling only provides diverse candidates without selection direction.
  - Design Motivation: This explains model collapse in closed-loop self-training and why code tests, proof checks, environmental rewards, and fixed rubrics make synthetic data genuinely useful.
- Meta-level information injection:
  - Function: To explain differences in sample efficiency across various information-open pipelines.
  - Mechanism: High meta-level signals only distinguish behaviorally significant equivalence classes, such as "is the answer correct" or "does the program pass the test." Low meta-level signals require replicating a specific reference answer. If there are \(M\) equally acceptable answers, a high-level signal treats them as the same class, while instance-level SFT spends up to \(\log_2 M\) bits identifying one specific surface form.
  - Design Motivation: This explains why binary correctness in RLVR can generalize across math, code, and logic, whereas single-reference imitation often spends capacity on irrelevant surface details.
- Supervision info decomposition and reward hacking mechanism:
  - Function: To separate "effective task signals" from "intra-class/spurious signals," explaining generalization and reward hacking.
  - Mechanism: Given a task-relevant partition \(\pi\), any supervisory signal satisfies \(I(Y;S\mid Q)=I([Y]_\pi;S\mid Q)+I(Y;S\mid [Y]_\pi,Q)\). The first term is the task-relevant gain and the second is the intra-class gain. If the signal depends only on \([Y]_\pi\), the intra-class term vanishes and the efficiency \(\eta_\pi=1\). If a coarser, easier-to-learn spurious signal correlates with the reward, the model tends to learn it first.
  - Design Motivation: This reframes reward hacking from "model laziness" to "the model learning the most information-efficient components of the training signal." The remedy is not longer training but decorrelating spurious features from the reward, or making the intended signal the most efficient one to learn.
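The decomposition and the \(\log_2 M\) figure can be illustrated with a small computation. This sketch makes several assumptions that are not in the source: a single fixed prompt (so the conditioning on \(Q\) is dropped), a uniform \(Y\) over 8 outputs split into a correct block of \(M=4\) and an incorrect block of 4, and an efficiency defined as the ratio of the task-relevant term to the total. That ratio is a guess at what \(\eta_\pi\) denotes, chosen because it gives \(\eta_\pi=1\) exactly when the intra-class term is zero.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi(joint):
    """Mutual information in bits from a 2-D joint table."""
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

def decompose(joint_ys, classes):
    """Return (I(Y;S), I([Y];S), I(Y;S|[Y])) for one fixed prompt."""
    joint_ys = np.asarray(joint_ys, dtype=float)
    classes = np.asarray(classes)
    blocks = np.unique(classes)
    # Joint of ([Y]_pi, S): merge the rows of Y that fall in the same block.
    joint_cs = np.stack([joint_ys[classes == c].sum(axis=0) for c in blocks])
    intra = 0.0
    for c in blocks:
        sub = joint_ys[classes == c]
        weight = sub.sum()                      # P([Y]_pi = c)
        if weight > 0:
            intra += weight * mi(sub / weight)  # I(Y;S | [Y]_pi = c)
    return mi(joint_ys), mi(joint_cs), intra

classes = np.array([1, 1, 1, 1, 0, 0, 0, 0])    # 4 acceptable answers, 4 unacceptable

# High meta-level signal: S is the correctness bit of Y.
high = np.zeros((8, 2))
high[np.arange(8), classes] = 1 / 8

# Instance-level signal: S pins down the exact output (S = Y).
instance = np.eye(8) / 8

if __name__ == "__main__":
    for name, joint in (("binary correctness", high), ("instance-level", instance)):
        total, relevant, intra = decompose(joint, classes)
        print(f"{name}: total={total:.2f}  task-relevant={relevant:.2f}  "
              f"intra-class={intra:.2f}  assumed efficiency={relevant / total:.2f}")
```

In this toy case both signals deliver the same one bit about correctness, but the instance-level signal also carries \(\log_2 4 = 2\) bits that only identify which acceptable answer was written.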
Loss & Training
The paper compares three ways in which training signals enter the model. SFT maximizes the log-likelihood of fixed data pairs, which mainly raises the probability of one specific reference answer. RFT generates candidates from the current model, filters them with an external acceptance test, and performs SFT on the accepted samples. RLVR feeds verifier rewards directly into the policy gradient, weighting model-generated outputs by their advantage. The authors emphasize that the key difference is not whether the data is synthetic, but how the external signal enters the gradient and the sample selection.
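The three entry points can be contrasted on a toy softmax policy over a handful of candidate answers. This is a minimal sketch under strong assumptions, not the paper's training setup: the policy is a bare logit vector for one prompt, each step uses the exact expected gradient instead of sampled batches, the verifier is a fixed 0/1 vector, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sft_step(logits, reference, lr):
    """Ascend log p(reference): the gradient w.r.t. logits is onehot(reference) - p."""
    p = softmax(logits)
    grad = -p
    grad[reference] += 1.0
    return logits + lr * grad

def rft_step(logits, accepted, lr):
    """Expected SFT step on policy samples that pass the external acceptance test."""
    p = softmax(logits)
    target = p * accepted
    target = target / target.sum()    # current policy restricted to accepted outputs
    return logits + lr * (target - p)

def rlvr_step(logits, reward, lr):
    """Exact policy-gradient step with a mean-reward baseline: p_i * (r_i - E[r])."""
    p = softmax(logits)
    advantage = reward - p @ reward
    return logits + lr * p * advantage

def run(step, signal, n_outputs=8, steps=200, lr=0.5):
    """Apply one update rule repeatedly from a uniform policy; return final probabilities."""
    logits = np.zeros(n_outputs)
    for _ in range(steps):
        logits = step(logits, signal, lr)
    return softmax(logits)

if __name__ == "__main__":
    accepted = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # verifier verdicts
    results = {
        "SFT": run(sft_step, 0),           # single reference answer: index 0
        "RFT": run(rft_step, accepted),
        "RLVR": run(rlvr_step, accepted),
    }
    for name, p in results.items():
        mass = p[:4].sum()
        print(f"{name}: mass on accepted = {mass:.3f}, "
              f"share of reference within accepted = {p[0] / mass:.3f}")
```

In this toy setting all three rules move probability onto acceptable answers, but only SFT also collapses the acceptable class onto the single reference; the two verifier-driven rules leave the within-class distribution untouched because the signal never distinguishes its members.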
Key Experimental Results
Main Results
This work is primarily a theoretical and case-analysis paper without a traditional benchmark table. The main results are summarized by the following criteria and case evidence.
| Scenario | Metric | Criterion / Observation | Previous Common Understanding | Contribution |
|---|---|---|---|---|
| Closed-loop self-training | \(I(X;Z_{t+1})\leq I(X;Z_t)\) | Task information cannot increase without external signals; collapse is expected | Collapse viewed as empirical instability | Provides DPI-level explanation |
| Synthetic data with verifier/env | \(I(X;S\mid Z_t)>0\) | External signals provide new task information; loop remains information-open | "Filtering improves data quality" | Defines source of new information |
| SFT vs RLVR | Meta-level | SFT learns specific references; RLVR learns correctness equivalence classes | Compares algorithmic form only | Explains RLVR efficiency and generalization |
| JudgeRLVR Case | Binary correctness signal | Judges trained only on right/wrong generalize across math/code/logic | Requires domain-specific rubrics | High meta-level signals reduce irrelevant variance |
| Reward Hacking Case | Training reward rises but true accuracy plateaus | Model learns coarser spurious signals like length/style | Reward hacking is "loophole exploitation" | Informational efficiency provides diagnostic mechanism |
Ablation Study
Analytical ablations in the paper focus on "externality," "signal granularity," and "spurious correlation removal."
| Configuration | Key Metrics | Description |
|---|---|---|
| Self-training w/o external signals | Information-closed, \(I(X;S\mid Z_t)=0\) | Only explores existing distribution; cannot judge proximity to task |
| Fixed evaluator / verifier | Information-open, \(I(X;S\mid Z_t)>0\) | Evaluator does not drift with student; provides continuous selection pressure |
| High meta-level binary signal | \(\eta_\pi=1\) | Distinguishes only success/failure; ignores all intra-class surface variance |
| Instance-level reference answers | Requires identifying specific elements | Supervisory capacity is consumed by surface forms when many answers are valid |
| Length/Style correlated with correctness | Reward rises, true accuracy stalls | Spurious signals are coarser/easier than correctness; model converges to spurious partitions first |
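The last row suggests a simple diagnostic: compare how much the reward tells you about a suspected spurious feature with how much it tells you about true correctness. The sketch below is not from the paper. It uses a plug-in mutual information estimate on hypothetical judged samples whose counts were invented so that the reward tracks length exactly and correctness only loosely; `toy_samples` and `spurious_report` are illustrative names.

```python
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) / (p(x) p(y)) = c * n / (count_x * count_y)
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def toy_samples():
    """Hypothetical judged outputs as (is_correct, is_long, reward); counts are invented."""
    counts = {(1, 1, 1): 40, (0, 1, 1): 20, (1, 0, 0): 10, (0, 0, 0): 30}
    return [row for row, c in counts.items() for _ in range(c)]

def spurious_report(samples):
    """Bits shared between the reward and each candidate explanation of it."""
    correct, is_long, reward = zip(*samples)
    return {
        "correctness": empirical_mi(correct, reward),
        "length": empirical_mi(is_long, reward),
    }

if __name__ == "__main__":
    for feature, bits in spurious_report(toy_samples()).items():
        print(f"I(reward; {feature}) = {bits:.3f} bits")
```

When the reward shares more bits with length than with correctness, as in this invented sample, optimizing the reward is likely to pull the model toward the length partition. The plug-in estimator is biased upward on small samples, so real use would need enough judged outputs per cell or a bias correction.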
Key Findings
- The core bottleneck of synthetic data is verification capacity. Mathematics, code, and formal reasoning progress quickly because these fields easily provide stable, high meta-level external signals.
- Both randomness and external signals are indispensable. Random sampling provides candidate diversity, while external signals provide selection direction. Sampling without signals leads to collapse; signals without diverse candidates leave nothing new to select, so exploration stalls.
- Partition coverage explains why diverse data outperforms repeated data: repeated prompt-output pairs provide almost no new task information, while samples that cover new partition blocks provide fresh bits.
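The partition-coverage argument in the last finding can be made concrete with a deliberately crude model. The sketch assumes that each partition block hides one independent label worth a fixed number of bits and that a dataset reveals the labels of exactly the blocks it touches; `fresh_bits` and the modulo-10 block map are hypothetical.

```python
def fresh_bits(samples, block_of, bits_per_block=1.0):
    """Bits revealed by a dataset when each covered partition block hides one independent label."""
    covered = {block_of(sample) for sample in samples}
    return bits_per_block * len(covered)

def block_of(prompt_id):
    """Assumed partition: prompts fall into 10 blocks by their id modulo 10."""
    return prompt_id % 10

if __name__ == "__main__":
    repeated = [3] * 100            # the same prompt-output pair 100 times
    diverse = list(range(100))      # 100 distinct prompts spread over all blocks
    print("repeated:", fresh_bits(repeated, block_of), "bits")
    print("diverse: ", fresh_bits(diverse, block_of), "bits")
```

Under these assumptions, dataset size matters only through the number of distinct blocks it reaches, which is why a hundred copies of one pair are worth the same as a single copy.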
Highlights & Insights
- The paper elevates the question of "whether synthetic data is effective" from an empirical recipe to an information boundary problem. This perspective is practical because it forces one to ask which variable in the pipeline carries task information that the current model does not already have.
- The meta-level formulation is highly explanatory. Many seemingly different success cases (code unit tests, Lean proof checkers, format rewards, SAT solver evaluators) essentially compress diverse outputs into a few behavioral equivalence classes.
- The explanation for reward hacking is more actionable than common narratives: if spurious features are more efficient than the real target, learning them is a rational optimization outcome. Thus, data construction must actively break coarse-grained spurious correlations like length, style, or source model.
- The paper provides inspiration for training design: instead of pursuing more synthetic samples, prioritize investing in stable verifiers, fixed judges, executable environments, and diverse prompt coverage.
Limitations & Future Work
- The framework is primarily a qualitative explanation and cannot directly predict exact sample requirements, verifier precision, or training rounds needed for a pipeline to succeed.
- The thesis that "models prioritize convergence to the most information-efficient signal" relies on simplicity bias in gradient learning and case support, rather than being strictly derived from mutual information decomposition.
- External signals are treated as given variables, but in reality, verifiers have errors, blind spots, and vulnerability to optimization attacks. Designing verifiers that remain reliable as model capability increases is an open problem.
- For open-ended generation, safety preferences, and creative tasks, high-quality meta-level signals are difficult to construct, likely requiring hybrid solutions involving human labeling, fixed rubrics, model judges, and behavioral tests.
Related Work & Insights
- vs Model Collapse: Existing work observes degradation in self-consuming generative models; this paper uses DPI to explain it as an inevitable trend of information-closed loops.
- vs RLVR / Verifiable Reward: The success of RLVR is not just an RL trick, but the verifier acting as an external signal to keep the training loop information-open.
- vs SFT Synthetic Data: SFT effectively distills fixed data, but when multiple answers are acceptable, it spends capacity on replicating specific references; this paper identifies this as an inefficiency of low meta-level signals.
- vs Sutton’s Bitter Lesson: The paper interprets "letting computation discover structure" as using meta-level constraints rather than instance-level hardcoded knowledge: signals define what is correct, not what it must look like.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Uses information-openness and meta-level injection to unify synthetic data, RLVR, collapse, and reward hacking.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Extensive case and theoretical analysis, though it lacks a unified reproducible benchmark with quantitative predictions.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear conceptual chain, from DPI to information decomposition to practical cases.
- Value: ⭐⭐⭐⭐⭐ High reference value for designing LLM synthetic data pipelines, verifiers, and reward models.