An Information-Theoretic Criterion for Efficient Data Synthesis

Conference: ICML2026
arXiv: 2605.16379
Code: None
Area: LLM Pre-training
Keywords: Synthetic data, information theory, data processing inequality, external verifier, reward hacking

TL;DR

This paper uses the Data Processing Inequality (DPI) to explain why synthetic data sometimes helps and sometimes causes model collapse: a synthetic data pipeline is information-open only if the training loop continuously receives stable external signals. It further argues that verification signals at a high meta-level are more sample-efficient and generalize better than instance-level imitation.

Background & Motivation

Background: Large language model training increasingly relies on synthetic data. Real high-quality text is approaching a supply bottleneck, while tasks such as mathematics, coding, tool use, and long-range reasoning require stronger supervision signals than generic web corpora. Successful cases include verifier-guided synthesis, RLVR, program test feedback, formal proof checkers, and fixed rubric evaluators.

Limitations of Prior Work: Synthetic data is a double-edged sword. Repeated self-training using a model's own outputs often causes distribution collapse, loss of long-tail modes, and performance degradation. However, synthetic pipelines with verifiers or environmental feedback can yield significant gains. While much empirical evidence exists, a unified explanation is missing: when does synthetic data inject new information, and when does it merely recycle the existing distribution of the model?

Key Challenge: Data generated by a model does not inherently increase information about the real task. If training samples are derived entirely from the current model without feedback independent of the model's state, the Data Processing Inequality (DPI) implies that task-relevant information will only remain constant or decrease. Yet, real-world successes in RLVR, code testing, and proof checking show model improvement, suggesting that successful pipelines are not closed systems.

Goal: The authors aim to provide an information-theoretic criterion to determine the effectiveness of synthetic data pipelines. Furthermore, they seek to explain why certain external signals can produce cross-domain generalization with few samples while others require massive data for limited gains.

Key Insight: The paper formulates training as relationships between random variables: real task structure \(X\), existing data \(D\), model state \(Z\), and external signal \(S\). If \(X\to D\to Z\) or \(X\to Z_t\to D_t^{syn}\to Z_{t+1}\) forms an information-closed loop, DPI dictates that task-relevant information is non-increasing along the chain. If a stable \(S\) is introduced, the upper bound rises to \(I(X;D,S)\) or \(I(X;Z_t,S)\).

Core Idea: The effectiveness of synthetic data depends not on whether it is "model-generated," but on whether the training loop is information-open and whether external signals inject task-relevant information at a high meta-level.

Method

Overall Architecture

This paper does not propose a new training algorithm but provides an information-theoretic "judgment criterion" for synthetic data pipelines. The core approach involves modeling the entire training process as relationships between random variables—task structure \(X\), data \(D\), model state \(Z\), and external signal \(S\)—and analyzing information flow via the Data Processing Inequality (DPI). The framework distinguishes between information-closed and information-open pipelines to explain model collapse, uses "meta-level" to characterize sample efficiency across SFT, RFT, and RLVR, and employs a task-relevant partition to decompose supervision signals into valid task information and intra-class noise, thereby unifying generalization, diversity, and reward hacking within a single framework.

Training is abstracted as a map from data \(D\) and internal randomness \(R\) to the model state \(Z=f_{train}(D,R)\). If the training procedure has no additional side information about the task structure \(X\), the variables form a Markov chain \(X\to D\to Z\), and DPI yields \(I(X;Z)\leq I(X;D)\): the model cannot hold more information about the real task than its training data does. For iterative synthesis, if \(D_t^{syn}\) is generated solely by \(Z_t\) without verifiers, environments, fixed judges, or new data, then \(X\to Z_t\to D_t^{syn}\to Z_{t+1}\), so \(I(X;Z_{t+1})\leq I(X;Z_t)\) and closed-loop self-training cannot systematically increase the model's information about the task. In practice, sampling error, optimization error, and distribution collapse typically make this inequality strict. The turning point is the introduction of an external signal \(S\): the upper bound relaxes to \(I(X;Z)\leq I(X;D,S)=I(X;D)+I(X;S\mid D)\), and all genuinely new information comes from the conditional mutual information \(I(X;S\mid D)\).
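The two bounds above can be checked numerically on a toy discrete system. The sketch below is illustrative only: the symmetric channels, their noise levels, and all names (`noisy_channel`, `x_to_d`, `d_to_z`, `x_to_s`) are assumptions made here, not constructions from the paper. It builds a chain \(X\to D\to Z\) and an external signal \(S\) that is a separate noisy view of \(X\), then compares \(I(X;Z)\), \(I(X;D)\), and \(I(X;S\mid D)\).

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a 2-D joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

def noisy_channel(n, flip):
    """n-ary symmetric channel: keep the symbol w.p. 1 - flip, else move uniformly."""
    ch = np.full((n, n), flip / (n - 1))
    np.fill_diagonal(ch, 1.0 - flip)
    return ch

n = 4
p_x = np.full(n, 1.0 / n)           # hypothetical uniform task variable X
x_to_d = noisy_channel(n, 0.2)      # data D is a noisy view of X
d_to_z = noisy_channel(n, 0.1)      # model state Z depends on X only through D
x_to_s = noisy_channel(n, 0.3)      # external signal S: a separate noisy view of X

joint_xd = p_x[:, None] * x_to_d    # p(x, d)
joint_xz = joint_xd @ d_to_z        # p(x, z) = sum_d p(x, d) p(z | d)
i_xd = mutual_information(joint_xd)
i_xz = mutual_information(joint_xz)

# p(x, d, s) = p(x, d) p(s | x); flatten (d, s) into one axis to get I(X; D, S).
joint_xds = joint_xd[:, :, None] * x_to_s[:, None, :]
i_x_ds = mutual_information(joint_xds.reshape(n, n * n))
i_xs_given_d = i_x_ds - i_xd        # chain rule: I(X;D,S) = I(X;D) + I(X;S|D)

print(f"I(X;D)   = {i_xd:.4f} bits")
print(f"I(X;Z)   = {i_xz:.4f} bits")
print(f"I(X;S|D) = {i_xs_given_d:.4f} bits")
```

In this toy setting the closed chain can only lose information between \(D\) and \(Z\), while the conditional term is positive because \(S\) observes \(X\) through a channel that does not pass through \(D\).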

Key Designs

1. Information-open criterion: Determining pipeline sustainability via conditional mutual information

The criterion separates pure self-training, which tends to collapse, from verifier-based pipelines, which can keep improving. If the external signal \(S\) satisfies \(I(X;S\mid D)>0\) (or \(I(X;S\mid Z_t)>0\) in the iterative setting), it carries task-relevant information beyond the current data or model, and the pipeline is information-open. Without such an \(S\), random sampling provides candidates but no selection pressure toward the real task; the loop closes, DPI applies, and task-relevant information can only stay constant or decrease. This explains both why closed-loop self-training degrades and why code tests, proof checkers, and fixed rubrics are effective: they are stable external signals that do not drift with the student model.
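The closed-loop case can be illustrated with a deliberately simple stand-in for self-training: a categorical "model" that is repeatedly refit on a finite sample of its own outputs. This is a toy assumed here for illustration, not the paper's experiment; the Zipf-shaped starting distribution, the sample size, and the function name `closed_loop_support` are all hypothetical.

```python
import numpy as np

def closed_loop_support(p0, n_samples, n_rounds, seed=0):
    """Track how many modes survive when a model is refit on its own samples."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, dtype=float)
    sizes = [int((p > 0).sum())]
    for _ in range(n_rounds):
        counts = rng.multinomial(n_samples, p)  # D_t^syn is sampled from Z_t only
        p = counts / n_samples                  # Z_{t+1}: empirical refit, no external S
        sizes.append(int((p > 0).sum()))
    return sizes

# Hypothetical long-tailed starting distribution over 50 modes.
ranks = np.arange(1, 51)
zipf = (1.0 / ranks) / (1.0 / ranks).sum()

support_sizes = closed_loop_support(zipf, n_samples=200, n_rounds=30)
print(support_sizes)
```

Once a mode receives zero samples its probability becomes zero and nothing inside the loop can bring it back, so the support size never grows. Restoring a lost mode would require information from outside the loop, which is the role the criterion assigns to \(S\).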

2. Meta-level information injection: Determining sample efficiency via "behavioral equivalence classes"

Being information-open is insufficient; sample efficiency varies depending on the "level" of signal injection. High meta-level signals only distinguish behaviorally significant equivalence classes, such as "is the answer correct" or "does the program pass the test." Low meta-level signals require replicating a specific reference answer. If a problem has \(M\) equally acceptable answers, high-level signals treat them as a single class, while instance-level SFT spends up to \(\log M\) bits identifying a specific surface form—capacity that is useless for the task. This explains why RLVR's binary correctness generalizes across domains (it constrains "what is correct"), while single-reference imitation wastes capacity on irrelevant phrasing and formatting.
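As a worked instance of the \(\log M\) gap, assume (hypothetically) that a reference answer \(Y^\ast\) is drawn uniformly from the \(M\) acceptable answers to a prompt \(Q\), and that the verifier emits a binary verdict \(V\). Under those assumptions, measuring in bits:

\[H(Y^\ast\mid Q)=\log_2 M,\qquad H(V\mid Q)\leq 1,\qquad \text{e.g. } M=1024\ \Rightarrow\ H(Y^\ast\mid Q)=10\ \text{bits}.\]

All 10 bits in this example identify a surface form inside a single equivalence class, whereas the verdict's at most 1 bit is spent entirely on class membership.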

3. Supervision decomposition: Separating task signals and intra-class noise for reward hacking

Given a task-relevant partition \(\pi\) (grouping outputs by "task equivalence"), any supervision signal satisfies the decomposition:

\[I(Y;S\mid Q)=I([Y]_\pi;S\mid Q)+I(Y;S\mid [Y]_\pi,Q)\]

The first term \(I([Y]_\pi;S\mid Q)\) is the true task-relevant gain, while the second is the intra-class gain (surface variation within an equivalence class). Efficiency \(\eta_\pi\) is defined as the task-relevant term's share of the total, \(\eta_\pi = I([Y]_\pi;S\mid Q)/I(Y;S\mid Q)\). When the signal depends only on \([Y]_\pi\), \(\eta_\pi=1\). Reward hacking occurs when a coarser, easier-to-learn proxy signal (e.g., length, style) correlates with the reward. Because gradient learning has a simplicity bias, the model converges to this proxy partition first, so training reward rises while true accuracy stagnates. This reframes reward hacking as a rational optimization outcome in which the model learns the most easily fit component of the signal.
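The decomposition can be verified on a small example. The sketch below makes several assumptions for illustration: a single fixed prompt (so conditioning on \(Q\) is dropped), four outputs split into a correct and an incorrect block with a short and a long surface form each, a binary reward \(S\), and \(\eta_\pi\) taken as the ratio of the task-relevant term to the total. The reward tables and the helper names `mi` and `decompose` are hypothetical.

```python
import numpy as np

def mi(joint):
    """I(A;B) in bits from a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

def decompose(p_y, p_s_given_y, block_of):
    """Return (I(Y;S), I([Y];S), I(Y;S | [Y])) for a partition given by block_of."""
    joint_ys = p_y[:, None] * p_s_given_y            # p(y, s)
    n_blocks = max(block_of) + 1
    joint_bs = np.zeros((n_blocks, joint_ys.shape[1]))
    for y, b in enumerate(block_of):
        joint_bs[b] += joint_ys[y]                   # p([y], s)
    intra = 0.0
    for b in range(n_blocks):                        # sum_b p(b) * I(Y;S | [Y]=b)
        rows = [y for y, bb in enumerate(block_of) if bb == b]
        sub = joint_ys[rows]
        if sub.sum() > 0:
            intra += sub.sum() * mi(sub / sub.sum())
    return mi(joint_ys), mi(joint_bs), intra

# Outputs: 0 = correct/short, 1 = correct/long, 2 = incorrect/short, 3 = incorrect/long.
p_y = np.full(4, 0.25)
block_of = [0, 0, 1, 1]

# Columns are p(S=0 | y), p(S=1 | y).
clean = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])   # correctness only
leaky = np.array([[0.4, 0.6], [0.1, 0.9], [0.9, 0.1], [0.5, 0.5]])   # also rewards length

for name, table in [("clean", clean), ("leaky", leaky)]:
    total, task, intra = decompose(p_y, table, block_of)
    print(f"{name}: total={total:.4f} task={task:.4f} intra={intra:.4f} eta={task / total:.4f}")

total_clean, task_clean, intra_clean = decompose(p_y, clean, block_of)
total_leaky, task_leaky, intra_leaky = decompose(p_y, leaky, block_of)
```

In the `leaky` table part of the reward's information concerns length inside each block, which is the component a model can fit without becoming more accurate.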

Loss & Training

The framework compares three ways of injecting the signal. SFT maximizes the log-likelihood of fixed pairs, raising the probability of specific references. RFT generates candidates, filters them with an external acceptance test, and then runs SFT on the accepted samples. RLVR feeds verifier rewards into a policy gradient, weighting self-generated outputs by their advantage. The real dividing line is not whether the data is synthetic but how the external signal enters the gradient and the sample selection, which corresponds to the information-open and meta-level dimensions.
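To make the three injection routes concrete, here is a minimal sketch on a softmax policy over 8 candidate outputs for one prompt, of which the first 4 are acceptable. Everything in it is an assumption for illustration: the tabular policy, the verifier `y < M`, the mean-reward baseline, the learning rate, and the function names. It is not the paper's training setup.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sft_step(logits, target, lr):
    """Ascend log p(target); the gradient w.r.t. logits is onehot(target) - p."""
    grad = -softmax(logits)
    grad[target] += 1.0
    return logits + lr * grad

def rft_step(logits, verifier, rng, n_samples, lr):
    """Sample candidates, keep those the verifier accepts, then SFT on them."""
    samples = rng.choice(len(logits), size=n_samples, p=softmax(logits))
    for y in samples:
        if verifier(y):
            logits = sft_step(logits, y, lr / n_samples)
    return logits

def rlvr_step(logits, verifier, rng, n_samples, lr):
    """REINFORCE with a mean-reward baseline: weight grad log p(y) by advantage."""
    p = softmax(logits)
    samples = rng.choice(len(logits), size=n_samples, p=p)
    rewards = np.array([1.0 if verifier(y) else 0.0 for y in samples])
    advantages = rewards - rewards.mean()
    grad = np.zeros_like(logits)
    for y, a in zip(samples, advantages):
        g = -p.copy()
        g[y] += 1.0
        grad += a * g
    return logits + lr * grad / n_samples

K, M = 8, 4                       # 8 candidate outputs, the first 4 are acceptable
reference = 0                     # the single reference answer used by SFT
verifier = lambda y: y < M        # hypothetical binary correctness check
rng = np.random.default_rng(0)

logits_sft, logits_rft, logits_rlvr = np.zeros(K), np.zeros(K), np.zeros(K)
for _ in range(300):
    logits_sft = sft_step(logits_sft, reference, lr=1.0)
    logits_rft = rft_step(logits_rft, verifier, rng, n_samples=16, lr=1.0)
    logits_rlvr = rlvr_step(logits_rlvr, verifier, rng, n_samples=16, lr=1.0)

def correct_mass(logits):
    return float(softmax(logits)[:M].sum())

for name, lg in [("SFT", logits_sft), ("RFT", logits_rft), ("RLVR", logits_rlvr)]:
    print(f"{name}: correct mass={correct_mass(lg):.3f}, p(reference)={softmax(lg)[reference]:.3f}")
```

In this sketch SFT never consults the verifier and pushes mass onto one surface form, while RFT and RLVR see only the class-level verdict, so the signal is indifferent to which acceptable output receives the mass.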

Key Experimental Results

Main Results

This paper focuses on theoretical and case analysis rather than traditional benchmarks.

Scenario	Metric	Criterion / Observation	Prior Understanding	Gain
Closed-loop self-training	\(I(X;Z_{t+1})\leq I(X;Z_t)\)	Task info cannot increase without external signals; collapse is expected	Collapse as empirical instability	DPI-level explanation
Synthesis with verifiers	\(I(X;S\mid Z_t)>0\)	External signals provide new task info; loop remains information-open	"Filtering improves quality"	Clarifies source of new info
SFT vs RLVR	Meta-level	SFT learns specific references; RLVR learns correctness classes	Comparison of algorithm forms	Explains RLVR efficiency and generalization
JudgeRLVR Case	Binary correctness	Judges trained on correct/incorrect generalize across math/code	Need domain-specific rubrics	High meta-level reduces noise
Reward Hacking	Reward rises; Accuracy stalls	Model learns proxy signals like length/style	Model "taking shortcuts"	Explains mechanism via efficiency

Ablation Study

Analytical ablations focus on "externality," "signal granularity," and "spurious correlation removal."

Configuration	Key Metric	Description
Self-training (No S)	Information-closed	Can only explore existing distributions; cannot identify task-proximal samples
Fixed evaluator	Information-open	Evaluator does not drift with the student; provides continuous selection pressure
High meta-level binary	\(\eta_\pi=1\)	Ignores surface differences; only distinguishes correctness/satisfaction
Instance-level ref	Requires class elements	Supervision capacity wasted on surface forms when many answers are valid
Correlated proxies	Reward up; Accuracy flat	Proxy signals are coarser/easier to learn; model converges to proxy partition

Key Findings

The core bottleneck of synthetic data is verification capacity. Progress is faster in math and code because these fields provide stable, high meta-level external signals.
Both randomness and external signals are essential. Sampling provides diversity; the signal provides direction.
Diversity's advantage can be explained by partition coverage: repeating prompt-output pairs provides almost no new task info, whereas samples covering new partition blocks provide fresh bits.

Highlights & Insights

Reframes the "effectiveness of synthetic data" from empirical recipes to an information boundary problem.
Meta-level descriptions provide high explanatory power for why code tests or proof checkers are more efficient than instance-level imitation.
Reward hacking is characterized as a rational optimization result where models learn the most "efficient" signal.
Practical implication: Investment should prioritize stable verifiers, fixed judges, and environmental feedback over simply increasing synthetic sample volume.

Limitations & Future Work

The framework is primarily qualitative and does not predict specific sample requirements or verifier precision needed for success.
The "simplicity bias" assumption for reward hacking relies on case evidence rather than being strictly derived from mutual information formulas.
External signals are assumed given, but verifiers in practice have errors and can be targets of optimization attacks.
High meta-level signals for open-ended or creative tasks are difficult to construct and may require hybrid human-model rubrics.

Related Work

vs. Model Collapse: Extends observations of generative model decay by defining them as inherent trends in information-closed loops.
vs. RLVR: Success is attributed to the verifier's role as an external signal that keeps the training loop information-open.
vs. SFT Synthesis: Explains the inefficiency of SFT when multiple answers are acceptable as wasting supervision capacity on surface forms.
vs. Bitter Lesson: Interprets "letting computation discover structure" as using meta-level constraints rather than hard-coding instance-level knowledge.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐☆
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐