Neon: Negative Extrapolation From Self-Training Improves Image Generation¶
Conference: ICLR 2026 Oral · arXiv: 2510.03597 · Code: github.com/VITA-Group/Neon · Area: Image Generation / Self-Training · Keywords: self-training, model collapse, weight merging, negative extrapolation, FID
TL;DR¶
Neon is proposed as a post-processing method requiring <1% additional training compute: the model is first fine-tuned on its own synthetic data (causing degradation), then negatively extrapolated away from the degraded weights. The paper proves that mode-seeking samplers cause anti-alignment between synthetic and real data gradients, so negative extrapolation is equivalent to optimizing toward the real data distribution. On ImageNet 256×256, xAR-L achieves SOTA FID of 1.02.
Background & Motivation¶
Background: Scaling generative models is constrained by the scarcity of high-quality training data. Self-training with synthetically generated data is an intuitive solution, but leads to Model Autophagy Disorder (MAD) / Model Collapse—rapid degradation in sample quality and diversity.
Limitations of Prior Work: (a) SIMS requires 2× inference NFE, a large synthetic set (100K samples), and significant additional training compute (20%); (b) DDO requires multiple rounds of iteration (16 rounds × 50K samples); (c) existing methods lack a unified theoretical explanation for why self-training degrades models and how the degradation can be exploited.
Key Challenge: Self-training degradation appears wasteful, yet the degradation direction itself carries information—if the direction of degradation can be understood, it can be exploited in reverse.
Goal: Can the degradation signal from self-training be transformed into a self-improvement signal, with theoretical guarantees?
Key Insight: The authors observe that mode-seeking samplers (temperature <1, top-k, finite-step ODE solvers) bias synthetic data toward high-probability regions of the model distribution, causing the population gradients of synthetic and real data to be anti-aligned (\(\cos\varphi < 0\)). Consequently, reversing the self-training gradient is equivalent to optimizing toward the real data distribution.
Core Idea: Self-training degrades the model, but the direction of degradation is precisely the opposite of the direction of improvement; therefore, negative extrapolation along that direction improves the model.
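The anti-alignment claim is straightforward to sanity-check empirically. The sketch below (an illustration under stated assumptions, not code from the paper's repository) estimates \(\cos\varphi\) by differentiating a real-data loss and a synthetic-data loss at the current weights; the two loss values are placeholders for whatever training objective the architecture uses:

```python
import torch

def gradient_cosine(model, loss_real, loss_synth):
    """Estimate cos(phi) between the real-data and synthetic-data gradients at theta_r.

    loss_real / loss_synth: scalar training losses computed on a batch of real
    samples and a batch of the model's own (mode-seeking) samples, respectively.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    g_real = torch.autograd.grad(loss_real, params, retain_graph=True)
    g_synth = torch.autograd.grad(loss_synth, params)

    def flatten(grads):
        return torch.cat([g.reshape(-1) for g in grads])

    r_d, r_s = flatten(g_real), flatten(g_synth)
    return (torch.dot(r_d, r_s) / (r_d.norm() * r_s.norm())).item()
```

Theorem 1 predicts a negative value when the samples come from a mode-seeking sampler, and a sign flip when the sampler is diversity-seeking (\(\tau > 1\)).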
Method¶
Overall Architecture¶
Neon is a minimalist three-step post-processing pipeline: (1) generate a small set of synthetic data \(S\) (~1K–6K samples) using the base model \(\theta_r\); (2) briefly fine-tune on \(S\) to obtain a degraded model \(\theta_s\); (3) perform weight negative extrapolation: \(\theta_{\text{neon}} = (1+w)\theta_r - w\theta_s\), where \(w > 0\).
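A minimal sketch of step (3), assuming PyTorch-style state dicts; `neon_merge` is an illustrative name, not the API of the released code:

```python
import torch

@torch.no_grad()
def neon_merge(base_sd, degraded_sd, w=0.8):
    """Negative extrapolation away from the self-trained weights.

    base_sd     -- state dict of the base model (theta_r)
    degraded_sd -- state dict after brief fine-tuning on the model's own samples (theta_s)
    w           -- extrapolation weight, typically in [0.5, 1.5]
    """
    merged = {}
    for k, v in base_sd.items():
        if torch.is_floating_point(v):
            # theta_neon = (1 + w) * theta_r - w * theta_s
            merged[k] = (1 + w) * v - w * degraded_sd[k]
        else:
            # integer buffers (e.g., step counters) are copied from the base model
            merged[k] = v.clone()
    return merged
```

Applying it is a single call, e.g. `base_model.load_state_dict(neon_merge(base_model.state_dict(), degraded_model.state_dict(), w=0.8))`; steps (1) and (2) reuse the base model's existing sampling and training code unchanged.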
Key Designs¶
- Negative Extrapolation Weight Merging
    - Function: Obtain an improved model via reverse extrapolation in parameter space.
    - Mechanism: \(\theta_{\text{neon}} = \theta_r + w(\theta_r - \theta_s)\), i.e., moving in the direction opposite to "base model → degraded model." In practice, this reduces to a single line of code: `merged[k] = base[k] - w * (aux[k] - base[k])`
    - Design Motivation: No inference overhead (unlike SIMS, which requires 2× NFE), no new real data required, and only a minimal number of synthetic samples needed.
- Anti-Alignment Theory (Theorem 1)
    - Function: Proves that under mode-seeking samplers, the synthetic-data gradient is anti-aligned with the real-data gradient.
    - Mechanism: Define \(r_d = \nabla_\theta \mathcal{L}_{\text{real}}(\theta_r)\) (real-data gradient) and \(r_s = \nabla_\theta \mathcal{L}_{\text{synth}}(\theta_r)\) (synthetic-data gradient). When the sampler satisfies the monotone reweighting condition and the model error \(|\varepsilon|\) is sufficiently small, \(\cos\varphi = \frac{\langle r_d, r_s \rangle}{\|r_d\| \|r_s\|} < 0\).
    - Design Motivation: This explains both why self-training degrades the model (a descent step on the synthetic loss, i.e., a move along \(-r_s\), increases the real-data loss) and why negative extrapolation works (reversing that step moves along \(+r_s\), which by anti-alignment decreases the real-data loss).
- Neon Reduces Population Risk (Theorem 2)
    - Function: Proves that an appropriate \(w > 0\) guarantees improvement.
    - Mechanism: \(\mathcal{L}_{\text{real}}(\theta_{\text{neon}}) < \mathcal{L}_{\text{real}}(\theta_r)\), and the optimal \(w\) can be predicted from the gradient-alignment analysis (a short first-order sketch is given after this list).
    - Design Motivation: Provides rigorous theoretical guarantees rather than a purely empirical recipe.
- U-Shaped Training Budget Curve
    - Function: Explains the non-monotonic effect of the fine-tuning budget \(B\) on performance.
    - Mechanism: When \(B\) is too small, high variance makes the estimate of the degradation direction inaccurate; when \(B\) is too large, higher-order terms dominate and the first-order Taylor approximation breaks down. The optimal range is 1–2% of the base training budget.
    - Design Motivation: Guides hyperparameter selection.
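To make the two theorems concrete, here is a one-step first-order sketch (the notation \(r_d, r_s\) follows Theorem 1; the effective step size \(\eta\) of the brief fine-tuning is an added assumption, not notation from the paper):

\[
\begin{aligned}
\theta_s &\approx \theta_r - \eta\, r_s && \text{(brief fine-tuning descends the synthetic loss)}\\
\theta_{\text{neon}} &= \theta_r + w(\theta_r - \theta_s) \approx \theta_r + w\eta\, r_s\\
\mathcal{L}_{\text{real}}(\theta_{\text{neon}}) &\approx \mathcal{L}_{\text{real}}(\theta_r) + w\eta\,\langle r_d, r_s\rangle < \mathcal{L}_{\text{real}}(\theta_r) && \text{(since } \cos\varphi < 0 \text{ by Theorem 1)}
\end{aligned}
\]

The same expansion also explains the U-shaped budget curve: if fine-tuning runs too long, \(\theta_s - \theta_r\) is no longer well described by a single gradient step and the first-order argument no longer applies.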
Loss & Training¶
The fine-tuning stage uses each architecture's standard training loss without modification. The extrapolation weight \(w\) is typically chosen in \([0.5, 1.5]\), with \(w \approx 0.8\text{–}1.0\) recommended. For class-conditional models, \(w\) should be tuned jointly with the classifier-free guidance (CFG) scale \(\gamma\).
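A hedged sketch of how that joint tuning might be scripted; `build_neon_model` and `compute_fid` are hypothetical stand-ins for one's own merge and evaluation code, and the grids simply span the ranges quoted above:

```python
import itertools

# Illustrative joint sweep over the extrapolation weight w and the CFG scale gamma.
w_grid = [0.5, 0.8, 1.0, 1.2, 1.5]
gamma_grid = [1.0, 1.5, 2.0, 3.0]

best = None
for w, gamma in itertools.product(w_grid, gamma_grid):
    model = build_neon_model(base_sd, degraded_sd, w=w)   # e.g., load neon_merge(...) weights
    fid = compute_fid(model, cfg_scale=gamma, num_samples=10_000)
    if best is None or fid < best[0]:
        best = (fid, w, gamma)

print(f"best FID {best[0]:.2f} at w={best[1]}, gamma={best[2]}")
```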
Key Experimental Results¶
Main Results¶
| Model | Type | Dataset | Baseline FID | Neon FID | FID Change |
|---|---|---|---|---|---|
| xAR-L | Flow matching | ImageNet-256 | 1.28 | 1.02 | -20.3% |
| xAR-B | Flow matching | ImageNet-256 | 1.72 | 1.31 | -23.8% |
| VAR d16 | Autoregressive | ImageNet-256 | 3.30 | 2.01 | -39.1% |
| VAR d36 | Autoregressive | ImageNet-512 | 2.63 | 1.70 | -35.4% |
| EDM (cond.) | Diffusion | CIFAR-10 | 1.78 | 1.38 | -22.5% |
| EDM (uncond.) | Diffusion | FFHQ-64 | 2.39 | 1.12 | -53.1% |
| IMM | Moment matching | ImageNet-256 | 1.99 | 1.46 | -26.6% |
Ablation Study¶
| Ablation Dimension | Key Findings |
|---|---|
| Training budget \(B\) | U-shaped curve: optimum at 1–2% of base training budget |
| Merging weight \(w\) | \(w=-1\) (direct self-training) degrades; \(w \in [0.5, 1.5]\) yields consistent improvement |
| Number of synthetic samples | Effective with as few as 1K; diminishing returns beyond 6K |
| Cross-architecture synthesis | Synthetic data generated by one architecture can improve another |
Efficiency Comparison¶
| Method | FID (EDM, cond. CIFAR-10) | Extra Compute | Synthetic Samples | Inference Overhead |
|---|---|---|---|---|
| Neon | 1.38 | 1.75% | 6K | None |
| SIMS | 1.33 | 20% | 100K | 2× NFE |
| DDO | 1.30 | 12% | 800K | None |
Key Findings¶
- Cross-Architecture Generality: The same method applies without modification to four architecture types: diffusion, flow matching, autoregressive, and moment matching.
- Precision–Recall Trade-off: Neon primarily improves recall (diversity) at a slight cost to precision, with a net reduction in FID.
- Mode-Seeking vs. Diversity-Seeking: When the sampler is diversity-seeking (\(\tau > 1\)), the gradient alignment flips and negative extrapolation fails—a boundary condition predicted by theory.
- SOTA: xAR-L + Neon achieves FID 1.02 on ImageNet 256×256 with only 0.36% additional compute.
Highlights & Insights¶
- The core insight of "degradation as signal" is remarkably elegant: model collapse is transformed from a problem into a tool, exploiting the information encoded in the degradation direction. The philosophical implication is profound—"knowing the wrong direction is equivalent to knowing the right direction."
- Minimal implementation: The entire method reduces to a single line of weight-merging code, with no changes to the inference pipeline, no new real data, and no extra inference overhead; it is hard to imagine a more parsimonious post-training procedure.
- Theory and practice correspond closely: the analysis predicts both the U-shaped budget curve and the mode-seeking boundary condition observed empirically, a rare example of a practical method with a genuinely predictive theory.
Limitations & Future Work¶
- Hyperparameter tuning: \(w\) and \(B\) require some tuning (though the effective range is broad), and no automatic selection mechanism is provided.
- Validated on image generation only: The approach has not been evaluated on NLP (where language models also use temperature/top-k) or video generation.
- One-shot correction: Neon relies on a local first-order approximation and cannot be applied iteratively (repeated negative extrapolation invalidates the Taylor expansion).
- Precision–diversity trade-off: The method does not improve peak generation quality, only the proportion of samples exceeding a quality threshold.
Related Work & Insights¶
- vs. SIMS: SIMS corrects at inference time using the score difference between the base and self-trained models, requiring 2× NFE. Neon performs a one-shot weight merge after training, incurring no inference overhead.
- vs. DDO (Direct Discriminative Optimization): DDO requires 16 iterative rounds × 50K samples, with substantially higher compute than Neon.
- Connection to "Why DPO is Misspecified": Both works exploit information encoded in misspecification or degradation directions—DPO's misspecified projection and Neon's degradation gradient both embody the principle of "leveraging bias signals."
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The idea of "reversing degradation" is highly original; both theory and method are novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four architectures, three datasets, extensive ablations, and thorough efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Theory is clearly presented, figures are intuitive, and the method is implementable in a single line of code.
- Value: ⭐⭐⭐⭐⭐ High generality, minimal cost, and theoretical guarantees—a strong candidate to become a standard post-training step for generative models.