# Rectified Noise: A Generative Model Using Positive-incentive Noise
- Conference: AAAI 2026
- arXiv: 2511.07911
- Code: https://github.com/simulateuser538/Rectified-Noise
- Area: Image Generation
- Keywords: Rectified Flow, Positive-incentive Noise, Flow Matching, SiT, Generative Models
## TL;DR
This paper proposes Rectified Noise (ΔRN), which leverages the positive-incentive noise (π-noise) framework to learn beneficial noise and inject it into the velocity field of a pretrained Rectified Flow model, reducing FID on ImageNet-1k from 10.16 to 9.05 (and to 9.06 with only 0.39% additional parameters).
## Background & Motivation
### State of the Field
Rectified Flow (RF) is an efficient generative modeling approach that learns a velocity field by connecting the source and target distributions along straight-line paths. RF directly parameterizes a continuous-time transport map without introducing additional stochasticity, using a simple regression objective (with \(\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1\)):

$$\mathcal{L}(\psi) = \mathbb{E}_{t,\, \mathbf{x}_0,\, \mathbf{x}_1}\Big[\big\|\mathbf{v}_\psi(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0)\big\|^2\Big]$$
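A minimal PyTorch sketch of this objective (`velocity_model` is a generic stand-in; names and shapes are illustrative, not the paper's code):

```python
import torch

def rf_loss(velocity_model, x1):
    """Standard Rectified Flow loss: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # source sample (standard Gaussian)
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over non-batch dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight-line path
    target = x1 - x0                               # constant velocity along that path
    pred = velocity_model(xt, t)                   # v_psi(x_t, t)
    return ((pred - target) ** 2).mean()
```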
SiT (Scalable Interpolant Transformers), a family of RF models built upon DiT, achieves strong generative performance through systematic exploration of the design space.
### Limitations of Prior Work
An Intriguing Observation: Although RF is based on a probability flow ODE, recent work (SiT) finds that injecting stochastic noise during sampling via a reverse-time SDE actually improves generation quality (lower FID). This implies:

- Deterministic sampling in RF is not necessarily optimal.
- Certain forms of noise can be beneficial for RF.
### Core Problem
This observation raises two key questions:

1. What kind of stochastic noise can bring performance gains to RF?
2. How should beneficial noise be introduced into RF?
### Starting Point
The π-noise (positive-incentive noise) framework provides a theoretical foundation: beneficial noise \(\mathcal{E}\) is learned by maximizing the mutual information between the task \(\mathcal{T}\) and the noise:

$$\max\; I(\mathcal{T}; \mathcal{E}) = H(\mathcal{T}) - H(\mathcal{T} \mid \mathcal{E})$$
This paper establishes a connection between the π-noise framework and RF, and designs a π-noise generator to automatically learn the optimal noise.
## Method
### Overall Architecture
The Rectified Noise pipeline consists of two stages:

1. Pretraining the RF model to obtain optimal parameters \(\psi^*\).
2. Training the π-noise generator: freezing the RF parameters, attaching trainable SiT blocks to predict π-noise, and injecting it into the velocity field.
At inference, sampling proceeds exactly as in standard RF, except that the learned π-noise is added to the predicted velocity field at each step (sketched below).
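A sketch of this inference loop as a plain Euler integrator (`v_psi` and `pi_noise` are hypothetical callables standing in for the frozen RF model and the trained π-noise generator):

```python
import torch

@torch.no_grad()
def sample_with_pi_noise(v_psi, pi_noise, shape, steps=50, device="cpu"):
    """Euler integration of the RF ODE, with learned pi-noise added to the velocity."""
    x = torch.randn(shape, device=device)              # draw from the source distribution
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = v_psi(x, t)                                # frozen pretrained velocity field
        eps = pi_noise(x, t)                           # sample the learned pi-noise
        x = x + (v + eps) * dt                         # Euler step on the perturbed velocity
    return x
```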
### Key Designs
#### 1. Defining Task Entropy via the RF Loss
Mechanism: The complexity of what the RF model must learn from a given dataset needs to be quantified. An auxiliary random variable \(\alpha\) is introduced to connect the RF loss to information entropy, via a zero-mean Gaussian whose variance is the per-sample loss:

$$p(\alpha \mid \mathbf{x}, t) = \mathcal{N}\big(\alpha \mid 0,\; \mathcal{L}(\mathbf{x}, t; \psi^*)\big)$$
where \(\mathcal{L}(\mathbf{x}, t; \psi^*)\) is the loss of the optimal RF model on sample \(\mathbf{x}\) at time \(t\). A higher loss yields a larger variance in the auxiliary distribution, a higher entropy, and thus a harder task.
Task entropy is then defined as the expected entropy of this auxiliary distribution:

$$H(\mathcal{T}) = \mathbb{E}_{\mathbf{x},\, t}\Big[H\big(p(\alpha \mid \mathbf{x}, t)\big)\Big]$$
Design Motivation: The auxiliary Gaussian distribution elegantly bridges the regression loss of RF and information entropy, laying the theoretical groundwork for applying the π-noise framework to generative models.
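The bridge is simply the closed-form entropy of a Gaussian, which is monotone in the variance; substituting the loss as the variance makes the loss-entropy link explicit:

$$H\big(\mathcal{N}(0, \sigma^2)\big) = \tfrac{1}{2}\log\big(2\pi e\,\sigma^2\big) \quad\Longrightarrow\quad H\big(p(\alpha \mid \mathbf{x}, t)\big) = \tfrac{1}{2}\log\big(2\pi e\,\mathcal{L}(\mathbf{x}, t; \psi^*)\big),$$

so samples on which the pretrained model incurs a higher loss contribute more entropy to the task.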
#### 2. Injecting π-noise into the RF Model
Core Derivation: Since the task entropy \(H(\mathcal{T})\) does not depend on the noise, maximizing the mutual information is equivalent to minimizing the conditional entropy \(H(\mathcal{T}|\mathcal{E})\). The noise-conditioned auxiliary loss, i.e. the RF loss with noise \(\epsilon\) injected into the predicted velocity, is defined as:

$$\mathcal{L}(\mathbf{x}, t, \epsilon; \psi^*) = \big\|\mathbf{v}_{\psi^*}(\mathbf{x}_t, t) + \epsilon - (\mathbf{x}_1 - \mathbf{x}_0)\big\|^2$$
Key Insight: When \(p(\epsilon|\mathbf{x}, t) \rightarrow \delta(\epsilon)\) (a Dirac delta, i.e., noise is always zero), the optimization objective degenerates to the standard RF loss. This shows that standard RF is a special case of ΔRN where π-noise is identically zero.
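Concretely, with the noise-conditioned loss defined above, the Dirac-delta case collapses the expectation over the noise back to the standard objective:

$$\mathbb{E}_{\epsilon \sim \delta(\epsilon)}\big[\mathcal{L}(\mathbf{x}, t, \epsilon; \psi)\big] = \mathcal{L}(\mathbf{x}, t, 0; \psi) = \mathcal{L}(\mathbf{x}, t; \psi).$$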
The final optimization objective simplifies to minimizing the expected noise-perturbed RF loss:

$$\min_{\theta}\; \mathbb{E}_{\mathbf{x},\, t}\, \mathbb{E}_{\epsilon \sim p_\theta(\epsilon \mid \mathbf{x}, t)}\big[\mathcal{L}(\mathbf{x}, t, \epsilon; \psi^*)\big]$$
The π-noise is parameterized by a neural network \(\epsilon_\theta\), trained to minimize this noise-perturbed RF loss.
#### 3. Two Optimization Strategies
Strategy 1: Joint optimization of \(\theta\) and \(\psi\)
Both parameter sets are unified via the reparameterization trick. For a Gaussian noise distribution, the perturbed velocity can be written as

$$\mathbf{v}_{\psi^*}(\mathbf{x}_t, t) + \epsilon_\theta = \underbrace{\mathbf{v}_{\psi^*}(\mathbf{x}_t, t) + \mu_\theta}_{\hat{\mu}_\theta} + \sigma_\theta \odot \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(0, I),$$

where \(\hat{\mu}_\theta = \mathbf{v}_{\psi^*} + \mu_\theta\) can be predicted by a single network.
Strategy 2: Freeze \(\psi^*\), optimize \(\theta\) only (recommended)
Intermediate feature representations from the pretrained RF model are used as inputs; additional SiT blocks are appended as the π-noise generator, with the final linear layer initialized to zero to ensure the initial output matches the original RF predictions.
Key Finding: Strategy 1 exhibits training instability (the injected stochasticity makes joint convergence difficult), whereas Strategy 2 (fine-tuning only the π-noise generator) trains stably and performs better; a sketch of Strategy 2 follows.
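A minimal PyTorch sketch of Strategy 2 under the Gaussian noise assumption. `frozen_rf` and `GaussianPiNoiseHead` are illustrative stand-ins (the actual generator consumes intermediate SiT features and appends SiT blocks; here a single zero-initialized linear head stands in for the 0-block case):

```python
import torch
import torch.nn as nn

class GaussianPiNoiseHead(nn.Module):
    """Illustrative pi-noise generator head: predicts (mu, sigma) from features.

    The final linear layer is zero-initialized, so the injected noise is exactly
    zero at the start of fine-tuning and the perturbed velocity initially matches
    the pretrained RF model, as described above.
    """
    def __init__(self, feat_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 2 * out_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, feats):
        mu, sigma = self.proj(feats).chunk(2, dim=-1)
        # sigma is left unconstrained for brevity; a sign flip is absorbed by
        # the symmetric Gaussian sample
        return mu + sigma * torch.randn_like(mu)   # reparameterization trick

def pi_noise_step(frozen_rf, noise_head, optimizer, x1):
    """One Strategy-2 training step: psi* stays frozen, only theta is updated."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1
    with torch.no_grad():                          # frozen pretrained RF model
        v, feats = frozen_rf(xt, t)                # assumed to return (velocity, features)
    eps = noise_head(feats)                        # learned pi-noise
    loss = ((v + eps - (x1 - x0)) ** 2).mean()     # noise-perturbed RF loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```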
#### π-noise Distribution Assumptions
Three reparameterizable distributions are explored (transcribed in code below):

- Gaussian: \(\mathbf{z} = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\)
- Gumbel: \(\mathbf{z} = \mu - \beta \odot \log(-\log(\epsilon)), \quad \epsilon_i \sim U(0,1)\)
- Uniform: \(\mathbf{z} = \mathbf{a} + (\mathbf{b}-\mathbf{a}) \odot \epsilon, \quad \epsilon_i \sim U(0,1)\)
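A direct transcription of these three samplers in PyTorch (the parameter tensors `mu`, `sigma`, `beta`, `a`, `b` would be produced by the π-noise generator; the clamp is a numerical-safety detail, not from the paper):

```python
import torch

def sample_gaussian(mu, sigma):
    """z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + sigma * torch.randn_like(mu)

def sample_gumbel(mu, beta, tiny=1e-9):
    """z = mu - beta * log(-log(u)), with u ~ U(0, 1), clamped away from {0, 1}."""
    u = torch.rand_like(mu).clamp(tiny, 1.0 - tiny)
    return mu - beta * torch.log(-torch.log(u))

def sample_uniform(a, b):
    """z = a + (b - a) * u, with u ~ U(0, 1)."""
    return a + (b - a) * torch.rand_like(a)
```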
### Loss & Training
- The pretrained RF model parameters are frozen.
- 0–4 additional SiT blocks are appended as the π-noise generator.
- Only the π-noise generator parameters are trained (0.39%–14.56% additional parameters).
- On ImageNet, the RF model is pretrained for 6M steps, followed by 100K steps of ΔRN fine-tuning; on AFHQ/CelebA-HQ, pretraining runs for 100K/200K steps followed by 10K fine-tuning steps.
## Key Experimental Results
### Main Results
ImageNet-1k 256×256 (without CFG):
| Model | Noise Setting | Extra SiT Blocks | Extra Params | FID↓ | IS↑ | sFID↓ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|---|---|---|
| SiT-XL/2 | - | - | - | 10.16 | 123.86 | 12.02 | 0.50 | 0.62 |
| +ΔRN | \(\mathcal{N}(\mu,\Sigma)\) | 0 | 0.39% | 9.06 | 130.21 | 11.18 | 0.52 | 0.61 |
| +ΔRN | \(\mathcal{N}(\mu,\Sigma)\) | 1 | 3.93% | 9.05 | 132.10 | 11.23 | 0.52 | 0.62 |
Cross-dataset results:
| Dataset | Baseline FID | ΔRN FID | ΔFID |
|---|---|---|---|
| ImageNet-1k | 10.16 | 9.05 | −1.11 |
| AFHQ | 12.33 | 10.44 | −1.89 |
| CelebA-HQ | 11.25 | 7.73 | −3.52 |
### Ablation Study
Effect of different noise distribution assumptions (ImageNet-1k):
| Noise Distribution | FID↓ | IS↑ | sFID↓ | Prec.↑ | Rec.↑ |
|---|---|---|---|---|---|
| None (baseline) | 10.16 | 123.86 | 12.02 | 0.50 | 0.62 |
| Gaussian | 9.05 | 132.10 | 11.23 | 0.52 | 0.62 |
| Gumbel | 9.42 | 129.73 | 11.42 | 0.52 | 0.61 |
| Uniform | 10.02 | 124.40 | 11.63 | 0.51 | 0.62 |
Effect of the number of additional SiT blocks (\(\mathcal{N}(\mu,\Sigma)\)):
| Extra Blocks | Param Ratio | FID↓ | Note |
|---|---|---|---|
| 0 | 0.39% | 9.06 | Linear layer alone is effective |
| 1 | 3.93% | 9.05 | Optimal |
| 2 | 7.48% | 9.08 | Diminishing returns |
| 4 | 14.56% | 9.15 | Slight degradation with excess parameters |
### Key Findings
- Effective with minimal parameters: Only 0.39% additional parameters (linear layer only, no extra SiT blocks) suffice to reduce FID from 10.16 to 9.06.
- Gaussian distribution is optimal: Among the three noise distributions, Gaussian performs best, likely due to its natural alignment with the forward process of RF (Gaussian noise).
- Largest improvement on CelebA-HQ: FID decreases by 3.52, possibly because the more concentrated distribution of facial data makes it easier for π-noise to learn.
- Fine-tuning strategy outperforms joint training: Jointly optimizing θ and ψ leads to slower and more unstable FID convergence.
- Diminishing returns with more SiT blocks: Zero to one extra block is sufficient; additional blocks may introduce overfitting.
## Highlights & Insights
- Theoretical elegance: The paper establishes a connection between the RF loss and information entropy via an auxiliary Gaussian variable. The derivation is rigorous and concise, ultimately revealing that standard RF is a special case of ΔRN where π-noise is zero.
- Exceptional parameter efficiency: Significant improvement with only 0.39% additional parameters is particularly valuable given the prevailing trend of ever-larger models.
- Plug-and-play: The architecture and weights of the pretrained RF model remain unchanged; only a lightweight π-noise generator is appended.
- π-noise visualization: The paper visualizes how π-noise evolves across timesteps, revealing the spatiotemporal structure of beneficial noise.
- Generality: Consistent improvements across three distinct datasets demonstrate that the method does not rely on a specific data distribution.
## Limitations & Future Work
- Validation is limited to SiT (a specific RF implementation); other Flow Matching architectures (e.g., Flux, SD3) are not tested.
- Experiments are conducted only at 256×256 resolution; performance at higher resolutions is unknown.
- Interaction with Classifier-Free Guidance (CFG) is not explored.
- The failure of the joint training strategy is not analyzed in sufficient depth.
- The interpretability of π-noise is limited: what information does the beneficial noise actually encode?
- Convergence behavior and hyperparameter sensitivity of the 10K fine-tuning steps are not thoroughly discussed.
## Related Work & Insights
- Connection to SDE sampling: SiT shows that SDE sampling outperforms ODE sampling; ΔRN can be seen as a further answer to the question of what noise is optimal: not arbitrary random noise, but learned π-noise.
- Success of π-noise in other tasks: VPN enhances classical neural networks; PiNI enhances vision-language models; this paper extends the framework to generative models.
- Insight: The paradigm of pretrained model + lightweight enhancement module is highly efficient and could generalize to other generative tasks (text generation, video generation, etc.).
- The relationship between ΔRN and LoRA-style methods is worth investigating — both enhance pretrained models with minimal additional parameters.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ (The theoretical connection between π-noise and RF is a genuinely novel finding with elegant derivation.)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Three datasets with comprehensive ablations, but high-resolution and CFG experiments are absent.)
- Writing Quality: ⭐⭐⭐⭐ (Theoretical derivations are clear and experimental presentation is well-organized.)
- Value: ⭐⭐⭐⭐ (Reveals the existence of beneficial noise in RF and provides a learning framework with substantial room for follow-up research.)