TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Conference: ICLR 2026
arXiv: 2512.05150
Code: https://github.com/inclusionAI/TwinFlow
Area: Diffusion Models / One-step Generation / Large Model Acceleration
Keywords: one-step generation, self-adversarial, flow matching, 20B scaling, no auxiliary models

TL;DR

TwinFlow extends the flow matching time interval from \([0,1]\) to \([-1,1]\), constructing twin trajectories that provide a self-adversarial training signal and enable one-step generation without any discriminator or frozen teacher. It is the first work to scale 1-NFE generation to a 20B-parameter model (Qwen-Image), achieving a 1-NFE GenEval of 0.86, approaching the original 100-NFE score of 0.87, while cutting inference cost by 100×.

Background & Motivation

Diffusion and flow matching models achieve excellent generation quality, but inference typically requires 40–100 function evaluations (NFE). In the era of large models, cumulative inference cost far exceeds the one-time training cost, making few-step or one-step generation highly desirable.

Bottlenecks of existing methods:

| Method | Auxiliary Models | Frozen Teacher | Core Issue |
|---|---|---|---|
| GAN | 1 (discriminator) | 0 | Unstable training, hard to scale to large models |
| Diffusion Distillation | 0 | 1 | Frozen teacher occupies extra GPU memory |
| DMD/DMD2 | 1–2 (fake score + discriminator) | 1 | Highest complexity; direct OOM at 20B |
| Consistency Models (LCM/PCM) | 0 | 0–1 | Quality degrades sharply at NFE < 4 |
| TwinFlow (Ours) | 0 | 0 | No extra components; trainable at 20B |

Key Challenge: One-step acceleration of large models requires an extremely simple and memory-efficient framework, yet all high-quality few-step methods rely on auxiliary components (discriminators/teachers), which directly cause OOM at the 20B scale.

Key Insight: Can a model teach itself? The multi-step output of a model is of higher quality than its single-step output — this quality gap itself constitutes a usable self-supervised signal, requiring no external teacher.

Method

Core Idea: Twin Trajectories

Standard flow matching learns the mapping from noise to real data over \(t \in [0,1]\). TwinFlow extends the time interval to \(t \in [-1,1]\):

  • Positive half-axis \(t \in (0,1]\): standard flow, noise → real data (real trajectory)
  • Negative half-axis \(t \in [-1,0)\): twin flow, noise → "fake" data generated by the model itself (fake trajectory)

Both trajectories share a single network \(F_\theta\), distinguished by the sign of the time condition. The model generates \(\mathbf{x}^{\text{fake}} = \mathbf{z} - F_\theta(\mathbf{z}, 0)\) as the fake endpoint, constructing a complete fake trajectory.
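The twin-trajectory construction can be sketched in a few lines. This is a toy numpy illustration, not the paper's implementation: `F_theta` here is a trivial stand-in for the shared 20B network, and the linear interpolation for the intermediate point is only illustrative.

```python
import numpy as np

def F_theta(x, t):
    # Toy stand-in for the shared network F_theta(x, t). In the paper this
    # is one network whose role is switched by the sign of the time input t.
    return 0.5 * x * (1.0 - t)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))        # a batch of noise samples

# Fake endpoint from a single forward pass at t = 0, per the paper's formula
# x_fake = z - F_theta(z, 0):
x_fake = z - F_theta(z, 0.0)

# The same intermediate point is conditioned on +t for the real trajectory
# and on -t for the twin (fake) trajectory.
t = 0.7
x_t = (1.0 - t) * z + t * x_fake       # illustrative linear interpolation
v_real = F_theta(x_t, t)               # real-trajectory velocity
v_fake = F_theta(x_t, -t)              # fake-trajectory velocity
```

The point of the sketch is that one forward pass at \(t = 0\) already yields a fake endpoint, and the same weights produce both velocity fields, distinguished only by the sign of \(t\).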

Key Designs

  1. Self-adversarial loss \(\mathcal{L}_{\text{adv}}\): applies the standard flow matching objective to the fake trajectory, training the network to learn the noise-to-fake-data mapping under negative time conditioning, without any additional discriminator.
  2. Velocity matching rectification loss \(\mathcal{L}_{\text{rectify}}\): the core mathematical insight — minimizing the velocity field discrepancy \(\Delta_\mathbf{v}(\mathbf{x}_t) = \mathbf{v}_{\text{real}}(\mathbf{x}_t, t) - \mathbf{v}_{\text{fake}}(\mathbf{x}_t, -t)\) between the real and fake trajectories is equivalent to minimizing \(D_{\text{KL}}(p_{\text{fake}} \| p_{\text{real}})\), which is converted into a tractable loss via stop-gradient.
  3. Unified any-step framework: based on the RCGM any-step formulation, the same model and checkpoint supports 1/2/4/... step inference, allowing flexible deployment.
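The stop-gradient conversion in the rectification loss can be sketched as follows. This is a schematic numpy version: `stop_gradient` is a hypothetical placeholder for `detach()` in a real autodiff framework, and the exact weighting and which side is stopped may differ from the paper.

```python
import numpy as np

def stop_gradient(x):
    # Placeholder: in PyTorch this would be x.detach(), in JAX
    # jax.lax.stop_gradient(x); with plain numpy it just marks a constant.
    return np.asarray(x).copy()

def rectify_loss(v_real, v_fake):
    # L_rectify drives the fake-trajectory velocity toward the real one.
    # Stopping the gradient on v_real makes Delta_v a fixed target, so the
    # update only moves the fake trajectory, which is what makes minimizing
    # the velocity discrepancy act like minimizing KL(p_fake || p_real).
    delta_v = stop_gradient(v_real) - v_fake
    return float(np.mean(delta_v ** 2))

loss = rectify_loss(np.ones((2, 3)), np.zeros((2, 3)))
```

Because no explicit score estimator appears anywhere, this term costs nothing beyond the two velocity evaluations the twin trajectories already require.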

Loss & Training

\[\mathcal{L}(\theta) = \mathcal{L}_{\text{base}} + \underbrace{(\mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{rectify}})}_{\mathcal{L}_{\text{TwinFlow}}}\]

  • \(\mathcal{L}_{\text{base}}\): standard any-step flow matching (\(N=2\)), with target time \(r\) sampled randomly from \([0,1]\)
  • \(\mathcal{L}_{\text{TwinFlow}}\): self-adversarial + rectification, with target time fixed at \(r=0\)
  • In-batch mixing: hyperparameter \(\lambda\) controls the proportion of each mini-batch allocated to each loss; optimal \(\lambda \approx 1/3\)
  • Qwen-Image-20B supports both LoRA fine-tuning (~40 GB) and full-parameter training
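The in-batch mixing scheme can be sketched as below. `mix_batch` is a hypothetical helper (not the authors' code); `lam` corresponds to \(\lambda\), and only the target-time assignment is shown.

```python
import numpy as np

def mix_batch(batch_size, lam=1/3, rng=None):
    # Allocate a fraction lam of each mini-batch to the TwinFlow losses
    # (target time fixed at r = 0) and the remainder to the base any-step
    # flow matching loss (target time r sampled uniformly from [0, 1]).
    rng = rng or np.random.default_rng(0)
    n_twin = int(round(lam * batch_size))
    r_twin = np.zeros(n_twin)                            # r = 0 for L_TwinFlow
    r_base = rng.uniform(0.0, 1.0, batch_size - n_twin)  # r ~ U[0,1] for L_base
    return r_twin, r_base

r_twin, r_base = mix_batch(12, lam=1/3)
```

With \(\lambda = 1/3\), a batch of 12 would send 4 samples through the TwinFlow losses and 8 through the base loss, matching the reported optimum.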

Why No Mode Collapse?

Unlike Qwen-Image-Lightning (which is based on DMD2 with the GAN loss removed), TwinFlow retains random target time sampling in \(\mathcal{L}_{\text{base}}\), continuously learning the full trajectory. Meanwhile, the fake trajectory co-evolves with training (self-play), avoiding the fixed-target bias introduced by a frozen teacher. Experiments confirm that Qwen-Image-Lightning suffers from severe diversity degradation (different noise inputs under the same prompt yield nearly identical images), whereas TwinFlow does not.

Key Experimental Results

Unified Multimodal Model Comparison (Qwen-Image-20B, LoRA)

| Method | NFE ↓ | GenEval ↑ | DPG-Bench ↑ | WISE ↑ |
|---|---|---|---|---|
| Qwen-Image (original) | 100 | 0.87 | 88.32 | 0.62 |
| Qwen-Image-Lightning | 1 | 0.85 | 87.79 | 0.51 |
| Qwen-Image-RCGM | 1 | 0.52 | 59.50 | 0.30 |
| Qwen-Image-TwinFlow | 1 | 0.86 | 86.52 | 0.54 |
| Qwen-Image-TwinFlow | 2 | 0.87 | 87.64 | 0.57 |
| BLIP3-o-8B | 60+ | 0.84 | 81.60 | 0.62 |
| Bagel | 100 | 0.82 | – | 0.52 |
| MetaQuery-XL | 60 | 0.78 | 81.10 | 0.55 |

Key finding: at 1 NFE, TwinFlow already surpasses most unified multimodal models running at 60–100 NFE (Bagel/MetaQuery/BLIP3-o); at 2 NFE it matches the original 100-NFE GenEval of 0.87.

20B Full-Parameter Training Comparison

| Method | NFE | GenEval ↑ | DPG-Bench ↑ | WISE ↑ | Notes |
|---|---|---|---|---|---|
| VSD / DMD / SiD (original) | – | OOM | OOM | OOM | Require 3 model copies |
| VSD (LoRA fake score) | 1 | 0.67 | 84.44 | 0.22 | Poor quality |
| DMD | 1 | 0.81 | 84.31 | 0.47 | Mode collapse ⭐ |
| sCM (JVP-free) | 8 | 0.60 | 85.54 | 0.45 | Still low at 8 steps |
| MeanFlow (JVP-free) | 8 | 0.49 | 83.81 | 0.37 | Only 0.49 at 8 steps |
| TwinFlow | 1 | 0.85 | 85.44 | 0.51 | |
| TwinFlow | 2 | 0.86 | 86.35 | 0.55 | |
| TwinFlow (longer training) | 1 | 0.89 | 87.54 | 0.57 | Continuous improvement with full-parameter training |

Key finding: VSD/DMD/SiD directly OOM at 20B in their original configurations; sCM/MeanFlow achieve far lower quality at 8 NFE than TwinFlow at 1 NFE. With longer training, 1-NFE GenEval reaches 0.89, surpassing the original 100-NFE score of 0.87.

Dedicated T2I Model Comparison (SANA Backbone)

| Method | NFE | Params | GenEval ↑ | DPG-Bench ↑ |
|---|---|---|---|---|
| SANA-Sprint-1.6B | 1 | 1.6B | 0.76 | 80.1 |
| RCGM-1.6B | 1 | 1.6B | 0.78 | 76.5 |
| FLUX-Schnell | 1 | 12B | 0.69 | – |
| SDXL-DMD2 | 1 | 0.9B | 0.59 | – |
| TwinFlow-0.6B | 1 | 0.6B | 0.83 | 78.9 |
| TwinFlow-1.6B | 1 | 1.6B | 0.81 | 79.1 |
| SANA-1.5 | 40 | 4.8B | 0.81 | 84.7 |

Key finding: TwinFlow-0.6B at 1-NFE (0.83) surpasses SANA-1.5-4.8B at 40-NFE (0.81), using only 1/8 the parameters and running 40× faster.

Ablation Study

  • Effect of \(\lambda\): \(\lambda = 1/3\) is optimal; performance degrades both above and below this value, confirming the importance of balancing the base loss and TwinFlow loss.
  • Generality of \(\mathcal{L}_{\text{TwinFlow}}\): across three architectures (OpenUni, SANA, Qwen-Image), adding \(\mathcal{L}_{\text{TwinFlow}}\) improves 1-NFE DPG-Bench by approximately 3, 2, and 27 percentage points respectively; the gain is most significant for Qwen-Image (59.50 → 86.52).
  • Training steps vs. NFE: longer training leads to lower optimal NFE — both 1-NFE and few-step performance improve simultaneously.

Highlights & Insights

  • Minimalist design is the key selling point: 0 auxiliary models + 0 frozen teachers. Where competing methods hit OOM at the 20B scale, TwinFlow is currently the only approach shown to train.
  • Mathematical elegance: extending the time interval to \([-1,1]\) naturally makes the real/fake velocity discrepancy equivalent to the KL divergence gradient, without requiring an explicit score estimator.
  • Engineering significance: the first demonstration that a 20B model can generate high-quality images in a single step, with direct implications for large model deployment costs.
  • Any-step flexibility: the same checkpoint supports 1/2/4-step inference, enabling dynamic quality/speed trade-offs at deployment time.

Limitations & Future Work

  • Theoretical convergence guarantees for self-adversarial training are insufficient — despite empirical stability, a rigorous analysis is lacking.
  • Traditional distributional quality metrics such as FID/IS are absent; evaluation relies solely on GenEval/DPG-Bench/WISE.
  • Validation is limited to text-to-image tasks; applicability to video generation, audio generation, and other modalities remains unknown.
  • Image editing experiments are preliminary (15K data, 4-NFE) and insufficiently explored.

Comparison with Related Methods

  • vs. DMD/DMD2: DMD requires a fake score estimator and a frozen teacher (3× model memory); TwinFlow requires only a single model copy.
  • vs. sCM/MeanFlow: both are auxiliary-model-free methods, but at 20B full-parameter training, their 8-NFE GenEval is only ~0.5, far below TwinFlow's 1-NFE score of 0.85.
  • vs. SANA-Sprint: Sprint uses a GAN loss and a frozen teacher, which is infeasible at large model scale; TwinFlow removes the GAN loss while achieving 7–11 percentage points higher 1-NFE GenEval.
  • vs. Qwen-Image-Lightning: both target few-step generation at 20B scale, but Lightning exhibits severe mode collapse, while TwinFlow does not.

Rating

  • Novelty: ⭐⭐⭐⭐ The derivation of twin trajectories and velocity matching rectification is mathematically elegant, though the core intuition (using the model's own multi-step outputs as a teaching signal) is not particularly complex.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers scales from 0.6B to 20B, LoRA to full-parameter training, 3 benchmarks, detailed ablations, and comparisons against 7 baselines.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and rich tables.
  • Value: ⭐⭐⭐⭐⭐ First demonstration of high-quality 1-step generation at the 20B scale, with direct and significant practical impact on large model inference costs.