# SoFlow: Solution Flow Models for One-Step Generative Modeling
- Conference: ICLR 2026
- arXiv: 2512.15657
- Code: https://github.com/zlab-princeton/SoFlow
- Area: Diffusion Models / One-Step Generation
- Keywords: solution function, flow matching, one-step generation, consistency loss, JVP-free
## TL;DR

This paper proposes Solution Flow Models (SoFlow), which directly learn the solution function \(f(x_t, t, s)\) of the velocity ODE (mapping \(x_t\) at time \(t\) to the solution at time \(s\)). Trained from scratch via a Flow Matching loss combined with a JVP-free solution consistency loss, SoFlow achieves a 1-NFE FID of 2.96 on ImageNet 256×256 (XL/2), outperforming MeanFlow (3.43).
## Background & Motivation
Background: Consistency models (CM/iCT/ECT/sCT) and MeanFlow have enabled few-step/one-step generation, but MeanFlow's Flow Matching anchoring requires expensive Jacobian-vector product (JVP) computations (poorly optimized in PyTorch), and consistency models trained from scratch struggle to leverage CFG.
Limitations of Prior Work: (a) JVP is inefficient in deep learning frameworks, as it is neither a standard forward nor backward pass; (b) consistency training objectives are unstable due to stop-gradient pseudo-target drift; (c) one-step models trained from scratch do not support CFG during training.
Key Challenge: Learning to "jump to the endpoint in one step" currently requires either JVP (slow) or unstable objectives (poor quality).
Goal: Design a one-step generative framework that is JVP-free, supports CFG during training, and can be trained from scratch.
Key Insight: Rather than learning a velocity field and integrating it (Flow Matching), SoFlow directly learns the solution function \(f(x_t, t, s)\) of the ODE. The solution function inherently satisfies two properties: (1) the initial condition \(f(x_t, t, t) = x_t\), and (2) ODE solution consistency — the latter can be approximated with a consistency loss that requires no JVP.
Core Idea: Learn the ODE solution function instead of the velocity field, and replace JVP with a three-timepoint \((s, l, t)\) consistency constraint to enforce ODE consistency.
## Method

### Overall Architecture
The SoFlow model \(f_\theta(x_t, t, s)\) takes three inputs — noisy data \(x_t\), current time \(t\), and target time \(s\) — and outputs a prediction at time \(s\). The training loss is a weighted combination: \(\lambda\) × Flow Matching loss + \((1-\lambda)\) × solution consistency loss.
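To make the interface concrete, here is a minimal PyTorch-style sketch of the solution-function wrapper under the Euler parameterization described under Key Designs, \(f_\theta(x_t, t, s) = x_t + (s-t)F_\theta(x_t, t, s)\). The backbone and the way the time gap is embedded are placeholder assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SolutionFlow(nn.Module):
    """Euler-parameterized solution function f(x_t, t, s) = x_t + (s - t) * F(x_t, t, s).

    `backbone` is a stand-in for the paper's DiT; conditioning it on
    (t, s - t) is an assumption for illustration, not the authors' code.
    """

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # predicts F_theta, same shape as x_t

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        F = self.backbone(x_t, t, s - t)   # F_theta(x_t, t, s)
        gap = (s - t).view(-1, 1, 1, 1)    # broadcast over latent dims
        # The parameterization makes the initial condition f(x_t, t, t) = x_t exact.
        return x_t + gap * F
```

One-NFE sampling is then a single call, e.g. `x0_hat = model(x1, t=torch.ones(B), s=torch.zeros(B))`, assuming the usual flow-matching convention that \(t=1\) is pure noise and \(s=0\) is data.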
### Key Designs
- **Solution Consistency Loss (JVP-free):**
    - Function: Enforces the transitivity property \(f(x_t, t, s) = f(f(x_t, t, l), l, s)\) that any ODE solution function must satisfy.
    - Mechanism: Three timepoints \(s < l < t\) are sampled and the loss is \(\|f_\theta(x_t, t, s) - f_{\theta^-}(x_t + (\alpha_t' x_0 + \beta_t' x_1)(l-t), l, s)\|^2\): the intermediate point is reached by a one-step Euler move along the conditional velocity, and the target comes from a stop-gradient copy \(f_{\theta^-}\) of the model, so only forward passes are required (see the training sketch after this list).
    - Design Motivation: MeanFlow's consistency loss needs JVP to differentiate the velocity field with respect to time, whereas SoFlow defines consistency directly over the solution function.
- **Flow Matching Loss (Velocity Field + CFG):**
    - Function: Near \(s = t\), the solution function reduces to a velocity field, so velocity prediction can be trained simultaneously.
    - Mechanism: Since \(\partial_s f(x_t, t, s)\big|_{s=t} = v(x_t, t)\), the Euler parameterization \(f_\theta(x_t, t, s) = x_t + (s-t) F_\theta(x_t, t, s)\) gives \(F_\theta(x_t, t, t) = v_\theta(x_t, t)\).
    - Design Motivation: (a) enables CFG, since a guided velocity field can be used directly during training; (b) stabilizes training by providing a well-defined FM objective.
- **Training-Time CFG Integration:**
    - Function: Injects the CFG signal during training rather than only at inference.
    - Mechanism: The FM loss uses the guided velocity target \(w(\alpha_t' x_0 + \beta_t' x_1) + (1-w)\, v_{\text{uncond}}\); the consistency loss replaces high-variance targets with the model's own predicted guided velocity.
    - Design Motivation: With only one sampling step there are no intermediate states at which to apply CFG at inference, so the model must internalize guidance during training.
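The following PyTorch-style sketch puts the two losses together under simplifying assumptions: a linear path \(x_t = (1-t)x_0 + t x_1\) with conditional velocity \(x_1 - x_0\), uniform time sampling, plain squared error instead of the adaptive Huber loss, and the guided CFG target reduced to the conditional one. `model` is the `SolutionFlow` wrapper sketched above and `ema_model` is a stop-gradient teacher copy; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soflow_losses(model, ema_model, x0, x1, lam=0.8):
    """lam * Flow Matching loss + (1 - lam) * JVP-free solution consistency loss."""
    B = x0.shape[0]
    expand = lambda a: a.view(-1, 1, 1, 1)

    # Sample three timepoints s < l < t (uniform here; the paper uses logit-normal).
    s, l, t = torch.rand(B, 3, device=x0.device).sort(dim=1).values.unbind(dim=1)

    x_t = (1 - expand(t)) * x0 + expand(t) * x1   # point on the interpolation path
    v_c = x1 - x0                                 # conditional velocity target

    # Flow Matching loss: F_theta(x_t, t, t) should match the (guided) velocity.
    v_pred = model.backbone(x_t, t, torch.zeros_like(t))  # time gap s - t = 0
    fm_loss = F.mse_loss(v_pred, v_c)

    # Solution consistency loss: only forward passes, no JVP anywhere.
    x_l = x_t + v_c * expand(l - t)               # one Euler step from t to l
    with torch.no_grad():
        target = ema_model(x_l, l, s)             # f_{theta^-}(x_l, l, s), stop-gradient
    sc_loss = F.mse_loss(model(x_t, t, s), target)

    return lam * fm_loss + (1 - lam) * sc_loss
```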
### Loss & Training
- DiT architecture (B/4, L/2, XL/2) in latent space (SD-VAE).
- \(\lambda\) controls the FM ratio: approximately 80% FM + 20% consistency.
- Time sampling follows a logit-normal distribution.
- Adaptive Huber loss (\(p=0.5\) or \(1\)) is used for robustness against large-error samples.
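As a rough illustration of the last two items, the sketch below samples times from a logit-normal distribution and applies an adaptive Huber-style loss. The \((\mu, \sigma)\) values, the constant \(c\), and the exact loss formula are assumptions chosen to be consistent with the quoted \(p\) behavior (\(p=0\) recovering MSE), not the paper's definitions.

```python
import torch

def sample_logit_normal(batch: int, mu: float = 0.0, sigma: float = 1.0,
                        device: str = "cpu") -> torch.Tensor:
    """t = sigmoid(z), z ~ N(mu, sigma^2): concentrates samples mid-trajectory.
    (mu, sigma) here are placeholders, not the paper's settings."""
    return torch.sigmoid(torch.randn(batch, device=device) * sigma + mu)

def adaptive_huber(pred: torch.Tensor, target: torch.Tensor,
                   p: float = 0.5, c: float = 1e-3) -> torch.Tensor:
    """Assumed form ||e||^2 / (||e||^2 + c)^p: p = 0 is plain MSE (matching
    the ablation table), while p > 0 down-weights large-error samples."""
    sq = (pred - target).pow(2).flatten(1).mean(dim=1)  # per-sample squared error
    return (sq / (sq + c).pow(p)).mean()
```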
## Key Experimental Results

### Main Results (ImageNet 256×256, 1-NFE)
| Model Size | MeanFlow FID | SoFlow FID | ΔFID |
|---|---|---|---|
| B/2 | 6.17 | 4.85 | −1.32 |
| M/2 | 5.01 | 3.73 | −1.28 |
| L/2 | 3.84 | 3.20 | −0.64 |
| XL/2 | 3.43 | 2.96 | −0.47 |
### Ablation Study
| Configuration | FID (B/4) |
|---|---|
| 100% Consistency, 0% FM | 53.78 |
| 20% FM + 80% Consistency | 47.65 |
| 80% FM + 20% Consistency | 44.64 |
| MSE (\(p=0\)) | 62.93 |
| Huber (\(p=0.5\)) | 44.64 |
| Without CFG | 44.64 |
| With CFG (\(w=1.0\)) | 14.92 |
### Key Findings
- SoFlow outperforms MeanFlow across all model sizes under identical architectures and training steps.
- An 80% FM loss ratio is optimal — excessive consistency loss is detrimental, as the FM loss provides stable velocity field supervision.
- Huber loss substantially outperforms MSE (62.93 → 44.64), indicating that robustness to large-error samples is critical.
- Training-time CFG yields dramatic gains (44.64 → 14.92), making it a key factor in one-step generation quality.
## Highlights & Insights
- JVP-free is the most significant practical advantage — MeanFlow requires JVP, but PyTorch's JVP implementation is inefficient (2–4× slower than backpropagation). SoFlow requires only forward passes, leading to simpler engineering.
- The conceptual shift from velocity field to solution function is illuminating: learning the "answer" (the solution function) rather than the "direction" (the velocity field) naturally sidesteps the integration process.
- Training-time CFG resolves the fundamental limitation of one-step models, which cannot apply guidance at inference due to the absence of intermediate states.
## Limitations & Future Work
- The XL/2 FID of 2.96 still lags behind multi-step methods such as SiT/DiT (~2.0 with 250 steps), indicating room for improvement in one-step generation quality.
- Validation is limited to ImageNet 256×256; experiments at higher resolutions (512/1024) and text-to-image settings are absent.
- The solution function requires an extra \(s\) input (injected via a positional embedding of \(s-t\)), which adds model design complexity.
- Direct comparisons with recent consistency-based methods such as sCT and IMM are lacking.
## Related Work & Insights
- vs. MeanFlow: The core distinction lies in the JVP-free formulation and the solution-function parameterization. SoFlow consistently outperforms MeanFlow across all model sizes (FID reductions of 0.47 to 1.32).
- vs. Consistency Models (iCT/sCT): Both share the spirit of consistency-based training, but SoFlow's solution function consistency is more general (supporting arbitrary \((t, s)\) pairs) and naturally accommodates training-time CFG.
- vs. Shortcut/IMM: Both approaches learn mappings between arbitrary time pairs, but differ in theoretical motivation — SoFlow is grounded in the mathematical properties of ODE solution functions.
## Rating
- Novelty: ⭐⭐⭐⭐ — Solution function learning combined with a JVP-free consistency loss is a meaningful contribution, though not a fundamental paradigm shift.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations and multi-scale model comparisons, but lacking high-resolution and text-to-image evaluations.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear and the relationship between motivation and method is well-structured.
- Value: ⭐⭐⭐⭐ — Represents a substantive advance in one-step generation, with significant engineering value due to the JVP-free design.