
Improved Training Technique for Shortcut Models (iSM)

Conference: NeurIPS 2025 | arXiv: 2510.21250 | Code: Not released | Area: Image Generation | Keywords: Shortcut Models, Flow Matching, Few-step Generation, CFG Guidance, Wavelet Loss, Optimal Transport, EMA Strategy

TL;DR

Targeting five key performance bottlenecks of Shortcut Models (compounding guidance, fixed guidance, frequency bias, self-consistency deviation, and curved trajectories), this paper proposes iSM, a unified training framework that incorporates intrinsic guidance, a multi-level wavelet loss, scaling optimal transport, and a twin EMA strategy, achieving substantial improvements on ImageNet 256×256 with one-step FID 5.27 and four-step FID 2.05.

Background & Motivation

Background: Flow Matching-based diffusion models have achieved remarkable results in image generation, but sampling requires a large number of iterative steps, limiting deployment efficiency. Accelerating sampling is a key research direction.

Advantages of Shortcut Models: Shortcut Models (SM) represent an elegant acceleration approach — the network is conditioned on both noise level \(t\) and target step size \(d\) simultaneously, trained with a self-consistency loss so that a single network supports one-step, few-step, and multi-step sampling.

Limitations of SM: Despite the elegant framework design, the practical performance of SM lags far behind comparable methods (one-step FID 10.60 vs. IMM 7.77), impeding widespread adoption.

Limitations of Prior Work: The authors identify five systematic issues — (1) exponential compounding amplification of CFG guidance at large step sizes; (2) fixed guidance strength during training, leading to loss of inference flexibility; (3) low-frequency bias from pixel-level loss causing blurry generated images; (4) conflict between the temporal lag of the EMA target network and the self-consistency objective; (5) curved generation trajectories caused by random noise-data pairing.

Key Insight: Rather than modifying the core framework of SM, the paper systematically addresses the five bottlenecks above to make SM a genuinely competitive generative paradigm.

Core Problem

Problem 1: Compounding Guidance

This is one of the most important findings of the paper. SM constructs self-consistency targets using a fixed CFG strength \(w=1.5\) during training. When the model performs one-step generation \(s_\theta(x_0, 0, c, 1)\), it implicitly aggregates the effects of \(N=128\) intermediate steps. The authors formally prove for the first time that the effective guidance strength at each implicit intermediate step is not \(w\) but \(w' = w^{\log_2(N)}\). When \(N=128, w=1.5\), \(w' = 1.5^7 \approx 17\), causing severe color oversaturation and blurring artifacts.
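To make the magnitude of this effect concrete, the short snippet below (illustrative, not from the paper's code) evaluates \(w' = w^{\log_2 N}\) for a few implicit step counts:

```python
import math

def effective_guidance(w: float, n_steps: int) -> float:
    """Effective guidance strength when n_steps implicit steps with per-step
    scale w are collapsed into one shortcut step: w' = w ** log2(n_steps)."""
    return w ** math.log2(n_steps)

for n in (2, 8, 32, 128):
    print(f"N = {n:3d}  ->  w' = {effective_guidance(1.5, n):5.2f}")
# N = 128 yields 1.5 ** 7 ≈ 17.1, matching the oversaturation analysis above.
```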

Problem 2: Inflexible Fixed Guidance

The original SM hard-codes CFG strength as \(w=1.5\) during training, making it impossible to adjust the diversity-fidelity trade-off at inference time. Moreover, the optimal \(w\) value depends on the number of inference steps, and a fixed value cannot adapt to different scenarios.

Problem 3: Frequency Bias

SM uses pixel-level \(\ell_2\) loss to optimize direct-domain predictions. Neural networks naturally tend to learn low-frequency features first, causing generated images to lack high-frequency texture details and appear overall blurry.

Problem 4: Self-Consistency Deviation

The EMA target network uses a slow decay rate to maintain training stability, but this causes the target network to represent a "historical state" of the online network. The online network must simultaneously optimize the current trajectory (flow matching) and align with the historical target (self-consistency), creating conflicting learning signals.

Problem 5: Curved Flow Trajectories

Random noise-data pairing in standard flow matching causes frequent crossing of forward trajectories, forcing the reverse generation process to follow curved paths and increasing the difficulty of large-step prediction.

Method

Overall Architecture

iSM preserves the core dual-loss structure of SM (flow matching + self-consistency) and introduces four key improvement components. The final training objective is \(\mathcal{L}_{\text{total}}(\theta) = \alpha \mathcal{L}_{\text{velocity}}(\theta) + \beta \mathcal{L}_{\text{guidance}}(\theta) + \gamma \mathcal{L}_{\text{consistency}}(\theta)\), where \(\alpha = \beta = \gamma = 1\).

Key Design 1: Intrinsic Guidance

  • Function: Takes guidance strength \(w\) as an explicit conditioning input to the network, training the model to directly output CFG-modulated velocity \(s_\theta(x_t, t, c, d, w)\)
  • Core Idea:
    • Flow matching objective: trains the base velocity field at \(d=0, w=0\), with standard conditioning dropout randomly substituting the null condition
    • Intrinsic guidance objective: trains the model to directly learn CFG scaling behavior at \(d=0, w>0\). The target is \(s_{\text{velocity}} + w \cdot \text{sg}(s_{\text{guidance}})\), where stop-gradient prevents interference with base predictions
    • Guided self-consistency objective: maintains self-consistency at arbitrary step sizes and guidance strengths for \(d>0, w \geq 0\)
  • Effect: Eliminates compounding guidance, supports flexible adjustment of \(w\) at inference, enables CFG in a single step, and halves inference time (no additional unconditional forward pass required)
  • Interval Guidance: Guidance is not applied in high-noise regions (\(t < t_{\text{interval}} = 0.3\)) to avoid premature mode collapse
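Below is a minimal PyTorch-style sketch of the intrinsic guidance objective at \(d=0, w>0\). It assumes a hypothetical `model(x, t, c, d, w)` signature for the \(w\)-conditioned network and `null_c` as the null condition; detaching the base term along with the guidance direction, and using a plain MSE loss, are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def intrinsic_guidance_loss(model, x_t, t, c, null_c, w):
    """Illustrative d = 0, w > 0 objective: regress the w-conditioned output onto
    the base velocity plus the w-scaled, stop-gradient guidance direction."""
    d0 = torch.zeros_like(t)                      # step size d = 0
    w0 = torch.zeros_like(t)                      # w = 0 selects the unguided branch

    with torch.no_grad():                         # sg(.): target construction only
        v_cond = model(x_t, t, c, d0, w0)         # conditional base velocity
        v_uncond = model(x_t, t, null_c, d0, w0)  # null-conditioned velocity
        s_guidance = v_cond - v_uncond            # CFG guidance direction
        # Broadcast w over (B, C, H, W) latents.
        target = v_cond + w.view(-1, 1, 1, 1) * s_guidance

    pred = model(x_t, t, c, d0, w)                # network conditioned on w > 0
    return F.mse_loss(pred, target)
```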

Key Design 2: Multi-Level Wavelet Loss

  • Function: Uses the Discrete Wavelet Transform (DWT) to decompose predictions and targets into multi-frequency representations, computing the loss in the wavelet domain
  • Core Idea: Recursively decomposes wavelet subbands to \(L=5\) levels (the maximum decomposition depth for a \(32 \times 32\) latent space), computing errors independently at each frequency band
  • Effect: Introduces frequency-aware error signals, forcing the model to recover high-frequency details neglected by \(\ell_2\) loss, producing sharper textures
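The sketch below illustrates one way such a loss could look, using a hand-rolled 2D Haar DWT in PyTorch; the Haar filter choice and the unweighted \(\ell_2\) per subband are assumptions, since the paper's exact wavelet and band weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One level of a 2D Haar DWT: returns the low-pass band and the three
    high-pass (horizontal / vertical / diagonal) detail bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)

def multilevel_wavelet_loss(pred, target, levels=5):
    """Sum of l2 errors over all subbands of a `levels`-deep decomposition.
    With 32x32 latents, 5 levels reduce the low-pass band to 1x1 (log2(32) = 5)."""
    loss = 0.0
    for _ in range(levels):
        pred, pred_hi = haar_dwt(pred)
        target, target_hi = haar_dwt(target)
        for p, t in zip(pred_hi, target_hi):
            loss = loss + F.mse_loss(p, t)
    loss = loss + F.mse_loss(pred, target)   # final low-pass residual
    return loss
```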

Key Design 3: Scaling Optimal Transport (sOT)

  • Function: Decouples the batch size for OT computation from the training batch size, enabling large-scale OT matching
  • Core Idea: Every \(K\) training batches (of size \(M\)), all \(K \times M\) noise-image samples are pooled to compute a global OT plan once, then split back into \(K\) mini-batches for training
  • Implementation: \(K=32\), with only approximately 4% additional training time overhead
  • Effect: Substantially reduces forward trajectory crossings, producing straighter reverse paths and reducing self-consistency and flow matching losses
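A simplified sketch of the sOT pooling step, using SciPy's `linear_sum_assignment` as a stand-in exact solver on a squared-Euclidean cost; the authors' actual OT solver and cost may differ, and the function below only illustrates the pool-match-split pattern.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def scaled_ot_pairing(noise, data, K, M):
    """Pool K mini-batches (K * M samples), compute one global noise-to-data
    assignment, then split the matched pairs back into K mini-batches.
    `noise` and `data` are arrays of shape (K * M, ...), initially paired at random."""
    n = K * M
    noise_flat = noise.reshape(n, -1)
    data_flat = data.reshape(n, -1)

    # Pairwise squared-Euclidean cost between all pooled noise and data samples.
    cost = ((noise_flat[:, None, :] - data_flat[None, :, :]) ** 2).sum(-1)
    _, col = linear_sum_assignment(cost)   # exact assignment (illustrative OT plan)

    paired_data = data[col]                # data reordered to match the noise order
    return [(noise[i * M:(i + 1) * M], paired_data[i * M:(i + 1) * M])
            for i in range(K)]
```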

Key Design 4: Twin EMA

  • Function: Maintains two sets of EMA parameters in place of the traditional single EMA
  • Core Idea:
    • Inference parameters \(\theta_{\text{infer}}^-\): slow decay rate (standard), used only for inference to ensure stable high-quality generation
    • Target parameters \(\theta_{\text{target}}^-\): fast decay rate (\(\rho = 0.95\)), used to generate self-consistency targets, closely tracking the current state of the online network
  • Effect: The self-consistency target seen by the online network more closely reflects the current distribution, eliminating the conflict of "aligning with a historical version," while inference still benefits from the stability of slow decay
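A compact sketch of the twin EMA bookkeeping with the decay rates reported in the paper (0.9999 for inference, 0.95 for targets); the class structure and the assumption of an update after every optimizer step are illustrative choices.

```python
import copy
import torch

class TwinEMA:
    """Two EMA copies of the online network: a slow one used only at inference
    and a fast one that generates self-consistency targets."""

    def __init__(self, online_model, infer_decay=0.9999, target_decay=0.95):
        self.infer_model = copy.deepcopy(online_model)   # theta_infer^-
        self.target_model = copy.deepcopy(online_model)  # theta_target^-
        self.infer_decay = infer_decay
        self.target_decay = target_decay

    @torch.no_grad()
    def update(self, online_model):
        """EMA step: ema <- decay * ema + (1 - decay) * online, for both copies."""
        for ema, decay in ((self.infer_model, self.infer_decay),
                           (self.target_model, self.target_decay)):
            for p_ema, p in zip(ema.parameters(), online_model.parameters()):
                p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

Here `target_model` would be queried without gradients to build self-consistency targets, while `infer_model` is used only for sampling.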

Loss & Training

Training Configuration

  • Backbone: SiT-XL/2 (675M parameters), latent space size \(32 \times 32\)
  • Dataset: ImageNet 256×256 (class-conditional generation), encoded to latent space using sd-vae-ft-mse
  • Training Iterations: 800K (main experiments), 250K (ablation studies)
  • Guidance Scale Sampling: \(w\) is discretely sampled from \([0, w_{\max}=3.5]\) in steps of 0.25; excessively high \(w_{\max}\) (5.0) introduces unnecessary complexity, while too low (2.0) lacks high-quality guidance
  • Interval Guidance Threshold: \(t_{\text{interval}} = 0.3\); below this threshold \(w=0\) (no guidance applied)
  • Wavelet Loss: \(L=5\) levels of DWT decomposition (maximum feasible depth for \(32 \times 32\) latent space \(= \log_2(32) = 5\))
  • sOT Parameters: \(K=32\); samples are pooled every 32 batches to compute a global OT plan, with approximately 4% additional training time
  • Twin EMA: Inference parameters \(\theta_{\text{infer}}^-\) decay rate 0.9999 (standard slow decay), target parameters \(\theta_{\text{target}}^-\) decay rate 0.95 (fast decay)
  • Conditioning Dropout: During training, conditions are randomly replaced with null conditioning \(\varnothing\) with a certain probability, following standard CFG training paradigm
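As an illustration of the guidance-scale sampling and interval guidance described above, here is a small sketch; the uniform draw over the discrete grid and the per-sample masking are assumptions, not the authors' implementation.

```python
import torch

def sample_guidance_scale(batch_size, t, w_max=3.5, step=0.25, t_interval=0.3):
    """Draw w uniformly from the discrete grid {0, 0.25, ..., w_max}, then
    disable guidance (w = 0) wherever t lies in the high-noise region t < t_interval."""
    grid = torch.arange(0.0, w_max + step, step)             # 15 discrete values
    idx = torch.randint(len(grid), (batch_size,))
    w = grid[idx]
    w = torch.where(t < t_interval, torch.zeros_like(w), w)  # interval guidance
    return w
```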

Inference Procedure

  • Inference uses slow-decay EMA parameters \(\theta_{\text{infer}}^-\) to generate samples
  • The number of sampling steps NFE \(\in \{1, 2, 4, 8, 128\}\) can be freely selected without retraining
  • Guidance strength \(w\) can be freely adjusted at inference time; the network directly outputs CFG-modulated velocity — no additional unconditional forward pass is required, reducing inference time by approximately 50% compared to standard CFG
  • One-step generation: directly \(x_1 = x_0 + s_\theta(x_0, 0, c, 1, w)\); multi-step generation iterates via Euler integration
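A minimal sketch of variable-step sampling under these rules, assuming a uniform time grid, Euler integration, and the same hypothetical `model(x, t, c, d, w)` signature as above; each step uses a single forward pass, consistent with the one-step formula.

```python
import torch

@torch.no_grad()
def sample(model, x0, c, num_steps, w):
    """Few-step sampling sketch: Euler integration with step size d = 1 / num_steps.
    One forward pass per step; the network already outputs CFG-modulated velocity."""
    x = x0                                   # x0 ~ N(0, I), pure noise at t = 0
    d = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * d, device=x.device)
        d_cond = torch.full_like(t, d)       # target step size conditioning
        w_cond = torch.full_like(t, w)       # guidance strength conditioning
        v = model(x, t, c, d_cond, w_cond)   # shortcut velocity for step size d
        x = x + d * v                        # Euler step toward t = 1 (data)
    return x
```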

Key Experimental Results

Main Results (ImageNet 256×256, SiT-XL/2, 800K Iterations)

Model           NFE   FID-50K ↓
SM (original)     1   10.60
IMM               1    7.77
iSM (Ours)        1    5.27
SM                2    —
IMM               2    3.99
iSM               2    2.44
SM                4    7.80
IMM               4    2.51
iSM               4    2.05
SM                8    —
IMM               8    1.99
iSM               8    1.93
SM              128    3.80
iSM             128    1.88

Additional Metric Validation (FD-DINOv2 / IS)

Model   NFE   FD-DINOv2 ↓   IS ↑
SM        1   500.92        102.66
IMM       1   247.78        128.87
iSM       1   232.31        223.52
SM        2   329.53        125.66
IMM       2   152.08        173.66
iSM       2   107.63        302.29
SM        4   265.90        136.79
IMM       4   110.88        204.95
iSM       4    83.70        298.23

Notably, iSM's improvement on FD-DINOv2 is even more pronounced: at 4 steps, iSM's score (83.70) is more than 3× lower than SM's (265.90), indicating that the improvements are not confined to the Inception feature space. The IS metric also shows a substantial advantage: at 2 steps, iSM (302.29) is 2.4× that of SM (125.66).

Component Ablation (250K Iterations)

The optimal hyperparameters for each component are determined within its corresponding ablation block and carried forward to the search for the next component.

Component                                FID (1-step)   FID (4-step)
Intrinsic Guidance (\(w_{\max}=3.5\))    9.62           3.17
+ Interval Guidance (\(t=0.3\))          8.49           2.81
+ Wavelet Loss (\(L=5\))                 8.12           2.64
+ sOT (\(K=32\))                         7.97           2.23
+ Twin EMA (\(\rho=0.95\))               6.56           2.16

Key Findings on Hyperparameter Sensitivity

  • \(w_{\max}\): 2.0 → 3.5 → 5.0 corresponds to 1-step FID 10.10 → 9.62 → 10.38, exhibiting a sweet spot
  • \(t_{\text{interval}}\): 0.0 → 0.1 → 0.3 → 0.5 corresponds to 4-step FID 3.17 → 3.14 → 2.81 → 2.84; at 0.5, 1-step FID jumps sharply to 19.22 (guidance coverage too narrow)
  • DWT levels: 0 → 1 → 3 → 5 show monotonically increasing effectiveness; 5 levels is the maximum decomposition depth for the latent space
  • sOT \(K\) value: 0 → 1 → 8 → 32, 1-step FID drops from 8.12 → 8.07 → 8.03 → 7.97; 4-step FID drops from 2.64 → 2.51 → 2.28 → 2.23
  • Target EMA decay rate: 0.9999 → 0.999 → 0.95, 1-step FID drops from 7.97 → 7.43 → 6.56; fast decay yields the largest improvement for one-step generation

High-Resolution Generalization (ImageNet 512×512, FlowDCN Architecture, 300K Iterations)

Model   NFE   FID ↓   Precision ↑   Recall ↑
SM        1   43.81   0.56          0.11
iSM       1   37.05   0.60          0.55
SM        4   12.16   0.86          0.19
iSM       4    9.94   0.78          0.62

The substantial improvement in Recall (0.11 → 0.55, 0.19 → 0.62) indicates that iSM dramatically improves sample diversity — a direct consequence of intrinsic guidance eliminating the mode collapse caused by compounding guidance. The slight drop in Precision at 4 steps (0.86 → 0.78) is consistent with the expected diversity-fidelity trade-off.

Highlights & Insights

  1. Formalization of Compounding Guidance: The paper is the first to rigorously prove the exponential compounding problem of CFG in SM (\(w' = w^{\log_2 N}\)), explaining the root cause of one-step generation artifacts — an important theoretical insight.
  2. Comprehensive and Systematic Problem Diagnosis: Rather than addressing issues in isolation, the paper identifies five bottlenecks at once and provides a unified solution framework.
  3. Clear Contribution of Each Component: The ablation study clearly demonstrates the incremental gains of each improvement, from 9.62 → 6.56 (1-step FID), with well-documented design decisions.
  4. Architecture and Resolution Generalization: The framework's generality is validated on FlowDCN + 512×512, with the Recall improvement from 0.11 to 0.55 being particularly notable.
  5. Controllable Training Overhead: sOT adds only ~4% training time, Twin EMA introduces negligible additional parameters, and the overall improvements do not impose significant computational burden.

Limitations & Future Work

  1. Validation Only on ImageNet: The absence of experiments on text-to-image (T2I) generation or larger-scale datasets leaves the practical applicability to be further confirmed.
  2. Limited Comparison with Distillation Methods: Distillation methods such as DMD and DMD2 offer more aggressive compression for few-step generation; the paper does not provide a thorough comparison.
  3. Remaining Gap in One-Step FID: Compared to GANs (StyleGAN-XL 2.30), iSM's one-step FID of 5.27 still shows a notable gap, and the ceiling of the SM framework in extreme one-step scenarios remains to be explored.
  4. Upper Bound of sOT \(K\): The paper stops at \(K=32\); whether larger \(K\) would yield continued gains is unknown, and increasing \(K\) introduces greater inter-batch latency.
  5. Generality of Wavelet Loss: The multi-level DWT loss is hard-coded for a \(32 \times 32\) latent space; different latent resolutions require re-adjusting the number of decomposition levels.
Comparison with SM and IMM

Dimension             SM (Original)                  IMM                            iSM (Ours)
Variable Steps        ✅ Single network, multi-step   ✅ Single network, multi-step   ✅ Single network, multi-step
CFG Flexibility       ❌ Fixed at training            Partial                        ✅ Adjustable at inference
1-Step FID            10.60                          7.77                           5.27
4-Step FID            7.80                           2.51                           2.05
Frequency Awareness   ❌                              ❌                              ✅ Wavelet Loss
OT Matching           ❌                              ❌                              ✅ Scaling OT
EMA Strategy          Single EMA                     Single EMA                     Twin EMA

Insights and Connections

  1. Intrinsic Guidance is Transferable: Conditioning on CFG strength as a network input can be directly transferred to other generative models requiring CFG (e.g., consistency models, Flow Matching) and serves as a general performance improvement technique.
  2. Lessons from Wavelet-Domain Loss: Performing frequency decomposition in latent space rather than pixel space is efficient and effective, and warrants exploration in other latent-space diffusion models.
  3. Twin EMA Concept: The design of using distinct EMA parameters for training and inference is concise and effective, and may be applicable to other self-consistency or self-distillation frameworks.
  4. Cross-Batch Pooling in sOT: Achieving large-scale OT at minimal cost can serve as a general training enhancement for Flow Matching.
  5. Complementarity with VeCoR: VeCoR improves FM via contrastive regularization of velocity fields, while iSM improves SM through training strategies; the two are orthogonal, and SM + VeCoR + iSM may be a combination worth exploring.

Rating

  • Novelty: ⭐⭐⭐⭐ (The theoretical analysis of compounding guidance and the Twin EMA design are novel, though individual components are not entirely new in isolation)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive ablations and diverse metrics, but lacks T2I and larger-scale validation)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Clear problem-solution structure, high-quality figures, and rigorous mathematical derivations)
  • Value: ⭐⭐⭐⭐ (Systematically improves the competitiveness of SM and provides a practical guide for variable-step generation paradigms)