Self-Refining Video Sampling

Conference: ICML 2026
arXiv: 2601.18577
Code: https://agwmon.github.io/self-refine-video/ (Project Page)
Area: Video Generation / Diffusion Models / Inference-time Sampling
Keywords: Video Diffusion, Flow Matching, Self-Refining Sampling, Denoising Autoencoder, Physical Consistency

TL;DR

The pretrained flow matching video generator is reinterpreted as a "denoising autoencoder." During inference, a "Predict-and-Perturb" inner loop is used within the same noise level to repeatedly correct latents. An uncertainty mask derived from model self-consistency is applied to refine only dynamic regions. This approach significantly improves motion coherence and physical plausibility without any external verifiers or additional training, achieving over 70% human preference.

Background & Motivation

Background: Modern video generators such as Wan2.1/2.2 and Cosmos-Predict are typically based on flow matching, where a learned vector field pushes Gaussian noise toward the data distribution along an ODE in VAE latent space. Although regarded as early "world models," physical dynamics (e.g., multi-argument interactions, complex human movements, rigid-body free fall) remain a known weakness.

Limitations of Prior Work: There are two primary paths to improving physical realism, both of which are computationally heavy. The first is external verifier + rejection sampling (e.g., Cosmos-Reason1, Bansal et al.), which generates multiple candidates and selects the one with the highest score. This suffers from low acceptance rates, high costs, and domain-specific verifiers with limited temporal/physical assessment capabilities. The second is additional training/post-training (e.g., WISA, VideoJAM, CGI synthetic data fine-tuning), which requires high-quality physical annotations and massive training costs, while reward models struggle to capture fine-grained motion.

Key Challenge: Large-scale video generators already encode priors of "realistic motion + structure" in their weights. However, standard ODE solvers follow a one-way trajectory—once coarse motion is determined in the early steps, no mechanism exists to revisit and correct it. While LLMs can critique-and-revise their output tokens, video generators lack such explicit feedback signals, particularly for high-dimensional, temporally coupled latents.

Goal: To find a (i) training-free, (ii) self-contained, (iii) computationally manageable, and (iv) plug-and-play inference-time self-refinement mechanism for existing ODE solvers.

Key Insight: The authors observe that the flow matching objective can be rewritten as \(\mathcal{L}_{\text{FM}}(\theta)=\mathbb{E}\bigl[\tfrac{1}{(1-t)^2}\lVert\hat z_1^\theta-z_1\rVert_2^2\bigr]\), where \(\hat z_1^\theta = z_t+(1-t)u_\theta(z_t,t)\) is the model's prediction of the clean sample. This is exactly a weighted version of the generalized denoising autoencoder (generalized DAE, Bengio 2013) objective. This means a flow matching model is essentially a time-conditioned DAE across all noise levels, possessing the implicit ability to "corrupt-and-reconstruct" back to the data manifold.
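
A short derivation sketch of this rewriting, assuming the linear interpolation path \(z_t = t z_1 + (1-t)\epsilon\) (the same form as the Perturb operator used in the Method section) and the standard conditional target velocity \(u = z_1 - \epsilon\). The symbol \(u\) for the target is introduced here for illustration only:

\[
\begin{aligned}
z_1 &= z_t + (1-t)(z_1-\epsilon) = z_t + (1-t)\,u,\\
\hat z_1^\theta - z_1 &= (1-t)\bigl(u_\theta(z_t,t) - u\bigr),\\
\lVert u_\theta(z_t,t) - u\rVert_2^2 &= \tfrac{1}{(1-t)^2}\,\lVert \hat z_1^\theta - z_1\rVert_2^2 .
\end{aligned}
\]

Taking the expectation over \(t\), \(z_1\), and \(\epsilon\) on both sides gives the weighted reconstruction form of \(\mathcal{L}_{\text{FM}}\) stated above.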

Core Idea: Reinterpret the flow matching video generator as a DAE and initiate a pseudo-Gibbs inner loop at each sampling step \(t\): use the model to predict the clean endpoint \(\hat z_1\) (Predict), then perturb it back to the same noise level (Perturb). Repeat this 2–3 times before performing a standard ODE step—termed Predict-and-Perturb (P&P). Additionally, apply an uncertainty mask based on self-consistency to refine only motion areas, preventing over-saturation in static regions caused by cumulative CFG.
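
The Predict and Perturb steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the video model (it is the exact velocity field when the data distribution is a single point `TARGET`), and the function names, tensor shape, and step size are assumptions made for the example.

```python
import numpy as np

def predict(z_t, t, velocity):
    """Predict operator: D(z_t, t) = z_t + (1 - t) * u(z_t, t)."""
    return z_t + (1.0 - t) * velocity(z_t, t)

def perturb(z_1, t, eps):
    """Perturb operator: R_eps(z, t) = t * z + (1 - t) * eps."""
    return t * z_1 + (1.0 - t) * eps

def pnp_refine(z_t, t, velocity, num_iters, rng):
    """Run num_iters Predict-and-Perturb rounds at a fixed noise level t."""
    for _ in range(num_iters):
        z1_hat = predict(z_t, t, velocity)       # estimate the clean endpoint
        eps = rng.standard_normal(z_t.shape)     # fresh Gaussian noise each round
        z_t = perturb(z1_hat, t, eps)            # re-noise back to level t
    return z_t

def euler_step(z_t, t, dt, velocity):
    """Standard first-order ODE step from t to t + dt."""
    return z_t + dt * velocity(z_t, t)

# Hypothetical toy model: the data is a point mass at TARGET, so the exact
# velocity at (z_t, t) is (TARGET - z_t) / (1 - t). Shape is (C, T, H, W).
TARGET = np.full((4, 2, 8, 8), 0.5)

def toy_velocity(z_t, t):
    return (TARGET - z_t) / (1.0 - t)

rng = np.random.default_rng(0)
z = rng.standard_normal(TARGET.shape)
z_star = pnp_refine(z, t=0.1, velocity=toy_velocity, num_iters=3, rng=rng)
z_next = euler_step(z_star, t=0.1, dt=0.025, velocity=toy_velocity)
```

With a real generator, `velocity` would wrap the model's forward pass (including CFG), and each inner round costs one extra function evaluation.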

Method

Overall Architecture

The input remains Gaussian noise \(z_{t_0}\sim\mathcal{N}(0,\mathbf{I})\), which is integrated along an ODE over discretized time steps \(t_0<\cdots<t_T=1\). In the early "motion determination window" (step index \(i\le\alpha T\) with \(\alpha\approx 0.2\), i.e., roughly \(t<0.2\)), a standard ODE step yields \(z_{t_{i+1}}^{(0)}\). This is followed by a P&P inner loop of \(K_f\le 3\) iterations. In each iteration, the "refined ODE result" and an "uncertainty mask" are calculated simultaneously. The mask is used to fuse the refined result with the previous round's result at each spatial-temporal position. Mid-to-late steps skip refinement and proceed with the standard ODE solver. Total inference time increases by only \(\sim 1.5\times\), with no extra training or external models.
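
The schedule described above can be written down as a per-step plan. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the window is taken as the first `floor(alpha * num_steps)` steps, and the NFE estimate assumes exactly one extra model call per inner iteration, which need not match the accounting behind the reported NFE figures.

```python
import math

def refinement_plan(num_steps, alpha=0.2, k_f=2):
    """Return the number of inner P&P iterations for each ODE step index."""
    # Small epsilon guards against floating-point error in alpha * num_steps.
    window = math.floor(alpha * num_steps + 1e-9)
    return [k_f if i < window else 0 for i in range(num_steps)]

def estimated_nfe(plan):
    """One model call per ODE step plus one per inner iteration (assumed)."""
    return len(plan) + sum(plan)

plan = refinement_plan(num_steps=40, alpha=0.2, k_f=2)
nfe = estimated_nfe(plan)
```

Under these assumptions, only the first 8 of 40 steps carry extra work, which is why the overhead stays well below that of doubling the step count.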

Key Designs

  1. Reinterpreting Flow Matching as DAE:

    • Function: Redefines the "sampling process" as a pair of repeatable operators: the Predict operator \(D_\theta(z_t,t):=z_t+(1-t)u_\theta(z_t,t)\) maps \(z_t\) at any noise level to a clean prediction \(\hat z_1\); the Perturb operator \(R_\epsilon(z,t):=tz+(1-t)\epsilon\) adds noise to a clean sample back to level \(t\).
    • Mechanism: The authors show that \(\mathcal{L}_{\text{FM}}\) is equivalent to a weighted DAE reconstruction objective \(\mathbb{E}\bigl[\tfrac{1}{(1-t)^2}\lVert\hat z_1^\theta-z_1\rVert_2^2\bigr]\). Thus, a trained flow matching model acts as a valid DAE at any fixed \(t\) and can be applied repeatedly at that noise level.
    • Design Motivation: This shifts the perspective from "one-way ODE trajectory" to a "re-entrant DAE at each \(t\)," providing the theoretical foundation—without this, repeated perturb-and-predict would lack justification.
  2. Predict-and-Perturb (P&P) Inner Loop:

    • Function: Performs pseudo-Gibbs sampling within a fixed noise level \(t\) to pull \(z_t\) toward high-density regions (i.e., the physically and temporally plausible video manifold) before the ODE solver continues.
    • Mechanism: A single iteration is defined as \(z_t^{(k+1)}=\operatorname{P\&P}_{\epsilon_k}(z_t^{(k)},t):=R_{\epsilon_k}(D_\theta(z_t^{(k)},t),t)\). Starting from \(z_t^{(0)}=z_t\), Predict→Perturb is repeated with newly sampled Gaussian noise for local resampling. The refined \(z_t^*:=z_t^{(K_f)}\) is then used in a standard ODE step: \(z_{t_{i+1}}=z_t^*+\Delta t\cdot u_\theta(z_t^*,t)\).
    • Design Motivation: Consistent with findings that "motion and physics are locked in its first few steps," P&P is only triggered for \(t<0.2\). Local resampling at the same noise level maintains a larger exploration radius to alleviate early lock-in without the error accumulation seen in cross-level jumps like Restart.
  3. Uncertainty-aware Selective Refinement (Uncertainty-aware P&P):

    • Function: Identifies spatial-temporal regions that the model has not yet "seen clearly" and applies P&P only to those areas. This preserves static backgrounds and avoids over-saturation or artifacts caused by cumulative CFG.
    • Mechanism: The difference between two consecutive P&P predictions serves as a "self-consistency" signal: \(\mathbf{U}(z_{t}^{(k-1)},z_{t}^{(k)}):=\tfrac{1}{C}\lVert D_\theta(z_{t}^{(k-1)},t)-D_\theta(z_{t}^{(k)},t)\rVert_1\) (averaged over channels). This map is binarized with threshold \(\tau=0.25\) to produce mask \(M_{t_i}^{(k)}\). Fusion is performed as: \(z_{t_{i+1}}^{(k)}\leftarrow M_{t_i}^{(k)}\odot z_{t_{i+1}}^{(k)}+(1-M_{t_i}^{(k)})\odot z_{t_{i+1}}^{(k-1)}\). Mask calculation is integrated into the Predict step to avoid extra NFE.
    • Design Motivation: The authors observe that \(K_f>3\) without masking leads to CFG accumulation on static backgrounds, causing color shifts and exaggerated reflections. Moving regions, which differ between predictions, do not suffer this accumulation. The consistency-based mask automatically separates motion that requires revision from static backgrounds.
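
The mask and fusion step in Key Design 3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: latents are shaped `(C, T, H, W)`, the threshold is applied directly to the raw channel-averaged L1 map (the summary does not say whether the map is normalized first), and the function names are hypothetical.

```python
import numpy as np

def uncertainty_map(pred_prev, pred_curr):
    """Channel-averaged L1 difference between two consecutive clean predictions.

    Inputs are shaped (C, T, H, W); the output is shaped (T, H, W).
    """
    return np.abs(pred_prev - pred_curr).mean(axis=0)

def binarize(u, tau=0.25):
    """Mask is 1 where the predictions disagree by more than tau, else 0."""
    return (u > tau).astype(u.dtype)

def fuse(z_refined, z_prev, mask):
    """Keep the refined latent where mask == 1 and the previous one elsewhere."""
    m = mask[None]  # broadcast the (T, H, W) mask over the channel axis
    return m * z_refined + (1.0 - m) * z_prev

# Toy example: the two predictions disagree at a single spatio-temporal position.
pred_prev = np.zeros((4, 2, 3, 3))
pred_curr = np.zeros((4, 2, 3, 3))
pred_curr[:, 0, 1, 1] = 1.0

mask = binarize(uncertainty_map(pred_prev, pred_curr), tau=0.25)
fused = fuse(np.ones((4, 2, 3, 3)), np.zeros((4, 2, 3, 3)), mask)
```

In the toy example only the disagreeing position takes the refined value, so static regions keep their previous latent and do not accumulate repeated CFG updates.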

Loss & Training

The method is completely training-free. It utilizes the weights of pretrained Wan2.1 / Wan2.2 / Cosmos-Predict-2.5 without any gradient updates or fine-tuning. The algorithm introduces 3 tunable hyperparameters: P&P iterations \(K_f\) (default 2–3), uncertainty threshold \(\tau\) (default 0.25), and P&P trigger window \(\alpha\) (default first ~20% of steps). All operations are executed in latent space.

Key Experimental Results

Main Results

Dynamic-bench (Wan2.2-A14B T2V, 120 challenging motion prompts, VBench auto-eval + 20-person subjective eval):

| Method | Human Motion (%) | Human Text (%) | VBench Motion ↑ | VBench Const. ↑ | NFE | Time |
|---|---|---|---|---|---|---|
| Wan2.2 T2V (Default UniPC) | 73.57 | 57.64 | 98.01 | 90.68 | 40 | 1.0× |
| + NFE×2 | 74.05 | 57.55 | 98.03 | 90.66 | 80 | 2.0× |
| + CFG-Zero | 81.53 | 65.71 | 98.27 | 91.16 | 40 | 1.0× |
| + FlowMo (training-free) | 70.57 | 61.71 | 97.68 | 90.95 | 40* | 3.9× |
| + Ours | | | 98.41 | 91.33 | 60 | 1.5× |

PAI-Bench Robotics I2V (Gemini 3 Flash evaluating grasp, Qwen2.5-VL-72B evaluating Robot-QA):

| Base Model + Method | Grasp ↑ | Robot-QA ↑ | Quality ↑ | NFE | Time |
|---|---|---|---|---|---|
| Cosmos-Predict-2.5 | 79.2 | 71.7 | 75.1 | 35 | 1.0× |
| + Verifier best-of-4 | 84.4 | 72.3 | 75.3 | 140 | 4.0× |
| + Ours | 89.6 | 76.3 | 75.1 | 57 | 1.6× |
| Wan2.2-I2V-A14B | 77.3 | 77.4 | 75.3 | 40 | 1.0× |
| + Verifier best-of-4 | 80.5 | 78.1 | 75.3 | 144 | 4.0× |
| + Ours | 85.7 | 80.3 | 75.5 | 60 | 1.5× |

Physical Alignment: On PhyWorldBench, the PC score improved from 29.3 to 40.0 (+10.7). In VideoPhy2 human evaluations, 84% of users preferred this method over the default sampler.

Spatial Consistency: SSIM improved from 0.401 to 0.485, and PSNR from 14.96 to 17.21 dB.

Ablation Study

| Configuration | Key Observation | Explanation |
|---|---|---|
| Full (P&P + Uncertainty mask) | Optimal motion/physical consistency | Complete method |
| P&P, \(K_f=5\), No mask | Over-saturation, tone drift | CFG accumulation in static regions |
| P&P, \(K_f=2{-}3\), No mask | Improved motion, occasional background simplification | Acceptable with limited iterations |
| P&P at \(t<0.2\) only vs. whole process | Early trigger is sufficient | Late triggers yield marginal gains but add NFE |
| \(\tau\) 0.15 / 0.25 / 0.40 | 0.25 is most stable | Too low: background altered; too high: motion missed |

Visualized masks show that the thresholded 1-regions closely track moving objects (hands, limbs), while 0-regions correspond to static backgrounds. This supports the claim that the model's self-consistency difference is a valid signal for locating areas that need refinement.

Key Findings

  • Early steps determine everything: Limiting P&P to the first 20% of steps captures almost all gains, confirming that motion/physical structure is locked in early steps.
  • Uncertainty mask is the cure for over-saturation: P&P alone can introduce artifacts; the mask ensures stability for larger \(K_f\) and prevents CFG accumulation in backgrounds.
  • Video is more robust than images: P&P on images can trigger drastic semantic shifts, while on video it only causes gentle motion structural changes. Cross-frame consistency makes P&P a "local search" rather than "global resampling."
  • Mode-seeking improves temporal consistency: Iterative P&P is mode-seeking. In videos, this manifests as reduced jitter and flicker, as temporally inconsistent videos lie in low-density regions.
  • Not all visual reasoning can be fixed: Graph traversal success jumped from 0.1 to 0.8; however, maze solving showed zero improvement—indicating P&P fixes "motion/temporal" errors but not discrete/semantic correctness, which still requires external verifiers.

Highlights & Insights

  • Theoretical Highlight: Flow matching as DAE. This allows each fixed \(t\) to be treated as an independent DAE that can be called repeatedly without breaking the distribution.
  • Engineering Highlight: Zero-NFE mask calculation. Binding the next-step ODE calculation with the mask calculation into the same Predict call allows the mask to be a "free lunch."
  • Methodological Highlight: Uncertainty from prediction variance. Using the L1 difference between two P&P outputs of the same model locates areas needing refinement without external heads or ensembles.
  • Transferable Trick: Adding 2–3 P&P iterations for the first 20% of steps requires only a few lines of code and yields large gains for a modest increase in compute.

Limitations & Future Work

  • Dependency on model priors: P&P cannot generate knowledge the base model lacks (e.g., maze solving). It is essentially a more stable search for high-density modes.
  • Mode-seeking in images: \(K_f=8\) can collapse image diversity. The success in video depends on the unique property of cross-frame consistency.
  • Empirical hyperparameters: \(\alpha, \tau, K_f\) are robust but currently empirical; there is no automatic estimation mechanism yet.
  • Future Work: (i) Adaptive \(K_f\) based on mask energy; (ii) Combining P&P with reward-guided fine-tuning; (iii) Exploring masks in 3D generation or video editing.

Comparison with Related Work

  • vs. Restart Sampling: Restart uses forward-backward macro cycles across noise levels to fix late-stage errors; this method performs micro-refinement within the same noise level to fix early-stage motion errors.
  • vs. FreeInit: FreeInit refines the initial noise by rerunning the entire denoising chain; this method refines intermediate latents with far higher efficiency.
  • vs. Verifier-based methods: Rejection sampling takes 4.0× time for lower gains; P&P's 1.6× time is more cost-effective for physical consistency.
  • vs. VideoJAM / FlowMo: Unlike these, this method requires zero gradients and no additional networks, representing a significant leap in engineering friendliness.
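
The cost-effectiveness comparison with verifier-based sampling can be made concrete with a little arithmetic on the PAI-Bench Grasp numbers reported above. The metric below (Grasp gain per unit of extra inference time) is an illustrative choice made for this sketch, not a quantity defined in the paper, and the names are hypothetical.

```python
# (Grasp score, time multiplier) taken from the PAI-Bench Robotics I2V table.
RESULTS = {
    "Cosmos-Predict-2.5": {"base": 79.2, "verifier": (84.4, 4.0), "ours": (89.6, 1.6)},
    "Wan2.2-I2V-A14B": {"base": 77.3, "verifier": (80.5, 4.0), "ours": (85.7, 1.5)},
}

def gain_per_extra_cost(base, score, time_mult):
    """Grasp improvement divided by the extra time over the 1.0x baseline."""
    return (score - base) / (time_mult - 1.0)

summary = {
    model: {
        method: gain_per_extra_cost(row["base"], *row[method])
        for method in ("verifier", "ours")
    }
    for model, row in RESULTS.items()
}

for model, gains in summary.items():
    print(f"{model}: verifier={gains['verifier']:.2f}, ours={gains['ours']:.2f}")
```

On both base models the self-refining sampler delivers roughly an order of magnitude more Grasp improvement per unit of added inference time than best-of-4 rejection sampling.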

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐