Envisioning the Future, One Step at a Time¶
Conference: CVPR 2026
arXiv: 2604.09527
Code: http://compvis.github.io/myriad
Area: Video Understanding / Motion Prediction
Keywords: Open-set motion prediction, sparse trajectories, autoregressive diffusion model, future prediction, world model
TL;DR¶
This paper formulates open-set future scene dynamics prediction as stepwise reasoning over sparse point trajectories, enabling rapid generation of thousands of diverse future hypotheses from a single image via an autoregressive diffusion model — orders of magnitude faster than dense prediction models.
Background & Motivation¶
Background: Most future prediction methods rely on dense video or latent-space prediction, expending substantial model capacity on appearance rather than underlying motion trajectories, making large-scale exploration of future hypotheses computationally prohibitive.
Limitations of Prior Work: (1) Dense video generation methods incur a "visual tax" — every pixel must be rendered before motion can be reasoned about; (2) single-step prediction methods fail in long-horizon scenarios involving multiple contacts; (3) physics engine methods cannot generalize to open-set motion.
Key Challenge: Real-world dynamics are highly complex and stochastic — a large number of possible futures must be considered, yet dense prediction renders such exploration computationally infeasible.
Goal: Achieve open-set, stepwise, and massively parallelizable motion prediction without incurring the visual tax.
Key Insight: Analogous to human cognition, we do not "paint" pictures of the future but instead track meaningful changes; sparsity is what makes foresight tractable.
Core Idea: Motion prediction is modeled as a stepwise autoregressive diffusion process over user-defined sparse point trajectories.
Method¶
Overall Architecture¶
Given a single reference frame and \(K\) visible query points, incremental motion at each timestep is generated autoregressively. Each step is a conditional diffusion model predicting locally predictable short-range transitions. The model factorizes the joint distribution causally across both the temporal and trajectory dimensions.
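As a sketch (not the authors' code), the stepwise rollout reduces to a loop over a per-step conditional sampler; the `sample_step` interface and array shapes below are assumptions for illustration:

```python
import numpy as np

def rollout(ref_feats, x0, num_steps, sample_step, rng):
    """Autoregressively roll out K sparse point trajectories.

    ref_feats:   features of the single reference frame (hypothetical)
    x0:          (K, 2) initial query-point positions
    sample_step: assumed per-step conditional diffusion sampler
    """
    x = x0.copy()
    trajectory = [x0.copy()]
    for t in range(num_steps):
        # Each step predicts only a short-range, locally predictable increment.
        dx = sample_step(ref_feats, x0, x, rng)   # (K, 2) motion increments
        x = x + dx
        trajectory.append(x.copy())
    return np.stack(trajectory)                   # (num_steps + 1, K, 2)
```

Because each step conditions only on the past, sampling many futures amounts to running this loop in parallel with independent noise per hypothesis.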
Key Designs¶
- Motion Token Design:
  - Function: Construct an informative representation for each (time, trajectory) pair.
  - Mechanism: Three sources of information are fused: (1) appearance features sampled at the original position \(x_0^{(i)}\) ("what"); (2) local context features sampled at the current position \(x_t^{(i)}\) ("where"); (3) Fourier-encoded current motion \(\Delta x_t^{(i)}\). A trajectory identifier \(id_{traj}^{(i)} \sim \mathcal{U}(\mathbb{S}^{d-1})\), sampled uniformly from the unit hypersphere, is additionally appended.
  - Design Motivation: Random IDs prevent the model from over-relying on fixed indices and allow scaling to arbitrary \(K\); dual-position sampling of appearance and context lets each token simultaneously encode what the target is and where it currently resides.
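A minimal sketch of how such a token could be assembled, assuming 2-D point coordinates and hypothetical feature dimensions (the actual encoders and dimensions are not specified in these notes):

```python
import numpy as np

def random_traj_ids(K, d, rng):
    """Sample trajectory IDs uniformly from the unit hypersphere S^{d-1}:
    a Gaussian draw, normalized, is uniform on the sphere."""
    v = rng.standard_normal((K, d))
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fourier_encode(dx, num_freqs=4):
    """Fourier features of the motion increment: (K, 2) -> (K, 4 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)                     # (F,)
    ang = dx[..., None] * freqs                             # (K, 2, F)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(len(dx), -1)

def motion_tokens(appearance_at_x0, context_at_xt, dx, traj_ids):
    """Concatenate 'what' (appearance at x0), 'where' (context at x_t),
    the encoded current motion, and the random trajectory identifier."""
    return np.concatenate(
        [appearance_at_x0, context_at_xt, fourier_encode(dx), traj_ids], axis=-1)
```

Resampling `traj_ids` every sequence is what discourages the model from memorizing fixed indices.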
- Fast Reasoning Blocks:
  - Function: Substantially accelerate sampling during autoregressive inference.
  - Mechanism: Parallel Transformer blocks merge self-attention, cross-attention, and the FFN into a single residual update: \(\mathbf{h} \leftarrow \mathbf{h} + SA(\mathbf{h}) + CA(\mathbf{h}, \mathbf{h}_{cross}) + FFN(\mathbf{h})\). Shared pre-normalization and fused projections are applied; image tokens remain frozen (serving solely as keys and values for cross-attention), while motion tokens causally attend to both streams.
  - Design Motivation: The many kernel launches in conventional Transformer layers are the primary bottleneck for autoregressive inference; the fused design significantly reduces the number of launches.
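The fused residual update can be illustrated with a toy single-head NumPy sketch; the weight names and single-head simplification are assumptions, not the paper's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v, causal=False):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Motion tokens may only attend to past positions.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v

def parallel_block(h, h_cross, params):
    """One fused update: h <- h + SA(h) + CA(h, h_cross) + FFN(h).
    A single shared pre-norm feeds all three branches, so the three
    sequential sub-layers of a standard block collapse into one residual add."""
    n = layer_norm(h)                  # shared pre-normalization
    nc = layer_norm(h_cross)           # frozen image tokens: keys/values only
    sa = attention(n @ params["wq"], n @ params["wk"], n @ params["wv"], causal=True)
    ca = attention(n @ params["wq2"], nc @ params["wk2"], nc @ params["wv2"])
    ffn = np.maximum(n @ params["w1"], 0.0) @ params["w2"]
    return h + sa + ca + ffn
```

Since the three branches share one normalized input, their projections can be fused into fewer, larger matmuls in a real implementation, which is where the kernel-launch savings come from.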
- Flow Matching Posterior Parameterization:
  - Function: Model the distribution of per-step motion with high fidelity.
  - Mechanism: Conditional flow matching models the distribution of the incremental motion \(\Delta x_t^{(i)}\) at each step, naturally accommodating uncertainty in multimodal motion. Independent denoising at each step allows uncertainty to grow naturally over time in long-horizon prediction.
  - Design Motivation: Compared to deterministic regression, flow matching inherently models multimodality and, critically, avoids the mode-averaging problem.
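The standard conditional flow matching recipe, applied to a per-step increment, can be sketched as follows; the linear interpolation path, Euler sampler, and `velocity_net` signature are assumptions for illustration:

```python
import numpy as np

def cfm_loss(velocity_net, dx_target, cond, rng):
    """Conditional flow matching loss for one step's motion increment.

    Linear path: z_tau = (1 - tau) * noise + tau * dx_target, whose
    velocity (dx_target - noise) is regressed by velocity_net.
    """
    noise = rng.standard_normal(dx_target.shape)
    tau = rng.uniform(size=(len(dx_target), 1))
    z = (1.0 - tau) * noise + tau * dx_target
    v_pred = velocity_net(z, tau, cond)
    return np.mean((v_pred - (dx_target - noise)) ** 2)

def sample_increment(velocity_net, cond, shape, rng, steps=8):
    """Draw one motion increment by Euler-integrating the learned flow
    from noise (tau = 0) to data (tau = 1)."""
    z = rng.standard_normal(shape)
    for i in range(steps):
        tau = np.full((shape[0], 1), i / steps)
        z = z + velocity_net(z, tau, cond) / steps
    return z
```

Drawing fresh noise at every autoregressive step is what lets uncertainty compound over the horizon, rather than being averaged away as in deterministic regression.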
Loss & Training¶
The model is trained on diverse in-the-wild videos with the standard conditional flow matching loss. KV caching is employed to accelerate autoregressive inference.
Key Experimental Results¶
Main Results¶
| Method Type | Prediction Accuracy | Sampling Speed | Diversity |
|---|---|---|---|
| Dense video models | High | Extremely slow | Low (cost-limited) |
| Physics engine methods | High (in-domain) | Moderate | Low (domain-limited) |
| Ours | Comparable / superior | Orders of magnitude faster | High (thousands of hypotheses) |
Ablation Study¶
| Configuration | Key Metric | Note |
|---|---|---|
| w/o trajectory ID | Significant performance drop | Essential for multi-trajectory setting |
| w/o Fast Reasoning | Substantial speed degradation | Fused blocks are critical |
| Single-step prediction | Degradation in long-horizon | Stepwise reasoning is necessary |
| Full model | Best | All components synergize |
Key Findings¶
- Accuracy matches or exceeds dense models on the OWM benchmark while sampling at speeds orders of magnitude faster.
- Random trajectory IDs are critical for multi-trajectory modeling — fixed IDs cause the model to memorize indices rather than learn dynamics.
- Stepwise reasoning allows uncertainty to grow naturally in long-horizon prediction, consistent with physical intuition.
Highlights & Insights¶
- "Track motion, not paint the world" philosophy: The visual tax is entirely avoided, concentrating computation on motion dynamics that truly matter.
- Engineering innovation in Fast Reasoning Blocks: The combination of fused projections, frozen image tokens, and prefix attention substantially improves throughput.
- Introduction of the OWM benchmark: Provides a standardized evaluation framework for open-set motion prediction.
Limitations & Future Work¶
- Sparse point trajectories cannot capture continuum motion such as deformation and rotation.
- Autoregressive generation still accumulates errors over very long horizons.
- The gap between motion prediction and scene understanding remains to be bridged.
Related Work & Insights¶
- vs. Video world models: These methods incur a substantial visual tax to predict every pixel; this work demonstrates that sparse trajectories suffice to capture the essence of motion.
- vs. Physics engine methods: Physics engines are restricted to closed-set domains, whereas the proposed approach achieves generalization in open-set settings through data-driven learning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ A paradigm shift combining sparse trajectories with stepwise autoregressive diffusion.
- Experimental Thoroughness: ⭐⭐⭐⭐ OWM benchmark with multi-scenario validation.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is compellingly articulated with insightful analogies.
- Value: ⭐⭐⭐⭐⭐ Opens an efficient and scalable new paradigm for future prediction.