# FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
**Conference:** CVPR 2026 · **arXiv:** 2603.12146 · **Code:** Unavailable · **Area:** Video Generation
**Keywords:** trajectory-controllable video generation, distillation acceleration, few-step inference, adversarial training, diffusion models
⚠️ This note is based on the arXiv abstract only (the local cache holds just the abstract, ~4.7 KB); method and experimental details are therefore limited.
## TL;DR
FlashMotion proposes a three-stage training framework that distills a multi-step trajectory-controllable video generation model into a few-step counterpart. By fine-tuning the adapter with a hybrid diffusion and adversarial objective, the method simultaneously preserves video quality and trajectory accuracy under few-step inference.
## Background & Motivation
Trajectory-controllable video generation has achieved remarkable progress in recent years, enabling users to precisely control object motion paths via predefined trajectories. Existing methods primarily adopt adapter architectures (e.g., ControlNet-style) injected into video diffusion models to achieve precise motion control.
- **Key Challenge:** All such methods rely on multi-step denoising (typically 20–50 steps), resulting in long inference times and substantial computational overhead. Although video distillation methods (e.g., consistency distillation, adversarial distillation) can compress multi-step generators into few-step versions, applying them directly to trajectory-controllable video generation significantly degrades both video quality and trajectory accuracy.
- **Key Challenge:** Distillation alters the model's latent-space distribution, introducing a mismatch between the trajectory adapter (trained on the multi-step model) and the distilled few-step model, so the adapter's control signals are misinterpreted.
- **Key Insight:** The paper designs a three-stage training framework (first train the adapter, then distill the base model, finally re-align the adapter with the few-step model via a hybrid objective) to resolve the distribution mismatch at its root.
## Method

### Overall Architecture
FlashMotion adopts a three-stage training pipeline:

1. **Stage 1 — Adapter Training:** Train a trajectory control adapter on the multi-step video generator to acquire precise trajectory control capability.
2. **Stage 2 — Base Model Distillation:** Distill the multi-step video generator into a few-step version to accelerate video generation.
3. **Stage 3 — Adapter Fine-tuning and Alignment:** Fine-tune the adapter using a hybrid strategy (diffusion objective + adversarial objective) to adapt it to the few-step generator.
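The staged decoupling can be sketched as a toy orchestration. All function names and the dict-based "model" stand-ins below are hypothetical illustrations (the paper's code is unavailable); the point is only the ordering of the three stages and the re-alignment in Stage 3:

```python
# Toy sketch of a FlashMotion-style three-stage pipeline.
# The dict-based "models" and all function names are illustrative
# stand-ins, not the paper's implementation.

def train_adapter(base_model):
    """Stage 1: fit a trajectory adapter against the multi-step base."""
    return {"aligned_to": base_model["id"], "controls": "trajectory"}

def distill(base_model, steps=4):
    """Stage 2: compress the multi-step generator into a few-step one."""
    return {"id": base_model["id"] + "-distilled", "steps": steps}

def realign(adapter, few_step_model):
    """Stage 3: fine-tune the adapter (diffusion + adversarial losses)."""
    return dict(adapter, aligned_to=few_step_model["id"])

base = {"id": "video-diffusion", "steps": 50}
adapter = train_adapter(base)        # Stage 1: adapter fits multi-step model
student = distill(base, steps=4)     # Stage 2: latent distribution shifts
adapter = realign(adapter, student)  # Stage 3: mismatch resolved

print(adapter["aligned_to"])  # video-diffusion-distilled
```

Skipping the `realign` step corresponds to the failure mode the paper identifies: the adapter would still be "aligned to" the multi-step model while driving the distilled one.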
### Key Designs
- **Trajectory Adapter Training (Stage 1)**
    - Function: Train a plug-and-play trajectory control module on the original multi-step video diffusion model.
    - Design Motivation: The adapter architecture injects precise motion control into the base model without affecting its original generation quality.
    - Mechanism: Follows standard adapter training paradigms (e.g., ControlNet); the adapter takes predefined trajectories as input and learns to map them to video motion.
- **Video Generator Distillation (Stage 2)**
    - Function: Compress the multi-step (20–50 steps) video diffusion model into a few-step (4–8 steps) version.
    - Design Motivation: Few-step inference substantially reduces computational overhead, though distillation shifts the model's latent-space distribution.
    - Mechanism: Employs existing video distillation methods; the paper validates generality across two distinct adapter architectures.
- **Hybrid-Objective Adapter Fine-tuning (Stage 3)**
    - Function: Re-align the adapter with the distilled few-step generator.
    - Design Motivation: Distillation alters the latent space, rendering the Stage 1 adapter incompatible with the few-step model.
    - Mechanism: A hybrid training strategy combining a diffusion objective (preserving trajectory accuracy) with an adversarial objective (enhancing video quality).
    - Novelty: The hybrid objective optimizes both dimensions at once: the diffusion loss ensures trajectory control signals propagate correctly, while the adversarial loss safeguards the perceptual quality of few-step generated videos.
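A quick back-of-the-envelope on the step counts above. This is only the implied reduction in denoising passes; actual wall-clock speedup also depends on per-step cost and adapter overhead, which the abstract does not report:

```python
# Reduction in denoising passes when going from 20-50 steps to 4-8.
# Illustrative arithmetic only; no timing numbers appear in the abstract.
multi_steps = (20, 50)  # typical multi-step range
few_steps = (4, 8)      # few-step range after distillation

worst_case = min(multi_steps) / max(few_steps)  # 20 steps -> 8 steps
best_case = max(multi_steps) / min(few_steps)   # 50 steps -> 4 steps

print(worst_case, best_case)  # 2.5 12.5
```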
## Evaluation Benchmark — FlashBench
- The paper introduces FlashBench, a benchmark specifically designed to evaluate long-sequence trajectory-controllable video generation.
- It jointly measures video quality and trajectory accuracy.
- It supports scenarios with varying numbers of foreground objects.
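The abstract does not specify FlashBench's metrics. One common way to quantify trajectory accuracy is the mean Euclidean distance between the user-specified path and the object path traced in the generated video; the function below is a hypothetical sketch of such a metric, not the benchmark's actual definition:

```python
import math

def trajectory_error(target, traced):
    """Mean per-frame Euclidean distance between two 2D point paths.

    target: user-drawn trajectory, list of (x, y) points, one per frame.
    traced: object-center positions observed in the generated video.
    (Hypothetical metric; FlashBench's definition is not in the abstract.)
    """
    if len(target) != len(traced):
        raise ValueError("paths must have the same number of frames")
    return sum(math.dist(p, q) for p, q in zip(target, traced)) / len(target)

target = [(0, 0), (1, 0), (2, 0), (3, 0)]  # requested straight path
traced = [(0, 0), (1, 1), (2, 0), (3, 1)]  # path the object actually took

print(trajectory_error(target, traced))  # 0.5
```

A lower score means the generated motion follows the requested trajectory more closely; a perfect reproduction scores 0.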
## Loss & Training

- Stage 3 employs a hybrid loss: \(\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda \, \mathcal{L}_{\text{adversarial}}\)
- The diffusion objective guarantees trajectory control precision.
- The adversarial objective guarantees the perceptual quality of generated videos.
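The abstract gives neither the concrete loss terms nor the weighting. A minimal sketch of how such a weighted hybrid objective is typically combined during Stage 3 (the loss values and λ below are placeholder numbers, not from the paper):

```python
# Hypothetical Stage-3 hybrid objective: L = L_diffusion + lambda * L_adversarial.
# Per the paper's setup, only the adapter is fine-tuned with this objective.
# All numeric values here are placeholders.

def hybrid_loss(l_diffusion: float, l_adversarial: float, lam: float = 0.1) -> float:
    """Weighted sum of the diffusion term (trajectory accuracy)
    and the adversarial term (perceptual quality)."""
    return l_diffusion + lam * l_adversarial

# e.g. a denoising (diffusion) term and a GAN generator term for one batch:
total = hybrid_loss(l_diffusion=0.42, l_adversarial=1.3, lam=0.1)
print(round(total, 2))  # 0.55
```

λ trades off the two terms: λ → 0 recovers a pure diffusion objective (accurate but potentially blurry few-step outputs), while a large λ favors the adversarial term at the risk of weaker trajectory adherence.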
## Key Experimental Results

### Main Results (Based on Abstract)
| Comparison Dimension | FlashMotion | Existing Distillation Methods | Multi-Step Models |
|---|---|---|---|
| Video Quality | ✓ Best | Notable degradation | Good |
| Trajectory Accuracy | ✓ Best | Notable degradation | Good |
| Inference Steps | Few-step (4–8) | Few-step | Multi-step (20–50) |
### Architectural Generality Validation
| Experimental Setting | Description |
|---|---|
| Adapter Architecture 1 | FlashMotion outperforms video distillation baseline |
| Adapter Architecture 2 | FlashMotion likewise outperforms baseline |
| Remark | Generality of the method is validated across two distinct adapter architectures |
### Key Findings
- Directly applying the original adapter to the distilled model leads to severe degradation, confirming the distribution mismatch problem.
- The hybrid diffusion + adversarial fine-tuning strategy effectively resolves this mismatch.
- The method is effective across two different adapter architectures, demonstrating strong generality.
- FlashMotion not only matches the performance of multi-step models but surpasses them on certain metrics.
## Highlights & Insights
- The paper precisely identifies the core bottleneck of few-step trajectory-controllable video generation — the distribution mismatch between the adapter and the distilled model.
- The three-stage decoupled training design is elegant: the adapter and the distilled generator are optimized independently before being aligned via a hybrid objective.
- The method is decoupled from specific adapter architectures, offering strong generality.
- The introduction of FlashBench addresses the gap in evaluation benchmarks for trajectory-controllable video generation.
## Limitations & Future Work
- The local cache contains only the abstract, precluding access to specific experimental data and implementation details.
- The three-stage training pipeline is relatively complex; the practical deployment cost warrants further evaluation.
- The specific number of few-step inference steps and the corresponding speedup ratio are not explicitly reported in the abstract.
- Applicability to longer videos and more complex multi-object trajectory scenarios remains to be verified.
## Related Work & Insights
- Trajectory-controllable video generation: adapter-based methods such as DragNUWA and MotionCtrl.
- Video distillation: consistency distillation and adversarial diffusion distillation.
- Insights for accelerating controllable generation: when distillation shifts the base model's distribution, control modules require re-alignment — an insight generalizable to other controllable generation settings (e.g., layout, depth).
## Rating
- Novelty: ⭐⭐⭐⭐ The three-stage decoupled design combined with hybrid-objective fine-tuning is novel and practical, precisely targeting the distribution mismatch problem.
- Experimental Thoroughness: ⭐⭐⭐ (based on abstract) Validation across two architectures and a new benchmark.
- Writing Quality: ⭐⭐⭐ The abstract is clear with well-defined problem formulation.
- Value: ⭐⭐⭐⭐ Strong practical demand for accelerating trajectory-controllable video generation; method demonstrates good generality.