FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

Conference: CVPR 2026 · arXiv: 2603.12146 · Code: Unavailable · Area: Video Generation · Keywords: trajectory-controllable video generation, distillation acceleration, few-step inference, adversarial training, diffusion models

⚠️ This note is written based on the arXiv abstract (local cache contains abstract only, ~4.7KB); method and experimental details are limited.

TL;DR

FlashMotion proposes a three-stage training framework that distills a multi-step trajectory-controllable video generation model into a few-step counterpart. By fine-tuning the adapter with a hybrid diffusion and adversarial objective, the method simultaneously preserves video quality and trajectory accuracy under few-step inference.

Background & Motivation

Trajectory-controllable video generation has achieved remarkable progress in recent years, enabling users to precisely control object motion paths via predefined trajectories. Existing methods primarily adopt adapter architectures (e.g., ControlNet-style) injected into video diffusion models to achieve precise motion control.

Key Challenge: All such methods rely on multi-step denoising processes (typically 20–50 steps), resulting in long inference times and substantial computational overhead. Although video distillation methods (e.g., consistency distillation, adversarial distillation) can compress multi-step generators into few-step versions, directly applying these distillation methods to trajectory-controllable video generation leads to significant degradation in both video quality and trajectory accuracy.

Root Cause: Distillation alters the latent-space distribution of the model, introducing a distribution mismatch between the trajectory adapter trained on the multi-step model and the distilled few-step model, causing the adapter's control signals to be misinterpreted.

Key Insight: The paper designs a three-stage training framework — first training the adapter, then distilling the base model, and finally re-aligning the adapter with the few-step model using a hybrid objective — to fundamentally resolve the distribution mismatch problem.

Method

Overall Architecture

FlashMotion adopts a three-stage training pipeline:

  1. Stage 1 — Adapter Training: Train a trajectory control adapter on the multi-step video generator to acquire precise trajectory control capability.
  2. Stage 2 — Base Model Distillation: Distill the multi-step video generator into a few-step version to accelerate video generation.
  3. Stage 3 — Adapter Fine-tuning and Alignment: Fine-tune the adapter using a hybrid strategy (diffusion objective + adversarial objective) to adapt it to the few-step generator.

Key Designs

  1. Trajectory Adapter Training (Stage 1):

    • Function: Train a plug-and-play trajectory control module on the original multi-step video diffusion model.
    • Design Motivation: The adapter architecture enables precise injection of motion control into the base model without affecting its original generation quality.
    • Mechanism: Follows standard adapter training paradigms (e.g., ControlNet); the adapter takes predefined trajectories as input and learns to map them to video motion (a minimal sketch follows after this list).
  2. Video Generator Distillation (Stage 2):

    • Function: Compress the multi-step (20–50 steps) video diffusion model into a few-step (e.g., 4–8 steps) version.
    • Design Motivation: Few-step inference substantially reduces computational overhead, though distillation shifts the latent space distribution of the model.
    • Mechanism: Employs existing video distillation methods; the paper validates generality across two distinct adapter architectures (a toy distillation loss is sketched after this list).
  3. Hybrid-Objective Adapter Fine-tuning (Stage 3):

    • Function: Re-align the adapter with the distilled few-step generator.
    • Design Motivation: Distillation alters the latent space, rendering the Stage 1 adapter incompatible with the few-step model.
    • Mechanism: A hybrid training strategy combining a diffusion objective (preserving trajectory accuracy) and an adversarial objective (enhancing video quality).
    • Novelty: The hybrid objective simultaneously optimizes two dimensions — the diffusion loss ensures that trajectory control signals are correctly propagated, while the adversarial loss ensures the perceptual quality of few-step generated videos.
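
To ground item 1, here is a minimal ControlNet-style adapter sketch. The rasterized-trajectory input, channel sizes, and zero-initialized output projection are common conventions for such adapters, not details from the paper:

```python
import torch
import torch.nn as nn

class TrajectoryAdapter(nn.Module):
    """Illustrative ControlNet-style adapter: encodes a per-frame trajectory
    raster (e.g., a 2-channel point/flow map) and adds a zero-initialized
    residual to the frozen generator's features. All sizes are hypothetical."""

    def __init__(self, traj_channels: int = 2, feat_channels: int = 320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(traj_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, feat_channels, 3, padding=1), nn.SiLU(),
        )
        self.proj = nn.Conv2d(feat_channels, feat_channels, 1)
        # Zero init: the adapter starts as a no-op, so training begins from the
        # base model's unmodified behavior (the standard ControlNet trick).
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, traj_map: torch.Tensor, base_feat: torch.Tensor) -> torch.Tensor:
        # traj_map: (B, traj_channels, H, W); base_feat: (B, feat_channels, H, W)
        return base_feat + self.proj(self.encoder(traj_map))
```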
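
For item 2, a toy consistency-distillation-style loss illustrates one existing video distillation recipe; the abstract does not name the method FlashMotion actually uses, and the noising scheme and interfaces here are assumptions:

```python
import torch
import torch.nn.functional as F

def consistency_distill_loss(student, ema_student, teacher, x0, t, dt):
    """Toy consistency distillation: the student's output at (x_t, t) is matched
    to its EMA counterpart one teacher ODE step earlier. All models map
    (noisy_latent, t) -> x0 estimate; t and dt broadcast over (B, 1, 1, 1)."""
    noise = torch.randn_like(x0)
    xt = x0 + t * noise                          # schematic VE-style noising
    with torch.no_grad():
        x0_hat = teacher(xt, t)                  # teacher's denoising estimate
        x_prev = xt - dt * (xt - x0_hat) / t     # one Euler step of the probability-flow ODE
        target = ema_student(x_prev, t - dt)     # self-consistency target
    return F.mse_loss(student(xt, t), target)
```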

Evaluation Benchmark — FlashBench

  • The paper introduces FlashBench, a benchmark specifically designed to evaluate long-sequence trajectory-controllable video generation.
  • It jointly measures video quality and trajectory accuracy (an illustrative accuracy metric is sketched after this list).
  • It supports scenarios with varying numbers of foreground objects.
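
As one illustration of how trajectory accuracy could be scored, below is a simple point-to-path adherence metric. This is purely a hypothetical stand-in; the abstract does not describe FlashBench's actual metrics:

```python
import numpy as np

def trajectory_adherence(pred_tracks: np.ndarray, target_tracks: np.ndarray,
                         tol_px: float = 8.0) -> float:
    """Hypothetical metric: fraction of (object, frame) pairs whose tracked
    center lies within tol_px pixels of the target path. Both arrays have
    shape (num_objects, num_frames, 2) in pixel coordinates."""
    dists = np.linalg.norm(pred_tracks - target_tracks, axis=-1)  # (O, F)
    return float((dists <= tol_px).mean())
```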

Loss & Training

  • Stage 3 employs a hybrid loss: \(\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda\,\mathcal{L}_{\text{adversarial}}\) (a training-step sketch follows this list).
  • The diffusion objective guarantees trajectory control precision.
  • The adversarial objective guarantees the perceptual quality of generated videos.
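
A minimal sketch of one Stage-3 update, under two loud assumptions: a flow-matching (velocity-prediction) parameterization and a non-saturating GAN loss. The abstract specifies neither, nor the value of \(\lambda\); all call signatures below are illustrative:

```python
import torch
import torch.nn.functional as F

def stage3_hybrid_loss(generator, adapter, discriminator, latents, traj, lam=0.1):
    """Hybrid objective L = L_diffusion + lam * L_adversarial. `generator` is
    the frozen few-step model; only `adapter` receives gradients here. The
    parameterization and interfaces are assumptions, not from the paper."""
    b = latents.shape[0]
    t = torch.rand(b, 1, 1, 1, device=latents.device)  # random timestep in [0, 1)
    noise = torch.randn_like(latents)
    xt = (1.0 - t) * latents + t * noise               # linear noising path (flow matching)

    control = adapter(traj)                            # trajectory control features
    v_pred = generator(xt, t, control)                 # predicted velocity

    # Diffusion term: keeps the adapter's control signal correctly propagated.
    loss_diff = F.mse_loss(v_pred, noise - latents)

    # Adversarial term: pushes the denoised estimate toward the real-video
    # distribution to preserve perceptual quality at few steps.
    x0_hat = xt - t * v_pred                           # exact x0 recovery on this path
    loss_adv = F.softplus(-discriminator(x0_hat)).mean()

    return loss_diff + lam * loss_adv
```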

Key Experimental Results

Main Results (Based on Abstract)

| Comparison Dimension | FlashMotion | Existing Distillation Methods | Multi-Step Models |
| --- | --- | --- | --- |
| Video Quality | ✓ Best | Notable degradation | Good |
| Trajectory Accuracy | ✓ Best | Notable degradation | Good |
| Inference Steps | Few-step (4–8) | Few-step | Multi-step (20–50) |

Architectural Generality Validation

| Experimental Setting | Description |
| --- | --- |
| Adapter Architecture 1 | FlashMotion outperforms the video distillation baseline |
| Adapter Architecture 2 | FlashMotion likewise outperforms the baseline |
| Remark | Generality of the method is validated across two distinct adapter architectures |

Key Findings

  • Directly applying the original adapter to the distilled model leads to severe degradation, confirming the distribution mismatch problem.
  • The hybrid diffusion + adversarial fine-tuning strategy effectively resolves this mismatch.
  • The method is effective across two different adapter architectures, demonstrating strong generality.
  • FlashMotion not only matches the performance of multi-step models but surpasses them on certain metrics.

Highlights & Insights

  • The paper precisely identifies the core bottleneck of few-step trajectory-controllable video generation — the distribution mismatch between the adapter and the distilled model.
  • The three-stage decoupled training design is elegant: the two components are independently optimized before being aligned via a hybrid objective.
  • The method is decoupled from specific adapter architectures, offering strong generality.
  • The introduction of FlashBench addresses the gap in evaluation benchmarks for trajectory-controllable video generation.

Limitations & Future Work

  • The local cache contains only the abstract, precluding access to specific experimental data and implementation details.
  • The three-stage training pipeline is relatively complex; its practical training cost warrants further evaluation.
  • The specific number of few-step inference steps and the corresponding speedup ratio are not explicitly reported in the abstract.
  • Applicability to longer videos and more complex multi-object trajectory scenarios remains to be verified.

Related Work

  • Trajectory-controllable video generation: adapter-based methods such as DragNUWA and MotionCtrl.
  • Video distillation: consistency distillation and adversarial diffusion distillation.
  • Insight for accelerating controllable generation: when distillation shifts the base model's distribution, control modules require re-alignment; this insight generalizes to other controllable-generation settings (e.g., layout, depth).

Rating

  • Novelty: ⭐⭐⭐⭐ The three-stage decoupled design combined with hybrid-objective fine-tuning is novel and practical, precisely targeting the distribution mismatch problem.
  • Experimental Thoroughness: ⭐⭐⭐ (abstract only) Validation across two adapter architectures and a new benchmark.
  • Writing Quality: ⭐⭐⭐ The abstract is clear with well-defined problem formulation.
  • Value: ⭐⭐⭐⭐ Strong practical demand for accelerating trajectory-controllable video generation; method demonstrates good generality.