FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion¶
Conference: AAAI 2026 arXiv: 2512.11274 Code: Project Page Area: Video Generation Keywords: Multi-shot video generation, autoregressive diffusion, cache mechanism, consistency, long video generation
TL;DR¶
FilmWeaver is proposed as a framework that guides autoregressive diffusion models via a dual-level cache (Shot Cache + Temporal Cache), enabling multi-shot video generation of arbitrary length with cross-shot consistency.
Background & Motivation¶
State of the Field¶
Current video diffusion models (e.g., HunyuanVideo, Wan) have demonstrated strong performance on single-shot video generation, yet multi-shot video generation remains a significant challenge. Multi-shot videos hold greater practical value in film production and narrative-driven creative scenarios.
Limitations of Prior Work¶
Multi-shot video generation faces two critical challenges:
Cross-shot consistency: Character identities and backgrounds must remain visually coherent across shots, which cannot be achieved through text descriptions alone.
Shot duration and count management: Existing methods are limited in controlling the duration of individual shots and the total video length.
Existing approaches suffer from distinct drawbacks:

- Multi-model pipeline methods (VideoDirectorGPT, VideoStudio): adopt a two-stage paradigm of keyframe generation followed by image-to-video, but independent generation of each segment leads to visual discontinuities and scene jumps.
- Concurrent multi-shot methods (ShotAdapter, Mask2DiT): pack multiple shots into a single sequence, severely constraining per-shot duration.
- TTT: introduces an RNN-style mechanism but lacks long-term memory and incurs high training cost.
- LCT: requires two-stage training and supports only MM-DiT architectures.
Core Idea¶
Decouple the consistency problem into two sub-problems — cross-shot consistency and intra-shot coherence — managed separately via a dual-level cache system: Shot Cache stores keyframes from prior shots to maintain character/scene identity, while Temporal Cache retains the frame history of the current shot to ensure smooth motion.
Method¶
Overall Architecture¶
FilmWeaver is built upon an autoregressive diffusion paradigm, centered on a Dual-Level Cache mechanism. When generating a new video chunk, the model is conditioned on a text prompt \(\mathbf{c}_{\text{text}}\), the Temporal Cache \(C_{\text{temp}}\), and the Shot Cache \(C_{\text{shot}}\). The training objective is the standard denoising loss, conditioned on all three inputs:

\[\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, t}\left[\left\|\boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t, \mathbf{c}_{\text{text}}, C_{\text{temp}}, C_{\text{shot}}\right) - \boldsymbol{\epsilon}\right\|_2^2\right]\]
The cache is injected via in-context injection without modifying the model architecture, making the approach compatible with existing pretrained T2V models.
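Concretely, in-context injection amounts to concatenating cache tokens onto the chunk's token sequence before the (unmodified) transformer sees it. A minimal sketch with toy shapes; the function name and dimensions are illustrative, not from the paper:

```python
import numpy as np

def inject_cache(chunk_tokens, temporal_cache, shot_cache):
    """In-context injection: cache tokens are prepended to the noisy
    chunk's token sequence, so the pretrained attention layers attend
    to them without any change to the model architecture."""
    parts = [p for p in (shot_cache, temporal_cache, chunk_tokens) if p is not None]
    return np.concatenate(parts, axis=0)  # (total_tokens, dim)

chunk = np.zeros((6 * 16, 64))  # 6 new latents x 16 tokens each (toy sizes)
temp = np.zeros((5 * 16, 64))   # compressed temporal history
shot = np.zeros((3 * 16, 64))   # 3 retrieved keyframes
seq = inject_cache(chunk, temp, shot)
print(seq.shape)  # (224, 64)
```

Because conditioning enters only through the input sequence, the same pretrained weights serve all cache configurations.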
Key Designs¶
1. Temporal Cache (Intra-Shot Coherence)¶
- Function: Acts as a sliding window storing conditioning information from the most recently generated frames of the current shot.
- Mechanism: Exploits high temporal redundancy in video via a differential compression strategy: recent frames are preserved at high fidelity, while distant frames are progressively compressed.
- Implementation: Three-level hierarchical compression: the most recent latent is kept uncompressed, the next 2 are compressed at 4×, and the oldest 16 at 32×.
- Design Motivation: Maintains motion coherence while controlling computational overhead; each generation step produces 6 latents (24 frames, 1 second at 24 fps).
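The three-level scheme can be sketched with average pooling over token groups; pooling as the compression operator is an assumption, since the paper only specifies the ratios:

```python
import numpy as np

def compress_history(latents, schedule=((1, 1), (2, 4), (16, 32))):
    """Differential compression of the temporal cache.
    latents: past latents as (tokens, dim) arrays, most recent first.
    schedule: (count, ratio) per level -- 1 latent uncompressed,
    the next 2 at 4x, the oldest 16 at 32x."""
    out, i = [], 0
    for count, ratio in schedule:
        for latent in latents[i:i + count]:
            n, d = latent.shape
            keep = max(1, n // ratio)
            # average-pool groups of `ratio` tokens into one token
            out.append(latent[:keep * ratio].reshape(keep, ratio, d).mean(axis=1))
        i += count
    return np.concatenate(out, axis=0)

history = [np.ones((64, 8)) for _ in range(19)]  # 19 past latents, 64 tokens each
cache = compress_history(history)
print(cache.shape)  # (128, 8): 1*64 + 2*16 + 16*2 tokens
```

With 64 tokens per latent, 19 history latents shrink from 1216 tokens to 128, which is where the computational savings come from.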
2. Shot Cache (Cross-Shot Consistency)¶
- Function: Retrieves the Top-K keyframes most relevant to the current text prompt from prior shots.
- Retrieval Mechanism: Computes cosine similarity between the CLIP text embedding of the current prompt and the CLIP image embeddings of candidate keyframes:

\[s_i = \frac{\mathbf{e}_{\text{text}} \cdot \mathbf{e}_{\text{img}}^{(i)}}{\lVert \mathbf{e}_{\text{text}} \rVert \, \lVert \mathbf{e}_{\text{img}}^{(i)} \rVert}\]
- K=3, chosen based on a performance-efficiency trade-off; three keyframes suffice to capture the diverse concepts required in complex multi-shot scenarios.
- Design Motivation: Provides a concise yet highly relevant visual summary of narrative history to guide the model in maintaining character and background consistency.
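Retrieval then reduces to a Top-K cosine-similarity lookup. A sketch assuming precomputed CLIP embeddings; the function name is illustrative:

```python
import numpy as np

def retrieve_keyframes(text_emb, keyframe_embs, k=3):
    """Rank stored keyframes by cosine similarity between the CLIP
    text embedding of the current prompt and each keyframe's CLIP
    image embedding; keep the Top-K (K=3 in the paper)."""
    t = text_emb / np.linalg.norm(text_emb)
    f = keyframe_embs / np.linalg.norm(keyframe_embs, axis=1, keepdims=True)
    sims = f @ t                     # cosine similarity per keyframe
    top = np.argsort(-sims)[:k]      # indices of the K best matches
    return top, sims[top]

rng = np.random.default_rng(0)
text = rng.normal(size=512)          # CLIP text embedding (toy)
frames = rng.normal(size=(10, 512))  # 10 candidate keyframe embeddings
idx, scores = retrieve_keyframes(text, frames)
```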
3. Four Inference Modes¶
Based on the state of the dual-level cache, inference proceeds in four modes:

1. No Cache (first shot generation): initializes the cache; operates in standard T2V mode.
2. Temporal Only (first shot extension): ensures high temporal coherence; supports video extension.
3. Shot Only (new shot generation): clears the Temporal Cache and injects prior keyframes from the Shot Cache; supports multi-concept injection.
4. Full Cache (new shot extension): leverages both cache levels simultaneously.
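Mode selection follows directly from which caches are populated; a minimal dispatch sketch with illustrative mode names:

```python
def select_mode(temporal_cache, shot_cache):
    """Pick one of the four inference modes from the cache state."""
    if temporal_cache is None and shot_cache is None:
        return "no_cache"       # 1. first shot: standard T2V
    if shot_cache is None:
        return "temporal_only"  # 2. extend the first shot
    if temporal_cache is None:
        return "shot_only"      # 3. start a new shot from keyframes
    return "full_cache"         # 4. extend a later shot
```

Starting a new shot clears the Temporal Cache (so motion does not leak across the cut) while the Shot Cache persists, which is exactly what the "Shot Only" branch encodes.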
Loss & Training¶
Progressive Training Curriculum¶
- Stage 1: Trains long single-shot video generation using only the Temporal Cache (10K steps); Shot Cache is disabled, allowing the model to first master intra-shot dynamics.
- Stage 2: Activates the Shot Cache and fine-tunes on a mixed curriculum covering all four cache scenarios (10K steps); the progressive approach accelerates convergence.
Data Augmentation (Addressing the "Copy-Paste" Problem)¶
- Problem: The model over-relies on visual context, reducing motion diversity and text-following capability.
- Negative Sampling: Randomly introduces irrelevant keyframes into the Shot Cache, forcing the model to distinguish useful from distracting information.
- Asymmetric Noising: Strong noise (100–400 timesteps) is applied to the Shot Cache, and weak noise (0–100 timesteps) to the Temporal Cache, balancing copy-avoidance with motion coherence.
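The augmentation can be sketched as follows; only the timestep ranges come from the paper, while the cosine-style schedule is an assumption:

```python
import numpy as np

# cumulative signal level per timestep (assumed cosine-style schedule)
ALPHAS = np.cos(np.linspace(0, np.pi / 2, 1000, endpoint=False)) ** 2

def noise_caches(shot_cache, temporal_cache, rng):
    """Asymmetric noising: strong timesteps (100-400) on the Shot Cache
    discourage pixel-level copying; weak timesteps (0-100) on the
    Temporal Cache preserve motion coherence."""
    def forward_diffuse(x, t):
        a = ALPHAS[t]
        return np.sqrt(a) * x + np.sqrt(1 - a) * rng.normal(size=x.shape)
    t_shot = int(rng.integers(100, 400))
    t_temp = int(rng.integers(0, 100))
    return forward_diffuse(shot_cache, t_shot), forward_diffuse(temporal_cache, t_temp)

rng = np.random.default_rng(0)
shot, temp = noise_caches(np.zeros((48, 8)), np.zeros((80, 8)), rng)
```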
Multi-Shot Dataset Construction¶
The dataset pipeline proceeds in five steps:

1. Shot segmentation with expert models.
2. Scene clustering via sliding-window CLIP similarity.
3. Filtering: segments shorter than 1 second and scenes with more than 3 persons are removed.
4. Group Captioning: all shots within a scene are jointly fed into Gemini 2.5 Pro for a unified description.
5. A validation step ensures annotation accuracy.
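The scene-clustering step can be sketched as a greedy pass over consecutive shot embeddings; the similarity threshold is an assumption:

```python
import numpy as np

def cluster_shots(shot_embs, threshold=0.85):
    """Group consecutive shots into scenes: a shot joins the current
    scene if its CLIP embedding is similar enough to the previous one."""
    scenes, current = [], [0]
    for i in range(1, len(shot_embs)):
        a = shot_embs[i - 1] / np.linalg.norm(shot_embs[i - 1])
        b = shot_embs[i] / np.linalg.norm(shot_embs[i])
        if float(a @ b) >= threshold:
            current.append(i)   # same scene
        else:
            scenes.append(current)
            current = [i]       # scene boundary
    scenes.append(current)
    return scenes

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(cluster_shots(embs))  # [[0, 1], [2]]
```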
Key Experimental Results¶
Main Results¶
| Method | Aes.↑ | Incep.↑ | Char. Cons.↑ | All Cons.↑ | Char. Align.↑ | All Align.↑ |
|---|---|---|---|---|---|---|
| VideoStudio | 32.02 | 6.81 | 73.34% | 62.40% | 20.88 | 31.52 |
| StoryDiffusion | 35.61 | 8.30 | 70.03% | 67.15% | 20.21 | 30.86 |
| IC-LoRA | 31.78 | 6.95 | 72.47% | 71.19% | 22.16 | 28.74 |
| FilmWeaver | 33.69 | 8.57 | 74.61% | 75.12% | 23.07 | 31.23 |
FilmWeaver achieves state-of-the-art performance on consistency and character-text alignment metrics, while also obtaining the highest Inception Score.
Ablation Study¶
| Configuration | Aes.↑ | Incep.↑ | Char.↑ | All.↑ | Char. Align.↑ | All Align.↑ |
|---|---|---|---|---|---|---|
| w/o Augmentation | 30.04 | 7.77 | 72.36% | 75.92% | 21.88 | 28.12 |
| w/o Shot Cache | 33.92 | 8.63 | 68.11% | 65.44% | 22.41 | 31.79 |
| w/o Temporal Cache | 31.61 | 8.36 | 70.79% | 70.57% | 20.21 | 30.70 |
| Full Model | 33.69 | 8.57 | 74.61% | 75.12% | 23.07 | 31.23 |
Key Findings¶
- Shot Cache is critical for cross-shot consistency: Its removal causes All Consistency to drop sharply from 75.12% to 65.44%.
- Noise augmentation is critical for text-following: Removing it leads to a significant decline in Text Alignment.
- Negative sampling provides fault tolerance: The model can effectively ignore irrelevant keyframes when they are retrieved.
- Computational efficiency: Chunked generation with the compressed cache reduces per-step attention cost from \(24^2 = 576\) (attending over the full 24-latent sequence) to roughly \(3.5 \times 11^2 \approx 423.5\).
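The quoted cost figures check out arithmetically; the chunk composition behind the factor of 3.5 and length 11 (new latents plus compressed cache per chunk) is an assumption:

```python
full_cost = 24 ** 2             # dense attention over the full 24-latent sequence
chunked_cost = 3.5 * 11 ** 2    # ~3.5 effective chunks of 11 latents each
print(full_cost, chunked_cost)  # 576 423.5
```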
Highlights & Insights¶
- Elegant decoupling: The consistency problem is explicitly decomposed into inter-shot and intra-shot sub-problems, each managed by a dedicated cache — a clean and effective design.
- High flexibility: The four inference modes naturally support downstream tasks such as multi-concept injection and video extension without additional training.
- Architecture-agnostic: Cache injection via in-context injection requires no modification to the model structure, enabling compatibility with diverse pretrained T2V models.
- Practical data construction pipeline: The Group Captioning strategy addresses the problem of cross-shot annotation consistency.
Limitations & Future Work¶
- Visual quality still has room for improvement, which could be addressed through better data curation and training strategies.
- The cache size (K=3) may be insufficient for highly complex scenes.
- The data pipeline relies on Gemini 2.5 Pro, limiting annotation cost-efficiency and accessibility.
- The evaluation benchmark is small (20 scenes × 5 shots), and standardized public multi-shot benchmarks are lacking.
Related Work & Insights¶
- The differential compression strategy parallels that of FramePack, but is extended to cross-shot scenarios.
- The retrieval-augmented generation idea underlying the Shot Cache is transferable to other generation tasks requiring long-term consistency.
- The negative sampling strategy resembles hard negative training in contrastive learning, enhancing model robustness.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual-level cache decoupling is elegant, though autoregressive diffusion and in-context injection are not themselves novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative and qualitative comparisons with sufficient ablations, though the evaluation scale is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, smooth logic, and rich figures and tables.
- Value: ⭐⭐⭐⭐ — Multi-shot video generation is an important problem, and the framework demonstrates strong practical utility.