
FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Conference: AAAI 2026 arXiv: 2512.11274 Code: Project Page Area: Video Generation Keywords: Multi-shot video generation, autoregressive diffusion, cache mechanism, consistency, long video generation

TL;DR

FilmWeaver guides autoregressive diffusion models with a dual-level cache (Shot Cache + Temporal Cache), enabling multi-shot video generation of arbitrary length while preserving cross-shot consistency.

Background & Motivation

State of the Field

Current video diffusion models (e.g., HunyuanVideo, Wan) have demonstrated strong performance on single-shot video generation, yet multi-shot video generation remains a significant challenge. Multi-shot videos hold greater practical value in film production and narrative-driven creative scenarios.

Limitations of Prior Work

Multi-shot video generation faces two critical challenges:

Cross-shot consistency: Character identities and backgrounds must remain visually coherent across shots, which cannot be achieved through text descriptions alone.

Shot duration and count management: Existing methods are limited in controlling the duration of individual shots and the total video length.

Existing approaches suffer from distinct drawbacks:

  • Multi-model pipeline methods (VideoDirectorGPT, VideoStudio): adopt a two-stage paradigm of keyframe generation followed by image-to-video, but independent generation of each segment leads to visual discontinuities and scene jumps.
  • Concurrent multi-shot methods (ShotAdapter, Mask2DiT): pack multiple shots into a single sequence, severely constraining per-shot duration.
  • TTT: introduces an RNN mechanism but lacks long-term memory and incurs high training costs.
  • LCT: requires two-stage training and supports only MM-DiT architectures.

Core Idea

Decouple the consistency problem into two sub-problems — cross-shot consistency and intra-shot coherence — managed separately via a dual-level cache system: Shot Cache stores keyframes from prior shots to maintain character/scene identity, while Temporal Cache retains the frame history of the current shot to ensure smooth motion.

Method

Overall Architecture

FilmWeaver is built upon an autoregressive diffusion paradigm, centered on a Dual-Level Cache mechanism. When generating a new video chunk, the model is conditioned on a text prompt \(\mathbf{c}_{\text{text}}\), Temporal Cache \(C_{\text{temp}}\), and Shot Cache \(C_{\text{shot}}\). The training objective is:

\[\mathcal{L} = \mathbb{E}_{\mathbf{v}_0, \mathbf{c}_{\text{text}}, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(\mathbf{v}_t, t, \mathbf{c}_{\text{text}}, C_{\text{temp}}, C_{\text{shot}})\right\|^2\right]\]

The cache is injected via in-context injection without modifying the model architecture, making the approach compatible with existing pretrained T2V models.
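Since the paper describes cache conditioning as pure in-context injection, the mechanism can be sketched as sequence concatenation; the token shapes and dimensions below are illustrative assumptions, not the paper's actual latent layout:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token dimension (illustrative)

def inject_in_context(noisy_video, c_temp, c_shot):
    """In-context injection sketch: cache tokens are concatenated with the
    noisy video tokens along the sequence axis, so the pretrained T2V
    backbone needs no architectural changes (hypothetical token layout)."""
    return np.concatenate([c_shot, c_temp, noisy_video], axis=0)

# Illustrative shapes: 3 Shot Cache keyframe tokens, 5 Temporal Cache
# tokens, and 6 noisy latents for the chunk currently being denoised.
c_shot = rng.standard_normal((3, D))
c_temp = rng.standard_normal((5, D))
noisy  = rng.standard_normal((6, D))

seq = inject_in_context(noisy, c_temp, c_shot)
print(seq.shape)  # (14, 64): one flat sequence, no new model parameters
```

Because conditioning enters only through the input sequence, any pretrained T2V backbone with a standard attention stack can consume it unchanged.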

Key Designs

1. Temporal Cache (Intra-Shot Coherence)

  • Function: Acts as a sliding window storing conditioning information from the most recently generated frames of the current shot.
  • Mechanism: Exploits high temporal redundancy in video via a differential compression strategy: recent frames are preserved at high fidelity, while distant frames are progressively compressed.
  • Implementation: Three-level hierarchical compression: the most recent latent is kept uncompressed, the next 2 are compressed at 4×, and the oldest 16 at 32×.
  • Design Motivation: Maintains motion coherence while controlling computational overhead; each generation step produces 6 latents (24 frames, 1 second at 24 fps).
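The three-level compression above can be sketched as follows; average pooling is an assumed stand-in for the paper's compression operator, and the latent dimension is illustrative:

```python
import numpy as np

def build_temporal_cache(latents):
    """Sketch of the three-level differential compression: the most recent
    latent is kept as-is, the next 2 are compressed 4x, and the oldest 16
    are compressed 32x (average pooling is an assumed operator)."""
    recent  = latents[-1:]        # 1 latent, uncompressed
    mid     = latents[-3:-1]      # 2 latents, 4x compression
    distant = latents[-19:-3]     # 16 latents, 32x compression

    def pool(x, factor):
        # Average-pool the feature axis by `factor` (illustrative only).
        n, d = x.shape
        return x.reshape(n, d // factor, factor).mean(axis=-1)

    return recent, pool(mid, 4), pool(distant, 32)

latents = np.random.default_rng(0).standard_normal((19, 256))
recent, mid, distant = build_temporal_cache(latents)
print(recent.shape, mid.shape, distant.shape)  # (1, 256) (2, 64) (16, 8)
```

The cost of attending to the cache thus shrinks geometrically with temporal distance, which is what keeps the per-chunk attention budget roughly constant as the shot grows.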

2. Shot Cache (Cross-Shot Consistency)

  • Function: Retrieves the Top-K keyframes most relevant to the current text prompt from prior shots.
  • Retrieval Mechanism: Computes cosine similarity between CLIP text embeddings and candidate keyframe image embeddings:
\[C_{\text{shot}} = \underset{kf \in \mathcal{KF}}{\arg\,\text{top-k}}\left(\text{sim}(\phi_T(\mathbf{c}_{\text{text}}), \phi_I(kf))\right)\]
  • K=3, chosen based on a performance-efficiency trade-off; three keyframes suffice to capture the diverse concepts required in complex multi-shot scenarios.
  • Design Motivation: Provides a concise yet highly relevant visual summary of narrative history to guide the model in maintaining character and background consistency.
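The retrieval formula above is a standard cosine-similarity top-K; a minimal sketch with stand-in embedding vectors (a real system would use CLIP text/image encoders for \(\phi_T\) and \(\phi_I\)):

```python
import numpy as np

def retrieve_shot_cache(text_emb, keyframe_embs, k=3):
    """Top-K keyframe retrieval by cosine similarity between the prompt's
    text embedding and candidate keyframe image embeddings (stand-in
    vectors here; CLIP encoders would produce them in practice)."""
    t  = text_emb / np.linalg.norm(text_emb)
    kf = keyframe_embs / np.linalg.norm(keyframe_embs, axis=1, keepdims=True)
    sims = kf @ t                      # cosine similarity per keyframe
    topk = np.argsort(sims)[::-1][:k]  # indices of the K best matches
    return topk, sims[topk]

rng = np.random.default_rng(1)
keyframes = rng.standard_normal((10, 512))                 # 10 candidates
prompt    = keyframes[4] + 0.1 * rng.standard_normal(512)  # near keyframe 4

idx, scores = retrieve_shot_cache(prompt, keyframes, k=3)
print(idx[0])  # 4: the closest keyframe is retrieved first
```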

3. Four Inference Modes

Based on the state of the dual-level cache, inference operates in one of four modes:

  1. No Cache (first shot generation): Initializes the cache; operates in standard T2V mode.
  2. Temporal Only (first shot extension): Ensures high temporal coherence; supports video extension.
  3. Shot Only (new shot generation): Clears the Temporal Cache and injects prior keyframes from the Shot Cache; supports multi-concept injection.
  4. Full Cache (new shot extension): Leverages both cache levels simultaneously.
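The mode dispatch reduces to a check on which caches are populated; a sketch (mode names follow this summary, the dispatch logic is an assumption):

```python
def select_mode(shot_cache, temporal_cache):
    """Dispatch on cache state to one of the four inference modes.
    At a shot transition the Temporal Cache is cleared while the
    Shot Cache persists, which is what moves us between modes."""
    if not shot_cache and not temporal_cache:
        return "no_cache"       # first shot: plain T2V generation
    if not shot_cache:
        return "temporal_only"  # extending the first shot
    if not temporal_cache:
        return "shot_only"      # starting a new shot from prior keyframes
    return "full_cache"         # extending a later shot

print(select_mode([], []))               # no_cache
print(select_mode([], ["frames"]))       # temporal_only
print(select_mode(["kf1", "kf2"], []))   # shot_only
print(select_mode(["kf1"], ["frames"]))  # full_cache
```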

Loss & Training

Progressive Training Curriculum

  • Stage 1: Trains long single-shot video generation using only the Temporal Cache (10K steps); Shot Cache is disabled, allowing the model to first master intra-shot dynamics.
  • Stage 2: Activates the Shot Cache and fine-tunes on a mixed curriculum covering all four cache scenarios (10K steps); the progressive approach accelerates convergence.
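The two-stage curriculum amounts to a step-indexed switch on cache availability; a minimal sketch (stage lengths follow the summary, the Stage 2 scenario mix is an assumption):

```python
def curriculum_stage(step):
    """Progressive curriculum sketch: Stage 1 (first 10K steps) trains
    with the Temporal Cache only; Stage 2 enables both caches and mixes
    all four inference scenarios (mixing ratio not specified here)."""
    if step < 10_000:
        # Stage 1: master intra-shot dynamics before cross-shot identity
        return {"shot_cache": False, "temporal_cache": True}
    # Stage 2: both caches active over a mixed scenario curriculum
    return {"shot_cache": True, "temporal_cache": True}

print(curriculum_stage(500))     # {'shot_cache': False, 'temporal_cache': True}
print(curriculum_stage(15_000))  # {'shot_cache': True, 'temporal_cache': True}
```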

Data Augmentation (Addressing the "Copy-Paste" Problem)

  • Problem: The model over-relies on visual context, reducing motion diversity and text-following capability.
  • Negative Sampling: Randomly introduces irrelevant keyframes into the Shot Cache, forcing the model to distinguish useful from distracting information.
  • Asymmetric Noising: Strong noise (100–400 timesteps) is applied to the Shot Cache, and weak noise (0–100 timesteps) to the Temporal Cache, balancing copy-avoidance with motion coherence.
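The asymmetric noising can be sketched as sampling a strong timestep for the Shot Cache and a weak one for the Temporal Cache; the linear schedule below is a toy stand-in, not the paper's actual diffusion schedule:

```python
import numpy as np

def noise_caches(c_shot, c_temp, rng):
    """Asymmetric noising sketch: strong timesteps (100-400) perturb the
    Shot Cache to discourage pixel-level copy-paste, while weak timesteps
    (0-100) keep the Temporal Cache nearly clean for motion coherence."""
    t_shot = rng.integers(100, 401)  # strong noise: avoid copy-paste
    t_temp = rng.integers(0, 101)    # weak noise: preserve coherence

    def add_noise(x, t, t_max=1000):
        # Toy variance-preserving perturbation with a linear schedule.
        alpha = 1.0 - t / t_max
        return np.sqrt(alpha) * x + np.sqrt(1 - alpha) * rng.standard_normal(x.shape)

    return add_noise(c_shot, t_shot), add_noise(c_temp, t_temp), (t_shot, t_temp)

rng = np.random.default_rng(0)
shot, temp, (ts, tt) = noise_caches(np.ones((3, 8)), np.ones((5, 8)), rng)
print(100 <= ts <= 400 and 0 <= tt <= 100)  # True
```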

Multi-Shot Dataset Construction

The dataset is built in five steps:

  1. Shot segmentation with expert models.
  2. Scene clustering via sliding-window CLIP similarity.
  3. Filtering: segments shorter than 1 second and scenes with more than 3 persons are removed.
  4. Group Captioning: all shots within a scene are jointly fed into Gemini 2.5 Pro for a unified description.
  5. A validation step ensures annotation accuracy.
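The sliding-window clustering step can be sketched as follows; the window size and similarity threshold are assumptions, and unit vectors stand in for CLIP embeddings:

```python
import numpy as np

def cluster_scenes(shot_embs, window=2, threshold=0.85):
    """Sliding-window similarity clustering sketch: a shot joins the
    current scene if its embedding is similar enough to any of the
    previous `window` shots (window and threshold are assumptions)."""
    norm = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    scenes, current = [], [0]
    for i in range(1, len(norm)):
        recent = norm[current[-window:]]          # last shots of this scene
        if (recent @ norm[i]).max() >= threshold:
            current.append(i)                     # same scene
        else:
            scenes.append(current)                # scene boundary
            current = [i]
    scenes.append(current)
    return scenes

# Two synthetic scenes: shots 0-2 share one embedding, shots 3-4 another.
a, b = np.eye(8)[0], np.eye(8)[1]
embs = np.stack([a, a, a, b, b])
print(cluster_scenes(embs))  # [[0, 1, 2], [3, 4]]
```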

Key Experimental Results

Main Results

| Method | Aes.↑ | Incep.↑ | Char. Cons.↑ | All Cons.↑ | Char. Align.↑ | All Align.↑ |
|---|---|---|---|---|---|---|
| VideoStudio | 32.02 | 6.81 | 73.34% | 62.40% | 20.88 | 31.52 |
| StoryDiffusion | 35.61 | 8.30 | 70.03% | 67.15% | 20.21 | 30.86 |
| IC-LoRA | 31.78 | 6.95 | 72.47% | 71.19% | 22.16 | 28.74 |
| FilmWeaver | 33.69 | 8.57 | 74.61% | 75.12% | 23.07 | 31.23 |

FilmWeaver achieves state-of-the-art performance on consistency and character-text alignment metrics, while also obtaining the highest Inception Score.

Ablation Study

| Configuration | Aes.↑ | Incep.↑ | Char. Cons.↑ | All Cons.↑ | Char. Align.↑ | All Align.↑ |
|---|---|---|---|---|---|---|
| w/o Augmentation | 30.04 | 7.77 | 72.36% | 75.92% | 21.88 | 28.12 |
| w/o Shot Cache | 33.92 | 8.63 | 68.11% | 65.44% | 22.41 | 31.79 |
| w/o Temporal Cache | 31.61 | 8.36 | 70.79% | 70.57% | 20.21 | 30.70 |
| Full Model | 33.69 | 8.57 | 74.61% | 75.12% | 23.07 | 31.23 |

Key Findings

  1. Shot Cache is critical for cross-shot consistency: Its removal causes All Consistency to drop sharply from 75.12% to 65.44%.
  2. Noise augmentation is critical for text-following: Removing it leads to a significant decline in Text Alignment.
  3. Negative sampling provides fault tolerance: The model can effectively ignore irrelevant keyframes when they are retrieved.
  4. Computational efficiency: Attention complexity is reduced from \(O(24^2)=576\) for the full sequence to approximately \(3.5 \times 11^2 \approx 423.5\) with chunked processing.

Highlights & Insights

  1. Elegant decoupling: The consistency problem is explicitly decomposed into inter-shot and intra-shot sub-problems, each managed by a dedicated cache — a clean and effective design.
  2. High flexibility: The four inference modes naturally support downstream tasks such as multi-concept injection and video extension without additional training.
  3. Architecture-agnostic: Cache injection via in-context injection requires no modification to the model structure, enabling compatibility with diverse pretrained T2V models.
  4. Practical data construction pipeline: The Group Captioning strategy addresses the problem of cross-shot annotation consistency.

Limitations & Future Work

  1. Visual quality still has room for improvement, which could be addressed through better data curation and training strategies.
  2. The cache size (K=3) may be insufficient for highly complex scenes.
  3. The data pipeline relies on Gemini 2.5 Pro, limiting annotation cost-efficiency and accessibility.
  4. The evaluation benchmark is small (20 scenes × 5 shots), and standardized public multi-shot benchmarks are lacking.

Connections

  • The differential compression strategy parallels that of FramePack, but is extended to cross-shot scenarios.
  • The retrieval-augmented generation idea underlying the Shot Cache is transferable to other generation tasks requiring long-term consistency.
  • The negative sampling strategy resembles hard negative training in contrastive learning, enhancing model robustness.

Rating

  • Novelty: ⭐⭐⭐⭐ — The dual-level cache decoupling is elegant, though autoregressive diffusion and in-context injection are not themselves novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative and qualitative comparisons with sufficient ablations, though the evaluation scale is limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, smooth logic, and rich figures and tables.
  • Value: ⭐⭐⭐⭐ — Multi-shot video generation is an important problem, and the framework demonstrates strong practical utility.