Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Conference: CVPR 2026
arXiv: 2603.06688
Code: To be confirmed
Area: Multimodal VLM
Keywords: Long-range visual consistency, narrative generation, AR+Diffusion, Memory Bank, e-commerce advertising

TL;DR

This paper proposes the Narrative Weaver framework, which combines narrative planning via MLLMs with fine-grained generation via diffusion models. Through learnable queries and a dynamic Memory Bank, the framework achieves long-range visually consistent generation under multi-modal conditioning. The authors also introduce EAVSD, the first e-commerce advertising video storyboard dataset, comprising 330K+ images.

Background & Motivation

Background: Generative AI systems such as Sora, Veo, and Midjourney demonstrate strong performance on short-clip image/video generation, yet long-range narrative generation—maintaining character, background, and style consistency across frames—remains a major challenge.

Limitations of Prior Work: (1) Video generation models suffer rapid consistency degradation beyond short clips; (2) Image generation models operate on individual frames and cannot plan multi-frame narratives; (3) Existing planning methods rely on purely textual conditioning and cannot produce controllable, visually grounded outputs.

Key Challenge: No unified framework exists that integrates narrative planning, fine-grained control, and long-range consistency into a single system. Large-scale multi-modal conditioned generation datasets are also lacking.

Goal: Achieve multi-modal conditioned long-sequence consistent generation of the form (text, image) → (text, {Image_i}).

Key Insight: A hybrid architecture combining an AR model for planning and a diffusion model for generation, with a Memory Bank propagating consistency information across keyframes.

Core Idea: An MLLM acts as a "director" to plan narratives and compress context into learnable queries; a Memory Bank anchors the initial visual condition to prevent drift; three-stage progressive training enables data-efficient learning.

Method

Overall Architecture

A hybrid AR + Diffusion architecture: an MLLM (Qwen2.5-VL-3B) serves as the AR component responsible for textual narrative planning and historical information encoding, while Flux.1-Dev serves as the diffusion component responsible for image generation. The input consists of a conditioning image and a user instruction; the output is a multi-frame visual narrative sequence.

Key Designs

Multi-Modal Interaction and Learnable Queries:
- Function: The MLLM simultaneously performs narrative planning (text generation) and high-level visual content aggregation (query vector generation).
- Mechanism: A dynamic causal attention mask is designed—text tokens attend only to preceding text tokens (standard causal attention), while learnable queries \(q_n\) can attend to the full multi-modal context (input \(\mathbf{I}\), all narrative texts \(\{t_j\}\), and preceding queries \(\{q_k\}\)).
- Special tokens <img> / </img> delimit query sequences, enabling the model to learn when to generate an image versus when to continue planning text.
- Design Motivation: Prevents queries from interfering with original text generation while allowing queries to fully absorb multi-modal information.
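The masking rule above can be sketched as a boolean matrix. This is a minimal illustration, not the authors' implementation: the token-kind labels (`I`, `T`, `Q`), the function name `build_dynamic_mask`, and the choice to let text tokens also see the input tokens are assumptions made for the example.

```python
import numpy as np

def build_dynamic_mask(kinds):
    """Return a boolean matrix M where M[i, j] is True if token i may attend to token j.

    kinds is a sequence of labels (hypothetical, for illustration only):
      "I" = input tokens (conditioning image / instruction)
      "T" = narrative text tokens
      "Q" = learnable query tokens
    """
    kinds = np.asarray(kinds)
    n = len(kinds)
    causal = np.tril(np.ones((n, n), dtype=bool))   # position i sees positions <= i
    row_is_query = (kinds == "Q")[:, None]
    col_is_query = (kinds == "Q")[None, :]
    # Query rows keep the full causal context; all other rows additionally
    # hide query columns, so text generation never reads query states.
    return causal & (row_is_query | ~col_is_query)

kinds = ["I", "I", "T", "T", "Q", "Q", "T", "Q"]
mask = build_dynamic_mask(kinds)
print(mask.astype(int))
```

In the printed matrix, the text token at position 6 has zeros in the two query columns before it, while the query at position 7 sees every earlier position.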
Dynamic Memory Bank:
- Function: Caches VAE features of previously generated images to prevent visual drift.
- Mechanism: Features from the most recent \(T\) frames are cached and compressed via geometrically decayed average pooling. With decay rate \(\lambda > 1\), the \(k\)-th most recent frame is pooled to length \(l/\lambda^{k-1}\), so the total memory length is a truncated geometric series, \(L = \sum_{k=1}^{T} l/\lambda^{k-1} < l \cdot \lambda/(\lambda-1)\), and stays bounded regardless of \(T\).
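- The final conditioning signal is \(\mathbf{C}_n = \text{Concat}(q_n, f^{cond}, \hat{f}_{n-1}, \ldots, \hat{f}_{n-T})\), where \(f^{cond}\) denotes the VAE features of the conditioning image and \(\hat{f}_{n-k}\) the compressed features of the \(k\)-th most recent generated frame.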
- Design Motivation: Recent frames retain more detail (high resolution) while distant frames provide coarse-grained context (compressed), decoupling the trade-off between consistency and efficiency.
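The Memory Bank compression and the concatenation into \(\mathbf{C}_n\) can be sketched as follows. This is a hedged illustration under stated assumptions: pooling is done in 1D over the token axis, all features share one channel width `d` (in practice projectors would handle this), and the helper names `compress_frame`, `build_memory`, and `build_condition` are hypothetical rather than taken from the paper.

```python
import numpy as np

def compress_frame(feat, k, lam=2):
    """Average-pool a (l, d) token sequence to roughly l / lam**(k-1) tokens.

    k = 1 is the most recent frame and is kept at full length.
    """
    l, d = feat.shape
    window = lam ** (k - 1)
    if window >= l:                       # very old frame: collapse to one token
        return feat.mean(axis=0, keepdims=True)
    kept = (l // window) * window         # drop a ragged tail, if any
    return feat[:kept].reshape(l // window, window, d).mean(axis=1)

def build_memory(history, T=3, lam=2):
    """history: list of (l, d) arrays, oldest first.

    Returns the pooled memory, most recent frame first.
    """
    recent = history[-T:][::-1]
    return np.concatenate(
        [compress_frame(f, k, lam) for k, f in enumerate(recent, start=1)], axis=0
    )

def build_condition(q_n, f_cond, history, T=3, lam=2):
    """C_n = Concat(q_n, f_cond, pooled memory) along the token axis."""
    return np.concatenate([q_n, f_cond, build_memory(history, T, lam)], axis=0)

l, d = 16, 4
history = [np.full((l, d), float(i)) for i in range(5)]   # 5 dummy frames
memory = build_memory(history, T=3, lam=2)
print(memory.shape)                         # (28, 4): 16 + 8 + 4 tokens
print(memory.shape[0] < l * 2 / (2 - 1))    # True: 28 < 32
```

With \(l = 16\), \(\lambda = 2\), and \(T = 3\), the memory holds 28 tokens, under the bound \(l \cdot \lambda/(\lambda-1) = 32\); raising \(T\) adds ever shorter segments and never crosses that bound.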
Three-Stage Progressive Training:
- Stage 1 (Narrative Planning): Trains the MLLM to learn textual narrative generation and image-timing prediction using a standard cross-entropy loss.
- Stage 2 (Semantic Consistent Generation): Trains learnable queries and projectors; pre-trained on 30M low-resolution text–image pairs, then fine-tuned on 60K high-quality samples using a Flow Matching objective.
- Stage 3 (Fine-Grained Consistent Alignment): Fully trains the diffusion model, incorporating VAE features of conditioning images and Memory Bank features, continuing with the Flow Matching objective.
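Stages 2 and 3 both use a Flow Matching objective. The summary does not give the exact parameterization, so the sketch below assumes the common straight-path (rectified-flow style) form, in which the model regresses the constant velocity from noise to data. The function name `flow_matching_loss`, the tensor shapes, and the dummy `zero_model` are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, x0, t, cond):
    """Mean squared error between predicted and target velocity.

    x1: clean latents (batch, ...); x0: Gaussian noise, same shape;
    t: times in [0, 1] with shape (batch, 1, ..., 1); cond: conditioning tokens.
    Assumes a linear interpolation path between noise and data.
    """
    xt = (1.0 - t) * x0 + t * x1          # point on the straight noise-to-data path
    target = x1 - x0                      # constant velocity along that path
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 8, 4))       # stand-in for VAE latents
x0 = rng.standard_normal(x1.shape)        # sampled noise
t = rng.uniform(size=(2, 1, 1))           # one time step per sample
cond = np.zeros((2, 5, 4))                # stand-in for the conditioning signal

zero_model = lambda xt, t, cond: np.zeros_like(xt)   # untrained placeholder
print(flow_matching_loss(zero_model, x1, x0, t, cond))
```

A model that predicted `x1 - x0` exactly would drive this loss to zero; in the staged setup described above, only the set of trainable parameters and the contents of `cond` would differ between Stage 2 and Stage 3.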

Efficiency Analysis

- DiT computational complexity is reduced from quadratic to linear growth with respect to the number of images.
- The bottleneck shifts to the highly optimizable MLLM component.
- Parallel planning and generation are supported at inference time.
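The quadratic-versus-linear claim can be illustrated with a rough count of attention token pairs. The cost model here is an assumption made for the example, not the paper's analysis: the baseline places all frames in one joint sequence, the Memory Bank variant lets each frame attend only to its own \(l\) tokens plus the pooled memory, and query and conditioning-image tokens are ignored. The function names are hypothetical.

```python
def joint_pairs(num_images, l):
    """All frames in one sequence: every token attends to every token."""
    return (num_images * l) ** 2

def memory_bank_pairs(num_images, l, T=3, lam=2):
    """Each frame attends to itself plus a bounded, geometrically pooled memory."""
    total = 0
    for n in range(num_images):
        visible = min(n, T)               # earlier frames available, capped at T
        mem = sum(l // lam ** (k - 1) for k in range(1, visible + 1))
        total += (l + mem) ** 2           # per-frame cost is bounded
    return total

for n in (50, 100, 200):
    print(n, joint_pairs(n, 16), memory_bank_pairs(n, 16))
```

Under this model, doubling the number of images quadruples the joint cost, while the Memory Bank cost grows by a fixed amount per added image once \(T\) frames of history exist.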

Key Experimental Results

GPT-4o Evaluation (Consistent Visual Generation)

Method	Text Control	ITC	RGC	MSSC	MSCC	IMQ
StoryDiffusion	✗	6.54	5.86	7.48	6.00	6.80
IP-Adapter	✗	7.11	6.10	8.57	7.57	6.65
Flux.1-kontext	✗	7.06	9.41	8.11	7.28	6.94
Narrative Weaver	✓	7.54	8.86	8.67	7.91	7.35

Automatic Evaluation (DreamSim↓ / CLIP Score↑)

Method	DreamSim↓ (Avg)	Notes
StoryDiffusion	56.33	Multi-scene generation method
IP-Adapter	33.30	Reference image method
Flux.1-kontext	3.71	Editing method (but exhibits copy-paste artifacts)
Narrative Weaver	12.18	Best among multi-scene generation methods

User Study

- 180+ user preference surveys confirm the model's advantages.
- Although Flux.1-kontext achieves better automatic metrics, its "copy-paste" behavior is disfavored by users.

Highlights & Insights

- The first generative framework to unify narrative planning, fine-grained control, and long-range consistency, filling an important gap in the field.
- The dynamic causal attention mask keeps text generation isolated from the queries, so text planning can be learned with only ~5K samples.
- The geometrically decayed compression in the Memory Bank guarantees bounded memory while prioritizing recent frames.
- EAVSD fills the gap in e-commerce advertising storyboard datasets (330K+ images).
- The three-stage training strategy achieves state-of-the-art performance under limited computation and data, offering strong practical utility.
- DiT computational complexity grows linearly rather than quadratically with the number of images, enabling generation of longer narrative sequences.

Limitations & Future Work

- The current approach focuses primarily on keyframe generation; consistency of transitional video segments between keyframes remains unresolved.
- The planning capacity of Qwen2.5-VL-3B may limit narrative complexity; larger MLLMs could raise the performance ceiling.
- EAVSD dataset construction relies on commercial models (Qwen-Image, Flux.1-kontext), which may introduce generative bias.
- Dedicated modules for identity preservation (e.g., face ID embeddings) could be incorporated to further improve character consistency.
- The effect of the geometric decay rate \(\lambda\) in the Memory Bank across different narrative lengths warrants further ablation.
- Stage 3 is trained for only 1–2 epochs; more thorough training may further improve fine-grained consistency.