StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

Conference: CVPR 2026
arXiv: 2602.21273
Code: Coming soon
Area: LLM Efficiency
Keywords: Visual Storytelling, Zero-Shot, Multi-Subject, Diffusion Model, Attention Mechanism

TL;DR

Proposes StoryTailor, a zero-shot pipeline for multi-subject, action-rich visual narratives built from three inference-time modules: Gaussian-Centered Attention (GCA) to mitigate subject overlap and background leakage, Action-Boost SVR (AB-SVR) to amplify action semantics, and a Selective Forgetting Cache (SFC) to maintain cross-frame background continuity. The full system runs on a single RTX 4090 and reports a 10–15% CLIP-T improvement over baselines.

Background & Motivation

Background: Personalized image generation is divided into two camps: fine-tuning methods (DreamBooth/LoRA/Textual Inversion) that require per-identity training, and adapter methods (IP-Adapter/MS-Diffusion) that are lighter but primarily single-frame. Sequence-level methods (FluxKontext, video diffusion) require GPU clusters and tend to entangle identities in multi-subject interactions.

Limitations of Prior Work: A three-way tension: (1) poor action-text fidelity (models excel at identity but not at actions); (2) subject identity fidelity collapses under overlap or close proximity; (3) cross-frame background continuity is difficult to maintain.

Key Challenge: Enhancing action response requires increasing text guidance strength, but this corrupts identity consistency through cross-attention drift; propagating background information across frames in turn constrains subject dynamics.

Goal: Achieve training-free multi-subject, action-rich, cross-frame consistent visual narrative generation on a single 24GB GPU.

Key Insight: Rather than modifying the backbone (SDXL), make precise interventions in the attention mechanism and text embedding space—targeting spatial localization, semantic enhancement, and temporal continuity respectively.

Core Idea: Three inference-time modules divide and conquer three sub-problems—GCA handles space, AB-SVR handles semantics, SFC handles time.

Method

Overall Architecture

Built on SDXL + MS-Diffusion backbone, taking a long narrative prompt, reference images for each subject, and grounding boxes as input. Three plug-and-play modules: GCA applies Gaussian decay masks in the IP branch's cross-attention to localize subject cores; AB-SVR performs SVD on text embeddings and selectively enhances action-related subspaces; SFC propagates background context across frames via KV cache + attention output blending.

Key Designs

  1. Gaussian-Centered Attention (GCA)

    • Function: Resolves identity confusion when grounding boxes overlap and prevents reference background leakage
    • Mechanism: Uses a Voronoi strategy to compute each box's centroid \(\mu_i^*\), dynamically adjusting Gaussian decay radii \(s_i^{\text{in}}, s_i^{\text{out}}\) based on text attention strength. The inner circle uses slow decay to protect identity cores, while the outer ring uses fast decay to decouple subjects from backgrounds. The mask is applied as a logit bias in the IP branch attention: \(\alpha^{ip} = \text{softmax}(QK_{ip}^T/\sqrt{d} + B_{ip})\)
    • Design Motivation: Hard box boundaries constrain joint motion and produce edge artifacts; simple soft masks still cling to box edges. Two-stage Gaussian decay both protects the identity core and leaves freedom for action (see the GCA sketch after this list)
  2. Action-Boost SVR (AB-SVR)

    • Function: Amplifies action semantics in the text embedding space while suppressing cross-frame action leakage
    • Mechanism: Performs thin SVD on the current frame tokens \(X_{\text{exp}}\), selects the retained rank \(k\) via cumulative energy threshold \(\tau=0.85\), forming projection matrix \(P_k = U_k U_k^T\). The current frame retains the main component: \(\tilde{X}_{\text{exp}} = P_k X_{\text{exp}}\); other frames apply notch projection to remove overlapping components: \(\tilde{X}_{\text{sup}}^{(\text{notch})} = (I - P_k) X_{\text{sup}}\)
    • Design Motivation: Standard SVR only suppresses but does not zero out other frames' semantics, so residual action noise still interferes. AB-SVR uses SVD principal projection for precise subspace separation, enhancing current-frame actions while denoising other frames' actions (see the AB-SVR sketch after this list)
  3. Selective Forgetting Cache (SFC)

    • Function: Propagates background context across frames for continuity while not constraining subject dynamics
    • Mechanism: Dual mode—(a) KV accumulation: top-k selects 128 relevant tokens from historical frame KV cache to concatenate with the current frame, with a negative bias \(\delta_h=-0.1\) on historical logits to promote forgetting, capacity capped at 512; (b) Context blending: at low-resolution layers, blends previous frame attention output according to background mask: \(\tilde{C} = C \odot (1-\alpha M_b') + \bar{C}_{\text{prev}} \odot (\alpha M_b')\), \(\alpha=0.6\)
    • Design Motivation: Directly propagating the full KV cache freezes subject motion and explodes memory; top-k selection, the forgetting bias, and the capacity cap together "remember backgrounds, forget unimportant history" (see the SFC sketch after this list)
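
To make the spatial mechanism concrete, here is a minimal numpy sketch of GCA under simplifying assumptions: the box center stands in for the Voronoi-refined centroid \(\mu_i^*\), the base radii 0.35/0.70 are fixed rather than text-adaptive, and the decay strengths c_slow/c_fast plus all function names are illustrative, not the paper's API.

```python
import numpy as np

def gca_bias(box, grid=64, s_in=0.35, s_out=0.70, c_slow=1.0, c_fast=8.0):
    """Two-stage Gaussian decay mask for one subject, as a negative logit bias.

    box is (x0, y0, x1, y1) in [0, 1] coordinates. Inside relative radius
    s_in the decay is slow (protects the identity core); between s_in and
    s_out it is fast (decouples the subject from the background).
    """
    ys, xs = np.mgrid[0:grid, 0:grid] / (grid - 1)
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2   # box centroid
    scale = max(box[2] - box[0], box[3] - box[1]) + 1e-8
    r = np.hypot(xs - cx, ys - cy) / scale                  # radius in box units
    inner = c_slow * (r / s_in) ** 2                        # slow inner decay
    outer = c_slow + c_fast * ((r - s_in) / (s_out - s_in)) ** 2  # fast outer decay
    return -np.where(r <= s_in, inner, outer).reshape(-1)

def gca_ip_attention(Q, subject_kvs, boxes, grid=64):
    """IP-branch attention softmax(QK^T/sqrt(d) + B)V with per-subject biases,
    so pixels near each centroid favor that subject's reference tokens and
    overlapping boxes compete instead of blending identities."""
    d = Q.shape[-1]
    K = np.concatenate([k for k, _ in subject_kvs])
    V = np.concatenate([v for _, v in subject_kvs])
    B = np.concatenate([np.repeat(gca_bias(b, grid)[:, None], k.shape[0], axis=1)
                        for b, (k, _) in zip(boxes, subject_kvs)], axis=1)
    logits = Q @ K.T / np.sqrt(d) + B
    a = np.exp(logits - logits.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ V

# Toy usage: two subjects with 4 reference tokens each, d = 8.
Q = np.random.randn(64 * 64, 8)
kvs = [(np.random.randn(4, 8), np.random.randn(4, 8)) for _ in range(2)]
img_ctx = gca_ip_attention(Q, kvs, boxes=[(0.1, 0.2, 0.45, 0.8), (0.55, 0.2, 0.9, 0.8)])
```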
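AB-SVR's projections follow directly from the formulas above; here is a numpy sketch, assuming all frame prompts share the same padded token count (e.g., 77) so that the single projector \(P_k = U_k U_k^T\) applies across frames. Function names are ours.

```python
import numpy as np

def ab_svr(X_exp, X_sup_list, tau=0.85):
    """Boost the current frame's action subspace and notch-filter other frames.

    X_exp:      (n, d) current-frame token embeddings.
    X_sup_list: list of (n, d) other-frame embeddings.
    """
    U, S, _ = np.linalg.svd(X_exp, full_matrices=False)   # thin SVD
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(energy, tau)) + 1             # smallest rank with energy >= tau
    P_k = U[:, :k] @ U[:, :k].T                           # principal projector U_k U_k^T
    X_exp_tilde = P_k @ X_exp                             # current frame keeps the main component
    I = np.eye(P_k.shape[0])
    X_sup_tilde = [(I - P_k) @ X for X in X_sup_list]     # notch: remove overlapping components
    return X_exp_tilde, X_sup_tilde

# Toy usage: after the notch, other frames have no component left in the
# current frame's retained subspace (P_k @ X_notched is ~0).
X_cur, X_other = np.random.randn(77, 256), np.random.randn(77, 256)
X_boost, (X_notched,) = ab_svr(X_cur, [X_other])
```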
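And a numpy sketch of SFC's two modes, using the reported constants (top-k = 128, cap = 512, \(\delta_h=-0.1\), \(\alpha=0.6\)). The relevance score used for top-k selection (mean query-key similarity) is our assumption; the paper's exact criterion may differ.

```python
import numpy as np

def sfc_select(Q, K_hist, V_hist, top_k=128, cap=512):
    """Mode (a), selection: keep at most `cap` historical tokens, then pick
    the `top_k` most relevant to the current queries (heuristic score)."""
    K_hist, V_hist = K_hist[-cap:], V_hist[-cap:]         # capacity cap
    if len(K_hist) <= top_k:
        return K_hist, V_hist
    rel = (Q @ K_hist.T).mean(axis=0)                     # mean relevance per token
    idx = np.argsort(rel)[-top_k:]
    return K_hist[idx], V_hist[idx]

def sfc_attention(Q, K, V, K_hist, V_hist, delta_h=-0.1):
    """Mode (a), accumulation: attend over current + selected historical KV,
    with a negative bias delta_h on historical logits to promote forgetting."""
    d = Q.shape[-1]
    K_sel, V_sel = sfc_select(Q, K_hist, V_hist)
    K_all, V_all = np.concatenate([K, K_sel]), np.concatenate([V, V_sel])
    bias = np.concatenate([np.zeros(len(K)), np.full(len(K_sel), delta_h)])
    logits = Q @ K_all.T / np.sqrt(d) + bias[None, :]
    a = np.exp(logits - logits.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ V_all

def sfc_blend(C, C_prev, M_b, alpha=0.6):
    """Mode (b), context blending at low-resolution layers: background pixels
    (M_b = 1, i.e. the mask M_b' above) mix in the previous frame's output."""
    return C * (1 - alpha * M_b) + C_prev * (alpha * M_b)
```

Note how the capacity cap and the top-k gate bound memory regardless of story length, while \(\delta_h\) biases attention toward the current frame whenever history and present conflict.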

Loss & Training

Training-free method; all modules are plug-and-play during SDXL inference. Hyperparameters: Gaussian base radii (0.35/0.70), AB-SVR energy threshold (\(\tau=0.85\)), SFC blending strength (\(\alpha=0.6\)), and forgetting bias (\(\delta_h=-0.1\)).
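
For reference, the reported hyperparameters gathered in one place (the variable names are ours, not the paper's config schema):

```python
# StoryTailor inference-time hyperparameters as reported in the paper.
GCA_S_IN, GCA_S_OUT = 0.35, 0.70   # Gaussian base radii (inner / outer)
ABSVR_TAU = 0.85                   # cumulative energy threshold for rank k
SFC_ALPHA = 0.60                   # background blending strength
SFC_DELTA_H = -0.10                # forgetting bias on historical logits
SFC_TOP_K, SFC_CAP = 128, 512      # historical tokens selected / cache capacity
```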

Key Experimental Results

Main Results

Multi-Subject Image Consistency (MSBench)

Method       | CLIP-I ↑ | M-DINO ↑ | CLIP-T ↑
MS-Diffusion | 0.692    | 0.108    | 0.340
FluxKontext  | 0.732    | 0.107    | 0.372
Nano-Banana  | 0.749    | 0.114    | 0.389
StoryTailor  | 0.717    | 0.112    | 0.414

Ablation Study

Config                | CLIP-T ↑ | CLIP-I ↑ | Note
Baseline (MS-Diff)    | 0.340    | 0.692    | baseline
+ GCA                 | ~0.355   | ~0.710   | spatial localization improves
+ AB-SVR              | ~0.390   | ~0.705   | action semantics significantly enhanced
Full (GCA+AB-SVR+SFC) | 0.414    | 0.717    | best synergy of all three

Key Findings

  • CLIP-T improves from 0.340 (MS-Diffusion baseline) to 0.414, with substantially better text-following for actions and interactions
  • CLIP-I is slightly lower than the API-based method Nano-Banana (0.717 vs. 0.749), but the latter requires cluster deployment
  • Runs on a single RTX 4090; FluxKontext requires more VRAM and is slower
  • AB-SVR is the largest contributor to CLIP-T improvement; GCA is the largest contributor to CLIP-I improvement

Highlights & Insights

  • The three-module divide-and-conquer architecture is clean—space (GCA), semantics (AB-SVR), time (SFC) are orthogonally decoupled
  • AB-SVR's SVD subspace separation is more precise than simple weight adjustment: the principal projection keeps the current frame's action component, while the notch projection removes the corresponding components from other frames entirely
  • Strong practicality: Training-free, single GPU (24GB), plug-and-play modules

Limitations & Future Work

  • CLIP-I is not optimal (0.717 vs. 0.749); identity preservation strategies have room for improvement
  • Relies on user-provided grounding boxes, raising the usage barrier
  • Validated only on SDXL; adaptability to other diffusion backbones is unknown

Comparison with Related Work

  • vs MS-Diffusion: StoryTailor adds the three modules on top, improving CLIP-T from 0.340 to 0.414
  • vs FluxKontext: comparable quality, but StoryTailor runs on a single GPU
  • vs 1Prompt1Story: the pioneering SVR work, but with weak identity preservation and limited action fidelity; AB-SVR extends its SVR idea with subspace separation

Rating

  • Novelty: ⭐⭐⭐⭐ AB-SVR's subspace notch projection is particularly novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple baseline comparisons + ablation + qualitative demonstrations
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, though notation-heavy
  • Value: ⭐⭐⭐⭐ A practical solution for single-GPU visual storytelling