Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement¶
Conference: ACL 2026
arXiv: 2411.15115
Code: video-repair
Area: Video Generation
Keywords: Text-to-Video Generation, Self-Correction, Localized Refinement, Text-Video Alignment, Diffusion Model
TL;DR¶
This paper proposes VideoRepair, the first training-free, model-agnostic self-correction framework for text-to-video generation. It detects fine-grained text-video misalignment with an MLLM, preserves correctly generated regions, and selectively repairs problematic ones, consistently improving alignment across four T2V backbone models on EvalCrafter and T2V-CompBench.
Background & Motivation¶
Background: Text-to-video (T2V) diffusion models have made significant advances in generation quality, but they still struggle to follow complex text prompts, particularly those involving multiple objects, attribute binding, and spatial relationships. Common errors include incorrect object counts, confused attribute binding, and distorted regions.
Limitations of Prior Work: Existing compositional T2V methods improve compositionality but lack explicit feedback mechanisms to detect and correct misalignments. Image-domain repair frameworks suffer from high computational overhead, reliance on external generators, or introduction of visual inconsistencies. The key observation: even when a generated video is only partially misaligned, its correctly generated regions should be preserved rather than regenerated.
Key Challenge: Global regeneration wastes correctly generated content, while simple inpainting/editing lacks semantically guided ability to introduce or correct entities that do not match the text. A mechanism is needed that can both precisely locate problematic regions and preserve faithful content.
Goal: Design a training-free video repair framework that can automatically detect what is wrong, plan how to fix it, and then locally correct it.
Key Insight: Analogous to how humans revise creative works — modifying only erroneous parts while preserving correct ones. Through MLLM-generated fine-grained evaluation questions to identify misaligned regions, then leveraging the diffusion model's own regeneration capability for selective repair.
Core Idea: Preserve correct regions, selectively repair erroneous regions — transform MLLM evaluation feedback into actionable generation guidance.
Method¶
Overall Architecture¶
VideoRepair has three stages: (1) Misalignment Detection: extract semantic tuples from text prompts, generate evaluation question sets, use MLLM binary answers to identify misaligned regions; (2) Refinement Planning: determine entities to preserve and their instance counts, obtain preservation region masks through segmentation models, generate local prompts for regions to be repaired; (3) Localized Refinement: selectively reinitialize noise, apply different text guidance to preserved and repair regions, achieve seamless fusion through joint optimization.
Key Designs¶
- MLLM-Driven Misalignment Detection:
- Function: Automatically identify which video elements do not match the text prompt
- Mechanism: Extract semantic tuples (entities, attributes, relationships, actions) from the prompt, then use an LLM to generate an evaluation question set \(Q\), split into counting questions \(Q_c\) (e.g., "Is there one bear?") and other questions \(Q_{\text{others}}\) covering attributes, actions, and scenes. An MLLM answers these questions on the initial video: counting questions return triplets (judgment, prompt count, video count), while other questions return binary judgments. The answers are aggregated into a \([0,1]\) alignment score
- Design Motivation: More fine-grained than simple object existence checks — explicitly capturing quantity, attributes, spatiotemporal relationships, and actions, providing feedback that directly guides repair planning
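The aggregation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the data layout and the equal weighting of all questions are assumptions; the source only specifies the answer formats and that they combine into a \([0,1]\) score.

```python
# Hypothetical sketch: aggregate MLLM answers into a [0,1] alignment score.
# Counting questions (Q_c) yield (judgment, prompt_count, video_count)
# triplets; other questions (Q_others) yield binary judgments.
# Equal weighting across all questions is an assumption.

def alignment_score(counting_answers, other_answers):
    """counting_answers: list of (is_correct, prompt_count, video_count);
    other_answers: list of bool (binary MLLM judgments)."""
    judgments = [bool(c[0]) for c in counting_answers] + list(other_answers)
    if not judgments:
        return 1.0  # nothing to check -> treat as fully aligned
    return sum(judgments) / len(judgments)

# Example: the count check fails ("two bears" generated, one requested),
# while two attribute/action checks pass -> score = 2/3.
score = alignment_score([(False, 2, 1)], [True, True])
print(round(score, 3))
```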
- Region-Preserving Refinement Planning:
- Function: Determine what to preserve, what to repair, and what prompts to use for repair
- Mechanism: (a) MLLM identifies correctly generated key entities \(O^*\) and their preservation count \(N^*\) based on QA results; (b) Pointing prompts and segmentation models obtain entity binary masks \(\mathbf{M}\) per frame; (c) LLM generates local repair prompts \(p^r\) excluding already preserved entities
- Design Motivation: Transforms evaluation feedback into actionable generation guidance — masks precisely define which pixels to preserve and which to regenerate; local prompts ensure repair regions receive correct semantic guidance
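The planning logic can be sketched as follows. This is a toy stand-in under stated assumptions: the paper uses an MLLM to select the preserved entities \(O^*\) and an LLM to write the local prompt \(p^r\); here both are replaced by simple dictionary operations, and the entity names and prompt template are illustrative.

```python
# Hypothetical sketch of refinement planning: keep entities judged correct,
# and build a local repair prompt p^r covering only what must be
# regenerated (the paper delegates both steps to an MLLM/LLM).

def plan_refinement(entity_verdicts, prompt_entities):
    """entity_verdicts: {entity: True if correctly generated};
    prompt_entities: {entity: its description phrase from the prompt}."""
    preserve = [e for e, ok in entity_verdicts.items() if ok]   # O*
    repair = {e: d for e, d in prompt_entities.items() if e not in preserve}
    local_prompt = ", ".join(repair.values())  # p^r excludes preserved entities
    return preserve, local_prompt

preserve, p_r = plan_refinement(
    {"bear": True, "penguin": False},
    {"bear": "a brown bear", "penguin": "a penguin wearing a hat"},
)
print(preserve, "|", p_r)  # preserved entity list and the local prompt
```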
- Localized Refinement and Fusion:
- Function: Repair problematic regions without destroying correct regions
- Mechanism: Downscale masks to latent space; preserved regions use original noise while repair regions use resampled noise. Each denoising step runs the diffusion model twice: preserved regions with original prompt \(p\), repair regions with local prompt \(p^r\). Final fusion via joint optimization: \(V_1 = \arg\min_{\tilde{V}} \|M_{pres} \otimes (\tilde{V} - \hat{V}_{pres})\|^2 + \|M_{refine} \otimes (\tilde{V} - \hat{V}_{refine})\|^2\), achieving seamless boundary transitions
- Design Motivation: Pure mask inpainting cannot introduce new entities; pure editing cannot freely correct misalignments; dual-path denoising + joint optimization achieves both precise control and global consistency
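For intuition on the fusion objective: when the latent masks are binary and complementary, the joint minimization decouples pixel-wise and reduces to a closed-form masked blend. The sketch below shows that special case only; shapes are simplified and the complementary-mask assumption is mine, not the paper's general formulation.

```python
import numpy as np

# Sketch of the fusion step. With binary, complementary masks the objective
#   ||M_pres * (V - V_pres)||^2 + ||M_refine * (V - V_refine)||^2
# is minimized independently per pixel, so the optimum simply takes V_pres
# where M_pres = 1 and V_refine where M_refine = 1.

def fuse(m_pres, v_pres, v_refine):
    # Closed-form minimizer under M_refine = 1 - M_pres.
    return m_pres * v_pres + (1.0 - m_pres) * v_refine

m = np.array([[1.0, 0.0]])     # first pixel preserved, second repaired
v_p = np.array([[5.0, 5.0]])   # preserved-path denoising result
v_r = np.array([[9.0, 9.0]])   # repair-path denoising result
print(fuse(m, v_p, v_r))       # preserved pixel from v_p, repaired from v_r
```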
Loss & Training¶
Entirely training-free: inference uses existing T2V diffusion models as-is. K repair candidate videos are generated with different random seeds, and the best is selected by its evaluation-question score; the BLIP-BLEU score serves as a tiebreaker when scores are tied.
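The best-of-K selection with tiebreaking can be sketched as below. Both scoring functions are stand-ins: in the paper the primary score comes from the evaluation questions and the tiebreaker is BLIP-BLEU; here candidates are just toy (qa, blip_bleu) pairs.

```python
# Hypothetical best-of-K candidate selection: score each repaired video
# with the evaluation questions, keep the highest, and break ties with a
# secondary score (BLIP-BLEU in the paper). Scoring functions are stubs.

def select_best(candidates, qa_score, tiebreak_score):
    best = max(qa_score(c) for c in candidates)
    tied = [c for c in candidates if qa_score(c) == best]
    return max(tied, key=tiebreak_score) if len(tied) > 1 else tied[0]

# Toy example: candidates represented as (qa_score, blip_bleu) pairs.
cands = [(0.8, 0.3), (0.9, 0.2), (0.9, 0.6)]
winner = select_best(cands, lambda c: c[0], lambda c: c[1])
print(winner)  # top QA score 0.9; tie broken by the second value
```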
Key Experimental Results¶
Main Results¶
| T2V Backbone | Method | EvalCrafter Avg↑ | Visual Quality | Motion Quality | Temporal Consistency |
|---|---|---|---|---|---|
| Wan 2.1-1.3B | Original | 44.83 | 63.2 | 61.0 | 62.1 |
| Wan 2.1-1.3B | + VideoRepair | 49.01 | 65.1 | 61.6 | 62.0 |
| VideoCrafter2 | Original | 45.97 | 61.8 | 62.6 | 62.9 |
| VideoCrafter2 | + VideoRepair | 48.83 | 62.1 | 62.4 | 62.0 |
| CogVideoX-5B | Original | 45.01 | 65.8 | 61.0 | 61.8 |
| CogVideoX-5B | + VideoRepair | 46.41 | 64.8 | 61.1 | 61.9 |
Ablation Study¶
| Method | EvalCrafter Avg↑ (range across backbones) | Note |
|---|---|---|
| LLM paraphrasing | 43.12-45.81 | Simple prompt rephrasing; limited or even negative improvement |
| SLD | 43.72-47.11 | Effective in some scenarios but severely damages visual/temporal quality |
| OPT2I | 45.63-48.69 | Clear improvement, but below VideoRepair |
| VideoRepair | 46.41-49.01 | Consistently best without harming quality metrics |
Key Findings¶
- VideoRepair provides consistent improvements across all four T2V backbones, validating model-agnosticism
- The key advantage is that it does not harm visual quality, motion quality, or temporal consistency; SLD sometimes approaches VideoRepair's alignment scores but severely damages these quality metrics (e.g., temporal consistency drops from 62.1 to 21.0)
- Count and Color subcategories show the most significant improvements, precisely the weakest areas of current T2V models
Highlights & Insights¶
- "Preserve correct, repair incorrect" paradigm: An intuitively natural but technically non-trivial approach — compared to global regeneration or simple inpainting, region-preserving refinement is superior in both efficiency and quality. This paradigm is transferable to any generative task requiring post-processing correction
- Evaluation-feedback-driven generation: Directly transforming MLLM evaluation QA results into repair plans (masks + prompts) establishes a closed loop between evaluation and generation. This self-correction paradigm is more scalable than purely human feedback
- Training-free + model-agnostic: No additional model training required; can be used plug-and-play with any T2V diffusion model
Limitations & Future Work¶
- Requires two diffusion model forward passes per denoising step (preserved path + repair path), roughly doubling inference cost
- Depends on MLLM evaluation accuracy — misalignment state misjudgments may lead to unnecessary modifications or omissions
- Currently supports only single-round repair; iterative repair may lead to error accumulation
- Future exploration: combining with T2V model training for online self-correction, incorporating user interactive feedback
Related Work & Insights¶
- vs SLD/OPT2I: SLD uses global semantic guidance but severely damages visual quality; OPT2I optimizes prompts but does not perform pixel-level repair; VideoRepair's region-preserving strategy achieves both alignment precision and quality maintenance
- vs Image repair/editing methods: Inpainting can only fill regions but cannot introduce new entities; editing cannot freely correct misalignments; VideoRepair's dual-path denoising overcomes both limitations
Rating¶
- Novelty: ⭐⭐⭐⭐ First training-free video self-correction framework; region-preserving repair paradigm is novel
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four backbones, two benchmarks, comprehensive ablation and quality metric evaluation
- Writing Quality: ⭐⭐⭐⭐ Three-stage pipeline diagram is clear; method description is systematic
- Value: ⭐⭐⭐⭐ Provides a general and practical post-processing improvement solution for T2V generation