🎬 Video Generation
📷 CVPR2026 · 55 paper notes
- ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
-
This work introduces the first activity-level video forgery localization task and the large-scale ActivityForensics benchmark (6K+ forged clips). A grounding-assisted automated data construction pipeline is proposed to produce highly realistic activity manipulations, and a baseline method, Temporal Artifact Diffuser (TADiff), is presented to amplify forgery cues via diffusion-based feature regularization.
- Anti-I2V: Safeguarding your photos from malicious image-to-video generation
-
Anti-I2V proposes a defense method against malicious image-to-video generation. By optimizing perturbations in both L*a*b* color space and the frequency domain, and designing Internal Representation Collapse (IRC) and Internal Representation Anchor (IRA) losses to disrupt semantic feature propagation within the denoising network, the method achieves state-of-the-art protection across three architecturally distinct models: CogVideoX, DynamiCrafter, and Open-Sora.
- AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation
-
AutoCut proposes an end-to-end advertisement video editing framework that unifies video, audio, and text into a shared discrete token space via Residual Vector Quantization (RQVAE), performs multimodal alignment and supervised fine-tuning on Qwen3-8B, and enables four tasks—clip selection, ordering, script generation, and background music selection—within a single unified model, surpassing GPT-4o baselines across multiple metrics.
- Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
-
This work models physically plausible video generation (PPVG) as a sequence of causally connected events. It decomposes complex physical phenomena into ordered events via physics-formula-grounded event chain reasoning, then synthesizes semantic–visual dual conditions through transition-aware cross-modal prompting to guide a video diffusion model in generating videos that follow causal physical evolution.
- Compressed-Domain-Aware Online Video Super-Resolution
-
CDA-VSR leverages compressed-domain information (motion vectors, residual maps, and frame types) to guide three key stages of online video super-resolution: motion-vector-guided deformable alignment for efficient and accurate registration, residual-map-gated fusion to suppress misalignment artifacts, and frame-type-aware reconstruction to adaptively allocate computation. The method achieves state-of-the-art PSNR on REDS4 at 93 FPS—more than twice the speed of prior SOTA.
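For intuition, here is a minimal numpy sketch of using decoded motion vectors as a free initial alignment. This is a stand-in under stated assumptions (the paper feeds MVs into deformable-convolution offsets rather than doing a hard gather), and `mv_warp` is a hypothetical helper name.

```python
import numpy as np

def mv_warp(prev_feat: np.ndarray, mv: np.ndarray) -> np.ndarray:
    """Warp previous-frame features using codec motion vectors as a free
    initial alignment. prev_feat: (H, W, C); mv: (H, W, 2) in pixels."""
    H, W, _ = prev_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Gather each output location from its motion-compensated source pixel.
    src_y = np.clip(ys + np.rint(mv[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(xs + np.rint(mv[..., 0]).astype(int), 0, W - 1)
    return prev_feat[src_y, src_x]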
- CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
-
CubeComposer decomposes 360° video into a six-face cubemap representation and generates each face autoregressively in a spatio-temporal order, achieving for the first time native 4K (3840×1920) 360° panoramic video generation from perspective video without post-hoc super-resolution.
- Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
-
This paper proposes Diff4Splat, a feed-forward framework that unifies video diffusion models with deformable 3D Gaussian fields into an end-to-end trainable model, enabling direct generation of dynamic 4D scene representations from a single image in approximately 30 seconds — roughly 60× faster than optimization-based methods.
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
-
DisCa is the first framework to unify learnable feature caching with step distillation in a compatible manner, replacing hand-crafted caching strategies with a lightweight neural predictor (<4% of model parameters). Combined with Restricted MeanFlow for stable large-scale video DiT distillation, DisCa achieves an 11.8× near-lossless speedup on HunyuanVideo.
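As a rough illustration of the caching idea, the sketch below reuses a cached block output and corrects it with a tiny timestep-conditioned MLP instead of recomputing the block. `CachePredictor` and `block_with_cache` are hypothetical names; DisCa's actual predictor design is not reproduced here.

```python
import torch
import torch.nn as nn

class CachePredictor(nn.Module):
    """Tiny timestep-conditioned MLP that corrects a cached block output."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim // 4), nn.GELU(),
                                 nn.Linear(dim // 4, dim))

    def forward(self, cached: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the (scalar) diffusion timestep to every token.
        t_feat = t.expand(*cached.shape[:-1], 1)
        return cached + self.net(torch.cat([cached, t_feat], dim=-1))

def block_with_cache(block, x, t, cache, predictor, recompute: bool):
    """Recompute the block on selected steps; otherwise predict from cache."""
    if recompute or cache is None:
        out = block(x)                 # full-cost forward, refresh the cache
        return out, out.detach()
    return predictor(cache, t), cache  # cheap learned correction, keep cache
```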
- DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
-
This paper proposes DreamShot, which leverages the spatiotemporal prior of video diffusion models to generate multi-shot storyboards with consistent characters and coherent scenes. A Role-Attention Consistency Loss (RACL) is introduced to address multi-character confusion, and three unified generation modes are supported: text-to-shot, reference-to-shot, and shot-to-shot.
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
-
This paper proposes DriveLaW, a driving world model that unifies video generation and motion planning through a shared latent space. The intermediate latent features of the video generator are directly injected into a diffusion-based planner, achieving state-of-the-art performance simultaneously on the nuScenes video prediction benchmark and the NAVSIM planning benchmark.
- FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
-
FastLightGen proposes a three-stage distillation algorithm that, for the first time, achieves joint distillation of sampling steps and model size. By identifying redundant layers, applying dynamic probabilistic pruning, and matching the student's output distribution to that of the well-guided teacher, it compresses HunyuanVideo/WanX into a lightweight generator with 4 sampling steps and 30% parameter pruning, achieving approximately 35× speedup while surpassing the teacher model in performance.
- First Frame Is the Place to Go for Video Content Customization
-
This paper identifies an intrinsic capability of video generation models to implicitly use the first frame as a "conceptual memory buffer" for storing and reusing multiple visual entities. Building on this observation, the authors propose FFGo—a lightweight LoRA adaptation method requiring only 20–50 training samples—that activates this capability without any architectural modification, enabling multi-reference video content customization. FFGo is rated best in 81.2% of cases in user studies.
- FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
-
FlashMotion is proposed as the first three-stage training framework for few-step (4-step) trajectory-controllable video generation. By sequentially training a trajectory adapter, distilling a fast generator, and fine-tuning the adapter via a hybrid adversarial and diffusion loss, the method surpasses existing multi-step approaches in both visual quality and trajectory accuracy under 4-step inference, achieving a 47× speedup.
- Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
-
FreeLOC proposes a training-free, layer-adaptive framework that identifies the differential sensitivity of each layer in video DiTs to two out-of-distribution (OOD) problems—frame-level relative position OOD and context length OOD—and selectively applies multi-granularity positional re-encoding (VRPR) and tiered sparse attention (TSA) to sensitive layers, achieving state-of-the-art long video generation quality without any additional training cost.
- From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
-
This paper proposes Co-Settle, a framework that trains a lightweight linear projection layer on top of a frozen image-pretrained encoder. Using temporal cycle consistency loss and semantic separability constraints, the method achieves consistent improvements across multi-granularity video downstream tasks on 8 image foundation models with only 5 epochs of self-supervised training.
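A minimal sketch of a temporal cycle-consistency objective in this spirit follows; the paper's exact loss and its semantic separability constraint are not reproduced, and `temporal_cycle_loss` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def temporal_cycle_loss(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    """za, zb: (T, D) projected per-frame features of two clips/views.
    Each frame of za soft-matches into zb and back; the round trip should
    land on the frame it started from (cycle-back classification)."""
    soft_b = (za @ zb.t()).softmax(dim=1) @ zb   # soft nearest neighbor in zb
    logits = soft_b @ za.t()                     # similarity back into za
    target = torch.arange(za.size(0))
    return F.cross_entropy(logits, target)
```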
- Generative Neural Video Compression via Video Diffusion Prior
-
This paper proposes GNVC-VD, the first DiT-based generative neural video compression framework. By leveraging a video diffusion transformer as a video-native generative prior, GNVC-VD performs spatiotemporal latent compression and sequence-level generative refinement within a unified codec. At extremely low bitrates (<0.03 bpp), it substantially surpasses both traditional and learned codecs in perceptual quality while significantly reducing the flickering artifacts prevalent in prior generative approaches.
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
-
This paper proposes the Geometry-as-Context (GaC) framework, which replaces the non-differentiable operators (3D reconstruction + rendering) in reconstruction-based scene video generation with a unified autoregressive video generation model. By embedding geometric information (depth maps) as interleaved context into the generation sequence, GaC enables end-to-end training and mitigates accumulated errors.
- Gloria: Consistent Character Video Generation via Content Anchors
-
Gloria introduces a compact set of "Content Anchors" to represent a character's multi-view appearance and expression identity. Through two key mechanisms—superset content anchoring (to prevent copy-paste artifacts) and RoPE weak conditioning (to distinguish multiple anchor frames)—the method enables consistent character video generation exceeding 10 minutes in duration.
- Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
-
This paper proposes GenReward, a framework that fine-tunes a pre-trained video diffusion model to generate goal-conditioned videos, and derives two-level goal-driven reward signals—video-level and frame-level—to guide reinforcement learning agents without manually designed reward functions, achieving substantial improvements over baselines on Meta-World robotic manipulation tasks.
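To make the two-level reward concrete, here is a toy sketch that scores a rollout against a generated goal video in an embedding space; cosine similarity is an assumed stand-in for the paper's diffusion-derived scores.

```python
import numpy as np

def goal_driven_rewards(rollout: np.ndarray, goal_video: np.ndarray):
    """rollout, goal_video: (T, D) per-frame embeddings. Returns a scalar
    video-level reward and a (T,) array of frame-level rewards."""
    def cos(a, b):
        na = np.linalg.norm(a, axis=-1)
        nb = np.linalg.norm(b, axis=-1)
        return (a * b).sum(-1) / (na * nb + 1e-8)
    frame_r = cos(rollout, goal_video)   # dense per-step shaping signal
    return frame_r.mean(), frame_r       # video-level + frame-level rewards
```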
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
-
This paper proposes IPRO, which directly optimizes a video diffusion model via reinforcement learning and a differentiable facial identity scorer, significantly improving face identity consistency in image-to-video generation without modifying the model architecture, achieving 20%–45% FaceSim gains on Wan 2.2.
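A minimal sketch of the reward signal, assuming a differentiable face-embedding network as the scorer (an ArcFace-style model is a plausible but unconfirmed choice; `face_embed` is a hypothetical callable):

```python
import torch
import torch.nn.functional as F

def identity_reward_loss(face_embed, gen_frames: torch.Tensor,
                         ref_embedding: torch.Tensor) -> torch.Tensor:
    """Maximize cosine similarity between the face embedding of each
    generated frame and the reference identity embedding; gradients flow
    through the scorer back into the video model. gen_frames: (T, 3, H, W)."""
    emb = face_embed(gen_frames)                                # (T, D)
    sim = F.cosine_similarity(emb, ref_embedding[None], dim=-1)
    return -sim.mean()                                          # loss = -reward
```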
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
-
This paper proposes ∞-RoPE, a training-free inference-time framework comprising three components — Block-Relativistic RoPE, KV Flush, and RoPE Cut — that extends an autoregressive video diffusion model trained solely on 5-second clips to support infinite-length generation, fine-grained action control, and cinematic scene transitions.
- I'm a Map! Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
-
This paper proposes IMAP (Interpretable Motion-Attentive Maps), a training-free framework that extracts spatio-temporal saliency maps for motion concepts from Video DiTs via two modules: GramCol for spatial localization and motion head selection for temporal localization. IMAP surpasses existing methods on motion localization and zero-shot video semantic segmentation benchmarks.
- LAMP: Language-Assisted Motion Planning for Controllable Video Generation
-
LAMP frames motion control as a language-to-program synthesis problem: it designs a cinematography-inspired motion DSL, fine-tunes an LLM to translate natural language descriptions into structured motion programs, and deterministically maps these programs to 3D object and camera trajectories that condition a video diffusion model — achieving, for the first time, simultaneous natural-language control over both object and camera motion.
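To illustrate the language-to-program idea, here is a toy motion DSL with two hypothetical primitives and a deterministic per-frame expansion; LAMP's actual grammar and primitives are not shown in this note.

```python
from dataclasses import dataclass

# Hypothetical motion primitives, named for illustration only.
@dataclass
class Orbit:            # camera circles a named target
    target: str
    degrees: float
    frames: int

@dataclass
class MoveTo:           # object translates to a 3D position
    obj: str
    xyz: tuple
    frames: int

def expand_move(op: MoveTo, start=(0.0, 0.0, 0.0)):
    """Deterministically expand one primitive into per-frame 3D positions."""
    return [tuple(s + (e - s) * i / (op.frames - 1)
                  for s, e in zip(start, op.xyz))
            for i in range(op.frames)]

program = [MoveTo("car", (4.0, 0.0, 0.0), 48), Orbit("car", 90.0, 48)]
print(expand_move(program[0])[:3])  # first three interpolated positions
```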
- Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer
-
This paper proposes FlexiMMT, the first I2V framework supporting implicit multi-object multi-motion transfer. It introduces a Motion Decoupling Mask Attention (MDMA) mechanism to constrain motion/text tokens to interact only with their corresponding object regions, and a Differential Mask Extraction Mechanism (DMEM) to derive object masks from diffusion attention maps with progressive propagation, enabling precise compositional multi-object motion transfer.
- Lighting-grounded Video Generation with Renderer-based Agent Reasoning
-
LiVER proposes a lighting-driven video generation framework that employs a renderer-based agent to translate textual descriptions into explicit 3D scene proxies (encompassing layout, lighting, and camera trajectories). Physically based rendering then produces diffuse, glossy, and rough GGX renderings of the scene proxies, which are injected into a video diffusion model to achieve physically accurate lighting effects and precise scene control.
- LightMover: Generative Light Movement with Color and Intensity Controls
-
LightMover leverages video diffusion priors to model light source editing as a sequence-to-sequence prediction problem. Through a unified control token representation, it achieves precise manipulation of light source position, color, and intensity. An adaptive token pruning mechanism reduces control sequence length by 41%, and the method outperforms existing approaches on both light movement and object movement tasks.
- LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
-
LinVideo is the first data-free post-training framework that automatically identifies which layers are most amenable to linear attention substitution via selective transfer, and recovers model performance through an Arbitrary-timestep Distribution Matching (ADM) objective. It achieves 1.43–1.71× lossless speedup on Wan 1.3B/14B, and up to 15.9–20.9× speedup when combined with 4-step distillation.
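For reference, here is a generic kernelized linear attention of the kind such substitutions rely on (a Katharopoulos-style elu+1 feature map; whether LinVideo uses this exact kernel is an assumption):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """O(n) attention via a positive feature map: softmax(QK^T)V is replaced
    by phi(Q)(phi(K)^T V), so cost is linear in sequence length n.
    q, k, v: (batch, n, dim)."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0          # positive feature map
    kv = torch.einsum("bnd,bne->bde", k, v)        # d x e summary, O(n)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```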
- MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer
-
MoVieDrive is the first method to simultaneously generate RGB + depth + semantic tri-modal multi-view driving-scene videos within a unified DiT framework. Through a decomposed design of modal-shared layers (temporal + multi-view spatiotemporal attention) and modal-specific layers (cross-modal interaction + projection heads), a unified layout encoder, and diverse conditioning inputs (text, layout, contextual reference), it achieves FVD 46.8 on nuScenes (a 22% improvement over CogVideoX+SyntheOcc), depth AbsRel 0.110, and semantic mIoU 37.5, outperforming pipelines that generate and estimate with separate models.
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
-
NeoVerse proposes a scalable 4D world model that enables the entire training pipeline to leverage large-scale in-the-wild monocular videos (millions of clips) via feed-forward pose-free 4DGS reconstruction and online monocular degradation simulation, achieving state-of-the-art performance on both 4D reconstruction and novel-trajectory video generation.
- NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
-
This paper proposes NOVA, which formalizes for the first time the "sparse control, dense synthesis" paradigm for video editing: a sparse branch provides semantic guidance from user-edited multi-keyframes, while a dense branch injects motion and texture information from the original video. Combined with a degradation simulation training strategy, NOVA achieves learning without paired data and comprehensively outperforms existing methods in editing fidelity, motion preservation, and temporal consistency.
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
-
This paper proposes leveraging the latent features of a 3D foundation generative model (Hunyuan3D) as shape priors, injecting them into a base video diffusion model via a multi-scale 3D adapter, to generate geometrically realistic and view-consistent orbital videos from a single image.
- PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
-
This paper proposes PAM — the first engine capable of generating realistic hand-object interaction (HOI) videos given only initial/target hand poses and object geometry. Through a three-stage decoupled architecture of pose generation, appearance generation, and motion generation, PAM achieves FVD 29.13 (vs. InterDyn 38.83) and MPJPE 19.37 mm (vs. CosHand 30.05 mm) on DexYCB. The generated synthetic data also effectively augments downstream hand pose estimation tasks.
- PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
-
PerformRecast presents a GAN-based portrait video editing method built upon a corrected 3DMM keypoint transformation formulation. By applying expression deformation before head rotation — consistent with the FLAME model — the method achieves precise disentanglement of expression and head pose. A Boundary Alignment Module (BAM) is further introduced to address stitching misalignment between facial and non-facial regions. The approach substantially outperforms existing methods under both expression replacement and expression enhancement modes.
- Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
-
This paper proposes Phantom, a framework that augments a pretrained video diffusion model (Wan2.2-TI2V) with a dedicated physical dynamics branch. Physics-aware embeddings extracted by V-JEPA2 serve as latent physical states, and bidirectional cross-attention is employed to jointly model visual content and physical dynamics evolution. Phantom achieves substantial improvements over baselines on physics consistency benchmarks (VideoPhy PC +50.4%) while preserving visual quality.
- Physical Simulator In-the-Loop Video Generation
-
This paper proposes PSIVG — the first training-free inference-time framework that embeds a physical simulator into the video diffusion generation loop. It reconstructs a 4D scene and object meshes from a template video, generates physically consistent trajectories via an MPM simulator, guides video generation using optical flow, and enforces texture consistency of moving objects through Test-Time Consistency Optimization (TTCO), achieving a user preference rate of 82.3%.
- PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
-
PoseGen achieves dual condition injection (token-level appearance + channel-level pose) via in-context LoRA finetuning, and proposes a segmented interleaved generation strategy (KV sharing + pose-aware frame interpolation) to generate high-fidelity long human videos using only 33 hours of training data.
- Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
-
This paper proposes PoCo (Position Embedding as Context Controller), which introduces an additional SideInfo axis in RoPE to encode reference entity information, addressing the "reference confusion" problem in multi-reference multi-shot video generation—where the model fails to correctly associate shots with references when reference images are visually similar. PoCo achieves state-of-the-art cross-shot consistency on the VACE-Wan2.1-14B framework (CrossShot-FaceSim 89.35, CrossShot-DINO 92.66).
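A minimal sketch of multi-axis RoPE with an extra side-information axis follows, splitting head channels across (time, height, width, side) groups; the channel allocation shown is illustrative, not PoCo's.

```python
import torch

def rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0):
    """Standard 1-D RoPE on the last dim. x: (N, c) with c even; pos: (N,)."""
    c = x.shape[-1]
    freqs = base ** (-torch.arange(0, c, 2, dtype=torch.float32) / c)
    ang = pos[:, None].float() * freqs[None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                      x1 * ang.sin() + x2 * ang.cos()], dim=-1)

def multi_axis_rope(x, t, h, w, side):
    """Split channels into four equal groups, one rotary axis each: time,
    height, width, and the extra side-info (reference-entity) axis.
    Assumes x: (N, C) with C divisible by 8."""
    c = x.shape[-1] // 4
    axes = (t, h, w, side)
    return torch.cat([rope(x[..., i * c:(i + 1) * c], p)
                      for i, p in enumerate(axes)], dim=-1)
```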
- SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
-
SeeU is a 2D→4D→2D learning framework that reconstructs a 4D world representation from sparse monocular 2D frames, learns continuous and physically consistent 4D dynamics on a low-rank representation (B-spline parameterization + physical constraints), and reprojects the 4D world back to 2D, completing unseen regions with a spatiotemporally context-aware video generator—enabling generation of unseen visual content across time and space.
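To ground the low-rank dynamics idea, here is a uniform cubic B-spline evaluator: a handful of control points parameterize a smooth 3D trajectory. This is a stand-in for SeeU's exact representation, which additionally imposes physical constraints.

```python
import numpy as np

def cubic_bspline(control: np.ndarray, n_samples: int) -> np.ndarray:
    """Evaluate a uniform cubic B-spline over (K, 3) control points, K >= 4.
    A few control points give a smooth, low-rank trajectory."""
    K = len(control)
    samples = []
    for u in np.linspace(0.0, K - 3, n_samples, endpoint=False):
        i, t = int(u), u - int(u)
        basis = np.array([(1 - t) ** 3,
                          3 * t**3 - 6 * t**2 + 4,
                          -3 * t**3 + 3 * t**2 + 3 * t + 1,
                          t**3]) / 6.0
        samples.append(basis @ control[i:i + 4])
    return np.asarray(samples)

traj = cubic_bspline(np.random.rand(6, 3), n_samples=48)  # (48, 3) positions
```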
- Semantic Satellite Communications for Synchronized Audiovisual Reconstruction
-
This paper proposes an adaptive multimodal semantic satellite transmission system that flexibly switches transmission priorities via a dual-stream generative architecture (video-driven audio / audio-driven video), combined with a dynamic knowledge base update mechanism and an LLM agent for adaptive decision-making, achieving high-fidelity synchronized audiovisual reconstruction under stringent bandwidth constraints.
- SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
-
This paper proposes SLVMEval, a meta-evaluation benchmark that synthesizes controlled degradations to construct "high-quality vs. low-quality" video pairs (up to ~3 hours) from densely captioned video datasets, and tests whether existing T2V evaluation systems can distinguish long-video quality differences. Human annotators achieve 84.7%–96.8% accuracy across 10 dimensions, whereas existing automatic evaluation systems fall behind humans on 9 out of 10 dimensions.
- StreamDiT: Real-Time Streaming Text-to-Video Generation
-
StreamDiT presents a complete streaming video generation pipeline—covering training, modeling, and distillation—that introduces a sliding buffer with progressive denoising under Flow Matching, a mixed partitioning training strategy, a time-varying DiT architecture with windowed attention, and a customized multi-step distillation method. The resulting 4B-parameter model achieves real-time streaming video generation at 512p@16FPS on a single GPU.
- SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
-
SWIFT introduces the novel task of "few-shot training-free generated video attribution," exploiting the temporal mapping property of 3D VAEs — where \(K\) pixel frames correspond to a single latent frame — by performing two reconstructions (normal and corrupted) via fixed-length sliding windows. The ratio of reconstruction losses over overlapping frames serves as the attribution signal. Using only 20 samples, SWIFT achieves over 90% average attribution accuracy, with a 5-model average of 94%.
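A simplified sketch of the attribution signal: per-frame reconstruction error aggregated over overlapping sliding windows, computed once on clean frames and once on a corrupted copy, with the ratio used as the fingerprint. `vae_reconstruct` stands in for a specific 3D VAE.

```python
import numpy as np

def attribution_signal(vae_reconstruct, frames: np.ndarray, window: int = 4):
    """Per-frame reconstruction-error ratio between a clean and a corrupted
    pass, averaged over overlapping sliding windows. frames: (T, H, W, C)."""
    def per_frame_err(x):
        err, cnt = np.zeros(len(x)), np.zeros(len(x))
        for s in range(len(x) - window + 1):
            rec = vae_reconstruct(x[s:s + window])
            err[s:s + window] += np.mean((rec - x[s:s + window]) ** 2,
                                         axis=(1, 2, 3))
            cnt[s:s + window] += 1
        return err / cnt
    noisy = frames + np.random.normal(0.0, 0.1, frames.shape)
    return per_frame_err(frames) / (per_frame_err(noisy) + 1e-8)

# Toy usage with a lossy identity "VAE" stand-in:
signal = attribution_signal(lambda x: 0.9 * x, np.random.rand(16, 8, 8, 3))
```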
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
-
This paper proposes SwitchCraft, a training-free multi-event video generation framework that achieves clear temporal transitions and scene consistency without modifying model weights, via Event-Aligned Query Steering (EAQS) to align frame-level attention to corresponding event prompts, and Auto-Balance Strength Solver (ABSS) to adaptively balance guidance strength.
- SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
-
SymphoMotion is a unified motion control framework that simultaneously and precisely controls camera motion and object 3D trajectories in video generation via two mechanisms — Camera Trajectory Control (CTC) and Object Dynamics Control (ODC) — alongside a large-scale real-world jointly annotated dataset, RealCOD-25K, containing 25K samples.
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
-
This paper proposes TEAR, the first automated red-teaming framework targeting temporal-dimension vulnerabilities in text-to-video (T2V) models. Through a two-stage optimized temporal-aware test generator and an iterative refinement model, TEAR generates textually benign prompts that exploit temporal dynamics to elicit harmful videos, achieving attack success rates (ASR) exceeding 80% on both open-source and commercial T2V models.
- The Devil is in the Details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
-
This paper proposes KeyTailor, a framework that employs a keyframe-driven details injection strategy—comprising garment dynamic enhancement and collaborative background optimization—to substantially improve garment fidelity and background consistency in video virtual try-on without modifying the DiT architecture. A 15K high-resolution dataset, ViT-HD, is also released.
- Training-free Motion Factorization for Compositional Video Generation
-
This paper proposes a motion factorization framework that decomposes multi-instance scene motion into three categories — stationary, rigid-body, and non-rigid motion — and addresses prompt semantic ambiguity via Structured Motion Reasoning (SMR) while steering the generation of each motion category during diffusion through Decoupled Motion Guidance (DMG). The framework requires no additional training and achieves substantial improvements in motion diversity and fidelity on VideoCrafter-v2.0 and CogVideoX-2B.
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
-
This paper proposes U-Mind, the first unified real-time full-stack multimodal interaction system supporting high-level reasoning dialogue and instruction following. Within a single interaction loop, the system jointly generates text, speech, and motion, and renders them into photorealistic video. Rehearsal-driven learning and a text-first decoding strategy are introduced to balance reasoning preservation with cross-modal alignment.
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
-
UniAVGen proposes a joint audio-video generation framework built on a symmetric dual-branch DiT, achieving precise spatiotemporal synchronization through an asymmetric cross-modal interaction mechanism and a Face-Aware Modulation module. With only 1.3M training samples, it outperforms competitors trained on 30M samples in lip-audio synchronization, timbre consistency, and emotional consistency.
- Unified Camera Positional Encoding for Controlled Video Generation
-
This paper proposes Unified Camera Positional Encoding (UCPE), which injects complete camera geometric information (6-DoF pose, intrinsics, and lens distortion) into Transformer attention mechanisms via relative ray encoding and absolute orientation encoding. UCPE enables fine-grained video generation control across heterogeneous camera models while introducing less than 1% additional trainable parameters.
- UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
-
UniTalking is proposed as an end-to-end talking portrait generation framework built upon MM-DiT. Through a joint attention mechanism within a dual-stream symmetric architecture, it explicitly models fine-grained temporal correspondences between audio and video tokens, achieving state-of-the-art lip-audio synchronization accuracy while supporting personalized voice cloning.
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
-
Vanast proposes a unified framework that simultaneously performs garment transfer and human image animation within a single stage via a Dual Module architecture (HAM + GTM) and a three-stage synthetic data construction pipeline. On the Internet dataset, it achieves a PSNR of 17.95 dB (+5.5 dB over the best two-stage baseline) and an LPIPS of 0.237.
- VideoCoF: Unified Video Editing with Temporal Reasoner
-
This paper proposes VideoCoF, a Chain-of-Thought-inspired "observe→reason→edit" video editing framework. By prompting a video diffusion model to first predict reasoning tokens (grayscale-highlighted region latents) before generating the target video tokens, VideoCoF achieves precise instruction-region alignment without requiring user-provided masks. Trained on only 50K video pairs, it achieves state-of-the-art performance and supports video extrapolation up to 16× the training length.
- When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
-
NUMINA proposes an identify-then-guide paradigm that, without retraining the video diffusion model, extracts a countable instance layout from DiT attention maps during inference, detects inconsistencies between numeric tokens and the current layout, applies conservative layout modifications, and uses the revised layout to guide regeneration—substantially improving adherence to quantity constraints such as "two apples" or "eight ducks" in text-to-video models.
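For the layout-extraction step, here is a toy connected-components counter over a thresholded cross-attention map for a noun token; the paper's actual detector may differ (4-connectivity flood fill is an assumption).

```python
import numpy as np

def count_instances(attn_map: np.ndarray, thresh: float = 0.5) -> int:
    """Count connected components in a thresholded cross-attention map,
    a stand-in for extracting the countable instance layout."""
    mask = attn_map > thresh
    seen = np.zeros_like(mask)
    H, W = mask.shape
    count = 0
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                count += 1                      # new instance found
                stack = [(i, j)]
                while stack:                    # flood-fill its region
                    y, x = stack.pop()
                    if 0 <= y < H and 0 <= x < W and mask[y, x] and not seen[y, x]:
                        seen[y, x] = True
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count
```

If the counted instances disagree with the prompt's numeral, the layout is conservatively edited and fed back to guide regeneration, as described above.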
- When to Lock Attention: Training-Free KV Control in Video Diffusion
-
This paper proposes KV-Lock, a training-free framework that dynamically schedules background KV cache fusion ratios and CFG guidance strength based on diffusion hallucination detection, simultaneously ensuring background consistency and foreground generation quality in video editing.
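A minimal sketch of the fusion step, assuming a simple linear blend driven by the hallucination score; KV-Lock's actual detector and its CFG-strength scheduling are not reproduced here.

```python
import torch

def fuse_background_kv(kv_edit: torch.Tensor, kv_src: torch.Tensor,
                       hallucination_score: float) -> torch.Tensor:
    """Blend the editing branch's background KV cache with the source
    video's: the stronger the detected hallucination, the harder we
    lock onto the source keys/values."""
    alpha = min(max(hallucination_score, 0.0), 1.0)
    return alpha * kv_src + (1.0 - alpha) * kv_edit
```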