# NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
- **Conference:** CVPR 2026
- **arXiv:** 2603.02802
- **Code:** https://github.com/WeChatCV/NovaEdit
- **Area:** Image Generation / Video Editing
- **Keywords:** pair-free video editing, dual-branch architecture, sparse control, degradation simulation training, multi-keyframe
## TL;DR
This paper proposes NOVA, which formalizes for the first time the "sparse control, dense synthesis" paradigm for video editing: a sparse branch provides semantic guidance from multiple user-edited keyframes, while a dense branch injects motion and texture information from the original video. Combined with a degradation simulation training strategy, NOVA learns without paired data and comprehensively outperforms existing methods in editing fidelity, motion preservation, and temporal consistency.
## Background & Motivation
Background: Diffusion model-driven video editing methods are advancing rapidly. Data-driven approaches (Senorita-2M, VACE) require large-scale paired data; first-frame-guided methods (AnyV2V, I2VEdit) propagate edits from the first frame to the entire video, relying on motion compensation.
Limitations of Prior Work: (a) Paired video data are extremely difficult to obtain, and synthetic data contain artifacts that impair generalization; (b) methods relying solely on the first frame suffer from structural drift under large camera or object motion; (c) global editing results are acceptable, but local, region-specific editing fails frequently: background inconsistency and severe artifacts in the edited regions are common.
Key Challenge: The control signal and the synthesis signal are coupled within a single pathway, making it difficult for the model to distinguish what to change from what to preserve.
Goal: Decouple control from synthesis and achieve high-quality video editing without paired data.
Key Insight: Multiple keyframes provide stronger spatiotemporal anchors, and the original video itself is the best reference for motion and texture.
Core Idea: The sparse branch encodes multiple edited keyframes for semantic guidance; the dense branch encodes the original video for motion/texture injection; degradation simulation enables self-supervised training.
## Method
### Overall Architecture
Built upon the WAN 2.1 VACE 1.3B architecture. The framework consists of a main denoising branch, a sparse branch (VACE blocks), and a dense branch (DiT replica + cross-attention). During training, only the newly added cross-attention modules are trainable.
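To make the layer-wise wiring concrete, here is a minimal PyTorch sketch of one main-branch layer, following the update rule given under Key Designs below. All names (`NovaLayer`, `DenseCrossAttention`, `dense_attn`) and the stand-in blocks are illustrative assumptions, not the released code; only the structure (frozen DiT/VACE blocks, trainable cross-attention with the main branch as Q and the dense branch as K/V) comes from the paper.

```python
import torch
import torch.nn as nn

class DenseCrossAttention(nn.Module):
    """Trainable injection: main-branch tokens query the dense branch."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z_main, z_dense):
        # Main branch as Q; dense branch as K/V.
        out, _ = self.attn(query=z_main, key=z_dense, value=z_dense)
        return out

class NovaLayer(nn.Module):
    """One main-branch layer: z_m <- z_m + S(z_m, r) + D(z_m, z_d)."""
    def __init__(self, dim: int, dit_block, vace_block):
        super().__init__()
        self.dit_block = dit_block    # frozen main-branch DiT block
        self.vace_block = vace_block  # frozen sparse-branch VACE block
        self.dense_attn = DenseCrossAttention(dim)  # only trainable module

    def forward(self, z_m, r, z_d):
        # r: sparse control (interpolated keyframe sequence features)
        # z_d: dense-branch features from the DiT replica over the source video
        z_m = z_m + self.vace_block(z_m, r) + self.dense_attn(z_m, z_d)
        return self.dit_block(z_m)

def freeze_all_but_cross_attention(model: nn.Module) -> None:
    # During training, only the newly added cross-attention gets gradients.
    for name, p in model.named_parameters():
        p.requires_grad = "dense_attn" in name

# Smoke test with dummy blocks (the real ones come from WAN 2.1 VACE):
layer = NovaLayer(64, dit_block=nn.Linear(64, 64),
                  vace_block=lambda z, r: r)  # dummy sparse injection
freeze_all_but_cross_attention(layer)
z = layer(torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64))
```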
### Key Designs
- **Dual-Branch Decoupled Architecture:**
  - Function: Decouples editing control and source-video fidelity into two separate pathways.
  - Mechanism: At layer \(l\), \(\boldsymbol{z}_m^{(l)} \leftarrow \boldsymbol{z}_m^{(l)} + \underbrace{\mathcal{S}^{(l)}(\boldsymbol{z}_m^{(l)}, \boldsymbol{r})}_{\text{sparse control}} + \underbrace{\mathcal{D}^{(l)}(\boldsymbol{z}_m^{(l)}, \boldsymbol{z}_d^{(l)})}_{\text{dense synthesis}}\). The sparse branch injects the degraded keyframe sequence via VACE blocks; the dense branch injects original-video information through trainable cross-attention (main branch as Q, dense branch as K/V).
  - Design Motivation: Directly fusing dense features would interfere with editing; cross-attention lets the main branch actively query the motion and texture information it needs.
- **Degradation Simulation Training (Pair-Free Learning):**
  - Anchored Control Pipeline: Keyframes are sparsely sampled from the target video and subjected to random degradations (Gaussian blur, affine transformations, etc.) to simulate editing artifacts: \(\hat{\boldsymbol{x}}_{k_i} = (\boldsymbol{1}-\boldsymbol{b}_{k_i})\odot\boldsymbol{x}_{k_i} + \boldsymbol{b}_{k_i}\odot\mathcal{D}_{aug}(\boldsymbol{x}_{k_i})\). The complete sequence is then reconstructed via linear interpolation and fed to the sparse branch.
  - Source Fidelity Pipeline: A pseudo-source video is generated by randomly applying Cut-and-Paste to the target video, \(\tilde{\boldsymbol{x}}_t = \boldsymbol{m}_t\odot\boldsymbol{y}_t + (1-\boldsymbol{m}_t)\odot\boldsymbol{x}_t\), and fed into the dense branch.
  - Training Objective: Standard denoising loss \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_\theta(\boldsymbol{z}_t, t, \tilde{\mathcal{X}}, \hat{\mathcal{X}})\|_2^2]\).
  - Design Motivation: Degradation simulation trains the model to perform temporal restoration and texture propagation; Cut-and-Paste trains it to recover motion and background from the dense branch (see the data-construction sketch after this list).
- **Consistency-Aware Keyframe Editing (Inference):**
  - Function: Ensures visual consistency across multiple edited keyframes.
  - Mechanism: FLUX Kontext performs the edits; the first frame is edited in the standard manner, and each subsequent keyframe is edited with the first edited frame as reference: \(\boldsymbol{x}_{k_i}^{\text{edit}} = \text{FLUX}(\boldsymbol{x}_{k_i}, \boldsymbol{x}_{k_0}^{\text{edit}}, \boldsymbol{m}_{k_i}, \mathcal{P})\).
  - Design Motivation: Independently editing each frame leads to style inconsistency and flickering (a minimal inference loop is sketched after the training details below).
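To ground the two self-supervised pipelines, below is a minimal data-construction sketch under simplifying assumptions: `make_training_inputs` is a hypothetical helper, the degradation set is reduced to blur plus a small random affine, masks are assumed given, and a temporal roll of the video stands in for sampling foreign Cut-and-Paste content.

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def make_training_inputs(x, mask, k_interval=10):
    """Build both branch inputs from a single unpaired target video x.

    x:    target video, (T, C, H, W) in [0, 1]
    mask: editable-region masks, (T, 1, H, W)
    """
    T = x.shape[0]
    key_idx = list(range(0, T, k_interval))

    # --- Anchored control pipeline (sparse branch) ---
    # Degrade each keyframe inside the mask to simulate editing artifacts:
    # x_hat = (1 - b) * x + b * D_aug(x)
    keyframes = []
    for i in key_idx:
        deg = TF.gaussian_blur(x[i], kernel_size=9)
        deg = TF.affine(deg, angle=random.uniform(-5, 5), translate=(0, 0),
                        scale=random.uniform(0.95, 1.05), shear=0.0)
        b = mask[i]
        keyframes.append((1 - b) * x[i] + b * deg)

    # Reconstruct a dense control sequence by linear interpolation over time
    # (trilinear with unchanged H, W reduces to linear along the frame axis).
    ctrl = torch.stack(keyframes)                 # (K, C, H, W)
    ctrl = F.interpolate(
        ctrl.permute(1, 0, 2, 3).unsqueeze(0),    # (1, C, K, H, W)
        size=(T, x.shape[2], x.shape[3]),
        mode="trilinear", align_corners=True,
    ).squeeze(0).permute(1, 0, 2, 3)              # back to (T, C, H, W)

    # --- Source fidelity pipeline (dense branch) ---
    # Cut-and-Paste foreign content y to mimic source/target mismatch:
    # x_tilde = m * y + (1 - m) * x
    y = x.roll(shifts=random.randrange(1, T), dims=0)  # stand-in for a patch
    pseudo_source = mask * y + (1 - mask) * x

    return ctrl, pseudo_source
```

The resulting control sequence feeds the sparse branch and the pseudo-source feeds the dense branch; the model is then optimized with the standard denoising loss above.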
### Loss & Training
- Only the newly added cross-attention modules are trained; the base model is frozen.
- Training data: 5,000 high-quality videos (Pexels), resolution 832×480, 81 frames per clip.
- AdamW, lr=1e-4, 8,000 steps.
- Inference uses a keyframe interval of 10 frames.
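Putting the inference recipe together, here is a minimal sketch of consistency-aware keyframe editing at the paper's interval of 10. `edit_model.edit` and `nova.synthesize` are hypothetical wrappers around the keyframe editor (FLUX Kontext in the paper) and the NOVA denoiser, not real APIs.

```python
def edit_video(video, masks, prompt, edit_model, nova, k_interval=10):
    """Consistency-aware keyframe editing, then dense synthesis with NOVA."""
    T = len(video)
    key_idx = list(range(0, T, k_interval))

    # Edit the first keyframe in the standard manner.
    edited = {key_idx[0]: edit_model.edit(video[key_idx[0]],
                                          mask=masks[key_idx[0]],
                                          prompt=prompt)}

    # Edit remaining keyframes with the first edited frame as reference,
    # avoiding the style drift and flicker of independent per-frame edits.
    ref = edited[key_idx[0]]
    for i in key_idx[1:]:
        edited[i] = edit_model.edit(video[i], reference=ref,
                                    mask=masks[i], prompt=prompt)

    # Sparse branch: edited keyframes; dense branch: the original video.
    return nova.synthesize(keyframes=edited, source_video=video)
```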
## Key Experimental Results
### Main Results
| Method | Params | Per-video Fine-tuning | SR↑ | TC↑ | FC↑ | BG-SSIM↑ | MS↑ | BC↑ |
|---|---|---|---|---|---|---|---|---|
| AnyV2V | 1.3B | ✗ | 0.75 | 0.918 | 0.840 | 0.858 | 0.973 | 0.939 |
| I2VEdit | 1.3B | ✓ | 0.83 | 0.931 | 0.846 | 0.900 | 0.991 | 0.941 |
| VACE (multi-frame) | 1.3B | ✗ | 0.90 | 0.928 | 0.840 | 0.913 | 0.989 | 0.940 |
| Senorita-2M | 5B | ✗ | 0.86 | 0.919 | 0.853 | 0.921 | 0.989 | 0.953 |
| NOVA | 1.3B | ✗ | 0.93 | 0.935 | 0.882 | 0.917 | 0.993 | 0.946 |
### Ablation Study
| Configuration | TC↑ | FC↑ | BG-SSIM↑ | Note |
|---|---|---|---|---|
| Full NOVA | 0.935 | 0.882 | 0.917 | Complete model |
| w/o Dense Branch | 0.920 | 0.841 | 0.807 | Background hallucination |
| w/o Consistency-Aware Keyframe Editing | 0.92 | 0.85 | 0.88 | Style inconsistency across independent edits |
| Blurred input to Dense Branch | 0.933 | 0.878 | 0.910 | Detail still recoverable |
### Key Findings
- NOVA achieves an editing success rate of 93%, surpassing LoRA-Edit (which requires per-video fine-tuning) by 13 percentage points.
- The dense branch is critical for background preservation: removing it causes BG-SSIM to drop from 0.917 to 0.807.
- Even when the dense-branch input is blurred, the model recovers backgrounds sharper than that input, indicating that the dense branch performs guided synthesis rather than simple copying.
- Performance is robust across keyframe intervals of 8–20, without overfitting to the training interval of 10.
- Replacing the editing model (FLUX → Qwen-Image-Edit) causes only minor performance changes, demonstrating the generality of the framework.
## Highlights & Insights
- Sparse/dense decoupling represents a key paradigm innovation: this is the first work to explicitly separate control and synthesis into independent pathways for video editing. This architectural idea is transferable to image editing, 3D editing, and beyond.
- Degradation simulation training makes ingenious use of unpaired data for self-supervision: by simulating editing artifacts and background mismatches, the model learns to repair them.
- Guided synthesis in the dense branch: experiments demonstrate that the branch performs generation with physical understanding rather than simple copying, an instructive finding for understanding the capabilities of diffusion models.
## Limitations & Future Work
- Performance is sensitive to the quality of edited keyframes, and current image editing models remain limited on complex edits.
- Training is conducted on only 5,000 videos, constraining scale.
- Text-driven global style transfer editing is not yet supported.
## Related Work & Insights
- vs. VACE: A unified framework but with control and synthesis coupled; NOVA's decoupling yields superior results.
- vs. I2VEdit/LoRA-Edit: Require per-video LoRA fine-tuning, which is not scalable; NOVA requires no fine-tuning.
- vs. Senorita-2M: Uses 5B parameters and large-scale paired data; NOVA with 1.3B parameters and no paired data surpasses it.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The sparse-control dense-synthesis paradigm is formally introduced for the first time; the degradation simulation training strategy is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple baselines, multiple metrics, user study, and ablation study.
- Writing Quality: ⭐⭐⭐⭐ Problem decomposition is clear; design motivations for the architecture are well justified.
- Value: ⭐⭐⭐⭐⭐ Provides a scalable, pair-free training framework for video editing.