VACE: All-in-One Video Creation and Editing¶
Conference: ICCV 2025 arXiv: 2503.07598 Code: Project Area: Video Generation / Video Editing Keywords: Unified Framework, DiT, Video Condition Unit, Controllable Generation, Video Editing
TL;DR¶
VACE is proposed as a unified framework for video creation and editing. It introduces a Video Condition Unit (VCU) that consolidates text, image, video, and mask inputs into a unified conditional representation. Combined with a Context Adapter that injects task concepts into a DiT model, VACE is the first single video DiT to simultaneously support reference-guided generation, video editing, mask-based editing, and their arbitrary combinations.
Background & Motivation¶
- Background: The image domain already has unified generation-editing frameworks such as ACE and OmniGen, but the video domain—due to its higher spatiotemporal consistency requirements—still relies predominantly on single-task, single-model approaches.
- Limitations of Prior Work: The large variety of video tasks (reference-guided generation, style transfer, inpainting, etc.) makes deploying a separate model per task costly, and chaining multiple tasks (e.g., long-video editing pipelines) is hard to realize with isolated models.
- Key Challenge: Unifying diverse video task input modalities while preserving spatiotemporal consistency.
- Goal: Build a single model that covers as many video generation and editing tasks as possible.
- Key Insight: Design a unified input interface (VCU) + a concept decoupling strategy + a plug-and-play Context Adapter.
- Core Idea: Represent all video task conditions as a triplet (text, frame sequence, mask sequence), and achieve multi-task unification via concept decoupling and adapter injection.
Method¶
Overall Architecture¶
Built upon a pretrained T2V DiT, the model takes a VCU triplet \([T; F; M]\) as input. \(F\) and \(M\) are decomposed via concept decoupling into reactive frames (to be modified) and inactive frames (to be preserved). Both are VAE-encoded and fused with noisy video tokens, with training performed either by full fine-tuning or via the Context Adapter.
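To make the overall flow concrete, here is a minimal toy sketch of the pipeline described above: the frames \(F\) are split by the mask \(M\) into reactive and inactive streams, each stream is encoded (a simple average-pooling stub stands in for the real VAE), and the resulting context tokens are fused with the noisy video tokens. All function names, shapes, and the pooling "encoder" are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vae_encode(frames: np.ndarray, stride: int = 8) -> np.ndarray:
    """Toy stand-in for the VAE encoder: spatial average pooling.
    frames: (t, h, w) grayscale video with h, w divisible by stride."""
    t, h, w = frames.shape
    return frames.reshape(t, h // stride, stride, w // stride, stride).mean(axis=(2, 4))

def vcu_to_tokens(F: np.ndarray, M: np.ndarray, noisy: np.ndarray) -> np.ndarray:
    """Decouple F via M, encode each stream, and concatenate the
    flattened context latents with the noisy video tokens."""
    F_c = F * M          # reactive frames: content to be generated/edited
    F_k = F * (1 - M)    # inactive frames: content to be preserved
    z_c = vae_encode(F_c).reshape(F.shape[0], -1)
    z_k = vae_encode(F_k).reshape(F.shape[0], -1)
    context = np.concatenate([z_c, z_k], axis=-1)     # per-frame context tokens
    return np.concatenate([noisy, context], axis=-1)  # fuse with noisy tokens

F = np.random.rand(4, 16, 16)                         # 4 frames of 16x16 pixels
M = (np.random.rand(4, 16, 16) > 0.5).astype(F.dtype)
noisy = np.random.randn(4, 8)                         # toy noisy latent tokens per frame
tokens = vcu_to_tokens(F, M, noisy)
print(tokens.shape)                                   # (4, 16): 8 noisy + 8 context dims
```

In the real model the fused tokens are then processed by the DiT; here the fusion step alone illustrates how one unified token sequence carries both the noise and the conditions.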
Key Designs¶
Design 1: Video Condition Unit (VCU)
- Function: Unifies four fundamental task types—T2V, R2V, V2V, and MV2V—and their combinations into a \((T, F, M)\) triplet.
- Mechanism: \(T\) denotes text; \(F\) is a frame sequence (reference images, control signals, edited video, or blank frames); \(M\) is a binary mask sequence (1 = to be generated, 0 = to be preserved). Different tasks are represented by different compositions of \(F\) and \(M\).
- Design Motivation: Eliminates the need for task-specific input interfaces and enables a high degree of combinatorial flexibility.
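The following sketch shows how the four fundamental task types could be expressed as different \((T, F, M)\) compositions. The exact frame/mask layouts (e.g., a reference image occupying the first slot with a zero mask) are plausible readings of the design, not the paper's canonical encoding.

```python
import numpy as np

T_FRAMES, H, W = 4, 8, 8
blank = np.zeros((T_FRAMES, H, W))  # placeholder "blank" frames
ones  = np.ones((T_FRAMES, H, W))
zeros = np.zeros((T_FRAMES, H, W))

def vcu(text, frames, masks):
    """A VCU is just the (T, F, M) triplet."""
    return {"T": text, "F": frames, "M": masks}

video   = np.random.rand(T_FRAMES, H, W)  # existing clip or control signal (e.g. depth)
ref     = np.random.rand(1, H, W)         # a single reference image
spatial = (np.random.rand(T_FRAMES, H, W) > 0.5).astype(float)

t2v  = vcu("a cat surfing", blank, ones)                        # generate everything
r2v  = vcu("a cat surfing", np.concatenate([ref, blank[1:]]),   # keep the reference (mask=0),
           np.concatenate([zeros[:1], ones[1:]]))               # generate the rest
v2v  = vcu("a cat surfing", video, ones)                        # regenerate under a control signal
mv2v = vcu("a cat surfing", video, spatial)                     # edit only the masked regions
```

The point of the interface is that combinations (e.g., a reference image plus a spatial mask over a source video) need no new API: they are just another \(F, M\) composition.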
Design 2: Concept Decoupling Strategy
- Function: Decomposes \(F\) via \(M\) into reactive frames (\(F_c = F \times M\)) and inactive frames (\(F_k = F \times (1-M)\)).
- Mechanism: Explicitly separates pixels to be edited from pixels to be preserved, enabling the model to distinguish between editing and reference roles. Each subset is encoded independently by the VAE.
- Design Motivation: Natural video frames and control signals (e.g., depth maps, pose sequences) have different distributions; mixing them hinders convergence.
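The decoupling itself is two elementwise products, and the split is lossless, as the small check below shows (toy shapes, not the model's real latent dimensions):

```python
import numpy as np

F = np.random.rand(2, 4, 4)                        # frames
M = (np.random.rand(2, 4, 4) > 0.5).astype(float)  # binary mask (1 = generate)

F_c = F * M        # reactive frames: content to be regenerated
F_k = F * (1 - M)  # inactive frames: content to be preserved

# The split is lossless: the two streams sum back to the original frames,
# but each can now be VAE-encoded along its own pathway.
assert np.allclose(F_c + F_k, F)
```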
Design 3: Context Adapter Tuning
- Function: An optional lightweight training strategy that freezes the DiT backbone and trains only the Context Embedder and Context Blocks.
- Mechanism: A subset of Transformer Blocks is copied from the DiT to form Context Blocks, which process context tokens and add them back to the main branch—analogous to Res-Tuning.
- Design Motivation: Avoids full fine-tuning, achieves faster convergence, and remains plug-and-play with respect to the base model.
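A minimal sketch of the adapter idea, with each transformer block reduced to a single linear map: selected backbone blocks are copied to form trainable Context Blocks, and their output is added back into the frozen main branch. The block choice (depths 0 and 2) and all shapes are arbitrary illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy token dimension

class Block:
    """Toy stand-in for one DiT transformer block: a single linear map."""
    def __init__(self, W):
        self.W = W
    def __call__(self, x):
        return x @ self.W

# Frozen backbone; a subset of its blocks is copied to form trainable Context Blocks.
backbone = [Block(rng.standard_normal((D, D)) * 0.1) for _ in range(4)]
context_blocks = {0: Block(backbone[0].W.copy()),  # copies at selected depths
                  2: Block(backbone[2].W.copy())}

def forward(x, c):
    """Main tokens x flow through the frozen backbone; context tokens c flow
    through the Context Blocks, whose output is added back (Res-Tuning style)."""
    for i, blk in enumerate(backbone):
        x = blk(x)
        if i in context_blocks:
            c = context_blocks[i](c)
            x = x + c  # additive injection into the main branch
    return x

x = rng.standard_normal((3, D))  # 3 video tokens
c = rng.standard_normal((3, D))  # 3 context tokens
out = forward(x, c)
print(out.shape)
```

Because only `context_blocks` would receive gradients, the base T2V model stays untouched and the adapter can be attached or removed freely.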
Loss & Training¶
Standard diffusion denoising loss is used. Training is conducted under either full fine-tuning or Context Adapter tuning, with random conditional dropout applied across different conditions to enable multi-task training.
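The random conditional dropout mentioned above can be sketched as independently dropping each condition during training, so the model sees every subset of \(\{T, F, M\}\) (similar in spirit to classifier-free-guidance dropout). The function name, dropout rate, and `None` placeholders are assumptions for illustration.

```python
import random

def sample_conditions(text, frames, masks, p_drop=0.3, rng=None):
    """Independently drop each condition with probability p_drop so the
    model learns to handle every subset of {text, frames, masks}."""
    rng = rng or random.Random()
    return {
        "T": text   if rng.random() > p_drop else "",    # "" -> unconditional text
        "F": frames if rng.random() > p_drop else None,  # None -> blank frames
        "M": masks  if rng.random() > p_drop else None,  # None -> all-ones mask
    }

batch = sample_conditions("a cat surfing", "frame_seq", "mask_seq",
                          p_drop=0.5, rng=random.Random(0))
print(batch)
```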
Key Experimental Results¶
Main Results¶
VACE-Benchmark: Automatic Evaluation Across 12 Tasks (Partial, Normalized Average Score)
| Task Type | Baseline | VACE (Ours) |
|---|---|---|
| I2V | CogVideoX-I2V: 73.66 | 74.38 |
| Inpaint | ProPainter: 70.15 | Competitive |
| Depth Control | Task-specific model | Comparable |
| Pose Control | Task-specific model | Comparable |
Ablation Study¶
| Configuration | Effect |
|---|---|
| Without concept decoupling | Confusion between editing and reference tasks |
| Full fine-tuning vs. Adapter | Full fine-tuning slightly better; Adapter converges faster |
| Without random condition dropout | Degraded task generalization |
Key Findings¶
- The unified model achieves performance comparable to task-specific models across all sub-tasks, validating the feasibility of the unified framework.
- Task combinations (e.g., reference-guided generation + inpainting) are natively supported by VACE but are not achievable with task-specific models.
- In human evaluations, VACE significantly outperforms most baselines on the temporal consistency dimension.
Highlights & Insights¶
- The VCU design is remarkably elegant, unifying the widest range of task types with a minimal formalism.
- Concept decoupling is the critical enabler—explicitly separating editing and reference information substantially improves convergence and output quality.
- VACE is the first all-in-one generation-and-editing model in the video domain, representing pioneering work in this direction.
Limitations & Future Work¶
- Temporal consistency in long-video scenarios still leaves room for improvement.
- Extending the range of supported control signals (e.g., depth, pose) requires additional training data.
- The scale of the user study is limited.
Related Work & Insights¶
- ACE and OmniGen achieve unified generation and editing in the image domain; VACE extends this paradigm to video.
- Key insight: The bottleneck for modality unification lies not in complex architectural design but in the elegance of the input interface.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ★★★★★ |
| Experimental Thoroughness | ★★★★☆ |
| Writing Quality | ★★★★☆ |
| Value | ★★★★★ |