MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Conference: ECCV 2024
arXiv: 2311.17338
Area: Video Generation

TL;DR

This paper proposes MagDiff, the first multi-alignment diffusion model that unifies video generation and editing. Through three mechanisms—subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment—MagDiff achieves high-quality video generation and editing simultaneously within a single, tuning-free framework.

Background & Motivation

Video generation and video editing are closely related but traditionally treated as separate tasks. Video generation creates videos from pure noise, while video editing requires maintaining consistency in unedited regions. Existing methods struggle with multiple alignment issues:

Subject Misalignment: Models solely relying on text prompts fail to precisely control visual details (e.g., generating "waving right hand" when "waving left hand" is requested).

Identity Misalignment: During text-guided editing, the identity of the subject and the background often change.

Motion Misalignment: When utilizing image prompts, fixed weights prevent motion from being controlled effectively by text.

The core reason is that text and images are heterogeneous modalities. Existing methods overlook the differences between homogeneous (image-image) and heterogeneous (text-image) alignments; simply assigning equal weights to both prompts fails to balance them.

Method

Overall Architecture

Based on the U-Net diffusion model, MagDiff introduces three alignment strategies:

1. Subject-Driven Alignment (SDA): unifies the two tasks.
2. Adaptive Prompts Alignment (APA): balances heterogeneous and homogeneous modalities.
3. High-Fidelity Alignment (HFA): enhances subject fidelity.

Key Designs

1. Subject-Driven Alignment (SDA)

Unlike VideoCrafter1, which conditions on whole images, MagDiff uses a segmentation algorithm to extract the main subject and uses it as the conditioning image. This simple yet critical modification enables:

- In generation tasks: the subject image provides appearance details, while the text controls the actions and scenes.
- In editing tasks: the unedited regions are preserved as the "subject", while the text guides the edited regions.

Consequently, both tasks are unified within the same model.
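As a rough illustration of the subject-extraction step, the sketch below applies a binary segmentation mask to an image so that only the subject remains. The mask is assumed to come from an external segmentation model. The function name `extract_subject`, the fill value, and the toy arrays are illustrative and not taken from the paper.

```python
import numpy as np

def extract_subject(image, mask, fill=0):
    """Keep pixels where `mask` is True and replace the rest with `fill`.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Both are assumed inputs; the paper's segmentation algorithm is not reproduced here.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must share height and width")
    # Broadcast the 2-D mask over the channel axis.
    return np.where(mask[..., None], image, np.uint8(fill))

# Toy example: a 4x4 image whose central 2x2 block is the "subject".
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
subject = extract_subject(image, mask)
```

For editing, the same operation would be applied with a mask covering the regions that should stay unchanged, so that those regions play the role of the subject.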

2. Adaptive Prompts Alignment (APA)

In the cross-attention blocks, learnable parameters \(\alpha_1\) and \(\alpha_2\) are designed to dynamically adjust the control strength of text and image prompts:

\[\text{Attention} = \alpha_1 \cdot \text{Softmax}\left(\frac{QK_1^\top}{\sqrt{d}}\right)V_1 + \alpha_2 \cdot \text{Softmax}\left(\frac{QK_2^\top}{\sqrt{d}}\right)V_2\]

Because the query \(Q\) is shared while the text and image prompts each have their own keys and values, the model can learn the weight ratio between the two modalities instead of relying on a fixed one.
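The attention formula above can be sketched in NumPy as two cross-attention branches that share one query and are summed with scalar weights. This is a minimal single-head sketch, not the authors' implementation: the weights are plain floats rather than learned parameters, branch 1 is assumed to be the text prompt and branch 2 the image prompt, and the token counts and dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def adaptive_prompt_attention(q, k_text, v_text, k_img, v_img, alpha_1, alpha_2):
    """Weighted sum of two cross-attention branches that share the same query."""
    return alpha_1 * attend(q, k_text, v_text) + alpha_2 * attend(q, k_img, v_img)

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(5, d))        # 5 latent (query) tokens
k_text = rng.normal(size=(7, d))   # 7 text tokens (assumed to be branch 1)
v_text = rng.normal(size=(7, d))
k_img = rng.normal(size=(3, d))    # 3 image tokens (assumed to be branch 2)
v_img = rng.normal(size=(3, d))

out = adaptive_prompt_attention(q, k_text, v_text, k_img, v_img,
                                alpha_1=1.0, alpha_2=0.5)
```

Setting `alpha_2` to zero reduces the output to text-only cross-attention, which makes the role of the two weights easy to check in isolation.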

3. High-Fidelity Alignment (HFA)

CLIP encoders preserve only high-level semantics, losing fine visual details. HFA addresses this by utilizing a VAE encoder to construct a pyramid structure:

- The subject image is resized to three scales: 384×384, 320×320, and 256×256.
- Multi-scale latent features are acquired through VAE encoding.
- After alignment via convolutional layers, these features are concatenated with the noisy latent variables to inject pixel-level details.
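The shape bookkeeping of the pyramid can be sketched as follows. Several things here are assumptions rather than details from the text: the VAE is taken to downsample each side by 8 and to produce 4 latent channels, the noisy latent is taken to be 32×32, random arrays stand in for real VAE outputs, and a nearest-neighbour resize stands in for the learned convolutional alignment. The frame dimension of a video latent is omitted.

```python
import numpy as np

DOWNSAMPLE = 8            # assumption: the VAE reduces each spatial side by 8x
LATENT_CHANNELS = 4       # assumption: the VAE latent has 4 channels
SCALES = (384, 320, 256)  # pyramid resolutions named in the text

def placeholder_vae_encode(side, rng):
    """Stand-in for VAE encoding of a side x side image: a random latent of the right shape."""
    s = side // DOWNSAMPLE
    return rng.normal(size=(LATENT_CHANNELS, s, s))

def resize_nearest(feat, size):
    """Nearest-neighbour resize of a (C, H, W) map; stands in for the convolutional alignment."""
    _, h, w = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[:, rows[:, None], cols[None, :]]

rng = np.random.default_rng(0)
noisy_latent = rng.normal(size=(LATENT_CHANNELS, 32, 32))  # assumed latent of one frame

pyramid = [placeholder_vae_encode(s, rng) for s in SCALES]       # 48x48, 40x40, 32x32
aligned = [resize_nearest(p, noisy_latent.shape[-1]) for p in pyramid]
fused = np.concatenate([noisy_latent, *aligned], axis=0)         # channel-wise concatenation
```

Under these assumptions the fused tensor has 16 channels (4 from the noisy latent plus 4 from each pyramid level), which is the input the denoising network would need to accept.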

Loss & Training

Standard conditional diffusion denoising loss:

\[\mathcal{L} = \mathbb{E}_{y \sim \mathcal{N}(0, I)}\left[\|y - f_\theta(x_t; c_s, c_t, t)\|_2^2\right]\]

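where \(y\) is the sampled Gaussian noise, \(x_t\) is the noisy latent at timestep \(t\), \(f_\theta\) is the denoising network that predicts the noise, \(c_s\) is the subject image prompt, and \(c_t\) is the text prompt.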
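A one-sample estimate of this loss can be sketched as below. The forward noising step and the linear noise schedule follow the common DDPM formulation and are assumptions, since the text does not spell them out. `zero_model` is a hypothetical stand-in for \(f_\theta\), and the conditioning inputs are left as `None`.

```python
import numpy as np

def diffusion_loss(x0, c_s, c_t, t, alpha_bar, model, rng):
    """One-sample estimate of ||y - f_theta(x_t; c_s, c_t, t)||_2^2."""
    y = rng.normal(size=x0.shape)  # y ~ N(0, I)
    # Assumed DDPM-style forward process: mix the clean latent with noise.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * y
    pred = model(x_t, c_s, c_t, t)  # the network predicts the noise
    return float(np.sum((y - pred) ** 2))

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 32, 32))      # toy clean latent

# Hypothetical stand-in for f_theta that always predicts zero noise.
zero_model = lambda x_t, c_s, c_t, t: np.zeros_like(x_t)

loss = diffusion_loss(x0, c_s=None, c_t=None, t=500,
                      alpha_bar=alpha_bar, model=zero_model, rng=rng)
```

In training, the expectation is approximated by averaging this quantity over mini-batches with randomly drawn timesteps and noise.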

Key Experimental Results

Main Results

Video Generation (UCF-101 & MSR-VTT)

| Method | Input Type | Training Data Volume | IS ↑ | FVD ↓ (UCF-101) | FVD ↓ (MSR-VTT) |
| --- | --- | --- | --- | --- | --- |
| Make-A-Video | text | 20M | 33.00 | 367.23 | - |
| PYoCo | text | 22.5M | 47.76 | 355.19 | - |
| VideoComposer | text&image | 10.3M | - | - | 580 |
| VideoCrafter1 | text&image | 10.3M | 44.53 | 415.87 | 465 |
| MagDiff | text&image | 5.3M+76K | 48.57 | 339.62 | 245 |

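Among the listed methods, MagDiff achieves the highest IS (48.57) and the lowest FVD on both UCF-101 (339.62) and MSR-VTT (245), with a reported training set of 5.3M plus 76K fine-tuning videos.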
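To put the FVD gap in relative terms, the snippet below computes the percentage reduction against VideoCrafter1, the other text&image baseline with both FVD numbers reported. The helper name `relative_reduction` is illustrative, and the inputs are copied from the table above.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

ucf = relative_reduction(415.87, 339.62)  # VideoCrafter1 vs MagDiff, UCF-101 FVD
msr = relative_reduction(465.0, 245.0)    # VideoCrafter1 vs MagDiff, MSR-VTT FVD
print(f"UCF-101: {ucf:.1f}%  MSR-VTT: {msr:.1f}%")
# prints "UCF-101: 18.3%  MSR-VTT: 47.3%"
```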

Video Editing (DAVIS)

| Method | Inference Mode | Textual-align | Frame-consistency |
| --- | --- | --- | --- |
| Tune-A-Video | Fine-tuning | 28.33 | 90.45 |
| FateZero | Fine-tuning | 23.81 | 92.92 |
| Framewise IP2P | Tuning-free | 25.11 | 86.76 |
| MagDiff | Tuning-free | 27.65 | 90.86 |

Ablation Study

| Component Configuration | IS ↑ | FVD ↓ | DINO ↑ | Textual-align ↑ | Frame-consist ↑ |
| --- | --- | --- | --- | --- | --- |
| Base (VidRD, Text-only) | 42.85 | 380.24 | 44.5 | 24.8 | 89.8 |
| + SDA | 45.12 | 363.45 | 47.2 | 24.9 | 89.9 |
| + SDA + APA | 46.89 | 349.18 | 49.1 | 25.2 | 90.0 |
| + SDA + APA + HFA | 48.57 | 339.62 | 50.8 | 25.4 | 90.2 |

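Adding the three components one at a time improves every metric at each step. From the text-only base to the full model, IS rises from 42.85 to 48.57, FVD falls from 380.24 to 339.62, and DINO rises from 44.5 to 50.8, while Textual-align and Frame-consist move only slightly (24.8 to 25.4 and 89.8 to 90.2). SDA serves as the foundation for unifying the two tasks, APA improves control precision, and HFA enhances fidelity.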
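The per-step contribution of each component can be read off by differencing consecutive rows of the ablation table. The snippet below does this for IS and FVD, with values copied from the table; the variable names are illustrative.

```python
rows = [
    ("Base", 42.85, 380.24),
    ("+ SDA", 45.12, 363.45),
    ("+ SDA + APA", 46.89, 349.18),
    ("+ SDA + APA + HFA", 48.57, 339.62),
]

# Difference each row against the previous one: (name, delta IS, delta FVD).
deltas = [
    (cur[0], round(cur[1] - prev[1], 2), round(cur[2] - prev[2], 2))
    for prev, cur in zip(rows, rows[1:])
]

for name, d_is, d_fvd in deltas:
    print(f"{name}: IS {d_is:+.2f}, FVD {d_fvd:+.2f}")
```

By this measure SDA gives the largest single step (IS +2.27, FVD -16.79), followed by APA (IS +1.77, FVD -14.27) and HFA (IS +1.68, FVD -9.56).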

Key Findings

In human evaluations, MagDiff scores 4.4/5 on subject fidelity and 4.1/5 on text alignment, compared with 3.2 and 2.8, respectively, for VideoCrafter1.
Learnable \(\alpha_1\)/\(\alpha_2\) adaptive adjustments outperform fixed equal-weight attention fusion.
Subject-driven alignment (removing the background to use only the subject) is key to unifying generation and editing.

Highlights & Insights

Simple and elegant design of the unified framework: By implementing a simple change of "using only the subject instead of the complete image", generation and editing are unified. The underlying insight is that subject-driven segmentation naturally distinguishes editable from uneditable areas.
Adaptive weights over fixed weights: Explicitly acknowledging the differences between homogeneous and heterogeneous modalities, the model utilizes learnable parameters to decide the optimal fusion ratio autonomously.
Tuning-free inference: Compared to methods like Tune-A-Video that require per-video fine-tuning, MagDiff can be used directly without fine-tuning during inference, making it highly practical.
Minimal training data: Reaching the best reported generation scores with only 76K fine-tuning videos suggests that the modules are data-efficient.

Limitations & Future Work

Only 16-frame generation is used, and the consistency of longer videos remains unverified.
Segmentation quality significantly impacts subject-driven alignment, but degradation when segmentation fails is not discussed.
The baselines compared in video editing are somewhat outdated (FateZero, Tune-A-Video), lacking comparisons with more recent methods.
Complex multi-subject scenarios and occlusions are not analyzed in depth.

Rating

Novelty: ⭐⭐⭐⭐ — The first multi-alignment diffusion framework to unify video generation and editing.
Technical Depth: ⭐⭐⭐⭐ — The design of the three alignment strategies is rational, and the APA module is inspiring.
Experimental Thoroughness: ⭐⭐⭐⭐ — Four benchmark datasets + human evaluation + comprehensive ablation.
Writing Quality: ⭐⭐⭐⭐ — The problem analysis is clear, and the illustrations are rich.