VidTwin: Video VAE with Decoupled Structure and Dynamics

Conference: CVPR 2025
arXiv: 2412.17726
Code: Project Page
Area: Video Generation
Keywords: Video Autoencoder, Latent Space Decoupling, Structure-Dynamics Disentanglement, High Compression Ratio, Q-Former

TL;DR

VidTwin decouples a video into two latent spaces: a Structure Latent (global content and coarse motion) and a Dynamics Latent (fine-grained details and fast motion). It reconstructs MCL-JCV videos at 28.14 PSNR with a compression ratio of only 0.20% (lower is more compact).

Background & Motivation

Video autoencoders (Video AEs) are a core component of video generation pipelines: they encode videos into a compact latent space, which reduces the modeling complexity for diffusion models. Existing methods follow one of two design philosophies, each with its own limitation:

Uniform Representation methods (e.g., MAGVIT-v2, CV-VAE) represent each frame as a fixed-size latent vector, ignoring inter-frame redundancy.
Content-Motion Decoupling methods (e.g., CMD) oversimplify the dynamic characteristics of videos, leading to blurry generation results.

Key Insight: Video information can be decoupled more finely into two complementary levels. The Structure Latent captures the main semantic content and low-frequency motion trends (such as which objects are present and slow translations), while the Dynamics Latent captures high-frequency details and fast motion (such as rotation and changes in color and texture). This decoupling allows stronger compression (a lower compression ratio) while maintaining reconstruction quality.

Method

Overall Architecture

VidTwin uses a Spatial-Temporal Transformer (768 dimensions, 16 layers each for the encoder and decoder, ~300M parameters) as the backbone. The encoder output \(z \in \mathbb{R}^{c \times f \times h \times w}\) is fed to two branches: \(\mathcal{F}_S\) extracts the Structure Latent \(z_S\), and \(\mathcal{F}_D\) extracts the Dynamics Latent \(z_D\). During decoding, the two latents are brought to a common shape, added element-wise, and passed to the decoder. The model follows the VAE paradigm, with KL regularization to keep the latent space smooth.
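The "align, then add" step at the decoder input can be sketched as follows. This summary does not say how the alignment is done, so everything here is an assumption for illustration: the helper names `align_structure`, `align_dynamics`, and `fuse` are hypothetical, nearest-neighbour repetition stands in for whatever upsampling the model uses, and both latents are assumed to have already been projected to the same channel count.

```python
import numpy as np

def align_structure(z_s, f, h, w):
    """Upsample (n_q, c, h_s, w_s) to (f, c, h, w) by nearest-neighbour repetition.

    Assumes f, h, w are integer multiples of n_q, h_s, w_s (illustrative only).
    """
    n_q, c, h_s, w_s = z_s.shape
    z = np.repeat(z_s, f // n_q, axis=0)
    z = np.repeat(z, h // h_s, axis=2)
    return np.repeat(z, w // w_s, axis=3)

def align_dynamics(z_d, h, w):
    """Broadcast (f, c, w + h) row/column profiles back to (f, c, h, w)."""
    assert z_d.shape[-1] == w + h
    per_column = z_d[:, :, :w]  # one value per column, shape (f, c, w)
    per_row = z_d[:, :, w:]     # one value per row, shape (f, c, h)
    return per_row[:, :, :, None] + per_column[:, :, None, :]

def fuse(z_s, z_d, f, h, w):
    """Element-wise sum of the two aligned latents (the decoder input)."""
    return align_structure(z_s, f, h, w) + align_dynamics(z_d, h, w)

z_s = np.ones((2, 3, 2, 2))    # toy Structure Latent
z_d = np.zeros((4, 3, 4 + 4))  # toy Dynamics Latent
fused = fuse(z_s, z_d, f=4, h=4, w=4)
print(fused.shape)  # (4, 3, 4, 4)
```

The point of the sketch is only the shape bookkeeping: two latents with very different layouts end up as one tensor of the shape the decoder expects.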

Key Design 1: Structure Latent Extraction (Q-Former + Spatial Downsampling)

Function: Extract global content and low-frequency motion trends.

Mechanism: Utilize a Q-Former to extract representative features along the temporal dimension. The spatial dimensions of the encoder output \(z\) are reshaped into the batch dimension to obtain \((hw, f, c)\). Then, \(n_q \leq f\) learnable queries are used to dynamically select \(n_q\) representative features from \(f\) frames via cross-attention. Subsequently, spatial downsampling and channel dimension reduction are performed via convolutional layers to obtain \(z_S \in \mathbb{R}^{n_q \times d_S \times h_S \times w_S}\).

Design Motivation: By merging the spatial dimensions into the batch size, the Q-Former is forced to learn generic temporal motion trends independently of spatial locations. Spatial downsampling removes redundant details, keeping only the major object information. The cross-attention mechanism of the Q-Former is naturally suited for extracting the most representative frame-level semantics from long sequences.
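A minimal sketch of the temporal query step, assuming a single attention head and no key/value projections, feed-forward layers, or the subsequent convolutional downsampling (all of which a real Q-Former block would have). The function name `temporal_query_pool` is hypothetical; the part that mirrors the description above is folding the spatial positions into the batch axis before cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_pool(z, queries):
    """Cross-attend n_q queries over the f frames at every spatial position.

    z:       (c, f, h, w) encoder output
    queries: (n_q, c) learnable queries (random in this sketch)
    returns: (n_q, c, h, w)
    """
    c, f, h, w = z.shape
    n_q = queries.shape[0]
    # Fold spatial positions into the batch axis: (h*w, f, c).
    tokens = z.transpose(2, 3, 1, 0).reshape(h * w, f, c)
    # Scaled dot-product scores between queries and frames: (h*w, n_q, f).
    scores = np.einsum("qc,bfc->bqf", queries, tokens) / np.sqrt(c)
    attn = softmax(scores, axis=-1)
    # Each query output is a weighted average of frame features: (h*w, n_q, c).
    pooled = np.einsum("bqf,bfc->bqc", attn, tokens)
    return pooled.reshape(h, w, n_q, c).transpose(2, 3, 0, 1)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 6, 4, 4))    # c=8, f=6, h=w=4
queries = rng.normal(size=(3, 8))    # n_q=3 <= f
z_pooled = temporal_query_pool(z, queries)
print(z_pooled.shape)  # (3, 8, 4, 4)
```

Because the same queries are applied at every spatial position, the attention weights can only depend on how features change over time at that position, which is the "independent of spatial location" property the design motivation refers to.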

Key Design 2: Dynamics Latent Extraction (Spatial Downsampling + Average Pooling)

Function: Capture fast motion and local details.

Mechanism: First, spatial downsampling is performed via convolutional layers to obtain an intermediate result \(z_D'\). Average pooling is then applied separately along the height and width dimensions, and the two results are concatenated:

\[z_D = \mathcal{G}([\text{avg}_h(z_D'); \text{avg}_w(z_D')]) \in \mathbb{R}^{f \times d_D \times (w_D + h_D)}\]

The dimensionality is reduced from \(\mathcal{O}(w_D \cdot h_D)\) to \(\mathcal{O}(w_D + h_D)\), drastically compressing the representation while preserving the motion information of each frame.

Design Motivation: Fast motion information is inherently low-dimensional and distributed across each frame. Average pooling along spatial dimensions (instead of Q-Former) avoids destroying spatial consistency. Independent average pooling along height and width preserves row/column-level dynamic patterns.
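The pooling-and-concatenation step can be sketched in a few lines. This assumes \(z_D'\) is stored as an array of shape \((f, d_D, h_D, w_D)\) and treats \(\mathcal{G}\) as the identity, since this summary does not say what \(\mathcal{G}\) is; the function name `dynamics_profiles` is hypothetical.

```python
import numpy as np

def dynamics_profiles(z_d_prime):
    """Reduce (f, d, h, w) to (f, d, w + h) with two directional averages."""
    avg_h = z_d_prime.mean(axis=2)  # average over height, shape (f, d, w)
    avg_w = z_d_prime.mean(axis=3)  # average over width,  shape (f, d, h)
    return np.concatenate([avg_h, avg_w], axis=-1)

# Toy input: 2 frames, 1 channel, 2 rows, 3 columns.
z_d_prime = np.arange(12, dtype=float).reshape(2, 1, 2, 3)
z_d = dynamics_profiles(z_d_prime)
print(z_d.shape)   # (2, 1, 5)
print(z_d[0, 0])   # [1.5 2.5 3.5 1.  4. ]
```

For the first frame, `[[0, 1, 2], [3, 4, 5]]`, the three column means are followed by the two row means, so each frame keeps \(w_D + h_D = 5\) values instead of \(w_D \cdot h_D = 6\). The saving grows quickly with spatial size.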

Key Design 3: Latent Concatenation for Diffusion Model Adaptation

Function: Adapt two latents of different shapes into a unified training target for diffusion models.

Mechanism: \(z_S\) (which has a video-like shape) and \(z_D\) (processed into a single-frame video shape by adding dummy dimensions) are respectively patchified into 3D patches, normalized, and concatenated along the sequence length dimension to form the training target for the diffusion model.

Design Motivation: Different latent spaces have different dimensions and physical meanings. Unifying them into token sequences via patchification allows standard DiT architectures to process them directly.
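A rough sketch of how the two latents might be turned into one token sequence. Several details are assumptions made for illustration rather than facts from the paper: the dummy-dimension layout used for \(z_D\), the patch sizes (picked so both token widths match without a learned projection), and per-latent standardization as the normalization. All function names are hypothetical.

```python
import numpy as np

def patchify_3d(x, pt, ph, pw):
    """Split (t, c, h, w) into non-overlapping 3D patches, one token per patch."""
    t, c, h, w = x.shape
    x = x.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)  # (T, H, W, pt, ph, pw, c)
    return x.reshape(-1, pt * ph * pw * c)

def standardize(tokens, eps=1e-6):
    """Zero-mean, unit-variance scaling (assumed form of the normalization)."""
    return (tokens - tokens.mean()) / (tokens.std() + eps)

def build_diffusion_target(z_s, z_d, patch_s=(1, 2, 2), patch_d=(1, 1, 2)):
    """z_s: (n_q, d_S, h_S, w_S); z_d: (f, d_D, w_D + h_D)."""
    # Treat z_d as a single-frame "video": (1, d_D, f, w_D + h_D).
    z_d_video = z_d.transpose(1, 0, 2)[None]
    tokens_s = standardize(patchify_3d(z_s, *patch_s))
    tokens_d = standardize(patchify_3d(z_d_video, *patch_d))
    assert tokens_s.shape[1] == tokens_d.shape[1], "token widths must match"
    # Concatenate along the sequence-length axis.
    return np.concatenate([tokens_s, tokens_d], axis=0)

rng = np.random.default_rng(0)
z_s = rng.normal(size=(4, 4, 4, 4))
z_d = rng.normal(loc=3.0, scale=5.0, size=(4, 8, 4 + 4))
target = build_diffusion_target(z_s, z_d)
print(target.shape)  # (32, 16)
```

In a real DiT, a learned patch embedding would typically map each latent's patches to the model width, so the matching-width constraint in this sketch would not be needed.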

Loss & Training

\[\mathcal{L} = \mathcal{L}_{rec} + \lambda_p \mathcal{L}_p + \lambda_{GAN} \mathcal{L}_{GAN} + \lambda_{KL} \mathcal{L}_{KL}\]

This includes reconstruction loss, perceptual loss, adversarial loss, and KL divergence regularization.
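As a small illustration of the objective, the sketch below combines the four terms with scalar weights and includes the standard closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The weight values in the usage lines are placeholders, not the paper's settings, and the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed per sample and batch-averaged."""
    reduce_axes = tuple(range(1, mu.ndim))
    per_sample = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=reduce_axes)
    return per_sample.mean()

def total_loss(l_rec, l_p, l_gan, l_kl, lambda_p, lambda_gan, lambda_kl):
    """L = L_rec + lambda_p * L_p + lambda_GAN * L_GAN + lambda_KL * L_KL."""
    return l_rec + lambda_p * l_p + lambda_gan * l_gan + lambda_kl * l_kl

mu = np.ones((2, 3))       # toy posterior means
logvar = np.zeros((2, 3))  # unit variance
l_kl = kl_to_standard_normal(mu, logvar)
print(l_kl)  # 1.5

# Placeholder weights, chosen only to show the call.
loss = total_loss(l_rec=1.0, l_p=2.0, l_gan=3.0, l_kl=l_kl,
                  lambda_p=0.5, lambda_gan=0.1, lambda_kl=0.01)
```

With unit variance the KL term reduces to half the squared norm of the mean, which gives 1.5 for a three-dimensional vector of ones.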

Key Experimental Results

Main Results: MCL-JCV Video Reconstruction

Method	Compression Ratio↓	PSNR↑	LPIPS↓	SSIM↑	FVD↓
iVideoGPT	1.50%	19.35	0.4677	0.5752	1693
MAGVIT-v2	0.65%	24.35	0.3347	0.6877	654
CMD	6.85%	27.33	0.2732	0.7746	468
EMU-3	0.53%	25.36	0.2543	0.7260	354
CV-VAE	0.53%	28.06	0.2436	0.7546	402
VidTwin	0.20%	28.14	0.2414	0.8044	389
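To make the Compression Ratio column concrete, the sketch below assumes it is the number of latent values divided by the number of input video values. The latent and video shapes are illustrative assumptions, not taken from this summary (only the \(224 \times 224\) resolution is mentioned in it). They were picked so that the arithmetic lands near the reported 0.20% and should be checked against the paper.

```python
from math import prod

def compression_ratio(latent_shapes, video_shape):
    """Total latent elements divided by total video elements."""
    latent_elements = sum(prod(shape) for shape in latent_shapes)
    return latent_elements / prod(video_shape)

# Assumed shapes, for illustration only.
video = (16, 3, 224, 224)          # frames, RGB channels, height, width
structure_latent = (16, 4, 7, 7)   # n_q, d_S, h_S, w_S
dynamics_latent = (16, 8, 7 + 7)   # f, d_D, w_D + h_D

ratio = compression_ratio([structure_latent, dynamics_latent], video)
print(f"{ratio:.2%}")  # 0.20%
```

Under these assumptions the two latents hold 3136 + 1792 = 4928 values against 2,408,448 values in the input clip.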

Downstream Generation: UCF-101 Class-Conditional Video Generation

Method	FVD↓
TATS	332
Video-LaViT	275
VidTwin	193
MAGVIT-v2	58

Key Findings

0.20% compression ratio: roughly 2.6x to 34x lower than the baselines in the table (0.53% to 6.85%), while achieving the best PSNR, LPIPS, and SSIM. On FVD (389), VidTwin is second to EMU-3 (354).
Reduces downstream diffusion model FLOPs by 4-8x and training GPU memory by 2-3x.
Cross-replacement experiments (Structure of Video A + Dynamics of Video B) validate the interpretability of the decoupling: the generated video inherits the primary object of A and the color/fast motion of B.
Decoding only using Structure Latent reconstructs the main semantic content but lacks color and fast motion; decoding only using Dynamics Latent reconstructs detailed motion but lacks the main object.
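The compression comparison in the first finding can be recomputed directly from the MCL-JCV table. The short check below only uses numbers that appear in that table.

```python
# Compression ratios (in percent) from the MCL-JCV table.
baseline_ratios = {
    "iVideoGPT": 1.50,
    "MAGVIT-v2": 0.65,
    "CMD": 6.85,
    "EMU-3": 0.53,
    "CV-VAE": 0.53,
}
vidtwin_ratio = 0.20

# How many times larger each baseline's latent is relative to VidTwin's.
factors = {name: r / vidtwin_ratio for name, r in baseline_ratios.items()}
for name, factor in sorted(factors.items(), key=lambda item: item[1]):
    print(f"{name}: {factor:.2f}x")
```

The smallest factor is 2.65x (EMU-3 and CV-VAE) and the largest is 34.25x (CMD).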

Highlights & Insights

Extremely Compact Latent: The 0.20% compression ratio is roughly 2.6-34x lower than that of the competing methods in the reconstruction table, directly easing the computational burden of downstream models.
Interpretable Decoupling: The separation of Structure and Dynamics has clear physical interpretations and is visually validated.
Novel Use of Q-Former: VidTwin ports the Q-Former from multimodal alignment to temporal video compression, applying cross-attention along the temporal dimension to extract representative frame-level features.

Limitations & Future Work

The FVD of 193 on UCF-101 generation does not reach MAGVIT-v2's 58, indicating that while the latent space is compact, the design of the generative model can be further optimized.
The decoupling is not perfect—there is information loss when decoding with only one latent variable.
Currently only verified at \(224 \times 224\) resolution, and the effectiveness at higher resolutions remains to be explored.
Future work can explore more fine-grained latent interaction mechanisms and conditional generation control.

Related Work

CMD: A pioneer in content-motion decoupling, but representing content using frame averages is too coarse.
CV-VAE: A uniform-sized video VAE baseline with a 0.53% compression ratio.
BLIP-2 Q-Former: A multimodal information extraction architecture, which VidTwin innovatively adopts for temporal feature compression.

Rating

⭐⭐⭐⭐: The decoupling design is novel and well motivated, and the reconstruction results at a 0.20% compression ratio are impressive. Using a Q-Former for temporal compression is a creative choice, and the cross-replacement experiments make the interpretability claim more convincing. However, the gap to MAGVIT-v2 in downstream generation quality suggests there is still room for optimization in the latent space design.