From Image to Video: An Empirical Study of Diffusion Representations¶
Conference: ICCV 2025 | arXiv: 2502.07001 | Code: No public code | Area: 3D Vision | Keywords: Diffusion Models, Video Representation Learning, Image vs. Video Diffusion, Motion Understanding, WALT
TL;DR¶
This paper systematically compares diffusion models trained under image vs. video generation objectives using the same architecture (WALT) on a suite of downstream visual understanding tasks. Video diffusion models consistently outperform their image counterparts across all tasks, with particularly large gains on tasks requiring motion and 3D spatial understanding (point tracking +68%, camera pose +60%).
Background & Motivation¶
Diffusion models have achieved remarkable success in image and video generation, and their internal representations have demonstrated strong potential for image understanding tasks such as segmentation, depth estimation, and keypoint matching. However, several critical gaps remain:
Video diffusion representations are nearly unexplored: Despite the high-quality video content generated by video diffusion models, little is known about the representations such models learn, in contrast to the extensive research on image diffusion representations.
Lack of fair comparison: Comparing the representation quality of image vs. video diffusion models requires architecturally identical models trained under different objectives. Existing models (e.g., Stable Diffusion 2.1 vs. SVD) differ substantially in parameter count (865M vs. 1.5B), precluding meaningful conclusions.
Unknown role of temporal information: How does the temporal understanding introduced by video training affect representation quality? Does motion information enhance semantic understanding, 3D perception, or object tracking?
Core Insight: The hybrid architecture of WALT provides an ideal platform for a fair comparison — models with identical parameter counts can be trained for image or video generation, differing only in whether the attention window spans the temporal dimension. This enables the first direct controlled comparison.
Method¶
WALT Model Architecture¶
WALT (Windowed-Attention Latent Transformer) is a Transformer-based latent diffusion model with the following key properties:
- Shared tokenizer: Uses a causal 3D CNN encoder from MAGVIT-v2 to compress images and videos into a shared latent space. Videos are encoded as \(z \in \mathbb{R}^{(1+m) \times h \times w \times c}\), where the first frame is encoded independently and the subsequent \(m=4\) latents represent 16 frames.
- Windowed attention: Alternates between spatio-temporal window blocks and spatial-only window blocks.
- Image mode compatibility: During image generation, each latent in the spatio-temporal blocks attends only to itself (equivalent to an identity mask).
I-WALT vs. V-WALT¶
Two models are constructed for a fair comparison:
- V-WALT (video model): standard WALT, in which the spatio-temporal window blocks perform both spatial and temporal attention.
- I-WALT (image model): the spatio-temporal window attention blocks are replaced with spatial-only window attention blocks (identical parameter count).
V-WALT is trained on joint image and video datasets; I-WALT is trained on the same datasets, but frames are randomly sampled from videos and treated as independent images. Both models share exactly the same architecture and parameter count.
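To make the single controlled variable concrete, the sketch below contrasts a spatial-only window attention block (I-WALT-style) with a spatio-temporal one (V-WALT-style). This is a minimal illustration, not the authors' implementation: the tensor layout, window sizes, and module names are assumptions, and the real WALT alternates the two block types with its own window configuration.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of the latent grid.

    Minimal sketch: the only difference between the "image" (I-WALT-style) and
    "video" (V-WALT-style) variants is whether the window spans the temporal
    axis. Window sizes and tensor layout are illustrative assumptions.
    """

    def __init__(self, dim, num_heads=8, window_hw=4, window_t=1):
        super().__init__()
        self.window_hw = window_hw  # spatial window size
        self.window_t = window_t    # temporal window size (1 = spatial-only)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z):
        # z: (B, T, H, W, C) latent tokens; T = 1 + m latent frames for a video.
        B, T, H, W, C = z.shape
        wt, ws = self.window_t, self.window_hw
        # Partition the grid into non-overlapping (wt, ws, ws) windows.
        z = z.view(B, T // wt, wt, H // ws, ws, W // ws, ws, C)
        z = z.permute(0, 1, 3, 5, 2, 4, 6, 7)          # (B, nT, nH, nW, wt, ws, ws, C)
        windows = z.reshape(-1, wt * ws * ws, C)        # one attention sequence per window
        out, _ = self.attn(windows, windows, windows)   # attention never crosses a window border
        out = out.reshape(B, T // wt, H // ws, W // ws, wt, ws, ws, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out

# Both variants have identical parameter counts; only the attention window differs.
spatial_only = WindowAttention(dim=256, window_t=1)     # I-WALT-style: no temporal mixing
spatio_temporal = WindowAttention(dim=256, window_t=5)  # V-WALT-style: window spans all 1 + m frames

z = torch.randn(2, 5, 16, 16, 256)  # (B, 1 + m, h, w, c) with m = 4
assert spatial_only(z).shape == spatio_temporal(z).shape == z.shape
```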
Feature Extraction and Probing Framework¶
WALT is used as a frozen backbone, and lightweight readout heads are trained to evaluate downstream tasks:
- Feature extraction: Noise is added to the input at timestep \(t\), a single forward pass is performed (no full denoising), and activations from intermediate Transformer blocks are extracted (see the sketch after this list).
- Readout architecture: Task-specific readout heads are selected:
- Classification tasks (ImageNet, K400, etc.): attentive readout (learnable query tokens + cross-attention + MLP)
- Depth estimation: Scene Representation Transformer decoder (cross-attention + per-pixel MLP)
- Point/box tracking: MooG-style recurrent readout (predict-correct mechanism)
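A minimal sketch of this probing recipe, assuming a generic frozen latent-diffusion backbone: `backbone`, its `blocks` list, the `timestep` keyword, and `alphas_cumprod` are hypothetical stand-ins for whatever model and DDPM-style noise schedule is used, not the paper's API; the attentive readout follows the "learnable query tokens + cross-attention + MLP" description above.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_features(backbone, z0, t, block_idx, alphas_cumprod):
    """Noise a clean latent to timestep t, run ONE forward pass (no sampling
    loop), and return the activations of an intermediate Transformer block.

    `backbone`, `backbone.blocks`, the `timestep` kwarg, and `alphas_cumprod`
    are hypothetical stand-ins for a frozen latent diffusion model and its
    DDPM-style noise schedule.
    """
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # forward process q(z_t | z_0)

    feats = {}
    hook = backbone.blocks[block_idx].register_forward_hook(
        lambda module, inputs, output: feats.update(tokens=output)
    )
    backbone(z_t, timestep=t)       # single denoising-network call
    hook.remove()
    return feats["tokens"]          # (B, num_tokens, C) frozen features


class AttentiveReadout(nn.Module):
    """Attentive probe for classification tasks: learnable query tokens
    cross-attend to the frozen features, followed by an MLP head."""

    def __init__(self, feat_dim, num_classes, num_queries=1, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(0.02 * torch.randn(num_queries, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.LayerNorm(feat_dim),
            nn.Linear(feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, tokens):                           # tokens: (B, N, C)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, tokens, tokens)   # queries pool the frozen tokens
        return self.head(pooled.mean(dim=1))             # (B, num_classes) logits
```

Only the readout is trained; the backbone stays frozen, so the ablations reported later amount to sweeps over the noise timestep `t` and the block index `block_idx`.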
Evaluation Task Coverage¶
A full spectrum of tasks, ranging from purely semantic to spatiotemporal understanding:
- Image classification: ImageNet (objects), Places365 (scenes), iNat2018 (fine-grained)
- Action recognition: K400/K700 (appearance-dominated), SSv2 (motion-sensitive)
- Monocular depth estimation: ScanNet
- Camera pose estimation: RealEstate10k
- Visual correspondence: point tracking (Perception Test), box tracking (Waymo Open)
Experiments¶
Main Results: Video vs. Image Diffusion Representations¶
| Task | I-WALT (Baseline 100%) | V-WALT Relative Gain | Task Nature |
|---|---|---|---|
| Places365 | 100% | +0.6% | Pure semantics |
| ImageNet | 100% | +1.8% | Pure semantics |
| iNat2018 | 100% | +11% | Fine-grained semantics |
| K400 | 100% | +8% | Appearance understanding |
| K700 | 100% | +12% | Appearance understanding |
| SSv2 | 100% | +42% | Motion understanding |
| Depth | 100% | +16% | 3D perception |
| Cam. Pose | 100% | +60% | Spatial understanding |
| Obj. Tracks | 100% | +23% | Spatiotemporal correspondence |
| PointTracks | 100% | +68% | Precise localization |
Key Findings:
- V-WALT consistently outperforms I-WALT on all 10 tasks, but the magnitude of improvement varies greatly (0.6%–68%).
- Purely semantic tasks (Places365) show minimal differences, whereas tasks requiring motion or spatial understanding (PointTracks, Cam. Pose, SSv2) exhibit the largest gains.
- The +11% improvement on iNat2018 is unexpected and may reflect an enhanced sensitivity to fine-grained visual differences induced by video training.
Ablation Study: Noise Level and Layer Selection¶
| Design Choice | Optimal Configuration | Observed Effect |
|---|---|---|
| Noise level \(t\) | \(t=200\) optimal for most tasks | High noise is universally harmful; tracking tasks are more sensitive to noise (\(t=0\)–\(100\) optimal) |
| Network block \(l\) | ~2/3 depth (\(l=11\)–\(16\)) optimal | Model implicitly divides into encoder/decoder; best representations emerge near the boundary |
| Training progress | >90% performance reached at 20% training | Recognition tasks improve continuously; tracking and depth tasks plateau earlier |
| Model scale | 284M→1.9B yields significant gains | Classification tasks benefit most (especially with many categories); PointTracks is an exception |
Key Findings:
- A small amount of noise (\(t=200\)) benefits most tasks, whereas low-level tasks such as tracking prefer zero or minimal noise.
- The optimal feature layer lies at roughly 2/3 of the model depth, suggesting an implicit encoder–decoder structure within diffusion models.
- Camera pose estimation degrades after 26% of training, suggesting that prolonged training may shift features toward generation rather than understanding.
- PCA visualizations (sketched below) reveal that V-WALT selectively attends to moving regions, whereas I-WALT attends to all salient regions, including static objects.
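The PCA visualization mentioned in the last point is a common way to inspect dense diffusion features; below is a minimal, generic sketch (not the authors' code), assuming per-pixel features of shape (H, W, C) have already been extracted from the frozen backbone, e.g. with the probing sketch above.

```python
import torch

def pca_rgb(feats):
    """Project dense per-pixel features onto their top-3 principal components
    and rescale to [0, 1] so the result can be viewed as an RGB image.

    feats: (H, W, C) features for one frame (hypothetical shape; the real
    feature grid resolution may differ).
    """
    H, W, C = feats.shape
    x = feats.reshape(-1, C).float()
    _, _, v = torch.pca_lowrank(x, q=3)            # v: (C, 3) principal directions
    proj = (x - x.mean(dim=0)) @ v                 # (H * W, 3) projected features
    proj = (proj - proj.amin(dim=0)) / (proj.amax(dim=0) - proj.amin(dim=0) + 1e-6)
    return proj.reshape(H, W, 3)                   # channels map to the top-3 components
```

Rendering such maps frame by frame is what makes the qualitative contrast visible: V-WALT's components concentrate on moving regions, while I-WALT's spread over all salient objects.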
Comparison with Other Representation Models¶
| Model | Type | Parameters | Relative Performance (vs. V-WALT) |
|---|---|---|---|
| DINOv2 | Self-distillation (image) | 300M | Strong on semantic tasks, weak on tracking |
| SigLIP | Image-text alignment (image) | 400M | Strongest on semantic tasks |
| V-JEPA | Feature reconstruction (video) | 300M | Comparable to V-WALT, task-dependent |
| VideoMAE | Pixel reconstruction (video) | 300M | Slightly weaker overall |
| V-WALT (1.9B) | Diffusion (video) | 1.9B | Significant gains on most tasks |
V-WALT is competitive on depth and motion understanding tasks, but is dominated by DINOv2 and SigLIP on purely semantic tasks, revealing a fundamental limitation of generative diffusion models in semantic understanding.
Highlights & Insights¶
- First fair comparison of image vs. video diffusion representations: By exploiting WALT's hybrid architecture under fully controlled conditions, the paper establishes that video training consistently improves representation quality.
- Discovery of motion sensitivity: PCA visualizations and brick-wall experiments elegantly reveal V-WALT's selective attention to motion regions, explaining its advantage on spatiotemporal tasks.
- Systematic study of noise level and layer depth: Provides practical guidelines for extracting diffusion features, namely a small amount of noise (\(t \approx 200\)) and activations from roughly 2/3 of the network depth.
- Interesting observations on training dynamics: Generation quality (FVD) improves continuously, yet some downstream tasks peak early or even degrade, suggesting a potential trade-off between generative capability and representational quality.
Limitations & Future Work¶
- Only a single model architecture (WALT) is evaluated; the generalizability of the conclusions warrants further validation, despite the authors' justifications.
- Training relies on Google-internal datasets and models, limiting full reproducibility.
- Only frozen feature extraction with lightweight probing is explored; fine-tuning the diffusion model may further reveal its representational potential.
- The choice of readout head may influence conclusions, though the designs are consistent with prior work.
- No code or model weights are released.
Related Work & Insights¶
- Image diffusion representations: DDPM-Seg, DIFT, zero-shot classification, multi-task understanding (segmentation + depth), etc.
- Video representation learning: V-JEPA (latent feature prediction), VideoMAE (masked reconstruction), VideoPrism (hybrid strategy).
- Diffusion model architectures: WALT (windowed-attention latent transformer), Stable Diffusion/SVD (U-Net family).
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic comparison, though empirical in nature; no new method is proposed.
- Technical Quality: ⭐⭐⭐⭐⭐ — Rigorous experimental design, well-controlled variables, 10 downstream tasks with broad coverage.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Main experiments, noise/layer ablations, training dynamics analysis, model scaling, and baseline comparisons; highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ — Figures and tables are concise and intuitive; conclusions are clear and discussion is in-depth.
- Overall Score: 8.0/10