Aether: Geometric-Aware Unified World Modeling¶
Conference: ICCV 2025 arXiv: 2503.18945 Code: Project Area: World Models / 4D Reconstruction / Visual Planning Keywords: world model, 4D reconstruction, video prediction, visual planning, geometric awareness, synthetic data
TL;DR¶
This paper proposes Aether, a unified world model obtained by post-training the CogVideoX video diffusion model on synthetic RGB-D data. Through a multi-task training strategy that randomly combines input/output modalities, Aether handles 4D reconstruction, action-conditioned video prediction, and goal-conditioned visual planning within a single model, and transfers zero-shot to real-world data with performance comparable to domain-specific baselines.
Background & Motivation¶
- Background: World models require three core capabilities—perception (4D reconstruction), prediction (action-conditioned generation), and planning (goal-conditioned reasoning)—yet existing methods typically address only one of these.
- Limitations of Prior Work: (1) Independent modeling of each capability lacks synergy; (2) real-world 4D annotated data is extremely scarce; (3) action representations are heterogeneous (keyboard inputs / robot actions / camera trajectories).
- Key Challenge: The demand for unifying three capabilities conflicts with the heterogeneity of data and representations.
- Goal: Construct a unified framework that simultaneously supports reconstruction, prediction, and planning.
- Key Insight: Synthetic data + camera trajectories as a unified action representation + multi-task post-training.
- Core Idea: Post-train a video diffusion model on synthetic 4D data, using camera trajectories as the geometric action space to unify reconstruction, prediction, and planning.
Method¶
Overall Architecture¶
CogVideoX-5b-I2V serves as the base model, with camera parameters automatically annotated from synthetic RGB-D data. The model produces three output modalities: color video, depth video, and action (raymap). Different tasks are realized through distinct conditioning combinations.
Key Designs¶
Design 1: 4D Synthetic Data Annotation Pipeline - Function: Automatically obtain accurate camera parameters from synthetic RGB-D videos. - Mechanism: Dynamic object masking (Grounded SAM2) → video segmentation and filtering (SIFT + optical flow) → coarse camera estimation (DroidCalib) → refinement (CoTracker3 point tracks + bundle adjustment solved with Ceres). - Design Motivation: 4D annotated data is scarce; an automated pipeline is a prerequisite for scaling.
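The four stages compose into a linear pipeline. Below is a minimal control-flow sketch in which every stage is a hypothetical stub (the real system wires in Grounded SAM2, SIFT + optical-flow filtering, DroidCalib, CoTracker3, and Ceres at the marked points); it only illustrates how the stages hand data to one another.

```python
import numpy as np

# Hypothetical stage stubs; the actual pipeline plugs in the named tools here.
def mask_dynamic_objects(frames):        # Grounded SAM2 in the paper
    return [np.zeros(f.shape[:2], dtype=bool) for f in frames]

def filter_frames(frames, masks):        # SIFT + optical-flow filtering
    return list(range(len(frames)))      # this stub keeps every frame

def coarse_calibration(frames, keep):    # DroidCalib
    return np.eye(3), [np.eye(4) for _ in keep]

def refine(frames, masks, K, poses):     # CoTracker3 tracks + Ceres bundle
    return K, poses                      # adjustment, skipping masked pixels

def annotate_cameras(frames):
    """Run the four annotation stages in order, returning shared
    intrinsics K and per-frame camera-to-world poses."""
    masks = mask_dynamic_objects(frames)
    keep = filter_frames(frames, masks)
    K, poses = coarse_calibration(frames, keep)
    return refine(frames, masks, K, poses)
```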
Design 2: Raymap Camera Representation - Function: Convert camera trajectories into a raymap video representation compatible with diffusion models. - Mechanism: Each frame uses 6 channels (3D ray directions + 3D ray origins), with translation normalized via log-scale. The raymap is invertible—camera intrinsics and extrinsics can be recovered from the generated raymap. - Design Motivation: Camera parameters must align with the spatiotemporal tokens of the DiT; raymaps naturally possess spatial structure.
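A per-frame raymap can be built directly from the intrinsics and extrinsics. The sketch below is illustrative, not the paper's code: it unprojects each pixel to a unit world-space ray direction and pairs it with the camera center as the ray origin; the `sign(t)·log1p(|t|)` normalization of the origins is an assumed concrete form of the paper's log-scale translation normalization.

```python
import numpy as np

def make_raymap(K, c2w, H, W):
    """Build a 6-channel raymap (3 ray-direction + 3 ray-origin channels)
    for one frame, shape (6, H, W).

    K:   3x3 camera intrinsics.
    c2w: 4x4 camera-to-world extrinsics.
    """
    # Pixel-center grid in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)  # (3, H*W)

    # Unproject to camera space, rotate to world space, normalize to unit rays.
    dirs_world = c2w[:3, :3] @ (np.linalg.inv(K) @ pix)
    dirs_world /= np.linalg.norm(dirs_world, axis=0, keepdims=True)

    # Ray origin is the camera center, broadcast to every pixel; the
    # log-scale normalization here is one plausible instantiation.
    origins = np.broadcast_to(c2w[:3, 3:4], (3, H * W))
    origins = np.sign(origins) * np.log1p(np.abs(origins))

    return np.concatenate([dirs_world, origins], axis=0).reshape(6, H, W)
```

Because directions and origins are stored losslessly per pixel, the map can be inverted to recover the camera parameters, which is what makes the generated raymap usable as an action output.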
Design 3: Multi-Task Random Conditioning Training - Function: Randomly mask different conditioning combinations to enable unified multi-task training. - Mechanism: Color conditioning probability allocation: 30% planning (first + last frame), 40% prediction (first frame), 28% reconstruction (full video), 2% with no color condition. Action conditioning is retained 50% of the time and masked otherwise. Two-stage training: Stage 1 uses the standard diffusion loss; Stage 2 adds a decoded MS-SSIM loss, a scale-invariant depth loss, and a point map loss. - Design Motivation: Random conditioning enables knowledge transfer across tasks; geometric supervision ensures 3D consistency.
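The per-clip conditioning draw can be written as a small sampler. This is a sketch of the stated probability allocation only; the field names and return format are my own, not the paper's.

```python
import random

def sample_condition_mask(rng=random):
    """Sample one (color, action) conditioning configuration per training clip,
    following the stated allocation: 30% / 40% / 28% / 2% for color, and a
    50% chance of keeping the raymap action condition."""
    r = rng.random()
    if r < 0.30:
        color = "first+last"   # goal-conditioned visual planning
    elif r < 0.70:
        color = "first"        # action-conditioned video prediction
    elif r < 0.98:
        color = "full"         # 4D reconstruction (full color video given)
    else:
        color = "none"         # no color condition
    action_kept = rng.random() < 0.5  # keep camera-trajectory condition half the time
    return {"color": color, "action_kept": action_kept}
```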
Loss & Training¶
Stage 1: Standard diffusion MSE. Stage 2: + MS-SSIM (color) + scale-shift invariant loss (depth) + scale-shift invariant point map loss (depth + raymap). Training runs for two weeks on 80× A100 GPUs.
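A scale-and-shift-invariant depth loss of the kind used in Stage 2 can be sketched as follows; this is a generic least-squares formulation (align the prediction to the target with a fitted scale and shift, then penalize the residual), assumed here rather than taken from the paper's code.

```python
import numpy as np

def ssi_depth_loss(pred, gt):
    """Scale-and-shift-invariant depth loss: fit gt ≈ s*pred + t by least
    squares, then return the mean squared residual."""
    p, g = pred.ravel(), gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)       # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)   # closed-form alignment
    return float(np.mean((s * p + t - g) ** 2))
```

The invariance matters because depth decoded from a generative model is only defined up to an affine ambiguity; any prediction that is an affine transform of the ground truth incurs zero loss.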
Key Experimental Results¶
Main Results¶
Video Depth Estimation (Abs Rel↓ / \(\delta<1.25\)↑)
| Method | Sintel | BONN | KITTI |
|---|---|---|---|
| MonST3R | 0.378/55.8 | 0.067/96.3 | 0.168/74.4 |
| DepthCrafter | 0.590/55.5 | 0.253/56.3 | 0.124/86.5 |
| Aether | 0.314/60.4 | 0.273/59.4 | 0.054/97.7 |
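For reference, the two metrics in the table can be computed from predicted and ground-truth depth maps as below; this is the standard definition of Abs Rel and the δ < 1.25 threshold accuracy, with the δ value reported in percent to match the table.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (Abs Rel, delta<1.25 accuracy in %) over valid (gt > 0) pixels.
    Abs Rel: lower is better. delta<1.25: higher is better."""
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)     # symmetric ratio per pixel
    delta = np.mean(ratio < 1.25) * 100.0
    return float(abs_rel), float(delta)
```

In practice these benchmarks align prediction and ground truth (e.g. by a median or least-squares scale) before computing the metrics; that alignment step is omitted here.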
Ablation Study¶
| Configuration | Depth Abs Rel (Sintel) |
|---|---|
| w/o multi-task training | 0.45 |
| w/o Stage 2 geometric loss | 0.38 |
| Full Aether | 0.314 |
Key Findings¶
- Training exclusively on synthetic data enables zero-shot transfer to real-world scenes, with reconstruction performance matching or exceeding specialist models on most benchmarks.
- Multi-task training yields significant knowledge transfer—reconstruction capability promotes geometric consistency in prediction and planning.
- Camera trajectories as the action space are particularly effective for ego-view tasks such as navigation.
Highlights & Insights¶
- This is the first work to unify reconstruction, prediction, and planning within a single video diffusion model.
- The paradigm of synthetic data combined with an automated annotation pipeline scales readily, removing the dependence on scarce real-world 4D annotations.
- The raymap representation elegantly embeds camera parameters into the diffusion model framework.
Limitations & Future Work¶
- Camera trajectories as the action space are insufficiently general for non-ego-view tasks such as robotic arm manipulation.
- The domain gap from synthetic data still manifests in certain real-world scenarios.
- Planning capability is achieved solely through first- and last-frame conditioning, lacking explicit path optimization.
Related Work & Insights¶
- DA-V trains depth estimation on synthetic data but does not address reconstruction or planning.
- Insight: Post-training video foundation models can efficiently inject 4D geometric reasoning capabilities.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ★★★★★ |
| Practicality | ★★★★☆ |
| Experimental Thoroughness | ★★★★★ |
| Writing Quality | ★★★★☆ |