# C4D: 4D Made from 3D through Dual Correspondences

- Conference: ICCV 2025
- arXiv: 2510.14960
- Code: https://littlepure2333.github.io/C4D
- Area: Other
- Keywords: 4D reconstruction, temporal correspondence, point tracking, motion segmentation, DUSt3R

## TL;DR
This paper proposes C4D, a framework that upgrades existing 3D reconstruction paradigms to full 4D reconstruction by jointly capturing dual temporal correspondences — short-term optical flow and dynamic-aware long-term point tracking (DynPT) — on top of DUSt3R's 3D pointmap predictions. Motion masks are generated to separate static and dynamic regions. Three optimization objectives are introduced: camera motion alignment, camera trajectory smoothing, and point trajectory smoothing. The resulting system produces per-frame point clouds, camera parameters, and 2D/3D trajectories, achieving competitive performance across depth estimation, pose estimation, and point tracking tasks.
## Background & Motivation
Pointmap-based methods such as DUSt3R achieve strong results in static scene 3D reconstruction, but fail in dynamic scenes where moving objects violate multi-view geometric constraints. Existing 4D approaches either require fine-tuning the model (e.g., MonST3R) or rely on complex NeRF/3DGS optimization pipelines. The central question is: how can temporal correspondence information be leveraged to upgrade 3D reconstruction to 4D without modifying pretrained weights?
## Core Problem
How can temporal correspondences (optical flow + point tracking) be exploited to distinguish static from dynamic regions, improve camera pose estimation, and achieve temporally smooth 4D reconstruction?
## Method

### Overall Architecture

- Inputs and backbones: a monocular video is fed to DUSt3R/MASt3R/MonST3R for pointmap prediction, to DynPT for long-term trajectories with dynamic scores, and to an optical-flow network for short-term correspondences.
- Motion masks: correspondence-guided prediction (fundamental matrix from static points + epipolar error).
- Optimization: multi-objective, combining global alignment (GA) with CMA, CTS, and PTS.
- 4D output: per-frame point clouds, depth, camera poses, intrinsics, motion masks, and 2D/3D trajectories.
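A minimal end-to-end sketch of this flow in Python; every function name here (`predict_pointmaps`, `dynpt`, `optical_flow`, `motion_masks`, `optimize`) is a hypothetical placeholder for illustration, not the released API:

```python
# Hypothetical sketch of the C4D pipeline; all function names are
# illustrative placeholders, not the authors' implementation.

def c4d(frames):
    # 1. Pairwise 3D pointmaps from a frozen pointmap backbone
    #    (DUSt3R / MASt3R / MonST3R).
    pointmaps = predict_pointmaps(frames)

    # 2. Dual temporal correspondences: long-term tracks with dynamic
    #    scores, plus short-term optical flow between adjacent frames.
    tracks, visibility, mobility = dynpt(frames)
    flows = optical_flow(frames)

    # 3. Correspondence-guided motion masks: a fundamental matrix fit on
    #    static points; epipolar error flags dynamic pixels.
    masks = motion_masks(flows, tracks, mobility)

    # 4. Multi-objective optimization: global alignment (GA) plus the
    #    three new objectives (CMA, CTS, PTS).
    depth, poses, intrinsics = optimize(pointmaps, flows, tracks, masks)

    # 5. 4D output.
    return depth, poses, intrinsics, masks, tracks
```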
### Key Designs
- DynPT (Dynamic-aware Point Tracker): Built on the CoTracker architecture, augmented with a 3D-aware ViT encoder (the frozen DUSt3R pretrained encoder) and a CNN dual-feature extractor. A Transformer iteratively updates each tracked point's position, confidence, visibility, and mobility. Trained on Kubric; mobility ground truth is generated via positional-difference thresholding (see the first sketch after this list).
- Correspondence-Guided Motion Mask: Static points predicted by DynPT are used to sample static correspondences from optical flow → LMedS estimates the fundamental matrix (reflecting camera motion only) → the Sampson distance flags dynamic regions that violate the epipolar constraint → masks from multiple frames are fused via union. This yields more accurate motion masks than MonST3R's (see the second sketch after this list).
- Correspondence-Assisted Optimization:
- CMA (Camera Motion Alignment): constrains the ego-motion field to be consistent with optical flow in static regions.
- CTS (Camera Trajectory Smoothing): penalizes abrupt changes in rotation and translation between adjacent frames.
- PTS (Point Trajectory Smoothing): adaptively weighted smoothing on sparse tracked points → linear blending displacement (LBD) propagated to dense points.
- Plug-and-Play: No modifications to DUSt3R/MASt3R/MonST3R weights; new objectives are introduced only at the optimization stage.
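To make the mobility supervision in DynPT concrete, here is a minimal NumPy sketch of positional-difference thresholding. The assumption that trajectories are expressed in world coordinates (so camera motion cancels out) and the threshold value are mine, not the paper's:

```python
import numpy as np

def mobility_labels(world_tracks, thresh=0.05):
    """Label each ground-truth 3D trajectory as dynamic (1) or static (0).

    world_tracks: (T, N, 3) point positions, assumed to be in *world*
    coordinates so that camera motion is already factored out.
    `thresh` is an assumed value, not taken from the paper.
    """
    # Displacement of each point from its position in the first frame.
    disp = np.linalg.norm(world_tracks - world_tracks[:1], axis=-1)  # (T, N)
    # A point whose maximum displacement exceeds the threshold is dynamic.
    return (disp.max(axis=0) > thresh).astype(np.int64)              # (N,)
```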
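The correspondence-guided motion mask maps naturally onto standard OpenCV/NumPy primitives. A sketch, assuming static correspondences have already been sampled as pixel arrays; the Sampson threshold `tau` is an assumed value:

```python
import cv2
import numpy as np

def motion_mask(static_pts1, static_pts2, all_pts1, all_pts2, tau=1.0):
    """Flag correspondences that violate the static-scene epipolar constraint.

    static_pts*: (M, 2) flow correspondences sampled at points DynPT deems
    static; all_pts*: (N, 2) dense correspondences. `tau` (in pixels) is an
    assumed threshold.
    """
    # Fundamental matrix from static points only, fit robustly with LMedS,
    # so it reflects camera motion alone.
    F, _ = cv2.findFundamentalMat(static_pts1, static_pts2, cv2.FM_LMEDS)

    # First-order geometric (Sampson) distance to the epipolar constraint.
    x1 = np.hstack([all_pts1, np.ones((len(all_pts1), 1))])  # (N, 3)
    x2 = np.hstack([all_pts2, np.ones((len(all_pts2), 1))])
    Fx1 = x1 @ F.T   # rows are F @ x1_i
    Ftx2 = x2 @ F    # rows are F^T @ x2_i
    num = np.einsum('ij,ij->i', x2, Fx1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    sampson = num / den

    # Points far from their epipolar lines move independently of the camera.
    return sampson > tau
```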
### Training / Optimization Details
- DynPT: trained for 50K steps, batch size 32, AdamW + OneCycle scheduler, lr \(5 \times 10^{-4}\).
- Optimization is conducted in two stages: (1) GA + CMA + CTS optimize depth, pose, and intrinsics; (2) pose is fixed and PTS optimizes depth only (see the sketch after this list).
- Each stage runs for 300 iterations, Adam lr \(0.01\).
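A PyTorch-style sketch of the two smoothing objectives. The exact loss forms (Frobenius distance to identity for CTS, a weighted second difference for PTS) and the adaptive weights are plausible assumptions rather than the paper's equations, and the GA/CMA terms are elided:

```python
import torch

def cts_loss(R, t):
    """Camera Trajectory Smoothing: penalize abrupt pose changes.

    R: (T, 3, 3) rotations, t: (T, 3) translations. Assumed loss form.
    """
    # Relative rotation between adjacent frames should stay near identity.
    R_rel = R[:-1].transpose(1, 2) @ R[1:]
    eye = torch.eye(3, device=R.device).expand_as(R_rel)
    rot_term = (R_rel - eye).square().sum(dim=(1, 2)).mean()
    trans_term = (t[1:] - t[:-1]).square().sum(dim=1).mean()
    return rot_term + trans_term

def pts_loss(tracks3d, weights):
    """Point Trajectory Smoothing on sparse tracked points.

    tracks3d: (T, N, 3) 3D trajectories; weights: (N,) adaptive per-point
    weights (e.g., down-weighting genuinely fast points) - an assumed form.
    """
    # Second temporal difference approximates acceleration along each track.
    accel = tracks3d[2:] - 2 * tracks3d[1:-1] + tracks3d[:-2]
    return (weights * accel.square().sum(dim=-1)).mean()
```

Under the schedule above, stage one would minimize GA + CMA + `cts_loss` over depth, pose, and intrinsics for 300 Adam steps (lr 0.01), and stage two would freeze pose and refine depth under `pts_loss`.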
## Key Experimental Results

### Camera Pose Estimation (ATE ↓)
| Method | Sintel | TUM-dyn | ScanNet |
|---|---|---|---|
| MonST3R+GA | 0.158 | 0.099 | 0.075 |
| C4D-M | 0.103 | 0.071 | 0.061 |
| DROID-SLAM† | 0.175 | — | — |
| LEAP-VO† | 0.089 | 0.068 | 0.070 |
- Compared to MonST3R+GA: Sintel ATE reduced by 35%; RPE_rot reduced from 1.924 to 0.705.
- Competitive with dedicated VO methods (marked †) that require GT intrinsics.
### Video Depth Estimation (AbsRel ↓, scale-only alignment)
| Method | Sintel | Bonn | KITTI |
|---|---|---|---|
| MonST3R | 0.345 | 0.065 | 0.159 |
| C4D-M | 0.338 | 0.063 | 0.091 |
| DepthCrafter | 0.692 | 0.217 | 0.141 |
Under scale-only alignment, KITTI AbsRel improves from 0.159 to 0.091 (43% relative gain over MonST3R).
### Point Tracking (TAP-Vid DAVIS, AJ ↑)
| Method | AJ | \(\delta_\text{avg}\) | OA |
|---|---|---|---|
| CoTracker | 61.8 | 76.1 | 88.3 |
| DynPT | 61.6 | 75.4 | 87.4 |
DynPT tracks on par with the SOTA CoTracker while additionally predicting mobility (D-ACC: 87.9% on MOVi-E, 94.1% on Pan.MOVi-E).
### Ablation Study (Sintel)

| Variant | ATE ↓ | RPE_trans ↓ | RPE_rot ↓ |
|---|---|---|---|
| w/o CMA | 0.140 | 0.051 | 0.905 |
| w/o CTS | 0.131 | 0.058 | 1.348 |
| w/o PTS | 0.103 | 0.040 | 0.705 |
| C4D (full) | 0.103 | 0.040 | 0.705 |
CTS has the largest impact on rotational error: removing it degrades RPE_rot from 0.705 to 1.348.
## Highlights & Insights
- Plug-and-play 4D upgrade: The 3D-to-4D transition requires no fine-tuning of 3D model weights; it is achieved solely through new optimization objectives and temporal correspondences.
- Mobility prediction in DynPT: A key innovation — distinguishing whether a point's motion originates from camera movement or object movement, enabling more accurate motion mask prediction.
- LMedS + fundamental matrix for motion segmentation: An elegant solution — the fundamental matrix is estimated using only static points, and any region violating epipolar constraints is classified as dynamic.
- Multi-frame motion mask fusion: Addresses cases where temporarily stationary dynamic objects (e.g., a pedestrian's feet while standing) appear static between adjacent frame pairs.
## Limitations & Future Work
- DynPT is trained on synthetic Kubric data; the domain gap may affect mobility prediction in real-world dynamic scenes.
- The optimization stage is relatively slow (2 × 300 iterations).
- PTS provides marginal improvement on quantitative depth metrics, though temporal smoothness is substantially enhanced (best assessed via visual evaluation).
## Related Work & Insights
- vs. MonST3R: MonST3R fine-tunes the DUSt3R decoder; C4D leaves model weights unchanged and upgrades reconstruction through optimization. C4D produces more accurate motion masks (Fig. 6).
- vs. Shape-of-Motion / GFlow: These methods require NeRF/3DGS optimization; C4D is more lightweight by operating directly on pointmaps.
- vs. DROID-SLAM / LEAP-VO: These methods require GT intrinsics; C4D operates from monocular video alone.
## Rating
- Novelty: ⭐⭐⭐⭐ — DynPT's mobility prediction and correspondence-guided motion masks are the primary contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 3D/4D comparisons, three downstream tasks, six+ datasets, ablations, and motion segmentation evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Architecture diagrams are clear; method descriptions are complete.
- Value: ⭐⭐⭐⭐ — Provides both 4D reconstruction methodology and an important extension of the DUSt3R ecosystem.