
Multi-View 3D Point Tracking

Conference: ICCV 2025 | arXiv: 2508.21060 | Code: ethz-vlg/mvtracker | Area: 3D Vision | Keywords: multi-view 3D point tracking, point cloud feature fusion, kNN association, Transformer iterative refinement, occlusion handling

TL;DR

This paper presents MVTracker—the first data-driven multi-view 3D point tracker. By back-projecting multi-view depth maps into a unified 3D feature point cloud and leveraging kNN association with Transformer-based iterative refinement, MVTracker achieves robust long-range 3D point trajectory estimation under a practical 4-camera configuration, attaining median trajectory errors of 3.1 cm and 2.0 cm on Panoptic Studio and DexYCB, respectively.

Background & Motivation

3D point tracking is a fundamental task in dynamic scene reconstruction, robotic manipulation, and augmented reality. Existing methods suffer from three key limitations:

Inherent limitations of 2D point trackers: Methods such as CoTracker and LocoTrack perform well in the 2D domain, but lifting 2D trajectories to 3D introduces significant accuracy degradation due to projection ambiguity and depth estimation noise.

Depth ambiguity in monocular 3D trackers: Methods such as SpatialTracker and DELTA rely on monocular depth estimation and struggle to produce reliable 3D trajectories under occlusion and complex motion.

High cost of multi-camera optimization approaches: Dynamic 3DGS requires 27 cameras and per-sequence optimization; Shape of Motion requires iterative optimization—both are ill-suited for real-time or large-scale deployment.

The authors' core insight is that exploiting multi-view information from a small number of cameras (e.g., 4) via a feed-forward model to directly predict 3D correspondences can simultaneously resolve depth ambiguity and occlusion, without requiring per-sequence optimization.

Method

Overall Architecture

The MVTracker pipeline consists of five stages:

  1. Feature extraction: A CNN backbone extracts \(d=128\)-dimensional feature maps \(\Phi_t^v\) per view with downsampling factor \(k=4\), forming a feature pyramid of \(S=4\) scales.
  2. 3D feature point cloud construction: Pixels are back-projected into 3D world coordinates using depth maps and camera parameters, with associated features aggregated across all views into a unified 3D feature point cloud \(\mathcal{X}_t^s\) (a minimal sketch of this step follows the list).
  3. kNN association computation: Multi-scale kNN search is performed in the fused point cloud for each tracked point to compute feature correlations.
  4. Transformer iterative refinement: Spatio-temporal tokens are constructed and refined through self-attention and virtual track point cross-attention over \(M\) iterations.
  5. Sliding window inference: Overlapping sliding windows handle long videos, with trajectory estimates propagated across windows.
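
To make stage 2 concrete, here is a minimal PyTorch sketch of back-projecting per-view depth and feature maps into a fused world-space point cloud. It assumes pinhole intrinsics and camera-to-world extrinsics; tensor names and shapes are illustrative and not taken from the authors' code.

```python
import torch

def backproject_views(depths, feats, Ks, cam_to_world):
    """Fuse per-view depth and feature maps into one world-space feature point cloud.

    depths:       (V, H, W)    depth map per view
    feats:        (V, C, H, W) feature map per view (C = 128 in the paper)
    Ks:           (V, 3, 3)    pinhole intrinsics
    cam_to_world: (V, 4, 4)    camera-to-world extrinsics
    Returns points (N, 3) and features (N, C) over all valid pixels of all views.
    """
    V, H, W = depths.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3) homogeneous pixels

    all_pts, all_feats = [], []
    for v in range(V):
        # Unproject: X_cam = depth * K^{-1} [u, v, 1]^T
        rays = pix @ torch.inverse(Ks[v]).T                            # (H, W, 3)
        pts_cam = rays * depths[v].unsqueeze(-1)                       # (H, W, 3)
        # Lift to homogeneous coordinates and map into the shared world frame
        pts_h = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)      # (H, W, 4)
        pts_world = (pts_h @ cam_to_world[v].T)[..., :3]               # (H, W, 3)

        valid = depths[v] > 0                                          # drop pixels without depth
        all_pts.append(pts_world[valid])
        all_feats.append(feats[v].permute(1, 2, 0)[valid])             # (N_v, C)

    return torch.cat(all_pts, dim=0), torch.cat(all_feats, dim=0)
```

In the actual pipeline this fusion is applied at each scale \(s\) of the feature pyramid, yielding the multi-scale clouds \(\mathcal{X}_t^s\) used for kNN association.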

Key Designs

Fused 3D Feature Point Cloud vs. Triplane

This is the paper's most central contribution. SpatialTracker uses a triplane representation that projects multi-view features onto three orthogonal planes (XY/YZ/ZX). This approach has two fundamental drawbacks:

  • Projection collision: Different 3D surfaces mapped to the same planar coordinates cause feature averaging and information loss.
  • Fixed boundary: A predefined scene bounding box is required, limiting adaptability across scenes of varying scale and position.

MVTracker's point cloud representation preserves features directly in 3D space, avoiding collisions and naturally adapting to diverse scenes. Experiments show that replacing this component with a Triplane Baseline drops AJ from 86.0 to 65.1 on Panoptic Studio.

Multi-Scale kNN Spatial Association

Unlike 2D trackers that compute associations on pixel grids, MVTracker employs kNN search in the 3D point cloud to establish correspondences:

\[\mathbf{C}_t^{n,s} = \{\langle \mathbf{f}_t^n, \phi_k \rangle \mid (\mathbf{x}_k, \phi_k) \in \mathcal{N}_K(\hat{\mathbf{p}}_t^n, \mathcal{X}_t^s)\}\]

A critical design choice is explicit 3D offset vector encoding: for each neighbor, the feature similarity and relative offset \((\mathbf{x}_k - \hat{\mathbf{p}}_t^n)\) are concatenated. Ablation studies confirm that using only offset vectors (without absolute position) is optimal, improving AJ from 21.3 (no offset) to 53.6.
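
A minimal sketch of this association step at one pyramid scale, assuming PyTorch: it computes the dot-product similarities from the equation above and concatenates the relative offsets, matching the ablation's preferred offset-only encoding. Function names and the choice of K are illustrative.

```python
import torch

def knn_correlation(query_pos, query_feat, cloud_pos, cloud_feat, k=16):
    """kNN correlation features with explicit 3D offsets, for one pyramid scale.

    query_pos:  (N, 3)  current track-point estimates p_hat_t^n
    query_feat: (N, C)  track-point features f_t^n
    cloud_pos:  (P, 3)  fused point-cloud positions
    cloud_feat: (P, C)  fused point-cloud features
    Returns (N, k, 4): [feature similarity, dx, dy, dz] per neighbor.
    """
    dists = torch.cdist(query_pos, cloud_pos)           # (N, P) pairwise Euclidean distances
    _, knn_idx = dists.topk(k, dim=-1, largest=False)   # indices of the k nearest neighbors

    nbr_pos = cloud_pos[knn_idx]                         # (N, k, 3)
    nbr_feat = cloud_feat[knn_idx]                       # (N, k, C)

    # <f_t^n, phi_k> similarity plus the relative offset (x_k - p_hat_t^n)
    sim = (nbr_feat * query_feat.unsqueeze(1)).sum(-1, keepdim=True)   # (N, k, 1)
    offset = nbr_pos - query_pos.unsqueeze(1)                          # (N, k, 3)
    return torch.cat([sim, offset], dim=-1)                            # (N, k, 4)
```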

The mean neighborhood radii across the 4 scales are 12.5, 22.4, 42.7, and 85.8 cm (Panoptic Studio), with the coarsest scale covering inter-frame motion of approximately 92 km/h at 30 FPS.
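
As a quick check of that motion-coverage figure, using the stated 30 FPS frame rate and the 85.8 cm coarsest-scale radius:

\[0.858\ \text{m/frame} \times 30\ \text{frames/s} \approx 25.7\ \text{m/s} \approx 92.7\ \text{km/h}\]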

Transformer Iterative Refinement

Each tracked point at each timestep is encoded as a token:

\[G_t^n = (\eta(\hat{\mathbf{p}}_t^n - \hat{\mathbf{p}}_{t_n^q}^n), \mathbf{f}_t^n, \mathbf{C}_t^{n,s}, \hat{v}_t^n)\]

where \(\eta(\cdot)\) denotes sinusoidal positional encoding. The Transformer applies temporal self-attention and virtual track point cross-attention (inherited from CoTracker2) to output residual updates to positions and features over \(M\) iterations. Visibility \(\hat{v}_t^n\) is predicted via sigmoid projection after the final iteration.
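
A minimal sketch of the token assembly in the expression above, assuming PyTorch; the encoding width and helper names are illustrative rather than the authors' implementation.

```python
import torch

def sinusoidal_encoding(x, num_freqs=8):
    """eta(.): sinusoidal positional encoding of 3D displacements."""
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * torch.pi     # (F,)
    angles = x.unsqueeze(-1) * freqs                                             # (..., 3, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)           # (..., 6F)

def build_tokens(p_hat, p_query, track_feat, corr_feat, vis):
    """Assemble one token G_t^n per tracked point and timestep.

    p_hat:      (T, N, 3)  current position estimates
    p_query:    (N, 3)     positions at each track's query frame t_n^q
    track_feat: (T, N, C)  track features f_t^n
    corr_feat:  (T, N, D)  flattened multi-scale kNN correlations C_t^{n,s}
    vis:        (T, N, 1)  current visibility logits v_hat_t^n
    """
    rel = sinusoidal_encoding(p_hat - p_query.unsqueeze(0))        # eta(p_hat - p_query)
    return torch.cat([rel, track_feat, corr_feat, vis], dim=-1)    # (T, N, D_token)
```

In the refinement loop, these tokens are rebuilt from the updated positions and features after each of the \(M\) iterations, and the sigmoid visibility head is applied only after the final pass.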

Loss & Training

The total loss combines a position term and a visibility term (a sketch follows the list below):

\[\mathcal{L} = \mathcal{L}_{xyz} + \lambda_{vis} \mathcal{L}_{vis}\]

  • Position loss: Weighted \(\ell_1\) norm with higher weights for later iterations (\(\gamma^{M-m}\)), summed over all windows, iterations, trajectories, and frames.
  • Visibility loss: Balanced binary cross-entropy (B-BCE) to address the class imbalance between visible and occluded frames.
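
A minimal single-window sketch of this objective, assuming PyTorch; the balanced BCE is approximated here with a standard pos_weight re-weighting, and all shapes, names, and default values are illustrative.

```python
import torch
import torch.nn.functional as F

def tracking_loss(pred_tracks, gt_tracks, vis_logits, gt_vis, gamma=0.8, lambda_vis=1.0):
    """L = L_xyz + lambda_vis * L_vis, sketched for a single window.

    pred_tracks: (M, T, N, 3)  position estimates after each of the M iterations
    gt_tracks:   (T, N, 3)     ground-truth 3D trajectories
    vis_logits:  (T, N)        visibility logits after the final iteration
    gt_vis:      (T, N)        ground-truth visibility in {0, 1}
    """
    M = pred_tracks.shape[0]
    # Position term: l1 error with later iterations weighted more, via gamma^(M-m)
    weights = gamma ** torch.arange(M - 1, -1, -1, dtype=torch.float32)   # (M,)
    l1 = (pred_tracks - gt_tracks).abs().sum(-1).mean(dim=(1, 2))         # (M,)
    loss_xyz = (weights * l1).sum()

    # Visibility term: BCE re-weighted to offset the visible/occluded imbalance
    pos_weight = (gt_vis == 0).sum() / (gt_vis == 1).sum().clamp(min=1)
    loss_vis = F.binary_cross_entropy_with_logits(
        vis_logits, gt_vis.float(), pos_weight=pos_weight)

    return loss_xyz + lambda_vis * loss_vis
```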

Key Experimental Results

Main Results (Table 1)

| Method | Panoptic Studio AJ↑ | Panoptic Studio MTE↓ (cm) | DexYCB AJ↑ | DexYCB MTE↓ (cm) | MV-Kubric AJ↑ | MV-Kubric MTE↓ (cm) |
|---|---|---|---|---|---|---|
| Dynamic 3DGS | 66.5 | 3.9 | 45.7 | 11.3 | 30.4 | 11.2 |
| Shape of Motion | 72.6 | 4.8 | 36.2 | 8.0 | 57.8 | 5.3 |
| CoTracker3 | 74.5 | 8.6 | 29.4 | 22.0 | 55.1 | 11.9 |
| SpatialTracker | 61.5 | 7.3 | 58.3 | 5.9 | 65.5 | 2.2 |
| TAPIP3D | 84.3 | 3.1 | 38.8 | 8.2 | 72.4 | 1.3 |
| Triplane Baseline | 65.1 | 7.2 | 57.5 | 4.3 | 74.7 | 1.2 |
| MVTracker | 86.0 | 3.1 | 71.6 | 2.0 | 81.4 | 0.7 |

MVTracker achieves state-of-the-art performance across all datasets and metrics. Compared to the strongest monocular baseline TAPIP3D, it improves AJ by +32.8 on DexYCB (38.8→71.6) and reduces MTE by 75.6% (8.2→2.0 cm).

Ablation Study on Association Components (Table 2, MV-Kubric)

| Variant | AJ↑ | δ_avg↑ | MTE↓ (cm) |
|---|---|---|---|
| No offset | 21.3 | 45.3 | 15.6 |
| Offset + absolute position | 48.7 | 59.6 | 6.8 |
| Offset only | 53.6 | 64.9 | 4.3 |

Explicit 3D offset vectors are critical for kNN association, with relative-offset-only encoding yielding the best performance.

Effect of Number of Views (Figure 4, DexYCB)

The paper's figure compares MVTracker against SpatialTracker and CoTracker3; only MVTracker's AJ values are reproduced here.

| No. of Views | MVTracker AJ↑ |
|---|---|
| 1 | 64.0 |
| 4 | 71.1 |
| 8 | 79.2 |

MVTracker's performance scales consistently with the number of views, demonstrating strong multi-view information utilization.

Robustness to Camera Configuration (Table 3)

Across different 4-camera layouts (opposing/adjacent placements), MVTracker consistently outperforms all baselines by a substantial margin. On Panoptic Studio, AJ scores for Setups A/B/C are 86.0/75.7/83.2, respectively.

Key Findings

  • MVTracker runs at 7.2 FPS given RGB-D input, suitable for near-real-time applications.
  • Training exclusively on 5K synthetic Kubric sequences generalizes well to real-world scenes.
  • The model supports flexible inputs of 1–8 views and 24–150 frames.

Highlights & Insights

  1. Fused point cloud > Triplane: This is the paper's most important technical contribution. The point cloud representation avoids projection collisions and fixed boundary issues, with advantages particularly pronounced in multi-view settings. Using identical training data and framework, replacing only the representation reduces AJ by more than 20 points.
  2. kNN + explicit offsets: In 3D space, kNN neighbors arrive from arbitrary directions; unlike 2D grids where direction is implicitly encoded, explicit offset vectors are essential for disambiguation.
  3. Practical deployment: Only 4 cameras are needed to substantially surpass 27-camera optimization-based methods, greatly lowering deployment barriers.
  4. Synthetic-to-real generalization: Training solely on synthetic data achieves state-of-the-art results on real-world data, attributed to comprehensive data augmentation strategies including view count randomization and depth source mixing.
  5. Feed-forward vs. optimization: Online inference at 7.2 FPS vastly outpaces the per-sequence optimization required by Dynamic 3DGS and Shape of Motion.

Limitations & Future Work

  1. Depth estimation dependency: The method is heavily reliant on depth map quality. Without sensor-based depth, it falls back on estimators such as DUSt3R/VGGT, which may be unreliable or fail under sparse camera configurations.
  2. Scene normalization: The model is trained on synthetic data at a fixed scale; at test time, manual or heuristic similarity transforms are required to adapt to scenes of varying scale, lacking an automated solution.
  3. Bounded scene assumption: The method operates only within regions of sufficient camera overlap; extending to outdoor unbounded environments poses challenges due to insufficient training data and weak viewpoint constraints.
  4. Lack of real-world training data: Complete reliance on synthetic training data and the absence of large-scale real-world 3D point tracking annotations limit generalization capacity.
  • Future directions: The authors identify joint optimization of depth estimation and tracking, 4D reconstruction foundation models, and self-supervised learning from real-world video as key research directions.

Relation to Prior Work

  • Scene flow methods (e.g., [30, 28]): Address dense 3D motion between two frames only and cannot perform long-range tracking; MVTracker can be viewed as a long-range extension of scene flow.
  • 2D point tracking (CoTracker2/3, LocoTrack): Provide mature architectural designs including sliding windows and virtual track points; MVTracker directly inherits the spatio-temporal Transformer framework from CoTracker2.
  • SpatialTracker: The first feed-forward method to extend point tracking to 3D, but its triplane representation becomes a bottleneck in multi-view settings; MVTracker's point cloud representation is a direct and effective improvement.
  • TAPIP3D: A strong monocular 3D tracker that approaches MVTracker on Panoptic Studio but falls far behind on DexYCB, underscoring the substantial value of multi-view information under depth estimation noise.

Rating

| Dimension | Score (/10) | Notes |
|---|---|---|
| Novelty | 8 | First data-driven multi-view 3D point tracker; fused point cloud + kNN association design is novel and effective |
| Technical Depth | 8 | Method design is clear and complete; ablation studies thoroughly validate each design choice |
| Experimental Thoroughness | 8 | Three datasets, multiple baselines, and rich ablations; limited real-world scene evaluation |
| Writing Quality | 9 | Clear structure, high-quality figures, well-motivated problem statement |
| Value | 8 | Practical 4-camera feed-forward inference at 7.2 FPS approaching real-time; deployment limited by depth dependency |
| Overall | 8.2 | A pioneering contribution to multi-view 3D point tracking with sound technical design and convincing experiments |
