VisualSync: Multi-Camera Synchronization via Cross-View Object Motion

Conference: NeurIPS 2025 · arXiv: 2512.02017 · Code: Project Page · Area: 3D Vision · Keywords: multi-camera synchronization, epipolar geometry, video alignment, dynamic scene reconstruction, temporal offset estimation

TL;DR

VisualSync presents a multi-camera temporal synchronization framework grounded in epipolar geometry constraints. By leveraging pretrained vision foundation models (VGGT, CoTracker3, MAST3R) to extract motion trajectories and cross-view correspondences, the method estimates per-camera temporal offsets by minimizing Sampson error, achieving median synchronization errors below 50 ms on three of the four benchmarks.

Background & Motivation

Multi-view video capture has become increasingly prevalent in everyday settings (concerts, sporting events, family gatherings, etc.), yet temporal synchronization across heterogeneous devices remains an open problem. Limitations of existing synchronization approaches include:

  • Geometry-based methods (Albl et al.): rely on static scene assumptions or fixed viewpoints
  • Human-pose-driven methods: constrained by pose estimation accuracy and require human presence
  • Audio-based methods: require clean audio signals, unsuitable for noisy environments
  • Sync-NeRF: jointly optimizes temporal offsets and radiance fields, but is restricted to controlled environments

The core insight of VisualSync is elegantly simple: frames captured at the same physical instant observe the scene in a single frozen state, so every correspondence, including points on moving objects, must satisfy the epipolar constraint \(\mathbf{x}'^T \mathbf{F} \mathbf{x} = 0\). When the temporal offset is wrong, a dynamic point has moved between the two frames, so its correspondences deviate from the epipolar lines. Minimizing this deviation recovers the correct temporal alignment.

Although the geometric principle is straightforward, building a system that works robustly on in-the-wild videos is non-trivial: it requires reliable fundamental matrix estimation, detection and tracking of dynamic objects, and cross-view correspondence establishment. VisualSync's contribution lies in harnessing state-of-the-art vision foundation models to address each of these sub-problems.
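
To make the constraint concrete, here is a minimal NumPy sketch (all camera parameters and the point trajectory are invented for illustration) that projects a moving 3D point into two views and checks the epipolar residual: it vanishes only when both views observe the same instant.

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Two toy pinhole cameras: view 1 at the origin, view 2 rotated about
# the y-axis and translated along x. All values are illustrative.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([1.0, 0.0, 0.0])
F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)  # x2^T F x1 = 0

def project(K, R, t, X):
    x = K @ (R @ X + t)
    return x / x[2]  # homogeneous pixel coordinates

def traj(s):
    """3D position of a moving point at time s (straight-line motion)."""
    return np.array([0.2 * s, 0.1 * s, 5.0 + 0.05 * s])

x1 = project(K, np.eye(3), np.zeros(3), traj(2.0))  # view 1 at the true instant
for dt in [0.0, 0.5, 1.0, 2.0]:
    x2 = project(K, R, t, traj(2.0 + dt))           # view 2, misaligned by dt
    print(f"offset {dt:.1f}s -> |x2^T F x1| = {abs(x2 @ F @ x1):.2e}")
```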

Method

Overall Architecture

VisualSync adopts a three-stage pipeline:

  • Stage 0: visual cue extraction (camera parameters, motion trajectories, cross-view correspondences)
  • Stage 1: pairwise temporal offset estimation via exhaustive search for the minimum Sampson error
  • Stage 2: global optimization to recover globally consistent temporal offsets

Given \(N\) asynchronous videos \(\{\mathbf{V}^i\}_{i=1}^N\), the goal is to estimate a temporal offset \(s^i \in \mathbb{R}\) for each video such that synchronized frames correspond to the same physical instant.
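
Under these definitions, the pipeline composes roughly as follows. This is a structural sketch only: `extract_visual_cues` stands in for the Stage-0 foundation models, and the other two helpers are sketched (with slightly fuller signatures) in the Key Designs subsections below.

```python
from itertools import combinations

def visualsync(videos):
    """Three-stage sketch; helper names are illustrative, not the paper's API."""
    # Stage 0: per-video cues (poses, dynamic-object tracks) and cross-view
    # track matches, via VGGT, GroundedSAM+DEVA, CoTracker3, and MAST3R.
    cues = [extract_visual_cues(v) for v in videos]
    # Stage 1: exhaustive pairwise offset search; keep only reliable pairs.
    pairwise = {}
    for i, j in combinations(range(len(videos)), 2):
        delta, reliable = estimate_pairwise_offset(cues[i], cues[j])
        if reliable:
            pairwise[(i, j)] = delta
    # Stage 2: globally consistent per-video offsets from the pair graph.
    return solve_global_offsets(pairwise, n_videos=len(videos))
```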

Key Designs

  1. Sampson-error-based alignment energy: For camera pair \((i,j)\), the pairwise energy is defined as the sum of Sampson errors over all matched trajectory pairs and all time instants:
\[E_{ij}(\Delta) = \sum_{(\mathbf{x}^i, \mathbf{x}^j)} \sum_t \frac{(\mathbf{x}^i(t+\Delta)^\top \mathbf{F}_{t+\Delta,t}^{ij} \mathbf{x}^j(t))^2}{\|\mathbf{F}_{t+\Delta,t}^{ij} \mathbf{x}^j(t)\|_{1,2}^2 + \|\mathbf{F}_{t+\Delta,t}^{ij\top} \mathbf{x}^i(t+\Delta)\|_{1,2}^2},\]
where \(\|\mathbf{v}\|_{1,2}^2 = v_1^2 + v_2^2\) sums the squares of a vector's first two components.

The Sampson error is a first-order approximation and lower bound of the true epipolar distance; it admits a closed-form analytical expression and is robust to noise. Comparisons against alternative geometric measures (algebraic error, symmetric epipolar distance, cosine error) confirm its superiority. Global synchronization is formulated as \(\{s^i\} = \arg\min \sum_{i<j} E_{ij}(\Delta^{ij})\), where \(\Delta^{ij} = s^j - s^i\).
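
A single term of this energy translates directly into NumPy. The sketch below assumes homogeneous 2D points and reads \(\|\mathbf{v}\|_{1,2}^2\) as \(v_1^2 + v_2^2\), the standard Sampson denominator:

```python
import numpy as np

def sampson_error(F, xi, xj):
    """One Sampson term: xi is the homogeneous point x^i(t+Delta) in view i,
    xj is x^j(t) in view j, and F is F^{ij}_{t+Delta,t}."""
    r = xi @ F @ xj                      # algebraic epipolar residual
    Fxj, Ftxi = F @ xj, F.T @ xi
    # ||.||_{1,2}^2: sum of squares of the first two vector components.
    return r**2 / (Fxj[0]**2 + Fxj[1]**2 + Ftxi[0]**2 + Ftxi[1]**2)
```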

  2. Visual cue extraction (Stage 0): Multiple pretrained models are combined:

    • VGGT: jointly estimates camera intrinsics and per-frame extrinsics for all views (for static cameras, only the first frame is used)
    • GPT-4o + GroundedSAM + DEVA: automatically identifies dynamic object categories and generates temporally consistent segmentation masks
    • CoTracker3: performs dense 2D point tracking within dynamic regions to obtain per-video temporal trajectories
    • MAST3R: establishes spatial cross-view correspondences via keyframe sampling and instance-level matching to filter noisy associations
  3. Divide-and-conquer optimization strategy: Stage 1 independently searches for the optimal offset for each camera pair, \(\Delta^{ij*} = \arg\min_{\Delta \in \mathcal{S}} E_{ij}(\Delta)\), and filters unreliable pairs via energy landscape analysis (threshold of 0.1 on the ratio of the optimal energy to the second-best local minimum). Stage 2 recovers global offsets via robust least squares: \(\{s^i\}^* = \arg\min \sum_{(i,j) \in \mathcal{E}} \rho_\delta(s^j - s^i - \Delta^{ij})\), solved with the Huber loss \(\rho_\delta\) and iteratively reweighted least squares (IRLS). Both stages are sketched in code below.
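
A minimal sketch of Stage 1, reusing `sampson_error` from above. The track container layout, the `fundamental(ti, tj)` lookup, and the treatment of unimodal energy landscapes are assumptions made for illustration, not the paper's specification:

```python
import numpy as np

def pair_energy(delta, tracks_i, tracks_j, fundamental):
    """E_ij(delta): Sampson error summed over matched tracks and frames.
    tracks_*: {track_id: {frame: homogeneous 2D point}}; fundamental(ti, tj)
    returns the fundamental matrix between frame ti of view i and tj of view j."""
    total = 0.0
    for tid, tr_j in tracks_j.items():
        tr_i = tracks_i.get(tid, {})
        for t, xj in tr_j.items():
            xi = tr_i.get(t + delta)
            if xi is not None:
                total += sampson_error(fundamental(t + delta, t), xi, xj)
    return total

def estimate_pairwise_offset(tracks_i, tracks_j, fundamental,
                             candidates, ratio_thresh=0.1):
    """Stage 1: grid search over candidate frame offsets, followed by the
    energy-landscape ratio test described above."""
    E = np.array([pair_energy(d, tracks_i, tracks_j, fundamental)
                  for d in candidates])
    best = int(np.argmin(E))
    # Interior local minima of the 1-D energy landscape, excluding the optimum.
    minima = [E[k] for k in range(1, len(E) - 1)
              if E[k] <= E[k - 1] and E[k] <= E[k + 1] and k != best]
    # Reliable if the optimum sits well below the second-best local minimum;
    # a unimodal landscape is treated as reliable here (an assumption).
    reliable = not minima or E[best] < ratio_thresh * min(minima)
    return candidates[best], reliable
```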

Loss & Training

VisualSync is an optimization-based framework rather than a learning-based one:

  • The energy function is grounded in epipolar geometry via the Sampson error
  • Stage 1 employs exhaustive search (step size determined by the frame rate)
  • Stage 2 is solved with IRLS for robust least squares
  • Unreliable pairs are automatically discarded through energy landscape analysis
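
Stage 2 then reduces to a small robust regression over the surviving pair graph. Below is a sketch using IRLS with Huber weights, pinning \(s^0 = 0\) to remove the global time-shift ambiguity; the Huber threshold value is a tunable not given in this summary.

```python
import numpy as np

def solve_global_offsets(pairwise, n_videos, huber_delta=1.0, iters=20):
    """Recover offsets s^i from pairwise estimates Delta^{ij} = s^j - s^i
    by iteratively reweighted least squares with Huber weights."""
    s = np.zeros(n_videos)
    for _ in range(iters):
        rows, rhs, w = [], [], []
        for (i, j), d in pairwise.items():
            row = np.zeros(n_videos)
            row[j], row[i] = 1.0, -1.0
            r = s[j] - s[i] - d                 # residual under current estimate
            w.append(1.0 if abs(r) <= huber_delta else huber_delta / abs(r))
            rows.append(row)
            rhs.append(d)
        # Pin s^0 = 0 with a high-weight pseudo-observation (gauge fixing).
        row0 = np.zeros(n_videos)
        row0[0] = 1.0
        rows.append(row0); rhs.append(0.0); w.append(1e6)
        A, b, sw = np.array(rows), np.array(rhs), np.sqrt(np.array(w))
        s, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return s
```

For instance, with the consistent triple `pairwise = {(0, 1): 1.2, (0, 2): -0.5, (1, 2): -1.7}` and `n_videos=3`, this returns approximately `s = [0.0, 1.2, -0.5]`.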

Key Experimental Results

Main Results

Video synchronization error (milliseconds):

| Method | EgoHumans \(\delta_{med}\) | CMU Panoptic \(\delta_{med}\) | 3D-POP \(\delta_{med}\) | UDBD \(\delta_{med}\) |
|---|---|---|---|---|
| Uni4D* | 222.1 | 99.9 | 1265.4 | 25.1 |
| MAST3R | 263.8 | 58.1 | 72.2 | 7.4 |
| Sync-NeRF* | - | 866.7 | 1100.0 | 0.2 |
| VisualSync | 46.6 | 41.5 | 77.8 | 5.9 |

Methods marked with * use GT camera poses.

Pairwise synchronization accuracy (A@100/A@500):

| Method | EgoHumans | CMU Panoptic | 3D-POP | UDBD |
|---|---|---|---|---|
| Uni4D* | 23.8 / 49.4 | 32.3 / 60.7 | 0.9 / 9.5 | 46.2 / 74.1 |
| MAST3R | 24.3 / 50.4 | 29.6 / 49.8 | 15.7 / 69.1 | 77.8 / 95.4 |
| VisualSync | 33.9 / 55.8 | 26.0 / 51.2 | 33.3 / 69.3 | 82.1 / 94.3 |

Ablation Study

Key component ablation (EgoHumans dataset):

| Segmentation | Correspondence | Camera | Energy | Solver | \(\delta_{med}\) | Note |
|---|---|---|---|---|---|---|
| GT | GT | GT | Sampson | IRLS | 2.0 | Ideal upper bound |
| DEVA | CoTracker+MAST3R | GT | Sampson | IRLS | 28.6 | GT poses used |
| DEVA | CoTracker+MAST3R | VGGT | Inlier | IRLS | 1544.8 | RANSAC baseline |
| DEVA | CoTracker+MAST3R | VGGT | Cosine | IRLS | 94.6 | Cosine error |
| DEVA | CoTracker+MAST3R | VGGT | Sampson | LS | 118.0 | Standard least squares |
| DEVA | CoTracker+MAST3R | VGGT | Sampson | IRLS | 46.6 | Full method |

Effect of input pair ratio:

| Pair Ratio | Pseudo-pair Detection | \(\delta_{med}\) | Note |
|---|---|---|---|
| RST (minimum spanning) | ✓ | 130.0 ± 24.5 | Significant degradation |
| 50% | ✓ | 70.7 ± 1.3 | Still functional |
| 100% | ✗ | 111.5 | Filtering omitted |
| 100% | ✓ | 46.6 | Full pipeline |

Key Findings

  • Sampson error outperforms all other geometric measures due to its noise robustness and role as a lower bound on epipolar distance
  • Pseudo-pair filtering is critical: omitting it raises \(\delta_{med}\) from 46.6 to 111.5
  • IRLS substantially outperforms standard least squares (46.6 vs. 118.0) by down-weighting unreliable estimates
  • Even with only 50% of pairs, performance remains acceptable (70.7 ms), demonstrating robustness
  • Performance is consistent across frame rates from 5 to 30 fps (51.5 vs. 41.5 ms median error), indicating strong adaptability

Highlights & Insights

  • Principled simplicity: temporal synchronization is addressed from first principles via epipolar constraints, yielding an elegant mathematical formulation
  • Modular design: each sub-module (tracking, matching, pose estimation) can be independently replaced or upgraded
  • Practical applicability: requires no GT inputs (e.g., camera poses) and generalizes to in-the-wild videos
  • The work demonstrates the feasibility of integrating multiple recent vision foundation models (VGGT, CoTracker3, MAST3R, GPT-4o) into a coherent, practical system
  • The method achieves 46.6 ms accuracy on the challenging EgoHumans benchmark (egocentric viewpoints with large temporal offsets)
  • Compared to Sync-NeRF, VisualSync does not require joint radiance field optimization, making it more efficient and general
  • The strategy of combining CoTracker (temporal dimension) and MAST3R (spatial dimension) for cross-view correspondence establishment is a noteworthy design pattern
  • Energy landscape analysis for automatic pseudo-pair filtering is a practically effective technique
  • Synchronized videos can be directly fed into downstream novel-view synthesis systems such as K-Planes, underscoring the foundational role of synchronization for 4D reconstruction

Limitations & Future Work

  • Requires accurate camera pose estimates for at least a subset of frames, even though per-frame pose accuracy is not needed
  • Cannot handle videos with non-uniform playback speeds (e.g., alternating slow-motion and normal-speed segments)
  • Pairwise estimation has \(\mathcal{O}(N^2)\) complexity, which limits scalability to large numbers of cameras
  • Depends on upstream modules (segmentation, matching, pose estimation), whose errors propagate to the final synchronization result

Rating

  • Novelty: ⭐⭐⭐⭐ The principle of epipolar-constraint-based synchronization is not entirely new, but systematically leveraging modern vision foundation models to make it work in the wild constitutes a meaningful contribution
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four diverse benchmarks, comprehensive ablations, and downstream application validation
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and mathematical derivations are rigorous
  • Value: ⭐⭐⭐⭐ Addresses a foundational problem in multi-camera 4D reconstruction with strong practical utility