# VisualSync: Multi-Camera Synchronization via Cross-View Object Motion
**Conference:** NeurIPS 2025 · **arXiv:** 2512.02017 · **Code:** Project Page · **Area:** 3D Vision · **Keywords:** multi-camera synchronization, epipolar geometry, video alignment, dynamic scene reconstruction, temporal offset estimation
## TL;DR
VisualSync presents a multi-camera temporal synchronization framework grounded in epipolar geometry constraints. By leveraging pretrained vision foundation models (VGGT, CoTracker3, MAST3R) to extract motion trajectories and cross-view correspondences, the method estimates per-camera temporal offsets by minimizing the Sampson error, achieving median synchronization errors below 50 ms on three of the four benchmarks.
## Background & Motivation
Multi-view video capture has become increasingly prevalent in everyday settings (concerts, sporting events, family gatherings, etc.), yet temporal synchronization across heterogeneous devices remains an open problem. Limitations of existing synchronization approaches include:
- Geometry-based methods (Albl et al.): rely on static scene assumptions or fixed viewpoints
- Human-pose-driven methods: constrained by pose estimation accuracy and require human presence
- Audio-based methods: require clean audio signals, unsuitable for noisy environments
- Sync-NeRF: jointly optimizes temporal offsets and radiance fields, but is restricted to controlled environments
The core insight of VisualSync is elegantly simple: when two frames from different cameras are correctly aligned in time, they observe the scene at the same physical instant, so every correspondence, including points on dynamic objects, must satisfy the epipolar constraint \(\mathbf{x}'^T \mathbf{F} \mathbf{x} = 0\). When the temporal offset is incorrect, correspondences on dynamic objects deviate from their epipolar lines. Minimizing this deviation recovers the correct temporal alignment.
Although the geometric principle is straightforward, building a system that works robustly on in-the-wild videos is non-trivial: it requires reliable fundamental matrix estimation, detection and tracking of dynamic objects, and cross-view correspondence establishment. VisualSync's contribution lies in harnessing state-of-the-art vision foundation models to address each of these sub-problems.
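To make the epipolar criterion concrete, here is a minimal NumPy sketch of the Sampson error for a single correspondence. The function name and the toy rectified-stereo fundamental matrix are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def sampson_error(F, x, xp):
    """First-order (Sampson) approximation of the epipolar distance for a
    correspondence x <-> x' (homogeneous 3-vectors) under fundamental matrix F."""
    Fx = F @ x        # epipolar line of x in the second view
    Ftxp = F.T @ xp   # epipolar line of x' in the first view
    algebraic = xp @ F @ x
    return algebraic**2 / (Fx[0]**2 + Fx[1]**2 + Ftxp[0]**2 + Ftxp[1]**2)

# Toy F of a rectified horizontal stereo pair: the constraint reduces to v == v'.
F = np.array([[0., 0., 0.],
              [0., 0., -1.],
              [0., 1., 0.]])

print(sampson_error(F, np.array([0., 0., 1.]), np.array([1., 0., 1.])))   # -> 0.0 (on the epipolar line)
print(sampson_error(F, np.array([0., 0., 1.]), np.array([1., 0.5, 1.])))  # -> 0.125 (off the line)
```

The error vanishes exactly when \(\mathbf{x}'^T \mathbf{F} \mathbf{x} = 0\) and grows as a point drifts off its epipolar line, which is what makes it usable as a synchronization energy.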
## Method
### Overall Architecture
VisualSync adopts a three-stage pipeline:
- Stage 0: visual cue extraction (camera parameters, motion trajectories, cross-view correspondences)
- Stage 1: pairwise temporal offset estimation via exhaustive search for the minimum Sampson error
- Stage 2: global optimization to recover globally consistent temporal offsets
Given \(N\) asynchronous videos \(\{\mathbf{V}^i\}_{i=1}^N\), the goal is to estimate a temporal offset \(s^i \in \mathbb{R}\) for each video such that synchronized frames correspond to the same physical instant.
### Key Designs
- Sampson-error-based alignment energy: For camera pair \((i,j)\) with fundamental matrix \(\mathbf{F}^{ij}\), the pairwise energy is defined as the sum of Sampson errors over all matched trajectory pairs \(k\) and all time instants \(t\), \(E_{ij}(\Delta) = \sum_{k}\sum_{t} d_S\big(\mathbf{x}^i_{k,t},\, \mathbf{x}^j_{k,t+\Delta};\, \mathbf{F}^{ij}\big)\), where \(d_S\) denotes the Sampson error of a correspondence.
The Sampson error is a first-order approximation and lower bound of the true epipolar distance; it admits a closed-form analytical expression and is robust to noise. Comparisons against alternative geometric measures (algebraic error, symmetric epipolar distance, cosine error) confirm its superiority. Global synchronization is formulated as \(\{s^i\} = \arg\min \sum_{i<j} E_{ij}(\Delta^{ij})\), where \(\Delta^{ij} = s^j - s^i\).
- Visual cue extraction (Stage 0): Multiple pretrained models are combined:
  - VGGT: jointly estimates camera intrinsics and extrinsic trajectories for all views (for static cameras, only the first frame is used)
- GPT-4o + GroundedSAM + DEVA: automatically identifies dynamic object categories and generates temporally consistent segmentation masks
- CoTracker3: performs dense 2D point tracking within dynamic regions to obtain per-video temporal trajectories
- MAST3R: establishes spatial cross-view correspondences via keyframe sampling and instance-level matching to filter noisy associations
- Divide-and-conquer optimization strategy: Stage 1 independently searches for the optimal offset for each camera pair, \(\Delta^{ij*} = \arg\min_{\Delta \in \mathcal{S}} E_{ij}(\Delta)\), and filters unreliable pairs via energy landscape analysis (threshold of 0.1 on the ratio of the optimal energy to the second-best local minimum). Stage 2 recovers global offsets via robust least squares: \(\{s^i\}^* = \arg\min \sum_{(i,j) \in \mathcal{E}} \rho_\delta(s^j - s^i - \Delta^{ij})\), solved with Huber loss \(\rho_\delta\) and iteratively reweighted least squares (IRLS).
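The Stage-1 pairwise search can be sketched as follows, under simplifying assumptions: a single matched track per camera pair, frame-level (integer) candidate offsets, and illustrative names (`sampson_error`, `pairwise_offset` are not from the paper). The real energy \(E_{ij}\) also sums over all matched trajectory pairs and applies the energy-landscape filtering described above:

```python
import numpy as np

def sampson_error(F, x, xp):
    Fx, Ftxp = F @ x, F.T @ xp
    return (xp @ F @ x)**2 / (Fx[0]**2 + Fx[1]**2 + Ftxp[0]**2 + Ftxp[1]**2)

def pairwise_offset(F, traj_i, traj_j, offsets):
    """Exhaustively score each candidate offset by the mean Sampson error of
    time-shifted correspondences and return the minimiser (Stage 1 sketch)."""
    T = len(traj_i)
    energies = []
    for d in offsets:
        errs = [sampson_error(F, traj_i[t], traj_j[t + d])
                for t in range(T) if 0 <= t + d < len(traj_j)]
        energies.append(np.mean(errs) if errs else np.inf)  # no overlap -> skip
    return offsets[int(np.argmin(energies))]

# Synthetic check: camera j lags camera i by 2 frames on a rectified pair.
F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
traj_i = np.array([[0., float(t), 1.] for t in range(10)])
traj_j = np.array([[1., float(t) - 2., 1.] for t in range(10)])
print(pairwise_offset(F, traj_i, traj_j, list(range(-3, 4))))  # -> 2
```

Only at the true offset do the time-shifted correspondences land on their epipolar lines, so the energy landscape has a sharp minimum there; the ratio between the best and second-best local minima is then a natural reliability score for a pair.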
### Loss & Training
VisualSync is an optimization-based framework rather than a learning-based one:
- The energy function is grounded in epipolar geometry via the Sampson error
- Stage 1 employs exhaustive search (step size determined by the frame rate)
- Stage 2 is solved with IRLS as a robust least-squares problem
- Unreliable pairs are automatically discarded through energy landscape analysis
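The Stage-2 solve can be sketched as a small IRLS loop with Huber weights. This is a generic robust least-squares implementation, not the paper's code; fixing \(s^0 = 0\) to remove the global time shift and the Huber threshold value are my assumptions:

```python
import numpy as np

def global_offsets(n, pairs, deltas, huber=1.0, iters=20):
    """Recover per-camera offsets s from pairwise estimates s^j - s^i ~ Delta^{ij}
    by iteratively reweighted least squares with Huber weights."""
    A = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        A[row, i], A[row, j] = -1.0, 1.0   # one row per constraint s^j - s^i
    b = np.asarray(deltas, dtype=float)
    A = A[:, 1:]                           # gauge fixing: s^0 = 0
    s = np.zeros(n - 1)
    for _ in range(iters):
        r = A @ s - b                      # current residuals
        w = np.ones_like(r)                # Huber weights: 1 inside the threshold,
        big = np.abs(r) > huber            # huber/|r| outside (down-weights outliers)
        w[big] = huber / np.abs(r[big])
        sw = np.sqrt(w)
        s, *_ = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)
    return np.concatenate([[0.0], s])

# Consistent pairwise estimates are recovered exactly: s = [0, 1, 2.5].
pairs = [(0, 1), (0, 2), (1, 2)]
print(global_offsets(3, pairs, [1.0, 2.5, 1.5]))
```

With an outlier pair added (e.g. a spurious \(\Delta^{02} = 10\)), the Huber reweighting pulls the solution back toward the consistent measurements, whereas plain least squares would be dragged far off; this mirrors the role IRLS plays in the ablation table below.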
## Key Experimental Results
### Main Results
Video synchronization error (milliseconds):
| Method | EgoHumans \(\delta_{med}\)↓ | CMU Panoptic \(\delta_{med}\)↓ | 3D-POP \(\delta_{med}\)↓ | UDBD \(\delta_{med}\)↓ |
|---|---|---|---|---|
| Uni4D* | 222.1 | 99.9 | 1265.4 | 25.1 |
| MAST3R | 263.8 | 58.1 | 72.2 | 7.4 |
| Sync-NeRF* | - | 866.7 | 1100.0 | 0.2 |
| VisualSync | 46.6 | 41.5 | 77.8 | 5.9 |
Methods marked with an asterisk (\*) use GT camera poses.
Pairwise synchronization accuracy (A@100/A@500):
| Method | EgoHumans | CMU Panoptic | 3D-POP | UDBD |
|---|---|---|---|---|
| Uni4D* | 23.8/49.4 | 32.3/60.7 | 0.9/9.5 | 46.2/74.1 |
| MAST3R | 24.3/50.4 | 29.6/49.8 | 15.7/69.1 | 77.8/95.4 |
| VisualSync | 33.9/55.8 | 26.0/51.2 | 33.3/69.3 | 82.1/94.3 |
### Ablation Study
Key component ablation (EgoHumans dataset):
| Segmentation | Correspondence | Camera | Energy | Solver | \(\delta_{med}\)↓ | Note |
|---|---|---|---|---|---|---|
| GT | GT | GT | Sampson | IRLS | 2.0 | Ideal upper bound |
| DEVA | CoTracker+MAST3R | GT | Sampson | IRLS | 28.6 | GT poses used |
| DEVA | CoTracker+MAST3R | VGGT | Inlier | IRLS | 1544.8 | RANSAC baseline |
| DEVA | CoTracker+MAST3R | VGGT | Cosine | IRLS | 94.6 | Cosine error |
| DEVA | CoTracker+MAST3R | VGGT | Sampson | LS | 118.0 | Standard least squares |
| DEVA | CoTracker+MAST3R | VGGT | Sampson | IRLS | 46.6 | Full method |
Effect of input pair ratio:
| Pair Ratio | Pseudo-pair Detection | \(\delta_{med}\)↓ | Note |
|---|---|---|---|
| RST (minimum spanning) | ✓ | 130.0±24.5 | Significant degradation |
| 50% | ✓ | 70.7±1.3 | Still functional |
| 100% | ✗ | 111.5 | Filtering omitted |
| 100% | ✓ | 46.6 | Full pipeline |
### Key Findings
- Sampson error outperforms all other geometric measures due to its noise robustness and role as a lower bound on epipolar distance
- Pseudo-pair filtering is critical: omitting it raises \(\delta_{med}\) from 46.6 to 111.5
- IRLS substantially outperforms standard least squares (46.6 vs. 118.0) by down-weighting unreliable estimates
- Even with only 50% of pairs, performance remains acceptable (70.7 ms), demonstrating robustness
- Performance remains consistent across frame rates from 5 to 30 fps (51.5 ms vs. 41.5 ms median error), indicating strong adaptability
## Highlights & Insights
- Principled simplicity: temporal synchronization is addressed from first principles via epipolar constraints, yielding an elegant mathematical formulation
- Modular design: each sub-module (tracking, matching, pose estimation) can be independently replaced or upgraded
- Practical applicability: requires no GT inputs (e.g., camera poses) and generalizes to in-the-wild videos
- The work demonstrates the feasibility of integrating multiple recent vision foundation models (VGGT, CoTracker3, MAST3R, GPT-4o) into a coherent, practical system
- The method achieves 46.6 ms accuracy on the challenging EgoHumans benchmark (egocentric viewpoints with large temporal offsets)
## Limitations & Future Work
- Relies on reliable camera pose estimation for a subset of frames, even though accuracy is not required for every frame
- Cannot handle videos with non-uniform playback speeds (e.g., alternating slow-motion and normal-speed segments)
- Pairwise estimation complexity is \(\mathcal{O}(N^2)\), limiting scalability in large-scale settings
- Depends on upstream modules (segmentation, matching, pose estimation), whose errors propagate to the final synchronization result
## Related Work & Insights
- Compared to Sync-NeRF, VisualSync does not require joint radiance field optimization, making it more efficient and general
- The strategy of combining CoTracker (temporal dimension) and MAST3R (spatial dimension) for cross-view correspondence establishment is a noteworthy design pattern
- Energy landscape analysis for automatic pseudo-pair filtering is a practically effective technique
- Synchronized videos can be directly fed into downstream novel-view synthesis systems such as K-Planes, underscoring the foundational role of synchronization for 4D reconstruction
## Rating
- Novelty: ⭐⭐⭐⭐ The principle of epipolar-constraint-based synchronization is not entirely new, but systematically leveraging modern vision foundation models to make it work in the wild constitutes a meaningful contribution
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four diverse benchmarks, comprehensive ablations, and downstream application validation
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and mathematical derivations are rigorous
- Value: ⭐⭐⭐⭐ Addresses a foundational problem in multi-camera 4D reconstruction with strong practical utility