
Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding

Conference: NeurIPS 2025 · arXiv: 2512.03601 · Code: GitHub · Area: 3D Vision
Keywords: 4D scene understanding, 3D Gaussian splatting, motion estimation, semantic segmentation, video object segmentation

TL;DR

Motion4D proposes a unified 4D Gaussian splatting framework that incorporates priors from 2D foundation models (semantic masks, point tracking, depth) into 3D representations via an iterative refinement strategy, achieving spatiotemporally consistent motion and semantic modeling. The method significantly outperforms existing approaches on video object segmentation, point tracking, and novel view synthesis tasks.

Background & Motivation

Recent 2D visual foundation models (e.g., SAM2, Track Any Point, Depth Anything) have achieved remarkable results in per-frame processing, yet fundamentally lack 3D consistency. In real-world dynamic scenes, models such as SAM2 frequently exhibit spatial misalignment and temporal flickering due to their reliance on frame-by-frame processing without explicit 3D reasoning.

Existing methods for lifting 2D models to 3D face two primary challenges:

Most methods are limited to static scenes: Semantic predictions from multiple views are fused into 3DGS/NeRF representations, but these approaches cannot handle the motion complexity and occlusions inherent in dynamic environments.

Decoupled modeling of semantics and motion: Existing dynamic methods (e.g., Semantic Flow, SADG) either learn feature fields independently of the 3D model or treat semantic understanding and motion estimation as separate processes, resulting in a lack of consistency.

The core motivation of Motion4D is to construct a unified dynamic representation that simultaneously models motion and semantics from monocular video, leveraging an iterative optimization strategy to enable mutual enhancement between 2D priors and 3D representations.

Method

Overall Architecture

Motion4D employs a two-stage iterative optimization framework:

  • Sequential Optimization: Motion and semantic fields are updated alternately within short temporal windows to maintain local consistency.
  • Global Optimization: All attributes are jointly optimized over the full sequence to ensure long-term coherence.

The input consists of an RGB video with known camera poses \(\{I_t\}\), along with priors generated by 2D pretrained models: object masks \(\mathbf{M}_t\), 2D point trajectories \(\mathbf{U}_{t \to t'}\), and monocular depth \(\mathbf{D}_t\). The objective is to estimate spatiotemporally consistent semantics \(\hat{\mathbf{M}}_t\) and motion \(\{\hat{\mathbf{U}}_{t \to t'}, \hat{\mathbf{D}}_t\}\).

Key Designs

  1. 4D Scene Representation: Building upon standard 3DGS, the framework extends Gaussians with motion and semantic fields. The motion field models rigid transformations via a set of global motion bases \(\{\hat{\mathbf{T}}_b^{0 \to t}\}_{b=1}^{B}\) and per-Gaussian blend coefficients \(w_i^b\), transforming each Gaussian from the canonical frame to the target frame as \(\mathbf{T}_i^{0 \to t} = \sum_{b=1}^{B} w_i^b \hat{\mathbf{T}}_b^{0 \to t}\) (a minimal blending sketch follows this list). The semantic field is embedded directly into each Gaussian and rendered into per-pixel semantic features via volumetric rendering analogous to color rendering. This design enables joint modeling of geometry, motion, and semantics within a single unified representation.

  2. Iterative Motion Refinement: The key innovation lies in introducing 3D confidence maps and adaptive resampling. Since 2D tracking networks do not support interactive correction, Motion4D assigns each Gaussian an uncertainty value \(u_i \in \mathbb{R}\) and renders per-pixel confidence weights \(w(p)\) that modulate the tracking and depth supervision losses: \(\mathcal{L}_{\text{track}} = \frac{1}{|I_t|} \sum_{p \in I_t} w(p) \|\hat{\mathbf{U}}_{t \to t'}(p) - \mathbf{U}_{t \to t'}(p)\|\) (see the weighted-loss sketch after this list). The target confidence weights are estimated via temporal self-consistency of color and semantics: pixels that remain consistent in both color and semantics across frames receive high confidence. Furthermore, adaptive resampling computes an RGB error \(e_{\text{rgb}}(p)\) and a semantic error \(e_{\text{sem}}(p)\), samples additional 2D points in high-error regions, and projects them to 3D to initialize new Gaussians, effectively recovering blurred or missing regions caused by inaccurate motion estimation.

  3. Iterative Semantic Refinement: Leveraging the promptable nature of SAM2, at each iteration the 3D-rendered semantic mask \(\hat{\mathbf{M}}_t^s\) is compared against the previous iteration's 2D mask \(\mathbf{M}_t^{s-1}\) to identify mismatched regions. Additional prompts are then generated for each object: (1) a precise bounding box derived from the 3D mask; and (2) positive/negative prompt points placed at the locations of maximum distance-transform value inside the mismatched regions (a prompt-derivation sketch follows this list). Notably, the method deliberately avoids using the 3D mask directly as a mask prompt, as SAM2 tends to follow mask inputs strictly, limiting its capacity for correction. The 3D mask provides stronger consistency constraints, while SAM2 excels at preserving high-resolution boundary details; the two are complementary.
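
To make the motion-field design in item 1 concrete, below is a minimal PyTorch-style sketch of blending per-Gaussian rigid transforms from the global motion bases. The function name `blend_motion_bases`, the softmax normalization, and the naive 4×4 matrix blending are illustrative assumptions rather than the authors' implementation, which may instead blend rotations and translations in SE(3)/se(3) space.

```python
import torch

def blend_motion_bases(basis_T, weights, means_canonical):
    """Blend B global rigid transforms into per-Gaussian motion (illustrative sketch).

    basis_T:         (B, 4, 4) global motion bases T_b^{0->t} for target frame t.
    weights:         (N, B) per-Gaussian blend coefficients w_i^b.
    means_canonical: (N, 3) Gaussian centers in the canonical frame.
    Returns:         (N, 3) centers transformed to frame t.
    """
    # Normalize blend coefficients so each Gaussian's weights sum to 1 (assumption).
    w = torch.softmax(weights, dim=-1)                      # (N, B)

    # Naive linear blend of 4x4 matrices; the paper may blend in SE(3)/se(3) instead.
    T = torch.einsum('nb,bij->nij', w, basis_T)             # (N, 4, 4)

    # Apply the blended transform to homogeneous canonical centers.
    ones = torch.ones_like(means_canonical[:, :1])
    x_h = torch.cat([means_canonical, ones], dim=-1)        # (N, 4)
    return torch.einsum('nij,nj->ni', T, x_h)[:, :3]        # (N, 3)
```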
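
Similarly, a hedged sketch of the confidence-weighted tracking loss from item 2: the helper names and the Gaussian self-consistency kernel in `self_consistency_confidence` are assumptions standing in for the rendered per-pixel weights \(w(p)\) described above.

```python
import torch

def confidence_weighted_track_loss(pred_tracks, prior_tracks, confidence):
    """L_track: confidence-weighted distance between rendered and prior 2D tracks.

    pred_tracks:  (H, W, 2) trajectories rendered from the 4D representation.
    prior_tracks: (H, W, 2) noisy 2D trajectories from the tracking prior.
    confidence:   (H, W) rendered per-pixel confidence weights w(p) in [0, 1].
    """
    err = torch.linalg.norm(pred_tracks - prior_tracks, dim=-1)  # (H, W)
    return (confidence * err).mean()

def self_consistency_confidence(rgb_err, sem_err, sigma=0.1):
    """Illustrative target confidence: high where color and semantics stay
    consistent across frames, low where either disagrees (assumed kernel)."""
    return torch.exp(-(rgb_err**2 + sem_err**2) / (2 * sigma**2))
```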
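
Finally, a sketch of how box and point prompts could be derived from the mismatch between the 3D-rendered mask and the previous 2D mask (item 3). The use of `cv2.distanceTransform` and the returned prompt format are assumptions for illustration; the actual SAM2 prompting interface is not reproduced here.

```python
import numpy as np
import cv2

def prompts_from_3d_mask(mask_3d, mask_2d_prev):
    """Derive a box prompt plus positive/negative point prompts for one object.

    mask_3d:      (H, W) bool mask rendered from the 3D semantic field.
    mask_2d_prev: (H, W) bool mask from the previous SAM2 iteration.
    """
    ys, xs = np.nonzero(mask_3d)
    if len(xs) == 0:
        return None
    # (1) Bounding-box prompt tightly enclosing the 3D-consistent mask.
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

    def farthest_interior_point(region):
        # Point with the maximum distance-transform value inside a region.
        if not region.any():
            return None
        dist = cv2.distanceTransform(region.astype(np.uint8), cv2.DIST_L2, 5)
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
        return (int(x), int(y))

    # (2) Positive point where the 3D mask says "object" but the 2D mask missed it;
    #     negative point for the opposite disagreement.
    pos_pt = farthest_interior_point(mask_3d & ~mask_2d_prev)
    neg_pt = farthest_interior_point(~mask_3d & mask_2d_prev)
    return box, pos_pt, neg_pt
```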

Loss & Training

The total loss is a weighted sum of multiple terms:

\[\mathcal{L} = \lambda_{\text{rgb}} \mathcal{L}_{\text{rgb}} + \lambda_{\text{sem}} \mathcal{L}_{\text{sem}} + \lambda_{\text{track}} \mathcal{L}_{\text{track}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}} + \lambda_{w} \mathcal{L}_{w}\]

Training proceeds in three stages (a minimal sketch of this schedule appears after the next paragraph):

  • Stage 1 (Sequential–Motion): The motion field is optimized within short temporal windows; each window \(\mathcal{S}_i = \{I_t \mid t \in [iL, (i+1)L)\}\) undergoes iterative motion refinement.
  • Stage 2 (Sequential–Semantics): The motion field is frozen; the semantic field is optimized and the SAM2 inputs are updated through iterative refinement.
  • Stage 3 (Global): All fields are jointly trained over the entire video sequence to ensure cross-field consistency and long-term coherence.

Sequential optimization is critical because 2D networks rely on short-term memory and tend to accumulate errors over time (e.g., SAM2 is accurate at the initial frame but gradually loses track).
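
Below is a minimal sketch of this three-stage schedule. The callables `optimize_motion`, `optimize_semantics`, `refine_2d_priors`, and `joint_finetune` are hypothetical placeholders for the per-stage optimizers and prior-update steps; the window length `L` and refinement iteration count are illustrative, not the paper's settings.

```python
from typing import Callable, Sequence

def train_motion4d(frames: Sequence, priors: dict,
                   optimize_motion: Callable, optimize_semantics: Callable,
                   refine_2d_priors: Callable, joint_finetune: Callable,
                   L: int = 16, refine_iters: int = 3) -> None:
    """Sequential (windowed) optimization followed by global optimization (sketch)."""
    windows = [frames[i:i + L] for i in range(0, len(frames), L)]

    # Stage 1 (Sequential-Motion): per-window motion fitting, iteratively
    # reweighting/resampling the tracking and depth priors.
    for window in windows:
        for _ in range(refine_iters):
            optimize_motion(window, priors)
            refine_2d_priors(window, priors, targets=("tracks", "depth"))

    # Stage 2 (Sequential-Semantics): motion frozen; optimize the semantic field
    # and re-prompt SAM2 with box/point prompts derived from the 3D masks.
    for window in windows:
        for _ in range(refine_iters):
            optimize_semantics(window, priors)
            refine_2d_priors(window, priors, targets=("masks",))

    # Stage 3 (Global): jointly optimize all fields over the entire sequence.
    joint_finetune(frames, priors)
```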

Key Experimental Results

Main Results

Video Object Segmentation (DyCheck-VOS and DAVIS):

| Method | Representation | DyCheck-VOS \(\mathcal{J}\&\mathcal{F}\) | DAVIS \(\mathcal{J}\&\mathcal{F}\) |
| --- | --- | --- | --- |
| SAM2 | 2D | 89.4 | 90.7 |
| SADG | 3D + SAM2 | 81.8 | 75.0 |
| Semantic Flow | 3D + SAM2 | 76.9 | 72.2 |
| Motion4D | 3D + SAM2 | 91.0 | 89.7 |
| Motion4D + SAM2 | 3D + SAM2 | 91.7 | 90.8 |

2D Point Tracking (DyCheck dataset):

| Method | AJ ↑ | \(<\delta_{\text{avg}}\) ↑ | OA ↑ |
| --- | --- | --- | --- |
| CoTracker3 | 31.0 | 44.4 | 79.9 |
| Shape of Motion | 34.4 | 47.0 | 86.6 |
| Motion4D | 37.3 | 50.4 | 87.1 |

3D Point Tracking and Novel View Synthesis (DyCheck):

| Method | EPE ↓ | \(\delta_{3D}^{.05}\) ↑ | PSNR ↑ |
| --- | --- | --- | --- |
| Shape of Motion | 0.082 | 43.0 | 16.72 |
| Motion4D | 0.072 | 46.7 | 17.91 |

Ablation Study

| Configuration | \(\mathcal{J}\&\mathcal{F}\) | AJ ↑ | OA ↑ | Notes |
| --- | --- | --- | --- | --- |
| Full model | 91.7 | 37.3 | 87.1 | All components |
| w/o iterative refinement | 87.6 | 34.6 | 86.5 | No 2D prior updates |
| w/o adaptive sampling | 88.9 | 35.1 | 84.2 | No error-guided densification |
| Full-sequence initialization | 88.0 | 34.9 | 87.0 | No sequential optimization |
| w/o global optimization | 90.3 | 36.5 | 86.6 | Sequential updates only |

Key Findings

  • Iterative refinement is critical for both segmentation and tracking performance; removing it causes a 4.1-point drop in \(\mathcal{J}\&\mathcal{F}\).
  • Adaptive sampling primarily improves motion consistency (OA drops by 2.9 without it), helping to recover regions with insufficient motion estimation.
  • Sequential optimization prevents long-term error accumulation in 2D priors and is essential for training stability.
  • Global optimization further improves consistency across temporal segments.

Highlights & Insights

  • Closed-loop mutual enhancement between 2D and 3D: The 3D representation provides consistency constraints while 2D foundation models supply rich detail priors; iterative optimization forms a positive feedback loop between the two.
  • The confidence-weighting mechanism elegantly addresses the inability to directly correct 2D tracking priors by automatically suppressing noisy supervision signals via self-consistency metrics.
  • The paper introduces the DyCheck-VOS benchmark, filling a gap in VOS evaluation for dynamic scenes.
  • This is the first method to simultaneously and substantially surpass both 2D foundation models and 3D approaches in dynamic scene understanding.

Limitations & Future Work

  • Performance depends on the quality of the underlying 3D reconstruction: severe occlusions, low-texture regions, or inaccurate depth estimates can degrade results.
  • Known camera poses are required as input.
  • The motion field assumes a weighted combination of rigid transformations, which may limit modeling capacity for highly non-rigid motions.
  • The multi-stage optimization combined with iterative refinement incurs significant computational overhead.
  • Shape of Motion is the closest 3D baseline; Motion4D extends it by incorporating a semantic field and iterative refinement.
  • The promptable nature of SAM2 enables iterative semantic field refinement, and this design principle is generalizable to other promptable models.
  • The confidence-weighting and adaptive-sampling strategy offers a transferable paradigm for other tasks that require fusion of noisy priors.

Rating

  • Novelty: ⭐⭐⭐⭐ Unifying multiple 2D priors into 4DGS with a closed-loop iterative refinement design is conceptually clear and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three tasks, multiple datasets, comprehensive ablations, and a newly proposed benchmark.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-illustrated.
  • Value: ⭐⭐⭐⭐ Provides a unified framework for dynamic scene understanding with strong practical applicability.