EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting¶
Conference: ICCV 2025 arXiv: 2411.15582 Code: qingpowuwu.github.io/emd Area: Autonomous Driving Keywords: 3D Gaussian Splatting, Dynamic Scene Reconstruction, Motion Modeling, Self-Supervised Learning, Street View Simulation
TL;DR¶
This paper proposes the Explicit Motion Decomposition (EMD) module, which models the motion characteristics of each Gaussian primitive via learnable motion embeddings and a dual-scale deformation framework. As a plug-and-play module, EMD integrates seamlessly into both self-supervised and supervised street-view Gaussian splatting methods, achieving state-of-the-art performance under the self-supervised setting on the Waymo and KITTI datasets.
Background & Motivation¶
Problem Definition¶
Novel view synthesis of dynamic street scenes is a core technology for closed-loop autonomous driving simulation. Methods based on 3DGS/4DGS address street scene reconstruction by decomposing the scene into static and dynamic components; however, existing approaches fail to effectively model the heterogeneous motion patterns of dynamic objects.
Limitations of Prior Work¶
Supervised methods (StreetGaussian, OmniRe):
- Use 3D bounding box supervision to classify scene elements as either "static" or "dynamic"
- Ignore the continuous spectrum of motion (e.g., pedestrian motion is far slower than vehicular motion)
- Bounding box optimization partially mitigates dynamic reconstruction errors, but motion patterns are never fundamentally modeled
Self-supervised methods (S3Gaussian, DeSiRe-GS):
- Optimize a holistic 4D street-scene representation via intrinsic motion cues
- Overlook inter-object differences in motion speed (e.g., pedestrians vs. vehicles)
- Lack effective motion modeling mechanisms, leading to blurry reconstruction of dynamic objects
Core Motivation¶
Different object categories in street scenes exhibit fundamentally distinct motion patterns: vehicles undergo fast global motion, while pedestrians undergo slow local motion. An explicit motion modeling mechanism is needed to distinguish and handle these multi-scale motion patterns. The key insight is that assigning each Gaussian a learnable motion embedding and processing fast global motion and slow local motion separately through a dual-scale deformation network can significantly improve dynamic scene reconstruction quality.
Method¶
Overall Architecture¶
EMD is a plug-and-play module consisting of two core components:
1. Motion-aware Feature Encoding: encodes spatial, temporal, and Gaussian-specific information into a unified motion-aware feature representation.
2. Dual-scale Deformation Framework: hierarchically handles fast global motion and slow local deformation.
Given a set of static 3D Gaussian primitives \(\mathbb{G} = \{(\mu_k, \mathbf{s}_k, \mathbf{q}_k, \alpha_k, \mathbf{c}_k)\}_{k=1}^K\) and a timestamp \(t\), the goal is to learn a deformation field \(\mathcal{D}\) that maps each Gaussian's parameters from a canonical state to the deformed state at time \(t\).
Key Designs¶
1. Motion-aware Feature Encoding¶
- Function: Fuses the spatial position, temporal information, and individual motion characteristics of each Gaussian primitive into a comprehensive feature representation.
- Mechanism:
The aggregated feature is formed by concatenating three components: \[\mathbf{F}_{aggr}(\mu, t) = [\mathbf{F}_{pos}(\mu), \mathbf{F}_{temp}(t), \mathbf{F}_{gauss}]\]
Spatial encoding \(\mathbf{F}_{pos}\): multi-frequency positional encoding with \(P=10\) frequency bands: \[\mathbf{F}_{pos}(\mu) = [\mu, \{\sin(2^i\pi\mu), \cos(2^i\pi\mu)\}_{i=0}^{P-1}]\]
Adaptive temporal embedding \(\mathbf{F}_{temp}\): realized via a learnable embedding matrix \(\mathbf{W} \in \mathbb{R}^{N_{max} \times D}\) and progressive temporal sampling: \[\mathbf{F}_{temp}(t) = \text{Interp}(\mathbf{W}, t, N(i))\] where \(N(i)\) grows progressively from \(N_{min}=30\) to \(N_{max}=150\), controlled by the training iteration \(i\): \[N(i) = N_{min} + (N_{max} - N_{min}) \cdot \min(i, T) / T\] Here \(D=4\) is the temporal embedding dimension and \(T=25000\) controls the duration of progressive sampling.
Gaussian embedding \(\mathbf{F}_{gauss}\): Each Gaussian \(k\) is assigned a learnable latent variable \(\mathbf{z}_k \in \mathbb{R}^M\) (\(M=32\)) encoding its individual motion characteristics.
- Design Motivation:
- Multi-frequency spatial encoding captures multi-level information from fine geometry to global structure.
- Progressive temporal sampling learns temporal dynamics from coarse to fine, preventing overfitting to high-frequency temporal changes in early training.
- Gaussian embeddings enable instance-level motion representation — Gaussians belonging to the same moving object should learn similar embeddings.
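As a concrete sketch, the three feature components might be assembled as follows. This is a minimal NumPy toy following the paper's dimensions (\(P=10\), \(D=4\), \(M=32\), \(N_{min}=30\), \(N_{max}=150\), \(T=25000\)); the function names and the linear-interpolation details are our own assumptions, not the authors' implementation:

```python
import numpy as np

def positional_encoding(mu, P=10):
    # mu: (K, 3) Gaussian centers; multi-frequency sin/cos bands as in F_pos
    feats = [mu]
    for i in range(P):
        feats.append(np.sin(2**i * np.pi * mu))
        feats.append(np.cos(2**i * np.pi * mu))
    return np.concatenate(feats, axis=-1)  # (K, 3 + 3*2*P) = (K, 63)

def temporal_embedding(W, t, it, N_min=30, N_max=150, T=25000):
    # Progressive sampling: the number of active rows N(i) grows with iteration.
    N = int(N_min + (N_max - N_min) * min(it, T) / T)
    # Interp(W, t, N): linear interpolation at continuous index t * (N - 1)
    x = t * (N - 1)
    lo = int(np.floor(x))
    hi = min(lo + 1, N - 1)
    w = x - lo
    return (1 - w) * W[lo] + w * W[hi]  # (D,)

# Toy assembly of F_aggr = [F_pos, F_temp, F_gauss]
rng = np.random.default_rng(0)
K, M, D = 5, 32, 4
mu = rng.random((K, 3))
W = rng.random((150, D))           # learnable temporal embedding matrix
z = rng.random((K, M))             # per-Gaussian learnable motion embeddings
F_pos = positional_encoding(mu)    # (K, 63)
F_temp = np.tile(temporal_embedding(W, t=0.4, it=10000), (K, 1))
F_aggr = np.concatenate([F_pos, F_temp, z], axis=-1)  # (K, 63 + 4 + 32)
```

Early in training only \(N_{min}=30\) temporal samples are interpolated, so the embedding can only express coarse temporal variation; as \(N(i)\) grows, finer temporal structure becomes representable.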
2. Dual-scale Deformation Framework¶
- Function: Decomposes deformation into two levels — coarse-scale (fast global motion) and fine-scale (slow local deformation).
- Mechanism:
\[\mathcal{D}(\mu, t) = \mathcal{D}_{coarse}(\mathbf{F}_{aggr}(\mu, t)) + \mathcal{D}_{fine}(\mathbf{F}_{aggr}(\mu + \Delta\mu_{coarse}, t))\]
The final deformation parameters combine predictions from both scales: \[\mu_t = \mu + \Delta\mu_{coarse} + \Delta\mu_{fine}, \quad \mathbf{s}_t = \mathbf{s} + \Delta\mathbf{s}_{coarse} + \Delta\mathbf{s}_{fine}, \quad \mathbf{q}_t = \mathbf{q} \otimes \Delta\mathbf{q}_{coarse} \otimes \Delta\mathbf{q}_{fine}\]
Crucially, the fine-scale deformation network re-encodes spatial features using the coarse-displaced position \(\mu + \Delta\mu_{coarse}\) as input.
- Design Motivation:
- \(\mathcal{D}_{coarse}\) focuses on large-scale motion such as vehicle translation, providing the primary deformation direction.
- \(\mathcal{D}_{fine}\) captures local details such as articulated motion on top of the coarse deformation.
- The coarse-to-fine cascaded design allows the two networks to specialize, avoiding the difficulty of learning both large displacements and fine deformations within a single network.
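The cascade can be sketched as below. This is a NumPy toy, not the authors' network: the MLP widths, the random stand-in for \(\mathbf{F}_{aggr}\), and the identity-plus-residual quaternion parameterization are all our own assumptions; the key point it illustrates is that the fine network consumes features re-encoded at the coarse-displaced position:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim, hidden=64):
    # Tiny two-layer ReLU MLP with random weights (stand-in for a trained net).
    W1 = rng.normal(0, 0.1, (in_dim, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.01, (hidden, out_dim)); b2 = np.zeros(out_dim)
    return lambda x: np.maximum(x @ W1 + b1, 0) @ W2 + b2

def quat_mul(q, r):
    # Hamilton product of batched quaternions, both (K, 4), w-first.
    w1, x1, y1, z1 = q.T; w2, x2, y2, z2 = r.T
    return np.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2], axis=-1)

F = 99  # aggregated feature dimension (63 + 4 + 32 in the toy encoding)
encode = lambda mu, t: rng.normal(size=(mu.shape[0], F))  # stand-in for F_aggr
D_coarse = mlp(F, 3 + 3 + 4)  # predicts (Δμ, Δs, Δq) jointly
D_fine = mlp(F, 3 + 3 + 4)

def deform(mu, s, q, t):
    d_c = D_coarse(encode(mu, t))
    dmu_c, ds_c, dq_c = d_c[:, :3], d_c[:, 3:6], d_c[:, 6:]
    # Fine network re-encodes at the coarse-displaced position mu + Δμ_coarse.
    d_f = D_fine(encode(mu + dmu_c, t))
    dmu_f, ds_f, dq_f = d_f[:, :3], d_f[:, 3:6], d_f[:, 6:]
    # Quaternion residuals composed onto identity (an assumed parameterization).
    id_q = np.tile([1.0, 0.0, 0.0, 0.0], (mu.shape[0], 1))
    mu_t = mu + dmu_c + dmu_f
    s_t = s + ds_c + ds_f
    q_t = quat_mul(quat_mul(q, id_q + dq_c), id_q + dq_f)
    return mu_t, s_t, q_t / np.linalg.norm(q_t, axis=-1, keepdims=True)

K_pts = 3
mu0 = np.zeros((K_pts, 3))
s0 = np.ones((K_pts, 3))
q0 = np.tile([1.0, 0.0, 0.0, 0.0], (K_pts, 1))
mu_t, s_t, q_t = deform(mu0, s0, q0, t=0.5)
```

Because the fine network sees positions already moved by the coarse branch, it only ever has to explain small residual motion, which is the specialization the design motivation describes.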
3. Integration with Existing Frameworks¶
- Function: Seamlessly integrates EMD into both self-supervised and supervised methods.
- Mechanism:
Self-supervised integration (S3Gaussian, DeSiRe-GS):
- A learnable embedding \(\mathbf{z}_k\) is added to each Gaussian.
- The original decoder is replaced with the dual-scale framework.
- The original self-supervised deformation setup is retained.
Supervised integration (StreetGaussian, OmniRe):
- Dual-scale decomposition is introduced into bounding box pose optimization: \(R'_t = \Delta R_t^f \cdot \Delta R_t^c \cdot R_t\), \(T'_t = T_t + (\Delta T_t^c + \Delta T_t^f)\)
- Dual-scale refinement is also applied to OmniRe's non-rigid SMPL model: \(\theta'_t = \Delta\theta_t^f \cdot \Delta\theta_t^c \cdot \theta_t\)
- Design Motivation: The plug-and-play design enables EMD to augment any existing street-view Gaussian method without redesigning the entire pipeline.
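The supervised-side pose refinement is a direct transcription of the two composition rules above. A minimal sketch (the function name is ours; rotation deltas left-multiply coarse-then-fine, translation deltas add):

```python
import numpy as np

def refine_box_pose(R, T, dR_c, dT_c, dR_f, dT_f):
    """Dual-scale bounding-box pose refinement:
    R'_t = ΔR^f · ΔR^c · R_t,   T'_t = T_t + (ΔT^c + ΔT^f)."""
    return dR_f @ dR_c @ R, T + (dT_c + dT_f)

# With identity rotation deltas and zero translation deltas the pose is a fixed point.
R = np.eye(3)
T = np.array([1.0, 2.0, 3.0])
I3, zero = np.eye(3), np.zeros(3)
R_refined, T_refined = refine_box_pose(R, T, I3, zero, I3, zero)
```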
Loss & Training¶
- The same reconstruction losses as the respective baselines are used (photometric loss, etc.).
- An additional local smoothness regularization on Gaussian embeddings encourages spatially adjacent Gaussians to share similar embeddings: \[\mathcal{L}_{\mathbf{z}_k} = \frac{1}{d|\mathcal{U}|}\sum_{i \in \mathcal{U}}\sum_{j \in \text{KNN}_{i;d}} e^{-\lambda_w\|\mu_j - \mu_i\|_2}\, \|\mathbf{z}_{k_i} - \mathbf{z}_{k_j}\|_2\] where \(\lambda_w = 2000\) and \(d=20\) is the number of KNN neighbors.
- Regularization on coarse/fine deformation magnitudes constrains deformation values close to zero, preventing excessively large deformations.
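The smoothness term can be sketched with a brute-force KNN over Gaussian centers. This NumPy toy follows the formula term by term (distance-weighted L2 gap between neighbor embeddings, normalized by \(d\,|\mathcal{U}|\)); the paper uses \(d=20\), while the toy uses a smaller \(d\) for its handful of points:

```python
import numpy as np

def embedding_smoothness_loss(mu, z, d=3, lam_w=2000.0):
    """L_z = 1/(d|U|) Σ_i Σ_{j∈KNN_i} exp(-λ_w ||μ_j-μ_i||) ||z_i - z_j||."""
    K = mu.shape[0]
    dist = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=-1)  # (K, K)
    np.fill_diagonal(dist, np.inf)        # exclude each point from its own KNN
    nn = np.argsort(dist, axis=1)[:, :d]  # (K, d) nearest-neighbor indices
    total = 0.0
    for i in range(K):
        for j in nn[i]:
            total += np.exp(-lam_w * dist[i, j]) * np.linalg.norm(z[i] - z[j])
    return total / (d * K)

rng = np.random.default_rng(0)
mu = rng.random((8, 3)) * 0.01   # tightly packed centers -> non-negligible weights
z_same = np.ones((8, 32))        # identical embeddings -> zero penalty
z_rand = rng.random((8, 32))     # differing embeddings -> positive penalty
```

With \(\lambda_w = 2000\) the exponential weight decays fast, so only genuinely close Gaussians are pulled toward shared embeddings, which is what lets same-object Gaussians cluster without explicit segmentation.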
Key Experimental Results¶
Main Results¶
Self-supervised setting — comparison with S3Gaussian (Waymo-D32, scene reconstruction):
| Method | Full PSNR↑ | Full SSIM↑ | Full LPIPS↓ | Vehicle PSNR↑ | Vehicle SSIM↑ |
|---|---|---|---|---|---|
| EmerNeRF | 28.16 | 0.806 | 0.228 | 24.32 | 0.682 |
| 3DGS | 28.47 | 0.876 | 0.136 | 23.26 | 0.716 |
| S3Gaussian | 30.69 | 0.900 | 0.121 | 26.23 | 0.804 |
| S3Gaussian+EMD | 32.50 | 0.933 | 0.082 | 29.04 | 0.879 |
Self-supervised setting — comparison with DeSiRe-GS (Waymo, scene reconstruction):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FPS |
|---|---|---|---|---|
| PVG | 32.46 | 0.910 | 0.229 | 50 |
| DeSiRe-GS | 33.61 | 0.919 | 0.204 | 36 |
| DeSiRe-GS+EMD | 34.15 | 0.948 | 0.130 | 32 |
Supervised setting — comparison with OmniRe (Waymo, 3 front cameras, novel view synthesis):
| Method | Full PSNR↑ | Human PSNR↑ | Vehicle PSNR↑ |
|---|---|---|---|
| OmniRe | 32.57 | 24.36 | 27.57 |
| OmniRe+EMD | 33.89 | 25.97 | 27.82 |
Ablation Study¶
Contribution of each component (Waymo-D32, S3Gaussian baseline):
| Configuration | Full PSNR↑ | Vehicle PSNR↑ | Note |
|---|---|---|---|
| Full model | 32.50 | 29.04 | — |
| w/o Gaussian embedding | 32.21 (−0.29) | 28.80 | Individual motion characteristics not captured |
| w/o temporal embedding | 32.23 (−0.27) | 28.08 | Insufficient temporal dynamics modeling |
| w/o coarse-scale deformation | 29.40 (−3.10) | 24.54 | Unable to handle large displacements |
| w/o fine-scale deformation | 32.45 (−0.05) | 28.80 | Local detail lost |
Novel trajectory synthesis (FID↓, Waymo):
| Method | 0.5 m offset | 1.0 m offset | 1.5 m offset |
|---|---|---|---|
| S3Gaussian | 83.48 | 110.11 | 134.38 |
| S3Gaussian+EMD | 45.11 | 70.26 | 90.20 |
Key Findings¶
- EMD yields the greatest improvement on vehicle regions: Vehicle PSNR increases from 26.23 to 29.04 (+2.81 dB), demonstrating that explicit motion modeling directly improves dynamic object reconstruction.
- Coarse-scale deformation is the most critical component: Removing it causes PSNR to drop sharply from 32.50 to 29.40 (−3.10), confirming that large-scale motion modeling is central.
- Substantial improvement in novel trajectory synthesis: FID drops from 83.48 to 45.11 at a 0.5 m offset, indicating that EMD's motion modeling improves simulation fidelity for lane-change scenarios.
- Demonstrated generality as a plug-and-play module: Consistent improvements are observed across all four baselines: S3Gaussian, DeSiRe-GS, StreetGaussian, and OmniRe.
- Effective for human body modeling: OmniRe+EMD improves Human PSNR from 24.36 to 25.97, confirming the dual-scale framework's applicability to non-rigid motion.
- Acceptable inference speed trade-off: DeSiRe-GS+EMD runs at 32 FPS vs. the original 36 FPS, incurring only an 11% speed reduction.
Highlights & Insights¶
- Motion continuum insight: Object motion in street scenes is not a simple static/dynamic binary but exists along a continuous spectrum from stationary to high-speed.
- Elegance of the dual-scale design: The coarse-to-fine cascade naturally corresponds to large displacements (vehicles) and small deformations (pedestrians, wheel rotation).
- Gaussian embedding + smoothness regularization: Enables Gaussians belonging to the same object to automatically cluster into similar motion patterns without explicit object segmentation.
- Progressive temporal sampling: Avoids interference from high-frequency temporal signals during early training, serving as a simple yet effective curriculum learning strategy.
- Practical plug-and-play engineering: Existing methods can be enhanced by simply adding motion embeddings and replacing the decoder.
Limitations & Future Work¶
- Increased computational cost: The dual-scale deformation network and embedding smoothness regularization increase training time.
- No explicit modeling of inter-object relationships: Each Gaussian models its motion independently, without considering motion correlations between objects.
- Fixed hyperparameters: Progressive sampling parameters such as \(N_{min}\), \(N_{max}\), and \(T\) are not adapted to individual scenes.
- Focused solely on appearance reconstruction: Motion modeling is not applied to downstream tasks such as object detection or tracking.
- Limited multi-camera evaluation: While Waymo supports 5 cameras, experiments are conducted primarily under 1–3 camera settings.
Related Work & Insights¶
- vs. S3Gaussian: S3Gaussian optimizes the scene holistically without distinguishing motion patterns; EMD addresses this gap through explicit motion embeddings.
- vs. StreetGaussian: StreetGaussian handles dynamic reconstruction errors via bounding box optimization; EMD further refines motion modeling on this foundation.
- Gaussian embedding concept is analogous to latent codes in NeRF, but applied to motion modeling rather than appearance alone.
- The dual-scale deformation paradigm may inspire other multi-scale motion scenarios, such as indoor robotics or sports scene reconstruction.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The explicit motion modeling concept is novel, though the technical components (dual-scale deformation, embeddings) are individually well-established.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four baselines, two datasets, both self-supervised and supervised settings, and novel trajectory synthesis evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, with rich figures and convincing visualizations.
- Value: ⭐⭐⭐⭐ — The plug-and-play design is highly practical, providing a standardized motion modeling component for street-scene reconstruction.