LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling¶
Conference: ICCV 2025 arXiv: 2507.02363 Area: 3D Vision Keywords: 3D Gaussian Splatting, dynamic scene reconstruction, multi-view, local spatial modeling, static-dynamic decoupling, temporal Gaussians
TL;DR¶
This paper proposes LocalDyGS, a framework that decomposes a complex global dynamic scene into local spaces anchored at seed points and generates temporal Gaussians via static-dynamic feature decoupling, so that each local space models its own short-range motion independently. This enables, for the first time, high-quality reconstruction of large-scale complex dynamic scenes.
Background & Motivation¶
Multi-view dynamic scene reconstruction is a fundamental problem in 3D vision, with applications in free-viewpoint sports broadcasting, AR/VR, and gaming. Existing methods face the following challenges:
- Deformation field methods (4DGaussian): warp a canonical space through a learned deformation field; struggle with large-scale motion.
- Trajectory tracking methods (SpaceTimeGS): Represent trajectories via polynomial/Fourier series; suffer from blurring and flickering under large-scale motion.
- Online streaming methods (3DGStream): Model per-frame; limited capacity for large-range motion.
- Storage issues: 3DGStream requires 1230 MB for 300 frames; RealTimeGS exceeds 1000 MB.
Core problem: Explicitly tracking the long-term motion of each Gaussian point is infeasible for large-scale complex motion.
Core Idea: Decompose the global dynamic scene into local spaces, where each seed point independently models short-range motion. Moving objects are represented by multiple seeds—seeds activate when an object passes through and deactivate when it leaves.
Method¶
Overall Architecture¶
Two major components:
1. Global-to-local spatial decomposition: fuse multi-frame SfM point clouds to initialize seeds.
2. Temporal Gaussian generation in each local space: static-dynamic feature decoupling + MLP decoding.
Key Designs¶
1. Global Seed Initialization
\(N\) frames are sampled to extract and fuse SfM point clouds, initializing the seed positions \(\mu\). Each seed carries:
- Position \(\mu\) (global parameter)
- Static feature \(f_s \in \mathbb{R}^{64}\) (shared across time steps, initialized to 0)
- Local spatial scale \(v \in \mathbb{R}^3\) (initialized to the average distance to the 3 nearest neighboring seeds)
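A minimal PyTorch sketch of this initialization, assuming the fused SfM point cloud is already available as an `(N, 3)` tensor; the function and variable names are illustrative, not the authors' code:

```python
import torch

def init_seeds(points: torch.Tensor, feat_dim: int = 64, k: int = 3):
    """Initialize seeds from a fused multi-frame SfM point cloud."""
    mu = points.clone()                               # (N, 3) global seed positions
    f_s = torch.zeros(points.shape[0], feat_dim)      # (N, 64) static features, init to 0
    # Brute-force pairwise distances (a k-d tree would replace this at scale);
    # mask the self-distance before taking the k nearest neighbors.
    d = torch.cdist(points, points)                   # (N, N)
    d.fill_diagonal_(float("inf"))
    knn_d, _ = d.topk(k, dim=1, largest=False)        # (N, k) distances to 3 nearest seeds
    v = knn_d.mean(dim=1, keepdim=True).repeat(1, 3)  # (N, 3) isotropic local scale
    return mu, f_s, v
```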
2. Spatio-temporal Field with Feature Decoupling
- Static feature \(f_s\): Independently optimized per local space, carries the majority of scene information.
- Dynamic residual field \(F_d\): multi-resolution 4D hash encoding over (position, time) + shallow MLP.
  - Hash table size \(2^{17}\); \(L\) resolution levels are concatenated and fused by the MLP.
  - Input: \((\mu, t) \in \mathbb{R}^4\).
- Weight field \(F_w\): Predicts \(w_s, w_d\) to balance static and dynamic contributions.
- Weighted feature: \(f_w = w_s \cdot f_s + w_d \cdot f_d\)
- Key finding: Dynamic residuals approach zero; the majority of information is carried by static features.
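The decoupling can be sketched as below, with the multi-resolution 4D hash encoder abstracted behind a generic module; the softmax normalization of \((w_s, w_d)\) and the MLP widths are assumptions rather than details taken from the paper:

```python
import torch
import torch.nn as nn

class SpatioTemporalField(nn.Module):
    """Static-dynamic feature decoupling: f_w = w_s * f_s + w_d * f_d."""
    def __init__(self, hash_enc: nn.Module, enc_dim: int, feat_dim: int = 64):
        super().__init__()
        self.hash_enc = hash_enc  # stands in for the multi-resolution 4D hash grid
        # Shallow MLP fusing the concatenated multi-level hash features
        # into a dynamic residual f_d of the same width as f_s.
        self.dyn_mlp = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Weight field F_w predicts the per-seed scalars (w_s, w_d).
        self.weight_mlp = nn.Sequential(
            nn.Linear(enc_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, mu: torch.Tensor, t: torch.Tensor, f_s: torch.Tensor):
        # mu: (N, 3), t: (N, 1) -> the shared 4D input (mu, t).
        h = self.hash_enc(torch.cat([mu, t], dim=-1))   # (N, enc_dim)
        f_d = self.dyn_mlp(h)                           # dynamic residual (near zero in practice)
        w = torch.softmax(self.weight_mlp(h), dim=-1)   # (N, 2) -> w_s, w_d
        w_s, w_d = w[:, :1], w[:, 1:]
        return w_s * f_s + w_d * f_d                    # weighted feature f_w
```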
3. Temporal Gaussian Generation
Each seed generates \(k=10\) temporal Gaussians:
- Position: \(\mu_t = \mu + v \cdot F_\mu(f_w)\)
- Opacity: \(\text{Sigmoid}(F_o(f_w, d))\), where \(d\) is the camera viewing direction
- Rotation, scale, and color are decoded by independent MLPs
- Deactivation mechanism: a Gaussian is deactivated when \(\sigma < \tau_\alpha = 0.01\), reducing computational cost
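A hedged sketch of the decoding step for position and opacity (the rotation, scale, and color heads are analogous and omitted); the layer shapes and raw view-direction conditioning are assumptions:

```python
import torch
import torch.nn as nn

class TemporalGaussianHead(nn.Module):
    """Decode k temporal Gaussians per seed from the weighted feature f_w."""
    def __init__(self, feat_dim: int = 64, k: int = 10, tau_alpha: float = 0.01):
        super().__init__()
        self.k, self.tau_alpha = k, tau_alpha
        self.f_mu = nn.Linear(feat_dim, 3 * k)      # per-Gaussian position offsets
        self.f_o = nn.Linear(feat_dim + 3, k)       # view-dependent opacity

    def forward(self, mu, v, f_w, view_dir):
        n = mu.shape[0]
        # mu_t = mu + v * F_mu(f_w): offsets are scaled by the local
        # spatial scale v, keeping each Gaussian inside its local space.
        offsets = self.f_mu(f_w).view(n, self.k, 3)
        mu_t = mu[:, None] + v[:, None] * offsets   # (N, k, 3)
        # Opacity conditioned on the camera viewing direction d.
        sigma = torch.sigmoid(self.f_o(torch.cat([f_w, view_dir], dim=-1)))
        # Deactivation: Gaussians below tau_alpha are skipped at render time.
        active = sigma > self.tau_alpha             # (N, k) boolean mask
        return mu_t, sigma, active
```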
4. Adaptive Seed Growing (ASG)
- Records the maximum 2D projection gradient and 3D position of temporal Gaussians over \(n\) iterations.
- New seeds are added when \(\nabla_{\max} > \tau_g = 0.001\).
- Executed every 100 iterations between iterations 3000 and 15000.
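A simplified sketch of this schedule, assuming the per-Gaussian gradient maxima and 3D positions have already been accumulated into flat tensors; buffer resets and pruning are omitted:

```python
import torch

def adaptive_seed_growing(grad_max, pos3d, seeds, it,
                          tau_g=1e-3, start=3000, end=15000, every=100):
    """Add seeds at the recorded 3D positions of temporal Gaussians whose
    maximum 2D projection gradient over the last n iterations exceeds tau_g."""
    if it < start or it > end or it % every != 0:
        return seeds
    mask = grad_max > tau_g                             # under-reconstructed regions
    if mask.any():
        seeds = torch.cat([seeds, pos3d[mask]], dim=0)  # grow new seed centers
    return seeds
```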
Loss & Training¶
- \(\mathcal{L}_v = \sum_i \prod_j s_{t,j}^{i}\): volume regularization summing, over all temporal Gaussians, the product of each Gaussian's scale components, which encourages small Gaussians and preserves locality.
- Adam optimizer, 30,000 iterations.
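A one-line sketch of the volume regularizer under this reading of the formula, where the hypothetical `scales` tensor holds each temporal Gaussian's three scale components:

```python
import torch

def volume_reg(scales: torch.Tensor) -> torch.Tensor:
    """L_v = sum_i prod_j s^i_{t,j}: penalize each temporal Gaussian's
    scale product so Gaussians stay small and motion stays local."""
    return scales.prod(dim=-1).sum()  # scales: (num_gaussians, 3)
```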
Key Experimental Results¶
N3DV Dataset (21 cameras, fine-grained motion)¶
| Method | PSNR↑ | LPIPS↓ | FPS↑ | Time↓ | Storage↓ |
|---|---|---|---|---|---|
| 4DGaussian | 31.02 | 0.150 | 30 | 0.67h | 90MB |
| SpaceTimeGS | 32.05 | 0.044 | 140 | >5h | 200MB |
| 3DGStream | 31.67 | - | 215 | 1.0h | 1230MB |
| RealTimeGS | 32.01 | 0.055 | 114 | 9.0h | >1000MB |
| LocalDyGS | 32.28 | 0.043 | 105 | 0.58h | 100MB |
MeetRoom Dataset (13 cameras, sparse views)¶
| Method | PSNR↑ | Time↓ | Storage↓ |
|---|---|---|---|
| 3DGS (per-frame) | 31.31 | 13h | 6330MB |
| 3DGStream | 30.79 | 0.6h | 1230MB |
| LocalDyGS | 32.45 | 0.36h | 90MB |
VRU Basketball Court (34 cameras, large-scale complex motion)¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 4DGS | 28.32 | 0.930 | 0.186 |
| SpaceTimeGS | 27.42 | 0.926 | 0.193 |
| LocalDyGS | 30.58 | 0.944 | 0.173 |
Exceeds prior dynamic methods by more than 2 dB in PSNR, approaching the static 3DGS upper bound.
Ablation Study¶
| Component | Effect |
|---|---|
| Adaptive seed growing (ASG) | PSNR 31.81 → 33.02 (+1.21 dB) |
| Remove static feature | 29.46 vs. 31.40 (−1.94 dB) |
| Remove deactivation | FPS drops 105 → 89; PSNR nearly unchanged |
| \(N=6\) vs. \(N=30\) | Performance nearly identical (32.28 vs. 32.30) |
| \(N=1\) | 31.84 (−0.44 dB), insufficient coverage |
| \(k=5/10/20\) | \(k=10\) achieves optimal balance |
Key Findings¶
- First method to successfully handle large-scale dynamic datasets such as the VRU basketball court.
- Static-dynamic decoupling contributes the most (−1.94 dB if removed).
- Only 6 SfM frames are sufficient to achieve optimal performance; the method is insensitive to the initial point cloud.
- Training time: 35 minutes on a single RTX 3090.
- Storage: 100 MB / 300 frames vs. 1230 MB for 3DGStream.
Highlights & Insights¶
- Core idea of local decomposition: Reducing global long-range tracking to local short-range modeling is a key breakthrough for large-scale dynamic scenes.
- Effectiveness of static-dynamic decoupling: Dynamic residuals approach zero, and decoupling substantially reduces modeling difficulty.
- Extreme efficiency: Best quality + 35-minute training + 100 MB storage.
- First large-scale dynamic scene benchmark: VRU basketball court validates practical applicability.
- Elegant dynamic extension of ScaffoldGS: Inherits the anchor structure and incorporates the temporal dimension.
- Deactivation mechanism analogous to MoE: Sparse activation reduces computational overhead.
Limitations & Future Work¶
- Relies on multi-frame SfM initialization; fast motion or texture-less regions may lead to insufficient coverage.
- Rendering at 105 FPS is lower than 3DGStream at 215 FPS.
- Assumes camera parameters are known and accurately calibrated.
- Volume regularization may constrain the representation of large-scale objects.
- Memory of 4D hash encoding grows with resolution.
Related Work & Insights¶
- Static novel view synthesis: 3DGS for real-time rendering; ScaffoldGS with anchor-based representation.
- Deformation field methods: 4DGaussian canonical-to-deformation mapping.
- Trajectory tracking: SpaceTimeGS with polynomial control.
- Streaming methods: 3DGStream for per-frame online reconstruction.
- 4DGS extensions: Fit Gaussians in 4D space; high storage and computation cost.
Rating¶
- Novelty: ★★★★☆ — Local spatial decomposition + feature decoupling introduces a new paradigm for dynamic scene modeling.
- Technical Depth: ★★★★★ — Complete system design: 4D hash residual field, static-dynamic decoupling, view-dependent MLP decoding, and adaptive seed growing.
- Experimental Thoroughness: ★★★★★ — Comprehensive quantitative and qualitative evaluation; first large-scale validation.
- Practicality: ★★★★★ — Fast training / low storage / high quality / first large-scale dynamic scene support.
- Overall Score: 9.0/10