# Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
Conference: NeurIPS 2025 · arXiv: 2510.10691 · Code: hhhddddddd/dydeblur · Area: 3D Vision · Keywords: 3D Gaussian Splatting, dynamic scene reconstruction, defocus blur, motion blur, novel view synthesis
## TL;DR
A unified framework is proposed that jointly models defocus blur and motion blur via learnable blur kernel convolution, combined with a dynamic Gaussian densification strategy and unseen-view constraints, enabling high-quality novel view synthesis of dynamic scenes from blurry monocular videos using 3DGS.
## Background & Motivation
Reconstructing dynamic scenes from monocular video and synthesizing novel views is a fundamental problem in 3D vision. Existing methods suffer from several core bottlenecks:
- Disjoint blur modeling: Defocus blur and motion blur arise from entirely different physical causes — the former from depth-of-field limitations, the latter from relative motion during exposure. Existing methods address only one type.
- Difficulty in kernel estimation: Although both blur types can be modeled as kernel convolution, accurately estimating per-pixel blur kernels from dynamic scenes is extremely challenging.
- Incomplete dynamic Gaussians: initializing dynamic Gaussians from tracked points leaves occluded regions uncovered.
## Core Problem
How can a single framework handle both defocus blur and motion blur in monocular videos while enabling high-fidelity, sharp novel view synthesis?
## Method
### Dynamic Gaussian Representation and Densification
The Shape-of-Motion framework is adopted to model static and dynamic Gaussians separately. Each dynamic Gaussian's motion is represented as a linear combination of shared \(\mathrm{SE}(3)\) motion bases, from which its position and rotation at every frame are obtained.
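A plausible form of this parameterization, following the Shape-of-Motion formulation (the per-Gaussian weights \(w_b\), per-frame bases \(\mathbf{T}_{t,b}\), and canonical position/rotation \(\boldsymbol{\mu}_0, \mathbf{R}_0\) are assumed notation):

$$\mathbf{T}_t = \sum_{b=1}^{B} w_b\, \mathbf{T}_{t,b}, \qquad \boldsymbol{\mu}_t = \mathbf{T}_t\, \boldsymbol{\mu}_0, \qquad \mathbf{R}_t = \mathrm{Rot}(\mathbf{T}_t)\, \mathbf{R}_0$$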
Dynamic Gaussian Densification (DGD): Dynamic Gaussians are first initialized from tracked points visible in the canonical frame. Depth maps from all observed frames are then back-projected to fill in dynamic regions, and Gaussians instantiated in an observed frame are mapped back to the canonical frame by inverting that frame's motion transform.
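Under the notation above, this mapping plausibly reads

$$\boldsymbol{\mu}_0 = \mathbf{T}_t^{-1}\, \boldsymbol{\mu}_t, \qquad \mathbf{R}_0 = \mathrm{Rot}(\mathbf{T}_t)^{-1}\, \mathbf{R}_t$$

so that Gaussians recovered from any observed frame live in a common canonical space.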
Densification is performed once after \(N_d = 2500\) training iterations.
### Unified Blur Synthesis
Both defocus and motion blur are modeled uniformly as a per-pixel convolution of the sharp rendering with a learned kernel.
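A standard form of this model (the sharp rendering \(I_s\) and the \(9 \times 9\) neighborhood \(\mathcal{N}(x)\) are assumed notation):

$$B(x) = \sum_{q \in \mathcal{N}(x)} k_x(q)\, I_s(q), \qquad \sum_{q \in \mathcal{N}(x)} k_x(q) = 1$$

where \(k_x\) is the normalized blur kernel predicted for pixel \(x\).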
Blur Prediction Network (BP-Net): a 4-layer CNN that takes as input the camera embedding \(e(i)\), scene features \(f_{\text{scene}}\) (encoded from the rendered image, depth, and motion mask), and a pixel-coordinate positional encoding \(p(x)\), and jointly predicts the blur kernel \(k_x\) and blur intensity \(m_x\).
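A minimal PyTorch sketch of such a network, assuming \(1 \times 1\) convolutions over per-pixel feature maps, a softmax to normalize the \(9 \times 9\) kernel, and a sigmoid for the intensity (channel sizes and the exact architecture are guesses, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BPNet(nn.Module):
    """Sketch of a blur-prediction network: per-pixel kernel + intensity.

    `feats` is the per-pixel concatenation of the camera embedding
    (broadcast over the image), scene features, and positional encoding.
    All channel sizes are illustrative assumptions.
    """
    def __init__(self, in_ch=64, hidden=128, kernel_size=9):
        super().__init__()
        self.net = nn.Sequential(  # 4 conv layers, as described
            nn.Conv2d(in_ch, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, kernel_size ** 2 + 1, 1),  # K*K kernel logits + 1 intensity logit
        )

    def forward(self, feats):
        out = self.net(feats)                    # (B, K*K+1, H, W)
        kernel_logits, m_logit = out[:, :-1], out[:, -1:]
        k = F.softmax(kernel_logits, dim=1)      # kernel weights sum to 1 per pixel
        m = torch.sigmoid(m_logit)               # blur intensity in [0, 1]
        return k, m
```

The softmax guarantees each per-pixel kernel sums to 1, matching the normalized convolution above.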
The output blurry image is obtained by blending the sharp rendering with its kernel-convolved counterpart, weighted by the predicted blur intensity.
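A plausible form of the blend (notation as above):

$$\hat{B}(x) = m_x \sum_{q \in \mathcal{N}(x)} k_x(q)\, I_s(q) + (1 - m_x)\, I_s(x)$$

so \(m_x = 0\) leaves the pixel sharp and \(m_x = 1\) applies the full kernel; the "Shortcut" ablated in the experiments presumably refers to this sharp-image bypass.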
Blur-Aware Sparsity Constraint: prevents over-blurring in mildly blurred regions by tying the kernel's target center weight to the predicted blur intensity.
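A plausible instantiation (the exact loss form is an assumption): set the target center weight to \(1 - m_x\) and penalize the kernel's actual center weight for deviating from it,

$$\mathcal{L}_{\text{sparse}} = \big|\, k_x(c) - (1 - m_x) \,\big|$$

where \(c\) denotes the kernel center.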
High blur intensity → low target center weight → dispersed kernel allowed; low blur intensity → high target center weight → concentrated kernel enforced.
### Unseen-View Constraints
Unseen views surrounding the training viewpoints are synthesized from geometric and appearance information to mitigate overfitting in monocular video:
Parallel unseen views (interpolated between adjacent training viewpoints) and perpendicular unseen views (offset along the normal of the camera trajectory) are generated and introduced every \(N_u = 5\) iterations.
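A minimal numpy sketch of the two families of virtual camera centers, assuming midpoint interpolation for parallel views and a fixed offset along the local trajectory normal for perpendicular views (rotations, which would be interpolated separately, are omitted, and the offset scale is an assumption):

```python
import numpy as np

def unseen_view_centers(centers, offset_scale=0.05):
    """Sketch: virtual camera centers around a monocular trajectory.

    centers: (N, 3) array of training camera positions in temporal order.
    Returns parallel views (midpoints of adjacent cameras) and
    perpendicular views (offset along a normal of the local trajectory).
    """
    c = np.asarray(centers, dtype=np.float64)

    # Parallel: interpolate between adjacent training viewpoints.
    parallel = 0.5 * (c[:-1] + c[1:])

    # Perpendicular: offset along a vector normal to the local tangent.
    tangents = c[1:] - c[:-1]
    tangents /= np.linalg.norm(tangents, axis=1, keepdims=True) + 1e-8
    up = np.array([0.0, 1.0, 0.0])  # assumed world-up direction
    normals = np.cross(tangents, up)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-8
    perpendicular = c[:-1] + offset_scale * normals

    return parallel, perpendicular
```

The perpendicular offsets probe directions a monocular trajectory never covers, which is exactly where single-view overfitting shows up.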
### Total Loss
The reconstruction loss is \(\mathcal{L}_{\text{rec}} = (1-\beta)\mathcal{L}_1(\hat{B}, B) + \beta \mathcal{L}_{\text{ssim}}(\hat{B}, B)\), with \(\beta = 0.2\).
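A plausible total objective, combining the reconstruction term with the sparsity and unseen-view constraints (the weights \(\lambda\) are assumptions):

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{sparse}}\, \mathcal{L}_{\text{sparse}} + \lambda_{\text{unseen}}\, \mathcal{L}_{\text{unseen}}$$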
## Key Experimental Results
### Comparison on the D2RF (Defocus Blur) and DyBluRF (Motion Blur) Datasets
| Method | Defocus PSNR↑ | Defocus SSIM↑ | Defocus LPIPS↓ | Motion PSNR↑ | Motion SSIM↑ | Motion LPIPS↓ | Training Time |
|---|---|---|---|---|---|---|---|
| D3DGS | 22.54 | 0.715 | 0.215 | 21.54 | 0.675 | 0.287 | 10 min |
| SoM | 28.32 | 0.784 | 0.164 | 26.21 | 0.823 | 0.109 | 10 min |
| D2RF | 27.04 | 0.808 | 0.128 | 23.67 | 0.745 | 0.120 | 48 hrs |
| DyBluRF | 26.24 | 0.788 | 0.159 | 24.53 | 0.864 | 0.079 | 48 hrs |
| De4DGS | 28.49 | 0.791 | 0.154 | 26.62 | 0.871 | 0.059 | 20 hrs |
| Ours | 29.39 | 0.859 | 0.078 | 27.01 | 0.876 | 0.056 | 1 hr |
On the defocus dataset, PSNR exceeds De4DGS by 0.9 dB and LPIPS is reduced by 49%; training requires only 1 hour versus 20 hours.
### Ablation Study
| Configuration | Defocus PSNR↑ | Motion PSNR↑ |
|---|---|---|
| w/o sparsity constraint | 29.03 | 26.63 |
| w/o Shortcut | 29.12 | 26.74 |
| w/o DGD | 29.19 | 26.53 |
| w/o unseen views | 29.09 | 26.66 |
| Full method | 29.39 | 27.01 |
Kernel size \(K = 9\) is the best trade-off; gains for \(K > 9\) are marginal.
## Highlights & Insights
- First unified framework: Simultaneously handles defocus and motion blur in dynamic scene reconstruction.
- Blur intensity–kernel sparsity coupling: \(m_x\) adaptively constrains kernel distribution in a physically grounded manner.
- High efficiency: 1-hour training and 65 FPS rendering; training is 48× faster than the NeRF-based baselines (1 hr vs. 48 hrs).
- The unseen-view synthesis strategy effectively alleviates overfitting in monocular video reconstruction.
## Limitations & Future Work
- Relies on 2D priors (depth estimation, segmentation); errors in these priors propagate downstream.
- Fails on scenes with severe blur caused by large non-rigid motion.
- Requires per-scene optimization and does not generalize to unseen scenes.
- Blur kernel size is fixed at \(9 \times 9\), which may not cover large-scale blur.
## Related Work & Insights
- vs. De4DGS / DyBluRF: These methods simulate motion blur by weighting multiple frames within the exposure interval and cannot handle defocus blur; the proposed method unifies both via kernel convolution.
- vs. D2RF: D2RF handles defocus blur via layered depth-of-field volume rendering but cannot address motion blur and is extremely slow (48 hours).
- vs. Shape-of-Motion: SoM is the state of the art for dynamic reconstruction from sharp monocular video; this work extends it with blur modeling to recover sharp scenes from blurry inputs.
The decoupled design of blur kernel prediction and blur intensity prediction is noteworthy — it explicitly models both "how blurry a pixel is" and "how it is blurred" as two separate dimensions. The unseen-view synthesis constraint is a general strategy for mitigating overfitting in monocular reconstruction.
## Rating
- ⭐ Novelty: 8/10 — Unifying defocus and motion blur in dynamic 3DGS is a first; the blur-aware sparsity constraint is elegantly designed.
- ⭐ Experimental Thoroughness: 8/10 — Two datasets covering both blur types with complete ablations.
- ⭐ Writing Quality: 7/10 — Structure is clear, but certain details (e.g., unseen-view generation) are described somewhat briefly.
- ⭐ Value: 8/10 — High practical value; the unified framework combined with high efficiency directly advances blurry video reconstruction.