EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis

Conference: CVPR 2025
arXiv: 2503.20168
Code: https://xdimlab.github.io/EVolSplat/
Area: Autonomous Driving / Novel View Synthesis
Keywords: 3D Gaussian Splatting, Urban Scenes, Feed-forward Reconstruction, Sparse 3D Convolution, Real-time Rendering

TL;DR

This paper proposes EVolSplat, a feed-forward 3D Gaussian Splatting method for urban scenes built on sparse 3D convolutions. Instead of making pixel-aligned predictions, it predicts Gaussian parameters from a single global voxel grid. Combined with occlusion-aware image-based rendering (IBR) for color, it reaches 23.26 dB PSNR at 83.81 FPS on KITTI-360.

Background & Motivation

Background: Novel view synthesis of urban scenes is a core requirement for autonomous driving simulation. Per-scene optimization methods (e.g., 3DGS, Street Gaussians) require tens of minutes of training per scene. Feed-forward methods (e.g., MVSplat) are fast but suffer from multi-view inconsistency in large-scale urban scenes due to pixel-aligned Gaussian predictions.

Limitations of Prior Work: Pixel-aligned methods associate 3D Gaussians with individual pixel rays, resulting in: (1) inconsistent Gaussian positions predicted from different views, causing conflicts during multi-view fusion; (2) lack of plausible representation for distant backgrounds and sky areas; (3) direct propagation of depth estimation errors to Gaussian positions.

Key Challenge: There is a trade-off between feed-forward speed and spatial consistency. Pixel-space operations are fast but inconsistent, whereas 3D-space operations are consistent but prohibitively expensive on dense voxel grids.

Key Insight: Use sparse 3D convolutions within a unified global voxel grid to predict Gaussian parameters, allocating computational resources only where 3D points exist.

Core Idea: Sparse 3D-CNN global voxel prediction + occlusion-aware IBR coloring + hemispherical background Gaussians = consistent and efficient novel view synthesis for urban scenes.

Method

Overall Architecture

Given multi-view input images and monocular depth estimates, the method first builds an initial 3D point cloud and voxelizes it into a sparse grid. A sparse 3D-CNN extracts geometric features, and MLPs predict the Gaussian parameters (position offset, scale, rotation, opacity) for each voxel. Colors are obtained by querying 2D textures from the input images via occlusion-aware IBR. Distant backgrounds and the sky are modeled with hemispherical background Gaussians.
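
The first step, lifting per-pixel depth into a shared 3D point cloud, can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera and z-depth. The function name `unproject_depth` and its inputs (`K`, `cam_to_world`) are hypothetical and not taken from the authors' code.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H*W, 3) world-space point cloud.

    Assumes a pinhole camera: K is the 3x3 intrinsic matrix and
    cam_to_world is a 4x4 camera-to-world transform (hypothetical inputs).
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project: camera-space ray through each pixel, scaled by its z-depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)
    # Move the points into the shared world frame.
    ones = np.ones((points_cam.shape[0], 1))
    points_h = np.concatenate([points_cam, ones], axis=1)
    return (points_h @ cam_to_world.T)[:, :3]
```

Point clouds from several views can then be concatenated before voxelization. Any error in the estimated depth moves the points along their rays, which is the error the predicted position offsets are meant to correct.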

Key Designs

Sparse 3D-CNN Voxel Prediction:
- Function: Consistently predict Gaussian parameters in global 3D space.
- Mechanism: Voxelize the 3D point cloud obtained from monocular depth estimation, extract features using MinkowskiNet sparse convolutions, and recursively refine positions. MLPs then predict position offsets \(\Delta p\), scale, rotation, and opacity. The position offsets correct depth estimation errors.
- Design Motivation: Unlike pixel-aligned methods, the global voxel representation ensures multi-view geometric consistency. Sparse convolutions operate only on voxels that contain points, achieving efficiency comparable to dense 2D methods.
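
The voxelization step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The sparse 3D-CNN itself is not reproduced, and `offsets` is a stand-in for the MLP output \(\Delta p\). The helper names `voxelize` and `apply_offsets` are hypothetical.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize (N, 3) points into occupied voxels.

    Returns integer voxel coordinates (M, 3) and the mean point per voxel (M, 3).
    Empty voxels are never materialized, which is the property sparse
    convolutions rely on.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # accumulate the points that fall in each voxel
    counts = np.bincount(inverse, minlength=len(uniq)).reshape(-1, 1)
    return uniq, sums / counts

def apply_offsets(voxel_coords, offsets, voxel_size):
    """Gaussian centers = voxel centers + predicted offsets (stand-in for the MLP output)."""
    centers = (voxel_coords + 0.5) * voxel_size
    return centers + offsets
```

Because every input view writes into the same grid, two views that observe the same surface land in the same voxels instead of producing two competing sets of Gaussians.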
Occlusion-Aware Image-Based Rendering (IBR) Coloring:
- Function: Directly retrieve Gaussian colors from input images rather than predicting them via network.
- Mechanism: Project 3D Gaussian centers back to the input views to query 2D features. Use visibility maps (generated by rendering input views to check if a Gaussian is visible) to filter out occluded views, and employ an attention mechanism to fuse multi-view colors.
- Design Motivation: Network-predicted colors lack detail in urban scenes, whereas IBR preserves high-frequency textures from the input images. The occlusion visibility check resolves color inconsistency under large baselines.
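
A simplified version of this lookup can be sketched in NumPy. Two substitutions are assumptions of the sketch: visibility is tested by comparing each point's depth against a per-view depth map (a stand-in for the rendered visibility maps), and visible views are averaged uniformly (a stand-in for the attention-based fusion). The function name `ibr_colors` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def ibr_colors(centers, images, depths, Ks, world_to_cams, tol=0.05):
    """Fetch one RGB color per 3D center from V source views, skipping occluded views.

    centers: (N, 3); images: (V, H, W, 3); depths: (V, H, W);
    Ks: (V, 3, 3); world_to_cams: (V, 4, 4).
    tol is a hypothetical relative depth tolerance for the visibility test.
    """
    n = centers.shape[0]
    homog = np.concatenate([centers, np.ones((n, 1))], axis=1)
    color_sum = np.zeros((n, 3))
    weight_sum = np.zeros((n, 1))
    for img, depth, K, w2c in zip(images, depths, Ks, world_to_cams):
        h, w = depth.shape
        cam = (homog @ w2c.T)[:, :3]           # points in this camera's frame
        z = cam[:, 2]
        in_front = z > 1e-6
        uv = cam @ K.T                          # pinhole projection (before division)
        safe_z = np.where(in_front, z, 1.0)     # avoid dividing by zero or negatives
        u = np.round(uv[:, 0] / safe_z).astype(int)
        v = np.round(uv[:, 1] / safe_z).astype(int)
        in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        u_c, v_c = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
        # Visible only if no surface in this view is noticeably closer than the point.
        visible = in_image & (z <= depth[v_c, u_c] * (1.0 + tol))
        mask = visible[:, None].astype(float)
        color_sum += mask * img[v_c, u_c]       # nearest-pixel color lookup
        weight_sum += mask
    # Uniform average over visible views; a learned attention fusion would go here.
    return color_sum / np.maximum(weight_sum, 1.0), weight_sum[:, 0] > 0
```

The second return value flags centers that were seen by at least one view. Without the visibility mask, a point hidden behind a car in one view would pick up the car's color from that view, which is the inconsistency the occlusion check is meant to remove.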
Hemispherical Background Gaussians:
- Function: Model distant backgrounds and the sky.
- Mechanism: Distribute Gaussians on a hemispherical surface outside the scene, and use an MLP to predict spherical harmonics coefficients from direction vectors.
- Design Motivation: In urban scenes, sky and distant buildings occupy a large fraction of pixels but lack depth information. Dedicated background modeling prevents foreground Gaussians from being "wasted" on distant content.
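
One way to place Gaussians on such a hemisphere is shown below. The Fibonacci-spiral sampling, the choice of +z as the up axis, and the function name `hemisphere_points` are assumptions of this sketch. The source does not specify how the points are distributed.

```python
import numpy as np

def hemisphere_points(n, radius, center=(0.0, 0.0, 0.0)):
    """Place n points roughly uniformly on the upper (z > 0) half of a sphere.

    Returns positions (n, 3) and unit directions (n, 3) from the center.
    The directions are what a color MLP would take as input.
    """
    i = np.arange(n) + 0.5
    z = i / n                                # uniform in height gives uniform area on a sphere
    r = np.sqrt(1.0 - z ** 2)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))   # golden-angle increments around the axis
    dirs = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)
    return np.asarray(center) + radius * dirs, dirs
```

The radius would be chosen large enough to enclose the foreground voxel grid, so that background Gaussians always render behind scene geometry.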

Loss & Training

\(\mathcal{L} = (1-0.2)\mathcal{L}_1 + 0.2\mathcal{L}_{SSIM} + 0.1\mathcal{L}_{entropy}\), where entropy regularization encourages opacity values to be close to 0 or 1, avoiding semi-transparent artifacts. The model is trained on 160 KITTI-360 scenes, with 30 stereo pairs per scene.
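
The weighting can be made concrete with a short NumPy sketch. The binary-entropy form of the regularizer is an assumption, since the source only states that it pushes opacities toward 0 or 1. Whether it is applied to per-Gaussian opacity or to rendered accumulated alpha is also not stated. The SSIM term is passed in precomputed, and the function names are hypothetical.

```python
import numpy as np

def opacity_entropy(alpha, eps=1e-6):
    """Mean binary entropy of opacities: near zero at 0 or 1, largest at 0.5."""
    a = np.clip(alpha, eps, 1.0 - eps)  # keep log() finite at the endpoints
    return float(np.mean(-a * np.log(a) - (1.0 - a) * np.log(1.0 - a)))

def total_loss(pred, target, ssim_loss, alpha, lam_ssim=0.2, lam_entropy=0.1):
    """Weighted sum matching the formula above; ssim_loss is computed elsewhere."""
    l1 = float(np.mean(np.abs(pred - target)))
    return (1.0 - lam_ssim) * l1 + lam_ssim * ssim_loss + lam_entropy * opacity_entropy(alpha)
```

Under this form, an opacity of 0.5 incurs the maximum penalty of \(\ln 2 \approx 0.693\), while values at either extreme incur almost none.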

Key Experimental Results

Main Results

Method	PSNR↑	SSIM↑	LPIPS↓	FPS
MVSplat	21.22	0.695	0.246	-
EDUS	22.13	0.761	0.178	-
EVolSplat	23.26	0.797	0.179	83.81

Zero-shot generalization on Waymo: PSNR 23.43, SSIM 0.786.

Ablation Study

Configuration	PSNR	Description
W/o IBR	21.06	IBR contributes +2.2 dB
W/o Position Offset	22.49	Offset corrects depth errors +0.77 dB
W/o Occlusion Check	22.97	Visibility filtering +0.29 dB
Full Model	23.26	—

Key Findings

IBR is the largest contributor: Querying colors directly from input images outperforms network prediction by more than 2 dB, as high-frequency textures in urban scenes must be preserved.
Zero-shot generalization to Waymo: The model trained on KITTI-360 reaches 23.43 dB on Waymo without fine-tuning, demonstrating the cross-dataset generalization ability of the global voxel representation.

Highlights & Insights

Efficiency of sparse 3D convolutions: Operations are restricted to occupied voxels, which avoids the memory explosion of dense 3D convolutions.
IBR vs. network-predicted colors: Urban scenes are rich in textural detail. Querying colors directly from the input images significantly outperforms letting the network "imagine" them.

Limitations & Future Work

Dynamic objects (moving vehicles/pedestrians) are not handled.
Reconstruction quality depends on the accuracy of the monocular depth estimates.
The fixed voxel size may not generalize well to all scene scales.

Comparison with Prior Work

vs MVSplat: Pixel alignment leads to multi-view inconsistency, whereas EVolSplat's global voxel structure is naturally consistent.
vs Per-Scene Optimization: Feed-forward inference needs no per-scene optimization, which makes it several orders of magnitude faster at the cost of a slight drop in quality.

Rating

Novelty: ⭐⭐⭐⭐ The combination of sparse 3D convolutions and IBR is highly effective for urban scenes.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on KITTI-360 and Waymo (zero-shot), with comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ Structured methodology with highly convincing tables and charts.
Value: ⭐⭐⭐⭐ Provides a strong baseline for feed-forward 3DGS in urban scenes.