Toward Real-World BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting

Conference: CVPR 2025
arXiv: 2504.01957
Code: https://hcis-lab.github.io/GaussianLSS/
Area: Autonomous Driving
Keywords: BEV Perception, Depth Uncertainty, Gaussian Splatting, Lift-Splat-Shoot, Semantic Segmentation

TL;DR

GaussianLSS adds depth uncertainty modeling to the classic Lift-Splat-Shoot (LSS) framework. It computes the variance of each pixel's depth distribution, converts it into a 3D Gaussian, and rasterizes the Gaussians with Gaussian Splatting to produce uncertainty-aware BEV features. On nuScenes it achieves state-of-the-art (SOTA) performance among unprojection methods while running 2.5\(\times\) faster and using about 74% less GPU memory than the strongest projection method, PointBEV.

Background & Motivation

Background: BEV perception is a core task in autonomous driving, providing a unified spatial representation for 3D detection, semantic segmentation, motion prediction, and planning. Existing methods fall into two main paradigms: (1) 2D unprojection methods (e.g., LSS, FIERY), which estimate depth to "lift" 2D features into 3D space; and (2) 3D projection methods (e.g., BEVFormer, SimpleBEV, PointBEV), which project predefined 3D queries onto the image plane to sample features without explicit depth estimation.

Limitations of Prior Work: (1) 3D projection methods achieve the highest accuracy, but sampling a dense 3D grid is computationally expensive, which makes real-time deployment difficult. (2) Traditional LSS is efficient but relies heavily on accurate depth estimation, which is an inherently ill-posed problem, so depth errors propagate directly into the BEV representation. (3) Existing LSS variants use softmax probability distributions for "soft" depth assignment but do not model depth uncertainty explicitly; softmax can assign vastly different probabilities to adjacent depth bins, leading to unstable BEV features.

Key Challenge: Unprojection methods are highly efficient but their accuracy is limited by the quality of depth estimation; projection methods are highly accurate but too slow. The challenge is to find a solution that is both efficient and robust to depth errors.

Goal: (1) Introduce depth uncertainty modeling into the LSS framework to reduce dependency on precise depth estimation; (2) Leverage Gaussian Splatting to achieve efficient BEV feature aggregation.

Key Insight: The authors observe that the variance of the depth distribution itself encodes depth estimation uncertainty: a large variance indicates high uncertainty, suggesting that features should "spread" over a wider spatial range to cover potential object locations. This matches the spatial "expansion" property of Gaussian distributions.

Core Idea: Calculate the mean and variance of the depth distribution for each pixel, convert them into 3D Gaussian distributions (where Mean = 3D position, Covariance = spatial uncertainty range), and then render them onto the BEV plane via Gaussian Splatting to achieve uncertainty-aware BEV feature aggregation.

Method

Overall Architecture

Input multi-view images → Backbone extracts features → CNN predicts splat feature \(F_i\), opacity \(\alpha_i\), and depth distribution \(P_i\) → Depth uncertainty transformation (Mean \(\mu\), Variance \(\sigma^2\) \(\rightarrow\) 3D Gaussians) → Multi-scale Gaussian Splatting rendering to the BEV plane → Fuse multi-scale BEV features → Segmentation head outputs predictions.

Key Designs

Depth Uncertainty Modeling:
- Function: Explicitly extract uncertainty information from the depth probability distribution.
- Mechanism: Given the discrete depth distribution \(P\) in LSS, compute the depth mean \(\mu = \sum_{i} P_i(p) d_i\) and depth variance \(\sigma^2 = \sum_{i} P_i(p)(d_i - \mu)^2\), and then define a tolerance range \(\hat{\mathbf{D}} = [\mu - k\sigma, \mu + k\sigma]\). This range transforms a "point estimation" into an "interval estimation with uncertainty" — when the model is uncertain about the depth (large \(\sigma\)), features will spread over a wider depth range. \(k\) is the error tolerance coefficient, empirically set to 0.5.
- Design Motivation: In traditional LSS, the softmax depth distribution looks probabilistic but is only used as weights in a sum, so the information carried by the spread of the distribution goes unused. Variance directly measures the confidence of the depth estimation.
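
The mean, variance, and tolerance range described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `depth_interval`, the tensor layout `(D, H, W)`, and the 1 m depth bins are assumptions chosen for the example.

```python
import numpy as np

def depth_interval(logits, depth_bins, k=0.5):
    """Per-pixel depth mean, std, and interval [mu - k*sigma, mu + k*sigma].

    logits:     (D, H, W) unnormalized depth scores (hypothetical network output)
    depth_bins: (D,) depth value d_i of each bin, in meters
    """
    # Numerically stable softmax over the depth axis gives P_i(p).
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=0, keepdims=True)

    d = depth_bins[:, None, None]
    mu = (probs * d).sum(axis=0)                # mu = sum_i P_i d_i
    var = (probs * (d - mu) ** 2).sum(axis=0)   # sigma^2 = sum_i P_i (d_i - mu)^2
    sigma = np.sqrt(var)
    return mu, sigma, mu - k * sigma, mu + k * sigma


depth_bins = np.linspace(1.0, 61.0, 61)   # assumed 1 m bins, for illustration only
logits = np.zeros((61, 1, 2))
logits[20, 0, 0] = 50.0                   # pixel 0: sharply peaked at the 21 m bin
# Pixel 1 keeps all-zero logits, i.e. a uniform (maximally uncertain) distribution.
mu, sigma, lo, hi = depth_interval(logits, depth_bins)
```

In this toy input the peaked pixel gets a near-zero interval width, while the uniform pixel gets a wide one, which is the behavior the tolerance range is meant to capture.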

3D Uncertainty Transformation and Gaussian Representation:
- Function: Convert 1D depth uncertainty into 3D spatial Gaussian distributions.
- Mechanism: Utilize camera intrinsic matrix \(I\) and extrinsic matrix \(E\) to back-project the pixel-depth points \((u,v,d_i)\) corresponding to each depth bin into the 3D space as \(p_i^{3d} = E^{-1}(d_i \cdot I^{-1}[u,v,1]^T)\), and then calculate the 3D mean \(\mu_{3d} = \sum_i P_i(p) p_i^{3d}\) and covariance matrix \(\Sigma = \sum_i P_i(p)(p_i^{3d} - \mu_{3d})(p_i^{3d} - \mu_{3d})^T\). This yields 3D Gaussians \(\mathcal{N}(\mu_{3d}, \Sigma)\), naturally representing spatial positions and uncertainty.
- Design Motivation: Depth uncertainty is 1D in the camera coordinate system, but when mapped to world coordinates, it stretches along the ray direction as a 3D ellipsoid. A Gaussian distribution is the most natural mathematical tool to describe this spatial uncertainty.
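
A rough NumPy sketch of the back-projection and moment computation for a single pixel is given below. The function name `pixel_to_gaussian`, the argument `cam_to_world` (standing in for \(E^{-1}\) as a 4×4 homogeneous matrix), and the example intrinsics are assumptions made for illustration.

```python
import numpy as np

def pixel_to_gaussian(u, v, probs, depth_bins, K, cam_to_world):
    """3D mean and covariance of one pixel's depth distribution.

    probs:        (D,) depth probabilities P_i(p) for this pixel
    depth_bins:   (D,) depth values d_i
    K:            (3, 3) camera intrinsics (I in the text)
    cam_to_world: (4, 4) inverse extrinsics (E^{-1} in the text)
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # I^{-1} [u, v, 1]^T
    pts_cam = depth_bins[:, None] * ray[None, :]      # (D, 3): d_i * ray
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts = pts_cam @ R.T + t                           # (D, 3): p_i^{3d}

    mean = probs @ pts                                # sum_i P_i p_i^{3d}
    diff = pts - mean
    cov = (probs[:, None] * diff).T @ diff            # sum_i P_i diff diff^T
    return mean, cov


# Toy setup: principal point at (240, 112), identity pose, uniform depth distribution.
K = np.array([[500.0, 0.0, 240.0],
              [0.0, 500.0, 112.0],
              [0.0, 0.0, 1.0]])
depth_bins = np.linspace(1.0, 61.0, 61)
probs = np.full(61, 1.0 / 61)
mean, cov = pixel_to_gaussian(240.0, 112.0, probs, depth_bins, K, np.eye(4))
```

Because all candidate points of one pixel lie on a single ray, the resulting covariance is rank one and stretched along that ray; here the only nonzero entry is the variance along the optical axis.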

Multi-Scale BEV Feature Rendering:
- Function: Efficiently render BEV features using Gaussian Splatting, and alleviate the depth mean inconsistency issue through multi-scale processing.
- Mechanism: Project 3D Gaussians (including mean, covariance, features, and opacity) onto the BEV plane, and render using alpha-blending: \(\mathbf{F}_{BEV}(\mathbf{x}) = \sum_i F_i \alpha_i \exp(-\frac{1}{2}(\mathbf{x}-\mu_i)^\top\Sigma_i^{-1}(\mathbf{x}-\mu_i))\). To address BEV feature distortion caused by depth mean jumps between adjacent pixels, render BEV features at multiple resolutions (50×50, 100×100, 200×200) and then upsample and fuse them.
- Design Motivation: The rasterization of Gaussian Splatting is highly efficient (based on tile-based rendering) and naturally supports spatial expansion (via covariance matrices), perfectly fitting uncertainty-aware feature aggregation. Multi-scale rendering borrows concepts from Feature Pyramid Networks (FPN).
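
To make the rendering formula concrete, here is a dense NumPy evaluation of it on a BEV grid. It is only a sketch under several assumptions: it loops over Gaussians instead of using a tile-based rasterizer, it takes Gaussians already reduced to the 2D BEV plane (for example by dropping the height axis of the mean and covariance), and it fuses scales by nearest-neighbor upsampling and summation, which stands in for whatever learned fusion the method uses. The names `splat_bev` and `multiscale_bev` and the 50 m extent are hypothetical.

```python
import numpy as np

def splat_bev(means, covs, feats, alphas, grid_size, extent=50.0):
    """Evaluate F_BEV(x) = sum_i F_i * alpha_i * exp(-0.5 * maha_i(x)) on a grid.

    means: (N, 2), covs: (N, 2, 2), feats: (N, C), alphas: (N,)
    Returns (grid_size, grid_size, C) covering [-extent, extent]^2.
    """
    step = 2 * extent / grid_size
    axis = -extent + step * (np.arange(grid_size) + 0.5)   # cell centers
    gx, gy = np.meshgrid(axis, axis, indexing="xy")
    cells = np.stack([gx, gy], axis=-1).reshape(-1, 2)     # (M, 2)

    bev = np.zeros((cells.shape[0], feats.shape[1]))
    for mu, cov, f, a in zip(means, covs, feats, alphas):
        diff = cells - mu
        # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) per cell.
        maha = np.einsum("mi,ij,mj->m", diff, np.linalg.inv(cov), diff)
        bev += a * np.exp(-0.5 * maha)[:, None] * f
    return bev.reshape(grid_size, grid_size, -1)


def multiscale_bev(means, covs, feats, alphas, sizes=(50, 100, 200)):
    """Render at several resolutions, upsample to the finest, and sum."""
    out = sizes[-1]
    maps = [splat_bev(means, covs, feats, alphas, s) for s in sizes]
    return sum(np.repeat(np.repeat(m, out // s, axis=0), out // s, axis=1)
               for m, s in zip(maps, sizes))


# One Gaussian centered on a cell center of the 200x200 grid.
means = np.array([[0.25, 0.25]])
covs = np.array([np.eye(2)])
feats = np.array([[1.0, 2.0]])
alphas = np.array([0.5])
bev = splat_bev(means, covs, feats, alphas, grid_size=200)
fused = multiscale_bev(means, covs, feats, alphas)
```

At the Gaussian's center the exponential equals 1, so the rendered feature is simply `alpha * F`; cells far from the center receive a negligible contribution.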

Loss & Training

Three loss functions are employed: focal loss for segmentation (\(\lambda_1=1\)), L1 loss for centerness (\(\lambda_2=2\)), and L2 loss for offset (\(\lambda_3=0.1\)). Optimizer: AdamW, learning rate: \(3 \times 10^{-4}\) with cosine annealing, total batch size: 8, GPU: 2\(\times\) RTX 4090, training epochs: 50. Backbone is EfficientNet-B4.
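
Assuming the three terms are combined as a plain weighted sum with the stated coefficients (the usual convention, though the combination is not spelled out here), the total objective would read as below. The subscripts seg, center, and offset are labels introduced for this illustration.

\[
\mathcal{L} = \lambda_1 \mathcal{L}_{\text{seg}} + \lambda_2 \mathcal{L}_{\text{center}} + \lambda_3 \mathcal{L}_{\text{offset}}, \qquad (\lambda_1, \lambda_2, \lambda_3) = (1,\ 2,\ 0.1)
\]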

Key Experimental Results

Main Results

nuScenes Vehicle BEV semantic segmentation (IoU, 224\(\times\)480 resolution, without visibility filtering):

Method	Type	Backbone	IoU↑
BEVFormer	3D projection	RN-50	35.8
SimpleBEV	3D projection	RN-50	36.9
PointBEV	3D projection	EN-b4	38.7
FIERY static	2D unprojection	EN-b4	35.8
CVT	2D unprojection	EN-b4	31.4
GaussianLSS	2D unprojection	EN-b4	38.3

Efficiency comparison:

Method	FPS↑	GPU Mem (GiB)↓	IoU
PointBEV	32.0	1.26	38.7
CVT	107.6	0.35	31.4
GaussianLSS	80.2	0.33	38.3
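
The headline speed and memory ratios follow directly from the PointBEV and GaussianLSS rows of this table; the short check below just redoes the arithmetic (the dictionary keys are arbitrary names).

```python
pointbev = {"fps": 32.0, "mem_gib": 1.26, "iou": 38.7}
gaussianlss = {"fps": 80.2, "mem_gib": 0.33, "iou": 38.3}

speedup = gaussianlss["fps"] / pointbev["fps"]
mem_saving = 1.0 - gaussianlss["mem_gib"] / pointbev["mem_gib"]
iou_gap = pointbev["iou"] - gaussianlss["iou"]

print(f"speedup: {speedup:.2f}x")            # speedup: 2.51x
print(f"memory saving: {mem_saving:.1%}")    # memory saving: 73.8%
print(f"IoU gap: {iou_gap:.1f} points")      # IoU gap: 0.4 points
```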

Ablation Study

Effect of the error tolerance coefficient \(k\):

k value	Vehicle IoU	Description
0.25	~37.0	Too small, insufficient uncertainty coverage
0.50	38.3	Optimal
1.00	~38.0	Still within reasonable range
2.00	~35.0	Too large, features over-diffused
Direct extent prediction	37.0	Without uncertainty, 1.3-point drop

Effectiveness of opacity learning:

Epoch	Retained Gaussian Ratio (α>0.01)	Vehicle IoU
Initial	~100%	Low
After convergence	~20%	Optimal
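
The retained ratio in this table is the fraction of Gaussians whose opacity exceeds 0.01. A minimal sketch of that thresholding is shown below; the function name `prune_by_opacity` is hypothetical, and the opacity values are synthetic, constructed only to reproduce a 20% retained ratio.

```python
import numpy as np

def prune_by_opacity(alphas, threshold=0.01):
    """Return a boolean keep-mask and the fraction of Gaussians retained."""
    keep = alphas > threshold
    return keep, float(keep.mean())


# Synthetic opacities: 80 near-transparent Gaussians and 20 clearly visible ones.
alphas = np.concatenate([np.full(80, 0.001), np.full(20, 0.6)])
keep, ratio = prune_by_opacity(alphas)
```

Only the Gaussians selected by `keep` contribute meaningfully to the rendered BEV features, so the mask could also be used to skip the rest at inference time.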

Key Findings

GaussianLSS achieves SOTA (38.3 IoU) among unprojection methods, only 0.4 IoU points below the strongest projection method, PointBEV, while running 2.5\(\times\) faster and using 74% less GPU memory.
Directly predicting the extent scores 1.3 IoU points lower than deriving it from depth uncertainty (37.0 vs. 38.3), which suggests that uncertainty modeling is superior to deterministic position prediction.
Performance is stable within the range of \(k \in [0.5, 1.25]\), but an excessively large \(k\) causes feature over-diffusion, leading to performance degradation.
GaussianLSS outperforms PointBEV on distant objects (\(>30\)m) — demonstrating that uncertainty modeling is especially important in long-range scenarios with high depth ambiguity.
After training convergence, 80% of the Gaussian points have an opacity below 0.01, indicating that the model automatically learns to focus only on semantically relevant regions.

Highlights & Insights

Uncertainty \(\approx\) Target Extent: Depth variance not only reflects the estimation uncertainty but also implicitly encodes the spatial range of objects (larger objects have more "dispersed" depth distributions). This is an elegant dual interpretation.
A New Use Case for Gaussian Splatting: Transferring 3DGS from rendering tasks to feature aggregation in BEV perception is a highly creative application. The efficient rasterization of GS is naturally suited for scenarios requiring spatial expansion.
Adaptive Pruning of Opacity: The model automatically learns to filter out 80% of redundant points using opacity, achieving adaptive sparsification without requiring post-processing.

Limitations & Future Work

Validated only on nuScenes; not tested on other datasets such as Waymo or Argoverse.
Currently only handles single-frame perception and does not utilize temporal information — incorporating temporal context would allow uncertainty to propagate and update across time.
Object shape prediction (IoU shape quality) is slightly inferior to projection methods.
Multi-scale rendering introduces additional computational overhead, which might not be ideal for scenarios with extreme real-time requirements.
Future work can extend this approach to more BEV tasks like 3D detection and map segmentation.

vs. LSS (Philion & Fidler): LSS pioneered the lift-splat paradigm but only performed softmax depth weighting, lacking uncertainty awareness. GaussianLSS introduces variance modeling and GS rendering to the same paradigm, representing a fundamental upgrade to LSS.
vs. PointBEV: PointBEV achieves high accuracy using a coarse-to-fine 3D grid strategy (a projection method) but is relatively slow. GaussianLSS achieves comparable accuracy and much faster speed using unprojection + GS.
vs. BEVFormer: BEVFormer uses 3D queries for cross-attention, which is computationally expensive. GaussianLSS replaces the attention mechanism with efficient GS rasterization.

Rating

Novelty: ⭐⭐⭐⭐ Elegantly and clearly integrates depth uncertainty and GS rendering into BEV perception.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on multiple tasks on nuScenes with comprehensive ablations, efficiency analysis, and long-range analysis.
Writing Quality: ⭐⭐⭐⭐ Clearly described method with intuitive illustrations.
Value: ⭐⭐⭐⭐ Friendly to practical deployment (fast + low memory usage), representing a substantial advancement for unprojection methods.