Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction¶
Conference: AAAI 2026 arXiv: 2508.10936 Code: GitHub Area: Autonomous Driving / Collaborative Perception Keywords: Collaborative Perception, 3D Gaussian Splatting, Semantic Occupancy, V2X Communication, Vision-Only
TL;DR¶
This paper proposes the first vision-only semantic occupancy prediction framework that uses sparse 3D semantic Gaussian primitives as the communication medium for collaborative perception. Through ROI cropping, rigid transformation of Gaussians, and a neighborhood fusion module to suppress noise and redundancy, the method achieves +8.42 mIoU over the single-agent baseline and +3.28 mIoU over the baseline collaborative method.
Background & Motivation¶
- Background: Collaborative perception extends single-agent perception range via V2X communication. 3D semantic occupancy prediction provides finer-grained scene understanding than BEV or 3D object detection.
- Limitations of Prior Work: Existing collaborative occupancy methods (e.g., CoHFF) rely on tri-plane features requiring depth supervision, multi-stage training, and complex cross-agent alignment; dense voxel feature transmission incurs high communication costs.
- Key Challenge: Fine-grained 3D representations demand large data transmission, which conflicts with limited V2X communication bandwidth.
- Goal: Design a communication-efficient, end-to-end trainable collaborative semantic occupancy prediction framework.
- Key Insight: 3D Gaussians are sparse, rigidly transformable representations that simultaneously encode geometry and semantics.
- Core Idea: Replace voxel/plane features with 3D Gaussian primitives as the V2X communication medium, which naturally supports rigid alignment and sparse transmission.
Method¶
Overall Architecture¶
- Single agent: Image-to-Gaussian module (randomly initialized Gaussians refined under multi-scale image feature guidance) → Gaussian-to-voxel splatting → occupancy prediction.
- Collaborative: Gaussian packaging (rigid transformation + ROI cropping) → transmission → cross-agent Gaussian fusion → splatting.
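The Gaussian-to-voxel splatting step can be sketched as follows. This is a minimal dense version under simplifying assumptions (Gaussian rotation is ignored, each primitive is treated as axis-aligned, and `splat_to_voxels` and its grid layout are illustrative, not the paper's implementation):

```python
import numpy as np

def splat_to_voxels(means, scales, opacities, sems, grid_min, voxel_size, grid_shape):
    """Accumulate opacity-weighted semantic mass from each Gaussian onto a voxel
    grid. Rotation is omitted for brevity: each Gaussian is treated as
    axis-aligned with per-axis standard deviations `scales`.
    Returns per-voxel class scores of shape grid_shape + (num_classes,)."""
    num_classes = sems.shape[1]
    scores = np.zeros(grid_shape + (num_classes,))
    # Centers of every voxel in the grid.
    idx = np.stack(np.meshgrid(*[np.arange(n) for n in grid_shape], indexing="ij"), axis=-1)
    centers = grid_min + (idx + 0.5) * voxel_size
    for mu, s, alpha, sem in zip(means, scales, opacities, sems):
        # Unnormalized Gaussian density at every voxel center, scaled by opacity.
        w = alpha * np.exp(-0.5 * np.sum(((centers - mu) / s) ** 2, axis=-1))
        scores += w[..., None] * sem
    return scores
```

A voxel's semantic label would then be the argmax over the class dimension, with low-mass voxels treated as empty.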
Key Designs¶
Design 1: Gaussian Primitives as Communication Medium

- Function: Transmit 3D Gaussians (mean + scale + rotation + opacity + semantics) as V2X messages.
- Mechanism: Gaussians are closed under rigid transformation (means: rotation + translation; orientations: quaternion multiplication; scale, opacity, and semantics unchanged); ROI cropping transmits only the Gaussians that fall within the ego agent's region of interest.
- Design Motivation: Sparser than voxel features, better at preserving 3D structure than plane features, and alignment requires only a rigid transformation.
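The packaging step (rigid transformation + ROI cropping) is simple precisely because of this closure property. A minimal sketch, assuming a (w, x, y, z) quaternion convention and an axis-aligned ROI box; the function names are illustrative, not the paper's code:

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of quaternions in (w, x, y, z) convention."""
    w1, x1, y1, z1 = q1[..., 0], q1[..., 1], q1[..., 2], q1[..., 3]
    w2, x2, y2, z2 = q2[..., 0], q2[..., 1], q2[..., 2], q2[..., 3]
    return np.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ], axis=-1)

def transform_and_crop(means, rots, scales, opacities, sems, R, t, q_R, roi_min, roi_max):
    """Rigidly map sender Gaussians into the ego frame (rotation R, translation t,
    with q_R the quaternion of R), then keep only those inside the ego ROI box."""
    means_ego = means @ R.T + t                                   # rotate + translate means
    rots_ego = quat_mul(np.broadcast_to(q_R, rots.shape), rots)   # compose orientations
    # Scale, opacity, and semantics are invariant under a rigid transform.
    keep = np.all((means_ego >= roi_min) & (means_ego <= roi_max), axis=1)
    return means_ego[keep], rots_ego[keep], scales[keep], opacities[keep], sems[keep]
```

Note that only the cropped subset is serialized and sent, which is where the communication savings come from.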
Design 2: Cross-Agent Gaussian Fusion Module

- Function: Fuse Gaussian primitives from multiple agents while suppressing noise and redundancy.
- Mechanism: Neighborhood-conditioned proposal → cross-neighborhood pooling → attribute blending with ego Gaussians. Unlike GaussianFormer's single-agent refinement, this module specifically addresses cross-agent inconsistencies.
- Design Motivation: Gaussians from different agents may be redundant or conflicting due to occlusion-induced noise, necessitating learned fusion.
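As a rough intuition for what the fusion module does, consider this fixed-rule stand-in. The real module is learned; the hard radius, equal-weight blending, and `neighborhood_fuse` name here are purely illustrative assumptions:

```python
import numpy as np

def neighborhood_fuse(ego_means, ego_sems, recv_means, recv_sems, radius=1.0):
    """Blend semantics of received Gaussians into nearby ego Gaussians.
    The paper replaces this fixed rule with a learned neighborhood-conditioned
    proposal + cross-neighborhood pooling; here we simply average the semantic
    logits of received neighbors within `radius` of each ego Gaussian."""
    fused = ego_sems.copy()
    for i, mu in enumerate(ego_means):
        dists = np.linalg.norm(recv_means - mu, axis=1)
        nbrs = dists < radius
        if nbrs.any():
            # Equal-weight blend of ego semantics and mean neighbor semantics.
            fused[i] = 0.5 * ego_sems[i] + 0.5 * recv_sems[nbrs].mean(axis=0)
    return fused
```

A learned module can instead weight neighbors by confidence and geometry, which is what lets it suppress occlusion-induced noise rather than averaging it in.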
Design 3: End-to-End Single-Stage Training

- Function: The entire pipeline is trained end-to-end without separate depth estimation or multi-stage schedules.
- Mechanism: GaussianFormer-based Image-to-Gaussian and Gaussian-to-voxel modules, jointly optimized with the fusion module.
- Design Motivation: CoHFF requires two-stage training and an independent depth network, increasing overall complexity.
Loss & Training¶
Standard semantic occupancy loss (CE + Lovász loss), single-stage end-to-end training.
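A minimal NumPy sketch of this objective over flattened voxels. The Lovász term follows the standard Lovász-Softmax formulation; class weighting and ignore labels, which the paper may use, are omitted, and `occupancy_loss` is an assumed name:

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def occupancy_loss(probs, labels):
    """Cross-entropy + Lovász-Softmax.
    probs: (N, C) softmax probabilities; labels: (N,) integer class ids."""
    n, num_classes = probs.shape
    ce = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    lovasz_terms = []
    for c in range(num_classes):
        fg = (labels == c).astype(probs.dtype)
        if fg.sum() == 0:          # class absent from this batch
            continue
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors)  # sort errors in decreasing order
        lovasz_terms.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return ce + (np.mean(lovasz_terms) if lovasz_terms else 0.0)
```

The Lovász term directly optimizes a surrogate of per-class IoU, which complements voxel-wise cross-entropy on the heavily imbalanced occupancy classes.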
Key Experimental Results¶
Main Results¶
| Method | IoU↑ | mIoU↑ |
|---|---|---|
| Single-Agent GSFormer | 67.76 | 29.20 |
| CoHFF (Collaborative) | 50.46 | 34.16 |
| Zero-Shot Stacking | 67.88 | 30.54 |
| Naive Fusion | 70.10 | 36.02 |
| Learned Fusion | 72.87 | 37.44 |
Ablation Study¶
| Communication Volume | mIoU |
|---|---|
| 100% Gaussians | 37.44 |
| 34.6% Gaussians | 36.06 (+1.90 vs. CoHFF) |
Key Findings¶
- Even zero-shot stacking of other agents' Gaussians (without any joint training) improves over the single-agent baseline (30.54 vs. 29.20 mIoU), validating the advantage of explicit representations.
- Transmitting only 34.6% of the Gaussians still surpasses the collaborative baseline CoHFF by +1.90 mIoU (36.06 vs. 34.16), demonstrating high communication efficiency.
- Learned fusion improves over naive fusion by +1.4 mIoU, confirming the effectiveness of the neighborhood fusion module.
Highlights & Insights¶
- This work is the first to bring 3D Gaussian splatting into collaborative perception, opening a new research direction.
- The closure of Gaussians under rigid transformation makes cross-agent alignment trivial.
- Sparse, explicit, and interpretable representations are better suited to communication-constrained scenarios than implicit features.
Limitations & Future Work¶
- Experiments are conducted only on simulated data; validation on real-world V2X datasets is absent.
- Gaussian initialization remains random; better initialization strategies warrant exploration.
- Communication latency and asynchrony are not addressed.
Related Work & Insights¶
- GaussianFormer applies Gaussians to single-agent occupancy prediction; this work extends the approach to collaborative settings.
- CoHFF is the first collaborative occupancy framework but relies on tri-plane features and depth supervision; this work substantially simplifies the pipeline.
- Insight: Choosing an appropriate representation can simultaneously address communication efficiency and alignment difficulty.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ★★★★★ |
| Practicality | ★★★★☆ |
| Experimental Thoroughness | ★★★☆☆ |
| Writing Quality | ★★★★☆ |