An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving¶

Conference: CVPR 2026
arXiv: 2603.27238
Code: https://mias.group/CarlaOcc
Area: Autonomous Driving
Keywords: Panoptic Occupancy Prediction, 3D Mesh Library, CARLA Simulation, Instance-Level Annotation, Occupancy Dataset Quality

TL;DR¶

This paper proposes ADMesh, a library of 15K+ high-quality 3D models, and CarlaOcc, a panoptic occupancy dataset of 100K frames at 0.05 m resolution. Together they provide what the authors describe as the first instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, alongside occupancy quality evaluation metrics and a systematic benchmark.

Background & Motivation¶

Background: 3D occupancy prediction is evolving from purely semantic occupancy toward fine-grained panoptic occupancy (joint semantic and instance prediction). Methods such as SparseOcc and PanoOcc have been proposed, but remain constrained by dataset quality.

Limitations of Prior Work: (1) Existing datasets lack instance-level annotations—SparseOcc/PaSCo generate pseudo-panoptic labels via heuristics (3D box grouping/clustering), introducing boundary artifacts and instance overlaps; (2) Existing ground truth relies on LiDAR point cloud aggregation and voxelization, resulting in coarse resolution (0.2–0.5 m), incomplete geometry (only sensor-visible surfaces), and physical inconsistencies (holes and fractures); (3) No unified high-quality 3D model library exists—current resources are fragmented and platform-dependent.

Key Challenge: Panoptic occupancy prediction requires precise instance-level geometric annotations, yet the generation pipelines of existing datasets (LiDAR aggregation → voxelization) are fundamentally incapable of providing physically consistent and complete ground truth.

Key Insight: Start from 3D meshes rather than point clouds, since meshes encode complete geometry and can be voxelized at arbitrary resolution.

Core Idea: Build a unified 3D model library (ADMesh) → reconstruct complete scene meshes via CARLA simulation → apply topology-aware voxelization to generate physically consistent panoptic occupancy labels.

Method¶

Overall Architecture¶

Four major components: (1) ADMesh 3D model library construction; (2) CarlaOcc dataset generation (scene mesh reconstruction → voxelization → sensor artifact correction); (3) occupancy quality evaluation metrics; (4) systematic benchmarking.

Key Designs¶

ADMesh: Unified 3D Model Library:
- Function: Integrates 15K+ high-quality 3D models from four sources—CARLA, BuildingNet, MeshFleet, and ShapeNet.
- Mechanism: Develops an automated mesh export toolchain—traverses CARLA scenes → extracts component-level mesh assets → queries component hierarchy and transforms via the UE editor interface → integrates CARLA's native semantic annotation system → hierarchically assembles complete object-level meshes. A unified data organization framework ensures consistency in naming, coordinate systems, and semantic hierarchy.
- Design Motivation: Simulation platform assets are fragmented, non-standardized, and platform-bound; a unified framework is needed to support large-scale dataset construction.
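
The hierarchical assembly step can be pictured as composing each component's local transform with those of its ancestors and applying the result to the component's vertices. The sketch below illustrates only that idea. The `Component` class, its fields, and the translation-only transforms are hypothetical simplifications and do not reflect the CARLA or UE editor interfaces; real assets would also carry rotation and scale.

```python
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]


@dataclass
class Component:
    """Hypothetical node in an object's component hierarchy."""
    name: str
    offset: Vec3                                          # translation relative to the parent
    vertices: list[Vec3] = field(default_factory=list)    # in local coordinates
    children: list["Component"] = field(default_factory=list)


def assemble(node: Component, parent_origin: Vec3 = (0.0, 0.0, 0.0)) -> list[Vec3]:
    """Flatten a component tree into a single list of object-level vertices."""
    # Accumulate the translation from the root down to this node.
    origin = tuple(p + o for p, o in zip(parent_origin, node.offset))
    out = [tuple(v + o for v, o in zip(vert, origin)) for vert in node.vertices]
    for child in node.children:
        out.extend(assemble(child, origin))
    return out


# Toy hierarchy: car -> body -> wheel, one vertex per component.
wheel = Component("wheel", (1.0, 0.5, 0.0), [(0.0, 0.0, 0.25)])
body = Component("body", (0.0, 0.0, 0.5), [(0.0, 0.0, 0.0)], [wheel])
car = Component("car", (10.0, 0.0, 0.0), [], [body])
object_vertices = assemble(car)
```

Keeping the hierarchy explicit until the final flattening step is what lets a single export pass preserve per-component semantics while still producing one object-level mesh.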
Mesh-Based Scene Reconstruction:
- Function: Directly reconstructs the panoptic scene mesh for each frame from 3D meshes (rather than LiDAR point aggregation).
- Mechanism:
  - Static background: Selects background meshes \(\mathcal{S}_{bg}\) intersecting the occupancy region.
  - Rigid foreground: Matches models \(\mathcal{S}_{fg}^r\) from ADMesh using a lookup table (LUT).
  - Non-rigid foreground (pedestrians): A skeletal motion analyzer preprocesses walking animations into \(D\) discrete phase template meshes; at runtime, the current skeletal state \(\delta_k\) is matched to the nearest template state \(\delta_d\) under a geodesic distance \(\mathcal{G}\): \(d_k = \arg\min_{d \in \{1,\dots,D\}} \mathcal{G}(\delta_k, \delta_d)\).
  - Merging: \(\mathcal{M}^{pano} = \mathcal{S}_{bg} \cup \mathcal{S}_{fg}^r \cup \mathcal{S}_{fg}^n\), where \(\mathcal{S}_{fg}^n\) denotes the matched non-rigid foreground meshes.
- Design Motivation: Meshes preserve complete geometric information, avoiding the incompleteness caused by LiDAR sparse sampling and occlusion.
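
The phase-matching rule \(d_k = \arg\min_d \mathcal{G}(\delta_k, \delta_d)\) can be sketched as a nearest-neighbour search over the precomputed templates. The summary does not specify how a skeletal state is parameterized or how \(\mathcal{G}\) is defined, so the sketch below assumes a pose is a list of per-joint angles in radians and takes \(\mathcal{G}\) to be the geodesic distance on the resulting torus (wrapped angular differences). All function names and the toy templates are hypothetical.

```python
import math


def wrapped_diff(a: float, b: float) -> float:
    """Shortest signed angular difference between two angles, in radians."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def geodesic(pose_a: list[float], pose_b: list[float]) -> float:
    """Assumed geodesic distance between two poses given as per-joint angles."""
    return math.sqrt(sum(wrapped_diff(a, b) ** 2 for a, b in zip(pose_a, pose_b)))


def nearest_phase(pose: list[float], templates: list[list[float]]) -> int:
    """Index d_k of the template phase closest to the current pose."""
    return min(range(len(templates)), key=lambda d: geodesic(pose, templates[d]))


# D = 4 hypothetical walking phases for a two-joint skeleton (hip, knee).
templates = [[0.0, 0.0], [0.5, 0.2], [0.0, 0.4], [-0.5, 0.2]]
current_pose = [0.45, 0.25]
phase_index = nearest_phase(current_pose, templates)  # selects the second template
```

Because the templates are fixed ahead of time, the runtime cost per pedestrian is a single pass over \(D\) distances, and the selected template mesh can be placed directly into the scene.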
Topology-Aware Mesh Displacement Strategy:
- Function: Generates overlap-free panoptic occupancy labels from the panoptic scene mesh.
- Mechanism: Merges stuff meshes by semantic category (eliminating redundant boundaries), then sorts instances by world height and integrates them via layer-by-layer voxelization from bottom to top—ensuring lower structures do not overwrite higher ones.
- Design Motivation: Independently voxelizing each mesh is computationally expensive and produces label conflicts.
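
The bottom-to-top integration can be sketched as a sort followed by ordered writes into a shared grid, so that every voxel ends up with exactly one label. This is only an illustration of the ordering idea: it assumes each instance is given as points sampled from its mesh (a stand-in for real mesh voxelization), that "world height" means the instance's lowest point, and that a later, higher layer overwrites an earlier one. The function names and toy data are hypothetical.

```python
import math

VOXEL_SIZE = 0.05  # metres, the resolution reported for CarlaOcc


def voxel_index(point, voxel_size=VOXEL_SIZE):
    """Map a 3D point to its integer voxel coordinate."""
    return tuple(math.floor(c / voxel_size) for c in point)


def layered_voxelize(instances, voxel_size=VOXEL_SIZE):
    """Voxelize instances from lowest to highest; each voxel keeps one label.

    `instances` maps an instance id to points sampled from its mesh.
    """
    def base_height(item):
        _, points = item
        return min(p[2] for p in points)

    grid = {}
    for inst_id, points in sorted(instances.items(), key=base_height):
        for p in points:
            # Written in height order, so the higher instance wins on overlap.
            grid[voxel_index(p, voxel_size)] = inst_id
    return grid


# The road and the car both touch the voxel at z index 2.
instances = {
    "car_7": [(1.01, 1.01, 0.11), (1.01, 1.01, 0.16)],
    "road_0": [(1.01, 1.01, 0.01), (1.01, 1.01, 0.11)],
}
grid = layered_voxelize(instances)
```

In this toy case the contested voxel is assigned to `car_7`, which matches the stated goal that lower structures do not overwrite higher ones.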
Instance-Guided Sensor Artifact Correction:
- Function: Corrects depth and semantic artifacts caused by transparent/semi-transparent objects in CARLA rendering.
- Mechanism: Constructs a scene mesh containing only transparent objects → generates accurate depth for those objects via ray casting → repairs the original depth map by taking, at each pixel, the minimum of the rendered depth and the ray-cast depth.
- Design Motivation: CARLA incorrectly renders depth and semantics of transparent objects, displaying the opaque objects behind them instead.
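
The depth repair reduces to an element-wise minimum between the rendered depth map and the depth obtained by ray casting against the transparent-only mesh. The sketch below shows that step on nested lists. The semantic repair rule (take the transparent object's label wherever its surface is closer) is an assumption made here for illustration, since the summary describes only the depth step, and all names are hypothetical.

```python
INF = float("inf")  # marks pixels where no transparent surface is hit


def correct_depth(rendered, transparent):
    """Per-pixel minimum of the rendered depth and the ray-cast transparent depth."""
    return [[min(r, t) for r, t in zip(row_r, row_t)]
            for row_r, row_t in zip(rendered, transparent)]


def correct_labels(labels, rendered, transparent, transparent_labels):
    """Assumed rule: use the transparent object's label where its surface is closer."""
    return [[tl if t < r else l
             for l, r, t, tl in zip(row_l, row_r, row_t, row_tl)]
            for row_l, row_r, row_t, row_tl
            in zip(labels, rendered, transparent, transparent_labels)]


# 2x2 toy images: a window at 5 m in front of a building rendered at 12 m,
# and a glass pane at 9 m that lies behind an opaque car at 8 m.
rendered = [[12.0, 12.0], [8.0, 30.0]]
transparent = [[INF, 5.0], [9.0, INF]]
labels = [["building", "building"], ["car", "road"]]
transparent_labels = [[None, "glass"], ["glass", None]]

fixed_depth = correct_depth(rendered, transparent)
fixed_labels = correct_labels(labels, rendered, transparent, transparent_labels)
```

Taking the minimum rather than overwriting means that opaque objects in front of a transparent surface, such as the car in the second row, keep their original depth and label.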

Occupancy Quality Evaluation Metrics¶

Spatial Continuity Score (\(s_{sc}\)): Quantifies the spatial continuity of occupied voxels within the same semantic category (higher is better).
Temporal Consistency Score (\(s_{tc}\)): Quantifies the temporal stability of occupancy labels across adjacent frames (higher is better).
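
The summary does not give formulas for \(s_{sc}\) or \(s_{tc}\), so the following is only an illustrative proxy for what such scores might measure, not the paper's definitions. It assumes a sparse grid stored as a dict from voxel index to semantic label, measures spatial continuity as the fraction of occupied voxels with a face-adjacent neighbour of the same class, and measures temporal consistency as the share of voxels that keep their label between two frames already expressed in a common world frame.

```python
# Offsets to the six face-adjacent neighbours of a voxel.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def spatial_continuity_proxy(grid):
    """Fraction of occupied voxels with a face neighbour of the same class."""
    if not grid:
        return 0.0
    connected = 0
    for (i, j, k), label in grid.items():
        if any(grid.get((i + di, j + dj, k + dk)) == label
               for di, dj, dk in NEIGHBOURS):
            connected += 1
    return connected / len(grid)


def temporal_consistency_proxy(grid_t, grid_next):
    """Share of voxels occupied in either frame that keep the same label."""
    voxels = set(grid_t) | set(grid_next)
    if not voxels:
        return 0.0
    same = sum(1 for v in voxels if v in grid_t and grid_t[v] == grid_next.get(v))
    return same / len(voxels)


# Two road voxels that touch, plus one isolated pole voxel that flickers out.
frame_a = {(0, 0, 0): "road", (1, 0, 0): "road", (5, 5, 5): "pole"}
frame_b = {(0, 0, 0): "road", (1, 0, 0): "road"}

s_sc_proxy = spatial_continuity_proxy(frame_a)
s_tc_proxy = temporal_consistency_proxy(frame_a, frame_b)
```

Under these proxies, isolated voxels and labels that appear in only one frame both lower the score, which is the kind of hole and flicker artifact that LiDAR-aggregated ground truth tends to exhibit.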

Key Experimental Results¶

Dataset Quality Comparison¶

Dataset	Synthetic	Resolution (m)	Instance Ann.	\(s_{sc}\)↑	\(s_{tc}\)↑
SemanticKITTI	No	0.2	No	0.353	0.023
Occ3D-nuScenes	No	0.4	No	0.721	0.431
SurroundOcc	No	0.5	No	0.878	0.589
CarlaSC	Yes	0.4	No	0.887	0.775
CarlaOcc (Ours)	Yes	0.05	Yes	0.996	0.873

Benchmark Model Evaluation (Semantic Occupancy mIoU)¶

Model	Key Findings
Multiple SOTA methods	Models trained on CarlaOcc benefit from finer-grained ground truth
Panoptic occupancy task	For the first time, evaluation on genuine instance-level annotations is possible

Key Findings¶

CarlaOcc achieves the highest spatial continuity (0.996) and temporal consistency (0.873) among the compared datasets; the next best, CarlaSC, scores 0.887 and 0.775.
The 0.05 m resolution is 4× finer per axis than the finest compared dataset (SemanticKITTI at 0.2 m), which corresponds to 64× more voxels for the same volume.
The instance-guided artifact correction pipeline effectively rectifies rendering errors for transparent objects.
The mesh-based generation pipeline avoids the information loss caused by sparse sampling and occlusion in LiDAR aggregation.

Highlights & Insights¶

Paradigm Shift from Point Clouds to Meshes: Meshes encode complete geometric information, fundamentally resolving the resolution and completeness limitations of LiDAR aggregation pipelines. This has significant implications for the methodology of synthetic dataset construction.
Skeletal Motion Analyzer: Provides an elegant solution for accurate reconstruction of non-rigid objects (pedestrians)—preprocessing animation phases combined with runtime geodesic matching.
Quality Evaluation Metrics: For the first time, quantitative standards for spatial continuity and temporal consistency are defined to assess occupancy dataset quality.

Limitations & Future Work¶

Sim-to-real gap: whether models trained on CarlaOcc transfer to real driving scenarios remains an open question.
ADMesh assets are primarily sourced from CARLA; asset diversity remains constrained by the simulation platform.
The enormous voxel count at 0.05 m resolution imposes significant memory and computational overhead for model training and inference.
Pedestrian animations cover only walking cycles; more complex human motions (e.g., bending, crouching) require future extension.
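
The memory concern raised above can be made concrete with a short calculation. CarlaOcc's spatial extent is not stated here, so the sketch assumes a 51.2 m × 51.2 m × 6.4 m volume, borrowed from the SemanticKITTI scene completion grid (256 × 256 × 32 voxels at 0.2 m), purely to compare resolutions on equal footing. The one-byte-per-voxel label size is likewise an assumption.

```python
import math


def grid_shape(extent_m, voxel_size_m):
    """Number of voxels per axis needed to cover a box of the given extent."""
    return tuple(round(e / voxel_size_m) for e in extent_m)


extent = (51.2, 51.2, 6.4)          # assumed volume in metres, not CarlaOcc's actual range

coarse = grid_shape(extent, 0.2)    # 256 x 256 x 32
fine = grid_shape(extent, 0.05)     # 1024 x 1024 x 128

coarse_voxels = math.prod(coarse)
fine_voxels = math.prod(fine)
ratio = fine_voxels // coarse_voxels  # 4x finer per axis gives 4**3 = 64x voxels

# Dense storage at one byte per voxel (assumed semantic label only).
fine_mebibytes = fine_voxels / 2**20
```

Under these assumptions a single dense label grid grows from about 2 million to about 134 million voxels, or 128 MiB per frame before any instance ids are added, which is why sparse representations are attractive at this resolution.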

Comparison with Related Work¶

vs. Occ3D/SurroundOcc: Real-world datasets based on LiDAR aggregation with incomplete geometry. CarlaOcc generates labels from meshes and is physically consistent, but incurs a sim-to-real gap.
vs. CarlaSC: Also a CARLA-based synthetic dataset, but lacks instance annotations and has coarser resolution (0.4 m vs. 0.05 m).
vs. SparseOcc/PanoOcc: Model-level algorithmic innovations; this paper provides dataset-level infrastructure.

Rating¶

Novelty: ⭐⭐⭐⭐ First instance-level panoptic occupancy benchmark; ADMesh and the mesh reconstruction pipeline are innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive dataset quality evaluation, though downstream model benchmarking could be enriched.
Writing Quality: ⭐⭐⭐⭐⭐ Pipeline description is clear and complete; dataset statistics are detailed.
Value: ⭐⭐⭐⭐⭐ Provides foundational infrastructure for 3D panoptic occupancy research and advances the field.