An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving

Conference: CVPR 2026
arXiv: 2603.27238
Code: https://mias.group/CarlaOcc
Area: Autonomous Driving
Keywords: Panoptic Occupancy Prediction, 3D Mesh Library, CARLA Simulation, Instance-level Annotation, Occupancy Dataset Quality

TL;DR

This paper proposes ADMesh (a high-quality 3D model library with 15K+ assets) and CarlaOcc (a panoptic occupancy dataset with 100k frames at 0.05m voxel resolution). Together they provide the first instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, along with occupancy quality evaluation metrics and a systematic benchmark.

Background & Motivation

Background: 3D occupancy prediction is evolving from pure semantic occupancy to fine-grained panoptic occupancy (joint semantic and instance prediction). Methods like SparseOcc and PanoOcc have been proposed, but they are constrained by dataset quality.

Limitations of Prior Work: (1) Existing datasets lack instance-level annotations—SparseOcc/PaSCo generate pseudo-panoptic labels via heuristics (3D box grouping/clustering), introducing boundary artifacts and instance overlaps; (2) Current ground truth (GT) relies on LiDAR point cloud aggregation and voxelization, resulting in coarse resolution (0.2-0.5m), incomplete geometry (sensor-visible surfaces only), and physical inconsistency (holes and fractures); (3) Lack of a unified high-quality 3D model library—existing resources are fragmented and platform-dependent.

Key Challenge: Panoptic occupancy prediction requires precise instance-level geometric annotations, but existing generation pipelines (LiDAR aggregation → voxelization) fundamentally fail to provide physically consistent and complete ground truth.

Key Insight: Start from 3D meshes rather than point clouds—meshes contain complete geometry and can be voxelized at any arbitrary resolution.

Core Idea: Construct a unified 3D model library (ADMesh) → Reconstruct complete scene meshes via CARLA simulation → Apply topology-aware voxelization to generate physically consistent panoptic occupancy labels.

Method

Overall Architecture

This paper addresses the lack of high-quality data for panoptic occupancy prediction: existing ground truth derived from LiDAR aggregation is coarse, incomplete, and has no instance labels. The authors instead generate labels from meshes, which inherently contain complete geometry. The pipeline has two primary stages: first, assets from various simulation platforms are consolidated into a unified 3D model library, ADMesh; second, CARLA simulation is used to reconstruct complete scene meshes frame by frame. Topology-aware voxelization and sensor artifact repair then produce physically consistent panoptic occupancy GT (CarlaOcc) with instance labels at 0.05m resolution. Finally, a set of quality metrics and a benchmark are provided to quantify dataset quality.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Simulation Assets<br/>CARLA / BuildingNet / MeshFleet / ShapeNet"] --> B["ADMesh<br/>Unified 3D Model Library (15K+ object-level meshes)"]
    B --> C["Mesh-based Scene Reconstruction<br/>Background Filtering + Rigid LUT Matching + Pedestrian Skeletal Phase"]
    C --> D["Topology-aware Mesh Displacement Voxelization<br/>Stuff Merging → Layer-by-layer Union by Height, No Label Overlap"]
    E["Instance-guided Sensor Artifact Repair<br/>Separate Ray-casting for Transparent Objects, Point-wise Depth Minimum"]
    D --> E
    E --> F["CarlaOcc Panoptic Occupancy GT<br/>0.05m Resolution, Instance Labels Included"]
    F --> G["Occupancy Quality Evaluation<br/>Spatial Continuity / Temporal Consistency Metrics"]

Key Designs

1. ADMesh: Consolidating Fragmented Simulation Assets into a Unified Library

Generating data from meshes requires clean, consistently annotated 3D models. However, simulation assets are highly fragmented, with differing coordinate systems and platform lock-in. ADMesh integrates 15K+ models from CARLA, BuildingNet, MeshFleet, and ShapeNet using an automated mesh export toolchain. For CARLA, the toolchain extracts component-level assets, queries their hierarchies and transforms via UE Editor interfaces, and uses CARLA's native semantic system to reassemble the parts into complete object-level meshes. This ensures consistent naming, coordinates, and semantic hierarchies for large-scale reuse.
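The reassembly step can be illustrated with a minimal sketch. It assumes each component carries a 4x4 homogeneous transform into the object frame; the `Component` class, its field names, and the `assemble` helper are hypothetical and do not come from the ADMesh toolchain.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]
Mat4 = list[list[float]]  # row-major 4x4 homogeneous transform


@dataclass
class Component:
    name: str
    vertices: list[Vec3]               # in the component's local frame
    faces: list[tuple[int, int, int]]  # indices into `vertices`
    local_to_object: Mat4              # hypothetical: pose queried from the editor


def transform_point(m: Mat4, p: Vec3) -> Vec3:
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3))


def assemble(components: list[Component]):
    """Merge components into one vertex/face list expressed in the object frame."""
    vertices, faces = [], []
    for comp in components:
        offset = len(vertices)  # face indices shift by the vertices already merged
        vertices.extend(transform_point(comp.local_to_object, v) for v in comp.vertices)
        faces.extend((a + offset, b + offset, c + offset) for a, b, c in comp.faces)
    return vertices, faces


def translation(tx: float, ty: float, tz: float) -> Mat4:
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


# Toy example: two single-triangle components placed 2 m apart.
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
body = Component("body", tri, [(0, 1, 2)], translation(0, 0, 0))
wheel = Component("wheel", tri, [(0, 1, 2)], translation(2, 0, 0))
car_vertices, car_faces = assemble([body, wheel])
```

The key detail is the index offset: once vertices from several parts share one list, each part's faces must point at its own slice of that list.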

2. Mesh-based Scene Reconstruction: Replacing Sparse Point Clouds with Complete Geometry

LiDAR aggregation suffers from sparse sampling and occlusion, so surfaces such as the rear of an object are missing. The paper instead reconstructs the full panoptic scene mesh \(\mathcal{M}^{pano}\) frame by frame from three sources: for the static background, it keeps the meshes \(\mathcal{S}_{bg}\) that intersect the occupancy region; for rigid foreground objects (vehicles), it uses a look-up table (LUT) to retrieve the matching models \(\mathcal{S}_{fg}^r\) from ADMesh; for non-rigid foreground objects (pedestrians), it uses a skeletal motion analyzer. The analyzer pre-processes walking animations into \(D\) discrete phase templates and matches the current skeletal state \(\delta_k\) to the nearest phase, \(d_k = \arg\min_d \mathcal{G}(\delta_k, \delta_d)\), where \(\mathcal{G}\) is a geodesic distance. This reconstruction avoids LiDAR-induced holes and fractures.
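The phase matching \(d_k = \arg\min_d \mathcal{G}(\delta_k, \delta_d)\) can be sketched as follows. The summary does not give the exact form of \(\mathcal{G}\), so this sketch assumes a skeletal state is a list of per-joint unit quaternions and sums the rotation angle between corresponding joints; all function names are illustrative.

```python
import math

Quat = tuple[float, float, float, float]  # unit quaternion (w, x, y, z)


def rotation_geodesic(q1: Quat, q2: Quat) -> float:
    """Angle in radians of the rotation that takes q1 to q2."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))  # abs: q and -q encode the same rotation
    return 2.0 * math.acos(min(1.0, dot))          # clamp guards against rounding above 1


def pose_distance(pose_a: list[Quat], pose_b: list[Quat]) -> float:
    """Assumed form of G: sum of per-joint geodesic angles."""
    return sum(rotation_geodesic(a, b) for a, b in zip(pose_a, pose_b))


def nearest_phase(current: list[Quat], templates: list[list[Quat]]) -> int:
    """Index d of the template minimising the distance to the current pose."""
    return min(range(len(templates)), key=lambda d: pose_distance(current, templates[d]))


def rot_z(angle: float) -> Quat:
    return (math.cos(angle / 2), 0.0, 0.0, math.sin(angle / 2))


# Three hypothetical phase templates for a one-joint skeleton.
templates = [[rot_z(0.0)], [rot_z(0.5)], [rot_z(1.0)]]
phase = nearest_phase([rot_z(0.6)], templates)
```

Because the templates are pre-computed, the per-frame cost is a single pass over \(D\) templates, and the matched template's mesh can be reused without re-skinning the pedestrian every frame.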

3. Topology-aware Mesh Displacement: Clean Voxelization for Overlap-free Labels

Voxelizing each mesh individually causes label conflicts and redundant computation. This strategy first merges "stuff" (background) meshes to eliminate internal boundaries. Instances are then sorted by world-coordinate height and integrated layer by layer from bottom to top. This ensures that lower structures (such as the ground) are not overwritten by voxels of objects above them, so the output is overlap-free by construction: each voxel belongs to exactly one semantic/instance ID.
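The bottom-up integration can be sketched on already-voxelized sets. This is a simplification under stated assumptions: the paper operates on meshes, whereas this sketch works on voxel indices, orders instances by their lowest occupied height, and lets the first writer keep a voxel. The function and variable names are hypothetical.

```python
Voxel = tuple[int, int, int]  # (x, y, z) grid index; z points up


def fuse_bottom_up(stuff, instances):
    """Return {voxel: (semantic, instance_id)} with exactly one label per voxel.

    stuff:     {semantic: set of voxels}, already merged per class
    instances: {instance_id: (semantic, set of voxels)}
    """
    grid = {}
    # 1) Stuff is written first under the reserved instance id 0.
    for semantic, voxels in stuff.items():
        for v in voxels:
            grid.setdefault(v, (semantic, 0))

    # 2) Instances are visited from lowest to highest; setdefault never
    #    overwrites, so earlier (lower) structures keep their voxels.
    def lowest_z(inst_id):
        _, voxels = instances[inst_id]
        return min(z for _, _, z in voxels)

    for inst_id in sorted(instances, key=lowest_z):
        semantic, voxels = instances[inst_id]
        for v in voxels:
            grid.setdefault(v, (semantic, inst_id))
    return grid


stuff = {"road": {(0, 0, 0), (1, 0, 0)}}
instances = {
    2: ("traffic_sign", {(0, 0, 1), (0, 0, 2)}),  # overlaps the car at z = 1
    1: ("car", {(0, 0, 0), (0, 0, 1)}),           # overlaps the road at z = 0
}
grid = fuse_bottom_up(stuff, instances)
```

Storing the result as a single mapping from voxel to label makes the one-label-per-voxel property structural rather than something to verify afterwards.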

4. Instance-guided Sensor Artifact Repair: Correcting Depth/Semantic Errors for Transparent Objects

CARLA rendering often allows depth and semantics to penetrate transparent objects (e.g., windows), embedding errors into the GT. The fix involves constructing a scene mesh containing only transparent objects, performing ray-casting to obtain accurate depths, and taking the point-wise minimum relative to the original depth. This recovers transparent surfaces closer to the camera, overwriting penetration errors.
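A minimal sketch of the point-wise minimum, assuming depth maps are flattened to per-pixel lists and that rays missing every transparent object return infinity. Overwriting the label together with the depth is an assumption made here for illustration; the names are hypothetical.

```python
import math


def repair_depth(depth, labels, transparent_depth, transparent_labels):
    """Keep, per pixel, whichever surface is closer to the camera.

    depth / labels:                         original render (may see through glass)
    transparent_depth / transparent_labels: ray-cast against transparent meshes only
    """
    fixed_depth, fixed_labels = [], []
    for d, lab, td, tlab in zip(depth, labels, transparent_depth, transparent_labels):
        if td < d:  # a transparent surface lies in front of the rendered hit
            fixed_depth.append(td)
            fixed_labels.append(tlab)
        else:       # no transparent surface, or it lies behind the rendered hit
            fixed_depth.append(d)
            fixed_labels.append(lab)
    return fixed_depth, fixed_labels


# Pixel 0: render sees a seat through a car window at 10 m.
# Pixel 1: no transparent object on this ray.
# Pixel 2: the window is behind the rendered road surface.
depth = [12.0, 30.0, 8.0]
labels = ["seat", "building", "road"]
transparent_depth = [10.0, math.inf, 9.5]
transparent_labels = ["car#7", None, "car#7"]

fixed_depth, fixed_labels = repair_depth(depth, labels, transparent_depth, transparent_labels)
```

Only pixel 0 changes in this toy input, which matches the intent of the repair: the minimum touches a pixel only where a transparent surface is genuinely closer than what the renderer reported.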

5. Occupancy Quality Evaluation Metrics: Quantifiable Standards for Dataset Quality

The paper defines two metrics to characterize label quality beyond resolution alone. The spatial continuity score \(s_{sc}\) measures whether occupancy voxels of the same semantic class form continuous volumes (fragmentation or holes lower the score). The temporal consistency score \(s_{tc}\) measures the stability of labels across adjacent frames (higher values indicate smoother temporal transitions). These turn the subjective notion of "geometric completeness" into comparable numbers.
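The summary does not reproduce the formulas for \(s_{sc}\) and \(s_{tc}\), so the sketch below uses simple stand-ins only to show the kind of quantity involved. They are not the paper's definitions: continuity is approximated by the share of a class's voxels in its largest 6-connected component, and temporal consistency by the IoU of a class's voxel sets in two adjacent frames, assumed to be expressed in a common world grid.

```python
from collections import deque

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def largest_component_fraction(voxels: set) -> float:
    """Proxy for spatial continuity: 1.0 means one connected volume."""
    if not voxels:
        return 1.0
    unseen = set(voxels)
    best = 0
    while unseen:
        queue = deque([unseen.pop()])
        size = 1
        while queue:  # breadth-first flood fill over face neighbours
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                n = (x + dx, y + dy, z + dz)
                if n in unseen:
                    unseen.remove(n)
                    queue.append(n)
                    size += 1
        best = max(best, size)
    return best / len(voxels)


def frame_iou(frame_a: set, frame_b: set) -> float:
    """Proxy for temporal consistency between two adjacent frames."""
    union = frame_a | frame_b
    return len(frame_a & frame_b) / len(union) if union else 1.0


solid = {(x, 0, 0) for x in range(4)}
fragmented = {(0, 0, 0), (1, 0, 0), (3, 0, 0), (5, 0, 0)}
continuity_solid = largest_component_fraction(solid)
continuity_fragmented = largest_component_fraction(fragmented)
stability = frame_iou({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)})
```

Both proxies lie in [0, 1] and penalize the failure modes the text attributes to LiDAR-aggregated labels: holes and fragments lower the first, and flickering labels lower the second.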

Key Experimental Results

Dataset Quality Comparison

| Dataset | Synthetic | Resolution (m) | Instance Labels | \(s_{sc}\) | \(s_{tc}\) |
|---|---|---|---|---|---|
| SemanticKITTI | No | 0.2 | No | 0.353 | 0.023 |
| Occ3D-nuScenes | No | 0.4 | No | 0.721 | 0.431 |
| SurroundOcc | No | 0.5 | No | 0.878 | 0.589 |
| CarlaSC | Yes | 0.4 | No | 0.887 | 0.775 |
| CarlaOcc (Ours) | Yes | 0.05 | Yes | 0.996 | 0.873 |

Benchmark Model Testing (Semantic Occupancy mIoU)

| Model | Key Finding |
|---|---|
| Various SOTA Methods | Models trained on CarlaOcc benefit from more refined ground truth. |
| Panoptic Occupancy Task | Evaluations on true instance-level annotations are possible for the first time. |

Key Findings

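  • CarlaOcc's spatial continuity (0.996) and temporal consistency (0.873) are the highest among the compared datasets; the next best, CarlaSC, scores 0.887 and 0.775.
  • The 0.05m resolution is 4x finer per axis than the previous best (SemanticKITTI, 0.2m), which means 64x more voxels for the same volume.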
  • The instance-guided repair pipeline effectively corrects rendering artifacts for transparent objects.
  • Mesh-based generation entirely avoids information loss associated with LiDAR aggregation.
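The 4x per-axis resolution gap noted above compounds in three dimensions. The short calculation below makes that concrete; the 80 m x 80 m x 6.4 m extent is a hypothetical volume chosen for illustration, not the CarlaOcc range.

```python
import math


def voxel_count(extent_m, voxel_m):
    """Number of voxels needed to tile a box of the given extent in metres."""
    return math.prod(round(e / voxel_m) for e in extent_m)


extent = (80.0, 80.0, 6.4)          # hypothetical occupancy volume in metres
coarse = voxel_count(extent, 0.2)   # SemanticKITTI-style resolution
fine = voxel_count(extent, 0.05)    # CarlaOcc resolution
ratio = fine // coarse              # (0.2 / 0.05) ** 3
```

The ratio is independent of the chosen extent, so it applies to any volume: each halving of voxel size multiplies the dense grid by eight.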

Highlights & Insights

  • Paradigm Shift from Point Clouds to Meshes: Meshes contain complete geometric information, fundamentally solving the resolution and completeness limits of LiDAR aggregation. This inspires new methodologies for synthetic dataset construction.
  • Skeletal Motion Analyzer: Provides an elegant solution for precise reconstruction of non-rigid objects (pedestrians) via animation phase pre-processing and geodesic matching.
  • Quality Evaluation Metrics: Quantitatively defines spatial continuity and temporal consistency for the first time to evaluate occupancy dataset quality.

Limitations & Future Work

  • Sim-to-real gap: Can models trained on CarlaOcc transfer effectively to real-world driving scenarios?
  • Asset Diversity: ADMesh assets are primarily from CARLA; diversity remains limited by simulation platforms.
  • Computational Cost: 0.05m resolution generates massive voxel volumes, increasing memory and compute overhead for training.
  • Animation Complexity: Pedestrian animations currently cover walking cycles; complex actions (crouching, bending) require further expansion.

Related Work & Insights

  • vs. Occ3D/SurroundOcc: These use LiDAR aggregation on real data but suffer from incomplete geometry. CarlaOcc uses mesh-based generation for physical consistency at the cost of the sim-to-real gap.
  • vs. CarlaSC: Also a CARLA dataset, but lacks instance labels and has coarser resolution (0.4m vs. 0.05m).
  • vs. SparseOcc/PanoOcc: While those focus on methodological innovation, this work provides necessary dataset infrastructure.

Rating

  • Novelty: ⭐⭐⭐⭐ First instance-level panoptic occupancy benchmark; ADMesh and mesh reconstruction pipeline are innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive dataset quality evaluation, though downstream model benchmarks could be more extensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Pipeline descriptions are clear and complete; dataset statistics are detailed.
  • Value: ⭐⭐⭐⭐⭐ Provides critical infrastructure for 3D panoptic occupancy research, driving the field forward.