SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

Conference: CVPR 2025
arXiv: 2503.06467
Code: GitHub
Area: 3D Vision
Keywords: Sparse Supervision, 3D Object Detection, Cross-Modal Semantic Prompts, Pseudo-Label Generation, Large Multimodal Models

TL;DR

Proposes SP3D, a two-stage training strategy that leverages large multimodal models (LMMs) to generate accurate cross-modal semantic prompts. Through dynamic clustering pseudo-label generation and distribution-shape scoring, it significantly boosts sparsely-supervised 3D object detection performance under an extremely low annotation rate (2%).

Background & Motivation

Sparsely-supervised 3D object detection aims to train 3D detectors with extremely limited annotations, reducing reliance on expensive human labeling. Existing methods face the following challenges:

  1. Insufficient feature discriminability under extremely low annotations: When the annotation rate is extremely low (e.g., 2%), existing methods (e.g., CoIn) struggle to learn sufficiently discriminative features from the few available labels.
  2. Semantic ambiguity issue: When directly transferring 2D image semantics to 3D point clouds, significant noise occurs at instance boundaries due to depth occlusion and camera calibration errors.
  3. Difficulty in pseudo-label quality assessment: Without ground truth, it is hard to evaluate the quality of the generated pseudo-labels.

Inspired by the success of LMMs in 2D tasks, this paper proposes leveraging the cross-modal prior knowledge of LMMs to enhance the feature discriminability of 3D detectors in sparsely-supervised scenarios.

Method

Overall Architecture

SP3D adopts a two-stage training strategy: the first stage pre-trains the 3D detector using pseudo-labels generated by LMMs to establish fundamental feature discriminability; the second stage fine-tunes the network using a small amount of precise annotations. The core pipeline consists of three modules: CPST (Credible Point Semantic Transfer), DCPG (Dynamic Clustering Pseudo-Label Generation), and DS Score (Distribution-Shape Score).

Key Designs

1. Credible Point Semantic Transfer Module (CPST)

  • Function: Extracts precise foreground semantic information from 2D images and transfers it to 3D point clouds, producing accurate cross-modal semantic prompts (seed points).
  • Mechanism: First segments images using FastSAM to obtain class-agnostic masks, then uses SemanticSAM to generate a description for each mask, and keeps as foreground the masks whose descriptions have a high cosine similarity to the class text. A key innovation is the boundary-constrained mask contraction operation: the foreground mask is contracted (\(\gamma = 0.3\)) to retain only its central region, thereby filtering out semantically ambiguous edge points.
  • Design Motivation: Directly projecting 2D semantics to 3D introduces substantial noise at instance boundaries. Contracting masks to preserve only the high-confidence central regions makes the semantic prompts transferred to the point cloud considerably more accurate and reliable.
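
The contraction-then-transfer idea can be illustrated with a minimal sketch. It is not the authors' implementation: contraction is modeled here as a uniform scaling of the mask about its centroid by \(1 - \gamma\) (an assumption, since the summary does not specify the operator), and `contract_mask`, `project`, and `seed_points` are hypothetical helper names. The 3×4 projection matrix `P` stands in for the camera calibration.

```python
from typing import List, Optional, Sequence, Tuple

Mask = List[List[bool]]
Point = Tuple[float, float, float]


def contract_mask(mask: Mask, gamma: float = 0.3) -> Mask:
    """Shrink a binary mask toward its centroid by a factor (1 - gamma).

    Assumption: contraction is a uniform scaling about the centroid;
    the paper's exact operator may differ.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    out = [[False] * len(row) for row in mask]
    if not pixels:
        return out
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    scale = 1.0 - gamma
    for r, row in enumerate(mask):
        for c in range(len(row)):
            # Map the candidate pixel back to the scale of the original mask.
            src_r = round(cr + (r - cr) / scale)
            src_c = round(cc + (c - cc) / scale)
            inside = 0 <= src_r < len(mask) and 0 <= src_c < len(mask[src_r])
            # Also require the pixel itself to be foreground, so the result
            # is always a subset of the original mask.
            out[r][c] = bool(inside and mask[src_r][src_c] and mask[r][c])
    return out


def project(point: Point, P: Sequence[Sequence[float]]) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a 3x4 matrix; None if it lies behind the camera."""
    x, y, z = point
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None
    return u / w, v / w


def seed_points(points: Sequence[Point], P, mask: Mask, gamma: float = 0.3) -> List[Point]:
    """Keep the 3D points whose projection falls inside the contracted mask."""
    core = contract_mask(mask, gamma)
    seeds = []
    for p in points:
        uv = project(p, P)
        if uv is None:
            continue
        col, row = round(uv[0]), round(uv[1])
        if 0 <= row < len(core) and 0 <= col < len(core[row]) and core[row][col]:
            seeds.append(p)
    return seeds
```

In this sketch, a point that projects onto the rim of the original mask is dropped even though it lies inside the 2D instance, which is the intended trade: fewer seed points in exchange for cleaner semantics.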

2. Dynamic Clustering Pseudo-Label Generation Module (DCPG)

  • Function: Dynamically generates complete 3D bounding box pseudo-labels based on the multi-scale neighborhood geometry of the seed points.
  • Mechanism: For each seed point \(p_t\), DBSCAN clustering is performed using a dynamically updated clustering radius \(r = r_{\text{init}} \cdot \frac{t}{N^{(k)}} + \delta\), and bounding boxes are fitted to the clustering results. A set of pseudo-label proposals is generated by traversing all seed points and multi-scale radii.
  • Design Motivation: A fixed clustering radius either leads to incomplete foreground information or introduces excessive background noise. The dynamic radius varies from small to large to capture foreground information under multi-scale receptive fields.
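
A small sketch of the multi-scale proposal loop follows. Several simplifications are assumed and are not taken from the paper: the radius schedule reads the formula as \(r_k = r_{\text{init}} \cdot k / K + \delta\) for steps \(k = 1..K\), clustering is a seed-centered region growing (equivalent to DBSCAN reachability with a minimum cluster size of 1), and boxes are axis-aligned rather than oriented. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner)


def radius_schedule(r_init: float, num_scales: int, delta: float) -> List[float]:
    """Radii growing from small to large: r_k = r_init * k / num_scales + delta."""
    return [r_init * k / num_scales + delta for k in range(1, num_scales + 1)]


def grow_cluster(points: Sequence[Point], seed_idx: int, radius: float) -> List[int]:
    """Indices of all points reachable from the seed through hops <= radius."""
    cluster = {seed_idx}
    frontier = [seed_idx]
    while frontier:
        i = frontier.pop()
        for j, q in enumerate(points):
            if j not in cluster and math.dist(points[i], q) <= radius:
                cluster.add(j)
                frontier.append(j)
    return sorted(cluster)


def fit_box(points: Sequence[Point], idx: Sequence[int]) -> Box:
    """Axis-aligned bounding box of the selected points (a simplification)."""
    xs, ys, zs = zip(*(points[i] for i in idx))
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def proposals(points: Sequence[Point], seed_indices: Sequence[int],
              r_init: float, num_scales: int, delta: float):
    """One (seed, radius, box) proposal per seed point and per radius."""
    out = []
    for s in seed_indices:
        for r in radius_schedule(r_init, num_scales, delta):
            out.append((s, r, fit_box(points, grow_cluster(points, s, r))))
    return out
```

With a small radius the box covers only the points close to the seed; with a larger one it may absorb nearby background. Both proposals are kept, and choosing between them is deferred to the scoring step.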

3. Distribution-Shape Score (DS Score)

  • Function: Evaluates the quality of pseudo-labels in the absence of ground truth, replacing the IoU score in NMS.
  • Mechanism: Combines two unsupervised priors: (1) the Distribution Constraint Score \(s_{dc}\), which assumes that the distances from points inside a high-quality box to its boundaries follow \(\mathcal{N}(0.8, 0.2)\); and (2) the Meta-Shape Constraint Score \(s_{msc}\), which measures, via KL divergence, how far the box shape deviates from the template shape of the category. The two are combined as \(\text{DS}(\hat{b}) = \lambda_1 \bar{s}_{dc} + \lambda_2 \bar{s}_{msc}\).
  • Design Motivation: Since the IoU between predicted boxes and GT cannot be computed, physical-world prior knowledge (point cloud distribution characteristics and category dimension templates) is leveraged to evaluate pseudo-label quality.
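
To make the weighted combination concrete, here is a hedged sketch of a DS-style score. The details are assumptions rather than the paper's definitions: the per-point distances are taken as already normalized scalars, 0.2 is treated as a standard deviation, the KL divergence is computed between box and template dimensions normalized to sum to 1, it is mapped to a score with \(\exp(-\mathrm{KL})\), and \(\lambda_1 = \lambda_2 = 0.5\). The template dimensions in the usage note are illustrative values only.

```python
import math
from typing import Sequence


def distribution_score(norm_dists: Sequence[float],
                       mu: float = 0.8, sigma: float = 0.2) -> float:
    """Mean unnormalized Gaussian likelihood of the per-point distances.

    Equals 1.0 when every distance sits exactly at mu and decays toward 0
    as the distances move away from it.
    """
    if not norm_dists:
        return 0.0
    return sum(math.exp(-((d - mu) ** 2) / (2 * sigma ** 2))
               for d in norm_dists) / len(norm_dists)


def shape_score(dims: Sequence[float], template: Sequence[float]) -> float:
    """exp(-KL) between normalized box dimensions and the class template."""
    p = [d / sum(dims) for d in dims]
    q = [t / sum(template) for t in template]
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return math.exp(-kl)


def ds_score(norm_dists: Sequence[float], dims: Sequence[float],
             template: Sequence[float],
             lam1: float = 0.5, lam2: float = 0.5) -> float:
    """Weighted sum of the distribution and shape scores."""
    return lam1 * distribution_score(norm_dists) + lam2 * shape_score(dims, template)
```

For example, `ds_score([0.8, 0.8], (3.9, 1.6, 1.5), (3.9, 1.6, 1.5))` reaches the maximum of this sketch, while an over-long box such as `(8.0, 1.6, 1.5)` with distances far from 0.8 scores lower. Because the dimensions are normalized, this particular shape term ignores absolute size; a practical version would need an additional scale check.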

Loss & Training

SP3D adopts the contrastive-learning training framework and loss of CoIn. The first stage is trained using pseudo-labels generated by SP3D, and the second stage is fine-tuned with sparse, accurate labels. Training is based on the OpenPCDet framework using detectors such as VoxelRCNN, CenterPoint, and CasA.

Key Experimental Results

Main Results: KITTI val split (2% Annotation Rate)

| Method | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Cyc. Easy |
|---|---|---|---|---|---|---|
| VoxelRCNN (fully-sup.) | 92.3 | 84.9 | 82.6 | 69.6 | 63.0 | 88.7 |
| VoxelRCNN (2%) | 70.5 | 54.9 | 44.8 | 42.6 | 38.5 | 73.3 |
| CoIn (2%) | 89.1 | 70.2 | 55.6 | 50.8 | 45.2 | 80.2 |
| CoIn++ (2%) | 92.0 | 79.5 | 71.5 | 46.7 | 36.1 | 82.0 |
| CoIn++ + SP3D (2%) | 91.3 | 80.5 | 74.0 | 67.4 | 58.7 | 92.5 |

Improvements across Different Detector Architectures

| Base Detector | W/o SP3D (Mod.) | W/ SP3D (Mod.) | Gain |
|---|---|---|---|
| CenterPoint | 54.82 | 69.24 | +14.42 |
| Voxel-RCNN | 68.47 | 74.89 | +6.42 |
| CasA | 75.32 | 75.94 | +0.62 |

Zero-Shot Performance (W/o Fine-Tuning)

| Method | Car Easy @0.5 | Car Mod. @0.5 | Car Easy @0.7 | Car Mod. @0.7 |
|---|---|---|---|---|
| VS3D | 40.32 | 37.36 | 9.09 | 5.73 |
| WS3DPR | - | - | 60.01 | 44.48 |
| SP3D | 93.75 | 76.36 | 69.71 | 48.65 |

Key Findings

  • SP3D yields larger gains on weaker detectors (CenterPoint +14.42 vs. CasA +0.62), which suggests that it mainly compensates for insufficient initial feature discriminability.
  • Under the zero-shot setting, SP3D substantially outperforms VS3D and WS3DPR, validating the effectiveness of cross-modal semantic prompts.
  • The improvement is also large on the Cyclist category, where annotations are the sparsest: Cyc. Easy rises from 82.0 with CoIn++ to 92.5 with SP3D (+10.5 AP in the main results table).

Highlights & Insights

  1. Ingenious two-stage strategy: Warming up the detector with LMM-generated pseudo-labels and then fine-tuning it with a small amount of accurate annotations addresses the cold-start problem under extremely low annotation rates.
  2. Simple yet effective boundary contraction: Mask contraction suppresses the noise of 2D-to-3D semantic transfer with a single, mathematically simple operation.
  3. Clever unsupervised quality assessment: DS Score evaluates pseudo-label quality using only geometric distribution priors and category shape templates, eliminating the need for ground truth.

Limitations & Future Work

  • CPST relies on the accuracy of the 2D-3D calibration matrix, meaning calibration errors will affect the quality of semantic transfer.
  • The initial dynamic clustering radius \(r_{\text{init}}\) and contraction factor \(\gamma\) require manual configuration.
  • Performance may be limited on more complex scenarios, such as dense pedestrian scenes.
  • Future work could explore directly integrating LMM features into the point cloud feature space.

Related Work

  • CoIn: Sparsely-supervised baseline method; SP3D builds on it by enhancing feature discriminability during the pre-training stage.
  • ULIP/CLIP2Scene: Transfer 2D LMM knowledge to 3D, but primarily focus on classification rather than detection.
  • SAM3D: Segments BEV images using SAM for 3D detection, but with limited precision.

Rating

⭐⭐⭐⭐: Achieves significant improvements under extremely low annotation rates, with novel designs in its two-stage strategy and unsupervised quality assessment. The experimental validation across several detector architectures is also thorough.