Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Conference: ECCV 2024
arXiv: 2312.07530
Code: https://github.com/kuanchihhuang/VG-W3D
Area: 3D Vision
Keywords: Weakly supervised 3D detection, visual guidance, pseudo-labels, 2D-3D constraints, feature alignment

TL;DR

This paper proposes VG-W3D, a framework that trains a 3D object detector from 2D annotations alone, with no 3D labels, using visual guidance at three levels: feature, output, and training. On the KITTI dataset it performs comparably to methods that use 500 frames of 3D annotations.

Background & Motivation

3D object detection is a core component of perception systems in autonomous driving, but 3D annotation is extremely expensive (3-16 times slower than 2D annotation). Existing weakly supervised methods have two key limitations:

Still relying on partial 3D annotations: WS3D requires BEV center annotations + 534 3D annotations, while MTrans/MAP-Gen require 500 frames of precise 3D annotations.

Only utilizing single-level constraints: FGR only uses the output-level frustum geometric relations, failing to fully exploit the multi-level correlation between the 2D and 3D domains.

The authors observe that the association between 2D images and 3D point clouds can be exploited at three levels: object-aware alignment at the feature level, 2D-3D bounding box overlap constraints at the output level, and high-quality pseudo-label generation at the training level. This observation motivates the design of multi-level visual guidance.

Method

Overall Architecture

VG-W3D consists of two branches:

Image Branch: Employs CenterNet as the 2D detector, whose parameters are frozen after training, to provide visual guidance signals (features \(\mathbf{F}_{\mathcal{I}}\), 2D boxes \(\mathbf{B}_{\mathcal{I}}\), and confidence scores \(\sigma_{\mathcal{I}}\)).
Point Cloud Branch: Employs PointRCNN as the 3D detector to extract point cloud features \(\mathbf{F}_{\mathcal{P}}\) and predict 3D bounding boxes \(\mathbf{B}_{\mathcal{P}}\).

Initial 3D labels are generated by the non-learning method of FGR (frustum point clouds + heuristics) and are then iteratively refined through self-training. During inference, only the point cloud branch is used, and the image branch is discarded.

Key Designs

Feature-Level Visual Guidance: Point cloud features are projected onto the image plane, and DINO self-supervised segmentation is used to generate an object foreground map \(\mathbf{S}\). Both the image features and the projected point cloud features are then trained on binary objectness classification against \(\mathbf{S}\). The core idea is to avoid direct L2 mimicry of image features, which would lose geometric information, and instead align the objectness probability distributions of the two modalities. The losses include:
- Point cloud objectness segmentation loss: \(\mathcal{L}_{seg}^{\mathcal{P}} = \frac{1}{|\mathcal{A}|}\sum_{i \in \mathcal{A}} \text{FL}(\mathbf{C}_{\mathcal{P}'}(i), \mathbf{S}(i))\)
- Image objectness segmentation loss: \(\mathcal{L}_{seg}^{\mathcal{I}}\) (same formulation)
- KL divergence alignment: \(\mathcal{L}_{kl} = \text{KL}(\mathbf{C}_{\mathcal{I}} || \mathbf{C}_{\mathcal{P}'})\)
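As a rough illustration of how these three terms fit together, the sketch below evaluates them on toy per-location objectness probabilities in plain Python. It reads FL as the binary focal loss and \(\mathcal{A}\) as the set of locations that receive a projected point. Both readings, the averaging of the KL term, the `gamma`/`alpha` defaults, and all function names are assumptions for illustration, not details confirmed by the summary.

```python
import math

def focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss at one location; p is the predicted foreground probability.
    gamma and alpha are assumed defaults, not values taken from the paper."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
    if target == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

def seg_loss(probs, mask, valid):
    """Mean focal loss over the index set `valid` (assumed meaning of A)."""
    return sum(focal_loss(probs[i], mask[i]) for i in valid) / len(valid)

def bernoulli_kl(p, q, eps=1e-7):
    """KL(Bernoulli(p) || Bernoulli(q)) at one location."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_loss(img_probs, pc_probs, valid):
    """Mean KL(C_I || C_P') over `valid`; averaging is an assumption."""
    return sum(bernoulli_kl(img_probs[i], pc_probs[i]) for i in valid) / len(valid)

# Toy data: four locations, the last one has no projected point.
S = [1, 0, 1, 0]              # foreground map (stand-in for the DINO mask)
C_img = [0.9, 0.1, 0.8, 0.2]  # image-branch objectness probabilities
C_pc = [0.6, 0.3, 0.7, 0.4]   # projected point-cloud objectness probabilities
A = [0, 1, 2]                 # locations covered by projected points

loss_seg_pc = seg_loss(C_pc, S, A)
loss_seg_img = seg_loss(C_img, S, A)
loss_kl = kl_loss(C_img, C_pc, A)
print(loss_seg_pc, loss_seg_img, loss_kl)
```

The KL term compares only two-class probabilities, so the point cloud branch is pushed to agree with the image branch about where objects are without having to reproduce the image features themselves.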
Output-Level Visual Guidance: A 3D box projected onto the image plane should overlap heavily with its corresponding 2D ground-truth box, so a GIoU loss is used to constrain the alignment between projected 3D boxes and 2D boxes. The 2D detection confidence \(\hat{\sigma}_{\mathcal{I}}\) is introduced as a weighting factor, so that lower-confidence 2D boxes receive smaller weights: \(\mathcal{L}_{box} = \hat{\sigma}_{\mathcal{I}} (1 - \text{GIoU}(\mathbf{B}_{\mathcal{I}}, \mathbf{B}_{proj}))\). GIoU is preferred over IoU because it still provides a useful gradient when the two boxes do not overlap, whereas the IoU gradient vanishes in that case.
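The confidence-weighted GIoU term can be sketched in a few lines of plain Python. The snippet assumes both boxes are axis-aligned rectangles `(x1, y1, x2, y2)` with positive area and takes the projected 3D box as already reduced to such a rectangle. The function names and toy boxes are illustrative, not taken from the released code.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2) with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def box_loss(box_2d, box_proj, conf_2d):
    """L_box = conf_2d * (1 - GIoU(box_2d, box_proj))."""
    return conf_2d * (1.0 - giou(box_2d, box_proj))

box_2d = (0.0, 0.0, 1.0, 1.0)
near = (2.0, 0.0, 3.0, 1.0)  # disjoint from box_2d, one unit away
far = (4.0, 0.0, 5.0, 1.0)   # disjoint from box_2d, three units away

loss_near = box_loss(box_2d, near, conf_2d=0.5)
loss_far = box_loss(box_2d, far, conf_2d=0.5)
print(loss_near, loss_far)
```

Both toy boxes have zero IoU with `box_2d`, yet the GIoU-based loss is larger for the farther one, which is the property that keeps the gradient informative before the boxes overlap. A 2D box with confidence 0 contributes nothing to the loss.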
Training-Level Visual Guidance: The 3D label quality is improved via iterative pseudo-label generation. Each round contains three steps:
- Train the 3D detector with the current pseudo-labels
- Generate new 3D pseudo-labels along with their confidence scores
- Filter pseudo-labels into two sets: (a) Overlap matching set \(\mathbf{B}_{overlap}\): projected 3D boxes whose IoU with a 2D GT box exceeds \(\alpha_0\) and whose average confidence exceeds \(\alpha_1\); (b) High-score set \(\mathbf{B}_{score}\): unmatched boxes whose 3D confidence exceeds \(\alpha_2\).
- Final pseudo-labels = \(\mathbf{B}_{overlap} + \mathbf{B}_{score}\)
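The filtering step can be sketched as follows. The summary does not spell out how "average confidence" or "unmatched" are computed, so this sketch assumes the average is the mean of the 3D score and the matched 2D score, that each prediction is matched to its best-overlapping 2D box, and that "unmatched" means the best IoU does not exceed \(\alpha_0\). The dictionary keys and function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2) with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def filter_pseudo_labels(preds, boxes_2d, alpha0=0.5, alpha1=0.5, alpha2=0.95):
    """preds: dicts with 'id', 'proj_box', 'score'; boxes_2d: (box, confidence) pairs."""
    overlap_set, score_set = [], []
    for pred in preds:
        # Assumption: match each prediction to its best-overlapping 2D box.
        best_iou, best_conf = 0.0, 0.0
        for box, conf in boxes_2d:
            value = iou(pred["proj_box"], box)
            if value > best_iou:
                best_iou, best_conf = value, conf
        if best_iou > alpha0:
            # Assumption: "average confidence" = mean of the 3D and 2D scores.
            if (pred["score"] + best_conf) / 2.0 > alpha1:
                overlap_set.append(pred)
        elif pred["score"] > alpha2:
            # Unmatched boxes survive only with a very high 3D score.
            score_set.append(pred)
    return overlap_set + score_set

boxes_2d = [((0, 0, 10, 10), 0.9), ((60, 60, 70, 70), 0.1)]
preds = [
    {"id": "A", "proj_box": (0, 0, 10, 10), "score": 0.8},     # matched, confident
    {"id": "B", "proj_box": (20, 20, 30, 30), "score": 0.97},  # unmatched, high score
    {"id": "C", "proj_box": (40, 40, 50, 50), "score": 0.6},   # unmatched, low score
    {"id": "D", "proj_box": (60, 60, 70, 70), "score": 0.3},   # matched, low confidence
]
kept_ids = [p["id"] for p in filter_pseudo_labels(preds, boxes_2d)]
print(kept_ids)
```

With the thresholds from the summary (0.5, 0.5, 0.95), only predictions A and B survive in this toy case: A through the overlap set and B through the high-score set.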

Loss & Training

For scenes with 3D pseudo-labels: \(\mathcal{L}_{pl} = \mathcal{L}_{rpn} + \mathcal{L}_{rcnn} + \mathcal{L}_{seg}^{\mathcal{P}} + \mathcal{L}_{kl}\)

For scenes with only 2D labels: \(\mathcal{L}_{weak} = \mathcal{L}_{seg}^{\mathcal{P}} + \mathcal{L}_{kl} + \mathcal{L}_{box}\)
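A minimal sketch of how the two objectives select their terms, assuming the individual loss values have already been computed and that all terms are summed with unit weight, as the formulas above are written. The dictionary keys and the function name are illustrative only.

```python
def total_loss(terms, has_pseudo_labels):
    """Combine precomputed loss terms for one scene.

    terms: dict with keys 'rpn', 'rcnn', 'seg_p', 'kl', 'box' (hypothetical names).
    """
    shared = terms["seg_p"] + terms["kl"]  # feature-level terms used in both cases
    if has_pseudo_labels:
        # L_pl: detector losses supervised by 3D pseudo-labels plus feature-level terms.
        return terms["rpn"] + terms["rcnn"] + shared
    # L_weak: no 3D pseudo-labels, so rely on the 2D box constraint instead.
    return shared + terms["box"]

terms = {"rpn": 1.0, "rcnn": 0.5, "seg_p": 0.25, "kl": 0.125, "box": 0.0625}
loss_pl = total_loss(terms, has_pseudo_labels=True)
loss_weak = total_loss(terms, has_pseudo_labels=False)
print(loss_pl, loss_weak)
```

The feature-level terms appear in both branches, while the detector losses and the output-level box loss are mutually exclusive depending on whether a scene has 3D pseudo-labels.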

Training parameters: \(\alpha_0 = 0.5\), \(\alpha_1 = 0.5\), \(\alpha_2 = 0.95\). CenterNet is trained for 140 epochs and PointRCNN for 30 epochs. Pseudo-label quality tends to saturate after 2-3 iterations.

Key Experimental Results

Main Results (KITTI test set)

Method	Weak Label	Requires 3D Labels	AP3D Easy	AP3D Mod.	AP3D Hard
PointRCNN (Fully supervised)	-	✓	86.96	75.64	70.70
WS3D (2021)	BEV Center	534	80.99	70.59	64.23
MTrans (PointRCNN)	2D boxes	500 frames	83.42	75.07	68.26
FGR	2D boxes	✗	80.26	68.47	61.57
VG-W3D (Ours)	2D boxes	✗	84.09	74.28	67.90

Ablation Study (KITTI val set)

Feature-Level	Output-Level	Training-Level	AP3D Easy	AP3D Mod.	AP3D Hard
✗	✗	✗	87.19	74.00	68.34
✓	✗	✗	89.12	74.29	70.78
✗	✓	✗	88.95	76.42	71.58
✗	✗	✓	88.95	77.75	73.31
✓	✓	✓	91.32	78.89	74.70
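The per-level gains over the unguided baseline can be recomputed directly from the ablation rows above. The short script below only copies the table values and subtracts the first row; the variable names are illustrative.

```python
# AP3D (Easy, Mod., Hard) copied from the ablation table.
baseline = (87.19, 74.00, 68.34)
rows = {
    "feature": (89.12, 74.29, 70.78),
    "output": (88.95, 76.42, 71.58),
    "training": (88.95, 77.75, 73.31),
    "all three": (91.32, 78.89, 74.70),
}

# Gain of each configuration over the baseline, rounded to two decimals.
gains = {
    name: tuple(round(value - base, 2) for value, base in zip(values, baseline))
    for name, values in rows.items()
}

for name, (easy, mod, hard) in gains.items():
    print(f"{name:>10}: Easy +{easy:.2f}  Mod. +{mod:.2f}  Hard +{hard:.2f}")
```

On the moderate split the individual gains are +0.29 (feature), +2.42 (output), and +3.75 (training), and the full model gains +4.89, so the three levels combine to more than any single one but less than the sum of the individual gains.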

Key Findings

All three levels of visual guidance improve results when applied individually, with the training level providing the largest improvement (+3.75 AP3D on Mod. over the baseline without guidance).
Feature guidance using KL divergence with segmentation masks outperforms variants that use an L2 loss or 2D bounding box masks.
GIoU loss performs better than IoU and L1 loss in output-level guidance.
The pseudo-label quality after 2 iterations improves [email protected] from 46.71% to 74.22%.
COCO pre-trained 2D detectors can be directly utilized instead of KITTI 2D annotations, while performance remains highly competitive.

Highlights & Insights

Completely 3D-annotation-free weakly supervised detection: For the first time among similar methods, performance close to fully supervised approaches is achieved without requiring any 3D labels.
Carefully designed multi-level constraints: The feature-level guidance avoids the loss of geometric information caused by direct feature mimicry and instead aligns the objectness probability distributions.
2D-3D consistency constraints for pseudo-label filtering: Leverages 2D detection confidence to filter false positives, effectively suppressing noise accumulation during self-training.
High practicality: Supports cross-domain 2D detectors (e.g., COCO pre-trained), reducing data annotation requirements for practical applications.

Limitations & Future Work

Only validated on KITTI (monocular/single-camera setups) and has not been tested on multi-camera datasets (such as nuScenes).
The initial pseudo-labels depend on FGR's frustum geometric method, which may fail when point clouds are extremely sparse.
Iterative training increases computational costs (requires multiple rounds of training).
The image branch needs to be frozen during training, and end-to-end joint optimization remains unexplored.

Shares a similar concept with semi-supervised methods like DetMatch, but completely eliminates the need for any 3D annotations.
The objectness prior provided by DINO self-supervised segmentation is highly effective and can be transferred to other weakly supervised scenarios.
The 2D-3D consistency filtering strategy for training-level pseudo-labels can be generalized to multi-modal scenarios.

Rating

Novelty: ⭐⭐⭐⭐ — The systematic design of three-level visual guidance is relatively novel, and the entirely 3D-annotation-free setting has practical value.
Experimental Thoroughness: ⭐⭐⭐⭐ — The ablation studies are comprehensive, though validation is confined to a single dataset (KITTI).
Writing Quality: ⭐⭐⭐⭐ — Clear structure and natural introduction of the three observations.
Value: ⭐⭐⭐⭐ — Provides a practical solution for reducing the annotation cost of 3D detection.