MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

Conference: CVPR 2026
arXiv: 2604.01646
Code: https://github.com/VisualAIKHU/MonoSAOD
Area: 3D Vision / Object Detection
Keywords: Monocular 3D Detection, Sparse Annotation, Data Augmentation, Pseudo Labels, Prototype Filtering

TL;DR

This work is the first to formally define and address the problem of sparsely annotated monocular 3D object detection. It proposes two modules—Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF)—achieving substantial improvements over existing 2D SAOD methods under the KITTI 30% annotation setting (AP3D Easy: 21.28 vs. 17.14).

Background & Motivation

Background: Monocular 3D object detection infers 3D object properties (depth, dimensions, orientation) from a single image and is a critical technology for autonomous driving. Recent methods such as MonoDETR and MonoDGP have achieved notable progress on fully annotated datasets, yet all assume complete 3D annotations for every object.

Limitations of Prior Work: 3D annotation is extremely costly—providing accurate depth, dimension, and orientation labels requires 3–16× more time than 2D annotation. Consequently, real-world datasets frequently contain incomplete annotations: the same visible object may be labeled in some scenes but omitted in others, resulting in sparse and inconsistent annotation. Such inconsistency severely disrupts the model's ability to learn reliable depth and orientation cues.

Key Challenge: Existing 2D sparse annotation object detection (SAOD) methods select pseudo labels based on classification confidence scores, which reflect 2D localization certainty rather than the accuracy of 3D attributes (depth, orientation). As a result, high-confidence predictions may carry substantial 3D errors. LiDAR-based 3D SAOD methods, on the other hand, rely on point-cloud depth that is unavailable in the monocular setting.

Goal: (1) How can models better understand road–object relationships and achieve greater scene diversity under limited annotations? (2) How can reliable pseudo labels be generated by jointly validating 2D appearance consistency and 3D geometric accuracy?

Key Insight: The problem is decomposed into two tracks—"making the most of sparse annotations" (data augmentation) and "mining unannotated objects" (pseudo labeling)—with dedicated modules designed for the specific requirements of monocular 3D detection.

Core Idea: Leverage SAM segmentation, road constraints, and 3D geometric transformation for augmentation; apply dual filtering via prototype similarity and depth uncertainty for pseudo-label generation, thereby addressing sparsely annotated monocular 3D detection.

Method

Overall Architecture

A teacher–student framework is adopted. The RAPA module first performs geometrically consistent data augmentation on sparsely annotated images to pretrain a model, which then initializes both the teacher and student networks. The teacher network processes augmented images to produce predictions, and the PBF module selects high-quality pseudo labels through dual filtering based on prototype similarity and depth uncertainty. Accepted pseudo labels are used to update the prototype bank and are stored in a GT Bank as annotations for subsequent epochs. The student network is trained on both sparse annotations and pseudo labels.
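
Put together, the training procedure reduces to the following skeleton (a minimal sketch only; the hooks `pbf_filter`, `update_prototypes`, `train_step`, and the data layout are hypothetical stand-ins, not the authors' API):

```python
from copy import deepcopy

def train_monosaod(pretrained_model, dataloader, pbf_filter, update_prototypes,
                   train_step, epochs=100):
    """Teacher-student loop: RAPA-pretrained weights initialize both networks,
    the teacher mines pseudo labels, and the GT Bank accumulates them across epochs."""
    teacher, student = deepcopy(pretrained_model), deepcopy(pretrained_model)
    gt_bank = {}                                    # image id -> sparse GT + accepted pseudo labels

    for _ in range(epochs):
        for image_id, image, sparse_labels in dataloader:
            labels = gt_bank.setdefault(image_id, list(sparse_labels))
            teacher_preds = teacher(image)          # teacher predictions (augmentation omitted for brevity)
            accepted = pbf_filter(teacher_preds)    # dual filtering: depth uncertainty + prototype similarity
            labels.extend(accepted)                 # cumulative GT Bank update
            update_prototypes(accepted)             # refine the prototype bank with accepted RoI features
            train_step(student, image, labels)      # supervise the student on sparse GT + pseudo labels
    return student
```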

Key Designs

  1. Road-Aware Patch Augmentation (RAPA):

    • Function: Generates geometrically consistent augmented training samples from sparse annotations.
    • Mechanism: High-quality annotated object patches (untruncated, unoccluded) are extracted from the training set; SAM is applied for precise foreground segmentation to eliminate background noise (as opposed to directly using bounding-box crops). SAM also generates a road mask \(M_\text{road}\) for the target image. The segmented object is transferred from the source camera coordinate system to the target one via the homogeneous extrinsic matrices: \([x_t, y_t, z_t, 1]^T = \begin{bmatrix} R_t & T_t \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} R_s & T_s \\ \mathbf{0} & 1 \end{bmatrix}^{-1} [x_s, y_s, z_s, 1]^T\). Candidate placement positions are sampled uniformly along the horizontal direction; for each candidate, the yaw angle is updated as \(r_y' = \alpha + \operatorname{arctan2}(x_t', z_t')\) to preserve the observation angle \(\alpha\). After projecting to 2D, two constraints are checked: the overlap ratio between the patch's bottom region and the road mask must be ≥ \(\tau_\text{road}\) (ensuring placement on the road), and the IoU with existing annotation boxes must be < \(\tau_\text{overlap}\) (preventing unrealistic overlaps). A sketch of this placement check appears after this list.
    • Design Motivation: Existing copy-paste augmentation methods exhibit three shortcomings: (1) rectangular patches include background noise; (2) road constraints are ignored, causing objects to float or appear in physically implausible locations; (3) 3D pose is not adjusted, leading to geometric inconsistency. RAPA addresses all three issues through SAM-based precise segmentation, road constraint enforcement, and 3D pose transformation.
  2. Prototype-Based Filtering (PBF):

    • Function: Generates high-quality pseudo labels via dual-criterion filtering.
    • Mechanism: The process consists of three steps. Prototype initialization: RoI features are extracted from the teacher network using sparse annotations, and a class prototype bank \(\mathcal{P} = \{p_k\}_{k=1}^K\) (capacity \(K=256\)) is built via weighted cumulative updates; similar features are merged (cosine similarity > 0.8), while dissimilar features initialize new prototypes. Geometric reliability filtering: Using the depth uncertainty \(\sigma\) obtained from training with a Laplacian aleatoric uncertainty loss, a geometric reliability score \(S_\text{depth} = \exp(-\sigma)\) is computed; only candidates with \(S_\text{depth} > \tau_\text{depth}\) pass this filter. Semantic consistency filtering: The maximum cosine similarity between a candidate RoI feature and all prototypes is computed as \(S_\text{proto}^{(i)} = \max_{p_k} \cos(f_\text{roi}^{(i)}, p_k)\); only candidates with \(S_\text{proto} > \tau_\text{proto} = 0.85\) pass. A prediction is accepted as a pseudo label only when both conditions are satisfied simultaneously; a sketch of this dual filtering appears after this list.
    • Design Motivation: Classification confidence scores cannot reflect 3D attribute accuracy. PBF verifies geometric reliability via depth uncertainty (rejecting predictions with erroneous depth estimates) and verifies semantic consistency via prototype similarity (rejecting visually anomalous predictions), providing dual guarantees for pseudo-label quality.
  3. GT Bank Cumulative Update Mechanism:

    • Function: Progressively increases the volume of effective training data.
    • Mechanism: Pseudo labels passing dual filtering are stored in the GT Bank and used as additional annotations in subsequent epochs. Meanwhile, pseudo-label RoI features continuously refine the prototype bank via weighted updates \(p_k' = (1-\beta)p_k + \beta f_\text{roi}\) (\(\beta=0.005\)), enabling the bank to adapt to the evolving feature distribution.
    • Design Motivation: As training progresses, the teacher model produces increasingly better predictions, and the GT Bank accumulates reliable pseudo labels, forming a positive feedback loop.
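
Below, a minimal sketch of RAPA's candidate-placement validation from item 1 above. The 4×4 extrinsic convention, helper names, and the default values for \(\tau_\text{road}\) and \(\tau_\text{overlap}\) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def transform_to_target_frame(xyz_src, ext_src, ext_tgt):
    """Move an object's 3D center from the source to the target camera frame.
    ext_src / ext_tgt are assumed to be 4x4 homogeneous extrinsic matrices."""
    p = np.append(np.asarray(xyz_src, dtype=float), 1.0)
    return (ext_tgt @ np.linalg.inv(ext_src) @ p)[:3]

def yaw_preserving_alpha(alpha, x_t, z_t):
    """Recompute the global yaw r_y so the observation angle alpha is preserved
    at the new horizontal position (KITTI convention: alpha = r_y - arctan2(x, z))."""
    return alpha + np.arctan2(x_t, z_t)

def iou_2d(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def placement_is_valid(bottom_mask, road_mask, box2d, existing_boxes2d,
                       tau_road=0.7, tau_overlap=0.3):
    """Accept a candidate only if the patch bottom lies on the road mask and the
    projected box does not collide with existing annotations. Defaults are illustrative."""
    road_ratio = np.logical_and(bottom_mask, road_mask).sum() / max(bottom_mask.sum(), 1)
    if road_ratio < tau_road:
        return False
    return all(iou_2d(box2d, gt) < tau_overlap for gt in existing_boxes2d)
```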
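
Next, a sketch of PBF's dual filtering from item 2; the thresholds follow the values quoted in this summary, while function names and the per-class bookkeeping (omitted here) are assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pbf_accept(roi_feat, sigma, prototypes, tau_depth=1.0, tau_proto=0.85):
    """A teacher prediction becomes a pseudo label only if it passes both checks:
    geometric reliability S_depth = exp(-sigma) and semantic consistency
    S_proto = max_k cos(f_roi, p_k)."""
    s_depth = np.exp(-sigma)
    s_proto = max((cosine(roi_feat, p) for p in prototypes), default=0.0)
    return (s_depth > tau_depth) and (s_proto > tau_proto)
```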
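
Finally, a sketch of the cumulative updates from item 3: the prototype EMA update \(p_k' = (1-\beta)p_k + \beta f_\text{roi}\) and the GT Bank accumulation. The data structures are assumptions; `cosine` is the helper from the previous sketch, repeated here so the block is self-contained.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors (same helper as above)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def update_prototype_bank(prototypes, roi_feat, sim_merge=0.8, beta=0.005, capacity=256):
    """Merge an accepted RoI feature into its closest prototype via an EMA update,
    or start a new prototype while capacity remains."""
    sims = [cosine(roi_feat, p) for p in prototypes]
    if sims and max(sims) > sim_merge:
        k = int(np.argmax(sims))
        prototypes[k] = (1.0 - beta) * prototypes[k] + beta * roi_feat   # p_k' = (1-β)p_k + β f_roi
    elif len(prototypes) < capacity:
        prototypes.append(np.asarray(roi_feat, dtype=float))
    return prototypes

def update_gt_bank(gt_bank, image_id, accepted_pseudo_labels):
    """Store accepted pseudo labels so later epochs treat them as annotations."""
    gt_bank.setdefault(image_id, []).extend(accepted_pseudo_labels)
    return gt_bank
```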

Loss & Training

The method builds on the MonoDETR architecture (ResNet-50 backbone). Depth uncertainty is trained with the Laplacian aleatoric uncertainty loss: \(\mathcal{L}_\text{depth} = \frac{\sqrt{2}}{\sigma}\|d_\text{gt} - d_\text{pred}\|_1 + \log(\sigma)\). The model is first pretrained on RAPA-augmented sparse annotations, then used to initialize the teacher–student network for pseudo-label training. Experiments are conducted on a single RTX 3090 GPU with batch size 16, AdamW optimizer, and 100 training epochs.
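
A compact PyTorch rendering of this depth loss (a sketch; the log-sigma parameterization is an assumed implementation detail for numerical stability):

```python
import math
import torch

def laplacian_depth_loss(d_pred, d_gt, log_sigma):
    """L_depth = sqrt(2)/sigma * |d_gt - d_pred| + log(sigma), averaged over objects."""
    sigma = log_sigma.exp()
    loss = math.sqrt(2.0) / sigma * (d_gt - d_pred).abs() + log_sigma
    return loss.mean()

# The same sigma later doubles as PBF's geometric reliability signal: S_depth = exp(-sigma).
```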

Key Experimental Results

Main Results

| Method | 30% Easy | 30% Mod. | 30% Hard | 50% Mod. | 70% Mod. |
| --- | --- | --- | --- | --- | --- |
| Baseline (MonoDETR) | 11.17 | 8.73 | 7.56 | 15.25 | 17.83 |
| Co-mining | 16.01 | 12.62 | 10.38 | 16.22 | 18.21 |
| Calibrated Teacher | 17.14 | 12.96 | 10.58 | 16.03 | 18.94 |
| MonoSAOD (Ours) | 21.28 | 15.60 | 12.79 | 18.84 | 19.37 |

Gains are most pronounced under the most challenging 30% annotation setting (Easy: +4.14, Mod.: +2.64, Hard: +2.21 vs. the strongest baseline). On the KITTI test set with 30% annotations, Easy AP3D reaches 17.47 (vs. 10.76 for the strongest baseline), a 62% improvement.

Ablation Study

| Configuration | Easy | Mod. | Hard |
| --- | --- | --- | --- |
| Baseline (no aug., no pseudo labels) | 11.17 | 8.73 | 7.56 |
| + Confidence pseudo labels | 12.39 | 9.68 | 8.18 |
| + Conf. + PBF | 16.49 | 12.65 | 10.32 |
| + Conf. + RAPA | 20.31 | 14.51 | 11.72 |
| + Conf. + RAPA + PBF (Full) | 21.28 | 15.60 | 12.79 |

Key Findings

  • RAPA contributes the most: Adding RAPA alone raises Easy AP3D from 12.39 to 20.31 (+7.92), demonstrating that geometrically consistent data augmentation is critical under sparse annotation.
  • PBF provides complementary gains: Adding PBF on top of RAPA yields approximately 1 additional point (20.31→21.28); used alone, PBF improves from 12.39 to 16.49 (+4.10).
  • Confidence-only filtering is weak: Confidence-based pseudo labels yield only ~1-point improvement, validating the claim that classification confidence fails to reflect 3D accuracy.
  • Generalization to other architectures: Applying RAPA+PBF to MonoDGP also yields substantial gains (30% annotation, Mod.: 11.70→16.79), demonstrating the generality of the approach.
  • Robustness to adverse weather: Under foggy KITTI with 30% annotations, MonoSAOD achieves 13.72 Mod. AP3D (vs. 8.65 for the strongest baseline), with even larger margins in degraded conditions.

Highlights & Insights

  • First formal definition of sparsely annotated monocular 3D detection: The paper identifies the fundamental inapplicability of existing 2D SAOD methods to 3D detection—confidence scores cannot reflect 3D geometric accuracy—and this problem formulation itself constitutes a contribution.
  • Elegant combination of SAM, road constraints, and 3D transformation: RAPA ingeniously integrates SAM's segmentation capability, road semantic constraints, and 3D geometric transformation to generate augmented samples that are both visually and geometrically plausible. This geometry-aware copy-paste paradigm is transferable to other augmentation tasks requiring 3D consistency.
  • Dual filtering via depth uncertainty and prototype similarity: Repurposing the existing Laplacian uncertainty signal as a proxy for 3D reliability incurs low additional cost while delivering strong empirical gains.

Limitations & Future Work

  • Evaluation is limited to the KITTI dataset; results on larger-scale benchmarks such as nuScenes and Waymo are absent.
  • RAPA relies on SAM segmentation and manually provided point prompts for road regions; automation could be improved.
  • The prototype bank capacity \(K=256\) is fixed and may be insufficient for scenarios with more categories (e.g., pedestrians, cyclists).
  • A substantial gap from full-annotation performance remains under the 30% annotation setting, indicating room for further improvement.
  • PBF thresholds (\(\tau_\text{depth}=1.0\), \(\tau_\text{proto}=0.85\)) are manually set; adaptive threshold strategies may yield better results.

Comparison with Related Methods

  • vs. Calibrated Teacher: Calibrated Teacher calibrates teacher confidence for pseudo-label selection but still relies on 2D information. MonoSAOD's PBF additionally validates 3D depth reliability, offering a clear advantage in monocular 3D detection.
  • vs. Co-mining / SparseDet: These methods design self-consistency losses or gradient reweighting to handle missing annotations, but are less direct. MonoSAOD's explicit augmentation and pseudo-label filtering pipeline is simpler and more effective.
  • vs. Semi-supervised M3OD: Semi-supervised methods assume some images are fully annotated and others are unannotated, whereas SAOD assumes each image has partially missing annotations—a fundamentally different problem setting.

Rating

  • Novelty: ⭐⭐⭐⭐ First to define and systematically address sparsely annotated monocular 3D detection; RAPA's geometry-aware augmentation design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple annotation ratios, test-set evaluation, architecture generalization, adverse-weather robustness, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated and method description is detailed, though some formulations could be simplified.
  • Value: ⭐⭐⭐⭐ Addresses a practically relevant annotation-sparsity problem with a generalizable method and open-source code.