MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations¶
Conference: ICCV 2025 arXiv: 2501.09481 Code: github.com/jskvrna/MonoSOWA Area: Autonomous Driving Keywords: Monocular 3D Detection, Weak Supervision, Auto-Labeling, Pseudo-LiDAR, Annotation-Free
TL;DR¶
This paper proposes the first monocular 3D object detection method that requires no human annotations of any kind (neither 2D nor 3D). A novel Local Object Motion Model (LOMM) is introduced to disentangle inter-frame motion sources, enabling auto-labeling at a speed ~700× faster than prior work. A Canonical Object Space (COS) is further proposed to enable multi-dataset training across heterogeneous camera configurations.
Background & Motivation¶
- Monocular 3D object detection is a critical component in autonomous driving; conventional methods depend on LiDAR sensors and extensive manual 3D annotations.
- The annotation process is extremely time-consuming and expensive, limiting training data diversity; any change in camera setup necessitates re-collection and re-annotation.
- Existing weakly supervised methods still require 2D instance masks (e.g., WeakMono3D, VSRD) or LiDAR data (e.g., WeakM3D, Autolabels).
- VSRD requires ~15 minutes per frame for annotation; annotating Waymo's 158K frames would take approximately 4 years, rendering it practically unscalable.
- Both WeakMono3D and VSRD fail to handle moving objects—they either discard them or assign low confidence, wasting a large amount of training signal.
- Goal: Completely eliminate dependence on human annotations and LiDAR, so that data from commodity vehicles equipped only with monocular cameras can be directly exploited.
Method¶
Overall Architecture¶
The method requires only image sequences, ego-motion data (GPS/IMU), and known camera intrinsics and extrinsics. No human annotations are needed, not even 2D labels. Two off-the-shelf pretrained models are used:
1. 2D Object Detector: MViT2-Huge (Detectron2 framework, COCO pretrained), which provides instance segmentation masks.
2. Monocular Depth Estimator: Metric3D v2, selected for its leading zero-shot metric-depth generalization.
The auto-labeling pipeline decomposes the 7-DOF 3D bounding box estimation problem into three sequential sub-problems: orientation, position, and dimensions, avoiding the difficulty of direct joint optimization.
Step 1: Pseudo-LiDAR Aggregation¶
- Metric depth maps are inferred per frame and back-projected into 3D point clouds (a minimal sketch follows this list).
- Instance segmentation masks from the 2D detector are used to extract per-object point clouds \(P_{i,j}\).
- Object tracking is performed in 3D world coordinates:
- The median of the object point cloud approximates spatial position.
- A physics-based motion model predicts the position in the next frame.
- Nearest-neighbor matching with a distance threshold is applied.
- Multi-frame point clouds are transformed into a reference frame coordinate system using ego-motion data.
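A minimal NumPy sketch of this step, assuming a pinhole camera model; the `depth`, `instance_mask`, `K`, and `T_cam_to_world` inputs are illustrative names, not the authors' code:

```python
import numpy as np

def backproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a metric depth map (H, W) into camera-frame points (H*W, 3)."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]                            # pixel grid (rows, cols)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx                        # pinhole inverse projection
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def object_points(depth, instance_mask, K, T_cam_to_world):
    """Pseudo-LiDAR for one object: mask the cloud, then move it to the world frame."""
    pts = backproject_depth(depth, K)[instance_mask.ravel()]
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])     # homogeneous coordinates
    return (T_cam_to_world @ pts_h.T).T[:, :3]           # ego-motion / extrinsics transform
```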
Step 2: Local Object Motion Model (LOMM)¶
This is the core contribution of the paper, addressing a key challenge: since the ego-vehicle is also moving, all objects undergo relative displacement between frames. The question is how to identify which objects are truly in motion.
- Core Idea: Decouple inter-frame object position changes into two sources — ego-vehicle motion and object self-motion.
- Classification: For each object instance, compute the frame-wise position difference sequence \(\Delta_{i,j} = L_{i,j} - L_{i-1,j}\):
- Mean \(\mu_j = \frac{1}{l-k} \sum_{i=k+1}^{l} \Delta_{i,j}\) and standard deviation \(\sigma_j\), normalized by \(\sqrt{2}\) (differencing two equally noisy positions inflates the noise standard deviation by \(\sqrt{2}\)).
- Ratio \(z_j = \|\mu_j\|_2 / \|\sigma_j\|_2\).
- Stationary objects: displacement variation stems from pseudo-LiDAR noise jitter (akin to a random walk), so \(\|\mu\|_2 \ll \|\sigma\|_2\).
- Moving objects: persistent directional motion yields \(\|\mu\|_2 \gg \|\sigma\|_2\).
- Decision criterion: an object is classified as moving when \(z_j > T_z = 0.2\) and its net displacement exceeds \(T_m = 5\,\text{m}\) (a minimal sketch of this classifier follows the list).
- For stationary objects: all frames' point clouds are aggregated as \(A_j = \bigcup_i P_{i,j}\) (after transformation into the reference frame), yielding a denser representation that also recovers parts lost to occlusion in individual frames.
- For moving objects: The known trajectory is leveraged to directly compute orientation from physical constraints (vehicles travel in the direction of motion).
- In contrast to prior methods (WeakMono3D, VSRD) that discard or downweight moving objects, LOMM exploits information from both stationary and moving objects simultaneously.
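The motion/static classifier above fits in a few lines. A minimal sketch, assuming per-frame object positions (e.g., point-cloud medians) are already tracked in a common world frame; the epsilon guard is illustrative, the thresholds follow the paper:

```python
import numpy as np

def is_moving(locations: np.ndarray, T_z: float = 0.2, T_m: float = 5.0) -> bool:
    """LOMM-style classification of one tracked object.

    locations: (N, 3) per-frame positions in a common world frame.
    """
    deltas = np.diff(locations, axis=0)            # frame-to-frame displacements
    mu = deltas.mean(axis=0)                       # persistent, directed motion
    sigma = deltas.std(axis=0) / np.sqrt(2)        # noise jitter, normalized as in the paper
    z = np.linalg.norm(mu) / (np.linalg.norm(sigma) + 1e-9)
    net = np.linalg.norm(locations[-1] - locations[0])
    return bool(z > T_z and net > T_m)             # moving iff both thresholds are exceeded
```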
Step 3: Sequential Auto-Labeling¶
The 7-DOF pose estimation is decomposed into three independent sub-problems:
Orientation Estimation:
- Moving objects: orientation is computed directly from consecutive frame positions as \(\theta = \operatorname{atan2}(\Delta z, \Delta x)\); for robustness, the median yaw over the 5 preceding and 5 following frames is used.
- Stationary objects: candidate angles \(\theta \in [0, \pi/2)\) are exhaustively searched in BEV, and the best angle is selected by the proposed Saturated Closeness Criterion (SCC), sketched below:
  - SCC improves on Zhang's Closeness Criterion in two ways: (1) a sigmoid saturation \(\sigma(\alpha \cdot x)\) (\(\alpha = 10\)) suppresses the influence of outliers; (2) the 10th/90th percentiles replace min/max as the boundary reference points.
  - The criterion accumulates, for each point, the minimum distance to the nearest boundary across the two perpendicular axes.
- Since the algorithm cannot distinguish front from back, two orientation hypotheses (differing by \(\pi\)) are produced.
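A sketch of the stationary-object angle search, reconstructing the SCC from the description above (sigmoid saturation, percentile boundaries, nearest-edge distances); this is an interpretation, not the authors' implementation:

```python
import numpy as np

def scc_orientation(bev_pts: np.ndarray, alpha: float = 10.0, n_angles: int = 90):
    """Search yaw in [0, pi/2) and score each candidate with the saturated criterion.

    bev_pts: (N, 2) object points projected onto the ground plane.
    Returns the best yaw and the 10th-90th percentile box extents at that yaw.
    """
    best_theta, best_cost, best_ext = 0.0, np.inf, None
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rot = bev_pts @ np.array([[c, -s], [s, c]])            # rotate into candidate frame
        lo = np.percentile(rot, 10, axis=0)                    # robust box boundaries
        hi = np.percentile(rot, 90, axis=0)
        edge = np.minimum(np.abs(rot - lo), np.abs(hi - rot))  # distance to closer edge, per axis
        d = edge.min(axis=1)                                   # nearest boundary across both axes
        cost = (1.0 / (1.0 + np.exp(-alpha * d))).sum()        # sigmoid saturation vs. outliers
        if cost < best_cost:
            best_theta, best_cost, best_ext = theta, cost, hi - lo
    return best_theta, best_ext
```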
Dimension Estimation:
- Dimensions are taken from the SCC output; outliers (outside typical vehicle size ranges) are replaced by prior dimensions.
- View-dependent unobservable dimensions are detected: when the angle between the vehicle heading and the viewing direction is near \(0, \pi/2, \pi, 3\pi/2\), prior dimensions are substituted (see the sketch below).
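A small sketch of these checks; the prior dimensions and plausibility ranges are hypothetical values for illustration, and falling back to all three priors near degenerate viewing angles simplifies the paper's per-dimension rule:

```python
import numpy as np

# Hypothetical car-class prior and plausibility ranges (illustrative, not from the paper).
PRIOR_LWH = np.array([4.5, 1.8, 1.6])                   # length, width, height [m]
RANGE_LWH = np.array([[3.0, 6.5], [1.4, 2.2], [1.2, 2.2]])

def sanitize_dims(dims: np.ndarray, yaw: float, view_angle: float,
                  tol: float = np.deg2rad(10)) -> np.ndarray:
    """Replace implausible or view-unobservable dimension estimates with priors."""
    out_of_range = (dims < RANGE_LWH[:, 0]) | (dims > RANGE_LWH[:, 1])
    dims = np.where(out_of_range, PRIOR_LWH, dims)      # swap size outliers for priors
    rel = (yaw - view_angle) % (np.pi / 2)              # heading vs. viewing ray, mod 90 deg
    if min(rel, np.pi / 2 - rel) < tol:                 # near 0, pi/2, pi, or 3*pi/2
        dims = PRIOR_LWH.copy()                         # simplification: fall back to all priors
    return dims
```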
Position Refinement:
- Given the estimated orientation and coarse position, small perturbations (≤ 2 m) are applied along the x and z axes.
- A Template Fitting Loss (TFL) against a canonical vehicle template point cloud selects the optimal position; TFL is more robust to outliers than Chamfer Distance.
- The front/back ambiguity is resolved by evaluating both \(\theta\) and \(\theta + \pi\) and keeping the hypothesis with the lower TFL (sketched below).
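A hedged sketch of the refinement loop; the paper's exact TFL formula is not reproduced here, so a truncated nearest-neighbour distance stands in for its robust loss, and `template` denotes an assumed canonical vehicle point cloud:

```python
import numpy as np
from scipy.spatial import cKDTree

def template_fit_loss(obj_pts, posed_template, trunc=0.5):
    """Robust fit score: truncated nearest-neighbour distances to the posed template."""
    d, _ = cKDTree(posed_template).query(obj_pts)
    return np.minimum(d, trunc).mean()                 # truncation damps outliers

def refine_position(obj_pts, template, center, yaw, step=0.25, radius=2.0):
    """Grid-search x/z offsets (<= 2 m) and both yaw hypotheses (theta, theta + pi)."""
    best = (np.inf, center, yaw)
    offsets = np.arange(-radius, radius + 1e-9, step)
    for theta in (yaw, yaw + np.pi):                   # resolve front/back ambiguity
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])  # yaw about the up (y) axis
        for dx in offsets:
            for dz in offsets:
                pose = center + np.array([dx, 0.0, dz])
                loss = template_fit_loss(obj_pts, template @ R.T + pose)
                if loss < best[0]:
                    best = (loss, pose, theta)
    return best[1], best[2]                            # refined position, resolved yaw
```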
Step 4: Canonical Object Space (COS)¶
- Problem: Under different camera focal lengths, the same object at the same distance appears at different image sizes; in extreme cases, the network must predict very different distances for objects with identical pixel sizes.
- Solution: A canonical focal length \(f^C = 750\) is chosen, and only the 3D label coordinates are scaled: \(\omega_i = f^C / f_i\), \((x^C, y^C, z^C) = (x \cdot \omega_i, y \cdot \omega_i, z \cdot \omega_i)\).
- During training the model learns in COS; at inference, predictions are inverse-transformed back to metric world coordinates using the target camera's \(\omega_j\) (a minimal sketch follows this list).
- Data augmentation (e.g., image scaling) requires synchronized adjustment of the perceived focal length.
- Inspired by Metric3D and Omni3D, but only label coordinates are transformed rather than entire images or depth maps, making the design minimally invasive.
- This allows a single model to train and infer across heterogeneous camera configurations without per-camera retraining.
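A minimal sketch of the COS label transform and its inverse (function names are illustrative):

```python
def to_canonical(xyz, f_i, f_c=750.0):
    """Map a 3D label into the Canonical Object Space with omega_i = f_c / f_i."""
    w = f_c / f_i
    return tuple(c * w for c in xyz)

def from_canonical(xyz_c, f_j, f_c=750.0):
    """Inverse transform at inference time for the target camera's focal length f_j."""
    w = f_c / f_j
    return tuple(c / w for c in xyz_c)

# Example: a point 30 m away seen through a 1500 px focal length maps to 15 m in COS,
# and the round trip recovers the original metric coordinates.
assert from_canonical(to_canonical((0.0, 1.5, 30.0), 1500.0), 1500.0) == (0.0, 1.5, 30.0)
```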
Implementation Details¶
- Detector: MonoDETR is used as the final 3D detection model, optimized with AdamW (lr=2e-4, wd=1e-4).
- Aggregation window: 100 frames (up to 50 frames before and after each object instance).
- Thresholds: \(T_z = 0.2\) (motion/static classification), \(T_m = 5\)m (minimum motion distance), \(\alpha = 10\) (SCC steepness).
- Identical hyperparameters are used across all three datasets, demonstrating the generalizability of the method.
Key Experimental Results¶
KITTI-360 Test Set¶
| Method | Human Annotation | AP_BEV@0.5 (Easy/Hard) | AP_3D@0.5 (Easy/Hard) | Labeling Speed |
|---|---|---|---|---|
| MonoFlex (Fully Supervised) | 3D boxes | 50.82/41.78 | 43.11/34.43 | - |
| MonoDETR (Fully Supervised) | 3D boxes | 47.21/36.05 | 41.01/30.38 | - |
| Autolabels | LiDAR+masks | 20.18/14.33 | 4.69/2.79 | 6s/frame |
| VSRD | Masks | 29.07/22.83 | 21.77/16.46 | 15min/frame |
| MonoSOWA (No Annotation) | None | 38.41/35.26 | 29.98/27.56 | 1.3s/frame |
Waymo Validation Set (Level 2)¶
| Method | AP_BEV@0.5 All | AP_3D@0.5 All |
|---|---|---|
| MonoDETR (Fully Supervised) | 23.63 | 21.41 |
| MonoSOWA (No Annotation) | 18.98 | 13.46 |
Pseudo-Label Pretraining + Fine-tuning with Limited Human Annotations (KITTI)¶
| Pretraining | Human Annotation Ratio | AP_BEV@0.7 Easy | AP_3D@0.7 Easy |
|---|---|---|---|
| None | 25% | 31.72 | 21.76 |
| MonoSOWA | 25% | 39.99 | 32.64 |
| None | 100% | 37.99 | 29.36 |
- MonoSOWA pretraining with only 25% human annotations surpasses fully supervised training with 100% human annotations.
- 15% human annotations + MonoSOWA pretraining ≈ 100% human annotation performance → 85% annotation cost reduction.
Cross-Dataset Training¶
| Training Data | AP_BEV@0.5 (Easy) | AP_3D@0.5 (Easy) |
|---|---|---|
| KITTI pseudo-labels | 61.24 | 53.22 |
| K360 pseudo-labels | 57.62 | 47.39 |
| KITTI+K360 pseudo-labels | 64.29 | 58.97 |
| KITTI human labels | 67.44 | 65.09 |
Joint training with multi-dataset pseudo-labels approaches human-label performance. On Hard@0.3, it even surpasses human-labeled training.
Ablation Study¶
| LOMM | SCC | AP_BEV@0.5 Easy |
|---|---|---|
| ✗ | ✗ | 20.41 |
| ✗ | ✓ | 20.37 |
| ✓ | ✗ | 35.89 |
| ✓ | ✓ | 39.22 |
LOMM is the key factor driving performance gains; SCC further improves by ~3.3 AP on top of LOMM.
Highlights & Insights¶
Strengths:
- First fully annotation-free monocular 3D detection system (requires neither 2D nor 3D labels).
- Labeling speed of ~1.3 s/frame, approximately 700× faster than VSRD.
- LOMM is the first approach to exploit temporal information from moving objects, which prior methods discard or downweight.
- COS enables a single model to train and infer across datasets with different camera configurations.
- As a pretraining tool, it cuts human annotation cost by 85%.
Limitations & Future Work¶
- Detection accuracy is lower for distant objects (a few pixels in height), which is an inherent limitation of monocular detection rather than the labeling pipeline.
- Performance depends on the zero-shot depth estimation quality of Metric3D.
- AP at IoU=0.3 still lags behind VSRD on KITTI-360: KITTI-360's human labels are amodal (they include occluded parts), which inherently favors VSRD's use of human-annotated masks. Given identical inputs, MonoSOWA consistently outperforms VSRD.
- Only the vehicle (car) category is validated; non-rigid or small object categories such as pedestrians and cyclists are not addressed.
Rating¶
- Novelty: ⭐⭐⭐⭐ First fully annotation-free monocular 3D detection system; both LOMM and SCC present meaningful innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated on three large-scale datasets with thorough ablations; pretraining and cross-dataset experiments are highly convincing.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed method descriptions and logically coherent pipeline steps.
- Value: ⭐⭐⭐⭐⭐ Significant practical impact on annotation cost reduction in autonomous driving; the 700× speedup makes large-scale application feasible.