MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations

Conference: ICCV2025 arXiv: 2501.09481 Code: github.com/jskvrna/MonoSOWA Area: Autonomous Driving Keywords: Monocular 3D Detection, Weak Supervision, Auto-Labeling, Pseudo-LiDAR, Annotation-Free

TL;DR

This paper proposes the first monocular 3D object detection method that requires no human annotations of any kind (neither 2D nor 3D). A novel Local Object Motion Model (LOMM) is introduced to disentangle inter-frame motion sources, enabling auto-labeling at a speed ~700× faster than prior work. A Canonical Object Space (COS) is further proposed to enable multi-dataset training across heterogeneous camera configurations.

Background & Motivation

  • Monocular 3D object detection is a critical component in autonomous driving; conventional methods depend on LiDAR sensors and extensive manual 3D annotations.
  • The annotation process is extremely time-consuming and expensive, limiting training data diversity; any change in camera setup necessitates re-collection and re-annotation.
  • Existing weakly supervised methods still require 2D instance masks (e.g., WeakMono3D, VSRD) or LiDAR data (e.g., WeakM3D, Autolabels).
  • VSRD requires ~15 minutes per frame for annotation; annotating Waymo's 158K frames would take approximately 4 years, rendering it practically unscalable.
  • Both WeakMono3D and VSRD fail to handle moving objects—they either discard them or assign low confidence, wasting a large amount of training signal.
  • Goal: Completely eliminate dependence on human annotations and LiDAR, so that data from commodity vehicles equipped only with monocular cameras can be directly exploited.

Method

Overall Architecture

The method requires only image sequences, ego-motion data (GPS/IMU), and known camera intrinsics and extrinsics. No human annotations are needed, not even 2D labels. Two off-the-shelf pretrained models are used:

  1. 2D Object Detector: MViT2-Huge (Detectron2 framework, COCO pretrained), which provides instance segmentation masks.
  2. Monocular Depth Estimator: Metric3D v2, selected for its strong zero-shot metric depth generalization.

The auto-labeling pipeline decomposes the 7-DOF 3D bounding box estimation problem into three sequential sub-problems: orientation, position, and dimensions, avoiding the difficulty of direct joint optimization.

Step 1: Pseudo-LiDAR Aggregation

  • Metric depth maps are inferred per frame and back-projected into 3D point clouds.
  • Instance segmentation masks from the 2D detector are used to extract per-object point clouds \(P_{i,j}\).
  • Object tracking is performed in 3D world coordinates:
    • The median of the object point cloud approximates spatial position.
    • A physics-based motion model predicts the position in the next frame.
    • Nearest-neighbor matching with a distance threshold is applied.
  • Multi-frame point clouds are transformed into a reference frame coordinate system using ego-motion data.
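
As a rough illustration of this step, the sketch below back-projects a metric depth map through the camera intrinsics and cuts out one object's pseudo-LiDAR points with its instance mask. The function and variable names (`backproject_depth`, `object_points`, a 4x4 `T_world_cam` pose) are illustrative, not the authors' code.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Turn an HxW metric depth map into an HxWx3 array of camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def object_points(depth, mask, fx, fy, cx, cy, T_world_cam):
    """Extract a per-object point cloud P_{i,j} and express it in the common world frame."""
    pts_cam = backproject_depth(depth, fx, fy, cx, cy)[mask]          # (N, 3) points on the object
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (T_world_cam @ pts_h.T).T[:, :3]                           # apply ego pose / extrinsics
```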

Step 2: Local Object Motion Model (LOMM)

This is the core contribution of the paper, addressing a key challenge: since the ego-vehicle is also moving, all objects undergo relative displacement between frames. The question is how to identify which objects are truly in motion.

  • Core Idea: Decouple inter-frame object position changes into two sources — ego-vehicle motion and object self-motion.
  • Classification: For each object instance, compute the frame-wise position difference sequence \(\Delta_{i,j} = L_{i,j} - L_{i-1,j}\):
    • Mean \(\mu_j = \frac{1}{l-k} \sum_i \Delta_{i,j}\) over the frames \(k,\dots,l\) in which object \(j\) is tracked; standard deviation \(\sigma_j\), normalized by \(\sqrt{2}\) because differencing doubles the noise variance.
    • Ratio \(z_j = \|\mu_j\|_2 / \|\sigma_j\|_2\).
    • Stationary objects: displacement variation stems from pseudo-LiDAR noise jitter (akin to a random walk), so \(\|\mu\|_2 \ll \|\sigma\|_2\).
    • Moving objects: persistent directional motion yields \(\|\mu\|_2 \gg \|\sigma\|_2\).
    • Decision criterion: an object is classified as moving if \(z_j > T_z = 0.2\) and its net displacement exceeds \(T_m = 5\,\text{m}\).
  • For stationary objects: All frames' point clouds are directly aggregated as \(A_j = \{P_{i,j}\}\) (after transformation to the reference frame), yielding a denser representation with recovery from occlusion.
  • For moving objects: the known trajectory is leveraged to compute orientation directly from a physical constraint (a vehicle's heading aligns with its direction of travel).
  • In contrast to prior methods (WeakMono3D, VSRD) that discard or downweight moving objects, LOMM exploits information from both stationary and moving objects simultaneously.
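
A minimal sketch of the moving/static test, assuming the object's per-frame median centres are already tracked and expressed in a common world frame (variable and threshold names are illustrative):

```python
import numpy as np

T_Z = 0.2   # motion/static ratio threshold
T_M = 5.0   # minimum net displacement (metres)

def is_moving(centers: np.ndarray) -> bool:
    """Classify one tracked object as moving or stationary from its (F, 3) centre track."""
    deltas = np.diff(centers, axis=0)                   # frame-to-frame displacement vectors
    mu = deltas.mean(axis=0)                            # persistent, directed motion component
    sigma = deltas.std(axis=0) / np.sqrt(2)             # jitter of the differenced positions
    z = np.linalg.norm(mu) / max(np.linalg.norm(sigma), 1e-9)
    net = np.linalg.norm(centers[-1] - centers[0])      # net displacement over the whole track
    return bool(z > T_Z and net > T_M)
```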

Step 3: Sequential Auto-Labeling

The 7-DOF pose estimation is decomposed into three independent sub-problems:

Orientation Estimation:

  • Moving objects: orientation is computed directly from consecutive frame positions as \(\theta = \text{atan2}(\Delta z, \Delta x)\); for robustness, the median yaw over the preceding and following 5 frames is used.
  • Stationary objects: candidate angles \(\theta \in [0, \pi/2)\) are swept in BEV, and the optimal angle is selected using the proposed Saturated Closeness Criterion (SCC):
    • SCC improves upon Zhang's Closeness Criterion in two ways: (1) a sigmoid saturation \(\sigma(\alpha \cdot x)\) (\(\alpha = 10\)) suppresses the influence of outliers; (2) the min/max extremes are replaced by the 10th/90th percentiles as boundary reference points.
    • The criterion accumulates, for each point, the minimum distance to the nearest boundary across the two perpendicular axes.
    • Since the algorithm cannot distinguish front from back, two orientation hypotheses (differing by \(\pi\)) are produced.
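
The sketch below shows how such an SCC angle sweep could look; the scoring is my reading of the description above (percentile box edges, sigmoid-saturated point-to-edge distances), not the authors' exact formulation.

```python
import numpy as np

ALPHA = 10.0  # sigmoid steepness

def scc_score(bev_pts: np.ndarray, theta: float) -> float:
    """Score one candidate yaw for an (N, 2) BEV point cloud; lower means a tighter box fit."""
    c, s = np.cos(theta), np.sin(theta)
    rot = bev_pts @ np.array([[c, -s], [s, c]])                  # align the candidate box axes
    lo, hi = np.percentile(rot, 10, axis=0), np.percentile(rot, 90, axis=0)
    d = np.minimum(np.abs(rot - lo), np.abs(hi - rot))           # distance to the nearer edge, per axis
    sat = 1.0 / (1.0 + np.exp(-ALPHA * d))                       # saturate so outliers stop dominating
    return sat.min(axis=1).sum()                                 # per point, keep the closer perpendicular edge

def best_yaw(bev_pts: np.ndarray, steps: int = 90) -> float:
    """Sweep theta in [0, pi/2) and return the angle with the best (lowest) SCC score."""
    angles = np.linspace(0.0, np.pi / 2, steps, endpoint=False)
    return min(angles, key=lambda th: scc_score(bev_pts, th))
```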

Dimension Estimation:

  • Dimensions are obtained from the SCC algorithm output; outliers (outside typical vehicle size ranges) are replaced by prior dimensions.
  • View-dependent unobservable dimensions are detected: when the angle between the vehicle heading and the viewing direction is near \(0, \pi/2, \pi, 3\pi/2\), prior dimensions are substituted.

Position Refinement:

  • Given the estimated orientation and coarse position, small perturbations (≤2 m) are applied along the x and z axes.
  • A Template Fitting Loss (TFL) against a canonical vehicle template point cloud selects the optimal position; TFL is more robust to outliers than Chamfer Distance.
  • The front/back orientation ambiguity is resolved by testing both \(\theta\) and \(\theta + \pi\) and selecting the hypothesis with the lower TFL.
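
A hedged sketch of this refinement search follows. The exact TFL formula is not reproduced here; as a stand-in, a distance-capped nearest-neighbour residual between the observed points and a posed canonical template is used, which only mimics the stated idea of being more outlier-robust than plain Chamfer distance.

```python
import numpy as np

def pose_template(template: np.ndarray, yaw: float, xz: np.ndarray) -> np.ndarray:
    """Rotate the canonical (M, 3) template about the vertical axis and place it at (x, ., z)."""
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return template @ rot.T + np.array([xz[0], 0.0, xz[1]])

def fitting_loss(obj_pts, template, yaw, xz, cap=0.5):
    """Capped nearest-neighbour residual; the cap keeps outlier points from dominating."""
    posed = pose_template(template, yaw, xz)
    d = np.linalg.norm(obj_pts[:, None, :] - posed[None, :, :], axis=-1).min(axis=1)
    return np.minimum(d, cap).mean()

def refine_position(obj_pts, template, yaw, coarse_xz, radius=2.0, step=0.25):
    """Grid-search x/z offsets (within +-2 m) and both heading hypotheses; keep the best."""
    offsets = np.arange(-radius, radius + 1e-9, step)
    best = None
    for th in (yaw, yaw + np.pi):                        # resolve the front/back ambiguity
        for dx in offsets:
            for dz in offsets:
                xz = coarse_xz + np.array([dx, dz])
                loss = fitting_loss(obj_pts, template, th, xz)
                if best is None or loss < best[0]:
                    best = (loss, th, xz)
    return best[1], best[2]                              # refined yaw and (x, z) position
```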

Step 4: Canonical Object Space (COS)

  • Problem: Under different camera focal lengths, the same object at the same distance appears at different image sizes; in extreme cases, the network must predict very different distances for objects with identical pixel sizes.
  • Solution: A canonical focal length \(f^C = 750\) is chosen, and only the 3D label coordinates are scaled: \(\omega_i = f^C / f_i\), \((x^C, y^C, z^C) = (x \cdot \omega_i, y \cdot \omega_i, z \cdot \omega_i)\).
  • During training the model learns in COS; during inference the predictions are inverse-transformed back to world coordinates using the target frame's \(\omega_j\).
  • Data augmentation (e.g., image scaling) requires synchronized adjustment of the perceived focal length.
  • Inspired by Metric3D and Omni3D, but only label coordinates are transformed rather than entire images or depth maps, making the design minimally invasive.
  • This allows a single model to train and infer across heterogeneous camera configurations without per-camera retraining.
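
Under this formulation, the transform reduces to a per-image scale factor applied to the 3D label coordinates; a minimal sketch (function names are illustrative):

```python
F_CANONICAL = 750.0  # canonical focal length f^C

def to_canonical(center_xyz, focal):
    """Scale a 3D label centre (x, y, z) into the Canonical Object Space (training time)."""
    w = F_CANONICAL / focal
    return tuple(c * w for c in center_xyz)

def from_canonical(center_xyz_cos, focal):
    """Inverse transform applied to predictions for the target camera (inference time)."""
    w = F_CANONICAL / focal
    return tuple(c / w for c in center_xyz_cos)
```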

Implementation Details

  • Detector: MonoDETR is used as the final 3D detection model, optimized with AdamW (lr=2e-4, wd=1e-4).
  • Aggregation window: 100 frames (up to 50 frames before and after each object instance).
  • Thresholds: \(T_z = 0.2\) (motion/static classification), \(T_m = 5\)m (minimum motion distance), \(\alpha = 10\) (SCC steepness).
  • Identical hyperparameters are used across all three datasets, demonstrating the generalizability of the method.
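
For reference, the hyperparameters quoted above can be gathered into a single illustrative config object (not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class AutoLabelConfig:
    """Hyperparameters listed in this section, collected in one place."""
    t_z: float = 0.2            # motion/static classification ratio threshold
    t_m: float = 5.0            # minimum motion distance (m)
    scc_alpha: float = 10.0     # SCC sigmoid steepness
    window: int = 100           # aggregation window in frames (+-50 around each instance)
    f_canonical: float = 750.0  # canonical focal length for COS
    lr: float = 2e-4            # AdamW learning rate for MonoDETR
    weight_decay: float = 1e-4  # AdamW weight decay
```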

Key Experimental Results

KITTI-360 Test Set

| Method | Human Annotation | AP_BEV@0.5 (Easy/Hard) | AP_3D@0.5 (Easy/Hard) | Labeling Speed |
| --- | --- | --- | --- | --- |
| MonoFlex (Fully Supervised) | 3D boxes | 50.82 / 41.78 | 43.11 / 34.43 | - |
| MonoDETR (Fully Supervised) | 3D boxes | 47.21 / 36.05 | 41.01 / 30.38 | - |
| Autolabels | LiDAR + masks | 20.18 / 14.33 | 4.69 / 2.79 | 6 s/frame |
| VSRD | Masks | 29.07 / 22.83 | 21.77 / 16.46 | 15 min/frame |
| MonoSOWA (No Annotation) | None | 38.41 / 35.26 | 29.98 / 27.56 | 1.3 s/frame |

Waymo Validation Set (Level 2)

| Method | AP_BEV@0.5 (All) | AP_3D@0.5 (All) |
| --- | --- | --- |
| MonoDETR (Fully Supervised) | 23.63 | 21.41 |
| MonoSOWA (No Annotation) | 18.98 | 13.46 |

Pseudo-Label Pretraining + Fine-tuning with Limited Human Annotations (KITTI)

| Pretraining | Human Annotation Ratio | AP_BEV@0.7 (Easy) | AP_3D@0.7 (Easy) |
| --- | --- | --- | --- |
| None | 25% | 31.72 | 21.76 |
| MonoSOWA | 25% | 39.99 | 32.64 |
| None | 100% | 37.99 | 29.36 |

  • MonoSOWA pretraining with only 25% human annotations surpasses fully supervised training with 100% human annotations.
  • 15% human annotations + MonoSOWA pretraining ≈ 100% human annotation performance → 85% annotation cost reduction.

Cross-Dataset Training

| Training Data | AP_BEV@0.5 (Easy) | AP_3D@0.5 (Easy) |
| --- | --- | --- |
| KITTI pseudo-labels | 61.24 | 53.22 |
| K360 pseudo-labels | 57.62 | 47.39 |
| KITTI + K360 pseudo-labels | 64.29 | 58.97 |
| KITTI human labels | 67.44 | 65.09 |

Joint training with multi-dataset pseudo-labels approaches human-label performance. On Hard@0.3, it even surpasses human-labeled training.

Ablation Study

| LOMM | SCC | AP_BEV@0.5 (Easy) |
| --- | --- | --- |
|  |  | 20.41 |
|  | ✓ | 20.37 |
| ✓ |  | 35.89 |
| ✓ | ✓ | 39.22 |

LOMM is the key factor driving performance gains; SCC further improves by ~3.3 AP on top of LOMM.

Highlights & Insights

Strengths:

  • First fully annotation-free monocular 3D detection system (requires neither 2D nor 3D labels).
  • Labeling speed of ~1.3 s/frame, approximately 700× faster than VSRD.
  • LOMM is the first approach to exploit temporal information from moving objects (prior methods could only discard them).
  • COS enables a single model to train and infer across datasets with different camera configurations.
  • As a pretraining tool, it reduces human annotation costs by 85%.

Limitations & Future Work

  • Detection accuracy is lower for distant objects (a few pixels in height), which is an inherent limitation of monocular detection rather than the labeling pipeline.
  • Performance depends on the zero-shot depth estimation quality of Metric3D.
  • AP at IoU=0.3 still lags behind VSRD on KITTI-360, because KITTI-360 human labels are amodal (including occluded parts), which inherently favors VSRD's use of human masks; when given identical inputs, MonoSOWA consistently outperforms VSRD.
  • Only the vehicle (car) category is validated; non-rigid or small object categories such as pedestrians and cyclists are not addressed.

Rating

  • Novelty: ⭐⭐⭐⭐ First fully annotation-free monocular 3D detection system; both LOMM and SCC present meaningful innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated on three large-scale datasets with thorough ablations; pretraining and cross-dataset experiments are highly convincing.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed method descriptions and logically coherent pipeline steps.
  • Value: ⭐⭐⭐⭐⭐ Significant practical impact on annotation cost reduction in autonomous driving; the 700× speedup makes large-scale application feasible.