
3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Conference: ICCV 2025 · arXiv: 2507.23567 · Code: royyang0714.github.io/3D-MOOD · Area: Object Detection
Keywords: Monocular 3D object detection, open-set detection, 2D-to-3D lifting, geometry-aware query, canonical image space

TL;DR

This paper proposes 3D-MOOD, the first end-to-end monocular open-set 3D object detector, which lifts open-set 2D detections into 3D space via geometry-aware 3D query generation and a canonical image space design, achieving state-of-the-art performance on both the Omni3D closed-set benchmark and the Argoverse 2 / ScanNet open-set benchmarks.

Background & Motivation

Monocular 3D object detection (3DOD) infers and localizes 3D objects from a single RGB image, offering low cost but posing significant challenges. Existing methods operate almost exclusively under the closed-set assumption, where training and test sets share the same scenes and categories. However, in real-world applications such as robotics and AR/VR, models frequently encounter objects from novel environments and novel categories, to which closed-set methods cannot generalize.

Unlike the 2D domain—where abundant image-text pairs enable open-vocabulary classification—3D data lacks rich vision-language correspondences, making open-set classification in 3D extremely difficult. Monocular depth estimation also faces inherent generalization challenges across datasets. The combination of these two issues has left open-set monocular 3DOD largely unsolved.

Core Problem

How can a monocular 3D detector simultaneously achieve: (1) the ability to recognize novel categories unseen during training (open-vocabulary classification), and (2) accurate 3D localization/size/orientation estimation in unseen scenes (cross-domain 3D regression generalization)?

The key challenges are twofold. First, 3D annotations are scarce and lack textual correspondence, so open-set classification cannot be achieved through vision-language alignment as it is in 2D. Second, cross-dataset training introduces significant variation in image resolution and camera intrinsics; conventional resize-and-pad strategies create ambiguity between intrinsics and image dimensions.

Method

Overall Architecture

3D-MOOD builds upon the open-set 2D detector Grounding DINO (G-DINO). Given a monocular image \(\mathbf{I}\) and a language prompt \(\mathbf{T}\) describing object categories of interest, the model outputs 3D bounding boxes \(\mathbf{D}^{3D}\) (comprising 3D center coordinates, dimensions, and 6D orientation) along with category predictions \(\hat{\mathbf{C}}\).

The core mechanism is 2D-to-3D lifting: G-DINO's vision-language fusion first produces open-set 2D detections; the 2D object queries are then turned into geometry-aware 3D queries, from which a 3D bounding box head estimates lifting parameters that differentiably map 2D boxes to 3D boxes. The entire pipeline is end-to-end trainable, enabling joint optimization of the 2D and 3D objectives.

Pipeline:

  1. An image encoder (Swin-T/B) extracts image features \(\mathbf{q}_{\text{image}}\); a text backbone (BERT) extracts text features \(\mathbf{q}_{\text{text}}\).
  2. A cross-modal Transformer decoder fuses image and text features across layers, producing 2D object queries \(\mathbf{q}_{2d}^i\).
  3. A 2D box head predicts 2D detections; categories are determined by the similarity between queries and text embeddings (open-set classification).
  4. A geometry-aware 3D query generation module fuses 2D queries with camera embeddings and depth features to produce 3D queries \(\mathbf{q}_{3d}^i\).
  5. A 3D bounding box head predicts lifting parameters (projected 3D center offset, log depth, log dimensions, 6D rotation) from the 3D queries; the final 3D detections are obtained by combining these with the 2D boxes and camera intrinsics \(\mathbf{K}\).
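
For orientation, here is a minimal PyTorch sketch of this data flow with stand-in modules (a strided conv for Swin, an embedding table for BERT, a single decoder layer for the cross-modal Transformer); all shapes, dimensions, and module choices are illustrative assumptions, not the authors' implementation.

```python
# Structural sketch of the forward pass; stand-ins replace Swin, BERT, and
# G-DINO's cross-modal decoder. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MOODSketch(nn.Module):
    def __init__(self, d=256, num_queries=900, vocab=1000):
        super().__init__()
        self.img_proj = nn.Conv2d(3, d, kernel_size=16, stride=16)  # stand-in Swin
        self.txt_embed = nn.Embedding(vocab, d)                     # stand-in BERT
        self.decoder = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.queries = nn.Embedding(num_queries, d)
        self.box2d_head = nn.Linear(d, 4)   # (cx, cy, w, h), normalized
        self.head3d = nn.Linear(d, 12)      # offset(2)+log depth(1)+log dims(3)+6D rot(6)

    def forward(self, image, text_ids):
        B = image.shape[0]
        img_tok = self.img_proj(image).flatten(2).transpose(1, 2)  # (B, N_img, d)
        txt_tok = self.txt_embed(text_ids)                         # (B, N_txt, d)
        memory = torch.cat([img_tok, txt_tok], dim=1)              # fused context
        q2d = self.decoder(self.queries.weight.unsqueeze(0).expand(B, -1, -1), memory)
        boxes2d = self.box2d_head(q2d).sigmoid()
        logits = q2d @ txt_tok.transpose(1, 2)  # query-text similarity = open-set cls
        # In the real model, q2d is first turned into geometry-aware 3D queries
        # (step 4) before the 3D head; here the head is applied directly.
        return boxes2d, logits, self.head3d(q2d)
```

The `boxes2d` and `logits` outputs correspond to step 3; the 12-D output feeds the lifting function of step 5, sketched under Key Designs below.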

Key Designs

  1. 3D Bounding Box Head (Differentiable 2D→3D Lifting): Predicts 12-dimensional 3D attributes from the 3D queries. 3D localization is achieved by predicting the offset \([\hat{u}, \hat{v}]\) between the projected 3D center and the 2D box center, together with a scaled log depth \(\hat{d}\): \(\hat{z} = \exp(\hat{d}/s_{\text{depth}})\). 3D dimensions are predicted as scaled log values without relying on category-specific size priors, since such priors are unavailable in open-set settings. Orientation is parameterized as a 6D rotation (rather than yaw only), enabling generalization to both indoor and outdoor scenes. The entire lifting process is differentiable: \(\hat{\mathbf{D}}^{3D}_i = \mathbf{Lift}(\text{MLP}^{3D}_i(\mathbf{q}_{3d}^i), \hat{\mathbf{D}}^{2D}_i, \mathbf{K})\). A lifting sketch follows this list.

  2. Canonical Image Space: Addresses the core problem of inconsistent image resolutions and camera intrinsics across datasets. Prior methods (e.g., Cube R-CNN resizing by short edge with bottom-right padding; G-DINO resizing by long edge with bottom-right padding) suffer from: (a) wasteful zero-padding that consumes GPU memory; (b) camera intrinsics \(\mathbf{K}\) not being updated after resizing, causing train-inference inconsistency; and (c) violation of the central projection assumption. 3D-MOOD fixes the input resolution to \([H_c \times W_c]\) (800×1333), resizes images while preserving aspect ratio, applies center padding, and adjusts intrinsics accordingly. This ensures a consistent observation space across training and inference while reducing GPU memory consumption (17 GB vs. 21–23 GB per batch of 2). A preprocessing sketch follows this list.

  3. Auxiliary Metric Depth Estimation: An FPN extracts multi-scale features \(\mathbf{F}\); a Transformer block then produces depth features \(\mathbf{F}^d_{16}\) at 1/16 resolution, conditions them on camera embeddings \(\mathbf{E}\) (following UniDepth's design), and upsamples them to 1/8 resolution to predict the full-image log depth \(\hat{d}_{\text{full}}\), which shares the depth scale factor \(s_{\text{depth}}\) with the 3D box head. Supervised with a scale-invariant log loss (sketched under Loss & Training below), this auxiliary branch provides global scene-geometry understanding and serves as a conditioning signal for the subsequent query generation.

  4. Geometry-aware 3D Query Generation: To improve cross-domain generalization of 3D estimation, geometric priors are injected into 2D object queries in two steps: (1) cross-attention with camera embeddings \(\mathbf{E}\) to make queries aware of scene-specific camera properties; (2) cross-attention with depth features \(\mathbf{F}^d_8|\mathbf{E}\) to align depth estimation with 3D box estimation. Notably, gradients are detached in the depth feature cross-attention to stabilize training. Ablations show this module yields the largest improvement in open-set settings (+1.3 ODS), demonstrating that encoding geometric priors is critical for generalization. A sketch follows this list.
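
For design 1, here is a hedged sketch of the differentiable lifting: the 12-D output layout [center offset (2), scaled log depth (1), log dimensions (3), 6D rotation (6)], the 6D-to-matrix conversion after Zhou et al., and the default value of \(s_{\text{depth}}\) are all assumptions; only the operations described above (exp of scaled log depth/dims, offset plus backprojection through \(\mathbf{K}\), 6D rotation) come from the paper.

```python
# Hedged sketch of the differentiable 2D->3D lifting (not the authors' code).
import torch
import torch.nn.functional as F

def rot6d_to_matrix(r6):
    """6D rotation parameterization (Zhou et al.) -> (N, 3, 3) rotation matrices."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

def lift(params, box2d_center, K, s_depth=2.0):
    """params: (N, 12) head output; box2d_center: (N, 2) pixels; K: (3, 3)."""
    duv, log_z = params[:, :2], params[:, 2]
    log_dims, r6 = params[:, 3:6], params[:, 6:]
    z = torch.exp(log_z / s_depth)           # scaled log depth -> metric depth
    dims = torch.exp(log_dims)               # log dims -> metric sizes (no priors)
    uv = box2d_center + duv                  # projected 3D center in pixels
    x = (uv[:, 0] - K[0, 2]) * z / K[0, 0]   # backproject with intrinsics
    y = (uv[:, 1] - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=-1), dims, rot6d_to_matrix(r6)
```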
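
Design 2 is fully specified by the text above (fixed 800×1333 canvas, aspect-preserving resize, center padding, synchronized intrinsics), so the sketch below follows it directly; only the interpolation mode and rounding are assumptions.

```python
# Sketch of the canonical image space transform: one aspect-preserving scale,
# center padding into a fixed canvas, and the matching intrinsics update.
import torch
import torch.nn.functional as F

def to_canonical(image, K, Hc=800, Wc=1333):
    """image: (3, H, W) float tensor; K: (3, 3) intrinsics. Returns (canvas, K')."""
    _, H, W = image.shape
    s = min(Hc / H, Wc / W)                        # single scale keeps aspect ratio
    Hn, Wn = round(H * s), round(W * s)
    resized = F.interpolate(image[None], size=(Hn, Wn),
                            mode="bilinear", align_corners=False)[0]
    pad_t, pad_l = (Hc - Hn) // 2, (Wc - Wn) // 2  # center (not corner) padding
    canvas = image.new_zeros(3, Hc, Wc)
    canvas[:, pad_t:pad_t + Hn, pad_l:pad_l + Wn] = resized
    K_new = K.clone()
    K_new[0, 0] *= s                               # fx scales with the image
    K_new[1, 1] *= s                               # fy
    K_new[0, 2] = K[0, 2] * s + pad_l              # principal point: scale + shift
    K_new[1, 2] = K[1, 2] * s + pad_t
    return canvas, K_new
```

Updating \(\mathbf{K}\) alongside the image is what removes the train-inference inconsistency of point (b), and center padding keeps the principal point near the image center, preserving the central projection assumption of point (c).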
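
Design 4's two steps map naturally onto two cross-attention layers; the sketch below uses nn.MultiheadAttention as a stand-in (the exact layer composition is an assumption) and reproduces the gradient detach mentioned above.

```python
# Sketch of geometry-aware 3D query generation: cross-attend to camera
# embeddings, then to (detached) camera-conditioned depth features.
import torch
import torch.nn as nn

class GeoAware3DQuery(nn.Module):
    def __init__(self, d=256, nhead=8):
        super().__init__()
        self.cam_attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(d, nhead, batch_first=True)

    def forward(self, q2d, cam_embed, depth_feat):
        """q2d: (B, Q, d); cam_embed: (B, Nc, d); depth_feat: (B, Nd, d)."""
        # Step 1: make queries aware of scene-specific camera properties.
        q, _ = self.cam_attn(q2d, cam_embed, cam_embed)
        # Step 2: align with the depth branch; detach so 3D-box gradients do
        # not flow back into the auxiliary depth features (stabilizes training).
        dfeat = depth_feat.detach()
        q3d, _ = self.depth_attn(q, dfeat, dfeat)
        return q3d
```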

Loss & Training

  • 2D loss \(L_{2D}\): L1 + GIoU (box regression) + contrastive loss (classification between predicted objects and language tokens, inherited from GLIP).
  • 3D loss \(L_{3D}\): L1 loss supervising each 3D attribute independently (projected center offset, depth, dimensions, orientation).
  • Auxiliary depth loss \(L^{\text{aux}}_{\text{depth}}\): Scale-invariant log loss with weight \(\lambda_{\text{depth}}=10\) (a sketch follows this list).
  • Total loss: \(L_{\text{final}} = \sum_{i=0}^{l}(L^i_{2D} + L^i_{3D}) + \lambda_{\text{depth}} L^{\text{aux}}_{\text{depth}}\)
  • Each decoder layer has independent 2D and 3D heads (multi-layer supervision).
  • Depth ground truth is sourced from Omni3D sub-dataset depth annotations, projected LiDAR points, or SfM points.
  • Training: 120 epochs, batch size 128, lr = 0.0004; ablations use 12 epochs, batch size 64.
  • Backbone: Swin-T (29M) / Swin-B (88M).
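
A minimal sketch of a scale-invariant log depth loss in the spirit of Eigen et al. follows; whether 3D-MOOD uses this exact variant and inner weighting is an assumption (only the loss family and the outer weight \(\lambda_{\text{depth}}=10\) come from the paper).

```python
# SILog-style loss sketch; the inner lambda and the square-root form are
# common choices, not confirmed details of 3D-MOOD.
import torch

def silog_loss(pred_log_depth, gt_depth, valid_mask, lam=0.85):
    """pred_log_depth, gt_depth, valid_mask: (B, H, W); gt_depth in meters."""
    d = pred_log_depth[valid_mask] - torch.log(gt_depth[valid_mask])
    var_term = (d ** 2).mean() - lam * d.mean() ** 2  # scale-invariant variance
    return torch.sqrt(var_term.clamp_min(1e-8))
```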

Key Experimental Results

Open-Set Results (Trained on Omni3D → Tested on AV2/ScanNet)

| Dataset | Metric | 3D-MOOD (Swin-B) | Cube R-CNN | OVM3D-Det | Gain |
|---|---|---|---|---|---|
| Argoverse 2 | ODS | 23.8 | 8.9 | 8.8 | +14.9 vs. Cube R-CNN |
| Argoverse 2 | ODS (novel) | 14.8 | 0.0 | 1.7 | +14.8 vs. Cube R-CNN |
| ScanNet | ODS | 31.5 | 19.5 | 16.3 | +12.0 vs. Cube R-CNN |
| ScanNet | ODS (novel) | 15.7 | 0.0 | 8.8 | +6.9 vs. OVM3D-Det |

Closed-Set Results (Omni3D test)

| Method | AP3D omni ↑ |
|---|---|
| Cube R-CNN | 23.3 |
| Uni-MODE* | 28.2 |
| 3D-MOOD (Swin-T) | 28.4 |
| 3D-MOOD (Swin-B) | 30.0 |

Ablation Study

| Setting | CI | Depth | GA | AP3D omni (closed-set) | ODS (open-set) |
|---|---|---|---|---|---|
| Baseline | – | – | – | 24.1 | 23.6 |
| + Canonical Image | ✓ | – | – | 25.5 (+1.4) | 24.5 (+0.9) |
| + Depth | ✓ | ✓ | – | 26.2 (+0.7) | 24.7 (+0.2) |
| + Geometry-aware | ✓ | ✓ | ✓ | 26.8 (+0.6) | 26.0 (+1.3) |

(CI = canonical image space, GA = geometry-aware query generation; settings are cumulative.)
  • Canonical Image Space benefits both closed-set and open-set performance while reducing GPU memory (21→17 GB).
  • The auxiliary depth head yields larger gains on closed-set (+0.7) than open-set (+0.2); the authors attribute this to limited diversity in Omni3D depth annotations.
  • Geometry-aware query generation provides the most significant open-set improvement (+1.3 ODS), demonstrating that geometric priors are critical for cross-domain generalization.
  • Geometry-aware query generation converges faster than Cube R-CNN's virtual depth (26.8 vs. 21.6 at 12 epochs).

Highlights & Insights

  • First formulation and solution of open-set monocular 3DOD: Establishes complete benchmarks (Omni3D→AV2/ScanNet) and the ODS evaluation metric.
  • Elegant inheritance of open-set capability from 2D to 3D: No vision-language pairs in 3D are required; classification ability is entirely borrowed from the 2D open-set detector via a differentiable 2D→3D lifting module.
  • Canonical Image Space design: Conceptually simple yet effective—fixed resolution + center padding + synchronized intrinsic adjustment, with the added benefit of reduced GPU memory.
  • ODS evaluation metric: Uses normalized distance rather than IoU for matching, yielding fairer evaluation for small or thin objects; incorporates TP error terms (ATE/ASE/AOE) for a more comprehensive assessment than AP alone (a toy matching sketch follows this list).
  • Advantage of end-to-end joint training: OVM3D-Det relies on a pipeline approach (G-DINO + SAM + UniDepth + LLM for pseudo-GT generation) that precludes end-to-end optimization, resulting in substantially inferior performance.
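
The exact ODS definition lives in the paper; purely to illustrate why center-distance matching treats small or thin objects more fairly than 3D IoU, here is a toy greedy matcher (the 2 m threshold, greedy order, and absence of ODS's distance normalization and confidence ranking are my simplifications).

```python
# Toy distance-based matcher; real detection metrics additionally rank
# predictions by confidence and normalize the distance threshold.
import numpy as np

def match_by_center_distance(pred_centers, gt_centers, max_dist=2.0):
    """pred_centers: (P, 3), gt_centers: (G, 3) in meters -> list of (pred, gt) TPs."""
    dists = np.linalg.norm(pred_centers[:, None, :] - gt_centers[None, :, :], axis=-1)
    matches, used = [], set()
    for i in np.argsort(dists.min(axis=1)):      # handle nearest predictions first
        for j in np.argsort(dists[i]):
            if dists[i, j] > max_dist:
                break                            # remaining candidates only farther
            if j not in used:
                matches.append((int(i), int(j)))
                used.add(int(j))
                break
    return matches  # unmatched preds are FPs; unmatched GTs are FNs
```

A thin object (e.g., a monitor) whose predicted box is offset by a few centimeters can have near-zero 3D IoU yet still matches easily under a distance criterion, which is the fairness argument above.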

Limitations & Future Work

  • Slower inference: 3D-MOOD (Swin-T) runs at 17 FPS vs. 68 FPS for Cube R-CNN (DLA-34), reflecting the cost of a heavy backbone and multiple heads.
  • Limited open-set generalization from the auxiliary depth head: Depth training data lacks diversity (Omni3D only); incorporating more diverse depth data (e.g., MiDaS/UniDepth pretraining) could further improve performance.
  • Depth estimation accuracy remains inferior: Absolute relative error of 9.1% on KITTI Eigen-split, well behind UniDepth (4.21%) and Metric3Dv2 (4.4%).
  • Low IoU-based AP in open-set settings: Even when ODS is high, IoU-based AP3D remains low (e.g., 14.7% on AV2), indicating substantial room for improvement in 3D localization accuracy.
  • Future work could explore using a stronger depth foundation model as a frozen feature provider rather than training the depth head from scratch.
  • Extending open-set capability to 3D tracking is a natural direction (the authors have prior work in 3D tracking, e.g., CC-3DT).
Comparison with Related Methods

| Method | Key Difference | Advantage of 3D-MOOD | Trade-off |
|---|---|---|---|
| Cube R-CNN | Closed-set unified 3DOD; category-specific size priors and virtual depth | Surpassed by 6.7 AP3D on closed-set Omni3D; cannot detect novel categories | Cube R-CNN is faster (68 FPS) |
| OVM3D-Det | Pipeline of foundation models (G-DINO + SAM + UniDepth + LLM) generating pseudo 3D GT | 3D-MOOD is end-to-end trainable with significantly better performance; OVM3D-Det's depth model is also trained on target-domain data | OVM3D-Det requires no 3D annotations |
| Uni-MODE | Domain confidence + joint BEV detector for indoor/outdoor | 3D-MOOD achieves slightly better closed-set performance on Omni3D (30.0 vs. 28.2) with additional open-set capability | Uni-MODE is not open-sourced, preventing direct open-set comparison |

Takeaways

The 2D→3D lifting paradigm is broadly applicable: given a strong 2D open-set detector, learning lifting parameters to extend predictions to 3D is a transferable approach for open-set 3D segmentation, tracking, and related tasks. The canonical image space strategy of center padding with synchronized intrinsic adjustment can be adopted by any cross-dataset detection or depth estimation method. The ODS metric provides a standardized evaluation protocol for future work in this area. The paper demonstrates that, in the absence of 3D vision-language data, borrowing semantic alignment capability from 2D foundation models is a viable and effective strategy, inspiring analogous approaches in other 3D tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — First end-to-end open-set monocular 3DOD with significant contributions in problem formulation and benchmarking; individual technical modules (depth head, canonical space) are not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation covering closed-set, open-set, cross-domain, ablation, backbone comparison, and metric analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with clear motivation; some tables and equations could be formatted more compactly.
  • Value: ⭐⭐⭐⭐⭐ — Opens a new research direction in open-set monocular 3DOD, provides benchmarks and evaluation metrics, and is likely to serve as a foundational reference for future work.