Towards In-the-Wild 3D Plane Reconstruction from a Single Image¶

Conference: CVPR 2025
arXiv: 2506.02493
Code: https://github.com/jcliu0428/ZeroPlane
Area: 3D Vision
Keywords: Plane Reconstruction, Zero-shot Generalization, Single-image 3D Reconstruction, Cross-domain Learning, Transformer

TL;DR¶

ZeroPlane proposes the first cross-domain, zero-shot framework for single-image 3D plane reconstruction. It combines a large-scale plane benchmark (14 datasets, 560k annotations), a decoupled normal-offset representation learned with a classification-regression paradigm, and a pixel-level geometry-enhanced embedding module. Together these generalize significantly better than existing methods across diverse indoor and outdoor scenes.

Background & Motivation¶

Background: Single-image 3D plane reconstruction is a fundamental task in 3D vision, applied in AR, localization and mapping, and robotics. Existing methods (PlaneNet, PlaneRCNN, PlaneTR, PlaneRecTR) are mostly trained and tested on a single dataset, focusing on either indoor scenes (ScanNet) or outdoor scenes (Synthia), performing well in their respective domains.

Limitations of Prior Work: (1) No existing method has attempted cross-domain generalizable plane reconstruction, where the geometric scales and plane distributions of indoor and outdoor scenes differ vastly; (2) High-quality outdoor plane annotation data is scarce, and existing outdoor datasets lack dense plane mask annotations; (3) Existing methods are trained and tested at low resolutions (256×192), which limits reconstruction quality.

Key Challenge: The geometric parameter distributions of planes in indoor and outdoor scenes vary enormously (indoor offsets are typically <5m, while outdoor offsets can reach tens or even hundreds of meters). Existing methods couple the normal and offset into a single scaled vector \(\mathbf{n}/d\) for regression, failing to adapt to such massive geometric scale variations during joint training.
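
To make the scale problem concrete, here is a small worked illustration (not taken from the paper). It assumes the plane is written as \(\mathbf{n}^\top \mathbf{x} = d\) with a unit normal, so the coupled target \(\mathbf{n}/d\) has magnitude \(1/d\); the two offsets and the error size are arbitrary example values.

\[
\left\| \frac{\mathbf{n}}{d} \right\| = \frac{1}{d}, \qquad d = 2\,\text{m} \;\Rightarrow\; 0.5, \qquad d = 100\,\text{m} \;\Rightarrow\; 0.01
\]

Under that assumption, an absolute error of 0.005 in the magnitude of the regressed vector changes the recovered indoor offset by about 1% (\(1/0.505 \approx 1.98\) m), but moves the outdoor offset to anywhere between roughly 67 m (\(1/0.015\)) and 200 m (\(1/0.005\)). A single regression head therefore has very uneven sensitivity across the two domains.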

Goal: To construct a unified cross-domain 3D plane reconstruction system capable of zero-shot generalization across diverse indoor, outdoor, and even in-the-wild scenarios.

Key Insight: Zero-shot depth estimation (MiDaS, Depth Anything) has shown that training a single model on a large mixture of datasets substantially improves generalization. The authors extend this idea to the more challenging task of plane reconstruction.

Core Idea: Decouple the representation of normal and offset, applying a "classification-regression" paradigm to both (first classifying to the nearest cluster center/exemplar, then regressing the residual) to alleviate the difficulty of learning cross-domain geometric parameters.

Method¶

Overall Architecture¶

ZeroPlane is built upon the detection-segmentation framework of Mask2Former and PlaneRecTR. Given a single input image, multi-scale features are extracted using a DINOv2-Base encoder and passed to a DPT pixel decoder to obtain multi-resolution feature maps. Learnable plane queries interact with image features within a Transformer decoder to output instance-level predictions: segmentation masks, classification scores, normals, and offsets. Additional pixel-level depth/normal predictions serve as auxiliary tasks, and their features are injected into plane queries via a geometric enhancement module.

Key Designs¶

Normal-Offset Decoupled + Classification-Regression Paradigm:
- Function: Makes plane-parameter regression tractable when indoor and outdoor data are mixed during training.
- Mechanism: The coupled plane parameter \(\mathbf{n}/d\) is split into a normal \(\mathbf{n}\) and an offset \(d\), which are learned separately, each with a classification-regression strategy. K-Means clustering over the mixed training data yields \(K_n=7\) normal exemplars and \(K_d=20\) offset exemplars (offsets are split into indoor and outdoor groups at a 20m threshold, with 10 exemplars clustered per group). An MLP classification head predicts the exemplar index, and a regression head predicts the residual relative to that exemplar. The final predictions are \(\mathbf{n} = \hat{\mathbf{n}}^{(i)} + \mathbf{r_n}^{(i)}\) and \(d = \hat{d}^{(j)} + r_d^{(j)}\), where \(i\) and \(j\) are the predicted normal and offset exemplar indices.
- Design Motivation: Direct regression requires the network to simultaneously handle drastically different geometric scales (indoor <5m vs. outdoor >50m), which is highly challenging. Classification first brings the problem to a coarse, appropriate range, while regression only needs to fine-tune the residual, vastly reducing learning difficulty.
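
The offset half of the classification-regression scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy, uses a plain Lloyd's k-means written inline, and trains on synthetic offsets. The function names (`kmeans_1d`, `build_offset_exemplars`, `encode_offset`, `decode_offset`) are hypothetical; only the 20m split and the 10 exemplars per group come from the text.

```python
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain Lloyd's k-means on scalars; returns sorted cluster centers."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assign every value to its nearest center, then recompute the means.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = values[labels == c].mean()
    return np.sort(centers)

def build_offset_exemplars(offsets, threshold=20.0, k_per_group=10):
    """Cluster near (< threshold) and far (>= threshold) offsets separately."""
    near = offsets[offsets < threshold]
    far = offsets[offsets >= threshold]
    return np.concatenate([kmeans_1d(near, k_per_group),
                           kmeans_1d(far, k_per_group)])

def encode_offset(d, exemplars):
    """Training target: index of the nearest exemplar plus the residual."""
    j = int(np.argmin(np.abs(exemplars - d)))
    return j, d - exemplars[j]

def decode_offset(j, residual, exemplars):
    """Inference: d = exemplar[j] + residual."""
    return exemplars[j] + residual

# Synthetic stand-in for training offsets (assumed ranges, in meters).
rng = np.random.default_rng(1)
train_offsets = np.concatenate([rng.uniform(0.5, 5.0, 500),
                                rng.uniform(20.0, 150.0, 500)])
exemplars = build_offset_exemplars(train_offsets)

j_in, r_in = encode_offset(3.2, exemplars)     # an indoor-scale plane
j_out, r_out = encode_offset(80.0, exemplars)  # an outdoor-scale plane
d_in = decode_offset(j_in, r_in, exemplars)
d_out = decode_offset(j_out, r_out, exemplars)
```

In this sketch the classifier target selects the scale regime, so the regression target stays small relative to the offset itself in both domains. Normals would follow the same pattern with \(K_n=7\) exemplar directions and a vector residual.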
Pixel-Level Geometric Enhanced Plane Embedding Module:
- Function: Injects low-level pixel-level geometric cues (depth, normals) into plane queries to enhance instance-level predictions.
- Mechanism: Two CNN modules are appended after the encoder to predict pixel-level depth maps \(\mathbf{D}\) and normal maps \(\mathbf{N}\) (as auxiliary tasks). These predictions are projected into embedding spaces \(\mathbf{F_D}\) and \(\mathbf{F_N}\), and the plane queries \(\mathbf{Q}\) then attend to them via cross-attention: \(\mathbf{X_D} = \text{Attn}(\mathbf{Q}, \mathbf{F_D})\), \(\mathbf{X_N} = \text{Attn}(\mathbf{Q}, \mathbf{F_N})\). The final embedding is \(\mathbf{X} = \mathbf{X_F} + \mathbf{X_D} + \mathbf{X_N}\), where \(\mathbf{X_F}\) is the query embedding obtained from the image features.
- Design Motivation: Simply training pixel depth/normals as auxiliary tasks provides limited improvement because plane queries cannot directly leverage this low-level geometric information. The attention mechanism enables queries to discover fine-grained contextual cues (such as plane boundaries) from pixel-level geometric predictions.
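
The query-side injection described above can be sketched with a single-head scaled dot-product attention in NumPy. This is a simplified illustration under stated assumptions: the learned query/key/value projections, multi-head splitting, and normalization layers of a real Transformer decoder are omitted, and all tensor sizes are toy values rather than the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Single-head attention: queries (Q, C) attend over features (P, C)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ features.T * scale, axis=-1)  # (Q, P)
    return weights @ features, weights

rng = np.random.default_rng(0)
num_queries, num_pixels, channels = 4, 48, 16    # toy sizes (assumed)
Q = rng.normal(size=(num_queries, channels))     # plane queries
F_D = rng.normal(size=(num_pixels, channels))    # embedded depth prediction
F_N = rng.normal(size=(num_pixels, channels))    # embedded normal prediction
X_F = rng.normal(size=(num_queries, channels))   # image-feature embedding

X_D, w_D = cross_attention(Q, F_D)
X_N, w_N = cross_attention(Q, F_N)
X = X_F + X_D + X_N   # final plane embedding, one row per query
```

Each row of `w_D` is a distribution over pixels, which is what lets a query concentrate on geometric cues such as depth discontinuities along a plane boundary.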
Large-scale Cross-domain Plane Benchmark Dataset:
- Function: Provides high-resolution dense plane annotations across diverse indoor and outdoor environments.
- Mechanism: Consolidates 14 datasets covering both indoor and outdoor scenes, totaling 560k high-resolution (640×480) annotations. For datasets lacking semantic annotations, Mask2Former is used to obtain panoptic segmentation results as pseudo-ground truths, and planes are then fitted on the back-projected point cloud of each object/region. For outdoor scenes, synthetic datasets (with accurate depth maps) or stereo camera data are used to generate plane annotations.
- Design Motivation: Existing plane datasets only cover indoor environments with low resolutions, which cannot support cross-domain training. High-resolution annotations are a prerequisite for high-quality plane reconstruction.
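
The annotation step "fit a plane on the back-projected point cloud of a region" can be sketched as below. This is a hedged illustration, assuming a pinhole camera and a plain least-squares fit via SVD; the summary does not say which fitting procedure the authors use (a robust estimator such as RANSAC is a common alternative). The intrinsics, the mask, and the function names are made up for the example.

```python
import numpy as np

def back_project(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to camera-space points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def fit_plane(points):
    """Least-squares plane n.x = d with unit normal n, for points (N, 3)."""
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    offset = float(normal @ centroid)
    if offset < 0:  # fix the sign so the offset is non-negative
        normal, offset = -normal, -offset
    return normal, offset

# Toy scene: a fronto-parallel wall 3 m from the camera (assumed values).
depth = np.full((48, 64), 3.0)
points = back_project(depth, fx=50.0, fy=50.0, cx=32.0, cy=24.0)

mask = np.zeros(depth.shape, dtype=bool)  # stand-in for one segment mask
mask[10:40, 5:60] = True
normal, offset = fit_plane(points[mask])
```

For this synthetic wall the fit recovers a normal along the camera z-axis and an offset of 3 m. On real depth, noise and segmentation errors are the reason a robust fit may be preferable.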

Loss & Training¶

Total Loss: \(L = \lambda_c L_c + \lambda_m L_m + \lambda_{n_c} L_{n_c} + \lambda_{n_r} L_{n_r} + \lambda_{d_c} L_{d_c} + \lambda_{d_r} L_{d_r} + \lambda_{p_d} L_{p_d} + \lambda_{p_n} L_{p_n}\)
Cross-entropy loss is used for classification, L1 loss for regression, and a combined cross-entropy and Dice loss for masks.
Bipartite (Hungarian) matching is used to assign predictions to ground-truth planes.
Training: AdamW optimizer, lr=1e-4, batch size 16, 50K steps, with the learning rate decayed by 10x at 40K and again at 47K steps.
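
The matching and schedule above can be sketched in a few lines of standard-library Python. Two assumptions are worth flagging. First, the matcher below is a brute-force search over assignments, used only because it is short; it returns the same optimum as the Hungarian algorithm on small inputs, and practical code would typically call `scipy.optimize.linear_sum_assignment`. Second, the summary does not say whether the decay takes effect exactly at the milestone step; this sketch applies it from that step onward. The cost values are invented.

```python
from itertools import permutations

def match_bruteforce(cost):
    """Optimal one-to-one assignment; cost[i][j] is prediction i vs. GT j.

    Assumes at least as many predictions as ground truths.
    Returns a list of (prediction_index, ground_truth_index) pairs.
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best = min(
        permutations(range(num_pred), num_gt),
        key=lambda perm: sum(cost[i][j] for j, i in enumerate(perm)),
    )
    return [(i, j) for j, i in enumerate(best)]

def learning_rate(step, base_lr=1e-4, milestones=(40_000, 47_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma for each milestone reached."""
    passed = sum(step >= m for m in milestones)
    return base_lr * gamma ** passed

# Three plane queries, two ground-truth planes (toy matching costs).
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]
pairs = match_bruteforce(cost)

lr_start = learning_rate(0)
lr_mid = learning_rate(40_000)
lr_end = learning_rate(49_999)
```

Here prediction 1 is matched to ground truth 0 and prediction 0 to ground truth 1, and the third query is left unmatched. In Mask2Former-style training, unmatched queries are supervised as "no object".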

Key Experimental Results¶

Main Results — Zero-shot Evaluation¶

Method	NYUv2 [email protected]	NYUv2 Recall@5°	7-Scenes [email protected]
PlaneRecTR (S)	10.13	16.42	-
PlaneRecTR (M)	14.29	24.97	10.97
ZeroPlane-DINO-B (M)	17.86	37.29	17.19
ZeroPlane-Dust3R (M)	21.20	38.32	19.14

Ablation Study¶

Design	NYUv2 [email protected]
Coupled representation (\(\mathbf{n}/d\) direct regression)	~12
Decoupled + Direct Regression	~14
Decoupled + Classification-Regression	17.86
W/O Geometric Enhancement Module	Dec. by ~1-2%
Auxiliary task without attention injection	Limited gain

Key Findings¶

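Mixed training (M) clearly outperforms single-dataset training (S): for PlaneRecTR, NYUv2 Recall@5° rises from 16.42 to 24.97, supporting the complementary value of cross-domain data.
Decoupling the normal and offset and adding classification-regression each help: the NYUv2 ablation score rises from roughly 12 (coupled \(\mathbf{n}/d\) regression) to roughly 14 (decoupled direct regression) to 17.86 (decoupled classification-regression), with the gain most visible in depth reconstruction accuracy.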
Inserting the geometric enhanced embedding module via the attention mechanism is effective, whereas treating it purely as an auxiliary task offers limited help.
Stronger encoders (DINOv2-L, Dust3R) further improve performance, though the base version already leads by a large margin.
The advantage on outdoor zero-shot data is even more pronounced, reflecting strong cross-domain generalization capability.

Highlights & Insights¶

First to introduce the concept of zero-shot generalization to the field of 3D plane reconstruction, opening up a new research direction.
The "classification-regression" paradigm elegantly solves the problem of vast distribution differences in cross-domain geometric parameters, showing good generalizability.
The constructed large-scale cross-domain benchmark dataset itself is a significant contribution, serving as a foundation for future research.
The introduction of the DINOv2 encoder significantly enhances the cross-domain robustness of features.

Limitations & Future Work¶

Outdoor data mostly originates from synthetic datasets or stereo cameras; obtaining high-quality annotations for real-world outdoor scenes remains challenging.
The indoor/outdoor segmentation threshold (20m) for offset exemplars is manually set, which may lack flexibility.
Reconstruction of non-planar regions (e.g., curved surfaces) is not addressed, focusing only on planes.
Integrating depth foundation models (e.g., Depth Anything) could be explored to further improve geometric estimation accuracy.

Related Work¶

PlaneRecTR: The previous SOTA Transformer plane detector; ZeroPlane introduces cross-domain training strategies on top of it.
MiDaS/Depth Anything: The success of zero-shot depth estimation directly inspired this work.
Mask2Former: Provided the overall architecture and bipartite matching strategy.
Dust3R: Served as an alternative encoder to further enhance geometric awareness.

Rating¶

Novelty: 8/10 — First to propose cross-domain zero-shot plane reconstruction, with an elegantly designed classification-regression paradigm.
Experimental Thoroughness: 9/10 — 14 datasets, multiple encoders, comprehensive ablation studies, with zero-shot evaluation covering both indoor and outdoor environments.
Writing Quality: 8/10 — Thorough problem analysis and clear motivation for methods.
Value: 8/10 — Dual contribution of both dataset and methodology, opening a new avenue of research in 3D plane reconstruction.