Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations

Conference: CVPR 2026
arXiv: 2604.00548
Code: None
Area: 3D Vision
Keywords: feed-forward 3D reconstruction, weak supervision, monocular depth, sparse correspondences, SfM-free training

TL;DR

Reliev3R introduces the first weakly supervised paradigm for training feed-forward 3D reconstruction models (FFRMs) from scratch without multi-view geometric annotations, that is, without SfM/MVS-derived point clouds or camera poses. It replaces those annotations with two weaker supervisory signals, monocular relative depth and sparse image correspondences, and achieves performance on par with or superior to certain fully supervised FFRMs.

Background & Motivation

Feed-forward 3D reconstruction models (e.g., DUSt3R, MASt3R) map 2D images end-to-end to 3D content, but rely heavily on multi-view geometric annotations generated by SfM/MVS pipelines. Such annotations are computationally expensive, brittle in low-texture scenes, and difficult to scale.

Core Observation: Multi-view geometric annotations are not essential to reconstruction — raw multi-view inputs already encode all geometric cues (depth–appearance relationships, multi-view correspondences, pose-induced reprojection structure). Training FFRMs with SfM annotations is effectively equivalent to embedding a traditional reconstruction pipeline inside a Transformer.

Key Question: Can geometric principles be learned directly from multi-view inputs without relying on heavy geometric annotations?

Method

Overall Architecture

Multi-view images → FFRM predicts per-view depth maps and camera poses → constrained by two weak supervisory signals: (1) monocular relative depth pseudo-labels constraining depth shape; (2) sparse 2D correspondences enforcing multi-view geometric consistency.

Key Designs

Ambiguity-Aware Relative Depth Loss
- Function: Constrains the shape of predicted depth using pseudo-labels from monocular depth estimation.
- Mechanism: A pretrained monocular depth model provides relative depth pseudo-labels. Since monocular estimates are inconsistent across views, an ambiguity-aware scale-invariant depth loss is designed to automatically down-weight unreliable regions (e.g., sky, reflective surfaces). Only the ordinal relationships and shape of depth are constrained, not the absolute scale.
- Design Motivation: Monocular depth offers a per-pixel prior on relative distance, but is unreliable in regions such as sky; automatic identification and down-weighting of such regions is necessary.
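
The weighting idea above can be sketched as follows. This is a minimal illustration rather than the paper's exact formulation: it assumes a closed-form scale alignment followed by a weighted L1 residual, and the function name `ambiguity_weighted_depth_loss` and the per-pixel `weight` map (near 0 for sky or reflective pixels) are hypothetical. How Reliev3R actually derives its ambiguity weights is not specified in this summary.

```python
import numpy as np

def ambiguity_weighted_depth_loss(pred, pseudo, weight, eps=1e-8):
    """Weighted L1 depth loss after scale-aligning pred to the pseudo-label.

    pred, pseudo, weight: arrays of identical shape. `weight` is a
    hypothetical reliability map in [0, 1]; unreliable pixels get ~0.
    """
    p = np.asarray(pred, dtype=np.float64).ravel()
    q = np.asarray(pseudo, dtype=np.float64).ravel()
    w = np.asarray(weight, dtype=np.float64).ravel()

    # Closed-form scale s minimising sum_i w_i * (s * p_i - q_i)^2.
    s = (w * p * q).sum() / max((w * p * p).sum(), eps)

    # Weighted mean absolute residual; zero-weight pixels contribute nothing.
    return float((w * np.abs(s * p - q)).sum() / max(w.sum(), eps))
```

Because the scale `s` is re-fitted on every call, multiplying `pred` by any positive constant leaves the loss unchanged, which matches the point that only depth shape is constrained. Setting `weight` to zero at a pixel removes its pseudo-label from both the scale fit and the residual.
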
Triangulation-Based Reprojection Loss
- Function: Uses sparse 2D correspondences to enforce multi-view geometric consistency between predicted depth and poses.
- Mechanism: An off-the-shelf matcher provides sparse 2D correspondences. The predicted depth maps and camera poses are used to triangulate 3D points, and reprojection errors are computed. This loss jointly optimizes depth and pose, registering per-view depth predictions into a global 3D coordinate frame.
- Design Motivation: Monocular depth constrains only local shape and lacks cross-view global consistency. Sparse correspondences provide geometric anchors that stitch individual views into a coherent reconstruction.
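
To make the coupling between depth and pose concrete, the sketch below computes a reprojection error for sparse matches between two views. It is a simplified stand-in, not the paper's loss: it lifts each matched pixel with the predicted depth instead of triangulating rays, and it assumes z-depth along the optical axis, a shared pinhole intrinsic matrix `K`, and 4x4 camera-to-world poses. All function names are hypothetical.

```python
import numpy as np

def backproject(uv, depth, K, cam_to_world):
    """Lift pixel uv with z-depth `depth` to a world-space 3D point."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    point_cam = depth * ray
    return cam_to_world[:3, :3] @ point_cam + cam_to_world[:3, 3]

def project(point_world, K, cam_to_world):
    """Project a world-space point into a camera; returns pixel coordinates."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    point_cam = R.T @ (point_world - t)  # invert the rigid camera-to-world pose
    uvw = K @ point_cam
    return uvw[:2] / uvw[2]

def reprojection_loss(matches, depth_a, K, pose_a, pose_b):
    """Mean pixel error of view-A keypoints lifted by depth_a and projected into view B.

    matches: iterable of ((u_a, v_a), (u_b, v_b)) pixel pairs.
    depth_a: HxW predicted depth map for view A.
    """
    errors = []
    for uv_a, uv_b in matches:
        # Nearest-pixel depth lookup; a trainable version would interpolate.
        d = depth_a[int(round(uv_a[1])), int(round(uv_a[0]))]
        point = backproject(uv_a, d, K, pose_a)
        errors.append(np.linalg.norm(project(point, K, pose_b) - np.asarray(uv_b)))
    return float(np.mean(errors))
```

The error depends on the depth of view A and on both poses at once, so minimising it pushes all three toward agreement with the matched pixels. In a real training loop the same computation would be written in an autodiff framework so gradients reach the network.
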
Weakly Supervised Training Paradigm
- Function: Trains FFRMs from scratch entirely without SfM/MVS annotations.
- Mechanism: Both types of pseudo-labels are generated zero-shot by pretrained expert models (a monocular depth model and an image matcher), requiring no scene-specific 3D annotations. Camera intrinsics are assumed to be known; they are generally available in practice.
- Design Motivation: Eliminates dependence on SfM pipelines, enabling training to scale to arbitrary unannotated multi-view image collections.

Loss & Training

The training objective combines the ambiguity-aware scale-invariant depth loss with the triangulation-based reprojection loss. The model is trained from scratch, without initializing from any fully supervised pretrained weights.

Key Experimental Results

Main Results

Method	Supervision	Depth Accuracy	Pose Accuracy	Notes
MVDUSt3R	Full	Medium	Medium	Early FFRM
FLARE	Full	Medium	Medium	Recent FFRM
AnyCam	Weak (pose only)	—	Medium	Focused on pose estimation
Reliev3R	Weak	On par / superior	Surpasses AnyCam	No geometric annotations

Reliev3R matches or exceeds certain fully supervised methods without using any multi-view geometric annotations.

Ablation Study

Configuration	Depth Accuracy	Pose Accuracy	Notes
Relative depth loss only	Medium (locally good)	Poor	Lacks global consistency
Reprojection loss only	Poor	Medium	Lacks depth shape constraint
Both combined	Best	Best	Strong complementary effect
Without ambiguity awareness	Degraded	—	Sky/reflection regions introduce noise

Key Findings

- The two supervisory signals are highly complementary: relative depth constrains local shape while reprojection enforces global alignment.
- The ambiguity-aware mechanism is critical: without it, erroneous depth estimates in sky/reflective regions corrupt optimization.
- Reliev3R substantially outperforms AnyCam (also weakly supervised) on pose estimation, demonstrating that joint optimization of depth and pose is more effective than estimating them independently.

Highlights & Insights

- Lowering the data barrier for 3D learning: The requirement shifts from "SfM annotations needed" to "images + pretrained models suffice," dramatically reducing the cost of training data construction.
- Pretrained models as free annotators: Pseudo-labels from monocular depth models and matchers prove sufficient to replace expensive SfM pipelines.
- Toward scalable 3D foundation models: By removing the geometric annotation bottleneck, FFRMs can be trained on multi-view data at arbitrary scale.

Limitations & Future Work

- Camera intrinsics are still assumed to be known; although they are generally obtainable in practice, this requirement stops short of truly assumption-free learning.
- Pseudo-label quality is bounded by the pretrained models and may be unreliable in severely out-of-distribution scenes.
- Current performance remains slightly below the latest fully supervised FFRMs (e.g., VGGT, Fast3R), though the gap is narrowing.
- Future work may explore fully self-supervised paradigms that require neither geometric annotations nor known intrinsics.

- vs. DUSt3R / MASt3R / VGGT: These fully supervised FFRMs achieve stronger performance but depend on SfM annotations; Reliev3R eliminates this dependency.
- vs. AnyCam: Both adopt weak supervision, but AnyCam estimates poses only, whereas Reliev3R jointly predicts depth and pose.
- vs. MonoDepth: Monocular depth methods lack multi-view consistency; Reliev3R treats monocular depth as a component rather than an end solution.

Rating

- Novelty: ⭐⭐⭐⭐⭐ First method to train FFRMs from scratch without multi-view geometric annotations; paradigm-level innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-dataset comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and complete technical exposition.
- Value: ⭐⭐⭐⭐⭐ Significant contribution toward scalable 3D reconstruction.