FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Conference: CVPR 2025
arXiv: 2502.12138
Code: https://zhanghe3z.github.io/FLARE/
Area: 3D Vision / Sparse-view Reconstruction
Keywords: Sparse-view Reconstruction, Camera Estimation, 3D Gaussians, Feed-forward, Pointmaps
TL;DR
FLARE is a feed-forward, differentiable system that jointly infers camera poses, 3D geometry, and appearance from uncalibrated sparse-view images (2-8 views) in under 0.5 seconds. It uses a cascaded learning paradigm, with camera poses acting as a bridge, to progressively simplify the otherwise hard 3D learning task.
Background & Motivation
Background: SfM+MVS is the classic two-stage reconstruction pipeline, but feature matching is difficult under sparse views; DUSt3R/MASt3R predict pointmaps but rely on post-optimization for global registration.
Limitations of Prior Work: DUSt3R only supports pairwise matching followed by global optimization, which is time-consuming and yields sub-optimal results; the triplane representation in PF-LRM limits its performance in large-scale scenes; existing methods fail to resolve camera estimation, geometric reconstruction, and appearance modeling simultaneously and efficiently.
Key Challenge: Jointly optimizing pose, geometry, and appearance directly from images is highly prone to falling into local optima.
Goal: To design a cascaded learning paradigm that progressively lowers learning difficulty by utilizing camera poses as intermediate proxies.
Core Idea: Estimate coarse poses first → guide local geometry in camera coordinates → project to global coordinates → generate 3D Gaussians for rendering.
Method
Overall Architecture
A four-step cascade: (1) a Neural Pose Predictor estimates coarse poses; (2) camera-centric geometry estimation predicts local pointmaps in each camera's coordinate system; (3) a Global Geometry Projector unifies the local pointmaps into global coordinates; (4) a 3D Gaussian regression head predicts Gaussians for novel view synthesis.
Key Designs
- Neural Pose Predictor:
    - Function: Directly regress camera poses from sparse-view images
    - Mechanism: Concatenate image patches and learnable camera latents into a 1D sequence, which is then fed into a small decoder-only transformer to predict 7D poses (translation + normalized quaternion)
    - Design Motivation: Skip feature matching and regress poses directly; even imperfect poses provide valuable spatial initialization
- Two-stage Geometry Estimation:
    - Function: Progressively learn geometry from local to global
    - Mechanism: First predict a local pointmap in each camera's coordinate system (this matches the imaging process and simplifies learning), then use a learnable geometry projector to transform the local pointmaps into global coordinates. During training, noise is added to the poses to improve robustness.
    - Design Motivation: Local prediction avoids reasoning directly about complex global spatial relationships, which splits the learning problem into easier parts.
- 3D Gaussian Appearance Modeling:
    - Function: Generate renderable 3D Gaussians from estimated geometry
    - Mechanism: Use the global pointmaps as Gaussian centers, and predict opacity, rotation, scale, and spherical harmonics after fusing VGG features with appearance features. To address scale inconsistency, both predicted and ground-truth pointmaps are normalized to a unit space.
    - Design Motivation: Decouple geometry and appearance, with the estimated geometry serving as the scaffold for the 3D Gaussians.
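To make the link between the 7D pose and the local-to-global step concrete, here is a minimal sketch of how a pose of the form (translation + quaternion) maps camera-frame points into a global frame. It is only a rigid-transform illustration: FLARE's Global Geometry Projector is a learned module, not this closed-form mapping. The component order `(tx, ty, tz, qw, qx, qy, qz)`, the camera-to-world convention, and all function names are assumptions made for this example.

```python
import math


def normalize_quaternion(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("a zero quaternion does not encode a rotation")
    return tuple(c / norm for c in q)


def quaternion_to_rotation(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = normalize_quaternion(q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def local_to_global(points_cam, pose_7d):
    """Apply X_global = R @ X_cam + t to every camera-frame point.

    pose_7d is assumed to be (tx, ty, tz, qw, qx, qy, qz), camera-to-world.
    """
    t = pose_7d[:3]
    rot = quaternion_to_rotation(pose_7d[3:])
    return [
        tuple(sum(rot[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points_cam
    ]
```

For example, a 90-degree rotation about the z-axis followed by a translation of (1, 2, 3) moves the camera-frame point (1, 0, 0) to (1, 3, 3). A learned projector can deviate from this mapping, which is one way it can absorb errors in the coarse poses.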
Loss & Training
Total loss = pose loss (Huber) + geometry loss (confidence-aware L2) + Gaussian rendering loss (L2 + VGG perceptual + depth). Jointly trained on a mixture of large-scale public datasets.
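The loss terms above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the confidence-aware term follows the DUSt3R-style form `conf * distance - alpha * log(conf)`, the unit-space normalization divides by the mean distance to the origin, and the weights, `alpha`, `delta`, and all function names are hypothetical. The rendering term is passed in as a precomputed number because it needs a Gaussian rasterizer.

```python
import math


def huber(residual, delta=1.0):
    """Huber penalty: quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def pose_loss(pred_pose, gt_pose, delta=1.0):
    """Mean Huber penalty over the pose components."""
    terms = [huber(p - g, delta) for p, g in zip(pred_pose, gt_pose)]
    return sum(terms) / len(terms)


def normalize_pointmap(points):
    """Rescale points so their mean distance to the origin is 1 (assumed form)."""
    mean_norm = sum(math.sqrt(sum(c * c for c in p)) for p in points) / len(points)
    if mean_norm == 0.0:
        raise ValueError("cannot normalize a pointmap that sits at the origin")
    return [tuple(c / mean_norm for c in p) for p in points]


def confidence_aware_l2(pred, gt, conf, alpha=0.2):
    """Mean of conf * ||pred - gt||_2 - alpha * log(conf) over all points.

    conf values must be positive; alpha is a hypothetical regularizer weight.
    """
    total = 0.0
    for p, g, c in zip(pred, gt, conf):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
        total += c * dist - alpha * math.log(c)
    return total / len(conf)


def total_loss(pose_term, geometry_term, render_term, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are placeholders."""
    w_pose, w_geo, w_render = weights
    return w_pose * pose_term + w_geo * geometry_term + w_render * render_term
```

In this sketch, `normalize_pointmap` would be applied to both the predicted and the ground-truth pointmaps before calling `confidence_aware_l2`, so the geometry term does not depend on the scale of the scene. The `log(conf)` term stops the model from lowering the loss by driving every confidence toward zero.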
Key Experimental Results
Main Results
On RealEstate10K and multiple datasets:
- Pose estimation: Outperforms DUSt3R and MASt3R
- Novel view synthesis: Outperforms existing pose-free methods
- Inference speed: < 0.5 seconds
Key Findings
- Cascaded learning is significantly better than direct joint learning
- Two-stage geometry (local → global) converges faster than direct global prediction
- Noisy pose augmentation is crucial for inference-time robustness
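One way to picture the noisy pose augmentation is to jitter each training pose before it is handed to the geometry stage. The sketch below is a guess at the general shape of such an augmentation, not the paper's recipe: the Gaussian noise model, the noise magnitudes `trans_sigma` and `rot_sigma_deg`, the pose layout `(tx, ty, tz, qw, qx, qy, qz)`, and the function names are all assumptions.

```python
import math
import random


def quaternion_multiply(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def perturb_pose(pose_7d, trans_sigma=0.05, rot_sigma_deg=5.0, rng=random):
    """Return a noisy copy of a pose (tx, ty, tz, qw, qx, qy, qz).

    Translation gets additive Gaussian noise; rotation is composed with a
    small random rotation about a random axis. Both sigmas are hypothetical.
    """
    t = [c + rng.gauss(0.0, trans_sigma) for c in pose_7d[:3]]

    # Random unit axis: a normalized 3D Gaussian sample is uniform on the sphere.
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    axis_norm = math.sqrt(sum(a * a for a in axis)) or 1.0
    axis = [a / axis_norm for a in axis]

    # Small rotation angle in radians, drawn from a zero-mean Gaussian.
    angle = math.radians(rng.gauss(0.0, rot_sigma_deg))
    half = 0.5 * angle
    dq = (math.cos(half),) + tuple(math.sin(half) * a for a in axis)

    # Compose the noise rotation with the original one, then renormalize.
    q = quaternion_multiply(dq, tuple(pose_7d[3:]))
    q_norm = math.sqrt(sum(c * c for c in q))
    return tuple(t) + tuple(c / q_norm for c in q)
```

Passing a seeded generator, for example `perturb_pose(pose, rng=random.Random(0))`, makes the perturbation reproducible. Setting both sigmas to zero returns the input pose unchanged, which is a convenient sanity check.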
Highlights & Insights
- Core insight of the cascaded learning paradigm: Pose as a bridge from 2D to 3D reduces learning complexity
- Inference speed of 0.5 seconds is orders of magnitude faster than optimization-based methods
- Handles a variable number of input views (2-8 in the reported setting)
Limitations & Future Work
- GPU memory limits the number of images processed simultaneously
- Still partially dependent on the quality of camera pose estimation
- Performance may degrade in textureless regions
Rating
- Novelty: 8/10 — Well-designed cascaded learning paradigm
- Technical Depth: 8/10 — Complete multi-task joint learning framework
- Experimental Thoroughness: 8/10 — Multi-dataset validation
- Writing Quality: 8/10 — Clear structure