# FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views
Conference: CVPR 2026 | arXiv: 2604.09862 | Code: https://chaoyizh.github.io/ff3r_project | Area: 3D Vision
Keywords: 3D Reconstruction, Semantic Understanding, Feedforward Architecture, 3D Gaussians, Annotation-Free Training
## TL;DR

FF3R is the first fully annotation-free feedforward framework that jointly performs geometric reconstruction and open-vocabulary semantic understanding from unconstrained multi-view image sequences, achieving a 180× speedup over optimization-based methods when processing 64+ images.
## Background & Motivation
Background: Geometric reconstruction and semantic understanding are two pillars of 3D vision, yet treating them as separate frameworks leads to redundant pipelines and accumulated errors.
Limitations of Prior Work: (1) Annotation-dependent methods are constrained by fixed category sets and labeling costs; (2) Annotation-free methods face two core challenges: global semantic inconsistency (2D foundation models lack multi-view geometric priors) and local structural inconsistency (Gaussian merging across semantic boundaries).
Key Challenge: Geometric foundation models are trained self-supervised with photometric losses, while semantic foundation models require annotations or knowledge distillation; the divergence between these two training paradigms makes a unified system extremely difficult to build.
Goal: Construct a fully self-supervised feedforward framework that relies solely on RGB and feature map rendering supervision.
Key Insight: Inject semantic context into geometric tokens via token-level fusion, and resolve consistency issues through a semantic-geometry mutual enhancement mechanism.
Core Idea: Geometry-guided semantic alignment (addressing global inconsistency) + semantics-aware voxelization (addressing local inconsistency).
## Method

### Overall Architecture
Unconstrained multi-view images → Pretrained geometric/semantic encoders extract tokens → Token-wise fusion module (cross-attention) → Decode pixel-aligned features → Predict feature-RGB 3DGS, depth, and camera parameters → Semantic-geometry mutual enhancement mechanism enables annotation-free training.
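The token-wise fusion step in this pipeline can be sketched as single-head cross-attention in which geometric tokens query semantic tokens. This is a minimal numpy sketch; the token counts, feature dimension, and identity projections are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_fusion(geo_tokens, sem_tokens, d):
    """Geometric tokens act as queries; semantic tokens as keys/values.

    Projection matrices are omitted (identity) for brevity.
    Returns semantics-aware geometric tokens via a residual update.
    """
    attn = softmax(geo_tokens @ sem_tokens.T / np.sqrt(d))
    return geo_tokens + attn @ sem_tokens

rng = np.random.default_rng(0)
d = 64                                  # hypothetical token dimension
geo = rng.normal(size=(196, d))         # hypothetical geometric token count
sem = rng.normal(size=(196, d))
fused = token_fusion(geo, sem, d)
print(fused.shape)  # (196, 64)
```

The residual form keeps the geometric tokens as the backbone of the representation, with semantic context injected as an additive update before 3D decoding.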
### Key Designs
- Token-wise Fusion Module:
  - Function: Inject semantic context into geometric tokens.
  - Mechanism: Cross-attention lets geometric tokens query semantic tokens, establishing geometric-semantic information exchange at the token level. The output is semantics-aware geometric tokens used for subsequent 3D decoding.
  - Design Motivation: Simple concatenation or post-hoc fusion fails to establish deep interactions at the representation level.
- Geometry-Guided Feature Warping Loss:
  - Function: Resolve global semantic inconsistency.
  - Mechanism: Leverages geometric priors (via 3DGS reprojection) to align semantic features across views: if two views observe the same 3D point, their semantic features should agree. Cross-view semantic alignment is enforced by computing a loss on feature maps rendered at novel views.
  - Design Motivation: 2D foundation models (CLIP/DINO) are trained on single images, so the same object seen from different viewpoints may produce inconsistent features.
- Semantics-Aware Voxelization:
  - Function: Resolve local structural inconsistency.
  - Mechanism: When merging redundant Gaussian primitives in dense-view settings, both geometric confidence and semantic consistency are considered. Geometry-only merging combines Gaussians across semantic boundaries and blurs semantics; semantics-aware weights prevent cross-category merging.
  - Design Motivation: Long image sequences cause an explosion in the number of Gaussians that must be merged, and semantics-agnostic merging destroys structure.
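One way to realize semantics-aware merging is to group Gaussians per voxel by semantic agreement before averaging. The sketch below is a hypothetical scheme of my own construction (the paper's exact weighting rule, voxel size, and similarity threshold are not specified here): within each voxel, Gaussians are merged with confidence weights only if their semantic feature agrees with the most confident member, so merging stops at semantic boundaries.

```python
import numpy as np

def semantics_aware_merge(means, feats, conf, voxel=1.0, sim_thresh=0.9):
    """Merge Gaussians that share a voxel AND agree semantically.

    means: (N, 3) Gaussian centers
    feats: (N, D) unit-norm semantic features
    conf:  (N,)   geometric confidence, used as merge weight
    Hypothetical illustration, not the paper's implementation.
    """
    keys = np.floor(means / voxel).astype(int)
    out_means, out_feats = [], []
    for key in np.unique(keys, axis=0):
        idx = np.where((keys == key).all(axis=1))[0]
        remaining = list(idx[np.argsort(-conf[idx])])      # most confident first
        while remaining:
            sim = feats[remaining] @ feats[remaining[0]]   # cosine sim (unit feats)
            grp = [g for g, s in zip(remaining, sim) if s >= sim_thresh]
            w = conf[grp] / conf[grp].sum()                # confidence-weighted merge
            out_means.append(w @ means[grp])
            f = w @ feats[grp]
            out_feats.append(f / np.linalg.norm(f))
            remaining = [g for g in remaining if g not in grp]
    return np.array(out_means), np.array(out_feats)

# Four Gaussians in one voxel: two "chair"-like and two "floor"-like features.
means = np.array([[.1, 0, 0], [.2, 0, 0], [.3, 0, 0], [.4, 0, 0]])
feats = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
conf = np.array([1.0, 0.9, 0.8, 0.7])
m, f = semantics_aware_merge(means, feats, conf)
print(len(m))  # 2: merging stops at the semantic boundary
```

A geometry-only variant would collapse all four Gaussians into one and average the two incompatible features, which is exactly the semantic blurring the design avoids.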
### Loss & Training
Fully annotation-free training: RGB rendering loss (photometric consistency) + feature map rendering loss (semantic consistency). No camera poses, depth maps, or semantic labels are required.
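The two-term objective can be sketched as follows. This is a minimal numpy illustration; the MSE form, the weighting `lam`, and the tensor shapes are assumptions (the paper's exact loss terms and weights are not reproduced here):

```python
import numpy as np

def annotation_free_loss(rgb_pred, rgb_gt, feat_pred, feat_gt, lam=0.5):
    """Self-supervised objective: photometric term plus rendered-feature term.

    rgb_*:  (H, W, 3) rendered vs. held-out target images
    feat_*: (H, W, D) rendered vs. 2D-foundation-model feature maps
    lam is a hypothetical weighting, not the paper's value.
    """
    l_rgb = np.mean((rgb_pred - rgb_gt) ** 2)     # photometric consistency
    l_feat = np.mean((feat_pred - feat_gt) ** 2)  # semantic consistency
    return l_rgb + lam * l_feat

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))
feat = rng.random((8, 8, 16))
print(annotation_free_loss(rgb, rgb, feat, feat))  # 0.0 when predictions match targets
```

Both supervision signals come from rendering alone, which is why no poses, depth maps, or labels are needed at training time.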
## Key Experimental Results

### Main Results
| Task / Dataset | Metric | FF3R | Prev. SOTA | Gain |
|---|---|---|---|---|
| ScanNet NVS | PSNR/SSIM | SOTA | — | Significant |
| ScanNet Semantic Segmentation | mIoU | SOTA | — | Significant |
| DL3DV-10K Depth Estimation | Error | SOTA | — | Significant |
### Ablation Study
| Configuration | Key Metric | Note |
|---|---|---|
| w/o Token Fusion | Degraded semantic quality | Geometric-semantic interaction absent |
| w/o Geometry-Guided Warping | Cross-view inconsistency | Global semantic alignment fails |
| w/o Semantics-Aware Voxelization | Blurred local boundaries | Cross-category Gaussian merging |
| Full FF3R | Best | Two designs are complementary |
### Key Findings
- FF3R handles 64+ images, whereas the previous SOTA handles only 6 — a scalability improvement of over 10×.
- FF3R runs 180× faster than optimization-based methods; the efficiency advantage of the feedforward architecture is even more pronounced on long sequences.
- Strong generalization to in-the-wild scenes demonstrates the scalability of the annotation-free training paradigm.
## Highlights & Insights
- Fully Annotation-Free Training Paradigm: Relying solely on RGB and feature map rendering supervision, the method truly enables learning from arbitrary in-the-wild images.
- Feedforward Processing Scalable to 64+ Images: Breaks the input limitations of prior methods, paving the way for practical applications.
- Bidirectional Gains from Semantic-Geometry Mutual Enhancement: Geometry aids semantic alignment, and semantics aids geometric merging — their interaction yields benefits beyond unidirectional transfer.
## Limitations & Future Work
- Dependent on the feature quality of 2D foundation models (CLIP/DINO).
- Voxelization may introduce quantization errors.
- Not validated in dynamic scenes.
## Related Work & Insights
- vs. LSM: LSM is the first annotation-free feedforward method but lacks deep geometric-semantic interaction and cannot scale to long sequences.
- vs. SceneSplat: SceneSplat relies on large-scale SAM2 annotation data, whereas FF3R is fully annotation-free.
## Rating
- Novelty: ⭐⭐⭐⭐ First realization of fully annotation-free feedforward processing for long sequences.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on ScanNet and DL3DV.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear and well-structured.
- Value: ⭐⭐⭐⭐⭐ Opens a scalable path toward unified 3D understanding.