# FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views
Conference: CVPR 2026 | arXiv: 2604.09862 | Code: https://chaoyizh.github.io/ff3r_project | Area: 3D Vision
Keywords: 3D Reconstruction, Semantic Understanding, Feedforward Architecture, 3D Gaussians, Annotation-Free Training
## TL;DR

FF3R is the first fully annotation-free feedforward framework that jointly performs geometric reconstruction and open-vocabulary semantic understanding from unconstrained multi-view image sequences, achieving a 180× speedup over optimization-based methods when processing 64+ images.
## Background & Motivation
Background: Geometric reconstruction and semantic understanding are two pillars of 3D vision, yet treating them as separate frameworks leads to redundant pipelines and accumulated errors.
Limitations of Prior Work: (1) Annotation-dependent methods are constrained by fixed category sets and labeling costs; (2) Annotation-free methods face two core challenges: global semantic inconsistency (2D foundation models lack multi-view geometric priors) and local structural inconsistency (Gaussian merging across semantic boundaries).
Key Challenge: Geometric foundation models are trained self-supervised with photometric losses, while semantic foundation models require annotations or knowledge distillation; the divergence between these two training paradigms makes a unified system extremely difficult to build.
Goal: Construct a fully self-supervised feedforward framework that relies solely on RGB and feature map rendering supervision.
Key Insight: Inject semantic context into geometric tokens via token-level fusion, and resolve consistency issues through a semantic-geometry mutual enhancement mechanism.
Core Idea: Geometry-guided semantic alignment (addressing global inconsistency) + semantics-aware voxelization (addressing local inconsistency).
## Method

### Overall Architecture
Unconstrained multi-view images → Pretrained geometric/semantic encoders extract tokens → Token-wise fusion module (cross-attention) → Decode pixel-aligned features → Predict feature-RGB 3DGS, depth, and camera parameters → Semantic-geometry mutual enhancement mechanism enables annotation-free training.
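The token-wise fusion step in this pipeline can be sketched as single-head cross-attention in which geometric tokens query semantic tokens. This is a minimal numpy sketch; the token counts, feature dimension, and identity projections are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_fusion(geo_tokens, sem_tokens, d):
    """Geometric tokens act as queries; semantic tokens as keys/values.

    Projection matrices are omitted (identity) for brevity.
    Returns semantics-aware geometric tokens via a residual update.
    """
    attn = softmax(geo_tokens @ sem_tokens.T / np.sqrt(d))
    return geo_tokens + attn @ sem_tokens

rng = np.random.default_rng(0)
d = 64                                  # hypothetical token dimension
geo = rng.normal(size=(196, d))         # hypothetical geometric token count
sem = rng.normal(size=(196, d))
fused = token_fusion(geo, sem, d)
print(fused.shape)  # (196, 64)
```

The residual form keeps the geometric tokens as the backbone of the representation, with semantic context injected as an additive update before 3D decoding.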
### Key Designs
- Token-wise Fusion Module:
  - Function: Inject semantic context into geometric tokens.
  - Mechanism: Cross-attention lets geometric tokens query semantic tokens, establishing geometric-semantic information exchange at the token level. The output is semantics-aware geometric tokens used for subsequent 3D decoding.
  - Design Motivation: Simple concatenation or post-hoc fusion fails to establish deep interactions at the representation level.
- Geometry-Guided Feature Warping Loss:
  - Function: Resolve global semantic inconsistency.
  - Mechanism: Leverages geometric priors (via 3DGS reprojection) to align semantic features across views: if two views observe the same 3D point, their semantic features should agree. Cross-view semantic alignment is enforced by computing a loss on feature maps rendered at novel views.
  - Design Motivation: 2D foundation models (CLIP/DINO) are trained on single images, so the same object seen from different viewpoints may produce inconsistent features.
- Semantics-Aware Voxelization:
  - Function: Resolve local structural inconsistency.
  - Mechanism: When merging redundant Gaussian primitives in dense-view settings, both geometric confidence and semantic consistency are considered. Geometry-only merging combines Gaussians across semantic boundaries and blurs semantics; semantics-aware weights prevent cross-category merging.
  - Design Motivation: Long image sequences cause an explosion in the number of Gaussians that must be merged, and semantics-agnostic merging destroys structure.
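One way to realize semantics-aware merging is to group Gaussians per voxel by semantic agreement before averaging. The sketch below is a hypothetical scheme of my own construction (the paper's exact weighting rule, voxel size, and similarity threshold are not specified here): within each voxel, Gaussians are merged with confidence weights only if their semantic feature agrees with the most confident member, so merging stops at semantic boundaries.

```python
import numpy as np

def semantics_aware_merge(means, feats, conf, voxel=1.0, sim_thresh=0.9):
    """Merge Gaussians that share a voxel AND agree semantically.

    means: (N, 3) Gaussian centers
    feats: (N, D) unit-norm semantic features
    conf:  (N,)   geometric confidence, used as merge weight
    Hypothetical illustration, not the paper's implementation.
    """
    keys = np.floor(means / voxel).astype(int)
    out_means, out_feats = [], []
    for key in np.unique(keys, axis=0):
        idx = np.where((keys == key).all(axis=1))[0]
        remaining = list(idx[np.argsort(-conf[idx])])      # most confident first
        while remaining:
            sim = feats[remaining] @ feats[remaining[0]]   # cosine sim (unit feats)
            grp = [g for g, s in zip(remaining, sim) if s >= sim_thresh]
            w = conf[grp] / conf[grp].sum()                # confidence-weighted merge
            out_means.append(w @ means[grp])
            f = w @ feats[grp]
            out_feats.append(f / np.linalg.norm(f))
            remaining = [g for g in remaining if g not in grp]
    return np.array(out_means), np.array(out_feats)

# Four Gaussians in one voxel: two "chair"-like and two "floor"-like features.
means = np.array([[.1, 0, 0], [.2, 0, 0], [.3, 0, 0], [.4, 0, 0]])
feats = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
conf = np.array([1.0, 0.9, 0.8, 0.7])
m, f = semantics_aware_merge(means, feats, conf)
print(len(m))  # 2: merging stops at the semantic boundary
```

A geometry-only variant would collapse all four Gaussians into one and average the two incompatible features, which is exactly the semantic blurring the design avoids.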
### Loss & Training
Fully annotation-free training: RGB rendering loss (photometric consistency) + feature map rendering loss (semantic consistency). No camera poses, depth maps, or semantic labels are required.
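The two-term objective can be sketched as follows. This is a minimal numpy illustration; the MSE form, the weighting `lam`, and the tensor shapes are assumptions (the paper's exact loss terms and weights are not reproduced here):

```python
import numpy as np

def annotation_free_loss(rgb_pred, rgb_gt, feat_pred, feat_gt, lam=0.5):
    """Self-supervised objective: photometric term plus rendered-feature term.

    rgb_*:  (H, W, 3) rendered vs. held-out target images
    feat_*: (H, W, D) rendered vs. 2D-foundation-model feature maps
    lam is a hypothetical weighting, not the paper's value.
    """
    l_rgb = np.mean((rgb_pred - rgb_gt) ** 2)     # photometric consistency
    l_feat = np.mean((feat_pred - feat_gt) ** 2)  # semantic consistency
    return l_rgb + lam * l_feat

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))
feat = rng.random((8, 8, 16))
print(annotation_free_loss(rgb, rgb, feat, feat))  # 0.0 when predictions match targets
```

Both supervision signals come from rendering alone, which is why no poses, depth maps, or labels are needed at training time.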
## Key Experimental Results

### Main Results
| Task / Dataset | Metric | FF3R | Prev. SOTA | Gain |
|---|---|---|---|---|
| ScanNet NVS | PSNR/SSIM | SOTA | — | Significant |
| ScanNet Semantic Segmentation | mIoU | SOTA | — | Significant |
| DL3DV-10K Depth Estimation | Error | SOTA | — | Significant |
### Ablation Study
| Configuration | Key Metric | Note |
|---|---|---|
| w/o Token Fusion | Degraded semantic quality | Geometric-semantic interaction absent |
| w/o Geometry-Guided Warping | Cross-view inconsistency | Global semantic alignment fails |
| w/o Semantics-Aware Voxelization | Blurred local boundaries | Cross-category Gaussian merging |
| Full FF3R | Best | Two designs are complementary |
### Key Findings
- FF3R handles 64+ images, whereas the previous SOTA handles only 6 — a scalability improvement of over 10×.
- FF3R runs 180× faster than optimization-based methods; the efficiency advantage of the feedforward architecture is even more pronounced on long sequences.
- Strong generalization to in-the-wild scenes demonstrates the scalability of the annotation-free training paradigm.
## Highlights & Insights
- Fully Annotation-Free Training Paradigm: Relying solely on RGB and feature map rendering supervision, the method truly enables learning from arbitrary in-the-wild images.
- Feedforward Processing Scalable to 64+ Images: Breaks the input limitations of prior methods, paving the way for practical applications.
- Bidirectional Gains from Semantic-Geometry Mutual Enhancement: Geometry aids semantic alignment, and semantics aids geometric merging — their interaction yields benefits beyond unidirectional transfer.
## Limitations & Future Work
- Dependent on the feature quality of 2D foundation models (CLIP/DINO).
- Voxelization may introduce quantization errors.
- Not validated in dynamic scenes.
## Related Work & Insights
- vs. LSM: LSM is the first annotation-free feedforward method but lacks deep geometric-semantic interaction and cannot scale to long sequences.
- vs. SceneSplat: SceneSplat relies on large-scale SAM2 annotation data, whereas FF3R is fully annotation-free.
## Rating
- Novelty: ⭐⭐⭐⭐ First realization of fully annotation-free feedforward processing for long sequences.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on ScanNet and DL3DV.
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear and well-structured.
- Value: ⭐⭐⭐⭐⭐ Opens a scalable path toward unified 3D understanding.