
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views

Conference: CVPR 2026
arXiv: 2604.09862
Code: https://chaoyizh.github.io/ff3r_project
Area: 3D Vision
Keywords: 3D Reconstruction, Semantic Understanding, Feedforward Architecture, 3D Gaussians, Annotation-Free Training

TL;DR

FF3R is the first fully annotation-free feedforward framework to jointly perform geometric reconstruction and open-vocabulary semantic understanding from unconstrained multi-view image sequences, achieving a 180× speedup over optimization-based methods when processing 64+ images.

Background & Motivation

Background: Geometric reconstruction and semantic understanding are two pillars of 3D vision, yet treating them as separate frameworks leads to redundant pipelines and accumulated errors.

Limitations of Prior Work: (1) Annotation-dependent methods are constrained by fixed category sets and labeling costs; (2) Annotation-free methods face two core challenges: global semantic inconsistency (2D foundation models lack multi-view geometric priors) and local structural inconsistency (Gaussian merging across semantic boundaries).

Key Challenge: Geometric foundation models are trained in a self-supervised manner with photometric losses, while semantic foundation models require annotations or knowledge distillation; this divergence in training paradigms makes building a unified system difficult.

Goal: Construct a fully self-supervised feedforward framework that relies solely on RGB and feature map rendering supervision.

Key Insight: Inject semantic context into geometric tokens via token-level fusion, and resolve consistency issues through a semantic-geometry mutual enhancement mechanism.

Core Idea: Geometry-guided semantic alignment (addressing global inconsistency) + semantics-aware voxelization (addressing local inconsistency).

Method

Overall Architecture

Unconstrained multi-view images → Pretrained geometric/semantic encoders extract tokens → Token-wise fusion module (cross-attention) → Decode pixel-aligned features → Predict feature-RGB 3DGS, depth, and camera parameters → Semantic-geometry mutual enhancement mechanism enables annotation-free training.
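A minimal PyTorch-style sketch of how such a feedforward pipeline might be wired is given below. The module names, head dimensions, and the patch tokenizers standing in for the pretrained geometric/semantic encoders are illustrative assumptions, not the paper's actual architecture; the token-wise fusion step is shown as a single cross-attention call.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Stand-in for a pretrained encoder: patchify images into tokens (illustrative only)."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images):                      # images: (B*V, 3, H, W)
        tokens = self.proj(images)                  # (B*V, C, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)    # (B*V, N, C)

class FF3RSketch(nn.Module):
    """Hypothetical wiring of a feedforward reconstruction pipeline; names are illustrative."""
    def __init__(self, dim=256, n_heads=8, feat_dim=32):
        super().__init__()
        self.geo_encoder = PatchTokenizer(dim)      # stand-in for the pretrained geometric encoder
        self.sem_encoder = PatchTokenizer(dim)      # stand-in for the pretrained semantic encoder (CLIP/DINO-like)
        self.fusion = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gaussian_head = nn.Linear(dim, 14 + feat_dim)  # e.g. 3DGS params (mean, scale, rotation, opacity, RGB) + semantic feature
        self.depth_head = nn.Linear(dim, 1)
        self.pose_head = nn.Linear(dim, 7)          # e.g. quaternion + translation per view

    def forward(self, images):                      # images: (B, V, 3, H, W), unposed views
        B, V = images.shape[:2]
        flat = images.flatten(0, 1)
        geo = self.geo_encoder(flat)                # geometric tokens (B*V, N, C)
        sem = self.sem_encoder(flat)                # semantic tokens  (B*V, N, C)
        fused, _ = self.fusion(geo, sem, sem)       # geometric tokens query semantic context
        gaussians = self.gaussian_head(fused)       # pixel-aligned feature-RGB Gaussians
        depth = self.depth_head(fused)              # per-token depth
        pose = self.pose_head(fused.mean(dim=1))    # per-view camera parameters
        return gaussians, depth, pose.view(B, V, 7)

# usage: gaussians, depth, poses = FF3RSketch()(torch.randn(1, 4, 3, 224, 224))
```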

Key Designs

  1. Token-wise Fusion Module:

    • Function: Inject semantic context into geometric tokens.
    • Mechanism: A cross-attention mechanism allows geometric tokens to query semantic tokens, establishing geometric-semantic information exchange at the token level. The output is semantically aware geometric tokens used for subsequent 3D decoding.
    • Design Motivation: Simple concatenation or post-hoc fusion fails to establish deep interactions at the representation level.
  2. Geometry-Guided Feature Warping Loss:

    • Function: Resolve global semantic inconsistency.
    • Mechanism: Leverages geometric priors (via 3DGS reprojection) to align semantic features across views: if two views observe the same 3D point, their semantic features should be consistent. Cross-view semantic alignment is enforced through a loss on feature maps rendered at novel views (a simplified sketch follows this list).
    • Design Motivation: 2D foundation models (CLIP/DINO) are trained on single images, and the same object seen from different viewpoints may produce inconsistent features.
  3. Semantics-Aware Voxelization:

    • Function: Resolve local structural inconsistency.
    • Mechanism: When merging redundant Gaussian primitives in dense-view settings, both geometric confidence and semantic consistency are considered. Geometry-only merging combines Gaussians across semantic boundaries and blurs semantics; semantics-aware weights prevent cross-category merging (see the second sketch after this list).
    • Design Motivation: Long image sequences cause an explosion in the number of Gaussians that must be merged, but semantics-agnostic merging destroys structure.
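To make the global-consistency idea concrete, here is a simplified sketch: features reprojected to the same 3D point should agree across views. FF3R enforces this through feature maps rendered from the 3DGS representation at novel views; the depth-based warping, function name, and cosine distance below are assumptions for illustration, and occlusion/out-of-bounds masking is omitted.

```python
import torch
import torch.nn.functional as F

def warp_feature_consistency_loss(feat_i, feat_j, depth_i, K, T_ij):
    """Sketch of a geometry-guided feature alignment loss (illustrative, not the paper's exact formulation).

    feat_i, feat_j: (C, H, W) semantic feature maps of views i and j
    depth_i:        (H, W) predicted depth for view i
    K:              (3, 3) camera intrinsics
    T_ij:           (4, 4) relative pose mapping view-i camera coords to view j
    """
    C, H, W = feat_i.shape
    # Back-project every pixel of view i to a 3D point using its depth.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()      # (H, W, 3) homogeneous pixels
    pts_i = (pix @ torch.linalg.inv(K).T) * depth_i.unsqueeze(-1)         # camera-frame 3D points
    pts_h = torch.cat([pts_i, torch.ones(H, W, 1)], dim=-1)               # homogeneous coordinates
    pts_j = (pts_h @ T_ij.T)[..., :3]                                     # points in view-j frame
    # Project into view j and normalize to [-1, 1] for grid_sample.
    proj = pts_j @ K.T
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-6)
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)            # (H, W, 2) sampling grid
    feat_j_warped = F.grid_sample(feat_j[None], grid[None], align_corners=True)[0]
    # Features of the same 3D point seen from both views should agree.
    return 1 - F.cosine_similarity(feat_i, feat_j_warped, dim=0).mean()
```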
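And a toy sketch of merging Gaussians within voxels while respecting semantic boundaries. The voxel hashing, the two-way split per voxel around the most confident Gaussian, and the confidence-weighted averaging are illustrative simplifications, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def semantics_aware_merge(means, feats, conf, voxel_size=0.05, sim_thresh=0.8):
    """Illustrative voxel merging that respects semantic boundaries (not the paper's exact algorithm).

    means: (N, 3) Gaussian centers
    feats: (N, D) per-Gaussian semantic features
    conf:  (N,)   geometric confidence used as merging weight
    """
    keys = torch.floor(means / voxel_size).long()                  # voxel index per Gaussian
    _, cell = torch.unique(keys, dim=0, return_inverse=True)       # cell id per Gaussian
    merged_means, merged_feats = [], []
    for c in cell.unique():
        idx = (cell == c).nonzero(as_tuple=True)[0]
        sub_feats = F.normalize(feats[idx], dim=-1)
        # Anchor on the most confident Gaussian in the voxel.
        anchor = sub_feats[conf[idx].argmax()]
        same = (sub_feats @ anchor) > sim_thresh                   # semantically consistent subset
        for group in (idx[same], idx[~same]):                      # keep cross-category Gaussians separate
            if group.numel() == 0:
                continue
            w = conf[group] / conf[group].sum()                    # confidence-weighted average
            merged_means.append((w[:, None] * means[group]).sum(0))
            merged_feats.append((w[:, None] * feats[group]).sum(0))
    return torch.stack(merged_means), torch.stack(merged_feats)
```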

Loss & Training

Fully annotation-free training: RGB rendering loss (photometric consistency) + feature map rendering loss (semantic consistency). No camera poses, depth maps, or semantic labels are required.
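A minimal sketch of these two supervision terms, assuming an L1 photometric distance and a cosine feature distance; the actual distances and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def annotation_free_loss(rendered_rgb, target_rgb, rendered_feat, teacher_feat, feat_weight=0.5):
    """Sketch of the two self-supervised terms above (distances and weighting are assumptions).

    rendered_rgb:  (B, 3, H, W) RGB rendered from the predicted feature-RGB Gaussians
    target_rgb:    (B, 3, H, W) input images (photometric supervision)
    rendered_feat: (B, D, H, W) rendered feature maps
    teacher_feat:  (B, D, H, W) features from a frozen 2D foundation model (e.g. CLIP/DINO)
    """
    rgb_loss = F.l1_loss(rendered_rgb, target_rgb)                                    # photometric consistency
    feat_loss = 1 - F.cosine_similarity(rendered_feat, teacher_feat, dim=1).mean()    # semantic consistency
    return rgb_loss + feat_weight * feat_loss
```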

Key Experimental Results

Main Results

Task / Dataset                  Metric      FF3R   Gain over Prev. SOTA
ScanNet NVS                     PSNR/SSIM   SOTA   Significant
ScanNet Semantic Segmentation   mIoU        SOTA   Significant
DL3DV-10K Depth Estimation      Error       SOTA   Significant

Ablation Study

Configuration                     Effect                      Note
w/o Token Fusion                  Degraded semantic quality   Geometric-semantic interaction absent
w/o Geometry-Guided Warping       Cross-view inconsistency    Global semantic alignment fails
w/o Semantics-Aware Voxelization  Blurred local boundaries    Cross-category Gaussian merging
Full FF3R                         Best                        Two designs are complementary

Key Findings

  • FF3R handles 64+ images, whereas the previous SOTA handles only 6 — a scalability improvement of over 10×.
  • FF3R runs 180× faster than optimization-based methods; the efficiency advantage of the feedforward architecture is even more pronounced on long sequences.
  • Strong generalization to in-the-wild scenes demonstrates the scalability of the annotation-free training paradigm.

Highlights & Insights

  • Fully Annotation-Free Training Paradigm: Relying solely on RGB and feature-map rendering supervision, the method can learn from arbitrary in-the-wild images.
  • Feedforward Processing Scalable to 64+ Images: Breaks the input limitations of prior methods, paving the way for practical applications.
  • Bidirectional Gains from Semantic-Geometry Mutual Enhancement: Geometry aids semantic alignment, and semantics aids geometric merging — their interaction yields benefits beyond unidirectional transfer.

Limitations & Future Work

  • Dependent on the feature quality of 2D foundation models (CLIP/DINO).
  • Voxelization may introduce quantization errors.
  • Not validated in dynamic scenes.
  • vs. LSM: LSM is the first annotation-free feedforward method but lacks deep geometric-semantic interaction and cannot scale to long sequences.
  • vs. SceneSplat: SceneSplat relies on large-scale SAM2 annotation data, whereas FF3R is fully annotation-free.

Rating

  • Novelty: ⭐⭐⭐⭐ First realization of fully annotation-free feedforward processing for long sequences.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on ScanNet and DL3DV.
  • Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear and well-structured.
  • Value: ⭐⭐⭐⭐⭐ Opens a scalable path toward unified 3D understanding.