Omni-Recon: Harnessing Image-Based Rendering for General-Purpose Neural Radiance Fields

Conference: ECCV 2024
arXiv: 2403.11131
Code: Available
Area: 3D Vision
Keywords: NeRF, Generalizable 3D Reconstruction, Image-Based Rendering, Zero-Shot Scene Understanding, Real-Time Rendering

TL;DR

This paper proposes Omni-Recon, a framework that builds a general-purpose NeRF on top of an image-based rendering (IBR) pipeline. Using a decoupled dual-branch design for geometry and appearance, it is presented as the first single model compatible with multiple downstream 3D tasks: generalizable 3D reconstruction, zero-shot multi-task scene understanding, real-time rendering, and scene editing.

Background & Motivation

NeRF has shown strong potential across 3D applications, but current approaches have three limitations that pull against each other:

Pipeline Fragmentation: Different 3D applications (cross-scene generalization, real-time rendering, scene understanding, etc.) require distinct NeRF models and pipelines, with each target task necessitating tedious training and trial-and-error experiments. For instance, the instant reconstruction in generalizable NeRFs and the real-time rendering in mesh-rasterization-based NeRFs typically employ entirely different pipelines, making it difficult to satisfy both requirements simultaneously.

Per-Scene Optimization Bottleneck: Traditional NeRFs rely on expensive per-scene optimization and therefore do not generalize across scenes. Generalizable NeRFs remove that requirement, but their inference cost is high (the generalizable baselines reported below render at under 0.03 FPS), which makes them unsuitable for real-time rendering.

Poor Scalability in Scene Understanding: Recognizing a new scene attribute (semantics, edges, keypoints, etc.) requires training a new NeRF model, which does not scale as the number of target attributes grows. The potential of existing generalizable methods for zero-shot scene understanding and scene editing remains largely untapped.

Inspired by the generalization and adaptation capabilities of foundation models, the authors build on one key insight: a well-designed IBR pipeline with accurate geometry and appearance estimation can lift 2D image features into 3D space, which naturally extends widely explored 2D tasks to the 3D world. This insight drives the overall design of Omni-Recon, which carefully decouples geometry and appearance to satisfy reconstruction accuracy, real-time capability, and multi-task scalability at the same time.

Method

Overall Architecture

Based on image-based rendering (IBR), Omni-Recon designs a general-purpose NeRF backbone consisting of two decoupled branches:

- Complex Transformer Geometry Branch (\(\mathbf{M}_{sdf}\)): Progressively fuses geometry and appearance features to predict the SDF.
- Lightweight Appearance Branch (\(\mathbf{M}_{color}\), 3-layer MLP only): Predicts the blending weights of source views.

The core advantage of this decoupled design is that the geometry branch can be baked into a mesh and discarded, while the lightweight appearance branch continues to function as a shader to achieve real-time rendering; meanwhile, the blending weights can be reused for zero-shot scene understanding.

Key Designs

1. Image-Based Rendering Pipeline

Given \(N\) source views \(\{I_i\}_{i=1}^N\), a CNN encoder extracts one feature map per view, \(\mathbf{F}_i \in \mathbb{R}^{H \times W \times C}\) for \(i = 1, \dots, N\). A 3D feature volume \(V \in \mathbb{R}^{M \times M \times M \times C}\) is constructed to aggregate multi-view geometric information: voxel centers are projected onto the \(N\) source views, the mean and variance of the sampled features are computed and concatenated, and the result is refined by a 3D U-Net. Geometry and appearance are then estimated through two independent branches:
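The mean/variance aggregation described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names `project_points` and `mean_var_volume_features` are hypothetical, cameras are assumed to be ideal pinhole cameras given as intrinsics plus world-to-camera matrices, nearest-neighbour lookup stands in for whatever interpolation the authors use, and the 3D U-Net refinement is omitted.

```python
import numpy as np

def project_points(points, K, w2c):
    """Project (P, 3) world points into one pinhole camera -> (P, 2) pixel coords."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (P, 4)
    cam = (w2c @ homo.T).T[:, :3]                                       # camera-space xyz
    uvw = (K @ cam.T).T                                                 # (P, 3)
    # Guard against division by zero for points at or behind the camera plane.
    return uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-8, None)

def mean_var_volume_features(points, feats, Ks, w2cs):
    """feats: (N, H, W, C) per-view feature maps -> (P, 2C) [mean | variance] features."""
    n_views, height, width, channels = feats.shape
    sampled = np.zeros((n_views, len(points), channels))
    for i in range(n_views):
        uv = project_points(points, Ks[i], w2cs[i])
        # Assumption: nearest-neighbour lookup clamped to the image bounds.
        u = np.clip(np.rint(uv[:, 0]).astype(int), 0, width - 1)
        v = np.clip(np.rint(uv[:, 1]).astype(int), 0, height - 1)
        sampled[i] = feats[i, v, u]
    # Cross-view mean and variance, concatenated along the channel axis.
    return np.concatenate([sampled.mean(axis=0), sampled.var(axis=0)], axis=-1)
```

Low cross-view variance at a voxel suggests that the views agree on what they see there, which is the usual cue that the voxel lies on a surface.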

\[s = \mathbf{M}_{sdf}(\{\mathbf{f}_i\}_{i=1}^N, V), \quad \{\omega_i\}_{i=1}^N = \mathbf{M}_{color}(\{\mathbf{f}_i\}_{i=1}^N, \mathbf{d})\]

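Point radiance is obtained via weighted summation: \(\hat{\mathbf{c}} = \sum_{i=1}^N \omega_i \mathbf{c}_i\), where \(\mathbf{f}_i\) and \(\mathbf{c}_i\) are the feature and color of source view \(i\) sampled at the projection of the 3D point. Design Motivation: Since the projected colors from source views are already close to the ground-truth radiance, the appearance branch only has to choose how to mix them and can remain lightweight (a 3-layer MLP).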
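A minimal sketch of the weighted summation, assuming the appearance branch outputs one unnormalised score per source view and that the scores are normalised with a softmax (the normalisation is an assumption; the summary above does not state it). The name `blend_radiance` is hypothetical.

```python
import numpy as np

def blend_radiance(scores, src_colors):
    """scores: (P, N) per-view scores; src_colors: (P, N, 3) projected source colors."""
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # assumed softmax over views
    # c_hat = sum_i w_i * c_i for every sampled point
    rgb = (weights[..., None] * src_colors).sum(axis=1)
    return rgb, weights
```

Because the output is a convex combination of observed colors, the network never has to regress radiance from scratch, which is consistent with the motivation for keeping this branch small.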

2. Transformer Geometry Branch (Three-stage Progressive Fusion)

The geometry branch consists of \(B=2\) blocks, each containing three Transformer modules that progressively fuse features:

Geometry Transformer: Cross-attention, fusing 3D volume features into the input and modeling occlusion relations among sampled points along the ray: \(\mathbf{M}_{sdf}^{geo}(\mathbf{x}, \{\mathbf{v}_k\}) = \text{CrossAttention}(\mathbf{q}=\mathbf{x}, \mathbf{k}=\mathbf{v}=\{\mathbf{v}_k\}_{k=1}^K)\)
Appearance Transformer: Uses subtraction attention to integrate appearance features \(\{\mathbf{f}_i\}_{i=1}^N\) into geometric features, handling the occlusion between sampled points and source views. Subtraction attention is better suited for reasoning about geometric relationships.
Occlusion Transformer: Self-attention, explicitly modeling occlusion among sampled points along the ray: \(\mathbf{M}_{sdf}^{occ}(\mathbf{x}) = \text{SelfAttention}(\mathbf{q}=\mathbf{k}=\mathbf{v}=\mathbf{x})\)

Design Motivation: Correctly handling two types of occlusion effects is critical for accurate SDF estimation—occlusion among sampled points (which point is in front) and occlusion between sampled points and source views (which projections are valid).
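The three attention patterns can be sketched with single-head NumPy code. Several simplifications are assumed here: learned query/key/value projections, residual connections, and normalisation layers are omitted, and the subtraction attention follows the common vector-attention idea of scoring each key from the difference \(\mathbf{q} - \mathbf{k}\) through a hypothetical learned vector `w_score`. The paper's exact formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, v):
    """Geometry stage: x (S, D) ray-sample features attend to v (K, D) volume features."""
    scores = x @ v.T / np.sqrt(x.shape[-1])   # scaled dot-product scores, (S, K)
    return softmax(scores, axis=-1) @ v

def self_attention(x):
    """Occlusion stage: samples along one ray attend to each other."""
    return cross_attention(x, x)

def subtraction_attention(x, f, w_score):
    """Appearance stage: x (S, D) geometry features, f (S, N, D) per-view features.

    Scores come from the difference q - k instead of a dot product
    (assumed form); w_score (D,) maps each difference to a scalar.
    """
    diff = x[:, None, :] - f                  # (S, N, D)
    attn = softmax(diff @ w_score, axis=1)    # (S, N) weights over source views
    return (attn[..., None] * f).sum(axis=1)  # (S, D)
```

One block would apply these in the order geometry, appearance, occlusion; stacking two such blocks matches the \(B=2\) configuration.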

3. Predict-then-Blend Zero-Shot Scene Understanding

Core Hypothesis: When geometry and appearance estimations are accurate, the blending weights \(\{\omega_i\}\) learned for radiance can be directly reused for other tasks, as regions with similar appearance tend to share similar scene attributes.

Workflow: (1) Generate predictions \(\{\mathbf{P}_i\}\) on each source view using pre-trained 2D models; (2) Reuse RGB blending weights to blend the projected predictions: \(\hat{\mathbf{p}} = \sum_{i=1}^N \omega_i \mathbf{p}_i\); (3) Obtain pixel-level predictions through NeuS volume rendering. Compared to the traditional "render-then-predict" paradigm, this method avoids propagating rendering errors to 2D models and leverages multi-view information to enhance monocular understanding.
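Steps (2) and (3) of the workflow can be sketched for a single ray as follows. The function name `predict_then_blend` and the array layout are assumptions for illustration; the per-view predictions are taken as given (for example class probabilities from a 2D segmentation model, already sampled at each point's projection), and the volume-rendering weights are assumed to have been computed elsewhere.

```python
import numpy as np

def predict_then_blend(blend_w, src_preds, render_w):
    """Blend projected 2D predictions with the RGB blending weights, then render.

    blend_w:   (S, N)    blending weights for S samples along one ray
    src_preds: (S, N, K) per-view predictions projected to each sample
    render_w:  (S,)      volume-rendering weights of the samples
    returns:   (K,)      pixel-level prediction for the ray
    """
    # Step 2: p_hat = sum_i w_i * p_i at every sample, reusing the color weights.
    point_pred = (blend_w[..., None] * src_preds).sum(axis=1)  # (S, K)
    # Step 3: accumulate along the ray exactly as color would be accumulated.
    return (render_w[:, None] * point_pred).sum(axis=0)
```

No task-specific parameter appears anywhere in this function, which is why a new attribute only requires swapping the 2D model that produces `src_preds`.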

Loss & Training

\[\mathcal{L} = \mathcal{L}_{color} + \beta \mathcal{L}_{depth}\]

\(\mathcal{L}_{color} = \frac{1}{R}\sum_{\mathbf{r}=1}^R \|\hat{\mathbf{C}}_\mathbf{r} - \mathbf{C}_\mathbf{r}\|_2^2\): L2 photometric loss
\(\mathcal{L}_{depth} = \frac{1}{R}\sum_{\mathbf{r}=1}^R \|\hat{\mathbf{D}}_\mathbf{r} - \mathbf{D}_\mathbf{r}\|_2^2\): L2 depth loss
\(\beta = 1\); SDF values are converted to density with the NeuS formulation
Training configuration: \(N=4\) source views, 640×512 resolution, 1024 rays per batch
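The sketch below shows how the pieces of the objective could fit together for a batch of rays: a NeuS-style conversion from SDF samples to rendering weights, followed by the combined color and depth loss. The discrete opacity uses the standard NeuS form \(\alpha_i = \max\big((\Phi_s(s_i) - \Phi_s(s_{i+1})) / \Phi_s(s_i),\, 0\big)\) with a sigmoid \(\Phi_s\); the sharpness value `s=64.0` and the function names are assumptions, not values taken from the paper.

```python
import numpy as np

def neus_weights(sdf, s=64.0):
    """sdf: (S+1,) SDF values at consecutive points on a ray -> (S,) rendering weights."""
    cdf = 1.0 / (1.0 + np.exp(-s * sdf))                  # Phi_s: sigmoid of scaled SDF
    alpha = np.clip((cdf[:-1] - cdf[1:]) / (cdf[:-1] + 1e-6), 0.0, 1.0)
    # Transmittance: probability that the ray reaches each section unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def total_loss(pred_rgb, gt_rgb, pred_depth, gt_depth, beta=1.0):
    """L = L_color + beta * L_depth, both squared L2 errors averaged over R rays."""
    color = np.mean(np.sum((pred_rgb - gt_rgb) ** 2, axis=-1))  # (R, 3) inputs
    depth = np.mean((pred_depth - gt_depth) ** 2)               # (R,) inputs
    return color + beta * depth
```

For a ray that crosses a surface once, the weights returned by `neus_weights` concentrate around the zero crossing of the SDF and sum to roughly one, so the rendered depth is a weighted average of sample depths near the surface.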

Key Experimental Results

Main Results: DTU Sparse-View Mesh Reconstruction (Chamfer Distance ↓)

Method	Mean	Scan24	Scan37	Scan65	Scan83	Scan110	Notes
COLMAP	1.52	0.90	2.89	1.94	1.30	1.42	Traditional MVS
MVSNet	1.22	1.05	2.52	1.52	1.29	0.66	Deep MVS
VolRecon	1.38	1.20	2.59	1.92	1.48	1.38	Generalizable Implicit
ReTR	1.17	1.05	2.31	1.52	1.35	0.77	Prev. SOTA
Omni-Recon	1.13	0.91	2.13	1.70	1.29	0.81	New SOTA, best on 10 out of 15 scenes

Rendering Quality (PSNR ↑): Omni-Recon reaches a mean PSNR of 26.32 dB, compared with 25.59 dB for ReTR (+0.73 dB) and 24.58 dB for VolRecon (+1.74 dB).
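For readers less familiar with the two metrics, the sketch below shows what they measure. It is a simplified illustration, not the DTU evaluation protocol: the official evaluation applies masks and other preprocessing that are omitted here, and the brute-force nearest-neighbour search only suits small point sets. Function names are hypothetical.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer_distance(a, b):
    """Mean of the two directed average nearest-neighbour distances.

    a: (Na, 3) points sampled from the predicted surface
    b: (Nb, 3) points sampled from the reference surface
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (Na, Nb)
    accuracy = d.min(axis=1).mean()      # prediction -> reference
    completeness = d.min(axis=0).mean()  # reference -> prediction
    return 0.5 * (accuracy + completeness)
```

Lower Chamfer distance means the reconstructed surface is both close to the reference (accuracy) and covers it (completeness); higher PSNR means lower pixel-wise error.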

Ablation Study: Zero-Shot Scene Understanding & Real-Time Rendering

Zero-Shot Scene Understanding Comparison:

Dataset	Strategy	Semantic mIoU↑	Edge↓	KeyPoint↓	KeyPoint3D↓
Replica	Render-then-Predict	15.64	0.0456	0.1101	0.0470
Replica	Predict-then-Blend	32.11	0.0412	0.0774	0.0176
ScanNet	Render-then-Predict	41.32	0.0471	0.0568	0.0412
ScanNet	Predict-then-Blend	61.11	0.0434	0.0424	0.0197

Real-Time Rendering (DTU):

Configuration	FPS	Mean PSNR	Notes
VolRecon	0.029	24.58	Baseline
ReTR	0.024	25.59	Strongest Baseline
Omni-Recon (No fine-tuning)	71.3	22.96	Direct use of mesh+shader
Omni-Recon (10s fine-tuning)	71.3	25.68	Already surpasses ReTR
Omni-Recon (1min fine-tuning)	71.3	28.34	Outperforms significantly
Omni-Recon (5min fine-tuning)	71.3	29.02	+3.43 over ReTR

Key Findings

Achieves optimal reconstruction on 10 out of 15 DTU scenes, reaching a new SOTA with a Mean Chamfer distance of 1.13.
Real-time rendering at 71.3 FPS, a >2458\(\times\) speedup over the VolRecon baseline (0.029 FPS); surpasses the rendering quality of ReTR with only 10 seconds of fine-tuning.
Predict-then-Blend achieves a 19.79-point higher mIoU on ScanNet semantic segmentation than Render-then-Predict (61.11 vs. 41.32).
After parameter-efficient tuning (PET), the semantic segmentation mIoU exceeds the SOTA method SRay by 5.20%.
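The headline numbers above follow directly from the tables in this section; the short script below reproduces the arithmetic. All inputs are copied from the tables, and the variable names are only illustrative.

```python
# Frames per second from the real-time rendering table (DTU).
fps_omni, fps_volrecon, fps_retr = 71.3, 0.029, 0.024

speedup_vs_volrecon = fps_omni / fps_volrecon  # about 2458.6x
speedup_vs_retr = fps_omni / fps_retr          # about 2970.8x

# PSNR gain after 5 minutes of fine-tuning, relative to ReTR.
psnr_gain_5min = 29.02 - 25.59                 # 3.43 dB

# mIoU gains of Predict-then-Blend over Render-then-Predict (absolute points).
miou_gain_scannet = 61.11 - 41.32              # 19.79
miou_gain_replica = 32.11 - 15.64              # 16.47

print(f"speedup vs VolRecon: {speedup_vs_volrecon:.1f}x")
print(f"speedup vs ReTR:     {speedup_vs_retr:.1f}x")
print(f"PSNR gain (5 min):   {psnr_gain_5min:.2f} dB")
print(f"mIoU gain ScanNet:   {miou_gain_scannet:.2f}")
print(f"mIoU gain Replica:   {miou_gain_replica:.2f}")
```

The quoted ">2458×" figure therefore corresponds to the VolRecon comparison; measured against ReTR, the ratio is larger still.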

Highlights & Insights

Versatility of Decoupled Design: The decoupling of geometry and appearance branches is the most ingenious design of this work—the complex branch ensures accuracy and can be discarded after baking to gain rendering speed, while the lightweight branch simultaneously acts as a shader and a multi-task bridge.
Underestimated Potential of IBR: The paper reveals the core value of IBR pipelines in constructing a general-purpose 3D pipeline, offering an inspiring perspective.
Blending Weight Reuse: Directly generalizing color blending weights to tasks like semantics/edges/keypoints without additional training is both elegant and practical.

Limitations & Future Work

Experiments mainly focus on object-centric captures (DTU) and indoor scenes (Replica, ScanNet); generalization to complex outdoor scenes remains to be verified.
The quality of zero-shot understanding depends heavily on the prediction accuracy of 2D prior models.
Extracting a mesh via TSDF is required prior to real-time rendering, and the mesh quality directly affects rendering results.
Text-guided editing requires iterative maintenance of 3D consistency, leaving room for efficiency improvements.

The idea of decoupling geometry and appearance can be extended to other 3D representations like 3DGS.
The Predict-then-Blend strategy can be combined with more 2D foundation models (e.g., SAM, CLIP-LSeg) to expand the scope of 3D understanding.
The paradigm of utilizing LoRA for fine-tuning pre-trained NeRF aligns with current trends in NLP/CV.

Rating

Novelty: ⭐⭐⭐⭐ — The idea of unifying general-purpose NeRF with IBR and the insight of reusing blending weights are highly creative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers reconstruction, rendering, understanding, and editing, with comprehensive datasets and ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Clear logic, rich tables/figures, and well-articulated motivation.
Value: ⭐⭐⭐⭐ — The real-time rendering capability and zero-shot understanding hold significant practical application value.