PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views

Conference: ICML 2025
arXiv: 2505.23481
Code: Available
Area: 3D Vision / Neural Radiance Fields
Keywords: NeRF, Sparse Views, Physics Priors, 3D Reconstruction, Generalization Analysis

TL;DR

PhysicsNeRF is a sparse-view NeRF framework built on physics-based priors. It combines four complementary constraints (depth ranking, cross-view consistency, sparsity regularization, and progressive training) to reach an average training PSNR of 21.4 dB from only 8 views (15.2 dB on held-out test views), and it provides a theoretical analysis of why overfitting arises under sparse-view conditions.

Background & Motivation

Background

Neural Radiance Fields (NeRF) have become a standard approach to novel view synthesis, but they typically rely on dense view coverage (often hundreds of images). Sparse-view methods such as RegNeRF, DietNeRF, and SparseNeRF improve regularization, and Instant-NGP improves the encoding strategy, yet these approaches either still require relatively dense views or lack physically grounded priors. Physics-aware extensions such as PAC-NeRF and PIE-NeRF introduce physical constraints, but they do not address the generalization challenges that arise under extremely sparse views.

Limitations of Prior Work

Sparse-view reconstruction is a severely underdetermined inverse problem: the limited $N \times K$ pixel-level color constraints cannot uniquely determine a continuous 3D radiance field, so exponentially many 3D solutions remain consistent with the same observations. Overfitting is therefore not merely a technical flaw but a reflection of this inherent ambiguity.

Ours

This paper proposes PhysicsNeRF, a compact NeRF variant with only 0.67M parameters. It leverages four complementary physics-based constraints to regularize 3D reconstruction under sparse views, alongside an in-depth theoretical analysis of the nature of the generalization gap.

Method

Overall Architecture

PhysicsNeRF employs dual-scale coordinate encoding ($1\times$ and $2\times$ scales), where each branch utilizes a 7-layer MLP (192 hidden units), totaling only 0.67M parameters. Inspired by Instant-NGP and Plenoxels, this design aims to balance model capacity with generalization capability under sparse supervision.
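As a rough illustration of what feeding two coordinate scales into two branches could look like, the sketch below builds a sin/cos encoding at $1\times$ and $2\times$ scale. The frequency ladder (`math.pi * 2**k`), the `num_freqs` value, and both function names are assumptions borrowed from the standard NeRF positional encoding. The summary does not specify the encoding or how the two branches are merged.

```python
import math

def positional_encoding(x, num_freqs=4, scale=1.0):
    """Sin/cos features of a 3D point at octave-spaced frequencies.

    num_freqs and the pi * 2**k ladder are illustrative assumptions,
    not values taken from the paper.
    """
    features = []
    for k in range(num_freqs):
        freq = math.pi * (2.0 ** k) * scale
        for coord in x:
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features

def dual_scale_encoding(x, num_freqs=4):
    """Return the inputs of the two branches: 1x and 2x coordinate scales."""
    coarse = positional_encoding(x, num_freqs, scale=1.0)
    fine = positional_encoding(x, num_freqs, scale=2.0)
    return coarse, fine

coarse, fine = dual_scale_encoding((0.25, -0.5, 0.1))
print(len(coarse), len(fine))  # 3 coords * 2 functions * 4 frequencies each
```

Under this particular assumption the $2\times$ branch sees the same frequency ladder shifted up by one octave, so it emphasizes finer detail while the $1\times$ branch keeps the lowest frequency.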

Key Designs

Depth Ranking Consistency: Using relative depth supervision from a monocular depth estimator such as MiDaS, a ranking loss is imposed on selected pixel pairs $(i,j) \in \mathcal{P}$: $$\mathcal{L}_{\text{depth}} = \sum_{(i,j)\in\mathcal{P}} \ell_{\text{rank}}\big(\text{sgn}(D_M(i)-D_M(j)),\; \text{sgn}(D_P(i)-D_P(j))\big)$$ where $D_M$ is the monocular depth estimate and $D_P$ is the depth predicted by the model. The mechanism relies on the ordinal relationships (rather than absolute depth values) provided by the pretrained depth model to guide geometric learning, thereby avoiding the inaccuracies of absolute depth estimation.
Cross-View Geometric Consistency: Rays cast from different camera poses toward the same 3D point are constrained to yield consistent radiance field outputs: $$\mathcal{L}_{\text{cv}} = \sum_k \|F_\theta(\mathbf{r}_{k,1}) - F_\theta(\mathbf{r}_{k,2})\|_2^2$$ where $\mathbf{r}_{k,1}$ and $\mathbf{r}_{k,2}$ form the $k$-th pair of corresponding rays. The motivation is to bring the geometric consistency principle of multi-view stereo into NeRF training, improving cross-view geometric coherence.
Sparsity Regularization: Natural scenes exhibit spatial sparsity. A volumetric prior is applied to the density field: $$\mathcal{L}_{\text{sparse}} = \mathbb{E}_{\mathbf{x}\sim\mathcal{U}(\Omega)}[\text{softplus}(\sigma(\mathbf{x}))]$$ Concurrently, a gradient regularization $\mathcal{L}_{\text{reg}} = \|\nabla_{\mathbf{r}} F_\theta(\mathbf{r})\|_2^2$ is incorporated to promote smoothness and prevent excessive local variation.
Progressive Training: Inspired by curriculum learning, the physical constraints are introduced progressively via a scheduling function $\alpha(t)$: $$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rgb}} + \alpha(t)\sum_i \lambda_i \mathcal{L}_i$$ where $\alpha(t)$ is piecewise constant: 0.008 when $t<5k$, 0.025 when $5k \leq t < 15k$, and 0.08 thereafter.
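The three loss terms above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the form of $\ell_{\text{rank}}$, so a hinge penalty with a small `margin` is assumed here (the sign function itself has no useful gradient), and all function names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def depth_ranking_loss(mono_depth, pred_depth, pairs, margin=1e-4):
    """Hinge penalty on pixel pairs whose predicted order contradicts the prior.

    The hinge form and the margin are assumptions; the summary only names
    a generic ranking loss on the signs of the depth differences.
    """
    total = 0.0
    for i, j in pairs:
        diff_m = mono_depth[i] - mono_depth[j]
        if diff_m == 0:
            continue  # a tie carries no ordinal information
        order = 1.0 if diff_m > 0 else -1.0
        # Zero once the predicted depths agree with the prior order by >= margin.
        total += max(0.0, margin - order * (pred_depth[i] - pred_depth[j]))
    return total

def cross_view_loss(outputs_a, outputs_b):
    """Sum of squared L2 differences between outputs of corresponding rays."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(ray_a, ray_b))
        for ray_a, ray_b in zip(outputs_a, outputs_b)
    )

def sparsity_loss(raw_densities):
    """Monte Carlo estimate of E[softplus(sigma(x))] over sampled points."""
    return sum(softplus(s) for s in raw_densities) / len(raw_densities)

# Toy values: the pair (1, 2) is ordered differently by the prior and the model.
mono = [0.2, 0.5, 0.9]
pred = [1.0, 2.0, 1.5]
rank = depth_ranking_loss(mono, pred, pairs=[(0, 1), (1, 2)], margin=0.0)
cv = cross_view_loss([[1.0, 2.0]], [[1.0, 4.0]])
sparse = sparsity_loss([0.0])
print(rank, cv, sparse)
```

Only the pair whose predicted ordering disagrees with the monocular prior contributes to the ranking term, which is the sense in which the supervision is ordinal rather than metric.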

Loss & Training

The total loss is a weighted combination of the RGB reconstruction loss and the four physical constraint losses. The model is optimized using the Adam optimizer with an initial learning rate of $5\times10^{-4}$ and an exponential decay factor of $\gamma=0.998$ over 150,000 training iterations, utilizing mixed-precision training and adaptive batch sizes.
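The weighted combination, together with the piecewise-constant schedule $\alpha(t)$ from the progressive-training design, can be sketched as follows. The thresholds and values of `alpha` come from the text. The per-term weights $\lambda_i$ are not given in this summary, so the numbers in `weights` are placeholders, and the function names are hypothetical.

```python
def alpha(t):
    """Piecewise-constant constraint weight as a function of iteration t."""
    if t < 5_000:
        return 0.008
    if t < 15_000:
        return 0.025
    return 0.08

def total_loss(rgb_loss, constraint_losses, weights, t):
    """L_total = L_rgb + alpha(t) * sum_i lambda_i * L_i."""
    penalty = sum(weights[name] * value for name, value in constraint_losses.items())
    return rgb_loss + alpha(t) * penalty

# Placeholder lambda_i values, chosen only for illustration.
weights = {"depth": 1.0, "cv": 1.0, "sparse": 1.0, "reg": 1.0}
losses = {"depth": 0.5, "cv": 0.2, "sparse": 0.1, "reg": 0.2}

for step in (0, 5_000, 15_000):
    print(step, total_loss(0.05, losses, weights, step))
```

With identical loss values, the constraint contribution grows tenfold between the first and last stage (0.008 to 0.08), so the RGB term dominates early training and the priors take hold later.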

Key Experimental Results

Main Results

Scene	Metric (PSNR, dB)	PhysicsNeRF	NeRF	RegNeRF	DietNeRF	SparseNeRF
Chair	Train/Test/Gap	23.2/18.5/4.7	16.2/9.1/7.1	21.0/12.6/8.4	20.4/13.8/6.6	21.3/12.9/8.4
Lego	Train/Test/Gap	21.7/15.0/6.7	15.0/8.5/6.5	19.8/11.5/8.3	19.5/13.0/6.5	20.1/11.7/8.4
Drums	Train/Test/Gap	19.2/12.0/7.2	14.4/8.5/5.9	19.5/11.3/8.2	19.5/12.8/6.7	20.1/12.8/7.3
Average	Train/Test/Gap	21.4/15.2/6.2	-	-	-	-
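The Average row can be reproduced from the per-scene PhysicsNeRF entries, taking the gap as train PSNR minus test PSNR. A small check, with values copied from the table:

```python
# (train PSNR, test PSNR) in dB for PhysicsNeRF, copied from the table.
physicsnerf = {
    "Chair": (23.2, 18.5),
    "Lego": (21.7, 15.0),
    "Drums": (19.2, 12.0),
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# The generalization gap is the train/test PSNR difference per scene.
gaps = {scene: train - test for scene, (train, test) in physicsnerf.items()}

avg_train = mean(train for train, _ in physicsnerf.values())
avg_test = mean(test for _, test in physicsnerf.values())
avg_gap = mean(gaps.values())

summary = f"{avg_train:.1f}/{avg_test:.1f}/{avg_gap:.1f}"
print(summary)  # 21.4/15.2/6.2
```

The headline 21.4 dB figure is therefore the average training PSNR, while the average on held-out views is 15.2 dB.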

Ablation Study

Configuration	Train PSNR (dB)	Test PSNR (dB)	Gap (dB)	Explanation
RGB only	23.3	9.8	13.5	Best training but worst generalization
+ Depth ranking	23.0	11.2	11.8	Gap reduced by 1.7 dB
+ Cross-view	22.7	12.8	9.9	Further narrows the Gap
+ Sparsity	22.4	13.9	8.5	Approaching final performance
+ All (Full)	21.7	15.0	6.7	Optimal generalization
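To make the per-constraint contribution explicit, the following snippet recomputes the gap for each ablation row and the reduction obtained at each step. It uses only the numbers in the table.

```python
# (configuration, train PSNR, test PSNR) in dB, copied from the ablation table.
ablation = [
    ("RGB only", 23.3, 9.8),
    ("+ Depth ranking", 23.0, 11.2),
    ("+ Cross-view", 22.7, 12.8),
    ("+ Sparsity", 22.4, 13.9),
    ("+ All (Full)", 21.7, 15.0),
]

gaps = [round(train - test, 1) for _, train, test in ablation]
# Gap reduction contributed by each added constraint.
reductions = [round(prev - cur, 1) for prev, cur in zip(gaps, gaps[1:])]

train_drop = round(ablation[0][1] - ablation[-1][1], 1)
test_gain = round(ablation[-1][2] - ablation[0][2], 1)

print(gaps)        # [13.5, 11.8, 9.9, 8.5, 6.7]
print(reductions)  # [1.7, 1.9, 1.4, 1.8]
print(train_drop, test_gain)  # 1.6 5.2
```

Each step narrows the gap by a similar amount (1.4 to 1.9 dB), and the full configuration trades 1.6 dB of training PSNR for a 5.2 dB gain on test views.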

Key Findings

Collapse-Recovery Dynamics: A consistent PSNR collapse-and-recovery pattern appears during training at approximately 20k iterations, which is attributed to the staged activation of the progressive constraints.
Generalization Gap Correlates Positively with Complexity: As the geometric complexity of the scene increases (Chair → Lego → Drums), the generalization gap grows from 4.7 → 6.7 → 7.2 dB.
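Overfitting is an Inherent Feature of Sparse-View Reconstruction: The theoretical analysis argues that the generalization gap has a structural component scaling as $O(\sqrt{|\theta|/N})$, which grows with the parameter count $|\theta|$ and shrinks as the amount of supervision $N$ increases.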
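As a rough reading of the $O(\sqrt{|\theta|/N})$ scaling, assume $N$ counts training views and the hidden constant and $|\theta|$ stay fixed (both are assumptions; the summary does not define $N$ precisely). The bound at two view counts $N_1$ and $N_2$ then compares as: $$\frac{\sqrt{|\theta|/N_2}}{\sqrt{|\theta|/N_1}} = \sqrt{\frac{N_1}{N_2}}, \qquad \text{e.g. } N_1 = 8,\; N_2 = 32:\quad \sqrt{\tfrac{8}{32}} = 0.5$$ Under these assumptions, quadrupling the number of views would only halve the bound. This is consistent with the claim that the gap is structural rather than something regularization alone can remove.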

Highlights & Insights

Deep theoretical analysis of the nature of overfitting in sparse-view reconstruction, casting overfitting as a structural property rather than an implementation flaw.
A compact design with only 0.67M parameters, suggesting that physical priors matter more than model scale in the sparse-view regime.
The discovery of collapse-recovery dynamics reveals the underlying mechanism of physical constraints in shaping the optimization landscape.
Insights for world model construction: physically consistent representation under limited observations remains an open problem.

Limitations & Future Work

The generalization gap remains at 5.7-6.2 dB, indicating that fixed-form physical constraints struggle to fully resolve the underdetermined nature of the problem.
Experiments are limited to the NeRF synthetic dataset, lacking validation on real-world scenes.
Future directions include: learnable adaptive physical constraints, multi-modal information fusion (semantics + geometry + temporal), and hierarchical scene decomposition.
Lack of comparison with more advanced 3DGS-based or diffusion-based methods.

Related Work & Insights

Belongs to the line of sparse-view NeRF work alongside RegNeRF, DietNeRF, and SparseNeRF, but places heavier emphasis on physical priors.
Borrows from the PINN paradigm of incorporating physical constraints into neural networks.
The collapse-recovery dynamics resemble the phase transitions studied in work on training loss landscapes.

Rating

Novelty: ⭐⭐⭐⭐ The physical constraints themselves are not entirely novel, but the theoretical analysis of overfitting behavior provides a fresh perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ Experiments are conducted only on the NeRF synthetic dataset (3 scenes), indicating a limited evaluation scale.
Writing Quality: ⭐⭐⭐⭐⭐ The theoretical analysis is clear and in-depth, with a complete structure.
Value: ⭐⭐⭐⭐ The theoretical insights are valuable, though the practical application scenarios remain limited.