ROGR: Relightable 3D Objects using Generative Relighting

Conference: NeurIPS 2025 · arXiv: 2510.03163 · Code: see project page · Area: 3D Vision / Relighting · Keywords: Relighting, Neural Radiance Field, Generative Relighting, Diffusion Model, Environment Lighting

TL;DR

This paper proposes ROGR, which leverages a multi-view diffusion relighting model to generate consistent images under multiple lighting conditions, trains a lighting-conditioned NeRF on the resulting dataset, and achieves feed-forward 3D object relighting under arbitrary environment lighting. ROGR attains state-of-the-art performance on the TensoIR and Stanford-ORB benchmarks while supporting interactive rendering.

Background & Motivation

Inserting real objects into novel environments and rendering them correctly under varying illumination is a classic problem in computer graphics. Current 3D relighting methods follow two main technical paradigms, each with notable shortcomings:

Inverse Rendering: Recovers material and lighting parameters by optimizing to explain observed images. Key issues include: (a) mismatches between physical light transport and simplified models make optimization fragile; (b) due to inherent ambiguities, recovered object properties frequently yield unrealistic appearances under novel lighting; (c) physically based Monte Carlo light transport simulation is computationally prohibitive for interactive applications.

Generative Relighting: Diffusion-based methods such as Neural Gaffer and DiLightNet can generate perceptually realistic single-image relighting results, but each image is processed independently, leading to multi-view inconsistencies. Although IllumiNeRF attempts to reconstruct inconsistent relit images into a NeRF, a separate 3D representation must be optimized for every new target lighting condition, precluding interactive use.

The paper's core idea: first use a multi-view diffusion model to generate a multi-view-consistent relit dataset, then train a lighting-conditioned NeRF on this data once; thereafter the NeRF renders under arbitrary lighting in a feed-forward manner. This is essentially a generalization of the Light Stage concept to generative models.

Method

Overall Architecture

  1. A multi-view relighting diffusion model relights \(N=64\) input views under \(M=111\) environment lighting conditions, producing an \(N \times M\) multi-illumination dataset.
  2. A lighting-conditioned NeRF is trained on this dataset.
  3. At inference, the NeRF accepts arbitrary environment lighting as input and produces relit outputs in a feed-forward manner.
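
As a concrete reading of this pipeline, here is a minimal sketch of the data-generation stage; `diffusion_relight` is a hypothetical stand-in for the multi-view relighting model, not an API from the paper's code.

```python
import numpy as np

N_VIEWS, M_LIGHTS = 64, 111  # N input views, M environment lighting conditions

def build_multi_illumination_dataset(views, env_maps, diffusion_relight):
    """Stage-1 data generation. `diffusion_relight` is a hypothetical stand-in
    for the multi-view relighting diffusion model: it denoises all N views
    jointly under one environment map, which is the source of the dataset's
    cross-view consistency.

    views:    (N, H, W, 3) input images
    env_maps: sequence of M HDR environment maps
    returns:  (M, N, H, W, 3) relit images for NeRF training
    """
    return np.stack([diffusion_relight(views, env) for env in env_maps])
```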

Key Designs

  1. Multi-View Relighting Diffusion Model:

    • Built on the CAT3D multi-view diffusion architecture with cross-view self-attention layers.
    • Environment lighting is encoded following Neural Gaffer: two separate latents for the HDR (log tone-mapped and normalized) and LDR (standard tone-mapped) representations; see the sketch after this list.
    • For each input view, the environment map is rotated into the corresponding camera frame and concatenated with the image latent and ray map before being fed into the diffusion network.
    • Key advantage: joint denoising across multiple views (64 views simultaneously), ensuring cross-view consistency in the generated relit images.
    • Trained on 128 TPU v5 chips with a total batch size of 128.
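
A minimal sketch of the two-latent lighting encoding; the exact tone-mapping operators below are plausible choices, not the paper's confirmed implementation.

```python
import numpy as np

def encode_environment(env_hdr):
    """Two-channel lighting encoding in the spirit of Neural Gaffer: an HDR
    branch (log tone-mapped, then normalized) preserving overall intensity,
    and an LDR branch (values clipped to [0, 1]) preserving detail in the
    well-exposed range. The concrete operators here are illustrative.

    env_hdr: (H, W, 3) linear HDR environment map
    """
    hdr = np.log1p(env_hdr)            # log tone-map: compress dynamic range
    hdr = hdr / max(hdr.max(), 1e-8)   # normalize to [0, 1]
    ldr = np.clip(env_hdr, 0.0, 1.0)   # standard LDR tone mapping
    return hdr, ldr                    # each encoded to its own latent downstream
```
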
  2. Lighting-Conditioned NeRF (Dual-Branch Architecture):

    • Built on NeRF-Casting; two types of lighting conditioning signals are designed:

General Conditioning:

  • Maps the full environment map to a 128-dimensional vector.
  • Uses a ViT-S/8-based Transformer encoder trained from scratch: \(f_\text{general} = W \cdot \text{ViT}(E)\).
  • Key distinction from NeRF-in-the-Wild: the embedding is a learnable mapping of the environment lighting rather than a per-image optimized code, enabling generalization to unseen lighting.
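
A PyTorch sketch of this branch; only the patch size (8) and the 128-d output come from the text, while the depth, width, and mean-pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GeneralLightingEncoder(nn.Module):
    """ViT-style encoder mapping an environment map E to f_general = W * ViT(E).
    Patch size 8 and the 128-d output follow the text; depth, width, and
    mean-pooling are illustrative. Positional embeddings omitted for brevity."""
    def __init__(self, dim=384, depth=4, heads=6, out_dim=128):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, out_dim)  # the learnable map W

    def forward(self, env):                  # env: (B, 3, H, W)
        tokens = self.patchify(env).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.proj(self.encoder(tokens).mean(dim=1))      # (B, 128)
```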

Specular Conditioning:

  • Queries the environment map value and its blurred versions along the reflection direction.
  • Pre-blurs the environment map with Gaussian kernels of varying widths \(\sigma_i\) to simulate reflections from materials of different roughness: \(f_i^\text{specular} = \int E(\omega')\, G(\omega'; \omega_r, \sigma_i)\, d\omega'\).
  • Conceptually a modernized implementation of Precomputed Radiance Transfer (PRT).
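
A sketch of the specular lookup, approximating the spherical kernel \(G\) with an image-space Gaussian on an equirectangular map; the blur widths, coordinate convention, and nearest-pixel lookup are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def specular_features(env, normal, view_dir, sigmas=(1.0, 4.0, 16.0)):
    """Query pre-blurred copies of the environment map along the mirror
    reflection direction. The image-space Gaussian stands in for the
    spherical kernel G and ignores equirectangular distortion; in practice
    the blurred pyramid would be precomputed once per environment map.

    env:      (H, W, 3) environment map (equirectangular, y-up convention)
    normal:   (3,) unit surface normal
    view_dir: (3,) unit direction from camera to surface point
    """
    # Mirror reflection of the incident direction about the normal.
    refl = view_dir - 2.0 * np.dot(view_dir, normal) * normal
    # Map the reflection direction to equirectangular pixel coordinates.
    theta = np.arccos(np.clip(refl[1], -1.0, 1.0))      # polar angle from +y
    phi = np.arctan2(refl[2], refl[0]) % (2.0 * np.pi)  # azimuth in [0, 2*pi)
    h, w = env.shape[:2]
    row = int(theta / np.pi * (h - 1))
    col = int(phi / (2.0 * np.pi) * (w - 1))
    # One RGB sample per blur level: wider sigma ~ rougher material.
    return np.concatenate(
        [gaussian_filter(env, sigma=(s, s, 0))[row, col] for s in sigmas])
```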

  3. Architecture Details (sketched in code below):

    • A geometry MLP outputs density, roughness, surface normals, and geometric features.
    • A color MLP receives 3D coordinates, view direction, general conditioning, and specular conditioning, and outputs RGB values.
    • Trained for 500k steps on 8 H100 GPUs.
    • Inference time of 0.384 seconds per frame, supporting interactive relighting.
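
A structural sketch of the two MLPs; layer widths, activations, and output ranges are illustrative, with only the listed inputs and outputs taken from the description above.

```python
import torch
import torch.nn as nn

class RelightableField(nn.Module):
    """Structural sketch of the dual-MLP design. Layer widths and activations
    are illustrative; only the inputs/outputs follow the description."""
    def __init__(self, feat_dim=64, gen_dim=128, spec_dim=9):
        super().__init__()
        # Geometry MLP: position -> density, roughness, normal, features.
        self.geometry = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 1 + 1 + 3 + feat_dim))
        # Color MLP: position, view dir, general + specular conditioning -> RGB.
        self.color = nn.Sequential(
            nn.Linear(3 + 3 + gen_dim + spec_dim, 256), nn.ReLU(),
            nn.Linear(256, 3))

    def forward(self, x, view_dir, f_general, f_specular):
        g = self.geometry(x)
        density   = torch.relu(g[..., 0])             # non-negative density
        roughness = torch.sigmoid(g[..., 1])          # in [0, 1]
        normal    = nn.functional.normalize(g[..., 2:5], dim=-1)
        features  = g[..., 5:]                        # consumed elsewhere in the full model
        rgb = torch.sigmoid(self.color(
            torch.cat([x, view_dir, f_general, f_specular], dim=-1)))
        return density, roughness, normal, features, rgb
```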

Loss & Training

  • Diffusion model: Training data rendered from 400k synthetic 3D objects (including 100k from Objaverse), with 64 views × 16 HDR lighting conditions per object.
  • Progressive training: 4-view diffusion → 16-view → 64-view.
  • NeRF: A multi-illumination dataset is generated using 111 environment lighting conditions, on which the lighting-conditioned NeRF is trained.

Key Experimental Results

TensoIR Benchmark (Synthetic Scenes)

| Method        | PSNR↑ | SSIM↑ | LPIPS↓ |
|---------------|-------|-------|--------|
| NeRFactor     | 23.38 | 0.908 | 0.131  |
| InvRender     | 23.97 | 0.901 | 0.101  |
| TensoIR       | 28.58 | 0.944 | 0.081  |
| Neural-PBIR   | 27.09 | 0.925 | 0.085  |
| NeRO          | 27.00 | 0.935 | 0.074  |
| R3DG          | 29.05 | 0.937 | 0.080  |
| Neural Gaffer | 27.30 | 0.918 | 0.122  |
| IllumiNeRF    | 29.71 | 0.947 | 0.072  |
| ROGR (Ours)   | 30.74 | 0.950 | 0.069  |

Stanford-ORB Benchmark (Real Scenes)

| Method       | PSNR-H↑ | PSNR-L↑ | SSIM↑ | LPIPS↓ |
|--------------|---------|---------|-------|--------|
| NVDiffRecMC† | 25.08   | 32.28   | 0.974 | 0.027  |
| Neural-PBIR  | 26.01   | 33.26   | 0.979 | 0.023  |
| IllumiNeRF   | 25.56   | 32.74   | 0.976 | 0.027  |
| R3DG         | 21.25   | 27.50   | 0.962 | 0.063  |
| ROGR (Ours)  | 26.21   | 32.91   | 0.980 | 0.027  |

Ablation Study (TensoIR Hotdog Scene)

| Configuration                           | PSNR↑ | SSIM↑ | LPIPS↓ | Note |
|-----------------------------------------|-------|-------|--------|------|
| Full model (64 views, 111 env. lights)  | 31.88 | 0.91  | 0.075  |      |
| (a) Specular conditioning w/o blurring  | 31.35 | 0.90  | 0.080  | Degraded on rough surfaces |
| (b) No specular conditioning            | 30.00 | 0.88  | 0.079  | Noticeable drop in highlight accuracy |
| (c) No general conditioning             | 21.59 | 0.70  | 0.130  | Severe artifacts |
| (d) Per-image embedding                 | 19.12 | 0.62  | 0.160  | Cannot generalize to novel lighting |
| (e) 128×128 environment map             | 29.26 | 0.89  | 0.082  | Loss of highlight detail |
| (f) 64×64 environment map               | 27.69 | 0.86  | 0.110  | Rendering artifacts |

Key Findings

  • ROGR surpasses IllumiNeRF (the previous SOTA) by 1.03 dB PSNR on TensoIR, with simultaneous improvements in SSIM and LPIPS.
  • Achieves the best PSNR-H and SSIM on Stanford-ORB; only Neural-PBIR's PSNR-L and LPIPS remain slightly ahead.
  • General conditioning is the critical component — removing it causes a dramatic performance collapse (31.88 → 21.59 PSNR).
  • The multi-scale blurring design for specular conditioning is important for materials of varying roughness.
  • Joint denoising of more views (64 > 16 > 4) significantly improves cross-view relighting consistency.
  • Rendering speed of 0.384 s/frame is comparable to Gaussian Splatting-based methods (R3DG: 0.415 s/frame).

Highlights & Insights

  • Combines the Light Stage concept with generative models: first generating multi-illumination data, then distilling it into an efficient feed-forward model — achieving "train once, relight anywhere."
  • The dual-branch lighting conditioning (general + specular) provides a clear division of responsibility: general conditioning handles global illumination effects, while specular conditioning captures high-frequency reflection details.
  • Compared to inverse rendering methods, this approach avoids the ambiguity inherent in material–lighting decomposition; compared to IllumiNeRF, it eliminates the cost of optimizing a separate NeRF for each lighting condition.
  • The multi-view diffusion model ensures cross-view consistency, which was the primary bottleneck of prior single-image relighting methods.
  • The ablation study on environment map resolution provides practical engineering guidance (512×512 is necessary).
  • This work elegantly combines the "linear superposition" philosophy of PRT (Precomputed Radiance Transfer) with modern generative models.
  • The key distinction from IllumiNeRF lies in generalization: ROGR trains a single universal lighting-conditioned NeRF, rather than a separate NeRF per lighting condition.
  • Insight: generative models can serve as a "virtual data factory," producing high-quality training data for downstream tasks.
  • NeRF-Casting's reflection modeling provides the foundation for the specular conditioning in this work, underscoring the importance of accurate reflection modeling in relighting.

Limitations & Future Work

  • Training data does not cover complex material effects such as subsurface scattering, refraction, or volumetric appearance.
  • The environment lighting model assumes light sources at infinity; near-field lighting is not supported.
  • The method handles only single-object scenes and has not been extended to large-scale scene relighting.
  • While substantially faster than inverse rendering methods, rendering speed still falls short of real-time Gaussian Splatting.
  • Although the diffusion-generated relit data is more consistent than that of single-image methods, minor cross-view inconsistencies may still persist.

Rating

  • Novelty: ⭐⭐⭐⭐ — The generative distillation relighting framework is novel, and the dual-branch conditioning design is clever.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers both synthetic and real benchmarks with comprehensive ablations, both quantitative and qualitative.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured; some technical details could be presented more concisely.
  • Value: ⭐⭐⭐⭐⭐ — Achieves new state-of-the-art on 3D relighting with practical application potential.