# ROGR: Relightable 3D Objects using Generative Relighting
Conference: NeurIPS 2025 · arXiv: 2510.03163 · Code: project page available · Area: 3D Vision / Relighting · Keywords: Relighting, Neural Radiance Field, Generative Relighting, Diffusion Model, Environment Lighting
## TL;DR
This paper proposes ROGR, which leverages a multi-view diffusion relighting model to generate consistent images under multiple lighting conditions, trains a lighting-conditioned NeRF on the resulting dataset, and achieves feed-forward 3D object relighting under arbitrary environment lighting. ROGR attains state-of-the-art performance on the TensoIR and Stanford-ORB benchmarks while supporting interactive rendering.
## Background & Motivation
Inserting real objects into novel environments and rendering them correctly under varying illumination is a classic problem in computer graphics. Current 3D relighting methods follow two main technical paradigms, each with notable shortcomings:
Inverse Rendering: Recovers material and lighting parameters by optimizing to explain observed images. Key issues include: (a) mismatches between physical light transport and simplified models make optimization fragile; (b) due to inherent ambiguities, recovered object properties frequently yield unrealistic appearances under novel lighting; (c) physically based Monte Carlo light transport simulation is computationally prohibitive for interactive applications.
Generative Relighting: Diffusion-based methods such as Neural Gaffer and DiLightNet can generate perceptually realistic single-image relighting results, but each image is processed independently, leading to multi-view inconsistencies. Although IllumiNeRF attempts to reconstruct inconsistent relit images into a NeRF, a separate 3D representation must be optimized for every new target lighting condition, precluding interactive use.
The paper's core idea: first use a multi-view diffusion model to generate a consistent relit dataset, then train a lighting-conditioned NeRF on this data once, so that rendering under arbitrary lighting becomes a single feed-forward pass. This is essentially a generalization of the Light Stage concept to generative models.
## Method
### Overall Architecture
- A multi-view relighting diffusion model relights \(N=64\) input views under \(M=111\) environment lighting conditions, producing an \(N \times M\) multi-illumination dataset.
- A lighting-conditioned NeRF is trained on this dataset.
- At inference, the NeRF accepts arbitrary environment lighting as input and produces relit outputs in a feed-forward manner.
### Key Designs
- Multi-View Relighting Diffusion Model:
    - Built on the CAT3D multi-view diffusion architecture with cross-view self-attention layers.
    - Environment lighting is encoded following Neural Gaffer: two separate latents for the HDR (log tone-mapped and normalized) and LDR (standard tone-mapped) representations (see the encoding sketch after this list).
    - For each input view, the environment map is rotated into the corresponding camera frame and concatenated with the image latent and ray map before being fed into the diffusion network.
    - Key advantage: joint denoising across multiple views (64 simultaneously) ensures cross-view consistency in the generated relit images.
    - Trained on 128 TPU v5 chips with a total batch size of 128.
- Lighting-Conditioned NeRF (Dual-Branch Architecture):
    - Built on NeRF-Casting; two types of lighting conditioning signals are designed (minimal sketches of both branches follow this list):
    - General Conditioning:
        - Maps the full environment map to a 128-dimensional vector.
        - Uses a ViT-S/8-based Transformer encoder trained from scratch: \(f_\text{general} = W \cdot \text{ViT}(E)\).
        - Key distinction from NeRF-in-the-Wild: the embedding is a learnable mapping of the environment lighting rather than a per-image optimized code, enabling generalization to unseen lighting.
    - Specular Conditioning:
        - Queries the environment map value and its blurred versions along the reflection direction.
        - Pre-blurs the environment map with Gaussian kernels of varying widths \(\sigma_i\) to simulate reflections from materials of different roughness: \(f_i^\text{specular} = \int E(\omega')\, G(\omega'; \omega_r, \sigma_i)\, d\omega'\)
        - Conceptually a modernized implementation of Precomputed Radiance Transfer (PRT).
- Architecture Details:
    - A geometry MLP outputs density, roughness, surface normals, and geometric features.
    - A color MLP receives 3D coordinates, view direction, general conditioning, and specular conditioning, and outputs RGB values.
    - Trained for 500k steps on 8 H100 GPUs.
    - Inference takes 0.384 seconds per frame, supporting interactive relighting.
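To make the lighting input concrete, here is a minimal sketch of the Neural Gaffer-style two-latent environment-map encoding used to condition the diffusion model. The paper only specifies a log tone-mapped, normalized HDR latent and a standard tone-mapped LDR latent; the gamma value, normalization scheme, and function name below are my assumptions.

```python
import numpy as np

def encode_env_map(env_hdr):
    """Two-latent environment-map encoding in the spirit of Neural Gaffer.

    env_hdr: (H, W, 3) linear HDR radiance map.
    Returns (ldr, hdr_log). Gamma and normalization constants are
    assumptions, not the authors' exact choices.
    """
    ldr = np.clip(env_hdr, 0.0, 1.0) ** (1.0 / 2.2)  # standard gamma tone-map
    log_hdr = np.log1p(env_hdr)                      # compress dynamic range
    hdr_log = log_hdr / max(log_hdr.max(), 1e-8)     # normalize to [0, 1]
    return ldr, hdr_log

# Toy usage: a random map with one bright emitter to exercise the HDR path.
env = np.random.rand(64, 128, 3).astype(np.float32)
env[10:14, 30:34] *= 50.0
ldr, hdr_log = encode_env_map(env)
print(ldr.shape, float(hdr_log.max()))  # (64, 128, 3) 1.0
```

In the full model, both latents (rotated into each view's camera frame) are concatenated with the image latent and ray map before denoising.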
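For the general branch, the paper maps the full environment map to a 128-dimensional vector with a ViT trained from scratch, \(f_\text{general} = W \cdot \text{ViT}(E)\). Below is a deliberately scaled-down PyTorch stand-in (shallower and at lower input resolution than a real ViT-S/8); the class name, mean-pooling over tokens, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class EnvMapEncoder(nn.Module):
    """Tiny ViT-style encoder: environment map -> 128-d conditioning vector.
    Depth and input size are reduced from ViT-S/8 for the sake of the sketch."""

    def __init__(self, img_size=64, patch=8, dim=384, depth=4, heads=6, out_dim=128):
        super().__init__()
        n_tokens = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, out_dim)  # the learned linear map W

    def forward(self, env):                  # env: (B, 3, H, W)
        x = self.patch_embed(env)            # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)     # (B, N, dim) patch tokens
        x = self.encoder(x + self.pos)       # global self-attention over patches
        return self.proj(x.mean(dim=1))      # pooled tokens -> f_general

f_general = EnvMapEncoder()(torch.randn(2, 3, 64, 64))
print(f_general.shape)  # torch.Size([2, 128])
```

Because the encoder is a learned function of \(E\) rather than a per-image code, any new environment map yields a valid embedding at test time, which is exactly what ablation row (d) probes.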
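The specular branch pre-filters the environment map and reads it back along the reflection direction \(\omega_r = 2(\mathbf{n} \cdot \mathbf{v})\,\mathbf{n} - \mathbf{v}\). A minimal numpy/scipy sketch, assuming an equirectangular (lat-long) map with a y-up convention, nearest-neighbor lookup, and arbitrary \(\sigma_i\) values, none of which are specified at this level of detail in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

SIGMAS = (1.0, 2.0, 4.0, 8.0)  # assumed blur widths (roughness levels)

def blur_stack(env):
    """Pre-blur the lat-long map with Gaussians of increasing width;
    'wrap' on the azimuth axis keeps the blur seamless across the seam."""
    return [gaussian_filter(env, sigma=(s, s, 0),
                            mode=("nearest", "wrap", "nearest")) for s in SIGMAS]

def reflect(n, v):
    """Reflection of the view direction v about the surface normal n."""
    return 2.0 * np.dot(n, v) * n - v

def query_equirect(env, w):
    """Nearest-neighbor lookup of unit direction w in a lat-long map (y-up)."""
    h, wd, _ = env.shape
    theta = np.arccos(np.clip(w[1], -1.0, 1.0))   # polar angle from +y
    phi = np.arctan2(w[2], w[0]) % (2 * np.pi)    # azimuth in [0, 2pi)
    row = min(int(theta / np.pi * h), h - 1)
    col = min(int(phi / (2 * np.pi) * wd), wd - 1)
    return env[row, col]

def specular_features(env, n, v):
    """f_i^specular for all i: one RGB sample per blur level along w_r."""
    w_r = reflect(n, v)
    return np.concatenate([query_equirect(e, w_r) for e in blur_stack(env)])

env = np.random.rand(64, 128, 3).astype(np.float32)
n = np.array([0.0, 1.0, 0.0])
v = np.array([0.0, 0.70710678, 0.70710678])
print(specular_features(env, n, v).shape)  # (12,) = 4 levels x RGB
```

Each blurred level approximates the integral \(\int E(\omega')\, G(\omega'; \omega_r, \sigma_i)\, d\omega'\) for one roughness scale, the same precomputation idea that PRT relies on.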
### Loss & Training
- Diffusion model: Training data rendered from 400k synthetic 3D objects (including 100k from Objaverse), with 64 views × 16 HDR lighting conditions per object.
- Progressive training: 4-view diffusion → 16-view → 64-view.
- NeRF: A multi-illumination dataset is generated using 111 environment lighting conditions, on which the lighting-conditioned NeRF is trained.
## Key Experimental Results
### TensoIR Benchmark (Synthetic Scenes)
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| NeRFactor | 23.38 | 0.908 | 0.131 |
| InvRender | 23.97 | 0.901 | 0.101 |
| TensoIR | 28.58 | 0.944 | 0.081 |
| Neural-PBIR | 27.09 | 0.925 | 0.085 |
| NeRO | 27.00 | 0.935 | 0.074 |
| R3DG | 29.05 | 0.937 | 0.080 |
| Neural Gaffer | 27.30 | 0.918 | 0.122 |
| IllumiNeRF | 29.71 | 0.947 | 0.072 |
| ROGR (Ours) | 30.74 | 0.950 | 0.069 |
### Stanford-ORB Benchmark (Real Scenes)
| Method | PSNR-H↑ | PSNR-L↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| NVDiffRecMC† | 25.08 | 32.28 | 0.974 | 0.027 |
| Neural-PBIR | 26.01 | 33.26 | 0.979 | 0.023 |
| IllumiNeRF | 25.56 | 32.74 | 0.976 | 0.027 |
| R3DG | 21.25 | 27.50 | 0.962 | 0.063 |
| ROGR (Ours) | 26.21 | 32.91 | 0.980 | 0.027 |
### Ablation Study (TensoIR Hotdog Scene)
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Note |
|---|---|---|---|---|
| Full model (64 views, 111 env. lights) | 31.88 | 0.91 | 0.075 | — |
| (a) Specular conditioning w/o blurring | 31.35 | 0.90 | 0.080 | Degraded on rough surfaces |
| (b) No specular conditioning | 30.00 | 0.88 | 0.079 | Noticeable drop in highlight accuracy |
| (c) No general conditioning | 21.59 | 0.70 | 0.130 | Severe artifacts |
| (d) Per-image embedding | 19.12 | 0.62 | 0.160 | Cannot generalize to novel lighting |
| (e) 128×128 environment map | 29.26 | 0.89 | 0.082 | Loss of highlight detail |
| (f) 64×64 environment map | 27.69 | 0.86 | 0.110 | Rendering artifacts |
### Key Findings
- ROGR surpasses IllumiNeRF (the previous SOTA) by 1.03 dB PSNR on TensoIR, with simultaneous improvements in SSIM and LPIPS.
- Achieves the best PSNR-H and SSIM on Stanford-ORB while remaining competitive on PSNR-L and LPIPS.
- General conditioning is the critical component — removing it causes a dramatic performance collapse (31.88 → 21.59 PSNR).
- The multi-scale blurring design for specular conditioning is important for materials of varying roughness.
- Joint denoising of more views (64 > 16 > 4) significantly improves cross-view relighting consistency.
- Rendering speed of 0.384 s/frame is comparable to Gaussian Splatting-based methods (R3DG: 0.415 s/frame).
## Highlights & Insights
- Combines the Light Stage concept with generative models: first generating multi-illumination data, then distilling it into an efficient feed-forward model — achieving "train once, relight anywhere."
- The dual-branch lighting conditioning (general + specular) provides a clear division of responsibility: general conditioning handles global illumination effects, while specular conditioning captures high-frequency reflection details.
- Compared to inverse rendering methods, this approach avoids the ambiguity inherent in material–lighting decomposition; compared to IllumiNeRF, it eliminates the cost of optimizing a separate NeRF for each lighting condition.
- The multi-view diffusion model ensures cross-view consistency, which was the primary bottleneck of prior single-image relighting methods.
- The ablation study on environment map resolution provides practical engineering guidance (512×512 is necessary).
## Limitations & Future Work
- Training data does not cover complex material effects such as subsurface scattering, refraction, or volumetric appearance.
- The environment lighting assumption places light sources at infinity; near-field lighting is not supported.
- The method handles only single-object scenes and has not been extended to large-scale scene relighting.
- While substantially faster than inverse rendering methods, rendering speed still falls short of real-time Gaussian Splatting.
- Although the diffusion-generated relit data is more consistent than single-image methods, minor cross-view inconsistencies may still persist.
## Related Work & Insights
- This work elegantly combines the "linear superposition" philosophy of PRT (Precomputed Radiance Transfer) with modern generative models.
- The key distinction from IllumiNeRF lies in generalization: ROGR trains a single universal lighting-conditioned NeRF, rather than a separate NeRF per lighting condition.
- Insight: generative models can serve as a "virtual data factory," producing high-quality training data for downstream tasks.
- NeRF-Casting's reflection modeling provides the foundation for the specular conditioning in this work, underscoring the importance of accurate reflection modeling in relighting.
## Rating
- Novelty: ⭐⭐⭐⭐ — The generative distillation relighting framework is novel, and the dual-branch conditioning design is clever.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers both synthetic and real benchmarks with comprehensive ablations, both quantitative and qualitative.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured; some technical details could be presented more concisely.
- Value: ⭐⭐⭐⭐⭐ — Achieves new state-of-the-art on 3D relighting with practical application potential.