# NeAR: Coupled Neural Asset–Renderer Stack
- Conference: CVPR 2026
- arXiv: 2511.18600
- Code: https://near-project.github.io/ (project page)
- Area: 3D Vision / Diffusion Models
- Keywords: Neural rendering, illumination homogenization, 3D Gaussian splatting, relighting, coupled asset–renderer design
## TL;DR
NeAR proposes jointly designing neural asset creation and neural rendering as a coupled stack. By introducing illumination-homogenized structured 3D latents (LH-SLAT) to remove baked lighting from input images, and employing an illumination-aware neural decoder for real-time synthesis of relightable 3D Gaussian fields, NeAR surpasses existing methods across four tasks: forward rendering, reconstruction, relighting, and novel-view relighting.
## Background & Motivation
- Background: Neural graphics currently follows two independent tracks—neural asset creation (synthesizing 3D assets via generative models) and neural rendering (mapping assets to images). These are typically developed in isolation: asset generation assumes a fixed renderer, while renderers are trained for static asset distributions.
- Limitations of Prior Work: (a) PBR decomposition-based methods (e.g., Hunyuan3D) are prone to material misclassification (e.g., identifying wood as metal); decomposition errors are amplified through nonlinear rendering pipelines, leading to baked shadows and illumination inconsistencies. (b) Diffusion-based 2D relighting methods (e.g., DiLightNet, IC-Light) lack 3D consistency and incur high computational cost. (c) Existing SLAT representations blindly encode appearance including shadows and specular highlights, making them unsuitable for relighting.
- Key Challenge: Shadows, highlights, and inter-reflections are inherently entangled with geometry and material in 3D assets. Explicit PBR inverse decomposition is fragile in real-world scenarios, while fully black-box neural generation lacks controllability.
- Goal: Achieve high-quality, controllable, relightable 3D generation from a single image without relying on unstable PBR inversion.
- Key Insight: The authors propose a homogenize-then-synthesize strategy—co-designing the asset representation and rendering process so that they form a robust "contract" through a shared illumination-homogenized latent space.
- Core Idea: Jointly design illumination-homogenized 3D latent representations and an illumination-aware neural renderer, forming a coupled asset–renderer stack that replaces the conventional paradigm of decoupled asset generation and independent rendering.
## Method

### Overall Architecture
NeAR proceeds in two stages. Stage 1 fine-tunes a rectified-flow model with LoRA to lift a single input image under arbitrary illumination into an illumination-homogenized SLAT (LH-SLAT), removing baked shadows and unstable highlights. Stage 2 uses a feed-forward decoder to synthesize, from the LH-SLAT, a relightable 3D Gaussian splatting field conditioned on target illumination and viewpoint. The entire pipeline requires no per-object optimization and supports real-time inference.
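To ground the two-stage description, here is a minimal PyTorch-style inference sketch. All component names (`slat_flow`, `lora_homogenizer`, `gs_decoder`) and their interfaces are hypothetical stand-ins for what the paper describes, not the released API:

```python
import torch

@torch.no_grad()
def near_infer(image, env_map, camera, slat_flow, lora_homogenizer, gs_decoder):
    """Two-stage NeAR inference (hypothetical interfaces).

    image:   single input photo under arbitrary, unknown illumination
    env_map: target HDR environment map for relighting
    camera:  target viewpoint for novel-view synthesis
    """
    # Stage 1: lift the image to a shaded SLAT, then remove baked lighting
    # with the LoRA-adapted rectified-flow model to obtain the LH-SLAT.
    z_shaded = slat_flow(image)               # appearance with baked-in lighting
    z_lh = lora_homogenizer(z_shaded, image)  # canonical white ambient light

    # Stage 2: feed-forward decode into a relightable 3D Gaussian field,
    # conditioned on target illumination and viewpoint; no per-object optimization.
    gaussians = gs_decoder(z_lh, env_map=env_map, camera=camera)
    return gaussians.rasterize(camera)        # HDR image at the target view
```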
### Key Designs
- Illumination-Homogenized Structured 3D Latent (LH-SLAT):
  - Function: Maps inputs under arbitrary illumination to a canonical illumination space, yielding an illumination-invariant asset representation.
  - Mechanism: Defines homogeneous illumination \(E_h\) as white uniform ambient light. A pretrained SLAT flow model \(f_s\) first generates a shaded SLAT \(Z_s\) from the input image; a LoRA-adapted model \(f_\theta\) then guides \(Z_s\) toward an illumination-homogenized \(Z_{\text{lh}} = f_\theta(Z_s, I_{\text{in}})\) in sparse voxel space (see the sketch after this item). Ground-truth LH-SLATs are obtained by rendering 3D assets under homogeneous illumination, with inputs rendered under random illumination. For highly reflective materials, an optional Base Color SLAT \(Z_{\text{bc}}\) is extracted and concatenated.
  - Design Motivation: Direct PBR decomposition is an ill-posed problem. Learning illumination removal in the latent space via flow models avoids the instability of explicit material decomposition. LH-SLAT retains essential geometry–material–illumination interaction information, providing a stable foundation for downstream rendering.
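To make "guides \(Z_s\) toward \(Z_{\text{lh}}\)" concrete, below is a minimal Euler sampler for the homogenization flow, assuming the LoRA-adapted model exposes a velocity network `v_theta(z, z_s, image, t)` consistent with the Stage-1 loss given later; the step count and conditioning interface are illustrative:

```python
import torch

def homogenize_slat(v_theta, z_s, image, num_steps=25):
    """Integrate a rectified flow from noise (t=1) to the illumination-
    homogenized SLAT (t=0), conditioned on the shaded SLAT and input image.

    v_theta: velocity network v(z, z_s, image, t) -> dz/dt (LoRA-adapted)
    z_s:     shaded SLAT from the pretrained SLAT flow model
    """
    z = torch.randn_like(z_s)                  # noise endpoint at t = 1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = v_theta(z, z_s, image, t_cur)      # predicted velocity field
        z = z + (t_next - t_cur) * v           # Euler step toward the data end
    return z                                   # approximate LH-SLAT z_lh
```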
- Illumination Tokenizer:
  - Function: Encodes HDR environment maps into compact illumination condition tokens.
  - Mechanism: Decomposes the environment map into an LDR tone-mapped image \(\mathbf{E}_{\text{ldr}}\), a normalized log-intensity map \(\mathbf{E}_{\text{log}}\), and a camera-space direction encoding \(\mathbf{E}_{\text{dir}}\). ConvNeXt extracts visual features; NeRF positional encoding processes \(\mathbf{E}_{\text{dir}}\); spatial cross-attention fuses directional information with visual features; self-attention then produces illumination condition tokens \(C_L \in \mathbb{R}^{4096 \times 768}\) (see the sketch after this item).
  - Design Motivation: Compared to compressing the entire environment map with a VAE, explicitly embedding directional information makes illumination direction editable when switching viewpoints.
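A compact structural sketch of the tokenizer under stated assumptions: a small conv stem stands in for ConvNeXt, and all dimensions (including the resulting token count) are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class IlluminationTokenizer(nn.Module):
    """Sketch: HDR environment map -> illumination condition tokens."""
    def __init__(self, dim=768, n_freq=8):
        super().__init__()
        # Stand-in for ConvNeXt; consumes LDR (3) + log-intensity (1) channels.
        self.visual = nn.Sequential(
            nn.Conv2d(4, dim, kernel_size=16, stride=16), nn.GELU())
        self.dir_proj = nn.Linear(3 * 2 * n_freq, dim)  # encoded directions
        self.cross = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.n_freq = n_freq

    def posenc(self, d):
        # NeRF-style positional encoding of per-pixel directions.
        freqs = 2.0 ** torch.arange(self.n_freq, device=d.device)
        ang = d[..., None] * freqs              # (..., 3, n_freq)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

    def forward(self, e_ldr, e_log, e_dir):
        # e_ldr: (B,3,H,W) tone-mapped map; e_log: (B,1,H,W) log intensity;
        # e_dir: (B,H,W,3) camera-space direction per pixel.
        feat = self.visual(torch.cat([e_ldr, e_log], dim=1))   # (B,D,h,w)
        tokens = feat.flatten(2).transpose(1, 2)               # (B,hw,D)
        B, _, h, w = feat.shape
        # Downsample directions to the token grid, then encode and project.
        dirs = nn.functional.interpolate(
            e_dir.permute(0, 3, 1, 2), size=(h, w)).permute(0, 2, 3, 1)
        d_tok = self.dir_proj(self.posenc(dirs).flatten(1, 2))  # (B,hw,D)
        # Fuse directional info into visual features, then self-attend.
        fused, _ = self.cross(tokens, d_tok, d_tok)
        c_l, _ = self.self_attn(fused, fused, fused)
        return c_l                                              # (B, hw, 768)
```

Because the directions are encoded in camera space and fused explicitly, switching the viewpoint changes \(\mathbf{E}_{\text{dir}}\) and hence the tokens, which is what keeps the illumination direction editable.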
- Intrinsic-Aware Decoder (IAD) + Lighting-Aware Decoder (LAD):
  - Function: IAD decodes view-independent, illumination-invariant intrinsic features; LAD injects illumination and viewpoint conditions to produce the final illumination-dependent features.
  - Mechanism: IAD processes the LH-SLAT with Transformer and shifted-window attention, augmented by 16 register tokens that capture global context via global cross-attention, outputting intrinsic features \(\boldsymbol{h}\). LAD first computes view embeddings (NeRF positional encoding of ray distance and ray direction) and adds them to \(\boldsymbol{h}\) to obtain \(\boldsymbol{h}^v\); stacked cross-attention blocks then inject illumination conditions \(C_L\) to produce illumination-aware features \(\boldsymbol{h}^e\) (see the sketch after this item).
  - Design Motivation: Separating decoding into illumination-independent and illumination-dependent stages ensures stable intrinsic attributes while enabling flexible illumination control. Replacing spherical harmonics with explicit viewpoint injection better models view-dependent specular highlights.
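The ordering that the ablations later highlight—inject viewpoint first, bake illumination second—can be sketched as follows. The IAD is abstracted to its output \(\boldsymbol{h}\); the layer count follows the paper's best trade-off (6), while widths are assumptions:

```python
import torch
import torch.nn as nn

class LightingAwareDecoder(nn.Module):
    """Sketch of the LAD stage: viewpoint injection, then illumination."""
    def __init__(self, dim=768, layers=6):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(layers)])

    def forward(self, h, view_emb, c_l):
        # h:        (B,N,D) intrinsic features from the IAD (view/light independent)
        # view_emb: (B,N,D) NeRF-encoded ray distance + direction embeddings
        # c_l:      (B,M,D) illumination condition tokens from the tokenizer
        h_v = h + view_emb                     # viewpoint injected first ...
        for attn, norm in zip(self.blocks, self.norms):
            q = norm(h_v)
            out, _ = attn(q, c_l, c_l)         # ... then illumination baked in
            h_v = h_v + out                    # residual cross-attention
        return h_v                             # illumination-aware features h^e
```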
- Neural 3D Gaussian Splatting:
  - Function: Regresses 3DGS parameters from the intrinsic and illumination-aware features.
  - Mechanism: Intrinsic features \(\boldsymbol{h}\) are decoded into illumination-independent parameters including position offsets, base color, roughness, metalness, scale, rotation, and opacity; illumination features \(\boldsymbol{h}^e\) are decoded into 48-dimensional color features, an illumination scale, and a shadow term. A shallow MLP combines positional-encoded normals with the color features to predict radiance values; differentiable rasterization yields HDR predicted images (see the sketch after this item).
  - Design Motivation: Partitioning Gaussian parameters into intrinsic and illumination-related groups achieves a disentangled material–illumination representation.
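A sketch of the two parameter groups, assuming plain linear heads. The 48-d color feature and the scalar illumination scale and shadow follow the counts listed above; the intrinsic channel layout (quaternion rotation, etc.) and the normal-encoding width are hypothetical:

```python
import torch
import torch.nn as nn

class GaussianHeads(nn.Module):
    """Sketch of the 3DGS parameter heads: intrinsic vs. illumination groups."""
    def __init__(self, dim=768):
        super().__init__()
        # Intrinsic (lighting-independent), assumed layout: offset(3) +
        # base color(3) + roughness(1) + metalness(1) + scale(3) +
        # rotation quaternion(4) + opacity(1) = 16 channels.
        self.intrinsic_head = nn.Linear(dim, 16)
        # Illumination-dependent: 48-d color feature + scale(1) + shadow(1).
        self.lighting_head = nn.Linear(dim, 50)
        # Shallow MLP: encoded normal (24-d assumed) + color feature -> radiance.
        self.radiance_mlp = nn.Sequential(
            nn.Linear(48 + 24, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, h, h_e, normal_enc):
        intr = self.intrinsic_head(h)                     # geometry + material
        light = self.lighting_head(h_e)
        color_feat, scale, shadow = light.split([48, 1, 1], dim=-1)
        radiance = self.radiance_mlp(
            torch.cat([color_feat, normal_enc], dim=-1))  # per-Gaussian radiance
        # Differentiable rasterization of the assembled Gaussians (omitted
        # here) produces the final HDR image.
        return intr, radiance, scale, shadow
```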
### Loss & Training
Stage 1: Conditional flow matching loss \(\mathcal{L}_{\text{stage1}} = \mathbb{E}\|\boldsymbol{v}_\theta(\boldsymbol{z}, Z_s, I_{\text{in}}, t) - (\boldsymbol{\epsilon} - \boldsymbol{z}_0)\|_2^2\).
Stage 2: HDR reconstruction loss (log-space L1 + LPIPS + D-SSIM) \(\mathcal{L}_{\text{hdr}}\) + PBR material auxiliary supervision \(\mathcal{L}_{\text{pbr}}\) + shadow supervision \(\mathcal{L}_{\text{shadow}}\).
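A minimal sketch of the Stage-1 objective and the log-space L1 term of \(\mathcal{L}_{\text{hdr}}\); the linear interpolation path matches the formula above, while LPIPS, D-SSIM, \(\mathcal{L}_{\text{pbr}}\), and \(\mathcal{L}_{\text{shadow}}\) are omitted:

```python
import torch

def stage1_cfm_loss(v_theta, z0_lh, z_s, image):
    """Conditional flow-matching loss from the Stage-1 formula above.

    z0_lh: ground-truth LH-SLAT (asset rendered under homogeneous light)
    z_s:   shaded-SLAT condition; image: input under random illumination
    """
    eps = torch.randn_like(z0_lh)                        # noise endpoint
    t = torch.rand(z0_lh.shape[0], device=z0_lh.device)  # per-sample time
    t_ = t.view(-1, *([1] * (z0_lh.dim() - 1)))          # broadcast shape
    z_t = (1 - t_) * z0_lh + t_ * eps                    # linear path z_t
    target = eps - z0_lh                                 # rectified-flow velocity
    return ((v_theta(z_t, z_s, image, t) - target) ** 2).mean()

def hdr_log_l1(pred_hdr, gt_hdr, eps=1e-4):
    """Log-space L1 term of the Stage-2 HDR reconstruction loss."""
    return (torch.log(pred_hdr + eps) - torch.log(gt_hdr + eps)).abs().mean()
```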
Training data: 87K 3D assets with PBR textures + 2K 4K-resolution HDR environment maps, rendered with Blender EEVEE Next.
## Key Experimental Results

### Main Results (PSNR↑ comparison across four tasks)
| Task | Method | ADT | DTC | Objaverse | Glossy Syn. |
|---|---|---|---|---|---|
| G-buffer forward rendering | DiffusionRenderer | 24.41 | 27.16 | 27.09 | 25.46 |
| G-buffer forward rendering | NeAR | 29.15 | 31.59 | 32.23 | 30.47 |
| Random illumination reconstruction | DiLightNet | 21.11 | 23.53 | 25.65 | 24.09 |
| Random illumination reconstruction | NeAR | 22.89 | 24.68 | 26.53 | 25.32 |
| Unknown illumination relighting | DiffusionRenderer | 21.91 | 22.99 | 23.75 | 22.13 |
| Unknown illumination relighting | NeAR | 21.95 | 23.47 | 24.38 | 22.61 |
| Novel-view relighting | Hunyuan3D-2.1 | 22.30 | 24.89 | 25.47 | 22.26 |
| Novel-view relighting | NeAR | 22.87 | 25.53 | 25.97 | 22.94 |
### Ablation Study
| Input SLAT Type | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Shaded SLAT | 28.95 | 0.9281 | 0.0813 |
| Base Color SLAT | 30.38 | 0.9541 | 0.0564 |
| LH-SLAT | 32.02 | 0.9631 | 0.0494 |
| LH + Base Color | 32.54 | 0.9649 | 0.0442 |

| LAD Layers (IAD fixed at 12 layers) | PSNR↑ | FPS↑ |
|---|---|---|
| 1 layer | 31.56 | 48 |
| 3 layers | 32.35 | 38 |
| 6 layers | 32.54 | 30 |
| 9 layers | 32.56 | 23 |
### Key Findings
- LH-SLAT surpasses Shaded SLAT by over 3 dB PSNR, confirming the necessity of illumination homogenization.
- Combining LH-SLAT with Base Color SLAT yields the best results; Base Color SLAT supplements information for highly reflective materials.
- LAD with 6 layers provides the optimal performance–speed trade-off (PSNR 32.54, 30 FPS).
- Injecting viewpoint information before baking illumination (architectures c+d+g in Figure 9) significantly outperforms the reverse order.
- HY3D-2.1 incorrectly classifies wood as metal (erroneous metalness maps), whereas NeAR's LH-SLAT correctly recovers material properties.
## Highlights & Insights
- Coupled asset–renderer design paradigm: Rather than treating asset generation and rendering as independent components, NeAR establishes a "contract" between them through a shared latent space—a paradigm with broad implications for the design of neural graphics stacks.
- Illumination homogenization strategy: By learning illumination removal in latent space rather than performing explicit PBR decomposition, the method elegantly circumvents the ill-posedness of inverse rendering.
- Texture style transfer application: LH-SLAT accepts stylized image inputs, enabling semantically consistent style transfer combined with photorealistic relighting, demonstrating the flexibility of the representation.
## Limitations & Future Work
- The method still struggles with transparent objects (e.g., helmets); the neural renderer partially mitigates but does not fully resolve this issue.
- Training requires multi-illumination rendering of large quantities of 3D assets, resulting in high data preparation costs.
- Evaluation is limited to static objects; dynamic scenes and human bodies remain unexplored.
- The real-time performance of 30 FPS may still be insufficient for certain applications.
## Related Work & Insights
- vs. Trellis: Trellis uses SLAT but blindly encodes illumination; NeAR proposes LH-SLAT to explicitly remove illumination, yielding a more stable representation suited for relighting.
- vs. DiffusionRenderer: DiffusionRenderer performs 2D rendering from G-buffers and lacks 3D structural information; NeAR's 3D Gaussian field achieves greater accuracy in shadow and specular detail.
- vs. HY3D-2.1: HY3D's decoupled PBR decomposition leads to material estimation errors (e.g., incorrect metalness), whereas NeAR avoids the fragility of explicit decomposition.
## Rating
- Novelty: ⭐⭐⭐⭐ The coupled asset–renderer design concept is original, and the LH-SLAT representation is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four sub-tasks, four datasets, multi-method comparisons, ablation studies, and application demonstrations are all comprehensive.
- Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with thorough comparative analysis.
- Value: ⭐⭐⭐⭐ Offers important inspiration for design paradigms in neural rendering and 3D generation.