InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction¶
Conference: CVPR 2026 arXiv: 2603.11298 Code: Unavailable (to be released after review) Area: 3D Reconstruction / High Dynamic Range Imaging Keywords: HDR novel view synthesis, feed-forward 3D reconstruction, 3D Gaussian splatting, multi-exposure fusion, tone mapping meta-network
TL;DR¶
This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It introduces a geometry-guided appearance modeling module to resolve appearance inconsistencies in multi-exposure fusion, and employs a MetaNet to predict scene-specific tone mapping parameters for generalization. The method reconstructs HDR 3D Gaussian scenes in seconds from uncalibrated multi-exposure LDR images, achieving +2.90 dB PSNR over GaussianHDR under sparse 4-view settings at approximately 700× faster speed.
Background & Motivation¶
Background: HDR novel view synthesis (HDR-NVS) aims to reconstruct HDR scenes from multi-exposure LDR images and render novel views at arbitrary exposures. Existing optimization-based methods (HDR-GS, GaussianHDR) already produce high-quality results.
Limitations of Prior Work:
- Optimization-based methods rely heavily on known camera poses, SfM-initialized dense point clouds, and per-scene optimization (GaussianHDR requires ~30 minutes per scene), limiting practical deployment.
- Multi-exposure inputs cause appearance inconsistencies, leading to SfM point cloud collapse and complete failure of optimization-based methods under sparse-view settings.
- Feed-forward 3D models (e.g., AnySplat) assume appearance consistency; directly applying them to multi-exposure inputs produces severe ghosting artifacts (e.g., the same white wall exhibits drastically different brightness across exposures).
- Different cameras apply different tone curves (AgX/Filmic/Standard), making it difficult to learn a unified tone mapping.
- Publicly available HDR datasets are extremely scarce (HDR-NeRF contains only 12 scenes), insufficient to support feed-forward model pretraining.
Key Challenge: reconciling the speed advantage of the feed-forward paradigm with the exposure inconsistency, CRF diversity, and data scarcity inherent to HDR scenes.
Goal: rapidly reconstruct high-quality HDR 3D scenes from uncalibrated, exposure-inconsistent multi-view LDR images, without any per-scene optimization.
Key Insight: Decouple geometry and appearance via a frozen geometry backbone and a trainable appearance branch; reuse intermediate-layer attention maps from the geometry encoder to guide cross-view fusion; employ a meta-network to predict CRF parameters for single-forward adaptation to diverse cameras.
Core Idea: Geometry-guided appearance modeling addresses exposure-inconsistent fusion; meta-network-predicted tone mapping enables generalization → single-forward HDR reconstruction.
Method¶
Overall Architecture¶
Given \(V\) uncalibrated multi-exposure LDR images \(\{I_v, \ell_v\}\), the dual-branch architecture proceeds as follows: ① The geometry branch (frozen pretrained alternating-attention Transformer from VGGT/AnySplat) estimates depth \(D_v\) and pose \(p_v\); ② The appearance branch applies exposure normalization \(F_E\) → geometry-guided cross-view attention fusion \(F_A\) → DoG high-resolution upsampling \(F_U\) → Gaussian head \(F_G\) merging both branches to produce HDR 3D Gaussians; ③ MetaNet \(F_M\) predicts tone mapping parameters \(\theta\) → LDR rendering at arbitrary exposure \(\ell\).
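The dual-branch flow above can be sketched end to end. Everything below is an illustrative stub under stated assumptions: the function names, shapes, and placeholder geometry outputs are mine, not the authors' code; only the overall dataflow (frozen geometry branch, exposure-normalized appearance branch, Gaussian head) follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometry_branch(images):
    """Frozen geometry backbone: per-view depth and pose (stubbed here)."""
    V, H, W, _ = images.shape
    depth = np.ones((V, H, W))             # placeholder depth maps D_v
    poses = np.tile(np.eye(4), (V, 1, 1))  # placeholder camera poses p_v
    return depth, poses

def appearance_branch(images, log_exposures):
    """Exposure-normalized irradiance features (simplified stand-in for F_E)."""
    rel = log_exposures - log_exposures.mean()  # relative log-exposure
    # undo per-view exposure in log-irradiance space
    return np.log(np.clip(images, 1e-6, None)) - rel[:, None, None, None]

def gaussian_head(features, depth, poses):
    """Merge both branches into pixel-aligned HDR Gaussians (stubbed)."""
    V, H, W, C = features.shape
    return {"means": np.zeros((V * H * W, 3)),
            "hdr_color": features.reshape(-1, C)}

images = rng.random((4, 8, 8, 3))             # 4 multi-exposure LDR views
log_exposures = np.array([-2.0, -1.0, 1.0, 2.0])

depth, poses = geometry_branch(images)
feats = appearance_branch(images, log_exposures)
gaussians = gaussian_head(feats, depth, poses)
print(gaussians["hdr_color"].shape)           # one HDR color per pixel
```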
Key Designs¶
- Geometry-guided Appearance Modeling
A three-stage pipeline:
- (a) Exposure Normalization \(F_E\): Computes relative log-exposure \(\tilde{\ell}_v = \ell_v - \bar{\ell}\), encodes it as a \(d\)-dimensional embedding \(e_v\) via sinusoidal positional encoding, and generates per-view affine parameters \((\gamma_v, \beta_v) = \text{FiLM}(\mathbf{e}_v, \bar{a}_v, \bar{a})\) via FiLM layers to modulate appearance tokens → aligning all views to a common irradiance level.
- (b) Geometry-guided Cross-view Attention \(F_A\): A key observation is that the \(Q\) and \(K\) matrices at layer 14 of the frozen geometry encoder already encode reliable cross-view geometric correspondences, even under exposure differences ranging from 0.5s to 32s. These \(Q\), \(K\) matrices are directly reused to guide cross-view fusion of appearance features: \(\tilde{t}_v^A = \text{softmax}(QK^\top/\sqrt{d}) \hat{t}_v^A\) → zero additional computational overhead.
- (c) DoG High-resolution Upsampling \(F_U\): Patch-level features lose high-frequency texture details. A shallow CNN extracts full-resolution features \(g_v\); high-frequency residuals \((g_v - g_{v\downarrow\uparrow})\) are added to the upsampled irradiance features → recovering pixel-level texture.
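The three stages can be condensed into a minimal numpy sketch. Toy shapes and hand-written FiLM coefficients stand in for the learned layers; the frozen geometry `Q`/`K` are random stand-ins, and all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, d = 3, 16, 8  # views, tokens per view, feature dim

# (a) Exposure normalization: FiLM-style modulation from relative log-exposure.
log_exp = np.array([-2.0, 0.0, 2.0])
rel = log_exp - log_exp.mean()                 # relative log-exposure per view
gamma = 1.0 + 0.1 * rel                        # toy FiLM scale
beta = -0.1 * rel                              # toy FiLM shift
tokens = rng.standard_normal((V, N, d))
tokens = gamma[:, None, None] * tokens + beta[:, None, None]

# (b) Geometry-guided cross-view attention: reuse the frozen geometry
# encoder's Q, K over the joint token set; only the appearance values are new.
Q = rng.standard_normal((V * N, d))            # stand-in for frozen layer-14 Q
K = rng.standard_normal((V * N, d))            # stand-in for frozen layer-14 K
scores = Q @ K.T / np.sqrt(d)
scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = scores / scores.sum(axis=-1, keepdims=True)   # row-wise softmax
fused = (A @ tokens.reshape(V * N, d)).reshape(V, N, d)

# (c) DoG-style upsampling: re-inject the high-frequency residual that
# down-then-up resampling of the full-resolution features discards.
g = rng.standard_normal((V, 32))               # full-resolution features
g_down_up = g.reshape(V, 16, 2).mean(-1).repeat(2, axis=-1)
residual = g - g_down_up                       # recovered high-freq detail
print(fused.shape, residual.shape)
```

Note that step (b) never recomputes attention: the softmax weights come entirely from the frozen geometry branch, which is why the paper can claim zero additional overhead for correspondence estimation.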
- Tone Mapping MetaNet
- The tone mapper \(g_\theta\) is a two-layer MLP (\(3 \to h \to 3\), ReLU + sigmoid) mapping log-irradiance to \([0,1]\) LDR values.
- Unlike optimization-based methods that overfit a separate MLP per scene, MetaNet predicts all weights and biases of \(g_\theta\) from scene context (LDR features \(g_v\) + exposure embeddings \(e_v\) + predicted HDR Gaussians \(G\)).
- The concatenated inputs are encoded via strided convolutions followed by global pooling → scene-level descriptor \(\theta \in \mathbb{R}^{d_\theta}\).
- Supports single-forward adaptation to diverse camera tone curves (AgX/Filmic/Standard) without per-scene optimization.
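The hypernetwork idea can be illustrated as follows. The descriptor size, the linear MetaNet, and all variable names are assumptions; only the \(3 \to h \to 3\) ReLU + sigmoid tone-mapper structure follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8
d_theta = 3 * h + h + h * 3 + 3        # W1, b1, W2, b2 flattened

def metanet(scene_descriptor):
    """Map a pooled scene descriptor to tone-mapper parameters (toy linear)."""
    W = rng.standard_normal((d_theta, scene_descriptor.size)) * 0.1
    return W @ scene_descriptor

def tone_map(theta, log_irradiance):
    """g_theta: two-layer MLP, ReLU then sigmoid, log-irradiance -> [0,1] LDR."""
    i = 0
    W1 = theta[i:i + 3 * h].reshape(h, 3); i += 3 * h
    b1 = theta[i:i + h]; i += h
    W2 = theta[i:i + h * 3].reshape(3, h); i += h * 3
    b2 = theta[i:i + 3]
    hid = np.maximum(0.0, log_irradiance @ W1.T + b1)   # ReLU
    return 1.0 / (1.0 + np.exp(-(hid @ W2.T + b2)))     # sigmoid -> [0,1]

descriptor = rng.standard_normal(16)   # pooled LDR + exposure + Gaussian cues
theta = metanet(descriptor)
ldr = tone_map(theta, rng.standard_normal((5, 3)))
print(ldr.min() >= 0.0 and ldr.max() <= 1.0)  # prints True
```

The key contrast with per-scene methods is that `tone_map` itself has no trained parameters: every weight is emitted by `metanet` in the same forward pass, so a new camera's CRF needs no optimization.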
- HDR-Pretrain Dataset
- 168 Blender-rendered indoor scenes based on HSSD open-source indoor assets.
- Per scene: \(5 \times 7\) view grids (2.5°/5° steps), 5-level exposure bracketing, 32-bit HDR ground truth, depth and normal maps.
- Three tone mapping operators (AgX/Filmic/Standard) applied randomly to increase CRF diversity.
- Resolution: \(448 \times 448\); rendered with Cycles path tracing.
- Fills a community gap in feed-forward HDR pretraining data (the largest prior HDR dataset contained only 16 scenes).
Loss & Training¶
- \(\mathcal{L} = \mathcal{L}_{\text{RGB}} + \lambda_g \mathcal{L}_g\), where \(\mathcal{L}_{\text{RGB}} = \text{MSE}(I_v, L_v(\ell_v)) + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}}\)
- \(\mathcal{L}_g\) is a depth consistency loss supervised only on the top 30% confidence pixels — avoiding unreliable regions such as reflective surfaces and sky.
- The geometry encoder and decoder head are fully frozen; only the appearance branch, Gaussian head, and MetaNet are trained.
- AdamW + cosine LR, peak \(2 \times 10^{-4}\), 1K warmup steps, 30K iterations, bf16 precision; trained on \(8 \times\) A6000 GPUs for approximately 2 days.
- \(\lambda_{\text{perc}} = 0.05\), \(\lambda_g = 0.1\); 2–10 context views sampled per iteration.
- Post-optimization: low-opacity Gaussians (\(\sigma < 0.01\)) are pruned, followed by 1K iterations of joint MSE + SSIM optimization.
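The loss pieces above can be sketched with toy data. The perceptual term is omitted, and the tensors and quantile-based confidence mask are illustrative assumptions; only \(\lambda_g = 0.1\), the top-30% rule, and the 0.01 opacity threshold come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_depth_loss(pred, gt, conf, keep=0.30):
    """Depth consistency on only the most confident `keep` fraction of pixels."""
    thresh = np.quantile(conf, 1.0 - keep)
    mask = conf >= thresh                      # skips unreliable regions
    return np.mean((pred[mask] - gt[mask]) ** 2)

I_v = rng.random((8, 8, 3))                    # ground-truth LDR view
L_v = np.clip(I_v + 0.05 * rng.standard_normal(I_v.shape), 0, 1)
l_rgb = np.mean((I_v - L_v) ** 2)              # MSE term of L_RGB

depth_pred = rng.random((8, 8))
depth_gt = depth_pred + 0.01 * rng.standard_normal((8, 8))
conf = rng.random((8, 8))
l_g = masked_depth_loss(depth_pred, depth_gt, conf)

loss = l_rgb + 0.1 * l_g                       # lambda_g = 0.1

# Post-optimization pruning: drop Gaussians with opacity below 0.01.
opacity = rng.random(1000)
kept = opacity >= 0.01
print(float(loss) >= 0.0)                      # prints True
```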
Key Experimental Results¶
Main Results (HDR-NeRF Real Dataset)¶
| Method | 4-view PSNR↑ | 4-view SSIM↑ | 8-view PSNR↑ | 18-view PSNR↑ | Time↓ |
|---|---|---|---|---|---|
| AnySplat | 12.10 | 0.517 | 13.30 | 13.91 | ~1–2s |
| GaussianHDR | ~19.26 | ~0.691 | ~24.96 | ~29.36 | ~1833s |
| HDR-GS | ~15.40 | — | — | ~28.90 | ~910s |
| InstantHDR (zero-shot) | 18.44 | 0.721 | 18.95 | 19.48 | ~1–2s |
| InstantHDR_1K | 22.16 | 0.762 | 25.32 | 29.19 | ~30–40s |
Ablation Study (HDR-NeRF Real, 8-view)¶
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Full InstantHDR | 18.95 | 0.724 | 0.269 |
| w/o exposure normalization | 13.72 | — | — |
| w/o MetaNet | 16.32 | — | — |
| w/o cross-view attention | 17.63 | — | — |
| w/o high-resolution upsampling | — | — | 0.386 |
Key Findings¶
- Zero-shot mode outperforms AnySplat by +5.65–+8.07 dB PSNR — exposure normalization and cross-view fusion are the core differentiators.
- Under sparse 4-view settings, InstantHDR_1K surpasses GaussianHDR by +2.90 dB (22.16 vs. 19.26) — feed-forward geometric priors effectively compensate for sparse input.
- Speed: InstantHDR_1K ~30–40s/scene vs. GaussianHDR ~1833s → ~50× speedup.
- Ablation shows exposure normalization has the largest impact (−5.23 dB) — brightness inconsistency fundamentally disrupts cross-view fusion.
- Removing MetaNet causes training instability (16.32 dB) — the model cannot adapt to different CRFs.
- Under dense 18-view settings, performance is comparable to GaussianHDR (29.19 vs. 29.36) at ~50× faster speed.
Highlights & Insights¶
- First feed-forward paradigm for HDR-NVS: reconstruction time drops from ~30 minutes to seconds — a qualitative leap in speed.
- Reusing intermediate-layer attention maps from a frozen geometry encoder is an elegant design — providing reliable cross-view geometric correspondence guidance at zero additional computational cost.
- MetaNet predicting all CRF parameters enables "one network, multiple cameras" — a paradigm shift from per-scene MLP overfitting to meta-learned parameter prediction.
- HDR-Pretrain dataset fills a community gap — 168 scenes, more than 10× larger than the previously largest HDR dataset.
Limitations & Future Work¶
- Zero-shot HDR output tends to be overly bright; extreme radiance values are hard to predict accurately in a single forward pass.
- A PSNR gap remains between InstantHDR and GaussianHDR under dense-view synthetic settings (~2–6 dB) — the latter employs a dedicated 3D-2D dual-branch tone mapping.
- Only a simple single-branch tone mapper (two-layer MLP) is used; more sophisticated CRF modeling is a promising direction for improvement.
- Fine-tuning on HDR-Plenoxels real scenes is required for generalization to HDR-NeRF real scenes — a domain gap persists.
- Dynamic scene HDR reconstruction remains unexplored.
Related Work & Insights¶
- vs. GaussianHDR: An optimization-based method requiring ~30 min/scene with SfM point cloud initialization; point cloud collapse causes artifacts under sparse views. InstantHDR requires neither poses nor point clouds, surpassing it by +2.90 dB PSNR under sparse 4-view settings.
- vs. AnySplat: A feed-forward 3D reconstruction method that assumes appearance consistency; multi-exposure inputs produce severe ghosting. InstantHDR outperforms it by +5.65 dB in zero-shot mode, with the core difference being exposure normalization and cross-view attention fusion.
- vs. HDR-GS: A strong optimization-based method but slow. InstantHDR_1K significantly outperforms it under sparse settings (22.16 vs. 15.40).
- Insights: The approach of reusing intermediate-layer attention maps from a frozen backbone for cross-view guidance is generalizable to multi-view segmentation, cross-view editing, and related tasks. The MetaNet paradigm of predicting module parameters is applicable to adaptive dehazing, white balance, and similar problems.
Rating¶
- Novelty: ⭐⭐⭐⭐ First feed-forward HDR-NVS; geometry-guided appearance modeling and MetaNet are both novel designs.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-view settings, dual LDR/HDR evaluation, comprehensive ablation, and rich qualitative results.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, method diagrams are intuitive, and experiments are well-organized.
- Value: ⭐⭐⭐⭐ Pioneers the feed-forward HDR-NVS direction with practically significant speed improvements, though a gap with optimization-based methods remains under dense-view settings.