LEDiff: Latent Exposure Diffusion for HDR Generation
Conference: CVPR 2025
arXiv: 2412.14456
Code: https://lediff.mpi-inf.mpg.de/ (Project Page)
Area: Image Generation
Keywords: High Dynamic Range Imaging, Latent Exposure Fusion, Diffusion Models, Inverse Tone Mapping, HDR Generation
TL;DR
LEDiff adds HDR generation to existing generative models and achieves SOTA LDR-to-HDR translation. It does so by performing exposure fusion in the latent space of a pre-trained diffusion model (rather than in image space) and by fine-tuning the VAE decoder and denoiser with a small amount of HDR data.
Background & Motivation
- Background: Consumer displays increasingly support HDR with a dynamic range exceeding 10 stops, but the vast majority of image resources (web photos, AI-generated content) are still limited to 8-bit LDR. Existing generative models (such as Stable Diffusion) can only output LDR images.
- Limitations of Prior Work: Traditional HDR reconstruction methods either require multi-exposure inputs (which suffer from scene alignment difficulties) or recover HDR from a single LDR image via inverse tone mapping (ITM). However, existing ITM methods generate poor detail in clipped areas and have almost no ability to handle shadow regions. Although GlowGAN made the first attempt at GAN-based HDR generation, it is limited by category-specific training.
- Key Challenge: Both the VAE and UNet of pre-trained diffusion models are only trained on LDR data—the VAE cannot represent the HDR dynamic range, and the latent codes generated by the UNet only encode LDR content. Directly fine-tuning the entire model on HDR data requires a massive amount of HDR data and destroys pre-trained priors.
- Goal: (a) To enable pre-trained diffusion models to generate HDR content; (b) To convert arbitrary LDR images to HDR.
- Key Insight: It is observed that the latent space of diffusion models is highly correlated with the image space in terms of clipping and pixel intensity—pixels clipped in the image space are also clipped in the latent space. Therefore, exposure fusion can be performed in the latent space to eliminate clipping.
- Core Idea: While keeping the pre-trained latent space unchanged, highlight and shadow generators are trained to generate multi-exposure "latent brackets" in the latent space. These brackets are then merged using a learnable fusion module and reconstructed into HDR images via a fine-tuned decoder.
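The reasoning behind exposure bracketing can be illustrated with a minimal image-space sketch. It assumes a plain gamma curve as a stand-in for a camera response function, and the stop offsets, thresholds, and helper names (`simulate_exposure`, `well_exposed`) are hypothetical choices for illustration, not values from the paper. It only shows that a pixel clipped in one exposure is usually well exposed in another, which is the property LEDiff exploits in the latent space.

```python
import numpy as np

def simulate_exposure(hdr, stops, gamma=2.2):
    """Render a linear HDR signal as a clipped, gamma-encoded LDR exposure.

    `stops` shifts the exposure by powers of two. The gamma curve is an
    assumed stand-in for a real camera response function.
    """
    scaled = hdr * (2.0 ** stops)
    clipped = np.clip(scaled, 0.0, 1.0)   # values above 1 saturate
    return clipped ** (1.0 / gamma)

def well_exposed(ldr, low=0.05, high=0.95):
    """Mask of pixels that are neither crushed to black nor saturated."""
    return (ldr > low) & (ldr < high)

# Toy linear HDR signal spanning 12 stops (2**-8 to 2**4).
hdr = 2.0 ** np.linspace(-8.0, 4.0, 256)

# Three exposures: low, middle, high (offsets are illustrative assumptions).
bracket = {s: simulate_exposure(hdr, s) for s in (-4.0, 0.0, 4.0)}
masks = np.stack([well_exposed(img) for img in bracket.values()])

coverage_single = masks[1].mean()             # middle exposure alone
coverage_bracket = masks.any(axis=0).mean()   # any of the three exposures
print(f"single exposure: {coverage_single:.2f}, bracket: {coverage_bracket:.2f}")
```

The bracket covers a larger fraction of the signal than any single exposure, which is why merging exposures can remove clipping.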
Method
Overall Architecture
The input is a single LDR image (or a latent code generated by a diffusion model). The pipeline consists of: (1) obtaining the latent code \(C_+\) using a pre-trained encoder; (2) generating low-exposure latent codes \(C_0, C_-\) using a highlight denoiser \(\epsilon_{\theta_-}\) (recovering highlight details); (3) generating high-exposure latent codes using a shadow denoiser \(\epsilon_{\theta_+}\) (recovering shadow details); (4) merging the three latent codes into a clipping-free latent code \(C_{\text{merge}}\) using a learnable fusion module \(\mathcal{F}\); and (5) decoding it into a linear HDR image using a fine-tuned HDR decoder.
Key Designs
- Latent Exposure Fusion:
    - Function: Seamlessly fuse multi-exposure LDR latent codes without clipping, while preserving the pre-trained latent space intact.
    - Mechanism: Multi-exposure LDR images \(I_-, I_0, I_+\) are simulated from an HDR image (with non-linearity introduced through randomly sampled CRFs), and the corresponding latent codes \(C_-, C_0, C_+\) are obtained using the pre-trained encoder. The fusion module \(\mathcal{F}\) generates a weight map for each latent code using depth-wise convolutions; the weight maps are normalized via softmax and the latent codes are merged through weighted summation. The fine-tuned decoder then reconstructs the HDR image from \(C_{\text{merge}}\) in the log space. Training is guided by a reconstruction loss and a GAN loss.
    - Design Motivation: This is analogous to exposure fusion in the image space (such as Mertens' method), but executed in the latent space. The advantage is that it fully preserves the generative priors learned by the pre-trained model: latent code generation is handled by the pre-trained model, and only fusion and decoding require training with HDR data.
- Highlight & Shadow Generators:
    - Function: Generate latent codes at different exposure levels from a single-exposure latent code to hallucinate details in clipped regions.
    - Mechanism: Both generators are fine-tuned from the denoiser of a pre-trained Stable Diffusion model. With the high-exposure latent code \(C_+\) as a condition (via channel concatenation), \(\epsilon_{\theta_-}\) is trained to generate the low-exposure latent codes \(C_0\) and \(C_-\) (to recover highlights). Similarly, \(\epsilon_{\theta_+}\) is trained to recover shadows. The input channel count of the first convolutional layer is enlarged to accept the concatenated input, while the rest of the network is initialized with pre-trained weights. The standard diffusion denoising loss is used for training.
    - Design Motivation: Leverage the strong generative capabilities of pre-trained diffusion models (similar to inpainting) to hallucinate content in clipped regions. Training the highlight and shadow generators separately avoids the difficulty of forcing a single model to handle both directions simultaneously.
- HDR Decoder Fine-tuning:
    - Function: Decode the clipping-free merged latent code into a linear HDR image.
    - Mechanism: The original VAE decoder architecture is kept and fine-tuned, with a training objective that compares the predicted HDR and the ground-truth HDR in the log space. It is trained on 36,000 HDR images with a learning rate of 1e-6 for 200K steps. The decoder learns two things at once: dynamic range expansion and linearization.
    - Design Motivation: The original VAE decoder can only output gamma-corrected LDR images. Fine-tuning the decoder to output linear, high-dynamic-range pixel values is a key step in correctly mapping the information recovered in the latent space to the HDR domain.
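The softmax-weighted merging described under Latent Exposure Fusion can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the latents are random stand-ins for \(C_-, C_0, C_+\), the 4-channel latent shape follows Stable Diffusion's VAE, the 3x3 depth-wise kernels are random rather than learned, and per-channel weight maps are an assumption. The names `depthwise_conv3x3` and `fuse_latents` are hypothetical.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    x: (C, H, W); kernels: (C, 3, 3), one kernel per channel.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_latents(latents, kernels):
    """Merge K latent codes with softmax-normalized weight maps.

    latents: (K, C, H, W); kernels: (K, C, 3, 3).
    Returns the merged latent (C, H, W) and the weights (K, C, H, W).
    """
    logits = np.stack([depthwise_conv3x3(z, k) for z, k in zip(latents, kernels)])
    weights = softmax(logits, axis=0)          # weights sum to 1 across exposures
    merged = (weights * latents).sum(axis=0)   # weighted summation
    return merged, weights

rng = np.random.default_rng(0)
latents = rng.standard_normal((3, 4, 8, 8))   # stand-ins for C_-, C_0, C_+
kernels = rng.standard_normal((3, 4, 3, 3))   # random, not learned (assumption)

c_merge, weights = fuse_latents(latents, kernels)
print(c_merge.shape)
```

Because the weights form a convex combination at every position, the merged latent always lies between the smallest and largest input value at that position, which keeps it inside the range the pre-trained latent space already covers.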
Loss & Training
Fusion module + decoder: Reconstruction loss (L1 + perceptual) + GAN loss. Denoiser: standard diffusion denoising loss (MSE). Decoder: 200K steps (lr=1e-6), denoisers: 400K steps (lr=1e-5), both using the Adam optimizer. The HDR dataset includes 36,000 images from multiple HDR sources.
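Two of the loss terms above can be sketched in NumPy. This is a hedged illustration only: the `eps` constant, the function names, the `alpha_bar` noise-level parameter, and the toy zero-output denoiser are assumptions made for the example, and the perceptual and GAN terms are not covered.

```python
import numpy as np

def log_l1_loss(pred_hdr, gt_hdr, eps=1e-6):
    """L1 distance between log-encoded HDR images (eps avoids log(0))."""
    pred_log = np.log(np.maximum(pred_hdr, 0.0) + eps)
    gt_log = np.log(np.maximum(gt_hdr, 0.0) + eps)
    return float(np.mean(np.abs(pred_log - gt_log)))

def denoising_loss(denoiser, target, cond, alpha_bar, rng):
    """Noise-prediction MSE with the condition concatenated on the channel axis.

    target, cond: latents of shape (C, H, W).
    alpha_bar: cumulative signal level in (0, 1) for the sampled timestep.
    """
    noise = rng.standard_normal(target.shape)
    noisy = np.sqrt(alpha_bar) * target + np.sqrt(1.0 - alpha_bar) * noise
    net_input = np.concatenate([noisy, cond], axis=0)   # shape (2C, H, W)
    pred_noise = denoiser(net_input)
    return float(np.mean((pred_noise - noise) ** 2))

rng = np.random.default_rng(0)

# Log-space reconstruction: a uniform 2x error costs about log(2) at any brightness.
gt = rng.uniform(0.1, 100.0, size=(3, 8, 8))
loss_scaled = log_l1_loss(2.0 * gt, gt)

# Denoising loss with a toy model that always predicts zero noise.
target = rng.standard_normal((4, 8, 8))   # e.g. a lower-exposure latent
cond = rng.standard_normal((4, 8, 8))     # e.g. the input-exposure latent
zero_denoiser = lambda x: np.zeros_like(x[:4])
loss_zero = denoising_loss(zero_denoiser, target, cond, alpha_bar=0.5, rng=rng)
print(loss_scaled, loss_zero)
```

Comparing in log space weights relative errors equally in dark and bright regions, which matters when pixel values span several orders of magnitude.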
Key Experimental Results
Main Results (LDR-to-HDR Reconstruction - Highlight Regions)
| Method | HDR-VDP3↑ | PU21-PIQE↓ | FID-R↓ | FID-D↓ | FID-L↓ |
|---|---|---|---|---|---|
| HDRCNN | 6.90 | 49.43 | 13.39 | 16.95 | 16.67 |
| MaskHDR | 6.47 | 49.38 | 12.83 | 13.85 | 15.21 |
| SingleHDR | 6.13 | 49.74 | 28.68 | 34.53 | 29.50 |
| ExpandNet | 5.23 | 52.53 | 18.85 | 25.49 | 21.34 |
| LEDiff | 6.16 | 48.46 | 12.70 | 13.08 | 13.73 |
Ablation Study
| Configuration | HDR-VDP3↑ | PU21-PIQE↓ | Description |
|---|---|---|---|
| w/o VAE decoder fine-tuning | 4.67 | 48.60 | Has hallucination but no dynamic range expansion |
| w/o denoiser fine-tuning | 5.59 | 50.25 | Has dynamic range but no detail hallucination |
| Replace with SD inpainting | 3.65 | 50.18 | Poor results with irregular masks |
| Full LEDiff | 6.16 | 48.46 | Both components are indispensable |
Key Findings
- Both VAE decoder and denoiser are indispensable: The decoder is responsible for dynamic range expansion and linearization, while the denoiser is responsible for detail hallucination in clipped areas—their functions are complementary.
- User study shows a decisive preference for LEDiff: In 1,200 pairwise comparisons, the preference rates for LEDiff are 84.19% (vs. HDRCNN), 88.22% (vs. MaskHDR), 89.40% (vs. SingleHDR), and 94.52% (vs. ExpandNet), respectively.
- Shadow processing is a differentiating advantage: Existing methods barely handle shadow regions, whereas LEDiff is the first generative method to address both highlights and shadows.
- Plug-and-play capability: LEDiff is indifferent to how the latent representation is generated, allowing it to seamlessly integrate into any SD-based generation pipeline (panoramas, videos, etc.).
Highlights & Insights
- Discovery of clipping correlation between latent and image spaces: This key observation makes exposure fusion in the latent space possible—it is a simple yet highly insightful discovery.
- Minimizing interference to pre-trained models: By keeping the pre-trained latent space unchanged and fine-tuning only the decoder and conditional denoisers, LEDiff achieves HDR generation with only 36K HDR images. This "repurpose rather than rebuild" philosophy is worth learning from.
- Wide range of potential applications: Generated HDR panoramas can be used directly for IBL (Image-Based Lighting), and applications such as HDR video generation and DoF (Depth of Field) rendering require linear HDR data, so LEDiff opens up many application scenarios.
Limitations & Future Work
- Inherits the generative limitations of Stable Diffusion (resolution, content bias).
- Does not account for compression artifacts (such as JPEG ringing and blocking) or noise in the input LDR images.
- The HDR training dataset contains only 36K images; increasing the scale could further improve quality.
- Currently utilizes a three-exposure bracket; more exposure levels might be beneficial in extreme dynamic range scenes.
Related Work & Insights
- vs GlowGAN: GlowGAN models the HDR-LDR relationship via GANs but is limited to category-specific training; LEDiff leverages the generalization capability of pre-trained diffusion models without category limitations.
- vs Exposure Diffusion (Bemana et al.): This concurrent work also uses diffusion models to generate exposure brackets, but performs fusion in the image space and requires estimating exposure parameters; LEDiff fuses in the latent space, eliminating the need for CRF and exposure parameter estimation.
- vs HDRCNN/MaskHDR: These methods use CNNs to directly predict HDR, which is effective in highlight areas, but they completely ignore shadows; LEDiff processes both directions and can hallucinate detail in highlights and shadows alike.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of latent space exposure fusion is simple and elegant.
- Theoretical/Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative + user study + ablation + multiple application scenarios demonstrated.
- Writing Quality: ⭐⭐⭐⭐⭐ Exquisite figures, rigorous logic, and a very clear deduction chain of observation \(\rightarrow\) method \(\rightarrow\) experiment.
- Value: ⭐⭐⭐⭐ Endows diffusion models with HDR capabilities, carrying high practical application value.