UltraFusion: Ultra High Dynamic Imaging via Exposure Fusion¶

Conference: CVPR 2025
arXiv: 2501.11515
Code: Project Page
Area: Image Generation / HDR Imaging
Keywords: High Dynamic Range, Exposure Fusion, Guided Inpainting, Diffusion Models, Ultra-large Exposure Difference

TL;DR¶

UltraFusion reformulates exposure fusion as a guided inpainting problem for the first time. By leveraging under-exposed images as soft guidance rather than hard constraints for over-exposed regions, it achieves ultra-high dynamic range imaging with a 9-stop exposure difference while maintaining robustness against alignment errors and illumination variations.

Background & Motivation¶

HDR imaging is a fundamental problem in camera design. Leading approaches improve dynamic range by fusing images captured at different exposures, but severe practical limitations persist:

- Prior methods can only handle a 3-4 stop exposure difference (e.g., HDR+ only adds 3 stops), which is far from sufficient for ultra-high dynamic range scenes.
- Alignment issues: Under large exposure differences, the input luminance varies drastically, making optical flow alignment highly challenging and leading to ghosting artifacts.
- Illumination inconsistency: Under-exposed images are not merely darkened versions of normally exposed images; object appearance changes with varying exposure.
- Tone mapping artifacts: Traditional HDR methods first generate HDR images and then compress them to LDR for display. In high dynamic range scenarios, the tone mapping stage introduces additional artifacts.
- Reference ambiguity: Directly applying ControlNet cannot determine which frame should serve as the reference, so different regions of the output end up following different reference frames.
- Data scarcity: There is a lack of large-scale exposure fusion training data on dynamic scenes.
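To put the stop counts above in perspective: one stop corresponds to a factor of two in captured light, so the gap between a 3-stop and a 9-stop bracket is large. The helper below is purely illustrative (the name `exposure_ratio` is not from the paper).

```python
def exposure_ratio(stops: float) -> float:
    """Ratio of captured light between two exposures that differ by `stops`.

    Each stop doubles (or halves) the exposure.
    """
    return 2.0 ** stops


print(exposure_ratio(3))  # 8.0
print(exposure_ratio(9))  # 512.0
```

A 9-stop pair therefore differs by a factor of 512 in exposure, which is why the under-exposed frame looks almost black next to the over-exposed one.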

Method¶

Overall Architecture¶

UltraFusion is a two-stage framework: (1) Pre-alignment stage—aligning the under-exposed image to the over-exposed image and masking occluded areas; (2) Guided inpainting stage—based on Stable Diffusion, using the over-exposed image as the main reference and the under-exposed image as soft guidance to reconstruct highlight information in over-exposed regions. The guided inpainting stage includes a decomposition fusion control branch and a fidelity control branch.

Key Designs¶

Key Design 1: Guided Inpainting Paradigm¶

Function: Redefining exposure fusion as an inpainting problem to directly output tone-mapped LDR results.

Mechanism: The normally exposed (over-exposed) image \(I_{oe}\) serves as the baseline, and the missing information in its highlight regions is reconstructed. The under-exposed image \(I_{ue}\) acts as soft guidance rather than a hard constraint, supplying the true content of the highlights. During the pre-alignment stage, RAFT is employed to estimate bidirectional optical flow. An occlusion mask \(\mathcal{M}\) is obtained via consistency checks, and the aligned output is \(I_{ue \to oe} = (1-\mathcal{M}) \cdot \mathcal{W}(I_{ue}, f_{oe \to ue})\), where \(\mathcal{W}\) warps \(I_{ue}\) with the flow \(f_{oe \to ue}\) and occluded pixels are zeroed out.
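The pre-alignment formula can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the two flow fields are taken as given arrays (in the paper they come from RAFT), the consistency check is assumed to be a forward-backward round-trip test with a hypothetical tolerance `tol`, and warping uses nearest-neighbour sampling for brevity. All function names are made up for this example.

```python
import numpy as np


def warp_nearest(img, flow):
    """Backward-warp `img` with `flow` of shape (H, W, 2) holding (dx, dy).

    Nearest-neighbour sampling with border clamping (a simplification).
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]


def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """Return 1 where following flow_fw and then flow_bw fails to return home."""
    bw_at_target = warp_nearest(flow_bw, flow_fw)  # backward flow at the mapped pixel
    err = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return (err > tol).astype(np.float32)


def pre_align(i_ue, flow_oe_to_ue, flow_ue_to_oe, tol=1.0):
    """Compute I_{ue->oe} = (1 - M) * W(I_ue, f_{oe->ue}) and the mask M."""
    mask = occlusion_mask(flow_oe_to_ue, flow_ue_to_oe, tol)
    warped = warp_nearest(i_ue, flow_oe_to_ue)
    return (1.0 - mask)[..., None] * warped, mask
```

With a pure 2-pixel horizontal shift and mutually consistent flows, the mask is empty and the warped under-exposed frame lines up with the over-exposed one everywhere except at the clamped border columns.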

Design Motivation: Soft guidance is robust to alignment errors and illumination variations (unlike hard constraints, which magnify errors). Directly generating LDR avoids cascaded errors from HDR-to-LDR conversion. The generative priors of diffusion models ensure natural-looking and plausible outputs.

Key Design 2: Decomposition Fusion Control Branch¶

Function: Extracting structure and color information robust to luminance variations from extremely dark under-exposed images to effectively guide diffusion-based inpainting.

Mechanism: Decomposing the under-exposed image into a structural component \(S_{ue} = (Y_{ue} - \mu(Y_{ue})) / \sigma(Y_{ue})\) (normalized YUV luminance channel) and a color component (UV chroma channels). Structure and color features are extracted through separate convolutional extractors to obtain multi-scale features, which are then fused with the over-exposed image features using multi-scale cross-attention. The control branch replicates the U-Net encoder structure but updates weights independently, and its outputs are injected into the main U-Net via zero convolutions.
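The structure/color decomposition can be illustrated with a few lines of NumPy. This is a sketch only: the source says "YUV" without naming a conversion, so the BT.601 matrix below is an assumption, as are the function name `decompose` and the small `eps` added for numerical safety.

```python
import numpy as np

# BT.601 RGB -> YUV coefficients (assumed; the source only says "YUV").
RGB2YUV = np.array([
    [0.299, 0.587, 0.114],
    [-0.14713, -0.28886, 0.436],
    [0.615, -0.51499, -0.10001],
])


def decompose(rgb, eps=1e-8):
    """Split an (H, W, 3) RGB image into structure and color components.

    structure: S = (Y - mean(Y)) / std(Y), the normalized luminance channel
    color:     the two chroma channels (U, V)
    """
    yuv = rgb @ RGB2YUV.T
    y = yuv[..., 0]
    structure = (y - y.mean()) / (y.std() + eps)
    color = yuv[..., 1:]
    return structure, color
```

Because of the mean/standard-deviation normalization, scaling the input by a global gain (for example, a much shorter exposure in linear space) leaves the structure map essentially unchanged, which is the property that keeps very dark inputs from being ignored.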

Design Motivation: Extremely dark under-exposed images are easily ignored by the model if used directly as guidance. After decomposing them into luminance-independent structure and color information, the effectiveness of the guiding signals is substantially improved.

Key Design 3: Fidelity Control Branch + Training Data Synthesis¶

Function: Alleviating texture distortions introduced by the VAE decoder; synthesizing training data for dynamic scene exposure fusion.

Mechanism: The Fidelity Control Branch (FCB) has a similar structure to the decomposition fusion control branch, but its main extractor adopts a VAE encoder structure (instead of U-Net) to provide skip connections for the VAE decoder. During training, the latent code encoded from the ground truth (GT) is used to simulate the denoising output, optimized via the \(\|I_{gt} - \hat{I}_{gt}\|_1\) reconstruction loss. For data synthesis, frame pairs are sampled from video datasets to simulate large motions, under-exposed patches are sampled from a static multi-exposure dataset (SICE), and pseudo-occlusion masks are used to simulate dynamic occlusions, enabling the model to learn to handle dynamic scenes using only static data.
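As a rough illustration of the pseudo-occlusion step in the data synthesis, the sketch below blanks out a random region of an already-aligned under-exposed patch. The shape and size distribution of the masks are not given in this summary, so the single random rectangle, the `max_frac` parameter, and both function names are hypothetical choices; the motion simulation from video frame pairs is not covered here.

```python
import numpy as np


def random_rect_mask(h, w, rng, max_frac=0.5):
    """Binary mask (1 = occluded) containing one random rectangle (hypothetical shape)."""
    mh = rng.integers(1, min(h, max(2, int(h * max_frac))) + 1)
    mw = rng.integers(1, min(w, max(2, int(w * max_frac))) + 1)
    top = rng.integers(0, h - mh + 1)
    left = rng.integers(0, w - mw + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + mh, left:left + mw] = 1.0
    return mask


def make_training_guidance(ue_patch, rng):
    """Zero out a pseudo-occluded region of an (H, W, 3) under-exposed patch."""
    mask = random_rect_mask(ue_patch.shape[0], ue_patch.shape[1], rng)
    return (1.0 - mask)[..., None] * ue_patch, mask
```

Training on guidance with such holes forces the model to fall back on its generative prior wherever the under-exposed frame offers no information, mirroring what happens at real occlusions.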

Design Motivation: VAE decoding tends to introduce undesirable texture alterations; furthermore, no large-scale dynamic HDR training dataset is readily available.

Loss & Training¶

The training objectives are the standard diffusion denoising loss and an L1 reconstruction loss \(\|I_{gt} - \hat{I}_{gt}\|_1\) for the fidelity control branch.
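For reference, the two terms can be written out as below. The first line is the usual noise-prediction objective of latent diffusion models; the symbols \(z_t\) (noisy latent at step \(t\)), \(c\) (conditioning from the control branch), and \(\epsilon_\theta\) (the denoising network) are notation assumed here rather than taken from the paper. Whether the two terms are summed or optimized in separate training stages is not stated in this summary.

\[
\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, t,\, c,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right],
\qquad
\mathcal{L}_{\text{fid}} = \left\| I_{gt} - \hat{I}_{gt} \right\|_1
\]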

Key Experimental Results¶

Main Results: Static MEFB Dataset¶

Method	MUSIQ ↑	DeQA-Score ↑	PAQ2PIQ ↑	HyperIQA ↑	MEF-SSIM ↑
UltraFusion	68.82	3.881	73.80	0.6482	0.9385
HSDS-MEF	66.76	3.544	72.60	0.6026	0.9520
HDR-Transformer	63.10	2.983	71.36	0.5996	0.8626

Main Results: Dynamic RealHDRV and UltraFusion Benchmark¶

Method	RealHDRV TMQI ↑	RealHDRV MUSIQ ↑	Benchmark MUSIQ ↑	Benchmark DeQA ↑
UltraFusion	0.8925	67.51	68.41	3.830
HSDS-MEF	0.8323	61.76	64.54	3.627
HDR-Transformer	0.8680	62.24	63.66	2.909

Key Findings¶

UltraFusion is the first method capable of merging images with a 9-stop exposure difference.
In user studies, UltraFusion significantly outperforms all baseline methods in both subjective quality and user preference.
The soft guidance mechanism makes the method robust to both large motion and illumination changes simultaneously.
Decomposing the under-exposed image into structure and color components is key to effectively extracting and guiding information from dark images.

Highlights & Insights¶

Paradigm Shift: Reformulating exposure fusion from the conventional "align-then-merge" paradigm to a "guided inpainting" paradigm.
High Practicality: Directly outputting LDR images bypasses the tone mapping step, producing display-ready, high-quality outputs end-to-end.
Ingenious Data Synthesis Strategy: Simulating dynamic HDR scenes using a combination of video frames and static multi-exposure data.

Limitations & Future Work¶

Inherits the slow inference speed of diffusion model sampling processes.
The pseudo-occlusions used in training data synthesis may not fully capture the complexity of real-world dynamic scenes.
Currently handles only two frames (one long and one short exposure) and is not yet scaled to broader exposure brackets.
Future work could explore real-time inference and video HDR scenarios.

Compared to the direct application of ControlNet, the strategy of fixing a reference frame (the over-exposed image) eliminates ambiguity.
The concept of decomposing under-exposed images into structure and color can be extended to other cross-domain guidance tasks.
The newly collected 100-scene UltraFusion Benchmark provides standard evaluation data for ultra-high dynamic range imaging.

Rating¶

⭐⭐⭐⭐ — The proposed reformulation of exposure fusion as guided inpainting is novel and effective, achieving 9-stop exposure difference fusion for the first time. The method design is comprehensive, with highly ingenious decomposition fusion control branches and data synthesis pipelines. It achieves state-of-the-art quality across several benchmarks.