AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm¶
Conference: ICCV 2025 arXiv: 2506.23537 Code: https://github.com/eezkni/AFUNet Area: Other Keywords: HDR imaging, deep unfolding network, MAP estimation, alignment-fusion alternating optimization, deghosting
TL;DR¶
This paper formulates multi-exposure HDR reconstruction from a MAP estimation perspective, decomposes the problem via a spatial correspondence prior into two alternating subproblems (alignment and fusion), and unfolds them into the end-to-end trainable AFUNet, comprising SAM (spatial alignment), CFM (channel fusion), and DCM (data consistency) modules. The method achieves state-of-the-art performance on three HDR benchmarks, reaching a PSNR-μ of 44.91 dB on the Kalantari dataset.
Background & Motivation¶
Existing HDR reconstruction methods fall into two paradigms: "align-then-fuse" (pre-alignment followed by fusion, but pre-alignment may discard useful information) and "fusion-only" (bypassing explicit alignment, leading to ghosting artifacts). Both are empirically designed without a principled mathematical foundation. The core insight is that interleaving alignment within the fusion process through alternating iterations outperforms executing the two steps independently.
Core Problem¶
How to provide a theoretically grounded framework for multi-exposure HDR reconstruction such that alignment and fusion mutually reinforce each other through progressive optimization?
Method¶
Overall Architecture¶
Three multi-exposure LDR images \((y_1, y_2, y_3)\) → SFEM shallow feature extraction → \(T=4\) stages of alternating alignment-fusion unfolding network (AFM) → residual HDR image reconstruction. Each AFM stage: SAM aligns non-reference features → SFM spatial fusion → CFM channel fusion → DCM data consistency update → MLP + residual update.
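The stage ordering above can be sketched as a minimal forward-pass skeleton. Only the control flow (SAM → SFM → CFM → DCM → MLP + residual, repeated for \(T\) stages) reflects the paper; the module bodies below are hypothetical scalar placeholders standing in for the actual attention-based networks, so the numbers they produce are meaningless.

```python
# Schematic dataflow of the AFUNet forward pass (T=4 unfolded stages).
# Real SAM/SFM/CFM/DCM are learned attention modules; these stand-ins are
# hypothetical placeholders, so only the stage ordering matches the paper.

def sam(f_nonref, f_x):
    """Placeholder spatial alignment of a non-reference feature toward f_x."""
    return 0.5 * (f_nonref + f_x)

def sfm(f1, f_ref, f3):
    """Placeholder spatial fusion of the three feature streams."""
    return (f1 + f_ref + f3) / 3.0

def cfm(f_fused, f_x_prev):
    """Placeholder channel-wise fusion with the previous reconstruction feature."""
    return 0.5 * (f_fused + f_x_prev)

def dcm(f_x):
    """Placeholder data-consistency update."""
    return f_x

def mlp(f_x):
    """Placeholder per-pixel MLP producing the residual."""
    return 0.1 * f_x

def afunet_forward(f1, f2, f3, num_stages=4):
    """Alternate alignment and fusion for num_stages unfolded AFM stages.

    f2 is the reference-exposure feature; f1/f3 are non-reference features,
    matching the three-exposure input setup.
    """
    f_x = f2  # initialize the reconstruction feature from the reference
    for _ in range(num_stages):
        a1, a3 = sam(f1, f_x), sam(f3, f_x)  # align non-reference features
        fused = sfm(a1, f_x, a3)             # spatial fusion
        fused = cfm(fused, f_x)              # channel fusion
        fused = dcm(fused)                   # data-consistency update
        f_x = f_x + mlp(fused)               # MLP + residual update
    return f_x
```

Each stage re-aligns the non-reference features against the current reconstruction estimate, which is exactly the "fusion result guides the next alignment" behavior the paper argues for.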
Key Designs¶
- MAP-based formulation + unfolding: HDR reconstruction is modeled as MAP estimation (Eq. 2) with a spatial correspondence prior constraint. The HQS method decouples it into an alignment subproblem (gradient descent) and a fusion subproblem (proximal operator). Each iteration is unfolded into one AFM module with independently learnable parameters.
- Spatial Alignment Module (SAM): Based on window-based cross-attention, SAM aligns non-reference features \(f_{\alpha_1}/f_{\alpha_3}\) with the intermediate reconstruction feature \(f_x\). Keys and Values incorporate information from the degradation transform \(D_i\) (learned via MLP), enabling the alignment process to be aware of exposure differences.
- Channel Fusion Module (CFM): Based on a channel-attention Transformer, CFM performs adaptive channel-wise fusion after spatial fusion (SFM), combining the previous-stage reconstruction feature \(f_x^{t-1}\) with the aligned features.
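As a rough sketch of the decoupling step, a generic half-quadratic splitting of a MAP objective looks as follows. The notation here is assumed (\(D_i\) the per-exposure degradation, \(\Phi\) the spatial correspondence prior, \(\beta\) the penalty weight); the paper's exact Eq. 2 may differ in detail.

\[
\hat{x} = \arg\min_{x} \; \sum_{i=1}^{3} \big\| D_i(x) - y_i \big\|_2^2 + \lambda \, \Phi(x)
\]

HQS introduces an auxiliary variable \(z\) to decouple the data term from the prior:

\[
\min_{x,\, z} \; \sum_{i=1}^{3} \big\| D_i(z) - y_i \big\|_2^2 + \lambda \, \Phi(x) + \frac{\beta}{2} \big\| z - x \big\|_2^2
\]

which is minimized by alternating a gradient step on \(z\) (the alignment subproblem) and a proximal step on \(x\) (the fusion subproblem):

\[
z^{t+1} = z^{t} - \eta \, \nabla_{z} \Big( \sum_{i=1}^{3} \big\| D_i(z^{t}) - y_i \big\|_2^2 + \frac{\beta}{2} \big\| z^{t} - x^{t} \big\|_2^2 \Big), \qquad
x^{t+1} = \operatorname{prox}_{\lambda \Phi / \beta}\big( z^{t+1} \big)
\]

Each such iteration is then unfolded into one AFM stage, with the proximal operator realized by the learned fusion modules.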
Loss & Training¶
- \(\mathcal{L} = \mathcal{L}_1\text{(tone-mapped)} + 0.005 \times \mathcal{L}_\text{perceptual}\text{(VGG-19)}\)
- Tone mapping uses the \(\mu\)-law function (\(\mu = 5000\))
- Adam optimizer, batch size = 6, lr = \(5\times10^{-4} \to 5\times10^{-6}\) cosine decay, 400 epochs
- Training patch: \(128\times128\); data augmentation: random crop/rotation/flip
- Single RTX 4090 GPU
Key Experimental Results¶
Kalantari Dataset¶
| Method | PSNR-μ↑ | PSNR-l↑ | SSIM-μ↑ | HDR-VDP2↑ |
|---|---|---|---|---|
| CA-ViT | 44.32 | 42.18 | 0.9916 | 66.03 |
| SCTNet | 44.43 | 42.21 | 0.9918 | 66.64 |
| SAFNet | 44.66 | 43.18 | 0.9919 | 66.69 |
| LFDiff | 44.76 | 42.59 | 0.9919 | 66.54 |
| AFUNet | 44.91 | 42.59 | 0.9923 | 66.75 |
Hu Dataset¶
| Method | PSNR-μ↑ | PSNR-l↑ |
|---|---|---|
| LFDiff | 48.74 | 52.10 |
| AFUNet | 48.83 | 52.13 |
Tel Dataset¶
| Method | PSNR-μ↑ | PSNR-l↑ |
|---|---|---|
| SCTNet | 42.55 | 47.51 |
| AFUNet | 43.31 | 47.83 |
Ablation Study¶
- SFM only: PSNR-μ = 43.94 → +SAM: 44.48 → +CFM: 44.62 → +DCM: 44.45 → Full (AFUNet): 44.91
- Alignment-then-Fusion (AF) order outperforms Fusion-then-Alignment (FA): 44.91 vs. 44.72
- Number of stages: 2→44.40, 3→44.83, 4→44.91 (default), 5→44.85, 6→44.93 (4 stages offers the best efficiency-performance trade-off)
- Even the 3-stage configuration surpasses the prior SOTA, demonstrating the intrinsic effectiveness of the proposed framework.
Highlights & Insights¶
- Theory-driven architecture design: MAP formulation + HQS unfolding provides a principled theoretical basis for alternating alignment and fusion, rather than purely empirical design.
- Alignment as an iterative process rather than preprocessing: The core innovation—alignment and fusion alternate, with each fusion result guiding the subsequent alignment step.
- Window-based cross-attention for alignment: Local windows are better suited for spatial alignment than global attention, as alignment primarily involves local structure and high-frequency details.
- Practical value of deep unfolding: Unfolding iterative algorithms into fixed-stage neural networks yields both theoretical interpretability and end-to-end trainability.
Limitations & Future Work¶
- Only three-exposure inputs are validated; generalization to more exposures remains unexplored.
- SAM relies on window-based attention, which may limit alignment capacity in regions with large motion.
- The Kalantari dataset contains only 15 test samples, making the evaluation scale relatively small.
- No direct fair comparison is made against other unfolding-based methods (e.g., GAN-style iterations in MERF).
Related Work & Insights¶
- vs. CA-ViT/SCTNet (Transformer-based): These methods still follow the "align-then-fuse" or "fusion-only" paradigm; AFUNet's alternating iterative paradigm proves more effective (+0.48–0.59 dB).
- vs. LFDiff (Diffusion-based): AFUNet requires no additional diffusion sampling cost yet achieves superior PSNR-μ (44.91 vs. 44.76).
- vs. Mai et al. (DUN-based): Prior unfolding methods treat HDR reconstruction as low-rank completion, imposing overly strong assumptions; AFUNet is more flexible and general.
Relevance to My Research¶
- The deep unfolding paradigm—transforming iterative optimization into a learnable architecture—is transferable to other complex reconstruction tasks.
- The alternating alignment-fusion iterative paradigm can be adapted to video inpainting, multi-view fusion, and related problems.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The MAP formulation → unfolding idea is relatively novel in the HDR domain, though deep unfolding itself is a mature technique.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets + ablations + paradigm analysis + stage count analysis; reasonably comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivation is clear; the path from MAP to unfolding is complete and traceable.
- Value: ⭐⭐⭐ — The unfolding idea is instructive, but HDR reconstruction is not a core research focus.