
GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution

Conference: AAAI 2026 arXiv: 2511.07103 Code: https://github.com/zhu-xlab/GEWDiff Area: Image Generation Keywords: Hyperspectral Image Super-resolution, Diffusion Model, Wavelet Transform, Geometric Enhancement, Remote Sensing

TL;DR

This paper proposes GEWDiff, a geometry-enhanced wavelet-based diffusion model for 4× hyperspectral image super-resolution. It compresses high-dimensional hyperspectral data into a compact latent space via a wavelet encoder-decoder, preserves geometric integrity with edge-aware noise scheduling and mask-conditional control, and stabilizes convergence with a multi-level loss function, achieving state-of-the-art performance.

Background & Motivation

Hyperspectral images (HSI) capture continuous spectral characteristics of ground objects, but high-resolution hyperspectral data are scarce due to sensor cost and coverage constraints. Existing HSI super-resolution methods face three major challenges:

High-dimensional HSI problem: HSI typically contains hundreds of bands (e.g., 242 bands), and directly feeding them into conventional diffusion models leads to memory overflow.

Geometric structure fidelity problem: General generative models lack understanding of the topology and geometric structures of ground objects in remote sensing imagery, making them prone to geometric distortion (especially for buildings) during super-resolution.

Convergence and quality problem: Most diffusion models optimize the loss at the noise level, resulting in unintuitive convergence behavior and suboptimal generation quality for complex data.

Limitations of Prior Work: CNN/GAN-based methods struggle to generate rich textures and complex spatial structures; existing HSI diffusion methods (SpectralDiff, HSR-Diff) either rely on two-stage training or fail to simultaneously guarantee spectral fidelity and visual quality.

Method

Overall Architecture

GEWDiff consists of three core components:

  1. Wavelet-based encoder-decoder: Compresses high-dimensional hyperspectral data into a low-dimensional latent space with near-lossless quality.
  2. Geometry-enhanced diffusion process: Incorporates an edge-aware noise scheduler and mask-controllable training.
  3. Multi-level loss function: Comprising pixel loss, perceptual loss, and gradient loss.

Key Designs

1. Wavelet Encoder-Decoder (RWA + PCA)

Core Idea: Regression Wavelet Analysis (RWA) combined with PCA is used to efficiently compress hyperspectral data to a dimensionality amenable to diffusion models.

Encoding process:

  • The input image \(\textbf{I}_{LR}\) undergoes \(J\)-level Haar wavelet decomposition, yielding principal coefficients \(\textbf{V}_{LR}^J\) and detail coefficients \(\textbf{w}_{LR}^j\).
  • Detail coefficients are predicted from principal coefficients via linear regression: \(\hat{\textbf{w}}_i^j = \beta_{i,0}^j + \beta_{i,1}^j \textbf{V}_1^j + \dots + \beta_{i,k}^j \textbf{V}_k^j\).
  • Only the principal coefficients and regression weights are stored; residuals are discarded.
  • PCA is applied to the principal coefficients: \((\textbf{z}_{LR}, \textbf{R}_{LR}) = \text{PCA}(\textbf{V}_{LR}^J)\).
  • The resulting \(\textbf{z}_{LR}\) serves as input to the diffusion model.

Decoding process:

  • The diffusion model output \(\hat{\textbf{z}}_0\) is passed through inverse PCA to recover the super-resolved principal coefficients.
  • The full hyperspectral image is reconstructed via inverse RWA, using the regression model saved during encoding to predict the detail coefficients.

The key advantage of this design is near-lossless spectral-spatial compression without lengthy training.
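The compress-then-reconstruct idea can be sketched in plain NumPy. This is a one-level toy version under my own naming and shape conventions (the paper uses a \(J\)-level decomposition, and the authors' actual implementation differs):

```python
import numpy as np

def haar_1level(cube):
    """One Haar wavelet level along the spectral axis.
    cube: (bands, H, W), with an even number of bands."""
    even, odd = cube[0::2], cube[1::2]
    approx = (even + odd) / np.sqrt(2)   # principal coefficients V
    detail = (even - odd) / np.sqrt(2)   # detail coefficients w
    return approx, detail

def fit_detail_regression(approx, detail):
    """Predict detail coefficients from principal ones by least squares,
    mirroring the RWA idea of storing only weights, not residuals."""
    A = approx.reshape(approx.shape[0], -1).T        # (pixels, k)
    A = np.hstack([np.ones((A.shape[0], 1)), A])     # intercept beta_0
    W = detail.reshape(detail.shape[0], -1).T        # (pixels, k)
    beta, *_ = np.linalg.lstsq(A, W, rcond=None)
    return beta

def pca_compress(approx, n_comp):
    """PCA over the spectral dimension of the principal coefficients."""
    X = approx.reshape(approx.shape[0], -1)          # (k, pixels)
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    R = U[:, :n_comp]                                # rotation matrix
    z = R.T @ (X - mean)                             # latent representation
    return z, R, mean

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 16, 16))                  # toy 8-band image
V, w = haar_1level(cube)
beta = fit_detail_regression(V, w)
z, R, mu = pca_compress(V, n_comp=2)
print(z.shape)   # (2, 256)
```

Decoding would invert these steps: `R @ z + mu` recovers the principal coefficients, and the stored `beta` predicts the detail coefficients for the inverse Haar step.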

2. Geometry-Enhanced Diffusion Process

Built upon the EDM (Elucidating Diffusion Models) framework, using continuous noise levels \(\sigma\) instead of discrete timesteps.

Edge-aware noise scheduler: Enhances the diffusion model's ability to generate edge pixels during training:

$$\textbf{z}_t = \textbf{z}_0 + \sigma_t \epsilon \odot (1 - \textbf{E}(1-\sigma_{norm}^2)\eta)$$

where \(\textbf{E}\) is a binary edge map. Noise near edges is smaller than in general regions (\(\eta=0.5\)), forcing the model to focus more on accurate reconstruction of edge regions during training. Key insight: when \(\sigma_{norm}\) is small (weak noise), edge modulation is stronger; when noise is large, \((1-\sigma_{norm}^2)\) approaches 0, eliminating edge influence and ensuring reasonable noise coverage during the initial stage.
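A minimal NumPy sketch of this scheduler, assuming \(\sigma_{norm} = \sigma_t / \sigma_{max}\) and a precomputed binary edge map (the function name and toy data are illustrative):

```python
import numpy as np

def edge_aware_noisify(z0, edge_map, sigma_t, sigma_max, eta=0.5, rng=None):
    """Edge-aware forward noising:
    z_t = z_0 + sigma_t * eps * (1 - E * (1 - sigma_norm^2) * eta),
    so edge pixels (E = 1) receive less noise when sigma is small,
    while large sigma makes the edge modulation vanish."""
    rng = rng or np.random.default_rng()
    sigma_norm = sigma_t / sigma_max                # normalize to [0, 1]
    eps = rng.normal(size=z0.shape)
    scale = 1.0 - edge_map * (1.0 - sigma_norm**2) * eta
    return z0 + sigma_t * eps * scale

rng = np.random.default_rng(0)
z0 = np.zeros((1, 8, 8))
E = np.zeros((1, 8, 8)); E[0, 4] = 1.0             # a toy horizontal "edge"
zt = edge_aware_noisify(z0, E, sigma_t=0.1, sigma_max=1.0, eta=0.5, rng=rng)
```

With \(\sigma_t = 0.1\), edge pixels get their noise scaled by \(1 - 0.99 \cdot 0.5 = 0.505\), about half the noise of non-edge pixels, which matches the "focus on edges at low noise" behavior described above.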

Mask-controllable training and sampling: Segmentation masks are obtained from the low-resolution RGB channels using the SAM model. Mask values are inverted based on the NDVI index:

$$M_s = 1 - \frac{1}{|S_s|}\sum_{(x,y) \in S_s} \text{NDVI}_{norm}(x,y)$$

High-NDVI (vegetation) regions receive low mask values, while low-NDVI (building) regions receive high mask values, directing the model's attention toward the geometric accuracy of buildings. During training:

$$\hat{\textbf{z}}_0 = f_\theta(\textbf{z}_t, \textbf{C}, \sigma_t), \quad \textbf{C} = [\textbf{z}_{LR}, \textbf{M}]$$
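The per-segment mask computation can be sketched as follows; here a toy two-region label map stands in for actual SAM segments, and NDVI is normalized from \([-1, 1]\) to \([0, 1]\) (an assumption on my part):

```python
import numpy as np

def ndvi_mask(nir, red, segments):
    """Per-segment mask value M_s = 1 - mean(NDVI_norm) over each segment,
    assigning high values to low-vegetation (building-like) regions.
    `segments` is an integer label map, e.g. from SAM."""
    ndvi = (nir - red) / (nir + red + 1e-8)
    ndvi_norm = (ndvi + 1.0) / 2.0                 # map [-1, 1] -> [0, 1]
    mask = np.zeros_like(ndvi_norm)
    for s in np.unique(segments):
        sel = segments == s
        mask[sel] = 1.0 - ndvi_norm[sel].mean()
    return mask

# Toy scene: left half vegetation (high NIR), right half built-up.
nir = np.where(np.arange(8)[None, :] < 4, 0.8, 0.3) * np.ones((8, 8))
red = np.where(np.arange(8)[None, :] < 4, 0.1, 0.3) * np.ones((8, 8))
segs = (np.arange(8)[None, :] >= 4).astype(int) * np.ones((8, 8), int)
M = ndvi_mask(nir, red, segs)
```

The vegetated half ends up with a low mask value and the built-up half with a higher one, so conditioning on `M` steers attention toward buildings as described above.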

During sampling, DPM-Solver++ is used for accelerated generation with second-order approximation and adaptive step sizes.

3. 3D U-Net + Spectral Fidelity Enhancer (SFE)

The network backbone employs a 3D U-Net to model spectral-spatial coupled features, integrated with a Spectral Fidelity Enhancer (SFE) to ensure spectral consistency.

Loss & Training

Multi-level loss function:

$$\mathcal{L} = \lambda(t) \cdot (\lambda_1 \mathcal{L}_{pixel} + \lambda_2 \mathcal{L}_{perc} + \lambda_3 \mathcal{L}_{grad})$$

Weights are set to \(\lambda_1=0.8, \lambda_2=0.1, \lambda_3=0.1\).

  • Pixel loss (spectral accuracy): Average of L2 norm and SAM angular loss: \(\mathcal{L}_{pixel} = (\|\textbf{z}_0 - \hat{\textbf{z}}_0\|^2 + \text{SAM}(\textbf{z}_0, \hat{\textbf{z}}_0))/2\)
  • Perceptual loss (high-level feature similarity): L2 distance in VGG feature space
  • Gradient loss (edge sharpness): L1 distance of image gradients in the x/y directions
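The three terms can be sketched in NumPy; `perc` below is a placeholder scalar standing in for the VGG-feature distance (which requires a pretrained network), and the time weighting \(\lambda(t)\) is omitted:

```python
import numpy as np

def sam_angle(a, b, eps=1e-8):
    """Spectral Angle Mapper: mean angle between per-pixel spectra.
    a, b: (bands, H, W)."""
    dot = (a * b).sum(axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return np.arccos(np.clip(dot / denom, -1.0, 1.0)).mean()

def gradient_loss(a, b):
    """L1 distance between x/y image gradients (edge-sharpness term)."""
    gx = np.abs(np.diff(a, axis=-1) - np.diff(b, axis=-1)).mean()
    gy = np.abs(np.diff(a, axis=-2) - np.diff(b, axis=-2)).mean()
    return gx + gy

def multi_level_loss(z0, z0_hat, perc, w=(0.8, 0.1, 0.1)):
    """L = w1*L_pixel + w2*L_perc + w3*L_grad, where L_pixel averages
    the L2 error and the SAM angular error."""
    pixel = (((z0 - z0_hat) ** 2).mean() + sam_angle(z0, z0_hat)) / 2
    grad = gradient_loss(z0, z0_hat)
    return w[0] * pixel + w[1] * perc + w[2] * grad

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8, 8))
loss_same = multi_level_loss(z0, z0, perc=0.0)   # near zero for identical inputs
```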

Training is conducted on 4 NVIDIA A100 GPUs with a learning rate of \(1 \times 10^{-4}\) for 200 epochs.

Key Experimental Results

Main Results

MDAS Sample 1 Dataset (4× Super-resolution)

| Method | PSNR↑ | SSIM↑ | SAM↓ | FID↓ | LV↑ |
|---|---|---|---|---|---|
| MCNet | 28.300 | 0.6658 | 8.333 | 116.14 | 0.0004 |
| MSDFormer | 28.284 | 0.6592 | 8.744 | 103.74 | 0.0004 |
| DMGASR | 26.986 | 0.5831 | 11.34 | 49.03 | 0.0037 |
| HIR-Diff | 24.833 | 0.6401 | 8.954 | 50.60 | 0.0021 |
| SNLSR | 28.531 | 0.6718 | 7.891 | 125.75 | 0.0003 |
| GEWDiff (Ours) | 28.863 | 0.7104 | 8.428 | 44.46 | 0.0041 |

WDC Dataset

| Method | PSNR↑ | SSIM↑ | SAM↓ | FID↓ | CC↑ |
|---|---|---|---|---|---|
| MCNet | 33.389 | 0.7441 | 8.550 | 464.13 | 0.6495 |
| ESSAFormer | 25.504 | 0.4120 | 18.72 | 701.35 | 0.6326 |
| HIR-Diff | 34.473 | 0.7362 | 8.360 | 363.23 | 0.7102 |
| SNLSR | 35.734 | 0.7525 | 7.661 | 470.34 | 0.7733 |
| GEWDiff (Ours) | 35.837 | 0.7747 | 7.474 | 238.12 | 0.7906 |

Ablation Study

| Configuration | PSNR↑ | SAM↓ | FID↓ | Note |
|---|---|---|---|---|
| w/o RWA & PCA (Baseline) | 2.048 | 124.2 | 5019 | Complete failure; cannot handle high-dimensional data |
| RWA only | 15.79 | 85.24 | 484.2 | Wavelet compression yields substantial improvement |
| RWA+PCA | 25.64 | 15.13 | 83.63 | PCA further significantly boosts performance |
| +Mask | 26.58 | 11.77 | 43.45 | Geometric mask improves building generation |
| +Edge | 26.68 | 12.16 | 36.27 | Edge scheduling enhances sharpness |
| Full model (Ours) | 27.01 | 11.50 | 34.94 | All components achieve optimal synergy |

Key Findings

  1. GEWDiff achieves the best or near-best results across all four evaluation dimensions: fidelity (PSNR/SSIM), spectral accuracy (SAM), visual realism (FID), and sharpness (LV); its only notable gap is SAM on MDAS, where SNLSR scores slightly lower.
  2. Compared to conventional CNN-based methods (e.g., MCNet), GEWDiff holds a substantial advantage in FID (44 vs. 116), indicating more realistic texture generation.
  3. Compared to existing diffusion-based methods (HIR-Diff, DMGASR), GEWDiff achieves significantly higher fidelity while maintaining generative realism.
  4. The RWA+PCA encoder is foundational — without it, the model fails entirely.
  5. Edge-aware scheduling and mask conditioning contribute most to the FID metric (83→35), demonstrating that geometric enhancement is critical for visual quality.

Highlights & Insights

  • The training-free RWA+PCA encoder is an elegant design: it leverages the multi-scale decomposition capability of wavelet transforms and the orthogonal compression of PCA to achieve efficient hyperspectral compression without training a VAE.
  • Edge-aware noise scheduling: By applying less noise to edge regions during training, the model naturally becomes more adept at reconstructing edges — a principle generalizable to other generation tasks requiring structural preservation.
  • NDVI-based mask conditioning is a clever design tailored to remote sensing: vegetation indices distinguish buildings from natural areas without requiring additional annotations.

Limitations & Future Work

  • Training data covers only 15 cities, potentially limiting land cover diversity.
  • The model is large (4.55 GB) with a test time of 28.7 seconds, posing challenges for practical deployment.
  • Only 4× super-resolution is validated; performance at other scales (e.g., 8×, 16×) remains unknown.
  • The number of principal components in PCA is selected empirically and may affect generalization across different scenes.
  • FID is computed in RGB feature space, which has limitations for evaluating hyperspectral data.

Related Work & Connections

  • WaveDiff first demonstrated the advantages of diffusion in the wavelet domain; this work extends the idea to hyperspectral imaging by incorporating RWA and PCA.
  • The continuous noise scheduling of the EDM framework provides a natural interface for edge-aware scheduling.
  • The "compress-then-generate" paradigm resembles Latent Diffusion, but the encoder employs a physics-driven wavelet transform rather than a learned VAE.
  • Edge-aware noise scheduling may inspire super-resolution tasks in medical imaging and other domains requiring the preservation of fine structures.

Rating

  • Novelty: ⭐⭐⭐⭐ — The RWA+PCA encoder and edge-aware noise scheduling are novel contributions, though the overall framework is largely combinatorial.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets with detailed ablations, though comparisons with the most recent methods are limited.
  • Writing Quality: ⭐⭐⭐⭐ — Methods are clearly described with complete formulations.
  • Value: ⭐⭐⭐⭐ — Practically valuable for remote sensing hyperspectral super-resolution; the encoder design is transferable.