
GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution

Conference: AAAI 2026 arXiv: 2511.07103 Code: https://github.com/zhu-xlab/GEWDiff Area: Image Generation Keywords: Hyperspectral Image Super-resolution, Diffusion Model, Wavelet Transform, Geometric Enhancement, Remote Sensing

TL;DR

This paper proposes GEWDiff, a geometry-enhanced wavelet-based diffusion model for 4× hyperspectral image super-resolution. It compresses high-dimensional hyperspectral data into a compact latent space via a wavelet encoder-decoder, preserves geometric integrity with edge-aware noise scheduling and mask-conditional control, and stabilizes convergence with a multi-level loss function, achieving state-of-the-art performance.

Background & Motivation

Hyperspectral images (HSI) capture continuous spectral characteristics of ground objects, but high-resolution hyperspectral data are scarce due to sensor cost and coverage constraints. Existing HSI super-resolution methods face three major challenges:

High-dimensional HSI problem: HSI typically contains hundreds of bands (e.g., 242 bands), and directly feeding them into conventional diffusion models leads to memory overflow.

Geometric structure fidelity problem: General generative models lack understanding of the topology and geometric structures of ground objects in remote sensing imagery, making them prone to geometric distortion (especially for buildings) during super-resolution.

Convergence and quality problem: Most diffusion models optimize the loss at the noise level, resulting in unintuitive convergence behavior and suboptimal generation quality for complex data.

Limitations of Prior Work: CNN/GAN-based methods struggle to generate rich textures and complex spatial structures; existing HSI diffusion methods (SpectralDiff, HSR-Diff) either rely on two-stage training or fail to simultaneously guarantee spectral fidelity and visual quality.

Method

Overall Architecture

GEWDiff consists of three core components:

  1. Wavelet-based encoder-decoder: Compresses high-dimensional hyperspectral data into a low-dimensional latent space with near-lossless quality.
  2. Geometry-enhanced diffusion process: Incorporates an edge-aware noise scheduler and mask-controllable training.
  3. Multi-level loss function: Comprising pixel loss, perceptual loss, and gradient loss.

Key Designs

1. Wavelet Encoder-Decoder (RWA + PCA)

Core Idea: Regression Wavelet Analysis (RWA) combined with PCA is used to efficiently compress hyperspectral data to a dimensionality amenable to diffusion models.

Encoding process:

  • The input image \(\textbf{I}_{LR}\) undergoes \(J\)-level Haar wavelet decomposition, yielding principal coefficients \(\textbf{V}_{LR}^J\) and detail coefficients \(\textbf{w}_{LR}^j\).
  • Detail coefficients are predicted from principal coefficients via linear regression: \(\hat{\textbf{w}}_i^j = \beta_{i,0}^j + \beta_{i,1}^j \textbf{V}_1^j + \dots + \beta_{i,k}^j \textbf{V}_k^j\).
  • Only the principal coefficients and regression weights are stored; residuals are discarded.
  • PCA is applied to the principal coefficients: \((\textbf{z}_{LR}, \textbf{R}_{LR}) = \text{PCA}(\textbf{V}_{LR}^J)\).
  • The resulting \(\textbf{z}_{LR}\) serves as input to the diffusion model.

Decoding process:

  • The diffusion model output \(\hat{\textbf{z}}_0\) is passed through inverse PCA to recover the super-resolved principal coefficients.
  • The full hyperspectral image is reconstructed via inverse RWA, using the regression model saved during encoding to predict the detail coefficients.

The key advantage of this design is near-lossless spectral-spatial compression without lengthy training.
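The compress-then-reconstruct idea can be sketched in plain NumPy. This is a one-level toy version under my own naming and shape conventions (the paper uses a \(J\)-level decomposition, and the authors' actual implementation differs):

```python
import numpy as np

def haar_1level(cube):
    """One Haar wavelet level along the spectral axis.
    cube: (bands, H, W), with an even number of bands."""
    even, odd = cube[0::2], cube[1::2]
    approx = (even + odd) / np.sqrt(2)   # principal coefficients V
    detail = (even - odd) / np.sqrt(2)   # detail coefficients w
    return approx, detail

def fit_detail_regression(approx, detail):
    """Predict detail coefficients from principal ones by least squares,
    mirroring the RWA idea of storing only weights, not residuals."""
    A = approx.reshape(approx.shape[0], -1).T        # (pixels, k)
    A = np.hstack([np.ones((A.shape[0], 1)), A])     # intercept beta_0
    W = detail.reshape(detail.shape[0], -1).T        # (pixels, k)
    beta, *_ = np.linalg.lstsq(A, W, rcond=None)
    return beta

def pca_compress(approx, n_comp):
    """PCA over the spectral dimension of the principal coefficients."""
    X = approx.reshape(approx.shape[0], -1)          # (k, pixels)
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    R = U[:, :n_comp]                                # rotation matrix
    z = R.T @ (X - mean)                             # latent representation
    return z, R, mean

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 16, 16))                  # toy 8-band image
V, w = haar_1level(cube)
beta = fit_detail_regression(V, w)
z, R, mu = pca_compress(V, n_comp=2)
print(z.shape)   # (2, 256)
```

Decoding would invert these steps: `R @ z + mu` recovers the principal coefficients, and the stored `beta` predicts the detail coefficients for the inverse Haar step.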

2. Geometry-Enhanced Diffusion Process

Built upon the EDM (Elucidating Diffusion Models) framework, using continuous noise levels \(\sigma\) instead of discrete timesteps.

Edge-aware noise scheduler: Enhances the diffusion model's ability to generate edge pixels during training:

$$\textbf{z}_t = \textbf{z}_0 + \sigma_t \epsilon \odot (1 - \textbf{E}(1-\sigma_{norm}^2)\eta)$$

where \(\textbf{E}\) is a binary edge map. Noise near edges is smaller than in general regions (\(\eta=0.5\)), forcing the model to focus more on accurate reconstruction of edge regions during training. Key insight: when \(\sigma_{norm}\) is small (weak noise), edge modulation is stronger; when noise is large, \((1-\sigma_{norm}^2)\) approaches 0, eliminating edge influence and ensuring reasonable noise coverage during the initial stage.
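A minimal NumPy sketch of this scheduler, assuming \(\sigma_{norm} = \sigma_t / \sigma_{max}\) and a precomputed binary edge map (the function name and toy data are illustrative):

```python
import numpy as np

def edge_aware_noisify(z0, edge_map, sigma_t, sigma_max, eta=0.5, rng=None):
    """Edge-aware forward noising:
    z_t = z_0 + sigma_t * eps * (1 - E * (1 - sigma_norm^2) * eta),
    so edge pixels (E = 1) receive less noise when sigma is small,
    while large sigma makes the edge modulation vanish."""
    rng = rng or np.random.default_rng()
    sigma_norm = sigma_t / sigma_max                # normalize to [0, 1]
    eps = rng.normal(size=z0.shape)
    scale = 1.0 - edge_map * (1.0 - sigma_norm**2) * eta
    return z0 + sigma_t * eps * scale

rng = np.random.default_rng(0)
z0 = np.zeros((1, 8, 8))
E = np.zeros((1, 8, 8)); E[0, 4] = 1.0             # a toy horizontal "edge"
zt = edge_aware_noisify(z0, E, sigma_t=0.1, sigma_max=1.0, eta=0.5, rng=rng)
```

With \(\sigma_t = 0.1\), edge pixels get their noise scaled by \(1 - 0.99 \cdot 0.5 = 0.505\), about half the noise of non-edge pixels, which matches the "focus on edges at low noise" behavior described above.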

Mask-controllable training and sampling: Segmentation masks are obtained from the low-resolution RGB channels using the SAM model. Mask values are inverted based on the NDVI index:

$$M_s = 1 - \frac{1}{|S_s|}\sum_{(x,y) \in S_s} \text{NDVI}_{norm}(x,y)$$

High-NDVI (vegetation) regions receive low mask values, while low-NDVI (building) regions receive high mask values, directing the model's attention toward the geometric accuracy of buildings. During training:

$$\hat{\textbf{z}}_0 = f_\theta(\textbf{z}_t, \textbf{C}, \sigma_t), \quad \textbf{C} = [\textbf{z}_{LR}, \textbf{M}]$$
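The per-segment mask computation can be sketched as follows; here a toy two-region label map stands in for actual SAM segments, and NDVI is normalized from \([-1, 1]\) to \([0, 1]\) (an assumption on my part):

```python
import numpy as np

def ndvi_mask(nir, red, segments):
    """Per-segment mask value M_s = 1 - mean(NDVI_norm) over each segment,
    assigning high values to low-vegetation (building-like) regions.
    `segments` is an integer label map, e.g. from SAM."""
    ndvi = (nir - red) / (nir + red + 1e-8)
    ndvi_norm = (ndvi + 1.0) / 2.0                 # map [-1, 1] -> [0, 1]
    mask = np.zeros_like(ndvi_norm)
    for s in np.unique(segments):
        sel = segments == s
        mask[sel] = 1.0 - ndvi_norm[sel].mean()
    return mask

# Toy scene: left half vegetation (high NIR), right half built-up.
nir = np.where(np.arange(8)[None, :] < 4, 0.8, 0.3) * np.ones((8, 8))
red = np.where(np.arange(8)[None, :] < 4, 0.1, 0.3) * np.ones((8, 8))
segs = (np.arange(8)[None, :] >= 4).astype(int) * np.ones((8, 8), int)
M = ndvi_mask(nir, red, segs)
```

The vegetated half ends up with a low mask value and the built-up half with a higher one, so conditioning on `M` steers attention toward buildings as described above.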

During sampling, DPM-Solver++ is used for accelerated generation with second-order approximation and adaptive step sizes.

3. 3D U-Net + Spectral Fidelity Enhancer (SFE)

The network backbone employs a 3D U-Net to model spectral-spatial coupled features, integrated with a Spectral Fidelity Enhancer (SFE) to ensure spectral consistency.

Loss & Training

Multi-level loss function:

$$\mathcal{L} = \lambda(t) \cdot (\lambda_1 \mathcal{L}_{pixel} + \lambda_2 \mathcal{L}_{perc} + \lambda_3 \mathcal{L}_{grad})$$

Weights are set to \(\lambda_1=0.8, \lambda_2=0.1, \lambda_3=0.1\).

  • Pixel loss (spectral accuracy): Average of L2 norm and SAM angular loss: \(\mathcal{L}_{pixel} = (\|\textbf{z}_0 - \hat{\textbf{z}}_0\|^2 + \text{SAM}(\textbf{z}_0, \hat{\textbf{z}}_0))/2\)
  • Perceptual loss (high-level feature similarity): L2 distance in VGG feature space
  • Gradient loss (edge sharpness): L1 distance of image gradients in the x/y directions
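The three terms can be sketched in NumPy; `perc` below is a placeholder scalar standing in for the VGG-feature distance (which requires a pretrained network), and the time weighting \(\lambda(t)\) is omitted:

```python
import numpy as np

def sam_angle(a, b, eps=1e-8):
    """Spectral Angle Mapper: mean angle between per-pixel spectra.
    a, b: (bands, H, W)."""
    dot = (a * b).sum(axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    return np.arccos(np.clip(dot / denom, -1.0, 1.0)).mean()

def gradient_loss(a, b):
    """L1 distance between x/y image gradients (edge-sharpness term)."""
    gx = np.abs(np.diff(a, axis=-1) - np.diff(b, axis=-1)).mean()
    gy = np.abs(np.diff(a, axis=-2) - np.diff(b, axis=-2)).mean()
    return gx + gy

def multi_level_loss(z0, z0_hat, perc, w=(0.8, 0.1, 0.1)):
    """L = w1*L_pixel + w2*L_perc + w3*L_grad, where L_pixel averages
    the L2 error and the SAM angular error."""
    pixel = (((z0 - z0_hat) ** 2).mean() + sam_angle(z0, z0_hat)) / 2
    grad = gradient_loss(z0, z0_hat)
    return w[0] * pixel + w[1] * perc + w[2] * grad

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8, 8))
loss_same = multi_level_loss(z0, z0, perc=0.0)   # near zero for identical inputs
```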

Training is conducted on 4 NVIDIA A100 GPUs with a learning rate of \(1 \times 10^{-4}\) for 200 epochs.

Key Experimental Results

Main Results

MDAS Sample 1 Dataset (4× Super-resolution)

| Method | PSNR↑ | SSIM↑ | SAM↓ | FID↓ | LV↑ |
|---|---|---|---|---|---|
| MCNet | 28.300 | 0.6658 | 8.333 | 116.14 | 0.0004 |
| MSDFormer | 28.284 | 0.6592 | 8.744 | 103.74 | 0.0004 |
| DMGASR | 26.986 | 0.5831 | 11.34 | 49.03 | 0.0037 |
| HIR-Diff | 24.833 | 0.6401 | 8.954 | 50.60 | 0.0021 |
| SNLSR | 28.531 | 0.6718 | 7.891 | 125.75 | 0.0003 |
| GEWDiff (Ours) | 28.863 | 0.7104 | 8.428 | 44.46 | 0.0041 |

WDC Dataset

| Method | PSNR↑ | SSIM↑ | SAM↓ | FID↓ | CC↑ |
|---|---|---|---|---|---|
| MCNet | 33.389 | 0.7441 | 8.550 | 464.13 | 0.6495 |
| ESSAFormer | 25.504 | 0.4120 | 18.72 | 701.35 | 0.6326 |
| HIR-Diff | 34.473 | 0.7362 | 8.360 | 363.23 | 0.7102 |
| SNLSR | 35.734 | 0.7525 | 7.661 | 470.34 | 0.7733 |
| GEWDiff (Ours) | 35.837 | 0.7747 | 7.474 | 238.12 | 0.7906 |

Ablation Study

| Configuration | PSNR↑ | SAM↓ | FID↓ | Note |
|---|---|---|---|---|
| w/o RWA & PCA (Baseline) | 2.048 | 124.2 | 5019 | Complete failure; cannot handle high-dimensional data |
| RWA only | 15.79 | 85.24 | 484.2 | Wavelet compression yields substantial improvement |
| RWA+PCA | 25.64 | 15.13 | 83.63 | PCA further significantly boosts performance |
| +Mask | 26.58 | 11.77 | 43.45 | Geometric mask improves building generation |
| +Edge | 26.68 | 12.16 | 36.27 | Edge scheduling enhances sharpness |
| Full model (Ours) | 27.01 | 11.50 | 34.94 | All components achieve optimal synergy |

Key Findings

  1. GEWDiff achieves the best or near-best results across all four evaluation dimensions: fidelity (PSNR/SSIM), spectral accuracy (SAM), visual realism (FID), and sharpness (LV); its only notable gap is SAM on MDAS, where SNLSR scores slightly lower.
  2. Compared to conventional CNN-based methods (e.g., MCNet), GEWDiff holds a substantial advantage in FID (44 vs. 116), indicating more realistic texture generation.
  3. Compared to existing diffusion-based methods (HIR-Diff, DMGASR), GEWDiff achieves significantly higher fidelity while maintaining generative realism.
  4. The RWA+PCA encoder is foundational — without it, the model fails entirely.
  5. Edge-aware scheduling and mask conditioning contribute most to the FID metric (83→35), demonstrating that geometric enhancement is critical for visual quality.

Highlights & Insights

  • The training-free RWA+PCA encoder is an elegant design: it leverages the multi-scale decomposition capability of wavelet transforms and the orthogonal compression of PCA to achieve efficient hyperspectral compression without training a VAE.
  • Edge-aware noise scheduling: By applying less noise to edge regions during training, the model naturally becomes more adept at reconstructing edges — a principle generalizable to other generation tasks requiring structural preservation.
  • NDVI-based mask conditioning is a clever design tailored to remote sensing: vegetation indices distinguish buildings from natural areas without requiring additional annotations.

Limitations & Future Work

  • Training data covers only 15 cities, potentially limiting land cover diversity.
  • The model is large (4.55 GB) with a test time of 28.7 seconds, posing challenges for practical deployment.
  • Only 4× super-resolution is validated; performance at other scales (e.g., 8×, 16×) remains unknown.
  • The number of principal components in PCA is selected empirically and may affect generalization across different scenes.
  • FID is computed in RGB feature space, which has limitations for evaluating hyperspectral data.

Related Work & Connections

  • WaveDiff first demonstrated the advantages of diffusion in the wavelet domain; this work extends the idea to hyperspectral imaging by incorporating RWA and PCA.
  • The continuous noise scheduling of the EDM framework provides a natural interface for edge-aware scheduling.
  • The "compress-then-generate" paradigm resembles Latent Diffusion, but the encoder employs a physics-driven wavelet transform rather than a learned VAE.
  • Edge-aware noise scheduling may inspire super-resolution tasks in medical imaging and other domains requiring the preservation of fine structures.

Rating

  • Novelty: ⭐⭐⭐⭐ — The RWA+PCA encoder and edge-aware noise scheduling are novel contributions, though the overall framework is largely combinatorial.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets with detailed ablations, though comparisons with the most recent methods are limited.
  • Writing Quality: ⭐⭐⭐⭐ — Methods are clearly described with complete formulations.
  • Value: ⭐⭐⭐⭐ — Practically valuable for remote sensing hyperspectral super-resolution; the encoder design is transferable.