Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction¶
Conference: AAAI 2026 arXiv: 2512.12445 Code: To be confirmed Area: Scientific Computing Keywords: Masked Autoencoder, hyperspectral, LSMM, SAM loss, physics-informed, knowledge-guided ML
TL;DR¶
This paper proposes KARMA, a framework that embeds the Linear Spectral Mixing Model (LSMM) as a physics constraint within the ViT-MAE decoder, combined with a Spectral Angle Mapper (SAM) loss, to improve reconstruction fidelity and downstream transfer performance for hyperspectral remote sensing imagery.
Background & Motivation¶
Purely data-driven ViT-MAE faces several limitations in hyperspectral remote sensing: (1) it ignores the physical mixing mechanism of spectral data, where each pixel is a linear combination of the spectra of multiple surface materials; (2) the conventional MSE loss focuses solely on numerical accuracy while neglecting spectral shape fidelity; (3) hyperspectral data is high-dimensional (218 bands) and exhibits the mixed-pixel problem, making direct transfer from generic foundation models infeasible. The Knowledge-Guided Machine Learning (KGML) paradigm aims to embed domain knowledge into neural networks to improve interpretability and generalization.
Core Problem¶
How can remote sensing physical priors (spectral mixing models) be effectively embedded into a self-supervised Transformer framework, such that the learned representations are both data-efficient and physically consistent?
Method¶
Overall Architecture¶
KARMA = ViT-MAE backbone + LSMM physics branch + SAM angular loss + Huber robust loss
Key Designs¶
LSMM Embedding: A lightweight abundance head \(f_\theta\) (MLP: \(D \to D/2 \to M\)) is appended to the decoder to predict the abundance vector for each patch: \(\hat{x} = \text{softmax}(f_\theta(z))\). Physical reconstruction: \(\hat{r}_{phys} = A\hat{x}\), where \(A \in \mathbb{R}^{218 \times M}\) is the endmember matrix (randomly initialized, learned end-to-end). The softmax naturally enforces the non-negativity and sum-to-one constraints: \(\hat{x} \geq 0, \mathbf{1}^\top \hat{x} = 1\).
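A minimal PyTorch sketch of this physics branch, following the description above; the module name, the number of endmembers `num_endmembers` (\(M\), not stated in the summary), and the hidden activation are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class LSMMBranch(nn.Module):
    """Physics branch: predicts per-patch abundances and reconstructs the
    spectrum as a convex combination of learned endmembers (LSMM)."""
    def __init__(self, dim=512, num_endmembers=8, num_bands=218):
        super().__init__()
        # Lightweight abundance head f_theta: D -> D/2 -> M
        self.abundance_head = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_endmembers),
        )
        # Endmember matrix A in R^{218 x M}: randomly initialized, learned end-to-end
        self.endmembers = nn.Parameter(torch.randn(num_bands, num_endmembers))

    def forward(self, z):
        # z: (B, N, D) decoder tokens, one per patch
        # Softmax enforces non-negativity and sum-to-one over the M abundances
        abundances = torch.softmax(self.abundance_head(z), dim=-1)  # (B, N, M)
        # Physical reconstruction r_phys = A x_hat for every patch
        r_phys = abundances @ self.endmembers.T                     # (B, N, 218)
        return r_phys, abundances
```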
SAM Angular Loss: Preserves spectral shape independent of magnitude: \(\mathcal{L}_{SAM} = \frac{1}{N} \sum_{i=1}^N \arccos\frac{\langle \hat{r}_i, r_i \rangle}{\|\hat{r}_i\|_2 \|r_i\|_2 + \epsilon}\)
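A direct sketch of this loss in PyTorch (the clamp is an added numerical safeguard for `arccos`, not something specified in the paper):

```python
import torch

def sam_loss(pred, target, eps=1e-8):
    """Spectral Angle Mapper loss: mean angle between predicted and reference
    spectra, insensitive to per-pixel magnitude."""
    # pred, target: (..., num_bands)
    dot = (pred * target).sum(dim=-1)
    denom = torch.norm(pred, dim=-1) * torch.norm(target, dim=-1) + eps
    # Clamp keeps the cosine strictly inside (-1, 1) for a finite gradient
    angle = torch.arccos(torch.clamp(dot / denom, -1.0 + 1e-7, 1.0 - 1e-7))
    return angle.mean()
```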
Composite Objective: \(\mathcal{L} = \lambda_1 \mathcal{L}_{Huber} + \lambda_2 \mathcal{L}_{SAM} + \lambda_3 \mathcal{L}_{phys}\)
The three losses respectively ensure: numerical accuracy (Huber), spectral shape fidelity (SAM), and physical consistency (LSMM).
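Putting the pieces together, a hedged sketch of the composite objective; the loss weights are placeholders, and the exact form of \(\mathcal{L}_{phys}\) is not given in the summary, so an MSE between the physics-branch reconstruction and the target spectrum is assumed here:

```python
import torch.nn.functional as F

def karma_loss(r_hat, r_phys, target, lambdas=(1.0, 1.0, 1.0)):
    """Composite objective: Huber (numerical accuracy) + SAM (spectral shape)
    + physics-branch reconstruction (LSMM consistency, assumed MSE form)."""
    l1, l2, l3 = lambdas
    loss_huber = F.huber_loss(r_hat, target)      # robust numerical term
    loss_sam = sam_loss(r_hat, target)            # spectral-angle term (defined above)
    loss_phys = F.mse_loss(r_phys, target)        # assumed form of L_phys
    return l1 * loss_huber + l2 * loss_sam + l3 * loss_phys
```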
Architecture Details¶
Patch size 16×16, \(D=512\), \(H=8\) attention heads, 75% masking ratio, EnMAP 218-band input.
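For reference, the reported settings collected into an illustrative configuration; remaining hyperparameters (encoder/decoder depth, number of endmembers, loss weights) are not stated here:

```python
# Illustrative configuration matching the reported settings only.
karma_config = dict(
    num_bands=218,     # EnMAP hyperspectral input
    patch_size=16,     # 16x16 patches
    embed_dim=512,     # D
    num_heads=8,       # H
    mask_ratio=0.75,   # 75% of patches masked during pretraining
)
```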
Key Experimental Results¶
Reconstruction Quality:
| Model | Avg PSNR (dB) | Avg SSIM |
|---|---|---|
| ViT-MAE | 24.61 | 0.55 |
| KARMA | 27.38 (+11.3%) | 0.68 (+23.6%) |
Downstream Task (CDL Crop Classification):
| Metric | ViT-MAE + Head | KARMA + Head |
|---|---|---|
| Top-1 Acc | 48.26% | 66.81% (+38.5%) |
| mIoU | 34.88% | 46.37% (+33.0%) |
Cross-Region Generalization (NLCD Land Cover, CA→CO/KS): Cultivated Crops improves from 56.70% → 91.59% (+61.5%)
Computational Overhead: KARMA training costs 9.47 ms per sample vs. 7.19 ms for ViT-MAE (+31.7%); the overhead applies to training only.
Highlights & Insights¶
- LSMM serves as a "low-rank physics bottleneck," compelling the network to discover efficient, physically interpretable decompositions
- SAM loss focuses on spectral angle (shape) rather than amplitude, which is critical for material identification
- The triple-loss design accounts for numerical, geometric, and physical dimensions simultaneously
- Strong cross-region generalization (CA→CO/KS) demonstrates that physical priors enhance transferability
Limitations & Future Work¶
- Comparison is limited to vanilla ViT-MAE, with no benchmarking against state-of-the-art hyperspectral (HSI) methods
- The endmember matrix \(A\) is randomly initialized and learned end-to-end, with no guarantee of correspondence to true physical endmembers
- Ablation study is incomplete — planned ablations (fixed vs. learned \(A\), effect of \(M\)) are not fully presented
- Dataset is restricted to EnMAP California regions at limited scale (5,000 tiles for pretraining)
Related Work & Insights¶
| Method | Physics Constraint | Spectral Angular Loss | Interpretability |
|---|---|---|---|
| ViT-MAE | ✗ | ✗ | Low |
| SatMAE | ✗ | ✗ | Low |
| HyperKD | Distillation | ✗ | Medium |
| KARMA | LSMM | SAM | High |
The paradigm of using a "physics model as a decoder branch" is generalizable to other domains with physical priors. The multi-loss design combining numerical, geometric, and physical objectives is broadly applicable. The learnable endmember matrix is essentially physics-guided dictionary learning.
Rating¶
⭐⭐⭐⭐ — The method is elegantly designed with a clear physical embedding rationale, but experimental comparisons and ablation studies are insufficiently comprehensive.