Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction¶
Conference: AAAI 2026 arXiv: 2512.12445 Code: To be confirmed Area: Scientific Computing Keywords: Masked Autoencoder, hyperspectral, LSMM, SAM loss, physics-informed, knowledge-guided ML
TL;DR¶
This paper proposes KARMA, a framework that embeds the Linear Spectral Mixing Model (LSMM) as a physics constraint within the ViT-MAE decoder, combined with a Spectral Angle Mapper (SAM) loss, to improve reconstruction fidelity and downstream transfer performance for hyperspectral remote sensing imagery.
Background & Motivation¶
Purely data-driven ViT-MAE faces several limitations in hyperspectral remote sensing: (1) it ignores the physical mixing mechanism of spectral data, where each pixel is a linear combination of the spectra of multiple surface materials; (2) the conventional MSE loss focuses solely on numerical accuracy while neglecting spectral shape fidelity; (3) hyperspectral data is high-dimensional (218 bands) and exhibits the mixed-pixel problem, making direct transfer from generic foundation models infeasible. The Knowledge-Guided Machine Learning (KGML) paradigm aims to embed domain knowledge into neural networks to improve interpretability and generalization.
Core Problem¶
How can remote sensing physical priors (spectral mixing models) be effectively embedded into a self-supervised Transformer framework, such that the learned representations are both data-efficient and physically consistent?
Method¶
Overall Architecture¶
KARMA = ViT-MAE backbone + LSMM physics branch + SAM angular loss + Huber robust loss
Key Designs¶
LSMM Embedding: A lightweight abundance head \(f_\theta\) (MLP: \(D \to D/2 \to M\)) is appended to the decoder to predict the abundance vector for each patch: \(\hat{x} = \text{softmax}(f_\theta(z))\). Physical reconstruction: \(\hat{r}_{phys} = A\hat{x}\), where \(A \in \mathbb{R}^{218 \times M}\) is the endmember matrix (randomly initialized, learned end-to-end). The softmax naturally enforces the non-negativity and sum-to-one constraints: \(\hat{x} \geq 0, \mathbf{1}^\top \hat{x} = 1\).
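A minimal PyTorch sketch of this physics branch, following the description above; the module name, the number of endmembers `num_endmembers` (\(M\), not stated in the summary), and the hidden activation are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class LSMMBranch(nn.Module):
    """Physics branch: predicts per-patch abundances and reconstructs the
    spectrum as a convex combination of learned endmembers (LSMM)."""
    def __init__(self, dim=512, num_endmembers=8, num_bands=218):
        super().__init__()
        # Lightweight abundance head f_theta: D -> D/2 -> M
        self.abundance_head = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_endmembers),
        )
        # Endmember matrix A in R^{218 x M}: randomly initialized, learned end-to-end
        self.endmembers = nn.Parameter(torch.randn(num_bands, num_endmembers))

    def forward(self, z):
        # z: (B, N, D) decoder tokens, one per patch
        # Softmax enforces non-negativity and sum-to-one over the M abundances
        abundances = torch.softmax(self.abundance_head(z), dim=-1)  # (B, N, M)
        # Physical reconstruction r_phys = A x_hat for every patch
        r_phys = abundances @ self.endmembers.T                     # (B, N, 218)
        return r_phys, abundances
```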
SAM Angular Loss: Preserves spectral shape independent of magnitude: \(\mathcal{L}_{SAM} = \frac{1}{N} \sum_{i=1}^N \arccos\frac{\langle \hat{r}_i, r_i \rangle}{\|\hat{r}_i\|_2 \|r_i\|_2 + \epsilon}\)
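A direct sketch of this loss in PyTorch (the clamp is an added numerical safeguard for `arccos`, not something specified in the paper):

```python
import torch

def sam_loss(pred, target, eps=1e-8):
    """Spectral Angle Mapper loss: mean angle between predicted and reference
    spectra, insensitive to per-pixel magnitude."""
    # pred, target: (..., num_bands)
    dot = (pred * target).sum(dim=-1)
    denom = torch.norm(pred, dim=-1) * torch.norm(target, dim=-1) + eps
    # Clamp keeps the cosine strictly inside (-1, 1) for a finite gradient
    angle = torch.arccos(torch.clamp(dot / denom, -1.0 + 1e-7, 1.0 - 1e-7))
    return angle.mean()
```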
Composite Objective: \(\mathcal{L} = \lambda_1 \mathcal{L}_{Huber} + \lambda_2 \mathcal{L}_{SAM} + \lambda_3 \mathcal{L}_{phys}\)
The three losses respectively ensure: numerical accuracy (Huber), spectral shape fidelity (SAM), and physical consistency (LSMM).
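Putting the pieces together, a hedged sketch of the composite objective; the loss weights are placeholders, and the exact form of \(\mathcal{L}_{phys}\) is not given in the summary, so an MSE between the physics-branch reconstruction and the target spectrum is assumed here:

```python
import torch.nn.functional as F

def karma_loss(r_hat, r_phys, target, lambdas=(1.0, 1.0, 1.0)):
    """Composite objective: Huber (numerical accuracy) + SAM (spectral shape)
    + physics-branch reconstruction (LSMM consistency, assumed MSE form)."""
    l1, l2, l3 = lambdas
    loss_huber = F.huber_loss(r_hat, target)      # robust numerical term
    loss_sam = sam_loss(r_hat, target)            # spectral-angle term (defined above)
    loss_phys = F.mse_loss(r_phys, target)        # assumed form of L_phys
    return l1 * loss_huber + l2 * loss_sam + l3 * loss_phys
```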
Architecture Details¶
Patch size 16×16, \(D=512\), \(H=8\) attention heads, 75% masking ratio, EnMAP 218-band input.
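For reference, the reported settings collected into an illustrative configuration; remaining hyperparameters (encoder/decoder depth, number of endmembers, loss weights) are not stated here:

```python
# Illustrative configuration matching the reported settings only.
karma_config = dict(
    num_bands=218,     # EnMAP hyperspectral input
    patch_size=16,     # 16x16 patches
    embed_dim=512,     # D
    num_heads=8,       # H
    mask_ratio=0.75,   # 75% of patches masked during pretraining
)
```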
Key Experimental Results¶
Reconstruction Quality:
| Model | Avg PSNR (dB) | Avg SSIM |
|---|---|---|
| ViT-MAE | 24.61 | 0.55 |
| KARMA | 27.38 (+11.3%) | 0.68 (+23.6%) |
Downstream Task (CDL Crop Classification):
| Metric | ViT-MAE + Head | KARMA + Head |
|---|---|---|
| Top-1 Acc | 48.26% | 66.81% (+38.5%) |
| mIoU | 34.88% | 46.37% (+33.0%) |
Cross-Region Generalization (NLCD Land Cover, CA→CO/KS): Cultivated Crops improves from 56.70% → 91.59% (+61.5%)
Computational Overhead: KARMA training costs 9.47 ms per sample vs. 7.19 ms for ViT-MAE (+31.7%); the overhead applies to training only.
Highlights & Insights¶
- LSMM serves as a "low-rank physics bottleneck," compelling the network to discover efficient, physically interpretable decompositions
- SAM loss focuses on spectral angle (shape) rather than amplitude, which is critical for material identification
- The triple-loss design accounts for numerical, geometric, and physical dimensions simultaneously
- Strong cross-region generalization (CA→CO/KS) demonstrates that physical priors enhance transferability
Limitations & Future Work¶
- Comparison is limited to vanilla ViT-MAE, with no benchmarking against state-of-the-art hyperspectral (HSI) methods
- The endmember matrix \(A\) is randomly initialized and learned end-to-end, with no guarantee of correspondence to true physical endmembers
- Ablation study is incomplete — planned ablations (fixed vs. learned \(A\), effect of \(M\)) are not fully presented
- Dataset is restricted to EnMAP California regions at limited scale (5,000 tiles for pretraining)
Related Work & Insights¶
| Method | Physics Constraint | Spectral Angular Loss | Interpretability |
|---|---|---|---|
| ViT-MAE | ✗ | ✗ | Low |
| SatMAE | ✗ | ✗ | Low |
| HyperKD | Distillation | ✗ | Medium |
| KARMA | LSMM | SAM | High |
The paradigm of using a "physics model as a decoder branch" is generalizable to other domains with physical priors. The multi-loss design combining numerical, geometric, and physical objectives is broadly applicable. The learnable endmember matrix is essentially physics-guided dictionary learning.
Rating¶
⭐⭐⭐⭐ — The method is elegantly designed with a clear physical embedding rationale, but experimental comparisons and ablation studies are insufficiently comprehensive.