Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

Conference: AAAI 2026 arXiv: 2512.12445 Code: To be confirmed Area: Scientific Computing Keywords: Masked Autoencoder, hyperspectral, LSMM, SAM loss, physics-informed, knowledge-guided ML

TL;DR

This paper proposes KARMA, a framework that embeds the Linear Spectral Mixing Model (LSMM) as a physics constraint within the ViT-MAE decoder, combined with a Spectral Angle Mapper (SAM) loss, to improve reconstruction fidelity and downstream transfer performance for hyperspectral remote sensing imagery.

Background & Motivation

A purely data-driven ViT-MAE faces several limitations in hyperspectral remote sensing: (1) it ignores the physical mixing mechanism of spectral data, where each pixel is a linear combination of the spectra of multiple surface materials; (2) the conventional MSE loss rewards numerical accuracy alone while neglecting spectral shape fidelity; (3) hyperspectral data is high-dimensional (218 bands in this work) and exhibits the mixed-pixel problem, making direct transfer from generic foundation models infeasible. The Knowledge-Guided Machine Learning (KGML) paradigm addresses this by embedding domain knowledge into neural networks to improve interpretability and generalization.

Core Problem

How to effectively embed remote sensing physical priors (spectral mixing models) into a self-supervised Transformer framework, such that the learned representations are both data-efficient and physically consistent.

Method

Overall Architecture

KARMA = ViT-MAE backbone + LSMM physics branch + SAM angular loss + Huber robust loss

Key Designs

LSMM Embedding: A lightweight abundance head \(f_\theta\) (MLP: \(D \to D/2 \to M\)) is appended to the decoder to predict the abundance vector for each patch, \(\hat{x} = \mathrm{softmax}(f_\theta(z))\). The physical reconstruction is \(\hat{r}_{phys} = A\hat{x}\), where \(A \in \mathbb{R}^{218 \times M}\) is the endmember matrix (randomly initialized and learned end-to-end). The softmax naturally enforces the LSMM non-negativity and sum-to-one constraints: \(\hat{x} \geq 0,\ \mathbf{1}^\top \hat{x} = 1\).
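As a concrete sketch, the abundance head and LSMM reconstruction can be written in a few lines of NumPy. The MLP weights `W1`/`W2`, the ReLU activation, and the initialization below are illustrative assumptions, not the authors' exact implementation; only the softmax-constrained abundances and the \(A\hat{x}\) reconstruction follow the paper's description.

```python
import numpy as np

def softmax(v, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsmm_reconstruct(z, W1, W2, A):
    """Predict abundances from latents z and reconstruct spectra via LSMM.

    z:  (N, D)      decoder latents, one per patch
    W1: (D, D//2)   first layer of the abundance head (assumed MLP weights)
    W2: (D//2, M)   second layer, projecting to M endmember abundances
    A:  (B, M)      learnable endmember matrix (B spectral bands)
    """
    h = np.maximum(z @ W1, 0.0)   # hidden layer with ReLU (assumed)
    x = softmax(h @ W2)           # abundances: non-negative, rows sum to one
    r_phys = x @ A.T              # (N, B) physical reconstruction A x
    return x, r_phys

# Toy dimensions matching the paper's setup (D=512, 218 bands; M is assumed)
rng = np.random.default_rng(0)
N, D, M, B = 4, 512, 16, 218
z = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, D // 2)) * 0.05
W2 = rng.normal(size=(D // 2, M)) * 0.05
A = rng.random(size=(B, M))
x, r = lsmm_reconstruct(z, W1, W2, A)
```

Because the softmax output lies on the simplex, the abundance constraints hold by construction rather than via penalty terms, which is what makes the branch a hard physics bottleneck.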

SAM Angular Loss: Preserves spectral shape independent of magnitude: \(\mathcal{L}_{SAM} = \frac{1}{N} \sum_{i=1}^N \arccos\frac{\langle \hat{r}_i, r_i \rangle}{\|\hat{r}_i\|_2 \|r_i\|_2 + \epsilon}\)
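A minimal NumPy version of this loss follows the formula directly; the clipping before `arccos` is a numerical-safety assumption on my part, not something the paper specifies.

```python
import numpy as np

def sam_loss(r_hat, r, eps=1e-8):
    """Mean spectral angle (radians) between predicted and reference spectra.

    Scale-invariant: multiplying r_hat by a positive constant leaves the
    angle unchanged, so the loss measures spectral *shape*, not amplitude.
    """
    num = (r_hat * r).sum(axis=-1)
    den = np.linalg.norm(r_hat, axis=-1) * np.linalg.norm(r, axis=-1) + eps
    cos = np.clip(num / den, -1.0, 1.0)   # guard arccos against rounding
    return np.arccos(cos).mean()

spectrum = np.array([[0.1, 0.4, 0.3, 0.2]])
print(sam_loss(3.0 * spectrum, spectrum))  # near 0: scaling preserves shape
```

The scale invariance is exactly why SAM complements an amplitude-sensitive loss like MSE or Huber: a reconstruction that is uniformly too bright but spectrally correct incurs almost no SAM penalty.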

Composite Objective: \(\mathcal{L} = \lambda_1 \mathcal{L}_{Huber} + \lambda_2 \mathcal{L}_{SAM} + \lambda_3 \mathcal{L}_{phys}\)

The three losses respectively ensure: numerical accuracy (Huber), spectral shape fidelity (SAM), and physical consistency (LSMM).
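The combination can be sketched as below. The Huber form of \(\mathcal{L}_{phys}\) (penalizing the LSMM branch's reconstruction against the true spectra) and the \(\lambda\) values are placeholder assumptions; the paper's exact weighting is not reproduced here.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large (robust)."""
    d = np.abs(pred - target)
    return np.where(d <= delta, 0.5 * d ** 2, delta * (d - 0.5 * delta)).mean()

def sam_loss(r_hat, r, eps=1e-8):
    """Mean spectral angle between predicted and reference spectra."""
    cos = (r_hat * r).sum(axis=-1) / (
        np.linalg.norm(r_hat, axis=-1) * np.linalg.norm(r, axis=-1) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0)).mean()

def karma_loss(r_hat, r_phys, r, lambdas=(1.0, 0.1, 0.1)):
    """Composite objective: numerical (Huber) + shape (SAM) + physics terms.

    r_hat:  decoder reconstruction; r_phys: LSMM-branch reconstruction A x;
    r:      ground-truth spectra. lambdas are illustrative, not the paper's.
    """
    l1, l2, l3 = lambdas
    return (l1 * huber(r_hat, r)
            + l2 * sam_loss(r_hat, r)
            + l3 * huber(r_phys, r))   # assumed: L_phys compares A x to r
```

A perfect reconstruction drives all three terms to (near) zero, while a reconstruction that matches amplitudes but distorts spectral shape is still penalized through the SAM term.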

Architecture Details

Patch size 16×16, \(D=512\), \(H=8\) attention heads, 75% masking ratio, EnMAP 218-band input.

Key Experimental Results

Reconstruction Quality:

Model     Avg PSNR (dB)     Avg SSIM
ViT-MAE   24.61             0.55
KARMA     27.38 (+11.3%)    0.68 (+23.6%)

Downstream Task (CDL Crop Classification):

Metric      ViT-MAE + Head   KARMA + Head
Top-1 Acc   48.26%           66.81% (+38.5%)
mIoU        34.88%           46.37% (+33.0%)

Cross-Region Generalization (NLCD Land Cover, CA→CO/KS): Cultivated Crops improves from 56.70% → 91.59% (+61.5%)

Computational Overhead: KARMA training costs 9.47 ms per sample vs. 7.19 ms for ViT-MAE (+31.7%); this overhead applies to training only.

Highlights & Insights

  • LSMM serves as a "low-rank physics bottleneck," compelling the network to discover efficient, physically interpretable decompositions
  • SAM loss focuses on spectral angle (shape) rather than amplitude, which is critical for material identification
  • The triple-loss design accounts for numerical, geometric, and physical dimensions simultaneously
  • Strong cross-region generalization (CA→CO/KS) demonstrates that physical priors enhance transferability

Limitations & Future Work

  • Comparison is limited to vanilla ViT-MAE; no benchmarking against HSI-SOTA methods
  • The endmember matrix \(A\) is randomly initialized and learned end-to-end, with no guarantee of correspondence to true physical endmembers
  • Ablation study is incomplete — planned ablations (fixed vs. learned \(A\), effect of \(M\)) are not fully presented
  • Dataset is restricted to EnMAP California regions at limited scale (5,000 tiles for pretraining)

Method Comparison

Method    Physics Constraint   Spectral Angular Loss   Interpretability
ViT-MAE   —                    —                       Low
SatMAE    —                    —                       Low
HyperKD   Distillation         —                       Medium
KARMA     LSMM                 SAM                     High

The paradigm of using a "physics model as a decoder branch" is generalizable to other domains with physical priors. The multi-loss design combining numerical, geometric, and physical objectives is broadly applicable. The learnable endmember matrix is essentially physics-guided dictionary learning.

Rating

⭐⭐⭐⭐ — The method is elegantly designed with a clear physical embedding rationale, but experimental comparisons and ablation studies are insufficiently comprehensive.