
Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning

Conference: NeurIPS 2025 arXiv: 2510.17959 Code: N/A Area: Astronomical Spectroscopy, Foundation Models, Self-Supervised Learning Keywords: Spectral Tokenizer, Heterogeneous Data Unification, Vision Transformer, Self-Supervised Pretraining, Astronomy

TL;DR

This paper proposes the first universal spectral tokenizer: a single model trained jointly on heterogeneous astronomical spectra (SDSS/DESI/GALAH/APOGEE) on their native wavelength grids, using continuous wavelength embeddings and a self-supervised reconstruction objective to produce aligned, uniform, and physically meaningful representations.

Background & Motivation

Root Cause

Key Challenge: Large astronomical surveys (SDSS, DESI, etc.) have collected millions of spectra, yet these cover different wavelength ranges and spectral resolutions.

State of the Field

Background: Existing analysis pipelines are fragmented: each survey requires independent preprocessing and task-specific models, precluding cross-survey knowledge sharing.

Limitations of Prior Work

Limitations of Prior Work: Fixed-grid methods for unifying multi-resolution data introduce interpolation artifacts and are computationally infeasible over broad wavelength ranges (a common grid covering all surveys at the highest resolution would require ~300K pixels).

Starting Point

Key Insight: The central challenge for scientific foundation models is learning universal representations from irregular, multi-resolution sequential data.

Method

Overall Architecture

  • Based on the Vision Transformer (ViT) architecture, adapted for one-dimensional spectral data processing.
  • Encoder: receives spectral data on native wavelength grids and produces uniform wavelength-aware embeddings.
  • Decoder: reconstructs the original spectrum from embeddings, supporting arbitrary output wavelength grids.
  • Self-supervised pretraining with lightweight downstream adaptation.

Key Designs

  1. Continuous Wavelength Embedding:

    • Applies per-pixel sinusoidal positional encoding \(PE(\lambda)_k\), with frequencies \(\omega_k\) evenly spaced in log space.
    • Operates directly on native wavelength grids without resampling or interpolation.
    • Wavelength embeddings are added to flux patches to inject positional wavelength information.
  2. Heterogeneous Input Handling:

    • Spectral normalization: divide by median flux to focus on relative variations.
    • Patch-level masking: patches with more than half bad pixels are flagged as invalid.
    • Bad patches are automatically excluded during attention computation.
  3. Loss-Aware Reconstruction:

    • The decoder receives sinusoidal embeddings of the target wavelength grid as additional input.
    • Gaussian likelihood reconstruction loss computed only over valid pixels.
    • \(\mathcal{L} = \frac{1}{N}\sum_i m_i \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}\)
    • Weighted by measurement uncertainty \(\sigma_i\), giving greater contribution to high-SNR pixels.
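The three key designs above can be sketched compactly. This is a minimal illustration, not the paper's implementation: the embedding dimension, frequency range `omega_min`/`omega_max`, and function names are assumed for illustration; only the log-spaced frequencies, the more-than-half-bad-pixels rule, and the loss formula come from the text.

```python
import numpy as np

def wavelength_embedding(lam, dim=64, omega_min=1e-3, omega_max=10.0):
    """Continuous sinusoidal embedding of raw wavelength values.

    Frequencies are evenly spaced in log space, so the encoding works
    directly on native wavelength grids without resampling. The
    frequency range here is an illustrative choice, not the paper's.
    """
    lam = np.asarray(lam, dtype=float)
    omega = np.logspace(np.log10(omega_min), np.log10(omega_max), dim // 2)
    ang = lam[:, None] * omega[None, :]                 # (n_pixels, dim/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (n_pixels, dim)

def patch_validity(bad_pixel_mask, patch_size=32):
    """A patch is flagged invalid when more than half its pixels are bad;
    invalid patches would then be excluded from attention."""
    patches = np.asarray(bad_pixel_mask, dtype=float).reshape(-1, patch_size)
    return patches.mean(axis=1) <= 0.5                  # True = valid patch

def masked_gaussian_loss(y, y_hat, sigma, mask):
    """Uncertainty-weighted reconstruction loss over valid pixels:
    L = (1/N) * sum_i m_i * (y_i - y_hat_i)^2 / sigma_i^2,
    so high-SNR pixels (small sigma) contribute more."""
    mask = np.asarray(mask, dtype=float)
    return np.mean(mask * (y - y_hat) ** 2 / sigma ** 2)
```

For example, embedding a 320-pixel SDSS-like grid with `wavelength_embedding(np.linspace(3600, 10400, 320))` yields one `dim`-dimensional vector per pixel, which is added to the corresponding flux-patch embedding.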

Model & Training Configuration

  • Encoder: 6 layers; Decoder: 6 layers; embedding dimension 512; 8 attention heads.
  • Patch size: 32 pixels; batch size: 64.
  • AdamW optimizer, learning rate 1e-4, trained for 600k steps.
  • 4× NVIDIA A100-SXM4-40GB GPUs, 48 hours of training.

Key Experimental Results

Training Data Overview

| Dataset | Wavelength Range | Resolution | Object Type |
|---|---|---|---|
| SDSS DR17 | 3600–10400 Å | R~2000 | Galaxies/Quasars/Stars |
| DESI DR1 | 3600–9800 Å | R~5000 | Galaxies/Quasars/Stars |
| GALAH DR3 | 4700–7900 Å | R~28000 | Stars |
| APOGEE | 1.51–1.7 μm | R~22500 | Stars |

Main Results

Object Classification (DESI Spectra)

| Model | Galaxy | Quasar | Star | Average |
|---|---|---|---|---|
| Zhong et al. (specialized model) | 93% | 99% | 98% | 96% |
| Ours + Adaptation Module | 94% | 97% | 98% | 96% |

Stellar Parameter Estimation (APOGEE Spectra)

| Model | log g | T_eff | [Fe/H] |
|---|---|---|---|
| The Cannon 2 | 0.07 dex | 38 K | 0.03 dex |
| astroNN | 0.05 dex | 30 K | 0.02 dex |
| Olney et al. | 0.15 dex | 100 K | 0.07 dex |
| Ours + Adaptation Module | 0.07 dex | 23 K | 0.02 dex |

Key Findings

  • A single model achieves unified representation across 4 surveys, spanning optical/infrared wavelengths and stars/galaxies/quasars.
  • The embedding space, learned without supervision, naturally exhibits physical structure: UMAP visualizations reveal clear gradients of stellar mass and redshift.
  • A lightweight adaptation module (frozen encoder + linear layer) is sufficient to achieve competitive downstream performance against specialized baselines.
  • Reconstruction quality spans multiple orders of magnitude in flux and covers diverse physical phenomena.
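The "frozen encoder + linear layer" adaptation above amounts to a linear probe: embeddings are computed once and only a linear head is fit. A minimal sketch with synthetic data (the embeddings, target, and dimensions here are invented stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 256 spectra, each already mapped by a (frozen)
# pretrained encoder to a 64-dimensional mean-pooled embedding.
embeddings = rng.normal(size=(256, 64))

# Synthetic regression target (imagine a stellar parameter like T_eff),
# constructed to be nearly linear in the embedding plus a little noise.
true_w = rng.normal(size=64)
labels = embeddings @ true_w + 0.01 * rng.normal(size=256)

# Lightweight adaptation: fit only a linear head by least squares;
# the encoder itself is never updated.
X = np.hstack([embeddings, np.ones((256, 1))])   # append a bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)

pred = X @ w
rmse = np.sqrt(np.mean((pred - labels) ** 2))
```

In practice the head could equally be a single trainable linear layer optimized with gradient descent; the point is that all representational heavy lifting is done by the frozen pretrained encoder.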

Highlights & Insights

  • This work represents the first unified spectral model across surveys, resolutions, and object types, carrying significant methodological importance.
  • The continuous wavelength embedding elegantly circumvents the limitations of fixed grids and naturally generalizes to other irregular sequential data such as time series.
  • The pretrained representations operate without redshift information, breaking the circular dependency of "estimate redshift before analysis."
  • The domain-agnostic architectural design positions this work as a potential building block for scientific foundation models.

Limitations & Future Work

  • More advanced pretraining objectives such as masked autoencoding or contrastive learning have not been explored.
  • Downstream tasks use simple mean pooling, discarding intra-sequence wavelength-dependent information.
  • Cross-survey transfer learning (e.g., training on one survey and evaluating on another) has not been demonstrated.
  • The model scale is relatively small (6+6 layers); performance gains from larger models remain to be explored.
  • The use of continuous positional encodings for irregular grids is generalizable to medical signals, climate data, and beyond.
  • The paradigm of a single encoder with multiple downstream adapters holds broad promise for scientific data applications.
  • Measurement-error-weighted reconstruction loss offers a valuable reference for handling noisy scientific data.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Both the problem formulation and proposed solution are pioneering)
  • Technical Contribution: ⭐⭐⭐⭐ (Universal architecture design is concise and effective)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Validated across multiple datasets and tasks)
  • Writing Quality: ⭐⭐⭐⭐ (Motivation is clear and presentation is thorough)