Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning¶
Conference: NeurIPS 2025 arXiv: 2510.17959 Code: N/A Area: Astronomical Spectroscopy, Foundation Models, Self-Supervised Learning Keywords: Spectral Tokenizer, Heterogeneous Data Unification, Vision Transformer, Self-Supervised Pretraining, Astronomy
TL;DR¶
This paper proposes the first universal spectral tokenizer that jointly trains on heterogeneous astronomical spectra (SDSS/DESI/GALAH/APOGEE) on their native wavelength grids via continuous wavelength embeddings and self-supervised reconstruction objectives, producing aligned, uniform, and physically meaningful representations.
Background & Motivation¶
Root Cause¶
Key Challenge: Large astronomical surveys (SDSS, DESI, etc.) have collected millions of spectra, but these spectra span disparate wavelength ranges and spectral resolutions, making them difficult to analyze jointly.
State of the Field¶
Background: Existing analysis pipelines are fragmented: each survey requires independent preprocessing and task-specific models, precluding cross-survey knowledge sharing.
Limitations of Prior Work¶
Limitations of Prior Work: Fixed-grid methods for unifying multi-resolution data introduce interpolation artifacts and become computationally infeasible over broad wavelength ranges, where a common grid would require ~300K pixels.
Starting Point¶
Key Insight: The central challenge for scientific foundation models is learning universal representations from irregular, multi-resolution sequential data.
Method¶
Overall Architecture¶
- Based on the Vision Transformer (ViT) architecture, adapted for one-dimensional spectral data processing.
- Encoder: receives spectral data on native wavelength grids and produces uniform wavelength-aware embeddings.
- Decoder: reconstructs the original spectrum from embeddings, supporting arbitrary output wavelength grids.
- Self-supervised pretraining with lightweight downstream adaptation.
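The tokenization step implied by this architecture — splitting a spectrum on its native grid into fixed-size flux patches, each carrying a representative wavelength — can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name and the choice of mean patch wavelength are assumptions.

```python
import numpy as np

def patchify(flux, wavelengths, patch_size=32):
    """Split a spectrum on its native wavelength grid into patches.

    Returns flux patches and the mean wavelength of each patch,
    to which the continuous wavelength embedding is attached.
    Illustrative sketch; the ragged tail is simply dropped.
    """
    n = (len(flux) // patch_size) * patch_size
    f = flux[:n].reshape(-1, patch_size)              # (num_patches, patch_size)
    w = wavelengths[:n].reshape(-1, patch_size).mean(axis=1)
    return f, w
```

Because patches are formed on the native grid, no resampling is needed; spectra of different lengths simply yield different numbers of tokens.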
Key Designs¶
- Continuous Wavelength Embedding:
- Applies per-pixel sinusoidal positional encoding \(PE(\lambda)_k\), with frequencies \(\omega_k\) evenly spaced in log space.
- Operates directly on native wavelength grids without resampling or interpolation.
- Wavelength embeddings are added to flux patches to inject positional wavelength information.
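A per-pixel sinusoidal embedding with log-spaced frequencies can be sketched as below. The frequency bounds (`lam_min`, `lam_max`) are illustrative assumptions, not values from the paper; only the log-spacing of \(\omega_k\) is from the text.

```python
import numpy as np

def wavelength_embedding(wavelengths, dim=512, lam_min=3000.0, lam_max=20000.0):
    """Sinusoidal embedding of per-pixel wavelengths (in Angstroms).

    Frequencies omega_k are evenly spaced in log space so the embedding
    resolves both coarse and fine wavelength scales. Bounds are assumed.
    """
    half = dim // 2
    k = np.arange(half)
    omega = (2 * np.pi / lam_max) * (lam_max / lam_min) ** (k / (half - 1))
    phase = wavelengths[:, None] * omega[None, :]                    # (N, dim/2)
    return np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)   # (N, dim)
```

Because the embedding is a continuous function of wavelength, it is defined for any grid, which is what lets one encoder serve surveys with different sampling.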
- Heterogeneous Input Handling:
- Spectral normalization: divide by median flux to focus on relative variations.
- Patch-level masking: patches with more than half bad pixels are flagged as invalid.
- Bad patches are automatically excluded during attention computation.
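The normalization and patch-level masking rules above can be sketched in a few lines. This is a reconstruction under assumed conventions (boolean bad-pixel mask, >50% bad pixels invalidates a patch), not the paper's exact implementation.

```python
import numpy as np

def normalize_and_mask(flux, bad_pixel_mask, patch_size=32):
    """Median-normalize a spectrum and flag invalid patches.

    A patch is invalid when more than half of its pixels are bad;
    invalid patches are then excluded from attention. Sketch only.
    """
    flux = flux / np.median(flux)                 # keep relative variations
    n = (len(flux) // patch_size) * patch_size
    bad = bad_pixel_mask[:n].reshape(-1, patch_size)
    patch_valid = bad.mean(axis=1) <= 0.5         # >50% bad pixels -> invalid
    return flux[:n].reshape(-1, patch_size), patch_valid
```

The `patch_valid` flags would be passed to the transformer as an attention mask so that invalid tokens neither attend nor are attended to.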
- Uncertainty-Aware Reconstruction:
- The decoder receives sinusoidal embeddings of the target wavelength grid as additional input.
- Gaussian likelihood reconstruction loss computed only over valid pixels.
- \(\mathcal{L} = \frac{1}{N}\sum_i m_i \frac{(y_i - \hat{y}_i)^2}{\sigma_i^2}\), where \(m_i\) is the valid-pixel mask.
- Weighting by the measurement uncertainty \(\sigma_i\) gives high-SNR pixels a larger contribution.
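The masked, uncertainty-weighted loss is a direct translation of the formula above. A minimal sketch:

```python
import numpy as np

def reconstruction_loss(y, y_hat, sigma, mask):
    """L = (1/N) * sum_i m_i * (y_i - y_hat_i)^2 / sigma_i^2.

    Bad pixels (mask = 0) contribute nothing; low-noise (high-SNR)
    pixels receive a larger weight via 1/sigma^2.
    """
    n = len(y)
    return np.sum(mask * (y - y_hat) ** 2 / sigma ** 2) / n
```

This is the pixel-wise chi-squared form of a Gaussian likelihood with known per-pixel noise; the constant \(\log\sigma_i\) terms drop out because \(\sigma_i\) is measured, not predicted.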
Model & Training Configuration¶
- Encoder: 6 layers; Decoder: 6 layers; embedding dimension 512; 8 attention heads.
- Patch size: 32 pixels; batch size: 64.
- AdamW optimizer, learning rate 1e-4, trained for 600k steps.
- 4× NVIDIA A100-SXM4-40GB GPUs, 48 hours of training.
Key Experimental Results¶
Training Data Overview¶
| Dataset | Wavelength Range | Resolution | Object Type |
|---|---|---|---|
| SDSS DR17 | 3600–10400 Å | R~2000 | Galaxies/Quasars/Stars |
| DESI DR1 | 3600–9800 Å | R~5000 | Galaxies/Quasars/Stars |
| GALAH DR3 | 4700–7900 Å | R~28000 | Stars |
| APOGEE | 1.51–1.70 μm | R~22500 | Stars |
Main Results¶
Object Classification (DESI Spectra)
| Model | Galaxy | Quasar | Star | Average |
|---|---|---|---|---|
| Zhong et al. (specialized model) | 93% | 99% | 98% | 96% |
| Ours + Adaptation Module | 94% | 97% | 98% | 96% |
Stellar Parameter Estimation (APOGEE Spectra)
| Model | log g | T_eff | [Fe/H] |
|---|---|---|---|
| The Cannon 2 | 0.07 dex | 38 K | 0.03 dex |
| astroNN | 0.05 dex | 30 K | 0.02 dex |
| Olney et al. | 0.15 dex | 100 K | 0.07 dex |
| Ours + Adaptation Module | 0.07 dex | 23 K | 0.02 dex |
Key Findings¶
- A single model achieves unified representation across 4 surveys, spanning optical/infrared wavelengths and stars/galaxies/quasars.
- The embedding space, learned without supervision, naturally exhibits physical structure: UMAP visualizations reveal clear gradients in stellar mass and redshift.
- A lightweight adaptation module (frozen encoder + linear layer) is sufficient to achieve competitive downstream performance against specialized baselines.
- Reconstructions remain accurate across multiple orders of magnitude in flux and capture diverse physical phenomena.
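The lightweight adaptation described above (frozen encoder + linear layer on pooled embeddings) can be sketched with a closed-form linear probe. This stands in for gradient training of the head and is not the paper's code; the function name is illustrative.

```python
import numpy as np

def linear_probe(embeddings, targets):
    """Fit a single linear layer on frozen, mean-pooled encoder
    embeddings via least squares (a sketch of the adaptation module;
    closed-form fit stands in for gradient-based training)."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # add bias column
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return lambda E: np.hstack([E, np.ones((len(E), 1))]) @ W
```

Because the encoder stays frozen, the only trainable parameters are those of this head, which is what keeps downstream adaptation cheap.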
Highlights & Insights¶
- This work represents the first unified spectral model across surveys, resolutions, and object types, carrying significant methodological importance.
- The continuous wavelength embedding elegantly circumvents the limitations of fixed grids and naturally generalizes to other irregular sequential data such as time series.
- The pretrained representations operate without redshift information, breaking the circular dependency of "estimate redshift before analysis."
- The domain-agnostic architectural design positions this work as a potential building block for scientific foundation models.
Limitations & Future Work¶
- More advanced pretraining objectives such as masked autoencoding or contrastive learning have not been explored.
- Downstream tasks use simple mean pooling, discarding intra-sequence wavelength-dependent information.
- Cross-survey transfer learning (e.g., training on one survey and evaluating on another) has not been demonstrated.
- The model scale is relatively small (6+6 layers); performance gains from larger models remain to be explored.
Related Work & Insights¶
- The use of continuous positional encodings for irregular grids is generalizable to medical signals, climate data, and beyond.
- The paradigm of a single encoder with multiple downstream adapters holds broad promise for scientific data applications.
- Measurement-error-weighted reconstruction loss offers a valuable reference for handling noisy scientific data.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Both the problem formulation and proposed solution are pioneering)
- Technical Contribution: ⭐⭐⭐⭐ (Universal architecture design is concise and effective)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Validated across multiple datasets and tasks)
- Writing Quality: ⭐⭐⭐⭐ (Motivation is clear and presentation is thorough)