AstroCo: Self-Supervised Conformer-Style Transformers for Light-Curve Embeddings¶
Conference: NeurIPS 2025 (ML4PS Workshop)
arXiv: 2509.24134
Code: To be released
Area: Physics
Keywords: Self-supervised learning, Conformer, light curves, astronomical time series, masked reconstruction, few-shot classification
TL;DR¶
This paper proposes AstroCo, a self-supervised encoder that brings the Conformer architecture (attention + depthwise convolution + gating) to irregular astronomical light curves. On the MACHO dataset, the larger AstroCo variant reduces reconstruction RMSE by 61–70% compared to Astromer v1/v2, and few-shot classification macro-F1 improves by approximately 7% (relative).
Background & Motivation¶
Large-scale astronomical surveys: Surveys such as MACHO and LSST generate massive volumes of unlabeled stellar light curves; manual annotation is prohibitively expensive, creating a pressing need for label-efficient representation learning.
Self-supervised pioneer Astromer: Donoso-Oliva et al. proposed Astromer v1/v2, representative works in the field that employ pure Transformer encoders with masked reconstruction pre-training.
Limitations of pure attention: Standard Transformers treat each time step uniformly, making it difficult to capture short-duration local phenomena in light curves (dips, flares, bursts) and lacking explicit control over noisy or temporally distant observations.
Inspiration from Conformer: The Conformer architecture (Gulati et al., 2020) from speech recognition demonstrated that a complementary "attention + convolution" design can simultaneously model global dependencies and local patterns.
Gating mechanisms: GLU (Dauphin et al., 2017) enables networks to adaptively select which local features to retain, which is particularly important for noisy astronomical observations.
Core motivation: Transfer the Conformer-style design to irregular astronomical time series to achieve better reconstruction accuracy and downstream classification performance with fewer training resources.
Method¶
Overall Architecture¶
Light curve {(t_i, m_i, σ_i)}
→ Input embedding (concatenation fusion instead of additive fusion)
→ M Conformer-style encoder blocks
→ Inter-layer learnable scalar mixing
→ Masked mean pooling → sequence embedding
Three Sub-layer Designs¶
(1) Multi-Head Self-Attention (MHSA)¶
- Standard multi-head self-attention + Dropout + residual connection + LayerNorm (post-norm).
- Responsible for modeling long-range dependencies between any two time steps in the light curve.
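The attention sub-layer is standard; a minimal single-head NumPy sketch (toy shapes; dropout, the residual connection, and LayerNorm are omitted) illustrates how every time step attends to every other:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One attention head over the T time steps of a light curve.
    Dropout, residual connection and LayerNorm are omitted."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (T, T) pairwise similarities
    return softmax(scores) @ v                # each step mixes all time steps

rng = np.random.default_rng(0)
T, D, dk = 6, 8, 4                            # toy sizes, not the paper's
x = rng.standard_normal((T, D))
out = self_attention(x,
                     rng.standard_normal((D, dk)),
                     rng.standard_normal((D, dk)),
                     rng.standard_normal((D, dk)))
```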
(2) Depthwise Separable Convolution Sub-layer (Depthwise Conv + GLU)¶
- LayerNorm → 1×1 pointwise projection expanding the dimension from \(D\) to \(2D\) → GLU gating (split into val and gate, \(\text{val} \odot \sigma(\text{gate})\)).
- Followed by a depthwise Conv1D (kernel size \(K=32\), selected after a hyperparameter search over \(K=5\)–\(128\)), which convolves each channel independently along the time dimension to capture local temporal patterns; together with the surrounding pointwise projections this forms a depthwise separable convolution.
- BatchNorm1d → SiLU activation → 1×1 pointwise projection back to \(D\) → residual connection + LN.
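A NumPy sketch of the two distinctive pieces of this sub-layer, the GLU gate and the per-channel ("depthwise") convolution, with toy sizes; the pointwise projections, BatchNorm, SiLU, and residual path are omitted:

```python
import numpy as np

def glu(x):
    """GLU: split channels into val and gate halves, return val * sigmoid(gate)."""
    val, gate = np.split(x, 2, axis=-1)
    return val * (1.0 / (1.0 + np.exp(-gate)))

def depthwise_conv1d(x, kernels):
    """Per-channel 1D convolution along time with 'same' padding.
    x: (T, D), kernels: (D, K) -- one length-K filter per channel."""
    T, D = x.shape
    K = kernels.shape[1]
    pad_left = K // 2
    xp = np.pad(x, ((pad_left, K - 1 - pad_left), (0, 0)))
    out = np.empty_like(x)
    for d in range(D):
        # cross-correlation = convolution with the reversed kernel
        out[:, d] = np.convolve(xp[:, d], kernels[d, ::-1], mode="valid")
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
h = glu(np.concatenate([x, x], axis=-1))       # toy 2D -> D GLU input
y = depthwise_conv1d(h, rng.standard_normal((4, 5)))
```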
(3) Gated Feed-Forward Network (Gated FFN)¶
- Expansion ratio \(r=4\); the input is projected separately into two \(rD\)-dimensional vectors:
- \(\text{val} = \text{GeLU}(W_{\text{val}} X)\)
- \(\text{gate} = \sigma(W_{\text{gate}} X)\)
- Output: \(Y = X + \text{Dropout}(W_{\text{out}}(\text{val} \odot \text{gate}))\), followed by LN.
- The gating mechanism adaptively determines which global features are retained.
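The equations above map directly onto a few lines of NumPy; a minimal sketch with toy dimensions (dropout and the trailing LayerNorm omitted):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GeLU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_ffn(x, W_val, W_gate, W_out):
    """val = GeLU(x W_val), gate = sigmoid(x W_gate),
    Y = x + (val * gate) W_out  (dropout and final LN omitted)."""
    val = gelu(x @ W_val)                        # (T, r*D)
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))   # (T, r*D), values in (0, 1)
    return x + (val * gate) @ W_out              # residual, back to (T, D)

rng = np.random.default_rng(0)
D, r, T = 8, 4, 5                                # expansion ratio r = 4
x = rng.standard_normal((T, D))
y = gated_ffn(x,
              0.1 * rng.standard_normal((D, r * D)),
              0.1 * rng.standard_normal((D, r * D)),
              0.1 * rng.standard_normal((r * D, D)))
```

Because the gate saturates at 0 or 1 per feature, the network can softly switch individual expanded features on or off before they are projected back down.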
Input Embedding¶
- Photometric values \((m_i, \sigma_i)\) are projected to \(D/2\) dimensions; timestamps \(t_i\) are mapped to \(D/2\) dimensions via sinusoidal embeddings.
- The two representations are concatenated and fused into \(D\) dimensions through a linear layer + GeLU + LN, avoiding the scale mismatch problem of additive fusion.
- Masked/padding tokens replace original values with zeros to prevent information leakage.
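A minimal sketch of the concatenation fusion, assuming classic Transformer-style sinusoidal time features and toy dimensions (the GeLU and LayerNorm after the fusion layer are omitted):

```python
import numpy as np

def sinusoidal_embed(t, dim):
    """Map raw (irregular) timestamps to dim sinusoidal features (dim even)."""
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    ang = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (T, dim)

def embed_inputs(t, m, sigma, W_phot, W_fuse):
    """Concatenation fusion: (m, sigma) -> D/2, t -> D/2 (sinusoidal),
    then concatenate and fuse with one linear layer (GeLU + LN omitted)."""
    phot = np.stack([m, sigma], axis=-1) @ W_phot       # (T, D/2)
    time = sinusoidal_embed(t, W_phot.shape[1])         # (T, D/2)
    return np.concatenate([phot, time], axis=-1) @ W_fuse  # (T, D)

rng = np.random.default_rng(0)
T, D = 7, 8
emb = embed_inputs(rng.random(T) * 100.0,    # irregular timestamps
                   rng.standard_normal(T),   # magnitudes m_i
                   rng.random(T),            # uncertainties sigma_i
                   rng.standard_normal((2, D // 2)),
                   rng.standard_normal((D, D)))
```

Because the photometric and time features occupy disjoint halves of the concatenated vector, neither can be drowned out by a scale mismatch, unlike additive fusion.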
Inter-layer Scalar Mixing¶
- Inspired by BERT layer analysis (Tenney et al., 2019), learnable scalar weights \(\{w_\ell\}\) (including the input layer) are normalized via softmax: \(\alpha_\ell = \frac{\exp(w_\ell)}{\sum_j \exp(w_j)}\).
- The final representation is \(\tilde{x}_i = \sum_{\ell=0}^{M} \alpha_\ell x_i^{(\ell)}\), allowing the model to adaptively fuse shallow and deep features.
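The mixing formula is a few lines of code; a NumPy sketch with toy sizes, where equal weights reduce to a plain average over layers:

```python
import numpy as np

def scalar_mix(layer_outputs, w):
    """Softmax-normalize learnable scalars over the M+1 layers
    (input layer included) and return the weighted sum of representations."""
    alpha = np.exp(w - w.max())
    alpha /= alpha.sum()                  # alpha_l = exp(w_l) / sum_j exp(w_j)
    # layer_outputs: (M+1, T, D); result: (T, D)
    return np.tensordot(alpha, layer_outputs, axes=1)

# Toy check: three "layers" with constant values 0, 1, 2 and w = 0
# give equal weights 1/3 each, i.e. the mean.
layers = np.stack([np.full((4, 2), float(l)) for l in range(3)])
mixed = scalar_mix(layers, np.zeros(3))
```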
Loss & Training¶
- Pre-training: BERT-style masked reconstruction; 50% of positions are probe targets (30% masked, 10% randomly replaced, 10% unchanged), with RMSE computed over probe positions as the loss.
- Downstream classification: The encoder is frozen; only a linear head is trained using cross-entropy loss.
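The masking scheme and probe-restricted loss can be sketched as follows (NumPy, with illustrative random replacements; all percentages are over the full sequence, as in the paper):

```python
import numpy as np

def mask_sequence(mags, rng):
    """BERT-style probing: 50% of positions become loss targets, of which
    30% of the sequence is zero-masked, 10% randomly replaced, and the
    remaining 10% left unchanged."""
    u = rng.random(mags.shape[0])
    probe = u < 0.5                             # 50% probe targets
    x = mags.copy()
    x[u < 0.3] = 0.0                            # 30% masked to zero
    repl = (u >= 0.3) & (u < 0.4)
    x[repl] = rng.standard_normal(repl.sum())   # 10% random replacement
    return x, probe                             # 0.4 <= u < 0.5: unchanged

def probe_rmse(pred, target, probe):
    """RMSE computed only over probe positions."""
    diff = (pred - target)[probe]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
mags = rng.standard_normal(200)            # window length 200, as in the paper
corrupted, probe = mask_sequence(mags, rng)
perfect = probe_rmse(mags, mags, probe)    # a perfect reconstruction scores 0
```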
Key Experimental Results¶
Dataset¶
- Pre-training: MACHO survey R band, approximately 1.5 million single-band light curves, window length 200.
- Classification: MACHO LMC variable star catalog, 20,894 light curves, 6 classes (Cepheid I/II, eclipsing binary, long-period variable, RR Lyrae ab/c).
Masked Reconstruction Results (Table 1)¶
| Model | RMSE ↓ | R² ↑ |
|---|---|---|
| Astromer v1 | 0.148 | — |
| Astromer v2 | 0.113 | 0.73 |
| AstroCo-S (5.9M) | 0.060 | 0.922 |
| AstroCo-L (15.2M) | 0.044 | 0.956 |
- AstroCo-S reduces RMSE by 59% over v1 and 47% over v2.
- AstroCo-L reduces RMSE by 70% over v1 and 61% over v2.
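The percentage reductions follow directly from the Table 1 RMSE values; a quick arithmetic check:

```python
def rel_reduction(new_rmse, base_rmse):
    """Relative RMSE reduction as reported in the text: 1 - new/base."""
    return 1.0 - new_rmse / base_rmse

# Table 1 RMSE values
v1, v2, astro_s, astro_l = 0.148, 0.113, 0.060, 0.044
s_vs_v1 = round(rel_reduction(astro_s, v1), 2)  # 0.59
s_vs_v2 = round(rel_reduction(astro_s, v2), 2)  # 0.47
l_vs_v1 = round(rel_reduction(astro_l, v1), 2)  # 0.70
l_vs_v2 = round(rel_reduction(astro_l, v2), 2)  # 0.61
```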
Few-Shot Classification Results¶
- Under few-shot settings of 20/100/500 labels per class, AstroCo-S/L with a frozen encoder and linear head outperforms Astromer v1/v2 across all settings.
- Relative macro-F1 improvement of approximately 7% (Figure 3).
- Results are averaged over 3 folds × 3 seeds, with low variance across runs.
Key Findings¶
- Adding local convolution and gating significantly improves the representation quality of pure attention-based encoders.
- Scalar mixing outperforms fixed pooling strategies (e.g., using only the last layer).
- AstroCo-S (5.9M parameters, 11.6h, 4×A100) already surpasses Astromer v1/v2 (5.4M parameters, 3 days, 4×A5000), with higher resource efficiency.
- AstroCo-L (15.2M, 1.2 days, 4×H200) further advances the state-of-the-art.
Highlights & Insights¶
- Successful cross-domain transfer: Applying the Conformer design from speech to irregular astronomical time series validates the generality of the "attention + local convolution + gating" combination.
- High resource efficiency: The smaller AstroCo-S surpasses prior baselines with substantially less compute.
- Concatenation fusion replaces additive fusion to avoid dimension/scale mismatches—a practically useful engineering improvement.
- Scalar mixing allows the model to adaptively leverage features from different depths, offering more flexibility than always using the last layer.
- The self-supervised → freeze → linear probe paradigm validates foundation model few-shot transfer capability in the astronomical domain.
Limitations & Future Work¶
- Workshop paper only: Experimental scale and analytical depth are limited; only the MACHO single-survey dataset is used.
- Single-band: Multi-band fusion is not explored, whereas modern surveys (e.g., LSST) are inherently multi-band.
- Limited downstream tasks: Only variable star classification is evaluated; other important astronomical tasks such as anomaly detection and period estimation are not tested.
- Insufficient ablation: No isolated ablation experiments are provided for the gating, convolution, and scalar mixing components individually.
- Limited interpretability: The distribution of scalar mixing weights is not analyzed, and the behavior of the gating mechanism lacks visualization.
- Not open-sourced: Code and pre-trained weights are not yet publicly available, raising reproducibility concerns.
Related Work & Insights¶
| Work | Core Idea | Relation to This Paper |
|---|---|---|
| Astromer v1 (2023) | Transformer masked reconstruction pre-training for light curves | Direct baseline; AstroCo adds convolution and gating on top |
| Astromer v2 (2025) | Improved Transformer encoder | Stronger baseline; AstroCo still outperforms it significantly |
| Conformer (Gulati 2020) | Attention + convolution for speech recognition | Architectural inspiration |
| GLU (Dauphin 2017) | Gated linear units for language modeling | Core gating mechanism in the convolution sub-layer |
| BERT (Devlin 2019) | Masked language model pre-training | Source of the pre-training masking strategy |
| Scalar Mixing (Tenney 2019) | Inter-layer feature weighting | Inspiration for inter-layer aggregation |
Rating¶
- Novelty: ⭐⭐⭐ — Transferring Conformer to astronomical time series is moderately novel, but the architectural components themselves (MHSA, GLU, depthwise conv, scalar mixing) are all combinations of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐ — Evaluation across masked reconstruction and few-shot classification is clear, but limited to a single dataset, lacks ablations, and covers few downstream tasks.
- Writing Quality: ⭐⭐⭐⭐ — Within the workshop paper format, the structure is clear, the presentation is accurate, and architecture diagrams and equations are complete.
- Value: ⭐⭐⭐⭐ — Provides good reference value for the astronomical self-supervised learning community, validates the effectiveness of Conformer on irregular time series, and the resource efficiency advantage is noteworthy.