POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning
Conference: NeurIPS 2025 | arXiv: 2506.03511 | Code: Zenodo (dataset + code open-sourced) | Area: Physics | Keywords: high-contrast imaging, exoplanets, polarimetric imaging, self-supervised learning, diffusion models, contrastive learning
TL;DR
This work introduces POLARIS, the first ML benchmark dataset for exoplanetary polarimetric imaging (921 VLT/SPHERE/IRDIS polarimetric images + 75,910 preprocessed exposures), and proposes the Diff-SimCLR framework (diffusion-augmented contrastive learning), achieving 93% accuracy on the reference-star vs. target-star classification task with fewer than 10% manual annotations.
Background & Motivation
Background: Direct imaging is a key technique for detecting wide-orbit exoplanets, yet the extreme contrast ratio between stellar and planetary light (~\(10^{-6}\) to \(10^{-10}\)) demands high-contrast imaging (HCI) technology. Over the past decade, instruments such as GPI and SPHERE have acquired more than \(10^6\) images, but fewer than 40 exoplanets have been directly imaged.
Limitations of Prior Work: (a) Reference Differential Imaging (RDI) requires reference-star images free of disk signal to model the stellar PSF, yet reference-star selection relies entirely on manual inspection by astronomers—a time-consuming and non-scalable process; (b) the ML community lacks a standardized benchmark dataset for exoplanetary imaging; (c) only 96 images carry labels (<10%), leaving a large volume of unlabeled data unexploited.
Key Challenge: Automating reference-star selection could reduce observing costs by ~50% (approximately 10 nights, $350K), yet insufficient labeled data prevents training supervised models.
Goal: (i) Construct the first large-scale HCI polarimetric benchmark dataset for exoplanetary science; (ii) achieve automatic reference-star/target-star classification using a small labeled set combined with abundant unlabeled data; (iii) validate that classification results are applicable to downstream background reconstruction.
Key Insight: The PDI product \(Q_\phi\) directly encodes circumstellar dust-disk structure—the absence of disk signal naturally identifies a reference star, providing a weak supervisory signal.
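For context, the azimuthal Stokes parameters used in PDI can be written as follows (this is one common convention; sign conventions differ between pipelines, so treat the signs here as an assumption):

```latex
Q_\phi = -Q\cos(2\phi) - U\sin(2\phi), \qquad
U_\phi = \;\, Q\sin(2\phi) - U\cos(2\phi)
```

where \(\phi\) is the azimuth angle about the star. Single scattering off circumstellar dust appears as a positive \(Q_\phi\) signal, so a structureless \(Q_\phi\) frame flags a usable reference star.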
Core Idea: Leverage the latent-space denoising trajectories of a diffusion model to augment contrastive representations, reaching 93% classification accuracy with only 96 labeled samples.
Method
Overall Architecture
Raw VLT/SPHERE polarimetric observations → IRDAP unified preprocessing → 921 \(Q_\phi\) images and 75,910 preprocessed exposures → unsupervised/self-supervised representation learning to extract 32-dimensional features → downstream supervised/unsupervised classifiers for reference-star vs. target-star judgment → classification results used for VAE background reconstruction.
Key Designs
- POLARIS Dataset Construction:
- Function: Retrieve all publicly available SPHERE/IRDIS polarimetric observations from the ESO archive (2014–2024) and apply unified preprocessing.
- Mechanism: IRDAP pipeline for uniform processing → manual quality control to remove bad exposures → crop central 256×256 region → log transform + linear normalization to \([-4, 4]\). The final dataset contains 96 labeled \(Q_\phi\) + 813 unlabeled \(Q_\phi\) + 75,910 preprocessed exposures.
- Design Motivation: Prior HCI data were scattered across individual team publications without uniform processing or labeling; POLARIS fills this gap.
- Diff-SimCLR (Diffusion-Augmented Contrastive Learning):
- Function: Augment SimCLR's feature representations using denoising trajectories from a conditional DDPM.
- Mechanism: For input image \(x\), the first \(\Delta_t=8\) steps of diffusion latent states \(p=[x_0,...,x_{\Delta_t}]\) are extracted and encoded via a ResNet into \(h_p\); two augmented views are separately encoded by ResNet into \(h_1, h_2\); the concatenation \([h_i \| h_p]\) is projected through an MLP head to obtain \(z_i\), optimized with the InfoNCE loss.
- Design Motivation: The augmentation-invariant representations learned by standard SimCLR may lack the compactness needed to capture subtle inter-class differences; the denoising trajectories of the diffusion model supply additional structural priors that enhance discriminative capacity.
- Multi-Model Baseline Evaluation:
- Function: Systematically evaluate three categories of models—statistical, generative, and LVLMs—on POLARIS.
- Includes: MAE (masked autoencoder), DeepCluster, SimCLR, and seven large vision–language models (GPT-4o/4.1, Gemini, Llama, DeepSeek).
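The normalization step in the dataset-construction design above can be sketched in NumPy. The signed-log variant is an assumption (the paper specifies only "log transform + linear normalization"), and `preprocess_qphi` is a hypothetical helper name:

```python
import numpy as np

def preprocess_qphi(img, size=256):
    """Crop the central size x size region, apply a sign-preserving log
    transform, then rescale linearly to [-4, 4]."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    patch = img[top:top + size, left:left + size]
    logged = np.sign(patch) * np.log1p(np.abs(patch))  # compress dynamic range, keep sign
    lo, hi = logged.min(), logged.max()
    scaled = (logged - lo) / (hi - lo)                 # -> [0, 1]
    return 8.0 * scaled - 4.0                          # -> [-4, 4]

x = preprocess_qphi(np.random.default_rng(0).standard_normal((1024, 1024)) * 100)
print(x.shape, x.min(), x.max())
```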
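The Diff-SimCLR objective can be illustrated in NumPy: the trajectory embedding h_p is concatenated to each view's embedding before projection, and the standard NT-Xent/InfoNCE loss is applied. The encoders and MLP head are replaced here by random stand-ins, so this is only a sketch of the loss geometry, not the paper's architecture:

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """NT-Xent / InfoNCE over 2N projected views (SimCLR-style)."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine geometry
    sim = z @ z.T / tau
    n2 = z.shape[0]
    n = n2 // 2
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity
    targets = np.concatenate([np.arange(n, n2), np.arange(n)])  # positive = other view
    logits = sim - sim.max(axis=1, keepdims=True)      # numerically stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(n2), targets].mean()

rng = np.random.default_rng(0)
n, d = 8, 32
h1, h2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))  # two augmented views
h_p = rng.standard_normal((n, d))           # encoded diffusion trajectory [x_0, ..., x_8]
W = rng.standard_normal((2 * d, d))         # stand-in for the MLP projection head
z1 = np.concatenate([h1, h_p], axis=1) @ W  # concatenate [h_i || h_p], then project
z2 = np.concatenate([h2, h_p], axis=1) @ W
loss = info_nce(z1, z2)
print(loss)
```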
Loss & Training
- DDPM: 300 epochs, lr=1e-3, batch=16, standard denoising loss.
- Contrastive learning stage: DDPM parameters frozen, 200 epochs, lr=1e-3, batch=32, InfoNCE loss.
- Feature dimensionality: 32 (balancing representational capacity against overfitting risk).
- Evaluation: 10-fold stratified cross-validation; hyperparameters tuned via 5-fold grid search.
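The evaluation protocol above can be sketched with scikit-learn; the random features below are hypothetical stand-ins for the frozen encoder's 32-dimensional representations of the 96 labeled samples:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Stand-in features: 96 labeled samples x 32-d representations
# (in the paper these come from the learned encoder, not random noise).
rng = np.random.default_rng(0)
X = rng.standard_normal((96, 32))
y = np.tile([0, 1], 48)          # reference star (0) vs. target star (1)

# 10-fold stratified cross-validation of a downstream classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv)
print(scores.mean())
```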
Key Experimental Results
Main Results — Classification Accuracy
| Method | SVC | Random Forest | MLP | SVM | KNN | GMM | Spectral |
|---|---|---|---|---|---|---|---|
| MAE | 80.33 | 77.44 | 82.29 | 85.00 | 73.78 | 74.00 | 77.00 |
| SimCLR | 84.78 | 84.33 | 82.00 | 86.46 | 73.89 | 71.11 | 77.78 |
| DeepCluster | 67.67 | 74.00 | 70.83 | 69.67 | 70.67 | 72.00 | 74.89 |
| Diff-SimCLR | 93.00 | 89.67 | 92.71 | 89.56 | 75.00 | 74.22 | 77.33 |
LVLM Zero-Shot Classification
| Model | Accuracy |
|---|---|
| GPT-4o | 67.71 |
| GPT-4.1 | 75.00 |
| Gemini-2.0-Flash | 75.21 |
| Llama-3.2-11B | 48.96 |
| DeepSeek-VL2-Small | 50.00 |
Key Findings
- Diff-SimCLR yields the strongest representations under nearly every downstream classifier: with SVC it reaches 93%, surpassing SimCLR by 8.22 percentage points and DeepCluster by 25.33 percentage points.
- LVLMs show limited performance: The best-performing Gemini-2.0-Flash achieves only 75.21%; open-source LVLMs perform near chance (~50%), indicating that the task's domain specificity exceeds the capabilities of general-purpose LVLMs.
- VAE background reconstruction validation: A VAE trained on 206 reference images identified by spectral clustering successfully reconstructs the stellar PSF and recovers the target disk structure via subtraction.
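The reconstruction-and-subtraction principle in the last finding can be illustrated on synthetic data, substituting a simple low-rank (PCA) PSF model for the paper's VAE (the VAE plays this role in the actual pipeline; all names and data below are illustrative):

```python
import numpy as np

def psf_model(refs, rank=5):
    """Fit a low-rank basis to flattened reference-star frames."""
    mean = refs.mean(axis=0)
    _, _, vt = np.linalg.svd(refs - mean, full_matrices=False)
    return mean, vt[:rank]

def reconstruct(frame, mean, basis):
    """Project a target frame onto the reference basis (PSF-only model)."""
    coeffs = basis @ (frame - mean)
    return mean + basis.T @ coeffs

rng = np.random.default_rng(0)
# Toy data: 206 reference frames sharing one stellar PSF plus noise (64x64, flattened).
psf = np.exp(-np.linspace(-3, 3, 64) ** 2)
psf = np.outer(psf, psf).ravel()
refs = psf + 0.01 * rng.standard_normal((206, 64 * 64))
disk = np.zeros(64 * 64)
disk[2000:2100] = 0.5                                  # fake disk signal
target = psf + disk + 0.01 * rng.standard_normal(64 * 64)

# Model the PSF from references, subtract it from the target frame.
mean, basis = psf_model(refs)
residual = target - reconstruct(target, mean, basis)
print(residual[2000:2100].mean())                      # disk signal survives subtraction
```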
Highlights & Insights
- First HCI ML benchmark for exoplanetary science: Fills a critical gap for both the astronomical and ML communities; the data scale (921 PDI images + 75K exposures) is sufficient to support deep learning research.
- Diffusion trajectories as feature priors: Leveraging intermediate denoising states of a DDPM to augment contrastive learning is a transferable idea applicable to any scientific image classification task with low annotation rates.
- Potential to save 50% of telescope time: Automated reference-star selection eliminates the need for dedicated reference-star observations, offering substantial value for current and next-generation facilities such as the VLT, ELT, and HWO.
Limitations & Future Work
- The 96 labeled samples cover only the brightest protoplanetary disks; fainter debris disks may be misclassified as reference stars.
- Non-detection in polarized intensity does not imply absence of signal in total intensity—point sources such as exoplanets themselves may be invisible in polarimetry.
- Validation is limited to SPHERE/IRDIS data; cross-instrument generalization (GPI, CHARIS) requires further testing.
- The choice of diffusion steps \(\Delta_t=8\) in Diff-SimCLR lacks systematic ablation.
Related Work & Insights
- vs. KLIP: KLIP is the standard HCI pipeline and relies on manual reference-star selection; POLARIS provides an automated alternative.
- vs. 4S (Bonse 2025): 4S is an iterative PCA method for recovering exoplanetary signals but does not address automatic reference-star classification.
- Broader inspiration: The paradigm of diffusion-augmented contrastive learning is applicable to other low-annotation scientific imaging domains, such as medical imaging and remote sensing.
Rating
- Novelty: ⭐⭐⭐⭐ Novel combination of the first HCI ML benchmark and diffusion-augmented contrastive learning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-model comparison with practical VAE reconstruction validation.
- Writing Quality: ⭐⭐⭐⭐ Cross-disciplinary narrative effectively balances astronomical and ML backgrounds.
- Value: ⭐⭐⭐⭐⭐ Direct practical value for data-driven astronomical research and next-generation telescope facilities.