POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning

Conference: NeurIPS 2025 · arXiv: 2506.03511 · Code: Zenodo (dataset + code open-sourced) · Area: Physics · Keywords: high-contrast imaging, exoplanets, polarimetric imaging, self-supervised learning, diffusion models, contrastive learning

TL;DR

This work introduces POLARIS, the first ML benchmark dataset for exoplanetary polarimetric imaging (921 VLT/SPHERE/IRDIS polarimetric images + 75,910 preprocessed exposures), and proposes the Diff-SimCLR framework (diffusion-augmented contrastive learning), achieving 93% accuracy on the reference-star vs. target-star classification task with manual labels for fewer than 10% of the images.

Background & Motivation

Background: Direct imaging is a key technique for detecting wide-orbit exoplanets, yet the extreme contrast ratio between stellar and planetary light (~\(10^{-6}\) to \(10^{-10}\)) demands high-contrast imaging (HCI) technology. Over the past decade, instruments such as GPI and SPHERE have acquired more than \(10^6\) images, but fewer than 40 exoplanets have been directly imaged.

Limitations of Prior Work: (a) Reference Differential Imaging (RDI) requires reference-star images free of disk signal to model the stellar PSF, yet reference-star selection relies entirely on manual inspection by astronomers—a time-consuming and non-scalable process; (b) the ML community lacks a standardized benchmark dataset for exoplanetary imaging; (c) only 96 images carry labels (<10%), leaving a large volume of unlabeled data unexploited.

Key Challenge: Automating reference-star selection could reduce observing costs by ~50% (approximately 10 nights, $350K), yet insufficient labeled data prevents training supervised models.

Goal: (i) Construct the first large-scale HCI polarimetric benchmark dataset for exoplanetary science; (ii) achieve automatic reference-star/target-star classification using a small labeled set combined with abundant unlabeled data; (iii) validate that classification results are applicable to downstream background reconstruction.

Key Insight: The PDI product \(Q_\phi\) directly encodes circumstellar dust-disk structure—the absence of disk signal naturally identifies a reference star, providing a weak supervisory signal.

Core Idea: Leveraging latent-space diffusion trajectories generated by a diffusion model to augment contrastive learning representations, achieving 93% classification accuracy with only 96 labeled samples.

Method

Overall Architecture

Raw VLT/SPHERE polarimetric observations → IRDAP unified preprocessing → 921 \(Q_\phi\) images and 75,910 preprocessed exposures → unsupervised/self-supervised representation learning to extract 32-dimensional features → downstream supervised/unsupervised classifiers for reference-star vs. target-star judgment → classification results used for VAE background reconstruction.

Key Designs

  1. POLARIS Dataset Construction:

    • Function: Retrieve all publicly available SPHERE/IRDIS polarimetric observations from the ESO archive (2014–2024) and apply unified preprocessing.
    • Mechanism: IRDAP pipeline for uniform processing → manual quality control to remove bad exposures → crop central 256×256 region → log transform + linear normalization to \([-4, 4]\). The final dataset contains 96 labeled \(Q_\phi\) + 813 unlabeled \(Q_\phi\) + 75,910 preprocessed exposures.
    • Design Motivation: Prior HCI data were scattered across individual team publications without uniform processing or labeling; POLARIS fills this gap.
  2. Diff-SimCLR (Diffusion-Augmented Contrastive Learning):

    • Function: Augment SimCLR's feature representations using denoising trajectories from a conditional DDPM.
    • Mechanism: For input image \(x\), the first \(\Delta_t=8\) steps of diffusion latent states \(p=[x_0,...,x_{\Delta_t}]\) are extracted and encoded via a ResNet into \(h_p\); two augmented views are separately encoded by ResNet into \(h_1, h_2\); the concatenation \([h_i \| h_p]\) is projected through an MLP head to obtain \(z_i\), optimized with the InfoNCE loss.
    • Design Motivation: The augmentation-invariant representations learned by standard SimCLR may lack the compactness needed to capture subtle inter-class differences; the denoising trajectories of the diffusion model supply additional structural priors that enhance discriminative capacity.
  3. Multi-Model Baseline Evaluation:

    • Function: Systematically evaluate three categories of models—statistical, generative, and LVLMs—on POLARIS.
    • Includes: MAE (masked autoencoder), DeepCluster, SimCLR, and seven large vision–language models (GPT-4o/4.1, Gemini, Llama, DeepSeek).
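As an illustration of the normalization step in the dataset construction above, here is a minimal NumPy sketch. `preprocess_qphi` is a hypothetical helper name, and the exact log transform is an assumption (a signed log1p is used here, since \(Q_\phi\) images can be negative); the paper specifies only "log transform + linear normalization to \([-4, 4]\)".

```python
import numpy as np

def preprocess_qphi(img, size=256, out_range=(-4.0, 4.0)):
    """Crop the central region, log-compress, and linearly rescale.

    Hypothetical sketch of the POLARIS normalization; the exact log
    transform is not specified by the paper, so a signed log1p is
    assumed (Q_phi images can take negative values).
    """
    # Crop the central size x size window.
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    img = img[top:top + size, left:left + size]

    # Signed log transform compresses the extreme dynamic range of HCI data.
    img = np.sign(img) * np.log1p(np.abs(img))

    # Linear rescale to the target range [-4, 4].
    lo, hi = img.min(), img.max()
    a, b = out_range
    return a + (img - lo) * (b - a) / (hi - lo)

# Example on a synthetic full-size frame:
frame = np.random.default_rng(0).normal(scale=100.0, size=(1024, 1024))
out = preprocess_qphi(frame)
print(out.shape, out.min(), out.max())  # → (256, 256) -4.0 4.0
```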

Loss & Training

  • DDPM: 300 epochs, lr=1e-3, batch=16, standard denoising loss.
  • Contrastive learning stage: DDPM parameters frozen, 200 epochs, lr=1e-3, batch=32, InfoNCE loss.
  • Feature dimensionality: 32 (balancing representational capacity against overfitting risk).
  • Evaluation: 10-fold stratified cross-validation; hyperparameters tuned via 5-fold grid search.
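The contrastive stage's InfoNCE objective over the concatenated features \([h_i \| h_p]\) can be sketched as follows. This is a minimal NumPy illustration with random stand-in features; the paper's ResNet encoders and MLP projection head are omitted, and all variable names here are illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """NT-Xent / InfoNCE loss over a batch of paired embeddings.

    z1, z2: (N, D) projections of two augmented views of the same batch.
    Minimal NumPy sketch of the loss used in the contrastive stage.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine geometry
    sim = z @ z.T / tau                               # scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = len(z1)
    # The positive for sample i is the other view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Diff-SimCLR concatenates a diffusion-trajectory code h_p to each view's
# encoding before the projection head; random stand-in features below.
rng = np.random.default_rng(0)
h1, h2, h_p = rng.normal(size=(3, 32, 32))  # batch of 32, 32-dim features
z1 = np.concatenate([h1, h_p], axis=1)      # [h_1 || h_p]
z2 = np.concatenate([h2, h_p], axis=1)      # [h_2 || h_p]
print(float(info_nce(z1, z2)))              # a positive scalar loss
```

Note how the shared trajectory code \(h_p\) enters both views, acting as a structural prior on top of the usual augmentation-invariance signal.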

Key Experimental Results

Main Results — Classification Accuracy

Method        SVC      Random Forest   MLP      SVM      KNN      GMM      Spectral
MAE           80.33    77.44           82.29    85.00    73.78    74.00    77.00
SimCLR        84.78    84.33           82.00    86.46    73.89    71.11    77.78
DeepCluster   67.67    74.00           70.83    69.67    70.67    72.00    74.89
Diff-SimCLR   93.00    89.67           92.71    89.56    75.00    74.22    77.33

LVLM Zero-Shot Classification

Model                Accuracy
GPT-4o               67.71
GPT-4.1              75.00
Gemini-2.0-Flash     75.21
Llama-3.2-11B        48.96
DeepSeek-VL2-Small   50.00

Key Findings

  • Diff-SimCLR outperforms all baseline representations across downstream classifiers: with an SVC it reaches 93%, surpassing SimCLR by 8.22 percentage points and DeepCluster by 25.33 percentage points.
  • LVLMs show limited performance: The best-performing Gemini-2.0-Flash achieves only 75.21%; open-source LVLMs perform near chance (~50%), indicating that the task's domain specificity exceeds the capabilities of general-purpose LVLMs.
  • VAE background reconstruction validation: A VAE trained on 206 reference images identified by spectral clustering successfully reconstructs the stellar PSF and recovers the target disk structure via subtraction.
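The reconstruct-and-subtract idea behind that last finding can be illustrated with a PCA background model as a lightweight stand-in for the paper's VAE (PCA-style projection is also the basis of KLIP). All data below are synthetic and the setup is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in data: 206 "reference" frames = smooth stellar halo with
# varying brightness + noise; one "target" frame = halo + an off-center disk.
yy, xx = np.mgrid[0:64, 0:64]
halo = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / 200.0)
refs = (halo[None] * rng.uniform(0.8, 1.2, size=(206, 1, 1))
        + 0.01 * rng.normal(size=(206, 64, 64)))
disk = 0.3 * np.exp(-((yy - 20) ** 2 + (xx - 44) ** 2) / 20.0)
target = halo + disk + 0.01 * rng.normal(size=(64, 64))

# PCA background model (stand-in for the paper's VAE): project the target
# onto the top stellar-PSF modes of the reference set and subtract.
X = refs.reshape(206, -1)
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
basis = Vt[:5]                                  # top-5 PSF modes
t = target.ravel() - mu
background = mu + basis.T @ (basis @ t)         # reconstructed stellar PSF
residual = (target.ravel() - background).reshape(64, 64)

# The residual suppresses the halo but retains the injected disk at (20, 44).
print(residual[20, 44], residual[32, 32])
```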

Highlights & Insights

  • First HCI ML benchmark for exoplanetary science: Fills a critical gap for both the astronomical and ML communities; the data scale (921 PDI images + 75K exposures) is sufficient to support deep learning research.
  • Diffusion trajectories as feature priors: Leveraging intermediate denoising states of a DDPM to augment contrastive learning is a transferable idea applicable to any scientific image classification task with low annotation rates.
  • Potential to save 50% of telescope time: Automated reference-star selection eliminates the need for dedicated reference-star observations, offering substantial value for current and next-generation facilities such as VLT, ELT, and HWO.

Limitations & Future Work

  • Among the 96 labeled samples, only the brightest protoplanetary disks are included; fainter debris disks may be misclassified as reference stars.
  • Non-detection in polarized intensity does not imply absence of signal in total intensity—point sources such as exoplanets themselves may be invisible in polarimetry.
  • Validation is limited to SPHERE/IRDIS data; cross-instrument generalization (GPI, CHARIS) requires further testing.
  • The choice of diffusion steps \(\Delta_t=8\) in Diff-SimCLR lacks systematic ablation.
Comparison with Related Work

  • vs. KLIP: KLIP is the standard PSF-subtraction algorithm in HCI and relies on manually selected reference stars; POLARIS enables an automated alternative.
  • vs. 4S (Bonse 2025): 4S is an iterative PCA method for recovering exoplanetary signals but does not address automatic reference-star classification.
  • Broader inspiration: The paradigm of diffusion-augmented contrastive learning is applicable to other low-annotation scientific imaging domains, such as medical imaging and remote sensing.

Rating

  • Novelty: ⭐⭐⭐⭐ Novel combination of the first HCI ML benchmark and diffusion-augmented contrastive learning.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-model comparison with practical VAE reconstruction validation.
  • Writing Quality: ⭐⭐⭐⭐ Cross-disciplinary narrative effectively balances astronomical and ML backgrounds.
  • Value: ⭐⭐⭐⭐⭐ Direct practical value for data-driven astronomical research and next-generation telescope facilities.