Improving Posterior Inference of Galaxy Properties with Image-Based Conditional Flow Matching

Conference: NeurIPS 2025
arXiv: 2512.05078
Code: Not released
Area: Image Generation
Keywords: conditional flow matching, galaxy property estimation, simulation-based inference, morphology, posterior inference

TL;DR

This paper proposes a Conditional Flow Matching (CFM) framework that jointly models morphological information from galaxy images alongside photometric data, substantially improving posterior inference of physical galaxy properties including stellar mass, star formation rate, metallicity, and dust extinction.

Background & Motivation

Spectroscopy vs. photometry trade-off: Spectroscopic analysis is the gold standard for measuring physical galaxy properties, but it is prohibitively expensive at survey scale (e.g., millions of SDSS targets). Broadband photometry in the five ugriz filters scales well but retains only integrated fluxes, discarding spatial structure, color gradients, and other morphological cues.
Morphology encodes rich physical information: Prior work has established connections between the spatial structure of galaxy images and physical quantities such as stellar mass, star formation history, and metallicity (Wu & Boada 2019; Alfonzo+ 2024; Parker+ 2024), yet traditional SED-fitting pipelines cannot exploit image information.
Limitations of existing approaches: Doorenbos+ (2024) first generate spectra from images with a generative model and then infer physical properties from those spectra, which introduces an intermediate step; direct incorporation of morphology into SBI frameworks (e.g., Iglesias-Navarro+ 2025) is only beginning to emerge.
Core hypothesis: Explicitly incorporating image morphology into the SBI framework can tighten posterior distributions over galaxy properties and help break the dust–age degeneracy.

Core Problem

How can morphological information encoded in galaxy images improve posterior inference of physical properties without relying on spectroscopy? Specific objectives:

Quantify the improvement in posterior accuracy and informativeness afforded by image morphology.
Verify whether including images leads to more faithful recovery of known galaxy scaling relations.
Investigate the potential of morphological information to alleviate the dust–age degeneracy.

Method

Conditional Flow Matching (CFM) Framework

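Core Idea: Learn a time-dependent velocity field \(v_\phi(t, \theta, \mathcal{D})\) that transports samples from a simple Gaussian base distribution at \(t=0\) to the posterior \(p(\theta|\mathcal{D})\) at \(t=1\).
Interpolation path: Linear interpolation with small additive noise, \(\theta_t = (1-t)\theta_0 + t\theta_1 + \sigma\epsilon\), where \(\theta_0 \sim \mathcal{N}(0, I)\) is the base sample, \(\theta_1\) is the true parameter vector of the training galaxy, \(\epsilon \sim \mathcal{N}(0, I)\), and \(\sigma = 0.05\).
Training loss: Mean squared error between the predicted velocity and the target velocity, \(\mathbb{E}\,\lVert v_\phi(t, \theta_t, \mathcal{D}) - (\theta_1 - \theta_0) \rVert^2\).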
Inference: Fourth-order Runge-Kutta (RK4) integration over 100 steps from \(t=0\) to \(t=1\); 1,000 trajectories are sampled per object to approximate the posterior.
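The training target and the RK4 sampler described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: `training_pair`, `toy_velocity`, and `rk4_sample` are hypothetical names, the parameter values passed to `training_pair` are made up, and `toy_velocity` is a closed-form stand-in for the trained network. It is the exact marginal velocity for a one-dimensional Gaussian target \(\mathcal{N}(m, s^2)\) under the noise-free linear path (\(\sigma = 0\)), chosen only because its flow is known analytically.

```python
import random

SIGMA = 0.05  # path noise scale quoted in the text


def training_pair(theta1, rng, sigma=SIGMA):
    """Draw (t, theta_t, target velocity) for one training example."""
    t = rng.random()                                  # t ~ U[0, 1)
    theta0 = [rng.gauss(0.0, 1.0) for _ in theta1]    # Gaussian base sample
    eps = [rng.gauss(0.0, 1.0) for _ in theta1]       # path noise
    theta_t = [(1 - t) * a + t * b + sigma * e
               for a, b, e in zip(theta0, theta1, eps)]
    target = [b - a for a, b in zip(theta0, theta1)]  # theta_1 - theta_0
    return t, theta_t, target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def toy_velocity(t, theta, m=2.0, s=0.5):
    """ASSUMPTION: analytic velocity for a N(m, s^2) target, not a trained net."""
    var_t = (1 - t) ** 2 + (t * s) ** 2
    slope = (t * s * s - (1 - t)) / var_t
    return [m + slope * (x - t * m) for x in theta]


def rk4_sample(velocity, theta0, n_steps=100):
    """Integrate d(theta)/dt = velocity(t, theta) from t=0 to t=1 with RK4."""
    h = 1.0 / n_steps
    theta = list(theta0)
    for i in range(n_steps):
        t = i * h
        k1 = velocity(t, theta)
        k2 = velocity(t + h / 2, [x + h / 2 * k for x, k in zip(theta, k1)])
        k3 = velocity(t + h / 2, [x + h / 2 * k for x, k in zip(theta, k2)])
        k4 = velocity(t + h, [x + h * k for x, k in zip(theta, k3)])
        theta = [x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(theta, k1, k2, k3, k4)]
    return theta


rng = random.Random(0)
# Hypothetical 5-parameter vector, used only to show the shapes involved.
t, theta_t, target = training_pair([10.5, 0.3, 8.9, 1.3, 0.8], rng)

# 1,000 trajectories, each started from an independent base sample.
starts = [[rng.gauss(0.0, 1.0)] for _ in range(1000)]
samples = [rk4_sample(toy_velocity, z)[0] for z in starts]
sample_mean = sum(samples) / len(samples)
```

For this toy field the exact flow maps a base sample \(z\) to \(m + s z\), which gives a simple check that the integrator is behaving. In the actual model the velocity function would also take the conditioning data \(\mathcal{D}\).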

Two Comparison Models

Model	Input	Velocity Network Input Dimension
Photometry Model	ugriz 5-dim photometry	\([t; \theta; f_{\text{phot}}]\), 11-dim total
Image Model	ugriz photometry + 128×128 three-channel (gri) image	\([t; \theta; f_{\text{img}}; f_{\text{phot}}]\), 267-dim total

Velocity network (MLP): 3 layers, width 256.
Image encoder (CNN): 4 stride-1 convolutional blocks + average pooling → global average pooling → 256-dim feature \(f_{\text{img}}\); average pooling (rather than max pooling) is used to preserve information from extended light distributions.
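The input dimensions in the table follow directly from concatenation, as the sketch below shows. It assumes only the sizes stated above (1 time value, 5 parameters, 5 photometric features, 256 image features); `velocity_input` is a hypothetical helper name, and the zero vectors are placeholders rather than real features.

```python
N_PARAMS = 5        # M_star, SFR, Z_gas, D_n(4000), A_V
N_PHOT = 5          # ugriz photometry
N_IMG_FEATS = 256   # CNN feature size after global average pooling


def velocity_input(t, theta, f_phot, f_img=None):
    """Concatenate [t; theta; f_img; f_phot], omitting f_img when absent."""
    parts = [[t], list(theta)]
    if f_img is not None:
        parts.append(list(f_img))
    parts.append(list(f_phot))
    return [x for part in parts for x in part]


# Placeholder inputs, used only to count dimensions.
theta = [0.0] * N_PARAMS
f_phot = [0.0] * N_PHOT
f_img = [0.0] * N_IMG_FEATS

phot_in = velocity_input(0.5, theta, f_phot)
img_in = velocity_input(0.5, theta, f_phot, f_img=f_img)
```

Here `len(phot_in)` is 1 + 5 + 5 and `len(img_in)` is 1 + 5 + 256 + 5, matching the 11-dim and 267-dim totals in the table.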

Inference Targets

Five physical galaxy properties are inferred jointly:

\(M_\star\): stellar mass
SFR: star formation rate
\(Z_{\text{gas}}\): gas-phase metallicity
\(D_n(4000)\): narrow 4000 Å break index (stellar age proxy)
\(A_V\): V-band dust extinction

Data & Training

Dataset: SDSS Main Galaxy Sample; 106,800 spectroscopically confirmed bright star-forming galaxies (BPT-classified); 80/10/10 train/validation/test split.
Images: 128×128 gri-band images downloaded from SDSS SkyServer (0.396″/pixel).
Optimizer: AdamW, learning rate \(5 \times 10^{-5}\), batch size 64, early stopping.
Hardware: 4× NVIDIA V100 GPUs with PyTorch DataParallel.
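As a quick arithmetic check on the data description, the sketch below works out the split sizes and the angular size of each cutout. The shuffled split and its seed are assumptions made for illustration; the summary does not say how galaxies were assigned to the three subsets, and `split_indices` is a hypothetical helper.

```python
import random

N_GALAXIES = 106_800
PIXEL_SCALE_ARCSEC = 0.396   # arcsec per pixel, as quoted above
IMAGE_SIDE_PX = 128


def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices and cut them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # ASSUMPTION: random split, fixed seed
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices(N_GALAXIES)
fov_arcsec = IMAGE_SIDE_PX * PIXEL_SCALE_ARCSEC   # side length of one cutout
```

This gives 85,440 / 10,680 / 10,680 galaxies and a cutout about 50.7″ on a side.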

Evaluation Metrics

Accuracy: \(\Delta\log p(\theta_*; \mathcal{D}) = \log p(\theta_*|\mathcal{D}) - \log p(\theta_*)\), evaluated at the reference (spectroscopic) value \(\theta_*\); positive values indicate that the posterior density at \(\theta_*\) exceeds the (empirical) prior density there, i.e., conditioning on the data raised the density assigned to the reference value.
Informativeness: \(D_{\text{KL}}[p(\theta|\mathcal{D}) \| p(\theta)]\), measuring how much the posterior departs from the prior; averaged over \(\mathcal{D}\) this equals the mutual information \(I(\theta; \mathcal{D})\).
Population distribution fidelity: Per-variable Wasserstein distance between the distribution of posterior means and the ground-truth distribution on the test set.
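To make the three metrics concrete, here is a one-dimensional sketch. It rests on two assumptions that are not taken from the paper: the posterior and prior are treated as Gaussians so that \(\Delta\log p\) and \(D_{\text{KL}}\) have closed forms, and the Wasserstein distance is taken to be of order 1 between equal-size samples. All function names are hypothetical.

```python
import math


def gauss_logpdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def delta_log_p(theta_star, post, prior):
    """log p(theta*|D) - log p(theta*), with (mean, std) Gaussian stand-ins."""
    return gauss_logpdf(theta_star, *post) - gauss_logpdf(theta_star, *prior)


def kl_gauss(post, prior):
    """KL[N(m1, s1^2) || N(m0, s0^2)] in closed form."""
    (m1, s1), (m0, s0) = post, prior
    return math.log(s0 / s1) + (s1 ** 2 + (m1 - m0) ** 2) / (2 * s0 ** 2) - 0.5


def wasserstein_1d(a, b):
    """Order-1 Wasserstein distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Toy case: the posterior halves the prior width and is centred on the truth.
prior = (0.0, 1.0)
post = (0.0, 0.5)
gain = delta_log_p(0.0, post, prior)
info = kl_gauss(post, prior)
shift = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 2.0, 1.0])
```

In this toy case the gain is \(\ln 2 \approx 0.69\) and the KL divergence is about 0.32. A posterior that is narrow but centred away from \(\theta_*\) would keep a high \(D_{\text{KL}}\) while \(\Delta\log p\) turns negative, which is why the two metrics are reported together.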

Key Experimental Results

Per-Object Posterior Quality (\(N=1000\) test galaxies)

Metric	Image Model	Photometry Model
Mean \(\Delta\log p\)	2.17 (σ=3.30)	1.26 (σ=3.98)
Mean \(D_{\text{KL}}\)	3.41 (σ=0.95)	2.55 (σ=0.97)
\(\Delta\log p\) win rate	Higher than the Photometry Model on 81.5% of targets	—
\(D_{\text{KL}}\) win rate	Higher than the Photometry Model on 96.5% of targets	—
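The win rates in the table are paired comparisons: for each test galaxy, the two models' scores are compared and the fraction of galaxies on which the Image Model scores higher is reported. A minimal sketch, with made-up scores and a hypothetical `win_rate` helper:

```python
def win_rate(scores_a, scores_b):
    """Fraction of paired objects on which model A scores strictly higher."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)


# Illustrative per-object scores only, not values from the paper.
image_scores = [2.0, 1.5, -0.3, 3.1]
phot_scores = [1.0, 1.7, -1.0, 2.0]
rate = win_rate(image_scores, phot_scores)
```

Because the comparison is per object, a high win rate can coexist with overlapping score distributions, as the standard deviations in the table suggest.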

Wasserstein Distance (Population Distribution Fidelity)

Property	Image Model	Photometry Model	Improvement
\(M_\star\)	0.0264	0.0547	0.0283
SFR	0.0639	0.1119	0.0480
\(Z_{\text{gas}}\)	0.0156	0.0302	0.0146
\(D_n(4000)\)	0.0103	0.0131	0.0028
\(A_V\)	0.1937	0.2565	0.0628

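The Image Model achieves a lower Wasserstein distance than the Photometry Model on all five properties.
Relative reductions are largest for stellar mass (~52%) and gas-phase metallicity (~48%), followed by SFR (~43%); \(A_V\) (~24%) and \(D_n(4000)\) (~21%) improve least. In absolute terms, the largest reductions are for \(A_V\) (0.0628) and SFR (0.0480).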
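The relative reductions implied by the table can be recomputed directly from its two distance columns. The numbers below are copied from the table; the dictionary keys are shorthand labels chosen for this sketch.

```python
# (Image Model, Photometry Model) Wasserstein distances from the table above.
W = {
    "M_star": (0.0264, 0.0547),
    "SFR": (0.0639, 0.1119),
    "Z_gas": (0.0156, 0.0302),
    "D_n4000": (0.0103, 0.0131),
    "A_V": (0.1937, 0.2565),
}

# Absolute and relative (percent) reduction when images are added.
absolute = {k: phot - img for k, (img, phot) in W.items()}
relative = {k: 100.0 * (phot - img) / phot for k, (img, phot) in W.items()}

ranking = sorted(relative, key=relative.get, reverse=True)
```

Sorting by relative reduction puts stellar mass first and \(D_n(4000)\) last, while sorting by absolute reduction would put \(A_V\) first, so the ranking depends on which measure is used.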

Scaling Relation Recovery

The Image Model more faithfully recovers known SDSS scaling relations in the \(M_\star\)–\(Z_{\text{gas}}\), \(M_\star\)–SFR, and SFR–\(Z_{\text{gas}}\) planes.
Selected image samples are visually consistent with astrophysical expectations (e.g., low-mass, low-SFR galaxies exhibit blue, diffuse morphologies).

Dust–Age Degeneracy

In the \(A_V\) vs. \(D_n(4000)\) plane, posterior distributions from the Image Model are closer to spectroscopic ground-truth values than those from the Photometry Model.
However, overall constraints on \(A_V\) remain weak, achieving only partial decoupling.

Highlights & Insights

Clear and direct methodology: A rigorous comparison between two structurally symmetric models (differing only in image input) cleanly isolates the contribution of morphological information.
Multi-level evaluation: Both per-object posterior quality (accuracy + informativeness) and population distribution fidelity (Wasserstein distance) are assessed, yielding a comprehensive evaluation framework.
Strong physical interpretability: Recovery of scaling relations paired with representative image visualizations establishes an intuitive link between morphological features and physical quantities.
Clear practical value: A viable pathway is demonstrated for integrating morphological information into SED-fitting pipelines.
Elegant CFM framework: Conditional flow matching replaces per-object MCMC or nested sampling with amortized inference: once trained, the posterior for a new galaxy is approximated by integrating 1,000 sampled trajectories, with no sampler run per object.

Limitations & Future Work

Insufficient \(A_V\) constraint: Dust extinction exhibits the largest Wasserstein distance (0.1937), and the dust–age degeneracy is only partially alleviated.
Sample limitations: The study is restricted to SDSS bright star-forming galaxies (\(r < 17.77\), the Main Galaxy Sample magnitude limit), excluding quenched galaxies, low-surface-brightness galaxies, and high-redshift sources.
Simple CNN encoder: The four-layer CNN with global average pooling has limited image encoding capacity; pre-trained vision foundation models (e.g., AstroCLIP) or ViT architectures could be explored.
No physical priors incorporated: The current CFM prior is Gaussian and does not incorporate physical priors from SPS models; the authors note plans to combine this approach with SED fitting in future work.
No posterior calibration analysis: Coverage/calibration of the posterior is not examined, leaving the reliability of posterior credible intervals unverified.
Low redshift only: All galaxies are low-redshift SDSS objects; generalization to deeper surveys such as JWST or DESI remains untested.
Occasional negative \(\Delta\log p\) outliers: Both models exhibit cases where posterior density falls below the prior, possibly due to limited CFM architecture capacity.

Method	Input	Inference Approach	Key Difference
Traditional SED fitting (Conroy 2013)	Photometry	MCMC / nested sampling	Strong physical priors but cannot exploit images; high computational cost
Doorenbos+ (2024)	Image → generated spectrum → inference	Conditional diffusion model	Requires synthetic spectra as an intermediate step, potentially accumulating errors
Hahn & Melchior (2022)	Photometry	NPE (neural posterior estimation)	Amortized SBI inference but no image input
Iglesias-Navarro+ (2025)	JWST image pixels	SBI	Incorporates images into SBI but uses JWST rather than SDSS, targeting high redshift
Ours	Photometry + image latent features	CFM	First application of CFM to galaxy property inference; CNN encodes morphology jointly conditioned with photometry; strict controlled-variable comparison

Key distinction from Doorenbos+ (2024): This work directly infers physical properties from images without generating intermediate spectra, yielding a simpler pipeline that avoids error propagation from spectrum generation.
Comparison with Hahn & Melchior (2022): Both operate within the SBI paradigm, but this work replaces NPE with CFM. CFM is trained by regressing a velocity field rather than by maximizing the likelihood of a normalizing flow, and it draws posterior samples by integrating an ODE; whether this is more stable in practice is not quantified here.
AstroCLIP (Parker+ 2024) is a cross-modal foundation model, but this work trains a CNN encoder from scratch rather than using pre-trained features. This choice keeps the pipeline simple but is also a potential area for improvement.

Additional Insights

Generality of CFM for scientific inference: This work illustrates that CFM applies not only to image/audio generation but also to amortized posterior inference of scientific parameters. Because inference reduces to integrating an ODE, no per-object MCMC chain is needed, and the same paradigm could transfer to medical imaging parameter estimation, climate model calibration, and related domains.
Minimalist multi-modal conditioning design: Simple concatenation of CNN features with scalar photometric features as velocity field conditions, without complex cross-attention or FiLM modules, suggests that straightforward architectures can be sufficient given adequate data.
Physics-informed generative models: The authors' outlook of combining CFM with physical SED models exemplifies the "physics-informed generative model" direction—traditional physical priors provide interpretability and coverage of extreme cases, while data-driven models provide flexibility and morphological information.
Transferable evaluation framework: The three-level evaluation scheme, comprising the per-object log posterior-to-prior ratio (accuracy), KL divergence (informativeness), and Wasserstein distance (population fidelity), is broadly applicable to other posterior inference tasks.
Physical intuition guiding architectural choices: The selection of average pooling over max pooling to preserve extended light distributions reflects domain knowledge informing architecture design, contrasting with the max pooling commonly used in object detection.

Rating

Novelty: 3.5/5 — The CFM framework itself is not a new contribution; the core novelty lies in incorporating image morphology into CFM-based conditional inference, with a clean and persuasive experimental design.
Experimental Thoroughness: 4/5 — Multi-level quantitative evaluation (accuracy, informativeness, Wasserstein distance, scaling relations) is comprehensive, though posterior calibration/coverage analysis and ablation studies (e.g., varying image resolution or encoder architecture) are absent.
Writing Quality: 4/5 — Structure is clear, motivation is well-articulated, and figures are informative; Methods and Results are tightly integrated.
Value: 3.5/5 — Directly useful for the astrophysics community as a pathway for incorporating morphology into SED fitting; relevant to the ML community as a paradigm for CFM-based scientific posterior inference; generality is currently limited by the narrow sample and redshift range.