
Causal Discovery of Latent Variables in Galactic Archaeology

Conference: ICML 2025
arXiv: 2507.00134
Code: None
Area: Causal Inference
Keywords: Causal Discovery, Latent Variables, Galactic Archaeology, Structural Causal Models, Stellar Migration

TL;DR

Using the Rank-based Latent Causal Discovery (RLCD) algorithm, this study recovers two physically meaningful latent variables, birth radius and guiding radius, from only five observable stellar properties in a purely data-driven manner. The result suggests that causal discovery methods can identify hidden physical quantities in astrophysics.

Background & Motivation

Galactic archaeology aims to reveal the formation and evolution processes of galaxies by studying the chemo-dynamical history of stars. However, astronomy is fundamentally an observational science where controlled experiments are impossible. Consequently, there is a crucial need to deeply understand the causal mechanisms underlying astrophysical variables, rather than relying solely on correlations.

Traditional methods rely on human intuition to match observational data through forward modeling, which is inherently limited by human understanding of complex systems. More importantly, several critical physical quantities in stellar evolution—such as the birth radius and guiding radius—are unobservable latent variables. These latent variables confound the relationships among observables, making causal inference highly challenging.

The core motivation of this paper is to investigate whether automated causal graph inference can discover these hidden physical quantities, and their causal relationships with observables, from purely observational data.

Method

Overall Architecture

A three-stage pipeline is adopted:

  1. Causal Structure Discovery: Identify latent variables and the causal graph structure from the covariance patterns of five observable variables using the RLCD algorithm.
  2. Parameter Estimation: Quantify the strength of causal relationships (edge coefficients) via maximum likelihood estimation.
  3. Latent Variable Inference: Estimate the corresponding latent variable values for each individual star.

The inputs are five observable stellar attributes:

  • Metallicity [Fe/H]: The abundance ratio of iron to hydrogen in a star.
  • Age: The chronological age of the star.
  • Vertical Action \(J_z\): An orbital parameter describing the motion of the star perpendicular to the Galactic disk.
  • Angular Momentum \(L_z\): The angular momentum of the star rotating around the Galactic center.
  • Eccentricity \(e\): The degree of deviation of the stellar orbit from a perfect circle.

Key Designs

RLCD Algorithm

RLCD (Rank-based Latent Causal Discovery) is the core algorithm used in this work. Its key insight is that latent variables leave detectable statistical signatures in the covariance matrix of observed variables, which manifest as rank deficiencies.

Mechanism of the algorithm:

  • Analyze the covariance patterns among observable variables.
  • When latent variables are present, they constrain these covariance patterns in a detectable way.
  • By identifying rank deficiencies in the covariance matrix, the algorithm determines: (a) the number of latent variables, (b) which observables are affected by each latent variable, and (c) whether causal relationships exist among the latent variables.
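
The rank signature can be illustrated with a toy example. The sketch below is not the RLCD implementation: the loadings, the generic observables, and the helper names `numerical_rank` and `cross_cov` are all made up for illustration, and a crude singular-value threshold stands in for the statistical rank tests a real algorithm would need. It only shows that when a single latent variable drives two groups of observables, their cross-covariance block has rank 1 up to sampling noise, whereas two independent latents give rank 2.

```python
import numpy as np

def numerical_rank(matrix, rel_tol=0.05):
    """Count singular values above rel_tol times the largest one (crude heuristic)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

def cross_cov(X, rows, cols):
    """Sample cross-covariance block between two sets of observed columns."""
    cov = np.cov(X, rowvar=False)
    return cov[np.ix_(rows, cols)]

rng = np.random.default_rng(0)
n = 200_000
noise = 0.5 * rng.normal(size=(n, 4))

# Case 1 (hypothetical): one latent drives all four observables.
latent = rng.normal(size=n)
X_one = latent[:, None] * np.array([0.9, -0.7, 0.5, 0.8]) + noise

# Case 2 (hypothetical): two independent latents; La -> X1, X3 and Lb -> X2, X4.
La, Lb = rng.normal(size=n), rng.normal(size=n)
X_two = np.column_stack([0.9 * La, -0.7 * Lb, 0.5 * La, 0.8 * Lb]) + noise

# Cross-covariance between {X1, X2} and {X3, X4} in each case.
rank_one = numerical_rank(cross_cov(X_one, [0, 1], [2, 3]))
rank_two = numerical_rank(cross_cov(X_two, [0, 1], [2, 3]))
print(rank_one, rank_two)
```

With finite samples the smallest singular value is never exactly zero, which is why a practical method has to decide rank with a statistical test rather than a fixed threshold.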

Structural Causal Model (SCM)

Causal relationships are modeled using a Directed Acyclic Graph (DAG) \(\mathcal{G} := (\mathbf{V}_\mathcal{G}, \mathbf{E}_\mathcal{G})\), where each variable \(V_i\) is generated by a linear equation:

\[V_i = \sum_{V_j \in \text{Pa}_\mathcal{G}(V_i)} a_{ij} V_j + \varepsilon_{V_i}\]

where:

  • \(\text{Pa}_\mathcal{G}(V_i)\) represents the parents (direct causes) of \(V_i\).
  • \(a_{ij}\) quantifies the strength of the causal effect of \(V_j\) on \(V_i\).
  • \(\varepsilon_{V_i}\) represents the random noise term.
  • The variable set \(\mathbf{V}_\mathcal{G}\) consists of observable variables \(\mathbf{X}_\mathcal{G}\) (5 measurements) and latent variables \(\mathbf{L}_\mathcal{G}\) (hidden factors to be discovered).
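
As a minimal sketch of what the linear SCM implies, the code below builds a small hypothetical graph (two latents, three generic observables, invented coefficients; not the graph discovered in the paper), samples from it, and checks that the sample covariance matches the closed form \((I - A)^{-1}\,\Omega\,(I - A)^{-\top}\), where \(A\) collects the \(a_{ij}\) and \(\Omega\) is the diagonal noise covariance. The names `A`, `noise_var`, and `mix` are illustrative only.

```python
import numpy as np

# Variable order: [L1, L2, X1, X2, X3]; A[i, j] = a_ij, the effect of V_j on V_i.
A = np.zeros((5, 5))
A[1, 0] = 0.6    # L1 -> L2 (hypothetical edge)
A[2, 0] = 0.8    # L1 -> X1
A[3, 0] = -0.5   # L1 -> X2
A[4, 1] = 0.7    # L2 -> X3

# Noise variances chosen so that both latents have unit variance.
noise_var = np.array([1.0, 0.64, 0.3, 0.3, 0.3])

# Solving V = A V + eps gives V = (I - A)^{-1} eps.
mix = np.linalg.inv(np.eye(5) - A)
implied_cov = mix @ np.diag(noise_var) @ mix.T

# Draw samples and compare with the implied covariance.
rng = np.random.default_rng(1)
eps = rng.normal(size=(100_000, 5)) * np.sqrt(noise_var)
V = eps @ mix.T
sample_cov = np.cov(V, rowvar=False)

# Only the observable block would be available to a discovery algorithm.
X_cov = implied_cov[2:, 2:]
print(np.abs(sample_cov - implied_cov).max())
```

Only `X_cov` is visible in practice; the latent rows and columns have to be inferred from the structure that the latents impose on it.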

Simulation Data Source

Experiments utilize high-resolution cosmological zoom-in hydrodynamical simulation data from the NIHAO-UHD project:

  • The focus is on the g2.79e12 simulation, selecting disk stars with [Fe/H] > -1 whose present-day radii lie between 7 and 10 kpc.
  • Observational uncertainties are added: 10% in age, 0.02 dex in [Fe/H], and 0.06 dex in [O/Fe].
  • High- and low-\(\alpha\) disks are separated using \([O/Fe] = -0.13[Fe/H] + 0.17\), focusing specifically on the low-\(\alpha\) disk (dominated by secular evolution).
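
The selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the uncertainties are treated as independent Gaussian errors (fractional for age), low-\(\alpha\) stars are taken to be those strictly below the dividing line, noise is applied before the cut, and the input arrays are random stand-ins rather than NIHAO-UHD data. The function names are hypothetical.

```python
import numpy as np

def is_low_alpha(fe_h, o_fe):
    """True where a star lies below the line [O/Fe] = -0.13 [Fe/H] + 0.17."""
    boundary = -0.13 * np.asarray(fe_h) + 0.17
    return np.asarray(o_fe) < boundary

def add_observational_noise(age, fe_h, o_fe, rng):
    """Perturb true values with Gaussian errors: 10% in age, 0.02 dex, 0.06 dex."""
    age_obs = age * (1.0 + 0.10 * rng.standard_normal(age.shape))
    fe_h_obs = fe_h + 0.02 * rng.standard_normal(fe_h.shape)
    o_fe_obs = o_fe + 0.06 * rng.standard_normal(o_fe.shape)
    return age_obs, fe_h_obs, o_fe_obs

rng = np.random.default_rng(2)
# Made-up stand-ins for simulation star particles (not NIHAO-UHD data).
age = rng.uniform(1.0, 12.0, size=1000)    # Gyr
fe_h = rng.uniform(-1.0, 0.4, size=1000)   # dex
o_fe = rng.uniform(-0.1, 0.4, size=1000)   # dex

age_obs, fe_h_obs, o_fe_obs = add_observational_noise(age, fe_h, o_fe, rng)

# Keep low-alpha stars that also pass the metallicity cut [Fe/H] > -1.
low_alpha = is_low_alpha(fe_h_obs, o_fe_obs) & (fe_h_obs > -1.0)
print(low_alpha.sum(), "of", low_alpha.size, "mock stars selected")
```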

Loss & Training

Parameter Estimation: Since latent variables have no inherent scale, the variance of each latent variable is fixed to 1 (a standard convention). Parameters that best explain the observational data are then obtained via maximum likelihood estimation.
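
One common way to write such an objective, assuming Gaussian noise terms and centered observables (assumptions made here for illustration; the noise model is not stated above), goes through the covariance implied by the linear SCM. The symbols \(A\) (matrix of edge coefficients \(a_{ij}\)), \(\Omega\) (diagonal covariance of the noise terms), \(S\) (sample covariance of the observables over \(n\) stars), and \(\theta\) (all free parameters) are introduced here and are not taken from the paper.

\[\Sigma_V(\theta) = (I - A)^{-1}\,\Omega\,(I - A)^{-\top}, \qquad \Sigma_X(\theta) = \text{the block of } \Sigma_V(\theta) \text{ indexed by } \mathbf{X}_\mathcal{G}\]

\[\hat{\theta} = \arg\min_{\theta} \; \frac{n}{2}\left[\log\det \Sigma_X(\theta) + \operatorname{tr}\!\left(S\,\Sigma_X(\theta)^{-1}\right)\right] \quad \text{subject to } \operatorname{Var}(L_k) = 1 \text{ for each latent } L_k\]

The unit-variance constraint removes the scale ambiguity: without it, rescaling a latent and its outgoing coefficients in opposite directions would leave \(\Sigma_X(\theta)\) unchanged.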

Latent Variable Inference: Given the causal structure and estimated parameters, the latent variable values for each star are obtained by minimizing the prediction error:

\[\hat{L} = \arg\min_{L} \| X_{\text{obs}} - f(L; \hat{a}) \|^2\]

where \(f\) is the forward function of the linear causal model.
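
When \(f\) is linear, say \(f(L; \hat{a}) = B L\) for a loading matrix \(B\) built from the estimated coefficients, the minimization is an ordinary least-squares problem with solution \(\hat{L} = (B^\top B)^{-1} B^\top X_{\text{obs}}\). The sketch below assumes exactly that form, with centered observables and an invented \(B\) (five generic observables, two latents); it is not the paper's fitted model.

```python
import numpy as np

# Hypothetical loading matrix B: 5 observables (rows) x 2 latents (columns).
B = np.array([
    [0.8,  0.0],
    [0.6,  0.0],
    [0.0,  0.9],
    [0.0, -0.5],
    [0.3,  0.2],
])

rng = np.random.default_rng(3)
n_stars = 1000
L_true = rng.normal(size=(n_stars, 2))

# Mock observations: X = B L plus small measurement noise.
X_obs = L_true @ B.T + 0.1 * rng.normal(size=(n_stars, 5))

# Solve min_L ||x - B L||^2 for all stars at once (one column per star).
L_hat = np.linalg.lstsq(B, X_obs.T, rcond=None)[0].T

# Agreement between inferred and true latent values in this toy setup.
corr_L1 = np.corrcoef(L_hat[:, 0], L_true[:, 0])[0, 1]
corr_L2 = np.corrcoef(L_hat[:, 1], L_true[:, 1])[0, 1]
print(corr_L1, corr_L2)
```

Because each star is solved independently, the per-star estimate inherits noise from its own measurements; the fewer observables load on a latent, the noisier its estimate.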

Key Experimental Results

Main Results

RLCD automatically identifies two latent variables, \(L_1\) and \(L_2\), from the five observables:

| Latent Variable | Physical Mapping | Impacted Observables | Validation Method |
| --- | --- | --- | --- |
| \(L_1\) | Birth radius \(R_b\) | [Fe/H], \(J_z\) | Compared against the inferred results of Lu et al. (2024), achieving comparable performance |
| \(L_2\) | Guiding radius \(R_g\) | \(L_z\), \(e\) | Directly compared with simulation ground truth, successfully recovered |

Ablation Study

| Configuration | Key Metric | Description |
| --- | --- | --- |
| Using only 5 observables | Discovered 2 latent variables | Recovers physical quantities without requiring prior knowledge |
| \(L_1\) vs Lu et al. (2024) | Comparable performance | RLCD achieves performance on par with supervised methods |
| \(L_2\) vs true guiding radius | Direct recovery | Accurately recovered in a purely data-driven manner |
| Low-\(\alpha\) disk vs High-\(\alpha\) disk | Focus on low-\(\alpha\) disk | Dominated by secular processes, yielding clearer causal relations |

Key Findings

  1. \(L_1\) encodes birth conditions: It affects [Fe/H] and \(J_z\). This is consistent with physical knowledge: stars born at different Galactic radii inherit different metallicities due to the radial abundance gradient, and their vertical motion retains the memory of their birth environment through distinct gravitational potentials.

  2. \(L_2\) encodes orbital characteristics: It directly affects \(L_z\) and \(e\), which jointly define the guiding radius of a star. Stars develop eccentric orbits through gravitational scattering with giant molecular clouds (the "blurring" process), a mechanism that conserves angular momentum but increases eccentricity.

  3. Consistency in spatial distribution of chemical abundances: On the age-metallicity and \(\alpha\)-abundance planes, the distribution patterns of \(L_1\) show high consistency with the simulation ground truth and the inferences from Lu et al. (2024). Specifically, young metal-rich stars originate from smaller Galactic radii, while old metal-poor stars come from larger radii, reflecting the inside-out galaxy formation scenario.

  4. Alignment of causal graph structure with physical theory: The discovered causal relationships align with existing knowledge of Galactic chemical evolution, demonstrating that RLCD is capable of recovering true physical mechanisms in an unsupervised manner.

Highlights & Insights

  • Methodological Innovation: This is the first successful application of latent causal discovery methods to Galactic archaeology, showing that unobservable physical quantities can be recovered automatically from observable quantities alone.
  • Strong Physical Interpretability: The two discovered latent variables directly correspond to established physical concepts (birth radius and guiding radius) rather than abstract statistical factors.
  • Unsupervised vs. Supervised: Without access to any labels, \(L_1\) achieves performance comparable to the supervised method of Lu et al. (2024), suggesting that the causal structure itself encodes rich physical information.
  • Applicability of the Linear Assumption: Despite using a linear SCM, the algorithm successfully captures the critical causal structures, implying that the dominant relationships in Galactic chemical evolution can be reasonably approximated linearly.
  • Interdisciplinary Demonstration: This work provides a compelling case study for the application of causal discovery methods in physical sciences.

Limitations & Future Work

  1. Linear Assumption: The current method is restricted to linear SCMs, which may overlook non-linear effects in real astrophysical systems. Non-linear causal discovery methods should be explored in the future.
  2. Dependence on Simulation Data: Validation is based on the NIHAO-UHD simulation and has not yet been verified on real observational data. Systematic discrepancies may exist between simulations and the actual Milky Way.
  3. Analysis Restricted to Low-\(\alpha\) Disk: The turbulent, gas-rich environment where high-\(\alpha\) disks form might possess more complex causal structures, requiring further testing of the current framework's applicability.
  4. Observational Uncertainty Modeling: Although noise is incorporated, the uncertainty structures in real observations can be much more complex.
  5. Scalability: The framework is tested using only five observables and two latent variables. The stability and computational efficiency of the algorithm when scaling up the number of variables need further investigation.
  6. Confidence Intervals of Causal Effects: Currently, only point estimates for edge coefficients are provided, lacking formal uncertainty quantification.

Related Work & Insights

  • RLCD (Dong et al., 2024; 2025): The core algorithm utilized in this study, which performs latent causal discovery based on rank deficiency and has been validated on synthetic and real data.
  • Lu et al. (2024): A supervised inference method for birth radius, which serves as the validation baseline for \(L_1\).
  • Pasquato et al. (2023): A preliminary exploratory study of causal discovery in astronomy.
  • Jin et al. (2025): Bayesian causal structure analysis of galaxy-supermassive black hole co-evolution.
  • PC/FCI Algorithms (Spirtes et al., 2001): Classical causal discovery methods, which face challenges when dealing with latent variables.
  • Insights: This research paradigm can be extended to other astrophysical systems (e.g., planet formation, star cluster evolution) and provides a methodological reference for recovering hidden physical quantities from observational data in other scientific domains.

Rating

| Dimension | Score (1-5) | Description |
| --- | --- | --- |
| Novelty | 4 | First application of RLCD to latent causal discovery in astrophysics, showing clear interdisciplinary novelty |
| Technical Depth | 3 | The core algorithm originates from previous work; the primary contribution of this paper lies in its application |
| Experimental Thoroughness | 3 | Validated on a single simulation only, lacking real-world data and comparisons with more baselines |
| Writing Quality | 4 | Clear physical motivation, smooth presentation of methods, and well-coordinated figures and text |
| Value | 4 | Opens a new path for causal inference in astronomy, with highly generalizable methods |
| Overall Score | 3.6 | The interdisciplinary application is novel and convincing, though the experimental scale and technical contributions are somewhat limited |