
Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

Conference: NeurIPS 2025
arXiv: 2510.04417
Code: https://github.com/warrenzha/flow-pid
Area: Interpretability / Information Theory
Keywords: Partial Information Decomposition, normalizing flow, Gaussian distribution, multimodal learning, mutual information

TL;DR

Two complementary tools are proposed: Thin-PID, an efficient Gaussian PID algorithm (10×+ faster than the prior Tilde-PID), and Flow-PID, which applies normalizing flows to map arbitrary input distributions into a latent Gaussian space before computing PID, addressing the infeasibility of PID on continuous high-dimensional data. The paper also resolves an open problem by proving that the optimal joint distribution in Gaussian PID is itself Gaussian.

Background & Motivation

Background: Partial Information Decomposition (PID) is an information-theoretic framework for quantifying multi-source information interactions. It decomposes the total mutual information of two sources \(X_1, X_2\) about a target \(Y\) into four non-negative components: redundant information \(R\) (shared by both), unique information \(U_1, U_2\) (exclusive to each), and synergistic information \(S\) (only available when both are combined). PID has been applied in multimodal learning to understand modality interactions.
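These four components are tied to classical mutual information by the standard PID consistency equations (part of the general PID framework rather than this paper's contribution):

\[
I(X_1, X_2; Y) = R + U_1 + U_2 + S, \qquad I(X_1; Y) = R + U_1, \qquad I(X_2; Y) = R + U_2.
\]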

Limitations of Prior Work: Computing PID requires solving an optimization problem over the set of joint distributions satisfying marginal constraints. This is feasible for discrete, small-scale data (e.g., via convex solvers such as CVX), but practically impossible for continuous high-dimensional data, where even estimating mutual information and entropy is extremely challenging. The BATCH method uses neural-network parameterization but achieves poor accuracy. Tilde-PID is restricted to Gaussian distributions and lacks a proof of optimality.

Key Challenge: A fundamental gap exists between the theoretical elegance of PID and its computational infeasibility — while PID can precisely quantify modality interactions in theory, in practice it is limited to discrete low-dimensional data.

Goal: (1) Prove the optimality of the joint Gaussian solution in Gaussian PID; (2) Design an algorithm more efficient than Tilde-PID; (3) Generalize to non-Gaussian high-dimensional continuous data.

Key Insights: (1) PID under Gaussian distributions reduces to an optimization with closed-form gradients and can be computed efficiently; (2) the invertibility of normalizing flows preserves mutual information, enabling transformation to Gaussian space prior to PID computation.
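The second insight rests on the invariance of mutual information under invertible maps, which the paper formalizes as Theorem 4.1 (cited below); written out, the identity is

\[
I\big(f_1(X_1), f_2(X_2);\, f_Y(Y)\big) = I(X_1, X_2; Y) \quad \text{for invertible } f_1, f_2, f_Y,
\]

so PID computed in the latent Gaussian space transfers back to the original variables.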

Core Idea: Transform data to Gaussian via normalizing flows, then efficiently compute PID in the Gaussian space.

Method

Overall Architecture

The framework operates at two levels: (1) Thin-PID handles Gaussian PID — reformulating the optimization objective as minimizing a function of the noise cross-covariance matrix and solving via projected gradient descent; (2) Flow-PID handles general distributions — training a Cartesian product normalizing flow \(f_1 \times f_2 \times f_Y\) to map \((X_1, X_2, Y)\) to a Gaussian marginal space, then invoking Thin-PID.
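A minimal end-to-end sketch of this two-level pipeline; `fit_flows` and `gaussian_pid` are hypothetical placeholders for the flow-training and Thin-PID components, not the repository's actual API:

```python
import numpy as np

def flow_pid(X1, X2, Y, fit_flows, gaussian_pid):
    """Two-stage Flow-PID pipeline (sketch under assumed interfaces)."""
    # Stage 1: jointly train the Cartesian-product flow f1 x f2 x fY so
    # that (f1(X1), fY(Y)) and (f2(X2), fY(Y)) are approximately Gaussian.
    f1, f2, fY = fit_flows(X1, X2, Y)
    Z = np.hstack([f1(X1), f2(X2), fY(Y)])

    # Stage 2: estimate the joint latent covariance and hand it to the
    # Gaussian PID solver (Thin-PID) to recover (R, U1, U2, S).
    cov = np.cov(Z, rowvar=False)
    return gaussian_pid(cov, dims=(X1.shape[1], X2.shape[1], Y.shape[1]))
```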

Key Designs

  1. Thin-PID: Efficient Gaussian PID Algorithm

     • Function: Efficiently solves the PID optimization problem when the marginals are known to be Gaussian.

     • Mechanism: PID is reinterpreted via a Gaussian broadcast channel model: \(Y\) is the transmitted signal, with \(X_1 = H_1 Y + n_1\) and \(X_2 = H_2 Y + n_2\). Synergistic information is equivalent to the cooperative gain under the worst-case noise correlation. The optimization variable is reduced to the noise cross-covariance matrix \(\Sigma_{n_1 n_2}^{\text{off}}\) (of size \(d_{X_1} \times d_{X_2}\)), solved via projected gradient descent: gradients have closed-form expressions (Proposition 3.4), and the projection is implemented via SVD, truncating singular values to \([0,1]\) (see the sketch after this list). Complexity is \(O(\min(d_{X_1}, d_{X_2})^3)\).

     • Design Motivation: Tilde-PID requires eigendecomposition of the full \((d_{X_1}+d_{X_2}) \times (d_{X_1}+d_{X_2})\) matrix, whereas Thin-PID only performs SVD on the \(d_{X_1} \times d_{X_2}\) cross-covariance, yielding significant speedups when \(d_{X_1} \gg d_{X_2}\).

  2. Proof of Joint Gaussian Optimality

     • Function: Proves that the optimal joint distribution under the GPID definition is necessarily Gaussian, resolving an open problem.

     • Mechanism: The key lemma establishes that for any feasible \(q\), \(h_q(Y|X_1,X_2) \leq h_{\hat{q}}(Y|X_1,X_2)\), where \(\hat{q}\) is the Gaussian distribution sharing the same first and second moments as \(q\) (via the Gaussian upper-bound property of conditional entropy). Since the optimization objective is equivalent to maximizing \(h_q(Y|X_1,X_2)\) and \(\hat{q}\) preserves the marginal constraints, a Gaussian solution is necessarily optimal.

     • Design Motivation: The prior Tilde-PID merely assumed that the Gaussian solution was sufficiently good, without formal proof. This result elevates a "heuristic approximation" to an "exact solution."

  3. Flow-PID: Normalizing Flow Encoder

     • Function: Transforms non-Gaussian continuous data into a marginal Gaussian space, enabling Thin-PID.

     • Mechanism: Three independent normalizing flows \(f_1, f_2, f_Y\) are trained so that the marginals of \((f_1(X_1), f_Y(Y))\) and \((f_2(X_2), f_Y(Y))\) approximate Gaussians. By Theorem 4.1, invertible mappings preserve total mutual information; Corollary 4.2 guarantees that the PID decomposition is likewise preserved. The training objective minimizes the KL divergence to variational Gaussian marginals.

     • Design Motivation: Direct estimation of MI in high dimensions is extremely difficult, whereas Gaussian MI has a closed-form solution. The invertibility of the flows ensures that PID in the latent space is equivalent to PID in the original space.
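As referenced in the Thin-PID item above, a minimal sketch of the projected-gradient loop follows. Only the SVD projection onto \([0,1]\) singular values comes directly from the description; `grad_fn` is an assumed placeholder for the closed-form gradient of Proposition 3.4, which is not reproduced here.

```python
import numpy as np

def project_cross_cov(M):
    """Project a candidate noise cross-covariance onto the feasible set
    by truncating its singular values to [0, 1]."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.clip(s, 0.0, 1.0)) @ Vt

def thin_pid_pgd(grad_fn, d1, d2, lr=0.1, n_steps=500):
    """Projected gradient descent over the d1 x d2 noise cross-covariance.

    `grad_fn` stands in for the paper's closed-form gradient
    (Proposition 3.4); it is an assumed interface, not the repo's API.
    """
    M = np.zeros((d1, d2))  # start from uncorrelated noise
    for _ in range(n_steps):
        M = project_cross_cov(M - lr * grad_fn(M))
    return M
```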

Loss & Training

The Flow-PID loss is the sum of Gaussian marginal regularization terms over the two transformed pairs: \(\mathcal{L}_{\text{flow}} = \mathcal{L}_\mathcal{N}(\{(X_1, Y)\}) + \mathcal{L}_\mathcal{N}(\{(X_2, Y)\})\). Minimizing it is equivalent to maximizing the Gaussian log-likelihood of the transformed samples plus the log-determinant Jacobian correction from the change of variables.
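A minimal sketch of one regularizer term, assuming (as a simplification) a standard-normal latent target rather than the paper's variational Gaussian marginals, and a hypothetical `flow_pair` interface that returns latents together with the log-determinant of the Jacobian:

```python
import math
import torch

def gaussian_marginal_loss(flow_pair, x, y):
    """One Gaussian marginal regularizer L_N (minimal sketch).

    Assumptions (not the repository's actual API): `flow_pair` maps a
    concatenated pair to latents `z` plus the log |det Jacobian| of the
    invertible transform, and the latent target is a standard normal.
    """
    z, log_det = flow_pair(torch.cat([x, y], dim=-1))
    # Negative log-density of z under N(0, I) ...
    nll = 0.5 * (z ** 2).sum(dim=-1) + 0.5 * z.shape[-1] * math.log(2 * math.pi)
    # ... minus the change-of-variables term: minimizing this maximizes
    # the Gaussian log-likelihood of the transformed samples.
    return (nll - log_det).mean()

# Total Flow-PID objective, summed over the two marginal pairs:
# loss = gaussian_marginal_loss(pair_1, x1, y) + gaussian_marginal_loss(pair_2, x2, y)
```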

Key Experimental Results

Main Results: Non-Gaussian Synthetic Data

Dimensions    Method        R     U1    U2    S     Notes
(2, 2, 2)     Tilde-PID     0.18  0.29  0.76  0.02  Severe bias
(2, 2, 2)     Flow-PID      0.62  0.91  0.50  0.11  Close to ground truth
(2, 2, 2)     Ground Truth  0.79  1.46  0.58  0.18
(100, 60, 2)  Tilde-PID     1.48  0     1.97  0.13  Worse in high dimensions
(100, 60, 2)  Flow-PID      4.34  0.36  0     0.25  Close to ground truth
(100, 60, 2)  Ground Truth  5.71  1.01  0     0.57

Ablation Study: Computational Efficiency

Method     Main Bottleneck                                     When \(\min(d_{X_1}, d_{X_2}) > 100\)
Thin-PID   SVD, cost \(O(\min(d_{X_1}, d_{X_2})^3)\)           10×+ faster than Tilde-PID
Tilde-PID  Eigendecomposition of size \(d_{X_1} + d_{X_2}\)    Baseline

Key Findings

  • Thin-PID achieves very high precision: absolute error \(<10^{-12}\) on Gaussian synthetic data, compared to \(>10^{-8}\) for Tilde-PID.
  • Flow-PID correctly recovers the interaction structure of non-Gaussian data: Tilde-PID applied directly to sample covariances leads to entirely incorrect interaction types (e.g., misclassifying unique information as redundant), while Flow-PID correctly identifies the structure via learned inverse transformations.
  • Synergistic information is the hardest to estimate: BATCH tends to overestimate redundancy and underestimate synergy; Flow-PID reduces this bias.
  • Real multimodal data application: On 6 MultiBench datasets, Flow-PID estimates total mutual information far exceeding that of BATCH, yielding results more consistent with actual model performance.
  • Model selection accuracy: PID estimates from Flow-PID achieve 96–100% accuracy in multimodal model selection tasks.

Highlights & Insights

  • Theoretical contribution resolving an open problem: Although the proof of joint Gaussian optimality under GPID is not technically complex, its significance is substantial — it elevates Tilde-PID from a "heuristic" to an "exact" method.
  • The broadcast channel reinterpretation is elegant: Recasting PID as a worst-case noise optimization in a Gaussian broadcast channel establishes a beautiful bridge between information theory and multimodal learning.
  • Theoretical guarantee that flows preserve PID: Rather than simply "training a good encoder and applying it," the framework provides rigorous mathematical guarantees that invertible mappings preserve the entire PID decomposition.

Limitations & Future Work

  • The accuracy of Flow-PID depends on how well the normalizing flow approximates the true distribution; complex distributions may require more expressive flow architectures.
  • The current framework handles only two-source PID; extension to multiple sources is theoretically feasible but increases complexity.
  • No ground-truth PID is available for real datasets, necessitating evaluation via indirect metrics.
  • The whitening preprocessing in Thin-PID assumes independent noise, which may be too strong an assumption in certain settings.
  • No direct comparison is made with neural mutual information estimators such as MINE.

Comparison with Related Methods

  • vs. CVX/BATCH: CVX is limited to discrete small-scale data; BATCH uses neural-network parameterization but achieves poor accuracy (synergistic information is severely underestimated). Flow-PID is both accurate and efficient on continuous high-dimensional data.
  • vs. Tilde-PID: Tilde-PID shares the same Gaussian PID definition but is approximately 10× slower than Thin-PID and does not prove the optimality of the Gaussian solution.
  • vs. MINE/NWJ: These methods estimate mutual information only and cannot be directly applied to PID, which additionally requires optimization over a constrained set of joint distributions.

Rating

  • Novelty: ⭐⭐⭐⭐ Resolving an open problem and the flow encoder design both demonstrate originality.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Ground-truth validation on synthetic data; coverage of multiple real-world benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ Rigorous mathematical exposition and clear paper organization.
  • Value: ⭐⭐⭐⭐ Extends PID to practical multimodal scenarios with important implications for understanding modality interactions.