Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions¶
Conference: NeurIPS 2025 · arXiv: 2510.04417 · Code: https://github.com/warrenzha/flow-pid · Area: Interpretability / Information Theory · Keywords: Partial Information Decomposition, normalizing flow, Gaussian distribution, multimodal learning, mutual information
TL;DR¶
Two complementary tools are proposed: Thin-PID is an efficient Gaussian PID algorithm (10× faster than existing methods), and Flow-PID applies normalizing flows to map arbitrary input distributions to Gaussian space before computing PID, addressing the infeasibility of PID on continuous high-dimensional data. The paper also resolves an open problem regarding whether the joint Gaussian solution is optimal.
Background & Motivation¶
Background: Partial Information Decomposition (PID) is an information-theoretic framework for quantifying multi-source information interactions. It decomposes the total mutual information of two sources \(X_1, X_2\) about a target \(Y\) into four non-negative components: redundant information \(R\) (shared by both), unique information \(U_1, U_2\) (exclusive to each), and synergistic information \(S\) (only available when both are combined). PID has been applied in multimodal learning to understand modality interactions.
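These four components are tied to the measurable mutual informations by the standard PID consistency equations:

```latex
I(X_1; Y) = R + U_1, \qquad I(X_2; Y) = R + U_2, \qquad
I(X_1, X_2; Y) = R + U_1 + U_2 + S.
```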
Limitations of Prior Work: Computing PID requires solving an optimization problem over the set of joint distributions satisfying marginal constraints. This is feasible for discrete small-scale data (via CVX), but practically impossible for continuous high-dimensional data, where even estimating mutual information and entropy is extremely challenging. The BATCH method uses neural network parameterization but achieves poor accuracy. Tilde-PID is restricted to Gaussian distributions without proof of optimality.
Key Challenge: A fundamental gap exists between the theoretical elegance of PID and its computational infeasibility — while PID can precisely quantify modality interactions in theory, in practice it is limited to discrete low-dimensional data.
Goal: (1) Prove the optimality of the joint Gaussian solution in Gaussian PID; (2) Design an algorithm more efficient than Tilde-PID; (3) Generalize to non-Gaussian high-dimensional continuous data.
Key Insight: (1) PID under Gaussian distributions can be computed efficiently, since the underlying optimization admits closed-form gradients; (2) the invertibility of normalizing flows preserves mutual information, enabling transformation to Gaussian space prior to PID computation.
Core Idea: Transform data to Gaussian via normalizing flows, then efficiently compute PID in the Gaussian space.
Method¶
Overall Architecture¶
The framework operates at two levels: (1) Thin-PID handles Gaussian PID — reformulating the optimization objective as minimizing a function of the noise cross-covariance matrix and solving via projected gradient descent; (2) Flow-PID handles general distributions — training a Cartesian product normalizing flow \(f_1 \times f_2 \times f_Y\) to map \((X_1, X_2, Y)\) to a Gaussian marginal space, then invoking Thin-PID.
Key Designs¶
- Thin-PID: Efficient Gaussian PID Algorithm
  - Function: Efficiently solves the PID optimization problem when the marginals are known to be Gaussian.
  - Mechanism: PID is reinterpreted via a Gaussian broadcast channel model — \(Y\) is the transmitted signal, with \(X_1 = H_1 Y + n_1\) and \(X_2 = H_2 Y + n_2\). Synergistic information is equivalent to the cooperative gain under the worst-case noise correlation. The optimization variable is reduced to the noise cross-covariance matrix \(\Sigma_{n_1 n_2}^{\text{off}}\) (of size \(d_{X_1} \times d_{X_2}\)), solved via projected gradient descent. Gradients have closed-form expressions (Proposition 3.4), and projection is implemented via SVD (truncating singular values to \([0,1]\)). Complexity is \(O(\min(d_{X_1}, d_{X_2})^3)\).
  - Design Motivation: Tilde-PID requires eigendecomposition of the full \((d_{X_1}+d_{X_2}) \times (d_{X_1}+d_{X_2})\) matrix, whereas Thin-PID only performs SVD on the \(d_{X_1} \times d_{X_2}\) cross-covariance, yielding significant speedups when \(d_{X_1} \gg d_{X_2}\).
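The projection and iteration described above can be sketched in a few lines. This is a minimal illustration of SVD-based projected gradient descent under the description in the text; the function names, default hyperparameters, and the stand-in `grad_fn` are hypothetical, not the paper's implementation:

```python
import numpy as np

def project_cross_covariance(K):
    """Project onto the feasible set by truncating the singular values
    of the (whitened) noise cross-covariance to [0, 1]."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return U @ np.diag(np.clip(s, 0.0, 1.0)) @ Vt

def projected_gradient_descent(K0, grad_fn, lr=0.1, steps=200):
    """Generic projected gradient descent over the d_{X1} x d_{X2}
    cross-covariance; `grad_fn` stands in for the closed-form gradient
    of the synergy objective (Proposition 3.4)."""
    K = K0
    for _ in range(steps):
        K = project_cross_covariance(K - lr * grad_fn(K))
    return K
```

For a toy quadratic objective \(\|K - K^\star\|_F^2\) with a feasible \(K^\star\), the iteration converges to \(K^\star\); the actual method would supply the paper's closed-form gradient instead.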
- Proof of Joint Gaussian Optimality
  - Function: Proves that the optimal joint distribution under the GPID definition is necessarily Gaussian, resolving an open problem.
  - Mechanism: The key lemma establishes that for any \(q\), \(h_q(Y|X_1,X_2) \leq h_{\hat{q}}(Y|X_1,X_2)\), where \(\hat{q}\) is the Gaussian distribution sharing the same first and second moments as \(q\) (via the Gaussian upper bound property of conditional entropy). Since the optimization objective is equivalent to maximizing \(h_q(Y|X_1,X_2)\) and \(\hat{q}\) preserves the marginal constraints, the Gaussian solution is necessarily optimal.
  - Design Motivation: The prior Tilde-PID merely assumed that the Gaussian solution was sufficiently good without formal proof. This result elevates a "heuristic approximation" to an "exact solution."
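The bound is an instance of the classical maximum-entropy property of the Gaussian, restated for completeness (a standard fact, not specific to this paper): for any density \(q\) on \(\mathbb{R}^d\) with covariance \(\Sigma_q\),

```latex
h(q) \;\le\; \tfrac{1}{2}\log\!\big((2\pi e)^{d}\det\Sigma_q\big) = h(\hat{q}),
\qquad \text{with equality iff } q = \mathcal{N}(\mu_q, \Sigma_q).
```

The paper's lemma lifts this to the conditional entropy \(h_q(Y \mid X_1, X_2)\), which is exactly the quantity the GPID objective maximizes.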
- Flow-PID: Normalizing Flow Encoder
  - Function: Transforms non-Gaussian continuous data into a marginal Gaussian space, enabling Thin-PID.
  - Mechanism: Three independent normalizing flows \(f_1, f_2, f_Y\) are trained so that the marginals of \((f_1(X_1), f_Y(Y))\) and \((f_2(X_2), f_Y(Y))\) approximate Gaussians. By Theorem 4.1, invertible mappings preserve total mutual information; Corollary 4.2 guarantees that the PID decomposition is likewise preserved. The training objective minimizes KL divergence to variational Gaussian marginals.
  - Design Motivation: Direct estimation of MI in high dimensions is extremely difficult, whereas Gaussian MI has a closed-form solution. The invertibility of the flows ensures that PID in the latent space is equivalent to PID in the original space.
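The invariance in Theorem 4.1 can be sanity-checked in the linear-Gaussian case, where mutual information has a closed form in the covariance blocks. A minimal numpy sketch (function name and dimensions are illustrative):

```python
import numpy as np

def gaussian_mi(cov, dx):
    """Closed-form MI of a jointly Gaussian (X, Y), X = first dx coords:
    I(X;Y) = 0.5 * (log det Sig_X + log det Sig_Y - log det Sig)."""
    ld = lambda m: np.linalg.slogdet(m)[1]
    return 0.5 * (ld(cov[:dx, :dx]) + ld(cov[dx:, dx:]) - ld(cov))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
cov = A @ A.T + np.eye(4)          # joint covariance of (X, Y), dx = dy = 2

# Invertible map acting separately on X and Y (a linear stand-in for the flows):
T = np.zeros((4, 4))
T[:2, :2] = rng.normal(size=(2, 2)) + 2 * np.eye(2)
T[2:, 2:] = rng.normal(size=(2, 2)) + 2 * np.eye(2)
cov_t = T @ cov @ T.T              # covariance after the transform

# gaussian_mi(cov, 2) and gaussian_mi(cov_t, 2) agree: MI is invariant.
```

Flow-PID relies on this invariance for learned nonlinear bijections, and (by Corollary 4.2) for the full PID decomposition rather than only total MI.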
Loss & Training¶
The Flow-PID loss is the sum of Gaussian marginal regularization terms over the two marginal pairs: \(\mathcal{L}_{\text{flow}} = \mathcal{L}_\mathcal{N}(\{(X_1, Y)\}) + \mathcal{L}_\mathcal{N}(\{(X_2, Y)\})\), which is equivalent to maximizing the Gaussian log-likelihood of the transformed samples plus the Jacobian term.
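As a toy illustration of this objective: for a single scalar flow, the Gaussian-marginal loss is the usual change-of-variables negative log-likelihood. A minimal sketch with an affine flow \(z = a x + b\) (all names and the crude grid-search fit are illustrative, not the paper's training code):

```python
import numpy as np

def affine_flow_nll(x, a, b):
    """NLL of samples x under z = a*x + b with base z ~ N(0, 1):
    -log p(x) = 0.5*z**2 + 0.5*log(2*pi) - log|a|  (Jacobian term)."""
    z = a * x + b
    return np.mean(0.5 * z**2 + 0.5 * np.log(2 * np.pi)) - np.log(abs(a))

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=5000)   # non-standard Gaussian "data"

# Minimizing the NLL over (a, b) Gaussianizes x: the optimum is
# a = 1/sigma, b = -mu/sigma (here ~0.5 and ~-1.5). Crude grid search:
best = min((affine_flow_nll(x, a, b), a, b)
           for a in np.linspace(0.1, 2.0, 40)
           for b in np.linspace(-3.0, 1.0, 40))
```

Flow-PID applies the same principle with expressive multivariate flows \(f_1, f_2, f_Y\) and Gaussian marginals in the latent space.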
Key Experimental Results¶
Main Results: Non-Gaussian Synthetic Data¶
| Dimension | Method | R | U1 | U2 | S | Notes |
|---|---|---|---|---|---|---|
| (2,2,2) | Tilde-PID | 0.18 | 0.29 | 0.76 | 0.02 | Severe bias |
| (2,2,2) | Flow-PID | 0.62 | 0.91 | 0.50 | 0.11 | Close to ground truth |
| (2,2,2) | Ground Truth | 0.79 | 1.46 | 0.58 | 0.18 | — |
| (100,60,2) | Tilde-PID | 1.48 | 0 | 1.97 | 0.13 | Worse in high dimensions |
| (100,60,2) | Flow-PID | 4.34 | 0.36 | 0 | 0.25 | Close to ground truth |
| (100,60,2) | Ground Truth | 5.71 | 1.01 | 0 | 0.57 | — |
Ablation Study: Computational Efficiency¶
| Method | Main Bottleneck | Speed when \(\min(d_{X_1},d_{X_2})>100\) |
|---|---|---|
| Thin-PID | SVD of the \(d_{X_1} \times d_{X_2}\) cross-covariance | 10×+ faster than Tilde-PID |
| Tilde-PID | Eigendecomposition of the \((d_{X_1}+d_{X_2}) \times (d_{X_1}+d_{X_2})\) matrix | Baseline |
Key Findings¶
- Thin-PID achieves very high precision: absolute error \(<10^{-12}\) on Gaussian synthetic data, compared to \(>10^{-8}\) for Tilde-PID.
- Flow-PID correctly recovers the interaction structure of non-Gaussian data: Tilde-PID applied directly to sample covariances leads to entirely incorrect interaction types (e.g., misclassifying unique information as redundant), while Flow-PID correctly identifies the structure via learned inverse transformations.
- Synergistic information is the hardest to estimate: BATCH tends to overestimate redundancy and underestimate synergy; Flow-PID reduces this bias.
- Real multimodal data application: On 6 MultiBench datasets, Flow-PID estimates total mutual information far exceeding that of BATCH, yielding results more consistent with actual model performance.
- Model selection accuracy: PID estimates from Flow-PID achieve 96–100% accuracy in multimodal model selection tasks.
Highlights & Insights¶
- Theoretical contribution resolving an open problem: Although the proof of joint Gaussian optimality under GPID is not technically complex, its significance is substantial — it elevates Tilde-PID from a "heuristic" to an "exact" method.
- The broadcast channel reinterpretation is elegant: Recasting PID as a worst-case noise optimization in a Gaussian broadcast channel establishes a beautiful bridge between information theory and multimodal learning.
- Theoretical guarantee that flows preserve PID: Rather than simply "training a good encoder and applying it," the framework provides rigorous mathematical guarantees that invertible mappings preserve the entire PID decomposition.
Limitations & Future Work¶
- The accuracy of Flow-PID depends on how well the normalizing flow approximates the true distribution; complex distributions may require more expressive flow architectures.
- The current framework handles only two-source PID; extension to multiple sources is theoretically feasible but increases complexity.
- No ground-truth PID is available for real datasets, necessitating evaluation via indirect metrics.
- The whitening preprocessing in Thin-PID assumes independent noise, which may be too strong an assumption in certain settings.
- No direct comparison is made with neural mutual information estimators such as MINE.
Related Work & Insights¶
- vs. CVX/BATCH: CVX is limited to discrete small-scale data; BATCH uses neural network parameterization but achieves poor accuracy (synergistic information is severely underestimated). Flow-PID is accurate and efficient on continuous high-dimensional data.
- vs. Tilde-PID: Tilde-PID shares the same Gaussian PID definition but is approximately 10× slower than Thin-PID and provides no proof that the Gaussian solution is optimal.
- vs. MINE/NWJ: These methods estimate MI only and cannot be directly applied to PID, which additionally requires optimization over a constrained distribution set.
Rating¶
- Novelty: ⭐⭐⭐⭐ Resolving an open problem and the flow encoder design both demonstrate originality.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ground-truth validation on synthetic data; coverage of multiple real-world benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Rigorous mathematical exposition and clear paper organization.
- Value: ⭐⭐⭐⭐ Extends PID to practical multimodal scenarios with important implications for understanding modality interactions.