Efficient Network Automatic Relevance Determination¶
Conference: ICML2025
arXiv: 2506.12352
Code: Not publicly available
Area: Sparse Bayesian Learning / Feature Selection / Multi-output Regression
Keywords: ARD, Sparse Bayesian, Precision Matrix Estimation, Graphical Lasso, Multi-output Regression, Feature Selection
TL;DR¶
Extends Automatic Relevance Determination (ARD) from single-output to multi-output regression scenarios, proposes the NARD framework to jointly estimate sparse regression coefficients and the output precision matrix, and designs three acceleration algorithms (Sequential/Surrogate/Hybrid) to reduce complexity from \(\mathcal{O}(d^3)\) to \(\mathcal{O}(p^2)\).
Background & Motivation¶
- Multi-output regression is widely applied in fields such as genomics and proteomics, requiring simultaneous modeling of input-output relationships and dependency structures among outputs.
- Biological data typically feature ultra-high dimensional features (thousands of genes/proteins), but only a small number of features truly affect the phenotypes, necessitating sparse feature selection.
- Limitations of prior work:
- Frequentist methods such as MRCE use \(\ell_1\) penalties but are highly sensitive to hyperparameters.
- Bayesian MCMC methods like HS-GHS and JRNS achieve high accuracy but are computationally expensive (>3000 seconds).
- Classical ARD/SBL only handle single-output scenarios and fail to capture correlations among outputs.
- Core Motivation: To design a Bayesian framework that can jointly estimate sparse regression coefficients and the output precision matrix while efficiently handling high-dimensional data.
Method¶
NARD Framework¶
Imposes a matrix normal prior on the regression coefficient matrix \(W\):
where \(K = \text{diag}(\alpha_1, \cdots, \alpha_d)\) is the column precision matrix and \(V\) is the row covariance matrix. Through the ARD mechanism, when \(\alpha_i \to \infty\), the corresponding feature is automatically pruned.
An \(\ell_1\) penalty is imposed on the precision matrix \(\Omega = V^{-1}\) to encourage sparsity, which is updated using Graphical Lasso:
The overall method employs Block Coordinate Descent (BCD) to worker out-of-order updates alternatingly among \(W\), \(\alpha\), and \(V\), with a complexity of \(\mathcal{O}(m^3 + d^3)\) per iteration.
Sequential NARD¶
Inspired by Tipping's greedy method, it evaluates the contribution of each feature sequentially: - Decomposes the marginal likelihood into two parts: one dependent on and one independent of a single \(\alpha_i\). - Defines \(\eta_i = \text{Tr}(q_i q_i^\top V^{-1}) - m s_i\), where the optimal \(\alpha_i\) is given by:
- Starts from an almost empty model, sequentially adding or deleting features, retaining the update only when the marginal likelihood increases.
- Reduces complexity to \(\mathcal{O}(m^3 + p^3)\), where \(p \ll d\) is the number of finally retained features.
Surrogate NARD¶
Introduces a surrogate function to approximate the lower bound of the marginal likelihood: - Uses Lipschitz continuous gradients to construct a lower bound \(R(W, W')\), satisfying \(\text{Tr}(g(W)) \leq \text{Tr}(R(W, W'))\). - Key advantage: \(S_{xx} = K + \rho I\) remains a diagonal matrix, which requires only \(\mathcal{O}(d)\) for inversion. - Updates \(W, W', V, K\) alternately via BCD, reducing complexity to \(\mathcal{O}(m^3 + d^2)\).
Hybrid NARD¶
Combines the feature selection of Sequential NARD with the computational efficiency of Surrogate NARD: - Evaluates whether features should be included in the model using the Sequential approach. - Once included, updates matrices using the Surrogate approach. - Further reduces complexity to \(\mathcal{O}(m^3 + p^2)\).
Key Experimental Results¶
Synthetic Data (\(d=5000, m=1500, N=1500\))¶
| Method | TPR | FPR | Total Time (s) |
|---|---|---|---|
| MRCE | 0.9083 | 0.0072 | 53 |
| CAPME | 0.8972 | 0.0124 | 52 |
| HS-GHS | 0.9463 | 0.0033 | >3000 |
| JRNS | 0.9485 | 0.0037 | >3000 |
| NARD | 0.9483 | 0.0062 | 49 |
| Sequential NARD | 0.9459 | 0.0067 | 35 |
| Surrogate NARD | 0.9462 | 0.0072 | 31 |
| Hybrid NARD | 0.9471 | 0.0068 | 23 |
Scalability (\(N=20000, m=2000\), per-step time in seconds)¶
| \(d\) | MRCE | NARD | Surrogate NARD |
|---|---|---|---|
| 5000 | 12.2 | 10.7 | 3.7 |
| 10000 | 33.4 | 30.8 | 8.9 |
| 20000 | 201.9 | 168.6 | 33.7 |
| 30000 | 421.3 | 376.7 | 64.7 |
Aging Phenotype Data¶
- 1,022 healthy individuals, 5,641 phenotypes (1,522 macro + 4,119 molecular phenotypes).
- The Jaccard index between various NARD variants and baseline methods is >98.5%, indicating highly consistent feature selection.
- NARD takes about 24 minutes, while Sequential NARD takes about 14 minutes.
TCGA Cancer Data¶
- 7 tumor types, 10 key signaling pathways.
- NARD successfully identifies the GSK3-AKT regulatory relationship in the PI3K/AKT pathway within COAD, which aligns with known biological knowledge.
Highlights & Insights¶
- Unified Framework: Seamlessly integrates ARD feature selection with Graphical Lasso precision matrix estimation, solving both problems within a single framework.
- Theoretical Elegance: All three acceleration schemes are supported by rigorous mathematical derivations (proofs of surrogate lower bounds, complexity analysis).
- High Practicality: Hybrid NARD achieves about 2× speedup at \(d=5000\) and about 6× speedup at \(d=30000\), with almost no loss in accuracy.
- Biological Interpretability: Protein interaction networks discovered on TCGA data align with known biological pathways.
- Can be naturally extended to non-linear scenarios through kernel methods.
Limitations & Future Work¶
- The core assumption of the model is a linear relationship; although kernel expansion is mentioned, it is not verified with in-depth experiments.
- The selection of \(\lambda\) in Graphical Lasso relies on 5-fold cross-validation, and this computational cost is not included in the total time comparison.
- Fails to compare against recent deep feature selection methods (e.g., gating networks, attention mechanisms).
- Theoretical analysis lacks rigorous bounds for the convergence rate.
- The code is not publicly available, posing obstacles to reproducibility.
Related Work & Insights¶
- MRCE (Rothman et al., 2010): Jointly penalizes \(W\) and \(\Omega\) with \(\ell_1\) norm, representing a classic frequentist approach.
- HS-GHS (Li et al., 2021): Horseshoe + Graphical Horseshoe prior, which offers high accuracy but is extremely slow.
- Tipping's Fast ARD (Tipping & Faul, 2003): The inspiration behind the Sequential update.
- Valuable reference for research in high-dimensional multi-output sparse learning, Bayesian feature selection, and biological network inference.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The extension of ARD to multi-output scenarios is natural and meaningful, and the three acceleration schemes are elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple scenarios including synthetic, aging phenotype, and TCGA datasets, though lacks comparison with deep learning methods.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and rigorous, and the algorithm pseudocode is complete.
- Value: ⭐⭐⭐⭐ — Possesses solid practical application value in the field of high-dimensional biostatistics; the lack of open-source code is a minor drawback.