Skip to content

Efficient Network Automatic Relevance Determination

Conference: ICML2025
arXiv: 2506.12352
Code: Not publicly available
Area: Sparse Bayesian Learning / Feature Selection / Multi-output Regression
Keywords: ARD, Sparse Bayesian, Precision Matrix Estimation, Graphical Lasso, Multi-output Regression, Feature Selection

TL;DR

Extends Automatic Relevance Determination (ARD) from single-output to multi-output regression scenarios, proposes the NARD framework to jointly estimate sparse regression coefficients and the output precision matrix, and designs three acceleration algorithms (Sequential/Surrogate/Hybrid) to reduce complexity from \(\mathcal{O}(d^3)\) to \(\mathcal{O}(p^2)\).

Background & Motivation

  • Multi-output regression is widely applied in fields such as genomics and proteomics, requiring simultaneous modeling of input-output relationships and dependency structures among outputs.
  • Biological data typically feature ultra-high dimensional features (thousands of genes/proteins), but only a small number of features truly affect the phenotypes, necessitating sparse feature selection.
  • Limitations of prior work:
    • Frequentist methods such as MRCE use \(\ell_1\) penalties but are highly sensitive to hyperparameters.
    • Bayesian MCMC methods like HS-GHS and JRNS achieve high accuracy but are computationally expensive (>3000 seconds).
    • Classical ARD/SBL only handle single-output scenarios and fail to capture correlations among outputs.
  • Core Motivation: To design a Bayesian framework that can jointly estimate sparse regression coefficients and the output precision matrix while efficiently handling high-dimensional data.

Method

NARD Framework

Imposes a matrix normal prior on the regression coefficient matrix \(W\):

\[W \sim \mathcal{MN}(0, V, K^{-1})\]

where \(K = \text{diag}(\alpha_1, \cdots, \alpha_d)\) is the column precision matrix and \(V\) is the row covariance matrix. Through the ARD mechanism, when \(\alpha_i \to \infty\), the corresponding feature is automatically pruned.

An \(\ell_1\) penalty is imposed on the precision matrix \(\Omega = V^{-1}\) to encourage sparsity, which is updated using Graphical Lasso:

\[\hat{V}, \hat{V}^{-1} \leftarrow \textbf{glasso}(V, \lambda)\]

The overall method employs Block Coordinate Descent (BCD) to worker out-of-order updates alternatingly among \(W\), \(\alpha\), and \(V\), with a complexity of \(\mathcal{O}(m^3 + d^3)\) per iteration.

Sequential NARD

Inspired by Tipping's greedy method, it evaluates the contribution of each feature sequentially: - Decomposes the marginal likelihood into two parts: one dependent on and one independent of a single \(\alpha_i\). - Defines \(\eta_i = \text{Tr}(q_i q_i^\top V^{-1}) - m s_i\), where the optimal \(\alpha_i\) is given by:

\[\alpha_i = \begin{cases} \frac{m s_i^2}{\eta_i}, & \eta_i > 0 \\ \infty, & \eta_i \leq 0 \end{cases}\]
  • Starts from an almost empty model, sequentially adding or deleting features, retaining the update only when the marginal likelihood increases.
  • Reduces complexity to \(\mathcal{O}(m^3 + p^3)\), where \(p \ll d\) is the number of finally retained features.

Surrogate NARD

Introduces a surrogate function to approximate the lower bound of the marginal likelihood: - Uses Lipschitz continuous gradients to construct a lower bound \(R(W, W')\), satisfying \(\text{Tr}(g(W)) \leq \text{Tr}(R(W, W'))\). - Key advantage: \(S_{xx} = K + \rho I\) remains a diagonal matrix, which requires only \(\mathcal{O}(d)\) for inversion. - Updates \(W, W', V, K\) alternately via BCD, reducing complexity to \(\mathcal{O}(m^3 + d^2)\).

Hybrid NARD

Combines the feature selection of Sequential NARD with the computational efficiency of Surrogate NARD: - Evaluates whether features should be included in the model using the Sequential approach. - Once included, updates matrices using the Surrogate approach. - Further reduces complexity to \(\mathcal{O}(m^3 + p^2)\).

Key Experimental Results

Synthetic Data (\(d=5000, m=1500, N=1500\))

Method TPR FPR Total Time (s)
MRCE 0.9083 0.0072 53
CAPME 0.8972 0.0124 52
HS-GHS 0.9463 0.0033 >3000
JRNS 0.9485 0.0037 >3000
NARD 0.9483 0.0062 49
Sequential NARD 0.9459 0.0067 35
Surrogate NARD 0.9462 0.0072 31
Hybrid NARD 0.9471 0.0068 23

Scalability (\(N=20000, m=2000\), per-step time in seconds)

\(d\) MRCE NARD Surrogate NARD
5000 12.2 10.7 3.7
10000 33.4 30.8 8.9
20000 201.9 168.6 33.7
30000 421.3 376.7 64.7

Aging Phenotype Data

  • 1,022 healthy individuals, 5,641 phenotypes (1,522 macro + 4,119 molecular phenotypes).
  • The Jaccard index between various NARD variants and baseline methods is >98.5%, indicating highly consistent feature selection.
  • NARD takes about 24 minutes, while Sequential NARD takes about 14 minutes.

TCGA Cancer Data

  • 7 tumor types, 10 key signaling pathways.
  • NARD successfully identifies the GSK3-AKT regulatory relationship in the PI3K/AKT pathway within COAD, which aligns with known biological knowledge.

Highlights & Insights

  • Unified Framework: Seamlessly integrates ARD feature selection with Graphical Lasso precision matrix estimation, solving both problems within a single framework.
  • Theoretical Elegance: All three acceleration schemes are supported by rigorous mathematical derivations (proofs of surrogate lower bounds, complexity analysis).
  • High Practicality: Hybrid NARD achieves about 2× speedup at \(d=5000\) and about 6× speedup at \(d=30000\), with almost no loss in accuracy.
  • Biological Interpretability: Protein interaction networks discovered on TCGA data align with known biological pathways.
  • Can be naturally extended to non-linear scenarios through kernel methods.

Limitations & Future Work

  • The core assumption of the model is a linear relationship; although kernel expansion is mentioned, it is not verified with in-depth experiments.
  • The selection of \(\lambda\) in Graphical Lasso relies on 5-fold cross-validation, and this computational cost is not included in the total time comparison.
  • Fails to compare against recent deep feature selection methods (e.g., gating networks, attention mechanisms).
  • Theoretical analysis lacks rigorous bounds for the convergence rate.
  • The code is not publicly available, posing obstacles to reproducibility.
  • MRCE (Rothman et al., 2010): Jointly penalizes \(W\) and \(\Omega\) with \(\ell_1\) norm, representing a classic frequentist approach.
  • HS-GHS (Li et al., 2021): Horseshoe + Graphical Horseshoe prior, which offers high accuracy but is extremely slow.
  • Tipping's Fast ARD (Tipping & Faul, 2003): The inspiration behind the Sequential update.
  • Valuable reference for research in high-dimensional multi-output sparse learning, Bayesian feature selection, and biological network inference.

Rating

  • Novelty: ⭐⭐⭐⭐ — The extension of ARD to multi-output scenarios is natural and meaningful, and the three acceleration schemes are elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple scenarios including synthetic, aging phenotype, and TCGA datasets, though lacks comparison with deep learning methods.
  • Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and rigorous, and the algorithm pseudocode is complete.
  • Value: ⭐⭐⭐⭐ — Possesses solid practical application value in the field of high-dimensional biostatistics; the lack of open-source code is a minor drawback.