
Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs

Conference: NeurIPS 2025 arXiv: 2510.27517 Code: None Area: Optimization Keywords: preconditioner, conjugate gradient method, graph neural network, GPU acceleration, sparse approximate inverse

TL;DR

This paper proposes a graph neural network (GNN)-based method for learning sparse approximate inverse (SPAI) preconditioners, exploiting the natural compatibility between SPAI's locality and GNN message passing, together with a scale-invariant loss function (the SAI loss). On GPUs, the learned preconditioners reduce conjugate gradient solve time by 40%–53% (a 68%–113% speedup).

Background & Motivation

Background: Solving symmetric positive definite (SPD) sparse linear systems \(\mathbf{Ax} = \mathbf{b}\) is a core problem in scientific computing. The conjugate gradient (CG) method is the dominant iterative solver, but its convergence rate critically depends on preconditioning.

Limitations of Prior Work:

  • Diagonal preconditioners: GPU-friendly but offer limited convergence improvement.
  • Incomplete Cholesky (IC): performs well on CPUs but requires triangular solves that are difficult to parallelize on GPUs.
  • Existing learning-based methods (GNN→IC): inherit the GPU parallelism bottleneck of triangular solves, and GNNs struggle to model long-range dependencies along the elimination tree.

Key Challenge: IC-type preconditioners yield fewer iterations but are GPU-inefficient (triangular solve bottleneck), while diagonal preconditioners are GPU-efficient but require many iterations.

Goal: Construct a preconditioner that both effectively reduces the condition number and fully exploits GPU parallelism.

Key Insight: The paper targets SPAI, using \(\mathbf{M}^{-1} = \mathbf{GG}^\top + \varepsilon\mathbf{I}\) directly as the preconditioner. Each CG step requires only two sparse matrix–vector multiplications (SpMV), making it inherently GPU-friendly.

Core Idea: SPAI's locality (output at a node depends only on its two-hop neighborhood) is naturally compatible with GNN local message passing, making GNN-learned SPAI a more principled fit than GNN-learned IC.
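
To make this concrete, below is a minimal NumPy/SciPy sketch of applying such a preconditioner inside CG: the only per-iteration preconditioning work is two sparse matrix–vector products plus a scaled vector addition, with no triangular solve. The model matrix, the placeholder factor `G` (a Jacobi-style diagonal scaling standing in for the GNN output), and all names are illustrative assumptions, not the paper's code.

```python
# Sketch: applying M^{-1} = G G^T + eps*I as a preconditioner in CG (two SpMVs per apply).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # SPD model problem
b = np.ones(n)

G = sp.diags(1.0 / np.sqrt(A.diagonal()), format="csr")  # placeholder for the learned factor
eps = 1e-4

def apply_spai(r):
    # Two sparse matrix-vector products plus an axpy -- no triangular solve, GPU-friendly.
    return G @ (G.T @ r) + eps * r

M = LinearOperator((n, n), matvec=apply_spai)
x, info = cg(A, b, M=M, rtol=1e-8)  # `rtol` keyword needs SciPy >= 1.12 (older versions use `tol`)
print("converged" if info == 0 else f"cg returned info={info}")
```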

Method

Overall Architecture

Nonzero entries and node features of input matrix \(\mathbf{A}\) → GNN encoder–processor–decoder → output sparse matrix \(\mathbf{G}\) → assemble \(\mathbf{M}^{-1} = \mathbf{GG}^\top + \varepsilon\mathbf{I}\) → apply as preconditioner in CG.

Key Designs

  1. Locality Analysis of SPAI Preconditioners: Assuming \(\mathbf{G}\) shares the sparsity pattern of \(\mathbf{A}\), the preconditioned output \(\mathbf{s}_j = (\mathbf{M}^{-1}\mathbf{r})_j\) depends only on the two-hop neighborhood of node \(j\): \(\mathbf{s}_j = \varepsilon\mathbf{r}_j + \sum_{l: \mathbf{A}_{jl}\neq 0} \mathbf{G}_{jl} \sum_{k: \mathbf{A}_{kl}\neq 0} \mathbf{G}_{kl}\mathbf{r}_k\). This stands in sharp contrast to IC's global dependencies along the elimination tree, making SPAI naturally suited to GNN's local propagation mechanism.

  2. GNN Architecture: An encoder–processor–decoder structure is employed, consisting of:

    • Encoder: two MLPs that separately encode node features \(\mathbf{v}_i\) and edge features \(\mathbf{e}_{ij}\) (i.e., \(\mathbf{A}_{ij}\)).
    • Processor: \(L=4\) message-passing layers, each containing a message function \(f_m^{(t)}\), node update \(f_v^{(t)}\), and edge update \(f_e^{(t)}\), all with residual connections.
    • Decoder: an MLP that maps the final edge features \(\mathbf{h}_{ij}^{(L)}\) to \(\mathbf{G}_{ij} \in \mathbb{R}^{b \times b}\).

The total parameter count is only approximately 24k; a minimal sketch of this encoder–processor–decoder architecture appears after the Loss & Training section below.

  3. Scale-invariant Alignment-to-Identity (SAI) Loss:

    • Two issues with the conventional SPAI loss \(\|\mathbf{AM}^{-1} - \mathbf{I}\|_F^2\): (1) directly computing \(\mathbf{AM}^{-1}\) is prohibitively expensive for large matrices; (2) the loss depends on the absolute scale of \(\mathbf{A}\), whereas CG convergence rates are scale-invariant.
    • Stochastic estimation: random vectors \(\mathbf{w} \sim \mathcal{N}(0,1)\) are used to approximate the Frobenius norm, reducing matrix–matrix products to matrix–vector products.
    • Scale normalization: \(\mathcal{L}_{\text{SAI}}(\mathbf{A}, \mathbf{M}^{-1}, \mathbf{w}) = \left\|\left(\frac{1}{\|\mathbf{A}\|}\mathbf{AM}^{-1} - \mathbf{I}\right)\mathbf{w}\right\|_2^2\), satisfying \(\mathcal{L}_{\text{SAI}}(\mathbf{A}, \mathbf{M}^{-1}) = \mathcal{L}_{\text{SAI}}(\alpha\mathbf{A}, \mathbf{M}^{-1})\) for any \(\alpha > 0\).
    • The norm is defined as \(\|\mathbf{A}\| = \text{mean}_{\mathbf{A}_{ij}\neq 0}|\mathbf{A}_{ij}|\), which is more robust to matrix size and outliers.
    • Theoretical analysis shows this loss directly bounds the condition number: \(\kappa(\mathbf{AM}^{-1}) \leq 1 + 2\|\mathbf{E}\|_2\), where \(\mathbf{E}\) denotes the deviation of the normalized preconditioned matrix from the identity. A minimal sketch of the SAI loss follows this list.
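
As a concrete illustration of the two points above (stochastic estimation and scale normalization), here is a minimal PyTorch sketch of the SAI loss. For readability it uses dense tensors; the paper works with sparse matrices and SpMV kernels, and \(\mathbf{M}^{-1}\) is never assembled. The function name `sai_loss`, the `num_probes` parameter, and the toy test matrix are assumptions for illustration, not the authors' implementation.

```python
import torch

def sai_loss(A: torch.Tensor, G: torch.Tensor, eps: float = 1e-4, num_probes: int = 1) -> torch.Tensor:
    """Stochastic, scale-invariant alignment-to-identity (SAI) loss (dense sketch).

    A: SPD system matrix, G: learned factor (sparse in the paper, dense here for clarity).
    M^{-1} = G G^T + eps*I is never formed; only matrix-vector products are used.
    """
    n = A.shape[0]
    a_norm = A[A != 0].abs().mean()                 # ||A|| = mean |A_ij| over the nonzeros

    loss = A.new_zeros(())
    for _ in range(num_probes):
        w = torch.randn(n, 1, dtype=A.dtype, device=A.device)  # Gaussian probe vector
        m_inv_w = G @ (G.t() @ w) + eps * w                     # M^{-1} w: two mat-vecs
        residual = (A @ m_inv_w) / a_norm - w                   # (A M^{-1} / ||A|| - I) w
        loss = loss + residual.pow(2).sum()
    return loss / num_probes

# Tiny usage example: a random SPD matrix and a Jacobi-like initial factor.
A = torch.randn(64, 64); A = A @ A.t() + 64.0 * torch.eye(64)
G = torch.diag(A.diagonal().rsqrt())
print(sai_loss(A, G, num_probes=8).item())
```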

Loss & Training

  • All experiments fix \(L=4\), hidden dimension \(d=24\), and \(\varepsilon = 10^{-4}\).
  • Training runs for 500 epochs with batch size 4, AdamW optimizer, and exponential learning rate decay (rate 0.99).
  • Training is performed on a single NVIDIA A100 GPU.
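
Putting the pieces together, the sketch below shows an encoder–processor–decoder GNN in the spirit of the architecture described above (L = 4 residual message-passing layers, hidden width d = 24, scalar edge outputs rather than \(b \times b\) blocks), together with the AdamW optimizer and exponential learning-rate decay (rate 0.99). The sum aggregation, the update order, the learning rate, and the toy graph are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def mlp(din, dout, hidden=24):
    return nn.Sequential(nn.Linear(din, hidden), nn.ReLU(), nn.Linear(hidden, dout))

class SPAIGNN(nn.Module):
    """Encoder-processor-decoder GNN: maps node/edge features of A to entries of G."""
    def __init__(self, node_in=1, edge_in=1, d=24, layers=4):
        super().__init__()
        self.enc_v = mlp(node_in, d, d)   # node-feature encoder
        self.enc_e = mlp(edge_in, d, d)   # edge-feature encoder (one feature per nonzero A_ij)
        self.f_m = nn.ModuleList([mlp(3 * d, d, d) for _ in range(layers)])  # message functions
        self.f_v = nn.ModuleList([mlp(2 * d, d, d) for _ in range(layers)])  # node updates
        self.f_e = nn.ModuleList([mlp(3 * d, d, d) for _ in range(layers)])  # edge updates
        self.dec = mlp(d, 1, d)           # decoder: final edge embedding -> scalar G_ij

    def forward(self, x, edge_index, edge_attr):
        # x: [n, node_in], edge_index: [2, m] (row/col indices of nonzeros), edge_attr: [m, edge_in]
        src, dst = edge_index
        h_v, h_e = self.enc_v(x), self.enc_e(edge_attr)
        for f_m, f_v, f_e in zip(self.f_m, self.f_v, self.f_e):
            msg = f_m(torch.cat([h_v[src], h_v[dst], h_e], dim=-1))        # per-edge messages
            agg = torch.zeros_like(h_v).index_add_(0, dst, msg)            # sum-aggregate at nodes
            h_v = h_v + f_v(torch.cat([h_v, agg], dim=-1))                 # residual node update
            h_e = h_e + f_e(torch.cat([h_v[src], h_v[dst], h_e], dim=-1))  # residual edge update
        return self.dec(h_e).squeeze(-1)  # one predicted value of G per nonzero of A

# Toy usage: a 5-node chain graph (plus self-loops) standing in for the pattern of A.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 2, 3, 4],
                           [1, 0, 2, 1, 3, 2, 4, 3, 0, 1, 2, 3, 4]])
x = torch.ones(5, 1)                              # node features (e.g. diagonal of A)
edge_attr = torch.ones(edge_index.shape[1], 1)    # edge features (values A_ij)

model = SPAIGNN()
g_values = model(x, edge_index, edge_attr)        # entries of G, one per nonzero
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)              # learning rate is an assumption
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)   # decay rate 0.99 as stated above
```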

Key Experimental Results

Main Results — GPU Solve Time (\(\text{rtol} = 10^{-8}\))

| Problem | Diag | IC | AINV | Ours | Relative Speedup |
|---|---|---|---|---|---|
| Heat Equation | 77ms (520) | 167ms (204) | 78ms (330) | 36ms (197) | 113% |
| Poisson Equation | 45ms (320) | 101ms (128) | 58ms (217) | 26ms (128) | 73% |
| Hyperelasticity | 86ms (464) | 202ms (117) | 247ms (266) | 51ms (175) | 68% |
| Synthetic System | 445ms (2775) | 1399ms (1808) | 5024ms (10896) | 253ms (1122) | 75% |

Numbers in parentheses are CG iteration counts; the relative speedup is measured against the fastest baseline (the diagonal preconditioner in every row). The proposed method achieves the fastest solve time across all datasets.

Ablation Study — Performance at Different Convergence Thresholds (Heat Equation, GPU)

| rtol | Diag | IC | AINV | Ours |
|---|---|---|---|---|
| \(10^{-2}\) | 43ms (309) | 92ms (115) | 52ms (199) | 22ms (124) |
| \(10^{-4}\) | 53ms (384) | 114ms (143) | 61ms (246) | 27ms (154) |
| \(10^{-6}\) | 63ms (442) | 132ms (164) | 67ms (280) | 32ms (176) |
| \(10^{-8}\) | 77ms (520) | 167ms (204) | 78ms (330) | 36ms (197) |

Key Findings

  • On GPUs, the proposed method achieves 68%–113% speedup.
  • Iteration counts are comparable to IC (indicating high preconditioner quality), while per-step application cost is far lower than IC's triangular solves.
  • Construction time (0.181ms) is on the same order as the diagonal preconditioner (0.196ms), far below IC (1.866ms) and AINV (18.924ms).
  • The SAI loss yields a notable condition number improvement over the non-scale-invariant conventional loss.

Highlights & Insights

  • Natural compatibility of SPAI and GNN: SPAI's two-hop locality perfectly matches GNN message passing, making it a more principled pairing than IC-GNN.
  • Elegant design of SAI loss: Scale invariance aligns the preconditioner optimization objective with CG convergence properties, without requiring precomputation of \(\mathbf{A}^{-1}\).
  • Extremely lightweight model: With only 24k parameters, the GNN demonstrates that local structural information suffices to guide high-quality preconditioner construction.
  • Full GPU pipeline: The entire workflow—from construction (GNN forward pass) to application (SpMV)—executes on the GPU.

Limitations & Future Work

  • Applicable only to SPD matrices; not suitable for nonsymmetric or indefinite systems.
  • Separate GNN training is required for different types of PDE problems.
  • The sparsity pattern is fixed to match \(\mathbf{A}\); adaptive sparsity pattern selection is not explored.
  • Comparison with multigrid methods such as AMG is limited.
  • Scalability to very large-scale problems (millions of degrees of freedom) remains to be verified.

Related Work & Context

  • Li et al., Häusner et al. (2023–2024): Pioneering work on GNN→IC preconditioners; this paper identifies the fundamental tension between triangular solves and GNN locality in those approaches.
  • Classical SPAI (Kolotilina et al.): This paper addresses the limitations of classical SPAI construction, which relies on sequential algorithms and yields suboptimal performance, through a data-driven approach.
  • Insight: In scientific computing, "GPU-friendliness" may matter more than "theoretical optimality"—IC is theoretically superior yet slower in practice on GPUs.

Rating

  • Novelty: ⭐⭐⭐⭐ The GNN+SPAI combination is novel, and the SAI loss design has theoretical depth.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on multiple PDE problems and synthetic datasets, with comprehensive comparison against both classical and learning-based methods.
  • Writing Quality: ⭐⭐⭐⭐⭐ Well-structured; the comparative analysis of locality is particularly compelling.
  • Value: ⭐⭐⭐⭐ High practical value for GPU-based scientific computing; the 68%–113% speedup is highly significant.