Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs¶
Conference: NeurIPS 2025 | arXiv: 2510.27517 | Code: None | Area: Optimization | Keywords: preconditioner, conjugate gradient method, graph neural network, GPU acceleration, sparse approximate inverse
TL;DR¶
This paper proposes a graph neural network (GNN)-based method for learning sparse approximate inverse (SPAI) preconditioners. It exploits the natural compatibility between SPAI's locality and GNN message passing and introduces a scale-invariant loss function (the SAI loss); on GPUs, the learned preconditioners reduce solve time by 40%–53% (a 68%–113% speedup).
Background & Motivation¶
Background: Solving symmetric positive definite (SPD) sparse linear systems \(\mathbf{Ax} = \mathbf{b}\) is a core problem in scientific computing. The conjugate gradient (CG) method is the dominant iterative solver, but its convergence rate critically depends on preconditioning.
Limitations of Prior Work:
- Diagonal preconditioners: GPU-friendly, but offer limited convergence improvement.
- Incomplete Cholesky (IC): performs well on CPUs, but requires triangular solves that are difficult to parallelize on GPUs.
- Existing learning-based methods (GNN→IC): inherit the GPU parallelism bottleneck of triangular solves, and GNNs struggle to model the long-range dependencies along the elimination tree.
Key Challenge: IC-type preconditioners yield fewer iterations but are GPU-inefficient (triangular solve bottleneck), while diagonal preconditioners are GPU-efficient but require many iterations.
Goal: Construct a preconditioner that both effectively reduces the condition number and fully exploits GPU parallelism.
Key Insight: The paper targets SPAI, using \(\mathbf{M}^{-1} = \mathbf{GG}^\top + \varepsilon\mathbf{I}\) directly as the preconditioner. Each CG step requires only two sparse matrix–vector multiplications (SpMV), making it inherently GPU-friendly.
Core Idea: SPAI's locality (output at a node depends only on its two-hop neighborhood) is naturally compatible with GNN local message passing, making GNN-learned SPAI a more principled fit than GNN-learned IC.
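To make the GPU-friendliness argument concrete, here is a minimal sketch of applying \(\mathbf{M}^{-1} = \mathbf{GG}^\top + \varepsilon\mathbf{I}\) inside CG as two SpMVs plus a scaled vector addition. This is not the authors' code: SciPy on CPU stands in for GPU SpMV kernels, a toy 1D Laplacian stands in for a real PDE system, and a diagonal \(\mathbf{G}\) (which reproduces the Jacobi preconditioner) stands in for the GNN-predicted factor.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Toy SPD system: 1D Laplacian, a stand-in for a PDE discretization.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.random.default_rng(0).standard_normal(n)

# Stand-in for the GNN output: a diagonal G gives M^{-1} ~ diag(A)^{-1}
# (Jacobi); in the paper, G carries A's full sparsity pattern.
G = sp.diags(1.0 / np.sqrt(A.diagonal()), format="csr")
eps = 1e-4

def apply_minv(r):
    # M^{-1} r = G (G^T r) + eps * r: two SpMVs and an axpy,
    # with no triangular solve and hence no sequential bottleneck.
    return G @ (G.T @ r) + eps * r

M_inv = LinearOperator((n, n), matvec=apply_minv)
x, info = cg(A, b, M=M_inv, rtol=1e-8)  # keyword is 'tol' in SciPy < 1.12
print("converged:", info == 0)
```

Because both SpMVs are embarrassingly parallel, the same operator maps directly onto GPU sparse kernels (e.g., cuSPARSE), which is exactly the property IC's triangular solves lack.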
Method¶
Overall Architecture¶
Nonzero entries and node features of input matrix \(\mathbf{A}\) → GNN encoder–processor–decoder → output sparse matrix \(\mathbf{G}\) → assemble \(\mathbf{M}^{-1} = \mathbf{GG}^\top + \varepsilon\mathbf{I}\) → apply as preconditioner in CG.
Key Designs¶
- Locality Analysis of SPAI Preconditioners: Assuming \(\mathbf{G}\) shares the sparsity pattern of \(\mathbf{A}\), the preconditioned output \(\mathbf{s}_j = (\mathbf{M}^{-1}\mathbf{r})_j\) depends only on the two-hop neighborhood of node \(j\):
  $$\mathbf{s}_j = \varepsilon\mathbf{r}_j + \sum_{l:\,\mathbf{A}_{jl}\neq 0} \mathbf{G}_{jl} \sum_{k:\,\mathbf{A}_{kl}\neq 0} \mathbf{G}_{kl}\mathbf{r}_k$$
  This stands in sharp contrast to IC's global dependencies along the elimination tree, making SPAI naturally suited to GNN's local propagation mechanism.
- GNN Architecture: An encoder–processor–decoder structure is employed (a minimal sketch appears at the end of the Method section), consisting of:
- Encoder: two MLPs that separately encode node features \(\mathbf{v}_i\) and edge features \(\mathbf{e}_{ij}\) (i.e., \(\mathbf{A}_{ij}\)).
- Processor: \(L=4\) message-passing layers, each containing a message function \(f_m^{(t)}\), node update \(f_v^{(t)}\), and edge update \(f_e^{(t)}\), all with residual connections.
- Decoder: an MLP that maps the final edge features \(\mathbf{h}_{ij}^{(L)}\) to \(\mathbf{G}_{ij} \in \mathbb{R}^{b \times b}\).
The total parameter count is only approximately 24k.
- Scale-invariant Alignment-to-Identity (SAI) Loss:
  - Two issues with the conventional SPAI loss \(\|\mathbf{AM}^{-1} - \mathbf{I}\|_F^2\): (1) directly computing \(\mathbf{AM}^{-1}\) is prohibitively expensive for large matrices; (2) the loss depends on the absolute scale of \(\mathbf{A}\), whereas CG convergence rates are scale-invariant.
  - Stochastic estimation: random vectors \(\mathbf{w} \sim \mathcal{N}(0, \mathbf{I})\) are used to approximate the Frobenius norm, reducing matrix–matrix products to matrix–vector products.
  - Scale normalization:
    $$\mathcal{L}_{\text{SAI}}(\mathbf{A}, \mathbf{M}^{-1}, \mathbf{w}) = \left\|\left(\frac{1}{\|\mathbf{A}\|}\mathbf{AM}^{-1} - \mathbf{I}\right)\mathbf{w}\right\|_2^2$$
    Since scaling \(\mathbf{A}\) by \(\alpha > 0\) scales \(\|\mathbf{A}\|\) by the same factor, the normalized product is unchanged, and hence \(\mathcal{L}_{\text{SAI}}(\mathbf{A}, \mathbf{M}^{-1}) = \mathcal{L}_{\text{SAI}}(\alpha\mathbf{A}, \mathbf{M}^{-1})\) for any \(\alpha > 0\).
  - The norm is defined as \(\|\mathbf{A}\| = \text{mean}_{\mathbf{A}_{ij}\neq 0}|\mathbf{A}_{ij}|\), which is more robust to matrix size and outliers.
  - Theoretical analysis shows this loss directly bounds the condition number: with \(\mathbf{E}\) denoting the preconditioning residual \(\frac{1}{\|\mathbf{A}\|}\mathbf{AM}^{-1} - \mathbf{I}\), \(\kappa(\mathbf{AM}^{-1}) \leq 1 + 2\|\mathbf{E}\|_2\).
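Putting the stochastic estimator and the scale normalization together, a minimal PyTorch sketch of the SAI loss could look as follows. This is a reconstruction from the formulas above, not the authors' implementation; the coalesced sparse-COO representation and the single-probe default are my assumptions.

```python
import torch

def sai_loss(A, G, eps=1e-4, num_probes=1):
    """Stochastic, scale-invariant alignment-to-identity (SAI) loss.

    A, G: coalesced sparse COO tensors (n x n); the preconditioner is
    M^{-1} = G G^T + eps * I, applied only through matrix-vector products.
    """
    n = A.shape[0]
    # ||A|| = mean absolute value of A's nonzeros (robust to size/outliers).
    a_norm = A.values().abs().mean()

    loss = A.values().new_zeros(())
    for _ in range(num_probes):
        w = torch.randn(n, 1, device=A.device)               # probe ~ N(0, I)
        # M^{-1} w via two SpMVs -- A M^{-1} is never materialized.
        minv_w = torch.sparse.mm(G, torch.sparse.mm(G.t(), w)) + eps * w
        # (A M^{-1} / ||A|| - I) w
        residual = torch.sparse.mm(A, minv_w) / a_norm - w
        loss = loss + residual.square().sum()
    return loss / num_probes
```

Averaging \(\|\mathbf{Ew}\|_2^2\) over Gaussian probes is a Hutchinson-style estimator: its expectation is exactly \(\|\mathbf{E}\|_F^2\), so minimizing the stochastic loss targets the Frobenius objective at SpMV cost.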
Loss & Training¶
- All experiments fix \(L=4\), hidden dimension \(d=24\), and \(\varepsilon = 10^{-4}\).
- Training runs for 500 epochs with batch size 4, AdamW optimizer, and exponential learning rate decay (rate 0.99).
- Training is performed on a single NVIDIA A100 GPU.
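The sketch below (referenced from the GNN Architecture item above) instantiates the encoder–processor–decoder with the stated hyperparameters \(L=4\) and \(d=24\) for the scalar case \(b=1\). The MLP depths, sum aggregation, and the exact inputs to each update function are my assumptions; the paper's design may differ in these details.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hidden=24):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

class SPAIGNN(nn.Module):
    """Encoder-processor-decoder GNN mapping the graph of A to the
    entries of the sparse factor G (scalar case, b = 1)."""

    def __init__(self, d=24, L=4, node_feats=1, edge_feats=1):
        super().__init__()
        self.enc_v = mlp(node_feats, d)   # encodes node features v_i
        self.enc_e = mlp(edge_feats, d)   # encodes edge features A_ij
        self.f_m = nn.ModuleList(mlp(3 * d, d) for _ in range(L))  # messages
        self.f_v = nn.ModuleList(mlp(2 * d, d) for _ in range(L))  # node updates
        self.f_e = nn.ModuleList(mlp(3 * d, d) for _ in range(L))  # edge updates
        self.dec = mlp(d, 1)              # decodes h_ij^(L) -> G_ij

    def forward(self, v, e, src, dst):
        # v: (n, node_feats); e: (m, edge_feats); src/dst: long tensors
        # holding the edge index (sparsity pattern) of A.
        h_v, h_e = self.enc_v(v), self.enc_e(e)
        for f_m, f_v, f_e in zip(self.f_m, self.f_v, self.f_e):
            msg = f_m(torch.cat([h_v[src], h_v[dst], h_e], dim=-1))
            agg = torch.zeros_like(h_v).index_add_(0, dst, msg)  # sum aggregation (assumed)
            h_v = h_v + f_v(torch.cat([h_v, agg], dim=-1))       # residual connection
            h_e = h_e + f_e(torch.cat([h_v[src], h_v[dst], h_e], dim=-1))
        return self.dec(h_e).squeeze(-1)  # one predicted value per nonzero of A
```

The output is one scalar per stored nonzero, which is then scattered into \(\mathbf{G}\) with \(\mathbf{A}\)'s sparsity pattern. With these choices the parameter count lands on the order of the ~24k reported in the paper.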
Key Experimental Results¶
Main Results — GPU Solve Time (\(\text{rtol} = 10^{-8}\))¶
| Problem | Diag | IC | AINV | Ours | Relative Speedup |
|---|---|---|---|---|---|
| Heat Equation | 77ms (520) | 167ms (204) | 78ms (330) | 36ms (197) | 113% |
| Poisson Equation | 45ms (320) | 101ms (128) | 58ms (217) | 26ms (128) | 73% |
| Hyperelasticity | 86ms (464) | 202ms (117) | 247ms (266) | 51ms (175) | 68% |
| Synthetic System | 445ms (2775) | 1399ms (1808) | 5024ms (10896) | 253ms (1122) | 75% |
Numbers in parentheses are CG iteration counts. The proposed method achieves the fastest solve time on every dataset; relative speedup is measured against the strongest baseline, which is the diagonal preconditioner in each row.
Ablation Study — Performance at Different Convergence Thresholds (Heat Equation, GPU)¶
| rtol | Diag | IC | AINV | Ours |
|---|---|---|---|---|
| \(10^{-2}\) | 43ms (309) | 92ms (115) | 52ms (199) | 22ms (124) |
| \(10^{-4}\) | 53ms (384) | 114ms (143) | 61ms (246) | 27ms (154) |
| \(10^{-6}\) | 63ms (442) | 132ms (164) | 67ms (280) | 32ms (176) |
| \(10^{-8}\) | 77ms (520) | 167ms (204) | 78ms (330) | 36ms (197) |
Key Findings¶
- On GPUs, the proposed method achieves a 68%–113% end-to-end speedup.
- Iteration counts are comparable to IC (indicating high preconditioner quality), while per-step application cost is far lower than IC's triangular solves.
- Construction time (0.181ms) is on the same order as the diagonal preconditioner (0.196ms), far below IC (1.866ms) and AINV (18.924ms).
- The SAI loss yields a notable condition number improvement over the non-scale-invariant conventional loss.
Highlights & Insights¶
- Natural compatibility of SPAI and GNN: SPAI's two-hop locality perfectly matches GNN message passing, making it a more principled pairing than IC-GNN.
- Elegant design of SAI loss: Scale invariance aligns the preconditioner optimization objective with CG convergence properties, without requiring precomputation of \(\mathbf{A}^{-1}\).
- Extremely lightweight model: With only 24k parameters, the GNN demonstrates that local structural information suffices to guide high-quality preconditioner construction.
- Full GPU pipeline: The entire workflow—from construction (GNN forward pass) to application (SpMV)—executes on the GPU.
Limitations & Future Work¶
- Applicable only to SPD matrices; not suitable for nonsymmetric or indefinite systems.
- Separate GNN training is required for different types of PDE problems.
- The sparsity pattern is fixed to match \(\mathbf{A}\); adaptive sparsity pattern selection is not explored.
- Comparison with multigrid methods such as AMG is limited.
- Scalability to very large-scale problems (millions of degrees of freedom) remains to be verified.
Related Work & Insights¶
- Li et al., Häusner et al. (2023–2024): Pioneering work on GNN→IC preconditioners; this paper identifies the fundamental tension between triangular solves and GNN locality in those approaches.
- Classical SPAI (Kolotilina et al.): This paper addresses the limitations of classical SPAI construction, which relies on sequential algorithms and yields suboptimal performance, through a data-driven approach.
- Insight: In scientific computing, "GPU-friendliness" may matter more than "theoretical optimality"—IC is theoretically superior yet slower in practice on GPUs.
Rating¶
- Novelty: ⭐⭐⭐⭐ The GNN+SPAI combination is novel, and the SAI loss design has theoretical depth.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on multiple PDE problems and synthetic datasets, with comprehensive comparison against both classical and learning-based methods.
- Writing Quality: ⭐⭐⭐⭐⭐ Well-structured; the comparative analysis of locality is particularly compelling.
- Value: ⭐⭐⭐⭐ High practical value for GPU-based scientific computing; the 68%–113% speedup is highly significant.