GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal \(k\)-sparse GLMs¶
Conference: ICML2025
arXiv: 2603.01306
Code: To be confirmed
Area: Others
Keywords: Sparse GLM, perspective relaxation, branch-and-bound, linear convergence, GPU acceleration, dual gap restart, proximal methods
TL;DR¶
Proposes GPU-friendly and linearly convergent first-order methods. Through composite reformulation and a dual gap restart strategy, the perspective relaxation solving is accelerated by 1-2 orders of magnitude, enabling optimality certification for large-scale sparse GLMs.
Background & Motivation¶
- Sparse GLMs are core tools in machine learning and statistics, widely applied in high-stakes scenarios such as healthcare and finance, which demand certifiable global optimal solutions rather than approximations.
- The problem is essentially an NP-hard \(\ell_0\)-constrained optimization, and the standard method is the branch-and-bound (BnB) framework.
- The core bottleneck of BnB lies in the lower bound computation at each node: the standard big-M relaxation is too loose, while the perspective relaxation is tighter but computationally expensive to solve.
- Interior point methods (IPMs) for solving the perspective relaxation suffer from cubic complexity, cannot be parallelized on GPUs, and do not support warm-start.
- Existing first-order methods (ADMM, coordinate descent) suffer from two issues:
- The convergence rate is only sublinear, and the convergence rate of the safe lower bound lacks theoretical guarantees.
- ADMM requires solving linear systems, and coordinate descent is inherently serial, making both unsuitable for GPUs.
Method¶
1. Composite Reformulation¶
Reformulates the perspective relaxation as an unconstrained composite optimization problem:
where \(F(\boldsymbol{X}\boldsymbol{\beta}) = f(\boldsymbol{X}\boldsymbol{\beta}, \boldsymbol{y})\) is the loss function, and \(G(\boldsymbol{\beta}) = 2\lambda_2 g_{\mathcal{N}}(\boldsymbol{\beta})\) is the implicit perspective regularization term. The key lies in defining the implicit regularization function:
which encodes both the perspective terms \(\beta_j^2/z_j\) and the cardinality/branching constraints into a unified regularizer.
2. Geometric Analysis: Primal Quadratic Growth + Dual Quadratic Decay¶
Proves key geometric properties under regularity conditions (firm convexity, polyhedral subdifferential):
- Primal Quadratic Growth: \(\Phi(\boldsymbol{\beta}) \geq \Phi^\star + \frac{\alpha}{2}\text{dist}^2(\boldsymbol{\beta}, \mathcal{B}^\star)\)
- Dual Quadratic Decay (proposed for the first time in this paper): \(-\Psi(\boldsymbol{\zeta}) \leq -\Psi^\star + \frac{\kappa}{2}\|\boldsymbol{\zeta} - \boldsymbol{\zeta}^\star\|^2\)
These two form a symmetric relationship. It further proves that the primal optimality gap strictly controls the dual error:
3. Dual Gap Restart Scheme¶
Designs a restart strategy based on the Fenchel dual gap \(\Delta(\boldsymbol{\beta}, \boldsymbol{\zeta}) = \Phi(\boldsymbol{\beta}) - \Psi(\boldsymbol{\zeta})\):
- Restarts accelerated methods like FISTA from the current iteration.
- Triggers a restart when the dual gap drops below a threshold.
- Upgrades the sublinear \(O(1/k^2)\) accelerated method to linear convergence.
- Applicable to various proximal method variants, including fixed step size, line search, and line-search-free schemes.
- This framework is general and not limited to sparse GLMs.
4. Efficient GPU Implementation¶
For the implicit perspective regularizer:
- Proves that both its function value and proximal operator can be computed exactly in \(O(p \log p)\) time (sorting + thresholding).
- Avoids expensive conic programming solvers.
- The primary computational load in each iteration comes from matrix-vector multiplications \(\boldsymbol{X}\boldsymbol{\beta}\) and \(\boldsymbol{X}^\top\boldsymbol{\zeta}\), which naturally fit GPU parallelization.
Key Experimental Results¶
| Experimental Setup | Baselines | Speedup of Ours |
|---|---|---|
| Synthetic Data CPU Lower Bound Computation | IPM / ADMM / CD | 1-2 orders of magnitude |
| Real Data CPU Lower Bound Computation | IPM / ADMM / CD | 1-2 orders of magnitude |
| GPU vs CPU (Ours) | CPU version of Ours | An additional 1 order of magnitude |
| Large-scale BnB Certification | Standard Solvers | Significantly improved BnB scalability |
- Fast convergence of the dual lower bound is validated on both synthetic and real datasets.
- GPU acceleration makes BnB certification feasible on large-scale instances.
- The linear convergence rate of the safe lower bound in experiments aligns with theoretical predictions.
Highlights & Insights¶
- The concept of dual quadratic decay is proposed for the first time, revealing a symmetric structure in primal-dual geometry.
- The dual gap restart is a general framework that can upgrade any sublinear proximal method to linear convergence, with applicability extending far beyond sparse GLMs.
- The \(O(p\log p)\) exact computation of the implicit perspective regularizer avoids general conic solvers, which is key to GPU-friendliness.
- A complete closed loop from theory to engineering: geometric analysis \(\to\) convergence guarantees \(\to\) efficient implementation \(\to\) GPU acceleration \(\to\) BnB integration.
- Comparison with FDPG: while FDPG starts from the dual side and only achieves sublinear primal convergence, this work starts from the primal side and achieves bilinear convergence.
Limitations & Future Work¶
- Geometric regularity conditions (e.g., firm convexity) might not hold under certain non-standard loss functions, limiting its scope of applicability.
- The constants in linear convergence \(\alpha, \kappa, \sigma\) depend heavily on the specific problem structure, and the theoretical bounds might not be tight.
- Experiments primarily focus on sparse linear/logistic regression; the performance on more complex GLMs (such as Poisson regression) has not been fully verified.
- The GPU implementation relies on the matrix \(\boldsymbol{X}\) being fully loaded into GPU memory, which might be restricted for ultra-large-scale features.
- The branching strategy in the BnB framework still employs standard schemes, leaving room for further co-optimization with first-order methods.
Related Work & Insights¶
- MIP for ML: Applying mixed-integer programming (MIP) to interpretable ML, classification/regression, decision trees, etc.
- Perspective relaxation: Pioneered by Ceria & Soares (1999), this paper advances it to highly efficient and solvable forms.
- GPU-accelerated optimization: Complementary to GPU-LP methods like cuPDLP (Lu et al., 2024), but this work directly handles non-linear objectives.
- The conference version of Liu et al. (2025) only handled the root node, whereas this work extends to every node of BnB and adds GPU experiments.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual quadratic decay and gap restart framework possess strong theoretical originality.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dimensional validation across synthetic, real, GPU, and BnB implementations.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematically rigorous, well-structured, and motivated thoroughly.
- Value: ⭐⭐⭐⭐ — Provides a practical and theoretically guaranteed solution for the optimality certification of large-scale sparse optimization.