Learning Provably Improves the Convergence of Gradient Descent¶
- Conference: NeurIPS 2025
- arXiv: 2501.18092
- Authors: Qingyu Song (Xiamen University), Wei Lin, Hong Xu (CUHK)
- Code: GitHub
- Area: Optimization
- Keywords: Learn to Optimize, Gradient Descent, Neural Tangent Kernel, Convergence Proof, Initialization Strategy
TL;DR¶
This paper presents the first rigorous proof of training convergence for an unrolling-based Learn to Optimize (L2O) framework (Math-L2O). By leveraging NTK theory, it establishes a linear convergence rate and proposes a deterministic initialization strategy that provably ensures L2O improves upon the convergence performance of gradient descent. Experiments demonstrate over 50% improvement in optimality compared to standard GD.
Background & Motivation¶
State of the Field¶
Learn to Optimize (L2O) employs deep neural networks to learn hyperparameters of optimization algorithms (e.g., step sizes), achieving success on convex problems (LASSO, logistic regression) and non-convex problems (MIMO sum-rate maximization). "White-box" approaches such as Math-L2O embed neural networks within the structure of classical optimization algorithms, offering both interpretability and practical utility.
Limitations of Prior Work¶
- LISTA-CPSS: Constructively proves a linear convergence rate, but the theoretical guarantees rely on extremely strict assumptions (sign consistency of iterates, column constraints on parameter matrices); experiments show these conditions are frequently violated after training.
- Math-L2O: Derives only necessary conditions for convergence without clarifying how the training process itself guarantees convergence; follow-up work relies on the strong assumption that the L2O model must mimic standard GD behavior.
- Common issue: Both suffer from training instability caused by gradient explosion under long optimization horizons — low learning rates converge slowly, while high learning rates cause training collapse.
Root Cause¶
Two critical theoretical gaps remain open: (1) a proof of training convergence for unrolling-based L2O models; (2) a precise characterization of the relationship between NN training convergence (the parameter-optimization dimension) and L2O solution convergence (the iterative-solving dimension).
Method¶
Problem Formulation¶
- Optimization objective: \(\min_{x} f(x) = \frac{1}{2}\|\mathbf{M}x - y\|_2^2\), where \(\mathbf{M} \in \mathbb{R}^{b \times d}\) has full row rank and \(f\) is \(\beta\)-smooth.
- Math-L2O update: \(X_t = X_{t-1} - \frac{1}{\beta} P_t \odot \nabla F(X_{t-1})\), where \(P_t = g_W(X_{t-1}, \nabla F(X_{t-1}))\) is a coordinate-wise step-size vector generated by the NN.
- NN architecture: \(L\)-layer network with ReLU activations in intermediate layers and \(2\sigma\) (Sigmoid scaled by 2) at the output to ensure \(0 < P_t < 2\).
- Training loss: \(F(W) = \frac{1}{2}\|\mathbf{M}X_T - Y\|_2^2\), optimized via Back-Propagation-Through-Time (BPTT).
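To make the formulation concrete, below is a minimal PyTorch sketch of the unrolled Math-L2O forward pass and training loss. The module layout, hidden width, and function names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MathL2O(nn.Module):
    """Coordinate-wise step-size generator g_W (illustrative sketch).

    Maps the current iterate and its gradient to a step-size vector P_t
    with 0 < P_t < 2, via ReLU hidden layers and a 2*Sigmoid output,
    matching the architecture described above.
    """

    def __init__(self, d: int, width: int = 256, depth: int = 3):
        super().__init__()
        layers = [nn.Linear(2 * d, width), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(width, width), nn.ReLU()]
        layers.append(nn.Linear(width, d))
        self.net = nn.Sequential(*layers)

    def forward(self, x, grad):
        # 2 * sigmoid keeps every coordinate-wise step size in (0, 2).
        return 2.0 * torch.sigmoid(self.net(torch.cat([x, grad], dim=-1)))

def unrolled_loss(model, M, y, x0, T, beta):
    """T unrolled Math-L2O steps on f(x) = 0.5 * ||M x - y||^2.

    beta is the smoothness constant of f (largest eigenvalue of M^T M).
    Returns the training loss F(W) = 0.5 * ||M x_T - y||^2.
    """
    x = x0
    for _ in range(T):
        grad = M.T @ (M @ x - y)          # gradient of the quadratic
        p = model(x, grad)                # learned step-size vector P_t
        x = x - (1.0 / beta) * p * grad   # coordinate-wise L2O update
    return 0.5 * (M @ x - y).pow(2).sum()
```

Training then amounts to calling `unrolled_loss(...).backward()` inside a standard optimizer loop, which is exactly BPTT through the \(T\) unrolled steps.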
Core Theoretical Contribution 1: Alignment Between L2O and GD¶
An explicit relationship is established between the training convergence rate and the GD convergence rate: the L2O solution error is bounded by GD's sublinear \(\frac{\beta}{T}\) rate multiplied by the convergence factor \(r_k < 1\) at training iteration \(k\). This implies that L2O is at least as fast as GD, and that training can only further accelerate convergence.
Core Theoretical Contribution 2: Linear Training Convergence Rate (Theorem 4.3)¶
Using NTK theory, the paper proves that Math-L2O training converges linearly, at a rate governed by the following quantities:
- \(\alpha_0 = \sigma_{\min}(G_{L-1,T}^0)\): the minimum singular value of the penultimate-layer output at initialization.
- \(\delta_4 = \sigma(\delta_3 \Theta_L)(1 - \sigma(\delta_3 \Theta_L)) > 0\): the Sigmoid derivative evaluated at \(\delta_3 \Theta_L\) (with \(\delta_3\) and \(\Theta_L\) bounds from the proof), which stays strictly positive.
- Only \(\mathcal{O}(Nd)\) network width is required, far below the infinite-width requirement of classical NTK results.
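Schematically, linear convergence here means the training loss contracts geometrically at each training step (a paraphrase rather than the paper's exact bound, whose contraction factor depends on \(\eta\), \(\alpha_0\), and \(\delta_4\)):

\[
F(W^{k+1}) \;\le\; r_k\, F(W^k), \qquad 0 < r_k < 1,
\qquad\text{so}\qquad
F(W^k) \;\le\; \Big(\prod_{j=0}^{k-1} r_j\Big)\, F(W^0).
\]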
Key Lemmas:

- Lemma 4.1 (Gradient bound): The training gradient norm is upper bounded by the objective value, ensuring the gradient decreases as training progresses.
- Lemma 4.2 (Semi-smoothness): The response of Math-L2O outputs to parameter perturbations is bounded, with a coefficient of \(\mathcal{O}(e^{LT})\), consistent with known results for deep architectures.
Core Theoretical Contribution 3: Deterministic Initialization Strategy¶
- Alignment initialization: The first \(L-1\) layers are sampled from a standard Gaussian distribution, followed by QR decomposition and non-negativization; the last layer is set to \(W_L^0 = \mathbf{0}\), so that the initial output is \(P_t = \mathbf{1}\) (since \(2\sigma(0) = 1\)), making the initial L2O update identical to standard GD with step size \(1/\beta\).
- Singular value enhancement: A scaling factor \(e \geq 1\) is introduced, amplifying the initial parameters to \(\{eW_1^0, \ldots, eW_{L-1}^0\}\), scaling \(\alpha_0\) to \(e^{L-1}\alpha_0\).
- Four lemmas specify the minimum required \(e\):
- Lemma 5.1: \(e = \Omega(T^{1/(L-1)})\)
- Lemma 5.2: \(e = \Omega(T^{(3T+6)/(TL-T-4L+6)})\) (most stringent; grows exponentially with \(T\))
- Lemma 5.3: \(e = \Omega(T^{4/(L-1)})\)
- Lemma 5.4: \(e = \Omega(T^{5/(L-1)} L^{1/(L-1)})\)
Increasing network depth \(L\) alleviates the demand on \(e\).
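Below is a minimal sketch of this initialization for a stack of linear layers, assuming `torch.nn.init.orthogonal_` (which is QR-based) for the orthogonalization step; the paper's exact non-negativization procedure may differ in detail.

```python
import torch

def deterministic_init(model: torch.nn.Module, e: float = 1.0) -> None:
    """Alignment initialization with singular-value enhancement (sketch).

    First L-1 layers: Gaussian sample -> QR orthogonalization ->
    entrywise non-negativization, then scaling by the factor e >= 1
    (which scales alpha_0 to e^(L-1) * alpha_0).
    Last layer: W_L^0 = 0, so 2*sigmoid(0) = 1 and the initial step-size
    vector is all ones, i.e. the untrained model behaves exactly like GD.
    """
    linears = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]
    with torch.no_grad():
        for layer in linears[:-1]:
            torch.nn.init.orthogonal_(layer.weight)  # QR-based orthogonal factor
            layer.weight.abs_()                      # non-negativization
            layer.weight.mul_(e)                     # singular-value enhancement
            layer.bias.zero_()
        linears[-1].weight.zero_()                   # W_L^0 = 0
        linears[-1].bias.zero_()
```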
Key Experimental Results¶
Experiment 1: Training Convergence and Robustness¶
- Environment: Python 3.9, PyTorch 1.12.0, Ubuntu 20.04, 128 GB RAM, 2× NVIDIA RTX 3090.
- Data: 10 Gaussian random problem instances, \(X \in \mathbb{R}^{5120 \times 1}\), \(Y \in \mathbb{R}^{4000 \times 1}\).
- Network: \(L=3\) layers, \(T=100\) steps, width 5120.
| Setting | Method | Result |
|---|---|---|
| \(T=100\), various learning rates | Ours | Stable convergence across all learning rates (\(10^{-3}\) to \(10^{-7}\)) |
| \(T=100\), various learning rates | Math-L2O | Slow convergence at low LR; training collapse at high LR |
| \(T=100\), various learning rates | LISTA-CPSS | Slow convergence at low LR; numerical overflow at high LR |
| Various \(T\)+LR combinations | Ours | Consistent convergence across all configurations, surpassing GD baseline |
Key finding: The proposed method achieves similar convergence rates across all learning rates \(\eta \in [10^{-7}, 10^{-3}]\), validating the linear convergence guarantee of Theorem 4.3. In contrast, the prior state-of-the-art methods must be terminated before 100 steps due to gradient explosion.
Ablation Study: Interaction Between Learning Rate and Scaling Factor¶
Setting: \(T=20\), \(X \in \mathbb{R}^{32 \times 32}\), \(Y \in \mathbb{R}^{32 \times 20}\), network width 1024. Performance is measured by the relative improvement \(\frac{\text{obj}_{\text{GD}} - \text{obj}_{\text{L2O}}}{\text{obj}_{\text{GD}}}\).
| Fixed Parameter | Varying Parameter | Key Observation |
|---|---|---|
| \(e=50\) | \(\eta \in \{10^{-3}, \ldots, 10^{-7}\}\) | \(\eta=10^{-3}\) causes instability/divergence; \(\eta \leq 10^{-4}\) all converge; larger \(\eta\) converges faster |
| \(\eta=10^{-7}\) | Various values of \(e\) | Relative improvement increases monotonically with \(e\) |
The experiments precisely validate Corollary C.1: \(e\) and \(\eta\) exhibit an inverse relationship, where larger \(e\) requires smaller \(\eta\) for convergence. An operational upper bound on the learning rate exists, below which larger \(\eta\) converges faster, consistent with the theoretical predictions of Theorem 4.3.
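For reference, the relative-improvement metric used in this ablation is straightforward to compute (a trivial sketch; the variable names are hypothetical):

```python
def relative_improvement(obj_gd: float, obj_l2o: float) -> float:
    """(obj_GD - obj_L2O) / obj_GD: the fraction by which L2O lowers the
    final objective relative to GD; 0.5 corresponds to a 50% improvement."""
    return (obj_gd - obj_l2o) / obj_gd
```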
Overall performance: The proposed L2O framework achieves over 50% improvement in optimality compared to standard GD, with strong robustness to hyperparameter choices, outperforming LISTA-CPSS, Math-L2O, and the Adam optimizer.
Highlights & Insights¶
- Pioneering theoretical contribution: The first rigorous proof of training convergence for unrolling-based L2O, establishing an explicit bridge between training convergence rate and objective optimization convergence rate.
- Practical deterministic initialization: Zero-initialization of the last layer ensures the initial L2O behavior is equivalent to standard GD, avoiding gradient explosion; QR decomposition with non-negativization guarantees \(\alpha_0 > 0\).
- Novel application of NTK theory: NTK is extended from standard ReLU networks to the Math-L2O architecture with recurrent structure and Sigmoid output, requiring only \(\mathcal{O}(Nd)\) width.
- Strong theory-experiment consistency: Ablation experiments precisely validate theoretical predictions, including the inverse \(e\)-\(\eta\) relationship and the existence of a learning rate upper bound.
Limitations & Future Work¶
- Restricted to quadratic programming: The theoretical analysis is limited to objectives of the form \(\frac{1}{2}\|\mathbf{M}x - y\|_2^2\) and has not been extended to general convex or non-convex problems.
- Synthetic data only: Experiments are conducted exclusively on randomly generated Gaussian data, lacking validation on real-world optimization problems (e.g., LASSO, compressed sensing, signal recovery).
- Rapid learning rate decay: \(\eta\) decays exponentially with \(L\) and \(T\); while theoretically permissible, this may limit practical applicability to large-scale problems.
- Sensitivity of scaling factor \(e\): \(e\) grows exponentially with \(T\) (Lemma 5.2), potentially causing numerical issues when the number of optimization steps is large.
- Coordinate independence assumption: The coordinate-wise architecture treats each dimension independently, failing to capture inter-dimensional coupling.
- No stochastic optimization: The analysis is based on full-batch GD and has not been extended to stochastic methods such as SGD.
Related Work & Insights¶
- LISTA-CPSS (Chen et al. 2018): Constructively proves linear convergence but relies on conditions that are rarely satisfied at inference time (sign consistency, parameter matrix constraints); the present work directly proves training convergence via NTK.
- Math-L2O (Liu et al. 2023): Provides only necessary conditions for convergence without proving training convergence; the present work establishes a linear training convergence rate.
- Song et al. 2024: Analyzes inference-time convergence of Math-L2O but assumes the L2O model has already learned to mimic GD; the present work eliminates this assumption.
- NTK theory (Jacot 2018, Du 2018, Allen-Zhu 2019): The present work extends NTK from feedforward networks and standard RNNs to L2O models with recurrent and gradient-dependent structure, addressing the novel challenge of higher-order polynomial parameter dependencies.
- Nguyen et al. 2021: Provides training convergence analysis for ReLU networks; the present work inherits their relaxation strategy but faces a more complex semi-smoothness coefficient of \(\mathcal{O}(e^{LT})\).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First rigorous proof of training convergence for L2O; theoretically significant and original.
- Experimental Thoroughness: ⭐⭐⭐ — Validated only on synthetic data; lacks experiments on real-world application scenarios.
- Writing Quality: ⭐⭐⭐⭐ — Rigorous theoretical derivations and clear structure, though the extensive notation increases reading difficulty.
- Value: ⭐⭐⭐⭐ — Fills a gap in L2O convergence theory, though generality is limited by the quadratic programming restriction.