Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2506.08428
Code: None
Area: Optimization
Keywords: nonconvex optimization, reduction mapping, convergence rate, manifold structure, over-parameterization
TL;DR
This paper proposes the Reduction Mapping framework, which exploits the manifold structure of the optimal solution set (arising from over-parameterization or symmetry) to reparameterize the optimization problem, and proves that the reparameterization improves local curvature and thereby accelerates gradient-based methods.
Background & Motivation
Background: In high-dimensional nonconvex optimization problems such as deep learning training, the set of minimizers typically forms a smooth manifold, arising from model over-parameterization or symmetry.
Limitations of Prior Work: Standard nonconvex optimization analyses focus on general convergence guarantees and overlook the acceleration potential offered by solution structure.
Key Challenge: Algorithms that exploit structural information often converge faster in practice, yet a unified theoretical explanation has been lacking.
Key Insight: When the geometry of the optimal solution set is known, at least locally, a reduction mapping can reparameterize the problem to eliminate redundant directions.
Method
Overall Architecture
Consider the optimization problem \(\min_x f(x)\), whose optimal solution set \(\mathcal{M}^*\) forms a smooth manifold. A reduction mapping \(\phi: \mathbb{R}^d \to \mathbb{R}^d\) maps a portion of the parameter space onto the solution manifold, yielding a reduced objective \(\tilde{f}(z) = f(\phi(z))\).
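To make the construction concrete, below is a minimal sketch of the reduced objective for low-rank matrix factorization, the running example in the paper's experiments. The inner least-squares solve plays the role of the reduction mapping; the function names and the specific form of \(\phi\) are my own illustration, not the paper's formal construction.

```python
# Minimal sketch: reduced objective via an inner least-squares solve.
# The matrix-factorization setup and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target

def f(U, V):
    """Original over-parameterized objective f(x) with x = (U, V)."""
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def phi(U):
    """Reduction mapping: send U to (U, V*(U)) with V*(U) = argmin_V f(U, V)."""
    V = np.linalg.lstsq(U, M, rcond=None)[0].T  # least-squares inner solve
    return U, V

def f_reduced(U):
    """Reduced objective f~(z) = f(phi(z)), here with z = U."""
    return f(*phi(U))

U0, V0 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
print(f(U0, V0), f_reduced(U0))  # the reduced value is at most f(U0, V0)
```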
Key Designs
- Definition of Reduction Mapping
  - Function: The reparameterization ensures that part of the optimization variables automatically lies on the solution manifold.
  - Formal definition: \(\phi\) satisfies \(\phi(z^*) \in \mathcal{M}^*\) for the desired \(z^*\).
  - Origin: Arises naturally from inner optimization problems (e.g., a single step of alternating optimization).
  - Design Motivation: Eliminate degeneracy along the normal directions of the solution manifold.
- Curvature Improvement Theorem (see the numerical sketch after this list)
  - Core Idea: A well-designed reduction mapping improves the curvature of the objective function near the solution.
  - Specifically: the smallest positive eigenvalue of the Hessian of the reduced objective \(\tilde{f}\) at the optimum is no smaller than that of the original objective \(f\).
  - Formally: \(\lambda_{\min}^+(\nabla^2 \tilde{f}) \geq \lambda_{\min}^+(\nabla^2 f)\) at the optimum, with strict inequality in most cases.
  - Significance: Better condition number \(\to\) faster convergence.
- Convergence Acceleration Analysis
  - Gradient descent achieves superior convergence rates on the reduced objective.
  - The unified framework subsumes alternating minimization, block coordinate descent, and reparameterization techniques.
  - Explicit bounds on the reduction in iteration count are provided.
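The curvature claim above can be checked numerically on a small instance. The sketch below builds a rank-\(r\) matrix-factorization problem, forms the Hessians of \(f\) and \(\tilde{f}\) at a minimizer by finite differences, and compares their smallest positive eigenvalues and effective condition numbers; the reduction is again the inner least-squares solve over \(V\), and the instance sizes, step size, and tolerances are my own choices rather than the paper's.

```python
# Numerical check of the curvature-improvement claim on a tiny instance
# (an illustration with my own setup, not the paper's proof).
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
A, B = rng.standard_normal((n, r)), rng.standard_normal((n, r))
M = A @ B.T  # exactly rank r, so (U, V) = (A, B) is a global minimizer

def f_flat(x):
    U, V = x[:n * r].reshape(n, r), x[n * r:].reshape(n, r)
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def f_reduced_flat(z):
    U = z.reshape(n, r)
    V = np.linalg.lstsq(U, M, rcond=None)[0].T  # inner solve = reduction mapping
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def num_hessian(fun, x, h=1e-4):
    """Dense Hessian by central second differences."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            ei, ej = np.zeros(d), np.zeros(d)
            ei[i], ej[j] = h, h
            H[i, j] = H[j, i] = (fun(x + ei + ej) - fun(x + ei - ej)
                                 - fun(x - ei + ej) + fun(x - ei - ej)) / (4 * h * h)
    return H

def pos_spectrum(H, rel_tol=1e-5):
    """Smallest positive eigenvalue and effective condition number lambda_max / lambda_min^+."""
    lam = np.linalg.eigvalsh(H)
    pos = lam[lam > rel_tol * lam.max()]  # drop the (numerically) zero manifold directions
    return pos.min(), pos.max() / pos.min()

x_star = np.concatenate([A.ravel(), B.ravel()])
print("f :  lambda_min^+, kappa =", pos_spectrum(num_hessian(f_flat, x_star)))
print("f~:  lambda_min^+, kappa =", pos_spectrum(num_hessian(f_reduced_flat, A.ravel())))
```

On such instances the reduced objective should show a larger \(\lambda_{\min}^+\) and a smaller effective condition number, which is the direction predicted by the Curvature Improvement Theorem.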
Theoretical Results
- Theorem 1 (Curvature Improvement): If \(\phi\) is a proper reduction mapping, then \(\kappa(\tilde{f}) \leq \kappa(f)\) (the condition number does not increase).
- Theorem 2 (Convergence Acceleration): The number of iterations for GD to reach an \(\epsilon\)-critical point on \(\tilde{f}\) is \(\mathcal{O}(\kappa(\tilde{f})/\epsilon^2)\), compared to \(\mathcal{O}(\kappa(f)/\epsilon^2)\) for the original problem.
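Since both bounds share the \(1/\epsilon^2\) factor, the predicted saving is simply the ratio of condition numbers. As a rough worked example (treating the bounds as tight and plugging in the condition numbers reported for matrix factorization in Experiment 1 below):

\[
\frac{N_{\text{orig}}}{N_{\text{red}}} \;\approx\; \frac{\kappa(f)/\epsilon^2}{\kappa(\tilde f)/\epsilon^2} \;=\; \frac{\kappa(f)}{\kappa(\tilde f)} \;\approx\; \frac{245}{38} \;\approx\; 6.4,
\]

which is in line with the roughly 6.5× speedups reported in Experiments 1 and 3.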
Key Experimental Results
Experiment 1: Matrix Factorization
| Method | Condition Number | Iterations to \(\epsilon=10^{-6}\) |
|---|---|---|
| Standard GD | 245 | 12,340 |
| GD + Reduction | 38 | 1,890 |
| Adam | 245 | 5,670 |
| Adam + Reduction | 38 | 980 |
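A toy re-creation of this comparison for plain gradient descent is sketched below (my own instance sizes, step size, and stopping rule, so the iteration counts will not match the table; the reduction step is the inner least-squares solve over \(V\)):

```python
# Toy version of the GD vs. GD+reduction comparison (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target

def gd_original(steps=50_000, lr=1e-3, tol=1e-6):
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    for t in range(steps):
        R = U @ V.T - M                              # residual
        if 0.5 * np.linalg.norm(R, "fro") ** 2 < tol:
            return t
        U, V = U - lr * R @ V, V - lr * R.T @ U      # joint gradient step on f
    return steps

def gd_reduced(steps=50_000, lr=1e-3, tol=1e-6):
    U = rng.standard_normal((n, r))
    for t in range(steps):
        V = np.linalg.lstsq(U, M, rcond=None)[0].T   # reduction: inner solve for V
        R = U @ V.T - M
        if 0.5 * np.linalg.norm(R, "fro") ** 2 < tol:
            return t
        U = U - lr * R @ V                           # envelope theorem: gradient of f~(U)
    return steps

print("plain GD       :", gd_original(), "iterations")
print("GD + reduction :", gd_reduced(), "iterations")
```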
Experiment 2: Neural Network Training (Over-parameterized 2-Layer Network)
| Method | Iterations to Convergence | Final Loss |
|---|---|---|
| SGD | 5,000 | 1.2e-4 |
| SGD + Block Reduction | 2,100 | 8.5e-5 |
| Adam | 3,200 | 9.8e-5 |
| Adam + Block Reduction | 1,400 | 7.1e-5 |
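One plausible reading of "block reduction" for a two-layer network is to solve the linear output layer in closed form given the hidden features and run gradient steps only on the hidden weights. The sketch below implements that reading; the architecture, ridge term, and step size are my assumptions, not the paper's training protocol.

```python
# Block reduction for a 2-layer network, read as: closed-form output layer +
# gradient steps on the hidden layer (an illustrative construction).
import numpy as np

rng = np.random.default_rng(2)
d, h, N = 5, 32, 200
X = rng.standard_normal((N, d))
y = np.sin(X @ rng.standard_normal(d))              # toy regression target

def reduced_loss_and_grad(W1, lam=1e-3):
    H = np.tanh(X @ W1)                              # hidden features, N x h
    # Inner (reduction) step: ridge least squares for the output weights w2.
    w2 = np.linalg.solve(H.T @ H + lam * np.eye(h), H.T @ y)
    resid = H @ w2 - y
    loss = 0.5 * np.mean(resid ** 2)
    # Gradient w.r.t. W1 with w2 held fixed (approximate envelope-theorem step;
    # exact only if the inner solve minimized the same unregularized loss).
    dZ = (np.outer(resid, w2) / N) * (1.0 - H ** 2)  # back-prop through tanh
    return loss, X.T @ dZ

W1 = rng.standard_normal((d, h)) / np.sqrt(d)
for t in range(501):
    loss, g = reduced_loss_and_grad(W1)
    if t % 100 == 0:
        print(f"step {t:4d}  reduced loss {loss:.5f}")
    W1 -= 0.1 * g
```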
Experiment 3: Comparison of Reduction Strategies
| Reduction Type | Effective Condition Number | Speedup |
|---|---|---|
| No Reduction (baseline) | 245 | 1× |
| Alternating Minimization | 82 | 2.9× |
| Orthogonal Reduction | 45 | 5.2× |
| Optimal Reduction | 38 | 6.5× |
Key Findings
- Condition number improvements are consistent with theoretical predictions.
- Acceleration is more pronounced under severe over-parameterization.
- The framework provides a unified explanation for the empirical speedups of alternating minimization and reparameterization techniques.
- 12 experimental figures clearly validate the theory.
Highlights & Insights
- Spotlight Paper: The theoretical contribution is recognized as important and technically refined.
- Unified Framework: Disparate optimization techniques are subsumed under a single theoretical framework.
- Practical Guidance: The work clarifies when and how solution structure can be exploited to accelerate optimization.
- The comprehensive 43-page paper includes complete proofs.
Limitations & Future Work
- The approach requires the solution manifold structure to be known, at least locally; methods for automatically discovering such structure remain to be investigated.
- The analysis for stochastic optimization (SGD) is less complete than for the deterministic setting.
- Precise characterization of the solution manifold in deep networks remains an open problem.
- The interaction effects with adaptive learning rate methods require further investigation.
Related Work & Insights
- Over-parameterization theory (Allen-Zhu et al. 2019, Du et al. 2019)
- Manifold optimization (Absil et al. 2008)
- Implicit regularization and solution structure
- Insight: Combining the framework with NTK theory to analyze reduction mappings in deep networks is a promising direction.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Unified framework explaining the acceleration of classical techniques
- Experimental Thoroughness: ⭐⭐⭐⭐ Theory validated with 12 figures
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretically clear with complete proofs
- Value: ⭐⭐⭐⭐⭐ Spotlight; foundational contribution