Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2506.08428
Code: None
Area: Optimization
Keywords: nonconvex optimization, reduction mapping, convergence rate, manifold structure, over-parameterization
TL;DR
This paper proposes the Reduction Mapping framework, which exploits the manifold structure of the optimal solution set (arising from over-parameterization or symmetry) to reparameterize the optimization problem, and proves that the reparameterization improves local curvature and thereby accelerates gradient-based methods.
Background & Motivation
Background: In high-dimensional nonconvex optimization problems such as deep learning training, the set of minimizers typically forms a smooth manifold, arising from model over-parameterization or symmetry.
Limitations of Prior Work: Standard nonconvex optimization analyses focus on general convergence guarantees and overlook the acceleration potential offered by solution structure.
Key Challenge: Algorithms that exploit structural information often converge faster in practice, yet a unified theoretical explanation has been lacking.
Key Insight: When the geometry of the optimal solution set is known, at least locally, a reduction mapping can reparameterize the problem to eliminate redundant directions.
Method
Overall Architecture
Consider the optimization problem \(\min_x f(x)\), whose optimal solution set \(\mathcal{M}^*\) forms a smooth manifold. A reduction mapping \(\phi: \mathbb{R}^d \to \mathbb{R}^d\) maps a portion of the parameter space onto the solution manifold, yielding a reduced objective \(\tilde{f}(z) = f(\phi(z))\).
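To make the construction concrete, below is a minimal sketch of the reduced objective for low-rank matrix factorization, the running example in the paper's experiments. The inner least-squares solve plays the role of the reduction mapping; the function names and the specific form of \(\phi\) are my own illustration, not the paper's formal construction.

```python
# Minimal sketch: reduced objective via an inner least-squares solve.
# The matrix-factorization setup and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target

def f(U, V):
    """Original over-parameterized objective f(x) with x = (U, V)."""
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def phi(U):
    """Reduction mapping: send U to (U, V*(U)) with V*(U) = argmin_V f(U, V)."""
    V = np.linalg.lstsq(U, M, rcond=None)[0].T  # least-squares inner solve
    return U, V

def f_reduced(U):
    """Reduced objective f~(z) = f(phi(z)), here with z = U."""
    return f(*phi(U))

U0, V0 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
print(f(U0, V0), f_reduced(U0))  # the reduced value is at most f(U0, V0)
```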
Key Designs
- Definition of Reduction Mapping
  - Function: The reparameterization ensures that part of the optimization variables automatically lies on the solution manifold.
  - Formal definition: \(\phi\) satisfies \(\phi(z^*) \in \mathcal{M}^*\) for the desired \(z^*\).
  - Origin: Arises naturally from inner optimization problems (e.g., a single step of alternating optimization).
  - Design Motivation: Eliminate degeneracy along the normal directions of the solution manifold.
- Curvature Improvement Theorem (see the numerical sketch after this list)
  - Core Idea: A well-designed reduction mapping improves the curvature of the objective function near the solution.
  - Specifically: the smallest positive eigenvalue of the Hessian of the reduced objective \(\tilde{f}\) at the optimum is no smaller than that of the original objective \(f\).
  - Formally: \(\lambda_{\min}^+(\nabla^2 \tilde{f}) \geq \lambda_{\min}^+(\nabla^2 f)\) at the optimum, with strict inequality in most cases.
  - Significance: Better condition number \(\to\) faster convergence.
- Convergence Acceleration Analysis
  - Gradient descent achieves superior convergence rates on the reduced objective.
  - The unified framework subsumes alternating minimization, block coordinate descent, and reparameterization techniques.
  - Explicit bounds on the reduction in iteration count are provided.
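The curvature claim above can be checked numerically on a small instance. The sketch below builds a rank-\(r\) matrix-factorization problem, forms the Hessians of \(f\) and \(\tilde{f}\) at a minimizer by finite differences, and compares their smallest positive eigenvalues and effective condition numbers; the reduction is again the inner least-squares solve over \(V\), and the instance sizes, step size, and tolerances are my own choices rather than the paper's.

```python
# Numerical check of the curvature-improvement claim on a tiny instance
# (an illustration with my own setup, not the paper's proof).
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
A, B = rng.standard_normal((n, r)), rng.standard_normal((n, r))
M = A @ B.T  # exactly rank r, so (U, V) = (A, B) is a global minimizer

def f_flat(x):
    U, V = x[:n * r].reshape(n, r), x[n * r:].reshape(n, r)
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def f_reduced_flat(z):
    U = z.reshape(n, r)
    V = np.linalg.lstsq(U, M, rcond=None)[0].T  # inner solve = reduction mapping
    return 0.5 * np.linalg.norm(U @ V.T - M, "fro") ** 2

def num_hessian(fun, x, h=1e-4):
    """Dense Hessian by central second differences."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            ei, ej = np.zeros(d), np.zeros(d)
            ei[i], ej[j] = h, h
            H[i, j] = H[j, i] = (fun(x + ei + ej) - fun(x + ei - ej)
                                 - fun(x - ei + ej) + fun(x - ei - ej)) / (4 * h * h)
    return H

def pos_spectrum(H, rel_tol=1e-5):
    """Smallest positive eigenvalue and effective condition number lambda_max / lambda_min^+."""
    lam = np.linalg.eigvalsh(H)
    pos = lam[lam > rel_tol * lam.max()]  # drop the (numerically) zero manifold directions
    return pos.min(), pos.max() / pos.min()

x_star = np.concatenate([A.ravel(), B.ravel()])
print("f :  lambda_min^+, kappa =", pos_spectrum(num_hessian(f_flat, x_star)))
print("f~:  lambda_min^+, kappa =", pos_spectrum(num_hessian(f_reduced_flat, A.ravel())))
```

On such instances the reduced objective should show a larger \(\lambda_{\min}^+\) and a smaller effective condition number, which is the direction predicted by the Curvature Improvement Theorem.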
Theoretical Results
- Theorem 1 (Curvature Improvement): If \(\phi\) is a proper reduction mapping, then \(\kappa(\tilde{f}) \leq \kappa(f)\) (the condition number does not increase).
- Theorem 2 (Convergence Acceleration): The number of iterations for GD to reach an \(\epsilon\)-critical point on \(\tilde{f}\) is \(\mathcal{O}(\kappa(\tilde{f})/\epsilon^2)\), compared to \(\mathcal{O}(\kappa(f)/\epsilon^2)\) for the original problem.
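Since both bounds share the \(1/\epsilon^2\) factor, the predicted saving is simply the ratio of condition numbers. As a rough worked example (treating the bounds as tight and plugging in the condition numbers reported for matrix factorization in Experiment 1 below):

\[
\frac{N_{\text{orig}}}{N_{\text{red}}} \;\approx\; \frac{\kappa(f)/\epsilon^2}{\kappa(\tilde f)/\epsilon^2} \;=\; \frac{\kappa(f)}{\kappa(\tilde f)} \;\approx\; \frac{245}{38} \;\approx\; 6.4,
\]

which is in line with the roughly 6.5× speedups reported in Experiments 1 and 3.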
Key Experimental Results
Experiment 1: Matrix Factorization
| Method | Condition Number | Iterations to \(\epsilon=10^{-6}\) |
|---|---|---|
| Standard GD | 245 | 12,340 |
| GD + Reduction | 38 | 1,890 |
| Adam | 245 | 5,670 |
| Adam + Reduction | 38 | 980 |
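A toy re-creation of this comparison for plain gradient descent is sketched below (my own instance sizes, step size, and stopping rule, so the iteration counts will not match the table; the reduction step is the inner least-squares solve over \(V\)):

```python
# Toy version of the GD vs. GD+reduction comparison (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target

def gd_original(steps=50_000, lr=1e-3, tol=1e-6):
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    for t in range(steps):
        R = U @ V.T - M                              # residual
        if 0.5 * np.linalg.norm(R, "fro") ** 2 < tol:
            return t
        U, V = U - lr * R @ V, V - lr * R.T @ U      # joint gradient step on f
    return steps

def gd_reduced(steps=50_000, lr=1e-3, tol=1e-6):
    U = rng.standard_normal((n, r))
    for t in range(steps):
        V = np.linalg.lstsq(U, M, rcond=None)[0].T   # reduction: inner solve for V
        R = U @ V.T - M
        if 0.5 * np.linalg.norm(R, "fro") ** 2 < tol:
            return t
        U = U - lr * R @ V                           # envelope theorem: gradient of f~(U)
    return steps

print("plain GD       :", gd_original(), "iterations")
print("GD + reduction :", gd_reduced(), "iterations")
```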
Experiment 2: Neural Network Training (Over-parameterized 2-Layer Network)
| Method | Iterations to Convergence | Final Loss |
|---|---|---|
| SGD | 5,000 | 1.2e-4 |
| SGD + Block Reduction | 2,100 | 8.5e-5 |
| Adam | 3,200 | 9.8e-5 |
| Adam + Block Reduction | 1,400 | 7.1e-5 |
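One plausible reading of "block reduction" for a two-layer network is to solve the linear output layer in closed form given the hidden features and run gradient steps only on the hidden weights. The sketch below implements that reading; the architecture, ridge term, and step size are my assumptions, not the paper's training protocol.

```python
# Block reduction for a 2-layer network, read as: closed-form output layer +
# gradient steps on the hidden layer (an illustrative construction).
import numpy as np

rng = np.random.default_rng(2)
d, h, N = 5, 32, 200
X = rng.standard_normal((N, d))
y = np.sin(X @ rng.standard_normal(d))              # toy regression target

def reduced_loss_and_grad(W1, lam=1e-3):
    H = np.tanh(X @ W1)                              # hidden features, N x h
    # Inner (reduction) step: ridge least squares for the output weights w2.
    w2 = np.linalg.solve(H.T @ H + lam * np.eye(h), H.T @ y)
    resid = H @ w2 - y
    loss = 0.5 * np.mean(resid ** 2)
    # Gradient w.r.t. W1 with w2 held fixed (approximate envelope-theorem step;
    # exact only if the inner solve minimized the same unregularized loss).
    dZ = (np.outer(resid, w2) / N) * (1.0 - H ** 2)  # back-prop through tanh
    return loss, X.T @ dZ

W1 = rng.standard_normal((d, h)) / np.sqrt(d)
for t in range(501):
    loss, g = reduced_loss_and_grad(W1)
    if t % 100 == 0:
        print(f"step {t:4d}  reduced loss {loss:.5f}")
    W1 -= 0.1 * g
```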
Experiment 3: Comparison of Reduction Strategies
| Reduction Type | Effective Condition Number | Speedup |
|---|---|---|
| No Reduction (baseline) | 245 | 1× |
| Alternating Minimization | 82 | 2.9× |
| Orthogonal Reduction | 45 | 5.2× |
| Optimal Reduction | 38 | 6.5× |
Key Findings
- Condition number improvements are consistent with theoretical predictions.
- Acceleration is more pronounced under severe over-parameterization.
- The framework provides a unified explanation for the empirical speedups of alternating minimization and reparameterization techniques.
- 12 experimental figures clearly validate the theory.
Highlights & Insights
- Spotlight Paper: The theoretical contribution is recognized as important and technically refined.
- Unified Framework: Disparate optimization techniques are subsumed under a single theoretical framework.
- Practical Guidance: The work clarifies when and how solution structure can be exploited to accelerate optimization.
- The comprehensive 43-page paper includes complete proofs.
Limitations & Future Work
- The approach requires the solution manifold structure to be known, at least locally; methods for automatically discovering such structure remain to be investigated.
- The analysis for stochastic optimization (SGD) is less complete than for the deterministic setting.
- Precise characterization of the solution manifold in deep networks remains an open problem.
- The interaction effects with adaptive learning rate methods require further investigation.
Related Work & Insights
- Over-parameterization theory (Allen-Zhu et al. 2019, Du et al. 2019)
- Manifold optimization (Absil et al. 2008)
- Implicit regularization and solution structure
- Insight: Combining the framework with NTK theory to analyze reduction mappings in deep networks is a promising direction.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Unified framework explaining the acceleration of classical techniques
- Experimental Thoroughness: ⭐⭐⭐⭐ Theory validated with 12 figures
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretically clear with complete proofs
- Value: ⭐⭐⭐⭐⭐ Spotlight; foundational contribution