Skip to content

Scalable Second-Order Riemannian Optimization for K-means Clustering

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=FNWNG1ftuw
Area: Optimization / Clustering
Keywords: K-means, Riemannian optimization, Semidefinite programming relaxation, Second-order critical point, Burer-Monteiro decomposition

TL;DR

This paper reformulates the non-negativity constrained low-rank K-means SDP relaxation as an unconstrained smooth optimization problem on a product manifold. It is solved using a cubic-regularized Riemannian Newton method. Through manifold decomposition, the cost per Newton subproblem is reduced to linear complexity with respect to the number of samples \(n\). This allows convergence to a second-order critical point (globally optimal under benign non-convexity assumptions) within \(n\cdot\epsilon^{-3/2}\cdot\mathrm{poly}(r,d)\) time, which is 2–4 times faster than first-order SOTA methods while maintaining comparable statistical accuracy.

Background & Motivation

Background: K-means clustering aims to partition \(n\) points into \(K\) groups to maximize intra-cluster similarity. Heuristic methods like Lloyd's algorithm and spectral clustering lack guarantees for local or global optima. A theoretically grounded path involves writing K-means as a semidefinite programming (SDP) relaxation. When data comes from a well-separated Gaussian Mixture Model (GMM), Peng & Wei (2007) proved the SDP relaxation can recover globally optimal clusters in polynomial time as \(\mu\to 0^+\), achieving exact recovery once cluster separation exceeds the information-theoretic threshold.

Limitations of Prior Work: SDP (2) requires optimization over an \(n\times n\) matrix \(Z\) with \(O(n^2)\) variables, which is impractical for \(n\) samples. A natural alternative is the Burer-Monteiro (BM) decomposition \(Z=UU^\top\), reducing variables to factor matrices \(U\in\mathbb{R}^{n\times r}\) at \(O(n)\) cost. However, this sacrifices convexity, generally only finding critical points that might be spurious local minima or saddle points. Moreover, the non-negative BM form for K-means adds element-wise constraints \(U\ge 0\) (for doubly non-negative \(Z\)). Non-negative BM decomposition is generally considered to lack benign non-convexity—testing for copositivity/complete positivity is NP-hard, and spurious local minima often emerge.

Key Challenge: Although theory suggests the non-negative form should be riddled with spurious solutions, the authors empirically and repeatedly observed that in average cases where SDP solves K-means globally, all second-order critical points of the low-rank non-negative form (3) lie within a neighborhood of some global optimum (formulated as "Benign Non-convexity" Assumption 1). If this assumption holds, solving K-means globally is equivalent to finding a second-order critical point of (3). However, in constrained non-convex scenarios, no general-purpose algorithm guarantees convergence to a second-order critical point. The core obstacle is maintaining the feasibility of non-convex constraints \(UU^\top\mathbf{1}_n=\mathbf{1}_n\) and \(\mathrm{tr}(UU^\top)=K\) at each iteration while advancing toward the optimum. Solvers like fmincon or knitro only guarantee convergence to critical points of a merit function (potentially infeasible for the original problem), and augmented Lagrangian only guarantees local first-order convergence.

Goal: Design an algorithm that guarantees convergence to a second-order critical point with low per-step cost and scalability to large \(n\), while maintaining strict feasibility.

Key Insight: In unconstrained non-convex optimization, algorithms like gradient descent and trust-region Newton can globally converge to second-order critical points from any initialization. If the constraints in (3) can be "internalized" into a manifold structure, turning the problem into unconstrained smooth optimization on a manifold, one can directly apply these mature algorithms along with their first/second-order convergence guarantees.

Core Idea: Rewrite the constraint set as a product manifold (projected hypersphere × orthogonal group) that admits cheap second-order retractions. Then, run a cubic-regularized Riemannian Newton method on it, leveraging the "block-diagonal + low-rank" structure of the Riemannian Hessian to compress the Newton subproblem to linear time.

Method

Overall Architecture

The method follows a rewriting chain: "relaxation → decomposition → manifoldification → second-order solving." The goal is to transform (3), a non-negative low-rank problem with non-convex constraints, into a form efficiently solvable by existing Riemannian second-order algorithms.

First step: The constraint set $\(\mathcal{M}=\{U\in\mathbb{R}^{n\times r}: UU^\top\mathbf{1}_n=\mathbf{1}_n,\ \mathrm{tr}(UU^\top)=K\}\)$ is verifiable as a manifold (satisfying Linear Independence Constraint Qualification, LICQ). Running Riemannian optimization directly on \(\mathcal{M}\) yields an algorithm similar to Carson et al. (2017), but its fatal flaw is the lack of an efficient retraction—which must be computed every step to keep \(U\in\mathcal{M}\). The exponential retraction used by Carson requires \(O(n^2)\), severely limiting scalability.

Second step: This is the key turning point. Instead of working directly on \(\mathcal{M}\), the authors construct a submersion \(\varphi\) from a product manifold \(\widetilde{\mathcal{M}}=\mathcal{V}\times O_r\) to \(\mathcal{M}\), where \(\mathcal{V}\) is a projected hypersphere and \(O_r\) is the \(r\times r\) orthogonal group. Retractions for each component on the product manifold are simple Euclidean projections, reducing the total cost to \(O(nr+r^3)\). The problem is then equivalently reformulated as optimizing objective (13) on \(\widetilde{\mathcal{M}}\).

Third step: Run a cubic-regularized Riemannian Newton method (Theorem 1, converges to a second-order critical point in \(O(\epsilon^{-3/2})\) iterations) on \(\widetilde{\mathcal{M}}\). While a naive implementation is expensive, this work exploits the block-diagonal + low-rank structure of the Riemannian Hessian and uses binary search to compress each step to linear time in \(n\), obtaining an \(\epsilon\)-second-order critical point in \((n/\epsilon^{1.5})\cdot\mathrm{poly}(r,d,K)\) time.

Key Designs

1. Formulating constrained K-means as unconstrained smooth optimization on manifolds: Bringing second-order convergence guarantees

Addressing the "lack of general second-order convergence in constrained non-convexity," the feasibility constraints are fully absorbed into the manifold geometry. For \(\min_{U\in\mathcal{M}} f(U)\) where \(\mathcal{M}=\{U: \mathcal{A}(UU^\top)+\mathcal{B}(U)=c\}\), under LICQ (8), every local minimum is an \(\epsilon\)-second-order critical point (satisfying first-order (9) and tangent space second-order (10) conditions). Riemannian algorithms use problem-specific retraction operators to keep iterates on the feasible set \(\mathcal{M}\). Starting from a feasible point \(U\), the algorithm moves along a smooth curve \(\gamma(t)=R_U(t\dot U)\) and performs a local second-order Taylor expansion (11) using the Riemannian gradient \(\mathrm{grad}\,f\) and Riemannian Hessian \(\mathrm{Hess}\,f\) to select descent directions. This opens the possibility of "global convergence to first/second-order optima from arbitrary initialization" for K-means—something Carson or NLR could not achieve (the former uses non-smooth objectives + penalties and may get stuck in saddles; the latter only has local linear convergence).

2. Product manifold submersion: Reducing \(O(n^2)\) retraction to \(O(nr+r^3)\)

This design is the key to scalability. The authors prove (Theorem 2) there exists a submersion from the product manifold \(\widetilde{\mathcal{M}}=\mathcal{V}\times O_r\) to \(\mathcal{M}\): $\(\varphi(V,Q)=\hat V Q,\quad \hat V=\begin{bmatrix}\hat{\mathbf 1}_n & V\end{bmatrix},\ \hat{\mathbf 1}_n=n^{-1/2}\mathbf 1_n,\)$ where the projected hypersphere \(\mathcal{V}=\{V\in\mathbb{R}^{n\times(r-1)}: \mathbf 1_n^\top V=0,\ \mathrm{tr}(VV^\top)=K-1\}\) and the orthogonal group \(O_r=\{Q: QQ^\top=I_r\}\). Standard submersion properties ensure that every \(\epsilon\)-second-order point on \(\widetilde{\mathcal{M}}\) is also a \(c\epsilon\)-second-order point on \(\mathcal{M}\) (\(c\) being a constant scaling), making the optimization equivalent. The beauty of the product manifold lies in its second-order retraction being the Euclidean projections of its two components: $\(\mathrm{Proj}_{\mathcal V}(V)=\sqrt{K-1}\,\frac{V-n^{-1}\mathbf 1_n\mathbf 1_n^\top V}{\lVert V-n^{-1}\mathbf 1_n\mathbf 1_n^\top V\rVert},\qquad \mathrm{Proj}_{O_r}(Q)=(QQ^\top)^{-1/2}Q,\)$ requiring only \(O(nr+r^3)\), a massive improvement over \(O(n^2)\). The equivalent objective becomes: $\(\min_{(V,Q)\in\widetilde{\mathcal{M}}}\ \langle C, VV^\top\rangle-\mu\sum_{i,j}\log\big(\varphi_{i,j}(V,Q)\big)_+,\qquad C=-XX^\top.\)$ The log-penalty necessitates starting from a strictly feasible point \(\varphi(V_0,Q_0)>0\) (the authors provide a suitable initialization and prove such points exist only when the search rank is over-parameterized \(r>K\)).

3. Cubic-regularized Riemannian Newton + Linear-time Newton subproblem: Trading speed for block-diagonal and low-rank structure

With high-speed retractions, Riemannian gradient descent could reach first-order points in \((n/\epsilon)\cdot\mathrm{poly}\) time. However, the log-penalty makes the landscape extremely ill-conditioned, causing standard methods like CG or Riemannian Trust Regions (RTR) to converge poorly. The authors use cubic-regularized Riemannian Newton (Theorem 1, \(O(\epsilon^{-3/2})\) iterations), solving the Newton subproblem: $\(\min_{Ap=0}\ g^\top p+\tfrac12 p^\top H p+\tfrac{L}{6}\lVert p\rVert^3\)$ efficiently. Here \(g, H\) are the vectorized Riemannian gradient and Hessian, and \(A\) enforces tangent space constraints. By recognizing that the Riemannian Hessian \(H\) has a "block-diagonal + low-rank" structure, the KKT system is solved via binary search on the regularization parameter \(\lambda\), reducing the complexity of each step to \(n\cdot\mathrm{poly}(r,d)\). Combining this with Theorem 1, the total time to reach an \(\epsilon\)-second-order point is \((n/\epsilon^{1.5})\cdot\mathrm{poly}(r,d,K)\).

Loss & Training

The objective is (13): the data term \(\langle C,VV^\top\rangle\) (\(C=-XX^\top\) is the negative Gram matrix) plus a log-barrier penalty \(-\mu\sum_{i,j}\log(\varphi_{i,j})_+\). Core hyperparameters: penalty coefficient \(\mu\) and search rank \(r\). Practice suggests taking \(\mu\) small enough to stay in the "benign" phase transition region. \(r\) is less sensitive when \(\mu\) is small; setting \(r=K+1\) is usually sufficient. After convergence, the cluster assignment matrix is uniquely recovered from the block-diagonal membership matrix \(Z=UU^\top\) via Lemma 1.

Key Experimental Results

Datasets: Synthetic Gaussian Mixture Models (GMM) and real-world CyTOF mass cytometry (265,627 cells × 32 markers; sampled as \(K=4\) imbalanced clusters with \(n=1800\) per Zhuang et al. 2024).

Main Results

Comparison with first-order SOTA (NLR) and classical baselines on CyTOF data (misclustering rate and Frobenius distance to oracle \(Z^\star\)):

Method Misclustering Rate Frobenius Dist. to \(Z^\star\) Description
Ours (Riemannian Newton) ≈0, lowest variance Lowest Same model but more accurate recovery via 2nd-order updates
NLR (Zhuang 2024) ≈0 Higher First-order SOTA; low error but poorer ground truth recovery
SC / NMF / KM++ Significantly higher Classical baselines

Convergence efficiency comparison on GMM (\(n=100\), \(\gamma=0.8\), \(\mu=0.1\)):

Metric Ours NLR
Iterations to converge 152 ≈80,000
Per-step cost 25–100x NLR step 1× (single matrix-vector product)
Per-step complexity vs \(n\) \(O(n)\) \(O(n)\)
Total wall-clock time 2–4x faster Baseline

Ablation Study

Configuration / Baseline Key Results Description
Verifying Assumption 1 Loss drops to \(10^{-10}\), min. eigenvalue of Hessian becomes positive 2nd-order point ≈ Global optimum; zero error
vs Carson et al. 2017 Misclustering > 30%, \(\lVert U_-\rVert\) remains high Penalty \(\lambda\) faces a tradeoff between feasibility and clustering
vs Classical RTR Ours ≈ 360 iterations; RTR stalls > 21,000 Log-penalty causes extreme ill-conditioning
Hyperparameter \(\mu\) Sharp phase transition threshold \(\mu\) must be small enough
Search rank \(r\) Insensitive if \(\mu\) is small; larger \(r\) helps escape with large \(\mu\) Recommend \(r=K+1\)
Misspecified \(K_{\mathrm{mis}}=3,5\) High NMI maintained Robust to cluster number misspecification

Key Findings

  • Newton's iterative advantage offsets per-step overhead: A single Newton step is 25–100x more expensive than NLR, but iterations drop by ~3 orders of magnitude. The net gain is a 2–4x total speedup.
  • Strong empirical support for Benign Non-convexity: Across 50 random initializations, the loss consistently converged to zero and Hessian eigenvalues were non-negative; second-order points are almost always global optima.
  • Ill-conditioning stems from log-barrier: Generic methods fail; cubic regularization combined with block-diagonal structure is essential for stability.
  • Hyperparameter Robustness: \(r=K+1\) is sufficient when \(\mu\) is well-chosen; the method is robust to misspecifying \(K\).

Highlights & Insights

  • Replacing expensive retraction with product manifold submersion: Lifting \(\mathcal{M}\) to "Projected Hypersphere × Orthogonal Group" reduces retraction complexity from \(O(n^2)\) to \(O(nr+r^3)\). This idea is transferable to other low-rank SDPs with complex constraints.
  • Block-diagonal + Low-rank Hessian → Linear-time Newton step: This allows second-order methods to achieve per-step costs comparable to first-order methods, providing a counter-example to "second-order methods are impractical."
  • Converting combinatorial clustering to guaranteed continuous optimization: Under benign non-convexity, global K-means is reduced to "finding a second-order critical point," which is strictly guaranteed by Riemannian algorithms.
  • Binary search for cubic regularization: Lemma 2 guarantees monotonicity, turning the subproblem into a clean 1D search for \(\lambda\).

Limitations & Future Work

  • Dependence on benign non-convexity assumption: The global guarantee relies on Assumption 1. While empirically supported, a rigorous proof for the non-negative BM form of K-means is still needed.
  • Dependence on well-separated cases: Theoretically bound to cases where the SDP can solve the problem; it cannot resolve clusters below the information-theoretic threshold.
  • Sensitivity to \(\mu\): A sharp phase transition exists; \(\mu\) must be tuned to locate the benign region.
  • Limited scale validation: Real-world experiments were at \(n=1800\). Performances on massive scales or very high rank \(r\) require further investigation.
  • vs Carson et al. (2017, 1st-order Riemannian): Uses non-smooth objectives and \(O(n^2)\) retractions. Ours provides feasibility, 2nd-order guarantees, and scalability.
  • vs NLR (Zhuang et al. 2024): An augmented Lagrangian 1st-order method with only local linear convergence. Ours is 2-4x faster and more accurate in recovery.
  • vs Hybrid methods: Methods mixing Riemannian and augmented Lagrangian often break the strict feasibility required for K-means membership; Ours maintains strict feasibility throughout.

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐