# Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

- Conference: NeurIPS 2025
- arXiv: 2506.02504
- Authors: Xingyu Chen, Bokun Wang, Ming Yang, Qihang Lin, Tianbao Yang (Texas A&M University, University of Iowa)
- Code: To be confirmed
- Area: Optimization
- Keywords: FCCO, non-smooth non-convex optimization, stochastic momentum methods, Moreau envelope, constrained optimization, compositional optimization

## TL;DR
For non-smooth non-convex finite-sum coupled compositional optimization (FCCO), this paper proposes two stochastic momentum methods — SONEX (single-loop) and ALEXR2 (double-loop) — that improve the iteration complexity from \(O(1/\epsilon^6)\) to \(O(1/\epsilon^5)\) via outer Moreau envelope smoothing and nested smoothing techniques, and achieve the same optimal complexity for non-convex inequality-constrained optimization.
## Background & Motivation

The finite-sum coupled compositional optimization (FCCO) problem takes the form:

$$\min_{\mathbf{w}} F(\mathbf{w}) := \frac{1}{n}\sum_{i=1}^n f_i(g_i(\mathbf{w}))$$
where \(f_i\) are outer functions (possibly non-smooth) and \(g_i\) are inner functions (stochastically estimated). This formulation arises broadly in X-risk optimization (AUC, contrastive losses), group distributionally robust optimization (GDRO), and non-convex constrained optimization.
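As a concrete instance of this template, the penalized constrained problem treated later in this paper fits directly: take \(f_i(u) = \rho[u]_+\) (a scaled hinge) and \(g_i\) as the \(i\)-th constraint function, with the smooth objective \(g_0\) added outside the compositional sum:

$$\min_{\mathbf{w}}\ g_0(\mathbf{w}) + \frac{1}{m}\sum_{i=1}^m \rho\,[g_i(\mathbf{w})]_+$$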
Prior methods suffer from two key limitations:

- High complexity: the previous state-of-the-art method SONX requires \(O(1/\epsilon^6)\) iterations under a mean Lipschitz continuity assumption on the stochastic estimates of the inner functions.
- Incompatibility with deep learning: SONX relies on vanilla SGD updates and lacks a momentum mechanism, making it ill-suited to training deep neural networks.
This paper aims to address both issues simultaneously: designing stochastic momentum methods with provable convergence guarantees while reducing complexity by an order of magnitude.
## Method

### Mechanism: Outer Smoothing + Nested Smoothing

Outer Smoothing: The non-smooth outer function \(f_i\) is smoothed via the Moreau envelope:

$$f_{i,\lambda}(\mathbf{u}) := \min_{\mathbf{v}} f_i(\mathbf{v}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{v}\|^2$$
This transforms the original problem into a smooth compositional objective \(\min_{\mathbf{w}} F_\lambda(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n f_{i,\lambda}(g_i(\mathbf{w}))\). When \(\lambda = \epsilon / C_f\), an \(\epsilon\)-stationary point of \(F_\lambda\) is an approximate \(\epsilon\)-stationary solution of the original problem (Theorem 4.2).
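To make the smoothing step tangible, here is a minimal numerical sketch (not the paper's code; it uses the absolute value as a stand-in outer function, whose prox is soft-thresholding) of the Moreau envelope and the standard gradient identity \(\nabla f_\lambda(u) = (u - \mathrm{prox}_{\lambda f}(u))/\lambda\):

```python
import numpy as np

def prox_abs(u, lam):
    """Proximal operator of f(v) = |v|: soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def moreau_env_abs(u, lam):
    """Moreau envelope f_lam(u) = min_v |v| + (1/(2*lam)) * (u - v)^2.
    For f = |.| this is exactly the Huber function."""
    v = prox_abs(u, lam)
    return np.abs(v) + (u - v) ** 2 / (2 * lam)

def grad_moreau_env_abs(u, lam):
    """Standard identity: grad f_lam(u) = (u - prox_{lam f}(u)) / lam.
    For f = |.| this reduces to clip(u / lam, -1, 1): bounded and smooth."""
    return (u - prox_abs(u, lam)) / lam

u = np.linspace(-2.0, 2.0, 9)
print(moreau_env_abs(u, lam=0.5))       # smooth approximation of |u|
print(grad_moreau_env_abs(u, lam=0.5))  # gradients stay in [-1, 1]
```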
Nested Smoothing: When the inner function \(g_i\) is also non-smooth (weakly convex), a second Moreau envelope is applied to \(F_\lambda\), yielding the doubly smoothed objective \(\min_{\mathbf{w}} F_{\lambda,\nu}(\mathbf{w})\), which is amenable to momentum-based methods.
### Algorithm 1: SONEX (Single-Loop, Smooth Inner Functions)

Applicable when: \(f_i\) is weakly convex (possibly non-smooth) or convex; \(g_i\) is smooth.

Key steps (a simplified sketch appears after the complexity bound below):

1. MSVR tracking: maintain \(n\) auxiliary sequences \(u_{t,i}\) to track \(g_i(\mathbf{w}_t)\), updated coordinate-wise (only sampled indices \(i \in \mathcal{B}_1^t\) are updated).
2. Gradient estimation: \(G_t = \frac{1}{|\mathcal{B}_1^t|}\sum_{i \in \mathcal{B}_1^t} \nabla f_{i,\lambda}(u_{t,i}) \nabla g_i(\mathbf{w}_t, \mathcal{B}_{i,2}^t)\)
3. Momentum update: \(\mathbf{v}_{t+1} = (1-\beta)\mathbf{v}_t + \beta G_t\), \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{v}_{t+1}\) (or an Adam-style update)
Complexity: \(O\left(\frac{n}{B_1\sqrt{B_2}} \epsilon^{-5}\right)\)
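The sketch promised above: a schematic SONEX-style loop under simplifying assumptions (scalar-valued \(g_i\), a plain moving average in place of the full MSVR correction, and hypothetical user-supplied callables `g`, `grad_g`, `grad_f_lam`). It illustrates the three steps; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sonex_sketch(w, g, grad_g, grad_f_lam, n, T=1000,
                 B1=8, eta=0.01, beta=0.1, gamma=0.5):
    """Schematic SONEX-style loop for scalar-valued g_i.
    g(i, w)       -> stochastic estimate of g_i(w)          (user-supplied)
    grad_g(i, w)  -> stochastic gradient estimate of g_i(w) (user-supplied)
    grad_f_lam(u) -> gradient of the Moreau-smoothed outer f_{i,lam}
    Note: a plain moving average stands in for the MSVR correction."""
    u = np.array([g(i, w) for i in range(n)])  # trackers for g_i(w_t)
    v = np.zeros_like(w)                       # momentum buffer
    for _ in range(T):
        batch = rng.choice(n, size=B1, replace=False)
        for i in batch:                        # 1. update only sampled trackers
            u[i] = (1.0 - gamma) * u[i] + gamma * g(i, w)
        # 2. chain-rule gradient estimate through the smoothed outer function
        G = sum(grad_f_lam(u[i]) * grad_g(i, w) for i in batch) / B1
        v = (1.0 - beta) * v + beta * G        # 3. momentum update
        w = w - eta * v
    return w
```

Only the sampled trackers are touched per iteration, so the per-step cost scales with \(B_1\) rather than \(n\); swapping the last two lines for an Adam update changes nothing structurally.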
### Algorithm 2: ALEXR2 (Double-Loop, Smooth or Weakly Convex Inner Functions)
Applicable when: \(f_i\) is convex; \(g_i\) is smooth or weakly convex (the latter requiring \(f_i\) to be monotonically non-decreasing).
\(F_\lambda\) is reformulated as a minimax problem and then doubly smoothed via nested Moreau envelopes to obtain \(F_{\lambda,\nu}\).

Key steps (a simplified outer-loop sketch appears after the complexity bound below):

1. Outer loop: momentum update of \(\mathbf{w}\) using the approximate gradient \(G_t = \frac{\beta}{\nu}(\mathbf{w}_t - \mathbf{z}_{t,K_t})\).
2. Inner loop: run ALEXR to solve a strongly convex–strongly concave minimax subproblem, yielding an approximate proximal-mapping solution \(\hat{\mathbf{z}}_t\).
3. Both momentum and Adam updates are supported.
Complexity: \(\tilde{O}\left(\frac{n}{B_1 B_2 \epsilon^5} + \frac{1}{\epsilon^5}\right)\)
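Abstracting the inner loop away, the outer loop is a momentum step on the Moreau-envelope gradient \((\mathbf{w}_t - \mathbf{z}_t)/\nu\) (the paper folds its momentum parameter \(\beta\) into the estimator itself). A minimal sketch, assuming a hypothetical callable `approx_prox` that plays the role of the inner ALEXR solver:

```python
import numpy as np

def alexr2_outer_sketch(w, approx_prox, T=500, nu=0.1, eta=0.01, beta=0.1):
    """Outer loop of an ALEXR2-style method.
    approx_prox(w, nu) -> approximate prox_{nu * F_lam}(w), i.e. the output
    of the inner ALEXR run on the minimax subproblem (user-supplied)."""
    v = np.zeros_like(w)
    for _ in range(T):
        z = approx_prox(w, nu)          # inner loop: solve the subproblem
        G = (w - z) / nu                # gradient of the doubly smoothed F_{lam,nu}
        v = (1.0 - beta) * v + beta * G # momentum update
        w = w - eta * v
    return w
```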
### Application to Constrained Optimization

For inequality-constrained optimization \(\min_{\mathbf{w}} g_0(\mathbf{w})\) s.t. \(g_i(\mathbf{w}) \leq 0\), \(i = 1, \dots, m\), a smooth hinge penalty function is employed:

$$\Phi_\lambda(\mathbf{w}) = g_0(\mathbf{w}) + \frac{1}{m}\sum_{i=1}^m f_\lambda(g_i(\mathbf{w}))$$
where \(f_\lambda\) is the Moreau-smoothed version of the hinge function \(\rho[\cdot]_+\). Optimizing via SONEX/ALEXR2 achieves \(O(1/\epsilon^5)\) complexity for finding an approximate \(\epsilon\)-KKT solution, improving the dependence on \(\delta\) from \(O(\delta^{-6})\) to \(O(\delta^{-4})\).
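The Moreau envelope of the scaled hinge \(\rho[\cdot]_+\) has a closed form (a one-sided Huber-style loss), which is what makes this penalty cheap to evaluate. The sketch below derives it from the envelope definition via the case analysis of the hinge prox (constants are illustrative):

```python
import numpy as np

def smooth_hinge(u, rho=1.0, lam=0.1):
    """Moreau envelope of f(v) = rho * max(v, 0), by cases on the prox:
       u <= 0            -> 0
       0 < u <= lam*rho  -> u^2 / (2*lam)           (quadratic transition)
       u > lam*rho       -> rho*u - lam*rho**2 / 2  (shifted linear hinge)"""
    u = np.asarray(u, dtype=float)
    return np.where(u <= 0.0, 0.0,
           np.where(u <= lam * rho, u ** 2 / (2.0 * lam),
                    rho * u - lam * rho ** 2 / 2.0))

def smooth_hinge_grad(u, rho=1.0, lam=0.1):
    """Gradient (u - prox(u)) / lam, which reduces to clip(u/lam, 0, rho)."""
    return np.clip(np.asarray(u, dtype=float) / lam, 0.0, rho)

print(smooth_hinge(np.array([-1.0, 0.05, 1.0])))       # 0, quadratic, linear
print(smooth_hinge_grad(np.array([-1.0, 0.05, 1.0])))  # 0.0, 0.5, 1.0
```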
## Key Experimental Results

### Experiment 1: Group Distributionally Robust Optimization (GDRO with CVaR)
Evaluated on Camelyon17 (30 groups, DenseNet121), Amazon (1252 groups, DistilBERT), and CelebA (16 groups, ResNet50) with CVaR ratio \(r=0.15\):
| Method | Update Type | Camelyon17 Loss | Amazon Loss | CelebA Loss | CelebA Test Acc. |
|---|---|---|---|---|---|
| OOA | SGD | Slower convergence | Slower convergence | Slower convergence | Lower |
| SONX | SGD | Moderate | Moderate | Moderate | Moderate |
| SONEX | Momentum/Adam | Fastest convergence | Fastest convergence | Fastest convergence | Highest |
SONEX achieves the fastest training loss reduction across all datasets, with the best test accuracy on CelebA. The advantage is most pronounced on Amazon (1252 groups) with Adam updates.
### Experiment 2: AUC Maximization + ROC Fairness Constraints
Evaluated on Adult and COMPAS datasets with 14 fairness constraints (FPR/TPR gap tolerance \(\kappa=0.005\), thresholds \(\Gamma=\{-3,...,3\}\)) using a 2-hidden-layer network:
| Method | Penalty / Type | Adult AUC | COMPAS AUC | Constraint Satisfaction |
|---|---|---|---|---|
| SOX | Squared hinge | Lower | Lower | Acceptable |
| SONX | Hinge | Moderate | Moderate | Acceptable |
| ICPPAC | Double-loop | Moderate | Moderate | Acceptable |
| ALEXR2 | Smooth hinge | Highest | Highest | Comparable |
ALEXR2 achieves superior AUC maximization while maintaining comparable constraint satisfaction.
### Experiment 3: Continual Learning + Non-forgetting Constraints

Fine-tuning CLIP on BDD100K (target tasks: foggy/cloudy classification; constraints: no forgetting on other weather categories):

- SONEX achieves significantly higher Target \(\Delta\)ACC than SOX (squared hinge).
- Constraint violations are comparable across methods.
- SONX (SGD) fails entirely on the Transformer architecture and cannot train.
## Highlights & Insights

- Theoretical contribution: Improves the best-known complexity for non-smooth non-convex FCCO from \(O(\epsilon^{-6})\) to \(O(\epsilon^{-5})\); this is the first stochastic momentum method for this problem class.
- Dual smoothing technique: Outer Moreau envelope smoothing combined with nested smoothing elegantly handles the double non-smoothness of both outer and inner functions.
- Practical applicability: Supports Adam updates and is directly applicable to deep learning; effective on Transformer/CLIP architectures where the SGD-based SONX fails.
- New results for constrained optimization: Improvements in both complexity and dependence on the regularization constant \(\delta\).
- Unified framework: SONEX and ALEXR2 respectively cover smooth and weakly convex inner function settings, applicable to GDRO, AUC fairness, continual learning, and beyond.
## Limitations & Future Work
- The single-loop SONEX has a weaker dependence on the inner batch size \(B_2\) (i.e., \(1/\sqrt{B_2}\)) compared to the double-loop ALEXR2 (i.e., \(1/B_2\)); the authors explicitly identify this as a direction for improvement.
- SONEX requires the mean Lipschitz continuity (MLC0) assumption, while ALEXR2 requires convexity of the outer function, each with its own restrictions.
- Assumption 4.3 (lower bound on the minimum singular value of the inner function Jacobian) is empirically verified but does not hold universally.
- Experimental scale is limited: validation is confined to relatively small classification tasks, without large-scale deep learning benchmark experiments.
- Hyperparameter settings rely on theoretically derived constants, which may require additional tuning in practice.
## Related Work & Insights
- FCCO methods: SOX, MSVR/STORM, and SONX either require smooth \(f_i\) or are restricted to SGD updates; this paper is the first to handle non-smoothness with momentum.
- Smoothing techniques: Nesterov smoothing and Moreau envelopes are well-established for weakly convex problems; this paper extends them to the coupled structure of FCCO.
- Non-convex constrained optimization: OSS/ICPPAC achieve \(O(\epsilon^{-6})\), SSG achieves \(O(\epsilon^{-8})\), and Li et al. achieve \(O(\epsilon^{-7})\); the proposed \(O(\epsilon^{-5})\) is optimal in both smooth and weakly convex settings.
- X-risk optimization: Objectives such as AUROC and contrastive losses can be modeled as FCCO, and the proposed methods are directly applicable.
## Rating
- Novelty: ⭐⭐⭐⭐ — First stochastic momentum method for non-smooth FCCO; the outer smoothing + nested smoothing design is clear and effective.
- Experimental Thoroughness: ⭐⭐⭐ — Covers GDRO, fairness constraints, and continual learning, but experiments are limited in scale and baseline coverage.
- Writing Quality: ⭐⭐⭐⭐ — Rigorous theoretical derivations with a well-organized notation system; however, the high content density poses a significant barrier for readers outside the optimization community.
- Value: ⭐⭐⭐⭐ — The complexity improvement carries solid theoretical significance, and the introduction of momentum methods directly benefits practical deep learning applications.