Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization¶
Conference: ICML2025
arXiv: 2506.05791
Code: None
Area: others (Decentralized Optimization / Distributed Learning)
Keywords: Decentralized Optimization, Proximal Point Method, SPDO, Communication Complexity, Function Similarity
TL;DR¶
Proposes the Stabilized Proximal Decentralized Optimization (SPDO) method and its accelerated variant, achieving optimal communication and computation complexities simultaneously within a proximal decentralized optimization framework. This is achieved by relaxing local subproblem accuracy requirements (from increasing with iterations to constant) via a stabilized projection technique, and reducing communication overhead by replacing the maximum function dissimilarity \(\delta_{\max}\) with the average function similarity \(\delta\).
Background & Motivation¶
Background: Decentralized optimization is a core paradigm in distributed machine learning, where \(n\) nodes, each holding a local loss function \(f_i\), collectively minimize the average loss \(f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)\) by exchanging parameters only with their neighbors over a network topology. Reducing communication complexity is the primary goal, as latency and bandwidth for parameter exchange between nodes are far more expensive than local computation.
Limitations of Prior Work: - Dilemma of Multi-step Local Updates: Decentralized Gradient Descent and Gradient Tracking attempt to reduce communication using multi-step local updates. However, increasing the number of local steps requires a smaller learning rate, leaving the final communication overhead unchanged due to inconsistency between local and global objectives. - Computational Bottleneck of the PDO Framework: Proximal decentralized optimization (PDO) methods like SONATA can exploit function similarity to reduce communication, but they require either exact solving of the proximal subproblems (which is infeasible) or increasingly high subproblem accuracy as iterations progress, leading to suboptimal computational overhead. - Dependence on Maximum Dissimilarity: Existing analyses rely on \(\delta_{\max}\) (the maximum node dissimilarity), whereas the average dissimilarity \(\delta\) can be as small as \(\delta_{\max}/\sqrt{n}\) and remains underutilized.
Key Challenge: How to achieve low computational complexity while maintaining low communication complexity? Existing methods face a trade-off: accurate subproblems require fewer communications but higher computation, while inaccurate subproblems reduce computation but may increase communication.
Goal: Can a proximal decentralized method be designed to simultaneously achieve optimal communication and computational complexities?
Key Insight: The authors observe that directly applying the standard Hybrid Projection Proximal-point Method to decentralized settings does not converge, because the auxiliary variable \(v_i\) does not align with the iteration variable \(x_i\) under exact solving. By introducing a "stabilization term" (adding the gradient tracking vector \(h_i\) into the projection step), SPDO is guaranteed to degenerate into standard PDO when solved exactly, ensuring algorithmic consistency.
Core Idea: Incorporating a stabilization term into the proximal projection step allows the subproblem accuracy requirement to change from "increasing with iterations" to a "constant", while improving the dependence of communication complexity from \(\delta_{\max}\) to \(\delta\).
Method¶
Overall Architecture¶
SPDO follows an iterative paradigm of "local subproblem solving + neighborhood communication averaging":
- Input: Number of nodes \(n\), local losses \(\{f_i\}\), network weights \(W\), hyperparameters \(\lambda, M\)
- Each Iteration:
- Each node approximately solves a local subproblem (local computation).
- Update the auxiliary variable \(v_i\) (stabilized projection step, closed-form solution).
- Update \(v_i\) and the gradient tracking variable \(h_i\) via multi-step Gossip averaging (neighbor communication).
- Output: All node parameter average \(\bar{x}^{(R)}\) converges to the global optimum.
Key Designs¶
-
Stabilized Projection Step
- Function: After the approximate solving of subproblems, compute an auxiliary variable \(v_i^{(r+1/2)}\) as an anchor for communication and gradient tracking.
- Mechanism: The key update is \(v_i^{(r+1/2)} = \arg\min_v \{\langle \nabla f_i(x_i^{(r+1)}) + h_i^{(r)}, v\rangle + \frac{\mu}{2}\|v - x_i^{(r+1)}\|^2 + \frac{\lambda}{2}\|v - v_i^{(r)}\|^2\}\), which yields the closed-form solution \(v_i^{(r+1/2)} = \frac{1}{\mu+\lambda}(\mu x_i^{(r+1)} + \lambda v_i^{(r)} - \nabla f_i(x_i^{(r+1)}) - h_i^{(r)})\). Compared to the naive Hybrid Projection, the added \(+h_i^{(r)}\) term (highlighted in blue) is the key to stabilization.
- Design Motivation: Naive Hybrid Projection Proximal-point does not converge in decentralized scenarios because the auxiliary variable \(v_i \neq x_i\) under exact solving. By adding \(h_i^{(r)}\), SPDO degenerates to standard PDO (\(v_i = x_i\)) when solved exactly, ensuring algorithmic consistency. This relaxes the subproblem accuracy requirement from PDO's \(O(1/(r+1)(r+2))\) (decaying with iterations) to a constant \(\lambda^2/10\) in SPDO.
-
Exploiting Average Function Similarity \(\delta\)
- Function: Reduces the dependence of communication complexity from \(\delta_{\max}\) to \(\delta\).
- Mechanism: Defines the average second-order similarity as \(\frac{1}{n}\sum_{i=1}^n \|\nabla h_i(x) - \nabla h_i(y)\|^2 \leq \delta^2 \|x-y\|^2\), where \(h_i = f - f_i\). This is more fine-grained than the maximum dissimilarity \(\delta_{\max} = \max_i \|\nabla^2 f - \nabla^2 f_i\|\), and since \(\delta \leq \delta_{\max}\), it can be up to \(\sqrt{n}\) times smaller. The corresponding number of Gossip rounds \(M \geq \frac{1}{1-\rho}\log(\frac{6L}{\delta})\) replaces the prior version that depended on \(\delta_{\max}\).
- Design Motivation: When local node datasets are similar, \(\delta \ll \delta_{\max}\), meaning that SPDO can substantially reduce communication in homogeneous scenarios. This is the first analysis utilizing average similarity within the PDO framework.
-
Monteiro-Svaiter Acceleration (Accelerated-SPDO)
- Function: Further reduces both communication and computational complexities by accelerating the proximal point method and the Gossip averaging.
- Mechanism: Introduces acceleration sequences \(\{A_r, B_r, a_r\}\) and momentum interpolation \(y_i^{(r)} = \frac{A_r x_i^{(r)} + a_{r+1} v_i^{(r)}}{A_{r+1}}\), combined with Fast Gossip Averaging (Chebyshev accelerated Gossip with momentum). Under strong convexity, the communication complexity reaches \(\tilde{O}(\sqrt{\delta/(\mu(1-\rho))} \log(1/\epsilon))\) and the computational complexity reaches \(\tilde{O}(\sqrt{L/\mu} \log(1/\epsilon))\).
- Design Motivation: The communication complexity of non-accelerated SPDO is \(\tilde{O}(\delta/(\mu(1-\rho)) \log(1/\epsilon))\), whereas acceleration applies a square root to \(\delta/\mu\), yielding significant improvements when \(\delta \gg \mu\). Accelerated Gossip (second-order Chebyshev iteration) reduces the number of communication steps per round from \(O(1/(1-\rho))\) to \(O(1/\sqrt{1-\rho})\).
Loss & Training¶
The local subproblem for each round is \(\min_x F_{i,r}(x) := f_i(x) + \langle h_i^{(r)}, x\rangle + \frac{\lambda}{2}\|x - v_i^{(r)}\|^2\), which is solved via Nesterov's accelerated gradient descent. SPDO sets \(\lambda = 20\delta\) (non-accelerated) or \(\lambda = 96\delta\) (accelerated). The subproblem accuracy requirement is constant: \(\|\nabla F_{i,r}(x_i^{(r+1)})\|^2 \leq \frac{\lambda^2}{10}\|v_i^{(r)} - x_i^{(r+1)}\|^2\), which does not need to shrink over iterations.
Key Experimental Results¶
Main Results: MNIST + Logistic Regression¶
Gradient norms after 2000 communication rounds are compared under a 25-node ring topology with a Dirichlet-distributed data partition.
| Method | Communication Complexity | Computational Complexity | Empirical Convergence |
|---|---|---|---|
| Gradient Tracking | \(O(L/(\mu(1-\rho)^2))\) | Same as Comm. | Unaffected by similarity |
| Inexact-PDO | \(\tilde{O}(\delta/(\mu(1-\rho)))\) | \(\tilde{O}(\sqrt{\delta L}/\mu \log\log)\) | Stagnates in later stages due to insufficient subproblem accuracy |
| SPDO | \(\tilde{O}(\delta/(\mu(1-\rho)))\) | \(\tilde{O}(\sqrt{\delta L}/\mu)\) | Comm.=PDO, superior computation |
| Accel-SPDO | \(\tilde{O}(\sqrt{\delta/(\mu(1-\rho))})\) | \(\tilde{O}(\sqrt{L/\mu})\) | Both communication and computation are optimal |
Ablation Study: Impact of Similarity \(\delta\)¶
| Dirichlet \(\alpha\) | Similarity \(\delta\) | Gradient Tracking | SPDO |
|---|---|---|---|
| Small \(\alpha\) (Heterogeneous) | Large | Gradient norm ~\(10^{-2}\) | Gradient norm ~\(10^{-3}\) |
| Large \(\alpha\) (Homogeneous) | Small | Gradient norm ~\(10^{-2}\) | Gradient norm ~\(10^{-6}\) (Substantial improvement) |
Key Findings¶
- SPDO Outperforms Inexact-PDO: When local subproblems are solved with a fixed 10 steps of gradient descent, Inexact-PDO suffers from increased communication overhead in later stages due to insufficient subproblem accuracy (where Eq.(8) is violated), while SPDO faces no such issue.
- Double Optimality of Accelerated-SPDO: Under both strongly convex and convex settings, both communication and computational complexities achieve known optimal rates.
- Greater Advantages with Higher Similarity: The communication complexity of Gradient Tracking is independent of \(\delta\), whereas that of SPDO scales proportionally to \(\delta\) or \(\sqrt{\delta}\). Thus, the advantage of SPDO is highly pronounced in homogeneous scenarios.
Highlights & Insights¶
- Exquisite Design of Stabilization: Naively transferring Hybrid Projection to the decentralized setting fails because the auxiliary and iteration variables remain inconsistent under exact solving. Simply introducing a linear term \(\langle h_i^{(r)}, v\rangle\) to the projection objective restores consistency—a minimal modification with a crucial effect, showcasing a highly elegant design.
- Constant Accuracy as the Core Breakthrough: The \(O(1/r^2)\) subproblem accuracy requirement in standard PDO demands an increasing number of inner iterations in later stages, which is the root cause of the computational bottleneck. SPDO makes this requirement constant, avoiding additional nested logarithmic factors in computational complexity, which constitutes a substantial theoretical improvement.
- Unified Convex and Strongly Convex Analysis: While prior works like SONATA and its accelerated versions only analyzed strongly convex scenarios, this work provides the convergence rates under convex settings for the first time, expanding the applicability of the method.
Limitations & Future Work¶
- Deterministic Setting Only: All results are based on full gradients, leaving stochastic gradient (SGD) scenarios unaddressed. Since full-gradient computations are impractical in large-scale training, extending SPDO to stochastic settings is an important direction.
- Limited Experimental Scale: Verification is conducted only on small-scale settings (MNIST + Logistic Regression + 25 nodes), lacking validation on deep neural networks or larger-scale distributed training scenarios.
- Hyperparameter Selection: The choice of \(\lambda\) (e.g., \(4\delta, 20\delta, 96\delta\)) depends on prior knowledge of \(\delta\) and \(L\), which are typically unknown in practice. Self-adaptive schemes for choosing \(\lambda\) remain unaddressed.
- Network Topology Limitations: Experiments only utilize ring topologies (where spectral gap \(\rho\) is close to 1), whereas the advantages of SPDO are expected to be more prominent on dense graphs. Performance under realistic, heterogeneous network topologies has yet to be verified.
Related Work & Insights¶
- vs SONATA (Sun et al., 2022): SONATA is the exact version of PDO, which requires exactly solving subproblems (impractical) and has a communication complexity dependent on \(\delta_{\max}\). This work improves the dependence from \(\delta_{\max}\) to \(\delta\) (up to a \(\sqrt{n}\)-fold reduction) and introduces a viable inexact variant for the first time.
- vs Accelerated SONATA (Tian et al., 2022): The accelerated SONATA assumes strong convexity and \(\mu \leq \delta_{\max}\), with a computational complexity containing a \((\log(1/\epsilon))^2\) term. Accelerated-SPDO eliminates one logarithmic factor and provides an analysis for the convex case.
- vs S-DANE (Jiang et al., 2024b): SPDO reduces to S-DANE (the federated learning variant) on fully connected graphs. However, on incomplete graphs, it requires additional stabilization and multi-step Gossip techniques, which constitutes the core contribution of this work.
Rating¶
- Novelty: ⭐⭐⭐⭐ The insights on stabilized projection are mathematically elegant and novel, though the overall framework is built upon existing proximal point methods and Gossip averaging.
- Experimental Thoroughness: ⭐⭐⭐ Rigorous theory but limited to small-scale experiments under convex optimization settings, lacking deep learning validation.
- Writing Quality: ⭐⭐⭐⭐ Clear theoretical statements; Table 1 provides a very straightforward comparison of complexities across all methods.
- Value: ⭐⭐⭐⭐ Provides an optimal solution to the communication-computation trade-off in decentralized optimization, bringing solid theoretical contributions.