Distributed Algorithms for Euclidean Clustering¶

Conference: ICLR2026 arXiv: 2603.08615 Code: None (theoretical work) Area: Other Keywords: distributed clustering, coreset, communication complexity, k-means, k-median

TL;DR¶

This paper constructs \((1+\varepsilon)\)-coresets for Euclidean \((k,z)\)-clustering in the distributed setting, achieving communication complexity that matches known lower bounds (up to polylogarithmic factors) in both the coordinator model and the blackboard model.

Background & Motivation¶

Euclidean \((k,z)\)-clustering is a classical clustering problem: given \(n\) points in \(d\) dimensions, find \(k\) centers minimizing the sum of \(z\)-th power distances from each point to its nearest center. The case \(z=1\) corresponds to \(k\)-median and \(z=2\) to \(k\)-means.
Modern datasets are massive and naturally distributed across multiple machines, precluding centralized processing. Distributed clustering has become a core requirement: \(n\) points are distributed across \(s\) machines, which collaborate by exchanging compact summaries (coresets) to perform clustering while minimizing communication.
Existing distributed methods (e.g., merge-and-reduce) exhibit multiplicative coupling among \(s\), \(d\), and \(1/\varepsilon\) in their communication complexity (e.g., \(\tilde{O}(skd/\varepsilon^4 \cdot \log(n\Delta))\)), far from the known communication lower bounds.
Core Motivation: Can one design communication-optimal protocols that decouple \(s\) (number of machines), \(d\) (dimensionality), and \(1/\varepsilon\) (approximation accuracy) to match theoretical lower bounds?

Core Problem¶

The paper aims to construct \((1+\varepsilon)\)-strong coresets for \((k,z)\)-clustering in two classical distributed communication models, while achieving optimal communication complexity:

Coordinator Model: \(s\) machines communicate only through a coordinator relay, using private channels and private randomness.
Blackboard Model: \(s\) machines communicate via a shared blackboard, where any message written by one machine is visible to all others.

Method¶

Overall Architecture¶

The protocol proceeds in two phases: (1) obtain a constant-factor approximation; (2) upgrade to a \((1+\varepsilon)\)-coreset via sensitivity sampling. The key innovation lies in substantially reducing communication at each phase.

Constant-Factor Approximation in the Blackboard Model¶

Existing adaptive sampling (analogous to \(k\)-means++) requires all \(s\) machines to report their distance sums \(D_i\) after each round, incurring \(O(sk\log n)\) communication—prohibitively expensive.
Lazy Sampling: The blackboard maintains approximate values \(\widehat{D_i}\) of each machine's distance sum \(D_i\). Machines are selected for sampling proportionally to \(\widehat{D_i}\). Provided \(\widehat{D_i}\) is a constant-factor approximation of \(D_i\), the sampling failure probability is a constant, and the total number of sampling rounds remains \(O(k)\).
When sampling fails—indicating that \(D_i\) has decreased significantly—machine \(i\) updates its value on the blackboard. Each machine updates at most \(O(\log n)\) times, transmitting only \(O(\log\log n)\) bits per update (by sending the exponent).
\(L_1\) Sampling for Acceleration: Rather than requiring all machines to report each round, a single machine \(i\) is randomly selected with probability proportional to \(\widehat{D_i}\). The estimator \(D_i/p_i\) is used to estimate whether the global total weight \(\sum D_i\) has decreased by more than \(1/64\), triggering a global update only when necessary.
Exponentially Growing Batch Sampling: When the global weight has not decreased significantly, \(2^i\) samples are drawn in a single batch (with \(i\) increasing incrementally), reducing the number of communication rounds to \(O(\log n \log k)\).

Key Innovations in the Coordinator Model¶

Coordinate-wise Sampling: Rather than transmitting full high-dimensional center coordinates, the coordinator and a machine perform a distributed binary search over sorted coordinates, transmitting only a small offset. This decouples communication from dimensionality \(d\).
Coordinate-wise Sensitivity Sampling: Each point is decomposed along its coordinates, and dimensions are sampled according to their importance. The coordinator sends a compact summary, and machines request additional information only when necessary.
The reconstructed samples may not correspond to any actual point in the dataset, but the paper proves that the distortion of the clustering cost remains controlled.

\((1+\varepsilon)\)-Coreset Construction¶

Using the constant-factor approximation \(C'\), the sensitivity \(\mu(x)\) of each point is computed, and \(\tilde{O}(k/\min(\varepsilon^4, \varepsilon^{2+z}))\) points are sampled according to the sensitivity distribution.
Compact Encoding: Each sampled point is encoded as \((c'(x), y)\), where \(c'(x)\) is the index of the nearest center and \(y\) is the coordinate-wise logarithmic exponent of the residual vector. Each point requires only \(O(\log k + d\log(1/\varepsilon, d, \log(n\Delta)))\) bits.

Main Theorems¶

Theorem 1.1 (Coordinator Model): Total communication is \(\tilde{O}(sk + dk/\min(\varepsilon^4, \varepsilon^{2+z}) + dk\log(n\Delta))\) bits.

Theorem 1.2 (Blackboard Model): Total communication is \(\tilde{O}(s\log(n\Delta) + dk\log(n\Delta) + dk/\min(\varepsilon^4, \varepsilon^{2+z}))\) bits.

Both results match known lower bounds up to polylogarithmic factors.

Key Experimental Results¶

This is a purely theoretical work with no experimental component. The primary contributions are the proofs of communication complexity upper bounds and their matching with known lower bounds.

Model	Prior Best	Ours	Improvement
Coordinator	\(O(skd/\varepsilon^4 \cdot \log(n\Delta))\)	\(\tilde{O}(sk + dk/\varepsilon^4 + dk\log(n\Delta))\)	Decouples \(s\), \(d\), \(1/\varepsilon\); removes multiplicative dependence
Blackboard	\(\tilde{O}((s+dk)\log^2(n\Delta))\) (constant approx. only)	\(\tilde{O}(s\log(n\Delta) + dk\log(n\Delta) + dk/\varepsilon^4)\)	Upgrades from constant to \((1+\varepsilon)\) approx.; removes extra \(\log\) factors

Highlights & Insights¶

Communication Matches Optimal Lower Bounds: Both distributed models match the lower bounds of [Chen et al., NeurIPS 2016] and [Huang et al., STOC 2024], representing a complete theoretical resolution of the problem.
Parameter Decoupling: In the coordinator model, \(s\) (number of machines) is no longer multiplied by \(d\) (dimensionality) and \(1/\varepsilon\) (precision); in the blackboard model, \(1/\varepsilon\) is no longer multiplied by \(\log(n\Delta)\).
No Plaintext Coordinate Broadcast: In the coordinator model, no site needs to broadcast full point coordinates to all machines—an unexpected and important result.
Lazy Sampling + \(L_1\) Sampling: An elegant variant of distributed adaptive sampling that avoids global communication at every round.
Coordinate-wise Sensitivity Sampling is a novel technique that may be of independent value for distributed regression, low-rank approximation, and related problems.
The results generalize to arbitrary connected communication topologies, not restricted to the coordinator or blackboard setting.

Limitations & Future Work¶

Purely Theoretical: No experiments validate improvements in actual communication volume or runtime; in particular, the polylogarithmic factors may be non-trivial in practice.
Round Complexity: In the blackboard model, while the total bit complexity is optimal, the number of rounds is \(O(\log n \log k)\), which may be undesirable in latency-sensitive settings.
Degradation When \(\varepsilon\) Is Constant: When \(\varepsilon\) is a constant, the \(1/\varepsilon^4\) term becomes a constant, and the dominant communication term becomes \(dk\log(n\Delta)\) (transmitting the center coordinates themselves), which cannot be further optimized.
Finite-Precision Grid Assumption: Input points are assumed to lie on a \(\{1,\ldots,\Delta\}^d\) grid; continuous data in practice requires an additional discretization step.
Restricted to Euclidean Space: Non-Euclidean metrics and non-clustering cost functions are outside the scope of this paper.

Work	Approximation	Communication	Notes
Merge-and-reduce + [BCP+24]	\((1+\varepsilon)\)	\(\tilde{O}(skd/\varepsilon^4 \cdot \log(n\Delta))\)	Coupled parameters
[BEL13] + [BCP+24]	\((1+\varepsilon)\)	\(O(dk/\varepsilon^4 \cdot \log(n\Delta) + sdk\log(sk)\log(n\Delta))\)	\(s\) still multiplies \(dk\)
[CSWZ16] (Blackboard)	\(O(1)\)	\(\tilde{O}((s+dk)\log^2(n\Delta))\)	Constant approx. only
Ours (Coordinator)	\((1+\varepsilon)\)	\(\tilde{O}(sk + dk/\varepsilon^4 + dk\log(n\Delta))\)	Optimal
Ours (Blackboard)	\((1+\varepsilon)\)	\(\tilde{O}(s\log(n\Delta) + dk\log(n\Delta) + dk/\varepsilon^4)\)	Optimal

Broader Connections:

The distributed algorithms + coreset paradigm—first obtaining a constant-factor approximation at low communication cost, then refining via sensitivity sampling—may transfer to other distributed optimization problems such as low-rank approximation and regression.
The Lazy Sampling philosophy (tolerating stale information and updating only when deviation exceeds a threshold) is analogous to eventual consistency in distributed systems and may apply to gradient compression in distributed SGD.
Coordinate-wise Sampling decomposes the high-dimensional communication problem into lower-dimensional per-coordinate subproblems, paralleling the role of quantization and sparsification in federated learning.
The authors explicitly note natural connections between these techniques and data quantization and communication compression in LLM training.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ (Resolves an open problem on communication optimality in distributed clustering)
Experimental Thoroughness: ⭐⭐ (Purely theoretical; no experiments)
Writing Quality: ⭐⭐⭐⭐ (Clear structure; technical overview is well explained)
Value: ⭐⭐⭐⭐ (Complete theoretical resolution; techniques have broad transfer potential)