Coresets for Clustering Under Stochastic Noise¶

Conference: NeurIPS 2025 arXiv: 2510.23438 Area: Others Keywords: Coreset, Clustering, Stochastic Noise, k-Means, Surrogate Error Metric

TL;DR¶

This paper presents the first systematic study of \((k,z)\)-clustering coreset construction under noisy data. It proposes a novel surrogate error metric \(\mathsf{Err}_\alpha\) to replace the traditional \(\mathsf{Err}\), achieving a \(\text{poly}(k)\)-fold reduction in coreset size and a \(\text{poly}(k)\)-fold tightening of quality guarantees under mild data assumptions, along with a noise-aware cluster-wise sampling algorithm.

Background & Motivation¶

Background: Coresets are a fundamental data compression tool for clustering problems — a weighted subset \(S \subseteq P\) satisfying \(\text{cost}_z(S, C) \in (1 \pm \varepsilon)\text{cost}_z(P, C)\) for all center sets \(C\). The noiseless setting has a rich theory with optimal size bounds.

Limitations of Prior Work: Real-world data is almost always corrupted by noise (sensor errors, privacy mechanisms, transmission distortion), yet existing coreset constructions rely entirely on noiseless assumptions. The fundamental question of how a coreset built from noisy data \(\hat{P}\) guarantees quality with respect to the true data \(P\) has remained largely unaddressed.

Key Challenge: (a) The true data \(P\) is unobservable, precluding direct quality evaluation; (b) noise uniformly inflates clustering costs, making the traditional metric \(\mathsf{Err}\) overly pessimistic; (c) noise may alter cluster assignments of individual points.

Key Insight: The paper introduces the approximation-ratio metric \(\mathsf{Err}_\alpha\), which focuses on relative performance within the neighborhood of optimal center sets, rendering it naturally immune to the uniform cost inflation induced by noise.

Method¶

Noise Models¶

Model I: Each point remains unchanged with probability \(1-\theta\) and, with probability \(\theta\), receives independent per-coordinate noise \(\xi_{p,j} \sim D_j\) (zero-mean, unit variance, satisfying Bernstein's condition). This covers Gaussian mechanisms (differential privacy), sensor errors, and related settings.

Model II: Every point receives per-coordinate noise \(\xi_{p,j} \sim D_j\) with variance \(\sigma^2\); this is a more general formulation.

Two Surrogate Metrics¶

Traditional metric \(\mathsf{Err}\):

\[\mathsf{Err}(S, P) = \sup_{C \in \mathcal{C}} \frac{|\text{cost}_z(S, C) - \text{cost}_z(P, C)|}{\text{cost}_z(S, C)}\]

New metric \(\mathsf{Err}_\alpha\) (core innovation):

\[\mathsf{Err}_\alpha(S, P) = \sup_{C \in \mathcal{C}_\alpha(S)} \frac{r_P(C)}{r_S(C)} - 1\]

where \(\mathcal{C}_\alpha(S) = \{C: r_S(C) \leq \alpha\}\) denotes the set of \(\alpha\)-approximate center sets on \(S\), and \(r_P(C) = \text{cost}_z(P, C) / \mathsf{OPT}_P\).

Key Distinction: \(\mathsf{Err}\) takes the supremum over all center sets and is highly sensitive to uniform noise-induced cost inflation; \(\mathsf{Err}_\alpha\) compares only relative rankings over approximately optimal center sets, where the uniform inflation cancels between numerator and denominator.

Illustrative Example¶

Consider 1-Means on \(\mathbb{R}\), with \(n/2\) points at \(-1\), \(n/2\) points at \(+1\), corrupted by \(\mathcal{N}(0,1)\) noise: - \(\mathsf{Err}(\hat{P}, P) \approx 2/3\) → \(r_P(\hat{P}, 1) \leq 25/9\) (overly pessimistic) - \(\mathsf{Err}_1(\hat{P}, P) \leq 1/n\) → \(r_P(\hat{P}, 1) \leq 1 + 1/n\) (near-perfect)

Theorem 3.1: Coreset via \(\mathsf{Err}\)¶

Using any standard algorithm that guarantees \(\mathsf{Err}(S, \hat{P}) \leq \varepsilon\), the coreset size is \(\tilde{O}(\min\{k^{1.5}\varepsilon^{-2}, k\varepsilon^{-4}\})\) (identical to the noiseless case), but the quality guarantee degrades to:

\[r_P(S, \alpha) \leq \left(1 + \varepsilon + O\left(\frac{\theta nd}{\mathsf{OPT}_P} + \sqrt{\frac{\theta nd}{\mathsf{OPT}_P}}\right)\right)^2 \cdot \alpha\]

The additional noise error \(O(\theta nd / \mathsf{OPT}_P)\) cannot be eliminated.

Theorem 3.3: Coreset via \(\mathsf{Err}_\alpha\) (Main Contribution)¶

Assumption 3.2: (1) \(\gamma\)-cost stability (\(\mathsf{OPT}_P(k-1)/\mathsf{OPT}_P(k) \geq 1 + \gamma\)); (2) absence of strong outliers (\(r_i \leq 8\bar{r}_i\)).

Algorithm 1 (CN\(_\alpha\)): 1. Partition \(\hat{P}\) into \(k\) clusters \(\hat{P}_i\) according to \(\hat{C}\) 2. Compute per-cluster mean radius \(\hat{r}_i = \sqrt{\text{cost}(\hat{P}_i, \hat{c}_i) / |\hat{P}_i|}\) 3. Remove extreme noise points (ball clipping): \(P_i' = \hat{P}_i \cap B(\hat{c}_i, R_i)\), where \(R_i = 3\hat{r}_i + O\!\left(\sqrt{d}\log\frac{1+\theta kd}{\sqrt{\alpha-1}}\right)\) 4. Uniformly sample from each cluster \(|S_i| = O\!\left(\frac{\log k}{\varepsilon - \Delta} + \frac{(\alpha-1)\log k}{(\varepsilon - \Delta)^2}\right)\), where \(\Delta = \frac{\sqrt{\alpha-1}\,\theta nd}{\alpha\,\mathsf{OPT}_P}\)

The resulting quality guarantee is:

\[r_P(S, \alpha) \leq \left(1 + \varepsilon + O\left(\frac{\theta kd}{\mathsf{OPT}_P} + \frac{\sqrt{\alpha-1}}{\alpha} \cdot \frac{\sqrt{\theta kd \cdot \mathsf{OPT}_P} + \theta nd}{\mathsf{OPT}_P}\right)\right) \cdot \alpha\]

Comparison of the Two Approaches¶

Dimension	\(\mathbf{CN}\) (Thm 3.1)	\(\mathbf{CN}_\alpha\) (Thm 3.3)	Improvement Factor
Coreset size	\(\tilde{O}(\min\{k^{1.5}/\varepsilon^2, k/\varepsilon^4\})\)	\(\tilde{O}(k/\varepsilon)\)	\(\sqrt{k}/\varepsilon\)
Noise error in \(r_P\)	\(O(\theta nd / \mathsf{OPT}_P)\)	\(O(\theta kd / \mathsf{OPT}_P)\)	\(n/k\)
Assumptions required	None	Assumption 3.2	—

Technical Core¶

The noise tolerance of \(\mathsf{Err}_\alpha\) stems from the separation between center drift and cost inflation: - Center drift: \(\mathsf{C}(\hat{P}_i) = \mathsf{C}(P_i) + \frac{1}{|P_i|}\sum_p \xi_p\), controlled via concentration inequalities for independent noise, contributing \(O(\theta d / \mathsf{OPT}_{P_i})\) - Cost inflation: \(\text{cost}(P_i, \mathsf{C}(\hat{P}_i)) = \mathsf{OPT}_{P_i} + \frac{1}{|P_i|}\|\sum_p \xi_p\|_2^2\), likewise controllable

Cluster-wise analysis followed by aggregation replaces the global factor \(n\) with \(k\) in the total error.

Key Experimental Results¶

Coreset Quality on Real Datasets (\(k\)-Means, \(\mathbf{CN}\) vs. \(\mathbf{CN}_\alpha\))¶

Dataset	\(n\)	\(d\)	\(k\)	\(\theta\)	\(r_P\) of \(\mathbf{CN}\)	\(r_P\) of \(\mathbf{CN}_\alpha\) ↓
MNIST	60000	784	10	0.1	1.32	1.08
CIFAR-10 features	50000	512	10	0.1	1.28	1.05

Robustness Under Non-i.i.d. Noise (Table 7)¶

When the noise covariance matrix \(\Sigma \neq \sigma^2 I_d\), \(\mathbf{CN}_\alpha\) still substantially outperforms \(\mathbf{CN}\), demonstrating practical utility beyond the theoretical assumptions.

Performance When Assumptions Are Violated¶

Even when datasets violate Assumption 3.2 (e.g., weakly separated data with small \(\gamma\)), \(\mathbf{CN}_\alpha\) continues to perform well empirically (Table 2), indicating that the theoretical assumptions are conservative.

Key Findings¶

\(\mathsf{Err}_\alpha\) yields tighter quality guarantees than \(\mathsf{Err}\) in all tested scenarios
Coreset size improvements are significant for large \(k\) (factor of \(\sqrt{k}/\varepsilon\))
Ball clipping (Line 3 of Algorithm 1) is a key design: it stabilizes cluster structure without information loss

Highlights & Insights¶

The \(\mathsf{Err}_\alpha\) metric constitutes a fundamental improvement to coreset evaluation: in noisy settings, it relaxes the requirement from "uniform approximation over all \(C\)" to "rank preservation over approximately optimal \(C\)," which is both natural and necessary
Cluster-wise sampling combined with ball clipping is a clean and effective algorithmic design: by controlling the influence of extreme noise points, the analysis is localized from global to per-cluster
The separation of center drift from cost inflation is the core insight: noise uniformly inflates costs without altering the relative advantage of optimal centers, and \(\mathsf{Err}_\alpha\) precisely captures this invariance
Theory and experiments are well-aligned: although Assumption 3.2 is required for theoretical guarantees, the algorithm remains effective when the assumption is not satisfied

Limitations & Future Work¶

The \(\gamma\)-cost stability in Assumption 3.2 is difficult to satisfy when clusters are heavily overlapping, and the required \(\gamma\) grows with \(\alpha\) and \(\theta\)
\(\mathsf{OPT}_P\) must be estimated — while \(\text{cost}(\hat{P}, \hat{C})\) serves as a proxy, its accuracy degrades under high noise
The primary analysis targets \(k\)-Means (\(z=2\)); extensions to \(k\)-Median (\(z=1\)) and other values of \(z\) are sketched in the appendix but lack depth
The \(\sqrt{\theta nd / \mathsf{OPT}_P}\) term introduced by Cauchy–Schwarz in the \(\mathsf{Err}\) analysis may not be tight — whether it can be eliminated remains open
Experiments are limited in scale, without coverage of million-scale datasets or high-dimensional embedding spaces

Rating¶

Novelty: ⭐⭐⭐⭐ New metric + new algorithm + first systematic study of noisy coresets
Theoretical Depth: ⭐⭐⭐⭐ Complete analysis of both metrics with explicit quantification of improvements
Experimental Thoroughness: ⭐⭐⭐ Multi-dataset validation, though experiments lack large scale
Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, intuitive examples are effective, notation is heavy
Overall: ⭐⭐⭐⭐ Fills a theoretical gap in noisy coresets; practical value grows as noise-aware pipelines become more prevalent

vs. Standard coresets (Feldman–Langberg, Cohen-Addad et al.): All assume noiseless data, making \(\mathsf{Err}\) directly applicable. This paper is the first to reveal the systematic over-pessimism of that metric under noise
vs. Robust coresets [38,52,54]: These treat noise as identifiable outliers to be filtered. This paper does not assume noise is separable and constructs coresets directly from noisy data
vs. Ben-David & Haghtalab (2014): That work studies robustness of clustering algorithms (e.g., fast convergence of \(k\)-means under Gaussian perturbations), targeting algorithm solving rather than coreset construction
Complementary to Coreset for Robust Geometric Median (2510.24621): Both papers extend coreset theory from different angles — the former addresses size dependence under deterministic outliers, the latter addresses cost inflation under stochastic noise. The technical tools differ, but both challenge the limits of classical analysis
The design paradigm of \(\mathsf{Err}_\alpha\) — comparing relative performance only within the neighborhood of approximate optima — is transferable to noisy graph partition coresets, privacy-preserving facility location, and related problems