Unifying Proportional Fairness in Centroid and Non-Centroid Clustering¶
Conference: NeurIPS 2025 arXiv: 2601.00447 Code: None Area: AI Safety / Fairness Keywords: fair clustering, proportional fairness, core stability, semi-centroid clustering, approximation algorithms
TL;DR¶
This paper unifies the study of proportional fairness in centroid and non-centroid clustering under a single "semi-centroid clustering" framework, establishes an impossibility theorem showing the two cannot be simultaneously achieved, and designs novel algorithms that attain constant-factor core guarantees under dual metric loss.
Background & Motivation¶
Fair clustering demands proportional fairness guarantees for every possible group, rather than only predefined protected attributes. Prior work studied proportional fairness in two separate paradigms:
Centroid clustering (Chen et al., 2019): each agent's cost is determined by its distance to the cluster representative (centroid). Canonical application: facility location, where residents prefer their assigned facility to be nearby.
Non-centroid clustering (Caragiannis et al., 2024): each agent's cost is determined by its distance to the farthest data point in its cluster. Canonical application: federated learning, where participants prefer cluster members whose data distributions are similar to their own.
Key Challenge: In many real-world applications, users care about both types of loss simultaneously. In facility location, residents may prefer both a nearby facility (centroid loss) and neighbors who share the same community (non-centroid loss). In federated learning, participants care both about model quality (centroid) and similarity of peers' data distributions (non-centroid). Moreover, the two losses may be defined over entirely different metric spaces—when a teacher forms groups, students' preferences over project topics (centroid) and over group members (non-centroid) may be grounded in completely different similarity measures.
Impossibility Result: The paper first establishes an important impossibility theorem (Theorem 1): no clustering can simultaneously maintain finite-approximation core stability with respect to both centroid and non-centroid losses. This rules out directly combining existing algorithms and necessitates a new joint algorithmic design.
Method¶
Overall Architecture¶
The paper proposes the semi-centroid clustering model, where each agent \(i\) in cluster \((C_t, x_t)\) incurs loss \(\ell_i(C_t, x_t)\) that depends jointly on: (1) the cluster membership \(C_t\) (who is in the cluster) and (2) the cluster centroid \(x_t\) (where the centroid is located).
Two structured loss function families are studied: - Dual Metric Loss: \(\ell_i(\mathcal{X}) = \max_{j \in C} d^m(i,j) + d^c(i, x_t)\), defined over two distinct metric spaces. - Weighted Single Metric Loss: \(\ell_i(\mathcal{X}) = \lambda \cdot \max_{j \in C} d(i,j) + (1-\lambda) \cdot d(i, x_t)\), a weighted combination over a single metric space.
Key Designs¶
-
Impossibility Theorem (Theorem 1):
- A carefully constructed instance with 6 agents and 3 clusters, \(\{a,b,c,d,e,f\}\), is used.
- It is shown that non-centroid core requirements force \(\{a,b\}\), \(\{e,f\}\), and \(\{c,d\}\) to each form their own cluster.
- It is then shown that the centroid core cannot be satisfied for this fixed partition, regardless of centroid placement.
- Significance: this rules out any direct adaptation of the existing Greedy Capture algorithm.
-
Dual Metric Algorithm (Algorithm 3):
- Two-phase design:
- Phase 1: Iteratively identifies the Most Cohesive Cluster (MCC)—the cluster-centroid pair minimizing the maximum agent loss. Agents in the identified cluster are then removed.
- Phase 2: Selectively permits agents to switch clusters, subject to key constraints:
- Switching must not excessively harm members of the new cluster: post-switch loss \(\leq c \cdot r_{t'}\).
- A bounded-loss guarantee is provided for the switching agent.
- Each agent's switching decision is made independently of others.
- The growth factor \(c\) is a key parameter controlling the permissiveness of switching: \(c = \frac{3\alpha + \sqrt{\alpha(\alpha+8)}}{4\alpha}\).
- Design Motivation: to interpolate between centroid GC (free switching) and non-centroid GC (no switching).
- Two-phase design:
-
Core Theoretical Results (Theorems 2–3):
- Existence: under dual metric loss, a 3-core clustering always exists (using an exact MCC subroutine).
- Polynomial-time algorithm: using a 4-approximation MCC subroutine, a \((3+2\sqrt{3})\)-core clustering is found in polynomial time.
- Stronger FJR (Fully Justified Representation) guarantees: exact 1-FJR exists; polynomial-time 4-FJR is achievable.
Loss & Training¶
- This is a theoretical work; no training is involved.
- Algorithm runtime depends on the efficiency of the MCC subroutine.
- For weighted single metric loss, parameterized approximation factors of \(\min\{2/\lambda, f_\lambda\}\) are provided as \(\lambda\) varies.
Key Experimental Results¶
Main Results¶
Core approximation ratios on real datasets
| Dataset | \(\lambda\) | Algorithm 3 Core Approx. | k-means++ Core Approx. | Notes |
|---|---|---|---|---|
| Occupancy | 0.5 | ~1.1 | ~2.5 | Far better than theoretical upper bound |
| Iris | 0.5 | ~1.0 | ~3.0 | Near-perfect |
| Wine | 0.5 | ~1.1 | ~2.8 | Consistently superior |
Ablation Study¶
| Configuration | Core Approx. (Median) | Classical Clustering Objective Loss | Notes |
|---|---|---|---|
| Algorithm 3 | ~1.05 | Slightly increased | Balances fairness and efficiency |
| k-means++ | ~2.5 | Optimal | Ignores fairness |
| Theoretical lower bound | ≥2 | — | Dual metric |
| Theoretical lower bound | ≥1 | — | FJR |
Key Findings¶
- Empirical performance substantially exceeds theoretical worst-case bounds: the algorithm typically achieves core approximation ratios close to 1.
- The price of fairness is small: classical clustering objectives increase only marginally.
- Compared to k-means++, the fair clustering algorithm consistently and significantly improves core approximation.
- Results extend analogous observations from Chen et al. and Caragiannis et al. within their respective paradigms.
- The impossibility theorem implies that \(\lambda\) must be provided as input; relying solely on a single metric \(d\) is insufficient.
Highlights & Insights¶
- The construction in the impossibility theorem (Theorem 1) is remarkably elegant: a fundamental impossibility is established using only 6 points and 3 clusters.
- The selective switching mechanism in Phase 2 is the core algorithmic innovation, striking a principled balance between unconstrained switching and no switching.
- The theoretical landscape is comprehensive: existence upper bounds, algorithmic upper bounds, and lower bounds together form a clear and coherent picture.
- The semi-centroid clustering model is forward-looking: it unifies two independent lines of research and naturally captures real-world scenarios.
Limitations & Future Work¶
- The polynomial-time core approximation factor of \((3+2\sqrt{3}) \approx 6.46\) leaves room for improvement; the theoretical lower bound is only 2.
- Non-centroid loss is restricted to the maximum distance (diameter); average distance and other aggregation functions are not addressed.
- Experiments are conducted on small real datasets; scalability to large-scale settings is untested.
- The weighting parameter \(\lambda\) must be specified in advance; learning an optimal \(\lambda\) from data is not discussed.
- A significant gap remains between the polynomial-time FJR approximation (4-FJR) and the existential guarantee (1-FJR).
Related Work & Insights¶
- The Greedy Capture algorithm of Chen et al. (2019) is the seminal work on proportional fairness in centroid clustering, achieving a \((1+\sqrt{2})\)-core.
- Caragiannis et al. (2024) extend proportional fairness to non-centroid clustering, motivated by federated learning.
- Solving the Most Cohesive Cluster problem is the computational bottleneck for algorithmic efficiency.
- The unified framework established in this paper provides a more general platform for future research, enabling simultaneous progress in both paradigms under the semi-centroid setting.
- Exploring clustering fairness through the lenses of sortition and strategic game theory warrants further investigation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Unifying two independent paradigms, the impossibility theorem, and the new algorithm are all significant theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical; empirical validation is adequate but limited in scale.
- Writing Quality: ⭐⭐⭐⭐ Theorems are stated clearly and proofs are rigorously structured, though accessibility for non-theory readers is limited.
- Value: ⭐⭐⭐⭐ Advances the theory of fair clustering; the unified framework has long-term impact.