Unifying Proportional Fairness in Centroid and Non-Centroid Clustering

Conference: NeurIPS 2025 | arXiv: 2601.00447 | Code: None | Area: AI Safety / Fairness | Keywords: fair clustering, proportional fairness, core stability, semi-centroid clustering, approximation algorithms

TL;DR

This paper unifies the study of proportional fairness in centroid and non-centroid clustering under a single "semi-centroid clustering" framework, establishes an impossibility theorem showing the two cannot be simultaneously achieved, and designs novel algorithms that attain constant-factor core guarantees under dual metric loss.

Background & Motivation

Fair clustering demands proportionality guarantees for every sufficiently large coalition of agents, rather than only for predefined protected attributes. Prior work studied proportional fairness in two separate paradigms:

Centroid clustering (Chen et al., 2019): each agent's cost is determined by its distance to the cluster representative (centroid). Canonical application: facility location, where residents prefer their assigned facility to be nearby.

Non-centroid clustering (Caragiannis et al., 2024): each agent's cost is determined by its distance to the farthest data point in its cluster. Canonical application: federated learning, where participants prefer cluster members whose data distributions are similar to their own.

Key Challenge: In many real-world applications, users care about both types of loss simultaneously. In facility location, residents may prefer both a nearby facility (centroid loss) and neighbors who share the same community (non-centroid loss). In federated learning, participants care both about model quality (centroid) and similarity of peers' data distributions (non-centroid). Moreover, the two losses may be defined over entirely different metric spaces—when a teacher forms groups, students' preferences over project topics (centroid) and over group members (non-centroid) may be grounded in completely different similarity measures.

Impossibility Result: The paper first establishes an important impossibility theorem (Theorem 1): no clustering can simultaneously maintain finite-approximation core stability with respect to both centroid and non-centroid losses. This rules out directly combining existing algorithms and necessitates a new joint algorithmic design.

Method

Overall Architecture

The paper proposes the semi-centroid clustering model, where each agent \(i\) in cluster \((C_t, x_t)\) incurs loss \(\ell_i(C_t, x_t)\) that depends jointly on: (1) the cluster membership \(C_t\) (who is in the cluster) and (2) the cluster centroid \(x_t\) (where the centroid is located).

Two structured loss function families are studied:

  • Dual Metric Loss: \(\ell_i(C_t, x_t) = \max_{j \in C_t} d^m(i,j) + d^c(i, x_t)\), where the membership metric \(d^m\) and the centroid metric \(d^c\) may live in entirely different metric spaces.
  • Weighted Single Metric Loss: \(\ell_i(C_t, x_t) = \lambda \cdot \max_{j \in C_t} d(i,j) + (1-\lambda) \cdot d(i, x_t)\), a weighted combination over a single metric \(d\) with weight \(\lambda \in [0,1]\).
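To make the two families concrete, the sketch below writes them down directly. Points in the plane with Euclidean metrics are our own illustrative choice, not an assumption of the paper; in the dual metric case the two distance functions could be arbitrary and unrelated.

```python
import math

def dual_metric_loss(i, cluster, centroid, d_m, d_c):
    """Dual metric loss of agent i in cluster (C_t, x_t):
    max_{j in C_t} d^m(i, j)   (non-centroid, diameter-style term)
    + d^c(i, x_t)              (centroid term, possibly a different metric)."""
    return max(d_m(i, j) for j in cluster) + d_c(i, centroid)

def weighted_single_metric_loss(i, cluster, centroid, d, lam):
    """Weighted single metric loss with trade-off lam in [0, 1]:
    lam * max_{j in C_t} d(i, j) + (1 - lam) * d(i, x_t)."""
    return lam * max(d(i, j) for j in cluster) + (1 - lam) * d(i, centroid)

# Example: both metrics Euclidean (illustrative only)
euclid = math.dist
```

Setting \(\lambda = 1\) recovers the pure non-centroid loss and \(\lambda = 0\) the pure centroid loss, which is why the weighted family interpolates between the two prior paradigms.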

Key Designs

  1. Impossibility Theorem (Theorem 1):

    • A carefully constructed instance with 6 agents \(\{a,b,c,d,e,f\}\) and 3 clusters is used.
    • It is shown that non-centroid core requirements force \(\{a,b\}\), \(\{e,f\}\), and \(\{c,d\}\) to each form their own cluster.
    • It is then shown that the centroid core cannot be satisfied for this fixed partition, regardless of centroid placement.
    • Significance: this rules out any direct adaptation of the existing Greedy Capture algorithm.
  2. Dual Metric Algorithm (Algorithm 3):

    • Two-phase design:
      • Phase 1: Iteratively identifies the Most Cohesive Cluster (MCC)—the cluster-centroid pair minimizing the maximum agent loss. Agents in the identified cluster are then removed.
      • Phase 2: Selectively permits agents to switch clusters, subject to key constraints:
        • Switching must not excessively harm members of the destination cluster: every member's post-switch loss must remain \(\leq c \cdot r_{t'}\), where \(r_{t'}\) is cluster \(t'\)'s maximum agent loss at formation time.
        • A bounded-loss guarantee is provided for the switching agent.
        • Each agent's switching decision is made independently of others.
    • The growth factor \(c\) is a key parameter controlling the permissiveness of switching: \(c = \frac{3\alpha + \sqrt{\alpha(\alpha+8)}}{4\alpha}\).
    • Design Motivation: to interpolate between centroid Greedy Capture (unrestricted switching) and non-centroid Greedy Capture (no switching).
  3. Core Theoretical Results (Theorems 2–3):

    • Existence: under dual metric loss, a 3-core clustering always exists (using an exact MCC subroutine).
    • Polynomial-time algorithm: using a 4-approximation MCC subroutine, a \((3+2\sqrt{3})\)-core clustering is found in polynomial time.
    • Stronger FJR (Fully Justified Representation) guarantees: exact 1-FJR exists; polynomial-time 4-FJR is achievable.
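The two-phase design above can be sketched end to end. The Python version below is our own simplified illustration, not the paper's implementation: `mcc_heuristic` is a stand-in for the exact or \(\alpha\)-approximate MCC subroutine (it carries no approximation guarantee), candidate centroids are drawn from the agents themselves, and Phase 2 performs a single pass of selective switching.

```python
import math

def mcc_heuristic(pool, size, loss_fn):
    """Stand-in for the Most Cohesive Cluster subroutine: for each candidate
    centroid drawn from the pool, grow a cluster of the target size by
    centroid distance, and keep the (cluster, centroid) pair whose maximum
    agent loss is smallest. Illustrative only; no approximation guarantee."""
    best = None
    for x in pool:
        members = sorted(pool, key=lambda a: math.dist(a, x))[:size]
        r = max(loss_fn(a, members, x) for a in members)
        if best is None or r < best[2]:
            best = (members, x, r)
    return best

def semi_centroid_clustering(agents, k, loss_fn, c):
    """Two-phase sketch of Algorithm 3.
    Phase 1: repeatedly peel off the most cohesive cluster, recording its
    formation radius r_t (the maximum agent loss at creation time).
    Phase 2: one pass of selective switching -- an agent may move to another
    cluster if its own loss strictly drops and the destination cluster's
    maximum loss stays within c * r_t of that cluster's formation radius."""
    size = math.ceil(len(agents) / k)
    remaining, clusters = list(agents), []
    while remaining:
        members, x, r = mcc_heuristic(remaining, min(size, len(remaining)), loss_fn)
        clusters.append([list(members), x, r])
        remaining = [a for a in remaining if a not in members]
    for members, x, _ in clusters:
        for a in list(members):
            for members2, x2, r2 in clusters:
                if members2 is members:
                    continue
                new = members2 + [a]
                if (loss_fn(a, new, x2) < loss_fn(a, members, x)
                        and max(loss_fn(b, new, x2) for b in new) <= c * r2):
                    members.remove(a)
                    members2.append(a)
                    break
    return [(members, x) for members, x, _ in clusters]
```

With the paper's growth factor \(c = \frac{3\alpha + \sqrt{\alpha(\alpha+8)}}{4\alpha}\), an exact MCC subroutine (\(\alpha = 1\)) gives \(c = 1.5\), which is the value used in the toy run below.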

Loss & Training

  • This is a theoretical work; no training is involved.
  • Algorithm runtime depends on the efficiency of the MCC subroutine.
  • For weighted single metric loss, parameterized approximation factors of \(\min\{2/\lambda, f_\lambda\}\) are provided as \(\lambda\) varies.

Key Experimental Results

Main Results

Core approximation ratios on real datasets

| Dataset | \(\lambda\) | Algorithm 3 Core Approx. | k-means++ Core Approx. | Notes |
|---|---|---|---|---|
| Occupancy | 0.5 | ~1.1 | ~2.5 | Far better than the theoretical upper bound |
| Iris | 0.5 | ~1.0 | ~3.0 | Near-perfect |
| Wine | 0.5 | ~1.1 | ~2.8 | Consistently superior |
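On toy instances, numbers like these can be reproduced by measuring the core approximation ratio of a clustering directly: the largest factor \(\beta\) by which some coalition of \(\lceil n/k \rceil\) agents could all reduce their loss by deviating to a new centroid. The routine below is our own evaluation sketch (the paper's exact protocol may differ); it checks only coalitions of size exactly \(\lceil n/k \rceil\), so it yields a lower bound on the true ratio, and it is feasible only for small \(n\) since the search is exponential.

```python
import math
from itertools import combinations

def core_approx_ratio(clusters, k, loss_fn, candidate_centroids):
    """Brute-force lower bound on the core approximation ratio: the largest
    beta such that some coalition S of ceil(n/k) agents and some centroid y
    would reduce every member's loss by a factor beta. A clustering is in
    the exact core iff the returned value is 1.0."""
    agents = [a for members, _ in clusters for a in members]
    current = {a: loss_fn(a, members, x)
               for members, x in clusters for a in members}
    size = math.ceil(len(agents) / k)
    beta = 1.0
    for S in combinations(agents, size):
        S = list(S)
        for y in candidate_centroids:
            beta = max(beta, min(current[i] / loss_fn(i, S, y) for i in S))
    return beta
```

On the well-separated four-point instance used in the test, no coalition can improve on the obvious two-cluster solution, so the ratio is exactly 1.0.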

Ablation Study

| Configuration | Core Approx. (Median) | Classical Clustering Objective Loss | Notes |
|---|---|---|---|
| Algorithm 3 | ~1.05 | Slightly increased | Balances fairness and efficiency |
| k-means++ | ~2.5 | Optimal | Ignores fairness |
| Theoretical lower bound (dual metric) | ≥2 | — | |
| Theoretical lower bound (FJR) | ≥1 | — | |

Key Findings

  • Empirical performance substantially exceeds theoretical worst-case bounds: the algorithm typically achieves core approximation ratios close to 1.
  • The price of fairness is small: classical clustering objectives increase only marginally.
  • Compared to k-means++, the fair clustering algorithm consistently and significantly improves core approximation.
  • Results extend analogous observations from Chen et al. and Caragiannis et al. within their respective paradigms.
  • Because the impossibility theorem rules out satisfying both cores at once, the trade-off parameter \(\lambda\) must be provided as input; relying solely on a single metric \(d\) is insufficient.

Highlights & Insights

  • The construction in the impossibility theorem (Theorem 1) is remarkably elegant: a fundamental impossibility is established using only 6 points and 3 clusters.
  • The selective switching mechanism in Phase 2 is the core algorithmic innovation, striking a principled balance between unconstrained switching and no switching.
  • The theoretical landscape is comprehensive: existence upper bounds, algorithmic upper bounds, and lower bounds together form a clear and coherent picture.
  • The semi-centroid clustering model is forward-looking: it unifies two independent lines of research and naturally captures real-world scenarios.

Limitations & Future Work

  • The polynomial-time core approximation factor of \((3+2\sqrt{3}) \approx 6.46\) leaves room for improvement; the theoretical lower bound is only 2.
  • Non-centroid loss is restricted to the maximum distance (diameter); average distance and other aggregation functions are not addressed.
  • Experiments are conducted on small real datasets; scalability to large-scale settings is untested.
  • The weighting parameter \(\lambda\) must be specified in advance; learning an optimal \(\lambda\) from data is not discussed.
  • A significant gap remains between the polynomial-time FJR approximation (4-FJR) and the existential guarantee (1-FJR).
  • For context, the Greedy Capture algorithm of Chen et al. (2019), the seminal work on proportional fairness in centroid clustering, achieves a \((1+\sqrt{2})\)-core, and Caragiannis et al. (2024) extend proportional fairness to non-centroid clustering, motivated by federated learning; both serve as natural baselines when attacking the gaps above.
  • Solving the Most Cohesive Cluster problem remains the computational bottleneck for algorithmic efficiency.
  • The unified framework established in this paper provides a more general platform for future research, enabling simultaneous progress in both paradigms under the semi-centroid setting.
  • Exploring clustering fairness through the lenses of sortition and strategic game theory warrants further investigation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Unifying two independent paradigms, the impossibility theorem, and the new algorithm are all significant theoretical contributions.
  • Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical; empirical validation is adequate but limited in scale.
  • Writing Quality: ⭐⭐⭐⭐ Theorems are stated clearly and proofs are rigorously structured, though accessibility for non-theory readers is limited.
  • Value: ⭐⭐⭐⭐ Advances the theory of fair clustering; the unified framework has long-term impact.