Fair Model-Based Clustering

Conference: AAAI 2026 arXiv: 2602.21509 Code: None Area: AI Safety / Fairness Keywords: Fair clustering, finite mixture models, EM algorithm, mini-batch learning, scalability

TL;DR

This paper proposes FMC, a fair clustering algorithm based on finite mixture models. By imposing fairness constraints on model parameters rather than sample-level assignments, FMC achieves scalable fair clustering whose parameter count is independent of the sample size \(N\). It supports mini-batch learning and categorical data, and substantially outperforms existing methods on large-scale datasets.

Background & Motivation

Background: Fair clustering requires that the proportion of sensitive attributes (e.g., gender, race) within each cluster reflects that of the overall dataset. Most existing methods are built upon K-means clustering, jointly optimizing cluster centers and assignment mappings under fairness constraints.

Limitations of Prior Work:

  • Each sample's cluster assignment must be optimized jointly with the cluster centers, so the number of learnable parameters scales with \(N\) and scalability on large datasets is poor.
  • Fairness constraints depend on assignments over the full dataset, so mini-batch learning is infeasible and standard acceleration techniques do not apply.
  • K-means-based methods require a metric space and cannot handle non-metric data such as categorical features.
  • After training, the assignment mapping is defined only on the training data, so fairly assigning new data is non-trivial.

Key Challenge: How can the computational complexity of fair clustering be decoupled from the sample size while preserving fairness guarantees?

Goal: To develop a scalable fair clustering algorithm whose parameter count is independent of the sample size.

Key Insight: Replace geometric distance with probabilistic mixture models, shifting fairness constraints from assignment mappings to model parameters — a parametric assignment mapping naturally supports mini-batch training and out-of-sample assignment.

Core Idea: Fair clustering based on finite mixture models, achieving scalability independent of \(N\) by imposing Gap constraints on model parameters.

Method

Overall Architecture

Given a dataset with sensitive attributes and a target cluster count \(K\), FMC outputs a fair probabilistic assignment mapping. Data are assumed to be generated by \(K\) mixture components: \(X \sim \sum_{k=1}^K \pi_k f(\cdot; \theta_k)\). Parameters \(\Theta = (\boldsymbol{\pi}, (\theta_1, ..., \theta_K))\) are estimated by maximizing the log-likelihood subject to fairness constraints.
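To make the generative assumption concrete, below is a minimal JAX sketch of the mixture density and the log-likelihood \(\ell(\Theta \mid \mathcal{D}) = \sum_i \log \sum_k \pi_k f(x_i; \theta_k)\). The choice of diagonal-covariance Gaussian components, the logit and log-variance parameterization, and all function names are illustrative assumptions, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def gaussian_logpdf(X, mean, log_var):
    """Row-wise log-density of a diagonal-covariance Gaussian (illustrative choice)."""
    var = jnp.exp(log_var)
    return -0.5 * jnp.sum((X - mean) ** 2 / var + jnp.log(2 * jnp.pi * var), axis=1)

def mixture_component_ll(params, X):
    """K x N matrix with entries log pi_k + log f(x_i; theta_k)."""
    logits, means, log_vars = params     # mixing-weight logits, K x d means, K x d log-variances
    log_pi = jax.nn.log_softmax(logits)  # keeps the mixing weights on the simplex
    comp = jax.vmap(lambda m, v: gaussian_logpdf(X, m, v))(means, log_vars)
    return comp + log_pi[:, None]

def log_likelihood(params, X):
    """Mixture log-likelihood: sum_i log sum_k pi_k f(x_i; theta_k)."""
    return jnp.sum(jax.scipy.special.logsumexp(mixture_component_ll(params, X), axis=0))
```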

Key Designs

  1. Parametric Assignment Mapping:

    • Function: Uses the posterior probability of the mixture model as a soft assignment mapping.
    • Mechanism: \(\psi_k(x_i; \Theta) = \frac{\pi_k f(x_i; \theta_k)}{\sum_l \pi_l f(x_i; \theta_l)}\)
    • Design Motivation: The assignment mapping is fully determined by the model parameters \(\Theta\); the parameter count (\(K\) means, covariances, and mixing weights) is independent of \(N\). A minimal sketch of this mapping appears after this list.
  2. Gap Fairness Constraint:

    • Function: Ensures that the proportion of each sensitive group is balanced across clusters.
    • Mechanism: Define \(\Delta(\Theta) = \max_k |\frac{\sum_{x_i \in \mathcal{D}^{(1)}} \psi_k(x_i; \Theta)}{N_1} - \frac{\sum_{x_j \in \mathcal{D}^{(2)}} \psi_k(x_j; \Theta)}{N_2}|\); the optimization objective is \(\max_\Theta \ell(\Theta | \mathcal{D}) - \lambda \Delta(\Theta)\).
    • Design Motivation: The Gap metric is numerically more stable than the Balance metric, making it more suitable for gradient-based optimization.
  3. FMC-GD and FMC-EM Optimization Algorithms:

    • FMC-GD: Directly applies gradient ascent to the penalized objective \(\ell(\Theta \mid \mathcal{D}) - \lambda \Delta(\Theta)\).
    • FMC-EM: Augments the Q-function with a fairness penalty, \(Q_{fair} = \mathbb{E}[\ell_{comp}(\Theta|Y)] - \lambda\Delta(\Theta)\), and employs the Generalized EM (GEM) framework, in which each M-step only needs to increase \(Q_{fair}\) rather than maximize it, so the penalized objective is non-decreasing across iterations.
    • Empirically, FMC-EM achieves a superior cost–fairness trade-off with lower variance.
  4. Mini-Batch Learning and Subsampled \(\Delta\):

    • Function: Addresses computational bottlenecks on large-scale data.
    • Mechanism: Mini-batches are used for the likelihood term, while \(\Delta\) is estimated via subsampling — Proposition 1 theoretically establishes that the approximation error of the subsampled \(\Delta\) is \(O(\sqrt{d/n'})\).
    • Design Motivation: Existing methods cannot leverage mini-batch learning because their fairness constraints depend on complete assignments; the parametric assignment mapping in FMC makes this possible.
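The pieces above can be tied together in a few lines. The sketch below, reusing the mixture helpers from the earlier snippet, computes the posterior assignment mapping \(\psi_k\), the Gap penalty \(\Delta\) on (sub)samples of the two sensitive groups, and the penalized mini-batch objective. Function names, the equal-size subsamples, and other details are assumptions for illustration rather than the paper's code.

```python
def responsibilities(params, X):
    """Posterior psi_k(x_i; Theta): the soft, parametric assignment mapping (N x K)."""
    return jax.nn.softmax(mixture_component_ll(params, X), axis=0).T

def gap_penalty(params, X1, X2):
    """Delta(Theta): largest per-cluster gap between the two groups' average soft assignments.
    X1 and X2 may be the full sensitive groups or subsamples of them."""
    p1 = responsibilities(params, X1).mean(axis=0)  # group-1 proportion assigned to each cluster
    p2 = responsibilities(params, X2).mean(axis=0)  # group-2 proportion assigned to each cluster
    return jnp.max(jnp.abs(p1 - p2))

def minibatch_objective(params, X_batch, X1_sub, X2_sub, lam):
    """Penalized objective: likelihood term on the mini-batch, Gap penalty on group subsamples."""
    return log_likelihood(params, X_batch) - lam * gap_penalty(params, X1_sub, X2_sub)
```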

Loss & Training

The training objective is \(\max_\Theta \ell(\Theta | \mathcal{D}) - \lambda \Delta(\Theta)\). Reported settings: FMC-GD uses \(T=10000\) iterations with learning rate \(10^{-3}\); FMC-EM uses \(T=200\) outer iterations, an inner loop of \(R=10\) steps, and learning rate \(10^{-2}\).
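As a rough illustration of the FMC-GD variant under these settings, the sketch below performs mini-batch gradient ascent on the penalized objective using the helpers defined earlier. The initialization, batch and subsample sizes, and plain gradient-ascent update are assumptions; the FMC-EM variant (alternating E-steps with \(R\) inner updates of the penalized Q-function) is not shown.

```python
def fmc_gd(X, group, K, lam, T=10_000, lr=1e-3, batch=1024, seed=0):
    """Mini-batch gradient ascent on l(Theta | D) - lam * Delta(Theta) (illustrative loop)."""
    key = jax.random.PRNGKey(seed)
    N, d = X.shape
    X1, X2 = X[group == 0], X[group == 1]                    # the two sensitive groups
    key, k0 = jax.random.split(key)
    params = (jnp.zeros(K),                                  # mixing-weight logits
              X[jax.random.choice(k0, N, (K,), replace=False)],  # means: K random samples
              jnp.zeros((K, d)))                             # log-variances
    grad_fn = jax.jit(jax.grad(minibatch_objective))         # gradient w.r.t. params
    for _ in range(T):
        key, k1, k2, k3 = jax.random.split(key, 4)
        xb = X[jax.random.choice(k1, N, (batch,))]               # likelihood mini-batch
        x1 = X1[jax.random.choice(k2, X1.shape[0], (batch,))]    # group-1 subsample for Delta
        x2 = X2[jax.random.choice(k3, X2.shape[0], (batch,))]    # group-2 subsample for Delta
        grads = grad_fn(params, xb, x1, x2, lam)
        params = jax.tree_util.tree_map(lambda p, g: p + lr * g, params, grads)  # ascent step
    return params
```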

Key Experimental Results

Main Results

On three medium-scale UCI datasets (Adult, Bank, Credit), the Pareto frontier of \(\Delta\) vs. Cost is compared across methods. FMC-EM is competitive with baselines including SFC, VFC, and FCA, and substantially outperforms them on the Credit dataset.

Large-Scale Experiment (Census, 2.45M samples)

Method                 Time (sec)   Feasible
SFC                    Timeout      No
VFC                    Timeout      No
FCA                    ~3000        Yes
FMC (5% subsample)     ~60          Yes

On the Census dataset, FMC is approximately 50× faster than FCA, the fastest baseline, while achieving comparable fairness and clustering cost.

Key Findings

  • FMC-EM outperforms FMC-GD, yielding a better Pareto frontier with lower variance.
  • Subsampling to 5% incurs negligible performance degradation — reducing the subsampling ratio from 100% to 5% causes only marginal changes in \(\Delta\) and Cost.
  • FMC extends naturally to categorical data by replacing the Gaussian with a multinomial distribution, a capability unavailable to existing K-means-based methods (a brief sketch follows this list).
  • For out-of-sample assignment, FMC's learned parametric mapping can be applied directly to test data, whereas existing methods require re-optimization.
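To illustrate the categorical extension mentioned above, the Gaussian component density can be swapped for a categorical one while the assignment mapping, Gap penalty, and training loop stay unchanged. The snippet below is a minimal sketch assuming a single one-hot-encoded categorical feature; it is not the paper's parameterization.

```python
def categorical_logpdf(X_onehot, cat_logits):
    """Row-wise log-density log f(x; theta) of one categorical feature with V categories.
    X_onehot is N x V one-hot data; cat_logits are the component's unconstrained scores.
    For several categorical features, sum one such term per feature."""
    return X_onehot @ jax.nn.log_softmax(cat_logits)
```

Plugging this in place of `gaussian_logpdf` inside `mixture_component_ll`, with one logit vector per component as the component parameters, would yield the categorical variant.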

Highlights & Insights

  • The conceptual shift from assignment optimization to parameter optimization is elegant — it simultaneously resolves scalability, mini-batch training, and out-of-sample assignment in a unified framework.
  • The theoretical guarantee for subsampled \(\Delta\) (Proposition 1) provides the first principled justification for mini-batch learning in fair clustering.
  • Extension to categorical data is a unique advantage not covered by any existing fair clustering method.

Limitations & Future Work

  • The Gaussian mixture model assumes data follow a specific parametric distribution, which may be inadequate for complex nonlinear manifold structures.
  • The trade-off between fairness and clustering cost requires manual tuning of \(\lambda\).
  • The primary analysis is restricted to binary sensitive attributes; extension to multiple groups is discussed but receives limited experimental validation.
  • Integration with deep clustering or contrastive learning methods has not been explored.

Comparison with Baselines

  • vs. SFC (fairlet): SFC preprocesses data via fairlet decomposition, incurring high computational cost and timing out on large-scale data; FMC's parametric approach is naturally scalable.
  • vs. VFC: VFC imposes variational fairness constraints via KL divergence and similarly does not support mini-batch learning; FMC's subsampling strategy addresses this limitation.
  • vs. FCA: FCA is the most recent state-of-the-art method; results on medium-scale data are mixed, but FMC is substantially faster than FCA at large scale.

Rating

  • Novelty: ⭐⭐⭐⭐ — The perspective of applying mixture models to fair clustering is novel and directly addresses the core scalability bottleneck.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Four datasets, large-scale validation, and subsampling analysis collectively provide strong empirical evidence.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are rigorous and the problem motivation is clearly articulated.
  • Value: ⭐⭐⭐⭐ — As the first scalable probabilistic fair clustering method, it holds high practical value.