
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance

Conference: CVPR 2026 arXiv: 2603.10341 Code: chenchenzong/FairFAL Area: AI Safety / Federated Learning Keywords: Federated Active Learning, Non-IID, Class Imbalance, active learning, Long-Tailed Distribution

TL;DR

This paper systematically investigates the query model selection problem in federated active learning (FAL), identifies class-balanced sampling as the key performance factor, and proposes FairFAL — a framework achieving fair and efficient FAL via adaptive model selection, prototype-guided pseudo-labeling, and uncertainty-diversity balanced sampling.

Background & Motivation

Federated active learning (FAL) combines the privacy guarantees of federated learning with the label efficiency of active learning, yet faces two severely underexplored challenges in realistic deployments:

Global class imbalance: Real-world federated systems typically exhibit long-tailed global distributions, where rare but critical classes appear sparsely across clients.

Extreme client heterogeneity: Data distributions vary drastically across clients (extreme Non-IID).

Existing FAL methods (e.g., LoGo, KAFAL, IFAL) have begun addressing Non-IID settings, but generally treat heterogeneity only as a data partitioning problem, implicitly assuming a relatively balanced global label distribution. Under long-tailed global distributions, existing acquisition strategies struggle to capture minority-class samples, leading to wasteful annotation budgets.

This paper raises a fundamental question: In FAL, which model — global or local — is better suited as the query selector, and how does this relate to class-balanced sampling?

Method

Overall Architecture

FairFAL is built upon three empirical observations and comprises three corresponding core components:

  • Observation 1: For uncertainty sampling, the local model generally outperforms the global model except when the global distribution is severely imbalanced and clients are approximately homogeneous → Adaptive Model Selection
  • Observation 2: Regardless of which model is used, better class-balanced sampling (especially minority-class acquisition) consistently leads to higher final performance → Class-Aware Sampling
  • Observation 3: For diversity sampling, the global model consistently outperforms the local model across all settings → Global Feature-Guided Diversity

Key Designs

  1. Adaptive Model Selection: The method estimates the degree of global imbalance and local-global distribution divergence via lightweight prediction discrepancy, then adaptively selects the query model.

Global class imbalance estimation: For each client, a class-balanced subset \(\mathcal{B}^{(k)}\) is constructed, and the global model's predicted prior \(\hat{\boldsymbol{\pi}}_g^{(k)}\) is used to estimate the imbalance ratio: \(\gamma_k = \frac{\min_{c \in \mathcal{C}_k^+} \hat{\pi}_{g,c}}{\max_{c \in \mathcal{C}_k^+} \hat{\pi}_{g,c}} \in (0,1]\) Each client uploads the scalar \(\gamma_k\); the server averages them to obtain the global coefficient \(\bar{\gamma}\) (computed only in the first round).
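
A minimal NumPy sketch of the client-side estimate (function names and the server-side averaging helper are illustrative, not from the released code):

```python
import numpy as np

def imbalance_ratio(global_probs: np.ndarray, present_classes: np.ndarray) -> float:
    """Estimate gamma_k for one client: ratio of the smallest to largest
    predicted class prior of the global model on the class-balanced
    local subset B^(k), restricted to classes present on this client."""
    pi = global_probs[present_classes]
    return float(pi.min() / pi.max())  # gamma_k in (0, 1]

def global_coefficient(gammas: list[float]) -> float:
    """Server side: average the uploaded scalars (first round only)."""
    return float(np.mean(gammas))
```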

Local-global distribution divergence estimation: \(d_k = \frac{1}{C}\sum_{c=1}^{C}\frac{|\hat{\pi}_{g,c} - \hat{\pi}_{\ell,c}^{(k)}|}{\hat{\pi}_{g,c} + \hat{\pi}_{\ell,c}^{(k)}}\)

Model selection score: \(s_k = 1 - \frac{1}{2}(d_k + \bar{\gamma})\). The global model is selected when \(s_k > \delta = 0.75\); otherwise the local model is used. Intuitively, when the global distribution is severely imbalanced (small \(\bar{\gamma}\)) and the local distribution closely mirrors the global one (small \(d_k\)), \(s_k\) is large and the global model is preferred, matching Observation 1.
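
Putting the divergence and the selection score together, a minimal sketch (illustrative function names; `DELTA` follows the paper's \(\delta = 0.75\)):

```python
import numpy as np

DELTA = 0.75  # selection threshold delta from the paper

def divergence(pi_g: np.ndarray, pi_l: np.ndarray) -> float:
    """Local-global distribution divergence d_k: mean per-class
    normalized absolute difference of predicted priors."""
    return float(np.mean(np.abs(pi_g - pi_l) / (pi_g + pi_l)))

def select_query_model(pi_g, pi_l, gamma_bar: float) -> str:
    """Return which model this client should use as the query selector."""
    d_k = divergence(np.asarray(pi_g), np.asarray(pi_l))
    s_k = 1.0 - 0.5 * (d_k + gamma_bar)
    return "global" if s_k > DELTA else "local"
```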

  2. Prototype-Guided Pseudo-Labeling: Class prototypes are constructed from global model features to provide more reliable class assignments, overcoming classifier bias induced by imbalanced data.

Class prototype: \(\boldsymbol{\mu}_c^{(k)} = \frac{1}{|\mathcal{D}_{L,c}^{(k)}|}\sum_{y_i=c} \mathbf{z}_i^{(k)}\), where \(\mathbf{z}_i^{(k)} = \frac{\phi^g(x_i)}{\|\phi^g(x_i)\|_2}\) denotes the \(\ell_2\)-normalized feature from the global model.

Pseudo-labels are assigned via cosine similarity: \(\hat{y}^{(k)}(x) = \arg\max_c \langle \mathbf{z}^{(k)}(x), \boldsymbol{\mu}_c^{(k)} \rangle\)

The unlabeled pool is partitioned into per-class subsets \(\widetilde{\mathcal{D}}_{U,c}^{(k)}\) based on pseudo-labels, forming the foundation for subsequent class-aware sampling.
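
The prototype construction and cosine-similarity assignment above can be sketched as follows (a minimal illustration; names are not from the released code):

```python
import numpy as np

def class_prototypes(feats: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Per-class mean of l2-normalized global-model features (mu_c)."""
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        protos[c] = z[labels == c].mean(axis=0)
    return protos

def pseudo_labels(unlabeled_feats: np.ndarray, protos: np.ndarray) -> np.ndarray:
    """Assign each unlabeled sample to the prototype with the highest
    inner product of normalized feature and class prototype."""
    z = unlabeled_feats / np.linalg.norm(unlabeled_feats, axis=1, keepdims=True)
    return (z @ protos.T).argmax(axis=1)
```

The per-class subsets \(\widetilde{\mathcal{D}}_{U,c}^{(k)}\) then follow directly by grouping the unlabeled pool on these pseudo-labels.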

  3. Two-Stage Balanced Sampling: Uncertainty and diversity are jointly optimized within a class-balanced framework.

Stage 1 — Per-class candidate selection: A uniform budget \(b_c^{(k)}\) is allocated per class; the top \(\kappa \cdot b_c^{(k)}\) highest-uncertainty samples form an over-complete candidate pool \(\mathcal{H}_c^{(k)}\) (\(\kappa = 4\)).

Stage 2 — Diversity refinement: \(k\)-center sampling is applied in the gradient embedding space of the global model \(\mathbf{g}^{(k)}(x) = \psi(x; \phi^g, f^g)\), minimizing the maximum distance: \(\mathcal{Q}_c^{(k)} = \arg\min_{\mathcal{Q}'} \max_{x \in \mathcal{H}_c^{(k)}} \min_{a \in \mathcal{A}_c^{(k)} \cup \mathcal{Q}'} d(\mathbf{g}^{(k)}(x), \mathbf{g}^{(k)}(a))\) A greedy \(k\)-center algorithm is used to obtain a 2-approximation solution.
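
The Stage 2 refinement is the standard greedy \(k\)-center (coreset) selection; a minimal sketch over one class's candidate pool, assuming precomputed gradient embeddings (names are illustrative):

```python
import numpy as np

def greedy_k_center(candidates: np.ndarray, labeled: np.ndarray, budget: int) -> list[int]:
    """2-approximate k-center: repeatedly pick the candidate whose distance
    to the nearest already-covered point (labeled set + picks so far) is largest.

    candidates: gradient embeddings of the candidate pool H_c^(k), shape (n, d)
    labeled:    gradient embeddings of the labeled set A_c^(k), shape (m, d)
    """
    # distance from every candidate to its nearest covered point
    d = np.linalg.norm(candidates[:, None, :] - labeled[None, :, :], axis=2).min(axis=1)
    picked = []
    for _ in range(budget):
        i = int(d.argmax())
        picked.append(i)
        # newly picked point now also covers its neighbors
        d = np.minimum(d, np.linalg.norm(candidates - candidates[i], axis=1))
    return picked
```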

Loss & Training

  • Standard federated training: FedAvg framework with local SGD
  • Each FAL round consists of a complete federated training phase followed by active querying
  • 5% of the training data is queried for annotation per round
  • The first round uses random querying; subsequent rounds apply the FairFAL strategy
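
The federated training phase is plain FedAvg, whose aggregation step is just a sample-size-weighted average of client parameters (a minimal illustration, not the paper's code):

```python
import numpy as np

def fedavg(client_weights: list[np.ndarray], client_sizes: list[int]) -> np.ndarray:
    """FedAvg aggregation: weight each client's parameters by its share
    of the total training samples, then sum."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```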

Key Experimental Results

Main Results

Datasets: FMNIST / CIFAR-10 / CIFAR-100, global imbalance ratio \(\rho = 20\), 10 clients.

CIFAR-10, final-round accuracy (%) at annotation budgets of 15–45% (α=0.1, ρ=20):

Method 15% 25% 35% 45%
Random 47.24 50.46 54.29 55.70
KAFAL 49.99 56.34 58.41 60.01
LoGo 51.56 56.35 58.30 59.68
IFAL 47.76 52.67 55.62 57.51
FairFAL 52.12 56.90 59.62 60.44

Medical datasets (α=0.1): On OctMNIST, FairFAL achieves 72.80% vs. KAFAL 70.40%; on DermaMNIST, FairFAL achieves 73.77% vs. LoGo 73.62%.

FairFAL consistently outperforms all baselines across all datasets and heterogeneity settings, with larger gains observed as task difficulty increases.

Ablation Study

Configuration (α=0.1, ρ=20) (α=100, ρ=20) Note
Adaptive model selection \(\mathcal{M}^{(k)}\) 59.33 63.65 Correct query model selected
Alternative model \(\widetilde{\mathcal{M}}^{(k)}\) 58.49 61.89 Wrong selection degrades accuracy by 0.84–1.76%
+ Class-aware sampling (Local prototypes) 59.14 63.39 Local prototype quality is lower
+ Class-aware sampling (Global prototypes) 59.95 64.02 Global prototypes more accurate (+0.63 to +0.81%)
+ Two-stage balanced sampling (κ=2) 60.61 64.60 κ=2 marginally better but difference is small
+ Two-stage balanced sampling (κ=4, Final) 60.44 64.57 Full FairFAL with a more flexible candidate pool

Key Findings

  • Generality of observations: The pattern that class-balanced sampling leads to better performance holds consistently across all experimental settings.
  • Necessity of adaptive selection: Using the "wrong" model degrades performance by 0.84–1.76% relative to the correct selection.
  • Global prototypes outperform local prototypes: Global model features yield more discriminative and globally consistent representations.
  • Validation on medical data: FairFAL achieves the best performance on OctMNIST and DermaMNIST (naturally long-tailed), attaining 72.80% vs. 70.40%.
  • Collapse of existing methods: Under α=100 (near-homogeneous clients), methods lacking explicit class-balancing mechanisms (e.g., IFAL) perform even worse than random sampling.

Highlights & Insights

  1. Systematic empirical study: This is the first work to systematically investigate global vs. local query model selection in FAL, presenting three valuable observations validated via rigorous statistical testing (Wilcoxon test + Hodges-Lehmann estimator) rather than simple mean comparisons.
  2. Observation-driven design: Each component has a clear empirical motivation with transparent design rationale.
  3. Privacy preservation: Adaptive model selection only requires uploading the scalar \(\gamma_k\), introducing no additional privacy leakage.
  4. Practical modularity: The framework is modular with composable components; the \(\kappa\) hyperparameter exhibits low sensitivity.
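
The statistical validation mentioned in point 1 (Wilcoxon signed-rank test plus a Hodges-Lehmann shift estimate) can be sketched with SciPy; the `diffs` values below are purely illustrative, not numbers from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

def hodges_lehmann(diffs: np.ndarray) -> float:
    """One-sample Hodges-Lehmann estimator: median of pairwise Walsh averages."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return float(np.median(walsh))

# Paired per-setting accuracy differences (method A minus method B); illustrative only.
diffs = np.array([0.9, 1.2, 0.5, 1.6, 0.8, 1.1, 0.3, 1.4])
stat, p = wilcoxon(diffs)
print(f"Wilcoxon p={p:.4f}, HL shift={hodges_lehmann(diffs):.3f}")
```

Reporting the HL shift alongside the p-value gives an effect-size estimate that is robust to outliers, which is why it is preferable to simple mean comparisons.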

Limitations & Future Work

  1. Fixed threshold \(\delta = 0.75\): This may lack flexibility for specific scenarios; adaptive adjustment warrants investigation.
  2. First-round assumption: The method assumes the first-round randomly queried labeled set approximates IID, which may not hold under extreme imbalance.
  3. Classification-only validation: Performance on more complex tasks such as detection and segmentation remains unexplored.
  4. Client scale: Only the 10-client configuration is tested.
  5. Class count limitation: CIFAR-100 covers only 100 classes; performance under very large label spaces (e.g., ImageNet-21k) is not verified.
Comparison with Related Methods

  • BADGE: A classic two-stage uncertainty-diversity sampling method; FairFAL extends this paradigm by incorporating class-aware mechanisms.
  • LoGo: A FAL method combining local clustering with global uncertainty scoring, but without consideration of global class imbalance.
  • KAFAL/IFAL: Leverage global-local prediction discrepancy to guide acquisition, but lack class-balancing designs and fail under extreme imbalance.
  • Key insight from this paper: Class balance is central to FAL performance, rather than solely pursuing uncertainty or diversity; the representational advantage of the global model can be leveraged for prototype computation.

Rating

  • Novelty: ⭐⭐⭐⭐ Empirical observations are substantive; method design follows clear theoretical logic.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Five datasets, multiple configurations, statistical testing, and complete ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Excellent structure: observation → design → validation, with fluent exposition.
  • Value: ⭐⭐⭐⭐ Fills a critical gap in FAL under extreme imbalance and Non-IID settings, with practical implications for real-world deployment.