
Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

Conference: ICML 2026
arXiv: 2605.00578
Code: No public link
Area: Federated Learning / Pathology / Dataset Distillation
Keywords: WSI, Multiple Instance Learning, Gaussian Mixture, Federated Distillation, Curriculum Learning

TL;DR

This paper proposes FedHD: in heterogeneous federated pathology scenarios, it performs "one-to-one" WSI feature-level distillation via Gaussian-mixture feature alignment, then gradually injects cross-institution synthetic features into local training through curriculum learning. This enables institutions to collaborate without sharing raw data or exchanging model parameters, and is compatible with heterogeneous MIL architectures and feature extractors. FedHD comprehensively outperforms existing federated and distillation baselines on TCGA-IDH / CAMELYON16 / CAMELYON17.

Background & Motivation

Background: Cancer diagnosis on WSI (whole slide images, typically gigapixel-scale) relies on MIL (CLAM, TransMIL, ACMIL, etc.), but single-center data is scarce and privacy regulations restrict cross-institution sharing. Federated learning is a natural solution. However, real hospitals differ greatly in computational resources and modeling preferences and often use different feature extractors (ResNet50/UNI/PhV2) and MIL architectures, so their parameter spaces cannot be aligned and traditional parameter averaging (FedAvg, FedMut, FedImpro) does not apply.

Limitations of Prior Work: FedDD (federated data distillation) shares synthetic datasets instead of parameters to sidestep parameter incompatibility, but existing methods are designed for natural images and fall short on WSIs in two ways: (a) the single Gaussian/mean matching assumption fails to capture the multi-component distribution of patch features within a WSI (coexistence of different morphological components); (b) extreme compression, which reduces thousands of patches to a few synthetic images, over-compresses WSI datasets that are already small in sample size and highly heterogeneous between slides, losing fine-grained diagnostic cues.

Key Challenge: On WSI, the combination of "small sample size + high intra-class heterogeneity + client model heterogeneity" renders traditional DD's "extreme compression + single-component matching" assumptions invalid, and parameter-sharing methods like FedAvg are not applicable.

Goal: (1) Enable each client to independently generate synthetic features that retain diagnostic details and can be utilized by any MIL architecture; (2) Avoid domain shift caused by direct concatenation during cross-institution integration; (3) Ensure interpretability (crucial in medical scenarios).

Key Insight: Start from patch-level embeddings rather than pixel-level—this fits the MIL pipeline and reduces the distillation dimension from \(256\times 256\times 3\) to \(\mathbb{R}^d\); introduce "one-to-one" slide-level synthesis (each real slide corresponds to one synthetic slide) instead of "many-to-one" aggregation, preserving slide-level diversity.

Core Idea: Model WSI patch features as a 16-component GMM, align the mean and covariance of each component in the synthetic set (rather than a single global mean), and perform one-to-one distillation at the slide level. In the federated phase, use curriculum learning—let the local model converge on real data first, then gradually add synthetic features from other clients as auxiliary supervision.

Method

Overall Architecture

FedHD runs in two stages: "local distillation + curriculum federation". (i) On each client \(c\), for each real slide \(x_i^{(c)}\) (with \(K\) patch embeddings \(b_k^{i,c}\in\mathbb{R}^d\)), a GMM models the real distribution \(P_\text{real}^{(c,i)}\approx \sum_m \pi_m \mathcal{N}(\mu_m^{(c,i)},\Sigma_m^{(c,i)})\), and a synthetic slide \(h_i^{(c)}\) (with \(T\) learnable patch embeddings) is optimized so that its per-component means and covariances match those of the real GMM (squared \(\ell_2\) distance on means, squared Frobenius distance on covariances); (ii) Clients upload \(\{h_i^{(c)}\}\) to the server, which, for each client \(c\), pools the synthetic slides of all other clients into \(\mathcal{H}_\text{global}^{(c)}\) and sends that set to \(c\); (iii) Clients first train local MIL models on real data only, then after \(t_0\) epochs gradually incorporate \(\mathcal{H}_\text{global}^{(c)}\) for joint training under the noise-robust GCE loss; an optional FastGAN generator can invert synthetic embeddings to pseudo-patches for visualization.
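
Step (ii) above, the server-side pooling, can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `build_global_sets`, the `(features, label)` representation of a synthetic slide, and the client identifiers are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Assumed representation: a synthetic slide is (T x d patch embeddings, slide label).
SyntheticSlide = Tuple[List[List[float]], int]


def build_global_sets(
    uploads: Dict[str, List[SyntheticSlide]]
) -> Dict[str, List[SyntheticSlide]]:
    """For each client c, pool the synthetic slides uploaded by every other client."""
    global_sets: Dict[str, List[SyntheticSlide]] = {}
    for client in uploads:
        pooled: List[SyntheticSlide] = []
        for other, slides in uploads.items():
            if other != client:  # a client never receives its own slides back
                pooled.extend(slides)
        global_sets[client] = pooled
    return global_sets


# Toy uploads with d = 2 and a single patch per slide (hypothetical values).
uploads = {
    "C1": [([[0.1, 0.2]], 0), ([[0.3, 0.4]], 1)],
    "C2": [([[0.5, 0.6]], 1)],
    "C3": [([[0.7, 0.8]], 0)],
}
global_sets = build_global_sets(uploads)
print({c: len(s) for c, s in global_sets.items()})  # {'C1': 2, 'C2': 3, 'C3': 3}
```

Because only embeddings travel and every client keeps its own MIL model, nothing in this step depends on the clients sharing an architecture.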

Key Designs

  1. Gaussian-Mixture Feature Alignment (replacing single mean matching):

    • Function: Captures the complex distribution of multiple morphological components (tumor/normal/boundary regions, etc.) within a WSI, avoiding the "gray average" effect of single mean matching on heterogeneous patches.
    • Mechanism: Use GMM to estimate \(M=16\) components \(\{\mu_m,\Sigma_m,\pi_m\}\) on each real slide's patch features \(\{b_k^{i,c}\}_{k=1}^K\); assign synthetic slide patches \(\{p_j^{i,c}\}_{j=1}^T\) to components via the same GMM, obtaining \(\{\hat{\mu}_m,\hat{\Sigma}_m\}\); the loss \(\mathcal{L}_\text{align}^{(c)}=\sum_m(\|\mu_m-\hat{\mu}_m\|_2^2+\|\Sigma_m-\hat{\Sigma}_m\|_F^2)\) aligns both means and covariances.
    • Design Motivation: Previous DD methods use \(\sum_y \|\Phi_{T_y}-\Phi_{S_y}\|^2\), assuming a single Gaussian/center; empirical evidence shows WSI patches are multi-modal, and single-center matching smooths out diagnostically critical minority components (e.g., tumor patches), harming downstream MIL classification.
  2. One-to-One Slide-Level Distillation:

    • Function: Each real slide corresponds to one synthetic slide, avoiding over-compression from aggregating multiple slides into a few, and preserving diagnostic diversity.
    • Mechanism: Client \(c\) maintains \(N\) synthetic slides \(h_i^{(c)}\) (\(N\) = number of local real slides); each synthetic slide contains \(T=1000\) patch embeddings, aligned with its real counterpart. The upload payload is \(O(NTd)\) floats, matching the scale of transmitting full patch features but without sharing actual patches.
    • Design Motivation: Extreme compression (IPC=1/10/50) is feasible for natural images due to relative intra-class homogeneity; WSI datasets are small (hundreds of cases) and highly heterogeneous between slides, so compressing them further distorts the synthetic set across the board.
  3. Curriculum-based Federation:

    • Function: Allows the local model to converge robustly before gradually introducing external synthetic data, preventing early-stage noise-induced drift.
    • Mechanism: Local total loss \(\mathcal{L}_\text{local}^{(c)} = \mathcal{L}_\text{real}^{(c)} + \mathcal{L}_\text{GCE}^{(c)}\cdot \mathbb{I}(t\geq t_0)\); for the first \(t_0=30\) epochs, only real data is used; afterwards, synthetic data is added and Generalized Cross-Entropy \(\mathcal{L}_\text{GCE}=\frac{1-p_y^q}{q}\) (\(q=0.7\)) replaces standard CE to suppress potential label noise.
    • Design Motivation: Directly mixing cross-institution synthetic data introduces distribution shift; curriculum ensures the model has a "solid foundation" before absorbing external knowledge, akin to prerequisites in education; GCE interpolates between standard CE (recovered as \(q\to 0\)) and the MAE-style loss \(1-p_y\) (at \(q=1\)), so \(q=0.7\) keeps much of MAE's robustness to label noise while remaining easier to optimize.
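
The alignment loss in design 1 can be illustrated with a small NumPy sketch. It is written under several assumptions that are not stated in the summary: the real-slide GMM parameters are taken as given (they could come from, for example, scikit-learn's `GaussianMixture`), synthetic patches are hard-assigned to the most responsible component, components with fewer than two assigned patches are skipped, and the demo uses 2 components in 2 dimensions instead of \(M=16\). All function names are hypothetical.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log-density of each row of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T   # cov^{-1} (x - mean)
    maha = np.sum(diff * solved, axis=1)      # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def assign(x, weights, means, covs):
    """Hard-assign each patch to its most responsible real-GMM component."""
    scores = np.stack(
        [np.log(w) + log_gaussian(x, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    return scores.argmax(axis=1)


def alignment_loss(synthetic, weights, means, covs):
    """Sum over components of ||mu - mu_hat||_2^2 + ||Sigma - Sigma_hat||_F^2."""
    labels = assign(synthetic, weights, means, covs)
    loss = 0.0
    for m in range(len(weights)):
        members = synthetic[labels == m]
        if len(members) < 2:  # assumption: skip components with too few patches
            continue
        mu_hat = members.mean(axis=0)
        sigma_hat = np.cov(members, rowvar=False)
        loss += np.sum((means[m] - mu_hat) ** 2) + np.sum((covs[m] - sigma_hat) ** 2)
    return float(loss)


# Toy "real slide": two well-separated morphological clusters in d = 2.
rng = np.random.default_rng(0)
a = rng.normal(loc=[-5.0, -5.0], scale=1.0, size=(200, 2))
b = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2))
real = np.concatenate([a, b])
weights = [0.5, 0.5]
means = [a.mean(axis=0), b.mean(axis=0)]
covs = [np.cov(a, rowvar=False), np.cov(b, rowvar=False)]

loss_same = alignment_loss(real, weights, means, covs)           # ~0: already aligned
loss_shifted = alignment_loss(real + 1.0, weights, means, covs)  # ~4: each mean is off by (1, 1)
print(loss_same, loss_shifted)
```

In an actual distillation loop the synthetic embeddings would be trainable tensors in an autodiff framework, with the loss minimized by gradient descent; the NumPy version only evaluates the objective.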

Loss & Training

Local distillation: 1000 iterations; single round of federated communication; local MIL training: 50 epochs; GMM components \(M=16\) (per Song 2024); synthetic patch count \(T=1000\); GCE parameter \(q=0.7\); curriculum threshold \(t_0=30\); optional FastGAN generator with joint \(\mathcal{L}_\text{GAN}^{(c)}+\lambda_\text{rec}\mathcal{L}_\text{rec}^{(c)}\) for visualization.
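
The curriculum-gated objective with \(t_0=30\) and \(q=0.7\) can be sketched in plain Python. The sketch assumes that \(\mathcal{L}_\text{real}\) is standard cross-entropy and that each term is averaged over its batch; neither detail is confirmed by the summary, and the function names are illustrative.

```python
import math


def gce(p_y: float, q: float = 0.7) -> float:
    """Generalized Cross-Entropy (1 - p_y^q) / q for the true-class probability p_y."""
    return (1.0 - p_y ** q) / q


def cross_entropy(p_y: float) -> float:
    """Standard cross-entropy for the true-class probability p_y."""
    return -math.log(p_y)


def local_loss(real_probs, synth_probs, epoch: int, t0: int = 30, q: float = 0.7) -> float:
    """L_real + L_GCE * 1[epoch >= t0]; batch averaging is an assumption."""
    loss = sum(cross_entropy(p) for p in real_probs) / len(real_probs)
    if epoch >= t0 and synth_probs:
        # Synthetic slides from other clients enter only after the warm-up.
        loss += sum(gce(p, q) for p in synth_probs) / len(synth_probs)
    return loss


real_probs = [0.9, 0.8]   # hypothetical true-class probabilities on real slides
synth_probs = [0.6, 0.2]  # hypothetical true-class probabilities on synthetic slides
warmup = local_loss(real_probs, synth_probs, epoch=10)
joint = local_loss(real_probs, synth_probs, epoch=40)
print(warmup, joint)
```

Evaluating `gce` at the two ends of its range shows the interpolation: `gce(0.5, q=1.0)` equals `1 - 0.5`, the MAE-style loss, while `gce(0.5, q=1e-6)` is numerically close to `-log(0.5)`, the cross-entropy value.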

Key Experimental Results

Main Results

| Dataset | Client/setting | FedHE | DESA | FedDGM | HistoFS | FedWSIDD | FedHD |
|---|---|---|---|---|---|---|---|
| CAM16 | C1 [R50+CLAM] Acc | 72.7 | 77.0 | 77.0 | 82.4 | 83.7 | 85.1 |
| CAM16 | C2 [UNI+TransMIL] Acc | 77.7 | 86.2 | 87.8 | 91.3 | 93.2 | 95.8 |
| CAM16 | Avg Acc | 75.2 | 81.9 | 83.4 | 86.7 | 88.7 | 91.2 |
| CAM17 | C1 [UNI+CLAM] Acc | 72.3 | 72.3 | 74.3 | 75.9 | 77.3 | 83.6 |
| CAM17 | C3 [R50+ACMIL] Acc | 77.0 | 78.0 | 79.0 | 79.0 | 79.0 | 84.0 |

CAM17 C4 [PhV2+TrMIL] Acc 73.7 78.3 79.9 82.3

(FedHD achieves the best Acc / MCC across all clients and combinations of heterogeneous feature extractors and MIL architectures; improvements are especially notable for each [feature, MIL] pairing on CAM17.)

Ablation Study

| Configuration | Function | Description |
|---|---|---|
| Single Gaussian (M=1) vs GMM (M=16) | M=16 significantly outperforms single mean | Validates necessity of multi-component modeling |
| One-to-one vs Many-to-one compression | One-to-one preserves diagnostic diversity | Over-compression in highly heterogeneous WSI leads to performance drop |
| No curriculum (direct mix) vs curriculum \(t_0=30\) | Curriculum introduces synthetic data later | Prevents early-stage drift from external noise |
| CE vs GCE (\(q=0.7\)) | GCE improves robustness | Suppresses potential label noise in synthetic data |
| Communication payload \(O(NTd)\) | Single round communication suffices | Lower communication cost than iterative FedAvg |
| FastGAN decoding | Interpretable pseudo-patches | Meets medical audit requirements |

Key Findings

  • Multi-component matching is critical: Single mean matching (FedWSIDD and other baselines) leads to significant performance drops on WSI; FedHD's GMM matches both mean and covariance, offering clear protection for minority components like tumors.
  • Architecture-agnostic collaboration: Traditional FedAvg fails completely under heterogeneous [R50+CLAM], [UNI+TransMIL], [PhV2+TrMIL] combinations; FedHD's feature-level distillation bypasses parameter space incompatibility.
  • Curriculum is more stable than direct mix: Directly introducing external synthetic data from epoch 0 degrades performance for some clients (e.g., CAM17 C3 with extreme class imbalance); \(t_0=30\) warm-up provides a stable foundation.
  • Interpretability module's clinical value: FastGAN-inverted pseudo-patches enable manual review by clinicians, addressing black-box concerns—a key gap for medical deployment.

Highlights & Insights

  • Replacing single mean with GMM is a design highly aligned with WSI morphological characteristics, hard-coding "morphological multi-component" domain knowledge into the DD loss—a good example of domain-aware distillation.
  • The "against the trend" choice of one-to-one slide-level distillation (explicitly not pursuing extreme compression) reflects a clear understanding of WSI data properties—not all domains are suitable for IPC=1.
  • The application of curriculum learning to "cross-client synthetic data integration" is inspiring: it can be generalized to any "self-distillation → federated integration" process, such as federated language models or recommender systems.
  • Single-round communication and feature-level payload design are highly practical for hospitals with low bandwidth and strict compliance auditing; combined with GCE noise robustness and FastGAN visualization, the engineering completeness is high.

Limitations & Future Work

  • The GMM component number \(M=16\) and synthetic patch count \(T=1000\) are empirically chosen and may not be optimal for all WSI datasets; automatic selection of \(M\) or Bayesian nonparametrics (DPGMM) is a natural direction.
  • Single-round communication is simple but may not converge to the optimum—multi-round iterative distillation could be better, but is not discussed by the authors.
  • Few clients (only 2 in CAM16, 5 in CAM17) make curriculum threshold \(t_0\) tuning easier; scalability to dozens or hundreds of clients remains untested.
  • GMM covariance \(\Sigma_m\) incurs \(O(d^2)\) computation/storage cost in high dimensions (e.g., UNI 1024d); the paper does not analyze overhead for large \(d\).
  • The credibility of FastGAN-decoded pseudo-patches compared to real tissue requires systematic evaluation by pathologists; currently only visual plausibility is shown, with no blinded assessment.

Related Work

  • vs FedAvg / FedMut / FedImpro: Traditional parameter-sharing methods are infeasible for heterogeneous MIL architectures; this work bypasses the limitation via feature-level distillation.
  • vs FedHisto (Lu 2022) / HistoFS (Raswa 2025): These assume homogeneous MIL and balanced computation; this work explicitly targets heterogeneous scenarios, better reflecting real hospital networks.
  • vs FedWSIDD (Jin 2025): Also performs federated WSI distillation but uses single mean matching; this work demonstrates that such simplification severely loses diagnostic detail on WSI.
  • vs FedD3 (Song 2023) / FedDGM (Jia 2025): These focus on personalized FL (disentangled dual decoder / diffusion model latent generation), with high computational cost; FedHD is lighter and architecture-agnostic.
  • vs Natural Image DD (DM, MTT): This work advocates "do not pursue extreme compression"—an observation also valuable for other "small data + high heterogeneity" domains (rare diseases, satellite remote sensing).
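
The two cost terms raised under Limitations, the \(O(NTd)\) upload and the \(O(d^2)\) covariance per GMM component, can be made concrete with a back-of-envelope calculation. The values \(T=1000\), \(M=16\) and \(d=1024\) come from the text; the slide count of 200 and the float32 storage format are assumptions chosen only for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage format


def upload_bytes(n_slides: int, t_patches: int, dim: int) -> int:
    """Size of one client's synthetic-slide upload: N * T * d floats."""
    return n_slides * t_patches * dim * BYTES_PER_FLOAT32


def gmm_covariance_bytes(n_components: int, dim: int) -> int:
    """Size of the full covariance matrices of one slide's GMM: M * d * d floats."""
    return n_components * dim * dim * BYTES_PER_FLOAT32


payload = upload_bytes(n_slides=200, t_patches=1000, dim=1024)  # N = 200 is hypothetical
cov = gmm_covariance_bytes(n_components=16, dim=1024)
print(f"upload: {payload / 1e9:.2f} GB")                # upload: 0.82 GB
print(f"covariances per slide: {cov / 2**20:.0f} MiB")  # covariances per slide: 64 MiB
```

Under these assumptions the one-off upload stays below a gigabyte, whereas full covariances cost 64 MiB per slide during distillation, which is where diagonal or low-rank covariance approximations would matter most.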

Rating

  • Novelty: ⭐⭐⭐⭐ GMM multi-component alignment + one-to-one distillation + curriculum federation, a highly targeted combination for heterogeneous WSI FL scenarios
  • Experimental Thoroughness: ⭐⭐⭐⭐ 3 datasets × multiple clients × heterogeneous [feature, MIL] pairings, with standardized statistical significance reporting
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation-method-experiment logic, with well-presented loss functions and hyperparameter tables
  • Value: ⭐⭐⭐⭐ Directly addresses the "heterogeneous architecture + privacy + interpretability" trilemma in medical federated deployment, with strong engineering practicality