Data Heterogeneity and Forgotten Labels in Split Federated Learning¶
Conference: AAAI 2026 arXiv: 2511.09736 Code: GitHub Area: Optimization Keywords: Split Federated Learning, Catastrophic Forgetting, Data Heterogeneity, Multi-head, Processing Order
TL;DR¶
This paper systematically investigates catastrophic forgetting (CF) caused by data heterogeneity in Split Federated Learning — with particular focus on intra-round forgetting induced by the server-side processing order — and proposes Hydra, a multi-head method that partitions and trains the final layers of part-2 in groups before aggregation, significantly reducing the performance gap (PG) across labels by up to 75.4%.
Background & Motivation¶
State of the Field¶
Background: Split Federated Learning (SFL/SplitFedv2) divides the model into part-1 (client-side) and part-2 (server-side). Part-1 is trained in parallel across clients and aggregated as in standard FL; part-2 is trained on the server by sequentially processing each client's activations.
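The round structure above can be sketched in a few lines. This is a toy numpy illustration with linear layers and an MSE loss; all shapes, learning rates, and training details are illustrative assumptions, not the paper's setup. The key structural point it shows is that part-1 has one copy per client (trained in parallel, then averaged), while part-2 is a single server copy updated sequentially, so the client order matters.

```python
import numpy as np

def server_step(part2_w, act, y, lr=0.1):
    """Sequentially update the single server-side part-2 on one client's
    activations; returns new weights and the gradient w.r.t. the activations."""
    err = (act @ part2_w - y) / len(y)
    new_w = part2_w - lr * (act.T @ err)
    return new_w, err @ part2_w.T            # gradient flows back to the client

def splitfedv2_round(part1_ws, part2_w, data, lr=0.1):
    """One SplitFedv2 round (toy sketch).
    part-1: one copy per client, updated locally, then FedAvg-aggregated.
    part-2: one server copy, updated sequentially -- processing order matters."""
    new_p1 = []
    for w1, (x, y) in zip(part1_ws, data):
        act = x @ w1                              # client forward ("smashed data")
        part2_w, g_act = server_step(part2_w, act, y, lr)
        new_p1.append(w1 - lr * (x.T @ g_act))    # client backward on part-1
    return np.mean(new_p1, axis=0), part2_w       # FedAvg on part-1 only
```

Because `part2_w` is threaded through the loop, labels seen by the last-processed client dominate its final state, which is exactly the intra-round forgetting the paper studies.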
The authors identify two forms of CF in SFL: (1) inter-round forgetting in part-1, which mirrors standard FL model drift caused by aggregation; and (2) intra-round forgetting (intraCF) in part-2, caused by sequential server-side processing — analogous to continual learning — where the model tends to retain knowledge of labels associated with the last-processed client while forgetting earlier ones. Experiments on CIFAR-10 under non-IID settings show that the accuracy of the last-processed labels far exceeds that of the first-processed ones, with overall accuracy dropping to only 43% (vs. 80% under IID).
Existing CF mitigation methods from CL and FL (e.g., EWC, Scaffold) cannot be directly applied to SFL due to incompatible assumptions: EWC assumes sequential data streams, while Scaffold requires access to the full model.
Starting Point¶
Goal: How can intra-round catastrophic forgetting caused by sequential server-side processing in SFL be characterized and mitigated? How do the cut layer choice and processing order affect the degree of forgetting?
Method¶
Overall Architecture¶
Hydra introduces a multi-head architecture on the server side of SFL: part-2 is further divided into a shared part-2a and multiple heads (part-2b). Part-2a is updated sequentially (preserving the original SFL behavior), while each head is assigned to a group of clients with similar data distributions and updated in parallel, with aggregation at the end of each round.
Key Designs¶
Client Grouping: Prior to training, each client transmits its label distribution vector of length \(L\). A greedy algorithm assigns clients with similar label distributions to the same group. With \(G\) groups (heads) in total, each group exhibits a more uniform data distribution, reducing gradient conflicts.
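One plausible greedy assignment rule is sketched below; the paper specifies only that a greedy algorithm groups clients with similar label distributions, so the skew-first ordering, the per-group size cap, and the L1-distance-to-uniform cost are all assumptions made for illustration.

```python
import numpy as np

def greedy_group(label_dists, G):
    """Assign clients to G groups so each group's pooled label distribution
    is close to uniform (one plausible greedy rule; the paper's exact
    algorithm may differ)."""
    label_dists = np.asarray(label_dists, dtype=float)
    n, L = label_dists.shape
    uniform = np.full(L, 1.0 / L)
    groups = [[] for _ in range(G)]
    sums = np.zeros((G, L))
    cap = -(-n // G)                             # ceil(n / G) clients per group
    for c in np.argsort(-label_dists.max(axis=1)):   # most skewed clients first
        best, best_cost = None, None
        for g in range(G):
            if len(groups[g]) >= cap:
                continue
            pooled = sums[g] + label_dists[c]
            cost = np.abs(pooled / pooled.sum() - uniform).sum()
            if best is None or cost < best_cost:
                best, best_cost = g, cost
        groups[best].append(int(c))
        sums[best] += label_dists[c]
    return groups
```

On four clients with complementary skews over two labels, the rule pairs opposite-skew clients so each group's pooled distribution is near-uniform.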
Head Design: Each head corresponds to the final few layers of part-2. During the forward pass, the server routes the output of part-2a to the corresponding head according to a client-to-group mapping, and reverses this routing during backpropagation. At the end of each round, all heads are aggregated into a single model via FedAvg — unlike continual learning approaches that retain separate heads at inference time, Hydra ultimately produces a single-head model.
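The routing-then-aggregation logic can be sketched as follows. This is a toy sketch with linear heads and an MSE loss; the shapes, the frozen part-2a, and the unweighted FedAvg over heads are simplifying assumptions (in Hydra, part-2a is itself updated sequentially).

```python
import numpy as np

def hydra_round(part2a_w, head_ws, client_acts, client_to_group, lr=0.1):
    """Route each client's part-2a output to its group's head, update that
    head, then FedAvg all heads into a single model at round end (sketch)."""
    for c, (act, y) in client_acts.items():
        shared = act @ part2a_w                  # shared part-2a forward
        g = client_to_group[c]                   # client-to-group routing
        pred = shared @ head_ws[g]
        grad = shared.T @ (pred - y) / len(y)
        head_ws[g] = head_ws[g] - lr * grad      # only this group's head moves
    # Unlike CL multi-head methods, Hydra ends the round with ONE head:
    return np.mean(list(head_ws.values()), axis=0)
```

Since each head only ever sees gradients from one (near-uniform) client group, the sequential cross-label interference is confined to part-2a.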
Processing Order Analysis: The authors define two ordering strategies, cyclic order (processing clients by dominant label in a round-robin fashion) and random order, and find that:
- Cyclic order consistently outperforms random order, as structured processing yields greater stability.
- Smaller values of \(\phi\) (i.e., fewer consecutive clients from the same class) lead to better performance, confirming the importance of diversity in the training sequence.
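A cyclic order can be built by round-robin interleaving over per-label client buckets. This is a minimal sketch of that idea; the tie-handling and label ordering are assumptions, since the paper only names the strategy.

```python
from collections import defaultdict, deque

def cyclic_order(dominant_labels):
    """Interleave clients so that consecutive clients have different
    dominant labels (a sketch of the paper's cyclic order)."""
    buckets = defaultdict(deque)
    for client, lab in enumerate(dominant_labels):
        buckets[lab].append(client)
    order, labels = [], deque(sorted(buckets))
    while labels:
        lab = labels.popleft()
        order.append(buckets[lab].popleft())   # next client for this label
        if buckets[lab]:
            labels.append(lab)                 # label re-enters the rotation
    return order
```

For six clients dominated by labels 0, 0, 1, 1, 2, 2, this yields the sequence 0, 2, 4, 1, 3, 5, so no two consecutive clients share a dominant label (a small \(\phi\)).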
Cut Layer Impact: A shallow cut layer results in a larger part-2, amplifying intraCF; a deeper cut layer yields a larger part-1, leading to inter-round forgetting analogous to standard FL. Experiments confirm that intraCF is more harmful than the model drift incurred in part-1.
Forgetting Metrics¶
- Performance Gap (PG): \(\mathcal{PG}(r) = \frac{1}{L}\sum_{l=1}^{L}\max_{k}\left|\min(0, A_l^r - A_k^r)\right|\), measuring the accuracy disparity across labels.
- Backward Transfer (BW): measuring the difference between peak accuracy and final accuracy.
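Both metrics are straightforward to compute from per-label accuracies. The PG function below transcribes the formula above directly; the BW function averages the peak-minus-final drop over labels, which is an assumption about how the per-label differences are combined.

```python
import numpy as np

def performance_gap(acc):
    """PG(r) from per-label accuracies A_l^r: for each label l, the largest
    shortfall |min(0, A_l - A_k)| to any label k, averaged over labels."""
    acc = np.asarray(acc, dtype=float)
    return float(np.mean([max(abs(min(0.0, a - acc[k])) for k in range(len(acc)))
                          for a in acc]))

def backward_transfer(acc_history):
    """BW: peak accuracy minus final accuracy per label, averaged over
    labels (averaging convention assumed; lower is better)."""
    h = np.asarray(acc_history, dtype=float)   # shape: (rounds, labels)
    return float(np.mean(h.max(axis=0) - h[-1]))
```

Note that since the max over \(k\) includes the best label, PG reduces to the mean gap between each label's accuracy and the best label's accuracy.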
Key Experimental Results¶
Main Results¶
| Method | Global Acc.↑ | PG↓ | BW↓ |
|---|---|---|---|
| SFL (baseline) | 43 | 43.4 | 28 |
| SplitFedV1/FL | 55 | 15.7 | 14 |
| MergeSFL | 65 | 21 | 13 |
| MultiHead | 47 | 36.6 | 25 |
| SFL+Hydra | 66.5 | 13.4 | 9 |
(MobileNet, CIFAR-10, 80%-DL)
Hydra outcomes:
- PG reduced by up to 75.4% (ResNet101); accuracy improved by up to 112.2%.
- Effective across diverse data partitioning strategies, including Sharding and Dirichlet (\(\alpha\) = 0.1/0.3).
- Small heads (last 2 layers only) outperform larger heads, with a memory overhead of only \(G \times 6\) MB.
Highlights & Insights¶
- First systematic characterization of intra-round catastrophic forgetting in SFL, clearly distinguished from inter-round forgetting in FL.
- Exceptionally thorough experiments: 2 models × 4 datasets × multiple data partitioning schemes × multiple processing orders, with ≥10 repeated runs per configuration and median reporting.
- Hydra is elegantly designed, leveraging the natural split structure of SFL with zero additional overhead on the client side.
- Provides detailed empirical validation and comparison of CL theory applicability within the SFL setting.
Limitations & Future Work¶
- Client grouping relies on sharing label distributions prior to training, which may be restricted in privacy-sensitive scenarios.
- The number of heads \(G\) must be predefined; adaptive grouping strategies remain unexplored.
- Only classification tasks are considered; forgetting patterns in generative or regression settings may differ.
- No integration with momentum-based methods such as SMoFi is explored, though the two approaches may be complementary.
Related Work & Insights¶
- FL forgetting methods (EWC/Scaffold): Require a complete model or sequential data assumptions, rendering them incompatible with SFL.
- SplitFedV1: Avoids sequential processing by maintaining a separate model copy per client on the server, but is effectively equivalent to FL and does not exploit the split structure.
- MergeSFL: Merges features and adjusts batch size, achieving 65% accuracy but with PG remaining at 21; Hydra reduces PG to 13.4.
- Traditional multi-head (CL): Retains multiple heads for separate inference; Hydra instead aggregates all heads into a single one, better suited for the SFL setting.
Insights¶
- The discovery of intraCF provides important guidance for SFL research: server-side processing strategy must not be overlooked, and random ordering may introduce instability.
- Hydra's multi-head aggregation paradigm can be extended to personalization in heterogeneous federated learning.
- Strong complementarity with SMoFi: SMoFi addresses momentum divergence in SFLv1, while Hydra addresses sequential forgetting in SFLv2; combining the two is a promising direction.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐