Data Heterogeneity and Forgotten Labels in Split Federated Learning

Conference: AAAI 2026 arXiv: 2511.09736 Code: GitHub Area: Optimization Keywords: Split Federated Learning, Catastrophic Forgetting, Data Heterogeneity, Multi-head, Processing Order

TL;DR

This paper systematically investigates catastrophic forgetting (CF) caused by data heterogeneity in Split Federated Learning — with particular focus on intra-round forgetting induced by the server-side processing order — and proposes Hydra, a multi-head method that partitions and trains the final layers of part-2 in groups before aggregation, significantly reducing the performance gap (PG) across labels by up to 75.4%.

Background & Motivation

State of the Field

Background: Split Federated Learning (SFL/SplitFedv2) divides the model into part-1 (client-side) and part-2 (server-side). Part-1 is trained in parallel across clients and aggregated as in standard FL; part-2 is trained on the server by sequentially processing each client's activations.
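
To make the sequential server-side update concrete, here is a minimal PyTorch-style sketch of one SplitFedv2 round with toy models and random data; the module sizes, learning rates, and variable names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of one SplitFedv2 round with toy models and random data; the
# module sizes, learning rates, and variable names are illustrative assumptions.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_part1():
    # part-1: client-side front of the model, up to the cut layer
    return nn.Sequential(nn.Flatten(), nn.Linear(32, 16), nn.ReLU())

part2 = nn.Linear(16, 10)                                  # part-2: server-side back of the model
clients = [{"part1": make_part1(),
            "x": torch.randn(8, 32),                       # toy local batch
            "y": torch.randint(0, 10, (8,))} for _ in range(4)]

server_opt = torch.optim.SGD(part2.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One round: the server consumes each client's activations SEQUENTIALLY,
# which is the source of intra-round forgetting (intraCF) in part-2.
for c in clients:
    opt1 = torch.optim.SGD(c["part1"].parameters(), lr=0.1)
    opt1.zero_grad()
    acts = c["part1"](c["x"])                  # client forward pass up to the cut layer
    acts_srv = acts.detach().requires_grad_()  # activations "sent" to the server
    loss = loss_fn(part2(acts_srv), c["y"])    # server forward pass + loss
    server_opt.zero_grad()
    loss.backward()                            # grads for part-2 and for the cut-layer activations
    server_opt.step()
    acts.backward(acts_srv.grad)               # cut-layer gradient "returned" to the client
    opt1.step()                                # client-side part-1 update

# End of round: part-1 is aggregated across clients exactly as in standard FL.
avg = copy.deepcopy(clients[0]["part1"].state_dict())
for key in avg:
    avg[key] = torch.stack([c["part1"].state_dict()[key] for c in clients]).mean(dim=0)
for c in clients:
    c["part1"].load_state_dict(avg)
```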

The authors identify two forms of CF in SFL: (1) inter-round forgetting in part-1, which mirrors standard FL model drift caused by aggregation; and (2) intra-round forgetting (intraCF) in part-2, caused by sequential server-side processing — analogous to continual learning — where the model tends to retain knowledge of labels associated with the last-processed client while forgetting earlier ones. Experiments on CIFAR-10 under non-IID settings show that the accuracy of the last-processed labels far exceeds that of the first-processed ones, with overall accuracy dropping to only 43% (vs. 80% under IID).

Existing CF mitigation methods from CL and FL (e.g., EWC, Scaffold) cannot be directly applied to SFL due to incompatible assumptions: EWC assumes sequential data streams, while Scaffold requires access to the full model.

Starting Point

Goal: How can intra-round catastrophic forgetting caused by sequential server-side processing in SFL be characterized and mitigated? How do the cut layer choice and processing order affect the degree of forgetting?

Method

Overall Architecture

Hydra introduces a multi-head architecture on the server side of SFL: part-2 is further divided into a shared part-2a and multiple heads (part-2b). Part-2a is updated sequentially (preserving the original SFL behavior), while each head is assigned to a group of clients with similar data distributions and updated in parallel, with aggregation at the end of each round.

Key Designs

Client Grouping: Prior to training, each client transmits its label distribution vector of length \(L\). A greedy algorithm assigns clients with similar label distributions to the same group. With \(G\) groups (heads) in total, the clients within each group share a more homogeneous distribution, reducing gradient conflicts within a head.
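
The exact greedy criterion is not spelled out here, so the following is only a plausible sketch: each remaining client is assigned to the group whose centroid label distribution is closest in L1 distance, with a capacity cap so that all \(G\) heads receive clients; the function and variable names are illustrative.

```python
# Hedged sketch of greedy client grouping by label-distribution similarity.
# This is one plausible variant, not the paper's exact algorithm.
import numpy as np

def greedy_group(label_dists, G):
    """label_dists: (num_clients, L) array of label frequencies (rows sum to 1).
    Returns G lists of client indices."""
    n = len(label_dists)
    capacity = int(np.ceil(n / G))             # keep group sizes roughly balanced
    groups = [[g] for g in range(G)]           # seed each group with one client
    for i in range(G, n):
        best_g, best_d = None, np.inf
        for g, members in enumerate(groups):
            if len(members) >= capacity:
                continue                        # group is full
            centroid = np.asarray(label_dists)[members].mean(axis=0)
            d = np.abs(label_dists[i] - centroid).sum()   # L1 distance to the group centroid
            if d < best_d:
                best_g, best_d = g, d
        groups[best_g].append(i)
    return groups

# Toy example: 6 clients, 4 labels, 2 heads; skewed clients end up with like peers.
dists = np.array([[.7, .1, .1, .1], [.1, .1, .7, .1], [.6, .2, .1, .1],
                  [.1, .2, .6, .1], [.8, .1, .05, .05], [.05, .05, .1, .8]])
print(greedy_group(dists, G=2))   # -> [[0, 2, 4], [1, 3, 5]]
```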

Head Design: Each head corresponds to the final few layers of part-2. During the forward pass, the server routes the output of part-2a to the corresponding head according to a client-to-group mapping, and reverses this routing during backpropagation. At the end of each round, all heads are aggregated into a single model via FedAvg — unlike continual learning approaches that retain separate heads at inference time, Hydra ultimately produces a single-head model.
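
The routing and end-of-round aggregation could look roughly like the sketch below; the layer sizes, the client-to-group map, and the per-head optimizers are assumptions made for illustration, and only the overall pattern (shared part-2a updated sequentially, heads updated per group, heads FedAvg'd into a single model) follows the description above.

```python
# Rough sketch of Hydra's server side: a shared part-2a updated for every client,
# per-group heads (part-2b) updated only on their group's activations, and FedAvg
# of the heads at the end of the round. Sizes and names are illustrative.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
G = 2
part2a = nn.Sequential(nn.Linear(16, 16), nn.ReLU())      # shared trunk, trained sequentially
heads = [nn.Linear(16, 10) for _ in range(G)]              # one head per client group
client_to_group = {0: 0, 1: 0, 2: 1, 3: 1}                 # produced by the grouping step

opt_trunk = torch.optim.SGD(part2a.parameters(), lr=0.1)
opt_heads = [torch.optim.SGD(h.parameters(), lr=0.1) for h in heads]
loss_fn = nn.CrossEntropyLoss()

# One round: clients are still processed sequentially, but the forward pass is
# routed through the head of the client's group, and gradients follow the same route.
for cid in range(4):
    acts = torch.randn(8, 16)                               # stand-in for the client's activations
    labels = torch.randint(0, 10, (8,))
    g = client_to_group[cid]
    logits = heads[g](part2a(acts))
    loss = loss_fn(logits, labels)
    opt_trunk.zero_grad()
    opt_heads[g].zero_grad()
    loss.backward()
    opt_trunk.step()
    opt_heads[g].step()

# End of round: FedAvg the heads into a single head, so the final model is single-headed.
merged = copy.deepcopy(heads[0].state_dict())
for k in merged:
    merged[k] = torch.stack([h.state_dict()[k] for h in heads]).mean(dim=0)
single_head = nn.Linear(16, 10)
single_head.load_state_dict(merged)
```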

Processing Order Analysis: The authors define two ordering strategies, cyclic order (processing clients by dominant label in a cyclic fashion) and random order, and find the following (a small sketch of the cyclic construction appears after this list):

  • Cyclic order consistently outperforms random order, as structured processing yields greater stability.
  • Smaller values of \(\phi\) (i.e., fewer consecutive clients from the same dominant class) lead to better performance, confirming the importance of diversity in the training sequence.
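
As a rough illustration of the cyclic strategy, the sketch below buckets clients by dominant label and visits the buckets round-robin, which keeps consecutive clients from sharing a class (small \(\phi\)); the exact construction used in the paper may differ.

```python
# Rough illustration of a cyclic processing order: clients are bucketed by their
# dominant label and visited round-robin, so consecutive clients rarely share a
# class (small phi). The exact construction in the paper may differ.
from collections import defaultdict
from itertools import zip_longest

def cyclic_order(dominant_labels):
    """dominant_labels[i] is the dominant class of client i; returns a processing order."""
    buckets = defaultdict(list)
    for cid, lab in enumerate(dominant_labels):
        buckets[lab].append(cid)
    order = []
    for round_slice in zip_longest(*buckets.values()):      # one client per class, cycling
        order.extend(cid for cid in round_slice if cid is not None)
    return order

print(cyclic_order([0, 0, 1, 1, 2, 2]))   # -> [0, 2, 4, 1, 3, 5]
```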

Cut Layer Impact: A shallow cut layer results in a larger part-2, amplifying intraCF; a deeper cut layer yields a larger part-1, leading to inter-round forgetting analogous to standard FL. Experiments confirm that intraCF is more harmful than the model drift incurred in part-1.

Forgetting Metrics

  • Performance Gap (PG): \(\mathcal{PG}(r) = \frac{1}{L}\sum_{l=1}^{L}\max_{k}\left|\min(0, A_l^r - A_k^r)\right|\), measuring the accuracy disparity across labels (a computational sketch follows this list).
  • Backward Transfer (BW): measuring the difference between peak accuracy and final accuracy.
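
A small sketch of the PG metric as defined above, assuming per-label accuracies are already available; for the best-performing label the inner term is zero, so PG reduces to the mean gap to the top label.

```python
# Small sketch of the PG metric above, given per-label accuracies at round r.
import numpy as np

def performance_gap(per_label_acc):
    """Average, over labels, of the largest accuracy deficit relative to any other label."""
    a = np.asarray(per_label_acc, dtype=float)
    gaps = [max(abs(min(0.0, a[l] - a[k])) for k in range(len(a))) for l in range(len(a))]
    return float(np.mean(gaps))            # equivalently: np.mean(a.max() - a)

print(performance_gap([80, 75, 40, 20]))   # 26.25
```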

Key Experimental Results

Main Results

| Method | Global Acc. ↑ | PG ↓ | BW ↓ |
| --- | --- | --- | --- |
| SFL (baseline) | 43 | 43.4 | 28 |
| SplitFedV1/FL | 55 | 15.7 | 14 |
| MergeSFL | 65 | 21 | 13 |
| MultiHead | 47 | 36.6 | 25 |
| SFL + Hydra | 66.5 | 13.4 | 9 |

(MobileNet, CIFAR-10, 80%-DL)

Hydra outcomes:

  • PG reduced by up to 75.4% (ResNet-101); accuracy improved by up to 112.2%.
  • Effective across diverse data partitioning strategies, including sharding and Dirichlet (\(\alpha\) = 0.1/0.3).
  • Small heads (the last 2 layers only) outperform larger heads, with a memory overhead of only \(G \times 6\) MB.

Highlights & Insights

  • First systematic characterization of intra-round catastrophic forgetting in SFL, clearly distinguished from inter-round forgetting in FL.
  • Exceptionally thorough experiments: 2 models × 4 datasets × multiple data partitioning schemes × multiple processing orders, with ≥10 repeated runs per configuration and median reporting.
  • Hydra is elegantly designed, leveraging the natural split structure of SFL with zero additional overhead on the client side.
  • Provides detailed empirical validation and comparison of CL theory applicability within the SFL setting.

Limitations & Future Work

  • Client grouping relies on sharing label distributions prior to training, which may be restricted in privacy-sensitive scenarios.
  • The number of heads \(G\) must be predefined; adaptive grouping strategies remain unexplored.
  • Only classification tasks are considered; forgetting patterns in generative or regression settings may differ.
  • No integration with momentum-based methods such as SMoFi is explored, though the two approaches may be complementary.

Comparison with Related Methods

  • FL forgetting methods (EWC/Scaffold): Require access to the complete model or assume a sequential data stream, rendering them incompatible with SFL.
  • SplitFedV1: Avoids sequential processing by maintaining a separate model copy per client on the server, but is effectively equivalent to FL and does not exploit the split structure.
  • MergeSFL: Merges features and adjusts batch sizes, achieving 65% accuracy but with PG remaining at 21; Hydra reduces PG to 13.4.
  • Traditional multi-head (CL): Retains multiple heads for separate inference; Hydra instead aggregates all heads into a single one, better suited to the SFL setting.

Takeaways & Outlook

  • The discovery of intraCF provides important guidance for SFL research: the server-side processing strategy must not be overlooked, and random ordering may introduce instability.
  • Hydra's multi-head aggregation paradigm can be extended to personalization in heterogeneous federated learning.
  • Strong complementarity with SMoFi: SMoFi addresses momentum divergence in SFLv1, while Hydra addresses sequential forgetting in SFLv2; combining the two is a promising direction.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐