FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling¶
Conference: NeurIPS 2025 · arXiv: 2501.19122 · Code: GitHub · Area: Optimization · Keywords: Federated Learning, Dynamic Pruning, Thompson Sampling, Combinatorial Multi-Armed Bandit, Sparse Training
TL;DR¶
This paper reformulates federated dynamic pruning as a Combinatorial Multi-Armed Bandit (CMAB) problem and proposes TSAdj, a Thompson Sampling-based topology adjustment mechanism. By replacing deterministic decisions with probabilistic ones, the method obtains more robust sparse model topologies while significantly reducing communication overhead.
Background & Motivation¶
Background: Federated Learning (FL) enables distributed devices to collaboratively train models without sharing data, but the high computational and communication demands of deep models limit participation from resource-constrained devices. Existing federated pruning frameworks adopt dynamic sparse training, where an inner loop updates weights with a fixed topology and an outer loop adjusts the topology (pruning + reactivation) based on aggregated information.
Limitations of Prior Work: Existing methods (PruneFL, FedTiny, FedDST, FedMef, etc.) suffer from three critical issues:

- Greedy adjustment: only the myopic information from the small subset of clients participating in the current round is utilized, discarding all historical observations.
- Topology instability: deterministic decisions are made based on unreliable aggregated information, leading to high variance under heterogeneous data distributions.
- Communication inefficiency: uploading auxiliary data such as full-size gradients incurs communication costs approaching those of dense models.
Key Challenge: The fundamental issue lies in the "myopic observation + deterministic decision" paradigm — deterministic selections are made based solely on a small fraction of client data in the current round, lacking a global perspective.
Key Insight: Topology adjustment is cast as a CMAB problem, where each parameter position is treated as an arm and each outer loop selects \(K\) arms to activate.
Core Idea: Replace deterministic magnitude-based ranking with Beta posterior distributions for topology selection, and substitute myopic aggregated information with farsighted comprehensive information.
Method¶
Overall Architecture¶
FedRTS adopts a dual-loop training procedure. The inner loop fixes the topology \(m\) and updates weights \(W\) via standard FL over \(\Delta T\) rounds; the outer loop adjusts the topology via TSAdj — the server samples from Beta distributions, selects Top-\(K\) links as the new topology, and updates distribution parameters based on comprehensive observations. The initial topology is obtained via random pruning and Beta distribution sampling.
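A minimal runnable sketch of this dual-loop skeleton follows; the toy `local_train`, the synthetic observation, and all names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_topology(alpha, beta, K):
    # TSAdj selection: sample xi_i ~ Beta(alpha_i, beta_i) per link and
    # activate the Top-K samples as the new sparse topology.
    xi = rng.beta(alpha, beta)
    mask = np.zeros_like(xi)
    mask[np.argsort(xi)[-K:]] = 1.0
    return mask

def local_train(w, mask):
    # Toy stand-in for one client's inner-loop update on the masked weights.
    return (w + 0.01 * rng.standard_normal(w.shape)) * mask

d, density, lam = 1000, 0.2, 10.0
K = int(density * d)
alpha = np.ones(d)                       # Beta posterior parameters, one pair per link
beta = np.ones(d)
w = rng.standard_normal(d)
mask = sample_topology(alpha, beta, K)   # initial topology

for t in range(5):                       # outer loop: topology adjustment
    for _ in range(3):                   # inner loop (Delta_T rounds): weights only
        w = np.mean([local_train(w, mask) for _ in range(4)], axis=0)
    # Placeholder observation in [0, 1]; FedRTS builds it from weight and
    # gradient magnitudes (see the observation mechanism below).
    X = np.abs(w) / (np.abs(w).max() + 1e-12)
    alpha += lam * X                     # Beta posterior update
    beta += lam * (1.0 - X)
    mask = sample_topology(alpha, beta, K)
```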
Key Designs¶
- CMAB Problem Formulation:
  - Function: Maps topology adjustment onto the combinatorial multi-armed bandit framework.
  - Mechanism: Each parameter position \(m_i\) is treated as an arm; at outer loop \(t\), an action \(S_t\) containing \(K = d\langle m\rangle\) arms is selected, with the objective of maximizing the cumulative reward \(\sum_{t=1}^T r(S_t, \mu)\). After the action is executed, an observation \(X_t\) is received and guides the next round of decisions.
  - Design Motivation: The CMAB framework enables rigorous analysis of the three deficiencies in existing methods and naturally motivates a Thompson Sampling solution.
- Thompson Sampling-based Adjustment (TSAdj):
  - Function: Replaces deterministic decisions with probabilistic ones for topology adjustment.
  - Mechanism: A Beta posterior \(P_i \sim \text{Beta}(\alpha_i, \beta_i)\) is maintained for each link \(i\). At adjustment time, \(\xi_i \sim P_i\) is sampled and the Top-\(K\) links are selected: \(S_t = \{i \in \text{Top}(\xi, K)\}\). Parameters are then updated as \((\alpha_i, \beta_i) \leftarrow (\alpha_i + \lambda X_{t,i}, \beta_i + \lambda(1 - X_{t,i}))\), where \(\lambda\) is a scaling factor (see the sketch after this item).
  - Design Motivation: Probabilistic decisions naturally balance exploration and exploitation, exhibit lower variance than deterministic methods, and the posterior distribution inherently accumulates historical information.
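A focused sketch of the TSAdj selection and update rule as described above; the class name and interface are assumptions for illustration:

```python
import numpy as np

class TSAdj:
    """Beta-posterior Thompson Sampling over d links, activating K per round."""

    def __init__(self, d, K, lam=10.0, seed=0):
        self.alpha = np.ones(d)   # pseudo-counts for "link is useful"
        self.beta = np.ones(d)    # pseudo-counts for "link is not useful"
        self.K, self.lam = K, lam
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Sample xi_i ~ Beta(alpha_i, beta_i) and activate the Top-K links.
        xi = self.rng.beta(self.alpha, self.beta)
        return np.argsort(xi)[-self.K:]

    def update(self, X):
        # X in [0, 1]^d: comprehensive observation; lam scales its weight.
        self.alpha += self.lam * X
        self.beta += self.lam * (1.0 - X)

# Usage: activate 200 of 1000 links, then fold in one observation.
adj = TSAdj(d=1000, K=200)
active = adj.select()
adj.update(np.random.default_rng(1).uniform(size=1000))
```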
- Farsighted Comprehensive Observation Mechanism:
  - Function: Integrates inner/outer-loop information and individual/aggregated signals to construct observations.
  - Mechanism: The final observation is a weighted fusion \(X_t = \gamma X_t^{agg} + (1-\gamma)\sum_{n \in C_t} p_n X_t^n\). Active arms are evaluated by weight magnitude (distinguishing the \(\kappa\) core links); inactive arms are evaluated by gradient magnitude (selecting \(K-\kappa\) candidates for reactivation). A sketch follows this item.
  - Design Motivation: Fusing individual and aggregated information eliminates single-source bias, while the Beta distribution accumulates historical information to compensate for partial client participation.
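A sketch of the observation construction and fusion, assuming binary per-arm indicators (consistent with the Top-\(\kappa\) / Top-\((K-\kappa)\) selection above, though the paper's exact scoring may differ) and that \(p_n\) are client aggregation weights:

```python
import numpy as np

def binarize_scores(weights, grads, mask, kappa, K):
    # Per-client observation X^n: active arms score 1 if among the kappa
    # largest-magnitude weights (core links); inactive arms score 1 if among
    # the K - kappa largest-magnitude gradients (reactivation candidates).
    X = np.zeros_like(weights)
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(1 - mask)
    X[active[np.argsort(np.abs(weights[active]))[-kappa:]]] = 1.0
    X[inactive[np.argsort(np.abs(grads[inactive]))[-(K - kappa):]]] = 1.0
    return X

def fuse_observations(X_agg, X_clients, p, gamma=0.5):
    # X_t = gamma * X_t^agg + (1 - gamma) * sum_n p_n * X_t^n
    return gamma * X_agg + (1.0 - gamma) * sum(
        p_n * X_n for p_n, X_n in zip(p, X_clients))

# Tiny demo: 4 active links (3 of them core), 1 reactivation candidate.
mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
w = np.array([0.9, 0.1, 0.5, 0.8, 0.0, 0.0, 0.0, 0.0])
g = np.array([0.0, 0.0, 0.0, 0.0, 0.7, 0.2, 0.9, 0.1])
X_n = binarize_scores(w, g, mask, kappa=3, K=4)
X_t = fuse_observations(X_agg=X_n, X_clients=[X_n], p=[1.0])
```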
- Communication-Efficient Design:
  - Function: Clients upload only the indices of their top gradients.
  - Mechanism: Each client uploads \(\mathcal{I}_{t+1}^n = \text{Top}(|G_{t+1}^n|, K-\kappa)\), eliminating the need to transmit full gradients (see the sketch after this item).
  - Design Motivation: For inactive arms, the server does not compute \(X_{t,i}^{agg}\) from aggregated gradients (it is set to 0.5 to avoid high variance) and relies only on the individual Top indices.
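A client-side sketch of the Top-index upload; the function name and the gradient flattening are assumptions:

```python
import numpy as np

def top_gradient_indices(grad, K, kappa):
    # Upload only the indices of the K - kappa largest-magnitude gradients
    # (candidate links to reactivate), instead of the full gradient tensor.
    flat = np.abs(grad).ravel()
    return np.argsort(flat)[-(K - kappa):]

# Example: a client reduces a 1M-entry gradient to a handful of int indices.
g = np.random.default_rng(2).standard_normal((1000, 1000))
idx = top_gradient_indices(g, K=2048, kappa=2000)   # 48 indices uploaded
```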
Loss & Training¶
- Standard FL task loss (cross-entropy for image classification; language modeling loss for NLP).
- SGD optimizer, learning rate \(\eta=0.01\), 5 local epochs, batch size 64 (CV) / 16 (NLP).
- \(\lambda=10\), \(\gamma=0.5\), \(\alpha_{adj}=0.4\); per-layer sparsity follows the ERK distribution.
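For reference, the training setup above gathered into one illustrative config dict (names are assumptions, not the authors' code):

```python
# Illustrative FedRTS hyperparameter config assembled from the reported setup.
CONFIG = {
    "optimizer": "SGD",
    "lr": 0.01,
    "local_epochs": 5,
    "batch_size": {"cv": 64, "nlp": 16},
    "lam": 10.0,        # Beta-update scaling factor lambda
    "gamma": 0.5,       # aggregated vs. individual observation weight
    "alpha_adj": 0.4,   # adjustment hyperparameter, as reported
    "sparsity": "ERK",  # per-layer sparsity distribution
}
```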
Theoretical Guarantees¶
Under assumptions of independence, \(L\)-Lipschitz continuity, and a mean-field approximation, the regret bound of TSAdj is \(\mathrm{Reg}(T) = O\Big(\sum_{s \neq s_1^*} \frac{\Delta_{\max} L^2 \log T}{\Delta_{s,k}^2} + \Delta_{\max}\Big)\), representing the first regret bound for federated pruning derived from a CMAB perspective.
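For context, the regret here follows the standard CMAB definition, i.e., the expected cumulative reward gap to the optimal action \(S_1^*\) (stated as an assumption about the paper's convention):

```latex
\mathrm{Reg}(T) \;=\; T \cdot r\big(S_1^{*}, \mu\big) \;-\; \mathbb{E}\left[\sum_{t=1}^{T} r(S_t, \mu)\right],
\qquad S_1^{*} = \operatorname*{arg\,max}_{S} r(S, \mu).
```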
Key Experimental Results¶
Main Results — NLP Task (TinyStories + GPT-2-32M)¶
| Density | Method | PPL ↓ | Avg. Acc ↑ | Comm. Cost ↓ |
|---|---|---|---|---|
| 100% | FedAvg | 20.56 | 0.4387 | 260.41MB |
| 50% | FedDST | 20.10 | 0.4261 | 138.30MB |
| 50% | FedMef | 20.61 | 0.4352 | 165.96MB |
| 50% | FedRTS | 18.54 | 0.4422 | 138.84MB |
| 20% | FedDST | 26.49 | 0.4219 | 60.22MB |
| 20% | FedMef | 21.53 | 0.4293 | 72.26MB |
| 20% | FedRTS | 19.93 | 0.4333 | 60.34MB |
Ablation Study (CIFAR-10 + ResNet18)¶
| Configuration | Performance | Note |
|---|---|---|
| FedRTS (full) | Best | Probabilistic decisions + farsighted info + fusion |
| FedRTS (\(\gamma=0\)) | Close to full | Individual info only, no aggregated info |
| FedRTS (\(\gamma=1\)) | Notable degradation | Aggregated info only, stability lost |
| FedRTS w/o TS (\(\gamma=1\)) | Significant degradation | Momentum-based deterministic adjustment |
| FedTiny | Worst | None of the three advantages |
Key Findings¶
- FedRTS at 20% density achieves better PPL than full-size FedAvg (19.93 vs. 20.56).
- Probabilistic decisions contribute the most: FedRTS w/o TS falls far behind FedRTS (\(\gamma=1\)).
- Individual information is more reliable than aggregated information: \(\gamma=0\) is close to \(\gamma=0.5\), while \(\gamma=1\) degrades significantly.
- Advantages are more pronounced under high heterogeneity (Dirichlet \(\alpha=0.1\)) and a low participation rate (10%).
Highlights & Insights¶
- CMAB Perspective: This is the first work to connect federated pruning with combinatorial bandit theory, providing a novel analytical framework and algorithm design paradigm.
- Dual Benefits of Probabilistic Decisions: Thompson Sampling simultaneously addresses the exploration–exploitation trade-off and reduces decision variance.
- Communication Efficiency as a Natural Byproduct: TSAdj requires only Top gradient indices, so the communication savings emerge naturally from the method design rather than from an explicit engineering effort.
Limitations & Future Work¶
- The theoretical analysis simplifies the inner loop and partial observations, creating a gap with the actual implementation.
- The independence assumption does not hold strictly in practice.
- Validation is limited to lightweight models (ResNet18 / ShuffleNetV2 / GPT-2-32M); experiments on large-scale models are absent.
- Whether the Beta distribution is the optimal posterior choice remains an open question.
Related Work & Insights¶
- vs. FedDST/FedMef: Deterministic magnitude pruning vs. probabilistic sampling — FedRTS shows clear advantages in heterogeneous settings.
- vs. PruneFL: PruneFL's client-side adjustment is less stable; FedRTS's server-side, globally informed probabilistic adjustment is more robust.
- vs. FedTiny: Uploading full gradients vs. only Top indices — a substantial communication efficiency gap.
Rating¶
- Novelty: ⭐⭐⭐⭐ First use of CMAB + TS for federated pruning; theoretically novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers CV + NLP with complete ablations, but lacks large-model experiments.
- Writing Quality: ⭐⭐⭐⭐ The derivation from CMAB formulation to TS solution is natural and well-structured.
- Value: ⭐⭐⭐⭐ Provides a practical and efficient sparse training solution for resource-constrained federated learning.