Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation¶
Conference: ICLR 2026 · arXiv: 2602.16936 · Code: GitHub · Area: AI Safety · Keywords: Federated Fine-Tuning, LoRA, Heterogeneous Rank, Initialization Noise, Aggregation Noise
TL;DR¶
This paper proposes Fed-PLoRA, a framework that replaces multi-rank LoRA with multiple parallel one-rank modules (PLoRA). Via a Select-N-Fold strategy, which selects \(N\) modules for training and folds the remainder into the frozen weights, it achieves zero initialization noise and minimal aggregation noise for heterogeneous federated fine-tuning, outperforming existing methods across 6 models and multiple tasks.
Background & Motivation¶
Background: Federated fine-tuning (FFT) leverages LoRA to collaboratively fine-tune LLMs across distributed clients while preserving data privacy. However, heterogeneous client resources lead to different LoRA ranks, causing dimension mismatches during initialization and aggregation.
Limitations of Prior Work: (1) FLoRA: randomly reinitializes LoRA each round, introducing large initialization noise; (2) HETLoRA: truncates the global LoRA, discarding information beyond the low-rank subspace and introducing aggregation bias; (3) FlexLoRA: uses SVD reconstruction, introducing decomposition errors. All existing methods face an irreconcilable trade-off between initialization noise and aggregation noise.
Key Challenge: When the global model rank \(R\) exceeds a client's rank \(r_i\), the client cannot fully inherit global information (initialization noise), and aggregating separately trained adapters is imperfect (aggregation noise).
Key Insight: Decompose multi-rank LoRA into multiple parallel one-rank modules. Each module is independent, allowing clients to select a subset for training and fold the remainder into frozen weights, achieving zero initialization noise.
Method¶
Overall Architecture¶
- PLoRA: \(\Delta W = \sum_{j=1}^{R} B_{(j)}A_{(j)}\), mathematically equivalent to standard LoRA but composed of independent rank-1 modules.
- Select-N-Fold: client \(i\) selects \(r_i\) modules for training and folds the remaining modules into the pretrained weights, which are then frozen.
- Aggregation: independent averaging along the rank dimension.
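To make the decomposition concrete, here is a minimal PyTorch sketch (shapes and variable names are illustrative, not from the paper's code) verifying that the sum of \(R\) rank-1 modules reproduces the standard rank-\(R\) LoRA update:

```python
import torch

torch.manual_seed(0)
d, k, R = 64, 32, 8  # illustrative shapes: output dim, input dim, global rank

# Standard LoRA update: Delta W = B @ A, with B in R^{d x R}, A in R^{R x k}.
B = torch.randn(d, R)
A = torch.randn(R, k)
delta_lora = B @ A

# PLoRA view: R parallel rank-1 modules B_(j) A_(j), where B_(j) is the j-th
# column of B and A_(j) is the j-th row of A; their sum recovers B @ A exactly.
delta_plora = sum(B[:, j:j+1] @ A[j:j+1, :] for j in range(R))

assert torch.allclose(delta_lora, delta_plora, atol=1e-5)
```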
Key Designs¶
- PLoRA (Parallel One-Rank Adaptation):
  - Function: Decomposes a rank-\(R\) LoRA into \(R\) parallel rank-1 modules.
  - Mechanism: \(\Delta W_{\text{PLoRA}} = \sum_{j=1}^R B_{(j)}A_{(j)} = \sum_{j=1}^R B_{[:,j]}A_{[j,:]} = BA = \Delta W_{\text{LoRA}}\)
  - Design Motivation: Mathematical equivalence combined with modular independence naturally enables subset selection.
- Select-N-Fold Strategy (see the sketch after this list):
  - Function: Each client randomly selects \(r_i\) PLoRA modules for training; the remaining modules are folded into the frozen weights.
  - Mechanism: \(\mathcal{W}_i^t = \mathcal{W}^0 + \sum_{j \notin \mathcal{K}_i^t} B_{(j)}^{t-1}A_{(j)}^{t-1}\), with training performed on top of \(\mathcal{W}_i^t\).
  - Design Motivation: Folding preserves the information in unselected modules, yielding zero initialization noise; random selection ensures all modules are updated in expectation.
- Noise Analysis:
  - Initialization noise: \(\mathcal{N}_{\text{Init}}^t = 0\) (global information is perfectly preserved).
  - Aggregation noise upper bound: \(\mathcal{N}_{\text{Agg}}^t \leq \sum_{j=1}^R \frac{1}{|\mathcal{Q}_{(j)}^t|}\sum_{i \in \mathcal{Q}_{(j)}^t} \left( \|B_{i,(j)}^t - \bar{B}_{(j)}^t\|_2 + \|A_{i,(j)}^t - \bar{A}_{(j)}^t\|_2 \right)\), where \(\mathcal{Q}_{(j)}^t\) denotes the set of clients that trained module \(j\) in round \(t\).
  - Cosine similarity analysis demonstrates that modules converge across clients during training, progressively tightening this upper bound.
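The Select-N-Fold step can be sketched as follows. This is a hedged, minimal PyTorch illustration (function and variable names are hypothetical): at initialization, the frozen weight plus the selected trainable modules reproduce the full global model, which is why \(\mathcal{N}_{\text{Init}}^t = 0\).

```python
import torch

def select_n_fold(W0, B, A, r_i, generator=None):
    """Fold unselected PLoRA modules into the frozen weight (illustrative).

    W0: (d, k) pretrained weight; B: (d, R) and A: (R, k) global PLoRA
    factors; r_i: this client's rank budget.
    """
    R = B.shape[1]
    perm = torch.randperm(R, generator=generator)
    selected, folded = perm[:r_i], perm[r_i:]

    # Fold: absorb unselected modules into the frozen weight, so no global
    # information is truncated or reinitialized away.
    W_frozen = W0 + B[:, folded] @ A[folded, :]

    # Train: only the selected rank-1 pairs remain trainable parameters.
    B_train = B[:, selected].clone().requires_grad_(True)
    A_train = A[selected, :].clone().requires_grad_(True)
    return W_frozen, B_train, A_train, selected

# Sanity check: at initialization the client model equals the global model.
d, k, R, r_i = 16, 12, 8, 3
W0, B, A = torch.randn(d, k), torch.randn(d, R), torch.randn(R, k)
W_frozen, B_t, A_t, _ = select_n_fold(W0, B, A, r_i)
assert torch.allclose(W_frozen + B_t @ A_t, W0 + B @ A, atol=1e-5)
```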
Loss & Training¶
- Standard federated fine-tuning pipeline (broadcast → local training → aggregation); the per-rank aggregation step is sketched after this list.
- 10% client participation per round.
- SGD/AdamW as local optimizers.
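The aggregation step averages modules independently along the rank dimension: for each rank index \(j\), the server averages the \(B\) columns and \(A\) rows over the clients that trained index \(j\) that round. A hedged sketch, under the same assumed conventions as the code above:

```python
import torch
from collections import defaultdict

def aggregate_plora(updates, B_prev, A_prev):
    """Independent per-rank-index averaging (illustrative, not paper code).

    updates: list of (selected, B_i, A_i) per participating client, where
    selected holds the rank indices client i trained, B_i is (d, r_i), and
    A_i is (r_i, k). Indices no client trained keep their previous values.
    """
    B_new, A_new = B_prev.clone(), A_prev.clone()
    buckets = defaultdict(list)  # rank index j -> list of (B column, A row)
    for selected, B_i, A_i in updates:
        for pos, j in enumerate(selected.tolist()):
            buckets[j].append((B_i[:, pos], A_i[pos, :]))
    for j, pairs in buckets.items():
        # Average B and A separately at the same rank index j; the clients
        # contributing to `pairs` form the set Q_(j) from the noise analysis.
        B_new[:, j] = torch.stack([b for b, _ in pairs]).mean(dim=0)
        A_new[j, :] = torch.stack([a for _, a in pairs]).mean(dim=0)
    return B_new, A_new
```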
Key Experimental Results¶
Main Results (Llama-1B, Natural Instructions)¶
| Method | IID Accuracy | non-IID Accuracy | Initialization Noise |
|---|---|---|---|
| FedIT (homogeneous) | 66.88 | 61.28 | 0 |
| FLoRA | Medium | Medium | High (random re-init) |
| FlexLoRA | Medium | Medium | Medium (truncation + SVD error) |
| HETLoRA | Medium | Medium | Medium (truncation) |
| Fed-PLoRA | Highest | Highest | 0 |
Multi-Model / Multi-Task Validation¶
| Model | Task | Fed-PLoRA vs. Best Baseline |
|---|---|---|
| BERT-base | GLUE | Outperforms |
| Llama-3.1-8B | Financial NLP | Outperforms |
| Qwen3-4B | Instruction Following | Outperforms |
| Mistral-7B | Medical QA | Outperforms |
Key Findings¶
- Cosine similarity heatmaps show that PLoRA modules at the same rank index converge across clients after training (high diagonal values), while modules at different rank indices remain independent (low off-diagonal values), indicating that each rank index captures distinct knowledge while clients converge toward shared modules.
- Fed-PLoRA demonstrates a larger advantage in non-IID settings, suggesting that zero initialization noise is especially critical under data heterogeneity.
- Communication, computation, and memory overhead are comparable to existing methods, introducing no significant additional cost.
Highlights & Insights¶
- Zero Initialization Noise: By folding rather than truncating or reinitializing, Fed-PLoRA perfectly preserves global information—a simple yet fundamental solution to heterogeneous FFT.
- Modular Independence of PLoRA: Although mathematically equivalent to standard LoRA, the modular independence naturally enables subset selection and independent aggregation, a simple architectural change that yields a systematic improvement.
- Unified Noise Analysis Framework: The paper provides a unified analysis of initialization noise and aggregation noise for FLoRA, FlexLoRA, HETLoRA, and Fed-PLoRA, clearly characterizing the strengths and weaknesses of each approach.
Limitations & Future Work¶
- Random module selection may be suboptimal; importance- or gradient-based selection strategies could be more effective.
- The folding operation adds \(O(dk(R-r_i))\) computation per round, which, while much smaller than the training cost, is non-zero.
- Downlink communication incurs \(O((d+k)(R-r_i))\) additional overhead compared to HETLoRA and FlexLoRA (see the worked example after this list).
- Experiments are limited to LoRA applied to self-attention layers; the effect of applying PLoRA to FFN layers remains unexplored.
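As a rough, illustrative estimate (numbers assumed here, not taken from the paper): for a square projection with \(d = k = 4096\), \(R = 16\), and \(r_i = 4\), folding costs on the order of \(dk(R - r_i) = 4096 \times 4096 \times 12 \approx 2.0 \times 10^8\) multiply-accumulates per layer per round, while the extra downlink is \((d + k)(R - r_i) = 8192 \times 12 \approx 9.8 \times 10^4\) parameters per layer; both are small relative to local training cost.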
Related Work & Insights¶
- vs. FLoRA: FLoRA achieves zero aggregation noise but suffers from large initialization noise; Fed-PLoRA achieves zero initialization noise with small aggregation noise, yielding overall superior performance.
- vs. HETLoRA: HETLoRA truncates high-rank components, discarding information; Fed-PLoRA folds them, retaining all information.
- vs. Standard LoRA / FedIT: Fed-PLoRA is equivalent to FedIT in the homogeneous setting and outperforms all baselines in the heterogeneous setting.
Rating¶
- Novelty: ⭐⭐⭐⭐ The PLoRA decomposition combined with the Select-N-Fold strategy is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 models, multiple domain tasks, IID/non-IID settings, and multiple baselines.
- Writing Quality: ⭐⭐⭐⭐ The noise analysis framework is clear and comparisons are fair.
- Value: ⭐⭐⭐⭐ Directly applicable to practical heterogeneous federated fine-tuning scenarios.