Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

Conference: NeurIPS 2025 | arXiv: 2502.01755 | Code: None | Area: Model Compression | Keywords: federated learning, LoRA, parameter-efficient fine-tuning, alternating optimization, LLM

TL;DR

This paper proposes RoLoRA, which alternately optimizes the down-projection (\(\mathbf{A}\)) and up-projection (\(\mathbf{B}\)) matrices of LoRA to address imprecise aggregation and limited expressiveness in federated learning. RoLoRA significantly outperforms FedAVG of LoRA and FFA-LoRA on RoBERTa-Large and Llama-2-7B.

Background & Motivation

Background: Using LoRA for parameter-efficient fine-tuning in federated learning is a mainstream approach. LoRA decomposes weight updates as \(\Delta \mathbf{W} = \alpha \mathbf{A}\mathbf{B}\), where \(\mathbf{A} \in \mathbb{R}^{d \times r}\), \(\mathbf{B} \in \mathbb{R}^{r \times d}\), \(r \ll d\).
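As a quick illustration of the decomposition and the parameter savings, here is a minimal NumPy sketch (the dimensions and scaling value are arbitrary examples, not the paper's settings):

```python
import numpy as np

d, r, alpha = 1024, 8, 16           # hypothetical hidden size, LoRA rank, and scaling
A = np.random.randn(d, r) * 0.01    # down-projection (trainable)
B = np.zeros((r, d))                # up-projection (trainable, zero-initialized as in LoRA)

delta_W = alpha * A @ B             # low-rank weight update, shape (d, d)
print(delta_W.shape)                                                  # (1024, 1024)
print(f"full update: {d * d:,} params, LoRA: {2 * d * r:,} params")   # 1,048,576 vs 16,384
```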

Limitations of Prior Work:

  • FedAVG of LoRA: directly averaging the client matrices \(\mathbf{A}_i\) and \(\mathbf{B}_i\) introduces aggregation interference, since \(\frac{1}{N}\sum_i \mathbf{A}_i \mathbf{B}_i \neq \left(\frac{1}{N}\sum_i \mathbf{A}_i\right)\left(\frac{1}{N}\sum_i \mathbf{B}_i\right)\)
  • FFA-LoRA: freezing \(\mathbf{A}\) (the down-projection) and updating only \(\mathbf{B}\) avoids interference but sacrifices expressiveness, leading to significant performance degradation with fewer parameters or more clients
  • FlexLoRA/FLoRA: recover the exact update via matrix multiplication and truncated SVD, but at substantial computational cost

Key Challenge: A three-way tension among exact aggregation, model expressiveness, and computational/communication efficiency.

Goal: Design a federated LoRA fine-tuning framework that simultaneously ensures exact aggregation, sufficient expressiveness, and low communication/computation overhead.

Key Insight: Inspired by multi-task linear representation learning (MLRL), the paper alternately freezes \(\mathbf{A}\) and \(\mathbf{B}\) — only one matrix is trained and aggregated per round, naturally guaranteeing exact aggregation.

Core Idea: In odd rounds, freeze \(\mathbf{A}\) and update \(\mathbf{B}\); in even rounds, freeze \(\mathbf{B}\) and update \(\mathbf{A}\). This alternation achieves both exact aggregation and full expressiveness.

Method

Overall Architecture

RoLoRA Algorithm (Algorithm 1):

  • Odd communication rounds: all clients freeze the shared \(\mathbf{A}^t\) and locally train \(\mathbf{B}_i^{t+1}\); the server aggregates \(\mathbf{B}^{t+1} = \frac{1}{N}\sum_i \mathbf{B}_i^{t+1}\)
  • Even communication rounds: all clients freeze the shared \(\mathbf{B}^{t+1}\) and locally train \(\mathbf{A}_i^{t+1}\); the server aggregates \(\mathbf{A}^{t+1} = \frac{1}{N}\sum_i \mathbf{A}_i^{t+1}\)
  • Since the frozen matrix is globally consistent across clients, aggregation is inherently exact (see the sketch below)
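A minimal NumPy sketch of this alternating schedule on a toy federated least-squares problem, purely for intuition: the local solvers, client data, and dimensions are illustrative assumptions, not the authors' implementation (exact least-squares solves stand in for local SGD to keep the sketch short).

```python
import numpy as np

def train_B_locally(A, X, Y):
    """Client update with A frozen: solve min_B ||X A B - Y||_F^2 exactly."""
    return np.linalg.lstsq(X @ A, Y, rcond=None)[0]

def train_A_locally(B, X, Y):
    """Client update with B frozen: solve min_A ||X A B - Y||_F^2 exactly."""
    W_ls = np.linalg.lstsq(X, Y, rcond=None)[0]   # best unconstrained fit
    return W_ls @ np.linalg.pinv(B)               # project onto the row space of B

def rolora_round(A, B, clients, t):
    """One RoLoRA communication round: only one matrix is trained and averaged."""
    if t % 2 == 1:   # odd round: freeze shared A, train and aggregate B
        B = np.mean([train_B_locally(A, X, Y) for X, Y in clients], axis=0)
    else:            # even round: freeze shared B, train and aggregate A
        A = np.mean([train_A_locally(B, X, Y) for X, Y in clients], axis=0)
    return A, B

rng = np.random.default_rng(0)
d, r, n_clients, n = 16, 2, 4, 64
W_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))   # rank-r target
clients = []
for _ in range(n_clients):
    X = rng.standard_normal((n, d))
    clients.append((X, X @ W_star))                                  # noiseless client data
A, B = rng.standard_normal((d, r)), np.zeros((r, d))
for t in range(1, 9):
    A, B = rolora_round(A, B, clients, t)
    err = np.mean([np.linalg.norm(X @ A @ B - Y) for X, Y in clients])
    print(f"round {t}: mean residual {err:.3e}")
```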

Key Designs

1. Exact Aggregation Guarantee

In odd rounds, \(\mathbf{A}_i^t = \mathbf{A}^t\) is identical across all clients, so

\[\frac{1}{N}\sum_i \mathbf{A}_i^t \mathbf{B}_i^{t+1} = \mathbf{A}^t \cdot \frac{1}{N}\sum_i \mathbf{B}_i^{t+1},\]

and aggregation is fully exact, with no interference.
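A quick numerical check of this identity, contrasting it with the interference of naive FedAVG of LoRA (a NumPy sketch with arbitrary toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, r = 5, 8, 2

# FedAVG of LoRA: clients hold different A_i and B_i
A_list = [rng.standard_normal((d, r)) for _ in range(N)]
B_list = [rng.standard_normal((r, d)) for _ in range(N)]
avg_of_products = np.mean([A @ B for A, B in zip(A_list, B_list)], axis=0)
product_of_avgs = np.mean(A_list, axis=0) @ np.mean(B_list, axis=0)
print("naive FedAVG gap:", np.linalg.norm(avg_of_products - product_of_avgs))  # clearly > 0

# RoLoRA odd round: A is shared (frozen), only the B_i differ
A_shared = rng.standard_normal((d, r))
avg_of_products = np.mean([A_shared @ B for B in B_list], axis=0)
exact = A_shared @ np.mean(B_list, axis=0)
print("RoLoRA gap:", np.linalg.norm(avg_of_products - exact))  # ~0 (machine precision)
```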

2. Theoretical Analysis via Linear Models (Theorem 4.5)

In federated linear regression (\(\mathbf{Y}_i = \mathbf{X}_i \mathbf{a}^* \mathbf{b}^{*\top}\)), RoLoRA achieves exponential convergence:

\[\sin\theta(\mathbf{a}^{t+1}, \mathbf{a}^*) \leq \sin\theta(\mathbf{a}^t, \mathbf{a}^*) \sqrt{1 - \eta(1 - \delta_0^2)\|\mathbf{b}^*\|^2}\]

  • The angular distance to \(\mathbf{a}^*\) decays geometrically, so it reaches any target accuracy \(\epsilon\)
  • In contrast, FFA-LoRA's loss is lower-bounded (Proposition 4.6) by \((1 + \tilde{c})\|\mathbf{b}^*\|^2 \delta_0^2\), a quantity determined by the initialization angle and bounded away from zero, so its loss can never converge to zero
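To see the flavor of this contraction, here is a small single-client NumPy simulation of the alternating update on \(\mathbf{Y} = \mathbf{X}\mathbf{a}^*\mathbf{b}^{*\top}\); the gradient-step schedule, step sizes, and dimensions are arbitrary choices, so this is only a qualitative illustration of the theorem's behavior, not its exact setting or constants.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 5, 200
a_star = rng.standard_normal(d); a_star /= np.linalg.norm(a_star)
b_star = rng.standard_normal(k); b_star /= np.linalg.norm(b_star)
X = rng.standard_normal((n, d))
Y = np.outer(X @ a_star, b_star)            # Y = X a* b*^T (noiseless, rank 1)

def sin_theta(u, v):
    cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(max(0.0, 1.0 - cos ** 2))

a = rng.standard_normal(d); a /= np.linalg.norm(a)   # random init, large initial angle
b = np.zeros(k)
eta_b, eta_a, inner_steps = 5e-3, 1e-3, 25

for t in range(1, 13):
    if t % 2 == 1:                          # "odd round": freeze a, take gradient steps on b
        for _ in range(inner_steps):
            b -= eta_b * (np.outer(X @ a, b) - Y).T @ (X @ a)
    else:                                   # "even round": freeze b, take gradient steps on a
        for _ in range(inner_steps):
            a -= eta_a * X.T @ ((np.outer(X @ a, b) - Y) @ b)
        a /= np.linalg.norm(a)              # rescale only; the angle to a* is unchanged
        print(f"round {t:2d}: sin(theta(a, a*)) = {sin_theta(a, a_star):.4f}")
```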

3. Non-Convex Convergence Guarantee (Theorem A4.4)

Under a smooth non-convex setting, RoLoRA achieves a convergence rate of \(O(1/\sqrt{T})\), consistent with FedAVG.

Loss & Training

  • Same loss function as standard LoRA
  • Trainable parameters per round are halved (only \(\mathbf{A}\) or \(\mathbf{B}\) is trained), and communication cost is likewise halved (see the sketch after this list)
  • Learning rate selected from \(\{5\times 10^{-4}, \ldots, 1\times 10^{-1}\}\)
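For intuition on how per-round freezing halves the trainable and communicated parameters, here is a minimal PyTorch-style sketch; the LoRALinear module, the lora_A/lora_B parameter names, and the helper function are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy frozen linear layer with a LoRA update delta_W = alpha * A @ B."""
    def __init__(self, d: int, r: int, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d), requires_grad=False)  # frozen base weight
        self.lora_A = nn.Parameter(torch.randn(d, r) * 0.01)                # down-projection
        self.lora_B = nn.Parameter(torch.zeros(r, d))                       # up-projection
        self.alpha = alpha

    def forward(self, x):
        return x @ (self.weight + self.alpha * self.lora_A @ self.lora_B)

def set_rolora_round(model: nn.Module, round_idx: int):
    """Odd rounds train B (A frozen); even rounds train A (B frozen)."""
    train_B = (round_idx % 2 == 1)
    for name, p in model.named_parameters():
        if "lora_B" in name:
            p.requires_grad = train_B
        elif "lora_A" in name:
            p.requires_grad = not train_B

model = LoRALinear(d=64, r=4)
for t in (1, 2):
    set_rolora_round(model, t)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"round {t}: trainable (and communicated) params = {n_train}")  # 256 each round
```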

Key Experimental Results

Main Results: RoBERTa-Large on GLUE (50 clients, rank 4)

| Method | SST-2 | QNLI | MNLI | QQP | RTE | Avg |
|---|---|---|---|---|---|---|
| LoRA | 93.00 | 78.13 | 52.64 | 77.60 | 52.23 | 70.72 |
| FFA-LoRA | 93.23 | 85.05 | 69.97 | 78.44 | 55.72 | 76.48 |
| FlexLoRA | 54.08 | 55.40 | 39.14 | 72.00 | 52.71 | 54.67 |
| RoLoRA | 94.80 | 90.00 | 82.98 | 85.71 | 75.57 | 85.81 |

Under the 50-client setting, RoLoRA's average accuracy exceeds LoRA by 15.09 points and FFA-LoRA by 9.33 points.

Llama-2-7B on Commonsense (50 clients, rank 8)

| Method | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA |
|---|---|---|---|---|---|---|---|---|
| LoRA | 61.42 | 33.19 | 31.88 | 21.23 | 31.36 | 27.36 | 32.03 | 26.07 |
| FFA-LoRA | 53.43 | 35.49 | 10.63 | 11.81 | 1.61 | 6.88 | 7.93 | 15.00 |
| RoLoRA | 61.83 | 61.26 | 39.76 | 27.49 | 47.67 | 33.19 | 40.13 | 31.67 |

FFA-LoRA nearly collapses on large models, while RoLoRA maintains a substantial lead.

Ablation Study

| Ablation Dimension | Key Findings |
|---|---|
| Number of clients (3→50) | LoRA/FFA-LoRA degrade sharply; RoLoRA remains stable (3 clients: 88.28 → 50 clients: 85.81) |
| Non-IID (Dirichlet 0.5 / 1.0) | RoLoRA reaches 82.60% on MNLI vs. 81.19% for LoRA and only 35.45% for FlexLoRA |
| Reduced parameters | FFA-LoRA degrades significantly with fewer parameters; RoLoRA remains robust |
| Symmetric vs. asymmetric updates | Balanced alternation of A and B is optimal; biasing toward either matrix degrades performance |
| Local steps (1→20) | FFA-LoRA drops from 72.52% to 69.97% as local steps increase; RoLoRA stays stable (84.39% → 82.98%) |

Key Findings

  • As the number of clients increases (3→20→50), the aggregation interference in FedAVG of LoRA deteriorates sharply
  • FFA-LoRA's limitation stems from the quality of the \(\mathbf{A}\) initialization: its accuracy varies widely across random seeds (PIQA: std = 9.55)
  • Learning \(\mathbf{A}\) is especially critical in the early stages of training (running RoLoRA for the first 20% of rounds and then FFA-LoRA for the remaining 80% already significantly outperforms pure FFA-LoRA)
  • RoLoRA's communication cost is only 50% of that of LoRA/FlexLoRA

Highlights & Insights

  • Simple yet effective: Alternating freezing is an extremely simple design that simultaneously resolves the tension between exact aggregation and expressiveness
  • Theory–practice alignment: The linear model theory (exponential convergence vs. saturation) is perfectly validated in nonlinear experiments
  • Outstanding robustness: Strong performance persists under extreme settings (50 clients, rank 2, non-IID)
  • Deployment-friendly: Communication and computation costs are halved with no additional SVD operations

Limitations & Future Work

  • The linear model theory assumes homogeneous clients and a single-layer LoRA structure, which diverges from practical multi-layer LoRA setups
  • Alternating optimization halves update efficiency per communication round (each of \(\mathbf{A}\) and \(\mathbf{B}\) is updated only half as many times given the same total rounds)
  • Adaptive alternating frequencies remain unexplored (how to determine the optimal alternation ratio between \(\mathbf{A}\) and \(\mathbf{B}\)?)
  • Comparison with full-parameter fine-tuning is absent
  • Integration with privacy-preserving mechanisms (e.g., differential privacy) is not discussed
  • The connection to MLRL (multi-task linear representation learning) provides the theoretical foundation
  • LoRA+ explores different learning rates for \(\mathbf{A}\) and \(\mathbf{B}\); RoLoRA's alternating strategy represents a more aggressive form of asymmetric treatment
  • The approach can be extended to heterogeneous rank settings (e.g., different clients using different ranks)
  • Insight: The aggregation precision problem in federated learning similarly exists in other decomposed parameter methods (e.g., adapters, prefix tuning)

Rating

⭐⭐⭐⭐ (4/5)

The method is simple and effective, the theoretical analysis is clear, and the experiments are comprehensive with substantial performance gains. The primary limitation is the gap between theory and practice (linear vs. multi-layer nonlinear settings).