FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=kntrZOm2AQ
Code: TBD
Area: Efficient Fine-tuning / Federated Learning
Keywords: Federated Fine-tuning, LoRA, Gram Matrix, Procrustes Alignment, Communication Efficiency, Convergence Analysis

TL;DR

FLoRG reparameterizes the two low-rank matrices of LoRA into a single low-rank matrix and aggregates only its Gram matrix, which turns server-side aggregation from a "biased bilinear operation" into an "unbiased linear operation." It then uses Procrustes alignment to resolve the drift caused by non-unique decomposition. Together these eliminate aggregation error, reduce communication overhead (up to 2041× versus FedEx-LoRA), and tighten the convergence bound in federated fine-tuning.

Background & Motivation

Background: LoRA replaces full-parameter fine-tuning with a low-rank update $W = W_0 + BA$ built from two small matrices, and has become a mainstream method for adapting large models. Federated Learning (FL) enables collaborative fine-tuning on decentralized data without exposing raw data. Combining the two is a natural fit: clients train LoRA locally and upload low-rank updates, and the server aggregates them.

Limitations of Prior Work: Directly integrating LoRA into FL runs into two fundamental problems. The first is Aggregation Error: traditional methods (e.g., FedIT) average $B_n$ and $A_n$ separately, yielding $(\frac{1}{N}\sum_n B_n)(\frac{1}{N}\sum_n A_n)$, whereas the desired update is $\frac{1}{N}\sum_n (B_n A_n)$. The two generally differ when clients' factors differ, and this systematic bias accumulates over rounds and degrades performance. The second is Decomposition Drift: another approach (e.g., FeDeRA) aggregates the products $B_n A_n$ and then factorizes the result to recover the two matrices. This eliminates aggregation error, but the aggregated matrix is often rank-deficient or has repeated eigenvalues, so its decomposition is not unique. Different decompositions change the parameter subspace and the gradient directions in subsequent rounds, causing cumulative drift; moreover, the rank of the decomposition may not match the target rank $r$.
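
The aggregation error described above is easy to reproduce numerically. The following is a minimal sketch with randomly generated client factors; the sizes, the seed, and the helper name `aggregation_gap` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_out, d_in, r = 4, 6, 5, 2   # hypothetical client count, layer sizes, rank

# Each client n holds its own LoRA factors B_n (d_out x r) and A_n (r x d_in).
Bs = rng.standard_normal((N, d_out, r))
As = rng.standard_normal((N, r, d_in))

def aggregation_gap(Bs, As):
    """Frobenius distance between mean(B_n A_n) and mean(B_n) mean(A_n)."""
    ideal = np.mean(Bs @ As, axis=0)              # desired: average of products
    separate = Bs.mean(axis=0) @ As.mean(axis=0)  # FedIT-style: product of averages
    return np.linalg.norm(ideal - separate)

# Heterogeneous clients: the two aggregates disagree.
gap_hetero = aggregation_gap(Bs, As)

# Identical clients: the bilinear bias disappears.
gap_identical = aggregation_gap(np.repeat(Bs[:1], N, axis=0),
                                np.repeat(As[:1], N, axis=0))

print(f"heterogeneous clients: gap = {gap_hetero:.3f}")
print(f"identical clients:     gap = {gap_identical:.1e}")
```

The gap is nonzero as soon as client factors differ, which is the situation non-IID data produces.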

Key Challenge: As long as LoRA remains a "product of two matrices," one must choose between separate aggregation (which produces error) and post-aggregation decomposition (which produces drift).

Goal: Develop a federated fine-tuning scheme that eliminates aggregation errors and minimizes decomposition drift.

Core Idea: [Reparameterization] Represent LoRA using a single low-rank matrix and aggregate only its Gram matrix (the matrix of column vector inner products). Since Gram matrix aggregation is linear and maintains positive semi-definiteness, the server achieves a truly unbiased aggregation. [Alignment Stability] Use Procrustes Alignment to project the decomposed matrix onto the direction closest to the previous round while maintaining the Gram matrix, thereby suppressing drift caused by non-unique decomposition.
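
To see why the Gram matrix is the quantity worth averaging, the sketch below contrasts averaging client Gram matrices with forming the Gram matrix of an averaged factor. All sizes and the random client matrices are illustrative assumptions; the point is only that the first operation is linear and stays positive semi-definite, while the second does not reproduce the average.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r, k = 5, 2, 6                       # hypothetical client count, rank, width

A_clients = rng.standard_normal((N, r, k))                   # one r x k factor per client
Q_clients = np.transpose(A_clients, (0, 2, 1)) @ A_clients   # N Gram matrices, each k x k

# Linear aggregation: the plain average of the client Gram matrices.
Q_avg = Q_clients.mean(axis=0)

# Averaging the factors first and squaring afterwards gives a different matrix.
A_mean = A_clients.mean(axis=0)
gram_of_mean = A_mean.T @ A_mean
bias = np.linalg.norm(Q_avg - gram_of_mean)

# The average of PSD matrices is PSD, but its rank can exceed r.
min_eig = np.linalg.eigvalsh(Q_avg).min()
rank_avg = np.linalg.matrix_rank(Q_avg)

print(f"||mean(A^T A) - mean(A)^T mean(A)||_F = {bias:.3f}")
print(f"smallest eigenvalue of Q_avg = {min_eig:.3e}, rank = {rank_avg}")
```

The rank growth visible here is the reason the server needs a step that brings the recovered factor back to rank $r$.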

Method

Overall Architecture

FLoRG expresses the fine-tuning increment of each LoRA layer as $\Delta W^t = L Q^t R = L (A^t)^\top A^t R$, where $L\in\mathbb{R}^{d_{out}\times k}$ and $R\in\mathbb{R}^{k\times d_{in}}$ are globally shared, fixed semi-orthogonal bases ($L^\top L = I_k$, $R R^\top = I_k$, $k=\min\{d_{in},d_{out}\}$). Clients only train a single low-rank matrix $A^t\in\mathbb{R}^{r\times k}$. The per-round process involves: local updates of $A^t$ via gradient descent → uploading the Gram matrix $Q_n^{t+1/2}=(A_n^{t+1/2})^\top A_n^{t+1/2}$ → unbiased linear averaging at the server to get $Q^{t+1}$ → eigendecomposition of $Q^{t+1}$ followed by Procrustes alignment to recover $A^{t+1}$ for the next round → broadcasting back to clients.

flowchart LR
    A["Client n<br/>Local Update of A_n"] -->|"Upload Gram Matrix<br/>Q_n = AₙᵀAₙ"| B["Server<br/>Linear Aggregation Q = Avg(Q_n)<br/>(Unbiased, PSD-preserving)"]
    B --> C["Eigendecomposition<br/>Q = PᵀΛP<br/>To get Ã = Λ^½ P"]
    C --> D["Procrustes Alignment<br/>S⋆ = U Vᵀ<br/>Project to rank-r subspace"]
    D -->|"Broadcast A^{t+1} = S⋆ Ã"| A
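
The shapes in the reparameterization $\Delta W = L (A)^\top A R$ can be checked with a short sketch. The source only states that $L$ and $R$ are fixed, shared, and semi-orthogonal; drawing them from a QR factorization of a random matrix is an assumption made here for illustration, as are the layer sizes and the helper name `orthonormal_columns`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2          # hypothetical layer sizes and rank
k = min(d_in, d_out)

def orthonormal_columns(rows, cols):
    """Random rows x cols matrix with orthonormal columns (needs rows >= cols)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q

# Assumption: random bases; the source only requires semi-orthogonality.
L = orthonormal_columns(d_out, k)        # d_out x k, so L^T L = I_k
R = orthonormal_columns(d_in, k).T       # k x d_in,  so R R^T = I_k

A = rng.standard_normal((r, k))          # the only trainable matrix
Q = A.T @ A                              # k x k Gram matrix, PSD with rank <= r
delta_W = L @ Q @ R                      # d_out x d_in weight increment

print(delta_W.shape, np.linalg.matrix_rank(delta_W))
```

Because $L$ has full column rank and $R$ has full row rank, the increment inherits the rank of $Q$, so the update stays low-rank even though $Q$ is a $k\times k$ matrix.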

Key Designs

1. Reparameterization with Single Matrix + Shared Bases: Error-free Aggregation. LoRA uses two matrices to adapt to different weight matrix shapes $W_0\in\mathbb{R}^{d_{out}\times d_{in}}$, which is the root of aggregation bias. FLoRG offloads the shape flexibility to two fixed and shared semi-orthogonal bases $L$ and $R$. Only the intermediate square matrix parameter $Q^t=(A^t)^\top A^t$ (represented via a single $A^t$) is trainable. Client gradients are computed only for $A^t$: $\nabla_A F_n(W^t;\xi_n)=A^t\big(H_n^t+(H_n^t)^\top\big)$, where $H_n^t=L^\top \nabla F_n(W^t;\xi_n) R^\top$. Because clients upload the Gram matrix, the aggregation $Q^{t+1}=\frac{1}{N}\sum_n (A_n^{t+1/2})^\top A_n^{t+1/2}$ is a purely linear operation that preserves positive semi-definiteness. The server thus obtains the true aggregated result, structurally eliminating bilinear inconsistency. Additionally, uploading one matrix instead of two reduces uplink communication by more than half.
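
The gradient formula $\nabla_A F_n = A\big(H+H^\top\big)$ with $H=L^\top \nabla F_n\, R^\top$ can be sanity-checked against finite differences. The sketch below uses a toy quadratic loss $F(W)=\tfrac12\|W-T\|_F^2$, random bases, and small sizes; all of these are assumptions chosen for the check, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 5, 4, 2                 # hypothetical sizes
k = min(d_in, d_out)

L, _ = np.linalg.qr(rng.standard_normal((d_out, k)))   # d_out x k
Rt, _ = np.linalg.qr(rng.standard_normal((d_in, k)))   # d_in x k
R = Rt.T                                               # k x d_in
W0 = rng.standard_normal((d_out, d_in))
target = rng.standard_normal((d_out, d_in))            # T in the toy loss

def loss(A):
    W = W0 + L @ A.T @ A @ R
    return 0.5 * np.sum((W - target) ** 2)

def grad_A(A):
    W = W0 + L @ A.T @ A @ R
    G = W - target            # gradient of the toy loss with respect to W
    H = L.T @ G @ R.T         # k x k projected gradient
    return A @ (H + H.T)      # closed-form gradient with respect to A

A = rng.standard_normal((r, k))
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(r):
    for j in range(k):
        E = np.zeros_like(A)
        E[i, j] = eps
        # Central difference approximation of dF/dA[i, j].
        numeric[i, j] = (loss(A + E) - loss(A - E)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - grad_A(A)))
print(f"max |numeric - analytic| = {max_abs_diff:.2e}")
```

The symmetrized factor $H+H^\top$ appears because $A$ enters $Q=A^\top A$ twice, once on each side.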

2. Procrustes Alignment: Eliminating Decomposition Drift. The server eigendecomposes the PSD matrix $Q^{t+1}=(P^{t+1})^\top \Lambda^{t+1} P^{t+1}$ to obtain a canonical factor $\tilde{A}^{t+1}=(\Lambda^{t+1})^{1/2}P^{t+1}$. However, the factorization is not unique: $(O\tilde{A}^{t+1})^\top O\tilde{A}^{t+1}=Q^{t+1}$ for any matrix $O$ with orthonormal columns ($O^\top O=I$). FLoRG therefore solves for a semi-orthogonal alignment matrix $S^t$ that makes the factor as close as possible, in Frobenius norm, to the previous round's $A^t$: $$\min_{S^t}\ \big\|S^t\tilde{A}^{t+1}-A^t\big\|_F^2\quad \text{s.t.}\ (S^t)^\top S^t=I.$$ This is the classic orthogonal Procrustes problem. Taking the SVD $A^t(\tilde{A}^{t+1})^\top=U^t\Sigma^t(V^t)^\top$ yields the closed-form optimum $S^{t,\star}=U^t(V^t)^\top$. This step achieves two goals: it selects the most "stable" decomposition, which steadies subsequent gradients, and the semi-orthogonal projection maps the rank $r^{\prime,t}$ of $\tilde{A}^{t+1}$ back to the target rank $r$.
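
The eigendecomposition and alignment steps can be sketched in a few lines of NumPy. To keep the example exact, it assumes the aggregated Gram matrix has rank $r$ (it is built from a single rank-$r$ factor), so $\tilde{A}$ is $r\times k$ and $S$ is a square orthogonal matrix; how the paper handles $r'\neq r$ is not reproduced here. The function names `gram_factor` and `procrustes_align` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
r, k = 3, 6                                   # hypothetical rank and width

def gram_factor(Q, r):
    """Canonical factor A_tilde (r x k) with A_tilde^T A_tilde = Q, for rank(Q) <= r."""
    w, V = np.linalg.eigh(Q)                  # Q = V diag(w) V^T, w ascending
    top = np.argsort(w)[::-1][:r]             # indices of the r largest eigenvalues
    return np.sqrt(np.clip(w[top], 0.0, None))[:, None] * V[:, top].T

def procrustes_align(A_tilde, A_prev):
    """Orthogonal S minimizing ||S A_tilde - A_prev||_F via the SVD of A_prev A_tilde^T."""
    U, _, Vt = np.linalg.svd(A_prev @ A_tilde.T)
    return U @ Vt

A_prev = rng.standard_normal((r, k))                     # A^t from the previous round
A_true = A_prev + 0.05 * rng.standard_normal((r, k))     # stand-in for the updated factor
Q = A_true.T @ A_true                                    # what the server actually sees

A_tilde = gram_factor(Q, r)                   # arbitrary rotation of the true factor
S = procrustes_align(A_tilde, A_prev)
A_next = S @ A_tilde                          # aligned factor broadcast to clients

drift_raw = np.linalg.norm(A_tilde - A_prev)
drift_aligned = np.linalg.norm(A_next - A_prev)
print(f"drift without alignment: {drift_raw:.3f}")
print(f"drift with alignment:    {drift_aligned:.3f}")
```

Alignment leaves the Gram matrix untouched while choosing, among all factors that reproduce it, the one nearest to the previous iterate.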

3. Alignment-tightened Convergence Bound: Under standard assumptions (L-smoothness, bounded gradients, bounded parameter space), the authors prove that the convergence rate of FLoRG under non-convex loss (Theorem 2) includes a Procrustes Alignment Drift term proportional to $\sum_t \Delta_{\text{proc}}^{t+1}/\sigma_{\min}(\cdot)$, where $\Delta_{\text{proc}}^{t+1}=\|S^t\tilde{A}^{t+1}-A^t\|_F^2-\|S^{t,\star}\tilde{A}^{t+1}-A^t\|_F^2\ge 0$. With optimal Procrustes alignment, $\Delta_{\text{proc}}^{t+1}=0$, effectively tightening the convergence bound.
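
The non-negativity of $\Delta_{\text{proc}}^{t+1}$ follows from the standard orthogonal Procrustes argument. A short derivation in the notation above, assuming the constraint $S^\top S=I$ and a thin SVD $A^t(\tilde{A}^{t+1})^\top=U\Sigma V^\top$ (round superscripts on $S,U,\Sigma,V$ are dropped, and $u_i$, $v_i$, $\sigma_i$ are shorthand introduced here for the singular vectors and values):

$$\begin{aligned}
\big\|S\tilde{A}^{t+1}-A^t\big\|_F^2 &= \big\|\tilde{A}^{t+1}\big\|_F^2+\big\|A^t\big\|_F^2-2\,\mathrm{tr}\big(S^\top A^t(\tilde{A}^{t+1})^\top\big) && \text{since } S^\top S=I,\\
\mathrm{tr}\big(S^\top U\Sigma V^\top\big) &= \sum_i \sigma_i\,(Sv_i)^\top u_i \;\le\; \sum_i \sigma_i && \text{since } \|Sv_i\|_2=\|u_i\|_2=1,\\
S^{\star}=UV^\top &\;\Rightarrow\; (S^{\star}v_i)^\top u_i = u_i^\top u_i = 1 \ \text{ for all } i.
\end{aligned}$$

Minimizing the distance is therefore the same as maximizing the trace term, whose upper bound $\sum_i\sigma_i$ is attained at $S^\star$. Every feasible $S$ has an error at least as large as that of $S^\star$, so $\Delta_{\text{proc}}^{t+1}\ge 0$, with equality when the server applies the closed-form solution.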

Key Experimental Results

Main Results (GLUE Accuracy, N=20, r=4, Non-IID ρ=0.5)

Base Model	Dataset	Ours (FLoRG)	FedIT	FeDeRA	FFA-LoRA	FedSA-LoRA	FedEx-LoRA
OPT-125M	MNLI	87.35	79.42	81.15	83.54	84.61	85.83
OPT-125M	WNLI	65.28	58.45	59.34	62.61	62.83	64.15
RoBERTa-large	MNLI	91.27	84.91	88.06	89.28	90.75	90.96
RoBERTa-large	RTE	71.26	64.25	67.12	68.49	69.93	70.98
Llama-3.2-3B	MNLI	93.15	87.24	89.83	91.05	92.38	92.74
Llama-3.2-3B	RTE	73.84	67.08	69.75	71.33	72.56	73.15

FLoRG outperforms five SOTA baselines across most settings; in the rows shown above it leads the strongest baseline (FedEx-LoRA) by roughly 0.3 to 1.5 percentage points.

Communication Overhead (QNLI, Total Parameters Transmitted to Reach Target Accuracy)

Base Model	Target Acc (%)	Ours (FLoRG)	FedIT	FedEx-LoRA
OPT-125M	80.00	8.2×10⁶	3.78×10⁷	1.25×10¹⁰
RoBERTa-large	85.00	1.45×10⁷	8.12×10⁷	2.96×10¹⁰

FLoRG achieves a communication reduction of up to 2041× compared to FedEx-LoRA.
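
The headline ratio can be recomputed directly from the table. The snippet below only divides the reported totals; the dictionary layout and variable names are illustrative.

```python
# Total parameters transmitted to reach the target accuracy (from the table above).
params = {
    "OPT-125M":      {"FLoRG": 8.2e6,  "FedIT": 3.78e7, "FedEx-LoRA": 1.25e10},
    "RoBERTa-large": {"FLoRG": 1.45e7, "FedIT": 8.12e7, "FedEx-LoRA": 2.96e10},
}

# Reduction factor of FLoRG relative to each baseline.
reduction = {
    model: {name: cost / costs["FLoRG"]
            for name, cost in costs.items() if name != "FLoRG"}
    for model, costs in params.items()
}

for model, ratios in reduction.items():
    for name, ratio in ratios.items():
        print(f"{model:14s} vs {name:10s}: {ratio:8.1f}x")
# OPT-125M       vs FedIT     :      4.6x
# OPT-125M       vs FedEx-LoRA:   1524.4x
# RoBERTa-large  vs FedIT     :      5.6x
# RoBERTa-large  vs FedEx-LoRA:   2041.4x
```

The 2041× figure comes from the RoBERTa-large row; against FedIT the reduction in these two rows is roughly 5×.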

Ablation Study (Impact of Procrustes Alignment, Accuracy)

Base	Setting	MRPC	MNLI	QNLI	WNLI	RTE
OPT-125M	w/ Alignment	86.54	87.20	89.69	65.41	68.77
OPT-125M	w/o Alignment	83.14	80.93	86.72	59.81	64.32
RoBERTa-large	w/ Alignment	89.87	91.39	92.48	66.41	71.40
RoBERTa-large	w/o Alignment	86.50	88.93	88.62	62.07	67.09

Key Findings

Procrustes Alignment is Critical: Dropping alignment costs 2.5–4.3 points on RoBERTa-large and 3.0–6.3 points on OPT-125M, reducing FLoRG to roughly FeDeRA levels.
Robustness Across Ranks: FLoRG consistently outperforms baselines for $r=2, 4, 8$.
Scalability: Server computation scales linearly with the number of LoRA layers and is decoupled from the number of clients.
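
The size of the alignment effect can be read off the ablation table with a few subtractions. The snippet below recomputes the per-task accuracy drop when alignment is removed; the data structure and names are illustrative.

```python
tasks = ["MRPC", "MNLI", "QNLI", "WNLI", "RTE"]

# Accuracy with and without Procrustes alignment (from the ablation table).
ablation = {
    "OPT-125M":      {"with": [86.54, 87.20, 89.69, 65.41, 68.77],
                      "without": [83.14, 80.93, 86.72, 59.81, 64.32]},
    "RoBERTa-large": {"with": [89.87, 91.39, 92.48, 66.41, 71.40],
                      "without": [86.50, 88.93, 88.62, 62.07, 67.09]},
}

# Per-task drop in accuracy points when alignment is removed.
drops = {
    model: {task: round(w - wo, 2)
            for task, w, wo in zip(tasks, rows["with"], rows["without"])}
    for model, rows in ablation.items()
}

for model, per_task in drops.items():
    print(model, per_task,
          f"range: {min(per_task.values())} to {max(per_task.values())}")
```

On both models the smallest drop is on a large-data task (QNLI or MNLI) and the largest exceeds four points, consistent with alignment mattering throughout rather than on a single benchmark.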

Highlights & Insights

Structure for Unbiasedness: Switching from two matrices to a "single matrix + fixed shared bases" makes aggregation naturally linear and unbiased, tackling the root cause (bilinearity) rather than patching it (e.g., FedEx-LoRA's residual matrices).
Complementarity: Gram aggregation ensures unbiasedness but introduces non-unique decomposition; Procrustes alignment removes exactly this side effect.
Theoretical Closure: The alignment term $\Delta_{\text{proc}}$ appears directly in the convergence bound, elevating an engineering trick to a provable convergence improvement.
Communication Gains: The 2041× compression stems from transmitting a single matrix and requiring fewer rounds due to faster convergence.

Limitations & Future Work

Task Diversity: Experiments are limited to NLU/QA, lacking validation on generative tasks, long-context scenarios, or instruction tuning.
Fixed Bases: The semi-orthogonal bases $L$ and $R$ are frozen. Whether extremely heterogeneous weights require adaptive bases remains unexplored.
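Server Overhead: While decoupled from the client count, the $O(Lk^3)$ server computation (with $L$ here presumably counting LoRA layers rather than denoting the basis matrix) may become significant for massive models with large $k$.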
FL Dynamics: The study assumes full client participation, leaving performance under partial participation, asynchrony, or Differential Privacy (DP) for future work.

Related Work & Insights

Standard Federated LoRA (FedIT): FLoRG structurally eliminates the aggregation bias inherent in FedAvg-style LoRA.
Aggregate-then-Decompose (FeDeRA): FLoRG addresses and solves the decomposition drift issue using Procrustes alignment.
Principle: A key takeaway for federated design is to transform variables such that they are closed under the aggregation operation (linear/PSD-preserving) before transmission.

Rating

Novelty: ⭐⭐⭐⭐ — The combination of single-matrix reparameterization and Gram-Procrustes alignment is a clean solution to two major bottlenecks in Federated LoRA.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive benchmarks across model sizes and baselines, though task variety could be broader.
Writing Quality: ⭐⭐⭐⭐ — Logical progression from error to drift to solution, with clear complexity analysis.
Value: ⭐⭐⭐⭐ — High practical value for bandwidth-constrained federated LLM fine-tuning due to massive communication compression.