Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models
Conference: ICML2025
arXiv: 2409.06277
Code: allen4747/Ferret
Area: AI Safety
Keywords: Federated Learning, Full-parameter Fine-tuning, Communication Compression, Shared Randomness, LLMs, Projection Reconstruction
TL;DR
This paper proposes Ferret, the first federated full-parameter fine-tuning method that combines first-order optimization with shared randomness. By projecting local updates into a low-dimensional space, Ferret cuts communication by roughly \(10^6\times\) relative to FedAvg and runs about \(6\times\) faster per round than FedKSeed, while keeping accuracy comparable to FedAvg.
Background & Motivation
- Key Challenge: When performing federated full-parameter fine-tuning on LLMs, federated learning (FL) must achieve a balance among data privacy, communication efficiency, and model accuracy.
- Limitations of PEFT: Although parameter-efficient fine-tuning (PEFT, such as LoRA) reduces communication overhead, it only updates a subset of parameters, failing to fully capture subtle differences in local data distributions and thus leading to accuracy degradation.
- Limitations of Zeroth-Order Methods: Zeroth-order optimization methods like FedKSeed reduce communication volume to \(\mathcal{O}(K)\) by transmitting scalar gradients, but suffer from three major issues:
  - High computational cost: estimating gradients requires \(K\) forward passes per round.
  - Slow convergence: significantly more communication rounds are required.
  - Biased gradient estimation: errors accumulate with local iterations \(T\).
- Limitations of FedAvg: First-order methods are computationally efficient and converge fast, but their communication overhead is \(\mathcal{O}(d)\) (where \(d\) is the number of parameters, often in the billions), making them impractical for LLMs.
Core Problem: Can a method be designed to simultaneously incorporate the computational efficiency and fast convergence of first-order methods, and the low communication overhead of zeroth-order methods?
Method
Overall Architecture
Ferret repeats three steps in each communication round \(r \in [R]\):
Step ①: Global Aggregation
Each client receives the random seeds \(s^{(i)}\) and projection coordinates \(\{\gamma_k^{(i)}\}_{k=1}^K\) from the other clients, regenerates the random bases \(\{\mathbf{v}_k^{(i)}\}_{k=1}^K\) from the shared seeds, reconstructs each local update from its coordinates, and aggregates the reconstructed updates into the global model.
Step ②: Local Update (First-Order Optimization)
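Each client performs \(T\) steps of local updates on its own data using standard gradient descent, i.e. \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla \mathcal{L}^{(j)}(\mathbf{w})\) with learning rate \(\eta\).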
Unlike zeroth-order methods that require hundreds of steps, first-order methods leverage more precise gradient information, achieving equivalent local update effects with very few iteration steps (e.g., \(T=10\)).
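The local step can be illustrated with a minimal sketch. It assumes a toy least-squares loss in place of an LLM objective; `local_update`, `loss`, the learning rate, and the synthetic data are all hypothetical names and values chosen for illustration, not taken from the paper or its code.

```python
import numpy as np

def loss(w, X, y):
    """Toy least-squares loss standing in for the client objective (assumption)."""
    r = X @ w - y
    return 0.5 * float(r @ r) / len(y)

def local_update(w_global, X, y, lr=0.1, T=10):
    """Run T plain gradient-descent steps and return the local update.

    The sign convention follows the text: delta = w_before - w_after.
    """
    w = w_global.copy()                      # never modify the global weights in place
    for _ in range(T):
        grad = X.T @ (X @ w - y) / len(y)    # exact gradient of the toy loss
        w -= lr * grad
    return w_global - w

# Synthetic client data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

w0 = np.zeros(4)
delta = local_update(w0, X, y)
print("loss before:", loss(w0, X, y))
print("loss after :", loss(w0 - delta, X, y))
```

The returned `delta` is the quantity that Step ③ compresses before transmission.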
Step ③: Projection Update (Dimension-Reduced Transmission)
The local update \(\Delta_r^{(j)} = \mathbf{w}_{r-1}^{(j)} - \mathbf{w}_r^{(j)}\) is computed and projected onto the \(K\) random bases \(\{\mathbf{v}_k^{(j)}\}_{k=1}^K\), yielding \(K\) scalar coordinates \(\{\gamma_k^{(j)}\}_{k=1}^K\). The projection is scaled by \(\rho\), a correction factor for the truncated normal distribution that keeps the reconstruction unbiased. Only the seed \(s^{(j)}\) and the \(K\) scalars are transmitted, reducing the communication volume from \(\mathcal{O}(d)\) to \(\mathcal{O}(K)\).
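The project-then-reconstruct round trip might look like the following NumPy sketch. Several things here are assumptions rather than details confirmed by the text: the correction factor is taken to be the per-entry variance of the truncated normal, the coordinates are computed as inner products divided by \(\rho K\), and the function names (`make_bases`, `project`, `reconstruct`) are hypothetical. The truncated normal is sampled by simple rejection from a uniform proposal to avoid extra dependencies.

```python
import math
import numpy as np

def truncated_normal(rng, shape, bound):
    """Sample N(0,1) restricted to [-bound, bound] via rejection sampling."""
    out = np.empty(int(np.prod(shape)))
    filled = 0
    while filled < out.size:
        n = out.size - filled
        x = rng.uniform(-bound, bound, size=n)
        # Accept x with probability exp(-x^2/2), the unnormalised normal density.
        keep = x[rng.uniform(size=n) < np.exp(-0.5 * x**2)]
        out[filled:filled + keep.size] = keep
        filled += keep.size
    return out.reshape(shape)

def entry_variance(bound):
    """Variance of N(0,1) truncated to [-bound, bound]; used as rho (assumption)."""
    phi = math.exp(-0.5 * bound**2) / math.sqrt(2.0 * math.pi)
    return 1.0 - 2.0 * bound * phi / math.erf(bound / math.sqrt(2.0))

def make_bases(seed, K, d):
    """Regenerate the K x d random bases from a shared seed."""
    rng = np.random.default_rng(seed)
    return truncated_normal(rng, (K, d), 1.0 / math.sqrt(d))

def project(delta, seed, K):
    """Client side: compress a d-dimensional update into K scalars."""
    d = delta.size
    rho = entry_variance(1.0 / math.sqrt(d))
    return make_bases(seed, K, d) @ delta / (rho * K)

def reconstruct(gamma, seed, d):
    """Receiver side: rebuild an estimate of the update without any matrix inverse."""
    return make_bases(seed, gamma.size, d).T @ gamma

# One client sends (seed, gamma): K + 1 numbers instead of d.
rng = np.random.default_rng(0)
delta = rng.normal(size=16)
gamma = project(delta, seed=7, K=8)
delta_hat = reconstruct(gamma, seed=7, d=16)

# Averaging reconstructions over many seeds should approach delta if unbiased.
avg = np.mean([reconstruct(project(delta, s, 8), s, 16) for s in range(2000)], axis=0)
rel_err = np.linalg.norm(avg - delta) / np.linalg.norm(delta)
print("relative error of the seed-averaged reconstruction:", rel_err)
```

A single reconstruction with \(K < d\) is noisy; the point of the sketch is that both sides derive identical bases from the seed, so only `gamma` and the seed need to travel.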
Key Designs
Choice of Random Bases: Each entry is sampled from a standard normal distribution \(v \sim \mathcal{N}(0,1)\) truncated to \(v \in [-1/\sqrt{d}, 1/\sqrt{d}]\). Since each of the \(d\) entries has magnitude at most \(1/\sqrt{d}\), this guarantees \(\|\mathbf{v}_k\| \leq 1\), enabling full-parameter updates while maintaining numerical stability.
Inversion-Free Reconstruction: Directly approximating \(\mathbf{V}^\top\mathbf{V} \approx \mathbf{I}_K\) avoids the \(\mathcal{O}(K^2d + K^3)\) computational overhead of matrix inversion, reducing it to \(\mathcal{O}(Kd)\).
Block-wise Reconstruction: Splitting the \(d\)-dimensional parameters into \(L\) blocks, where each block is independently projected and reconstructed, further reduces the computational complexity by a factor of \(L\) and scales down the memory complexity to \(\mathcal{O}(\max\{K_l, d_l\})\).
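The factor-of-\(L\) saving can be checked with a short cost count. This sketch assumes the total budget \(K\) and the dimension \(d\) are split evenly across blocks and counts one multiply-accumulate per basis entry; the helper names and the example sizes are hypothetical.

```python
def full_cost(K, d):
    """Multiply-accumulates to reconstruct with K bases of dimension d."""
    return K * d

def blockwise_cost(block_ks, block_dims):
    """Same count when block l uses K_l bases of dimension d_l."""
    return sum(k * dl for k, dl in zip(block_ks, block_dims))

# Illustrative sizes only: d = 1200 parameters, K = 64 coordinates, L = 4 blocks.
K, d, L = 64, 1200, 4
block_ks = [K // L] * L      # assumption: even split of the coordinate budget
block_dims = [d // L] * L    # assumption: equally sized parameter blocks

print("full      :", full_cost(K, d))
print("block-wise:", blockwise_cost(block_ks, block_dims))
print("ratio     :", full_cost(K, d) / blockwise_cost(block_ks, block_dims))
```

With an even split, each block costs \((K/L)(d/L)\), so the \(L\) blocks together cost \(Kd/L\). Uneven splits change the ratio.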
Theoretical Guarantees
- Unbiased Reconstruction (Theorem 1): \(\mathbb{E}[\widetilde{\Delta}] = \Delta\), avoiding the estimation bias of zeroth-order methods.
- Reconstruction Error (Theorem 2): The error rate is \(\widetilde{\mathcal{O}}(d/K)\), which shrinks in inverse proportion to \(K\) and does not accumulate with the number of local iteration steps \(T\).
- Convergence (Theorem 4): The communication round complexity is \(\mathcal{O}(1/\epsilon^2)\), which is asymptotically equivalent to standard SGD and independent of the parameter dimension \(d\).
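The intuition behind the unbiasedness claim can be sketched in one line. This is an illustrative derivation under stated assumptions, not the paper's proof: the entries of each \(\mathbf{v}_k\) are taken to be i.i.d. with zero mean and variance \(\rho\), and the reconstruction is assumed to have the form \(\widetilde{\Delta} = \frac{1}{\rho K}\sum_{k=1}^{K}\mathbf{v}_k\mathbf{v}_k^\top\Delta\).

\[
\mathbb{E}\big[\widetilde{\Delta}\big]
= \frac{1}{\rho K}\sum_{k=1}^{K}\mathbb{E}\big[\mathbf{v}_k\mathbf{v}_k^\top\big]\,\Delta
= \frac{1}{\rho K}\cdot K\,\rho\,\mathbf{I}_d\,\Delta
= \Delta
\]

The middle step uses \(\mathbb{E}[\mathbf{v}_k\mathbf{v}_k^\top] = \rho\,\mathbf{I}_d\), which holds because distinct entries are independent with zero mean, so all off-diagonal expectations vanish.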
Key Experimental Results
Accuracy Comparison (Rouge-L %)
| Method | NI (DataJuicer-1.3B) | NI (LLaMA-3B) | Dolly (DataJuicer-1.3B) | Dolly (LLaMA-3B) |
|---|---|---|---|---|
| FedIT (PEFT) | 22.30 | 28.13 | 30.80 | 33.23 |
| FedZO | 21.74 | 29.46 | 26.99 | 31.67 |
| FedKSeed | 22.33 | 29.77 | 30.91 | 34.56 |
| FedAvg | 23.95 | 32.11 | 29.67 | 30.98 |
| Ferret | 24.99 | 30.03 | 30.63 | 34.57 |
Experiments on Large Models (LLaMA2-7B / 13B)
| Method | CodeAlpaca (7B) | CodeAlpaca (13B) | GSM8K (7B) | GSM8K (13B) |
|---|---|---|---|---|
| FedKSeed | 8.33 | 10.70 | 28.26 | 33.67 |
| FedAvg | 15.41 | 14.68 | 38.30 | 39.82 |
| Ferret | 12.10 | 11.84 | 36.10 | 34.50 |
Scalability Comparison (Overhead per Round on LLaMA-3B)
| Method | Local Update (s) | Global Aggregation (s) | Total (s) | Communication Volume (No. of Parameters) |
|---|---|---|---|---|
| FedKSeed | 56.9 | 123.8 | 180.7 | 8.2×10³ |
| FedAvg | 1.8 | 0.3 | 2.1 | 6.0×10⁹ |
| Ferret | 5.6 (10.2×↓ vs FedKSeed) | 24.7 (5.0×↓) | 30.3 (6.0×↓) | 7.8×10³ (10⁶×↓ vs FedAvg) |
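The ratios quoted in the Ferret row follow directly from the other entries in the table. The short script below recomputes them; the numbers are copied from the table above, while the dictionary layout and helper names are only illustrative.

```python
import math

# Per-round overhead on LLaMA-3B, copied from the table above.
rows = {
    "FedKSeed": {"local_s": 56.9, "agg_s": 123.8, "comm": 8.2e3},
    "FedAvg":   {"local_s": 1.8,  "agg_s": 0.3,   "comm": 6.0e9},
    "Ferret":   {"local_s": 5.6,  "agg_s": 24.7,  "comm": 7.8e3},
}

def total_s(name):
    """Total seconds per round: local update plus global aggregation."""
    return rows[name]["local_s"] + rows[name]["agg_s"]

def ratio(baseline, method, key):
    """How many times larger the baseline value is than the method value."""
    return rows[baseline][key] / rows[method][key]

local_speedup = ratio("FedKSeed", "Ferret", "local_s")
agg_speedup = ratio("FedKSeed", "Ferret", "agg_s")
total_speedup = total_s("FedKSeed") / total_s("Ferret")
compression = ratio("FedAvg", "Ferret", "comm")

print(f"local update speedup vs FedKSeed : {local_speedup:.1f}x")
print(f"aggregation speedup vs FedKSeed  : {agg_speedup:.1f}x")
print(f"total speedup vs FedKSeed        : {total_speedup:.1f}x")
print(f"compression vs FedAvg            : about 10^{math.log10(compression):.1f}")
```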
Overhead per Round on LLaMA2-7B
| Method | Total (s) | Communication Volume |
|---|---|---|
| FedKSeed | 627.0 | 8.2×10³ |
| FedAvg | 6.5 | 1.4×10¹⁰ |
| Ferret | 97.2 (6.5×↓ vs FedKSeed) | 6.4×10³ (10⁶×↓ vs FedAvg) |
Ferret converges in only 12 rounds on the NI dataset (compared to 40 rounds for FedKSeed), achieving a 3.3× reduction in the number of convergence rounds.
Highlights & Insights
- Elegant Fusion of First-Order and Zeroth-Order Advantages: Integrating first-order gradients to ensure computational efficiency and convergence speed while utilizing random projection and shared randomness for communication compression, yielding the best of both worlds.
- Theoretical Breakthrough in Unbiased Reconstruction: Proving that the reconstruction remains unbiased under the approximation \(\mathbf{V}^\top\mathbf{V} \approx \mathbf{I}_K\) and that the error does not accumulate over iterations—offering a fundamental advantage over the biased estimations of zeroth-order methods.
- Block-wise Reconstruction Strategy: Reducing computational complexity by an additional \(1/L\) factor, allowing the method to scale effectively to 7B and 13B models.
- High Practicality: Fully compatible with arbitrary gradient optimizers (such as AdamW) and straightforward to integrate into existing LLM training pipelines.
- Enhanced Privacy: Transmitting only the seeds and low-dimensional coordinates, providing superior privacy preservation compared to FedAvg, which transmits full gradients or parameters.
Limitations & Future Work
- Accuracy Gap on Complex Tasks: On complex tasks like CodeAlpaca and GSM8K, Ferret still underperforms FedAvg by roughly 2 to 5 points in the LLaMA2-7B / 13B results (from 2.2 on GSM8K 7B to 5.3 on GSM8K 13B), suggesting that the information loss from projection reconstruction is more pronounced in challenging tasks.
- Higher Per-Round Computation than FedAvg: Although significantly faster than FedKSeed, Ferret still requires 30.3 s per round on LLaMA-3B compared to 2.1 s for FedAvg (approximately 14× slower), with most of the overhead (24.7 s) coming from the reconstruction step during global aggregation.
- Theoretical Analysis Limited to Homogeneous Settings: The convergence analysis is only provided under the IID scenario where \(\mathcal{L}^{(i)} = \mathcal{L}\), lacking rigorous guarantees in heterogeneous settings.
- Selection of Hyperparameter \(K\): A small \(K\) leads to large reconstruction errors that degrade accuracy, while an excessively large \(K\) diminishes the communication compression benefits, necessitating task-specific tuning.
- Unvalidated under Large-Scale Client Scenarios: The experiments only evaluate a small number of active clients (5% sampling rate per round); performance under the scale of hundreds or thousands of clients remains unexplored.
Rating
- Novelty: ⭐⭐⭐⭐ — First to combine first-order optimization with shared randomness projection for federated full-parameter fine-tuning, featuring an elegant design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers multiple models (1.3B to 13B) across various datasets, including scalability and ablation analyses, though large-scale client experiments are missing.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured paper with complementary theoretical formulation and empirical evaluations, along with intuitive illustrations.
- Value: ⭐⭐⭐⭐ — Provides a scalable full-parameter framework for federated LLM fine-tuning, successfully balancing efficiency, communication, and accuracy.