RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates¶
Basic Information¶
Conference: ACL 2025
Code: Kowsher/RoCoFT
Institution: Nokia Bell Labs / UCF / UIC
Area: Parameter-Efficient Fine-Tuning / LLM
Keywords: PEFT, LoRA, row-column updates, NTK, parameter-efficient finetuning
TL;DR¶
Proposes RoCoFT, a very simple parameter-efficient fine-tuning (PEFT) method that updates only a small subset of rows or columns in the Transformer weight matrices. It achieves accuracy comparable to state-of-the-art PEFT methods such as LoRA on GLUE, QA, summarization, and commonsense/mathematical reasoning, while reducing memory and computation overhead. The method's effectiveness is explained theoretically via the Neural Tangent Kernel (NTK).
Background & Motivation¶
- The Dilemma of Full Fine-Tuning: As LLM parameter scales grow (from billions to hundreds of billions), storing a full copy of the model for each downstream task becomes impractical. Full fine-tuning is also prone to overfitting and catastrophic forgetting.
- Development of PEFT Methods: LoRA achieves efficient fine-tuning through low-rank matrix decomposition, Adapters insert additional modules, and Prefix/Prompt Tuning prepends learnable vectors. While effective, these methods still introduce extra parameters or architectural modifications.
- Core Problem: Can a simpler PEFT method be designed? Simpler methods would not only improve efficiency but also help understand why PEFT is effective in the first place.
- Key Observation: The pre-training phase has already captured most critical features, and fine-tuning only needs to adjust a very small fraction of parameters.
Method¶
Core Idea¶
RoCoFT is straightforward: it updates only a small number of rows or columns in each weight matrix and keeps all remaining parameters frozen.
Mathematical Formulation¶
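For the weight matrices \(\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v, \mathbf{W}_{ff}\) in Transformers, the RoCoFT update can be written as:
\[
\mathbf{W} = \mathbf{W}_0 + \mathbf{R} \quad \text{(row update)} \qquad \text{or} \qquad \mathbf{W} = \mathbf{W}_0 + \mathbf{C} \quad \text{(column update)}
\]
where \(\mathbf{W}_0\) is the frozen pre-trained weight, and \(\mathbf{R}\) and \(\mathbf{C}\) are restricted weight matrices with at most \(r\) non-zero rows or columns, respectively.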
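The restricted update can be sketched as a gradient step that touches only the selected rows (or columns). This is a minimal NumPy illustration, not the authors' implementation: the helper names `rocoft_row_step` and `rocoft_col_step`, the plain SGD rule, and the random stand-in gradient are all assumptions made for the example.

```python
import numpy as np

def rocoft_row_step(W, grad, rows, lr=0.1):
    """One SGD step that changes only the selected rows of W (hypothetical helper)."""
    W_new = W.copy()
    W_new[rows, :] -= lr * grad[rows, :]
    return W_new

def rocoft_col_step(W, grad, cols, lr=0.1):
    """One SGD step that changes only the selected columns of W (hypothetical helper)."""
    W_new = W.copy()
    W_new[:, cols] -= lr * grad[:, cols]
    return W_new

rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 4))     # stand-in for a pre-trained weight matrix
grad = rng.normal(size=(6, 4))   # stand-in for a loss gradient
r = 2
selected = np.arange(r)          # first r rows / columns

W_row = rocoft_row_step(W0, grad, selected)
R = W_row - W0                   # update matrix R: non-zero only in the selected rows
W_col = rocoft_col_step(W0, grad, selected)
C = W_col - W0                   # update matrix C: non-zero only in the selected columns

nonzero_rows = np.flatnonzero(np.abs(R).sum(axis=1) > 0)
nonzero_cols = np.flatnonzero(np.abs(C).sum(axis=0) > 0)
print(nonzero_rows, nonzero_cols)
```

Because \(\mathbf{R}\) has at most \(r\) non-zero rows, its matrix rank is also at most \(r\), which is what makes the comparison with a rank-\(r\) LoRA update natural.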
Comparison with LoRA¶
| Characteristic | LoRA | RoCoFT |
|---|---|---|
| Update Form | \(\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}\) | \(\mathbf{W} = \mathbf{W}_0 + \mathbf{R}\) (or \(\mathbf{C}\)) |
| Extra Parameters | Requires extra matrices \(\mathbf{A}\), \(\mathbf{B}\) | No extra parameters, updated in-place |
| Parameter Count (rank=r, d×k matrix) | \(r(d+k)\) | \(r \cdot k\) (row) or \(r \cdot d\) (column) |
| Forward Computation | Requires matrix multiplication \(\mathbf{B}\mathbf{A}\) | No extra computation |
| Initialization issues | Must consider initialization of \(\mathbf{A}\), \(\mathbf{B}\) | No initialization issues |
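The parameter-count row of the table can be checked with a few lines of arithmetic. The sketch below uses the formulas from the table; the function names and the example shape (768 × 3072) are illustrative assumptions, not values taken from the paper.

```python
def lora_params(d, k, r):
    """Trainable parameters of a rank-r LoRA update on a d x k matrix: r(d + k)."""
    return r * (d + k)

def rocoft_params(d, k, r, mode="row"):
    """Trainable parameters when r rows (k entries each) or r columns (d entries each) are updated."""
    if mode == "row":
        return r * k
    if mode == "column":
        return r * d
    raise ValueError(f"unknown mode: {mode}")

d, k, r = 768, 3072, 3   # hypothetical matrix shape and rank
print("LoRA:         ", lora_params(d, k, r))
print("RoCoFT row:   ", rocoft_params(d, k, r, "row"))
print("RoCoFT column:", rocoft_params(d, k, r, "column"))
```

At equal \(r\), the row and column counts always sum to the LoRA count, so either RoCoFT variant uses strictly fewer trainable parameters than LoRA on the same matrix.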
Row/Column Selection Strategy¶
- Default Strategy: Select rows or columns sequentially from the beginning.
- Key Discovery: Different selection strategies have minimal impact on performance; that is, any row or column can produce similar results, demonstrating the robustness of this method.
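The selection strategies compared in the ablation (from the beginning, from the end, uniform interval, random) amount to different ways of choosing \(r\) indices out of \(n\). A minimal sketch follows; the function name `select_indices`, the strategy labels, and the exact spacing rule for the uniform case are assumptions for illustration.

```python
import random

def select_indices(n, r, strategy="first", seed=0):
    """Pick r row/column indices out of n (hypothetical helper)."""
    if not 0 < r <= n:
        raise ValueError("r must satisfy 0 < r <= n")
    if strategy == "first":        # default: take indices from the beginning
        return list(range(r))
    if strategy == "last":         # take indices from the end
        return list(range(n - r, n))
    if strategy == "uniform":      # evenly spaced indices, starting at 0
        step = n // r
        return [i * step for i in range(r)]
    if strategy == "random":       # seeded random sample without replacement
        return sorted(random.Random(seed).sample(range(n), r))
    raise ValueError(f"unknown strategy: {strategy}")

for name in ("first", "last", "uniform", "random"):
    print(name, select_indices(768, 3, name))
```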
NTK Theoretical Analysis¶
- The Neural Tangent Kernel (NTK) theory is leveraged to explain the effectiveness of RoCoFT.
- Core Finding: The NTK constructed from a small subset of row/column parameters is numerically close to the full-parameter NTK.
- Validated using NTK kernel logistic regression on multiple tasks, where the classification performance of the restricted parameter set's kernel is comparable to that of the full-parameter kernel.
- This indicates: the pre-training phase has already captured most of the critical features required for fine-tuning.
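To make the comparison between a full-parameter and a restricted-parameter kernel concrete, the sketch below computes an empirical NTK Gram matrix \(K_{ij} = \langle \nabla_\theta f(x_i), \nabla_\theta f(x_j) \rangle\) for a toy model \(f(x) = v^\top \tanh(Wx)\), once over all rows of \(W\) and once over three rows only. The toy model, the function names, and the random data are assumptions; this does not reproduce the paper's experiments and makes no claim about how close the two kernels are in real networks.

```python
import numpy as np

def jacobian_wrt_W(W, v, x):
    """Gradient of f(x) = v . tanh(W x) with respect to W, shape (h, d)."""
    pre = W @ x
    return (v * (1.0 - np.tanh(pre) ** 2))[:, None] * x[None, :]

def empirical_ntk(W, v, X, rows=None):
    """Gram matrix of gradient inner products, optionally restricted to some rows of W."""
    J = np.stack([jacobian_wrt_W(W, v, x) for x in X])   # shape (n, h, d)
    if rows is not None:
        J = J[:, rows, :]                                 # keep only the selected rows
    J = J.reshape(len(X), -1)
    return J @ J.T

rng = np.random.default_rng(0)
h, d, n = 16, 8, 5
W = rng.normal(size=(h, d)) / np.sqrt(d)
v = rng.normal(size=h) / np.sqrt(h)
X = rng.normal(size=(n, d))

K_full = empirical_ntk(W, v, X)                  # kernel over all rows of W
K_rows = empirical_ntk(W, v, X, rows=[0, 1, 2])  # kernel over the first three rows
print(K_full.shape, K_rows.shape)
```

In this toy setting the restricted kernel is a sum over a subset of the per-row terms that make up the full kernel, so `K_full - K_rows` is positive semidefinite; comparing the two after normalization is one way to quantify how much of the full kernel a few rows capture.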
Experiments¶
Experimental Setup¶
- Medium-sized Models: RoBERTa-Base/Large (GLUE), DeBERTa-v3 (SQuAD), BART-Large (Summarization).
- Large Models: Bloom-7B, GPT-J-6B, LLaMA2-7B, LLaMA2-13B (commonsense reasoning + mathematical reasoning, across 13 datasets in total).
- Baselines: LoRA, AdaLoRA, IA3, Prefix-Tuning, Prompt-Tuning, BitFit, Adapter, MAM Adapter, LoRA-XS, VeRA, Diff Pruning, etc.
GLUE Benchmark Results (RoBERTa-Base)¶
| Method | Trainable Params | Avg Score |
|---|---|---|
| Full FT | 124.6M | 83.56 |
| LoRA (r=8) | 0.89M | 84.32 |
| AdaLoRA | 1.03M | 84.06 |
| BitFit | 0.083M | 84.22 |
| SFT | 0.90M | 85.03 |
| RoCoFT3-Row | 0.249M | 85.65 |
| RoCoFT3-Column | 0.249M | 85.55 |
RoCoFT3-Row achieves the highest average score among the listed methods (85.65) with only 0.249M trainable parameters, approximately 28% of LoRA's 0.89M.
LLM Reasoning Task Results (LLaMA2-7B)¶
| Method | Trainable Params | Commonsense Reasoning Avg | Mathematical Reasoning Avg |
|---|---|---|---|
| LoRA | 24.30M | 75.53 | 78.52 |
| AdaLoRA | 24.90M | 74.81 | 77.48 |
| RoCoFT3-Row | 13.47M | 76.46 | 79.54 |
| RoCoFT3-Column | 13.47M | 76.45 | 79.35 |
On LLaMA2-7B, RoCoFT outperforms LoRA with only about 55% of the parameters.
LLaMA2-13B Results (Selected)¶
- RoCoFT3-Row also performs well on the 13B model, outperforming LoRA and AdaLoRA on multiple tasks.
- Trainable parameters are approximately 24M (compared to 44M for LoRA), a reduction of about 45%.
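The parameter ratios quoted in the results can be verified directly from the reported trainable-parameter counts. The sketch below only re-does that arithmetic; the helper name `ratio_vs_lora` is an assumption, and the 13B figures are the approximate values given in the text.

```python
def ratio_vs_lora(rocoft_m, lora_m):
    """Fraction of LoRA's trainable parameters that RoCoFT uses (both in millions)."""
    return rocoft_m / lora_m

# (RoCoFT3 params, LoRA params) in millions, as reported in the tables and text
reported = {
    "RoBERTa-Base (GLUE)": (0.249, 0.89),
    "LLaMA2-7B": (13.47, 24.30),
    "LLaMA2-13B (approx.)": (24.0, 44.0),
}

for name, (rocoft_m, lora_m) in reported.items():
    ratio = ratio_vs_lora(rocoft_m, lora_m)
    print(f"{name}: {ratio:.0%} of LoRA's parameters, {1 - ratio:.0%} fewer")
```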
Ablation Study¶
- Row vs Column: Row-wise and column-wise updates exhibit similar performance, with no significant differences.
- Impact of Selection Strategy: Strategies such as random selection, selecting from the beginning, selecting from the end, and uniform interval selection yield similar performance, validating the robustness of the method.
- Impact of Rank: The performance progressively improves as the rank increases from 1 to 3, with Rank=3 being sufficient to achieve outstanding performance.
- Application Depth: Applying updates to all weight matrices (Q, K, V, and FFN) yields the best results.
NTK Experimental Verification¶
- The kernel classification performance of the full-parameter NTK is compared against the restricted-parameter NTK on RoBERTa-Base.
- The performance gap between the restricted NTK and the full-parameter NTK on GLUE tasks is only 1-2%.
- From a kernel-method perspective, this supports the claim that a few rows or columns are sufficient to capture the core information needed for fine-tuning.
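Kernel logistic regression on a precomputed Gram matrix, the kind of classifier used for this comparison, can be sketched as follows. This is a minimal illustration under stated assumptions: a linear kernel on six hand-picked 2-D points stands in for an NTK Gram matrix, and the function names, learning rate, and regularization strength are hypothetical choices, not the paper's settings.

```python
import numpy as np

def fit_kernel_logreg(K, y, lr=0.01, steps=500, reg=1e-3):
    """Gradient descent on the regularized logistic loss for scores f = K @ alpha, labels in {0, 1}."""
    alpha = np.zeros(len(y))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(K @ alpha)))          # predicted probabilities
        grad = K @ (p - y) / len(y) + reg * (K @ alpha)  # loss gradient plus RKHS-norm penalty
        alpha -= lr * grad
    return alpha

def predict(K_test_train, alpha):
    """Predict class 1 where the kernel score is positive."""
    return (K_test_train @ alpha > 0).astype(int)

# Toy, linearly separable data; a real experiment would use an NTK Gram matrix instead.
X = np.array([[2.0, 0.0], [3.0, 1.0], [2.5, -1.0],
              [-2.0, 0.0], [-3.0, 1.0], [-2.5, -1.0]])
y = np.array([1, 1, 1, 0, 0, 0])
K = X @ X.T                                   # stand-in kernel (linear)

alpha = fit_kernel_logreg(K, y)
train_pred = predict(K, alpha)
X_test = np.array([[1.5, 0.5], [-1.5, 0.5]])
test_pred = predict(X_test @ X.T, alpha)
print(train_pred, test_pred)
```

Running the same routine once with a full-parameter kernel and once with a restricted-parameter kernel, then comparing held-out accuracy, mirrors the structure of the comparison described above.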
Highlights & Insights¶
- Ultra-Simple Design: This is perhaps the simplest known PEFT method: it introduces no extra parameters or modules and directly updates a subset of the original weight matrices.
- Theoretical Backing: The NTK analysis provides an elegant theoretical explanation rather than purely empirical observations.
- Robustness: It is insensitive to the row/column selection strategy, reducing the hyperparameter search burden.
- Efficiency Advantages: Unlike LoRA, there is no extra matrix multiplication, no initialization issues, and in-place updates reduce memory overhead.
- Profound Insight: Empirical results suggest that pre-training has already learned the vast majority of critical features, and fine-tuning merely adjusts the directions of a tiny subset of parameters.
Limitations & Future Work¶
- Theoretical Limitations: Technically, the NTK theory applies strictly to infinitely wide networks, presenting only an approximation for real-world networks of finite width.
- Rank Ceiling: When a larger capacity is needed to adapt to downstream tasks (e.g., tasks with significant domain gaps), a small number of rows/columns may be insufficient.
- Untested on Larger Models: The experiments are conducted on models up to 13B, and have not been validated on models with 70B+ parameters.
- Task Coverage: The evaluation focuses on NLU, summarization, and reasoning tasks, leaving more open-ended generative tasks (such as safety-oriented dialogue or creative writing) unexplored.
- Integration with Other Techniques: Combining RoCoFT with techniques such as quantization (e.g., QLoRA) has not been explored.
Related Work & Insights¶
- Low-Rank Methods: LoRA (Hu et al., 2021), AdaLoRA (Zhang et al., 2023), VeRA (Kopiczko et al., 2023), LoRA-XS (Bałazy et al., 2024).
- Sparse Fine-Tuning: Diff Pruning (Guo et al., 2021), SFT (Ansell et al., 2024), Fish Mask (Sung et al., 2021).
- Other PEFT Methods: BitFit (Zaken et al., 2021), LayerNorm Tuning (Zhao et al., 2023), IA3 (Liu et al., 2022).
- NTK Theory: Jacot et al. (2018) introduce the NTK; Malladi et al. (2023) use it to analyze LLM fine-tuning.
Rating¶
⭐⭐⭐⭐ (4/5)
- Novelty: The method is extremely simple yet highly effective, raising a valuable question: "how simple can PEFT actually be?" (+1)
- Theoretical Depth: The NTK analysis provides a solid theoretical foundation for the method, moving beyond purely empirical work. (+0.5)
- Experimental Thoroughness: Extensive experiments from medium-sized to large models cover various tasks including NLU, reasoning, and summarization. (+0.5)
- Value: The implementation is straightforward, incurs no extra overhead, and is robust to selection strategies. (+0.5)
- Deductions: No validation on ultra-large models, no exploration of combinations with other techniques such as QLoRA, and unknown applicability to more complex generative tasks. (-1)