RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates¶

Basic Information¶

Conference: ACL 2025
Code: Kowsher/RoCoFT
Institution: Nokia Bell Labs / UCF / UIC
Area: Parameter-Efficient Fine-Tuning / LLM
Keywords: PEFT, LoRA, row-column updates, NTK, parameter-efficient finetuning

TL;DR¶

Proposes RoCoFT, an extremely simple parameter-efficient fine-tuning (PEFT) method that updates only a small subset of rows or columns in the Transformer weight matrices. It achieves accuracy comparable to state-of-the-art PEFT methods such as LoRA on GLUE, QA, summarization, and commonsense/mathematical reasoning tasks, while reducing memory and computation overhead. Neural Tangent Kernel (NTK) theory is used to explain why the method works.

Background & Motivation¶

The Dilemma of Full Fine-Tuning: As LLM parameter scales grow (from billions to hundreds of billions), storing a full copy of the model for each downstream task becomes impractical. Full fine-tuning is also prone to overfitting and catastrophic forgetting.
Development of PEFT Methods: LoRA achieves efficient fine-tuning through low-rank matrix decomposition, Adapters insert additional modules, and Prefix/Prompt Tuning appends learnable vectors. While effective, these methods still introduce extra parameters or architectural modifications.
Core Problem: Can a simpler PEFT method be designed? Simpler methods would not only improve efficiency but also help understand why PEFT is effective in the first place.
Key Observation: The pre-training phase has already captured most critical features, and fine-tuning only needs to adjust a very small fraction of parameters.

Method¶

Core Idea¶

The methodology of RoCoFT is highly straightforward: it only updates a small number of row or column parameters in the weight matrices, while keeping the remaining parameters completely frozen.

Mathematical Formulation¶

For the weight matrices $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v, \mathbf{W}_{ff}$ in Transformers, the updates of RoCoFT can be formulated as:

\[\mathbf{W} = \mathbf{W}_0 + \mathbf{R} \quad \text{(row update)}\]

\[\mathbf{W} = \mathbf{W}_0 + \mathbf{C} \quad \text{(column update)}\]

Here $\mathbf{W}_0$ denotes the pre-trained weights, and $\mathbf{R}$ and $\mathbf{C}$ are update matrices restricted to at most $r$ non-zero rows and at most $r$ non-zero columns, respectively.
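The restricted update can be illustrated with a minimal, dependency-free sketch. It assumes plain SGD and a list-of-lists matrix; the function name `rocoft_step`, the learning rate, and the toy gradient are illustrative choices and are not taken from the paper's code.

```python
def rocoft_step(weight, grad, indices, lr=0.1, mode="row"):
    """Apply one SGD step only to the selected rows (or columns) of `weight`.

    Entries outside the selection are copied unchanged, so the implied update
    new_weight - weight has at most len(indices) non-zero rows (or columns).
    """
    if mode not in ("row", "column"):
        raise ValueError("mode must be 'row' or 'column'")
    selected = set(indices)
    new_weight = []
    for i, row in enumerate(weight):
        new_row = []
        for j, value in enumerate(row):
            trainable = (i in selected) if mode == "row" else (j in selected)
            new_row.append(value - lr * grad[i][j] if trainable else value)
        new_weight.append(new_row)
    return new_weight


# Toy 3x4 matrix standing in for W_0 and a dense gradient of ones (both hypothetical).
w0 = [[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0]]
g = [[1.0] * 4 for _ in range(3)]

w_row = rocoft_step(w0, g, indices=[0], lr=0.5, mode="row")     # only row 0 moves
w_col = rocoft_step(w0, g, indices=[1], lr=0.5, mode="column")  # only column 1 moves
```

In a deep learning framework the same effect would more likely be obtained by masking gradients or by marking only the selected slices as trainable; which of these the authors use is not stated in these notes.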

Comparison with LoRA¶

Characteristic	LoRA	RoCoFT
Update Form	$\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$	$\mathbf{W} = \mathbf{W}_0 + \mathbf{R}$ (or $\mathbf{C}$)
Extra Parameters	Requires extra matrices $\mathbf{A}$, $\mathbf{B}$	No extra parameters, updated in-place
Parameter Count (rank=r, d×k matrix)	$r(d+k)$	$r \cdot k$ (row) or $r \cdot d$ (column)
Forward Computation	Requires matrix multiplication $\mathbf{B}\mathbf{A}$	No extra computation
Initialization issues	Must consider initialization of $\mathbf{A}$, $\mathbf{B}$	No initialization issues
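The parameter-count row of the table can be turned into a small calculation. The sketch below simply evaluates the two formulas; the 768×768 matrix size and the helper names are assumptions chosen for illustration.

```python
def lora_params(d, k, r):
    """Trainable parameters LoRA adds to a d x k matrix (B is d x r, A is r x k)."""
    return r * (d + k)


def rocoft_params(d, k, r, mode="row"):
    """Trainable entries when r rows (length k each) or r columns (length d each) are updated."""
    if mode == "row":
        return r * k
    if mode == "column":
        return r * d
    raise ValueError("mode must be 'row' or 'column'")


d, k = 768, 768  # hypothetical square projection matrix
print(lora_params(d, k, r=8))    # 12288
print(rocoft_params(d, k, r=3))  # 2304
```

At the same $r$ and for a square matrix, the row or column update trains half as many entries as LoRA, since $r \cdot k$ is half of $r(d+k)$ when $d = k$.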

Row/Column Selection Strategy¶

Default Strategy: Select rows or columns sequentially from the beginning.
Key Discovery: Different selection strategies have minimal impact on performance; that is, any row or column can produce similar results, demonstrating the robustness of this method.
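A sketch of how such index selection might be written follows. The strategy names mirror those listed in the ablation study (beginning, end, random, uniform interval); the function `select_indices`, its signature, and the exact spacing rule for the uniform case are assumptions, not the authors' implementation.

```python
import random


def select_indices(n, r, strategy="first", seed=0):
    """Pick r of the n row (or column) indices according to a simple strategy."""
    if not 0 < r <= n:
        raise ValueError("need 0 < r <= n")
    if strategy == "first":      # default: take rows/columns from the beginning
        return list(range(r))
    if strategy == "last":       # take them from the end
        return list(range(n - r, n))
    if strategy == "random":     # seeded random choice without replacement
        return sorted(random.Random(seed).sample(range(n), r))
    if strategy == "uniform":    # evenly spaced indices
        step = n // r
        return [i * step for i in range(r)]
    raise ValueError(f"unknown strategy: {strategy}")


print(select_indices(768, 3, "first"))    # [0, 1, 2]
print(select_indices(768, 3, "last"))     # [765, 766, 767]
print(select_indices(768, 3, "uniform"))  # [0, 256, 512]
```

Because the reported results are insensitive to the strategy, the default of taking the first $r$ indices is the simplest choice and needs no random seed.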

NTK Theoretical Analysis¶

The Neural Tangent Kernel (NTK) theory is leveraged to explain the effectiveness of RoCoFT.
Core Finding: The NTK constructed from a small subset of row/column parameters is numerically close to the full-parameter NTK.
This is validated with NTK kernel logistic regression on multiple tasks: the kernel built from the restricted parameter set classifies about as well as the full-parameter kernel.
This suggests that the pre-training phase has already captured most of the critical features required for fine-tuning.
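For reference, the comparison can be written with the standard empirical NTK definition. The notation below ($f$ for the model output, $\theta_0$ for the pre-trained parameters, $S$ for the selected row/column entries and $S^c$ for the frozen rest) is an assumption made here, and the paper's normalization may differ.

\[K(x, x') = \left\langle \nabla_{\theta} f(x; \theta_0),\, \nabla_{\theta} f(x'; \theta_0) \right\rangle\]

\[K_S(x, x') = \left\langle \nabla_{\theta_S} f(x; \theta_0),\, \nabla_{\theta_S} f(x'; \theta_0) \right\rangle, \qquad K = K_S + K_{S^c}\]

The decomposition holds because the inner product splits over disjoint parameter blocks. The empirical claim is that $K_S$ alone, built from only a few rows or columns, already behaves much like $K$ in kernel classification.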

Experiments¶

Experimental Setup¶

Medium-sized Models: RoBERTa-Base/Large (GLUE), DeBERTa-v3 (SQuAD), BART-Large (Summarization).
Large Models: Bloom-7B, GPT-J-6B, LLaMA2-7B, LLaMA2-13B (commonsense reasoning + mathematical reasoning, across 13 datasets in total).
Baselines: LoRA, AdaLoRA, IA3, Prefix-Tuning, Prompt-Tuning, BitFit, Adapter, MAM Adapter, LoRA-XS, VeRA, Diff Pruning, etc.

GLUE Benchmark Results (RoBERTa-Base)¶

Method	Trainable Params	Avg Score
Full FT	124.6M	83.56
LoRA (r=8)	0.89M	84.32
AdaLoRA	1.03M	84.06
BitFit	0.083M	84.22
SFT	0.90M	85.03
RoCoFT3-Row	0.249M	85.65
RoCoFT3-Column	0.249M	85.55

RoCoFT3 achieves the highest average score among the methods listed, with only 0.249M trainable parameters (about 28% of LoRA's 0.89M).
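As a plausibility check, the 0.249M figure can be approximately reproduced by one possible accounting. This is an assumption, not something stated in these notes: it supposes that $r = 3$ rows (or columns) are trained in six matrices per layer, including the attention output projection and both feed-forward matrices, and that the classifier head is not counted.

```python
# RoBERTa-Base sizes (hidden width, feed-forward width, number of layers); r = 3 as in RoCoFT3.
hidden, ffn, layers, r = 768, 3072, 12, 3

# (rows, cols) of each matrix ASSUMED to be updated in one layer.
matrices = {
    "query": (hidden, hidden),
    "key": (hidden, hidden),
    "value": (hidden, hidden),
    "attn_out": (hidden, hidden),  # assumption: not listed among the matrices in the text
    "ffn_up": (ffn, hidden),
    "ffn_down": (hidden, ffn),
}


def trainable(shape, r, mode):
    """Entries trained when r rows (length cols) or r columns (length rows) are updated."""
    rows, cols = shape
    return r * cols if mode == "row" else r * rows


row_total = layers * sum(trainable(s, r, "row") for s in matrices.values())
col_total = layers * sum(trainable(s, r, "column") for s in matrices.values())
print(row_total, col_total)  # 248832 248832
```

Under these assumptions both variants come to 248,832 trainable entries, which rounds to the 0.249M reported in the table for the row and column versions alike.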

LLM Reasoning Task Results (LLaMA2-7B)¶

Method	Trainable Params	Commonsense Reasoning Avg	Mathematical Reasoning Avg
LoRA	24.30M	75.53	78.52
AdaLoRA	24.90M	74.81	77.48
RoCoFT3-Row	13.47M	76.46	79.54
RoCoFT3-Column	13.47M	76.45	79.35

On LLaMA2-7B, RoCoFT outperforms LoRA on both averages with only about 55% of LoRA's trainable parameters (13.47M vs. 24.30M).

LLaMA2-13B Results (Selected)¶

RoCoFT3-Row also performs well on the 13B model, outperforming LoRA and AdaLoRA on multiple tasks.
It trains approximately 24M parameters, compared to 44M for LoRA, a reduction of about 45%.
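The parameter ratios quoted in the results can be checked with a few lines of arithmetic. The sketch uses only the rounded counts (in millions) given in the tables and text; the dictionary layout is an illustrative choice.

```python
# (RoCoFT, LoRA) trainable parameters in millions, as quoted in the results.
counts_m = {
    "RoBERTa-Base (GLUE)": (0.249, 0.89),
    "LLaMA2-7B": (13.47, 24.30),
    "LLaMA2-13B": (24.0, 44.0),
}

ratios = {name: rocoft / lora for name, (rocoft, lora) in counts_m.items()}

for name, ratio in ratios.items():
    print(f"{name}: RoCoFT trains {ratio:.0%} of LoRA's parameters ({1 - ratio:.0%} fewer)")
```

The ratios round to 28%, 55%, and 55% respectively, so on both LLaMA2 models RoCoFT trains roughly 45% fewer parameters than LoRA.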

Ablation Study¶

Row vs Column: Row-wise and column-wise updates exhibit similar performance, with no significant differences.
Impact of Selection Strategy: Strategies such as random selection, selecting from the beginning, selecting from the end, and uniform interval selection yield similar performance, validating the robustness of the method.
Impact of Rank: The performance progressively improves as the rank increases from 1 to 3, with Rank=3 being sufficient to achieve outstanding performance.
Application Depth: Applying updates to all weight matrices (Q, K, V, and FFN) yields the best results.

NTK Experimental Verification¶

The kernel classification performance of the full-parameter NTK is compared against the restricted-parameter NTK on RoBERTa-Base.
The performance gap between the restricted NTK and the full-parameter NTK on GLUE tasks is only 1-2%.
From a kernel-method perspective, this supports the claim that a few rows or columns are sufficient to capture the core information needed for fine-tuning.

Highlights & Insights¶

Ultra-Simple Design: This is perhaps the simplest known PEFT method—it introduces no extra parameters or modules, directly updating a subset of the original weight matrices.
Theoretical Backing: The NTK analysis provides an elegant theoretical explanation rather than purely empirical observations.
Robustness: It is insensitive to the row/column selection strategy, reducing the hyperparameter search burden.
Efficiency Advantages: Unlike LoRA, there is no extra matrix multiplication, no initialization issues, and in-place updates reduce memory overhead.
Profound Insight: Empirical results suggest that pre-training has already learned the vast majority of critical features, and fine-tuning merely adjusts the directions of a tiny subset of parameters.

Limitations & Future Work¶

Theoretical Limitations: NTK theory holds exactly only in the infinite-width limit, so it is only an approximation for real-world networks of finite width.
Rank Ceiling: When a larger capacity is needed to adapt to downstream tasks (e.g., tasks with significant domain gaps), a small number of rows/columns may be insufficient.
Untested on Larger Models: The experiments are conducted on models up to 13B, and have not been validated on models with 70B+ parameters.
Task Coverage: The evaluation focuses primarily on NLU and reasoning tasks, leaving more complex generative tasks (such as safety-aligned dialogue or creative writing) unexplored.
Integration with Other Update Methods: The combination with other techniques like quantization (e.g., QLoRA) has not been explored.

Related Work¶

Low-Rank Methods: LoRA (Hu et al., 2021), AdaLoRA (Zhang et al., 2023), VeRA (Kopiczko et al., 2023), LoRA-XS (Bałazy et al., 2024).
Sparse Fine-Tuning: Diff Pruning (Guo et al., 2021), SFT (Ansell et al., 2024), Fish Mask (Sung et al., 2021).
Other PEFT Methods: BitFit (Zaken et al., 2021), LayerNorm Tuning (Zhao et al., 2023), IA3 (Liu et al., 2022).
NTK Theory: Jacot et al. (2018) introduce the NTK; Malladi et al. (2023) use it to analyze LLM fine-tuning.

Rating¶

⭐⭐⭐⭐ (4/5)

Novelty: The method is extremely simple yet highly effective, raising a valuable question: "how simple can PEFT actually be?" (+1)
Theoretical Depth: The NTK analysis provides a solid theoretical foundation for the method, moving beyond purely empirical work. (+0.5)
Experimental Thoroughness: Extensive experiments from medium-sized to large models cover various tasks including NLU, reasoning, and summarization. (+0.5)
Value: The implementation is straightforward, incurs no extra overhead, and is robust to selection strategies. (+0.5)
Deductions: No validation on ultra-large models, no exploration of combinations with other techniques such as QLoRA, and unknown applicability to some complex tasks. (-1)
