LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning¶
Conference: ICLR 2026 arXiv: 2505.21289 Code: None Area: Parameter-Efficient Fine-Tuning / Model Compression Keywords: LoRA, Low-Rank Adaptation, Full Fine-Tuning, Optimizer State Alignment, AdamW
TL;DR¶
This paper proposes LoFT, a low-rank adaptation method built from six building blocks that together align the internal optimizer dynamics (momentum and second-order moments) with those of full fine-tuning. In the full-rank limit, LoFT exactly recovers AdamW, and it substantially closes the performance gap between LoRA and full fine-tuning across multiple benchmarks.
Background & Motivation¶
Adapting large-scale pretrained models to downstream tasks has become the standard paradigm in NLP and beyond. However, as model sizes scale to billions of parameters, full fine-tuning becomes computationally expensive and impractical, particularly in multi-task or multi-user settings. Parameter-efficient fine-tuning (PEFT) techniques address this challenge by training only a small subset of parameters, with LoRA (Low-Rank Adaptation) being the most popular approach.
Successes and Limitations of LoRA:
LoRA freezes the original weights and injects trainable low-rank matrices \(W = W_0 + UV^\top\) into selected layers, where \(U \in \mathbb{R}^{m \times r}\), \(V \in \mathbb{R}^{n \times r}\), \(r \ll \min\{m, n\}\). This significantly reduces the number of trainable parameters without adding inference latency. Nevertheless, LoRA still lags behind full fine-tuning in certain settings:
- Persistent performance gap: Empirical studies consistently show a non-trivial gap between LoRA and full fine-tuning.
- Slower convergence: The optimization dynamics of LoRA differ fundamentally from those of full fine-tuning.
- Hyperparameter sensitivity: The choice of the scaling factor \(\alpha\) significantly affects performance, incurring non-trivial tuning costs.
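For readers new to LoRA, here is a minimal NumPy sketch of the parameterization above (my own illustration, not the paper's code; the \(\alpha/r\) scaling follows common LoRA implementations and is exactly the hyperparameter LoFT later removes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4                       # output dim, input dim, LoRA rank (r << min(m, n))

W0 = rng.standard_normal((m, n))          # frozen pretrained weight
U = np.zeros((m, r))                      # trainable low-rank factor, zero-initialized
V = 0.01 * rng.standard_normal((n, r))    # trainable low-rank factor, small random init
alpha = 8.0                               # LoRA scaling factor

def lora_forward(x):
    # Effective weight W = W0 + (alpha / r) * U V^T; only U and V receive gradients.
    W = W0 + (alpha / r) * (U @ V.T)
    return W @ x

y = lora_forward(rng.standard_normal(n))  # y has shape (m,)
```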
Key Insight of This Paper:
Prior work (e.g., DoRA, LoRA-Pro) primarily focused on more accurate gradient approximation within the low-rank subspace. This paper reveals a neglected yet critical factor: optimizer state misalignment — specifically, the first-order moment (momentum) and second-order moment (variance) in AdamW. When these internal statistics are not properly aligned with the low-rank constraint, adaptation quality degrades.
Method¶
Overall Architecture¶
LoFT can be understood as "the closest approximation to full fine-tuning under the constraint that weight updates are restricted to a low-rank subspace." It consists of six core building blocks that systematically address the discrepancies between LoRA and full fine-tuning.
Key Designs¶
- Alternating Updates (Building Block 1):
- Function: Updates \(U\) and \(V\) alternately rather than simultaneously.
- Mechanism: When standard LoRA updates \(U\) and \(V\) jointly, a cross term proportional to \(\eta^2\) appears in the weight update, which is inconsistent with full fine-tuning. Alternating updates eliminate this term: when only \(U\) is updated, \(W^+ = W - \eta \nabla_W f(W) VV^\top\).
- Design Motivation: Eliminates the second-order cross term in LoRA updates, bringing the update form closer to full fine-tuning.
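A minimal NumPy sketch of Building Block 1 (my own illustration, reusing the synthetic objective \(f(W) = \|W - A\|_F^2\) from the paper's toy experiment): alternating the factor updates keeps every induced weight change linear in \(\eta\), whereas a joint update of \(U\) and \(V\) would add an \(\eta^2\) cross term.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, eta = 16, 12, 3, 1e-2
A = rng.standard_normal((m, n))           # target for f(W) = ||W - A||_F^2
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))

def grad_W(W):
    return 2.0 * (W - A)                  # dF/dW for f(W) = ||W - A||_F^2

# Step on U with V frozen: the induced weight change is -eta * grad_W(W) @ V @ V.T,
# exactly the form stated above, with no eta^2 cross term.
W = U @ V.T
U = U - eta * grad_W(W) @ V               # grad_U = grad_W @ V by the chain rule
assert np.allclose(U @ V.T, W - eta * grad_W(W) @ V @ V.T)

# Then step on V with U frozen, using the refreshed weight and gradient.
W = U @ V.T
V = V - eta * grad_W(W).T @ U             # grad_V = grad_W^T @ U
assert np.allclose(U @ V.T, W - eta * U @ U.T @ grad_W(W))
```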
- Gradient Scaling (Building Block 2):
- Function: Uses the scaled gradient \(\tilde{\nabla}_U f(W) = \nabla_U f(W)(V^\top V)^{-1}\).
- Mechanism: Using LoRA gradients directly introduces a scale-dependency issue — \(UV^\top = (cU)(V/c)^\top\) holds for any \(c \neq 0\), yet gradient updates scale with \(c\). The projection matrix \(\mathcal{P}_V = V(V^\top V)^{-1}V^\top\) is used to normalize the update: \(W^+ = W - \eta \nabla_W f(W) \mathcal{P}_V\). This ensures the update direction is always the orthogonal projection of the full gradient onto the current low-rank subspace — the optimal low-rank approximation.
- Design Motivation: Eliminates scale ambiguity, makes updates independent of the specific parameterization of \(U\) and \(V\), and removes the need for the \(\alpha\) hyperparameter.
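A small NumPy check of Building Block 2 (my own illustration): the scaled gradient induces exactly the projected full-gradient update, and the result is invariant to the \(U \to cU,\ V \to V/c\) reparameterization, which is why no \(\alpha\) is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 16, 12, 3
G = rng.standard_normal((m, n))                  # full gradient grad_W f(W)
V = rng.standard_normal((n, r))

G_U_scaled = (G @ V) @ np.linalg.inv(V.T @ V)    # scaled gradient: grad_U (V^T V)^{-1}
P_V = V @ np.linalg.inv(V.T @ V) @ V.T           # orthogonal projector onto span(V)

# The induced weight update equals the projection of the full gradient onto span(V).
assert np.allclose(G_U_scaled @ V.T, G @ P_V)

# Rescaling V (and implicitly U) leaves the induced update unchanged.
V_rescaled = V / 3.7
G_U2 = (G @ V_rescaled) @ np.linalg.inv(V_rescaled.T @ V_rescaled)
assert np.allclose(G_U2 @ V_rescaled.T, G_U_scaled @ V.T)
```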
- First-Order Moment (Momentum) Calibration (Building Block 3):
- Function: Introduces a calibration matrix \(C_V^k = (V_{k-1}^\top V_k)(V_k^\top V_k)^{-1}\) into momentum accumulation.
- Mechanism: The standard LoRA momentum, mapped to weight space as \(m_U^k V_k^\top\), mixes gradients computed against \(V_i\) from earlier time steps with the current \(V_k\), leading to inconsistent implicit projections. The calibration matrix "rotates" historical momentum into the current subspace: \(m_U^k = \beta_1 m_U^{k-1} C_V^k + (1-\beta_1)\tilde{\nabla}_U f(W_k)\). The calibrated momentum is equivalent to accumulating the full gradient projected onto the intersection of all historical and current subspaces.
- Design Motivation: As the low-rank subspace evolves during training, historical momentum must be "transformed" into the current subspace to remain meaningful.
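A NumPy sketch of Building Block 3 (my own illustration): expressed in weight space, the calibrated momentum is exactly the previous momentum projected onto the new subspace spanned by \(V_k\).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 16, 12, 3
m_U_prev = rng.standard_normal((m, r))                 # momentum factor from step k-1
V_prev = rng.standard_normal((n, r))                   # V_{k-1}
V_curr = V_prev + 0.1 * rng.standard_normal((n, r))    # V_k (subspace has drifted)

# Calibration matrix C_V^k = (V_{k-1}^T V_k)(V_k^T V_k)^{-1}
C = (V_prev.T @ V_curr) @ np.linalg.inv(V_curr.T @ V_curr)
P_curr = V_curr @ np.linalg.inv(V_curr.T @ V_curr) @ V_curr.T

# Old momentum, expressed in weight space and projected onto span(V_k),
# equals the calibrated momentum m_U^{k-1} C_V^k expressed with the current V_k.
assert np.allclose((m_U_prev @ C) @ V_curr.T, (m_U_prev @ V_prev.T) @ P_curr)
```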
- Second-Order Moment Alignment (Building Blocks 4 & 5):
- Function: Efficiently maintains cross terms of the second-order moment via Khatri-Rao and Kronecker product identities.
- Mechanism: Adam's second-order moment \(v_k\) accumulates element-wise squared gradients. Under the low-rank parameterization this requires maintaining a cross-term matrix \(p_U^k\) with \(r^2\) columns: \(p_U^k = \beta_2\, p_U^{k-1}(C_V^k \otimes C_V^k) + (1-\beta_2)(\tilde{\nabla}_U f \bullet \tilde{\nabla}_U f)\), where \(\bullet\) denotes the (row-wise) Khatri-Rao product. The full second-order moment estimate is then reconstructed as \(\tilde{v}_U^k = p_U^k(V_k \bullet V_k)^\top\). Building Block 5 combines the calibrated first- and second-order moments into the final update (division and square root taken element-wise): \(U_{k+1} = U_k - \eta_k\, \frac{m_U^k V_k^\top / (1-\beta_1^k)}{\sqrt{p_U^k(V_k \bullet V_k)^\top/(1-\beta_2^k)} + \varepsilon}\, V_k(V_k^\top V_k)^{-1}\)
- Design Motivation: Precisely aligns Adam's adaptive learning rate mechanism so that each step of low-rank optimization is as close as possible to full fine-tuning.
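A NumPy sketch of the identity behind Building Blocks 4 & 5 (my own illustration; I assume the Khatri-Rao product here is the row-wise, face-splitting variant, which is what makes the shapes work out): the element-wise square of the induced full gradient can be stored as an \(m \times r^2\) cross-term matrix and reconstructed on demand.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 16, 12, 3
G_U = rng.standard_normal((m, r))         # scaled low-rank gradient for U
V = rng.standard_normal((n, r))

def face_split(A):
    # Row-wise Khatri-Rao (face-splitting) product of A with itself: (rows, r) -> (rows, r^2).
    return np.einsum('ia,ib->iab', A, A).reshape(A.shape[0], -1)

full_squared = (G_U @ V.T) ** 2           # what Adam's second moment accumulates, element-wise
p_U = face_split(G_U)                     # m x r^2 cross-term matrix that is maintained instead
assert np.allclose(full_squared, p_U @ face_split(V).T)
```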
- Gradient Clipping (Building Block 6):
- Function: Uses the projected effective gradient \(\nabla_W f(W) \mathcal{P}_V\) as the layer representative during gradient clipping.
- Design Motivation: Ensures clipping behavior is consistent with full fine-tuning.
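A minimal sketch of one plausible reading of Building Block 6 (my own illustration; the paper may aggregate norms across layers differently): the clipping norm is computed from the projected effective gradient \(\nabla_W f(W)\,\mathcal{P}_V\) rather than from the raw factor gradients.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r, max_norm = 16, 12, 3, 1.0
G = rng.standard_normal((m, n))                     # full gradient for this layer
V = rng.standard_normal((n, r))

P_V = V @ np.linalg.inv(V.T @ V) @ V.T
effective = G @ P_V                                 # layer representative used for clipping
scale = min(1.0, max_norm / (np.linalg.norm(effective) + 1e-12))
clipped = effective * scale                         # the same rule full fine-tuning would apply
```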
Loss & Training¶
- LoFT is an optimizer-level improvement and does not modify the loss function; it is compatible with any standard training objective.
- Weight decay requires no special modification: alternating updates ensure that decay \(UV^\top \to (1-\lambda\eta_k)UV^\top\) is consistent with full fine-tuning.
- Additional memory overhead: \(\mathcal{O}((m+n)r)\) for first-order moment calibration; \(\mathcal{O}((m+n)r^2)\) for second-order moment cross terms.
- Computational overhead: primarily from \(r \times r\) matrix inversions and Khatri-Rao products, at \(\mathcal{O}(r^3)\) complexity.
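To make these costs concrete, a back-of-the-envelope calculation (my own illustrative numbers for a single 4096×4096 layer in fp32; not from the paper):

```python
m, n, r, bytes_per_float = 4096, 4096, 16, 4

lora_params  = (m + n) * r          # U and V themselves
momentum_cal = (m + n) * r          # O((m+n) r) first-moment calibration state
cross_terms  = (m + n) * r * r      # O((m+n) r^2) second-moment cross terms

for name, count in [("LoRA factors", lora_params),
                    ("momentum state", momentum_cal),
                    ("cross terms", cross_terms)]:
    print(f"{name:>15}: {count * bytes_per_float / 2**20:6.1f} MiB")
# The r x r inversions and Khatri-Rao products add O(r^3) compute per step, negligible for small r.
```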
Core Theoretical Guarantee: When \(r = \min\{m, n\}\) and \(U_k, V_k\) are full-rank, LoFT-AdamW exactly recovers the full AdamW update; intuitively, at full rank the factorization \(UV^\top\) can represent any \(m \times n\) matrix, so the projections and moment calibrations no longer discard information and each step collapses to standard AdamW. This is the first low-rank adaptation method with this property.
Key Experimental Results¶
Main Results¶
Commonsense Reasoning (LLaMA series):
| Model | Method | BoolQ | PIQA | SIQA | HS | WG | ARC-C | ARC-E | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-7B | LoRA | - | - | - | - | - | - | - | - | Baseline |
| LLaMA-7B | DoRA | - | - | - | - | - | 64.68 | - | - | Baseline+ |
| LLaMA-7B | LoFT | - | 80.96 | 78.27 | 80.50 | 76.40 | - | 80.26 | 78.40 | 74.95 |
| LLaMA2-7B | DoRA | - | 82.92 | 79.22 | 88.90 | - | - | - | - | 79.71 |
| LLaMA2-7B | LoFT | 71.80 | - | - | - | 82.72 | 69.11 | 84.43 | 81.00 | Best |
Image Classification (ViT-Base):
- Evaluated on medical imaging datasets and highly imbalanced datasets such as DomainNet.
- LoFT matches or exceeds full fine-tuning performance on most datasets.
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| LoFT (full) | Best convergence | All components working together |
| w/o alternating updates | Significantly worse | Second-order cross terms harm convergence |
| w/o optimizer state calibration | Noticeably worse | Misaligned momentum and variance lead to suboptimal results |
| Gradient scaling only | Limited improvement | Gradient alignment is only a partial solution |
| Standard LoRA | Slowest convergence | All issues compounded |
A synthetic experiment (\(f(W) = \|W - A\|_F^2\), \(m=1024, n=512, r=8\)) clearly demonstrates the importance of each component: the convergence curve of LoFT nearly overlaps with that of full fine-tuning.
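A rough reproduction of that synthetic setup (my own sketch, not the authors' protocol; it only contrasts full-parameter gradient descent with naive jointly-updated factors, i.e., the gap that LoFT's building blocks are designed to close):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, steps, eta = 1024, 512, 8, 200, 1e-3
A = rng.standard_normal((m, n))

# Full-parameter gradient descent on f(W) = ||W - A||_F^2.
W = np.zeros((m, n))
for _ in range(steps):
    W -= eta * 2.0 * (W - A)

# Naive factored (LoRA-style) descent with joint U/V updates, for contrast.
U = 0.01 * rng.standard_normal((m, r))
V = 0.01 * rng.standard_normal((n, r))
for _ in range(steps):
    G = 2.0 * (U @ V.T - A)
    U, V = U - eta * (G @ V), V - eta * (G.T @ U)   # joint step: contains the eta^2 cross term
print("full fine-tuning loss:", np.linalg.norm(W - A) ** 2)
print("naive low-rank loss  :", np.linalg.norm(U @ V.T - A) ** 2)
```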
Key Findings¶
- LoFT addresses not only gradient alignment but also the long-neglected problem of optimizer state alignment.
- LoFT maintains robust performance even under extreme low-rank constraints (\(r=1\)).
- LoFT naturally eliminates the need for the LoRA scaling factor \(\alpha\).
- Significant training quality improvements are achieved without any additional inference cost.
Highlights & Insights¶
- Optimizer state misalignment is a neglected critical issue: Prior LoRA improvement work focused mainly on gradient approximation; this paper is the first to systematically identify and address the misalignment of momentum and second-order moments, a genuine blind spot in the field.
- The six-block decomposition is clear and elegant: Each building block has a well-defined corresponding problem and theoretical motivation, and together they form a complete solution.
- Mathematically provable equivalence: Exact recovery of AdamW in the full-rank limit constitutes a very strong theoretical guarantee.
- Elimination of the \(\alpha\) hyperparameter: Gradient scaling naturally resolves norm ambiguity, reducing tuning burden.
- Clever use of Khatri-Rao / Kronecker products: Matrix decomposition theory is leveraged to maintain second-order moments efficiently, keeping computational complexity tractable.
Limitations & Future Work¶
- Additional memory overhead: The second-order moment cross terms require \(\mathcal{O}((m+n)r^2)\) extra memory, which is unfriendly for large values of \(r\). The authors plan to explore LLM-specific optimizers (e.g., Muon) to mitigate this.
- Increased computational cost: While inference is unaffected, training requires additional matrix operations (calibration, projection, etc.).
- Incomplete experimental tables: ar5iv conversion errors caused some numerical results to be missing from the tables in this write-up, so the presented results are incomplete.
- Larger-scale models not explored: Experiments are primarily conducted at the 7B–8B scale; effectiveness on models of 70B parameters and above remains to be verified.
- Integration with other PEFT methods: Whether the ideas underlying LoFT can transfer to other PEFT approaches such as Adapter and Prefix Tuning is an open question.
Related Work & Insights¶
- LoRA and variants: LoRA → DoRA (decoupled direction and magnitude) → LoRA-Pro (improved gradient approximation) → LoFT (comprehensive alignment of optimizer dynamics).
- Riemannian optimization perspective: Zhang et al. derive similar gradient scaling results from a Riemannian geometry viewpoint.
- Inspiration: The idea of optimizer state alignment may be applicable to other constrained optimization settings — any optimization method operating within a subspace may benefit from analogous calibration strategies.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First work to systematically reveal and address optimizer state misalignment; solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers language and vision tasks across multiple model scales, though some data are missing.
- Writing Quality: ⭐⭐⭐⭐⭐ — Building-block organization is clear; mathematical derivations are rigorous.
- Value: ⭐⭐⭐⭐⭐ — Provides important guidance for LoRA improvements and has the potential to become a new standard practice.