Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation

Conference: NeurIPS 2025
arXiv: 2510.23123
Code: https://github.com/Leopold1423/toplora-neurips25
Area: Parameter-Efficient Fine-Tuning
Keywords: LoRA, Low-Rank Adaptation, Token-wise Adaptation, Input-Output Projection, PEFT

TL;DR

TopLoRA analyzes the expressive capacity of LoRA from an input-output projection perspective, identifying the fact that all tokens share a single projection matrix as a critical bottleneck. It proposes dynamically adjusting LoRA weights via a learnable token-wise diagonal matrix \(\Sigma_X\) (i.e., \(\Delta W_X = B\Sigma_X A\)), achieving fine-grained adaptation without increasing rank and consistently outperforming LoRA across tasks, with gains of 2–3% on mathematical reasoning and vision-language benchmarks.

Background & Motivation

Background: LoRA enables parameter-efficient fine-tuning of large models via low-rank matrices \(\Delta W = BA\). Existing improvements primarily focus on increasing rank—HiRA (Hadamard product), MELoRA (mini-ensemble stacking), and MoELoRA (mixture of experts).

Limitations of Prior Work: In LoRA, all tokens share the same \(\Delta W = BA\), i.e., the same input-output projection matrix \(P = R_B L_A\). However, tokens differ substantially in semantics, and identical projection directions may encode entirely different information for different tokens, necessitating distinct processing.

Key Challenge: Increasing rank expands the dimensionality of the input/output space but incurs linear parameter growth; the expressive bottleneck from shared projections persists regardless of rank. Even at high rank, all tokens remain subject to the same mapping.

Goal: Learn distinct input-output projections for each token without increasing the LoRA rank.

Key Insight: Any LoRA update \(\Delta W = BA\) can be decomposed via QR/LQ factorization (\(B = Q_B R_B\), \(A = L_A Q_A\)) into three components: an input space \(Q_A\), an output space \(Q_B\), and a projection matrix \(P = R_B L_A\), so that \(\Delta W = Q_B P Q_A\). Rank determines the dimensionality of the two spaces, while \(P\) governs the mapping between them; the projection is therefore the natural component to make token-specific.
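
To make this decomposition concrete, here is a minimal numerical sketch (PyTorch; not the authors' code, and the variable names are illustrative) that factors \(B = Q_B R_B\) by QR, \(A = L_A Q_A\) by LQ, and verifies \(\Delta W = Q_B P Q_A\):

```python
import torch

torch.manual_seed(0)
m, n, r = 12, 10, 4                 # output dim, input dim, LoRA rank
B = torch.randn(m, r)
A = torch.randn(r, n)

# QR factorization of B: B = Q_B R_B, with orthonormal columns in Q_B.
Q_B, R_B = torch.linalg.qr(B)       # Q_B: (m, r), R_B: (r, r)

# LQ factorization of A via QR of A^T: A^T = Q R  =>  A = R^T Q^T = L_A Q_A.
Q, R = torch.linalg.qr(A.T)
L_A, Q_A = R.T, Q.T                 # L_A: (r, r), Q_A: (r, n)

# The r x r projection matrix P maps the input subspace onto the output subspace.
P = R_B @ L_A
assert torch.allclose(B @ A, Q_B @ P @ Q_A, atol=1e-4)
```

Note that TopLoRA's token-wise \(\Sigma_X\) only touches \(P\); the subspaces \(Q_A\) and \(Q_B\), and hence the rank, are left unchanged.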

Core Idea: A lightweight projection network generates a diagonal matrix \(\Sigma_X\) from each token \(X\), modifying the projection as \(P \to P_X = R_B \Sigma_X L_A\), thereby achieving token-adaptive input-output mapping.

Method

Overall Architecture

Standard LoRA: \(Y = (W + BA)X\). TopLoRA: \(Y = (W + B\Sigma_X A)X\), where \(\Sigma_X = \text{Diag}(\text{Exp}(\text{RMSNorm}(\Theta X)))\) is an \(r \times r\) diagonal matrix dynamically generated from input token \(X\), and \(\Theta \in \mathbb{R}^{r \times n}\) is a learnable projection parameter.
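
A minimal PyTorch sketch of this forward pass (an illustration under stated assumptions, not the released implementation; the class name, the gain-free `rms_norm`, and the exact Kaiming variant are choices made here):

```python
import math
import torch
import torch.nn as nn

def rms_norm(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Gain-free RMSNorm; any learnable gain the paper may use is omitted here.
    return z * torch.rsqrt(z.pow(2).mean(dim=-1, keepdim=True) + eps)

class TopLoRALinear(nn.Module):
    """Frozen linear layer W plus a token-wise low-rank update B Diag(sigma_X) A."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base                                 # pretrained W, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        n, m = base.in_features, base.out_features
        self.A = nn.Parameter(torch.empty(r, n))         # down-projection
        self.B = nn.Parameter(torch.zeros(m, r))         # up-projection, zero-init as in LoRA
        self.theta = nn.Parameter(torch.empty(r, n))     # Theta: generates Sigma_X per token
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.theta, a=math.sqrt(5))  # Kaiming, not zero (see Key Designs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n); every token receives its own positive diagonal scaling.
        sigma = torch.exp(rms_norm(x @ self.theta.T))    # Sigma_X diagonal: (..., r)
        ax = x @ self.A.T                                # A x: (..., r)
        delta = (sigma * ax) @ self.B.T                  # B Diag(sigma) A x
        return self.base(x) + delta
```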

Key Designs

  1. Token-wise Diagonal Matrix Generation:

    • Function: Generates a distinct \(r\)-dimensional diagonal scaling matrix for each input token.
    • Mechanism: \(\Sigma_X = \text{Diag}(\text{Exp}(\text{RMSNorm}(\Theta X)))\). \(\Theta X\) projects tokens into an \(r\)-dimensional space → RMSNorm normalizes away magnitude effects → Exp converts values into positive scaling factors.
    • Design Motivation: RMSNorm prevents \(\Sigma_X\) from being influenced by token magnitude or the scale of \(\Theta\), amplifying inter-token differences in \(\Sigma_X\); Exp avoids near-zero values that would cause information loss, ensuring even subtle normalized differences are magnified.
  2. Relationship to Standard LoRA:

    • Function: TopLoRA output decomposes into a LoRA base term and a token-adaptive correction term.
    • Mechanism: \(\Delta W_X X = BAX + B(\Sigma_X - I)AX\). The first term \(BAX\) corresponds to standard LoRA output (global pattern); the second term \(B(\Sigma_X - I)AX\) provides a token-specific correction (checked numerically in the sketch after this list).
    • Design Motivation: When \(\Sigma_X = I\), TopLoRA reduces to standard LoRA, maintaining full compatibility.
  3. Kaiming Initialization:

    • Function: \(\Theta\) is initialized with Kaiming initialization rather than zero initialization.
    • Mechanism: Consistent with the initialization strategy of the \(A\) matrix, ensuring training stability under a unified learning rate.
    • Design Motivation: Zero initialization causes the initial \(\Sigma_X\) to be uniformly 1 (degenerating to LoRA); Kaiming initialization provides meaningful initial diversity.
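
Continuing the sketch from the Method section (it reuses the hypothetical `TopLoRALinear` and `rms_norm` defined there), a quick numerical check of the decomposition in design 2 and of the zero-initialization degeneracy in design 3:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = TopLoRALinear(nn.Linear(16, 16), r=4)
nn.init.normal_(layer.B)                         # make the update nonzero for the check
x = torch.randn(3, 16)                           # three tokens

sigma = torch.exp(rms_norm(x @ layer.theta.T))
ax = x @ layer.A.T
lora_term = ax @ layer.B.T                       # B A x: the shared, global pattern
correction = ((sigma - 1.0) * ax) @ layer.B.T    # B (Sigma_X - I) A x: token-specific
assert torch.allclose((sigma * ax) @ layer.B.T, lora_term + correction, atol=1e-5)

# Zero-initializing Theta forces Sigma_X = I: RMSNorm(0) = 0 and exp(0) = 1,
# so the correction vanishes and TopLoRA degenerates to plain LoRA.
with torch.no_grad():
    layer.theta.zero_()
sigma0 = torch.exp(rms_norm(x @ layer.theta.T))
assert torch.allclose(sigma0, torch.ones_like(sigma0))
```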

Loss & Training

  • The downstream task loss is used directly, fully consistent with the standard LoRA training pipeline.
  • AdamW optimizer; LoRA dropout of 0.05.
  • Parameter analysis: An additional \(\Theta \in \mathbb{R}^{r \times n}\) is introduced, amounting to approximately 0.5× the parameter count of LoRA.
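
As a concrete check of that figure: for a weight \(W \in \mathbb{R}^{m \times n}\) at rank \(r\), LoRA trains \(A \in \mathbb{R}^{r \times n}\) and \(B \in \mathbb{R}^{m \times r}\), i.e., \(r(m+n)\) parameters, and \(\Theta \in \mathbb{R}^{r \times n}\) adds \(rn\) more; for a square layer (\(m = n\)) the overhead is \(rn / 2rn = 0.5\times\).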

Key Experimental Results

Main Results

| Task | Model | LoRA-r8 | LoRA-r32 | TopLoRA-r8 | Gain (TopLoRA-r8 vs. LoRA-r8) |
|---|---|---|---|---|---|
| GLUE (NLU) | RoBERTa-Base | 82.55% | 84.19% | 84.14% | +1.59% |
| GLUE (NLU) | RoBERTa-Large | 87.06% | 87.75% | 87.64% | +0.58% |
| Math Reasoning | Gemma-7B | 71.44% | 72.22% | 73.11% | +1.67% |
| Math Reasoning | LLaMA-3-8B | 64.06% | 64.48% | 66.36% | +2.30% |
| Commonsense Reasoning | Gemma-7B | 80.22% | 80.43% | 81.10% | +0.88% |
| Vision-Language (COCO) | BLIP-2 | BLEU-4 40.7 | n/a | BLEU-4 43.3 | +2.6 (BLEU-4 points) |

Ablation Study

| Variant | GLUE Avg | Note |
|---|---|---|
| TopLoRA (full) | 84.14% | Full model |
| w/o RMSNorm | 83.72% | RMSNorm amplifies inter-token differences |
| w/o Exp | 83.85% | Exp prevents information loss |
| Zero-init \(\Theta\) | 83.51% | Initially degenerates to LoRA |
| Fixed \(\Sigma_X = I\) (= LoRA) | 82.55% | No token adaptation |

Key Findings

  • TopLoRA-r8 matches LoRA-r32 on GLUE (84.14 vs. 84.19) while using only ~1.5× the parameters of LoRA-r8, compared with the 4× required by LoRA-r32.
  • Consistent effectiveness across models (RoBERTa/Gemma/LLaMA/BLIP-2) and tasks (NLU/NLG/Vision-Language).
  • Gains are most pronounced on mathematical reasoning and vision-language tasks (2–3%), with 1%+ improvements also observed on NLU.
  • Outperforms LoRA variants including DoRA, MELoRA, and HydraLoRA.
  • TopLoRA can be applied as a plug-in within the MoELoRA framework for further gains.

Highlights & Insights

  • Novel input-output projection perspective: Decomposing LoRA into input space + output space + projection matrix reveals that token-shared projection is an overlooked bottleneck. This analytical framework is itself highly informative.
  • Extremely simple and efficient method: Only a linear projection, RMSNorm, and Exp are added, introducing negligible architectural complexity while achieving consistent cross-task gains.
  • Expressiveness without rank increase: In contrast to the mainstream paradigm of "higher rank = greater expressiveness," this work demonstrates that rank is not the sole dimension of expressive capacity.

Limitations & Future Work

  • Advantages diminish when the base rank is already very high.
  • Additional parameters and computational overhead from the projection network \(\Theta\) (small but nonzero).
  • The diagonal constraint limits the expressiveness of \(\Sigma_X\)—a full-matrix \(\Sigma_X\) could be more expressive but would incur prohibitive parameter costs.
  • Layer-wise behavioral differences of \(\Sigma_X\) remain unexplored.

Comparison with Related Methods

  • vs. LoRA: LoRA shares a single projection across all tokens; TopLoRA enables token-level adaptation, with TopLoRA-r8 achieving performance comparable to LoRA-r32.
  • vs. DoRA: DoRA decomposes weights into direction and magnitude; TopLoRA decomposes into space and projection—orthogonal perspectives.
  • vs. HiRA/MELoRA: These methods increase rank; TopLoRA improves projection without altering rank. The two approaches are orthogonal and composable.
  • vs. MoELoRA: MoELoRA uses multiple LoRA experts for token adaptation at higher parameter cost; TopLoRA achieves similar adaptivity more parameter-efficiently.

Rating

  • Novelty: ⭐⭐⭐⭐ Novel input-output projection analysis perspective; method is concise and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three task categories × multiple models × comprehensive ablations × comparison with multiple baselines.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly derived; mathematical derivations are rigorous.
  • Value: ⭐⭐⭐⭐ Introduces a new dimension for LoRA improvement (projection diversity rather than rank); contributes to both theoretical understanding and practical utility of the LoRA family.