Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation¶
Conference: NeurIPS 2025 | arXiv: 2510.23123 | Code: https://github.com/Leopold1423/toplora-neurips25 | Area: Parameter-Efficient Fine-Tuning | Keywords: LoRA, Low-Rank Adaptation, Token-wise Adaptation, Input-Output Projection, PEFT
TL;DR¶
TopLoRA analyzes the expressive capacity of LoRA from an input-output projection perspective and identifies the fact that all tokens share a single projection matrix as a critical bottleneck. It proposes dynamically adjusting LoRA weights via a learnable token-wise diagonal matrix \(\Sigma_X\) (i.e., \(\Delta W_X = B\Sigma_X A\)), achieving fine-grained adaptation without increasing rank and consistently outperforming LoRA, with gains of 2–3% on math reasoning and vision-language tasks and smaller but consistent gains on NLU.
Background & Motivation¶
Background: LoRA enables parameter-efficient fine-tuning of large models via low-rank matrices \(\Delta W = BA\). Existing improvements primarily focus on increasing rank—HiRA (Hadamard product), MELoRA (mini-ensemble stacking), and MoELoRA (mixture of experts).
Limitations of Prior Work: In LoRA, all tokens share the same \(\Delta W = BA\), i.e., the same input-output projection matrix \(P = R_B L_A\). However, tokens differ substantially in semantics, and identical projection directions may encode entirely different information for different tokens, necessitating distinct processing.
Key Challenge: Increasing rank expands the dimensionality of the input/output space but incurs linear parameter growth; the expressive bottleneck from shared projections persists regardless of rank. Even at high rank, all tokens remain subject to the same mapping.
Goal: Learn distinct input-output projections for each token without increasing the LoRA rank.
Key Insight: LoRA is decomposed via QR/LQ factorization into three components—input space \(Q_A\), output space \(Q_B\), and projection matrix \(P = R_B L_A\). Rank determines the dimensionality of these spaces, while \(P\) governs the mapping; projection should therefore be token-specific.
Core Idea: A lightweight projection network generates a diagonal matrix \(\Sigma_X\) from each token \(X\), modifying the projection as \(P \to P_X = R_B \Sigma_X L_A\), thereby achieving token-adaptive input-output mapping.
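To make the QR/LQ view concrete, here is a toy PyTorch check (dimensions chosen arbitrarily for illustration, not from the paper) that factorizes \(B\) by QR and \(A\) by LQ and verifies \(BA = Q_B P Q_A\) with the shared projection \(P = R_B L_A\):

```python
import torch

torch.manual_seed(0)
n, m, r = 16, 12, 4                     # toy dimensions, for illustration only
A, B = torch.randn(r, n), torch.randn(m, r)

# QR factorization of B: B = Q_B R_B, where Q_B spans the output space.
Q_B, R_B = torch.linalg.qr(B)           # Q_B: (m, r), R_B: (r, r)

# LQ factorization of A via QR of A^T: A = L_A Q_A, where Q_A spans the input space.
Q_t, R_t = torch.linalg.qr(A.T)         # A^T = Q_t R_t
L_A, Q_A = R_t.T, Q_t.T                 # L_A: (r, r), Q_A: (r, n)

P = R_B @ L_A                           # the r×r projection matrix shared by all tokens in LoRA
assert torch.allclose(B @ A, Q_B @ P @ Q_A, atol=1e-5)
```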
Method¶
Overall Architecture¶
Standard LoRA: \(Y = (W + BA)X\). TopLoRA: \(Y = (W + B\Sigma_X A)X\), where \(\Sigma_X = \text{Diag}(\text{Exp}(\text{RMSNorm}(\Theta X)))\) is an \(r \times r\) diagonal matrix dynamically generated from input token \(X\), and \(\Theta \in \mathbb{R}^{r \times n}\) is a learnable projection parameter.
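A minimal PyTorch sketch of this forward pass may help make the token-wise scaling concrete. The class name `TopLoRALinear`, the `alpha`/`scaling` convention, and the hand-rolled `_rmsnorm` helper are illustrative assumptions rather than the authors' implementation (see the linked repository for the official code):

```python
import torch
import torch.nn as nn


def _rmsnorm(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm without a learnable scale: removes magnitude effects along the last dim.
    return z * torch.rsqrt(z.pow(2).mean(dim=-1, keepdim=True) + eps)


class TopLoRALinear(nn.Module):
    """Sketch of a frozen linear layer with a TopLoRA update: Y = (W + B Σ_X A) X."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        n, m = base.in_features, base.out_features
        self.base = base
        self.base.requires_grad_(False)                   # W stays frozen
        self.A = nn.Parameter(torch.empty(r, n))          # LoRA down-projection
        self.B = nn.Parameter(torch.zeros(m, r))          # LoRA up-projection (zero-init, as in LoRA)
        self.theta = nn.Parameter(torch.empty(r, n))      # token-wise projection Θ
        nn.init.kaiming_uniform_(self.A, a=5 ** 0.5)      # LoRA-style init for A
        nn.init.kaiming_uniform_(self.theta, a=5 ** 0.5)  # Kaiming init for Θ (not zero)
        self.scaling = alpha / r                          # standard LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n). Each token gets its own positive diagonal scaling Σ_X.
        sigma = torch.exp(_rmsnorm(x @ self.theta.T))     # Exp(RMSNorm(ΘX)), shape (..., r)
        lora_out = (sigma * (x @ self.A.T)) @ self.B.T    # B Σ_X A X, applied token-wise
        return self.base(x) + self.scaling * lora_out
```

Wrapping an existing attention or MLP projection with such a module and training only \(A\), \(B\), and \(\Theta\) would mirror the standard LoRA workflow.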
Key Designs¶
- Token-wise Diagonal Matrix Generation:
  - Function: Generates a distinct \(r\)-dimensional diagonal scaling matrix for each input token.
  - Mechanism: \(\Sigma_X = \text{Diag}(\text{Exp}(\text{RMSNorm}(\Theta X)))\). \(\Theta X\) projects the token into an \(r\)-dimensional space → RMSNorm normalizes away magnitude effects → Exp converts the values into positive scaling factors.
  - Design Motivation: RMSNorm prevents \(\Sigma_X\) from being influenced by token magnitude or the scale of \(\Theta\), amplifying inter-token differences in \(\Sigma_X\); Exp avoids near-zero values that would cause information loss, ensuring even subtle normalized differences are magnified.
- Relationship to Standard LoRA:
  - Function: The TopLoRA output decomposes into a LoRA base term and a token-adaptive correction term (a toy numerical check appears after this list).
  - Mechanism: \(\Delta W_X X = BAX + B(\Sigma_X - I)AX\). The first term \(BAX\) corresponds to the standard LoRA output (global pattern); the second term \(B(\Sigma_X - I)AX\) provides a token-specific correction.
  - Design Motivation: When \(\Sigma_X = I\), TopLoRA reduces to standard LoRA, maintaining full compatibility.
- Kaiming Initialization:
  - Function: \(\Theta\) is initialized with Kaiming initialization rather than zero initialization.
  - Mechanism: Consistent with the initialization strategy of the \(A\) matrix, ensuring training stability under a unified learning rate.
  - Design Motivation: Zero initialization causes the initial \(\Sigma_X\) to be uniformly 1 (degenerating to LoRA); Kaiming initialization provides meaningful initial diversity.
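As referenced above, a toy numerical check (made-up dimensions, random stand-in for \(\Sigma_X\)) that the TopLoRA update splits exactly into a LoRA base term plus a token-specific correction, and collapses to LoRA when \(\Sigma_X = I\):

```python
import torch

torch.manual_seed(0)
n, m, r = 16, 12, 4                            # toy dimensions
A, B = torch.randn(r, n), torch.randn(m, r)
x = torch.randn(n)
sigma = torch.exp(torch.randn(r))              # stand-in for Exp(RMSNorm(Θx))

toplora = B @ torch.diag(sigma) @ A @ x                            # B Σ_X A x
lora_plus_corr = B @ A @ x + B @ torch.diag(sigma - 1.0) @ A @ x   # BAx + B(Σ_X - I)Ax
assert torch.allclose(toplora, lora_plus_corr, atol=1e-5)

# With Σ_X = I the update collapses exactly to standard LoRA.
assert torch.allclose(B @ torch.eye(r) @ A @ x, B @ A @ x)
```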
Loss & Training¶
- The downstream task loss is used directly, fully consistent with the standard LoRA training pipeline.
- AdamW optimizer; LoRA dropout of 0.05.
- Parameter analysis: An additional \(\Theta \in \mathbb{R}^{r \times n}\) is introduced; since LoRA itself uses \(r(n+m)\) parameters (\(A\) plus \(B\)), this amounts to approximately 0.5× the LoRA parameter count when \(m \approx n\).
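A back-of-the-envelope count with hypothetical dimensions (n = m = 4096, r = 8; values chosen for illustration, not taken from the paper):

```python
n = m = 4096                 # hypothetical hidden size
r = 8                        # LoRA rank
lora_params = r * (n + m)    # A: r×n plus B: m×r  -> 65,536
theta_params = r * n         # Θ: r×n              -> 32,768
print(theta_params / lora_params)   # 0.5, i.e. ~50% extra on top of LoRA
```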
Key Experimental Results¶
Main Results¶
| Task | Model | LoRA-r8 | LoRA-r32 | TopLoRA-r8 | Gain (TopLoRA-r8 vs. LoRA-r8) |
|---|---|---|---|---|---|
| GLUE (NLU) | RoBERTa-Base | 82.55% | 84.19% | 84.14% | +1.59% |
| GLUE (NLU) | RoBERTa-Large | 87.06% | 87.75% | 87.64% | +0.58% |
| Math Reasoning | Gemma-7B | 71.44% | 72.22% | 73.11% | +1.67% |
| Math Reasoning | LLaMA-3-8B | 64.06% | 64.48% | 66.36% | +2.30% |
| Commonsense Reasoning | Gemma-7B | 80.22% | 80.43% | 81.10% | +0.88% |
| Vision-Language | BLIP-2 (COCO) | BLEU-4 40.7 | — | BLEU-4 43.3 | +2.6 |
Ablation Study¶
| Variant | GLUE Avg | Note |
|---|---|---|
| TopLoRA (full) | 84.14% | Full model |
| w/o RMSNorm | 83.72% | RMSNorm amplifies inter-token differences |
| w/o Exp | 83.85% | Exp prevents information loss |
| Zero-init \(\Theta\) | 83.51% | Initially degenerates to LoRA |
| Fixed \(\Sigma_X = I\) (=LoRA) | 82.55% | No token adaptation |
Key Findings¶
- TopLoRA-r8 matches LoRA-r32 on GLUE (84.14% vs. 84.19%) while using only ~1.5× the parameters of LoRA-r8, compared with the 4× required by LoRA-r32.
- Consistent effectiveness across models (RoBERTa/Gemma/LLaMA/BLIP-2) and tasks (NLU/NLG/Vision-Language).
- Gains are most pronounced on mathematical reasoning and vision-language tasks (2–3%), with 1%+ improvements also observed on NLU.
- Outperforms LoRA variants including DoRA, MELoRA, and HydraLoRA.
- TopLoRA can be applied as a plug-in within the MoELoRA framework for further gains.
Highlights & Insights¶
- Novel input-output projection perspective: Decomposing LoRA into input space + output space + projection matrix reveals that token-shared projection is an overlooked bottleneck. This analytical framework is itself highly informative.
- Extremely simple and efficient method: Only a linear projection, RMSNorm, and Exp are added, introducing negligible architectural complexity while achieving consistent cross-task gains.
- Expressiveness without rank increase: In contrast to the mainstream paradigm of "higher rank = greater expressiveness," this work demonstrates that rank is not the sole dimension of expressive capacity.
Limitations & Future Work¶
- Advantages diminish when the base rank is already very high.
- Additional parameters and computational overhead from the projection network \(\Theta\) (small but nonzero).
- The diagonal constraint limits the expressiveness of \(\Sigma_X\)—a full-matrix \(\Sigma_X\) could be more expressive but would incur prohibitive parameter costs.
- Layer-wise behavioral differences of \(\Sigma_X\) remain unexplored.
Related Work & Insights¶
- vs. LoRA: LoRA shares projections across tokens; TopLoRA enables token-level adaptation. TopLoRA-r8 achieves performance comparable to LoRA-r32.
- vs. DoRA: DoRA decomposes weights into direction and magnitude; TopLoRA decomposes into space and projection—orthogonal perspectives.
- vs. HiRA/MELoRA: These methods increase rank; TopLoRA improves projection without altering rank. The two approaches are orthogonal and composable.
- vs. MoELoRA: MoELoRA uses multiple LoRA experts for token adaptation at higher parameter cost; TopLoRA achieves similar adaptivity more parameter-efficiently.
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel input-output projection analysis perspective; method is concise and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three task categories × multiple models × comprehensive ablations × comparison with multiple baselines.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly derived; mathematical derivations are rigorous.
- Value: ⭐⭐⭐⭐ Introduces a new dimension for LoRA improvement (projection diversity rather than rank) and contributes to both theoretical understanding and practical improvement of the LoRA family.