Scalable Multi-Task Low-Rank Model Adaptation

Conference: ICLR 2026
arXiv: 2603.01526
Code: GitHub
Area: Social Computing
Keywords: LoRA, multi-task learning, spectral-aware regularization, block-level adaptation, fine-grained routing

TL;DR

This paper systematically analyzes why multi-task LoRA collapses as the number of tasks scales, identifying two root causes: uniform regularization destroys shared knowledge, and component-level LoRA amplifies gradient conflicts. It proposes mtLoRA, which combines spectral-aware regularization, block-level adaptation, and fine-grained routing. The method outperforms the state of the art by an average of 2.3% on 15–25 tasks, while reducing parameters by 47% and training time by 24%.

Background & Motivation

1. LoRA excels at single-task adaptation, but real-world applications frequently require handling a large number of tasks simultaneously (15–25+), where multi-task LoRA suffers from catastrophic collapse.
2. Parameter misalignment: Weight update directions across different LoRA modules conflict (gradient antagonism).
3. Representation misalignment: Output features of LoRA modules diverge.
4. Existing regularization methods (Task Arithmetic, TIES-Merging) and dynamic routing methods (MoLE, HydraLoRA) each fail independently.
5. A key finding reveals a fundamental trade-off between regularization and routing: strengthening regularization reduces conflicts but simultaneously degrades routing effectiveness (routing entropy increases from 2.6 to 2.7).
6. Root cause analysis: (1) Shared knowledge concentrates in high-singular-value components (top-20% account for 89% of cross-task alignment), and uniform regularization destroys this shared knowledge; (2) component-level ($W_q, W_v$) LoRA amplifies gradient conflicts, whereas block-level adaptation reduces conflicts by 76%.

Method

Overall Architecture

mtLoRA builds upon the asymmetric structure of HydraLoRA (shared $A$, multiple task-specific $B_i$) and introduces three novel designs:

1. Spectral-Aware Regularization
2. Block-Level Adaptation
3. Fine-Grained Routing

Key Designs

Design 1: Spectral-Aware Regularization

- Function: Selectively applies orthogonality constraints to low-singular-value components while preserving shared knowledge encoded in high-singular-value components.
- Mechanism: SVD is applied to each $B_i$ to obtain singular values $\{\sigma_k\}$; a weighting function $w(\sigma) = \exp(-\sigma/\bar{\sigma})$ constructs a reweighted matrix $B'_i$; the loss is defined as $\mathcal{L}_{spectral} = \lambda \sum_{i<j} \|(B'_i)^T B'_j\|_F^2$.
- Design Motivation: Low-SV components ($\sigma \ll \bar{\sigma}$) receive weights close to 1, enforcing orthogonalization (denoising); high-SV components ($\sigma \gg \bar{\sigma}$) receive weights close to 0, preserving cross-task shared knowledge. Experiments confirm that low-SV components are suppressed about 3× more strongly (−6.0%) than high-SV components (−2.0%).
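The spectral-aware loss can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how $B'_i$ is built from the weights, so the sketch assumes $B'_i = U\,\mathrm{diag}(w(\sigma)\,\sigma)\,V^T$ and takes $\bar{\sigma}$ to be the mean singular value of each $B_i$. The function names (`spectral_weights`, `reweight`, `spectral_loss`) and the $d_{out} \times r$ shape of $B_i$ are hypothetical, and no gradients are computed.

```python
import numpy as np

def spectral_weights(sigma, eps=1e-12):
    """w(sigma) = exp(-sigma / sigma_bar), with sigma_bar assumed to be the mean singular value."""
    # eps guards the all-zero case (LoRA B matrices are commonly initialized to zero).
    sigma_bar = max(float(np.mean(sigma)), eps)
    return np.exp(-sigma / sigma_bar)

def reweight(B):
    """Assumed construction: B' = U diag(w(sigma) * sigma) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    scaled = spectral_weights(sigma) * sigma
    return (U * scaled) @ Vt  # broadcasting scales the columns of U

def spectral_loss(Bs, lam=1.0):
    """lam * sum over i<j of ||B'_i^T B'_j||_F^2 for a list of (d_out, r) matrices."""
    reweighted = [reweight(B) for B in Bs]
    total = 0.0
    for i in range(len(reweighted)):
        for j in range(i + 1, len(reweighted)):
            total += np.linalg.norm(reweighted[i].T @ reweighted[j], ord="fro") ** 2
    return lam * total

# Two task heads whose column spaces do not overlap incur (numerically) zero penalty.
B_a = np.zeros((8, 2)); B_a[:4] = np.arange(8, dtype=float).reshape(4, 2) + 1.0
B_b = np.zeros((8, 2)); B_b[4:] = np.arange(8, dtype=float).reshape(4, 2) + 2.0
print(spectral_loss([B_a, B_b]), spectral_loss([B_a, B_a]))
```

Under this assumed construction, directions with large singular values are down-weighted before the pairwise orthogonality penalty is computed, so overlap between tasks along dominant directions is penalized less than overlap along minor ones.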

Design 2: Block-Level Adaptation

- Function: Elevates LoRA adaptation from the component level ($W_q, W_v$) to the block level (a parallel path over the entire Attention/FFN block).
- Mechanism: $x' = x + W^{(F)}(\text{LN}(x)) + \Delta(\text{LN}(x))$, decoupling the LoRA update path from internal nonlinearities within the main block (e.g., Softmax).
- Design Motivation: Component-level LoRA gradients propagate through Softmax and create cross-token dependencies: modifying the attention from "bank" → "money" automatically suppresses "bank" → "river". Block-level adaptation eliminates this competition while reducing parameters by 50%.
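The block-level update rule above can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions: `frozen_block` is a stand-in for the frozen Attention/FFN block $W^{(F)}$, $\Delta$ is modeled as a single low-rank pair $h \mapsto h A^T B^T$ (the shared-$A$, multi-$B$ structure is omitted), and `layer_norm` has no learned scale or bias. All names and shapes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain layer normalization over the last axis (no learned affine parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def block_level_forward(x, frozen_block, A, B):
    """x' = x + F(LN(x)) + Delta(LN(x)), with Delta(h) = h A^T B^T.

    x: (tokens, d), A: (r, d), B: (d, r). The low-rank path runs in parallel
    with the whole block, so it never passes through the block's internals.
    """
    h = layer_norm(x)
    delta = h @ A.T @ B.T
    return x + frozen_block(h) + delta

rng = np.random.default_rng(0)
d, r = 8, 2
x = rng.normal(size=(3, d))
W = rng.normal(size=(d, d))
frozen_block = lambda h: np.tanh(h @ W)  # placeholder for the frozen block
A = rng.normal(size=(r, d))
B = np.zeros((d, r))  # zero-initialized B leaves the frozen model's output unchanged
print(block_level_forward(x, frozen_block, A, B).shape)
```

Because the low-rank path is added outside the block, its gradient does not flow through the block's Softmax, which is the decoupling the mechanism describes.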

Design 3: Fine-Grained Routing

- Function: Assigns dimension-specific routing weight vectors to each LoRA module, rather than scalar weights.
- Mechanism: The router outputs $\Pi_i \in \mathbb{R}^g$ ($g$ groups) for each LoRA, with the combination $\sum_{i=1}^N \Pi_i(x) \odot \Delta_i(x)$ computed via grouped element-wise multiplication.
- Design Motivation: Different feature subspaces may require different LoRA combinations (e.g., a "creativity" dimension emphasizes a brainstorming LoRA, while a "factuality" dimension emphasizes a QA LoRA); scalar routing cannot express such heterogeneity.
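The grouped element-wise combination can be sketched as below. The summary does not specify how the $d$ output dimensions are partitioned into $g$ groups, so this sketch assumes contiguous, equally sized groups; `fine_grained_combine` and its argument layout are hypothetical.

```python
import numpy as np

def fine_grained_combine(deltas, weights):
    """Compute sum_i Pi_i (grouped) * Delta_i for a single token.

    deltas:  (N, d) outputs of the N LoRA modules.
    weights: (N, g) routing weights, one per module and group.
    Assumes d is divisible by g and groups are contiguous slices of size d // g.
    """
    n_modules, d = deltas.shape
    g = weights.shape[1]
    if weights.shape[0] != n_modules or d % g != 0:
        raise ValueError("weights must be (N, g) with d divisible by g")
    expanded = np.repeat(weights, d // g, axis=1)  # (N, d): each weight covers its group
    return (expanded * deltas).sum(axis=0)

deltas = np.array([[1.0, 1.0, 1.0, 1.0],
                   [2.0, 2.0, 2.0, 2.0]])
# Group 0 (first two dims) listens only to module 0, group 1 only to module 1.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
print(fine_grained_combine(deltas, weights))  # [1. 1. 2. 2.]
```

With $g = 1$ the weight matrix has one column and the function reduces to ordinary scalar routing, which is the baseline the design is contrasted against.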

Loss & Training

The overall loss function is:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{spectral} + \lambda_2 \mathcal{L}_{balance}$$

- $\mathcal{L}_{spectral}$: SVD is performed once per epoch to compute the spectral-aware orthogonality loss.
- $\mathcal{L}_{balance}$: Load-balancing loss to prevent routing collapse (all samples selecting only a few experts).
- Router: A 2-layer MLP that takes mean-pooled hidden states as input and outputs $N \times g$ weights normalized via softmax.
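The router described above can be sketched as a small NumPy function. Several details are assumptions rather than facts from the paper: the hidden activation is taken to be ReLU, the softmax is applied across the $N$ LoRA modules separately for each group, and the parameter names and shapes (`W1`, `b1`, `W2`, `b2`) are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route(hidden, W1, b1, W2, b2, n_lora, n_groups):
    """2-layer MLP router producing an (N, g) weight matrix.

    hidden: (tokens, d) hidden states, mean-pooled over tokens.
    W1: (d, h), W2: (h, N * g).
    """
    pooled = hidden.mean(axis=0)                 # (d,)
    h = np.maximum(0.0, pooled @ W1 + b1)        # ReLU is an assumption
    logits = (h @ W2 + b2).reshape(n_lora, n_groups)
    return softmax(logits, axis=0)               # assumed: normalize over modules per group

rng = np.random.default_rng(0)
d, h_dim, N, g = 8, 16, 3, 4
hidden = rng.normal(size=(5, d))
W1, b1 = rng.normal(size=(d, h_dim)), np.zeros(h_dim)
W2, b2 = rng.normal(size=(h_dim, N * g)), np.zeros(N * g)
Pi = route(hidden, W1, b1, W2, b2, N, g)
print(Pi.shape)
```

Each column of `Pi` then acts as a distribution over LoRA modules for one feature group; normalizing along the other axis would be an equally plausible reading of "normalized via softmax".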

Key Experimental Results

Main Results

Large-scale multi-task evaluation across 15–25 tasks:

| Method | Parameters | DOTA (15) | iNat2018 (25) | Dolly-15k (16) | BBH (27) | Avg. |
|---|---|---|---|---|---|---|
| HydraLoRA | 75.5M (1.11%) | 89.0 | 78.3 | 41.6 | 35.5 | 61.1 |
| mtLoRA | 39.8M (0.59%) | 91.0 | 81.5 | 44.5 | 38.5 | 63.9 |
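As a quick arithmetic check, the per-benchmark gains and the averages in the table can be recomputed from the listed scores. The snippet uses only the numbers shown above; the variable names are illustrative.

```python
benchmarks = ["DOTA", "iNat2018", "Dolly-15k", "BBH"]
hydralora = [89.0, 78.3, 41.6, 35.5]
mtlora = [91.0, 81.5, 44.5, 38.5]

# Per-benchmark improvement of mtLoRA over HydraLoRA, rounded to one decimal.
gains = {name: round(m - h, 1) for name, h, m in zip(benchmarks, hydralora, mtlora)}

avg_hydralora = sum(hydralora) / len(hydralora)
avg_mtlora = sum(mtlora) / len(mtlora)
avg_gain = round(avg_mtlora - avg_hydralora, 1)

print(gains)
print(round(avg_hydralora, 1), round(avg_mtlora, 1), avg_gain)
```

The recomputed averages match the Avg. column (61.1 and 63.9), and the mean gain over HydraLoRA on these four benchmarks comes out at 2.8.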

Ablation Study

Contribution of each component (relative to HydraLoRA baseline):

| Component Combination | Parameters | Training Time | DOTA | BBH | Avg. |
|---|---|---|---|---|---|
| Baseline HydraLoRA | 75.5M | 1.00× | 89.0 | 35.5 | 61.1 |
| +Block-Level | 37.7M | 0.67× | 91.2 | 37.9 | 63.2 |
| +Block+Spectral | 37.7M | 0.70× | 91.7 | 38.4 | 63.8 |
| +Block+Fine-grained | 39.8M | 0.69× | 89.9 | 38.2 | 63.1 |
| All (mtLoRA) | 39.8M | 0.76× | 91.0 | 38.5 | 63.9 |
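The efficiency figures quoted elsewhere in this summary (50% and 47% fewer parameters, 24% less training time, +2.1 and +0.7 average gains) can be derived from the ablation rows. The sketch below recomputes them from the table values only; the `summarize` helper is illustrative.

```python
# (parameters in millions, relative training time, average score) per configuration
ablation = {
    "HydraLoRA": (75.5, 1.00, 61.1),
    "+Block-Level": (37.7, 0.67, 63.2),
    "+Block+Spectral": (37.7, 0.70, 63.8),
    "+Block+Fine-grained": (39.8, 0.69, 63.1),
    "mtLoRA": (39.8, 0.76, 63.9),
}
base_params, base_time, base_avg = ablation["HydraLoRA"]

def summarize(name):
    """Savings and average gain of one configuration relative to the HydraLoRA baseline."""
    params, time, avg = ablation[name]
    return {
        "param_saving_pct": round((1 - params / base_params) * 100),
        "time_saving_pct": round((1 - time / base_time) * 100),
        "avg_gain": round(avg - base_avg, 1),
    }

# Extra gain from adding spectral regularization and fine-grained routing on top of block-level.
extra_gain = round(ablation["mtLoRA"][2] - ablation["+Block-Level"][2], 1)

print(summarize("+Block-Level"))
print(summarize("mtLoRA"), extra_gain)
```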

Routing granularity ablation:

| Strategy | Groups $g$ | Dolly-15k | BBH | Avg. |
|---|---|---|---|---|
| Scalar routing | 1 | 41.6 | 35.5 | 38.5 |
| Fine-grained | 2 | 41.6 | 37.0 | 39.3 |
| Fine-grained | 32 | 42.0 | 37.7 | 39.9 |

Key Findings

Multi-task LoRA collapse is severe: DOTA performance drops from 88.2% (5 tasks) to 2.0% (15 tasks); iNat drops from 87.0% (1 task) to 0.3% (100 tasks).
Block-level adaptation contributes the most (+2.1%) while reducing parameters by 50% — a design that achieves simultaneous gains in efficiency and performance.
Spectral-aware regularization and fine-grained routing together contribute an additional +0.7%, with particularly significant gains on NLP tasks (+2.9%).
mtLoRA consistently improves across all task difficulty levels: Easy +1.6%, Medium +3.5%, Hard +0.4%.
Uniform regularization combined with dynamic routing reaches the Pareto frontier and cannot improve further; mtLoRA breaks this trade-off through spectral awareness.

Highlights & Insights

First systematic analysis of scalability failure in multi-task LoRA: Reveals the mechanism by which shared knowledge concentrates in high-SV components and uniform regularization destroys it.
Elegance of block-level adaptation: Simply elevating the LoRA placement from component level to block level simultaneously reduces gradient conflicts by 76% and parameters by 50%.
Pareto improvement in efficiency–performance trade-off: a +2.8% average gain over HydraLoRA (61.1 → 63.9), accompanied by 47% parameter reduction and 24% training time savings.
Clever design of the spectral-aware weighting function: $w(\sigma) = \exp(-\sigma/\bar{\sigma})$ is continuously adaptive, requiring no manual threshold for singular values.
Validation across both vision and NLP domains demonstrates the generality of the approach.

Limitations & Future Work

Block-level LoRA bypasses internal nonlinearities within attention layers, potentially limiting performance on tasks requiring fine-grained attention adjustment.
Experiments are conducted with a fixed rank of 16; performance under varying ranks is not explored.
The additional parameters introduced by fine-grained routing ($g=32$, +1.93%) need to be evaluated at larger model scales.
Spectral-aware regularization requires one SVD per epoch, which may become a bottleneck as the number of tasks and model scale increase.
Evaluation primarily relies on accuracy metrics; comprehensive assessment of generation quality (e.g., BLEU, ROUGE) is lacking.

Related Work & Insights

HydraLoRA (Tian et al., 2024): Pioneer of the asymmetric structure (shared $A$, task-specific $B$); mtLoRA extends this foundation.
MoLE (Wu et al., 2024): Top-K routing with load-balancing loss, but does not resolve the regularization–routing trade-off.
AlphaEdit / SPHERE (Fang et al., 2025): Adopts a similar "protect principal directions" strategy in knowledge editing.
Insight: The spectral-aware regularization paradigm is generalizable to LoRA merging (model merging) and continual learning scenarios.

Rating

Novelty: ⭐⭐⭐⭐ All three designs are innovative; the insight behind spectral-aware regularization is particularly outstanding, though the motivation for block-level adaptation is relatively natural.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four large-scale benchmarks (15–25 tasks), comprehensive ablations, dual-domain validation (vision + NLP), and efficiency analysis are all provided.
Writing Quality: ⭐⭐⭐⭐ The three motivating observations visualized in Figure 1 are clear and convincing; overall structure is well-organized.
Value: ⭐⭐⭐⭐⭐ First work to make multi-task LoRA practically viable at 15+ tasks; highly valuable for real-world deployment; open-source code is directly usable.