Scalable Multi-Task Low-Rank Model Adaptation

Conference: ICLR 2026 · arXiv: 2603.01526 · Code: GitHub · Area: Social Computing · Keywords: LoRA, multi-task learning, spectral-aware regularization, block-level adaptation, fine-grained routing

TL;DR

This paper systematically analyzes the root causes of multi-task LoRA collapse as the number of tasks scales (uniform regularization destroying shared knowledge + component-level LoRA amplifying gradient conflicts), and proposes mtLoRA: spectral-aware regularization + block-level adaptation + fine-grained routing. The method outperforms the state of the art by an average of 2.3% on 15–25 tasks, while reducing parameters by 47% and training time by 24%.

Background & Motivation

  1. LoRA excels at single-task adaptation, but real-world applications frequently require handling a large number of tasks simultaneously (15–25+), where multi-task LoRA suffers from catastrophic collapse.
  2. Parameter misalignment: weight-update directions across different LoRA modules conflict (gradient antagonism).
  3. Representation misalignment: output features of different LoRA modules diverge.
  4. Existing regularization methods (Task Arithmetic, TIES-Merging) and dynamic routing methods (MoLE, HydraLoRA) each fail independently.
  5. A key finding reveals a fundamental trade-off between regularization and routing: strengthening regularization reduces conflicts but simultaneously degrades routing effectiveness (routing entropy increases from 2.6 to 2.7).
  6. Root-cause analysis: (1) shared knowledge concentrates in high-singular-value components (the top 20% account for 89% of cross-task alignment), and uniform regularization destroys this shared knowledge; (2) component-level (\(W_q, W_v\)) LoRA amplifies gradient conflicts, whereas block-level adaptation reduces conflicts by 76%.

Method

Overall Architecture

mtLoRA builds upon the asymmetric structure of HydraLoRA (shared \(A\), multiple task-specific \(B_i\)) and introduces three novel designs:

  1. Spectral-Aware Regularization
  2. Block-Level Adaptation
  3. Fine-Grained Routing

Key Designs

Design 1: Spectral-Aware Regularization

  • Function: Selectively applies orthogonality constraints to low-singular-value components while preserving the shared knowledge encoded in high-singular-value components.
  • Mechanism: SVD is applied to each \(B_i\) to obtain singular values \(\{\sigma_k\}\); a weighting function \(w(\sigma) = \exp(-\sigma/\bar{\sigma})\) constructs a reweighted matrix \(B'_i\); the loss is defined as \(\mathcal{L}_{spectral} = \lambda \sum_{i<j} \|(B'_i)^T B'_j\|_F^2\).
  • Design Motivation: Low-SV components (\(\sigma \ll \bar{\sigma}\)) receive weights close to 1, enforcing orthogonalization (denoising); high-SV components (\(\sigma \gg \bar{\sigma}\)) receive weights close to 0, preserving cross-task shared knowledge. Experiments confirm low-SV components are suppressed 3× more (−6.0%) than high-SV components (−2.0%).
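The mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: `spectral_reg_loss` and its argument names are hypothetical, and it assumes each task-specific \(B_i\) is available as a plain matrix.

```python
import torch

def spectral_reg_loss(B_list, lam=1.0):
    """Sketch of the spectral-aware orthogonality loss.

    Each B_i is reweighted by w(sigma) = exp(-sigma / sigma_bar), so
    low-singular-value components (w close to 1) are pushed toward mutual
    orthogonality, while high-singular-value components (w close to 0),
    which carry cross-task shared knowledge, are left largely untouched.
    """
    reweighted = []
    for B in B_list:
        U, S, Vh = torch.linalg.svd(B, full_matrices=False)
        w = torch.exp(-S / S.mean())                   # weighting w(sigma)
        reweighted.append(U @ torch.diag(w * S) @ Vh)  # reweighted B'_i
    loss = 0.0
    # pairwise squared Frobenius norms of cross-products (B'_i)^T B'_j
    for i in range(len(reweighted)):
        for j in range(i + 1, len(reweighted)):
            loss = loss + (reweighted[i].T @ reweighted[j]).pow(2).sum()
    return lam * loss
```

Note that the mean singular value \(\bar{\sigma}\) is recomputed per matrix, so the weighting adapts to each \(B_i\)'s spectrum without a manual threshold.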

Design 2: Block-Level Adaptation

  • Function: Elevates LoRA adaptation from the component level (\(W_q, W_v\)) to the block level (a parallel path over the entire Attention/FFN block).
  • Mechanism: \(x' = x + W^{(F)}(\text{LN}(x)) + \Delta(\text{LN}(x))\), decoupling the LoRA update path from nonlinearities inside the main block (e.g., Softmax).
  • Design Motivation: Component-level LoRA gradients propagate through Softmax and create cross-token dependencies — modifying the attention from "bank" → "money" automatically suppresses "bank" → "river". Block-level adaptation eliminates this competition while reducing parameters by 50%.
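The update rule \(x' = x + W^{(F)}(\text{LN}(x)) + \Delta(\text{LN}(x))\) amounts to a single low-rank path running in parallel with the whole frozen block. A minimal sketch, with `BlockLevelLoRA` and its argument names being illustrative rather than the paper's code, and `block` standing in for the frozen attention or FFN sub-layer \(W^{(F)}\):

```python
import torch
import torch.nn as nn

class BlockLevelLoRA(nn.Module):
    """Sketch of block-level adaptation: one low-rank parallel path over
    an entire frozen block, x' = x + F(LN(x)) + Delta(LN(x))."""

    def __init__(self, block, d_model, rank=16):
        super().__init__()
        self.block = block                            # frozen main path W^(F)
        self.ln = nn.LayerNorm(d_model)
        self.A = nn.Linear(d_model, rank, bias=False)  # shared down-projection A
        self.B = nn.Linear(rank, d_model, bias=False)  # task-specific up-projection B
        nn.init.zeros_(self.B.weight)   # zero-init: Delta contributes nothing at start

    def forward(self, x):
        h = self.ln(x)
        # residual + frozen block + low-rank LoRA path (never passes through Softmax)
        return x + self.block(h) + self.B(self.A(h))
```

Because \(\Delta\) sits outside the block, its gradients never flow through the Softmax inside attention, which is the source of the cross-token competition described above.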

Design 3: Fine-Grained Routing

  • Function: Assigns a dimension-specific routing weight vector to each LoRA module, rather than a scalar weight.
  • Mechanism: The router outputs \(\Pi_i \in \mathbb{R}^g\) (\(g\) groups) for each LoRA; the combination \(\sum_{i=1}^N \Pi_i(x) \odot \Delta_i(x)\) is computed via grouped element-wise multiplication.
  • Design Motivation: Different feature subspaces may require different LoRA combinations (e.g., a "creativity" dimension emphasizes a brainstorming LoRA, while a "factuality" dimension emphasizes a QA LoRA); scalar routing cannot express such heterogeneity.
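The grouped combination \(\sum_i \Pi_i(x) \odot \Delta_i(x)\) can be sketched as follows. This is a shape-level illustration under assumed conventions (each group weight broadcast over a contiguous slice of \(d/g\) feature dimensions; `fine_grained_combine` is a hypothetical name):

```python
import torch

def fine_grained_combine(deltas, pi, g):
    """Grouped element-wise combination of N LoRA outputs.

    deltas: list of N tensors of shape (d,), the per-LoRA outputs Delta_i(x)
    pi:     (N, g) routing weights, one g-vector Pi_i per LoRA
    Assumes g divides d; each group weight covers d // g contiguous dims.
    """
    d = deltas[0].shape[-1]
    out = torch.zeros(d)
    for i, delta in enumerate(deltas):
        # broadcast the g group weights over all d feature dimensions
        weights = pi[i].repeat_interleave(d // g)
        out = out + weights * delta
    return out
```

With \(g = 1\) this reduces to ordinary scalar routing; with \(g = d\) every feature dimension gets its own mixing weight, so the granularity ablation below interpolates between the two extremes.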

Loss & Training

The overall loss function is:

\(\mathcal{L} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{spectral} + \lambda_2 \mathcal{L}_{balance}\)

  • \(\mathcal{L}_{spectral}\): SVD is performed once per epoch to compute the spectral-aware orthogonality loss.
  • \(\mathcal{L}_{balance}\): A load-balancing loss that prevents routing collapse (all samples selecting only a few experts).
  • Router: A 2-layer MLP that takes mean-pooled hidden states as input and outputs \(N \times g\) weights normalized via softmax.
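The router description can be sketched as below. The hidden width and the choice of applying the softmax per group across the \(N\) LoRAs are assumptions; the paper only states that the \(N \times g\) outputs are softmax-normalized.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Sketch of the 2-layer MLP router: mean-pooled hidden states in,
    N x g routing weights out, softmax-normalized across the N LoRAs
    within each group (a plausible reading, not confirmed by the paper)."""

    def __init__(self, d_model, n_lora, g, hidden=64):
        super().__init__()
        self.n_lora, self.g = n_lora, g
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_lora * g),
        )

    def forward(self, hidden_states):          # (batch, seq, d_model)
        pooled = hidden_states.mean(dim=1)     # mean-pool over tokens
        logits = self.mlp(pooled).view(-1, self.n_lora, self.g)
        return torch.softmax(logits, dim=1)    # weights over LoRAs sum to 1
```

Each row of the output is a \(\Pi_i \in \mathbb{R}^g\) as defined in Design 3, so per group the \(N\) LoRA contributions form a convex combination.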

Key Experimental Results

Main Results

Large-scale multi-task evaluation across 15–25 tasks:

| Method    | Parameters    | DOTA (15) | iNat2018 (25) | Dolly-15k (16) | BBH (27) | Avg. |
|-----------|---------------|-----------|---------------|----------------|----------|------|
| HydraLoRA | 75.5M (1.11%) | 89.0      | 78.3          | 41.6           | 35.5     | 61.1 |
| mtLoRA    | 39.8M (0.59%) | 91.0      | 81.5          | 44.5           | 38.5     | 63.9 |

Ablation Study

Contribution of each component (relative to HydraLoRA baseline):

| Component Combination | Parameters | Training Time | DOTA | BBH  | Avg. |
|-----------------------|------------|---------------|------|------|------|
| Baseline HydraLoRA    | 75.5M      | 1.00×         | 89.0 | 35.5 | 61.1 |
| +Block-Level          | 37.7M      | 0.67×         | 91.2 | 37.9 | 63.2 |
| +Block+Spectral       | 37.7M      | 0.70×         | 91.7 | 38.4 | 63.8 |
| +Block+Fine-grained   | 39.8M      | 0.69×         | 89.9 | 38.2 | 63.1 |
| All (mtLoRA)          | 39.8M      | 0.76×         | 91.0 | 38.5 | 63.9 |

Routing granularity ablation:

| Strategy       | Groups \(g\) | Dolly-15k | BBH  | Avg. |
|----------------|--------------|-----------|------|------|
| Scalar routing | 1            | 41.6      | 35.5 | 38.5 |
| Fine-grained   | 2            | 41.6      | 37.0 | 39.3 |
| Fine-grained   | 32           | 42.0      | 37.7 | 39.9 |

Key Findings

  1. Multi-task LoRA collapse is severe: DOTA performance drops from 88.2% (5 tasks) to 2.0% (15 tasks); iNat drops from 87.0% (1 task) to 0.3% (100 tasks).
  2. Block-level adaptation contributes the most (+2.1%) while reducing parameters by 50% — a design that achieves simultaneous gains in efficiency and performance.
  3. Spectral-aware regularization and fine-grained routing together contribute an additional +0.7%, with particularly significant gains on NLP tasks (+2.9%).
  4. mtLoRA consistently improves across all task difficulty levels: Easy +1.6%, Medium +3.5%, Hard +0.4%.
  5. Uniform regularization combined with dynamic routing reaches the Pareto frontier and cannot improve further; mtLoRA breaks this trade-off through spectral awareness.

Highlights & Insights

  1. First systematic analysis of scalability failure in multi-task LoRA: Reveals the mechanism by which shared knowledge concentrates in high-SV components and uniform regularization destroys it.
  2. Elegance of block-level adaptation: Simply elevating the LoRA placement from component level to block level simultaneously reduces gradient conflicts by 76% and parameters by 50%.
  3. Pareto improvement in efficiency–performance trade-off: +2.8% performance gain accompanied by 47% parameter reduction and 24% training time savings.
  4. Clever design of the spectral-aware weighting function: \(w(\sigma) = \exp(-\sigma/\bar{\sigma})\) is continuously adaptive, requiring no manual threshold for singular values.
  5. Validation across both vision and NLP domains demonstrates the generality of the approach.

Limitations & Future Work

  1. Block-level LoRA bypasses internal nonlinearities within attention layers, potentially limiting performance on tasks requiring fine-grained attention adjustment.
  2. Experiments are conducted with a fixed rank of 16; performance under varying ranks is not explored.
  3. The additional parameters introduced by fine-grained routing (\(g=32\), +1.93%) need to be evaluated at larger model scales.
  4. Spectral-aware regularization requires one SVD per epoch, which may become a bottleneck as the number of tasks and model scale increase.
  5. Evaluation primarily relies on accuracy metrics; comprehensive assessment of generation quality (e.g., BLEU, ROUGE) is lacking.
Related Work

  • HydraLoRA (Tian et al., 2024): Pioneer of the asymmetric structure (shared \(A\), task-specific \(B\)); mtLoRA extends this foundation.
  • MoLE (Wu et al., 2024): Top-K routing with load-balancing loss, but does not resolve the regularization–routing trade-off.
  • AlphaEdit / SPHERE (Fang et al., 2025): Adopts a similar "protect principal directions" strategy in knowledge editing.
  • Insight: The spectral-aware regularization paradigm is generalizable to LoRA merging (model merging) and continual learning scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ All three designs are innovative; the insight behind spectral-aware regularization is particularly outstanding, though the motivation for block-level adaptation is relatively natural.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four large-scale benchmarks (15–25 tasks), comprehensive ablations, dual-domain validation (vision + NLP), and efficiency analysis are all provided.
  • Writing Quality: ⭐⭐⭐⭐ The three motivating observations visualized in Figure 1 are clear and convincing; overall structure is well-organized.
  • Value: ⭐⭐⭐⭐⭐ First work to make multi-task LoRA practically viable at 15+ tasks; highly valuable for real-world deployment; open-source code is directly usable.