Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy¶
Conference: CVPR 2026 · arXiv: 2603.26299 · Code: https://github.com/wooseong97/TARA-Merge · Area: Model Compression / Model Merging · Keywords: LoRA merging, subspace coverage, anisotropy, multi-objective optimization, model merging
TL;DR¶
This paper revisits the LoRA merging problem through two complementary lenses, subspace coverage and directional anisotropy, and proposes the TARA-Merging framework. By retaining rank-1 LoRA directions to preserve subspace coverage and optimizing direction-level weights with a preference-weighted predictive-entropy pseudo-loss, TARA consistently outperforms existing merging methods across 8 vision and 6 NLI benchmarks.
Background & Motivation¶
- Background: LoRA has become the standard paradigm for fine-tuning large models. Merging multiple task-specific LoRA adapters into a single unified model (model merging) offers an effective alternative to expensive multi-task joint training.
- Limitations of Prior Work: Existing merging methods suffer from two categories of issues: (1) general-purpose methods (Task Arithmetic, TIES, DARE) disregard the low-rank structure of LoRA and operate directly in the full parameter space, leading to severe cross-task interference; (2) LoRA-aware methods (KnOTS, LoRA-LEGO) exploit the LoRA structure but typically address only one of the two problems—coverage or anisotropy.
- Key Challenge: The update directions of LoRA adapters span different subspaces and contribute unevenly. Naive merging attenuates directions most critical to certain task losses while over-emphasizing relatively unimportant ones.
- Goal: Jointly handle two properties of LoRA merging: (a) subspace coverage, i.e., preserving the diversity of per-task LoRA directions after merging; and (b) directional anisotropy, i.e., the unequal sensitivity of different LoRA directions to task losses, which calls for fine-grained direction-level control.
- Key Insight: Effective rank (erank) analysis reveals that LoRA-aware rank-1 stacking retains approximately 70% of the per-task independent dimensions, whereas interpolation-based merging (e.g., Task Arithmetic) causes severe subspace collapse. Directional sensitivity analysis via Jacobians further shows that sensitivity distributions are highly inconsistent across different preferences.
- Core Idea: While preserving LoRA rank-1 directions to maintain subspace coverage, TARA addresses anisotropy by optimizing direction-level weights through preference-weighted smooth Tchebycheff scalarization.
Method¶
Overall Architecture¶
The TARA framework takes as input \(N\) task-specific LoRA adapters \(\{\Delta W_i = B_i A_i^\top\}\) and a user-specified task preference vector \(\boldsymbol{\rho}\). Each LoRA is decomposed into rank-1 directions (\(\mathbf{b}_{ij}\mathbf{a}_{ij}^\top\)), and a learnable scalar weight \(\phi_{ij}\) is assigned to each direction. These weights are optimized by minimizing a preference-weighted entropy pseudo-loss. The final merged model weight is \(W = W_0 + \sum_i\sum_j \phi_{ij} \mathbf{b}_{ij}\mathbf{a}_{ij}^\top\).
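The merged-weight formula above can be sketched directly. The following is a minimal NumPy illustration with toy dimensions and random matrices (the function name, shapes, and the uniform 0.4 initialization used in the demo are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def merge_lora_rank1(W0, Bs, As, phi):
    """Merge N task LoRAs by summing preference-weighted rank-1 directions:
    W = W0 + sum_i sum_j phi_ij * b_ij a_ij^T.
    Bs[i]: (d_out, r), As[i]: (d_in, r), phi[i]: (r,) per-direction weights."""
    W = W0.copy()
    for B, A, w in zip(Bs, As, phi):
        for j in range(B.shape[1]):
            W += w[j] * np.outer(B[:, j], A[:, j])
    return W

rng = np.random.default_rng(0)
d_out, d_in, r, n_tasks = 8, 6, 4, 2
W0 = rng.standard_normal((d_out, d_in))
Bs = [rng.standard_normal((d_out, r)) for _ in range(n_tasks)]
As = [rng.standard_normal((d_in, r)) for _ in range(n_tasks)]
phi = [np.full(r, 0.4) for _ in range(n_tasks)]  # toy uniform weights
W = merge_lora_rank1(W0, Bs, As, phi)
```

With uniform weights the sum collapses to ordinary scaled adapter addition, \(W_0 + 0.4\sum_i B_i A_i^\top\); the point of TARA is that \(\phi_{ij}\) is learned per direction rather than fixed.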
Key Designs¶
- Subspace Coverage Analysis and Preservation:
- Function: Quantify and preserve representational capacity in LoRA merging.
- Mechanism: An entropy-based effective rank (erank) is used to measure the effective dimensionality of LoRA directions. The rank-1 components of per-task LoRAs are vectorized and stacked, and the erank of three stacking strategies is compared: (1) per-task independent summation, (2) LoRA-agnostic \(\Delta W\) stacking, and (3) LoRA-aware rank-1 stacking. Rank-1 stacking retains approximately 70% of the independent dimensions, whereas \(\Delta W\) stacking suffers from severe collapse due to interpolation interference. TARA maintains subspace coverage by treating rank-1 directions as the fundamental optimization units.
- Design Motivation: Merging directly in the full parameter space destroys the internal low-rank structure of LoRA, resulting in a loss of representational capacity.
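The coverage comparison above can be reproduced in miniature. This sketch computes the entropy-based effective rank of (1) LoRA-aware rank-1 stacking versus (2) LoRA-agnostic \(\Delta W\) stacking, using random toy adapters (dimensions and data are assumptions for illustration):

```python
import numpy as np

def effective_rank(M):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution (higher = more dimensions used)."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d_out, d_in, r, n_tasks = 8, 6, 4, 2
Bs = [rng.standard_normal((d_out, r)) for _ in range(n_tasks)]
As = [rng.standard_normal((d_in, r)) for _ in range(n_tasks)]

# LoRA-aware stacking: one row per vectorized rank-1 direction b_j a_j^T.
rank1_stack = np.stack([np.outer(B[:, j], A[:, j]).ravel()
                        for B, A in zip(Bs, As) for j in range(r)])
# LoRA-agnostic stacking: one row per vectorized full update Delta W_i.
dw_stack = np.stack([(B @ A.T).ravel() for B, A in zip(Bs, As)])

erank_rank1 = effective_rank(rank1_stack)  # up to n_tasks * r dimensions
erank_dw = effective_rank(dw_stack)        # capped at n_tasks rows
```

Since \(\Delta W\) stacking yields only one row per task, its erank is bounded by \(N\), while rank-1 stacking exposes up to \(N \cdot r\) dimensions; this mirrors the paper's observation that interpolation-style merging collapses the subspace.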
- Directional Anisotropy Alignment:
- Function: Address the unequal sensitivity of different LoRA directions to task losses.
- Mechanism: A task-loss Jacobian is constructed as \(J_{i,k} = \langle \nabla f_i(W), S_k \rangle_F\), where \(S_k\) denotes the rank-1 LoRA direction. The condition number \(\kappa(J)\) reflects the degree of anisotropy. A directional sensitivity misalignment metric is further defined as \(\xi(\boldsymbol{\rho}_1, \boldsymbol{\rho}_2) = 1 - |\cos(\mathbf{h}(\boldsymbol{\rho}_1), \mathbf{h}(\boldsymbol{\rho}_2))|\), where \(h_k(\boldsymbol{\rho}) = \langle g(\boldsymbol{\rho}; W), S_k \rangle_F\) denotes the direction-level sensitivity. Experiments show that sensitivity distributions are highly inconsistent across different preferences (large \(\xi\)), validating the necessity of direction-level reweighting.
- Design Motivation: Equal-norm LoRA direction updates do not produce proportional changes in task loss; sensitivity differences must be accounted for at the direction level.
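The misalignment metric \(\xi\) is easy to sketch. The following toy NumPy version uses random stand-ins for the per-task gradients \(\nabla f_i\) and rank-1 directions \(S_k\) (all tensors here are hypothetical; the real \(h_k\) would come from model gradients):

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, K, n_tasks = 8, 6, 5, 2

# Toy stand-ins: per-task loss gradients and rank-1 LoRA directions S_k.
task_grads = [rng.standard_normal((d_out, d_in)) for _ in range(n_tasks)]
directions = [np.outer(rng.standard_normal(d_out), rng.standard_normal(d_in))
              for _ in range(K)]

def sensitivity_profile(rho):
    """h_k(rho) = <g(rho; W), S_k>_F, with g the preference-weighted gradient."""
    g = sum(w * gr for w, gr in zip(rho, task_grads))
    return np.array([np.sum(g * S) for S in directions])

def misalignment(rho1, rho2):
    """xi(rho1, rho2) = 1 - |cos(h(rho1), h(rho2))|, in [0, 1];
    0 means the two preferences induce the same direction-level sensitivity."""
    h1, h2 = sensitivity_profile(rho1), sensitivity_profile(rho2)
    c = (h1 @ h2) / (np.linalg.norm(h1) * np.linalg.norm(h2))
    return 1.0 - abs(c)

xi = misalignment(np.array([0.9, 0.1]), np.array([0.1, 0.9]))
```

A large \(\xi\) between two preference vectors means a single fixed set of direction weights cannot serve both, which is the paper's argument for preference-conditioned reweighting.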
- Two Merging Variants (Variant A/B) and Optimization:
- Function: Provide different efficiency–accuracy trade-offs.
- Mechanism: Variant A directly assigns weights to the rank-1 factors of each task: \(W_A = W_0 + \sum_i\sum_j \phi_{ij} \mathbf{b}_{ij}\mathbf{a}_{ij}^\top\), offering simplicity and efficiency. Variant B horizontally concatenates all adapters and applies SVD to obtain a shared orthonormal basis \(\{u_k\}\), then assigns weights to each task's projections onto the shared basis: \(W_B = W_0 + \sum_i\sum_k \phi_{ik}\sigma_k u_k v_{ki}^\top\). Optimization employs smooth Tchebycheff scalarization \(\Psi(\phi, \boldsymbol{\rho}) = \alpha \log(\sum_i \exp(\rho_i |f_i - z_i| / \alpha))\), using predictive entropy as a proxy for the true loss to avoid label dependency.
- Design Motivation: Variant A avoids SVD computation, making it suitable for resource-constrained settings; Variant B better decorrelates and reduces interference through the shared orthonormal basis, yielding higher accuracy.
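The smooth Tchebycheff scalarization used by both variants fits in a few lines. A minimal NumPy sketch with made-up loss, anchor, and preference values (the numbers are illustrative, not from the paper):

```python
import numpy as np

def smooth_tchebycheff(losses, anchors, rho, alpha=0.1):
    """Psi(phi, rho) = alpha * log sum_i exp(rho_i * |f_i - z_i| / alpha):
    a smooth upper bound on max_i rho_i |f_i - z_i| that tightens as alpha -> 0."""
    g = rho * np.abs(np.asarray(losses) - np.asarray(anchors)) / alpha
    m = g.max()
    return float(alpha * (m + np.log(np.exp(g - m).sum())))  # stable log-sum-exp

f = np.array([0.8, 0.3, 0.5])    # per-task proxy losses (hypothetical)
z = np.array([0.2, 0.1, 0.4])    # single-adapter anchors (hypothetical)
rho = np.array([0.5, 0.3, 0.2])  # user preference vector
psi = smooth_tchebycheff(f, z, rho, alpha=0.01)
```

Because the log-sum-exp approaches a hard max as \(\alpha \to 0\), minimizing \(\Psi\) drives down the worst preference-weighted gap \(\rho_i |f_i - z_i|\) while remaining differentiable everywhere.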
Loss & Training¶
An AdaMerging-style predictive entropy serves as an unsupervised proxy loss, combined with smooth Tchebycheff scalarization and user preferences to optimize the direction-level weights. Optimization uses AdamW with a learning rate of 0.001, direction weights initialized to 0.4, 500 iterations, and a batch size of 16. The anchor \(z_i\) is set to the entropy loss obtained when using only the single adapter for task \(i\).
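Putting the pieces together, the training loop can be sketched end to end. This toy PyTorch version uses a single linear layer, random unlabeled batches, and zero anchors for brevity (all shapes and data are illustrative assumptions; the paper runs 500 iterations on real adapters):

```python
import torch

torch.manual_seed(0)
d_in, d_out, n_tasks, r, alpha = 6, 4, 2, 3, 0.1

W0 = torch.randn(d_out, d_in)
Bs = [torch.randn(d_out, r) for _ in range(n_tasks)]   # toy LoRA factors
As = [torch.randn(d_in, r) for _ in range(n_tasks)]
xs = [torch.randn(16, d_in) for _ in range(n_tasks)]   # unlabeled batches
rho = torch.tensor([0.5, 0.5])                         # task preferences
z = torch.zeros(n_tasks)                               # anchors (0 for brevity)

phi = torch.full((n_tasks, r), 0.4, requires_grad=True)  # init 0.4
opt = torch.optim.AdamW([phi], lr=1e-3)

def entropy_loss(W, xb):
    """Predictive-entropy proxy: mean entropy of softmax outputs (no labels)."""
    p = torch.softmax(xb @ W.t(), dim=-1)
    return -(p * torch.log(p + 1e-8)).sum(-1).mean()

for _ in range(50):  # shortened from the paper's 500 iterations
    opt.zero_grad()
    W = W0 + sum(phi[i, j] * torch.outer(Bs[i][:, j], As[i][:, j])
                 for i in range(n_tasks) for j in range(r))
    f = torch.stack([entropy_loss(W, xs[i]) for i in range(n_tasks)])
    psi = alpha * torch.logsumexp(rho * (f - z).abs() / alpha, dim=0)
    psi.backward()
    opt.step()
```

Only the scalars \(\phi_{ij}\) receive gradients; the base weights and LoRA factors stay frozen, which is what keeps direction-level merging cheap relative to fine-tuning.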
Key Experimental Results¶
Main Results¶
| Method | Cars | DTD | EuroSAT | GTSRB | MNIST | RESISC45 | SUN397 | SVHN | Avg (normalized%) |
|---|---|---|---|---|---|---|---|---|---|
| TA | 82.1 | 74.3 | 48.7 | 41.8 | 53.4 | 71.5 | 96.6 | 42.0 | 63.8 |
| TIES | 81.0 | 72.5 | 53.8 | 37.4 | 69.0 | 65.3 | 94.8 | 45.3 | 64.9 |
| AdaMerging | 79.5 | 73.5 | 70.9 | 39.7 | 63.0 | 69.0 | 97.8 | 66.6 | 70.0 |
| KnOTS-TIES | 82.7 | 73.7 | 49.3 | 48.9 | 68.9 | 70.9 | 95.5 | 53.8 | 68.0 |
| LoRA-LEGO | 81.1 | 73.0 | 54.4 | 40.3 | 48.6 | 71.5 | 97.3 | 37.1 | 62.9 |
| TARA-A | 82.2 | 76.0 | 74.9 | 43.5 | 76.3 | 70.2 | 98.0 | 70.8 | 74.0 |
| TARA-B | 86.2 | 78.4 | 76.8 | 42.9 | 82.7 | 75.4 | 98.6 | 69.7 | 76.3 |
NLI Results (LLaMA-3 8B)¶
| Method | MNLI | QNLI | SNLI | RTE | SICK | SCITAIL | Avg (normalized%) |
|---|---|---|---|---|---|---|---|
| TA | 67.3 | 87.3 | 41.8 | 95.7 | 77.9 | 76.9 | 74.6 |
| AdaMerging | 47.5 | 92.9 | 41.3 | 102.6 | 93.8 | 94.2 | 78.7 |
| KnOTS-TIES | 41.1 | 83.4 | 56.6 | 87.2 | 87.9 | 94.8 | 75.2 |
| TARA-A | 51.7 | 92.6 | 41.4 | 102.6 | 95.3 | 94.4 | 79.7 |
| TARA-B | 46.8 | 94.1 | 41.4 | 103.4 | 98.1 | 97.8 | 80.3 |
Key Findings¶
- TARA-B achieves the best results on both vision and NLI tasks: 76.3% average on 8 vision tasks (vs. AdaMerging 70.0%) and 80.3% average on 6 NLI tasks (vs. AdaMerging 78.7%), demonstrating the importance of jointly addressing coverage and anisotropy.
- LoRA-LEGO underperforms vanilla baselines: Retaining rank-1 directions without sensitivity weighting is insufficient (62.9% vs. TA's 63.8%), confirming that both problems must be addressed simultaneously.
- Generalization to unseen tasks: After merging on 6 known tasks, TARA-B achieves 52.2% average accuracy on 2 unseen tasks, substantially outperforming TA (42.9%) and KnOTS-TIES (41.8%).
- Joint task evaluation: TARA-B achieves Hits@1 of 49.3% (vs. TA 43.5%, AdaMerging 48.1%).
Highlights & Insights¶
- Unification of two orthogonal perspectives: Decomposing the LoRA merging problem into the independent yet complementary dimensions of "coverage" and "anisotropy" yields a theoretically clear and elegant framework. Prior methods (KnOTS addressing coverage, AdaMerging addressing global weighting) each solve only half of the problem.
- Discovery of directional sensitivity misalignment: Jacobian analysis quantifies the inconsistency of LoRA direction sensitivity distributions across different preferences, providing a rigorous theoretical motivation for direction-level weighting.
- Efficiency advantage of LoRA-level operations: Compared to full-parameter merging, LoRA-level operations substantially reduce memory and computational requirements, making gradient-based merging methods scalable to foundation model sizes.
Limitations & Future Work¶
- Variant B requires joint SVD over all adapters, which may incur significant computational overhead when the number of tasks \(N\) is large or the LoRA rank is high.
- Experiments are primarily conducted on ViT-B/32 and LLaMA-3 8B; performance on larger models (e.g., 70B-scale) remains unknown.
- The preference vector \(\boldsymbol{\rho}\) must be manually specified by the user; automatic preference discovery would be more practical.
- The framework includes no explicit mechanism for resolving conflicting gradients across tasks (e.g., PCGrad-style gradient projection).
Related Work & Insights¶
- vs. AdaMerging: AdaMerging learns layer-level weights but ignores sensitivity differences within LoRA directions. TARA operates at finer direction-level granularity. Both methods share the same entropy minimization proxy.
- vs. KnOTS: KnOTS aligns subspaces via SVD to address coverage but neglects anisotropic weighting. TARA-B builds upon the SVD basis and additionally introduces direction-level weight optimization.
- vs. LoRA-LEGO: LoRA-LEGO preserves modularity through clustering, but the clustering process may discard critical directional information. TARA retains all original directions.
Rating¶
- Novelty: ⭐⭐⭐⭐ The identification of two analytical perspectives and the unified framework design are original, though the specific implementation is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ The dual-track evaluation across vision and NLI, joint evaluation, generalization testing, and preference sensitivity analysis are comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear, though the dense notation imposes a non-trivial reading burden.
- Value: ⭐⭐⭐⭐ The work makes a substantive contribution to the LoRA merging field; the open-sourced code enables direct practical use.