Label-Free Cross-Task LoRA Merging with Null-Space Compression¶
Conference: CVPR 2026 | arXiv: 2603.26317 | Code: GitHub | Area: Multimodal VLM | Keywords: Model Merging, LoRA, Null-Space Compression, Label-Free, Cross-Task
TL;DR¶
Motivated by the observation that the null-space ratio of the down-projection matrix \(\mathbf{A}\) decreases during LoRA fine-tuning and is strongly (negatively) correlated with task performance, this paper proposes NSC Merging, a label-free, task-agnostic LoRA merging method that achieves state-of-the-art results across 20 heterogeneous vision tasks, 6 NLI tasks, and VLM benchmarks.
Background & Motivation¶
Model merging combines independently fine-tuned checkpoints into a single multi-task model without joint training. In the era of foundation models, LoRA fine-tuning has become the de facto standard, making LoRA merging an increasingly important research direction.
Existing gradient-guided merging methods (e.g., AdaMerging) use output entropy minimization as a proxy objective to estimate merging coefficients. While effective on classification tasks, they face two fundamental limitations:

1. Inapplicable to regression tasks: entropy is only meaningful for classification, so tasks such as depth estimation and surface normal prediction cannot leverage this signal.
2. Not scalable to LLMs/VLMs: entropy must be computed at every generated token, incurring a cost that grows linearly with sequence length.
Key Challenge: A merging signal is needed that applies to both classification and regression tasks without relying on output logits.
Key Observation: During LoRA fine-tuning, the null space of the down-projection matrix \(\mathbf{A}\) is systematically compressed, i.e., an increasing fraction of each input activation's energy falls within the adapter's projection subspace. The resulting null-space ratio is strongly negatively correlated with task performance, which makes null-space compression a task-agnostic merging signal.
Method¶
Overall Architecture¶
- Fine-tune the base model independently with LoRA for each task.
- Pre-compute the Gram inverse matrix \((A_kA_k^\top)^{-1}\) for each adapter.
- Optimize layer-wise merging coefficients \(\{\lambda_k^\ell\}\) using the NSC objective.
- Output the merged multi-task model.
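The last step above amounts to a per-layer weighted sum of LoRA updates added to the frozen base weight. Below is a minimal sketch, assuming PyTorch tensors; the helper name `merge_lora_layer` and the tensor layout are illustrative assumptions, not the authors' released code.

```python
import torch

def merge_lora_layer(W0, loras, lambdas):
    """Merge per-task LoRA updates for one linear layer:
    W_merged = W0 + sum_k lambda_k * B_k @ A_k.

    W0:      (d_out, d_in) frozen base weight
    loras:   list of K (B_k, A_k) pairs with B_k (d_out, r), A_k (r, d_in)
    lambdas: K learned layer-wise merging coefficients
    """
    W = W0.clone()
    for (B, A), lam in zip(loras, lambdas):
        W = W + lam * (B @ A)
    return W
```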
Key Designs¶
- Null-Space Ratio Definition and Compression Dynamics:
  - Function: Provides a task-agnostic proxy signal for performance.
  - Mechanism: For a LoRA update \(\Delta W = BA\), the null-space ratio is defined as \(\omega_k^\ell(\mathbf{z}) = \frac{\|\text{Proj}_{\mathcal{N}(A_k^\ell)}(\mathbf{z})\|_2}{\|\mathbf{z}\|_2}\), the fraction of an input activation's norm that falls in the null space of \(A_k^\ell\) and is therefore discarded by the down-projection. This ratio decreases monotonically during training (null-space compression) and exhibits a strong negative correlation with task performance, for both classification and regression tasks.
  - Design Motivation: Since the LoRA rank is small (e.g., 16 vs. 768 dimensions), the adapter subspace covers only ~2.1% of the feature space. Null-space compression indicates that the adapter has learned to better capture task-relevant activations, making it a reliable performance proxy. Crucially, this is an input-driven signal that does not depend on output logits.
- NSC Merging Objective:
  - Function: Learns layer-wise merging coefficients without labels.
  - Mechanism: Minimizes the average null-space ratio across all tasks: \(\min_{\{\lambda_k^\ell\}} \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{\mathbf{x} \sim \mathcal{D}_k}[\Omega_k(\mathbf{x}; \Theta_{\text{merge}})]\), where \(\Omega_k\) is the mean null-space ratio across target layers. The merged model parameters are given by \(W_0^\ell + \sum_k \lambda_k^\ell B_k^\ell A_k^\ell\).
  - Design Motivation: The null-space ratio is computed purely from adapter geometry, making it applicable to any task type. For LLMs/VLMs, only the activations of input tokens are required, so computational cost is independent of sequence length.
- Fast NSC: Gram Inverse Caching:
  - Function: Substantially reduces computational overhead.
  - Mechanism: The null-space ratio admits the equivalent form \(\omega_k(\mathbf{z}) = \sqrt{1 - \frac{\mathbf{z}^\top A_k^\top(A_kA_k^\top)^{-1}A_k\mathbf{z}}{\|\mathbf{z}\|_2^2}}\). Since \(\mathbf{z}\) and \(A_k\mathbf{z}\) are already computed during inference, only the small matrix \((A_kA_k^\top)^{-1}\) (of dimension equal to the LoRA rank) needs to be pre-cached, avoiding the construction of the full null-space projection matrix (see the sketch after this list).
  - Target Layer Selection: The NSC objective is computed only over the last quarter of transformer blocks to balance efficiency and performance.
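The cached computation can be written in a few lines. The following is a minimal sketch, assuming row-major LoRA factors (\(A_k\) of shape \(r \times d\)) and batched activations; the function names are illustrative, not from the paper's code.

```python
import torch

def precompute_gram_inverse(A):
    """Cache the r x r inverse Gram matrix (A A^T)^{-1} once per adapter/layer."""
    return torch.linalg.inv(A @ A.T)

def null_space_ratio(z, A, gram_inv, eps=1e-8):
    """omega(z) = sqrt(1 - z^T A^T (A A^T)^{-1} A z / ||z||^2).

    z: (batch, d) input activations; A: (r, d) LoRA down-projection.
    Only A @ z (already available from the LoRA forward pass) and the cached
    r x r Gram inverse are needed -- no d x d projection matrix is built.
    """
    Az = z @ A.T                                             # (batch, r)
    row_energy = torch.einsum('bi,ij,bj->b', Az, gram_inv, Az)
    ratio_sq = 1.0 - row_energy / (z.pow(2).sum(-1) + eps)
    return ratio_sq.clamp(min=0.0).sqrt()                    # clamp guards round-off
```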
Loss & Training¶
- Optimizer: AdamW; lr = 0.001 (vision) / 0.0003 (LLM/VLM)
- Initialization: \(\lambda\) initialized to 0.4
- Iterations: 100 steps (vision) / 500 steps (LLM/VLM)
- Data: Unlabeled validation sets only
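Putting the objective and these settings together, here is a self-contained toy sketch of the coefficient optimization, assuming a tiny 4-layer linear model, random stand-in "unlabeled" batches, and the `null_space_ratio` helper from the previous sketch; the dimensions, ReLU stack, and single target layer are illustrative assumptions, not the paper's actual setup.

```python
import torch

torch.manual_seed(0)
d, r, K, L = 64, 4, 3, 4                      # hidden dim, LoRA rank, tasks, layers (toy)
target_layers = {3}                           # analogue of "last quarter of blocks"

W0 = [torch.randn(d, d) / d**0.5 for _ in range(L)]                    # frozen base weights
loras = [[(0.02 * torch.randn(d, r), 0.02 * torch.randn(r, d))         # (B, A) per task/layer
          for _ in range(L)] for _ in range(K)]
gram_inv = [[torch.linalg.inv(A @ A.T) for (_, A) in task] for task in loras]

def null_space_ratio(z, A, G, eps=1e-8):      # same helper as in the previous sketch
    Az = z @ A.T
    energy = torch.einsum('bi,ij,bj->b', Az, G, Az)
    return (1.0 - energy / (z.pow(2).sum(-1) + eps)).clamp(min=0.0).sqrt()

lam = torch.full((K, L), 0.4, requires_grad=True)   # layer-wise coefficients, init 0.4
opt = torch.optim.AdamW([lam], lr=1e-3)             # AdamW, lr = 0.001 (vision setting)
unlabeled = [torch.randn(32, d) for _ in range(K)]  # stand-in unlabeled validation batches

for step in range(100):                             # 100 steps (vision setting)
    loss = 0.0
    for k, x in enumerate(unlabeled):               # one unlabeled batch per task
        z = x
        ratios = []
        for l in range(L):
            if l in target_layers:                  # ratio of task k's adapter on its input
                ratios.append(null_space_ratio(z, loras[k][l][1], gram_inv[k][l]).mean())
            # Merged weight at this layer: W0 + sum_k lambda_k * B_k @ A_k.
            W = W0[l] + sum(lam[j, l] * loras[j][l][0] @ loras[j][l][1] for j in range(K))
            z = torch.relu(z @ W.T)
        loss = loss + torch.stack(ratios).mean()
    loss = loss / K
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned coefficients:\n", lam.detach())
```

After optimization, the learned coefficients are folded into the per-layer merge \(W_0^\ell + \sum_k \lambda_k^\ell B_k^\ell A_k^\ell\) to produce the final multi-task model.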
Key Experimental Results¶
Main Results — 20 Heterogeneous Vision Tasks (ViT-B)¶
| Method | NYUD-v2 (4 tasks) | PASCAL (5 tasks) | Taskonomy (11 tasks) | Overall Avg. |
|---|---|---|---|---|
| Task Arithmetic | ~46% | ~62% | ~103% | 77.2% |
| TIES | ~47% | ~62% | ~102% | 77.3% |
| KnOTS-TIES | ~45% | ~62% | ~102% | 76.6% |
| RobustMerge | ~69% | ~85% | ~100% | 89.9% |
| NSC (Ours) | ~75% | ~87% | ~100% | 92.0% |
(Values are performance normalized to single-task fine-tuning.)
LLM Experiments (LLaMA-3-8B, 6 NLI Tasks)¶
| Method | MNLI | QNLI | SNLI | RTE | SICK | SciTail | Avg. |
|---|---|---|---|---|---|---|---|
| TA | 92.8 | 86.8 | 93.3 | 93.6 | 83.8 | 95.0 | 90.9 |
| AdaMerging | 94.3 | 84.8 | 92.5 | 92.1 | 89.2 | 84.8 | 89.6 |
| RobustMerge | 94.3 | 88.1 | 93.7 | 93.6 | 83.0 | 94.5 | 91.2 |
| NSC (Ours) | 94.9 | 88.3 | 92.8 | 91.3 | 91.2 | 95.1 | 92.3 |
Ablation Study¶
| Configuration | Description | Effect |
|---|---|---|
| Full-layer NSC | Compute over all LoRA layers | Best performance, highest cost |
| Last 1/4 layers | Only the last quarter of transformer blocks | Near full-layer quality, substantially more efficient |
| Last 1 layer | Only the final block | Noticeable performance drop |
| Input IDs only | No image/text content required | Still effective on LLMs |
Key Findings¶
- NSC achieves the largest gains on heterogeneous mixed classification+regression benchmarks (~75% vs. ~69% for RobustMerge on NYUD-v2, and +2.1% on the overall average), as competing methods struggle with regression tasks.
- AdaMerging underperforms on LLMs (89.6% avg.): entropy must be computed over long generated sequences, and this cost limits how thoroughly the merging coefficients can be optimized.
- NSC exhibits the best task balance: it does not overfit a subset of tasks at the expense of others, a known failure mode of prior methods.
- The null-space ratio remains correlated with performance in the merged model: samples with lower ratios yield higher accuracy.
Highlights & Insights¶
- Null-space compression is an elegant observation: the down-projection matrix in LoRA gradually "captures" more task-relevant activations during training, and this dynamic is repurposed as a merging signal.
- The fully input-driven formulation naturally extends to regression and generative tasks, resolving the fundamental limitation of entropy-based approaches.
- The Gram inverse caching trick is practically valuable: it reduces the computational bottleneck from \(O(d^2)\) to \(O(r^2)\) (\(r \ll d\)).
- Being both label-free and output-free makes NSC the most lightweight gradient-guided merging method to date.
Limitations & Future Work¶
- A small amount of unlabeled data is still required for optimization; the method is not entirely data-free.
- Normalized performance on heterogeneous vision tasks reaches only 92%, leaving a non-trivial gap relative to single-task fine-tuning.
- The choice of target layers (last 1/4) is empirically determined and may require adjustment for different model architectures.
- Experiments are limited to ViT-B-scale vision models; effectiveness on larger models (e.g., ViT-L/H) remains unexplored.
Related Work & Insights¶
- vs. AdaMerging: Both optimize merging coefficients via gradient-based methods, but AdaMerging relies on output entropy (limited to classification, scales with sequence length), whereas NSC uses the null-space ratio (task-agnostic, input-driven).
- vs. KnOTS: KnOTS projects adapters onto a shared subspace and merges SVD components, focusing on adapter alignment rather than merging weight optimization.
- vs. Task Arithmetic: TA applies a global scaling factor and performs poorly on heterogeneous tasks (77.2%); NSC's layer-wise coefficients provide substantially finer-grained control.
Rating¶
- Novelty: ⭐⭐⭐⭐ The null-space compression observation is novel and compelling; deriving a merging signal from LoRA geometry is a natural and principled approach.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 20 vision tasks, 6 NLI tasks, and VLM benchmarks against 11 baselines, with comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation and method derivation are clear; experimental presentation is thorough.
- Value: ⭐⭐⭐⭐⭐ Addresses a critical bottleneck in model merging for heterogeneous tasks, with practical implications for the LoRA ecosystem.