Task Singular Vectors: Reducing Task Interference in Model Merging
Conference: CVPR 2025
arXiv: 2412.00081
Code: GitHub
Area: Model Compression/Model Merging
Keywords: Model Merging, Task Vectors, Singular Value Decomposition, Task Interference, Multi-Task Learning
TL;DR
The Task Singular Vectors (TSV) framework is proposed to analyze and resolve task interference in model merging within the SVD space of layer-wise task matrices. TSV-Compress compresses task vectors to 10% of their size while maintaining 99% accuracy, and TSV-Merge decorrelates singular vectors of different tasks via a whitening transformation, outperforming existing methods by an average of approximately 15 percentage points across 8/14/20 task merging scenarios.
Background & Motivation
- Value of Model Merging: With a vast number of pre-trained models publicly available, merging multiple task-specific fine-tuned models into a single multi-task model (without extra training) holds significant practical value.
- Limitations of Task Arithmetic: Task Arithmetic (TA) flattens the network into a single high-dimensional vector before adding task vectors, discarding the matrix structure of the weights. Consequently, it can only evaluate relationships between tasks using coarse-grained measures such as cosine similarity.
- Root Cause of Task Interference: When weight changes from different tasks overlap or conflict along shared directions, simply summing or averaging them causes task interference, degrading the performance of each task after merging.
- Intuition on Low-Rank Characteristics: Research on PEFT (such as LoRA) suggests that the weight change matrix produced by fine-tuning is approximately low-rank, so a few singular vectors are sufficient to faithfully represent the functional changes in a layer.
- Lack of Structural Analysis: Existing methods (TIES, DARE, Consensus TA), although attempting to reduce interference, still operate at the parameter level and fail to fully exploit the matrix structure of weights.
- Key Insight: Keep the layer-wise matrix structure, extract Task Singular Vectors (TSV) from each task matrix via SVD, and analyze and reduce task interference directly in the singular vector space.
Method
Overall Architecture
The TSV framework operates at the layer-wise matrix level. For each layer \(l\) and each task \(i\), the task matrix \(\Delta_i^{(l)} = \theta_{\text{ft}_i}^{(l)} - \theta_{\text{pre}}^{(l)}\) is calculated, and its SVD \(\Delta_i = U_i \Sigma_i V_i^\top\) is computed (the layer index \(l\) is dropped from here on for brevity). The framework consists of two complementary modules: TSV-C (Compression), which leverages the low-rank nature to preserve the top-\(k\) singular components, and TSV-M (Merging), which removes cross-task singular vector interference on top of the compressed components using Procrustes orthogonalization.
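The first step described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic weights, not the authors' code: the helper names `task_matrix` and `task_singular_vectors`, the layer shape, and the rank-4 synthetic update are all assumptions made for the example.

```python
import numpy as np

def task_matrix(theta_ft: np.ndarray, theta_pre: np.ndarray) -> np.ndarray:
    """Layer-wise task matrix: fine-tuned weights minus pre-trained weights."""
    return theta_ft - theta_pre

def task_singular_vectors(delta: np.ndarray):
    """Thin SVD of a task matrix, so that delta = U @ diag(S) @ Vt."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return U, S, Vt

# Synthetic stand-in for one layer (hypothetical shapes and values).
rng = np.random.default_rng(0)
theta_pre = rng.normal(size=(64, 32))
low_rank_update = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
theta_ft = theta_pre + low_rank_update + 1e-3 * rng.normal(size=(64, 32))

delta = task_matrix(theta_ft, theta_pre)
U, S, Vt = task_singular_vectors(delta)
reconstruction = U @ np.diag(S) @ Vt

print("largest singular values:", np.round(S[:6], 3))
```

Because the synthetic update has rank 4, the fifth singular value is far smaller than the first four, which mirrors the low-rank behaviour the framework relies on.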
Key Designs
Design 1: TSV-Compress (TSV-C), Low-Rank Task Vector Compression
- Function: Compresses each task vector to 10% of its original size while maintaining 99% accuracy.
- Mechanism: Based on the Eckart-Young theorem, only the top-\(k\) singular components \(\hat{\Delta}_i = \sum_{j=1}^{k} \sigma_j^i u_j^i v_j^{i\top}\) are retained for each layer-wise task matrix. When task identity is known (e.g., via a router), setting \(k = \frac{\text{rank}}{T}\) reduces storage requirements by a factor of \(T\). Experiments demonstrate that even retaining only 3% of the singular components results in a mere 1.5% drop in average accuracy.
- Design Motivation: Task matrices are naturally low-rank, meaning a vast number of singular components carry very little information. Discarding these components not only saves storage but also removes noise interference across different tasks.
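The truncation in Design 1 can be illustrated as follows. This is a sketch under stated assumptions: the function names, the synthetic rank-8 task matrix, and the choice of `min(m, n)` as the rank are made up for the example, and the storage count simply tallies the entries of the retained factors.

```python
import numpy as np

def compress_task_matrix(delta: np.ndarray, k: int):
    """Keep only the top-k singular components of a layer-wise task matrix."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def reconstruct(U_k: np.ndarray, S_k: np.ndarray, Vt_k: np.ndarray) -> np.ndarray:
    """Rebuild the rank-k approximation sum_j sigma_j * u_j * v_j^T."""
    return (U_k * S_k) @ Vt_k

rng = np.random.default_rng(0)
num_tasks = 4                      # T (hypothetical)
m, n = 64, 32                      # layer shape (hypothetical)
delta = rng.normal(size=(m, 8)) @ rng.normal(size=(8, n))
delta += 0.01 * rng.normal(size=(m, n))

k = min(m, n) // num_tasks         # k = rank / T, using min(m, n) as the rank
U_k, S_k, Vt_k = compress_task_matrix(delta, k)
delta_hat = reconstruct(U_k, S_k, Vt_k)

# Eckart-Young: the Frobenius error of the best rank-k approximation equals
# the root of the sum of squared discarded singular values.
full_S = np.linalg.svd(delta, compute_uv=False)
error = np.linalg.norm(delta - delta_hat)
expected_error = np.sqrt(np.sum(full_S[k:] ** 2))

stored = k * (m + n + 1)           # entries in U_k, S_k and Vt_k
dense = m * n                      # entries in the dense task matrix
print(f"k={k}, relative error={error / np.linalg.norm(delta):.4f}")
print(f"stored entries: {stored} vs dense: {dense}")
```

In this toy setting the relative error is small because the signal has rank 8 and `k` is also 8; real task matrices are only approximately low-rank, so the error depends on how quickly their singular values decay.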
Design 2: Singular Task Interference (STI), Task Interference Measurement
- Function: Quantifies layer-wise task interference based on the geometric relationship of singular vectors.
- Mechanism: The metric is defined as \(\text{STI}(\{\Delta_i\}) = \|(U^\top U - I)\Sigma(V^\top V - I)\|_1\), where \(U\) and \(V\) represent the concatenation of TSVs from all tasks. When singular vectors of different tasks are orthogonal, \(U^\top U\) and \(V^\top V\) are close to identity matrices, and STI approaches 0. When singular vectors heavily overlap, the STI value is high, indicating severe task interference.
- Design Motivation: Compared to global metrics like cosine similarity, STI characterizes interference at the layer-wise and singular-direction levels, providing a much finer level of analysis.
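A small numerical sketch of the STI formula is given below. Two points are assumptions not fixed by the text above: \(\Sigma\) is taken to be the block-diagonal matrix of the retained singular values of all tasks, and \(\|\cdot\|_1\) is read as the entrywise sum of absolute values. The function name `sti` and the toy task matrices are likewise illustrative.

```python
import numpy as np

def sti(deltas, k):
    """STI = ||(U^T U - I) Sigma (V^T V - I)||_1 for concatenated top-k TSVs.

    Assumption: the 1-norm is the entrywise sum of absolute values.
    """
    Us, Ss, Vs = [], [], []
    for delta in deltas:
        U, S, Vt = np.linalg.svd(delta, full_matrices=False)
        Us.append(U[:, :k])
        Ss.append(S[:k])
        Vs.append(Vt[:k, :].T)
    U = np.concatenate(Us, axis=1)        # (m, T*k)
    V = np.concatenate(Vs, axis=1)        # (n, T*k)
    Sigma = np.diag(np.concatenate(Ss))   # (T*k, T*k)
    I = np.eye(U.shape[1])
    return np.abs((U.T @ U - I) @ Sigma @ (V.T @ V - I)).sum()

rng = np.random.default_rng(0)
Q_u, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # orthonormal left basis
Q_v, _ = np.linalg.qr(rng.normal(size=(12, 12)))   # orthonormal right basis
spectrum = np.diag([3.0, 2.0])

# Tasks a and b use disjoint, mutually orthogonal singular directions.
delta_a = Q_u[:, :2] @ spectrum @ Q_v[:, :2].T
delta_b = Q_u[:, 2:4] @ spectrum @ Q_v[:, 2:4].T
# Task c reuses exactly the directions of task a (maximal overlap).
delta_c = 0.5 * delta_a

sti_orthogonal = sti([delta_a, delta_b], k=2)
sti_overlap = sti([delta_a, delta_c], k=2)
print(f"orthogonal tasks: {sti_orthogonal:.2e}")
print(f"overlapping tasks: {sti_overlap:.2f}")
```

The orthogonal pair yields an STI at floating-point zero, while the fully overlapping pair yields a clearly positive value, matching the qualitative behaviour described in the Mechanism bullet.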
Design 3: TSV-Merge (TSV-M), Whitening-Based De-interference Merging
- Function: Merges models without requiring validation data or additional training.
- Mechanism: Applies an orthogonal Procrustes transformation to the compressed and concatenated matrices \(\hat{U}\) and \(\hat{V}\): \(\hat{U}_\bot = P_U Q_U^\top\), where \(\hat{U} = P_U D_U Q_U^\top\) is the SVD of \(\hat{U}\) (and analogously \(\hat{V}_\bot = P_V Q_V^\top\)). This is equivalent to the whitening transformation \(X \mapsto X(X^\top X)^{-1/2}\). Following the transformation, the singular vectors of different tasks are decorrelated. The merged matrix is then reconstructed as \(\hat{M} = \hat{U}_\bot \hat{\Sigma} \hat{V}_\bot^\top\), where \(\hat{\Sigma}\) collects the retained singular values of all tasks, leading to the final merged weights \(\theta_{\text{MT}} = \theta_{\text{pre}} + \alpha \hat{M}\).
- Design Motivation: Because whitening is equivalent to the Procrustes solution, it can be computed from a single SVD rather than an explicit inverse square root, which favors numerical stability. The operation is also closed-form, with no need for iterative optimization, and decorrelation directly reduces the STI metric.
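The merging steps of Design 3 can be sketched for a single 2-D layer as follows. This is an illustrative reading of the description above rather than the reference implementation: the function names `orthogonalize` and `tsv_merge`, the choice \(k = \min(m, n) / T\), and the synthetic weights are assumptions.

```python
import numpy as np

def orthogonalize(X: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes step: if X = P D Q^T (thin SVD), return P Q^T."""
    P, _, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ Qt

def tsv_merge(theta_pre: np.ndarray, thetas_ft, alpha: float = 1.0) -> np.ndarray:
    """Merge one layer from several fine-tuned models (sketch of TSV-M)."""
    num_tasks = len(thetas_ft)
    k = min(theta_pre.shape) // num_tasks   # components kept per task (assumed)
    Us, Ss, Vs = [], [], []
    for theta_ft in thetas_ft:
        U, S, Vt = np.linalg.svd(theta_ft - theta_pre, full_matrices=False)
        Us.append(U[:, :k])
        Ss.append(S[:k])
        Vs.append(Vt[:k, :].T)
    U_hat = np.concatenate(Us, axis=1)          # concatenated left TSVs
    V_hat = np.concatenate(Vs, axis=1)          # concatenated right TSVs
    Sigma_hat = np.diag(np.concatenate(Ss))     # retained singular values
    U_perp = orthogonalize(U_hat)               # decorrelate across tasks
    V_perp = orthogonalize(V_hat)
    M_hat = U_perp @ Sigma_hat @ V_perp.T
    return theta_pre + alpha * M_hat

# Synthetic example: three "fine-tuned" copies of one 24 x 18 layer.
rng = np.random.default_rng(0)
theta_pre = rng.normal(size=(24, 18))
thetas_ft = [theta_pre + 0.1 * rng.normal(size=(24, 18)) for _ in range(3)]
theta_mt = tsv_merge(theta_pre, thetas_ft, alpha=1.0)
print(theta_mt.shape)
```

A useful sanity check on this sketch is the single-task case: with one task and all components kept, the concatenated factors are already orthonormal, so the merge returns the fine-tuned weights unchanged.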
Loss & Training
TSV-M does not involve any training or loss functions; it is a training-free, purely post-processing model merging method that only requires the weights of the pre-trained and task-specific fine-tuned models.
Key Experimental Results
Main Results: Average Accuracy of ViT-L-14 Multi-Task Merging
| Method | 8 tasks | 14 tasks | 20 tasks |
|---|---|---|---|
| Zero-shot | 64.70 | 68.20 | 65.23 |
| Weight Averaging | 79.56 | 76.73 | 71.60 |
| Task Arithmetic | 84.93 | 79.41 | 74.01 |
| Consensus TA | 86.34 | 82.22 | 79.00 |
| TSV-M (Ours) | ~90+ | ~87+ | ~83+ |
Detailed Results of ViT-B-32
| Method | 8 tasks | 14 tasks | 20 tasks |
|---|---|---|---|
| Task Arithmetic | 70.79 | 65.32 | 60.52 |
| TIES | ~72 | ~66 | ~62 |
| Consensus TA | 75.03 | 70.39 | 65.43 |
| TSV-M | 85.86 | 80.06 | ~76 |
Key Findings
- TSV-M improves accuracy by approximately 10.8 percentage points (75.03 to 85.86) on ViT-B-32/8 tasks compared to Consensus TA (the previous SOTA).
- TSV-C retains 99% accuracy with only 10% of the parameters, and even with only 3% of the parameters, the performance drop is only 1.5%.
- Compression and de-interference are complementary; both provide improvements individually, with their combination yielding the best overall performance.
- TSV-M does not require validation data, extra training, or labels, requiring the fewest resource assumptions (see Table 1).
- As the number of tasks increases (8 to 14 to 20), the advantage of TSV-M persists: on ViT-B-32 its lead over Consensus TA stays at roughly 10 percentage points in every setting, suggesting that de-interference remains valuable as more tasks are merged.
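The gaps quoted in the findings above can be recomputed from the exactly reported ViT-B-32 entries. The snippet below is only arithmetic on the table values; entries given approximately (prefixed with "~") are deliberately left out, and the dictionary layout and `gain` helper are illustrative.

```python
# Exact ViT-B-32 average accuracies copied from the table above.
vit_b_32 = {
    "Task Arithmetic": {8: 70.79, 14: 65.32},
    "Consensus TA": {8: 75.03, 14: 70.39},
    "TSV-M": {8: 85.86, 14: 80.06},
}

def gain(baseline: str, num_tasks: int) -> float:
    """Accuracy gain of TSV-M over a baseline, in percentage points."""
    return round(vit_b_32["TSV-M"][num_tasks] - vit_b_32[baseline][num_tasks], 2)

for num_tasks in (8, 14):
    for baseline in ("Task Arithmetic", "Consensus TA"):
        print(f"{num_tasks} tasks, vs {baseline}: +{gain(baseline, num_tasks)}")
```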
Highlights & Insights
- Mathematical Elegance: The proof of equivalence between whitening and Procrustes is concise, providing a solid theoretical foundation for the method.
- Minimal Requirements: TSV-M needs no validation data, labels, extra training, or routers, so it makes very few assumptions about available resources.
- STI Metric: Provides an interference analysis tool that is significantly more granular than cosine similarity, capable of being used independently to predict the quality of model merging.
- Consistency with PEFT: The low-rank discovery aligns with PEFT methods like LoRA, providing additional empirical evidence that fine-tuning induces low-rank updates.
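The whitening and Procrustes equivalence mentioned in the first highlight can be verified in a few lines. The derivation below is a sketch that assumes \(X\) has full column rank, so that \(D\) is invertible; the symbols \(P\), \(D\), \(Q\) denote the thin SVD factors of a generic matrix \(X\).

\[
X = P D Q^\top, \qquad P^\top P = I, \quad Q^\top Q = Q Q^\top = I
\]

\[
X^\top X = Q D P^\top P D Q^\top = Q D^2 Q^\top
\quad\Longrightarrow\quad
(X^\top X)^{-1/2} = Q D^{-1} Q^\top
\]

\[
X (X^\top X)^{-1/2} = P D Q^\top Q D^{-1} Q^\top = P Q^\top
\]

The right-hand side is exactly the Procrustes result \(P Q^\top\), so both routes produce the same matrix with orthonormal columns.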
Limitations & Future Work
- The computational cost of the SVD increases with the number of layers and parameter scale, and its feasibility on ultra-large models (e.g., LLMs) remains to be validated.
- The current validation is restricted to ViT+CLIP vision classification tasks; its generalization to other modalities/tasks, such as NLP, remains unknown.
- The whitening transformation might discard features too aggressively, which may not be optimal in scenarios where shared information between tasks is beneficial.
- The scaling factor \(\alpha\) still requires manual tuning.
Related Work & Insights
- The low-rank analysis of TSV offers a novel theoretical perspective for methods like LoRA: the essence of fine-tuning might be identifying task-specific directions within a low-rank subspace.
- The STI metric can serve as a pre-screening tool, helping to determine which tasks are suitable for merging and which are not.
- The idea of whitening for de-interference can be extended to model aggregation in federated learning.
Rating
⭐⭐⭐⭐ — Solid theoretical foundation and thorough experimentation, achieving a significant breakthrough in the field of model merging. The designs of the STI metric and Procrustes whitening are highly inspiring, and the method is simple, elegant, and training-free.