Task Vector Quantization for Memory-Efficient Model Merging¶
Conference: ICCV 2025 arXiv: 2503.06921 Code: https://aim-skku.github.io/TVQ/ Area: Model Compression Keywords: Model Merging, Task Vector Quantization, Low-Bit Storage, Residual Quantization, Multi-Task Learning
TL;DR¶
This paper proposes quantizing task vectors (the difference between fine-tuned and pre-trained weights) rather than the fine-tuned weights themselves. By exploiting the narrower numerical range of task vectors, the method achieves quantization down to 3-bit without accuracy loss. The paper further proposes Residual Task Vector Quantization (RTVQ), which decomposes task vectors into a shared high-precision base vector and low-precision per-task offsets, maintaining or even improving model merging performance while using only 8% of the original storage.
Background & Motivation¶
Model merging constructs efficient multi-task models by combining multiple single-task fine-tuned models, avoiding the inference overhead of ensemble methods. Task Arithmetic defines task vectors as \(\tau_t = \theta_{ft}^t - \theta_{pre}\) and achieves multi-task merging via linear combination.
However, storing multiple fine-tuned checkpoints incurs substantial storage costs. For example, a single ViT-L/14 checkpoint occupies 1.14 GB, totaling 22.8 GB for 20 tasks. On edge devices such as the NVIDIA Jetson Nano (16 GB storage), this overhead becomes a bottleneck for scaling to larger models and more tasks.
A naive approach is to directly quantize the fine-tuned weights. However, the authors observe a critical phenomenon: the numerical range of task vectors is an order of magnitude smaller than that of fine-tuned weights. Since the upper bound of quantization error is \(|\epsilon| \leq \frac{\theta_{max} - \theta_{min}}{2(2^b - 1)}\), a narrower range implies smaller quantization error at the same bit-width.
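To make the bound concrete (illustrative numbers, not taken from the paper): at \(b = 3\) bits, a fine-tuned weight range of \(0.4\) gives \(|\epsilon| \leq \frac{0.4}{2(2^3 - 1)} \approx 0.029\), while a task-vector range ten times narrower gives \(|\epsilon| \leq \frac{0.04}{14} \approx 0.0029\); the error bound shrinks in direct proportion to the range.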
Core Idea: Quantize task vectors rather than full weights, exploiting their naturally narrow dynamic range.
Method¶
Overall Architecture¶
The system comprises two quantization schemes: (1) TVQ — directly quantizing task vectors \(\tau_t\); and (2) RTVQ — decomposing task vectors into a shared base vector and task-specific offset vectors, quantized separately. Both are post-training quantization (PTQ) approaches that do not modify any merging method, operating solely on checkpoint storage.
Key Designs¶
- Task Vector Quantization (TVQ)
    - Function: Replaces stored fine-tuned weights \(\theta_{ft}^t\) with task vectors \(\tau_t = \theta_{ft}^t - \theta_{pre}\), which are then quantized.
    - Mechanism: Applies asymmetric quantization \(\tau^q = \text{Round}(\tau / \Delta) + z\), where \(\Delta = \frac{\tau_{max} - \tau_{min}}{2^b - 1}\). Since the dynamic range of \(\tau\) is far smaller than that of \(\theta_{ft}\), the step size \(\Delta\) is smaller at the same bit-width, yielding lower quantization error.
    - Design Motivation: Experiments confirm that the L2 quantization error of task vectors is significantly lower than that of directly quantized fine-tuned weights across all precisions, with the gap widening below 4-bit.
    - Key Advantage: The method only modifies the checkpoint storage format and integrates seamlessly into existing task-vector merging frameworks (see the sketch after this list).
- Residual Task Vector Quantization (RTVQ)
    - Function: Addresses the sharp performance degradation of TVQ at ultra-low precision (2-bit); sketched after this list.
    - Mechanism: Decomposes each task vector into two components: \(\tau_t = \underbrace{(\theta_{ft}^t - \theta_{ft\_avg})}_{\text{Offset Vector}} + \underbrace{(\theta_{ft\_avg} - \theta_{pre})}_{\text{Base Vector}}\), where \(\theta_{ft\_avg} = \frac{1}{T}\sum_t \theta_{ft}^t\) is the mean of all fine-tuned weights. The base vector is stored at higher precision (e.g., 3-bit) and the offsets at ultra-low precision (e.g., 2-bit), giving an effective precision of \(b_{off} + b_{base}/T = 2 + 3/8 = 2.375\) bits per task for \(T = 8\) tasks.
    - Quantization Error Correction: The base vector is quantized first, yielding the corrected reference \(\theta_{ft\_avg\_ec} = Q(\theta_{ft\_avg} - \theta_{pre}) + \theta_{pre}\); offset vectors are then computed against this reference as \(\theta_{ft}^t - \theta_{ft\_avg\_ec}\), so the base vector's quantization error does not accumulate into the offsets.
    - Design Motivation: The base vector is shared across all tasks, so its amortized overhead is negligible; the offset vectors, representing deviations from the mean, have smaller magnitudes and are thus well-suited to low-precision quantization.
- Quantization Regularization Effect
    - Finding: 3-bit quantization sometimes improves merging performance over FP32.
    - Mechanism: Quantization noise acts as a regularizer, reducing overfitting in a manner analogous to dropout.
    - The phenomenon of TVQ-INT3 outperforming FP32 is consistently observed across Task Arithmetic, Ties Merging, and other merging methods.
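A minimal PyTorch sketch of TVQ as described above (my own illustration of the stated formulas, not the authors' released code; tensor shapes and ranges are synthetic):

```python
import torch

def asym_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform asymmetric quantization; returns the dequantized tensor.
    A real implementation would store the integer codes plus (delta, zero)."""
    qmax = 2 ** bits - 1
    delta = (x.max() - x.min()) / qmax      # step size scales with dynamic range
    zero = torch.round(-x.min() / delta)    # zero-point
    q = torch.clamp(torch.round(x / delta) + zero, 0, qmax)
    return (q - zero) * delta               # dequantize for downstream merging

# Synthetic stand-ins: the task vector is ~10x narrower than the weights.
torch.manual_seed(0)
theta_pre = torch.randn(1_000_000) * 0.05
tau = torch.randn(1_000_000) * 0.005
theta_ft = theta_pre + tau

for bits in (4, 3, 2):
    err_fq = (asym_quantize(theta_ft, bits) - theta_ft).norm().item()          # quantize weights
    err_tvq = (theta_pre + asym_quantize(tau, bits) - theta_ft).norm().item()  # quantize task vector
    print(f"{bits}-bit | FQ L2 err {err_fq:.3f} | TVQ L2 err {err_tvq:.3f}")
```

At every bit-width the TVQ reconstruction error is roughly an order of magnitude lower, mirroring the paper's L2-error comparison.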
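And a sketch of the RTVQ decomposition with error correction, under the same assumptions (hypothetical helper reusing `asym_quantize` from above):

```python
def rtvq(theta_fts: list[torch.Tensor], theta_pre: torch.Tensor,
         base_bits: int = 3, offset_bits: int = 2) -> list[torch.Tensor]:
    """Reconstruct per-task weights from a shared base + per-task offsets."""
    theta_avg = torch.stack(theta_fts).mean(dim=0)
    # Error-corrected reference: quantize the base vector first, ...
    theta_avg_ec = asym_quantize(theta_avg - theta_pre, base_bits) + theta_pre
    # ... then compute offsets against the corrected reference, so the base
    # vector's quantization error does not leak into the offsets.
    return [
        theta_avg_ec + asym_quantize(theta_ft - theta_avg_ec, offset_bits)
        for theta_ft in theta_fts
    ]

# Effective storage for T tasks: offset_bits + base_bits / T bits per weight,
# e.g. 2 + 3/8 = 2.375 bits for 8 tasks.
```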
Loss & Training¶
- No additional training is required; the method is a pure PTQ approach.
- Quantization is performed offline at the checkpoint stage, prior to model merging.
- Layer-wise quantization is supported, with scale and zero-point computed independently per layer.
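Continuing the sketch above, layer-wise PTQ at the checkpoint stage could be as simple as a loop over the state dict (hypothetical helper name; real storage would keep the integer codes and per-layer scale/zero-point rather than dequantized floats):

```python
def quantize_checkpoint(theta_ft: dict, theta_pre: dict, bits: int = 3) -> dict:
    """Per-layer TVQ: each tensor gets its own scale and zero-point."""
    return {
        name: asym_quantize(theta_ft[name] - theta_pre[name], bits)
        for name in theta_ft
    }

# tau_q = quantize_checkpoint(finetuned.state_dict(), pretrained.state_dict())
# Merging methods then consume theta_pre[name] + tau_q[name] as usual.
```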
Key Experimental Results¶
Main Results¶
8-task image classification (ViT-B/32):
| Method | FP32 | FQ-INT4 | TVQ-INT4 | TVQ-INT3 | TVQ-INT2 | RTVQ(3+2) |
|---|---|---|---|---|---|---|
| Task Arithmetic | 69.2 | 4.2 | 69.1 | 71.2 | 62.1 | 70.2 |
| AdaMerging | 81.8 | 4.5 | 81.5 | 82.0 | 78.1 | 82.8 |
| EMR-Merging | 88.3 | 3.9 | 89.8 | 90.0 | 77.2 | 83.2 |
8-task image classification (ViT-L/14):
| Method | FP32 | TVQ-INT4 | TVQ-INT3 | RTVQ(3+2) |
|---|---|---|---|---|
| Task Arithmetic | 84.3 | 84.4 | 84.8 | 84.8 |
| AdaMerging | 90.8 | 90.9 | 91.0 | 90.9 |
Ablation Study¶
Dense prediction tasks (NYUv2, ResNet-50):
| Method | Seg(mIoU)↑ | Depth(RelErr)↓ | Normal(MAE)↓ |
|---|---|---|---|
| Task Arith. FP32 | 31.6 | 24.0 | 30.6 |
| Task Arith. TVQ-4 | 31.5 | 24.0 | 30.6 |
| Task Arith. TVQ-2 | 36.4 | 26.2 | 36.1 |
| Task Arith. RTVQ | 36.1 | 24.6 | 32.6 |
Key Findings¶
- FQ-INT4 fails completely: Directly quantizing fine-tuned weights to 4-bit reduces accuracy to approximately 4% (random-chance level), demonstrating the necessity of task vector quantization.
- TVQ-INT3 consistently outperforms FP32: The regularization effect of quantization noise appears consistently across multiple merging methods.
- RTVQ effectively mitigates 2-bit degradation: When TVQ-INT2 suffers significant performance drops, RTVQ (effective 2.375 bit) maintains performance close to FP32.
- RTVQ becomes more advantageous as task count increases: At 20 tasks, the effective bit-width reduces to only 2.15 bit/task, with stronger amortization of the base vector overhead.
- Storage requires only 8% of the original FP32 checkpoint size.
Highlights & Insights¶
- Simple yet powerful observation: The narrow dynamic range of task vectors relative to fine-tuned weights, though straightforward to observe, underpins the entire method's effectiveness.
- Seamless compatibility with existing methods: As a purely checkpoint-side operation, TVQ integrates into Task Arithmetic, Ties Merging, AdaMerging, and all other merging frameworks at zero modification cost.
- Elegant residual decomposition: The shared base vector plus task-specific offset decomposition naturally exploits inter-task redundancy.
- Unexpected regularization finding: The observation that 3-bit quantization can improve merging performance merits further theoretical study.
Limitations & Future Work¶
- The RTVQ base vector depends on the mean of all tasks; adding new tasks requires recomputation.
- Dense prediction tasks are more sensitive to quantization, particularly methods relying on task-specific masks such as EMR-Merging.
- Only the PTQ paradigm is evaluated; whether quantization-aware training (QAT) could further improve performance remains unexplored.
- Validation on NLP tasks is limited, with the experiments primarily focused on vision domains.
Related Work & Insights¶
- Unlike storage optimization methods such as TALL Mask and TSV-C, TVQ is more general and requires no additional task-specific structures.
- The discovery of quantization regularization effects may inspire new regularization strategies for model merging.
- The residual decomposition idea is extensible to other scenarios requiring storage of multiple model variants (e.g., storing multiple LoRA adapters).
Rating¶
- Novelty: ⭐⭐⭐⭐ The core observation is simple but profound; the RTVQ design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers classification, dense prediction, and NLP; evaluates 8/14/20-task scales across multiple merging methods.
- Writing Quality: ⭐⭐⭐⭐ Logically clear with persuasive figures and tables.
- Value: ⭐⭐⭐⭐⭐ Highly practical — preserving performance at 8% storage has significant implications for resource-constrained deployment.