
Task Vector Quantization for Memory-Efficient Model Merging

Conference: ICCV 2025 | arXiv: 2503.06921 | Code: https://aim-skku.github.io/TVQ/ | Area: Model Compression | Keywords: Model Merging, Task Vector Quantization, Low-Bit Storage, Residual Quantization, Multi-Task Learning

TL;DR

This paper proposes quantizing task vectors (the difference between fine-tuned and pre-trained weights) rather than the fine-tuned weights themselves. By exploiting the narrower numerical range of task vectors, the method achieves quantization down to 3-bit without accuracy loss. The paper further proposes Residual Task Vector Quantization (RTVQ), which decomposes task vectors into a shared high-precision base vector and low-precision per-task offsets, maintaining or even improving model merging performance while using only 8% of the original storage.

Background & Motivation

Model merging constructs efficient multi-task models by combining multiple single-task fine-tuned models, avoiding the inference overhead of ensemble methods. Task Arithmetic defines task vectors as \(\tau_t = \theta_{ft}^t - \theta_{pre}\) and achieves multi-task merging via linear combination.
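
As a concrete picture, here is a minimal sketch of this merging rule in PyTorch-style code (checkpoints as plain dicts of tensors; the function names and the scaling coefficient `lam` are illustrative assumptions, not the paper's code):

```python
import torch

def task_vector(theta_ft: dict, theta_pre: dict) -> dict:
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def task_arithmetic_merge(theta_pre: dict, task_vectors: list, lam: float = 0.3) -> dict:
    """Task Arithmetic: add a scaled sum of task vectors to the pre-trained weights."""
    merged = {k: v.clone() for k, v in theta_pre.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += lam * tau[k]
    return merged
```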

However, storing multiple fine-tuned checkpoints incurs substantial storage costs. For example, a single ViT-L/14 checkpoint occupies 1.14 GB, totaling 22.8 GB for 20 tasks. On edge devices such as the NVIDIA Jetson Nano (16 GB storage), this overhead becomes a bottleneck for scaling to larger models and more tasks.

A naive approach is to directly quantize the fine-tuned weights. However, the authors observe a critical phenomenon: the numerical range of task vectors is an order of magnitude smaller than that of fine-tuned weights. Since the upper bound of quantization error is \(|\epsilon| \leq \frac{\theta_{max} - \theta_{min}}{2(2^b - 1)}\), a narrower range implies smaller quantization error at the same bit-width.
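
A toy numerical check of this argument (a sketch with simulated weights, assuming standard uniform asymmetric quantization; not the paper's code):

```python
import torch

def quantize_dequantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform asymmetric quantization followed by dequantization."""
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax            # step size Delta
    zero_point = torch.round(-x.min() / scale)    # shifts the grid so the minimum maps to 0
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

# Toy weights: fine-tuning moves the pre-trained weights only slightly,
# so the task vector has a much narrower range than the full fine-tuned weights.
theta_pre = torch.randn(10_000) * 0.05
theta_ft = theta_pre + torch.randn(10_000) * 0.005
tau = theta_ft - theta_pre

for bits in (4, 3, 2):
    err_full = (quantize_dequantize(theta_ft, bits) - theta_ft).norm().item()
    err_tau = (quantize_dequantize(tau, bits) - tau).norm().item()
    print(f"{bits}-bit  L2 error  fine-tuned weights: {err_full:.4f}  task vector: {err_tau:.4f}")
```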

Core Idea: Quantize task vectors rather than full weights, exploiting their naturally narrow dynamic range.

Method

Overall Architecture

The system comprises two quantization schemes: (1) TVQ — directly quantizing task vectors \(\tau_t\); and (2) RTVQ — decomposing task vectors into a shared base vector and task-specific offset vectors, quantized separately. Both are post-training quantization (PTQ) approaches that do not modify any merging method, operating solely on checkpoint storage.

Key Designs

  1. Task Vector Quantization (TVQ)

    • Function: Replaces stored fine-tuned weights \(\theta_{ft}^t\) with task vectors \(\tau_t = \theta_{ft}^t - \theta_{pre}\), which are then quantized.
    • Mechanism: Applies asymmetric quantization \(\tau^q = \text{Round}(\tau / \Delta) + z\), where \(\Delta = \frac{\tau_{max} - \tau_{min}}{2^b - 1}\). Since the dynamic range of \(\tau\) is far smaller than that of \(\theta_{ft}\), the step size \(\Delta\) is smaller at the same bit-width, yielding lower quantization error.
    • Design Motivation: Experiments confirm that the L2 quantization error of task vectors is significantly lower than that of directly quantized fine-tuned weights across all precisions, with the gap widening below 4-bit.
    • Key Advantage: The method only modifies the checkpoint storage format and integrates seamlessly into existing task vector merging frameworks (see the sketch after this list).
  2. Residual Task Vector Quantization (RTVQ)

    • Function: Addresses the sharp performance degradation of TVQ at ultra-low precision (2-bit).
    • Mechanism: Decomposes task vectors into two components: \(\tau_t = \underbrace{(\theta_{ft}^t - \theta_{ft\_avg})}_{\text{Offset Vector}} + \underbrace{(\theta_{ft\_avg} - \theta_{pre})}_{\text{Base Vector}}\), where \(\theta_{ft\_avg} = \frac{1}{T}\sum_t \theta_{ft}^t\) is the mean of all fine-tuned weights. The base vector is stored at higher precision (e.g., 3-bit) and the offsets at ultra-low precision (e.g., 2-bit); with 8 tasks this yields an effective precision of approximately \(2 + 3/8 = 2.375\) bits per task (see the sketch after this list).
    • Quantization Error Correction: The base vector is first quantized to obtain \(\theta_{ft\_avg\_ec} = Q(\theta_{ft\_avg} - \theta_{pre}) + \theta_{pre}\), and the corrected reference is then used to compute offset vectors, reducing accumulated error.
    • Design Motivation: The base vector is shared across all tasks, so its amortized overhead is negligible; the offset vectors, representing deviations from the mean, have smaller magnitudes and are thus well-suited for low-precision quantization.
  3. Quantization Regularization Effect

    • Function: The paper finds that 3-bit quantization sometimes improves merging performance.
    • Mechanism: Quantization noise acts as a regularizer, reducing overfitting in a manner analogous to dropout.
    • This phenomenon — where TVQ-INT3 outperforms FP32 — is consistently observed across Task Arithmetic, Ties Merging, and other merging methods.
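
The two quantization schemes above can be sketched as follows, reusing the `quantize_dequantize` helper from earlier and treating checkpoints as plain dicts of tensors. The names and the immediate dequantization are simplifications for clarity; a real implementation would store the integer codes plus per-layer scale and zero-point rather than dequantized floats.

```python
def tvq(theta_ft: dict, theta_pre: dict, bits: int = 3) -> dict:
    """TVQ: quantize the task vector (per layer) instead of the fine-tuned weights."""
    return {k: quantize_dequantize(theta_ft[k] - theta_pre[k], bits) for k in theta_pre}

def rtvq(theta_fts: list, theta_pre: dict, base_bits: int = 3, offset_bits: int = 2):
    """RTVQ: one shared base vector at higher precision + per-task offsets at ultra-low precision."""
    # Shared reference: mean of all fine-tuned checkpoints.
    theta_avg = {k: sum(t[k] for t in theta_fts) / len(theta_fts) for k in theta_pre}
    # Base vector (theta_avg - theta_pre), quantized once and shared by every task.
    base_q = {k: quantize_dequantize(theta_avg[k] - theta_pre[k], base_bits) for k in theta_pre}
    # Error-corrected reference: rebuild the average from the quantized base vector,
    # so the base quantization error is absorbed into the offsets instead of accumulating.
    theta_avg_ec = {k: base_q[k] + theta_pre[k] for k in theta_pre}
    # Per-task offsets, measured against the corrected reference and quantized at low precision.
    offsets_q = [
        {k: quantize_dequantize(t[k] - theta_avg_ec[k], offset_bits) for k in theta_pre}
        for t in theta_fts
    ]
    return base_q, offsets_q

def reconstruct_task_vector(base_q: dict, offset_q: dict) -> dict:
    """Approximate task vector for one task: shared base plus that task's offset."""
    return {k: base_q[k] + offset_q[k] for k in base_q}
```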

Loss & Training

  • No additional training is required; the method is a pure PTQ approach.
  • Quantization is performed offline at the checkpoint stage, prior to model merging.
  • Layer-wise quantization is supported, with scale and zero-point computed independently per layer (see the usage sketch below).
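
An illustrative end-to-end use, under the same assumptions as the sketches above: quantization happens once, offline, per layer of the checkpoint, and the downstream merging call is unchanged.

```python
# Offline, at checkpoint-save time: for each layer, scale and zero-point are computed
# from that layer's task vector alone, and the quantized task vector is stored
# in place of the full fine-tuned weights.
tau_q = tvq(theta_ft, theta_pre, bits=3)

# Later, at merge time, any task-vector merging method consumes it unchanged.
theta_merged = task_arithmetic_merge(theta_pre, [tau_q], lam=0.3)
```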

Key Experimental Results

Main Results

8-task image classification, average accuracy in % (ViT-B/32):

| Method | FP32 | FQ-INT4 | TVQ-INT4 | TVQ-INT3 | TVQ-INT2 | RTVQ(3+2) |
|---|---|---|---|---|---|---|
| Task Arithmetic | 69.2 | 4.2 | 69.1 | 71.2 | 62.1 | 70.2 |
| AdaMerging | 81.8 | 4.5 | 81.5 | 82.0 | 78.1 | 82.8 |
| EMR-Merging | 88.3 | 3.9 | 89.8 | 90.0 | 77.2 | 83.2 |

8-task image classification, average accuracy in % (ViT-L/14):

| Method | FP32 | TVQ-INT4 | TVQ-INT3 | RTVQ(3+2) |
|---|---|---|---|---|
| Task Arithmetic | 84.3 | 84.4 | 84.8 | 84.8 |
| AdaMerging | 90.8 | 90.9 | 91.0 | 90.9 |

Ablation Study

Dense prediction tasks (NYUv2, ResNet-50):

| Method | Seg (mIoU) ↑ | Depth (RelErr) ↓ | Normal (MAE) ↓ |
|---|---|---|---|
| Task Arith. FP32 | 31.6 | 24.0 | 30.6 |
| Task Arith. TVQ-4 | 31.5 | 24.0 | 30.6 |
| Task Arith. TVQ-2 | 36.4 | 26.2 | 36.1 |
| Task Arith. RTVQ | 36.1 | 24.6 | 32.6 |

Key Findings

  • FQ-INT4 fails completely: Directly quantizing fine-tuned weights to 4-bit reduces accuracy to approximately 4% (random-chance level), demonstrating the necessity of task vector quantization.
  • TVQ-INT3 consistently outperforms FP32: The regularization effect of quantization noise appears consistently across multiple merging methods.
  • RTVQ effectively mitigates 2-bit degradation: Where TVQ-INT2 suffers significant performance drops, RTVQ (effective \(2 + 3/8 = 2.375\) bits/task with 8 tasks) maintains performance close to FP32.
  • RTVQ becomes more advantageous as task count increases: At 20 tasks the effective bit-width falls to \(2 + 3/20 = 2.15\) bits/task, since the shared base vector is amortized over more tasks.
  • Storage requires only 8% of the original FP32 checkpoint size.

Highlights & Insights

  • Simple yet powerful observation: The narrow dynamic range of task vectors relative to fine-tuned weights, though straightforward to observe, underpins the entire method's effectiveness.
  • Seamless compatibility with existing methods: As a purely checkpoint-side operation, TVQ integrates into Task Arithmetic, Ties Merging, AdaMerging, and all other merging frameworks at zero modification cost.
  • Elegant residual decomposition: The shared base vector plus task-specific offset decomposition naturally exploits inter-task redundancy.
  • Unexpected regularization finding: The phenomenon of 3-bit quantization improving performance carries theoretical research value.

Limitations & Future Work

  • The RTVQ base vector depends on the mean of all tasks; adding new tasks requires recomputation.
  • Dense prediction tasks are more sensitive to quantization, particularly under merging methods that rely on task-specific masks, such as EMR-Merging.
  • Only the PTQ paradigm is evaluated; whether quantization-aware training (QAT) could further improve performance remains unexplored.
  • Validation on NLP tasks is limited, with the experiments primarily focused on vision domains.
  • Unlike storage optimization methods such as TALL Mask and TSV-C, TVQ is more general and requires no additional task-specific structures.
  • The discovery of quantization regularization effects may inspire new regularization strategies for model merging.
  • The residual decomposition idea is extensible to other scenarios requiring storage of multiple model variants (e.g., storing multiple LoRA adapters).

Rating

  • Novelty: ⭐⭐⭐⭐ The core observation is simple but profound; the RTVQ design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers classification, dense prediction, and NLP; evaluates 8/14/20-task scales across multiple merging methods.
  • Writing Quality: ⭐⭐⭐⭐ Logically clear with persuasive figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Highly practical — preserving performance at 8% storage has significant implications for resource-constrained deployment.