
MergeTune: Continued Fine-Tuning of Vision-Language Models

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=MAApSY32Z6
Paper: OpenReview
Code: https://github.com/Surrey-UP-Lab/MERGETUNE
Area: Multimodal VLM
Keywords: VLM Continued Fine-Tuning, CLIP Adaptation, Model Merging, Linear Mode Connectivity, Catastrophic Forgetting

TL;DR

MERGETUNE frames the recovery of pre-trained knowledge in an already fine-tuned CLIP-style VLM as a "continued fine-tuning" problem. It further optimizes the previously trained parameters under Linear Mode Connectivity (LMC) constraints, so that the final model stays close to the zero-shot CLIP while remaining connected to the downstream fine-tuned model along a low-loss linear path. This improves base-to-novel generalization, cross-dataset generalization, domain generalization, and ID-OOD robustness without adding inference-time parameters.

Background & Motivation

Background: Vision-language models like CLIP rely on large-scale image-text pre-training to achieve strong zero-shot generalization, but real-world downstream tasks often require adaptation. Common practices include parameter-efficient fine-tuning (PEFT) methods like CoOp, KgCoOp, MMA, or PromptKD, which update only prompts, adapters, or lightweight heads. Another line is robust fine-tuning, which fine-tunes the entire model or linear heads on large data like ImageNet and mitigates out-of-distribution performance drops via weight averaging or prediction ensembles.

Limitations of Prior Work: These methods usually focus on "minimizing forgetting during fine-tuning," yet a portion of pre-trained knowledge is still lost after adaptation. A direct observation in the paper is that across 11 cross-dataset evaluations, no existing PEFT method consistently outperforms zero-shot CLIP. Furthermore, post-hoc model merging methods like TIES or DARE often degrade when merging zero-shot and fine-tuned checkpoints, suggesting the two solutions do not naturally lie on a low-loss linear path in the weight space.

Key Challenge: Downstream fine-tuning pulls the model toward task data, while zero-shot generalization relies on the knowledge near the original pre-trained solution. Simply restricting updates sacrifices adaptation capability, while naive checkpoint merging may traverse high-loss regions, leading to unstable trade-offs between base classes, novel classes, or OOD data. The problem is not just "how to average two models," but "whether a new solution can be learned that maintains low-loss connectivity with both endpoints."

Goal: The authors reframe this as continued fine-tuning: given an already adapted model, they aim to optimize the trainable parameters in a post-hoc stage to recover forgotten pre-trained knowledge while preserving downstream task performance, without reworking the original fine-tuning pipeline, changing the architecture, or requiring pre-training data.

Key Insight: MERGETUNE approaches this from the geometric perspective of model merging and mode connectivity. If a model \(w\) has low-loss linear paths to both the zero-shot solution \(\hat{w}_1\) and the fine-tuned solution \(\hat{w}_2\), then \(w\) is not just a crude average but resides in a connectivity region that inherits knowledge from both ends. The authors turn this property into an explicit training objective rather than a passive post-training averaging attempt.

Core Idea: Use Linear Mode Connectivity to guide continued fine-tuning, learning a continued model that is connected to both zero-shot CLIP and the fine-tuned downstream model with low loss, thereby "injecting" pre-trained generalization back into adapted VLMs.

Method

Overall Architecture

MERGETUNE takes two checkpoints as input: the zero-shot VLM weights \(\hat{w}_1\) (e.g., original CLIP) and weights \(\hat{w}_2\) already trained by a downstream method (e.g., CoOp, KgCoOp, MMA, PromptKD, linear probing, or E2E-FT). The method introduces no new architecture; instead, it initializes a continued model \(w\) in the same parameter space, continues training it on downstream data, and constrains it to remain close to the zero-shot solution while maintaining low-loss linear connectivity with the fine-tuned solution. After training, the single model \(w\) is used for inference. In robust fine-tuning scenarios, \(w\) can also be combined with the zero-shot model via standard weight interpolation as a stronger ensemble version.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["zero-shot VLM<br/>Pre-trained Knowledge"] --> C["Dual-endpoint Continued<br/>Fine-tuning"]
    B["fine-tuned VLM<br/>Downstream Adaptation"] --> C
    C --> D["Pre-trained End Replay-free<br/>Second-order Proxy"]
    C --> E["Downstream End LMC<br/>Low-loss Path Constraint"]
    D --> F["Continued Model<br/>Single-model Inference"]
    E --> F
    F --> G["Optional Weight Ensemble<br/>Further Robustification"]

The diagram highlights three contributions. Dual-endpoint continued fine-tuning defines the target solution. The replay-free second-order proxy at the pre-trained end handles the unavailability of the original CLIP training data. The LMC constraint at the downstream end keeps the loss low along the interpolation path from the continued model to the fine-tuned checkpoint.

Key Designs

1. Dual-endpoint Continued Fine-tuning: Solving Forgetting Post-adaptation

MERGETUNE's main shift is that it no longer treats catastrophic forgetting solely as a regularization problem to be handled during fine-tuning. Traditional PEFT restricts parameters during training, and robust fine-tuning averages weights after training. MERGETUNE instead asks: since an adapted model already exists, can we continue training a third model \(w\) that acts as a mergeable, smooth bridge between the two endpoints?

Formally, the goal is for \(w\) to be linearly connected to the zero-shot solution \(\hat{w}_1\) and fine-tuned solution \(\hat{w}_2\) with low loss along the paths:

\[ w = \arg\min_w \mathbb{E}_{\alpha \sim U[0,1]}\left[L_1(\hat{w}_1 + \alpha(w - \hat{w}_1)) + L_2(\hat{w}_2 + \alpha(w - \hat{w}_2))\right]. \]

This objective is proactive: instead of picking an interpolation coefficient \(\alpha\) and hoping for a good average, it optimizes the position of \(w\) so it enters a low-loss connectivity region. Consequently, MERGETUNE can be applied as a post-hoc enhancement to various VLM adaptation methods.
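What "linearly connected with low loss" means can be made concrete with a small diagnostic. The sketch below measures a loss barrier along the straight line between two weight vectors. It assumes one common definition of the barrier (the largest excess of the path loss over the linear interpolation of the endpoint losses); this summary doesn't state which definition the paper uses, and `lerp`, `loss_barrier`, and the toy losses are hypothetical names chosen for illustration.

```python
def lerp(u, v, alpha):
    """Point on the segment from u (alpha=0) to v (alpha=1)."""
    return [a + alpha * (b - a) for a, b in zip(u, v)]


def loss_barrier(loss_fn, u, v, n_points=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    loss_u, loss_v = loss_fn(u), loss_fn(v)
    barrier = 0.0
    for i in range(n_points):
        alpha = i / (n_points - 1)
        baseline = (1 - alpha) * loss_u + alpha * loss_v
        excess = loss_fn(lerp(u, v, alpha)) - baseline
        barrier = max(barrier, excess)
    return barrier


# Toy 1-D losses (assumptions, not the paper's losses).
def double_well(p):
    # Two minima at -1 and +1 separated by a bump at 0.
    return (p[0] ** 2 - 1.0) ** 2


def bowl(p):
    # A single convex basin: every straight path stays low.
    return p[0] ** 2


barrier_double_well = loss_barrier(double_well, [-1.0], [1.0])
barrier_bowl = loss_barrier(bowl, [-1.0], [1.0])
print(barrier_double_well, barrier_bowl)
```

In the double-well case the two endpoints are both minima, yet their midpoint sits on a bump, which is the situation naive checkpoint averaging runs into. The convex bowl has no barrier. The dual-endpoint objective searches for a \(w\) whose paths to both endpoints look like the second case.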

2. Pre-trained End Replay-free Second-order Proxy: Anchor Zero-shot Knowledge Without CLIP Data

A challenge in the LMC objective is that \(L_1\) is the CLIP pre-training loss, requiring web-scale image-text data. Since this data is typically unavailable, MERGETUNE uses a second-order Taylor approximation of the loss at the zero-shot end.

Near \(\hat{w}_1\), the term \(L_1(\hat{w}_1 + \alpha(w - \hat{w}_1))\) is expanded. Assuming the zero-shot checkpoint is a local optimum for the pre-training task, \(\nabla L_1(\hat{w}_1) \approx 0\). By simplifying the Hessian with an isotropic curvature \(H_1 \approx \mu I\), the pre-training loss becomes a distance regularization:

\[ R_{Task1} = \lambda \|w - \hat{w}_1\|^2. \]

This design provides a geometric anchor to the zero-shot solution. It is much cheaper than replaying pre-trained data and less prone to further forgetting than training on downstream data alone.
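The step from the Taylor expansion to the distance regularizer can be written out as follows. This is a sketch under the two stated assumptions; the constant \(\mu/6\) falls out of this derivation and isn't quoted from the paper, where \(\lambda\) is simply tuned as a hyperparameter.

\[ L_1(\hat{w}_1 + \alpha(w - \hat{w}_1)) \approx L_1(\hat{w}_1) + \alpha\,\nabla L_1(\hat{w}_1)^\top (w - \hat{w}_1) + \frac{\alpha^2}{2}(w - \hat{w}_1)^\top H_1 (w - \hat{w}_1). \]

Setting \(\nabla L_1(\hat{w}_1) \approx 0\) and \(H_1 \approx \mu I\) removes the linear term and collapses the quadratic form:

\[ L_1(\hat{w}_1 + \alpha(w - \hat{w}_1)) \approx L_1(\hat{w}_1) + \frac{\alpha^2 \mu}{2}\|w - \hat{w}_1\|^2. \]

Taking the expectation over \(\alpha \sim U[0,1]\), with \(\mathbb{E}[\alpha^2] = 1/3\):

\[ \mathbb{E}_{\alpha}\left[L_1(\hat{w}_1 + \alpha(w - \hat{w}_1))\right] \approx L_1(\hat{w}_1) + \frac{\mu}{6}\|w - \hat{w}_1\|^2. \]

The first term doesn't depend on \(w\), so it can be dropped from the optimization, and absorbing \(\mu/6\) into \(\lambda\) gives \(R_{Task1}\).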

3. Downstream End LMC Path Constraint: Smoothing with the Fine-tuned Solution

To ensure the continued model retains downstream knowledge, MERGETUNE explicitly adds an LMC term on the fine-tuned side. It samples several interpolation points \(\hat{w}_2 + \alpha(w-\hat{w}_2)\) between \(\hat{w}_2\) and \(w\), computing the downstream loss \(L_2\) on these points to ensure the entire path remains low-loss.

The final replay-free objective is:

\[ L(w) = L_2(w) + \lambda\|w - \hat{w}_1\|^2 + \beta\,\mathbb{E}_{\alpha \sim U[0,1]}\left[L_2(\hat{w}_2 + \alpha(w - \hat{w}_2))\right]. \]

Here, \(L_2(w)\) ensures the model's own downstream performance, \(\lambda\) controls zero-shot knowledge retention, and \(\beta\) controls connectivity with the fine-tuned solution. During training, the expectation is approximated by sampling \(N_\alpha\) points (e.g., 5 or 10).
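The three terms of the replay-free objective can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: weights are flat lists, `loss2` is any callable standing in for the downstream loss, and `mergetune_objective`, `sq_dist`, and the toy quadratic loss are hypothetical names. The expectation is approximated with `n_alpha` uniform samples, as described above.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two flat weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mergetune_objective(w, w1_hat, w2_hat, loss2, lam, beta, n_alpha=5, rng=None):
    """L2(w) + lam * ||w - w1_hat||^2 + beta * mean of L2 along the path to w2_hat."""
    rng = rng or random.Random(0)
    task_term = loss2(w)                      # downstream loss of w itself
    anchor_term = lam * sq_dist(w, w1_hat)    # second-order proxy for L1
    path_sum = 0.0
    for _ in range(n_alpha):
        alpha = rng.random()                  # uniform sample from [0, 1)
        point = [b + alpha * (x - b) for x, b in zip(w, w2_hat)]
        path_sum += loss2(point)              # downstream loss on the path
    path_term = beta * path_sum / n_alpha
    return task_term + anchor_term + path_term


# Toy setup (assumption): the downstream loss is a bowl centred on w2_hat.
w1_hat = [0.0, 0.0]
w2_hat = [1.0, 0.0]


def toy_loss2(p):
    return sq_dist(p, w2_hat)


value_at_finetuned = mergetune_objective(w2_hat, w1_hat, w2_hat, toy_loss2, lam=8.0, beta=0.5)
print(value_at_finetuned)
```

Evaluated at the fine-tuned checkpoint itself, the task and path terms vanish in this toy setup and only the anchor term remains. That shows how \(\lambda\) pulls the solution away from \(\hat{w}_2\) and back toward \(\hat{w}_1\).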

Loss & Training

The training workflow follows three steps. First, train the downstream checkpoint \(\hat{w}_2\) following the baseline method while keeping \(\hat{w}_1\) fixed. Second, initialize the continued model with \(w=(1-\tau)\hat{w}_1+\tau\hat{w}_2\). Third, continue training on the downstream data using the combined objective.
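The initialization in the second step is a per-parameter convex combination of the two checkpoints. A minimal sketch, assuming checkpoints are stored as dictionaries mapping parameter names to flat lists of floats (the function name `init_continued` and the toy `prompt` parameter are hypothetical):

```python
def init_continued(w1_hat, w2_hat, tau):
    """Return (1 - tau) * w1_hat + tau * w2_hat, parameter by parameter."""
    if w1_hat.keys() != w2_hat.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [(1.0 - tau) * a + tau * b
               for a, b in zip(w1_hat[name], w2_hat[name])]
        for name in w1_hat
    }


# Toy checkpoints (assumption): a single trainable vector.
zero_shot = {"prompt": [0.0, 0.0, 0.0]}
fine_tuned = {"prompt": [2.0, -1.0, 4.0]}

# tau = 0 starts from the zero-shot weights, tau = 1 from the fine-tuned ones.
w_init = init_continued(zero_shot, fine_tuned, tau=0.3)
print(w_init)
```

With `tau=0.3` the starting point sits on the segment between the two endpoints, nearer the zero-shot end, before the combined objective moves it.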

In few-shot settings (16-shot), MERGETUNE is evaluated on CoOp, KgCoOp, MMA, and PromptKD using CLIP ViT-B/16. For instance, CoOp+MERGETUNE follows the baseline batch size of 128 and learning rate of 0.002, training for 50 epochs. For many-shot robust fine-tuning, linear probing and E2E-FT are used as baselines, followed by an equal number of MERGETUNE training rounds.

Hyperparameter analysis shows that for \(\lambda \in [8, 16]\) and \(\beta \in [0.1, 0.5]\), performance is stable. The authors also confirm that prolonged continued fine-tuning (e.g., 100 epochs) does not cause "over-merging" or performance degradation, indicating that the dual-endpoint constraint provides stable anchors for training.

Key Experimental Results

Main Results

MERGETUNE was evaluated across four protocols: base-to-novel generalization, cross-dataset generalization, domain generalization, and many-shot ID-OOD robust fine-tuning. The primary finding is that while post-training merging (TIES/DARE) often degrades in PEFT scenarios, MERGETUNE provides consistent gains, particularly for methods with significant forgetting like CoOp.

| Setting | Baseline | Original | TIES / DARE | MERGETUNE | Key Conclusion |
|---|---|---|---|---|---|
| Base-to-novel Avg HM | CoOp | 71.66 | 66.32 / 70.59 | 77.24 | +5.58 over CoOp; post-training merging fails |
| Base-to-novel Avg HM | KgCoOp | 77.01 | 72.56 / 75.17 | 77.98 | +0.97 even on knowledge-preserving baseline |
| Base-to-novel Avg HM | MMA | 79.87 | 69.41 / 71.81 | 80.44 | Outperforms original MMA despite differing structures |
| Base-to-novel Avg HM | PromptKD | 83.73 | 79.52 / 82.13 | 84.09 | Stable +0.36 gain on very strong baseline |
| Cross-dataset Avg-C | CoOp | 63.88 | 63.80 / 61.67 | 65.80 | +1.92 improvement migrating from ImageNet |
| Domain Avg-D | CoOp | 59.28 | 53.20 / 57.64 | 60.15 | +0.87 gain on ImageNet shifts; TIES/DARE degrades |

In base-to-novel experiments, MERGETUNE boosts CoOp's Novel accuracy from 63.22 to 73.97 while maintaining Base accuracy at 80.82. For robust fine-tuning, a single MERGETUNE-tuned model outperforms VRF.
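The HM column is the harmonic mean of base and novel accuracy, so the CoOp+MERGETUNE figures quoted above can be cross-checked against the table. A short check, assuming the reported average HM is computed from the average base and novel accuracies (the numbers are consistent with that reading):

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base * novel / (base + novel)


# CoOp + MERGETUNE accuracies quoted in the text.
hm = harmonic_mean(80.82, 73.97)
print(round(hm, 2))  # 77.24, matching the table entry
```

The harmonic mean penalizes imbalance between the two accuracies, which is why recovering novel-class accuracy raises HM even when base accuracy stays roughly flat.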

| Robust Fine-tuning | Method | ImageNet | Avg-D | Gain vs. FT (Avg-D) | Inference |
|---|---|---|---|---|---|
| Linear probing | Original LP | 79.79 | 57.39 | - | Single model |
| Linear probing | MERGETUNE | 79.96 | 59.66 | +2.27 | Single model |
| Linear probing | MERGETUNE + Weight ens. | 79.88 | 60.23 | +2.84 | Single weight-interp |
| E2E-FT | Original E2E-FT | 81.31 | 53.70 | - | Single model |
| E2E-FT | MERGETUNE | 82.26 | 62.29 | +8.59 | Single model |
| E2E-FT | MERGETUNE + Weight ens. | 82.18 | 62.90 | +9.20 | Single weight-interp |

Ablation Study

| Config | HM Score | Note |
|---|---|---|
| \(\lambda=1, \beta=0.1\) | 76.44 | Weak zero-shot constraint; insufficient novel recovery |
| \(\lambda=8, \beta=0.5\) | 77.62 | Optimal balance between base and novel |
| Initialization \(\tau=0.0\) | 76.94 | Starting from CLIP biases toward pre-training; lower adaptation |
| Initialization \(\tau=0.3\) | 77.62 | Balanced initialization yields best results |
| \(N_\alpha=1\) | 77.19 | Insufficient path sampling; weaker LMC constraint |
| \(N_\alpha=5\) | 77.62 | Good trade-off between performance and cost |

Key Findings

  • Gains correlate with the baseline's degree of forgetting: CoOp shows the largest improvement (+5.58 HM).
  • Post-training merging is unreliable in PEFT scenarios. TIES and DARE mostly decrease HM, which suggests that merging alone is not enough and that the low-loss path has to be shaped explicitly during training.
  • Domain and cross-dataset results show that MERGETUNE recovers generic cross-domain knowledge rather than just overfitting to splits.
  • In robust fine-tuning, MERGETUNE's single-model performance is highly competitive, and its weight-ensemble version achieves state-of-the-art OOD accuracy.

Highlights & Insights

  • MERGETUNE's strength lies in separating "knowledge recovery" into a post-hoc continued fine-tuning stage. This is practical as it allows researchers to enhance existing trained checkpoints.
  • It moves beyond simple distance regularization by using LMC to explain why merging fails and turning that observation into a training objective.
  • The second-order proxy is a clever engineering solution. By avoiding pre-training data replay, the method remains applicable to models where original data is inaccessible.
  • The method is model-agnostic, applicable to various PEFT methods (prompts, adapters) and full fine-tuning.

Limitations & Future Work

  • The second-order proxy relies on strong assumptions (\(\nabla L_1 \approx 0\) and isotropic Hessian) that might not be strictly true for all VLMs.
  • It still requires downstream data for training, unlike purely training-free merging methods.
  • Sampling interpolation points increases training cost (e.g., \(N_\alpha=5\) triples the training time of KgCoOp).
  • Experiments focus on encoder-based VLMs; application to generative VLMs or dense prediction tasks remains to be explored.

Comparison with Related Methods

  • vs CoOp/KgCoOp: Instead of replacing them, MERGETUNE acts as a post-processor to restore forgotten knowledge.
  • vs TIES/DARE: While post-hoc methods struggle with parameter conflicts, MERGETUNE actively shapes the weight space for better mergeability.
  • vs Wise-FT: MERGETUNE can be viewed as "preparing" a checkpoint to be more compatible for weight interpolation, leading to better ID-OOD trade-offs.
  • vs VRF: MERGETUNE achieves better robust fine-tuning results as a single checkpoint without the complex per-sample operations or failure sets required by VRF.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Converting LMC from an observation into a post-hoc training objective is clear and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across multiple protocols, backbones, and baselines.
  • Writing Quality: ⭐⭐⭐⭐☆ Logical flow; however, some symbol definitions require looking into the appendix.
  • Value: ⭐⭐⭐⭐⭐ High practical value for improving existing VLM fine-tuning pipelines without inference overhead.