Training-free LLM Merging for Multi-task Learning
Conference: ACL 2025
arXiv: 2506.12379
Code: GitHub
Area: LLM Model Merging
Keywords: Model Merging, Multi-Task Learning, Layer-wise Pruning, Conflict Elimination, Training-free
TL;DR
This paper proposes Hi-Merging, a hierarchical iterative training-free model merging method. By utilizing model-wise and layer-wise pruning and scaling operations combined with contribution analysis, it identifies and resolves parameter conflicts. This merges specialized LLMs across different tasks/languages into a single unified multi-task model, outperforming mixed-data fine-tuning baselines in most scenarios.
Background & Motivation
With the release of open-source large language models like LLaMA and Qwen, there are now over 1 million specialized LLMs fine-tuned for various tasks and languages on Hugging Face. A natural question arises: Can these specialized models be merged into a single unified multi-task model?
The direct solution is to collect all the fine-tuning data and retrain a single model on it, which faces three major difficulties:
1. Data Inaccessibility: Models are publicly available, but fine-tuning data is often proprietary.
2. High Computational Cost: Retraining LLMs requires massive computing resources.
3. Seesaw Effect: When training on mixed data, improving performance on one task often degrades performance on another.
Model merging has thus become an attractive alternative, but existing methods face two core challenges:
- Noise Interference: Noise parameters introduced during fine-tuning by data bias or overfitting can impair the generalization of the merged model.
- Knowledge Misalignment: Independently trained models follow different optimization trajectories, so their knowledge is misaligned in parameter space and direct merging causes incompatibility.
Existing methods, such as TIES-Merging and DARE, lack explicit guidance for conflict localization, leading to high performance variance. The proposed Hi-Merging systematically addresses these issues through layer-wise analysis.
Method
Overall Architecture
Hi-Merging adopts a two-stage hierarchical processing architecture:
1. Model-wise Pruning & Scaling: De-noising and scaling the entire delta vector of each fine-tuned model.
2. Layer-wise Pruning & Scaling: Identifying the layers with the most severe conflicts through contribution analysis and iteratively eliminating parameter conflicts.
The core mathematical foundation is the delta vector: \(\boldsymbol{\delta}_m = \boldsymbol{\theta}_m - \boldsymbol{\theta}_F\), which represents the parameter difference between the fine-tuned model and the base model.
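The delta-vector definition can be illustrated with a minimal sketch. It assumes a toy representation in which each model is a dictionary mapping layer names to flat lists of floats (a stand-in for real weight tensors). The helper names `delta_vector` and `apply_delta` are hypothetical and do not come from the paper or from mergekit.

```python
from typing import Dict, List

# Toy representation (assumption): layer name -> flattened weights.
Params = Dict[str, List[float]]


def delta_vector(fine_tuned: Params, base: Params) -> Params:
    """Compute delta_m = theta_m - theta_F, layer by layer."""
    return {
        name: [t - b for t, b in zip(fine_tuned[name], base[name])]
        for name in base
    }


def apply_delta(base: Params, delta: Params) -> Params:
    """Recover theta_F + delta, the inverse of delta_vector."""
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }


# Hypothetical weights, chosen only for illustration.
base = {"layer0": [1.0, 2.0], "layer1": [0.5, -0.5]}
tuned = {"layer0": [1.5, 2.0], "layer1": [0.5, -1.0]}

delta = delta_vector(tuned, base)
print(delta)  # {'layer0': [0.5, 0.0], 'layer1': [0.0, -0.5]}
```

Keeping the delta separate from the base weights is what allows the later pruning and scaling steps to act only on what fine-tuning changed.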
Key Designs
- Model-wise Pruning & Scaling:
  - Pruning threshold \(p\): Retains the fraction \(p\) of delta-vector parameters with the largest absolute values and sets the rest to zero, eliminating noise parameters introduced by data bias.
  - Scaling factor \(s\): Multiplies the retained delta vector by \(s \in [0,1]\), mitigating overly aggressive parameter updates caused by overfitting.
  - Experimental validation: Setting \(p=0.1, s=0.9\) (retaining only 10% of the delta parameters and scaling them by 0.9) already outperforms the original model.
  - Complementary operations: Pruning eliminates minor perturbations, while scaling regulates extreme parameter changes.
- Contribution Analysis:
  - Pruning Impact \(\alpha\): After an initial merged model \(\mathcal{M}_G\) is constructed, measures the performance drop on the original task of \(\mathcal{M}_m\) when the delta vector of a specific layer is removed.
  - Addition Impact \(\beta\): Adds the delta vector of a specific layer to the base model \(\mathcal{M}_F\) and measures the performance gain on the original task of \(\mathcal{M}_m\).
  - Total contribution: \(c = \alpha + \beta\). Conflict degree: \(\gamma_m^l = c_{m,m}^l - c_{m,G}^l\), where \(c_{m,m}^l\) and \(c_{m,G}^l\) denote the contribution of layer \(l\) to task \(m\) measured in the fine-tuned model \(\mathcal{M}_m\) and in the merged model \(\mathcal{M}_G\), respectively. A positive \(\gamma_m^l\) therefore means the layer contributes less after merging.
  - Layers are sorted by \(\Gamma^l = \sum_m \gamma_m^l\) to identify the most conflict-prone ones.
- Iterative Conflict Elimination: Processes each layer in descending order of conflict severity, distinguishing three cases:
  - Severe Conflict (\(\gamma_A > 0\) and \(\gamma_B > 0\)): Both capabilities are degraded by merging \(\rightarrow\) keep only the delta vector with the larger contribution and zero out the other.
  - Partial Conflict (\(\gamma_A \cdot \gamma_B < 0\)): One model's overfitting degrades the other \(\rightarrow\) prune and scale the conflicting model's delta vector further.
  - Mutual Enhancement (\(\gamma_A \leq 0\) and \(\gamma_B \leq 0\)): Both capabilities improve after merging \(\rightarrow\) no adjustment needed.
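The contribution analysis and the three-way case split above can be sketched as follows. This is an illustrative reading of the formulas, not the authors' implementation: the class and function names are hypothetical, the \(\alpha\) and \(\beta\) scores are assumed to have been measured already, and the numeric values are made up. The stated cases leave one boundary open (one \(\gamma\) exactly zero and the other positive); the sketch assumes it is treated as a partial conflict.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Contribution:
    """Measured contribution of one layer's delta vector on one task."""
    alpha: float  # pruning impact: performance drop when the layer's delta is removed
    beta: float   # addition impact: performance gain when the layer's delta is added to the base

    @property
    def total(self) -> float:
        # c = alpha + beta
        return self.alpha + self.beta


def conflict_degree(own: Contribution, merged: Contribution) -> float:
    """gamma_m^l = c_{m,m}^l - c_{m,G}^l."""
    return own.total - merged.total


def classify(gamma_a: float, gamma_b: float) -> str:
    """Map a pair of conflict degrees to one of the three cases."""
    if gamma_a > 0 and gamma_b > 0:
        return "severe"              # keep only the larger-contribution delta
    if gamma_a <= 0 and gamma_b <= 0:
        return "mutual_enhancement"  # no adjustment
    # Assumption: also covers the boundary where one gamma is 0 and the other > 0.
    return "partial"                 # prune and scale the conflicting delta further


def rank_layers(gammas: Dict[int, Tuple[float, float]]) -> List[int]:
    """Sort layer indices by Gamma^l = sum_m gamma_m^l, most conflicting first."""
    return sorted(gammas, key=lambda layer: sum(gammas[layer]), reverse=True)


# Hypothetical (gamma_A, gamma_B) values per layer index, for illustration only.
gammas = {0: (-1.0, -0.25), 1: (2.0, 1.0), 2: (1.5, -0.5)}
for layer in rank_layers(gammas):
    print(layer, classify(*gammas[layer]))
# 1 severe
# 2 partial
# 0 mutual_enhancement
```

In a real pipeline each \(\alpha\) and \(\beta\) requires an evaluation run, so the ranking step is cheap relative to producing its inputs.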
Loss & Training
Completely training-free: Hi-Merging is a parameter post-processing method. The experimental setup is as follows:
- Base Model: Qwen2-7B-Instruct
- Fine-tuning: LLaMA-Factory + LoRA (rank=8, alpha=16, dropout=0.01)
- Merging Tool: mergekit
- Model-wise \(p\) and \(s\): searched in the range 0.1 to 1.0 (step size 0.1).
- Layer-wise \(p\) and \(s\): set to half of the model-wise values.
- Evaluation Metrics: Accuracy for MCQA; BLEU-4 and ROUGE-1/2/L for QA.
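The pruning and scaling operation and the hyperparameter grid described above can be sketched in a few lines. This is a minimal illustration on a flat list of floats, not the mergekit-based implementation. The function names are hypothetical, and two details are assumptions of the sketch: the number of kept entries is rounded to the nearest integer, and at least one entry is always kept.

```python
from itertools import product
from typing import List, Tuple


def prune_and_scale(delta: List[float], p: float, s: float) -> List[float]:
    """Keep the fraction p of largest-magnitude entries, zero the rest, scale by s."""
    if not (0.0 < p <= 1.0 and 0.0 <= s <= 1.0):
        raise ValueError("expected 0 < p <= 1 and 0 <= s <= 1")
    # Assumption: round to the nearest count and never prune everything.
    keep = max(1, round(p * len(delta)))
    by_magnitude = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(by_magnitude[:keep])
    return [s * d if i in kept else 0.0 for i, d in enumerate(delta)]


def layer_wise_params(p: float, s: float) -> Tuple[float, float]:
    """Layer-wise p and s are half of the model-wise values."""
    return p / 2, s / 2


# Model-wise search grid: p and s each in 0.1, 0.2, ..., 1.0.
values = [i / 10 for i in range(1, 11)]
grid = list(product(values, values))

print(len(grid))                                            # 100
print(prune_and_scale([0.5, -2.0, 0.1, 1.0], p=0.5, s=0.5))  # [0.0, -1.0, 0.0, 0.5]
```

Each grid point still needs an evaluation of the resulting model, so the 100 combinations are the dominant cost of the model-wise stage.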
Key Experimental Results
Main Results
Bilingual MCQA Task Merging (English MedQA + Chinese CMExam):
| Method | MedQA (Acc) | CMExam (Acc) | Avg Impr. | Avg Rank |
|---|---|---|---|---|
| Qwen2-7B Base | 51.41 | 74.62 | - | 17.0 |
| Single-Task Fine-tuning A (English) | 59.14 | 83.78 | +13.40% | 10.0 |
| Mixed-Data Fine-tuning | 60.08 | 88.22 | +17.67% | 3.5 |
| Task Arithmetic | 59.53 | 88.77 | +17.67% | 4.0 |
| TIES | 59.06 | 88.78 | +17.31% | 4.5 |
| DARE | 58.67 | 88.69 | +16.93% | 7.5 |
| Hi-Merging | 60.16 | 89.07 | +18.41% | 1.0 |
Monolingual Multi-task Merging (English MCQA+QA):
| Method | MedQA Acc | HCMagic BLEU-4 | HCMagic ROUGE-L | Avg Impr. | Avg Rank |
|---|---|---|---|---|---|
| Mixed-Data Fine-tuning | 59.22 | 35.60 | 20.46 | +25.23% | 8.3 |
| TIES | 60.47 | 35.79 | 20.37 | +26.78% | 4.2 |
| DARE | 58.44 | 36.58 | 20.39 | +26.29% | 4.4 |
| Hi-Merging | 60.16+ | Best-level | Best-level | Best | 1.0 |
Ablation Study
| Configuration | Key Metrics | Description |
|---|---|---|
| Model-wise processing only | Avg Rank ~4 | Outperforms no-processing, but inferior to the full hierarchical method |
| Layer-wise processing only (no model-wise) | Avg Rank ~3 | Lacks global de-noising, limiting the layer-wise optimization space |
| Full Hi-Merging | Avg Rank 1.0 | Achieves the best hierarchical coordination effect |
| \(p=0.1, s=0.9\) (single model) | Outperforms original | Validates that pruning + scaling is also beneficial for single models |
| Different base models (Yi-1.5-9B, Baichuan2-7B) | Effective but baselines differ | The method is robust to the choice of base models |
Key Findings
- Hi-Merging Consistently Ranks First: Across the three settings (bilingual MCQA, monolingual multi-task, and cross-lingual cross-task), the average rank is consistently 1.0.
- Outperforming Mixed-data Fine-tuning: Under most scenarios, the training-free Hi-Merging outperforms the mixed-data fine-tuning baseline which requires extra training.
- High Variance in Existing Methods: TIES and DARE occasionally perform best on individual metrics but are unstable overall because they lack explicit guidance for locating conflicts.
- 10% Parameters Suffice for Performance Preservation: When pruned to retain only 10% of delta parameters, proper scaling still maintains or even enhances performance.
- Severe Conflict Layers Identified and Resolved: Contribution analysis effectively identifies the most problematic layers in merging, and the three corresponding conflict elimination strategies are highly targeted.
Highlights & Insights
- Value of Hierarchical Thinking: Decomposing the global model merging problem into model-wise de-noising followed by layer-wise conflict elimination makes the problem more analytical and controllable.
- Innovation in Contribution Analysis: Quantifying the conflict severity of each layer by measuring both "pruning impact" and "addition impact" is more direct than statistics-based methods.
- Categorized Treatment of Three Conflict Types: The categorization of severe conflict, partial conflict, and mutual enhancement is intuitive and provides targeted resolution strategies.
- Complementarity of Pruning and Scaling: Pruning removes minor noise, while scaling adjusts large parameters. Their complementarity addresses two common issues in the fine-tuning process.
- High Practical Usability: Built on mergekit with a manageable hyperparameter search space (a 10×10 grid over \(p\) and \(s\)), making it friendly to the open-source community.
Limitations & Future Work
- Primarily Explores Two-Model Merging: Although the framework can theoretically scale to multiple models, experiments are mainly pairwise, and multi-model scenarios have not been fully validated.
- Limited Task Types: Validated only on MCQA and QA tasks in the medical domain, without covering other critical tasks like code generation and reasoning.
- Computational Overhead of Contribution Analysis: It requires pruning/addition experiments and performance evaluation for every single layer. The overhead is non-negligible when scaling up model and task combinations.
- LoRA Fine-tuning Assumption: The experiments utilize LoRA for fine-tuning. The effectiveness of merging full-parameter fine-tuned models is yet to be validated.
- Future work can explore adaptive selection of \(p\) and \(s\) to lower the hyperparameter search overhead.
Related Work & Insights
- Task Arithmetic (Ilharco et al., 2023): Proposes the fundamental framework of delta vector merging, on top of which Hi-Merging incorporates layer-wise optimization.
- TIES-Merging (Yadav et al., 2023) and DARE (Yu et al., 2024): Mitigate parameter conflicts through various strategies, but struggle with explicit conflict localization.
- DELLA (Deep et al., 2024): Incorporates parameter magnitude, but remains a global processing approach.
- Model Breadcrumbs (Davari & Belilovsky, 2024): Progressively sparsifies parameters without involving layer-wise analysis.
- Layer Swapping (Bandarkar et al., 2025): Utilizes a layer swapping strategy but lacks fine-grained conflict analysis.
- Insights: Model merging should not be a brute-force, single-step operation. Layer-wise analysis and iterative optimization can significantly reduce parameter conflicts.
Rating
- Novelty: ⭐⭐⭐⭐ The hierarchical conflict analysis and elimination framework is highly novel in the model merging domain, with an innovative contribution analysis method.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three merging scenarios (bilingual, multi-task, and cross-lingual cross-task), compared against 10+ baselines, including detailed ablation analyses.
- Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations, intuitive visualization of the three conflict types, and a well-structured layout.
- Value: ⭐⭐⭐⭐ This training-free method outperforms training-based baselines, providing substantial practical guidance for model integration in the LLM community.