Knowledge Fusion of Large Language Models Via Modular Skillpacks¶
Conference: ICLR 2026 arXiv: 2505.18502 Code: duguodong7/GraftLLM Area: Model Compression / Knowledge Fusion Keywords: knowledge grafting, SkillPack, heterogeneous model fusion, continual learning, delta compression
TL;DR¶
This paper proposes GraftLLM, a framework that extracts capabilities from heterogeneous source models into compact, transferable "SkillPacks" (modular skill packages). Through a module-aware adaptive compression strategy that stores parameter deltas, GraftLLM supports knowledge transfer, heterogeneous model fusion, and continual learning without forgetting, significantly outperforming existing PEFT and parameter merging methods across multiple settings.
Background & Motivation¶
Cross-capability transfer is a central challenge in LLM research, encompassing multi-task fusion, model compression, and continual learning. Works such as FuseLLM and FuseChat have demonstrated the potential of consolidating capabilities from multiple models into lightweight targets, yet existing approaches are primarily designed for small-scale homogeneous models and offer limited applicability to large-scale heterogeneous settings.
Knowledge transfer between large heterogeneous models faces three key challenges:
Catastrophic forgetting under full fine-tuning: Knowledge distillation combined with full-parameter fine-tuning overwrites the target model's inherent capabilities while acquiring knowledge from the source model.
Insufficient capacity of PEFT methods: Parameter-efficient methods such as LoRA mitigate large-scale forgetting but suffer from limited adapter capacity, making it difficult to absorb complex knowledge from large source LLMs—particularly under complex training objectives such as DPO, where performance degrades substantially.
Parameter conflicts: Directly merging parameter deltas from multiple models induces inter-task interference and performance degradation.
The core problem is: how to efficiently and composably transfer specialized capabilities from heterogeneous source models while preserving the target model's general-purpose competence?
Method¶
Overall Architecture¶
The core idea of GraftLLM is to represent capabilities in a "target model + SkillPack" format:

1. Transfer source model capabilities to the target model via SFT and DPO.
2. Extract the parameter delta \(\Delta\theta = \theta_{tgt}^* - \theta_{tgt}\).
3. Apply module-aware adaptive compression to the delta, yielding a compact SkillPack.
4. Support multi-SkillPack fusion and continual learning via a routing mechanism.
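The four steps can be sketched on a single weight matrix. This is a minimal NumPy illustration, not the paper's implementation: the rank-8 SVD is only a placeholder for the module-aware compression described under Key Designs, and all names and shapes are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins for the target weights before and after SFT + DPO transfer.
rng = np.random.default_rng(0)
theta_tgt = rng.standard_normal((64, 64))
theta_tgt_star = theta_tgt + 0.01 * rng.standard_normal((64, 64))

# Step 2: extract the parameter delta.
delta = theta_tgt_star - theta_tgt

# Step 3: compress the delta (rank-8 SVD as a placeholder for the
# module-aware strategy; a real SkillPack mixes pruning, SVD, quantization).
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
r = 8
skillpack = (U[:, :r] * S[:r], Vt[:r])  # two small factors instead of delta

# Step 4: re-apply the SkillPack on top of the frozen target weights.
delta_hat = skillpack[0] @ skillpack[1]
theta_restored = theta_tgt + delta_hat

# Stored fraction relative to the dense delta: 2 * 64 * 8 / (64 * 64) = 0.25.
print("stored fraction:", (skillpack[0].size + skillpack[1].size) / delta.size)
```

Because the base weights are never modified, removing the SkillPack recovers the original model exactly, which is the property the fusion and continual-learning designs below build on.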
Key Designs¶
- Two-stage cross-capability transfer:
- SFT stage: Minimize the negative log-likelihood \(\mathcal{L}_{SFT}(\theta) = -\mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_{SFT}}[\log p_\theta(y_i \mid x_i)]\) on high-quality data \(\mathcal{D}_{SFT}\) generated by the source model, bridging the source–target distributional gap.
- DPO stage: Construct preference pairs \((y_w, y_l)\) (best and worst responses from the same source model) and further align via the DPO loss: \(\mathcal{L}_{DPO} = -\mathbb{E}[\log\sigma(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)})]\)
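The two objectives can be rendered as a toy NumPy computation. This sketch assumes summed per-response log-probabilities as inputs; the specific numbers and function names are illustrative, not the paper's code.

```python
import numpy as np

def sft_loss(logp_tokens):
    """SFT objective: mean negative log-likelihood of the source-generated
    response tokens under the target model (toy per-token log-probs)."""
    return -np.mean(logp_tokens)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective on one preference pair, following the formula above.
    Inputs are summed response log-likelihoods under the policy (pi_theta)
    and the frozen reference model (pi_ref)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigma(margin)

# Toy numbers: the policy prefers y_w more strongly than the reference does,
# so the loss drops below the untrained value of -log sigma(0) = log 2.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)
print(loss)
```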
- Module-Aware Adaptive Compression: Differentiated compression strategies are applied to distinct parameter module types:
- Embedding and Output Head: Magnitude pruning, retaining the top-\(\alpha\) fraction of weights by absolute value. These layers are highly sensitive to vocabulary alignment and task adaptation.
- MLP modules: SVD decomposition \(\Delta\theta_{mlp} = \mathbf{U}\Sigma\mathbf{V}^\top\), exploiting their high degree of parameter redundancy.
- Attention modules: Low-rank SVD decomposition, retaining only the components corresponding to the top-\(r\) singular values.
- Mixed-precision quantization: Bit-width \(k\) is adaptively assigned to each SVD component according to its importance, with grouped quantization performed via GPTQ: \(\hat{\mathbf{V}}_{[r]}^\top = \text{Quant}_k(\mathbf{V}_{[r]}^\top, \mathbf{x})\)
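A minimal NumPy sketch of the three operator families follows. The uniform quantizer is a simple stand-in for GPTQ's importance-aware grouped quantization, and the sizes, ranks, and bit-widths are illustrative assumptions.

```python
import numpy as np

def prune_topk(delta, alpha=0.1):
    """Magnitude pruning for embedding / output-head deltas:
    keep only the top-alpha fraction of entries by absolute value."""
    k = max(1, int(alpha * delta.size))
    thresh = np.sort(np.abs(delta).ravel())[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)

def svd_lowrank(delta, r=8):
    """Low-rank SVD for attention / MLP deltas:
    keep only the components of the top-r singular values."""
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r]  # two small factors instead of delta

def quantize(x, k=4):
    """Uniform k-bit quantization, a simplified stand-in for GPTQ's
    grouped quantization of the retained SVD factors."""
    levels = 2 ** k - 1
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

rng = np.random.default_rng(0)
delta = rng.standard_normal((32, 32))
A, Vt = svd_lowrank(delta)
approx = A @ quantize(Vt)  # quantize the right factor, as in the formula above
```

The point of the module-aware design is simply that each operator exploits a different structure: sparsity in embedding deltas, low rank in attention/MLP deltas, and low dynamic range in the retained factors.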
- SkillPack generation: The resulting SkillPack \(\Delta\hat{\theta} = \{C_m(\Delta\theta_m)\}_{m \in \mathcal{M}}\) is a compact, transferable knowledge carrier, where each module \(m\) is compressed using the operator \(C_m\) best suited to its structural characteristics.
- Knowledge fusion and continual learning:
- Multi-SkillPack fusion: \(\theta_{fused} = \theta_{tgt} + \sum_{i=1}^{n} \mathcal{R}(\Delta\hat{\theta}_i)\), where the routing function \(\mathcal{R}\) dynamically assigns each SkillPack to corresponding sub-modules based on the source model or task type.
- Continual learning: A task-adaptive activation mechanism activates only the relevant subset \(\mathcal{S}_t\) for each task \(t\): \(\theta_t = \theta_{tgt} + \sum_{\Delta\hat{\theta}_i \in \mathcal{S}_t} \mathcal{R}(\Delta\hat{\theta}_i)\)
- Plug-and-play paradigm: SkillPacks can be loaded or unloaded at any time, naturally supporting unlearning and detoxification.
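The fusion and activation formulas can be sketched with a dictionary-based router. This is a hypothetical illustration: the module names, the task-to-SkillPack mapping, and the dense toy deltas are assumptions, since the paper does not fully specify the routing implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
base = {"mlp.w": rng.standard_normal((16, 16))}  # frozen target weights

def make_skillpack(scale=0.01):
    # Toy "compressed" delta; a real SkillPack would store pruned /
    # low-rank / quantized factors per module.
    return {"mlp.w": scale * rng.standard_normal((16, 16))}

skillpacks = {"math": make_skillpack(), "code": make_skillpack()}
# Router R: maps a task t to the subset S_t of SkillPacks to activate.
router = {"gsm8k": ["math"], "humaneval": ["code"], "mixed": ["math", "code"]}

def assemble(task):
    """theta_t = theta_tgt + sum of routed SkillPack deltas."""
    weights = {k: v.copy() for k, v in base.items()}
    for name in router[task]:
        for module, delta in skillpacks[name].items():
            weights[module] = weights[module] + delta
    return weights

# Plug-and-play: base weights are never modified in place, so unloading a
# SkillPack (or routing to none) recovers the original model exactly.
math_model = assemble("gsm8k")
```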
Experimental Setup¶
- Target model: Llama-3.1-8B-Instruct
- Source models: Gemma-2-27B-it, Mistral-Large-Instruct, Qwen-2.5-72B-Instruct, Llama-3.1-70B-Instruct
- Explicit fusion: FuseChat 2.0 setting (OpenChat-3.5-7B as pivot, 6 chat models as sources)
- Implicit fusion: FuseChat 3.0 setting (Llama-3.1-8B-I and Qwen-2.5-7B-I as targets)
- Continual learning: Llama-3.1-8B-I sequentially acquiring math then coding capabilities
Key Experimental Results¶
Knowledge Transfer and Compression (SFT Setting)¶
GraftLLM stays within 1% of full fine-tuning on general tasks such as MMLU, while requiring substantially fewer stored parameters than the complete model. LoRA performs acceptably in simple SFT scenarios but degrades significantly, or fails entirely, under complex DPO training.
DPO Setting (GSM8K + MATH)¶
| Method | Stored Parameters | GSM8K+MATH Avg | Notes |
|---|---|---|---|
| Full fine-tuning | 100% | Highest | Upper bound |
| LoRA (r=32/64/128) | Very low | Severe degradation | Near failure under DPO |
| Other compression methods | Moderate | Moderate | Also struggle under DPO |
| GraftLLM | Moderate-low | Near full fine-tuning | Remains stable under DPO |
Main Results¶
Explicit Knowledge Fusion (AlpacaEval 2.0 + MT-Bench)¶
| Method Type | AlpacaEval 2.0 LC Win Rate | MT-Bench Avg Score |
|---|---|---|
| Best parameter merging (DARE, etc.) | Moderate | Moderate |
| Best routing method (Twin-Merging) | Higher | Higher |
| GraftLLM | +8.07% over best parameter merging | Surpasses all source models |
With only a 28% increase in parameter count, GraftLLM elevates a 7B model to performance competitive with Mixtral-8x7B and Qwen1.5-Chat-72B.
Implicit Knowledge Fusion (10 Benchmarks)¶
GraftLLM demonstrates consistent advantages over existing methods across 10 benchmarks on both Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct target models.
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| No SVD compression | Highest performance | But high storage cost |
| Uniform compression strategy | Performance drops | Module-differentiated compression yields gains |
| SFT only (no DPO) | Basic capability transfer | DPO significantly improves alignment quality |
| Varying SVD rank \(r\) | Performance improves then saturates with larger \(r\) | Efficiency optimum exists |
Continual Learning¶
| Method | Math→Coding Performance | Coding→Math Performance | Notes |
|---|---|---|---|
| LoRA | Moderate | Severe forgetting | Inter-task interference |
| Model Grafting | Moderate | Partial forgetting | Limited improvement |
| Model Tailor | Moderate | Partial forgetting | Limited improvement |
| GraftLLM | Highest | Near zero forgetting | Modular isolation |
Key Findings¶
- Failure of PEFT under DPO: Methods such as LoRA perform adequately in simple SFT scenarios but suffer severe capacity limitations under complex training objectives like DPO—an important observation that has been largely overlooked.
- Necessity of module-differentiated compression: Embedding layers are sensitive to pruning; attention layers are well-suited to low-rank decomposition; MLP layers benefit from SVD. A uniform strategy performs suboptimally.
- Plug-and-play nature of SkillPacks: Since the target model's parameters are never modified, SkillPacks can be freely loaded and unloaded, naturally enabling unlearning and detoxification.
- Routing eliminates parameter conflicts: The performance of conventional parameter merging methods (e.g., Task Arithmetic, TIES) is limited by inter-task parameter conflicts; routing isolates different skills into independent SkillPacks.
Highlights & Insights¶
- "Target model + SkillPack" storage paradigm: Reformulating cross-model capability transfer as a modular, composable representation is conceptually clear and practically appealing.
- Module-aware adaptive compression: The insight that different architectural components exhibit distinct compression characteristics has broad applicability in the delta-parameter compression literature.
- Two-stage SFT + DPO pipeline: Bridging distributional gaps via SFT first, then refining preference alignment via DPO, yields greater robustness than single-stage alternatives.
- Native support for continual learning: The modular design allows new skills to be added without interfering with existing ones, addressing the catastrophic forgetting problem in LLM continual learning.
Limitations & Future Work¶
- The design and training procedure of the routing mechanism are not fully elaborated in the paper; how to automatically select the optimal SkillPack combination for new tasks remains unclear.
- The module-level assignment of compression strategies (which module uses pruning vs. SVD vs. quantization) appears to be manually determined; automated search could potentially yield further improvements.
- Experiments primarily use the Llama model family as the target; generalization to more heterogeneous architectures (e.g., Mamba, hybrid architectures) requires validation.
- SFT data relies on source model generation, making data quality inherently bounded by source model capability.
- As the number of SkillPacks grows, the computational overhead of routing and the accuracy of SkillPack selection may become bottlenecks.
Related Work & Insights¶
This work synthesizes research from three directions: knowledge distillation (FuseLLM, FuseChat series), model grafting/merging (Task Arithmetic, TIES-Merging, DARE, and related task-vector methods), and model compression (SVD, quantization methods such as GPTQ and BitDelta). GraftLLM's innovation lies in combining the performance advantages of full fine-tuning with parameter-efficient storage: it first acquires complete capabilities via full fine-tuning, then compresses the resulting deltas for efficient storage, thereby avoiding the capacity bottleneck of PEFT methods, which are storage-efficient from the start but cannot absorb complex knowledge. The SkillPack concept is analogous to LoRA adapters but more powerful, as it derives from full fine-tuning deltas rather than low-rank constraints.
Rating¶
- Novelty: ⭐⭐⭐⭐ (The SkillPack concept and module-aware compression are novel, though individual components are not entirely new in isolation)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Covers knowledge transfer, explicit fusion, implicit fusion, and continual learning across 10 benchmarks)
- Writing Quality: ⭐⭐⭐⭐ (Method is clearly presented and experiments are rich, though the paper having two title variants is slightly confusing)
- Value: ⭐⭐⭐⭐ (The modular capability transfer paradigm is practically useful and serves as a valuable reference for multi-task LLM deployment)