Weight Weaving: Parameter Pooling for Data-Free Model Merging¶
Conference: NeurIPS 2025 arXiv: 2510.13921 Code: https://github.com/VirtualSpaceman/weight_weaving Area: Other Keywords: model merging, data-free, scaling factor, parameter pooling, task vectors
TL;DR¶
This paper proposes Weight Weaving, a plug-and-play, data-free enhancement for model merging that removes the dependence on evaluation data by pooling model parameters (e.g., via averaging or random selection) over the scaling-factor search space. Across three scenarios (multi-task learning, continual learning, and domain generalization), the method improves average accuracy by up to 15.9 percentage points.
Background & Motivation¶
Model merging integrates multiple expert models into a unified model through parameter-level operations, without requiring retraining. Classical methods such as Task Arithmetic scale task vectors (the difference between fine-tuned and pre-trained weights) by a scaling factor \(\lambda\) and add them back to the pre-trained model. However, the choice of \(\lambda\) has a substantial impact on performance.
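Concretely, with task vectors \(\tau_t = \theta_t - \theta_{\text{pre}}\) for \(T\) fine-tuned experts, Task Arithmetic produces

\[
\theta_{\text{merged}} = \theta_{\text{pre}} + \lambda \sum_{t=1}^{T} \tau_t,
\]

so a single global \(\lambda\) must serve all tasks at once: a value that suits one task may be too large or too small for another.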
Key Challenge: Nearly all existing methods rely on the scaling factor \(\lambda\), and setting it correctly typically requires access to evaluation data ("privileged data"), which is often unavailable in real-world deployment. In practice, researchers commonly tune \(\lambda\) directly on the evaluation set, an impractical shortcut. The only existing data-free alternative, MetaGPT, is restricted to Task Arithmetic and does not generalize to other merging methods.
Core Idea: Rather than searching for a single optimal \(\lambda\), the paper proposes pooling all candidate parameters over the \(\lambda\) search space — an idea analogous to ensemble methods. This approach avoids the need for data to select \(\lambda\) while aggregating information across multiple \(\lambda\) values.
Method¶
Overall Architecture¶
Weight Weaving takes three user-defined inputs: (1) a base merging function \(f_{\text{merge}}\) (e.g., an existing method such as TIES or PCB); (2) a scaling-factor search space \(\lambda_{\text{search}}\); and (3) a pooling function \(f_{\text{pooling}}\) (e.g., averaging, random selection, or another merging method). The algorithm proceeds in three steps:

1. Compute the delta weights \(\Delta w = \{\theta_t - \theta_{\text{pre}}\}\)
2. For each \(\lambda_i\) in the search space, apply \(f_{\text{merge}}\) to produce a set of augmented weights \(A\)
3. Combine the delta weights and augmented weights into a set \(A^*\), apply \(f_{\text{pooling}}\), and add the result back to the pre-trained model
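A minimal PyTorch-style sketch of these three steps (function and variable names are mine, not from the released code; `theta_pre` and each `theta_t` are assumed to be state dicts mapping parameter names to tensors):

```python
import torch

def task_arithmetic(deltas, lam):
    # example f_merge: Task Arithmetic scales the summed task vectors by lambda
    return {k: lam * sum(d[k] for d in deltas) for k in deltas[0]}

def average_pool(candidates):
    # example f_pooling: element-wise mean over the candidate delta weights
    return {k: torch.stack([c[k] for c in candidates]).mean(dim=0)
            for k in candidates[0]}

def weight_weaving(theta_pre, theta_tasks, f_merge, lambdas, f_pool,
                   collaborative=True):
    # Step 1: delta weights of each fine-tuned expert w.r.t. the pre-trained model
    deltas = [{k: theta_t[k] - theta_pre[k] for k in theta_pre}
              for theta_t in theta_tasks]
    # Step 2: one set of augmented weights per scaling factor in the search space
    augmented = [f_merge(deltas, lam) for lam in lambdas]
    # Step 3: the collaborative variant weaves the raw deltas back in
    # (A* = deltas + augmented), pools element-wise, and adds the result
    # back to the pre-trained weights
    pool_set = deltas + augmented if collaborative else augmented
    pooled = f_pool(pool_set)
    return {k: theta_pre[k] + pooled[k] for k in theta_pre}
```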
Key Designs¶
- Parameter-level pooling instead of model-level selection: Traditional methods select a single \(\lambda\) to produce one merged model. Weight Weaving performs element-wise pooling over parameters generated from all \(\lambda\) values in the search space. The intuition is that different tasks have different optimal \(\lambda\) values, and pooling marginalizes out the dependency on any specific \(\lambda\).
- Collaborative variant: Pooling operates not only over the augmented weights but also incorporates the original delta weights, forming a richer parameter set \(A^* = \Delta w \cup A\). Experiments show that including a broader set of parameter sources benefits final performance.
- Orthogonality to existing methods: Weight Weaving serves as an outer wrapper and can be combined with any merging method that depends on \(\lambda\). It does not modify the internal logic of \(f_{\text{merge}}\); it only aggregates outputs at different \(\lambda\) values (see the usage sketch below). The search space is also not restricted to scalar \(\lambda\): it can include categorical variables, probability distributions, or even functions.
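Continuing the sketch above (all names hypothetical, including the search grid), wrapping a \(\lambda\)-dependent merger requires no change to its internals:

```python
# hypothetical usage of the earlier sketch; the grid over (0, 1] is illustrative
lambdas = [0.1 * i for i in range(1, 11)]
theta_merged = weight_weaving(theta_pre, theta_tasks,
                              f_merge=task_arithmetic,  # swap in TIES, PCB, ...
                              lambdas=lambdas,
                              f_pool=average_pool)
```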
Pooling Function Options¶
- Average: Element-wise arithmetic mean across candidates
- Random Uniform: For each parameter position, independently sample one value from \(N\) candidates with equal probability
- MagMax: For each parameter position, select the value with the largest absolute magnitude
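A tensor-level sketch of these three pooling rules, applied independently to each parameter tensor (a minimal PyTorch illustration; function names are mine):

```python
import torch

def pool_average(candidates):
    # element-wise arithmetic mean over the N candidate tensors
    return torch.stack(candidates).mean(dim=0)

def pool_random_uniform(candidates):
    # for every parameter position, draw one of the N candidates uniformly
    stacked = torch.stack(candidates)                        # (N, *shape)
    idx = torch.randint(len(candidates), stacked.shape[1:])  # one choice per position
    return torch.gather(stacked, 0, idx.unsqueeze(0)).squeeze(0)

def pool_magmax(candidates):
    # for every parameter position, keep the value of largest absolute magnitude
    stacked = torch.stack(candidates)
    idx = stacked.abs().argmax(dim=0, keepdim=True)
    return torch.gather(stacked, 0, idx).squeeze(0)
```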
Key Experimental Results¶
Main Results: Effect of Weight Weaving Under Data-Free Setting (Average Accuracy)¶
| Base Method | Original (Data-Free) | +Weight Weaving | Gain |
|---|---|---|---|
| Breadcrumbs | 52.17 | 68.11 | +15.94 |
| MagMax | 60.14 | 69.77 | +9.63 |
| TIES | 68.39 | 71.21 | +2.82 |
| PCB | 71.41 | 72.10 | +0.69 |
| TSV | 73.11 | 74.01 | +0.90 |
| ISO-C | 72.38 | 73.78 | +1.40 |
Per-Scenario Results (Averaged over ViT-B-32 / B-16 / L-14)¶
| Method | Multi-Task Learning | Continual Learning | Domain Generalization |
|---|---|---|---|
| TIES | 78.18 | 72.95 | 54.04 |
| TIES+Ours | 78.62 | 74.79 | 60.21 |
| PCB | 81.21 | 74.39 | 58.63 |
| PCB+Ours | 80.92 | 74.86 | 60.53 |
| TSV | 87.10 | 75.32 | 56.91 |
| TSV+Ours | 85.65 | 75.48 | 60.89 |
Ablation Study on Pooling Functions¶
| Pooling Function | Breadcrumbs | TIES | PCB | TSV | ISO-C |
|---|---|---|---|---|---|
| Average | 68.11 | 71.21 | 72.10 | 74.01 | 73.78 |
| Random | 68.11 | 71.21 | 72.08 | 73.64 | 73.61 |
| MagMax | 51.93 | 54.56 | 50.36 | 55.72 | 64.34 |
Analysis of Optimal \(\lambda\) Distributions¶
| Scenario | Distribution Characteristics | Weight Weaving Effect |
|---|---|---|
| Multi-task learning | Concentrated at a single value (e.g., ISO-C at 1.0) | Limited or slightly negative gain |
| Continual learning | Broadly dispersed across the search space | Significant improvement |
| Domain generalization | Broadly dispersed | Significant improvement |
Key Findings¶
- Weight Weaving yields the largest gains in continual learning and domain generalization, precisely the scenarios where optimal \(\lambda\) values are most dispersed across tasks
- When the optimal \(\lambda\) concentrates at a single value (e.g., ISO-C in multi-task learning), pooling provides limited benefit or a slight degradation
- Average and Random pooling perform comparably, while MagMax pooling performs poorly — as it tends to select parameters corresponding to the largest \(\lambda\)
- Sequential fine-tuning in continual learning introduces correlations among task vectors (in contrast to the near-orthogonality observed in multi-task learning), posing a unique challenge
- Methods with weaker baseline performance (e.g., Breadcrumbs) benefit the most (+15.94), while already strong methods (e.g., TSV) benefit less (+0.90)
Highlights & Insights¶
- Remarkably simple yet effective: The core idea of Weight Weaving — "try multiple \(\lambda\) values and average" — is conceptually straightforward and easy to implement
- Genuinely data-free: The method requires no validation set, evaluation set, test data, or any form of privileged information, making it practically viable for real-world deployment
- Plug-and-play modular design: As an outer wrapper, it can enhance any merging method that relies on \(\lambda\) without modifying the underlying method
- Informative observations on continual learning: The finding that sequential fine-tuning induces high correlation among task vectors offers a useful direction for designing continual-learning-specific merging methods
- Optimal \(\lambda\) distribution as a diagnostic tool: The analysis reveals a practical heuristic — Weight Weaving is most effective when optimal \(\lambda\) values are broadly dispersed across tasks
Limitations & Future Work¶
- When the optimal \(\lambda\) is concentrated at a single value, pooling may introduce suboptimal parameters and cause a slight performance drop (e.g., ISO-C in multi-task learning)
- How to filter out suboptimal \(\lambda\) values from the search space without using privileged data remains an open question
- Computational overhead scales with the size of the search space and the complexity of \(f_{\text{merge}}\); large-scale models (e.g., billions of parameters) may require parallel computation
- Experiments are conducted exclusively on vision tasks (ViT), leaving validation on other modalities such as NLP unexplored
- While average pooling is simple and effective, it may not be optimal; adaptive weighted pooling is a natural direction for improvement
- The design of the search space (range and step size) currently relies on empirical heuristics; automatic determination of the search space warrants further investigation
Related Work & Insights¶
- Task Arithmetic (Ilharco et al., 2023) introduces the concept of task vectors; Weight Weaving builds on this foundation and targets its core bottleneck, the selection of \(\lambda\)
- TIES, PCB, MagMax, and related methods focus on resolving parameter conflicts (task conflicts); Weight Weaving improves merging quality from a complementary perspective — robustness to \(\lambda\)
- MetaGPT proposes a closed-form solution for finding \(\lambda\) but is restricted to Task Arithmetic; Weight Weaving is applicable to all merging methods
- The method has direct practical value: in deployment scenarios where evaluation data is unavailable (e.g., edge devices, privacy-sensitive settings), Weight Weaving represents the most pragmatic solution currently available
Rating¶
- Novelty: ⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐