FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization¶
- Conference: ICCV 2025
- arXiv: 2503.12649
- Code: available (open-sourced, per the paper)
- Area: Model Merging / Multi-task Learning
- Keywords: model merging, Frank-Wolfe optimization, multi-task learning, scalability, large language models
TL;DR¶
This paper formalizes model merging as a constrained optimization problem and introduces FW-Merging, a method inspired by Frank-Wolfe optimization that iteratively selects the most relevant models and merges them locally. The approach achieves scalable and robust merging over large black-box model pools, surpassing the data-aware method AdaMerging by 8.39% when merging 20 ViT models.
Background & Motivation¶
Model merging has emerged as a data-efficient alternative to multi-task learning. However, with the rapid growth of the open-source AI ecosystem, existing methods face two critical limitations:
Lack of adaptability to unknown models: Existing methods tune merging coefficients using prior knowledge of each model's capabilities; they perform poorly on checkpoints from diverse, partially unknown sources and cannot distinguish high-quality from low-quality fine-tuned models.
Inability to scale effectively: Performance degrades severely when merging a large number of unknown model checkpoints. The authors' analytical experiments show that adding 16 irrelevant models causes a performance drop of 18.9%–64.4%.
An ideal merging method should satisfy two fundamental scaling properties: (1) adding irrelevant models does not degrade performance; (2) adding relevant models leads to steady performance improvement.
Method¶
Overall Architecture¶
FW-Merging is an iterative process with three core stages (a minimal sketch follows below):

1. Relevance Assessment: Constructs a linear approximation of the objective function using gradients at the current merged model to identify the most beneficial direction for improvement.
2. Model Selection: Selects the most relevant checkpoint from the candidate pool by minimizing the linear approximation.
3. Knowledge Integration: Integrates the selected checkpoint into the current model using an orthogonal merging method.
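A minimal PyTorch sketch of this loop, assuming model weights as flat dictionaries of tensors; `loss_fn`, `data`, and the Task Arithmetic-style merge step are illustrative stand-ins, not the authors' implementation:

```python
import torch

def objective_gradient(theta, loss_fn, data):
    """Stage 1: gradient of the task loss at the current merged weights."""
    params = {k: v.detach().requires_grad_(True) for k, v in theta.items()}
    loss = loss_fn(params, data)  # e.g., cross-entropy on ~100 samples/task
    grads = torch.autograd.grad(loss, list(params.values()))
    return dict(zip(params.keys(), grads))

def fw_merging(theta_init, checkpoints, loss_fn, data, num_iters=10):
    """Hard FW-Merging sketch: linearize, select a checkpoint, integrate."""
    theta = {k: v.detach().clone() for k, v in theta_init.items()}
    for t in range(num_iters):
        grads = objective_gradient(theta, loss_fn, data)
        # Stage 2: LMO over the finite pool -- pick the checkpoint with the
        # smallest gradient inner product <grad l(theta_t), theta_i*>.
        scores = [sum(float((grads[k] * c[k]).sum()) for k in grads)
                  for c in checkpoints]
        s = checkpoints[scores.index(min(scores))]
        # Stage 3: convex-combination step toward the selected vertex
        # (a Task Arithmetic-style MergeFn; classic FW step-size schedule).
        gamma = 2.0 / (t + 2)
        theta = {k: theta[k] + gamma * (s[k] - theta[k]) for k in theta}
    return theta
```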
Key Designs¶
- Constrained Optimization Formulation: Redefines model merging as minimizing an objective function \(\ell(\theta)\) over the convex hull \(\mathcal{M} = \text{conv}(\{\theta_i^*\}_{i=1}^n)\) of the candidate checkpoints. Proposition 1 establishes equivalence with the traditional coefficient-optimization formulation. The key advantage is that, since a linear objective over a convex hull attains its minimum at a vertex, the Linear Minimization Oracle (LMO) reduces to
\[
s_t = \arg\min_{s \in \mathcal{M}} \langle \nabla \ell(\theta_t),\, s \rangle = \arg\min_{i \in \{1,\dots,n\}} \langle \nabla \ell(\theta_t),\, \theta_i^* \rangle,
\]
which requires only inner-product minimization over a finite vertex set and is computationally efficient.
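As a sanity check on this reduction, the toy example below (with made-up 2-D "checkpoints") confirms that a linear objective over a convex hull is never beaten by an interior point:

```python
import torch

g = torch.tensor([1.0, -2.0])          # stand-in gradient direction
vertices = [torch.tensor([0.0, 0.0]),   # stand-in "checkpoints"
            torch.tensor([1.0, 0.0]),
            torch.tensor([0.0, 1.0])]

# LMO: inner products with the n vertices only.
vertex_best = min(float(g @ v) for v in vertices)

# Brute-force sampling of the hull never does better than the best vertex.
hull_best = min(
    float(g @ (a * vertices[0] + b * vertices[1] + (1 - a - b) * vertices[2]))
    for a in torch.linspace(0, 1, 50)
    for b in torch.linspace(0, 1, 50)
    if float(a + b) <= 1.0
)
assert vertex_best <= hull_best + 1e-6
print(vertex_best, hull_best)  # both -2.0: the minimum sits at vertex (0, 1)
```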
- Hard FW vs. Soft FW:
- Hard LMO: Selects the argmin of the linear subproblem as the merging direction; simple and direct.
- Soft LMO: Selects the top-\(k\) vertices and optimizes their merging coefficients internally via projected gradient descent onto the simplex. The update rule is \(\theta_{t+1} = \theta_t + \sum_{j=1}^k \lambda_j^*(\tilde{s}_j - \theta_t)\) (see the sketch after this list).
- Theorem 1 proves that Soft FW achieves a convergence rate of \(O(1/T)\), superior to the vanilla \(O(1/\sqrt{T})\).
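A sketch of one soft-LMO step, assuming flattened weight vectors; `project_simplex` is the standard sort-based Euclidean projection onto the probability simplex, and `loss_fn` is a stand-in for the task objective:

```python
import torch

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    """Euclidean projection onto the probability simplex (sort-based method)."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0) - 1.0
    idx = torch.arange(1, v.numel() + 1, dtype=v.dtype)
    rho = int(((u - css / idx) > 0).nonzero().max()) + 1
    tau = css[rho - 1] / rho
    return torch.clamp(v - tau, min=0.0)

def soft_fw_step(theta, top_k, loss_fn, lr=0.1, pgd_steps=20):
    """One soft-LMO step: optimize simplex weights over the top-k vertices,
    then apply theta_{t+1} = theta_t + sum_j lambda_j (s_j - theta_t)."""
    lam = torch.full((len(top_k),), 1.0 / len(top_k))
    for _ in range(pgd_steps):
        lam = lam.detach().requires_grad_(True)
        cand = theta + sum(l * (s - theta) for l, s in zip(lam, top_k))
        (g,) = torch.autograd.grad(loss_fn(cand), lam)
        lam = project_simplex(lam.detach() - lr * g)  # projected gradient step
    return theta + sum(l * (s - theta) for l, s in zip(lam, top_k))
```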
- Task-wise vs. Layer-wise LMO:
- Task-wise LMO: Solves the LMO by vectorizing the entire model weight.
- Layer-wise LMO: Defines the constraint set as the Cartesian product of per-layer convex hulls \(\mathcal{M} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_L\), selecting the best model independently per layer; can be viewed as a block-coordinate Frank-Wolfe algorithm (see the per-layer sketch after this list).
- \(\text{FW}_{hard}\) is better suited to layer-wise LMO (7.2-point gain on NLP discriminative tasks), while \(\text{FW}_{soft}\) is better suited to task-wise LMO (since internal coefficient optimization already operates at the layer level).
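A per-layer variant of the selection step might look like the following, again with an illustrative dict-of-tensors layout rather than the paper's code:

```python
def layerwise_lmo(grads, checkpoints):
    """Layer-wise LMO sketch: choose the best checkpoint independently for
    each layer, i.e. block-coordinate Frank-Wolfe over M_1 x ... x M_L."""
    selected = {}
    for layer, g in grads.items():
        scores = [float((g * c[layer]).sum()) for c in checkpoints]
        selected[layer] = checkpoints[scores.index(min(scores))][layer]
    return selected  # a "virtual vertex" stitched together layer by layer
```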
Loss & Training¶
- Objective function: Task-specific cross-entropy loss minimized on training data.
- \(\text{FW}_{hard}\): 10 iterations for NLP, 3 for CV, initialized from Task Arithmetic results.
- \(\text{FW}_{soft}\): 15 iterations for CV, initialized from the pretrained model.
- Only 100 training samples per task are required (vs. 2.9K samples/task for conventional MTL).
- Constant memory overhead: only a fixed number of models is loaded at any step, so the full checkpoint pool never needs to reside in memory simultaneously (see the streaming sketch below).
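The constant-memory property can be realized by streaming checkpoints from disk during the selection step; a sketch in which the file paths and flat state-dict layout are assumptions:

```python
import torch

def select_checkpoint_streaming(grad_vec, ckpt_paths):
    """LMO with O(1) model memory: score candidates one at a time from disk.
    Paths and the flat state-dict layout are illustrative assumptions."""
    best_score, best_path = float("inf"), None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")  # one model in RAM
        vec = torch.cat([p.flatten() for p in state.values()])
        score = float(grad_vec @ vec)                 # LMO inner product
        if score < best_score:
            best_score, best_path = score, path
        del state, vec                                # free before next load
    return best_path
```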
Key Experimental Results¶
Main Results¶
Vision tasks (merging 20 ViT-B/32 models):
| Method | SUN397 | Cars | GTSRB | DTD | Avg. |
|---|---|---|---|---|---|
| Pretrained | 62.3 | 59.7 | 32.6 | 43.8 | 49.6 |
| Task Arithmetic | 20.4 | 12.2 | 29.8 | 22.3 | 21.2 |
| Ties-Merging | 51.0 | 36.2 | 57.7 | 40.6 | 46.4 |
| AdaMerging | 66.4 | 70.1 | 95.1 | 64.0 | 73.9 |
| Surgery | 69.7 | 71.8 | 96.6 | 73.4 | 77.9 |
| FW_soft (Ours) | 72.9 | 74.8 | 96.8 | 76.0 | 80.1 |
Language tasks:
| Method | 4 Discriminative Tasks (8 models) | 3 Generative Tasks (16 models) | Avg. Normalized Score |
|---|---|---|---|
| Traditional MTL | 73.1 | 81.2 | 77.2 |
| Task Arithmetic | 80.8 | 75.9 | 78.4 |
| Ties-Merging | 64.3 | 78.5 | 71.4 |
| FW_hard (Ours) | 85.4 | 81.1 | 83.1 |
Ablation Study¶
Scaling experiments (effect of adding irrelevant/relevant models):
| #Models | Irrelevant: Task Arith. | Irrelevant: FW_soft | Relevant: Task Arith. | Relevant: FW_soft |
|---|---|---|---|---|
| 4 | 70.3 | 74.1 | 59.2 | 59.2 |
| 12 | 47.9 | 74.1 | 52.3 | 67.5 |
| 20 | 21.2 | 74.2 | 36.3 | 68.3 |
FW-Merging exhibits the desired scaling properties: performance does not degrade when 16 irrelevant models are added (74.1 → 74.2), and improves by 15.3% (59.2 → 68.3) when 16 relevant models are added.
Design variant ablation:
| Coefficients | Method | LMO | CV Score |
|---|---|---|---|
| Optimized | FW_soft | Task-wise | 80.1 |
| Optimized | FW_soft | Layer-wise | 79.7 |
| Not optimized | FW_soft | Task-wise | 70.3 |
| — | FW_hard | Layer-wise | 74.0 |
Optimizing merging coefficients yields gains of up to 9.8 points (70.3 → 80.1); task-wise LMO slightly outperforms layer-wise LMO for \(\text{FW}_{soft}\).
Key Findings¶
- FW-Merging requires only 100 samples per task and about 2 minutes of merging time, yet surpasses conventional MTL, which needs 2.9K samples per task and 4.2 hours of training.
- The checkpoint with the smallest linear approximation value is precisely the most relevant model to the target task (validated in Figure 3), indicating that the gradient inner product is a reliable relevance indicator.
- The method demonstrates strong robustness to noisy models (models initialized from different pretraining starting points).
- FW-Merging can be combined with other merging functions (e.g., Ties-Merging) for further improvement.
Highlights & Insights¶
- Elegant theoretical framework: Establishes a correspondence between model merging and the classical Frank-Wolfe optimization algorithm, providing convergence guarantees and clear geometric intuition.
- Practical scalability: Constant memory overhead combined with robustness to irrelevant models is critical for large-scale model merging in the HuggingFace ecosystem.
- Exceptional data efficiency: Only 100 samples per task, making the method suitable for privacy-sensitive and data-scarce scenarios.
- Strong orthogonality: Serves as a general framework that can be combined with existing merging methods.
Limitations & Future Work¶
- The internal optimization (coefficient solving in Soft FW) may increase overall computational cost.
- Only Task Arithmetic is used as the MergeFn to ensure convex hull feasibility; more complex merging functions may violate constraints yet yield better empirical results.
- Validation on larger-scale models (e.g., 70B+) has not been conducted.
- The choice between layer-wise and task-wise LMO lacks an adaptive mechanism and must be manually selected based on the scenario.
- The objective function design relies on a small amount of task-aligned data, making the method inapplicable in fully unsupervised settings.
Related Work & Insights¶
- Task Arithmetic pioneered the paradigm of task vector operations in weight space.
- The parameter conflict resolution approaches of TIES-Merging and DARE are complementary to FW-Merging's model selection mechanism.
- AdaMerging's test-time entropy optimization contrasts with FW-Merging's training-set cross-entropy optimization.
- Frank-Wolfe algorithms hold broad promise for constrained optimization in deep learning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Innovatively bridges classical optimization algorithms with model merging, with solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers NLP discriminative/generative and CV tasks, with detailed scalability analysis and complete ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are rigorous, intuitive figures are clear, and theory is tightly integrated with experiments.
- Value: ⭐⭐⭐⭐⭐ Establishes a new paradigm for scalable model merging with significant practical value for the open-source model ecosystem.