# FREE-Merging: Fourier Transform for Efficient Model Merging
**Conference:** ICCV 2025 · **arXiv:** 2411.16815 · **Code:** GitHub · **Area:** Multimodal VLM · **Keywords:** Model Merging, Fourier Transform, Task Interference, Frequency-Domain Analysis, Lightweight Expert
## TL;DR
This paper is the first to identify the frequency-domain manifestation of task interference in model merging. It proposes FR-Merging, which removes low-frequency interference via high-pass filtering to construct a high-quality merged backbone, and combines it with lightweight task expert modules (FREE-Merging) to achieve an optimal performance–cost trade-off across vision, language, and multimodal tasks.
## Background & Motivation
With the proliferation of open-source fine-tuned models, model merging has emerged as an efficient way to consolidate multiple task-specific models into a single multi-task model, avoiding both the high cost of joint multi-task training and its data privacy concerns. However, existing methods face two core challenges:
1. **Task interference degrades performance:** Conflicts exist among fine-tuned weights from different tasks. Existing methods (e.g., Task Arithmetic, Ties-Merging, DARE) operate solely in the spatial domain (pruning, sign-conflict resolution, etc.) and overlook frequency-domain interference. This paper is the first to reveal that task interference is significant in the frequency domain, that it is concentrated in low-frequency regions, and that spatial-domain methods barely mitigate it (reducing frequency-domain amplitude variance by only 1–5%, versus 20–24% for FR-Merging).
2. **Conflict between performance and deployment cost:** Introducing task experts improves performance, but existing methods (EMR-Merging, Twin-Merging) must store substantial task-specific knowledge (2–3% of the parameters) while neglecting optimization of the backbone itself.
The core insight of this paper is that low-frequency signals capture global structural information and are more likely to contain task-specific information that causes inter-task interference, whereas high-frequency signals represent fine-grained variations with stronger generalization capacity. Directly filtering out the low-frequency component can therefore substantially reduce task interference while preserving performance.
## Method
### Overall Architecture
FREE-Merging is a two-stage approach:

- **Stage 1 — FR-Merging (training-free):** A high-pass filter is applied to each task vector \(v_k = \theta_k - \theta_{\text{pre}}\) to remove low-frequency interference signals, and the filtered vectors are merged to obtain a high-quality backbone network.
- **Stage 2 — Expert extraction (training-free):** Lightweight task experts comprising only ~1% of the parameter count are extracted from the task vectors and dynamically assigned at inference time via a router.
### Key Designs
**FR-Merging (frequency-domain high-pass filtering merge)**

- Function: Applies a Fourier transform to each task vector, filters out the low-frequency interference region, and applies the inverse transform.
- Mechanism: An ideal high-pass filter is applied to the task vector \(v(x, y)\):

  \[
  G(x, y) = \mathcal{F}^{-1}\{H(\eta, \gamma) \cdot \mathcal{F}\{v(x, y)\}\}, \qquad
  H(\eta, \gamma) =
  \begin{cases}
  1, & \sqrt{\eta^2 + \gamma^2} \geq D_0 \\
  0, & \sqrt{\eta^2 + \gamma^2} < D_0
  \end{cases}
  \]

  where \(D_0\) is the cutoff frequency. Merging coefficients are computed via mean normalization of the task vectors: \(\lambda_i = \mathbb{E}(v_i) \left(\sum_{j=1}^{K} \mathbb{E}(v_j)\right)^{-1}\).
- Design Motivation: Fine-tuned weights occupy different positions in the loss landscape, so linear interpolation can easily land in high-loss regions. Removing low-frequency signals reduces the discrepancies between models, making the merged result more likely to reside within a shared loss basin. Experiments confirm that removing low-frequency components incurs only a marginal per-task performance loss (diagonal of the cross-task evaluation matrix) while substantially improving generalization (off-diagonal entries).
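To make the mechanism concrete, here is a minimal NumPy sketch of FR-Merging for a single 2-D weight matrix. The cutoff value, the frequency grid built with `np.fft.fftfreq`, and the use of the mean absolute value for \(\mathbb{E}(\cdot)\) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def high_pass_filter(v: np.ndarray, d0: float) -> np.ndarray:
    """Ideal high-pass filter on a 2-D task vector: zero out frequencies below d0."""
    V = np.fft.fft2(v)
    eta = np.fft.fftfreq(v.shape[0])[:, None]     # vertical frequency coordinates
    gamma = np.fft.fftfreq(v.shape[1])[None, :]   # horizontal frequency coordinates
    H = np.sqrt(eta**2 + gamma**2) >= d0          # 1 outside the cutoff, 0 inside
    return np.fft.ifft2(H * V).real

def fr_merge(theta_pre: np.ndarray, thetas: list, d0: float = 0.05) -> np.ndarray:
    """Merge fine-tuned weights into one backbone from filtered task vectors."""
    task_vectors = [t - theta_pre for t in thetas]
    filtered = [high_pass_filter(v, d0) for v in task_vectors]
    # lambda_i = E(v_i) / sum_j E(v_j); E(.) taken as the mean |.| here (assumption).
    means = np.array([np.abs(v).mean() for v in task_vectors])
    lam = means / means.sum()
    return theta_pre + sum(l * f for l, f in zip(lam, filtered))
```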
**Lightweight task expert extraction**

- Function: Selects the top-d% of parameters with the largest magnitude changes in each task vector as that task's expert, requiring only ~1% of the parameter count.
- Mechanism: The highest-magnitude parameters are selected and rescaled:

  \[
  e(v_i) = \mu_i M(v_i, d), \qquad
  \mu_i = -\frac{\mathbb{E}(M(v_i, d)) \cdot \log(d)}{\lambda_i \cdot \mathbb{E}(v_i)}
  \]

  where \(M(v_i, d)\) denotes the top-d% parameters and \(\mu_i\) is a scaling factor that keeps the expert's output magnitude consistent with the original task vector.
- Design Motivation: Theorem 5.1 guarantees that a merged model cannot simultaneously retain all the capabilities of the original models without introducing additional information (a "no free lunch" result for model merging). The filtered-out low-frequency signals encode exactly this task-specific information, but storing the low-frequency region directly would require an inverse FFT at every inference step, which is impractical; parameter magnitude is therefore used as a proxy.
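A minimal sketch of the extraction step, under the assumptions that \(\mathbb{E}(\cdot)\) is the mean absolute value, that \(\mathbb{E}(M(v_i, d))\) is taken over the kept entries, and that d is a fraction (0.01 for ~1%):

```python
import numpy as np

def extract_expert(v: np.ndarray, d: float, lam: float) -> np.ndarray:
    """Keep the top-d fraction of |v| as a sparse expert, rescaled by mu."""
    flat = v.ravel()
    k = max(1, int(d * flat.size))                  # d = 0.01 keeps ~1% of entries
    idx = np.argpartition(np.abs(flat), -k)[-k:]    # indices of the k largest |.|
    mask = np.zeros_like(flat)
    mask[idx] = flat[idx]                           # M(v, d)
    # mu = -E(M(v, d)) * log(d) / (lam * E(v)); log(d) < 0 for d < 1, so mu > 0.
    mu = -(np.abs(flat[idx]).mean() * np.log(d)) / (lam * np.abs(flat).mean())
    return (mu * mask).reshape(v.shape)
```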
**MoE router for dynamic dispatch**

- Function: Dynamically selects the active task expert based on the input at inference time.
- Mechanism: \(\theta_* = \theta_m + \sum_{i=1}^{K} w_i e_i\), where \([w_1, \ldots, w_K] \leftarrow \arg\max(R(x))\) and \(R\) is a lightweight MLP router.
- Design Motivation: Inspired by Mixture-of-Experts, dynamic routing eliminates the overhead of loading all experts for every input.
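A PyTorch sketch of the dispatch step; the router width, the mean-pooling of input features, and the hard top-1 selection are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Lightweight MLP router R(x): pooled input features -> per-task scores."""
    def __init__(self, feat_dim: int, num_tasks: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_tasks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, feat_dim) -> mean-pool over tokens, then score tasks.
        return self.mlp(x.mean(dim=1))

def dispatch(theta_m: dict, experts: list, router: Router, x: torch.Tensor) -> dict:
    """theta_* = theta_m + e_k, where k = argmax R(x) (hard top-1 routing)."""
    k = int(router(x).argmax(dim=-1)[0])  # expert chosen for this input
    return {name: w + experts[k].get(name, torch.zeros_like(w))
            for name, w in theta_m.items()}
```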
## Loss & Training
Both FR-Merging and expert extraction are entirely training-free, requiring only a one-time computation. The router can be implemented as a simple MLP or other classifier. The overall pipeline executes the FFT operation only once during merging; at inference, only a lightweight router is added.
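A toy end-to-end run, reusing the sketches above (the names `fr_merge` and `extract_expert` come from those sketches, not from the authors' code, and the "model" is a single weight matrix):

```python
import numpy as np

# Toy setup: one pretrained weight matrix and three "fine-tuned" variants.
rng = np.random.default_rng(0)
theta_pre = rng.standard_normal((64, 64))
thetas = [theta_pre + 0.01 * rng.standard_normal((64, 64)) for _ in range(3)]

# One-time, training-free merge (a single FFT pass per task vector).
backbone = fr_merge(theta_pre, thetas, d0=0.05)

task_vectors = [t - theta_pre for t in thetas]
means = np.array([np.abs(v).mean() for v in task_vectors])
lams = means / means.sum()
experts = [extract_expert(v, d=0.01, lam=l) for v, l in zip(task_vectors, lams)]

# Inference: the router's chosen expert is added on top of the backbone.
theta_star = backbone + experts[0]  # index 0 stands in for argmax R(x)
```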
## Key Experimental Results
### Main Results
8-task vision merging (average accuracy on ViT-B/32 / ViT-L/14):
| Method | Extra Cost | ViT-B/32 Avg | ViT-L/14 Avg | Notes |
|---|---|---|---|---|
| Individual | — | 90.5 | 94.2 | Upper bound |
| Task Arithmetic | None | 70.1 | 84.5 | Baseline |
| Ties-Merging | None | 73.6 | 86.0 | |
| PCB-Merging | None | 75.8 | 86.9 | Prev. SOTA (training-free) |
| FR-Merging | None | 78.1 | 88.3 | +2.3 / +1.4 |
| EMR-Merging | 3% storage | 87.7 | 92.8 | |
| Twin-Merging | 2% storage | 87.8 | 92.7 | |
| FREE-Merging | 1% storage | 89.7 | 93.7 | Lowest storage, highest performance |
### Ablation Study
Frequency-domain interference quantification (ViT-B/32 amplitude variance):
| Method | Freq. Amplitude Variance | Variance Reduction | Notes |
|---|---|---|---|
| Task Arithmetic | 0.059 | — | Baseline |
| DARE | 0.057 | ↓3% | Spatial-domain method |
| Ties-Merging | 0.058 | ↓2% | Spatial-domain method |
| PCB-Merging | 0.056 | ↓5% | Spatial-domain method |
| FR-Merging | 0.045 | ↓24% | Frequency-domain method is markedly more effective |
Cross-domain validation on language models (RoBERTa / T0-3B / Qwen-14B):
| Method | RoBERTa | T0-3B | Qwen-14B | Average |
|---|---|---|---|---|
| Task Arithmetic | 66.65 | 63.91 | 66.40 | 65.65 |
| FR-Merging | 70.02 | 66.88 | 68.00 | 68.30 |
| EMR-Merging | 74.20 | 67.11 | 70.98 | 70.76 |
| FREE-Merging | 80.16 | 68.68 | 72.78 | 73.87 |
### Key Findings
- In the 30-task vision merging setting, FR-Merging improves training-free performance from 48.88% (Task Arithmetic) to 53.90%, while FREE-Merging reaches 79.67%.
- High-pass filtering substantially improves generalization (off-diagonal) with only marginal per-task performance loss (diagonal), validating that low-frequency signals primarily encode task-specific information.
- FREE-Merging requires only 1% additional parameter storage yet outperforms EMR-Merging and Twin-Merging, which require 2–3%.
- The method generalizes effectively to PEFT settings (LoRA, IA³), demonstrating broad applicability.
## Highlights & Insights
- Breakthrough frequency-domain perspective: This is the first work to introduce frequency-domain analysis into model merging, revealing the concentration of task interference in low-frequency regions and providing a novel theoretical lens for understanding model merging.
- Simplicity and effectiveness: FR-Merging requires only a single FFT operation with a theoretical time complexity of \(O(nm\log m)\), and demands no training data or gradient computation.
- Theoretical guarantee: Theorem 5.1 rigorously proves a "No Free Lunch" theorem for model merging, theoretically justifying the necessity of incorporating task experts.
- Cross-modal generalization: Effectiveness is validated across CV, NLP, and multimodal tasks, spanning models of varying scales from ViT to LLaMA.
## Limitations & Future Work
- The ideal high-pass filter introduces ringing artifacts; Butterworth or Gaussian high-pass filters could be explored for smoother frequency transitions (see the sketch after this list).
- The cutoff frequency \(D_0\) is a fixed hyperparameter; different layers or tasks may benefit from adaptive cutoff strategies.
- Router accuracy directly affects FREE-Merging performance and may degrade when task boundaries are ambiguous.
- In very large-scale merging scenarios (e.g., 30+ models), the linear growth in the number of experts remains a storage burden.
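On the first point above, a Gaussian high-pass transfer function is a standard ringing-free alternative to the ideal filter; the sketch below is our illustration, not an experiment from the paper.

```python
import numpy as np

def gaussian_high_pass(n: int, m: int, d0: float) -> np.ndarray:
    """H(eta, gamma) = 1 - exp(-D^2 / (2 * d0^2)): smooth roll-off, no ringing."""
    eta = np.fft.fftfreq(n)[:, None]
    gamma = np.fft.fftfreq(m)[None, :]
    d2 = eta**2 + gamma**2          # squared distance from the frequency origin
    return 1.0 - np.exp(-d2 / (2.0 * d0**2))
```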
## Related Work & Insights
- The frequency-domain analysis paradigm is transferable to other parameter-space operations, such as frequency-domain pruning in knowledge distillation and model compression.
- The lightweight expert extraction approach can be combined with PEFT methods such as LoRA and Adapter for more efficient multi-task deployment.
- The insight that "low frequency encodes task-specific information while high frequency encodes generalization capacity" sheds light on the nature of parameter changes during fine-tuning.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to introduce frequency-domain analysis into model merging; the discovery of low-frequency interference is highly pioneering.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers vision/language/multimodal tasks, full fine-tuning and PEFT, and merging settings with 8 and 30 tasks.
- Writing Quality: ⭐⭐⭐⭐ — Logically clear with fair and systematic experimental comparisons and intuitive figures.
- Value: ⭐⭐⭐⭐⭐ — Highly practical; the training-free approach lowers the barrier to model merging and has significant implications for large model deployment.