# FREE-Merging: Fourier Transform for Efficient Model Merging
**Conference:** ICCV 2025 · **arXiv:** 2411.16815 · **Code:** GitHub · **Area:** Multimodal VLM · **Keywords:** Model Merging, Fourier Transform, Task Interference, Frequency-Domain Analysis, Lightweight Expert
## TL;DR
This paper is the first to identify the frequency-domain manifestation of task interference in model merging. It proposes FR-Merging, which removes low-frequency interference via high-pass filtering to construct a high-quality merged backbone, and combines it with lightweight task expert modules (FREE-Merging) to achieve an optimal performance–cost trade-off across vision, language, and multimodal tasks.
## Background & Motivation
With the proliferation of open-source fine-tuned models, model merging has emerged as an efficient way to consolidate multiple task-specific models into a single multi-task model, avoiding both the high cost of joint multi-task training and its data privacy concerns. However, existing methods face two core challenges:
1. **Task interference degrades performance:** Conflicts exist among fine-tuned weights from different tasks. Existing methods (e.g., Task Arithmetic, Ties-Merging, DARE) operate solely in the spatial domain (pruning, sign-conflict resolution, etc.) and overlook frequency-domain interference. This paper is the first to reveal that task interference is significant in the frequency domain, that it is concentrated in low-frequency regions, and that spatial-domain methods barely mitigate it (reducing frequency-domain amplitude variance by only 1–5%, versus 20–24% for FR-Merging).
2. **Conflict between performance and deployment cost:** Introducing task experts improves performance, but existing methods (EMR-Merging, Twin-Merging) must store substantial task-specific knowledge (2–3% of the parameters) while neglecting optimization of the backbone itself.
The core insight of this paper is that low-frequency signals capture global structural information and are more likely to contain task-specific information that causes inter-task interference, whereas high-frequency signals represent fine-grained variations with stronger generalization capacity. Directly filtering out the low-frequency component can therefore substantially reduce task interference while preserving performance.
## Method
### Overall Architecture
FREE-Merging is a two-stage approach:

- **Stage 1 — FR-Merging (training-free):** A high-pass filter is applied to each task vector \(v_k = \theta_k - \theta_{\text{pre}}\) to remove low-frequency interference signals, and the filtered vectors are merged to obtain a high-quality backbone network.
- **Stage 2 — Expert extraction (training-free):** Lightweight task experts comprising only ~1% of the parameter count are extracted from the task vectors and dynamically assigned at inference time via a router.
### Key Designs
**FR-Merging (frequency-domain high-pass filtering merge)**

- Function: Applies a Fourier transform to each task vector, filters out the low-frequency interference region, and applies the inverse transform.
- Mechanism: An ideal high-pass filter is applied to the task vector \(v(x, y)\):

  \[
  G(x, y) = \mathcal{F}^{-1}\{H(\eta, \gamma) \cdot \mathcal{F}\{v(x, y)\}\}, \qquad
  H(\eta, \gamma) =
  \begin{cases}
  1, & \sqrt{\eta^2 + \gamma^2} \geq D_0 \\
  0, & \sqrt{\eta^2 + \gamma^2} < D_0
  \end{cases}
  \]

  where \(D_0\) is the cutoff frequency. Merging coefficients are computed via mean normalization of the task vectors: \(\lambda_i = \mathbb{E}(v_i) \left(\sum_{j=1}^{K} \mathbb{E}(v_j)\right)^{-1}\).
- Design Motivation: Fine-tuned weights occupy different positions in the loss landscape, so linear interpolation can easily land in high-loss regions. Removing low-frequency signals reduces the discrepancies between models, making the merged result more likely to reside within a shared loss basin. Experiments confirm that removing low-frequency components incurs only a marginal per-task performance loss (diagonal of the cross-task evaluation matrix) while substantially improving generalization (off-diagonal entries).
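To make the mechanism concrete, here is a minimal NumPy sketch of FR-Merging for a single 2-D weight matrix. The cutoff value, the frequency grid built with `np.fft.fftfreq`, and the use of the mean absolute value for \(\mathbb{E}(\cdot)\) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def high_pass_filter(v: np.ndarray, d0: float) -> np.ndarray:
    """Ideal high-pass filter on a 2-D task vector: zero out frequencies below d0."""
    V = np.fft.fft2(v)
    eta = np.fft.fftfreq(v.shape[0])[:, None]     # vertical frequency coordinates
    gamma = np.fft.fftfreq(v.shape[1])[None, :]   # horizontal frequency coordinates
    H = np.sqrt(eta**2 + gamma**2) >= d0          # 1 outside the cutoff, 0 inside
    return np.fft.ifft2(H * V).real

def fr_merge(theta_pre: np.ndarray, thetas: list, d0: float = 0.05) -> np.ndarray:
    """Merge fine-tuned weights into one backbone from filtered task vectors."""
    task_vectors = [t - theta_pre for t in thetas]
    filtered = [high_pass_filter(v, d0) for v in task_vectors]
    # lambda_i = E(v_i) / sum_j E(v_j); E(.) taken as the mean |.| here (assumption).
    means = np.array([np.abs(v).mean() for v in task_vectors])
    lam = means / means.sum()
    return theta_pre + sum(l * f for l, f in zip(lam, filtered))
```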
**Lightweight task expert extraction**

- Function: Selects the top-d% of parameters with the largest magnitude changes in each task vector as that task's expert, requiring only ~1% of the parameter count.
- Mechanism: The highest-magnitude parameters are selected and rescaled:

  \[
  e(v_i) = \mu_i M(v_i, d), \qquad
  \mu_i = -\frac{\mathbb{E}(M(v_i, d)) \cdot \log(d)}{\lambda_i \cdot \mathbb{E}(v_i)}
  \]

  where \(M(v_i, d)\) denotes the top-d% parameters and \(\mu_i\) is a scaling factor that keeps the expert's output magnitude consistent with the original task vector.
- Design Motivation: Theorem 5.1 guarantees that a merged model cannot simultaneously retain all the capabilities of the original models without introducing additional information (a "no free lunch" result for model merging). The filtered-out low-frequency signals encode exactly this task-specific information, but storing the low-frequency region directly would require an inverse FFT at every inference step, which is impractical; parameter magnitude is therefore used as a proxy.
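A minimal sketch of the extraction step, under the assumptions that \(\mathbb{E}(\cdot)\) is the mean absolute value, that \(\mathbb{E}(M(v_i, d))\) is taken over the kept entries, and that d is a fraction (0.01 for ~1%):

```python
import numpy as np

def extract_expert(v: np.ndarray, d: float, lam: float) -> np.ndarray:
    """Keep the top-d fraction of |v| as a sparse expert, rescaled by mu."""
    flat = v.ravel()
    k = max(1, int(d * flat.size))                  # d = 0.01 keeps ~1% of entries
    idx = np.argpartition(np.abs(flat), -k)[-k:]    # indices of the k largest |.|
    mask = np.zeros_like(flat)
    mask[idx] = flat[idx]                           # M(v, d)
    # mu = -E(M(v, d)) * log(d) / (lam * E(v)); log(d) < 0 for d < 1, so mu > 0.
    mu = -(np.abs(flat[idx]).mean() * np.log(d)) / (lam * np.abs(flat).mean())
    return (mu * mask).reshape(v.shape)
```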
**MoE router for dynamic dispatch**

- Function: Dynamically selects the active task expert based on the input at inference time.
- Mechanism: \(\theta_* = \theta_m + \sum_{i=1}^{K} w_i e_i\), where \([w_1, \ldots, w_K] \leftarrow \arg\max(R(x))\) and \(R\) is a lightweight MLP router.
- Design Motivation: Inspired by Mixture-of-Experts, dynamic routing eliminates the overhead of loading all experts for every input.
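A PyTorch sketch of the dispatch step; the router width, the mean-pooling of input features, and the hard top-1 selection are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Lightweight MLP router R(x): pooled input features -> per-task scores."""
    def __init__(self, feat_dim: int, num_tasks: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_tasks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, feat_dim) -> mean-pool over tokens, then score tasks.
        return self.mlp(x.mean(dim=1))

def dispatch(theta_m: dict, experts: list, router: Router, x: torch.Tensor) -> dict:
    """theta_* = theta_m + e_k, where k = argmax R(x) (hard top-1 routing)."""
    k = int(router(x).argmax(dim=-1)[0])  # expert chosen for this input
    return {name: w + experts[k].get(name, torch.zeros_like(w))
            for name, w in theta_m.items()}
```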
## Loss & Training
Both FR-Merging and expert extraction are entirely training-free, requiring only a one-time computation. The router can be implemented as a simple MLP or other classifier. The overall pipeline executes the FFT operation only once during merging; at inference, only a lightweight router is added.
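A toy end-to-end run, reusing the sketches above (the names `fr_merge` and `extract_expert` come from those sketches, not from the authors' code, and the "model" is a single weight matrix):

```python
import numpy as np

# Toy setup: one pretrained weight matrix and three "fine-tuned" variants.
rng = np.random.default_rng(0)
theta_pre = rng.standard_normal((64, 64))
thetas = [theta_pre + 0.01 * rng.standard_normal((64, 64)) for _ in range(3)]

# One-time, training-free merge (a single FFT pass per task vector).
backbone = fr_merge(theta_pre, thetas, d0=0.05)

task_vectors = [t - theta_pre for t in thetas]
means = np.array([np.abs(v).mean() for v in task_vectors])
lams = means / means.sum()
experts = [extract_expert(v, d=0.01, lam=l) for v, l in zip(task_vectors, lams)]

# Inference: the router's chosen expert is added on top of the backbone.
theta_star = backbone + experts[0]  # index 0 stands in for argmax R(x)
```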
## Key Experimental Results
### Main Results
8-task vision merging (average accuracy on ViT-B/32 / ViT-L/14):
| Method | Extra Cost | ViT-B/32 Avg | ViT-L/14 Avg | Notes |
|---|---|---|---|---|
| Individual | — | 90.5 | 94.2 | Upper bound |
| Task Arithmetic | None | 70.1 | 84.5 | Baseline |
| Ties-Merging | None | 73.6 | 86.0 | |
| PCB-Merging | None | 75.8 | 86.9 | Prev. SOTA (training-free) |
| FR-Merging | None | 78.1 | 88.3 | +2.3 / +1.4 |
| EMR-Merging | 3% storage | 87.7 | 92.8 | |
| Twin-Merging | 2% storage | 87.8 | 92.7 | |
| FREE-Merging | 1% storage | 89.7 | 93.7 | Lowest storage, highest performance |
### Ablation Study
Frequency-domain interference quantification (ViT-B/32 amplitude variance):
| Method | Freq. Amplitude Variance | Variance Reduction | Notes |
|---|---|---|---|
| Task Arithmetic | 0.059 | — | Baseline |
| DARE | 0.057 | ↓3% | Spatial-domain method |
| Ties-Merging | 0.058 | ↓2% | Spatial-domain method |
| PCB-Merging | 0.056 | ↓5% | Spatial-domain method |
| FR-Merging | 0.045 | ↓24% | Frequency-domain method is markedly more effective |
Cross-domain validation on language models (RoBERTa / T0-3B / Qwen-14B):
| Method | RoBERTa | T0-3B | Qwen-14B | Average |
|---|---|---|---|---|
| Task Arithmetic | 66.65 | 63.91 | 66.40 | 65.65 |
| FR-Merging | 70.02 | 66.88 | 68.00 | 68.30 |
| EMR-Merging | 74.20 | 67.11 | 70.98 | 70.76 |
| FREE-Merging | 80.16 | 68.68 | 72.78 | 73.87 |
### Key Findings
- In the 30-task vision merging setting, FR-Merging improves training-free performance from 48.88% (Task Arithmetic) to 53.90%, while FREE-Merging reaches 79.67%.
- High-pass filtering substantially improves generalization (off-diagonal) with only marginal per-task performance loss (diagonal), validating that low-frequency signals primarily encode task-specific information.
- FREE-Merging requires only 1% additional parameter storage yet outperforms EMR-Merging and Twin-Merging, which require 2–3%.
- The method generalizes effectively to PEFT settings (LoRA, IA³), demonstrating broad applicability.
## Highlights & Insights
- Breakthrough frequency-domain perspective: This is the first work to introduce frequency-domain analysis into model merging, revealing the concentration of task interference in low-frequency regions and providing a novel theoretical lens for understanding model merging.
- Simplicity and effectiveness: FR-Merging requires only a single FFT operation with a theoretical time complexity of \(O(nm\log m)\), and demands no training data or gradient computation.
- Theoretical guarantee: Theorem 5.1 rigorously proves a "No Free Lunch" theorem for model merging, theoretically justifying the necessity of incorporating task experts.
- Cross-modal generalization: Effectiveness is validated across CV, NLP, and multimodal tasks, spanning models of varying scales from ViT to LLaMA.
## Limitations & Future Work
- The ideal high-pass filter introduces ringing artifacts; Butterworth or Gaussian high-pass filters could be explored for smoother frequency transitions (see the sketch after this list).
- The cutoff frequency \(D_0\) is a fixed hyperparameter; different layers or tasks may benefit from adaptive cutoff strategies.
- Router accuracy directly affects FREE-Merging performance and may degrade when task boundaries are ambiguous.
- In very large-scale merging scenarios (e.g., 30+ models), the linear growth in the number of experts remains a storage burden.
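On the first point above, a Gaussian high-pass transfer function is a standard ringing-free alternative to the ideal filter; the sketch below is our illustration, not an experiment from the paper.

```python
import numpy as np

def gaussian_high_pass(n: int, m: int, d0: float) -> np.ndarray:
    """H(eta, gamma) = 1 - exp(-D^2 / (2 * d0^2)): smooth roll-off, no ringing."""
    eta = np.fft.fftfreq(n)[:, None]
    gamma = np.fft.fftfreq(m)[None, :]
    d2 = eta**2 + gamma**2          # squared distance from the frequency origin
    return 1.0 - np.exp(-d2 / (2.0 * d0**2))
```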
## Related Work & Insights
- The frequency-domain analysis paradigm is transferable to other parameter-space operations, such as frequency-domain pruning in knowledge distillation and model compression.
- The lightweight expert extraction approach can be combined with PEFT methods such as LoRA and Adapter for more efficient multi-task deployment.
- The insight that "low frequency encodes task-specific information while high frequency encodes generalization capacity" sheds light on the nature of parameter changes during fine-tuning.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to introduce frequency-domain analysis into model merging; the discovery of low-frequency interference is highly pioneering.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers vision/language/multimodal tasks, full fine-tuning and PEFT, and merging settings with 8 and 30 tasks.
- Writing Quality: ⭐⭐⭐⭐ — Logically clear with fair and systematic experimental comparisons and intuitive figures.
- Value: ⭐⭐⭐⭐⭐ — Highly practical; the training-free approach lowers the barrier to model merging and has significant implications for large model deployment.