ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation¶
Conference: CVPR 2026 arXiv: 2603.02945 Code: N/A Area: LLM Evaluation Keywords: Model Merging, Data-Free, Covariance Estimation, Spectral Refinement, Closed-Form Solution
TL;DR¶
This paper theoretically proves that fine-tuning weight deltas encode input covariance information, and proposes ACE-Merging, which achieves data-free closed-form model merging through three steps: adaptive covariance estimation, collective structural prior, and spectral refinement. ACE-Merging achieves an average improvement of 4% over prior methods on GPT-2 and 5% on RoBERTa-Base.
Background & Motivation¶
Background: The pretrain-then-finetune paradigm produces a large number of task-specific models. Model merging aims to combine multiple expert models into a single unified model, avoiding costly multi-task retraining. Existing approaches fall into three categories: data-dependent (requiring original data), test-time adaptive (with high inference overhead), and data-free (most flexible).
Limitations of Prior Work: Data-free methods are the most practically valuable, yet approaches ranging from Task Arithmetic to TIES-Merging rely on heuristic operations in parameter space (sign alignment, pruning, etc.), addressing only the symptoms of interference rather than the root cause — differences in the statistical structure of task data distributions.
Key Challenge: The optimal merging formula \(\bar{W} = (\sum_t W_t \Sigma_t)(\sum_t \Sigma_t)^{-1}\) requires the input covariance \(\Sigma_t\) for each task, which is precisely unavailable in the data-free setting.
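For reference, this closed form follows from a layer-wise least-squares argument (our sketch, under the paper's linear view of a layer, with \(\Sigma_t = \mathbb{E}_{x\sim\mathcal{D}_t}[xx^\top]\)): minimizing the merged layer's output discrepancy against each expert on that expert's inputs,

\[
\mathcal{L}(\bar{W}) \;=\; \sum_t \mathbb{E}_{x\sim\mathcal{D}_t}\big\|\bar{W}x - W_t x\big\|^2 \;=\; \sum_t \operatorname{Tr}\!\big[(\bar{W}-W_t)\,\Sigma_t\,(\bar{W}-W_t)^\top\big],
\]

and setting the gradient \(2\sum_t (\bar{W}-W_t)\Sigma_t = 0\) gives \(\bar{W}\sum_t \Sigma_t = \sum_t W_t \Sigma_t\), i.e. the formula above.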
Goal: To accurately estimate the input covariance for each task without any data access, thereby enabling theoretically grounded optimal model merging.
Key Insight: The authors observe that the weight deltas \(\Delta W_t\) produced by fine-tuning implicitly encode input covariance information — treating the rows of \(\Delta W_t\) as independent samples, their empirical covariance is proportional to \(\Sigma_t\).
Core Idea: Fine-tuning weight deltas themselves encode input covariance, enabling estimation and construction of a theoretically optimal closed-form merging solution without any data access.
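A minimal numpy sketch of this estimate (function and variable names are ours; the centering follows the formula given under Key Designs below):

```python
import numpy as np

def estimate_sigma(delta_w: np.ndarray) -> np.ndarray:
    """Data-free input-covariance estimate from a fine-tuning delta of shape [d_out, d_in].

    Rows of delta_w are treated as samples in R^{d_in}; the centered Gram matrix
    is proportional to the per-task covariance estimate Sigma_hat_t.
    """
    mu = delta_w.mean(axis=0, keepdims=True)  # row mean mu_t, shape [1, d_in]
    centered = delta_w - mu                   # Delta W_t - 1 mu_t^T
    return centered.T @ centered              # proportional to Sigma_hat_t, [d_in, d_in]
```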
Method¶
Overall Architecture¶
Input: Pretrained model weights \(W_0\) and multiple fine-tuned expert weights \(\{W_t\}\). Output: Merged model \(\bar{W}\). The method operates independently per layer in three stages: (1) adaptive covariance normalization, (2) collective structural prior construction, and (3) spectral refinement. The final solution is closed-form and requires no iterative optimization; an end-to-end sketch of the per-layer pipeline follows the Key Designs list below.
Key Designs¶
- Estimating Input Covariance from Weight Deltas (Theoretical Core):
- Function: Obtain per-task input covariance estimates without data access.
- Mechanism: Theorem 1 proves that \(\Sigma_t \propto \text{Cov}_{\mathcal{D}_t}[\Delta W_t]\). Since fine-tuning updates are small, the gradient can be linearized at \(W_0\): \(\Delta W_t \approx -2\eta N_t \mathbb{E}[(W_0 x - y)x^\top]\), so weight deltas implicitly encode the second-order moment of inputs. In practice, the rows of \(\Delta W_t\) are treated as independent samples to compute the empirical covariance \(\hat{\Sigma}_t \propto (\Delta W_t - \mathbf{1}\mu_t^\top)^\top (\Delta W_t - \mathbf{1}\mu_t^\top)\).
- Design Motivation: This is the theoretical foundation of ACE-Merging, transforming data-free merging into an optimization problem with an explicit objective. Prior work WUDI-Merging implicitly used a similar proxy \(\hat{\Sigma}_t \propto \|\Delta W_t\|_F^{-2} (\Delta W_t)^\top \Delta W_t\), but relied on iterative gradient descent, which is unstable.
- Adaptive Covariance Normalization:
- Function: Balance energy-scale discrepancies across tasks.
- Mechanism: A heterogeneity measure \(\gamma = \frac{\text{Var}_t[\log\|\Delta W_t\|_F^2]}{(\mathbb{E}_t[\log\|\Delta W_t\|_F^2])^2}\) detects inter-task scale differences. When \(\gamma > \tau\), trace normalization is applied to the covariance: \(\hat{\Sigma}_{t,\text{scaled}} = \hat{\Sigma}_t / \text{Tr}(\hat{\Sigma}_t)\), followed by adaptive Tikhonov regularization: \(\hat{\Sigma}_{t,\text{reg}} = \hat{\Sigma}_{t,\text{scaled}} + \frac{\epsilon}{\text{Tr}(\hat{\Sigma}_t)} I\).
- Design Motivation: Experiments show that task heterogeneity in RoBERTa (\(\gamma > 0.3\)) is much higher than in ViT (\(\gamma < 0.25\)); without normalization, high-energy tasks dominate the merged result. The \(\gamma\) threshold serves as a gate to avoid unnecessary normalization for homogeneous tasks.
- Collective Structural Prior (CSP):
- Function: Introduce anisotropic regularization that captures shared feature geometry across tasks.
- Mechanism: \(\mathbf{C}_{\text{agg}} = \mathbf{1} \cdot (\frac{1}{d_{\text{in}}} \mathbf{1}^\top \sum_t \hat{\Sigma}_{t,\text{scaled}})\) broadcasts the column-wise mean of all tasks' scaled covariances to each row, forming a low-rank consensus prior. The final closed-form solution is \(\bar{W}_{\text{pre}} = (\sum_t W_t \hat{\Sigma}_{t,\text{reg}})(\sum_t \hat{\Sigma}_{t,\text{reg}} + \mathbf{C}_{\text{agg}})^{-1}\).
- Design Motivation: Standard \(\epsilon I\) regularization is isotropic and treats all feature dimensions equally, ignoring the intrinsic geometry of the input space. CSP uses cross-task consensus energy distributions as weights to selectively reinforce shared important dimensions.
- Spectral Refinement:
- Function: Correct severe spectral ill-conditioning in the closed-form solution.
- Mechanism: A structural residual \(\Delta_{\text{res}} = \sum_t W_t (\hat{\Sigma}_{t,\text{scaled}} - \bar{\Sigma})\) is computed and fused with \(\bar{W}_{\text{pre}}\) before applying SVD. The top-\(k\) singular directions are reweighted using their mean singular value \(\sigma_{\text{iso}}\): \(\Delta W_{\text{refine}} = \sigma_{\text{iso}} \mathbf{U}_{:,1:k} \mathbf{V}_{:,1:k}^\top\), and the final result is \(\bar{W} = \bar{W}_{\text{pre}} + \Delta W_{\text{refine}}\).
- Design Motivation: Experiments reveal that the top 5% singular values of \(\bar{W}_{\text{pre}}\) account for over 99% of the energy, with condition number exceeding \(8.7 \times 10^5\), yet the principal directions are correct (cosine similarity ≈ 1 with the final solution). Hence, it suffices to preserve the directions and redistribute the energy.
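Putting the four designs together, here is an end-to-end sketch of the per-layer pipeline in numpy. It is our reading of the paper, not a reference implementation: in particular, the fallback when \(\gamma \le \tau\), taking \(\bar{\Sigma}\) as the mean scaled covariance, and fusing the residual by addition are assumptions.

```python
import numpy as np

def ace_merge_layer(w0, experts, tau=0.3, k_frac=0.3, eps=1e-5):
    """Merge one linear layer. w0 and experts[i] are arrays of shape [d_out, d_in]."""
    d_in = w0.shape[1]
    deltas = [w - w0 for w in experts]

    # (1) Covariance estimates from the deltas (rows treated as samples).
    sigmas = []
    for d in deltas:
        c = d - d.mean(axis=0, keepdims=True)          # Delta W_t - 1 mu_t^T
        sigmas.append(c.T @ c)                         # Sigma_hat_t, [d_in, d_in]

    # (2) Heterogeneity measure: gamma = Var_t[log ||dW||_F^2] / (E_t[log ||dW||_F^2])^2.
    log_energy = np.array([np.log(np.linalg.norm(d, "fro") ** 2) for d in deltas])
    gamma = log_energy.var() / (log_energy.mean() ** 2)

    # (3) Adaptive normalization, applied only when tasks are heterogeneous.
    if gamma > tau:
        scaled = [s / np.trace(s) for s in sigmas]
        reg = [sc + (eps / np.trace(s)) * np.eye(d_in) for s, sc in zip(sigmas, scaled)]
    else:  # homogeneous tasks: plain Tikhonov regularization (our assumption)
        scaled = sigmas
        reg = [s + eps * np.eye(d_in) for s in sigmas]

    # (4) Collective structural prior: broadcast the column mean of the summed
    #     scaled covariances to every row (a rank-1 consensus matrix).
    s_sum = sum(scaled)
    c_agg = np.ones((d_in, 1)) @ s_sum.mean(axis=0, keepdims=True)

    # (5) Closed-form pre-solution: W_pre = (sum_t W_t S_reg,t)(sum_t S_reg,t + C_agg)^{-1}.
    a = sum(w @ r for w, r in zip(experts, reg))
    b = sum(reg) + c_agg
    w_pre = np.linalg.solve(b.T, a.T).T                # right-multiply by b^{-1}

    if gamma <= tau:                                   # refinement also gated off (per ablation)
        return w_pre

    # (6) Spectral refinement: fuse a structural residual, keep the top-k singular
    #     directions, and redistribute energy via their mean singular value.
    sigma_bar = s_sum / len(experts)                   # Sigma_bar = mean scaled covariance (assumption)
    res = sum(w @ (sc - sigma_bar) for w, sc in zip(experts, scaled))
    u, s, vt = np.linalg.svd(w_pre + res, full_matrices=False)
    k = max(1, int(k_frac * len(s)))
    sigma_iso = s[:k].mean()
    return w_pre + sigma_iso * (u[:, :k] @ vt[:k, :])
```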
Loss & Training¶
ACE-Merging is a purely closed-form method with no training or iterative optimization. Hyperparameters are fixed at \(\tau=0.3\), \(k_{\text{frac}}=0.3\), with \(\epsilon\) adjusted per model family (GPT-2: \(4\times 10^{-2}\), RoBERTa-Base: \(2\times 10^{-4}\), others: \(1\times 10^{-5}\)).
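A hypothetical driver showing how these settings plug into the sketch above (the dict-of-layers interface and all names here are ours, not the paper's):

```python
# Reported per-family epsilon values; all other model families default to 1e-5.
EPS_BY_FAMILY = {"gpt2": 4e-2, "roberta-base": 2e-4}

def merge_model(w0_layers, expert_layers, family):
    """w0_layers: {name: array}; expert_layers: list of such dicts, one per expert."""
    eps = EPS_BY_FAMILY.get(family, 1e-5)
    return {
        name: ace_merge_layer(w0, [e[name] for e in expert_layers],
                              tau=0.3, k_frac=0.3, eps=eps)
        for name, w0 in w0_layers.items()
    }
```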
Key Experimental Results¶
Main Results¶
Vision Tasks (ViT-B/16, Mean Absolute Accuracy %)
| # Tasks | Weight Avg | Task Arithmetic | CART | TSV-M | ACE-Merging | Gain vs. best baseline |
|---|---|---|---|---|---|---|
| 8 tasks | 72.2 | 75.4 | 88.3 | 89.0 | 90.6 | +1.6 |
| 14 tasks | 69.5 | 70.5 | 84.1 | 84.6 | 86.1 | +1.5 |
| 20 tasks | 65.3 | 65.8 | 80.5 | 80.6 | 82.1 | +1.5 |
Language Tasks (GPT-2, GLUE Avg %)
| Method | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | Avg |
|---|---|---|---|---|---|---|---|---|
| Task Arithmetic | 68.7 | 68.6 | 69.6 | 70.5 | 81.8 | 47.3 | 83.6 | 70.0 |
| TSV-M | 65.6 | 75.4 | 58.6 | 64.4 | 86.2 | 55.6 | 85.7 | 70.2 |
| ACE-Merging | 70.3 | 69.9 | 71.8 | 76.7 | 79.0 | 62.5 | 88.5 | 74.1 |
Ablation Study¶
| Configuration | RoBERTa-L | GPT-2 | ViT-B/16 (8 tasks) | Note |
|---|---|---|---|---|
| E1: Basic Closed-form | 80.05 | 68.72 | 89.91 | Closed-form only |
| E2: + Adaptive ε | 88.04 | 71.50 | 89.91 | Largest contribution from adaptive regularization |
| E3: + Aggregate Prior | 86.79 | 71.51 | 90.60 | Structural prior assists |
| E4: + Spectral Refinement | 91.68 | 74.09 | 90.60 | Final gain from spectral refinement |
Key Findings¶
- Adaptive regularization contributes the most — nearly 8 percentage points improvement from E1 to E2 on RoBERTa-L, indicating that balancing task heterogeneity is the core bottleneck.
- Due to its low task heterogeneity (\(\gamma < 0.3\)), ViT automatically bypasses the adaptive normalization and spectral refinement stages (E1≈E2, E3≈E4), supporting the design of the \(\gamma\)-gating mechanism.
- Sensitivity analysis shows stable performance for the gating threshold \(\tau \in [0.1, 0.3]\) and \(k_{\text{frac}} \in [0.1, 0.5]\); \(\epsilon\) is more sensitive.
- ACE-Merging (90.4%) substantially outperforms WUDI-Merging (85.3%) on RoBERTa-Base and maintains a ~3% advantage on RoBERTa-Large.
Highlights & Insights¶
- Theoretically Elegant Insight: The fundamental obstacle of data-free merging (lack of covariance) is transformed into a quantity directly estimable from weight deltas, establishing a formal "fine-tuning weight deltas ↔ input covariance" connection. This perspective not only explains why ACE-Merging works, but also unifies prior methods: Weight Averaging assumes \(\Sigma_t = kI\), while WUDI-Merging implicitly uses \(\|\Delta W_t\|_F^{-2} (\Delta W_t)^\top \Delta W_t\) as a proxy (see the short check after this list).
- Closed-Form vs. Iterative: WUDI-Merging requires iterative gradient descent, whereas ACE-Merging is a genuine closed-form solution, offering higher computational efficiency and stability.
- Insightful Spectral Diagnosis: The observation that the initial closed-form solution has correct directions but severely imbalanced energy distribution (top 5% singular values account for 99% of energy) leads to a targeted fix — preserving directions while redistributing energy. This "correct direction, wrong magnitude" diagnostic paradigm is transferable to other matrix optimization problems.
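As a quick check of the unification claim above (our algebra, not the paper's), plugging equal isotropic covariances \(\Sigma_t = c\,I\) into the optimal merge recovers plain weight averaging:

\[
\bar{W} \;=\; \Big(\sum_{t=1}^{T} W_t\, cI\Big)\Big(\sum_{t=1}^{T} cI\Big)^{-1} \;=\; \Big(c\sum_t W_t\Big)\,(Tc\,I)^{-1} \;=\; \frac{1}{T}\sum_{t=1}^{T} W_t.
\]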
Limitations & Future Work¶
- \(\epsilon\) must be manually tuned per model family (different values for GPT-2, RoBERTa, and ViT); the authors acknowledge that automatic estimation of \(\epsilon\) is a direction for future work.
- The linear approximation \(f(W,x) \approx Wx\) may be insufficiently precise for deep nonlinear networks, particularly for attention layers.
- The assumption of treating rows of \(\Delta W_t\) as independent samples may not hold in practice, as rows of weight matrices exhibit structural correlations.
- Validation is limited to GLUE and visual classification tasks; generative tasks (e.g., LLM dialogue, code generation) are not evaluated.
- The layer-wise independent merging procedure ignores cross-layer dependencies.
Related Work & Insights¶
- vs. WUDI-Merging: This work reinterprets WUDI as a special case of ACE within a unified theoretical framework (norm-weighted covariance proxy), while ACE replaces WUDI's iterative optimization with a closed-form solution for greater stability and efficiency.
- vs. TSV-M: TSV-M decomposes shared/task-specific subspaces via SVD in a heuristic manner; ACE directly models the covariance, providing a more rigorous theoretical foundation.
- vs. RegMean: RegMean is a data-dependent method that uses ground-truth covariances directly; ACE demonstrates that comparable performance can be achieved by estimating covariance without data access.
- The paradigm of covariance estimation combined with spectral correction is generalizable to model aggregation problems in federated learning.
Rating¶
- Novelty: ⭐⭐⭐⭐ The theoretical contribution is elegant, though the core insight (weight delta ∝ covariance) is not entirely surprising.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Vision and language benchmarks, multiple architectures and scales, complete ablation and sensitivity analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logical progression from theory → unified framework → method → experiments.
- Value: ⭐⭐⭐⭐ A practical tool for data-free merging, though the need to manually tune \(\epsilon\) reduces out-of-the-box usability.