MHA2MLA: Towards Economical Inference by Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Conference: ACL 2025
arXiv: 2502.14837
Code: https://github.com/JT-Ushio/MHA2MLA
Area: LLM/NLP
Keywords: Multi-Head Latent Attention, KV cache compression, partial RoPE, SVD, reasoning efficiency
TL;DR
MHA2MLA proposes the first method for efficiently migrating pre-trained MHA models to DeepSeek's MLA architecture. Using contribution-aware partial-RoPE removal and joint SVD low-rank approximation, it restores performance with only 0.6%-1% of the training data and compresses the KV cache of Llama2-7B by 92.19% with only a 1% drop in LongBench performance.
Background & Motivation
Background: Multi-Head Latent Attention (MLA) introduced by DeepSeek significantly reduces inference memory by compressing the KV cache into low-rank latent vectors, but it requires pre-training from scratch. Existing MHA/GQA models (e.g., Llama, Qwen) cannot directly benefit from it.
Limitations of Prior Work:
- MHA stores a full-rank KV cache with a footprint of \(O(2ln_hd_h)\) per layer, where \(l\) is the sequence length, \(n_h\) the number of heads, and \(d_h\) the head dimension; it grows linearly with the sequence length.
- Although GQA reduces the KV cache, it also reduces the parameter count, leading to degraded performance.
- MLA must be trained from scratch, which is computationally expensive and leaves existing models unable to migrate.
Key Challenge: The inference efficiency of MLA is only accessible to models trained from scratch, preventing the reuse of substantial investments already made in existing MHA models.
Goal: To migrate pre-trained MHA models to the MLA architecture at an extremely low data cost.
Key Insight: The primary structural differences in MLA are partial-RoPE (the separation of positional and non-positional dimensions) and KV low-rank compression. These differences are bridged using contribution analysis and SVD initialization, respectively.
Core Idea: Remove the lowest-contributing RoPE dimensions from MHA → Perform joint SVD compression on non-RoPE dimensions → Conduct minor fine-tuning to recover performance.
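To make the MHA footprint discussed in this section concrete, the following minimal sketch evaluates the \(2ln_hd_h\) count per layer and scales it by the number of layers and the bytes per cached value. The function name `kv_cache_bytes` is hypothetical, and the shape used (32 layers, 32 heads, head dimension 128, 16-bit values) is an assumed Llama2-7B-like configuration, not something stated in these notes.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence under standard MHA."""
    # Factor 2: one key vector plus one value vector per head, per layer, per token.
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_value


# Assumed Llama2-7B-like shape (an illustration, not taken from the paper).
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"MHA cache at 4096 tokens: {full / 2**30:.2f} GiB")  # 2.00 GiB

# Apply the 92.19% reduction quoted in the TL;DR to the same sequence.
remaining = full * (1 - 0.9219)
print(f"After a 92.19% reduction: {remaining / 2**20:.1f} MiB")
```

Because the footprint is linear in `seq_len`, doubling the context doubles both numbers; the compression rate itself does not depend on the sequence length.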
Method
Overall Architecture
MHA2MLA consists of two steps: (1) Partial-RoPE: Based on a 2-norm contribution analysis, the RoPE frequency subspaces that contribute least to the attention scores in each attention head have their rotation removed and become NoPE (No Position Embedding) dimensions; (2) Low-rank Approximation: A joint SVD is performed on the Key (NoPE portion) and Value projection matrices to initialize the down-projection and up-projection matrices of MLA. A short fine-tuning run on 0.6%-1% of the training data then recovers performance.
Key Designs
Contribution-aware Partial-RoPE Removal:
- Function: Selectively remove RoPE from frequency subspaces of the queries and keys, converting them to NoPE.
- Mechanism: RoPE splits each query and key head into \(d_h/2\) two-dimensional subspaces, each rotated at a different frequency. The contribution of subspace \(k\) to the attention score is estimated via the Cauchy-Schwarz upper bound \(\|q^{[2k,2k+1]}\| \cdot \|k^{[2k,2k+1]}\|\), and RoPE is retained only in the top-\(r\) highest-contributing subspaces.
- Design Motivation: MLA must separate RoPE and non-RoPE dimensions, because the RoPE part cannot be folded into the projection matrices and must be computed separately. Contribution-based selection (rather than fixed high-frequency, low-frequency, or uniform strategies) yields the best experimental results because different attention heads attend to different frequencies.
- Difference from Prior Work: Prior work (such as GPT-Neo and Barbero et al.) explored training partial-RoPE models from scratch, while this work is the first to study fine-tuning from full-RoPE to partial-RoPE.
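The selection rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_rope_subspaces` is hypothetical, the adjacent-pair layout follows the \([2k,2k+1]\) indexing used in these notes, and aggregating over tokens by multiplying the mean query norm with the mean key norm is an assumption. Since a 2D rotation preserves norms, the scores are the same before and after RoPE is applied.

```python
import numpy as np


def select_rope_subspaces(q, k, r):
    """Per head, pick the r two-dimensional subspaces with the largest
    (mean ||q_pair||) * (mean ||k_pair||) score.

    q, k: arrays of shape (n_tokens, n_heads, d_h) with sampled queries/keys.
    Returns an int array of shape (n_heads, r) of sorted subspace indices.
    """
    n_tokens, n_heads, d_h = q.shape
    # Group adjacent dimensions (2k, 2k+1) into d_h / 2 subspaces.
    q_pairs = q.reshape(n_tokens, n_heads, d_h // 2, 2)
    k_pairs = k.reshape(n_tokens, n_heads, d_h // 2, 2)
    # Mean 2-norm of each subspace over the sampled tokens, shape (n_heads, d_h / 2).
    q_norm = np.linalg.norm(q_pairs, axis=-1).mean(axis=0)
    k_norm = np.linalg.norm(k_pairs, axis=-1).mean(axis=0)
    score = q_norm * k_norm
    # Indices of the r highest-scoring subspaces for each head.
    top = np.argsort(-score, axis=-1)[:, :r]
    return np.sort(top, axis=-1)


# Toy data: 2 heads, head dimension 8, so 4 subspaces per head.
rng = np.random.default_rng(0)
q = rng.normal(size=(64, 2, 8))
k = rng.normal(size=(64, 2, 8))
# Make subspace 3 dominant in head 0 and subspace 0 dominant in head 1.
q[:, 0, 6:8] *= 10
k[:, 0, 6:8] *= 10
q[:, 1, 0:2] *= 10
k[:, 1, 0:2] *= 10

chosen = select_rope_subspaces(q, k, r=1)
print(chosen.tolist())  # [[3], [0]]
```

The two heads keep different subspaces, which is the head-wise diversity that a fixed high-frequency, low-frequency, or uniform rule cannot capture.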
Joint SVD Low-rank Approximation (SVDjoint):
- Function: Jointly compress the Key (NoPE portion) and Value projection matrices that remain after RoPE removal, so that keys and values are produced from a shared low-rank latent vector.
- Mechanism: Concatenate \(W_{k,nope}\) and \(W_v\) of all heads into \([W_{k,nope}; W_v] \in \mathbb{R}^{d \times n_h(d_{nope} + d_h)}\), where \(d_{nope} = d_h - 2r\) is the number of NoPE key dimensions per head, and perform truncated SVD to obtain \(U\Sigma V^\top\). Take \(W_{dkv} = U_{\text{trunc}}\) (the leading left singular vectors) as the down-projection matrix, yielding the latent vector \(c_{kv}\), and split the truncated \(\Sigma V^\top\) into the up-projection matrices \(W_{uk}\) and \(W_{uv}\) for each head.
- Design Motivation: Separately applying SVD to K and V (SVDsplit) wastes latent space dimensions. Joint SVD allows K and V to share the latent space, achieving more efficient information utilization.
- Key Advantage: The SVD factors \(U, \Sigma, V\) are derived directly from the pre-trained parameters, preserving existing knowledge as far as possible and keeping the fine-tuning data requirements minimal.
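The joint initialization can be sketched with `numpy.linalg.svd`. This is a minimal example on random matrices, assuming row-vector inputs (`x @ W`) and treating all heads as already concatenated along the column axis; the function names `svd_joint` and `svd_split` and all sizes are hypothetical. It also compares against a split baseline that spends the same total latent size, half on K and half on V.

```python
import numpy as np


def svd_joint(w_k_nope, w_v, rank):
    """Shared down-projection plus separate up-projections from one truncated SVD.

    w_k_nope: (d, a) key projection restricted to NoPE output dimensions.
    w_v:      (d, b) value projection.
    Returns w_dkv (d, rank), w_uk (rank, a), w_uv (rank, b).
    """
    a = w_k_nope.shape[1]
    joint = np.concatenate([w_k_nope, w_v], axis=1)  # (d, a + b)
    u, s, vt = np.linalg.svd(joint, full_matrices=False)
    w_dkv = u[:, :rank]                              # down-projection
    up = s[:rank, None] * vt[:rank]                  # truncated Sigma V^T
    return w_dkv, up[:, :a], up[:, a:]


def svd_split(w_k_nope, w_v, rank):
    """Baseline: approximate K and V separately with rank // 2 latent dims each."""
    def low_rank(w, r):
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        return u[:, :r] @ (s[:r, None] * vt[:r])
    return low_rank(w_k_nope, rank // 2), low_rank(w_v, rank // 2)


rng = np.random.default_rng(0)
d, a, b, rank = 64, 24, 32, 16
w_k_nope = rng.normal(size=(d, a))
w_v = rng.normal(size=(d, b))

w_dkv, w_uk, w_uv = svd_joint(w_k_nope, w_v, rank)
x = rng.normal(size=(5, d))   # five token hidden states
c_kv = x @ w_dkv              # the only tensor that would be cached, shape (5, rank)
k_nope = c_kv @ w_uk          # approximate NoPE keys, shape (5, a)
v = c_kv @ w_uv               # approximate values, shape (5, b)

# Frobenius reconstruction error of the concatenated weights for both schemes.
joint_err = np.linalg.norm(
    np.concatenate([w_k_nope - w_dkv @ w_uk, w_v - w_dkv @ w_uv], axis=1))
k_hat, v_hat = svd_split(w_k_nope, w_v, rank)
split_err = np.linalg.norm(
    np.concatenate([w_k_nope - k_hat, w_v - v_hat], axis=1))
print(f"joint error {joint_err:.3f} <= split error {split_err:.3f}")
```

The inequality holds for any inputs: the split reconstruction is itself a matrix of rank at most `rank`, and the truncated joint SVD is the best approximation of that rank in Frobenius norm. This only measures weight reconstruction; downstream quality after fine-tuning is an empirical question that the ablations address.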
Compatibility with KV Cache Quantization:
- The output of MHA2MLA is a latent vector \(c_{kv}\) in standard MLA format, allowing seamless integration with KV cache quantization (e.g., 4-bit/2-bit quantization).
- The combination of latent vector compression and quantization can achieve up to 96.87% KV cache compression.
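Latent compression and quantization act on different axes (number of cached values versus bits per value), so their savings multiply. The sketch below shows the arithmetic; `combined_compression` is a hypothetical helper, the 16-bit baseline is an assumption, and the latent fraction of 12.5% is chosen only because it reproduces the 96.87% figure with 4-bit quantization. Whether it matches the paper's exact configuration is not confirmed by these notes.

```python
def combined_compression(latent_fraction, quant_bits, base_bits=16):
    """Fraction of the original KV cache removed when a low-rank latent cache
    is additionally quantized.

    latent_fraction: size of the latent cache relative to the full KV cache
                     at the same precision (hypothetical input).
    quant_bits:      bits per cached value after quantization.
    base_bits:       bits per cached value in the uncompressed baseline.
    """
    remaining = latent_fraction * quant_bits / base_bits
    return 1.0 - remaining


# Quantization alone: 16-bit -> 4-bit removes three quarters of the cache.
print(f"{combined_compression(1.0, 4):.2%}")     # 75.00%
# Latent cache at 12.5% of the original size, then 4-bit quantization.
print(f"{combined_compression(0.125, 4):.4%}")   # 96.8750%
```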
Loss & Training
- Minimal fine-tuning data: 0.6%-1% of the pre-training data (e.g., for Llama2-7B, taking ~10B tokens as the reference corpus, only 60-100M tokens are required).
- Full-parameter fine-tuning is used, but convergence is extremely rapid because SVD initialization provides a highly effective starting point.
Key Experimental Results
Main Results
Llama2-7B KV Cache Compression:
| Method | KV Cache Compression Rate | LongBench Performance Drop |
|---|---|---|
| GQA (4 groups) | 75% | 3-5% |
| MHA2MLA (SVDjoint) | 92.19% | 1% |
| MHA2MLA + 4-bit Quantization | 96.87% | 2-3% |
Multi-Model Validation (PPL Recovery):
| Model | Parameters | Data Volume | PPL Recovery Ratio |
|---|---|---|---|
| GPT-2 | 135M | 0.6% | ~99% |
| Llama2 | 7B | 1% | ~98% |
| Llama2 | 13B | 1% | ~98% |
| Llama3 (GQA) | 8B | 1% | ~97% |
Ablation Study
Partial-RoPE Strategy Comparison (Llama2-7B, r = Number of Retained RoPE Subspaces):
| Strategy | PPL Recovery | Notes |
|---|---|---|
| High-frequency | Moderate | Retain high-frequency RoPE |
| Low-frequency | Poor | Retain low-frequency RoPE |
| Uniform | Moderate | Retain at even intervals |
| 2-norm contribution | Best | Select by contribution |
SVD Strategy Comparison:
| Strategy | PPL Recovery | KV Cache Size |
|---|---|---|
| SVDsplit (Separate SVD) | Moderate | Same |
| SVDjoint (Joint SVD) | Superior | Same |
Key Findings
- 2-norm contribution selection significantly outperforms other strategies: different attention heads focus on different frequency subspaces (validated visually in Figure 3), and head-wise selection captures this diversity.
- SVDjoint outperforms SVDsplit: joint KV compression achieves higher utilization of the latent space.
- Both MHA and GQA models can be migrated: Llama3-8B (GQA) is successfully converted as well.
- Retaining fewer RoPE dimensions yields higher compression rates but complicates performance recovery; \(r=d_h/8\) serves as the optimal trade-off point.
Highlights & Insights
- First MHA-to-MLA Migration Scheme: Enabling existing models to leverage the inference efficiency of MLA is of great significance, bypassing the astronomical costs associated with training MLA models from scratch.
- Ingenuity of SVD Initialization: Directly initializing the up/down-projection matrices of MLA using the SVD of pre-trained parameters reduces the requirement for fine-tuning data to 0.6%-1%. This "maximal parameter reuse" approach is highly elegant.
- 2-Norm Contribution Analysis: The study finds that the importance of different frequency subspaces varies substantially across different attention heads, making head-wise adaptive selection critical.
Limitations & Future Work
- Fine-Tuning Still Demands Computation: Although data requirements are low, full-parameter fine-tuning of 7B/13B models still requires considerable GPU execution time.
- Lacking Validation on Larger Models (70B+): The efficacy and fine-tuning costs for 70B-scale models have not been evaluated.
- Impact of partial-RoPE on Long Contexts: Reducing RoPE dimensions may limit position modeling capabilities for extremely long sequences, which has not yet been tested beyond 100K tokens.
- Manual Selection of RoPE Dimension Ratio: The value of \(r\) still needs to be determined experimentally; future work could explore automatic selection based on gradient signals.
Related Work & Insights
- vs DeepSeek MLA: MLA requires training from scratch, whereas MHA2MLA allows existing models to obtain the advantages of MLA at minimal cost.
- vs GQA/MQA: GQA and MQA reduce the parameter count, which degrades performance, while MHA2MLA reuses the pre-trained parameters via SVD to maintain performance during compression.
- vs KV Cache Quantization: Can be stacked with quantization to yield even more extreme compression.
- This migration framework can be extended to other architectural transformations (e.g., MHA-to-Linear Attention migration).
Rating
- Novelty: ⭐⭐⭐⭐⭐ The first MHA-to-MLA migration scheme; the combined design of partial-RoPE and joint SVD is highly ingenious.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple model scales (135M to 13B), multiple ablation settings, and combination with quantization.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear mathematical derivations, highly intuitive visualizations, and a complete logic chain from MHA to MLA.
- Value: ⭐⭐⭐⭐⭐ Exceptionally high practical value: existing MHA models can gain MLA inference efficiency at a data cost of at most 1%.