ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models¶
Conference: ICLR 2026 arXiv: 2505.14238 Code: https://github.com/CERT-Lab/abba Area: Model Compression / PEFT Keywords: Parameter-efficient fine-tuning, LoRA, Hadamard product, low-rank adaptation, Khatri-Rao decomposition
TL;DR¶
This paper proposes ABBA adapters, which parameterize weight updates as the Hadamard product of two independently learnable low-rank matrices, \(\Delta W = s(B_1A_1) \odot (B_2A_2)\). Under the same parameter budget, ABBA achieves an effective rank of \(r_1 \cdot r_2\) compared to LoRA's \(r\), representing a quadratic improvement. Through Khatri-Rao reconstruction, ABBA maintains memory efficiency comparable to LoRA, and significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning tasks.
Background & Motivation¶
Background: LoRA is the most widely adopted PEFT method, constraining weight updates to a rank-\(r\) subspace via \(\Delta W = BA\) (\(B \in \mathbb{R}^{m \times r}, A \in \mathbb{R}^{r \times n}\)).
Limitations of Prior Work: LoRA's updates are strictly confined to a rank-\(r\) subspace, inherently limiting expressiveness. HiRA introduces Hadamard products via \(\Delta W = W_0 \odot (BA)\) to increase effective rank, but couples updates to the frozen weight \(W_0\)—when the target update divided element-wise by \(W_0\) is not low-rank, HiRA offers no advantage.
Key Challenge: High expressiveness (high-rank updates) requires more parameters, yet the fundamental constraint of PEFT is a small parameter count. How can one break the rank barrier under the same parameter budget?
Goal: Substantially increase the expressiveness and effective rank of weight updates while maintaining LoRA-level parameter efficiency.
Key Insight: Set both factors in the Hadamard product as learnable low-rank matrices, fully decoupling updates from the pretrained weights. Employ Khatri-Rao decomposition to avoid instantiating full-size matrices.
Core Idea: The Hadamard product of two rank-\(r/2\) matrices can achieve an effective rank of \(r^2/4\), a quadratic improvement over LoRA's rank \(r\) under the same parameter count.
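The rank bound \(\text{rank}(W_1 \odot W_2) \leq \text{rank}(W_1)\cdot\text{rank}(W_2)\) is generically tight for random factors. A minimal NumPy check (layer sizes chosen arbitrarily for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 8   # hypothetical layer size and a LoRA-style rank budget
r1 = r2 = r // 2      # ABBA splits the same budget into two rank-r/2 factors

# Two independent low-rank factor pairs, same parameter count as rank-r LoRA.
B1, A1 = rng.standard_normal((m, r1)), rng.standard_normal((r1, n))
B2, A2 = rng.standard_normal((m, r2)), rng.standard_normal((r2, n))

lora_rank = np.linalg.matrix_rank(B1 @ A1)                # bounded by r1 = 4
abba_rank = np.linalg.matrix_rank((B1 @ A1) * (B2 @ A2))  # bounded by r1*r2 = 16

print(lora_rank, abba_rank)  # generically 4 and 16 for random factors
```

With generic (e.g. Gaussian) factors the Hadamard product reaches the full \(r_1 r_2\) upper bound, matching the quadratic-gain claim.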
Method¶
Overall Architecture¶
In each target layer, LoRA's \(\Delta W = BA\) is replaced by \(\Delta W = s(B_1A_1) \odot (B_2A_2)\). The four matrices \(A_1, B_1, A_2, B_2\) form the "ABBA" structure. For a fair comparison, the paper sets \(r_1 = r_2 = r/2\) so that the total parameter count matches LoRA at rank \(r\).
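The parameterization and the parameter-budget match can be sketched in NumPy (shapes only; `s` is a placeholder scalar here, not the paper's derived value):

```python
import numpy as np

m, n, r = 128, 128, 16
r1 = r2 = r // 2
s = 1.0  # placeholder scaling; the paper sets s based on the effective rank

rng = np.random.default_rng(0)
B1, A1 = rng.standard_normal((m, r1)), rng.standard_normal((r1, n))
B2, A2 = rng.standard_normal((m, r2)), rng.standard_normal((r2, n))

# ABBA update: Hadamard product of two learnable low-rank matrices.
delta_W = s * (B1 @ A1) * (B2 @ A2)

# Parameter budget matches rank-r LoRA: two rank-r/2 pairs cost
# 2 * (m + n) * (r/2) = (m + n) * r parameters in total.
abba_params = B1.size + A1.size + B2.size + A2.size
lora_params = (m + n) * r
assert abba_params == lora_params
```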
Key Designs¶
- Dual Low-Rank Parameterization via Hadamard Product:
- Function: Expresses the weight update as the Hadamard (element-wise) product of two independent low-rank matrices.
- Mechanism: Since \(\text{rank}(W_1 \odot W_2) \leq \text{rank}(W_1)\cdot\text{rank}(W_2)\), ABBA's effective rank is bounded by \(r_1 r_2 = r^2/4\), far exceeding LoRA's \(r\). Matrix reconstruction experiments confirm that ABBA consistently achieves lower reconstruction error than LoRA under the same parameter budget.
- Design Motivation: Unlike HiRA, both factors are fully learnable and not tied to \(W_0\), freeing the update capacity from the structural constraints of the pretrained weights.
- Khatri-Rao Efficient Implementation (Theorem 1):
- Function: Converts ABBA into a LoRA-compatible form via Khatri-Rao decomposition, avoiding the instantiation of full-size matrices.
- Mechanism: Define \(B_{\text{kr}} = B_1 \odot_r B_2 \in \mathbb{R}^{m \times r_1 r_2}\) (row-wise Khatri-Rao) and \(A_{\text{kr}} = (A_1^\top \odot_r A_2^\top)^\top\); then \(\Delta W x = s\,B_{\text{kr}}(A_{\text{kr}} x)\), with an intermediate activation of dimension only \(r_1 r_2\).
- Design Motivation: A naïve implementation would require constructing two \(m \times n\) matrices and computing their Hadamard product, incurring memory costs equivalent to full fine-tuning. Khatri-Rao reconstruction keeps both computation and storage at the low-rank level.
- SVD Initialization + Rank Stability:
- Function: Initializes \((B_1, A_1)\) via truncated SVD of \(W_0\); \((B_2, A_2)\) follows standard LoRA initialization.
- Mechanism: The Eckart–Young–Mirsky (EYM) theorem guarantees that truncated SVD is the optimal rank-\(r_1\) approximation. The scaling factor \(s\) must be adjusted with respect to the effective rank \(r_1 r_2\) (not \(r\)); rank stability is formally proved in the paper.
- Design Motivation: The hybrid initialization anchors the update direction to a meaningful low-rank subspace while preserving the second matrix pair's capacity for task-specific exploration.
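The Khatri-Rao identity of Theorem 1 can be verified numerically: the row-wise Khatri-Rao (face-splitting) product of \(B_1, B_2\) and the column-wise Khatri-Rao product of \(A_1, A_2\) reproduce the Hadamard update exactly. A NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r1, r2 = 32, 48, 4, 4
B1, A1 = rng.standard_normal((m, r1)), rng.standard_normal((r1, n))
B2, A2 = rng.standard_normal((m, r2)), rng.standard_normal((r2, n))

# Row-wise Khatri-Rao: row i of B_kr is kron(B1[i], B2[i]).
B_kr = np.einsum("ip,iq->ipq", B1, B2).reshape(m, r1 * r2)
# Column-wise Khatri-Rao: column j of A_kr is kron(A1[:, j], A2[:, j]).
A_kr = np.einsum("pj,qj->pqj", A1, A2).reshape(r1 * r2, n)

naive = (B1 @ A1) * (B2 @ A2)   # instantiates two full m x n matrices
efficient = B_kr @ A_kr         # LoRA-shaped factorization, rank r1*r2
assert np.allclose(naive, efficient)

# The forward pass stays LoRA-compatible: x -> A_kr x -> B_kr (A_kr x),
# with an intermediate activation of dimension r1*r2 instead of m*n memory.
x = rng.standard_normal(n)
assert np.allclose(naive @ x, B_kr @ (A_kr @ x))
```

Entry-wise, \((B_1A_1 \odot B_2A_2)_{ij} = \sum_{p,q} B_1[i,p]B_2[i,q]\,A_1[p,j]A_2[q,j]\), which is exactly the \((i,j)\) entry of \(B_{\text{kr}}A_{\text{kr}}\).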
Loss & Training¶
Standard fine-tuning loss is used. Training hyperparameters are identical to LoRA; only the adapter structure is replaced with ABBA. Code is publicly available.
Key Experimental Results¶
Main Results¶
Arithmetic Reasoning (GSM8K, MATH, etc.):
| Method | Parameters | GSM8K | MATH | Avg ↑ |
|---|---|---|---|---|
| LoRA (r=16) | Baseline | Baseline | Baseline | Baseline |
| DoRA | Same | Marginal gain | Marginal gain | Marginal gain |
| HiRA | Same | Better than LoRA | Better than LoRA | Better than LoRA |
| ABBA (r=8+8) | Same | Best by significant margin | Best by significant margin | Best by significant margin |
Commonsense Reasoning (Average across multiple datasets):
| Method | LLaMA-7B | LLaMA-3-8B | Notes |
|---|---|---|---|
| LoRA | Baseline | Baseline | |
| ABBA | +2–3 pp | +2–3 pp | Consistently best |
Ablation Study¶
| Configuration | Performance | Notes |
|---|---|---|
| \(r_1 = r_2 = r/2\) | Best | Equal split maximizes \(r_1 r_2\) |
| \(r_1 \neq r_2\) | Slightly worse | Asymmetric allocation is suboptimal |
| Random init for \((B_1, A_1)\) | Worse | SVD initialization is critical |
| No scaling factor | Training unstable | Rank stability requires appropriate scaling |
Key Findings¶
- Matrix reconstruction experiments confirm that ABBA consistently outperforms same-parameter LoRA across various matrix types, validating its higher expressiveness.
- ABBA converges faster in practice than LoRA and HiRA (visualized via an MNIST toy experiment).
- Khatri-Rao reconstruction gives ABBA a memory footprint even smaller than HiRA, which must store the full \(W_0\).
- Rank-stability analysis shows the appropriate scaling is governed by the effective rank, \(s \propto 1/\sqrt{r_1 r_2}\), generalizing rsLoRA's \(1/\sqrt{r}\) scaling (which corrects LoRA's \(1/r\)) from the nominal rank \(r\) to the effective rank \(r_1 r_2\).
Highlights & Insights¶
- Quadratic rank gain at constant parameter count: splitting the budget into two rank-\(r/2\) factors yields an effective rank of \((r/2)\cdot(r/2) = r^2/4\). This is the central contribution: an \(r/4\times\) gain in expressiveness within the same budget.
- Engineering elegance of Khatri-Rao: While Hadamard products cannot be directly "distributed" into matrix-vector multiplication, the Khatri-Rao decomposition elegantly avoids full matrix instantiation—a key technical contribution that makes ABBA practically viable.
- Fundamental distinction from HiRA: HiRA fixes one factor as \(W_0\) (free but non-learnable); ABBA makes both factors learnable but low-rank (incurring a parameter cost but offering greater flexibility). This raises an interesting trade-off between exploiting pretrained weight structure versus unconstrained learning.
Limitations & Future Work¶
- The intermediate activation dimension in Khatri-Rao reconstruction is \(r_1 r_2\) (vs. LoRA's \(r\)), incurring additional FLOPs.
- ABBA does not admit a closed-form optimal solution (the EYM theorem does not apply directly), so optimization relies on gradient descent.
- Initialization requires a truncated SVD of \(W_0\) per layer, imposing a one-time upfront cost.
- Validation is limited to LLMs; applicability to vision and multimodal models remains unexplored.
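The FLOP overhead in the first limitation is easy to make concrete: in the Khatri-Rao form, the adapter's per-token multiply count scales with \(r_1 r_2\) rather than \(r\). A back-of-envelope calculation with illustrative sizes (not taken from the paper):

```python
# Adapter multiply-adds per token for one m x n layer, y = B(Ax) form.
# Illustrative sizes; the ratio depends only on the ranks, not on m, n.
m, n, r = 4096, 4096, 16
r1 = r2 = r // 2

lora_flops = 2 * r * (m + n)          # intermediate dimension r
abba_flops = 2 * (r1 * r2) * (m + n)  # intermediate dimension r1*r2

print(abba_flops / lora_flops)  # -> 4.0: the rank gain is not free in compute
```

The ratio is \((r_1 r_2)/r = r/4\), so the adapter's FLOPs grow linearly in \(r\) relative to LoRA, though both remain small next to the frozen \(m \times n\) matmul.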
Related Work & Insights¶
- vs. LoRA: ABBA raises the effective rank from \(r\) to \(r^2/4\) under the same parameter count, a fundamental gain in expressiveness at the cost of slightly more complex initialization and implementation.
- vs. HiRA: HiRA couples updates to the pretrained weights by fixing one Hadamard factor as \(W_0\); ABBA is fully learnable and thus more general.
- vs. DoRA: DoRA decouples direction and magnitude but the update remains low-rank; ABBA breaks the rank barrier via the Hadamard product.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of dual low-rank Hadamard parameterization and Khatri-Rao efficient implementation is elegant, and the quadratic rank improvement insight is profound.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four models, arithmetic and commonsense reasoning, matrix reconstruction experiments, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative flows smoothly from motivation to theory to experiments, with clear figures and tables.
- Value: ⭐⭐⭐⭐⭐ As a direct improvement over LoRA, ABBA is simple, practical, and yields significant gains, with open-source code.