DiaBlo: Diagonal Blocks Are Sufficient For Finetuning¶
Conference: ICLR2026 arXiv: 2506.03230 Code: ziyangjoy/DiaBlo Area: Code Intelligence Keywords: PEFT, diagonal blocks, LoRA alternative, LLM fine-tuning, parameter efficiency
TL;DR¶
This paper proposes DiaBlo—a parameter-efficient fine-tuning method that replaces low-rank decomposition with diagonal block updates. The weight matrix is partitioned into \(N \times N\) blocks, and only the diagonal blocks \(\mathbf{D}_1, \ldots, \mathbf{D}_N\) are trained. This approach entirely bypasses the non-convex optimization, initialization sensitivity, and gradient instability introduced by the \(\mathbf{AB}\) product in LoRA. Zero initialization suffices for convergence, and the method requires only a single torch.einsum batched matmul in PyTorch. Theoretical analysis proves that DiaBlo is strictly more expressive than LoRA under the same parameter budget. DiaBlo achieves state-of-the-art results across commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, as well as 4-bit/2-bit quantization settings.
Background & Motivation¶
Background: LoRA and its variants (DoRA, PiSSA, MiLoRA, LoRA-GA) constitute the dominant family of PEFT methods. They inject trainable low-rank matrix products \(\Delta\mathbf{W} = \mathbf{AB}\) alongside frozen pretrained weights, substantially reducing the number of trainable parameters. Earlier approaches such as Prompt Tuning and Prefix Tuning are lightweight but lack expressiveness; Adapter-based methods require architectural modifications and introduce inference latency.
Limitations of Prior Work:
- Non-convex optimization: The \(\mathbf{AB}\) product in LoRA renders the objective non-convex with respect to \(\mathbf{A}\) and \(\mathbf{B}\). The gradients \(\mathbf{g}_{\mathbf{A}} = \mathbf{g}_{\mathbf{W}} \mathbf{B}^\top\) and \(\mathbf{g}_{\mathbf{B}} = \mathbf{A}^\top \mathbf{g}_{\mathbf{W}}\) are mutually dependent, causing extreme sensitivity to initialization and unstable convergence.
- Proliferating variant complexity: DoRA decouples magnitude and direction; PiSSA initializes with large singular values; MiLoRA uses small singular values; LoRA-GA aligns the first-step gradient. These variants are essentially patches for the matrix-product structure, increasing both algorithmic and engineering complexity.
- Hardware-unfriendly sparse methods: Fine-tuning based on unstructured sparsity (random masking or importance-based selection) avoids low-rank decomposition but induces irregular memory access patterns and poor GPU utilization.
Core Insight: The gradient of a diagonal block, \(\mathbf{g}_{\mathbf{D}_i} = \mathbf{X}_i^\top \mathbf{g}_{\mathbf{Y}_i}\), equals exactly the gradient of the corresponding sub-block \(\mathbf{W}_{ii}\) in full fine-tuning—without passing through any intermediate matrix-product variable. Consequently, zero initialization does not cause gradient vanishing, and the optimization landscape is far simpler than that of low-rank parameterization.
Core Idea: Rather than performing low-rank decomposition, DiaBlo directly updates \(N\) diagonal blocks \(\mathbf{D}_i \in \mathbb{R}^{d_1 \times d_2}\) of the weight matrix, implemented efficiently via batched matrix multiplication.
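To make the contrast concrete, here is a minimal sketch (our illustration, not the paper's code) of the first-step gradients: with the customary zero initialization of one LoRA factor, the other factor receives a zero gradient, whereas a zero-initialized DiaBlo block receives exactly the corresponding diagonal block of the full fine-tuning gradient. All shapes and names are arbitrary.

```python
# Hypothetical toy setup: compare first-step gradients of LoRA and DiaBlo.
import torch

torch.manual_seed(0)
b, m1, m2, r, N = 8, 64, 64, 8, 4          # batch, weight dims, LoRA rank, block count
d1, d2 = m1 // N, m2 // N

X = torch.randn(b, m1)
W = torch.randn(m1, m2)                     # frozen pretrained weight
target = torch.randn(b, m2)

# LoRA: dW = A @ B with B zero-initialized (standard practice)
A = torch.randn(m1, r, requires_grad=True)
B = torch.zeros(r, m2, requires_grad=True)
((X @ (W + A @ B) - target) ** 2).mean().backward()
print(A.grad.abs().max())                   # 0: gradient of A vanishes at step 0

# DiaBlo: zero-initialized diagonal blocks, applied via a batched matmul
D = torch.zeros(N, d1, d2, requires_grad=True)
Y = X @ W + torch.einsum("bni,nio->bno", X.view(b, N, d1), D).reshape(b, m2)
((Y - target) ** 2).mean().backward()

# g_{D_i} equals the (i, i) block of the full fine-tuning gradient X^T g_Y
W_full = W.clone().requires_grad_(True)
((X @ W_full - target) ** 2).mean().backward()
i = 2
block = W_full.grad[i * d1:(i + 1) * d1, i * d2:(i + 1) * d2]
print(torch.allclose(D.grad[i], block, atol=1e-5))   # True
```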
Method¶
Overall Architecture¶
For a linear layer \(\mathbf{Y} = \mathbf{X}\mathbf{W}\), DiaBlo partitions the weight \(\mathbf{W} \in \mathbb{R}^{m_1 \times m_2}\) into an \(N \times N\) block matrix and introduces a block-diagonal adapter:

\[
\mathbf{Y} = \mathbf{X}\left(\mathbf{W} + \mathbf{D}\right), \qquad
\mathbf{D} = \begin{bmatrix}
\mathbf{D}_1 & & \\
 & \ddots & \\
 & & \mathbf{D}_N
\end{bmatrix}.
\]
Each diagonal block satisfies \(\mathbf{D}_i \in \mathbb{R}^{d_1 \times d_2}\) (where \(d_1 = m_1/N\), \(d_2 = m_2/N\)) and is stored as a tensor \(\mathcal{D} \in \mathbb{R}^{N \times d_1 \times d_2}\). All off-diagonal blocks are frozen; only the \(N\) diagonal blocks are trained.
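A minimal PyTorch sketch of such a layer (our illustration; names and structure are assumptions, not the official ziyangjoy/DiaBlo implementation): the base weight is frozen, and the adapter is the single tensor \(\mathcal{D}\), zero-initialized so the layer initially reproduces the pretrained output.

```python
import torch
import torch.nn as nn

class DiaBloLinear(nn.Module):
    """Frozen linear layer plus a trainable block-diagonal adapter (illustrative)."""
    def __init__(self, base: nn.Linear, num_blocks: int):
        super().__init__()
        m2, m1 = base.weight.shape                   # nn.Linear stores (out, in)
        assert m1 % num_blocks == 0 and m2 % num_blocks == 0
        self.base = base
        for p in self.base.parameters():             # freeze pretrained weights
            p.requires_grad_(False)
        self.N, self.d1, self.d2 = num_blocks, m1 // num_blocks, m2 // num_blocks
        # zero-initialized diagonal blocks: output starts identical to the base layer
        self.D = nn.Parameter(torch.zeros(num_blocks, self.d1, self.d2))

    def forward(self, x):                            # x: (..., m1)
        xb = x.reshape(*x.shape[:-1], self.N, self.d1)
        delta = torch.einsum("...ni,nio->...no", xb, self.D)
        return self.base(x) + delta.reshape(*x.shape[:-1], self.N * self.d2)

layer = DiaBloLinear(nn.Linear(4096, 4096, bias=False), num_blocks=64)
y = layer(torch.randn(2, 16, 4096))                  # -> shape (2, 16, 4096)
```

Only `layer.D` carries gradients, so the trainable parameter count is \(m_1 m_2 / N\) per adapted layer.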
Key Designs¶
- Structured sparsity without matrix products: LoRA represents updates as the product of two matrices \(\mathbf{AB}\), whereas DiaBlo directly updates diagonal sub-blocks of the original weight. Since no product is involved, the optimization problem is convex with respect to \(\mathbf{D}_i\) in the linear case, eliminating the need for specialized initialization or optimization strategies—zero initialization is sufficient.
- GPU-friendly batched matmul: The forward computation \(\mathbf{X}\mathbf{D}\) is equivalent to reshaping \(\mathbf{X}\) to \(b \times N \times d_1\) and computing `torch.einsum("bni,nio->bno", X, D)`, where `b`, `n`, `i`, `o` index the batch, the block, and the within-block input/output dimensions; the backward pass follows analogously, without reconstructing the sparse matrix \(\mathbf{D}\). A small sanity check follows this list.
- Theoretical guarantees in both linear and nonlinear settings: Under linear least squares, the DiaBlo solution coincides with the full fine-tuning solution. With parameter count \(Nd_1d_2 = m_1m_2/N \geq m_2 r\) versus LoRA's requirement of \((m_1+m_2)r\) parameters, DiaBlo is strictly more expressive under the same budget. In nonlinear settings, when the activations \(\mathbf{X}\) and gradients \(\mathbf{g}_{\mathbf{Y}}\) satisfy low-rank conditions (supported by prior empirical findings), the stationary points of DiaBlo coincide with those of full fine-tuning.
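As referenced in the batched-matmul bullet, a quick numerical check (our sketch, not from the paper) confirms that the einsum path matches multiplication by an explicitly materialized block-diagonal matrix, and compares the two parameter budgets at an arbitrary setting where they happen to coincide.

```python
import torch

b, m1, m2, N, r = 4, 512, 512, 8, 32
d1, d2 = m1 // N, m2 // N

X = torch.randn(b, m1)
D = torch.randn(N, d1, d2)

# Batched path used during training: never builds the m1 x m2 sparse matrix
Y_fast = torch.einsum("bni,nio->bno", X.view(b, N, d1), D).reshape(b, m2)

# Reference path: materialize the block-diagonal matrix explicitly
Y_ref = X @ torch.block_diag(*D)
print(torch.allclose(Y_fast, Y_ref, atol=1e-4))       # True

# Parameter budgets: DiaBlo m1*m2/N vs. LoRA (m1+m2)*r
print("DiaBlo:", m1 * m2 // N, "  LoRA:", (m1 + m2) * r)   # 32768 vs. 32768
```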
Core Differences from LoRA¶
| Dimension | LoRA | DiaBlo |
|---|---|---|
| Parameterization | Low-rank product \(\mathbf{AB}\) | Direct diagonal block update \(\mathbf{D}_i\) |
| Optimization landscape | Non-convex, initialization-sensitive | Convex (linear) / flatter (nonlinear) |
| Initialization | Requires Kaiming/SVD strategies | Zero initialization |
| Gradient computation | \(\mathbf{g}_\mathbf{A}\) depends on \(\mathbf{B}\) and vice versa | \(\mathbf{g}_{\mathbf{D}_i} = \mathbf{X}_i^\top \mathbf{g}_{\mathbf{Y}_i}\), independent |
| Implementation complexity | Two parameter matrices + merge logic | Single tensor + einsum |
| FLOPs | \(2bmr\) | \(bNd^2\) (equal at the same parameter budget) |
| Training speed | Baseline | On par with LoRA; far faster than DoRA |
Key Experimental Results¶
Commonsense Reasoning (Commonsense Reasoning 170K, average over 8 subtasks)¶
| Model | Method | r/N | Trainable Params | Avg Acc (%) |
|---|---|---|---|---|
| LLaMA2-7B | Full FT | — | 100% | 83.5 |
| | LoRA | r=32 | 0.83% | 77.6 |
| | DoRA | r=16 | 0.42% | 80.5 |
| | MiLoRA | r=32 | 0.83% | 79.2 |
| | SMT(Best) | — | 4.91% | 83.4 |
| | DiaBlo | N=128 | 0.52% | 83.5 |
| LLaMA3-8B | Full FT | — | 100% | 87.5 |
| | LoRA | r=32 | 0.78% | 80.8 |
| | DoRA | r=32 | 0.78% | 85.2 |
| | SMT(Best) | — | 3.01% | 87.2 |
| | DiaBlo | N=64 | 1.04% | 87.3 |
| LLaMA-13B | DoRA | r=32 | 0.68% | 80.8 |
| | DiaBlo | N=64 | 1.06% | 84.9 |
Arithmetic Reasoning (MetaMathQA → GSM8K + MATH, LLaMA2-7B)¶
| Method | r/N | Trainable Params | GSM8K | MATH | Avg |
|---|---|---|---|---|---|
| Full FT | — | 100% | 66.5 | 19.8 | 43.2 |
| LoRA | r=64 | 1.67% | 60.6 | 16.9 | 38.7 |
| PiSSA | r=64 | 1.67% | 58.2 | 15.8 | 37.0 |
| MiLoRA | r=64 | 1.67% | 63.5 | 17.8 | 40.7 |
| DiaBlo | N=32 | 2.09% | 66.3 | 20.4 | 43.4 |
Code Generation and Safety Alignment (LLaMA3-8B)¶
| Method | r/N | Trainable Params | Pass@1 | Pass@10 | HEx-PHI |
|---|---|---|---|---|---|
| LoRA | r=32 | 1.12% | 34.7 | 50.8 | 91.6 |
| DoRA | r=32 | 1.12% | 33.1 | 48.6 | 93.6 |
| LoRI | r=32 | 0.56% | 43.2 | 63.2 | 92.8 |
| DiaBlo | N=64 | 1.51% | 43.2 | 63.5 | 97.6 |
Quantized Model Fine-tuning (Math10K, LLaMA2-7B, average over 4 tasks)¶
| Quantization | Method | Trainable Params | Avg Acc (%) |
|---|---|---|---|
| 4-bit | QLoRA (r=64) | 112M | 53.7 |
| | ApiQ-bw (r=64) | 112M | 53.5 |
| | MagR-DiaBlo (N=64) | 70M | 54.8 |
| 2-bit | QLoRA (r=64) | 112M | 2.1 |
| | GPTQ-LoRA (r=64) | 112M | 39.9 |
| | ApiQ-bw (r=64) | 112M | 47.3 |
| | MagR-DiaBlo (N=64) | 70M | 48.7 |
Sparsity Pattern Comparison (GSM8K, LLaMA3-8B, Sparsity 1/64)¶
| Sparsity Pattern | Fine-tuned Acc (%) | Training Time (min) |
|---|---|---|
| DiaBlo (Diagonal Blocks) | 67.68 | 17.26 |
| Random Entries | 65.35 | 26.51 |
| Random Block | 64.86 | 29.76 |
| Random Column | 65.19 | 17.01 |
| Random Row | 61.71 | 17.76 |
Key Findings¶
- Commonsense reasoning: DiaBlo (N=128, 0.52% parameters) achieves 83.5% on LLaMA2-7B, matching Full FT and substantially outperforming LoRA (77.6%); SMT requires 4.91% parameters to barely reach parity.
- Arithmetic reasoning: DiaBlo (N=32) attains the highest MATH score of 20.4% among all methods, surpassing Full FT (19.8%).
- Quantization robustness: Under 2-bit quantization, QLoRA nearly collapses (2.1%), while DiaBlo maintains 48.7%—a gap of 46.6 percentage points.
- Training efficiency: DiaBlo matches LoRA in training speed (170 min/epoch), whereas DoRA requires 480 min/epoch (2.8× slower).
- Structured sparsity advantage: Diagonal blocks achieve the highest accuracy among all sparsity patterns and are 1.5–1.7× faster than unstructured alternatives.
- Gradient stability: The gradient norm variance of DiaBlo is consistently lower than that of LoRA; LoRA exhibits gradient vanishing in matrix \(\mathbf{A}\) during early training.
Highlights & Insights¶
- Remarkably simple yet surprisingly effective: Zero initialization combined with diagonal block updates requires no additional tricks.
- Rigorous theoretical grounding: DiaBlo is strictly superior to LoRA on linear problems—not merely approximately so.
- Quantization-friendly: The diagonal block structure is more robust than low-rank matrix products under low-bit quantization.
- The optimization difficulty of LoRA is fundamental: The low-rank matrix product is inherently a non-convex problem; DiaBlo circumvents this entirely.
Limitations & Future Work¶
- The diagonal block assumption disregards cross-block interactions, which may limit performance on tasks whose optimal weight updates are dense across blocks.
- The choice of \(N\) must be tuned to match hardware constraints and parameter budgets.
- A systematic comparison with adapter-based methods has not been conducted.
Related Work & Insights¶
- vs. LoRA: LoRA approximates \(\Delta \mathbf{W}\) with a low-rank product \(\mathbf{AB}\). DiaBlo employs structured sparsity (diagonal blocks), yielding greater stability and expressiveness.
- vs. S²FT: S²FT is also a structured sparse fine-tuning method. DiaBlo's diagonal blocks are more regular, leading to higher GPU efficiency.
- vs. QLoRA: QLoRA combines quantization with LoRA. The combination of DiaBlo and quantization appears more favorable, with a clear advantage at 2-bit precision.
Rating¶
- Novelty: ⭐⭐⭐⭐ Conceptually minimalist yet effective, with solid theoretical support
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple tasks, precisions, and model scales
- Writing Quality: ⭐⭐⭐⭐ Theory and experiments are clearly presented with intuitive figures and tables
- Value: ⭐⭐⭐⭐⭐ Has the potential to replace LoRA as the default PEFT method