Rectifying Magnitude Neglect in Linear Attention¶
Conference: ICCV 2025 arXiv: 2507.00698 Code: https://github.com/qhfan/MALA Area: Signal Communication Keywords: Linear Attention, Magnitude-Aware, Vision Transformer, Attention Score Distribution, Linear Complexity
TL;DR¶
This paper identifies that Linear Attention completely discards Query magnitude information, causing a significant deviation of attention score distributions from Softmax Attention. It proposes Magnitude-Aware Linear Attention (MALA), which restores magnitude awareness by introducing a scaling factor \(\beta\) and an offset term \(\gamma\), achieving comprehensive improvements over existing methods across classification, detection, segmentation, NLP, speech recognition, and image generation tasks.
Background & Motivation¶
- Background: The quadratic complexity \(O(N^2)\) of Softmax Attention limits Vision Transformers in high-resolution visual tasks. Linear Attention reduces complexity to \(O(N)\) via kernel function approximation, but suffers from notable performance degradation.
- Limitations of Prior Work: Existing improvements (e.g., EfficientViT adding convolutions for local compensation, Flatten Transformer's focused linear attention) are largely heuristic workarounds that do not address the root cause.
- Key Challenge: This paper analyzes the fundamental cause of the performance gap at the level of the mathematical formulation: in the Linear Attention formula, the Query magnitude \(\|\phi(Q_i)\|\) cancels between the numerator and the denominator (Eq. 4, written out below this list), leaving only the directional component \(\vec{\alpha_i}\). This means:
- Softmax Attention: increasing Query magnitude → sharper attention distribution (high-score Keys receive more attention)
- Linear Attention: attention distribution remains unchanged regardless of Query magnitude variation
- Implication: This finding explains the long-standing issues of overly smooth attention scores and weak local perception in Linear Attention.
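Written out (this is the cancellation the Key Challenge bullet cites as Eq. 4), substituting \(\phi(Q_i) = \|\phi(Q_i)\|\,\vec{\alpha_i}\) into the Linear Attention scores gives:

\[
\text{Attn}(Q_i, K_j)
= \frac{\phi(Q_i)\,\phi(K_j)^T}{\phi(Q_i)\sum_m \phi(K_m)^T}
= \frac{\|\phi(Q_i)\|\,\vec{\alpha_i}\,\phi(K_j)^T}{\|\phi(Q_i)\|\,\vec{\alpha_i}\sum_m \phi(K_m)^T}
= \frac{\vec{\alpha_i}\,\phi(K_j)^T}{\vec{\alpha_i}\sum_m \phi(K_m)^T},
\]

so the scores depend only on the direction \(\vec{\alpha_i}\) and never on \(\|\phi(Q_i)\|\).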
Method¶
Overall Architecture¶
MALA modifies the normalization scheme of Linear Attention: it replaces division-based normalization with additive normalization and introduces a scaling factor \(\beta\) and an offset term \(\gamma\) that depend on the magnitude of \(\phi(Q_i)\), so the attention scores adapt dynamically to the Query magnitude.
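As a compact side-by-side (using the \(\beta\) and \(\gamma\) defined under Key Designs below), the two normalization schemes read:

\[
\text{Linear Attention: } O_i = \frac{\phi(Q_i)\sum_j \phi(K_j)^T V_j}{\phi(Q_i)\sum_m \phi(K_m)^T},
\qquad
\text{MALA: } O_i = \sum_j \big(\beta\,\phi(Q_i)\phi(K_j)^T - \gamma\big)\,V_j .
\]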
Key Designs¶
- Formal Proof of Magnitude Neglect:
- Substituting \(\phi(Q_i) = \|\phi(Q_i)\| \vec{\alpha_i}\) into the Linear Attention formula shows the magnitude term cancels in numerator and denominator.
- Empirical validation: replacing \(Q\) with \(Q/\|Q\|\) in DeiT-T's Softmax Attention (discarding magnitude) drops accuracy from 72.2% to 70.0%, approaching Linear Attention's 69.8%.
- Attention score visualization also converges to the smooth distribution characteristic of Linear Attention.
- MALA Formulation:
- Attention score: \(\text{Attn}(Q_i, K_j) = \beta \cdot \phi(Q_i)\phi(K_j)^T - \gamma\)
- Scaling factor: \(\beta = 1 + \frac{1}{\phi(Q_i)\sum_m \phi(K_m)^T}\) (negatively correlated with Query magnitude)
- Offset term: \(\gamma = \frac{\phi(Q_i)\sum_m \phi(K_m)^T}{N}\) (positively correlated with Query magnitude)
- Normalization preserved: \(\sum_j \text{Attn}(Q_i, K_j) = 1\)
- Core property: when \(\|\phi(Q_i)\|\) increases by a factor of \(a\), the ratio between the attention given to high-score and low-score Keys increases (the new ratio \(p_m\) exceeds the original ratio \(p\)), consistent with the behavior of Softmax Attention.
- Difference in Magnitude Growth Rates (Key Insight):
- In Softmax Attention, the ratio \(p\) grows exponentially with the scaling factor \(a\) (\(p^a\)) → overly sharp attention.
- In MALA, the ratio \(p\) grows as a rational (fractional-form) function of \(a\), i.e., far more slowly → a more balanced distribution.
- Visualization confirms: Softmax focuses too locally, Linear Attention is overly smooth, and MALA achieves a well-balanced distribution.
- Linear complexity preserved: \(Y_i = \beta\, \phi(Q_i)\sum_j \phi(K_j)^T V_j - \gamma \sum_j V_j\), so \(\sum_j \phi(K_j)^T V_j\) can still be computed once and then combined with each Query (see the NumPy sketch below).
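To make the formulas above concrete, here is a minimal NumPy sketch (not the official MALA implementation; a ReLU kernel and toy shapes are assumed). It checks three things: vanilla Linear Attention ignores the magnitude of \(\phi(Q_i)\), MALA scores sharpen as that magnitude grows while still summing to 1, and the linear-complexity form matches the explicit per-key scores.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 16                                   # toy sequence length and head dimension
phi = lambda x: np.maximum(x, 0.0)             # ReLU kernel (one of the kernels tested in the paper)
Qf = phi(rng.normal(size=(N, d)))              # phi(Q): query features
Kf = phi(rng.normal(size=(N, d)))              # phi(K): key features
V  = rng.normal(size=(N, d))

def linear_scores(qf, Kf):
    """Vanilla Linear Attention scores for one query: division-based normalization."""
    s = qf @ Kf.T
    return s / s.sum()

def mala_scores(qf, Kf):
    """MALA scores for one query: additive normalization with beta and gamma."""
    s = qf @ Kf.T                              # phi(Q_i) phi(K_j)^T for every key j
    S = s.sum()                                # phi(Q_i) sum_m phi(K_m)^T
    beta  = 1.0 + 1.0 / S                      # shrinks as the query magnitude grows
    gamma = S / len(s)                         # grows with the query magnitude
    return beta * s - gamma                    # sums to 1 by construction

# 1) Scale phi(Q_i) by a: the linear-attention distribution is unchanged, MALA sharpens.
#    (With random toy inputs some MALA scores go negative; the paper notes this is
#    theoretically possible but not observed in trained models.)
for a in (1.0, 2.0, 4.0):
    lin, mala = linear_scores(a * Qf[0], Kf), mala_scores(a * Qf[0], Kf)
    print(f"a={a}: linear max={lin.max():.3f} (constant)  "
          f"MALA max={mala.max():.3f}, spread={mala.max() - mala.min():.3f}, sum={mala.sum():.3f}")

# 2) Linear complexity: fold sum_j phi(K_j)^T V_j once, then touch each query in O(d^2).
def mala_output_explicit(qf, Kf, V):
    return mala_scores(qf, Kf) @ V             # goes through all N scores

def mala_output_linear(qf, Kf, V):
    k_sum = Kf.sum(axis=0)                     # sum_m phi(K_m), shared across queries
    kv    = Kf.T @ V                           # sum_j phi(K_j)^T V_j, shared across queries
    S     = qf @ k_sum
    beta, gamma = 1.0 + 1.0 / S, S / len(Kf)
    return beta * (qf @ kv) - gamma * V.sum(axis=0)

assert np.allclose(mala_output_explicit(Qf[0], Kf, V),
                   mala_output_linear(Qf[0], Kf, V))
```

Running it shows the vanilla scores frozen across \(a\) while MALA's top score and spread grow with the sum fixed at 1, which is exactly the \(p_m > p\) behavior described above.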
Loss & Training¶
- A MAViT (Magnitude-Aware Vision Transformer) model family T/S/B/L is constructed.
- Image classification: trained from scratch on ImageNet-1K for 300 epochs, with maximum stochastic depth rates of 0.1/0.15/0.4/0.55.
- Detection/Segmentation: standard COCO/ADE20K configurations using RetinaNet/Mask R-CNN/Cascade Mask R-CNN/SemanticFPN/UperNet.
Key Experimental Results¶
Main Results¶
ImageNet-1K classification accuracy comparison (key scales):
| Model | Type | Params(M) | FLOPs(G) | Top-1(%) |
|---|---|---|---|---|
| RMT-S | Trans | 27 | 4.5 | 84.1 |
| SECViT-S | Trans | 27 | 4.6 | 84.3 |
| RAVLT-S | Linear | 26 | 4.6 | 84.4 |
| MAViT-S | Linear | 27 | 4.6 | 84.7 |
| RMT-B | Trans | 54 | 9.7 | 85.0 |
| RAVLT-B | Linear | 48 | 9.9 | 85.5 |
| MAViT-B | Linear | 50 | 9.9 | 85.7 |
| RMT-L | Trans | 95 | 18.2 | 85.5 |
| MAViT-L | Linear | 98 | 16.1 | 86.0 |
COCO detection (Cascade Mask R-CNN 3×+MS): MAViT-B achieves 55.5 \(AP^b\) / 48.0 \(AP^m\), surpassing the larger CSwin-B.
Ablation Study¶
Comparison of Linear Attention variants (DeiT-T/Swin-T/Swin-S settings, replacing only the attention mechanism):
| Linear Attention Type | DeiT-T | Swin-T | Swin-S |
|---|---|---|---|
| Hydra Attn | 68.3 | 80.7 | — |
| Enhanced Linear Attn | 72.9 | 81.8 | — |
| Focused Linear Attn | 74.1 | 82.1 | 83.5 |
| InLine Attn | 74.5 | 82.4 | 83.6 |
| MALA | 75.1 | 83.7 | 85.3 |
Ablation of \(\beta\) and \(\gamma\) (MAViT-T): removing \(\beta\) drops accuracy to 52.3%; removing \(\gamma\) causes NaN; replacing both with learnable parameters drops accuracy to 71.7%.
Kernel function insensitivity: ELU+1, ReLU, and exp are nearly equivalent (82.9 vs. 82.8 vs. 82.9).
Key Findings¶
- MALA outperforms Softmax Attention on all tested tasks while maintaining linear complexity.
- NLP (0.3B model / 15B tokens): MALA is competitive with Transformer and Mamba on the LAMBADA, PIQA, HellaSwag, and WinoGrande benchmarks.
- Speech recognition (Conformer replacement): WER improves from Softmax's 2.7/6.3 to MALA's 2.4/5.3.
- Image generation (DiT framework): FID improves from 68.40 to 49.62, with throughput of 5.6 imgs/s (fastest).
- Clear efficiency advantage in high-resolution inference: MAViT significantly outperforms Softmax-based models in 512×2048 semantic segmentation.
- No negative or zero attention scores were observed in experiments (they are theoretically possible but do not occur in practice).
Highlights & Insights¶
- Precise analytical starting point: Rather than heuristically adding convolutions for compensation, this paper mathematically identifies the fundamental flaw of Linear Attention (magnitude cancellation), and the solution directly addresses the identified problem.
- Minimal yet effective design of \(\beta\) and \(\gamma\): only two analytically derived terms, computed directly from \(Q\) and \(K\) with no learnable parameters; ablations demonstrate their indispensability.
- The difference in growth rates is a profound insight: the exponential growth in Softmax leads to overly sharp attention, while the fractional growth in MALA yields a more reasonable distribution.
- Cross-domain validation (vision/NLP/speech/generation) demonstrates that this is a fundamental improvement rather than a task-specific trick.
Limitations & Future Work¶
- The introduction of \(\beta\) and \(\gamma\) incurs additional per-token scalar computation, which increases constant overhead without affecting asymptotic complexity.
- The theoretical analysis of negative attention scores is insufficiently rigorous (only claimed to be absent empirically).
- No thorough comparison with Mamba/SSM-based methods in large-scale NLP settings.
- Performance of Linear Attention on ultra-long sequences (e.g., video, genomics) remains unexplored.
Related Work & Insights¶
- Flatten Transformer (ICCV 2023) and InLine Attention (NeurIPS 2024) are the most direct comparison baselines.
- MILA (NeurIPS 2024) is inspired by Mamba to redesign the macro-architecture of Linear Attention, but does not address the magnitude problem.
- RMT (CVPR 2024) is prior work from the same group using a Retentive mechanism; MALA provides a more fundamental improvement.
- The analysis of Softmax being overly sharp versus Linear Attention being overly smooth offers broadly applicable guidance for designing new attention mechanisms.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Both the problem identification and the solution are exceptionally elegant, grounded in mathematical fundamentals.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across 7 tasks with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear, though some formula typesetting could be improved.
- Value: ⭐⭐⭐⭐⭐ A fundamental improvement to Linear Attention with broad impact.