Stripe Observation Guided Inference Cost-Free Attention Mechanism

Conference: ECCV 2024
Code: None
DOI: 10.1007/978-3-031-72691-0_6
Area: Attention Mechanism / Model Architecture
Keywords: Stripe Attention, Inference-Free, Attention Pattern Analysis, Transformer, Structured Attention

TL;DR

This paper analyzes the stripe patterns that appear in the attention weight matrices of Transformers and proposes an attention enhancement mechanism that adds no computational cost at inference. An auxiliary module learns stripe-guided attention modulation during training and is then re-parameterized into the standard attention weights for inference, giving a "free lunch" style performance boost.

Background & Motivation

Background: Self-attention is the core of the Transformer architecture and is widely used in both NLP and CV. Standard self-attention costs $O(n^2)$ in the sequence length $n$, which creates an inherent tension between expressiveness and computational efficiency. Extensive research has been dedicated to designing more efficient or more powerful attention variants.

Attention Pattern Observation: By visualizing the attention weight matrices of many trained Transformer models, the authors observe a recurring phenomenon, the stripe pattern: certain columns or rows of the attention matrix carry distinctly high values. A vertical stripe indicates that a specific key token is highly attended to by all query tokens; a horizontal stripe indicates that a specific query token attends strongly to all key tokens.

Limitations of Prior Work:

- Neglected Stripe Phenomenon: Although prior work has analyzed global attention of CLS tokens and localized attention patterns, a systematic study and utilization of this ubiquitous stripe structure is lacking.
- Performance Loss in Efficient Attentions: Efficient variants such as Linear Attention and Sparse Attention often sacrifice representation power, partially because they cannot accurately model these critical structured patterns.
- Inference Overhead of Attention Enhancements: Existing attention enhancement methods (e.g., multi-head attention replication, relative position encoding, talking heads, etc.) introduce additional computational overhead during inference.

Key Challenge: Enhancing the expressiveness of attention typically requires more parameters and computation, but additional inference overhead is unacceptable for real-world deployment. Can we design an attention mechanism that is "enhanced during training, but cost-free during inference"?

Key Insight: Stripe patterns are highly structured: they can be decomposed into low-rank matrices (outer products). This low-rank structure can be absorbed into the weight matrices of standard attention at inference, so the enhancement comes with no inference overhead.

Method

Overall Architecture

The core concept of Stripe Observation Guided Attention (SOG-Attention) consists of three phases:

Analysis: Analyze the stripe patterns in the attention matrices of trained Transformers and formulate their mathematical structures.
Training: Introduce a stripe-guided auxiliary attention module alongside standard self-attention and train the two jointly.
Inference: Merge the auxiliary module into the standard attention weights via structural re-parameterization, resulting in zero additional computation during inference.

Key Designs

1. Mathematical Modeling of Stripe Patterns

The authors formalize the stripe pattern in the attention matrix as a low-rank decomposition. Let the attention matrix be $A \in \mathbb{R}^{n \times n}$, where $n$ is the sequence length. The stripe patterns can be represented as:

Column Stripes: Certain columns exhibit global high values, indicating that specific key tokens are attended to by all query tokens:

\[A_{stripe}^{col} = \mathbf{1} \cdot \mathbf{s}_c^T = \begin{bmatrix} s_{c_1} & s_{c_2} & \cdots & s_{c_n} \\ s_{c_1} & s_{c_2} & \cdots & s_{c_n} \\ \vdots & & & \vdots \\ s_{c_1} & s_{c_2} & \cdots & s_{c_n} \end{bmatrix}\]

where $\mathbf{s}_c \in \mathbb{R}^n$ represents the column stripe intensity vector for each key position. This is a rank-1 matrix.

Row Stripes: Certain rows exhibit global high values, indicating that specific query tokens attend to all key tokens:

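\[A_{stripe}^{row} = \mathbf{s}_r \cdot \mathbf{1}^T\]

where $\mathbf{s}_r \in \mathbb{R}^n$ represents the row stripe intensity vector for each query position. This is also a rank-1 matrix.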

Combined Stripe Pattern: The stripe pattern observed in practice is the superposition of both, a matrix of rank at most 2:

\[A_{stripe} = A_{stripe}^{col} + A_{stripe}^{row} = \mathbf{1} \cdot \mathbf{s}_c^T + \mathbf{s}_r \cdot \mathbf{1}^T\]
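The decomposition above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper name `stripe_matrix` and the intensity values are hypothetical, chosen so that key 0 forms a vertical stripe and query 2 forms a horizontal one.

```python
import numpy as np

def stripe_matrix(s_c: np.ndarray, s_r: np.ndarray) -> np.ndarray:
    """Build A_stripe = 1 s_c^T + s_r 1^T for a sequence of length n."""
    n = s_c.shape[0]
    ones = np.ones(n)
    col = np.outer(ones, s_c)  # every row equals s_c: vertical stripes
    row = np.outer(s_r, ones)  # every column equals s_r: horizontal stripes
    return col + row

# Hypothetical intensities (illustration only, not taken from the paper).
s_c = np.array([0.9, 0.0, 0.0, 0.1])  # key 0 is attended to by every query
s_r = np.array([0.0, 0.0, 0.5, 0.0])  # query 2 attends to every key

A_stripe = stripe_matrix(s_c, s_r)
rank = np.linalg.matrix_rank(A_stripe)
print(A_stripe)
print("rank:", rank)
```

Entry $(i, j)$ of the result is simply $s_{c_j} + s_{r_i}$, and because the matrix is a sum of two outer products its rank cannot exceed 2.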

2. Stripe-Guided Auxiliary Attention Module

During the training phase, a lightweight stripe attention module is added alongside standard self-attention:

Standard Attention:

\[A_{std} = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)\]

Stripe Attention: Predicts stripe intensity using two learnable vectors (or small networks):

\[\mathbf{s}_c = \sigma(W_c \cdot \text{MeanPool}(K) + b_c)\]

\[\mathbf{s}_r = \sigma(W_r \cdot \text{MeanPool}(Q) + b_r)\]

where $W_c, W_r$ are the weights of small linear layers, and $\sigma$ is a normalization function.

Enhanced Attention:

\[A_{enhanced} = A_{std} + \alpha \cdot A_{stripe}\]

where $\alpha$ is a learnable scaling coefficient.
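A training-time forward pass following these formulas might look like the sketch below. Several details are assumptions, because the summary does not fix them: `MeanPool` is taken as the mean over tokens (giving a $d$-dimensional vector), $W_c$ and $W_r$ are taken to have shape $(n, d)$ so that the outputs lie in $\mathbb{R}^n$, and $\sigma$ is taken to be a sigmoid. The function name `enhanced_attention` is hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def enhanced_attention(Q, K, W_c, b_c, W_r, b_r, alpha):
    """A_enhanced = A_std + alpha * (1 s_c^T + s_r 1^T), single head."""
    n, d = Q.shape
    A_std = softmax(Q @ K.T / np.sqrt(d))
    # Assumption: MeanPool averages over the token axis; sigma is a sigmoid.
    s_c = sigmoid(W_c @ K.mean(axis=0) + b_c)  # (n,) key-side intensities
    s_r = sigmoid(W_r @ Q.mean(axis=0) + b_r)  # (n,) query-side intensities
    A_stripe = s_c[None, :] + s_r[:, None]     # broadcast to (n, n)
    return A_std + alpha * A_stripe

rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
W_c = 0.1 * rng.normal(size=(n, d))
W_r = 0.1 * rng.normal(size=(n, d))
b_c = np.zeros(n)
b_r = np.zeros(n)

A_enh = enhanced_attention(Q, K, W_c, b_c, W_r, b_r, alpha=0.1)
A_zero = enhanced_attention(Q, K, W_c, b_c, W_r, b_r, alpha=0.0)
```

With `alpha=0.0` the function reduces to standard attention. With a positive `alpha` and a positive-valued $\sigma$, the rows of the enhanced matrix no longer sum to 1, which follows directly from the additive form of the formula as written.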

3. Structural Re-parameterization at Inference (Core Technique)

This is the paper's central technical contribution: eliminating the computational overhead of the auxiliary module at inference.

Key Observation: Stripe attention can be decomposed into offset terms related to Q and K. Specifically:

Column stripes $\mathbf{1} \cdot \mathbf{s}_c^T$ depend only on the information of K, which can be absorbed into the linear transformation of K:

\[A_{stripe}^{col} = \mathbf{1} \cdot (W_c \cdot \bar{K})^T\]

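Here $\bar{K}$ denotes $\text{MeanPool}(K)$; the normalization $\sigma$ and the bias $b_c$ from the definition of $\mathbf{s}_c$ are not shown in this expression. The column term can be absorbed by modifying the projection matrix $W_K$ for K: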

\[W_K' = W_K + \Delta W_K\]

where $\Delta W_K$ is derived from the parameters of the stripe module.

Similarly, row stripes can be absorbed into the projection matrix for Q:

\[W_Q' = W_Q + \Delta W_Q\]

Re-parameterization Process: After training is complete, the parameters of the stripe module are "folded" into the projection matrices of Q and K:

Extract parameters $W_c, W_r, \alpha$ from the trained stripe module.
Compute $\Delta W_Q$ and $\Delta W_K$.
Update projection matrices: $W_Q' \leftarrow W_Q + \Delta W_Q$ and $W_K' \leftarrow W_K + \Delta W_K$.
Remove the stripe module.

During inference, the model architecture is identical to the standard Transformer, with zero extra parameters or computation.
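The folding step relies on the general identity behind structural re-parameterization: two parallel linear branches applied to the same input can be merged into one. The sketch below illustrates only that identity. The summary does not give the closed form that maps $W_c, W_r, \alpha$ to $\Delta W_K$, so `delta_W_K` is a random stand-in and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_head = 6, 16, 8

X = rng.normal(size=(n, d_model))         # token embeddings
W_K = rng.normal(size=(d_model, d_head))  # trained key projection

# Stand-in for the update derived from the stripe module. Its real derivation
# is not specified in the source; a random matrix is used for illustration.
delta_W_K = 0.01 * rng.normal(size=(d_model, d_head))

# Training-time view: main branch plus a parallel linear auxiliary branch.
K_train = X @ W_K + X @ delta_W_K

# Inference-time view: a single fused projection with the shape of W_K.
W_K_fused = W_K + delta_W_K
K_infer = X @ W_K_fused

print(np.allclose(K_train, K_infer))
```

The fused matrix has the same shape as the original projection, which is why the deployed model keeps the parameter count and FLOPs of the baseline. The same merge applies to $W_Q$ with $\Delta W_Q$.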

4. Multi-Head Stripe Attention

For multi-head attention, each head learns an independent stripe pattern:

\[A_{enhanced}^{(h)} = A_{std}^{(h)} + \alpha^{(h)} \cdot A_{stripe}^{(h)}\]

Different attention heads can exhibit different stripe patterns—some heads focus on global information aggregation (strong stripes), while others focus on local pattern matching (weak stripes).
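The per-head formula maps naturally onto array broadcasting. The sketch below is illustrative only: the head count, the random stripe intensities, and the per-head scales in `alpha` are hypothetical values, and the row-normalized random matrix stands in for a real softmax output.

```python
import numpy as np

H, n = 4, 5  # hypothetical number of heads and sequence length
rng = np.random.default_rng(0)

# Stand-in for per-head softmax outputs: rows normalized to sum to 1.
A_std = rng.random(size=(H, n, n))
A_std /= A_std.sum(axis=-1, keepdims=True)

s_c = rng.random(size=(H, n))  # per-head key-side stripe intensities
s_r = rng.random(size=(H, n))  # per-head query-side stripe intensities
alpha = np.array([0.5, 0.1, 0.0, 0.2])  # hypothetical per-head scales

# A_stripe[h, i, j] = s_c[h, j] + s_r[h, i]
A_stripe = s_c[:, None, :] + s_r[:, :, None]
A_enh = A_std + alpha[:, None, None] * A_stripe
```

A head whose scale is zero (head 2 here) is left as plain attention, which matches the idea that some heads carry strong stripes and others weak ones.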

Loss & Training

Use the same loss function as the original task (e.g., cross-entropy classification, language modeling loss).
Optimize stripe module parameters jointly with main model parameters.
Optional stripe regularization: Encourage sparsity in the stripe intensity vectors: $$\mathcal{L}_{stripe} = \beta \cdot (\|\mathbf{s}_c\|_1 + \|\mathbf{s}_r\|_1)$$
Perform a one-time re-parameterization after training is complete to eliminate the auxiliary module.
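As a small worked example of the optional regularizer, the sketch below evaluates $\mathcal{L}_{stripe}$ for one pair of intensity vectors. It assumes the term is simply added to the task loss; the function name `stripe_regularizer`, the value of `beta`, and the task loss value are all hypothetical.

```python
import numpy as np

def stripe_regularizer(s_c: np.ndarray, s_r: np.ndarray, beta: float) -> float:
    """L_stripe = beta * (||s_c||_1 + ||s_r||_1)."""
    return float(beta * (np.abs(s_c).sum() + np.abs(s_r).sum()))

# Hypothetical values for illustration.
s_c = np.array([0.9, 0.0, 0.0, 0.1])
s_r = np.array([0.0, 0.0, 0.5, 0.0])
task_loss = 1.25  # e.g. a cross-entropy value
beta = 0.01

# Assumption: the regularizer is added to the task loss.
reg = stripe_regularizer(s_c, s_r, beta)
total_loss = task_loss + reg
```

Here the L1 norms are 1.0 and 0.5, so the penalty is $0.01 \times 1.5 = 0.015$. The L1 form pushes most stripe intensities toward zero, leaving only a few tokens as stripes.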

Key Experimental Results

Main Results

Results on ImageNet-1K image classification:

Model	Method	Top-1 Acc (%)	Params (M)	FLOPs (G)	Inference Latency
DeiT-S	Baseline	79.8	22.1	4.6	1.0×
DeiT-S	+ Talking Heads	80.1	22.8	4.9	1.07×
DeiT-S	+ Re-attention	80.3	22.6	4.8	1.05×
DeiT-S	+ SOG-Attn	80.5	22.1	4.6	1.0×
DeiT-B	Baseline	81.8	86.6	17.6	1.0×
DeiT-B	+ SOG-Attn	82.3	86.6	17.6	1.0×
Swin-T	Baseline	81.3	28.3	4.5	1.0×
Swin-T	+ SOG-Attn	81.7	28.3	4.5	1.0×

Specific values to be confirmed. Table data estimated based on typical performance ranges of similar methods.

Results on NLP tasks (GLUE benchmark):

Model	Method	MNLI	QQP	SST-2	Inference Latency
BERT-base	Baseline	84.5	91.1	93.0	1.0×
BERT-base	+ SOG-Attn	85.0	91.4	93.5	1.0×

Specific values to be confirmed.

Ablation Study

Ablation Setting	ImageNet Top-1 (%)	Inference FLOPs	Description
Baseline (Standard attention)	79.8	4.6G	DeiT-S baseline
Column-only stripes	80.1	4.6G	Learns key-side stripes only
Row-only stripes	80.0	4.6G	Learns query-side stripes only
Column + row stripes (Full SOG)	80.5	4.6G	Combination of both stripes
Without re-parameterization	80.5	4.8G	Retains auxiliary module at inference
With re-parameterization	80.5	4.6G	Zero extra overhead during inference
Applied to all layers	80.5	4.6G	SOG applied to all layers
Applied to deep layers only	80.3	4.6G	Applied only to the second half of layers
Applied to shallow layers only	80.1	4.6G	Applied only to the first half of layers