Skip to content

AttnZero: Efficient Attention Discovery for Vision Transformers

Conference: ECCV 2024
Code: https://github.com/lliai/AttnZero
Area: Model Efficiency / Neural Architecture Search
Keywords: Linear Attention, Attention Mechanism Search, Vision Transformer, Evolutionary Algorithm, Attention Benchmark

TL;DR

This paper proposes AttnZero, the first framework to automatically discover efficient attention modules. By constructing a structured search space consisting of six computation graph types and a rich set of operators, and leveraging an evolutionary algorithm for multi-objective search, it automatically discovers linear attention formulations applicable to various ViTs. It achieves ImageNet top-1 accuracies of 74.9%/78.1%/82.1%/82.9% on DeiT/PVT/Swin/CSwin, respectively, and constructs the Attn-Bench-101 benchmark containing 2000 attention variants.

Background & Motivation

Background: Vision Transformers (ViTs) have become the dominant architecture in computer vision, but their core self-attention mechanism incurs an \(O(n^2)\) computational complexity, where \(n\) is the number of tokens. This severely limits the application of ViTs in high-resolution inputs and real-time inference scenarios. Linear attention, which approximates softmax attention to achieve \(O(n)\) complexity, is a promising alternative.

Limitations of Prior Work: Existing linear attention methods (e.g., Efficient Attention, FLatten Transformer, ReLU Attention) are manually designed, which suffers from two main issues: (1) manual design relies heavily on human experience and intuition, leading to a restricted search space that easily misses optimal attention formulas; (2) compared to standard softmax attention, existing linear attention formulations typically exhibit significant performance degradation, especially on small-scale models.

Key Challenge: While linear attention possesses a theoretical complexity advantage of \(O(n)\), in practice, manually designed linear attention formulas struggle to strike the optimal balance between efficiency and performance. The immense design space suggests that superior formula combinations likely exist, but manual enumeration is intractable.

Goal: (1) How to systematically search for the optimal linear attention formulas; (2) How to ensure that the discovered attention modules generalize well across different ViT architectures; (3) How to efficiently filter out a large number of infeasible candidates.

Key Insight: The authors model attention formulas as computation graphs, where each node represents an operation (activation function, normalization, binary operation), and employ an evolutionary algorithm to search for the optimal linear attention formula within this computation graph space. The key innovation lies in using the joint performance across multiple architectures as the search objective and utilizing program verification to accelerate filtering.

Core Idea: Employ an evolutionary algorithm to automatically discover linear attention modules in a structured search space that outperform manually designed alternatives.

Method

Overall Architecture

The pipeline of AttnZero is divided into three stages: (1) Search space construction: defining six computation graph templates and a rich operator set to compose the entire linear attention search space; (2) Evolutionary search: using a multi-objective evolutionary algorithm with the joint accuracy of candidate attention modules on various ViTs (DeiT, PVT, Swin, etc.) as the fitness function, while rapidly filtering infeasible schemes using program checking and rejection protocols; (3) Verification and transfer: replacing original attention blocks in various ViT architectures with the discovered optimal attention module for full training and downstream task evaluation.

Key Designs

  1. Structured Search Space:

    • Function: Defines all possible formula combinations of linear attention.
    • Mechanism: The search space comprises six computation graph types, covering different \(Q\), \(K\), \(V\) interaction modes. Within each computation graph, selectable operators for action nodes include: activation functions (ReLU, GELU, Sigmoid, ELU+1, Softmax, etc.), normalizations (L1-Norm, L2-Norm, LayerNorm, BatchNorm, etc.), and binary operations (addition, multiplication, element-wise maximum, etc.). The overall search space encompasses tens of thousands of possible attention formulas. A crucial constraint is that all candidate formulas must satisfy linear complexity—i.e., computing \(\phi(K)^T V\) (with complexity \(O(d^2 n)\)) prior to multiplying by \(\phi(Q)\), rather than computing \(\phi(Q)\phi(K)^T\) first (with complexity \(O(n^2 d)\)).
    • Design Motivation: Manual design can only explore extremely small subspaces. A systematic search space construction ensures that potential formula combinations are not missed. The six computation graphs cover various paradigms, such as \(Q\)-\(K\) first interaction, \(K\)-\(V\) first interaction, and adding bias terms.
  2. Multi-Objective Evolutionary Search:

    • Function: Efficiently finds the optimal attention formulas within the search space.
    • Mechanism: The evolutionary algorithm maintains a population of candidate attention formulas. In each generation: (a) the proxy performance (e.g., ImageNet accuracy over short training epochs) of each candidate formula on various ViT architectures (DeiT-Tiny, PVT-Tiny, etc.) is evaluated as the multi-objective fitness; (b) new individuals are generated via crossover operations that combine subgraphs of two high-fitness individuals; (c) operators in node functions are randomly replaced via mutation operations. Multi-objective optimization ensures that the searched attention generalizes across different architectures rather than overfitting to a single one.
    • Design Motivation: Single-objective search tends to result in attention modules highly specialized for a single architecture. Multi-objective search forces the discovery of more general design principles.
  3. Program Checking and Rejection Protocols:

    • Function: Rapidly filters out infeasible or obviously low-quality candidates to accelerate the search process.
    • Mechanism: Before running expensive proxy training evaluations on candidate formulas, a series of cheap verification checks are executed: (a) syntactic check—verifying if the computation graph is valid (e.g., type matching, dimension consistency); (b) numerical stability check—forward passing on random inputs to detect NaN values or numerical overflows; (c) gradient check—verifying if backward propagation can successfully compute gradients; (d) complexity check—confirming that the computational complexity remains linear. Only candidates passing all checks proceed to the proxy evaluation stage.
    • Design Motivation: A massive proportion of combinations in the search space are infeasible (due to numerical instability, vanishing gradients, etc.). It would be highly wasteful to run proxy training on every candidate. Pre-checking filters out over 90% of invalid candidates at virtually zero cost.

Loss & Training

During the search phase, proxy training (short-epoch training on ImageNet classification) is utilized to evaluate the quality of candidate attentions. The finally selected attention module (e.g., Trial-105) is then integrated into the target ViTs for full training. The training strategies align with the original configuration of each ViT (e.g., DeiT's DeiT-III training strategy, CSwin's EMA training strategy) to ensure a fair comparison. Additionally, the authors construct Attn-Bench-101, a pre-computed performance library of 2,000 attention variants, enabling users to directly query the performance of a specific attention on various ViTs without retraining.

Key Experimental Results

Main Results

Model Attention Type Top-1 Acc Gain
DeiT-Tiny Softmax (Original) 72.2% -
DeiT-Tiny AttnZero 74.9% +2.7%
PVT-Tiny Softmax (Original) 75.1% -
PVT-Tiny AttnZero 78.1% +3.0%
Swin-Tiny Softmax (Original) 81.3% -
Swin-Tiny AttnZero 82.1% +0.8%
CSwin-Tiny Softmax (Original) 82.7% -
CSwin-Tiny AttnZero 82.9% +0.2%

Ablation Study

Configuration Key Metrics Description
Single-Objective Search (DeiT only) Optimal on DeiT but poor generalization Overfitting to a single architecture
Multi-Objective Search (DeiT+PVT+Swin) Balanced improvement across architectures Better generalization of attention
Without Program Checking Search time increases significantly A large number of invalid candidates waste resources
With Program Checking Search efficiency increased by 10x+ Fast filtering of infeasible schemes

Key Findings

  • The discovered AttnZero attention module not only achieves linear complexity but also outperforms standard softmax attention on small-scale models, which is highly counter-intuitive.
  • Different ViT architectures share commonalities in their preferred attention formulas, indicating the existence of more universal attention design principles.
  • Analysis of Attn-Bench-101 reveals interesting attention design insights, such as proper normalization being more critical than the choice of activation function.

Highlights & Insights

  • This is the first work to model attention mechanism design as a search problem, opening up a new direction for automated attention design.
  • The multi-objective search strategy ensures generalization, enabling the discovered attention modules to be "plug-and-play" across various ViTs.
  • The construction of Attn-Bench-101 provides a valuable benchmark resource for subsequent attention mechanism research.
  • The design of the program checking protocol is practical and highly efficient, serving as a good reference for other NAS works.

Limitations & Future Work

  • Although the search space is rich, it remains a discrete computation graph and does not cover continuous parameterized attention representations.
  • Proxy evaluation relies on short-epoch training, which might introduce rank correlation bias compared to full training.
  • Currently, validation is mainly conducted on image classification; its performance on dense prediction tasks like object detection and segmentation requires further verification.
  • The search process still requires substantial GPU resources, making it less accessible for resource-constrained researchers.
  • Hardware-aware efficiency metrics (e.g., actual inference latency) are not considered, relying solely on theoretical FLOPs.
  • Linear Attention series (e.g., Katharopoulos et al., EfficientAttention) provide the theoretical foundation for linear attention.
  • Evolutionary search methods in the NAS (Neural Architecture Search) field (e.g., ENAS, DARTS) inspired the search framework of this paper.
  • DeiT, PVT, Swin, CSwin are mainstream ViT architectures, on which the discovered attention modules are validated in this work.
  • This work may inspire more "X-Zero"-style automated component design efforts, such as automated loss function discovery and automated data augmentation strategy discovery.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The first automated attention discovery framework, featuring outstanding design of the search space and search strategy.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple ViT architectures, with Attn-Bench-101 as an added bonus.
  • Writing Quality: ⭐⭐⭐⭐ Clear description of the search space and well-organized experiments.
  • Value: ⭐⭐⭐⭐ Opens up a new direction for automated attention design, and Attn-Bench offers long-term value.