Associative Transformer

Conference: CVPR 2025
arXiv: 2309.12862
Code: None publicly available
Area: LLM Efficiency
Keywords: Transformer, explicit memory, Hopfield network, sparse representation, bottleneck attention

TL;DR

The Associative Transformer (AiT) integrates a learnable explicit memory module and a Hopfield network for token reconstruction into the Transformer architecture. It outperforms ViT on classification and relational reasoning with fewer parameters.

Background & Motivation

Background: Vision Transformer (ViT) has made significant progress in computer vision tasks through the self-attention mechanism, but its token representation lacks the support of explicit structured memory, with all information implicitly encoded in the attention weights.

Limitations of Prior Work: Standard Transformer attention computes global interactions among all \(N\) tokens at \(O(N^2)\) computational cost, and it has no mechanism for maintaining persistent information across samples. Consequently, models are prone to overfitting on small datasets and perform poorly on tasks that require relational reasoning.

Key Challenge: Although Transformers have powerful representation capabilities, they lack a mechanism akin to the human brain's "Global Workspace Theory." Existing methods either rely entirely on implicit representations or introduce external memory but lack an effective retrieval mechanism.

Goal: How can persistent explicit memory be introduced into Transformers so that tokens can competitively access a shared memory pool while maintaining computational efficiency?

Key Insight: Drawing inspiration from the Global Workspace Theory and associative memory (Hopfield Network) in cognitive science, a bottleneck mechanism is designed to force tokens to compete for entry into a shared memory space.

Core Idea: A Global Workspace Layer is introduced, combining low-rank explicit memory, bottleneck attention, and Hopfield networks to endow Transformers with persistent, competitive associative memory capabilities.

Method

Overall Architecture

AiT builds on the standard ViT and adds a Global Workspace Layer (GWL) to each Transformer block. The input is a sequence of image patch tokens. After self-attention, the tokens enter the GWL for memory interaction and token reconstruction, and the layer outputs enhanced token representations.

Key Designs

  1. Low-rank Explicit Memory

    • Function: Maintains a learnable memory pool \(\gamma \in \mathbb{R}^{M \times D}\), where \(M\) is the number of memory slots (32-128) and \(D\) is the low-dimensional embedding dimension (32).
    • Mechanism: Memory is continuously updated via an exponentially weighted moving average (EWMA): \(\gamma^{t+1} = (1-\alpha)\gamma^t + \alpha \cdot \text{LN}(\text{Concat}(h_1,\ldots,h_S)W^O)\), with \(\alpha=0.1\).
    • Design Motivation: The low-dimensional design enables the memory pool to scale up to 32.8K tokens without introducing significant computational overhead, while cross-batch updates allow the accumulation of global statistical information.
  2. Bottleneck Attention

    • Function: Forces tokens to compete for entry into the memory space through a top-k selection mechanism.
    • Mechanism: The attention scores of each token over all memory slots are calculated, and only the top-k tokens with the highest scores are kept to interact with the memory.
    • Design Motivation: This competitive mechanism simulates the "broadcasting" process of the global workspace, ensuring that only the most relevant information is written into the shared memory.
    • Balance Loss: Consists of two components, cumulative attention balancing and selection frequency balancing.
  3. Hopfield Network Token Reconstruction

    • Function: Retrieves and reconstructs token representations from memory using a continuous Hopfield network.
    • Mechanism: The Hopfield energy function is defined as \(E(\Xi^t) = -\text{lse}(\beta, f_{LT}(\gamma^{t+1})\Xi^t) + \frac{1}{2}\Xi^t(\Xi^t)^T\).
    • Design Motivation: Hopfield networks are inherently well-suited for retrieving matched patterns from a memory pool, and their FLOPs account for only 0.84% of the total computation.
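
The three designs above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: `candidate` stands in for \(\text{LN}(\text{Concat}(h_1,\ldots,h_S)W^O)\), which is not computed here; a token's competition score is assumed to be its largest scaled dot product with any memory slot, since the summary does not say how per-slot scores are reduced; the projection \(f_{LT}\) is omitted, so memory and tokens share one dimension; and all function names are hypothetical. The retrieval step uses the standard continuous Hopfield update \(\Xi^{\text{new}} = \sum_i \text{softmax}(\beta X \Xi)_i \, x_i\), which decreases the energy given above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, beta=1.0):
    """Numerically stable softmax of beta * scores."""
    peak = max(scores)
    exps = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ewma_update(memory, candidate, alpha=0.1):
    """gamma^{t+1} = (1 - alpha) * gamma^t + alpha * candidate, element-wise.

    ASSUMPTION: `candidate` is a precomputed stand-in for the normalized
    multi-head attention output in the update rule.
    """
    return [[(1 - alpha) * m + alpha * c for m, c in zip(m_row, c_row)]
            for m_row, c_row in zip(memory, candidate)]

def select_topk_tokens(tokens, memory, k):
    """Return the indices (in original order) of the k highest-scoring tokens.

    ASSUMPTION: a token's score is its largest scaled dot product with any
    memory slot; the reduction over slots is not specified in the summary.
    """
    scale = math.sqrt(len(memory[0]))
    scores = [max(dot(tok, slot) for slot in memory) / scale for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def hopfield_retrieve(query, patterns, beta=1.0):
    """One continuous Hopfield update: a softmax-weighted sum of stored patterns."""
    weights = softmax([dot(p, query) for p in patterns], beta)
    return [sum(w * p[d] for w, p in zip(weights, patterns))
            for d in range(len(query))]

# Toy run: 2 memory slots, 4 tokens, embedding dimension 2.
memory = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.1, 0.1], [2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]

kept = select_topk_tokens(tokens, memory, k=2)    # tokens that win the competition
candidate = [tokens[i] for i in kept]             # placeholder for the attention-based write
memory = ewma_update(memory, candidate, alpha=0.1)
reconstructed = [hopfield_retrieve(t, memory, beta=4.0) for t in tokens]
```

A larger `beta` makes retrieval snap to the single closest memory slot, while a smaller `beta` blends several slots.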

Loss & Training

  • Total loss: \(\ell = \ell_{\text{class}} + \sigma \cdot \sum \ell_{\text{bottleneck}_i}\), where \(\sigma = 10^{-2}\).
  • Batch size: 512 (CIFAR), 128 (Pet), 64 (relational reasoning).
  • Number of memory slots M: 32 (CIFAR/Triangle), 128 (Pet).
  • Bottleneck capacity: 512 (CIFAR/Pet), 64 (Triangle).
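
As a small worked example of the total loss, the sketch below combines a classification loss with a list of bottleneck losses. The loss values are made up for illustration, the function name is hypothetical, and the reading of the sum as one term per Global Workspace Layer is an assumption.

```python
def total_loss(class_loss, bottleneck_losses, sigma=1e-2):
    """l = l_class + sigma * sum_i l_bottleneck_i."""
    return class_loss + sigma * sum(bottleneck_losses)

# Illustrative values only: one bottleneck term per layer (assumed).
loss = total_loss(class_loss=2.0, bottleneck_losses=[0.5, 0.3, 0.2])
```

With \(\sigma = 10^{-2}\), the bottleneck terms act as a light regularizer: here they add 0.01 to a classification loss of 2.0.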

Key Experimental Results

Main Results

| Dataset | AiT-Base (91M) | AiT-Medium (45.9M) | ViT-Base (85.7M) | ViT-Medium |
|---|---|---|---|---|
| CIFAR10 | 85.44% | 84.59% | 83.82% | 82.41% |
| CIFAR100 | 60.78% | 60.58% | 57.92% | 55.78% |
| Triangle | 99.64% | 99.57% | 99.63% | 99.62% |
| Average | 81.95% | 81.58% | 80.46% | 79.27% |

AiT-Medium (45.9M parameters) outperforms ViT-Base (85.7M) on average (81.58% vs. 80.46%) with roughly half the parameters, although ViT-Base is marginally ahead on Triangle (99.63% vs. 99.57%). On ImageNet100, AiT-Medium achieves 36.72% vs. 34.62% for ViT-Base.
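
The averages and the parameter ratio can be rechecked with a few lines of arithmetic. The numbers below are copied from the Main Results table; the variable names are chosen only for this sketch.

```python
# Accuracies (%) copied from the Main Results table.
results = {
    "AiT-Base":   {"CIFAR10": 85.44, "CIFAR100": 60.78, "Triangle": 99.64},
    "AiT-Medium": {"CIFAR10": 84.59, "CIFAR100": 60.58, "Triangle": 99.57},
    "ViT-Base":   {"CIFAR10": 83.82, "CIFAR100": 57.92, "Triangle": 99.63},
    "ViT-Medium": {"CIFAR10": 82.41, "CIFAR100": 55.78, "Triangle": 99.62},
}

# Mean accuracy over the three datasets, rounded to two decimals.
averages = {model: round(sum(acc.values()) / len(acc), 2)
            for model, acc in results.items()}

# AiT-Medium parameter count relative to ViT-Base (both in millions).
param_ratio = round(45.9 / 85.7, 3)
```

The recomputed means match the Average row, and the ratio of about 0.54 is what "roughly half the parameters" refers to.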

Ablation Study

| Configuration | Average Accuracy | Change |
|---|---|---|
| Full AiT-Small | 79.70% | |
| w/o Bottleneck | 72.75% | -6.95% |
| w/o Self-Attention | 73.31% | -6.39% |
| w/o Memory (=ViT) | 77.40% | -2.30% |
| w/o Hopfield | 78.48% | -1.22% |
| w/o Balance Loss | 78.68% | -1.02% |
| Reset Memory | 79.12% | -0.58% |
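
The Change column is the ablated accuracy minus the full AiT-Small accuracy, in percentage points. A short sketch with the table values (names chosen for illustration) recomputes the column and ranks the components by how much their removal hurts.

```python
full = 79.70  # Full AiT-Small average accuracy (%)

ablations = {
    "w/o Bottleneck": 72.75,
    "w/o Self-Attention": 73.31,
    "w/o Memory (=ViT)": 77.40,
    "w/o Hopfield": 78.48,
    "w/o Balance Loss": 78.68,
    "Reset Memory": 79.12,
}

# Change in percentage points relative to the full model.
deltas = {name: round(acc - full, 2) for name, acc in ablations.items()}

# Largest drop first.
ranked = sorted(deltas, key=deltas.get)
```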

Key Findings

  • Removing bottleneck attention causes the largest drop (-6.95%), demonstrating that the competitive access mechanism is a core design.
  • Removing memory reduces the model to a standard ViT (-2.30%), showing that the memory module provides additional capacity.
  • The Hopfield computational overhead is extremely low (\(8.02 \times 10^6\) FLOPs, <0.84% of the total), yet removing it lowers accuracy by 1.22 percentage points.
  • In Oxford Pet experiments, ViT-Base overfits after 50 epochs, whereas AiT-Small accuracy continues to rise.
  • Sort-of-CLEVR relational reasoning: AiT-Base achieves 80.03% (relational task), outperforming the standard Transformer.

Highlights & Insights

  • Cognitive science-inspired architecture: Introducing Global Workspace Theory into Transformers is a clever interdisciplinary transfer.
  • Counter-intuitive parameter efficiency: The smaller AiT-Medium outperforms the larger ViT-Base, suggesting that structured memory is more effective than simply increasing parameters.
  • Lightweight application of Hopfield networks: The Hopfield module brings steady gains at only 0.84% computational overhead.
  • Memory updating via EWMA could potentially transfer to online learning and continual learning settings.

Limitations & Future Work

  • Evaluation is limited to small-scale datasets; a complete evaluation on ImageNet-1K and on downstream dense prediction tasks is missing.
  • The number of memory slots M and bottleneck capacity k must be manually tuned.
  • Combination with parameter-efficient fine-tuning methods like LoRA has not been explored.
  • vs Memory Transformer: Memory Transformer lacks competitive mechanisms and Hopfield retrieval; AiT's combination of bottleneck + Hopfield is more effective.
  • vs Set Transformer: Inducing points in Set Transformer are similar to memory slots, but lack persistent updates and associative retrieval.

Rating

  • Novelty: ⭐⭐⭐⭐ Cognitive science-inspired design is creative, but external memory in Transformers is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐ The datasets used are relatively small, lacking large-scale evaluations.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and detailed methodology description.
  • Value: ⭐⭐⭐⭐ Explores a promising direction for structured memory in Transformers.