Memory Mosaics v2 scales associative memory networks to 10B parameters trained on 1T tokens, substantially outperforming same-scale Transformers, and even a Transformer trained on 8T tokens, on new-task learning and in-context learning.
Compositional generalization and in-context learning (ICL) have long been central objectives in machine learning, yet the underlying mechanisms in existing Transformers remain opaque.
Prior efforts based on statistical independence (ICA) and multi-environment optimization (IRM, MAML) have achieved only limited success.
Memory Mosaics (Zhang et al., 2025) replaces attention with simple key-value associative memories (without positional encoding), demonstrating superior ICL capability at GPT-2 scale on synthetic data.
Core Problem: Can these advantages be preserved at large scale on real data? This work scales Memory Mosaics to LLaMA-8B size to answer this question.
An associative memory stores key-value pairs \(\{(k_1,v_1)\dots(k_n,v_n)\}\) and retrieves values given a query key.
The storage set is permutation-invariant and can be viewed as estimating the conditional probability \(P(V|K)\); retrieval computes the conditional expectation \(E[V \mid K = k]\).
Retrieval is implemented via Gaussian kernel regression: \(f(k) = \sum_i \frac{e^{-\beta\|k-k_i\|^2}}{\sum_j e^{-\beta\|k-k_j\|^2}} v_i\)
When all key vectors share the same L2 norm, this reduces to standard softmax attention (dot-product form).
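A minimal numerical sketch of this retrieval rule and its reduction to dot-product softmax attention when all stored keys share the same norm (the array names and toy data below are illustrative):

```python
import numpy as np

def kernel_retrieve(q, keys, values, beta=1.0):
    """Gaussian-kernel associative memory: weights are a softmax of
    -beta * squared distances between the query and the stored keys."""
    d2 = np.sum((keys - q) ** 2, axis=-1)
    logits = -beta * d2
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values            # estimate of E[V | K = q]

# With unit-norm keys, -beta*||q-k||^2 = 2*beta*(q.k) + const,
# so the weights coincide with a dot-product softmax (temperature 1/(2*beta)).
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8)); keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(5, 3))
q = rng.normal(size=8)

beta = 0.7
logits = 2 * beta * keys @ q
attn = np.exp(logits - logits.max()); attn /= attn.sum()
assert np.allclose(kernel_retrieve(q, keys, values, beta), attn @ values)
```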
The key-extraction circuitry is inspired by recurrent architectures such as RWKV, Mamba, and xLSTM, but recurrence is used solely for key construction; the associative memory itself still retains all key-value pairs.
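The notes do not spell out the recurrence; as an illustration only, a leaky-average (EMA) key extractor in the spirit of such time-mixing layers might look like the following (the decay `lam` and the choice of values are assumptions):

```python
import numpy as np

def leaky_average_keys(x, lam=0.9):
    """Illustrative key extractor: an exponential moving average over past
    token features, so each key summarizes a trailing stretch of context.
    (Hypothetical sketch; MM v2's actual recurrent extractor differs in detail.)"""
    keys = np.zeros_like(x)
    state = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        state = lam * state + (1.0 - lam) * x[t]
        keys[t] = state
    return keys

# The values can simply be projections of the next-token features; every
# (key, value) pair is appended to the associative memory -- nothing is compressed away.
```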
The new-knowledge-storage evaluation measures the model's ability to store and retrieve information first presented at inference time.
Uses the Multi-Document Question Answering task from the Ruler benchmark (concatenating multiple articles with a question).
Substantially harder than "needle-in-a-haystack": the answers carry high information entropy, so the task cannot be solved by simple exact matching.
Multi-document QA accuracy (%) across task lengths:

| Model | Train ctx | 4k | 8k | 16k | 32k | 64k |
|---|---|---|---|---|---|---|
| Transformer large | 32k | 51.2 | 48.8 | 44.7 | 41.1 | × |
| MM v2 large | 32k | 58.9 | 55.5 | 54.9 | 53.4 | 46.4 |
MM v2 outperforms Transformer by 12.3% at 32k task length.
MM v2 trained at 4k extrapolates to 32k without fine-tuning, whereas a Transformer trained at 4k fails beyond 8k.
Compressed-memory approaches such as RNN/SSM/sliding-window attention structurally fail on this task—they cannot retain all articles before encountering the question.
Both semantic-label and anonymized-label variants are evaluated; the latter more rigorously tests genuine new-task learning.
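For concreteness, a hedged sketch of how an anonymized-label few-shot prompt can be constructed (the labels, formatting, and helper below are illustrative, not the paper's exact protocol):

```python
import random

def anonymize_labels(examples, seed=0):
    """Replace semantic class names with meaningless tokens so the model must
    learn the input->label mapping from the shots alone (illustrative only)."""
    rng = random.Random(seed)
    labels = sorted({y for _, y in examples})
    mapping = {y: f"<label_{i}>" for i, y in enumerate(rng.sample(labels, len(labels)))}
    return [(x, mapping[y]) for x, y in examples], mapping

shots = [("the movie was wonderful", "positive"),
         ("a tedious, lifeless film", "negative")]
anon_shots, mapping = anonymize_labels(shots)
prompt = "\n".join(f"Review: {x}\nLabel: {y}" for x, y in anon_shots)
prompt += "\nReview: a real delight\nLabel:"
print(prompt)   # the model must infer what <label_0>/<label_1> mean from the shots
```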
Key Findings:
MM v2 accuracy consistently improves as the number of shots increases.
Transformers exhibit an anomalous trend—more shots actually degrade performance.
MM v2 surpasses Transformer by over 10%.
Adding separate short- and long-term attention to a Transformer does not reproduce this advantage, confirming that MM v2 is not a simple Transformer variant.
Extended Data Comparison: 1T MM v2 vs. 8T Transformer
| Dimension | Transformer 1T | Transformer 8T | MM v2 1T |
|---|---|---|---|
| New Knowledge Storage (32k) | 41.1% | 46.9% | 53.4% |
| Semantic-Label ICL | Lower | Close to MM v2 | Best |
| Anonymized-Label ICL | Low | Even lower (degrades) | Significantly best |
A Transformer trained on 8× more data still trails MM v2 (1T) by approximately 6.5% on new knowledge storage.
On anonymized-label ICL, more training data actually degrades Transformer performance—the "more data" strategy fails completely.
Interpretable memory mechanism: explicit key-value storage makes attention allocation transparent; attention scores over distant tokens are position-invariant (vs. position-dependent curves in Transformers).
Challenging scaling-law orthodoxy: a Transformer trained on 8T tokens still falls short of MM v2 trained on 1T tokens on new-task learning, demonstrating that architectural innovation matters more than data accumulation.
Context length extrapolation: the absence of positional encoding combined with adaptive bandwidth enables a model trained at 4k to extrapolate directly to 32k without fine-tuning.
Extremely fast fine-tuning: adaptation to a new domain requires only 1 mini-batch, offering high practical value.
Clear division of labor across three memory tiers: short-term handles local patterns, long-term handles remote retrieval, and persistent memory stores global knowledge.
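As an architectural illustration only (module names, the random persistent entries, and the summation rule below are assumptions, not the released design), the three tiers can be pictured as parallel associative memories read with the same kernel rule:

```python
import numpy as np

class MosaicBlockSketch:
    """Hypothetical three-tier memory block:
    short-term  -> key/value pairs from a recent window (local patterns)
    long-term   -> key/value pairs from the whole context (remote retrieval)
    persistent  -> key/value pairs learned during training (global knowledge)"""
    def __init__(self, dim, window=64, n_persistent=128, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.window, self.beta = window, beta
        self.pk = rng.normal(size=(n_persistent, dim))   # stand-in for trained parameters
        self.pv = rng.normal(size=(n_persistent, dim))

    def _read(self, q, K, V):
        logits = -self.beta * np.sum((K - q) ** 2, axis=-1)
        w = np.exp(logits - logits.max())
        return (w / w.sum()) @ V

    def forward(self, keys, values, queries):
        out = np.zeros_like(queries)
        for t, q in enumerate(queries):
            lo = max(0, t + 1 - self.window)
            st = self._read(q, keys[lo:t + 1], values[lo:t + 1])   # short-term
            lt = self._read(q, keys[:t + 1], values[:t + 1])       # long-term
            pm = self._read(q, self.pk, self.pv)                   # persistent
            out[t] = st + lt + pm                                  # assumed combination rule
        return out
```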
Bietti et al. (2023) showed that Transformer induction heads require positional encoding and asymmetric query/key maps; MM achieves equivalent behavior with symmetric keys in a single layer.
Olsson et al. (2022) identified the induction head mechanism as central to ICL; MM constructs this mechanism explicitly.
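A toy illustration of that explicit construction: store each token's representation as a key and the *next* token as the value, so querying with the current token retrieves whatever followed its earlier occurrence. (One-hot tokens and the `retrieve` helper are illustrative; the actual model operates on learned features.)

```python
import numpy as np

def retrieve(q, K, V, beta):
    w = np.exp(-beta * np.sum((K - q) ** 2, axis=-1))
    return (w / w.sum()) @ V

vocab = {tok: i for i, tok in enumerate("A B C".split())}
onehot = lambda t: np.eye(len(vocab))[vocab[t]]

seq = "A B C A".split()                          # context "... A B C A", expect "B"
K = np.stack([onehot(t) for t in seq[:-1]])      # keys: token at position t
V = np.stack([onehot(t) for t in seq[1:]])       # values: token at position t+1
pred = retrieve(onehot(seq[-1]), K, V, beta=5.0)
print(max(vocab, key=lambda t: pred[vocab[t]]))  # -> "B": the induction-head pattern
```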
Compressed-memory methods such as RWKV, Mamba, and xLSTM structurally fail on multi-document QA because they cannot fully retain all information.