Towards Flexible Perception with Visual Memory
Conference: ICML 2025
arXiv: 2408.08172
Authors: Robert Geirhos, Priyank Jaini, Austin Stone, Sourabh Medapati, Xi Yi, George Toderici, Abhijit Ogale, Jonathon Shlens (Google DeepMind)
Code: Not open-sourced
Area: Interpretability
Keywords: visual memory, kNN classification, retrieval-based inference, machine unlearning, data attribution, scalable search
TL;DR
The paper shifts the knowledge of deep visual models from being "carved in weights" to being "stored in an external database." A flexible Visual Memory, built from a pre-trained encoder and kNN retrieval, supports plug-and-play data operations (adding, deleting, and scaling) and interpretable classification. It reaches 83.6% top-1 accuracy on ImageNet with retrieval alone and 88.5% with Gemini re-ranking.
Background & Motivation
Core Problem
Current deep learning models statically encode knowledge within millions or billions of parameters, leading to the following limitations:
- Difficult data updates: When new data becomes available or old data needs to be revoked (due to privacy or fairness), retraining or fine-tuning is required.
- Infeasible machine unlearning: Removing the influence of specific training exemplars from a model is extremely difficult.
- Uninterpretable decisions: It is impossible to trace which training data points drove a particular prediction.
- Concept drift: The real world is constantly changing, causing static models to quickly become outdated.
Motivation
The authors advocate for a complete decoupling of "representation" and "memory": using pre-trained models to learn general feature representations, while storing the knowledge required for classification in an external, editable database. This idea aligns with exemplar theory in psychology—which posits that humans recognize objects by comparing them with stored exemplars in memory.
Method
Overall Architecture
The system consists of two modules:
- Feature Encoder \(\Phi\): A frozen pre-trained model (e.g., DinoV2, CLIP) that extracts feature vectors from images.
- Visual Memory Database: Stores \((z_i, y_i)\) pairs, where \(z_i = \Phi(x_i)\) is the feature vector and \(y_i\) is the label.
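Constructing Visual Memory
Given a training set \(\mathcal{D}_{\text{train}} = \{(x_1, y_1), \ldots, (x_N, y_N)\}\) and a pre-trained encoder \(\Phi\), the memory holds the encoded feature of every training image together with its label:
\[\text{VisualMemory} = \{(\Phi(x_1), y_1), \ldots, (\Phi(x_N), y_N)\}\]
Only the feature vectors and labels are stored, not the images themselves; the features occupy approximately 1-3% of the original dataset's storage space.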
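The construction step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `toy_encoder` is a hypothetical stand-in for a frozen encoder such as DinoV2, and the L2 normalization is an assumption added here so that cosine similarity later reduces to a dot product.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumption: features are normalized once at insert time)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def build_memory(dataset, encoder):
    """Encode every (image, label) pair once and keep only (feature, label)."""
    return [(l2_normalize(encoder(x)), y) for x, y in dataset]


def toy_encoder(image):
    """Hypothetical frozen encoder: the toy 'image' is already a short list of numbers."""
    return [float(p) for p in image]


train = [([3.0, 4.0], "cat"), ([0.0, 2.0], "dog")]
memory = build_memory(train, toy_encoder)
print(memory)
```

The encoder is called exactly once per training image; everything downstream operates on the stored pairs only.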
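Retrieval-Based Classification
For a test image \(\tilde{x}\), the feature \(\tilde{z} = \Phi(\tilde{x})\) is extracted, and the \(k\) nearest neighbors \(\{(z_{[1]}, y_{[1]}), \ldots, (z_{[k]}, y_{[k]})\}\) are retrieved from Visual Memory using cosine distance. The neighbors are ordered from nearest to farthest, satisfying:
\[\operatorname{dist}(\tilde{z}, z_{[1]}) \le \operatorname{dist}(\tilde{z}, z_{[2]}) \le \cdots \le \operatorname{dist}(\tilde{z}, z_{[k]})\]
A weighted vote over the neighbors' labels is then conducted to obtain the final classification.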
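As a rough sketch of the retrieval step, the snippet below uses the standard definition of cosine distance, \(1 - \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}\), and a brute-force scan. The function names and the toy two-dimensional features are illustrative assumptions, not taken from the paper.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest(query, memory, k):
    """Return the k closest (distance, label) pairs, nearest first."""
    scored = [(cosine_distance(query, z), y) for z, y in memory]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Toy memory of (feature, label) pairs.
memory = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog"), ([0.9, 0.1], "cat")]
neighbors = k_nearest([1.0, 0.05], memory, k=2)
print(neighbors)
```

The returned list is already sorted by distance, so the position of each neighbor is its rank, which is what a rank-based aggregation needs.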
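Aggregation Strategy: RankVoting (Core Contribution)
The authors observe that existing aggregation methods (Plurality, Distance, Softmax voting) lose accuracy as \(k\) increases, because they assign too much weight to distant neighbors. To address this, the authors propose RankVoting, which weights the \(i\)-th neighbor by its rank alone:
\[w_i = \frac{1}{\alpha + \operatorname{rank}_i}\]
where \(\operatorname{rank}_i\) is the position of the neighbor in the distance ordering and \(\alpha = 2.0\) is an offset. Each neighbor adds \(w_i\) to the score of its label \(y_{[i]}\), and the label with the highest total score is predicted. This power-law weight scheme allows the accuracy to increase monotonically with \(k\) until saturation, without requiring hyperparameter tuning.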
| Aggregation Method | ImageNet Top-1 (DinoV2 ViT-L14) | Trend with respect to \(k\) |
|---|---|---|
| Plurality Voting | ~82.5% | Increases then decreases |
| Distance Voting | ~83.0% | Increases then decreases |
| Softmax Voting (\(\tau=0.07\)) | 83.6% | Increases then decreases |
| RankVoting | 83.6% | Monotonically increases to a plateau |
| + Gemini re-ranking | 88.5% | — |
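A small sketch can make the difference between rank-based and plurality voting concrete. It assumes rank weights of the form \(1/(\alpha + \text{rank})\) with \(\alpha = 2.0\) and ranks starting at 1 (the starting index is an assumption of this sketch); the neighbor labels are made up for illustration.

```python
from collections import Counter, defaultdict


def rank_voting(neighbor_labels, alpha=2.0):
    """Neighbors are ordered nearest first; each votes with weight 1 / (alpha + rank)."""
    scores = defaultdict(float)
    for rank, label in enumerate(neighbor_labels, start=1):
        scores[label] += 1.0 / (alpha + rank)
    return max(scores, key=scores.get)


def plurality_voting(neighbor_labels):
    """Every neighbor counts equally, regardless of how far away it is."""
    return Counter(neighbor_labels).most_common(1)[0][0]


# Two close "cat" neighbors followed by three more distant "dog" neighbors.
labels = ["cat", "cat", "dog", "dog", "dog"]
print(rank_voting(labels), plurality_voting(labels))
```

In this toy case the two nearest neighbors carry a combined weight of 1/3 + 1/4, which exceeds the 1/5 + 1/6 + 1/7 of the three distant ones, so rank voting keeps "cat" while plurality voting flips to "dog".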
Gemini Re-ranking
Feeding the 50 nearest neighbors and their labels into the context of Gemini 1.5 Flash for in-context re-ranking boosts accuracy from 83.6% to 88.5%, surpassing the 86.3% of DinoV2 ViT-L14 linear probing. Gemini alone, without the neighbor information, achieves only 69.6%, indicating that the performance is primarily driven by the retrieved exemplars.
Retrieval Acceleration
- Small scale (ImageNet): Matrix multiplication + argmax on GPU/TPU.
- Large scale (Billion-scale): ScaNN-based approximate nearest neighbor search, requiring approximately 2ms per query on 1M features (500-600 QPS), supporting CPU deployment.
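For the small-scale case, the "matrix multiplication + top-k" idea can be sketched with NumPy on CPU. This is an illustrative assumption about how such a search might be written, not the authors' implementation; on GPU/TPU the same pattern would use an accelerator library, and at billion scale an approximate index such as ScaNN replaces the exhaustive product.

```python
import numpy as np


def top_k_neighbors(queries, memory, k):
    """Indices of the k most similar memory rows for each query, most similar first."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = q @ m.T  # (num_queries, memory_size) cosine similarities
    # argpartition selects the k largest per row without a full sort ...
    part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # ... then only those k candidates are sorted by similarity.
    order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)


memory = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
queries = np.array([[1.0, 0.05], [0.1, 1.0]])
idx = top_k_neighbors(queries, memory, k=2)
print(idx)
```

One matrix product scores all queries against the whole memory at once, which is why this approach maps well onto accelerators while the memory still fits on the device.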
Key Experimental Results
Seven Core Capabilities Validation
1. Flexible Lifelong Learning: Adding OOD New Classes
Directly inserting 64 OOD new classes from the NINCO dataset into the Visual Memory of ImageNet-1k, without any training:
| Scenario | IN-val Acc | NINCO Acc |
|---|---|---|
| IN-train only | 83.6% | — |
| IN-train + NINCO | 83.6% (-0.02%) | 87.4% |
The addition of new classes has almost zero impact on the performance of the original classes (no catastrophic forgetting), while achieving 87%+ accuracy on the OOD classes.
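The mechanics of adding a class without training can be illustrated with a toy memory. The sketch below uses a plain 1-nearest-neighbor rule for brevity, and the class names and three-dimensional features are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def predict(query, memory):
    """Label of the single most similar memory entry (1-NN, for illustration only)."""
    return max(memory, key=lambda entry: cosine_similarity(query, entry[0]))[1]


memory = [([1.0, 0.0, 0.0], "cat"), ([0.0, 1.0, 0.0], "dog")]
novel_query = [0.2, 0.1, 1.0]
before = predict(novel_query, memory)  # forced into one of the known classes

# "Learning" the new class is just appending its features; no weights change.
memory.extend([([0.0, 0.0, 1.0], "axolotl"), ([0.1, 0.0, 0.9], "axolotl")])
after = predict(novel_query, memory)
old_class = predict([0.9, 0.1, 0.0], memory)  # an original class is still recognized
print(before, after, old_class)
```

Because existing entries are never modified, queries that were already close to an original class keep their neighbors, which is consistent with the near-zero change in IN-val accuracy reported above.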
2. Flexible Computation-Memory Trade-off
The performance of different models across varying memory sizes exhibits a consistent pattern: small model + large memory \(\approx\) large model + small memory. For instance, DinoV2 S/14 with a 1.28M memory \(\approx\) L/14 with a ~70K memory.
3. Training-Free Billion-Scale Scaling
A billion-scale Visual Memory is constructed from a subset of JFT-3B, with pseudo-labels generated by ViT-22B. As the memory grows, the ImageNet validation error continues to decrease, following a roughly linear trend in log-log space.
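A straight line in log-log space corresponds to a power law. Writing \(E(N)\) for the validation error at memory size \(N\), and using \(a > 0\) and \(b > 0\) as hypothetical fit constants (their values are not given in this summary), the relationship reads:

\[\log E(N) \approx \log a - b \log N \quad\Longleftrightarrow\quad E(N) \approx a\, N^{-b}\]

Under this form, every tenfold increase in memory size multiplies the error by the same factor \(10^{-b}\), which is why gains continue but shrink in absolute terms as the memory grows.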
4. Enhanced OOD Generalization
| Dataset | Linear Probe | IN Memory | JFT Memory | + Gemini |
|---|---|---|---|---|
| IN-A | 71.3 | 58.8 | 61.1 | 69.6 |
| IN-R | 74.4 | 62.8 | 73.7 | 81.4 |
| IN-Sketch | 59.3 | 61.5 | 68.0 | 75.0 |
| IN-V2 | 78.0 | 75.6 | 77.6 | 82.3 |
| IN-ReaL | 89.5 | 87.1 | 88.2 | 90.5 |
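JFT memory clearly outperforms linear probing on IN-Sketch (68.0 vs. 59.3) and stays within about 1.5 points of it on IN-R, IN-V2, and IN-ReaL, but trails on IN-A (61.1 vs. 71.3). Gemini re-ranking gives the best result on every benchmark except IN-A, where linear probing remains ahead (71.3 vs. 69.6).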
5. Machine Unlearning: Perfect Guarantees
Deleting data points from the Visual Memory achieves perfect unlearning (<20ms per sample), excelling on all three core metrics:
- Efficiency: Deleting a sample takes <20ms vs. hours of retraining.
- Model Utility: Performance on the remaining data is completely unaffected (by design).
- Unlearning Quality: 100% complete unlearning (by design).
Prerequisite: The encoder must be trained on a safe general dataset, and the data to be forgotten must only reside in the Visual Memory.
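Under that prerequisite, unlearning reduces to a database delete. The following sketch assumes each memory entry carries a hypothetical `id` field so that specific samples can be targeted; the identifiers and features are invented.

```python
def forget(memory, ids_to_forget):
    """Return a new memory without the listed entries; nothing is retrained."""
    drop = set(ids_to_forget)
    return [entry for entry in memory if entry["id"] not in drop]


memory = [
    {"id": "img_001", "feature": [1.0, 0.0], "label": "cat"},
    {"id": "img_002", "feature": [0.9, 0.1], "label": "cat"},
    {"id": "img_003", "feature": [0.0, 1.0], "label": "dog"},
]
remaining = forget(memory, ["img_002"])
print([entry["id"] for entry in remaining])
```

Once an entry is gone it can never be retrieved as a neighbor, so it cannot influence any later prediction, while the entries that remain are byte-for-byte unchanged.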
6. Fine-grained Incremental Data Addition (iNaturalist)
Simulating the discovery of new species on iNaturalist21 (10,000 species): adding only 5-10 images of a new species significantly improves the species-level accuracy, and the improvement propagates upward to higher taxonomic levels such as genus and family.
7. Memory Pruning and Interpretable Decision Making
Supports memory compression by removing low-quality/redundant samples, as well as tracing the data source of each prediction (data attribution), achieving interpretable classification.
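Data attribution falls out of retrieval almost for free: the neighbors that voted for the predicted label are the explanation. The sketch below assumes unit-length features, hypothetical `id` fields, and rank weights of the form \(1/(\alpha + \text{rank})\); it is an illustration rather than the paper's procedure.

```python
def predict_with_attribution(query, memory, k=3, alpha=2.0):
    """Return (label, supporters), where supporters are the (id, weight) pairs behind the label."""
    def similarity(entry):
        # Dot product equals cosine similarity because features are assumed unit length.
        return sum(q * z for q, z in zip(query, entry["feature"]))

    ranked = sorted(memory, key=similarity, reverse=True)[:k]
    votes = {}
    for rank, entry in enumerate(ranked, start=1):
        votes.setdefault(entry["label"], []).append((entry["id"], 1.0 / (alpha + rank)))
    label = max(votes, key=lambda name: sum(w for _, w in votes[name]))
    return label, votes[label]


memory = [
    {"id": "a", "feature": [1.0, 0.0], "label": "cat"},
    {"id": "b", "feature": [0.8, 0.6], "label": "cat"},
    {"id": "c", "feature": [0.0, 1.0], "label": "dog"},
]
label, supporters = predict_with_attribution([1.0, 0.0], memory)
print(label, supporters)
```

Each returned id points back to a concrete stored sample, so a surprising prediction can be inspected, and an offending sample removed, without touching the encoder.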
Limitations & Future Work
- Incomplete Decoupling of Representation and Knowledge: If the encoder itself was trained on the data that needs to be forgotten, removing it from "memory" does not fully eliminate its influence.
- Evaluation Limited to Classification: Visual Memory schemes for other vision tasks such as detection, segmentation, and generation remain unexplored.
- High Cost of Gemini Re-ranking: The best result of 88.5% relies on VLM API calls, whose inference cost and latency make it impractical for real-time applications.
- Fixed Feature Space: The frozen encoder means the quality of representation itself cannot be improved via memory data.
- Retrieval Bottleneck: The storage and retrieval overhead of a billion-scale memory remains impractical on edge devices.
- Pseudo-label Noise: The JFT billion-scale experiments rely on pseudo-labels from ViT-22B, where label quality directly bounds performance.
- No Multimodal Extension Discussed: The method uses image features only and does not explore integrating text or other modalities into the memory.
Related Work & Insights
- Exemplar Theory \(\to\) Visual Memory: A technical bridge from the ALCOVE model in cognitive science to kNN classification.
- Retrieval-Augmented Learning: Echoing the success of kNN-LM (Khandelwal et al., 2019) in NLP.
- ScaNN Approximate Search: Engineering enablement for billion-scale retrieval.
- Insights: A systematic exploration of migrating RAG concepts from NLP to computer vision, providing strong arguments for "knowledge representation beyond model parameters."
Highlights & Insights
The core value of this paper lies in its systematic approach. Rather than offering a single technological breakthrough, it uses a minimal framework (pre-trained encoder + kNN + database) to address, within one unified design, several long-standing challenges in deep learning: lifelong learning, machine unlearning, OOD generalization, and interpretability. RankVoting is simple but effective because it corrects the tendency of existing voting strategies to over-weight distant neighbors. The Gemini re-ranking result is impressive (88.5%, surpassing linear probing), though its practicality remains questionable.
The biggest limitation is that this scheme depends entirely on the quality of the pre-trained encoder: if DinoV2's representations cannot distinguish certain classes, no amount of memory will help. Furthermore, tasks other than classification remain untested, leaving a long way to go towards a "general Visual Memory."
The paper is exceptionally well-written, with a compelling narrative and exhaustive, meticulous experiments.
Rating
- Novelty: ⭐⭐⭐ (The idea is not entirely new, but the systematic demonstration is highly valuable)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Validation across seven core capabilities, billion-scale data, and multiple datasets)
- Writing Quality: ⭐⭐⭐⭐⭐ (Fluid narrative and clear motivation)
- Value: ⭐⭐⭐⭐ (Inspiring for the paradigm of knowledge representation)