Separating Knowledge and Perception with Procedural Data
Conference: ICML2025
arXiv: 2508.11697
Code: TBD
Area: Image Segmentation
Keywords: procedural data, visual memory, KNN classification, semantic segmentation, data privacy, differential privacy, self-supervised learning
Authors: Adrián Rodríguez-Muñoz, Manel Baradad, Phillip Isola, Antonio Torralba (MIT)
TL;DR
Training a visual representation model solely on procedurally generated (non-real) images and injecting real-world knowledge through a visual memory (a KNN retrieval database) approaches the performance of models trained on real data in classification and segmentation tasks, while keeping all real-world data fully controllable (privacy protection and efficient forgetting).
Background & Motivation
Modern vision models "digest" images into weights through gradient descent, which introduces three major issues:
- Privacy & Bias: Weights store knowledge in a black-box manner, making it difficult to trace or delete specific data (such as human faces or sensitive medical images);
- Difficult Data Forgetting: When laws require the deletion of certain data, the entire model must be retrained, which is extremely costly;
- Inflexible Knowledge Editing: Adding, deleting, or updating knowledge requires fine-tuning or retraining.
Prior work proposed the concept of visual memory: using KNN retrieval to replace parametric classifiers so that knowledge is stored as a database, facilitating additions and deletions. However, the limitation of prior work is that the feature extractor itself is still trained on real data, meaning knowledge and perception are not truly separated.
The key insight of this paper: train the feature extractor with procedural data, so the model never encounters real images. Procedural data consists of non-real images generated by simple code (OpenGL shaders), carrying minimal privacy risks. Consequently, all real-world data exists solely in the visual memory, achieving a complete separation of knowledge and perception.
Method
Overall Architecture
The system consists of three phases:
1. Training Phase: Train a ViT-S feature extractor using procedural data with a DINO self-supervised objective;
2. Knowledge Injection: Construct a visual memory database of real image embeddings (without additional training);
3. Inference Phase: Extract features from a query image, perform KNN retrieval in the visual memory, and output the majority label.
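The knowledge-injection and inference phases can be sketched as a small in-memory database. This is a minimal illustration, not the paper's implementation: the class name `VisualMemory`, the cosine-similarity metric, the value of `k`, and the 2-D vectors standing in for ViT-S embeddings are all assumptions.

```python
import numpy as np
from collections import Counter


class VisualMemory:
    """Toy visual memory: stores (embedding, label) pairs and answers by KNN vote."""

    def __init__(self, dim):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def add(self, embeddings, labels):
        # Knowledge injection: append L2-normalised embeddings; no training step.
        e = np.asarray(embeddings, dtype=np.float32)
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        self.embeddings = np.vstack([self.embeddings, e])
        self.labels.extend(labels)

    def remove(self, index):
        # Forgetting: drop one stored item; the feature extractor is untouched.
        self.embeddings = np.delete(self.embeddings, index, axis=0)
        del self.labels[index]

    def predict(self, query, k=3):
        # Inference: cosine similarity to every stored item, then majority vote.
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.embeddings @ q
        top = np.argsort(-sims)[:k]
        votes = Counter(self.labels[i] for i in top)
        return votes.most_common(1)[0][0]


# Hypothetical 2-D "embeddings" standing in for features of real images.
memory = VisualMemory(dim=2)
memory.add([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]],
           ["cat", "cat", "dog", "dog"])
pred_before = memory.predict([1.0, 0.0], k=3)

# Deleting both "cat" entries changes the answer without any retraining.
memory.remove(0)
memory.remove(0)
pred_after = memory.predict([1.0, 0.0], k=3)
```

In this toy setup the query is first labelled from its stored neighbours, and the prediction changes as soon as those neighbours are deleted, which is the editing behaviour the architecture is designed for.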
Key Designs: Shaders KML / Shaders KML Mixup
The previous best procedural dataset is Shaders Mixup (Baradad et al., 2022), which uses constant masks for Mixup in pixel space to alleviate short-cut issues. This work proposes two improvements:
Shaders KML (K-Means Leaves):
1. Sample three shader images \(s_1, s_2, s_3\);
2. Perform KMeans clustering on \(s_1\) in the RGB space to extract a data-driven mixing mask \(m\);
3. Blend \(s_2\) and \(s_3\) using \(m\) to obtain the final sample.
The key difference is that the mask is extracted from the data itself rather than being a fixed constant, which significantly increases the diversity of the dataset. Prior studies have demonstrated that diversity is the most critical driver of performance for procedural data.
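The three Shaders KML steps can be illustrated with a short NumPy sketch. Several details here are assumptions rather than the paper's settings: the number of clusters `k`, the random assignment of each cluster to \(s_2\) or \(s_3\), the plain Lloyd's k-means loop, and the random arrays standing in for rendered shader images.

```python
import numpy as np


def kmeans_labels(pixels, k, rng, iters=10):
    # Plain Lloyd's algorithm on an (N, 3) array of RGB values.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels


def kml_sample(s1, s2, s3, k=4, seed=0):
    rng = np.random.default_rng(seed)
    h, w, _ = s1.shape
    # Step 2: cluster the pixels of s1 in RGB space.
    labels = kmeans_labels(s1.reshape(-1, 3).astype(np.float64), k, rng)
    # Assumption: each cluster is randomly assigned to show s2 or s3.
    pick_s2 = rng.integers(0, 2, size=k).astype(bool)
    mask = pick_s2[labels].reshape(h, w, 1)
    # Step 3: blend s2 and s3 with the data-driven mask.
    return np.where(mask, s2, s3), mask[..., 0]


# Random images stand in for shader renders in this sketch.
demo_rng = np.random.default_rng(1)
s1, s2, s3 = (demo_rng.random((32, 32, 3)) for _ in range(3))
sample, mask = kml_sample(s1, s2, s3)
```

Because the mask is derived from \(s_1\), a different \(s_1\) yields a different partition of the image, which is where the added diversity comes from.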
Shaders KML Mixup: Superimposes standard Mixup on top of Shaders KML to further suppress short-cut solutions, achieving a new SOTA.
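For reference, standard Mixup forms a convex combination of two images with a weight drawn from a Beta distribution. The sketch below is illustrative only: the `alpha` value and the constant arrays standing in for two Shaders KML samples are assumptions.

```python
import numpy as np


def mixup(a, b, alpha=1.0, rng=None):
    # Standard Mixup: lam ~ Beta(alpha, alpha), output = lam * a + (1 - lam) * b.
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * a + (1.0 - lam) * b, lam


# Constant images stand in for two Shaders KML samples.
img_a = np.zeros((8, 8, 3))
img_b = np.ones((8, 8, 3))
mixed, lam = mixup(img_a, img_b, rng=np.random.default_rng(0))
```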
Training Details
- Backbone: ViT-S (Vision Transformer Small)
- Training Objective: DINO's local-to-global similarity (learning consistent representations of local and global views)
- On real data, the DINO objective guides the model to learn similar representations for different parts of the same object; on procedural data, it learns the part similarity of abstract shapes and textures.
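As a rough illustration of the local-to-global objective listed above, the sketch below computes a DINO-style cross-entropy between teacher outputs on global views and student outputs on all views. The array shapes, the temperatures, the zero center, and the random logits standing in for network outputs are assumptions for illustration, not the paper's training code.

```python
import numpy as np


def softmax(x, temp):
    # Temperature-scaled softmax over the last axis, stabilised by max subtraction.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """student_out: (V, B, K) logits for all V views, global views first.
    teacher_out: (G, B, K) logits for the G global views only."""
    # Teacher targets are centered and sharpened; they act as fixed targets.
    teacher_probs = softmax(teacher_out - center, teacher_temp)
    student_logp = np.log(softmax(student_out, student_temp) + 1e-12)
    total, pairs = 0.0, 0
    for g in range(teacher_out.shape[0]):
        for v in range(student_out.shape[0]):
            if v == g:
                continue  # never pair a global view with itself
            total += -(teacher_probs[g] * student_logp[v]).sum(axis=-1).mean()
            pairs += 1
    return total / pairs


# Random logits: 2 global + 2 local views, batch of 8, 16 output dimensions.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 8, 16))
teacher_out = rng.normal(size=(2, 8, 16))
loss = dino_loss(student_out, teacher_out, center=np.zeros(16))
```

Minimising this quantity pushes the student's output on a local crop toward the teacher's output on a global crop of the same image, which is what "local-to-global similarity" refers to.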
Differential Privacy Analysis
Defines \(\epsilon\)-differential privacy in the standard form: an algorithm \(\mathcal{A}\) is \(\epsilon\)-differentially private if, for any datasets \(D_1, D_2\) differing in only a single sample \(x\) and any set of outputs \(S\), it satisfies:
\[\Pr[\mathcal{A}(D_1) \in S] \le e^{\epsilon} \, \Pr[\mathcal{A}(D_2) \in S]\]
For deterministic algorithms (such as KNN), this simplifies to requiring that predictions on all test inputs remain identical with or without sample \(x\). Under the architecture of procedural embeddings + visual memory, one only needs to compare whether the KNN prediction changes with or without a certain real image, avoiding any model retraining.
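The with-or-without comparison for a deterministic KNN classifier can be sketched as a leave-one-out check. The function names `knn_predict` and `is_private`, the cosine-similarity metric, and the tiny 2-D memory are hypothetical and chosen only for illustration.

```python
import numpy as np
from collections import Counter


def knn_predict(memory, labels, queries, k):
    # Deterministic cosine-similarity KNN with majority vote
    # (all vectors are assumed to be L2-normalised already).
    sims = queries @ memory.T
    top = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    return [Counter(labels[i] for i in row).most_common(1)[0][0] for row in top]


def is_private(memory, labels, queries, idx, k=3):
    """True if deleting memory item `idx` leaves every query prediction unchanged."""
    keep = np.arange(len(memory)) != idx
    kept_labels = [lab for lab, m in zip(labels, keep) if m]
    with_x = knn_predict(memory, labels, queries, k)
    without_x = knn_predict(memory[keep], kept_labels, queries, k)
    return with_x == without_x


# Toy memory of unit vectors and a single test query.
memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
labels = ["a", "b", "b"]
queries = np.array([[1.0, 0.0]])
flags = [is_private(memory, labels, queries, i, k=1) for i in range(len(memory))]
```

Here only the first item is non-private: it is the nearest neighbour of the query, so removing it flips the prediction, while removing either of the other two items changes nothing.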
Key Experimental Results
Visual Similarity (NIGHTS Dataset)
The alignment of the best procedural model, Shaders KML, with human judgement reaches 82.4%, which is only 0.9% lower than that of the Places model. White-box metrics like PSNR/SSIM are close to random.
KNN Classification
| Data Type | Dataset | Flowers102 | CUB200 | Food101 | ImageNet-1K |
|---|---|---|---|---|---|
| Real | Places | 59.51 | 19.09 | 47.78 | 47.30 |
| Procedural | S. KML Mixup | 75.20 | 27.08 | 48.70 | 37.88 |
| White-box | Random init. | 11.18 | 1.93 | 5.32 | 1.84 |
Key Findings: In fine-grained classification, the procedural model outperforms the Places model by about 15.7 points on Flowers102, 8.0 points on CUB200, and 0.9 points on Food101. This is because the semantic capacity of the Places model is occupied by scene knowledge, whereas the procedural model learns domain-agnostic visual skills. On ImageNet-1K the procedural model trails the Places model by about 9.4 points (37.88 vs. 47.30).
Zero-Shot Semantic Segmentation (COCO)
| Data Type | Dataset | \(R^2\) |
|---|---|---|
| Real | ImageNet | 63.7 |
| Real | Places | 62.1 |
| Procedural | S. KML | 55.9 |
| Procedural | S. KML Mixup | 53.7 |
| White-box | Random init. | 36.7 |
The \(R^2\) score of the best procedural model (S. KML, 55.9) on COCO trails the real-data models by roughly 6 to 8 points (62.1 for Places, 63.7 for ImageNet), while remaining well above the random initialization baseline (36.7).
Medical Data (MedMNIST)
The procedural model matches or exceeds the best results of standard trained ResNets from the original paper on 7 out of 10 MedMNIST datasets. This holds great value for medical privacy scenarios.
Model Scaling
When scaling ViT from S to larger sizes, the procedural model does not overfit—greater capacity leads to higher performance, demonstrating robust generalization.
Privacy Analysis
On ImageNet, fewer than 0.6% of the training samples are non-private (that is, their deletion would alter at least one test prediction), and accuracy is linearly related to the proportion of non-private samples.
Highlights & Insights
- Counter-Intuitive Strong Results: Models that never digest real images surprisingly outperform models trained on real-world scene data in fine-grained classification tasks, indicating that domain-agnostic visual perception skills themselves are extremely valuable.
- An Elegant Privacy Solution: Restricting all real data to an erasable and appendable database transforms privacy protection and data forgetting into \(O(1)\) operations.
- Gestalt Analysis: It is discovered that neither real- nor procedurally-trained vision models possess Gestalt perception capabilities, revealing an essential gap between current vision models and human perception.
- Practical Storage/Computation/Accuracy Trade-off Analysis: The training cost of the KNN method is only 1/64 of a linear classifier, and storing the embeddings of the entire ImageNet in memory requires only ~2GiB.
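The ~2GiB storage figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes one float32 embedding of width 384 (the ViT-S hidden size) per ImageNet-1K training image; the paper's exact storage format is not stated here, so treat these inputs as assumptions.

```python
num_images = 1_281_167   # ImageNet-1K training images
dim = 384                # ViT-S embedding width
bytes_per_value = 4      # float32 (assumed storage precision)

total_bytes = num_images * dim * bytes_per_value
gib = total_bytes / 2**30
print(f"{total_bytes} bytes = {gib:.2f} GiB")
```

Under these assumptions the memory comes to about 1.83 GiB, consistent with the rounded ~2GiB figure.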
Limitations & Future Work
- Part-Whole Problem: Since the procedural model has never seen real objects, it cannot associate visually distinct parts of the same object (such as a bicycle's hub and spokes) as a coherent whole, leading to spurious matching during KNN semantic segmentation. This is the primary driver of the performance gap.
- Only Verified on ViT-S: Although scaling up shows no sign of overfitting, complete experiments on ViT-B/L are missing.
- Inference Latency: KNN inference over a large memory is slower than a parametric classifier (about 2x slower in the naive implementation). Optimizations such as Faiss can reduce this to <0.03ms/query, but they add engineering complexity.
- Limitations on Segmentation Tasks: While zero-shot PCA segmentation is decent, KNN semantic segmentation performs poorly due to "overly local" representations, and no solution has been proposed.
- Ceiling of Procedural Data: The current best procedural data still originates from OpenGL shaders, and the upper bound of diversity in the generation process remains unclear.
Related Work & Insights
- Visual Memory (Geirhos et al., 2024; Nakata et al., 2022): Utilizing KNN retrieval to replace classifiers. This work builds on this by introducing procedural embeddings to achieve complete separation.
- Procedural Data (Baradad et al., 2021/2022; Kataoka et al., 2020): Learning representations from noise/fractals/shaders. This work extends this to segmentation tasks and proposes a new KML data augmentation technique.
- Differentially Private SGD (Abadi et al., 2016): Traditional approaches inject noise during training, whereas this work bypasses this requirement through architectural design.
Rating
- Novelty: ⭐⭐⭐⭐ — The combination of procedural data and visual memory is neat and elegant, with technical contributions from the KML mask generation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers multiple dimensions including similarity, classification, segmentation, medical tasks, privacy, and Gestalt perception, with extensive quantitative and qualitative analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clearly structured, deeply analyzed, with pragmatic trade-off discussions.
- Value: ⭐⭐⭐⭐ — Holds practical significance for privacy-sensitive scenarios (e.g., medical, facial data), though the part-whole problem limits general applicability.