The Invisible Gorilla Effect in Out-of-distribution Detection¶
Conference: CVPR 2026 · arXiv: 2602.20068 · Code: Available · Area: Medical Imaging · Keywords: OOD detection, out-of-distribution detection bias, visual similarity, medical imaging safety, feature space analysis
TL;DR¶
This paper reveals a previously unreported bias in OOD detection — the "Invisible Gorilla Effect": detection performance is substantially higher when OOD artifacts are visually similar to the model's region of interest (ROI), and degrades significantly when they are dissimilar, with feature-based OOD methods being most severely affected.
Background & Motivation¶
1. State of the Field¶
DNNs have achieved expert-level accuracy in high-stakes settings such as medical imaging and autonomous driving, yet suffer severe performance degradation when encountering out-of-distribution (OOD) data. OOD detection methods, which aim to identify and reject unreliable predictions, have become a regulatory necessity in AI-assisted medicine (both the U.S. FDA and EU AI regulations mandate that ML systems handle OOD inputs).
2. Limitations of Prior Work¶
Prior studies have observed that OOD detection performance varies considerably across artifact types, but the underlying reasons for this variation have not been systematically investigated. In real-world deployment, the types of OOD inputs a model may encounter cannot be anticipated in advance, making it essential to develop detection methods that generalize across diverse distribution shifts.
3. Root Cause¶
The conventional assumption holds that OOD detection difficulty increases monotonically with the similarity between a sample and the training distribution — near-OOD is harder to detect, far-OOD is easier. This paper shows that this assumption does not always hold: there exists a counterintuitive regime in which OOD samples that are visually more similar to the ROI are actually easier to detect.
4. Core Problem¶
To systematically identify, quantify, and explain the bias by which visual similarity influences OOD detection performance, and to evaluate potential mitigation strategies.
5. Starting Point¶
Color similarity is used as a controlled variable (color artifacts are common and can be varied independently of shape and texture) across large-scale experiments in medical imaging (skin lesion classification, chest X-rays) and industrial inspection (MVTec). The authors draw inspiration from the cognitive psychology "invisible gorilla" experiment — subjects focused on counting passes among players in white shirts tend to miss a person in a black gorilla costume walking through the scene, yet would notice the gorilla far more readily if it were wearing white.
6. Core Idea¶
The Invisible Gorilla Effect: OOD detection methods tend to detect artifacts that are visually similar to the model's ROI, while "overlooking" those that are dissimilar. This occurs because, in feature-based methods, color variation is distributed primarily along high-variance directions in latent space — precisely the directions that methods such as Mahalanobis distance down-weight.
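To see why down-weighting high-variance directions produces this blind spot, consider a toy two-dimensional example (ours, not the paper's): Mahalanobis distance divides each deviation by the variance along its axis, so a shift of identical Euclidean size scores tiny on a high-variance axis and huge on a low-variance one.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-distribution features: high variance on axis 0, low variance on axis 1.
X = rng.normal(size=(10_000, 2)) * np.array([10.0, 0.1])
mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# The same Euclidean shift of 5.0, placed along different axes:
print(mahalanobis(np.array([5.0, 0.0])))  # ~0.5  (high-variance axis: "overlooked")
print(mahalanobis(np.array([0.0, 5.0])))  # ~50   (low-variance axis: flagged)
```

If color variation happens to live along the high-variance axis, even a strongly off-color artifact yields only a small score, which is exactly the regime the paper calls the Invisible Gorilla Effect.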
Method¶
Overall Architecture¶
This work is a systematic empirical study rather than a proposal for a new OOD detection method. The research framework consists of:
- Controlled experiment construction: Artifact colors are annotated across 11,355 images from three public datasets (CheXpert chest X-rays, ISIC dermoscopy, MVTec industrial inspection).
- Large-scale evaluation: 40 OOD detection methods × 3,795 hyperparameter configurations × 7 benchmarks × 3 network architectures (ResNet18, VGG16, ViT-B/32) × 25 random seeds.
- Mechanistic analysis: PCA-based latent space analysis to explain the root cause of the effect.
- Mitigation evaluation: Two mitigation strategies are assessed — color jitter augmentation and subspace projection.
Key Designs¶
1. Similar/Dissimilar Grouping and Color Annotation¶
- Function: OOD artifacts are partitioned into "similar" and "dissimilar" groups based on their color similarity to the model's ROI.
- Mechanism: SAM is used to segment the ROI and artifact regions; mean RGB values are computed for each, and a Euclidean-distance threshold in RGB space decides the grouping (see the sketch after this list). For example, in ISIC the mean ROI RGB is (176, 116, 77); red ink is classified as similar, while black/green/purple ink is classified as dissimilar.
- Design Motivation: Color is a variable that can be controlled independently of shape and texture, and color artifacts are extremely common in medical imaging.
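A minimal sketch of the grouping rule, assuming ROI and artifact masks are already in hand (e.g., from SAM); the function name and the threshold value are illustrative, since the paper's exact cutoff is not reproduced here.

```python
import numpy as np

def artifact_group(image, roi_mask, artifact_mask, threshold=60.0):
    """Label an artifact 'similar'/'dissimilar' to the ROI by mean-RGB distance.

    image: (H, W, 3) uint8; roi_mask / artifact_mask: (H, W) bool.
    threshold is a hypothetical value, not the paper's setting.
    """
    roi_rgb = image[roi_mask].mean(axis=0)        # e.g., (176, 116, 77) on ISIC
    art_rgb = image[artifact_mask].mean(axis=0)
    dist = float(np.linalg.norm(roi_rgb - art_rgb))
    return ("similar" if dist < threshold else "dissimilar"), dist
```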
2. Color-Swapped Counterfactual Generation¶
- Function: Counterfactual images with swapped colors are generated for ISIC color-chart data — similar-colored charts (red/orange/yellow) are recolored to black, and dissimilar-colored charts (green/blue/black/gray) are recolored to the mean skin lesion color.
- Mechanism: Segmentation masks are used to recolor regions via per-channel mean shifting, preserving pixel-level variance and texture (sketched after this list).
- Design Motivation: This controls for dataset-level confounds (e.g., different-colored artifacts appearing at different spatial locations), ensuring the observed effect is attributable to color similarity rather than other factors.
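A sketch of the recoloring step described in the mechanism above; the helper name is ours, but the operation follows the stated per-channel mean shift.

```python
import numpy as np

def recolor_region(image, mask, target_rgb):
    """Shift the masked region's mean color to target_rgb, channel by channel.

    Per-pixel deviations from the regional mean are left untouched, so texture
    and pixel-level variance survive; only the mean color moves.
    """
    out = image.astype(np.float32).copy()
    region = out[mask]                                   # (N, 3) masked pixels
    shift = np.asarray(target_rgb, np.float32) - region.mean(axis=0)
    out[mask] = region + shift
    return np.clip(out, 0, 255).astype(np.uint8)

# e.g., recolor a green color chart toward the mean lesion color:
# swapped = recolor_region(img, chart_mask, target_rgb=(176, 116, 77))
```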
3. ROI Causal Validation (CheXpert Counterfactual Experiment)¶
- Function: The cardiac region in chest X-rays is altered from high to low brightness during training, after which detection performance on synthetic OOD patches of varying brightness is evaluated (a brief sketch follows this list).
- Mechanism: If the effect is truly ROI-driven, altering the ROI's appearance should reverse the direction of detection performance trends.
- Design Motivation: Causal validation — confirming that the effect is driven by the model's learned representation of the ROI rather than by other confounding factors.
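A brief sketch of the kind of ROI manipulation this experiment relies on, assuming a cardiac-region mask is available; the scaling factor is illustrative.

```python
import numpy as np

def rescale_region_brightness(image, mask, factor=0.5):
    """Scale brightness inside the masked region (e.g., the cardiac silhouette).

    Training on images with factor < 1 turns a bright ROI into a dark one; if
    the effect is ROI-driven, the similar/dissimilar detection gap should flip.
    """
    out = image.astype(np.float32).copy()
    out[mask] *= factor
    return np.clip(out, 0, 255).astype(np.uint8)
```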
4. Subspace Attribution Analysis (PCA Mechanistic Explanation)¶
- Function: PCA is applied to hidden-layer features; the discriminability \(I_k\) of each principal component for distinguishing similar from dissimilar artifacts is computed and correlated (via Spearman's \(\rho\)) with the component's variance \(\lambda_k\).
- Mechanism: If color-sensitive directions are aligned with high-variance directions (positive correlation), this explains why methods such as Mahalanobis distance, which down-weight high-variance directions, effectively "overlook" dissimilar artifacts (a sketch of the analysis follows this list).
- Design Motivation: To provide a mechanistic explanation from the geometry of the feature space.
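A sketch of this analysis under one plausible reading: we take per-component AUROC between the two artifact groups as the discriminability \(I_k\) (the paper's exact definition of \(I_k\) may differ) and correlate it with the explained variance \(\lambda_k\).

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def subspace_attribution(feats_id, feats_similar, feats_dissimilar):
    """Correlate each PC's artifact discriminability I_k with its variance lambda_k."""
    pca = PCA().fit(feats_id)                   # principal axes of the ID feature cloud
    z_sim = pca.transform(feats_similar)
    z_dis = pca.transform(feats_dissimilar)
    labels = np.r_[np.zeros(len(z_sim)), np.ones(len(z_dis))]

    I = []
    for k in range(z_sim.shape[1]):
        auc = roc_auc_score(labels, np.r_[z_sim[:, k], z_dis[:, k]])
        I.append(max(auc, 1.0 - auc))           # direction-agnostic separability
    rho, p = spearmanr(I, pca.explained_variance_)
    return np.array(I), rho, p                  # rho > 0: color variation sits along
                                                # high-variance principal components
```

A positive Spearman \(\rho\) here is the paper's smoking gun: the very directions Mahalanobis-style scores suppress are the ones carrying the color signal.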
Loss & Training¶
This work is primarily analytical and does not propose a new training methodology. Key training details include:
- Task models are trained with standard cross-entropy loss, using 25 random seeds × 5-fold cross-validation.
- One mitigation strategy employs color jitter augmentation (mild: brightness/contrast/saturation = 0.2; strong = 0.8).
- The subspace projection mitigation strategy is defined as \(F_\perp = (I - UU^\top)F\), where \(U\) spans the subspace formed by the top \(k=5\) principal components with the highest color sensitivity.
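A minimal sketch of both mitigations, assuming row-major features and an orthonormal basis \(U\) whose columns are the \(k=5\) most color-sensitive principal components; the torchvision lines are our reading of the stated jitter settings.

```python
import numpy as np
import torchvision.transforms as T

def project_out_nuisance(F, U):
    """F_perp = (I - U U^T) F: strip the color-sensitive subspace from features.

    F: (N, d) feature matrix (one row per sample); U: (d, k) orthonormal basis
    of the most color-sensitive principal components from the PCA analysis.
    """
    return F - (F @ U) @ U.T        # row-wise equivalent of (I - U @ U.T) @ f

# Color jitter baselines as stated in the paper (mild / strong):
mild_jitter = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)
strong_jitter = T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8)
```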
Key Experimental Results¶
Main Results¶
Table 1: ISIC Benchmark Key Results (ResNet18, 40 methods, AUROC %)
| Method Category | Representative Method | Ink-Similar | Ink-Dissimilar | Chart-Similar | Chart-Dissimilar | Mean Δ (pp) |
|---|---|---|---|---|---|---|
| Feature-based | Mahalanobis | 77.0 | 63.6 | 96.7 | 95.4 | 7.3 |
| Feature-based | KNN | 85.7 | 70.1 | 91.3 | 90.6 | 8.2 |
| Feature-based | FeatureNorm | 75.1 | 52.9 | 62.4 | 58.1 | 13.2 |
| Confidence-based | MCP | 69.8 | 68.7 | 57.5 | 55.4 | 1.6 |
| Confidence-based | ODIN | 72.8 | 72.4 | 59.7 | 57.0 | 1.6 |
| External | RealNVP | 84.0 | 65.6 | 96.1 | 94.2 | 10.1 |
Headline number: on the ISIC benchmark, Mahalanobis distance achieves a 31.5% higher AUROC when detecting red ink (similar to the ROI) than when detecting black ink (dissimilar).
Table 2: MVTec Benchmark Key Results (ResNet18, AUROC %)
| Method | Pill-Similar | Pill-Dissimilar | Metal Nut-Similar | Metal Nut-Dissimilar | Mean Δ (pp) |
|---|---|---|---|---|---|
| KNN | 93.3 | 86.2 | 71.0 | 36.9 | 20.6 |
| Mahalanobis | 71.9 | 68.7 | 69.8 | 58.3 | 7.3 |
| MCP | 78.5 | 78.3 | 58.8 | 45.3 | 6.8 |
| GradNorm | 80.1 | 79.1 | 60.3 | 59.8 | 0.8 |
Ablation Study¶
Mitigation Strategy Comparison (ISIC Ink Benchmark, ResNet18):
| Strategy | Method | Similar AUROC | Dissimilar AUROC | Similar − Dissimilar Gap |
|---|---|---|---|---|
| No augmentation | Mahalanobis | 77.0 | 63.6 | 13.4 pp |
| Subspace projection | Mahalanobis+Proj | 77.5 | 75.8 | 1.7 pp ↓↓ |
| No augmentation | FeatureNorm | 75.1 | 52.9 | 22.2 pp |
| Subspace projection | FeatureNorm+Proj | 75.3 | 74.5 | 0.8 pp ↓↓ |
| No augmentation | NAN | 75.6 | 48.5 | 27.1 pp |
| Subspace projection | NAN+Proj | 75.3 | 76.8 | −1.5 pp ↓↓ |
| Mild color jitter | KNN | 90.1 | 77.3 | 12.8 pp |
| Strong color jitter | KNN | 87.9 | 77.6 | 10.3 pp |
Key Findings¶
- Feature-based methods are most severely affected: mean AUROC drop of \(7.1 \pm 1.8\) pp, substantially larger than the \(1.5 \pm 1.1\) pp observed for confidence-based methods.
- CheXpert causal experiment: reversing the ROI appearance reverses the direction of detection performance trends, confirming that the effect is ROI-driven.
- PCA analysis: color-sensitive directions are significantly positively correlated with high-variance principal components (Spearman \(\rho = 0.47\), \(p < 1.5 \times 10^{-4}\)).
- Subspace projection is effective: it nearly eliminates the performance gap across three feature-based methods without degrading detection of similar artifacts.
- Color jitter yields inconsistent results: it benefits some methods (KNN) but harms others (DICE); strong jitter reduces in-distribution accuracy by 5.5 pp.
- DDPM-MSE is the sole exception: it exhibits no Invisible Gorilla Effect on any of the ISIC benchmarks.
Highlights & Insights¶
- Apt naming: the analogy to the "invisible gorilla" cognitive psychology experiment provides an intuitive and communicable framing for the DNN "attention blind spot."
- Unprecedented experimental scale: 40 methods × 3,795 configurations × 7 benchmarks × 3 architectures × 25 seeds, with every conclusion supported by statistical significance testing (Wilcoxon signed-rank, \(p < 10^{-5}\)).
- Closed causal validation loop: the effect is not merely observed but causally attributed to ROI-driven learning through the CheXpert cardiac brightness reversal experiment.
- Clear mechanistic explanation: PCA subspace analysis identifies the root cause of feature-based methods' susceptibility — color variation is distributed along high-variance directions that are down-weighted by these methods.
- Transferable mitigation: the nuisance subspace learned from the ISIC color-chart benchmark transfers directly to the ink benchmark, demonstrating the generalizability of the identified subspace.
- Practical clinical significance: the findings reveal a silent failure mode for OOD detectors in real deployment — artifacts whose color differs from the ROI are precisely those most likely to be missed.
Limitations & Future Work¶
- Focus limited to color: although color serves as a well-controlled variable, similar effects may arise from shape, texture, or spatial location; future work should investigate these dimensions.
- Limited dataset scope: only three datasets (two medical, one industrial) are examined; high-stakes domains such as autonomous driving and remote sensing are not covered.
- Subspace projection limitations: the approach requires prior knowledge of which principal components constitute "nuisance" directions, which may be impractical in deployment (requiring some OOD-labeled samples).
- Foundation models excluded: large-scale pretrained models such as CLIP are excluded to avoid data leakage, yet foundation models represent a dominant trend; whether they exhibit the same effect warrants investigation.
- Mitigation strategies remain preliminary: color jitter yields inconsistent results, and subspace projection is validated only for feature-based methods — a general-purpose mitigation strategy is lacking.
- Cross-domain transfer potential: whether a nuisance subspace learned on one dataset can be applied zero-shot to OOD detection in an entirely different domain remains an open question.
Related Work & Insights¶
- Anthony & Kamnitsas (2023, 2025): found that the optimal feature layer for Mahalanobis Score varies by artifact type; the present work reveals the deeper underlying reason.
- Averly & Chao (2023): counterfactual analysis showing that OOD artifacts can produce high-confidence predictions; the present work systematizes this finding along the color dimension.
- Ren et al.: near-OOD vs. far-OOD framework; the present work challenges the monotonic assumption that detection difficulty increases with similarity.
- Implications for OOD detection method design: future feature-based methods should not indiscriminately down-weight high-variance directions; distinguishing "informative" from "nuisance" variance is necessary. Learning ROI-aware feature spaces may be a promising direction.
Rating¶
⭐⭐⭐⭐ An exceptionally rigorous empirical study that reveals, at unprecedented scale, an important and previously overlooked systematic bias in OOD detection. The causal validation and mechanistic explanation are compelling, and the findings carry significant implications for the safe deployment of OOD detectors in practice.