THUNDER: Tile-level Histopathology image UNDERstanding benchmark¶
- Conference: NeurIPS 2025 Datasets and Benchmarks Track (Spotlight)
- arXiv: 2507.07860
- Code: https://github.com/MICS-Lab/thunder
- Area: Medical Imaging / Digital Pathology
- Keywords: digital pathology, benchmark, foundation models, robustness, uncertainty, tile-level analysis
TL;DR¶
This paper presents THUNDER, a comprehensive tile-level benchmark for digital pathology foundation models. It enables efficient comparison of 23 foundation models across 16 datasets, covering downstream task performance, feature-space analysis, robustness, and uncertainty estimation.
Background & Motivation¶
The Proliferation of Digital Pathology Foundation Models¶
In recent years, numerous digital pathology foundation models have been released (e.g., UNI, Virchow, CONCH, Phikon, CTransPath), serving as tile-level feature extractors for a variety of downstream tile- and slide-level tasks. However, the pace of model release has far outstripped the community's understanding of their relative performance and differences.
Limitations of Prior Work¶
- Focus solely on downstream performance: Overlooking fundamental differences between models (e.g., feature space structure)
- Lack of robustness evaluation: In safety-critical domains such as healthcare, accuracy alone is insufficient
- Lack of uncertainty analysis: Models must be capable of expressing uncertainty when appropriate
- Poor reproducibility and extensibility: Many model evaluations employ inconsistent protocols and data splits
Paper Goals¶
To construct a fast, accessible, and dynamic benchmark that not only evaluates performance but also provides in-depth analysis of feature spaces, robustness, and uncertainty, offering the community a comprehensive understanding of model behavior.
Method¶
Overall Architecture¶
The THUNDER benchmark encompasses four evaluation dimensions:
Input: Foundation Models (extracting tile embeddings)
├── Dimension 1: Downstream Task Performance (classification, retrieval)
├── Dimension 2: Feature Space Analysis (structure, separability)
├── Dimension 3: Robustness Evaluation (distribution shift, adversarial perturbation)
└── Dimension 4: Uncertainty Estimation (calibration, OOD detection)
Key Designs¶
1. Dataset Composition¶
THUNDER comprises 16 datasets covering diverse tissue types and tasks:
- Cancer classification: Breast, lung, colorectal, gastric cancer, etc.
- Tissue classification: Normal tissue type recognition
- Subtype classification: Fine-grained tumor subtype classification
- Cross-site evaluation: Same task with data from different hospitals
2. Evaluation Protocol¶
- Linear probing: Frozen foundation model with only a linear classification head trained
- KNN classification: k-nearest-neighbor voting directly in the embedding space (see the sketch after this list)
- Retrieval tasks: Similar tile retrieval via embedding similarity
- Unified data splits: Identical train/validation/test sets used for all models
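As a concrete illustration of the KNN protocol above, here is a minimal sketch using scikit-learn on cached tile embeddings; the synthetic arrays, the choice of k, and the cosine metric are illustrative assumptions rather than THUNDER's actual configuration.

```python
# Minimal KNN-probing sketch on frozen tile embeddings (illustrative, not the
# official THUNDER code). In practice the arrays would be embeddings cached
# from a frozen foundation model; random data stands in here.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
embed_dim, n_classes = 768, 9
train_emb = rng.normal(size=(5_000, embed_dim))
train_lab = rng.integers(0, n_classes, size=5_000)
test_emb = rng.normal(size=(1_000, embed_dim))
test_lab = rng.integers(0, n_classes, size=1_000)

# Cosine distance is a common choice for comparing tile embeddings.
knn = KNeighborsClassifier(n_neighbors=20, metric="cosine")
knn.fit(train_emb, train_lab)
print(f"KNN (k=20) accuracy: {accuracy_score(test_lab, knn.predict(test_emb)):.3f}")
```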
3. Feature Space Analysis¶
- t-SNE visualization: Examining clustering quality of different classes in the embedding space
- Inter-class/intra-class distance ratio: Quantifying feature space separability (see the sketch after this list)
- Embedding dimension utilization: Analyzing the number of effectively used embedding dimensions
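The separability and dimension-utilization statistics can be illustrated with a short sketch; the exact metric definitions below (a centroid-based distance ratio and a participation-ratio effective dimension) are common choices and may differ from the paper's formulations.

```python
# Illustrative feature-space statistics on cached tile embeddings. The exact
# metric definitions used by THUNDER may differ; these are common choices.
import numpy as np

def interclass_intraclass_ratio(emb, labels):
    """Mean distance between class centroids divided by the mean distance of
    samples to their own class centroid; higher means better separation."""
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([
        np.linalg.norm(emb[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    inter = pairwise[np.triu_indices(len(classes), k=1)].mean()
    return inter / intra

def effective_dimension(emb):
    """Participation-ratio estimate of how many embedding dimensions
    actually carry variance."""
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(emb, rowvar=False)), 0.0, None)
    return (eigvals.sum() ** 2) / (np.square(eigvals).sum() + 1e-12)

# Example with random stand-in data:
# emb = np.random.randn(2000, 768); labels = np.random.randint(0, 9, 2000)
# print(interclass_intraclass_ratio(emb, labels), effective_dimension(emb))
```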
4. Robustness & Uncertainty¶
- Distribution shift: Performance degradation evaluated on data from different sites/staining protocols
- Uncertainty calibration: Expected Calibration Error (ECE) for assessing reliability of predictive confidence (sketched after this list)
- OOD detection: Ability to distinguish in-distribution from out-of-distribution samples
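For reference, a minimal ECE sketch with equal-width confidence bins; the 15-bin convention below is a common default and not necessarily the paper's exact binning or weighting scheme.

```python
# Illustrative Expected Calibration Error (ECE) with equal-width confidence
# bins; not necessarily the paper's exact setup.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (n, n_classes) softmax outputs; labels: (n,) integer targets."""
    conf = probs.max(axis=1)                      # predicted confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # |bin accuracy - bin confidence|, weighted by the bin's share of samples
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```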
Loss & Training¶
THUNDER does not train foundation models itself; its evaluations use:
- Linear probing: Cross-entropy loss + SGD optimizer (a minimal sketch follows this list)
- KNN: No training required
- All foundation model parameters are frozen
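A minimal linear-probing sketch in PyTorch, training a single linear head with cross-entropy and SGD on frozen, pre-extracted embeddings; the dimensions, hyperparameters, and synthetic data are illustrative stand-ins rather than the paper's exact settings.

```python
# Minimal linear-probing sketch: a single linear head trained with
# cross-entropy + SGD on frozen, pre-extracted embeddings.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

embed_dim, n_classes = 1024, 9                       # e.g. a ViT-L extractor, 9 tissue classes
train_emb = torch.randn(10_000, embed_dim)           # stand-in for cached tile embeddings
train_lab = torch.randint(0, n_classes, (10_000,))

head = nn.Linear(embed_dim, n_classes)               # the only trainable module
optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(train_emb, train_lab), batch_size=256, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(head(x), y)                 # the foundation model stays frozen
        loss.backward()
        optimizer.step()
```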
Key Experimental Results¶
Main Results¶
Average tile-level linear probing performance of 23 foundation models across 16 datasets:
| Model | Pretraining Data Scale | Architecture | Embedding Dim | Mean Acc (%) ↑ | Mean AUC ↑ |
|---|---|---|---|---|---|
| UNI (v1) | 100k slides | ViT-L | 1024 | 82.4 | 0.924 |
| Virchow | 1.5M slides | ViT-H | 1280 | 83.1 | 0.931 |
| CONCH | 1.17M image–caption pairs | ViT-B | 512 | 80.7 | 0.912 |
| Phikon | 40k slides | ViT-B | 768 | 78.3 | 0.896 |
| CTransPath | 15k slides | Swin-T | 768 | 76.8 | 0.882 |
| Lunit-DINO | 33k slides | ViT-S | 384 | 77.5 | 0.889 |
| Prov-GigaPath | 171k slides | ViT-G | 1536 | 84.2 | 0.938 |
| UNI v2 | 350k slides | ViT-L | 1024 | 83.8 | 0.935 |
| ResNet-50 (ImageNet) | 1.2M imgs | ResNet-50 | 2048 | 68.2 | 0.812 |
Robustness evaluation (cross-site performance degradation):
| Model | Source Site Acc (%) | Target Site Acc (%) | Performance Drop ↓ | ECE ↓ |
|---|---|---|---|---|
| UNI | 82.4 | 76.8 | 5.6 | 0.082 |
| Virchow | 83.1 | 78.2 | 4.9 | 0.071 |
| CONCH | 80.7 | 73.4 | 7.3 | 0.095 |
| Prov-GigaPath | 84.2 | 79.5 | 4.7 | 0.068 |
| CTransPath | 76.8 | 68.1 | 8.7 | 0.112 |
| ResNet-50 | 68.2 | 58.4 | 9.8 | 0.145 |
Ablation Study¶
Comparison of evaluation protocols:
| Evaluation Method | Accuracy Range | Computation Cost | Correlation with Full Fine-tuning |
|---|---|---|---|
| Linear probing | 68–84% | Fast (minutes) | r=0.92 |
| KNN (k=5) | 65–82% | Very fast (seconds) | r=0.88 |
| KNN (k=20) | 66–81% | Very fast (seconds) | r=0.86 |
| Few-shot (10-shot) | 55–72% | Fast | r=0.84 |
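The "Correlation with Full Fine-tuning" column can be reproduced in spirit by correlating per-model scores under a cheap protocol with per-model fine-tuning scores. The sketch below uses the linear-probing accuracies from the main results table; the fine-tuning accuracies are hypothetical placeholders purely for illustration, not numbers from the paper.

```python
# Illustrative computation of the protocol-agreement correlation.
from scipy.stats import pearsonr

linear_probe_acc = [82.4, 83.1, 80.7, 78.3, 76.8, 77.5, 84.2, 83.8, 68.2]
full_finetune_acc = [85.0, 86.2, 83.1, 81.0, 79.4, 80.2, 86.9, 86.5, 72.5]  # hypothetical

r, p = pearsonr(linear_probe_acc, full_finetune_acc)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```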
Key Findings¶
- Pretraining data scale remains the primary factor: Models pretrained on the largest datasets (Prov-GigaPath, Virchow) consistently achieve the best overall performance.
- Larger models are not necessarily more robust: Certain large models exhibit greater performance degradation in cross-site settings.
- Uncertainty estimation is broadly inadequate: Most models exhibit high calibration error, necessitating additional calibration steps for clinical deployment.
- Significant variation in feature space structure: Different models show substantial differences in clustering quality and embedding dimension utilization.
- ImageNet pretraining is clearly insufficient: General-purpose vision models lag considerably behind domain-specific models on pathology tasks.
Highlights & Insights¶
- Comprehensiveness: The first benchmark to simultaneously cover performance, feature space, robustness, and uncertainty for pathology models.
- Scale: 23 models × 16 datasets = 368 evaluation combinations.
- Practicality: Rapid execution, support for user-defined models, and dynamic extensibility.
- Spotlight acceptance: Recognized as having significant reference value for the community.
- Open source: Fully open-sourced to facilitate community reproduction and extension.
Limitations & Future Work¶
- Tile-level only: Slide-level tasks (e.g., WSI classification via MIL aggregation) are not covered.
- Limited evaluation protocols: Primarily linear probing and KNN; methods such as prompt tuning are not included.
- Dataset bias: Coverage is skewed toward common H&E-stained cancer types.
- Absence of multimodal evaluation: The text-based capabilities of vision-language models (e.g., CONCH) are not assessed.
- Timeliness challenge: Foundation models are evolving rapidly, requiring ongoing benchmark maintenance.
Related Work & Insights¶
- Pathology foundation models: UNI, Virchow, CONCH, Phikon, etc.
- General vision benchmarks: ImageNet, VTAB, etc., inform benchmark design principles.
- Directions for inspiration: Developing comprehensive slide-level benchmarks, incorporating more rare disease data, and integrating clinical metrics for evaluation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First comprehensive tile-level pathology benchmark
- Theoretical Depth: ⭐⭐⭐ — Primarily an experimentally driven benchmark contribution
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 23 models × 16 datasets; exceptionally comprehensive
- Practical Impact: ⭐⭐⭐⭐⭐ — Significant reference value for the pathology AI community
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, easy to navigate