Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning¶
- Conference: CVPR 2026
- arXiv: 2603.00667
- Authors: Wentao Huang et al. (Stony Brook University, Mayo Clinic, Harvard/MGH, Stanford)
- Area: Medical Imaging / Pathology VQA
- Keywords: Whole Slide Image, Visual Question Answering, Information Bottleneck, Patch Selection, Tissue-Aware Reasoning
TL;DR¶
This paper proposes HistoSelect, a framework that emulates the coarse-to-fine reasoning of pathologists through a three-stage filtering pipeline (tissue segmentation → Group Sampler → Patch Selector) grounded in Information Bottleneck (IB) theory. By discarding task-irrelevant visual tokens, the method achieves state-of-the-art performance across three datasets while cutting the visual token count, and hence computational cost, by approximately 70%.
Background & Motivation¶
Whole slide images (WSIs) are the gold standard for cancer diagnosis, yet a single WSI may contain tens of thousands of patches. Directly feeding these into large language models presents two fundamental bottlenecks:
Computational bottleneck: WSIs can reach resolutions of 100,000×100,000 pixels, yielding tens of thousands of patches after tiling. Encoding every patch as visual tokens produces sequences that far exceed the context window of current LLMs.
Information redundancy: Pathologists do not examine every patch sequentially; instead, they first identify tissue types and then focus on regions relevant to the diagnostic question — the majority of patches are irrelevant to any given query.
Existing methods such as Q-Instruct and PathChat either apply uniform sampling (discarding critical information) or process all patches (computationally intractable). The core question is therefore: how can the number of tokens be drastically reduced while preserving diagnostically relevant information?
The natural workflow of pathologists provides direct inspiration: a low-magnification overview to assess tissue architecture, followed by high-magnification examination of suspicious regions. HistoSelect formalizes this coarse-to-fine reasoning process as a learnable pipeline.
Method¶
Overall Architecture¶
HistoSelect consists of three core stages that mirror the cognitive workflow of pathologists:
- Tissue Segmentation: Patches from the WSI are grouped by tissue type.
- Group Sampler: The sampling ratio for each group is determined adaptively.
- Patch Selector: The most relevant patches within each group are selected.
The selected patches are then fed into a VLM for question answering.
Key Designs¶
Stage 1: Tissue-Aware Grouping
- A pathologist pre-defines \(M\) tissue-type text prompts (e.g., "tumor tissue," "stroma," "necrosis").
- CONCH, a pathology-domain CLIP model, computes cosine similarity between each patch feature and the tissue prompts.
- Each patch is assigned to the highest-similarity tissue group, yielding \(M\) groups \(\{G_1, G_2, \ldots, G_M\}\) (a minimal sketch of this step follows below).
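A minimal sketch of the grouping step, assuming precomputed CONCH patch embeddings and tissue-prompt embeddings; the function name `assign_tissue_groups` and the toy dimensions are illustrative, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def assign_tissue_groups(patch_feats: torch.Tensor,
                         prompt_feats: torch.Tensor) -> torch.Tensor:
    """Assign each patch to the tissue prompt with the highest cosine similarity.

    patch_feats:  (N, D) patch embeddings from a pathology encoder such as CONCH.
    prompt_feats: (M, D) text embeddings of the M pathologist-defined tissue prompts.
    Returns an (N,) tensor of group indices in [0, M).
    """
    patches = F.normalize(patch_feats, dim=-1)   # unit-normalize so the dot
    prompts = F.normalize(prompt_feats, dim=-1)  # product equals cosine similarity
    sim = patches @ prompts.T                    # (N, M) similarity matrix
    return sim.argmax(dim=-1)                    # hard assignment per patch

# Toy usage: 6 patches, 3 tissue prompts ("tumor", "stroma", "necrosis"), D = 512.
group_ids = assign_tissue_groups(torch.randn(6, 512), torch.randn(3, 512))
```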
Stage 2: Group Sampler (IB-based Group-Level Sampling)
- A group prototype vector \(g_j\) is computed as the mean of patch features within each group.
- \(g_j\) is concatenated with the question encoding \(q\) and passed through a two-layer MLP with sigmoid activation to produce a sampling rate \(r_j \in (0,1)\).
- \(r_j\) determines the proportion of patches to retain from group \(j\): \(k_j = \lceil r_j \cdot N_j \rceil\).
- IB objective: maximize the mutual information between \(r_j\) and the answer while penalizing the complexity of \(r_j\), instantiating the classic IB trade-off \(\max\, I(Z;Y) - \beta I(X;Z)\) (a sketch of this stage follows below).
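As an illustration of Stage 2, the sketch below predicts one sampling rate per group from the prototype-question concatenation. The class name, hidden width, and per-group loop are assumptions; a real implementation would likely batch the groups:

```python
import math
import torch
import torch.nn as nn

class GroupSampler(nn.Module):
    """Predict a sampling rate r_j in (0, 1) for each tissue group (sketch)."""

    def __init__(self, feat_dim: int, q_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(               # two-layer MLP with sigmoid output
            nn.Linear(feat_dim + q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, patch_feats, group_ids, q_emb, num_groups):
        rates, budgets = [], []
        for j in range(num_groups):
            members = patch_feats[group_ids == j]      # patches assigned to group j
            if len(members) == 0:                      # empty group: sample nothing
                rates.append(patch_feats.new_zeros(()))
                budgets.append(0)
                continue
            g_j = members.mean(dim=0)                  # group prototype vector
            r_j = self.mlp(torch.cat([g_j, q_emb])).squeeze(-1)
            rates.append(r_j)
            budgets.append(math.ceil(r_j.item() * len(members)))  # k_j = ceil(r_j * N_j)
        return torch.stack(rates), budgets             # rates feed the group-level KL loss
```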
Stage 3: Patch Selector (Hard Patch-Level Selection)
- For each patch, a selection probability is computed as \(s_i = \sigma(F_{\text{patch}}([x_i; q]))\), where \(F_{\text{patch}}\) is a lightweight MLP.
- Within group \(G_j\), patches are ranked by \(s_i\) and the top-\(k_j\) are selected.
- The Straight-Through Estimator (STE) is used to enable gradient flow through the non-differentiable hard selection operation (see the sketch below).
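The following sketch covers both the scoring MLP and the STE trick: the forward pass uses the hard 0/1 top-\(k_j\) mask, while the backward pass routes gradients through the soft scores \(s_i\). Layer sizes and the masking convention are assumptions:

```python
import torch
import torch.nn as nn

class PatchSelector(nn.Module):
    """Score patches against the question and hard-select the top-k per group (sketch)."""

    def __init__(self, feat_dim: int, q_dim: int, hidden: int = 256):
        super().__init__()
        self.score = nn.Sequential(              # lightweight MLP F_patch
            nn.Linear(feat_dim + q_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, patch_feats, q_emb, k):
        q = q_emb.expand(patch_feats.size(0), -1)                        # broadcast question
        s = self.score(torch.cat([patch_feats, q], dim=-1)).squeeze(-1)  # s_i = sigma(F([x_i; q]))
        hard = torch.zeros_like(s).scatter(0, s.topk(k).indices, 1.0)    # hard 0/1 mask
        mask = hard + s - s.detach()   # STE: hard values forward, soft gradients backward
        return patch_feats * mask.unsqueeze(-1), s
```

Masking instead of gathering keeps tensor shapes static for batching; at inference one would simply gather the \(k_j\) surviving patches before handing them to the VLM.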
Loss & Training¶
The total loss consists of three terms, reflecting a two-level IB compression design:
- \(L_{\text{VQA}}\): Standard cross-entropy loss for VQA.
- \(L_{\text{group}}\) (group-level IB regularization): Bernoulli KL divergence between \(r_j\) and a prior derived from cosine similarity.
- \(L_{\text{patch}}\) (patch-level IB regularization): Bernoulli KL divergence between \(s_i\) and a patch-question cosine similarity prior.
Training strategy:
- The Group Sampler, Patch Selector, and VLM are trained jointly, end to end.
- STE ensures gradient propagation through the hard selection step.
- Cosine-similarity priors serve as unsupervised weak signals to guide selection (the combined loss is sketched below).
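A minimal sketch of the combined objective, assuming the cosine-similarity priors are already rescaled into \((0,1)\); the linear weighting \(L = L_{\text{VQA}} + \lambda_1 L_{\text{group}} + \lambda_2 L_{\text{patch}}\) and the \(\lambda\) values are assumptions, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def bernoulli_kl(p: torch.Tensor, prior: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Mean KL( Bernoulli(p) || Bernoulli(prior) ), clamped for numerical stability."""
    p, prior = p.clamp(eps, 1 - eps), prior.clamp(eps, 1 - eps)
    kl = p * (p / prior).log() + (1 - p) * ((1 - p) / (1 - prior)).log()
    return kl.mean()

def total_loss(logits, answer_ids, rates, group_prior, scores, patch_prior,
               lam_group: float = 0.1, lam_patch: float = 0.1) -> torch.Tensor:
    """L = L_VQA + lam_group * L_group + lam_patch * L_patch (weights are assumed)."""
    l_vqa = F.cross_entropy(logits, answer_ids)     # standard VQA cross-entropy
    l_group = bernoulli_kl(rates, group_prior)      # group-level IB regularizer
    l_patch = bernoulli_kl(scores, patch_prior)     # patch-level IB regularizer
    return l_vqa + lam_group * l_group + lam_patch * l_patch
```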
Key Experimental Results¶
Main Results¶
| Method | SlideBench-VQA (Acc) | WSI-Bench (Acc) | In-house Ovarian (Acc) | Visual Token Reduction |
|---|---|---|---|---|
| Random Sampling | 52.3 | 48.7 | 61.2 | 70% |
| Q-Instruct | 56.1 | 51.3 | 64.8 | 0% |
| PathChat | 58.4 | 53.9 | 67.3 | 0% |
| HistoSelect | 63.7 | 58.2 | 73.6 | ~70% |
Trained on 356K QA pairs, HistoSelect achieves consistent state-of-the-art results across all three datasets.
Ablation Study¶
| Configuration | SlideBench-VQA | Change |
|---|---|---|
| Full HistoSelect | 63.7 | — |
| w/o Group Sampler | 59.8 | -3.9 |
| w/o Patch Selector | 60.5 | -3.2 |
| w/o IB Loss (group) | 61.2 | -2.5 |
| w/o IB Loss (patch) | 61.8 | -1.9 |
| Random patch selection | 55.1 | -8.6 |
Key Findings¶
- Both filtering stages are necessary: Removing either the Group Sampler or the Patch Selector leads to significant performance degradation, confirming that the two coarse-to-fine filtering levels are complementary.
- IB regularization is effective: Performance drops without IB losses, demonstrating that prior-guided information compression not only reduces computation but also improves accuracy.
- Strong interpretability: Selected patches are highly consistent with diagnostically critical regions annotated by senior pathologists, validating the clinical plausibility of the approach.
- Lossless 70% compression: Substantially reducing the number of tokens while surpassing full-input methods indicates that removing noisy patches is intrinsically beneficial.
Highlights & Insights¶
- Cognitively inspired design: By encoding the pathologist's overview-then-focus workflow into the model architecture, domain knowledge is leveraged more efficiently than purely data-driven approaches.
- Elegant application of IB theory: The information bottleneck framework is instantiated at two levels (group and patch), with Bernoulli KL divergence and cosine similarity priors forming a concise and effective design.
- Hard selection via STE: Hard patch selection reduces actual computation more faithfully than soft attention; STE ensures the model remains trainable.
- Clinical interpretability: Beyond benchmark scores, the selected patches align with pathologist cognition, enhancing both the credibility and practical utility of the method.
Limitations & Future Work¶
- Tissue types must be predefined: The \(M\) tissue prompts are manually specified by domain experts; reconfiguration is required when transferring across diseases or organs.
- Dependence on CONCH: Grouping quality is bounded by CONCH's encoding capability in specific pathological domains; rare or tail tissue types may be grouped inaccurately.
- Information loss from hard selection: Although STE enables training, discarded patches may still contain weak but potentially useful contextual information.
- Single-magnification processing: The current method operates at a single magnification level and does not exploit the multi-scale pyramidal structure of WSIs.
- Non-negligible preprocessing overhead: Encoding all patches with CONCH plus MLP inference may incur considerable upfront cost on extremely large WSIs.
Related Work & Insights¶
- CONCH / PLIP: Vision-language alignment models for pathology, providing high-quality patch feature spaces.
- Information Bottleneck: The classical framework of Tishby et al., with established applications in video understanding (AdaFocus) and NLP (VIB).
- PathChat / LLaVA-Med: Representative pathology VQA methods; HistoSelect can serve as a plug-and-play frontend for these systems.
- Insight: The two-level IB compression framework generalizes naturally to other long-sequence tasks with hierarchical structure, such as long-video understanding and multi-document QA.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4 | Novel combination of two-level IB compression and pathologist cognitive workflow |
| Technical Depth | 4 | Solid information-theoretic foundations; well-motivated STE and prior designs |
| Experimental Thoroughness | 4 | Three datasets, ablation studies, and interpretability analysis |
| Value | 5 | 70% token reduction with performance gains; clinically deployable |
| Writing Quality | 4 | Clear structure and intuitive figures |
| Overall | 4.2 | An elegant integration of cognitive inspiration and information theory |