Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling¶
Conference: NeurIPS 2025 arXiv: 2505.17982 Code: GitHub Area: Computational Pathology / Few-Shot Learning Keywords: Multiple Instance Learning, Vision-Language Models, Hierarchical Graph, Whole Slide Images, Few-Shot Classification, Pathology
TL;DR¶
This paper proposes HiVE-MIL, a hierarchical vision-language MIL framework that constructs a unified heterogeneous graph to model cross-scale hierarchical relationships (5× and 20×) and intra-scale multimodal alignment. Combined with a text-guided dynamic filtering mechanism and a hierarchical contrastive loss, HiVE-MIL consistently outperforms existing methods under the 16-shot setting on three TCGA datasets (lung, breast, and renal cancer), achieving up to 4.1% improvement in Macro F1.
Background & Motivation¶
Whole Slide Image Classification: WSIs exhibit gigapixel resolution and encode multi-scale spatial information spanning coarse tissue architecture to fine-grained cellular morphology. The scarcity of annotations necessitates weakly supervised MIL frameworks.
Limitations of Conventional MIL: These methods rely on large amounts of annotated data, depend solely on visual features, and are sensitive to staining variations and domain shifts, resulting in poor performance in few-shot scenarios.
Progress and Bottlenecks of VLM-MIL: Recent multi-scale VLM-MIL approaches introduce scale-specific prompts but suffer from two critical drawbacks:
- Insufficient cross-scale intra-modal interaction: visual and textual features at each scale are processed independently and merged only by simple summation or averaging at the prediction stage, discarding the coarse-to-fine semantic hierarchy.
- Inadequate intra-scale cross-modal alignment: fine-grained alignment between visual and textual features within the same scale remains underexplored.
Method¶
1. Multi-Scale Hierarchical Feature Extraction¶
Visual hierarchy: WSIs are processed at the low scale (5×) to extract \(N\) patches with features \(z_n^{(l)} = f_{\text{img}}(x_n^{(l)}) \in \mathbb{R}^D\); each low-scale patch is subdivided into \(M = (20/5)^2 = 16\) high-scale (20×) sub-patches with features \(z_{n,m}^{(h)} = f_{\text{img}}(x_{n,m}^{(h)})\).
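The two-level feature extraction above can be sketched as follows. Shapes and the placeholder encoder `f_img` are illustrative assumptions; in the paper, \(f_{\text{img}}\) is a frozen pathology VLM image encoder (e.g., PLIP).

```python
import numpy as np

N, M, D = 8, 16, 32            # N low-scale patches, M = (20/5)**2 sub-patches each, feature dim D
rng = np.random.default_rng(0)

def f_img(x):
    # stand-in encoder: here it just truncates a flattened patch to D dims
    return x.reshape(-1)[:D]

low_patches = rng.standard_normal((N, D))          # pretend 5x patches
high_patches = rng.standard_normal((N, M, D))      # their 4x4 grids of 20x sub-patches

z_low = np.stack([f_img(p) for p in low_patches])                      # (N, D)
z_high = np.stack([[f_img(c) for c in row] for row in high_patches])   # (N, M, D)

# parent-child index map, reusable later for hierarchical graph edges
parent_of = np.repeat(np.arange(N), M)             # flat child index n*M + m -> parent n
```

The explicit `parent_of` map is what makes the later hierarchical edges cheap to build: every 20× sub-patch knows its 5× parent by construction.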
Text hierarchy: GPT-4o is used to generate hierarchical text descriptions — \(O=4\) low-scale descriptions per class (coarse tissue features, e.g., "acinar pattern"), each paired with \(K=3\) high-scale sub-descriptions (fine-grained cellular features, e.g., "nuclear hyperchromatism"). CoOp-style learnable token prefixes of length \(L=16\) are prepended to every description before text encoding.
2. Text-Guided Dynamic Filtering (TGDF)¶
Stage 1 (Low-scale filtering): A patch-text cosine similarity matrix \(S^{(l)} \in \mathbb{R}^{N \times O}\) is computed. For each text \(o\), its column mean \(\mu_o\) and standard deviation \(\sigma_o\) define a text-adaptive threshold, and only patch-text pairs whose similarity exceeds it are retained.
Stage 2 (High-scale refinement): The same filtering is applied to the high-scale sub-patches of retained low-scale patches, masked by the low-scale results to enforce parent-child consistency: \(S_{\text{masked}}^{(h)}(r,s) = S^{(h)}(r,s) \cdot S_{\text{filtered}}^{(l)}(n,o)\), where sub-patch \(r\) descends from low-scale patch \(n\) and sub-description \(s\) from description \(o\).
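A minimal sketch of the two-stage filtering. The threshold form \(\tau_o = \mu_o + \alpha\sigma_o\) (with \(\alpha = 0.5\)) is an assumption consistent with the per-text mean/std description above, not a verbatim reproduction of the paper's formula.

```python
import numpy as np

rng = np.random.default_rng(1)
N, O = 8, 4                       # low-scale patches x low-scale texts
M, K = 16, 3                      # sub-patches per patch, sub-texts per text

def filter_by_text(S, alpha=0.5):
    """Keep (patch, text) pairs whose similarity exceeds a per-text threshold."""
    tau = S.mean(axis=0) + alpha * S.std(axis=0)   # assumed threshold: mu_o + alpha * sigma_o
    return (S > tau).astype(float)

# Stage 1: low-scale similarities and mask
S_low = rng.uniform(-1, 1, size=(N, O))
mask_low = filter_by_text(S_low)                          # (N, O)

# Stage 2: high-scale similarities, masked by each parent's low-scale result
S_high = rng.uniform(-1, 1, size=(N, M, O, K))
mask_high = filter_by_text(S_high.reshape(N * M, O * K)).reshape(N, M, O, K)
S_masked = S_high * mask_high * mask_low[:, None, :, None]   # child kept only if parent kept
```

The broadcast of `mask_low` over the child dimensions is what implements the top-down consistency: a sub-patch can survive only if its parent patch survived Stage 1.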
3. Hierarchical Heterogeneous Graph (HHG)¶
Node types: \(\mathcal{T} = \{\text{img}^{(l)}, \text{img}^{(h)}, \text{text}^{(l)}, \text{text}^{(h)}\}\)
Intra-scale edges: Connect visual-text node pairs within the same scale based on the TGDF-filtered similarity matrix.
Hierarchical edges: Connect same-modality nodes across low and high scales (visual hierarchy + text hierarchy), based on spatial parent-child relationships.
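The two edge families above can be sketched as explicit index-pair lists. Node ids and the random stand-in for the TGDF mask are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, O, K = 4, 16, 2, 3

# intra-scale edges: (low-scale patch, low-scale text) pairs surviving TGDF
keep = rng.random((N, O)) > 0.5
intra_low = [(n, o) for n in range(N) for o in range(O) if keep[n, o]]

# hierarchical edges: same-modality parent-child links
img_hier = [(n, n * M + m) for n in range(N) for m in range(M)]  # 5x patch -> its 16 sub-patches
txt_hier = [(o, o * K + k) for o in range(O) for k in range(K)]  # description -> its 3 sub-descriptions
```

Keeping the edge sets typed by relation (intra-scale vs. hierarchical, image vs. text) is what lets the attention layer in the next section use relation-specific parameters.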
4. Modality-Scale Attention (MSA)¶
An attention mechanism is applied to hierarchical edges to enhance cross-scale information propagation. Node features incorporate learnable scale embeddings, and the QKV projections are relation-specific, so each edge type learns its own attention parameters.
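A sketch of this attention along one hierarchical edge type. All weights, the scale embeddings, and the relation names are illustrative assumptions; only the structure (relation-specific QKV projections plus scale embeddings) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 16
relations = ["img_hier", "txt_hier"]
# one projection triple per relation type
W_q = {r: rng.standard_normal((D, D)) / np.sqrt(D) for r in relations}
W_k = {r: rng.standard_normal((D, D)) / np.sqrt(D) for r in relations}
W_v = {r: rng.standard_normal((D, D)) / np.sqrt(D) for r in relations}
scale_emb = {"low": rng.standard_normal(D), "high": rng.standard_normal(D)}

def msa(parent, children, relation):
    """Aggregate child (high-scale) features into a parent (low-scale) node."""
    q = (parent + scale_emb["low"]) @ W_q[relation]        # (D,)
    k = (children + scale_emb["high"]) @ W_k[relation]     # (M, D)
    v = (children + scale_emb["high"]) @ W_v[relation]
    logits = k @ q / np.sqrt(D)
    att = np.exp(logits - logits.max())                    # stabilized softmax
    att /= att.sum()
    return att @ v                                         # (D,)

out = msa(rng.standard_normal(D), rng.standard_normal((16, D)), "img_hier")
```

Indexing the weight dictionaries by relation is the key design choice: the visual and text hierarchies get separate attention parameters instead of sharing one projection.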
5. Training Objectives¶
Hierarchical Text Contrastive Loss (HTCL): Aligns cross-scale text semantics, with positive pairs defined as same-class parent-child text pairs and negatives drawn from different classes.
Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}}(z_i, y_i) + \lambda \mathcal{L}_{\text{HTCL}}\), with \(\lambda = 0.5\)
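HTCL can be sketched as an InfoNCE-style loss over pooled per-class text features: the positive for each low-scale class embedding is its same-class high-scale embedding, and the other classes serve as negatives. The pooling to one embedding per class and the temperature \(\tau = 0.07\) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
C, D = 3, 16                                    # classes, feature dim

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

t_low = l2norm(rng.standard_normal((C, D)))     # pooled low-scale text embedding per class
t_high = l2norm(rng.standard_normal((C, D)))    # pooled high-scale text embedding per class

def htcl(t_low, t_high, tau=0.07):
    logits = t_low @ t_high.T / tau                       # (C, C) cross-scale similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives sit on the diagonal

lam = 0.5
loss_htcl = htcl(t_low, t_high)
# the total objective would add the slide-level cross-entropy: L_CE + lam * loss_htcl
```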
Key Experimental Results¶
16-Shot Results (PLIP Feature Encoder)¶
| Method | NSCLC ACC | NSCLC F1 | BRCA ACC | BRCA F1 | RCC ACC | RCC F1 |
|---|---|---|---|---|---|---|
| ABMIL | 70.64 | 70.37 | 65.83 | 65.29 | 80.00 | 77.95 |
| TransMIL | 73.21 | 72.98 | 72.08 | 71.94 | 87.05 | 84.96 |
| MSCPT | 76.86 | 76.82 | 72.71 | 72.58 | 86.21 | 84.20 |
| ViLa-MIL | 74.17 | 73.90 | 71.04 | 70.56 | 85.06 | 82.51 |
| FOCUS | 71.73 | 71.65 | 71.66 | 71.36 | 87.82 | 85.54 |
| HiVE-MIL | 80.13 | 80.08 | 75.21 | 74.99 | 88.89 | 87.18 |
| Δ vs 2nd | +3.27 | +3.26 | +2.50 | +2.41 | +1.07 | +1.64 |
Consistency Across VLM Encoders¶
HiVE-MIL achieves state-of-the-art performance consistently across three pathology VLMs: PLIP (208K image-text pairs), QuiltNet (1M image-text pairs), and CONCH.
Ablation Study¶
- Removing TGDF → F1 drops by 1.5–3%, demonstrating the necessity of filtering weakly matched patch-text pairs.
- Removing hierarchical edges → significant performance degradation, validating the importance of cross-scale modeling.
- Removing HTCL → reduced cross-scale consistency of textual semantics.
Highlights & Insights¶
- ⭐⭐⭐ Unified Hierarchical Heterogeneous Graph: The first approach to simultaneously model cross-scale intra-modal hierarchical interactions and intra-scale cross-modal alignment in WSIs, yielding substantial gains over naive fusion strategies.
- ⭐⭐⭐ Text-Guided Dynamic Filtering: A top-down two-stage filtering mechanism that effectively removes irrelevant patch-text pairs and is updated dynamically during training.
- ⭐⭐ Consistent and Substantial Improvements: Outperforms all baselines across 3 datasets × 3 VLMs × multiple shot settings.
- ⭐⭐ Well-Motivated Design: Each component (MSA, TGDF, HTCL) is thoroughly validated through ablation experiments.
Limitations & Future Work¶
- Computational Overhead: Constructing and processing the hierarchical heterogeneous graph — particularly the 16× high-scale patch expansion — introduces significant memory and computational costs.
- Dependence on GPT-4o for Text Generation: The quality of hierarchical text descriptions relies on LLM outputs; generalizability across different cancer types requires further validation.
- Limited to Binary/Ternary Classification: Experiments cover only 2–3 class settings; performance on finer-grained subtype classification remains unverified.
- Sensitivity of Threshold Hyperparameter \(\alpha\): Whether the filtering threshold \(\alpha=0.5\) in TGDF is optimal across all datasets and VLMs is not thoroughly investigated.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall Recommendation | ⭐⭐⭐⭐ |