MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification
Conference: CVPR 2026 · arXiv: 2603.09374 · Code: available (open-sourced) · Area: Medical Imaging · Keywords: multiple instance learning, mammography, precomputed features, frozen foundation models, weakly supervised classification
TL;DR
MIL-PF pairs frozen general-purpose foundation encoders (DINOv2 ViT-Giant / MedSigLIP) with a lightweight MIL aggregation head of only ~40k parameters. Using a dual-stream aggregation strategy (global mean pooling + local Perceiver cross-attention), it achieves state-of-the-art performance on large-scale mammography classification benchmarks such as EMBED (AUC 0.916, Spec@Sens=0.9 of 0.762), while training in 5–7 minutes with 35–458× fewer trainable parameters than the baselines.
Background & Motivation
Background: Breast cancer is the most common malignancy and the leading cause of cancer-related death in women; mammography is the preferred screening modality. Existing methods either fine-tune large backbones end-to-end (14–23M parameters, computationally expensive) or rely on domain-specific pretrained encoders (e.g., MammoCLIP) with limited generalization.
Limitations of Prior Work: Mammography presents three unique challenges: (1) extremely high resolution (up to 4708×5844 pixels), making end-to-end fine-tuning of large models impractical; (2) annotations are available only at the breast level (BI-RADS grading), without pixel-level labels, yielding a weakly supervised setting; and (3) a single examination contains multiple views (CC, MLO), requiring joint reasoning across views.
Key Challenge: Can powerful general-purpose foundation models (e.g., DINOv2) generalize zero-shot to out-of-distribution mammography data? If so, precomputed features could substantially reduce experimental cost, challenging the prevailing assumption that medical imaging requires domain-specific pretraining.
Goal: To design a lightweight classification framework that models both global tissue structure and sparse local lesion signals using features from frozen foundation models, without fine-tuning large visual encoders.
Key Insight: The authors observe that DINOv2 ViT-Giant and MedSigLIP generalize surprisingly well to mammography in a zero-shot setting, substantially outperforming the domain-specific MammoCLIP (AUC 0.897 vs. 0.870), validating the feasibility of the "frozen general-purpose encoder + lightweight task head" paradigm.
Core Idea: Precomputed features from frozen general-purpose foundation encoders, combined with a dual-stream MIL aggregation head (global mean pooling + local Perceiver attention), achieve SOTA performance with ~40k parameters and 5–7 minutes of training.
Method
Overall Architecture
MIL-PF operates in two stages: (1) Feature Precomputation—a frozen encoder \(\mathcal{F}\) (DINOv2 ViT-Giant or MedSigLIP) extracts global features (whole-image encoding \(\mathcal{G}_i\)) and local features (tile-level encoding \(\mathcal{T}_i\)) for each mammogram, producing an embedding dataset \(\mathcal{E} = \{(\mathcal{G}_i, \mathcal{T}_i, y_i)\}\); (2) MIL Head Training—a ~40k-parameter aggregation head is trained on these embeddings, comprising a global-stream aggregator \(\mathcal{A}_\psi^G\), a local-stream Perceiver aggregator \(\mathcal{A}_\omega^T\), and a final classification layer \(h_\theta\). A bag is defined as all view images of the same breast within the same examination.
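As a concrete sketch of stage 1, the loop below builds the embedding dataset \(\mathcal{E}\) once, so that all later training touches only small vectors. Here `encode` and `tile_fn` are hypothetical stand-ins for the frozen foundation model \(\mathcal{F}\) and the tiler; the real pipeline would batch encoder calls on GPU.

```python
import numpy as np

def build_embedding_dataset(exams, encode, tile_fn):
    """Stage-1 sketch: run the frozen encoder once per full image
    (global stream) and once per tissue tile (local stream), then
    store (G, T, label) triples. A bag is all view images of one
    breast within one examination."""
    dataset = []
    for images, label in exams:                        # one breast/exam
        G = np.stack([encode(img) for img in images])  # global stream
        T = np.stack([encode(t)                        # local stream,
                      for img in images                # tiles merged
                      for t in tile_fn(img)])          # across views
        dataset.append((G, T, label))
    return dataset

# Toy stand-ins: a 2-d "encoder" and a tiler that yields 3 crops.
enc = lambda x: np.array([x.mean(), x.max()])
tiles = lambda x: [x[:2], x[2:4], x[4:]]
data = build_embedding_dataset(
    [([np.arange(6.0), np.arange(6.0) + 1], 1)], enc, tiles)
G, T, y = data[0]
```

With two views and three tiles per view, the bag holds a (2, d) global matrix and a (6, d) tile matrix; only these embeddings are stored, never the pixels.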
Key Designs
- Dual-Stream Embedding Dataset Construction
- Function: Decomposes each mammographic examination into two complementary signal sources—global tissue context and local lesion candidates.
- Mechanism: The global stream encodes each full image as \(\mathcal{G}_i = \{\mathcal{F}(I_i^{(n)})\}_n\) to capture overall tissue density. The local stream tiles each image into a non-overlapping grid, discards pure-background tiles, encodes each tissue-containing tile individually, and merges all tiles across views: \(\mathcal{T}_i = \bigcup_n \bigcup_k \{\mathcal{F}(C_i^{(n)(k)})\}\). Tile size is chosen to be large enough to encompass the expected ROI while not exceeding the encoder's maximum supported resolution (448/518 pixels).
- Design Motivation: The global stream provides macroscopic tissue density information; the local stream captures fine-grained signals from sparse lesions. The two streams are complementary.
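The tiling step can be sketched as follows. The tissue/background test (fraction of non-zero pixels above a small threshold) is an assumed heuristic, as the note does not specify the exact rule; 518 matches DINOv2's input resolution.

```python
import numpy as np

def tile_image(image, tile_size=518, background_thresh=0.05):
    """Split a mammogram into non-overlapping tiles, keeping only
    tiles that contain breast tissue. Edge remainders smaller than
    a full tile are ignored in this sketch."""
    h, w = image.shape
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tile = image[y:y + tile_size, x:x + tile_size]
            # Discard near-empty (pure background) tiles.
            if (tile > 0).mean() > background_thresh:
                tiles.append(tile)
    return tiles

# Toy example: a blank "mammogram" with one bright tissue region.
img = np.zeros((2072, 1554), dtype=np.float32)
img[200:1300, 100:900] = 1.0
kept = tile_image(img)
```

On this toy image the 4×3 grid yields 12 candidate tiles, of which only the 6 overlapping the tissue region survive the background filter.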
- Perceiver-Style Local Attention Aggregator
- Function: Extracts the most relevant lesion information from a large set of local tile embeddings into a single summary vector.
- Mechanism: A single trainable latent vector \(z\) serves as the query; all tile embeddings are projected into keys and values; the aggregated representation is computed as \(\text{softmax}(zK^T)V\). A single latent query is found to be sufficient; adding more yields no benefit.
- Design Motivation: Mean pooling dilutes the signal with background tiles; max pooling captures only the single most salient tile. Cross-attention learns which tiles are task-relevant and is more parameter-efficient than self-attention, as it does not need to model inter-tile dependencies.
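A minimal NumPy sketch of the aggregator's \(\text{softmax}(zK^T)V\) computation with a single latent query. The projection dimensions are illustrative, and the \(1/\sqrt{a}\) scaling is a standard addition not stated in the note.

```python
import numpy as np

def perceiver_aggregate(tiles, z, Wk, Wv):
    """Single-latent-query cross-attention over one bag of tiles.
    tiles: (n, d) tile embeddings; z: (a,) trainable latent query;
    Wk, Wv: (d, a) key/value projections."""
    K = tiles @ Wk                      # keys,   (n, a)
    V = tiles @ Wv                      # values, (n, a)
    scores = K @ z / np.sqrt(z.size)    # one score per tile, (n,)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w = w / w.sum()                     # attention weights over tiles
    return w @ V                        # (a,) summary vector

rng = np.random.default_rng(0)
n, d, a = 37, 1536, 64                  # 37 tiles in the bag
summary = perceiver_aggregate(rng.normal(size=(n, d)),
                              rng.normal(size=a),
                              rng.normal(size=(d, a)),
                              rng.normal(size=(d, a)))
```

Note the parameter count: only `z`, `Wk`, and `Wv` are learned, and the cost is linear in the number of tiles, whereas self-attention over the same bag would be quadratic.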
- Late-Fusion Classification Head
- Function: Combines the global and local streams into a final prediction.
- Mechanism: The aggregated vectors from each stream are concatenated and passed through a classification layer: \(\hat{y}_i = h_\theta(\text{concat}(\mathcal{A}_\psi^G(\mathcal{G}_i), \mathcal{A}_\omega^T(\mathcal{T}_i)))\). Each aggregator contains a 2-layer MLP (embedding_dim→16→8, ReLU).
- Design Motivation: Late fusion preserves modularity and interpretability; more complex early fusion offers no meaningful benefit for this task.
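The fusion head above can be sketched as follows, assuming DINOv2-sized 1536-d embeddings. The per-stream MLP shape (dim→16→8, ReLU) follows the note; the random weights and the single sigmoid output are placeholders for trained parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mil_head_forward(global_vec, local_vec, params):
    """Late-fusion sketch: each stream is reduced by a small MLP
    (dim -> 16 -> 8, ReLU), the two 8-d outputs are concatenated,
    and a final linear layer produces the probability."""
    g = relu(relu(global_vec @ params["Wg1"]) @ params["Wg2"])  # (8,)
    t = relu(relu(local_vec @ params["Wt1"]) @ params["Wt2"])   # (8,)
    fused = np.concatenate([g, t])                              # (16,)
    logit = fused @ params["w_out"]                             # scalar
    return 1.0 / (1.0 + np.exp(-logit))                         # sigmoid

d = 1536  # e.g. DINOv2 ViT-Giant embedding dimension
rng = np.random.default_rng(0)
params = {
    "Wg1": rng.normal(size=(d, 16)) * 0.01,
    "Wg2": rng.normal(size=(16, 8)),
    "Wt1": rng.normal(size=(d, 16)) * 0.01,
    "Wt2": rng.normal(size=(16, 8)),
    "w_out": rng.normal(size=(16,)),
}
p = mil_head_forward(rng.normal(size=d), rng.normal(size=d), params)
```

Because each stream is reduced independently before the concat, either stream can be ablated or inspected on its own, which is the modularity argument made above.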
Loss & Training
- Binary Cross-Entropy loss.
- The entire embedding dataset fits in a single batch on one A100 40 GB GPU; each training run takes 5–7 minutes with ~2M FLOPs per breast forward pass.
- Each experiment is repeated for 36 independent runs; the model with the highest validation AUC is selected, leveraging the low training cost to mitigate variance.
- Data splits are 70/10/20, stratified by BI-RADS label with no patient leakage.
- Non-overlapping tiles are used for classification; 75% overlapping tiles are used for attention map visualization.
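The repeat-and-select strategy in the bullets above is simple enough to sketch directly. `train_fn` and `evaluate_auc` are placeholders for the actual MIL-head training and validation-AUC computation; the point is that cheap runs make best-of-36 selection practical.

```python
def select_best_run(train_fn, evaluate_auc, n_runs=36):
    """Variance mitigation: because one run costs only minutes,
    repeat training with different seeds and keep the model with
    the highest validation AUC."""
    best_auc, best_model = float("-inf"), None
    for seed in range(n_runs):
        model = train_fn(seed)          # one cheap training run
        auc = evaluate_auc(model)       # validation AUC
        if auc > best_auc:
            best_auc, best_model = auc, model
    return best_model, best_auc

# Toy stand-ins: "training" returns the seed; AUC peaks at seed 17.
model, auc = select_best_run(
    train_fn=lambda s: s,
    evaluate_auc=lambda m: 0.9 - abs(m - 17) * 0.001,
)
```

Note that selecting on validation AUC while reporting test metrics keeps the selection honest; selecting on the test set would be leakage.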
Key Experimental Results
Main Results: EMBED + VinDr BI-RADS Classification
| Method | Trainable Params | Level | EMBED AUC↑ | EMBED Spec@Sens=0.9↑ | VinDr AUC↑ |
|---|---|---|---|---|---|
| GMIC [Shen] | 14.11M | Image | 0.816 | 0.380 | 0.899 |
| SIL IL GMIC [Pathak] | 22.49M | Image | 0.875 | 0.566 | 0.911 |
| FPN-AbMIL [Mourão] | 1.76M | Image | 0.802 | 0.367 | 0.920 |
| FPN-AbMIL (mean) | 1.76M | Breast | 0.835 | 0.403 | 0.911 |
| MIL-PF (DINOv2 attn) | 0.05M | Breast | 0.916 | 0.762 | 0.894 |
| MIL-PF (MedSigLIP attn) | 0.04M | Breast | 0.914 | 0.746 | 0.911 |
Results on Additional Datasets
| Dataset | MIL-PF (DINOv2 attn) AUC | MIL-PF (MedSigLIP attn) AUC | Best Baseline AUC |
|---|---|---|---|
| VinDr Calcification | 0.967 | 0.967 | 0.954 (FPN-AbMIL) |
| VinDr Mass | 0.800 | 0.814 | 0.808 (FPN-AbMIL mean) |
| RSNA Cancer | 0.923 | 0.923 | 0.914 (FPN-AbMIL mean) |
Ablation Study: Encoder Choice and Aggregation Strategy
| Encoder | Resolution | AUC (EMBED) | Spec@Sens=0.9 |
|---|---|---|---|
| DINOv2 ViT-Giant | 518×518 | 0.897 | 0.655 |
| MedSigLIP | 448×448 | 0.897 | 0.691 |
| MammoCLIP | 1520×912 | 0.870 | 0.558 |
| BiomedCLIP | 224×224 | 0.872 | 0.606 |
| DINOv3 ViT-Huge+ | 512×512 | 0.831 | 0.497 |

| Aggregation Strategy | DINOv2 AUC | DINOv2 Spec@Sens=0.9 |
|---|---|---|
| Global max + Local max | 0.905 | 0.703 |
| Global max + Local attn | 0.916 | 0.762 |
Key Findings
- General-purpose foundation models (DINOv2/MedSigLIP) substantially outperform the domain-specific MammoCLIP in zero-shot generalization to mammography (AUC 0.897 vs. 0.870), challenging the assumption that medical imaging requires domain-specific pretraining.
- The attention aggregator's advantage is most pronounced on Spec@Sens=0.9 (0.762 vs. 0.703), the clinically more critical operating point.
- DINOv3 unexpectedly underperforms (AUC 0.831), demonstrating that the latest model version is not always superior.
- Small lesion detection is constrained by tile size (448–518 pixels), with small-lesion mAP values as low as 0.1–1.2.
Highlights & Insights
- The core insight—"frozen general-purpose foundation models applied directly to mammography yield surprisingly strong results"—challenges the assumed necessity of domain-specific pretraining and carries broad implications for the medical imaging community. It validates the robust out-of-distribution generalization of large-scale general-purpose learned representations.
- With only 40k parameters and 5–7 minutes of training, the approach substantially lowers the barrier to entry for research, particularly benefiting resource-constrained teams; precomputed features enable rapid experimental iteration and encoder comparison.
- The Perceiver-style single-query cross-attention is an elegant design choice for ROI-sparse scenarios—more parameter-efficient than self-attention and better at capturing sparse signals than mean/max pooling.
- The hierarchical MIL problem formulation is broadly transferable: the nested structure, complementary dual-stream design, and weak-label framework can be directly adapted to other high-resolution weakly supervised settings such as pathology and radiology.
Limitations & Future Work
- On the smaller VinDr dataset, BI-RADS classification underperforms end-to-end fine-tuning (AUC 0.894 vs. 0.911), suggesting that the frozen encoder approach is less advantageous on small datasets.
- Small lesion detection is constrained by tile size, yielding extremely low mAP values (0.1–1.2); multi-scale tiling strategies could address this.
- Patient longitudinal history and bilateral symmetry information are not utilized—the authors identify these as important future directions.
- Inter-run variance is substantial (Spec@Sens=0.9 variance up to 11%), requiring 36 training runs with a selection strategy, which adds operational complexity.
- BI-RADS labels are inherently noisy with limited inter-radiologist agreement, placing an upper bound on the quality of training supervision.
Related Work & Insights
- vs. FPN-AbMIL / SetTrans (Mourão et al.): These methods train 1.76–5.38M parameters end-to-end. MIL-PF outperforms them on the largest benchmark, EMBED (AUC 0.916 vs. 0.835); its key advantages are training efficiency (35× fewer trainable parameters) and breast-level modeling that better matches clinical workflows.
- vs. GMIC / SIL IL GMIC (Shen / Pathak et al.): These methods train 14–23M parameters. SIL IL GMIC is stronger on the smaller VinDr dataset (AUC 0.911), but MIL-PF surpasses it on the largest and most diverse benchmark, EMBED, suggesting the precomputed-feature approach is more robust at scale.
- vs. MammoCLIP: Domain-specific pretraining underperforms general-purpose DINOv2/MedSigLIP on out-of-distribution datasets, suggesting that large-scale general-purpose pretraining yields greater cross-domain robustness.
Rating
⭐⭐⭐⭐
- Novelty ⭐⭐⭐: The method combines existing components; the core contribution is an empirical finding rather than a methodological innovation.
- Experimental Thoroughness ⭐⭐⭐⭐⭐: Three datasets, multi-encoder comparisons, 36 independent runs, comprehensive ablations, and detection-based interpretability analysis.
- Writing Quality ⭐⭐⭐⭐: Problem formulation is clear and experimental design is rigorous.
- Value ⭐⭐⭐⭐: Significant practical value for the medical imaging community—demonstrates a low-resource, high-efficiency research pathway.