# G2L: From Giga-Scale to Cancer-Specific Large-Scale Pathology Foundation Models via Efficient Fine-Tuning
- Conference: AAAI 2026
- arXiv: 2510.11176
- Code: N/A
- Area: Computational Pathology / Foundation Models
- Keywords: Pathology Foundation Models, Knowledge Distillation, Model Compression, Cancer Specificity, Vision Transformer
## TL;DR
This paper proposes G2L (Giga-to-Large), a distillation framework that transfers knowledge from a 1.9B-parameter giga-scale pathology foundation model (H-optimus-0) to a 300M-parameter large-scale model (Hibou-L) using only 1K whole slide images, achieving performance on par with or superior to the teacher model and larger models across multiple cancer-specific downstream tasks.
## Background & Motivation
Foundation models in computational pathology have advanced rapidly in recent years. By pre-training Vision Transformers on large-scale whole slide images (WSIs), these models learn generalizable histomorphological representations applicable to downstream tasks such as tumor classification, mutation detection, and immune infiltration analysis.
Recent work has made clear that "scale is all you need" in pathology foundation models: more training data, greater cancer type diversity, and larger model capacity consistently yield better performance. Representative giga-scale models such as H-optimus-0 and GigaPath employ ViT-G backbones (1.9B parameters) trained on 170K+ slides spanning 28 cancer types, achieving state-of-the-art results across a wide range of downstream benchmarks.
Nevertheless, giga-scale models face three practical challenges:
Prohibitive development cost: Pre-training requires hundreds of thousands of slides and substantial GPU resources, placing such models beyond the reach of most research institutions and clinical settings.
Deployment difficulty: The inference cost of 1.9B-parameter models is substantial, limiting real-time clinical application.
Dilution of cancer-type specificity: When training data encompasses dozens of cancer types, the distinctive morphological signals of a specific cancer (e.g., breast cancer) may be overwhelmed by data from the others: the structural patterns, nuclear morphology, and stromal interactions characteristic of that cancer can end up underweighted in multi-cancer training.
The core idea of this paper is: rather than training a massive model from scratch, knowledge distillation can compress the capabilities of a giga-scale model into a large-scale model with only 15% of the parameters, while simultaneously specializing the model for a target cancer type using as few as 1K slides. This approach balances efficiency, accessibility, and cancer-type specificity.
## Method
### Overall Architecture
The G2L framework consists of three steps: (1) select a target cancer type and retrieve 1K slides of that cancer from a public database (e.g., TCGA); (2) tile each slide into non-overlapping 256×256 patches and randomly crop to 224×224; (3) train via knowledge distillation using the giga-scale model as teacher and the large-scale model as student.
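As a rough illustration of step (2), the tile-and-crop preprocessing can be sketched in PyTorch as below; the slide is assumed to be already decoded into a tensor, and `tile_slide` is an illustrative helper rather than code from the paper:

```python
import torch
from torchvision import transforms

def tile_slide(slide: torch.Tensor, tile: int = 256) -> torch.Tensor:
    """Split a (C, H, W) slide image into non-overlapping tile x tile patches."""
    c, h, w = slide.shape
    return (
        slide.unfold(1, tile, tile)    # slice along height
             .unfold(2, tile, tile)    # slice along width
             .permute(1, 2, 0, 3, 4)   # (nH, nW, C, tile, tile)
             .reshape(-1, c, tile, tile)
    )

random_crop = transforms.RandomCrop(224)   # 256x256 patch -> 224x224 input

slide = torch.rand(3, 1024, 1024)          # stand-in for a decoded WSI region
patches = tile_slide(slide)                # (16, 3, 256, 256)
inputs = torch.stack([random_crop(p) for p in patches])  # (16, 3, 224, 224)
```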
### Key Designs
- Teacher–Student Model Selection:
  - Function: H-optimus-0 (ViT-G/14, 1.9B parameters) serves as the teacher; Hibou-L (ViT-L/14, 300M parameters) serves as the student.
  - Mechanism: The teacher outputs 1536-dimensional features and the student outputs 1024-dimensional features; a linear projection layer \(W_p\) with BatchNorm is used to align dimensions.
  - Design Motivation: H-optimus-0 is the current SOTA on multiple downstream tasks; Hibou-L is a strong performer among models of comparable scale. Adding a single projection layer at the student's output minimizes architectural modification.
- Log-Sum Loss (see the sketch after this list):
  - Function: A smooth feature alignment loss that measures the discrepancy between teacher and student representations.
  - Mechanism: \(D(Z_s, Z_t; W_p) = \log \sum_i |Z_s W_p - Z_t|_i^\alpha\), where the exponent is set to \(\alpha = 4\).
  - Design Motivation: Compared to standard MSE or L1 losses, the Log-Sum formulation is more sensitive to large deviations and more tolerant of small ones. The high-order exponent \(\alpha = 4\) further amplifies large errors, encouraging the student to prioritize the feature dimensions where it diverges most from the teacher.
- Efficient Training Strategy:
  - Function: Enables rapid distillation training using only 1K slides.
  - Mechanism: AdamW optimizer (initial learning rate \(10^{-4}\), weight decay 0.05) with cosine annealing schedules for both the learning rate and the weight decay. Early stopping is applied: training halts once the current loss has exceeded the running mean of the previous 100 iterations more than 10 times (see the training-loop sketch in the Loss & Training subsection).
  - Design Motivation: 1K slides are sufficient to capture the morphological feature distribution of the target cancer type, eliminating the need for large-scale data. Data augmentation (flipping, color jitter, Gaussian blur) is applied during training to further improve generalization.
- Data Augmentation:
  - Function: Applied to the input patches fed to both teacher and student simultaneously.
  - Mechanism: Horizontal/vertical flipping (50%), color jitter (50%; brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05), and Gaussian blur (10%, 9×9 kernel); these transforms also appear in the training-loop sketch below.
  - Design Motivation: Improves robustness to color variation and imaging-condition differences, which is particularly important for multi-center pathology data.
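Below is a minimal PyTorch sketch of the student-side projection and the Log-Sum loss defined above. The `Projector` module and the exact BatchNorm placement are assumptions; the summary only states that the projection \(W_p\) uses BatchNorm:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Linear projection W_p with BatchNorm, mapping student features (1024-d)
    into the teacher's feature space (1536-d). Placing BatchNorm after the
    linear layer is an assumption."""
    def __init__(self, d_student: int = 1024, d_teacher: int = 1536):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)
        self.bn = nn.BatchNorm1d(d_teacher)

    def forward(self, z_s: torch.Tensor) -> torch.Tensor:
        return self.bn(self.proj(z_s))

def log_sum_loss(z_s_proj: torch.Tensor, z_t: torch.Tensor,
                 alpha: float = 4.0) -> torch.Tensor:
    """D(Z_s, Z_t; W_p) = log sum_i |Z_s W_p - Z_t|_i^alpha, summed over all
    elements of the residual; alpha = 4 amplifies the largest deviations."""
    return (z_s_proj - z_t).abs().pow(alpha).sum().log()

# Toy check with random features for a batch of 32 patches.
z_s, z_t = torch.randn(32, 1024), torch.randn(32, 1536)
proj = Projector()
loss = log_sum_loss(proj(z_s), z_t)
```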
### Loss & Training
The sole loss function is the Log-Sum Loss described above. Training is conducted on 3× NVIDIA RTX A6000 GPUs with a batch size of 32. The entire distillation process is highly data-efficient, drawing only 1K slides from TCGA-BRCA or TCGA-PRAD.
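Putting the pieces together, here is a hedged sketch of the distillation loop under the stated hyperparameters. The dataloader, the teacher/student feature interfaces, and the reuse of `Projector` and `log_sum_loss` from the sketch above are illustrative assumptions, and the weight-decay cosine schedule is omitted for brevity:

```python
import torch
from collections import deque
from torchvision import transforms

# Augmentations as specified in the paper; the same augmented view is fed
# to both teacher and student.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(
        brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05)], p=0.5),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=9)], p=0.1),
])

def distill(teacher, student, projector, loader, max_steps=100_000):
    teacher.eval()                       # the giga-scale teacher stays frozen
    params = list(student.parameters()) + list(projector.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.05)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=max_steps)

    recent = deque(maxlen=100)           # running window of recent losses
    exceed_count = 0                     # early-stopping counter
    for step, patches in enumerate(loader):
        x = augment(patches)             # identical view for both models
        with torch.no_grad():
            z_t = teacher(x)             # 1536-d teacher features
        loss = log_sum_loss(projector(student(x)), z_t)

        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

        # Early stopping: halt once the current loss has exceeded the running
        # mean of the previous 100 iterations more than 10 times.
        if len(recent) == 100 and loss.item() > sum(recent) / len(recent):
            exceed_count += 1
            if exceed_count > 10:
                break
        recent.append(loss.item())
        if step + 1 >= max_steps:
            break
```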
## Key Experimental Results
### Main Results
Evaluation is conducted on 9 downstream benchmark tasks spanning two major cancer types (breast and prostate), comparing 6 foundation models of varying scales. Two evaluation protocols are employed: a training-free method (kNN voting) and linear probing.
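For concreteness, the two protocols can be sketched with scikit-learn as follows; the synthetic features, choice of k, and solver settings are illustrative stand-ins, not the paper's exact configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# X_*: (N, D) frozen patch embeddings from a foundation model (synthetic here).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 1024)), rng.integers(0, 2, 500)
X_test, y_test = rng.normal(size=(200, 1024)), rng.integers(0, 2, 200)

# Protocol 1: training-free kNN voting, reported as accuracy.
knn = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))

# Protocol 2: linear probing, i.e. a linear classifier on frozen features,
# reported as AUC.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
```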
Training-Free Method (Accuracy):
| Dataset | Cancer | G2L (0.3B) | H-optimus-0 (1.9B) | UNI-v2 (0.6B) | Hibou-L (0.3B) |
|---|---|---|---|---|---|
| TILS | Breast | 0.9362 | 0.9344 | 0.9291 | 0.9214 |
| TP53 | Breast | 0.6904 | 0.6598 | 0.6542 | 0.6504 |
| IDC | Breast | 0.9232 | 0.9141 | 0.9165 | 0.9074 |
| Gleason | Prostate | 0.8988 | 0.8994 | 0.8678 | 0.8124 |
| AGGC | Prostate | 0.9243 | 0.9226 | 0.9170 | 0.8788 |
| CHIMERA | Prostate | 0.7657 | 0.7663 | 0.7605 | 0.7184 |
Linear Probing (AUC):
| Dataset | Cancer | G2L (0.3B) | H-optimus-0 (1.9B) | UNI-v2 (0.6B) | Hibou-L (0.3B) |
|---|---|---|---|---|---|
| TILS | Breast | 0.9838 | 0.9822 | 0.9788 | 0.9827 |
| TP53 | Breast | 0.8046 | 0.7603 | 0.6795 | 0.7085 |
| IDC | Breast | 0.9796 | 0.9778 | 0.9756 | 0.9488 |
| Gleason | Prostate | 0.9841 | 0.9846 | 0.9790 | 0.9708 |
| AGGC | Prostate | 0.9958 | 0.9955 | 0.9956 | 0.9921 |
G2L matches or surpasses the giga-scale teacher model on most tasks while using only 15% of its parameters. Notably, on TP53 mutation prediction, G2L achieves an AUC of 0.8046, substantially outperforming the teacher's 0.7603 (+4.4 AUC points).
### Ablation Study: Feature Similarity
Centered Kernel Alignment (CKA) is used to quantify the alignment between student and teacher feature spaces before and after distillation:
| Dataset | CKA (Pre-distillation) | CKA (Post-distillation) | Note |
|---|---|---|---|
| BRCAS | 0.7594 | 0.9683 | +27.5% improvement; highly aligned feature spaces |
| BreakHis 40× | 0.8909 | 0.9558 | Consistent improvement across all magnifications |
| BreakHis 100× | 0.9147 | 0.9686 | |
| BreakHis 200× | 0.9230 | 0.9734 | |
| BreakHis 400× | 0.8995 | 0.9575 |
Post-distillation CKA values consistently exceed 0.95, demonstrating that the student model learns a latent representation highly consistent with that of the teacher.
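For reference, a minimal NumPy sketch of linear CKA in the standard Kornblith et al. formulation; whether the paper uses the linear or a kernel variant is not specified in this summary, so the linear form is an assumption:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between features X (N, D1) and
    Y (N, D2) extracted from the same N patches."""
    X = X - X.mean(axis=0)                          # center each dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2      # HSIC with linear kernels
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Example: alignment between student (1024-d) and teacher (1536-d) features.
rng = np.random.default_rng(0)
print(linear_cka(rng.normal(size=(256, 1024)), rng.normal(size=(256, 1536))))
```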
### Robustness Metrics
Robustness is evaluated on the TIGER dataset using a robustness index (tissue consistency / site consistency; values >1 indicate greater sensitivity to biological features than to imaging-site variation):
| Model | k=3 | k=5 | k=10 | k=20 |
|---|---|---|---|---|
| H-optimus-0 | 1.0826 | 1.1890 | 1.3730 | 1.8467 |
| UNI-v2 | 1.0682 | 1.2433 | 1.4113 | 1.8548 |
| Hibou-L | 0.9056 | 0.9905 | 1.0879 | 1.1855 |
| G2L | 1.0891 | 1.3002 | 1.5021 | 2.0316 |
G2L achieves the highest robustness index across all values of \(k\), surpassing both the teacher model and all larger models, indicating that the distilled model more effectively captures biologically meaningful morphological features rather than imaging artifacts.
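This summary does not pin down the exact computation of the robustness index; one plausible reading, sketched below, measures tissue and site consistency as the mean fraction of each patch's k nearest neighbors that share its tissue or site label (the helper names and details are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_consistency(feats: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Mean fraction of each sample's k nearest neighbors (self excluded)
    that share its label."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(feats).kneighbors(feats)
    return (labels[idx[:, 1:]] == labels[:, None]).mean()

def robustness_index(feats, tissue_labels, site_labels, k=10):
    # > 1 means neighborhoods agree more on tissue type (biology) than on
    # imaging site (acquisition artifact).
    return knn_consistency(feats, tissue_labels, k) / knn_consistency(feats, site_labels, k)

# Example with synthetic features and labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 1024))
tissue, site = rng.integers(0, 3, 300), rng.integers(0, 5, 300)
print(robustness_index(feats, tissue, site, k=10))
```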
### Key Findings
- G2L matches or surpasses the teacher model on most benchmarks using only 15% of its parameters (0.3B vs. 1.9B), exceeding the teacher by approximately 4.4 AUC points on TP53 mutation prediction.
- Distillation requires only 1K slides, demonstrating exceptional data efficiency.
- The student not only acquires the teacher's representational capacity but in some cancer-specific tasks surpasses it — likely because the distillation process focuses on morphological signals of the target cancer type, avoiding the signal dilution inherent in multi-cancer training.
- The improvement in robustness index indicates that G2L better discriminates between biological features and imaging variation, enhancing clinical applicability.
## Highlights & Insights
- The approach is remarkably simple and practical: standard knowledge distillation combined with 1K slides and a single projection layer suffices to transfer the capabilities of a massive model to a smaller one, solving a real-world problem without methodological complexity.
- The finding that a smaller, cancer-specific model can outperform a larger generalist model is highly instructive: cancer-specific distillation may be better suited to clinical settings than universal large models, resonating with the broader debate around domain specialization versus general-purpose scaling.
- The evaluation framework is comprehensive: 9 benchmarks spanning patch, ROI, and slide granularities; both kNN and linear probing protocols; CKA feature analysis and robustness index together form a complete evaluation loop.
- The results have direct implications for clinical deployment: practitioners need not pursue the largest available model, as distillation can yield high-performing cancer-specific models at low cost.
## Limitations & Future Work
- Validation is limited to breast and prostate cancer; generalization to rarer cancer types — where fewer than 1K training slides may be available — remains untested.
- The teacher–student pairing (H-optimus-0 → Hibou-L) is fixed; alternative pairings are not explored.
- The choice of \(\alpha=4\) in the Log-Sum Loss is empirical and lacks ablation.
- The effect of varying the number of distillation slides (e.g., 500 vs. 1K vs. 2K) is not investigated.
- Only feature-level distillation is employed; more advanced techniques such as attention map distillation or relational distillation are not explored.
## Related Work & Insights
- Unlike general vision distillation works such as TinyViT and DeiT, G2L targets the pathology domain and distills using only in-domain data, making it closer in spirit to domain adaptation via distillation.
- The proposed approach can inspire domain-specific distillation of large foundation models in other fields (e.g., remote sensing, autonomous driving).
- The robustness index evaluation methodology is worth adopting in other multi-center medical imaging studies.
## Rating
- Novelty: ⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐