# Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
- Conference: CVPR 2026
- arXiv: 2603.02554
- Code: GitHub
- Area: Segmentation / Knowledge Distillation
- Keywords: Knowledge Distillation, Domain Generalization, Vision Foundation Models, Semantic Segmentation, Multi-stage Learning
## TL;DR

This paper proposes Generalizable Knowledge Distillation (GKD), which transfers the cross-domain generalization capability of vision foundation models (VFMs) to lightweight student models. GKD combines a multi-stage distillation scheme that decouples representation learning from task learning with a query-based soft distillation (QSD) mechanism, and achieves an average improvement of +10.6% mIoU under the F2L (Foundation-to-Local) setting.
## Background & Motivation
Knowledge distillation (KD) is widely used for model compression in semantic segmentation. However, conventional KD methods share a largely overlooked critical flaw: they preserve in-domain accuracy while significantly degrading out-of-domain generalization. With the rise of VFMs such as DINOv2, this issue becomes more pronounced—VFMs inherently possess strong generalization ability, yet student models obtained via conventional KD exhibit degraded generalization.
The authors empirically validate a key insight: task loss and distillation loss conflict in their optimization directions—the task objective drives the student toward source-domain-specific decision boundaries, while the distillation objective pushes the student toward the teacher's domain-invariant representations. Joint optimization of both objectives leads to unstable convergence and degraded generalization.
Core Problem: Can VFMs' out-of-domain generalization be preserved during distillation?
## Method

### Overall Architecture
GKD is a two-stage framework. Stage 1 (Domain-General Distillation) first equips the student with domain-agnostic representations via feature distillation. Stage 2 (Task Learning) freezes the student encoder and trains only the decoder for segmentation. This decoupled design ensures the student internalizes transferable knowledge before task-specific adaptation.
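A minimal sketch of how this decoupling might be organized in code is given below. The loaders, the feature interfaces, and the `qsd_loss` callable (a QSD-style distillation loss, sketched later in this note) are placeholders under assumed interfaces, not the authors' implementation.

```python
import torch

# Illustrative skeleton of GKD's decoupled two-stage schedule.
def stage1_domain_general_distillation(student, teacher, proxy_loader, source_loader, qsd_loss, opt):
    teacher.eval()
    # Step (a): task-agnostic distillation on proxy images (e.g. ImageNet),
    # then step (b): domain-agnostic distillation on unlabeled source images.
    for loader in (proxy_loader, source_loader):
        for images in loader:  # note: no segmentation labels anywhere in Stage 1
            s_patch, s_cls = student(images)
            with torch.no_grad():
                t_patch, t_cls = teacher(images)
            loss = qsd_loss(s_patch, s_cls, t_patch, t_cls)
            opt.zero_grad()
            loss.backward()
            opt.step()

def stage2_task_learning(student, decoder, source_loader, seg_loss, opt):
    # Freeze the distilled encoder; only the segmentation decoder is trained.
    student.eval()
    for p in student.parameters():
        p.requires_grad_(False)
    for images, labels in source_loader:
        with torch.no_grad():
            feats, _ = student(images)
        loss = seg_loss(decoder(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```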
### Key Designs
- Two-step Domain-General Distillation: Stage 1 is further divided into two steps: (a) Task-agnostic distillation: distillation is performed on a proxy dataset (ImageNet) to close the initial representational gap between the student and teacher: \(\min_{\theta_s} \mathbb{E}_{x_P \sim D_P}[\mathcal{L}_{QSD}(\mathcal{F}_{\theta_t}(x_P), \mathcal{F}_{\theta_s}(x_P))]\); (b) Domain-agnostic distillation: distillation continues on source-domain data without task labels, exposing the student to task-relevant yet domain-invariant features. Crucially, no task supervision is introduced throughout Stage 1, preventing the injection of domain-specific bias.
- Query-based Soft Distillation (QSD): The core distillation mechanism, designed to address the limitations of conventional point-wise feature alignment. Student features \(v_s\) serve as queries to retrieve spatial knowledge from teacher features \(v_t\) via an attention mechanism: \(W = \varphi(v_s) \cdot v_t^\top\). The student feature is reconstructed as \(v_s' = \sigma(\varphi(v_s) \cdot v_t^\top) \cdot \phi(v_s)\), and an MSE constraint is applied: \(\mathcal{L}_{feat} = \|v_s' - v_t\|_2^2\). Design motivation: the spatial structural information of VFMs exhibits strong domain invariance (confirmed via PCA visualization). QSD enables the student to selectively acquire transferable relational structure rather than passively mimicking local activations.
- Masked Patch Distillation + CLS Token Distillation: Inspired by DINOv2, a masked distillation loss \(\mathcal{L}_{mask} = \|v_s'^{mask} - v_t\|_2^2\) is introduced to extract hidden knowledge from VFMs, and CLS token distillation \(\mathcal{L}_{cls} = \|v_s'^{cls} - v_t^{cls}\|_2^2\) transfers global semantics. The total distillation loss is \(\mathcal{L}_{QSD} = \alpha\mathcal{L}_{feat} + \beta\mathcal{L}_{mask} + \gamma\mathcal{L}_{cls}\), with \(\alpha = \beta = \gamma = 1\). A minimal sketch of this combined loss is given after this list.
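Below is a minimal PyTorch sketch of the combined Stage 1 loss that follows the formulas above literally. The linear projections used for \(\varphi\) and \(\phi\), the scaled softmax used as \(\sigma\), the assumption of matching student/teacher token counts, and random masking with a learned mask token are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QSDLoss(nn.Module):
    """Sketch of the Stage 1 distillation loss: L_QSD = a*L_feat + b*L_mask + c*L_cls.

    Assumptions (not from the paper's code): linear varphi/phi projections,
    scaled softmax as sigma, equal numbers of student and teacher patch tokens,
    and a learned mask token for the masked-distillation term.
    """

    def __init__(self, dim_s, dim_t, alpha=1.0, beta=1.0, gamma=1.0, mask_ratio=0.5):
        super().__init__()
        self.varphi = nn.Linear(dim_s, dim_t)   # query projection varphi
        self.phi = nn.Linear(dim_s, dim_t)      # value projection phi
        self.mask_token = nn.Parameter(torch.zeros(dim_s))
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.mask_ratio = mask_ratio

    def reconstruct(self, v_s, v_t):
        # v_s' = sigma(varphi(v_s) . v_t^T) . phi(v_s), softmax over teacher tokens.
        attn = F.softmax(
            self.varphi(v_s) @ v_t.transpose(-2, -1) / v_t.shape[-1] ** 0.5, dim=-1
        )
        return attn @ self.phi(v_s)  # (B, N, dim_t); assumes N_s == N_t

    def forward(self, s_patch, s_cls, t_patch, t_cls):
        # L_feat: reconstructed student patch tokens regressed onto teacher tokens.
        loss_feat = F.mse_loss(self.reconstruct(s_patch, t_patch), t_patch)

        # L_mask: mask a random subset of student tokens, then reconstruct.
        keep = torch.rand(s_patch.shape[:2], device=s_patch.device) > self.mask_ratio
        s_masked = torch.where(keep.unsqueeze(-1), s_patch, self.mask_token)
        loss_mask = F.mse_loss(self.reconstruct(s_masked, t_patch), t_patch)

        # L_cls: align the (projected) student CLS token with the teacher CLS token.
        loss_cls = F.mse_loss(self.phi(s_cls), t_cls)

        return self.alpha * loss_feat + self.beta * loss_mask + self.gamma * loss_cls
```

With \(\alpha = \beta = \gamma = 1\), as reported above, this reduces to a plain sum of the three MSE terms.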
### Loss & Training
- Stage 1: AdamW, lr=5e-4; for F2L, first trained on ImageNet for 100 epochs (batch=512, 224×224), then on source-domain data for 300 epochs (batch=128, 512×512).
- Stage 2: Encoder frozen; Mask2Former decoder used; backbone lr=1e-5, decoder lr=1e-4; 40K iterations, batch=4, crop 512×512 (optimizer setup sketched after this list).
- Segmentation loss follows the standard Mask2Former configuration.
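For reference, the optimizer setup implied by this recipe might look roughly like the following. The module shapes and names are placeholders, and keeping the frozen backbone as a parameter group with lr=1e-5 simply mirrors the reported configuration (frozen parameters receive no updates).

```python
import torch
import torch.nn as nn

# Placeholders standing in for the distilled student encoder and the
# Mask2Former decoder; only the learning-rate grouping reflects the recipe.
student_encoder = nn.Linear(768, 768)
decoder = nn.Linear(768, 19)

# Stage 1: a single AdamW optimizer over the student encoder.
stage1_opt = torch.optim.AdamW(student_encoder.parameters(), lr=5e-4)

# Stage 2: encoder frozen, decoder trained; per-group learning rates as reported.
for p in student_encoder.parameters():
    p.requires_grad_(False)
stage2_opt = torch.optim.AdamW([
    {"params": student_encoder.parameters(), "lr": 1e-5},  # backbone group (frozen, effectively inactive)
    {"params": decoder.parameters(), "lr": 1e-4},          # Mask2Former decoder
])
```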
## Key Experimental Results

### Main Results
F2L results with a DeiT ViT-B student (all values are average mIoU):

| Setting | GKD | Prev. SOTA (KD) | Gain |
|---|---|---|---|
| GTAV → Citys+BDD+Map | 57.9 | 51.1 (G2SD) | +6.8 |
| Cityscapes → ACDC | 64.6 | 53.8 (G2SD) | +10.8 |
| Potsdam-RGB → P-I+V-I | 65.1 | 59.5 (G2SD) | +5.6 |

F2L results with a DeiT ViT-S student (all values are average mIoU):

| Setting | GKD | Prev. SOTA (KD) | Gain |
|---|---|---|---|
| GTAV → Citys+BDD+Map | 54.1 | 47.8 (G2SD) | +6.3 |
| Cityscapes → ACDC | 57.7 | 51.2 (G2SD) | +6.5 |
### Ablation Study

| Configuration | Avg mIoU (GTAV → Citys+BDD+Map) | Notes |
|---|---|---|
| Vanilla single-stage KD | 49.9 | Joint task + distillation objectives |
| Two-stage KD (MSE) | ~53 | Decoupled, but without QSD |
| + QSD | ~56 | Query-based selective distillation |
| + Masked distillation | ~57 | Extracts hidden VFM knowledge |
| Full GKD | 57.9 | All components included |
### Key Findings
- Conventional KD methods (CWD, Af-DCD) can actually reduce the student's out-of-domain generalization—a counterintuitive finding.
- The key to the two-stage design lies in removing task gradients from Stage 1, yielding smoother convergence and better generalization.
- The F2L (Foundation-to-Local) setting yields the most significant improvement (+10.6%), as smaller models have inherently weaker generalization and benefit more from GKD's compensation.
- The attention mechanism in QSD effectively selects spatially transferable knowledge; PCA visualization confirms that the student acquires domain-invariant spatial structure.
- GKD also generalizes to the EVA02 teacher model, demonstrating that the approach is not restricted to DINOv2.
## Highlights & Insights
- Diagnosing the problem is more valuable than solving it: The primary contribution of this work is the identification of the generalization bottleneck in conventional KD—a problem that was previously almost entirely overlooked.
- The decoupled design is both simple and powerful: freezing the encoder while training the decoder is straightforward yet highly effective.
- QSD transforms distillation from "passive imitation" to "active retrieval," representing an elegant advancement over standard soft distillation.
- Experiments cover 5 domain generalization benchmarks under both F2F and F2L settings, providing strong empirical support.
## Limitations & Future Work

- The two-stage training pipeline is more complex than single-stage KD and requires an additional distillation pass on ImageNet before source-domain training.
- Validation is currently limited to semantic segmentation; applicability to other dense prediction tasks such as object detection and instance segmentation remains to be verified.
- Using ImageNet as the proxy dataset in Stage 1 is an assumption; the impact of different proxy datasets has not been fully analyzed.
- Freezing the encoder may impose an upper bound on task performance.
## Related Work & Insights
- G2SD/TinyMIM/Proteus: Existing VFM distillation methods focus solely on in-domain transfer and neglect cross-domain generalization.
- CWD/CIRKD/Af-DCD: Segmentation KD methods that do not account for domain generalization.
- Domain Generalized Semantic Segmentation: This work bridges KD and DGSS, opening a new research direction.
- Insight: The idea of decoupling "representation learning" from "task learning" can be generalized to other scenarios involving pre-training and fine-tuning.
## Rating
- Novelty: ⭐⭐⭐⭐ First systematic study of domain generalization in KD; innovative multi-stage decoupled design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five benchmarks, F2F/F2L dual settings, extensive comparisons and ablations.
- Writing Quality: ⭐⭐⭐⭐ The logical chain from motivation validation to method design is very clear.
- Value: ⭐⭐⭐⭐⭐ Exposes an important overlooked problem in the KD literature; the method is simple, effective, and opens a new research direction.