Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

Conference: CVPR 2025
arXiv: 2503.23220
Code: https://github.com/TRAILab/DINO_Teacher
Area: Object Detection / Domain Adaptation
Keywords: Domain Adaptive Object Detection, DINOv2, Vision Foundation Models, Pseudo-labeling, Feature Alignment

TL;DR

DINO Teacher replaces the EMA teacher of the Mean Teacher framework with frozen, self-supervised DINOv2 foundation models. They serve as both a more accurate pseudo-label generator and a proxy target for feature alignment, giving SOTA results on multiple domain adaptive object detection benchmarks (+7.6 mAP@50 over the previous best on Cityscapes → BDD100k).

Background & Motivation

Background: The mainstream approach in domain adaptive object detection (DAOD) is the Mean Teacher self-labeling framework. An EMA copy of the student model serves as the teacher and generates pseudo-labels on the target domain; these labels then train the student, forming a positive feedback loop.
Limitations of Prior Work: Mean Teacher tightly couples label generation with model training. A student trained only on the source domain may not produce pseudo-labels accurate enough on the target domain to bootstrap that loop. In addition, domain alignment methods that rely on pseudo-labels for category-level alignment face a chicken-and-egg problem, because the pseudo-labels they depend on are themselves unreliable.
Key Challenge: Since the teacher and student models share the exact same architecture and are trained on the same data, why should the teacher be expected to generate high-quality labels in an unseen target domain?
Goal: (1) How to generate more accurate pseudo-labels in the target domain? (2) How to perform domain alignment without relying on pseudo-labels?
Key Insight: Vision foundation models such as DINOv2, pre-trained with self-supervision on massive datasets, learn features that generalize well across domains. Even with frozen parameters, their features give a consistent semantic representation of source and target images.
Core Idea: Decouple label generation from student training by using a frozen DINOv2 as the pseudo-label generator, and use the DINOv2 feature space as a proxy target for domain alignment.

Method

Overall Architecture

DINO Teacher comprises three stages: (1) Offline labeller training: A Faster R-CNN detection head is trained on top of a frozen DINOv2 ViT-G encoder, using only source domain data; (2) Offline label generation: The trained labeller processes the target domain data to generate all pseudo-labels in a single pass; (3) Online student training: A small student network (VGG16/ResNet-50) is trained using source domain ground-truth labels and target domain pseudo-labels, while the student's features are aligned with those of a frozen DINOv2 ViT-B. During inference, only the student network is used.

Key Designs

DINOv2 Pseudo-Label Generator (Foundation Model Labeller):
- Function: Replaces the Mean Teacher as the pseudo-label source for the target domain to provide higher-quality cross-domain labels.
- Mechanism: A Faster R-CNN detection head is added to the frozen DINOv2 ViT-G encoder, and trained solely on labeled source domain data. Once trained, a single forward pass over all target domain images is executed to generate pseudo-labels \(\tilde{B}_T, \tilde{Y}_T\), which are filtered with a class probability threshold of \(\delta=0.8\). Because DINOv2 has seen samples far beyond the source distribution during large-scale pre-training, its features can facilitate more accurate box and category predictions on the target domain, even when the detection head is trained only on the source domain.
- Design Motivation: The labeller in a traditional Mean Teacher is an exponential moving average of the student, so it is limited by the same small architecture and the same restricted training data. A DINOv2-based labeller decouples pseudo-label quality from the student's capacity: the large model extracts features for labelling, while the student stays lightweight.
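
The confidence filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Detection` container and the `filter_pseudo_labels` helper are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption, since the summary only gives the value \(\delta=0.8\).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One labeller prediction on a target-domain image (hypothetical container)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    label: str                              # predicted class name
    score: float                            # class probability in [0, 1]


def filter_pseudo_labels(detections: List[Detection], delta: float = 0.8) -> List[Detection]:
    """Keep only predictions whose class probability reaches the threshold delta.

    Assumption: the comparison is inclusive; the source only states delta = 0.8.
    """
    return [d for d in detections if d.score >= delta]


# Toy predictions for a single target image (made-up values for illustration).
raw = [
    Detection((10.0, 20.0, 110.0, 220.0), "car", 0.95),
    Detection((30.0, 40.0, 60.0, 120.0), "person", 0.62),
    Detection((200.0, 50.0, 400.0, 260.0), "truck", 0.81),
]
kept = filter_pseudo_labels(raw)
print([d.label for d in kept])
```

Because the labeller is frozen after source training, this filter only needs to run once per target image, and the surviving boxes can be cached for the whole student training run.
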
DINOv2 Feature Alignment:
- Function: Indirectly reduces the feature gap between the source and target domains by aligning student features with DINOv2 features.
- Mechanism: A frozen DINOv2 ViT-B serves as the alignment encoder. Patch-level features \(\mathbf{x}\) from the student backbone are projected by a 2-layer MLP \(g\), resized by \(\text{interp}\) to the spatial size of the DINOv2 feature map \(\mathbf{x}^{big}\), and aligned with it through a cosine similarity loss averaged over the \(N\) images and \(H \times W\) patch positions: \(\mathcal{L}^{sim} = \frac{1}{NHW}\sum \left(1 - \frac{\text{interp}(g(\mathbf{x}))^T \mathbf{x}^{big}}{\|\text{interp}(g(\mathbf{x}))\|_2 \|\mathbf{x}^{big}\|_2}\right)\). Alignment is performed independently on source and target domain images, without cross-domain matching or any labels.
- Design Motivation: Traditional domain-invariant methods either perform image-level alignment, which does not guarantee instance-level consistency, or use pseudo-labels for category-level alignment, which depends heavily on label quality. With DINOv2 as a proxy target, source and target student features are each pulled toward the same DINOv2 feature space, so the domain gap narrows without any cross-domain matching.
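
To make the alignment loss concrete, here is a small pure-Python sketch of the per-patch term \(1 - \cos(\cdot,\cdot)\) averaged over patch positions. It assumes the student features have already been projected and resized to the DINOv2 patch grid, so both inputs are flat lists of equally many patch vectors; the function name and the `eps` guard against zero-norm vectors are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def cosine_alignment_loss(student: Sequence[Sequence[float]],
                          target: Sequence[Sequence[float]],
                          eps: float = 1e-8) -> float:
    """Mean of (1 - cosine similarity) over matching patch positions.

    `student` stands in for interp(g(x)) and `target` for the frozen
    DINOv2 features; each is a list of patch vectors in the same order.
    """
    if len(student) == 0 or len(student) != len(target):
        raise ValueError("both feature maps need the same, non-zero number of patches")
    total = 0.0
    for s, t in zip(student, target):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / max(norm, eps)  # eps avoids division by zero
    return total / len(student)


# Two patches: the first is perfectly aligned, the second is orthogonal.
student_feats = [[1.0, 0.0], [0.0, 2.0]]
dino_feats = [[2.0, 0.0], [3.0, 0.0]]
loss = cosine_alignment_loss(student_feats, dino_feats)
print(loss)
```

The loss depends only on direction, not magnitude, so the projection MLP does not have to match the scale of the DINOv2 features. Each image is compared only with its own DINOv2 features, which is why no source-target pairing or label is needed.
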
Multi-Stage Training Strategy:
- Function: Properly schedules the startup sequence of each component to ensure training stability.
- Mechanism: Training proceeds in three stages: (1) for the first \(n^{initSim}=5000\) iterations, the student is trained on the source domain only, together with source-domain DINO alignment, so that the projection MLP is learned; (2) from iteration 5,000 onward, DINO alignment is applied to both source and target images, still without pseudo-labels; (3) from iteration \(n^{initPL}=20000\) onward, training on DINO pseudo-labels is added. The total loss is \(\mathcal{L} = \mathcal{L}^{det}_S + \lambda^{unsup}\mathcal{L}^{det}_T + \lambda^{sim}\mathcal{L}^{sim}\), where \(\lambda^{unsup}=\lambda^{sim}=1\).
- Design Motivation: Building a solid feature foundation through domain alignment before introducing pseudo-label training prevents gradient interference from misaligned features in the early phase. Adaptive Teacher, by contrast, starts alignment and pseudo-label training simultaneously; aligning first reduces gradient conflicts between the domains.
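
The staged schedule can be expressed as a gate on each loss term. The sketch below is one possible reading under stated assumptions: iterations are counted from 0, each stage boundary is inclusive, and the source and target alignment terms are simply summed into \(\mathcal{L}^{sim}\). The summary does not specify any of these details, and the function names are hypothetical.

```python
from typing import Dict

N_INIT_SIM = 5_000   # n^{initSim}: target-domain alignment starts here
N_INIT_PL = 20_000   # n^{initPL}: pseudo-label training starts here


def active_terms(iteration: int) -> Dict[str, bool]:
    """Which loss terms are switched on at a given iteration (0-indexed, assumed)."""
    return {
        "det_source": True,                      # supervised source loss, always on
        "sim_source": True,                      # source alignment, always on
        "sim_target": iteration >= N_INIT_SIM,   # stage 2 onward
        "det_target": iteration >= N_INIT_PL,    # stage 3 onward
    }


def total_loss(iteration: int, det_source: float, det_target: float,
               sim_source: float, sim_target: float,
               lambda_unsup: float = 1.0, lambda_sim: float = 1.0) -> float:
    """L = L_det_S + lambda_unsup * L_det_T + lambda_sim * L_sim with stage gating."""
    on = active_terms(iteration)
    sim = sim_source + (sim_target if on["sim_target"] else 0.0)  # summed (assumption)
    unsup = det_target if on["det_target"] else 0.0
    return det_source + lambda_unsup * unsup + lambda_sim * sim


# Same per-term values at three points in training (toy numbers).
for it in (0, 5_000, 20_000):
    print(it, total_loss(it, det_source=1.0, det_target=2.0, sim_source=0.5, sim_target=0.25))
```

Gating the terms this way keeps a single training loop for all three stages; only the set of active losses changes as the iteration counter crosses each boundary.
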

Loss & Training

Total loss = source domain detection loss \(\mathcal{L}^{det}_S\) (including RPN and RoI losses) + target domain pseudo-label detection loss \(\mathcal{L}^{det}_T\) + cosine similarity alignment loss \(\mathcal{L}^{sim}\). The learning rate for the VGG16 backbone is 0.04 without decay, with an EMA decay factor of \(\alpha=0.9996\), for a total of 60k iterations. A weak-strong data augmentation strategy is applied (weak augmentation for the teacher, strong for the student).
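
For readers unfamiliar with the EMA decay factor, the generic Mean Teacher-style update is \(\theta^{ema} \leftarrow \alpha\,\theta^{ema} + (1-\alpha)\,\theta^{student}\). The sketch below shows only this generic rule on scalar weights; the summary does not say which weights the EMA is applied to in DINO Teacher, and `ema_update` is a hypothetical helper name.

```python
from typing import Dict


def ema_update(ema_weights: Dict[str, float],
               student_weights: Dict[str, float],
               alpha: float = 0.9996) -> Dict[str, float]:
    """One exponential-moving-average step over named scalar weights."""
    if ema_weights.keys() != student_weights.keys():
        raise ValueError("EMA and student must hold the same parameter names")
    return {
        name: alpha * ema_weights[name] + (1.0 - alpha) * student_weights[name]
        for name in ema_weights
    }


# Toy example: the EMA weight drifts slowly toward the student weight.
ema = {"w": 1.0}
student = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, student)
print(ema["w"])
```

With \(\alpha=0.9996\), the influence of a given student state halves after about \(\ln 0.5 / \ln 0.9996 \approx 1733\) iterations, so the averaged weights change slowly relative to the 60k-iteration schedule.
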

Key Experimental Results

Main Results

Dataset	Metric	Ours	Prev. SOTA	Gain
Cityscapes → BDD100k	mAP@50	47.8	40.2 (HT)	+7.6%
Cityscapes → Foggy CS	mAP@50	55.4	53.1 (REACT)	+2.3%
Cityscapes → ACDC-Fog	mAP@50	68.6	62.2 (AT)	+6.4%
Cityscapes → ACDC-Night	mAP@50	36.4	29.5 (AT)	+6.9%
Cityscapes → ACDC-Rain	mAP@50	39.0	37.7 (AT)	+1.3%
Cityscapes → ACDC-Snow	mAP@50	56.8	55.2 (AT)	+1.6%

The improvement is most pronounced on BDD100k, which has the largest domain gap, especially for rare classes such as truck (+12.9%) and bus (+11.3%).

Ablation Study

Configuration	Label Source	Alignment Method	Pre-PL mAP	Final mAP
AT (Baseline)	Mean Teacher	\(\mathcal{L}^{dis}\)	28.5	31.8
Case 1	Mean Teacher	\(\mathcal{L}^{sim}\)	32.5	35.3
Case 2	DINO Labeller	\(\mathcal{L}^{dis}\)	28.5	46.8
DT (Full)	DINO Labeller	\(\mathcal{L}^{sim}\)	33.0	47.8

Key Findings

DINO pseudo-labels are the primary contributor: Regardless of the alignment method used, employing DINO labels yields a gain of over 10% (AT → Case 2: +15.0%, Case 1 → DT: +12.5%), far outpacing changes in the alignment method (AT → Case 1: +3.5%).
DINO alignment is effective before pseudo-labels are introduced: Prior to enabling pseudo-labels, DINO alignment improves mAP from 28.5 to 32.5-33.0, indicating that the benefit of feature alignment does not depend on pseudo-label quality.
Rare classes show the greatest improvement: On BDD100k, truck improved from 31.4 to 44.3, bus from 34.6 to 45.9, and motor from 24.4 to 38.3, demonstrating that DINOv2 features provide better representations for rare categories.
t-SNE visualizations clearly reveal that after DINO alignment, the category separation in the student network increases significantly, notably reducing confusion between person and rider.
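
The deltas quoted in the first finding can be recomputed directly from the final-mAP column of the ablation table. The snippet below is just that arithmetic; the dictionary keys are shorthand introduced here for the table rows.

```python
from typing import Dict, Tuple

# (label source, alignment loss) -> final mAP@50, copied from the ablation table
FINAL_MAP: Dict[Tuple[str, str], float] = {
    ("mean_teacher", "dis"): 31.8,   # AT (baseline)
    ("mean_teacher", "sim"): 35.3,   # Case 1
    ("dino_labeller", "dis"): 46.8,  # Case 2
    ("dino_labeller", "sim"): 47.8,  # DT (full)
}


def gain(before: Tuple[str, str], after: Tuple[str, str]) -> float:
    """Final-mAP difference between two configurations, rounded to one decimal."""
    return round(FINAL_MAP[after] - FINAL_MAP[before], 1)


# Effect of switching the label source, with the alignment loss held fixed.
label_gain_dis = gain(("mean_teacher", "dis"), ("dino_labeller", "dis"))
label_gain_sim = gain(("mean_teacher", "sim"), ("dino_labeller", "sim"))
# Effect of switching the alignment loss, with the label source held fixed.
align_gain_mt = gain(("mean_teacher", "dis"), ("mean_teacher", "sim"))
align_gain_dino = gain(("dino_labeller", "dis"), ("dino_labeller", "sim"))

print(label_gain_dis, label_gain_sim, align_gain_mt, align_gain_dino)
```

Holding one factor fixed while changing the other shows that the label source accounts for most of the improvement, while the alignment loss adds a smaller gain that shrinks once DINO labels are in use.
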

Highlights & Insights

Thorough execution of decoupling: Label generation is decoupled from student training (with the DINO labeller independent of the student), and domain alignment is decoupled from pseudo-labels (DINO alignment does not require labels). This systematic decoupling allows each component to be optimized independently, avoiding the fragile circular dependencies of Mean Teacher.
Foundation models empowering small models: Inference uses only a lightweight network such as VGG16 or ResNet-50; the knowledge from DINOv2 is transferred during training. Inference cost is therefore the same as for the plain student detector while accuracy improves substantially, which makes the method deployment-friendly.
Offline label generation: Pseudo-labels are generated in a single offline forward pass over the target domain data. After this step, the large ViT-G labeller is no longer needed; only the smaller frozen ViT-B alignment encoder still runs during student training, which keeps training costs manageable.

Limitations & Future Work

Static nature of DINO labels: Labels are generated once and never updated, failing to exploit new knowledge acquired by the student during training. Periodically updating pseudo-labels or adopting curriculum learning represents a promising direction.
Choice of alignment encoder: Utilizing ViT-B instead of ViT-G for alignment is a speed-centric compromise. While larger alignment models may yield better performance, they would slow down training.
Limited improvement in rain and snow: The performance gains on ACDC-Rain (+1.3%) and ACDC-Snow (+1.6%) are smaller than on BDD100k, ACDC-Fog, and ACDC-Night. This might stem from smaller baseline domain gaps in these scenarios, or from limits in DINOv2's generalization under these specific conditions.
Integration with data augmentation techniques like CutMix/Mixup could yield further improvements.
Extending the DINO Teacher paradigm to domain adaptive segmentation or other dense prediction tasks appears to be a natural next step.

Comparison with Related Work

vs Adaptive Teacher (AT): AT employs an EMA teacher alongside domain-discriminator adversarial loss. The proposed method replaces both components with the DINO labeller and DINO alignment, outperforming AT by 4.5% (Foggy CS) to 16.0% (BDD100k).
vs Harmonious Teacher (HT): HT is the SOTA among FCOS-based detectors (BDD100k mAP of 40.2). DINO Teacher reaches 47.8 with Faster R-CNN, suggesting that higher-quality pseudo-labels matter more than a more complex detector architecture.
vs REIN: REIN demonstrated that a frozen foundation model backbone can approach full fine-tuning performance in domain-generalized segmentation. This work adapts a similar concept to pseudo-label generation in DAOD, further validating the cross-domain consistency of VFM features.
This framework, in which foundation models supply knowledge during training and a small model is deployed, may generalize to other dense prediction tasks that require domain adaptation.

Rating

Novelty: ⭐⭐⭐⭐ Clear decoupling philosophy, successfully and systematically deploying foundation models to empower small models in DAOD.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of 4 datasets across 6 test scenarios with rigorous ablation studies.
Writing Quality: ⭐⭐⭐⭐⭐ Rigorous logical progression in motivation and transparent methodological explanation.
Value: ⭐⭐⭐⭐⭐ Extremely practical, establishing a new baseline paradigm for DAOD.