Test-Time Adaptive Object Detection with Foundation Model¶
Conference: NeurIPS 2025 arXiv: 2510.25175 Code: https://github.com/gaoyingjay/ttaod_foundation Area: Object Detection / Domain Adaptation Keywords: test-time adaptation, open-vocabulary detection, Mean-Teacher, prompt tuning, dynamic memory
TL;DR¶
This paper proposes TTAOD, a source-free open-vocabulary test-time adaptive object detection framework that combines multimodal Prompt Tuning, Mean-Teacher, an Instance Dynamic Memory (IDM) module, and memory augmentation/hallucination strategies. It achieves 56.2% AP50 on Pascal-C (+11.0 vs. SOTA) and demonstrates consistent gains across 13 cross-domain datasets.
Background & Motivation¶
Background: Test-time adaptation (TTA) has been well studied for classification, but TTA for object detection remains underexplored. Existing methods such as STFAR require source-domain statistics (mean/variance) and assume a closed-set category space.
Limitations of Prior Work: (a) Reliance on source-domain data or statistics is impractical; (b) closed-set assumptions restrict applicability to open-world scenarios; (c) pseudo-label quality degrades under corruption/domain shift, causing Mean-Teacher to collapse.
Key Challenge: Open-vocabulary detectors (e.g., GroundingDINO) suffer performance drops under domain shift, yet TTA must operate without source data and without restricting the category space.
Goal: Achieve test-time domain adaptation under a source-free, open-vocabulary setting by tuning only prompt parameters.
Key Insight: Vision-language foundation models (GroundingDINO) already possess strong zero-shot capabilities; adapting to the target domain requires tuning only a small number of prompt parameters. A memory module is used to accumulate high-quality instances to guide adaptation.
Core Idea: Freeze GroundingDINO and introduce learnable text/visual prompts → update via Mean-Teacher EMA → accumulate high-quality pseudo-labels in an Instance Dynamic Memory (IDM) using DINOv2 features → refine detection scores via memory augmentation → address negative samples via memory hallucination.
Method¶
Overall Architecture¶
Test image → GroundingDINO (frozen) + learnable prompts → Teacher (EMA) generates pseudo-labels → Student learns → IDM stores high-quality detection instances (DINOv2 features) → Memory Augmentation refines detection scores using class prototypes → Memory Hallucination synthesizes training signals for negative samples.
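A load-bearing detail of this pipeline is how the learnable visual prompts are initialized. A minimal NumPy sketch of the warm-start rule \(P_{I,i} = \text{AvgPool}(E_{I,i})\) follows; the token count, feature dimension, layer count, and prompt length are illustrative assumptions, not GroundingDINO's actual shapes:

```python
import numpy as np

def warm_start_prompts(token_features, num_prompts=4):
    """Test-Time Warm-Start (TTWS) sketch: initialize each layer's visual
    prompt from the average of that layer's token features on the FIRST
    test image, instead of random initialization: P_{I,i} = AvgPool(E_{I,i})."""
    prompts = []
    for tokens in token_features:                      # one (num_tokens, dim) array per encoder layer
        pooled = tokens.mean(axis=0)                   # average-pool over the token axis
        prompts.append(np.tile(pooled, (num_prompts, 1)))  # repeat to the prompt length
    return prompts

# Toy example: 2 encoder layers, 16 tokens of dimension 8 (shapes are illustrative)
rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8)) for _ in range(2)]
prompts = warm_start_prompts(feats)
```

The point of the sketch is that the first test sample itself supplies a target-domain-aware starting point, which is what prevents the early pseudo-label collapse described in the ablations.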
Key Designs¶
- Multimodal Prompt Tuning + Warm-Start:
  - Function: Insert learnable prompts into both the text and visual encoders.
  - Mechanism: Text prompt \(\tilde{E}_T = E_T + P_T\); the visual prompt inserts learnable tokens \(P_{I,i}\) at each layer. Key innovation: Test-Time Warm-Start (TTWS), which initializes the visual prompts from the average token features of the first test sample, i.e., \(P_{I,i} = \text{AvgPool}(E_{I,i})\).
  - Design Motivation: Ablation studies show TTWS is critical (+8.5% AP50); randomly initialized visual prompts produce extremely poor pseudo-labels in early iterations, causing Mean-Teacher to collapse.
- Instance Dynamic Memory (IDM):
  - Function: Maintains a queue of up to 20 high-quality detected instances per class.
  - Mechanism: After each detection, high-confidence instances are enqueued together with their DINOv2 features and confidence scores; the class prototype \(v_c\) is the mean of the stored features. Low-quality instances are replaced by newer ones.
  - Design Motivation: Dynamically accumulates visual knowledge of the target domain to support the memory augmentation and hallucination strategies.
- Memory Augmentation + Memory Hallucination:
  - Function: Augmentation refines detection scores using class prototypes; hallucination synthesizes training signals for images with no detections.
  - Mechanism: Augmentation extracts DINOv2 features for each detected box, computes cosine similarity with the class prototype, derives a refined score \(s' = \alpha \exp(-\beta(1 - \text{sim}))\), and fuses it with the original score. Hallucination randomly samples instances from the IDM for images without pseudo-labels and pastes them onto the image (Beta mixing \(\lambda \sim \text{Beta}(1,1)\), IoU < 0.2 to limit overlap), up to 3 instances per image.
  - Design Motivation: Augmentation addresses per-frame detection uncertainty; hallucination addresses Mean-Teacher degradation caused by an abundance of negative samples, since without positive samples the teacher cannot provide effective supervision.
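The IDM queue, the class prototype, and the score-refinement formula above can be sketched together in a few lines. This is a toy version under stated assumptions: random vectors stand in for DINOv2 features, the capacity of 20 and \(\alpha, \beta\) values are taken from or inspired by the paper, and the fusion of \(s'\) with the original score (a simple average here) is an assumption:

```python
import numpy as np
from collections import deque

class InstanceDynamicMemory:
    """Sketch of the IDM: a fixed-size queue of high-confidence instance
    features per class. deque(maxlen=...) drops the oldest entry when full."""
    def __init__(self, capacity=20):
        self.queues = {}                      # class name -> deque of (feature, score)
        self.capacity = capacity

    def enqueue(self, cls, feature, score):
        q = self.queues.setdefault(cls, deque(maxlen=self.capacity))
        q.append((feature, score))            # older entries roll off once at capacity

    def prototype(self, cls):
        feats = np.stack([f for f, _ in self.queues[cls]])
        return feats.mean(axis=0)             # class prototype v_c = mean feature

def refine_score(score, feature, prototype, alpha=1.0, beta=5.0):
    """Memory augmentation: s' = alpha * exp(-beta * (1 - sim)), then fuse
    with the original detector score (averaging is an assumed fusion rule)."""
    sim = feature @ prototype / (np.linalg.norm(feature) * np.linalg.norm(prototype) + 1e-8)
    s_mem = alpha * np.exp(-beta * (1.0 - sim))
    return 0.5 * (score + s_mem)

rng = np.random.default_rng(0)
idm = InstanceDynamicMemory()
for _ in range(25):                           # 25 > capacity: queue stays at 20
    idm.enqueue("dog", rng.standard_normal(8), 0.9)
v_dog = idm.prototype("dog")
s = refine_score(0.6, v_dog, v_dog)           # a feature equal to the prototype gets sim = 1
```

With sim = 1 the memory term contributes its maximum \(\alpha\), so a box that matches its class prototype gets its score pulled upward; dissimilar boxes are suppressed exponentially in \(1 - \text{sim}\).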
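The memory-hallucination paste admits a small sketch as well. The paper fixes only the Beta(1,1) mixing, the IoU < 0.2 constraint, and the cap of 3 instances; the placement policy here (uniformly random positions, rejection on overlap) is an assumption:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def hallucinate(image, memory_patches, max_instances=3, iou_thresh=0.2, rng=None):
    """Memory hallucination sketch: paste up to `max_instances` memory
    instances onto an image with no pseudo-labels, mixing pixels with
    lambda ~ Beta(1, 1) and rejecting placements that overlap an already
    accepted box at IoU >= iou_thresh."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    placed = []                               # accepted (box, class) pairs
    for patch, cls in memory_patches:
        if len(placed) >= max_instances:
            break
        ph, pw = patch.shape[:2]
        y = int(rng.integers(0, h - ph + 1))  # random top-left corner
        x = int(rng.integers(0, w - pw + 1))
        box = [x, y, x + pw, y + ph]
        if any(iou(box, b) >= iou_thresh for b, _ in placed):
            continue                          # too much overlap: skip this sample
        lam = rng.beta(1.0, 1.0)              # Beta(1,1) mixing coefficient
        image[y:y + ph, x:x + pw] = lam * patch + (1 - lam) * image[y:y + ph, x:x + pw]
        placed.append((box, cls))
    return image, placed                      # pasted boxes become training targets

img = np.zeros((64, 64, 3))
patches = [(np.full((8, 8, 3), 0.5), "dog") for _ in range(5)]
out, placed = hallucinate(img, patches)
```

The returned boxes act as synthetic positives, giving the teacher something to supervise on frames that would otherwise contain only negatives.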
Loss & Training¶
- \(L_{total} = L_{cls} + L_{loc}\) (contrastive classification loss + localization loss)
- Mean-Teacher EMA with \(\gamma = 0.999\)
- Only prompt parameters are updated (~0.1% of total parameters)
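The teacher-update rule above is a standard exponential moving average applied only to the prompt parameters, with everything else frozen. A minimal sketch (the parameter names and sizes are illustrative):

```python
import numpy as np

def ema_update(teacher_params, student_params, gamma=0.999):
    """Mean-Teacher EMA: teacher <- gamma * teacher + (1 - gamma) * student.
    Only the learnable prompt parameters are touched; the frozen detector
    backbone is simply not in these dicts."""
    return {k: gamma * teacher_params[k] + (1.0 - gamma) * student_params[k]
            for k in teacher_params}

teacher = {"text_prompt": np.zeros(4), "visual_prompt": np.zeros(4)}
student = {"text_prompt": np.ones(4), "visual_prompt": 2 * np.ones(4)}
teacher = ema_update(teacher, student)        # teacher drifts slowly toward the student
```

With \(\gamma = 0.999\), the teacher moves only 0.1% of the way toward the student per step, which is what keeps pseudo-labels stable while the student adapts.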
Key Experimental Results¶
Main Results¶
| Dataset | Method | AP50 / mAP |
|---|---|---|
| Pascal-C | Direct Test | 44.8% |
| Pascal-C | STFAR | 45.2% |
| Pascal-C | Mean-Teacher | 51.5% |
| Pascal-C | TTAOD (Ours) | 56.2% |
| COCO-C | TTAOD | 26.0 mAP |
| ODinW-13 | TTAOD | 54.2 mAP |
Ablation Study¶
| Component | AP50 | Notes |
|---|---|---|
| Baseline | 44.8% | No adaptation |
| + TPT only | 45.4% | Text prompt tuning |
| + VPT only | 41.4% | Visual prompt (no TTWS) — degrades performance |
| + TTWS | 53.4% | Warm-start is critical |
| + All components | 56.2% | Best performance |
Key Findings¶
- TTWS is the single most critical component (+8.5%), highlighting the importance of prompt initialization in TTA.
- Visual prompts without TTWS are harmful (−3.4%), underscoring the cold-start problem.
- TTAOD achieves the best performance on 14 out of 15 corruption types.
- On ODinW-13 cross-domain benchmark: improvement on 11/13 datasets, average gain of +1.4%.
Highlights & Insights¶
- TTWS addresses the cold-start problem in TTA: initializing prompts with features from the first test sample is simple yet highly effective, and the idea is transferable to other TTA settings.
- Memory Hallucination is a novel solution for handling negative samples: in detection, many test images contain no detected objects, so Mean-Teacher degrades on those frames; synthesizing positive samples from memory is an elegant remedy.
- Source-free + open-vocabulary is a more practical setting: eliminating dependence on source-domain data and closed-set category assumptions.
Limitations & Future Work¶
- Memory quality depends on pseudo-label selection; errors may accumulate under continual domain shift.
- GroundingDINO inference is itself heavy, and additional DINOv2 feature extraction introduces further overhead.
- Applicability to streaming or real-time scenarios is not discussed.
- The memory hallucination synthesis strategy is relatively simple (direct paste); more sophisticated augmentation methods may yield better results.
Related Work & Insights¶
- vs. STFAR: STFAR requires source-domain statistics and a closed-set category space; the proposed method is source-free and open-vocabulary.
- vs. Tent/MEMO: General-purpose TTA methods designed for classification that do not address detection-specific pseudo-label challenges.
- vs. TPT: TPT performs test-time prompt tuning for classification; this work extends prompt tuning to detection and introduces TTWS.
Rating¶
- Novelty: ⭐⭐⭐⭐ TTWS and memory hallucination are novel and well-motivated designs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, 15 corruption types, 13 cross-domain datasets, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ The methodology is presented clearly.
- Value: ⭐⭐⭐⭐ Advances the practical applicability of test-time adaptation for object detection.