Test-Time Adaptive Object Detection with Foundation Model¶
Conference: NeurIPS 2025 arXiv: 2510.25175 Code: https://github.com/gaoyingjay/ttaod_foundation Area: Object Detection / Domain Adaptation Keywords: test-time adaptation, open-vocabulary detection, Mean-Teacher, prompt tuning, dynamic memory
TL;DR¶
This paper proposes TTAOD, a source-free open-vocabulary test-time adaptive object detection framework that combines multimodal Prompt Tuning, Mean-Teacher, an Instance Dynamic Memory (IDM) module, and memory augmentation/hallucination strategies. It achieves 56.2% AP50 on Pascal-C (+11.0 vs. SOTA) and demonstrates consistent gains across 13 cross-domain datasets.
Background & Motivation¶
Background: Test-time adaptation (TTA) has been well studied for classification, but TTA for object detection remains underexplored. Existing methods such as STFAR require source-domain statistics (mean/variance) and assume a closed-set category space.
Limitations of Prior Work: (a) Reliance on source-domain data or statistics is impractical; (b) closed-set assumptions restrict applicability to open-world scenarios; (c) pseudo-label quality degrades under corruption/domain shift, causing Mean-Teacher to collapse.
Key Challenge: Open-vocabulary detectors (e.g., GroundingDINO) suffer performance drops under domain shift, yet TTA must operate without source data and without restricting the category space.
Goal: Achieve test-time domain adaptation under a source-free, open-vocabulary setting by tuning only prompt parameters.
Key Insight: Vision-language foundation models (GroundingDINO) already possess strong zero-shot capabilities; adapting to the target domain requires tuning only a small number of prompt parameters. A memory module is used to accumulate high-quality instances to guide adaptation.
Core Idea: Freeze GroundingDINO and introduce learnable text/visual prompts → update via Mean-Teacher EMA → accumulate high-quality pseudo-labels in an Instance Dynamic Memory (IDM) using DINOv2 features → refine detection scores via memory augmentation → address negative samples via memory hallucination.
Method¶
Overall Architecture¶
Test image → GroundingDINO (frozen) + learnable prompts → Teacher (EMA) generates pseudo-labels → Student learns → IDM stores high-quality detection instances (DINOv2 features) → Memory Augmentation refines detection scores using class prototypes → Memory Hallucination synthesizes training signals for negative samples.
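A load-bearing detail of this pipeline is how the learnable visual prompts are initialized. A minimal NumPy sketch of the warm-start rule \(P_{I,i} = \text{AvgPool}(E_{I,i})\) follows; the token count, feature dimension, layer count, and prompt length are illustrative assumptions, not GroundingDINO's actual shapes:

```python
import numpy as np

def warm_start_prompts(token_features, num_prompts=4):
    """Test-Time Warm-Start (TTWS) sketch: initialize each layer's visual
    prompt from the average of that layer's token features on the FIRST
    test image, instead of random initialization: P_{I,i} = AvgPool(E_{I,i})."""
    prompts = []
    for tokens in token_features:                      # one (num_tokens, dim) array per encoder layer
        pooled = tokens.mean(axis=0)                   # average-pool over the token axis
        prompts.append(np.tile(pooled, (num_prompts, 1)))  # repeat to the prompt length
    return prompts

# Toy example: 2 encoder layers, 16 tokens of dimension 8 (shapes are illustrative)
rng = np.random.default_rng(0)
feats = [rng.standard_normal((16, 8)) for _ in range(2)]
prompts = warm_start_prompts(feats)
```

The point of the sketch is that the first test sample itself supplies a target-domain-aware starting point, which is what prevents the early pseudo-label collapse described in the ablations.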
Key Designs¶
- Multimodal Prompt Tuning + Warm-Start:
  - Function: Insert learnable prompts into both the text and visual encoders.
  - Mechanism: Text prompt \(\tilde{E}_T = E_T + P_T\); the visual prompt inserts learnable tokens \(P_{I,i}\) at each layer. Key innovation: Test-Time Warm-Start (TTWS), which initializes the visual prompts from the average token features of the first test sample, i.e., \(P_{I,i} = \text{AvgPool}(E_{I,i})\).
  - Design Motivation: Ablation studies show TTWS is critical (+8.5% AP50); randomly initialized visual prompts produce extremely poor pseudo-labels in early iterations, causing Mean-Teacher to collapse.
- Instance Dynamic Memory (IDM):
  - Function: Maintains a queue of up to 20 high-quality detected instances per class.
  - Mechanism: After each detection, high-confidence instances are enqueued together with their DINOv2 features and confidence scores; the class prototype \(v_c\) is the mean of the stored features. Low-quality instances are replaced by newer ones.
  - Design Motivation: Dynamically accumulates visual knowledge of the target domain to support the memory augmentation and hallucination strategies.
- Memory Augmentation + Memory Hallucination:
  - Function: Augmentation refines detection scores using class prototypes; hallucination synthesizes training signals for images with no detections.
  - Mechanism: Augmentation extracts DINOv2 features for each detected box, computes cosine similarity with the class prototype, derives a refined score \(s' = \alpha \exp(-\beta(1 - \text{sim}))\), and fuses it with the original score. Hallucination randomly samples instances from the IDM for images without pseudo-labels and pastes them onto the image (Beta mixing \(\lambda \sim \text{Beta}(1,1)\), IoU < 0.2 to limit overlap), up to 3 instances per image.
  - Design Motivation: Augmentation addresses per-frame detection uncertainty; hallucination addresses Mean-Teacher degradation caused by an abundance of negative samples, since without positive samples the teacher cannot provide effective supervision.
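The IDM queue, the class prototype, and the score-refinement formula above can be sketched together in a few lines. This is a toy version under stated assumptions: random vectors stand in for DINOv2 features, the capacity of 20 and \(\alpha, \beta\) values are taken from or inspired by the paper, and the fusion of \(s'\) with the original score (a simple average here) is an assumption:

```python
import numpy as np
from collections import deque

class InstanceDynamicMemory:
    """Sketch of the IDM: a fixed-size queue of high-confidence instance
    features per class. deque(maxlen=...) drops the oldest entry when full."""
    def __init__(self, capacity=20):
        self.queues = {}                      # class name -> deque of (feature, score)
        self.capacity = capacity

    def enqueue(self, cls, feature, score):
        q = self.queues.setdefault(cls, deque(maxlen=self.capacity))
        q.append((feature, score))            # older entries roll off once at capacity

    def prototype(self, cls):
        feats = np.stack([f for f, _ in self.queues[cls]])
        return feats.mean(axis=0)             # class prototype v_c = mean feature

def refine_score(score, feature, prototype, alpha=1.0, beta=5.0):
    """Memory augmentation: s' = alpha * exp(-beta * (1 - sim)), then fuse
    with the original detector score (averaging is an assumed fusion rule)."""
    sim = feature @ prototype / (np.linalg.norm(feature) * np.linalg.norm(prototype) + 1e-8)
    s_mem = alpha * np.exp(-beta * (1.0 - sim))
    return 0.5 * (score + s_mem)

rng = np.random.default_rng(0)
idm = InstanceDynamicMemory()
for _ in range(25):                           # 25 > capacity: queue stays at 20
    idm.enqueue("dog", rng.standard_normal(8), 0.9)
v_dog = idm.prototype("dog")
s = refine_score(0.6, v_dog, v_dog)           # a feature equal to the prototype gets sim = 1
```

With sim = 1 the memory term contributes its maximum \(\alpha\), so a box that matches its class prototype gets its score pulled upward; dissimilar boxes are suppressed exponentially in \(1 - \text{sim}\).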
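The memory-hallucination paste admits a small sketch as well. The paper fixes only the Beta(1,1) mixing, the IoU < 0.2 constraint, and the cap of 3 instances; the placement policy here (uniformly random positions, rejection on overlap) is an assumption:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def hallucinate(image, memory_patches, max_instances=3, iou_thresh=0.2, rng=None):
    """Memory hallucination sketch: paste up to `max_instances` memory
    instances onto an image with no pseudo-labels, mixing pixels with
    lambda ~ Beta(1, 1) and rejecting placements that overlap an already
    accepted box at IoU >= iou_thresh."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    placed = []                               # accepted (box, class) pairs
    for patch, cls in memory_patches:
        if len(placed) >= max_instances:
            break
        ph, pw = patch.shape[:2]
        y = int(rng.integers(0, h - ph + 1))  # random top-left corner
        x = int(rng.integers(0, w - pw + 1))
        box = [x, y, x + pw, y + ph]
        if any(iou(box, b) >= iou_thresh for b, _ in placed):
            continue                          # too much overlap: skip this sample
        lam = rng.beta(1.0, 1.0)              # Beta(1,1) mixing coefficient
        image[y:y + ph, x:x + pw] = lam * patch + (1 - lam) * image[y:y + ph, x:x + pw]
        placed.append((box, cls))
    return image, placed                      # pasted boxes become training targets

img = np.zeros((64, 64, 3))
patches = [(np.full((8, 8, 3), 0.5), "dog") for _ in range(5)]
out, placed = hallucinate(img, patches)
```

The returned boxes act as synthetic positives, giving the teacher something to supervise on frames that would otherwise contain only negatives.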
Loss & Training¶
- \(L_{total} = L_{cls} + L_{loc}\) (contrastive classification loss + localization loss)
- Mean-Teacher EMA with \(\gamma = 0.999\)
- Only prompt parameters are updated (~0.1% of total parameters)
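The teacher-update rule above is a standard exponential moving average applied only to the prompt parameters, with everything else frozen. A minimal sketch (the parameter names and sizes are illustrative):

```python
import numpy as np

def ema_update(teacher_params, student_params, gamma=0.999):
    """Mean-Teacher EMA: teacher <- gamma * teacher + (1 - gamma) * student.
    Only the learnable prompt parameters are touched; the frozen detector
    backbone is simply not in these dicts."""
    return {k: gamma * teacher_params[k] + (1.0 - gamma) * student_params[k]
            for k in teacher_params}

teacher = {"text_prompt": np.zeros(4), "visual_prompt": np.zeros(4)}
student = {"text_prompt": np.ones(4), "visual_prompt": 2 * np.ones(4)}
teacher = ema_update(teacher, student)        # teacher drifts slowly toward the student
```

With \(\gamma = 0.999\), the teacher moves only 0.1% of the way toward the student per step, which is what keeps pseudo-labels stable while the student adapts.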
Key Experimental Results¶
Main Results¶
| Dataset | Method | AP50 / mAP |
|---|---|---|
| Pascal-C | Direct Test | 44.8% |
| Pascal-C | STFAR | 45.2% |
| Pascal-C | Mean-Teacher | 51.5% |
| Pascal-C | TTAOD (Ours) | 56.2% |
| COCO-C | TTAOD | 26.0 mAP |
| ODinW-13 | TTAOD | 54.2 mAP |
Ablation Study¶
| Component | AP50 | Notes |
|---|---|---|
| Baseline | 44.8% | No adaptation |
| + TPT only | 45.4% | Text prompt tuning |
| + VPT only | 41.4% | Visual prompt (no TTWS) — degrades performance |
| + TTWS | 53.4% | Warm-start is critical |
| + All components | 56.2% | Best performance |
Key Findings¶
- TTWS is the single most critical component (+8.5%), highlighting the importance of prompt initialization in TTA.
- Visual prompts without TTWS are harmful (−3.4%), underscoring the cold-start problem.
- TTAOD achieves the best performance on 14 out of 15 corruption types.
- On ODinW-13 cross-domain benchmark: improvement on 11/13 datasets, average gain of +1.4%.
Highlights & Insights¶
- TTWS addresses the cold-start problem in TTA: initializing prompts with features from the first test sample is simple yet highly effective, and the idea is transferable to other TTA settings.
- Memory Hallucination is a novel solution for handling negative samples: in detection, many test images contain no detected objects, so Mean-Teacher degrades on those frames; synthesizing positive samples from memory is an elegant remedy.
- Source-free + open-vocabulary is a more practical setting: eliminating dependence on source-domain data and closed-set category assumptions.
Limitations & Future Work¶
- Memory quality depends on pseudo-label selection; errors may accumulate under continual domain shift.
- GroundingDINO inference is itself heavy, and additional DINOv2 feature extraction introduces further overhead.
- Applicability to streaming or real-time scenarios is not discussed.
- The memory hallucination synthesis strategy is relatively simple (direct paste); more sophisticated augmentation methods may yield better results.
Related Work & Insights¶
- vs. STFAR: STFAR requires source-domain statistics and a closed-set category space; the proposed method is source-free and open-vocabulary.
- vs. Tent/MEMO: General-purpose TTA methods designed for classification that do not address detection-specific pseudo-label challenges.
- vs. TPT: TPT performs test-time prompt tuning for classification; this work extends prompt tuning to detection and introduces TTWS.
Rating¶
- Novelty: ⭐⭐⭐⭐ TTWS and memory hallucination are novel and well-motivated designs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, 15 corruption types, 13 cross-domain datasets, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ The methodology is presented clearly.
- Value: ⭐⭐⭐⭐ Advances the practical applicability of test-time adaptation for object detection.