Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval

Conference: CVPR 2025
arXiv: 2409.18733
Code: None
Area: Object Detection
Keywords: Open-vocabulary detection, long-tail detection, web-image retrieval, training-free, SAM

TL;DR

SearchDet is a training-free framework for long-tail object detection. It retrieves positive and negative sample images from the web, fuses them into attention-weighted queries, and localizes objects by combining SAM region proposals with similarity heatmaps. Relative to GroundingDINO, it reports mAP improvements of 48.7% on ODinW and 59.1% on LVIS, which suggests that the Web can serve as an external, dynamic memory for inference-time augmentation.

Background & Motivation

Background: Open-vocabulary object detection (OVD) aims to detect objects described by arbitrary text labels. Current mainstream methods like GroundingDINO, GLIP, and T-Rex2 have acquired strong zero-shot capabilities through large-scale vision-language pre-training. However, their performance remains constrained by the distribution coverage of the pre-training data—performing well on common objects seen during training but poorly on long-tail, rare objects.

Limitations of Prior Work: Improving the performance of these models on long-tail objects requires expensive continued pre-training or task-specific fine-tuning. For instance, GroundingDINO completely fails when detecting fine-grained concepts such as a "Mountain Dew bottle," as the model's parameterized memory lacks sufficient relevant visual experience.

Key Challenge: The model's knowledge is frozen in its parameters and cannot be dynamically expanded, whereas web search engines can obtain visual examples of arbitrary concepts in real-time. However, there has been no effective method to integrate this external retrieval capability into object detection pipelines.

Goal: Design a completely training-free detection framework that leverages web-retrieved images to dynamically enhance detection capabilities during inference.

Key Insight: The authors observe that search engines themselves act as a continuously expanding "visual memory bank" of objects—given any text label, Google can return high-quality relevant images. Such retrieval-based visual representations naturally possess long-tail coverage capabilities.

Core Idea: For a target label, retrieve positive and negative sample images from the Web, perform attention-weighted fusion to generate query embeddings, and then combine SAM region proposals with similarity heatmaps for joint localization, achieving high-precision detection without any training.

Method

Overall Architecture

Given an input image and a target label, SearchDet operates in four steps: (1) retrieve from the web positive sample images for the label and negative sample images for negative labels generated by an LLM; (2) generate adjusted query embeddings using an attention-weighted fusion mechanism; (3) generate region proposals using SAM and filter them with adaptive thresholds; (4) generate similarity heatmaps and intersect them with the filtered SAM regions to output the final detection boxes.

Key Designs

Positive/Negative Web Retrieval and Attentional Query Generation:
- Function: Retrieve visual representations of target and distractor objects from the Web to generate refined query embeddings.
- Mechanism: First, retrieve 5 positive and 5 negative sample images using a search engine (negative labels are generated by an LLM, e.g., the negative label for "surfboard" is "waves"). Use the CLS token of DINOv2 to embed the input image, the positive samples, and the negative samples. Compute the cosine similarity \(S_{pos,i}\) between the input-image embedding and each positive embedding \(e_{pos,i}\), normalize the similarities with a softmax over \(i\) to obtain attention weights \(\alpha_{pos,i} = \text{softmax}_i(S_{pos,i})\), and aggregate: \(A_{pos} = \sum_i \alpha_{pos,i} e_{pos,i}\). \(A_{neg}\) is computed in the same way from the negative embeddings. The final query is \(q_{adjusted} = A_{pos} - A_{neg}\). The attention weighting gives higher weight to retrieved images that are more relevant to the input image.
- Design Motivation: Negative samples are crucial—retrieved images of "surfboard" often contain ocean waves; without subtracting negative embeddings, the model might mistake waves for the target. Compared to simple mean pooling subtraction, attention weighting adapts more precisely to the specific context of the input image.
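
The query construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: random vectors stand in for DINOv2 CLS embeddings, the helper names (`attention_pool`, `adjusted_query`) are hypothetical, and the sketch assumes that the softmax runs over the retrieved images within each set and that \(A_{neg}\) is built the same way as \(A_{pos}\).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_pool(image_emb, retrieved_embs):
    """Weight retrieved embeddings by softmax of cosine similarity to the input image."""
    sims = l2_normalize(retrieved_embs) @ l2_normalize(image_emb)  # S_i, shape (n,)
    weights = softmax(sims)                                        # alpha_i
    return weights @ retrieved_embs, weights                       # A = sum_i alpha_i * e_i

def adjusted_query(image_emb, pos_embs, neg_embs):
    """q_adjusted = A_pos - A_neg."""
    a_pos, _ = attention_pool(image_emb, pos_embs)
    a_neg, _ = attention_pool(image_emb, neg_embs)
    return a_pos - a_neg

# Toy data standing in for DINOv2 CLS embeddings (assumption: 5 images per set).
rng = np.random.default_rng(0)
dim = 8
image_emb = rng.normal(size=dim)
pos_embs = rng.normal(size=(5, dim))
neg_embs = rng.normal(size=(5, dim))
q_adjusted = adjusted_query(image_emb, pos_embs, neg_embs)
```

Replacing `softmax(sims)` with uniform weights recovers the mean-pooling baseline that the ablation study compares against.
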
SAM Region Proposals and Frequency-Adaptive Thresholding:
- Function: Generate candidate regions and adaptively determine whether the target is present.
- Mechanism: Use HQ-SAM to generate all segmented regions in the input image. For each region, obtain its region embedding by feeding its masked image into DINOv2, and compute the Euclidean distance to the query embedding. Sort the distances and group them into bins (each bin containing \(m\) distances, where \(m\) is the number of queries). If a single mask occupies more than 80% of a certain bin, it is selected as a candidate. The key innovation lies in the verification step: compute the average distance \(\mu_{D_j}\) of candidate masks in the bin and compare it with the mean \(\mu_R\) of a reference distribution (the distribution of mutual distances among query embeddings). If the deviation exceeds 3 standard deviations, the candidate is rejected.
- Design Motivation: Simple percentile thresholding still outputs results when the target is absent, leading to false positives. The frequency-adaptive method dynamically adjusts based on the shape of the distance distribution. When the target is absent, the distance distribution remains uniform, meaning no single mask will dominate the low-distance bins.
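
The binning and verification logic can be illustrated with the following sketch. It is one possible reading of the description, not the authors' code: it assumes each region is compared against each of the \(m\) query embeddings, that all resulting distances are sorted globally before being cut into bins of size \(m\), and that the 3-standard-deviation test is one-sided (only masks much farther than the reference mean are rejected). The function names are hypothetical; the 0.8 and 3.0 defaults come from the text above.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def select_masks(region_embs, query_embs, dominance=0.8, n_sigma=3.0):
    """Indices of masks that dominate a bin and pass the reference check.

    Assumes at least two query embeddings so the reference distribution exists.
    """
    m = query_embs.shape[0]
    dists = pairwise_dist(region_embs, query_embs)           # (n_regions, m)

    # Reference distribution: mutual distances among the query embeddings.
    ref = pairwise_dist(query_embs, query_embs)[np.triu_indices(m, k=1)]
    mu_ref, sigma_ref = ref.mean(), ref.std()

    # Sort all distances, remembering which mask produced each one.
    order = np.argsort(dists, axis=None)
    owners = order // m                                       # flat index -> mask index
    sorted_d = dists.ravel()[order]

    selected = []
    for start in range(0, owners.size, m):                    # bins of m distances
        bin_owners = owners[start:start + m]
        bin_dists = sorted_d[start:start + m]
        ids, counts = np.unique(bin_owners, return_counts=True)
        if counts.max() / m <= dominance:
            continue                                          # no mask dominates this bin
        top = int(ids[np.argmax(counts)])
        mu_bin = bin_dists[bin_owners == top].mean()
        # One-sided check (an assumption): reject masks much farther than the reference.
        if mu_bin - mu_ref <= n_sigma * sigma_ref and top not in selected:
            selected.append(top)
    return selected

# Toy data: mask 0 sits at the centroid of the queries, masks 1..5 are far away.
rng = np.random.default_rng(0)
query_embs = rng.uniform(-0.1, 0.1, size=(5, 16))
target = query_embs.mean(axis=0, keepdims=True)
distractors = target + 100.0 * np.eye(16)[:5]
region_embs = np.vstack([target, distractors])

print(select_masks(region_embs, query_embs))   # target present -> [0]
print(select_masks(distractors, query_embs))   # target absent  -> []
```

In the second call a distant mask can still dominate a bin, but its mean distance lies far outside the reference distribution, so nothing is returned. This is the "target absent" behaviour that a fixed percentile threshold would not provide.
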
Heatmap Generation and Joint Localization:
- Function: Provide a detection signal independent of SAM to refine localization.
- Mechanism: Upsample the DINOv2 patch features of the input image, generate query embeddings using the same attention-weighted method (this time using patch features instead of CLS tokens), and compute the cosine similarity between the query and each patch to produce a heatmap. After binarizing the heatmap, intersect it with the filtered SAM regions: a filtered SAM region yields a bounding box only if its intersection with the heatmap is non-empty. The two signals are complementary. SAM might omit parts of an object, which the heatmap can supplement, while heatmap boundaries can be imprecise where SAM provides more accurate segmentation boundaries.
- Design Motivation: Relying solely on SAM region proposals might fail to detect target objects if SAM misses the corresponding mask. Using only heatmaps lacks precise object boundaries. Combining both yields higher recall and more precise localization.
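
The joint localization step can be sketched as follows. This is a simplified illustration under stated assumptions: the patch features are taken to be already upsampled to the mask resolution, the binarization threshold of 0.5 is a placeholder (the text above does not specify the rule), and the function names are hypothetical.

```python
import numpy as np

def similarity_heatmap(patch_feats, query):
    """Cosine similarity between one query vector and every patch feature (H, W, D)."""
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-12)
    q = query / (np.linalg.norm(query) + 1e-12)
    return p @ q                                   # shape (H, W)

def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around the True pixels of a mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def joint_localize(heatmap, masks, threshold=0.5):
    """Keep a mask only if it overlaps the binarized heatmap, then emit its box."""
    hot = heatmap > threshold                      # placeholder binarization rule
    return [mask_to_box(m) for m in masks if np.any(m & hot)]

# Toy scene: a 2x3 object whose features align with the query, on an orthogonal background.
H, W, D = 6, 6, 4
patch_feats = np.zeros((H, W, D))
patch_feats[..., 1] = 1.0                          # background direction
patch_feats[1:3, 1:4] = [1.0, 0.0, 0.0, 0.0]       # object direction
query = np.array([1.0, 0.0, 0.0, 0.0])

mask_a = np.zeros((H, W), dtype=bool)
mask_a[1:3, 1:4] = True                            # overlaps the hot region
mask_b = np.zeros((H, W), dtype=bool)
mask_b[4:6, 4:6] = True                            # lies on the background

boxes = joint_localize(similarity_heatmap(patch_feats, query), [mask_a, mask_b])
print(boxes)                                       # [(1, 1, 3, 2)]
```

Only `mask_a` survives because `mask_b` has an empty intersection with the binarized heatmap.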

Key Experimental Results

Main Results

Method	Backbone	COCO	LVIS	ODinW-35	Roboflow100
GLIP-L	Swin-L	49.8	26.9	23.4	8.6
GroundingDINO-L	Swin-L	48.4	27.4	22.3	8.3
T-Rex2 (Text)	Swin-L	52.2	45.8	22.0	10.5
T-Rex2 (Visual-G)	Swin-L	46.5	45.3	27.8	18.5
SearchDet (Ours)	DINOv2-L	59.3	43.6	33.1	27.9

Under the 10-shot setting, SearchDet achieves 61.4 mAP, a 16.1% relative improvement over the previous SOTA method DE-ViT (52.9).

Ablation Study

Configuration	COCO mAP	Description
Full Method	59.34	—
Positive Only (No Negatives)	45.80	Decreases by 22.82%, showing negative samples are crucial
No Heatmap Refinement	51.07	Decreases by 13.94%, heatmap provides key supplementation
Mean Pooling instead of Attention Weighting	55.47	Decreases by 6.5%, attention adaptation is more effective
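
The percentages in the Description column are relative drops with respect to the full method. The short check below reproduces them from the mAP values in the table; the variable and function names are only for illustration.

```python
full = 59.34
ablations = {
    "positive only": 45.80,
    "no heatmap": 51.07,
    "mean pooling": 55.47,
}

def relative_drop(full_map, ablated_map):
    """Relative mAP decrease in percent, measured against the full method."""
    return 100.0 * (full_map - ablated_map) / full_map

drops = {name: round(relative_drop(full, value), 2) for name, value in ablations.items()}
print(drops)  # approximately 22.82, 13.94 and 6.52 percent
```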

Key Findings

Negative samples matter most among the ablated components (removing them drops mAP from 59.3 to 45.8), indicating that suppressing visual distractors is key in open-vocabulary detection.
The number of retrieved images positively correlates with performance: increasing positive/negative samples from 1 to 10 steadily improves mAP from 49.7 to 59.3, a 19.4% gain.
Embeddings of images retrieved on different days are highly consistent (cosine similarity > 0.9), suggesting that web retrieval is stable enough to support practical applications.
The advantages are most prominent on domain-diverse datasets: relative to GroundingDINO-L, mAP rises by roughly 48% on ODinW-35 (22.3 to 33.1) and more than triples on Roboflow100 (8.3 to 27.9), indicating that the long-tail coverage of web retrieval is the method's main strength.

Highlights & Insights

Treating web search engines as an "external visual memory" is the central insight: model parameters are static, but the Web keeps growing, so any new object can be detected as long as web images of it exist, without retraining.
LLM-generated negative labels are a simple yet effective design: the world knowledge of language models is used to infer automatically that "surfboard" often co-occurs with "waves", which saves the cost of manual negative class design.
Frequency-adaptive thresholding is a much more robust solution than fixed thresholding: it handles the "target absent from image" scenario to avoid false positives, which is crucial for practical deployment.
The entire pipeline can be migrated to visual question answering: retrieving relevant images as few-shot examples to enhance VLM reasoning capabilities.

Limitations & Future Work

Real-time capability is insufficient: processing each image takes about 3 seconds on a V100 GPU, in addition to web retrieval latency.
SearchDet scores slightly lower than T-Rex2 on LVIS (43.6 vs. 45.8 for T-Rex2 with text prompts), potentially because only 5 retrieved images are insufficient to cover the semantic space of 1203 classes.
Reliance on the Google search engine introduces privacy and availability issues; offline versions would require pre-building a large-scale image database.
Sensitive to vague labels (e.g., "20" instead of "20 dollar bill"), requiring better label prompt enrichment strategies.
Future directions: pre-building offline retrieval databases to eliminate web dependence, combining with VLMs to achieve more intelligent label expansion, and integrating with online learning to progressively refine retrieval quality.

Comparison with Related Methods

vs GroundingDINO / GLIP: These methods encode knowledge into model parameters, offering limited support for long-tail objects. SearchDet dynamically expands knowledge via external retrieval, which gives it a clear advantage in long-tail scenarios.
vs T-Rex2: T-Rex2 also utilizes visual prompts, but requires alignment during the training phase. SearchDet is completely training-free and more flexible.
vs DE-ViT: Both are training-free few-shot detection methods, but DE-ViT requires manually provided support images. SearchDet automatically retrieves them from the Web, which is more practical.

Rating

Novelty: ⭐⭐⭐⭐ The combination of web retrieval and training-free detection is innovative, with each component being a creative combination of existing models.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on four datasets with comprehensive ablation and stability analyses, though detailed analysis of inference speed is lacking.
Writing Quality: ⭐⭐⭐⭐ Method descriptions are clear and the algorithm pseudocode is complete, though some notation usage is not fully consistent.
Value: ⭐⭐⭐⭐ Demonstrates the immense potential of web-retrieval augmentation, opening up a new direction for training-free detection.