Zero-Shot Object Counting with Good Exemplars (VA-Count)¶

Conference: ECCV 2024
arXiv: 2407.04948
Code: https://github.com/HopooLinZ/VA-Count
Area: LLM/NLP
Keywords: Zero-Shot Counting, Vision-Language Pre-training, Exemplar Enhancement, Noise Suppression, Grounding DINO

TL;DR¶

Proposes VA-Count, a vision-association-based zero-shot object counting framework. It establishes robust visual associations between high-quality exemplars and images for arbitrary categories through a Grounding DINO-driven exemplar enhancement module and a contrastive learning noise suppression module.

Background & Motivation¶

Zero-shot object counting (ZOC) aims to count objects in an image given only the category name at test time, without any manually annotated exemplars.
The core problem of existing ZOC methods is their inability to effectively identify high-quality exemplars.
Two types of existing methods each have limitations:
Vision-language alignment methods (CLIP-Count, VLCount): Rely on direct image-text associations, resulting in insufficient representation for categories with atypical shapes.
Category-relevant exemplar search (ZSC): Rely on arbitrary patch selection, failing to accurately delineate complete objects, and are constrained by predefined categories.
Goal: Introduce a detection-driven exemplar discovery method (Grounding DINO) while simultaneously fusing textual and visual representations.

Method¶

Overall Architecture¶

Two core modules work collaboratively:
1. Exemplar Enhancement Module (EEM): Employs Grounding DINO to discover positive and negative exemplars, filtered by a single-object classifier.
2. Noise Suppression Module (NSM): Uses contrastive learning to distinguish between density maps of positive and negative exemplars, reducing the impact of erroneous exemplars.

Key Designs¶

Exemplar Enhancement Module (EEM):

Grounding DINO-guided Box Selection:
- Positive samples: Input image + positive text label (specific category name) $\rightarrow$ obtain positive candidate boxes $B^p$.
- Negative samples: Input image + negative text label ("object") $\rightarrow$ obtain negative candidate boxes $B^n$.
- Logit threshold $\tau_l = 0.02$.

De-duplication Filtering:
- Negative bounding boxes that overlap with positive bounding boxes are discarded, using an IoU threshold $\tau_{\text{iou}} = 0.5$, so that no target-class object ends up as a negative exemplar.
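The box selection and de-duplication steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the comparison directions (keep a box when its logit is at least $\tau_l$, drop a negative when its IoU reaches $\tau_{\text{iou}}$) are assumptions, since the note only gives the threshold values.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_confident(boxes: List[Box], logits: List[float],
                   tau_l: float = 0.02) -> List[Box]:
    """Keep detector boxes whose logit reaches the threshold (assumed >=)."""
    return [b for b, s in zip(boxes, logits) if s >= tau_l]


def drop_overlapping_negatives(neg: List[Box], pos: List[Box],
                               tau_iou: float = 0.5) -> List[Box]:
    """Discard negative boxes that overlap any positive box (assumed IoU >= tau_iou)."""
    return [n for n in neg if all(iou(n, p) < tau_iou for p in pos)]
```

With one positive box `(0, 0, 10, 10)`, a negative candidate `(1, 1, 10, 10)` has IoU 0.81 and is discarded, while a distant box such as `(20, 20, 30, 30)` survives as a negative exemplar candidate.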

Single-Object Exemplar Filtering:
- Binary classifier $\delta(\cdot) = \text{FFN}(\text{CLIP-ViT}(b))$.
- Uses a frozen CLIP-ViT-B/16 + a trainable FFN to judge whether a candidate box contains exactly one object.
- Training data: positive samples = single-object exemplars annotated in the training set; negative samples = randomly cropped patches + the whole image.
- Ensures the cleanliness of exemplars.
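To make the structure of $\delta(\cdot)$ concrete, here is a rough sketch of the trainable head only. Everything specific is an assumption for illustration: the embedding is passed in precomputed (the frozen CLIP encoder is not invoked), the FFN is taken to be one hidden ReLU layer with a sigmoid output, and the 0.5 decision threshold is hypothetical. The note does not specify any of these details.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ffn_probability(embedding: List[float],
                    w1: List[List[float]], b1: List[float],
                    w2: List[float], b2: float) -> float:
    """Hypothetical FFN head: linear -> ReLU -> linear -> sigmoid.

    `embedding` stands in for the frozen image-encoder feature of a cropped box.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return sigmoid(logit)


def is_single_object(embedding: List[float], w1, b1, w2, b2,
                     threshold: float = 0.5) -> bool:
    """Accept the box as an exemplar if the predicted probability reaches the threshold."""
    return ffn_probability(embedding, w1, b1, w2, b2) >= threshold
```

Only the FFN parameters (`w1`, `b1`, `w2`, `b2` here) would be trained; the encoder producing `embedding` stays frozen, as described above.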

Noise Suppression Module (NSM):

Counter $\Gamma(\cdot)$:
- Based on the CounTR architecture: Image encoder + exemplar-image interaction module + decoder.
- Image features act as Query; exemplar features are linearly projected as Key/Value.
- Generates positive/negative density maps using positive/negative exemplars, respectively.
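The query/key/value roles in the interaction module can be illustrated with a single-head cross-attention sketch. It is a simplification under stated assumptions: the learned linear projections are omitted (exemplar tokens serve directly as both keys and values), there is one head, and `cross_attend` is a hypothetical name, not a CounTR or VA-Count function.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cross_attend(image_tokens: List[Vector],
                 exemplar_tokens: List[Vector]) -> List[Vector]:
    """Each image token (query) attends over exemplar tokens (keys = values here)."""
    dim = len(exemplar_tokens[0])
    scale = math.sqrt(dim)
    output = []
    for q in image_tokens:
        weights = softmax([dot(q, k) / scale for k in exemplar_tokens])
        # Weighted sum of exemplar tokens, one output vector per image token.
        output.append([sum(w * v[d] for w, v in zip(weights, exemplar_tokens))
                       for d in range(dim)])
    return output
```

Running the same routine once with positive exemplar tokens and once with negative ones gives the two feature sets from which the decoder would produce the positive and negative density maps.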

Contrastive Learning:
- Contrastive loss:

$$L_C = -\log\left(\frac{\exp(\text{sim}(D^p, D^g))}{\exp(\text{sim}(D^p, D^g)) + \exp(\text{sim}(D^n, D^g))}\right)$$

- Maximizes the similarity between the positive density map $D^p$ and the ground truth $D^g$, while minimizing the similarity between the negative density map $D^n$ and the ground truth.
- Total loss:

$$L_{\text{total}} = L_C + L_D$$
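The loss above can be evaluated on toy inputs with the sketch below. Two assumptions are made for illustration: density maps are flattened to plain lists, and $\text{sim}(\cdot,\cdot)$ is taken to be cosine similarity, which the note does not specify. $L_D$ is the MSE between $D^p$ and $D^g$, as given in the training details.

```python
import math
from typing import List


def cosine_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity; an assumed choice for sim(., .)."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb) if na > 0 and nb > 0 else 0.0


def mse(a: List[float], b: List[float]) -> float:
    """Density loss L_D = MSE(D^p, D^g)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def contrastive_loss(d_pos: List[float], d_neg: List[float],
                     d_gt: List[float]) -> float:
    """L_C = -log( e^{sim(Dp,Dg)} / (e^{sim(Dp,Dg)} + e^{sim(Dn,Dg)}) )."""
    e_pos = math.exp(cosine_sim(d_pos, d_gt))
    e_neg = math.exp(cosine_sim(d_neg, d_gt))
    return -math.log(e_pos / (e_pos + e_neg))


def total_loss(d_pos: List[float], d_neg: List[float],
               d_gt: List[float]) -> float:
    """L_total = L_C + L_D."""
    return contrastive_loss(d_pos, d_neg, d_gt) + mse(d_pos, d_gt)
```

When the positive and negative maps are equally similar to the ground truth, $L_C = \log 2$; it falls below that value as the positive map aligns with the ground truth and the negative map diverges from it.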

Loss & Training¶

Density loss $L_D$: $\text{MSE}(D^p, D^g)$
Contrastive loss $L_C$: contrastive learning between positive/negative density maps and the ground truth.
Optimizer: AdamW, learning rate $10^{-5}$, batch size 8.
Two-stage training: MAE pre-training + fine-tuning.
Select top-3 positive exemplars and top-3 negative exemplars for each image.
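The exemplar selection step might look like the sketch below. The ranking criterion is an assumption: the note only says the top 3 exemplars are kept, so sorting by detector score after the single-object filter is a guess, and `select_top_k` and the `is_single_object` callback are hypothetical names.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def select_top_k(candidates: List[Tuple[Box, float]],
                 is_single_object: Callable[[Box], bool],
                 k: int = 3) -> List[Box]:
    """Keep boxes that pass the single-object filter, then take the k highest-scoring."""
    kept = [(box, score) for box, score in candidates if is_single_object(box)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [box for box, _ in kept[:k]]
```

The same routine would be applied separately to the positive and the negative candidate sets, giving up to 3 exemplars of each kind per image.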

Key Experimental Results¶

Main Results¶

FSC-147 Dataset:

Method	Type	Val MAE	Val RMSE	Test MAE	Test RMSE
CLIP-Count	Zero-Shot	17.78	55.43	18.97	95.93
VLCount	Zero-Shot	18.20	60.63	19.18	103.29
VA-Count	Zero-Shot	Best	Best	Best	Best
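For reference, the two metrics in the table are the standard counting errors computed over per-image counts, where each predicted count is the sum of the predicted density map. The sketch below uses hypothetical helper names and toy numbers that are not results from the paper.

```python
import math
from typing import List


def count_from_density(density: List[List[float]]) -> float:
    """Predicted count = sum of all density-map values."""
    return sum(sum(row) for row in density)


def mae(pred: List[float], gt: List[float]) -> float:
    """Mean absolute error over images."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def rmse(pred: List[float], gt: List[float]) -> float:
    """Root mean squared error over images."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))
```

RMSE penalizes large per-image errors more heavily than MAE, which is why the two can diverge sharply on test sets containing a few very dense images.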

CARPK Dataset (Cross-dataset generalization): VA-Count likewise outperforms existing zero-shot and few-shot methods.

Ablation Study¶

Component	MAE Change
Without EEM (random patches)	Significantly degraded
Without Single-Object Filter	Degraded (multi-object exemplars introduce noise)
Without NSM (no contrastive learning)	Degraded (erroneous exemplars not suppressed)
Full EEM + NSM	Best

Key Findings¶

Grounding DINO provides exemplar quality significantly superior to random patch selection.
Single-object filtering is crucial: bounding boxes that Grounding DINO scores with high confidence can still contain multiple objects.
Contrastive learning effectively distinguishes the impact of positive and negative exemplars on density maps.
"object" as a general negative text label can detect objects of non-target categories.

Highlights & Insights¶

Detection-driven exemplar discovery: Introduces the universal detection capability of Grounding DINO into zero-shot counting, achieving high-quality exemplar acquisition for arbitrary categories.
Positive-negative exemplar contrastive learning: Learns not only "what the target is" but also "what the target is not", facilitating a more robust bi-directional constraint.
Modular design: EEM and NSM are independently controllable, making them easy to understand and extend.
Practical negative sample strategy: Uses "object" to detect all objects, and removes those overlapping with positive classes to obtain negative samples.

Limitations & Future Work¶

Relies heavily on the detection quality of Grounding DINO, and may fail for objects that Grounding DINO struggles to detect (extremely small or heavily occluded).
The single-object classifier requires additional training data and training processes.
Only 3 positive/negative exemplars are selected per image; whether more exemplars would yield further improvements remains to be explored.
Relatively high computational cost (Grounding DINO + CLIP + CounTR).
Future directions: End-to-end training of the entire pipeline, and utilizing SAM to replace/enhance exemplar localization.

Related Work & Insights¶

The MAE pre-training + exemplar matching framework of CounTR serves as the base architecture of this work.
The introduction of Grounding DINO demonstrates the plug-and-play value of large-scale VLP models in downstream tasks.
The concept of positive-negative exemplar contrastive learning can be generalized to other tasks requiring exemplar matching (e.g., few-shot segmentation).

Rating¶

Dimension	Score (1-5)
Novelty	3.5
Technical Depth	3.5
Experimental Thoroughness	4
Writing Quality	4
Value	4
Overall Score	3.8