Exploring Contextual Attribute Density in Referring Expression Counting (CAD-GD)¶

Conference: CVPR 2025
arXiv: 2503.12460
Code: github.com/Xu3XiWang/CAD-GD
Area: Other
Keywords: Referring Expression Counting, Contextual Attribute Density, Open-World Detection, GroundingDINO, Density Map

TL;DR¶

The paper proposes Contextual Attribute Density (CAD) to improve referring expression counting. Three modules (a U-shape density estimator, CAD attention, and dynamic query initialization) inject attribute-level density cues into GroundingDINO. On REC-8K with a Swin-T backbone, validation MAE falls from 6.80 (GroundingREC) to 5.43, a reduction of about 20%; the best reported configuration reaches 4.83, about 29% below the 6.80 baseline.

Background & Motivation¶

Background: Referring Expression Counting (REC) is an emerging counting task that requires counting objects with specific attributes based on fine-grained textual descriptions (e.g., "walking person" instead of a simple "person"). GroundingREC, built on GroundingDINO, is the first REC baseline.

Limitations of Prior Work: GroundingREC suffers from two types of errors when handling fine-grained attributes: (1) over-counting, where it over-focuses on category information while ignoring fine-grained attributes and thus includes objects with incorrect attributes; and (2) under-counting, where objects with the specified attributes are missed due to occlusions and scale variations.

Key Challenge: Existing REC models follow a counting-by-detection pipeline (one-to-one matching between queries and objects) and therefore lack awareness of the spatial density distribution. Traditional counting methods have shown that "visual density" is crucial for scale-robust modeling of spatial distributions, yet existing open-world models do not exploit it.

Key Insight: Analogous to the concept of "visual density", "Contextual Attribute Density" (CAD) is defined to measure the information intensity of a specific fine-grained attribute across visual regions of different scales. Modeling CAD guides the model to more accurately align attribute information with visual patterns.

Core Idea: Introduce attribute-level density map supervision to open-world detectors, enabling them to perceive the spatial distribution of attributes corresponding to fine-grained textual descriptions.

Method¶

Overall Architecture¶

The CAD-GD framework is built upon GroundingDINO: after extracting image and text features via their respective backbones, multi-scale visual features \(\{F_{vi}\}_{i=1}^{4}\) and text features \(F_t\) are obtained through the Feature Enhancer. Then, CAD information is injected through three main modules: the CAD generation module produces density features, the CAD attention module enhances visual features, and the CAD dynamic query module initializes decoder queries.

Key Designs¶

CAD Generation Module (U-shape CADE): It first projects visual features into the text space to calculate similarity \(S_i = \text{Proj}(F_{vi}) \cdot F_t\), and then feeds the similarity features along with visual features into a U-shape estimator to generate multi-scale CAD features \(\{D_i\}_{i=1}^{4}\). Finally, it outputs density maps supervised by an \(\ell_2\) loss (where ground-truth density maps are generated via Gaussian kernels).
CAD Attention Module: This operates in two steps: spatial attention uses channel pooling (max+avg) on CAD features to generate spatial weight maps that enhance foreground regions; channel attention applies channel-level weighting via a shared MLP to the features enhanced by spatial attention, distinguishing different attributes across scales.
CAD Dynamic Query Initialization: It first dynamically initializes query content using text features (Text Init, \(\dot{Q} = (Q \times (F_t \times M)^\top) \times F_t\)), and then further refines it using CAD features via cross-attention (Density Init). This makes queries for different referring expressions more distinguishable in the feature space.
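
The similarity step of the CAD generation module, \(S_i = \text{Proj}(F_{vi}) \cdot F_t\), can be illustrated with a minimal pure-Python sketch. It assumes that `Proj` is a single linear layer and that the similarity is a plain dot product per spatial location and text token; the names `project`, `similarity_map`, and `weight`, as well as the toy tensors, are hypothetical and are not taken from the released code.

```python
def project(vec, weight):
    """Linear projection (assumed form of Proj): one weight row per text-space dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def similarity_map(visual, text, weight):
    """Return S[y][x][t] = <Proj(visual[y][x]), text[t]> for one feature scale."""
    sim = []
    for row in visual:
        sim_row = []
        for pixel in row:
            projected = project(pixel, weight)
            sim_row.append(
                [sum(p * t for p, t in zip(projected, token)) for token in text]
            )
        sim.append(sim_row)
    return sim


# Toy example: a 1x2 feature map with 3 visual channels,
# and 2 text tokens living in a 2-dimensional text space.
visual = [[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]]
weight = [
    [1.0, 0.0, 0.0],  # text-space dim 0 reads visual channel 0
    [0.0, 0.0, 1.0],  # text-space dim 1 reads visual channel 2
]
text = [[1.0, 0.0], [0.5, 0.5]]

S = similarity_map(visual, text, weight)
print(S)
```

In the described pipeline, a map like `S` would then be fed, together with the visual features, into the U-shape estimator; that estimator is not sketched here.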

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{\text{loc}} + \alpha \cdot \mathcal{L}_{\text{density}}\]

Here \(\mathcal{L}_{\text{loc}}\) is the standard localization loss (Hungarian matching + L1/GIoU), \(\mathcal{L}_{\text{density}} = \|D_{\text{pred}} - D_{\text{gt}}\|_2^2\) is the \(\ell_2\) regression loss on the density map, and \(\alpha\) weights the density term. The visual and text backbones are frozen during training. The model is trained with AdamW for 20 epochs at a learning rate of 1e-5, which is reduced by a factor of 10 at the 10th epoch.
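
As a rough illustration of the density supervision, the sketch below builds a ground-truth density map by placing one Gaussian per annotated point and evaluates the combined loss. It is a toy version under stated assumptions: each Gaussian is renormalized over the image grid so the map sums to the object count, the grid is 8x8 with `sigma=1.5` purely for readability, and `alpha=0.5` is a placeholder value. All function names are hypothetical.

```python
import math


def gaussian_density_map(points, height, width, sigma):
    """Ground-truth density map with one Gaussian per annotated point.

    Assumption: each Gaussian is renormalized over the grid, so the map
    sums to len(points).
    """
    density = [[0.0] * width for _ in range(height)]
    for (py, px) in points:
        kernel = [
            [
                math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        total = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def density_loss(pred, gt):
    """Squared l2 distance between predicted and ground-truth density maps."""
    return sum((p - g) ** 2 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))


def total_loss(loc_loss, pred, gt, alpha):
    """L = L_loc + alpha * L_density, with loc_loss supplied by the detector."""
    return loc_loss + alpha * density_loss(pred, gt)


gt = gaussian_density_map([(2, 2), (5, 6)], height=8, width=8, sigma=1.5)
count = sum(sum(row) for row in gt)
zeros = [[0.0] * 8 for _ in range(8)]

print(round(count, 6))                       # integral of the map
print(density_loss(zeros, gt) > 0)           # an empty prediction is penalized
print(total_loss(1.0, gt, gt, alpha=0.5))    # perfect density adds no penalty
```

Because the map integrates to the number of annotated points, summing a predicted map gives a count estimate directly.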

Key Experimental Results¶

Method	Backbone	Val MAE↓	Val RMSE↓	Val F1↑	Test MAE↓	Test RMSE↓	Test F1↑
GroundingDINO	Swin-T	9.03	21.98	0.65	8.88	21.95	0.66
GroundingREC	Swin-T	6.80	18.13	0.68	6.50	19.79	0.69
CAD-GD	Swin-T	5.43	15.01	0.70	5.29	17.08	0.72
GroundingREC*	Swin-B	5.66	15.24	0.71	5.42	18.47	0.70
CAD-GD	Swin-B	4.83	13.52	0.75	4.94	14.65	0.76
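
For reference, the two counting metrics in the table follow their usual definitions: MAE is the mean absolute difference between predicted and ground-truth counts, and RMSE is the square root of the mean squared difference. A minimal sketch with made-up counts (the values below are illustrative, not from REC-8K):

```python
import math


def mae(pred_counts, gt_counts):
    """Mean absolute error over per-sample counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)


def rmse(pred_counts, gt_counts):
    """Root mean squared error over per-sample counts."""
    squared = sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
    return math.sqrt(squared / len(gt_counts))


# Illustrative counts for four image/expression pairs.
pred = [3, 10, 7, 0]
gt = [4, 10, 5, 2]

print(mae(pred, gt))
print(rmse(pred, gt))
```

RMSE weights large per-sample errors more heavily than MAE, which is why the two columns can move differently between methods.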

Ablation Study¶

Module Combination	Val MAE	Val RMSE	Val F1
Baseline (w/o CAD)	6.52	17.72	0.665
+CAD Generation	6.17	16.38	0.673
+Spatial Attention	5.88	16.43	0.691
+Channel Attention	5.61	16.28	0.690
+Text Init	5.67	14.43	0.690
+Density Init	5.43	15.01	0.700
+Density Inference Strategy	4.83	13.52	0.695
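
The step-by-step effect of each ablation row can be recomputed from the Val MAE column. The short sketch below does that arithmetic; the helper name `relative_reduction` is hypothetical and the numbers are copied from the table above.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Val MAE values copied from the ablation table, in cumulative order.
ablation_mae = [
    ("Baseline (w/o CAD)", 6.52),
    ("+CAD Generation", 6.17),
    ("+Spatial Attention", 5.88),
    ("+Channel Attention", 5.61),
    ("+Text Init", 5.67),
    ("+Density Init", 5.43),
    ("+Density Inference Strategy", 4.83),
]

for (_, prev), (name, cur) in zip(ablation_mae, ablation_mae[1:]):
    print(f"{name}: MAE {prev:.2f} -> {cur:.2f} "
          f"({relative_reduction(prev, cur):.1f}% reduction)")

overall = relative_reduction(ablation_mae[0][1], ablation_mae[-1][1])
print(f"Baseline to final row: {overall:.1f}% reduction")
```

Note that the Text Init row raises MAE slightly on its own (5.61 to 5.67) while lowering RMSE, so its contribution is best read jointly with the Density Init row.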

Key Findings¶

The density map inference strategy (replacing thresholding with density-map-based counting) lowers Val MAE from 5.43 to 4.83, an additional reduction of about 11%.
The CAD density map can distinguish spatial distributions of different attributes within the same category (e.g., "bluish pen" vs "greenish pen").
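In zero-shot counting on FSC-147, it outperforms GroundingREC on the validation split (MAE of 9.30 vs 10.06) and on RMSE for both splits, but its test MAE is slightly higher (10.35 vs 10.12).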
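
To make the difference between the two counting rules concrete, the sketch below contrasts a threshold-based count with a density-based count. This summary does not spell out the exact inference procedure, so the sketch rests on an assumption: the density count is simply the rounded sum of the predicted density map. The function names, scores, and map values are all hypothetical.

```python
def count_by_threshold(query_scores, threshold):
    """Detection-style count: queries whose confidence exceeds the threshold."""
    return sum(1 for s in query_scores if s > threshold)


def count_by_density(density_map):
    """Density-style count (assumed form): rounded sum of the density map."""
    return int(round(sum(sum(row) for row in density_map)))


# Hypothetical decoder confidences for five queries.
scores = [0.91, 0.85, 0.40, 0.33, 0.10]

# Hypothetical 3x3 predicted density map.
density = [
    [0.0, 0.5, 0.5],
    [0.25, 0.75, 0.0],
    [0.0, 0.5, 0.5],
]

print(count_by_threshold(scores, 0.5))
print(count_by_threshold(scores, 0.35))
print(count_by_density(density))
```

The threshold count changes with the chosen cut-off, whereas the density count does not depend on one; this is one plausible reason the strategy helps MAE while leaving localization quality to the detection branch.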

Highlights & Insights¶

Conceptual Innovation: Introduces density estimation into cross-modal referring expression counting for the first time, defining the new concept of "Contextual Attribute Density".
Convincing Query Visualization: t-SNE visualization clearly demonstrates that queries for different attributes can be effectively separated after CAD initialization.
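Plug-and-Play: The CAD modules are designed around a DETR-like structure, so in principle they can be attached to other open-world detectors of that family; the results here cover only GroundingDINO-based models.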

Limitations & Future Work¶

Ground-truth density maps use a fixed-size Gaussian kernel (\(\sigma=15\)), which does not adapt to target scales.
The density inference strategy improves counting but slightly degrades localization (Val F1 falls from 0.700 to 0.695 in the ablation), indicating that the density-based count and the detected instances are not fully consistent.
Referring expression counting is evaluated only on REC-8K (~8,000 images), a relatively small dataset.
There is still room for improvement on complex semantic compositions, such as negative expressions like "not in a bus".

Zero-Shot Counting Generalization (FSC-147)¶

Method	Val MAE	Val RMSE	Test MAE	Test RMSE
GroundingREC	10.06	58.62	10.12	107.19
CountGD	12.14	47.51	12.98	98.35
CAD-GD	9.30	40.96	10.35	86.88

Related Work¶

GroundingDINO / GroundingREC: Open-world detection \(\rightarrow\) REC baseline.
Density Estimation Counting: CounTR, LOCA, CACViT, DAVE, etc., which demonstrate the value of density features for counting tasks.
Text-Guided Counting: CLIP-Count, CounTX, CountGD, VLCounter, all of which perform cross-modal counting.
Density Modeling in Other Tasks: DQ-DETR (small object detection), Cholakkal et al. (instance segmentation), which suggest that density priors are useful beyond counting.

Rating¶

Novelty: ⭐⭐⭐⭐ The concept of CAD is novel, and the fusion of density + detection is inspiring.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies, including zero-shot generalization validation.
Writing Quality: ⭐⭐⭐⭐ Clear and well-structured, with rich visualizations.
Value: ⭐⭐⭐⭐ Provides a fresh perspective for open-world counting.