Exploring Contextual Attribute Density in Referring Expression Counting (CAD-GD)
Conference: CVPR 2025
arXiv: 2503.12460
Code: github.com/Xu3XiWang/CAD-GD
Area: Other
Keywords: Referring Expression Counting, Contextual Attribute Density, Open-World Detection, GroundingDINO, Density Map
TL;DR
The paper proposes Contextual Attribute Density (CAD) to improve referring expression counting. Three modules (a U-shape density estimator, CAD attention, and dynamic query initialization) inject CAD into GroundingDINO. With a Swin-T backbone, validation MAE on REC-8K drops from 6.80 (GroundingREC) to 5.43, a reduction of about 20%; the best reported configuration reaches 4.83, a reduction of about 29%.
Background & Motivation
Background: Referring Expression Counting (REC) is an emerging task that counts objects with specific attributes given fine-grained textual descriptions (e.g., "walking person" rather than just "person"). GroundingREC, built on GroundingDINO, is the first REC baseline.
Limitations of Prior Work: GroundingREC suffers from two types of errors when handling fine-grained attributes: (1) over-counting, where it over-focuses on category information while ignoring fine-grained attributes and thus includes objects with incorrect attributes; and (2) under-counting, where objects with the specified attributes are missed due to occlusions and scale variations.
Key Challenge: Existing REC methods follow a counting-by-detection pipeline (one-to-one matching) and lack awareness of the spatial density distribution. Traditional counting methods have shown that "visual density" is crucial for scale-robust modeling of spatial distributions, yet existing open-world models do not exploit it.
Key Insight: Analogous to the concept of "visual density", "Contextual Attribute Density" (CAD) is defined to measure the information intensity of a specific fine-grained attribute across visual regions of different scales. Modeling CAD guides the model to more accurately align attribute information with visual patterns.
Core Idea: Introduce attribute-level density map supervision to open-world detectors, enabling them to perceive the spatial distribution of attributes corresponding to fine-grained textual descriptions.
Method
Overall Architecture
The CAD-GD framework is built upon GroundingDINO: after extracting image and text features via their respective backbones, multi-scale visual features \(\{F_{vi}\}_{i=1}^{4}\) and text features \(F_t\) are obtained through the Feature Enhancer. Then, CAD information is injected through three main modules: the CAD generation module produces density features, the CAD attention module enhances visual features, and the CAD dynamic query module initializes decoder queries.
Key Designs
- CAD Generation Module (U-shape CADE): It first projects visual features into the text space to calculate similarity \(S_i = \text{Proj}(F_{vi}) \cdot F_t\), and then feeds the similarity features along with visual features into a U-shape estimator to generate multi-scale CAD features \(\{D_i\}_{i=1}^{4}\). Finally, it outputs density maps supervised by an \(\ell_2\) loss (where ground-truth density maps are generated via Gaussian kernels).
- CAD Attention Module: This operates in two steps. Spatial attention applies channel pooling (max + avg) to the CAD features to generate spatial weight maps that enhance foreground regions. Channel attention then applies channel-level weighting, via a shared MLP, to the spatially enhanced features, distinguishing different attributes across scales.
- CAD Dynamic Query Initialization: It first dynamically initializes query content using text features (Text Init, \(\dot{Q} = (Q \times (F_t \times M)^\top) \times F_t\)), and then further refines it using CAD features via cross-attention (Density Init). This makes queries for different referring expressions more distinguishable in the feature space.
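The Text Init formula can be traced with small matrices. The following is a minimal pure-Python sketch, not the authors' code. It assumes \(Q\) has shape (num_queries, d), \(F_t\) has shape (num_tokens, d), and \(M\) is a d × d matrix; this summary does not define \(M\), so that shape is an assumption. The helper names `matmul`, `transpose`, and `text_init` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def text_init(Q, F_t, M):
    """Compute (Q x (F_t x M)^T) x F_t.

    Q:   num_queries x d query content
    F_t: num_tokens x d text features
    M:   d x d matrix (assumed shape; M is not defined in the summary)
    """
    weights = matmul(Q, transpose(matmul(F_t, M)))  # num_queries x num_tokens
    return matmul(weights, F_t)                     # num_queries x d


# Toy example: 2 queries, 3 text tokens, feature size 2, identity M.
Q = [[1.0, 0.0], [0.0, 1.0]]
F_t = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
Q_dot = text_init(Q, F_t, M)
```

Under these assumptions, each initialized query is a weighted sum of text-token features, with weights given by the query's similarity to each (projected) token, so queries for different expressions start from different points in feature space.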
Loss & Training
The training objective combines \(\mathcal{L}_{\text{loc}}\), the standard localization loss (Hungarian matching + L1/GIoU), with \(\mathcal{L}_{\text{density}} = \|D_{\text{pred}} - D_{\text{gt}}\|_2^2\), the \(\ell_2\) regression loss on the density map. During training, the visual and text backbones are frozen. The model is trained with AdamW for 20 epochs at a learning rate of 1e-5, which is divided by 10 at the 10th epoch.
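The density term and the step schedule can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, epochs are assumed to be counted from 0 with the decay applied from epoch index 10 onward, and the weighting between the localization and density terms is not stated in this summary, so it is not modeled.

```python
def density_l2_loss(pred, gt):
    """Squared L2 distance between two density maps given as lists of rows."""
    return sum((p - g) ** 2 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))


def learning_rate(epoch, base_lr=1e-5, decay_epoch=10, factor=10.0):
    """Step schedule: base_lr before decay_epoch, base_lr / factor afterwards."""
    return base_lr / factor if epoch >= decay_epoch else base_lr


# Toy 2 x 2 density maps for illustration only.
pred = [[0.0, 0.5], [0.25, 0.0]]
gt = [[0.0, 1.0], [0.0, 0.0]]
loss = density_l2_loss(pred, gt)  # 0.5**2 + 0.25**2

# Learning rate for each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]
```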
Key Experimental Results
| Method | Backbone | Val MAE↓ | Val RMSE↓ | Val F1↑ | Test MAE↓ | Test RMSE↓ | Test F1↑ |
|---|---|---|---|---|---|---|---|
| GroundingDINO | Swin-T | 9.03 | 21.98 | 0.65 | 8.88 | 21.95 | 0.66 |
| GroundingREC | Swin-T | 6.80 | 18.13 | 0.68 | 6.50 | 19.79 | 0.69 |
| CAD-GD | Swin-T | 5.43 | 15.01 | 0.70 | 5.29 | 17.08 | 0.72 |
| GroundingREC* | Swin-B | 5.66 | 15.24 | 0.71 | 5.42 | 18.47 | 0.70 |
| CAD-GD | Swin-B | 4.83 | 13.52 | 0.75 | 4.94 | 14.65 | 0.76 |
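For reference, MAE and RMSE over per-image counts are usually computed as in the sketch below. The counts are toy values for illustration, not data from REC-8K, and the function names are hypothetical.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


def rmse(pred_counts, true_counts):
    """Root mean squared error; large misses weigh more than in MAE."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Toy counts for illustration only.
pred = [3, 10, 7, 20]
true = [4, 10, 5, 12]
toy_mae = mae(pred, true)    # errors 1, 0, 2, 8
toy_rmse = rmse(pred, true)
```

Because RMSE squares each error, a few images with large miscounts can push it well above MAE, which is one common reason the RMSE columns are much larger than the MAE columns.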
Ablation Study
| Module Combination | Val MAE | Val RMSE | Val F1 |
|---|---|---|---|
| Baseline (w/o CAD) | 6.52 | 17.72 | 0.665 |
| +CAD Generation | 6.17 | 16.38 | 0.673 |
| +Spatial Attention | 5.88 | 16.43 | 0.691 |
| +Channel Attention | 5.61 | 16.28 | 0.690 |
| +Text Init | 5.67 | 14.43 | 0.690 |
| +Density Init | 5.43 | 15.01 | 0.700 |
| +Density Inference Strategy | 4.83 | 13.52 | 0.695 |
Key Findings
- The density map inference strategy (counting from the estimated density map instead of thresholding) yields an additional 11% reduction in MAE (5.43 to 4.83 in the ablation).
- The CAD density map can distinguish spatial distributions of different attributes within the same category (e.g., "bluish pen" vs "greenish pen").
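- CAD-GD also outperforms GroundingREC in zero-shot counting on the FSC-147 validation split (MAE 9.30 vs. 10.06). On the test split its MAE is slightly higher (10.35 vs. 10.12), while its RMSE is lower on both splits.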
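The contrast between threshold-based and density-based counting can be illustrated as follows. This is one plausible reading of the strategy, not the authors' implementation: it assumes the density-based count is the rounded sum of the predicted density map, and the 0.3 threshold and all values are hypothetical.

```python
def count_by_threshold(scores, threshold=0.3):
    """Count detections whose confidence exceeds a fixed threshold."""
    return sum(1 for s in scores if s > threshold)


def count_by_density(density_map):
    """Count as the rounded sum of a predicted density map."""
    return round(sum(sum(row) for row in density_map))


# Toy values for illustration only.
scores = [0.9, 0.8, 0.35, 0.28, 0.1]
density = [[0.0, 0.5, 0.5], [1.0, 0.9, 0.1], [0.6, 0.4, 0.0]]

threshold_count = count_by_threshold(scores)
density_count = count_by_density(density)
```

In this toy case a detection scored 0.28 falls just below the threshold and is dropped, while the density sum still accounts for it; the threshold-based count depends on a hand-picked cutoff, whereas the density-based count does not.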
Highlights & Insights
- Conceptual Innovation: Introduces density estimation into cross-modal referring expression counting for the first time, defining the new concept of "Contextual Attribute Density".
- Convincing Query Visualization: t-SNE visualization clearly demonstrates that queries for different attributes can be effectively separated after CAD initialization.
- Plug-and-Play: In principle, the CAD modules could be attached to other open-world detectors with DETR-like structures, although the results reported here are for GroundingDINO only.
Limitations & Future Work
- Ground-truth density maps use a fixed-size Gaussian kernel (\(\sigma=15\)), which does not adapt to target scales.
- The density inference strategy improves counting (Val MAE 5.43 to 4.83) but slightly lowers Val F1 (0.700 to 0.695), suggesting that counting accuracy and localization quality are not fully aligned.
- REC evaluation is limited to REC-8K (~8,000 images), a relatively small dataset.
- Complex expressions remain challenging, such as combinations of unrelated attributes or negations like "not in a bus".
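The fixed-kernel ground-truth density map mentioned in the first limitation can be sketched as below. This is a minimal illustration, not the authors' code: the function name is hypothetical, and normalizing each kernel inside the image bounds (so the map sums to the object count) is an assumption, since implementations differ in how they handle borders. The toy example uses a small sigma on a small grid; the summary cites \(\sigma=15\).

```python
import math


def gaussian_density_map(points, height, width, sigma=15.0):
    """Place one unit-mass Gaussian per annotated (row, col) point."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        kernel = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        mass = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                # Normalizing inside the image keeps each object's mass at 1.
                density[y][x] += kernel[y][x] / mass
    return density


# Toy example: three annotated points on a 32 x 32 grid with a small sigma.
points = [(5, 5), (16, 20), (30, 2)]
density = gaussian_density_map(points, 32, 32, sigma=2.0)
total = sum(sum(row) for row in density)
```

Because the same sigma is used for every point, a small distant object and a large nearby one receive the same spatial spread, which is the scale mismatch the limitation refers to.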
Zero-Shot Counting Generalization (FSC-147)
| Method | Val MAE | Val RMSE | Test MAE | Test RMSE |
|---|---|---|---|---|
| GroundingREC | 10.06 | 58.62 | 10.12 | 107.19 |
| CountGD | 12.14 | 47.51 | 12.98 | 98.35 |
| CAD-GD | 9.30 | 40.96 | 10.35 | 86.88 |
Related Work & Insights
- GroundingDINO / GroundingREC: Open-world detection \(\rightarrow\) REC baseline.
- Density Estimation Counting: CounTR, LOCA, CACViT, DAVE, etc. — proving the value of density features for counting tasks.
- Text-Guided Counting: CLIP-Count, CounTX, CountGD, VLCounter — cross-modal counting.
- Density Modeling in Other Tasks: DQ-DETR (small object detection) and Cholakkal et al. (instance segmentation) illustrate the generality of density priors.
Rating
- Novelty: ⭐⭐⭐⭐ The concept of CAD is novel, and the fusion of density + detection is inspiring.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies, including zero-shot generalization validation.
- Writing Quality: ⭐⭐⭐⭐ Clear and well-structured, with rich visualizations.
- Value: ⭐⭐⭐⭐ Provides a fresh perspective for open-world counting.