Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting

Conference: CVPR 2026
arXiv: 2603.16129
Code: Coming soon
Area: Computer Vision (Object Counting)
Keywords: zero-shot counting, vision-language model, prompt tuning, cost aggregation, quantity awareness

TL;DR

QICA tackles two weaknesses of zero-shot object counting: missing quantity awareness and spatial insensitivity. A quantity-conditioned Synergistic Prompting Strategy (SPS) jointly adapts the CLIP vision and language encoders, while a Cost Aggregation Decoder (CAD) operates on similarity maps rather than on features to preserve zero-shot transferability. The result is zero-shot SOTA on FSC-147 (test MAE 12.41) with strong cross-domain generalization.

Background & Motivation

Background: Zero-shot object counting (ZSOC) aims to enumerate arbitrary-category objects using only text descriptions. Mainstream methods leverage VLMs such as CLIP to compute vision-text similarity maps, then use CNN/Transformer decoders to predict density maps.

Limitations of Prior Work: (1) Lack of quantity awareness — text prompts specify only categories without quantity information; models excel at recognizing "what" but fail to understand "how many," especially in dense scenes. (2) Spatial insensitivity + feature space distortion — directly fine-tuning VLM encoders leads to severe overfitting to training categories, corrupting the pre-trained feature space and harming zero-shot generalization.

Key Challenge: Accurate counting requires adapting encoders to learn quantity-sensitive features, but fine-tuning corrupts zero-shot generalization — creating an adaptation-vs-generalization dilemma.

Goal: (1) Enable fine-grained quantity discrimination; (2) Achieve effective adaptation without corrupting the pre-trained feature space.

Key Insight: Introduce quantity-conditioned prompts for encoders to implicitly learn quantity information, while operating on similarity maps (rather than feature space) to avoid feature distortion.

Core Idea: During training, factual/counterfactual quantity prompts teach the model to distinguish different quantities; at inference, only category prompts are used for zero-shot counting.

Method

Overall Architecture

Input: image \(I\) and text description \(T\); output: density map \(D\). The framework comprises three core modules: (1) SPS jointly adapts frozen CLIP visual and language encoders via quantity-conditioned learnable prompts; (2) CAD performs spatial aggregation on vision-text similarity maps with multi-scale upsampling to predict density maps; (3) \(\mathcal{L}_{MQA}\) enforces quantity consistency at both encoder and decoder levels. Key design: quantity-information pathways active during training are disabled at inference, retaining only category semantic pathways.
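
To make the data flow concrete, below is a minimal sketch of the inference-time pipeline as we read it; all names (qica_inference, cad) and tensor shapes are assumptions, since the official code is not yet released.

```python
import torch
import torch.nn.functional as F

def qica_inference(image, category_tokens, visual_encoder, text_encoder, cad):
    """image: (B, 3, H, W); category_tokens: tokenized category prompt.
    Quantity pathways are disabled; only the category pathway runs."""
    B = image.shape[0]
    V = visual_encoder(image)                  # dense patch features (B, N, C)
    t_cat = text_encoder(category_tokens)      # category embedding (B, C)
    V = F.normalize(V, dim=-1)
    t_cat = F.normalize(t_cat, dim=-1)
    # Per-position cosine similarity: a scalar field, not a feature map,
    # so the decoder never rewrites the pre-trained CLIP embedding space.
    S = torch.einsum("bnc,bc->bn", V, t_cat)   # similarity map (B, N)
    h = w = int(S.shape[1] ** 0.5)             # assume a square patch grid
    D = cad(S.view(B, 1, h, w))                # CAD -> density map (B, 1, H, W)
    return D, D.sum(dim=(1, 2, 3))             # density map and predicted count
```

Note that the decoder only ever sees the scalar similarity map S; this is the design choice that protects the pre-trained feature space.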

Key Designs

  1. Synergistic Prompting Strategy (SPS)

    • Function: Jointly adapts visual and language encoders via quantity-conditioned learnable prompts
    • Mechanism: Maps quantity value \(q\) to a continuous embedding \(\epsilon_q\), added to per-layer learnable prompts \(\Pi^j\) to generate quantity-aware text prompts \(\hat{\Pi}^j_k = \Pi^j + \epsilon_{q_k}\mathbf{1}^T\). A coupling function \(\Phi^j\) projects language prompts to visual prompts \(\Psi^j_k = \Phi^j(\hat{\Pi}^j_k)\), enabling bidirectional gradient flow. During training, both factual (true quantity) and counterfactual (deviated quantity) prompts are generated per image (a minimal code sketch follows this list)
    • Design Motivation: Independent unimodal prompts cannot achieve cross-modal quantity-aware coordination. The coupling function establishes a direct language→vision gradient pathway, enabling both encoders to jointly adapt toward quantity awareness
  2. Cost Aggregation Decoder (CAD)

    • Function: Operates directly on vision-text cosine similarity maps to produce density maps via spatial aggregation
    • Mechanism: Computes per-position cosine similarity between dense visual features \(\mathbf{V}\) and category text embeddings \(\mathbf{T}^{\text{cat}}_k\) to obtain similarity maps \(\mathbf{S}_k\); then applies embedding layer → Swin Transformer spatial aggregation → multi-scale upsampling (with skip connections and similarity gating) → prediction head for final density map
    • Design Motivation: Operating in feature space corrupts the pre-trained manifold and causes overfitting, while aggregating on similarity maps (a scalar field) preserves embedding space integrity, enabling encoder fine-tuning without harming generalization
  3. Multi-level Quantity Alignment Loss (\(\mathcal{L}_{MQA}\))

    • Function: Enforces quantity consistency constraints at both encoder and decoder levels
    • Mechanism: Encoder level uses ranking loss to ensure the true quantity hypothesis has the highest global similarity \(\alpha_0 > \alpha_i\), with closer quantities yielding higher similarity; decoder level uses auxiliary MSE loss to constrain density map integrals to match corresponding quantity values. Total loss: \(\mathcal{L}_{MQA} = \|D^0 - D^{GT}\|_2^2 + \lambda_1 \mathcal{L}^{qty}_{enc} + \lambda_2 \mathcal{L}^{qty}_{dec}\)
    • Design Motivation: Decoder-only supervision is insufficient; encoder-level ranking constraints implicitly encode quantity information in the feature space during training, enabling accurate counting at inference without quantity prompts
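
For concreteness, here is a minimal sketch of one SPS layer implementing \(\hat{\Pi}^j_k = \Pi^j + \epsilon_{q_k}\mathbf{1}^T\) and \(\Psi^j_k = \Phi^j(\hat{\Pi}^j_k)\). The token count, channel widths, and the linear forms of \(\epsilon\) and \(\Phi^j\) are assumptions, not confirmed details of the paper.

```python
import torch
import torch.nn as nn

class SPSLayer(nn.Module):
    """One layer of the Synergistic Prompting Strategy (hypothetical
    shapes: P prompt tokens, text width c_text, visual width c_vis)."""
    def __init__(self, num_tokens=4, c_text=512, c_vis=768):
        super().__init__()
        self.pi = nn.Parameter(torch.randn(num_tokens, c_text) * 0.02)  # Pi^j
        self.eps = nn.Linear(1, c_text)      # quantity q -> embedding epsilon_q
        self.phi = nn.Linear(c_text, c_vis)  # coupling function Phi^j

    def forward(self, q):
        """q: (K,) quantity hypotheses; q[0] factual, the rest counterfactual."""
        eps_q = self.eps(q.view(-1, 1))                           # (K, c_text)
        # hat{Pi}^j_k = Pi^j + epsilon_{q_k} 1^T: one shift per hypothesis,
        # broadcast across all P prompt tokens.
        text_prompts = self.pi.unsqueeze(0) + eps_q.unsqueeze(1)  # (K, P, c_text)
        # Psi^j_k = Phi^j(hat{Pi}^j_k): projecting language prompts into the
        # visual stream opens a language -> vision gradient pathway.
        vis_prompts = self.phi(text_prompts)                      # (K, P, c_vis)
        return text_prompts, vis_prompts

# Example: true count 12 plus two counterfactual hypotheses.
layer = SPSLayer()
text_p, vis_p = layer(torch.tensor([12.0, 6.0, 24.0]))
```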

Loss & Training

  • Density map MSE + encoder ranking loss (\(\lambda_1=0.1\)) + decoder counting loss (\(\lambda_2=0.05\)); a hedged sketch of the combined loss follows this list
  • During training, \(K\) quantity hypotheses (factual + counterfactual) are generated per image with independent forward passes sharing encoder parameters
  • At inference, no quantity information is needed — only category text input
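
The sketch below assembles \(\mathcal{L}_{MQA}\) as described above. The margin-based hinge is our assumed form of the ranking loss; the paper's exact formulation, including the "closer quantities score higher" ordering term, may differ.

```python
import torch
import torch.nn.functional as F

def mqa_loss(D, D_gt, alpha, q, lam1=0.1, lam2=0.05, margin=0.1):
    """D: (K, 1, H, W) density maps, D[0] from the factual prompt;
    D_gt: (1, H, W) ground-truth density map;
    alpha: (K,) global image-text similarity per quantity hypothesis;
    q: (K,) hypothesized quantities, q[0] the true count."""
    # Main term: MSE between the factual density map and the ground truth.
    l_density = F.mse_loss(D[0], D_gt)
    # Encoder level: ranking hinge enforcing alpha_0 > alpha_i for every
    # counterfactual hypothesis i (assumed form of L^qty_enc).
    l_enc = F.relu(margin + alpha[1:] - alpha[0]).mean()
    # Decoder level: each density map should integrate to its own
    # hypothesized quantity (auxiliary counting constraint L^qty_dec).
    l_dec = F.mse_loss(D.sum(dim=(1, 2, 3)), q)
    return l_density + lam1 * l_enc + lam2 * l_dec
```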

Key Experimental Results

Main Results

FSC-147 Zero-Shot Counting

| Method | Backbone | Val MAE↓ | Val RMSE↓ | Test MAE↓ | Test RMSE↓ |
|---|---|---|---|---|---|
| CounTX | ViT-B/16 | 17.76 | 65.21 | 16.70 | 105.21 |
| VLCounter | ViT-B/16 | 18.06 | 65.13 | 17.05 | 106.16 |
| T2ICount | SD-v1.5 | 13.78 | 58.78 | 11.76 | 97.86 |
| CountGD | GDINO-Swin-B | 12.14 | 47.51 | 14.76 | 120.42 |
| QICA | ViT-B/16 | 13.82 | 60.24 | 13.05 | 104.17 |
| QICA† | ViT-L/14 | 12.98 | 56.35 | 12.41 | - |

Ablation Study

| Configuration | Val MAE | Test MAE | Note |
|---|---|---|---|
| Baseline (CLIP + Conv decoder) | ~18 | ~17 | No quantity awareness |
| + SPS (text prompts only) | ~16 | ~15 | Limited improvement with unimodal prompts |
| + SPS (synergistic prompts) | ~15 | ~14 | Significant boost from bimodal coupling |
| + CAD | ~14 | ~13.5 | Spatial aggregation adds further gains |
| + \(\mathcal{L}_{MQA}\) (full model) | 13.82 | 13.05 | Multi-level constraints yield final results |

Key Findings

  • QICA significantly outperforms all zero-shot methods with the same backbone (ViT-B/16): CounTX MAE 16.70 → QICA 13.05
  • Cross-domain generalization tests (CARPK, ShanghaiTech-A) surpass all baselines, confirming no overfitting
  • The coupling function in SPS improves MAE by ~1.5 over independent prompting, demonstrating bimodal synergy is crucial
  • CAD reduces MAE by ~1–2 points compared to feature-space decoding while preserving zero-shot capability
  • Quantity ranking loss contributes more significantly in dense scenes (high object counts)

Highlights & Insights

  • Elegant train-inference consistency design: During training, the full quantity-aware embedding \(\mathbf{T}^{\text{full}}\) produces the category embedding \(\mathbf{T}^{\text{cat}}\) via projection; at inference, semantically equivalent category embeddings are naturally produced — quantity knowledge is "distilled" into the visual encoder's implicit representations
  • Operating on similarity maps rather than feature space: This design choice is critical: CAD consumes a scalar field (the similarity map), leaving the pre-trained feature space intact and sidestepping the longstanding adaptation-vs-generalization tension in VLM fine-tuning
  • Factual/counterfactual quantity prompts: Teaching quantity awareness through contrastive learning with "correct quantity" vs "wrong quantity" is more discriminative than simply adding quantity labels

Limitations & Future Work

  • Still depends on pre-trained VLM category recognition capabilities; may fail for extremely rare objects unseen by the VLM
  • Multiple forward passes for the \(K\) quantity hypotheses increase training overhead
  • Density map MSE loss may be imbalanced in extremely sparse or extremely dense scenes
  • The quantity awareness mechanism could be extended to open-set detection/segmentation tasks

Comparison with Related Methods

  • vs CLIP-Count/CounTX: They freeze encoders to avoid overfitting but thereby sacrifice adaptation capability; QICA resolves this dilemma by running CAD on similarity maps instead of features
  • vs CountGD: CountGD uses a larger GDINO backbone + visual exemplars; QICA achieves comparable performance using only text + ViT-B
  • vs T2ICount: Stable Diffusion-based methods have stronger generative priors but higher inference cost and different backbones make fair comparison difficult

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of quantity-aware prompts and cost aggregation decoder is creative, with elegant train-inference consistency design
  • Experimental Thoroughness: ⭐⭐⭐⭐ FSC-147 + cross-domain validation + rich ablations
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation, detailed method description
  • Value: ⭐⭐⭐⭐ Practical contribution to zero-shot counting