# Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting
Conference: CVPR 2026
arXiv: 2603.16129
Code: Coming soon
Area: Computer Vision (Vision-Language)
Keywords: zero-shot counting, vision-language model, prompt tuning, cost aggregation, quantity awareness
## TL;DR
QICA tackles the missing quantity awareness and spatial insensitivity of zero-shot object counting. A quantity-conditioned Synergistic Prompting Strategy (SPS) jointly adapts the vision and language encoders, while a Cost Aggregation Decoder (CAD) operates on similarity maps rather than features to preserve zero-shot transferability. The result is zero-shot SOTA on FSC-147 (test MAE 12.41 with ViT-L/14) with strong cross-domain generalization.
## Background & Motivation
Background: Zero-shot object counting (ZSOC) aims to enumerate arbitrary-category objects using only text descriptions. Mainstream methods leverage VLMs such as CLIP to compute vision-text similarity maps, then use CNN/Transformer decoders to predict density maps.
Limitations of Prior Work: (1) Lack of quantity awareness — text prompts specify only categories without quantity information; models excel at recognizing "what" but fail to understand "how many," especially in dense scenes. (2) Spatial insensitivity and feature-space distortion — directly fine-tuning VLM encoders leads to severe overfitting to training categories, corrupting the pre-trained feature space and harming zero-shot generalization.
Key Challenge: Accurate counting requires adapting encoders to learn quantity-sensitive features, but fine-tuning corrupts zero-shot generalization — creating an adaptation-vs-generalization dilemma.
Goal: (1) Enable fine-grained quantity discrimination; (2) Achieve effective adaptation without corrupting the pre-trained feature space.
Key Insight: Introduce quantity-conditioned prompts for encoders to implicitly learn quantity information, while operating on similarity maps (rather than feature space) to avoid feature distortion.
Core Idea: During training, factual/counterfactual quantity prompts teach the model to distinguish different quantities; at inference, only category prompts are used for zero-shot counting.
## Method

### Overall Architecture
Input: image \(I\) and text description \(T\); output: density map \(D\). The framework comprises three core modules: (1) SPS jointly adapts frozen CLIP visual and language encoders via quantity-conditioned learnable prompts; (2) CAD performs spatial aggregation on vision-text similarity maps with multi-scale upsampling to predict density maps; (3) \(\mathcal{L}_{MQA}\) enforces quantity consistency at both encoder and decoder levels. Key design: quantity-information pathways active during training are disabled at inference, retaining only category semantic pathways.
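To make the data flow concrete, here is a minimal sketch of the core pipeline with illustrative shapes; the tensors stand in for the outputs of the prompt-adapted CLIP encoders (all names and dimensions are ours, not the paper's code).

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch B, an H x W patch grid, joint embedding dim d.
B, H, W, d = 2, 24, 24, 512

V = torch.randn(B, H * W, d)   # dense visual features (encoder stand-in)
T_cat = torch.randn(B, d)      # category text embedding (encoder stand-in)

# Per-position cosine similarity between visual features and the category
# embedding -- the scalar field that CAD decodes into a density map.
S = torch.einsum('bnd,bd->bn',
                 F.normalize(V, dim=-1),
                 F.normalize(T_cat, dim=-1))
S = S.view(B, 1, H, W)         # similarity map fed to the decoder
```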
### Key Designs
- **Synergistic Prompting Strategy (SPS)**
- Function: Jointly adapts visual and language encoders via quantity-conditioned learnable prompts
- Mechanism: Maps quantity value \(q\) to a continuous embedding \(\epsilon_q\), added to per-layer learnable prompts \(\Pi^j\) to generate quantity-aware text prompts \(\hat{\Pi}^j_k = \Pi^j + \epsilon_{q_k}\mathbf{1}^T\). A coupling function \(\Phi^j\) projects language prompts to visual prompts \(\Psi^j_k = \Phi^j(\hat{\Pi}^j_k)\), enabling bidirectional gradient flow. During training, both factual (true quantity) and counterfactual (deviated quantity) prompts are generated per image (see the prompt-construction sketch after this list)
- Design Motivation: Independent unimodal prompts cannot achieve cross-modal quantity-aware coordination. The coupling function establishes a direct language→vision gradient pathway, enabling both encoders to jointly adapt toward quantity awareness
- **Cost Aggregation Decoder (CAD)**
- Function: Operates directly on vision-text cosine similarity maps to produce density maps via spatial aggregation
- Mechanism: Computes per-position cosine similarity between dense visual features \(\mathbf{V}\) and category text embeddings \(\mathbf{T}^{\text{cat}}_k\) to obtain similarity maps \(\mathbf{S}_k\); then applies embedding layer → Swin Transformer spatial aggregation → multi-scale upsampling (with skip connections and similarity gating) → prediction head for the final density map (a simplified decoder stand-in follows this list)
- Design Motivation: Operating in feature space corrupts the pre-trained manifold and causes overfitting, while aggregating on similarity maps (a scalar field) preserves embedding space integrity, enabling encoder fine-tuning without harming generalization
- **Multi-level Quantity Alignment Loss (\(\mathcal{L}_{MQA}\))**
- Function: Enforces quantity consistency constraints at both encoder and decoder levels
- Mechanism: Encoder level uses ranking loss to ensure the true quantity hypothesis has the highest global similarity \(\alpha_0 > \alpha_i\), with closer quantities yielding higher similarity; decoder level uses an auxiliary MSE loss to constrain density map integrals to match the corresponding quantity values. Total loss: \(\mathcal{L}_{MQA} = \|D^0 - D^{GT}\|_2^2 + \lambda_1 \mathcal{L}^{qty}_{enc} + \lambda_2 \mathcal{L}^{qty}_{dec}\) (see the loss sketch under Loss & Training)
- Design Motivation: Decoder-only supervision is insufficient; encoder-level ranking constraints implicitly encode quantity information in the feature space during training, enabling accurate counting at inference without quantity prompts
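Below, a minimal sketch of the SPS prompt construction; the prompt length, dimensions, and the use of plain linear layers for the quantity embedding and the coupling functions \(\Phi^j\) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# Illustrative sizes: prompt length L_p, embedding dim d, prompt depth n_layers.
L_p, d, n_layers = 8, 512, 12

pi = nn.ParameterList(                   # learnable language prompts Pi^j
    [nn.Parameter(torch.randn(L_p, d) * 0.02) for _ in range(n_layers)])
quantity_embed = nn.Linear(1, d)         # q -> epsilon_q (assumed linear)
couple = nn.ModuleList(                  # coupling functions Phi^j (assumed linear)
    [nn.Linear(d, d) for _ in range(n_layers)])

def sps_prompts(q: float):
    """Quantity-aware text prompts and coupled visual prompts, per layer."""
    eps_q = quantity_embed(torch.tensor([[q]]))       # (1, d)
    text_prompts, visual_prompts = [], []
    for j in range(n_layers):
        pi_hat = pi[j] + eps_q                        # Pi^j + eps_q * 1^T
        text_prompts.append(pi_hat)                   # feeds the text encoder
        visual_prompts.append(couple[j](pi_hat))      # Psi^j = Phi^j(pi_hat)
    return text_prompts, visual_prompts

# One factual and two counterfactual hypotheses for an image with 37 objects:
prompt_sets = [sps_prompts(float(q)) for q in (37, 21, 55)]
```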
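And a self-contained stand-in for CAD. The paper's decoder uses Swin Transformer blocks plus skip connections and similarity gating; plain convolutions replace them here purely to keep the sketch short.

```python
import torch
import torch.nn as nn

class TinyCAD(nn.Module):
    """Conv stand-in for CAD: embed the scalar similarity field, aggregate
    spatially (Swin blocks in the paper), upsample, and predict density."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Conv2d(1, dim, 3, padding=1)
        self.aggregate = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear'),
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=2, mode='bilinear'),
            nn.Conv2d(dim // 2, dim // 4, 3, padding=1), nn.GELU())
        self.head = nn.Conv2d(dim // 4, 1, 1)

    def forward(self, S):                         # S: (B, 1, H, W)
        x = self.aggregate(self.embed(S))
        return torch.relu(self.head(self.up(x)))  # non-negative density map

S = torch.randn(2, 1, 24, 24)                     # similarity map, as above
D = TinyCAD()(S)                                  # (2, 1, 96, 96)
print(D.sum(dim=(1, 2, 3)))                       # predicted counts
```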
### Loss & Training
- Density map MSE + encoder ranking loss (\(\lambda_1=0.1\)) + decoder counting loss (\(\lambda_2=0.05\)); see the loss sketch below
- During training, \(K\) quantity hypotheses (factual + counterfactual) are generated per image and run as independent forward passes sharing encoder parameters (see the sampling sketch below)
- At inference, no quantity information is needed — only category text input
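A hedged sketch of \(\mathcal{L}_{MQA}\) with the weights above. The margin value is an assumption, and the encoder term is reduced to a single factual-vs-counterfactual margin (the paper additionally requires closer quantities to score higher).

```python
import torch
import torch.nn.functional as F

def mqa_loss(D_pred, D_gt, alpha, q, lam1=0.1, lam2=0.05, margin=0.1):
    """
    D_pred: (K, 1, H, W) density maps, one per hypothesis (factual first)
    D_gt:   (H, W)       ground-truth density map
    alpha:  (K,)         global vision-text similarity per hypothesis
    q:      (K,)         quantity value of each hypothesis
    """
    density = F.mse_loss(D_pred[0, 0], D_gt)                    # ||D^0 - D^GT||^2
    rank = torch.relu(margin - (alpha[0] - alpha[1:])).mean()   # alpha_0 > alpha_i
    dec = ((D_pred.sum(dim=(1, 2, 3)) - q) ** 2).mean()         # integral ~= q_k
    return density + lam1 * rank + lam2 * dec

# Toy usage with random tensors:
K, H, W = 3, 96, 96
loss = mqa_loss(torch.rand(K, 1, H, W), torch.rand(H, W),
                torch.randn(K), torch.tensor([37., 21., 55.]))
```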
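Hypothesis generation could look like the following; the deviation range is purely illustrative, as the note does not give the paper's exact counterfactual sampling scheme.

```python
import random

def sample_hypotheses(true_count: int, K: int = 4) -> list[int]:
    """Factual quantity first, then K-1 counterfactual (deviated) quantities;
    each hypothesis gets its own forward pass with shared encoder weights."""
    qs = [true_count]
    for i in range(1, K):
        factor = random.uniform(1.3, 2.0)          # illustrative deviation range
        q = true_count * factor if i % 2 else true_count / factor
        qs.append(max(1, round(q)))                # alternate over-/under-counts
    return qs

print(sample_hypotheses(37))   # e.g. [37, 55, 21, 64]
```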
## Key Experimental Results

### Main Results

**FSC-147 Zero-Shot Counting**
| Method | Backbone | Val MAE↓ | Val RMSE↓ | Test MAE↓ | Test RMSE↓ |
|---|---|---|---|---|---|
| CounTX | ViT-B/16 | 17.76 | 65.21 | 16.70 | 105.21 |
| VLCounter | ViT-B/16 | 18.06 | 65.13 | 17.05 | 106.16 |
| T2ICount | SD-v1.5 | 13.78 | 58.78 | 11.76 | 97.86 |
| CountGD | GDINO-Swin-B | 12.14 | 47.51 | 14.76 | 120.42 |
| QICA | ViT-B/16 | 13.82 | 60.24 | 13.05 | 104.17 |
| QICA† | ViT-L/14 | 12.98 | 56.35 | 12.41 | - |
### Ablation Study
| Configuration | Val MAE↓ | Test MAE↓ | Note |
|---|---|---|---|
| Baseline (CLIP + Conv decoder) | ~18 | ~17 | No quantity awareness |
| + SPS (text prompts only) | ~16 | ~15 | Limited improvement from unimodal prompts |
| + SPS (synergistic prompts) | ~15 | ~14 | Significant boost from bimodal coupling |
| + CAD | ~14 | ~13.5 | Spatial aggregation brings further gains |
| + \(\mathcal{L}_{MQA}\) (full model) | 13.82 | 13.05 | Multi-level constraints give the final numbers |
### Key Findings
- QICA significantly outperforms all zero-shot methods with the same ViT-B/16 backbone (test MAE: CounTX 16.70 → QICA 13.05)
- Cross-domain generalization tests (CARPK, ShanghaiTech-A) surpass all baselines, confirming no overfitting
- The coupling function in SPS improves MAE by ~1.5 over independent prompting, showing that bimodal synergy is crucial
- CAD reduces MAE by ~1–2 points compared to feature-space decoding while preserving zero-shot capability
- Quantity ranking loss contributes more significantly in dense scenes (high object counts)
## Highlights & Insights
- Elegant train-inference consistency design: During training, the full quantity-aware embedding \(T^{full}\) produces category embedding \(T^{cat}\) via projection; at inference, semantically equivalent category embeddings are naturally produced — quantity knowledge is "distilled" into the visual encoder's implicit representations
- Operating on similarity maps rather than feature space: this design choice is critical. CAD decodes a scalar field (the similarity map), leaving the pre-trained feature space untouched and sidestepping the adaptation-vs-generalization dilemma of VLM fine-tuning
- Factual/counterfactual quantity prompts: Teaching quantity awareness through contrastive learning with "correct quantity" vs "wrong quantity" is more discriminative than simply adding quantity labels
## Limitations & Future Work
- Still depends on pre-trained VLM category recognition capabilities; may fail for extremely rare objects unseen by the VLM
- Multiple forward passes for K quantity hypotheses increase training overhead
- Density map MSE loss may be imbalanced in extremely sparse or extremely dense scenes
- The quantity awareness mechanism could be extended to open-set detection/segmentation tasks
## Related Work & Insights
- vs CLIP-Count/CounTX: They freeze encoders to avoid overfitting but also sacrifice adaptation capability; QICA resolves this dilemma by operating CAD on similarity maps
- vs CountGD: CountGD uses a larger GDINO backbone + visual exemplars; QICA achieves comparable performance using only text + ViT-B
- vs T2ICount: Stable Diffusion-based methods have stronger generative priors, but their higher inference cost and different backbone make fair comparison difficult
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of quantity-aware prompts and cost aggregation decoder is creative, with elegant train-inference consistency design
- Experimental Thoroughness: ⭐⭐⭐⭐ FSC-147 + cross-domain validation + rich ablations
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, detailed method description
- Value: ⭐⭐⭐⭐ Practical contribution to zero-shot counting