Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions¶
Conference: CVPR 2026 | arXiv: 2511.17722 | Code: GitHub | Area: Multimodal/VLM | Keywords: VLM, counting ability, attention mechanism, synthetic benchmark, visual attention intervention
TL;DR¶
This paper constructs a synthetic counting benchmark dataset, systematically evaluates the counting capabilities of open-source VLMs under varying image and prompt conditions, and investigates mechanisms for improving counting behavior through visual attention reweighting at the decoder level.
Background & Motivation¶
Background: VLMs have been widely adopted for tasks such as visual question answering, yet they perform poorly on precise enumeration, falling far short of dedicated counting methods (e.g., PseCo, CountGD, CrowdDiff).
Limitations of Prior Work: Most existing evaluations rely on natural image datasets where variables are highly entangled (occlusion, texture, density, etc.), making it difficult to isolate specific failure factors. Systematic diagnostic frameworks for analyzing the root causes of counting failures are largely absent.
Key Challenge: VLMs acquire strong prior biases during training and tend to rely on memorized patterns rather than object-by-object analysis when confronted with counting tasks that require precise visual attention. This closely parallels the enumeration limits and cognitive load effects observed in human cognition.
Goal: To construct a controlled synthetic benchmark that precisely isolates contributing factors by varying image and prompt attributes one at a time, and to explore whether attention interventions can improve counting performance.
Key Insight: The work is approached from two complementary perspectives: cognitive science (cognitive load theory) and model interpretability (attention analysis).
Core Idea: A controllable synthetic data framework with strict variable control, combined with an interpretable diagnostic framework based on attention reweighting interventions.
Method¶
Overall Architecture¶
The framework consists of three components: (1) a synthetic data generation pipeline with single-variable control; (2) a multi-dimensional counting performance evaluation system; and (3) visual attention reweighting intervention experiments.
Key Designs¶
- Synthetic Evaluation Dataset: Using 512×512 white-background images with black circles as the baseline, the following attributes are varied one at a time: object count (0–50 in steps of 10), object color/shape/texture, and background color/texture. Multiple dataset groups are generated, each modifying only one attribute while holding all others constant, enabling strictly controlled experiments. Design Motivation: Natural-image benchmarks entangle multiple factors, preventing precise attribution of failure patterns.
- Prompt Specificity Ladder: Five prompt levels (P1–P5) are designed, ranging from the most generic ("count the number of objects") to the most detailed ("count the number of Z-shaped objects with X texture and Y color"). Design Motivation: To disentangle the effect of linguistic complexity on counting, and to test whether VLMs can leverage richer descriptive information to improve performance.
- Five Visual Attention Reweighting Strategies: Applied to visual-token attention weights within the language decoder:
  - Amplify: \(\tilde{A}_{h,i,j} = \alpha \cdot A_{h,i,j}\) (\(\alpha=2.0\)), enhancing visual attention
  - Suppress: \(\tilde{A}_{h,i,j} = \beta \cdot A_{h,i,j}\) (\(\beta=0.5\)), reducing visual attention
  - Focus: Non-visual token attention is set to \(\epsilon=10^{-10}\), forcing nearly all attention onto visual tokens
  - Balance: Attention is rescaled toward a target visual-attention ratio \(r_v^{target}=0.4\)
  - Visual Mask Amplify: Using SAM segmentation masks, object regions are amplified with \(\alpha_{obj}=2.0\) and background regions suppressed with \(\alpha_{bg}=0.5\)

  Design Motivation: VLMs exhibit "visual attention sinks," where a disproportionate amount of attention concentrates on visual tokens irrelevant to the query; redistributing attention may improve counting.
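A minimal sketch of four of the reweighting strategies, operating on a single attention row (the Visual Mask Amplify variant is omitted since it needs SAM masks). The function name and signature are illustrative, not the paper's code; the constants match the values reported above, and in a real decoder the operation would be applied inside each attention layer per head.

```python
import numpy as np

def reweight_attention(A, visual_mask, strategy):
    """Apply one inference-time intervention to a single attention row
    (one head, one query position; weights sum to 1).
    Sketch only -- names and structure are assumptions, constants are
    the values reported in the paper."""
    A = A.astype(float).copy()
    if strategy == "amplify":            # alpha = 2.0 on visual tokens
        A[visual_mask] *= 2.0
    elif strategy == "suppress":         # beta = 0.5 on visual tokens
        A[visual_mask] *= 0.5
    elif strategy == "focus":            # non-visual attention -> epsilon
        A[~visual_mask] = 1e-10
    elif strategy == "balance":          # rescale visual mass toward r_v = 0.4
        r_v = A[visual_mask].sum()
        if 0.0 < r_v < 1.0:
            A[visual_mask] *= 0.4 / r_v
            A[~visual_mask] *= (1.0 - 0.4) / (1.0 - r_v)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return A / A.sum()                   # renormalize to a distribution

# Toy row over 6 tokens, the last 3 being visual tokens
A = np.array([0.3, 0.3, 0.2, 0.1, 0.05, 0.05])
vis = np.array([False, False, False, True, True, True])
print(reweight_attention(A, vis, "balance")[vis].sum())  # ≈ 0.4
```

Renormalizing after the scale step keeps the row a valid probability distribution, which is why Amplify and Suppress effectively shift mass between the visual and non-visual partitions rather than changing its total.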
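The baseline image condition (white background, non-overlapping black circles, exact ground-truth count) can likewise be sketched. This hypothetical generator only samples the circle geometry; the rendering backend and the single-attribute variations (color, shape, texture) are left out, and all names and parameters are illustrative rather than the paper's pipeline.

```python
import random

def make_count_scene(n_objects, size=512, radius=15, seed=0, max_tries=10000):
    """Sample n non-overlapping circle centers on a size x size canvas,
    mirroring the benchmark's baseline condition. Returns the center
    list; the ground-truth count is its length."""
    rng = random.Random(seed)
    centers = []
    tries = 0
    while len(centers) < n_objects and tries < max_tries:
        tries += 1
        x = rng.randint(radius, size - radius)
        y = rng.randint(radius, size - radius)
        # Reject overlapping placements so the true count is unambiguous
        if all((x - cx) ** 2 + (y - cy) ** 2 >= (2 * radius) ** 2
               for cx, cy in centers):
            centers.append((x, y))
    return centers

scene = make_count_scene(20)
print(len(scene))  # 20
```

Rejection sampling is cheap at these densities (20 circles of radius 15 on a 512×512 canvas), and the no-overlap constraint is what makes the ground-truth label exact, which natural-image benchmarks cannot guarantee.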
Loss & Training¶
This work involves no model training; all attention interventions are applied at inference time. Evaluation metrics:

- Accuracy: the proportion of predictions that exactly match the ground-truth count
- MRCE (Mean Relative Count Error): \(\text{MRCE} = \frac{1}{N}\sum_{i=1}^{N}\frac{|c_{pred}^{(i)} - c_{true}^{(i)}|}{c_{true}^{(i)}}\)
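Both metrics are a few lines each. One caveat the sketch makes explicit: MRCE is undefined when the true count is zero, and the paper does not state its zero-count convention, so this version simply skips such items (an assumption).

```python
def accuracy(preds, truths):
    """Exact-match counting accuracy."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def mrce(preds, truths):
    """Mean Relative Count Error: mean_i |c_pred_i - c_true_i| / c_true_i.
    Items with c_true = 0 are skipped -- one possible convention, since
    the relative error is undefined there."""
    pairs = [(p, t) for p, t in zip(preds, truths) if t > 0]
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)

print(accuracy([8, 12, 20], [10, 10, 20]))  # 1/3
print(mrce([8, 12, 20], [10, 10, 20]))      # (0.2 + 0.2 + 0.0) / 3 ≈ 0.1333
```

Note that the two metrics can disagree: a model that is always off by one scores zero accuracy but a low MRCE, which is why the tables above report both.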
Key Experimental Results¶
Main Results — Effect of Prompt Specificity¶
| Feature Category | Model | P1 Acc | Best Prompt Acc | MRCE Change |
|---|---|---|---|---|
| Background Texture | Qwen7b | 0.090 | P2: 0.168 (+0.078) | −0.433 |
| Background Texture | Kimi | 0.169 | P2: 0.264 (+0.095) | −0.355 |
| Object Texture | Qwen32b | 0.240 | P1 is best | P5: +0.172 (degraded) |
| Object Color | Qwen7b | 0.163 | P2: 0.212 (+0.049) | −0.115 |
Ablation Study — Effect of Visual Complexity¶
| Configuration | Key Metric | Description |
|---|---|---|
| Object count 0–9 | Highest accuracy | All models perform best in the low-count range |
| Object count 40–50 | Significant accuracy drop | Counting ability degrades systematically as count increases |
| Background texture — checkerboard | Elevated MRCE | High-frequency texture interferes with object detection |
| Background texture — diagonal stripes | Highest MRCE (Qwen32b: 0.308) | Directional texture creates confusion with object shape |
Key Findings¶
- Asymmetric effect of prompt specificity: Prompts that describe background features consistently improve performance (simplifying visual segmentation), whereas adding object-texture detail monotonically degrades accuracy (inducing a "cognitive load sink").
- Cognitive load effect: Under high-load P5 prompts, the model's attention to shape is suppressed by the processing of texture and color, as directly confirmed by attention heatmaps.
- Model scale does not imply robustness: Qwen32b performs worst on the object texture dimension (Acc drops from 0.240 to 0.132), demonstrating that greater scale does not translate to better counting ability.
- Attention reweighting yields limited but measurable gains: Mask-guided amplification improves MRCE in certain settings, though the overall improvement remains modest.
Highlights & Insights¶
- This is the first framework to systematically diagnose VLM counting capabilities from a cognitive science perspective, mapping human cognitive load theory onto VLM failure modes.
- A "P1-optimal phenomenon" is identified: the simplest, most generic prompt often yields the best results, as it bypasses the cognitive sinking induced by specific semantic cues.
- Cross-modal binding is identified as the fundamental cause of counting failures — a problem that natural image benchmarks cannot easily isolate.
- Qualitatively consistent trends are validated on the FSC-147 real-world counting benchmark, demonstrating that the findings are not artifacts of synthetic imagery.
Limitations & Future Work¶
- Attention interventions operate solely at inference time; attention guidance during training (e.g., attention-based loss terms) is not explored.
- Although the synthetic data affords controllability, the complexity gap with real-world scenes is substantial, and intervention effectiveness may diminish in naturalistic settings.
- Only three open-source VLMs are evaluated; closed-source models (GPT-4V, Gemini) are not analyzed.
- Large-scale counting scenarios with object counts exceeding 50 are not explored.
Related Work & Insights¶
- Vo et al. found strong prior biases in o3/Gemini 2.5 Pro, consistent with the findings of this paper.
- Kang et al.'s work on visual attention sinks directly inspired the attention intervention strategies adopted here.
- The paper's controllable diagnostic framework is generalizable to systematic testing of other visual reasoning capabilities in VLMs.
Rating¶
- Novelty: ⭐⭐⭐⭐ The diagnostic framework is novel, though the attention intervention strategies are relatively straightforward
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ The multi-model, multi-dimensional, multi-level systematic evaluation is highly comprehensive
- Writing Quality: ⭐⭐⭐⭐ Well-structured with apt analogies to cognitive science
- Value: ⭐⭐⭐⭐ Provides important diagnostic tools and mechanistic explanations for understanding VLM counting failures