# CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Conference: ICCV 2025 | arXiv: 2504.15485 | Code: https://github.com/atinpothiraj/CAPTURe | Area: Multimodal VLM | Keywords: VLM evaluation, spatial reasoning, amodal completion, occlusion, counting benchmark
## TL;DR
This paper introduces CAPTURe, a benchmark that evaluates spatial reasoning and world-model construction in VLMs by requiring amodal counting of regularly arranged objects under occlusion. Even the strongest model tested, GPT-4o, incurs a 14.75% counting error (sMAPE) on occluded real images, while humans perform nearly perfectly.
## Background & Motivation
- Background: VLMs have achieved remarkable progress across various visual reasoning tasks, yet whether they can understand occluded scenes and infer invisible objects as humans do remains an open question.
- Limitations of Prior Work: (1) Existing VLM evaluations overlook occlusion reasoning; (2) amodal completion is typically assessed via pixel-level prediction, which is incompatible with text-output VLMs; (3) objective and quantifiable metrics for occlusion reasoning are lacking.
- Key Challenge: The human visual system effortlessly infers and counts objects behind occluders, but whether VLMs possess analogous world model capabilities is unknown.
- Goal: Design an objective and quantifiable benchmark for evaluating VLM occlusion reasoning.
- Key Insight: Leverage regular spatial arrangements (e.g., grids, circles) so that counting under occlusion yields a uniquely determined answer, enabling counting accuracy as the evaluation metric.
- Core Idea: Pattern + Occlusion + Counting = measurable world model evaluation, simultaneously probing VLMs along three dimensions: pattern recognition, spatial reasoning, and counting.
## Method

### Overall Architecture
CAPTURe is an evaluation benchmark rather than a methodological contribution. It comprises two subsets: CAPTURe\(^{\text{real}}\) (924 real images across 92 object categories) and CAPTURe\(^{\text{synthetic}}\) (1,250 synthetic images with controlled variables). Each image contains regularly arranged objects, with a portion occluded by a black rectangle; VLMs must infer the hidden objects and report the total count.
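To make the task format concrete, below is a minimal sketch of how a CAPTURe-style query might be posed to GPT-4o through the OpenAI chat API. The prompt wording and the `ask_count` helper are illustrative assumptions, not the paper's exact evaluation harness.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()

def ask_count(image_path: str, object_name: str) -> str:
    """Pose a CAPTURe-style amodal counting question about one image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Some {object_name} are hidden behind the black box. "
                         f"Assuming the arrangement pattern continues, how many "
                         f"{object_name} are there in total, including the "
                         "occluded ones? Answer with a single number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```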
### Key Designs
- CAPTURe\(^{\text{real}}\) Dataset:
  - Function: Provides amodal counting evaluation in real-world scenes.
  - Mechanism: Images with regularly arranged objects are filtered from the FSC-147 dataset via GPT-4o pre-screening followed by manual verification, yielding 924 images. Black occlusion blocks are applied manually. Both occluded and non-occluded versions are retained for comparison.
  - Design Motivation: Evaluate VLMs in natural scenes across 92 object types, with an average of 61.45 objects per image and 13.97 occluded objects.
- CAPTURe\(^{\text{synthetic}}\) Dataset:
  - Function: Provides fully controlled, diagnostic evaluation.
  - Mechanism: Synthetic images of simple shapes (dots, squares) arranged in various patterns are generated, with systematic variation in object count (5–15), arrangement shape (rectangle/circle/triangle), position (5 types), and color (5 types). A generation sketch follows this list.
  - Design Motivation: Eliminate confounding factors such as background clutter and texture variation to precisely identify failure modes in VLMs.
- Auxiliary Information Experiments (Oracle & Prediction):
  - Function: Diagnose the sources of VLM errors by providing additional information.
  - Mechanism: (1) All-object coordinate oracle: provides coordinates of all objects, requiring only textual counting; (2) Visible-object coordinate oracle: provides coordinates of visible objects only, still requiring inference of the occluded ones; (3) Inpainting pipeline: occluded regions are restored using FLUX.1-Fill before the image is passed to the VLM (see the inpainting sketch below).
  - Design Motivation: Decouple "visual counting ability" from "world model / occlusion reasoning ability" to identify the root cause of errors.
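As referenced in the CAPTURe\(^{\text{synthetic}}\) item above, here is a minimal sketch of how such a controlled image could be produced with matplotlib. The grid size, marker style, and occluder geometry are illustrative assumptions, not the paper's generation code.

```python
import matplotlib.pyplot as plt
import numpy as np

def make_synthetic(rows=4, cols=5, occluder=((2.5, -0.5), (4.5, 1.5)),
                   color="red", path="capture_synth.png"):
    """Draw a regular grid of dots and cover part of it with a black box."""
    xs, ys = np.meshgrid(np.arange(cols), np.arange(rows))
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.scatter(xs.ravel(), ys.ravel(), s=200, c=color)
    # Black occluder rectangle, given as lower-left and upper-right corners.
    (x0, y0), (x1, y1) = occluder
    ax.add_patch(plt.Rectangle((x0, y0), x1 - x0, y1 - y0, color="black"))
    ax.set_xlim(-1, cols)
    ax.set_ylim(-1, rows)
    ax.set_axis_off()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

make_synthetic()  # 20 dots total; the 4 inside the box must be inferred
```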
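For the inpainting arm, a sketch of restoring the occluded region with FLUX.1-Fill via Hugging Face diffusers, assuming the black box's coordinates are known and used as the inpainting mask; the prompt text and mask coordinates here are illustrative, not the paper's pipeline settings.

```python
import torch
from PIL import Image, ImageDraw
from diffusers import FluxFillPipeline  # requires a recent diffusers release

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("occluded.png").convert("RGB")

# Mask out the known occluder region (white = repaint, black = keep).
mask = Image.new("L", image.size, 0)
ImageDraw.Draw(mask).rectangle([300, 200, 500, 400], fill=255)  # assumed box coords

restored = pipe(
    prompt="a regular grid of apples on a table",  # illustrative prompt
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=30.0,
).images[0]
restored.save("inpainted.png")  # this image is then passed to the VLM
```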
### Evaluation Metric
- Primary metric: sMAPE (Symmetric Mean Absolute Percentage Error), ranging from 0–100%; lower is better.
- \(\text{sMAPE} = 100 \cdot \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}\)
- Responses that fail to produce an answer are assigned the maximum error of 100%.
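A minimal reference implementation of the metric as defined above, including the maximum-error penalty for unparseable responses (representing failures as `None` is an assumption about the harness, not the paper's code):

```python
def smape(y_true: list[int], y_pred: list[int | None]) -> float:
    """Symmetric mean absolute percentage error, in [0, 100]; lower is better.

    A prediction of None (no parseable answer) incurs the maximum error.
    """
    errors = []
    for y, y_hat in zip(y_true, y_pred):
        if y_hat is None:
            errors.append(1.0)  # failed response -> maximum error of 100%
        else:
            denom = abs(y) + abs(y_hat)
            errors.append(abs(y - y_hat) / denom if denom else 0.0)
    return 100 * sum(errors) / len(errors)

print(smape([10, 12], [8, None]))  # (2/18 + 1) / 2 * 100 ≈ 55.56
```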
## Key Experimental Results

### Main Results
All values are sMAPE (lower is better); Δ = Occluded − Non-occluded, in percentage points.

| Model | CAPTURe\(^{\text{real}}\) No Occ. | CAPTURe\(^{\text{real}}\) Occ. | Δ | CAPTURe\(^{\text{syn}}\) No Occ. | CAPTURe\(^{\text{syn}}\) Occ. | Δ |
|---|---|---|---|---|---|---|
| GPT-4o | 13.34% | 14.75% | +1.41 | 5.90% | 9.71% | +3.81 |
| InternVL2 | 26.17% | 32.90% | +6.73 | 16.44% | 17.57% | +1.13 |
| Molmo | 25.90% | 32.49% | +6.59 | 8.40% | 17.73% | +9.33 |
| Qwen2VL | 18.96% | 29.33% | +10.37 | 6.63% | 11.74% | +5.11 |
| 6 VLM Avg. | 21.95% | 27.59% | +5.64 | 11.89% | 15.64% | +3.75 |
| Human | - | 3.79% | - | - | 0.92% | - |
### Ablation Study (Effect of Auxiliary Information on CAPTURe\(^{\text{real}}\) Occluded Set)
| Model | Orig. Occ. | +All Coords | +Visible Coords | +Inpainted |
|---|---|---|---|---|
| GPT-4o | 14.75% | 2.93% (−11.82) | 9.20% (−5.55) | 15.89% (+1.14) |
| InternVL2 | 32.90% | 17.48% (−15.42) | 25.13% (−7.77) | 31.12% (−1.78) |
| Qwen2VL | 29.33% | 9.62% (−19.71) | 17.70% (−11.63) | 22.64% (−6.69) |
| 3 VLM Avg. | 25.66% | 10.01% (−15.65) | 17.34% (−8.32) | 23.22% (−2.44) |
### Key Findings
- All VLMs exhibit substantial counting errors under both occluded and non-occluded conditions, with occlusion consistently degrading performance.
- Humans achieve very low error under occlusion (3.79% real / 0.92% synthetic), far below the VLM averages of 27.59% and 15.64%.
- Providing coordinates of all objects cuts error sharply (avg. −15.65 points), indicating that visual counting itself is a major bottleneck.
- Image inpainting offers limited benefit (avg. −2.44 points), suggesting that diffusion models do not serve as perfect world models either.
- Models recognize arrangement patterns reasonably well (accuracy >80%), but accuracy drops by approximately 11% under occlusion.
- Error increases with the number of occluded objects, while total object count has comparatively little effect.
- CountGD (a detection-based model) substantially outperforms VLMs in the non-occluded setting but cannot handle occlusion.
## Highlights & Insights
- The benchmark design is elegant: combining pattern + occlusion + counting converts world model construction into an objective, quantifiable metric.
- The experimental analysis is thorough: oracle experiments precisely disentangle errors arising from "visual counting" versus "occlusion reasoning."
- A fundamental limitation of VLMs is identified: counting objects in images is challenging even in the absence of occlusion.
- A hybrid VLM+CountGD system demonstrates that feeding specialized detector outputs to a VLM can improve performance.
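A hedged sketch of what such a hybrid could look like: a specialized counter supplies the visible-object count, and the VLM extrapolates over the occluder. `countgd_count_visible` is a hypothetical stand-in for a real CountGD inference call, and the prompt wording is illustrative, not the paper's.

```python
def countgd_count_visible(image_path: str) -> int:
    """Hypothetical wrapper around CountGD inference; returns the visible count."""
    raise NotImplementedError  # placeholder for an actual detector call

def hybrid_prompt(image_path: str, object_name: str) -> str:
    """Build a VLM prompt that injects the specialized detector's output."""
    visible = countgd_count_visible(image_path)
    return (
        f"A specialized counting model detected {visible} visible {object_name}. "
        f"Some {object_name} are hidden behind the black box. Based on the "
        "arrangement pattern, how many are there in total?"
    )
```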
## Limitations & Future Work
- Only 4–6 VLMs are evaluated; more recent models (e.g., GPT-4.5, Gemini) are not covered.
- Objects in CAPTURe\(^{\text{real}}\) are predominantly drawn from FSC-147, limiting data diversity.
- Answer extraction relies on Llama 3.1 8B; although verified to be 100% accurate, this adds pipeline complexity.
- Only regularly arranged objects are considered; irregular arrangements are not addressed.
- Occlusion blocks are restricted to rectangular shapes; irregular occluders are not studied.
- The benchmark focuses on diagnosing capability rather than proposing improvements.
## Related Work & Insights
- FSC-147: A dense counting dataset that serves as the image source for CAPTURe\(^{\text{real}}\).
- CountGD: A state-of-the-art detection-based counting method used as a reference baseline.
- SpartQA: A spatial reasoning VQA benchmark, though it only tests relationships among visible objects.
- FLUX.1-Fill: A diffusion-based inpainting model used to provide "predicted world model" assistance.
- Insight: VLM evaluation should devote greater attention to the "invisible" aspects of scenes — occlusion reasoning and commonsense inference — rather than focusing solely on the processing of visible information.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First work to use amodal counting as a probe for VLM spatial reasoning and world model capabilities; a distinctly original perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Systematic evaluation with multi-VLM comparisons, human baselines, oracle conditions, inpainting pipeline, and factor analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; experimental analysis proceeds in a well-structured, progressive manner with rich figures and tables.
- Value: ⭐⭐⭐⭐ Exposes fundamental deficiencies in VLMs regarding visual counting and occlusion reasoning, providing clear directions for future improvement.