ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Conference: ICML 2026
arXiv: 2604.24300
Code: Available (project page + GitHub + HuggingFace)
Area: Multimodal VLM / Evaluation Benchmark / Visual Spatial Intelligence
Keywords: VSI-Bench, spatial reasoning, frame budget, dummy video, hallucination

TL;DR

This work shows that the widely used VSI-Bench suffers from structural failures caused by 3D annotation drift and by frame sampling that differs between annotation and model input. The authors re-annotate 381 scenes and 5,365 objects, design frame-budget adaptive QA, and introduce a dummy video stress test that removes all frames containing the queried object. The result is ReVSI, a high-fidelity spatial intelligence benchmark. In evaluation, open-source VLMs drop by up to 40% on ReVSI and still show high hallucination rates on dummy videos, which indicates that VSI-Bench has systematically overestimated their spatial reasoning abilities.

Background & Motivation

Background: As VLMs expand towards embodied and 3D perception, VSI evaluation benchmarks such as VSI-Bench, SPAR-Bench, and VSI-SUPER have become mainstream. These benchmarks use 3D datasets like ScanNet/ARKitScenes to automatically generate QA for testing models on tasks such as object counting, relative direction, and room area. VLM training (e.g., SpatialVLM, Cambrian-S, SpaceR) is also optimized around these benchmarks.

Limitations of Prior Work: Through manual auditing, the authors identify two core defects. First, annotation-video drift: VSI-Bench GT is derived from point-cloud 3D reconstructions built for traditional 3D perception. Objects clearly visible in the raw video can be missing because the reconstruction is incomplete, object categories can be mislabeled (e.g., a cup labeled as a notebook), and room area is computed with a noisy Alpha Shape. As a result, many QA items are incorrect or semantically ambiguous given the video evidence; for example, among 565 Object Counting questions, 27% are wrong and 11% are ambiguous. Second, frame sampling mismatch: VLMs see only 16/32/64 frames, but VSI-Bench GT is labeled using all frames. Figure 3 shows that with 16 frames, GT correctness drops to 67%, so many questions are unsolvable given the model's actual input.

Key Challenge: The benchmark assumes "the model sees the entire scene = the annotator sees the entire scene," but modern VLMs' sparse-frame input breaks this assumption, making it impossible to distinguish whether a model's error is due to weak spatial reasoning or missing key evidence. Additionally, VSI-Bench's answer distribution is highly imbalanced (e.g., "2" accounts for 53% of Object Counting, distances 0–2m dominate), allowing models to score high using priors rather than visual evidence.

Goal: Retain the task paradigm of VSI-Bench while (i) ensuring that annotations strictly match the original video, (ii) ensuring that every QA item is answerable and correct under every frame budget, and (iii) providing controllable diagnostic tools that decouple "visual evidence" from "reasoning ability."

Key Insight: Rather than training another model, it is more effective to fix the evaluation—strictly aligning "what the benchmark intends to ask" with "what the model actually sees" gives the benchmark true diagnostic value.

Core Idea: By combining "video-aligned manual 3D re-annotation," "frame-budget adaptive QA," and "dummy video stress testing," the authors reconstruct the first input-consistent VSI benchmark, ReVSI.

Method

Overall Architecture

The ReVSI pipeline consists of three stages: (1) Using a custom 3D web annotation interface, the authors expand from the original VSI-Bench's 288 scenes and 65 categories to 381 scenes and 504 categories (open vocabulary) across ScanNetv2/ScanNet++/ARKitScenes/3RScan/MultiScan, redrawing 5,365 3D boxes; (2) For six task types (object counting, object size, absolute distance, room size, relative distance, relative direction—removing Object Appearance Order as it focuses more on temporal reasoning), QA is regenerated with stricter templates and manually verified; (3) For each video, GT is constructed under four frame budgets (16/32/64/all), and dummy videos are generated by removing all frames containing the queried object for visibility-guided control experiments.

Key Designs

Video-aligned Open Vocabulary 3D Re-annotation:
- Function: Replace 3D GT based on noisy reconstructed meshes with high-fidelity manual annotations anchored to the original video, increasing object annotations from 3,185 to 5,365 and categories from 65 to 504.
- Mechanism: Using a custom web annotator, the authors (3D domain experts) start from the original VSI-Bench annotations, filter mislabels, tighten 3D boxes, recover objects visible in video but missing in reconstruction, and extrapolate true physical sizes for geometrically damaged objects using adjacent frames. Open vocabulary labels (e.g., "Sony PlayStation," "Coca-Cola box") are manually written, with GPT-5.2 used only for verification. Room area is annotated by manually drawing polygons from a top-down view, discarding scenes with unclear boundaries, instead of using Alpha Shape.
- Design Motivation: The root problem of the old GT is "annotating meshes instead of video"; changing the annotation target to the video addresses the downstream issues at their source. Open vocabulary and finer categories prevent models from exploiting the 65-category prior for narrow guessing.
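
The room-area step above annotates a top-down polygon by hand. The source does not say how the area is then computed; one plausible way, assumed here, is the shoelace formula over the polygon's vertices. The function name and the L-shaped example room are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: Sequence[Point]) -> float:
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula.

    Vertices must be listed in order around the boundary, in metres.
    Assumption: this is an illustrative stand-in, not the authors' tool.
    """
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    # abs() makes the result independent of clockwise vs counter-clockwise order
    return abs(twice_area) / 2.0


# Hypothetical L-shaped room: a 5 m x 3 m block plus a 2 m x 3 m block.
l_shaped_room = [(0, 0), (5, 0), (5, 3), (2, 3), (2, 6), (0, 6)]
room_area_m2 = polygon_area(l_shaped_room)
```

For the example room, the two rectangular blocks sum to 21 m², which is what the function returns.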

Debiased & Manually Verified QA Regeneration:
- Function: Rewrite templates while retaining task definitions, specifically addressing VSI-Bench's answer distribution bias (e.g., always guessing "2" yields 62% accuracy).
- Mechanism: For Object Counting, reintroduce single-instance queries ("How many black office chairs") and add "sum of two categories" templates, and change "this room" to "the scene" to match multi-room videos. For Object Size, remove categories with nearly fixed sizes (e.g., toilet/bed) and sample out-of-distribution (OOD) examples for categories such as refrigerators. For Absolute Distance, remove <1m questions (answerable from single-frame 2D cues) and add long-distance pairs. For Relative Direction, require the positioning object's footprint to be ≤1 m² and the inter-object distance to be ≥1m, and add templates like "facing away from X." For Room Size, add "main room only" templates to mitigate multi-room ambiguity. Every question is manually verified.
- Design Motivation: The original benchmark's answer choices are overly concentrated, enabling models to mode-collapse to high-frequency answers; statistical debiasing and template diversification block this shortcut, ensuring the metric truly reflects spatial reasoning.
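
The numeric filters in the mechanism above (drop <1m Absolute Distance questions; for Relative Direction, require footprint ≤1 m² and distance ≥1m) can be sketched as a simple predicate. The `CandidatePair` record, its field names, and the task strings are hypothetical; how the distance is measured (centre-to-centre or nearest points) is not stated in the source and is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class CandidatePair:
    """Hypothetical record for one candidate question about two objects."""
    task: str                   # "absolute_distance" or "relative_direction"
    distance_m: float           # inter-object distance in metres
    anchor_footprint_m2: float  # floor footprint of the positioning object


# Thresholds taken from the description above.
MIN_ABS_DISTANCE_M = 1.0
MAX_ANCHOR_FOOTPRINT_M2 = 1.0
MIN_REL_DIR_DISTANCE_M = 1.0


def keep_candidate(c: CandidatePair) -> bool:
    """Return True if the candidate passes the numeric filters for its task."""
    if c.task == "absolute_distance":
        # Questions under 1 m are dropped.
        return c.distance_m >= MIN_ABS_DISTANCE_M
    if c.task == "relative_direction":
        # Small positioning object and sufficiently separated objects.
        return (c.anchor_footprint_m2 <= MAX_ANCHOR_FOOTPRINT_M2
                and c.distance_m >= MIN_REL_DIR_DISTANCE_M)
    # Other tasks have no numeric filter in this sketch.
    return True


candidates = [
    CandidatePair("absolute_distance", 0.6, 0.2),   # too close: dropped
    CandidatePair("absolute_distance", 3.4, 0.2),   # kept
    CandidatePair("relative_direction", 2.0, 1.8),  # anchor too large: dropped
    CandidatePair("relative_direction", 2.0, 0.5),  # kept
]
kept = [c for c in candidates if keep_candidate(c)]
```

Such a filter only narrows the candidate pool; per the source, every remaining question is still verified manually.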

Frame-budget Adaptive Evaluation & Dummy Video Control Experiments:
- Function: Construct GT for each of the four frame budgets (16/32/64/all), and generate dummy videos by removing all frames containing the queried object to stress-test whether models truly rely on visual evidence.
- Mechanism: Each sampled frame is rasterized using the scene's GT camera poses to determine object visibility automatically (an object counts as visible when it occupies >5% of the frame area); manual annotation is used when the object is not visible. Room Size and Route Planning are excluded under the 16-frame setting because the frames carry insufficient information. Dummy videos retain the scene context but remove all frames showing the target object, so the questions are "unanswerable" for humans while the GT remains deterministic (e.g., object counting must be 0; for object size, the frames are replaced with all-black frames). Metrics: multiple-choice questions (MCQ) use accuracy (Acc); numerical questions (NQ) use Mean Relative Accuracy \(\text{MRA}=\frac{1}{|C|}\sum_{\theta\in C}\mathbb{1}[|\hat y-y|/y<1-\theta]\) with \(C=\{0.5,0.55,\dots,0.95\}\), where \(\hat y\) is the prediction and \(y\) the ground truth.
- Design Motivation: Aligning "model input" with "benchmark evaluation target" is fundamental for trustworthy evaluation; dummy videos expose the implicit assumption of "seeing the object before answering"—if the model answers correctly without evidence, outputs are driven by prior rather than vision, which is the definition of hallucination.
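
The visibility test and the dummy video construction described above can be sketched as follows. This assumes the rasterization step has already produced, for each frame, the fraction of the image covered by the queried object; the function names and the `area_fraction` mapping are hypothetical. The source states the >5% visibility rule but not the exact removal rule for dummy videos, so `removal_threshold` is an assumption that defaults to removing any frame where the object projects at all.

```python
from typing import Dict, List


def visible_frames(area_fraction: Dict[int, float],
                   threshold: float = 0.05) -> List[int]:
    """Frames where the object covers more than `threshold` of the image."""
    return sorted(i for i, frac in area_fraction.items() if frac > threshold)


def dummy_video_frames(num_frames: int,
                       area_fraction: Dict[int, float],
                       removal_threshold: float = 0.0) -> List[int]:
    """Frame indices kept in the dummy video.

    A frame is kept only if the object's coverage is at or below
    `removal_threshold`. Frames absent from `area_fraction` are treated
    as not containing the object (assumption).
    """
    return [i for i in range(num_frames)
            if area_fraction.get(i, 0.0) <= removal_threshold]


# Hypothetical 8-frame clip: the object appears in frames 2..5.
area_fraction = {2: 0.02, 3: 0.12, 4: 0.30, 5: 0.04}
evidence = visible_frames(area_fraction)                     # frames 3 and 4
strict_dummy = dummy_video_frames(8, area_fraction)          # drops 2, 3, 4, 5
loose_dummy = dummy_video_frames(8, area_fraction, 0.05)     # drops 3 and 4 only
```

The strict default is the safer choice for a stress test, since a frame with 2% coverage could still leak evidence about the object.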

Loss & Training

ReVSI is an evaluation benchmark and does not train models; evaluation follows MRA (NQ) and Acc (MCQ).
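
A minimal sketch of the MRA metric, following the formula given in the Method section, with thresholds \(C=\{0.5,0.55,\dots,0.95\}\). The function names and the toy answer list are hypothetical. The formula divides by the ground truth, so it is undefined for \(y=0\); how such cases are scored is not stated in the source, and this sketch simply rejects them.

```python
from typing import Sequence

# C = {0.50, 0.55, ..., 0.95}, built from integers to avoid float drift.
THRESHOLDS = [k / 100 for k in range(50, 100, 5)]


def mean_relative_accuracy(pred: float, gt: float) -> float:
    """Fraction of thresholds theta for which |pred - gt| / gt < 1 - theta."""
    if gt <= 0:
        raise ValueError("MRA is undefined for non-positive ground truth")
    rel_err = abs(pred - gt) / gt
    return sum(rel_err < 1 - theta for theta in THRESHOLDS) / len(THRESHOLDS)


def constant_guess_score(guess: float, answers: Sequence[float]) -> float:
    """Average MRA obtained by always answering `guess`."""
    return sum(mean_relative_accuracy(guess, a) for a in answers) / len(answers)


# Toy answer distribution (not from the paper): "2" is half of the answers.
toy_answers = [2, 2, 2, 2, 3, 3, 1, 4]
score = constant_guess_score(2, toy_answers)
```

In this toy case the constant guess "2" scores about 0.6 even though "2" makes up only half of the answers, because a guess of 2 against a ground truth of 3 still earns partial credit (0.4). Partial credit of this kind may be why a constant guess can score above the raw share of its answer.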

Key Experimental Results

Main Results

Evaluation covers general VLMs (Qwen3-VL, InternVL-3.5, LLaVA-Video, GPT-5.2, Gemini 3) and 3D expert models (SpatialVLM, Cambrian-S, SpaceR, VLM-3R, Spatial-MLLM), with performance compared on both ReVSI and VSI-Bench.

Dataset Statistics	VSI-Bench	ReVSI
Number of Scenes	288	381
Number of Objects	3,185	5,365
Number of Categories	65	504
Open Vocabulary	✗	✓
Frame-budget Adaptive GT	✗ (all-frame only)	✓ (16/32/64/all)

Model Type	VSI-Bench Performance	ReVSI Performance	Conclusion
Closed-source VLMs (GPT-5.2, Gemini 3)	Seemingly lower than open-source	Significantly outperform open-source, especially in Object Counting	VSI-Bench systematically underestimates closed-source models
Open-source VLMs (Qwen3-VL, InternVL-3.5)	High	Up to 40% drop (Counting / Rel-Dist / Rel-Dir)	VSI-Bench overestimates open-source models
3D Fine-tuned Experts (SpaceR, 3D-R1)	Much higher than base	Sharply reduced gains, some subtasks worse than base	Fine-tuning gains are amplified by benchmark bias

Ablation Study

Diagnostic Setting	Key Findings
Object Counting, always guessing "2"	62% on VSI-Bench, <20% on ReVSI, confirming successful answer debiasing
Absolute Distance	Most models score higher on ReVSI. MRA is based on relative error, so it tolerates larger absolute errors at long distances; removing <1m samples therefore raises scores and highlights Qwen3-VL's long-distance strengths
Dummy Video Object Counting	Models like InternVL-3.5 still output mid-range counts instead of 0, i.e., a nonzero hallucination rate, indicating that outputs are driven by indoor priors rather than visual evidence
Object Size with all-black frames	Some expert models still hit "typical category sizes," revealing that size estimation relies heavily on category priors
Frame budget scan	GT correctness rises from 67% at 16 frames to 92% at 64 frames, demonstrating the necessity of frame-aware design
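
The dummy-video counting diagnostic in the table can be summarized with a simple rate. The sketch below assumes the hallucination rate is the fraction of dummy-video counting answers that are nonzero, since the GT count is 0 once all frames with the object are removed; the paper's exact definition may differ, and the function name and predictions are hypothetical.

```python
from typing import Sequence


def counting_hallucination_rate(predicted_counts: Sequence[int]) -> float:
    """Fraction of dummy-video counting answers that are not 0 (GT is 0)."""
    if not predicted_counts:
        raise ValueError("need at least one prediction")
    hallucinated = sum(1 for c in predicted_counts if c != 0)
    return hallucinated / len(predicted_counts)


# Hypothetical model outputs on five dummy-video counting questions.
dummy_predictions = [0, 3, 2, 0, 4]
rate = counting_hallucination_rate(dummy_predictions)  # 3 of 5 are nonzero
```

A model that relies on visual evidence should drive this rate toward 0, while a model answering from indoor priors keeps producing plausible nonzero counts.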

Key Findings

- The "open-source > closed-source" ranking on VSI-Bench is reversed on ReVSI, indicating that previous "expert model SOTA" claims are likely benchmark artifacts.
- 3D fine-tuned experts see sharply reduced gains on the cleaner ReVSI, and post-training data scale decouples from performance, suggesting that current 3D instruction tuning mainly overfits noisy GT.
- Dummy videos expose that several SOTA open-source VLMs are almost insensitive to the presence of visual evidence; this is the real bottleneck in spatial reasoning.
- Empirical frame sampling threshold: single-room scenes require at least 64 frames, and benchmarks should provide a separate GT for each frame budget.

Highlights & Insights

- An empirical example of "fixing evaluation is more important than fixing models": The authors' audit exposes a 27% error rate and an 11% ambiguity rate among VSI-Bench's Object Counting questions and overturns most SOTA claims, showing that evaluation hygiene is one of the highest-ROI directions in current spatial AI research.
- Dummy video protocol: Easily extensible to any video QA; the approach is to "automatically remove evidence frames per question and see if the model can still answer." This visibility-controlled stress test can systematically quantify hallucination and is highly transferable.
- Frame-budget-aware GT: For the first time, "GT is not a single value but a function \(\text{GT}(\text{frames})\)" is implemented at scale in a benchmark; future long-video benchmarks should follow suit.
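
The idea that GT is a function of the sampled frames can be illustrated for Object Counting. The sketch assumes evenly spaced sampling and counts an instance only if it is visible in at least one sampled frame; both the sampling rule and the names (`sample_frame_indices`, `counting_gt`, the toy scene) are hypothetical, not the authors' implementation.

```python
from typing import Dict, List, Optional, Set


def sample_frame_indices(num_frames: int, budget: Optional[int]) -> List[int]:
    """Evenly spaced frame indices; `None` means the model sees all frames."""
    if budget is not None and budget <= 0:
        raise ValueError("budget must be positive")
    if budget is None or budget >= num_frames:
        return list(range(num_frames))
    return [(i * num_frames) // budget for i in range(budget)]


def counting_gt(visible_in: Dict[str, Set[int]], sampled: List[int]) -> int:
    """Number of instances visible in at least one sampled frame."""
    sampled_set = set(sampled)
    return sum(1 for frames in visible_in.values() if frames & sampled_set)


# Toy 12-frame scene: frames in which each chair instance is visible.
visible_in = {
    "chair_1": {0, 1, 2},
    "chair_2": {4, 5},
    "chair_3": {10, 11},
}
gt_by_budget = {
    budget: counting_gt(visible_in, sample_frame_indices(12, budget))
    for budget in (4, 6, None)
}
```

With 4 sampled frames only one chair is ever seen, so the answerable count is 1, while 6 frames or the full video give 3. A single all-frame GT of 3 would mark a correct 4-frame answer as wrong.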

Limitations & Future Work

- Although large in scale, the re-annotation is still manual and hard to scale to in-the-wild videos; next steps could involve semi-automation (GPT-5.2 assistance + human spot checks).
- Object Appearance Order is removed to avoid temporal reasoning, so joint spatio-temporal understanding is not yet covered.
- Dummy video treats "no evidence → answer 0/unknown" as GT, which aligns with human intuition but does not fully match some models' "refuse to answer" behavior; future work could add confidence calibration metrics.
- ReVSI shares task definitions with VSI-Bench, so the new benchmark has not yet expanded to entirely new 3D reasoning tasks (e.g., multi-view registration, 6DoF manipulation).

Related Work & Insights

- vs VSI-Bench (Yang 2025a): Directly audited in this work; ReVSI systematically corrects the issues found and serves as a more reliable "successor."
- vs SPAR-Bench / VSI-SUPER: These benchmarks also suffer from GT drift and frame mismatch; ReVSI's "input-consistent" principle can be adopted by them.
- vs 3D fine-tuned VLMs (SpatialVLM, Cambrian-S, SpaceR): ReVSI reveals that these methods' gains sharply diminish on a clean benchmark, calling for rigorous evaluation before further training.
- Cross-task insights: The dummy video / visibility-controlled QA protocol can be extended to medical VQA (removing key anatomical regions to test diagnosis), robot perception (removing key frames), and similar settings.

Rating

Novelty: ⭐⭐⭐⭐ "Rebuilding + new protocol" for benchmark work; not a brand-new task but with unique ideas and broad impact.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 10+ models across open-source/closed-source/expert types, multi-frame budgets, and dummy video multidimensional diagnostics, with sufficient audit data.
Writing Quality: ⭐⭐⭐⭐⭐ Clear three-part logic: problem diagnosis → solution → empirical validation, with Figures 1/3/5 illustrating core issues.
Value: ⭐⭐⭐⭐⭐ Directly challenges the credibility of a widely cited benchmark, potentially shifting the entire VLM spatial reasoning research direction, with huge community impact.