Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Conference: NeurIPS 2025 | arXiv: 2506.22146 | Code: https://sharif-ml-lab.github.io/VISER/ | Area: Multimodal VLM | Keywords: Visual reasoning, binding problem, visual prompting, LVLM, spatial structure, cognitive science

TL;DR

This paper proposes VISER (Visual Input Structure for Enhanced Reasoning), which constructs spatial partitions by superimposing equidistant horizontal lines with numeric labels onto input images, combined with a "row-by-row scan" textual instruction. This approach converts the parallel visual processing of LVLMs into sequential region-by-region parsing. Without modifying the model, without training, and within a single query, VISER substantially mitigates the binding problem and improves performance on visual reasoning tasks including counting, visual search, scene description, and spatial relationship understanding.

Background & Motivation

Background: LVLMs (e.g., GPT-4o, Claude 3.5 Sonnet) have approached or surpassed human performance in language reasoning, yet consistently underperform on visual reasoning tasks—seemingly straightforward tasks such as counting, visual search, scene description, and spatial relationship understanding remain error-prone.

Limitations of Prior Work: The root cause of these visual reasoning failures can be attributed to the binding problem from cognitive science—the inability of models to reliably associate perceptual features (color, shape) with the correct visual objects. When multiple similar objects are present in a scene, features are easily confused across objects (illusory conjunctions), causing, for example, "red circle + green square" to be misperceived as "red square."

Key Challenge: Current LVLMs process visual features in parallel and lack a spatially-oriented serial attention mechanism. Attempts to guide reasoning through purely textual strategies (e.g., CoT prompting) are ineffective—once early tokens are generated based on entangled visual features, every subsequent reasoning step inherits the same binding errors. Remediation at the language level cannot fix problems at the visual encoding level.

Goal: To mitigate the binding problem in LVLMs and improve performance across multiple visual reasoning tasks through simple input-level interventions, without modifying model architecture or performing any training.

Key Insight: Grounded in cognitive science and neuroscience. The human visual system operates in two modes: fast but imprecise parallel processing (System 1) and more accurate serial attention (System 2). Humans overcome the binding problem through serial attention. Neuroscientific research further shows that grid frameworks enhance visual recognition memory—providing a theoretical basis for adding structured lines to visual inputs.

Core Idea: Drawing a few horizontal lines on the image and instructing the model to "scan row by row"—this is the visual analogue of Chain-of-Thought, injecting inductive bias directly into the visual input rather than the language prompt.

Method

Overall Architecture

VISER is a fully training-free, model-agnostic input augmentation method comprising two complementary components: (1) superimposing equidistant horizontal lines onto the input image as visual anchors; and (2) prepending sequential scanning instructions to the text prompt. The two components work in concert—the visual structure provides spatial partitioning, while the textual instruction guides the model to leverage these partitions for serial processing. The entire procedure is completed within a single query with virtually zero additional computational overhead.
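
To make the augmentation concrete, below is a minimal sketch (not the authors' released code) of how the two components could be produced with Pillow; the function names, line color and width, and label placement are illustrative assumptions.

```python
# Minimal sketch of VISER-style input augmentation; not the authors' code.
# Line color/width and label placement are illustrative assumptions.
from PIL import Image, ImageDraw

def add_horizontal_scaffold(image: Image.Image, n_lines: int = 3) -> Image.Image:
    """Superimpose n equidistant horizontal lines and label the n+1 regions 1..n+1."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for i in range(1, n_lines + 1):                      # equidistant visual anchors
        y = i * h // (n_lines + 1)
        draw.line([(0, y), (w, y)], fill="red", width=3)
    for r in range(n_lines + 1):                         # numeric label on the left of each region
        y_center = int((r + 0.5) * h / (n_lines + 1))
        draw.text((5, y_center), str(r + 1), fill="red")
    return img

# Scanning instruction as quoted in the paper.
SCAN_INSTRUCTION = ("Scan the image sequentially based on horizontal lines "
                    "exists in the image.")

def build_viser_input(image: Image.Image, task_prompt: str):
    """Return the augmented image and text used together in a single query."""
    return add_horizontal_scaffold(image), f"{SCAN_INSTRUCTION} {task_prompt}"
```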

Key Designs

  1. Visual Structuring — Horizontal Line Partitioning

    • Function: \(n\) equidistant horizontal lines (default \(n=3\)) are superimposed on the input image, dividing it into \(n+1\) horizontal regions, each labeled with a numeric index (1 to \(n+1\)) on the left side.
    • Mechanism: The horizontal lines serve as "visual anchors" that constrain the model to perform local attention processing within each region, reducing cross-object feature interference. Numeric labels provide explicit guidance on processing order.
    • Design Motivation: Inspired by neuroscientific research showing that grid frameworks enhance human visual recognition memory, and that humans reduce interference by iteratively detecting individual objects. The minimalist design (only a few lines) preserves image clarity while providing sufficient spatial guidance, avoiding content occlusion from dense grids.
  2. Sequential Scanning Text Prompt

    • Function: A fixed instruction is prepended to the original task prompt: "Scan the image sequentially based on horizontal lines exists in the image," directing the model to process regions sequentially from top to bottom.
    • Mechanism: This shifts the model's attention pattern from global-parallel to local-serial, exposing the model to fewer competing objects within each region and thereby producing cleaner local representations for downstream reasoning.
    • Design Motivation: Visual structure alone cannot guarantee that the model will exploit it—textual instructions are necessary to align the model's processing strategy with the visual scaffold. Experiments confirm that neither component alone is sufficient; using either component in isolation yields significantly weaker results than the complete method.
  3. Task-Adaptive Prompt Extension

    • Function: Task-specific auxiliary instructions are appended to the base scanning instruction for different tasks (counting, spatial relationships, visual search, etc.).
    • Mechanism: The base sequential scanning instruction provides a general serial processing strategy, while task-specific instructions further direct the model to attend to task-relevant information during scanning (e.g., accumulating target counts per region during counting tasks).
    • Design Motivation: This preserves the generality of the method (the same visual structure applies to all tasks) while enabling lightweight textual adaptation to optimize performance for individual tasks (a sketch of this prompt composition follows the list).
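
As referenced above, here is a minimal sketch of how the base scanning instruction and task-specific extensions might be composed; the auxiliary phrasings and the build_prompt helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative task-adaptive prompt composition; the auxiliary phrasings below
# are placeholders, not the paper's exact task-specific instructions.
BASE_SCAN = ("Scan the image sequentially based on horizontal lines "
             "exists in the image.")  # base instruction quoted in the paper

TASK_EXTENSIONS = {
    "counting": "Count the target objects region by region, then report the total.",
    "visual_search": "Check each region for the target and report the region where it appears.",
    "spatial_relations": "Note the region of each referenced object before comparing their positions.",
}

def build_prompt(task: str, question: str) -> str:
    """Prepend the scanning instruction and a task-specific extension to the question."""
    extension = TASK_EXTENSIONS.get(task, "")
    return " ".join(part for part in (BASE_SCAN, extension, question) if part)
```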

Loss & Training

VISER requires no training whatsoever and involves no loss functions, gradient updates, or modifications to model parameters. The method operates at inference time through input augmentation and is applicable to any black-box LVLM, including closed-source API models such as GPT-4o. No multi-turn queries or external tool calls are required; all augmentation is completed within a single forward pass.
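
For illustration, a hedged sketch of this single-query, black-box usage with the OpenAI Python SDK is shown below; the model name, encoding details, and the helper functions reused from the earlier sketches are assumptions rather than part of the paper.

```python
# Hedged sketch: one black-box query with the VISER-augmented input.
# Reuses add_horizontal_scaffold / build_prompt from the sketches above;
# any image-capable chat API could be substituted.
import base64
import io

from openai import OpenAI
from PIL import Image

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_with_viser(image_path: str, task: str, question: str) -> str:
    augmented = add_horizontal_scaffold(Image.open(image_path))
    buf = io.BytesIO()
    augmented.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": build_prompt(task, question)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```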

Key Experimental Results

Main Results

Evaluated on 2D/3D synthetic datasets and natural images across four closed-source and open-source models (results for three are shown below):

| Task | Metric | GPT-4o Base | GPT-4o +VISER | Claude Base | Claude +VISER | Qwen2.5-VL Base | Qwen2.5-VL +VISER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Counting (2D) | Accuracy | 12.00% | 38.83% | 8.50% | 10.67% | 5.83% | 40.83% |
| Counting (3D) | Accuracy | 15.00% | 31.00% | 17.33% | 22.00% | 8.51% | 26.67% |
| Visual Search (2D) | Harmonic Mean | 0.48 | 0.73 | 0.34 | 0.66 | 0.30 | 0.40 |
| Visual Search (3D) | Harmonic Mean | 0.91 | 0.93 | 0.80 | 0.86 | 0.12 | 0.20 |
| Scene Description (2D) | Edit Distance ↓ | 1.94 | 1.62 | 3.01 | 2.20 | 8.12 | 7.39 |
| Spatial Relations (2D) | Accuracy | 43.00% | 52.50% | 34.18% | 36.26% | 48.50% | 50.00% |
| Spatial Relations (Natural) | Accuracy | 69.39% | 77.43% | 37.43% | 46.15% | 80.10% | 77.04% |

Ablation Study

Using GPT-4o on the 2D counting task to isolate the contribution of each component:

| Configuration | 2D Counting Accuracy | Note |
| --- | --- | --- |
| Baseline (no augmentation) | 12.00% | Vanilla model |
| Visual structure only (lines, no text prompt) | ~28.5% | Visual structure contributes independently |
| Text prompt only (scan instruction, no lines) | ~14.2% | Minimal effect from text alone |
| CoT prompt ("Let's think step by step") | ~9.8% | Performance degrades! |
| VISER (full method) | 38.83% | Both components are complementary |

In comparison with fine-tuned models (on Qwen2.5-VL): VISER achieves 41% on 2D counting, while Mulberry (fine-tuned) achieves only 15% and OpenVLThinker (RL fine-tuned) also achieves 15%—the training-free method substantially outperforms trained counterparts.

Key Findings

  • Visual modification is necessary: Purely textual strategies such as CoT are not only ineffective but can degrade performance, since language-level reasoning cannot repair already-entangled visual representations.
  • Both components are indispensable: The visual structure alone contributes roughly 16.5 percentage points over the baseline; adding the text prompt contributes roughly another 10 points; only the full method achieves the maximum gain.
  • Greater improvement on high-complexity scenes: As the number of objects increases from 10 to 20, VISER's relative gains become more pronounced (e.g., GPT-4o on 2D counting with 14 objects improves from 1% to 34%).
  • Model-agnostic: Effective across GPT-4o, Claude 3.5, LLaMA4, and Qwen2.5-VL, with generalization demonstrated on external benchmarks including MMBench and PhysBench.
  • Surpasses fine-tuned models: VISER applied to the base Qwen2.5-VL matches or exceeds the performance of specially fine-tuned models (Mulberry and OpenVLThinker) on most tasks.

Highlights & Insights

  • Binding problem as a unified explanatory framework: This paper is the first to systematically introduce the binding problem from cognitive science into the analysis of LVLM visual reasoning failures, offering a mechanistic explanation that goes beyond "insufficient model capability"—the issue is not that models are too weak, but that they lack a serial attention mechanism to correctly associate features with objects.
  • Introducing the concept of "visual CoT": CoT injects inductive bias for reasoning at the language level; VISER does so at the visual level. This analogy reveals a neglected dimension in multimodal reasoning: the design of visual inputs is equally—or perhaps more—important than language prompting.
  • Counter-intuitive finding—CoT is harmful: CoT degrades performance on visual reasoning tasks, directly challenging the default assumption that "CoT is universally beneficial." This demonstrates that when the problem originates in visual encoding rather than the reasoning chain, language-level interventions are ineffective.
  • Minimal intervention, maximal effect: Simply drawing 3 lines on an image and adding a one-sentence instruction raises GPT-4o's counting accuracy from 12% to 39%. The stark contrast between the method's simplicity and its substantial effectiveness suggests that there is enormous room for optimization in the visual processing pipeline of current LVLMs.

Limitations & Future Work

  • Static scaffold structure: The position and number of horizontal lines are fixed; when lines happen to pass through critical objects, they may occlude information or introduce interference. Future work could explore adaptive scaffolds that dynamically determine line placement based on image content.
  • Reliance on synthetic data: Primary experiments are conducted on synthetic datasets; improvements on natural images are relatively limited (e.g., Qwen2.5-VL shows a 3.06-percentage-point decline on natural-image spatial relations), and generalization to real-world scenarios warrants further investigation.
  • Unexplored visual structure variants: Only horizontal lines are employed; other visual scaffold forms such as grids, circular partitions, or contour lines are not systematically compared. Ensemble strategies integrating multiple scaffold types may yield greater robustness.
  • Lack of mechanistic analysis of internal visual encoding: The paper demonstrates effectiveness but does not conduct in-depth analysis of how LVLM attention patterns change before and after adding lines, leaving the question of why the method works without a mechanistic explanation.

Relation to Prior Work

  • vs. Visual Sketchpad: Sketchpad lets the model draw auxiliary lines during reasoning, but it requires agentic multi-step interaction and tool calls; VISER achieves a similar effect at the input level within a single query, making it simpler and more efficient.
  • vs. LVLM-COUNT: LVLM-COUNT employs external counting tools in a divide-and-conquer fashion but does not improve the model's intrinsic reasoning capability; VISER directly improves the feature binding process within the model.
  • vs. Mulberry / OpenVLThinker: Fine-tuning approaches require substantial training resources and are limited to specific base models; VISER is training-free and applicable to any model, including closed-source APIs.
  • Cognitive science inspiration: Feature Integration Theory (Treisman & Gelade, 1980) and serial attention theory provide a solid theoretical foundation for the method's design—this cross-disciplinary approach is worth emulating.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Introduces the binding problem framework to LVLM analysis and proposes the "visual CoT" concept; the method is simple yet the perspective is unique and highly thought-provoking.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 4 models, 4 core tasks, multiple external benchmarks, comparison with fine-tuned methods, and detailed ablations; evaluation on natural images is relatively limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The cognitive science grounding is thoroughly explained, the motivation is clear, the experiments are well organized, the discussion is in-depth, and the overall narrative logic is rigorous.
  • Value: ⭐⭐⭐⭐⭐ — Plug-and-play, zero-cost, applicable to all models; reveals the underappreciated importance of visual input design over language prompting, with significant implications for future research directions.