The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

Conference: ICML 2026
arXiv: 2605.01345
Code: None
Area: Multimodal VLM / Active Vision / Vision Agent
Keywords: Perceptual bandwidth bottleneck, Bayesian experimental design, active visual reasoning, high resolution, training-free

TL;DR

This paper formalizes the issue of "VLMs failing to perceive details" as a Sequential Bayesian Optimal Experimental Design (S-BOED) problem, and proposes a training-free FOVEA module based on a computable proxy objective of "coverage × resolution." FOVEA consistently outperforms Direct and ReAct-style baselines on high-resolution and remote sensing benchmarks.

Background & Motivation

Background: Modern VLMs (Qwen3-VL, GPT-5, Gemini 2.5, etc.) are already strong in overall scene understanding. Mainstream high-resolution processing methods fall into two categories: (1) downsampling the image and feeding the whole image into a ViT encoder with a fixed token budget; (2) introducing tool calls, where the VLM generates crop commands via ReAct or latent CoT, then invokes expert tools like OCR/detection.

Limitations of Prior Work: On fine-grained tasks such as small-object counting, OCR, and precise spatial localization, existing VLMs exhibit "perceptual blindness": the relevant detail is never perceived, so even simple reasoning fails. Downsampling makes small objects "disappear" before encoding; ReAct-style cropping is heuristic and often crops the wrong region; brute-force sliding-window scanning is too expensive and introduces substantial noise.

Key Challenge: The authors identify this as a perceptual bandwidth bottleneck—ViTs compress images of any resolution into a fixed number of tokens, forcing an unavoidable "field of view vs. resolution" trade-off: seeing more means losing detail, seeing detail means losing context. This is not merely a semantic reasoning failure, but a failure to obtain task-relevant evidence under limited bandwidth.

Goal: Turn the "where to look" decision from an ad-hoc heuristic into a principled optimal experimental design problem, and provide a proxy objective that remains computable over a continuous, gigapixel-scale space of crops.

Key Insight: Analogous to how scientists design experiments—each foveation (crop) is an experimental design \(\mathbf{d}\), aiming to reduce uncertainty about latent variables \(\boldsymbol{\theta}=\{\ell, y\}\) (target location + semantic answer). The BOED framework naturally fits this "active information foraging" process.

Core Idea: Use the product of "coverage × resolution" as a computable proxy for expected information gain, and package it as a plug-in module to refine the VLM's proposed crop.

Method

Overall Architecture

Input is a high-resolution image \(I\) and query \(Q\). The VLM first generates a seed crop \(\mathbf{d}_{\text{seed}}\) in ReAct style. FOVEA intercepts this command and generates a candidate crop pool \(\mathcal{D}_{\text{cand}}=\{\mathbf{d}_{\text{seed}}, \mathbf{d}_{\text{small}}, \mathbf{d}_{\text{large}}\}\) around it. Each candidate is scored using a resolvability probe, and the highest-scoring crop is fed back to the VLM or downstream tools (OCR/Detection). The entire process is completely training-free, only requiring extra VLM calls as a scorer during inference.
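
The candidate pool described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Crop` type, the `rescale` and `candidate_pool` helpers, and the 0.5x / 2x scale factors are assumptions introduced here; the summary only states that a smaller and a larger crop are generated around the seed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Axis-aligned crop in pixel coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def rescale(crop: Crop, factor: float, width: int, height: int) -> Crop:
    """Scale a crop about its center, then clip it to the image bounds."""
    cx, cy = (crop.x0 + crop.x1) / 2, (crop.y0 + crop.y1) / 2
    half_w = (crop.x1 - crop.x0) * factor / 2
    half_h = (crop.y1 - crop.y0) * factor / 2
    return Crop(max(0.0, cx - half_w), max(0.0, cy - half_h),
                min(float(width), cx + half_w), min(float(height), cy + half_h))


def candidate_pool(seed: Crop, width: int, height: int,
                   shrink: float = 0.5, grow: float = 2.0) -> dict[str, Crop]:
    """Return {seed, small, large}; the scale factors are assumed values."""
    return {
        "seed": seed,
        "small": rescale(seed, shrink, width, height),
        "large": rescale(seed, grow, width, height),
    }


pool = candidate_pool(Crop(1000, 1000, 1400, 1300), width=4096, height=4096)
for name, crop in pool.items():
    print(name, crop, crop.area)
```

Keeping the seed in the pool means the refinement step can always fall back to the VLM's own proposal when neither alternative scores higher.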

Key Designs

S-BOED Formalization and Three-Layer Probabilistic Model:
- Function: Reformulates "active vision" as Bayesian optimal experimental design, clearly distinguishing physical constraints (fixed token budget), generative process (visibility-gated observation), and decision objective (EIG).
- Mechanism: Defines perceptual bandwidth \(\mathcal{B}\), information density \(\rho(\mathbf{d})=\mathcal{B}/A(\mathbf{d})\) with \(A(\mathbf{d})\) the crop area, and resolution probability \(\phi(\mathbf{d})=f_{\text{sat}}(\rho(\mathbf{d}))\) (sigmoid form, corresponding to a "semantic Nyquist rate"). It introduces a binary visibility event \(\mathcal{S}\), where \(\mathcal{S}=1\) only if the target is both spatially covered (\(\ell\in\mathbf{d}\)) and resolved (which happens with probability \(\phi(\mathbf{d})\)); in that case the observation \(\mathbf{z}\) carries semantic information about \(y\), otherwise it is background noise drawn from \(p_0\).
- Design Motivation: The authors highlight that this problem violates the submodularity assumption common in active learning—neither "wide field of view" nor "random zoom-in" alone yields information gain; only their sequential combination does, resulting in an "information cliff," so look-ahead is necessary rather than pure greediness.
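
To make the field-of-view versus resolution trade-off concrete, the quantities above can be evaluated numerically. The sketch below is illustrative only: the token budget, the log-density sigmoid chosen for \(f_{\text{sat}}\), its parameters `rho_half` and `steepness`, and the image and target coordinates are all assumptions, not values from the paper.

```python
import math


def information_density(bandwidth_tokens: float, crop_area_px: float) -> float:
    """rho(d) = B / A(d): visual tokens available per source pixel."""
    return bandwidth_tokens / crop_area_px


def resolution_probability(rho: float, rho_half: float = 1e-3,
                           steepness: float = 4.0) -> float:
    """phi(d) = f_sat(rho(d)), modelled here as a sigmoid in log-density.

    rho_half (density at which phi = 0.5) and steepness are assumed values.
    """
    return 1.0 / (1.0 + math.exp(-steepness * math.log(rho / rho_half)))


def visibility_probability(target_xy, crop, bandwidth_tokens: float) -> float:
    """P(S = 1 | target location, crop): target covered AND resolved."""
    x, y = target_xy
    x0, y0, x1, y1 = crop
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0  # not covered: the observation is background noise
    area = (x1 - x0) * (y1 - y0)
    return resolution_probability(information_density(bandwidth_tokens, area))


B = 1024                           # assumed fixed token budget
target = (3000, 2000)              # hypothetical target location
full_view = (0, 0, 7680, 4320)     # whole 8K-wide image
zoom_hit = (2800, 1800, 3300, 2300)
zoom_miss = (100, 100, 600, 600)
for name, crop in [("full view", full_view), ("zoom (hit)", zoom_hit),
                   ("zoom (miss)", zoom_miss)]:
    print(f"{name:12s} P(S=1) = {visibility_probability(target, crop, B):.3f}")
```

Under these assumed numbers the full view covers the target but cannot resolve it, and a zoomed crop resolves detail but is useful only if it happens to contain the target, which is the trade-off the visibility gate encodes.
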
Computable Coverage-Resolution Objective:
- Function: Simplifies the nested expectation form of EIG in BOED into a scalar objective computable in gigapixel space.
- Mechanism: Based on three progressive assumptions—Factorised Belief (\(p_t(\ell, y)\approx p_t(\ell)\cdot p_t(y)\)), Calibrated Visibility (\(H(\mathcal{S}|\mathbf{z},\mathbf{d})\approx 0\)), Ideal Observer (\(H(y|\mathbf{z},\mathcal{S}=1)\approx 0\))—derives \(U_t(\mathbf{d})\approx H_t(y)\cdot\mathcal{J}_t(\mathbf{d})\), where \(\mathcal{J}_t(\mathbf{d})=\left(\int_{\mathbf{x}\in\mathbf{d}}p_t(\mathbf{x})d\mathbf{x}\right)\cdot \phi(\mathbf{d})\) is the "coverage × resolution" product. Since \(H_t(y)\) is independent of \(\mathbf{d}\), maximizing EIG is equivalent to maximizing \(\mathcal{J}_t\).
- Design Motivation: This reduces the complex semantic reasoning objective to geometric visibility maximization, leaving "understanding" to the backbone VLM and "search" to FOVEA; this separation of concerns enables training-free inference-time optimization.
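
One way to see how the three assumptions lead to the product form is sketched below. It is a reconstruction under the stated assumptions, treating the utility as the information gained about the answer \(y\) and taking \(P(\mathcal{S}=1\mid\ell,\mathbf{d})=\mathbb{1}[\ell\in\mathbf{d}]\,\phi(\mathbf{d})\); the paper's own derivation may proceed differently.

\[
\begin{aligned}
U_t(\mathbf{d}) = I_t(y;\mathbf{z}\mid\mathbf{d})
&\approx I_t(y;\mathbf{z},\mathcal{S}\mid\mathbf{d})
&& \text{(Calibrated Visibility: } H(\mathcal{S}\mid\mathbf{z},\mathbf{d})\approx 0\text{)}\\
&= I_t(y;\mathcal{S}\mid\mathbf{d}) + I_t(y;\mathbf{z}\mid\mathcal{S},\mathbf{d})
&& \text{(chain rule)}\\
&= \sum_{s\in\{0,1\}} P_t(\mathcal{S}=s\mid\mathbf{d})\, I_t(y;\mathbf{z}\mid\mathcal{S}=s,\mathbf{d})
&& \text{(Factorised Belief: } \mathcal{S} \text{ depends only on } \ell\text{, so } I_t(y;\mathcal{S}\mid\mathbf{d})=0\text{)}\\
&= P_t(\mathcal{S}=1\mid\mathbf{d})\, I_t(y;\mathbf{z}\mid\mathcal{S}=1,\mathbf{d})
&& \text{(for } \mathcal{S}=0\text{, } \mathbf{z}\sim p_0 \text{ is independent of } y\text{)}\\
&\approx P_t(\mathcal{S}=1\mid\mathbf{d})\, H_t(y)
&& \text{(Ideal Observer: } H(y\mid\mathbf{z},\mathcal{S}=1)\approx 0\text{)}\\
&= H_t(y)\left(\int_{\mathbf{x}\in\mathbf{d}}p_t(\mathbf{x})\,d\mathbf{x}\right)\phi(\mathbf{d})
= H_t(y)\cdot\mathcal{J}_t(\mathbf{d}).
\end{aligned}
\]

Read this way, \(\mathcal{J}_t(\mathbf{d})\) is simply the probability that the visibility event fires for crop \(\mathbf{d}\) under the current location belief.
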
Resolvability Probing and Three Optimizers:
- Function: In the absence of a true ground-truth belief map, uses the VLM itself as a "binary scorer" to estimate \(\hat{\mathcal{J}}(\mathbf{d})\).
- Mechanism: Introduces a binary resolvability signal \(r\in\{0,1\}\), defines \(\hat{\mathcal{J}}(\mathbf{d})\approx P(\text{VLM}(I_\mathbf{d}, Q)=\text{"Yes"})\), i.e., "does this crop contain enough visual evidence to answer the question"; runs \(K=3\) random probes per candidate crop and averages the results. The upper layer supports three optimizers: Greedy (default, selects the crop with highest \(\hat{\mathcal{J}}\)), MCMC-style (iterative refinement), Lookahead (uses simulated next-state \(\hat{V}(\mathbf{d}, \mathcal{H}_{t-1})\) instead of immediate score, specifically to address information cliffs).
- Design Motivation: The resolvability probe is not an exact EIG estimator, but an empirical proxy from the S-BOED perspective—this design avoids training a scoring model, requiring only VLM calls; the three optimizers provide a continuous spectrum of "compute-accuracy operating points," switchable according to latency budget.
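
The probing and Greedy selection steps described above can be sketched as follows. The `probe` callable stands in for the VLM call that answers "Yes" or "No"; it, the helper names, and the assumed per-crop "Yes" rates in the demo are hypothetical and not part of any real API.

```python
import random
from statistics import mean
from typing import Callable, Hashable, Iterable

# A probe takes a crop and returns True when the scorer answers "Yes" to
# "does this crop contain enough visual evidence to answer the question?".
Probe = Callable[[Hashable], bool]


def estimate_resolvability(crop: Hashable, probe: Probe, k: int = 3) -> float:
    """Estimate J_hat(d) as the fraction of k probes that answer "Yes"."""
    return mean(1.0 if probe(crop) else 0.0 for _ in range(k))


def greedy_select(candidates: Iterable[Hashable], probe: Probe, k: int = 3):
    """Score every candidate and return (best_crop, scores)."""
    scores = {crop: estimate_resolvability(crop, probe, k) for crop in candidates}
    best = max(scores, key=scores.get)  # ties resolve to the earliest candidate
    return best, scores


# Stand-in for the VLM scorer: each named crop has an assumed "Yes" rate.
rng = random.Random(0)
ASSUMED_YES_RATE = {"seed": 0.5, "small": 0.9, "large": 0.1}


def fake_probe(crop: str) -> bool:
    return rng.random() < ASSUMED_YES_RATE[crop]


best, scores = greedy_select(["seed", "small", "large"], fake_probe, k=3)
print(best, scores)
```

With only three probes per crop the estimate takes one of four values, so ties are common; listing the seed first makes the sketch keep the VLM's own proposal whenever the probe cannot separate the candidates.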

Loss & Training

Completely training-free, with no parameter updates. FOVEA is only inserted into the VLM's crop call chain at inference and selects crops via \(|\mathcal{D}_{\text{cand}}|\times K\) extra VLM probe calls per refinement (\(3\times 3=9\) with the three-candidate pool and \(K=3\) described above). The cost is extra tokens and latency; the benefit is improved crop quality.

Key Experimental Results

Main Results

| Method | Backbone | MME-RealW | CV-Bench | V* | HR-4K | HR-8K | Mean |
|---|---|---|---|---|---|---|---|
| GPT-5 | Closed | 55.0 | 84.9 | 77.0 | 78.1 | 75.5 | 74.1 |
| Gemini 2.5 Flash | Closed | 58.5 | 87.3 | 80.1 | 83.4 | 80.9 | 78.0 |
| Direct | Qwen3-VL-30B | 48.2 | 81.2 | 81.2 | 80.0 | 75.9 | 73.3 |
| ReAct | Qwen3-VL-30B | 51.1 | 81.3 | 83.8 | 80.8 | 78.3 | 75.1 |
| RAP | Qwen3-VL-30B | 40.8 | 72.2 | 86.4 | 79.6 | 80.6 | 71.9 |
| FOVEA | Qwen3-VL-30B | 54.6 | 84.8 | 85.3 | 84.5 | 79.2 | 77.7 |
| Direct | Qwen3-VL-8B | 47.6 | 84.5 | 76.9 | 74.5 | 70.9 | 70.9 |
| ReAct | Qwen3-VL-8B | 48.1 | 83.9 | 78.8 | 77.7 | 73.8 | 72.5 |
| FOVEA | Qwen3-VL-8B | 49.9 | 84.7 | 83.6 | 80.9 | 75.4 | 74.9 |

On the 30B backbone, FOVEA raises ReAct's mean from 75.1 to 77.7 (+2.6), approaching Gemini 2.5 Flash's 78.0; on the 8B backbone, it improves the mean from 72.5 to 74.9 (+2.4). The same strategy therefore helps at both backbone scales.

Ablation Study (Remote Sensing Subset, search-dominated)

| Configuration | Accuracy | Notes |
|---|---|---|
| Direct (30B) | ~35% | Whole image, no active search |
| ReAct | 45.1% | Heuristic crop |
| FOVEA-Greedy | ~48% | Adds resolvability probe |
| FOVEA-MCMC | ~50% | Iterative refinement |
| FOVEA-Lookahead | 54.7% | Explicit look-ahead for information cliff |
| Oracle Crop | ~65% | Upper bound with human-annotated crop |

Key Findings

FOVEA achieves the largest gains in search-dominated remote sensing scenarios; Lookahead outperforms Greedy by 6+ points, supporting the "information cliff" hypothesis—immediate gain signals are insufficient in such tasks, requiring look-ahead.
FOVEA-Lookahead (54.7%) still trails the Oracle crop (~65%) by about 10 points. The authors attribute the remaining errors to both a "search bottleneck" and a "recognition bottleneck": the Oracle crop itself tops out at ~65%, so even with the correct crop the VLM backbone can still fail at recognition.
On the accuracy-compute curve, Greedy / MCMC / Lookahead form a monotonically increasing set of operating points; FOVEA is actually a family of strategies, not a single point—this introduces a new axis for inference-time scaling: not just spending more tokens on textual CoT, but spending more tokens to actively acquire visual evidence.
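
The gap between Greedy and Lookahead can be illustrated with a deliberately simple toy model. Everything in it is an assumption made for illustration and it is not the paper's Lookahead optimizer: the image is a 1-D strip of 16 cells with a uniform location prior, resolution is a hard threshold (crops wider than 2 cells resolve nothing), and a full-width glance reveals no answer information but localizes the target to a block of 4 cells.

```python
from fractions import Fraction

N_CELLS = 16   # 1-D "image"; the target occupies exactly one cell, uniform prior
NARROW = 2     # widest crop (in cells) that still resolves the target (phi = 1)
BLOCK = 4      # assumed: a wide glance localizes the target to a block of 4 cells


def immediate_score(crop_width: int, belief_support: int) -> Fraction:
    """Coverage x resolution with a hard resolution threshold."""
    coverage = Fraction(min(crop_width, belief_support), belief_support)
    phi = 1 if crop_width <= NARROW else 0
    return coverage * phi


def success_greedy(steps: int) -> Fraction:
    """Always take the best immediate score: scan fresh narrow crops."""
    return Fraction(min(NARROW * steps, N_CELLS), N_CELLS)


def success_lookahead(steps: int) -> Fraction:
    """Spend step 1 on a wide glance (score 0), then scan inside the block."""
    if steps < 2:
        return Fraction(0)
    return Fraction(min(NARROW * (steps - 1), BLOCK), BLOCK)


print("score of wide glance      :", immediate_score(N_CELLS, N_CELLS))
print("score of narrow crop      :", immediate_score(NARROW, N_CELLS))
print("narrow crop after glance  :", immediate_score(NARROW, BLOCK))
for budget in (1, 2, 3):
    print(f"budget {budget}: greedy {success_greedy(budget)}, "
          f"lookahead {success_lookahead(budget)}")
```

In this toy, the wide glance has the lowest possible immediate score, so a greedy rule never picks it, yet any plan with at least two steps does better by starting with it. That is the non-submodular "information cliff" pattern the findings above refer to.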

Highlights & Insights

Reframes active vision under the mature BOED decision-theoretic framework: Previous tool-based VLM agents (Thyme, RAP) relied on RL-trained policies or heuristics to propose crops; FOVEA, in contrast, offers a fully training-free yet theoretically grounded approach, with the core trick being the "coverage × resolution" proxy objective.
The "information cliff" observation is highly incisive: Explains why greedy strategies often fail on high-resolution tasks—not because the model is weak, but because the submodularity assumption does not hold, elevating the necessity of look-ahead to a theoretical imperative.
Resolvability probing is highly transferable: Essentially uses the VLM as its own critic (\(P(\text{VLM}=\text{Yes})\) as utility), and can be applied to web agents, tool calls, RAG retrieval ranking, etc., as long as the task has a "binary yes/no verification" form.

Limitations & Future Work

The authors acknowledge reliance on the Ideal Observer assumption; if the backbone VLM hallucinates, even oracle crops cannot help.
Being proposal-limited is a hard constraint: if the seed crop is far from the true target region, local refinement and look-ahead cannot recover. The authors call this the cold-start problem and propose multi-seed proposals as a mitigation, but do not explore it in depth.
The resolvability probe requires extra VLM runs, which significantly increases inference time; FOVEA is better viewed as a set of operating points on the compute-accuracy curve, and it is not suitable for every latency-sensitive scenario.
Future directions: train an amortized lightweight policy to directly predict good crops, or add a meta-policy to decide "when to activate FOVEA."

Comparison with Related Work

vs ReAct / Thyme / RAP: They use RL or heuristics for VLM crop proposals; this work leaves the backbone unchanged and adds a BOED optimization layer at inference, with zero training cost and a decision-theoretic grounding. Experimentally, at 30B, FOVEA's mean (77.7) exceeds ReAct's (75.1) and RAP's (71.9), although RAP remains ahead on V* (86.4 vs. 85.3) and HR-8K (80.6 vs. 79.2).
vs BED-LLM (Choudhury et al. 2025): They apply BOED to discrete question selection; FOVEA extends it to continuous gigapixel visual space, requiring additional handling of visibility gating and information cliffs.
vs Standard visual CoT: Textual CoT spends more tokens "thinking," while FOVEA spends more tokens "seeing"; the authors view this as another orthogonal axis for inference-time scaling.

Rating

Novelty: ⭐⭐⭐⭐⭐ S-BOED + information cliff + coverage-resolution product, theoretically complete and novel framing
Experimental Thoroughness: ⭐⭐⭐⭐ Four high-resolution benchmarks + two backbone scales + three optimizers, with clear Oracle gap analysis, but only validated on Qwen3-VL family
Writing Quality: ⭐⭐⭐⭐⭐ Three-layer probabilistic model (physical → generative → decision) is clearly organized, with explicit assumptions and approximations
Value: ⭐⭐⭐⭐ Training-free and plug-and-play, low deployment barrier, but probe overhead and unresolved cold-start limit direct application