LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models¶
Conference: CVPR 2026 arXiv: 2603.14882 Code: https://empactlab.github.io/LLMind-CVPR-2026/ Area: Multimodal VLM Keywords: Bio-inspired visual sampling, Möbius transformation, training-free, pixel budget, VQA
TL;DR¶
Inspired by foveal encoding and cortical magnification in the human visual system, this paper proposes LLMind, a training-free adaptive sampling framework that leverages Möbius transformations for non-uniform pixel allocation. A closed-loop semantic feedback mechanism optimizes sampling parameters at test time, achieving substantial improvements over uniform sampling under tight pixel budgets of only 1%–5%.
Background & Motivation¶
Background: Current VLMs (e.g., Qwen, LLaVA) allocate equal resolution to all pixel regions during visual processing, consuming equal computational resources even for semantically irrelevant background areas. Dynamic tokenization partially alleviates redundancy but still requires full-resolution input, making it ill-suited for edge devices.
Limitations of Prior Work: Uniform downsampling neither reflects the resource-allocation strategy of human vision nor prevents the loss of globally critical details when high-resolution images are downscaled: semantically important regions and irrelevant backgrounds are treated identically.
Key Challenge: There is a fundamental tension between efficiency and inference accuracy—uniform sampling cannot focus on task-critical regions under a limited pixel budget.
Goal: Can the foveal fixation strategy of biological vision be leveraged to maintain high accuracy in VLMs under extremely low pixel budgets?
Key Insight: The human eye achieves maximum information gain at minimal cost through high-resolution foveal sampling, low-resolution peripheral context, and rapid saccades. The authors map this mechanism to non-uniform sampling parameterized by Möbius transformations.
Core Idea: Möbius transformations are used to simulate cortical magnification, amplifying sampling density in task-relevant regions, while closed-loop semantic feedback optimization is achieved via SPSA gradient estimation for black-box VLMs.
Method¶
Overall Architecture¶
Given an input image \(I\) and question \(q\), the Bio-inspired Adaptive Sampling Strategy (BASS) module produces an adaptively sampled image \(\hat{I}\), which is fed into a frozen VLM to obtain the answer. BASS parameters are iteratively optimized at inference time via perceptual and semantic losses.
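A minimal sketch of that loop, assuming stand-in callables `bass_sample`, `query_vlm`, and `semantic_loss` for the BASS module, the frozen VLM, and the CSF semantic loss (their interfaces and the hyperparameters below are illustrative assumptions, not the paper's API):

```python
import numpy as np

def closed_loop_optimize(image, question, answer_gt,
                         bass_sample, query_vlm, semantic_loss,
                         budget=0.05, iters=8, delta=0.1, lr=0.05, seed=0):
    """Sketch: BASS sampling -> frozen VLM -> CSF loss -> SPSA update on theta."""
    rng = np.random.default_rng(seed)
    theta = np.array([1.0, 0.0, 0.0, 1.0])  # identity Möbius transform as the starting point

    def loss(th):
        # Evaluate answer quality for the image sampled under the pixel budget.
        return semantic_loss(query_vlm(bass_sample(image, th, budget), question), answer_gt)

    for _ in range(iters):
        perturb = rng.choice([-1.0, 1.0], size=theta.shape)  # simultaneous +/-1 perturbation
        # SPSA: two loss evaluations yield a gradient estimate for all four parameters at once.
        grad = (loss(theta + delta * perturb) - loss(theta - delta * perturb)) / (2 * delta * perturb)
        theta -= lr * grad
    return bass_sample(image, theta, budget), theta
```

Because the VLM is queried only through its answers, the same loop applies to closed-source APIs.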
Key Designs¶
- BASS (Bio-inspired Adaptive Sampling Strategy):
  - Function: Applies non-uniform spatial sampling to the image, simulating the foveal magnification effect of the human eye.
  - Mechanism: Image pixels are mapped to the complex plane via stereographic projection, and a Möbius transformation \(z = (aw+b)/(cw+d)\) is applied for smooth spatial remapping, enlarging the fixation region while compressing the periphery. Uniform sampling is then performed in the transformed space and mapped back, yielding an equivalent non-uniform sampling of the original image (a sampling sketch follows this list).
  - Design Motivation: As a conformal mapping, the Möbius transformation preserves global geometric structure while magnifying local regions.
  - Distinction from Simple Cropping: Global contextual information is retained without discarding scene structure.
- MLP Parameter Predictor:
  - Function: A lightweight MLP predicts the four real-valued parameters \(\theta \in \mathbb{R}^4\) of the Möbius transformation.
  - Mechanism: Parameter learning is integrated into a differentiable sampling pipeline \(\hat{I} = \mathcal{M}_\theta^{-1}(\mathcal{I}(\mathcal{S}_B(\mathcal{M}_\theta(I))))\), where \(\mathcal{S}_B\) denotes uniform sampling under budget \(B\).
- Closed-loop Semantic Feedback (CSF):
  - Function: Optimizes sampling parameters at test time based on VLM response quality.
  - Mechanism: A perceptual loss \(\mathcal{L}_{img} = \alpha \cdot \mathcal{L}_{VSI} + \beta \cdot \mathcal{L}_{DISTS} + \gamma \cdot \mathcal{L}_{MSE}\) ensures visual quality; a semantic loss \(\mathcal{L}_{text} = 1 - \cos(E(y_{pred}), E(y_{gt}))\) measures semantic similarity between predicted and ground-truth answers via a Sentence Transformer (a minimal sketch of this term follows the list).
  - Gradient Estimation: SPSA (Simultaneous Perturbation Stochastic Approximation) estimates gradients for black-box VLMs coordinate-wise as \(\widehat{\nabla}_{\theta_i} \mathcal{L}_{text} \approx \frac{\mathcal{L}_{text}(\theta+\delta\Delta) - \mathcal{L}_{text}(\theta-\delta\Delta)}{2\delta\Delta_i}\), where \(\Delta\) is a random \(\pm 1\) perturbation vector, without accessing internal model parameters.
  - Design Motivation: Compatible with both white-box and black-box VLMs, including closed-source APIs.
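As a rough illustration of the Möbius-based remapping in BASS, the sketch below builds a low-resolution uniform grid in the transformed space and pulls it back through the inverse Möbius map, so sampling density concentrates where the forward map magnifies. It uses PyTorch's `grid_sample`, interprets \(\theta\) directly as \((a, b, c, d)\), and omits the stereographic projection step, so it approximates the described mechanism rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mobius_resample(image, theta, out_hw):
    """Non-uniform resampling via a Möbius map (sketch, not the authors' code).

    image : (1, C, H, W) float tensor.
    theta : length-4 sequence of real parameters, read here as (a, b, c, d)
            for the map z = (a*w + b) / (c*w + d) on complex pixel coordinates.
    out_hw: (h, w) of the budgeted output grid.
    """
    h, w = out_hw
    # Uniform grid in the *transformed* space, as complex coordinates in [-1, 1]^2.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    z = torch.complex(xs, ys)

    # Inverse Möbius map w = (d*z - b) / (-c*z + a): pull grid points back to the
    # original image plane, so sampling is dense where the forward map magnifies.
    a, b, c, d = [torch.as_tensor(t, dtype=torch.complex64) for t in theta]
    src = (d * z - b) / (-c * z + a)

    # grid_sample expects (N, h, w, 2) coordinates in [-1, 1], ordered (x, y).
    grid = torch.stack((src.real, src.imag), dim=-1).clamp(-1, 1).unsqueeze(0)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)

# Example: for a 1000x1000 input, a 224x224 output grid is roughly a 5% pixel budget.
# small = mobius_resample(torch.rand(1, 3, 1000, 1000), [1.2, 0.1, 0.0, 1.0], (224, 224))
```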
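The semantic term of the CSF loss can be sketched with the sentence-transformers library; the specific checkpoint below is an assumption, since the paper only states that a Sentence Transformer encoder \(E\) is used, and the perceptual terms (VSI, DISTS, MSE) would be added to it with the weights \(\alpha, \beta, \gamma\).

```python
from sentence_transformers import SentenceTransformer, util

# The encoder checkpoint is an assumption; the paper only specifies "a Sentence Transformer".
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_loss(pred_answer: str, gt_answer: str) -> float:
    """CSF semantic term: L_text = 1 - cos(E(y_pred), E(y_gt)). Identical answers give ~0."""
    emb = _encoder.encode([pred_answer, gt_answer], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()
```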
Loss & Training¶
- Entirely training-free: all optimization is performed at test time through a small number of iterations.
- Adaptive question selection: exponential weighting is applied to incorrectly answered questions, focusing optimization on hard examples (an illustrative weighting sketch follows below).
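The exact weighting rule is not given in this summary; one plausible form, up-weighting each question in proportion to the exponential of its error count, is sketched below.

```python
import numpy as np

def question_weights(wrong_counts, tau=1.0):
    """Turn per-question error counts into a sampling distribution (illustrative only).

    wrong_counts[i]: how often question i has been answered incorrectly so far.
    Questions with more failures get exponentially larger weight, so optimization
    focuses on hard examples; tau controls how sharply the weighting concentrates.
    """
    w = np.exp(np.asarray(wrong_counts, dtype=float) / tau)
    return w / w.sum()

# Example: three questions, the second failed twice, the third once.
print(question_weights([0, 2, 1]))  # ~[0.09, 0.67, 0.24]
```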
Key Experimental Results¶
Main Results¶
| Dataset | Model | Pixel Budget | Uniform Sampling | LLMind | Relative Gain |
|---|---|---|---|---|---|
| VQAv2 | Qwen2.5-VL | 5% | 59.94 | 73.54 | +22.68% |
| VQAv2 | SmolVLM | 5% | 59.06 | 76.46 | +29.46% |
| Seed-Bench | Qwen2.5-VL | 5% | — | — | +38% (avg) |
| A-OKVQA | Qwen2.5-VL | 5% | — | — | +37% (avg) |
Performance Retention under Extreme Low Budgets¶
| Pixel Budget | VQAv2/Qwen2.5-VL Retention | Notes |
|---|---|---|
| 1% | 63.31% | Only 1% of pixels |
| 3% | 75.17% | Retains most performance |
| 5% | 84.56% | Approaches full resolution |
Ablation Study¶
- Static foveal sampling underperforms uniform sampling because it lacks adaptivity.
- Sunflower-inspired and radial sampling also perform poorly.
- CSF closed-loop feedback is the key driver of performance gains.
- In region-guided VQA, LLMind at 1% pixel budget even surpasses full-resolution accuracy.
Comparison Details¶
- Static Foveated, Sunflower Inspired, and Radial Sampling all underperform uniform sampling, demonstrating that static foveal encoding cannot generalize across diverse tasks.
- Exponential weighting in the adaptive question selection strategy focuses optimization on hard examples, accelerating convergence.
Highlights & Insights¶
- First work to systematically incorporate neuroscientific foveal encoding and cortical magnification into VLM visual representation research.
- Fully training-free and plug-and-play; compatible with both white-box and black-box VLMs, including closed-source APIs.
- Retains 82% of full-resolution performance at an extreme 1% pixel budget, with significant practical value.
- The conformal property of Möbius transformations guarantees that global structure is not disrupted.
- On SmolVLM, retention at 5% budget reaches 95.56%, achieving nearly lossless compression.
Limitations & Future Work¶
- Test-time optimization requires multiple forward passes (approximately 5–10 iterations per image), increasing inference latency.
- SPSA gradient estimation may converge slowly in high-dimensional parameter spaces and is sensitive to the perturbation magnitude \(\delta\).
- The current approach relies on a small number of ground-truth answers for CSF optimization; applicability in fully annotation-free settings requires further validation.
- Handling scenes with multiple fixation points (e.g., multiple key regions in complex charts) remains unexplored.
- A single Möbius transformation may be insufficient to simultaneously magnify multiple spatially dispersed semantically critical regions.
- The phenomenon of LLMind surpassing full-resolution accuracy in region-guided VQA warrants deeper theoretical investigation.