AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition¶
Conference: CVPR2026
arXiv: 2512.03794
Code: github.com/adaptvision/adaptvision
Area: Multimodal VLM
Keywords: visual token compression, adaptive visual acquisition, reinforcement learning, tool calling, efficient VLM
TL;DR¶
This paper proposes AdaptVision, which lets a VLM autonomously determine the minimum number of visual tokens required per sample: a coarse-to-fine active visual mechanism first processes a low-resolution image and optionally crops a critical region, while reinforcement learning with Decoupled Turn Policy Optimization (DTPO) trains the policy toward an optimal efficiency-accuracy trade-off.
Background & Motivation¶
- VLMs rely on a large number of visual tokens (e.g., Qwen2.5-VL generates 2,678 tokens for a 2048×1024 image), incurring significant computational and memory overhead.
- Existing efficient VLM methods compress visual tokens at a fixed ratio (e.g., pruning 50%), lacking adaptive capability for varying task demands.
- Cognitive neuroscience reveals that the human visual system operates as "active vision"—first capturing coarse low-frequency information, then performing fine-grained analysis on critical regions.
- Recent "thinking with images" paradigms (e.g., zoom/crop operations in DeepEyes and Mini-o3) demonstrate the potential of active visual reasoning.
- Applying "thinking with images" to visual token compression—letting the model decide how many visual tokens suffice—remains underexplored.
- Training such a dual-objective policy with standard GRPO poses challenges of ambiguous credit assignment and optimization imbalance.
Method¶
Overall Architecture¶
AdaptVision adopts a coarse-to-fine pipeline: it first processes a quarter-resolution image \(I_{low}\), consuming only about 25% of the full visual-token budget. The model then autonomously decides whether to answer directly or to invoke a cropping tool via <tool_call>[x1,y1,x2,y2]</tool_call>, retrieving a critical region \(I_{crop}\) from the high-resolution image for a second round of reasoning. The total visual-token count is \(n_{img} = n_{low} + \mathbf{1}_{tool} \cdot n_{crop}\).
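A minimal sketch of this two-turn inference loop, assuming a PIL-style image object and a hypothetical `model.generate` API; the `<tool_call>[x1,y1,x2,y2]</tool_call>` format follows the paper:

```python
import re

TOOL_RE = re.compile(r"<tool_call>\[([\d.]+),([\d.]+),([\d.]+),([\d.]+)\]</tool_call>")

def adaptvision_infer(model, image, question):
    """Coarse-to-fine inference: answer on a low-res view, optionally
    crop a region from the full-res image for a second turn."""
    # Quarter-resolution view: half width x half height -> ~25% of the tokens
    low = image.resize((image.width // 2, image.height // 2))
    first = model.generate(images=[low], prompt=question)
    m = TOOL_RE.search(first)
    if m is None:
        return first  # model answered directly from the coarse view
    x1, y1, x2, y2 = map(float, m.groups())
    crop = image.crop((x1, y1, x2, y2))  # fetch the critical region at full res
    # Second turn: reason over both the coarse view and the crop
    return model.generate(images=[low, crop], prompt=question, history=first)
```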
Key Designs¶
Reward Function Design: The total reward combines an outcome reward \(\mathcal{R}_{oc}\) and a tool reward \(\mathcal{R}_{tool}\) (a composition sketch follows this list).
- Accuracy Reward \(\mathcal{R}_{acc}\): LLM-as-judge evaluates answer correctness (1/0).
- Format Reward \(\mathcal{R}_{form}\): Enforces compliant use of the <think>, <answer>, and <tool_call> tags (0.5/0).
- Balance Reward \(\mathcal{R}_{bal}\): Penalizes correct answers obtained via tool calls by 0.1, and likewise penalizes low-confidence direct correct answers by 0.1 to prevent "lucky guessing."
- Crop Reward \(\mathcal{R}_{crop}\): GPT-4o evaluates whether the cropped region contains information necessary for answering.
- Area Penalty \(\mathcal{R}_{area}\): Penalizes excessively large cropped regions to incentivize minimal token usage.
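A minimal sketch of how these terms might combine, assuming \(\mathcal{R}_{acc}\), \(\mathcal{R}_{form}\), and \(\mathcal{R}_{bal}\) form the outcome reward while \(\mathcal{R}_{crop}\) and \(\mathcal{R}_{area}\) form the tool reward; the crop-reward magnitude, the linear area penalty, and all helper inputs are hypothetical:

```python
def outcome_reward(correct: bool, format_ok: bool, used_tool: bool, confident: bool) -> float:
    """R_oc = R_acc + R_form + R_bal (weights per the paper where stated)."""
    r_acc = 1.0 if correct else 0.0
    r_form = 0.5 if format_ok else 0.0
    r_bal = 0.0
    if correct and used_tool:
        r_bal -= 0.1  # discourage tool calls when the coarse view already suffices
    if correct and not used_tool and not confident:
        r_bal -= 0.1  # discourage "lucky guessing" on direct answers
    return r_acc + r_form + r_bal

def tool_reward(crop_is_relevant: bool, crop_area_frac: float) -> float:
    """R_tool = R_crop + R_area; relevance is judged by GPT-4o in the paper."""
    r_crop = 1.0 if crop_is_relevant else 0.0  # assumed binary crop reward
    r_area = -crop_area_frac                   # assumed linear area penalty
    return r_crop + r_area
```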
DTPO (Decoupled Turn Policy Optimization):
- Decoupled Learning Objectives: Generated tokens are functionally separated into tool tokens (first turn) and answer tokens (second turn), with gradient signals normalized independently to address the under-optimization of tool tokens in long sequences.
- Decoupled Advantage Estimation: Independent advantage values \(A^{tool}\) and \(A^{oc}\) are computed for the tool reward and outcome reward respectively, resolving ambiguous credit assignment.
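Assuming standard GRPO-style group normalization (the paper's exact estimator may differ), the two decoupled advantages for rollout \(i\) in a group of \(G\) samples would be

\[
A^{tool}_i = \frac{\mathcal{R}^{tool}_i - \mathrm{mean}\{\mathcal{R}^{tool}_j\}_{j=1}^{G}}{\mathrm{std}\{\mathcal{R}^{tool}_j\}_{j=1}^{G}}, \qquad
A^{oc}_i = \frac{\mathcal{R}^{oc}_i - \mathrm{mean}\{\mathcal{R}^{oc}_j\}_{j=1}^{G}}{\mathrm{std}\{\mathcal{R}^{oc}_j\}_{j=1}^{G}},
\]

with \(A^{tool}_i\) applied to the tool-turn tokens and \(A^{oc}_i\) to the answer-turn tokens.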
Loss & Training¶
A GRPO-based, PPO-style clipped objective is employed, with separate normalization factors and advantage values for tool tokens and answer tokens: \(\mathcal{J}_{DTPO} = \mathcal{J}_{tool} + \mathcal{J}_{answer}\). Each component is normalized independently so that neither turn's gradient signal is diluted across the long two-turn sequence.
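A minimal PyTorch sketch of this objective, assuming per-sequence scalar advantages broadcast over each turn's tokens and per-turn token-count normalization; tensor shapes and names are hypothetical:

```python
import torch

def dtpo_loss(logp_new, logp_old, adv_tool, adv_oc, tool_mask, answer_mask,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate with decoupled per-turn normalization.

    logp_new, logp_old: [batch, seq] token log-probs under the new/old policy.
    adv_tool, adv_oc:   [batch] per-sequence advantages for each turn.
    tool_mask, answer_mask: [batch, seq] 0/1 masks selecting each turn's tokens.
    """
    ratio = torch.exp(logp_new - logp_old)

    def clipped_term(adv, mask):
        a = adv.unsqueeze(-1)  # broadcast the sequence-level advantage over tokens
        surr = torch.min(ratio * a,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a)
        # normalize by this turn's own token count, not the full sequence length
        return (surr * mask).sum() / mask.sum().clamp(min=1)

    j_tool = clipped_term(adv_tool, tool_mask)
    j_answer = clipped_term(adv_oc, answer_mask)
    return -(j_tool + j_answer)  # negated so an optimizer can minimize it
```

Normalizing \(\mathcal{J}_{tool}\) by the tool turn's own token count keeps its gradient magnitude comparable to the answer turn's, even though tool turns are typically much shorter than the full two-turn sequence.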
Key Experimental Results¶
Main Results: Comparison with Existing Efficient VLM Methods¶
| Method | #Tokens ↓ | ChartQA | OCRBench | DocVQA | MME | MathVista | Avg. (% of vanilla) ↑ |
|---|---|---|---|---|---|---|---|
| Vanilla (Qwen2.5-VL-7B) | 100% | 79.8 | 81.5 | 95.1 | 2316 | 68.2 | 100% |
| Down-Sample | 25% | 62.9 | 68.8 | 94.3 | 2270 | 62.2 | 92.1% |
| FastV (50%) | 50% | 72.6 | 75.8 | 93.6 | 2308 | 63.7 | 95.8% |
| VisionZip (50%) | 50% | 71.5 | 70.5 | 93.8 | 2209 | 64.1 | 94.8% |

AdaptVision uses fewer than 50% of the visual tokens yet surpasses all of the above methods on every listed benchmark, achieving the highest relative average.
Ablation Study: Contribution of DTPO Components¶
| Setting | Avg. Performance | Tool-Call Behavior |
|---|---|---|
| GRPO baseline | Lower; training unstable | Rapidly collapses into tool overuse |
| + Decoupled normalization | Significant improvement | Stable tool-usage rate |
| + Decoupled advantage estimation | Best | Optimal token efficiency |
Key Findings¶
- AdaptVision achieves performance superior to state-of-the-art efficient VLM methods while consuming significantly fewer visual tokens.
- Training directly with GRPO is unstable: tool usage first biases toward direct answering and then collapses into excessive tool invocation.
- Both decoupled designs in DTPO contribute critically to training stability and performance improvement.
Highlights & Insights¶
- The coarse-to-fine mechanism of human active vision is elegantly incorporated into VLM efficiency optimization.
- The decoupling in DTPO is general and should transfer to other multi-turn RL training scenarios involving tool calls.
- The multi-layered reward design (accuracy + format + balance + crop + area) reflects a deep understanding of practical training dynamics.
Limitations & Future Work¶
- Only a single tool call (1-turn crop) is supported; multi-turn iterative refinement is not explored.
- The crop reward relies on GPT-4o evaluation, resulting in relatively high training costs.
- Experiments are conducted on Qwen2.5-VL-7B; effectiveness on larger-scale models remains to be verified.
Related Work & Insights¶
- Compared to VisionThink: VisionThink only coarsely switches between low and high resolution, whereas AdaptVision supports region-level adaptive cropping with greater flexibility.
- Contrasted with fixed-compression methods such as FastV and SparseVLM: a paradigm shift from passive to active compression.
- The decoupling philosophy of DTPO may inspire solutions to gradient conflict issues in multi-objective RL training.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Novel perspective combining active vision and RL for token compression; DTPO is innovative)
- Experimental Thoroughness: ⭐⭐⭐⭐ (9 VQA benchmarks; comprehensive ablation study)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation; rigorous methodological derivation)
- Value: ⭐⭐⭐⭐ (Practical direction for efficient VLMs; DTPO is transferable)