# CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Conference: CVPR 2026 · arXiv: 2511.19820 · Code: GitHub · Area: Multimodal VLM · Keywords: Visual cropping, reinforcement learning, GRPO, fine-grained perception, plug-and-play
## TL;DR
This paper proposes CropVLM, a lightweight 256M-parameter cropping network trained via GRPO reinforcement learning without manual bounding box annotations. It dynamically selects the most informative image region for a VLM to focus on, and since the target model's weights are untouched, it integrates plug-and-play with both open-source and commercial VLMs to improve fine-grained visual understanding.
## Background & Motivation
VLMs are constrained by input resolution on tasks requiring fine-grained visual perception (e.g., document analysis, scene text recognition); LLaVA-1.5's 336×336 input, for instance, cannot resolve small text. Uniformly increasing resolution is computationally expensive and largely unnecessary: research shows most queries can be answered with only a small number of image tokens.
Limitations of prior work:
- Architecture modifications (e.g., Matryoshka, S2) require extensive retraining and risk catastrophic forgetting
- Weight-level modifications are incompatible with commercial models whose weights are inaccessible
- Training-free methods like ViCrop rely on attention maps/gradients and generalize poorly out of distribution
- UV-CoT employs DPO training, requiring synthetic preference pairs with low data efficiency
CropVLM's unique positioning: a lightweight plug-in module trained with GRPO without manual bounding boxes, compatible with both open-source and commercial VLMs.
## Method
### Overall Architecture
Input image + question → CropVLM (SmolVLM 256M) generates bounding box coordinates → crops the corresponding region from the original image → original image + cropped region are jointly fed into the target VLM → answer is generated.
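The pipeline is simple enough to sketch end to end. Below is a minimal illustration; `crop_vlm`, `target_vlm`, and their `predict_bbox` / `generate` methods are invented wrapper names for this sketch, not the authors' API.

```python
from PIL import Image

def answer_with_crop(image: Image.Image, question: str, crop_vlm, target_vlm) -> str:
    """Sketch of CropVLM inference: predict a crop, then answer with it."""
    # 1. CropVLM (SmolVLM 256M) predicts bounding-box coordinates for the query.
    x1, y1, x2, y2 = crop_vlm.predict_bbox(image, question)  # hypothetical call

    # 2. Crop the predicted region from the full-resolution original, so fine
    #    detail (small text, distant objects) survives the VLM's downscaling.
    region = image.crop((x1, y1, x2, y2))

    # 3. The target VLM sees the original image plus the zoomed crop and answers.
    return target_vlm.generate(images=[image, region], prompt=question)  # hypothetical call
```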
### Key Designs
- GRPO-Based Cropping Training:
  - Function: Optimizes the contribution of cropping to downstream VLM performance without requiring GT bounding boxes
  - Mechanism: For each image–question pair, \(G=6\) candidate bounding boxes are generated; each cropped region is combined with the original image and evaluated by a reward VLM; relative advantage is computed via within-group normalization (see the sketch after this list)
  - Design Motivation: GT bounding box annotation is costly and not necessarily optimal (human annotations may not best facilitate model responses)
- Dual Reward Design:
  - Function: Provides learning signals to guide cropping quality
  - Mechanism: An accuracy reward (comparing the VLM's answer, given original + cropped image, against the GT) and a log-likelihood reward (the log-likelihood the VLM assigns to the correct answer, computed via a single forward pass without generation)
  - Design Motivation: The likelihood reward is more fine-grained (nearly eliminating identical within-group rewards), enabling more samples to contribute effectively to weight updates
- SFT Seed Initialization:
  - Function: Equips the model with the basic capability to generate valid bounding box formats
  - Mechanism: A synthetic bounding box dataset is generated by Qwen 2.5-VL 7B for SFT; small-area bounding boxes are expanded via percentile-based scaling
  - Design Motivation: SmolVLM natively lacks bounding box output capability; basic competency must be established before RL optimization
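As a concrete reference for the GRPO mechanism and the dual rewards above, here is a minimal sketch of the per-group reward and advantage computation. The `generate` and `log_likelihood` calls are hypothetical wrapper methods, not the authors' API; only the group size \(G=6\), the two reward definitions, and the within-group normalization follow the paper.

```python
import numpy as np

G = 6  # candidate crops sampled per image-question pair (paper's group size)

def accuracy_reward(reward_vlm, image, crop, question, gt_answer) -> float:
    # 1.0 if the reward VLM answers correctly given original + crop, else 0.0.
    pred = reward_vlm.generate(images=[image, crop], prompt=question)  # hypothetical call
    return float(pred.strip().lower() == gt_answer.strip().lower())

def likelihood_reward(reward_vlm, image, crop, question, gt_answer) -> float:
    # Log-likelihood of the GT answer: one forward pass, no generation.
    # Continuous values almost never tie within a group.
    return reward_vlm.log_likelihood(  # hypothetical call
        images=[image, crop], prompt=question, answer=gt_answer
    )

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO: each candidate's advantage is its reward normalized within the group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Usage: score all G candidate crops for one image-question pair.
# rewards = np.array([likelihood_reward(vlm, img, c, q, gt) for c in crops])
# advantages = group_advantages(rewards)
```

With the binary accuracy reward, an all-correct or all-wrong group yields zero advantage for every candidate; the continuous likelihood reward avoids this degenerate case, which is why it lets more samples contribute to weight updates.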
### Loss & Training
- Two-stage pipeline: SFT (learning bounding box format) → GRPO (optimizing cropping quality)
- All training is conducted on a single A100 GPU; SFT takes approximately 3 hours and GRPO approximately 24 hours (2048px variant)
- LoRA (rank 128, alpha 256) is applied to fine-tune SmolVLM; a configuration sketch follows this list
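The reported LoRA hyperparameters map onto a standard `peft` configuration. A minimal sketch, assuming the public SmolVLM-256M checkpoint and attention-projection target modules (both assumptions; the paper's exact setup may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Base cropping model: SmolVLM 256M (checkpoint name is an assumption).
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")

lora_cfg = LoraConfig(
    r=128,           # LoRA rank reported in the paper
    lora_alpha=256,  # LoRA alpha reported in the paper
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapter weights train
```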
## Key Experimental Results
### Main Results (with Different Target VLMs)
| Target VLM | w/o CropVLM | + CropVLM (2048px) | Avg. Gain |
|---|---|---|---|
| LLaVA 1.5 (336px) | 36.69 | 42.71 | +6.02 |
| Qwen 2.5 VL (448px) | 56.42 | 67.14 | +10.72 |
| GPT 4.1 nano (512px) | 41.27 | 47.41 | +6.14 |
### Comparison with Other Cropping Methods
| Method | TextVQA | DocVQA | V* | HR-8k | Avg. |
|---|---|---|---|---|---|
| ViCrop (Qwen) | 74.15 | 72.27 | 53.40 | 46.00 | 59.67 |
| UV-CoT (Qwen) | 74.56 | 76.60 | 56.54 | 47.25 | 60.64 |
| CropVLM (Qwen) | 75.72 | 84.41 | 59.69 | 60.75 | 67.14 |
### Ablation Study
| Configuration | 1024px Avg. | Notes |
|---|---|---|
| Baseline SmolVLM | 44.55 | No cropping |
| + SFT | 46.55 | Synthetic bbox training |
| + GRPO (accuracy) | 49.75 | RL optimization |
| + GRPO (likelihood) | 50.89 | Likelihood reward superior |
### Key Findings
- CropVLM (1024px) paired with SmolVLM outperforms baseline SmolVLM (2048px) — low-resolution input with intelligent cropping surpasses brute-force high-resolution processing
- Significant gains are observed on out-of-distribution benchmarks (V*, HR-Bench), demonstrating strong generalization of the learned cropping strategy
- When paired with CropVLM, GPT 4.1 nano's refusals decrease from 31/191 to 2/191
- The likelihood reward consistently outperforms the accuracy reward
## Highlights & Insights
- Plug-and-play design: no modification to the target VLM weights is required; applicable even to commercial API-based models
- Extremely low cost: a 256M-parameter cropping network trained on a single GPU yields substantial performance gains
- Elegance of GRPO training: no GT bounding boxes, no auxiliary evaluator models — downstream task performance serves directly as the reward signal
- Demonstrates the significant value of the seemingly simple "crop" operation for fine-grained VLM understanding
## Limitations & Future Work
- Only single-region cropping is supported; multi-region or multi-step reasoning remains unexplored
- SmolVLM tokenizes numbers as individual digits (0–9 only), so bounding box coordinates are emitted digit by digit, slowing coordinate generation
- Training is conservative (single GPU, small group size), likely representing a lower bound on achievable performance
- The cropping network operates at a fixed input resolution; adaptive resolution strategies have not been explored
## Related Work & Insights
- vs. ViCrop: Training-free methods rely on attention maps/gradients and exhibit poor out-of-distribution performance; CropVLM learns a more robust cropping strategy
- vs. UV-CoT: DPO training requires 249k preference pairs and a 7B model; CropVLM requires only 62k data points and a 256M model, offering substantially higher efficiency
- vs. DeepEyes/Mini-o3: Multi-turn reasoning incurs high inference overhead; CropVLM achieves competitive results with a single crop, maintaining high inference efficiency
## Rating
- Novelty: ⭐⭐⭐⭐ GRPO-based cropping training combined with a plug-and-play design is novel in this area
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across multiple VLMs, benchmarks, methods, and cost analyses
- Writing Quality: ⭐⭐⭐⭐ Method presentation is concise and clear; experimental reporting is well-structured
- Value: ⭐⭐⭐⭐ Highly practical plug-and-play solution with low cost and high return