# CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Conference: CVPR 2026 · arXiv: 2511.19820 · Code: GitHub · Area: Multimodal VLM · Keywords: Visual cropping, reinforcement learning, GRPO, fine-grained perception, plug-and-play
## TL;DR
This paper proposes CropVLM, a lightweight 256M-parameter cropping network trained via GRPO reinforcement learning without manual bounding box annotations. It dynamically selects the most informative image region for a VLM to focus on, and since the target model's weights are untouched, it integrates plug-and-play with both open-source and commercial VLMs to improve fine-grained visual understanding.
## Background & Motivation
VLMs are constrained by input resolution on tasks requiring fine-grained visual perception (e.g., document analysis, scene text recognition); LLaVA-1.5's 336×336 input, for instance, cannot resolve small text. Uniformly increasing resolution is computationally expensive and largely unnecessary: research shows most queries can be answered with only a small number of image tokens.
Limitations of prior work:
- Architecture modifications (e.g., Matryoshka, S2) require extensive retraining and risk catastrophic forgetting
- Weight-level modifications are incompatible with commercial models whose weights are inaccessible
- Training-free methods like ViCrop rely on attention maps/gradients and generalize poorly out of distribution
- UV-CoT employs DPO training, requiring synthetic preference pairs with low data efficiency
CropVLM's unique positioning: a lightweight plug-in module trained with GRPO without manual bounding boxes, compatible with both open-source and commercial VLMs.
## Method
### Overall Architecture
Input image + question → CropVLM (SmolVLM 256M) generates bounding box coordinates → crops the corresponding region from the original image → original image + cropped region are jointly fed into the target VLM → answer is generated.
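The pipeline is simple enough to sketch end to end. Below is a minimal illustration; `crop_vlm`, `target_vlm`, and their `predict_bbox` / `generate` methods are invented wrapper names for this sketch, not the authors' API.

```python
from PIL import Image

def answer_with_crop(image: Image.Image, question: str, crop_vlm, target_vlm) -> str:
    """Sketch of CropVLM inference: predict a crop, then answer with it."""
    # 1. CropVLM (SmolVLM 256M) predicts bounding-box coordinates for the query.
    x1, y1, x2, y2 = crop_vlm.predict_bbox(image, question)  # hypothetical call

    # 2. Crop the predicted region from the full-resolution original, so fine
    #    detail (small text, distant objects) survives the VLM's downscaling.
    region = image.crop((x1, y1, x2, y2))

    # 3. The target VLM sees the original image plus the zoomed crop and answers.
    return target_vlm.generate(images=[image, region], prompt=question)  # hypothetical call
```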
### Key Designs
- GRPO-Based Cropping Training:
  - Function: Optimizes the contribution of cropping to downstream VLM performance without requiring GT bounding boxes
  - Mechanism: For each image–question pair, \(G=6\) candidate bounding boxes are generated; each cropped region is combined with the original image and evaluated by a reward VLM; relative advantage is computed via within-group normalization (see the sketch after this list)
  - Design Motivation: GT bounding box annotation is costly and not necessarily optimal (human annotations may not best facilitate model responses)
- Dual Reward Design:
  - Function: Provides learning signals to guide cropping quality
  - Mechanism: An accuracy reward (comparing the VLM's answer, given original + cropped image, against the GT) and a log-likelihood reward (the log-likelihood the VLM assigns to the correct answer, computed via a single forward pass without generation)
  - Design Motivation: The likelihood reward is more fine-grained (nearly eliminating identical within-group rewards), enabling more samples to contribute effectively to weight updates
- SFT Seed Initialization:
  - Function: Equips the model with the basic capability to generate valid bounding box formats
  - Mechanism: A synthetic bounding box dataset is generated by Qwen 2.5-VL 7B for SFT; small-area bounding boxes are expanded via percentile-based scaling
  - Design Motivation: SmolVLM natively lacks bounding box output capability; basic competency must be established before RL optimization
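As a concrete reference for the GRPO mechanism and the dual rewards above, here is a minimal sketch of the per-group reward and advantage computation. The `generate` and `log_likelihood` calls are hypothetical wrapper methods, not the authors' API; only the group size \(G=6\), the two reward definitions, and the within-group normalization follow the paper.

```python
import numpy as np

G = 6  # candidate crops sampled per image-question pair (paper's group size)

def accuracy_reward(reward_vlm, image, crop, question, gt_answer) -> float:
    # 1.0 if the reward VLM answers correctly given original + crop, else 0.0.
    pred = reward_vlm.generate(images=[image, crop], prompt=question)  # hypothetical call
    return float(pred.strip().lower() == gt_answer.strip().lower())

def likelihood_reward(reward_vlm, image, crop, question, gt_answer) -> float:
    # Log-likelihood of the GT answer: one forward pass, no generation.
    # Continuous values almost never tie within a group.
    return reward_vlm.log_likelihood(  # hypothetical call
        images=[image, crop], prompt=question, answer=gt_answer
    )

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO: each candidate's advantage is its reward normalized within the group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Usage: score all G candidate crops for one image-question pair.
# rewards = np.array([likelihood_reward(vlm, img, c, q, gt) for c in crops])
# advantages = group_advantages(rewards)
```

With the binary accuracy reward, an all-correct or all-wrong group yields zero advantage for every candidate; the continuous likelihood reward avoids this degenerate case, which is why it lets more samples contribute to weight updates.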
### Loss & Training
- Two-stage pipeline: SFT (learning bounding box format) → GRPO (optimizing cropping quality)
- All training is conducted on a single A100 GPU; SFT takes approximately 3 hours and GRPO approximately 24 hours (2048px variant)
- LoRA (rank 128, alpha 256) is applied to fine-tune SmolVLM; a configuration sketch follows this list
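The reported LoRA hyperparameters map onto a standard `peft` configuration. A minimal sketch, assuming the public SmolVLM-256M checkpoint and attention-projection target modules (both assumptions; the paper's exact setup may differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Base cropping model: SmolVLM 256M (checkpoint name is an assumption).
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")

lora_cfg = LoraConfig(
    r=128,           # LoRA rank reported in the paper
    lora_alpha=256,  # LoRA alpha reported in the paper
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapter weights train
```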
## Key Experimental Results
### Main Results (with Different Target VLMs)
| Target VLM | w/o CropVLM | + CropVLM (2048px) | Avg. Gain |
|---|---|---|---|
| LLaVA 1.5 (336px) | 36.69 | 42.71 | +6.02 |
| Qwen 2.5 VL (448px) | 56.42 | 67.14 | +10.72 |
| GPT 4.1 nano (512px) | 41.27 | 47.41 | +6.14 |
### Comparison with Other Cropping Methods
| Method | TextVQA | DocVQA | V* | HR-8k | Avg. |
|---|---|---|---|---|---|
| ViCrop (Qwen) | 74.15 | 72.27 | 53.40 | 46.00 | 59.67 |
| UV-CoT (Qwen) | 74.56 | 76.60 | 56.54 | 47.25 | 60.64 |
| CropVLM (Qwen) | 75.72 | 84.41 | 59.69 | 60.75 | 67.14 |
### Ablation Study
| Configuration | 1024px Avg. | Notes |
|---|---|---|
| Baseline SmolVLM | 44.55 | No cropping |
| + SFT | 46.55 | Synthetic bbox training |
| + GRPO (accuracy) | 49.75 | RL optimization |
| + GRPO (likelihood) | 50.89 | Likelihood reward superior |
### Key Findings
- CropVLM (1024px) paired with SmolVLM outperforms baseline SmolVLM (2048px) — low-resolution input with intelligent cropping surpasses brute-force high-resolution processing
- Significant gains are observed on out-of-distribution benchmarks (V*, HR-Bench), demonstrating strong generalization of the learned cropping strategy
- When paired with CropVLM, GPT 4.1 nano's refusals decrease from 31/191 to 2/191
- The likelihood reward consistently outperforms the accuracy reward
## Highlights & Insights
- Plug-and-play design: no modification to the target VLM weights is required; applicable even to commercial API-based models
- Extremely low cost: a 256M-parameter cropping network trained on a single GPU yields substantial performance gains
- Elegance of GRPO training: no GT bounding boxes, no auxiliary evaluator models — downstream task performance serves directly as the reward signal
- Demonstrates the significant value of the seemingly simple "crop" operation for fine-grained VLM understanding
## Limitations & Future Work
- Only single-region cropping is supported; multi-region or multi-step reasoning remains unexplored
- SmolVLM tokenizes numbers as individual digits (0–9 only), so bounding box coordinates are emitted digit by digit, slowing coordinate generation
- Training is conservative (single GPU, small group size), likely representing a lower bound on achievable performance
- The cropping network operates at a fixed input resolution; adaptive resolution strategies have not been explored
## Related Work & Insights
- vs. ViCrop: Training-free methods rely on attention maps/gradients and exhibit poor out-of-distribution performance; CropVLM learns a more robust cropping strategy
- vs. UV-CoT: DPO training requires 249k preference pairs and a 7B model; CropVLM requires only 62k data points and a 256M model, offering substantially higher efficiency
- vs. DeepEyes/Mini-o3: Multi-turn reasoning incurs high inference overhead; CropVLM achieves competitive results with a single crop, maintaining high inference efficiency
## Rating
- Novelty: ⭐⭐⭐⭐ GRPO-based cropping training combined with a plug-and-play design is novel in this area
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across multiple VLMs, benchmarks, methods, and cost analyses
- Writing Quality: ⭐⭐⭐⭐ Method presentation is concise and clear; experimental reporting is well-structured
- Value: ⭐⭐⭐⭐ Highly practical plug-and-play solution with low cost and high return