CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Conference: CVPR 2026
arXiv: 2511.19820
Code: GitHub
Area: Multimodal VLM
Keywords: Visual cropping, reinforcement learning, GRPO, fine-grained perception, plug-and-play

TL;DR

This paper proposes CropVLM, a lightweight 256M-parameter cropping network trained with GRPO reinforcement learning and no manual bounding-box annotations. It dynamically selects the most informative image region for a VLM to focus on, and plugs into both open-source and commercial VLMs to improve fine-grained visual understanding.

Background & Motivation

VLMs are constrained by input resolution in tasks requiring fine-grained visual perception (e.g., document analysis, scene text recognition) — LLaVA-1.5's 336×336 resolution fails to resolve small text. Uniformly increasing resolution is computationally expensive and unnecessary (research shows most queries can be answered with only a small number of image tokens).

Limitations of prior work:

  • Architecture modifications (e.g., Matryoshka, S2) require extensive retraining and risk catastrophic forgetting
  • Such approaches are incompatible with commercial models whose weights are inaccessible
  • Training-free methods such as ViCrop rely on attention maps/gradients and generalize poorly out-of-distribution
  • UV-CoT uses DPO training, which requires synthetic preference pairs and has low data efficiency

CropVLM's unique positioning: a lightweight plug-in module trained with GRPO without manual bounding boxes, compatible with both open-source and commercial VLMs.

Method

Overall Architecture

Input image + question → CropVLM (SmolVLM 256M) generates bounding box coordinates → crops the corresponding region from the original image → original image + cropped region are jointly fed into the target VLM → answer is generated.
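A minimal sketch of this inference flow is shown below. The interfaces `predict_bbox` and `generate`, as well as the normalized-corner coordinate convention, are assumptions for illustration, not the authors' actual API.

```python
# Minimal sketch of the CropVLM inference flow. `crop_model.predict_bbox` and
# `target_vlm.generate` are hypothetical interfaces used only for illustration.
from PIL import Image

def answer_with_crop(crop_model, target_vlm, image: Image.Image, question: str) -> str:
    # 1. CropVLM (SmolVLM 256M) proposes a bounding box for the question.
    #    Assumed to return normalized corner coordinates in [0, 1].
    x1, y1, x2, y2 = crop_model.predict_bbox(image, question)

    # 2. Crop the corresponding region from the full-resolution original image.
    w, h = image.size
    region = image.crop((int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h)))

    # 3. Feed the original image and the cropped region jointly to the target VLM.
    return target_vlm.generate(images=[image, region], prompt=question)
```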

Key Designs

  1. GRPO-Based Cropping Training:

    • Function: Optimizes the contribution of cropping to downstream VLM performance without requiring GT bounding boxes
    • Mechanism: For each image–question pair, \(G=6\) candidate bounding boxes are generated; each cropped region is combined with the original image and evaluated by a reward VLM; relative advantage is computed via within-group normalization
    • Design Motivation: GT bounding box annotation is costly and not necessarily optimal (human annotations may not best facilitate model responses)
  2. Dual Reward Design:

    • Function: Provides learning signals to guide cropping quality
    • Mechanism: An accuracy reward (comparing the VLM's answer given original + cropped image against the GT) and a log-likelihood reward (log-likelihood of the correct answer assigned by the VLM, computed via a single forward pass without generation)
    • Design Motivation: The likelihood reward is more fine-grained (nearly eliminating identical within-group rewards), enabling more samples to contribute effectively to weight updates (see the reward/advantage sketch after this list)
  3. SFT Seed Initialization:

    • Function: Equips the model with the basic capability to generate valid bounding box formats
    • Mechanism: A synthetic bounding box dataset is generated by Qwen 2.5-VL 7B for SFT; small-area bounding boxes are expanded via percentile-based scaling
    • Design Motivation: SmolVLM natively lacks bounding box output capability; basic competency must be established before RL optimization
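To make the reward and advantage computation concrete, here is a minimal sketch under assumed interfaces (`reward_vlm.answer_logprob` is a placeholder, not the authors' API): the log-likelihood reward scores each of the \(G\) candidate crops with a single forward pass, and advantages come from normalizing rewards within the group.

```python
# Sketch of the group-relative reward signal used for GRPO training of the
# cropping network. `reward_vlm.answer_logprob` is an assumed placeholder.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one group of G candidate crops (the paper uses G = 6)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def likelihood_rewards(reward_vlm, image, crops, question, gt_answer) -> torch.Tensor:
    """Log-likelihood reward: the log-probability the reward VLM assigns to the
    ground-truth answer given the original image plus each candidate crop,
    obtained with a single forward pass per crop (no generation). The accuracy
    reward variant would instead generate an answer and score it 0/1 against
    the ground truth."""
    return torch.tensor([
        reward_vlm.answer_logprob(images=[image, crop], prompt=question, answer=gt_answer)
        for crop in crops
    ])
```

Because the log-likelihood varies continuously, within-group ties are rare, so nearly every sampled crop receives a non-zero advantage and contributes to the policy update; this is the intuition behind the likelihood reward outperforming the 0/1 accuracy reward.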

Loss & Training

  • Two-stage pipeline: SFT (learning bounding box format) → GRPO (optimizing cropping quality)
  • All training is conducted on a single A100 GPU; SFT takes approximately 3 hours and GRPO approximately 24 hours (2048px variant)
  • LoRA (rank 128, alpha 256) is applied to fine-tune SmolVLM (see the configuration sketch below)
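A hedged sketch of the reported LoRA setup using Hugging Face peft; only the rank and alpha come from the paper, while the checkpoint name, target modules, and dropout are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Rank and alpha match the paper; everything else (base checkpoint,
# target modules, dropout) is an assumption for illustration.
base = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```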

Key Experimental Results

Main Results (with Different Target VLMs)

| Target VLM | w/o CropVLM | + CropVLM (2048px) | Avg. Gain |
| --- | --- | --- | --- |
| LLaVA 1.5 (336px) | 36.69 | 42.71 | +6.02 |
| Qwen 2.5 VL (448px) | 56.42 | 67.14 | +10.72 |
| GPT 4.1 nano (512px) | 41.27 | 47.41 | +6.14 |

Comparison with Other Cropping Methods

| Method | TextVQA | DocVQA | V* | HR-8k | Avg. |
| --- | --- | --- | --- | --- | --- |
| ViCrop (Qwen) | 74.15 | 72.27 | 53.40 | 46.00 | 59.67 |
| UV-CoT (Qwen) | 74.56 | 76.60 | 56.54 | 47.25 | 60.64 |
| CropVLM (Qwen) | 75.72 | 84.41 | 59.69 | 60.75 | 67.14 |

Ablation Study

| Configuration | 1024px Avg. | Notes |
| --- | --- | --- |
| Baseline SmolVLM | 44.55 | No cropping |
| + SFT | 46.55 | Synthetic bbox training |
| + GRPO (accuracy reward) | 49.75 | RL optimization |
| + GRPO (likelihood reward) | 50.89 | Likelihood reward superior |

Key Findings

  • CropVLM (1024px) paired with SmolVLM outperforms baseline SmolVLM (2048px) — low-resolution input with intelligent cropping surpasses brute-force high-resolution processing
  • Significant gains are observed on out-of-distribution benchmarks (V*, HR-Bench), demonstrating strong generalization of the learned cropping strategy
  • When paired with CropVLM, GPT 4.1 nano's refusals decrease from 31/191 to 2/191
  • The likelihood reward consistently outperforms the accuracy reward

Highlights & Insights

  • Plug-and-play design: no modification to the target VLM weights is required; applicable even to commercial API-based models
  • Extremely low cost: a 256M-parameter cropping network trained on a single GPU yields substantial performance gains
  • Elegance of GRPO training: no GT bounding boxes, no auxiliary evaluator models — downstream task performance serves directly as the reward signal
  • Demonstrates the significant value of the seemingly simple "crop" operation for fine-grained VLM understanding

Limitations & Future Work

  • Only single-region cropping is supported; multi-region or multi-step reasoning remains unexplored
  • SmolVLM's numeric output vocabulary is restricted (digits 0–9 only), resulting in slower bounding box coordinate generation
  • Training is conservative (single GPU, small group size), likely representing a lower bound on achievable performance
  • The cropping network operates at a fixed input resolution; adaptive resolution strategies have not been explored

Comparison with Related Methods

  • vs. ViCrop: training-free methods rely on attention maps/gradients and degrade out-of-distribution; CropVLM learns a more robust cropping strategy
  • vs. UV-CoT: DPO training requires 249k preference pairs and a 7B model; CropVLM needs only 62k data points and a 256M model, which is substantially more efficient
  • vs. DeepEyes/Mini-o3: multi-turn reasoning incurs high inference overhead; CropVLM achieves competitive results with a single crop, keeping inference efficient

Rating

  • Novelty: ⭐⭐⭐⭐ GRPO-based cropping training combined with a plug-and-play design is novel in this area
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across multiple VLMs, benchmarks, methods, and cost analyses
  • Writing Quality: ⭐⭐⭐⭐ Method presentation is concise and clear; experimental reporting is well-structured
  • Value: ⭐⭐⭐⭐ Highly practical plug-and-play solution with low cost and high return