See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/zss02/BiPS
Area: Multimodal VLM
Keywords: Visual Language Models, Reinforcement Learning, Perceptual Shaping, KL Constraint, Chart Understanding

TL;DR¶

BiPS moves the "where to look" visual cues from inference-time tools or latent tokens into the training phase. Within the GRPO framework, it shapes the VLM's perceptual strategy with a pair of KL constraints: one pulls the policy toward "evidence-only" charts and the other pushes it away from "evidence-ablated" charts. Trained on only 13K chart samples, Qwen2.5-VL-7B gains 7.3 points of average accuracy across eight benchmarks (8.2 points after adding 39K math samples), with zero additional inference overhead.

Background & Motivation¶

Background: In Visual Question Answering (VQA), Visual Language Models (VLMs) often require intermediate visual cues to identify "where to look." Prevailing approaches follow two paths: (1) invoking external tools (cropping, masking, segmentation) during inference to generate focused evidence images; (2) training the model to output a "Visual Chain-of-Thought" (bounding boxes, tool-call trajectories, or implicit visual tokens) during inference.

Limitations of Prior Work: These paths treat visual cues as inference-time crutches, leading to three issues. First, rigid shapes: focused regions are often rectangular crops or coarse masks, failing to capture irregular evidence like thin lines in charts, lesion contours in medical images, or non-convex polygons in geometry. Second, scenario coupling: customized tools and training pipelines for specific tokens are tightly coupled with layouts/domains, hindering generalization. Third, inference overhead: generating intermediate cues at inference time increases latency and the risk of cascading errors.

Key Challenge: Placing visual cues in the inference phase forces a trade-off between "cue quality" and "computation/generalization": finer cues require more specialized, slower, and less universal tools.

Goal: To internalize the ability to "perceive correct fine-grained evidence" into the model weights, eliminating the need for extra tools, parsers, or visual tokens during inference.

Key Insight: Since charts are rendered via code, every element (marks/axes/legend) has a specific programmatic origin. This allows "surgical" manipulation at the code level to programmatically generate two types of perfect, ground-truth visual views: one that preserves only the evidence required for the question, and one that precisely ablates only the critical evidence. These views are used as training signals rather than inference crutches.

Core Idea: Shape the policy with a pair of opposing KL constraints. One pulls the model's prediction on the original image toward its prediction on the "evidence-preserved" view (learning where to look); the other pushes it away from its prediction on the "evidence-ablated" view (forcing reliance on vision rather than linguistic shortcuts). This is termed Bi-directional Perceptual Shaping (BiPS).

Method¶

Overall Architecture¶

BiPS is a two-stage training curriculum built upon GRPO. It uses programmatically generated paired views to "bidirectionally shape" the VLM's perceptual strategy. This mechanism occurs entirely during training; at inference, the model behaves as a standard Qwen2.5-VL without extra steps.

The pipeline comprises three components: a programmatic data construction pipeline that edits rendering code to produce \((I, q, I_{pres}, I_{abl})\) quadruplets; a bi-directional KL constraint consisting of a consistency term (pulling the original image policy toward \(I_{pres}\)) and a separation term (pushing it away from \(I_{abl}\)); and a coarse-to-fine two-stage curriculum that decouples these objectives.

flowchart TD
    A["Chart Code + Question"] --> B["Programmatic Data Construction<br/>Edit code to generate paired views<br/>I_pres / I_abl"]
    B --> C["Consistency Constraint<br/>L_cons: Pull toward I_pres"]
    B --> D["Separation Constraint<br/>L_sep: Push away from I_abl"]
    C --> E["Coarse-to-Fine Curriculum<br/>Stage 1: L_cons first"]
    D --> E
    E -->|Load Stage 1 Weights| F["Stage 2: L_sep later"]
    F --> G["BiPS Model<br/>Zero inference overhead"]

Key Designs¶

1. Programmatic data construction pipeline: Pixel-precise views via code-level surgery

Bi-directional shaping requires precise supervision of "what to look at." Using the ECD corpus (multi-panel charts with executable code), the pipeline follows three steps: (i) Question reconstruction and verification: Open-ended questions are rewritten as multiple-choice questions by an LLM judge (GPT5-mini), which uses the chart source code and metadata to ensure the answers are verifiable. (ii) Difficulty filtering: Questions that the base model answers correctly in all 8 of 8 attempts are discarded, so training focuses on difficult samples. (iii) Code editing and rendering: The evidence-preserved view \(I_{pres}\) is rendered after deleting code segments irrelevant to the question. The evidence-ablated view \(I_{abl}\) is rendered after deleting the segments that provide critical evidence while keeping the global context (axes, layout). This yields 13K high-quality samples whose supervision is semantically precise and naturally aligned.
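As a rough illustration of steps (ii) and (iii), the sketch below models a chart as a list of named elements instead of real rendering code. The `ChartElement` type, the element names, and the choice to keep context elements in the preserved view are assumptions made for this example; the source describes editing the actual plotting code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartElement:
    """One drawable unit of a chart, standing in for a code segment (hypothetical)."""
    name: str         # e.g. "line:revenue"
    is_context: bool  # axes/layout; assumed to be kept in both views


def keep_hard_samples(correct_counts, num_rollouts=8):
    """Drop questions the base model answers correctly in every rollout."""
    return [qid for qid, n in correct_counts.items() if n < num_rollouts]


def build_views(elements, evidence_names):
    """Return (preserved, ablated) element lists for one question.

    preserved: context plus the evidence elements only.
    ablated:   everything except the evidence elements.
    """
    evidence = set(evidence_names)
    preserved = [e for e in elements if e.is_context or e.name in evidence]
    ablated = [e for e in elements if e.name not in evidence]
    return preserved, ablated


chart = [
    ChartElement("axes", True),
    ChartElement("line:revenue", False),
    ChartElement("line:cost", False),
    ChartElement("bar:headcount", False),
]
pres, abl = build_views(chart, ["line:revenue"])
print([e.name for e in pres])  # ['axes', 'line:revenue']
print([e.name for e in abl])   # ['axes', 'line:cost', 'bar:headcount']
print(keep_hard_samples({"q1": 8, "q2": 5, "q3": 0}))  # ['q2', 'q3']
```

In a real pipeline each element would correspond to a block of plotting code, and both views would be re-rendered from the edited script so that they share the original layout.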

2. Consistency constraint \(L_{cons}\): Learning coarse-grained focus

To address the issue of the model being distracted by irrelevant information, \(L_{cons}\) requires the policy on the original image \(I\) to be consistent with the policy on the evidence-preserved view \(I_{pres}\) by minimizing KL divergence:

\[L_{cons} = \mathbb{E}_{(I,q,r)}\Big[\mathbb{I}(r{=}1)\min\big(c_{cons},\ D_{KL}(\pi_\theta(\cdot|I,q)\,\|\,\mathrm{sg}[\tilde\pi_\theta(\cdot|I_{pres},q)])\big)\Big]\]

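\(\pi_\theta\) is the policy distribution on the original image, \(\tilde\pi_\theta\) is the target distribution on the preserved view, and \(\mathrm{sg}[\cdot]\) denotes a stop-gradient, so gradients flow only through the original-image branch. The indicator \(\mathbb{I}(r{=}1)\) restricts the term to samples with reward \(r = 1\), and \(c_{cons}\) caps the per-sample KL. Minimizing this term pushes the model to treat irrelevant regions as redundant and to base its decision on the preserved evidence.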
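The per-sample consistency term can be sketched on small discrete distributions. This is a minimal illustration under stated assumptions: the distributions, the cap value `c_cons=0.5`, and the helper names are made up for the example, and because there is no autodiff here the stop-gradient target is simply treated as a constant.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def consistency_term(p_orig, p_pres, reward, c_cons):
    """Per-sample L_cons: clipped KL, counted only when reward == 1.

    p_pres plays the role of sg[pi_tilde(. | I_pres, q)]: a fixed target.
    """
    if reward != 1:
        return 0.0
    return min(c_cons, kl_divergence(p_orig, p_pres))


# Hypothetical answer distributions over four multiple-choice options.
p_orig = [0.40, 0.30, 0.20, 0.10]  # distracted by irrelevant panels
p_pres = [0.85, 0.05, 0.05, 0.05]  # confident on the evidence-only view

print(consistency_term(p_orig, p_pres, reward=1, c_cons=0.5))  # positive penalty, at most 0.5
print(consistency_term(p_pres, p_pres, reward=1, c_cons=0.5))  # 0.0: already consistent
print(consistency_term(p_orig, p_pres, reward=0, c_cons=0.5))  # 0.0: masked by the indicator
```

The cap keeps a single badly mismatched sample from dominating the batch average, and the indicator keeps the model from being pulled toward the preserved-view distribution on samples that did not earn the reward.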

3. Separation constraint \(L_{sep}\): Cutting off text shortcuts

Models might rely on OCR text or linguistic priors to give the same answer on both \(I\) and \(I_{pres}\) without truly "seeing" the fine-grained evidence (shortcut learning). \(L_{sep}\) acts as a regularizer, forcing the policy on \(I\) to diverge from the policy on the evidence-ablated view \(I_{abl}\) by maximizing KL divergence:

\[L_{sep} = \mathbb{E}_{(I,q)}\Big[\min\big(c_{sep},\ D_{KL}(\pi_\theta(\cdot|I,q)\,\|\,\mathrm{sg}[\tilde\pi_\theta(\cdot|I_{abl},q)])\big)\Big]\]

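Since \(I_{abl}\) removes the key evidence, a model that truly reads that evidence should answer differently on the original and the "no-evidence" image; maximizing the KL enforces this and ensures reliance on fine-grained visual features (See Right). The cap \(c_{sep}\) bounds the term, so the divergence cannot be increased without limit.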

4. Coarse-to-fine two-stage curriculum: Decoupling conflicting objectives

\(L_{cons}\) acts as an attractive force and \(L_{sep}\) as a repulsive force, so optimizing both simultaneously may cause gradient conflict. BiPS decouples them into two stages. Stage 1 (Consistency Phase): \(L_{Stage 1} = L_{GRPO} + \alpha L_{cons}\) establishes coarse-grained focus. Stage 2 (Separation Phase): \(L_{Stage 2} = L_{GRPO} - \beta L_{sep}\) applies the grounding constraint; the minus sign means that minimizing the loss maximizes the clipped KL to \(I_{abl}\). Experiments show that reversing the order or training jointly converges more slowly and performs worse (CharXiv 46.8 and 46.4 vs. 49.4 for the coarse-to-fine order).
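The sign convention of the two stage objectives can be checked with a small sketch. The weights `alpha` and `beta`, the cap `c_sep=1.0`, and the example distributions are illustrative assumptions, not values from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def separation_term(p_orig, p_abl, c_sep):
    """Per-sample L_sep: KL to the ablated-view policy, capped at c_sep."""
    return min(c_sep, kl_divergence(p_orig, p_abl))


def stage1_loss(l_grpo, l_cons, alpha):
    """Stage 1: the consistency term is added, so it gets minimized."""
    return l_grpo + alpha * l_cons


def stage2_loss(l_grpo, l_sep, beta):
    """Stage 2: the separation term is subtracted, so it gets maximized."""
    return l_grpo - beta * l_sep


uniform = [0.25, 0.25, 0.25, 0.25]
grounded = [0.97, 0.01, 0.01, 0.01]

# A shortcut policy answers identically with or without the evidence.
shortcut_sep = separation_term(grounded, grounded, c_sep=1.0)
# A grounded policy becomes uncertain once the evidence is ablated.
grounded_sep = separation_term(grounded, uniform, c_sep=1.0)

print(shortcut_sep)   # 0.0: no separation, no loss reduction
print(grounded_sep)   # 1.0: the raw KL exceeds the cap
print(stage2_loss(0.8, shortcut_sep, beta=0.1))  # 0.8
print(stage2_loss(0.8, grounded_sep, beta=0.1))  # lower loss for the grounded policy
```

Without the cap, Stage 2 could keep lowering the loss by driving the divergence ever higher; with it, the reward for separation saturates once the two policies are clearly distinct.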

Loss & Training¶

Base Objective: \(L_{GRPO} = -\mathbb{E}\big[\min(r_t(\theta)A_t, \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t) - \gamma D_{KL}(\pi_\theta\|\pi_{ref})\big]\).
Curriculum: Stage 1 (5 epochs / 7K samples, \(+\alpha L_{cons}\)) → Stage 2 (3 epochs / 13K samples, \(-\beta L_{sep}\)) results in BiPS-Chart. Further fine-tuning on 39K math samples (ViRL39k) results in BiPS-General.
Optimization: AdamW, lr \(=1\times10^{-6}\), 8×H100 GPUs.
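For readers unfamiliar with GRPO, the sketch below shows the two ingredients behind the base objective: group-normalized advantages (the standard GRPO choice; the summary above does not spell out how \(A_t\) is computed) and the clipped surrogate. The values `epsilon=0.2` and `gamma=0.01` and the small `eps` added to the standard deviation are assumptions for illustration.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one token."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_token_loss(ratio, advantage, kl_to_ref, epsilon=0.2, gamma=0.01):
    """Negative of (clipped surrogate minus the weighted KL to the reference)."""
    return -(clipped_surrogate(ratio, advantage, epsilon) - gamma * kl_to_ref)


# Four rollouts for one question: two correct (reward 1), two wrong (reward 0).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in adv])    # [1.0, -1.0, -1.0, 1.0]

print(clipped_surrogate(1.5, 1.0))   # 1.2: gain from a large ratio is clipped
print(clipped_surrogate(1.5, -1.0))  # -1.5: the penalty is not clipped
print(grpo_token_loss(1.0, 1.0, 0.0))  # -1.0
```

If every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy gradient.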

Key Experimental Results¶

Main Results¶

Using Qwen2.5-VL-7B as the base, accuracy on four of the eight benchmarks and the average over all eight:

Model	Training Data Size	CharXiv	EvoChart	MathVista	MMStar	Average (8 benchmarks)
Qwen2.5-VL-7B (base)	-	42.5	52.0	68.2	62.1	44.3
DeepEyes-7B	14K+33K	42.9	65.6	70.8	63.0	47.5
Chart-R1-7B	258K	46.2	64.7	67.5	61.1	45.5
BiPS-Chart-7B	13K	49.4	68.2	73.5	64.9	51.6 (+7.3)
BiPS-General-7B	13K+39K	50.6	68.7	75.0	65.7	52.5 (+8.2)

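With only 13K training samples, BiPS-Chart reaches a higher average than DeepEyes-7B (47K samples, 47.5) and Chart-R1-7B (258K samples, 45.5). Although it is trained only on charts, it also generalizes out of domain to general reasoning tasks (MathVista +5.3 over the base model).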

Ablation Study¶

Breakdown of bi-directional constraints (added to GRPO baseline):

Configuration	CharXiv	ECD	ChartMuseum	Description
Qwen2.5-VL-7B	42.5	19.0	26.0	base
GRPO	44.3	35.6	30.8	Pure RL baseline
GRPO + \(L_{cons}\)	47.2	36.3	31.3	Consistency only
GRPO + \(L_{sep}\)	47.7	38.3	31.8	Separation only
Ours (Both)	49.4	39.9	33.5	Two-stage synergy

Curriculum order and view generation strategy:

Dimension	Configuration	CharXiv	ECD	ChartMuseum
Order	Joint Training	46.4	36.7	31.5
Order	Reversed (Stage 2 then 1)	46.8	39.2	31.3
Order	Ours (Coarse-to-fine)	49.4	39.9	33.5
Strategy	Random Masking (60% patch)	44.8	37.6	31.8
Strategy	Ours (Programmatic)	49.4	39.9	33.5

Key Findings¶

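Complementarity: Each constraint alone improves over plain GRPO (CharXiv 44.3 → 47.2 with \(L_{cons}\), 47.7 with \(L_{sep}\)). \(L_{sep}\) gives the larger ECD gain (35.6 → 38.3 vs. 36.3), consistent with its role in fine-grained grounding and shortcut suppression, and combining both yields the best results on all three benchmarks.
Curriculum Order: Reversing the order drops CharXiv from 49.4 to 46.8, plausibly because the separation regularizer is applied before a stable focus has been established.
Programmatic vs. Random Masking: The large gap (49.4 vs. 44.8 on CharXiv) indicates that semantically faithful views are essential for the constraints to be effective.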
Case Analysis: BiPS significantly improves precision in reading values and counting curves where the base model often makes minor numerical errors.

Highlights & Insights¶

Inference Cues to Training Signals: BiPS internalizes perception into model weights, achieving zero inference overhead while improving fine-grained perception.
Code-Level Surgery: Manipulating rendering code bypasses the limitations of pixel-level masking, producing pixel-precise paired views without human annotation.
Opposing KL Design: The attraction/repulsion pair addresses both "where to look" and grounding, with the two forces applied in sequence through the curriculum instead of jointly.
Data Efficiency: 13K samples outperform models trained on up to 258K samples, suggesting that improving core perception is more efficient than increasing task data volume.

Limitations & Future Work¶

Reliance on Rendering Code: While effective for charts (ECD), programmatically generating such views for natural or medical images remains an open challenge.
LLM Judge Dependency: The pipeline depends on GPT5-mini for question reconstruction and code editing, so data quality is bounded by the judge's capabilities.
Domain Coverage: Validation was primarily within chart and math VQA; effectiveness in dense natural scene VQA or video has not been verified.
Hyperparameter Sensitivity: \(\alpha, \beta, c_{cons}, c_{sep}\) may require tuning when switching base models or data domains.

Comparison with Related Work¶

vs. External Tools / Visual CoT (DeepEyes, Chart-R1): These approaches generate intermediate cues such as boxes or masks at inference. BiPS internalizes these skills during training, which avoids the inference cost and generalizes better.
vs. Noise-based Methods (ChiP, PAPO): These use random noise/masking as negative perturbations. BiPS uses semantically precise ablated views, providing a "cleaner" negative space for grounding.
vs. Implicit Latent Reasoning: Unlike methods that couple reasoning with specific tasks, BiPS shapes a generalizable perceptual strategy.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Innovative combination of training-time signal shaping and code-level paired views.
Experimental Thoroughness: ⭐⭐⭐⭐ Strong results on chart/math domains, but less exploration of natural images.
Writing Quality: ⭐⭐⭐⭐⭐ Clear logical flow from motivation to method and results.
Value: ⭐⭐⭐⭐⭐ High data efficiency and practical utility due to zero inference overhead.