Perception Programs: Unlocking Visual Tool Reasoning in Language Models¶
Conference: CVPR 2026
arXiv: 2604.12896
Code: https://github.com/AISmartPerception/perception-programs
Area: LLM / NLP (Other)
Keywords: perception programs, visual tools, language-native representation, training-free, multimodal reasoning
TL;DR¶
Perception Programs (P2) is a training-free, model-agnostic method that converts raw visual tool outputs (depth, optical flow, correspondences, etc.) into compact, language-native structured summaries. This lets MLLMs directly "read" visual modalities instead of inferring from dense pixels, yielding an average 19.66% improvement across 6 BLINK tasks.
Background & Motivation¶
Background: MLLMs are increasingly used in conjunction with visual tools (depth estimation, optical flow, visual correspondences, etc.) to enhance visual reasoning.
Limitations of Prior Work: Despite visual tools providing accurate perceptual signals, MLLMs often fail to fully leverage them. Raw tool outputs are dense pixel-level representations that are mismatched with LLMs' language-native reasoning capabilities. Experiments show that GPT-5 Mini cannot even recover correct depth ordering from depth maps (Kendall τ rapidly approaches zero).
Key Challenge: The bottleneck is not more tool calls or larger MLLMs, but the representation format of visual tool outputs. Dense numerical tokens are fundamentally mismatched with the language reasoning substrate.
Goal: Convert tool outputs from dense pixel-level representations to language-native structured summaries.
Key Insight: Humans extract cues from visual information differently depending on the data type (depth emphasizes near/far ordering, optical flow emphasizes direction, etc.). Converting the key information to text frees the model from processing pixel details.
Core Idea: P2 standardizes what tools convey (what), spatial locations (where), and inter-part relationships (how), enabling any MLLM to directly parse and reason.
Method¶
Overall Architecture¶
Given raw output from a visual tool, P2 partitions the pixel domain into a finite set of primitives (patches/points), extracts a structured item \(I_p = (p, c_p, r_p, b_p)\) for each primitive (identifier, normalized coordinates, modality readout, optional label), and generates a set of sparse symbolic relation triples \(\mathcal{T}\). The entire summary is serialized as a YAML-formatted text block and served directly as input to the MLLM.
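To make the conversion concrete, below is a minimal sketch (not the authors' released code) of how a depth map could be turned into a P2-style YAML summary; the grid size, key names, and relation vocabulary are illustrative assumptions.

```python
# Minimal illustrative sketch of a P2-style conversion for a depth map.
# The grid size, key names, and relation vocabulary are assumptions made
# for this example, not the authors' exact schema.
import numpy as np
import yaml


def depth_to_p2(depth: np.ndarray, grid: int = 8) -> str:
    """Partition a depth map into grid cells, build one item per cell,
    add sparse neighbor relations, and serialize everything as YAML."""
    h, w = depth.shape
    items, cell_mean = [], {}
    for gy in range(grid):
        for gx in range(grid):
            cell = depth[gy * h // grid:(gy + 1) * h // grid,
                         gx * w // grid:(gx + 1) * w // grid]
            pid = f"p{gy * grid + gx}"
            cell_mean[pid] = float(cell.mean())
            items.append({
                "id": pid,                                 # primitive identifier p
                "coords": [int(1000 * (gx + 0.5) / grid),  # cell center in [0, 1000]^2
                           int(1000 * (gy + 0.5) / grid)],
                "readout": [round(float(cell.min()), 3),   # r_p = [min D, max D]
                            round(float(cell.max()), 3)],
            })

    # Sparse relation triples between horizontally adjacent cells
    # (smaller depth value = closer to the camera).
    relations = []
    for gy in range(grid):
        for gx in range(grid - 1):
            a, b = f"p{gy * grid + gx}", f"p{gy * grid + gx + 1}"
            rel = "closer_than" if cell_mean[a] < cell_mean[b] else "farther_than"
            relations.append([a, rel, b])

    return yaml.safe_dump(
        {"modality": "depth", "items": items, "relations": relations},
        sort_keys=False,
    )


# The resulting text block replaces the raw depth map as MLLM input.
print(depth_to_p2(np.random.rand(480, 640)))
```

With an 8×8 grid this produces 64 items plus a short relation list, i.e., a few kilobytes of structured text instead of a dense pixel array.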
Key Designs¶
- Unified Item Schema:
  - Function: Standardized representation across modalities
  - Mechanism: All modalities share the same item structure \((p, c_p, r_p, b_p)\): a primitive identifier, spatial coordinates normalized to \([0, 1000]^2\), a readout extracted from the modality data, and an optional semantic label. The only difference across modalities is how the readout \(r_p\) is constructed and whether relations are included
  - Design Motivation: A unified schema makes the method generalizable across depth, optical flow, correspondences, detection, and other modalities
- Modality-Specific Readout Construction (see the flow-readout sketch after this list):
  - Function: Extracts the key information for each visual modality
  - Mechanism: Depth: each grid cell stores its minimum and maximum depth values \(r_p = [\min D, \max D]\), and relation triples between neighboring cells (e.g., "closer than," "farther than") are generated. Optical flow: encodes motion direction and magnitude. Correspondences: encodes matching-point positions and confidence. Detection: encodes object category and bounding box
  - Design Motivation: The key information differs across modalities and requires specialized extraction
- Training-Free, Model-Agnostic Deployment (see the prompt-assembly sketch after this list):
  - Function: Plug-and-play for any MLLM
  - Mechanism: P2 requires no parameter updates, architecture modifications, or additional tool calls. The same tool output is converted to P2 within the standard tool-use pipeline and consumed directly by the MLLM. Only minimal text-processing overhead is added at inference time
  - Design Motivation: Avoids training costs and model modifications while retaining maximum flexibility
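As an example of a modality-specific readout, a hypothetical optical-flow readout might reduce each grid cell to a coarse direction and mean magnitude while leaving the rest of the item unchanged; the eight-way direction binning below is an assumption for illustration, not the paper's exact encoding.

```python
# Hypothetical optical-flow readout that reuses the same item structure;
# the eight-way direction binning is an illustrative assumption.
import numpy as np

DIRECTIONS = ["right", "up-right", "up", "up-left",
              "left", "down-left", "down", "down-right"]


def flow_readout(flow_cell: np.ndarray) -> dict:
    """flow_cell: (H, W, 2) array of per-pixel (dx, dy) motion vectors for one grid cell."""
    dx = float(flow_cell[..., 0].mean())
    dy = float(flow_cell[..., 1].mean())
    # Image y grows downward, so negate dy to recover the conventional angle.
    angle = np.degrees(np.arctan2(-dy, dx)) % 360
    direction = DIRECTIONS[int((angle + 22.5) // 45) % 8]
    return {"direction": direction, "magnitude": round(float(np.hypot(dx, dy)), 2)}


# Usage: flow_readout(flow[y0:y1, x0:x1]) -> {"direction": "up-right", "magnitude": 3.4}
```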
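Because deployment is purely a prompt-level change, plugging P2 into an existing tool-use pipeline can amount to splicing the serialized summary into the user message; the chat-message layout and prompt wording in this sketch are illustrative assumptions.

```python
# Minimal sketch of training-free deployment: the serialized P2 summary simply
# replaces the raw tool output in the prompt. The chat-message layout and
# prompt wording here are illustrative assumptions, not the authors' exact setup.
def build_messages(question: str, image_url: str, p2_summary: str) -> list:
    return [
        {"role": "system",
         "content": ("You are given a visual-tool summary in YAML. "
                     "Use its items and relations to answer the question.")},
        {"role": "user",
         "content": [
             {"type": "image_url", "image_url": {"url": image_url}},
             {"type": "text",
              "text": f"Tool summary:\n{p2_summary}\n\nQuestion: {question}"},
         ]},
    ]
```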
Loss & Training¶
P2 involves no training whatsoever. It is a purely inference-time representation conversion module.
Key Experimental Results¶
Main Results¶
| Model | Task | Baseline | +Raw Tool | +P2 |
|---|---|---|---|---|
| GPT-5 Mini | Multi-view Reasoning | 41.4% | 52.8% | 86.5% |
| GPT-5 Mini | Relative Depth | 52.4% | 61.2% | 81.5% |
| GPT-5 Mini | Visual Correspondence | 38.7% | 45.3% | 72.1% |
| InternVL3.5-4B | 6-task Average | 42.1% | 48.5% | 70.3% |
| Qwen3VL-4B | 6-task Average | 43.5% | 49.2% | 71.8% |
Ablation Study¶
| Configuration | BLINK 6-task Average | Note |
|---|---|---|
| Full P2 | 86.5% | Items + Relations |
| Items only (no relations) | 78.2% | No neighborhood relations |
| Coarse grid (4×4) | 82.1% | Reduced resolution |
| Fine grid (12×12) | 85.8% | Higher resolution |
| Raw tool output | 52.8% | Pixel-level representation |
Key Findings¶
- P2 boosts GPT-5 Mini's accuracy on multi-view reasoning from 41.4% to 86.5% (+45 percentage points) — a striking improvement
- Even on small 4B-scale models, absolute improvements of 21–25 percentage points are observed
- P2 can enhance existing agent tool-use methods: an additional 18.28% improvement on depth and localization tasks
Highlights & Insights¶
- The core insight is profound: the bottleneck in visual reasoning is not tool accuracy but representation format. MLLMs can "read" text but cannot effectively "see" dense numerical values
- P2's design embodies the principle of "let machines do what machines are good at": let visual tools extract perceptual signals, let LLMs handle language reasoning
- Training-free + model-agnostic design gives it extremely high practical value
Limitations & Future Work¶
- Grid partitioning granularity requires task-specific tuning
- For tasks requiring precise pixel-level information (e.g., fine segmentation boundaries), P2's spatial discretization may lose information
- Extension along the video temporal dimension has not been evaluated
- Adaptive granularity and dynamic relation generation could be explored
Related Work & Insights¶
- vs VisProg/ViperGPT: These methods generate programs that call tools but still operate on tool outputs at the pixel level; P2 changes the representation of tool outputs
- vs Aurora/Mirage: These methods use training to improve tool usage; P2 achieves greater improvement without any training
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The insight that "representation format is the bottleneck" redefines the problem
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple models and tasks with striking results
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation, analysis, and experiments are all clear
- Value: ⭐⭐⭐⭐⭐ Significant implications for the MLLM tool-use paradigm