# CodePercept: Code-Grounded Visual STEM Perception for MLLMs
**Conference:** CVPR 2026 · **arXiv:** 2603.10757 · **Code:** TongkunGuan/Qwen-CodePercept · **Area:** Multimodal VLM / STEM Perception · **Keywords:** STEM visual perception, executable code, image reconstruction, code-grounded captioning, multimodal large language models, perception enhancement
## TL;DR
Through systematic scaling analysis, this work identifies perception—rather than reasoning—as the true bottleneck for MLLMs in STEM domains. It proposes the CodePercept paradigm, which uses executable Python code as an anchoring medium, constructs the million-scale ICC-1M dataset and the STEM2Code-Eval benchmark, and achieves significant improvements in STEM visual perception and downstream reasoning after two-stage SFT+RL training.
## Background & Motivation
Background: A large body of work focuses on enhancing MLLM reasoning via reinforcement learning (cold-start data, RL reward design, transfer of unimodal reasoning data), yet a fundamental question remains unanswered: does MLLM failure on STEM tasks stem from insufficient perception or insufficient reasoning?
Limitations of Prior Work: The authors decouple STEM visual reasoning into two stages, perception (image→caption) and reasoning (caption→answer), and scale each stage while holding the other fixed. Experiments on the MathVision dataset reveal that scaling perception consistently outperforms scaling reasoning (e.g., Perception@32B+Reasoning@4B substantially outperforms Perception@4B+Reasoning@32B). This demonstrates that perception is the true lever in STEM visual reasoning, yet it has received almost no systematic attention.
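A minimal sketch of this decoupled protocol, assuming hypothetical `captioner`/`solver` model handles; the key property is that the solver never sees the image, only the intermediate caption, so each stage can be scaled independently:

```python
def decoupled_accuracy(samples, captioner, solver):
    """Stage 1 (perception): image -> caption; Stage 2 (reasoning): caption -> answer."""
    correct = 0
    for image, question, answer in samples:
        caption = captioner.describe(image)               # perception stage
        pred = solver.answer(f"{caption}\n\n{question}")  # text-only reasoning stage
        correct += int(pred == answer)
    return correct / len(samples)

# Cross-scaling on MathVision (model handles are hypothetical):
#   decoupled_accuracy(data, captioner_32b, solver_4b)   # scale perception
#   decoupled_accuracy(data, captioner_4b,  solver_32b)  # scale reasoning
```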
Key Challenge: The intuitive solution, knowledge distillation via descriptive captions generated by GPT/Gemini, suffers from two critical flaws: (1) teacher models are prone to hallucination about spatial relationships and quantitative details; (2) natural language suffers from "descriptive aphasia" on complex STEM images: it cannot precisely capture structural information such as auxiliary-line constructions or spatial relationships in polyhedra.
Goal: (1) How can the STEM visual perception capability of MLLMs be systematically enhanced? (2) How can perception be evaluated directly, rather than using question-answering accuracy as a proxy metric?
Key Insight: Executable code inherently possesses precise semantics, verifiability, and structured expressiveness—only a thorough understanding of an image can yield correct reconstruction code. Code thus serves as a more precise "structured caption" than natural language.
Core Idea: Use executable Python code as the precise perceptual medium for STEM images, serving simultaneously as a training signal (code-grounded captioning + image-to-code translation) and as an evaluation criterion (fidelity of code-reconstructed images).
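To make the "structured caption" idea concrete, here is an illustrative matplotlib snippet (a toy example of ours, not drawn from the paper's data) reconstructing a simple plane-geometry figure; every coordinate and the auxiliary line are stated exactly, where a natural-language caption could only say "a triangle with an altitude from the apex":

```python
import matplotlib.pyplot as plt

# Triangle ABC with an auxiliary altitude CH dropped onto AB.
fig, ax = plt.subplots(figsize=(4, 4))
A, B, C = (0, 0), (6, 0), (2, 4)

ax.plot(*zip(A, B, C, A), color="black")  # triangle ABC (closed polyline)
ax.plot([C[0], C[0]], [C[1], 0], "k--")   # auxiliary line: altitude CH
for name, (x, y) in {"A": A, "B": B, "C": C, "H": (C[0], 0)}.items():
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))

ax.set_aspect("equal")
ax.axis("off")
plt.savefig("triangle.png")
```

Rendering the code and comparing the result against the source image gives exactly the verifiable perception signal the paradigm is built on.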
## Method

### Overall Architecture
CodePercept comprises three major modules: (1) an image–code pair construction engine that generates high-quality image–code pairs via three complementary pipelines; (2) two code-grounded training tasks (Code-Grounded Caption Generation and STEM Image-to-Code Translation); and (3) a two-stage SFT+RL post-training procedure. The framework is built upon the Qwen3-VL model series.
### Key Designs
- **Image–Code Pair Construction Engine (Three Complementary Pipelines)**
- Function: Generate large-scale, high-quality image–code paired data from existing STEM sources.
- Mechanism: Three parallel pipelines are employed:
  - (i) Image Reproduction: for each seed image, a detailed description is first generated, then used together with the image to produce matplotlib reconstruction code.
  - (ii) Image Diversity: the underlying scientific principle \(G_{\text{principle}}\) is extracted from a seed image, and \(K\) distinct visual instantiations are generated from the same principle (e.g., deriving circular spinners, triangular arrangements, and grid diagrams from a domino puzzle).
  - (iii) Solid Geometry: to address fundamental deficiencies of LLMs in generating spatial-geometry code, a parameterized template library covering eight categories (cube unfolding/folding, orthographic three-view drawings, cross-section analysis, polyhedra construction, etc.) is constructed, and samples are generated by parameter sampling (a sampling sketch follows this list).
  - A unified quality-control mechanism filters by image quality \(Q_I\), code quality \(Q_C\), and image–code consistency \(Q_{IC}\).
- Design Motivation: Image Reproduction is constrained by the diversity of source images; Image Diversity overcomes this bottleneck by abstracting principles and re-instantiating them; Solid Geometry compensates for LLMs' weakness in spatial reasoning code generation.
- **Code-Grounded Caption Generation**
- Function: Leverage ground-truth code to eliminate hallucinations in MLLM-generated captions.
- Mechanism: A three-step pipeline is used (a refinement sketch follows this list):
  - (1) Native Caption: the MLLM generates a descriptive draft \(t_{\text{draft}}\) directly from the image (linguistically natural but potentially factually incorrect).
  - (2) Code Analysis: the code, together with an execution tracer \(\xi(\mathbf{c})\) that records precise coordinates, dimensions, z-order levels, and all rendering details, is used to extract verified visual facts \(t_{\text{code}}\).
  - (3) Code-Grounded Refinement: using the code-analysis results as reference, quantitative and spatial-relationship errors in the draft are surgically corrected while preserving the original linguistic style and fluency. Formally: \(t_{\text{new}} = G_{\text{refine}}(G_{\text{caption}}(\mathbf{x}), G_{\text{analyze}}(\mathbf{c}, \xi(\mathbf{c})))\).
- Design Motivation: The execution tracer resolves the problem of "directly analyzing complex code being too difficult for LLMs"—even when code involves deep recursion and nested loops, execution logs provide deterministic rendering information as factual references.
- **STEM Image-to-Code Translation**
- Function: Train models to directly generate executable reconstruction code from images, serving as a complementary perceptual signal to natural language.
- Mechanism: The MLLM first generates an interpretive code draft \(c_{\text{draft}}\) from the image (with step-by-step decomposition and parameter explanations but potential factual errors), then the ground-truth code is used to correct errors while retaining the interpretive structure, yielding \(c_{\text{new}} = G_{\text{refine}}(G_{\text{code}}(\mathbf{x}), \mathbf{c})\).
- Design Motivation: Code provides a structured visual description complementary to natural language—geometric relationships, mathematical constraints, and structural details that natural language cannot adequately express are represented deterministically through programming constructs, resolving the descriptive aphasia problem.
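The Solid Geometry pipeline's template idea, referenced in the first design item above, can be sketched as parameter sampling over hand-written drawing templates followed by the unified quality gate; the template body, parameter ranges, and threshold below are illustrative assumptions, not the paper's actual library:

```python
import random

def cube_unfolding_code(edge: float, net_id: int) -> str:
    """Illustrative template from the 'cube unfolding' category: emits
    matplotlib code for one of the 11 cube nets, scaled by `edge`."""
    return (
        "import matplotlib.pyplot as plt\n"
        f"# cube net #{net_id}, edge length {edge:.2f}\n"
        "# ... deterministic square-placement drawing code ...\n"
    )

TEMPLATES = {"cube_unfolding": cube_unfolding_code}  # the paper covers 8 categories

def sample_filtered_pair(rng: random.Random, render, q_i, q_c, q_ic, tau=0.8):
    """Draw (image, code) pairs until one passes the Q_I / Q_C / Q_IC gate."""
    while True:
        name = rng.choice(list(TEMPLATES))
        code = TEMPLATES[name](edge=rng.uniform(1.0, 3.0), net_id=rng.randrange(11))
        image = render(code)  # execute the code, return the rendered figure
        if min(q_i(image), q_c(code), q_ic(image, code)) >= tau:
            return name, image, code
```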
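The code-grounded refinement step (the sketch promised in the second design item) can likewise be approximated in a few lines, assuming generic `mllm`/`llm` callables; here the execution tracer \(\xi(\mathbf{c})\) is crudely emulated by exec-ing the reconstruction code headlessly and dumping matplotlib's artist list:

```python
def execution_trace(code: str) -> str:
    """xi(c): run the reconstruction code and dump deterministic rendering facts."""
    import matplotlib
    matplotlib.use("Agg")            # headless backend, no display needed
    import matplotlib.pyplot as plt
    exec(code, {})                   # trusted, QC-filtered code only
    facts = []
    for ax in plt.gcf().get_axes():
        for artist in ax.get_children():
            facts.append(f"{type(artist).__name__} zorder={artist.get_zorder()}")
    plt.close("all")
    return "\n".join(facts)

def code_grounded_caption(image, code, mllm, llm):
    t_draft = mllm(image, "Describe this STEM figure in detail.")   # G_caption(x)
    t_code = llm("Extract verified visual facts from this code and its trace:\n"
                 f"{code}\n\nTrace:\n{execution_trace(code)}")      # G_analyze(c, xi(c))
    return llm("Correct factual errors in the draft using the verified facts, "
               "preserving its style.\n\n"
               f"Draft:\n{t_draft}\n\nFacts:\n{t_code}")            # G_refine
```

The image-to-code task's refinement \(c_{\text{new}} = G_{\text{refine}}(G_{\text{code}}(\mathbf{x}), \mathbf{c})\) follows the same pattern, with the ground-truth code rather than the trace-derived facts as the reference.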
### Loss & Training
- Stage 1 (SFT): Based on Qwen3-VL, jointly trained on ICC-1M for both image captioning and image-to-code translation tasks; 1 epoch on 32 A100 GPUs.
- Stage 2 (RL): GRPO reinforcement learning applied exclusively to the code generation task on 10,000 selected samples. The reward function includes a format reward \(r_{\text{fmt}}\) (whether the code block format is correct) and a content reward \(r_{\text{cnt}}\) (execution success rate + GPT-4o-evaluated code semantic equivalence + image visual similarity).
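A hedged sketch of the composite GRPO reward; the regex for the format check, the equal weighting inside \(r_{\text{cnt}}\), and the callable interfaces are assumptions beyond what the summary states:

```python
import re

def format_reward(response: str) -> float:
    """r_fmt: 1 if the response contains a properly fenced Python code block."""
    return 1.0 if re.search(r"```python\n.*?```", response, re.S) else 0.0

def content_reward(code, ref_image, run_code, judge_equiv, visual_sim) -> float:
    """r_cnt: execution success + GPT-4o-judged semantic equivalence + visual
    similarity of the rendered image (equal weighting is our assumption)."""
    ok, rendered = run_code(code)          # (success flag, rendered image or None)
    if not ok:
        return 0.0
    return (1.0 + judge_equiv(code) + visual_sim(rendered, ref_image)) / 3.0

def reward(response, code, ref_image, **fns):
    return format_reward(response) + content_reward(code, ref_image, **fns)
```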
## Key Experimental Results

### Main Results (STEM2Code-Eval Image Reconstruction)
| Model | Image Score | Code Score | Avg | Exec Rate |
|---|---|---|---|---|
| Gemini2.5-Pro-Thinking | 68.89 | 75.41 | 72.15 | 91.7% |
| GPT5-Thinking | 64.97 | 64.98 | 64.98 | 96.6% |
| Qwen3-VL-4B-Instruct | 24.55 | 26.42 | 25.49 | 79.4% |
| CodePercept-4B-S1 | 38.13 | 43.43 | 40.78 | 80.7% |
| CodePercept-4B-R1 | 47.17 | 45.86 | 46.52 | 91.3% |
| Qwen3-VL-32B-Instruct | 36.85 | 39.98 | 38.42 | 81.8% |
| CodePercept-32B-R1 | 68.97 | 62.53 | 65.75 | 95.9% |
### Perception Capability Evaluation (Captioner–Solver Setup, LLM Solver: Qwen3-30A3-Thinking)
| Model (Captioner) | MathVision | MathVista | MathVerse | DynaMath | WeMath | LogicVista | Avg |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B-Instruct | 54.21 | 67.30 | 64.59 | 69.40 | 46.10 | 54.14 | 59.29 |
| CodePercept-4B-S1 | 57.63 | 69.60 | 65.59 | 71.38 | 47.81 | 60.40 | 62.07 |
| Qwen3-VL-32B-Instruct | 58.55 | 72.20 | 71.09 | 75.78 | 48.00 | 62.19 | 64.63 |
| CodePercept-32B-S1 | 62.27 | 72.90 | 71.70 | 77.41 | 54.19 | 65.33 | 67.30 |
### Ablation Study
| Data Configuration | Effect (qualitative) |
|---|---|
| Image Reproduction only | baseline |
| + Image Diversity | significant gain on MathVision and on average |
| + Solid Geometry | further improvement |
| + CodeCap (code-grounded captioning) | additional gain; captions and code are complementary |
| + ImCode (image-to-code) | best overall |
## Key Findings
- CodePercept-4B-R1 improves Image Score on STEM2Code-Eval from 24.55→47.17 (+92%) and Exec Rate from 79.4%→91.3%, demonstrating the effectiveness of the RL stage in improving code quality.
- CodePercept-32B-R1 achieves an Avg Score of 65.75, surpassing GPT5-Thinking's 64.98 with an open-source model and leaving only Gemini2.5-Pro-Thinking (72.15) ahead.
- Perception gains directly transfer to downstream reasoning: when CodePercept-32B-S1 serves as the captioner, downstream reasoning improves by an average of 2.7 points.
- Code and captions are complementary: training on captions alone or code alone is inferior to joint training.
## Highlights & Insights
- Scaling analysis reveals "perception is the bottleneck": The field has been intensely focused on reasoning (RL, chain-of-thought, reward design), yet it is perception that truly constrains STEM performance. This finding may redirect the research priorities of the entire field.
- Paradigm shift: code as a perceptual medium: Code serves as a "verifiable, executable structured caption." Spatial relationships and precise numerical values that natural language cannot adequately describe are represented deterministically in code. The execution tracer further addresses the problem of "LLMs being unable to comprehend complex code."
- The "principle abstraction → re-instantiation" strategy of the Image Diversity Pipeline is an efficient data augmentation approach—not simple augmentation, but concept-level diversification that preserves scientific rigor, and is transferable to data construction in other scientific domains.
## Limitations & Future Work
- The STEM2Code-Eval benchmark contains only 1,000 samples; coverage of STEM sub-domains and difficulty distributions may be insufficient.
- Code reconstruction relies on matplotlib and is not applicable to non-2D visualization content (e.g., real experimental photographs, microscopic images).
- The RL reward function depends on GPT-4o scoring, introducing bias from an external model.
- The solid geometry template library is manually designed, and its coverage is limited by the number of template categories.
## Related Work & Insights
- vs. conventional STEM reasoning enhancement (KeyeVL, InternS1): These methods focus on the reasoning side (RL, chain-of-thought); this paper demonstrates that the perception side is the true lever—equal computational investment in perception enhancement yields greater returns.
- vs. knowledge distillation methods: The conventional approach of generating captions using stronger models is limited by teacher model hallucinations; CodePercept uses code execution results as ground truth to eliminate hallucinations, providing a more reliable knowledge transfer mechanism.
- vs. domain-specific code generation (UI-to-code, Chart-to-code): These works target downstream applications; CodePercept's image–code pairs serve dual value for both evaluation and training.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Scaling analysis revealing the perception bottleneck, combined with code as a perceptual medium, constitutes a paradigm-level innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Six STEM benchmarks + STEM2Code-Eval + ablations, though RL ablations lack sufficient granularity.
- Writing Quality: ⭐⭐⭐⭐ Clear logical structure; scaling analysis figures are highly convincing.
- Value: ⭐⭐⭐⭐⭐ Likely to redirect STEM multimodal research from the reasoning side toward the perception side.