Can Vision-Language Models Evaluate Handwritten Math?¶

Conference: ACL 2025
arXiv: 2501.07244
Code: AI4Bharat/FERMAT
Area: Multimodal VLM
Keywords: Handwritten Math Evaluation, Error Detection, Error Localization, Error Correction, VLM Benchmark

TL;DR¶

This paper proposes the FERMAT benchmark to systematically evaluate the error detection, localization, and correction capabilities of 9 VLMs on handwritten mathematical content. Using 609 manually curated Grade 7-12 math problems alongside over 2,200 handwritten erroneous solutions (covering computation, conceptual, notation, and formatting errors), the evaluation reveals that Gemini-1.5-Pro achieves the highest correction rate of 77%, though all models still face significant challenges when processing handwritten content.

Background & Motivation¶

VLMs hold immense potential in education, particularly in automated grading of handwritten math assignments. OpenAI's earlier GPT-4 demo of evaluating handwritten math drew widespread attention. However, several critical issues remain:

Lack of Systematic Evaluation: Although VLMs have made progress in mathematical reasoning, comprehensive research on their ability to evaluate handwritten mathematical content is still lacking.

Limitations of Prior Work: Existing multimodal evaluation benchmarks focus on simple scenarios with printed text and images, or only handle single-line mathematical expression OCR, failing to address multi-line handwritten derivations and complex mathematical notation.

Specific Challenges of Handwritten Content: Varied handwriting styles, inconsistent writing quality, and diverse image conditions pose additional challenges for VLMs.

Key Challenge: VLMs are presented as having visual understanding capabilities, but how well do their reasoning and evaluation abilities actually hold up on highly varied handwritten mathematical content from real-world educational scenarios?

Core Idea: Build a handwritten math error evaluation benchmark based on educational scenarios to systematically test the "detect \(\rightarrow\) localize \(\rightarrow\) correct" error-capability chain of VLMs through controlled perturbation and manual handwritten transcription.

Method¶

Overall Architecture¶

The construction of FERMAT consists of four stages:
1. Problem collection (math textbooks + competition problems)
2. Designing a perturbation taxonomy (5 error axes)
3. Human-AI collaborative perturbation generation (GPT-4o generation + manual verification)
4. Handwritten transcription (43 annotators + quality audit)

Key Designs¶

Problem Collection and Processing:
- Hand-collected approximately 850 mathematical problems with detailed step-by-step solutions from Grade 7-12 textbooks.
- Covers 7 major areas (Arithmetic, Algebra, Geometry & Measurement, Geometry, Probability & Statistics, Trigonometry, and Calculus) and over 50 sub-topics.
- Additionally collected competition MCQs focusing on practical mathematics (e.g., profit/loss, time/work, data interpretation).
- Utilized GPT-4o to convert the problem images to LaTeX format, followed by manual validation, yielding 609 high-quality LaTeX pairs (Q, A_gold).
Perturbation Taxonomy (5 Error Axes):
- Computational Errors (CO, 611 cases): Final numerical error, intermediate calculation error, non-propagated step error, propagated step error, transcribing error.
- Conceptual Errors (CP, 609 cases): Theorem misuse, misinterpreting the question, invalid assumption, obviously incorrect facts, formula misuse.
- Notation Errors (NO, 255 cases): Notation errors (x²→x2), operator swap (+→×), misplaced brackets.
- Presentation (Formatting) Errors (PR, 429 cases): Ignoring formatting requirements, term swap, incorrect logical order, contextual substitution, variable naming error, units error.
- Surface Perturbations (SU, 340 cases): Modifications that do not affect correctness (variable name changes, omitted steps, irrelevant information addition), used to test whether VLMs wrongly flag correct solutions as erroneous (false positives).
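The case counts in the taxonomy can be tallied to check them against the "over 2,200" figure from the TL;DR. The snippet below is a minimal sketch: the dictionary name and the helper `summarize` are illustrative, and it assumes (as a simplification) that only the SU axis contains solutions that remain correct.

```python
# Case counts per perturbation axis, copied from the taxonomy list above.
PERTURBATION_COUNTS = {
    "CO": 611,  # computational (calculation) errors
    "CP": 609,  # conceptual errors
    "NO": 255,  # notation errors
    "PR": 429,  # presentation (formatting) errors
    "SU": 340,  # surface perturbations: the solution stays correct
}


def summarize(counts):
    """Return (total cases, erroneous cases, still-correct cases).

    Assumption: only the SU axis holds solutions that remain correct.
    """
    total = sum(counts.values())
    still_correct = counts["SU"]
    return total, total - still_correct, still_correct


total, erroneous, still_correct = summarize(PERTURBATION_COUNTS)
print(f"total={total}, erroneous={erroneous}, still correct={still_correct}")
```

Under this assumption the erroneous cases outnumber the still-correct ones by more than five to one, which is the kind of class imbalance that a balanced metric is meant to handle in error detection.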
Human-AI Collaborative Perturbation Generation:
- Used GPT-4o to perturb correct solutions based on perturbation types, instructions, and 3 in-context examples.
- Manually verified all perturbed outputs: checked if the perturbation aligned with the specified category, if the reasoning was logical, and if the final answer was correctly altered.
- Further classified perturbations into real errors or surface variations.
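The generation step above can be pictured as assembling a few-shot prompt from the perturbation type, an instruction, and three demonstrations. The sketch below is purely illustrative: the `InContextExample` schema, the function name, and the prompt wording are assumptions and are not taken from the FERMAT codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InContextExample:
    """One demonstration: a gold solution and its perturbed counterpart (hypothetical schema)."""
    gold_solution: str
    perturbed_solution: str


def build_perturbation_prompt(
    perturbation_type: str,
    instruction: str,
    examples: List[InContextExample],
    question: str,
    gold_solution: str,
) -> str:
    """Assemble a few-shot prompt asking a model to inject one perturbation type."""
    if len(examples) != 3:
        # The summary above mentions 3 in-context examples per request.
        raise ValueError("expected exactly 3 in-context examples")
    parts = [
        f"Perturbation type: {perturbation_type}",
        f"Instruction: {instruction}",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nCorrect solution:\n{ex.gold_solution}\n"
            f"Perturbed solution:\n{ex.perturbed_solution}"
        )
    # The target problem comes last; the model completes the final field.
    parts.append(
        f"Question:\n{question}\nCorrect solution:\n{gold_solution}\n"
        "Perturbed solution:"
    )
    return "\n\n".join(parts)
```

The returned string would then be sent to the generator model, and every output would still go through the manual verification described above.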
Handwritten Transcription and Validation:
- Hand-transcribed by 43 annotators from diverse demographic backgrounds.
- Utilized different paper types, pen colors, and ink types.
- Captured photos via mobile phones and uploaded them to a centralized platform.
- Recorded metadata: readability, image orientation, and overall quality.
- Developed dedicated validation tools for quality auditing.
Evaluation Task Design (Ascending Difficulty):
- Error Detection (ED): Determine if an error exists in the image (binary) and provide a reasoning process.
- Error Localization (EL): Identify the specific line where the error occurs (harder than ED).
- Error Correction (EC): Output the complete corrected solution in LaTeX; this is the most challenging of the three tasks.
- Two variants exist for each task: processing the handwritten image directly, or performing OCR first and then processing (+OCR variant).
- Cascade Setting: Sequential execution of ED \(\rightarrow\) EL \(\rightarrow\) EC, where the output of the previous stage serves as input for the next.
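The cascade setting can be sketched as three chained calls in which a "no error" verdict from ED ends the pipeline early. This is a minimal illustration, not the benchmark's implementation: the three stage callables stand in for VLM calls, and their signatures are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CascadeResult:
    has_error: bool
    error_line: Optional[int] = None
    corrected_latex: Optional[str] = None


def run_cascade(
    image: str,
    detect: Callable[[str], Tuple[bool, str]],   # ED: (has_error, reasoning)
    localize: Callable[[str, str], int],         # EL: line index of the error
    correct: Callable[[str, int], str],          # EC: corrected LaTeX solution
) -> CascadeResult:
    """Run ED -> EL -> EC, feeding each stage the previous stage's output."""
    has_error, reasoning = detect(image)
    if not has_error:
        # A "no error" verdict stops the cascade: EL and EC never see this image.
        return CascadeResult(has_error=False)
    line = localize(image, reasoning)
    fixed = correct(image, line)
    return CascadeResult(has_error=True, error_line=line, corrected_latex=fixed)


# Stub stages standing in for model calls (hypothetical outputs).
result = run_cascade(
    "solution_001.jpg",
    detect=lambda img: (True, "sign flipped in step 2"),
    localize=lambda img, why: 2,
    correct=lambda img, line: r"x = \frac{3}{2}",
)
print(result)
```

The early return makes the dependency explicit: any erroneous image that ED misses cannot be recovered by the later stages.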

Loss & Training¶

As this work introduces an evaluation benchmark, it does not involve model training.
- ED uses Balanced Accuracy (BACC) to account for class imbalance (positive vs. negative samples).
- EL and EC employ GPT-4o as the evaluator, achieving a 94% agreement rate with human evaluation.
- All models use the same prompt and a temperature of 0 to ensure reproducibility.
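Balanced accuracy is the mean of the true-positive rate and the true-negative rate, so a model that labels every solution as erroneous scores only 0.5 regardless of how skewed the classes are. The snippet below is a small self-contained sketch of the standard binary definition; the toy labels are made up for illustration and are not benchmark data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = error present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # recall on erroneous solutions
    specificity = tn / (tn + fp)  # recall on correct solutions
    return (sensitivity + specificity) / 2


# Toy data: 8 erroneous and 2 correct solutions; the model always says "error".
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # plain accuracy rewards the majority guess
print(balanced_accuracy(y_true, y_pred))  # balanced accuracy does not
```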

Key Experimental Results¶

Main Results¶

Model	ED(BACC)	ED+OCR	EL(ACC)	EL+OCR	EC(ACC)	EC+OCR	Cascade
Gemini-1.5-Pro	0.63	0.67	0.43	0.56	0.76	0.77	0.50
GPT-4o	0.65	0.64	0.45	0.50	0.66	0.71	0.45
Llama-3.2-90B	0.52	0.62	0.18	0.41	0.25	0.57	0.31
Phi-3.5-VI	0.52	0.51	0.06	0.09	0.15	0.12	0.11
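To make the effect of the OCR step easier to read off, the per-task difference between the +OCR and direct-image scores can be computed from the rows above. This is a small sketch over the tabulated numbers only; the dictionary layout and the function name `ocr_gains` are illustrative.

```python
# (direct image, +OCR) scores per task, copied from the table above.
RESULTS = {
    "Gemini-1.5-Pro": {"ED": (0.63, 0.67), "EL": (0.43, 0.56), "EC": (0.76, 0.77)},
    "GPT-4o":         {"ED": (0.65, 0.64), "EL": (0.45, 0.50), "EC": (0.66, 0.71)},
    "Llama-3.2-90B":  {"ED": (0.52, 0.62), "EL": (0.18, 0.41), "EC": (0.25, 0.57)},
    "Phi-3.5-VI":     {"ED": (0.52, 0.51), "EL": (0.06, 0.09), "EC": (0.15, 0.12)},
}


def ocr_gains(results):
    """Return {model: {task: score with OCR minus score without}}, rounded to 2 places."""
    return {
        model: {task: round(with_ocr - direct, 2) for task, (direct, with_ocr) in tasks.items()}
        for model, tasks in results.items()
    }


gains = ocr_gains(RESULTS)
for model, per_task in gains.items():
    print(model, per_task)
```

For the rows shown, Llama-3.2-90B gains the most from OCR on every task, while GPT-4o's ED and Phi-3.5-VI's ED and EC scores drop slightly.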

Ablation Study¶

Setting	GPT-4o ED(BACC)	Description
Base	0.658	Basic prompt
L1	0.670	Add grade/area/sub-domain
L2	0.676	L1 + all perturbation descriptions and examples
L3	0.691	L1 + specific perturbation category
L4	0.702	L3 + erroneous solution examples and explanations

Key Findings¶

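Gemini-1.5-Pro is strongest in error correction (77% with OCR). Among the models tabulated above, GPT-4o scores highest in detection (0.65) and localization (0.45) on direct image input, whereas with the OCR step Gemini-1.5-Pro leads on both (0.67 and 0.56).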
The OCR step is generally beneficial: Pixtral-124B and Llama-3.2-90B show significant improvement with OCR (strong OCR capability compensates for weaker multimodal reasoning), while GPT-4o and Gemini-1.5-Pro mostly see smaller gains (attributed to their already strong intrinsic multimodal comprehension); GPT-4o's ED score even dips slightly, from 0.65 to 0.64.
The cascade setting unexpectedly leads to performance drops: This is primarily because conservative detection behavior during the ED stage filters out a large number of images, which then never reach the EL and EC stages.
Additional information indeed aids VLMs: From the Base prompt to L4, the ED performance of GPT-4o increases from 0.658 to 0.702.
Handwritten content remains the core challenge: Replacing handwritten images with printed LaTeX-rendered images or direct text inputs consistently boosts performance, with the most substantial gain occurring when switching from image to text input.

Highlights & Insights¶

Fills the gap in evaluating VLMs on handwritten mathematics, keeping closely aligned with real-world educational scenarios.
Outlines a comprehensive perturbation taxonomy design, incorporating the key category of "surface perturbations" to test false positive rates.
Features handwriting diversity from 43 annotators, ensuring the ecological validity of the benchmark.
Proposes an ascending task difficulty design (ED \(\rightarrow\) EL \(\rightarrow\) EC) which clearly exposes model bottlenecks within the evaluation chain.
Conducts comparative experiments with OCR variants, revealing intriguing differences in model behavior: stronger models rely more on end-to-end multimodal understanding, while weaker models benefit more from explicit OCR steps.

Limitations & Future Work¶

The perturbation categories may not be exhaustive, leaving more real-world student error patterns uncovered.
Mainly focuses on school-level mathematics, without covering more advanced mathematical domains.
Has not explored multi-agent approaches for error detection.
Information propagation in cascade settings can introduce error accumulation; more robust multi-step evaluation pipelines warrant investigation.
Future studies could examine performance disparities of VLMs across different handwriting styles (e.g., neat vs. messy) and varying image qualities.
Personalized feedback generation (providing pedagogical explanations rather than just detecting errors) is an avenue worth exploring.

Related Work¶

Conceptual extension of the CheckList framework (Ribeiro et al., 2020): expanding model behavior testing from text models to multimodal math evaluation.
Related LLM evaluation benchmarks such as FBI, MathCheck, and DUPE: FERMAT uniquely focuses on handwritten visual inputs.
Multimodal error detection benchmarks like ErrorRadar: FERMAT features more fine-grained perturbations and a broader range of error categories.
Studies on LLM error correction capabilities (e.g., Li et al., 2024): Text-based LLMs exhibit weak detection yet strong correction abilities; FERMAT observes a similar trend in VLMs.

Rating¶

Novelty: ⭐⭐⭐⭐ Evaluating handwritten mathematics is a practical and under-explored direction, though the overall evaluation framework remains standard.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated 9 VLMs with multiple evaluation strategies and ablation studies, but lacks a fine-grained analysis of different handwriting qualities.
Writing Quality: ⭐⭐⭐⭐ Clear task definitions and experimental setups, with highly informative figures/tables.
Value: ⭐⭐⭐⭐ Holds direct reference value for educational technology applications; the perturbation taxonomy is highly reusable.