Can Vision-Language Models Evaluate Handwritten Math?
Conference: ACL 2025
arXiv: 2501.07244
Code: AI4Bharat/FERMAT
Area: Multimodal VLM
Keywords: Handwritten Math Evaluation, Error Detection, Error Localization, Error Correction, VLM Benchmark
TL;DR
This paper proposes FERMAT, a benchmark that systematically evaluates how well 9 VLMs detect, localize, and correct errors in handwritten mathematical content. Built from 609 manually curated Grade 7-12 math problems and over 2,200 handwritten perturbed solutions (covering computation, conceptual, notation, and formatting errors, plus surface changes that leave the solution correct), the evaluation shows Gemini-1.5-Pro reaching the highest error-correction accuracy of 77% (with an OCR step), yet all models still struggle with handwritten content.
Background & Motivation
VLMs hold immense potential in education, particularly for automated grading of handwritten math assignments. An earlier OpenAI demo of GPT-4 evaluating handwritten math drew widespread attention. However, several critical issues remain:
Lack of Systematic Evaluation: Although VLMs have made progress in mathematical reasoning, comprehensive research on their ability to evaluate handwritten mathematical content is still lacking.
Limitations of Prior Work: Existing multimodal evaluation benchmarks focus on simple scenarios with printed text and images, or only handle single-line mathematical expression OCR, failing to address multi-line handwritten derivations and complex mathematical notation.
Specific Challenges of Handwritten Content: Varied handwriting styles, inconsistent writing quality, and diverse image conditions pose additional challenges for VLMs.
Key Challenge: VLMs are presented as having visual understanding capabilities, but how well do their reasoning and evaluation abilities actually hold up on the highly varied handwritten mathematical content found in real-world educational scenarios?
Core Idea: Build a handwritten math error evaluation benchmark based on educational scenarios to systematically test the "detect \(\rightarrow\) localize \(\rightarrow\) correct" error-capability chain of VLMs through controlled perturbation and manual handwritten transcription.
Method
Overall Architecture
The construction of FERMAT consists of four stages:
1. Problem collection (math textbooks + competition problems)
2. Designing a perturbation taxonomy (5 perturbation axes)
3. Human-AI collaborative perturbation generation (GPT-4o generation + manual verification)
4. Handwritten transcription (43 annotators + quality audit)
Key Designs
Problem Collection and Processing:
- Hand-collected approximately 850 mathematical problems with detailed step-by-step solutions from Grade 7-12 textbooks.
- Covers 7 major areas (Arithmetic, Algebra, Geometry & Measurement, Geometry, Probability & Statistics, Trigonometry, and Calculus) and over 50 sub-topics.
- Additionally collected competition MCQs focusing on practical mathematics (e.g., profit/loss, time/work, data interpretation).
- Utilized GPT-4o to convert the problem images to LaTeX format, followed by manual validation, yielding 609 high-quality LaTeX question-solution pairs \((Q, A_{\text{gold}})\).
Perturbation Taxonomy (5 Axes: 4 Error Types + Surface Perturbations):
- Calculation Errors (CO, 611 cases): Final numerical error, intermediate calculation error, non-propagated step error, propagated step error, transcribing error.
- Conceptual Errors (CP, 609 cases): Theorem misuse, misinterpreting the question, invalid assumption, obviously incorrect facts, formula misuse.
- Notation Errors (NO, 255 cases): Notation slips (e.g., x²→x2), operator swaps (e.g., +→×), misplaced brackets.
- Formatting Errors (PR, 429 cases): Ignoring formatting requirements, term swap, incorrect logical order, contextual substitution, variable naming error, units error.
- Surface Perturbations (SU, 340 cases): Modifications that do not affect correctness (variable name changes, omitted steps, added irrelevant information), used to test whether VLMs falsely flag correct solutions as erroneous.
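
The counts above can be tallied to see how the detection classes are balanced. The following is a minimal sketch, assuming every CO/CP/NO/PR sample contains a real error and every SU sample is error-free; the dictionary layout and variable names are illustrative and not taken from the FERMAT code.

```python
# Counts copied from the taxonomy above; the data layout is an assumption.
PERTURBATION_COUNTS = {
    "CO": 611,  # calculation errors
    "CP": 609,  # conceptual errors
    "NO": 255,  # notation errors
    "PR": 429,  # formatting errors
    "SU": 340,  # surface perturbations (solution stays correct)
}

# Assumption: only SU leaves the solution correct, so it is the negative class.
NEGATIVE_AXES = {"SU"}

total = sum(PERTURBATION_COUNTS.values())
negatives = sum(n for axis, n in PERTURBATION_COUNTS.items() if axis in NEGATIVE_AXES)
positives = total - negatives

print(f"total={total}, with_error={positives}, without_error={negatives}")
print(f"share of error-free samples: {negatives / total:.1%}")
```

Under these assumptions the error-free class is a small minority, which is consistent with the benchmark reporting balanced accuracy rather than plain accuracy for detection.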
Human-AI Collaborative Perturbation Generation:
- Used GPT-4o to perturb correct solutions based on perturbation types, instructions, and 3 in-context examples.
- Manually verified all perturbed outputs: checked if the perturbation aligned with the specified category, if the reasoning was logical, and if the final answer was correctly altered.
- Further classified perturbations into real errors or surface variations.
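
To make the generation step concrete, here is a hedged sketch of how a perturbation prompt could be assembled from a perturbation type, an instruction, and 3 in-context examples. The `Example` class, the `build_perturbation_prompt` function, and the prompt wording are all hypothetical; the paper's actual prompts are not reproduced here, and no model call is made.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One in-context demonstration: a correct solution and its perturbed form."""
    original: str
    perturbed: str


def build_perturbation_prompt(
    solution: str,
    perturbation_type: str,
    instruction: str,
    examples: List[Example],
) -> str:
    """Assemble a few-shot prompt (hypothetical wording) for one perturbation."""
    if len(examples) != 3:
        raise ValueError("the setup described above uses 3 in-context examples")
    parts = [
        f"Perturbation type: {perturbation_type}",
        f"Instruction: {instruction}",
        "",
    ]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i} original:",
            ex.original,
            f"Example {i} perturbed:",
            ex.perturbed,
            "",
        ]
    parts += ["Solution to perturb:", solution, "Perturbed solution:"]
    return "\n".join(parts)


demo_examples = [Example("2 + 3 = 5", "2 + 3 = 6")] * 3
prompt = build_perturbation_prompt(
    "x = 4 \\cdot 5 = 20",
    "CO",
    "Introduce a single intermediate calculation error.",
    demo_examples,
)
print(prompt)
```

The returned string would then be sent to the generator model, and its output passed to the manual verification described above.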
Handwritten Transcription and Validation:
- Hand-transcribed by 43 annotators from diverse demographic backgrounds.
- Utilized different paper types, pen colors, and ink types.
- Captured photos via mobile phones and uploaded them to a centralized platform.
- Recorded metadata: readability, image orientation, and overall quality.
- Developed dedicated validation tools for quality auditing.
Evaluation Task Design (Ascending Difficulty):
- Error Detection (ED): Determine if an error exists in the image (binary) and provide a reasoning process.
- Error Localization (EL): Identify the specific line where the error occurs (harder than ED).
- Error Correction (EC): Output the complete corrected solution in LaTeX; by design this is the most demanding of the three tasks.
- Two variants exist for each task: processing the handwritten image directly, or performing OCR first and then processing (+OCR variant).
- Cascade Setting: Sequential execution of ED \(\rightarrow\) EL \(\rightarrow\) EC, where the output of the previous stage serves as input for the next.
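
The cascade setting can be sketched as a chain of three callables in which a "no error" verdict from ED stops the pipeline. This is an illustrative sketch only: `run_cascade`, the stage signatures, and the stub stages are assumptions standing in for real VLM calls, not the benchmark's implementation.

```python
from typing import Callable, Optional, Tuple

Detection = Tuple[bool, str]  # (error found?, reasoning text)


def run_cascade(
    image: str,
    detect: Callable[[str], Detection],
    localize: Callable[[str, str], int],
    correct: Callable[[str, int], str],
) -> Optional[str]:
    """Run ED -> EL -> EC, feeding each stage's output to the next."""
    has_error, reasoning = detect(image)
    if not has_error:
        return None  # ED reports no error, so EL and EC never run
    line = localize(image, reasoning)
    return correct(image, line)


# Stub stages standing in for model calls (hypothetical behaviour).
def conservative_detect(image: str) -> Detection:
    return (False, "solution looks fine")


def eager_detect(image: str) -> Detection:
    return (True, "sign flips in the second step")


def stub_localize(image: str, reasoning: str) -> int:
    return 2


def stub_correct(image: str, line: int) -> str:
    return f"corrected solution (fix applied at line {line})"


missed = run_cascade("sample.jpg", conservative_detect, stub_localize, stub_correct)
fixed = run_cascade("sample.jpg", eager_detect, stub_localize, stub_correct)
print(missed)
print(fixed)
```

With the conservative detector the erroneous sample never reaches localization or correction, which illustrates how a cautious first stage caps the performance of the whole chain.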
Loss & Training
As this work introduces an evaluation benchmark, it does not involve model training.
- ED uses Balanced Accuracy (BACC) to account for class imbalance (positive vs. negative samples).
- EL and EC employ GPT-4o as the evaluator, achieving a 94% agreement rate with human evaluation.
- All models use the same prompt and a temperature of 0 to ensure reproducibility.
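
Balanced accuracy is the mean of the true-positive rate and the true-negative rate, \(\text{BACC} = \tfrac{1}{2}(\text{TPR} + \text{TNR})\). The sketch below uses made-up toy labels, not benchmark data, to show why it is preferred over plain accuracy when one class dominates.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (tp / pos + tn / neg)


# Toy data: 8 erroneous solutions, 2 correct ones (True = "contains an error").
y_true = [True] * 8 + [False] * 2
always_error = [True] * 10  # a degenerate model that flags everything

plain_accuracy = sum(t == p for t, p in zip(y_true, always_error)) / len(y_true)
bacc = balanced_accuracy(y_true, always_error)
print(f"accuracy={plain_accuracy:.2f}, balanced accuracy={bacc:.2f}")
```

A model that flags every solution looks strong under plain accuracy on this toy set but scores at chance level under BACC, so the ED scores in the results table should be read against a 0.5 baseline.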
Key Experimental Results
Main Results
| Model | ED(BACC) | ED+OCR | EL(ACC) | EL+OCR | EC(ACC) | EC+OCR | Cascade |
|---|---|---|---|---|---|---|---|
| Gemini-1.5-Pro | 0.63 | 0.67 | 0.43 | 0.56 | 0.76 | 0.77 | 0.50 |
| GPT-4o | 0.65 | 0.64 | 0.45 | 0.50 | 0.66 | 0.71 | 0.45 |
| Llama-3.2-90B | 0.52 | 0.62 | 0.18 | 0.41 | 0.25 | 0.57 | 0.31 |
| Phi-3.5-VI | 0.52 | 0.51 | 0.06 | 0.09 | 0.15 | 0.12 | 0.11 |
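
The effect of the OCR step can be read off the table by subtracting each direct-image score from its +OCR counterpart. The sketch below does this for the four rows shown; the `RESULTS` layout and the `ocr_gain` helper are illustrative names, and the numbers are copied from the table.

```python
# (direct image score, +OCR score) per task, copied from the table above.
RESULTS = {
    "Gemini-1.5-Pro": {"ED": (0.63, 0.67), "EL": (0.43, 0.56), "EC": (0.76, 0.77)},
    "GPT-4o":         {"ED": (0.65, 0.64), "EL": (0.45, 0.50), "EC": (0.66, 0.71)},
    "Llama-3.2-90B":  {"ED": (0.52, 0.62), "EL": (0.18, 0.41), "EC": (0.25, 0.57)},
    "Phi-3.5-VI":     {"ED": (0.52, 0.51), "EL": (0.06, 0.09), "EC": (0.15, 0.12)},
}


def ocr_gain(model: str, task: str) -> float:
    """Score change from adding the OCR step (positive means OCR helped)."""
    direct, with_ocr = RESULTS[model][task]
    return round(with_ocr - direct, 2)


for model, tasks in RESULTS.items():
    gains = ", ".join(f"{task} {ocr_gain(model, task):+.2f}" for task in tasks)
    print(f"{model}: {gains}")
```

Among these rows, Llama-3.2-90B shows the largest gains, while Phi-3.5-VI and GPT-4o each lose slightly on at least one task, so the benefit of OCR is common but not universal.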
Ablation Study
| Setting | GPT-4o ED(BACC) | Description |
|---|---|---|
| Base | 0.658 | Basic prompt |
| L1 | 0.670 | Add grade/area/sub-domain |
| L2 | 0.676 | L1 + all perturbation descriptions and examples |
| L3 | 0.691 | L1 + specific perturbation category |
| L4 | 0.702 | L3 + erroneous solution examples and explanations |
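
The ablation scores can be restated as gains over the basic prompt. A small sketch with the values copied from the table; the variable names are illustrative.

```python
# GPT-4o ED (BACC) per prompt setting, copied from the ablation table.
ABLATION = [
    ("Base", 0.658),
    ("L1", 0.670),
    ("L2", 0.676),
    ("L3", 0.691),
    ("L4", 0.702),
]

base_score = ABLATION[0][1]
gains = {name: round(score - base_score, 3) for name, score in ABLATION[1:]}

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f} BACC over Base")
```

Note that L2 and L3 both extend L1 rather than each other, so the table compares two alternative ways of adding perturbation information, with L4 building on L3.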
Key Findings
- Among the models tabulated above, Gemini-1.5-Pro is strongest in error correction (0.76 direct, 0.77 with OCR). GPT-4o leads detection and localization on direct image input (0.65 BACC and 0.45 ACC), while Gemini-1.5-Pro overtakes it on both tasks once OCR is added (0.67 and 0.56).
- The OCR step is generally beneficial: Pixtral-124B and Llama-3.2-90B improve substantially with OCR (e.g., Llama-3.2-90B's EC accuracy rises from 0.25 to 0.57), suggesting that strong OCR compensates for weaker multimodal reasoning, while GPT-4o and Gemini-1.5-Pro gain only marginally, consistent with their already strong intrinsic multimodal comprehension.
- The cascade setting unexpectedly leads to performance drops: This is primarily because conservative detection behavior during the ED stage filters out a large number of images before they reach the localization and correction stages.
- Additional information indeed aids VLMs: From Base to L4, the ED performance of GPT-4o increases from 0.658 to 0.702.
- Handwritten content remains the core challenge: Replacing handwritten images with printed LaTeX-rendered images or direct text inputs consistently boosts performance, with the most substantial gain occurring when switching from image to text input.
Highlights & Insights
- Fills the gap in evaluating VLMs on handwritten mathematics, keeping closely aligned with real-world educational scenarios.
- Outlines a comprehensive perturbation taxonomy design, incorporating the key category of "surface perturbations" to test false positive rates.
- Features handwriting diversity from 43 annotators, ensuring the ecological validity of the benchmark.
- Proposes an ascending task difficulty design (ED \(\rightarrow\) EL \(\rightarrow\) EC) which clearly exposes model bottlenecks within the evaluation chain.
- Conducts comparative experiments with OCR variants, revealing intriguing differences in model behavior: stronger models rely more on end-to-end multimodal understanding, while weaker models benefit more from explicit OCR steps.
Limitations & Future Work
- The perturbation categories may not be exhaustive, leaving some real-world student error patterns uncovered.
- Mainly focuses on school-level mathematics, without covering more advanced mathematical domains.
- Has not explored multi-agent approaches for error detection.
- Information propagation in cascade settings can introduce error accumulation; more robust multi-step evaluation pipelines warrant investigation.
- Future studies could examine performance disparities of VLMs across different handwriting styles (e.g., neat vs. messy) and varying image qualities.
- Personalized feedback generation (providing pedagogical explanations rather than just detecting errors) is an avenue worth exploring.
Related Work & Insights
- Conceptual extension of the CheckList framework (Ribeiro et al., 2020): expanding model behavior testing from text models to multimodal math evaluation.
- Related LLM evaluation benchmarks such as FBI, MathCheck, and DUPE: FERMAT uniquely focuses on handwritten visual inputs.
- Multimodal error detection benchmarks like ErrorRadar: FERMAT features more fine-grained perturbations and a broader range of error categories.
- Studies on LLM error correction capabilities (e.g., Li et al., 2024): Text-based LLMs exhibit weak detection yet strong correction abilities; FERMAT observes a similar trend in VLMs.
Rating
- Novelty: ⭐⭐⭐⭐ Evaluating handwritten mathematics is a practical and under-explored direction, though the overall evaluation framework remains standard.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated 9 VLMs with multiple evaluation strategies and ablation studies, but lacks a fine-grained analysis of different handwriting qualities.
- Writing Quality: ⭐⭐⭐⭐ Clear task definitions and experimental setups, with highly informative figures/tables.
- Value: ⭐⭐⭐⭐ Holds direct reference value for educational technology applications; the perturbation taxonomy is highly reusable.