
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

Conference: ACL 2025
arXiv: 2506.00064
Code: Not released (dataset will be released)
Area: LLM NLP / Error Handling and Evaluation
Keywords: Proactive Error Handling, Error Detection, Error Correction, Error Guidance, benchmark, SFT

TL;DR

Mis-prompt is a benchmark with four evaluation tasks, an error taxonomy of 4 primary and 14 secondary categories, and a dataset of 14,969 items. It systematically investigates how well LLMs handle errors proactively, that is, when the prompt contains no explicit error-handling instruction. Current LLMs prove severely lacking in this ability, whereas SFT improves it significantly.

Background & Motivation

Background: LLMs have made significant progress in error handling, including mathematical reasoning error correction (EIC-Math, ErrorRadar), grammatical error correction (GEC), fact-checking in summarization (SummEdits), etc. However, these works are all passive error handling—relying on explicit error-handling instructions provided by users in the prompt (e.g., "Please determine if the following content is correct").

Core Problem: In real-world scenarios, users usually do not provide explicit error-handling instructions. For example, if a user asks "After Napoleon's grand victory in the Battle of Waterloo in 1815...", there is an implicit historical error (Napoleon was defeated), but the user does not request the LLM to correct the error.

Key Findings: When facing user inputs that contain errors, GPT-4o often fails to proactively identify them and instead continues generating text based on the erroneous information, which undermines the trustworthiness and reliability of the system.

Related Work & Insights: Existing benchmarks (BIG-Bench Mistake, ReaLMistake, Medec, ProcessBench, etc.) are all passive and mostly cover only detection and identification. This paper defines and evaluates proactive error handling for the first time, covering four dimensions: detection, identification, correction, and guidance.

Method

Overall Architecture

The Mis-prompt framework consists of three parts: (1) an error taxonomy with 4 primary categories and 14 secondary categories; (2) four evaluation tasks: detection, identification, correction, and guidance; (3) a dataset of 14,969 annotated error prompts.

Four Evaluation Tasks

Error Detection: Determines if the prompt contains errors, outputting a binary label \(y \in \{\text{True}, \text{False}\}\).
Error Identification: Assesses whether the model attempts to locate the error (\(y_1\)) and whether it correctly locates it (\(y_2\)), using dual-label evaluation.
Error Correction: Assesses whether the model attempts to correct the error (\(y_1\)) and whether the correction is accurate (\(y_2\)), using dual-label evaluation.
Error Guidance: Assesses whether the model can provide meaningful suggestions to help users improve the prompt.
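
The dual-label scheme can be made concrete with a small sketch. It shows one plausible way to turn judge verdicts into "attempted" and "accurate" F1 scores; the record fields, the function names, and the assumption that error-free prompts are scored alongside erroneous ones are all hypothetical, since this summary does not spell out the exact computation.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """Judge verdict for one model response (field names are hypothetical)."""
    has_error: bool  # gold label: does the prompt contain an error?
    attempted: bool  # y1: did the model try to identify or correct the error?
    accurate: bool   # y2: was that identification or correction right?


def f1(gold, pred):
    """Binary F1 with True as the positive class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def dual_label_f1(records):
    """Return (attempted F1, accurate F1) for a list of judged responses."""
    gold = [r.has_error for r in records]
    attempted = [r.attempted for r in records]
    # A response counts as accurate only if it also attempted the task.
    accurate = [r.attempted and r.accurate for r in records]
    return f1(gold, attempted), f1(gold, accurate)


# Toy verdicts, not taken from the benchmark.
toy_records = [
    JudgedResponse(has_error=True, attempted=True, accurate=True),
    JudgedResponse(has_error=True, attempted=True, accurate=False),
    JudgedResponse(has_error=True, attempted=False, accurate=False),
    JudgedResponse(has_error=False, attempted=False, accurate=False),
]
attempted_f1, accurate_f1 = dual_label_f1(toy_records)
```

Because an accurate response is by construction an attempted one, the accurate score can never exceed the attempted score under this reading, which matches the pattern in the reported Att./Acc. column pairs.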

Error Taxonomy

Based on prior works (Pagnoni et al., Sourati et al., Orlovskiy et al., Masanti et al.):

Primary Category	Secondary Category	Data Volume
Linguistic Errors	Grammatical Errors / Punctuation Errors / Spelling Errors	3,135
Incomplete Information	Speaker/Character / Text Subject / Location / Time & Date	4,164
Factual Errors	Relational Errors / Entity Errors / Contextual Errors	3,109
Logical Errors	Fallacy of Relevance / Fallacy of Presumption / Inductive Defect / Fallacy of Ambiguity	4,288
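
As a quick arithmetic check on the table, the sketch below totals the per-category counts and computes each category's share. The counts are copied from the rows above; the variable names are illustrative. As listed, the four rows sum to 14,696, which does not match the 14,969 items quoted for the full dataset, and this summary does not say which figure is mistranscribed or whether some items fall outside these rows.

```python
# Per-category item counts, copied from the taxonomy table above.
counts = {
    "Linguistic Errors": 3135,
    "Incomplete Information": 4164,
    "Factual Errors": 3109,
    "Logical Errors": 4288,
}

total = sum(counts.values())

# Percentage share of each primary category, relative to the summed total.
shares = {name: 100 * n / total for name, n in counts.items()}

largest = max(counts, key=counts.get)

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"total: {total}, largest category: {largest}")
```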

Data Construction Method

Data Generation: Two pathways—(a) converting existing datasets (FEVEROUS \(\rightarrow\) factual errors, CommonsenseQA \(\rightarrow\) logical errors, ROCStories \(\rightarrow\) narrative errors), using GPT-4o to transform correct statements into error-containing Wh-questions; (b) direct generation—letting GPT-4o generate diverse error prompts according to secondary categories.
Diversity Design Principles: Special questions + errors (not simple yes/no questions); error information embedded in clauses; erroneous statements + related questioning.
Deduplication: Sentence-BERT embeddings are compared by cosine similarity, and prompts whose similarity exceeds a threshold of 0.85 are merged.
Quality Control: Manual review by 3 graduate students, with Fleiss' Kappa = 0.78 (substantial agreement) and a final pass rate of 93.76%.
Ground-truth Generation: GPT-4o generates standard answers based on the error category and erroneous prompt, containing error detection, explanation, correction, and guidance.
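
The deduplication step can be sketched as a greedy filter over embedding vectors. This is a minimal illustration, not the authors' pipeline: the toy 2-D vectors stand in for Sentence-BERT embeddings, "merging" is interpreted as keeping the first item of each near-duplicate group, and the function names are hypothetical. Only the 0.85 threshold comes from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def deduplicate(embeddings, threshold=0.85):
    """Greedy pass: keep an item only if no already-kept item is too similar."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: the second is nearly parallel to the first, the third is orthogonal.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept_indices = deduplicate(toy_embeddings)
```

The greedy pass is quadratic in the number of kept items, which is workable at the scale of roughly 15K prompts; a real pipeline would more likely compute the similarity matrix in batches.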

Evaluation Method

Automated Evaluation: GPT-4o acts as the judge model and labels each response; scores are reported as F1.
Human Evaluation: A secondary evaluation by 3 graduate students yields a Fleiss' Kappa of 0.63, with only a 5.59% discrepancy from the automated evaluation.
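
For readers unfamiliar with the agreement statistic quoted above, the following sketch implements the standard Fleiss' kappa formula for a fixed number of raters per item. The rating matrices are toy data, not the paper's annotations, and the function name is illustrative.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa.

    ratings[i][j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    return (p_observed - p_chance) / (1 - p_chance)


# Three raters, two categories (e.g. "acceptable" / "not acceptable").
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
partial = fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]])
```

With three raters, a single dissenting vote on half of the items is already enough to pull kappa down to about 0.33 in the second toy matrix, which gives some intuition for the reported values of 0.78 and 0.63.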

Experiments

Main Results (Table 3)

Model	Det.	Att.Ident.	Acc.Ident.	Att.Corr.	Acc.Corr.	Guid.	Avg
Claude-3.5	63.98	67.53	63.01	36.48	30.23	43.73	50.83
GPT-4o	43.54	48.71	43.78	31.72	23.32	30.66	36.96
LLaMA-3.3-70B	57.78	59.23	53.50	39.67	30.17	37.40	46.29
Qwen-2.5-32B	51.11	54.91	50.63	34.20	27.21	41.39	43.24
DeepSeek-V2-16B	29.44	33.90	27.92	18.57	11.46	12.80	22.35

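Key Findings:
- Claude-3.5 performs best (50.83% average F1). The source reports that closed-source models generally outperform open-source ones, although in the rows shown here GPT-4o (36.96%) trails both LLaMA-3.3-70B (46.29%) and Qwen-2.5-32B (43.24%).
- The average F1 across all evaluated models is only 37.53%, indicating that proactive error-handling capability is severely deficient.
- Scores decline as the tasks become harder. The source orders the tasks by score as Detection > Identification > Correction > Guidance, although in the rows above every model scores higher on Guid. than on Acc.Corr.
- Accurate correction (22.62% on average) is the most challenging task.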
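
The Avg column of Table 3 can be reproduced as the unweighted mean of the six task scores. The sketch below recomputes it from the rows above and measures the largest gap to the reported value; the dictionary layout and names are illustrative.

```python
# Rows of Table 3: Det., Att.Ident., Acc.Ident., Att.Corr., Acc.Corr., Guid.
scores = {
    "Claude-3.5":      [63.98, 67.53, 63.01, 36.48, 30.23, 43.73],
    "GPT-4o":          [43.54, 48.71, 43.78, 31.72, 23.32, 30.66],
    "LLaMA-3.3-70B":   [57.78, 59.23, 53.50, 39.67, 30.17, 37.40],
    "Qwen-2.5-32B":    [51.11, 54.91, 50.63, 34.20, 27.21, 41.39],
    "DeepSeek-V2-16B": [29.44, 33.90, 27.92, 18.57, 11.46, 12.80],
}
reported_avg = {
    "Claude-3.5": 50.83,
    "GPT-4o": 36.96,
    "LLaMA-3.3-70B": 46.29,
    "Qwen-2.5-32B": 43.24,
    "DeepSeek-V2-16B": 22.35,
}

# Unweighted mean over the six task columns.
recomputed = {model: sum(row) / len(row) for model, row in scores.items()}

# Largest absolute difference between recomputed and reported averages.
max_gap = max(abs(recomputed[m] - reported_avg[m]) for m in scores)

# Models ranked by recomputed average, best first.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)
```

All five recomputed means agree with the reported Avg to within rounding, so the column is a plain average over the six metrics rather than a weighted score.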

Category-wise Analysis (Table 4 - GPT-4o)

Error Category	Det.	Acc.Ident.	Acc.Corr.	Guid.	Avg
Factual Errors	72.99	71.70	41.63	22.27	55.22
Logical Errors	41.49	43.04	30.91	18.36	38.74
Incomplete Information	40.58	41.32	7.83	48.43	32.53
Linguistic Errors	6.50	3.93	13.34	20.23	11.53

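GPT-4o is most adept at detecting factual errors (Det. F1 72.99), likely benefiting from its rich knowledge base, but it is virtually "blind" to linguistic errors (Det. F1 only 6.50), tending to answer directly while ignoring the errors. The Avg column does not equal the mean of the four columns shown, so it presumably also includes the Att.Ident. and Att.Corr. scores omitted from this table.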

Impact of SFT (Table 5)

Model	Method	Det.	Acc.Ident.	Acc.Corr.	Guid.	Avg
LLaMA-3.1-8B	zero-shot	42.05	40.48	19.70	33.76	35.15
LLaMA-3.1-8B	3-shot	81.99	69.09	40.72	82.43	72.15
LLaMA-3.1-8B	CoT	75.62	73.56	47.02	75.44	70.22
LLaMA-3.1-8B	SFT	90.16	80.02	62.86	84.77	81.77
Qwen-2.5-32B	zero-shot	51.11	50.63	27.21	41.39	43.24
Qwen-2.5-32B	SFT	97.88	88.43	70.86	93.17	89.55

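SFT raises the average F1 by roughly 46 to 47 percentage points over zero-shot for both models, with per-metric gains ranging from about 38 to 52 points. On LLaMA-3.1-8B it also beats 3-shot prompting by 9.62 points and CoT by 11.55 points on average, even though both prompting methods already recover most of the gap. Qwen-2.5-32B + SFT achieves the best average result of 89.55%.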
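
The gains quoted for Table 5 follow from simple differences against the zero-shot rows. The sketch below recomputes them; the values are copied from the table, while the data layout and the `gain` helper are illustrative.

```python
COLUMNS = ["Det.", "Acc.Ident.", "Acc.Corr.", "Guid.", "Avg"]

# Rows of Table 5, keyed by (model, method).
table5 = {
    ("LLaMA-3.1-8B", "zero-shot"): [42.05, 40.48, 19.70, 33.76, 35.15],
    ("LLaMA-3.1-8B", "3-shot"):    [81.99, 69.09, 40.72, 82.43, 72.15],
    ("LLaMA-3.1-8B", "CoT"):       [75.62, 73.56, 47.02, 75.44, 70.22],
    ("LLaMA-3.1-8B", "SFT"):       [90.16, 80.02, 62.86, 84.77, 81.77],
    ("Qwen-2.5-32B", "zero-shot"): [51.11, 50.63, 27.21, 41.39, 43.24],
    ("Qwen-2.5-32B", "SFT"):       [97.88, 88.43, 70.86, 93.17, 89.55],
}


def gain(model, method, baseline="zero-shot"):
    """Per-column improvement of `method` over `baseline`, in percentage points."""
    new = table5[(model, method)]
    old = table5[(model, baseline)]
    return {col: n - o for col, n, o in zip(COLUMNS, new, old)}


llama_sft = gain("LLaMA-3.1-8B", "SFT")
qwen_sft = gain("Qwen-2.5-32B", "SFT")
sft_over_3shot = gain("LLaMA-3.1-8B", "SFT", baseline="3-shot")
sft_over_cot = gain("LLaMA-3.1-8B", "SFT", baseline="CoT")

for col in COLUMNS:
    print(f"{col:>10}: LLaMA +{llama_sft[col]:.2f}, Qwen +{qwen_sft[col]:.2f}")
```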

Scaling Law Observations

The LLaMA series conforms to the scaling law: 70B > 8B > 3B.
The Qwen-2.5 series experiences inverse scaling: 32B > 72B, showing that larger models are not necessarily better.
GPT-4o underperforms on Correction/Guidance relative to expectations because of its strong tendency to directly answer user questions.

Highlights & Insights

Novel Problem Definition: Distinguishing proactive vs. passive error handling for the first time, filling an evaluation gap. In real-world scenarios, users rarely tell models "there is an error in your input."
Comprehensive Taxonomy: The taxonomy of 4 major categories and 14 subcategories provides extensive coverage with a sufficient data scale (nearly 15K entries).
Well-designed Task Gradient: The progressive design (Detection \(\rightarrow\) Identification \(\rightarrow\) Correction \(\rightarrow\) Guidance) shows performance generally dropping as the tasks demand more of the model, with accurate correction the weakest point.
SFT Is the Most Effective Remedy: SFT outperforms both ICL and CoT on proactive error handling, suggesting that the capability does not emerge reliably on its own and benefits from explicit training. Few-shot and CoT prompting nonetheless close much of the gap on LLaMA-3.1-8B.
Interesting Counter-example: GPT-4o almost completely ignores linguistic errors (Det. F1 only 6.50), consistent with its tendency to answer the question directly rather than flag problems in the prompt.

Limitations & Future Work

Restricted to single-turn text dialogue, without covering multimodal and multi-turn scenarios.
Although the F1 score facilitates large-scale evaluation, it may not capture every dimension of response quality.
The dataset is predominantly generated by GPT-4o, which may inject specific biases.
The impact of combining instruction tuning with RLHF has not been explored.

Related Work

Passive Error Handling: BIG-Bench Mistake (error detection in logical tasks), Medec (clinical note error correction), EIC-Math/ErrorRadar/ProcessBench (mathematical reasoning error correction).
LLM Error Detection & Evaluation: ReaLMistake (multi-dimensional error detection), SummEdits (factual checking in summarization).
Error Handling Improvements: LoRA fine-tuning, Few-shot learning, CoT prompting.

Rating ⭐⭐⭐⭐

Novelty: ⭐⭐⭐⭐ The distinction between proactive and passive error handling is simple yet profound, with a highly practical problem definition.
Experimental Thoroughness: ⭐⭐⭐⭐ Thorough category-wise analysis across 13 models × 5 methods.
Dataset Quality: ⭐⭐⭐⭐ Well-structured taxonomy and strict quality control (93.76% pass rate).
Value: ⭐⭐⭐⭐ Directly guides the improvement of LLM safety and reliability.