ChatBench: From Static Benchmarks to Human-AI Evaluation¶

Conference: ACL 2025
arXiv: 2504.07114
Code: Available
Area: LLM Evaluation
Keywords: Benchmark Evaluation, Human-AI Interaction, User Simulation, MMLU, Dynamic Evaluation

TL;DR¶

Through user studies, this work converts the static MMLU benchmark into human-AI dialogues, constructing the ChatBench dataset (396 questions, 7,336 dialogues). It reveals that AI-alone accuracy cannot predict user-AI accuracy, and trains a user simulator that improves correlation by 22–26 percentage points, laying the foundation for scalable interactive evaluation.

Background & Motivation¶

Nearly 40% of US adults used generative AI in 2024, making the practical significance of LLM evaluation increasingly critical.
Standard benchmarks (e.g., MMLU) exhibit a large gap compared to real-world user interactions:
- Benchmarks: Full question text $\rightarrow$ single-letter option, in a fixed format.
- Real Interactions: Varied user phrasing, incomplete information, multi-turn dialogues, and context dependency.
Existing human-computer interaction evaluations (e.g., WildChat, ChatBot Arena, MT-Bench) are disconnected from standard benchmarks:
- Distribution shift: Real user questions vs. benchmark questions.
- Lack of ground truth: Requiring LLM-as-a-judge, which prevents direct comparison with MMLU results.
Core Problem: Can AI-alone benchmark scores predict actual performance when users collaborate with AI?
Lee et al. (2023) conducted similar exploration but with only 30 questions, which lacks scale and provides no simulator.

Method¶

Overall Architecture¶

Designing a pipeline to convert MMLU into user-AI dialogues: 1. Select high-quality questions from MMLU. 2. Collect three types of data: AI-alone (model solves independently), User-alone (user solves independently), and User-AI (user solves after chatting with the model). 3. Analyze the disparity between AI-alone and User-AI scenarios. 4. Train user simulators for scaling up.

Key Designs¶

1. User Study Design¶

Two-stage pipeline: - Phase 1: Users answer questions independently (user-alone data). - Phase 2: Users answer after chatting with the AI chatbot (user-AI data). - At least one message must be sent (forced interaction). - User confidence is recorded for each question.

Two experimental conditions: - Answer-first: In Phase 2, users solve the question independently before chatting with the AI (within-subjects design). - Direct-to-AI: In Phase 2, users chat with the AI directly (closer to real-world scenarios).

Incentive mechanism: A base fee of $\$5.00$ plus a $\$0.10$ bonus for each correct answer to enhance ecological validity.

2. Question Selection¶

5 MMLU subsets: Elementary/High School/College Mathematics + Conceptual Physics + Moral Scenarios.
Mathematics is chosen because it remains challenging (GPT-4o achieves 84% overall on MMLU, but only 48% on HS Math).
Quality control: Manual labeling using MMLU-Redux + cross-verification with o1 models.
Batched design (19 math batches, 7 physics/morality batches) to minimize variance in the number of responses per question.

3. AI-Alone Evaluation Methods¶

Three AI-alone variants: - Letter-only zero-shot: Outputs only the answer letter (standard benchmark style). - Letter-only few-shot: Includes 5 MMLU dev questions as in-context examples. - Free-text (novel design in this paper): No format constraints on replies, using GPT-4o to extract answers—closer to the user experience.

4. User Simulator¶

Two-step simulator architecture: - Task 1: Generates the user's first message given an MMLU question. - Task 2: Decides whether to output the final answer or to ask a follow-up question given the dialogue history.

Fine-tuning data construction: Every $k$-turn user dialogue yields $k+1$ training samples.

Fine-tuning approach: Supervised fine-tuning of GPT-4o on ChatBench data.

Data Scale¶

Data Type	Quantity
Total Questions	396
Tested Models	GPT-4o, Llama-3.1-8b
Confidence Responses	10,828
User-alone Answers	7,148
User-AI Dialogues	7,336
Total Answers	144,000+

Key Experimental Results¶

Main Results: AI-alone vs. User-AI Accuracy¶

Mean Absolute Error (MAE) between Letter-only few-shot and user-AI: 21 percentage points
Mean Absolute Error (MAE) between Free-text and user-AI: 10 percentage points (improved but still significantly different)

Key observations: - Math: GPT-4o performs well in free-text, but user-AI accuracy is significantly lower than AI-alone (users introduce ambiguity). - Llama-3.1-8b Math: The AI-alone $\rightarrow$ user-AI gap is smaller (the gap for weak models is already at the baseline). - While the two models differ by 25 percentage points in AI-alone accuracy, the gap boils down to only 5–9 percentage points in user-AI accuracy.

Question-level Correlation¶

Metric	Correlation (Pearson $r$)
Free-text vs. User-AI (direct-to-AI)	0.45
Free-text vs. User-AI (answer-first)	0.46
Free-text predicting user-AI improvement	0.26–0.27
Linear prediction of user-AI using User-alone + AI-alone	0.55–0.63

AI-alone also fails to predict user-AI performance well at the question level.

Only 39.8% of Dialogues "Mirror" the AI Benchmark¶

Conditions for interaction to mirror the AI benchmark: the user precisely copies the original question + the AI provides the answer in a single turn + the user adopts this answer.
Most interactions do not meet these criteria—users rephrase questions, omit information, and ask multi-turn follow-ups.

Net Effect of AI on Users¶

Effect	Proportion
User error corrected by AI	54%
User correct answer misled by AI	10%
Reasons why AI-alone was 100% correct but user-AI failed	67% AI did not output the correct answer (user rephrased the question)

User Simulator Results¶

Method	GPT-4o Corr. $\uparrow$	GPT-4o MAE $\downarrow$	Llama Corr. $\uparrow$	Llama MAE $\downarrow$
Letter-only few-shot	0.30	0.31	0.21	0.40
Free-text	0.49	0.20	0.61	0.20
IQA-EVAL	0.50	0.18	0.43	0.22
Two-Step (un-finetuned)	0.41	0.19	0.39	0.23
ChatBench-Sim (finetuned)	0.63	0.15	0.65	0.17

Fine-tuning improves correlation by 22–26 percentage points and reduces MAE by 21–26%.

Ablation Study¶

Condition Comparison: Under the answer-first condition, the accuracy gap between users and AI is smaller (since users have already thought about the problem).
Model Capability Comparison: Although GPT-4o's AI-alone performance is far superior to Llama-3.1-8b, the user-AI performance gap narrows drastically.
Impact of User Rephrasing: In approximately 66% of the cases where "AI should have got it right but the user-AI interaction failed", the user's initial prompt was not an exact paraphrase/copy of the original question.

Key Findings¶

AI-alone accuracy cannot predict user-AI accuracy: Statistically significant differences are observed across multiple subjects.
Letter-only format severely overestimates model capabilities: The deviation from user-AI accuracy reaches up to 21 percentage points.
Free-text evaluation is more realistic but still suffers from a 10 percentage point deviation.
The capability gap between the two models shrinks significantly after user interaction: From 25pp $\rightarrow$ 5–9pp.
Only 40% of user-AI dialogues align with the benchmark evaluation paradigm.
Fine-tuned user simulators can significantly improve prediction accuracy: Offering a feasible path for scalable evaluation.

Highlights & Insights¶

First to systematically compare AI-alone and user-AI evaluation on a large scale (396 questions, 7,336 dialogues).
The finding that "AI-alone benchmarks can mislead model selection" has direct industry implications—weaker models may perform closely to stronger models in interactive settings.
The fine-tuning method for the user simulator is simple yet effective: decomposing a single dialogue into multiple SFT samples and designing a two-step architecture.
Rigorous experimental design: pre-registration analysis, incentive schemes, quality control, and two experimental conditions.

Limitations & Future Work¶

Only 5 MMLU subsets were tested; generalization to other benchmarks or task types remains to be validated.
Users were sourced from the Prolific platform, which may not represent all user cohorts.
The simulator was only fine-tuned on ChatBench, utilizing limited training data (dialogues of 237 questions).
The sensitivity of AI-alone results to different prompt templates was not evaluated.
Only two models (GPT-4o and Llama-3.1-8b) were tested.
User-AI evaluation is costly; exploring how to further reduce simulator costs is highly worthwhile.

Complementary to WildBench/ArenaHard/MT-Bench: While these evaluate natural-occurring dialogues, they lack ground truth; ChatBench provides labeled comparisons.
Extension of Lee et al. (2023): Scaled from 30 questions to 396 questions + incorporated simulators.
Li et al. (2024b) conducted similar conversions in the healthcare domain (benchmark $\rightarrow$ simulated interaction), sharing similar ideas.
Inspiration: This method can be extended to the interactive evaluation of non-QA tasks like code generation and creative writing.

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐ Systematically bridges the gap between AI-alone and user-AI evaluation
Experimental Thoroughness	⭐⭐⭐⭐⭐ Large-scale user study + pre-registration + simulator
Value	⭐⭐⭐⭐ Direct guidance for LLM evaluation practices
Writing Quality	⭐⭐⭐⭐⭐ Rigorous experimental design, thorough analysis
Overall Recommendation	⭐⭐⭐⭐