ChatBench: From Static Benchmarks to Human-AI Evaluation¶
Conference: ACL 2025
arXiv: 2504.07114
Code: Available
Area: LLM Evaluation
Keywords: Benchmark Evaluation, Human-AI Interaction, User Simulation, MMLU, Dynamic Evaluation
TL;DR¶
Through user studies, this work converts the static MMLU benchmark into human-AI dialogues, constructing the ChatBench dataset (396 questions, 7,336 dialogues). It reveals that AI-alone accuracy cannot predict user-AI accuracy, and trains a user simulator that improves correlation by 22–26 percentage points, laying the foundation for scalable interactive evaluation.
Background & Motivation¶
- Nearly 40% of US adults used generative AI in 2024, so it increasingly matters whether LLM evaluation reflects how people actually use these models.
- Standard benchmarks (e.g., MMLU) differ sharply from real-world user interactions:
  - Benchmarks: Full question text \(\rightarrow\) single-letter option, in a fixed format.
  - Real interactions: Varied user phrasing, incomplete information, multi-turn dialogues, and context dependency.
- Existing human-computer interaction evaluations (e.g., WildChat, ChatBot Arena, MT-Bench) are disconnected from standard benchmarks:
  - Distribution shift: Real user questions differ from benchmark questions.
  - Lack of ground truth: Scoring requires an LLM-as-a-judge, which prevents direct comparison with MMLU results.
- Core Problem: Can AI-alone benchmark scores predict actual performance when users collaborate with AI?
- Lee et al. (2023) explored a similar question, but with only 30 questions and without a simulator, so their study lacks scale.
Method¶
Overall Architecture¶
The pipeline converts MMLU into user-AI dialogues in four steps:
1. Select high-quality questions from MMLU.
2. Collect three types of data: AI-alone (the model solves the question independently), User-alone (the user solves it independently), and User-AI (the user solves it after chatting with the model).
3. Analyze the disparity between the AI-alone and User-AI scenarios.
4. Train user simulators to scale up the evaluation.
Key Designs¶
1. User Study Design¶
Two-stage pipeline:
- Phase 1: Users answer questions independently (user-alone data).
- Phase 2: Users answer after chatting with the AI chatbot (user-AI data).
- At least one message must be sent (forced interaction).
- User confidence is recorded for each question.

Two experimental conditions:
- Answer-first: In Phase 2, users solve the question independently before chatting with the AI (within-subjects design).
- Direct-to-AI: In Phase 2, users chat with the AI directly (closer to real-world scenarios).
Incentive mechanism: A base fee of \(\$5.00\) plus a \(\$0.10\) bonus for each correct answer to enhance ecological validity.
2. Question Selection¶
- 5 MMLU subsets: Elementary/High School/College Mathematics + Conceptual Physics + Moral Scenarios.
- Mathematics is chosen because it remains challenging (GPT-4o achieves 84% overall on MMLU, but only 48% on HS Math).
- Quality control: Manual labeling using MMLU-Redux + cross-verification with o1 models.
- Batched design (19 math batches, 7 physics/morality batches) to minimize variance in the number of responses per question.
3. AI-Alone Evaluation Methods¶
Three AI-alone variants:
- Letter-only zero-shot: Outputs only the answer letter (standard benchmark style).
- Letter-only few-shot: Includes 5 MMLU dev questions as in-context examples.
- Free-text (novel design in this paper): No format constraints on replies; GPT-4o extracts the answer from the reply, which is closer to the user experience.
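The three variants differ only in how the prompt is built and how the reply is read. A minimal sketch of the prompt side follows; the exact wording (`Answer:` suffix, `A.` to `D.` option labels) and the function names are assumptions for illustration, not the prompts used in the paper.

```python
LETTERS = "ABCD"  # MMLU questions have four options


def format_question(question, options):
    """Render a question followed by its lettered options, one per line."""
    lines = [question] + [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    return "\n".join(lines)


def letter_only_prompt(question, options, examples=()):
    """Letter-only prompt. Zero-shot when `examples` is empty, few-shot otherwise.

    `examples` holds (question, options, answer_letter) tuples, e.g. 5 dev questions.
    """
    parts = [
        format_question(ex_q, ex_opts) + f"\nAnswer: {ex_letter}"
        for ex_q, ex_opts, ex_letter in examples
    ]
    parts.append(format_question(question, options) + "\nAnswer:")
    return "\n\n".join(parts)


def free_text_prompt(question, options):
    """Free-text prompt: no format constraint, so the reply needs a separate
    answer-extraction step (the paper uses GPT-4o for that step)."""
    return format_question(question, options)
```

In this sketch the letter-only variants are scored by reading a single letter, while the free-text variant needs a second model call to map the reply to one of the four options.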
4. User Simulator¶
Two-step simulator architecture:
- Task 1: Generates the user's first message given an MMLU question.
- Task 2: Decides whether to output the final answer or to ask a follow-up question given the dialogue history.
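The two tasks can be read as a simple control loop. The sketch below assumes three caller-supplied callables (`first_message`, `next_action`, `assistant`) standing in for the simulator and the model under test, plus a `max_turns` cap; all of these names and the cap are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (role, text) pairs


def simulate_dialogue(
    question: str,
    first_message: Callable[[str], str],                      # Task 1
    next_action: Callable[[str, History], Tuple[str, str]],   # Task 2
    assistant: Callable[[History], str],                      # model under test
    max_turns: int = 5,                                       # assumed safety cap
) -> Tuple[Optional[str], History]:
    """Run one simulated user-AI dialogue and return (final_answer, history)."""
    history: History = []
    user_msg = first_message(question)
    for _ in range(max_turns):
        history.append(("user", user_msg))
        history.append(("assistant", assistant(history)))
        # Task 2 returns ("answer", letter) or ("follow_up", next user message).
        kind, text = next_action(question, history)
        if kind == "answer":
            return text, history
        user_msg = text
    return None, history  # no answer produced within the turn cap
```

Keeping the assistant as a separate callable means the same simulated user can be paired with different models, which is what makes the simulator usable for comparing models.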
Fine-tuning data construction: Every \(k\)-turn user dialogue yields \(k+1\) training samples.
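One way to see where the \(k+1\) comes from: a dialogue with \(k\) user messages gives one Task 1 sample (the first message), \(k-1\) Task 2 samples whose target is a follow-up, and one Task 2 sample whose target is the final answer. The sketch below assumes every user message is followed by one AI reply; the `Sample` record and the `ANSWER:` target format are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    task: str                        # "first_message" or "continue_or_answer"
    question: str
    history: List[Tuple[str, str]]   # dialogue so far, as (role, text) pairs
    target: str                      # text the simulator should produce


def dialogue_to_samples(question, user_msgs, ai_msgs, final_answer):
    """Split one dialogue with k user messages into k + 1 training samples."""
    if not user_msgs or len(user_msgs) != len(ai_msgs):
        raise ValueError("expected k >= 1 user messages, each with one AI reply")
    # Task 1: question -> first user message.
    samples = [Sample("first_message", question, [], user_msgs[0])]
    history = []
    for i, (user_msg, ai_msg) in enumerate(zip(user_msgs, ai_msgs)):
        history += [("user", user_msg), ("assistant", ai_msg)]
        if i + 1 < len(user_msgs):
            target = user_msgs[i + 1]            # Task 2: ask a follow-up
        else:
            target = f"ANSWER: {final_answer}"   # Task 2: give the final answer
        samples.append(Sample("continue_or_answer", question, list(history), target))
    return samples
```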
Fine-tuning approach: Supervised fine-tuning of GPT-4o on ChatBench data.
Data Scale¶
| Data Type | Quantity |
|---|---|
| Total Questions | 396 |
| Tested Models | GPT-4o, Llama-3.1-8b |
| Confidence Responses | 10,828 |
| User-alone Answers | 7,148 |
| User-AI Dialogues | 7,336 |
| Total Answers | 144,000+ |
Key Experimental Results¶
Main Results: AI-alone vs. User-AI Accuracy¶
Mean Absolute Error (MAE) between Letter-only few-shot and user-AI: 21 percentage points
Mean Absolute Error (MAE) between Free-text and user-AI: 10 percentage points (improved but still significantly different)
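For reference, the MAE here is the average absolute difference between paired AI-alone and user-AI accuracies. The sketch below assumes the pairs are already aligned (for example, one pair per question or per subject-model combination; the unit of aggregation is an assumption), and the numbers are made up for illustration.

```python
def mean_absolute_error(ai_alone, user_ai):
    """Average |AI-alone accuracy - user-AI accuracy| over aligned pairs."""
    if not ai_alone or len(ai_alone) != len(user_ai):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - u) for a, u in zip(ai_alone, user_ai)) / len(ai_alone)


# Illustrative accuracies only, not values from the paper.
ai_alone_acc = [0.90, 0.60, 0.80]
user_ai_acc = [0.70, 0.65, 0.55]
mae_points = 100 * mean_absolute_error(ai_alone_acc, user_ai_acc)
print(f"MAE: {mae_points:.1f} percentage points")
```

Because the differences are taken in absolute value, the MAE measures how far apart the two settings are but not which one is higher.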
Key observations:
- Math: GPT-4o performs well in free-text, but user-AI accuracy is significantly lower than AI-alone accuracy (users introduce ambiguity).
- Llama-3.1-8b Math: The AI-alone \(\rightarrow\) user-AI gap is smaller (the weaker model already starts near the baseline).
- While the two models differ by 25 percentage points in AI-alone accuracy, the gap shrinks to only 5–9 percentage points in user-AI accuracy.
Question-level Correlation¶
| Metric | Correlation (Pearson \(r\)) |
|---|---|
| Free-text vs. User-AI (direct-to-AI) | 0.45 |
| Free-text vs. User-AI (answer-first) | 0.46 |
| Free-text predicting user-AI improvement | 0.26–0.27 |
| Linear prediction of user-AI using User-alone + AI-alone | 0.55–0.63 |
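The question-level correlations in this table are Pearson coefficients between two per-question accuracy vectors. A minimal stdlib sketch of that computation follows; the function name and the toy inputs in the usage note are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson_r(free_text_acc, user_ai_acc)` with one entry per question would give a value comparable to the 0.45–0.46 rows, assuming both lists are ordered by the same question IDs.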
AI-alone also fails to predict user-AI performance well at the question level.
Only 39.8% of Dialogues "Mirror" the AI Benchmark¶
Conditions for interaction to mirror the AI benchmark: the user precisely copies the original question + the AI provides the answer in a single turn + the user adopts this answer.
Most interactions do not meet these criteria—users rephrase questions, omit information, and ask multi-turn follow-ups.
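The three mirroring conditions can be written as a single predicate. The sketch below is one possible operationalization: it treats "precisely copies" as equality after case and whitespace normalization and takes the AI's answer letter as already extracted. Both choices, and all names, are assumptions rather than the paper's annotation procedure.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def mirrors_benchmark(question, user_msgs, ai_answer, user_answer):
    """True if a dialogue matches the AI-alone benchmark setting.

    question:    original benchmark question text (with options)
    user_msgs:   list of the user's messages, in order
    ai_answer:   answer letter extracted from the AI's reply, or None
    user_answer: answer letter the user finally submitted
    """
    if not user_msgs:
        return False
    copied = normalize(user_msgs[0]) == normalize(question)    # condition 1
    single_turn = len(user_msgs) == 1                           # condition 2
    adopted = ai_answer is not None and ai_answer == user_answer  # condition 3
    return copied and single_turn and adopted
```

Under this definition, the 39.8% figure in the heading would correspond to the share of dialogues for which the predicate is true.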
Net Effect of AI on Users¶
| Effect | Proportion |
|---|---|
| User error corrected by AI | 54% |
| User correct answer misled by AI | 10% |
| Cases where AI-alone was 100% correct but user-AI failed: the AI did not output the correct answer (user rephrased the question) | 67% |
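Rates like the first two rows can be computed from paired before/after correctness, which the answer-first condition provides. The sketch below assumes the denominators are "answers that were wrong before chatting" and "answers that were correct before chatting" respectively; that reading of the table, and the function name, are assumptions.

```python
def net_effect(pairs):
    """Compute correction and misleading rates.

    pairs: iterable of (correct_before, correct_after) booleans, one per
    user-question pair, before and after chatting with the AI.
    """
    pairs = list(pairs)
    after_when_wrong = [after for before, after in pairs if not before]
    after_when_right = [after for before, after in pairs if before]
    corrected = (
        sum(after_when_wrong) / len(after_when_wrong) if after_when_wrong else None
    )
    misled = (
        sum(not after for after in after_when_right) / len(after_when_right)
        if after_when_right
        else None
    )
    return {"corrected_by_ai": corrected, "misled_by_ai": misled}
```

With this definition the two rates use different denominators, so they do not need to sum to any particular value.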
User Simulator Results¶
| Method | GPT-4o Corr. \(\uparrow\) | GPT-4o MAE \(\downarrow\) | Llama Corr. \(\uparrow\) | Llama MAE \(\downarrow\) |
|---|---|---|---|---|
| Letter-only few-shot | 0.30 | 0.31 | 0.21 | 0.40 |
| Free-text | 0.49 | 0.20 | 0.61 | 0.20 |
| IQA-EVAL | 0.50 | 0.18 | 0.43 | 0.22 |
| Two-Step (un-finetuned) | 0.41 | 0.19 | 0.39 | 0.23 |
| ChatBench-Sim (finetuned) | 0.63 | 0.15 | 0.65 | 0.17 |
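Relative to the un-finetuned two-step simulator, fine-tuning improves correlation by 22–26 points (0.41 \(\rightarrow\) 0.63 for GPT-4o, 0.39 \(\rightarrow\) 0.65 for Llama-3.1-8b) and reduces MAE by 21–26% (0.19 \(\rightarrow\) 0.15 and 0.23 \(\rightarrow\) 0.17).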
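The 22–26 and 21–26% figures follow directly from the table rows for the un-finetuned two-step simulator and ChatBench-Sim. The short check below recomputes them; the dictionary layout and function name are just for illustration.

```python
# (correlation, MAE) pairs copied from the table above.
two_step = {"GPT-4o": (0.41, 0.19), "Llama-3.1-8b": (0.39, 0.23)}
chatbench_sim = {"GPT-4o": (0.63, 0.15), "Llama-3.1-8b": (0.65, 0.17)}


def improvement(before, after):
    """Return (correlation gain in points, relative MAE reduction in %)."""
    corr_gain = round((after[0] - before[0]) * 100)
    mae_reduction = round((before[1] - after[1]) / before[1] * 100)
    return corr_gain, mae_reduction


for model in two_step:
    gain, reduction = improvement(two_step[model], chatbench_sim[model])
    print(f"{model}: +{gain} correlation points, MAE down {reduction}%")
# GPT-4o: +22 correlation points, MAE down 21%
# Llama-3.1-8b: +26 correlation points, MAE down 26%
```

Note that the correlation gain is an absolute difference (in hundredths of \(r\)), while the MAE reduction is relative to the un-finetuned value.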
Ablation Study¶
- Condition Comparison: Under the answer-first condition, the accuracy gap between users and AI is smaller (since users have already thought about the problem).
- Model Capability Comparison: Although GPT-4o's AI-alone performance is far superior to Llama-3.1-8b, the user-AI performance gap narrows drastically.
- Impact of User Rephrasing: In approximately 66% of the cases where the AI should have answered correctly but the user-AI interaction failed, the user's initial prompt did not reproduce the original question exactly.
Key Findings¶
- AI-alone accuracy cannot predict user-AI accuracy: Statistically significant differences are observed across multiple subjects.
- Letter-only format severely overestimates model capabilities: its mean absolute deviation from user-AI accuracy is 21 percentage points.
- Free-text evaluation is more realistic but still deviates from user-AI accuracy by 10 percentage points on average.
- The capability gap between the two models shrinks significantly after user interaction: From 25pp \(\rightarrow\) 5–9pp.
- Only about 40% (39.8%) of user-AI dialogues align with the benchmark evaluation paradigm.
- Fine-tuned user simulators can significantly improve prediction accuracy: Offering a feasible path for scalable evaluation.
Highlights & Insights¶
- First to systematically compare AI-alone and user-AI evaluation on a large scale (396 questions, 7,336 dialogues).
- The finding that "AI-alone benchmarks can mislead model selection" has direct industry implications: weaker models may perform close to stronger models in interactive settings.
- The fine-tuning method for the user simulator is simple yet effective: decomposing a single dialogue into multiple SFT samples and designing a two-step architecture.
- Rigorous experimental design: pre-registered analysis, incentive schemes, quality control, and two experimental conditions.
Limitations & Future Work¶
- Only 5 MMLU subsets were tested; generalization to other benchmarks or task types remains to be validated.
- Users were sourced from the Prolific platform, which may not represent all user cohorts.
- The simulator was only fine-tuned on ChatBench, utilizing limited training data (dialogues of 237 questions).
- The sensitivity of AI-alone results to different prompt templates was not evaluated.
- Only two models (GPT-4o and Llama-3.1-8b) were tested.
- User-AI evaluation is costly; further reducing the cost of simulator-based evaluation is worth exploring.
Related Work & Insights¶
- Complementary to WildBench/ArenaHard/MT-Bench: These evaluate naturally occurring dialogues but lack ground truth, whereas ChatBench provides labeled comparisons.
- Extension of Lee et al. (2023): Scales from 30 questions to 396 questions and adds a user simulator.
- Li et al. (2024b) performed a similar conversion in the healthcare domain (benchmark \(\rightarrow\) simulated interaction).
- Inspiration: This method can be extended to the interactive evaluation of non-QA tasks like code generation and creative writing.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ Systematically bridges the gap between AI-alone and user-AI evaluation |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ Large-scale user study + pre-registration + simulator |
| Value | ⭐⭐⭐⭐ Direct guidance for LLM evaluation practices |
| Writing Quality | ⭐⭐⭐⭐⭐ Rigorous experimental design, thorough analysis |
| Overall Recommendation | ⭐⭐⭐⭐ |