ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations¶

Conference: ACL 2025
arXiv: 2506.14200
Code: inklab.usc.edu/eli-why
Area: LLM/NLP
Keywords: Educational Evaluation, Language Model Explanations, Pedagogical Adaptation, Information Needs, Readability Analysis

TL;DR¶

The paper constructs ELI-Why, a benchmark of 13.4K "Why" questions. Two human studies find that only about 50% of GPT-4-generated explanations tailored to different educational levels are perceived as matching the targeted level (compared with 79% for manually curated ones), and that they satisfy learners' information needs on average 20% less than manually curated explanations.

Background & Motivation¶

Background: Language models are widely used in education for information retrieval, tutoring, and automated assessment, with personalized instruction regarded as a critical capability.

Limitations of Prior Work: LMs by default generate "one-size-fits-all" responses, failing to adapt to learners with diverse prior knowledge. Existing benchmarks predominantly focus on objective multiple-choice QA tasks, lacking a systematic evaluation of the pedagogical utility of explanatory responses.

Key Challenge: Although models like GPT-4 can be prompted to generate explanations tailored to specific grade levels, "ability to generate" does not equate to "actual suitability"—the difficulty level perceived by users often mismatches the model's intent.

Goal: To systematically quantify the extent and causes of pedagogical adaptation failures when LMs generate explanations for users of different educational backgrounds (elementary school, high school, and graduate school).

Key Insight: Utilizing "Why" questions (which require explanatory answers) as a vehicle to construct a standardized benchmark, combined with a dual-perspective human study (educators and learners) along with automated metric analysis.

Core Idea: To evaluate the pedagogical utility of LMs across two complementary dimensions—Perceived Background Match from educators and satisfaction of information needs from learners—revealing the fundamental limitations of prompt-based adaptation.

Method¶

Overall Architecture¶

ELI-Why Benchmark Construction:
- Starting from 50 seed "Why" questions, GPT-4 is used in a few-shot manner to over-generate ~30K questions. After manual deduplication and crowdsourced filtering (removing overly niche domain questions), 13,392 questions are retained.
- The benchmark covers 6,217 STEM questions (physics, chemistry, computer science, etc.) and 7,175 non-STEM questions (sociology, law, history, etc.).

Grade-Adapted Explanation Generation:
- Three educational levels: Elementary School (approx. US Grade 4), High School (high school to college sophomore), and Graduate School.
- Four model families: GPT-4-0613, Llama-3.2-3B-Instruct, Qwen 2.5 14B Instruct, and DeepSeek R1 Distill Llama 8B.
- Zero-shot prompts instruct the models to act as experts and generate explanations for each grade level; the prompt explicitly instructs them "not to add greetings or honorific conversational fillers" to minimize stylistic distraction.
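
The prompting setup above can be sketched roughly as follows. This is only an illustration: the prompt wording, the `LEVELS` dictionary, and the `build_prompt` helper are hypothetical and are not the paper's actual prompts. Passing `level=None` corresponds to the "default" condition with no grade level specified.

```python
from typing import Optional

# Hypothetical audience descriptions; the paper's exact prompt text is not reproduced here.
LEVELS = {
    "elementary": "an elementary school student (around US Grade 4)",
    "high_school": "a high school student",
    "graduate": "a graduate student",
}


def build_prompt(question: str, level: Optional[str] = None) -> str:
    """Build a zero-shot prompt; level=None yields a prompt with no target audience."""
    parts = ["You are an expert. Explain the answer to the question below."]
    if level is not None:
        parts.append(f"Write the explanation for {LEVELS[level]}.")
    # Mirrors the instruction to avoid greetings and conversational filler.
    parts.append("Do not add greetings or conversational fillers.")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Why is the sky blue?", "elementary"))
```

Keeping the audience description as the only varying part makes it easier to attribute differences between explanations to the requested level rather than to other prompt changes.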

Key Designs¶

Human Study I: Educator's Perspective
- A subset of 400 questions with three-level explanations generated by GPT-4.
- Participants act as "educators" and judge which grade level the explanation is suitable for (Perceived Background).
- Three independent annotations are collected for each explanation pair, resolved by majority voting.
- Control group: Three-level explanations for 40 questions are manually curated from the web by the authors (Manually Web-Retrieved).
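
A minimal sketch of how a Perceived Background Match rate could be computed from three annotations per explanation. The function names and the toy data are illustrative, and counting a three-way tie as a mismatch is an assumption made here, not something stated in the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else None


def match_rate(items):
    """items: iterable of (intended_level, [annotator votes]) pairs."""
    items = list(items)
    # Assumption: an explanation with no majority label counts as a mismatch.
    matches = sum(majority_label(votes) == intended for intended, votes in items)
    return matches / len(items)


if __name__ == "__main__":
    # Toy annotations, invented for illustration only.
    annotations = [
        ("elementary", ["elementary", "elementary", "high_school"]),
        ("graduate", ["high_school", "high_school", "graduate"]),
        ("high_school", ["high_school", "graduate", "high_school"]),
        ("elementary", ["elementary", "high_school", "graduate"]),
    ]
    print(match_rate(annotations))
```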

Human Study II: Learner's Perspective
- Participants evaluate, based on their own educational background, whether the explanation (1) provides novel information and (2) bridges to existing knowledge.
- Covers three groups of participants: elementary, high school, and graduate level (specifically in physics and psychology).

Automated Metrics:
- Flesch-Kincaid Reading Ease readability score.
- Surface features such as sentence count and percentage of complex words.
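
The metrics above can be approximated with a short script. This is a sketch under stated assumptions, not the paper's tooling: syllables are estimated with a crude vowel-group heuristic, and "complex words" are taken to be words with three or more syllables. The two formulas are the standard Flesch Reading Ease and Flesch-Kincaid Grade Level definitions.

```python
import re


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def surface_features(text):
    """Compute sentence count, % complex words, and two readability scores for non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    return {
        "sentence_count": len(sentences),
        # Assumption: a "complex" word has three or more syllables.
        "pct_complex_words": 100 * sum(s >= 3 for s in syllables) / len(words),
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "fk_grade_level": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }


if __name__ == "__main__":
    print(surface_features("The sky is blue. Light scatters in the air."))
```

Both scores depend only on sentence length and syllable counts, which is one reason they can miss differences in conceptual depth between explanations.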

Baseline Comparisons¶

Default Explanations (zero-shot generation without specifying a grade level)
Web-Retrieved Explanations (Google API Featured Snippet)
Manually Web-Retrieved (manually curated three-level explanations)

Key Experimental Results¶

Main Results: Perceived Background Match¶

Explanation Source	Perceived Background Match Rate
GPT-4 Grade-Adapted	~50%
Manually Web-Retrieved	79.16%

GPT-4's explanations are mostly perceived as "High School" level, reflecting the model's default tendency to target general audiences.
Surprising mismatches also occur: some explanations intended for elementary school are perceived as suitable for graduate school, and vice versa.

Human Study II: Learner Information Needs¶

Across all explanations, GPT-4-generated explanations satisfy learners' information needs on average 20% less than manually curated ones.
The gap is especially pronounced among populations with graduate and high school backgrounds.

Findings from Automated Metrics¶

Model	Elementary Sentence Count	High School Sentence Count	Graduate Sentence Count
GPT-4	4.63±1.34	7.08±2.53	8.46±2.62
Llama-3.2-3B	3.29±1.63	6.70±2.97	9.10±3.33

Explanation length increases with educational level, yet Flesch-Kincaid readability grades overlap heavily (mostly falling into the high school to college tier).
All four model families exhibit a similar trend: readability differences across grade levels are not statistically distinct.

Key Findings¶

Simply instructing the LM to target a specific grade via prompting does not lead the model to truly adapt the depth of knowledge; instead, it mostly adjusts the style (e.g., adding relatable scenarios like "playing in the park" for elementary level).
Manually curated explanations exceed GPT-4's match rate by approximately 30 percentage points (79.16% vs. ~50%), showing that the pedagogical adaptation of current LMs falls far short of how humans structure information for a given audience.

Highlights & Insights¶

The dual-perspective evaluation framework is highly ingenious: the educator's perspective validates "match rate" while the learner's perspective assesses "utility," complementing each other to reveal the multidimensional failures of LM pedagogical adaptation.
The ELI-Why benchmark, spanning 13.4K questions across both STEM and non-STEM domains, provides a standardized evaluation resource for future educational LM research.
It quantitatively exposes the limits of prompt-based adaptation: even with carefully designed prompts, GPT-4 achieves only about a 50% grade match, offering a crucial reality check on optimistic expectations of "realizing personalized teaching through prompting."
Comparative analysis between automated metrics and human studies indicates that traditional readability metrics (such as Flesch-Kincaid) are insufficient for capturing the quality of pedagogical adaptation.

Limitations & Future Work¶

Educational levels are restricted to only three categories (elementary school, high school, and graduate school), whereas real educational needs are more continuous and multidimensional.
The benchmark is grounded in the U.S. educational system; hence, its cross-cultural generalizability has not been verified.
Generating questions via GPT-4 may introduce model bias (unbalanced distribution of generated questions in certain domains).
Human studies were only conducted on GPT-4, leaving other models assessed purely via automated metrics.
While targeting "Why" questions is valuable, the scope is limited; pedagogical adaptation for other question types (e.g., How, What-if) might exhibit different patterns.

Related Work & Insights¶

Relation to Educational LM Evaluation: Existing benchmarks like ScienceQA and MMLU focus on multiple-choice questions, whereas ELI-Why is the first to systematically evaluate open-ended explanation generation across different grade levels.
Relation to Text Simplification: Text simplification solely adjusts "readability," whereas ELI-Why emphasizes "depth of knowledge adaptation"—meaning choosing the appropriate granularity of concepts rather than merely simplifying sentences.
Insights: Future directions include exploring (1) dynamic explanation generation based on learner modeling (inferring knowledge levels from dialogue history), (2) extending ELI-Why to multilingual and multicultural contexts, and (3) incorporating curriculum knowledge graphs as external constraints for explanation generation.

Rating¶

Novelty: ⭐⭐⭐⭐ — First benchmark and evaluation framework to systematically assess the pedagogical utility of LMs across different educational levels.
Experimental Thoroughness: ⭐⭐⭐⭐ — Two human studies combined with automated analysis of four model families, though human evaluations were limited to GPT-4.
Writing Quality: ⭐⭐⭐⭐ — Well-structured with intuitive visualizations; the Sankey diagrams demonstrating grade mismatches are highly compelling.
Value: ⭐⭐⭐⭐ — Serves as a significant cautionary note for the AI-in-Education field, with both the benchmark and evaluation framework offering long-term reuse value.