EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Conference: ACL 2025
arXiv: 2504.14928
Code: github
Area: LLM Evaluation
Keywords: Teaching Capability Evaluation, Multi-Agent, Formative Assessment, Educational AI, LLM-as-Teacher
TL;DR
This work proposes EducationQ, a multi-agent dialogue framework designed to evaluate the teaching capabilities of LLMs by simulating teacher-student informal formative assessment interactions in real classrooms. The study reveals that teaching effectiveness is not linearly correlated with model size or general reasoning ability, with Llama 3.1 70B demonstrating the best teaching performance.
Background & Motivation
While LLMs are increasingly applied in education, existing evaluation methods possess fundamental limitations:
Misalignment of Evaluation Goals: Current benchmarks (MMLU, GPQA, etc.) primarily evaluate knowledge recall and reasoning capabilities rather than interactive teaching skills. The core of education lies in guiding the learning process, facilitating knowledge construction, providing personalized feedback, and scaffolding skills.
Limitations of Evaluation Methods:
- Closed-ended questions only test baseline knowledge levels and fail to capture the dynamic nature of instruction.
- Open-ended evaluations rely on human judgment, making them difficult to scale.
- Multi-turn dialogue frameworks lack dedicated mechanisms to elicit and evaluate teaching effectiveness.
Lack of Teacher Initiative: Existing frameworks fail to assess the active role of a teacher in posing questions, evaluating understanding, and making real-time adjustments.
The core theoretical foundation of this work is Vygotsky's Zone of Proximal Development (ZPD) and Informal Formative Assessment (IFA), where teachers assess learning progress through continuous dialogue, identify gaps, and adapt teaching strategies.
Method
Overall Architecture
EducationQ adopts a tri-role multi-agent architecture comprising a Teacher Agent (to be evaluated), a standardized Student Agent, and an Evaluator Agent (to analyze teaching quality), simulating a cyclical teacher-student interaction in a classroom setting.
Key Designs
- Student Agent: Llama 3.1 70B Instruct is used as the fixed student model (46.97% accuracy on the GPQA Diamond benchmark). Ablation studies show that switching the student model (e.g., to Qwen 72B or Mistral Nemo) does not alter the teacher rankings, which suggests that the method isolates differences in teacher performance rather than artifacts of a particular student.
- Teacher Agent: The agent is prompted to dynamically assess the student's thought process, using probing questions to gauge comprehension and provide feedback. A key constraint is that the teacher cannot access the multiple-choice options and must guide learning solely based on the student's reasoning patterns and correctness feedback, preventing direct answer leakage.
- Evaluator Agent: This agent utilizes a qualitative analysis framework with 17 rubric dimensions, covering teacher dimensions (questioning, assessment, feedback) and student impact dimensions (metacognitive reflection, cognitive dimensions, etc.). Evaluation by human experts demonstrates a 78% agreement rate with the automated qualitative analysis.
- Interaction Protocol:
  - Pre-test: Establish the student's initial knowledge baseline.
  - Interaction: 5 rounds of dialogue per question, with a maximum of 150 tokens per turn for the teacher and 260 tokens for the student.
  - Post-test: Incorporate pre-test reasoning traces and teacher-student dialogue history, maintaining evaluation parameters consistent with the baseline.
- Dataset Construction: A dataset of 1,498 questions was curated from GPQA and MMLU-Pro, covering 13 subjects and 10 difficulty levels. The source pools contain 448 (GPQA) and 12,032 (MMLU-Pro) questions; the 1,498 evaluated questions are a subset of these. The MMLU-Pro Stratified subset balances subjects and difficulty levels via stratified sampling.
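The interaction protocol above can be sketched as a single per-question episode. This is a minimal illustration, not the authors' implementation: the three agents are modeled as plain callables, and the names `Question`, `run_episode`, `truncate`, `teacher`, `student_reply`, and `student_answer` are hypothetical. It assumes the teacher sees the question stem, the student's pre-test reasoning, and the dialogue so far, but never the options; the whitespace-based token cap is a stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)

NUM_ROUNDS = 5            # dialogue rounds per question (from the protocol)
TEACHER_MAX_TOKENS = 150  # per-turn teacher budget (from the protocol)
STUDENT_MAX_TOKENS = 260  # per-turn student budget (from the protocol)


@dataclass
class Question:
    stem: str
    options: List[str]
    answer: str  # letter of the correct option


def truncate(text: str, max_tokens: int) -> str:
    """Crude whitespace-token cap; a real run would use the model's tokenizer."""
    return " ".join(text.split()[:max_tokens])


def run_episode(
    question: Question,
    teacher: Callable[[str, str, List[Turn]], str],
    student_reply: Callable[[str, List[Turn]], str],
    student_answer: Callable[[str, List[str], List[Turn]], Tuple[str, str]],
) -> Tuple[bool, bool, List[Turn]]:
    """Run pre-test, dialogue, and post-test for one question."""
    # Pre-test: the student answers with no dialogue context.
    pre_choice, pre_reasoning = student_answer(question.stem, question.options, [])

    history: List[Turn] = []
    for _ in range(NUM_ROUNDS):
        # The teacher is never given question.options (assumed data-flow rule).
        t_msg = teacher(question.stem, pre_reasoning, history)
        history.append(("teacher", truncate(t_msg, TEACHER_MAX_TOKENS)))
        s_msg = student_reply(question.stem, history)
        history.append(("student", truncate(s_msg, STUDENT_MAX_TOKENS)))

    # Post-test: same question, now with pre-test reasoning plus the dialogue.
    context = [("student_pretest", pre_reasoning)] + history
    post_choice, _ = student_answer(question.stem, question.options, context)
    return pre_choice == question.answer, post_choice == question.answer, history
```

Keeping the options out of the teacher's call signature makes the no-leakage constraint structural rather than a matter of prompt wording.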
Loss & Training
This work introduces an evaluation framework rather than a training method. The evaluation metric system includes:
- Absolute Learning Gain (ALG): \(ALG = ACC_{post} - ACC_{pre}\), which directly measures teaching effectiveness.
- Positive-Negative Impact Ratio (PNIR): \(PNIR = N_{neg} / N_{pos}\), which measures teaching consistency (lower is better).
- Cross-Subject Stability (CSS): The standard deviation of learning gains across subjects (lower is better).
- Unique Improvement Count (UIC): The number of unique questions that only a specific teacher model can improve.
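The four metrics can be computed from per-question correctness flags, as in the hedged sketch below. Several details are assumptions rather than statements from the paper: \(N_{pos}\) is taken to be the number of questions that flip from incorrect to correct and \(N_{neg}\) the number that flip from correct to incorrect, CSS uses the population standard deviation, and all function names are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List, Set


def absolute_learning_gain(pre: List[bool], post: List[bool]) -> float:
    """ALG = ACC_post - ACC_pre, returned as a fraction (multiply by 100 for points)."""
    return (sum(post) - sum(pre)) / len(pre)


def positive_negative_impact_ratio(pre: List[bool], post: List[bool]) -> float:
    """PNIR = N_neg / N_pos, counting flips between pre-test and post-test (assumed)."""
    n_pos = sum(1 for a, b in zip(pre, post) if not a and b)  # incorrect -> correct
    n_neg = sum(1 for a, b in zip(pre, post) if a and not b)  # correct -> incorrect
    return n_neg / n_pos if n_pos else float("inf")


def cross_subject_stability(gain_by_subject: Dict[str, float]) -> float:
    """CSS: standard deviation of per-subject learning gains (population form assumed)."""
    return pstdev(gain_by_subject.values())


def unique_improvement_count(improved: Dict[str, Set[int]]) -> Dict[str, int]:
    """UIC: per teacher, the number of questions that no other teacher improved."""
    result = {}
    for teacher, ids in improved.items():
        others = set().union(*(s for t, s in improved.items() if t != teacher))
        result[teacher] = len(ids - others)
    return result
```

For example, with `pre = [True, False, False, True]` and `post = [True, True, True, False]`, two questions improve and one regresses, so ALG is 0.25 and PNIR is 0.5.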
Key Experimental Results
Main Results
Overall Teaching Performance (14 LLMs evaluated on 1,498 questions; selected rows shown):
| Teacher Model | Pre-test (%) | Post-test (%) | ALG↑ (pp) | CSS↓ | PNIR↓ | UIC |
|---|---|---|---|---|---|---|
| Llama 3.1 70B Instruct | 47.73 | 58.74 | 11.01 | 0.041 | 0.18 | 37 |
| Gemini 1.5 Pro 002 | 47.73 | 55.21 | 7.48 | 0.030 | 0.40 | 37 |
| OpenAI o1-mini | 47.73 | 53.57 | 5.84 | 0.051 | 0.25 | 7 |
| Qwen 2.5 72B Instruct | 47.73 | 53.14 | 5.41 | 0.054 | 0.33 | 7 |
| Llama 3.1 8B Instruct | 47.73 | 52.60 | 4.87 | 0.051 | 0.40 | 13 |
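As a worked check of the top row, assuming the reported accuracies are percentages over all 1,498 questions, ALG is a difference in percentage points:

\[
ALG_{\text{Llama 3.1 70B}} = ACC_{post} - ACC_{pre} = 58.74 - 47.73 = 11.01
\]

Under that assumption, this corresponds to a net change of roughly \(0.1101 \times 1498 \approx 165\) questions moving from incorrect to correct.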
GPQA Diamond Subset (Consistency Across Student Models):
| Teacher \ Student | Llama 70B | Qwen 72B | Mistral Nemo |
|---|---|---|---|
| Llama 70B Teacher | +12.63% | +8.08% | +4.55% |
| Qwen 72B Teacher | +8.59% | +4.55% | +2.53% |
| Mistral Nemo Teacher | +7.07% | +2.53% | 0.00% |
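The ranking-consistency claim can be checked directly from the table above. The sketch below ranks the teachers separately for each student model and tests whether the orderings agree. It also converts each gain to a net question count, using the fact that GPQA Diamond contains 198 questions; the variable and function names are illustrative only.

```python
# Learning gains in percentage points, copied from the table above.
gains = {
    "Llama 70B Teacher": {"Llama 70B": 12.63, "Qwen 72B": 8.08, "Mistral Nemo": 4.55},
    "Qwen 72B Teacher": {"Llama 70B": 8.59, "Qwen 72B": 4.55, "Mistral Nemo": 2.53},
    "Mistral Nemo Teacher": {"Llama 70B": 7.07, "Qwen 72B": 2.53, "Mistral Nemo": 0.00},
}
students = ["Llama 70B", "Qwen 72B", "Mistral Nemo"]
GPQA_DIAMOND_SIZE = 198  # number of questions in GPQA Diamond


def teacher_ranking(student: str) -> list:
    """Teachers sorted from highest to lowest gain for one student model."""
    return sorted(gains, key=lambda t: gains[t][student], reverse=True)


rankings = {s: teacher_ranking(s) for s in students}
# Consistent if every student model yields the same teacher ordering.
consistent = len({tuple(r) for r in rankings.values()}) == 1

# Net number of questions gained, recovered from the rounded percentages.
net_questions = {
    t: {s: round(g / 100 * GPQA_DIAMOND_SIZE) for s, g in row.items()}
    for t, row in gains.items()
}

print(consistent)
print(net_questions["Llama 70B Teacher"])
```

All three student models produce the same ordering (Llama 70B, then Qwen 72B, then Mistral Nemo as teacher). Note that on a 198-question set each gain corresponds to a few dozen questions at most, so small differences between teachers rest on only a handful of items.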
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| 250 tokens per turn (vs. 150) | No significant improvement | Relaxing output constraints does not improve teaching effectiveness. |
| 70-100 tokens per turn | Teaching performance degradation | Insufficient expression space constrains teaching strategies. |
| 10 dialogue rounds (vs. 5 rounds) | No significant improvement | Doubling compute costs yields limited gains. |
| Test-retest stability (GPQA-main) | σ²=0.00832 | Extremely low ALG variance, indicating high framework stability. |
| Cross-dataset consistency | r=0.871, p<0.001 | High ranking consistency across GPQA and MMLU-Pro datasets. |
Key Findings
- Teaching capability is not proportional to model size: Llama 3.1 70B outperforms the larger 405B model and commercial models, suggesting that teaching capabilities require specialized optimization.
- Different models possess unique teaching advantages: Llama 70B excels in subtle questioning strategies and knowledge-intensive subjects, o1-mini shines in reasoning-intensive subjects, and Gemini 1.5 Pro is proficient in providing targeted feedback.
- Surprising improvement in specific subjects: Llama 70B achieved up to a 24% accuracy improvement in certain subjects.
- Teacher rankings remain consistent across different student models: This validates the robustness of the evaluation framework.
Highlights & Insights
- Theory-driven evaluation design: Integrates Vygotskian learning theory and formative assessment theory into the AI evaluation framework, grounding technical assessment in pedagogical theory.
- Rigorous data flow control: Teachers cannot access options, students cannot access pre-test results, and learning occurs solely through dialogue. These constraints ensure the fairness of the evaluation and the authenticity of the teaching behaviors.
- Discovery of "Teaching Effectiveness ≠ Knowledge Level": Challenges the assumption that "larger models are inherently better" and points educational AI development toward teaching-specific optimization rather than scale alone.
- Mixed-method evaluation: Combines quantitative metrics (learning gain) with qualitative analysis (17 dimensions of teaching behaviors) to offer a comprehensive perspective.
Limitations & Future Work
- Authenticity of student models: The assumption that the student behavior simulated by LLMs truly reflects human student learning processes requires further validation.
- Limitations of short interactions: Only 5 rounds of dialogue per question may be insufficient to evaluate long-term instructional strategies.
- Uneven subject coverage: Biology has only 19 questions in GPQA Diamond, which impacts the reliability of cross-subject comparisons.
- Overlapping evaluation dimensions: There is overlap among the 17 qualitative analysis dimensions (as acknowledged by the authors).
- Evaluation limited to multiple-choice questions: The post-test uses a multiple-choice question (MCQ) format, failing to evaluate open-ended learning outcomes.
Related Work & Insights
- GPQA (Rein et al., 2023): A high-difficulty Q&A benchmark designed by domain experts.
- MMLU-Pro (Wang et al., 2024): An enhanced reasoning evaluation benchmark with 10 options.
- TeachTune (Jin et al., 2025): Generates teaching dialogues for human evaluation, complementing the automated approach proposed in this work.
- Insight: Evaluating the teaching capability of LLMs might be the closest way to measure "true understanding," as teaching others is inherently harder than solving problems on one's own.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Establishes the first systematic evaluation of LLMs' teaching capabilities with a novel framework design grounded in pedagogical theory.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 14 models with thorough stability validation and ablation studies, though lacking sufficient validation with human students.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, tightly coupling theory with empirical practice.
- Value: ⭐⭐⭐⭐⭐ Reveals the decoupling of teaching capabilities from baseline knowledge levels, offering critical insights for the development of educational AI.