Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents¶
Conference: AAAI 2026 · arXiv: 2511.11772 · Code: GitHub · Area: Education AI / LLM Applications · Keywords: Multi-Agent Systems, Formative Feedback, Automated Scoring, Fairness, Metacognition
TL;DR¶
This paper proposes a zero-shot multi-agent pipeline of five role-based GPT-4o agents that score learner reflection texts against a rubric and generate bias-aware conversational feedback. Evaluated on 336 reflections, the system reaches MAE=0.467 and QWK=0.459 in scoring agreement with expert annotations, and an overall feedback quality score of Q(g)=3.967 on a 1–5 Likert scale.
Background & Motivation¶
- Core Problem: Formative feedback is one of the most effective interventions for improving learning outcomes (effect size up to 0.7), yet in large-enrollment courses, instructors cannot respond individually to every student's reflection text. This feedback gap disproportionately affects students from disadvantaged backgrounds.
- Limitations of Prior Work: Although LLMs can process text at superhuman speed, (1) without instructor-designed rubrics they tend to overweight surface-level expression, and (2) existing LLM-based scoring systems focus primarily on score agreement while neglecting fairness and the pedagogical value of feedback.
- Key Challenge: No prior system has integrated stable rubric-based scoring, bias-aware feedback generation, and explicit fairness evaluation into an end-to-end pipeline.
- Key Insight: A multi-agent role decomposition approach realizes a complete pipeline covering scoring, equity auditing, metacognitive prompting, aggregation, and self-verification, while introducing the cross-ability fairness metric \(\Delta_{\text{MAE}}\) to constrain scoring bias.
Method¶
Overall Architecture: Five-Agent Pipeline¶
Five GPT-4o agents are orchestrated in a collaborative pipeline within the AutoGen framework. All inference is zero-shot (no fine-tuning), with temperature set to 0.3. Each reflection text passes through all agents and produces rubric scores on four dimensions (0–3 each) and a learner-facing feedback response of no more than 120 words.
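A minimal configuration sketch for one of the agents, assuming the pyautogen 0.2-style API; the system message paraphrases the role description below, and only the GPT-4o model choice and temperature of 0.3 come from the paper:

```python
import os
import autogen  # pyautogen

# Shared LLM configuration: zero-shot GPT-4o at temperature 0.3 (values from the paper);
# the config_list / api_key plumbing follows pyautogen 0.2 conventions and is an assumption here.
llm_config = {
    "config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}],
    "temperature": 0.3,
}

# One of the five role-based agents; the prompt is illustrative, not the authors' exact wording.
evaluator = autogen.AssistantAgent(
    name="Evaluator",
    system_message=(
        "Score the learner reflection on four rubric dimensions (conceptual understanding, "
        "real-world application, reflective questioning, clarity of expression), each 0-3. "
        "Return JSON with scores, reasoning, and improvement suggestions."
    ),
    llm_config=llm_config,
)
```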
Key Designs: Role Assignments of the Five Agents¶
- Evaluator Agent: Scores reflections along four rubric dimensions (conceptual understanding, real-world application, reflective questioning, and clarity of expression), outputting structured JSON containing scores, reasoning, and improvement suggestions.
- Equity Monitor Agent: Reviews the evaluative narrative for biased or exclusionary language and proposes revisions.
- Metacognitive Agent: Generates one to two prompting questions to encourage learners to examine their own reasoning.
- Aggregator Agent: Synthesizes outputs from the preceding three agents into a concise feedback response (≤120 words) that highlights only a small number of actionable next steps.
- Reflexion Agent: Performs a final self-verification pass over the aggregated feedback, returning either CONFIDENT or REVISE with specific revision suggestions.
The Evaluator, Equity Monitor, and Metacognitive agents can run in parallel; the Aggregator and Reflexion agents execute sequentially.
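A minimal sketch of this data flow, written as plain async Python around direct OpenAI API calls rather than the authors' AutoGen orchestration; the role prompts, payload fields, and the choice to re-run the Aggregator once on a REVISE verdict are all assumptions:

```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder role prompts; the authors' actual prompts are not reproduced in this summary.
SYSTEM_PROMPTS = {
    "evaluator": "Score the reflection on the four rubric dimensions (0-3 each); reply in JSON.",
    "equity_monitor": "Flag biased or exclusionary wording and propose revisions; reply in JSON.",
    "metacognitive": "Write 1-2 questions prompting the learner to examine their reasoning; reply in JSON.",
    "aggregator": "Combine the inputs into a <=120-word learner-facing feedback message; reply in JSON.",
    "reflexion": "Verify the draft feedback; reply in JSON with status CONFIDENT or REVISE plus suggestions.",
}

async def call_agent(role: str, payload: dict) -> dict:
    """One zero-shot GPT-4o call per role (temperature 0.3 per the paper), parsed as JSON."""
    response = await client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[role]},
            {"role": "user", "content": json.dumps(payload)},
        ],
    )
    return json.loads(response.choices[0].message.content)

async def give_feedback(reflection: str) -> dict:
    # Stage 1: Evaluator, Equity Monitor, and Metacognitive agents are independent -> run concurrently.
    scores, equity, prompts = await asyncio.gather(
        call_agent("evaluator", {"reflection": reflection}),
        call_agent("equity_monitor", {"reflection": reflection}),
        call_agent("metacognitive", {"reflection": reflection}),
    )
    # Stage 2: the Aggregator composes the <=120-word learner-facing response.
    feedback = await call_agent(
        "aggregator", {"scores": scores, "equity_review": equity, "questions": prompts}
    )
    # Stage 3: the Reflexion agent verifies the draft; handling of REVISE is an assumption here.
    verdict = await call_agent("reflexion", {"scores": scores, "feedback": feedback})
    if verdict.get("status") == "REVISE":
        feedback = await call_agent(
            "aggregator",
            {"scores": scores, "equity_review": equity, "questions": prompts,
             "revision_notes": verdict.get("suggestions")},
        )
    return {"scores": scores, "feedback": feedback}

# Example: print(asyncio.run(give_feedback("Today I learned how gradient descent ...")))
```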
Loss & Training: Three-Objective Formalization and Fairness Framework¶
- Objective 1 (Scoring Accuracy): MAE and QWK are used to measure agreement between model predictions and expert annotations.
- Objective 2 (Fairness): Students are divided into low-ability (human scores 0–1) and high-ability (human scores 2–3) groups, and the maximum inter-group error gap is computed as \(\Delta_{\text{MAE}} = \max_{b \in \mathcal{B}} |\text{MAE}_b(f) - \text{MAE}_{\neg b}(f)|\), where \(\mathcal{B}\) is the set of ability groups and \(\neg b\) denotes the complementary group.
- Objective 3 (Feedback Utility): Feedback quality is measured by the aggregated score \(Q(g) = \frac{1}{M}\sum_{j}\frac{1}{5}\sum_{d}q_{j,d}\), where \(M\) is the number of rated feedback messages and \(q_{j,d}\) is the 1–5 Likert rating of message \(j\) on dimension \(d\) (correctness, rubric alignment, actionability, depth of insight, empathetic tone).
No training or fine-tuning is performed; all reasoning is accomplished through structured prompting.
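A sketch of how the three objectives could be computed with scikit-learn, assuming human and model scores are stored as integer arrays per rubric dimension; function and variable names are illustrative:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def scoring_metrics(human: np.ndarray, model: np.ndarray) -> dict:
    """Objective 1: MAE (lower is better) and quadratic-weighted kappa (higher is better)
    between human and model rubric scores on the 0-3 scale."""
    return {
        "mae": mean_absolute_error(human, model),
        "qwk": cohen_kappa_score(human, model, weights="quadratic"),
    }

def delta_mae(human: np.ndarray, model: np.ndarray) -> float:
    """Objective 2: maximum absolute MAE gap across ability groups.
    Low-ability = human score 0-1, high-ability = human score 2-3; with two complementary
    groups the two gaps coincide, but the max is kept to mirror the formula over B."""
    groups = [human <= 1, human >= 2]
    gaps = []
    for mask in groups:
        mae_in = mean_absolute_error(human[mask], model[mask])
        mae_out = mean_absolute_error(human[~mask], model[~mask])
        gaps.append(abs(mae_in - mae_out))
    return max(gaps)

def feedback_quality(ratings: np.ndarray) -> float:
    """Objective 3: Q(g) = mean over M feedback messages of the mean of the
    five 1-5 Likert dimensions (ratings has shape M x 5)."""
    return ratings.mean(axis=1).mean()
```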
Key Experimental Results¶
Main Results: Scoring Accuracy (MAE, lower is better)¶
| Dimension | MAE |
|---|---|
| Conceptual Understanding | 0.381 |
| Real-World Application | 0.560 |
| Reflective Questioning | 0.500 |
| Clarity of Expression | 0.429 |
| Overall | 0.467 |
Ordinal Agreement (QWK, higher is better)¶
| Dimension | μ (QWK) | σ |
|---|---|---|
| Conceptual Understanding | 0.298 | 0.158 |
| Real-World Application | 0.479 | 0.077 |
| Reflective Questioning | 0.483 | 0.088 |
| Clarity of Expression | 0.349 | 0.126 |
| Overall | 0.459 | 0.008 |
Feedback Quality (1–5 Likert, higher is better)¶
| Dimension | Mean ± SD |
|---|---|
| Correctness | 4.080 ± 0.756 |
| Rubric Alignment | 3.924 ± 0.763 |
| Actionability | 3.760 ± 0.845 |
| Depth of Insight | 3.845 ± 0.860 |
| Empathetic Tone | 4.223 ± 0.612 |
| Overall Q(g) | 3.967 |
Fairness (\(\Delta_\text{MAE}\), lower is better)¶
| Dimension | Low-Ability MAE | High-Ability MAE | \(\Delta_\text{MAE}\) |
|---|---|---|---|
| Conceptual Understanding | 1.000 | 0.278 | 0.722 |
| Real-World Application | 1.500 | 0.403 | 1.097 |
| Reflective Questioning | 0.917 | 0.431 | 0.486 |
| Clarity of Expression | 0.167 | 0.472 | 0.306 |
| Overall | 0.896 | 0.396 | 0.500 |
Efficiency Analysis¶
- Per-reflection scoring latency: 7.71s (vs. human average of 1.4 min; 11× speedup)
- End-to-end latency (including feedback generation): 33.35s per reflection
- Batch of 84 reflections: 10.8 min for scoring alone; 46.7 min including full feedback generation
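These batch times are consistent with the per-reflection latencies: \(84 \times 7.71\,\mathrm{s} \approx 648\,\mathrm{s} \approx 10.8\,\mathrm{min}\) for scoring and \(84 \times 33.35\,\mathrm{s} \approx 2801\,\mathrm{s} \approx 46.7\,\mathrm{min}\) end-to-end, while \(1.4\,\mathrm{min} / 7.71\,\mathrm{s} \approx 10.9\) yields the quoted ≈11× speedup over human scoring.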
Highlights & Insights¶
- Explicit Fairness Quantification: This work is the first to treat cross-ability scoring fairness as a measurable optimization objective rather than an afterthought, making the system auditable.
- Role Separation Enhances Transparency: Each of the five agents handles a distinct subtask that can be independently audited, and instructors can adjust individual components.
- Zero-Shot Performance Approaches Expert Level: Without fine-tuning, the system achieves MAE=0.467 and QWK=0.459, with empathetic tone receiving the highest feedback rating (4.223/5).
- Fairness Analysis Reveals Systematic Bias: Low-ability students face substantially larger scoring errors (\(\Delta_\text{MAE}\)=0.500), providing a clear direction for future improvement.
Limitations & Future Work¶
- Limited Dataset Scale: Only 336 reflections from 28 learners in a single AI literacy course; generalizability remains to be validated.
- Persistent Fairness Gaps: The real-world application dimension exhibits \(\Delta_\text{MAE}\)=1.097, meaning reflections from low-ability students are scored far less accurately than those from high-ability students.
- Feedback Quality Below Target: Q(g)=3.967 falls short of the 4.0 target; actionability (3.760) and depth of insight (3.845) are the weakest dimensions.
- Dependence on Closed-Source GPT-4o: Reproducibility and cost are constrained by API access.
- No Demographic Attributes Collected: Fairness analysis is limited to ability-level groupings, precluding evaluation of bias along dimensions such as race or gender.
Related Work & Insights¶
- Evolution of Automated Scoring Systems: Automated scoring has progressed from rule-based methods in the 1960s to BERT- and GPT-4-based prompted evaluation; this paper explores the zero-shot multi-agent direction.
- Multi-Agent LLM Systems: The work draws on the AutoGen framework for role-based collaboration; unlike general-purpose LLM agents, however, educational equity serves as the central constraint.
- Insights: The fairness framework \(\Delta_{\text{MAE}}\) is transferable to other automated systems requiring equitable assessment, such as hiring screening and medical consultation.
Rating¶
⭐⭐⭐ — The problem framing is compelling (educational equity + automated feedback) and the fairness quantification framework is valuable, but the experimental scale is too small, zero-shot performance still lags behind human experts, and the technical contributions are relatively straightforward.