Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

Conference: AAAI 2026 · arXiv: 2511.11772 · Code: GitHub · Area: Education AI / LLM Applications · Keywords: Multi-Agent Systems, Formative Feedback, Automated Scoring, Fairness, Metacognition

TL;DR

This paper proposes a zero-shot multi-agent pipeline of five role-based GPT-4o agents that score learner reflection texts against a rubric and generate bias-aware conversational feedback. Evaluated on 336 reflections, the system reaches MAE = 0.467 and QWK = 0.459 against expert scores, with an aggregate feedback quality of Q(g) = 3.967 on a 1–5 Likert scale.

Background & Motivation

  • Core Problem: Formative feedback is one of the most effective interventions for improving learning outcomes (effect size up to 0.7), yet in large-enrollment courses, instructors cannot respond individually to every student's reflection text. This feedback gap disproportionately affects students from disadvantaged backgrounds.
  • Limitations of Prior Work: Although LLMs can process text at superhuman speed, (1) without instructor-designed rubrics they tend to overemphasize surface-level expression, and (2) existing LLM-based scoring systems focus primarily on score agreement while neglecting fairness and the pedagogical value of feedback.
  • Key Challenge: No prior system has integrated stable rubric-based scoring, bias-aware feedback generation, and explicit fairness evaluation into an end-to-end pipeline.
  • Key Insight: Decomposing the task across specialized agent roles yields a complete pipeline covering scoring, equity auditing, metacognitive prompting, aggregation, and self-verification, while the cross-ability fairness metric \(\Delta_{\text{MAE}}\) quantifies and constrains scoring bias.

Method

Overall Architecture: Five-Agent Pipeline

Five GPT-4o agents are orchestrated within the AutoGen framework as a collaborative pipeline. All inference is zero-shot (no fine-tuning) with temperature set to 0.3. Each reflection passes through the full pipeline and yields a score of 0–3 on each of four rubric dimensions plus a learner-facing feedback message of at most 120 words.
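A minimal sketch of how the five roles might be declared, assuming the classic pyautogen API and an OPENAI_API_KEY in the environment; the system prompts below are paraphrases of the paper's role descriptions, not the authors' actual prompts.

```python
# Sketch only: role prompts are paraphrased, config structure is an assumption.
from autogen import AssistantAgent

LLM_CONFIG = {
    "config_list": [{"model": "gpt-4o"}],  # zero-shot GPT-4o, no fine-tuning
    "temperature": 0.3,                    # sampling temperature from the paper
}

SYSTEM_PROMPTS = {
    "Evaluator": "Score the reflection 0-3 on four rubric dimensions; "
                 "return JSON with scores, reasoning, and suggestions.",
    "EquityMonitor": "Flag biased or exclusionary language in the evaluation "
                     "and propose revisions.",
    "Metacognitive": "Pose 1-2 questions that push the learner to examine "
                     "their own reasoning.",
    "Aggregator": "Merge all inputs into feedback of at most 120 words with "
                  "a few actionable next steps.",
    "Reflexion": "Verify the final feedback; reply CONFIDENT or REVISE with "
                 "specific fixes.",
}

agents = {
    name: AssistantAgent(name=name, system_message=prompt, llm_config=LLM_CONFIG)
    for name, prompt in SYSTEM_PROMPTS.items()
}
```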

Key Designs: Role Assignments of the Five Agents

  1. Evaluator Agent: Scores reflections along four rubric dimensions (conceptual understanding, real-world application, reflective questioning, and clarity of expression), outputting structured JSON containing scores, reasoning, and improvement suggestions.
  2. Equity Monitor Agent: Reviews the evaluative narrative for biased or exclusionary language and proposes revisions.
  3. Metacognitive Agent: Generates one to two prompting questions to encourage learners to examine their own reasoning.
  4. Aggregator Agent: Synthesizes outputs from the preceding three agents into a concise feedback response (≤120 words) that highlights only a small number of actionable next steps.
  5. Reflexion Agent: Performs a final verification pass over the aggregated feedback, returning either CONFIDENT or REVISE along with specific revision suggestions.

The Evaluator, Equity Monitor, and Metacognitive agents can run in parallel; the Aggregator and Reflexion agents execute sequentially.
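A framework-agnostic sketch of that dependency structure follows; the `ask` helper is hypothetical and stands in for a per-role GPT-4o call.

```python
# Hypothetical orchestration sketch: three independent agents fan out in
# parallel, then Aggregator and Reflexion run sequentially.
import asyncio

async def ask(role: str, prompt: str) -> str:
    """Stand-in for a per-role GPT-4o call; swap in a real client."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"[{role}] output"

async def assess(reflection: str) -> dict:
    # Evaluator, Equity Monitor, and Metacognitive agents only need the
    # reflection itself, so they can run concurrently.
    scores, equity, probes = await asyncio.gather(
        ask("evaluator", reflection),
        ask("equity_monitor", reflection),
        ask("metacognitive", reflection),
    )
    # Aggregator consumes all three outputs; Reflexion verifies the result.
    feedback = await ask("aggregator", f"{scores}\n{equity}\n{probes}")
    verdict = await ask("reflexion", feedback)
    if verdict.startswith("REVISE"):  # one revision pass on a REVISE verdict
        feedback = await ask("aggregator", f"Revise:\n{feedback}\n{verdict}")
    return {"feedback": feedback, "verdict": verdict}

print(asyncio.run(assess("Today I learned how loss functions shape training...")))
```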

Loss & Training: Three-Objective Formalization and Fairness Framework

  • Objective 1 (Scoring Accuracy): MAE and QWK are used to measure agreement between model predictions and expert annotations.
  • Objective 2 (Fairness): Students are split by human rating into low-ability (scores 0–1) and high-ability (scores 2–3) groups, and the maximum inter-group error gap is computed as \(\Delta_{\text{MAE}} = \max_{b \in \mathcal{B}} \left|\text{MAE}_b(f) - \text{MAE}_{\neg b}(f)\right|\), where \(\mathcal{B}\) is the set of ability groups and \(f\) the scoring model.
  • Objective 3 (Feedback Utility): Feedback quality is the aggregate \(Q(g) = \frac{1}{M}\sum_{j=1}^{M}\frac{1}{5}\sum_{d=1}^{5} q_{j,d}\), where \(q_{j,d}\) is the 1–5 Likert rating of feedback item \(j\) on quality dimension \(d\) (correctness, rubric alignment, actionability, depth of insight, empathetic tone) and \(M\) is the number of rated items.

No training or fine-tuning is performed; all reasoning is accomplished through structured prompting.
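Under those definitions, all three objectives reduce to a few lines of NumPy/scikit-learn. A sketch, assuming integer rubric scores and the binary ability split described above:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def mae(y_true, y_pred):
    """Mean absolute error between human and model rubric scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

def qwk(y_true, y_pred):
    """Quadratic-weighted kappa for ordinal 0-3 rubric scores."""
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

def delta_mae(y_true, y_pred, threshold=2):
    """Cross-ability fairness gap: with only low (<2) and high (>=2)
    human-score groups, the max over groups collapses to one difference."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    low = y_true < threshold
    return abs(mae(y_true[low], y_pred[low]) - mae(y_true[~low], y_pred[~low]))

def feedback_quality(ratings):
    """Q(g): grand mean of an (M, 5) array of 1-5 Likert ratings."""
    return float(np.mean(np.asarray(ratings)))

# Toy usage with hypothetical scores:
human = [0, 1, 2, 3, 2, 1]
model = [1, 1, 2, 2, 3, 0]
print(mae(human, model), qwk(human, model), delta_mae(human, model))
```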

Key Experimental Results

Main Results: Scoring Accuracy (MAE, lower is better)

| Dimension | MAE |
| --- | --- |
| Conceptual Understanding | 0.381 |
| Real-World Application | 0.560 |
| Reflective Questioning | 0.500 |
| Clarity of Expression | 0.429 |
| Overall | 0.467 |

Ordinal Agreement (QWK, higher is better)

| Dimension | μ (QWK) | σ |
| --- | --- | --- |
| Conceptual Understanding | 0.298 | 0.158 |
| Real-World Application | 0.479 | 0.077 |
| Reflective Questioning | 0.483 | 0.088 |
| Clarity of Expression | 0.349 | 0.126 |
| Overall | 0.459 | 0.008 |

Feedback Quality (1–5 Likert, higher is better)

| Dimension | Mean ± SD |
| --- | --- |
| Correctness | 4.080 ± 0.756 |
| Rubric Alignment | 3.924 ± 0.763 |
| Actionability | 3.760 ± 0.845 |
| Depth of Insight | 3.845 ± 0.860 |
| Empathetic Tone | 4.223 ± 0.612 |
| Overall Q(g) | 3.967 |

Fairness (\(\Delta_\text{MAE}\), lower is better)

| Dimension | Low-Ability MAE | High-Ability MAE | \(\Delta_\text{MAE}\) |
| --- | --- | --- | --- |
| Conceptual Understanding | 1.000 | 0.278 | 0.722 |
| Real-World Application | 1.500 | 0.403 | 1.097 |
| Reflective Questioning | 0.917 | 0.431 | 0.486 |
| Clarity of Expression | 0.167 | 0.472 | 0.306 |
| Overall | 0.896 | 0.396 | 0.500 |

Ablation Study: Efficiency

  • Per-reflection scoring latency: 7.71 s (vs. a human average of 1.4 min, roughly an 11× speedup)
  • End-to-end latency (including feedback generation): 33.35 s per reflection
  • Batch of 84 reflections: 10.8 min to score, 46.7 min for complete feedback generation (see the consistency check below)
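As a quick consistency check (our arithmetic, from the reported per-item latencies):

\( 84 \times 7.71\,\mathrm{s} \approx 648\,\mathrm{s} \approx 10.8\,\mathrm{min}, \qquad 84 \times 33.35\,\mathrm{s} \approx 2801\,\mathrm{s} \approx 46.7\,\mathrm{min}, \qquad \frac{1.4 \times 60\,\mathrm{s}}{7.71\,\mathrm{s}} \approx 10.9 \approx 11\times \)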

Highlights & Insights

  1. Explicit Fairness Quantification: This work is the first to treat cross-ability scoring fairness as a measurable optimization objective rather than an afterthought, making the system auditable.
  2. Role Separation Enhances Transparency: Each of the five agents handles a distinct subtask that can be independently audited, and instructors can adjust individual components.
  3. Zero-Shot Performance Approaches Expert Level: Without fine-tuning, the system achieves MAE=0.467 and QWK=0.459, with empathetic tone receiving the highest feedback rating (4.223/5).
  4. Fairness Analysis Reveals Systematic Bias: Low-ability students face substantially larger scoring errors (\(\Delta_\text{MAE}\)=0.500), providing a clear direction for future improvement.

Limitations & Future Work

  1. Limited Dataset Scale: Only 336 reflections from 28 learners in a single AI literacy course; generalizability remains to be validated.
  2. Persistent Fairness Gaps: The real-world application dimension exhibits \(\Delta_\text{MAE}\) = 1.097; the low-ability group's MAE (1.500) is nearly four times the high-ability group's (0.403), so scoring errors concentrate on low-ability students.
  3. Feedback Quality Below Target: Q(g)=3.967 falls short of the 4.0 target; actionability (3.760) and depth of insight (3.845) are the weakest dimensions.
  4. Dependence on Closed-Source GPT-4o: Reproducibility and cost are constrained by API access.
  5. No Demographic Attributes Collected: Fairness analysis is limited to ability-level groupings, precluding evaluation of bias along dimensions such as race or gender.

Related Work & Context

  • Evolution of Automated Scoring Systems: From rule-based methods in the 1960s through BERT- and GPT-4-based prompt evaluation, this paper explores the zero-shot multi-agent direction.
  • Multi-Agent LLM Systems: The work builds on the AutoGen framework for role-based collaboration, but unlike general-purpose LLM agent systems, it makes educational equity the central constraint.
  • Transferable Insight: The \(\Delta_{\text{MAE}}\) fairness framework carries over to other automated assessment settings that require equity, such as hiring screening and medical consultation.

Rating

⭐⭐⭐ — The problem framing is compelling (educational equity + automated feedback) and the fairness quantification framework is valuable, but the experimental scale is too small, zero-shot performance still lags behind human experts, and the technical contributions are relatively straightforward.