LLMs can Perform Multi-Dimensional Analytic Writing Assessments¶
Conference: ACL 2025
arXiv: 2502.11368
Code: GitHub
Area: LLM/NLP
Keywords: Automated Writing Evaluation, Multi-Dimensional Analytic Scoring, Feedback Comment Generation, L2 Academic Writing, LLM-as-judge
TL;DR¶
Using an L2 postgraduate literature review corpus, this study systematically evaluates the capabilities of LLMs in multi-dimensional analytic writing assessment (scoring + commenting) and proposes ProEval, an explainable feedback quality evaluation framework.
Background & Motivation¶
High cost of multi-dimensional analytic assessment: Human grading and commenting across multiple dimensions simultaneously imposes a heavy cognitive load, is time-consuming, and is expensive, resulting in a severe scarcity of high-quality annotated corpora in the L2 writing field.
Emergence of LLMs in writing assessment: Prior studies have explored LLMs for holistic scoring or single-dimension commenting, but the ability to jointly perform multi-dimensional scoring and commenting has not been systematically investigated.
Limitations of prior work in evaluation: Evaluation of comment quality typically relies on human Likert scale ratings, which are costly, non-reproducible, and difficult to scale.
Lack of suitable evaluation corpora: Publicly available L2 writing corpora that concurrently contain multi-dimensional scores and comments are virtually non-existent.
Core Problem: Can LLMs provide "reasonably good" multi-dimensional analytic assessments? How do different interaction modes and prompting conditions affect the quality?
Value: If LLM assessment is reliable, it can significantly reduce feedback costs in L2 academic writing pedagogy, benefiting both learners and instructors.
Method¶
Overall Architecture¶
The research pipeline consists of three components: (1) constructing an L2 writing corpus annotated with human multi-dimensional assessments; (2) prompting LLMs to execute the same assessment task under various conditions; and (3) proposing the ProEval framework to automatically evaluate the quality of feedback comments.
Corpus Construction¶
- Scale: 141 literature reviews written by 51 L2 graduate students, with an average length of 1,321 words (930 words excluding citations), covering 5 humanities and social science topics.
- Assessment Rubric: Each paper is assessed by 2–3 independent experts, who assign scores on a 10-point scale and write comments across 9 analytic dimensions (C1–C9). C1: Material Selection, C2: Citation Integration, C3: Quality of Key Elements, C4: Structural Logic, C5: Content Clarity, C6: Coherence, C7: Use of Cohesive Devices, C8: Grammar and Syntax, C9: Academic Vocabulary.
- No Data Contamination: The corpus was created prior to the release of ChatGPT and had not been made public before this work, making it suitable for LLM evaluation.
ProEval Feedback Comment Quality Evaluation Framework¶
ProEval evaluates feedback comments with a three-step pipeline:
- Problem Extraction: Uses GPT-4o to extract identified writing problems and their contextual information (explanation/suggestion/correction) from the feedback comments.
- Problem Classification: Characterizes each problem along three dimensions—whether it references a specific location in the paper, whether it includes suggestions for improvement, and whether it provides a directly applicable concrete correction.
- Correction Relevance Check: Uses GPT-4-Turbo to verify whether the identified problems are genuine, whether they relate to the assessment rubric, and whether the corrections are accurate.
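The three steps above can be pictured as filling in one record per extracted problem and then aggregating over a comment. The sketch below is only an illustration under stated assumptions: the `Problem` class, its field names, and the `summarize` helper are hypothetical and not taken from the paper's code, and the LLM calls that would populate the fields are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Problem:
    """One writing problem extracted from a feedback comment (step 1)."""
    description: str
    # Step 2: classification flags.
    has_location: bool     # refers to a specific place in the paper
    has_suggestion: bool   # includes a suggestion for improvement
    has_correction: bool   # gives a directly applicable correction
    # Step 3: verification flags (hypothetical names).
    is_genuine: bool = True          # the problem really exists in the paper
    is_rubric_relevant: bool = True  # the problem relates to the rubric
    correction_ok: bool = True       # the correction, if any, is accurate


def summarize(problems: List[Problem]) -> Dict[str, float]:
    """Aggregate per-comment statistics from the extracted problems."""
    n = len(problems)
    if n == 0:
        return {"n_problems": 0}
    valid = [p for p in problems if p.is_genuine and p.is_rubric_relevant]
    return {
        "n_problems": n,
        "n_valid": len(valid),
        "location_rate": sum(p.has_location for p in problems) / n,
        "suggestion_rate": sum(p.has_suggestion for p in problems) / n,
        "correction_rate": sum(p.has_correction for p in problems) / n,
    }


example = [
    Problem("Paragraph 2 lacks a topic sentence", True, True, False),
    Problem("Vague use of 'things'", False, False, False, is_genuine=False),
]
stats = summarize(example)
```

Keeping every judgment as an explicit boolean is what makes the result inspectable: each aggregate can be traced back to individual problems.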
Framework Validation: Two annotators independently labeled 200+ samples, achieving generally high Cohen's Kappa. The LLM obtained an \(F_1 = 0.92\) in problem extraction and a classification accuracy of \(\ge 5\%\).
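For readers unfamiliar with the two validation metrics, here is a minimal sketch of how Cohen's Kappa and \(F_1\) are commonly computed. It assumes categorical labels from two annotators and, for \(F_1\), that extracted problems can be matched to gold problems by exact identifier. That matching rule is a simplification for illustration and not necessarily how the paper aligned them.

```python
from collections import Counter
from typing import Hashable, Sequence, Set


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators' labels."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1.0 - p_expected)


def f1_score(gold: Set[str], predicted: Set[str]) -> float:
    """F1 over sets of problem identifiers (exact matching assumed)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


kappa_same = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0])
kappa_chance = cohen_kappa([1, 0, 1, 0], [1, 1, 0, 0])
f1_example = f1_score({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```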
LLM Evaluation Experimental Setup¶
- Models: GPT-4o, Gemini-1.5-flash, Llama-3 70B-Instruct.
- Three Interaction Modes: IM1 (a single prompt that asks about all 9 dimensions at once), IM2 (a multi-turn conversation that asks about one dimension per turn), IM3 (9 independent prompts, one per dimension).
- Default Prompt Settings: A system prompt containing the L2 context and assessment guidelines, the paper's reference list included in the input, the score requested before the comment, and temperature = 0.
- Reliability Testing: Single-factor comparisons including altering model versions, simplifying system prompts, removing reference lists, comment-first, and temperature = 1.
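The difference between the three interaction modes is easiest to see in how the user turns are grouped into conversations. The following sketch builds those groupings without calling any model. The prompt wording and the `build_conversations` helper are hypothetical. Only the dimension names and the score-first ordering come from the setup described above.

```python
from typing import List

# Dimension names follow the rubric; the prompt wording is an assumption.
DIMENSIONS = [
    "C1: Material Selection", "C2: Citation Integration",
    "C3: Quality of Key Elements", "C4: Structural Logic",
    "C5: Content Clarity", "C6: Coherence",
    "C7: Use of Cohesive Devices", "C8: Grammar and Syntax",
    "C9: Academic Vocabulary",
]


def question(dimension: str) -> str:
    # Score first, comment second, as in the default prompt setting.
    return f"For {dimension}, give a score on the 10-point scale, then a comment."


def build_conversations(mode: str, essay: str) -> List[List[str]]:
    """Return conversations; each one is an ordered list of user turns."""
    intro = f"Here is the literature review to assess:\n{essay}\n\n"
    questions = [question(d) for d in DIMENSIONS]
    if mode == "IM1":  # one conversation, one turn covering all dimensions
        return [[intro + "\n".join(questions)]]
    if mode == "IM2":  # one conversation, one dimension per turn
        return [[intro + questions[0]] + questions[1:]]
    if mode == "IM3":  # nine independent single-turn conversations
        return [[intro + q] for q in questions]
    raise ValueError(f"unknown interaction mode: {mode}")


im1 = build_conversations("IM1", "essay text")
im2 = build_conversations("IM2", "essay text")
im3 = build_conversations("IM3", "essay text")
```

Under IM3 the essay is resent with every prompt, so no dimension's assessment can be influenced by the model's answers for the others.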
Key Experimental Results¶
Table 1: Scoring Agreement (Figure 3 Heatmap)¶
| Comparison | QWK Range | AAR1 Range |
|---|---|---|
| Human-Human | Higher | Higher |
| LLM-LLM | Highest | Highest |
| Human-LLM (Best) | 0.59-0.88 (AAR1) | Majority \(>0.5\) |
Key Findings: (a) Human raters agree more with other human raters, whereas LLMs cluster with other LLMs; (b) LLM scores usually differ from human scores by \(\le 1\) point; (c) Human-LLM agreement is highest under the IM3 interaction mode; (d) Agreement is higher for technical/objective dimensions such as C1/C2/C8/C9 and lowest for C7 (cohesive devices).
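As a reference for the two agreement metrics, the sketch below computes quadratic weighted kappa (QWK) in its standard form and an adjacent agreement rate. It assumes AAR1 denotes the share of score pairs that differ by at most 1 point, which is consistent with finding (b) but is an interpretation rather than a definition given here. The function names and the integer score range passed in are also assumptions.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Standard QWK for two raters' integer scores on the same items."""
    k = max_score - min_score + 1
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        return 1.0  # each rater used one score and the two scores coincide
    return 1.0 - weighted_observed / weighted_expected


def adjacent_agreement_rate(a: Sequence[int], b: Sequence[int],
                            tolerance: int = 1) -> float:
    """Share of items whose two scores differ by at most `tolerance`."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


qwk_reversed = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
qwk_same = quadratic_weighted_kappa([7, 8, 6, 9], [7, 8, 6, 9], 1, 10)
aar1 = adjacent_agreement_rate([7, 8, 6, 9], [8, 8, 4, 9])
```

QWK penalizes large disagreements more heavily than small ones, while the adjacent agreement rate treats any gap within the tolerance as agreement, so the two metrics can rank rater pairs differently.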
Table 2: Statistics on Feedback Comments (Table 2)¶
| Evaluator | Comment Rate | Avg Length | Problem Detection Rate | Avg Problem Count |
|---|---|---|---|---|
| Human B | 0.24 | 104 \(\pm\) 85 | 0.97 | 3.8 \(\pm\) 3.5 |
| Human C | 1.00 | 62 \(\pm\) 85 | 0.56 | 1.3 \(\pm\) 1.8 |
| GPT-4o IM1 | 1.00 | 65 \(\pm\) 14 | 1.00 | 2.1 \(\pm\) 0.9 |
| GPT-4o IM3 | 1.00 | 381 \(\pm\) 65 | 1.00 | 6.1 \(\pm\) 2.0 |
| Gemini IM3 | 1.00 | 571 \(\pm\) 182 | 1.00 | 8.2 \(\pm\) 3.3 |
Key Findings: (a) LLMs consistently provide comments and detect problems, whereas humans sometimes omit them; (b) Comments generated under IM2/IM3 are significantly longer and more detailed than those from IM1; (c) Under IM3, LLMs provide more concrete corrections in subjective dimensions (C3–C6) compared to humans; (d) Scores correlate negatively with the problems raised in the comments, as expected, which supports the validity of the assessments.
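One simple way to check the negative relationship in finding (d) is to correlate each assessment's score with the number of problems extracted from its comment. The sketch below uses Pearson correlation on made-up numbers. Both the choice of Pearson and the pairing of scores with problem counts are assumptions for illustration, and the data are not from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: higher scores paired with fewer detected problems.
scores = [9, 8, 6, 5]
problem_counts = [1, 2, 4, 5]
r = pearson(scores, problem_counts)
```

A coefficient near zero or above it would suggest that the comments do not justify the scores they accompany.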
Reliability Testing (Table 4)¶
After altering model versions or prompting conditions, AAR1 remains \(\ge 0.81\) (majority \(>0.9\)) and BERTScore remains \(\ge 0.67\), indicating that LLM assessment exhibits strong stability and robustness.
Highlights & Insights¶
- Deft design of the ProEval framework: It decomposes subjective comment quality assessment into a verifiable pipeline of subtasks, achieving interpretability, extensibility, and reproducibility, and outperforming direct Likert scale ratings.
- Comprehensive experimental design: Evaluating 3 interaction modes \(\times\) multiple ablation conditions \(\times\) 3 LLMs, covering both scoring and commenting dimensions.
- Corpus contribution: The first publicly available L2 academic writing corpus that concurrently includes multi-dimensional scores and comments, with zero data contamination.
- High practical value: Under the IM3 mode, LLMs can provide concrete corrective recommendations even in subjective dimensions, compensating for the lack of detailed human commentary in such dimensions.
Limitations & Future Work¶
- Domain limitations: It only covers English literature reviews, leaving other genres (such as technical reports or creative writing) and other languages unverified.
- Indirect evaluation: The evaluation of comment quality is indirect (via problem extraction rather than direct human judgment) and lacks large-scale human validation.
- Assumption limitations of ProEval: It does not account for factors influencing perceived comment quality, such as politeness or logical coherence.
- Limited ablation studies: Only one condition is changed at a time, leaving multi-factor interaction effects unexplored.
- Model timeliness: Only GPT-4o, Gemini-1.5-flash, and Llama-3 70B were tested, without covering more recent models.
Related Work & Insights¶
- Traditional AWE/AES systems: Research since the 1960s has focused primarily on holistic scoring, with later work using DNNs for scoring (Taghipour & Ng 2016) and addressing sentence-level error correction (Nagata 2019).
- LLMs for AWE: Holistic scoring (Mizumoto & Eguchi 2023; Yancey et al. 2023), multi-dimensional scoring (Yavuz et al. 2024; Banno et al. 2024), and feedback generation (Han et al. 2024; Behzad et al. 2024). Stahl et al. (2024) is the only study investigating joint scoring and commenting, but focuses solely on holistic assessment.
- L2 writing corpora: TOEFL11 (scored without comments), CLC-FCE (error-annotated), and LEAF (personalized feedback)—none of which contain joint multi-dimensional score and comment annotations.
Rating¶
- Novelty: ⭐⭐⭐ — The task definition (joint multi-dimensional scoring and commenting) is innovative, and the ProEval framework is novel, though the core remains LLM prompting evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — The experiments are comprehensive with thorough ablations, and ProEval is doubly validated through human annotation and LLM-as-judge.
- Value: ⭐⭐⭐⭐ — The corpus and framework have direct practical value for L2 education and automated assessment research.
- Writing Quality: ⭐⭐⭐⭐ — The paper is clearly structured, with high-density charts/tables and detailed methodological descriptions.