Incentives Of EdTech: A Systematic Review Of EduNLP Research

Conference: ACL2026
arXiv: 2606.13691
Code: Pre-registered on OSF (no code repository)
Area: NLP / Educational NLP (EduNLP) Systematic Review
Keywords: Educational Technology, Systematic Literature Review, Stakeholders, Research Incentives, AI Ethics

TL;DR

This is the first systematic literature review of EduNLP (Educational Natural Language Processing) focusing on the ACL Anthology. The authors manually annotated 204 papers from the 2024–2025 BEA/NLP4CALL workshops and main conferences across five dimensions: tasks, motives, stakeholder inclusion, incentive structures, and ethical risks. A core tension is identified: research is driven by private sector incentives (e.g., commercial automated grading), while the actual needs of educational infrastructure—especially teachers—are systematically neglected. Teachers are treated as beneficiaries in only 33.3% of papers, real-world deployment accounts for only 9.8%, and ethical engagement often stops at "acknowledgment" rather than "action."

Background & Motivation

Background: From early feature-based automated scorers (e.g., e-rater) to LLM-driven intelligent tutoring systems (e.g., Khanmigo), the objective of applying NLP to education has always been to "extend good instruction and support learners who would otherwise be out of school." Given the global teacher shortage, the widening gap in educational equity, and the rapid adoption of commercial AI educational products, the question of "what role technology plays in education" is more urgent than ever.

Limitations of Prior Work: Rapidly evolving research fields face a specific risk: the closer researchers are to immediate technical problems, the easier it is to lose sight of the larger goals. Researchers are naturally drawn to familiar datasets, trusted metrics, and tasks with clear progress, so the question "Can this system run?" displaces the more critical "Does it truly serve those we claim to serve?" Several existing EdTech ethics reviews (Yan et al. 2025, Fu and Weng 2024, Holmes et al. 2022, etc.) repeatedly diagnose the same issue: ethics are widely acknowledged in principle but inconsistently implemented in practice. Most of these reviews, however, do not focus on the ACL/NLP community itself.

Key Challenge: A "push-pull" tension exists in EduNLP research between private sector incentives (automated grading has direct commercial value for large testing organizations and EdTech companies) and the fundamental needs of educational infrastructure (teacher empowerment, real-world classroom deployment, and the agency of those affected). The research agenda may be quietly shaped by commercial interests.

Goal: To audit the NLP community's own educational research through three research questions: RQ1 examines which tasks are prioritized, their motivations, and the contexts in which systems are deployed; RQ2 identifies stakeholders, how they are included, and whose interests are served; RQ3 investigates which risks/limitations are raised and to what extent they are mitigated.

Key Insight: Rather than making a technical contribution, this work takes a step back and uses a systematic literature review methodology to examine quantitatively "whether the field has met its own aspirations."

Core Idea: A multi-dimensional annotation schema covering tasks, motives, stakeholders, incentives, and risks is used to conduct an honest "self-check" of 204 EduNLP papers from the ACL Anthology, distilling actionable recommendations for improvement from exemplary papers.

Method

Overall Architecture

This is a methodologically rigorous systematic review built on a pipeline of "retrieval → sampling → three-stage manual annotation → consistency measurement → multi-dimensional analysis." The authors defined the corpus using two sources: all papers from the 2024–2025 BEA and NLP4CALL workshops, plus an API search of ACL Anthology main-conference and associated-conference titles/abstracts using 38 EduNLP-related search terms (e.g., "student modeling"). These sources yielded 191 (workshop) and 316 (ACL) hits, respectively. Stratified sampling was then applied: for workshops, 25% of the papers in each shared task were randomly sampled (minimum 5 papers per task) and all shared task overview papers were included; for ACL, 214 irrelevant papers were manually excluded after abstract reading, and 44 papers were sampled from the remaining 102 based on year/conference/search term. The final corpus is 160 workshop + 44 ACL = 204 papers. Each paper was manually annotated for tasks, datasets, motives, stakeholders, incentives, ethical risks, mitigation measures, and future directions.
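The counts reported for the corpus funnel can be cross-checked with a few lines of arithmetic. The sketch below only restates the figures quoted above; the class and field names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusFunnel:
    """Counts as quoted in this summary; field names are hypothetical."""
    workshop_hits: int = 191   # workshop papers retrieved
    workshop_kept: int = 160   # workshop papers in the final corpus
    acl_hits: int = 316        # ACL Anthology search hits
    acl_excluded: int = 214    # removed as irrelevant after abstract reading
    acl_sampled: int = 44      # sampled by year/conference/search term

    @property
    def acl_relevant(self) -> int:
        # Papers left after manual relevance screening.
        return self.acl_hits - self.acl_excluded

    @property
    def corpus_size(self) -> int:
        # Final corpus = workshop papers kept + ACL papers sampled.
        return self.workshop_kept + self.acl_sampled


funnel = CorpusFunnel()
print(funnel.acl_relevant, funnel.corpus_size)
```

Both derived values match the reported 102 relevant ACL papers and the 204-paper final corpus.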

Key Designs

1. Dual-source Retrieval + Stratified Sampling: Covering Contemporary EduNLP via 204 Representative Papers

Because existing reviews do not focus on the NLP community, the authors intentionally limited the corpus to the ACL Anthology. A dual retrieval protocol was used. On the workshop side, the full set of papers was collected and shared task papers were then sampled at 25% per task, which keeps every task represented while retaining the overview papers, each of which is unique and qualitative. On the main conference side, 38 domain search terms identified candidates, followed by manual relevance screening and stratified sampling by "year × conference × search term." This compromise of "full scope definition + stratified sampling" provides an in-depth snapshot of contemporary trends at the cost of longitudinal (cross-decade) analysis, a trade-off the authors explicitly acknowledge.
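The workshop-side sampling rule can be sketched as follows. This is a minimal illustration, not the authors' script. The record layout (`shared_task`, `is_overview`), the rounding-up of the 25% quota, the fixed seed, and the choice to keep non-shared-task papers in full are all assumptions made for the example.

```python
import math
import random


def sample_workshop_papers(papers, rate=0.25, min_per_task=5, seed=0):
    """Sample shared task papers per task; keep overview and regular papers.

    `papers` is a list of dicts with hypothetical keys:
    'id', 'shared_task' (task name or None), 'is_overview' (bool).
    """
    rng = random.Random(seed)
    selected, by_task = [], {}
    for paper in papers:
        if paper["shared_task"] is None or paper["is_overview"]:
            selected.append(paper)  # assumption: kept without sampling
        else:
            by_task.setdefault(paper["shared_task"], []).append(paper)

    for task in sorted(by_task):
        group = by_task[task]
        quota = max(min_per_task, math.ceil(rate * len(group)))
        # A task with fewer papers than the minimum contributes all of them.
        selected.extend(rng.sample(group, min(quota, len(group))))
    return selected


# Toy corpus (hypothetical): task A has 39 system papers plus 1 overview,
# task B has 4 system papers, and 10 papers are regular workshop papers.
toy = (
    [{"id": f"A{i}", "shared_task": "A", "is_overview": False} for i in range(39)]
    + [{"id": "A-overview", "shared_task": "A", "is_overview": True}]
    + [{"id": f"B{i}", "shared_task": "B", "is_overview": False} for i in range(4)]
    + [{"id": f"R{i}", "shared_task": None, "is_overview": False} for i in range(10)]
)
sample = sample_workshop_papers(toy)
```

Sampling within each task rather than across the pooled set is what prevents a large shared task from crowding out a small one.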

2. Three-stage Annotation Process + Shared Extraction Schema: Turning a "Self-Audit" into a Verifiable Task

To mitigate the risk of subjectivity, the authors designed a three-stage iterative annotation process. Phase (1): Three authors collaboratively annotated a single paper to develop and validate the schema. Phase (2): Annotators independently labeled a stratified shared batch (25 papers, 12.3% of the corpus), then met to revise the schema and resolve ambiguities, with everyone retroactively updating Phase (2) labels after each change. Phase (3): The remaining papers were annotated independently. The schema captured: specific tasks, datasets and their availability, explicit motives, mentioned vs. included stakeholders (with citations), levels of inclusion, deployment context, explicit and implicit incentives, ethical risks, mitigation measures taken, and risk-related future directions. Annotation averaged 45 minutes per paper, totaling approximately 190 annotation hours.
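The effort figures can be sanity-checked with simple arithmetic. The sketch below assumes that three annotators each labeled the 25-paper shared batch, so the batch was annotated three times in total. That assumption is not stated in this summary, but it is one reading under which 45 minutes per paper adds up to roughly 190 hours.

```python
CORPUS_SIZE = 204        # papers in the final corpus
SHARED_BATCH = 25        # papers labeled by every annotator in Phase (2)
MINUTES_PER_PAPER = 45   # reported average annotation time
ANNOTATORS = 3           # assumption: all three authors labeled the shared batch

# Share of the corpus covered by the shared batch.
shared_fraction = SHARED_BATCH / CORPUS_SIZE

# One annotation pass over every paper in the corpus.
single_pass_hours = CORPUS_SIZE * MINUTES_PER_PAPER / 60

# Extra passes over the shared batch by the other annotators.
extra_hours = (ANNOTATORS - 1) * SHARED_BATCH * MINUTES_PER_PAPER / 60

total_hours = single_pass_hours + extra_hours
print(f"{shared_fraction:.1%} shared, {total_hours:.1f} hours in total")
```

Under these assumptions, a single pass takes 153 hours and the overlapping Phase (2) annotations add 37.5 hours, which is consistent with the reported total of about 190.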

3. Multi-dimensional Consistency Measurement (IAA): Using Krippendorff's \(\alpha\) and Percentage Agreement (PA) for Reliability

To indicate which conclusions are reliable and which are only indicative trends, the authors calculated Inter-Annotator Agreement (IAA) on the Phase (2) shared batch. Percentage Agreement (PA) was used for free-text fields, ranging from 0.52 (implicit incentives) to 1.00 (deployment). For four multi-label dimensions, both Krippendorff's \(\alpha\) and PA were reported: PA was generally high (0.84–0.94), but \(\alpha\) was more volatile. Agreement on whether stakeholders "appeared" was moderate to strong (\(\alpha\) = 0.49–0.70; "teacher" reached 0.79–0.84), while more interpretive dimensions like "level of inclusion" and "level of risk engagement" showed lower agreement (\(\alpha\) = 0.52–0.61). Consequently, the authors emphasize that figures for subjective dimensions like implicit incentives and risks should be read as indicative trends rather than precise counts.
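To see why PA can look high while \(\alpha\) stays low, it helps to compute both on the same labels. The following is a minimal sketch of the standard coincidence-matrix formulation of Krippendorff's \(\alpha\) for nominal labels. It is not the authors' code, the toy labels are invented, and defining PA as "all annotators agree on the unit" is an assumption about how it was computed.

```python
from collections import Counter
from itertools import permutations


def percentage_agreement(units):
    """Share of units on which every annotator chose the same label."""
    agreed = sum(1 for labels in units if len(set(labels)) == 1)
    return agreed / len(units)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels; None marks a missing label."""
    coincidences = Counter()
    for labels in units:
        present = [v for v in labels if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value was ever used
    return 1.0 - observed / expected


# Toy data (hypothetical): two annotators, ten papers, 1 = "group is mentioned".
toy = [(0, 1), (1, 1), (0, 1), (0, 0), (0, 0),
       (0, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
pa = percentage_agreement(toy)
alpha = krippendorff_alpha_nominal(toy)
print(f"PA = {pa:.2f}, alpha = {alpha:.3f}")
```

On this toy batch PA is 0.60 while \(\alpha\) is only about 0.095. Because one label dominates, much of the raw agreement is what chance alone would produce, and \(\alpha\) discounts it. This is the same effect that makes \(\alpha\) more volatile than PA on skewed annotation dimensions.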

4. Five-dimensional Analysis Framework: Quantifying "Who Does the Research Serve?"

To address the inherently vague question of whose interests are served, the authors decomposed it into five statistical dimensions. They distinguished mention from inclusion of stakeholders, three levels of inclusion (High: involved in design; Middling: involved in data evaluation/annotation without design input; Low: test subjects for data collection only), and explicit from implicit beneficiaries (implicit beneficiaries had to be inferred from task nature, deployment, and funding, which produced the lowest agreement). This breakdown allowed structural findings to emerge, such as "teachers being treated as pressure points (cost/time burden) rather than beneficiaries" and "automated grading implicitly serving industry."
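The distinctions above map naturally onto a small record type. The sketch below shows one way such annotations could be represented and aggregated. The class names, field names, and toy records are hypothetical and do not reproduce the authors' extraction schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Inclusion(Enum):
    HIGH = "involved in design"
    MIDDLING = "evaluation/annotation without design input"
    LOW = "test subject for data collection only"


@dataclass
class StakeholderAnnotation:
    paper_id: str
    group: str                              # e.g. "teacher", "learner"
    inclusion: Optional[Inclusion] = None   # None = mentioned but not included
    explicit_beneficiary: bool = False      # named as a beneficiary in the paper
    implicit_beneficiary: bool = False      # inferred from task/deployment/funding


def inclusion_rate(records, group):
    """Fraction of papers mentioning `group` that also include it."""
    mentioned = [r for r in records if r.group == group]
    included = [r for r in mentioned if r.inclusion is not None]
    return len(included) / len(mentioned) if mentioned else 0.0


def inclusion_distribution(records):
    """Share of each inclusion level among records with any inclusion."""
    counts = Counter(r.inclusion for r in records if r.inclusion is not None)
    total = sum(counts.values())
    return {level: counts[level] / total for level in Inclusion} if total else {}


# Toy records (hypothetical), one per (paper, stakeholder group) pair.
records = [
    StakeholderAnnotation("p1", "teacher", Inclusion.MIDDLING),
    StakeholderAnnotation("p2", "teacher"),
    StakeholderAnnotation("p3", "teacher", Inclusion.HIGH, explicit_beneficiary=True),
    StakeholderAnnotation("p4", "teacher"),
    StakeholderAnnotation("p1", "learner", Inclusion.LOW, explicit_beneficiary=True),
    StakeholderAnnotation("p2", "learner", Inclusion.MIDDLING),
]
teacher_rate = inclusion_rate(records, "teacher")
distribution = inclusion_distribution(records)
```

Keeping "mentioned" and "included" as separate facts is what makes it possible to report an inclusion rate per group instead of a single headline count.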

Key Experimental Results

Main Results

Core statistics organized by RQ1–RQ3:

| Dimension | Key Figures | Meaning |
| --- | --- | --- |
| Task Distribution | Automated Scoring (AES/ASAG) 56 papers, GEC 30, Text Simplification/Complexity 28 | Assessment/feedback tasks occupy half the corpus, directly linked to commercial value in testing. |
| Dataset Concentration | 284 datasets used 460 times; W&I+LOCNESS, ASAP, and CoNLL-2014 account for 12.9% of public use | 73.9% are public (good for reproducibility), but highly concentrated in a few English datasets, questioning generalization. |
| Motive Types | "Helping a stakeholder" 110 papers, "Educational/Ethical concerns" 82, Purely technical 43 (21.1%) | Most have stakeholder-related motives, but pure technical motivation remains significant. |
| Deployment Context | 79.4% (162 papers) never deployed to real users; only 9.8% had real-world deployment | Massive research effort on benchmark optimization with little discussion on deployment paths. |

Stakeholders and Incentives

| Group | Mentioned | Inclusion Rate | As Beneficiary |
| --- | --- | --- | --- |
| Learners/Students | 170 papers (Most) | 22.4% | Most frequent explicit beneficiary (125 papers) |
| Teachers | 97 papers | 26.8%; 65.5% of inclusions are Middling (mostly annotators) | Only 33.3% of papers treat them as beneficiaries; 80.9% of appearances are explicit |
| Domain Experts | 88 papers | 56.8% (Often hired as annotators/graders) | |
| Parents | Only 2 papers | | Almost absent despite critical roles in early education |

Overall distribution of inclusion levels: Middling 47.0%, High 32.1%, Low 20.9%—even when included, stakeholders are more often "tools" for research rather than "agents" shaping it. Non-profits, industry, and government agencies appear prominently as implicit beneficiaries: automated scoring research consistently favors industry (reducing reliance on human graders), while direct benefits to teachers/examiners are sparse.

Risks and Mitigation

  • Risks are mentioned but seldom mitigated: The most frequent risks are methodological limitations (69 papers), dataset limitations (60), and lack of generalization/language specificity (56). Hallucination risks (12), dual-use (6), and informed consent/fair payment (11/10) are rarely mentioned—ironically, human subject protection is least discussed in a corpus involving heavy learner data collection and human annotation.
  • Low engagement levels: Most risk engagement remains at the Low/Middling level; 98.6% of methodological limitations and 90.0% of dataset limitations receive only Middling/Low engagement. Bias risks were mentioned 46 times but only 15.2% received High-level treatment. Concerns raised are rarely converted into mitigations within the same paper.
  • Concentrated Funding: Universities dominate affiliations (188 papers), and government funding is primary (80, US/China national funds most frequent), with 20 industrial acknowledgments (e.g., Microsoft)—yet few papers explicitly discuss conflicts of interest arising from funding.

Key Findings

  • Structural Tension Evidenced: The convergence of tasks favoring commercial automated scoring, the marginalization of teachers, and the scarcity of real-world deployment reveals a structural misalignment between "private incentives vs. educational infrastructure needs."
  • Teachers are the most affected yet most ignored: They are mentioned in 97 papers, but when included they serve mostly as annotators; only one third of papers view them as beneficiaries. There is a fundamental difference between "automation to reduce teacher burden" and "supporting teacher agency/augmentation."
  • Ethics Stop at "Acknowledgment": The field is generally aware of ethical dimensions but has yet to establish a norm of "detect and mitigate" within a single paper.

Highlights & Insights

  • Quantifying "Aspirations vs. Reality": By using axes like mention vs. inclusion, levels of inclusion, and explicit/implicit beneficiaries, the vague notion of "who is being served" is transformed into a statistical audit. This framework is transferable to other "application-driven" subfields (e.g., Medical or Legal NLP).
  • Methodological Safeguards: Reporting \(\alpha\) and PA for each subjective dimension, declaring "indicative trends," anchoring characterizations with direct quotes, and pre-registering on OSF together set a benchmark for honesty in high-interpretation reviews.
  • Actionable Advice from Exemplars: Beyond criticism, the authors highlight exemplars like Galletti and Cesaroni (2025) and Wang et al. (2025c) to ground recommendations: "co-designing with teachers/learners from the start," "explicitly stating deployment context/costs," and "integrating ethical reflection into current work."
  • The "Implicit Beneficiary" Dimension: It exposes the presence of industry/testing agencies as "unnamed yet obvious beneficiaries," making the influence of commercial incentives on the research agenda explicit.

Limitations & Future Work

  • The corpus is not exhaustive (204 papers) and is limited to the ACL Anthology—work in AIED journals, Learning Analytics conferences, or specialized EdTech venues is excluded. It portrays the NLP community, not the entire field.
  • The 2024–2025 window is short; trends may not generalize earlier or later. The authors suggest expanding the window to pre-ChatGPT eras for a "pre/post-Generative AI" comparison.
  • Interpretive dimensions (e.g., inclusion levels) have unavoidable subjectivity, only partially reflected by consistency scores; the implicit incentives dimension had the lowest agreement (0.53).
  • The authors admit they are members of the community they criticize, and the definitions of "meaningful inclusion" or "sufficient ethical engagement" are shaped by their own values.

Related Work & Insights

  • vs. Yan et al. (2025) / Fu and Weng (2024) / Holmes et al. (2022) (EdTech Ethics Reviews): These typically cover the broader AIED/Learning Analytics field and focus on qualitative diagnoses of "ethical tensions." This work focuses specifically on the ACL/NLP community and extends the analysis to a five-dimensional quantitative account of incentives and beneficiaries.
  • vs. Suresh and Guttag (2021) (Bias in ML Lifecycles): This work adopts their "lifecycle-wide examination of bias" perspective but grounds it in specific stakeholder inclusion and deployment gaps in EduNLP.
  • Insight: Treating "who benefits, who is included, and who is served" as first-class citizens for quantitative auditing is a reusable self-examination paradigm for any application-oriented AI subfield. The call for "mitigating risks within the paper in which they are raised" aligns directly with the ARR Responsible NLP Checklist.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic review of EduNLP focusing on the ACL Anthology; the quantitative perspective on incentives/beneficiaries is novel (though the review format itself is not a technical innovation).
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 204 manually annotated papers, ~190 hours, multi-dimensional IAA, and OSF pre-registration; the methodology is solid and honest.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear and restrained arguments; boundaries of conclusion reliability are repeatedly noted; recommendations are specific and actionable.
  • Value: ⭐⭐⭐⭐⭐ Provides the EduNLP community with a mirror and a roadmap for improvement; likely to influence research agendas and review standards.