FEAT: A Preference Feedback Dataset through a Cost-Effective Auto-Generation and Labeling Framework for English AI Tutoring¶
Conference: ACL 2025
arXiv: 2506.19325
Code: hyenee/FEAT
Area: Others
Keywords: AI Tutoring, Preference Dataset, Teacher Feedback, knowledge distillation, Ranking Model
TL;DR¶
Proposes the FEAT framework, which uses LLMs to automatically generate and label teacher-feedback preference datasets for English tutoring systems. The study finds that mixing as little as 5-10% human-annotated data into the LLM-generated data can yield ranking performance that exceeds training on 100% human-annotated data.
Background & Motivation¶
In English tutoring, teacher feedback is crucial for guiding students and improving learning outcomes. With the rise of AI tutoring systems, constructing high-quality teacher feedback preference datasets has become a core requirement. Such datasets can support reward- or ranking-based learning (such as RLHF, DPO) to train AI tutors that align more closely with human teachers.
However, existing data construction methods face a dilemma:
- Human Generation + Human Annotation: High quality but extremely high cost, difficult to scale
- Pure LLM Generation: Low cost but difficult to guarantee quality
The core question is whether a cost-effective sweet spot exists: one that achieves the largest gain in model performance with the smallest investment in human annotation.
Method¶
Overall Architecture¶
FEAT (Feedback Dataset Generation Framework for English AI Tutoring) consists of three complementary datasets:
- DIRECT-Manual (DM): Feedback is collaboratively generated by humans and LLMs, with humans performing ranking annotations. High quality, high cost.
- DIRECT-Generated (DG): Completely generated and annotated by LLMs. Medium quality, low cost.
- DIRECT-Augmented (DA): Primarily based on DG, with a small amount of DM mixed in. High quality, low cost.
Key Designs¶
- Five-Dimensional Feedback Criteria System: FEAT is based on five teacher feedback criteria defined by Seo et al. (2025): Correct (accurate and specific), Revealing (not directly giving answers), Guidance, Diagnostic, and Encouragement. These five criteria run through the entire process of data generation and evaluation.
- DM Dataset Construction Workflow:
  - Feedback Generation: Feedback is collected from five sources: Human, DIRECT dataset, PrepTutor, GPT-3.5, and GPT-4.
  - Human Ranking: Annotators rank the five candidate feedback responses based on two priority criteria: Correct and Revealing.
  - Preference Pair Construction: Pairwise (chosen, rejected) preference data are constructed from the rankings.
- DG Dataset Construction Workflow:
  - Scenario Conversion: Convert reading comprehension tasks (MCTest) into tutoring scenarios to generate large-scale dialogues.
  - Criteria-Based Automatic Annotation: Generate two versions of feedback, a "criteria-guided" version and a "non-guided" version. The former is designated as chosen, while the latter is rejected.
  - This avoids the overhead of human annotation, under the assumption that feedback adhering to educational criteria is of higher quality.
- DA Dataset Construction: Mix DG with different proportions of DM (5%-100%). The core finding is that mixing only a small amount of high-quality seed data can significantly improve overall quality.
- Diversity Enhancement: During training, not only are standard (chosen, rejected) pairs used, but feedback from different contexts is also incorporated for comparison, allowing the model to learn differences in feedback quality across scenarios.
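The two labeling routes described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the dictionary layout are hypothetical, and the sketch assumes that every ordered pair from a human ranking is kept, which the summary does not confirm.

```python
from itertools import combinations


def pairs_from_ranking(context, ranked_feedback):
    """DM-style labeling: turn one human ranking into preference pairs.

    `ranked_feedback` is ordered best first. Assumption: all pairs implied
    by the ranking are kept (5 candidates -> 10 pairs).
    """
    # combinations() preserves input order, so `better` always precedes
    # `worse` in the ranking.
    return [
        {"context": context, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_feedback, 2)
    ]


def pair_from_guidance(context, guided_feedback, unguided_feedback):
    """DG-style labeling: the criteria-guided output is chosen by construction."""
    return {
        "context": context,
        "chosen": guided_feedback,
        "rejected": unguided_feedback,
    }


# Hypothetical example data: five candidates already ranked by an annotator.
ranked = ["human", "gpt-4", "direct", "gpt-3.5", "preptutor"]
dm_pairs = pairs_from_ranking("dialogue-001", ranked)
dg_pair = pair_from_guidance("dialogue-002", "guided text", "unguided text")
```

Both routes produce the same record shape, so DM and DG pairs can be fed to the same ranking-model training code.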
Loss & Training¶
Five ranking model methods are used:
- Binary Classifier: Encodes preference pairs as a binary classification (1 = correct order, 0 = inverted).
- Reward Model: Computes scalar preference scores for feedback, trained to assign higher scores to chosen responses.
- DPO: Direct Preference Optimization, based on the log probability difference between chosen and rejected responses.
- RankNet: Uses binary cross-entropy to learn score differences.
- Ensemble: Majority voting of the four methods.
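As a rough illustration of the pairwise objectives listed above, the sketch below writes them for a single (chosen, rejected) pair using plain floats. It is a simplified reading, not the paper's implementation: the function names and the default `beta` value are assumptions. With the target fixed to "chosen beats rejected", RankNet's binary cross-entropy on the score difference reduces to the same expression as the usual pairwise reward-model loss, so one function covers both.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_score_loss(score_chosen, score_rejected):
    """Reward-model / RankNet style loss: -log sigmoid(s_chosen - s_rejected)."""
    return -log_sigmoid(score_chosen - score_rejected)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    `beta` is an assumed hyperparameter value, not taken from the paper.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -log_sigmoid(beta * margin)


# Scoring the chosen response higher lowers the loss.
good_order = pairwise_score_loss(2.0, 0.0)
bad_order = pairwise_score_loss(0.0, 2.0)
```

In practice these quantities would be computed on batched tensors from a language model; the scalar form only shows the shape of each objective.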
The evaluation metric is RBO (Rank-Biased Overlap), which measures the similarity between the predicted ranking and the ground-truth ranking, taking values in \([0, 1]\).
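To make the metric concrete, here is a minimal sketch of extrapolated RBO for two equal-length rankings without duplicate items. The summary does not say which RBO variant or persistence parameter the paper uses, so both the extrapolated form and the default `p = 0.9` are assumptions.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated Rank-Biased Overlap for two equal-length rankings.

    Assumptions: no duplicate items within a ranking, and 0 < p < 1.
    Smaller p puts more weight on agreement at the top ranks.
    """
    if len(ranking_a) != len(ranking_b) or not ranking_a:
        raise ValueError("rankings must be non-empty and of equal length")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(ranking_a)
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for depth in range(1, k + 1):
        seen_a.add(ranking_a[depth - 1])
        seen_b.add(ranking_b[depth - 1])
        overlap = len(seen_a & seen_b)  # items shared by both top-`depth` prefixes
        weighted_sum += (overlap / depth) * p ** depth

    # Extrapolate the agreement observed at depth k to unseen depths.
    return (overlap / k) * p ** k + ((1.0 - p) / p) * weighted_sum


human = ["f1", "f2", "f3", "f4", "f5"]
model = ["f1", "f2", "f3", "f5", "f4"]  # only the last two swapped
score = rbo_ext(human, model)
```

Identical rankings score 1 and fully disjoint rankings score 0; a swap near the bottom of the list costs less than a swap at the top.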
Key Experimental Results¶
Main Results¶
| Scenario | Method | Llama-1B | Llama-1B-IT | Llama-3B | Llama-3B-IT | Qwen-3B-IT |
|---|---|---|---|---|---|---|
| DM→DM | Ensemble | 0.77-0.80 | 0.77-0.80 | 0.77-0.80 | 0.77-0.80 | 0.77-0.80 |
| DG→DM | Binary | 0.76 | - | - | - | - |
| DG→DM | Reward | - | - | - | 0.73 | - |
| DG→DM | RankNet | - | - | - | - | 0.76 |
| DG→DM | Ensemble | - | - | - | - | 0.76 |
Models trained on DG and evaluated on DM approach the performance of models trained on fully human-annotated data.
Ablation Study: DM Ratio¶
| DM Ratio | Llama-3B-IT (Majority Method) | Qwen-3B-IT |
|---|---|---|
| 0% (Pure DG) | Below DM→DM baseline | Below DM→DM baseline |
| 5% | Outperforms DM→DM baseline | Close to baseline |
| 10% | Outperforms DM→DM baseline | Close to baseline |
| 50% | Outperforms DM→DM baseline | Outperforms DM→DM baseline |
| 100% (= DM→DM) | Baseline | Baseline |
Llama-3B-IT needs only 5% human-annotated data to outperform training on fully human-annotated data.
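The mixing step behind this ablation can be sketched as follows. The summary does not specify how the DM ratio is normalised, so this sketch makes an explicit assumption: the ratio is the DM share of a fixed-size training set, which makes 0% pure DG and 100% pure DM, matching the endpoints of the table. The function name, the `total` parameter, and the seed are hypothetical.

```python
import random


def mix_by_ratio(dg_pairs, dm_pairs, dm_ratio, total, seed=0):
    """Build a DA-style training set with a given share of DM pairs.

    Assumption: `dm_ratio` is the fraction of the `total` examples drawn
    from DM; the remainder is drawn from DG. Raises ValueError if either
    pool is too small for the requested sample.
    """
    if not 0.0 <= dm_ratio <= 1.0:
        raise ValueError("dm_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed for reproducible mixes
    n_dm = round(dm_ratio * total)
    mixed = rng.sample(dm_pairs, n_dm) + rng.sample(dg_pairs, total - n_dm)
    rng.shuffle(mixed)
    return mixed


# Hypothetical pools: tagged tuples stand in for preference pairs.
dg_pool = [("dg", i) for i in range(200)]
dm_pool = [("dm", i) for i in range(50)]
da_5_percent = mix_by_ratio(dg_pool, dm_pool, dm_ratio=0.05, total=40)
```

Sweeping `dm_ratio` over values such as 0.05, 0.10, and 0.50 would reproduce the shape of the ablation, under the stated assumption about what the ratio means.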
Key Findings¶
- 5-10% of seed data is sufficient to outperform 100% human data: This is the core finding of the paper. For Llama-3B-IT, the Binary Classifier, DPO, and Ensemble methods each require only 5% DM to exceed DM→DM performance.
- LLM-generated rankings are highly consistent with human rankings: The RBO for DG→DM reaches 0.76, while DM→DM is between 0.77 and 0.80, showing a very small gap.
- The number of feedback criteria impacts performance: Increasing from 2 to 5 criteria improves performance for most models and methods, with DPO showing the largest gain (+11.41%).
- Ensemble method is the most stable: It performs most consistently across different model architectures and scales, mitigating the volatility of individual methods.
- Larger models benefit more from DG data: 3B models perform significantly better than 1B models in the DG→DM scenario.
Highlights & Insights¶
- High Practical Value: Provides a deployable, low-cost data construction scheme for the educational AI field. The finding that a 5% share of human annotation can outperform full human annotation is compelling.
- The five-dimensional criteria system brings pedagogical theory into data construction, making the criteria more domain-specific than general helpful/harmless criteria.
- The three-dataset design (DM/DG/DA) itself is an excellent experimental design, clearly illustrating the cost-quality trade-off space.
- Validates an inspiring paradigm: A small amount of high-quality data + a large amount of low-cost data > a large amount of high-quality data, resonating with research directions in curriculum learning and data mixing.
Limitations & Future Work¶
- Only the MCTest dataset is used to generate tutoring scenarios, and the generalization of the framework to other educational datasets remains unverified.
- Only 1B and 3B models are used, leaving the effectiveness on larger models (7B, 13B, 70B) unknown.
- Only pairwise ranking methods are adopted, without exploring pointwise and listwise ranking methods.
- Human annotation in DM is based only on two criteria (Correct and Revealing), which may not fully utilize the other three dimensions.
- Lack of evaluation on final tutoring effectiveness—whether the RBO improvement of ranking models truly translates to better pedagogical feedback remains to be validated.
Related Work & Insights¶
- RLHF (Ouyang et al., 2022) and DPO (Rafailov et al., 2023): Foundational frameworks for preference learning.
- UltraFeedback (Cui et al., 2023): Large-scale AI feedback dataset.
- DIRECT (Huang et al., 2023): The primary source for the DM dataset.
- Liermann et al. (2024): Improving tutoring feedback generation and automatic evaluation.
- Distinction from general preference datasets: FEAT designs domain-specific criteria tailored for educational scenarios.
Rating¶
- Novelty: ⭐⭐⭐ The framework workflow is clear, but the methodological components (DPO, Reward Model, etc.) are existing tools. The innovation lies primarily in the data construction strategy.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 5 ranking methods, 5 models, and multiple data ratio combinations, with thorough ablations.
- Writing Quality: ⭐⭐⭐ Well-structured, but some descriptions are slightly verbose; charts and figures could be more intuitive.
- Value: ⭐⭐⭐⭐ Offers useful guidance for data construction in the educational AI field. The "small high-quality seed" conclusion may generalize to other domains.