Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction¶

Conference: ACL 2025
arXiv: 2501.13125
Code: GitHub
Area: Others
Keywords: Distractor Generation, Multiple-Choice Questions, DPO, Student Choice Prediction, Educational Assessment

TL;DR¶

This paper proposes a three-step pipeline that predicts student choice preferences via a pairwise ranker and subsequently trains a distractor generator using DPO, rendering the generated multiple-choice question (MCQ) distractors more plausible and discriminative.

Background & Motivation¶

Background: Multiple-choice questions (MCQs) are essential assessment tools in education, and the quality of their distractors (incorrect options) directly determines how effective a test is. Automated distractor generation (ADG) has become an active research area, yet existing methods mainly aim to produce distractors that resemble human-written ones and pay little attention to plausibility.

Limitations of Prior Work: Distractors generated by prior methods are often so simplistic that students can eliminate them instantly. Such distractors fail to assess students' actual comprehension, which reduces the educational value and discriminative power of MCQs.

Key Challenge: Generating plausible distractors requires understanding students' common misconceptions and knowledge gaps. Such a "student cognitive model" is difficult to directly encode into generation models.

Goal: Train a model capable of generating highly plausible distractors, making the produced distractors more likely to be selected by students, thereby improving the Discrimination Index (DI) of MCQs.

Key Insight: Leverage selection rate information from real student response data to first train a model that infers student misconceptions and ranks the plausibility of distractors, and then use the ranking results to train a generator via DPO.

Core Idea: Use a pairwise ranker to learn students' choice preference patterns, then construct a preference dataset to drive DPO training, so that the distractor generator learns to target students' misconceptions and produces more plausible options.

Method¶

Overall Architecture¶

Three-step training pipeline: Step 1 trains a pairwise ranker (predicting which distractor is more likely to be selected) \(\rightarrow\) Step 2 constructs a student choice dataset (ranking the distractors) \(\rightarrow\) Step 3 trains the distractor generator using DPO.

Key Designs¶

Pairwise Ranker
- Function: Given a question, the correct answer, and two distractors, determine which is more likely to be selected by students.
- Mechanism: The model first generates a reasoning chain (analyzing possible student misconceptions) before outputting the selection result. Training data for reasoning is generated using GPT-4o for SFT, and DPO is subsequently applied to correct reasoning errors.
- Design Motivation: Enhanced interpretability and accuracy are achieved through structured reasoning, where the reasoning process reveals the specific causes of student misconceptions.
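
The ranker's input and output described above can be sketched as follows. This is a minimal illustration, not the paper's prompt: the template wording, the `RankingItem` fields, and the `Choice: A/B` output convention are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankingItem:
    """One pairwise comparison (field names are hypothetical)."""
    question: str
    answer: str
    distractor_a: str
    distractor_b: str


def build_ranker_prompt(item: RankingItem) -> str:
    """Format a pairwise comparison; the template is an assumption."""
    return (
        f"Question: {item.question}\n"
        f"Correct answer: {item.answer}\n"
        f"Distractor A: {item.distractor_a}\n"
        f"Distractor B: {item.distractor_b}\n"
        "First explain which misconception could lead a student to each "
        "distractor, then finish with 'Choice: A' or 'Choice: B'."
    )


def parse_choice(model_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Choice: X' marker, or None if absent."""
    matches = re.findall(r"Choice:\s*([AB])\b", model_output)
    return matches[-1] if matches else None
```

Keeping the reasoning text and the final verdict in one output makes it possible to inspect why the ranker preferred a distractor, which is the interpretability benefit noted above.
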
Student Choice Dataset Construction
- Function: Build a plausibility ranking for all distractors of each question.
- Mechanism: Use GPT-4o to generate new distractors for each question, then score and rank them together with the original distractors using the ranker. Among the original distractors, the ranking given by real-world student selection rates is preserved.
- Design Motivation: Expand the candidate pool of distractors and establish preference pairs to provide chosen/rejected samples for DPO.
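
One way to merge generated distractors into the ranking while keeping the original order fixed is sketched below. The insertion strategy and the `ranker_prefers_first` callable are assumptions for illustration; the source does not specify how pairwise decisions are aggregated into a full ranking.

```python
from typing import Callable, Dict, List


def rank_distractors(
    original_rates: Dict[str, float],
    generated: List[str],
    ranker_prefers_first: Callable[[str, str], bool],
) -> List[str]:
    """Order distractors from most to least plausible.

    Original distractors are ordered by observed selection rate and never
    reordered. Each generated distractor is placed in front of the first
    entry that the (hypothetical) ranker says it beats.
    """
    ranking = sorted(original_rates, key=original_rates.get, reverse=True)
    for new in generated:
        position = len(ranking)  # default: least plausible so far
        for i, existing in enumerate(ranking):
            if ranker_prefers_first(new, existing):
                position = i
                break
        ranking.insert(position, new)
    return ranking
```

Because generated distractors are only ever inserted, the relative order of the original distractors always matches their real selection rates.
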
Distractor Generator Training (SFT + DPO)
- Function: Generate highly plausible distractors.
- Mechanism: In the SFT stage, basic generation capability is learned (including first judging the question type). In the DPO stage, preference pairs are constructed using top-\(n\) vs bottom-\(n\) distractors, guiding the model toward generating more plausible distractors.
- Design Motivation: Determining the question type (whether the question asks for the correct or the incorrect statement) is crucial for generating valid distractors, and DPO improves plausibility significantly more than SFT alone.
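
The top-\(n\) vs bottom-\(n\) pairing can be sketched as below. Pairing every top distractor with every bottom distractor is an assumption of this example, as are the function name and the `prompt`/`chosen`/`rejected` record layout (a common convention for preference datasets); the paper may pair them differently.

```python
from itertools import product
from typing import Dict, List


def build_preference_pairs(prompt: str, ranked: List[str], n: int) -> List[Dict[str, str]]:
    """Pair each top-n distractor (chosen) with each bottom-n distractor (rejected).

    `ranked` must be ordered from most to least plausible.
    """
    if n < 1 or 2 * n > len(ranked):
        raise ValueError("need at least 2 * n ranked distractors")
    top, bottom = ranked[:n], ranked[-n:]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(top, bottom)
    ]
```

Requiring at least 2n ranked distractors keeps the top and bottom sets disjoint, so no distractor is both chosen and rejected for the same question.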

Loss & Training¶

Both the ranker and generator are based on Mistral-7B-Instruct-v0.2, fine-tuned using LoRA.
Ranker: Two-stage training with SFT + DPO, where the rejected samples for DPO are obtained from incorrect reasoning generated by the SFT model.
Generator: SFT + DPO, where DPO constructs preference pairs using the top-\(n\) and bottom-\(n\) ranked distractors.
To mitigate position bias, each pair is ranked with the order of A and B swapped, and inference is repeated until the two orderings yield a consistent result.
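
The order-swapping check can be sketched as follows. The `ranker` callable, its `"A"`/`"B"` return convention, and the `max_rounds` cap are assumptions for this example; the source only states that inference is repeated until the result is consistent.

```python
from typing import Callable, Optional


def consistent_preference(
    first: str,
    second: str,
    ranker: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the distractor preferred in both presentation orders, else None.

    `ranker(a, b)` is a hypothetical callable returning "A" if it prefers the
    option shown in slot A and "B" otherwise; it may be stochastic.
    """
    for _ in range(max_rounds):
        forward = ranker(first, second)    # `first` shown in slot A
        backward = ranker(second, first)   # `first` shown in slot B
        pick_forward = first if forward == "A" else second
        pick_backward = second if backward == "A" else first
        if pick_forward == pick_backward:
            return pick_forward
    return None  # no consistent verdict within the round limit
```

A ranker that always answers "A" regardless of content never produces a consistent verdict here, which is exactly the position bias the swap is meant to expose.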

Key Experimental Results¶

Main Results¶

Ranker	Python Accuracy	DB Accuracy	MLDL Accuracy	Average Accuracy
GPT-3.5 (Reasoning)	0.633	0.523	0.606	0.587
GPT-4o (Reasoning)	0.686	0.664	0.570	0.640
Ours (DPO, Comb.)	0.712	0.659	0.655	0.675
Human Expert	-	-	-	0.717

Ablation Study¶

Ablation Condition	Average Ranking Accuracy
Ours (SFT, Sep.)	0.587
Ours (SFT, Comb.)	0.657
Ours (DPO, Comb.)	0.675
Ours (SFT w/o Reasoning)	0.567

Key Findings¶

The DPO ranker (67.5%) surpasses GPT-4o (64.0%) and approaches the human expert level (71.7%), even though its training data comes from GPT-4o's reasoning: a case of the student surpassing the teacher.
Removing the reasoning step hurts substantially: the SFT ranker without reasoning reaches only 56.7% average accuracy, compared with 65.7% for the combined SFT ranker with reasoning and 67.5% after DPO, indicating that structured reasoning is essential for the task.
In human evaluation, the distractors generated via DPO achieve the highest Discrimination Index (DI), confirming that highly plausible distractors indeed help distinguish students of different proficiency levels.
Primary factors of student misconceptions: "incorrect assumptions about function outputs/operations" is most common in programming questions, whereas "conceptual confusion with similar terms" is most common in conceptual questions.
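
For readers unfamiliar with the Discrimination Index cited in the findings above, the sketch below uses the classical upper-lower group definition: the number of correct answers in the top-scoring group minus the number in the bottom-scoring group, divided by the group size. The 27% split and the function signature are assumptions; the paper's exact DI computation may differ.

```python
from typing import List, Tuple


def discrimination_index(results: List[Tuple[float, bool]], fraction: float = 0.27) -> float:
    """Upper-lower discrimination index for a single question.

    `results` holds (total_test_score, answered_this_item_correctly) per student.
    """
    if not results:
        raise ValueError("results must not be empty")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ordered = sorted(results, key=lambda r: r[0], reverse=True)
    group_size = max(1, round(len(ordered) * fraction))
    upper_correct = sum(correct for _, correct in ordered[:group_size])
    lower_correct = sum(correct for _, correct in ordered[-group_size:])
    return (upper_correct - lower_correct) / group_size
```

A value near 1 means strong students answer correctly while weak students are drawn to the distractors; a value near 0 means the question does not separate the two groups.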

Highlights & Insights¶

Elegant Closed-Loop Design: Real selection rates \(\rightarrow\) ranker \(\rightarrow\) synthetic preference data \(\rightarrow\) DPO generator; each step is closely linked and backed by empirical data.
High Interpretability: The reasoning process of the ranker uncovers the types and distributions of common student misconceptions, providing direct pedagogical guidance.
Small Models Outperforming GPT-4o: After DPO, the 7B Mistral model surpasses GPT-4o in ranking accuracy, demonstrating the power of specialized training.

Limitations & Future Work¶

Validation is restricted to the computer science domain (Python/DB/MLDL), leaving generalizability to other disciplines (such as history or languages) unexplored.
The method depends on real student selection rate data, which limits its applicability in cold-start scenarios where new questions lack historical data.
Filtering of "invalid distractors" (e.g., contradicting the correct answer or formatting errors) is primarily handled manually and could be automated.
The long-term impact of generated distractors on practical pedagogical outcomes (e.g., student learning progress) has not been investigated.

Related Work & Insights¶

Scarlatos et al. (2024): Proposed an overgenerate-and-rank paradigm for math questions, but the ranker lacked reasoning. The reasoning-based ranking proposed in this paper is more generalizable and accurate.
DPO (Rafailov et al. 2024): A preference optimization method that simplifies RLHF by optimizing the policy directly on preference pairs, without a separately trained reward model; this paper applies it to distractor generation in educational settings.
Insights: This paradigm of "learning ranking preferences first, then optimizing generation based on those preferences" can be extended to other generation tasks requiring human alignment.
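
For reference, the standard DPO objective from Rafailov et al. is reproduced below. The mapping to this task is an interpretation of the summary rather than the paper's stated notation: \(x\) would be the question with its correct answer, \(y_w\) a top-ranked (chosen) distractor, \(y_l\) a bottom-ranked (rejected) distractor, \(\pi_{\mathrm{ref}}\) the SFT generator, and \(\beta\) a temperature hyperparameter.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Minimizing this loss raises the likelihood of the chosen distractor relative to the rejected one, measured against the reference model.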

Rating¶

Novelty: ⭐⭐⭐⭐ (The combination of pairwise ranking, reasoning, and DPO in educational contexts is highly creative)
Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive evaluation incorporating automated metrics, human assessment, and real-world student testing)
Writing Quality: ⭐⭐⭐⭐ (Clear flowcharts and well-articulated motivations)
Value: ⭐⭐⭐⭐ (Provides direct practical value in pedagogical application scenarios)