Automatic Generation of Inference Making Questions for Reading Comprehension Assessments¶
Conference: ACL 2025
arXiv: 2506.08260
Code: https://github.com/maafiah/InferenceQuestionsAQG
Area: NLP Understanding / Educational NLP
Keywords: Reading Comprehension, Inference Question Generation, Bridging Inference Taxonomy, GPT-4o, Diagnostic Assessment
TL;DR¶
A reading comprehension inference question taxonomy (pronominal bridging / text-connecting / gap-filling) is developed to automatically generate multiple-choice questions for specific inference types using GPT-4o few-shot prompting; while 93.8% of the questions are of acceptable quality, only 42.6% accurately match the target inference type, indicating LLMs still lack precise control over their reasoning abilities.
Background & Motivation¶
Background: Inference ability is a core but complex skill in reading comprehension (RC). Diagnostic RC assessments require items targeted at specific inference types to help educators provide targeted reading interventions. Existing LLM-based question generation research largely treats RC as a single construct and does not distinguish among inference types.
Limitations of Prior Work: (a) Manual creation of inference questions is highly costly and difficult to scale; (b) existing automatic question generation research focuses on overall quality but lacks control over the inference type; (c) there is a lack of a systematic taxonomy of RC inference questions to guide generation.
Key Challenge: While LLMs can generate high-quality RC items, can they generate items targeting specific inference types? Precise control over inference types represents the critical gap between "usable" and "usable for diagnostic assessment."
Goal: (a) Establish a bridging inference question taxonomy; (b) verify whether GPT-4o can generate RC items of specific inference types under few-shot prompting; (c) evaluate whether CoT prompting is helpful.
Key Insight: Building an inference taxonomy based on reading science literature, prompting GPT-4o to generate questions for each type, and evaluating their quality and type accuracy using three expert annotators.
Core Idea: LLMs can generate high-quality RC questions at scale, but precisely matching the inference type still requires human review; "automatic generation + human judgment" represents a scalable approach to diagnostic assessment.
Method¶
Overall Architecture¶
- Literature review → Constructing a bridging inference taxonomy (3 types)
- Annotation and validation of the taxonomy on a live item bank
- Manually writing training exemplar questions (6 passages × 2-4 questions per type)
- Generating questions for 10 new passages using GPT-4o few-shot prompting
- Three expert annotators assessing the generated questions across three dimensions
Key Designs¶
-
Bridging Inference Taxonomy:
- Function: Categorizes RC inference items into three bridging inference types.
- Pronominal Bridging: Bridges information between sentences using pronouns as clues, e.g., referring to the earlier "ships" with "That."
- Text-Connecting: Links two explicitly stated textual elements using noun phrases, often involving causal relationships.
- Gap-Filling: Requires readers to apply extra-textual world knowledge to fill in details not explicitly stated.
- Design Motivation: Bridging inference accounts for 51% of the 192-item live item bank, representing the most critical sub-construct; the three types correspond to different cognitive skills.
-
Few-shot Prompt Generation:
- Function: Designs independent system prompts for each inference type, containing the type definition, generation steps, and 4 or 6 exemplars.
- Four conditions are compared: Standard_4, Standard_6, CoT_4, CoT_6.
- CoT conditions additionally provide a "Text Hint" (relevant sentences in the text) and "Reasoning" (explanation of the reasoning process).
- Three items are generated per passage-type combination, with temperature set to 0 and
frequency_penalty=0.2.
-
Three-Dimensional Expert Evaluation:
- General Item Quality: Overall item quality (whether the key is correct, whether distractors are plausible, grade appropriateness for grades 3-12).
- Inference-type Accuracy: Whether the generated items match the requested inference types.
- Reasoning Quality: Whether the reasoning chain provided by the LLM in CoT conditions is sufficient and sound.
- Two annotation rounds: independent annotation in the first round, followed by adjudication of discrepancies in the second, resulting in a Fleiss' κ of 0.57-0.83.
Key Experimental Results¶
Main Results¶
| Generation Method | No. of Items | Quality Acceptance Rate | Inference Type Accuracy | Reasoning Quality Acceptance Rate |
|---|---|---|---|---|
| Standard_4 | 88 | 93.2% | 40.9% | - |
| Standard_6 | 89 | 95.5% | 46.1% | - |
| CoT_4 | 90 | 90.0% | 41.1% | 35.6% |
| CoT_6 | 90 | 96.7% | 42.2% | 38.9% |
| Total | 357 | 93.8% | 42.6% | 37.2% |
Ablation Study: Difficulty of Generating Each Inference Type (Standard_6)¶
| Target Inference Type | Accuracy | Description |
|---|---|---|
| Gap-Filling | 60.0% | Easiest to generate accurately |
| Pronominal Bridging | 53.3% | Moderate |
| Text-Connecting | 24.1% | Hardest, often degenerating into factual questions |
Key Findings¶
- High Quality but Low Accuracy: While 93.8% of the items were of acceptable quality for actual assessments, only 42.6% matched the target inference type—high quality does not equate to precise control.
- Increasing Exemplars (4 to 6) is Effective: Performance across all metrics improved with 6 exemplars.
- CoT Does Not Help: Adding reasoning process exemplars did not improve inference type accuracy (42.2% vs 46.1%), likely because of the LLM's own reasoning deficiencies (only 38.9% of the generated reasoning chains were deemed sound).
- 34.8% of generated items degenerated into factual/literal items, indicating that LLMs tend to generate easier questions that do not require deep reasoning.
- The inference type distribution of the generated items is highly similar to that of the human item bank—though individual items might be inaccurate, the overall distribution remains useful.
Highlights & Insights¶
- Practical Value of the Inference Taxonomy: Validated on a live item bank (where bridging inference constitutes 51%), providing a roadmap for future item development and research.
- Decoupling of Quality and Controllability: Generating high-quality items does not guarantee precise control over item properties, which is a key insight for educational NLP applications.
- Pragmatic "Generation + Human Review" Approach: Instead of striving for full automation, utilizing LLMs for mass generation followed by human filtering is significantly more efficient than writing questions manually from scratch.
Limitations & Future Work¶
- Only evaluated using a single model (GPT-4o); other models with stronger reasoning capabilities (e.g., o1, Claude) were not tested.
- Evaluated on only 10 expository texts, without covering other genres like narrative texts.
- The ineffectiveness of CoT might be due to the limited number of training exemplars (only 12-18); more examples might yield improvements.
- No testing was conducted with real students; psychometric properties like item discrimination and difficulty remain unknown.
- Future work could leverage LLMs to first classify the inference types of existing items in stock to obtain more training exemplars.
- The matching rate for the Text-Connecting type is extremely low (24.1%), requiring targeted prompt optimization.
Related Work & Insights¶
- Compared to General QG Research: Prior QG studies treated RC as a singular construct. This work is the first to systematically generate items based on inference types, filling a critical gap.
- Compared to Säuberli & Clematide (2024): While they successfully applied CoT in RC QG, the inference type control task in this study is more complex, causing CoT to fail.
- Compared to Multi-Hop QA: Multi-hop reasoning in NLP intersects with bridging inference, but educational assessment scenarios entail unique requirements (e.g., grade-appropriateness, distractor quality).
- Insight: LLMs exhibit limited capability in grasping the nuances of specific inference types. Future efforts may need to integrate structured knowledge or specialized fine-tuning.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to systematically apply an inference type taxonomy to LLM-based question generation, offering a valuable task definition.
- Experimental Thoroughness: ⭐⭐⭐ Rigorous evaluation by three experts, though limited to a single model and a small set of texts.
- Writing Quality: ⭐⭐⭐⭐⭐ Well-structured, theoretically grounded taxonomy, and complete evaluation methodology.
- Value: ⭐⭐⭐⭐ Directly relevant and highly practical for educational NLP; the finding "high quality but inaccurate target typing" is highly significant.