Automatic Generation of Inference Making Questions for Reading Comprehension Assessments¶

Conference: ACL 2025
arXiv: 2506.08260
Code: https://github.com/maafiah/InferenceQuestionsAQG
Area: NLP Understanding / Educational NLP
Keywords: Reading Comprehension, Inference Question Generation, Bridging Inference Taxonomy, GPT-4o, Diagnostic Assessment

TL;DR¶

A reading comprehension inference question taxonomy (pronominal bridging / text-connecting / gap-filling) is developed to automatically generate multiple-choice questions for specific inference types using GPT-4o few-shot prompting; while 93.8% of the questions are of acceptable quality, only 42.6% accurately match the target inference type, indicating LLMs still lack precise control over their reasoning abilities.

Background & Motivation¶

Background: Inference ability is a core but complex skill in reading comprehension (RC). Diagnostic RC assessments require items targeted at specific inference types to help educators provide targeted reading interventions. Existing LLM-based question generation research largely treats RC as a single construct and does not distinguish among inference types.

Limitations of Prior Work: (a) Manual creation of inference questions is highly costly and difficult to scale; (b) existing automatic question generation research focuses on overall quality but lacks control over the inference type; (c) there is a lack of a systematic taxonomy of RC inference questions to guide generation.

Key Challenge: While LLMs can generate high-quality RC items, can they generate items targeting specific inference types? Precise control over inference types represents the critical gap between "usable" and "usable for diagnostic assessment."

Goal: (a) Establish a bridging inference question taxonomy; (b) verify whether GPT-4o can generate RC items of specific inference types under few-shot prompting; (c) evaluate whether CoT prompting is helpful.

Key Insight: Building an inference taxonomy based on reading science literature, prompting GPT-4o to generate questions for each type, and evaluating their quality and type accuracy using three expert annotators.

Core Idea: LLMs can generate high-quality RC questions at scale, but precisely matching the inference type still requires human review; "automatic generation + human judgment" represents a scalable approach to diagnostic assessment.

Method¶

Overall Architecture¶

Literature review → Constructing a bridging inference taxonomy (3 types)
Annotation and validation of the taxonomy on a live item bank
Manually writing training exemplar questions (6 passages × 2-4 questions per type)
Generating questions for 10 new passages using GPT-4o few-shot prompting
Three expert annotators assessing the generated questions across three dimensions

Key Designs¶

Bridging Inference Taxonomy:
- Function: Categorizes RC inference items into three bridging inference types.
- Pronominal Bridging: Bridges information between sentences using pronouns as clues, e.g., referring to the earlier "ships" with "That."
- Text-Connecting: Links two explicitly stated textual elements using noun phrases, often involving causal relationships.
- Gap-Filling: Requires readers to apply extra-textual world knowledge to fill in details not explicitly stated.
- Design Motivation: Bridging inference accounts for 51% of the 192-item live item bank, representing the most critical sub-construct; the three types correspond to different cognitive skills.
Few-shot Prompt Generation:
- Function: Designs independent system prompts for each inference type, containing the type definition, generation steps, and 4 or 6 exemplars.
- Four conditions are compared: Standard_4, Standard_6, CoT_4, CoT_6.
- CoT conditions additionally provide a "Text Hint" (relevant sentences in the text) and "Reasoning" (explanation of the reasoning process).
- Three items are generated per passage-type combination, with temperature set to 0 and frequency_penalty=0.2.
Three-Dimensional Expert Evaluation:
- General Item Quality: Overall item quality (whether the key is correct, whether distractors are plausible, grade appropriateness for grades 3-12).
- Inference-type Accuracy: Whether the generated items match the requested inference types.
- Reasoning Quality: Whether the reasoning chain provided by the LLM in CoT conditions is sufficient and sound.
- Two annotation rounds: independent annotation in the first round, followed by adjudication of discrepancies in the second, resulting in a Fleiss' κ of 0.57-0.83.

Key Experimental Results¶

Main Results¶

Generation Method	No. of Items	Quality Acceptance Rate	Inference Type Accuracy	Reasoning Quality Acceptance Rate
Standard_4	88	93.2%	40.9%	-
Standard_6	89	95.5%	46.1%	-
CoT_4	90	90.0%	41.1%	35.6%
CoT_6	90	96.7%	42.2%	38.9%
Total	357	93.8%	42.6%	37.2%

Ablation Study: Difficulty of Generating Each Inference Type (Standard_6)¶

Target Inference Type	Accuracy	Description
Gap-Filling	60.0%	Easiest to generate accurately
Pronominal Bridging	53.3%	Moderate
Text-Connecting	24.1%	Hardest, often degenerating into factual questions

Key Findings¶

High Quality but Low Accuracy: While 93.8% of the items were of acceptable quality for actual assessments, only 42.6% matched the target inference type—high quality does not equate to precise control.
Increasing Exemplars (4 to 6) is Effective: Performance across all metrics improved with 6 exemplars.
CoT Does Not Help: Adding reasoning process exemplars did not improve inference type accuracy (42.2% vs 46.1%), likely because of the LLM's own reasoning deficiencies (only 38.9% of the generated reasoning chains were deemed sound).
34.8% of generated items degenerated into factual/literal items, indicating that LLMs tend to generate easier questions that do not require deep reasoning.
The inference type distribution of the generated items is highly similar to that of the human item bank—though individual items might be inaccurate, the overall distribution remains useful.

Highlights & Insights¶

Practical Value of the Inference Taxonomy: Validated on a live item bank (where bridging inference constitutes 51%), providing a roadmap for future item development and research.
Decoupling of Quality and Controllability: Generating high-quality items does not guarantee precise control over item properties, which is a key insight for educational NLP applications.
Pragmatic "Generation + Human Review" Approach: Instead of striving for full automation, utilizing LLMs for mass generation followed by human filtering is significantly more efficient than writing questions manually from scratch.

Limitations & Future Work¶

Only evaluated using a single model (GPT-4o); other models with stronger reasoning capabilities (e.g., o1, Claude) were not tested.
Evaluated on only 10 expository texts, without covering other genres like narrative texts.
The ineffectiveness of CoT might be due to the limited number of training exemplars (only 12-18); more examples might yield improvements.
No testing was conducted with real students; psychometric properties like item discrimination and difficulty remain unknown.
Future work could leverage LLMs to first classify the inference types of existing items in stock to obtain more training exemplars.
The matching rate for the Text-Connecting type is extremely low (24.1%), requiring targeted prompt optimization.

Compared to General QG Research: Prior QG studies treated RC as a singular construct. This work is the first to systematically generate items based on inference types, filling a critical gap.
Compared to Säuberli & Clematide (2024): While they successfully applied CoT in RC QG, the inference type control task in this study is more complex, causing CoT to fail.
Compared to Multi-Hop QA: Multi-hop reasoning in NLP intersects with bridging inference, but educational assessment scenarios entail unique requirements (e.g., grade-appropriateness, distractor quality).
Insight: LLMs exhibit limited capability in grasping the nuances of specific inference types. Future efforts may need to integrate structured knowledge or specialized fine-tuning.

Rating¶

Novelty: ⭐⭐⭐⭐ First to systematically apply an inference type taxonomy to LLM-based question generation, offering a valuable task definition.
Experimental Thoroughness: ⭐⭐⭐ Rigorous evaluation by three experts, though limited to a single model and a small set of texts.
Writing Quality: ⭐⭐⭐⭐⭐ Well-structured, theoretically grounded taxonomy, and complete evaluation methodology.
Value: ⭐⭐⭐⭐ Directly relevant and highly practical for educational NLP; the finding "high quality but inaccurate target typing" is highly significant.