
Domain-Specific Data Generation Framework for RAG Adaptation

Conference: ACL 2026 · arXiv: 2510.11217 · Code: None · Area: Information Retrieval / RAG · Keywords: RAG Adaptation, Data Generation, Domain-Specific, Embedding Fine-tuning, Bloom's Taxonomy

TL;DR

This paper proposes RAGen, a scalable and modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. The framework supports contrastive fine-tuning of embedding models and supervised fine-tuning of LLMs, achieving substantial improvements over AutoRAG and LlamaIndex baselines across three domain-specific datasets.

Background & Motivation

Background: RAG (Retrieval-Augmented Generation) has become the predominant paradigm for integrating LLMs into domain-specific workflows, providing models with contextual information through external retrieval. However, directly applying general-purpose RAG pipelines to new domains frequently leads to suboptimal performance.

Limitations of Prior Work: (1) General-purpose retrievers and generators are not aligned with domain-specific terminology and data distributions; (2) RAG adaptation requires high-quality domain-specific training data, yet manual annotation is prohibitively expensive; (3) Existing data generation methods (AutoRAG, LlamaIndex) follow a single-chunk question generation paradigm—generating questions from individual text chunks—resulting in shallow, localized questions that lack cross-concept reasoning capability; (4) Methods such as RAFT optimize for individual components and are tightly coupled to specific training paradigms.

Key Challenge: The critical bottleneck in RAG adaptation lies not in model architecture or training objectives, but in upstream data supply—specifically, the absence of high-quality, cross-concept, multi-cognitive-level domain-specific training data.

Goal: To design a data-centric framework that automatically generates high-quality QAC datasets from raw documents, suitable for multi-component RAG adaptation (embedding models + LLMs).

Key Insight: Rather than operating at the chunk level, the framework begins from document-level concepts, assembles cross-chunk evidence into "question stems," applies Bloom's Taxonomy to guide question generation across different cognitive levels, and pairs each question with carefully constructed positive, negative, and distracting contexts.

Core Idea: High-quality RAG training data should be cross-concept, cross-chunk, and span multiple cognitive levels—rather than being mechanically generated as shallow QA pairs from individual text chunks.

Method

Overall Architecture

A three-stage pipeline: Stage 1 (Document Concept Extraction) — semantic chunking → chunk-level concept extraction (ChatGPT-4o) → embedding-based clustering to fuse into document-level concepts. Stage 2 (Concept-Centric Evidence Assembly) — cross-chunk retrieval → sentence-level evidence filtering → question stem construction. Stage 3 (QAC Generation) — Bloom's Taxonomy-guided multi-level question generation + construction of four context variants (fully supportive / partially supportive / irrelevant / distracting).
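The concept-fusion step of Stage 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for the OpenAI Ada embeddings, the K-means loop is hand-rolled for self-containment, and all function names are assumptions.

```python
# Sketch of Stage 1's fusion step: K-means over chunk-level concept
# embeddings, with the centroid-nearest concept as each cluster's
# representative document-level concept.
import numpy as np

def fuse_concepts(embeddings: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Cluster chunk-level concept embeddings into k document-level
    concepts; return per-point cluster labels and, per cluster, the
    index of the concept closest to its centroid (the representative)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each concept to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids from current assignments
        for c in range(k):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # representative = the concept nearest its own cluster centroid
    reps = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):
            d = np.linalg.norm(embeddings[idx] - centroids[c], axis=1)
            reps.append(int(idx[d.argmin()]))
    return labels, reps

# toy usage: 12 fake 8-dim concept embeddings fused into K=3 document-level concepts
emb = np.random.default_rng(1).normal(size=(12, 8))
labels, reps = fuse_concepts(emb, k=3)
```

In the actual pipeline the embeddings would come from the Ada model and the representative concept strings would anchor Stage 2's cross-chunk evidence retrieval.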

Key Designs

  1. Document-Level Concept Extraction and Fusion:

    • Function: Extract high-level semantic topics from documents to serve as anchors for question generation.
    • Mechanism: Documents are first segmented into chunks (1,024 tokens with 200-token overlap); ChatGPT-4o extracts chunk-level concepts from each chunk; OpenAI Ada embeddings and K-means clustering then fuse all chunk-level concepts into \(K\) document-level concepts, with the centroid-nearest concept in each cluster serving as the representative.
    • Design Motivation: Chunk-level concepts are overly localized. Document-level concepts capture high-level semantic themes that span multiple chunks, providing global anchors for cross-chunk question generation.
  2. Bloom's Taxonomy-Guided Question Generation:

    • Function: Generate diverse questions covering six cognitive levels, from remembering to creating.
    • Mechanism: The six levels of the revised Bloom's Taxonomy (Remember → Understand → Apply → Analyze → Evaluate → Create) are used to guide question type selection. The framework supports both single-stem (\(\ell=1\)) and multi-stem (\(\ell \geq 2\)) inputs—the latter jointly conditions on evidence from multiple concepts to generate cross-concept questions. An upper bound is imposed when the number of multi-stem combinations becomes intractable.
    • Design Motivation: Single-chunk methods tend to produce large proportions of shallow Remember/Understand-level questions. Bloom's Taxonomy guidance ensures greater coverage of higher-order Analyze/Evaluate/Create-level questions.
  3. Four Context Variant Construction:

    • Function: Construct diverse contexts for each QA pair to improve retrieval training robustness.
    • Mechanism: Fully supportive (evidence that directly answers the question) + Partially supportive (incomplete information requiring cross-evidence reasoning) + Irrelevant (same-domain but unrelated content) + Distracting (topically related but semantically insufficient content for answering, analogous to distractors in reading comprehension).
    • Design Motivation: Existing methods use randomly sampled chunks as negative samples. RAGen's carefully constructed distracting contexts introduce greater semantic difficulty, training more robust retrievers.
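The multi-stem assembly in design 2 amounts to enumerating concept combinations up to a maximum \(\ell\) and capping the count when it becomes intractable. A minimal sketch, with the caveat that the cap value, the subsampling strategy, and all names are assumptions (the paper only states that an upper bound is imposed):

```python
# Sketch of multi-stem input assembly: enumerate single-stem (l=1) and
# multi-stem (l>=2) concept combinations, subsampling when a level
# exceeds the cap.
from itertools import combinations
from random import Random

def stem_combinations(stems, max_l=2, cap=100, seed=0):
    """Return stem tuples for l = 1 .. max_l; multi-stem levels whose
    combination count exceeds `cap` are randomly subsampled."""
    combos = []
    for l in range(1, max_l + 1):
        level = list(combinations(range(len(stems)), l))
        if l >= 2 and len(level) > cap:
            level = Random(seed).sample(level, cap)
        combos.extend(tuple(stems[i] for i in c) for c in level)
    return combos

stems = [f"concept-{i}" for i in range(5)]
# 5 single-stem inputs + min(C(5,2), 6) = 6 sampled pairs = 11 combinations
combos = stem_combinations(stems, max_l=2, cap=6)
```

Each resulting tuple would then be passed, together with its assembled evidence, to the Bloom's Taxonomy-guided question generator.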

Loss & Training

Embedding fine-tuning: InfoNCE contrastive loss with temperature \(\tau=0.02\), learning rate 1e-5, 3 epochs, and 2 negative samples per query. LLM fine-tuning: LoRA-based supervised fine-tuning of Qwen2.5-1.5B/3B, learning rate 1e-5, 5 epochs, 10% validation split. All experiments run on 4× RTX 3090 GPUs.
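The InfoNCE objective can be illustrated with the stated hyperparameters (\(\tau=0.02\), two negatives per query). This is a sketch only: numpy stands in for the actual training stack, and the use of cosine similarity is an assumption.

```python
# Minimal InfoNCE sketch: the loss is low when the query is close to the
# positive context and high when a negative (e.g. a distracting context)
# scores higher than the positive.
import numpy as np

def info_nce(query, positive, negatives, tau=0.02):
    """-log( exp(s(q,p)/tau) / (exp(s(q,p)/tau) + sum_n exp(s(q,n)/tau)) )
    with cosine similarity s."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(query, positive)] + [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=64)
loss_easy = info_nce(q, q, [rng.normal(size=64), rng.normal(size=64)])  # positive matches query
loss_hard = info_nce(q, rng.normal(size=64), [q, q])                    # negatives match query
```

The very low temperature (0.02) sharpens the softmax, so near-miss negatives such as RAGen's distracting contexts contribute much stronger gradients than randomly sampled chunks.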

Key Experimental Results

Main Results

Embedding Model Retrieval Performance (BGE-large-v1.5, averaged across three domains)

Training Data             R@1     R@5     R@10    MRR@10
Vanilla (no fine-tuning)  0.153   0.411   0.534   0.263
AutoRAG                   0.190   0.517   0.655   0.330
LlamaIndex                0.204   0.539   0.671   0.346
RAGen                     0.333   0.716   0.828   0.497

LLM Fine-tuning Results

LLM Fine-tuning Performance (Qwen2.5-1.5B, ROUGE-L)

Domain       AutoRAG   LlamaIndex   RAGen
PPFS         0.288     0.329        0.396
TradePolicy  0.278     0.270        0.391
BusinessAI   0.270     0.269        0.339

Cognitive Level Distribution Comparison

Method       Remember + Understand (lower-order)   Analyze + Evaluate + Create (higher-order)
LlamaIndex   ~70%                                  ~15%
AutoRAG      ~65%                                  ~20%
RAGen        ~30%                                  ~50%

Key Findings

  • RAGen substantially outperforms baselines on embedding retrieval—R@1 is approximately 63% higher than LlamaIndex (0.333 vs. 0.204), demonstrating the superiority of cross-concept data generation.
  • RAGen consistently achieves the best ROUGE-L in LLM fine-tuning (+20–40% relative improvement), indicating that data quality is equally critical for the generation component.
  • Questions generated by RAGen exhibit higher cognitive levels—higher-order questions (Analyze/Evaluate/Create) account for 50%, compared to 15–20% for baselines.
  • The inclusion of distracting contexts significantly improves retrieval robustness over random negative sampling.
  • Multi-stem combinations (\(\ell \geq 2\)) generate cross-concept questions requiring deeper reasoning, which is the primary source of RAGen's data quality advantage.

Highlights & Insights

  • The data-centric approach to RAG adaptation—achieving the largest performance gains by improving training data rather than modifying model architecture or training objectives—offers a broadly applicable design philosophy.
  • Bloom's Taxonomy-guided question generation is a transferable methodology applicable to any educational or assessment data generation scenario.
  • The design of four context variants (particularly distracting contexts) draws on the concept of distractors in reading comprehension.

Limitations & Future Work

  • Concept extraction and question generation rely on ChatGPT-4o, which incurs relatively high API costs and caps the quality of the generated data at the capability of the underlying model.
  • Evaluation is conducted on only three relatively small-scale domain datasets; large-scale industrial scenarios remain untested.
  • No direct comparison is made with end-to-end RAG adaptation methods such as RAFT.
  • Cross-document reasoning (combining concepts from different documents with \(\ell \geq 2\)) is not thoroughly explored.
Comparison with Related Methods

  • vs. RAFT: RAFT focuses on distractor-aware fine-tuning of the generation component, whereas RAGen provides a general data generation framework supporting multi-component adaptation.
  • vs. AutoRAG/LlamaIndex: These methods follow a single-chunk generation paradigm; RAGen's cross-concept multi-stem design constitutes a fundamental distinction.
  • vs. RAGEval/RAGAS: These frameworks target RAG system evaluation, whereas RAGen is explicitly designed for generating training data for RAG adaptation.

Rating

  • Novelty: ⭐⭐⭐⭐ The data generation paradigm combining document-level concepts, Bloom's Taxonomy, and multi-stem composition is novel and practically useful.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three domains, three embedding models, and two LLMs with sufficient ablations, though at a limited scale.
  • Writing Quality: ⭐⭐⭐⭐ The method is described clearly and systematically, with intuitive illustrations.
  • Value: ⭐⭐⭐⭐ Provides a practical data generation solution for domain-specific RAG adaptation.