Domain-Specific Data Generation Framework for RAG Adaptation¶
Conference: ACL 2026 · arXiv: 2510.11217 · Code: None · Area: Information Retrieval / RAG · Keywords: RAG Adaptation, Data Generation, Domain-Specific, Embedding Fine-tuning, Bloom's Taxonomy
TL;DR¶
This paper proposes RAGen, a scalable and modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. The framework supports contrastive fine-tuning of embedding models and supervised fine-tuning of LLMs, achieving substantial improvements over AutoRAG and LlamaIndex baselines across three domain-specific datasets.
Background & Motivation¶
Background: RAG (Retrieval-Augmented Generation) has become the predominant paradigm for integrating LLMs into domain-specific workflows, providing models with contextual information through external retrieval. However, directly applying general-purpose RAG pipelines to new domains frequently leads to suboptimal performance.
Limitations of Prior Work: (1) General-purpose retrievers and generators are not aligned with domain-specific terminology and data distributions; (2) RAG adaptation requires high-quality domain-specific training data, yet manual annotation is prohibitively expensive; (3) Existing data generation methods (AutoRAG, LlamaIndex) follow a single-chunk question generation paradigm—generating questions from individual text chunks—resulting in shallow, localized questions that lack cross-concept reasoning capability; (4) Methods such as RAFT optimize for individual components and are tightly coupled to specific training paradigms.
Key Challenge: The critical bottleneck in RAG adaptation lies not in model architecture or training objectives, but in upstream data supply—specifically, the absence of high-quality, cross-concept, multi-cognitive-level domain-specific training data.
Goal: To design a data-centric framework that automatically generates high-quality QAC datasets from raw documents, suitable for multi-component RAG adaptation (embedding models + LLMs).
Key Insight: Rather than operating at the chunk level, the framework begins from document-level concepts, assembles cross-chunk evidence into "question stems," applies Bloom's Taxonomy to guide question generation across different cognitive levels, and pairs each question with carefully constructed positive, negative, and distracting contexts.
Core Idea: High-quality RAG training data should be cross-concept, cross-chunk, and span multiple cognitive levels—rather than being mechanically generated as shallow QA pairs from individual text chunks.
Method¶
Overall Architecture¶
A three-stage pipeline: Stage 1 (Document Concept Extraction) — semantic chunking → chunk-level concept extraction (GPT-4o) → embedding-based clustering to fuse chunk-level concepts into document-level concepts. Stage 2 (Concept-Centric Evidence Assembly) — cross-chunk retrieval → sentence-level evidence filtering → question stem construction. Stage 3 (QAC Generation) — Bloom's Taxonomy-guided multi-level question generation + construction of four context variants (fully supportive / partially supportive / irrelevant / distracting). A toy skeleton of this control flow follows.
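The sketch below renders the three stages as plain control flow. It is a self-contained toy: every helper (chunking, concept extraction, evidence assembly) is a placeholder standing in for the GPT-4o- and embedding-based components, and none of the names come from a released codebase.

```python
# Toy skeleton of the three-stage pipeline. Every helper is a placeholder
# for the LLM- and embedding-based components; only the control flow
# mirrors the paper's description.

BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

def chunk(doc: str, size: int = 1024, overlap: int = 200) -> list[str]:
    """Fixed-size windows stand in for semantic chunking (sizes follow the paper)."""
    step = size - overlap
    return [doc[i:i + size] for i in range(0, max(len(doc) - overlap, 1), step)]

def build_qac_dataset(documents: list[str]) -> list[dict]:
    dataset = []
    for doc in documents:
        chunks = chunk(doc)
        # Stage 1: chunk-level concept extraction (GPT-4o in the paper), fused
        # into K document-level concepts via embedding clustering -- stubbed here.
        doc_concepts = [f"concept-{i}" for i in range(3)]
        for concept in doc_concepts:
            # Stage 2: cross-chunk retrieval + sentence filtering -> question stem.
            evidence = chunks[:2]  # placeholder for concept-conditioned retrieval
            stem = f"{concept}: " + " / ".join(e[:40] for e in evidence)
            # Stage 3: Bloom-guided QA generation + four context variants.
            for level in BLOOM_LEVELS:
                dataset.append({
                    "question": f"[{level}] question grounded in: {stem[:60]}",
                    "answer": "(LLM-generated answer)",
                    "ctx_full": evidence,         # fully supportive
                    "ctx_partial": evidence[:1],  # partially supportive
                    "ctx_irrelevant": [],         # same-domain, unrelated (stub)
                    "ctx_distracting": [],        # related but insufficient (stub)
                })
    return dataset

print(len(build_qac_dataset(["Domain document text. " * 200])))  # 3 concepts x 6 levels = 18
```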
Key Designs¶
- Document-Level Concept Extraction and Fusion:
- Function: Extract high-level semantic topics from documents to serve as anchors for question generation.
- Mechanism: Documents are first segmented into chunks (1,024 tokens with 200-token overlap); GPT-4o extracts chunk-level concepts from each chunk; OpenAI Ada embeddings and K-means clustering then fuse all chunk-level concepts into \(K\) document-level concepts, with the centroid-nearest concept in each cluster serving as the representative (see the clustering sketch after this list).
- Design Motivation: Chunk-level concepts are overly localized. Document-level concepts capture high-level semantic themes that span multiple chunks, providing global anchors for cross-chunk question generation.
- Bloom's Taxonomy-Guided Question Generation:
- Function: Generate diverse questions covering six cognitive levels, from remembering to creating.
- Mechanism: The six levels of the revised Bloom's Taxonomy (Remember → Understand → Apply → Analyze → Evaluate → Create) guide question type selection. The framework supports both single-stem (\(\ell=1\)) and multi-stem (\(\ell \geq 2\)) inputs—the latter jointly conditions on evidence from multiple concepts to generate cross-concept questions. An upper bound is imposed when the number of multi-stem combinations becomes intractable (a prompt sketch follows this list).
- Design Motivation: Single-chunk methods tend to produce large proportions of shallow Remember/Understand-level questions. Bloom's Taxonomy guidance ensures greater coverage of higher-order Analyze/Evaluate/Create-level questions.
- Four Context Variant Construction:
- Function: Construct diverse contexts for each QA pair to improve retrieval training robustness.
- Mechanism: Fully supportive (evidence that directly answers the question) + Partially supportive (incomplete information requiring cross-evidence reasoning) + Irrelevant (same-domain but unrelated content) + Distracting (topically related but semantically insufficient content for answering, analogous to distractors in reading comprehension).
- Design Motivation: Existing methods use randomly sampled chunks as negative samples. RAGen's carefully constructed distracting contexts introduce greater semantic difficulty, training more robust retrievers.
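As referenced above, here is a minimal sketch of the Stage 1 concept-fusion step: K-means over concept embeddings, keeping the centroid-nearest concept per cluster as its representative. Assumptions: scikit-learn for clustering; the OpenAI Ada embeddings are mocked with random vectors so the snippet runs offline.

```python
import numpy as np
from sklearn.cluster import KMeans

chunk_concepts = ["tariff escalation", "import quotas", "anti-dumping duties",
                  "supply chain risk", "export controls", "trade remedies"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(chunk_concepts), 1536))  # stand-in for Ada vectors

K = 2  # target number of document-level concepts
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(embeddings)

doc_concepts = []
for k in range(K):
    members = np.where(km.labels_ == k)[0]
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[k], axis=1)
    doc_concepts.append(chunk_concepts[members[np.argmin(dists)]])  # centroid-nearest
print(doc_concepts)
```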
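And for the Bloom-guided generation step, an illustrative prompt-construction sketch. The paper does not release its prompts, so the per-level instructions and output format here are assumptions; only the single-stem vs. multi-stem distinction comes from the description above.

```python
BLOOM_INSTRUCTIONS = {
    "Remember":   "Ask for a fact stated directly in the evidence.",
    "Understand": "Ask the reader to explain or summarize the evidence.",
    "Apply":      "Ask how the concept would be used in a new situation.",
    "Analyze":    "Ask to compare or relate evidence across the stems.",
    "Evaluate":   "Ask for a judgment that must be justified by the evidence.",
    "Create":     "Ask to propose something new grounded in the evidence.",
}

def question_prompt(stems: list[str], level: str) -> str:
    # Passing len(stems) >= 2 conditions the question on evidence from several
    # concepts at once -- the multi-stem (l >= 2) cross-concept case.
    joined = "\n\n".join(f"Stem {i + 1}: {s}" for i, s in enumerate(stems))
    return (f"{joined}\n\nGenerate one {level}-level question. "
            f"{BLOOM_INSTRUCTIONS[level]} Then give an answer grounded "
            f"only in the stems above.")

print(question_prompt(["Evidence on tariff escalation ...",
                       "Evidence on import quotas ..."], "Analyze"))
```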
Loss & Training¶
Embedding fine-tuning: InfoNCE contrastive loss, learning rate 1e-5, 3 epochs, temperature \(\tau=0.02\), 2 negatives per query. LLM fine-tuning: LoRA supervised fine-tuning (Qwen2.5-1.5B/3B), learning rate 1e-5, 5 epochs, 10% validation split. All experiments run on 4× RTX 3090 GPUs. Sketches of both training setups follow.
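A minimal InfoNCE sketch with the stated hyperparameters (\(\tau=0.02\), 2 negatives per query). PyTorch is an assumption; the paper does not name its training stack. In RAGen's setup the positive would be the fully supportive context and the negatives would be drawn from the distracting/irrelevant variants.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, neg, tau=0.02):
    """q: (B, d) query embeddings; pos: (B, d) positives; neg: (B, N, d) negatives."""
    q, pos, neg = (F.normalize(t, dim=-1) for t in (q, pos, neg))
    s_pos = (q * pos).sum(-1, keepdim=True)          # (B, 1) positive similarity
    s_neg = torch.einsum("bd,bnd->bn", q, neg)       # (B, N) negative similarities
    logits = torch.cat([s_pos, s_neg], dim=1) / tau  # positive sits at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

B, N, d = 8, 2, 1024
print(info_nce(torch.randn(B, d), torch.randn(B, d), torch.randn(B, N, d)).item())
```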
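For the LLM side, a hedged LoRA setup sketch via Hugging Face peft. Only the backbone (Qwen2.5-1.5B), learning rate, and epoch count come from the paper; the rank, alpha, dropout, and target modules below are illustrative defaults, not reported values.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed, not reported in the paper
    target_modules=["q_proj", "v_proj"],     # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # small trainable fraction vs. full fine-tuning
```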
Key Experimental Results¶
Main Results¶
Embedding Model Retrieval Performance (BGE-large-v1.5, averaged across three domains)
| Training Data | R@1 | R@5 | R@10 | MRR@10 |
|---|---|---|---|---|
| Vanilla (no fine-tuning) | 0.153 | 0.411 | 0.534 | 0.263 |
| AutoRAG | 0.190 | 0.517 | 0.655 | 0.330 |
| LlamaIndex | 0.204 | 0.539 | 0.671 | 0.346 |
| RAGen | 0.333 | 0.716 | 0.828 | 0.497 |
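For reference, R@k and MRR@10 follow their standard definitions; a small sketch (the paper's evaluation script is not public, so this is a reference implementation, not theirs):

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold context appears in the top-k results."""
    hits = sum(any(g in r[:k] for g in gs) for r, gs in zip(ranked, gold))
    return hits / len(ranked)

def mrr_at_k(ranked, gold, k=10):
    """Mean reciprocal rank of the first gold context within the top-k."""
    total = 0.0
    for r, gs in zip(ranked, gold):
        for rank, doc_id in enumerate(r[:k], start=1):
            if doc_id in gs:
                total += 1.0 / rank
                break
    return total / len(ranked)

ranked = [["c3", "c1", "c7"], ["c2", "c9", "c4"]]  # retrieved ids per query
gold = [{"c1"}, {"c5"}]                            # gold context ids per query
print(recall_at_k(ranked, gold, 5), mrr_at_k(ranked, gold))  # 0.5 0.25
```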
Additional Results & Analysis¶
LLM Fine-tuning Performance (Qwen2.5-1.5B, ROUGE-L)
| Domain | AutoRAG | LlamaIndex | RAGen |
|---|---|---|---|
| PPFS | 0.288 | 0.329 | 0.396 |
| TradePolicy | 0.278 | 0.270 | 0.391 |
| BusinessAI | 0.270 | 0.269 | 0.339 |
Cognitive Level Distribution Comparison (share of generated questions; Apply-level questions make up the remainder)
| Method | Remember + Understand (Lower-order) | Analyze + Evaluate + Create (Higher-order) |
|---|---|---|
| LlamaIndex | ~70% | ~15% |
| AutoRAG | ~65% | ~20% |
| RAGen | ~30% | ~50% |
Key Findings¶
- RAGen substantially outperforms baselines on embedding retrieval—R@1 is approximately 63% higher than LlamaIndex (0.333 vs. 0.204), demonstrating the superiority of cross-concept data generation.
- RAGen consistently achieves the best ROUGE-L in LLM fine-tuning (+20–40% relative improvement), indicating that data quality is equally critical for the generation component.
- Questions generated by RAGen exhibit higher cognitive levels—higher-order questions (Analyze/Evaluate/Create) account for 50%, compared to 15–20% for baselines.
- The inclusion of distracting contexts significantly improves retrieval robustness over random negative sampling.
- Multi-stem combinations (\(\ell \geq 2\)) generate cross-concept questions requiring deeper reasoning, which is the primary source of RAGen's data quality advantage.
Highlights & Insights¶
- The data-centric approach to RAG adaptation—achieving the largest performance gains by improving training data rather than modifying model architecture or training objectives—offers a broadly applicable design philosophy.
- Bloom's Taxonomy-guided question generation is a transferable methodology that carries over naturally to educational and assessment data generation scenarios.
- The design of four context variants (particularly distracting contexts) draws on the concept of distractors in reading comprehension.
Limitations & Future Work¶
- Concept extraction and question generation rely on GPT-4o, which incurs relatively high API costs and caps data quality at the capability of the underlying model.
- Evaluation is conducted on only three relatively small-scale domain datasets; large-scale industrial scenarios remain untested.
- No direct comparison is made with end-to-end RAG adaptation methods such as RAFT.
- Cross-document reasoning (combining concepts from different documents with \(\ell \geq 2\)) is not thoroughly explored.
Related Work & Insights¶
- vs. RAFT: RAFT focuses on distractor-aware fine-tuning of the generation component, whereas RAGen provides a general data generation framework supporting multi-component adaptation.
- vs. AutoRAG/LlamaIndex: These methods follow a single-chunk generation paradigm; RAGen's cross-concept multi-stem design constitutes a fundamental distinction.
- vs. RAGEval/RAGAS: These frameworks target RAG system evaluation, whereas RAGen is explicitly designed for generating training data for RAG adaptation.
Rating¶
- Novelty: ⭐⭐⭐⭐ The data generation paradigm combining document-level concepts, Bloom's Taxonomy, and multi-stem composition is novel and practically useful.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three domains, three embedding models, and two LLMs with sufficient ablations, though at a limited scale.
- Writing Quality: ⭐⭐⭐⭐ The method is described clearly and systematically, with intuitive illustrations.
- Value: ⭐⭐⭐⭐ Provides a practical data generation solution for domain-specific RAG adaptation.