
Domain-Specific Data Generation Framework for RAG Adaptation

Conference: ACL 2026 · arXiv: 2510.11217 · Code: None · Area: Information Retrieval / RAG · Keywords: RAG Adaptation, Data Generation, Domain-Specific, Embedding Fine-tuning, Bloom's Taxonomy

TL;DR

This paper proposes RAGen, a scalable and modular data generation framework that automatically synthesizes domain-specific QAC (Question-Answer-Context) data through document-level concept extraction, multi-chunk evidence assembly, and Bloom's Taxonomy-guided question generation. The framework supports contrastive fine-tuning of embedding models and supervised fine-tuning of LLMs, achieving substantial improvements over AutoRAG and LlamaIndex baselines across three domain-specific datasets.

Background & Motivation

Background: RAG (Retrieval-Augmented Generation) has become the predominant paradigm for integrating LLMs into domain-specific workflows, providing models with contextual information through external retrieval. However, directly applying general-purpose RAG pipelines to new domains frequently leads to suboptimal performance.

Limitations of Prior Work: (1) General-purpose retrievers and generators are not aligned with domain-specific terminology and data distributions; (2) RAG adaptation requires high-quality domain-specific training data, yet manual annotation is prohibitively expensive; (3) Existing data generation methods (AutoRAG, LlamaIndex) follow a single-chunk question generation paradigm—generating questions from individual text chunks—resulting in shallow, localized questions that lack cross-concept reasoning capability; (4) Methods such as RAFT optimize for individual components and are tightly coupled to specific training paradigms.

Key Challenge: The critical bottleneck in RAG adaptation lies not in model architecture or training objectives, but in upstream data supply—specifically, the absence of high-quality, cross-concept, multi-cognitive-level domain-specific training data.

Goal: To design a data-centric framework that automatically generates high-quality QAC datasets from raw documents, suitable for multi-component RAG adaptation (embedding models + LLMs).

Key Insight: Rather than operating at the chunk level, the framework begins from document-level concepts, assembles cross-chunk evidence into "question stems," applies Bloom's Taxonomy to guide question generation across different cognitive levels, and pairs each question with carefully constructed positive, negative, and distracting contexts.

Core Idea: High-quality RAG training data should be cross-concept, cross-chunk, and span multiple cognitive levels—rather than being mechanically generated as shallow QA pairs from individual text chunks.

Method

Overall Architecture

A three-stage pipeline: Stage 1 (Document Concept Extraction) — semantic chunking → chunk-level concept extraction (ChatGPT-4o) → embedding-based clustering to fuse into document-level concepts. Stage 2 (Concept-Centric Evidence Assembly) — cross-chunk retrieval → sentence-level evidence filtering → question stem construction. Stage 3 (QAC Generation) — Bloom's Taxonomy-guided multi-level question generation + construction of four context variants (fully supportive / partially supportive / irrelevant / distracting).
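The concept-fusion step of Stage 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for the OpenAI Ada embeddings, the K-means loop is hand-rolled for self-containment, and all function names are assumptions.

```python
# Sketch of Stage 1's fusion step: K-means over chunk-level concept
# embeddings, with the centroid-nearest concept as each cluster's
# representative document-level concept.
import numpy as np

def fuse_concepts(embeddings: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Cluster chunk-level concept embeddings into k document-level
    concepts; return per-point cluster labels and, per cluster, the
    index of the concept closest to its centroid (the representative)."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each concept to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids from current assignments
        for c in range(k):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # representative = the concept nearest its own cluster centroid
    reps = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):
            d = np.linalg.norm(embeddings[idx] - centroids[c], axis=1)
            reps.append(int(idx[d.argmin()]))
    return labels, reps

# toy usage: 12 fake 8-dim concept embeddings fused into K=3 document-level concepts
emb = np.random.default_rng(1).normal(size=(12, 8))
labels, reps = fuse_concepts(emb, k=3)
```

In the actual pipeline the embeddings would come from the Ada model and the representative concept strings would anchor Stage 2's cross-chunk evidence retrieval.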

Key Designs

  1. Document-Level Concept Extraction and Fusion:

    • Function: Extract high-level semantic topics from documents to serve as anchors for question generation.
    • Mechanism: Documents are first segmented into chunks (1,024 tokens with 200-token overlap); ChatGPT-4o extracts chunk-level concepts from each chunk; OpenAI Ada embeddings and K-means clustering then fuse all chunk-level concepts into \(K\) document-level concepts, with the centroid-nearest concept in each cluster serving as the representative.
    • Design Motivation: Chunk-level concepts are overly localized. Document-level concepts capture high-level semantic themes that span multiple chunks, providing global anchors for cross-chunk question generation.
  2. Bloom's Taxonomy-Guided Question Generation:

    • Function: Generate diverse questions covering six cognitive levels, from remembering to creating.
    • Mechanism: The six levels of the revised Bloom's Taxonomy (Remember → Understand → Apply → Analyze → Evaluate → Create) are used to guide question type selection. The framework supports both single-stem (\(\ell=1\)) and multi-stem (\(\ell \geq 2\)) inputs—the latter jointly conditions on evidence from multiple concepts to generate cross-concept questions. An upper bound is imposed when the number of multi-stem combinations becomes intractable.
    • Design Motivation: Single-chunk methods tend to produce large proportions of shallow Remember/Understand-level questions. Bloom's Taxonomy guidance ensures greater coverage of higher-order Analyze/Evaluate/Create-level questions.
  3. Four Context Variant Construction:

    • Function: Construct diverse contexts for each QA pair to improve retrieval training robustness.
    • Mechanism: Fully supportive (evidence that directly answers the question) + Partially supportive (incomplete information requiring cross-evidence reasoning) + Irrelevant (same-domain but unrelated content) + Distracting (topically related but semantically insufficient content for answering, analogous to distractors in reading comprehension).
    • Design Motivation: Existing methods use randomly sampled chunks as negative samples. RAGen's carefully constructed distracting contexts introduce greater semantic difficulty, training more robust retrievers.
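The multi-stem assembly in design 2 amounts to enumerating concept combinations up to a maximum \(\ell\) and capping the count when it becomes intractable. A minimal sketch, with the caveat that the cap value, the subsampling strategy, and all names are assumptions (the paper only states that an upper bound is imposed):

```python
# Sketch of multi-stem input assembly: enumerate single-stem (l=1) and
# multi-stem (l>=2) concept combinations, subsampling when a level
# exceeds the cap.
from itertools import combinations
from random import Random

def stem_combinations(stems, max_l=2, cap=100, seed=0):
    """Return stem tuples for l = 1 .. max_l; multi-stem levels whose
    combination count exceeds `cap` are randomly subsampled."""
    combos = []
    for l in range(1, max_l + 1):
        level = list(combinations(range(len(stems)), l))
        if l >= 2 and len(level) > cap:
            level = Random(seed).sample(level, cap)
        combos.extend(tuple(stems[i] for i in c) for c in level)
    return combos

stems = [f"concept-{i}" for i in range(5)]
# 5 single-stem inputs + min(C(5,2), 6) = 6 sampled pairs = 11 combinations
combos = stem_combinations(stems, max_l=2, cap=6)
```

Each resulting tuple would then be passed, together with its assembled evidence, to the Bloom's Taxonomy-guided question generator.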

Loss & Training

Embedding fine-tuning: InfoNCE contrastive loss with temperature \(\tau=0.02\), learning rate 1e-5, 3 epochs, and 2 negative samples per query. LLM fine-tuning: LoRA-based supervised fine-tuning of Qwen2.5-1.5B/3B, learning rate 1e-5, 5 epochs, 10% validation split. All experiments run on 4× RTX 3090 GPUs.
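The InfoNCE objective can be illustrated with the stated hyperparameters (\(\tau=0.02\), two negatives per query). This is a sketch only: numpy stands in for the actual training stack, and the use of cosine similarity is an assumption.

```python
# Minimal InfoNCE sketch: the loss is low when the query is close to the
# positive context and high when a negative (e.g. a distracting context)
# scores higher than the positive.
import numpy as np

def info_nce(query, positive, negatives, tau=0.02):
    """-log( exp(s(q,p)/tau) / (exp(s(q,p)/tau) + sum_n exp(s(q,n)/tau)) )
    with cosine similarity s."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(query, positive)] + [cos(query, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=64)
loss_easy = info_nce(q, q, [rng.normal(size=64), rng.normal(size=64)])  # positive matches query
loss_hard = info_nce(q, rng.normal(size=64), [q, q])                    # negatives match query
```

The very low temperature (0.02) sharpens the softmax, so near-miss negatives such as RAGen's distracting contexts contribute much stronger gradients than randomly sampled chunks.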

Key Experimental Results

Main Results

Embedding Model Retrieval Performance (BGE-large-v1.5, averaged across three domains)

Training Data             R@1     R@5     R@10    MRR@10
Vanilla (no fine-tuning)  0.153   0.411   0.534   0.263
AutoRAG                   0.190   0.517   0.655   0.330
LlamaIndex                0.204   0.539   0.671   0.346
RAGen                     0.333   0.716   0.828   0.497

LLM Fine-tuning Results

LLM Fine-tuning Performance (Qwen2.5-1.5B, ROUGE-L)

Domain       AutoRAG   LlamaIndex   RAGen
PPFS         0.288     0.329        0.396
TradePolicy  0.278     0.270        0.391
BusinessAI   0.270     0.269        0.339

Cognitive Level Distribution Comparison

Method       Remember + Understand (lower-order)   Analyze + Evaluate + Create (higher-order)
LlamaIndex   ~70%                                  ~15%
AutoRAG      ~65%                                  ~20%
RAGen        ~30%                                  ~50%

Key Findings

  • RAGen substantially outperforms baselines on embedding retrieval—R@1 is approximately 63% higher than LlamaIndex (0.333 vs. 0.204), demonstrating the superiority of cross-concept data generation.
  • RAGen consistently achieves the best ROUGE-L in LLM fine-tuning (+20–40% relative improvement), indicating that data quality is equally critical for the generation component.
  • Questions generated by RAGen exhibit higher cognitive levels—higher-order questions (Analyze/Evaluate/Create) account for 50%, compared to 15–20% for baselines.
  • The inclusion of distracting contexts significantly improves retrieval robustness over random negative sampling.
  • Multi-stem combinations (\(\ell \geq 2\)) generate cross-concept questions requiring deeper reasoning, which is the primary source of RAGen's data quality advantage.

Highlights & Insights

  • The data-centric approach to RAG adaptation—achieving the largest performance gains by improving training data rather than modifying model architecture or training objectives—offers a broadly applicable design philosophy.
  • Bloom's Taxonomy-guided question generation is a transferable methodology applicable to any educational or assessment data generation scenario.
  • The design of four context variants (particularly distracting contexts) draws on the concept of distractors in reading comprehension.

Limitations & Future Work

  • Concept extraction and question generation rely on ChatGPT-4o, which incurs relatively high API costs and caps the quality of the generated data at the capability of the underlying model.
  • Evaluation is conducted on only three relatively small-scale domain datasets; large-scale industrial scenarios remain untested.
  • No direct comparison is made with end-to-end RAG adaptation methods such as RAFT.
  • Cross-document reasoning (combining concepts from different documents with \(\ell \geq 2\)) is not thoroughly explored.
Comparison with Related Methods

  • vs. RAFT: RAFT focuses on distractor-aware fine-tuning of the generation component, whereas RAGen provides a general data generation framework supporting multi-component adaptation.
  • vs. AutoRAG/LlamaIndex: These methods follow a single-chunk generation paradigm; RAGen's cross-concept multi-stem design constitutes a fundamental distinction.
  • vs. RAGEval/RAGAS: These frameworks target RAG system evaluation, whereas RAGen is explicitly designed for generating training data for RAG adaptation.

Rating

  • Novelty: ⭐⭐⭐⭐ The data generation paradigm combining document-level concepts, Bloom's Taxonomy, and multi-stem composition is novel and practically useful.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers three domains, three embedding models, and two LLMs with sufficient ablations, though at a limited scale.
  • Writing Quality: ⭐⭐⭐⭐ The method is described clearly and systematically, with intuitive illustrations.
  • Value: ⭐⭐⭐⭐ Provides a practical data generation solution for domain-specific RAG adaptation.