MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA¶

Conference: ACL 2025
arXiv: 2512.10996
Code: None
Area: Medical NLP
Keywords: Retrieval-Augmented Generation, Biomedical QA, Semantic Search, GPT-4o Fine-tuning, RAG

TL;DR¶

MedBioRAG proposes a retrieval-augmented generation framework that integrates semantic search, document retrieval, and fine-tuned LLMs for biomedical QA tasks. It outperforms previous SOTA and GPT-4o baseline models on multiple benchmarks across three task types: text retrieval (NFCorpus, TREC-COVID), closed-domain QA (MedQA, PubMedQA, BioASQ), and long-text QA.

Background & Motivation¶

Biomedical QA is a highly challenging task. The key challenges are:

Knowledge Timeliness: Although large models like GPT-4o possess strong zero-shot reasoning capabilities, they rely on static pre-training data, are prone to hallucinating, and cannot access up-to-date medical knowledge. In the medical field, information accuracy and timeliness are critical.

Retrieval Quality Bottleneck: RAG compensates for LLMs' limitations by dynamically retrieving external knowledge, but its effectiveness depends heavily on retrieval quality. Traditional keyword-based retrieval methods (BM25, TF-IDF) perform poorly when handling synonyms (e.g., "heart attack" vs. "myocardial infarction") and polysemy in medical terminology, often leading to irrelevant retrieval results.

Domain Adaptation Requirements: Unlike general QA, medical QA requires extremely high accuracy and interpretability. General LLMs require domain adaptation to meet clinical-grade requirements.

Research Gap: Prior work lacked a comprehensive framework that systematically integrates semantic search, document ranking, and fine-tuned LLMs to address the multidimensional challenges of biomedical QA.

MedBioRAG combines three mutually reinforcing components: hybrid retrieval (semantic search as the primary signal, lexical search as the auxiliary signal), supervised fine-tuning of GPT-4o, and structured prompt engineering.

Method¶

Overall Architecture¶

The MedBioRAG workflow has three stages:

1. Retrieval Stage: Hybrid search (primary semantic search, auxiliary lexical search) is run on the user query to retrieve relevant biomedical documents, which are then reranked.
2. Generation Stage: The retrieved documents are passed as context to the fine-tuned LLM, which generates the answer.
3. Filtering Stage: Prompt engineering and content filtering are applied to control output quality.

It supports three task modes: closed-domain QA (multiple-choice/yes-no), long-text QA (detailed explanations), and text retrieval.
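
The three stages can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the names `answer_query`, `toy_retrieve`, and `toy_generate`, the `Answer` type, and the default `top_k` and `min_confidence` values are all hypothetical, and the retriever and generator are toy stand-ins for the hybrid search and the fine-tuned LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], Answer],
    top_k: int = 3,
    min_confidence: float = 0.5,
) -> Optional[Answer]:
    """Run retrieve -> generate -> filter; return None if the answer is filtered out."""
    # Stage 1: retrieval (hybrid search and reranking are hidden behind `retrieve`).
    context = retrieve(query, top_k)
    # Stage 2: generation conditioned on the retrieved context.
    answer = generate(query, context)
    # Stage 3: filtering; answers below the confidence threshold are dropped.
    return answer if answer.confidence >= min_confidence else None


# Toy stand-ins so the sketch runs without any external service.
def toy_retrieve(query: str, k: int) -> List[str]:
    corpus = [
        "Myocardial infarction is commonly called a heart attack.",
        "Aspirin is used in the acute management of myocardial infarction.",
        "Vitamin C is found in citrus fruit.",
    ]
    return corpus[:k]


def toy_generate(query: str, context: List[str]) -> Answer:
    return Answer(text=f"Based on {len(context)} documents: yes", confidence=0.9)


result = answer_query(
    "Is aspirin used for heart attack?", toy_retrieve, toy_generate, top_k=2
)
print(result)
```

Passing the retriever and generator as callables keeps the stage boundaries explicit, so each stage can be swapped or evaluated on its own.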

Key Designs¶

Hybrid Retrieval Mechanism:
- Lexical Search (BM25): Traditional term-frequency-based retrieval with IDF weighting and document length normalization; it excels at exact matching.
- Semantic Search: Encodes queries and documents into dense vector representations, computes relevance using cosine similarity, and selects Top-K documents.
- Design Motivation: Semantic search captures conceptual relationships between medical terms (even without exact keyword matches), improving NDCG@10 by 6.57 points ($31.34 \rightarrow 37.91$) on NFCorpus compared to lexical search.
- Top-K Selection: Experiments reveal that retrieval performance does not scale indefinitely with the number of retrieved documents; exceeding the optimal threshold introduces noise and contradictory information, leading to performance degradation.
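
The hybrid retrieval described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: the summary does not say how the two signals are fused, so the min-max normalization and the weight `alpha` (semantic weighted above lexical) are assumptions, the function names are hypothetical, and the two-dimensional "embeddings" are hand-made stand-ins for a real dense encoder.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 with IDF weighting and document length normalization."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(xs: Sequence[float]) -> List[float]:
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def hybrid_top_k(query_tokens, query_vec, docs_tokens, doc_vecs,
                 k: int = 2, alpha: float = 0.7) -> List[int]:
    """Return indices of the top-k documents by a weighted sum of both signals."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    # Assumed fusion rule: alpha weights the (primary) semantic score.
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]


docs_tokens = [
    "heart attack symptoms include chest pain".split(),
    "myocardial infarction treatment with aspirin".split(),
    "vitamin c in citrus fruit".split(),
]
# Hand-made vectors: the first two documents are semantically close to the query.
doc_vecs = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
top = hybrid_top_k("myocardial infarction".split(), [1.0, 0.0], docs_tokens, doc_vecs)
print(top)
```

In this toy corpus the "heart attack" document shares no keyword with the query, so BM25 alone scores it zero; the semantic signal is what places it in the top 2, which mirrors the synonym problem described earlier. The parameter `k` plays the role of the Top-K cutoff.
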

GPT-4o Supervised Fine-tuning:
- Performs fine-tuning on (query + retrieved context, target answer) pairs.
- Standard language modeling loss: $$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{|y|} \log P_\theta(y_t | y_{<t}, x)$$
- Necessity of Fine-Tuning: Zero-shot GPT-4o achieves only 44.74% on PubMedQA; fine-tuning raises this to 80.70%, and adding RAG on top of fine-tuning reaches 85.00%. This suggests that domain-specific fine-tuning is crucial for improving medical reasoning and reducing hallucinations.
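
As a worked instance of the loss $\mathcal{L}_{\text{LM}}$, the sketch below evaluates it for one training pair. The helper names `build_input` and `lm_loss`, the input layout, and the per-token probabilities are illustrative assumptions; in practice the probabilities $P_\theta(y_t \mid y_{<t}, x)$ come from the model being fine-tuned.

```python
import math
from typing import List, Sequence


def build_input(query: str, retrieved_docs: List[str]) -> str:
    """Concatenate retrieved context and query into the model input x (assumed layout)."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def lm_loss(target_token_probs: Sequence[float]) -> float:
    """Negative sum over t of log P(y_t | y_<t, x), given each target token's probability."""
    return -sum(math.log(p) for p in target_token_probs)


x = build_input(
    "Does aspirin reduce mortality after myocardial infarction?",
    ["Aspirin is used in the acute management of myocardial infarction."],
)
# Hypothetical probabilities the model assigns to the three target tokens of y.
probs = [0.9, 0.8, 0.95]
loss = lm_loss(probs)
print(round(loss, 4))
```

Because the loss is a sum of log-probabilities, `exp(-loss)` equals the probability the model assigns to the whole target sequence, here `0.9 * 0.8 * 0.95`.
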

Prompt Engineering & Content Filtering:
- Structured prompts guide the model to generate formatted and reliable answers.
- Confidence Filtering: The model assigns a confidence score $s_c = \text{softmax}(W_o h_T)$ to each response. Responses scoring below a threshold are discarded or iteratively corrected.
- Uses different prompt templates customized for varying task types (closed-domain/long-text).
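
The confidence filter can be sketched as below. Several details are assumptions, since the summary gives only the formula: $W_o$ is taken to project the final hidden state $h_T$ onto answer options, the scalar confidence is taken to be the largest softmax probability, and the threshold of 0.6 and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Sequence[Sequence[float]], h: Sequence[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def confidence(W_o: Sequence[Sequence[float]], h_T: Sequence[float]) -> Tuple[int, float]:
    """Return (index of the most likely option, its softmax probability)."""
    probs = softmax(matvec(W_o, h_T))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def keep_response(W_o, h_T, threshold: float = 0.6) -> bool:
    """Keep the response only if its confidence reaches the (assumed) threshold."""
    return confidence(W_o, h_T)[1] >= threshold


# Toy projection onto three options, e.g. yes / no / maybe.
W_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
confident = keep_response(W_o, [2.0, 0.0])   # logits [2, 0, 0]
uncertain = keep_response(W_o, [0.1, 0.0])   # logits [0.1, 0, 0]
print(confident, uncertain)
```

With logits `[2, 0, 0]` the top probability is about 0.79 and the response is kept; with nearly flat logits it is about 0.36 and the response would be discarded or sent for correction.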

Loss & Training¶

Fine-tuning utilizes standard autoregressive language modeling loss.
The base models are all GPT-4o.
Different model instances are fine-tuned separately for different tasks (closed-domain QA, long-text QA, retrieval).

Key Experimental Results¶

Main Results — Closed-Domain QA¶

Method	MedQA	PubMedQA	BioASQ
GPT-4o (Zero-shot)	81.82	44.74	96.12
GPT-4o + MedBioRAG	86.86	66.67	97.06
Fine-tuned GPT-4o	87.88	80.70	97.06
Fine-tuned GPT-4o + MedBioRAG	89.47	85.00	98.32
GPT-3.5	51.52	19.30	88.24
GPT-4 + MedBioRAG	78.79	72.81	97.79

Retrieval Performance¶

Metric	NFCorpus Lexical Search	NFCorpus Semantic Search	TREC-COVID Lexical Search	TREC-COVID Semantic Search
NDCG@10	31.34	37.91	48.35	61.02
MRR@10	51.63	64.29	82.50	89.17
MAP@10	46.01	56.15	72.31	82.19
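
For readers less familiar with these metrics, the sketch below computes NDCG@k and the reciprocal rank for a single query; MRR@k is the mean of the reciprocal rank over queries. It is a simplified illustration with two assumptions: it uses linear gain (some implementations use $2^{rel} - 1$), and it derives the ideal ranking from the retrieved list only, whereas a full evaluation uses all judged documents for the query. The table values appear to be these scores scaled by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """1 / rank of the first relevant document within the top k, else 0."""
    for rank, rel in enumerate(relevances[:k]):
        if rel > 0:
            return 1.0 / (rank + 1)
    return 0.0


# Relevance labels of the ranked results for one query (1 = relevant).
ranked = [0, 1, 1]
ndcg = ndcg_at_k(ranked)
rr = reciprocal_rank_at_k(ranked)
print(round(ndcg, 4), rr)
```

Here the first relevant document sits at rank 2, so the reciprocal rank is 0.5, and NDCG is below 1 because both relevant documents are ranked lower than they ideally would be.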

Ablation Study — Contribution of Components¶

Configuration	PubMedQA Accuracy	Description
GPT-4o Zero-shot	44.74%	Baseline
+ RAG (No Fine-tuning)	66.67%	RAG yields +21.93 points
Fine-tuned (No RAG)	80.70%	Fine-tuning yields +35.96 points
Fine-tuned + RAG	85.00%	Combination of both is best
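
The gains in the table above can be checked with a few lines of arithmetic. The accuracies are taken from the table; the dictionary keys and variable names are just labels chosen for this sketch.

```python
# PubMedQA accuracy (%) for each configuration, copied from the ablation table.
pubmedqa_accuracy = {
    "zero_shot": 44.74,
    "rag_only": 66.67,
    "fine_tuned_only": 80.70,
    "fine_tuned_plus_rag": 85.00,
}

baseline = pubmedqa_accuracy["zero_shot"]
# Gain of each configuration over the zero-shot baseline, in percentage points.
gains = {
    name: round(acc - baseline, 2)
    for name, acc in pubmedqa_accuracy.items()
    if name != "zero_shot"
}
# Additional gain from RAG once the model is already fine-tuned.
rag_on_top_of_ft = round(
    pubmedqa_accuracy["fine_tuned_plus_rag"] - pubmedqa_accuracy["fine_tuned_only"], 2
)
print(gains, rag_on_top_of_ft)
```

The combined gain (40.26 points) is smaller than the sum of the two individual gains (57.89 points), and RAG adds 4.30 points on top of fine-tuning versus 21.93 points on the zero-shot model, so the two improvements overlap rather than add up.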

Key Findings¶

Fine-tuning contributes more than RAG: On PubMedQA, fine-tuning alone adds 35.96 points over zero-shot GPT-4o, while RAG alone adds 21.93 points, suggesting that internalizing domain knowledge matters more than external retrieval on this benchmark.
Semantic search outperforms lexical search: It leads on every reported retrieval metric (NDCG@10, MRR@10, MAP@10) on both NFCorpus and TREC-COVID.
An optimal Top-K exists: Retrieving too many documents introduces noise, which is particularly detrimental to closed-domain QA, where concise answers are required.
RAG can hurt weaker models: GPT-3.5 with RAG drops on MedQA ($51.52 \rightarrow 45.36$), suggesting that RAG can have a negative effect when the base model's capacity is insufficient.

Highlights & Insights¶

Systematic Evaluation Framework: Breaks biomedical QA into three dimensions (retrieval, closed-domain QA, and long-text QA) and compares against baselines on each.
Synergy of Fine-Tuning and RAG: Shows that the two methods are complementary rather than alternatives: fine-tuning provides domain knowledge, while RAG supplies up-to-date information.
Reports a new SOTA on PubMedQA (85.00%), outperforming prior models such as Med-PaLM-2.
Shows empirically that pairing a weak model with RAG can be counterproductive, which offers practical guidance for model selection in RAG systems.

Limitations & Future Work¶

Lack of clinical validation by medical experts: All evaluations are based on automated metrics, without evaluation by medical professionals regarding clinical accuracy and reliability of model outputs.
Insufficient handling of contradictions among retrieved documents: If retrieved documents contain conflicting information, the model lacks a mechanism for conflict resolution.
High computational overhead: Real-time retrieval increases inference latency, limiting its application in time-sensitive clinical scenarios.
High fine-tuning costs based on GPT-4o: Using closed-source commercial models limits reproducibility and incurs high deployment costs.
Evaluated only on English datasets: Cross-lingual biomedical QA remains unaddressed.
RAG degrades long-text QA on LiveQA: Adding RAG lowered ROUGE scores on LiveQA, indicating that long-text generation needs a more careful retrieval-integration strategy.

Related Work & Insights¶

Extends the RAG (Lewis et al.) framework, specifically optimizing it for the biomedical domain.
Directly compared with and outperforms BlendedRAG and BM25S in retrieval performance.
Compares favorably with specialized medical models like MEDITRON-70B and Med-PaLM-2, demonstrating the competitiveness of general-purpose model + fine-tuning + RAG.
Serves as a reference for RAG system designs in other vertical domains (e.g., law, finance).

Rating¶

Novelty: ⭐⭐⭐ Individual components (semantic search, fine-tuning, RAG) are not new; the innovation lies mostly in the system integration level.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple task types and datasets, but lacks expert human evaluation.
Writing Quality: ⭐⭐⭐ Structure is complete, though the writing is redundant in places and the mathematical formulations are basic.
Value: ⭐⭐⭐⭐ Provides a complete solution and a comprehensive baseline comparison for biomedical QA, with high reference value for practical deployment.