MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA

Conference: ACL 2025
arXiv: 2512.10996
Code: None
Area: Medical NLP
Keywords: Retrieval-Augmented Generation, Biomedical QA, Semantic Search, GPT-4o Fine-tuning, RAG

TL;DR

MedBioRAG proposes a retrieval-augmented generation framework that integrates semantic search, document retrieval, and fine-tuned LLMs for biomedical QA. It outperforms previous SOTA and GPT-4o baselines on benchmarks spanning three task categories: text retrieval (NFCorpus, TREC-COVID), closed-domain QA (MedQA, PubMedQA, BioASQ), and long-text QA.

Background & Motivation

Biomedical QA is highly challenging. The key challenges are:

Knowledge Timeliness: Although large models like GPT-4o possess strong zero-shot reasoning capabilities, they rely on static pre-training data, are prone to hallucinating, and cannot access up-to-date medical knowledge. In the medical field, information accuracy and timeliness are critical.

Retrieval Quality Bottleneck: RAG compensates for LLMs' limitations by dynamically retrieving external knowledge, but its effectiveness depends heavily on retrieval quality. Traditional keyword-based retrieval methods (BM25, TF-IDF) perform poorly when handling synonyms (e.g., "heart attack" vs. "myocardial infarction") and polysemy in medical terminology, often leading to irrelevant retrieval results.

Domain Adaptation Requirements: Unlike general QA, medical QA requires extremely high accuracy and interpretability. General LLMs require domain adaptation to meet clinical-grade requirements.

Research Gap: Prior work lacked a comprehensive framework that systematically integrates semantic search, document ranking, and fine-tuned LLMs to address the multidimensional challenges of biomedical QA.

MedBioRAG combines three mutually reinforcing components: hybrid retrieval (semantic search as the primary signal, lexical search as an auxiliary one), supervised fine-tuning of GPT-4o, and structured prompt engineering.

Method

Overall Architecture

The MedBioRAG workflow has three stages:

  1. Retrieval Stage: Hybrid search (primary semantic search, auxiliary lexical search) is run on the user query to retrieve relevant biomedical documents, which are then reranked.
  2. Generation Stage: The retrieved documents are passed as context to the fine-tuned LLM, which generates the answer.
  3. Filtering Stage: Prompt engineering and content filtering are applied to control output quality.

It supports three task modes: closed-domain QA (multiple-choice/yes-no), long-text QA (detailed explanations), and text retrieval.
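
The retrieval stage described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the fusion rule (a weighted sum of min-max-normalized scores with a hypothetical weight `alpha`), the whitespace tokenizer, and the hand-written three-dimensional "embeddings" are all assumptions made for illustration, and the reranking step is omitted. In a real system the vectors would come from a dense encoder.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_top_k(query_tokens, query_vec, docs_tokens, doc_vecs, k=2, alpha=0.7):
    """Rank documents by alpha * semantic + (1 - alpha) * lexical (assumed fusion rule)."""
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return ranked[:k]

# Toy corpus; the vectors are hand-made stand-ins for encoder outputs.
docs = [
    "myocardial infarction treatment with aspirin",
    "heart attack symptoms in adults",
    "vitamin d and bone density",
]
docs_tokens = [d.lower().split() for d in docs]
doc_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
query_tokens = "heart attack treatment".split()
query_vec = [0.85, 0.15, 0.05]

lexical_only = bm25_scores(query_tokens, docs_tokens)
top = hybrid_top_k(query_tokens, query_vec, docs_tokens, doc_vecs, k=2)
```

In this toy setup BM25 alone credits the "myocardial infarction" document only for the shared word "treatment", while the dense similarity places it alongside the "heart attack" document. The fused ranking keeps both and drops the unrelated one.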

Key Designs

  1. Hybrid Retrieval Mechanism:

    • Lexical Search (BM25): Traditional term-frequency-based retrieval with IDF weighting and document-length normalization; it excels at exact matching.
    • Semantic Search: Encodes queries and documents into dense vector representations, computes relevance using cosine similarity, and selects Top-K documents.
    • Design Motivation: Semantic search captures conceptual relationships between medical terms (even without exact keyword matches), improving NDCG@10 by 6.57 points (\(31.34 \rightarrow 37.91\)) on NFCorpus compared to lexical search.
    • Top-K Selection: Experiments reveal that retrieval performance does not scale indefinitely with the number of retrieved documents; exceeding the optimal threshold introduces noise and contradictory information, leading to performance degradation.
  2. GPT-4o Supervised Fine-tuning:

    • Performs fine-tuning using (query + retrieved context, target answer) pairs.
    • Standard language modeling loss: \(\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{|y|} \log P_\theta(y_t \mid y_{<t}, x)\), where \(x\) is the query plus retrieved context and \(y\) is the target answer.
    • Necessity of Fine-Tuning: Zero-shot GPT-4o achieves only 44.74% on PubMedQA; fine-tuning raises this to 80.70%, and combining fine-tuning with RAG reaches 85.00%. This suggests that domain-specific fine-tuning is crucial for reducing hallucinations and improving medical reasoning.
  3. Prompt Engineering & Content Filtering:

    • Structured prompts guide the model to generate formatted and reliable answers.
    • Confidence Filtering: The model assigns a confidence score \(s_c = \text{softmax}(W_o h_T)\) for each response. Responses below the threshold are discarded or iteratively corrected.
    • Uses different prompt templates customized for varying task types (closed-domain/long-text).
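
The confidence filter above can be sketched as follows. The note gives only \(s_c = \text{softmax}(W_o h_T)\), so several details here are assumptions: the softmax vector is reduced to a scalar by taking its largest entry, the threshold value `0.6` is hypothetical, and low-confidence responses are simply flagged rather than iteratively corrected. The matrix `W_o` and the vectors `h_T` are toy values.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, h):
    """Matrix-vector product W @ h for plain Python lists."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def confidence(W_o, h_T):
    """Scalar confidence: largest softmax probability (assumed reduction)."""
    return max(softmax(matvec(W_o, h_T)))

def filter_responses(candidates, W_o, threshold=0.6):
    """Split (text, h_T) candidates into kept and flagged lists by confidence."""
    kept, flagged = [], []
    for text, h_T in candidates:
        s_c = confidence(W_o, h_T)
        (kept if s_c >= threshold else flagged).append((text, s_c))
    return kept, flagged

# Toy two-class output head and two candidate responses.
W_o = [[2.0, 0.0], [0.0, 2.0]]
candidates = [("yes", [1.5, 0.0]), ("maybe", [0.1, 0.0])]
kept, flagged = filter_responses(candidates, W_o)
```

Flagged responses would be the ones to discard or send back for another generation pass; how that correction loop works is not specified in this note.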

Loss & Training

  • Fine-tuning utilizes standard autoregressive language modeling loss.
  • All variants use GPT-4o as the base model.
  • Different model instances are fine-tuned separately for different tasks (closed-domain QA, long-text QA, retrieval).
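
As a worked illustration of the training objective, the sketch below packs one (query + retrieved context, target answer) pair and evaluates the autoregressive loss from per-token probabilities. The prompt layout in `build_example` is hypothetical, and the probabilities are made-up numbers; in practice they come from the model, and the loss is computed by the fine-tuning service rather than by hand.

```python
import math

def build_example(query, contexts, answer):
    """Pack one supervised pair: x = query + retrieved context, y = target answer.

    The "Context / Question" layout is an illustrative assumption.
    """
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    x = f"Context:\n{context_block}\n\nQuestion: {query}"
    return {"input": x, "target": answer}

def lm_loss(gold_token_probs):
    """L_LM = -sum_t log P(y_t | y_<t, x), given the probability the model
    assigned to each gold target token."""
    return -sum(math.log(p) for p in gold_token_probs)

example = build_example(
    "Placeholder yes/no question?",
    ["Placeholder abstract text A.", "Placeholder abstract text B."],
    "yes",
)

# Suppose the answer spans two tokens to which the model gave 0.5 and 0.25.
loss = lm_loss([0.5, 0.25])  # equals log(8), since -(log 0.5 + log 0.25) = log 8
```

The loss is summed only over target tokens; the input `x` conditions the prediction but contributes no terms of its own.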

Key Experimental Results

Main Results — Closed-Domain QA

| Method | MedQA | PubMedQA | BioASQ |
| --- | --- | --- | --- |
| GPT-4o (Zero-shot) | 81.82 | 44.74 | 96.12 |
| GPT-4o + MedBioRAG | 86.86 | 66.67 | 97.06 |
| Fine-tuned GPT-4o | 87.88 | 80.70 | 97.06 |
| Fine-tuned GPT-4o + MedBioRAG | 89.47 | 85.00 | 98.32 |
| GPT-3.5 | 51.52 | 19.30 | 88.24 |
| GPT-4 + MedBioRAG | 78.79 | 72.81 | 97.79 |

Retrieval Performance

| Metric | NFCorpus (Lexical) | NFCorpus (Semantic) | TREC-COVID (Lexical) | TREC-COVID (Semantic) |
| --- | --- | --- | --- | --- |
| NDCG@10 | 31.34 | 37.91 | 48.35 | 61.02 |
| MRR@10 | 51.63 | 64.29 | 82.50 | 89.17 |
| MAP@10 | 46.01 | 56.15 | 72.31 | 82.19 |
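
For readers unfamiliar with the metrics in this table, here is a minimal sketch of NDCG@10 and MRR@10 for a single query. It assumes the input list holds the graded relevance of every judged document in ranked order, and it uses the linear-gain form of DCG; the evaluation tooling actually used in the paper is not stated in this note, so exact values may differ from those produced by standard benchmark scripts. MAP@10 is not covered here.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr_at_k(relevances, k=10):
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# One query whose only relevant document was ranked second.
ranked_relevance = [0, 1]
ndcg = ndcg_at_k(ranked_relevance)  # 1 / log2(3)
mrr = mrr_at_k(ranked_relevance)    # 1 / 2
```

Benchmark scores such as those in the table are averages of these per-query values over all queries, reported as percentages.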

Ablation Study — Contribution of Components

| Configuration | PubMedQA Accuracy | Description |
| --- | --- | --- |
| GPT-4o Zero-shot | 44.74% | Baseline |
| + RAG (No Fine-tuning) | 66.67% | RAG yields +21.93 points |
| Fine-tuned (No RAG) | 80.70% | Fine-tuning yields +35.96 points |
| Fine-tuned + RAG | 85.00% | Combination of both is best |
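
The gains in this ablation are differences in percentage points, and they can be rechecked directly from the reported accuracies. The short sketch below does that arithmetic; the dictionary keys are illustrative names, and the numbers are the PubMedQA accuracies listed above.

```python
# PubMedQA accuracies (%) as reported in the ablation.
pubmedqa = {
    "zero_shot": 44.74,
    "rag_only": 66.67,
    "finetuned": 80.70,
    "finetuned_rag": 85.00,
}

def gain(before, after):
    """Difference in percentage points, rounded to two decimals."""
    return round(pubmedqa[after] - pubmedqa[before], 2)

rag_gain = gain("zero_shot", "rag_only")            # RAG alone over baseline
ft_gain = gain("zero_shot", "finetuned")            # fine-tuning alone over baseline
rag_on_ft_gain = gain("finetuned", "finetuned_rag") # RAG added on top of fine-tuning
total_gain = gain("zero_shot", "finetuned_rag")     # both together over baseline

# The individual gains overlap: their sum exceeds the combined gain.
overlap = round(rag_gain + ft_gain - total_gain, 2)
```

The combined improvement (40.26 points) is smaller than the sum of the two individual improvements (57.89 points), so the two techniques partly cover the same errors. RAG adds 4.30 points once the model is already fine-tuned, compared with 21.93 points on the zero-shot model.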

Key Findings

  • Fine-tuning is more important than RAG: On PubMedQA, fine-tuning alone adds 35.96 points over zero-shot, while RAG alone adds 21.93 points, suggesting that internalizing domain knowledge matters more than external retrieval on this benchmark.
  • Semantic search outperforms lexical search: It leads on every reported retrieval metric (NDCG@10, MRR@10, MAP@10) on both NFCorpus and TREC-COVID.
  • An optimal Top-K exists: Retrieving too many documents introduces noise, which is particularly detrimental to closed-domain QA requiring concise answers.
  • RAG can hurt weak base models: Adding RAG to GPT-3.5 lowers MedQA accuracy (\(51.52 \rightarrow 45.36\)), suggesting that RAG can have a negative effect when the base model's capacity is insufficient.

Highlights & Insights

  • Systematic Evaluation Framework: Evaluates biomedical QA along three dimensions (retrieval, closed-domain QA, and long-text QA), offering a comprehensive benchmark comparison.
  • Synergy of Fine-Tuning and RAG: Shows that the two enhancement methods are complementary rather than alternatives: fine-tuning provides domain knowledge, while RAG supplies up-to-date information.
  • Establishes a new SOTA on PubMedQA (85.00%), outperforming prior models such as Med-PaLM-2.
  • Shows experimentally that pairing a weak model with RAG can be counterproductive, offering practical guidance for model selection in RAG systems.

Limitations & Future Work

  • Lack of clinical validation by medical experts: All evaluations are based on automated metrics, without evaluation by medical professionals regarding clinical accuracy and reliability of model outputs.
  • Insufficient handling of contradictions among retrieved documents: If retrieved documents contain conflicting information, the model lacks a mechanism for conflict resolution.
  • High computational overhead: Real-time retrieval increases inference latency, limiting its application in time-sensitive clinical scenarios.
  • High fine-tuning costs based on GPT-4o: Using closed-source commercial models limits reproducibility and incurs high deployment costs.
  • Evaluated only on English datasets: Cross-lingual biomedical QA remains unaddressed.
  • RAG degrades long-text QA on LiveQA: Adding RAG lowered ROUGE scores, indicating that retrieval in long-text generation scenarios requires more careful strategies.

Related Work

  • Extends the RAG framework (Lewis et al.), optimizing it specifically for the biomedical domain.
  • Directly compared with and outperforms BlendedRAG and BM25S in retrieval performance.
  • Compares favorably with specialized medical models like MEDITRON-70B and Med-PaLM-2, demonstrating the competitiveness of general-purpose model + fine-tuning + RAG.
  • Serves as a reference for RAG system designs in other vertical domains (e.g., law, finance).

Rating

  • Novelty: ⭐⭐⭐ Individual components (semantic search, fine-tuning, RAG) are not new; the innovation lies mostly at the system-integration level.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple task types and datasets, but lacks expert human evaluation.
  • Writing Quality: ⭐⭐⭐ Structure is complete, though the writing is redundant in places and the mathematical formulations are basic.
  • Value: ⭐⭐⭐⭐ Provides a complete solution and a comprehensive baseline comparison for biomedical QA, with high reference value for practical deployment.