MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA¶

Project	Content
Author	Seonok Kim
Institution	Mazelone, Seoul
Conference	ACL 2025
arXiv	2512.10996
Topic	Retrieval-Augmented Generation for Biomedical QA

TL;DR¶

MedBioRAG is a retrieval-augmented generation framework that combines semantic search, document retrieval, and fine-tuned LLMs. It is reported to outperform GPT-4o baselines and prior state-of-the-art systems on three types of biomedical tasks: text retrieval, closed-book QA, and long-text QA.

Background & Motivation¶

Domain Challenges: Biomedical QA demands extremely high factual accuracy. General-domain LLMs (such as GPT-4o) rely on static pre-training data and are prone to hallucinations and outdated information.
Limitations of Prior Work: Traditional keyword retrieval (BM25, TF-IDF) cannot handle synonyms (e.g., "heart attack" vs "myocardial infarction") and polysemy in medical terminology, leading to incomplete retrieval.
RAG Bottlenecks: Although retrieval-augmented generation can dynamically introduce external knowledge, its effectiveness is highly dependent on retrieval quality, document ranking, and the degree of model fine-tuning.
Core Motivation: To design an end-to-end biomedical QA framework that integrates semantic search (for improved retrieval accuracy) and fine-tuned LLMs (for improved generation quality).

Method¶

Overall Architecture¶

MedBioRAG consists of three core stages:

Hybrid Retrieval Module: Utilizes both lexical and semantic search, with semantic search playing the leading role.
LLM-Based Answer Generation: The fine-tuned LLM integrates the retrieved information into coherent answers.
Prompt Engineering and Content Filtering: Optimizes prompt structures to guide the model to generate factually accurate outputs.
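
One way a hybrid module could let semantic scores lead is a weighted blend of normalized scores, sketched below. The fusion rule, the function names, and the weight `alpha=0.7` are assumptions for illustration; this summary does not state how MedBioRAG actually combines the two signals.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(doc_ids, lexical_scores, semantic_scores, alpha=0.7):
    """Blend normalized scores; alpha > 0.5 lets semantic search lead.

    alpha is a hypothetical fusion weight, not a value from the paper.
    """
    lex = min_max(lexical_scores)
    sem = min_max(semantic_scores)
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(sem, lex)]
    # Sort document ids by fused score, highest first.
    return sorted(zip(doc_ids, fused), key=lambda pair: pair[1], reverse=True)


ranking = hybrid_rank(
    ["d1", "d2", "d3"],
    lexical_scores=[12.0, 3.0, 0.0],      # e.g. BM25 scores
    semantic_scores=[0.40, 0.85, 0.10],   # e.g. cosine similarities
)
print([doc for doc, _ in ranking])
```

Normalizing first matters because BM25 scores and cosine similarities live on different scales; without it the weight would not mean what it appears to mean.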

Retrieval Mechanism¶

Lexical Search: A classic term-frequency ranking method based on BM25, computing the matching score between documents and queries via IDF and TF.
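
For concreteness, here is a minimal Okapi BM25 scorer in plain Python. The parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions chosen for illustration, not settings reported for MedBioRAG.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are illustrative values, not taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (the "+1" variant keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "myocardial infarction treatment guidelines".split(),
    "heart attack symptoms and first aid".split(),
    "dietary fiber and cholesterol".split(),
]
scores = bm25_scores("heart attack".split(), docs)
print(scores)
```

The first document scores zero for the query "heart attack" even though it is about myocardial infarction, which is the synonym gap that motivates semantic search.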

Semantic Search: An encoder \(\phi\) maps the query \(Q\) and each document \(D_i\) to dense vectors \(v_Q = \phi(Q)\) and \(v_{D_i} = \phi(D_i)\), and semantic relevance is computed as their cosine similarity:

\[\text{Sim}(Q, D_i) = \frac{v_Q \cdot v_{D_i}}{\|v_Q\| \|v_{D_i}\|}\]

The retrieval system ranks according to similarity scores and selects the Top-K documents. The core advantage of semantic search lies in its ability to retrieve semantically relevant documents even without exact keyword matches.
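
The ranking step can be sketched in a few lines of plain Python. The three-dimensional vectors and document ids below are made-up stand-ins for the output of the encoder \(\phi\); a real system would use a trained embedding model and, most likely, a vector index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, similarity) pairs with the highest similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
doc_vecs = {
    "mi_guidelines": [0.9, 0.1, 0.0],
    "diet_fiber": [0.0, 0.2, 0.9],
    "cardiac_rehab": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for phi("heart attack")
results = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in results])
```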

Fine-Tuning and Generation¶

Supervised Fine-Tuning: Trained using (x, y) pairs, where x represents the query + retrieved document context, and y represents the expected answer, optimizing the standard language model loss.
Confidence Filtering: The model assigns a confidence score to each generated response; responses below the threshold are dropped or iteratively corrected.
Prompt Engineering: System prompts are tailored for closed-book QA (requiring only option letters), long-text QA (generating structured answers), and short-text QA (concise answers), using different max tokens, temperature, and top-p parameters.
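
The confidence-filtering step can be pictured as a small retry loop. In the sketch below, the `generate` callable, the `threshold` of 0.7, the `max_attempts` limit, and the feedback string are all hypothetical; the summary does not specify how confidence is computed or how correction is triggered.

```python
def answer_with_confidence_filter(generate, question, context,
                                  threshold=0.7, max_attempts=3):
    """Regenerate until confidence clears the threshold, else drop.

    `generate` is a hypothetical callable returning (answer, confidence);
    `threshold` and `max_attempts` are illustrative values.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        answer, confidence = generate(question, context, feedback)
        if confidence >= threshold:
            return {"answer": answer, "confidence": confidence, "attempts": attempt}
        # Ask the next attempt to revise the low-confidence draft.
        feedback = f"Previous draft scored {confidence:.2f}; revise using the context."
    return None  # dropped: no attempt reached the threshold


def fake_generate(question, context, feedback):
    """Stand-in for an LLM call: confidence rises once feedback is given."""
    if feedback is None:
        return "maybe", 0.40
    return "yes", 0.90


result = answer_with_confidence_filter(
    fake_generate, "Example yes/no question", "retrieved abstracts ...")
print(result)
```

Returning `None` for a dropped response keeps the decision about a fallback (for example, abstaining) with the caller.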

Experiments¶

Evaluation Settings¶

Retrieval Evaluation: NFCorpus and TREC-COVID, with metrics including NDCG@10, MRR@10, Precision@10, and MAP@10.
Closed-Book QA: MedQA, PubMedQA, BioASQ, evaluated by accuracy.
Long-Text QA: LiveQA, MedicationQA, PubMedQA, BioASQ, with metrics including ROUGE, BLEU, BERTScore, and BLEURT.
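
For readers unfamiliar with the retrieval metrics, the following is a per-query sketch of Precision@k, reciprocal rank (averaged over queries to give MRR), and NDCG@k. The function names and toy relevance labels are assumptions; benchmark toolkits may differ in details such as the gain function used for NDCG.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1/rank of the first relevant result in the top k (0 if none)."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k=10):
    """NDCG with graded gains and a log2 rank discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Toy query: graded relevance labels invented for illustration.
ranked = ["d3", "d1", "d7", "d2"]
gains = {"d1": 2, "d2": 1}
relevant = set(gains)

p_at_4 = precision_at_k(ranked, relevant, k=4)
rr_at_4 = reciprocal_rank_at_k(ranked, relevant, k=4)
ndcg_at_4 = ndcg_at_k(ranked, gains, k=4)
print(p_at_4, rr_at_4, round(ndcg_at_4, 4))
```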

Table 1: Closed-Book QA Performance Comparison¶

Method	MedQA	PubMedQA	BioASQ
GPT-3.5 + MedBioRAG	45.36	38.60	66.91
GPT-4 + MedBioRAG	78.79	72.81	97.79
GPT-4o	81.82	44.74	96.12
GPT-4o + MedBioRAG	86.86	66.67	97.06
GPT-4o-mini + MedBioRAG	70.71	76.32	97.06
Fine-Tuned GPT-4o	87.88	80.70	97.06
Fine-Tuned GPT-4o + MedBioRAG	89.47	85.00	98.32

Key Point: Fine-tuned GPT-4o + MedBioRAG achieves the best performance across all datasets, improving PubMedQA from the GPT-4o baseline of 44.74% to 85.00%, an increase of over 40 percentage points.
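
The gains over the plain GPT-4o baseline can be checked directly from the Table 1 figures. The snippet below only recomputes differences from the accuracies listed above; the dictionary layout is an illustrative choice.

```python
# Accuracy values (percent) copied from Table 1.
table1 = {
    "GPT-4o": {"MedQA": 81.82, "PubMedQA": 44.74, "BioASQ": 96.12},
    "GPT-4o + MedBioRAG": {"MedQA": 86.86, "PubMedQA": 66.67, "BioASQ": 97.06},
    "Fine-Tuned GPT-4o": {"MedQA": 87.88, "PubMedQA": 80.70, "BioASQ": 97.06},
    "Fine-Tuned GPT-4o + MedBioRAG": {"MedQA": 89.47, "PubMedQA": 85.00, "BioASQ": 98.32},
}

baseline = table1["GPT-4o"]
# Absolute gain in percentage points over the GPT-4o baseline.
gains = {
    method: {ds: round(acc - baseline[ds], 2) for ds, acc in row.items()}
    for method, row in table1.items()
    if method != "GPT-4o"
}

for method, row in gains.items():
    print(method, row)
```

On PubMedQA, retrieval alone and fine-tuning alone each account for a large share of the improvement, and the combined system adds a few more points on top of fine-tuning.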

Table 2: Retrieval Performance Comparison (Semantic vs Lexical Search)¶

Metric	NFCorpus Lexical	NFCorpus Semantic	TREC-COVID Lexical	TREC-COVID Semantic
NDCG@10	31.34	37.91	48.35	61.02
MRR@10	51.63	64.29	82.50	89.17
Precision@10	23.04	27.88	49.60	64.20
MAP@10	46.01	56.15	72.31	82.19

Key Point: Semantic search outperforms lexical search across all metrics, with NDCG@10 improving by approximately 6.6 points on NFCorpus and 12.7 points on TREC-COVID.

Highlights & Insights¶

Systematic and Comprehensive Evaluation: Covers three types of tasks (retrieval, closed-book QA, long-text QA) using up to 7 evaluation metrics, demonstrating a very thorough experimental design.
Clear Advantage of Semantic Search: Semantic search beats lexical search on every reported retrieval metric for both NFCorpus and TREC-COVID (Table 2).
Synergy of Fine-Tuning + RAG: In Table 1, the combination outperforms both fine-tuning alone and RAG alone on all three datasets (e.g., 85.00 vs. 80.70 and 66.67 on PubMedQA), offering a practical recipe for biomedical AI applications.
Trade-off Analysis of Top-K Retrieval: Reveals that retrieving more documents is not always better; excessive documents introduce noise and conflicting information, which degrades performance.

Limitations & Future Work¶

Lack of Medical Specialist Validation: Model outputs have not been reviewed by clinicians, making it impossible to confirm alignment with expert reasoning.
Insufficient Handling of Retrieval Contradictions: The model lacks an efficient conflict-resolution mechanism when retrieved documents contain factual contradictions.
High Computational Overhead: Real-time retrieval increases inference latency, limiting its application in time-sensitive scenarios (e.g., emergency decision-making).
Limited Domain Generalization: Performance in specific clinical sub-domains (e.g., clinical diagnosis, electronic health records) remains to be verified.
Limited Baseline Models: Primarily based on the GPT series, lacking in-depth comparison with open-source biomedical models (such as MEDITRON-70B, BioMistral).

Related Work¶

Biomedical LLMs: Med-PaLM 2, BioGPT, MEDITRON-70B, BiomedGPT, etc., enhance biomedical reasoning capabilities through domain-specific fine-tuning.
RAG Frameworks: Hybrid retrieval strategies like BlendedRAG combine lexical and semantic search; LLM4IR explores the application of LLMs in information retrieval.
Embedding Models: Semantic retrieval methods based on pre-trained embeddings such as SGPT provide the foundation for biomedical semantic search.
Domain Fine-Tuning: Prompt optimization methods such as Medprompt and supervised fine-tuning strategies enhance the domain adaptation capabilities of LLMs.

Rating¶

Dimension	Score (1-5)	Description
Novelty	2	The combination of semantic search + RAG + fine-tuning is relatively conventional and does not introduce novel technical contributions.
Experimental Thoroughness	4	Covers three types of tasks, multiple benchmark datasets, and various metrics, with a comprehensive design.
Writing Quality	3	Clearly structured, but the descriptions are somewhat verbose and partly redundant.
Value	3	Provides a reusable reference RAG framework for biomedical QA, but relies heavily on closed-source models.
Overall Score	3.0	An engineering integration work with a solid experimental design but lacking in methodological novelty.