Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?¶
Conference: ACL 2026 arXiv: 2604.19394 Code: None Area: Medical NLP Keywords: Continual pre-training, domain adaptation, German medical LLM, model merging, data filtering
TL;DR¶
This paper constructs a high-quality German medical corpus, FineMed-de (7.3 million documents / 5.1 billion tokens filtered from FineWeb2), applies continual pre-training and SLERP model merging to three LLMs (7B–24B), and creates the DeFineMed model family. The results demonstrate that a domain-specialized 7B model can substantially narrow the performance gap with a 24B general-purpose model on German medical tasks, improving the win rate by approximately 3.5×.
Background & Motivation¶
Background: LLMs have shown transformative potential in healthcare, yet integrating them into clinical workflows remains challenging. General-purpose models typically fail to capture domain-specific knowledge and terminology with sufficient accuracy.
Limitations of Prior Work: (1) Strict data protection regulations mandate on-premise deployment, rendering large-scale API services infeasible and favoring smaller models; (2) smaller models lack domain-specific data coverage, making it difficult to handle complex medical terminology; (3) high-quality medical data in non-English languages—German in particular—is scarce.
Key Challenge: Regulatory constraints require the use of smaller models, yet smaller models need targeted domain knowledge to reach clinically acceptable performance—creating a critical compliance-versus-performance trade-off.
Goal: To achieve domain adaptation via continual pre-training and model merging, enabling a 7B model to compete with a 24B general-purpose model on complex medical tasks.
Key Insight: A complete methodology, spanning data filtering through model adaptation, combines LLM-assisted annotation with classical ML classifiers to enable scalable data curation.
Core Idea: High-quality domain data + continual pre-training + model merging can make resource-efficient small models competitive solutions for complex medical tasks.
Method¶
Overall Architecture¶
The approach consists of two major components: (1) a medical filtering pipeline, in which Mixtral performs zero-shot annotation on a sample of the German subset of FineWeb2 and an XLM-RoBERTa classifier trained on these annotations scales the labeling to the full dataset, yielding the FineMed-de corpus; (2) model adaptation, in which instruction-tuned models are continually pre-trained on FineMed-de and then merged with the original instruction-tuned checkpoints via SLERP to restore instruction-following capability.
Key Designs¶
- Hybrid Medical Document Filtering Pipeline (see the classifier sketch after this list)
  - Function: Efficiently extract high-quality medical documents from general-purpose web corpora.
  - Mechanism: (a) Sample 260,000 documents from the German subset of FineWeb2; (b) apply Mixtral-8x7B for zero-shot classification into medical/non-medical categories (human-validated F1 = 91.1%); (c) fine-tune an XLM-RoBERTa (279M) classifier on the annotated data (precision 0.95, recall 0.80); (d) apply the classifier to the full 428 million documents, extracting 7.3 million medical documents (5.1 billion tokens).
  - Design Motivation: LLMs provide high-quality annotations but are costly at scale, while classical ML classifiers offer scalability; the hybrid approach balances quality and efficiency.
- Continual Pre-training + SLERP Model Merging (see the SLERP merge sketch after this list)
  - Function: Inject domain knowledge while preserving instruction-following capability.
  - Mechanism: Instruction-tuned models are continually pre-trained on FineMed-de for 2 epochs (using FSDP, Flash Attention, and mixed precision), then merged with the original instruction-tuned checkpoints via layer-wise SLERP interpolation. Three base models are used: Qwen2.5-7B, Mistral-7B, and Mistral-Small-24B.
  - Design Motivation: Continual pre-training can cause catastrophic forgetting and degrade instruction-following ability; model merging offers an efficient way to restore these capabilities without additional fine-tuning.
- Multi-dimensional Evaluation Design
  - Function: Comprehensively assess the effects and trade-offs of domain adaptation.
  - Mechanism: (a) Knowledge-intensive benchmarks (MMLU-de medical subset + MedQA-de) evaluate medical knowledge; (b) pairwise win-rate analysis evaluates complex medical instruction following; (c) failure mode analysis (language mixing, verbosity) evaluates side effects.
  - Design Motivation: A single benchmark may obscure the true performance gap; multi-dimensional evaluation reveals the complete picture of domain adaptation.
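The classifier stage of the filtering pipeline is straightforward to reproduce. Below is a minimal sketch of steps (c) and (d), assuming Hugging Face transformers and datasets; the checkpoint name is a real public model, but the file name, label field, decision threshold, and hyperparameters are illustrative assumptions rather than values reported in the paper.

```python
# Hypothetical sketch of steps (c)-(d): fine-tune an XLM-RoBERTa classifier on the
# Mixtral-annotated sample, then use it to score documents at corpus scale.
# File names, the label field, and all hyperparameters are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL = "FacebookAI/xlm-roberta-base"  # ~279M parameters, matching the size quoted above

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Sampled FineWeb2 documents with medical / non-medical labels from the zero-shot LLM step.
annotated = load_dataset("json", data_files="mixtral_annotated_sample.jsonl")["train"]

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [1 if label == "medical" else 0 for label in batch["label"]]
    return enc

annotated = annotated.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medical-filter", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=annotated,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()

@torch.no_grad()
def medical_probability(texts):
    """Score a batch of documents; thresholding this probability trades precision for recall."""
    model.eval()
    enc = tokenizer(texts, truncation=True, max_length=512, padding=True,
                    return_tensors="pt").to(model.device)
    return model(**enc).logits.softmax(dim=-1)[:, 1]
```

A threshold chosen on a held-out validation split would then be applied to all 428 million documents, reproducing the precision/recall trade-off reported above.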
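SLERP merging itself is only a few lines: each parameter tensor of the continually pre-trained model and its instruction-tuned counterpart is interpolated along the great circle between them rather than linearly. The sketch below is a minimal illustration; the interpolation factor t = 0.5 and the per-tensor flattening are assumptions, since this summary does not give the paper's exact per-layer schedule.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    # Angle between the two weight vectors.
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        merged = (1 - t) * v0 + t * v1
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) / sin_omega) * v0 \
               + (torch.sin(t * omega) / sin_omega) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

def merge_state_dicts(sd_cpt: dict, sd_instruct: dict, t: float = 0.5) -> dict:
    """Layer-wise SLERP: interpolate every shared parameter of the two checkpoints."""
    return {name: slerp(sd_cpt[name], sd_instruct[name], t) for name in sd_cpt}
```

In practice, merges like this are usually run with a tool such as mergekit rather than hand-rolled code; the sketch only makes the interpolation explicit.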
Loss & Training¶
Continual pre-training employs the standard language modeling objective (next-token prediction) with the AdamW optimizer, linear learning rate decay, and 500 warm-up steps. Training efficiency is optimized via FSDP, Flash Attention, activation checkpointing, and sequence packing.
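This recipe maps onto a fairly standard causal-LM continual pre-training setup. The sketch below, built on the Hugging Face Trainer, is a hedged illustration: the 2 epochs, 500 warm-up steps, linear decay, FSDP, Flash Attention, and activation checkpointing come from the summary, while the base checkpoint, dataset path, batch size, and learning rate are placeholders, and sequence packing is omitted for brevity.

```python
# Hypothetical continual pre-training setup; launch with torchrun/accelerate so FSDP
# actually shards across GPUs. Requires flash-attn for attn_implementation="flash_attention_2".
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-7B-Instruct"  # one of the three instruction-tuned starting points

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
)
model.gradient_checkpointing_enable()  # activation checkpointing

corpus = load_dataset("json", data_files="finemed_de.jsonl")["train"]  # hypothetical local dump

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

args = TrainingArguments(
    output_dir="definemed-cpt",
    num_train_epochs=2,                 # 2 epochs over FineMed-de, per the summary
    per_device_train_batch_size=4,      # placeholder
    learning_rate=1e-5,                 # placeholder
    lr_scheduler_type="linear",
    warmup_steps=500,
    bf16=True,
    fsdp="full_shard auto_wrap",        # FSDP sharding
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # next-token prediction
)
trainer.train()
```

AdamW is the Trainer's default optimizer, so it does not need to be set explicitly.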
Key Experimental Results¶
Main Results¶
Average Accuracy on German Medical Benchmarks (MMLU-de medical subset and MedQA-de)
| Model | Average Accuracy |
|---|---|
| BioMistral-7B (baseline) | 43.55 |
| BioMistral-7B-SLERP | 48.22 |
| Mistral-7B-Instruct | 49.73 |
| DeFineMed-Mistral-7B-SLERP | 56.46 |
| Qwen2.5-7B-Instruct | 59.08 |
| DeFineMed-Qwen2.5-7B | 64.91 |
Ablation Study¶
- The Qwen2.5-based DeFineMed 7B model achieves approximately a 3.5× improvement in pairwise win rate against Mistral-Small-24B-Instruct.
- SLERP model merging successfully restores instruction-following capability, but introduces side effects such as language mixing (German–English code-switching) and increased verbosity.
- Relative to their instruction-tuned baselines, domain adaptation yields a larger absolute gain for the Mistral-based model (+6.73) than for the Qwen2.5-based model (+5.83), though both improvements are substantial.
Key Findings¶
- Continual pre-training combined with model merging enables 7B models to approach and compete with 24B models on German medical tasks.
- Data quality matters more than data scale—a carefully filtered corpus of 5.1 billion tokens is sufficient to yield significant improvements.
- Model merging is effective at restoring instruction-following capability, but inherent trade-offs such as language mixing remain.
- The choice of base model has a substantial impact on final performance (Qwen2.5 > Mistral).
Highlights & Insights¶
- The hybrid data filtering pipeline (LLM annotation + ML classifier) is practical and transferable to other domains and languages.
- The finding that a 7B model can compete with a 24B model has significant practical implications for resource-constrained clinical settings.
- The failure mode analysis (language mixing, verbosity) provides an honest assessment of adaptation trade-offs.
- The methodology is directly generalizable to medical LLM development in other non-English languages.
Limitations & Future Work¶
- The approach targets German only and has not been extended to other languages.
- Language mixing and verbosity issues require targeted subsequent fine-tuning to resolve.
- The optimal ordering of continual pre-training and instruction fine-tuning remains an open question.
- Model utility has not been validated in real clinical settings.
Related Work & Insights¶
- Compared to BioMistral, this work focuses not only on benchmark score improvements but also on the competitiveness of small models against larger ones.
- Apollo-2 follows an instruction fine-tuning route, whereas this work adopts a continual pre-training route—the two approaches are complementary.
- The effectiveness of SLERP in the medical domain receives further empirical support.
Rating¶
- Novelty: ⭐⭐⭐ — Individual methodological components are well-established, but their combined application to the German medical setting is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation is comprehensive across three dimensions: multiple benchmarks, win-rate analysis, and failure mode analysis.
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear and experimental design is sound.