Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

Conference: ACL 2026 · arXiv: 2604.10937 · Code: GitHub · Area: Medical NLP (Text Retrieval) · Keywords: Medical text retrieval, asymmetric encoders, Chinese medical benchmark, embedding models, RAG

TL;DR

This paper proposes CMedTEB (Chinese Medical Text Embedding Benchmark) and CARE (asymmetric retrieval framework). CMedTEB constructs a high-quality Chinese medical retrieval/reranking/STS benchmark via multi-LLM voting with expert validation, while CARE adopts an asymmetric architecture that encodes queries with a lightweight BERT and documents with a large LLM. Through a two-stage progressive alignment strategy, CARE achieves LLM-level retrieval accuracy at BERT-level online latency.

Background & Motivation

Background: Text embedding models are fundamental infrastructure in NLP, playing a particularly critical role in RAG systems. Recent LLM-based embedding models (e.g., Qwen3-Embedding, NV-Embed) have demonstrated strong performance on general benchmarks, yet Chinese medical text embedding remains underexplored.

Limitations of Prior Work: (1) Poor benchmark quality: Existing Chinese medical retrieval benchmarks (CmedqaRetrieval, MedicalRetrieval) suffer from severe false-negative issues—the "topic density" of the medical domain causes numerous semantically relevant but unannotated documents to be incorrectly labeled as irrelevant (averaging 9–19 false negatives per query). (2) Accuracy-efficiency trade-off: LLM-based embedding models achieve high accuracy but incur substantial latency, rendering them impractical for latency-sensitive scenarios such as real-time medical Q&A; BERT-style models offer low latency but insufficient accuracy.

Key Challenge: High accuracy demands large models, while real-time scenarios demand low latency—an apparently irreconcilable trade-off between accuracy and efficiency.

Goal: (1) Construct a high-quality Chinese medical embedding benchmark; (2) Design a retrieval framework that breaks the accuracy-latency trade-off.

Key Insight: In retrieval, query encoding happens online (requiring low latency), whereas document encoding can be precomputed offline (permitting large models). CARE exploits this natural asymmetry by assigning different-sized encoders to queries and documents.

Core Idea: A lightweight BERT encodes online queries while a large LLM encodes offline documents. A two-stage progressive alignment strategy—first freezing the document encoder to align the query encoder, then jointly fine-tuning both—bridges the semantic gap between heterogeneous encoders.

Method

Overall Architecture

CMedTEB benchmark: a multi-LLM consensus annotation pipeline for constructing retrieval, reranking, and STS tasks. CARE framework: initialize two encoders → Stage I: freeze the document encoder and align the query encoder via unsupervised self-contrastive learning → Stage II: unfreeze both encoders for joint fine-tuning. At inference, queries are processed by the BERT encoder (0.3B), while document embeddings are precomputed by the LLM.
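To make the deployment pattern concrete, here is a minimal sketch of the asymmetric inference flow. The encoder stubs, toy corpus, and shared 768-d space are illustrative assumptions standing in for the actual 0.3B BERT query encoder and 4B/8B LLM document encoder:

```python
import numpy as np

def _stub_encoder(tag: str, text: str, dim: int = 768) -> np.ndarray:
    # Deterministic stand-in for a real encoder: a production system would
    # load the 0.3B BERT (queries) or the 4B/8B LLM (documents) here.
    seed = abs(hash((tag, text))) % (2**32)
    vec = np.random.default_rng(seed).normal(size=dim)
    return vec / np.linalg.norm(vec)

def encode_query_small(text: str) -> np.ndarray:
    return _stub_encoder("query", text)

def encode_doc_large(text: str) -> np.ndarray:
    return _stub_encoder("doc", text)

# Offline: document embeddings are precomputed once with the large encoder.
corpus = ["doc about diabetes", "doc about hypertension", "doc about flu"]
doc_matrix = np.stack([encode_doc_large(d) for d in corpus])  # (N, 768)

# Online: only the lightweight query encoder runs at request time.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = encode_query_small(query)
    scores = doc_matrix @ q                 # cosine similarity (unit vectors)
    top_idx = np.argsort(-scores)[:k]
    return [corpus[i] for i in top_idx]

print(retrieve("glucose control"))
```

The key point is that `doc_matrix` is built once offline, so only the small query encoder contributes to per-request latency.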

Key Designs

  1. CMedTEB Benchmark Construction (Multi-LLM Consensus + Expert Validation):

    • Function: Provides a high-fidelity evaluation standard for Chinese medical text embedding.
    • Mechanism: DeepSeek-V3, Doubao-1.5-Pro, and GPT-4o independently score query-document pairs on a 5-point scale; a pair is retained as a positive sample only when all three models unanimously agree. Experts independently re-annotate 5,000 pairs, achieving a 93.3% agreement rate. Fleiss' Kappa = 0.731 confirms annotation reliability.
    • Design Motivation: Single-LLM annotation (e.g., CMIRB using only ChatGPT) cannot guarantee quality; multi-model consensus combined with expert validation provides a more reliable gold standard.
  2. Two-Stage Asymmetric Alignment Strategy:

    • Function: Bridges the semantic gap between the lightweight query encoder and the large document encoder.
    • Mechanism: Stage I (query encoder alignment): The document encoder is frozen, and a "self-contrastive" strategy aligns the query encoder—embeddings of the same text produced by the two encoders serve as mutual positives. Loss = Asym-InfoNCE (soft ranking alignment) + MSE (hard structural alignment). Stage II (joint fine-tuning): Both encoders are unfrozen, and Asym-InfoNCE on query-document pairs is used for end-to-end optimization of retrieval boundaries.
    • Design Motivation: Jointly training heterogeneous encoders from the start leads to unstable convergence. The progressive strategy first establishes a spatial mapping foundation (Stage I, using unlabeled data) and then optimizes task performance (Stage II, using annotated data).
  3. Medical Domain Training Data Construction (Diversity-Aware Deduplication + False-Negative Cleaning):

    • Function: Addresses the false-negative problem caused by topic density in the medical domain.
    • Mechanism: A vector index is initialized with 5,000 seed samples; new candidates are discarded if their similarity to existing samples exceeds a threshold (ensuring diversity). GPT-4o then verifies the top-50 retrieved results to distinguish true hard negatives from false negatives. This process ultimately yields 500K high-quality triplets.
    • Design Motivation: Standard hard-negative mining fails in the medical domain—because a large number of semantically related documents remain unannotated, mined "negatives" are in fact positives.
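Stage I's combined objective from design 2 can be sketched in a few lines of numpy. The temperature, MSE weight, and equal weighting of the two terms are assumptions for illustration, not the paper's exact hyperparameters:

```python
import numpy as np

def stage1_alignment_loss(q_emb, d_emb, temperature=0.05, mse_weight=1.0):
    # Row i of q_emb and d_emb embed the SAME text, produced by the small
    # query encoder and the frozen large document encoder respectively
    # (self-contrastive positives).
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    # Asym-InfoNCE (soft ranking alignment): the diagonal pair should win
    # each row's softmax against the in-batch negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -np.mean(np.diag(log_probs))
    # MSE (hard structural alignment) pulls the two spaces together pointwise.
    mse = np.mean((q - d) ** 2)
    return info_nce + mse_weight * mse

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 64))                        # frozen LLM embeddings
aligned = stage1_alignment_loss(docs, docs)            # identical encoders
misaligned = stage1_alignment_loss(rng.normal(size=(8, 64)), docs)
```

A perfectly aligned query encoder drives both terms toward zero, while an unaligned one pays both the ranking and the structural penalty.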
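The diversity-aware deduplication step from design 3 can likewise be sketched as a greedy similarity filter. The brute-force scan and the 0.95 threshold are stand-ins for the paper's seeded vector index and its actual threshold:

```python
import numpy as np

def diversity_filter(candidates: np.ndarray, threshold: float = 0.95) -> list[int]:
    # Greedily keep a candidate only if its max cosine similarity to every
    # previously kept sample stays below the threshold. A production system
    # would query a vector index seeded with 5,000 samples instead of this
    # brute-force scan.
    unit = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(unit):
        if not kept or (unit[kept] @ vec).max() < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 32))                       # 5 diverse samples
near_dupes = base + 0.01 * rng.normal(size=(5, 32))   # near-duplicates
pool = np.vstack([base, near_dupes])
kept = diversity_filter(pool)                          # dupes are discarded
```

The subsequent GPT-4o verification pass (true hard negative vs. false negative) would run on whatever survives this filter.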

Key Experimental Results

Main Results (CMedTEB Comprehensive Scores)

| Model | Params (Q/D) | Retrieval nDCG@10 | Rerank MAP@10 | STS Pearson | Avg |
|---|---|---|---|---|---|
| bge-large-zh-v1.5 | 326M / 326M | 50.32 | 67.55 | 78.95 | 73.04 |
| Conan-v1 | 326M / 326M | 52.75 | 69.31 | 81.49 | 76.44 |
| gte-Qwen2-1.5B | 1.78B / 1.78B | 55.39 | 72.35 | 85.50 | 77.61 |
| CARE-0.3B-4B | 305M / 4.02B | 55.91 | 72.84 | 88.53 | 78.13 |
| CARE-0.3B-8B | 305M / 8.19B | 56.75 | 73.67 | 87.07 | 78.94 |

Ablation Study (Asymmetric vs. Symmetric vs. Other Efficient Methods)

| Method | Type | Retrieval | Rerank | Avg |
|---|---|---|---|---|
| KALE | Asymmetric | 42.67 | 67.42 | 55.05 |
| ScalingNote | Asymmetric | 34.81 | 64.17 | 49.49 |
| CARE-0.3B-4B | Asymmetric | 55.91 | 72.84 | 64.38 |
| Med-Emb-8B | Symmetric | 56.42 | 74.84 | 65.63 |

Key Findings

  • CARE breaks the accuracy-latency trade-off: CARE-0.3B-8B trails the fully symmetric 8B model by only 0.6% in accuracy while reducing online inference parameter count by 27×.
  • CMedTEB is substantially more challenging than existing benchmarks: General models average 85.15 on CMedQA but only 57.85 on the new CMedTEB tasks.
  • The two-stage training strategy substantially outperforms other asymmetric methods: CARE exceeds KALE by 9.33 pp and ScalingNote by 14.89 pp.
  • Scaling the document encoder yields continuous performance gains without increasing online cost: Expanding from 4B to 8B improves the average score by 0.81.
  • False-negative issues in existing benchmarks are severe: 92% of LLM-annotated false negatives were confirmed by manual verification.

Highlights & Insights

  • The asymmetric architecture elegantly exploits the natural asymmetry of the retrieval task: queries must be encoded online, while documents can be encoded offline ahead of time. This paradigm transfers to any query-document matching scenario.
  • Self-contrastive alignment (treating embeddings of the same text from the two encoders as mutual positives) is an elegant unsupervised solution that establishes cross-model spatial mappings without additional annotation.
  • The CMedTEB construction methodology (multi-LLM consensus + expert validation + false-negative analysis) provides a reusable paradigm for domain-specific benchmark construction.

Limitations & Future Work

  • The document encoder requires offline precomputation, making the approach less suitable for scenarios with frequent document updates (e.g., real-time news retrieval).
  • Stage I's MRL (Matryoshka Representation Learning) truncates high-dimensional LLM embeddings to 768 dimensions, which may incur information loss.
  • CMedTEB covers Chinese only; cross-lingual medical retrieval is not addressed.
  • Validation is conducted solely in the medical domain; generalizability to other specialized domains such as law and finance remains to be confirmed.
  • Online distillation or progressive knowledge transfer could be explored to further close the gap of the query encoder.

Comparison with Related Work

  • vs. KALE/ScalingNote: These methods also pursue asymmetric retrieval but employ simpler alignment strategies (layer pruning or direct training); the proposed two-stage progressive alignment is markedly more effective.
  • vs. symmetric LLM embeddings: Models such as Qwen3-Embedding lead in accuracy but incur 10×+ latency; CARE nearly matches their accuracy while maintaining BERT-level latency.
  • vs. CMIRB benchmark: CMIRB relies on single-LLM annotation and covers only retrieval; CMedTEB offers broader coverage with multi-LLM consensus and three task types.

Rating

  • Novelty: ⭐⭐⭐⭐ The asymmetric architecture is not new, but the two-stage self-contrastive alignment strategy is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of benchmark construction, model evaluation, ablation studies, and efficiency analysis, with expert validation for benchmark quality.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure; figures and tables effectively convey core information.
  • Value: ⭐⭐⭐⭐⭐ Full open-sourcing of benchmark, model, code, and data provides a direct contribution to Chinese medical NLP.