# BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

- **Conference:** ACL 2026
- **arXiv:** 2604.15591
- **Code:** https://github.com/MengfeiLan/BioHiCL
- **Area:** Medical Imaging
- **Keywords:** Biomedical retrieval, MeSH hierarchy, contrastive learning, multi-label, parameter-efficient fine-tuning
## TL;DR
BioHiCL leverages the hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space via depth-weighted label similarity, a 0.1B model surpasses most specialized models on biomedical retrieval, sentence similarity, and question answering tasks.
## Background & Motivation
- **Background:** General-domain dense retrievers (e.g., BGE, E5) perform well on general IR benchmarks but fail to capture biomedical-specific terminology and semantic relationships. Specialized biomedical retrieval models (e.g., MedCPT, BMRetriever) improve semantic alignment through large-scale contrastive learning.
- **Limitations of Prior Work:** Existing biomedical retrieval models rely on coarse-grained relevance signals—either binary annotations (relevant/irrelevant) or query-article click data. Such coarse signals cannot capture the complex partial semantic overlap in biomedical texts (e.g., two articles labeled "irrelevant" may still share a common parent concept in the disease hierarchy).
- **Key Challenge:** Semantic relationships among biomedical texts are graded and hierarchical, yet training signals are binary—learning graded semantic relationships from binary signals limits retrieval precision.
- **Goal:** Design a method that leverages the MeSH hierarchical structure to provide fine-grained, graded supervision signals for adapting general-purpose retrievers to the biomedical domain.
- **Key Insight:** MeSH provides naturally multi-faceted supervision—each article is annotated with multiple MeSH labels, the labels themselves form a hierarchical tree, and the degree and depth of label overlap can quantify semantic similarity.
- **Core Idea:** Align the similarity in embedding space with that in the MeSH depth-weighted label space, replacing binary contrastive learning with hierarchical multi-label contrastive learning.
## Method

### Overall Architecture
Built upon the general-domain dense retriever BGE, the model is fine-tuned with LoRA on 80,000 MeSH-annotated abstracts from BioASQ. The training objective consists of two components: (1) a regression loss \(\mathcal{L}_{\text{mse}}\) that fits embedding similarity to label similarity; and (2) a hierarchical contrastive loss \(\mathcal{L}_{\text{con}}\) that pulls semantically related documents closer and pushes unrelated ones apart in embedding space.
### Key Designs
- **Depth-Weighted Hierarchical Label Representation**
  - Function: Encodes the MeSH hierarchical structure as computable label similarity.
  - Mechanism: The MeSH label set of each abstract is expanded to include all ancestor nodes, forming a complete path \(m_i^{\text{hier}}\), encoded as a multi-hot vector \(y_i \in \{0,1\}^C\). Each MeSH concept \(c_j\) is assigned a depth weight \(w_j = \log(d(c_j)+1)\), such that deeper (more specific) concepts receive higher weights. The label similarity between two documents is defined as the cosine similarity of their weighted multi-hot vectors: \(\text{SimL}(k_p, k_q) = \cos(y_p \odot \mathbf{w}, y_q \odot \mathbf{w})\).
  - Design Motivation: Shallow MeSH labels (e.g., "Disease") carry little discriminative value when matched, whereas deep labels (e.g., "Intracranial Hemorrhage") indicate genuine semantic relatedness. Depth weighting focuses the model on meaningful fine-grained matches.
- **Hierarchical Multi-Label Contrastive Loss**
  - Function: Prevents embedding collapse and maintains discriminative structure.
  - Mechanism: Positive pairs are document pairs with label similarity \(\text{SimL} > \beta\); negative pairs are those with no label overlap (\(\text{SimL}=0\)). Positive pairs are weighted by their label similarity in the contrastive loss: \(\mathcal{L}_{\text{con}} = -\mathbb{E}_i \log\frac{\text{SimL}(k_i, k_i^+) \cdot \exp(\text{SimE}(k_i, k_i^+))}{\sum_{k_j^-} \exp(\text{SimE}(k_i, k_j^-))}\). The threshold \(\beta\) filters weakly associated pairs to reduce noisy supervision.
  - Design Motivation: The regression loss alone may cause all embeddings to collapse to a single point; the contrastive loss maintains discriminability by pushing unrelated documents apart. Weighting by label similarity ensures that more semantically related positive pairs contribute larger gradients.
- **LoRA Parameter-Efficient Fine-Tuning**
  - Function: Adapts a general-domain retriever to the biomedical domain at low cost.
  - Mechanism: The original BGE parameters are frozen, and low-rank adapters \(W_{\text{adapted}}^{(l)} = W^{(l)} + B^{(l)} A^{(l)}\) are injected, training only 0.3% of the parameters.
  - Design Motivation: Full-parameter fine-tuning is prone to overfitting on small datasets and incurs high computational cost; LoRA achieves domain adaptation while preserving general language understanding.
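The depth-weighted label similarity can be sketched in a few lines of NumPy. The concept names, depths, and label sets below are an invented toy fragment for illustration, not the real MeSH tree maintained by NLM:

```python
import math
import numpy as np

# Toy MeSH fragment (invented for illustration): concept -> depth in the tree.
DEPTHS = {"Disease": 1, "Cardiovascular Diseases": 2, "Hemorrhage": 2,
          "Intracranial Hemorrhage": 3, "Stroke": 3}
CONCEPTS = sorted(DEPTHS)            # fixed label order for multi-hot vectors

# Depth weights w_j = log(d(c_j) + 1): deeper, more specific concepts count more.
W = np.array([math.log(DEPTHS[c] + 1) for c in CONCEPTS])

def multi_hot(labels):
    """Multi-hot vector y_i over the (ancestor-expanded) concept set."""
    return np.array([1.0 if c in labels else 0.0 for c in CONCEPTS])

def sim_l(labels_p, labels_q):
    """SimL(k_p, k_q) = cos(y_p * w, y_q * w)."""
    a, b = multi_hot(labels_p) * W, multi_hot(labels_q) * W
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Ancestor-expanded label sets: sharing the deep "Intracranial Hemorrhage"
# path scores far higher than sharing only the shallow "Disease" root.
doc_a = {"Disease", "Hemorrhage", "Intracranial Hemorrhage"}
doc_b = {"Disease", "Hemorrhage", "Intracranial Hemorrhage", "Stroke"}
doc_c = {"Disease", "Cardiovascular Diseases"}
print(round(sim_l(doc_a, doc_b), 3), round(sim_l(doc_a, doc_c), 3))
```

Note how the deep-overlap pair scores well above the shallow-overlap pair even though both pairs share the "Disease" root, which is exactly the graded signal the binary relevant/irrelevant setup cannot express.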
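A minimal sketch of the hierarchical contrastive term, assuming cosine similarity for \(\text{SimE}\) and in-batch negatives (the paper's exact batching and normalization may differ):

```python
import numpy as np

def hier_contrastive_loss(emb, sim_l, beta=0.3):
    """Hierarchical multi-label contrastive loss (sketch).

    emb:   (n, d) document embeddings
    sim_l: (n, n) depth-weighted MeSH label similarities
    Positives: SimL > beta; negatives: SimL == 0, as in the paper.
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim_e = e @ e.T                      # SimE: cosine in embedding space
    n, terms = len(emb), []
    for i in range(n):
        pos = [j for j in range(n) if j != i and sim_l[i, j] > beta]
        neg = [j for j in range(n) if j != i and sim_l[i, j] == 0.0]
        if not pos or not neg:
            continue                     # a term needs both pair types
        denom = np.sum(np.exp(sim_e[i, neg]))
        # each positive is weighted by its label similarity SimL(k_i, k_i^+)
        terms += [-np.log(sim_l[i, p] * np.exp(sim_e[i, p]) / denom)
                  for p in pos]
    return float(np.mean(terms))

# Toy batch: docs 0 and 1 share deep labels (SimL = 0.8); doc 2 is unrelated.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sl = np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss = hier_contrastive_loss(emb, sl)
```

Because the denominator sums only over negatives, individual terms can be negative; what matters for training is the gradient pulling weighted positives together and pushing zero-overlap negatives apart.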
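The LoRA update itself can be illustrated with plain NumPy; `d` and `r` are toy sizes, so the 0.3% trainable fraction of the real model is not reproduced at this scale:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                              # toy hidden size and LoRA rank

W = rng.standard_normal((d, d))          # frozen pretrained weight W^(l)
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection A^(l)
B = np.zeros((d, r))                     # trainable up-projection B^(l), init 0

def adapted_forward(x):
    """Forward pass through W_adapted = W + B @ A; only A, B are trained."""
    return x @ (W + B @ A).T

x = rng.standard_normal((1, d))
# With B initialised to zero the adapter starts as a no-op, so the
# fine-tuned model initially matches the frozen base model exactly.
assert np.allclose(adapted_forward(x), x @ W.T)

# Trainable fraction at this toy size: 2*d*r / d*d = 50%; the paper's
# 0.3% arises at BGE scale, where d is far larger relative to r.
print(f"trainable fraction: {2 * d * r / (d * d):.2%}")
```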
### Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{\text{mse}} + \lambda \mathcal{L}_{\text{con}}\), with \(\lambda=0.1\) and \(\beta=0.3\). Training is conducted on 80,000 abstracts from BioASQ v2022; the optimal checkpoint is selected on the TREC-CT 2022 validation set. Training and inference can be completed on a single A100 40GB GPU.
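Putting the two terms together, a sketch of the total objective \(\mathcal{L} = \mathcal{L}_{\text{mse}} + \lambda \mathcal{L}_{\text{con}}\), again assuming cosine \(\text{SimE}\) over in-batch pairs (a hypothetical simplification of the actual training loop):

```python
import numpy as np

def training_loss(emb, sim_l, lam=0.1, beta=0.3):
    """Total objective L = L_mse + lam * L_con (sketch with the paper's
    lam=0.1 and beta=0.3; SimE is cosine similarity over in-batch pairs)."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim_e = e @ e.T
    n = len(emb)
    off = ~np.eye(n, dtype=bool)
    # regression term: fit embedding similarity to MeSH label similarity
    l_mse = float(np.mean((sim_e[off] - sim_l[off]) ** 2))
    # contrastive term: SimL-weighted positives vs zero-overlap negatives
    terms = []
    for i in range(n):
        pos = [j for j in range(n) if j != i and sim_l[i, j] > beta]
        neg = [j for j in range(n) if j != i and sim_l[i, j] == 0.0]
        if pos and neg:
            denom = np.sum(np.exp(sim_e[i, neg]))
            terms += [-np.log(sim_l[i, p] * np.exp(sim_e[i, p]) / denom)
                      for p in pos]
    l_con = float(np.mean(terms)) if terms else 0.0
    return l_mse + lam * l_con

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sl = np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Setting `lam=0` recovers the pure regression loss, which is the configuration the ablation shows is prone to embedding collapse.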
## Key Experimental Results

### Main Results
| Task/Dataset | Metric | BioHiCL-Base (0.1B) | BMRetriever-1B | bge-base (0.1B) |
|---|---|---|---|---|
| IR Average | nDCG@10 | 0.543 | 0.531 | 0.529 |
| NFCorpus | nDCG@10 | 0.379 | 0.344 | 0.368 |
| TREC-COVID | nDCG@10 | 0.812 | 0.840 | 0.798 |
| BIOSSES | Spearman | 0.896 | 0.858 | 0.860 |
| PubMedQA | Recall@1 | 0.893 | 0.810 | 0.856 |
### Ablation Study
| Configuration | IR Avg | Note |
|---|---|---|
| BioHiCL-Base | 0.543 | Full model |
| w/o \(\mathcal{L}_{\text{con}}\) | 0.528 | Removing contrastive loss yields the largest drop |
| w/o Ancestor Label | 0.538 | Without ancestor node expansion |
| w/o \(\mathcal{L}_{\text{mse}}\) | 0.537 | Without regression loss |
| w/o Depth Weight | 0.541 | Without depth weighting |
| w/o LoRA (full params) | 0.542 | LoRA matches full fine-tuning performance |
### Key Findings
- BioHiCL-Base (0.1B) outperforms BMRetriever (1B) on the average IR metric, demonstrating that structured supervision signals can compensate for the gap in model scale.
- The contrastive loss is the most critical component (removing it reduces IR average by 0.015), confirming the necessity of preventing embedding collapse.
- Fine-tuning BMRetriever with the BioHiCL methodology leads to severe performance degradation (0.501→0.279), as replacing the original instruction-based training objective disrupts its retrieval-specialized embedding geometry.
- LoRA achieves performance comparable to full fine-tuning with only 0.3% of the parameters, validating the effectiveness of parameter-efficient methods for domain adaptation.
## Highlights & Insights
- Leveraging the MeSH hierarchical structure as a graded supervision signal is a natural and effective design: MeSH is an expert-curated, standardized vocabulary that provides a precise measure of semantic relationships between documents. This paradigm of "borrowing existing structured knowledge for supervision" is transferable to any domain with a hierarchical label system (e.g., legal code classification, product taxonomy).
- Depth-weighted label similarity encodes the domain intuition that "specific concepts matter more than abstract ones" in an elegantly minimal form (a single formula \(w_j = \log(d(c_j)+1)\)).
- The extreme efficiency of the 0.1B model makes it well-suited for large-scale deployment, offering a clear practical advantage over larger specialized systems such as BMRetriever (1B) and MedCPT.
## Limitations & Future Work
- Training is conducted on only 80,000 abstracts from BioASQ, far smaller in scale than MedCPT (click data) and BMRetriever (multi-task data).
- The coverage and granularity of MeSH annotations are constrained by the label set maintained by NLM; emerging concepts may lack adequate coverage.
- The potential of combining MeSH hierarchical information with instruction-based retrieval training remains unexplored.
- Gains on SCIDOCS are limited (0.215→0.225), indicating that cross-domain generalization requires further improvement.
## Related Work & Insights
- vs. MedCPT (Jin et al., 2023): The latter employs query-article clicks for contrastive learning; BioHiCL uses MeSH hierarchy to provide finer-grained supervision signals.
- vs. BMRetriever (Xu et al., 2024): The latter trains a 1B model via large-scale multi-task learning; BioHiCL achieves comparable performance with only 0.1B parameters and MeSH supervision, offering substantially higher efficiency.
- vs. BiCA (Sinha et al., 2025): The latter performs biomedical adaptation without exploiting hierarchical label structure; BioHiCL complements this with an explicit hierarchical dimension.
## Rating
- Novelty: ⭐⭐⭐ Contrastive learning with MeSH supervision is a natural combination, though the core idea is not particularly surprising.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation (IR + similarity + QA), detailed ablations, and efficiency analysis provide comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and concise, though the Related Work section is somewhat thin.