BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

Conference: ACL 2026 · arXiv: 2604.15591 · Code: https://github.com/MengfeiLan/BioHiCL · Area: Medical Imaging · Keywords: Biomedical retrieval, MeSH hierarchy, contrastive learning, multi-label, parameter-efficient fine-tuning

TL;DR

BioHiCL leverages the hierarchical multi-label annotations of MeSH (Medical Subject Headings) to provide structured supervision for dense retrievers. By aligning the embedding space with the MeSH semantic space via depth-weighted label similarity, a 0.1B model surpasses most specialized models on biomedical retrieval, sentence similarity, and question answering tasks.

Background & Motivation

Background: General-domain dense retrievers (e.g., BGE, E5) perform well on general IR benchmarks but fail to capture biomedical-specific terminology and semantic relationships. Specialized biomedical retrieval models (e.g., MedCPT, BMRetriever) improve semantic alignment through large-scale contrastive learning.

Limitations of Prior Work: Existing biomedical retrieval models rely on coarse-grained relevance signals—either binary annotations (relevant/irrelevant) or query-article click data. Such coarse signals cannot capture the complex partial semantic overlap in biomedical texts (e.g., two articles labeled "irrelevant" may share a common parent concept in the disease hierarchy).

Key Challenge: Semantic relationships among biomedical texts are graded and hierarchical, yet training signals are binary—learning graded semantic relationships from binary signals limits retrieval precision.

Goal: Design a method that leverages the MeSH hierarchical structure to provide fine-grained, graded supervision signals for adapting general-purpose retrievers to the biomedical domain.

Key Insight: MeSH provides naturally multi-faceted supervision—each article is annotated with multiple MeSH labels, the labels themselves form a hierarchical tree, and the degree and depth of label overlap can quantify semantic similarity.

Core Idea: Align the similarity in embedding space with that in the MeSH depth-weighted label space, replacing binary contrastive learning with hierarchical multi-label contrastive learning.

Method

Overall Architecture

Built upon the general-domain dense retriever BGE, the model is fine-tuned with LoRA on 80,000 MeSH-annotated abstracts from BioASQ. The training objective consists of two components: (1) a regression loss \(\mathcal{L}_{\text{mse}}\) that fits embedding similarity to label similarity; and (2) a hierarchical contrastive loss \(\mathcal{L}_{\text{con}}\) that pulls semantically related documents closer and pushes unrelated ones apart in embedding space.

Key Designs

  1. Depth-Weighted Hierarchical Label Representation:

    • Function: Encodes the MeSH hierarchical structure as computable label similarity.
    • Mechanism: The MeSH label set of each abstract is expanded to include all ancestor nodes, forming a complete path \(m_i^{\text{hier}}\), encoded as a multi-hot vector \(y_i \in \{0,1\}^C\). Each MeSH concept \(c_j\) is assigned a depth weight \(w_j = \log(d(c_j)+1)\), such that deeper (more specific) concepts receive higher weights. The label similarity between two documents is defined as the cosine similarity of their weighted multi-hot vectors: \(\text{SimL}(k_p, k_q) = \cos(y_p \odot \mathbf{w}, y_q \odot \mathbf{w})\). A code sketch of this computation follows the list below.
    • Design Motivation: Shallow MeSH labels (e.g., "Disease") carry little discriminative value when matched, whereas deep labels (e.g., "Intracranial Hemorrhage") indicate genuine semantic relatedness. Depth weighting focuses the model on meaningful fine-grained matches.
  2. Hierarchical Multi-Label Contrastive Loss:

    • Function: Prevents embedding collapse and maintains discriminative structure.
    • Mechanism: Positive pairs are document pairs with label similarity \(\text{SimL} > \beta\); negative pairs are those with no label overlap (\(\text{SimL}=0\)). Positive pairs are weighted by their label similarity in the contrastive loss: \(\mathcal{L}_{\text{con}} = -\mathbb{E}_i \log[\text{SimL}(k_i, k_i^+) \cdot \exp(\text{SimE}(k_i, k_i^+)) / \sum_{k_j^-} \exp(\text{SimE}(k_i, k_j^-))]\). The threshold \(\beta\) filters weakly associated pairs to reduce noisy supervision. This loss is also illustrated in the sketch after the list.
    • Design Motivation: Regression loss alone may cause all embeddings to collapse to a single point; the contrastive loss maintains discriminability by pushing unrelated documents apart. Weighting by label similarity ensures that more semantically related positive pairs contribute larger gradients.
  3. LoRA Parameter-Efficient Fine-Tuning:

    • Function: Adapts a general-domain retriever to the biomedical domain at low cost.
    • Mechanism: The original BGE parameters are frozen, and low-rank adapters \(W_{\text{adapted}}^{(l)} = W^{(l)} + B^{(l)} A^{(l)}\) are injected, training only 0.3% of the parameters.
    • Design Motivation: Full-parameter fine-tuning is prone to overfitting on small datasets and incurs high computational cost; LoRA achieves domain adaptation while preserving general language understanding.
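
The first two designs can be made concrete with a short sketch. The following Python snippet is a minimal illustration, not the released BioHiCL code: the tiny MeSH fragment is a toy example, and cosine similarity is assumed for the embedding similarity \(\text{SimE}\).

```python
# Minimal sketch of the depth-weighted label similarity SimL and the
# hierarchical contrastive term (illustrative; not the authors' code).
import numpy as np
import torch

# Toy MeSH fragment (hypothetical depths, for illustration only).
MESH_PARENTS = {
    "Intracranial Hemorrhage": ["Cerebrovascular Disorders"],
    "Cerebrovascular Disorders": ["Diseases"],
    "Diseases": [],
}
MESH_DEPTH = {"Diseases": 1, "Cerebrovascular Disorders": 2, "Intracranial Hemorrhage": 3}
CONCEPTS = sorted(MESH_DEPTH)
IDX = {c: j for j, c in enumerate(CONCEPTS)}

def expand_with_ancestors(labels):
    """Expand an article's MeSH labels to the full hierarchical path m_i^hier."""
    expanded, stack = set(), list(labels)
    while stack:
        c = stack.pop()
        if c not in expanded:
            expanded.add(c)
            stack.extend(MESH_PARENTS.get(c, []))
    return expanded

def weighted_label_vector(labels):
    """Multi-hot vector y_i scaled by depth weights w_j = log(d(c_j) + 1)."""
    y = np.zeros(len(CONCEPTS))
    for c in expand_with_ancestors(labels):
        y[IDX[c]] = np.log(MESH_DEPTH[c] + 1)  # deeper (more specific) -> larger weight
    return y

def sim_l(labels_p, labels_q):
    """SimL(k_p, k_q): cosine similarity of the depth-weighted label vectors."""
    yp, yq = weighted_label_vector(labels_p), weighted_label_vector(labels_q)
    denom = np.linalg.norm(yp) * np.linalg.norm(yq)
    return float(yp @ yq / denom) if denom > 0 else 0.0

def contrastive_term(anchor, pos, negs, sim_l_pos):
    """One anchor's contribution to L_con:
    -log[ SimL(k_i, k_i+) * exp(SimE(k_i, k_i+)) / sum_j exp(SimE(k_i, k_j-)) ].
    anchor, pos: (d,) embeddings; negs: (n, d) embeddings of documents with SimL = 0."""
    sim_e_pos = torch.nn.functional.cosine_similarity(anchor, pos, dim=0)
    sim_e_neg = torch.nn.functional.cosine_similarity(anchor.unsqueeze(0), negs, dim=1)
    return -torch.log(sim_l_pos * torch.exp(sim_e_pos) / torch.exp(sim_e_neg).sum())

# Example: articles sharing only an ancestor concept still get a graded (nonzero) SimL.
print(round(sim_l(["Intracranial Hemorrhage"], ["Cerebrovascular Disorders"]), 3))
```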

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{\text{mse}} + \lambda \mathcal{L}_{\text{con}}\), with \(\lambda=0.1\) and \(\beta=0.3\). Training is conducted on 80,000 abstracts from BioASQ v2022; the optimal checkpoint is selected on the TREC-CT 2022 validation set. Training and inference can be completed on a single A100 40GB GPU.
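
As a rough illustration of the training setup, the sketch below combines the two losses and injects LoRA adapters via the Hugging Face peft library. Only \(\lambda=0.1\) and \(\beta=0.3\) come from the paper; the checkpoint name, LoRA rank, and target modules are assumptions.

```python
# Hedged sketch of the total objective L = L_mse + lambda * L_con and the LoRA
# setup; hyperparameters other than lambda and beta are illustrative guesses.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

LAMBDA, BETA = 0.1, 0.3  # loss weight and positive-pair threshold from the paper

def total_loss(sim_e_pairs, sim_l_pairs, l_con):
    """L_mse fits embedding similarity to label similarity over sampled pairs;
    l_con is the hierarchical contrastive term (see the sketch above)."""
    l_mse = torch.nn.functional.mse_loss(sim_e_pairs, sim_l_pairs)
    return l_mse + LAMBDA * l_con

# Freeze the BGE backbone and inject low-rank adapters W_adapted = W + B A,
# so only a small fraction (~0.3%) of the parameters is trained.
base = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")  # assumed checkpoint
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                      target_modules=["query", "key", "value"])
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```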

Key Experimental Results

Main Results

| Task/Dataset | Metric | BioHiCL-Base (0.1B) | BMRetriever-1B | bge-base (0.1B) |
|---|---|---|---|---|
| IR Average | nDCG@10 | 0.543 | 0.531 | 0.529 |
| NFCorpus | nDCG@10 | 0.379 | 0.344 | 0.368 |
| TREC-COVID | nDCG@10 | 0.812 | 0.840 | 0.798 |
| BIOSSES | Spearman | 0.896 | 0.858 | 0.860 |
| PubMedQA | Recall@1 | 0.893 | 0.810 | 0.856 |

Ablation Study

| Configuration | IR Avg | Note |
|---|---|---|
| BioHiCL-Base | 0.543 | Full model |
| w/o \(\mathcal{L}_{\text{con}}\) | 0.528 | Removing contrastive loss yields the largest drop |
| w/o Ancestor Label | 0.538 | Without ancestor node expansion |
| w/o \(\mathcal{L}_{\text{mse}}\) | 0.537 | Without regression loss |
| w/o Depth Weight | 0.541 | Without depth weighting |
| w/o LoRA (full params) | 0.542 | LoRA matches full fine-tuning performance |

Key Findings

  • BioHiCL-Base (0.1B) outperforms BMRetriever (1B) on the average IR metric, demonstrating that structured supervision signals can compensate for the gap in model scale.
  • The contrastive loss is the most critical component (removing it reduces IR average by 0.015), confirming the necessity of preventing embedding collapse.
  • Fine-tuning BMRetriever with the BioHiCL methodology leads to severe performance degradation (0.501→0.279), as replacing the original instruction-based training objective disrupts its retrieval-specialized embedding geometry.
  • LoRA achieves performance comparable to full fine-tuning with only 0.3% of the parameters, validating the effectiveness of parameter-efficient methods for domain adaptation.

Highlights & Insights

  • Leveraging the MeSH hierarchical structure as a graded supervision signal is a natural and effective design: MeSH is an expert-curated, standardized vocabulary that provides a precise measure of semantic relationships between documents. This paradigm of "borrowing existing structured knowledge for supervision" is transferable to any domain with a hierarchical label system (e.g., legal code classification, product taxonomy).
  • Depth-weighted label similarity encodes the domain intuition that "specific concepts matter more than abstract ones" in an elegantly minimal form (a single formula \(w_j = \log(d(c_j)+1)\)); a small worked example follows this list.
  • The extreme efficiency of the 0.1B model makes it well-suited for large-scale deployment, offering a clear practical advantage over systems such as BMRetriever and MedCPT that require 1B+ parameters.
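
For concreteness (illustrative depths, natural log assumed): a top-level concept at depth 1 receives weight \(\log(1+1) \approx 0.69\), while a concept at depth 5 receives \(\log(5+1) \approx 1.79\), so a match on the deeper concept contributes roughly 2.6 times as much to the weighted label vectors.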

Limitations & Future Work

  • Training uses only 80,000 abstracts from BioASQ, far smaller in scale than the training data of MedCPT (click data) and BMRetriever (multi-task data).
  • The coverage and granularity of MeSH annotations are constrained by the label set maintained by NLM; emerging concepts may lack adequate coverage.
  • The potential of combining MeSH hierarchical information with instruction-based retrieval training remains unexplored.
  • Gains on SCIDOCS are limited (0.215→0.225), indicating that cross-domain generalization requires further improvement.

Comparison with Related Work

  • vs. MedCPT (Jin et al., 2023): MedCPT learns from query-article click data via contrastive learning; BioHiCL instead uses the MeSH hierarchy to provide finer-grained supervision signals.
  • vs. BMRetriever (Xu et al., 2024): BMRetriever trains a 1B model via large-scale multi-task learning; BioHiCL achieves comparable performance with only 0.1B parameters and MeSH supervision, offering substantially higher efficiency.
  • vs. BiCA (Sinha et al., 2025): BiCA performs biomedical adaptation without exploiting hierarchical label structure; BioHiCL complements it with an explicit hierarchical dimension.

Rating

  • Novelty: ⭐⭐⭐ Contrastive learning with MeSH supervision is a natural combination, though the core idea is not particularly surprising.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation (IR + similarity + QA), detailed ablations, and efficiency analysis provide comprehensive coverage.
  • Writing Quality: ⭐⭐⭐⭐ Method description is clear and concise, though the Related Work section is somewhat thin.