SECRET: Semi-supervised Clinical Trial Document Similarity Search

Conference: ACL 2025
arXiv: 2505.10780
Authors: Trisha Das, Afrah Shafquat, Beigi Mandis, Jacob Aptekar, Jimeng Sun
Institution: University of Illinois Urbana-Champaign, Medidata Solutions
Code: Not open-sourced
Area: Medical NLP
Keywords: clinical trial, document similarity, contrastive learning, semi-supervised, information retrieval

TL;DR

Proposes SECRET, a semi-supervised clinical trial protocol similarity search method. By converting clinical trial documents into Q/A pair representations and combining local (Q/A-level) and global (trial-level) contrastive learning to generate embeddings, it improves recall@1 by 78% relative to the best baseline in full trial search.

Background & Motivation

Clinical trials are critical for evaluating the safety and efficacy of new therapies, but their design process is complex and error-prone. Retrieving similar historical trials can provide references for trial design (e.g., target population, eligibility criteria, dosing regimens, and anticipated adverse events). However, existing methods face four core challenges:

Scarcity of labeled data: Publicly available labeled data for trial similarity is extremely scarce, and supervised methods (e.g., GTSLNet) rely on private datasets.

Long document issues: Clinical trial protocols often exceed 1,000 words. Existing methods (e.g., Trial2Vec) still require truncation for long paragraphs, leading to the loss of key information.

Insufficient local semantic understanding: Two text snippets containing the same medical entities can have completely different semantics (e.g., "detecting insulin levels to diagnose diabetes" vs. "prescribing insulin to manage diabetes"). Entity matching-based methods fail to distinguish them.

Inefficient contrastive supervision: SimCSE's use of the same document as its own positive sample is too simplistic, while Trial2Vec's generation of positive samples by deleting paragraphs may lose crucial information.

SECRET proposes a semi-supervised framework that simultaneously leverages a small amount of labeled data and a large volume of unlabeled data, utilizing Q/A pairs as the representation unit to address the aforementioned problems.

Method

SECRET consists of three core components:

1. Q/A Pair Generation

Convert each clinical trial protocol into a set of Q/A pairs:

  • Long paragraphs (e.g., eligibility criteria): Use Llama-3.1-8B-Instruct to generate Q/A pairs, extracting key information and significantly compressing document length.
  • Short paragraphs (e.g., title, diseases, interventions): Use manually predefined questions.

Core assumption: Two similar trials will share a similar set of Q/A pairs.
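
The conversion step can be sketched as follows. This is a minimal illustration, not the authors' code: the question wording in PREDEFINED_QUESTIONS, the function trial_to_qa_pairs, and the stub_generate_qa stand-in for the LLM call are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]

# Hypothetical fixed questions for short sections (not the paper's wording).
PREDEFINED_QUESTIONS: Dict[str, str] = {
    "title": "What is the title of the trial?",
    "diseases": "Which diseases or conditions does the trial study?",
    "interventions": "Which interventions are used in the trial?",
}


def trial_to_qa_pairs(
    trial: Dict[str, str],
    generate_qa: Callable[[str], List[QAPair]],
) -> Dict[str, List[QAPair]]:
    """Map each section name to its list of (question, answer) pairs."""
    qa_by_section: Dict[str, List[QAPair]] = {}
    for section, text in trial.items():
        text = text.strip()
        if not text:
            continue  # skip empty sections
        if section in PREDEFINED_QUESTIONS:
            # Short section: the section text itself is the answer.
            qa_by_section[section] = [(PREDEFINED_QUESTIONS[section], text)]
        else:
            # Long section: delegate to an LLM-backed generator (assumed).
            qa_by_section[section] = generate_qa(text)
    return qa_by_section


def stub_generate_qa(text: str) -> List[QAPair]:
    """Stand-in for the LLM: one Q/A pair per non-empty line."""
    return [
        ("What is one criterion?", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]
```

In practice generate_qa would wrap a prompt to an instruction-tuned model; the stub only keeps the example runnable without one.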

2. Local Contrastive Learning (Q/A-level)

Perform contrastive training at the Q/A pair granularity, using BioBERT as the backbone encoder:

  • Positive Sample Selection: For an anchor Q/A pair, select the pair with the highest cosine similarity from the Q/A pool of the same paragraph as the positive sample.
  • Negative Samples: All other Q/A pairs within the batch.
  • Loss Function: InfoNCE loss with a temperature parameter tau = 0.1.

This design ensures that even sentences containing the same medical entities but having different semantics can obtain distinct embedding representations.
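
To make the local objective concrete, here is a small pure-Python sketch of positive selection by cosine similarity and an InfoNCE loss with in-batch negatives. It operates on plain lists of floats rather than BioBERT outputs, and it uses the common simplification that the positives of the other anchors in the batch serve as negatives; the function names are illustrative, and the paper's exact batching may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_positive(anchor: Sequence[float], pool: List[Sequence[float]]) -> int:
    """Index of the pool embedding most similar to the anchor."""
    return max(range(len(pool)), key=lambda i: cosine(anchor, pool[i]))


def info_nce(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    tau: float = 0.1,
) -> float:
    """Mean InfoNCE loss; positives[j] with j != i are negatives for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / tau for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when each anchor is closest to its own positive and grows when another item in the batch is closer, which is the pressure that pulls apart snippets that share entities but differ in meaning.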

3. Global Contrastive Learning (Trial-level)

Perform contrastive training at the overall trial level, combining labeled and unlabeled data:

  • Positive Samples for Unlabeled Data: Randomly drop one Q/A pair from a paragraph containing multiple Q/A pairs to generate the positive sample.
  • Positive Samples for Labeled Data: Directly use labeled similar trials.
  • Hard Negative Samples: Trials within the same disease category but not similar.
  • Loss Function: Combination of pairwise loss and in-batch loss.
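
The two sampling rules above can be sketched as below. This is an illustration under assumptions: a trial is represented as a mapping from section name to Q/A pairs, labels are given as a dictionary of similar-trial sets, and the names drop_one_qa and hard_negatives are hypothetical. How the pairwise and in-batch losses are combined is not specified here, so it is left out.

```python
import random
from typing import Dict, List, Set, Tuple

QAPair = Tuple[str, str]
Trial = Dict[str, List[QAPair]]


def drop_one_qa(trial: Trial, rng: random.Random) -> Trial:
    """Copy the trial, removing one random Q/A pair from one multi-pair section."""
    augmented = {section: list(pairs) for section, pairs in trial.items()}
    candidates = [s for s, pairs in trial.items() if len(pairs) > 1]
    if candidates:
        section = rng.choice(candidates)
        del augmented[section][rng.randrange(len(augmented[section]))]
    return augmented


def hard_negatives(
    anchor_id: str,
    disease_of: Dict[str, str],
    similar: Dict[str, Set[str]],
) -> List[str]:
    """Trials sharing the anchor's disease category but not labeled similar to it."""
    labeled_similar = similar.get(anchor_id, set())
    return [
        trial_id
        for trial_id, disease in disease_of.items()
        if trial_id != anchor_id
        and disease == disease_of[anchor_id]
        and trial_id not in labeled_similar
    ]
```

Sections with a single Q/A pair are never emptied, so the augmented view keeps every section of the original trial.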

Finally, trial embeddings are ranked and retrieved using cosine similarity.
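
Retrieval by cosine similarity can be sketched in a few lines. The embeddings here are toy two-dimensional vectors, the trial identifiers are made up, and rank_trials is an illustrative name; how Q/A embeddings are aggregated into one trial embedding is not covered by this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_trials(
    query: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k corpus trials most similar to the query, best first."""
    scored = [(trial_id, cosine(query, emb)) for trial_id, emb in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy corpus with hypothetical trial IDs.
corpus = {"T1": [1.0, 0.0], "T2": [0.6, 0.8], "T3": [0.0, 1.0]}
top2 = rank_trials([1.0, 0.0], corpus, k=2)
```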

Key Experimental Results

Table 2: Full Trial Similarity Search (Core Results)

Method P@1 R@1 P@5 R@5 nDCG@5 MAP
TF-IDF 0.363 0.244 0.217 0.687 0.522 0.501
Trial2Vec 0.422 0.263 0.227 0.689 0.553 0.539
SECRET 0.647 0.467 0.297 0.924 0.796 0.754
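
For readers less familiar with the column headers, the sketch below shows the standard definitions of precision@k and recall@k for a single query. The ranked list and relevance set are made-up examples, and the paper's evaluation script may average or break ties differently.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved trials that are relevant."""
    hits = sum(1 for trial_id in ranked[:k] if trial_id in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant trials that appear in the top k."""
    hits = sum(1 for trial_id in ranked[:k] if trial_id in relevant)
    return hits / len(relevant)


# Hypothetical ranking for one query with three relevant trials.
ranked = ["T3", "T1", "T7", "T2", "T9"]
relevant = {"T1", "T2", "T4"}
p_at_5 = precision_at_k(ranked, relevant, 5)
r_at_5 = recall_at_k(ranked, relevant, 5)
```

Because each query usually has only a few relevant trials, P@5 is capped well below 1 even for a perfect ranking, which is why P@5 values in the table are much lower than R@5.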

SECRET leads by a wide margin on all metrics, with recall@1 increasing by 78% relative to Trial2Vec, precision@1 increasing by 53%, and MAP increasing by 40%.
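
The relative gains quoted above can be checked directly from the Trial2Vec and SECRET rows of Table 2. The helper name relative_gain is illustrative.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# (Trial2Vec, SECRET) values taken from Table 2.
table2 = {
    "R@1": (0.263, 0.467),
    "P@1": (0.422, 0.647),
    "MAP": (0.539, 0.754),
}
gains = {
    metric: round(100 * relative_gain(secret, trial2vec))
    for metric, (trial2vec, secret) in table2.items()
}
```

These are relative, not absolute, improvements: the absolute gain in R@1 is about 0.20.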

Table 3: Partial Trial Search (Query by Title Only)

Method P@1 R@1 R@5 nDCG@5 MAP
Trial2Vec 0.456 0.322 0.717 0.592 0.579
SECRET 0.548 0.390 0.902 0.745 0.696

Under partial-query scenarios, SECRET still significantly outperforms all baselines, with a 29% relative increase in recall@2 over the best baseline. Against Trial2Vec, the rows above give relative gains of about 21% in R@1 and 26% in R@5.

Table 4: Zero-Shot Patient-Trial Matching (TREC2021)

Method P@1 R@1 nDCG@5 MAP
Trial2Vec 0.608 0.129 0.618 0.695
SECRET 0.710 0.158 0.666 0.744

In the zero-shot setting, with no training on patient-trial matching, SECRET still outperforms all baselines, with precision@1 increasing by 17% and recall@1 increasing by 22% relative to Trial2Vec.

Ablation Study

  • Local contrastive learning alone yields the worst performance; global only (Q/A representation) outperforms global only (full-text representation); the combination of both achieves the best performance.
  • Experiment on the number of Q/A pairs: Selecting the top-10 Q/A pairs yields the best results; too many will introduce noise, while too few will lose information.
  • SECRET's training data volume is only about 1/4 of Trial2Vec's (approx. 10K labeled + 60K unlabeled vs. Trial2Vec's full dataset).

Highlights & Insights

  • Q/A pair representation is an elegant solution to the long document issue, compressing verbose protocols into structured, comparable information units.
  • The dual-level contrastive learning is elegantly designed: local contrastive learning captures fine-grained semantic differences, while global contrastive learning models the overall similarity between trials.
  • The semi-supervised framework effectively balances annotation costs and performance requirements, outperforming Trial2Vec, which is trained on its full dataset, with about 1/4 of the training data.
  • Zero-shot transfer to patient-trial matching tasks still outperforms all baselines, demonstrating excellent generalization capability.
  • Case studies indicate that SECRET can better capture exact matches of key attributes such as age and intervention measures.

Limitations & Future Work

  • Only used title, disease, intervention, keywords, outcomes, and eligibility criteria, excluding important paragraphs such as descriptions and study designs (due to LLM resource limitations).
  • Does not include other clinical trial-related documents such as informed consent forms and adverse event reports.
  • Q/A generation relies on an LLM (Llama-3.1-8B-Instruct), which may introduce inconsistency in generation quality.
  • The evaluation dataset is limited in scale (1,420 pairs in the test set) and only contains English data.
  • The importance weights of different paragraphs were not explored; all paragraphs are treated with equal weight.

Comparison with Related Methods

Dimension | Trial2Vec | GTSLNet | SECRET
Learning Paradigm | Self-supervised | Supervised | Semi-supervised
Document Representation | Segment-wise Encoding + Merging | Full Text | Q/A Pair Set
Contrastive Granularity | Entity-level | - | Q/A-level + Trial-level
Training Data Requirements | Large-scale Unlabeled | Large-scale Labeled (Private) | Small-scale Labeled + Unlabeled
Long Document Processing | Truncation | Truncation | Q/A Compression
Open Source Data | Yes | No | Yes

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of Q/A pair representation + dual-level contrastive learning exhibits relatively high originality.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three tasks, 10 baselines, ablation studies, and case analyses are relatively complete.
  • Writing Quality: ⭐⭐⭐⭐ — The problem definition is clear, and the logic between the four challenges and their corresponding solutions is coherent.
  • Value: ⭐⭐⭐⭐ — Clinical trial retrieval is an important practical demand, and the method has direct application value.