SECRET: Semi-supervised Clinical Trial Document Similarity Search

Conference: ACL 2025
arXiv: 2505.10780
Authors: Trisha Das, Afrah Shafquat, Beigi Mandis, Jacob Aptekar, Jimeng Sun
Institution: University of Illinois Urbana-Champaign, Medidata Solutions
Code: Not open-sourced
Area: Medical NLP
Keywords: clinical trial, document similarity, contrastive learning, semi-supervised, information retrieval

TL;DR

Proposes SECRET, a semi-supervised clinical trial protocol similarity search method. By converting clinical trial documents into Q/A pair representations and combining local (Q/A-level) and global (trial-level) contrastive learning to generate embeddings, it improves recall@1 by 78% relative to the best baseline in full trial search.

Background & Motivation

Clinical trials are critical for evaluating the safety and efficacy of new therapies, but their design process is complex and error-prone. Retrieving similar historical trials can provide references for trial design (e.g., target population, eligibility criteria, dosing regimens, and anticipated adverse events). However, existing methods face four core challenges:

Scarcity of labeled data: Publicly available labeled data for trial similarity is extremely scarce, and supervised methods (e.g., GTSLNet) rely on private datasets.

Long document issues: Clinical trial protocols often exceed 1,000 words. Existing methods (e.g., Trial2Vec) still require truncation for long paragraphs, leading to the loss of key information.

Insufficient local semantic understanding: Two text snippets containing the same medical entities can have completely different semantics (e.g., "detecting insulin levels to diagnose diabetes" vs. "prescribing insulin to manage diabetes"). Entity matching-based methods fail to distinguish them.

Inefficient contrastive supervision: SimCSE's use of the same document as its own positive sample is too simplistic, while Trial2Vec generates positive samples by deleting paragraphs, which may discard crucial information.

SECRET is a semi-supervised framework that leverages a small amount of labeled data together with a large volume of unlabeled data, and uses Q/A pairs as the representation unit to address these four problems.

Method

SECRET consists of three core components:

1. Q/A Pair Generation

Convert each clinical trial protocol into a set of Q/A pairs:

- Long paragraphs (e.g., eligibility criteria): Use Llama-3.1-8B-Instruct to generate Q/A pairs, extracting key information and significantly compressing document length.
- Short paragraphs (e.g., title, diseases, interventions): Use manually predefined questions.
- Core assumption: Two similar trials will share a similar set of Q/A pairs.
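
As a rough illustration of this representation, the sketch below turns a trial into a flat list of Q/A pairs. It is not the authors' pipeline: the `QAPair` type, the question wording in `PREDEFINED_QUESTIONS`, the section keys, and the `fake_generate_qa` stand-in (which replaces the Llama-3.1-8B-Instruct call with a trivial line splitter) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class QAPair:
    section: str
    question: str
    answer: str


# Hypothetical predefined questions for short sections (wording is assumed).
PREDEFINED_QUESTIONS: Dict[str, str] = {
    "title": "What is the title of the trial?",
    "diseases": "Which diseases does the trial study?",
    "interventions": "Which interventions does the trial use?",
}


def trial_to_qa_pairs(
    trial: Dict[str, str],
    generate_qa: Callable[[str, str], List[Tuple[str, str]]],
) -> List[QAPair]:
    """Short sections get a fixed question; long sections go to a generator."""
    pairs: List[QAPair] = []
    for section, text in trial.items():
        if section in PREDEFINED_QUESTIONS:
            pairs.append(QAPair(section, PREDEFINED_QUESTIONS[section], text))
        else:
            pairs.extend(QAPair(section, q, a) for q, a in generate_qa(section, text))
    return pairs


def fake_generate_qa(section: str, text: str) -> List[Tuple[str, str]]:
    """Stand-in for the LLM call: one Q/A pair per non-empty line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [(f"What is {section} item {i + 1}?", line) for i, line in enumerate(lines)]


# Made-up trial content, for illustration only.
trial = {
    "title": "Metformin for type 2 diabetes",
    "eligibility_criteria": "Age 18 to 65 years\nHbA1c above 7%",
}
pairs = trial_to_qa_pairs(trial, fake_generate_qa)
```

Passing the generator in as a callable keeps the sketch runnable without a model; a real setup would presumably swap in a prompt-based LLM call at that point.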

2. Local Contrastive Learning (Q/A-level)

Perform contrastive training at the Q/A pair granularity, using BioBERT as the backbone encoder:

- Positive Sample Selection: For an anchor Q/A pair, select the pair with the highest cosine similarity from the Q/A pool of the same paragraph as the positive sample.
- Negative Samples: All other Q/A pairs within the batch.
- Loss Function: InfoNCE loss with a temperature parameter tau = 0.1.
- This design is intended to give sentences that contain the same medical entities but differ in meaning distinct embedding representations.
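
For readers who want to see the loss concretely, here is a minimal pure-Python sketch of InfoNCE with in-batch negatives and tau = 0.1. It is an assumption-laden illustration, not the paper's training code: the embeddings are toy 2-d vectors rather than BioBERT outputs, and the negatives for each anchor are simply the positives of the other anchors in the batch.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors: List[Vector], positives: List[Vector], tau: float = 0.1) -> float:
    """Mean InfoNCE loss where positives[i] is the positive for anchors[i].

    Simplification: the negatives for anchor i are the other rows of
    `positives` in the same batch.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / tau for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): the loss is small when each anchor matches its
# own positive and large when the positives are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

With a small temperature such as 0.1, differences in cosine similarity are amplified tenfold before the softmax, which is why the swapped batch is penalized so heavily.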

3. Global Contrastive Learning (Trial-level)

Perform contrastive training at the overall trial level, combining labeled and unlabeled data:

- Positive Samples for Unlabeled Data: Randomly drop one Q/A pair from a paragraph containing multiple Q/A pairs to generate the positive sample.
- Positive Samples for Labeled Data: Directly use labeled similar trials.
- Hard Negative Samples: Trials within the same disease category but not similar.
- Loss Function: Combination of pairwise loss and in-batch loss.
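
The positive-sample construction for unlabeled trials can be sketched as a small augmentation function. This is one possible reading of the description, not a confirmed implementation: it assumes a trial is stored as a mapping from section name to a list of (question, answer) tuples, and that exactly one pair is removed from one randomly chosen section that has more than one pair. The function name `drop_one_qa` and the sample trial are hypothetical.

```python
import random
from typing import Dict, List, Tuple

QA = Tuple[str, str]
Trial = Dict[str, List[QA]]


def drop_one_qa(trial: Trial, rng: random.Random) -> Trial:
    """Return a copy of `trial` with one Q/A pair removed.

    Only sections holding more than one pair are eligible, so no section
    is ever emptied. The input trial is left unchanged.
    """
    view = {section: list(pairs) for section, pairs in trial.items()}
    eligible = [section for section, pairs in view.items() if len(pairs) > 1]
    if not eligible:
        return view  # nothing can be dropped safely
    section = rng.choice(eligible)
    del view[section][rng.randrange(len(view[section]))]
    return view


# Made-up trial content, for illustration only.
trial = {
    "title": [("What is the title of the trial?", "Metformin for type 2 diabetes")],
    "eligibility": [
        ("What is the age range?", "18 to 65 years"),
        ("What HbA1c level is required?", "Above 7%"),
        ("Is pregnancy an exclusion?", "Yes"),
    ],
}
positive_view = drop_one_qa(trial, random.Random(0))
```

The anchor trial and `positive_view` would then form a positive pair for the trial-level objective; how the pairwise and in-batch terms are weighted is not specified here, so it is left out of the sketch.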

Finally, trial embeddings are ranked and retrieved using cosine similarity.
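
The retrieval step amounts to sorting candidate trials by cosine similarity to the query embedding. A minimal sketch follows, assuming embeddings are plain lists of floats; the trial identifiers and vectors are made up, and `rank_trials` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_trials(
    query: Vector, corpus: Dict[str, Vector], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_k (trial_id, similarity) pairs, most similar first."""
    scored = [(trial_id, cosine(query, emb)) for trial_id, emb in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy corpus with hypothetical identifiers and 2-d embeddings.
corpus = {
    "trial_a": [1.0, 0.0],
    "trial_b": [0.6, 0.8],
    "trial_c": [0.0, 1.0],
}
top_two = rank_trials([1.0, 0.1], corpus, top_k=2)
```

A brute-force scan like this is fine for small collections; a larger corpus would typically use an approximate nearest-neighbor index instead.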

Key Experimental Results

Table 2: Full Trial Similarity Search (Core Results)

Method	P@1	R@1	P@5	R@5	nDCG@5	MAP
TF-IDF	0.363	0.244	0.217	0.687	0.522	0.501
Trial2Vec	0.422	0.263	0.227	0.689	0.553	0.539
SECRET	0.647	0.467	0.297	0.924	0.796	0.754

SECRET leads by a wide margin on all metrics, with recall@1 increasing by 78% relative to Trial2Vec, precision@1 increasing by 53%, and MAP increasing by 40%.
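
The relative improvements quoted for Table 2 can be checked with a few lines of arithmetic. The snippet below simply recomputes (SECRET - Trial2Vec) / Trial2Vec from the table values; `relative_gain` is a helper name introduced for this check.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Values copied from Table 2.
trial2vec = {"P@1": 0.422, "R@1": 0.263, "MAP": 0.539}
secret = {"P@1": 0.647, "R@1": 0.467, "MAP": 0.754}

gains_percent = {
    metric: round(100 * relative_gain(trial2vec[metric], secret[metric]))
    for metric in trial2vec
}
# gains_percent == {"P@1": 53, "R@1": 78, "MAP": 40}
```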

Table 3: Partial Trial Search (Query by Title Only)

Method	P@1	R@1	R@5	nDCG@5	MAP
Trial2Vec	0.456	0.322	0.717	0.592	0.579
SECRET	0.548	0.390	0.902	0.745	0.696

Under partial query scenarios, SECRET still significantly outperforms all baselines. Against Trial2Vec, the values in Table 3 correspond to relative gains of about 21% in recall@1 (0.322 to 0.390) and about 26% in recall@5 (0.717 to 0.902).

Table 4: Zero-Shot Patient-Trial Matching (TREC 2021)

Method	P@1	R@1	nDCG@5	MAP
Trial2Vec	0.608	0.129	0.618	0.695
SECRET	0.710	0.158	0.666	0.744

Under the zero-shot setting, without any patient-trial matching training, SECRET still outperforms all baselines, with precision@1 increasing by 17% and recall@1 increasing by 22% relative to Trial2Vec.

Ablation Study

Local contrastive learning alone yields the worst performance; global only (Q/A representation) outperforms global only (full-text representation); the combination of both achieves the best performance.
Experiment on the number of Q/A pairs: Selecting the top-10 Q/A pairs yields the best results; too many will introduce noise, while too few will lose information.
The training data volume is only 1/4 of Trial2Vec (approx. 10K labeled + 60K unlabeled vs. Trial2Vec's full dataset).

Highlights & Insights

Q/A pair representation is an elegant solution to the long document issue, compressing verbose protocols into structured, comparable information units.
The dual-level contrastive learning divides the work cleanly: local contrastive learning captures fine-grained semantic differences, while global contrastive learning models the overall similarity between trials.
The semi-supervised framework balances annotation cost against performance, outperforming the fully trained baseline with roughly 1/4 of the training data.
Zero-shot transfer to patient-trial matching tasks still outperforms all baselines, demonstrating excellent generalization capability.
Case studies indicate that SECRET can better capture exact matches of key attributes such as age and intervention measures.

Limitations & Future Work

Uses only the title, disease, intervention, keywords, outcomes, and eligibility criteria paragraphs, excluding important paragraphs such as descriptions and study designs (due to LLM resource limitations).
Does not include other clinical trial-related documents such as informed consent forms and adverse event reports.
Q/A generation relies on LLMs (Llama-3.1-8B), which may introduce inconsistency in generation quality.
The evaluation dataset is limited in scale (1,420 pairs in the test set) and only contains English data.
The importance weights of different paragraphs were not explored; all paragraphs are treated with equal weight.

Dimension	Trial2Vec	GTSLNet	SECRET
Learning Paradigm	Self-supervised	Supervised	Semi-supervised
Document Representation	Segment-wise Encoding + Merging	Full Text	Q/A Pair Set
Contrastive Granularity	Entity-level	-	Q/A-level + Trial-level
Training Data Requirements	Large-scale Unlabeled	Large-scale Labeled (Private)	Small-scale Labeled + Unlabeled
Long Document Processing	Truncation	Truncation	Q/A Compression
Open Source Data	Yes	No	Yes

Rating

Novelty: ⭐⭐⭐⭐ — The combination of Q/A pair representation + dual-level contrastive learning exhibits relatively high originality.
Experimental Thoroughness: ⭐⭐⭐⭐ — Three tasks, 10 baselines, ablation studies, and case analyses are relatively complete.
Writing Quality: ⭐⭐⭐⭐ — The problem definition is clear, and the logic between the four challenges and their corresponding solutions is coherent.
Value: ⭐⭐⭐⭐ — Clinical trial retrieval is an important practical demand, and the method has direct application value.