Improving Automatic Evaluation of LLMs in Biomedical Relation Extraction via LLMs-as-the-Judge

Conference: ACL 2025
arXiv: 2506.00777
Code: https://github.com/tahmedge/llm_judge_biomedical_re
Area: Biomedical NLP
Keywords: LLM-as-Judge, biomedical relation extraction, structured output, domain adaptation, evaluation

TL;DR

This paper presents the first systematic study of LLM-as-the-Judge in evaluating biomedical relation extraction. The authors find that its accuracy is typically below 50%, and propose structured output formatting (JSON) and domain adaptation techniques to improve evaluation accuracy by approximately 15%.

Background & Motivation

Background: LLMs have demonstrated strong zero-shot capabilities in biomedical relation extraction (BioRE), yet evaluation still heavily relies on expensive human annotation. Although the LLM-as-the-Judge paradigm has proven effective in general NLP tasks, it remains under-explored in the biomedical domain.

Limitations of Prior Work: (1) LLM-generated responses often contain synonyms or abbreviations of the gold standards (e.g., "dex" vs "dexamethasone"), which traditional exact match metrics fail to evaluate correctly; (2) human evaluation is expensive, time-consuming, and unscalable; (3) directly applying LLM-as-the-Judge to biomedical relation extraction yields surprisingly low accuracy (typically <50%).

Key Challenge: The free-form outputs generated by LLMs make it difficult for the LLM judge to accurately parse relations, a problem further compounded by technical terminology and naming conventions in the biomedical field.

Goal: (1) Quantify the actual capability of LLM-as-the-Judge in BioRE evaluation; (2) identify the root causes of failure; (3) propose improvement schemes to make the LLM judge more reliable.

Key Insight: The analysis starts from the output format of the LLM generator and identifies unstructured text as the root cause of evaluation difficulty.

Core Idea: By constraining the LLM generator's output to a structured JSON format, the accuracy of LLM-as-the-Judge in BioRE evaluation is significantly improved, which is further enhanced by combining it with domain-adaptive fine-tuning.

Method

Overall Architecture

Input text → LLM-Generator (relation extraction) → Output relations → LLM-Judge (replacing human evaluation) → Determine the number of correct relations and total predicted relations → Compute Precision/Recall/F1. The core improvements lie in two stages: (1) output formatting of the LLM-Generator (unstructured → structured JSON); (2) domain-adaptive fine-tuning of the LLM-Judge.
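
The final step of this pipeline, turning the judge's counts into Precision/Recall/F1, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `prf1_from_counts` is hypothetical, and the `num_gold` argument (the number of gold relations, which recall requires) is an assumption introduced here.

```python
def prf1_from_counts(num_correct: int, num_predicted: int, num_gold: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from relation counts.

    num_correct   -- predicted relations that the judge marked as correct
    num_predicted -- total relations predicted by the generator
    num_gold      -- gold relations in the reference (assumed to be available)
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: the judge accepts 3 of 4 predicted relations; 5 gold relations exist.
p, r, f1 = prf1_from_counts(num_correct=3, num_predicted=4, num_gold=5)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Because the metrics depend only on these counts, any miscount by the judge propagates directly into the reported scores, which is why the accuracy of the judge's counts is the quantity under study.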

Key Designs

Structured Output Formatting:
- Function: Requiring the LLM-Generator to output relation extraction results in JSON format instead of free-form text.
- Mechanism: Free-text outputs such as "The drug X treats disease Y" force the LLM-Judge to parse relation pairs itself, which leads to frequent errors. When the output is formatted as JSON, e.g. {"relations": [{"entity1": "X", "relation": "treats", "entity2": "Y"}]}, evaluation becomes a structured comparison.
- Design Motivation: Experimental observations show that the low accuracy of the LLM-Judge primarily stems from its inability to correctly parse relations from unstructured outputs. The choice of JSON format is based on recent studies showing that LLMs are more reliable with JSON than with other formats like YAML.
- Gain: An average improvement of approximately 15% in exact match accuracy.
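
A minimal sketch of what structured output buys: once the generator emits the JSON shape shown above, relations can be read with a standard parser instead of being inferred from prose. The helper name `parse_relations` and the lowercasing/whitespace normalization are assumptions made for illustration, not details from the paper.

```python
import json


def parse_relations(generator_output: str) -> set[tuple[str, str, str]]:
    """Parse the generator's JSON output into a set of normalized triples."""
    data = json.loads(generator_output)
    return {
        (
            rel["entity1"].strip().lower(),
            rel["relation"].strip().lower(),
            rel["entity2"].strip().lower(),
        )
        for rel in data.get("relations", [])
    }


# The example output from the text above.
output = '{"relations": [{"entity1": "X", "relation": "treats", "entity2": "Y"}]}'
triples = parse_relations(output)
print(sorted(triples))
```

Parsing alone does not resolve synonyms or abbreviations such as "dex" vs "dexamethasone"; that judgment is still left to the LLM-Judge. The structured format only removes the relation-parsing step as a source of error.
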
Domain Adaptation via Transfer Learning:
- Function: Fine-tuning the LLM-Judge using human-annotated data from other datasets when human evaluation data for the target domain is lacking.
- Mechanism: Assuming that target dataset \(X\) has no training data, but human evaluation data from another dataset \(Y\) is available, the LLM-Judge is first fine-tuned on \(Y\) to learn the general capability of "how to evaluate relation extraction" before transferring to \(X\).
- Design Motivation: Open-source LLM judges (such as Prometheus-2) are fine-tuned only for general evaluation dimensions (e.g., helpfulness, factual correctness) and are not suitable for evaluating relation extraction, while human-annotated evaluation data remains scarce.
- Novelty: Solves the dual challenges of domain specificity and data scarcity.
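
The data setup behind this transfer can be sketched as below, assuming each human-annotated judgment is stored as a record tagged with its dataset. The record fields and the helper `split_for_transfer` are hypothetical; the actual fine-tuning procedure is not shown and is not specified here.

```python
def split_for_transfer(records: list[dict], source: str, target: str) -> tuple[list[dict], list[dict]]:
    """Return (fine-tuning examples from the source dataset Y, held-out examples from the target dataset X)."""
    train = [rec for rec in records if rec["dataset"] == source]
    test = [rec for rec in records if rec["dataset"] == target]
    return train, test


# Hypothetical records: a dataset name plus the human counts for one generator response.
records = [
    {"dataset": "DDI", "num_correct": 2, "num_predicted": 3},
    {"dataset": "KD-DTI", "num_correct": 1, "num_predicted": 1},
    {"dataset": "BC5CDR", "num_correct": 0, "num_predicted": 2},
]
train, test = split_for_transfer(records, source="DDI", target="BC5CDR")
print(len(train), len(test))
```

The key property is that no target-dataset annotation ever enters the fine-tuning set, which mirrors the assumption that \(X\) has no training data.
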
Evaluation Metric Design:
- Exact Match (EM) Accuracy: A sample counts as correct only if both the number of correct relations and the total number of predicted relations reported by the LLM-Judge match the human annotations exactly.
- RMSE: Calculating the root mean square error between the LLM-Judge annotations and human annotations to penalize large deviations.
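
Both metrics can be sketched in a few lines. This is an illustrative implementation under stated assumptions: each annotation is represented as a `(num_correct, num_predicted)` pair, and RMSE is computed here over the `num_correct` counts only; the notes above do not say which counts the paper uses for RMSE.

```python
import math


def exact_match_accuracy(judge: list[tuple[int, int]], human: list[tuple[int, int]]) -> float:
    """Percentage of samples where both counts agree with the human annotation."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return 100.0 * matches / len(judge)


def rmse(judge_counts: list[int], human_counts: list[int]) -> float:
    """Root mean square error between judge counts and human counts."""
    if not judge_counts or len(judge_counts) != len(human_counts):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(j - h) ** 2 for j, h in zip(judge_counts, human_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical annotations as (num_correct, num_predicted) pairs.
judge = [(2, 3), (1, 1), (0, 2), (4, 4)]
human = [(2, 3), (1, 2), (0, 2), (2, 4)]

em = exact_match_accuracy(judge, human)
err = rmse([j[0] for j in judge], [h[0] for h in human])
print(em, err)
```

EM treats a near miss and a large miss identically, whereas RMSE grows with the size of the deviation, which is why the two are reported together.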

Key Experimental Results

Main Results: LLM-Judge Zero-Shot Accuracy

LLM-Judge	BC5CDR EM↑	DDI EM↑	KD-DTI EM↑	BC5CDR RMSE↓
GPT-4o-Mini	48.35	59.03	53.11	2.33
Gemini-Flash	42.55	47.12	40.68	2.09
Qwen-2.5-7B	45.25	46.60	49.98	2.42
Claude-3-Haiku	29.50	31.15	40.27	2.26
LLaMA-3.1-8B	29.45	29.32	36.73	2.40
DeepSeek-R1-Qwen-7B	30.60	42.67	42.45	2.76

Ablation Study: Impact of Structured Output

Configuration	BC5CDR EM	DDI EM	KD-DTI EM	Explanation
Unstructured Output	~48% (GPT-4o-Mini)	~59%	~53%	Baseline, free-text format
Structured JSON Output	~63%	~74%	~68%	Average improvement of ~15%
+ Domain adaptation fine-tuning	Further improvement	Further improvement	Further improvement	Cross-dataset knowledge transfer

Key Findings

LLM-Judge performs poorly in BioRE: Even the top-performing judge, GPT-4o-Mini, reaches only about 48-59% EM accuracy across the three datasets, and every other judge in the table stays below 50%. This is much lower than LLM judges achieve on general NLP tasks.
Reasoning LLMs offer no advantage: The distilled versions of DeepSeek-R1 do not outperform standard instruction-tuned models (for example, DeepSeek-R1-Qwen-7B trails Qwen-2.5-7B in EM on all three datasets), suggesting that "reasoning" capability does not directly transfer to evaluation tasks.
Structured format is critical: Unifying the output into JSON format makes it significantly easier for the LLM-Judge to accurately parse and compare relations, markedly improving consistency.
Biomedical-specific LLMs are unsuitable as judges: BioMistral-7B completely fails to follow instructions for evaluation.
Domain adaptation is effective but limited: Cross-dataset transfer performs better when relation types are similar.

Highlights & Insights

Precise diagnosis of issues: Through extensive experiments (over 100 runs), the authors trace the LLM-Judge's failure to parsing difficulties caused by unstructured output rather than to insufficient evaluation capability. This insight is simple yet profound.
Generality of structured outputs: Constraining outputs to JSON not only facilitates evaluation but is also a good practice for improving the quality of relation extraction itself, which can be adapted to other information extraction tasks.
Domain adaptation strategy: The paradigm of fine-tuning the LLM-Judge with out-of-domain human-annotated data can be generalized to other domains lacking evaluation data (such as legal or financial entity relations).
Open-source release of large-scale annotated data: The public release of 36K annotated samples (4K human + 32K LLM) offers substantial value to the research community.

Limitations & Future Work

Only three BioRE datasets were evaluated, covering a limited set of relation types (chemical-disease, drug-drug, chemical-protein interactions), leaving broader relation types such as gene-disease unexplored.
Domain adaptation only employed simple cross-dataset fine-tuning without exploring more advanced transfer learning techniques (such as adapters or LoRA).
Structured output relies on the LLM-Generator's ability to emit valid JSON, which earlier LLMs may not do reliably.
The evaluation is focused heavily on relation extraction and has not yet been extended to other biomedical NLP tasks (e.g., NER, event extraction).

Comparison with Related Work

vs LLM-as-the-Judge (Zheng et al., 2023): While general LLM judges perform well in evaluating dialogue generation, this work reveals their severe limitations in structured tasks like relation extraction.
vs Prometheus-2 (Kim et al., 2024): Prometheus-2 is an open-source evaluation LLM fine-tuned for general evaluation dimensions (e.g., helpfulness, correctness), which makes it unsuitable for judging relation extraction and highlights the necessity of task-specific evaluation.
vs Jahan et al. (2024): Jahan et al. provide a zero-shot benchmark for LLMs on BioRE but rely on human evaluation; this work attempts to replace that human evaluation with the LLM-Judge.

Rating

Novelty: ⭐⭐⭐ First systematic study of LLM-as-the-Judge in BioRE, although the methods (structured output + domain adaptation) are relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, with over 100 experiments involving 8 judges, 5 generators, and 3 datasets.
Writing Quality: ⭐⭐⭐⭐ Clear structure with a coherent logic chain from problem definition to analysis and resolution.
Value: ⭐⭐⭐⭐ High practical value; the open-sourced datasets represent a valuable contribution to the community.