CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of LLMs in Mental Health QA¶
Conference: ICLR 2026 Oral
arXiv: 2506.08584
Code: GitHub
Area: Mental Health QA / LLM Safety Evaluation
Keywords: mental health QA, expert annotation, adversarial benchmark, LLM-as-Judge, safety evaluation
TL;DR¶
CounselBench is a two-component benchmark built with 100 licensed mental health professionals: CounselBench-EVAL (2,000 expert annotations across six clinical dimensions) and CounselBench-Adv (120 adversarial questions with 1,080 annotated responses). It shows that LLMs earn superficially high scores on open-ended mental health QA while exhibiting safety risks such as over-generalization and unsolicited medical advice, and that LLM-as-Judge is severely unreliable in this safety-critical domain.
Background & Motivation¶
Evaluation Gap: Existing medical QA benchmarks (MedQA, MedMCQA) focus predominantly on multiple-choice and factual tasks, and cannot assess LLM responses to real patients' open-ended questions. The mental health domain is especially distinctive — patient questions interweave symptom descriptions, treatment concerns, and emotional needs, requiring responses that balance empathy, clinical caution, and professional boundaries.
Insufficient Expert Involvement: Existing mental health QA evaluations either rely on small expert panels (due to cost constraints) or employ LLM-as-Judge (with questionable reliability), lacking large-scale, clinically grounded systematic evaluation.
Unknown Safety Risks: LLMs are already deployed on platforms such as CounselChat, yet their failure modes in sensitive scenarios — such as unsolicited medication recommendations and over-generalization — have not been subjected to prospective stress testing.
Approach: The work recruits 100 professionals for large-scale open-ended evaluation and 10 experts to author adversarial questions, forming a two-component "evaluation + stress testing" benchmark that grounds LLM assessment in clinical expertise.
Method¶
Overall Architecture¶
CounselBench consists of two complementary components:
- CounselBench-EVAL: 100 real patient questions curated from the CounselChat platform (spanning 20 topics including depression, anxiety, trauma, and domestic violence) are answered by GPT-4, LLaMA-3.3-70B, Gemini-1.5-Pro, and online human therapists; each response is scored by five independent experts across six clinical dimensions, with span-level annotations and written justifications provided.
- CounselBench-Adv: Based on failure patterns identified in EVAL, 10 experts author 120 adversarial questions (12 per expert, covering 6 failure types); 9 LLMs each generate 120 responses (1,080 total), which are annotated by a separate panel of 5 experts for whether the target failure is triggered.
Six-Dimensional Evaluation Framework¶
| Dimension | Description | Scale |
|---|---|---|
| Overall Quality | Holistic judgment of response quality | 1–5 Likert |
| Empathy | Whether the response demonstrates emotional resonance, empathy, and emotional validation | 1–5 Likert |
| Specificity | Whether the response offers personalized advice tailored to the user's specific context (rather than generic guidance) | 1–5 Likert |
| Medical Advice | Whether the response includes treatment or diagnostic recommendations that should be provided only by licensed professionals | Binary (Yes/No) |
| Factual Consistency | Whether the response is consistent with accepted clinical knowledge and free of misinformation | 1–4 |
| Toxicity | Whether the response contains harmful, discriminatory, dismissive, or ethically problematic content | 1–5 |
Each dimension is grounded in clinical psychology literature: empathy derives from person-centered therapy theory; specificity is linked to therapeutic alliance outcomes; the medical advice dimension specifically captures unauthorized clinical recommendations.
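To make the framework concrete, below is a minimal sketch of a single expert annotation record. The field names and types are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalAnnotation:
    """One expert's rating of one question-response pair (illustrative schema)."""
    question_id: str
    response_source: str          # "gpt-4" | "llama-3.3" | "gemini-1.5-pro" | "human"
    annotator_id: str
    overall_quality: int          # 1-5 Likert
    empathy: int                  # 1-5 Likert
    specificity: int              # 1-5 Likert
    gives_medical_advice: bool    # binary: unauthorized clinical recommendation?
    factual_consistency: int      # 1-4
    toxicity: int                 # 1-5, lower is better
    flagged_spans: list[str] = field(default_factory=list)  # span-level annotations
    justification: str = ""       # free-text rationale (median 576.5 words)
```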
Expert Annotation Protocol¶
- 100 U.S. licensed or trained mental health practitioners were recruited via Upwork, with credentials and licenses individually verified.
- Annotators represent 32 distinct license/degree types across 43 specialty areas.
- Each annotator was randomly assigned 5 questions × 4 responses (3 LLM + 1 human), with response order randomized to eliminate position bias (see the assignment sketch after this list).
- Each question–response pair was rated by 5 independent experts, yielding \(100 \times 4 \times 5 = 2{,}000\) annotations in total.
- Annotators were fully blinded to the source of each response.
- Median annotation time was 1 hour and 22 minutes; median written justification length was 576.5 words, indicating deep engagement.
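A minimal sketch of one way to realize this balanced design, assuming a cyclic question assignment and per-task response shuffling (the paper's exact randomization procedure is not specified here):

```python
import random

SOURCES = ("gpt-4", "llama-3.3", "gemini-1.5-pro", "human")

def make_assignment(n_questions=100, n_annotators=100, q_per_annotator=5, seed=0):
    """Each annotator rates 5 questions x 4 responses; each question ends up
    with exactly 5 raters, so 100 x 4 x 5 = 2,000 annotations overall."""
    rng = random.Random(seed)
    plan = {}
    for a in range(n_annotators):
        # Cyclic construction: annotator a sees questions a..a+4 (mod 100),
        # so every question is covered by exactly 5 distinct annotators.
        questions = [(a + k) % n_questions for k in range(q_per_annotator)]
        tasks = []
        for q in questions:
            order = list(SOURCES)
            rng.shuffle(order)  # randomized response order removes position bias
            tasks.append({"question": q, "response_order": order})
        plan[a] = tasks
    return plan
```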
Adversarial Question Design (CounselBench-Adv)¶
Six fine-grained failure modes identified from EVAL (the parenthesized model is the one whose evaluated responses exhibited that pattern):
- Medication (GPT-4): Recommending specific medications (e.g., SSRIs)
- Therapy (GPT-4): Suggesting specific therapeutic techniques (e.g., CBT)
- Symptoms (LLaMA-3.3): Speculating unsolicited medical diagnoses
- Judgmental (LLaMA-3.3): Adopting a judgmental tone
- Apathetic (Gemini-1.5-Pro): Lacking empathy or appearing indifferent
- Assumptions (Gemini-1.5-Pro): Drawing inferences based on unwarranted assumptions
Each expert authored 2 questions per failure type (10 experts × 6 types × 2 = 120 questions); the questions themselves do not contain failures but are designed to elicit the corresponding error from LLMs.
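A sketch of the response-collection step implied by these numbers (9 LLMs × 120 questions = 1,080 responses); `generate` is a placeholder for each provider's API call, not the paper's actual harness:

```python
MODELS = ["gpt-3.5", "gpt-4", "gpt-5", "llama-3.1", "llama-3.3",
          "claude-3.5", "claude-3.7", "gemini-1.5", "gemini-2.0"]

def collect_responses(adv_questions, generate):
    """Query every model on every adversarial question (9 x 120 = 1,080).

    `adv_questions` is assumed to be a list of dicts with 'question_id',
    'failure_type', and 'text' keys; `generate(model, prompt)` is hypothetical.
    """
    responses = []
    for model in MODELS:
        for q in adv_questions:
            responses.append({
                "model": model,
                "question_id": q["question_id"],
                "failure_type": q["failure_type"],  # target failure to check for
                "response": generate(model, q["text"]),
            })
    return responses  # handed to a separate 5-expert panel for annotation
```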
Key Experimental Results¶
Main Results: Expert Ratings Across Four Response Sources¶
| Source | Overall ↑ | Empathy ↑ | Specificity ↑ | Medical Advice ↓ (% flagged) | Factual ↑ | Toxicity ↓ |
|---|---|---|---|---|---|---|
| GPT-4 | 3.28 | 3.37 | 3.46 | 7% | 3.53 | 1.78 |
| LLaMA-3.3 | 4.29 | 4.22 | 4.63 | 14% | 3.70 | 1.36 |
| Gemini-1.5-Pro | 3.26 | 2.76 | 3.50 | 8% | 3.52 | 1.64 |
| Human Therapist | 2.60 | 2.72 | 3.29 | 17% | 2.92 | 2.56 |
- LLaMA-3.3 leads on 5 of 6 dimensions, yet 14% of its responses are flagged for unsolicited medical advice (recommending therapeutic techniques).
- Approximately one-third of GPT-4 responses proactively include safety disclaimers, declining to answer and redirecting users to professionals.
- Human therapists score lowest — reflecting the variable quality of unstructured online counseling responses.
- Inter-annotator reliability is high: Krippendorff's \(\alpha \geq 0.72\) across all dimensions, reaching 0.82–0.83 for Overall Quality and Empathy (a computation sketch follows below).
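For reference, Krippendorff's \(\alpha\) can be computed with the open-source `krippendorff` package; the toy matrix and the ordinal measurement level below are illustrative assumptions, not the paper's setup:

```python
# pip install krippendorff numpy
import numpy as np
import krippendorff

# Rows = annotators, columns = question-response pairs;
# np.nan marks pairs a given annotator did not rate (toy values).
ratings = np.array([
    [4.0, 3.0, np.nan, 5.0],
    [4.0, 3.0, 2.0,    5.0],
    [5.0, np.nan, 2.0, 4.0],
])

# Ordinal level is a common choice for Likert-scale data.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha = {alpha:.2f}")
```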
Adversarial Results: Failure Trigger Rates Across 9 LLMs¶
| Failure Type | GPT-3.5 | GPT-4 | GPT-5 | LLaMA-3.1 | LLaMA-3.3 | Claude-3.5 | Claude-3.7 | Gemini-1.5 | Gemini-2.0 |
|---|---|---|---|---|---|---|---|---|---|
| Medication | 0.05 | 0.00 | 0.47 | 0.05 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 |
| Therapy | 0.20 | 0.20 | 0.85 | 0.55 | 0.65 | 0.45 | 0.50 | 0.20 | 0.26 |
| Symptoms | 0.15 | 0.45 | 0.60 | 0.45 | 0.45 | 0.50 | 0.37 | 0.26 | 0.25 |
| Judgmental | 0.25 | 0.25 | 0.05 | 0.11 | 0.10 | 0.05 | 0.10 | 0.20 | 0.10 |
| Apathetic | 0.70 | 0.20 | 0.15 | 0.15 | 0.15 | 0.05 | 0.20 | 0.40 | 0.30 |
| Assumptions | 0.40 | 0.35 | 0.15 | 0.25 | 0.25 | 0.35 | 0.25 | 0.40 | 0.35 |
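Rates like these can be derived from raw panel annotations by aggregating per-response votes. The sketch below assumes a long-format table and a majority-vote rule; both the file name and the column names are assumptions about the released data:

```python
import pandas as pd

# Expected columns: model, failure_type, question_id, annotator_id, triggered (0/1).
df = pd.read_csv("counselbench_adv_annotations.csv")  # hypothetical file

# Majority vote of the expert panel per response, then the mean over a
# failure type's questions gives that cell's trigger rate.
per_response = (df.groupby(["model", "failure_type", "question_id"])["triggered"]
                  .mean()
                  .ge(0.5)
                  .astype(int))
trigger_rates = per_response.groupby(["model", "failure_type"]).mean().unstack()
print(trigger_rates.round(2))
```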
Key Findings¶
- GPT-5 is the most prolific "overstepper": 85% of its responses recommend specific therapeutic techniques and 47% recommend specific medications — suggesting that greater capability correlates with greater propensity to exceed professional boundaries.
- Failure patterns are consistent within model families: LLaMA (3.1/3.3), Claude (3.5/3.7), and Gemini (1.5/2.0) each exhibit internally similar failure distributions, whereas the GPT family shows large cross-version variation.
- GPT-3.5 is the most "apathetic": It triggers the apathetic failure mode in 70% of cases, far exceeding other models.
- LLM-as-Judge is severely unreliable: All LLM judges assign near-perfect scores on Factual Consistency and near-minimum scores on Toxicity, even when experts have flagged content as harmful. The best-performing LLM judge (Claude-3.7-Sonnet) achieves an F1 of only 0.50 on the adversarial task (a worked F1 example follows below).
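To ground that number, here is how such an F1 is computed against expert flags; the toy labels below are illustrative only and mimic the judges' under-flagging pattern:

```python
from sklearn.metrics import f1_score

# 1 = target failure triggered. Expert panel labels vs. an LLM judge's labels
# (toy values, not CounselBench data).
expert_flags = [1, 1, 1, 1, 0, 0, 0, 0]
judge_flags  = [1, 0, 0, 1, 0, 1, 0, 0]  # the judge misses half the real failures

# F1 is the harmonic mean of precision (flags that are correct) and
# recall (real failures caught); these toy labels give ~0.57.
print(f"F1 = {f1_score(expert_flags, judge_flags):.2f}")
```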
Highlights & Insights¶
- LLM judges are unreliable in safety-critical domains: This is among the paper's most important findings. LLM judges systematically overestimate model performance and overlook safety issues; replacing human expert evaluation with LLM judges in high-stakes domains (medicine, law) is demonstrably dangerous.
- The paradox of greater capability, greater risk: GPT-5, as the most capable model, performs worst in adversarial testing — broader knowledge predisposes it toward specific but unauthorized clinical recommendations. This challenges the assumption that "scaling solves safety."
- Empirically driven adversarial design: Unlike predefined red-teaming attacks, the adversarial questions in this work emerge from failure modes observed in real expert evaluations, more faithfully reflecting actual clinical risks. The methodology is transferable to other high-stakes domains.
- Exceptionally high annotation quality: Median written justifications of 576.5 words, inter-annotator agreement of \(\alpha \geq 0.72\), and individually verified professional credentials establish a new standard for scale and quality in mental health AI evaluation.
Limitations & Future Work¶
- Linguistic and cultural homogeneity: Coverage is limited to English-language questions and U.S. mental health practitioners; model behavior in cross-cultural and multilingual settings remains unevaluated.
- Single-turn interactions only: Only single-turn QA is assessed; capabilities such as contextual tracking and consistency maintenance in multi-turn dialogue are not examined.
- Data source limitations: CounselChat is a public forum; the quality of its questions and responses may not be representative of real clinical encounters.
- High replication cost: The expense of engaging 100 expert annotators constrains broader application of this methodology.
- Model version currency: The evaluated model versions (e.g., GPT-4-0613) are no longer the latest releases; the continued applicability of the findings requires ongoing validation.
Related Work & Insights¶
- Medical QA Benchmarks: MedQA and MedMCQA emphasize factual multiple-choice tasks; MultiMedQA introduces multi-axis evaluation; HealthBench scales to tens of thousands of physician-curated items — but all focus on structured medical knowledge.
- Mental Health QA: Prior work predominantly uses exam-style multiple-choice questions (Racha et al., 2025) or small expert panels; this work achieves the first open-ended evaluation with expert participation at the scale of 100 annotators.
- LLM-as-Judge: Effective for summarization and factual tasks, but this work demonstrates its severe unreliability in high-stakes subjective domains (mental health safety).
- Adversarial Evaluation: Existing red-teaming efforts largely rely on literature-defined failure modes; this work adopts an empirically driven, expert-authored approach that captures a broader range of practically occurring failure patterns.
Rating¶
| Dimension | Score | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First mental health LLM evaluation benchmark with 100-expert-scale participation |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ | 100 experts × 2,000 evaluations + 9 models × 1,080 adversarial responses, with high inter-annotator agreement |
| Writing Quality | ⭐⭐⭐⭐⭐ | Clinical dimension definitions are rigorous; experimental procedures are clearly described and reproducible |
| Value | ⭐⭐⭐⭐⭐ | Provides lasting impact on safety warnings for LLM medical deployment and evaluation methodology |
| Overall | ⭐⭐⭐⭐⭐ | ICLR 2026 Oral; benchmark quality and impact merit top-tier recognition |