
H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

Conference: NeurIPS 2025 | arXiv: 2510.03700 | Code: To be confirmed | Area: Medical AI / LLM Evaluation | Keywords: Differential Diagnosis, ICD-10 Hierarchy, Hierarchical F1, LLM Medical Evaluation, Approximate Correctness

TL;DR

H-DDx proposes a differential diagnosis evaluation framework grounded in the ICD-10 classification hierarchy. By expanding both predicted and ground-truth diagnoses to their ancestor nodes and computing a Hierarchical Diagnostic F1 (HDF1), the framework rewards "clinically relevant approximate correctness" rather than exact match only. Evaluating 22 LLMs reveals that the domain-specialized model MediPhi rises from 20th to 2nd place under HDF1, an advantage completely obscured by Top-5 metrics.

Background & Motivation

Background: LLM differential diagnosis evaluation relies primarily on Top-k accuracy — whether the correct diagnosis appears among the model's top k predictions. On this basis, 22 LLMs have been benchmarked on datasets such as DDXPlus.

Limitations of Prior Work: Top-k treats all errors identically — predicting "viral URI" instead of "influenza" (same disease category) receives the same score of 0 as predicting "fracture" (entirely unrelated). This penalizes models that produce clinically approximate predictions as harshly as models that are entirely wrong.

Key Challenge: The ICD-10 coding system naturally encodes hierarchical distances between diseases (chapter/section/category/subcategory), yet existing evaluation metrics entirely ignore this structure.

Goal: Design evaluation metrics that respect the ICD-10 hierarchical structure and reward clinically relevant approximate predictions.

Key Insight: Expand both the predicted and ground-truth diagnosis sets upward along the ICD-10 tree to all ancestor nodes, then compute precision/recall/F1 on the expanded sets.

Core Idea: Expand diagnosis sets along the ICD-10 tree to chapter/section/category ancestor nodes → compute Hierarchical F1 on expanded sets → reward approximate predictions within the same category or section.

Method

Overall Architecture

Mapping: Natural-language diagnoses → ICD-10 codes (retrieved via text-embedding-3-large + reranked by gpt-4o, Top-1 accuracy 93.1%) → Expansion: Each ICD-10 code is expanded upward to all ancestor nodes (subcategory → category → section → chapter) → HDF1 Computation: Precision/recall/F1 computed on the expanded sets.
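The retrieve-then-rerank shape of the mapping stage can be sketched as follows. This is a minimal illustration only: the toy `EMBED` vectors and the tiny `ICD_INDEX` stand in for real text-embedding-3-large embeddings and the full ICD-10 index, and the final pick is a placeholder where the actual pipeline calls the gpt-4o reranker.

```python
import math

# Hypothetical embedding table (stands in for text-embedding-3-large output).
EMBED = {
    "influenza": [0.9, 0.1, 0.0],
    "flu": [0.88, 0.12, 0.05],
    "fracture of femur": [0.0, 0.2, 0.95],
}
# Hypothetical ICD-10 index: code -> canonical description.
ICD_INDEX = {"J10.1": "influenza", "S72.9": "fracture of femur"}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def map_to_icd10(diagnosis, k=15):
    """Retrieve top-k ICD-10 candidates by embedding similarity.
    The real pipeline then reranks the candidates with gpt-4o; here we
    simply take the top retrieval hit as a placeholder."""
    q = EMBED[diagnosis]
    ranked = sorted(ICD_INDEX,
                    key=lambda code: cosine(q, EMBED[ICD_INDEX[code]]),
                    reverse=True)
    return ranked[:k][0]

code = map_to_icd10("flu")  # maps the informal phrasing to "J10.1"
```

The point of the two-stage design is visible even in this sketch: retrieval alone must already place the right code among the candidates (hence top-15), and the reranker only needs to reorder a short list, which is where the paper reports the jump from 71.3% to 93.1% Top-1 accuracy.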

Key Designs

  1. Diagnosis-to-ICD-10 Mapping Pipeline:

    • Function: Maps natural-language diagnoses from LLM outputs to standard ICD-10 codes.
    • Mechanism: text-embedding-3-large retrieves top-15 candidates → gpt-4o reranks → Top-1 accuracy of 93.1% (vs. 71.3% for retrieval alone).
    • Design Motivation: LLM outputs vary in phrasing (e.g., "flu" vs. "influenza"), requiring robust normalization to ensure fair comparison.
  2. Hierarchical Expansion + HDF1 Metric:

    • Function: Computes an F1 score that respects the hierarchical taxonomy of diseases.
    • Mechanism: The predicted set \(\hat{D}_i\) and ground-truth set \(D_i\) are each expanded to \(\hat{C}_i\) and \(C_i\) comprising all ancestor nodes. \(HDP = \frac{1}{N}\sum_i \frac{|\hat{C}_i \cap C_i|}{|\hat{C}_i|}\), \(HDR = \frac{1}{N}\sum_i \frac{|\hat{C}_i \cap C_i|}{|C_i|}\), \(HDF1 = 2HDP \cdot HDR / (HDP + HDR)\).
    • Design Motivation: If the prediction is "viral URI" while the ground truth is "influenza," the two share chapter-level (respiratory system) and section-level (upper respiratory infection) ancestors, so HDF1 awards partial credit; Top-k awards 0.
  3. Hierarchical Cascade Analysis:

    • Function: Computes HDF1 separately at each ICD-10 level (chapter/section/category/subcategory).
    • Mechanism: Truncates expansion depth level by level to analyze how model accuracy changes from coarse to fine granularity.
    • Design Motivation: Reveals whether a model is "directionally correct but imprecise in detail" or "entirely off-track."
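The expansion and HDF1 computation above can be sketched directly from the formulas. The `PARENT` table below is a hand-written toy fragment of the ICD-10 tree (real use would load the full tabular hierarchy); the metric code follows the paper's definitions of HDP, HDR, and HDF1.

```python
# Toy ICD-10 fragment: each node maps to its parent (None = chapter root).
PARENT = {
    "J10.1": "J10",        # influenza with other respiratory manifestations
    "J10": "J09-J18",      # influenza and pneumonia (section)
    "J09-J18": "J00-J99",  # diseases of the respiratory system (chapter)
    "J06.9": "J06",        # acute upper respiratory infection, unspecified
    "J06": "J00-J06",      # acute upper respiratory infections (section)
    "J00-J06": "J00-J99",
}

def expand(codes):
    """Expand a set of ICD-10 codes to include all ancestor nodes."""
    out = set()
    for c in codes:
        while c is not None:
            out.add(c)
            c = PARENT.get(c)
    return out

def hdf1(pred_sets, gold_sets):
    """HDP/HDR averaged over N cases, combined into HDF1 per the paper's formulas."""
    n = len(pred_sets)
    hdp = hdr = 0.0
    for pred, gold in zip(pred_sets, gold_sets):
        cp, cg = expand(pred), expand(gold)
        inter = len(cp & cg)
        hdp += inter / len(cp)
        hdr += inter / len(cg)
    hdp, hdr = hdp / n, hdr / n
    return 2 * hdp * hdr / (hdp + hdr)

# Predicting "viral URI" (J06.9) for ground truth "influenza" (J10.1):
# the expanded sets share the chapter J00-J99, so HDF1 gives partial
# credit (0.25 here) where Top-k would score 0.
score = hdf1([{"J06.9"}], [{"J10.1"}])
```

Note that in this toy fragment the two diagnoses overlap only at the chapter level; with diagnoses from the same ICD-10 section or category, more ancestors coincide and the partial credit grows accordingly.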

Loss & Training

  • This is an evaluation framework; no training is involved.
  • 22 LLMs are evaluated on 730 test cases from DDXPlus.

Key Experimental Results

Main Results

Model            Top-5 Rank   HDF1 Rank   HDF1 Score
Claude-Sonnet-4  —            1st         0.3673
MediPhi          20th         2nd         0.35+
GPT-4o           high         declined    —

Hierarchical Cascade (Average Across All Models)

ICD-10 Level HDF1
Chapter ~60%
Section ~40%
Category ~30%
Subcategory 10–20%
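The cascade in the table above can be reproduced in miniature by truncating the expansion at each level before computing the F1. The `ANCESTORS` table is a hypothetical two-code fragment of the ICD-10 tree (chains ordered fine → coarse); the per-level scores for a single near-miss case show the same coarse-to-fine decline the paper reports.

```python
# Hypothetical ancestor chains: code -> [subcategory, category, section, chapter].
ANCESTORS = {
    "J10.1": ["J10.1", "J10", "J09-J18", "J00-J99"],
    "J06.9": ["J06.9", "J06", "J00-J06", "J00-J99"],
}
LEVELS = ["subcategory", "category", "section", "chapter"]

def expand_at(codes, level):
    """Expand codes, keeping only ancestors at `level` or coarser."""
    start = LEVELS.index(level)
    return {a for c in codes for a in ANCESTORS[c][start:]}

def hdf1_at(pred, gold, level):
    """Single-case hierarchical F1 with expansion truncated at `level`."""
    cp, cg = expand_at(pred, level), expand_at(gold, level)
    inter = len(cp & cg)
    p, r = inter / len(cp), inter / len(cg)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# "viral URI" predicted for "influenza": perfect at chapter level,
# progressively worse as finer levels are included.
cascade = {lvl: hdf1_at({"J06.9"}, {"J10.1"}, lvl) for lvl in LEVELS}
```

Here `cascade` runs from 1.0 at the chapter level down to 0.25 at the subcategory level, mirroring the "directionally correct but imprecise" pattern in the table.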

Key Findings

  • MediPhi rises from 20th under Top-5 to 2nd under HDF1 — domain fine-tuning yields far superior performance on the "clinical approximation" dimension compared to general-purpose models.
  • Case Study: MediPhi achieves HDF1=0.5714 vs. GPT-4o HDF1=0.2069 on a complex respiratory case, despite GPT-4o achieving Top-5=1.0.
  • All models perform reasonably at the chapter level (~60%) but drop sharply at the subcategory level (10–20%), indicating that LLM diagnoses are "directionally correct but insufficiently precise."
  • The hierarchical competence of medically fine-tuned models is completely obscured by Top-k metrics.

Highlights & Insights

  • HDF1 exposes the blind spots of Top-k: The rank reversal (MediPhi 20→2) demonstrates that existing evaluations severely underestimate the true capability of domain fine-tuned models.
  • Clever utilization of the ICD-10 hierarchical structure: The framework repurposes an existing medical coding system as a measurement foundation, requiring no additional annotation.
  • Implications for all medical AI evaluation: Any evaluation task with a hierarchical taxonomy can adopt a similar approach.

Limitations & Future Work

  • DDXPlus is a synthetic dataset; validation on real clinical cases is needed.
  • ICD-10 may not fully capture clinical similarity (SNOMED CT may be more appropriate).
  • The framework evaluates static lists and does not assess sequential reasoning processes.
  • The mapping pipeline depends on gpt-4o, introducing additional bias.
Comparison with Other Metrics

  • vs. Top-k Accuracy: Top-k is a flat metric; HDF1 respects the disease hierarchy.
  • vs. BLEU/ROUGE: These metrics assess textual similarity rather than clinical correctness.

Rating

  • Novelty: ⭐⭐⭐⭐ Hierarchical F1 is a novel and practical metric for medical evaluation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 22 LLMs + hierarchical cascade analysis + case studies.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; case analyses are persuasive.
  • Value: ⭐⭐⭐⭐⭐ Potential to reshape the paradigm of medical AI evaluation.