H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis¶
Conference: NeurIPS 2025 arXiv: 2510.03700 Code: To be confirmed Area: Medical AI / LLM Evaluation Keywords: Differential Diagnosis, ICD-10 Hierarchy, Hierarchical F1, LLM Medical Evaluation, Approximate Correctness
TL;DR¶
H-DDx proposes a differential diagnosis evaluation framework grounded in the ICD-10 classification hierarchy. By expanding both predicted and ground-truth diagnoses to their ancestor nodes and computing a Hierarchical Diagnostic F1 (HDF1), the framework rewards "clinically relevant approximate correctness" rather than exact match only. Evaluating 22 LLMs reveals that the domain-specialized model MediPhi rises from 20th to 2nd place under HDF1, an advantage completely obscured by Top-5 metrics.
Background & Motivation¶
Background: LLM differential diagnosis evaluation primarily relies on Top-k accuracy — whether the correct diagnosis appears anywhere in the predicted list. Benchmarks such as DDXPlus are routinely used to rank LLMs on this basis.
Limitations of Prior Work: Top-k treats all errors equally — predicting "viral URI" instead of "influenza" (same disease category) receives the same score of 0 as predicting "fracture" (entirely unrelated). This penalizes clinically approximate predictions as harshly as predictions that are entirely wrong.
Key Challenge: The ICD-10 coding system naturally encodes hierarchical distances between diseases (chapter/section/category/subcategory), yet existing evaluation metrics entirely ignore this structure.
Goal: Design evaluation metrics that respect the ICD-10 hierarchical structure and reward clinically relevant approximate predictions.
Key Insight: Expand both the predicted and ground-truth diagnosis sets upward along the ICD-10 tree to all ancestor nodes, then compute precision/recall/F1 on the expanded sets.
Core Idea: Expand diagnosis sets along the ICD-10 tree to chapter/section/category ancestor nodes → compute Hierarchical F1 on expanded sets → reward approximate predictions within the same category or section.
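The upward expansion at the heart of this idea can be sketched with a hand-written parent map over a tiny ICD-10 fragment (the codes and the `CH10` chapter label below are illustrative, not the full taxonomy):

```python
# Toy parent map over an ICD-10 fragment (illustrative, not exhaustive).
PARENT = {
    "J10.1": "J10",       # subcategory -> category (influenza with resp. manifestations)
    "J10": "J09-J18",     # category -> section (influenza and pneumonia)
    "J09-J18": "CH10",    # section -> chapter (diseases of the respiratory system)
}

def expand(code: str) -> set[str]:
    """Return a code together with all of its ancestor nodes."""
    nodes = {code}
    while code in PARENT:
        code = PARENT[code]
        nodes.add(code)
    return nodes

print(expand("J10.1"))  # {'J10.1', 'J10', 'J09-J18', 'CH10'}
```

Precision/recall/F1 are then computed on these expanded sets rather than on the raw diagnoses.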
Method¶
Overall Architecture¶
Mapping: Natural-language diagnoses → ICD-10 codes (retrieved via text-embedding-3-large + reranked by gpt-4o, Top-1 accuracy 93.1%) → Expansion: Each ICD-10 code is expanded upward to all ancestor nodes (subcategory → category → section → chapter) → HDF1 Computation: Precision/recall/F1 computed on the expanded sets.
Key Designs¶
- Diagnosis-to-ICD-10 Mapping Pipeline:
- Function: Maps natural-language diagnoses from LLM outputs to standard ICD-10 codes.
- Mechanism: text-embedding-3-large retrieves top-15 candidates → gpt-4o reranks → Top-1 accuracy of 93.1% (vs. 71.3% for retrieval alone).
- Design Motivation: LLM outputs vary in phrasing (e.g., "flu" vs. "influenza"), requiring robust normalization to ensure fair comparison.
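A rough offline sketch of the retrieval step above, substituting a hypothetical character-trigram similarity for text-embedding-3-large and omitting the gpt-4o rerank (the `ICD_INDEX` entries are a toy subset):

```python
# Toy stand-in for the paper's retrieval stage: the paper embeds diagnoses with
# text-embedding-3-large and reranks top-15 candidates with gpt-4o; here we rank
# a tiny index by trigram Jaccard overlap so the sketch runs offline.
ICD_INDEX = {
    "J10.1": "influenza with respiratory manifestations",
    "J06.9": "acute upper respiratory infection, unspecified",
    "S52.5": "fracture of lower end of radius",
}

def trigrams(s: str) -> set[str]:
    s = f"  {s.lower()}  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def retrieve(diagnosis: str, k: int = 2) -> list[str]:
    """Rank ICD-10 candidates by trigram Jaccard similarity (retrieval only)."""
    q = trigrams(diagnosis)
    scored = sorted(ICD_INDEX,
                    key=lambda c: -len(q & trigrams(ICD_INDEX[c])) /
                                   len(q | trigrams(ICD_INDEX[c])))
    return scored[:k]

print(retrieve("upper respiratory infection"))
```

In the actual pipeline, the reranking LLM resolves paraphrases ("flu" vs. "influenza") that surface-level similarity cannot.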
- Hierarchical Expansion + HDF1 Metric:
- Function: Computes an F1 score that respects the hierarchical taxonomy of diseases.
- Mechanism: The predicted set \(\hat{D}_i\) and ground-truth set \(D_i\) are each expanded to \(\hat{C}_i\) and \(C_i\) comprising all ancestor nodes. \(HDP = \frac{1}{N}\sum_i \frac{|\hat{C}_i \cap C_i|}{|\hat{C}_i|}\), \(HDR = \frac{1}{N}\sum_i \frac{|\hat{C}_i \cap C_i|}{|C_i|}\), \(HDF1 = 2HDP \cdot HDR / (HDP + HDR)\).
- Design Motivation: If the prediction is "viral URI" while the ground truth is "influenza," the two share chapter-level (respiratory system) and section-level (upper respiratory infection) ancestors, so HDF1 awards partial credit; Top-k awards 0.
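The viral-URI/influenza example can be worked through end to end. A minimal sketch of HDF1 over expanded sets, using toy ancestor labels in place of real ICD-10 codes:

```python
# Toy ancestor sets: "URI" and "influenza" share section- and chapter-level
# ancestors, as in the paper's example; "fracture" shares nothing with them.
ANCESTORS = {
    "URI": {"URI", "upper-resp-infection", "respiratory"},
    "influenza": {"influenza", "upper-resp-infection", "respiratory"},
    "fracture": {"fracture", "injury"},
}

def expand_set(codes):
    """Union of each code with all of its ancestor nodes."""
    out = set()
    for c in codes:
        out |= ANCESTORS[c]
    return out

def hdf1(pred_sets, gold_sets):
    """Per-case HDP/HDR averaged over N cases, combined into HDF1."""
    n = len(pred_sets)
    hdp = hdr = 0.0
    for P, G in zip(pred_sets, gold_sets):
        Pe, Ge = expand_set(P), expand_set(G)
        inter = len(Pe & Ge)
        hdp += inter / len(Pe)
        hdr += inter / len(Ge)
    hdp, hdr = hdp / n, hdr / n
    return 2 * hdp * hdr / (hdp + hdr) if hdp + hdr else 0.0

# Predicting URI for influenza earns partial credit; fracture earns none.
print(hdf1([{"URI"}], [{"influenza"}]))       # ≈ 0.667
print(hdf1([{"fracture"}], [{"influenza"}]))  # 0.0
```

Under Top-k, both wrong predictions would score 0; HDF1 separates the near miss from the total miss.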
- Hierarchical Cascade Analysis:
- Function: Computes HDF1 separately at each ICD-10 level (chapter/section/category/subcategory).
- Mechanism: Truncates expansion depth level by level to analyze how model accuracy changes from coarse to fine granularity.
- Design Motivation: Reveals whether a model is "directionally correct but imprecise in detail" or "entirely off-track."
Loss & Training¶
- This is an evaluation framework; no training is involved.
- 22 LLMs are evaluated on 730 test cases from DDXPlus.
Key Experimental Results¶
Main Results¶
| Model | Top-5 Rank | HDF1 Rank | HDF1 Score |
|---|---|---|---|
| Claude-Sonnet-4 | — | 1st | 0.3673 |
| MediPhi | 20th | 2nd | 0.35+ |
| GPT-4o | Near the top | Declines | — |
Hierarchical Cascade (Average Across All Models)¶
| ICD-10 Level | HDF1 |
|---|---|
| Chapter | ~60% |
| Section | ~40% |
| Category | ~30% |
| Subcategory | 10–20% |
Key Findings¶
- MediPhi rises from 20th under Top-5 to 2nd under HDF1 — domain fine-tuning yields far superior performance on the "clinical approximation" dimension compared to general-purpose models.
- Case Study: MediPhi achieves HDF1=0.5714 vs. GPT-4o HDF1=0.2069 on a complex respiratory case, despite GPT-4o achieving Top-5=1.0.
- All models perform reasonably at the chapter level (~60%) but drop sharply at the subcategory level (10–20%), indicating that LLM diagnoses are "directionally correct but insufficiently precise."
- The hierarchical competence of medically fine-tuned models is completely obscured by Top-k metrics.
Highlights & Insights¶
- HDF1 exposes the blind spots of Top-k: The rank reversal (MediPhi 20→2) demonstrates that existing evaluations severely underestimate the true capability of domain fine-tuned models.
- Clever utilization of the ICD-10 hierarchical structure: The framework repurposes an existing medical coding system as a measurement foundation, requiring no additional annotation.
- Implications for all medical AI evaluation: Any evaluation task with a hierarchical taxonomy can adopt a similar approach.
Limitations & Future Work¶
- DDXPlus is a synthetic dataset; validation on real clinical cases is needed.
- ICD-10 may not fully capture clinical similarity (SNOMED CT may be more appropriate).
- The framework evaluates static lists and does not assess sequential reasoning processes.
- The mapping pipeline depends on gpt-4o, introducing additional bias.
Related Work & Insights¶
- vs. Top-k Accuracy: Top-k is a flat metric; HDF1 respects the disease hierarchy.
- vs. BLEU/ROUGE: These metrics assess textual similarity rather than clinical correctness.
Rating¶
- Novelty: ⭐⭐⭐⭐ Hierarchical F1 is a novel and practical metric for medical evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 22 LLMs + hierarchical cascade analysis + case studies.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; case analyses are persuasive.
- Value: ⭐⭐⭐⭐⭐ Potential to reshape the paradigm of medical AI evaluation.