
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning

Conference: ACL 2026
arXiv: 2604.24186
Code: https://github.com/Applied-Machine-Learning-Lab/ACL2026-MultiDx
Area: Medical Imaging / Medical Diagnostic Reasoning
Keywords: Multi-source knowledge integration, differential diagnosis, medical reasoning, RAG, Agent

TL;DR

MultiDx synthesizes web search, SOAP-structured cases, similar case libraries, and fine-grained reasoning snippet retrieval into a two-stage diagnostic reasoning framework. It first generates candidate diseases from multi-path evidence, then applies disease matching, voting, and differential diagnosis reranking to improve diagnostic hit rates and reasoning recall on MedCaseReasoning and DiReCT.

Background & Motivation

Background: Medical diagnostic reasoning means building a verifiable clinical reasoning chain from chief complaints, signs, tests, imaging, and disease progression, rather than just outputting a disease name. Large language models have recently been applied to MedQA, PubMedQA, and case-based Q&A, and multi-agent or retrieval-augmented frameworks such as MedAgents, MDAgents, ConfAgents, and OpenAI-DR have emerged.

Limitations of Prior Work: Relying solely on internal LLM knowledge leads to insufficient expertise in rare diseases, latest clinical updates, or complex multi-system cases. Static knowledge bases suffer from limited coverage and update speed. Furthermore, many methods focus only on final answer accuracy while ignoring whether the diagnostic process aligns with clinical practices, making results difficult for physicians to verify.

Key Challenge: Diagnostic reasoning requires two simultaneous capabilities: comprehensive and dynamic medical knowledge, and the ability to organize scattered evidence into a standard differential diagnosis process. Current methods emphasize either "finding more knowledge" or "prompting the model to think more," but lack an explicit integration layer to transform multi-source candidate diagnoses into a clinical-style comparative analysis.

Goal: The authors aim for the model to first list suspected diseases like a physician, then compare supporting and conflicting evidence for these candidates to output a final diagnosis with a reasoning trajectory. This is decomposed into three sub-problems: converting free-text cases into stable clinical structures, retrieving truly useful evidence from external knowledge, and unifying multi-path candidates into an interpretable differential diagnosis.

Key Insight: Clinical diagnosis is inherently multi-perspective. A single case can yield different clues from case structures, similar cases, similar reasoning steps, and the latest medical web data. While a single path might be biased, consensus and conflict between multiple paths serve as important signals for differential diagnosis.

Core Idea: Replace single-turn Q&A diagnosis with "multi-source candidate diagnosis generation + explicit differential diagnosis integration." The LLM first collects disease lists from different perspectives, then performs synonym matching, evidence aggregation, and clinical reranking within a unified candidate space.

Method

MultiDx is a training-free two-stage framework focused on organizing diagnostic reasoning workflows rather than fine-tuning a specific model. Input is a case description \(C\), and output includes the final diagnosis \(D\) and reasoning path \(R\). Stage one generates candidate lists from four sources; stage two integrates these into a final ranking and explanation.

Overall Architecture

Stage one is Multi-source Knowledge-guided Diagnosis Generation. The model calls four information sources in parallel: web search (\(H_{web}\)), SOAP-structured cases (\(H_{SOAP}\)), similar case retrieval (\(H_{case}\)), and similar reasoning snippet retrieval (\(H_{trace}\)). Each path outputs a list of suspected diseases with evidence descriptions.
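
The parallel fan-out in stage one can be pictured with a minimal Python sketch. Everything here is hypothetical: the four generator functions are stubs returning fixed (disease, evidence) pairs, and the thread pool is just one plausible way to run the paths concurrently, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs: in the framework each path would call an LLM,
# a retriever, or a web agent. Here they return fixed candidates.
def from_web(case):
    return [("pulmonary embolism", "guideline page lists pleuritic chest pain")]

def from_soap(case):
    return [("myocardial infarction", "Objective section: ST elevation")]

def from_cases(case):
    return [("myocardial infarction", "similar library case with chest pain")]

def from_traces(case):
    return [("aortic dissection", "reasoning step mentions tearing pain")]

SOURCES = {"web": from_web, "soap": from_soap, "case": from_cases, "trace": from_traces}

def generate_candidates(case: str) -> dict[str, list[tuple[str, str]]]:
    """Run all four paths concurrently and collect one candidate list per source."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fn, case) for name, fn in SOURCES.items()}
        return {name: future.result() for name, future in futures.items()}

result = generate_candidates("58-year-old with acute chest pain")
for source, candidates in result.items():
    print(source, candidates)
```

Keeping the output keyed by source preserves the provenance that stage two needs when it counts support per source.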

Stage two is Evidence Integration and Differential Diagnosis. The LLM receives the original case and four candidate lists, unifies synonyms (e.g., "myocardial infarction" and "heart attack") into standard terms, counts support from each source and their rankings, and compares clinical evidence between high-confidence candidates to output the final list and reasoning trajectory.

This workflow has a clear clinical mapping: stage one resembles forming a "suspected diagnosis list," while stage two resembles performing the "differential diagnosis." Consequently, MultiDx does not simply vote on agent answers but incorporates the underlying evidence into the reranking process.
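
The bookkeeping part of stage two (synonym unification and per-source support counting) can be illustrated with a small sketch. The synonym table and the reciprocal-rank score are assumptions made for illustration; in MultiDx the LLM performs the matching, and the summary does not give a scoring formula. The resulting table is only an input to the differential comparison, not the final decision.

```python
from collections import defaultdict

# Hypothetical synonym table; in MultiDx the LLM does this matching.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}

def normalize(name: str) -> str:
    key = name.strip().lower()
    return SYNONYMS.get(key, key)

def aggregate_support(candidate_lists: dict[str, list[str]]):
    """Merge ranked lists into (disease, score, sources) tuples.

    The reciprocal-rank score is an illustrative choice, not the paper's.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ranked in candidate_lists.items():
        seen = set()
        for rank, name in enumerate(ranked, start=1):
            disease = normalize(name)
            if disease in seen:  # count each disease once per source
                continue
            seen.add(disease)
            scores[disease] += 1.0 / rank
            sources[disease].append(source)
    merged = [(d, scores[d], sources[d]) for d in scores]
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged

lists = {
    "web": ["heart attack", "pulmonary embolism"],
    "soap": ["myocardial infarction", "aortic dissection"],
    "case": ["pulmonary embolism", "myocardial infarction"],
    "trace": ["aortic dissection"],
}
ranked = aggregate_support(lists)
for disease, score, srcs in ranked:
    print(f"{disease}: score={score:.2f}, sources={srcs}")
```

In this toy input "heart attack" and "myocardial infarction" merge into one candidate supported by three sources, which is the kind of signal plain string voting would miss.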

Key Designs

  1. Multi-source Candidate Generation:

    • Function: Generate candidate diseases from case structure, similar cases, similar reasoning snippets, and real-time web data to mitigate blind spots from single sources.
    • Mechanism: Each branch produces its own candidates. The SOAP branch uses the LLM to restructure the free-text case into Subjective, Objective, Assessment, and Plan sections. The case library branch uses BM25 to retrieve the top-k similar cases from MedCaseReasoning. The reasoning snippet branch splits the reasoning chains of library cases into steps and ranks them by the Jaccard similarity \(|E_C \cap E_{i,j}| / |E_C \cup E_{i,j}|\) between the biomedical entities of the input case (\(E_C\)) and those of step \(j\) in case \(i\) (\(E_{i,j}\)). The web branch lets an agent plan queries and browse the retrieved content.
    • Design Motivation: Diagnostic reasoning is sensitive to knowledge coverage. SOAP addresses input clutter, similar cases provide few-shot clinical paradigms, reasoning snippets offer fine-grained alignment, and web search adds dynamic information. These are more robust than single RAG or agent discussion.
  2. Hierarchical Case Retrieval:

    • Function: Allow the model to refer to both complete cases and specific reasoning steps most relevant to local symptoms or test results.
    • Mechanism: The database \(\mathcal{G}=\{(C_i,R_i,D_i)\}_{i=1}^{N}\) is represented at two levels. Level one retrieves similar cases via BM25. Level two retrieves reasoning sentences by extracting medical entities and matching them with the input case entities.
    • Design Motivation: Retrieving only full cases can mislead the model with non-critical history details, while retrieving only short snippets can lose context. Hierarchical retrieval separates "complete case patterns" from "local evidence logic," improving interpretability and recall.
  3. Evidence Integration and Differential Diagnosis Reranking:

    • Function: Transform four candidate lists into a final diagnosis instead of mechanically selecting the most frequent disease.
    • Mechanism: The LLM performs four steps: disease name matching, aggregating support by source and rank, differential analysis for high-rank diseases, and outputting the final list with justifications. Formally, it generates \((R,D)\) based on \(C,H_{web},H_{SOAP},H_{case},H_{trace}\).
    • Design Motivation: Simple voting fails to handle medical synonyms, conflicting evidence, or cases where a disease appears less frequently but has stronger evidence. Explicit differential diagnosis allows the model to compare how candidates fit symptoms and tests, mimicking real clinical decisions.
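
The step-level retrieval in designs 1 and 2 reduces to a Jaccard score over entity sets. The sketch below assumes the entities have already been extracted (the paper uses SciSpacy for that); the entity sets and step texts are made up for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def top_k_steps(case_entities: set[str], steps, k: int = 10):
    """Rank reasoning steps by entity overlap with the input case.

    `steps` is a list of (step_text, entity_set) pairs.
    Steps with no shared entity are dropped.
    """
    scored = [(jaccard(case_entities, ents), text) for text, ents in steps]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: -item[0])  # stable: ties keep input order
    return scored[:k]

# Toy data: entity sets would normally come from a biomedical NER model.
case_entities = {"fever", "headache", "neck stiffness"}
steps = [
    ("Fever with neck stiffness suggests meningitis.",
     {"fever", "neck stiffness", "meningitis"}),
    ("Chest pain with ST elevation suggests infarction.",
     {"chest pain", "st elevation"}),
    ("Headache alone is nonspecific.", {"headache"}),
]
top = top_k_steps(case_entities, steps, k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

The first step shares two of four distinct entities with the case (score 0.5), so it outranks the step that shares only "headache".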

Loss & Training

MultiDx introduces no new trainable parameters and relies entirely on prompts, retrieval, and tool-calling workflows. The DeepSeek-R1 API serves as the primary backbone. Hierarchical retrieval defaults to the top 10 similar cases and the top 10 reasoning paths. Medical entity extraction uses SciSpacy 0.5.5.

The training set is used to construct the case database rather than for fine-tuning. 13,092 cases from MedCaseReasoning form the retrieval library. For evaluation, 300 test samples from MedCaseReasoning and 50 from DiReCT were randomly selected. Web search modules exclude sources like PubMed or Hugging Face to avoid data leakage.
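
Case-level retrieval over such a library can be sketched with a compact Okapi BM25 scorer. This is a from-scratch illustration, not the authors' code: the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, and the three-case toy library are all assumptions.

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

class BM25:
    """Minimal Okapi BM25 index over a non-empty list of case texts."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Smoothed IDF that stays positive for every term.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, idx: int) -> float:
        tf, dl = self.tfs[idx], len(self.docs[idx])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            freq = tf[term]
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def top_k(self, query: str, k: int = 10) -> list[int]:
        """Indices of the k best-scoring cases with a non-zero score."""
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [i for s, i in scored[:k] if s > 0]

library = [
    "fever productive cough dyspnea lobar consolidation",
    "crushing chest pain st elevation troponin",
    "fever headache neck stiffness photophobia",
]
index = BM25(library)
top = index.top_k("fever cough dyspnea", k=2)
print(top)
```

With the default `k=10` this mirrors the top-10 case setting mentioned above; the toy call uses `k=2` only to keep the output short.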

Key Experimental Results

Main Results

Evaluations were conducted on MedCaseReasoning (Reasoning Recall and H@k) and DiReCT.

| Dataset | Method | Reasoning Recall | H@1 Acc. | H@5 Acc. | H@10 Acc. | Key Conclusion |
|---|---|---|---|---|---|---|
| MedCaseReasoning | DeepSeek-R1 | 0.648 | 0.360 | 0.419 | 0.442 | Strong backbone, but recall is insufficient |
| MedCaseReasoning | MedAgents | 0.641 | 0.344 | 0.458 | 0.471 | Multi-expert discussion provides gains |
| MedCaseReasoning | OpenAI-DR | 0.557 | 0.416 | 0.553 | 0.602 | High accuracy but lower reasoning recall |
| MedCaseReasoning | MultiDx | 0.662 | 0.420 | 0.577 | 0.617 | Best across all, strongest gain in H@10 |
| DiReCT | DeepSeek-R1 | 0.473 | 0.293 | 0.413 | 0.473 | Performance drops on clinical notes |
| DiReCT | Self-refinement | 0.662 | 0.300 | 0.466 | 0.586 | Strong recall, H@10 near MultiDx |
| DiReCT | MultiDx | 0.665 | 0.333 | 0.503 | 0.587 | Remains best or tied on small-sample DiReCT |

MultiDx substantially raises the probability that the correct diagnosis enters the candidate list. On MedCaseReasoning, H@10 improved from 0.442 (DeepSeek-R1) to 0.617, an absolute gain of 0.175.
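
For readers unfamiliar with H@k, the metric can be sketched as the fraction of cases whose gold diagnosis appears among the top-k ranked candidates. The case-insensitive exact string match below is a simplifying assumption; this summary does not state how the paper judges a match, and the toy predictions are invented.

```python
def hits_at_k(predictions: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of cases whose gold diagnosis is in the top-k candidates.

    Matching is a case-insensitive exact string match (an assumption).
    """
    if not predictions:
        return 0.0
    hits = 0
    for ranked, answer in zip(predictions, gold):
        top = [name.lower() for name in ranked[:k]]
        if answer.lower() in top:
            hits += 1
    return hits / len(predictions)

# Toy ranked outputs for four cases.
predictions = [
    ["pneumonia", "bronchitis", "asthma"],
    ["migraine", "meningitis", "tension headache"],
    ["gout", "septic arthritis", "pseudogout"],
    ["appendicitis", "gastroenteritis", "diverticulitis"],
]
gold = ["pneumonia", "meningitis", "pseudogout", "cholecystitis"]

h1 = hits_at_k(predictions, gold, k=1)
h2 = hits_at_k(predictions, gold, k=2)
h3 = hits_at_k(predictions, gold, k=3)
print(h1, h2, h3)
```

Because H@k can only grow with k, a large gap between H@1 and H@10 indicates that the right disease is often retrieved but not ranked first, which is the gap the reranking stage targets.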

Ablation Study

Ablations used DeepSeek-R1 as the base model.

| Configuration | H@1 | H@5 | H@10 | Reasoning Recall | Description |
|---|---|---|---|---|---|
| DeepSeek-R1 | 0.360 | 0.419 | 0.442 | 0.648 | No external enhancement |
| w/ SOAP | 0.379 | 0.467 | 0.502 | 0.638 | Improved accuracy; slight recall drop |
| w/ web search | 0.416 | 0.553 | 0.602 | 0.460 | Strongest single accuracy module; low recall |
| w/ related case | 0.393 | 0.489 | 0.523 | 0.634 | Helpful for seen diseases |
| MultiDx | 0.420 | 0.577 | 0.617 | 0.662 | All gains combined |

Web search alone helps identify disease names but fails to generate clinical reasoning chains matching expert annotations. Full MultiDx integrates web knowledge with clinical paradigms and structured inputs.

Generalization, Integration Strategies, and Cost

  • Backbone Generalization: MultiDx improved Qwen3-14B's H@10 from 0.295 to 0.601.
  • Seen vs Unseen: H@5 on unseen diseases reached 0.448 for MultiDx versus 0.366 for the base model, suggesting that web knowledge improves generalization beyond the case library.
  • Integration Strategy: Explicit differential diagnosis (H@1 0.420) outperformed simple voting (H@1 0.403).
  • Cost: End-to-end latency is ~8.46 mins with ~20,000 tokens, comparable to OpenAI-DR.

Key Findings

  • Multi-source fusion is more stable: Web search is the strongest single source for accuracy, but full MultiDx achieves both the highest H@k and Reasoning Recall.
  • Differential diagnosis is essential: Second-stage clinical comparison is more effective than mechanical voting.
  • Unseen diseases benefit: MultiDx does not merely rely on memorizing the training library.
  • Case analysis verifies logic: In complex CNS cases, MultiDx successfully ranked primary CNS lymphoma first by applying expert-like exclusion logic.

Highlights & Insights

  • Natural Task Decomposition: Splitting diagnosis into "candidate generation" and "differential diagnosis" allows the model to compare reasonable options rather than committing prematurely.
  • SOAP Value: Its benefit comes from reducing input noise through information organization, which is a bottleneck in medical reasoning.
  • Snippet Retrieval: Retrieval based on reasoning steps aligns closer to the diagnostic process than whole-case RAG, as diagnosis is often triggered by specific signs.
  • Against Simple Voting: In multi-source systems, source count does not equate to evidence strength. A candidate with fewer sources but tighter evidence fit is often more credible.

Limitations & Future Work

  • External Knowledge Quality: Noise in web search or retrieved cases can contaminate the candidate list.
  • Decoupled Design: Because the two stages are decoupled, a critical disease missed during candidate generation cannot be recovered during reranking. Future work could explore joint planning.
  • Sample Size: The evaluation sets were small (300 samples from MedCaseReasoning, 50 from DiReCT). Large-scale multi-center validation is required.
  • Safety: Detailed analysis of hallucinations, over-diagnosis risk, and human-in-the-loop mechanisms is still needed for real-world deployment.

Related Work & Insights

  • vs MedAgents/MDAgents: MultiDx differs by explicitly introducing SOAP, case libraries, and reasoning snippets rather than relying solely on multi-agent discussion.
  • vs OpenAI-DR: While deep research agents are strong generally, MultiDx organizes tool-calling into structures that specifically align with professional medical processes.
  • Inspiration: The architecture can be migrated to legal or financial tasks: generate candidates from multi-source evidence, then perform synonym merging, support aggregation, and counter-evidence comparison.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Solid combination of multi-source RAG and clinical workflows.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers ablation, cross-backbone, and cost, though sample size is limited.
  • Writing Quality: ⭐⭐⭐⭐☆ Clear methodology and high information density.
  • Value: ⭐⭐⭐⭐⭐ Strong practical insights for medical LLMs focusing on verifiable processes.