EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes", "Hands" and "Minds"
Conference: CVPR 2026 | arXiv: 2604.05541 | Code: N/A | Area: Medical Imaging | Keywords: Echocardiography, Agent System, Multimodal Large Language Model, Cardiac Function Assessment, Tool Calling
TL;DR
This paper proposes EchoAgent, an agent system that simulates the "eyes–hands–minds" collaborative workflow of echocardiography clinicians. Through three stages, an Expertise-Driven Cognition Engine (the "mind"), a Hierarchical Collaboration Toolkit (the "eyes" and "hands"), and an Orchestrated Reasoning Hub that coordinates them, the system achieves end-to-end reliable echocardiography interpretation and attains state-of-the-art performance on multiple benchmarks.
Background & Motivation
Echocardiography (Echo) is one of the most important non-invasive imaging modalities for cardiac function assessment, yet its clinical value depends critically on expert interpretation. Clinicians must simultaneously coordinate three capabilities:
"Eyes" (Visual Observation): Recognizing diverse cardiac views such as the apical two-chamber, four-chamber, and parasternal long-axis views.
"Hands" (Manual Operation): Localizing and segmenting cardiac structures and quantitatively measuring key parameters.
"Minds" (Expert Reasoning): Acquiring clinical knowledge, integrating multimodal evidence, and performing reliable diagnostic reasoning.
Existing approaches follow two lines of development, each with notable limitations:
- Task-specific deep learning models (e.g., MemSAM, EchoONE): Proficient at isolated tasks such as segmentation, possessing "eyes + hands" but lacking "minds," and therefore unable to carry out complete diagnostic reasoning autonomously.
- Multimodal large language models (e.g., GPT-5, Qwen2.5-VL): Capable of visual question answering with "eyes + minds," but lacking domain-specific Echo knowledge and quantitative "hands," leading to clinically ungrounded reasoning.
A unified end-to-end solution integrating "eyes–hands–minds" therefore remains absent. EchoAgent is designed to fill this gap.
Method
Overall Architecture
EchoAgent comprises three core stages that simulate the complete clinician workflow from learning → observation → operation → reasoning:
- Expertise-Driven Cognition Engine (EDC): Constructs a domain knowledge base, endowing the agent with a professional "mind."
- Hierarchical Collaboration Toolkit (HC): Equips the agent with perception and operation tools, providing "eyes" and "hands."
- Orchestrated Reasoning Hub (OR): Coordinates the above components to enable end-to-end interpretable reasoning.
Key Designs
- Expertise-Driven Cognition Engine (EDC) (a minimal retrieval sketch follows this list):
- Domain knowledge is sourced from four authoritative sources: the UMLS medical knowledge base and echocardiography guidelines from the AHA, ASE, and EACVI.
- Heterogeneous documents are decomposed into semantic knowledge primitives \(P=\{p_1, p_2, \ldots, p_D\}\).
- A medical concept encoder \(f_\theta(\cdot)\) maps knowledge into a high-dimensional semantic space.
- Indexing is organized across 14 cardiac anatomical regions (left ventricle, mitral valve, aortic valve, etc.).
- A RAG retrieval mechanism supports anatomy-specific knowledge retrieval, fetching the top-\(k\) most relevant primitives to construct a structured knowledge base \(R\).
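The paper does not release code, so the following is only a minimal Python sketch of the anatomy-indexed top-\(k\) retrieval step, assuming a generic embedding function stands in for the medical concept encoder \(f_\theta\); the primitive texts, region names, and function names are illustrative and not the authors' implementation.

```python
import numpy as np

def encode(texts):
    """Stand-in for f_theta: map texts to unit-norm embedding vectors."""
    rng = np.random.default_rng(0)  # placeholder; a real medical concept encoder goes here
    vecs = rng.normal(size=(len(texts), 384))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Knowledge primitives grouped by anatomical region (the paper indexes 14 regions).
primitives = {
    "left_ventricle": [
        "Biplane Simpson's method is recommended for measuring ejection fraction.",
        "Reduced ejection fraction suggests impaired systolic function.",
    ],
    "mitral_valve": [
        "Mitral regurgitation severity is assessed with color Doppler.",
    ],
}
index = {region: encode(texts) for region, texts in primitives.items()}

def retrieve(query: str, region: str, k: int = 2):
    """Return the top-k primitives for the query within one anatomical region."""
    q = encode([query])[0]
    scores = index[region] @ q          # cosine similarity, since vectors are unit-norm
    top = np.argsort(scores)[::-1][:k]
    return [primitives[region][i] for i in top]

print(retrieve("How should ejection fraction be measured?", "left_ventricle"))
```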
- Hierarchical Collaboration Toolkit (HC): A three-tier progressive structure (a minimal dispatch sketch follows this list).
- Perceptual Layer: Employs the EchoPrime foundation model to parse video streams and automatically classify echocardiographic view types (48 views).
- Operational Layer: Applies a USFM-based customized segmentation model to automatically delineate key cardiac structures (left ventricle, aorta, right ventricle, left atrium, etc.).
- Functional Layer: Integrates fine-tuned versions of USFM and EchoPrime to compute key clinical parameters including ejection fraction (EF), chamber dimensions, and right atrial pressure.
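As a rough illustration of the three-tier structure, here is a minimal Python sketch of a layered tool registry with dispatch; the lambdas merely stand in for EchoPrime view classification, USFM-based segmentation, and the measurement routines, and all names and signatures are assumptions rather than the paper's actual toolkit API.

```python
from typing import Callable, Dict

# Three-tier tool registry: perception ("eyes"), operation ("hands"), measurement.
TOOLS: Dict[str, Dict[str, Callable]] = {
    "perceptual":  {"classify_view": lambda video: "A4C"},            # EchoPrime stand-in
    "operational": {"segment_lv":    lambda frame: {"mask": "..."}},  # USFM stand-in
    "functional":  {"compute_ef":    lambda masks: 58.0},             # measurement stand-in
}

def call_tool(layer: str, name: str, *args):
    """Dispatch a call to the requested layer/tool; fail loudly on unknown names."""
    try:
        return TOOLS[layer][name](*args)
    except KeyError as err:
        raise ValueError(f"unknown tool {layer}/{name}") from err

# A typical eyes -> hands -> measurement chain for an ejection-fraction query.
view = call_tool("perceptual", "classify_view", "study.dcm")
mask = call_tool("operational", "segment_lv", "frame_0")
ef   = call_tool("functional", "compute_ef", [mask])
print(view, ef)
```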
- Orchestrated Reasoning Hub (OR Hub): The core reasoning engine (a minimal adaptive-loop sketch follows this list).
- Knowledge Retrieval and Task Decomposition: Given a diagnostic query \(Q\), the hub retrieves the relevant knowledge base \(R_{a_q}\), decomposes it into an executable step sequence \(S=\{s_1,\ldots,s_n\}\), and maps each step to the optimal tool.
- Dynamic Reasoning Graph Construction: Incrementally constructs a multimodal reasoning graph \(G=(N,E)\), where nodes encode diagnostic concepts, evidence, and data anchors, and edges represent generation, support–contradiction, and derivation relationships.
- Adaptive Reasoning Workflow: Hypothesis confidence is assessed via Bayesian posterior estimation \(P(h_m|G(t)) \propto P(G(t)|h_m) \cdot P(h_m)\); when confidence falls below a threshold, supplementary examinations (e.g., switching views for re-measurement) are automatically triggered until the evidence graph reaches a consistency threshold.
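To make the adaptive workflow concrete, below is a minimal Python sketch of the confidence check, assuming a small discrete hypothesis set and hand-picked likelihood values; the 0.9 threshold and the two evidence rounds are illustrative assumptions, not values from the paper.

```python
def posterior(prior, likelihood):
    """P(h | G(t)) is proportional to P(G(t) | h) * P(h), normalized over the hypothesis set."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

prior = {"normal_ef": 0.5, "reduced_ef": 0.5}
evidence_rounds = [                                  # illustrative likelihoods P(G(t) | h)
    {"normal_ef": 0.4, "reduced_ef": 0.6},           # first measurement: ambiguous
    {"normal_ef": 0.1, "reduced_ef": 0.9},           # re-measured from another view
]

belief, threshold = prior, 0.9
for t, likelihood in enumerate(evidence_rounds):
    belief = posterior(belief, likelihood)           # fold new graph evidence into the belief
    top_h = max(belief, key=belief.get)
    if belief[top_h] >= threshold:
        break                                        # confident enough: stop gathering evidence
    print(f"round {t}: confidence {belief[top_h]:.2f} < {threshold}, requesting another measurement")
print("conclusion:", top_h, {h: round(p, 3) for h, p in belief.items()})
```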
Loss & Training
- The base MLLM is Qwen3-VL-Plus.
- Foundation models in the HC toolkit (EchoPrime, USFM) are fine-tuned separately on echocardiographic data.
- The knowledge base is dynamically retrieved via the RAG mechanism, requiring no end-to-end joint training.
- The CAMUS dataset is split into training/validation/test sets at a 7:1:2 ratio.
- EF computation follows Simpson's biplane method of disks (SMOD); a minimal computation sketch follows.
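For reference, here is a minimal Python sketch of Simpson's biplane method of disks; in practice the per-disk diameters would come from A4C and A2C segmentations, and the numbers below are made up purely for illustration.

```python
import math

def biplane_volume(diams_a4c, diams_a2c, long_axis_cm):
    """V = (pi/4) * sum_i a_i * b_i * (L / n_disks); diameters in cm give volume in mL."""
    n = len(diams_a4c)
    assert n == len(diams_a2c), "both views must be sliced into the same number of disks"
    disk_height = long_axis_cm / n
    return math.pi / 4 * sum(a * b for a, b in zip(diams_a4c, diams_a2c)) * disk_height

# Illustrative end-diastolic / end-systolic diameter profiles (20 disks each).
edv = biplane_volume([4.0] * 20, [3.8] * 20, long_axis_cm=8.5)
esv = biplane_volume([3.0] * 20, [2.8] * 20, long_axis_cm=7.5)
ef = (edv - esv) / edv * 100
print(f"EDV={edv:.1f} mL  ESV={esv:.1f} mL  EF={ef:.1f}%")  # grade against guideline thresholds
```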
Key Experimental Results
Main Results
Single-structure task (EF grading accuracy in %, CAMUS dataset); "E-H-M" denotes an eyes–hands–minds (tool-augmented) system:
| Method | Type | Normal Acc | Mildly Reduced Acc | Considerably Reduced Acc | Mean Acc |
|---|---|---|---|---|---|
| EchoONE | Task-specific | 74.00 | 64.00 | 80.00 | 72.67 |
| GPT-5 | General MLLM | 44.00 | 61.00 | 55.00 | 53.33 |
| GPT-5* (tool-augmented) | E-H-M | 78.00 | 69.00 | 89.00 | 78.67 |
| EchoAgent | E-H-M | 88.00 | 80.00 | 92.00 | 80.00 |
Multi-structure task (EchoQA per-category accuracy in %, MIMIC-EchoQA dataset):
| Method | Pericardium | Aortic Valve | Mitral Valve | Ventricles | Atria | Vessels | Others |
|---|---|---|---|---|---|---|---|
| GPT-5 | 60.98 | 40.91 | 36.78 | 26.32 | 36.99 | 38.71 | 44.44 |
| GPT-5* | 69.51 | 60.61 | 59.77 | 47.89 | 63.01 | 41.94 | 66.67 |
| EchoAgent | 84.15 | 82.58 | 81.61 | 75.26 | 80.82 | 77.42 | 70.37 |
EchoAgent achieves accuracy exceeding 70% across all 7 anatomical structure categories, outperforming the best-performing MLLM by an average of 31.45%.
Ablation Study
| Configuration | EF Grading Acc | EchoQA Acc | Notes |
|---|---|---|---|
| Baseline (eyes+minds) | 35.00 | 43.57 | Qwen3-VL-Plus only |
| +EDC (expert mind) | 50.00 (+15.00) | 51.45 (+7.88) | Domain knowledge added |
| +HC (skilled hands) | 73.00 (+37.00) | 59.97 (+16.40) | Operational tools added |
| +EDC+HC+OR (full) | 80.00 (+45.00) | 79.42 (+35.85) | Full collaboration |
Key Findings
- Adding tools alone (GPT-5*) yields substantial gains (+48.67%), yet still falls short of EchoAgent, demonstrating that tools, knowledge, and orchestration are all indispensable.
- EchoAgent achieves AUROC of 98.43%, 87.79%, and 93.88% at the three EF grading thresholds, indicating strong clinical utility.
- General MLLMs exhibit highly uneven performance across anatomical structures (e.g., GPT-5 attains only 26.32% on Ventricles), whereas EchoAgent maintains consistently high performance.
- Quantitative operational capability ("hands") contributes most to EF grading (+37%), while the knowledge engine is more critical for knowledge-intensive tasks.
Highlights & Insights
- Successful application of the agent paradigm: Modeling medical image analysis as an agent workflow rather than a single model is a promising direction. The "eyes–hands–minds" analogy is both intuitive and effective.
- Dynamic reasoning graph design: Constructing an incrementally built multimodal reasoning graph enables traceable reasoning, representing a significant improvement over black-box LLM outputs.
- Adaptive mechanism: The closed-loop design that automatically gathers supplementary evidence under low-confidence conditions mirrors the iterative confirmation process in actual clinical practice.
- Broad coverage: Support for 48 view types and 14 anatomical structures approaches the clinical scope of a comprehensive echocardiographic examination.
Limitations & Future Work
- Real-time performance unverified: The paper does not discuss inference latency; multi-round tool invocations may be time-consuming and potentially incompatible with real-time clinical requirements.
- Dependence on upstream model quality: The accuracy of segmentation models in the HC toolkit directly affects higher-level reasoning, introducing error propagation risks.
- Limited dataset scale: CAMUS contains only 500 cases and MIMIC-EchoQA only 622, leaving generalizability to be validated at larger scale.
- Knowledge base update mechanism unclear: As clinical guidelines evolve continuously, how the EDC engine incorporates new knowledge is not addressed.
- Insufficient comparison with other agent systems: Methods such as MedRAX and other medical agent frameworks are not included for comparison.
Related Work & Insights
- MedRAX: A conceptually similar medical agent approach, but targeting chest X-rays rather than echocardiography.
- EchoPrime / EchoONE: Serving as foundation models within EchoAgent's toolkit, these works demonstrate the value of domain-specific foundation models.
- LangChain: Provides the engineering foundation for the agent framework implementation.
- Insight: This paradigm could be extended to other complex medical imaging modalities (e.g., multi-sequence CT/MRI analysis), with the key challenge being the design of modality-specific toolkits.
Rating
- Novelty: ⭐⭐⭐⭐ — Applying the agent paradigm to echocardiography interpretation is relatively novel, though the overall framework of agent + RAG + tool calling is not entirely unprecedented.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets with comprehensive ablation and comparison, though dataset scale is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ — The "eyes–hands–minds" analogy is woven consistently throughout, with clear logic and excellent readability.
- Value: ⭐⭐⭐⭐ — Substantial potential for real-world medical AI applications, demonstrating the engineering value of the agent paradigm.