TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

Conference: ICLR 2026
arXiv: 2603.05867
Code: GitHub
Area: LLM Reasoning
Keywords: Tumor Analysis, Multimodal CoT Reasoning, Interleaved Reasoning, 3D CT, TNM Staging

TL;DR

This paper proposes TumorChain, an interleaved multimodal chain-of-thought reasoning framework for tumor analysis across five major digestive organs. It combines a knowledge graph-driven data engine that produces 1.5M CoT-VQA pairs, organ-guided iterative interleaved reasoning (IIR), and joint optimization of the segmentation, classification, and LLM components to build a complete reasoning chain from imaging findings → clinical impressions → pathological predictions. TumorChain reaches an average accuracy of 84.41%, well above GPT-5-Mini (51.59%).

Background & Motivation

Background: Medical VLMs have made progress in general report generation, but remain critically insufficient for the high-stakes domain of clinical oncology. Tumor analysis requires a complete reasoning chain connecting imaging findings, clinical impressions, and pathological endpoints (TNM staging).
Three Key Limitations: (1) Existing Med-VLMs lack tumor-specific capabilities and cannot reliably map radiological findings to pathology-level endpoints; (2) Large-scale, multi-granularity tumor-specific datasets are absent — existing benchmarks such as CT-RATE focus on multiple-choice or short-text QA and do not support CoT reasoning; (3) Most Med-VLMs are restricted to 2D images and single-step reasoning, whereas the structural complexity of 3D CT demands multi-step clinical reasoning.
Key Challenge: Clinical tumor diagnosis is an inherently multi-step reasoning process (detecting anomalies → synthesizing judgments → pathological staging), yet existing models cannot produce traceable reasoning chains, leaving the internal reasoning process opaque.
Key Insight: This work constructs a complete findings → impressions → pathology reasoning pipeline and introduces a dedicated CoT evaluation protocol (TumorChain-Eval) to assess the quality of each step in the reasoning chain.

Method

Overall Architecture

TumorChain comprises five modules: a 3D visual encoder \(\mathcal{E}_v\), an organ segmentation expert \(\mathcal{S}eg\), an auxiliary classification model \(\mathcal{C}ls\), an MLP projector \(\mathcal{P}\), and an LLM \(\mathcal{LLM}\). These modules collectively enable end-to-end tumor analysis through global-local visual alignment and interleaved multimodal reasoning.

Key Designs

1. Knowledge Graph-Driven CoT Data Engine (TumorCoT-1.5M):
- Raw data: 41,059 3D CT scans + 10,708 radiology reports + partial pathology reports, covering five major digestive organs: liver, pancreas, stomach, colon, and esophagus.
- Six collaborative agents, including a segmentation expert (TotalSegmentator), a structured feature extractor (Qwen3-235B), a CoT reasoner (GPT-4o-mini), a logic calibrator (Claude3.5-Haiku), and a summarizer (GPT-5-Mini).
- Diagnostic knowledge graph (KG) constraints: Five organ-specific KGs, co-constructed with radiologists and pathologists, guide reasoning chains toward clinical standards.
- Cross-validation mechanism: When the logic calibrator detects issues in a reasoning chain, one of two repair strategies is triggered (expanding the organ region or providing suspected causes) to guide re-reasoning.
- Final output: 1,497,818 CoT-VQA pairs covering four task types: localization, lesion attributes, TNM staging, and CoT reports.
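The cross-validation mechanism can be read as a bounded retry loop around the CoT reasoner. The sketch below is only an illustration under stated assumptions: `reason` and `calibrate` are hypothetical callables standing in for the CoT reasoner and the logic calibrator, the strategy names and the `roi_margin` / `hint` context keys are invented for the example, and both the retry budget and the alternation between the two repair strategies are assumptions that the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical labels for the two repair strategies described above.
EXPAND_REGION = "expand_organ_region"
SUGGEST_CAUSE = "provide_suspected_cause"


@dataclass
class Verdict:
    ok: bool
    issue: Optional[str] = None  # calibrator's description of the problem


def generate_validated_cot(
    reason: Callable[[dict], str],
    calibrate: Callable[[str], Verdict],
    context: dict,
    max_rounds: int = 3,  # assumption: the source gives no retry budget
) -> Optional[str]:
    """Return a CoT chain the calibrator accepts, or None if none is found."""
    strategies = [EXPAND_REGION, SUGGEST_CAUSE]
    for round_idx in range(max_rounds):
        chain = reason(context)
        verdict = calibrate(chain)
        if verdict.ok:
            return chain
        # Assumption: alternate between the two repair strategies on failure.
        strategy = strategies[round_idx % len(strategies)]
        if strategy == EXPAND_REGION:
            context = {**context, "roi_margin": context.get("roi_margin", 0) + 1}
        else:
            context = {**context, "hint": verdict.issue}
    return None
```

Returning `None` after the budget is exhausted lets a caller drop or manually review chains that never pass calibration; how the actual pipeline handles that case is not described in the source.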

2. Organ-Guided Iterative Interleaved Reasoning (IIR):
- Step I: The LLM receives global CT tokens and a task prompt to produce an initial diagnosis \(\mathcal{R}^1_{cot}\).
- Step II: The target organ is identified from the initial output → ROI is extracted via segmentation → an enhanced prompt ("more attention should be paid to [organ name]") is generated → local organ tokens are obtained.
- Step III: Global tokens + task prompt + initial answer + local tokens are jointly fed into the LLM for iterative reasoning; if additional relevant organs are identified, the loop continues.
- Effect: Simulates the clinical radiologist workflow: a global overview first, then focused inspection of suspicious regions with repeated confirmation.
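The three steps above can be sketched as a small loop. This is a minimal sketch, not the authors' implementation: `llm`, `find_organs`, and `local_tokens` are hypothetical callables standing in for the LLM, the organ parser, and the segmentation-plus-encoder path, tokens are represented as plain strings, and `max_rounds` is an assumed cap that the source does not state.

```python
from typing import Callable, List, Set


def iterative_interleaved_reasoning(
    llm: Callable[[List[str], str], str],
    find_organs: Callable[[str], List[str]],
    local_tokens: Callable[[str], List[str]],
    global_tokens: List[str],
    task_prompt: str,
    max_rounds: int = 3,  # assumption: the source does not state a cap
) -> str:
    # Step I: initial diagnosis from global tokens and the task prompt.
    answer = llm(global_tokens, task_prompt)
    visited: Set[str] = set()
    tokens = list(global_tokens)
    for _ in range(max_rounds):
        # Step II: organs named in the latest answer that were not yet inspected.
        new_organs = [o for o in find_organs(answer) if o not in visited]
        if not new_organs:
            break
        for organ in new_organs:
            visited.add(organ)
            tokens += local_tokens(organ)  # ROI tokens for this organ
        focus = ", ".join(sorted(visited))
        prompt = (
            f"{task_prompt}\nPrevious answer: {answer}\n"
            f"More attention should be paid to {focus}."
        )
        # Step III: re-reason over global + local tokens and the prior answer.
        answer = llm(tokens, prompt)
    return answer
```

The loop stops as soon as a round names no organ that has not already been inspected, which mirrors the "if additional relevant organs are identified, the loop continues" condition.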

3. Hybrid Collaborative Optimization (HCO):
- Segmentation model: Continuously provides accurate ROI localization.
- Classification model: Trained on local organ features for normal/abnormal binary classification, enhancing the visual encoder's discriminative power for subtle anomalies.
- LLM: Integrates reasoning results and leverages the segmentation model for iterative decision-making.
- Joint loss: \(L_{total} = L_{LLM} + \lambda L_{cls}\), where \(\lambda\) weights the classification term.
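To make the joint loss concrete, the sketch below evaluates \(L_{total} = L_{LLM} + \lambda L_{cls}\) on toy numbers. The specific forms are assumptions: mean token negative log-likelihood stands in for \(L_{LLM}\), binary cross-entropy stands in for \(L_{cls}\) (plausible for a normal/abnormal classifier, but not confirmed by the source), and the default `lam` is a hypothetical value since the source does not report \(\lambda\).

```python
import math
from typing import Sequence


def token_nll(probs_of_target: Sequence[float]) -> float:
    """Mean negative log-likelihood over target tokens (stand-in for L_LLM)."""
    return -sum(math.log(p) for p in probs_of_target) / len(probs_of_target)


def binary_cross_entropy(p_abnormal: float, label: int) -> float:
    """BCE for a normal/abnormal organ classifier (stand-in for L_cls)."""
    return -(label * math.log(p_abnormal) + (1 - label) * math.log(1.0 - p_abnormal))


def joint_loss(
    probs_of_target: Sequence[float],
    p_abnormal: float,
    label: int,
    lam: float = 0.5,  # hypothetical: the source does not report lambda
) -> float:
    """L_total = L_LLM + lam * L_cls."""
    return token_nll(probs_of_target) + lam * binary_cross_entropy(p_abnormal, label)
```

Setting `lam=0` recovers plain language-model training, so the classification term acts as an auxiliary signal whose strength is controlled by a single scalar.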

4. TumorChain-Eval Evaluation Protocol:
- Subject-predicate-object triplets are extracted from CoT reasoning chains (e.g., "pancreatic tail – finding – malignancy").
- Three-level scoring: finding chain \(S_{FC}\) (individual facts) → impression chain \(S_{IC}\) (synthesis of multiple findings) → long reasoning chain \(S_{LRC}\) (high-level inference).
- GPT-4 is used to score against defined rubrics; \(CoT_e\) is the weighted sum of all three levels.
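The scoring structure can be illustrated with a triplet type and a weighted sum. This is a sketch under assumptions: the `Triplet` class and the level keys `FC`, `IC`, `LRC` are names chosen for the example, and the equal weights are hypothetical because the source states that \(CoT_e\) is a weighted sum without giving the weights. The per-level scores would come from the GPT-4 rubric scoring, which is not modeled here.

```python
from typing import Mapping, NamedTuple


class Triplet(NamedTuple):
    subject: str
    predicate: str
    object: str


# Example triplet taken from the text above.
example = Triplet("pancreatic tail", "finding", "malignancy")

# Hypothetical equal weights; the source does not give the actual values.
DEFAULT_WEIGHTS: Mapping[str, float] = {"FC": 1 / 3, "IC": 1 / 3, "LRC": 1 / 3}


def cot_e(scores: Mapping[str, float],
          weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of finding-, impression-, and long-reasoning-chain scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same levels")
    return sum(weights[level] * scores[level] for level in weights)
```

Keeping the three level scores separate until the final sum makes it possible to report where a chain fails, for example correct findings but a wrong long-range inference.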

Key Experimental Results

Main Results

Method	Avg. Accuracy	TNM-T	TNM-N	TNM-M	\(CoT_e\) Score
GPT-5-Mini	51.59%	—	—	—	61.23
Gemini2.0	41.29%	—	—	—	54.28
TumorChain-7B	84.41%	88.83%	61.63%	71.07%	58.33

Ablation Study

Configuration	Avg. Accuracy	Note
Full TumorChain	84.41%	Complete framework
w/o IIR	80.34% (−4.07 pp)	IIR is the largest contributor
w/o CoT	82.45% (−1.96 pp)	CoT data also contributes
w/o Classification Model	82.93% (−1.48 pp)	Auxiliary classification enhances discrimination
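The drops in the ablation table are differences in percentage points relative to the full model. A quick arithmetic check, using only the accuracies reported in the table, reproduces them and ranks the components by contribution:

```python
FULL = 84.41  # average accuracy of the full model, in percent

ablations = {
    "w/o IIR": 80.34,
    "w/o CoT": 82.45,
    "w/o Classification Model": 82.93,
}

# Drop in percentage points relative to the full model.
drops = {name: round(FULL - acc, 2) for name, acc in ablations.items()}

# Largest drop first, i.e. the most important component first.
ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop:.2f} pp")
```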

Key Findings

Localization accuracy is near-perfect: organ-level 99.97%, position-level 97.57%, substantially surpassing all baselines.
IIR contributes the most (accuracy drops by 4.07 percentage points when it is removed); iterative refinement is the core mechanism, mirroring the "scan → focus → re-examine" radiologist workflow.
Zero-shot generalization on the public DeepTumorVQA benchmark: 73.30% vs. 56.41% for MedVLM-R1, suggesting strong domain transferability.
TNM-N (lymph node metastasis) yields the lowest accuracy (61.63%), consistent with its difficulty in clinical practice.

Highlights & Insights

Complete clinical reasoning pipeline: The three-level reasoning chain design of findings → impressions → pathology ensures traceability and interpretability.
Knowledge graph-driven data engine: Automatic generation of 1.5M high-quality CoT samples addresses the scarcity of tumor-specific annotated data.
Iterative Interleaved Reasoning (IIR): Fuses global context with local evidence over multiple rounds of self-verification, which is intended to reduce hallucination risk.
Triplet-based evaluation protocol: Extracts structured knowledge from CoT chains for scoring, providing finer granularity than end-to-end metrics.

Limitations & Future Work

Iterative reasoning introduces a latency of 2.51 seconds per sample, requiring acceleration for real-time clinical deployment.
CoT evaluation relies on GPT-4 scoring, which may introduce systematic bias.
Coverage is currently limited to five digestive organs; generalizability to other domains (e.g., lung, breast) remains to be validated.
TNM-N staging accuracy of only 61.63% indicates that lymph node metastasis assessment remains a critical challenge.
Training data originate from multi-center Chinese hospitals; cross-regional and cross-device generalization requires further verification.
The absence of comparison experiments with specialist physicians limits conclusions regarding clinical deployment value.

Related Work & Insights

Compared to general medical VLM datasets such as CT-RATE and 3D-RAD, TumorCoT-1.5M is the first large-scale dataset to provide tumor-specific CoT annotations.
Compared to medical reasoning models such as MedVLM-R1, TumorChain achieves deeper multi-step reasoning through iterative interleaved reasoning.
The IIR design pattern (LLM → ROI identification → segmentation → local feature injection → re-reasoning) may generalize to other medical imaging tasks that require spatial refinement.

Rating

Novelty: ⭐⭐⭐⭐⭐ (First multimodal CoT reasoning framework specifically targeting oncology)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (1.5M data / multi-task evaluation / generalization / ablation)
Writing Quality: ⭐⭐⭐⭐ (Deep clinical motivation, complete technical details)
Value: ⭐⭐⭐⭐⭐ (Important tool for precision oncology with strong clinical translation potential)