# 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

- Conference: NeurIPS 2025
- arXiv: 2506.11147
- Code: https://github.com/Tang-xiaoxiao/3D-RAD
- Area: Medical VQA / 3D Medical Image Understanding / Multimodal
- Keywords: 3D Med-VQA, CT imaging, multi-temporal reasoning, longitudinal diagnosis, benchmark
## TL;DR
This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, totaling 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), of which 136K form the training set and 34K the evaluation benchmark. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.
## Background & Motivation
Existing Med-VQA datasets face three major bottlenecks:

1. Dimensionality constraints: The vast majority are based on 2D images or 2D slices extracted from 3D volumes, discarding volumetric spatial relationships — yet clinical diagnosis (CT/MRI) is inherently dependent on 3D information.
2. Task homogeneity: Most datasets consist of simple multiple-choice or short (3–5 word) answers, lacking quantitative computation, temporal analysis, and other real-world clinical scenarios.
3. Insufficient scale and granularity: For instance, VQA-RAD contains only 315 images and SLAKE only 642, which is inadequate for large-scale training and comprehensive evaluation.
Furthermore, real radiology workflows extensively involve follow-up comparison — comparing scans from different time points to determine whether lesions are new, resolved, or persistent — yet no existing Med-VQA dataset supports this form of multi-temporal reasoning.
## Core Problem
How can a large-scale, multi-task Medical VQA benchmark be constructed that supports 3D volumetric input and multi-temporal reasoning, so as to comprehensively evaluate and advance VLM capabilities in real-world 3D radiology scenarios?
## Method

### Overall Architecture
3D-RAD is built upon the CT-RATE dataset (16,188 CT scans from 11,255 patients) via a semi-automated pipeline, yielding:

- 3D-RAD-Bench (evaluation set): 33,910 QA pairs over 2,662 3D images
- 3D-RAD-T (training set): 136,195 QA pairs over 13,526 3D images
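For orientation, a single record in such a release typically bundles the volume reference, task type, question, and answer. The field names below are illustrative assumptions, not the released schema:

```python
# Hypothetical 3D-RAD QA record layout; field names are illustrative
# assumptions, not taken from the released files.
qa_record = {
    "volume_id": "ct_rate_00123",   # reference to the source 3D CT volume
    "patient_id": "p_00042",        # enables longitudinal grouping (Tasks 5-6)
    "task": 6,                      # 1-3 open-ended, 4-6 closed-ended
    "question": "Given the diagnostic history [1, 0, 1], is the anomaly "
                "persistent, resolved, new, or absent in the current scan?",
    "history_labels": [1, 0, 1],    # prior binary labels (Task 6 only)
    "answer": "persistent",
}
```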
The six task categories split into open-ended (Tasks 1–3) and closed-ended (Tasks 4–6) groups, detailed below:
### Key Designs
- Task 1 — Anomaly Detection: Identifies anomalous patterns in 3D CT scans, outputting anomaly type, characteristics, and location. Divided into four sub-tasks: disease diagnosis, anomaly type, anomaly characteristics, and anomaly location. Open-ended response format.
- Task 2 — Image Observation: Analyzes descriptive information in medical images, including recognition of both normal and abnormal structures (e.g., cardiac stents), assessing the model's basic perceptual capabilities.
- Task 3 — Medical Computation: Performs quantitative reasoning on 3D medical images, such as measuring nodule diameter and wall thickness. Assesses numerical reasoning ability.
- Task 4 — Existence Detection: Binary classification (yes/no) for 18 predefined anomaly categories, evaluating generalization across pathological classes.
- Task 5 — Static Temporal Diagnosis [Novel Task]: Given only the current 3D scan (no historical information), the model infers the temporal status of lesions (persistent / resolved / new / no anomaly). Simulates implicit temporal reasoning in the absence of prior records.
- Task 6 — Longitudinal Temporal Diagnosis [Novel Task]: Provides a sequence of historical diagnostic labels (e.g., [1, 0, 1]) alongside the current scan, and asks the model to determine the same four temporal states (see the sketch after this list). Evaluates the model's ability to integrate explicit temporal context.
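To make the label semantics of Tasks 5 and 6 concrete, here is a minimal sketch of one plausible rule for deriving the four temporal states from a patient's binary label history. It assumes the state is judged against the most recent prior scan; the paper's exact derivation may differ (e.g., it may compare against the full history):

```python
def temporal_status(history: list[int], current: int) -> str:
    """Derive the four-way temporal state of Tasks 5-6 from binary
    per-anomaly labels (1 = present, 0 = absent).

    Assumption: the state is judged against the most recent prior scan;
    an empty history counts as no prior finding.
    """
    prev = history[-1] if history else 0
    if current == 1:
        return "persistent" if prev == 1 else "new"
    return "resolved" if prev == 1 else "no anomaly"

# Example: the history [1, 0, 1] from Task 6, with the anomaly present now.
assert temporal_status([1, 0, 1], current=1) == "persistent"
assert temporal_status([0, 0], current=1) == "new"
assert temporal_status([1], current=0) == "resolved"
```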
### Data Construction Pipeline
- Open-ended QA (Tasks 1–3): Extracted from the Findings/Impression fields of clinical reports; QA pairs are generated by GPT-4o-mini using prompt templates. A "6W" question framework (what, where, which, etc.) ensures diversity, with answers constrained to approximately five words.
- Closed-ended QA (Tasks 4–6): Generated directly from CT-RATE's multi-label annotations via templates (see the sketch after this list). Tasks 5–6 require longitudinal label comparison across multiple scans from the same patient.
- Quality Control: GPT-based scoring along five dimensions (visual verifiability, specificity and clarity, answer appropriateness, QA alignment, linguistic quality); pairs scoring below 3 are discarded; high-frequency QA pairs are capped at 10 per question; manual verification of 600 samples yields a 96.17% agreement rate.
- Cross-LLM Consistency Validation: DeepSeek-R1, LLaMA3-70B, and LLaMA3-8B are used to cross-validate the reliability of GPT-4o-mini's scoring.
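As a rough sketch of the closed-ended generation and the score-based filter described above (the template wording, anomaly names, and per-dimension thresholding are assumptions in the spirit of the pipeline, not the paper's exact implementation):

```python
ANOMALIES = ["pleural effusion", "cardiomegaly", "lung nodule"]  # 18 categories in the paper

def make_existence_qa(volume_id: str, labels: dict[str, int]) -> list[dict]:
    """Turn CT-RATE multi-label annotations into Task 4 yes/no QA pairs
    via a fixed template (wording here is illustrative)."""
    return [
        {
            "volume_id": volume_id,
            "task": 4,
            "question": f"Is there evidence of {name} in this CT scan?",
            "answer": "yes" if labels.get(name, 0) else "no",
        }
        for name in ANOMALIES
    ]

def passes_quality_control(scores: dict[str, float], threshold: float = 3.0) -> bool:
    """Keep a QA pair only if its GPT-scored dimensions meet the threshold.
    The paper discards pairs scoring below 3; applying the threshold per
    dimension (rather than to the average) is an assumption here."""
    return all(s >= threshold for s in scores.values())
```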
## Key Experimental Results

### Zero-Shot Evaluation (Existing 3D Med-VLMs)
| Task | Metric | RadFM (13B) | M3D (7B) | M3D (4B) | OmniV (1.5B) |
|---|---|---|---|---|---|
| Task 1 Anomaly Detection | ROUGE | 17.62 | 18.64 | 23.19 | 25.72 |
| Task 2 Image Observation | ROUGE | 19.14 | 20.82 | 23.19 | 26.69 |
| Task 3 Medical Computation | ROUGE | 6.62 | 23.24 | 5.63 | 7.88 |
| Task 4 Existence Detection | ACC | 29.20 | 18.00 | 40.25 | 28.66 |
| Task 5 Static Temporal | ACC | 44.11 | 25.47 | 25.40 | 22.96 |
| Task 6 Longitudinal Temporal | ACC | 42.99 | 24.17 | 24.31 | 24.23 |
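For reference, ROUGE on short open-ended answers is typically computed along these lines; a minimal sketch using the `rouge-score` package, which may differ from the paper's actual evaluation script (the exact ROUGE variant is an assumption):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Overlap between a reference answer and a model prediction for the
# open-ended Tasks 1-3 (ROUGE-L chosen here as an assumption).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    target="small nodule in right upper lobe",
    prediction="nodule in the right upper lobe",
)
print(scores["rougeL"].fmeasure)  # F-measure in [0, 1]
```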
### Fine-Tuning Performance (M3D-RAD)
| Task | Metric | M3D (4B) Zero-shot | M3D-RAD (4B) Fine-tuned | Gain |
|---|---|---|---|---|
| Task 1 Anomaly Detection | ROUGE | 23.19 | 42.45 | +19.26 |
| Task 2 Image Observation | ROUGE | 23.19 | 50.52 | +27.33 |
| Task 3 Medical Computation | ROUGE | 5.63 | 36.46 | +30.83 |
| Task 4 Existence Detection | ACC | 40.25 | 82.43 | +42.18 |
| Task 5 Static Temporal | ACC | 25.40 | 49.30 | +23.90 |
| Task 6 Longitudinal Temporal | ACC | 24.31 | 74.77 | +50.46 |
### Ablation Study
- Data scaling effect: Performance improves consistently across all tasks as training data scales from 1% → 10% → 100%; however, Tasks 5/6 exhibit high variance across data scales, indicating that current architectures lack the inductive biases necessary for temporal reasoning.
- Cross-task transfer: Single-task fine-tuning also improves performance on other tasks, but joint training across all tasks yields the best results — demonstrating that the value of the dataset lies not only in scale but also in diverse domain knowledge.
- General VLM evaluation: General-purpose models such as LLaVA-OneVision and Qwen2.5-VL also perform poorly on Tasks 4–6, while the fine-tuned M3D-RAD consistently outperforms them.
## Highlights & Insights
- First large-scale 3D Med-VQA benchmark: 170K QA pairs across six task categories, systematically covering clinical needs from perception to reasoning.
- Novel temporal reasoning tasks: Tasks 5 (static temporal) and 6 (longitudinal temporal) represent the first systematic introduction of multi-temporal diagnosis in Med-VQA, closely mirroring real follow-up scenarios.
- Rigorous quality control: Multi-LLM cross-validation, five-dimensional scoring, and manual verification achieve a 96.17% agreement rate.
- Practical value of the training set: The 136K high-quality training samples lift the fine-tuned model on the longitudinal temporal task from ~24% to ~75% accuracy, demonstrating the data's effectiveness.
- Revealing finding: Temporal reasoning capability does not emerge from pretraining alone and must be acquired through explicit supervised learning.
## Limitations & Future Work
- Limited temporal information: Task 6 provides only binary diagnostic label sequences (0/1) without leveraging complete 3D scans from multiple time points — richer spatial morphological change information remains unused.
- Model input constraints: Existing VLMs do not support simultaneous input of multiple 3D volumes, making true longitudinal comparative reasoning infeasible.
- Absence of open-ended temporal tasks: Tasks 5–6 are closed-ended four-class classification only; open-ended temporal analysis has not been introduced.
- Single data source: The dataset is based solely on chest CT from CT-RATE, without coverage of other anatomical regions (e.g., brain, abdomen) or MRI modalities.
- Dependence on LLM-generated QA: QA pairs generated by GPT-4o-mini may introduce inherent biases or hallucinations, despite human spot-checking.
## Related Work & Insights
- vs. VQA-RAD / SLAKE / PathVQA: These are 2D datasets of limited scale (hundreds to thousands of images) with simple tasks, offering no 3D support and no temporal reasoning.
- vs. M3D-VQA: Also a 3D Med-VQA dataset, but M3D-VQA contains only 120K QA pairs across five task categories, with no multi-temporal reasoning.
- vs. RadFM: RadFM is a unified 2D/3D foundation model but lacks systematic benchmark evaluation and the temporal dimension.
- vs. CT-RATE: 3D-RAD is built upon CT-RATE, constructing structured VQA tasks on top of its radiology reports and multi-label annotations.
Connections to related ideas:
- Related to idea 20260316_2d_to_3d_medical_distill: 3D-RAD reveals the data bottleneck in 3D medical understanding; 2D→3D distillation could serve as a pretraining strategy to improve the foundational capabilities of 3D VLMs.
- Related to idea 20260317_multi_agent_medical_diagnosis: The six task categories in 3D-RAD (detection → observation → computation → classification → temporal reasoning) naturally correspond to a multi-agent division of labor.
- Key insight: The paper exposes an important gap — existing 3D VLMs entirely lack the capability to jointly process multiple 3D volumes as input. Designing an efficient representation and alignment method for multi-time-point 3D scans would constitute a high-impact research direction. The low ceiling of Task 5 (only 49.3% after fine-tuning) demonstrates that inferring temporal status from a single frame is extremely difficult, whereas Task 6 achieves 74.8% via simple label sequences — highlighting the substantial value of explicit temporal signals.
## Rating
- Novelty: ⭐⭐⭐⭐ First large-scale 3D Med-VQA benchmark with multi-temporal reasoning; Tasks 5/6 are creatively designed — though the contribution remains fundamentally that of a dataset paper.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 specialized 3D Med-VLMs and multiple general-purpose VLMs; includes zero-shot, fine-tuning, data scaling ablation, cross-task ablation, and failure case analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rich figures; however, the related work section over-cites the authors' own prior work.
- Value: ⭐⭐⭐⭐⭐ Fills a critical gap in 3D Med-VQA; open-source dataset and code; the 136K training set has strong practical utility and reveals a key bottleneck in temporal reasoning.