# 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

- Conference: NeurIPS 2025
- arXiv: 2506.11147
- Code: https://github.com/Tang-xiaoxiao/3D-RAD
- Area: Medical VQA / 3D Medical Image Understanding / Multimodal
- Keywords: 3D Med-VQA, CT imaging, multi-temporal reasoning, longitudinal diagnosis, benchmark
## TL;DR
This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, totaling 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), of which 136K form the training set and 34K the evaluation benchmark. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.
## Background & Motivation
Existing Med-VQA datasets face three major bottlenecks:

1. Dimensionality constraints: The vast majority are based on 2D images or 2D slices extracted from 3D volumes, discarding volumetric spatial relationships — yet clinical diagnosis (CT/MRI) is inherently dependent on 3D information.
2. Task homogeneity: Most datasets consist of simple multiple-choice or short (3–5 word) answers, lacking quantitative computation, temporal analysis, and other real-world clinical scenarios.
3. Insufficient scale and granularity: For instance, VQA-RAD contains only 315 images and SLAKE only 642, which is inadequate for large-scale training and comprehensive evaluation.
Furthermore, real radiology workflows extensively involve follow-up comparison — comparing scans from different time points to determine whether lesions are new, resolved, or persistent — yet no existing Med-VQA dataset supports this form of multi-temporal reasoning.
## Core Problem
How can a large-scale, multi-task Medical VQA benchmark be constructed that supports 3D volumetric input and multi-temporal reasoning, so as to comprehensively evaluate and advance VLM capabilities in real-world 3D radiology scenarios?
## Method

### Overall Architecture
3D-RAD is built upon the CT-RATE dataset (16,188 CT scans from 11,255 patients) via a semi-automated pipeline, yielding:

- 3D-RAD-Bench (evaluation set): 33,910 QA pairs over 2,662 3D images
- 3D-RAD-T (training set): 136,195 QA pairs over 13,526 3D images
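For orientation, a single record in such a release typically bundles the volume reference, task type, question, and answer. The field names below are illustrative assumptions, not the released schema:

```python
# Hypothetical 3D-RAD QA record layout; field names are illustrative
# assumptions, not taken from the released files.
qa_record = {
    "volume_id": "ct_rate_00123",   # reference to the source 3D CT volume
    "patient_id": "p_00042",        # enables longitudinal grouping (Tasks 5-6)
    "task": 6,                      # 1-3 open-ended, 4-6 closed-ended
    "question": "Given the diagnostic history [1, 0, 1], is the anomaly "
                "persistent, resolved, new, or absent in the current scan?",
    "history_labels": [1, 0, 1],    # prior binary labels (Task 6 only)
    "answer": "persistent",
}
```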
The six task categories split into open-ended (Tasks 1–3) and closed-ended (Tasks 4–6) groups, detailed below:
### Key Designs
- Task 1 — Anomaly Detection: Identifies anomalous patterns in 3D CT scans, outputting anomaly type, characteristics, and location. Divided into four sub-tasks: disease diagnosis, anomaly type, anomaly characteristics, and anomaly location. Open-ended response format.
- Task 2 — Image Observation: Analyzes descriptive information in medical images, including recognition of both normal and abnormal structures (e.g., cardiac stents), assessing the model's basic perceptual capabilities.
- Task 3 — Medical Computation: Performs quantitative reasoning on 3D medical images, such as measuring nodule diameter and wall thickness. Assesses numerical reasoning ability.
- Task 4 — Existence Detection: Binary classification (yes/no) for 18 predefined anomaly categories, evaluating generalization across pathological classes.
- Task 5 — Static Temporal Diagnosis [Novel Task]: Given only the current 3D scan (no historical information), the model infers the temporal status of lesions (persistent / resolved / new / no anomaly). Simulates implicit temporal reasoning in the absence of prior records.
- Task 6 — Longitudinal Temporal Diagnosis [Novel Task]: Provides a sequence of historical diagnostic labels (e.g., [1, 0, 1]) alongside the current scan, and asks the model to determine the same four temporal states (see the sketch after this list). Evaluates the model's ability to integrate explicit temporal context.
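To make the label semantics of Tasks 5 and 6 concrete, here is a minimal sketch of one plausible rule for deriving the four temporal states from a patient's binary label history. It assumes the state is judged against the most recent prior scan; the paper's exact derivation may differ (e.g., it may compare against the full history):

```python
def temporal_status(history: list[int], current: int) -> str:
    """Derive the four-way temporal state of Tasks 5-6 from binary
    per-anomaly labels (1 = present, 0 = absent).

    Assumption: the state is judged against the most recent prior scan;
    an empty history counts as no prior finding.
    """
    prev = history[-1] if history else 0
    if current == 1:
        return "persistent" if prev == 1 else "new"
    return "resolved" if prev == 1 else "no anomaly"

# Example: the history [1, 0, 1] from Task 6, with the anomaly present now.
assert temporal_status([1, 0, 1], current=1) == "persistent"
assert temporal_status([0, 0], current=1) == "new"
assert temporal_status([1], current=0) == "resolved"
```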
### Data Construction Pipeline
- Open-ended QA (Tasks 1–3): Extracted from the Findings/Impression fields of clinical reports; QA pairs are generated by GPT-4o-mini using prompt templates. A "6W" question framework (what, where, which, etc.) ensures diversity, with answers constrained to approximately five words.
- Closed-ended QA (Tasks 4–6): Generated directly from CT-RATE's multi-label annotations via templates (see the sketch after this list). Tasks 5–6 require longitudinal label comparison across multiple scans from the same patient.
- Quality Control: GPT-based scoring along five dimensions (visual verifiability, specificity and clarity, answer appropriateness, QA alignment, linguistic quality); pairs scoring below 3 are discarded; high-frequency QA pairs are capped at 10 per question; manual verification of 600 samples yields a 96.17% agreement rate.
- Cross-LLM Consistency Validation: DeepSeek-R1, LLaMA3-70B, and LLaMA3-8B are used to cross-validate the reliability of GPT-4o-mini's scoring.
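As a rough sketch of the closed-ended generation and the score-based filter described above (the template wording, anomaly names, and per-dimension thresholding are assumptions in the spirit of the pipeline, not the paper's exact implementation):

```python
ANOMALIES = ["pleural effusion", "cardiomegaly", "lung nodule"]  # 18 categories in the paper

def make_existence_qa(volume_id: str, labels: dict[str, int]) -> list[dict]:
    """Turn CT-RATE multi-label annotations into Task 4 yes/no QA pairs
    via a fixed template (wording here is illustrative)."""
    return [
        {
            "volume_id": volume_id,
            "task": 4,
            "question": f"Is there evidence of {name} in this CT scan?",
            "answer": "yes" if labels.get(name, 0) else "no",
        }
        for name in ANOMALIES
    ]

def passes_quality_control(scores: dict[str, float], threshold: float = 3.0) -> bool:
    """Keep a QA pair only if its GPT-scored dimensions meet the threshold.
    The paper discards pairs scoring below 3; applying the threshold per
    dimension (rather than to the average) is an assumption here."""
    return all(s >= threshold for s in scores.values())
```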
## Key Experimental Results

### Zero-Shot Evaluation (Existing 3D Med-VLMs)
| Task | Metric | RadFM (13B) | M3D (7B) | M3D (4B) | OmniV (1.5B) |
|---|---|---|---|---|---|
| Task 1 Anomaly Detection | ROUGE | 17.62 | 18.64 | 23.19 | 25.72 |
| Task 2 Image Observation | ROUGE | 19.14 | 20.82 | 23.19 | 26.69 |
| Task 3 Medical Computation | ROUGE | 6.62 | 23.24 | 5.63 | 7.88 |
| Task 4 Existence Detection | ACC | 29.20 | 18.00 | 40.25 | 28.66 |
| Task 5 Static Temporal | ACC | 44.11 | 25.47 | 25.40 | 22.96 |
| Task 6 Longitudinal Temporal | ACC | 42.99 | 24.17 | 24.31 | 24.23 |
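For reference, ROUGE on short open-ended answers is typically computed along these lines; a minimal sketch using the `rouge-score` package, which may differ from the paper's actual evaluation script (the exact ROUGE variant is an assumption):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Overlap between a reference answer and a model prediction for the
# open-ended Tasks 1-3 (ROUGE-L chosen here as an assumption).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(
    target="small nodule in right upper lobe",
    prediction="nodule in the right upper lobe",
)
print(scores["rougeL"].fmeasure)  # F-measure in [0, 1]
```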
### Fine-Tuning Performance (M3D-RAD)
| Task | Metric | M3D (4B) Zero-shot | M3D-RAD (4B) Fine-tuned | Gain |
|---|---|---|---|---|
| Task 1 Anomaly Detection | ROUGE | 23.19 | 42.45 | +19.26 |
| Task 2 Image Observation | ROUGE | 23.19 | 50.52 | +27.33 |
| Task 3 Medical Computation | ROUGE | 5.63 | 36.46 | +30.83 |
| Task 4 Existence Detection | ACC | 40.25 | 82.43 | +42.18 |
| Task 5 Static Temporal | ACC | 25.40 | 49.30 | +23.90 |
| Task 6 Longitudinal Temporal | ACC | 24.31 | 74.77 | +50.46 |
### Ablation Study
- Data scaling effect: Performance improves consistently across all tasks as training data scales from 1% → 10% → 100%; however, Tasks 5/6 exhibit high variance across data scales, indicating that current architectures lack the inductive biases necessary for temporal reasoning.
- Cross-task transfer: Single-task fine-tuning also improves performance on other tasks, but joint training across all tasks yields the best results — demonstrating that the value of the dataset lies not only in scale but also in diverse domain knowledge.
- General VLM evaluation: General-purpose models such as LLaVA-OneVision and Qwen2.5-VL also perform poorly on Tasks 4–6, while the fine-tuned M3D-RAD consistently outperforms them.
## Highlights & Insights
- First large-scale 3D Med-VQA benchmark: 170K QA pairs across six task categories, systematically covering clinical needs from perception to reasoning.
- Novel temporal reasoning tasks: Tasks 5 (static temporal) and 6 (longitudinal temporal) represent the first systematic introduction of multi-temporal diagnosis in Med-VQA, closely mirroring real follow-up scenarios.
- Rigorous quality control: Multi-LLM cross-validation, five-dimensional scoring, and manual verification achieve a 96.17% agreement rate.
- Practical value of the training set: The 136K high-quality training samples lift the fine-tuned model on the longitudinal temporal task from ~24% to ~75% accuracy, demonstrating the data's effectiveness.
- Revealing finding: Temporal reasoning capability does not emerge from pretraining alone and must be acquired through explicit supervised learning.
## Limitations & Future Work
- Limited temporal information: Task 6 provides only binary diagnostic label sequences (0/1) without leveraging complete 3D scans from multiple time points — richer spatial morphological change information remains unused.
- Model input constraints: Existing VLMs do not support simultaneous input of multiple 3D volumes, making true longitudinal comparative reasoning infeasible.
- Absence of open-ended temporal tasks: Tasks 5–6 are closed-ended four-class classification only; open-ended temporal analysis has not been introduced.
- Single data source: The dataset is based solely on chest CT from CT-RATE, without coverage of other anatomical regions (e.g., brain, abdomen) or MRI modalities.
- Dependence on LLM-generated QA: QA pairs generated by GPT-4o-mini may introduce inherent biases or hallucinations, despite human spot-checking.
## Related Work & Insights
- vs. VQA-RAD / SLAKE / PathVQA: These are 2D datasets of limited scale (hundreds to thousands of images) with simple tasks, offering no 3D support and no temporal reasoning.
- vs. M3D-VQA: Also a 3D Med-VQA dataset, but M3D-VQA contains only 120K QA pairs across five task categories, with no multi-temporal reasoning.
- vs. RadFM: RadFM is a unified 2D/3D foundation model but lacks systematic benchmark evaluation and the temporal dimension.
- vs. CT-RATE: 3D-RAD is built upon CT-RATE, constructing structured VQA tasks on top of its radiology reports and multi-label annotations.
Connections to related ideas:
- Related to idea 20260316_2d_to_3d_medical_distill: 3D-RAD reveals the data bottleneck in 3D medical understanding; 2D→3D distillation could serve as a pretraining strategy to improve the foundational capabilities of 3D VLMs.
- Related to idea 20260317_multi_agent_medical_diagnosis: The six task categories in 3D-RAD (detection → observation → computation → classification → temporal reasoning) naturally correspond to a multi-agent division of labor.
- Key insight: The paper exposes an important gap — existing 3D VLMs entirely lack the capability to jointly process multiple 3D volumes as input. Designing an efficient representation and alignment method for multi-time-point 3D scans would constitute a high-impact research direction. The low ceiling of Task 5 (only 49.3% after fine-tuning) demonstrates that inferring temporal status from a single frame is extremely difficult, whereas Task 6 achieves 74.8% via simple label sequences — highlighting the substantial value of explicit temporal signals.
## Rating
- Novelty: ⭐⭐⭐⭐ First large-scale 3D Med-VQA benchmark with multi-temporal reasoning; Tasks 5/6 are creatively designed — though the contribution remains fundamentally that of a dataset paper.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 specialized 3D Med-VLMs and multiple general-purpose VLMs; includes zero-shot, fine-tuning, data scaling ablation, cross-task ablation, and failure case analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rich figures; however, the related work section over-cites the authors' own prior work.
- Value: ⭐⭐⭐⭐⭐ Fills a critical gap in 3D Med-VQA; open-source dataset and code; the 136K training set has strong practical utility and reveals a key bottleneck in temporal reasoning.