3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

Conference: NeurIPS 2025 arXiv: 2506.11147 Code: https://github.com/Tang-xiaoxiao/3D-RAD Area: Medical VQA / 3D Medical Image Understanding / Multimodal Keywords: 3D Med-VQA, CT imaging, multi-temporal reasoning, longitudinal diagnosis, benchmark

TL;DR

This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, comprising 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), accompanied by a 136K training set. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.

Background & Motivation

Existing Med-VQA datasets face three major bottlenecks:

  1. Dimensionality constraints: The vast majority are based on 2D images or 2D slices extracted from 3D volumes, discarding volumetric spatial relationships — yet clinical diagnosis (CT/MRI) is inherently dependent on 3D information.

  2. Task homogeneity: Most datasets consist of simple multiple-choice or short (3–5 word) answers, lacking quantitative computation, temporal analysis, and other real-world clinical scenarios.

  3. Insufficient scale and granularity: For instance, VQA-RAD contains only 315 images and SLAKE only 642, which is inadequate for large-scale training and comprehensive evaluation.

Furthermore, real radiology workflows extensively involve follow-up comparison — comparing scans from different time points to determine whether lesions are new, resolved, or persistent — yet no existing Med-VQA dataset supports this form of multi-temporal reasoning.

Core Problem

How to construct a large-scale, multi-task Medical VQA benchmark that supports 3D volumetric input and multi-temporal reasoning, so as to comprehensively evaluate and advance VLM capabilities in real-world 3D radiology scenarios?

Method

Overall Architecture

3D-RAD is built upon the CT-RATE dataset (16,188 CT scans, 11,255 patients) via a semi-automated pipeline, yielding:

  • 3D-RAD-Bench (evaluation set): 33,910 QA pairs, 2,662 3D images
  • 3D-RAD-T (training set): 136,195 QA pairs, 13,526 3D images

The six task categories divide into open-ended (Tasks 1–3) and closed-ended (Tasks 4–6) formats; their key designs are summarized below.

Key Designs

  1. Task 1 — Anomaly Detection: Identifies anomalous patterns in 3D CT scans, outputting anomaly type, characteristics, and location. Divided into four sub-tasks: disease diagnosis, anomaly type, anomaly characteristics, and anomaly location. Open-ended response format.

  2. Task 2 — Image Observation: Analyzes descriptive information in medical images, including recognition of both normal and abnormal structures (e.g., cardiac stents), assessing the model's basic perceptual capabilities.

  3. Task 3 — Medical Computation: Performs quantitative reasoning on 3D medical images, such as measuring nodule diameter and wall thickness. Assesses numerical reasoning ability.

  4. Task 4 — Existence Detection: Binary classification (yes/no) for 18 predefined anomaly categories, evaluating generalization across pathological classes.

  5. Task 5 — Static Temporal Diagnosis [Novel Task]: Given only the current single 3D scan (no historical information), the model infers the temporal status of lesions (persistent / resolved / new / no anomaly). Simulates implicit temporal reasoning in the absence of prior records.

  6. Task 6 — Longitudinal Temporal Diagnosis [Novel Task]: Provides a sequence of historical diagnostic labels (e.g., [1, 0, 1]) alongside the current scan, and asks the model to determine the same four temporal states. Evaluates the model's ability to integrate explicit temporal context.
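The labeling logic behind the two temporal tasks can be sketched as a simple rule over binary diagnostic labels. The mapping below is our reading of the four-state task definition (persistent / resolved / new / no anomaly), not the paper's released code, and the function and enum names are illustrative:

```python
from enum import Enum

class TemporalState(Enum):
    NO_ANOMALY = "no anomaly"
    NEW = "new"
    RESOLVED = "resolved"
    PERSISTENT = "persistent"

def temporal_state(history: list[int], current: int) -> TemporalState:
    """Map a binary diagnostic history (e.g. [1, 0, 1]) plus the current
    scan's label to one of the four temporal states used by Tasks 5/6.
    Assumes the state is decided by the most recent prior label."""
    prev = history[-1] if history else 0  # most recent prior diagnosis
    if prev == 1 and current == 1:
        return TemporalState.PERSISTENT
    if prev == 1 and current == 0:
        return TemporalState.RESOLVED
    if prev == 0 and current == 1:
        return TemporalState.NEW
    return TemporalState.NO_ANOMALY
```

Task 6 hands the model both `history` and the current scan, so the rule is explicit; Task 5 withholds `history`, which is why it is so much harder for models.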

Data Construction Pipeline

  • Open-ended QA (Tasks 1–3): Extracted from the Findings/Impression fields of clinical reports; QA pairs are generated by GPT-4o-mini using prompt templates. A "6W" framework (what/where/which) ensures diversity, with answers constrained to approximately five words.
  • Closed-ended QA (Tasks 4–6): Generated directly from CT-RATE's multi-label annotations via templates. Tasks 5–6 require longitudinal label comparison across multiple scans from the same patient.
  • Quality Control: GPT-based scoring along five dimensions (visual verifiability, specificity and clarity, answer appropriateness, QA alignment, linguistic quality); pairs scoring below 3 are discarded; high-frequency QA pairs are capped at 10 per question; manual verification of 600 samples yields a 96.17% agreement rate.
  • Cross-LLM Consistency Validation: DeepSeek-R1, LLaMA3-70B, and LLaMA3-8B are used to cross-validate the reliability of GPT-4o-mini's scoring.
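The score-threshold and frequency-cap rules above can be sketched as a small filtering pass. The record shape (`"scores"`, `"question"` keys) and the choice of applying the threshold per dimension rather than to the average are assumptions for illustration:

```python
from collections import Counter

def filter_qa_pairs(pairs, min_score=3, max_per_question=10):
    """Drop QA pairs whose lowest dimension score falls below min_score,
    then cap repeats of the same question at max_per_question.
    Mirrors the paper's stated rules; field names are illustrative."""
    kept, seen = [], Counter()
    for p in pairs:
        if min(p["scores"].values()) < min_score:
            continue  # fails at least one of the five quality dimensions
        if seen[p["question"]] >= max_per_question:
            continue  # high-frequency question already capped
        seen[p["question"]] += 1
        kept.append(p)
    return kept
```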

Key Experimental Results

Zero-Shot Evaluation (Existing 3D Med-VLMs)

Task                           Metric  RadFM (13B)  M3D (7B)  M3D (4B)  OmniV (1.5B)
Task 1 Anomaly Detection       ROUGE   17.62        18.64     23.19     25.72
Task 2 Image Observation       ROUGE   19.14        20.82     23.19     26.69
Task 3 Medical Computation     ROUGE   6.62         23.24     5.63      7.88
Task 4 Existence Detection     ACC     29.20        18.00     40.25     28.66
Task 5 Static Temporal         ACC     44.11        25.47     25.40     22.96
Task 6 Longitudinal Temporal   ACC     42.99        24.17     24.31     24.23
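For the open-ended tasks, ROUGE compares a model's free-text answer against the reference. A minimal unigram ROUGE-1 F1 illustrates the idea; the paper's exact ROUGE variant and tokenizer may differ:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference answer and a model answer.
    Whitespace tokenization only; a simplified stand-in for the metric."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# rouge1_f1("small nodule in right lung", "nodule in left lung") ≈ 0.667
```

Partial-credit scoring like this explains why even weak models post non-zero ROUGE on Tasks 1–3 while failing the exact-match accuracy (ACC) of Tasks 4–6.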

Fine-Tuning Performance (M3D-RAD)

Task                           Metric  M3D (4B) Zero-shot  M3D-RAD (4B) Fine-tuned  Gain
Task 1 Anomaly Detection       ROUGE   23.19               42.45                    +19.26
Task 2 Image Observation       ROUGE   23.19               50.52                    +27.33
Task 3 Medical Computation     ROUGE   5.63                36.46                    +30.83
Task 4 Existence Detection     ACC     40.25               82.43                    +42.18
Task 5 Static Temporal         ACC     25.40               49.30                    +23.90
Task 6 Longitudinal Temporal   ACC     24.31               74.77                    +50.46

Ablation Study

  • Data scaling effect: Performance improves consistently across all tasks as training data scales from 1% → 10% → 100%; however, Tasks 5/6 exhibit high variance across data scales, indicating that current architectures lack the inductive biases necessary for temporal reasoning.
  • Cross-task transfer: Single-task fine-tuning also improves performance on other tasks, but joint training across all tasks yields the best results — demonstrating that the value of the dataset lies not only in scale but also in diverse domain knowledge.
  • General VLM evaluation: General-purpose models such as LLaVA-OneVision and Qwen2.5-VL also perform poorly on Tasks 4–6, while the fine-tuned M3D-RAD consistently outperforms them.

Highlights & Insights

  1. First large-scale 3D Med-VQA benchmark: 170K QA pairs across six task categories, systematically covering clinical needs from perception to reasoning.
  2. Novel temporal reasoning tasks: Tasks 5 (static temporal) and 6 (longitudinal temporal) represent the first systematic introduction of multi-temporal diagnosis in Med-VQA, closely mirroring real follow-up scenarios.
  3. Rigorous quality control: Multi-LLM cross-validation, five-dimensional scoring, and manual verification achieve a 96.17% agreement rate.
  4. Practical value of the training set: The 136K high-quality training samples enable fine-tuned models to improve on temporal tasks from ~25% to ~75%, demonstrating data effectiveness.
  5. Revealing finding: Temporal reasoning capability does not emerge from pretraining alone and must be acquired through explicit supervised learning.

Limitations & Future Work

  1. Limited temporal information: Task 6 provides only binary diagnostic label sequences (0/1) without leveraging complete 3D scans from multiple time points — richer spatial morphological change information remains unused.
  2. Model input constraints: Existing VLMs do not support simultaneous input of multiple 3D volumes, making true longitudinal comparative reasoning infeasible.
  3. Absence of open-ended temporal tasks: Tasks 5–6 are closed-ended four-class classification only; open-ended temporal analysis has not been introduced.
  4. Single data source: The dataset is based solely on chest CT from CT-RATE, without coverage of other anatomical regions (e.g., brain, abdomen) or MRI modalities.
  5. Dependence on LLM-generated QA: QA pairs generated by GPT-4o-mini may introduce inherent biases or hallucinations, despite human spot-checking.

Comparison with Related Work

  • vs. VQA-RAD / SLAKE / PathVQA: These are 2D datasets of limited scale (hundreds to thousands of images) with simple tasks, offering no 3D support and no temporal reasoning.
  • vs. M3D-VQA: Also a 3D Med-VQA dataset, but M3D-VQA contains only 120K QA pairs across five task categories, with no multi-temporal reasoning.
  • vs. RadFM: RadFM is a unified 2D/3D foundation model but lacks systematic benchmark evaluation and the temporal dimension.
  • vs. CT-RATE: 3D-RAD is built upon CT-RATE, constructing structured VQA tasks on top of its radiology reports and multi-label annotations.

Connections to related ideas:

  • Related to idea 20260316_2d_to_3d_medical_distill: 3D-RAD reveals the data bottleneck in 3D medical understanding; 2D→3D distillation could serve as a pretraining strategy to improve the foundational capabilities of 3D VLMs.
  • Related to idea 20260317_multi_agent_medical_diagnosis: The six task categories in 3D-RAD (detection → observation → computation → classification → temporal reasoning) naturally correspond to a multi-agent division of labor.
  • Key insight: The paper exposes an important gap — existing 3D VLMs entirely lack the capability to jointly process multiple 3D volumes as input. Designing an efficient representation and alignment method for multi-time-point 3D scans would constitute a high-impact research direction. The low ceiling of Task 5 (only 49.3% accuracy after fine-tuning) shows that inferring temporal status from a single scan is extremely difficult, whereas Task 6 reaches 74.8% with simple label sequences — highlighting the substantial value of explicit temporal signals.

Rating

  • Novelty: ⭐⭐⭐⭐ First large-scale 3D Med-VQA benchmark with multi-temporal reasoning; Tasks 5/6 are creatively designed — though the contribution remains fundamentally that of a dataset paper.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 specialized 3D Med-VLMs and multiple general-purpose VLMs; includes zero-shot, fine-tuning, data scaling ablation, cross-task ablation, and failure case analysis — highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and rich figures; however, the related work section over-cites the authors' own prior work.
  • Value: ⭐⭐⭐⭐⭐ Fills a critical gap in 3D Med-VQA; open-source dataset and code; the 136K training set has strong practical utility and reveals a key bottleneck in temporal reasoning.