# 4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding
- Conference: ICCV 2025
- arXiv: 2503.17827
- Code: https://4dbench.github.io/
- Area: Video Understanding / Multi-Modal
- Keywords: 4D understanding, MLLM evaluation, multi-view temporal reasoning, benchmark, visual question answering
## TL;DR
4D-Bench is the first benchmark for evaluating multi-modal large language models (MLLMs) on 4D object understanding. It encompasses two tasks—4D object question answering and captioning—and reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy against a human baseline of 91%, exposing significant deficiencies in multi-view temporal reasoning among current MLLMs.
## Background & Motivation
4D digital assets (dynamic 3D objects) are increasingly important in digital twins, augmented reality, gaming, and related domains. As MLLMs (e.g., GPT-4o, Qwen2-VL) have achieved substantial progress in 2D image/video understanding, a natural question arises: can these models understand 4D objects?
A critical gap currently exists:
No public benchmark for 4D language understanding: existing benchmarks either focus on 2D images/videos (ignoring multi-view understanding) or on static 3D scenes (ignoring temporal dynamics).
Unique challenges of 4D understanding:
- Multi-view ambiguity: the same object presents different appearances from different viewpoints, requiring integration of multi-view information.
- Temporal evolution: object parts move over time, necessitating tracking and reasoning.
- Joint reasoning across views and time: as illustrated in Figure 1, a robot's right hand may be occluded from certain viewpoints and eventually disappear, so answering questions requires selecting the appropriate viewpoint, localizing the part, and tracking its changes.
Counterfactual testing: the synthetic objects in 4D-Bench can provide counterfactual data that violates physical laws or common sense (e.g., a spider with six legs, a ball rolling out of a hole), testing whether MLLMs genuinely understand the input rather than relying on memorized priors.
Core Idea: render 4D objects as multi-view videos and feed them directly into existing MLLMs for evaluation, without constructing new 4D understanding models. Carefully designed evaluation tasks expose specific shortcomings of MLLMs.
## Method
### Overall Architecture
4D-Bench consists of two tasks:
1. 4D Object Question Answering (QA): 751 four-choice questions covering 736 4D objects.
2. 4D Object Captioning: 580 4D objects, each with 5 manually annotated captions.
### Key Designs
#### 1. Data Collection and Filtering
Data is sourced from dynamic 3D objects in Objaverse-XL and processed through a two-stage filtering pipeline:
- Motion analysis: motion boundaries are detected via pixel-change analysis to extract valid video segments, ensuring only dynamic objects are retained.
- Visual quality assessment: human annotators label thousands of images as high/low quality; a fine-tuned CLIP image encoder serves as a quality classifier, and multi-view voting filters out low-quality objects.
Each 4D object is rendered from 24 viewpoints.
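The motion-analysis step above can be sketched as simple frame differencing: a frame counts as "moving" when it differs enough from its predecessor, and sufficiently long runs of moving frames become the retained segments. This is a minimal illustration, not the paper's actual code; the thresholds and the mean-absolute-difference criterion are assumptions.

```python
import numpy as np

def extract_dynamic_segments(frames, diff_thresh=2.0, min_len=8):
    """Return (start, end) index pairs of segments with sustained motion.

    frames: list of HxWxC uint8 arrays from a single rendered viewpoint.
    diff_thresh: mean absolute pixel difference above which a frame "moves"
                 (illustrative value, not from the paper).
    min_len: minimum segment length (in frames) worth keeping.
    """
    # moving[i] is True when frame i+1 differs enough from frame i.
    moving = []
    for prev, cur in zip(frames, frames[1:]):
        diff = np.abs(cur.astype(np.float32) - prev.astype(np.float32)).mean()
        moving.append(diff > diff_thresh)

    # Collect runs of True that are at least min_len frames long.
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i
        elif not m and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments
```

A static object yields no segments and is filtered out; an object that only moves briefly falls below `min_len` and is likewise discarded.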
#### 2. QA Task Design (5 Sub-tasks)
| Sub-task | Evaluation Target | Unique Challenge |
|---|---|---|
| Appearance | Visual attribute analysis | Synthetic/fictional objects deviate from real-world training distributions |
| Action | Fine-grained motion detection | Motion direction requires multi-view observation |
| Object Counting | Precise counting in dynamic scenes | Object appearance/disappearance + cross-view occlusion |
| Spatial Relationship | Cross-view spatial configuration understanding | Spatial relations differ across viewpoints |
| Temporal Relationship | Temporal evolution and order understanding | Joint reasoning across both temporal and viewpoint dimensions |
#### 3. Annotation Pipeline
QA annotation: a hybrid approach is employed.
- A professional annotation team performs initial labeling (the retention rate drops from 92% to 62.5%, highlighting quality-control challenges).
- Subsequently, GPT-4o/Qwen2-VL generate candidate QA pairs → Qwen2-VL 7B performs initial filtering → text-only blind testing (Qwen2.5 + Llama 3.1, discarding items both models answer correctly) → final human review.
- 751 high-quality QA pairs are retained.
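The text-only blind-testing step has a simple logic: if every blind model answers a question correctly from the text alone, the question does not actually require visual evidence and is discarded. A sketch, assuming a hypothetical `answer_text_only(model, question, choices)` helper that queries an LLM without any visual input and returns a choice index:

```python
def blind_test_filter(qa_pairs, models=("qwen2.5", "llama3.1"),
                      answer_text_only=None):
    """Discard QA pairs that every text-only model answers correctly.

    qa_pairs: dicts with "question", "choices", and "answer" (correct index).
    answer_text_only: hypothetical helper; any text-only LLM client works.
    """
    kept = []
    for qa in qa_pairs:
        answers = [answer_text_only(m, qa["question"], qa["choices"])
                   for m in models]
        # Keep the pair only if at least one blind model got it wrong,
        # i.e. the question cannot be solved from text priors alone.
        if not all(a == qa["answer"] for a in answers):
            kept.append(qa)
    return kept
```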
Captioning annotation: fully manual; five professional annotators independently write a description for each object, and reviewers ensure that descriptions capture important details.
#### 4. Evaluation Setup
- \(K=3\) views are uniformly sampled from 24 viewpoints.
- \(N=6\) frames are sampled per view → input consists of \(3 \times 6 = 18\) frames.
- The captioning task uses GPT-4o as the evaluator, producing separate GPT-Appearance and GPT-Action scores (0–5).
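The view/frame sampling above amounts to two rounds of uniform subsampling. A minimal sketch, assuming uniform stride-based index selection and a placeholder total frame count (the paper specifies K=3 views of 24 and N=6 frames per view, but not the exact indexing code):

```python
def sample_indices(num_total, num_sample):
    """Uniformly spaced indices, e.g. 3 of 24 views or 6 of T frames."""
    step = num_total / num_sample
    return [int(i * step) for i in range(num_sample)]

def build_mllm_input(num_views=24, total_frames=48, K=3, N=6):
    """Return (view, frame) index pairs forming the 18-frame MLLM input.

    total_frames=48 is a placeholder; actual clip lengths vary.
    """
    views = sample_indices(num_views, K)       # e.g. [0, 8, 16]
    frames = sample_indices(total_frames, N)
    # View-first ordering: all frames of view 1, then view 2, then view 3.
    # (The paper reports that view-first vs. time-first barely matters.)
    return [(v, f) for v in views for f in frames]
```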
## Loss & Training
4D-Bench is an evaluation benchmark; no model training is involved.
## Key Experimental Results
### Main Results
4D Object QA Accuracy (%):
| Model | Counting | Temporal Rel. | Action | Spatial Rel. | Appearance | Overall |
|---|---|---|---|---|---|---|
| MiniGPT4-Video | 22.05 | 26.43 | 22.90 | 22.39 | 22.06 | 23.17 |
| Qwen2-VL 7B | 38.58 | 56.43 | 57.94 | 58.96 | 71.32 | 56.99 |
| LLaVA-Video 72B | 54.33 | 58.57 | 57.48 | 66.42 | 77.21 | 62.32 |
| GPT-4o | 44.09 | 59.29 | 63.55 | 69.40 | 77.21 | 62.98 |
| All Models Avg. | 37.29 | 49.29 | 49.37 | 53.57 | 63.92 | 50.69 |
| Human | 88.98 | 89.29 | 94.39 | 91.04 | 89.71 | 91.08 |
The gap between GPT-4o and the human baseline is nearly 28 percentage points.
### Ablation Study (Effect of Number of Views and Frame Sampling)
| Configuration Change | Accuracy Change (Gemini 1.5 Flash) |
|---|---|
| 1 view → 3 views (fixed 6 frames) | 41.3% → 53.7% (+12.4 points) |
| 1 frame → 6 frames (fixed 3 views) | 46.3% → 53.7% (+7.4 points) |
| 3 views → 6 views | 53.7% → decrease (information redundancy) |
| 6 frames → 9 frames | Negligible improvement |
Conclusion: the tasks genuinely require multi-view and temporal information, but exceeding 3 views or 6 frames introduces redundancy that interferes with model performance.
Captioning Task GPT-Eval Scores:
| Model | GPT-Appearance | GPT-Action | GPT-Eval |
|---|---|---|---|
| Qwen2-VL 72B | 3.324/5 | 2.791/5 | 3.057/5 |
| Gemini 1.5 Pro | 3.311/5 | 2.983/5 | 3.147/5 |
| GPT-4o | 3.507/5 | 3.258/5 | 3.382/5 |
| Human | 3.772/5 | 3.879/5 | 3.826/5 |
### Key Findings
- Counting is the most challenging sub-task: the all-model average is only 37.29%, barely 12 points above the 25% random-guess baseline, since resolving occlusions requires integrating information across views.
- Appearance understanding >> action understanding: appearance averages 63.92% vs. 49.37% for action, a gap of roughly 15 percentage points.
- The open-source vs. closed-source gap is larger on action understanding: open-source models approach closed-source performance on appearance, but the gap on action understanding is substantial.
- Counterfactual data exposes "memory dependence": when presented with a six-legged spider or physically impossible scenarios, all MLLMs produce incorrect answers, indicating reliance on world-knowledge priors rather than genuine visual understanding.
- Good robustness: changing frame ordering (view-first vs. time-first) or adding timestamps has minimal effect on results.
## Highlights & Insights
- Filling the gap in 4D–language understanding evaluation: a novel evaluation dimension is introduced between static 3D and single-view 2D video benchmarks.
- Elegant counterfactual test design: synthetic data naturally enables out-of-distribution evaluation beyond the real world, which is infeasible in 2D benchmarks.
- Rigorous data quality control: the hybrid annotation pipeline (human + MLLM + blind testing + final review) ensures that questions genuinely require multi-view temporal reasoning.
- Actionable findings: poor counting performance → better cross-view correspondence modeling is needed; weak action understanding → stronger temporal encoders are required.
## Limitations & Future Work
- The current approach uses concatenated multi-view videos as a proxy for 4D input rather than native 4D representations (e.g., point cloud sequences, 4D Gaussian Splatting), due to the input modality constraints of current MLLMs.
- The dataset scale is relatively limited (751 QA pairs + 580 captioning instances), which may be insufficient for comprehensive statistical conclusions.
- Objects are sourced from Objaverse-XL and are predominantly synthetic, introducing potential distribution gaps in appearance and motion relative to the real world.
- Only general-purpose 2D MLLMs are evaluated; dedicated 3D/4D understanding models (e.g., 3D-LLM) are not included.
## Related Work & Insights
- MVBench [Li et al., 2024]: a multi-task video understanding benchmark, but limited to single viewpoints.
- ScanQA [Azuma et al., 2022]: 3D scene question answering, but restricted to static scenes.
- T3Bench [He et al., 2023]: evaluates text-to-3D generation, focusing on generation quality rather than understanding.
- 4DGS [Wu et al., 2024]: 4D Gaussian Splatting, providing a 4D representation but lacking language understanding evaluation.
- Implication: future MLLMs require native 4D input support (rather than multi-view video proxies) and stronger temporal modeling capabilities.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — the first benchmark for 4D object understanding, with a pioneering problem formulation.
- Technical Depth: ⭐⭐⭐ — primarily an evaluation work; methodological contributions are relatively limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 14 MLLMs, 5 sub-tasks, and multi-dimensional analyses (number of views, frames, ordering, counterfactuals).
- Practical Value: ⭐⭐⭐⭐ — provides clear directions for improving 4D understanding capabilities in MLLMs.