Two Causally Related Needles in a Video Haystack

Conference: NeurIPS 2025
arXiv: 2505.19853
Code: Project Page
Area: Video Understanding / Causal Reasoning
Keywords: Long video understanding, causal reasoning, needle-in-a-haystack, video-language models, benchmark

TL;DR

This paper proposes Causal2Needles, a benchmark of 4,100 QA pairs that ties the understanding of two causally related events together via a "bridging entity," forcing VLMs to jointly retrieve and reason over two needles scattered across a long video. It reveals severe deficiencies in state-of-the-art models on the causal dual-needle task: ChatGPT-4o achieves only 13.4% "Both" accuracy.

Background & Motivation

Background: Long video understanding benchmarks have proliferated, yet most evaluate only single-needle information extraction (retrieving an answer from one location) or track objects purely through visual appearance matching.

Limitations of Prior Work: (1) NLP research has shown that models perform substantially worse on multi-needle than single-needle tasks, yet the multimodal community lacks systematic multi-needle evaluation; (2) existing benchmarks assess VLMs' "world model" capacity only through physical motion prediction, neglecting causal reasoning over human actions; (3) narrative text input may introduce "text bias," allowing models to answer directly from text without genuinely understanding the video.

Key Challenge: Do models that perform well on single-needle tasks truly understand the causal structure of long videos, or are they merely performing local information retrieval?

Goal: Construct a long video understanding benchmark that simultaneously evaluates joint dual-needle comprehension and causal reasoning ability.

Key Insight: Exploit causally related event pairs in movie summary videos, using a "bridging entity" design that forces the model to first retrieve the effect event before localizing the cause event.

Core Idea: Obfuscate the bridging entity to transform the dual-needle problem into an indivisible joint reasoning task, preventing it from degenerating into two independent single-needle problems.

Method

Overall Architecture

Causal2Needles is built upon 192 movie summary videos (from the YMS and SyMoN datasets) and comprises four question types:

  1. Non-causal single-needle (1,704 questions): asks about event details.
  2. Causal single-needle (902 questions): infers cause from effect, with the answer at a single location.
  3. Causal dual-needle, visual grounding format (747 questions): requires selecting the video segment containing the answer.
  4. Causal dual-needle, image description format (747 questions): requires describing visual details of the cause event.

The construction pipeline proceeds as: LLM extracts causal relations → generates bridging entities → generates two-part questions → automatic and human quality evaluation.
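
As a concrete illustration of the bridging-entity design (a hypothetical example; the field names and sample questions are illustrative, not taken from the released dataset), a dual-needle item might be represented as:

```python
from dataclasses import dataclass

@dataclass
class DualNeedleQuestion:
    """Hypothetical record for one causal dual-needle item."""
    bridging_entity: str        # shared reference linking cause and effect
    obfuscated_form: str        # vague paraphrase used in the question text
    part1_effect_question: str  # resolve the bridging entity from the effect event
    part2_cause_question: str   # locate/describe the cause event using that resolution

q = DualNeedleQuestion(
    bridging_entity="Superman's death",
    obfuscated_form="the tragedy",
    part1_effect_question="What is 'the tragedy' the citizens are mourning?",
    part2_cause_question="Which earlier segment shows the event that led to it?",
)

# The obfuscation is what keeps the two parts inseparable:
assert q.obfuscated_form != q.bridging_entity
```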

Key Designs

  1. Bridging-Entity-Driven Dual-Needle Question Design:

    • Function: Design a question structure that forces the model to jointly understand both cause and effect events.
    • Mechanism: Identify a bridging entity shared by a causal event pair (e.g., "Superman's death"), then replace it with an obfuscated expression in the question (e.g., "tragedy"). Part 1 requires resolving the bridging entity from the effect event; Part 2 requires retrieving the video segment of the cause event based on that resolution.
    • Design Motivation: If the bridging entity is stated explicitly, the dual-needle problem degenerates into two independently answerable single-needle problems—obfuscation is the key to ensuring joint reasoning.
  2. Dual Complementary QA Formats (Visual Grounding + Image Description):

    • Function: Design two complementary question formats to offset each other's evaluation biases.
    • Mechanism: The visual grounding format requires selecting the correct video segment, preventing purely text-based answers but potentially causing out-of-distribution (OOD) issues; the image description format requires answering multiple-choice questions about visual details, circumventing OOD but potentially susceptible to pretrained knowledge leakage. Both formats are used jointly for comprehensive evaluation.
    • Design Motivation: A single format will either overestimate or underestimate model capability—visual grounding may underestimate (OOD), while image description may overestimate (movie knowledge leakage).
  3. Global + Local Causal Relation Extraction:

    • Function: Use an LLM to extract a comprehensive causal relation graph from video narratives.
    • Mechanism: A global graph extracts long-range causal relations from the full narrative; a sliding window (15 sentences, stride 5) extracts local graphs to capture fine-grained relations. After merging, only causal pairs with distance \(\geq 3\) events apart are retained.
    • Design Motivation: LLMs exhibit reduced attention to content in the middle of long texts ("lost in the middle"); local windows compensate for missed relations.
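
The windowing and filtering described above can be sketched in a few lines. This is a minimal sketch under the paper's stated parameters (window of 15 sentences, stride 5, minimum distance 3); the function names and the (cause_idx, effect_idx) pair encoding are assumptions, not the released code:

```python
def extract_local_windows(sentences, window=15, stride=5):
    """Yield sliding windows over a narrative (window/stride from the paper)."""
    for start in range(0, max(1, len(sentences) - window + 1), stride):
        yield start, sentences[start:start + window]

def merge_and_filter(global_pairs, local_pairs, min_distance=3):
    """Union global and local causal pairs, keep only sufficiently distant ones.

    Each pair is (cause_idx, effect_idx) into the event sequence.
    """
    merged = set(global_pairs) | set(local_pairs)
    return sorted(p for p in merged if abs(p[1] - p[0]) >= min_distance)

sents = [f"sentence {i}" for i in range(30)]
windows = list(extract_local_windows(sents))      # starts at 0, 5, 10, 15
pairs = merge_and_filter({(0, 10), (4, 5)}, {(4, 5), (7, 12)})
# (4, 5) is dropped: the two events are fewer than 3 positions apart
```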

Loss & Training

This is a purely evaluative benchmark; no training is involved. Questions are generated using LLMs (GPT-4o-mini, Gemini-2.0-flash); quality is assessed by ChatGPT-4.1 and Gemini-2.0-flash as well as five human annotators. Bridging entity co-occurrence rate exceeds 95%, and factual correctness of questions scores 4.6+/5.

Key Experimental Results

Main Results

VLM accuracy on Causal2Needles (%, averaged over forward/reverse):

Model                 Non-Causal 1-Needle   Causal 1-Needle   VG 2-Needle (Both)   ID 2-Needle
Human                 -                     78.2              79.3                 88.2
ChatGPT-4o            56.8                  39.2              13.4                 59.2
Gemini-1.5-pro        55.4                  35.6              8.4                  60.9
ChatGPT-4o-mini       39.9                  33.4              5.2                  52.3
Qwen2.5VL-32B         30.7                  11.7              1.9                  53.5
LLaVA-OneVision-7B    12.3                  18.0              0.1                  28.3
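
Here "Both" means a dual-needle question counts as correct only if both of its parts are answered correctly, which is why the VG column is so much lower than the single-needle columns. A minimal sketch of that scoring rule (variable names are illustrative, not from the released evaluation code):

```python
def both_accuracy(results):
    """results: list of (part1_correct, part2_correct) booleans, one per question."""
    if not results:
        return 0.0
    hits = sum(1 for p1, p2 in results if p1 and p2)
    return 100.0 * hits / len(results)

# e.g. three of four questions get part 1 right, but only one gets both right
demo = [(True, True), (True, False), (True, False), (False, False)]
assert both_accuracy(demo) == 25.0
```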

Ablation Study

Effects of causality and dual-needle distance:

Analysis Dimension               Finding
Non-causal → Causal (1-needle)   ChatGPT-4o: 56.8% → 39.2% (causal reasoning is substantially harder)
1-needle → 2-needle Both         ChatGPT-4o: 39.2% → 13.4% (joint dual-needle reasoning causes a dramatic drop)
Distance correlation             Greater inter-needle distance correlates with lower performance (significant negative correlation)
Forward vs. reverse order        Most models perform better with forward-order input, though differences vary by model
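
The distance finding can be checked with an ordinary correlation coefficient over (needle distance, accuracy) pairs. The numbers below are invented for illustration and are not from the paper:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external dependencies)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative numbers only: accuracy tends to fall as needle distance grows
distances = [3, 5, 8, 12, 20]
accuracies = [30.0, 25.0, 18.0, 12.0, 6.0]
r = pearson(distances, accuracies)
assert r < -0.9  # strong negative correlation, matching the reported trend
```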

Key Findings

  1. Causal reasoning is far harder than non-causal retrieval: All models exhibit large performance drops from non-causal to causal questions.
  2. Joint dual-needle comprehension is the true bottleneck: The strongest model, ChatGPT-4o, achieves only 13.4% VG dual-needle Both accuracy, far below the single-needle setting.
  3. Open-source models nearly completely fail: Qwen2.5VL-32B achieves only 1.9% on dual-needle Both.
  4. Image description format consistently outperforms visual grounding: Likely because models can leverage pretrained knowledge when answering appearance-based multiple-choice questions.

Highlights & Insights

  • Elegant bridging entity design: Obfuscating a shared reference ensures the dual-needle task is indivisible—this is the core innovation for evaluating joint reasoning.
  • Exposing the "high-score trap": Models may achieve high scores on existing benchmarks while entirely lacking causal reasoning ability.
  • Dual complementary format design: Simultaneously controls for OOD bias and knowledge leakage bias.
  • High dataset quality: Automatic and human evaluation confirms >95% valid bridging entities and 4.6+/5 factual correctness.

Limitations & Future Work

  • Videos are sourced from movie summaries (edited clips), which may not fully represent natural long videos.
  • Narrative text as auxiliary input may introduce text bias that is difficult to fully eliminate.
  • Human baseline evaluation is limited in scale (5 annotators).
  • Causal relation extraction itself relies on LLMs and may miss implicit causal chains.
  • Only a limited set of VLMs is evaluated; additional architectures (e.g., video-native models) remain to be assessed.

Comparison with Related Benchmarks

  • vs. VideoMME/MLVU: These benchmarks include multiple needles, but the needles are only visually associated and can each be understood independently; Causal2Needles requires causal joint understanding.
  • vs. EgoSchema: EgoSchema lacks diagnostic category labels, whereas Causal2Needles precisely isolates single-needle vs. dual-needle and causal vs. non-causal conditions.
  • Insight: "Capable of retrieval ≠ capable of reasoning": the next frontier in long video understanding is advancing from information extraction to causal reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The bridging-entity-driven dual-needle causal reasoning design is unique among long video benchmarks.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 10+ models, 4 question types, forward/reverse comparison, and comprehensive quality evaluation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear and motivation is well-argued.
  • Value: ⭐⭐⭐⭐⭐ Reveals fundamental weaknesses of existing VLMs in joint reasoning; highly diagnostic.