HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Conference: CVPR 2026
arXiv: 2512.14870
Code: None
Area: Video Understanding / Multimodal VLM
Keywords: VideoQA Benchmark, Multi-evidence Integration, Frame Selection, Long Video Understanding, Temporal Reasoning
TL;DR
HERBench is a VideoQA benchmark specifically designed for multi-evidence integration, consisting of 26,806 five-choice multiple-choice questions. Each question structurally necessitates the fusion of \(\ge 3\) temporally dispersed, non-overlapping visual cues. By introducing the Minimum Required Frame Set (MRFS) metric, it identifies two critical bottlenecks in current Video-LLMs: insufficient frame retrieval and evidence fusion failure.
Background & Motivation
- Background: Video-LLMs (e.g., GPT-4o, Gemini, Qwen2.5-VL) have achieved high scores on existing VideoQA benchmarks, suggesting rapid progress in video understanding capabilities.
- Limitations of Prior Work: Recent audit studies reveal that these high scores often stem from language priors or single-cue shortcuts rather than genuine temporal reasoning. Models can answer questions correctly by viewing a single frame or exploiting linguistic biases; existing benchmarks fail to distinguish "true video understanding" from "shortcut exploitation."
- Key Challenge: The question design of existing VideoQA benchmarks allows for single-cue shortcuts: a single keyframe or textual common sense is often sufficient. Consequently, it remains uncertain whether models can actually integrate multiple pieces of evidence across time.
- Goal: (1) Design a benchmark that structurally requires multi-evidence integration (\(\ge 3\) dispersed cues); (2) Propose the MRFS quantitative metric to measure "evidential requirements"; (3) Diagnose specific failure modes in current Video-LLMs, distinguishing between frame selection issues and information fusion failures.
- Key Insight: The authors define the concept of "Evidential Requirement" (ER): the minimum number of non-redundant pieces of visual evidence required to answer a question. Enforcing \(ER \ge 3\) eliminates single-cue shortcuts by construction, making multi-evidence reasoning an unavoidable requirement.
- Core Idea: Through structural design, each question is guaranteed to require at least 3 temporally dispersed visual cues. Combined with the MRFS metric to quantify frame fusion difficulty, the benchmark systematically reveals the dual shortcomings of Video-LLMs in frame retrieval and evidence fusion.
Method
Overall Architecture
HERBench is not a new model but an evaluation benchmark designed to "force out" genuine video understanding. It asks whether modern Video-LLMs still perform well when a question structurally requires at least 3 dispersed cues. Construction follows four steps: first, a taxonomy of 12 sub-tasks across 4 reasoning families embeds the "\(\ge 3\) dispersed cues" requirement into the question structure; second, a three-channel data pipeline extracts spatio-temporal information at the trajectory, shot, and manual log granularities; third, oriented task programming synthesizes five-choice questions; fourth, a quality control stage filters card leakage (text overlap between appearance and behavior cards) and linguistic shortcuts so that answers depend on visual evidence. The MRFS metric then quantifies the "evidence demand" of each question. The final benchmark comprises 336 long videos (average 395 seconds) and 26,806 questions.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
T["Four Reasoning Families · 12 Sub-task System<br/>Mandatory ≥3 Dispersed Cues per Question"]
subgraph PIPE["Three-Channel Data Construction Pipeline"]
direction TB
P1["Channel I: Tracking & Trajectories<br/>RF-DETR+DeepSORT → Card A/Card B"]
P2["Channel II: Shot Segmentation<br/>MLLM Scene Descriptions"]
P3["Channel III: Manual Log Integration<br/>Narration Events → Temporal/Count Anchors"]
end
T --> S["Oriented Task Programming<br/>Synthesize 5-choice Questions"]
PIPE --> S
S --> Q["Quality Control<br/>Card Leakage + Blind LLM De-biasing + Expert Verification"]
Q --> B["Final Benchmark<br/>336 Videos / 26806 Questions"]
B -->|MRFS Quantifies Difficulty| M["Minimum Required Frame Set (MRFS)<br/>Binary Search for Min Successful Frames"]
Key Designs
1. Four Reasoning Families and 12 Sub-tasks: Embedding "Multi-evidence" into Structure
A critical weakness of existing VideoQA benchmarks is the single-cue shortcut. HERBench addresses this by redesigning tasks to structurally enforce \(k \ge 3\): the correct answer must be assembled from multiple cues at different timestamps. The 12 sub-tasks are grouped into four families:
- Temporal Reasoning & Chaining (TR&C): Temporal Sequence Ordering (TSO), Multi-Person Duration Reasoning (MPDR), Action Sequence Integrity Identification (ASII). Requires understanding event order and duration.
- Referencing & Tracking (R&T): Appearance-Grounded Behavior Interaction (AGBI), Appearance-Grounded Attribute Recognition (AGAR), Appearance-Grounded Localization Trajectory (AGLT). Forces identity binding across time.
- Global Consistency & Verification (GC&V): False Action Memory (FAM), Scene Verification Arrangement (SVA), False Object Memory (FOM). Requires scanning the full video for verification.
- Multi-Entity Aggregation & Navigation (MEA&N): Multi-Entity Grounded Localization (MEGL), Action Counting (AC), Region-Limited People Counting (RLPC). Requires cross-time de-duplication and aggregation.
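The taxonomy and the "\(\ge 3\) dispersed cues" rule can be written down as a small data structure. This is only an illustrative sketch: the names `TAXONOMY`, `family_of`, and `has_dispersed_cues` are hypothetical, and treating "dispersed" as "pairwise non-overlapping time spans" is an assumption, not the authors' stated check.

```python
# Four families x three sub-tasks, using the abbreviations from the text.
TAXONOMY = {
    "TR&C": ["TSO", "MPDR", "ASII"],
    "R&T": ["AGBI", "AGAR", "AGLT"],
    "GC&V": ["FAM", "SVA", "FOM"],
    "MEA&N": ["MEGL", "AC", "RLPC"],
}
MIN_CUES = 3  # the k >= 3 requirement


def family_of(sub_task: str) -> str:
    """Return the reasoning family that a sub-task abbreviation belongs to."""
    for family, sub_tasks in TAXONOMY.items():
        if sub_task in sub_tasks:
            return family
    raise KeyError(sub_task)


def has_dispersed_cues(cue_spans, min_cues: int = MIN_CUES) -> bool:
    """Assumed check: at least `min_cues` (start, end) spans, none overlapping."""
    spans = sorted(cue_spans)
    if len(spans) < min_cues:
        return False
    # After sorting, each span must start no earlier than the previous one ends.
    return all(prev[1] <= cur[0] for prev, cur in zip(spans, spans[1:]))
```

For instance, `has_dispersed_cues([(0, 2), (10, 12), (30, 31)])` passes, while two cues, or three cues where two overlap in time, do not.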
2. Three-Channel Data Construction Pipeline: Multi-granular Information Extraction
To generate "mandatory multi-frame" questions at scale, three complementary channels are used. Channel I performs object tracking and trajectory analysis: RF-DETR + DeepSORT produce trajectories, and each track yields non-overlapping A-cards (appearance) and B-cards (behavior/trajectory). Crucially, "appearance" and "behavior" are placed at different timestamps: the model must locate the person via the A-card appearance and then track them to the B-card behavior timestamp, which makes identity binding an unavoidable multi-frame reasoning task. Channel II performs shot segmentation and uses MLLMs to generate scene cards that capture macro-structure. Channel III integrates manually verified narration logs to provide factual anchors for temporal sequences and counts.
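The A-card/B-card separation in Channel I amounts to a constraint on time intervals. The sketch below is a minimal illustration under assumptions: the `Card` fields, the second-based timestamps, and the helper names are hypothetical and not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    track_id: int   # identity of the tracked person (hypothetical field)
    kind: str       # "A" = appearance, "B" = behavior/trajectory
    start: float    # interval start in seconds
    end: float      # interval end in seconds
    text: str       # natural-language description


def overlaps(a: Card, b: Card) -> bool:
    """True if the two half-open intervals [start, end) intersect."""
    return a.start < b.end and b.start < a.end


def is_valid_pair(a_card: Card, b_card: Card) -> bool:
    """An A/B pair is usable only if it refers to one track at disjoint times."""
    return (
        a_card.kind == "A"
        and b_card.kind == "B"
        and a_card.track_id == b_card.track_id
        and not overlaps(a_card, b_card)
    )


appearance = Card(7, "A", 12.0, 15.0, "person in a red jacket")
behavior = Card(7, "B", 140.0, 152.0, "walks from the door to the desk")
concurrent = Card(7, "B", 14.0, 20.0, "picks up a bag")
```

Here `is_valid_pair(appearance, behavior)` holds because the two intervals are disjoint, whereas `concurrent` overlaps the appearance window and would let a single frame answer the question.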
3. Quality Control: Blocking Information Leakage and Linguistic Shortcuts
To ensure the validity of the "mandatory multi-frame" design, two shortcuts are blocked. First, card leakage: if A/B cards use similar phrasing, models might infer answers from text alone; token-level similarity checks and manual reviews eliminate such overlaps. Second, linguistic bias: questions answered correctly by at least 3 of 4 "blind" (text-only) LLMs are discarded. Furthermore, 15% of questions undergo expert verification to confirm \(k \ge 3\) compliance and answer uniqueness. Human performance (88.8% on full videos, 95.7% with oracle frames) confirms the benchmark is solvable but challenging.
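The two filters can be sketched as follows. The text only says "token-level similarity", so the use of Jaccard similarity and the `0.5` threshold are assumptions chosen for illustration; the "3 of 4 blind LLMs" rule comes from the description above. All function names are hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a card description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a_text: str, b_text: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for the paper's check)."""
    ta, tb = tokens(a_text), tokens(b_text)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def has_card_leakage(a_text: str, b_text: str, threshold: float = 0.5) -> bool:
    """Flag A/B card pairs whose wording overlaps too much (threshold assumed)."""
    return jaccard(a_text, b_text) >= threshold


def is_text_solvable(blind_answers, gold: str, min_correct: int = 3) -> bool:
    """Discard rule: at least `min_correct` text-only models answer correctly."""
    return sum(answer == gold for answer in blind_answers) >= min_correct
```

With four blind models, `is_text_solvable(["B", "B", "B", "A"], "B")` marks the question for removal, while two correct answers out of four keep it.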
4. Minimum Required Frame Set (MRFS) Metric: Quantifying Evidential Requirements
To quantify difficulty, MRFS is introduced. Given an MLLM \(f\), a frame selector \(r\), and a frame budget \(x\), the MRFS size is the minimum number of frames \(k\) at which the model's answer flips from incorrect to correct. In implementation, text-solvable questions are excluded (where \(E(f(q, \varnothing), y) = 0\)), and an adaptive binary search is performed over \(k \in [1, x]\) to find the minimum number of frames for success. Unlike existing metrics, MRFS is automated, model-centric, and directly measures the difficulty of multi-evidence integration.
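A minimal sketch of the search, assuming that success is monotone in \(k\) (if \(k\) frames suffice, so do more) and that the selector returns its top-\(k\) frames. The paper's "adaptive" variant is not specified here, so this is plain binary search; `select_frames`, `model`, and the toy setup are hypothetical.

```python
from typing import Callable, List, Optional


def mrfs_size(
    question: str,
    answer: str,
    budget: int,
    select_frames: Callable[[str, int], List[int]],
    model: Callable[[str, List[int]], str],
) -> Optional[int]:
    """Smallest k in [1, budget] at which the model answers correctly.

    Returns None if the question is text-solvable (correct with no frames)
    or if the model is still wrong at the full budget.
    """
    def correct(k: int) -> bool:
        frames = select_frames(question, k) if k > 0 else []
        return model(question, frames) == answer

    if correct(0) or not correct(budget):
        return None
    lo, hi = 1, budget  # invariant: correct(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if correct(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


# Toy setup: the answer needs frames 2, 5 and 7; the selector ranks frames
# in a fixed order, so all three only appear once k reaches 5.
RANKED = [5, 1, 2, 9, 7, 3]
toy_selector = lambda q, k: RANKED[:k]
toy_model = lambda q, frames: "B" if {2, 5, 7} <= set(frames) else "A"
```

In the toy setup, `mrfs_size("q", "B", 6, toy_selector, toy_model)` evaluates to 5: the three evidence frames are only all present once the selector's fifth-ranked frame is included.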
Key Experimental Results
Main Results
Evaluation of 13 SOTA Video-LLMs shows an overall accuracy of only 31-42%, against a random-guessing baseline of 20%. The table lists five of these models (accuracy in %):
| Model | TR&C Avg. | R&T Avg. | GC&V Avg. | MEA&N Avg. | Overall |
|---|---|---|---|---|---|
| GPT-4.1 | 25.4 | 66.0 | 37.1 | 29.0 | 39.4 |
| Gemini-2.5-Flash | 29.7 | 69.9 | 34.9 | 26.8 | 40.3 |
| Qwen2.5-VL-72B | 26.9 | 70.9 | 36.6 | 24.4 | 39.7 |
| Ovis-2.5-9B | 18.9 | 73.5 | 46.8 | 29.2 | 42.1 |
| InternVL3.5-8B | 33.6 | 70.2 | 29.7 | 30.8 | 41.1 |
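As a quick arithmetic check on the rows above, the Overall column agrees with the unweighted mean of the four family averages to within rounding. Whether the paper defines Overall this way is an assumption; the snippet only verifies consistency for the five listed rows, and the helper names are hypothetical.

```python
# (TR&C, R&T, GC&V, MEA&N) family averages and the reported Overall, in %.
ROWS = {
    "GPT-4.1": ([25.4, 66.0, 37.1, 29.0], 39.4),
    "Gemini-2.5-Flash": ([29.7, 69.9, 34.9, 26.8], 40.3),
    "Qwen2.5-VL-72B": ([26.9, 70.9, 36.6, 24.4], 39.7),
    "Ovis-2.5-9B": ([18.9, 73.5, 46.8, 29.2], 42.1),
    "InternVL3.5-8B": ([33.6, 70.2, 29.7, 30.8], 41.1),
}


def family_mean(scores):
    """Unweighted mean of the four family averages."""
    return sum(scores) / len(scores)


def max_deviation(rows):
    """Largest gap between the family mean and the reported Overall score."""
    return max(abs(family_mean(s) - overall) for s, overall in rows.values())


def best_model(rows):
    """Model with the highest reported Overall score among the listed rows."""
    return max(rows, key=lambda name: rows[name][1])
```

For these rows the largest deviation stays within the 0.05 rounding margin of one-decimal reporting, and `best_model(ROWS)` is Ovis-2.5-9B.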
Cross-Benchmark MRFS Comparison
| Benchmark | Videos | Questions | MRFS↑ | Language De-biasing | Mandatory Fusion |
|---|---|---|---|---|---|
| MVBench | 4,000 | 4,000 | 3.52 | ✗ | ✗ |
| Video-MME | 900 | 2,700 | 5.31 | ✗ | ✗ |
| MINERVA | 223 | 1,515 | 5.14 | ✓ | ✗ |
| HERBench | 336 | 26,806 | 5.49 | ✓ | ✓ |
Key Findings
- Frame Retrieval Bottleneck (Finding 1): Although adaptive frame selectors outperform uniform sampling, a significant gap remains relative to oracle keyframes, indicating that the retrieval stage still misses critical evidence frames.
- Fusion Bottleneck (Finding 2): Even when provided with oracle frames, models show only modest accuracy gains, indicating an inability to correctly allocate attention across all keyframes and integrate the information.
- The R&T family scores relatively high (~60-73%) because appearance descriptions provide strong visual anchors; the TR&C and MEA&N families score lowest (mostly below 30%, and no higher than 33.6% among the models listed in the main results table), reflecting severe deficiencies in temporal reasoning and multi-entity aggregation.
- Smaller models sometimes outperform larger ones (e.g., Ovis-2.5-9B at 42.1% overall vs. GPT-4.1 at 39.4%), suggesting the problem is not purely a matter of model scale.
Highlights & Insights
- Elegant Design of the MRFS Metric: It isn't just a simple frame count; by using binary search with a fixed selector and excluding text-solvable cases, it provides a fair and computationally efficient measurement across benchmarks.
- A/B Card Separation: Deliberately placing appearance descriptions and behavior queries in different timeframes forces the model to perform identity binding through temporal tracking, turning it into a mandatory multi-frame task.
- Dual Bottleneck Diagnostic Framework: Decoupling frame selection from fusion reasoning via oracle experiments clearly identifies that "finding frames" and "using frames" are distinct challenges, providing a roadmap for future research.
Limitations & Future Work
- Parts of the benchmark are generated via automated pipelines, which may contain residual systematic biases.
- With only 336 videos, the scene diversity may not represent all real-world scenarios.
- MRFS depends on specific frame selectors and models; different combinations might yield different rankings.
- The focus is on diagnosing current failures rather than providing specific architectural solutions.
- High accuracy in R&T tasks may suggest that the ER requirements for these tasks could be stricter.
Related Work & Insights
- vs Video-MME: Video-MME focuses on longer contexts but does not control evidential requirement; HERBench ensures high ER through structural design, making evidence density rather than duration the primary difficulty.
- vs MINERVA: MINERVA also uses language de-biasing and has high MRFS but focuses on multi-step reasoning and reasoning chain auditing rather than mandatory multi-frame integration.
- vs MVBench: MVBench covers various temporal tasks but uses short videos, and questions can often be solved via single frames.
Rating
- Novelty: ⭐⭐⭐⭐ First VideoQA benchmark centered on evidential requirement; innovative MRFS metric.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation of 13 models, cross-benchmark MRFS comparison, multi-dimensional diagnostic analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear framework and comprehensive task taxonomy, though the paper is extensive.
- Value: ⭐⭐⭐⭐ Provides a critical diagnostic of Video-LLM shortfalls, offering significant reference for field advancement.