FunQA: Towards Surprising Video Comprehension¶

Conference: ECCV 2024
arXiv: 2306.14899
Code: https://github.com/Jingkang50/FunQA
Area: LLM/NLP
Keywords: video question answering, counter-intuitive reasoning, surprising videos, VLM, benchmark

TL;DR¶

The authors construct a large-scale counter-intuitive video question answering benchmark, FunQA (consisting of 4.3K videos and 312K QA pairs), covering three categories of surprising videos: Humor, Creativity, and Magic. They also propose the FunMentor agent, which enhances the counter-intuitive reasoning capabilities of VLMs through multi-turn dialogue.

Background & Motivation¶

Background¶

Background: Existing VideoQA benchmarks mainly focus on common, everyday scenarios (cooking, instruction, etc.) and lack evaluation regarding "surprising" video comprehension.

Limitations of Prior Work¶

Limitations of Prior Work: Understanding entertaining videos goes beyond basic visual perception; it requires comprehending deviations from common sense—i.e., why a certain scene is funny, creative, or mind-boggling.

Key Challenge¶

Key Challenge: GPT-4V has already achieved an accuracy of 80% on NExT-QA, necessitating the creation of more challenging benchmarks.

Key Insight¶

Key Insight: Existing datasets for humor/creativity comprehension heavily rely on audio and narrative cues, with the role of visual understanding being underemphasized.

Supplementary Note¶

Supplementary Note: The average response length in FunQA is 34.2 words, which far exceeds NExT-QA's 2.6 words, indicating a demand for deeper video comprehension.

Method¶

Overall Architecture¶

The FunQA dataset consists of three subsets: 1. HumorQA: Funny videos, featuring unexpected contrasts and plot twists. 2. CreativeQA: Creative performance videos, highlighting ingenious disguises and creative techniques. 3. MagicQA: Magic performance videos, centering on seemingly impossible events.

Four types of task design: 1. Counter-intuitive timestamp localization: Locating the specific video segment where the surprising event occurs. 2. Detailed video description: Generating coherent and objective descriptions of the video content. 3. Counter-intuitive reasoning: Explaining why the video is surprising or funny. 4. Advanced tasks: Title generation, creativity scoring, and magic trick explanation.

Key Designs¶

FunMentor Agent: - Plays a role similar to a mentor in a variety show. - Guides the VLM through multi-turn dialogue: 1. Basic Description → Guiding the model to focus on key details. 2. Comparative Analysis → Guiding the model to identify anomalous elements. 3. Reasoning & Integration → Guiding the model to formulate a comprehensive explanation. - Employs precise prompting strategies to produce fluent and logical answers.

Data Construction Pipeline (~900 hours, 50+ annotators): 1. Preprocessing: YouTube scraping → Two-stage manual filtering and trimming. 2. Manual Annotation: Chinese annotation → 10% secondary verification → Consensus evaluation. 3. Post-processing: GPT-3.5 translation and expansion → 312K QA pairs. 4. Provides both FunQA-MC (multiple-choice) and FunQA-DIA (dialogue) formats.

Loss & Training¶

FunMentor does not involve model training; it is an inference-time prompting strategy.
Evaluation utilizes a variety of metrics: GPT-4 assisted evaluation, BLEU, ROUGE, BERTScore, etc.

Key Experimental Results¶

Main Results¶

Model	H1 (Locating)	H2 (Description)	H3 (Reasoning)	C3 (Reasoning)	M3 (Reasoning)
Video-ChatGPT	1.23	2.05	1.89	1.67	1.54
VideoChat2	1.31	2.14	2.03	1.83	1.72
GPT-4V	1.98	2.89	2.67	2.45	2.19
GPT-4V + FunMentor	2.34	3.12	3.01	2.78	2.51

Ablation Study¶

FunMentor Component	Improvement on Reasoning Task
W/o FunMentor	Baseline
+ Single-turn guidance	+0.15
+ Multi-turn dialogue	+0.34
+ Full FunMentor	+0.34

Key Findings¶

All existing VLMs perform significantly worse on FunQA compared to typical VideoQA benchmarks.
Timestamp localization is the most challenging task, as models generally struggle to pinpoint counter-intuitive moments precisely.
FunMentor delivers massive improvements across all tasks, validating the effectiveness of multi-turn dialogue guidance.
The annotation consensus rate exceeds 90% in the "high consensus" category, and only 1% falls into "low consensus", demonstrating high data quality.
MagicQA is the most challenging subset, as it requires models to understand physical intuition and deduce the principles behind magic tricks.

Highlights & Insights¶

Fills a crucial gap: The first large-scale benchmark systematically targeting counter-intuitive and entertaining video comprehension.
Hierarchical task design: Scales up from perception (localization) to understanding (description) and finally to reasoning (explanation).
Simple yet effective FunMentor: A training-free multi-turn prompting strategy that significantly bolsters the reasoning capability of VLMs.
Exposes severe deficiencies in state-of-the-art VLMs concerning counter-intuitive reasoning.
Strict quality control: Involves 900+ annotation hours and multi-stage consensus validation.

Limitations & Future Work¶

The video sources are primarily from YouTube, which may introduce cultural biases (skewed toward Western humor styles).
A substantial portion of the 312K QA pairs was translated/expanded using GPT-3.5, posing quality control challenges.
FunMentor relies heavily on specific prompt engineering, and its generalizability across different VLMs warrants further validation.
Evaluation remains primarily reliant on automated metrics, with human evaluation conducted on a relatively small scale.

NExT-QA: A pioneer in open-ended VideoQA, but lacks sufficient difficulty.
CLEVRER: A synthetic video reasoning benchmark.
Whoops: Focuses on counter-intuitive image understanding.
Insight: Models' "understanding" capability should not merely be evaluated on everyday scenarios; counter-to-common-sense reasoning constitutes a far deeper test of cognitive abilities.

Rating¶

Dimension	Score (1-10)
Novelty	9
Technical Depth	6
Experimental Thoroughness	8
Value	8
Writing Quality	8
Overall Rating	7.8