FunQA: Towards Surprising Video Comprehension
Conference: ECCV 2024
arXiv: 2306.14899
Code: https://github.com/Jingkang50/FunQA
Area: LLM/NLP
Keywords: video question answering, counter-intuitive reasoning, surprising videos, VLM, benchmark
TL;DR
The authors construct a large-scale counter-intuitive video question answering benchmark, FunQA (consisting of 4.3K videos and 312K QA pairs), covering three categories of surprising videos: Humor, Creativity, and Magic. They also propose the FunMentor agent, which enhances the counter-intuitive reasoning capabilities of VLMs through multi-turn dialogue.
Background & Motivation
Background
Existing VideoQA benchmarks mainly focus on common, everyday scenarios (cooking, instruction, etc.) and do not evaluate "surprising" video comprehension.
Limitations of Prior Work
Understanding entertaining videos goes beyond basic visual perception: it requires comprehending deviations from common sense, i.e., why a certain scene is funny, creative, or mind-boggling.
Key Challenge
GPT-4V has already achieved an accuracy of 80% on NExT-QA, which motivates the creation of more challenging benchmarks.
Key Insight
Existing datasets for humor/creativity comprehension rely heavily on audio and narrative cues and underemphasize the role of visual understanding.
Supplementary Note
The average response length in FunQA is 34.2 words, far exceeding NExT-QA's 2.6 words, which indicates a demand for deeper video comprehension.
Method
Overall Architecture
The FunQA dataset consists of three subsets:
1. HumorQA: Funny videos, featuring unexpected contrasts and plot twists.
2. CreativeQA: Creative performance videos, highlighting ingenious disguises and creative techniques.
3. MagicQA: Magic performance videos, centering on seemingly impossible events.

The tasks fall into four types:
1. Counter-intuitive timestamp localization: Locating the specific video segment where the surprising event occurs.
2. Detailed video description: Generating coherent and objective descriptions of the video content.
3. Counter-intuitive reasoning: Explaining why the video is surprising or funny.
4. Advanced tasks: Title generation, creativity scoring, and magic trick explanation.
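To make the timestamp localization task concrete, a predicted segment has to be compared with the annotated one. This summary does not say how that comparison is scored, so the sketch below uses temporal intersection-over-union (IoU), a common choice for segment localization, purely as an illustrative assumption. The function name `temporal_iou` and the example intervals are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals, in seconds.

    Assumption: IoU is used here only as an illustration of how a predicted
    segment can be compared with an annotated one.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("interval end must not precede its start")
    # Overlap is zero when the two segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical example: the model predicts 2s-6s, the annotation says 4s-8s.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

A score of 1.0 means the predicted segment matches the annotation exactly, and 0.0 means the two segments do not overlap at all.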
Key Designs
FunMentor Agent:
- Plays a role similar to a mentor in a variety show.
- Guides the VLM through multi-turn dialogue:
    1. Basic Description → Guiding the model to focus on key details.
    2. Comparative Analysis → Guiding the model to identify anomalous elements.
    3. Reasoning & Integration → Guiding the model to formulate a comprehensive explanation.
- Employs precise prompting strategies to produce fluent and logical answers.
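The three-stage guidance described for FunMentor can be sketched as a simple loop over stage prompts. This is a minimal illustration, not the authors' implementation: the prompt wording, the `guided_dialogue` function, the `(role, text)` message format, and the `echo_vlm` stand-in model are all hypothetical.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)

# Hypothetical stage prompts; the wording is illustrative, not from the paper.
STAGES: List[Tuple[str, str]] = [
    ("basic_description",
     "Describe what happens in the video, focusing on key details."),
    ("comparative_analysis",
     "Compared with what would normally happen, which elements are anomalous?"),
    ("reasoning_integration",
     "Using your previous answers, explain why the video is surprising."),
]


def guided_dialogue(vlm: Callable[[str, List[Message]], str],
                    video_path: str) -> List[Message]:
    """Run the stages in order, feeding the full history back to the model."""
    history: List[Message] = []
    for _stage_name, prompt in STAGES:
        history.append(("mentor", prompt))
        reply = vlm(video_path, history)  # the model sees all earlier turns
        history.append(("vlm", reply))
    return history


def echo_vlm(video_path: str, history: List[Message]) -> str:
    """Stand-in model so the sketch runs without any real VLM."""
    turn = sum(1 for role, _ in history if role == "mentor")
    return f"[answer to turn {turn} about {video_path}]"


for role, text in guided_dialogue(echo_vlm, "clip.mp4"):
    print(f"{role}: {text}")
```

Because every call receives the accumulated history, the final stage can build on the description and the anomalies collected in the earlier turns. No model weights are updated at any point.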
Data Construction Pipeline (~900 hours, 50+ annotators):
1. Preprocessing: YouTube scraping → Two-stage manual filtering and trimming.
2. Manual Annotation: Chinese annotation → 10% secondary verification → Consensus evaluation.
3. Post-processing: GPT-3.5 translation and expansion → 312K QA pairs.
4. Release formats: FunQA-MC (multiple-choice) and FunQA-DIA (dialogue).
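The 10% secondary verification step amounts to drawing a subset of annotations for a second check. The summary does not describe how that subset was chosen, so the sketch below assumes uniform random sampling with a fixed seed. The function name, the seed, and the integer IDs are hypothetical.

```python
import random


def sample_for_verification(annotation_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of annotations for a second check.

    Assumption: uniform sampling; the actual selection procedure is not
    described in this summary.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(annotation_ids)
    if not ids:
        return []
    # At least one item is checked, and never more than are available.
    k = min(len(ids), max(1, round(len(ids) * fraction)))
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(ids, k))


picked = sample_for_verification(range(50))
print(len(picked), "of 50 annotations selected for secondary verification")
```

Fixing the seed keeps the selected subset stable across runs, which makes the verification pass auditable.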
Loss & Training
- FunMentor does not involve model training; it is an inference-time prompting strategy.
- Evaluation utilizes a variety of metrics: GPT-4 assisted evaluation, BLEU, ROUGE, BERTScore, etc.
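To show what an overlap-based metric such as ROUGE measures on free-text answers, here is a simplified ROUGE-L F1 based on the longest common subsequence of tokens. It is a sketch under stated assumptions: lowercase whitespace tokenization, a single reference, and equal weighting of precision and recall. It is not the scoring code used for FunQA, and established metric packages handle tokenization and stemming more carefully.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 with lowercase whitespace tokenization."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model answer and reference explanation.
print(rouge_l_f1("the man slips on the ice", "the man suddenly slips on ice"))
```

Metrics of this kind reward shared wording rather than correct reasoning, which is one reason to pair them with model-assisted judgments such as the GPT-4 evaluation listed above.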
Key Experimental Results
Main Results
| Model | H1 (Locating) | H2 (Description) | H3 (Reasoning) | C3 (Reasoning) | M3 (Reasoning) |
|---|---|---|---|---|---|
| Video-ChatGPT | 1.23 | 2.05 | 1.89 | 1.67 | 1.54 |
| VideoChat2 | 1.31 | 2.14 | 2.03 | 1.83 | 1.72 |
| GPT-4V | 1.98 | 2.89 | 2.67 | 2.45 | 2.19 |
| GPT-4V + FunMentor | 2.34 | 3.12 | 3.01 | 2.78 | 2.51 |
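The per-task effect of adding FunMentor to GPT-4V can be read off the last two rows of the table. The short snippet below does that subtraction, using only the scores listed above; the variable names are illustrative.

```python
tasks = ["H1", "H2", "H3", "C3", "M3"]
gpt4v = [1.98, 2.89, 2.67, 2.45, 2.19]
gpt4v_funmentor = [2.34, 3.12, 3.01, 2.78, 2.51]

# Difference per task, rounded to the two decimals reported in the table.
gains = {
    task: round(with_mentor - base, 2)
    for task, base, with_mentor in zip(tasks, gpt4v, gpt4v_funmentor)
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The gains range from 0.23 (H2) to 0.36 (H1), so the improvement is present on every reported task and is largest on localization.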
Ablation Study
| FunMentor Component | Improvement on Reasoning Task |
|---|---|
| W/o FunMentor | Baseline |
| + Single-turn guidance | +0.15 |
| + Multi-turn dialogue | +0.34 |
| + Full FunMentor | +0.34 |
Key Findings
- All existing VLMs perform significantly worse on FunQA compared to typical VideoQA benchmarks.
- Timestamp localization is the most challenging task, as models generally struggle to pinpoint counter-intuitive moments precisely.
- FunMentor improves GPT-4V on every reported task, by 0.23 to 0.36 points in the main results table, supporting the effectiveness of multi-turn dialogue guidance.
- In the consensus evaluation, more than 90% of annotations fall into the "high consensus" category and only 1% into "low consensus", indicating high data quality.
- MagicQA is the most challenging subset, as it requires models to understand physical intuition and deduce the principles behind magic tricks.
Highlights & Insights
- Fills a crucial gap: The first large-scale benchmark systematically targeting counter-intuitive and entertaining video comprehension.
- Hierarchical task design: Scales up from perception (localization) to understanding (description) and finally to reasoning (explanation).
- Simple yet effective FunMentor: A training-free multi-turn prompting strategy that significantly bolsters the reasoning capability of VLMs.
- Exposes severe deficiencies in state-of-the-art VLMs concerning counter-intuitive reasoning.
- Strict quality control: Involves roughly 900 annotation hours and multi-stage consensus validation.
Limitations & Future Work
- The video sources are primarily from YouTube, which may introduce cultural biases (skewed toward Western humor styles).
- A substantial portion of the 312K QA pairs was translated/expanded using GPT-3.5, posing quality control challenges.
- FunMentor relies heavily on specific prompt engineering, and its generalizability across different VLMs warrants further validation.
- Evaluation remains primarily reliant on automated metrics, with human evaluation conducted on a relatively small scale.
Related Work & Insights
- NExT-QA: A pioneer in open-ended VideoQA, but lacks sufficient difficulty.
- CLEVRER: A synthetic video reasoning benchmark.
- Whoops: Focuses on counter-intuitive image understanding.
- Insight: A model's "understanding" should not be evaluated only on everyday scenarios; reasoning about events that violate common sense is a far deeper test of its cognitive abilities.
Rating
| Dimension | Score (1-10) |
|---|---|
| Novelty | 9 |
| Technical Depth | 6 |
| Experimental Thoroughness | 8 |
| Value | 8 |
| Writing Quality | 8 |
| Overall Rating | 7.8 |