Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Conference: ECCV 2024
arXiv: 2407.13094
Code: Yes (Project Page)
Area: Video Understanding / Vision-Language
Keywords: video-text understanding, counterfactual augmentation, action semantics, contrastive learning, LLM-teacher

TL;DR

This paper proposes the Retrieval from Counterfactually Augmented Data (RCAD) task and the Feint6K dataset, which reveal that SOTA video-text models lag far behind humans in understanding action semantics (InternVideo 58.2% vs. human 96.8% R@1 on the VATEX subset). It also introduces LLM-teacher, a method that improves action embedding learning by distilling LLM knowledge.

Background & Motivation

Background: Video-text foundation models (e.g., InternVideo, LanguageBind) have achieved outstanding results on standard retrieval tasks (up to 87.9% R@1), and are widely considered to possess strong video understanding capabilities.

Limitations of Prior Work: Standard evaluation tasks exhibit severe shortcuts and biases. Many questions can be answered solely based on objects and context from a single frame—e.g., seeing "cymbal" suggests "playing cymbals," and seeing an outdoor scene suggests "football."

Key Challenge: Existing evaluations cannot distinguish whether a model truly understands cross-frame action semantics or is merely exploiting shortcuts from objects and scenes. The core incremental value of video understanding compared to image understanding—cross-frame reasoning and action semantics—is obscured by existing benchmarks.

Goal: (1) To design a shortcut-free evaluation paradigm that exposes model weaknesses; (2) to improve how models learn action semantics.

Key Insight: Human-annotated, counterfactually modified descriptions preserve the objects and scene and alter only the action, which forces the model to reason across frames. LLM knowledge is then used to make contrastive learning of actions more effective.

Core Idea: Counterfactual augmentation exposes the model's deficient action understanding after eliminating object shortcuts, and LLM-teacher improves action embeddings through synthetic negative generation and soft-label distillation.

Method

Overall Architecture

This work consists of two components: (1) An evaluation framework—the RCAD task and the Feint6K dataset—to evaluate the model's true action understanding ability after removing shortcuts; (2) An improvement method—LLM-teacher—which generates negative descriptions of action variants using LLMs and enhances action representations through soft-labeled contrastive learning.

Key Designs

RCAD Task Design (Retrieval from Counterfactually Augmented Data)
- Function: Given a video and a set of candidate descriptions (1 positive and 5 negatives), retrieve the description that semantically matches the video. The negatives contain the same objects as the positive, differing only in the actions.
- Mechanism: Negatives are counterfactually modified—keeping the textual structure and object entities unchanged, while only replacing the action verbs. For example, if the positive is "A man kicks a football," the negative might be "A man catches a football."
- Design Motivation: To eliminate object-based shortcuts, forcing the model to perform cross-frame reasoning to understand action semantics.
- Supports zero-shot evaluation without requiring downstream fine-tuning.
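The retrieval protocol above can be sketched in a few lines. This is a minimal illustration, assuming that video and text embeddings are already available as plain lists of floats and that candidates are ranked by cosine similarity; the function names and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rcad_recall_at_1(examples):
    """Fraction of videos whose top-ranked candidate is the positive.

    Each example is (video_emb, candidate_embs, positive_index), where
    candidate_embs holds 1 positive and 5 counterfactual negatives.
    """
    hits = 0
    for video_emb, candidate_embs, positive_index in examples:
        scores = [cosine(video_emb, c) for c in candidate_embs]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == positive_index)
    return hits / len(examples)

# Toy 2-D embeddings (hypothetical values, for illustration only).
video = [1.0, 0.1]
candidates = [[0.9, 0.2], [0.1, 1.0], [0.5, 0.5],
              [0.0, 1.0], [0.3, 0.9], [0.6, 0.8]]
examples = [
    (video, candidates, 0),  # the positive is ranked first: a hit
    (video, candidates, 2),  # a negative outranks the positive: a miss
]
print(rcad_recall_at_1(examples))
```

Because every candidate shares the same objects, a model whose text embeddings barely move when only the verb changes will produce nearly tied scores, and its R@1 will drift toward the 1-in-6 chance level.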
Feint6K Dataset Construction
- Function: Construction of a high-quality counterfactually augmented video-text evaluation set.
- Mechanism: A human-in-the-loop pipeline in which 40 annotators manually rewrite the actions in descriptions from the MSR-VTT and VATEX validation sets.
  - The new actions must be plausible in context but must not occur in the video.
  - Annotators receive demonstrations and feedback during a training phase.
  - Each annotation is audited; unqualified annotations are sent back for revision.
- Scale: 6,243 videos, derived from the MSR-VTT validation set and VATEX test set.
- The human baseline R@1 reaches 95.2% (MSR-VTT) and 96.8% (VATEX), demonstrating that the task is solvable and has a unique correct answer.
LLM-teacher Method
- Function: Improving action embedding learning of video-text models using LLM knowledge.
- Mechanism (Three Steps):
  - Synthetic Negative Generation: An AMR parser extracts action and object tokens from the original descriptions, and variant descriptions are then generated with one of two methods:
    - Method I (Mask Filling): The masked-language-modeling capability of XLM-RoBERTa predicts alternative action words.
    - Method II (LLM Chatbot): The in-context learning of LLMs generates more flexible replacements (e.g., modifying prepositions).
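  - Contrastive Learning: The synthetic negatives are contrasted against the positive description with a temperature-scaled cross-entropy loss: \(l = -\log \frac{\exp(\text{sim}(f_v, f_p)/\tau)}{\exp(\text{sim}(f_v, f_p)/\tau) + \sum_{i=1}^{k}\exp(\text{sim}(f_v, f_{n_i})/\tau)}\), where \(f_v\), \(f_p\), and \(f_{n_i}\) are the embeddings of the video, the positive description, and the \(i\)-th of \(k\) synthetic negatives, and \(\tau\) is the temperature.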
  - LLM Soft-Label Distillation: Some synthetic negatives may be semantically similar to the original description and should not be treated as strictly negative. Sentence-BERT is used to calculate similarities between descriptions to serve as soft labels from the LLM teacher, aligning the model output using KL divergence: \(l = \mathcal{L}_{\text{KL}}(z_{\text{video-text}}, z_{\text{LLM}})\)
- By default, 10 action-based synthetic descriptions are generated for each video.
- Design Motivation: In standard contrastive learning, objects serve as a natural shortcut—the model only needs to distinguish between "cymbal" and "football" to minimize the contrastive loss, without ever truly learning action embeddings.
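The contrastive loss in the list above can be written out directly. The following is a minimal sketch, assuming the similarities between the video and each description have already been computed; the function name, the default temperature of 0.07, and the similarity values are illustrative assumptions, not settings reported in the paper.

```python
import math

def contrastive_loss(sim_pos, sim_negs, tau=0.07):
    """-log( exp(s_p/tau) / (exp(s_p/tau) + sum_i exp(s_ni/tau)) ).

    sim_pos:  similarity between the video and the positive description.
    sim_negs: similarities between the video and the k synthetic negatives.
    tau:      temperature (0.07 is an assumed default, not from the paper).
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Hypothetical similarities for one video and 10 synthetic negatives.
easy = contrastive_loss(0.8, [0.1] * 10)   # negatives already far from the video
hard = contrastive_loss(0.8, [0.75] * 10)  # action-only negatives stay close
print(easy < hard)
```

Negatives that differ from the positive only in the action stay close to the video in embedding space, so they yield a larger loss than object-level negatives and push the gradient toward the action representation.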

Loss & Training

Binary pseudo-label version (LLM-teacher-lbl): Standard cross-entropy contrastive loss.
Soft-label version (LLM-teacher-lgt): KL divergence alignment with the soft distribution from the LLM teacher.
Applied to two pre-trained models: SimVTP and InternVideo.
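The difference between the two variants can be illustrated with a small sketch. It assumes that the teacher distribution is a softmax over text-text similarities between the original description and each candidate (the role Sentence-BERT plays in the method), and that the KL divergence is taken from teacher to student, as is common in distillation. The direction of the KL term, the temperatures, and all names and values below are assumptions made for illustration.

```python
import math

def softmax(xs, tau=1.0):
    """Temperature-scaled softmax with max subtraction for stability."""
    scaled = [x / tau for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(student_sims, tau=0.1):
    """lbl variant: cross-entropy with the positive at index 0."""
    return -math.log(softmax(student_sims, tau)[0])

def soft_label_loss(student_sims, teacher_sims, tau_student=0.1, tau_teacher=0.1):
    """lgt variant: align the video-text distribution with the teacher's."""
    teacher = softmax(teacher_sims, tau_teacher)
    student = softmax(student_sims, tau_student)
    return kl_divergence(teacher, student)

# Index 0 is the positive; index 1 is a near-synonym negative; index 2 is a clear negative.
student_sims = [0.80, 0.75, 0.20]   # video-text similarities (hypothetical)
teacher_sims = [1.00, 0.90, 0.10]   # text-text similarities to the original (hypothetical)
print(hard_label_loss(student_sims), soft_label_loss(student_sims, teacher_sims))
```

Under the hard-label loss the near-synonym at index 1 is penalized as strongly as any other negative, whereas the soft-label loss lets the teacher assign it a share of the probability mass.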

Key Experimental Results

Main Results: Standard Retrieval vs. RCAD

Model	MSR-VTT R@1 (%)	Feint6K R@1 (%)	Gap (Feint6K minus MSR-VTT; for "+" rows, Feint6K gain over the baseline above)	Human R@1 (%)
CLIP (Zero-shot)	26.3	37.3	—	95.2
InternVideo (Zero-shot)	37.5	45.8	+8.3	95.2
InternVideo (Fine-tuned)	49.1	58.6	+9.5	95.2
LanguageBind (Zero-shot)	42.8	41.3	-1.5	95.2
SimVTP (Fine-tuned)	50.2	35.7	-14.5	95.2
+ LLM-teacher-lgt	49.5	43.5	+7.8	—
InternVideo (Fine-tuned)	49.1	58.6	—	95.2
+ LLM-teacher-lgt	48.9	65.8	+7.2	—

VATEX Subset Results

Model	VATEX R@1	Feint6K R@1	Human R@1
InternVideo (Fine-tuned)	87.9	58.2	96.8
+ LLM-teacher-lgt	87.3(-0.6)	65.6(+7.4)	—
SimVTP (Fine-tuned)	76.6	33.6	96.8
+ LLM-teacher-lgt	75.3(-1.3)	40.1(+6.5)	—
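The parenthesized deltas in the table can be recomputed from its R@1 columns. The short sketch below copies the four rows above and derives the change on standard retrieval, the gain on Feint6K, and the gap that remains to the human baseline; the last quantity is simple arithmetic on the table and is not a number reported separately in the source.

```python
# R@1 values copied from the VATEX subset table: (VATEX, Feint6K).
results = {
    "InternVideo": {"base": (87.9, 58.2), "lgt": (87.3, 65.6)},
    "SimVTP": {"base": (76.6, 33.6), "lgt": (75.3, 40.1)},
}
HUMAN_R1 = 96.8  # human R@1 on the VATEX subset

def summarize(entry):
    """Derive the deltas for one model, rounded to one decimal place."""
    (std_base, rcad_base), (std_new, rcad_new) = entry["base"], entry["lgt"]
    return {
        "standard_change": round(std_new - std_base, 1),
        "rcad_gain": round(rcad_new - rcad_base, 1),
        "remaining_gap_to_human": round(HUMAN_R1 - rcad_new, 1),
    }

for model, entry in results.items():
    print(model, summarize(entry))
```

Even after the LLM-teacher gain, both models remain tens of points below the human baseline on this subset.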

Ablation Study

Configuration	VATEX R@1	Feint6K R@1	Description
Default (10 action descriptions, XLM-RoBERTa)	87.3	65.6	Default configuration
5 action descriptions	87.6	64.7	-0.9, more negatives are better
5 action + 5 object descriptions	87.5	64.2	Object negatives do not help
LLM Chatbot substitution	87.0	65.9	Slightly better but slower inference

Key Findings

InternVideo's R@1 drops from 87.9% on standard VATEX retrieval to 58.2% on RCAD, a decline of 29.7 points, and trails the human baseline (96.8%) by 38.6 points.
The cosine similarity change \(|\Delta s|\) for object replacement is much larger than that for action replacement, indicating that the model's embeddings are far more discriminative for objects than for actions.
LLM-teacher-lgt (soft label) outperforms LLM-teacher-lbl (hard label) because some synthetic negatives are semantically close to the positive sample.
On InternVideo, LLM-teacher loses only 0.2-0.6 points on standard retrieval while gaining 7.2-7.4 points on RCAD.
Object negatives do not help improve RCAD, verifying that the model already possesses decent object embeddings, and the missing link is the action embedding.
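The \(|\Delta s|\) comparison can be made concrete with a small sketch. It assumes that \(|\Delta s|\) is the absolute change in video-text cosine similarity when the description is edited; the function names are hypothetical, and the toy vectors are chosen only to mimic the reported pattern, not measured from any model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def delta_s(video_emb, original_text_emb, edited_text_emb):
    """Absolute change in video-text similarity caused by editing the text."""
    original = cosine(video_emb, original_text_emb)
    edited = cosine(video_emb, edited_text_emb)
    return abs(original - edited)

# Toy embeddings (hypothetical): swapping the action barely moves the text
# embedding, while swapping the object moves it a lot.
video = [1.0, 0.2]
original_text = [1.0, 0.2]
action_swapped = [1.0, 0.3]
object_swapped = [0.2, 1.0]

print(delta_s(video, original_text, action_swapped))
print(delta_s(video, original_text, object_swapped))
```

A small \(|\Delta s|\) under action replacement means the positive and its counterfactual negatives receive nearly identical scores, which is the situation RCAD is designed to expose.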

Highlights & Insights

Contribution of Evaluation Paradigm: Eliminating shortcuts through counterfactual augmentation exposes fundamental deficiencies in action understanding for SOTA video-text models. The plunge from 87.9% to 58.2% is a wake-up call, indicating that high scores on standard benchmarks mostly result from object matching rather than motion/action comprehension.
In-depth Analysis of Shortcut Learning: Objects serve as natural shortcuts in contrastive learning. CLIP pre-training already equips models with excellent object embeddings, so during video-text contrast the model can minimize the loss by distinguishing objects alone, without learning action semantics. This analysis is supported by experimental evidence (the \(\Delta s\) analysis).
Elegant Design of LLM-teacher: It changes training data and objectives without modifying model architectures. Soft-label distillation works better than hard labels because descriptions like "kicking a ball" and "throwing a ball" are semantically close despite being different.

Limitations & Future Work

Feint6K is only based on MSR-VTT and VATEX, resulting in limited video diversity.
RCAD has only 6 candidates per video (1 positive, 5 negatives); increasing the number of candidates could offer higher discriminative power.
LLM-teacher slightly degrades standard retrieval (0.2-0.6 points for InternVideo, 0.7-1.3 points for SimVTP), indicating a trade-off.
Improvements on the video encoder side (e.g., better temporal modeling) were not explored, with optimization focused solely on training objectives.
The human baseline is 95.2% (MSR-VTT) and 96.8% (VATEX) rather than 100%, suggesting that some counterfactual descriptions may be ambiguous.

vs. InternVideo: InternVideo is pre-trained on 7 large-scale datasets and reaches 87.9% on standard VATEX retrieval but only 58.2% on RCAD. LLM-teacher raises this to 65.6% without architectural modifications, suggesting that the bottleneck is the training objective rather than model capacity.
vs. CLIP4Clip/ViCLIP: These models extend CLIP to the video domain, but they all inherit CLIP's object bias, performing similarly poorly on RCAD.
vs. Counterfactual VQA: This work draws inspiration from counterfactual data augmentation in NLP, but represents the first systematic application to video-text understanding evaluation.

Rating

Novelty: ⭐⭐⭐⭐⭐ The counterfactual evaluation paradigm uncovers a blind spot in the field, and the LLM-teacher approach is simple yet effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive multi-model evaluation, human baselines, cosine similarity analysis, and ablation studies.
Writing Quality: ⭐⭐⭐⭐⭐ Closely-knit motivational flow, with clear logic spanning from evaluation to analysis and method.
Value: ⭐⭐⭐⭐⭐ Serves as a wake-up call to the field; RCAD could potentially become a new standard evaluation.