VMDT: Decoding the Trustworthiness of Video Foundation Models¶
Conference: NeurIPS 2025 | arXiv: 2511.05682 | Code: Available | Area: Video Generation | Keywords: Video foundation models, trustworthiness evaluation, safety, fairness, adversarial robustness
TL;DR¶
This paper introduces VMDT (Video-Modal DecodingTrust), the first unified benchmark platform for evaluating the trustworthiness of T2V and V2T video foundation models across five dimensions—safety, hallucination, fairness, privacy, and adversarial robustness. It covers large-scale assessments of 7 T2V and 19 V2T models and reveals a complex relationship between model scale and trustworthiness.
Background & Motivation¶
Background: AI trustworthiness evaluation has predominantly focused on LLMs and image models (e.g., DecodingTrust, MMDT), leaving the video modality without a systematic trustworthiness benchmark. Video foundation models (VFMs) are rapidly advancing, yet their trustworthiness evaluation lags significantly behind.
Limitations of Prior Work: The video modality presents unique challenges—such as temporal risks (harmful content only manifesting during continuous playback) and photosensitive epilepsy triggers (flickering effects undetectable in static frames)—that are overlooked in image-based evaluation.
Key Challenge: While VFM capabilities are rapidly improving, safety alignment, fairness control, and privacy protection mechanisms remain severely insufficient, and no unified platform exists to measure and track progress.
Goal: To construct the first trustworthiness evaluation platform for video models covering both T2V and V2T directions across five key dimensions.
Key Insight: Drawing on the evaluation frameworks of DecodingTrust (text) and MMDT (image), the authors design dedicated datasets and metrics tailored to the specific characteristics of the video modality.
Core Idea: Decompose video model trustworthiness into five orthogonal dimensions—safety, hallucination, fairness, privacy, and adversarial robustness—and construct dedicated datasets and evaluation methods for each, forming a unified benchmark platform.
Method¶
Overall Architecture¶
The VMDT platform evaluates two categories of models: T2V (text-to-video, 7 models) and V2T (video-to-text, 19 models), with dimension-specific datasets and evaluation metrics designed for each of the five trustworthiness dimensions. Each dimension accounts for the unique characteristics of the video modality.
Key Designs¶
- Safety Evaluation:
    - T2V: 780 prompts covering 13 risk categories, including vanilla (directly harmful instructions) and transformed (seemingly benign but implicitly harmful) scenarios.
    - V2T: 990 video–prompt pairs covering 6 major categories and 27 subcategories of risk.
    - Special attention to video-specific risks: temporal risks (harmful content apparent only during continuous playback) and physical harm (photosensitive epilepsy triggers).
    - Core metrics: Bypass Rate (BR, the rate at which the model fails to refuse) and Harmful content Generation Rate (HGR).
    - GPT-4o serves as the automated evaluator (human agreement rate: 86%–88%).
- Hallucination Evaluation:
    - T2V: 1,650 prompts; V2T: 1,218 prompts.
    - 7 scenarios: Natural Selection (NS), Distraction (DIS), Misleading (MIS), Counterfactual (CR), Temporal (TMP, video-specific), Co-Occurrence (CO), and OCR.
    - 5 task types: object recognition, attribute recognition, action recognition, counting, and spatial understanding (plus scene understanding for V2T).
    - T2V evaluation uses Qwen2.5-VL-72B-Instruct as the judge (Pearson correlation 0.765); V2T uses keyword matching.
- Fairness Evaluation:
    - T2V: 1,086 prompts; V2T: 5,008 video–prompt pairs.
    - Three demographic attributes: gender, race, and age.
    - Three metrics: \(F_1(g)\) for social-stereotype fairness, \(F_2(g)\) for decision fairness, and \(O\) for overkill fairness (sacrificing historical accuracy in pursuit of diversity).
    - Ideal values: \(F_1 = 0\), \(F_2 = 0\), \(O = 0\).
- Privacy Evaluation:
    - T2V: 1,000 prompts (drawn from the WebVid-10M training corpus) to evaluate data memorization.
    - V2T: 200 driving videos to evaluate location-inference capability.
    - T2V metrics: \(\ell_2\) distance and cosine similarity; V2T metric: location-inference accuracy.
- Adversarial Robustness Evaluation:
    - T2V: 329 benign/adversarial prompt pairs; V2T: 1,523 pairs.
    - Attack algorithms: greedy, genetic, and gradient-based for T2V; FMM-Attack for V2T.
    - Five tasks: action, attribute, counting, object, and spatial understanding.
    - Metric: performance degradation from benign accuracy to robust accuracy.
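As a minimal sketch of how the safety metrics reduce to simple counts (the data structures here are hypothetical stand-ins; the paper's actual judge is GPT-4o), BR and HGR over a set of judged responses look like:

```python
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    refused: bool   # judge says the model declined the request
    harmful: bool   # judge says the output contains harmful content

def bypass_rate(results: list[JudgedResponse]) -> float:
    """BR: fraction of prompts the model did NOT refuse (lower is safer)."""
    return sum(not r.refused for r in results) / len(results)

def harmful_generation_rate(results: list[JudgedResponse]) -> float:
    """HGR: fraction of prompts yielding harmful output (lower is safer)."""
    return sum(r.harmful for r in results) / len(results)

# Toy example: 4 prompts, 1 refusal, 2 harmful outputs.
judged = [
    JudgedResponse(refused=True,  harmful=False),
    JudgedResponse(refused=False, harmful=True),
    JudgedResponse(refused=False, harmful=True),
    JudgedResponse(refused=False, harmful=False),
]
print(bypass_rate(judged))              # 0.75
print(harmful_generation_rate(judged))  # 0.5
```

Under this reading, an open-source T2V model with no refusal mechanism would score BR = 1.00, matching the paper's reported finding.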
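The privacy dimension's T2V memorization check relies on two standard vector measures; a small NumPy sketch with placeholder embeddings (the paper's actual feature extractor is not specified here):

```python
import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two video embeddings (lower = closer match)."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings (higher = closer match)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gen = np.array([1.0, 0.0, 1.0])    # placeholder embedding of a generated video
train = np.array([1.0, 0.0, 0.0])  # placeholder embedding of a training clip
print(l2_distance(gen, train))       # 1.0
print(cosine_similarity(gen, train))
```

A generated video that is unusually close to a training clip under both measures would signal memorization of the WebVid-10M corpus.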
Loss & Training¶
This paper is an evaluation work (benchmark) and does not involve model training. The evaluation pipeline proceeds as follows: construct datasets per dimension → run model inference to generate outputs → automated evaluation (GPT-4o or keyword matching) → aggregated analysis.
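A hypothetical sketch of this four-stage loop (all names below are illustrative placeholders, not the paper's code):

```python
def evaluate_dimension(models, dataset, judge, aggregate):
    """Generic VMDT-style pass: inference -> automated judging -> aggregation.

    `models`, `dataset`, `judge`, and `aggregate` are placeholder callables /
    collections standing in for the paper's actual components.
    """
    scores = {}
    for model in models:
        outputs = [model(sample) for sample in dataset]   # run inference
        verdicts = [judge(sample, out)                    # e.g. GPT-4o or keyword match
                    for sample, out in zip(dataset, outputs)]
        scores[model.__name__] = aggregate(verdicts)      # e.g. mean accuracy, BR, HGR
    return scores

# Toy demonstration with stand-in components.
def echo_model(sample):
    return sample.upper()

result = evaluate_dimension(
    models=[echo_model],
    dataset=["a cat", "a dog"],
    judge=lambda sample, out: out == sample.upper(),  # trivial exact-match "judge"
    aggregate=lambda v: sum(v) / len(v),
)
print(result)  # {'echo_model': 1.0}
```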
Key Experimental Results¶
Main Results¶
T2V Safety (HGR, lower is safer):
| Model | Vanilla HGR | Transformed HGR | Avg. HGR |
|---|---|---|---|
| Nova Reel | 0.08 | 0.11 | 0.10 |
| Luma | 0.19 | 0.14 | 0.17 |
| CogVideoX-5B | 0.45 | 0.26 | 0.36 |
| Pika | 0.52 | 0.28 | 0.40 |
T2V Hallucination (Accuracy %, higher is better):
| Model | NS | DIS | MIS | CR | TMP | CO | OCR | Avg |
|---|---|---|---|---|---|---|---|---|
| Luma | 63.8 | 74.7 | 78.3 | 68.5 | 45.5 | 82.9 | 59.7 | 67.6 |
| Pika | 56.5 | 68.9 | 72.3 | 70.7 | 53.7 | 77.3 | 41.5 | 63.0 |
| Vchitect-2.0 | 58.5 | 66.6 | 47.9 | 28.3 | 46.9 | 35.3 | 59.2 | 49.0 |
Comprehensive Cross-Dimension Scores:
| | Best T2V | Worst T2V | Best V2T | Worst V2T |
|---|---|---|---|---|
| Model | Luma | CogVideoX-5B | InternVL2.5-78B | Qwen2.5-VL-3B |
| Overall Score | 70.1 | 55.7 | 72.7 | 65.3 |
Ablation Study¶
Effect of Model Scale on Each Dimension (V2T):
| Dimension | Scale Effect | Impact on Trustworthiness |
|---|---|---|
| Hallucination | Scale↑ → Hallucination↓ | Positive |
| Adversarial Robustness | Scale↑ → Robustness↑ | Positive (\(P=0.034\)) |
| Fairness | Scale↑ → Unfairness↑ | Negative |
| Privacy | Scale↑ → Location inference↑ | Negative (risk grows with scale: Pearson \(r=0.544\), \(P=0.016\)) |
| Safety | No significant scale effect | None |
Key Findings¶
- All open-source T2V models lack refusal mechanisms (\(\text{BR}=1.00\)), and even closed-source models offer only limited safety protection.
- T2V models exhibit lower HGR in transformed scenarios, but this reflects capability limitations (failing to realize the implied harmful content) rather than genuine safety improvements.
- Model scale is a double-edged sword for V2T models: it improves hallucination and robustness but worsens fairness and privacy risks.
- T2V models exhibit severe over-representation across gender/race/age (biased toward male, white, and young subjects), with biases more pronounced than in T2I models.
- Even the best-performing models achieve an overall score of only approximately 70–73, indicating a substantial gap from an ideal trustworthy model.
Highlights & Insights¶
- Pioneering Contribution: The first unified video trustworthiness evaluation platform covering both T2V and V2T, filling an important gap in the field.
- Video-Specific Risk Discovery: Safety issues such as temporal risks and photosensitive epilepsy triggers cannot be captured by image-based evaluations.
- Scale Paradox: The first systematic demonstration of the non-monotonic relationship between model scale and trustworthiness in V2T models—larger models exhibit fewer hallucinations but are less fair.
- Open-Source vs. Closed-Source Gap: A substantial performance gap between open-source and closed-source models, particularly in the safety dimension.
- Cross-Modal Comparison: A systematic comparison of fairness between T2V and T2I models reveals that T2V models are biased toward male subjects while T2I models are biased toward female subjects (overcorrection).
Limitations & Future Work¶
- Evaluation relies on GPT-4o as the judge, introducing bias from the evaluator model (despite a human agreement rate of approximately 86–88%).
- Only 7 T2V models are evaluated, providing insufficient coverage (e.g., Sora is not included).
- Privacy evaluation covers only pixel-level memorization and location inference, excluding more complex privacy risks such as facial recognition.
- The adversarial attack methods employed are limited and do not account for stronger adaptive attacks.
- The paper lacks concrete recommendations and mitigation strategies for model improvement.
Related Work & Insights¶
- DecodingTrust / TrustGPT: LLM trustworthiness evaluation frameworks; VMDT extends these to the video modality.
- MMDT: Multimodal (image) DecodingTrust; the direct predecessor of VMDT, whose dataset design is reused in several dimensions.
- T2VSafetyBench / SafeWatch-Bench: Prior work on video safety evaluation; VMDT integrates and extends their data and taxonomy.
- Insight: Trustworthiness evaluation should advance in tandem with model capabilities; video, as the most information-rich modality, presents the most complex trustworthiness challenges.
Rating¶
- Novelty: ⭐⭐⭐⭐ First comprehensive video model trustworthiness platform
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Large-scale evaluation across 26 models × 5 dimensions
- Writing Quality: ⭐⭐⭐⭐ Well-structured with rich and insightful findings
- Value: ⭐⭐⭐⭐ Serves as an important reference for the safe development of video models