Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs¶

Conference: NeurIPS 2025 arXiv: 2505.11842 Code: https://liuxuannan.github.io/Video-SafetyBench.github.io/ Area: Multimodal / VLM Safety Keywords: Video Safety, LVLM Evaluation, Attack Success Rate, Multimodal Safety Benchmark, RJScore

TL;DR¶

This paper presents Video-SafetyBench, the first comprehensive benchmark for safety evaluation of video LVLMs. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, constructed via a controllable video generation pipeline. A confidence-based evaluation metric, RJScore, is proposed to assess model outputs. Large-scale evaluation across 24 LVLMs reveals an average attack success rate of 67.2% under benign queries.

Background & Motivation¶

As large vision-language models (LVLMs) are increasingly deployed in real-world settings, systematic safety evaluation becomes critical. However, significant gaps remain in existing multimodal safety assessment:

Existing benchmarks focus on static images: Works such as FigStep, MM-SafetyBench, HADES, and VLSBench exclusively consider image-text inputs, overlooking the unique safety risks introduced by the temporal dynamics of video (e.g., harmful actions that evolve over time).

Video inputs expand the attack surface: Compared to single-frame images, continuous frame sequences in video pose greater challenges for safety alignment, as adversaries can exploit temporal information to circumvent safety mechanisms.

Evaluation metrics are insufficient for boundary cases: Existing automated judges have limited capability in handling uncertain or borderline harmful outputs, lacking calibration mechanisms aligned with human judgment.

Key Challenge: The safety risks of video LVLMs are increasingly prominent, yet no systematic video-text attack benchmark or reliable safety evaluation methodology exists.

Key Insight: (1) Construct compositional video-text attack tasks encompassing both harmful queries with explicit malicious intent and benign queries that implicitly convey malice through video context; (2) Design a controllable video generation pipeline to ensure semantic alignment between video content and harmful intent; (3) Propose RJScore, an LLM confidence-based evaluation metric with calibrated decision thresholds.

Method¶

Overall Architecture¶

Video-SafetyBench consists of three main components: (1) a safety taxonomy covering 13 primary categories and 48 subcategories; (2) a three-stage controllable video generation pipeline; and (3) the RJScore evaluation metric based on LLM token-level confidence. Each video is paired with both a harmful query and a benign query variant.

Key Designs¶

Two-Level Safety Taxonomy:
- Function: Defines a systematic hierarchy of video safety risks.
- Mechanism: 13 primary categories (violent crime, non-violent crime, sexual crime, child sexual exploitation, defamation, professional advice, privacy, intellectual property, weapons of mass destruction, hate speech, self-harm/suicide, sexual content, elections) and 48 fine-grained subcategories.
- Design Motivation: Adapted from existing LLM safety taxonomies and extended for video-specific scenarios to ensure comprehensive coverage.
Three-Stage Controllable Video Generation Pipeline:
- Function: Synthesizes videos semantically aligned with harmful intent.
- Mechanism: Video semantics are decomposed into "what to show" (subject image) and "how to move" (motion text), executed in three steps:
  - Stage 1 (Text): Harmful queries are generated based on safety policies and then rewritten into benign variants via LLM (replacing harmful phrases with video-referential expressions, e.g., "high-explosive device" → "the device shown in the video").
  - Stage 2 (Text → Image): An LLM (GPT-4o) transforms abstract queries into rich scene descriptions, which are then fed to a T2I model (Midjourney-V6, KLING 1.5) to generate subject images.
  - Stage 3 (Image + Text → Video): An LVLM infers motion trajectory descriptions; combined with the subject image, these are input to an I2V model (KLING 1.6, Sora, Jimeng) to generate 10-second videos.
- Design Motivation: Direct T2V generation lacks precise semantic control. Decomposing into subject and motion dimensions substantially improves controllability. The resulting dataset achieves FID=73 (video quality) and VQAScore=0.522 (text-video alignment), both superior to existing benchmarks.
RJScore Evaluation Metric:
- Function: Quantifies the harmfulness of model outputs and aligns judgment with human annotations.
- Mechanism: Qwen2.5-72B assigns a 5-level toxicity score to each output. Rather than taking the argmax prediction, the logit values of the 5 candidate tokens are converted to softmax probabilities, and the expected score is computed as \(RJScore = \sum_{k=1}^{5} k \cdot p(k)\). A decision threshold of \(\tau=2.85\) is calibrated via 5-fold cross-validation.
- Design Motivation: Binary classification fails to handle uncertain and borderline cases. Token-level logit distributions capture judgment confidence, and threshold calibration yields 91% agreement with human annotations, surpassing GPT-4o (88.2%).

Attack Modalities¶

Harmful Query Attack: Text explicitly expresses malicious intent, with video amplifying the harmful effect.
Benign Query Attack: Text itself is innocuous but implicitly conveys malice through video references (e.g., "drop the device shown in the video" instead of "throw a high-explosive device").

Key Experimental Results¶

Main Results¶

Attack success rate (ASR) evaluated across 24 LVLMs (7 closed-source + 17 open-source):

Model	Harmful Query ASR	Benign Query ASR	Notes
Qwen-VL-Max	25.4%	78.3%	Benign ASR higher by 52.9%
GPT-4o	14.8%	43.3%	Safest among closed-source
Claude 3.5 Sonnet	7.8%	19.9%	Safest overall
Qwen2-VL-72B	44.6%	83.3%	Most vulnerable among open-source 72B
Qwen2.5-VL-7B	—	68.7%	7B safer than 72B
Qwen2.5-VL-72B	—	74.0%	Larger model ≠ safer
Average (all models)	39.1%	67.2%	Benign queries higher by 28.1%

Ablation Study¶

Judge Model	Human Agreement	F1	FPR	FNR
Rule-based	76.5%	75.1%	46.4%	0.7%
HarmBench	77.1%	76.1%	2.7%	43.0%
Llama Guard 3	79.5%	79.4%	12.2%	28.6%
GPT-4o	88.2%	88.1%	19.7%	3.9%
Qwen-2.5-72B	88.4%	88.3%	18.4%	4.7%
RJScore (τ=2.85)	91.0%	91.0%	12.3%	5.8%

Key Findings¶

Benign queries are far more dangerous than harmful queries: The average ASR under benign queries exceeds that under harmful queries by 28.1%, indicating that models struggle to detect malicious intent implicitly conveyed through video references.
Model scale does not imply greater safety: Within the Qwen2.5-VL series, benign-query ASR for 7B/32B/72B models is 68.7%/73.2%/74.0%, respectively — larger models are actually more susceptible.
Video inputs are more dangerous than static images: Video-based ASR is on average 8.6% higher than image-based ASR, with temporal information introducing additional risk.
Professional advice (S6-SA) is the most vulnerable category: Nearly all models exhibit ASR exceeding 70% in this category, particularly under benign queries.

Highlights & Insights¶

This is the first systematic study of video-text compositional attacks on LVLM safety, filling a critical gap in video safety evaluation.
The subject-plus-motion decomposition in the controllable video generation pipeline is elegant and effective, ensuring semantic alignment between video content and harmful intent.
The RJScore approach — combining token-level logit confidence with cross-validation threshold calibration — provides a more reliable paradigm for LLM-as-Judge evaluation.
The high ASR of benign query attacks reveals a fundamental weakness of LVLMs in reasoning about implicitly conveyed malicious intent.

Limitations & Future Work¶

The benchmark currently relies solely on synthetic videos; attack effectiveness on real-world videos may differ.
The benign query rewriting strategy is relatively simple (template-based substitution); more sophisticated semantic obfuscation techniques warrant exploration.
Audio modality safety risks are not addressed.
Evaluation focuses exclusively on output harmfulness, without considering over-refusal (excessive defense against legitimate requests).

This work provides strong directional guidance for extending existing image safety benchmarks (MM-SafetyBench, VLSBench, etc.) to the video domain.
The confidence quantification methodology underlying RJScore is transferable to other LLM-as-Judge scenarios.
The benign query attack paradigm is conceptually consistent with VLSBench's "benign text + harmful image" approach, but poses greater risks in the video domain.

Rating¶

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐