Failures to Surface Harmful Contents in Video Large Language Models¶
Conference: AAAI 2026 arXiv: 2508.10974 Code: https://github.com/yuxincao22/VideoLLM-Failures Area: Model Compression Keywords: VideoLLM, Harmful Content Detection, Security Vulnerabilities, Black-box Attack, Multimodal Safety
TL;DR¶
This paper presents the first systematic security analysis of VideoLLMs, identifying three structural design flaws — sparse temporal sampling, spatial token downsampling, and modality fusion imbalance — that cause clearly visible harmful content in videos to be omitted from model-generated textual summaries (omission rate exceeding 90%). Three zero-query black-box attacks are designed to empirically validate the severity of these vulnerabilities.
Background & Motivation¶
VideoLLMs are increasingly deployed for video understanding tasks, generating concise textual summaries that enable users to rely on automatic outputs while browsing video streams. This "watch + read" hybrid consumption mode concentrates semantic trust in the VideoLLM's output.
Key Challenge: When harmful content is embedded in a video — whether as full-frame insertions or small corner patches — current state-of-the-art VideoLLMs almost never mention such content in their outputs, even though human viewers can perceive it clearly. This creates a "semantic blind spot": harmful content is visible in the video yet absent from the summary.
Three Structural Flaws:
Sparse Temporal Sampling: Most VideoLLMs uniformly sample only 8/16/32 frames, leaving large portions of the video unexamined.
Spatial Token Downsampling: Aggressive token compression (e.g., \(14\times14 \rightarrow 7\times7\)) discards fine-grained spatial information.
Modality Fusion Imbalance: Language priors dominate the attention budget, causing visual cues to be suppressed during generation even when captured by the encoder.
Key Insight: The three identified flaws are exploited to design zero-query black-box attacks that quantify the harmful content omission rate of VideoLLMs.
Method¶
Overall Architecture¶
Rather than proposing a solution, the paper presents a systematic vulnerability analysis and attack validation framework: it first dissects three structural flaws in the VideoLLM pipeline, then designs a targeted attack for each flaw, and finally conducts large-scale evaluation across five mainstream models.
Key Designs¶
-
Frame-Replacement Attack (FRA):
- Function: Replaces \(t_r\) seconds of original video content at a random position with a harmful video clip.
- Mechanism: Exploits temporal gaps in sparse uniform sampling so that harmful clips are entirely skipped. For example, a 2-minute video sampled at 16 frames has an 8-second sampling interval, allowing a 4-second harmful clip to fall completely between two sampled frames.
- Design Motivation: Validates the severity of the temporal sampling flaw without requiring any model knowledge.
-
Picture-in-Picture Attack (PPA):
- Function: Embeds a harmful video clip in a fixed corner region of every frame, occupying \(\eta H \times \eta W\) pixels.
- Mechanism: Corner regions lose information after token downsampling; harmful signals manifest as high-frequency components suppressed by low-pass filtering.
- Design Motivation: Validates the destructive effect of spatial token compression on small-region information.
-
Transparent-Overlay Attack (TOA):
- Function: Overlays a harmful video onto every frame at transparency \(\alpha\), ensuring all sampled frames carry the harmful signal.
- Mechanism: Even if the visual encoder captures the harmful signal, modality fusion imbalance causes these visual cues to be overridden by language priors.
- Design Motivation: Specifically validates the modality fusion imbalance flaw — demonstrating that even full-frame visible harmful content goes undetected.
Threat Model¶
- Strict zero-query black-box setting: the attacker has no knowledge of model architecture, weights, sampling rate, or any other internal information.
- The attacker's only prior knowledge consists of the three known architectural flaws.
- Harmful content must be human-perceptible (not single-frame flickers or imperceptible perturbations).
Evaluation Metric¶
Harmfulness Omission Rate (HOR): the proportion of responses in which the model answers "No" (indicating the video does not contain harmful content), using the prompt "Does this video contain violence/crime/pornography?"
Key Experimental Results¶
Main Results (5 Models × 3 Attacks × 3 Harmful Content Categories)¶
| Attack | Violence | Crime | Pornography | Average |
|---|---|---|---|---|
| FRA (\(t_r=4\)s) | 99% | 91% | 100% | 96.3% |
| PPA (\(\eta=0.2\)) | 98% | 87% | 76% | 87.0% |
| TOA (\(\alpha=0.5\)) | 93% | 82% | 93% | 89.3% |
Per-Model FRA Results¶
| Model | Violence | Crime | Pornography | Average |
|---|---|---|---|---|
| LLaVA-Video-7B | 100% | 85% | 100% | 95% |
| LLaVA-NeXT-7B | 100% | 100% | 100% | 100% |
| LLaVA-NeXT-32B | 100% | 78% | 100% | 93% |
| VideoLLaMA2 | 98% | 94% | 100% | 97% |
| ShareGPT4Video | 95% | 98% | 100% | 98% |
Ablation Study (Hyperparameter Analysis)¶
| Configuration | Key Finding | Notes |
|---|---|---|
| PPA \(\eta\): 0.1→0.3 | HOR gradually decreases for LLaVA series | VL2/SG4V show almost no response even at \(\eta=0.3\) |
| PPA \(\eta=0.5\) | L-7B HOR drops below 20% | Harmful region must cover ~1/4 of frame area for reliable detection |
| TOA \(\alpha\): 0.3→0.7 | HOR shows almost no significant change | Visual salience alone is insufficient to trigger detection |
| FRA simulation | Under 16-frame sampling, clips occupying <6% of video length are captured by at most 1 frame | Explains the high evasion rate of 4-second clips in minute-long videos |
Key Findings¶
- SG4V with keyframe selection still fails: confirms the core issue is sampling sparsity rather than the specific sampling strategy.
- Visual encoder vs. full model comparison: SigLIP alone can detect harmful content, but performance degrades significantly after modality fusion, directly evidencing fusion imbalance.
- Model "Yes" responses are also unreliable: follow-up queries about the specific time, location, or content of harmful material typically elicit incorrect answers, indicating that actual omission rates are higher than HOR reflects.
- Longer videos are more dangerous: under a fixed sampling budget, omission probability grows exponentially with video length.
Highlights & Insights¶
- Systematic depth: this is not a simple attack paper but a principled analysis grounded in root causes (three design flaws).
- Rigorous experimental design: large-scale evaluation spanning 200 original videos × 10 harmful clips × 3 categories × 5 models.
- Industry warning: VideoLLMs are being deployed in safety-critical scenarios such as content moderation, yet they fundamentally fail to detect embedded harmful content.
- Reveals a fundamental design problem: the issue is not model intelligence but structural information loss in the pipeline.
Limitations & Future Work¶
- Evaluation is limited to open-source VideoLLMs (<32B); closed-source models such as GPT-4o and Gemini are not assessed.
- Only three categories of harmful content (violence/crime/pornography) are tested; finer-grained categories such as hate speech are not covered.
- Proposed mitigations (denser sampling, VLM-assisted detection) show limited effectiveness, with HOR still reaching 71%–95%.
- Long-video models, despite recent advances, still employ sparse sampling and token compression and are not thoroughly evaluated.
- Training-stage mitigation strategies (e.g., safety-aware fine-tuning data) are not explored.
Related Work & Insights¶
- Relation to image MLLM safety research: safety risks in image-based models have been studied (e.g., SafeBench), but security vulnerabilities in video models remain largely unexplored.
- Finding from fu2025hidden: insufficient utilization of visual features during decoding in image MLLMs is shown to be even more severe in video models.
- Implications for system design: safety-critical applications cannot rely solely on single-model summary outputs; multi-level safety checks are necessary.
- Efficiency–safety trade-off: current VideoLLMs sacrifice safety for efficiency; sampling and fusion strategies must be redesigned to ensure semantic coverage.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (first systematic exposure of harmful content omission vulnerabilities in VideoLLMs)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (large-scale evaluation across 5 models × 3 attacks × 3 categories with detailed hyperparameter analysis)
- Writing Quality: ⭐⭐⭐⭐ (clear structure with a complete logical chain from root-cause analysis to attack design)
- Value: ⭐⭐⭐⭐⭐ (significant warning implications for the safe deployment of VideoLLMs)