# Evaluation of Vision-LLMs in Surveillance Video
Conference: NeurIPS 2025 · arXiv: 2510.23190 · Code: GitHub · Area: 3D Vision · Keywords: Vision-LLM, Zero-shot Anomaly Detection, Surveillance Video, Privacy Protection, Natural Language Inference
## TL;DR
This paper proposes a training-free two-stage framework that leverages small Vision-LLMs to generate textual descriptions of video content, followed by an NLI classifier for zero-shot scoring. It systematically evaluates the impact of prompting strategies and privacy-preserving filters on anomalous behavior recognition in surveillance videos.
## Background & Motivation
- Surveillance data volumes far exceed human monitoring capacity: The widespread deployment of cameras generates massive volumes of video data, making real-time manual monitoring impractical and necessitating automated anomaly detection.
- Traditional methods rely on large amounts of annotated data: Supervised approaches (e.g., MIL, GNN) require fine-grained event boundary annotations, which are costly and difficult to generalize to novel anomaly categories.
- Existing anomaly detection datasets offer limited coverage: Datasets such as UCF-Crime, XD-Violence, and RWF-2000 are constrained in scale and label diversity, limiting the generalization capability of models trained on them.
- Systematic evaluation of Vision-LLMs for anomaly recognition is lacking: Although VLMs have demonstrated strong performance on conventional action recognition, their zero-shot capability on rare or criminal behaviors has not been systematically validated.
- Privacy protection is a hard constraint for real-world deployment: Surveillance scenarios require video anonymization (e.g., blurring, GAN-based replacement), yet the impact of such operations on VLM performance remains unclear.
- Zero-shot flexibility has practical value: If VLMs can recognize novel anomaly types via natural language prompts without retraining, system adaptability and scalability would be greatly enhanced.
## Method
### Overall Architecture
A two-stage pipeline: (1) a frozen Vision-LLM converts a sequence of video frames into natural language descriptions; (2) a frozen NLI classifier (BART-large-MNLI) scores the textual entailment between the description and each candidate label, assigning the highest-scoring label as the predicted category. No gradient updates are required throughout.
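As a concrete reference, here is a minimal Python sketch of the two-stage loop using Hugging Face's `zero-shot-classification` pipeline with the BART-large-MNLI checkpoint named above; `vlm_describe` and the candidate label list are illustrative placeholders, not the authors' code.

```python
from transformers import pipeline

# Frozen stage-2 classifier: the zero-shot-classification pipeline turns
# each candidate label into an entailment hypothesis for the NLI model.
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Illustrative label subset; the paper uses the dataset's own category names.
CANDIDATE_LABELS = ["fighting", "robbery", "arson", "normal activity"]

def classify_clip(frames, vlm_describe):
    """Two-stage zero-shot prediction for one clip. `vlm_describe` stands in
    for any frozen Vision-LLM wrapper mapping frames to a short description."""
    description = vlm_describe(frames)                # stage 1: video -> text
    result = nli(description, CANDIDATE_LABELS)       # stage 2: text -> label scores
    return result["labels"][0], result["scores"][0]   # top label and its score
```

Because both stages are frozen, swapping in a different VLM or NLI checkpoint only changes the two model handles, which is what makes the pipeline modular.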
### Key Design 1: Video-to-Text Description Generation
- Function: Video frames are sampled and fed into the VLM, which generates a concise textual description of no more than 40 words (a sampling sketch follows this list).
- Mechanism: The pre-trained VLM's embedded world knowledge is leveraged for semantic reasoning, reformulating the pixel-to-label mapping as a language inference problem.
- Design Motivation: Through large-scale pre-training, VLMs have acquired rich vision-language alignment capabilities, enabling meaningful video descriptions without task-specific fine-tuning.
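This summary does not pin down the exact sampling scheme, so the following is a minimal sketch assuming uniform temporal sampling with OpenCV; `num_frames` is an assumed parameter, not a value from the paper.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; most VLM preprocessors expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```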
### Key Design 2: NLI-Based Zero-Shot Classification
- Function: The generated textual description serves as the premise, and each candidate anomaly label serves as a hypothesis; an NLI model computes the entailment score for each pair (a scoring sketch follows this list).
- Mechanism: Multi-class classification is reformulated as a textual entailment task, exploiting the semantic matching capability of pre-trained NLI models for zero-shot classification.
- Design Motivation: New anomaly categories can be incorporated simply by appending text labels to the candidate set, without modifying any model parameters, enabling genuine zero-shot flexibility.
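Under the hood, the entailment scoring can be done with raw MNLI logits. A sketch, assuming the pipeline's default hypothesis template `"This example is {}."` and the checkpoint's `[contradiction, neutral, entailment]` label order:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

@torch.no_grad()
def entailment_score(premise: str, label: str) -> float:
    # The hypothesis template is an assumption (the HF pipeline default).
    inputs = tok(premise, f"This example is {label}.",
                 return_tensors="pt", truncation=True)
    logits = model(**inputs).logits[0]      # [contradiction, neutral, entailment]
    probs = logits[[0, 2]].softmax(dim=0)   # drop neutral, renormalize
    return probs[1].item()                  # probability of entailment
```

The predicted category is then simply the label with the highest entailment score, so adding a new anomaly type means adding one more string to the candidate set.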
### Key Design 3: Multi-Level Prompting Strategies
- Function: Three prompting schemes are designed: unguided prompting (free-form description), guided prompting (providing a list of candidate categories), and guided + few-shot prompting (additionally providing example images and descriptions); illustrative templates are sketched after this list.
- Mechanism: Structured prompts constrain the VLM's output space, directing it to generate descriptions more relevant to the classification task.
- Design Motivation: Open-ended descriptions may deviate from information critical to anomaly detection; guided prompts focus the model's attention on task-relevant semantics.
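The paper's exact prompt wording is not reproduced in this note; the templates below are only illustrative of how the three schemes differ in structure.

```python
# Illustrative prompt templates; the authors' exact wording may differ.
UNGUIDED = "Describe what happens in these video frames in at most 40 words."

GUIDED = (
    "Describe what happens in these video frames in at most 40 words. "
    "Consider whether any of the following is occurring: {labels}."
)

def few_shot_prompt(examples, labels):
    """`examples` is a list of (frames, description) pairs drawn from the
    training split, matching the paper's leakage-free protocol."""
    shots = "\n".join(f"Example description: {desc}" for _, desc in examples)
    return shots + "\n" + GUIDED.format(labels=", ".join(labels))
```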
### Key Design 4: Privacy-Preserving Filter Evaluation
- Function: Three privacy protection schemes are tested on RWF-2000: local head blurring, GAN-based face anonymization (DeepPrivacy2), and GAN-based full-body anonymization (a blurring sketch follows this list).
- Mechanism: Anonymized datasets are pre-generated and compared under identical evaluation conditions to assess the impact of different privacy filters on VLM anomaly detection performance.
- Design Motivation: Privacy protection is indispensable in real-world deployment; quantifying its specific cost to model performance provides empirical guidance for engineering decisions.
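As a rough illustration of the local blurring filter only (the GAN variants rely on the dedicated DeepPrivacy2 tool rather than hand-written code), a Haar-cascade face detector stands in here for whatever head localizer the authors actually used:

```python
import cv2

# Stand-in detector; the paper's head localization method is not specified here.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def blur_heads(frame_bgr):
    """Gaussian-blur each detected face region in place and return the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1,
                                                  minNeighbors=5):
        roi = frame_bgr[y:y + h, x:x + w]
        # Kernel size must be odd; 51x51 gives a strong local blur.
        frame_bgr[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame_bgr
```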
### Loss & Training
This method requires no training — both the VLM and the NLI classifier use frozen parameters. At inference, a conservative decoding strategy is adopted: temperature 0.05–0.1, maximum new tokens 64–128, and a repetition penalty of 1.5. Long videos are processed in temporal windows; a video is considered correctly predicted if any single window yields the correct prediction.
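Mapping these settings onto Hugging Face `generate()` kwargs, together with the any-window aggregation rule, might look like the sketch below; the kwarg mapping is an assumption about the implementation, not code from the paper.

```python
# Conservative decoding settings reported in the paper, expressed as
# Hugging Face generate() kwargs (this mapping is an assumption).
GEN_KWARGS = dict(
    do_sample=True,          # needed for temperature to take effect
    temperature=0.05,        # paper reports 0.05-0.1
    max_new_tokens=128,      # paper reports 64-128
    repetition_penalty=1.5,
)

def video_is_correct(window_predictions, ground_truth):
    """Aggregation rule: a long video counts as correctly predicted
    if ANY single temporal window yields the correct label."""
    return any(pred == ground_truth for pred in window_predictions)
```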
## Key Experimental Results
### Experiment 1: Effect of Prompting Strategies on UCF-Crime
| Model | Unguided Top-1 (%) | Guided Top-1 (%) | Guided + Few-shot Top-1 (%) |
|---|---|---|---|
| Gemma-3 (4B) | 26.29 | 33.85 | 29.80 |
| NVILA-8B | 13.39 | 27.00 | 45.05 |
| Qwen-2.5-VL-7B | 25.31 | 34.69 | — |
| VideoLLaMA-3-7B | 19.94 | 34.16 | — |
- Guided prompting consistently improves accuracy (+7–14 pp).
- Few-shot examples yield a substantial gain for NVILA (+18 pp) but lead to a decline for Gemma-3, and generally increase the false positive rate across models.
### Experiment 2: Effect of Privacy-Preserving Filters on RWF-2000
| Model | No Filter Acc / FP (%) | Blur ΔAcc / ΔFP (pp) | GAN Face ΔAcc / ΔFP (pp) | GAN Full-Body ΔAcc / ΔFP (pp) |
|---|---|---|---|---|
| Gemma-3 (4B) | 86.25/20.50 | –5.0/+10.5 | –2.8/+7.0 | –4.0/+7.0 |
| NVILA-8B | 82.50/14.00 | –1.8/+2.0 | –1.8/+5.0 | –11.3/+7.5 |
| Qwen-2.5-VL-7B | 82.25/24.50 | –4.8/+9.0 | –1.0/+2.0 | –6.5/+11.0 |
| VideoLLaMA-3-7B | 83.25/8.50 | –2.5/+2.0 | –4.5/–5.5 | –8.8/–6.5 |
- Privacy filters reduce accuracy across the board (by roughly 1–11 pp) and, for most model–filter combinations, increase false positive rates.
- GAN full-body anonymization has the largest impact, as temporally inconsistent appearances generated across frames distort motion cues.
- VideoLLaMA-3 exhibits a reduction in false positives under GAN filters, highlighting model-specific differences in sensitivity to privacy processing.
## Highlights & Insights
- Fully training-free: The entire pipeline requires no gradient updates, enabling genuine zero-shot detection — new anomaly types can be detected simply by adding new text labels.
- Modular architecture: The VLM and NLI classifier are decoupled and can be independently upgraded or replaced.
- First systematic evaluation of privacy protection's impact on VLM-based anomaly detection: Fills an important experimental gap in the field.
- Rigorous experimental design: Few-shot examples are drawn from the training set to prevent data leakage; single-variable comparisons are maintained with clear control conditions.
## Limitations & Future Work
- Overall accuracy remains low: The highest performance on UCF-Crime reaches only 45%, far from practical deployment requirements.
- Only small models (≤8B) are evaluated: Larger models (e.g., GPT-4V, Gemini) may perform substantially better but are not included.
- Single-run experiments: Due to computational constraints, each experiment is conducted only once, lacking statistical significance analysis.
- Limited dataset coverage: Only two datasets are used, excluding broader scenarios (e.g., the multimodal data in XD-Violence).
- GAN temporal inconsistency: The paper identifies but does not address the inter-frame inconsistency introduced by GAN full-body anonymization.
## Related Work & Insights
- Supervised anomaly detection: The MIL paradigm for UCF-Crime [Sultani et al. 2018], REWARD [Karim et al. 2024], AnomalyCLIP [Zanella et al. 2024a], MissionGNN [Yun et al. 2025].
- VLMs for anomaly detection: LAVAD [Zanella et al. 2024b] uses an LLM for temporal aggregation of caption-based anomaly scores; Holmes-VAD [Zhang et al. 2024] instruction-tunes a multimodal LLM; TEVAD [Chen et al. 2023] improves anomaly scoring via text.
- VLM backbones: Gemma-3 [Team et al. 2025], Qwen-2.5-VL [Bai et al. 2025], VideoLLaMA-3 [Zhang et al. 2025], NVILA [Liu et al. 2024].
## Rating
- Novelty: ⭐⭐⭐ — The framework concept (VLM + NLI) is not particularly novel, but the systematic evaluation of privacy protection offers meaningful contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablation across multiple models × prompting strategies × privacy filters, though statistical testing is absent.
- Writing Quality: ⭐⭐⭐⭐ — Well-organized and clearly presented, with complete formal derivations.
- Value: ⭐⭐⭐⭐ — Provides a practical baseline and engineering guidance for zero-shot surveillance anomaly detection under privacy constraints.