Text-guided Fine-Grained Video Anomaly Understanding¶
Conference: CVPR 2026 · arXiv: 2511.00524 · Code: github.com/momiji-bit/T-VAU · Area: Interpretability · Keywords: Video Anomaly Detection, Anomaly Heatmap, Region-Aware Encoder, Large Vision-Language Model, Multi-turn Dialogue
TL;DR¶
This paper proposes the T-VAU framework, which achieves pixel-level spatiotemporal anomaly localization via an Anomaly Heatmap Decoder (AHD), and introduces a Region-Aware Anomaly Encoder (RAE) that injects heatmap evidence into an LVLM for unified reasoning over anomaly detection, localization, and semantic explanation.
Background & Motivation¶
Video Anomaly Detection (VAD) is critical for security surveillance, yet existing approaches suffer from fundamental limitations:
- Traditional VAD: outputs video- or frame-level anomaly scores, providing coarse-grained binary decisions without interpretable evidence; fine-grained cues may be diluted during feature aggregation.
- Direct LVLM application: can produce textual judgments but lacks pixel-level localization; weak anomaly signals are captured unreliably, leading to unfaithful textual descriptions.
- LVLM–diffusion hybrids: combine visualization and text but may be unstable or inconsistent.
Core requirement: Anomaly understanding demands not only "whether anomalous" but also "where," "which object is responsible," and "how it evolves over time"—requiring a closed loop from pixel-level evidence to language-level reasoning.
Key Insight: (i) Extract spatiotemporal anomaly evidence via visual–textual alignment; (ii) inject this evidence as structured prompts into an LVLM for multi-task, multi-turn reasoning.
Method¶
Overall Architecture¶
T-VAU adds two lightweight trainable modules—AHD (Anomaly Heatmap Decoder) and RAE (Region-Aware Anomaly Encoder)—on top of a frozen LVLM backbone. The input consists of a video, natural language queries, and normal/anomalous text prompts; the output is a pixel-level anomaly heatmap and multi-turn dialogue responses.
Key Designs¶
- Anomaly Heatmap Decoder (AHD):
- Extracts multi-scale features \(V_i\) from the visual encoder (layers 1/8/16/32).
- Projects visual features to the text space via MLP and computes cosine similarity: \(h_c^i[t,h,w] = \text{CosineSimilarity}(V'_i[t,:,h,w], T_c)\)
- Fuses across layers with learnable weights \(w_i\): \(H_c = \sum_i w_i \cdot h_c^i\)
- Applies Softmax and selects the anomaly channel to obtain the final heatmap.
- Design Motivation: Leverages visual–textual alignment to extract anomaly signals directly from intermediate representations, eliminating the need for threshold setting.
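The AHD steps above can be sketched as follows. This is a minimal illustration of the computation, not the authors' implementation: all shapes, dimensions, and module names (e.g. a 1×1 convolution standing in for the MLP projector) are assumptions.

```python
# Sketch of AHD: project multi-scale visual features into the text space,
# score each location against normal/anomalous text embeddings via cosine
# similarity, fuse layers with learnable weights, then softmax and select
# the anomaly channel. Shapes and module choices are illustrative.
import torch
import torch.nn.functional as F

T, H, W, D_v, D_t = 8, 14, 14, 1024, 512    # frames, spatial grid, feature dims
num_layers = 4                               # e.g. visual-encoder layers 1/8/16/32

# V_i: multi-scale visual features; T_c: text embeddings for [normal, anomalous]
V = [torch.randn(T, D_v, H, W) for _ in range(num_layers)]
T_c = F.normalize(torch.randn(2, D_t), dim=-1)

proj = torch.nn.Conv2d(D_v, D_t, kernel_size=1)          # stand-in for the MLP projector
layer_w = torch.softmax(torch.randn(num_layers), dim=0)  # learnable fusion weights w_i

H_c = 0
for w_i, V_i in zip(layer_w, V):
    Vp = F.normalize(proj(V_i), dim=1)               # (T, D_t, H, W), unit-norm
    h_c = torch.einsum('tdhw,cd->tchw', Vp, T_c)     # cosine similarity per text channel c
    H_c = H_c + w_i * h_c                            # H_c = sum_i w_i * h_c^i

heatmap = H_c.softmax(dim=1)[:, 1]                   # anomaly channel, (T, H, W)
print(heatmap.shape)  # torch.Size([8, 14, 14])
```

Because the softmax normalizes the normal/anomalous channels against each other per location, the anomaly channel is directly usable as a heatmap without a hand-tuned threshold.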
- Region-Aware Anomaly Encoder (RAE):
- Computes temporal differences between adjacent frame heatmaps: \(X[t] = H_c[t+1] - H_c[t]\), capturing motion-aware information.
- Extracts region-aware features via a convolutional backbone.
- Divides each frame into a 3×3 grid and applies adaptive pooling to obtain region prompts \(P_{region}\).
- Generates a global prompt \(p_{global}\) via spatial average pooling.
- Final prompt sequence: \(P_{An} = [P_{base}, P_{region}, p_{global}]\)
- Concatenated with visual prompts and dialogue context before being fed into the LLM decoder.
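The RAE prompt construction can be sketched as below. Module choices and dimensions (the single conv layer, the prompt width, the placeholder base prompt) are assumptions for illustration, not the paper's exact code.

```python
# Sketch of RAE: temporal differences of adjacent heatmaps -> conv features
# -> 3x3 adaptive pooling for region prompts + global average pooling,
# concatenated into the prompt sequence P_An = [P_base, P_region, p_global].
import torch
import torch.nn.functional as F

T, Hm, Wm, D = 8, 56, 56, 512               # frames, heatmap size, prompt dim
H_c = torch.rand(T, Hm, Wm)                 # anomaly heatmaps from AHD

# Motion-aware input: X[t] = H_c[t+1] - H_c[t]
X = H_c[1:] - H_c[:-1]                      # (T-1, Hm, Wm)

backbone = torch.nn.Conv2d(1, D, kernel_size=3, padding=1)  # stand-in for the conv backbone
feats = backbone(X.unsqueeze(1))            # (T-1, D, Hm, Wm) region-aware features

# 3x3 grid -> 9 region prompts per frame; global prompt via spatial mean pooling
P_region = F.adaptive_avg_pool2d(feats, 3).flatten(2).transpose(1, 2)  # (T-1, 9, D)
p_global = feats.mean(dim=(2, 3)).unsqueeze(1)                         # (T-1, 1, D)

P_base = torch.zeros(T - 1, 1, D)           # learnable base prompt (placeholder here)
P_an = torch.cat([P_base, P_region, p_global], dim=1)   # (T-1, 11, D) prompt sequence
print(P_an.shape)  # torch.Size([7, 11, 512])
```

The resulting sequence is what gets concatenated with the visual prompts and dialogue context before the LLM decoder.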
- Fine-Grained Anomaly Understanding Dataset Construction:
- Based on ShanghaiTech and UBnormal, using a three-stage pipeline:
- Frame-level structured prompting → extraction of object attributes and spatial information → aggregation into object timelines.
- Anomaly-focused refinement: background suppression via anomaly masks and Gaussian blur.
- Cross-modal consistency verification: bidirectional appearance↔motion validation.
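The anomaly-focused refinement step (background suppression) can be sketched as masked Gaussian blurring. Kernel size and sigma are assumptions; the paper does not specify these values.

```python
# Sketch of background suppression: keep the region under the anomaly mask
# sharp and Gaussian-blur everything else. Kernel size/sigma are assumed.
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=11, sigma=3.0):
    ax = (torch.arange(ksize) - ksize // 2).float()
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

def suppress_background(frame, mask, ksize=11, sigma=3.0):
    """frame: (3, H, W) in [0, 1]; mask: (H, W) binary anomaly mask."""
    k = gaussian_kernel(ksize, sigma)
    # Treat the 3 color channels as a batch of single-channel images
    blurred = F.conv2d(frame.unsqueeze(1), k, padding=ksize // 2).squeeze(1)
    m = mask.unsqueeze(0).float()
    return m * frame + (1 - m) * blurred    # anomaly sharp, background blurred

frame = torch.rand(3, 64, 64)
mask = torch.zeros(64, 64)
mask[20:40, 20:40] = 1                      # toy anomaly region
out = suppress_background(frame, mask)
print(out.shape)  # torch.Size([3, 64, 64])
```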
Loss & Training¶
- AHD stage: Only the AHD is optimized; all other components are frozen.
- RAE stage: Curriculum-based supervised fine-tuning (appearance–motion narration → anomaly-focused refinement).
- Overall: The LVLM backbone is frozen; only the two lightweight modules, AHD and RAE, are trained.
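The freezing scheme above amounts to routing gradients only through the lightweight modules. A toy sketch with stand-in modules (whether AHD stays frozen during the RAE stage is an assumption here):

```python
# Toy sketch of the two-stage training setup: the LVLM backbone is frozen
# throughout; each stage builds an optimizer over only the active module.
import torch

backbone = torch.nn.Linear(8, 8)    # stand-in for the frozen LVLM backbone
ahd = torch.nn.Linear(8, 2)         # stand-in for the Anomaly Heatmap Decoder
rae = torch.nn.Linear(2, 8)         # stand-in for the Region-Aware Anomaly Encoder

for p in backbone.parameters():
    p.requires_grad_(False)         # backbone frozen in both stages

# Stage 1: optimize AHD only
opt_ahd = torch.optim.AdamW(ahd.parameters(), lr=1e-4)

# Stage 2: freeze AHD (assumed), optimize RAE only via curriculum SFT
for p in ahd.parameters():
    p.requires_grad_(False)
opt_rae = torch.optim.AdamW(rae.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in rae.parameters() if p.requires_grad)
print(trainable > 0)  # True: only RAE parameters remain trainable in stage 2
```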
Key Experimental Results¶
Main Results¶
| Dataset | Metric | T-VAU | Prev. SOTA | Gain |
|---|---|---|---|---|
| UBnormal | Micro-AUC | 94.8 | 68.2 (Georgescu FT) | +26.6 |
| UBnormal | RBDC | 67.8 | 28.7 (Georgescu FT) | +39.1 |
| UBnormal | TBDC | 76.7 | 58.1 (Georgescu FT) | +18.6 |
| ShanghaiTech | BLEU-4 (Target) | 62.67 | 55.73 (InternVL 8B) | +6.94 |
| ShanghaiTech | BLEU-4 (Trajectory) | 88.84 | 82.65 (InternVL 8B) | +6.19 |
| ShanghaiTech | Yes/No Acc | 97.67% | 94.28% (InternVL 8B) | +3.39% |
Ablation Study¶
| Configuration | RBDC/TBDC | BLEU-4 (Target) | Yes/No Acc |
|---|---|---|---|
| T-VAU (full) | 67.8/76.7 | 62.67 | 97.67% |
| w/o AHD | N/A | 61.82 | 95.38% |
| w/o RAE | 67.8/76.7 | - | - |
| w/o AHD & RAE | N/A | 61.82 | 95.38% |
Key Findings¶
- AHD and RAE are strongly complementary: AHD provides pixel-level evidence while RAE translates this evidence into interpretable language.
- Under a one-shot setting, AHD already achieves 94.5% micro-AUC and 64.3% RBDC, demonstrating exceptional data efficiency.
- Fine-tuning yields further gains, but the one-shot baseline is already competitive.
- The parameter overhead is only about 50M (8,274M → 8,325M total), keeping the framework lightweight and efficient.
Highlights & Insights¶
- The closed-loop "evidence→reasoning" design: anomaly heatmaps serve as visual evidence, and RAE structurally injects this evidence into the language model.
- The fine-grained dataset construction pipeline is systematic and complete: frame-level extraction → temporal aggregation → anomaly-focused refinement → cross-modal verification.
- Trajectory visualization (cumulative heatmaps across frames) provides intuitive temporal consistency verification.
- The threshold-free anomaly localization design avoids the threshold sensitivity inherent in conventional methods.
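The trajectory visualization can be approximated by accumulating per-frame heatmaps; a running max is one plausible accumulation rule (the paper's exact operator is not specified here):

```python
# Toy sketch of trajectory visualization: accumulate per-frame anomaly
# heatmaps (via a running max) so later frames trace the anomaly's path.
import torch

heatmaps = torch.rand(8, 56, 56)                     # per-frame heatmaps (T, H, W)
trajectory = torch.cummax(heatmaps, dim=0).values    # cumulative heatmap per frame
print(trajectory.shape)  # torch.Size([8, 56, 56])
# trajectory[-1] overlays the full path; intermediate frames show its evolution.
```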
Limitations & Future Work¶
- Performance remains challenging for micro-actions (minimal displacement) and highly non-rigid motion scenarios.
- Scene-dependent appearance variations (specular reflections, fog, etc.) degrade localization accuracy.
- The dataset is constructed from ShanghaiTech and UBnormal, limiting scene diversity.
- Freezing the LVLM backbone may constrain deeper anomaly understanding capacity.
Related Work & Insights¶
- Compared to VAU methods such as HAWK and Holmes-VAU, T-VAU provides explicit pixel-level evidence through AHD.
- Training-free methods such as LAVAD are conceptually appealing but lack precise localization.
- Examining anomaly detection through the lens of Subtle Visual Computing (SVC) represents a promising research direction.
Rating¶
- Novelty: ⭐⭐⭐⭐ The evidence–reasoning closed-loop design via AHD+RAE is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional evaluation with complete ablation and qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear, and inter-component relationships are well articulated.
- Value: ⭐⭐⭐⭐ Advances anomaly detection from score prediction to interpretable reasoning.