Text-guided Fine-Grained Video Anomaly Understanding¶
Conference: CVPR 2026 · arXiv: 2511.00524 · Code: github.com/momiji-bit/T-VAU · Area: Interpretability · Keywords: Video Anomaly Detection, Anomaly Heatmap, Region-Aware Encoder, Large Vision-Language Model, Multi-turn Dialogue
TL;DR¶
This paper proposes the T-VAU framework, which achieves pixel-level spatiotemporal anomaly localization via an Anomaly Heatmap Decoder (AHD), and introduces a Region-Aware Anomaly Encoder (RAE) that injects heatmap evidence into an LVLM for unified reasoning over anomaly detection, localization, and semantic explanation.
Background & Motivation¶
Video Anomaly Detection (VAD) is critical for security surveillance, yet existing approaches suffer from fundamental limitations:
- Traditional VAD: outputs video- or frame-level anomaly scores, providing coarse-grained binary decisions without interpretable evidence; fine-grained cues may be diluted during feature aggregation.
- Direct LVLM application: can produce textual judgments but lacks pixel-level localization; weak anomaly signals are captured unreliably, leading to unfaithful textual descriptions.
- LVLM–diffusion hybrids: combine visualization and text but may be unstable or inconsistent.
Core requirement: Anomaly understanding demands not only "whether anomalous" but also "where," "which object is responsible," and "how it evolves over time"—requiring a closed loop from pixel-level evidence to language-level reasoning.
Key Insight: (i) Extract spatiotemporal anomaly evidence via visual–textual alignment; (ii) inject this evidence as structured prompts into an LVLM for multi-task, multi-turn reasoning.
Method¶
Overall Architecture¶
T-VAU adds two lightweight trainable modules—AHD (Anomaly Heatmap Decoder) and RAE (Region-Aware Anomaly Encoder)—on top of a frozen LVLM backbone. The input consists of a video, natural language queries, and normal/anomalous text prompts; the output is a pixel-level anomaly heatmap and multi-turn dialogue responses.
Key Designs¶
- Anomaly Heatmap Decoder (AHD):
- Extracts multi-scale features \(V_i\) from the visual encoder (layers 1/8/16/32).
- Projects visual features to the text space via MLP and computes cosine similarity: \(h_c^i[t,h,w] = \text{CosineSimilarity}(V'_i[t,:,h,w], T_c)\)
- Fuses across layers with learnable weights \(w_i\): \(H_c = \sum_i w_i \cdot h_c^i\)
- Applies Softmax and selects the anomaly channel to obtain the final heatmap.
- Design Motivation: Leverages visual–textual alignment to extract anomaly signals directly from intermediate representations, eliminating the need for threshold setting.
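The AHD steps above can be sketched as follows. This is a minimal illustration of the computation, not the authors' implementation: all shapes, dimensions, and module names (e.g. a 1×1 convolution standing in for the MLP projector) are assumptions.

```python
# Sketch of AHD: project multi-scale visual features into the text space,
# score each location against normal/anomalous text embeddings via cosine
# similarity, fuse layers with learnable weights, then softmax and select
# the anomaly channel. Shapes and module choices are illustrative.
import torch
import torch.nn.functional as F

T, H, W, D_v, D_t = 8, 14, 14, 1024, 512    # frames, spatial grid, feature dims
num_layers = 4                               # e.g. visual-encoder layers 1/8/16/32

# V_i: multi-scale visual features; T_c: text embeddings for [normal, anomalous]
V = [torch.randn(T, D_v, H, W) for _ in range(num_layers)]
T_c = F.normalize(torch.randn(2, D_t), dim=-1)

proj = torch.nn.Conv2d(D_v, D_t, kernel_size=1)          # stand-in for the MLP projector
layer_w = torch.softmax(torch.randn(num_layers), dim=0)  # learnable fusion weights w_i

H_c = 0
for w_i, V_i in zip(layer_w, V):
    Vp = F.normalize(proj(V_i), dim=1)               # (T, D_t, H, W), unit-norm
    h_c = torch.einsum('tdhw,cd->tchw', Vp, T_c)     # cosine similarity per text channel c
    H_c = H_c + w_i * h_c                            # H_c = sum_i w_i * h_c^i

heatmap = H_c.softmax(dim=1)[:, 1]                   # anomaly channel, (T, H, W)
print(heatmap.shape)  # torch.Size([8, 14, 14])
```

Because the softmax normalizes the normal/anomalous channels against each other per location, the anomaly channel is directly usable as a heatmap without a hand-tuned threshold.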
- Region-Aware Anomaly Encoder (RAE):
- Computes temporal differences between adjacent frame heatmaps: \(X[t] = H_c[t+1] - H_c[t]\), capturing motion-aware information.
- Extracts region-aware features via a convolutional backbone.
- Divides each frame into a 3×3 grid and applies adaptive pooling to obtain region prompts \(P_{region}\).
- Generates a global prompt \(p_{global}\) via spatial average pooling.
- Final prompt sequence: \(P_{An} = [P_{base}, P_{region}, p_{global}]\)
- Concatenated with visual prompts and dialogue context before being fed into the LLM decoder.
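The RAE prompt construction can be sketched as below. Module choices and dimensions (the single conv layer, the prompt width, the placeholder base prompt) are assumptions for illustration, not the paper's exact code.

```python
# Sketch of RAE: temporal differences of adjacent heatmaps -> conv features
# -> 3x3 adaptive pooling for region prompts + global average pooling,
# concatenated into the prompt sequence P_An = [P_base, P_region, p_global].
import torch
import torch.nn.functional as F

T, Hm, Wm, D = 8, 56, 56, 512               # frames, heatmap size, prompt dim
H_c = torch.rand(T, Hm, Wm)                 # anomaly heatmaps from AHD

# Motion-aware input: X[t] = H_c[t+1] - H_c[t]
X = H_c[1:] - H_c[:-1]                      # (T-1, Hm, Wm)

backbone = torch.nn.Conv2d(1, D, kernel_size=3, padding=1)  # stand-in for the conv backbone
feats = backbone(X.unsqueeze(1))            # (T-1, D, Hm, Wm) region-aware features

# 3x3 grid -> 9 region prompts per frame; global prompt via spatial mean pooling
P_region = F.adaptive_avg_pool2d(feats, 3).flatten(2).transpose(1, 2)  # (T-1, 9, D)
p_global = feats.mean(dim=(2, 3)).unsqueeze(1)                         # (T-1, 1, D)

P_base = torch.zeros(T - 1, 1, D)           # learnable base prompt (placeholder here)
P_an = torch.cat([P_base, P_region, p_global], dim=1)   # (T-1, 11, D) prompt sequence
print(P_an.shape)  # torch.Size([7, 11, 512])
```

The resulting sequence is what gets concatenated with the visual prompts and dialogue context before the LLM decoder.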
- Fine-Grained Anomaly Understanding Dataset Construction:
- Based on ShanghaiTech and UBnormal, using a three-stage pipeline:
- Frame-level structured prompting → extraction of object attributes and spatial information → aggregation into object timelines.
- Anomaly-focused refinement: background suppression via anomaly masks and Gaussian blur.
- Cross-modal consistency verification: bidirectional appearance↔motion validation.
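The anomaly-focused refinement step (background suppression) can be sketched as masked Gaussian blurring. Kernel size and sigma are assumptions; the paper does not specify these values.

```python
# Sketch of background suppression: keep the region under the anomaly mask
# sharp and Gaussian-blur everything else. Kernel size/sigma are assumed.
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=11, sigma=3.0):
    ax = (torch.arange(ksize) - ksize // 2).float()
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

def suppress_background(frame, mask, ksize=11, sigma=3.0):
    """frame: (3, H, W) in [0, 1]; mask: (H, W) binary anomaly mask."""
    k = gaussian_kernel(ksize, sigma)
    # Treat the 3 color channels as a batch of single-channel images
    blurred = F.conv2d(frame.unsqueeze(1), k, padding=ksize // 2).squeeze(1)
    m = mask.unsqueeze(0).float()
    return m * frame + (1 - m) * blurred    # anomaly sharp, background blurred

frame = torch.rand(3, 64, 64)
mask = torch.zeros(64, 64)
mask[20:40, 20:40] = 1                      # toy anomaly region
out = suppress_background(frame, mask)
print(out.shape)  # torch.Size([3, 64, 64])
```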
Loss & Training¶
- AHD stage: Only the AHD is optimized; all other components are frozen.
- RAE stage: Curriculum-based supervised fine-tuning (appearance–motion narration → anomaly-focused refinement).
- Overall: The LVLM backbone is frozen; only the two lightweight modules, AHD and RAE, are trained.
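The freezing scheme above amounts to routing gradients only through the lightweight modules. A toy sketch with stand-in modules (whether AHD stays frozen during the RAE stage is an assumption here):

```python
# Toy sketch of the two-stage training setup: the LVLM backbone is frozen
# throughout; each stage builds an optimizer over only the active module.
import torch

backbone = torch.nn.Linear(8, 8)    # stand-in for the frozen LVLM backbone
ahd = torch.nn.Linear(8, 2)         # stand-in for the Anomaly Heatmap Decoder
rae = torch.nn.Linear(2, 8)         # stand-in for the Region-Aware Anomaly Encoder

for p in backbone.parameters():
    p.requires_grad_(False)         # backbone frozen in both stages

# Stage 1: optimize AHD only
opt_ahd = torch.optim.AdamW(ahd.parameters(), lr=1e-4)

# Stage 2: freeze AHD (assumed), optimize RAE only via curriculum SFT
for p in ahd.parameters():
    p.requires_grad_(False)
opt_rae = torch.optim.AdamW(rae.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in rae.parameters() if p.requires_grad)
print(trainable > 0)  # True: only RAE parameters remain trainable in stage 2
```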
Key Experimental Results¶
Main Results¶
| Dataset | Metric | T-VAU | Prev. SOTA | Gain |
|---|---|---|---|---|
| UBnormal | Micro-AUC | 94.8 | 68.2 (Georgescu FT) | +26.6 |
| UBnormal | RBDC | 67.8 | 28.7 (Georgescu FT) | +39.1 |
| UBnormal | TBDC | 76.7 | 58.1 (Georgescu FT) | +18.6 |
| ShanghaiTech | BLEU-4 (Target) | 62.67 | 55.73 (InternVL 8B) | +6.94 |
| ShanghaiTech | BLEU-4 (Trajectory) | 88.84 | 82.65 (InternVL 8B) | +6.19 |
| ShanghaiTech | Yes/No Acc | 97.67% | 94.28% (InternVL 8B) | +3.39% |
Ablation Study¶
| Configuration | RBDC/TBDC | BLEU-4 (Target) | Yes/No Acc |
|---|---|---|---|
| T-VAU (full) | 67.8/76.7 | 62.67 | 97.67% |
| w/o AHD | N/A | 61.82 | 95.38% |
| w/o RAE | 67.8/76.7 | - | - |
| w/o AHD & RAE | N/A | 61.82 | 95.38% |
Key Findings¶
- AHD and RAE are strongly complementary: AHD provides pixel-level evidence while RAE translates this evidence into interpretable language.
- Under a one-shot setting, AHD already achieves 94.5% micro-AUC and 64.3% RBDC, demonstrating exceptional data efficiency.
- Fine-tuning yields further gains, but the one-shot baseline is already competitive.
- The parameter overhead is only about 50M (8,274M → 8,325M total), keeping the framework lightweight and efficient.
Highlights & Insights¶
- The closed-loop "evidence→reasoning" design: anomaly heatmaps serve as visual evidence, and RAE structurally injects this evidence into the language model.
- The fine-grained dataset construction pipeline is systematic and complete: frame-level extraction → temporal aggregation → anomaly-focused refinement → cross-modal verification.
- Trajectory visualization (cumulative heatmaps across frames) provides intuitive temporal consistency verification.
- The threshold-free anomaly localization design avoids the threshold sensitivity inherent in conventional methods.
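The trajectory visualization can be approximated by accumulating per-frame heatmaps; a running max is one plausible accumulation rule (the paper's exact operator is not specified here):

```python
# Toy sketch of trajectory visualization: accumulate per-frame anomaly
# heatmaps (via a running max) so later frames trace the anomaly's path.
import torch

heatmaps = torch.rand(8, 56, 56)                     # per-frame heatmaps (T, H, W)
trajectory = torch.cummax(heatmaps, dim=0).values    # cumulative heatmap per frame
print(trajectory.shape)  # torch.Size([8, 56, 56])
# trajectory[-1] overlays the full path; intermediate frames show its evolution.
```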
Limitations & Future Work¶
- Performance remains challenging for micro-actions (minimal displacement) and highly non-rigid motion scenarios.
- Scene-dependent appearance variations (specular reflections, fog, etc.) degrade localization accuracy.
- The dataset is constructed from ShanghaiTech and UBnormal, limiting scene diversity.
- Freezing the LVLM backbone may constrain deeper anomaly understanding capacity.
Related Work & Insights¶
- Compared to VAU methods such as HAWK and Holmes-VAU, T-VAU provides explicit pixel-level evidence through AHD.
- Training-free methods such as LAVAD are conceptually appealing but lack precise localization.
- Examining anomaly detection through the lens of Subtle Visual Computing (SVC) represents a promising research direction.
Rating¶
- Novelty: ⭐⭐⭐⭐ The evidence–reasoning closed-loop design via AHD+RAE is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional evaluation with complete ablation and qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear, and inter-component relationships are well articulated.
- Value: ⭐⭐⭐⭐ Advances anomaly detection from score prediction to interpretable reasoning.