Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
Conference: CVPR 2026
arXiv: 2603.00550
Code: None
Area: LLM Evaluation
Keywords: Weakly supervised video anomaly detection, connected components, intention reasoning, CLIP, multiple instance learning
TL;DR
This paper proposes LAS-VAD, a framework built on two mechanisms. Anomaly-Connected Components (ACC) partitions video frames into semantically consistent groups and generates pseudo-labels from them, compensating for the absence of frame-level annotations. The Intention-Aware Mechanism (IAM) uses position, velocity, and acceleration features to distinguish normal from anomalous behaviors that look similar but differ in intention. The method achieves 89.96% AP (I3D) on XD-Violence.
Background & Motivation
Background: Weakly supervised video anomaly detection (WS-VAD) uses only video-level labels and identifies anomalous temporal intervals via multiple instance learning (MIL). Mainstream approaches adopt a pipeline of pretrained feature extraction followed by a classifier.
Limitations of Prior Work:
- Insufficient semantic information: The absence of frame-level annotations makes it difficult for models to learn semantic representations of anomalies; learning can only proceed indirectly through MIL's top-K strategy.
- Ambiguous behavioral discrimination: Normal and anomalous behaviors can appear highly similar (e.g., "picking up an object" vs. "stealing an object"), making appearance-only features insufficient for discrimination.
Key Challenge: Missing frame-level annotations ↔ requirement for frame-level semantic understanding; similar appearance ↔ different intentions.
Key Insight:
- Semantic problem: Construct a connected-component graph based on inter-frame similarity so that frames in the same group share semantic labels → pseudo-labels.
- Intention problem: Anomalous behaviors often exhibit abnormal velocity/acceleration (stealing is faster than normally picking up an object); kinematic features are used to reason about intention.
Core Idea: Learning anomaly semantics = spatial semantic grouping (ACC) + motion intention reasoning (IAM) + anomaly attribute augmentation.
Method
Overall Architecture
The core pipeline of LAS-VAD:
1. Extract frame-level features \(X_\text{video} \in \mathbb{R}^{T \times D}\) using a CLIP visual encoder.
2. Model temporal dependencies with a local Transformer and GCN to obtain enhanced features \(X_f\).
3. Extract anomaly category features \(X_\text{lang}\) via a text encoder; generate anomaly attribute description features \(X_\text{aux}\) with an LLM.
4. The ACC module generates frame-level pseudo-labels to guide learning.
5. The IAM module distinguishes intentions via kinematic features.
6. Multi-branch prediction fusion: \(p^f = \frac{1}{3}(q^m + q^a + q^l)\).
Key Designs
- Anomaly-Connected Components (ACC):
    - Function: Partition video frames into semantically consistent, non-overlapping groups to generate frame-level pseudo-labels.
    - Mechanism:
        - Compute inter-frame visual similarity \(\mathcal{A}_v[i,j] = \frac{X_f^i \cdot X_f^j}{\|X_f^i\| \, \|X_f^j\|}\), i.e., the cosine similarity between the enhanced features of frames \(i\) and \(j\).
        - Correct bias using cross-modal semantic similarity: \(\hat{\mathcal{A}}_w[i,j] = \mathcal{A}_v[i,j] \cdot (1 + \eta \cdot \max_c \min(q^l[i,c], q^l[j,c]))\).
        - Binarize \(\mathcal{A} = (\hat{\mathcal{A}}_w > \tau)\) to construct an adjacency matrix.
        - Apply DFS to identify connected components \(B_1, B_2, ..., B_r\); frames within the same component share a semantic label.
    - Design Motivation: Circumvents the absence of frame-level annotations: it is unnecessary to know the label of each individual frame; it suffices to identify which frames belong to the same semantic group.
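The ACC steps above (cosine similarity, cross-modal correction, thresholding, DFS) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `anomaly_connected_components`, the value `eta=0.5`, and the toy inputs are hypothetical, and nested lists stand in for the tensor operations a real model would use. The threshold `tau=0.9` follows the value mentioned in the limitations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def anomaly_connected_components(feats, q_l, eta=0.5, tau=0.9):
    """Group frames into connected components of the thresholded similarity graph.

    feats: T frame feature vectors (rows of X_f).
    q_l:   T rows of per-class text-alignment scores.
    eta:   correction strength (illustrative value, not taken from the paper).
    tau:   binarization threshold.
    Returns a list of components, each a sorted list of frame indices.
    """
    T = len(feats)
    adj = [[False] * T for _ in range(T)]
    for i in range(T):
        for j in range(T):
            a_v = cosine(feats[i], feats[j])
            # max over classes of the smaller of the two frames' class scores
            agree = max(min(qi, qj) for qi, qj in zip(q_l[i], q_l[j]))
            adj[i][j] = a_v * (1.0 + eta * agree) > tau

    seen, components = set(), []
    for start in range(T):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp.append(node)
            for nxt in range(T):
                if adj[node][nxt] and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components

# Toy input: frames 0, 1, 4 look alike; frames 2, 3 look alike.
feats = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.02]]
q_l = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
groups = anomaly_connected_components(feats, q_l)
print(groups)
```

In this toy case the components need not be temporally contiguous: frame 4 joins frames 0 and 1 because the grouping depends only on similarity, not on position in the video.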
- Intention-Aware Mechanism (IAM):
    - Function: Infer behavioral intention from kinematic features to distinguish behaviors with similar appearances but different intentions.
    - Mechanism:
        - Extract positional features \(X_p\) from \(X_f\); compute velocity \(X_v\) and acceleration \(X_a\) via differencing.
        - Gating: \(X_v = \text{Sigmoid}(\text{Conv}(X_v^\text{diff})) \times X_v^\text{diff}\).
        - Concatenate to obtain intention features \(X_\text{int} \in \mathbb{R}^{T \times D}\).
        - Establish intention prototypes \(Z \in \mathbb{R}^{(C+1) \times D}\) with momentum updates.
        - Cross-intention contrastive learning: mine the least similar positive samples within the same class and the most similar negative samples from different classes, constrained by an InfoNCE loss: \[\mathcal{L}_\text{cst} = -\frac{1}{T}\sum_{t=1}^T \log \frac{\exp(X_\text{int}^t \cdot S_\text{pos}^t)}{\sum_{i=1}^M \exp(X_\text{int}^t \cdot S_\text{neg}^{t,i})}\] where \(S_\text{pos}^t\) is the mined positive and \(S_\text{neg}^{t,i}\) is the \(i\)-th of the \(M\) mined negatives for frame \(t\).
    - Design Motivation: The difference between theft and normal retrieval lies in "grasping speed"; velocity/acceleration features naturally encode this intentional distinction.
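The kinematic features and the hard-sample contrastive term can be sketched as follows. This is a rough illustration under several assumptions that the summary does not confirm: zero padding at the first frame, a scalar weight `w` standing in for the learned Conv layer in the gate, acceleration computed from the gated velocity, and no projection back to dimension \(D\) after concatenation. All function names are hypothetical, and the prototype momentum update is omitted.

```python
import math

def diff(seq):
    """First-order temporal difference; a zero row is used at t = 0 (assumed padding)."""
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def gate(seq, w=1.0):
    """x * sigmoid(w * x); the scalar w stands in for the learned Conv layer."""
    return [[x / (1.0 + math.exp(-w * x)) for x in row] for row in seq]

def intention_features(x_p):
    """Concatenate position, gated velocity, and gated acceleration per frame."""
    x_v = gate(diff(x_p))
    x_a = gate(diff(x_v))  # differencing the gated velocity is an assumption
    return [p + v + a for p, v, a in zip(x_p, x_v, x_a)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hard_contrastive_loss(anchor, same_class, other_class, m=2):
    """One frame's term of L_cst: hardest positive vs. the m hardest negatives."""
    pos = min(same_class, key=lambda s: dot(anchor, s))  # least similar positive
    negs = sorted(other_class, key=lambda s: dot(anchor, s), reverse=True)[:m]
    numerator = math.exp(dot(anchor, pos))
    denominator = sum(math.exp(dot(anchor, s)) for s in negs)
    return -math.log(numerator / denominator)

# Toy 1-D positions: the motion speeds up between frames.
x_p = [[0.0], [1.0], [3.0]]
x_int = intention_features(x_p)

loss = hard_contrastive_loss(
    anchor=[1.0, 0.0],
    same_class=[[1.0, 0.0], [0.5, 0.5]],
    other_class=[[0.0, 1.0], [0.4, 0.6], [-1.0, 0.0]],
)
print(x_int, loss)
```

The denominator here sums over negatives only, mirroring the formula as written in this summary; many InfoNCE formulations also include the positive term there.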
- Anomaly Attribute Augmentation:
    - Function: Leverage LLM-generated attribute descriptions for each anomaly category (e.g., "explosion → flames, dense smoke") as auxiliary features to support detection.
    - Mechanism: \(X_\text{text} = [X_\text{lang}; X_\text{aux}]\); compute cross-modal cosine similarity with video features to obtain \(q^l\).
    - Design Motivation: Anomalous events are accompanied by characteristic attributes; textual descriptions provide additional semantic guidance.
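A small sketch of how attribute descriptions could enter the text-alignment scores \(q^l\). The summary only states that category and attribute features are concatenated and compared with video features by cosine similarity; reading the concatenation as row stacking and reducing each class's text rows with a maximum are assumptions made here for illustration, and `text_alignment_scores` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def text_alignment_scores(x_f, x_lang, x_aux):
    """Return q_l[t][c]: best cosine match between frame t and any text row of class c.

    x_f:    T frame feature vectors.
    x_lang: one category-name embedding per class.
    x_aux:  per class, a list of attribute-description embeddings.
    """
    q_l = []
    for frame in x_f:
        row = []
        for name_emb, attr_embs in zip(x_lang, x_aux):
            x_text = [name_emb] + attr_embs  # [X_lang; X_aux] for this class
            row.append(max(cosine(frame, t) for t in x_text))  # max is an assumption
        q_l.append(row)
    return q_l

# Toy embeddings: frame 1 matches class 1 only through an attribute row.
x_f = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x_lang = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
x_aux = [[[0.9, 0.1, 0.0]], [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]]]
q_l = text_alignment_scores(x_f, x_lang, x_aux)
print(q_l)
```

In the toy data the second frame is orthogonal to the class-1 name embedding but aligned with one of its attribute embeddings, which is the kind of case the attribute augmentation is meant to cover.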
Loss & Training
- \(\mathcal{L}_\text{ags}\): Binary cross-entropy (coarse-grained anomaly/normal).
- \(\mathcal{L}_\text{fg}\): Multi-class cross-entropy (fine-grained anomaly categories).
- \(\mathcal{L}_\text{aux}\): L1 loss on ACC pseudo-labels.
- \(\mathcal{L}_\text{reg}\): Consistency regularization between coarse- and fine-grained predictions.
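Two of these terms can be illustrated with a short sketch. The summary does not give the exact formulations, so the following are assumptions: the coarse-grained BCE is applied to a video-level score obtained by top-K MIL pooling, the pseudo-labels are per-frame values in [0, 1], and the weight `lam` is a made-up value. The fine-grained and regularization terms are left out because their form is not specified here.

```python
import math

def topk_mil_bce(frame_scores, video_label, k=2):
    """Video-level BCE on the mean of the k highest frame scores (top-K MIL pooling)."""
    top = sorted(frame_scores, reverse=True)[:k]
    p = sum(top) / len(top)
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to keep the logarithms finite
    return -(video_label * math.log(p) + (1 - video_label) * math.log(1.0 - p))

def l1_pseudo_label_loss(frame_scores, pseudo_labels):
    """Mean absolute error between frame scores and ACC-style pseudo-labels."""
    return sum(abs(s - y) for s, y in zip(frame_scores, pseudo_labels)) / len(frame_scores)

frame_scores = [0.1, 0.2, 0.9, 0.7]   # hypothetical per-frame anomaly scores
pseudo_labels = [0.0, 0.0, 1.0, 1.0]  # hypothetical frame-level pseudo-labels
lam = 1.0                              # hypothetical loss weight

l_ags = topk_mil_bce(frame_scores, video_label=1)
l_aux = l1_pseudo_label_loss(frame_scores, pseudo_labels)
total = l_ags + lam * l_aux
print(l_ags, l_aux, total)
```

The contrast between the two functions shows why pseudo-labels matter: the MIL term only sees the top-scoring frames of an anomalous video, while the L1 term supervises every frame.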
Key Experimental Results
Main Results
| Dataset | Feature | Metric | LAS-VAD | Prev. SOTA | Gain |
|---|---|---|---|---|---|
| XD-Violence | I3D | AP(%) | 89.96 | LEC-VAD 88.47 | +1.49 |
| XD-Violence | CLIP | AP(%) | 87.92 | LEC-VAD 86.56 | +1.36 |
| UCF-Crime | I3D | AUC(%) | 91.05 | π-VAD 90.33 | +0.72 |
| UCF-Crime | CLIP | AUC(%) | 90.86 | LEC-VAD 89.97 | +0.89 |
Fine-grained mAP on XD-Violence at IoU thresholds 0.1–0.5, with the average over the five thresholds (AVG):
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | AVG |
|---|---|---|---|---|---|---|
| LEC-VAD | 19.65 | 17.17 | 14.37 | 9.45 | 7.18 | 13.56 |
| LAS-VAD | 22.07 | 19.96 | 16.18 | 11.24 | 8.64 | 15.62 |
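As a quick arithmetic check, the AVG column can be recomputed from the per-threshold values in the table. The variable names are illustrative only.

```python
# Per-threshold mAP values copied from the table (IoU 0.1 to 0.5).
map_at_iou = {
    "LEC-VAD": [19.65, 17.17, 14.37, 9.45, 7.18],
    "LAS-VAD": [22.07, 19.96, 16.18, 11.24, 8.64],
}

# Mean over the five IoU thresholds, rounded as in the table.
avg = {name: round(sum(vals) / len(vals), 2) for name, vals in map_at_iou.items()}
avg_gain = round(avg["LAS-VAD"] - avg["LEC-VAD"], 2)
print(avg, avg_gain)
```

Both averages agree with the reported AVG column, and the resulting gap between the two methods is 2.06 points.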
Ablation Study
| ATT (attribute augmentation) | ACC | IAM | mAP AVG | Note |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 24.24 | Baseline |
| ✓ | ✗ | ✗ | 26.50 | Attribute augmentation is effective (+2.26) |
| ✓ | ✓ | ✗ | 29.78 | ACC contributes most (+3.28) |
| ✓ | ✓ | ✓ | 29.98 | IAM yields a further, smaller improvement (+0.20) |
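The per-module gains quoted for the ablation follow directly from consecutive rows of the table; the short check below recomputes them. The row labels are shorthand introduced here.

```python
# (configuration, mAP AVG) pairs copied from the ablation table, in order.
rows = [("Baseline", 24.24), ("+ATT", 26.50), ("+ACC", 29.78), ("+IAM", 29.98)]

# Gain of each added module over the previous row.
gains = {cur[0]: round(cur[1] - prev[1], 2) for prev, cur in zip(rows, rows[1:])}
total_gain = round(rows[-1][1] - rows[0][1], 2)
print(gains, total_gain)
```

The three increments sum to the total improvement of 5.74 points over the baseline, with ACC accounting for more than half of it.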
Key Findings
- ACC (connected components) is the most impactful module; pseudo-labels provide critical frame-level supervision.
- IAM's intention reasoning mainly helps in scenarios where normal and anomalous behaviors look alike; its overall gain in the ablation is modest (+0.20).
- LLM-generated anomaly attribute descriptions provide meaningful semantic supplementation (+2.26).
- The method achieves state-of-the-art results across two datasets and three feature extractors (C3D/I3D/CLIP).
Highlights & Insights
- Connected components for frame grouping: The paper applies the graph-theoretic concept of connected components to the semantic grouping of video frames, a concise and effective idea. The key is the cross-modal semantic correction step: pure visual similarity is biased, and the correction yields more accurate groupings.
- Kinematic encoding of intention: Designing features around the physics-inspired notion of position, velocity, and acceleration is intuitively sound, since theft actions are often faster than normal retrieval. The gating mechanism for noise filtering is also a well-motivated design choice.
- LLM attribute descriptions as textual priors: Using GPT-4 to generate anomaly attribute descriptions is simple yet effective and requires no manual prompt engineering.
Limitations & Future Work
- The threshold \(\tau\) in ACC requires manual tuning (set to 0.9) and may be sensitive to different video types.
- The position/velocity/acceleration feature extraction in IAM is relatively simple (fully connected layers + differencing), which may be insufficient for modeling complex motion patterns.
- The method relies on GPT-4 for attribute description generation, introducing an external model dependency.
- The momentum update mechanism for intention prototypes may be unstable during early training.
- Validation on larger-scale datasets (e.g., anomaly subsets of Kinetics-700) has not been conducted.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of ACC and IAM is original; connected-component-based frame grouping is a notable contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across two datasets and multiple features, with complete ablation studies.
- Writing Quality: ⭐⭐⭐ Motivation descriptions are somewhat verbose, and the notation is dense.
- Value: ⭐⭐⭐⭐ Represents solid progress in the weakly supervised VAD domain.