Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment¶
Conference: AAAI 2026 arXiv: 2511.10334 Code: https://github.com/lessiYin/DSANet Area: Multimodal VLM Keywords: Weakly supervised video anomaly detection, semantic disentanglement, normal pattern modeling, contrastive alignment, CLIP
TL;DR¶
This paper proposes DSANet, which enhances the discriminability between normal and anomalous features in weakly supervised video anomaly detection (WS-VAD) at two levels: coarse-grained self-guided normal pattern modeling (SG-NM) and fine-grained disentangled contrastive semantic alignment (DCSA). DSANet achieves state-of-the-art performance with 86.95% AP (+1.14%) on XD-Violence and 13.01% fine-grained mAP (+3.39%) on UCF-Crime.
Background & Motivation¶
State of the Field¶
Weakly supervised video anomaly detection (WS-VAD) aims to localize anomalous segments in videos using only video-level labels (normal/anomalous, without frame-level annotations). Mainstream approaches follow the multiple instance learning (MIL) framework: features are extracted using pretrained backbones (I3D or CLIP), and a binary classifier produces frame-level anomaly scores. Recent methods (VadCLIP, PEMIL, ITC, etc.) leverage the vision-language pretraining capability of CLIP to identify anomaly categories via text prompts.
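Since the MIL recipe above underpins everything that follows, here is a minimal sketch of the video-level objective most WS-VAD methods optimize, assuming top-k mean pooling of frame scores (the pooling choice varies across methods and is not DSANet-specific):

```python
import torch
import torch.nn.functional as F

def mil_topk_loss(frame_scores: torch.Tensor, video_labels: torch.Tensor, k_ratio: float = 1 / 16):
    """Generic MIL-style video-level loss from frame-level anomaly scores.

    frame_scores: (B, T) frame-level anomaly scores in [0, 1]
    video_labels: (B,)   video-level labels (1 = anomalous, 0 = normal)
    """
    B, T = frame_scores.shape
    k = max(1, int(T * k_ratio))
    # A video's score is the mean of its top-k frame scores; salient frames
    # dominate, which is exactly the bias DSANet later compensates for.
    topk_scores = frame_scores.topk(k, dim=1).values.mean(dim=1)
    return F.binary_cross_entropy(topk_scores, video_labels.float())
```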
Limitations of Prior Work¶
Despite promising detection performance, existing WS-VAD methods suffer from two fundamental deficiencies:
Coarse-grained level: Incomplete understanding of normal patterns

- The discriminative nature of MIL causes models to focus on identifying the most salient anomalous clips
- Rich and diverse normal patterns in videos are not explicitly modeled
- Failure to construct robust normal representations leads to blurred normal–anomaly boundaries and high false positive rates
- For example, a complex yet normal scene (e.g., a crowded supermarket) may be misclassified as anomalous
Fine-grained level: Severe inter-class confusion

- Different anomaly categories may appear visually similar (e.g., "robbery" and "theft" both involve objects being taken)
- Background patterns between anomalous events and normal contexts may also be similar
- Without frame-level supervision, models frequently conflate co-occurring background patterns with genuine anomalies
- Features from different anomaly categories become entangled in the embedding space, reducing inter-class separability
Core Idea¶
"Learn to tell apart" at two levels: - Coarse-grained: Supplement the blind spots of MIL discriminative learning through generative normal pattern reconstruction - Fine-grained: Eliminate inter-class confusion by disentangling event/background features and aligning each with its corresponding semantics
Method¶
Overall Architecture¶
DSANet comprises three collaborative branches:

1. Anomaly detection branch: Produces frame-level binary anomaly scores under the MIL framework (foundation)
2. Self-guided normal pattern modeling branch (SG-NM): Mines video-specific normal patterns and guides feature reconstruction (coarse-grained enhancement)
3. Anomaly classification branch: Aligns video features with text category embeddings for fine-grained classification, incorporating the DCSA mechanism (fine-grained enhancement)
Key Designs¶
1. Self-Guided Normal Pattern Modeling (SG-NM)¶
Mechanism: Even in anomalous videos, local regions retain inherent normality (e.g., normal backgrounds during anomalous events). SG-NM dynamically mines normal prototypes directly from the input video without requiring an external memory bank.
Procedure:

- Normal frame selection: The anomaly scores \(S_{det}\) from the detection branch are used to select the \(M\) frames with the lowest scores (\(M = 80\%\) of video length), forming a candidate normal feature set \(F_n \in \mathbb{R}^{M \times D}\)
- Dynamic Normal Prototype (DNP) extraction: \(K = 16\) learnable queries extract \(K\) distilled normal prototypes \(P \in \mathbb{R}^{K \times D}\) from \(F_n\) via single-layer cross-attention
- Normal compactness loss: Ensures the DNPs purely represent normal features
- Feature reconstruction: An 8-layer cross-attention decoder uses DNPs as keys/values to reconstruct video features. Key design: The residual connection is removed from the first decoder layer to ensure reconstruction relies entirely on normal patterns, preventing anomaly leakage
- Consistency loss: Aligns reconstructed anomaly scores \(S_{rec}\) with detection scores \(S_{det}\)
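A minimal PyTorch sketch of the SG-NM procedure described above; dimensions, head counts, and attention internals are assumptions, while the overall flow (lowest-score frame selection, \(K\) learnable queries, DNP-conditioned reconstruction with no residual in the first decoder layer) follows the text:

```python
import torch
import torch.nn as nn

class SelfGuidedNormalModeling(nn.Module):
    """Sketch of SG-NM; layer widths and attention configuration are assumptions,
    not the authors' released code."""

    def __init__(self, dim: int = 512, num_prototypes: int = 16,
                 dec_layers: int = 8, normal_ratio: float = 0.8):
        super().__init__()
        self.normal_ratio = normal_ratio
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim))      # K learnable queries
        self.proto_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.dec_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for _ in range(dec_layers)]
        )

    def forward(self, feats: torch.Tensor, s_det: torch.Tensor):
        """feats: (B, T, D) frame features; s_det: (B, T) detection-branch scores."""
        B, T, D = feats.shape
        m = max(1, int(T * self.normal_ratio))
        # 1) Candidate normal frames: the M frames with the lowest anomaly scores.
        idx = s_det.topk(m, dim=1, largest=False).indices                  # (B, M)
        f_n = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, D))  # (B, M, D)
        # 2) Dynamic Normal Prototypes via single-layer cross-attention.
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                    # (B, K, D)
        dnp, _ = self.proto_attn(q, f_n, f_n)                              # (B, K, D)
        # 3) Reconstruction decoder: DNPs serve as keys/values; the first layer
        #    omits the residual so the output is built purely from normal patterns.
        rec, _ = self.dec_attn[0](feats, dnp, dnp)
        for layer in self.dec_attn[1:]:
            out, _ = layer(rec, dnp, dnp)
            rec = rec + out                                                # residual kept in later layers
        return dnp, rec
```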
Design Motivation: MIL learns "what is anomalous" (discriminatively); SG-NM complements this by learning "what is normal" (generatively). The two are mutually reinforcing: MIL provides anomaly scores to help SG-NM select normal frames, while reconstruction errors from SG-NM calibrate the detection boundaries of MIL.
2. Disentangled Contrastive Semantic Alignment (DCSA)¶
Mechanism: Video features are explicitly disentangled into event components and background components, each aligned with their corresponding textual semantics.
Visual feature disentanglement: Soft disentanglement (not hard segmentation) using anomaly scores from the detection branch:

\(F_{event} = w_{event} \odot F, \quad F_{bkg} = w_{bkg} \odot F\)

where \(F\) denotes the frame-level video features, \(w_{event} = \text{Softmax}(S_{det})\), and \(w_{bkg} = 1 - w_{event}\)
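A small sketch of this soft disentanglement; the pooling of the weighted frames into video-level event/background representations is an assumption, since only the weighting scheme is specified above:

```python
import torch

def soft_disentangle(feats: torch.Tensor, s_det: torch.Tensor):
    """feats: (B, T, D) frame features; s_det: (B, T) detection-branch anomaly scores."""
    w_event = torch.softmax(s_det, dim=1)                       # soft event weights over frames
    w_bkg = 1.0 - w_event                                       # complementary background weights
    f_event = (w_event.unsqueeze(-1) * feats).sum(dim=1)        # (B, D) event representation
    f_bkg = (w_bkg.unsqueeze(-1) * feats).sum(dim=1) / w_bkg.sum(dim=1, keepdim=True)
    return f_event, f_bkg
```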
Text feature preparation: The CLIP text encoder generates text embeddings \(T_{text} = \{t_0, t_1, ..., t_{C-1}\}\) for \(C\) categories, where \(t_0\) represents the "normal" class
Separation loss: Pushes the normal embedding away from all anomaly class embeddings
Dual contrastive alignment loss \(\mathcal{L}_{dcsa} = \mathcal{L}_{event} + \mathcal{L}_{bkg}\):

- Event alignment: \(F_{event}\) is aligned with the ground-truth category \(t_c\) (anomalous videos → corresponding anomaly class; normal videos → normal class)
- Background alignment: \(F_{bkg}\) is always aligned with the normal class \(t_0\) (regardless of whether the video is anomalous, the background should be biased toward normal)
Design Motivation: This disentanglement prevents entanglement between anomaly-related features and background patterns. The regularization effect of always aligning the background with the normal class prevents the model from mistakenly treating co-occurring background patterns as anomalous.
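The dual alignment can be written as a pair of cross-entropy terms over the class text embeddings; a sketch assuming an InfoNCE-style form and a generic temperature (both assumptions, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def dcsa_loss(f_event, f_bkg, text_emb, labels, tau: float = 0.07):
    """Dual contrastive semantic alignment, sketched under assumed InfoNCE form.

    f_event, f_bkg: (B, D) disentangled video-level features
    text_emb:       (C, D) class text embeddings, row 0 = "normal"
    labels:         (B,)   ground-truth class indices (0 = normal video)
    """
    f_event = F.normalize(f_event, dim=-1)
    f_bkg = F.normalize(f_bkg, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Event features are pulled toward their ground-truth class embedding.
    logits_event = f_event @ t.t() / tau                       # (B, C)
    loss_event = F.cross_entropy(logits_event, labels)
    # Background features are always pulled toward the "normal" class t_0.
    logits_bkg = f_bkg @ t.t() / tau
    loss_bkg = F.cross_entropy(logits_bkg, torch.zeros_like(labels))
    return loss_event + loss_bkg
```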
3. Lightweight Text Adapter¶
Lightweight adapters are inserted into the first \(L\) Transformer blocks of the CLIP text encoder (see the sketch below).
This achieves domain adaptation while preserving CLIP's general knowledge, and is shown to be more effective than prompt learning.
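The adapter structure is not detailed here, so the sketch below assumes a standard residual bottleneck adapter; the zero-initialized up-projection keeps the frozen CLIP path intact at the start of training:

```python
import torch
import torch.nn as nn

class TextAdapter(nn.Module):
    """Bottleneck adapter; the down/up-projection design is an assumption, the
    paper only states that adapters are inserted into the first L text blocks."""

    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as identity so CLIP's knowledge is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual keeps the frozen path

# Hypothetical insertion into a frozen CLIP text encoder's first L blocks
# (attribute names depend on the CLIP implementation used):
# for block in clip_text_encoder.transformer.resblocks[:L]:
#     block.adapter = TextAdapter(dim=512)
```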
Loss & Training¶
Training details:

- Visual features: Extracted with frozen CLIP (ViT-B/16)
- Optimizer: AdamW; batch size 64/96
- Trained for 10 epochs on a single RTX 4090
- The SG-NM branch is used only during training; it introduces no additional computational overhead at inference
Inference strategy: Hierarchical Belief Modulation — \(S_{det}\) serves as a temporal prior, \(S_{align}\) assigns category probabilities, and calibration is performed via temperature ratio \(\beta\).
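The exact form of Hierarchical Belief Modulation is not reproduced here; the sketch below only illustrates the stated idea, assuming \(S_{det}\) gates per-frame category probabilities obtained from the alignment logits, with \(\beta\) as a softening temperature:

```python
import torch

def hierarchical_belief_modulation(s_det, align_logits, beta: float = 2.0):
    """Illustrative fusion only (assumed form, not the paper's exact equation).

    s_det:        (T,)   frame-level binary anomaly scores (temporal prior)
    align_logits: (T, C) frame-to-text similarity logits (class 0 = normal)
    """
    cat_probs = torch.softmax(align_logits / beta, dim=-1)   # category beliefs per frame
    fine_scores = s_det.unsqueeze(-1) * cat_probs            # modulate beliefs by the temporal prior
    return fine_scores                                       # (T, C) fine-grained anomaly scores
```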
Key Experimental Results¶
Main Results¶
Coarse-grained detection:
| Method | XD-Violence AP(%) | UCF-Crime AUC(%) |
|---|---|---|
| VadCLIP (CLIP) | 84.51 | 88.02 |
| ITC (CLIP) | 85.45 | 89.04 |
| ReFLIP (CLIP) | 85.81 | 88.57 |
| DSANet | 86.95 | 89.44 |
Fine-grained detection (mAP@IoU, XD-Violence):
| Method | @0.1 | @0.2 | @0.3 | @0.4 | @0.5 | AVG |
|---|---|---|---|---|---|---|
| VadCLIP | 37.03 | 30.84 | 23.38 | 17.90 | 14.31 | 24.70 |
| ReFLIP | 39.24 | 33.45 | 27.71 | 20.86 | 17.22 | 27.36 |
| DSANet | 40.93 | 34.63 | 28.21 | 22.70 | 17.89 | 28.87 |
Fine-grained detection (UCF-Crime):
| Method | @0.1 | AVG |
|---|---|---|
| ReFLIP | 14.23 | 9.62 |
| DSANet | 21.39 | 13.01 |
The fine-grained improvement on UCF-Crime is particularly notable (AVG +3.39%), demonstrating DCSA's inter-class discrimination capability in challenging scenarios.
Ablation Study¶
| Configuration | AP(%) | AVG(%) | Note |
|---|---|---|---|
| Baseline (VadCLIP) | 84.51 | 24.70 | — |
| + Adapter | 85.00 | 28.15 | Text adaptation is effective |
| + Adapter + SG-NM | 85.94 | 28.39 | Normal modeling improves detection |
| + Adapter + DCSA | 85.67 | 28.25 | Semantic disentanglement improves classification |
| + All (DSANet) | 86.95 | 28.87 | Components are synergistic |
Comparison of text encoder tuning strategies:
| Strategy | AP(%) | AVG(%) |
|---|---|---|
| Frozen | 81.57 | 27.60 |
| Manual prompt | 81.05 | 28.05 |
| Learnable prompt (CoOp) | 82.88 | 28.26 |
| Adapter (Ours) | 86.95 | 28.87 |
Key Findings¶
- DNP quality validation: The minimum cosine distance distribution from normal frames to DNPs (mean 0.35) is clearly separated from that of anomalous frames (mean 0.69), confirming that DNPs form compact normal representations
- DCSA effectiveness validation: The accuracy of aligning background prototypes with the normal class reaches 99.63% (vs. 87.63% for VadCLIP), with substantially stronger diagonal dominance in confusion matrices
- t-SNE visualization: In DSANet's feature space, inter-class boundaries between different anomaly categories are clearer and intra-class clusters are more compact
- Temporal visualizations show that DSANet's predictions align more closely with ground truth; VadCLIP tends to capture only the most salient clips, resulting in inaccurate temporal boundaries
Highlights & Insights¶
- Complementary dual learning paradigm: MIL discriminatively learns "what is anomalous" + SG-NM generatively learns "what is normal" — this complementary design is highly instructive and generalizable to other detection tasks
- Structured disentangled alignment: Event features aligned to their corresponding category; background features always aligned to normal — this asymmetric alignment constraint is notably elegant, incorporating stronger structural priors than simple alignment losses
- Self-contained design of SG-NM: No external memory bank is required; normal patterns are dynamically mined from the current video, making the approach scalable and data-efficient
- The design detail of removing the residual connection in the first reconstruction decoder layer is subtle but critical — it prevents anomalous information from leaking into the reconstruction output via skip connections
- The model can be trained on a single RTX 4090, making it computationally accessible
Limitations & Future Work¶
- SG-NM is used only during training; reconstruction errors from this branch are not exploited at inference — it remains an open question whether reconstruction residuals could serve as auxiliary signals during inference
- Candidate normal frame selection depends on anomaly scores from the detection branch — inaccurate scores in early training may degrade DNP quality
- The quality of textual category labels affects DCSA performance — automatic generation of richer category descriptions warrants investigation
- Only CLIP ViT-B/16 is evaluated; effectiveness on larger models (e.g., ViT-L) remains to be verified
- The number of anomaly categories is fixed to those present in the training set; generalization to open-world scenarios has not been validated
Related Work & Insights¶
- VadCLIP serves as the baseline, upon which DSANet introduces two additional dimensions: normal pattern modeling and semantic disentanglement
- The progression from CLIP's vision-language alignment to task-specific semantic disentanglement illustrates a promising pathway for adapting pretrained models to downstream tasks
- The event/background disentanglement is conceptually related to foreground/background separation in video understanding, and may prove useful in other video understanding tasks
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐