AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios

Conference: CVPR 2025
arXiv: 2410.14379
Code: GitHub
Area: Others
Keywords: novel class discovery, anomaly classification, MEBin, mask-guided attention, industrial inspection

TL;DR

Proposes AnomalyNCD, the first self-supervised multi-class anomaly classification method for industrial scenarios: MEBin extracts major anomaly regions $\rightarrow$ mask-guided ViT focuses on weak-semantic anomalies $\rightarrow$ region fusion strategy achieves flexible region/image-level classification, improving F1 by 10.8% and NMI by 8.8% on MVTec AD.

Background & Motivation

Background: Mature industrial anomaly detection methods (e.g., PatchCore, EfficientAD) can localize anomalies but cannot distinguish fine-grained anomaly classes (e.g., fracture vs. ablation). Downstream processing requires identifying anomaly categories and even discovering novel classes.

Limitations of Prior Work:
- Anomaly clustering methods (AC, UniFormaly): Frozen feature extractors cannot learn anomaly-specific features.
- Generic NCD methods (UNO, GCD, SimGCD): Assume objects are centered in images, which is inapplicable to industrial scenarios.
- Two major obstacles:
  - ❶ Non-salient anomalies: Industrial anomalies are local damage and are not located at the image center.
  - ❷ Weak-semantic anomalies: Industrial anomalies have weak semantics; ViTs tend to focus on the background rather than the anomaly.

Key Challenge: The attention of the classification network (ViT) naturally focuses on salient objects rather than subtle anomalies, so standard NCD pipelines transfer poorly to industrial defects.

Key Insight: Use MEBin to isolate anomalies from the detection results $\rightarrow$ crop them into anomaly-centered sub-images $\rightarrow$ apply mask-guided attention so that the [CLS] token focuses on anomaly regions.

Core Idea: Anomaly-centered cropping + mask-guided ViT attention = enabling the classification network to "see" weak-semantic anomalies.

Method

Overall Architecture

Use anomaly detection methods (e.g., MuSc, PatchCore) to obtain anomaly probability maps.
MEBin binarizes the probability maps and extracts anomaly-centered sub-images.
Mask-Guided ViT (MGViT) learns discriminative features of anomalies.
Teacher-Student framework generates pseudo-labels for classification learning.
Region fusion strategy merges sub-image predictions into image-level classification.

Key Designs

Main Element Binarization (MEBin)
- Function: Stably extract major anomaly regions from anomaly detection results.
- Three-step pipeline:
  - Step 1: Determine the threshold range $[s_{\min}, s_{\max}]$, where $s_{\min}$ is the maximum value of the minimum anomaly scores of all anomaly maps.
  - Step 2: Uniformly sample $\mathcal{T}=64$ thresholds for binarization.
  - Step 3: Find the most frequent number of connected components $\bar{\delta}_i$, and select the minimum threshold for complete segmentation.
- Core Advantage: Adaptive threshold selection without validation sets, generalizable to various AD methods.
- Contrast with Otsu: Otsu tends to over-detect, especially on normal images.
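
The three MEBin steps can be sketched as follows. This is a minimal NumPy sketch of one reading of the summary above, not the authors' implementation: the function names `count_components` and `mebin_like_threshold` are hypothetical, `s_min` and `s_max` are passed in rather than computed, 4-connectivity is assumed, and "minimum threshold" is taken to mean the smallest sampled threshold that yields the most frequent component count. A pure-Python flood fill stands in for the OpenCV connected-components call.

```python
from collections import Counter, deque

import numpy as np


def count_components(mask):
    """Count 4-connected foreground components in a boolean 2D mask."""
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    n = 0
    for y, x in zip(*np.nonzero(mask)):
        if seen[y, x]:
            continue
        n += 1
        seen[y, x] = True
        queue = deque([(y, x)])
        while queue:  # breadth-first flood fill of one component
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
    return n


def mebin_like_threshold(score_map, s_min, s_max, num_thresholds=64):
    """Pick a binarization threshold from the most frequent component count.

    Returns (threshold, mask); threshold is None when the most frequent
    count is zero, i.e. the map is treated as containing no anomaly.
    """
    # Step 2: uniformly sample thresholds in [s_min, s_max].
    thresholds = np.linspace(s_min, s_max, num_thresholds)
    counts = [count_components(score_map >= t) for t in thresholds]
    # Step 3 (assumed reading): most frequent count, then smallest threshold with it.
    mode_count = Counter(counts).most_common(1)[0][0]
    if mode_count == 0:
        return None, np.zeros_like(score_map, dtype=bool)
    t_star = float(thresholds[counts.index(mode_count)])
    return t_star, score_map >= t_star
```

With a strong blob and a weak spurious blob in the same map, the weak one only survives at the lowest thresholds, so the most frequent count is 1 and the selected mask keeps only the major region.
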
Mask-Guided Vision Transformer (MGViT)
- Function: Guide the [CLS] token's attention to focus on the anomaly regions.
- Mechanism: Insert masks into the self-attention of the last $L_m=9$ layers.
- Comparison of three designs:
  - (a) Masking both CLS and patch tokens $\rightarrow$ suppresses context.
  - (b) Masking only patch tokens $\rightarrow$ also suppresses context.
  - (c) Masking only the CLS token (adopted) $\rightarrow$ patch tokens maintain global receptive fields.
- Masked Attention: $\text{Attn} = \text{softmax}(\text{concat}(\mathbf{Q}^{cls}\mathbf{K}^\top + \bar{\mathcal{M}}, \mathbf{Q}^{patch}\mathbf{K}^\top))\mathbf{V}$
- Where $\bar{\mathcal{M}}(i) = 0$ if $\mathcal{M}(i) > 0.5$, else $-\infty$.
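
Design (c) can be illustrated with a single-head attention sketch in NumPy. Assumptions not stated in the summary: token 0 is the [CLS] token, the [CLS] key itself is left unmasked (so the softmax row always has a finite entry), and the standard $1/\sqrt{d}$ scaling is applied even though the formula above omits it. The function name `cls_masked_attention` is hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax that tolerates -inf entries."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cls_masked_attention(q, k, v, patch_mask):
    """Single-head attention where only the CLS query row is masked.

    q, k, v: arrays of shape (1 + N, d); token 0 is CLS (assumption).
    patch_mask: array of shape (N,) with values in [0, 1].
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)  # standard scaling, assumed here
    # Additive bias: 0 where the mask exceeds 0.5, -inf elsewhere.
    bias = np.where(patch_mask > 0.5, 0.0, -np.inf)
    logits[0, 1:] += bias  # only the CLS row is masked; patch rows stay global
    weights = softmax(logits)
    return weights @ v, weights
```

The returned weights make the difference between the designs visible: the [CLS] row has exactly zero weight on patches outside the mask, while every patch row still attends to all tokens.
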
Pseudo-Label Correction (PLC)
- Function: Correct pseudo-labels of over-detected regions using anomaly scores.
- Formula: $\hat{q}_{i,k} \leftarrow w_{i,k}\mathbf{e} + (1-w_{i,k})\hat{q}_{i,k}$, where $w_{i,k} = \max(0.5 - s_{i,k}, 0)$.
- Effect: Recall for normal class improves by 14.9%.
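
A small worked example of the correction formula follows. The summary does not define $\mathbf{e}$; this sketch assumes it is the one-hot vector of the normal class, which matches the stated goal of fixing over-detected regions. The function name and the `normal_index` parameter are hypothetical.

```python
import numpy as np


def correct_pseudo_label(q_hat, anomaly_score, normal_index=0):
    """Blend a pseudo-label toward the normal class for low-score regions."""
    q = np.asarray(q_hat, dtype=float)
    # w = max(0.5 - s, 0): only regions with anomaly score below 0.5 are corrected.
    w = max(0.5 - anomaly_score, 0.0)
    e = np.zeros_like(q)
    e[normal_index] = 1.0  # assumed: e is the one-hot normal-class vector
    return w * e + (1.0 - w) * q
```

For a region with score 0.2 the weight is 0.3, so a pseudo-label of (0.1, 0.6, 0.3) moves to (0.37, 0.42, 0.21); a region with score 0.8 is left unchanged.
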
Region Fusion Strategy
- Function: Determine the image-level class based on sub-image classifications.
- Core Idea: Area weighting (instead of simple averaging or anomaly score weighting).
- Formula: $\alpha_{i,k}^u = \frac{\exp(a_{i,k}^u / \tau_\alpha)}{\sum_k \exp(a_{i,k}^u / \tau_\alpha)}$
- Design Motivation: Over-detected regions have small areas but high anomaly scores; area weighting mitigates their impact.
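
The area weighting can be sketched as a temperature-scaled softmax over region areas, followed by a weighted sum of the per-region class probabilities. The weighted sum, the use of area as a fraction of the image, and the default `tau_alpha=0.1` are assumptions for illustration; the summary only gives the weight formula.

```python
import numpy as np


def fuse_regions(region_probs, region_areas, tau_alpha=0.1):
    """Fuse per-region class probabilities into an image-level prediction.

    region_probs: (K, C) class probabilities for K sub-images.
    region_areas: (K,) region areas, assumed normalized to [0, 1].
    tau_alpha: softmax temperature (hypothetical default).
    """
    probs = np.asarray(region_probs, dtype=float)
    areas = np.asarray(region_areas, dtype=float)
    logits = areas / tau_alpha
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    image_probs = weights @ probs  # assumed: weighted sum of region predictions
    return image_probs, weights
```

With one large region and one small over-detected region, the large region dominates the fused prediction regardless of how confident the small one is.
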

Loss & Training

$$\mathcal{L} = \lambda(\mathcal{L}_{rep}^l + \mathcal{L}_{cls}^l) + (1-\lambda)(\mathcal{L}_{rep} + \mathcal{L}_{cls}^u + \mu\mathcal{L}_{reg}^u)$$
- $\mathcal{L}_{rep}^l$: Supervised contrastive learning; $\mathcal{L}_{rep}$: Self-supervised contrastive learning.
- $\mathcal{L}_{cls}^l$: Cross-entropy with GT labels; $\mathcal{L}_{cls}^u$: Cross-entropy with pseudo-labels.
- $\mathcal{L}_{reg}^u$: Mean entropy maximization regularization.
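
How the terms combine can be shown with a short sketch. It assumes the regularizer enters as the negative entropy of the batch-mean prediction, so that minimizing the total loss maximizes the mean entropy; the summary does not spell out this sign convention, and the values of $\lambda$ and $\mu$ are not given, so both are plain arguments here. Function names are hypothetical.

```python
import numpy as np


def mean_entropy_regularizer(probs, eps=1e-12):
    """Return -H(mean prediction) over a batch of shape (B, C).

    Minimizing this value maximizes the entropy of the mean prediction,
    which discourages collapsing all samples into one class.
    """
    p_bar = np.asarray(probs, dtype=float).mean(axis=0)
    return float(np.sum(p_bar * np.log(p_bar + eps)))


def total_loss(l_rep_l, l_cls_l, l_rep, l_cls_u, l_reg_u, lam, mu):
    """Weighted sum of the labeled and unlabeled loss terms."""
    labeled = l_rep_l + l_cls_l
    unlabeled = l_rep + l_cls_u + mu * l_reg_u
    return lam * labeled + (1.0 - lam) * unlabeled
```

Under this convention a batch whose predictions are spread evenly over two classes gives a regularizer of about $-\log 2$, while a collapsed batch gives about 0, so the balanced batch is preferred.
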

Key Experimental Results

Main Results (Unsupervised setting, using only unlabeled images)

| Method | MVTec AD NMI↑ | MVTec AD ARI↑ | MVTec AD F1↑ |
|---|---|---|---|
| SimGCD | 0.452 | 0.346 | - |
| AC (Anomaly Clustering) | 0.525 | 0.431 | - |
| MuSc + AnomalyNCD | 0.613 | 0.526 | 0.712 |

Semi-supervised Setting (Using normally labeled images)

| AD Method + AnomalyNCD | MVTec AD NMI↑ | MVTec AD ARI↑ | MVTec AD F1↑ |
|---|---|---|---|
| PatchCore | 0.670 | 0.601 | 0.769 |
| CPR | 0.736 | 0.674 | 0.805 |

Ablation Study

| MGA variant | NMI | ARI | F1 |
|---|---|---|---|
| (a) w/o MGA | 0.598 | 0.494 | 0.698 |
| (b) all tokens | 0.507 | 0.382 | 0.600 |
| (c) patch tokens | 0.563 | 0.467 | 0.686 |
| (d) class token (Ours) | 0.613 | 0.526 | 0.712 |

| Binarization method | FPR↓ | FNR↓ | F1↑ |
|---|---|---|---|
| Fixed threshold 0.5 | High | High | 0.640 |
| Otsu | Highest | Medium | 0.499 |
| MEBin | 0.153 | 0.035 | 0.712 |

Key Findings

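Mask-guided attention (MGA) performs best when applied only to the CLS token: +5.0 points NMI and +2.6 points F1 over masking only patch tokens (0.563 to 0.613 and 0.686 to 0.712), and +1.5 points NMI and +1.4 points F1 over no MGA.
Applying masked attention in the last 9 layers is optimal ($L_m=9$).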
Area-weighted fusion outperforms average/score weighting.
NMI reaches 0.871 under GT masks, indicating that the quality of AD methods is the bottleneck.
Using labeled anomalous data ($\mathcal{D}_l$) yields a +3.0% NMI gain.

Highlights & Insights

The first self-supervised multi-class anomaly classification method for industrial scenarios, compatible with any AD method.
MEBin provides adaptive threshold selection, generalizable to various AD methods.
Elegant mask-guided attention design, requiring modification of only the CLS token's attention.
Supports composite anomalies (a single image containing multiple anomaly types).

Limitations & Future Work

Performance depends heavily on the quality of the upstream AD method (e.g., EfficientAD's anomaly probabilities span an extreme range, which leads to poor results).
MEBin runs on the CPU (OpenCV connected-components analysis) and accounts for over 80% of inference time, making it the bottleneck.
Requires the number of novel classes $\mathcal{C}_u$ as a prior.

Rating

Novelty: ⭐⭐⭐⭐ Pioneering integration of NCD with industrial anomaly classification, unique MEBin design.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely detailed ablation studies (7 ablations + cross-dataset evaluations + class-wise results).
Writing Quality: ⭐⭐⭐⭐ Clear structure and intuitive diagrams.
Value: ⭐⭐⭐⭐ A crucial cornerstone for downstream processing in industrial quality inspection.