AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
Conference: CVPR 2025
arXiv: 2410.14379
Code: GitHub
Area: Others
Keywords: novel class discovery, anomaly classification, MEBin, mask-guided attention, industrial inspection
TL;DR
Proposes AnomalyNCD, the first self-supervised multi-class anomaly classification method for industrial scenarios: MEBin extracts the major anomaly regions \(\rightarrow\) a mask-guided ViT focuses on weak-semantic anomalies \(\rightarrow\) a region fusion strategy allows classification at either region or image level. On MVTec AD, it improves F1 by 10.8% and NMI by 8.8% over prior methods.
Background & Motivation
Background: Mature industrial anomaly detection methods (e.g., PatchCore, EfficientAD) can localize anomalies but cannot distinguish fine-grained anomaly classes (e.g., fracture vs. ablation). Downstream processing requires identifying the anomaly category and even discovering novel classes.
Limitations of Prior Work:
- Anomaly clustering methods (AC, UniFormaly): Frozen feature extractors cannot learn anomaly-specific features.
- Generic NCD methods (UNO, GCD, SimGCD): Assume objects are centered in images, which does not hold in industrial scenarios.
- Two major obstacles:
- ❶ Non-salient anomalies: Industrial anomalies are local damage and are not located at the image center.
- ❷ Weak-semantic anomalies: Industrial anomalies have weak semantics; ViTs tend to focus on the background rather than the anomaly.
Key Challenge: The attention of the classification network (ViT) naturally focuses on salient objects rather than subtle anomalies, rendering standard NCD pipelines completely ineffective for industrial defects.
Key Insight: Use MEBin to isolate anomalies from the detection results \(\rightarrow\) crop them into anomaly-centered sub-images \(\rightarrow\) use mask-guided attention to force the [CLS] token to focus on the anomaly regions.
Core Idea: Anomaly-centered cropping + mask-guided ViT attention = enabling the classification network to "see" weak-semantic anomalies.
Method
Overall Architecture
- Use anomaly detection methods (e.g., MuSc, PatchCore) to obtain anomaly probability maps.
- MEBin binarizes the probability maps and extracts anomaly-centered sub-images.
- Mask-Guided ViT (MGViT) learns discriminative features of anomalies.
- Teacher-Student framework generates pseudo-labels for classification learning.
- Region fusion strategy merges sub-image predictions into image-level classification.
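The cropping step in the pipeline above can be illustrated with a minimal sketch. It assumes a binary mask for a single anomaly region is already available; the function name `crop_centered`, the square crop shape, and the `margin` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def crop_centered(image, region_mask, margin=4):
    """Crop a roughly square window centered on one anomaly region.

    image:       (H, W) or (H, W, C) array.
    region_mask: (H, W) boolean array marking a single region.
    margin:      hypothetical padding (in pixels) around the region.
    """
    ys, xs = np.nonzero(region_mask)
    # Center of the region's bounding box.
    cy = (ys.min() + ys.max()) // 2
    cx = (xs.min() + xs.max()) // 2
    # Half side length: half of the longer bounding-box side plus the margin.
    half = max(ys.max() - ys.min(), xs.max() - xs.min()) // 2 + margin
    h, w = region_mask.shape
    # Clip the window to the image borders.
    y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
    x0, x1 = max(cx - half, 0), min(cx + half + 1, w)
    return image[y0:y1, x0:x1], region_mask[y0:y1, x0:x1]
```

The cropped mask is returned alongside the sub-image because the later mask-guided attention step needs to know which pixels of the crop belong to the anomaly.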
Key Designs
- Main Element Binarization (MEBin)
- Function: Stably extract major anomaly regions from anomaly detection results.
- Three-step pipeline:
- Step 1: Determine the threshold range \([s_{\min}, s_{\max}]\), where \(s_{\min}\) is the maximum value of the minimum anomaly scores of all anomaly maps.
- Step 2: Uniformly sample \(\mathcal{T}=64\) thresholds for binarization.
- Step 3: Find the most frequent number of connected components \(\bar{\delta}_i\), and select the minimum threshold for complete segmentation.
- Core Advantage: Adaptive threshold selection without validation sets, generalizable to various AD methods.
- Contrast with Otsu: Otsu tends to over-detect, especially on normal images.
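The three MEBin steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: the threshold range is passed in as `s_min` and `s_max` rather than computed from a set of anomaly maps, connected components are counted with a small flood fill instead of OpenCV, and the handling of the case where most thresholds yield no region (returning `None`) is an assumption. The function names are hypothetical.

```python
from collections import Counter
import numpy as np

def count_components(mask):
    """Count 4-connected foreground components with an iterative flood fill."""
    mask = mask.copy()
    h, w = mask.shape
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            count += 1
            mask[y, x] = False
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                        mask[ny, nx] = False
                        stack.append((ny, nx))
    return count

def mebin_like_threshold(score_map, s_min, s_max, num_thresholds=64):
    """Return (threshold, component_count) for one anomaly score map."""
    # Step 2: uniformly sample thresholds in [s_min, s_max].
    thresholds = np.linspace(s_min, s_max, num_thresholds)
    counts = [count_components(score_map > t) for t in thresholds]
    # Step 3: most frequent component count across thresholds.
    mode = Counter(counts).most_common(1)[0][0]
    if mode == 0:
        return None, 0  # assumption: no stable region, treat as normal
    # Smallest sampled threshold that produces the most frequent count.
    idx = counts.index(mode)
    return float(thresholds[idx]), mode
```

Choosing the smallest threshold with the stable count keeps as much of each region as possible, which matches the stated goal of complete segmentation.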
- Mask-Guided Vision Transformer (MGViT)
- Function: Guide the [CLS] token's attention to the anomaly regions via mask-guided attention (MGA).
- Mechanism: Insert masks into the self-attention of the last \(L_m=9\) layers.
- Comparison of three designs:
- (a) Masking both CLS and patch tokens \(\rightarrow\) suppresses context.
- (b) Masking only patch tokens \(\rightarrow\) also suppresses context.
- (c) Masking only the CLS token (adopted) \(\rightarrow\) patch tokens maintain global receptive fields.
- Masked Attention: \(\text{Attn} = \text{softmax}(\text{concat}(\mathbf{Q}^{cls}\mathbf{K}^\top + \bar{\mathcal{M}}, \mathbf{Q}^{patch}\mathbf{K}^\top))\mathbf{V}\)
- Where \(\bar{\mathcal{M}}(i) = 0\) if \(\mathcal{M}(i) > 0.5\), else \(-\infty\).
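The adopted design (c) can be sketched as a single-head attention in NumPy. This is a minimal illustration under stated assumptions: token 0 is [CLS], the mask is given per patch token, the [CLS]-to-[CLS] logit is left unmasked, and the usual \(1/\sqrt{d}\) scaling is applied even though the formula above omits it. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; -inf logits map to zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_masked_attention(q, k, v, patch_mask):
    """Single-head attention where only the [CLS] query is masked.

    q, k, v:    (1 + N, d) arrays, row 0 is the [CLS] token.
    patch_mask: (N,) array in [0, 1], anomaly mask value per patch.
    Returns (output, attention_weights).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # M-bar: 0 where the patch lies in the anomaly mask, -inf elsewhere.
    bias = np.where(patch_mask > 0.5, 0.0, -np.inf)
    # Add the bias to the [CLS] row only; patch rows keep global attention.
    logits[0, 1:] += bias
    attn = softmax(logits, axis=-1)
    return attn @ v, attn
```

Because patch queries are untouched, patch tokens still aggregate context from the whole sub-image, while the [CLS] token reads only from patches inside the anomaly mask.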
- Pseudo-Label Correction (PLC)
- Function: Correct pseudo-labels of over-detected regions using anomaly scores.
- Formula: \(\hat{q}_{i,k} \leftarrow w_{i,k}\mathbf{e} + (1-w_{i,k})\hat{q}_{i,k}\), where \(w_{i,k} = \max(0.5 - s_{i,k}, 0)\).
- Effect: Recall for normal class improves by 14.9%.
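The correction formula can be worked through in a few lines. The summary does not define \(\mathbf{e}\); the sketch below assumes it is the one-hot vector of the normal class, which is consistent with the reported gain in normal-class recall but should be checked against the paper. The function name and the `normal_index` parameter are hypothetical.

```python
import numpy as np

def correct_pseudo_label(q_hat, score, normal_index=0):
    """Blend a pseudo-label toward the normal class for low-score regions.

    q_hat: (C,) pseudo-label distribution for one region.
    score: anomaly score s_{i,k} of that region, assumed to lie in [0, 1].
    """
    q_hat = np.asarray(q_hat, dtype=float)
    # w = max(0.5 - s, 0): only regions with score below 0.5 are corrected.
    w = max(0.5 - score, 0.0)
    # Assumption: e is the one-hot vector of the normal class.
    e = np.zeros_like(q_hat)
    e[normal_index] = 1.0
    return w * e + (1.0 - w) * q_hat
```

For example, a region with score 0.1 gets \(w = 0.4\), so a pseudo-label of `[0.2, 0.5, 0.3]` becomes `[0.52, 0.30, 0.18]`, while a region with score 0.8 is left unchanged.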
- Region Fusion Strategy
- Function: Determine the image-level class based on sub-image classifications.
- Core Idea: Area weighting (instead of simple averaging or anomaly score weighting).
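- Formula: \(\alpha_{i,k}^u = \frac{\exp(a_{i,k}^u / \tau_\alpha)}{\sum_{k'} \exp(a_{i,k'}^u / \tau_\alpha)}\), where \(a_{i,k}^u\) is the area of the \(k\)-th region in image \(i\) and \(\tau_\alpha\) is a temperature.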
- Design Motivation: Over-detected regions have small areas but high anomaly scores; area weighting mitigates their impact.
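A minimal sketch of area-weighted fusion follows. The softmax over areas mirrors the formula above; combining the sub-image predictions as a weighted sum and taking the argmax is an assumption about how the weights are used, and the default `tau` is a placeholder rather than the paper's value.

```python
import numpy as np

def area_weights(areas, tau=50.0):
    """Softmax over region areas with temperature tau."""
    z = np.asarray(areas, dtype=float) / tau
    z = z - z.max()  # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

def fuse_regions(region_probs, areas, tau=50.0):
    """Fuse per-region class probabilities into one image-level prediction.

    region_probs: (K, C) class probabilities for K sub-images of one image.
    areas:        (K,) region areas in pixels.
    Returns (fused_probs, predicted_class).
    """
    w = area_weights(areas, tau)
    fused = w @ np.asarray(region_probs, dtype=float)
    return fused, int(np.argmax(fused))
```

With areas `[100, 10]`, the larger region receives most of the weight, so a small over-detected region cannot flip the image-level class on its own.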
Loss & Training
\[\mathcal{L} = \lambda(\mathcal{L}_{rep}^l + \mathcal{L}_{cls}^l) + (1-\lambda)(\mathcal{L}_{rep} + \mathcal{L}_{cls}^u + \mu\mathcal{L}_{reg}^u)\]
- \(\mathcal{L}_{rep}^l\): Supervised contrastive learning; \(\mathcal{L}_{rep}\): Self-supervised contrastive learning.
- \(\mathcal{L}_{cls}^l\): Cross-entropy with GT labels; \(\mathcal{L}_{cls}^u\): Cross-entropy with pseudo-labels.
- \(\mathcal{L}_{reg}^u\): Mean entropy maximization regularization.
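How the terms combine can be sketched as below. The regularizer is written as the negative entropy of the batch-mean prediction, which is a common form of mean entropy maximization and is assumed here rather than confirmed by the summary; the default values of `lam` and `mu` are placeholders, not the paper's settings.

```python
import numpy as np

def mean_entropy_regularizer(probs):
    """Negative entropy of the batch-mean prediction (assumed form of L_reg^u).

    probs: (B, C) predicted class probabilities for a batch.
    Minimizing this value maximizes the entropy of the mean prediction,
    which discourages assigning every sample to the same class.
    """
    mean_p = probs.mean(axis=0)
    return float(np.sum(mean_p * np.log(mean_p + 1e-12)))

def total_loss(rep_l, cls_l, rep, cls_u, reg_u, lam=0.35, mu=2.0):
    """Weighted sum of labeled and unlabeled terms, following the formula above."""
    return lam * (rep_l + cls_l) + (1.0 - lam) * (rep + cls_u + mu * reg_u)
```

A collapsed batch such as `[[1, 0], [1, 0]]` gives a regularizer near 0, while a balanced batch such as `[[1, 0], [0, 1]]` gives about \(-\ln 2\), so the balanced case is preferred under minimization.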
Key Experimental Results
Main Results (Unsupervised setting, using only unlabeled images)
| Method | MVTec AD NMI↑ | MVTec AD ARI↑ | MVTec AD F1↑ |
|---|---|---|---|
| SimGCD | 0.452 | 0.346 | — |
| AC (Anomaly Clustering) | 0.525 | 0.431 | — |
| MuSc + AnomalyNCD | 0.613 | 0.526 | 0.712 |
Semi-supervised Setting (Using normally labeled images)
| AD Method + AnomalyNCD | MVTec AD NMI↑ | MVTec AD ARI↑ | MVTec AD F1↑ |
|---|---|---|---|
| PatchCore | 0.670 | 0.601 | 0.769 |
| CPR | 0.736 | 0.674 | 0.805 |
Ablation Study
| MGA variant | NMI↑ | ARI↑ | F1↑ |
|---|---|---|---|
| (a) w/o MGA | 0.598 | 0.494 | 0.698 |
| (b) all tokens | 0.507 | 0.382 | 0.600 |
| (c) patch tokens | 0.563 | 0.467 | 0.686 |
| (d) class token (Ours) | 0.613 | 0.526 | 0.712 |

| Binarization method | FPR↓ | FNR↓ | F1↑ |
|---|---|---|---|
| Fixed threshold 0.5 | High | High | 0.640 |
| Otsu | Highest | Medium | 0.499 |
| MEBin | 0.153 | 0.035 | 0.712 |
Key Findings
- MGA performs best when applied only to the CLS token: +0.050 NMI and +0.026 F1 over masking patch tokens, and +0.015 NMI and +0.014 F1 over using no MGA.
- Applying masked attention in the last 9 layers is optimal (\(L_m=9\)).
- Area-weighted fusion outperforms average/score weighting.
- NMI reaches 0.871 when GT masks are used, indicating that the quality of the upstream AD method is the bottleneck.
- Using labeled anomalous data (\(\mathcal{D}_l\)) yields a +3.0% NMI gain.
Highlights & Insights
- The first self-supervised multi-class anomaly classification method for industrial scenarios, compatible with any AD method.
- MEBin provides adaptive threshold selection, generalizable to various AD methods.
- Elegant mask-guided attention design, requiring modification of only the CLS token's attention.
- Supports composite anomalies (a single image containing multiple anomaly types).
Limitations & Future Work
- Performance is highly dependent on the quality of the upstream AD method (e.g., EfficientAD's extreme span of anomaly probability causes poor results).
- MEBin runs on the CPU (OpenCV connected components analysis) and accounts for over 80% of inference time, making it the inference bottleneck.
- Requires the number of novel classes \(\mathcal{C}_u\) as a prior.
Rating
- Novelty: ⭐⭐⭐⭐ Pioneering integration of NCD with industrial anomaly classification, unique MEBin design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely detailed ablation studies (7 ablations + cross-dataset evaluations + class-wise results).
- Writing Quality: ⭐⭐⭐⭐ Clear structure and intuitive diagrams.
- Value: ⭐⭐⭐⭐ A crucial cornerstone for downstream processing in industrial quality inspection.