UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Conference: CVPR 2026 | arXiv: 2509.25934 | Code: GitHub | Area: Multi-modal VLM / Anomaly Detection | Keywords: anomaly detection, multi-modal, mixture of experts, feature decompression, unified framework
TL;DR
This paper proposes UniMMAD, the first unified framework for multi-modal (RGB/Depth/IR, etc.) and multi-class anomaly detection. It follows a General-to-Specific paradigm: a general multi-modal encoder compresses features, and a Cross Mixture-of-Experts (C-MoE) decompresses them into domain-specific features. The method achieves state-of-the-art results on 5 datasets spanning industrial, medical, and synthetic scenarios at 59 FPS inference speed.
Background & Motivation
Background: Existing anomaly detection methods treat modalities and categories as independent factors, training dedicated models for each combination, leading to fragmented solutions and high memory overhead.
Limitations of Prior Work: (a) Multi-class reconstruction-based methods use shared decoders, causing distortion of normality boundaries and inter-domain interference under large domain gaps; (b) Each modality–category pair requires a separate model, which is not scalable; (c) Existing methods struggle to handle cross-domain scenarios spanning industrial, medical, and synthetic settings simultaneously.
Key Challenge: A unified model must handle multi-modal inputs (RGB, 3D, infrared, etc.) and up to 66 categories, yet features across domains vary drastically, and naive parameter sharing leads to inter-domain interference.
Goal: Design a parameter-efficient unified framework that simultaneously handles multi-modal, multi-class, and multi-domain anomaly detection.
Key Insight: A General-to-Specific paradigm — first compress multi-modal features with a general encoder (suppressing anomalies), then decompress them into domain-specific features with sparse MoE (restoring normality).
Core Idea: General encoder compression combined with Cross MoE sparse-routing decompression, where different domains activate different experts, enabling domain isolation within a single unified model.
Method
Overall Architecture
Two stages: (1) General multi-modal encoder: an input embedding layer unifies channel dimensions; residual blocks with a Feature Compression Module (FCM) progressively compress features and suppress potential anomalies. (2) Cross MoE decoder: a conditional router selects top-K experts based on domain-specific statistics; a MoE-in-MoE structure achieves ~75% parameter reduction.
Key Designs
- Feature Compression Module (FCM):
- Function: Suppress potential anomalies and facilitate cross-modal interaction.
- Mechanism: Inner multi-scale bottleneck (parallel \(1\times1\), \(3\times3\), \(5\times5\) convolutions) combined with an outer bottleneck, forming a two-level compression structure.
- Design Motivation: Anomalies at different scales require receptive fields of different sizes for effective detection.
- Cross Mixture-of-Experts (C-MoE):
- Function: Decompress general features into domain-specific features.
- Mechanism: A conditional router projects general features as key/value and prior features as query; global average pooling generates domain statistics; sparse top-K gating selects experts. Annealing load-balancing loss: \(\mathcal{L}_{\text{MoE}} = (1 - e/E)^2 \cdot \text{CV}(G)\).
- Design Motivation: Different domains activate different experts to avoid inter-domain interference; the annealing mechanism gradually relaxes the load-balancing constraint as training progresses.
- MoE-in-MoE Nested Structure:
- Function: Embed dense base experts inside each leader expert of the sparse MoE.
- Mechanism: The convolutional kernels of each MoE-Leader are formed by a weighted combination of shared base experts, reducing parameters by ~75%.
- Design Motivation: Reduce parameter count while preserving sparse activation characteristics.
- Inference Acceleration: Grouped Dynamic Convolution:
- Function: Pre-compute and cache weighted kernels; convert multi-expert parallel computation into grouped convolution.
- Effect: 59 FPS, roughly 4.5× faster than CFM and over 150× faster than M3DM (per the efficiency table).
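The multi-scale inner bottleneck of the FCM can be sketched in one dimension (pure Python; the averaging kernels here are hypothetical stand-ins for the learned \(1\times1\), \(3\times3\), \(5\times5\) convolutions, and the "compression" step is only a schematic of anomaly suppression):

```python
def conv1d(x, kernel):
    """Same-length 1-D convolution with symmetric zero padding."""
    k = len(kernel)
    pad = k // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

def fcm_sketch(x):
    """Inner multi-scale bottleneck: parallel size-1/3/5 kernels, then fused.
    Hypothetical averaging kernels stand in for learned weights."""
    k1, k3, k5 = [1.0], [1 / 3] * 3, [1 / 5] * 5
    branches = [conv1d(x, k) for k in (k1, k3, k5)]
    fused = [sum(vals) / 3 for vals in zip(*branches)]  # merge the scales
    # Outer bottleneck: a crude squashing that damps large (anomalous) responses.
    return [v / (1 + abs(v)) for v in fused]

smooth = fcm_sketch([1.0] * 8)
```

The parallel-branch structure is the point: each kernel size covers anomalies at a different scale, matching the design motivation above.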
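The sparse top-K gating and the annealed load-balancing term \((1 - e/E)^2 \cdot \text{CV}(G)\) can be sketched as follows (pure Python; the router logits, K, and schedule values are illustrative, not the paper's):

```python
import math

def top_k_gate(logits, k):
    """Softmax over router logits, then keep only the top-k experts (renormalized)."""
    exps = [math.exp(v - max(logits)) for v in logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}  # expert index -> gate weight

def cv(values):
    """Coefficient of variation of expert loads: std / mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

def annealed_balance_loss(expert_loads, epoch, total_epochs):
    """L_MoE = (1 - e/E)^2 * CV(G): the balancing constraint fades over training."""
    return (1 - epoch / total_epochs) ** 2 * cv(expert_loads)

gates = top_k_gate([2.0, 0.5, 1.5, -1.0], k=2)  # only experts 0 and 2 are active
```

Note how the loss vanishes both when loads are already uniform (CV = 0) and at the end of training (e = E), which is exactly the gradual relaxation described above.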
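The MoE-in-MoE parameter sharing and the kernel caching behind the grouped-convolution speedup can be sketched together (pure Python; all sizes and mixing weights are illustrative, not taken from the paper):

```python
def leader_kernel(base_kernels, mix_weights):
    """A MoE-Leader's kernel is a weighted combination of shared base-expert
    kernels, so each leader stores only mixing coefficients, not a full kernel."""
    size = len(base_kernels[0])
    return [sum(w * k[i] for w, k in zip(mix_weights, base_kernels))
            for i in range(size)]

# Illustrative sizes: 4 shared base kernels of 576 weights each
# (e.g. a 3x3 conv over 64 channels), shared across 16 leader experts.
NUM_BASE, KERNEL_SIZE, NUM_LEADERS = 4, 576, 16
base = [[float((i + j) % 7) for i in range(KERNEL_SIZE)] for j in range(NUM_BASE)]
mixes = [[0.4, 0.3, 0.2, 0.1] for _ in range(NUM_LEADERS)]

shared_params = NUM_BASE * KERNEL_SIZE + NUM_LEADERS * NUM_BASE  # MoE-in-MoE
dense_params = NUM_LEADERS * KERNEL_SIZE                         # independent kernels

# Inference-time trick: combine each leader's kernel once and cache it, so the
# per-input cost reduces to a plain (grouped) convolution with cached weights.
cached_kernels = [leader_kernel(base, m) for m in mixes]
```

With these illustrative sizes the shared parameterization uses 2368 weights versus 9216 for independent kernels, a ~74% reduction, in the ballpark of the reported ~75%.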
Loss & Training
- Decompression consistency loss \(\mathcal{L}_{\text{DeC}}\): negative cosine similarity with focal-loss modulation (\(\gamma = 2\)).
- Total loss: \(\mathcal{L} = \mathcal{L}_{\text{DeC}} + \mathcal{L}_{\text{MoE}}\).
- Weighted sampling: training probability inversely proportional to per-class sample count.
- 300 epochs, batch size 10; WideResNet50 used as prior feature generator.
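The decompression consistency loss can be sketched as a focal-modulated negative-cosine term (pure Python; the exact modulation form is an assumption based on the description above, using \(\gamma = 2\)):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dec_loss(prior_feats, decompressed_feats, gamma=2.0):
    """Per-position distance d = 1 - cos; focal modulation d**gamma up-weights
    positions that are hard to reconstruct (one plausible reading of L_DeC)."""
    dists = [1.0 - cosine(u, v) for u, v in zip(prior_feats, decompressed_feats)]
    return sum((d ** gamma) * d for d in dists) / len(dists)
```

Perfectly reconstructed positions contribute nothing, while badly misaligned positions dominate the gradient, which is the usual motivation for focal modulation.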
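The weighted sampling rule, probability inversely proportional to per-class sample count, is simple to state concretely (pure Python; the counts are made up):

```python
def sampling_probs(class_counts):
    """Per-class sampling probability proportional to 1 / count, normalized."""
    inv = [1.0 / c for c in class_counts]
    total = sum(inv)
    return [w / total for w in inv]

# A class with 10 samples is drawn 10x as often as one with 100.
probs = sampling_probs([100, 50, 10])
```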
Key Experimental Results
Main Results: Image-Level AUC (across 5 multi-scene datasets)
| Dataset | UniMMAD | Best Baseline AUC | Baseline Method |
|---|---|---|---|
| MVTec-3D | 92.53 | 92.44 | CFM |
| Eyecandies | 85.57 | 81.85 | CFM |
| MulSen-AD | 85.46 | 78.87 | MulSen-TripleAD |
| BraTS (Medical) | 95.84 | 96.06 | INP-Former |
| UniMed (Medical) | 96.34 | 95.98 | MambaAD |
Ablation Study: Component Contributions (Mean across datasets)
| Configuration | AUC_I | AUC_P | MF1_P |
|---|---|---|---|
| Baseline (Reverse TS) | 75.62 | 86.62 | 28.46 |
| + FCM | 77.37 | 86.65 | 28.93 |
| + General-to-Specific | 84.31 | 96.11 | 37.13 |
| + C-MoE (Full) | 91.15 | 96.68 | 42.91 |
Per the table, the General-to-Specific paradigm contributes the largest single gain (+6.9 AUC_I points over the +FCM configuration), followed by C-MoE (+6.8 points).
Efficiency Comparison
| Metric | UniMMAD | CFM | M3DM |
|---|---|---|---|
| FPS | 59.09 | 13.18 | 0.39 |
| FLOPs (GFLOPs) | 110.91 | 431.14 | — |
Key Findings
- The General-to-Specific paradigm is the single largest contributor in the ablation (+6.9 AUC_I points), validating the compress-then-decompress design.
- Removing cross-condition routing drops AUC_I from 91.1 to 85.1 (a 6.0-point drop), demonstrating the critical role of domain-conditioned routing.
- Unified training vs. task-specific training differs by only 0.1–0.3%, confirming effective task isolation.
- Supports continual learning: performance degradation on old tasks remains below 8% when new tasks are added.
Highlights & Insights
- The General-to-Specific paradigm is an elegant design: compression suppresses anomalies, decompression restores normality, and MoE routing achieves domain isolation.
- MoE-in-MoE nested structure: achieves 75% parameter reduction without sacrificing performance, far more efficient than simply increasing the number of experts.
- Practical inference acceleration: pre-computation combined with grouped convolution brings MoE inference close to single-model speed (59 FPS).
- Cross-domain unification: a single model handles industrial, medical, and synthetic scenarios simultaneously, demonstrating the generality of the paradigm.
Limitations & Future Work
- Pixel-level detection (MF1_P) does not surpass baselines on certain datasets (e.g., MVTec-3D: 44.16 vs. 44.72).
- The method requires a pre-trained WideResNet50 as a prior feature generator, introducing an additional dependency.
- Image-level AUC on the BraTS medical dataset is slightly below INP-Former (95.84 vs. 96.06).
Related Work & Insights
- vs. M3DM: M3DM relies on memory-bank retrieval and is extremely slow at inference (0.39 FPS); UniMMAD is 150× faster.
- vs. CFM: CFM uses fixed fusion, resulting in lower AUC and slower inference; UniMMAD's C-MoE dynamic fusion is more effective.
- vs. INP-Former: UniMMAD remains competitive even in a purely RGB, single-modality, many-class setting (MVTec-AD + VisA, 66 classes).
Rating
- Novelty: ⭐⭐⭐⭐ General-to-Specific + C-MoE constitutes a systematic innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five multi-modal datasets + two single-modal datasets, with comprehensive ablation and efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐ Framework description is clear and well-organized.
- Value: ⭐⭐⭐⭐ A unified anomaly detection framework with practical industrial application value.