
EPHAD: An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

Conference: NeurIPS 2025
arXiv: 2510.21296
Code: GitHub
Area: Others
Keywords: anomaly detection, data contamination, test-time adaptation, CLIP, post-hoc adjustment

TL;DR

EPHAD proposes a test-time post-processing framework that corrects the output of anomaly detection models trained on contaminated data via Bayesian-style fusion with external evidence (e.g., CLIP, LOF) through exponential tilting. The framework requires no access to the training pipeline and consistently improves detection performance of contaminated models across 8 visual and 26 tabular AD datasets.

Background & Motivation

Background: Unsupervised anomaly detection (AD) assumes clean training data, learning compact representations of "normal" samples and flagging deviations as anomalies. Existing methods include one-class classification (DeepSVDD), feature embedding (PatchCore), density estimation (CFLOW/FastFlow), and reconstruction-based methods (DRÆM), all of which perform well when training data is clean.

Limitations of Prior Work: Real-world datasets are frequently contaminated by undetected anomalous samples—e.g., hidden defective products in industrial data or unlabeled cases in medical datasets. Existing mitigation strategies either require modifying the training pipeline (Refine filters suspicious anomalies via OCC ensemble; LOE iteratively assigns scores via block coordinate descent), require knowledge of the contamination ratio, or rely on semi-supervised annotations. These conditions are entirely unavailable when deploying proprietary black-box AD models.

Key Challenge: How can the performance degradation caused by data contamination be mitigated without access to the training pipeline, training data, or contamination ratio? This "preparation-agnostic" setting reflects the common real-world scenario of deploying proprietary AD models, and is conceptually dual to the test-time alignment problem in generative models.

Key Insight: Drawing on ideas from test-time adaptation (TTA) and KL-regularized alignment in generative models, EPHAD applies post-hoc correction to contaminated model outputs at test time using external "evidence." Core Idea: The AD model's output scores are treated as a contaminated prior, which is fused with an evidence function via exponential tilting so that the adjusted distribution is closer to the true normal sample distribution in the KL divergence sense.

Method

Overall Architecture

EPHAD is a general post-processing framework: given an AD model trained on (possibly contaminated) data and its output scores, an evidence function \(T(x)\) is used at test time to apply exponential tilting correction to the original scores. The framework has a single hyperparameter \(\beta\) controlling the trust trade-off between the model and the evidence. The entire process requires no modification to the original model, no retraining, and no knowledge of the contamination ratio.

Key Designs

  1. Exponential Tilting Fusion Mechanism:

    • Function: Fuses the output density of the contaminated model with the evidence function to produce corrected anomaly scores.
    • Mechanism: The contaminated distribution \(f_\pm(x)\) is exponentially tilted to yield a corrected density \(\check{f}_\pm(x) \propto f_\pm(x) \cdot \exp(T(x)/\beta)\). For mainstream score-based AD methods, this simplifies to \(\check{s}_{in}^\pm(x) = s_{in}^\pm(x) + T(x)/\beta\), i.e., the original inlier score plus a weighted contribution from the evidence; the normalization constant can be ignored, since AD relies only on the ranking of scores.
    • Design Motivation: Proposition 4.1 provides a theoretical guarantee—when the expected log-weight of the evidence function over truly normal samples is positive, the corrected density is strictly closer to the true normal distribution in KL divergence. This formulation is also the optimal solution to the KL-regularized objective \(J_{KL} = \mathbb{E}[T(x)] - \beta \cdot \mathrm{KL}(\check{f} \| f)\), directly paralleling test-time alignment and RLHF in generative models.
  2. Multi-Source Evidence Functions:

    • Function: Provides a "second opinion," independent of the contaminated model, on whether a sample is normal.
    • Mechanism: For visual AD, CLIP is used (following WinCLIP, normal/anomaly text templates are defined and the softmax similarity between the image and the two text classes is computed as \(T(x)\)); for tabular AD, output scores from classical methods such as LOF or IForest serve as evidence.
    • Design Motivation: As a multimodal foundation model, CLIP generalizes broadly and is unaffected by contamination in any specific training set. Classical methods such as LOF operate under different assumptions and thus provide complementary information. Key insight: the evidence does not need to perform well in isolation; it only needs to satisfy the condition of Proposition 4.1, i.e., a positive expected log-weight over truly normal samples.
  3. EPHAD-Ada: Adaptive Temperature Selection:

    • Function: Automatically determines the hyperparameter \(\beta\) in an unsupervised manner at test time, eliminating the need for a labeled validation set.
    • Mechanism: Based on the entropy minimization principle—empirical entropies \(H(p_Y^o)\) and \(H(p_Y^e)\) of the inlier probabilities produced by the original model and the evidence function are computed respectively, and \(\beta_{ada} = H(p_Y^e) / (H(p_Y^o) + \delta)\). Inlier probabilities are estimated by converting score rankings into Beta distribution posterior means.
    • Design Motivation: When the original model exhibits high confidence (low \(H(p_Y^o)\)), the model should be trusted more (large \(\beta\)); when the evidence exhibits high confidence (low \(H(p_Y^e)\)), the evidence should be trusted more (small \(\beta\)), achieving automatic balance.
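The CLIP-based evidence described above reduces to a softmax over image–text similarities. A minimal numpy sketch, assuming the image embedding and the two (template-averaged) text embeddings are precomputed and L2-normalized; the function name and temperature value are illustrative, not taken from the released code:

```python
import numpy as np

def clip_evidence(image_emb, text_emb_normal, text_emb_anomaly, tau=0.01):
    """Evidence T(x): softmax probability that the image matches the 'normal'
    text class. All embeddings are assumed L2-normalized, so dot products are
    cosine similarities; tau is an assumed softmax temperature."""
    sims = np.array([float(image_emb @ text_emb_normal),
                     float(image_emb @ text_emb_anomaly)])
    z = sims / tau
    z = z - z.max()                      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs[0]                      # mass on the "normal" class
```

An image whose embedding is closer to the normal-class text receives evidence near 1, and near 0 otherwise, matching the role of \(T(x)\) in the tilting formula.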

Loss & Training

EPHAD requires no training and is a purely post-processing method. The core operation is a weighted fusion of the existing AD model's scores: \(\check{s}_{in}^\pm(x) = s_{in}^\pm(x) + T(x)/\beta\), after which the adjusted scores are used to re-rank samples for anomaly determination.
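The full correction, including the adaptive \(\beta\) of EPHAD-Ada, fits in a short numpy sketch. The rank-to-probability mapping (a Beta posterior mean over ranks) and the entropy estimator below are plausible instantiations of the paper's description, not the released implementation:

```python
import numpy as np

def rank_to_inlier_prob(scores):
    """Map inlier scores to probabilities via ranks: the i-th smallest of n
    scores gets the Beta(1, 1) posterior mean (i + 1) / (n + 2).
    Assumed conversion; higher score -> higher inlier probability."""
    n = len(scores)
    ranks = np.argsort(np.argsort(scores))   # 0 = lowest score
    return (ranks + 1) / (n + 2)

def binary_entropy(p):
    """Mean binary entropy of the inlier probabilities (natural log)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def ephad_adjust(model_scores, evidence, beta=None, delta=1e-6):
    """EPHAD correction: s_check = s + T(x) / beta.
    If beta is None, choose it adaptively (EPHAD-Ada): a confident model
    (low entropy) yields a large beta that downweights the evidence, and a
    confident evidence source yields a small beta that upweights it."""
    model_scores = np.asarray(model_scores, dtype=float)
    evidence = np.asarray(evidence, dtype=float)
    if beta is None:
        h_model = binary_entropy(rank_to_inlier_prob(model_scores))
        h_evidence = binary_entropy(rank_to_inlier_prob(evidence))
        beta = h_evidence / (h_model + delta)
    return model_scores + evidence / beta
```

Since only the ranking of the adjusted scores matters, the output can be passed directly to any threshold- or AUROC-based evaluation without renormalization.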

Key Experimental Results

Main Results (Visual AD, 10% Contamination Rate)

| Method / Dataset | Original AUROC (%) | +EPHAD (%) | +EPHAD-Ada (%) | Notes |
|---|---|---|---|---|
| CFLOW / CIFAR10 | 65.47 | 97.38 | 96.43 | CLIP evidence yields large gains |
| FastFlow / FMNIST | 83.66 | 93.49 | 92.10 | Significant improvement on semantic AD |
| ULSAD / MVTec | 91.93 | 91.31 | 92.25 | Strong model + weak evidence: slight drop / neutral |
| RD / ViSA | 86.33 | 77.76 | 79.42 | CLIP evidence is weak in industrial settings |
| PatchCore / RealIAD | 70.08 | 69.76 | 77.18 | Adaptive \(\beta\) from Ada is superior |

Ablation Study

| Configuration | Key Observation | Notes |
|---|---|---|
| \(\beta=0.5\) (default) | Optimal for most semantic AD scenarios | Balances prior and evidence |
| EPHAD-Ada | More robust in industrial settings | Automatically avoids cases where evidence is weaker than the model |
| Contamination rate 0→20% | Greater gains at higher contamination | Essentially harmless at 0% |
| CLIP vs. LOF evidence | CLIP superior for visual; LOF superior for tabular | Evidence must match the domain |

Key Findings

  • CLIP as evidence yields substantial improvements on semantic AD datasets (CIFAR10/FMNIST), with gains of +20–30 AUROC, but provides limited benefit or even degrades performance on industrial defect detection.
  • When the AD model is substantially stronger than the evidence (e.g., ULSAD achieves 64.27% on SVHN while CLIP achieves only 58.46%), fusion may degrade performance; EPHAD-Ada's adaptive \(\beta\) mitigates this issue.
  • In tabular AD experiments (26 datasets), EPHAD-Ada generally performs best, as the quality of LOF/IForest evidence is more uncertain.
  • Compared to methods that modify training (Refine/LOE/SoftPatch), EPHAD achieves comparable performance on industrial AD (RealIAD) through purely post-hoc processing.

Highlights & Insights

  • The transfer of test-time alignment ideas from generative models to anomaly detection is elegant; the KL-regularized objective is formally identical to the RLHF alignment formulation, representing a sophisticated cross-domain conceptual borrowing.
  • The post-processing paradigm of "modifying outputs rather than models" is highly practical for real-world deployment—applicable even when the model is an encrypted API.
  • The theoretical analysis provides clear sufficient conditions under which evidence fusion is guaranteed to improve performance (Proposition 4.1), avoiding naive or arbitrary fusion.
  • The progressive validation narrative—from 2D synthetic toy examples to real industrial datasets—is compelling and well-structured.

Limitations & Future Work

  • Fusion can be harmful when the AD model is substantially stronger than the evidence function; although EPHAD-Ada mitigates this, it cannot fully prevent it.
  • CLIP evidence depends on text template design—what constitutes an "anomaly" in industrial settings is difficult to describe precisely in natural language.
  • The framework is limited to image-level or sample-level anomaly determination; pixel-level anomaly localization is not addressed.
  • The evidence function itself may be affected by test-set distribution shift, and the "good evidence" condition required by the theoretical guarantee is difficult to verify in practice.
  • Simultaneous fusion of multiple evidence sources is not explored—when multiple evidence functions are available, how should they be optimally combined?
  • Generalizability to non-visual and non-tabular modalities (e.g., time series, graph data) remains to be validated.

Broader Context & Connections

  • TTA meets AD: EPHAD is the first work to introduce preparation-agnostic TTA into anomaly detection, opening a new direction for post-hoc correction in AD.
  • Foundation models as universal evidence: Although CLIP's zero-shot AD capability is modest in isolation, it serves effectively as a "correction signal" complementary to trained models, inspiring a paradigm of multi-model collaboration.
  • Connection to RLHF: Exponential tilting is equivalent to KL-regularized reward maximization, and further insights may be drawn from more recent alignment strategies such as DPO.
  • Data contamination in AD: ADBench analysis shows that approximately 70% of datasets have anomaly ratios below 10% with a median of 5%, indicating that low-level contamination is the norm rather than the exception.
  • Extension to multi-evidence fusion: The current framework employs a single evidence function \(T(x)\); in principle, multiple evidence sources (e.g., CLIP + LOF + domain rules) could be combined for more robust correction.

Rating

  • Novelty: ⭐⭐⭐⭐ — Transfers TTA and generative model alignment ideas to AD with a novel perspective and theoretically elegant formulation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 8 visual + 26 tabular + 1 industrial dataset, 7 AD baselines, comprehensive ablation.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and toy examples are intuitive, though the extensive tables are somewhat verbose.
  • Value: ⭐⭐⭐⭐ — The post-processing paradigm is highly practical, though caution is warranted in strong-model + weak-evidence scenarios.