UNDERSTANDING THE EFFECT OF BIAS IN DEEP ANOMALY DETECTION

Anonymous authors
Paper under double-blind review

Abstract

Anomaly detection presents a unique challenge in machine learning, due to the scarcity of labeled anomaly data. Recent work attempts to mitigate this problem by augmenting the training of deep anomaly detection models with additional labeled anomaly samples. However, the labeled data often does not align with the target distribution and introduces harmful bias to the trained model. In this paper, we aim to understand the effect of a biased anomaly set on anomaly detection. We formally state the anomaly detection problem as a supervised learning task, and focus on the anomaly detector's recall at a given false positive rate as the main performance metric. Given two different anomaly score functions, we formally define their difference in performance as the relative scoring bias of the anomaly detectors. Along this line, our work provides two key contributions. We establish the first finite sample rates for estimating the relative scoring bias for deep anomaly detection, and empirically validate our theoretical results on both synthetic and real-world datasets. We also provide an extensive empirical study of how a biased training anomaly set affects the anomaly score function and therefore the detection performance on different anomaly classes. Our study demonstrates scenarios in which the biased anomaly set can be useful or problematic, and provides a solid benchmark for future research.

1. INTRODUCTION

Anomaly detection (Chandola et al., 2009; Pimentel et al., 2014) trains a formal model to identify unexpected or anomalous instances in incoming data, whose behaviors differ from those of normal instances. It is particularly useful for detecting problematic events such as digital fraud, structural defects, and system malfunctions. Building accurate anomaly detection models is a well-known challenge in machine learning, due to the scarcity of labeled anomaly data. The classical and most common approach is to train anomaly detection models using only normal data [1]: first train a model on a corpus of normal data to capture normal behaviors, then configure the model to flag instances with large deviations as anomalies. Researchers have also developed deep learning methods to better capture the complex structure in the data (Ruff et al., 2018; 2019a; Wang et al., 2019a; Zhou & Paffenroth, 2017). Following the terminology introduced by Chandola et al. (2009), we refer to these models as semi-supervised anomaly detection. Recently, a new line of anomaly detection models proposes to leverage available labeled anomalies during model training, i.e., train an anomaly detection model using both normal data and additional labeled anomaly samples as they become available (Ruff et al., 2020b; Yamanaka et al., 2019; Ruff et al., 2020a; Hendrycks et al., 2019a). Existing works show that these new models achieve considerable performance improvements beyond the models trained using only normal data. We hereby refer to these models as deep supervised anomaly detection [2].

When exploring these models, we found that when the labeled anomalies used to train the model do not align with the target distribution, they can introduce harmful bias into the trained model. Specifically, when comparing the performance of a supervised anomaly detector to its semi-supervised version, the performance difference varies significantly across test anomaly data, some better and some worse. That is, using labeled anomalies during model training does not always improve model performance; instead, it may introduce large variance (or bias) in anomaly detection outcomes.

In this paper, we aim to understand the effect of a biased training set on deep anomaly detection models. We formally state the anomaly detection problem, focusing on the anomaly detector's recall at a given false positive rate as the main performance metric.
We factor the contribution of the labeled anomalies through the detector's anomaly scoring function, and show that different types of labeled anomalies produce different anomaly scoring functions. Next, given any two different anomaly scoring functions, we formally define their difference in performance as the relative scoring bias of the anomaly detectors. Our novel notion of scoring bias for anomaly detection aligns with the notion of bias in the classical supervised learning setting, with the key difference being the performance metric: we target recall at a given false positive rate, the metric used by real-world anomaly detection tasks (Li et al., 2019; Liu et al., 2018).

Along this line, we establish the first finite sample rates for estimating the relative scoring bias for deep anomaly detection. We empirically validate our assumptions and theoretical results on synthetic data and three real-world datasets (Fashion-MNIST, Statlog (Landsat Satellite), and Cellular Spectrum Misuse (Li et al., 2019)). Furthermore, we provide an empirical study of how a biased training anomaly set affects the anomaly score function and therefore the resulting detection performance, considering the above three real-world datasets and six deep-learning-based anomaly detection models. Our study demonstrates scenarios in which the biased anomaly set can be useful or problematic, and provides a solid benchmark for future research.

In summary, we introduce a formal analysis of the effect of a biased training set on deep anomaly detection. Our main contributions are the following:

• We discover the issue of large performance variance in deep anomaly detectors, caused by the use of a biased anomaly set as training data.

• We model the effect of biased training as relative scoring bias, and establish the first finite sample rates for estimating the relative scoring bias of the trained models.
• We conduct empirical experiments to verify and characterize the impact of the relative scoring bias on six popular anomaly detection models and three real-world datasets.

To the best of our knowledge, our work is the first to formally study the effect of a biased anomaly training set on deep anomaly detection. Our results show both significant positive and negative impacts of these biases, and suggest that model trainers must treat labeled anomalies with additional care. We believe this opens new opportunities for improving deep anomaly detectors and deserves more attention from the research community.
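To make the performance metric concrete, the sketch below computes a detector's recall at a given false positive rate from anomaly scores, and the relative scoring bias as the difference in recall between two score functions at the same operating point. This is our own minimal illustration, not the paper's implementation; the function names and the toy Gaussian scores are ours:

```python
import numpy as np

def recall_at_fpr(scores_normal, scores_anomaly, target_fpr):
    """Recall on anomalies when the threshold is set so that at most
    target_fpr of the normal samples are (falsely) flagged."""
    # Pick the score threshold from the normal data: the (1 - FPR) quantile.
    threshold = np.quantile(scores_normal, 1.0 - target_fpr)
    # A test instance is flagged as an anomaly if its score exceeds the threshold.
    return float(np.mean(scores_anomaly > threshold))

def relative_scoring_bias(f1_normal, f1_anomaly, f2_normal, f2_anomaly, target_fpr):
    """Difference in recall (at the same FPR) between two score functions."""
    return (recall_at_fpr(f1_normal, f1_anomaly, target_fpr)
            - recall_at_fpr(f2_normal, f2_anomaly, target_fpr))

# Toy example: score function f2 separates anomalies from normal data
# better than f1, so the bias of f1 relative to f2 is negative.
rng = np.random.default_rng(0)
f1_n, f1_a = rng.normal(0, 1, 5000), rng.normal(1, 1, 5000)
f2_n, f2_a = rng.normal(0, 1, 5000), rng.normal(3, 1, 5000)
print(relative_scoring_bias(f1_n, f1_a, f2_n, f2_a, target_fpr=0.05))  # negative: f1 recalls fewer anomalies
```

Setting the threshold as the (1 - FPR) quantile of the normal scores is what fixes the false positive rate; recall is then measured on the anomalies at that shared operating point, which is why two score functions can be compared directly.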

2. RELATED WORK

Anomaly Detection Models. While the literature on anomaly detection models is extensive, the most relevant to our work are deep learning based models. Following the terminology used by Chandola et al. (2009), we consider two types of models:

• Semi-supervised anomaly detection refers to models trained on only normal data, e.g., Ruff et al. (2018; 2019a); Sakurada & Yairi (2014); Zhou & Paffenroth (2017).

• Supervised anomaly detection refers to models trained on normal data and a small set of labeled anomalies, e.g., Pang et al. (2019); Daniel et al. (2019); Yamanaka et al. (2019); Ruff et al. (2020a;b).

One can also categorize models by their architecture: hypersphere models (Ruff et al., 2018; 2020a;b) and autoencoder (or reconstruction) based models (Zhou & Paffenroth, 2017; Yamanaka et al., 2019). Another line of recent work proposes to use synthetic or auxiliary anomalies to train anomaly detection models (Golan & El-Yaniv, 2018; Hendrycks et al., 2019c; Lee et al., 2018; Hendrycks et al., 2019b), "forcing" the model to learn a more compact representation of the normal data. While existing work has shown empirically that the choice of abnormal data in training can help detect some unseen abnormal distributions, it does not offer a theoretical explanation for the phenomenon.

[1] Existing literature has used different terms to describe this type of model: some use semi-supervised anomaly detection (Chandola et al., 2009) while others use unsupervised anomaly detection (Ruff et al., 2018).
[2] Some works term these models semi-supervised anomaly detection (Ruff et al., 2020b; Yamanaka et al., 2019; Ruff et al., 2020a; Hendrycks et al., 2019a), while others term them supervised anomaly detection (Chandola et al., 2009).
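The contrast between the two training regimes in the hypersphere family can be sketched with a small example. The following is our own simplified illustration, loosely modeled on Deep SVDD-style (normal data only) and Deep SAD-style (with labeled anomalies) objectives; the function names, the inverse-distance penalty form, and the hyperparameters are our assumptions, not the exact losses of the cited papers:

```python
import numpy as np

def semi_supervised_loss(embed_normal, center):
    # Semi-supervised (Deep SVDD-style): pull embeddings of normal
    # data toward a fixed center of the hypersphere.
    return np.mean(np.sum((embed_normal - center) ** 2, axis=1))

def supervised_loss(embed_normal, embed_anomaly, center, eta=1.0, eps=1e-6):
    # Supervised (Deep SAD-style): additionally push labeled anomalies
    # away from the center by penalizing the inverse squared distance.
    d_norm = np.sum((embed_normal - center) ** 2, axis=1)
    d_anom = np.sum((embed_anomaly - center) ** 2, axis=1)
    return np.mean(d_norm) + eta * np.mean(1.0 / (d_anom + eps))

def anomaly_score(embed, center):
    # Both regimes score a test point by its distance to the center.
    return np.sum((embed - center) ** 2, axis=1)
```

In both regimes a test point is scored by its distance to the center, so the labeled anomalies influence detection only through how they reshape the learned embedding; this is the mechanism through which a biased anomaly set can shift the scoring function.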

