DEEP BATCH ACTIVE ANOMALY DETECTION WITH DIVERSE QUERIES

Abstract

Selecting informative data points for expert feedback can significantly improve the performance of anomaly detection in various contexts, such as medical diagnostics or fraud detection. In this paper, we determine a set of conditions under which the ranking of anomaly scores generalizes from labeled queries to unlabeled data. Inspired by these conditions, we propose a new querying strategy for batch active anomaly detection that leads to systematic improvements over current approaches. It selects a diverse set of data points for labeling, achieving high data coverage with a limited budget. These labeled data points provide weak supervision to the unsupervised anomaly detection problem. However, correctly identifying anomalies in the contaminated training data requires an estimate of the contamination ratio. We show how this anomaly rate can be estimated from the query set by importance-weighting, removing the associated bias due to the non-uniform sampling procedure. Extensive experiments on image, tabular, and video data sets show that our approach results in state-of-the-art active anomaly detection performance.

1. INTRODUCTION

Detecting anomalies in data is a fundamental task in machine learning with applications in various domains, from industrial fault detection to medical diagnosis. The main idea is to train a model (such as a neural network) on a data set of "normal" samples to minimize the loss of an auxiliary (e.g., self-supervised) task. Using the loss function to score test data, one hopes to obtain low scores for normal data and high scores for anomalies (Ruff et al., 2021). Oftentimes, the training data is contaminated with unlabeled anomalies, and many approaches either hope that training will be dominated by the normal samples (inlier priority, Wang et al. (2019)) or try to detect and exploit anomalies in the training data (e.g., Qiu et al. (2022a)). In some settings, expert feedback is available to check whether individual samples are normal or should be considered anomalies. These labels are usually expensive to obtain but are very valuable for guiding an anomaly detector during training. For example, in a medical setting, one may ask a medical doctor to confirm whether a given image shows normal or abnormal cellular tissue. Other application areas include detecting network intrusions or machine failures. As expert feedback is typically expensive, it is essential to find effective strategies for querying informative data points. Previous work on active anomaly detection primarily involves domain-specific applications and/or ad hoc architectures, making it hard to disentangle modeling choices from querying strategies (Trittenbach et al., 2021). This paper aims to disentangle the different factors that affect detection accuracy. We theoretically analyze generalization performance under various querying strategies and find that diversified sampling systematically improves over existing popular querying strategies, such as querying data based on their predicted anomaly scores or around the decision boundary.
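To make the contrast with score-based or boundary-based querying concrete, a diversified query set can be obtained with a simple greedy farthest-point (k-center) heuristic on feature embeddings. This is only an illustrative sketch of diversified sampling, not the exact strategy analyzed in this paper; the function name and embedding input are our own choices.

```python
import numpy as np

def greedy_diverse_queries(features, budget, rng=None):
    """Illustrative diversified querying: greedy farthest-point selection.

    features: (n, d) array of data embeddings.
    budget:   number of points to query for expert labels.
    Returns a list of `budget` indices spread out to cover the data,
    in contrast to querying only the highest-scoring (most anomalous)
    or most uncertain points.
    """
    rng = np.random.default_rng(rng)
    n = features.shape[0]
    selected = [int(rng.integers(n))]  # random seed point
    # Distance from every point to the current query set.
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dists))  # farthest point from the query set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return selected
```

Because each new query maximizes the distance to all previously selected points, the resulting set achieves high coverage of the data with a limited labeling budget.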
Based on these findings, we propose active latent outlier exposure (ALOE): a state-of-the-art active learning strategy compatible with many unsupervised and self-supervised losses for anomaly detection (Ruff et al., 2021; Qiu et al., 2022a). ALOE draws information from both the queried and unqueried parts of the data via two equally weighted losses. Its sole hyperparameter, the assumed anomaly rate, can be efficiently estimated with an importance sampling estimator. We show on a multitude of data sets (images, tabular data, and video) that ALOE leads to a new state of the art. In summary, our main contributions are as follows:
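Since the query set is sampled non-uniformly, the fraction of anomalies among the queried points is a biased estimate of the contamination ratio. A Horvitz-Thompson-style importance-weighted estimator corrects for this; the sketch below illustrates the idea under the assumption that the selection probability of each queried point is known (the function name and exact form are our own, not necessarily the estimator used in the paper).

```python
import numpy as np

def estimate_anomaly_rate(labels, selection_probs, n_total):
    """Importance-weighted estimate of the contamination ratio.

    labels:          0/1 expert labels for the queried points (1 = anomaly).
    selection_probs: probability with which each queried point was
                     selected by the (non-uniform) querying strategy.
    n_total:         size of the full, mostly unlabeled training set.

    Each queried anomaly is up-weighted by 1 / p_i, so points the strategy
    was unlikely to pick count for more; this removes the bias introduced
    by non-uniform sampling.
    """
    labels = np.asarray(labels, dtype=float)
    selection_probs = np.asarray(selection_probs, dtype=float)
    return float(np.sum(labels / selection_probs) / n_total)
```

As a sanity check, with uniform selection probabilities the estimator reduces to the labeled anomaly count divided by the expected number of queries per data point times the data set size.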

