TOWARDS LIGHTWEIGHT, MODEL-AGNOSTIC AND DIVERSITY-AWARE ACTIVE ANOMALY DETECTION

Abstract

Active Anomaly Discovery (AAD), which aims to incorporate analysts' feedback into unsupervised anomaly detectors, is a flourishing area of anomaly detection research. However, existing AAD approaches usually prioritize the samples with the highest anomaly scores for user labeling, which hinders the exploration of anomalies that were initially ranked lower. Moreover, most existing AAD approaches are tailored to a specific unsupervised detector, making them difficult to extend to other detection models. To tackle these problems, we propose a lightweight, model-agnostic and diversity-aware AAD method, named LMADA. In LMADA, we design a diversity-aware sample selector powered by a Determinantal Point Process (DPP), which considers the diversity of samples in addition to their anomaly scores when querying feedback. Furthermore, we propose a model-agnostic tuner that approximates diverse unsupervised detectors with a unified proxy model, on top of which the feedback information is incorporated by a lightweight non-linear representation adjuster. In extensive experiments on 8 public datasets, LMADA achieved a 74% F1-Score improvement on average, outperforming the other AAD approaches compared. LMADA also delivers a significant performance boost regardless of the underlying unsupervised detector.

1. INTRODUCTION

Anomaly detection aims to detect data samples that behave significantly differently from the majority. It has been applied in various domains, such as fraud detection (John & Naaz, 2019), cyber intrusion detection (Sadaf & Sultana, 2020), medical diagnosis (Fernando et al., 2021), and incident detection (Wang et al., 2020). Numerous unsupervised anomaly detectors have been proposed (Zhao et al., 2019; Boukerche et al., 2020; Wang et al., 2019). However, practitioners are usually unsatisfied with their detection accuracy (Das et al., 2016), because there is usually a discrepancy between the detected outliers and the actual anomalies of interest to users (Das et al., 2017; Zha et al., 2020; Siddiqui et al., 2018). To mitigate this problem, Active Anomaly Discovery (AAD) (Das et al., 2016) was proposed to incorporate analysts' feedback into unsupervised detectors so that the detection output better matches the actual anomalies.

The general workflow of Active Anomaly Discovery is shown in Fig. 1. First, a base unsupervised anomaly detector is trained. Next, a small number of samples are selected and presented to analysts for feedback. The labeled samples are then used to update the detector so that it incorporates the feedback. Based on the updated detection model, a new set of samples is recommended for the next feedback iteration. The iterations continue until the labeling budget is exhausted, at which point the tuned detection model is ready to be applied.

Despite the progress of existing AAD methods (Das et al., 2017; Zha et al., 2020; Siddiqui et al., 2018; Keller et al., 2012; Zhang et al., 2019; Li et al., 2019; Das et al., 2016), some intrinsic limitations of these approaches still pose great barriers to their real-world application.
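
The general AAD workflow described above (train a base detector, select samples, query the analyst, update the detector, repeat until the budget is exhausted) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the LMADA method or any cited approach: the names `aad_loop`, `top_selection`, `fit_centroid`, and `refit_without_confirmed_anomalies` are hypothetical, the detector is a toy distance-to-centroid scorer, and the analyst is simulated by a function that returns the ground-truth label.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Sequence[float]
Scorer = Callable[[Sample], float]  # maps a sample to an anomaly score


def top_selection(scores: List[float], labeled: Dict[int, bool], k: int) -> List[int]:
    """Return the k unlabeled indices with the highest anomaly scores."""
    candidates = [i for i in range(len(scores)) if i not in labeled]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


def aad_loop(
    data: List[Sample],
    fit: Callable[[List[Sample]], Scorer],
    update: Callable[[Scorer, List[Sample], Dict[int, bool]], Scorer],
    ask_analyst: Callable[[int], bool],
    budget: int,
    batch_size: int = 1,
) -> Tuple[Scorer, Dict[int, bool]]:
    """Generic feedback loop: returns the tuned scorer and the collected labels."""
    scorer = fit(data)                 # train the base unsupervised detector
    labeled: Dict[int, bool] = {}      # sample index -> "is a true anomaly"
    while len(labeled) < budget:
        scores = [scorer(x) for x in data]
        k = min(batch_size, budget - len(labeled))
        queried = top_selection(scores, labeled, k)   # choose samples to show
        if not queried:                # every sample has already been labeled
            break
        for i in queried:
            labeled[i] = ask_analyst(i)               # collect feedback
        scorer = update(scorer, data, labeled)        # incorporate feedback
    return scorer, labeled


# Toy stand-ins (assumptions for illustration only).
def fit_centroid(data: List[Sample]) -> Scorer:
    """Score a sample by its Euclidean distance to the data centroid."""
    dim = len(data[0])
    center = [sum(x[d] for x in data) / len(data) for d in range(dim)]
    return lambda x: sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5


def refit_without_confirmed_anomalies(
    scorer: Scorer, data: List[Sample], labeled: Dict[int, bool]
) -> Scorer:
    """Recompute the centroid after dropping analyst-confirmed anomalies."""
    kept = [x for i, x in enumerate(data) if not labeled.get(i, False)]
    return fit_centroid(kept or data)


data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [5.0, 5.0], [0.0, -0.1]]
true_anomalies = {4}                   # simulated ground truth
scorer, labeled = aad_loop(
    data,
    fit_centroid,
    refit_without_confirmed_anomalies,
    ask_analyst=lambda i: i in true_anomalies,
    budget=2,
)
```

In this sketch the selection step is isolated in `top_selection`, so an alternative querying strategy could be substituted there without touching the rest of the loop.
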
Firstly, most AAD methods adopt the top-selection strategy for feedback querying (Das et al., 2017; Zha et al., 2020; Siddiqui et al., 2018; Li et al., 2019), i.e., the samples with the highest anomaly scores are always prioritized for user labeling. However, this strategy hinders the exploration of actual anomalies that are not initially scored highly by the base detector. As such, these AAD approaches are

