UNSUPERVISED MODEL SELECTION FOR TIME-SERIES ANOMALY DETECTION

Abstract

Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the F1 score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
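The combination step described above can be illustrated with a minimal sketch. Here a simple Borda-style average-rank aggregation stands in for the paper's robust rank aggregation; the metric names, model names, and scores are purely illustrative assumptions, not the paper's actual experimental setup.

```python
def aggregate_ranks(metric_scores):
    """Combine several imperfect surrogate metrics by rank aggregation.

    metric_scores: dict mapping surrogate-metric name -> dict mapping
    model name -> score, where higher score means better under that metric.
    Returns model names sorted from best to worst by average rank.
    """
    models = sorted(next(iter(metric_scores.values())).keys())
    avg_rank = {m: 0.0 for m in models}
    for scores in metric_scores.values():
        # Rank 0 is best under this surrogate metric.
        ordered = sorted(models, key=lambda m: -scores[m])
        for rank, m in enumerate(ordered):
            avg_rank[m] += rank / len(metric_scores)
    return sorted(models, key=lambda m: avg_rank[m])

# Hypothetical surrogate-metric scores for three candidate detectors.
surrogates = {
    "prediction_error": {"lstm": 0.9, "iforest": 0.4, "knn": 0.6},
    "model_centrality": {"lstm": 0.7, "iforest": 0.5, "knn": 0.8},
    "synthetic_anomaly_f1": {"lstm": 0.8, "iforest": 0.3, "knn": 0.7},
}
print(aggregate_ranks(surrogates))  # best candidate first
```

A robust aggregation method would additionally down-weight surrogate metrics whose rankings disagree strongly with the consensus, rather than averaging all ranks uniformly as above.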

1. INTRODUCTION

Anomaly detection in time-series data has gained considerable attention from the academic and industrial research communities due to the explosion in the amount of data produced and the number of automated systems requiring some form of monitoring. A large number of anomaly detection methods have been developed to solve this task (Schmidl et al., 2022; Blázquez-García et al., 2021), ranging from simple algorithms (Keogh et al., 2005; Ramaswamy et al., 2000) to complex deep-learning models (Xu et al., 2018; Challu et al., 2022). These models exhibit significant variance in performance across datasets (Schmidl et al., 2022; Paparrizos et al., 2022b), and evaluating their actual performance on real-world anomaly detection tasks is non-trivial, even when labeled datasets are available (Wu & Keogh, 2021).

Labels are seldom available for many, if not most, anomaly detection tasks. Labels indicate which time points in a time-series are anomalous. The definition of an anomaly varies with the use case, but these definitions have in common that anomalies are rare events. Hence, accumulating a sizable number of labeled anomalies typically requires a domain expert to review a large portion of a dataset. This is an expensive, time-consuming, subjective, and thereby error-prone task, which is a considerable hurdle for labeling even a subset of the data. Unsurprisingly, a large number of time-series anomaly detection methods are unsupervised or semi-supervised, i.e., they do not require any anomaly labels during training and inference.

There is no single universally best method (Schmidl et al., 2022; Paparrizos et al., 2022b). Therefore, it is important to select the most accurate method for a given dataset without access to anomaly labels. The problem of unsupervised anomaly detection model selection has been overlooked in the literature, even though it is a key problem in practical applications.
Thus, we offer an answer to the problem of unsupervised model selection for time-series anomaly detection.

* Work carried out when the first two authors were interns at Amazon AWS AI Labs.

