UNSUPERVISED MODEL SELECTION FOR TIME-SERIES ANOMALY DETECTION

Abstract

Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the F1 score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem and provide theoretical justification for the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.

1. INTRODUCTION

Anomaly detection in time-series data has gained considerable attention from the academic and industrial research communities due to the explosion in the amount of data produced and the number of automated systems requiring some form of monitoring. A large number of anomaly detection methods have been developed to solve this task (Schmidl et al., 2022; Blázquez-García et al., 2021), ranging from simple algorithms (Keogh et al., 2005; Ramaswamy et al., 2000) to complex deep-learning models (Xu et al., 2018; Challu et al., 2022). These models have significant variance in performance across datasets (Schmidl et al., 2022; Paparrizos et al., 2022b), and evaluating their actual performance on real-world anomaly detection tasks is non-trivial, even when labeled datasets are available (Wu & Keogh, 2021).

Labels are seldom available for many, if not most, anomaly detection tasks. Labels are indications of which time points in a time-series are anomalous. The definition of an anomaly varies with the use case, but these definitions have in common that anomalies are rare events. Hence, accumulating a sizable number of labeled anomalies typically requires a domain expert to review a large portion of a dataset. This is an expensive, time-consuming, subjective, and thereby error-prone task, which is a considerable hurdle to labeling even a subset of data. Unsurprisingly, a large number of time-series anomaly detection methods are unsupervised or semi-supervised, i.e., they do not require any anomaly labels during training and inference. There is no single universally best method (Schmidl et al., 2022; Paparrizos et al., 2022b). Therefore, it is important to select the most accurate method for a given dataset without access to anomaly labels. The problem of unsupervised anomaly detection model selection has been overlooked in the literature, even though it is a key problem in practical applications.
Thus, we offer an answer to the question: Given an unlabeled dataset and a set of candidate models, how can we select the most accurate model? Our approach is based on computing "surrogate" metrics that correlate with model performance yet do not require anomaly labels, followed by aggregating the model ranks induced by these metrics (Fig. 1). Empirical evaluation on 10 real-world datasets spanning diverse domains such as medicine and sports shows that our approach can perform unsupervised model selection as effectively as selection based on labeling a subset of data.

In summary, our contributions are as follows:
• To the best of our knowledge, we propose one of the first methods for unsupervised selection of anomaly detection models on time-series data. To this end, we identify intuitive and effective unsupervised metrics of model performance. Prior work has used a few of these unsupervised metrics for problems other than time-series anomaly detection model selection.
• We propose a novel robust rank aggregation method for combining multiple surrogate metrics into a single model selection criterion. We show that our approach performs on par with selection based on labeling a subset of data.
• We conduct large-scale experiments on over 275 diverse time-series, spanning a gamut of domains such as medicine and entomology, with 5 popular and widely-used anomaly detection models, each with 1 to 4 hyper-parameter combinations, resulting in over 5,000 trained models. Upon acceptance, we will make our code publicly available.

In the next section, we formalize the model selection problem. We then describe the surrogate metric classes in Sec. 3 and present our rank aggregation method in Sec. 4. Our empirical evaluation is described in Sec. 5. Finally, we summarize related work in Sec. 6 and conclude in Sec. 7.
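To build intuition for rank aggregation over surrogate metrics, the sketch below combines per-metric model scores with a simple Borda count. This is an illustrative baseline only, not the robust aggregation method proposed in this paper; the three score rows stand in for the three surrogate metric classes.

```python
import numpy as np

def aggregate_ranks(metric_scores):
    """Combine per-metric model scores into one ranking via Borda count.

    metric_scores: array of shape (n_metrics, n_models), higher = better.
    Returns model indices ordered from best to worst under the aggregate.
    """
    # Rank models within each metric: 0 = worst, n_models - 1 = best.
    ranks = np.argsort(np.argsort(metric_scores, axis=1), axis=1)
    # Borda count: sum the ranks each model receives across all metrics.
    borda = ranks.sum(axis=0)
    return np.argsort(-borda)  # best model first

# Toy scores for 3 candidate models under 3 surrogate metrics
# (illustrative values, not taken from the paper's experiments).
scores = np.array([
    [0.2, 0.9, 0.5],  # e.g. prediction-error-based score
    [0.3, 0.8, 0.4],  # e.g. model-centrality score
    [0.1, 0.7, 0.9],  # e.g. score on injected synthetic anomalies
])
order = aggregate_ranks(scores)  # order[0] is the selected model's index
```

A robust aggregation scheme, unlike this plain Borda count, would additionally down-weight metrics whose rankings disagree strongly with the consensus.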

2. PRELIMINARIES & THE MODEL SELECTION PROBLEM

Let {(x_t, y_t)}_{t=1}^{T} denote a multivariate time-series (TS) with observations (x_1, . . . , x_T), x_t ∈ R^d, and anomaly labels (y_1, . . . , y_T), y_t ∈ {0, 1}, where y_t = 1 indicates that the observation x_t is an anomaly. The labels are used only for evaluating our selection approaches, not for the model selection procedure itself. Next, let M = {A_i}_{i=1}^{N} denote a set of N candidate anomaly detection models. Each model A_i is a tuple (detector, hyper-parameters), i.e., a combination of an anomaly detection method (e.g., LSTM-VAE (Park et al., 2018)) and a fully specified hyper-parameter configuration (e.g., for LSTM-VAE: hidden layer size = 128, num layers = 2, . . . ). Here, we only consider models that do not require anomaly labels for training. Some models still require unlabeled time-series as training data. We therefore consider a train/test split {x_t}_{t=1}^{t_test - 1},
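The candidate set M can be made concrete with a short sketch: each model is a (detector, configuration) tuple, enumerated from per-detector hyper-parameter grids. The detector names and grid values below are hypothetical placeholders, chosen only to mirror the (detector, hyper-parameters) definition above.

```python
from itertools import product

# Hypothetical detectors and hyper-parameter grids (illustrative only).
grids = {
    "LSTM-VAE": {"hidden_size": [64, 128], "num_layers": [1, 2]},
    "kNN-distance": {"k": [5, 10]},
}

def candidate_models(grids):
    """Enumerate M: one (detector, config) tuple per grid combination."""
    models = []
    for detector, grid in grids.items():
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            models.append((detector, dict(zip(keys, values))))
    return models

M = candidate_models(grids)
# 2 x 2 LSTM-VAE configs + 2 kNN configs = 6 candidate models in total.
```

As in the problem setup, none of these candidates would see anomaly labels: each is trained on the unlabeled prefix {x_t}_{t=1}^{t_test - 1} and scored on the remainder.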



Figure 1: The Model Selection Workflow. We identify three classes of surrogate metrics of model quality (Sec. 3), and propose a novel robust rank aggregation framework to combine multiple rankings from metrics (Sec. 4).

