UNSUPERVISED MODEL SELECTION FOR TIME-SERIES ANOMALY DETECTION

Abstract

Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question, i.e.: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the F1 score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.

1. INTRODUCTION

Anomaly detection in time-series data has gained considerable attention from the academic and industrial research communities due to the explosion in the amount of data produced and the number of automated systems requiring some form of monitoring. A large number of anomaly detection methods have been developed to solve this task (Schmidl et al., 2022; Blázquez-García et al., 2021), ranging from simple algorithms (Keogh et al., 2005; Ramaswamy et al., 2000) to complex deep-learning models (Xu et al., 2018; Challu et al., 2022). These models have significant variance in performance across datasets (Schmidl et al., 2022; Paparrizos et al., 2022b), and evaluating their actual performance on real-world anomaly detection tasks is non-trivial, even when labeled datasets are available (Wu & Keogh, 2021). Labels are seldom available for many, if not most, anomaly detection tasks. Labels are indications of which time points in a time-series are anomalous. The definition of an anomaly varies with the use case, but these definitions have in common that anomalies are rare events. Hence, accumulating a sizable number of labeled anomalies typically requires review of a large portion of a dataset by a domain expert. This is an expensive, time-consuming, subjective, and thereby error-prone task, which is a considerable hurdle for labeling even a subset of data. Unsurprisingly, a large number of time-series anomaly detection methods are unsupervised or semi-supervised, i.e., they do not require any anomaly labels during training and inference. There is no single universally best method (Schmidl et al., 2022; Paparrizos et al., 2022b). Therefore, it is important to select the most accurate method for a given dataset without access to anomaly labels. The problem of unsupervised anomaly detection model selection has been overlooked in the literature, even though it is a key problem in practical applications.
Thus, we offer an answer to the question: Given an unlabeled dataset and a set of candidate models, how can we select the most accurate model? Our approach is based on computing "surrogate" metrics that correlate with model performance yet do not require anomaly labels, followed by aggregating the model ranks induced by these metrics (Fig. 1). Empirical evaluation on 10 real-world datasets spanning diverse domains such as medicine, sports, etc. shows that our approach can perform unsupervised model selection as effectively as selection based on labeling a subset of data. In summary, our contributions are as follows:

• To the best of our knowledge, we propose one of the first methods for unsupervised selection of anomaly detection models on time-series data. To this end, we identify intuitive and effective unsupervised metrics for model performance. Prior work has used a few of these unsupervised metrics for problems other than time-series anomaly detection model selection.

• We propose a novel robust rank aggregation method for combining multiple surrogate metrics into a single model selection criterion. We show that our approach performs on par with selection based on labeling a subset of data.

• We conduct large-scale experiments on over 275 diverse time-series, spanning a gamut of domains such as medicine, entomology, etc., with 5 popular and widely-used anomaly detection models, each with 1 to 4 hyper-parameter combinations, resulting in over 5,000 trained models. Upon acceptance, we will make our code publicly available.

In the next section, we formalize the model selection problem. We then describe the surrogate metric classes in Sec. 3, and present our rank aggregation method in Sec. 4. Our empirical evaluation is described in Sec. 5. Finally, we summarize related work in Sec. 6 and conclude in Sec. 7.

2. PRELIMINARIES & THE MODEL SELECTION PROBLEM

Let {x_t, y_t}_{t=1}^T denote a multivariate time-series (TS) with observations (x_1, ..., x_T), x_t ∈ R^d, and anomaly labels (y_1, ..., y_T), y_t ∈ {0, 1}, where y_t = 1 indicates that the observation x_t is an anomaly. The labels are only used for evaluating our selection approaches, not for the model selection procedure itself. Next, let M = {A_i}_{i=1}^N denote a set of N candidate anomaly detection models. Each model A_i is a tuple (detector, hyper-parameters), i.e., a combination of an anomaly detection method (e.g., LSTM-VAE (Park et al., 2018)) and a fully specified hyper-parameter configuration (e.g., for LSTM-VAE, hidden layer size = 128, num layers = 2, ...). Here, we only consider models that do not require anomaly labels for training. Some models still require unlabeled time-series as training data. We therefore consider a train/test split {x_t}_{t=1}^{t_test−1}, {x_t}_{t=t_test}^T, where the former is used if a model requires training (without labels). In our notation, A_i denotes a trained model. We assume that a trained model A_i, when applied to observations {x_t}_{t=t_test}^T, produces anomaly scores {s_t^i}_{t=t_test}^T, s_t^i ∈ R_{≥0}. We assume that a higher anomaly score indicates that the observation is more likely to be an anomaly. However, we do not assume that the scores correspond to likelihoods of any particular statistical model, or that the scores are comparable across models. Model performance or, equivalently, the quality of the scores can be measured using a supervised metric Q({s_t}_{t=1}^T, {y_t}_{t=1}^T), such as the area under the precision-recall curve or the best F1 score, commonly used in the literature. We discuss the choice of the quality metric in the next section. In general, rather than considering a single time-series, we will be performing model selection for a set of L time-series with observations X = {{x_t^j}_{t=1}^T}_{j=1}^L and labels Y = {{y_t^j}_{t=1}^T}_{j=1}^L, where j indexes time-series.
Let X_train, X_test, and Y_test denote the train and test portions of the observations, and the test portion of the labels, respectively. We are now ready to introduce the following problem.

Problem 1 (Unsupervised Time-series Anomaly Detection Model Selection). Given observations X_test and a set of models M = {A_i}_{i=1}^N trained using X_train, select a model that maximizes the anomaly detection quality metric Q(A_i(X_test), Y_test). The selection procedure cannot use labels.

2.1. MEASURING ANOMALY DETECTION MODEL PERFORMANCE

Anomaly detection can be viewed as a binary classification problem where each time point is classified as an anomaly or a normal observation. Hence, the performance of a model A_i can be measured using standard precision and recall metrics. However, these metrics ignore the sequential nature of time-series; thus, time-series anomaly detection is usually evaluated using adjusted versions of precision and recall (Paparrizos et al., 2022a). We adopt widely used adjusted versions of precision and recall (Xu et al., 2018; Challu et al., 2022; Shen et al., 2020; Su et al., 2019; Carmona et al., 2021). These metrics treat time points as independent samples, except when an anomaly lasts for several consecutive time points; in this case, detecting any of these points is treated as if all points inside the anomalous segment were detected. Adjusted precision and recall can be summarized in a metric called the adjusted F1 score, which is their harmonic mean. The adjusted F1 score depends on the choice of a decision threshold on anomaly scores. In line with several recent studies, we consider threshold selection a problem orthogonal to ours (Schmidl et al., 2022; Laptev et al., 2015; Blázquez-García et al., 2021; Rebjock et al., 2021; Paparrizos et al., 2022a;b) and therefore consider metrics that summarize model performance across all possible thresholds. Common metrics include the area under the precision-recall curve and the best F1 (maximum F1 over all possible thresholds). Like Xu et al. (2018), we found a strong correlation between the two (App. A.11). We also found best F1 to have a statistically significant positive correlation with the volume under the surface of the ROC curve, a recently proposed robust evaluation metric (App. A.10). Thus, in the remainder of the paper, we restrict our analysis to identifying models with the highest adjusted best F1 score (i.e., Q is the adjusted best F1).
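The adjusted metric described above can be sketched in a few lines of NumPy. This is a minimal illustration of point-adjusted evaluation, not the paper's exact implementation: any detection inside a contiguous anomalous segment counts as detecting the whole segment, and the best F1 is taken over all candidate thresholds.

```python
import numpy as np

def point_adjust(scores, labels, threshold):
    """Point-adjusted predictions: if any point inside a contiguous
    anomalous segment is detected, mark the whole segment as detected."""
    preds = (scores >= threshold).astype(int)
    adjusted = preds.copy()
    t, T = 0, len(labels)
    while t < T:
        if labels[t] == 1:
            end = t
            while end < T and labels[end] == 1:
                end += 1
            if adjusted[t:end].any():       # any hit in the segment?
                adjusted[t:end] = 1         # credit the whole segment
            t = end
        else:
            t += 1
    return adjusted

def adjusted_best_f1(scores, labels):
    """Maximum point-adjusted F1 over all candidate thresholds."""
    best = 0.0
    for thr in np.unique(scores):
        preds = point_adjust(scores, labels, thr)
        tp = int(((preds == 1) & (labels == 1)).sum())
        fp = int(((preds == 1) & (labels == 0)).sum())
        fn = int(((preds == 0) & (labels == 1)).sum())
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * prec * rec / (prec + rec))
    return best
```

For a two-point anomalous segment, a single high score inside the segment at the right threshold yields a perfect adjusted F1, which is exactly the leniency the adjustment introduces.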

3. SURROGATE METRICS OF MODEL PERFORMANCE

The problem of unsupervised model selection is often viewed as finding an unsupervised metric which correlates well with supervised metrics of interest (Ma et al., 2021; Goix, 2016; Lin et al., 2020; Duan et al., 2019). Each unsupervised metric serves as a noisy measure of "model goodness" and reduces the problem to picking the best-performing model according to the metric. We identified three classes of imperfect metrics that closely align with expert intuition in predicting the performance of time-series anomaly detection models. Our metrics are unsupervised because they do not require anomaly labels. However, some metrics, such as the F1 score on synthetic anomalies, are typically used for supervised evaluation. To avoid confusion, we use the term surrogate for our metrics. Below we elaborate on each class of surrogate metrics, starting with the intuition behind each class.

Prediction Error. If a model can forecast or reconstruct a time-series well, it must also be a good anomaly detector. A large number of anomaly detection methods are based on forecasting or reconstruction (Schmidl et al., 2022), and for such models, we can compute forecasting or reconstruction error without anomaly labels. We collectively call these prediction error metrics and consider a common set of statistics, such as mean absolute error, mean squared error, mean absolute percentage error, symmetric mean absolute percentage error, and the likelihood of observations. For multivariate time-series, we average each metric across all the variables. For example, given a series of observations {x_t}_{t=1}^T and their predictions {x̂_t}_{t=1}^T, the mean squared error is defined as MSE = (1/T) Σ_{t=1}^T (x_t − x̂_t)². In the interest of space, we refer the reader to a textbook (e.g., Hyndman & Athanasopoulos (2018)) for definitions of the other metrics.
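The error statistics above are standard and can be computed in a few lines. A minimal sketch (the small `eps` guarding against division by zero is our own convention, not a detail from the paper):

```python
import numpy as np

def prediction_error_metrics(x, x_hat, eps=1e-8):
    """Surrogate prediction-error metrics for a series and its model
    prediction; lower is better. For multivariate series, np.mean
    over the full array averages across time steps and variables."""
    err = x - x_hat
    return {
        "mse":   float(np.mean(err ** 2)),
        "mae":   float(np.mean(np.abs(err))),
        "mape":  float(np.mean(np.abs(err) / (np.abs(x) + eps))),
        "smape": float(np.mean(2 * np.abs(err)
                               / (np.abs(x) + np.abs(x_hat) + eps))),
    }
```

Each candidate forecasting or reconstruction model is then ranked by one of these statistics, with lower error interpreted as a better anomaly detector.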
Prior work on time-series anomaly detection (Saganowski & Andrysiak, 2020; Laptev et al., 2015; Kuang et al., 2022) used forecasting error to select between some classes of models. Prediction metrics are not perfect for model selection because the time-series can contain anomalies. Low prediction errors are desirable at all points except anomalies, but without labels we do not know where the anomalies are. While prediction metrics can only be computed for anomaly detection methods based on time-series forecasting or reconstruction, the next two classes of metrics apply to any anomaly detection method.

[Figure 2 caption: We inject 9 different types of synthetic anomalies randomly, one at a time, across multiple copies of the original time-series. The figure shows the original time-series before injection, the injected anomaly, and the resulting time-series after anomaly injection.]

Synthetic Anomaly Injection

A good anomaly detection model will perform well on data with synthetically injected anomalies. Some studies have previously explored using synthetic anomaly injection to train anomaly detection models (Carmona et al., 2021). Here, we extend this line of research by systematically evaluating model selection capability after injecting different kinds of anomalies. Given an input time-series without anomaly labels, we randomly inject an anomaly of a particular type. The location of the injected anomaly is then treated as a (pseudo-)positive label, while the rest of the time points are treated as (pseudo-)negatives. We then select the model that achieves the best adjusted F1 score with respect to the pseudo-labels. Instead of relying on complex deep generative models (Wen et al., 2020), we devise simple and efficient yet effective procedures to inject anomalies of the previously described types (Schmidl et al., 2022; Wu & Keogh, 2021), as shown in Fig. 2. We defer the details of the anomaly injection approaches to Appendix A.5. Model selection based on anomaly injection might not be perfect because (i) real anomalies might be of a different type than the injected ones, and (ii) the series may already contain anomalies, for which our pseudo-labels will be negative.

Model Centrality. There is only one ground truth; thus, models close to the ground truth are close to each other. Hence, the most "central" model is the best one. Centrality-based metrics have seen recent success in unsupervised model selection for disentangled representation learning (Lin et al., 2020; Duan et al., 2019) and anomaly detection (Ma et al., 2021). To adapt this approach to time-series, we leverage the scores from the anomaly detection models to define a notion of model proximity. Anomaly scores of different models are not directly comparable. However, since anomaly detection is performed by thresholding the scores, we can consider the ranking of time points by their anomaly scores.
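The injection procedure can be sketched as follows. This is an illustrative simplification of two of the nine injection types (a spike-like scale change and additive noise); the actual procedures, lengths, and magnitudes are in Appendix A.5, and the parameters below are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_anomaly(x, kind="spike", length=5, magnitude=3.0):
    """Inject one synthetic anomaly into a copy of the series.
    Returns (corrupted series, pseudo-labels): the injected segment is
    pseudo-positive, all other time points are pseudo-negative."""
    x = x.copy()
    labels = np.zeros(len(x), dtype=int)
    start = rng.integers(0, len(x) - length)
    seg = slice(start, start + length)
    if kind == "spike":
        x[seg] += magnitude * x.std()       # shift the segment upward
    elif kind == "noise":
        x[seg] += rng.normal(0, magnitude * x.std(), size=length)
    labels[seg] = 1
    return x, labels
```

The candidate model that scores the injected segment highest, i.e., achieves the best adjusted F1 against these pseudo-labels, is selected.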
Let σ_k(i) denote the rank of time point i according to the scores from model k. We define the distance between models A_k and A_l as the Kendall's τ distance, which is the number of pairwise disagreements:

d_τ(σ_k, σ_l) = Σ_{i<j} I{(σ_k(i) − σ_k(j))(σ_l(i) − σ_l(j)) < 0}   (1)

Next, we measure model centrality as the average distance to its K nearest neighbors, where K is a parameter. This metric favors models which are close to their nearest neighbors. The centrality-based approach is imperfect, because "bad" models might produce similar results and form a cluster. Each metric, while imperfect, is predictive of anomaly detection performance, but to a varying extent, as we show in Sec. 5. Access to multiple noisy metrics raises a natural question: How can we combine multiple imperfect metrics to improve model selection performance?
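A naive implementation of Eq. 1 and the centrality metric might look as follows. This sketch is quadratic in the number of time points, so in practice one would subsample time points for long series (our assumption, not a detail from the paper):

```python
import numpy as np
from itertools import combinations

def kendall_tau_distance(sigma_k, sigma_l):
    """Eq. 1: number of pairs of time points on whose relative
    ordering the two models' score rankings disagree."""
    n = len(sigma_k)
    return sum((sigma_k[i] - sigma_k[j]) * (sigma_l[i] - sigma_l[j]) < 0
               for i, j in combinations(range(n), 2))

def centrality(rankings, K=3):
    """Average Kendall tau distance from each model's ranking to its
    K nearest neighbours; smaller values mean a more central model."""
    n = len(rankings)
    D = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            D[a, b] = D[b, a] = kendall_tau_distance(rankings[a], rankings[b])
    out = []
    for a in range(n):
        dists = np.sort(np.delete(D[a], a))   # distances to other models
        out.append(dists[:K].mean())
    return np.array(out)
```

The model with the smallest average distance to its neighbours is selected under this surrogate.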

4. ROBUST RANK AGGREGATION

Many studies have explored the benefits of combining multiple sources of noisy information to reduce error in different areas of machine learning, such as crowd-sourcing (Dawid & Skene, 1979) and programmatic weak supervision (Ratner et al., 2016; Goswami et al., 2021; Dey et al., 2022; Gao et al., 2022; Zhang et al., 2022a). Each of our surrogate metrics induces a (noisy) model ranking. Thus, two natural questions arise in the context of our work: (1) How do we reliably combine multiple model rankings? (2) Does aggregating multiple rankings help in model selection?

4.1. APPROACHING MODEL SELECTION AS A RANK AGGREGATION PROBLEM

Let [N] = {1, 2, ..., N} denote the set of N items, and let S_N denote the set of permutations of [N]. For σ ∈ S_N, let σ(i) denote the rank of element i and σ^{−1}(j) denote the element at rank j. Then the rank aggregation problem is as follows.

Problem 2 (Rank Aggregation). Given a collection of M permutations σ_1, ..., σ_M ∈ S_N of N items, find σ* ∈ S_N that best summarizes these permutations according to some objective C(σ*, σ_1, ..., σ_M).

In our context, the N items correspond to the N candidate anomaly detection models, M corresponds to the number of surrogate metrics, σ* is the best summary ranking of models, and σ*^{−1}(1) is the predicted best model. The problem of rank aggregation has been extensively studied in the social choice, bioinformatics, and machine learning literatures. Here, we consider Kemeny rank aggregation, which involves choosing a distance on the set of rankings and finding a barycentric or median ranking (Korba et al., 2017). Specifically, the rank aggregation problem is known as the Kemeny-Young problem if the aggregation objective is defined as C = (1/M) Σ_{i=1}^M d_τ(σ*, σ_i), where d_τ is the Kendall's τ distance (Eq. 1). Kemeny aggregation has been shown to satisfy many desirable properties, but it is NP-hard even with as few as 4 rankings (Dwork et al., 2001).
To this end, we consider an efficient approximate solution using the Borda method (Borda, 1784), in which each item (model) receives points from each ranking (surrogate metric). For example, if a model is ranked r-th by a surrogate metric, it receives (N − r) points from that metric. The models are then ranked in descending order of the total points received from all metrics.

Suppose that we have several surrogate metrics for ranking anomaly detection models. Why would it be beneficial to aggregate the rankings induced by these metrics? The following theorem provides insight into the benefit of Borda rank aggregation. Consider two arbitrary models i and j with (unknown) ground-truth ranks σ*(i) and σ*(j). Without loss of generality, assume σ*(i) < σ*(j), i.e., model i is better. Next, let us view each surrogate metric k = 1, ..., M as a random variable Ω_k taking values in the set of permutations S_N, such that Ω_k(i) ∈ [N] denotes the rank of item i. We apply Borda rank aggregation to realizations of these random variables. Theorem 1 states that, under certain assumptions, the probability of the Borda ranking making a mistake (i.e., placing model i lower than j) is upper bounded in a way such that increasing M decreases this bound. In Appendix A.14, we empirically show that model selection performance improves as we increase the number of surrogate metrics (rankings).

Theorem 1. Assume that the Ω_k's are pairwise independent, and that P(Ω_k(i) < Ω_k(j)) > 0.5 for any arbitrary models i and j, i.e., each surrogate metric is better than random. Then the probability that Borda aggregation makes a mistake in ranking items i and j is at most 2 exp(−Mε²/2), for some fixed ε > 0.

The proof of the theorem is given in Appendix A.8. We do not assume that the Ω_k are identically distributed, only that they are pairwise independent.
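The Borda method described above reduces to a few lines of code. A minimal sketch (rankings are 1-based rank vectors indexed by model; ties in total points are broken by model index here, an arbitrary convention of this sketch):

```python
import numpy as np

def borda(rankings, N):
    """Borda count: a model ranked r-th (1-based) by one surrogate
    metric receives N - r points from that metric; models are then
    ordered by descending total points."""
    points = np.zeros(N)
    for sigma in rankings:               # sigma[i] = rank of model i
        for i, r in enumerate(sigma):
            points[i] += N - r
    order = np.argsort(-points)          # model indices, best first
    agg = np.empty(N, dtype=int)
    agg[order] = np.arange(1, N + 1)     # aggregate rank of each model
    return agg
```

The predicted best model is the one with aggregate rank 1, i.e., `int(np.argmin(agg))`.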
We emphasize that the notion of independence of the Ω_k as random variables is different from rank correlation between realizations of these variables (i.e., permutations). For example, two permutations drawn independently from a Mallows distribution can have a high Kendall's τ coefficient (rank correlation). Therefore, our assumption of independence, commonly adopted in theoretical analysis (Ratner et al., 2016), does not contradict our observation of high rank correlation between surrogate metrics. The Borda method weighs all surrogate metrics equally. However, our experience and theoretical analysis suggest that focusing only on "good" metrics can improve performance after aggregation (see Sec. 5 and Corollary 1). But how do we identify "good" rankings without access to anomaly labels?

4.2. EMPIRICAL INFLUENCE AND ROBUST RANK AGGREGATION

Since we do not have access to anomaly labels, we introduce an intuitive unsupervised way to identify "good" or "reliable" rankings based on the notion of empirical influence. Empirical influence functions have previously been used to detect influential cases in regression (Cook & Weisberg, 1980). Given a collection of M rankings S = {σ_1, ..., σ_M}, let Borda(S) denote the aggregate ranking produced by the Borda method. Then, for some σ_i ∈ S, the empirical influence EI(σ_i, S) measures the impact of σ_i on the Borda solution:

EI(σ_i, S) = f(S) − f(S \ {σ_i}), where f(A) = (1/|A|) Σ_{σ_j ∈ A} d_τ(Borda(A), σ_j)   (2)

Under the assumption that a majority of rankings are good, EI is likely to identify bad rankings as the ones with high positive EI. This is intuitive, since removing a bad ranking results in a larger decrease of the objective, i.e., a smaller f(S \ {σ_i}). In Appendix A.9, we show that under idealized settings, empirical influence can identify bad rankings. We cluster all rankings based on their EI using two-component single-linkage agglomerative clustering (Murtagh & Contreras, 2012), discard all rankings from the "bad" cluster, and apply Borda aggregation to the remaining ones. To further improve the ranking performance of the aggregated rank, especially at the top, we only consider models at the top-k ranks, setting the positions of the remaining models to N (Fagin et al., 2003). See Appendix A.7.1 for details. We collectively refer to these variations as Robust Borda rank aggregation.
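Eq. 2 can be sketched as follows. Borda aggregation and Kendall's τ distance are re-implemented naively for brevity, and the clustering and top-k steps are omitted; this is an illustration of the leave-one-out computation only.

```python
import numpy as np
from itertools import combinations

def d_tau(a, b):
    """Kendall tau distance: number of discordant pairs (Eq. 1)."""
    return sum((a[i] - a[j]) * (b[i] - b[j]) < 0
               for i, j in combinations(range(len(a)), 2))

def borda(rankings):
    """Aggregate 1-based rank vectors by total Borda points."""
    R = np.asarray(rankings)
    points = (R.shape[1] - R).sum(axis=0)
    agg = np.empty(R.shape[1], dtype=int)
    agg[np.argsort(-points)] = np.arange(1, R.shape[1] + 1)
    return agg

def empirical_influence(rankings):
    """EI(sigma_i, S) = f(S) - f(S \\ {sigma_i}), where f is the mean
    Kendall tau distance from the Borda aggregate to the rankings in
    the set. A high positive EI flags an unreliable ranking."""
    def f(S):
        agg = borda(S)
        return float(np.mean([d_tau(agg, s) for s in S]))
    full = f(rankings)
    return [full - f([s for j, s in enumerate(rankings) if j != i])
            for i in range(len(rankings))]
```

With two agreeing rankings and one reversed ranking, only the reversed one receives a positive EI, matching the intuition that removing it shrinks the objective.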

5.1. DATASETS

We carry out experiments on two popular and widely used real-world collections with diverse time-series and anomalies.

UCR Anomaly Archive (UCR) (Wu & Keogh, 2021). Wu and Keogh recently found flaws in many commonly used benchmarks for anomaly detection tasks. They introduced the UCR Time-Series Anomaly Archive, a collection of 250 diverse univariate time-series of varying lengths, spanning domains such as medicine, sports, entomology, and space science, with natural as well as realistic synthetic anomalies, overcoming the pitfalls of previous benchmarks. Given the heterogeneity of the included time-series, we split UCR into 9 subsets by domain: (1) Acceleration, (2) Air Temperature, (3) Arterial Blood Pressure (ABP), (4) Electrical Penetration Graph (EPG), (5) Electrocardiogram (ECG), (6) Gait, (7) NASA, (8) Power Demand, and (9) Respiration (RESP). We refer to each of these subsets as a separate dataset.

Server Machine Dataset (SMD) (Su et al., 2019). SMD comprises multivariate time-series with 38 features from 28 server machines, collected from a large internet company over a period of 5 weeks. Each entity includes common metrics for a server machine, such as CPU load, memory, and network usage. Both the train and test sets contain around 50,000 timestamps, with 5% anomalous cases. While there has been some criticism of this dataset (Wu & Keogh, 2021), we use SMD due to its multivariate nature and the variety it brings to our evaluation along with UCR. We emphasize that on visually inspecting SMD, we found most of the anomaly labels to be accurate, much like Challu et al. (2022). We refer to SMD as a single dataset, and thus in our evaluation consider 10 datasets (9 from UCR + SMD).

5.2. MODEL SET

There are over a hundred unique time-series anomaly detection algorithms, developed using a wide variety of approaches. For our experiments, we implemented a representative subset of 5 popular methods (Table 1), spanning two learning types (unsupervised and semi-supervised) and three method families (forecasting, reconstruction, and distance-based) (Schmidl et al., 2022). We defer brief descriptions of the models to Appendix A.6.

5.3. EXPERIMENTAL SETUP

Evaluating model selection performance. We evaluate all model selection strategies based on the adjusted best F1 (Section 2) of the top-1 selected model. While there exist other popular measures to evaluate ranking algorithms, such as Normalized Discounted Cumulative Gain (Wang et al., 2013) and Mean Reciprocal Rank, our evaluation metric is grounded in practice, since in our setting only one model will be selected and used. Our approach can be used to select the best model for an individual time-series. However, since anomalies are rare, a single time-series may contain only a few anomalies (e.g., one point anomaly in the case of some UCR time-series). This makes the evaluation of a selection strategy very noisy. To combat this issue, we perform model selection at the dataset level, i.e., we select a single method for a given dataset.

Baselines. As a baseline, we consider an approach in which we label a part of a dataset and select the best method based on these labels. We call this baseline "supervised selection", or SS in short. This represents current practice, where a subset of data is labeled and the best-performing model is used on the rest of the dataset. Specifically, we randomly partition each dataset into selection (20%) and evaluation (80%) sets. Each time-series in the dataset is assigned to one of these sets. The baseline method selects a model based on anomaly labels in the selection set. Our surrogate metrics do not use the selection set. Instead, the surrogate metrics are computed over the evaluation set without using the labels. For reference, we also report the expected performance when selecting a model from the model set at random ("random") and the performance of the actual best model ("oracle"). Adjusted best F1 scores of the selected model (whether chosen by the baseline, a surrogate metric, or rank aggregation) are then reported. The entire experiment is repeated 5 times with random selection/evaluation splits.

Model selection strategies.
We compare 17 individual surrogate metrics, including 5 prediction error metrics, 9 synthetic anomaly injection metrics, and 3 metrics of centrality (average distance to 1, 3, and 5 nearest neighbors) (Section 3), as well as 4 proposed robust variations of Borda rank aggregation (see Appendix A.7.2). In a preliminary round of evaluations, we found 5 of the anomaly injection metrics (scale, noise, cutoff, contextual, and speedup) to generally outperform the other surrogate metrics across all datasets (Appendix A.3). We apply rank aggregation only to these 5 metrics. While this subset of surrogate metrics was identified using labels, it does not appear to depend on the dataset. Thus, given a new dataset without labels, we still propose to use rank aggregation over the same 5 surrogate metrics.

Pairwise statistical tests. We carry out one-sided paired Wilcoxon signed-rank tests at a significance level of α = 0.05 to identify significant performance differences between all pairs of model selection strategies and baselines on each dataset.

Implementation. Our implementation uses Python 3.9.13. All our experiments were carried out on an AWS g3.4xlarge EC2 instance with 16 Intel(R) Xeon(R) CPUs, 122 GiB RAM, and an 8 GiB GPU.
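The pairwise tests can be run directly with SciPy's `wilcoxon`. A sketch, where the per-time-series F1 arrays below are purely illustrative, not our actual results:

```python
from scipy.stats import wilcoxon

def significantly_better(f1_a, f1_b, alpha=0.05):
    """One-sided paired Wilcoxon signed-rank test: does selection
    strategy A achieve significantly higher adjusted best F1 than
    strategy B on the same set of time-series?"""
    stat, p = wilcoxon(f1_a, f1_b, alternative="greater")
    return bool(p < alpha)

# Illustrative per-time-series adjusted best F1 of two strategies:
f1_ours = [0.90, 0.85, 0.88, 0.92, 0.87, 0.91, 0.86, 0.89]
f1_rand = [0.85, 0.75, 0.85, 0.84, 0.80, 0.89, 0.80, 0.85]
```

Running the test over every pair of strategies on each dataset yields the significance comparisons summarized in Table 2.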

5.4. RESULTS AND DISCUSSION

A subset of our results is shown in Figure 3 and Table 2. Here we provide results for model selection using some of the individual surrogate metrics, as well as three variations of rank aggregation: Borda, Robust Borda, and a special case of Robust Borda where we select only one metric, the one that minimizes empirical influence, the "minimum influence metric" or MIM (Section 4). The complete set of results with all ranking metrics can be found in Appendix A.3.

Robust variations of Borda outperform Borda. Aggregating via Robust Borda and MIM is significantly better than Borda aggregation on 4 and 3 datasets, respectively. While MIM is never significantly worse than Borda, Robust Borda is inferior to Borda on SMD.

Robust variations of Borda minimize losses to the oracle. From Table 2 it is evident that MIM and Robust Borda have fewer significant losses to the oracle than the 5 surrogate metrics used as their input, and than SS. Thus, in general, robust aggregation is better than any randomly chosen individual surrogate metric. In terms of significant improvements over supervised and random selection, we found MIM to be a better alternative than Robust Borda.

Synthetic anomaly injection metrics outperform other classes of metrics. We found synthetic anomaly injection to perform better than prediction error and centrality metrics (Appendix A.3). Among the three surrogate classes, centrality metrics performed the worst, most likely due to highly correlated bad rankings.

Prior knowledge of anomalies might help identify good anomaly injection metrics. For instance, synthetic metrics such as speedup and cutoff perform well on respiration data, since these datasets are likely to have such anomalies: the slowing down of the respiration rate is a common anomaly and indicative of sleep apnea.
Similarly, synthetic noise performs well on the air temperature dataset, likely because these time-series contain noise anomalies simulating the sudden failure of temperature sensors. We also hypothesize that the strong performance of synthetic noise might be due to the intrinsic noise and distortion added to 100 of the 250 UCR time-series (Wu & Keogh, 2021), allowing the metric to choose models which are better suited to handle the noise. In practice, domain experts are often aware of the types of anomalies that might exist in the data; using our anomaly injection methods, this prior knowledge can be effectively leveraged for model selection.

6. RELATED WORK

Time-series Anomaly Detection. The field has a long and rich history, with well over a hundred proposed algorithms differing in scope (e.g., multivariate vs. univariate, unsupervised vs. supervised) and detection approach (e.g., forecasting, reconstruction, tree-based, etc.). However, recent surveys have found that there is no single universally best method, and the choice of the best method depends on the dataset (Schmidl et al., 2022; Blázquez-García et al., 2021).

Meta Learning and Model Selection. These approaches aim to identify the best model for a given dataset based on its characteristics, such as the number of classes, attributes, instances, etc. Recently, Zhao et al. (2021) explored the use of meta-learning to identify the best outlier detection algorithm on tabular datasets. For time-series datasets, both Ying et al. (2020) and Zhang et al. (2021) also approached model selection as a meta-learning problem. All these studies rely on the historical performance of models on a set of "meta-train" datasets to essentially learn a mapping from the characteristics of a dataset ("meta-features") to model performance. These methods fail if anomaly labels are unavailable for the meta-train datasets, or if the meta-train datasets are not representative of the test datasets. Both these assumptions are frequently violated in practice, limiting the utility of such methods. In our approach, we do not assume access to similar datasets with anomaly labels.

Weak Supervision and Rank Aggregation. Our work draws intuition from a large body of work on learning from weak or noisy labels, since the predictions of each surrogate metric have inherent noise and biases. We refer the reader to recent surveys on programmatic weak supervision, e.g., Zhang et al. (2022a). Since we are interested in ranking anomaly detection methods rather than inferring their exact performance numbers, rank aggregation is a much more relevant paradigm for our problem setting.
An extended version of Related Work is provided in Appendix A.1.

7. CONCLUSION

We consider the problem of unsupervised model selection for anomaly detection in time-series. Our model selection approach is based on surrogate metrics which correlate with model performance to varying degrees but do not require anomaly labels. We identify three classes of surrogate metrics, namely, prediction error, synthetic anomaly injection, and model centrality. We devise effective procedures to inject different kinds of synthetic anomalies and extend the idea of model centrality to time-series. Next, we propose to combine rankings from different metrics using a robust rank aggregation approach and provide theoretical justification for rank aggregation. Our evaluation using over 5,000 trained models and 17 surrogate metrics shows that our proposed approach performs on par with selection based on partial data labeling. Future work might focus on other definitions of model centrality and include more types of synthetic anomalies.

REPRODUCIBILITY STATEMENT

We provide an open-source implementation of our model selection method and the anomaly detection models at https://github.com/mononitogoswami/tsad-model-selection. All datasets used in this work are available in the public domain. Directions to download the Server Machine Dataset are available at https://github.com/NetManAIOps/OmniAnomaly, whereas the UCR Anomaly Archive can be downloaded from https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.

A APPENDIX

A.1 RELATED WORK

Time-series Anomaly Detection. This field has a long and rich history with well over a hundred proposed algorithms (Schmidl et al., 2022; Blázquez-García et al., 2021). The methods vary in their scope (e.g., multivariate vs. univariate, unsupervised vs. supervised) and in the anomaly detection approach (e.g., forecasting, reconstruction, tree-based, etc.). At a high level, many of these methods learn a model of the "normal" time-series regime and identify points unlikely under that model as anomalies. For example, the progression of a time-series can be described using a forecasting model, and points deviating from the forecast can be flagged as outliers (Malhotra et al., 2015). As another example, for tree-based methods, the data model is an ensemble of decision trees that recursively partition the points (sub-sequences of the time-series) until each sample is isolated. Abnormal points are easier to separate from the rest of the data, and therefore tend to have shorter distances from the root (Guha et al., 2016). For a comprehensive overview of the field, we refer the reader to one of the recent surveys on anomaly detection methods and their evaluation (Schmidl et al., 2022; Blázquez-García et al., 2021; Wu & Keogh, 2021). Importantly, there is no universally best method, and the choice of the best method depends on the data at hand (Schmidl et al., 2022).

Meta Learning and Model Selection. These approaches aim to identify the best model for a given dataset. The use of meta-learning for model selection is a long-standing idea. For instance, Kalousis and Hilario used boosted decision trees to predict (propose) classifiers for a new dataset given data characteristics ("meta-features") such as the number of classes, attributes, and instances (Alexandros & Melanie, 2001). More recently, Zhao et al. (2021) explored the use of meta-learning for outlier detection algorithms.
Their algorithm relies on a collection of historical outlier detection datasets with ground-truth (anomaly) labels and the historical performance of models on these meta-train datasets. The authors leverage meta-features of datasets to build a model that "recommends" an anomaly detection method for a dataset. Concurrently, Kotlar et al. (2021) developed a meta-learning algorithm with novel meta-features to select anomaly detection algorithms using labeled data in the training phase. Ying et al. (2020) focused on model selection at the level of time-series. They characterize each time-series based on a fixed set of features, and use anomaly-labeled time-series from a knowledge base to train a classifier for model selection and a regressor for hyper-parameter estimation. A similar approach is taken by Zhang et al. (2021), who extracted time-series features, identified the best-performing models via exhaustive hyper-parameter tuning, and trained a classifier (or regressor) using the extracted time-series features and information about model performance as labels. In contrast to the methods mentioned in this sub-section, we are interested in unsupervised model selection: given an unlabeled dataset, we do not assume access to a similar dataset with anomaly labels.

Weak Supervision and Rank Aggregation. We consider a number of unsupervised metrics, and these can be viewed as weak labels or noisy human labelers. The question then is which metric to use, or how to aggregate the metrics. There has been a substantial body of work on reconciling weak labels, mostly in the context of classification and to a lesser extent for regression. Many such methods build a joint generative model of unlabeled data and weak labels to infer latent true labels. We refer the interested reader to a recent survey by Zhang et al. (2022a).
Since we are interested in ranking anomaly detection methods rather than inferring their exact performance numbers, rank aggregation is a more relevant paradigm for our problem. Rank aggregation is concerned with producing a consensus ranking given a number of potentially inconsistent rankings (permutations). One of the dominant approaches in the field involves defining a (pseudo-)distance in the space of permutations and then finding a barycentric ranking, i.e., the one that minimizes the distance to the other rankings (Korba et al., 2017). The distance distribution can then be modeled, e.g., using the Mallows model, and rank aggregation can be made more reliable by assuming a mixture of Mallows distributions with different dispersion parameters (Collas & Irurozki, 2021). In our work, we draw on ideas from this research.

Automatic Machine Learning (AutoML). The success of machine learning across a gamut of applications has led to an ever-increasing demand for predictive systems that can be used by non-experts. However, current AutoML systems are either limited to classification with labeled data (e.g., see the AutoML and Combined Algorithm Selection and Hyperparameter Optimization (CASH) problems in Feurer et al. (2015)), or to self-supervised regression or forecasting problems (Feurer et al., 2015; Gijsbers et al., 2022; 2019).

Using Forecasting Error to Select Good Forecasting Models. Some prior work on time-series anomaly detection uses the idea of selecting models with low forecasting error to pick the best model for forecasting-based anomaly detection (Saganowski & Andrysiak, 2020; Laptev et al., 2015; Kuang et al., 2022).

Semi-supervised anomaly detection model selection. Recently, Zhang et al. (2022b) leveraged reinforcement learning to select the best base anomaly detection model given a dataset. They too use the idea of model centrality (referred to as Prediction-Consensus Confidence).
However, their method is semi-supervised since it relies on actual anomaly labels to evaluate the reward function. Moreover, their base models do not include any recent deep learning-based anomaly detection models. Finally, they evaluate their method on only one dataset (SWaT).

Limitations of Prior Work in the Context of Unsupervised Time-series Anomaly Detection Model Selection. We briefly summarize the limitations of prior work considering the model selection problem in different settings.

• Limited, predefined model set M: Many studies experiment with a small number of models (|M| ≤ 5). Most of these models are not state-of-the-art for time-series anomaly detection (e.g., Saganowski & Andrysiak (2020) use Holt-Winters and ARFIMA), are unrepresentative (only forecasting-based, non-deep-learning models), or are specialised to certain domains (e.g., network anomaly detection (Saganowski & Andrysiak, 2020; Kuang et al., 2022)). In contrast, our experiments are conducted with 19 models of 5 different types, spanning distance, forecasting and reconstruction-based families, which have been shown to perform well for general time-series anomaly detection problems (Schmidl et al., 2022; Paparrizos et al., 2022b). Moreover, our model selection methodology is agnostic to the model set: neither the model centrality and synthetic anomaly injection metrics nor robust rank aggregation rely on the nature or hypothesis space of the models.

• Limited, specialised datasets or old benchmarks with known issues: Most studies (Saganowski & Andrysiak, 2020; Kuang et al., 2022; Laptev et al., 2015) are evaluated on a small number of real-world datasets (N ≤ 2) from specific domains (e.g., network anomaly detection (Saganowski & Andrysiak, 2020; Kuang et al., 2022)). Most papers are either not evaluated on common anomaly detection benchmarks (Saganowski & Andrysiak, 2020; Kuang et al., 2022), or are evaluated on old benchmarking datasets (e.g., the Numenta Anomaly Benchmark) with known issues (Wu & Keogh, 2021). In contrast, our methods are evaluated on 275 diverse time-series from different domains such as electrocardiography, air temperature, etc.

• Supervised model selection: Some prior work, e.g., Kuang et al. (2022), used supervised model selection methods such as the Bayesian Information Criterion (BIC). In our setting, methods such as BIC cannot be used because we do not have anomaly labels, and measuring the model complexity of different model types (deep neural networks, non-parametric models, etc.) is non-trivial.

• No publicly available code: Finally, most papers do not have publicly available implementations.

A.2 LIMITATIONS AND FUTURE WORK

Computational Complexity. Our methods are computationally expensive since they rely on predictions from all models in the model set. However, our main baseline is supervised selection, which requires manually labeled data. Our premise is that manual labeling is much more expensive than compute time. For example, Amazon Web Services compute time costs on the order of cents per hour, whereas human time costs approximately $10/hour. In fact, recall that some of our benchmarks involve medical datasets, where gold-standard expert annotations can cost anywhere between $50-200 per hour (Abend et al., 2015).

Assumption: A majority of rankings can identify accurate models better than random. Empirical influence and robust Borda methods provably identify good models only if a majority of the rankings can identify good models better than random (see Appendices A.8, A.9). An interesting direction of future work is to relax this assumption, or to make an orthogonal one. For example, there has been work on provably optimal model selection for bandit algorithms which assumes a specific structure of the problem (e.g., linear contextual bandits) and that at least one base model satisfies a regret guarantee with high probability (Pacchiano et al., 2020).

Assumption: Total ordering rather than partial orderings. We assume that all metrics impart a total ordering of the models. However, this may not always be true: in our experience, some models may have the same performance as measured by a metric. To this end, future work may focus on aggregation techniques for partial rankings (Ailon, 2010).

Online model selection. We rely on predictions of models on the test data; however, in practice test data may be unavailable during model selection.

A.5 GENERATING SYNTHETIC ANOMALIES

Natural anomalies do not occur randomly, but are governed by the periodic nature of the time-series.
Therefore, before injecting anomalies we determine the periodicity of the time-series from its auto-correlation, and only inject anomalies (barring spikes, which are point outliers) at the beginning of a cycle. We inject only a single anomaly into each time-series in a repetition, and its length is chosen uniformly at random from [1, max_length], the latter being a user-defined parameter (Fig. 7).

Spike Anomaly. To introduce a spike, we first draw m ∼ Bernoulli(p), and if m = 1 we inject a spike whose magnitude is governed by a normal distribution, s ∼ N(0, σ²).

Flip Anomaly. We reverse the time-series: ts_a[i, start:end] = ts[i, start:end][::-1]

Speedup Anomaly. We increase or decrease the frequency of a time-series using interpolation, where the factor of change in frequency (1/2 or 2) is a user-defined parameter.

Noise Anomaly. We inject noise drawn from a normal distribution N(0, σ²), where σ is a user-defined parameter.
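As a concrete illustration, the injection procedures above might be sketched as follows. This is a minimal sketch assuming 1-D NumPy arrays; the function names and default parameters are ours for illustration, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_spikes(ts, start, end, p=0.05, sigma=2.0):
    """Spike anomaly: at each step in [start, end), draw m ~ Bernoulli(p)
    and, where m = 1, add a spike with magnitude s ~ N(0, sigma^2)."""
    ts_a = ts.copy()
    mask = rng.random(end - start) < p
    ts_a[start:end][mask] += rng.normal(0.0, sigma, size=mask.sum())
    return ts_a

def inject_flip(ts, start, end):
    """Flip anomaly: reverse the sub-sequence in time."""
    ts_a = ts.copy()
    ts_a[start:end] = ts[start:end][::-1]
    return ts_a

def inject_noise(ts, start, end, sigma=0.5):
    """Noise anomaly: add Gaussian noise N(0, sigma^2) to the sub-sequence."""
    ts_a = ts.copy()
    ts_a[start:end] += rng.normal(0.0, sigma, size=end - start)
    return ts_a
```

Each function leaves the original series untouched and returns an anomalous copy, so the same clean series can be reused across the multiple injection repetitions described above.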


Figure 7: Different kinds of synthetic anomalies. We inject 9 different types of synthetic anomalies randomly, one at a time, across multiple copies of the original time-series. The dashed line (--) is the original time-series before anomalies are injected, the thin line (-) is the injected anomaly, and the solid line is the time-series after anomaly injection.

Cutoff Anomaly. Here, we set the time-series to either N(0, σ²) or N(1, σ²), where both the choice of location parameter and σ are user-defined.

Average Anomaly. We run a moving average where the length of the window (len_window) is determined by the user: ts_a[i, start:end] = moving_average(ts[i, start:end], len_window)

Scale Anomaly. We scale the time-series by a user-defined factor (factor): ts_a[i, start:end] = factor * ts[i, start:end]

Wander Anomaly. We add a linear trend with the increase in baseline (baseline) defined by the user: ts_a[i, start:end] = np.linspace(0, baseline, end-start) + ts[i, start:end]

Contextual Anomaly. We transform the time-series subsequence with a linear function aY + b, where a ∼ N(1, σ_a²) and b ∼ N(0, σ_b²).
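The remaining transformation-style anomalies can be sketched in the same style (again a hedged illustration on 1-D NumPy arrays; names and defaults are ours, not the paper's code):

```python
import numpy as np

def inject_scale(ts, start, end, factor=3.0):
    """Scale anomaly: multiply the sub-sequence by a user-defined factor."""
    ts_a = ts.copy()
    ts_a[start:end] = factor * ts[start:end]
    return ts_a

def inject_wander(ts, start, end, baseline=1.0):
    """Wander anomaly: add a linear trend rising from 0 to `baseline`."""
    ts_a = ts.copy()
    ts_a[start:end] = np.linspace(0.0, baseline, end - start) + ts[start:end]
    return ts_a

def inject_average(ts, start, end, len_window=5):
    """Average anomaly: replace the sub-sequence with its moving average."""
    ts_a = ts.copy()
    kernel = np.ones(len_window) / len_window
    ts_a[start:end] = np.convolve(ts[start:end], kernel, mode="same")
    return ts_a
```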

A.6 DESCRIPTION OF MODELS

There are over a hundred unique time-series anomaly detection algorithms developed using a wide variety of approaches. For our experiments, we implemented a representative subset of 5 popular methods (Table 1), across two different learning types (unsupervised and semi-supervised) and three different method families (forecasting, reconstruction and distance-based) (Schmidl et al., 2022). In the context of time-series anomaly detection, "semi-supervised" means requiring a part of the time-series known to be anomaly-free. Each time-series in the UCR and SMD collections comprises a train part with no anomalies and a test part with one (UCR) or more (SMD) anomalies. We used the train part for training semi-supervised models. We emphasize that our surrogate metrics do not rely on the availability of the "clean" train part (and we did not use the train part for evaluation). Moreover, while semi-supervised models are supposed to be trained on the anomaly-free part, in practice they can still work (perhaps sub-optimally) after training on time-series that potentially contain anomalies.

The models we implemented can also be categorized as forecasting, reconstruction and distance methods. Forecasting methods learn to forecast the next few time steps based on the current context window, whereas reconstruction methods encode observed sub-sequences into a low-dimensional latent space and reconstruct them from it. In both cases, the predictions (forecasts or reconstructions) of the models are compared with the observed values. Finally, distance-based models use distance metrics (e.g., Euclidean distance) to compare sub-sequences, and expect anomalous sub-sequences to have large distances to the sub-sequences they are most similar to. We created a pool of 3 k-NN (Ramaswamy et al., 2000), 4 moving average, 4 DGHL (Challu et al., 2022), 4 LSTM-VAE (Park et al., 2018) and 4 RNN (Chang et al., 2017) models by varying important hyper-parameters, for a total of 19 models.
Finally, to efficiently train our models, we sub-sampled all time-series with T > 2560 by a factor of 10.
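For instance, the simplest member of the pool, a moving-average detector, can be sketched in a few lines (our own minimal illustration; the window size and threshold below are hypothetical):

```python
import numpy as np

def moving_average_scores(x, h):
    """Anomaly score at time t: squared deviation of x[t] from the mean
    of the preceding h observations (scores for t < h are left at 0)."""
    x = np.asarray(x, dtype=float)
    scores = np.zeros_like(x)
    for t in range(h, len(x)):
        scores[t] = (x[t] - x[t - h:t].mean()) ** 2
    return scores

# A flat series with a single spike: the spike receives the largest score
# and is the only point flagged for a threshold tau between 1 and 25.
x = np.concatenate([np.zeros(20), [5.0], np.zeros(9)])
scores = moving_average_scores(x, h=5)
assert scores.argmax() == 20
```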

Model                             Area           Learning Type    Method Family
Moving Average                    Statistics     Unsupervised     Forecasting
k-NN (Ramaswamy et al., 2000)     Classical ML   Semi-supervised  Distance
Dilated RNN (Chang et al., 2017)  Deep Learning  Semi-supervised  Forecasting
DGHL (Challu et al., 2022)        Deep Learning  Semi-supervised  Reconstruction
LSTM-VAE (Park et al., 2018)      Deep Learning  Semi-supervised  Reconstruction

Table 4: Models in the model set and their properties.

Below is a brief description of each model.

Moving Average. These models compare observations with the "moving" average of recent observations in a sliding window of predefined size, assuming that if the current observation differs significantly from the moving average, it must be an anomaly. Specifically, if x_t denotes the current observation, h the window size, and τ an anomaly threshold, then x_t is flagged as an anomaly if $\left\| x_t - \frac{1}{h} \sum_{i=t-h-1}^{t-1} x_i \right\|_2^2 > \tau$. Moving average is one of the simplest, most efficient, yet practical anomaly detection models, frequently used by industry practitioners. Moreover, moving average is unsupervised since it does not require any normal training data.

k-Nearest Neighbors (k-NN) (Ramaswamy et al., 2000). These distance-based anomaly detection models compare a window of observations to its k nearest-neighbor windows in the training data. If the current window is significantly far from its k closest windows in the training set, it must be an outlier. In our implementation, the distance between time-series windows is measured using the standard Euclidean distance. Specifically, let w_t = {x_{t-h}, …

LSTM-VAE (Park et al., 2018). The LSTM-VAE is a reconstruction-based model that uses a Variational Autoencoder with LSTM components to model the probability of observing the target time-series. The model also uses the latent representation to dynamically adapt the anomaly-flagging threshold based on the normal variability of the data.
The anomaly score corresponds to the negative log-likelihood of an observation x_t under the output distribution of the model, and x_t is flagged as an anomaly if the score is larger than a threshold τ: for reconstruction error, $\| \hat{x}_t - x_t \|_2^2 > \tau$; for the LSTM-VAE, $-\log p(x_t) > \tau$.

A.8 THEORETICAL JUSTIFICATION FOR BORDA RANK AGGREGATION

Suppose that we have several surrogate metrics for ranking our anomaly detection models, and suppose that each of these metrics is better than a random ranking. Why would it be beneficial to aggregate the rankings induced by these metrics? Theorem 1 provides an insight into the benefit of Borda rank aggregation. Suppose that we have N anomaly detection models to rank. Consider two arbitrary models i and j with (unknown) ground-truth rankings σ*(i) and σ*(j). Without loss of generality, assume σ*(i) < σ*(j), i.e., model i is better. Next, let us view each surrogate metric k = 1, …, M as a random variable Ω_k taking values in S_N (the set of possible permutations over N items), such that Ω_k(i) ∈ [N] denotes the rank of item i. We are interested in the properties of Borda rank aggregation applied to realizations of these random variables. Theorem 1 states that, under certain assumptions, the probability of the Borda ranking making a mistake (i.e., placing model i lower than j) is upper bounded in a way such that increasing M decreases this bound.

Theorem 1. Assume that the Ω_k's are pairwise independent, and that P(Ω_k(i) < Ω_k(j)) > 0.5 for any arbitrary models i and j, i.e., each surrogate metric is better than random. Then the probability that Borda aggregation makes a mistake in ranking models i and j is at most $2 \exp\left(-\frac{M\epsilon^2}{2}\right)$, for some fixed ϵ.

Proof. We begin by defining random variables X_k = 2 I[Ω_k(i) < Ω_k(j)] − 1; thus X_k = 1 if Ω_k(i) < Ω_k(j) and X_k = −1 otherwise. Denote the expectation of these random variables as E_k ≜ E[X_k]. We assume that E_k > ϵ for all k ∈ [M] and some small ϵ > 0, i.e., that each surrogate metric k is better than random. Next, define $S_M \triangleq \sum_{k=1}^{M} X_k$ and $E^M \triangleq \sum_{k=1}^{M} E_k$. Note that if S_M > 0 then Borda aggregation will correctly rank i higher than j. If the Ω_k are pairwise independent, then the X_k are pairwise independent, and we can apply Hoeffding's inequality for all t > 0:

$P(|S_M - E^M| \ge t) \le 2\exp\left(-\frac{2t^2}{4M}\right)$
$P(E^M - S_M \ge t) \le 2\exp\left(-\frac{t^2}{2M}\right)$
$P(S_M \le 0) \le 2\exp\left(-\frac{(E^M)^2}{2M}\right)$, setting $t = E^M$
$P(S_M \le 0) \le 2\exp\left(-\frac{M\epsilon^2}{2}\right)$, because $\epsilon < E_k$ for all $k$

Recall that S_M < 0 results in an incorrect ranking (ranking j higher than i), and S_M = 0 results in a tie (which we break uniformly at random). Therefore, the probability that Borda aggregation makes a mistake in ranking models i and j is upper bounded as per the above statement.

We do not assume that the Ω_k are identically distributed, but we do assume that they are pairwise independent. We emphasize that the notion of independence of the Ω_k as random variables is different from rank correlation between realizations of these variables (i.e., permutations). For example, two permutations drawn independently from a Mallows distribution can have a high Kendall's τ coefficient (rank correlation). Therefore, our assumption of independence, commonly adopted in theoretical analysis (Ratner et al., 2016), does not contradict our observation of high rank correlation between surrogate metrics.

Corollary 1. Under the same assumptions as Theorem 1, if Borda aggregation makes a mistake in ranking models i and j with probability at most δ > 0, then $\min_k P(\Omega_k(i) < \Omega_k(j)) > \sqrt{\frac{\log(2/\delta)}{2M}} + \frac{1}{2}$, where k ∈ [M] and i, j ∈ [N] are any two arbitrary models.
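The bound in Theorem 1 can be sanity-checked with a small simulation (ours, not from the paper): M independent metrics each prefer the better model with probability p, and the pairwise Borda decision is a majority vote over their ±1 votes.

```python
import math
import random

random.seed(0)

def borda_pair_error(M, p, trials=20000):
    """Estimate the probability that S_M = sum of X_k ranks the pair
    incorrectly, where X_k = +1 w.p. p and -1 w.p. 1 - p."""
    errors = 0
    for _ in range(trials):
        s = sum(1 if random.random() < p else -1 for _ in range(M))
        if s < 0 or (s == 0 and random.random() < 0.5):  # ties broken at random
            errors += 1
    return errors / trials

M, p = 101, 0.65
eps = 2 * p - 1                          # E[X_k] = 2p - 1; take eps at this value
bound = 2 * math.exp(-M * eps ** 2 / 2)  # Theorem 1: 2 exp(-M eps^2 / 2)
assert borda_pair_error(M, p) <= bound   # empirical error sits below the bound
```

With M = 101 and p = 0.65, the bound evaluates to roughly 0.02 while the empirical error rate is far smaller, consistent with the (loose) Hoeffding-style guarantee.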

A.12 DOES EMPIRICAL INFLUENCE IDENTIFY BAD METRICS?

To measure the correlation between empirical influence and the quality of a metric, we provide (1) experiments on real-world datasets, and (2) an analysis on synthetic data in Appendix A.4, where we see that empirical influence can almost perfectly identify bad metrics.

Experimental setup. We define the quality of a metric σ using two quantities: (Q1) the adjusted best F1 of the top-5 models ranked by σ, and (Q2) the difference in adjusted best F1 between the top-5 and bottom-5 models ranked by σ. This quality metric is inspired by the separation analysis in Paparrizos et al. (2022a).

Results. 6 out of 9 datasets had negative Kendall-τ correlation between the quality of a metric and its empirical influence when quality is measured using Q1; of these 6 datasets, 4 had statistically significant negative correlation. Recall that bad metrics have high positive empirical influence (Section 4.2). When quality is measured using Q2, again 6 out of 9 datasets had negative Kendall-τ correlation between the quality of a metric and its empirical influence; of these 6 datasets, 2 had statistically significant negative correlation.



We use the terms "ranking" and "permutation" as synonyms; the subtle difference is that in a ranking multiple models can receive the same rank. Where necessary, we break ties uniformly at random.

Sleep apnea is a common and potentially serious health condition in which breathing repeatedly stops and starts (https://www.nhlbi.nih.gov/health/sleep-apnea).



Figure 1: The Model Selection Workflow. We identify three classes of surrogate metrics of model quality (Sec. 3), and propose a novel robust rank aggregation framework to combine multiple rankings from metrics (Sec. 4).

Figure 2: Different kinds of synthetic anomalies. We inject 9 different types of synthetic anomalies randomly, one at a time, across multiple copies of the original time-series. The dashed line (--) is the original time-series before anomalies are injected, the thin line (-) is the injected anomaly, and the solid line is the time-series after anomaly injection.

Let [N] denote the universe of elements and S_N be the set of permutations on [N].

Figure 3: Performance of the top-1 selected model. Box plots of the adjusted best F1 of the model selected by each metric or baseline, across 5 unique combinations of the selection and evaluation sets. Orange (|) and blue (|) vertical lines represent the median and average performance of each metric or baseline, and the minimum influence metric (MIM), respectively. Each box plot is organized into 3 sections: performance of individual metrics, Borda and its robust variations, and three baselines.

Figure 5: Impact of Noisy Permutations. With an increase in noise, the distance of the median permutation (σ) from the central permutation (σ_0) increases, i.e., noise hurts rank aggregation.

Figure 6: Empirical Influence can Identify Outlying Permutations. The influence of a permutation is directly proportional to its distance from the central permutation (σ_0). Under the Mallows model, outlying (or low-probability) permutations have a higher distance from σ_0.


A.11 CORRELATION BETWEEN ADJUSTED BEST F1 AND ADJUSTED PR-AUC

On our datasets, the performance of models measured using PR-AUC and adjusted best F1 had a statistically significant, highly positive correlation (see Figure 8).

Figure 8: Correlation between adjusted best F1 and adjusted PR-AUC. r_S and τ denote the Spearman and Kendall correlations, respectively. p-values of two-sided tests are reported in parentheses.

Models in the model set and their properties. Detailed descriptions of model architectures and hyper-parameters can be found in Appendix A.6.

Results of pairwise-statistical tests. On a total of n = 10 datasets, minimum influence metric (MIM) and robust Borda have the fewest significant losses to oracle, and are considerably better than random model selection and Borda.

Let ||·||_2 represent the Euclidean norm and τ be an anomaly threshold. Then the current window is flagged as an anomaly if:

Dilated RNN (Chang et al., 2017). The Dilated Recurrent Neural Network (RNN) is a recently proposed improvement over vanilla RNNs, characterized by multi-resolution dilated recurrent skip connections. These skip connections alleviate problems arising from memorizing long-term dependencies without forgetting short-term dependencies, vanishing and exploding gradients, and the sequential nature of forward- and back-propagation in vanilla RNNs. Given a historical window of a time-series, we use a dilated RNN to forecast T_la time steps ahead. Let w_t = {x_t, …, x_{T_la}} denote the current window of observations, ŵ_t = {x̂_t, …, x̂_{T_la}} the forecasts from the dilated RNN model, and τ an anomaly threshold. Then x_t is flagged as an anomaly if:

DGHL (Challu et al., 2022). This model is a recent state-of-the-art generative model for time-series anomaly detection. It uses a top-down Convolutional Network (CNN) to map a novel hierarchical latent representation to multivariate time-series windows. The key difference from other generative models such as GANs and VAEs is that it is trained with the Alternating Back-Propagation algorithm. The architecture does not include an encoder; instead, latent vectors are directly sampled from the posterior distribution. Let x_t be the observation and x̂_t the reconstruction at timestamp t, and τ an anomaly threshold. Then x_t is flagged as an anomaly if $\| \hat{x}_t - x_t \|_2^2 > \tau$.

ACKNOWLEDGMENTS

We would like to thank Abishek Sankararaman, Anoop Deoras, Barış Kurt, Christos Faloutsos, Kelvin Kan, and Peihong Jiang for valuable discussions about the problem statement and methods. We thank the anonymous reviewers and area chairs for helpful suggestions. The first author would also like to thank AWS for a travel grant to attend the conference.

ETHICS STATEMENT

We propose an unsupervised method for model selection of time-series anomaly detection models. In this paper, we restrict ourselves to finding accurate anomaly detection models. The fairness of models chosen via unsupervised model selection is an open problem. While prevailing wisdom suggests a tension between the accuracy and fairness of models, recent work suggests otherwise. In fact, Wick et al. (2019) argue that under reasonable assumptions and certain definitions of fairness, accuracy and fairness are no longer in tension when selection and label bias are accounted for. We believe that an interesting direction of future work is extending our model selection problem and method to account for possibly competing objectives such as accuracy, fairness, inference time etc.

A.3 ADDITIONAL RESULTS

The complete set of results is shown in Fig. 4 and Tab. 3. We compare 17 individual surrogate metrics, including 5 prediction error metrics, 9 synthetic anomaly injection metrics, and 3 metrics of centrality (average distance to 1, 3 and 5 nearest neighbors) (Section 3); and 4 proposed robust variations of Borda rank aggregation, namely Partial Borda, Trimmed Borda, Minimum Influence Metric and Robust Borda (see Appendix A.7.2). We also compare Borda and its variations with exact Kemeny aggregation (Appendix A.7.3).


The Mallows model defines a distribution over permutations, $P(\sigma) \propto \exp(-\theta\, d(\sigma, \sigma_0))$, where d is the Kendall's τ distance, σ_0 the central permutation, and θ the dispersion parameter. With an increase in θ, there is stronger consensus, making Kemeny aggregation easier. We aim to empirically answer two research questions:

1. What is the impact of noisy permutations on the median rank?

2. Can empirical influence identify outlying permutations?

To this end, we draw k permutations from the Mallows distribution, σ_1, …, σ_k ∼ M(σ_0, θ), and N − k permutations uniformly at random from S_n. We compute the median rank σ using the Borda method. We experiment with n ∈ {50, 100}, θ ∈ {0.05, 0.1, 0.2} and N = 50, based on the experimental setup used by Ali & Meilȃ (2012).

A.7 BORDA RANK AGGREGATION

Borda is a positional ranking system and an unbiased estimator of the Kemeny ranking of a sample distributed according to the Mallows model (Fligner & Verducci, 1988). Consider N models and a collection of M surrogate metrics S = {σ_1, …, σ_M}. We use σ_k to refer both to the surrogate metric and to the ranking it induces. Under the Borda scheme, for any given metric σ_k, the i-th model receives N − σ_k(i) points. Thus, the i-th model accrues a total of $\sum_{k=1}^{M} (N - \sigma_k(i))$ points. Finally, all models are ranked in decreasing order of their total points, and ties are broken uniformly at random. Borda can be computed in quasi-linear (O(M + N log N)) time.
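The Borda scheme above amounts to a few lines of code (a minimal sketch; ties here are broken by model index rather than uniformly at random):

```python
def borda_aggregate(rankings):
    """rankings[k][i] = sigma_k(i), the rank (1..N) of model i under metric k.
    Model i receives N - sigma_k(i) points from each metric; models are
    returned in decreasing order of total points."""
    N = len(rankings[0])
    points = [0] * N
    for sigma in rankings:
        for i, rank in enumerate(sigma):
            points[i] += N - rank
    return sorted(range(N), key=lambda i: -points[i])

# Three metrics broadly agree that model 0 is the best of four models.
rankings = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
assert borda_aggregate(rankings) == [0, 1, 2, 3]
```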

A.7.1 IMPROVING RANKING PERFORMANCE AT THE TOP

In model selection, we only care about top-ranked models. Thus, to improve model selection performance at the top ranks, we only consider models at the top-k ranks, setting the positions of the remaining models to N (Fagin et al., 2003). Here, k is a user-specified hyper-parameter. Specifically, under the top-k Borda scheme, the i-th model accrues a total of $\sum_{k'=1}^{M} (N - \tilde{\sigma}_{k'}(i))$ points, where $\tilde{\sigma}_{k'}(i) = \sigma_{k'}(i)$ if $\sigma_{k'}(i) \le k$ and N otherwise. Intuitively, this increases the probability that models which consistently place in the top-k ranks across surrogate metrics receive top ranks upon aggregation.
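A hedged sketch of this top-k variant, which truncates ranks beyond k to N before scoring:

```python
def topk_borda_aggregate(rankings, k):
    """Top-k Borda: a model ranked worse than k by a metric is treated as if
    it were ranked N (earning 0 points from that metric), so only consistent
    top-k placements accumulate points."""
    N = len(rankings[0])
    points = [0] * N
    for sigma in rankings:
        for i, rank in enumerate(sigma):
            effective = rank if rank <= k else N
            points[i] += N - effective
    return sorted(range(N), key=lambda i: -points[i])

# Model 0 is in the top-2 for two of three metrics and last in the third;
# with k = 2, its single bad rank costs no more than any other rank beyond k.
rankings = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 1, 2]]
assert topk_borda_aggregate(rankings, k=2)[0] == 0
```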

A.7.2 ROBUST VARIATIONS OF BORDA

We consider multiple variations of Borda, namely Partial Borda, Trimmed Borda, Minimum Influence Metric (MIM) and Robust Borda. We collectively refer to these as robust variations of Borda rank aggregation.

Partial Borda improves model selection performance, especially at the top ranks, by only considering models at the top-k ranks, as described in Appendix A.7.1.

Trimmed Borda is another robust variation of Borda, which only aggregates reliable ("good") permutations. Trimmed Borda identifies outlying or "bad" permutations as ones having high positive empirical influence.

Minimum Influence Metric (MIM) for a dataset is the surrogate metric with minimum influence. Recall that while high positive influence is indicative of "bad" permutations, low values of influence are indicative of "good" permutations.

Robust Borda performs rank aggregation based only on the top-k ranks while truncating permutations with high positive influence. Hence, Partial Borda, Trimmed Borda and Minimum Influence Metric can be viewed as special cases of Robust Borda.

A.7.3 EXACT KEMENY RANK AGGREGATION

The Kemeny rank aggregation problem can be viewed as a minimization problem on a weighted directed graph (Conitzer et al., 2006). Let G = (V, E) denote a weighted directed graph with the models as vertices. For every pair of models, we define an edge (i → j) ∈ E and set its weight as $w_{ij} = \sum_{k=1}^{M} \mathbb{I}[i \succeq_k j]$. Here the indicator $\mathbb{I}[i \succeq_k j] = 1$ denotes that metric k prefers model i over model j, and hence w_ij denotes the number of metrics which prefer model i over j. We can then formulate the rank aggregation problem as a binary linear program; intuitively, we incur a cost for every metric's pairwise comparison that we do not honor.

Proof (of Corollary 1). The proof follows from upper bounding the probability that Borda makes a mistake in ranking models i and j by δ: requiring $2\exp(-M\epsilon^2/2) \le \delta$ gives $\epsilon \ge \sqrt{2\log(2/\delta)/M}$. Next, consider the relationship between ϵ and $\min_k P(\Omega_k(i) < \Omega_k(j))$: since $E_k = 2P(\Omega_k(i) < \Omega_k(j)) - 1 > \epsilon$, we have $\min_k P(\Omega_k(i) < \Omega_k(j)) > (\epsilon + 1)/2$. From these two relations, we can conclude that $\min_k P(\Omega_k(i) < \Omega_k(j)) > \sqrt{\log(2/\delta)/(2M)} + 1/2$.

To reduce the chance of error, it follows from Corollary 1 that we can either increase the number of surrogate metrics, or collect a smaller number of highly accurate metrics. As an example, set δ = 0.05 and consider M = 20 metrics. Then, for Borda to err at most 5% of the time, each surrogate metric must prefer A_i over A_j with probability at least 0.7. This also highlights the importance of empirical influence in identifying and removing bad permutations from the Borda aggregation.
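For small N, the graph formulation above can be solved exactly by brute-force enumeration instead of a linear program (our illustrative sketch, not the paper's solver):

```python
from itertools import permutations

def kemeny_brute_force(rankings):
    """Exact Kemeny aggregation by enumeration (feasible for small N only).
    w[i][j] counts metrics preferring model i over model j; an ordering pays
    w[j][i] for each pair it places with i before j, i.e., a cost for every
    pairwise preference it does not honor."""
    N = len(rankings[0])
    w = [[0] * N for _ in range(N)]
    for sigma in rankings:
        for i in range(N):
            for j in range(N):
                if i != j and sigma[i] < sigma[j]:  # metric prefers i over j
                    w[i][j] += 1

    def cost(order):
        return sum(w[order[b]][order[a]]
                   for a in range(N) for b in range(a + 1, N))

    return min(permutations(range(N)), key=cost)

# Two metrics rank the models 0 > 1 > 2 and one ranks them 1 > 2 > 0.
assert kemeny_brute_force([[1, 2, 3], [1, 2, 3], [3, 1, 2]]) == (0, 1, 2)
```

Enumeration costs O(N! · N²), which is why the binary linear program (or the Borda approximation) is used in practice.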

A.9 THEORETICAL JUSTIFICATION FOR EMPIRICAL INFLUENCE

In this section, we provide two arguments in support of Empirical Influence (EI). First, we note that in an idealized case, where all but one of the metrics are perfect, EI helps identify the imperfect metric.

Lemma 2. Consider a set of M surrogate metrics that consists of (M - 1) perfect metrics and one imperfect metric. The perfect metrics induce permutations equal to the ground-truth permutation σ*, while the imperfect metric induces a permutation σ_M ≠ σ*. Then EI(σ_M) ≥ EI(σ_i) for any 1 ≤ i ≤ M - 1.

Next, recall that Borda(A) is an unbiased estimator of the central permutation in the Kemeny-Young problem. Given a set of permutations A, the central permutation is defined as

c(A) = \arg\min_{\chi} \sum_{\sigma \in A} d_\tau(\chi, \sigma),

where the minimization is over all possible permutations χ of the N items, and d_τ denotes the Kendall-tau distance.

We now consider a modified notion of Empirical Influence that uses the central permutation instead of its Borda approximation:

EI'(\sigma) = d_\tau(\sigma, c(S \setminus \{\sigma\})),

where S denotes the set of permutations being aggregated. Using this modified definition, we show that EI' is capable of distinguishing between permutations close to and far from the central permutation. Recall that good permutations are those that are close to the central permutation.

Lemma 3. Consider a set of permutations S = {σ_1, ..., σ_{M-2}, σ_ok, σ_bad}, where σ_ok and σ_bad represent good and bad permutations, respectively. Let σ* be the unknown ground-truth permutation, and for some r > 0, assume that d_τ(σ*, σ_ok) < r and d_τ(σ*, σ_bad) > 3r. We also assume that, overall, our set S is of reasonable quality, in the sense that d_τ(c(S \ {σ}), σ*) < r for any σ ∈ S. Then, EI'(σ_bad) > EI'(σ_ok).

Proof. Let c_ok = c(S \ {σ_bad}) and c_bad = c(S \ {σ_ok}). By the triangle inequality,

EI'(\sigma_{bad}) = d_\tau(\sigma_{bad}, c_{ok}) \ge d_\tau(\sigma_{bad}, \sigma^*) - d_\tau(\sigma^*, c_{ok}) > 3r - r = 2r,

while

EI'(\sigma_{ok}) = d_\tau(\sigma_{ok}, c_{bad}) \le d_\tau(\sigma_{ok}, \sigma^*) + d_\tau(\sigma^*, c_{bad}) < r + r = 2r.

Therefore, EI'(σ_bad) > 2r > EI'(σ_ok).

A.10 CORRELATION BETWEEN ADJUSTED BEST F1 AND VUS-ROC

Motivation. In a recent study, Paparrizos et al. (2022a) found the volume under the surface of the ROC curve (VUS-ROC) to be more robust than the other evaluation metrics they compared.
Their study did not compare VUS-ROC against the adjusted best F1 used in this paper. Moreover, the VUS-ROC metric is defined for univariate time-series, whereas one of the datasets we use, the Server Machine Dataset (SMD), contains multivariate time-series. Hence, our main results are still reported with respect to adjusted best F1. However, we carried out an experiment to evaluate how well VUS-ROC correlates with adjusted best F1.

Experimental setup. For every time-series, we computed the Spearman-r and Kendall-τ correlation between the performance of models as measured by VUS-ROC and by adjusted best F1. We carried out this experiment on 250 time-series from the UCR anomaly archive (Wu & Keogh, 2021). We measured the dispersion of an evaluation measure (VUS-ROC or adjusted best F1) on a time-series as the interquartile range of the performance values of all models on that time-series. The sliding-window parameter that VUS-ROC relies on was set automatically using the auto-correlation of the time-series.
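The per-time-series agreement statistics can be computed with SciPy; the sketch below uses made-up stand-in scores (one value per model) in place of real VUS-ROC and adjusted best F1 measurements, and also shows the interquartile-range dispersion measure:

```python
import numpy as np
from scipy import stats

# Hypothetical per-model scores on one time-series (5 candidate models).
vus_roc = np.array([0.91, 0.85, 0.60, 0.72, 0.55])
best_f1 = np.array([0.88, 0.80, 0.55, 0.70, 0.58])

# Rank correlations between the two evaluation measures.
rho, rho_p = stats.spearmanr(vus_roc, best_f1)
tau, tau_p = stats.kendalltau(vus_roc, best_f1)
print(f"spearman-r={rho:.2f} (p={rho_p:.3f}), kendall-tau={tau:.2f} (p={tau_p:.3f})")

# Dispersion of a measure = interquartile range across models.
iqr = np.percentile(vus_roc, 75) - np.percentile(vus_roc, 25)
print(f"dispersion (IQR) = {iqr:.2f}")
```

In the experiment, this computation is repeated for each of the 250 time-series, and a series counts as significantly correlated when both p-values fall below the chosen significance level.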

Results. Out of 250 time-series, only 23 did not have a statistically significant positive Kendall-τ and Spearman-r correlation. All of these time-series had low dispersion of the evaluation measures, i.e., the difference in performance between most models trained on the time-series, measured in terms of VUS-ROC or adjusted best F1, was less than 10%.

A.13 RESULTS WHEN AGGREGATING OVER ALL METRICS

When we perform robust rank aggregation over all the metrics, instead of over only the 5 anomaly injection metrics (scale, noise, cutoff, contextual, and speedup), we observe a degradation in the performance of both the minimum influence metric (MIM) and robust Borda. MIM has more significant wins and fewer significant losses when only the 5 best-performing metrics are aggregated. This degradation can be explained by the fact that some metrics, for example the model centrality metrics and prediction likelihood, have ranking performance worse than random model selection.

In this short experiment, we provide empirical support for Theorem 1.

Experimental setup. Theorem 1 considers error bounds when ranking two models. Thus, for each time-series, we identify the best- and worst-performing models and empirically test whether aggregating more rankings helps reduce this error. We varied the number of surrogate metrics m from 2 to 17, and performed Borda rank aggregation to find the best model. The performance of the aggregated Borda ranking is measured as the adjusted best F1 of the top-ranked model. For each value of m, we report the performance averaged over all \binom{17}{m} unique combinations of the metrics.

Results. First, we present the performance of the best model as identified by Borda aggregation of a combination of metrics, averaged across all the time-series. Overall, we see a clear trend: the performance of the selected best model increases with the number of surrogate metrics (rankings). In only 27 out of 181 time-series did we not observe a monotonic improvement as the number of metrics increased. However, we note that in all such cases, the performance of all anomaly detection models was low and similar across models. Under such circumstances, not only do surrogate metrics tend to be noisy, violating the assumptions of the theorem, but model selection also does not matter much.
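The subset-averaging protocol above can be sketched as follows. This is an illustrative mock-up: the rankings and per-model F1 scores are randomly generated stand-ins, and we use 6 metrics and 5 models instead of the paper's 17 metrics, but the enumeration over all size-m metric combinations is the same idea:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n_metrics, n_models = 6, 5

# Stand-in data: one ranking per surrogate metric, plus a hypothetical
# adjusted best F1 for each candidate model.
rankings = np.array([rng.permutation(n_models) for _ in range(n_metrics)])
model_f1 = rng.uniform(0.2, 0.9, size=n_models)

def borda_top_model(subset):
    """Borda-aggregate the rankings in `subset`, return the top model."""
    scores = (n_models - 1 - rankings[list(subset)]).mean(axis=0)
    return int(np.argmax(scores))

# For each subset size m, average the F1 of the selected model over
# every one of the C(n_metrics, m) metric combinations.
for m in range(2, n_metrics + 1):
    f1s = [model_f1[borda_top_model(c)]
           for c in combinations(range(n_metrics), m)]
    print(m, round(float(np.mean(f1s)), 3))
```

With real surrogate metrics, the curve of averaged F1 versus m exhibits the increasing trend reported above.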

