QUANTILE-LSTM: A ROBUST LSTM FOR ANOMALY DETECTION IN TIME SERIES DATA

Abstract

Anomalies refer to departures of systems and devices from their normal behaviour under standard operating conditions. In an industrial device, an anomaly often indicates an upcoming failure and typically unfolds over time. In this paper, we make two contributions: 1) multiple novel LSTM architectures, collectively called q-LSTM, which embed quantile techniques for anomaly detection; and 2) a new learnable, parameterized activation function, the Parameterized Elliot Function (PEF), used inside the LSTM, which saturates later than its non-parameterized siblings such as the sigmoid and tanh and thereby better models long-range temporal dependency. The proposed algorithms are compared with other well-known anomaly detection algorithms and evaluated on performance metrics such as Recall, Precision and F1-score. Extensive experiments on multiple industrial time-series datasets (Yahoo, AWS, GE, machine sensors, and the Numenta and VLDB benchmark data) and on non-time-series data show the effectiveness and superior performance of LSTM-based quantile techniques in identifying anomalies.

1. INTRODUCTION

Anomalies indicate a departure of a system from its normal behaviour. In industrial systems, they often lead to failures. By definition, anomalies are rare events. As a result, from a machine learning standpoint, collecting and classifying anomalies poses significant challenges. For example, when anomaly detection is posed as a classification problem, it leads to extreme class imbalance (the data paucity problem). Morales-Forero & Bassetto (2019) applied a semi-supervised neural network, a combination of an autoencoder and an LSTM, to detect anomalies in industrial datasets and mitigate the data paucity problem. Sperl et al. (2020) also addressed the data imbalance issue of anomaly detection and applied a semi-supervised method to inspect large amounts of data for anomalies. However, these approaches do not solve the problem completely, since they still require some labeled data. Our proposed approach is to train models on a normal dataset and devise post-processing techniques to detect anomalies. The model thus captures the normal behavior of the industrial device, and no expensive dataset labeling is required. Similar approaches have been tried in the past: the autoencoder-based family of models uses some form of threshold to detect anomalies. For example, Sakurada & Yairi (2014) and Jinwon & Cho (2015) rely mostly on reconstruction errors. The reconstruction error can be treated as an anomaly score: if the reconstruction error of a datapoint is higher than a threshold, the datapoint is declared an anomaly. However, the threshold value can be specific to the domain and the model, and setting a threshold on the reconstruction error can be cumbersome.
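The reconstruction-error scheme just described can be sketched as follows. Here `reconstruct` is a hypothetical placeholder for any trained autoencoder's forward pass, and the fixed `threshold` illustrates the per-domain tuning burden noted above; this is a minimal sketch, not the method of any cited paper.

```python
import numpy as np

def reconstruction_anomalies(x, reconstruct, threshold):
    """Flag datapoints whose reconstruction error exceeds a fixed threshold."""
    # per-point reconstruction error serves as the anomaly score
    errors = np.abs(x - reconstruct(x))
    return errors > threshold

# toy usage: a stand-in "autoencoder" that reconstructs every point as 1.0,
# mimicking a model trained only on normal data hovering around 1.0
x = np.array([1.0, 1.1, 0.9, 5.0, 1.0])
flags = reconstruction_anomalies(x, lambda v: np.ones_like(v), threshold=0.5)
# only the point 5.0 is far from its reconstruction, so only it is flagged
```

The choice `threshold=0.5` is arbitrary, which is precisely the drawback the quantile-based approach below is designed to avoid.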

MOTIVATION AND CONTRIBUTION

Unlike the above, the quantile-based thresholds applied in our quantile-LSTM are generic, not specific to a domain or dataset. We introduce multiple versions of the LSTM-based anomaly detector in this paper, namely (i) quantile-LSTM, (ii) iqr-LSTM and (iii) Median-LSTM. All the versions estimate quantiles instead of the mean behaviour of an industrial device; for example, the median is the 50% quantile. Our contributions are three-fold: (1) the introduction of quantiles in the design of quantile-based LSTM techniques and their application to anomaly identification. (2) The proposal of the Parameterized Elliot Function as a 'flexible-form, adaptive, learnable' activation function in LSTM, where the parameter is learnt from the dataset. We show empirically that the modified LSTM architecture with the PEF performs better than one with the Elliot Function (EF), and that this behavior may be attributed to the slower saturation rate of the PEF. The PEF improves anomaly detection performance in comparison to its non-parameterized siblings. (3) Evidence of superior performance of the proposed Long Short Term Memory (LSTM) methods over state-of-the-art (SoTA) deep learning and non-deep learning algorithms across multiple industrial and non-industrial datasets, including the Numenta Anomaly Benchmark and the VLDB anomaly benchmark (Appendix, Tables 7, 8, 9 and 10). There are three key pieces to modelling anomalies: the type of time series we need to work with; modelling the temporal dependency; and post-processing the forecasts to flag a forecast as an anomaly. Given the nature of anomalies, they should clearly be modelled as departures from normality, i.e., as tail behaviour. Quantiles are the natural statistical quantities to consider in this respect. The temporal component of time-series models is some form of dynamical system, including classical statistical models such as ARMA and its variants.
LSTMs are the most popular non-parametric, non-linear dynamical models of this kind. One could technically swap LSTMs for any other sequence architecture suited to the problem; the added advantage LSTMs bring is their multiplicative gates, which help prevent vanishing gradients. This is coupled with the introduction of the Parameterized Elliot Function (PEF) as the activation function, which shifts the point of saturation. A classifier that flags anomalies is also a comparator, either learnt via a supervised task or based on reasonable heuristics. The former needs labels, which we assume are not available in large numbers in practice; for the latter, there is no option but to default to some heuristics. Thankfully, with a non-parametric, non-linear dynamical system such as q-LSTM modelling the quantiles, even fixed, deterministic comparators turn out to behave as adaptive comparators. We can therefore view our contribution as setting out this template and making sensible choices for each of its three pieces. The rest of the paper is organized as follows. The proposal and discussion of various LSTM-based algorithms are presented in section 2. Section 3 describes the LSTM structure and introduces the PEF; it also explains the intuition behind choosing a parameterized version of the activation function and the better variability it affords. Experimental results are presented in section 4. Section 5 discusses relevant literature in anomaly detection. We conclude the paper in section 6.
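To make the saturation argument concrete, the standard Elliot function and one illustrative learnable variant can be sketched as below. The exact parameterization of the PEF is given in section 3; the form used here, a trainable output scale `a`, is an assumption for illustration only.

```python
import numpy as np

def elliot(x):
    # standard Elliot function: a cheap, sigmoid-like squashing nonlinearity
    # with range (-1, 1), similar in shape to tanh but without exponentials
    return x / (1.0 + np.abs(x))

def parameterized_elliot(x, a=1.5):
    # illustrative learnable variant (assumed form): scaling the output by a
    # trainable `a` keeps the Elliot shape while letting the network stretch
    # the range, so the effective saturation sets in later than for the
    # fixed EF; in an LSTM, `a` would be learnt jointly with the weights
    return a * x / (1.0 + np.abs(x))
```

With `a > 1` the gradient `a / (1 + |x|)^2` stays larger than the plain Elliot gradient at every point, which is one way a learnable parameter can delay saturation.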

2. ANOMALY DETECTION WITH QUANTILE LSTMS

Quantiles are used as a robust alternative to classical conditional means in econometrics and statistics, as they can capture the uncertainty in a prediction and model tail behaviours (Koenker, 2005). An additional benefit is that quantiles make very few distributional assumptions. Tambwekar et al. (2022) also showed that quantiles aid explainability, since they can be used to obtain several univariate summary statistics that can be directly applied to existing explanation tools. This motivated carrying the idea of quantiles over from classification to anomaly detection, as quantiles capture tail behavior succinctly. It is well known that quantiles minimize the check loss (Horowitz, 1992), a generalized version of the Mean Absolute Error (MAE) arising from medians rather than means. It is also known that medians are often preferred to means in robust settings, particularly for skewed and heavy-tailed data. Thus, for time-series data, where the LSTM architecture has proven beneficial, we couple the LSTM architecture with quantiles to capture anomalies (outliers). Note that the method is applied to univariate time-series data only and is agnostic to the data distribution (see Table 6); as the empirical results show, distributional variance does not affect prediction quality. Before we discuss quantile-based anomaly detection, we describe the data structure and processing setup, with some notation. Let x_i, i = 1, 2, ..., n be the n time-series training datapoints. Let T_k = {x_i : i = k, ..., k + t - 1} be a set of t consecutive datapoints, split into w disjoint windows, each of integer size m = t/w, so that T_k = {T_k^1, ..., T_k^w}, where T_k^j = {x_{k+m(j-1)}, ..., x_{k+mj-1}}. Let Q_τ(D) denote the sample τ-quantile of the datapoints in a set D.
The training data consists of, for every T_k, the predictors X_{k,τ} ≡ {Q_τ(T_k^j)}, j = 1, ..., w, with the label (response) y_{k,τ} ≡ Q_τ(T_{k+1}), the sample quantile at a future time step. Let ŷ_{k,τ} be the value predicted by an LSTM model.
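The construction of these training pairs can be sketched as follows. This is a minimal illustration of the windowing and quantile extraction defined above (function and variable names are ours, not from the paper); the pairs would then be fed to an LSTM regressor.

```python
import numpy as np

def make_quantile_training_pairs(x, t, w, tau):
    """Build (X_{k,tau}, y_{k,tau}) pairs from a univariate series x.

    For each start index k, the segment T_k = x[k : k+t] is split into w
    disjoint windows of size m = t // w. The predictors are the tau-quantiles
    of the w windows; the label is the tau-quantile of the shifted segment
    T_{k+1} = x[k+1 : k+1+t], i.e., the sample quantile one step ahead.
    """
    m = t // w
    assert w * m == t, "t must be divisible by w"
    X, y = [], []
    for k in range(len(x) - t):
        seg = x[k:k + t]
        # quantile of each of the w disjoint sub-windows of T_k
        feats = [np.quantile(seg[j * m:(j + 1) * m], tau) for j in range(w)]
        X.append(feats)
        # label: quantile of the next (shifted) segment T_{k+1}
        y.append(np.quantile(x[k + 1:k + 1 + t], tau))
    return np.array(X), np.array(y)

# toy usage with the median (tau = 0.5) on a short ramp signal
x = np.arange(20.0)
X, y = make_quantile_training_pairs(x, t=4, w=2, tau=0.5)
```

Setting tau = 0.5 recovers the Median-LSTM input construction; other tau values yield the inputs for the general quantile-LSTM.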

