UNSUPERVISED ANOMALY DETECTION BY ROBUST COLLABORATIVE AUTOENCODERS

Abstract

Unsupervised anomaly detection plays a critical role in many real-world applications, from computer security to healthcare. A common deep learning approach is to apply autoencoders to learn a feature representation of the normal (non-anomalous) observations and use the reconstruction error of each observation to detect anomalies present in the data. However, due to the high model capacity brought about by the over-parameterization of deep neural networks (DNNs), the anomalies themselves may have small reconstruction errors, which degrades the performance of these methods. To address this problem, we present a robust framework for detecting anomalies using collaborative autoencoders. Unlike previous methods, our framework requires neither supervised label information nor access to clean (uncorrupted) examples during training. We investigate the theoretical properties of our framework and perform extensive experiments to compare its performance against other DNN-based methods. Our experimental results show the superior performance of the proposed framework as well as its robustness to noise due to missing value imputation compared to the baseline methods.

1. INTRODUCTION

Anomaly detection (AD) is the task of identifying abnormal observations in data. It has been successfully applied in many domains, from malware detection to medical diagnosis (Chandola et al., 2009). Driven by the success of deep learning, AD methods based on deep neural networks (DNNs) (Zhou & Paffenroth, 2017; Aggarwal & Sathe, 2017; Ruff et al., 2018; Zong et al., 2018; Hendrycks et al., 2018) have attracted increasing attention. Unfortunately, DNN methods have several known drawbacks when applied to AD problems. First, since many of them follow a supervised learning approach (Hendrycks et al., 2018), they require labeled examples of anomalies, which are often expensive to acquire and may not be representative enough in non-stationary environments. Supervised AD methods are also susceptible to the class imbalance problem, as anomalies are rare compared to normal observations. Some DNN methods rely on having access to clean data to ensure that the learned feature representation is not contaminated by anomalies during training (Zong et al., 2018; Ruff et al., 2018; Pidhorskyi et al., 2018; Fan et al., 2020). This limits their applicability, as acquiring representative clean data is itself a difficult problem. Due to these limitations, there have been concerted efforts to develop robust unsupervised DNN methods that assume the availability of neither supervised labels nor clean training data (Chandola et al., 2009; Liu et al., 2019).

Deep autoencoders are among the most widely used unsupervised AD methods (Sakurada & Yairi, 2014; Vincent et al., 2010). An autoencoder compresses the original data by learning a latent representation that minimizes the reconstruction loss. It is based on the working assumption that normal observations are easier to compress than anomalies.
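The reconstruction-error criterion can be made concrete with a linear autoencoder, which is equivalent to PCA and can be fit in closed form via the SVD. The NumPy sketch below, including its function names and the rank-2 synthetic data, is illustrative only and is not the paper's method:

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Fit a rank-k linear autoencoder in closed form via the SVD (PCA)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T  # d x k encoder/decoder weights (principal directions)
    return mu, W

def anomaly_scores(X, mu, W):
    """Squared reconstruction error of each row, used as its anomaly score."""
    Z = (X - mu) @ W      # encode into the k-dim latent space
    X_hat = Z @ W.T + mu  # decode back to the input space
    return ((X - X_hat) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Normal points lie near a 2-D subspace of a 10-D space; anomalies do not.
normal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))
anomalies = rng.normal(size=(5, 10)) * 5.0
X = np.vstack([normal, anomalies])

mu, W = fit_linear_autoencoder(X, k=2)
scores = anomaly_scores(X, mu, W)
print(np.argsort(scores)[-5:])  # indices of the highest-scoring points
```

Because the normal points are (approximately) rank-2, they reconstruct almost perfectly, while the anomalies incur large residuals; this is exactly the "easier to compress" assumption that over-parameterized deep autoencoders can violate.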
Unfortunately, such an assumption may not hold in practice since DNNs are often over-parameterized and have the capacity to overfit the anomalies (Zhang et al., 2016), thus degrading their overall performance. To improve their performance, unsupervised DNN methods must consider the trade-off between model capacity and overfitting to the anomalies. One way to control model capacity is through regularization. Many regularization methods for deep networks have been developed, e.g., constraining the norms of the model parameters or explicitly perturbing the training process (Srivastava et al., 2014). However, these approaches do not prevent the networks from perfectly fitting random data (Zhang et al., 2016). As a consequence, regularization alone cannot prevent the anomalies from being memorized, especially in an unsupervised learning setting.

Our work is motivated by recent advances in supervised learning that improve the robustness of DNNs to noisy labels by learning weights for the training examples (Jiang et al., 2017; Han et al., 2018). Unlike previous studies, our goal is to learn the weights in an unsupervised fashion so that normal observations are assigned higher weights than anomalies when calculating the reconstruction error. The weights help to reduce the influence of anomalies when learning a feature representation of the data. Since existing approaches for weight learning are supervised, they are inapplicable to unsupervised AD. Instead, we propose an unsupervised robust collaborative autoencoders (RCA) method that trains a pair of autoencoders in a collaborative fashion and jointly learns their model parameters and sample weights. Each autoencoder selects the subset of samples with the lowest reconstruction errors from a mini-batch to learn their feature representation.
By discarding samples with high reconstruction errors, the algorithm is biased towards learning the representation of clean data, thereby reducing the risk of memorizing anomalies. However, selecting only easy-to-fit samples in each iteration may lead to premature convergence without sufficient exploration of the loss surface. Thus, instead of using its selected samples to update its own model parameters, each autoencoder passes them to the other autoencoder, which uses them to update its model parameters. The sample selection procedure is illustrated in Figure 1. During the testing phase, we apply the dropout mechanism used in training to produce multiple output predictions for each test point by repeating the forward pass multiple times. This ensemble of outputs is then aggregated to obtain a more robust estimate of the anomaly score.

The main contributions of this paper are as follows. First, we present a novel framework for unsupervised AD using robust collaborative autoencoders (RCA). Second, we provide a rigorous theoretical analysis of the mechanism behind RCA. We characterize the convergence of RCA toward the solution that would be obtained by training on clean data only, show that its worst-case behavior is better than that of conventional autoencoders, and analyze the conditions under which RCA is guaranteed to find the anomalies. Finally, we empirically demonstrate that RCA outperforms state-of-the-art unsupervised AD methods on the majority of the datasets used in this study, even in the presence of noise due to missing value imputation.
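The collaborative select-and-swap step can be sketched as follows. For brevity, the deep autoencoders with dropout are replaced here by tiny tied-weight linear autoencoders trained by gradient descent, and the ensemble score is simply the average of the two models' errors rather than dropout-sampled outputs; all names (`TinyAE`, `keep_ratio`, etc.) are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

class TinyAE:
    """Tied-weight linear autoencoder x_hat = x W W^T, trained by gradient descent."""
    def __init__(self, d, k, rng):
        self.W = rng.normal(scale=0.1, size=(d, k))

    def errors(self, X):
        """Per-sample squared reconstruction error (the anomaly score)."""
        X_hat = X @ self.W @ self.W.T
        return ((X - X_hat) ** 2).sum(axis=1)

    def step(self, X, lr=0.01):
        # Gradient of the mean of ||x - x W W^T||^2 with respect to W.
        R = X @ self.W @ self.W.T - X
        grad = 2 * (X.T @ R @ self.W + R.T @ X @ self.W) / len(X)
        self.W -= lr * grad

def rca_epoch(ae1, ae2, X, batch_size, keep_ratio, rng):
    """One epoch of collaborative training: each autoencoder picks its
    lowest-error samples, then the two SWAP selections before updating."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = X[idx[start:start + batch_size]]
        m = max(1, int(keep_ratio * len(batch)))
        sel1 = np.argsort(ae1.errors(batch))[:m]  # ae1's easy samples
        sel2 = np.argsort(ae2.errors(batch))[:m]  # ae2's easy samples
        ae1.step(batch[sel2])  # ae1 trains on ae2's selection
        ae2.step(batch[sel1])  # ae2 trains on ae1's selection

rng = np.random.default_rng(1)
normal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 8))
anomalies = rng.normal(size=(8, 8)) * 4.0
X = np.vstack([normal, anomalies])

ae1, ae2 = TinyAE(8, 2, rng), TinyAE(8, 2, rng)
for _ in range(100):
    rca_epoch(ae1, ae2, X, batch_size=64, keep_ratio=0.9, rng=rng)
scores = (ae1.errors(X) + ae2.errors(X)) / 2  # simple two-model ensemble
```

Because the high-error (anomalous) samples are dropped from every mini-batch, the gradient updates are driven almost entirely by normal data, while the swap prevents each model from locking onto only its own easiest samples.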

2. RELATED WORK

There are numerous methods developed for anomaly detection; a survey can be found in Chandola et al. (2009). Reconstruction-based methods, such as principal component analysis (PCA) and autoencoders, are popular approaches whereby the input data is projected to a lower-dimensional space before being transformed back to the original feature space. The distance between the input and reconstructed data determines the anomaly score of each data point. More advanced unsupervised AD methods have been developed recently. Zhou & Paffenroth (2017) combined robust PCA with an autoencoder to decompose the data into a mixture of normal and anomalous parts. Zong et al. (2018) jointly learned a low-dimensional embedding and the density of the data, using the density at each point as its anomaly score, while Ruff et al. (2018) extended the traditional one-class SVM approach to a deep learning setting. Wang et al. (2019) applied an end-to-end self-supervised learning approach to the unsupervised AD problem. However, their approach is designed for image data, requiring operations such as rotation and patch reshuffling.

Despite the recent progress on deep unsupervised AD, current methods do not explicitly prevent the neural network from incorporating anomalies into its learned representation, thereby degrading model performance. One way to address this issue is to assign a weight to each data point, giving higher weights to the normal data to make the model more robust against anomalies. The idea of learning a weight for each data point is not new in supervised learning. A classic example is boosting (Freund et al., 1996), where hard-to-classify examples are assigned higher weights to encourage the model to classify them more accurately. The opposite strategy is used in self-paced learning (Kumar et al., 2010), where the algorithm assigns higher weights to easier examples and lower weights to harder ones.
This strategy was also used by other methods for learning from noisy labeled data, including Jiang et al. (2017) and Han et al. (2018). Furthermore, several studies provide theoretical analysis of the benefits of using samples with smaller loss to drive the optimization (Shen & Sanghavi, 2018; Shah et al., 2020).
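The small-loss selection principle analyzed in these studies can be illustrated with a toy trimmed-loss estimator that alternates between keeping the lowest-loss samples under the current fit and refitting on the kept samples; the function name and synthetic setup below are our own illustrative choices:

```python
import numpy as np

def trimmed_mean_fit(x, keep_ratio=0.8, iters=20):
    """Iterative trimmed-loss minimization in the spirit of small-loss
    selection: keep the keep_ratio fraction of points with the smallest
    squared loss under the current estimate, then refit on those points."""
    theta = x.mean()  # initialize with the (contaminated) sample mean
    for _ in range(iters):
        losses = (x - theta) ** 2
        m = int(keep_ratio * len(x))
        kept = np.argsort(losses)[:m]  # small-loss subset
        theta = x[kept].mean()
    return theta

rng = np.random.default_rng(0)
# 80% inliers around 0, 20% outliers around 10.
data = np.concatenate([rng.normal(0.0, 1.0, 80), rng.normal(10.0, 1.0, 20)])
print(trimmed_mean_fit(data))  # close to 0, despite the outliers
print(data.mean())             # pulled toward the outliers
```

The same alternation underlies RCA's training loop: discarding high-loss samples biases each refit toward the clean population, which is why the selected subset progressively excludes the anomalies.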

