UNSUPERVISED ANOMALY DETECTION BY ROBUST COLLABORATIVE AUTOENCODERS

Abstract

Unsupervised anomaly detection plays a critical role in many real-world applications, from computer security to healthcare. A common deep learning approach is to apply autoencoders to learn a feature representation of the normal (non-anomalous) observations and to use the reconstruction error of each observation to detect anomalies present in the data. However, due to the high model capacity brought about by over-parameterization of deep neural networks (DNNs), the anomalies themselves may have small reconstruction errors, which degrades the performance of these methods. To address this problem, we present a robust framework for detecting anomalies using collaborative autoencoders. Unlike previous methods, our framework requires neither supervised label information nor access to clean (uncorrupted) examples during training. We investigate the theoretical properties of our framework and perform extensive experiments to compare its performance against other DNN-based methods. Our experimental results show the superior performance of the proposed framework, as well as its robustness to noise due to missing value imputation, compared to the baseline methods.

1. INTRODUCTION

Anomaly detection (AD) is the task of identifying abnormal observations in data. It has been successfully applied in many domains, from malware detection to medical diagnosis (Chandola et al., 2009). Driven by the success of deep learning, AD methods based on deep neural networks (DNNs) (Zhou & Paffenroth, 2017; Aggarwal & Sathe, 2017; Ruff et al., 2018; Zong et al., 2018; Hendrycks et al., 2018) have attracted increasing attention. Unfortunately, DNN methods have several known drawbacks when applied to AD problems. First, many of them are based on a supervised learning approach (Hendrycks et al., 2018), which requires labeled examples of anomalies; these are often expensive to acquire and may not be representative enough in non-stationary environments. Supervised AD methods are also susceptible to the class imbalance problem, as anomalies are rare compared to normal observations. Some DNN methods rely on having access to clean data to ensure that the feature representation learning is not contaminated by anomalies during training (Zong et al., 2018; Ruff et al., 2018; Pidhorskyi et al., 2018; Fan et al., 2020). This limits their applicability, as acquiring representative clean data is itself a difficult problem. Due to these limitations, there have been concerted efforts to develop robust unsupervised DNN methods that assume the availability of neither supervised labels nor clean training data (Chandola et al., 2009; Liu et al., 2019). Deep autoencoders are perhaps the most widely used unsupervised AD methods (Sakurada & Yairi, 2014; Vincent et al., 2010). An autoencoder compresses the original data by learning a latent representation that minimizes the reconstruction loss. It is based on the working assumption that normal observations are easier to compress than anomalies.
Unfortunately, such an assumption may not hold in practice, since DNNs are often over-parameterized and have the capacity to overfit the anomalies (Zhang et al., 2016), thus degrading their overall performance. To improve their performance, unsupervised DNN methods must consider the trade-off between model capacity and overfitting to the anomalies. One way to control model capacity is through regularization. Many regularization methods for deep networks have been developed to control model capacity, e.g., by constraining the norms of the model parameters or explicitly perturbing the training process (Srivastava et al., 2014). However, these approaches do not prevent the networks from being able to perfectly fit random data (Zhang et al., 2016). As a consequence, regularization approaches cannot prevent the anomalies from being memorized, especially in an unsupervised learning setting. Our work is motivated by recent advances in supervised learning on the robustness of DNNs for noisy labeled data obtained by learning the weights of the training examples (Jiang et al., 2017; Han et al., 2018). Unlike previous studies, our goal is to learn the weights in an unsupervised fashion, so that normal observations are assigned higher weights than the anomalies when calculating the reconstruction error. The weights help to reduce the influence of anomalies when learning a feature representation of the data. Since existing approaches for weight learning are supervised, they are inapplicable to unsupervised AD. Instead, we propose an unsupervised robust collaborative autoencoders (RCA) method that trains a pair of autoencoders in a collaborative fashion and jointly learns their model parameters and sample weights. Each autoencoder selects the subset of samples with the lowest reconstruction errors from a mini-batch to learn the feature representation.
By discarding samples with high reconstruction errors, the algorithm is biased towards learning the representation of clean data, thereby reducing its risk of memorizing anomalies. However, selecting only easy-to-fit samples in each iteration may lead to premature convergence of the algorithm without sufficient exploration of the loss surface. Thus, instead of using the selected samples to update its own model parameters, each autoencoder passes its selected samples to the other autoencoder, which uses them to update its model parameters. The sample selection procedure is illustrated in Figure 1. During the testing phase, we apply the dropout mechanism used in training to produce multiple output predictions for each test point by repeating the forward pass multiple times. This ensemble of outputs is then aggregated to obtain a more robust estimate of the anomaly score. The main contributions of this paper are as follows. First, we present a novel framework for unsupervised AD using robust collaborative autoencoders (RCA). Second, we provide a rigorous theoretical analysis to understand the mechanism behind RCA. We also characterize the convergence of RCA to the solution that would be obtained if it were trained on clean data only. We show that the worst-case scenario for RCA is better than that of conventional autoencoders and analyze the conditions under which RCA is guaranteed to find the anomalies. Finally, we empirically demonstrate that RCA outperforms state-of-the-art unsupervised AD methods on the majority of the datasets used in this study, even in the presence of noise due to missing value imputation.

2. RELATED WORK

There are numerous methods developed for anomaly detection; a survey can be found in Chandola et al. (2009). Reconstruction-based methods, such as principal component analysis (PCA) and autoencoders, are popular approaches, whereby the input data is projected to a lower-dimensional space before being transformed back to the original feature space. The distance between the input and the reconstructed data is used to determine the anomaly scores of the data points. More advanced unsupervised AD methods have been developed recently. Zhou & Paffenroth (2017) combined robust PCA with an autoencoder to decompose the data into normal and anomalous parts. Zong et al. (2018) jointly learned a low-dimensional embedding and the density of the data, using the density at each point as its anomaly score, while Ruff et al. (2018) extended the traditional one-class SVM approach to a deep learning setting. Wang et al. (2019) applied an end-to-end self-supervised learning approach to the unsupervised AD problem; however, their approach is designed for image data, requiring operations such as rotation and patch reshuffling. Despite the recent progress on deep unsupervised AD, current methods do not explicitly prevent the neural network from incorporating anomalies into its learned representation, thereby degrading model performance. One way to address this issue is to assign a weight to each data point, giving higher weights to the normal data to make the model more robust against anomalies. The idea of learning a weight for each data point is not new in supervised learning. A classic example is boosting (Freund et al., 1996), where hard-to-classify examples are assigned higher weights to encourage the model to classify them more accurately. The opposite strategy is used in self-paced learning (Kumar et al., 2010), where the algorithm assigns higher weights to easier-to-classify examples and lower weights to harder ones.
This strategy was also used by other methods for learning from noisy labeled data, including Jiang et al. (2017) and Han et al. (2018) . Furthermore, there are many studies providing theoretical analysis on the benefits of choosing samples with smaller loss to drive the optimization algorithm (Shen & Sanghavi, 2018; Shah et al., 2020) . 
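As a concrete illustration of the reconstruction-based scoring described at the start of this section, the following is a minimal PCA-based sketch (not from the paper; `pca_anomaly_scores` is a hypothetical helper): points that are poorly captured by the top principal components receive high scores.

```python
import numpy as np

def pca_anomaly_scores(X, n_components):
    """Score each row of X by its squared PCA reconstruction error.

    Higher scores indicate points poorly captured by the top principal
    components, i.e. likely anomalies.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # top principal directions via SVD of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                  # d x k basis
    X_rec = Xc @ V @ V.T + mu                # project and reconstruct
    return ((X - X_rec) ** 2).sum(axis=1)    # squared reconstruction error
```

On data lying near a low-dimensional subspace, an off-subspace outlier gets a much larger score than the inliers; autoencoders generalize this idea with a nonlinear encoder/decoder.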

3. METHODOLOGY

This section introduces the proposed robust collaborative autoencoder (RCA) framework and analyzes its properties. Let X ∈ R^{n×d} denote the input data, where n is the number of observations and d is the number of features. Our goal is to classify each data point x_i ∈ X as an anomaly or a normal observation. Let O ⊂ X denote the set of true anomalies in the data. We assume the anomaly ratio, ε = |O|/n, is given or can be approximately estimated. The RCA framework trains a pair of autoencoders, A1 and A2, with different initializations. In each training iteration, the autoencoders each apply a forward pass on a mini-batch randomly sampled from the training data and compute the reconstruction error of each data point in the mini-batch. The data points in the mini-batch are then sorted according to their reconstruction errors, and each autoencoder selects the points with the lowest errors to be exchanged with the other autoencoder. Each autoencoder subsequently performs a back-propagation step to update its model parameters using the samples it receives from the other autoencoder. Upon convergence, the averaged reconstruction error of each data point is used to determine the anomaly score. A pseudocode of the training phase for RCA is given in Algorithm 1, while the testing phase is given in Algorithm 2. Their key steps are:

    A1 ← backprop(Ŝ1, S[c2], dropout = r);  A2 ← backprop(Ŝ2, S[c1], dropout = r)
    end
    X̂tst1 ← forward(A1, Xtst, dropout = 0);  X̂tst2 ← forward(A2, Xtst, dropout = 0)
    ξtest ← L(X̂tst1, Xtst) + L(X̂tst2, Xtst)
    if ξtest < ξ* then ξ* ← ξtest;  A1* ← A1;  A2* ← A2 end
    β ← max(β − ε/max_epoch, 1 − ε)
    for i = 1 to v do
        ξ1 ← forward(A1*, Xtst, dropout = r);  ξ2 ← forward(A2*, Xtst, dropout = r)
        ξ ← ξ ∪ {(ξ1 + ξ2)/2}
    end
    anomaly score ← average(ξ)

RCA differs from conventional autoencoders in several ways. First, its autoencoders are trained using only the selected data points with small reconstruction errors.
The selected points are then exchanged between the autoencoders to avoid premature convergence. Furthermore, in the testing phase, each autoencoder applies a dropout mechanism to generate multiple predicted outputs. The averaged ensemble output is used as the final anomaly score. Details of these steps are given next.
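Putting the selection and exchange steps together, one training iteration might look like the following sketch. This is not the authors' implementation: `A1`/`A2` are assumed to expose a hypothetical `errors(X)` method (per-sample reconstruction error) and `update(X)` method (one gradient step).

```python
import numpy as np

def rca_iteration(A1, A2, batch, beta):
    """One RCA-style training step on a mini-batch (illustrative sketch)."""
    e1, e2 = A1.errors(batch), A2.errors(batch)
    k = max(1, int(beta * len(batch)))
    sel1 = np.argsort(e1)[:k]      # A1's lowest-error ("clean") samples
    sel2 = np.argsort(e2)[:k]      # A2's lowest-error samples
    # exchange: each autoencoder trains on the *other* one's selection
    A1.update(batch[sel2])
    A2.update(batch[sel1])
```

The exchange is the collaborative element: neither network trains purely on its own easiest samples, which is what would otherwise encourage premature convergence.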

3.1. SAMPLE SELECTION

Given a mini-batch $X_m \subset X$, our sample selection procedure chooses a subset of points as "clean" samples to update the parameters of an autoencoder by minimizing the following objective function:

$$\min_{w,c} \sum_{x_i \in X_m} c_i f(x_i, w), \quad \text{s.t. } \forall i: c_i \in \{0,1\}, \; c^T \mathbf{1} = \beta n, \qquad (1)$$

where $f(x, w)$ denotes the reconstruction loss of a data point x for an autoencoder with parameters w. Although $c_i$ is binary-valued, as will be shown later, the probability that a data point is selected to update an autoencoder depends on the probability that it is chosen to be part of the mini-batch and the probability that it has among the lowest reconstruction errors within the mini-batch. We use an alternating minimization approach to solve the objective function. Fixing c and optimizing w reduces to the standard autoencoder problem on the selected "clean" samples (i.e., those with $c_i = 1$), which can be solved with an optimizer such as Adam (Kingma & Ba, 2014). When w is fixed, the objective function reduces to a linear program, which admits a simple solution: the data points are sorted by their reconstruction loss $f(x_i, w)$, and we set $c_i = 1$ for the (β × 100)% of the points in the mini-batch with the lowest reconstruction loss. This procedure is denoted by the sample_selection(·) function in Algorithm 1. A decay schedule is applied to β when selecting the "clean" samples. In the early stages of training, all the samples in the mini-batch are selected to update the model, with gradually fewer points selected as training progresses. The rationale is that we should not drop too many data points early on, especially in the first few epochs, when the autoencoders have not yet properly learned the feature representation; the autoencoders only start to overfit the anomalies once the number of training epochs is sufficiently large.
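With w fixed, the c-step described above reduces to sorting. A minimal sketch (where `solve_c` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def solve_c(losses, beta):
    """Closed-form solution for c with w fixed: set c_i = 1 for the
    beta fraction of points with the lowest reconstruction loss."""
    losses = np.asarray(losses)
    m = int(round(beta * len(losses)))
    c = np.zeros(len(losses), dtype=int)
    c[np.argsort(losses)[:m]] = 1   # keep the m lowest-loss points
    return c
```

This is the entire linear-programming step: since the objective is linear in c with a cardinality constraint, the optimum is attained at the lowest-loss points.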
A linear decay from β = 1 to β = 1 − ε (see the second-to-last line in Algorithm 1) was found to work well in practice, so we use this setting in our experiments. Next, we analyze the convergence properties of our sample selection procedure. Let k be the mini-batch size and w be the current parameters of the autoencoder. Our algorithm selects the (β × 100)% of the data points with the lowest errors in the mini-batch for updating the autoencoder. Let $x_{(i)} \in X$ denote the data point with the $i$-th smallest reconstruction loss among all n points, and let $p_i(w)$ be the probability that $x_{(i)}$ is chosen by the sample selection procedure to update the parameters of the autoencoder. Assuming sampling without replacement, we consider two cases: $i \le \beta k$ and $i > \beta k$. In the first case ($i \le \beta k$), $x_{(i)}$ is used to update the autoencoder as long as it is selected to be part of the mini-batch. In the second case ($i > \beta k$), $x_{(i)}$ is chosen only if it is part of the mini-batch and has among the $(\beta k)$ lowest errors of the data points in the mini-batch. Thus:

$$p_i(w) = \begin{cases} \dfrac{\binom{n-1}{k-1}}{\binom{n}{k}} = \dfrac{k}{n} & \text{if } i \le \beta k,\\[6pt] \dfrac{\sum_{j=0}^{\beta k - 1} \binom{i-1}{j}\binom{n-i}{k-j-1}}{\binom{n}{k}} & \text{otherwise.} \end{cases} \qquad (2)$$

The corresponding probability $p_i(w)$ for sampling with replacement is provided in the Appendix. The objective function of our sample selection procedure (Equation 1) can then be stated as $\tilde{F}(w) = \sum_{x_{(i)} \in X} p_i(w) f(x_{(i)}, w)$. Let $\Omega(w^*_{sr})$ be the set of stationary points of $\tilde{F}(w)$. Furthermore, let $\Omega(w^*)$ be the set of stationary points of the loss on clean data only, i.e., $F(w) = \sum_{x_i \notin O} f_i(w)$, and let $\Omega(w^*_{ns})$ be the set of stationary points of the loss on the entire data, i.e., $\sum_{x_i \in X} f_i(w)$. For brevity, we write $f_i(w)$ for $f(x_i, w)$. Finally, we denote by $\Omega_i(w^*)$ the set of stationary points of the individual loss $f_i(w)$.
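The selection probability in Eq. (2) can be evaluated directly; a small sketch (where `selection_probability` is a hypothetical helper, assuming sampling without replacement):

```python
from math import comb

def selection_probability(i, n, k, beta):
    """P(x_(i) is used in an update), following Eq. (2): i is the rank of
    x_(i)'s loss among all n points, k is the mini-batch size, and the
    beta*k lowest-error samples are kept per mini-batch."""
    bk = int(beta * k)
    if i <= bk:
        return k / n          # any mini-batch containing x_(i) uses it
    # x_(i) needs fewer than beta*k lower-ranked points in its mini-batch
    total = sum(comb(i - 1, j) * comb(n - i, k - j - 1) for j in range(bk))
    return total / comb(n, k)
```

As expected, the probability is constant (k/n) for the βk lowest-loss points and decays to zero for the highest-loss points, which is precisely what biases training toward clean data.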
Our analysis of the convergence properties of our sample selection approach is based on the following assumptions:

Assumption 1 (Gradient Regularity): $\max_{i,w} \|\nabla f_i(w)\| \le G$.

Assumption 2 (Bounded Clean Objective): Let $F(w) = \sum_{x_i \notin O} f_i(w)$. There exists a constant $B > 0$ such that $\max_{i,j} |F(w_i) - F(w_j)| \le B$.

Assumption 3 (Individual L-smoothness): Every individual loss $f_i(w)$ satisfies $\|\nabla f_i(w) - \nabla f_i(w')\| \le L_i \|w - w'\|$ for all $w, w'$.

Assumption 4 (Equal Minima): The minimum value of every individual loss is the same, i.e., $\forall i, j: \min_w f_i(w) = \min_w f_j(w)$.

Assumption 5 (Individual Strong Convexity): Every individual loss $f_i(w)$ satisfies $\|\nabla f_i(w) - \nabla f_i(w')\| \ge \mu_i \|w - w'\|$ for all $w, w'$.

To simplify the notation, we define $L_{\max} = \max_i L_i$, $L_{\min} = \min_i L_i$, $\mu_{\max} = \max_i \mu_i$, and $\mu_{\min} = \min_i \mu_i$. Since $F(w)$ is the sum of the losses over the clean data, Assumption 3 implies that $F(w)$ is $n(1-\varepsilon)L_{\max}$-smooth, while Assumption 5 implies that $F(w)$ is $n(1-\varepsilon)\mu_{\min}$-strongly convex. We thus define $M = n(1-\varepsilon)L_{\max}$ and $m = n(1-\varepsilon)\mu_{\min}$. Note that Assumptions 1-3 are common in non-convex optimization. Assumption 4 is reasonable in an over-parameterized DNN setting (Zhang et al., 2016). Although Assumption 5 is the strongest assumption, it is only used to discuss the correctness of our algorithm in Theorem 3, not in Theorems 1 and 2. A similar assumption was used by Shah et al. (2020) in their proof of correctness. We define the constants $\delta > 0$ and $\varphi \ge 1$ as follows:

$$\forall x_i \notin O, \forall x_j \in O: \max_{v \in \Omega_i(w^*),\, y \in \Omega(w^*)} \|v - y\| \le \delta \le \min_{z \in \Omega_j(w^*),\, y \in \Omega(w^*)} \|z - y\|, \qquad \forall x_j \in O: \max_{z \in \Omega_j(w^*),\, y \in \Omega(w^*)} \|z - y\| \le \varphi\delta \qquad (3)$$

If the loss is convex, then $\Omega_i(w^*) = \{w_i^*\}$ and $\Omega(w^*) = \{w^*\}$, and the above conditions simplify to: $\|w_i^* - w^*\| \le \delta \le \|w_j^* - w^*\| \le \varphi\delta$, $\forall x_i \notin O, \forall x_j \in O$.
The constants $\delta$ and $\varphi$ thus bound the distance between the stationary points $w_j^*$ of an anomalous point and the stationary point $w^*$ for the clean data. We first consider a non-convex setting. Our goal is to determine whether the parameters learned from the samples chosen by our procedure, which optimizes $\tilde{F}(w)$, converge to $w^*$, the solution obtained by minimizing the loss on clean data, $F(w) = \sum_{x_i \notin O} f_i(w)$.

Theorem 1: Let $F(w) = \sum_{x_i \notin O} f_i(w)$ be a twice-differentiable function. Consider the sequence $\{w^{(t)}\}$ generated by optimizing $\tilde{F}(w) = \sum_i p_i(w) f(x_{(i)}, w)$, i.e., $w^{(t+1)} = w^{(t)} - \eta^{(t)} \sum_i p_i(w^{(t)}) \nabla f(x_{(i)}, w^{(t)})$. Let $\max_{w^{(t)}} \|\nabla F(w^{(t)}) - \sum_i p_i(w^{(t)}) \nabla f_{(i)}(w^{(t)})\|^2 = C$. Under Assumptions 1-3, if $\eta^{(t)}$ satisfies $\sum_{t=1}^{\infty} \eta^{(t)} = \infty$ and $\sum_{t=1}^{\infty} (\eta^{(t)})^2 < \infty$, then $\min_{t=0,1,\dots,T} \|\nabla F(w^{(t)})\|^2 \to C$ as $T \to \infty$.

Remark 1: The preceding theorem shows that optimizing $\tilde{F}(w)$ converges to a C-approximate stationary point of the loss function on clean data, where C depends on the sample selection approach. For example, if $p_i(w) = 1$ for all $x_i \notin O$ and $p_i(w) = 0$ for all $x_i \in O$, then $C = 0$. In this ideal situation, the solution obtained by optimizing $\tilde{F}(w)$ reduces to that of vanilla SGD on clean data. Next, we compare the stationary point of $\tilde{F}(w)$ against the stationary point of the loss function on the entire data (without sample selection).

Theorem 2: Let $F(w) = \sum_{x_i \notin O} f_i(w)$ be a twice-differentiable function and let C be defined as in Theorem 1. Consider the sequence $\{w_{sr}\}$ generated by optimizing $\tilde{F}(w) = \sum_i p_i(w) f(x_{(i)}, w)$, i.e., $w^{(t+1)} = w^{(t)} - \eta^{(t)} \sum_i p_i(w^{(t)}) \nabla f(x_{(i)}, w^{(t)})$, and the sequence $\{w_{ns}\}$ generated by standard SGD on the entire data, $w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla f(x_i, w^{(t)})$. Under Assumptions 1-3 and $C \le (\min(n\varepsilon G, M\delta))^2$, if $\eta^{(t)}$ satisfies $\sum_{t=1}^{\infty} \eta^{(t)} = \infty$ and $\sum_{t=1}^{\infty} (\eta^{(t)})^2 < \infty$, then there exists a large enough T and $\tilde{w} \in \Omega(w^*_{ns})$ such that $\min_{t=0,1,\dots,T} \|\nabla F(w^{(t)}_{sr})\| \le \|\nabla F(\tilde{w})\|$.

Remark 2: This theorem is analogous to the result given in Shah et al.
(2020), which assumes a convex loss function, whereas our theorem is applicable even in the non-convex case. Although the theorem is a worst-case analysis, our experiments show that, on average, our method easily outperforms other DNN methods that use all the data. Theorem 2 suggests that, as long as C is smaller than a threshold, sample selection gives better convergence to the stationary points for clean data than using all the data. As the anomaly ratio increases or the distance to the nearest outlier increases, sample selection improves the worst-case convergence to the stationary points for clean data compared to no sample selection. Below we give a sufficient condition for guaranteeing correctness when the objective is restricted to a convex setting. We assume that $f_i(w^*) = 0$ for all $x_i \notin O$ and $f_j(w^*) > 0$ for all $x_j \in O$. Assuming $f(w)$ is convex and its gradient is upper bounded, there exists a ball of radius $r > 0$ around $w^*$ defined as follows: $B_r(w^*) = \{w \mid f_i(w) < f_j(w), \forall x_i \notin O, x_j \in O, \|w - w^*\| \le r\}$. This definition describes a ball around the optimal point in which the normal observations have a smaller loss than the anomalies. Based on this definition, the following theorem gives a sufficient condition for our algorithm to converge to a solution within the ball.

Theorem 3: Let $F(w) = \sum_{x_i \notin O} f_i(w)$ be a twice-differentiable function, let $L^c_{\max} = \max_{x_i \notin O} L_i$ be the maximum Lipschitz smoothness over the clean data, and let $\mu^o_{\min} = \min_{x_j \in O} \mu_j$ be the minimum strong-convexity constant over the anomalies. Consider the sequence $\{w_{sr}\}$ generated by $w^{(t+1)} = w^{(t)} - \eta^{(t)} \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)})$ and let $\max_{w^{(t)}} \|\nabla F(w^{(t)}) - \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)})\|^2 = C$. Define $\kappa = L^c_{\max}/\mu^o_{\min}$ and suppose Assumptions 1-5 hold. If $\eta^{(t)}$ satisfies $\sum_{t=1}^{\infty} \eta^{(t)} = \infty$ and $\sum_{t=1}^{\infty} (\eta^{(t)})^2 < \infty$, and $C \le \left(\frac{\delta m}{1+\kappa}\right)^2 = O\!\left(\left(\frac{\delta}{\kappa}\right)^2\right)$, then there exists $r > 0$ such that $w^*_{sr} \in B_r(w^*)$. The proof is given in the Appendix.
The convergence guarantee depends on having a small enough value of C, which is related to δ, the distance between the nearest anomalies and the normal points, and to the landscape of the loss surface, κ. A small κ suggests that the loss surface is very sharp for anomalies (large $\mu^o_{\min}$) but flat for normal data (small $L^c_{\max}$). In this case, most areas of the loss surface will have a smaller loss for normal observations and a larger loss for anomalies (assuming equal minima among all the points). Due to their larger loss, the anomalies have a smaller probability of being selected as "clean" samples by the proposed RCA algorithm. The analysis above shows that sample selection benefits the convergence of our method to the stationary points for clean data. However, our ultimate goal is to improve generalization performance, not just to converge to good stationary points of the training data. When sample selection is applied to a single autoencoder, the algorithm may converge too quickly, as it uses only samples with low reconstruction loss to compute the gradient, making it susceptible to overfitting (Zhang et al., 2016). Thus, instead of using only the self-selected samples for model updates, we train two autoencoders collaboratively and shuffle the selected samples between them to avoid overfitting. Similar ideas have been found to be effective in supervised learning for data with noisy labels (Han et al., 2018).

3.2. ENSEMBLE EVALUATION

Unsupervised anomaly detection using an ensemble of model outputs has been shown to be highly effective in previous studies (Liu et al., 2008; Zhao et al., 2019; Emmott et al., 2015; Aggarwal & Sathe, 2017). However, incorporating ensemble methods into deep learning is challenging, as training a large number of DNNs is expensive. In this paper, we use the dropout mechanism (Srivastava et al., 2014) to emulate the ensemble process. Dropout is typically used during the training phase only. In RCA, we employ the dropout mechanism during testing as well. Specifically, we use the randomly perturbed network structures to perform multiple forward passes over the data, obtaining a set of reconstruction losses for each test point. The final anomaly score is computed by averaging these reconstruction losses. Although dropout may increase the overall reconstruction loss, we expect a more robust estimate of the anomaly score from this procedure.
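The test-time averaging described above can be sketched as follows. This is a simplified illustration, not the authors' code: `reconstruct(X, mask)` is a hypothetical forward pass taking a keep mask, and dropout is emulated per input feature rather than on hidden units for brevity.

```python
import numpy as np

def ensemble_scores(reconstruct, X, rate=0.2, passes=10, seed=0):
    """Average the reconstruction error over several stochastic forward
    passes, emulating test-time dropout."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(passes):
        mask = rng.random(X.shape[1]) >= rate   # randomly drop some features
        scores += ((X - reconstruct(X, mask)) ** 2).sum(axis=1)
    return scores / passes
```

Averaging over many perturbed passes reduces the variance of the score estimate, at the cost of a uniformly higher reconstruction loss, which does not affect the relative ranking used for detection.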

4. EXPERIMENTS

We performed extensive experiments on both synthetic and real-world data to compare the performance of RCA against the baseline methods and to investigate its robustness to noise due to missing value imputation. The code is included in the supplementary materials.

4.1. RESULTS ON SYNTHETIC DATA

To better understand how RCA overcomes the limitations of conventional autoencoders (AE) on datasets with anomalies, we experimented with a synthetic 2-dimensional dataset. The dataset contains a pair of crescent-shaped moons with Gaussian noise (Pedregosa et al., 2011) representing the normal observations, plus anomalies generated from a 2-dimensional uniform distribution. In this experiment, we vary the proportion of anomalies from 10% to 40% while fixing the sample size at 10,000. Samples with the top-[ε · n] highest anomaly scores are classified as anomalies, where ε is the anomaly ratio.

Figure 2: The first and second rows are the results for the 10% and 40% anomaly ratios, respectively. The last column shows the fraction of points with the highest reconstruction loss that are true anomalies.

We show the results for the 10% (top row) and 40% (bottom row) anomaly ratios in Figure 2. Observe that the performance of both methods decreases with increasing anomaly ratio. However, the results for RCA (third column) are more robust than those for AE (second column). In particular, when the anomaly ratio is 40%, AE fails to capture the true manifold of the normal data, unlike RCA. This result is consistent with Theorem 2, which states that, as the anomaly ratio increases, using the subset of data selected by our algorithm is better than using all the data.
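The scores-to-labels step used in this experiment, flagging the assumed anomaly fraction ε with the highest scores, can be sketched as follows (`label_top_scores` is an illustrative helper, not the authors' code):

```python
import numpy as np

def label_top_scores(scores, eps):
    """Flag the eps fraction of points with the highest anomaly scores
    as anomalies (eps is the assumed anomaly ratio)."""
    scores = np.asarray(scores)
    n_anom = int(round(eps * len(scores)))
    labels = np.zeros(len(scores), dtype=int)
    labels[np.argsort(scores)[::-1][:n_anom]] = 1   # 1 = anomaly
    return labels
```

Note that the AUC metric used in the rest of the paper is threshold-free; this hard labeling is only needed for visualizations such as Figure 2.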

4.2. RESULTS ON REAL-WORLD DATA

For performance comparison, we use 18 benchmark datasets obtained from the Stony Brook ODDS library (Rayana, 2016). A summary description of the data is given in Table 2 in the Appendix. We reserve 60% of the data for training and the remaining 40% for testing. The performance of the competing methods is evaluated based on their area under the ROC curve (AUC) scores. We also performed experiments on the CIFAR10 dataset, for which the results are given in the Appendix. We compared RCA against the following baseline methods: SVDD (deep one-class SVM) (Ruff et al., 2018), VAE (variational autoencoder) (An & Cho, 2015; Kingma & Welling, 2013), DAGMM (deep autoencoding Gaussian mixture model) (Zong et al., 2018), SO-GAAL (Single-Objective Generative Adversarial Active Learning) (Liu et al., 2019), OCSVM (one-class SVM) (Chen et al., 2001), and IF (isolation forest) (Liu et al., 2008). Note that SVDD and DAGMM are two recent deep unsupervised AD methods, while OCSVM and IF are two state-of-the-art AD methods. In addition, we perform an ablation study to compare RCA against three variants: AE (standard autoencoders without collaborative networks), RCA-E (RCA without ensemble evaluation), and RCA-SS (RCA without sample selection). Since the methods are unsupervised, to ensure a fair comparison we maintain similar hyperparameter settings for all the competing DNN-based approaches as much as possible (details can be found in the supplementary materials). Experimental results are reported as average AUC scores across 10 random initializations. Figure 3a summarizes the results of our experiments. The full table can be found in the Appendix. Note that RCA outperforms the deep unsupervised AD methods (SO-GAAL, DAGMM, SVDD) on 17 out of 18 datasets. These results suggest that the strategies employed by RCA are more effective at detecting anomalies than the ones used by the baseline deep unsupervised AD methods.
Surprisingly, some of the more complex DNN baselines such as SO-GAAL, DAGMM, and SVDD perform poorly on these datasets. Their poor performance can be explained as follows. First, most of these baseline methods assume the availability of clean training data, whereas in our experiments the training data was contaminated with anomalies to reflect more realistic settings. Second, we use the same network architecture on every dataset for all the methods (including RCA), since there is no guidance for tuning the network structure in an unsupervised AD task. Finally, as will be discussed in Section 4.3, the performance of conventional unsupervised AD methods such as OCSVM and IF degrades significantly as the amount of missing values in the data increases, unlike the proposed RCA framework.

4.3. RESULTS FOR ABLATION STUDY AND ANOMALY DETECTION WITH MISSING VALUES

As real-world datasets are often imperfect, we compare the performance of RCA and the other baseline methods in terms of their robustness to missing values. Mean imputation is a common approach for dealing with missing values. In this experiment, we add missing values randomly to the features of each benchmark dataset and apply mean imputation to replace them. This imputation process will likely introduce noise into the data. We vary the percentage of missing values from 10% to 50% and compare the average AUC scores of the competing methods. The results are summarized in Table 1, which shows the number of wins, draws, and losses of RCA compared to each baseline method on the 18 benchmark datasets. We also include results from an ablation study to investigate the effectiveness of sample selection and ensemble evaluation. Specifically, we compare RCA against its variants RCA-E, RCA-SS, and AE. The results show that our framework is better than the baselines on the majority of the datasets in almost all settings. In particular, RCA consistently outperforms both DAGMM and SVDD on more than 80% of the datasets, demonstrating the robustness of our algorithm compared to other deep unsupervised AD methods when the training data is contaminated.

Table 1: Comparison of RCA against various competing methods in terms of (#win-#draw-#loss) on 18 benchmark datasets with different missing ratios. RCA-E (no ensemble), RCA-SS (no sample selection), and AE (no ensemble and no sample selection) are used for the ablation study of our method.

Missing Ratio | RCA-E  | RCA-SS | VAE    | SO-GAAL | AE     | DAGMM  | SVDD   | OCSVM  | IF
0.0           | 11-2-5 | 16-0-2 | 14-0-4 | 17-0-1  | 15-0-3 | 18-0-0 | 18-0-0 | 10-1-7 | 10-0-8
0.1           | 12-1-5 | 14-1-3 | 16-1-1 | 16-0-2  | 14-0-4 | 17-0-1 | 18-0-0 | 13-1-4 | 12-0-6
0.2           | 11-1-6 | 13-3-2 | 14-2-2 | 17-0-1  | 13-0-5 | 18-0-0 | 18-0-0 | 15-0-3 | 9-0-9
0.3           | 9-3-6  | 13-1-4 | 15-0-3 | 17-1-0  | 13-0-5 | 18-0-0 | 18-0-0 | 16-0-2 | 14-1-3
0.4           | 10-0-8 | 12-2-4 | 14-0-4 | 15-0-3  | 12-0-6 | 17-0-1 | 18-0-0 | 16-0-2 | 15-0-3
0.5           | 8-3-7  | 10-1-7 | 11-1-6 | 14-0-4  | 9-0-9  | 15-0-3 | 17-0-1 | 14-1-3 | 13-0-5
Additionally, as the missing ratio increases to more than 30%, it outperforms IF and OCSVM on more than 70% of the datasets. On the other hand, the advantage of RCA over its variants AE, RCA-SS, and RCA-E is significant when the missing ratio is less than 40%, but becomes less significant at higher missing ratios. Finally, since the true anomaly ratio is often unknown in practice, we conducted experiments to evaluate the robustness of RCA when ε is overestimated by 5%, 10%, or 20% from its true value on all the datasets. Fig. 3b shows that the AUC scores for RCA do not change significantly even when ε is highly overestimated on most of the datasets.
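The corruption protocol in this experiment (random masking followed by mean imputation) can be sketched as follows (`corrupt_and_impute` is an illustrative helper, not the authors' code):

```python
import numpy as np

def corrupt_and_impute(X, missing_ratio, seed=0):
    """Randomly mask entries of X at the given ratio, then replace each
    missing entry with its column mean over the observed entries."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < missing_ratio
    X[mask] = np.nan
    col_means = np.nanmean(X, axis=0)      # means over observed values only
    rows, cols = np.where(mask)
    X[rows, cols] = col_means[cols]
    return X, mask
```

Because every imputed entry collapses to the column mean, the imputed points look like slightly "blurred" versions of the originals, which is exactly the kind of noise the robustness comparison probes.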

5. CONCLUSION

This paper introduces RCA, a robust collaborative autoencoder framework for unsupervised AD. The framework is designed to overcome limitations of existing deep unsupervised AD methods caused by over-parameterization of the DNNs, which hampers their effectiveness. We theoretically show the effectiveness of our algorithm at preventing model overfitting due to anomalies. In addition, we empirically show that RCA outperforms various state-of-the-art unsupervised AD algorithms in most experimental settings. We also found RCA to be more robust than the other baseline methods to noise introduced by missing value imputation. In the future, we aim to extend the proposed framework to incorporate more than two autoencoders. We will also investigate whether it is possible to relax some of the assumptions behind our theoretical bounds to cover more realistic scenarios.

Tuning the network structure for each dataset is impractical for unsupervised anomaly detection without ground-truth labels available. In our experiments, all baselines and our method use the same network structure across the different datasets. Since there is no official implementation of DAGMM from its authors, our implementation of DAGMM depends heavily on two open-source implementations3. One may also ask why complex deep methods such as DAGMM, SVDD, and SO-GAAL cannot beat shallow methods such as OCSVM and Isolation Forest. To the best of our knowledge, there is no evidence that SVDD, SO-GAAL, or DAGMM perform better than OCSVM and IF on datasets beyond the benchmark image data (CIFAR-10). In fact, the SO-GAAL, OCSVM, and IF results reported in our paper for the ODDS datasets are consistent with the numbers reported in PyOD4, a Python toolkit for outlier detection. Also, Reference [8] compared DAGMM and SVDD (denoted as E2E) against OCSVM; the results shown in Table 1 of that paper are similar to ours.
To verify that our algorithm genuinely outperforms SVDD, and that nothing is wrong with our SVDD implementation, we also conducted experiments on CIFAR-10. Since our method is not specifically designed for image data, we process CIFAR-10 as follows: we use the official PyTorch implementation of VGG19 (pretrained on ImageNet) to extract a 4096-dimensional feature representation of CIFAR-10, and perform anomaly detection for each class (10 sub-datasets). Each sub-dataset consists of 5,000 examples of the normal class and 5% anomalies, which are randomly sampled from the other classes. The training and test sets consist of 80% and 20% of the data, respectively (i.e., the training set has (5000 + 250) × 0.8 = 4200 samples and the test set has 1050 samples). All results are averaged over 5 random seeds and shown in Figure 4. For RCA, DAGMM, and SVDD, we use the same network structure as in our paper. For SVDD-Original, we directly quote the numbers from the original SVDD paper (Ruff et al., 2018).

A.4 PROOF OF THEOREM 1

Let $M = n(1-\epsilon)L_{\max}$ denote the smoothness of the function $F = \sum_{i \notin O} f_i(w)$, and let the update rule be $w^{(t+1)} = w^{(t)} - \eta^{(t)} \sum_i p_i \nabla f_i(w)$.
For standard stochastic gradient descent, by smoothness we have:

$$F(w^{(t+1)}) - F(w^{(t)}) \le \langle \nabla F(w^{(t)}), w^{(t+1)} - w^{(t)} \rangle + \frac{M}{2}\|w^{(t+1)} - w^{(t)}\|^2 \le -\eta^{(t)} \langle \nabla F(w^{(t)}), \nabla f_i(w^{(t)}) \rangle + \frac{(\eta^{(t)})^2 M}{2}\|\nabla f_i(w^{(t)})\|^2 \quad (4)$$

Taking the expectation of $\nabla f_i(w^{(t)})$ under our sampling probability and applying the triangle inequality to the last term, together with $p_i^2 \le p_i$, we have

$$\mathbb{E} F(w^{(t+1)}) - F(w^{(t)}) \le -\eta^{(t)} \Big\langle \nabla F(w^{(t)}), \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)}) \Big\rangle + \sum_i p_i(w^{(t)}) \frac{(\eta^{(t)})^2 M}{2}\|\nabla f_i(w^{(t)})\|^2$$
$$\le -\eta^{(t)} \Big\langle \nabla F(w^{(t)}), \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)}) - \nabla F(w^{(t)}) \Big\rangle - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$

Completing the square and letting $\nabla \hat F(w^{(t)}) = \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)})$, we have

$$\le \frac{\eta^{(t)}}{2}\|\nabla F(w^{(t)})\|^2 + \frac{\eta^{(t)}}{2}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$
$$\le \frac{\eta^{(t)}}{2}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 - \frac{\eta^{(t)}}{2}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$

Moving the gradient norm to the left and taking the total expectation, we have

$$\eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le 2\big[\mathbb{E} F(w^{(t)}) - \mathbb{E} F(w^{(t+1)})\big] + \eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 + (\eta^{(t)})^2 M G^2$$

Summing from $t = 0$ to $t = T$, we have:

$$\sum_{t=0}^{T} \eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le 2\big[\mathbb{E} F(w^{(0)}) - \mathbb{E} F(w^{(T+1)})\big] + \sum_{t=0}^{T} \Big( \eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 + (\eta^{(t)})^2 M G^2 \Big) \le 2B + C\sum_{t=0}^{T} \eta^{(t)} + M G^2 \sum_{t=0}^{T} (\eta^{(t)})^2$$

$$\min_{t=0,1,\dots,T} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \frac{2B}{\sum_t \eta^{(t)}} + \frac{M G^2 \sum_t (\eta^{(t)})^2}{\sum_t \eta^{(t)}} + C \quad (5)$$

By the learning-rate assumption ($\sum_t \eta^{(t)} = \infty$, $\sum_t (\eta^{(t)})^2 < \infty$), the first two terms vanish as $T$ goes to infinity, which yields the convergence stated in Theorem 1. (The $\log(T)$ convergence rate is obtained by assuming the learning rate $\eta^{(t)} = 1/t$, which satisfies the above assumption.) We also provide a better convergence rate than in the submitted manuscript, under a stricter learning-rate setting.

Under review as a conference paper at ICLR 2021

³ https://github.com/danieltan07/dagmm, https://github.com/tnakae/DAGMM
⁴ We use the PyOD implementation (https://github.com/yzhao062/pyod) of SO-GAAL, VAE, IF, and OCSVM.
Starting from Equation 4, we have:

$$\mathbb{E} F(w^{(t+1)}) - F(w^{(t)}) \le -\eta^{(t)} \Big\langle \nabla F(w^{(t)}), \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)}) - \nabla F(w^{(t)}) \Big\rangle - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$

Completing the square in a different way and letting $\nabla \hat F(w^{(t)}) = \sum_i p_i(w^{(t)}) \nabla f_i(w^{(t)})$, we have

$$\le \eta^{(t)} \Big[ \frac{\eta^{(t)}}{2}\|\nabla F(w^{(t)})\|^2 + \frac{1}{2\eta^{(t)}}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 \Big] - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$
$$= \frac{(\eta^{(t)})^2}{2}\|\nabla F(w^{(t)})\|^2 + \frac{1}{2}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 M G^2}{2}$$
$$\le \frac{1}{2}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 - \eta^{(t)}\|\nabla F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 (M+1) G^2}{2}$$

Moving the gradient norm to the left and taking the total expectation, we have

$$\eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \mathbb{E} F(w^{(t)}) - \mathbb{E} F(w^{(t+1)}) + \frac{1}{2}\mathbb{E}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 + \frac{(\eta^{(t)})^2 (M+1) G^2}{2}$$

Summing from $t = 0$ to $t = T$, we have:

$$\sum_{t=0}^{T} \eta^{(t)} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \mathbb{E} F(w^{(0)}) - \mathbb{E} F(w^{(T+1)}) + \sum_{t=0}^{T} \frac{1}{2}\mathbb{E}\|\nabla F(w^{(t)}) - \nabla \hat F(w^{(t)})\|^2 + \sum_{t=0}^{T} \frac{(\eta^{(t)})^2 (M+1) G^2}{2} \le B + \frac{TC}{2} + \sum_{t=0}^{T} \frac{(\eta^{(t)})^2 (M+1) G^2}{2}$$

$$\min_{t=0,1,\dots,T} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \frac{B}{\sum_t \eta^{(t)}} + \frac{(M+1) G^2}{2} \cdot \frac{\sum_t (\eta^{(t)})^2}{\sum_t \eta^{(t)}} + \frac{TC}{\sum_t \eta^{(t)}}$$

Assuming a constant learning rate $\eta$, and letting $H = (M+1)G^2$, we can write the right-hand side as a function of the learning rate:

$$f(\eta) = \frac{B}{T\eta} + \frac{H\eta}{2} + \frac{C}{\eta}$$

We study the minimum of a function of the form $f(x) = \frac{a}{x} + bx$, which is attained at $x^* = \sqrt{a/b}$ with $f(x^*) = 2\sqrt{ab}$. Thus, letting $a = \frac{B}{T} + C$ and $b = \frac{H}{2}$, the bound on the right-hand side is minimized by the optimal learning rate

$$\eta^* = \sqrt{\frac{2B}{HT} + \frac{2C}{H}}$$

and the optimal value of the right-hand side is

$$f(\eta^*) = 2\sqrt{\Big(\frac{B}{T} + C\Big)\frac{H}{2}} = \sqrt{\frac{2BH}{T} + 2CH} \le \sqrt{\frac{2BH}{T}} + \sqrt{2CH} = O\Big(\frac{1}{\sqrt{T}}\Big) + O(\sqrt{C})$$

Thus we can conclude that $\min_{t=0,1,\dots,T} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \to O(1/\sqrt{T}) + O(\sqrt{C})$. This is an even better result than Theorem 1 in our paper, since we achieve a convergence rate of $1/\sqrt{T}$ instead of $\log(T)$, while the error term depends on $\sqrt{C}$ instead of $C$. However, achieving this rate requires a stricter condition on the learning rate.
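The optimal learning rate above relies on the elementary fact that $f(x) = a/x + bx$ is minimized at $x^* = \sqrt{a/b}$ with value $2\sqrt{ab}$. As a quick numerical sanity check (a sketch with illustrative constants, not part of the proof):

```python
import math

def f(x, a, b):
    # The bound shape appearing in the proof: f(x) = a/x + b*x
    return a / x + b * x

# Illustrative constants standing in for a = B/T + C and b = H/2
a, b = 3.0, 0.75
x_star = math.sqrt(a / b)        # analytic minimizer: sqrt(4) = 2
f_star = 2.0 * math.sqrt(a * b)  # analytic minimum value: 2*sqrt(2.25) = 3

# f at the minimizer matches the closed form...
assert abs(f(x_star, a, b) - f_star) < 1e-12
# ...and nearby points are no better, consistent with x* being the minimum.
for x in [0.5 * x_star, 0.9 * x_star, 1.1 * x_star, 2.0 * x_star]:
    assert f(x, a, b) >= f_star
print(x_star, f_star)  # → 2.0 3.0
```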
A.5 PROOF OF THEOREM 2

We analyze the stationary point obtained by using all of the data. Let $w^*_{ns}$ denote the stationary point obtained using the entire data, and $w^*$ the stationary point obtained using only the clean data. By the stationarity condition for $w^*_{ns}$, we have

$$p \sum_{i \notin O} \nabla f_i(w^*_{ns}) = -q \sum_{j \in O} \nabla f_j(w^*_{ns}) \quad \Longrightarrow \quad \Big\| p \sum_{i \notin O} \nabla f_i(w^*_{ns}) \Big\| = \Big\| q \sum_{j \in O} \nabla f_j(w^*_{ns}) \Big\|$$

Upper bound for the left-hand side (using $\sum_{i \notin O} \nabla f_i(w^*) = 0$):

$$\Big\| p \sum_{i \notin O} \nabla f_i(w^*_{ns}) \Big\| = \Big\| p \sum_{i \notin O} \big( \nabla f_i(w^*_{ns}) - \nabla f_i(w^*) \big) \Big\| \le p \sum_{i \notin O} L_i \|w^*_{ns} - w^*\| \le (1-\epsilon) n L_{\max} \max_i \|w^*_{ns} - w^*\| \le M\delta$$

where $\delta$ is defined in Equation 3 of the submitted manuscript. Another upper bound for the left-hand side follows from the right-hand side:

$$\Big\| p \sum_{i \notin O} \nabla f_i(w^*_{ns}) \Big\| = \Big\| q \sum_{j \in O} \nabla f_j(w^*_{ns}) \Big\| \le n\epsilon G$$

Thus, we have $\|\nabla F(w^*_{ns})\| \le \min(n\epsilon G, M\delta)$.

From Theorem 1, by setting $\eta^* = \sqrt{2B/(HT) + 2C/H}$, it follows directly that $\min_{t} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \sqrt{2BH/T + 2CH}$. Alternatively, under the assumption $\sum_t \eta^{(t)} = \infty$, $\sum_t (\eta^{(t)})^2 < \infty$, Equation 5 gives

$$\min_{t} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le O\Big(\frac{1}{\sum_t \eta^{(t)}}\Big) + O\Big(\frac{\sum_t (\eta^{(t)})^2}{\sum_t \eta^{(t)}}\Big) + C$$

We study the worst case, i.e., when the upper bound of our algorithm is better than the upper bound without sample selection. Thus we want the following to hold as $t$ goes to infinity:

$$\sqrt{\frac{2BH}{T} + 2CH} \le \big( \min(n\epsilon G, M\delta) \big)^2$$

As $t$ goes to infinity, by the learning-rate assumption, we have

$$\sqrt{C} \le \frac{1}{\sqrt{2H}} \big( \min(n\epsilon G, M\delta) \big)^2 \quad \text{(fixed optimal learning rate)}$$

or similarly, in the diminishing learning-rate setting,

$$C \le \big( \min(n\epsilon G, M\delta) \big)^2 \quad \Big( \sum_t \eta^{(t)} = \infty, \ \sum_t (\eta^{(t)})^2 < \infty \Big)$$

Since both sides are upper bounds, we conclude that, in the worst case, our solution is better than the algorithm without sample selection, which establishes the existence claim.

A.6 PROOF OF THEOREM 3

We now prove the correctness of the algorithm. We assume $w^*$ satisfies $f_i(w^*) < f_j(w^*)$ for all $x_i \notin O$, $x_j \in O$. Without loss of generality, we define $\delta$ and $\phi \ge 1$ such that

$$\delta \le \|w^*_j - w^*\| \le \phi\delta, \quad \forall j \in O \quad (7)$$

We first answer the question: under what conditions is our solution perfect?
We define a neighborhood around the optimal point. In our anomaly detection setting, the anomalies should have a higher loss than the normal data points:

$$B_r(w^*) = \big\{ w \mid f_i(w) < f_j(w), \ \forall i \notin O, j \in O, \ \|w - w^*\| \le r \big\}$$

Such a ball with radius $r$ must exist since the loss functions above are smooth. To better describe $B_r(w^*)$, we analyze the boundary of the ball. Denote the set of intersections between the loss surfaces of the normal data and the abnormal data as

$$\Omega_w = \big\{ w \mid f_{x_i \notin O}(w) = f_{x_j \in O}(w) \big\}$$

Then we can write the boundary point of the ball as

$$w_B = \arg\min_{w \in \Omega_w} \|w - w^*\|$$

At $w_B$, by smoothness and strong convexity, we have

$$f_i(w_B) \le f_i(w^*) + \langle w_B - w^*, \nabla f_i(w^*) \rangle + \frac{L_i}{2}\|w_B - w^*\|^2$$
$$f_j(w_B) \ge f_j(w^*_j) + \langle w_B - w^*_j, \nabla f_j(w^*_j) \rangle + \frac{\mu_j}{2}\|w_B - w^*_j\|^2$$

By the definition of $w_B$ and the equal-minimum assumption, we may assume without loss of generality that the minima are 0, which does not affect the result. Then we have

$$f_i(w_B) \le \frac{L_i}{2}\|w_B - w^*\|^2, \qquad -f_j(w_B) \le -\frac{\mu_j}{2}\|w_B - w^*_j\|^2$$

Adding the two inequalities (and using $f_i(w_B) = f_j(w_B)$ at the boundary), we have:

$$\|w_B - w^*\|^2 \ge \frac{\mu_j}{L_i}\|w_B - w^*_j\|^2$$

Our goal is now to upper bound the term $\|\nabla F(w^*_{sr})\|$. According to Theorem 1, for the fixed optimal learning rate,

$$\min_{t=0,1,\dots,T} \mathbb{E}\|\nabla F(w^{(t)})\|^2 \le \sqrt{\frac{2BH}{T} + 2CH}$$

We therefore have:

$$\|w^*_{sr} - w^*\|^2 \le \frac{1}{m^2}\sqrt{\frac{2BH}{T} + 2CH}, \qquad \|w_B - w^*\|^2 \ge \Big(\frac{\delta}{1+\kappa}\Big)^2$$

Thus, a sufficient condition for $\|w^*_{sr} - w^*\|^2 \le \|w_B - w^*\|^2$ is

$$\frac{1}{m^2}\sqrt{\frac{2BH}{T} + 2CH} \le \Big(\frac{\delta}{1+\kappa}\Big)^2$$

By using the optimal fixed learning rate and assuming $T$ is sufficiently large, rearranging the terms gives:

$$\sqrt{C} \le \frac{1}{\sqrt{2H}}\Big(\frac{m\delta}{1+\kappa}\Big)^2 = O\Big(\frac{\delta}{\kappa}\Big)^2, \qquad \text{i.e.,} \quad C \le O\Big(\frac{\delta}{\kappa}\Big)^4$$

Similarly, from Theorem 1 under the diminishing learning-rate condition ($\sum_t \eta^{(t)} = \infty$, $\sum_t (\eta^{(t)})^2 < \infty$), we have

$$C \le \Big(\frac{m\delta}{1+\kappa}\Big)^2 = O\Big(\frac{\delta}{\kappa}\Big)^2$$

We conclude that as long as the above inequality holds, our algorithm is guaranteed to return the correct answer.

In the submitted manuscript, due to space limitations, we only showed results for the 10% and 40% anomaly ratios. In this section, we provide the results for 10%, 20%, 30%, and 40% in Figure 6.



In practice, users would typically specify the top-k anomalies to be examined and verified, where k = εn. More results for 20% and 30% can be found in the appendix.
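The top-k selection described above can be sketched in a few lines. This is an illustrative snippet (the function name and toy errors are ours, not from the paper): given per-point reconstruction errors and the anomaly ratio ε, flag the k = round(εn) points with the largest errors.

```python
import numpy as np

def top_k_anomalies(recon_errors, anomaly_ratio):
    """Return indices of the top-k points by reconstruction error,
    where k = round(anomaly_ratio * n), as described in the text."""
    n = len(recon_errors)
    k = int(round(anomaly_ratio * n))
    # argsort ascending, then take the k largest errors
    return np.argsort(recon_errors)[::-1][:k]

errors = np.array([0.1, 2.5, 0.3, 0.2, 1.8])  # toy reconstruction errors
idx = top_k_anomalies(errors, 0.4)            # k = 2 for n = 5
print(sorted(idx.tolist()))                   # → [1, 4]
```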



Figure 1: An illustration of the training phase for the proposed RCA framework.

Robust Collaborative Autoencoders (Training Phase)
input: training data Xtrn, test data Xtst, reconstruction loss function L, anomaly ratio ε, dropout rate r > 0, and maximum training epochs max_epoch
return: trained autoencoders A*1 and A*2
Initialize autoencoders A1 and A2; sample selection rate β = 1; best loss ξ* = +∞
while epoch ≤ max_epoch do
    for minibatch S in Xtrn do
        Ŝ1 ← forward(A1, S, dropout = 0);  Ŝ2 ← forward(A2, S, dropout = 0)
        c1 ← sample_selection(L(Ŝ1, S), β);  c2 ← sample_selection(L(Ŝ2, S), β)
        Ŝ1 ← forward(A1, S[c2], dropout = r);  Ŝ2 ← forward(A2, S[c1], dropout = r)
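The selection-and-swap step in the listing above can be illustrated with a minimal numpy sketch. This is not the paper's implementation (the real autoencoders are deep networks trained with backpropagation); here, toy linear maps stand in for A1 and A2, and MSE is assumed as the reconstruction loss L, purely to show how each model selects its lowest-loss samples and then trains on the other model's selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_selection(losses, beta):
    """Keep the beta fraction of samples with the smallest loss
    (the sample_selection step in the listing above)."""
    k = max(1, int(np.floor(beta * len(losses))))
    return np.argsort(losses)[:k]

def forward(A, S):
    """Toy linear stand-in for an autoencoder forward pass."""
    return S @ A

# Toy data: 8 samples, 4 features; A1/A2 are illustrative near-identity maps.
S = rng.normal(size=(8, 4))
A1 = np.eye(4) + 0.01 * rng.normal(size=(4, 4))
A2 = np.eye(4) + 0.01 * rng.normal(size=(4, 4))

beta = 0.75  # keep the 75% of samples each model reconstructs best

loss1 = ((forward(A1, S) - S) ** 2).mean(axis=1)  # L(S_hat1, S), per sample
loss2 = ((forward(A2, S) - S) ** 2).mean(axis=1)  # L(S_hat2, S), per sample
c1 = sample_selection(loss1, beta)
c2 = sample_selection(loss2, beta)

# Collaborative swap: A1 would now train on A2's selection and vice versa.
S_for_A1, S_for_A2 = S[c2], S[c1]
print(S_for_A1.shape, S_for_A2.shape)  # each has floor(0.75 * 8) = 6 samples
```

The swap is the key design choice: because each autoencoder is trained only on samples the *other* model reconstructs well, an anomaly that one model happens to overfit is unlikely to be passed along by its partner.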

(a) AUC comparison of RCA against DNN baselines; details are given in Table 3 in the Appendix. (b) Results of RCA when overestimating ε by 0.05, 0.1, and 0.2; details are given in Table 4 in the Appendix.

Figure 3: Experimental results on 18 benchmark datasets from ODDS repository.

Figure 4: Results on CIFAR-10 for SVDD, DAGMM, and our method. The x-axis is the normal class and the y-axis is the AUC score. SVDD is our implementation trained on the contaminated data described above; SVDD-Original is the number reported in the original SVDD paper (Ruff et al., 2018), where clean training data is used. Even when trained on contaminated data, our model still outperforms Deep SVDD in most settings, while SVDD performs poorly when the training data is contaminated (i.e., SVDD performs much worse than SVDD-Original).

By the triangle inequality, we have $\|w_B - w^*\| + \|w_B - w^*_j\| \ge \|w^*_j - w^*\|$. Combining the two inequalities above, we have

$$\|w_B - w^*\| \ge \sqrt{\frac{\mu_j}{L_i}} \big( \|w^*_j - w^*\| - \|w_B - w^*\| \big)$$

Here $L^c_{\max}$ denotes the maximum Lipschitz smoothness over the clean data and $\mu^o_{\min}$ denotes the minimum strong convexity over the anomalies. Defining $F(w) = \sum_{x_i \notin O} f_i(w)$, the function $F(w)$ is $m = n(1-\epsilon)\mu_{\min}$ strongly convex. Similarly, $F(w)$ is $M = n(1-\epsilon)L_{\max}$ smooth. Then we have:

$$\|w^*_{sr} - w^*\| \le \frac{1}{m}\|\nabla F(w^*_{sr}) - \nabla F(w^*)\| = \frac{1}{m}\|\nabla F(w^*_{sr})\|, \qquad \|w^*_{sr} - w^*\|^2 \le \frac{1}{m^2}\|\nabla F(w^*_{sr})\|^2$$

Figure 6: The rows show the results for 10%, 20%, 30%, and 40% anomaly ratios from top to bottom, respectively. The last column shows the fraction of points with the highest reconstruction loss that are true anomalies.

C.3 RESULTS FOR TWO MOON

Summary of benchmark data, where N is sample size and d is number of features.

Performance comparison of RCA against various baseline methods in terms of average AUC scores and their standard deviations across 10 random seeds.

Sensitivity analysis of ε: the first number is the averaged AUC score and the second number is the standard deviation. All experiments are repeated over 10 random seeds.


There are several possible reasons why SVDD and DAGMM perform poorly on ODDS. First, the results reported in the SVDD paper assume the training data has no contamination. The DAGMM paper has only 2 datasets that overlap with our experiments (thyroid, arrhythmia), and the results reported there are also for clean data. In contrast, the training data used in our experiments are contaminated with anomalies. Also, the DAGMM paper acknowledges that its performance degrades as the amount of contamination increases. Furthermore, unlike our experiments, the results reported in the DAGMM paper use a different network structure for each dataset to obtain good performance; this is possible because they have clean data, which makes it feasible to tune the network structure and hyperparameters for each dataset.

The left figure illustrates a case with a large κ. According to our bound, a large κ is undesirable for the correctness guarantee: the dangerous zone in the left figure is very large compared to the right figure, which corresponds to a large L^c_max, a small µ^o_min, and a small κ. In the right figure, the orange curve will be dropped, since the probability of the orange curve being sampled is very small under our method.

B PROBABILITY OF SAMPLING WITH REPLACEMENT

In the submitted manuscript, we only showed the probability of sampling without replacement due to space limitations. Here, we show the probability of sampling with replacement:
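The formula itself is omitted in this extraction, but the standard quantity for sampling with replacement can be sketched as follows. This is a generic illustration (our own notation, not necessarily the paper's exact expression): with per-draw probabilities $p_i$ and $k$ independent draws with replacement, a sample is included at least once with probability $1 - (1 - p_i)^k$.

```python
import numpy as np

def inclusion_prob_with_replacement(p, k):
    """Probability that each sample is drawn at least once in k
    independent draws with replacement, given per-draw probabilities p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p) ** k

# Uniform per-draw probabilities over n samples, drawing k = n times:
n = 1000
probs = inclusion_prob_with_replacement(np.full(n, 1.0 / n), n)
# For large n this approaches 1 - 1/e ≈ 0.632, the classic bootstrap rate.
print(round(float(probs[0]), 3))  # → 0.632
```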

C SUPPLEMENTARY EXPERIMENTAL RESULTS

In this section, we present supplementary experimental results that were omitted from the submitted manuscript due to space limitations.

C.1 NETWORK HYPERPARAMETER FOR TWO-MOON DATA

The network structure for both autoencoders and our method is the same for a fair comparison. The network has a one-layer encoder and a two-layer decoder. The hidden layers use a 0.5 dropout ratio, and the number of hidden nodes in all layers is set to 128. All activation functions are tanh. The maximum number of training epochs is 200, and the stopping criterion is the loss on the test data. The ensemble size is set to 1000 for our method. The training ratio is 60% and the test ratio is 40%. We use the Adam optimizer for both methods with an initial learning rate of 3e-4. The batch size for both methods is 128.

C.2 NETWORK HYPERPARAMETERS FOR BENCHMARK DATA

Specifically, we use a 6-layer fully connected autoencoder with 128 hidden nodes in every layer except for the bottleneck layer, which has 10 hidden nodes. We also set the dropout rate to 0.5 for every hidden layer. The deep neural networks are trained using Adam, with the learning rate initialized to 3e-4 and a batch size of 128. The maximum number of epochs is set to 100, with a stopping criterion determined by the minimum reconstruction loss on the test data. The reconstruction loss for the opt-digits dataset is the cross-entropy loss, since the features of this dataset are all discrete; all other datasets use the mean squared error loss. The activation function is LeakyReLU with α = 0.1. For SVDD, we pretrain the autoencoder for 50 epochs and use the encoder (except its last layer) as the initialization of SVDD.
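The layer layout described above can be made concrete with a small numpy sketch of the forward pass (128-unit hidden layers, a 10-unit bottleneck, LeakyReLU with α = 0.1, and dropout 0.5 on hidden layers). The actual experiments use a PyTorch implementation trained with Adam; this is only an illustration of the architecture, with an illustrative input dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.1):
    # LeakyReLU with alpha = 0.1, as in the benchmark experiments
    return np.where(x > 0, x, alpha * x)

# Layer widths: input d -> 128 -> 128 -> 10 (bottleneck) -> 128 -> 128 -> d
d = 32  # illustrative input dimensionality (datasets vary)
sizes = [d, 128, 128, 10, 128, 128, d]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x, train=False, dropout=0.5):
    h = x
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:      # no activation on the output layer
            h = leaky_relu(h)
            if train:                 # inverted dropout on hidden layers only
                mask = rng.random(h.shape) > dropout
                h = h * mask / (1.0 - dropout)
    return h

x = rng.normal(size=(4, d))
recon = forward(x)                    # evaluation mode: no dropout
mse = ((recon - x) ** 2).mean()       # mean squared reconstruction loss
print(recon.shape)                    # → (4, 32)
```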

