OUTLIER PRESERVING DISTRIBUTION MAPPING AUTOENCODERS

Abstract

State-of-the-art deep outlier detection methods map data into a latent space with the aim of having outliers far away from inliers in this space. Unfortunately, this is shown to often fail: the divergence penalty these methods adopt pushes outliers into the same high-probability regions as inliers. We propose a novel method, OP-DMA, that successfully addresses this problem. OP-DMA succeeds in mapping outliers to low-probability regions in the latent space by leveraging a novel Prior-Weighted Loss (PWL) built on the insight that outliers are likely to have a higher reconstruction error than inliers. Building on this insight, OP-DMA explicitly encourages outliers to be mapped to low-probability regions of its latent space by weighting the reconstruction error of individual points by a multivariate Gaussian probability density function evaluated at each point's latent representation. We formally prove that OP-DMA succeeds in mapping outliers to low-probability regions. Our experimental study demonstrates that OP-DMA consistently outperforms state-of-the-art methods on a rich variety of outlier detection benchmark datasets.

1. INTRODUCTION

Background. Outlier detection, the task of discovering abnormal instances in a dataset, is critical for applications ranging from fraud detection and erroneous measurement identification to system fault detection (Singh & Upadhyaya, 2012). Given that outliers are by definition rare, it is often infeasible to obtain enough labeled outlier examples that are representative of all the forms the outliers could take. Consequently, unsupervised outlier detection methods that do not require prior labeling of inliers or outliers are frequently adopted (Chandola et al., 2009).

State-of-the-Art Deep Learning Methods for Outlier Detection. Deep learning methods for outlier detection commonly utilize the reconstruction error of an autoencoder model as the outlier score (Sakurada & Yairi, 2014; Vu et al., 2019). However, directly using the reconstruction error as the outlier score has a major flaw. As the learning process converges, the reconstruction errors of both outliers and inliers tend to converge to the average reconstruction error (that is, to the same outlier score), making them indistinguishable (Beggel et al., 2019). This is demonstrated in Figure 1a, which shows that the average reconstruction error of outliers converges to that of the inliers. To overcome this shortcoming, recent work (Beggel et al., 2019; Perera et al., 2019) utilizes the distribution-mapping capabilities of generative models that encourage data to follow a prior distribution in the latent space. These cutting-edge methods assume that while the mapping of inlier points will follow the target prior distribution, outliers will not due to their anomalous nature. Instead, outliers will be mapped to low-probability regions of the prior distribution, making them easy to detect as outliers (Beggel et al., 2019; Perera et al., 2019). However, this widely held assumption has been shown to not hold in practice (Perera et al., 2019).
Unfortunately, as shown in Figure 1b, both inliers and outliers are still mapped to the same high-probability regions of the target prior distribution, making them difficult to distinguish.

Problem Definition. Given a dataset X ∈ R^M of multivariate observations, let f : R^M → R^N, N ≤ M, be a function from the multivariate feature space of X to a latent space such that f(X) ∼ P_Z, where P_Z is a known and tractable prior probability density function. The dataset X is composed as X = X_O ∪ X_I, where X_O and X_I are the sets of outlier and inlier points, respectively. During training, it is unknown whether any given point x ∈ X is an outlier or an inlier. Intuitively, our goal is to find a function f that maps instances of the dataset X into a latent space with a known distribution, such that outliers are mapped to low-probability regions and inliers to high-probability regions. More formally, we define unsupervised distribution-mapping outlier detection as the problem of finding a function f* with the aforementioned properties of f that maximizes the number of outliers x_o ∈ X_O and inliers x_i ∈ X_I for which P_Z(f*(x_o)) < P_Z(f*(x_i)) holds.

Challenges.

To address the open problem defined above, the following challenges exist:

1. Overpowering divergence penalty. Distribution-mapping methods utilize a divergence penalty to achieve a latent-space mapping of the input data that has a high probability of following a target prior distribution. While the data overall should follow this prior, a solution must be found that instead maps outliers to low-probability regions of the prior. Having the data match the prior overall while mapping outliers to low-probability regions creates a conflict, as the two tasks are diametrically opposed. Achieving such a mapping requires overpowering the divergence penalty in order to map outliers to low-probability regions of the latent space.

2. Unknown outlier status. In unsupervised outlier detection, points carry no labels during training indicating whether they are outliers or inliers. This unsupervised scenario, while common in practice (Singh & Upadhyaya, 2012), makes it challenging to design strategies that explicitly coerce outliers to be mapped to low-probability regions.

Our OP-DMA Approach. In this work, we propose the Outlier Preserving Distribution Mapping Autoencoder (OP-DMA). Our core idea is a novel Prior-Weighted Loss (PWL) function that resolves the two conflicting tasks of mapping the input data to a prior distribution while encouraging outliers to be mapped to low-probability regions of that prior. The PWL directly addresses the shortcomings of existing distribution-mapping outlier detection methods (Vu et al., 2019; Perera et al., 2019), and to the best of our knowledge is the first unsupervised cost function that explicitly encourages outliers to be mapped to low-probability regions. We assume that outliers will have a high reconstruction error during the initial stages of training, which causes the PWL to place them in low-probability (low PDF) regions of the latent space.
In this way, the PWL overcomes the challenge of the overpowering divergence penalty. It succeeds in mapping outliers to low-probability regions (far from the mean of the latent distribution) even though each input point's outlier status is unknown. Our OP-DMA framework is pluggable, meaning off-the-shelf distance-based outlier methods can be flexibly plugged in post-transformation. Our key contributions are as follows:

1. We propose OP-DMA, a novel distribution-mapping autoencoder that effectively separates outliers from inliers in the latent space without knowing or making assumptions about the original distribution of the data in the feature space.

2. We design the Prior-Weighted Loss (PWL), which, when coupled with a divergence penalty, encourages outliers to be mapped to low-probability regions while inliers are mapped to high-probability regions of the latent space of an autoencoder.

3. We provide a rigorous theoretical proof that the optimal solution for OP-DMA places outliers further than inliers from the mean of the distribution of the data in the latent space.

4. We demonstrate experimentally that OP-DMA consistently outperforms other state-of-the-art outlier detection methods on a rich variety of real-world benchmark outlier datasets.

Significance. OP-DMA is a versatile outlier detection strategy, as it can handle input data with arbitrary distributions in the feature space while not making any distance or density assumptions on the data. To the best of our knowledge, we are the first to propose a loss function that explicitly encourages outliers to be mapped to low-probability regions while inliers are mapped to high-probability regions. Our PWL approach is pluggable and can easily be incorporated into alternate outlier detectors. Our ideas could also spur further research into various prior-weighted loss functions.

2. RELATED WORK

State-of-the-art deep outlier detection methods fall into one of three categories: 1) autoencoders coupled with classic outlier detectors (Erfani et al., 2016; Chalapathy et al., 2018), 2) reconstruction error-based outlier detection methods (Zhou & Paffenroth, 2017; Chen et al., 2017; Sabokrou et al., 2018; Xia et al., 2015), or 3) generative outlier detection methods (Perera et al., 2019; Vu et al., 2019; Liu et al., 2019).

1) Autoencoders coupled with classic outlier detectors project data into a lower dimensional latent space before performing outlier detection on that latent representation. These methods make the strict assumption that outliers in the original space will remain outliers in the latent space, yet they fail to explicitly encourage this in the mapping function.

2) Reconstruction error-based outlier detection methods utilize the reconstruction error of an autoencoder network to identify outliers. They typically use the reconstruction error directly as the anomaly score (An & Cho, 2015). More recent work separates outliers into a low-rank matrix analogous to RPCA (Zhou & Paffenroth, 2017) or introduces a separate discriminator network (Sabokrou et al., 2018). However, as shown in (Beggel et al., 2019), for autoencoders the reconstruction error of outliers often converges to that of inliers, which negatively impacts the performance of such reconstruction error methods.

3) Generative outlier detection methods leverage deep generative models (Goodfellow et al., 2014; Kingma & Welling, 2013) to produce a latent space whose distribution is encouraged to match a known prior, so that thereafter an outlier method appropriate for the prior can be applied to the latent space (Vu et al., 2019), or a discriminator can identify outliers in the latent space (Vu et al., 2019) or in both the latent space and the reconstructed space (Perera et al., 2019).
However, as discussed in Section 1, in practice both inliers and outliers are mapped to the prior distribution, since outliers mapped to low-probability regions generally incur a high cost from the divergence term that matches the latent distribution to the prior. OP-DMA shares characteristics with each of these three categories. However, unlike the other methods in these categories, OP-DMA actively encourages outliers to be mapped to low-probability regions instead of merely assuming that this will be the case. OP-DMA is a generative outlier method that uses the reconstruction error to encourage outliers to be mapped to low-probability regions. Further, it can flexibly be paired with nearly any classic outlier detector after distribution mapping.

3. PROPOSED APPROACH: OP-DMA

Overview of approach. OP-DMA consists of three main components:

1. A distribution-mapping autoencoder (DMA) that maps a dataset X from the feature space R^M into a lower dimensional latent space R^N, such that the encoded data in the latent space follows a known probability distribution P_Z. This step is crucial, as it makes it easy for OP-DMA to identify low-probability regions of the latent space (where outliers should be mapped): after the distribution mapping, we can explicitly calculate the probability density function (PDF) of the latent space so long as we selected a prior distribution with a known PDF.

2. A novel Prior-Weighted Loss (PWL) function for distribution mapping that encourages outliers to be mapped to low-probability regions of the latent space, solving both the challenge of the overpowering divergence penalty and that of unknown outlier status.

3. A traditional outlier detection method used to identify outliers in the transformed latent space. The choice of outlier detection method is flexible as long as it is amenable to the prior distribution P_Z selected in step 1 of OP-DMA. For instance, when a Gaussian prior is used, OP-DMA utilizes a classical distance-based outlier detection method for step 3.

These steps are described in the following subsections and illustrated in Figure 2.

3.1. DISTRIBUTION MAPPING AUTOENCODER (DMA)

In order to use prior-weighting to map outliers to low-probability regions of a known PDF in a latent space, our distribution-mapping method must meet two design requirements:

1. A one-to-one mapping between each original data point, its latent representation, and the reconstructed data point must be established, so that each data point's reconstruction is unique and can be determined, and vice versa.

2. The divergence term must impose a cost based on how well a batch of latent data points matches the prior overall, rather than requiring individual data points to have a high probability of being a draw from the prior.

To meet these requirements, we select the Wasserstein AutoEncoder (WAE) (Tolstikhin et al., 2017) as the foundation for our distribution mapping. WAEs are distribution-mapping autoencoders that minimize the Wasserstein distance between the original data and its reconstruction, while mapping the input data to a latent space with a known prior distribution. To see why we base our distribution-mapping technique on this method, consider the WAE objective function for encoder network Q and decoder network G:

W^λ_c(X, Y) = inf_Q E_{P_X} E_{Q(Z|X)} [c(X, G(Z))] + λ D(P_Q, P_Z)     (1)

where the first term is the reconstruction error and the second the divergence penalty. The first term on the right-hand side of Equation 1 corresponds to the reconstruction error between the input data and the reconstructed data for cost function c. The second term D is a divergence penalty between the distribution of the latent space and the prior distribution, with λ a constant weight that determines how strongly that divergence is penalized. Let us deterministically produce the latent representation Q(X) and output G(Q(X)|X) (by using Q(X) = δ_{µ(X)}, where µ is some function mapping the input data set X to Q(X), for instance). It is now clear why Wasserstein autoencoders are an appropriate choice for our distribution-mapping method: the reconstruction error term E_{P_X} E_{Q(Z|X)} [c(X, G(Z))] in Equation 1 represents a one-to-one correspondence between input data, its latent representation, and the reconstructed output (meeting requirement 1). Additionally, D is a batch-level cost term that is incurred if the latent representation of a batch does not match the prior distribution, but it does not require individual points to be mapped to high-probability regions of the prior (meeting requirement 2). However, we note that WAEs unfortunately do not encourage outliers in the feature space to remain outliers in the latent space. Consider D to be a discriminator network. Then D is likely to learn a boundary around the high-probability region of the prior distribution. Thus the encoder network Q will be penalized for mapping an outlier to a low-probability region outside of the boundary found by D, as the discriminator would correctly identify it as a generated point.

Figure 2: An overview of OP-DMA. Data is mapped to a latent space with a prior distribution using a Prior-Weighted Loss (PWL), which encourages outliers to be mapped to low-probability regions. This allows for distance-based outlier detection.
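To make the batch-level nature of the divergence penalty concrete, here is a minimal numpy sketch of a WAE-style objective, assuming the MSE cost and inverse multiquadratic MMD kernel adopted later in Section 3.4; the `scale` parameter and the use of a simple biased batch MMD estimate (rather than the unbiased estimator the paper uses) are our simplifications, and the function names are ours.

```python
import numpy as np

def imq_kernel(a, b, scale=1.0):
    """Inverse multiquadratic kernel k(u, v) = scale / (scale + ||u - v||^2)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return scale / (scale + d2)

def mmd2(z, z_prior, scale=1.0):
    """Biased batch estimate of squared MMD between latent codes and prior draws."""
    k_zz = imq_kernel(z, z, scale)
    k_pp = imq_kernel(z_prior, z_prior, scale)
    k_zp = imq_kernel(z, z_prior, scale)
    return k_zz.mean() + k_pp.mean() - 2.0 * k_zp.mean()

def wae_loss(x, x_recon, z, z_prior, lam=1.0):
    """WAE-style objective: batch MSE reconstruction error plus lambda * MMD penalty.

    The penalty depends only on how the batch of codes z matches the prior
    draws z_prior overall, not on any individual point's likelihood."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    return recon + lam * mmd2(z, z_prior)
```

Note how an individual code can sit far in the tail without being penalized, as long as the batch as a whole matches the prior; this is exactly the property (requirement 2) that OP-DMA exploits.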

3.2. PRIOR-WEIGHTED LOSS (PWL): NOVEL LOSS FUNCTION FOR OUTLIER COERCION

We now describe our novel Prior-Weighted Loss (PWL) that tackles the above challenge of WAEs mapping outliers to high-probability regions. The key idea is that outliers will initially have higher reconstruction error than inliers during training. This core idea draws from the area of anomaly detection using reconstruction probability (An & Cho, 2015). We thus propose the Prior-Weighted Loss (PWL), a novel cost term that weights each data point's reconstruction error term in Equation 1 by the point's latent likelihood P_Z(Q(x)), i.e., the PDF of the latent space's prior distribution evaluated at the point's latent representation. The prior-weighted loss c' is defined as

c'(x, G(Q(x))) := c(x, G(Q(x))) · P_Z(Q(x))     (2)

As the latent likelihood is by definition large in high-probability regions and small in low-probability regions, points with a high reconstruction error that are mapped to high-probability regions are penalized more than points with a high reconstruction error that are mapped to low-probability regions. Since outliers are assumed to incur a high reconstruction error (at least during early training epochs), reducing the penalty for poorly reconstructed points that have been mapped to low-probability regions of the prior encourages the network to map outliers to these regions. We now introduce our OP-DMA objective W^λ_{c'} as:

W^λ_{c'} = inf_{Q: P_Q = P_Z} E_{P_X} E_{Q(Z|X)} [c'(X, G(Z))] + λ D(P_Q, P_Z)

where the first term is the prior-weighted loss and the second the divergence penalty. Since we have significantly modified the reconstruction error term in the Wasserstein autoencoder loss function, a natural question is whether OP-DMA still corresponds to an autoencoder. Specifically, will the decoder's output still match the input data to the encoder?
If this does not hold, two issues could arise: 1) the latent features learned by the network might be unrelated to the input, and hence useless in cases where the latent representation is needed for a downstream task; 2) more importantly for our outlier detection task, if the network is no longer encouraged to reconstruct the input, the crucial property that outliers have a higher reconstruction error may no longer hold, in which case the "reconstruction error" may be meaningless. Fortunately, we can show that our OP-DMA loss function still corresponds to a Wasserstein divergence between the input and reconstructed distributions (Theorem 1). For this, we must demonstrate that the prior-weighted cost c' meets the requirements of a Wasserstein divergence's cost function, namely that c'(x_1, x_2) ≥ 0 for all x_1, x_2 ∈ supp(P), that c'(x, x) = 0 for all x ∈ supp(P), and that E_γ[c'(x_1, x_2)] ≥ 0 for all γ ∈ Γ[P, P_Z].

Theorem 1. Let W_c be a Wasserstein distance. Then W_{c'} is a Wasserstein distance, with c' the prior-weighted c.
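The prior-weighted cost c' = c · P_Z(Q(x)) of Equation 2 can be sketched directly. The snippet below is an illustrative numpy/scipy implementation, assuming an MSE cost and a standard multivariate Gaussian prior; the function name and batch conventions are ours, not from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def prior_weighted_loss(x, x_recon, z):
    """Per-point prior-weighted cost c'(x) = c(x, G(Q(x))) * P_Z(Q(x)),
    with c the squared error and P_Z a standard multivariate Gaussian.

    A badly reconstructed point parked in a low-probability latent region
    contributes less loss than an equally bad point in a high-probability
    region, so the encoder is nudged to push outliers into the tails."""
    n_latent = z.shape[1]
    prior = multivariate_normal(mean=np.zeros(n_latent), cov=np.eye(n_latent))
    recon = np.sum((x - x_recon) ** 2, axis=1)  # c(x, G(Q(x)))
    return recon * prior.pdf(z)                 # weight by latent likelihood
```

For two points with identical reconstruction error, the one whose latent code lies near the prior mean incurs a strictly larger loss than the one whose code lies in the tail, which is the coercion mechanism the text describes.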

3.3. UNSUPERVISED STATISTICAL OUTLIER DETECTION METHOD

Intuitively, an ideal mapping would place all inliers within regions where the latent likelihood is greater than some value V, and all outliers into regions where the latent likelihood is less than that value V. The core result fundamental to our work is that this scenario is indeed the optimal solution for the loss function of OP-DMA, as stated in Theorem 2.

Theorem 2. Let Q be an encoder network such that D(P_Q, P_Z, F) = 0, where D(A, B, F) is the Maximum Mean Discrepancy between A and B, F is the set of bounded continuous functions, and P_Z = N(0, Σ). Let us consider our dataset X as a centered random variable X : Ω → R^n, X ∼ P_X. Let X(A), A ⊂ Ω, be outliers and let H = Ω \ A be the inliers, where ∫_{X(A)} p_X(x) dx = α. Further, let c'(a, G(Q(a))) > c'(h, G(Q(h))) for all a ∈ X(A), h ∈ X(H). Then, the optimal solution of OP-DMA is to map such that ||Q(X(A))||_mahalanobis ≥ δ and ||Q(X(H))||_mahalanobis < δ, where δ satisfies

∫_0^{δ²} t^{n/2-1} e^{-t/2} / (2^{n/2} Γ(n/2)) dt = 1 - α     (3)

This important result implies that after transformation with OP-DMA, outliers can be separated from inliers using a simple distance metric. This lays a solid foundation for a simple yet effective outlier detection scheme. Namely, we first transform the dataset X to a latent representation with a multivariate Gaussian prior distribution, as justified by Theorem 2. Then, as Equation 3 states, outliers can be isolated using a simple distance-based approach. More specifically, any standard outlier detection method that finds outliers in Gaussian distributions (e.g., the EllipticEnvelope method (Rousseeuw & Driessen, 1999)) can be used to find outliers in the latent space.

Table 1: Description of real-world datasets' dimensionality, size, and outlier percentage. Most datasets are taken from the standard ODDs database, while RC Flu was taken from the RealityCommons Social Evolution database. We also evaluate on the well-known MNIST dataset.
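The threshold δ of Theorem 2 is the square root of a chi-squared quantile, so it can be computed without evaluating the integral in Equation 3 explicitly. A small scipy sketch (the function name is ours):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_threshold(alpha, n_latent):
    """Radius delta separating inliers from outliers under a N(0, I) prior.

    The squared Mahalanobis distance of an n-dimensional Gaussian vector is
    chi-squared with n degrees of freedom, so delta^2 is simply the
    (1 - alpha) quantile of that chi-squared distribution (Equation 3)."""
    return np.sqrt(chi2.ppf(1.0 - alpha, df=n_latent))
```

For example, with a 2-dimensional latent space and α = 0.05, roughly 95% of latent inliers should fall within Mahalanobis distance `mahalanobis_threshold(0.05, 2)` of the mean.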
3.4. PULLING IT ALL TOGETHER: UNSUPERVISED OUTLIER DETECTION USING OP-DMA

We now summarize OP-DMA, our end-to-end outlier detection approach. First, the input data is transformed to match a prior distribution with a distribution-mapping autoencoder using our novel Prior-Weighted Loss (PWL) (Equation 2). We choose this prior to be a multivariate Gaussian distribution with zero mean and identity covariance, as justified by Theorem 2. Then, an Elliptic Envelope (Rousseeuw & Driessen, 1999) is used to identify outliers. The outlier detection process is outlined in Appendix A.3. We use the unbiased estimator of Maximum Mean Discrepancy (MMD) from (Gretton et al., 2012) for the divergence term. For the kernel k of MMD, we use the inverse multiquadratic kernel as in (Tolstikhin et al., 2017), and Mean Squared Error (MSE) for c.
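As a rough illustration of the final detection step, the following numpy sketch scores latent points by squared Mahalanobis distance under the empirical mean and covariance, then flags the top α fraction. It is a simplified stand-in for the EllipticEnvelope / Minimum Covariance Determinant estimator actually used, which fits a robust mean and covariance on the most concentrated subset of points rather than on all of them.

```python
import numpy as np

def flag_outliers(z, alpha):
    """Flag the alpha fraction of latent points farthest from the mean in
    Mahalanobis distance.

    Simplified stand-in for EllipticEnvelope: here mean and covariance are
    the plain empirical estimates, not the robust MCD estimates."""
    mu = z.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(z, rowvar=False))
    diff = z - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared Mahalanobis distances
    cutoff = np.quantile(d2, 1.0 - alpha)
    return d2 > cutoff
```

After OP-DMA's mapping, Theorem 2 suggests that such a distance cutoff cleanly separates the two groups, which is why a classical Gaussian-assuming detector suffices in the latent space.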

4. EXPERIMENTAL EVALUATION

Compared Methods. We compare OP-DMA to state-of-the-art distribution-mapping outlier detection methods. These include methods that perform outlier detection on the latent space of a WAE (Tolstikhin et al., 2017), a VAE (Kingma & Welling, 2013), and an Adversarial Autoencoder (AAE) (Makhzani et al., 2015), all with a Gaussian prior but without our PWL idea. We test against MO-GAAL (Liu et al., 2019) and ALOCC (Sabokrou et al., 2018), two state-of-the-art deep generative outlier detection models. We also test against LOF (Breunig et al., 2000) and OC-SVM (Schölkopf et al., 2001), two popular state-of-the-art non-deep outlier detection methods.

Data Sets. We evaluate on a rich variety of real-world data sets from the ODDs benchmark data store (Rayana, 2016). These datasets cover a wide range of feature-space dimensionality, from 6 to 274, and different outlier contamination percentages, from 0.2% to 32%. Table 1 breaks down the statistics of each dataset. We evaluate all methods on their ability to detect subjects who have a fever from smartphone-sensed data using the MIT Social Evolution dataset (Madan et al., 2011) (RC Fever) to demonstrate OP-DMA's effectiveness for mobile healthcare. Finally, we also evaluate on the MNIST dataset. We use all MNIST images of "7"s as inliers, and randomly sample "0"s as outliers such that "0"s account for ∼1% of the data. Since outlier detection is unsupervised, with no supervised training phase, we perform outlier detection on the entire dataset instead of introducing train/test splits. In each dataset, all points are labeled as either inlier or outlier as ground truth.

Table 2: Weighted F1 scores with 95% confidence intervals for OP-DMA vs. state-of-the-art methods on benchmark outlier detection datasets. The best performing method is in bold, the second-best underlined. No confidence intervals are given for LOF and OC-SVM as they are deterministic.
We emphasize that these ground truth labels are only used for evaluation, never for training any method.

Metrics. Due to the large class imbalance inherent to outlier detection, we use the F1 score as our performance metric (Lazarevic-McManus et al., 2008), as is common when evaluating outlier detection methods (An & Cho, 2015; Zhou & Paffenroth, 2017; Zong et al., 2018).

Parameter Configurations of Methods. Encoders and decoders of all methods consist of 3-layer neural networks, where each decoder mirrors the structure of its encoder. The number of nodes in the hidden layer of each network is a hyperparameter chosen from {5, 6, 9, 15, 18, 100}. The number of nodes in the latent layer varies over {2, 3, 6, 9, 15}. The regularization parameter λ is chosen such that the reconstruction error is on the same order of magnitude as the MMD error for the first epoch. We use the standard parameters of MO-GAAL from the authors' code. We also use the standard configuration of ALOCC from the authors' code, except that we add an additional dense layer at the beginning of each subnetwork; ALOCC assumes its input to be images of a certain shape, and the additional dense layer transforms the input data from its original dimensionality into the required shape. For LOF and OC-SVM we use the standard parameters from Scikit-Learn.

Experiment 1: Versatile Anomaly Detection. We validate the versatility of our OP-DMA method by showing that it consistently outperforms state-of-the-art methods on a rich variety of benchmark datasets. As shown in Table 2, OP-DMA outperforms the other methods on the majority (9/13) of the benchmark datasets. OP-DMA's superior performance is not limited to datasets with either a high or low percentage of outliers: it is the best performing method on the dataset with the largest ratio of outliers (Satellite) as well as on that with the smallest ratio (Cover).

Experiment 2: Sensitivity to Contamination Parameter.
The contamination parameter α is used to fit the standard outlier method, Elliptic Envelope, plugged into our OP-DMA framework on the encoded data after training. We thus test the sensitivity of EllipticEnvelope to the value of the contamination parameter by evaluating the F1 score of outlier detection on the Satellite dataset mapped by OP-DMA. The results (Figure 3(a)) show that as long as this parameter is not significantly underestimated, the F1 score is robust to different values of the contamination parameter.

Experiment 3: Verifying that Outliers are Mapped to Low-Probability Regions. We transformed data from a multi-modal distribution in R^4, consisting of a mixture of two Gaussians centered at (0,0,0,0) and (5,5,5,5), to a standard normal Gaussian in R^2. Outliers in the original space were drawn from a uniform distribution and made up 2.4% of the total data. As Figure 3(b) shows, outliers are successfully mapped far from the inlier data points. Furthermore, the average value of the prior PDF evaluated at the outlier points is 0.02, while the average for inliers is 0.08, confirming that outliers are mapped to lower-probability regions than inliers.
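For reproducibility, the synthetic setup of Experiment 3 can be sketched as follows; the uniform sampling box and the inlier count are our assumptions, as the text specifies only the mixture centers and the 2.4% outlier fraction.

```python
import numpy as np

def make_synthetic(n_inliers=1000, outlier_frac=0.024, seed=0):
    """Two Gaussian modes in R^4 centered at the origin and at (5, 5, 5, 5),
    plus uniform outliers making up outlier_frac of the combined data.

    The box bounds (-10, 15) are an assumption; only the centers and the
    2.4% outlier fraction come from the experiment description."""
    rng = np.random.default_rng(seed)
    half = n_inliers // 2
    inliers = np.vstack([
        rng.normal(0.0, 1.0, size=(half, 4)),
        rng.normal(5.0, 1.0, size=(n_inliers - half, 4)),
    ])
    n_out = int(round(outlier_frac * n_inliers / (1.0 - outlier_frac)))
    outliers = rng.uniform(-10.0, 15.0, size=(n_out, 4))
    labels = np.r_[np.zeros(n_inliers), np.ones(n_out)]  # 1 marks an outlier
    return np.vstack([inliers, outliers]), labels
```

Feeding such data through a distribution-mapping autoencoder with a 2-dimensional Gaussian prior reproduces the qualitative setting of Figure 3(b).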

5. CONCLUSION

We have introduced OP-DMA, an autoencoder-based solution that, unlike prior methods, is truly outlier-preserving in its distribution mapping. That is, OP-DMA maps outliers in the feature space to low-probability regions of the latent space, in which a multivariate standard normal Gaussian prior distribution is enforced. Outliers are consequently easily identifiable in the latent space. Our experimental study comparing OP-DMA to state-of-the-art methods on a collection of benchmark outlier detection datasets shows that it consistently outperforms these methods on the majority of the datasets. We have also demonstrated that our method does not incur a significant increase in running time over state-of-the-art methods.

A APPENDIX

A.1 PROOF OF THEOREM 1

Proof. Since W_c is a Wasserstein divergence, we know that c(x_1, x_2) ≥ 0 for all x_1, x_2 ∈ supp(P), that c(x, x) = 0 for all x ∈ supp(P), and that E_γ[c(x_1, x_2)] ≥ 0 for all γ ∈ Γ[P, P_Z]. Since P_Z(z) ≥ 0 for all z, c' = c · P_Z also fulfills the three aforementioned properties. Thus, W_{c'} is a Wasserstein divergence.

A.2 PROOF OF THEOREM 2

Proof. The Mahalanobis distance of Q(X) can itself be expressed as a random variable δ with δ² = Q(X) Σ^{-1} Q(X)^T. Let Φ_δ be the CDF of δ. Then Φ_δ(d) = P(δ ≤ d) = P(δ² ≤ d²) = Φ_{δ²}(d²). Let Y = Q(X) M^{-1}, where M^T M = Σ is the Cholesky decomposition of the covariance Σ. Since D(P_Q, P_Z, F) = 0, and D(A, B, F) = 0 iff A = B, we know that Q(X) ∼ N(0, Σ). Thus, since Q(X) is normally distributed and centered, Y is normally distributed with identity covariance. Since δ² = Q(X) Σ^{-1} Q(X)^T = Y Y^T, Φ_{δ²} is the CDF of the sum of squares of n normally distributed variables with mean 0 and σ = 1, i.e., the chi-squared CDF with n degrees of freedom. The inverse chi-squared CDF thus gives us the distance δ such that a fraction 1 - α of the points lie within δ, as stated in Equation 3.

Now, let us assume that for some parameter choice Θ' for Q we have α P(Q(X(A)|Θ') ≤ δ) = β with β > 0. Consequently, (1 - α) P(Q(X(H)|Θ') > δ) = β, since P(Q(X) > δ) = α and ∫_{X(A)} p_X(x) dx = α. Conversely, let us assume that there is a parameter configuration Θ such that α P(Q(X(A)|Θ) ≤ δ) = 0 and so (1 - α) P(Q(X(H)|Θ) > δ) = 0. Since P_Z ∼ N(0, Σ), we have P_Z(d_1) < P_Z(d_2) whenever ||d_1||_mahalanobis > ||d_2||_mahalanobis. Thus, since we assume c(a, G(Q(a))) > c(h, G(Q(h))) for all a ∈ X(A), h ∈ X(H), the expected prior-weighted cost under Θ' strictly exceeds that under Θ. Thus, the optimal solution for OP-DMA's cost function is one that maps outliers to regions with a larger Mahalanobis distance than that of inliers.



Footnote URLs:
1. http://odds.cs.stonybrook.edu/
2. http://realitycommons.media.mit.edu/socialevolution4.html
3. http://yann.lecun.com/exdb/mnist/
4. https://github.com/leibinghe/GAAL-based-outlier-detection
5. https://github.com/khalooei/ALOCC-CVPR2018



Figure 1: The dataset used in (a) and (b): inliers are MNIST "1"s while outliers are MNIST "0"s, such that outliers account for roughly 20% of the total data. (a) The left plot shows the average reconstruction error of outliers divided by the average reconstruction error of inliers during the training of a standard autoencoder. As the plot shows, the ratio of the errors of outliers to inliers goes to 1, meaning outliers are difficult to distinguish from inliers after training. (b) The right plot shows inliers and outliers in the 2-dimensional latent space of a Wasserstein autoencoder (a popular type of distribution-mapping autoencoder). As seen, the outliers are in high-probability regions of the latent space and are thus difficult to separate from the inliers.

Figure 3: (a) F1 score of OP-DMA for various values of the contamination parameter on the Satimage-2 dataset. (b) Outliers and inliers in OP-DMA's latent space.

The inequality chain underlying the proof of Theorem 2:

E_{P_X} E_{Q(Z|X)} [c'(x_p, G(Q(x_p|Θ')))] = E_{P_X} E_{Q(Z|X)} [c(x_p, G(Q(x_p|Θ'))) · P_Z(Q(x_p|Θ'))] > E_{P_X} E_{Q(Z|X)} [c(x_p, G(Q(x_p|Θ))) · P_Z(Q(x_p|Θ))] = E_{P_X} E_{Q(Z|X)} [c'(x_p, G(Q(x_p|Θ)))].

A.3 OP-DMA ALGORITHM

Algorithm 1: Unsupervised Outlier Detection with OP-DMA
Require: Regularization coefficient λ; contamination parameter α; encoder network Q_Φ and decoder network G_Θ initialized with random weights Φ and Θ; dataset X
while Θ, Φ not converged do
    Sample {x_1, ..., x_n} from X, {z_1, ..., z_n} from N(0, I), and {z̃_1, ..., z̃_n} from Q_Φ(Z|X)
    Update weights Φ and Θ by descending the gradient of
        (1/n) Σ_{i=1}^n c(x_i, G_Θ(z̃_i)) · P_Z(z̃_i) + λ · MMD_k({z_i}, {z̃_i})
end while
Find D_min = {Q_Φ(x_i), Q_Φ(x_j), ..., Q_Φ(x_k)}, |D_min| = (1 - α)|D|, with the Minimum Covariance Determinant estimator, inf_Σ̂ Det{Σ̂}
Find the estimated mean μ̂ from D_min
return ||Q_Φ(x_i)||_mahalanobis = sqrt((Q_Φ(x_i) - μ̂) Σ̂^{-1} (Q_Φ(x_i) - μ̂)^T) for x_i ∈ D as outlier scores

