OUTLIER PRESERVING DISTRIBUTION MAPPING AUTOENCODERS

Abstract

State-of-the-art deep outlier detection methods map data into a latent space with the aim of having outliers far away from inliers in this space. Unfortunately, this is shown to often fail: the divergence penalty these methods adopt pushes outliers into the same high-probability regions as inliers. We propose a novel method, OP-DMA, that successfully addresses this problem. OP-DMA succeeds in mapping outliers to low-probability regions in the latent space by leveraging a novel Prior-Weighted Loss (PWL) that utilizes the insight that outliers are likely to have a higher reconstruction error than inliers. Building on this insight, OP-DMA explicitly encourages outliers to be mapped to low-probability regions of its latent space by weighting the reconstruction error of individual points by a multivariate Gaussian probability density function evaluated at each point's latent representation. We formally prove that OP-DMA succeeds in mapping outliers to low-probability regions. Our experimental study demonstrates that OP-DMA consistently outperforms state-of-the-art methods on a rich variety of outlier detection benchmark datasets.

1. INTRODUCTION

Background. Outlier detection, the task of discovering abnormal instances in a dataset, is critical for applications ranging from fraud detection and measurement error identification to system fault detection (Singh & Upadhyaya, 2012). Since outliers are by definition rare, it is often infeasible to obtain enough labeled outlier examples that are representative of all the forms outliers can take. Consequently, unsupervised outlier detection methods that do not require prior labeling of inliers or outliers are frequently adopted (Chandola et al., 2009).

State-of-the-Art Deep Learning Methods for Outlier Detection. Deep learning methods for outlier detection commonly utilize the reconstruction error of an autoencoder model as the outlier score (Sakurada & Yairi, 2014; Vu et al., 2019). However, directly using the reconstruction error as the outlier score has a major flaw. As the learning process converges, outliers and inliers tend to converge to the same average reconstruction error (i.e., to the same outlier score), making them indistinguishable (Beggel et al., 2019). This is demonstrated in Figure 1a, which shows that the ratio of the average reconstruction error of outliers to that of inliers converges to 1. To overcome this shortcoming, recent work (Beggel et al., 2019; Perera et al., 2019) utilizes the distribution-mapping capabilities of generative models that encourage data to follow a prior distribution in the latent space. These cutting-edge methods assume that while inlier points will be mapped to follow the target prior distribution, outliers will not due to their anomalous nature; instead, outliers will be mapped to low-probability regions of the prior distribution, making them easy to detect as outliers (Beggel et al., 2019; Perera et al., 2019). However, this widely held assumption has been shown to not hold in practice (Perera et al., 2019).
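The reconstruction-error scoring scheme described above can be sketched as follows. This is a minimal illustration, not the paper's method: the data is synthetic and `X_hat` stands in for the output of a trained autoencoder's decoder.

```python
import numpy as np

def reconstruction_scores(X, X_hat):
    # Per-point squared reconstruction error, used directly as the outlier
    # score: the points with the largest scores are flagged as outliers.
    return np.sum((X - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # toy data standing in for real features
X_hat = X + 0.1 * rng.normal(size=(100, 8))   # stand-in for an autoencoder's reconstruction
scores = reconstruction_scores(X, X_hat)
flagged = np.argsort(scores)[-5:]             # indices of the 5 highest-scoring points
```

As the text notes, this scheme degrades as training converges: inlier and outlier scores drift toward the same value, so the ranking above loses its discriminative power.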
Unfortunately, as shown in Figure 1b, both inliers and outliers are still mapped to the same high-probability regions of the target prior distribution, making them difficult to distinguish.

Problem Definition. Given a dataset X ⊂ R^M of multivariate observations, let f : R^M → R^N, N ≤ M, be a function from the multivariate feature space of X to a latent space such that f(X) ∼ P_Z, where P_Z is a known and tractable prior probability density function. The dataset X is composed as X = X_O ∪ X_I, where X_O and X_I are the sets of outlier and inlier points, respectively. During training, it is unknown whether any given point x ∈ X is an outlier or an inlier. Intuitively, our goal is to find a function f that maps instances of the dataset X into a latent space with a known distribution such that outliers are mapped to low-probability regions and inliers to high-probability regions. More formally, we define unsupervised distribution-mapping outlier detection as the problem of finding a function f* with the aforementioned properties of f that maximizes the number of outliers x_o ∈ X_O and inliers x_i ∈ X_I for which P_Z(f*(x_o)) < P_Z(f*(x_i)) holds.
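The success criterion P_Z(f*(x_o)) < P_Z(f*(x_i)) can be made concrete with a small numerical sketch. Here we assume P_Z is a standard 2-D Gaussian (a common choice of prior, though the problem statement allows any tractable density), and the two latent codes are hypothetical values of f*:

```python
import numpy as np

def standard_gaussian_pdf(z):
    # Density of the standard multivariate normal N(0, I_d) at a single point z.
    d = z.shape[0]
    return np.exp(-0.5 * z @ z) / (2 * np.pi) ** (d / 2)

z_inlier = np.array([0.2, -0.1])   # hypothetical latent code near the mode of P_Z
z_outlier = np.array([3.5, 3.0])   # hypothetical latent code far in the tail

# f* solves the problem when this inequality holds for as many
# outlier/inlier pairs as possible:
assert standard_gaussian_pdf(z_outlier) < standard_gaussian_pdf(z_inlier)
```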

Challenges.

To address the open problem defined above, the following challenges exist:

1. Overpowering divergence penalty. Distribution-mapping methods utilize a divergence penalty to achieve a latent-space mapping of the input data that has a high probability of following a target prior distribution. While the data overall should follow this prior, a solution must be found that instead maps outliers to low-probability regions of the prior. Having the data match the prior overall while mapping outliers to low-probability regions creates a conflict, as the two tasks are diametrically opposed. Achieving such a mapping requires overpowering the divergence penalty in order to push outliers to low-probability regions of the latent space.

2. Unknown outlier status. In unsupervised outlier detection, points carry no labels during training indicating whether they are outliers or inliers. This unsupervised scenario, while common in practice (Singh & Upadhyaya, 2012), makes it challenging to design strategies that explicitly coerce outliers to be mapped to low-probability regions.

Our OP-DMA Approach. In this work, we propose the Outlier Preserving Distribution Mapping Autoencoder (OP-DMA). Our core idea is a novel Prior-Weighted Loss (PWL) function that solves the two conflicting tasks of mapping the input data to a prior distribution while encouraging outliers to be mapped to low-probability regions of that prior. The PWL directly addresses the shortcomings of existing distribution-mapping outlier detection methods (Vu et al., 2019; Perera et al., 2019), and to the best of our knowledge it is the first unsupervised cost function that explicitly encourages outliers to be mapped to low-probability regions. We assume that outliers will have a high reconstruction error during the initial stages of training, which causes the PWL to place them in low-probability (low-PDF) regions of the latent space.
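Assuming the prior P_Z is a standard multivariate Gaussian, the PWL idea of weighting each point's reconstruction error by the prior density at its latent code can be sketched as below. This is an illustrative simplification, not the paper's exact loss; the function names and the mean-aggregation are our own choices.

```python
import numpy as np

def gaussian_pdf(Z):
    # Standard multivariate normal density N(0, I) at each row of Z.
    d = Z.shape[1]
    return np.exp(-0.5 * np.sum(Z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

def prior_weighted_loss(X, X_hat, Z):
    # Weight each point's reconstruction error by the prior density at its
    # latent code Z = f(X). A point with a large reconstruction error (likely
    # an outlier) therefore reduces the loss most by moving to a low-density
    # (low-PDF) region of the latent space.
    rec_err = np.sum((X - X_hat) ** 2, axis=1)
    return np.mean(gaussian_pdf(Z) * rec_err)
```

For a point with high reconstruction error, the loss is smaller when its latent code sits in the tail of the prior than at the mode, which is exactly the pressure that pushes suspected outliers outward.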
This way, the PWL overcomes the challenge of overpowering the divergence penalty. It succeeds in mapping outliers to low-probability regions (far from the mean of the latent distribution) even though each input point's outlier status is unknown. Our OP-DMA framework is pluggable, meaning off-the-shelf distance-based outlier methods can be flexibly plugged in post-transformation. Our key contributions are as follows:



Figure 1: The dataset used in (a) and (b) consists of inliers taken from MNIST "1"s and outliers taken from MNIST "0"s, with outliers accounting for roughly 20% of the total data. (a) The left plot shows the average reconstruction error of outliers divided by the average reconstruction error of inliers during the training of a standard autoencoder. As the plot shows, the ratio of errors for outliers to inliers goes to 1, meaning outliers are difficult to distinguish from inliers after training. (b) The right plot shows inliers and outliers in the 2-dimensional latent space of a Wasserstein Autoencoder (a popular type of distribution-mapping autoencoder). As seen, the outliers lie in high-probability regions of the latent space and are thus difficult to separate from the inliers.

