NOVELTY DETECTION VIA ROBUST VARIATIONAL AUTOENCODING

Abstract

We propose a new method for novelty detection that can tolerate high corruption of the training points, whereas previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to high corruption, we incorporate the following four changes into the common VAE: (1) extracting crucial features of the latent code by a carefully designed dimension-reduction component for distributions; (2) modeling the latent distribution as a mixture of a low-rank Gaussian for the inliers and a full-rank Gaussian for the outliers, where testing uses only the inlier model; (3) applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler (KL) divergence; and (4) using a least absolute deviation error for reconstruction. We establish both the robustness to outliers and the suitability to low-rank modeling of the Wasserstein metric, as opposed to the KL divergence. We demonstrate state-of-the-art results on standard benchmarks for novelty detection.

1. INTRODUCTION

Novelty detection refers to the task of detecting testing data points that deviate from the underlying structure of a given training dataset (Chandola et al., 2009; Pimentel et al., 2014; Chalapathy & Chawla, 2019). It finds crucial applications in areas such as insurance and credit fraud (Zhou et al., 2018), mobile robots (Neto & Nehmzow, 2007) and medical diagnosis (Wei et al., 2018). Ideally, novelty detection requires learning the underlying distribution of the training data, although it is sometimes sufficient to learn a significant feature, geometric structure or another property of the training data. One can then apply the learned distribution (or property) to detect deviating points in the test data. This is different from outlier detection (Chandola et al., 2009), in which one does not have training data and has to determine the deviating points in a sufficiently large dataset, assuming that the majority of points share the same structure or properties. We note that novelty detection is equivalent to the well-known one-class classification problem (Moya & Hush, 1996). In this problem, one needs to identify members of a class in a test dataset, and consequently distinguish them from "novel" data points, given training points from this class. The points of the main class are commonly referred to as inliers and the novel ones as outliers. Novelty detection is also commonly referred to as semi-supervised anomaly detection. In this terminology, the notion of being "semi-supervised" is different from the usual one: it emphasizes that only the inliers are trained, with no restriction on the fraction of training points. On the other hand, the unsupervised case has no training (we referred to this setting above as "outlier detection"), and in the supervised case there are training datasets for both the inliers and the outliers.
We remark that some authors refer to semi-supervised anomaly detection as the setting where a small amount of labeled data is provided for both the inliers and outliers (Ruff et al., 2020). There is a myriad of solutions to novelty detection. Nevertheless, such solutions often assume that the training set is purely sampled from a single class or that it has a very low fraction of corrupted samples. This assumption is only valid when the area of investigation has been carefully studied and there are sufficiently precise tools to collect data. However, there are important scenarios where this assumption does not hold. One scenario includes new areas of study, where it is unclear how to distinguish between normal and abnormal points. For example, at the beginning of the COVID-19 pandemic it was hard to diagnose COVID-19 patients and distinguish them from other patients with pneumonia. Another scenario occurs when it is very hard to make precise measurements, for example, when working with the highly corrupted images obtained in cryogenic electron microscopy (cryo-EM). Therefore, we study a robust version of novelty detection that allows a nontrivial fraction of corrupted samples, namely outliers, within the training set. We solve this problem by using a special variational autoencoder (VAE) (Kingma & Welling, 2014). Our VAE is able to model the underlying distribution of the uncorrupted data, despite nontrivial corruption. We refer to our new method as "Mixture Autoencoding with Wasserstein penalty", or "MAW". To put it in context, we first review previous works and then explain our contributions in view of these works.
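To make the general pipeline concrete before reviewing prior work, the following toy sketch follows the common reconstruction-based paradigm: fit a model to the training data, then score each test point by how poorly the model explains it. It uses a linear (PCA) model rather than a VAE, and the data, dimensions, and noise level are illustrative assumptions, not the setup of our experiments.

```python
import numpy as np

def fit_pca(X, d):
    """Fit a d-dimensional PCA model: return the mean and top-d principal directions."""
    mu = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]

def novelty_score(X, mu, V):
    """Reconstruction error: distance of each point to the learned affine subspace."""
    Z = (X - mu) @ V.T      # project onto the subspace
    X_hat = mu + Z @ V      # reconstruct
    return np.linalg.norm(X - X_hat, axis=1)

rng = np.random.default_rng(0)
# Training inliers lying near a 2-D subspace of R^10 (no corruption in this toy case)
W = rng.normal(size=(2, 10))
X_train = rng.normal(size=(500, 2)) @ W + 0.01 * rng.normal(size=(500, 10))
mu, V = fit_pca(X_train, d=2)

x_in = rng.normal(size=(1, 2)) @ W        # test inlier, on the subspace
x_out = 3.0 * rng.normal(size=(1, 10))    # test outlier, generic point off the subspace
s_in = novelty_score(x_in, mu, V)[0]
s_out = novelty_score(x_out, mu, V)[0]
assert s_in < s_out    # the inlier receives a lower novelty score
```

A test point is then declared novel when its score exceeds a threshold chosen on validation data. The difficulty addressed in this paper arises when the training matrix `X_train` itself contains a nontrivial fraction of off-subspace outliers, which contaminate the fitted model.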

1.1. PREVIOUS WORK

Solutions to one-class classification and novelty detection either estimate the density of the inlier distribution (Bengio & Monperrus, 2005; Ilonen et al., 2006) or determine a geometric property of the inliers, such as their boundary set (Breunig et al., 2000; Schölkopf et al., 2000; Xiao et al., 2016; Wang & Lan, 2020; Jiang et al., 2019). When the inlier distribution is well approximated by a low-dimensional linear subspace, Shyu et al. (2003) propose to distinguish between inliers and outliers via Principal Component Analysis (PCA). To address more general nonlinear low-dimensional structures, one may use autoencoders (or restricted Boltzmann machines), which nonlinearly generalize PCA (Goodfellow et al., 2016, Ch. 2) and whose reconstruction error naturally provides a score for membership in the inlier class. Instances of this strategy with various architectures include Zhai et al. (2016); Zong et al. (2018); Sabokrou et al. (2018); Perera et al. (2019); Pidhorskyi et al. (2018). In all of these works, except for Zong et al. (2018), the training set is assumed to solely represent the inlier class. In fact, Perera et al. (2019) observed that interpolation of a latent space, which was trained using digit images of a complex shape, can lead to digit representations of a simple shape. If there are also outliers (with a simple shape) among the inliers (with a complex shape), encoding the inlier distribution becomes even more difficult. Nevertheless, some previous works have explored the possibility of a corrupted training set (Xiao et al., 2016; Wang & Lan, 2020; Zong et al., 2018). In particular, Xiao et al. (2016); Zong et al. (2018) test artificial instances with at most 5% corruption of the training set, and Wang & Lan (2020) consider ratios of 10%, but with very small numbers of training points. In this work, we consider corruption ratios up to 30%, with a method that aims to estimate the distribution of the training set, not just a geometric property.

VAEs (Kingma & Welling, 2014) have been commonly used for generating distributions with reconstruction scores and are thus natural for novelty detection without corruption. They determine the latent code of an autoencoder via variational inference (Jordan et al., 1999; Blei et al., 2017). Alternatively, they can be viewed as autoencoders for distributions that penalize the Kullback-Leibler (KL) divergence of the latent distribution from the prior distribution. The first VAE-based method for novelty detection was suggested by An & Cho (2015); it was recently extended by Daniel et al. (2019), who modified the training objective. A variety of VAE models were also proposed for special anomaly detection problems that are different from novelty detection (Xu et al., 2018; Zhang et al., 2019; Pol et al., 2019). Current VAE-based methods for novelty detection do not perform well when the training data is corrupted. Indeed, the learned distribution of any such method also represents the corruption, that is, the outlier component. To the best of our knowledge, no effective solution has been proposed for collapsing the outlier mode so that the trained VAE represents only the inlier distribution.

An adversarial autoencoder (AAE) (Makhzani et al., 2016) and a Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) can be considered variants of the VAE. The penalty term of an AAE takes the form of a generative adversarial network (GAN) (Goodfellow et al., 2016), where its generator is the encoder. A WAE generalizes the AAE with a framework that minimizes the Wasserstein metric between the sample distribution and the inference distribution; it reformulates the corresponding objective function so that it can be implemented in the form of an AAE.

There are two relevant lines of work on robustness to outliers in linear modeling that can be used in nonlinear settings via autoencoders or VAEs. Robust PCA aims to deal with sparse elementwise corruption of a data matrix (Candès et al., 2011; De La Torre & Black, 2003; Wright et al., 2009; Vaswani & Narayanamurthy, 2018). Robust subspace recovery (RSR) aims to address general corruption of selected data points and thus better fits the framework of outliers (Watson, 2001; De La Torre & Black, 2003; Ding et al., 2006; Zhang et al., 2009; McCoy & Tropp, 2011; Xu et al., 2012; Lerman & Zhang, 2014; Zhang & Lerman, 2014; Lerman et al., 2015; Lerman & Maunu, 2017; Maunu et al., 2019; Lerman & Maunu, 2018; Maunu & Lerman, 2019). Autoencoders that use robust PCA for anomaly detection tasks were proposed in Chalapathy et al. (2017); Zhou & Paffenroth (2017). Dai et al. (2018) show that a VAE can be interpreted as a nonlinear robust PCA problem. Nevertheless, explicit regularization is often required to improve robustness to sparse corruption in VAEs (Akrami et al., 2019; Eduardo et al., 2020). RSR was successfully applied to outlier detection by Lai et al. (2020). One can apply their work to the different setting of novelty detection; however, our proposed VAE formulation seems to work better.
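The preference for the Wasserstein-1 metric over the KL divergence can be previewed with a toy 1-D experiment (this is only an illustrative sketch under assumed sample sizes and outlier magnitudes, not the paper's formal argument): perturbing a single training point to an extreme value moves the empirical Wasserstein-1 distance by an amount proportional to the outlier's magnitude, whereas a KL-based Gaussian fit, which depends on the sample variance, reacts roughly quadratically.

```python
import numpy as np

def w1_empirical(x, y):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions:
    the mean absolute difference of the sorted samples (quantile coupling)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def kl_gauss_to_std_normal(x):
    """Closed-form KL( N(mean(x), std(x)^2) || N(0, 1) ) for a Gaussian fit of x."""
    m, s = x.mean(), x.std()
    return np.log(1.0 / s) + (s**2 + m**2) / 2.0 - 0.5

rng = np.random.default_rng(1)
clean = rng.normal(size=1000)   # inlier sample from the standard normal prior

results = []
for t in (10.0, 100.0):
    corrupted = clean.copy()
    corrupted[0] = t            # a single outlier of magnitude t
    results.append((w1_empirical(corrupted, clean),
                    kl_gauss_to_std_normal(corrupted)))

(w1_a, kl_a), (w1_b, kl_b) = results
# W1 grows about linearly in the outlier magnitude; the KL of the Gaussian fit
# grows much faster, since the sample variance absorbs the outlier quadratically.
assert w1_b / w1_a < 20     # roughly 10x for a 10x larger outlier
assert kl_b / kl_a > 50     # far steeper growth for KL
```

This sensitivity gap is one intuition behind replacing the KL penalty of the common VAE with a Wasserstein-1 penalty in MAW; the formal robustness and low-rank suitability statements are established later in the paper.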