NOVELTY DETECTION VIA ROBUST VARIATIONAL AUTOENCODING

Abstract

We propose a new method for novelty detection that can tolerate high corruption of the training points, whereas previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to high corruption, we incorporate the following four changes to the common VAE: 1. extracting crucial features of the latent code by a carefully designed dimension reduction component for distributions; 2. modeling the latent distribution as a mixture of Gaussian low-rank inliers and full-rank outliers, where the testing only uses the inlier model; 3. applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler (KL) divergence; and 4. using a least absolute deviation error for reconstruction. We establish that, unlike the KL divergence, the Wasserstein metric is both robust to outliers and suitable for low-rank modeling. We demonstrate state-of-the-art results on standard benchmarks for novelty detection.

1. INTRODUCTION

Novelty detection refers to the task of detecting testing data points that deviate from the underlying structure of a given training dataset (Chandola et al., 2009; Pimentel et al., 2014; Chalapathy & Chawla, 2019). It finds crucial applications in areas such as insurance and credit fraud (Zhou et al., 2018), mobile robots (Neto & Nehmzow, 2007) and medical diagnosis (Wei et al., 2018). Ideally, novelty detection requires learning the underlying distribution of the training data, where sometimes it is sufficient to learn a significant feature, geometric structure or another property of the training data. One can then apply the learned distribution (or property) to detect deviating points in the test data. This is different from outlier detection (Chandola et al., 2009), in which one does not have training data and has to determine the deviating points in a sufficiently large dataset, assuming that the majority of points share the same structure or properties. We note that novelty detection is equivalent to the well-known one-class classification problem (Moya & Hush, 1996). In this problem, one needs to identify members of a class in a test dataset, and consequently distinguish them from "novel" data points, given training points from this class. The points of the main class are commonly referred to as inliers and the novel ones as outliers. Novelty detection is also commonly referred to as semi-supervised anomaly detection. In this terminology, the notion of being "semi-supervised" is different from the usual one. It emphasizes that only the inliers are trained, where there is no restriction on the fraction of training points. On the other hand, the unsupervised case has no training (we referred to this setting above as "outlier detection"), and in the supervised case there are training datasets for both the inliers and outliers.
We remark that some authors refer to semi-supervised anomaly detection as the setting where a small amount of labeled data is provided for both the inliers and outliers (Ruff et al., 2020). There are a myriad of solutions to novelty detection. Nevertheless, such solutions often assume that the training set is purely sampled from a single class or that it has a very low fraction of corrupted samples. This assumption is only valid when the area of investigation has been carefully studied and there are sufficiently precise tools to collect data. However, there are important scenarios where this assumption does not hold. One scenario includes new areas of study, where it is unclear how to distinguish between normal and abnormal points. For example, in the beginning of the COVID-19 pandemic it was hard to diagnose COVID-19 patients and distinguish them from other patients with pneumonia. Another scenario occurs when it is very hard to make precise measurements, for example, when working with the highly corrupted images obtained in cryogenic electron microscopy (cryo-EM). Therefore, we study a robust version of novelty detection that allows a nontrivial fraction of corrupted samples, namely outliers, within the training set. We solve this problem by using a special variational autoencoder (VAE) (Kingma & Welling, 2014). Our VAE is able to model the underlying distribution of the uncorrupted data, despite nontrivial corruption. We refer to our new method as "Mixture Autoencoding with Wasserstein penalty", or "MAW". In order to clarify it, we first review previous works and then explain our contributions in view of these works.

1.1. PREVIOUS WORK

Solutions to one-class classification and novelty detection either estimate the density of the inlier distribution (Bengio & Monperrus, 2005; Ilonen et al., 2006) or determine a geometric property of the inliers, such as their boundary set (Breunig et al., 2000; Schölkopf et al., 2000; Xiao et al., 2016; Wang & Lan, 2020; Jiang et al., 2019). When the inlier distribution is nicely approximated by a low-dimensional linear subspace, Shyu et al. (2003) propose to distinguish between inliers and outliers via Principal Component Analysis (PCA). In order to consider more general cases of nonlinear low-dimensional structures, one may use autoencoders (or restricted Boltzmann machines), which nonlinearly generalize PCA (Goodfellow et al., 2016, Ch. 2) and whose reconstruction error naturally provides a score for membership in the inlier class. Instances of this strategy with various architectures include Zhai et al. (2016); Zong et al. (2018); Sabokrou et al. (2018); Perera et al. (2019); Pidhorskyi et al. (2018). In all of these works except Zong et al. (2018), the training set is assumed to solely represent the inlier class. In fact, Perera et al. (2019) observed that interpolation of a latent space, which was trained using digit images of a complex shape, can lead to digit representation of a simple shape. If there are also outliers (with a simple shape) among the inliers (with a complex shape), encoding the inlier distribution becomes even more difficult. Nevertheless, some previous works have already explored the possibility of a corrupted training set (Xiao et al., 2016; Wang & Lan, 2020; Zong et al., 2018). In particular, Xiao et al. (2016) and Zong et al. (2018) test artificial instances with at most 5% corruption of the training set, and Wang & Lan (2020) consider ratios of 10%, but with very small numbers of training points.
In this work we consider corruption ratios up to 30%, with a method that tries to estimate the distribution of the training set, and not just a geometric property. VAEs (Kingma & Welling, 2014) have been commonly used for generating distributions with reconstruction scores and are thus natural for novelty detection without corruption. They determine the latent code of an autoencoder via variational inference (Jordan et al., 1999; Blei et al., 2017). Alternatively, they can be viewed as autoencoders for distributions that penalize the Kullback-Leibler (KL) divergence of the latent distribution from the prior distribution. The first VAE-based method for novelty detection was suggested by An & Cho (2015). It was recently extended by Daniel et al. (2019), who modified the training objective. A variety of VAE models were also proposed for special anomaly detection problems, which are different from novelty detection (Xu et al., 2018; Zhang et al., 2019; Pol et al., 2019). Current VAE-based methods for novelty detection do not perform well when the training data is corrupted. Indeed, the learned distribution of any such method also represents the corruption, that is, the outlier component. To the best of our knowledge, no effective solutions were proposed for collapsing the outlier mode so that the trained VAE would only represent the inlier distribution. An adversarial autoencoder (AAE) (Makhzani et al., 2016) and a Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) can be considered variants of the VAE. The penalty term of an AAE takes the form of a generative adversarial network (GAN) (Goodfellow et al., 2016), where its generator is the encoder. A WAE generalizes the AAE with a framework that minimizes the Wasserstein metric between the sample distribution and the inference distribution. It reformulates the corresponding objective function so that it can be implemented in the form of an AAE.
There are two relevant lines of work on robustness to outliers in linear modeling that can be used in nonlinear settings via autoencoders or VAEs. Robust PCA aims to deal with sparse elementwise corruption of a data matrix (Candès et al., 2011; De La Torre & Black, 2003; Wright et al., 2009; Vaswani & Narayanamurthy, 2018). Robust subspace recovery (RSR) aims to address general corruption of selected data points and thus better fits the framework of outliers (Watson, 2001; De La Torre & Black, 2003; Ding et al., 2006; Zhang et al., 2009; McCoy & Tropp, 2011; Xu et al., 2012; Lerman & Zhang, 2014; Zhang & Lerman, 2014; Lerman et al., 2015; Lerman & Maunu, 2017; Maunu et al., 2019; Lerman & Maunu, 2018; Maunu & Lerman, 2019). Autoencoders that use robust PCA for anomaly detection tasks were proposed in Chalapathy et al. (2017) and Zhou & Paffenroth (2017). Dai et al. (2018) show that a VAE can be interpreted as a nonlinear robust PCA problem. Nevertheless, explicit regularization is often required to improve robustness to sparse corruption in VAEs (Akrami et al., 2019; Eduardo et al., 2020). RSR was successfully applied to outlier detection by Lai et al. (2020). One can apply their work to the different setting of novelty detection; however, our proposed VAE formulation seems to work better.

1.2. THIS WORK

We propose a robust novelty detection procedure, MAW, that aims to model the distribution of the training data in the presence of a nontrivial fraction of outliers. We highlight its following four features:

• MAW models the latent distribution by a Gaussian mixture of low-rank inliers and full-rank outliers, and applies the inlier distribution for testing. Previous applications of mixture models for novelty detection were designed for multiple modes of inliers and used more complicated tools, such as constructing another network (Zong et al., 2018) or applying clustering (Aytekin et al., 2018; Lee et al., 2018).

• MAW applies a novel dimension reduction component, which extracts lower-dimensional features of the latent distribution. The reduced small dimension allows using full covariances for both the outliers (with full rank) and inliers (with deficient rank), whereas previous VAE-based methods for novelty detection used diagonal covariances in their models (An & Cho, 2015; Daniel et al., 2019). The new component is inspired by the RSR layer in Lai et al. (2020); however, the two are essentially different, since the RSR layer is only applicable to data points and not to probability distributions.

• For the latent code penalty, MAW uses the Wasserstein-1 (W_1) metric. Under a special setting, we prove that the Wasserstein metric gives rise to outlier-robust estimation and is suitable to the low-rank modeling of inliers by MAW. We also show that these properties do not hold for the KL divergence, which is used by VAE, AAE and WAE. We remark that the use of the Wasserstein metric in WAE is different from that of MAW. Indeed, in WAE it measures the distance between the data distribution and the generated distribution, and it does not appear in the latent code. Our use of W_1 can be viewed as a variant of AAE, which replaces the GAN with a Wasserstein GAN (WGAN) (Arjovsky et al., 2017). That is, it replaces the minimization of the KL divergence by that of the W_1 distance.

• MAW achieves state-of-the-art results on popular anomaly detection datasets.

Two additional features are as follows. First, for reconstruction, MAW replaces the common least squares formulation with a least absolute deviations formulation. This can be justified by the use of either a robust estimator (Lopuhaa & Rousseeuw, 1991) or a likelihood function with a heavier tail. Second, MAW is attractive for practitioners. It is simple to implement in any standard deep learning library, and is easily adaptable to other choices of network architecture, energy functions and similarity scores. We remark that since we do not have labels for the training set, we cannot learn, in a supervised manner, the Gaussian component with the low-rank covariance from the inliers and the Gaussian component with the full-rank covariance from the outliers. However, the use of two robust losses (the least absolute deviation and the W_1 distance) helps obtain a careful model for the inliers, which is robust to outliers. Note that in our testing, we only use the model for the inliers. We explain MAW in §2. We establish the advantage of its use of the Wasserstein metric in §3. We carefully test MAW in §4. Finally, we conclude this work in §5.

2. DESCRIPTION OF MAW

We motivate and overview the underlying model and assumptions of MAW in §2.1. We describe the simple implementation details of its components in §2.2. Fig. 1 illustrates the general idea of MAW and can assist in reading this section.

2.1. THE MODEL AND ASSUMPTIONS OF MAW

MAW aims to robustly estimate a mixture inlier-outlier distribution for the training data and then use its inlier component to detect outliers in the testing data. For this purpose, it designs a novel variational autoencoder with an underlying mixture model and a robust loss function in the latent space. We find the variational framework natural for novelty detection. Indeed, it learns a distribution that describes the inlier training examples and generalizes to the inlier test data. Moreover, the variational formulation allows a direct modeling of a Gaussian mixture model in the latent space, unlike a standard autoencoder. We assume L training points in R^D, which we designate by {x^(i)}_{i=1}^L. Let x be a random variable on R^D with the unknown training data distribution, which we estimate by the empirical distribution of the training points. We assume a latent random variable z of low and even dimension 2 ≤ d ≤ D, where our default choice is d = 2. We further assume a standardized Gaussian prior p(z), so that z ∼ N(0, I_{d×d}). The posterior distribution p(z|x) is unknown. However, we assume an approximation to it, which we denote by q(z|x), such that z|x is a mixture of two Gaussian distributions representing the inlier and outlier components. More specifically,

z|x ∼ η N(µ_1, Σ_1) + (1 − η) N(µ_2, Σ_2),

where we explain its parameters next. We assume that η > 0.5, where our default value is η = 5/6, so that the first mode of z represents the inliers and the second one represents the outliers. The other parameters are generated by the encoder network and a following dimension reduction component. We remark that unlike previous works which adopted Gaussian mixtures to model the clusters of inliers (Reddy et al., 2017; Zong et al., 2018), the Gaussian mixture model in MAW aims to separate the inliers from the outliers. The dimension reduction component involves a mapping from a higher-dimensional space onto the latent space.
It is analogous to the RSR layer in Lai et al. (2020) that projects encoded points onto the latent space, but requires a more careful design since we consider a distribution rather than sample points. Due to this reduction, we assume that the mapped covariance matrices of z|x are full, unlike common single-mode VAE models that assume a diagonal covariance (Kingma & Welling, 2014; An & Cho, 2015). Our underlying assumption is that the inliers lie on a low-dimensional structure, and we thus enforce the lower rank d/2 for Σ_1, but allow Σ_2 to have full rank d. Nevertheless, we later describe a necessary regularization of both matrices by the identity. Following the VAE framework, we approximate the unknown posterior distribution p(z|x) within the variational family Q = {q(z|x)}, which is indexed by µ_1, Σ_1, µ_2 and Σ_2. Unlike a standard VAE, which maximizes the evidence lower bound (ELBO), MAW maximizes the following ELBO-Wasserstein, or ELBOW, function, which uses the W_1 distance (see also §A.1):

ELBOW(q) = E_{p(x)} E_{q(z|x)} log p(x|z) − W_1(q(z), p(z)).   (1)

Following the VAE framework, we use a Monte-Carlo approximation to estimate E_{q(z|x)} log p(x|z) with i.i.d. samples {z^(t)}_{t=1}^T from q(z|x) as follows:

E_{q(z|x)} log p(x|z) ≈ (1/T) Σ_{t=1}^T log p(x|z^(t)).   (2)

To improve the robustness of our model, we choose the negative log-likelihood −log p(x|z^(t)) to be a constant multiple of the ℓ2 norm of the difference between the random variable x and a mapping of the sample z^(t) from R^d to R^D by the decoder D, that is,

−log p(x|z^(t)) ∝ ‖x − D(z^(t))‖_2.   (3)

Note that we deviate from the common choice of the squared ℓ2 norm, which corresponds to an underlying Gaussian likelihood, and assume instead a likelihood with a heavier tail. MAW trains its networks by minimizing −ELBOW(q). For any 1 ≤ i ≤ L, it samples {z^(i,t)_gen}_{t=1}^T from q(z|x^(i)), where all samples are independent. Using the aggregation formula q(z) = (1/L) Σ_{i=1}^L q(z|x^(i)), which is also used by an AAE, the approximation of p(x) by the empirical distribution of the training data, and (1)-(3), MAW applies the following approximation of −ELBOW(q):

(1/(LT)) Σ_{i=1}^L Σ_{t=1}^T ‖x^(i) − D(z^(i,t)_gen)‖_2 + W_1((1/L) Σ_{i=1}^L q(z|x^(i)), p(z)).   (4)

Details of minimizing (4) are described in §2.2. We remark that the procedure described in §2.2 is independent of the multiplicative constant in (3) and therefore this constant is ignored in (4). During testing, MAW identifies inliers and outliers according to high or low similarity scores computed between each given test point and points generated from the learned inlier component of z|x.
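To make the reconstruction term of (4) concrete, the following minimal NumPy sketch evaluates the least absolute deviation loss on a toy batch. The linear map standing in for the decoder D, the identity covariances and the fixed mixture parameters are hypothetical placeholders, not MAW's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: L points, T latent samples per point, latent dim d, data dim D.
L, T, d, D_dim = 4, 5, 2, 8
W_dec = rng.standard_normal((D_dim, d))
decode = lambda z: W_dec @ z  # hypothetical linear stand-in for the decoder D

x = rng.standard_normal((L, D_dim))             # toy training batch
eta, mu1, mu2 = 5 / 6, np.zeros(d), np.ones(d)  # mixture parameters (eta = 5/6)

# Sample z_gen^(i,t) from eta*N(mu1, I) + (1-eta)*N(mu2, I); identity
# covariances are used only to keep the sketch short.
comp = rng.random((L, T)) < eta
z_gen = np.where(comp[..., None], mu1, mu2) + rng.standard_normal((L, T, d))

# First term of (4): mean of the unsquared l2 norms ||x^(i) - D(z_gen^(i,t))||_2,
# i.e. the least absolute deviation reconstruction loss.
recon = np.mean([
    np.linalg.norm(x[i] - decode(z_gen[i, t]))
    for i in range(L) for t in range(T)
])
assert np.isfinite(recon) and recon > 0
```

Using the unsquared norm down-weights large residuals relative to the squared loss, which is the robustness rationale behind (3).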

2.2. DETAILS OF IMPLEMENTING MAW

MAW has a VAE-type structure with an additional WGAN-type structure for minimizing the W_1 loss in (4). We provide here details of implementing these structures. Some specific choices of the networks are described in §4, since they may depend on the type of dataset. The VAE-type structure of MAW contains three ingredients: encoder, dimension reduction component and decoder. The encoder forms a neural network E that maps the training sample x^(i) ∈ R^D to the four vectors µ^(i)_{0,1}, µ^(i)_{0,2}, s^(i)_{0,1}, s^(i)_{0,2} in R^{D'}, where our default choice is D' = 128. The dimension reduction component then computes the following statistical quantities of the Gaussian mixture z|x^(i): means µ^(i)_1 and µ^(i)_2 in R^d and covariance matrices Σ^(i)_1 and Σ^(i)_2 in R^{d×d}. First, a linear layer, represented by A ∈ R^{D'×d}, maps (via A^T) the features µ^(i)_{0,1} and µ^(i)_{0,2} in R^{D'} to the following respective vectors in R^d: µ^(i)_1 = A^T µ^(i)_{0,1} and µ^(i)_2 = A^T µ^(i)_{0,2}. For j = 1, 2, form M^(i)_j = A^T diag(s^(i)_{0,j}) A. For j = 2, compute Σ^(i)_2 = M^(i)_2 M^(i)T_2. For j = 1, we first need to reduce the rank of M^(i)_1. For this purpose, we form the spectral decomposition M^(i)_1 = U^(i)_1 diag(σ^(i)_1) U^(i)T_1 and then truncate its bottom d/2 eigenvalues. That is, we let

σ̃^(i)_1 ∈ R^d have the same entries as the largest d/2 entries of σ^(i)_1 and zero entries otherwise,   (5)

and then compute

M̃^(i)_1 = U^(i)_1 diag(σ̃^(i)_1) U^(i)T_1   (6)

and Σ^(i)_1 = M̃^(i)_1 M̃^(i)T_1. Since the TensorFlow package requires numerically-significant positive definiteness of covariance matrices, we add an identity matrix to both Σ^(i)_1 and Σ^(i)_2. Despite this, the low-rank structure of Σ^(i)_1 is still evident. Note that the dimension reduction component only trains A. The decoder, D : R^d → R^D, maps independent samples {z^(i,t)_gen}_{t=1}^T, generated for each 1 ≤ i ≤ L from the distribution η N(µ^(i)_1, Σ^(i)_1) + (1 − η) N(µ^(i)_2, Σ^(i)_2), into the reconstructed data space.
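The rank-truncation step of (5)-(6) can be sketched in a few lines of NumPy. The symmetric matrix below is a random stand-in for M^(i)_1 = A^T diag(s^(i)_{0,1}) A, and the function name is ours, not part of MAW's released code.

```python
import numpy as np

def truncated_covariance(M, keep):
    """Zero out all but the top-`keep` eigenvalues of the symmetric matrix M
    and return (M_trunc, Sigma) with Sigma = M_trunc @ M_trunc.T, mirroring
    the construction of the deficient-rank Sigma_1 (illustrative sketch)."""
    eigvals, U = np.linalg.eigh(M)      # eigenvalues in ascending order
    sigma_trunc = eigvals.copy()
    sigma_trunc[:-keep] = 0.0           # truncate the bottom eigenvalues
    M_trunc = U @ np.diag(sigma_trunc) @ U.T
    return M_trunc, M_trunc @ M_trunc.T

rng = np.random.default_rng(1)
d = 4                                   # even latent dimension (toy choice)
B = rng.standard_normal((8, d))         # stand-in factor for A^T diag(s) A
M1 = B.T @ B                            # symmetric positive semidefinite
_, Sigma1 = truncated_covariance(M1, keep=d // 2)

# Sigma1 has rank at most d/2; adding the identity (as MAW does for
# TensorFlow's benefit) makes it positive definite while the low-rank
# structure stays evident in the spectrum.
Sigma1_reg = Sigma1 + np.eye(d)
assert np.linalg.matrix_rank(Sigma1) <= d // 2
```

The key point is that the truncation happens on the eigenvalues of M^(i)_1, so Σ^(i)_1 = M̃^(i)_1 M̃^(i)T_1 is automatically positive semidefinite with deficient rank.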
The loss function associated with the VAE structure is the first term in (4). We can write it as

L_VAE(E, A, D) = (1/(LT)) Σ_{i=1}^L Σ_{t=1}^T ‖x^(i) − D(z^(i,t)_gen)‖_2.   (7)

The dependence of this loss on E and A is implicit, but follows from the fact that the parameters of the sampling distribution of each z^(i,t)_gen were obtained by E and A. The WGAN-type structure seeks to minimize the second term in (4) using the dual formulation

W_1((1/L) Σ_{i=1}^L q(z|x^(i)), p(z)) = sup_{‖f‖_Lip ≤ 1} E_{z_hyp ∼ p(z)} f(z_hyp) − E_{z_gen ∼ (1/L) Σ_{i=1}^L q(z|x^(i))} f(z_gen).   (8)

The generator of this WGAN-type structure is composed of the encoder E and the dimension reduction component, which we represent by A. It generates the samples {z^(i,t)_gen}_{i=1,t=1}^{L,T} described above. The discriminator, Dis, of the WGAN-type structure plays the role of the Lipschitz function f in (8). It compares the latter samples with the i.i.d. samples {z^(i,t)_hyp}_{t=1}^T drawn from the prior distribution. In order to make Dis Lipschitz, its weights are clipped to [−1, 1] during training. In the MinMax game of this WGAN-type structure, the discriminator minimizes and the generator (E and A) maximizes

L_W1(Dis) = (1/(LT)) Σ_{i=1}^L Σ_{t=1}^T (Dis(z^(i,t)_gen) − Dis(z^(i,t)_hyp)).   (9)

We note that maximization of (9) by the generator is equivalent to minimization of the loss function

L_GEN(E, A) = −(1/(LT)) Σ_{i=1}^L Σ_{t=1}^T Dis(z^(i,t)_gen).   (10)

During the training phase, MAW alternately minimizes the losses (7)-(10) instead of minimizing a weighted sum. Therefore, any multiplicative constant in front of either term of (4) will not affect the optimization. In particular, it was legitimate to omit the multiplicative constant of (3) when deriving (4). For each testing point y^(j), we sample {z^(j,t)_in}_{t=1}^T from the inlier mode of the learned latent Gaussian mixture and decode them as {ỹ^(j,t)}_{t=1}^T = {D(z^(j,t)_in)}_{t=1}^T. Using a similarity measure S(•, •) (our default is the cosine similarity), we compute

S^(j) = Σ_{t=1}^T S(y^(j), ỹ^(j,t)).
If S^(j) is larger than a chosen threshold, then y^(j) is classified as normal; otherwise, it is classified as novel. Additional details of MAW are given in §A.
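The test-time scoring rule can be sketched as follows, with hypothetical decoded vectors standing in for {D(z^(j,t)_in)}; cosine similarity is the default choice of S(•, •).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def novelty_score(y, decoded_samples):
    """S^(j): sum of cosine similarities between a test point y and its T
    decoded inlier samples; higher means more inlier-like (a sketch)."""
    return sum(cosine(y, yt) for yt in decoded_samples)

rng = np.random.default_rng(2)
T = 5
y_inlier = np.ones(8)
# Hypothetical decoded samples, close to the inlier direction.
decoded = [y_inlier + 0.1 * rng.standard_normal(8) for _ in range(T)]

s_in = novelty_score(y_inlier, decoded)                 # near T (each cos ~ 1)
y_novel = np.concatenate([np.ones(4), -np.ones(4)])     # roughly orthogonal
s_out = novelty_score(y_novel, decoded)                 # near 0
assert s_in > s_out
```

A point is then labeled normal when its score exceeds a threshold chosen, e.g., from a validation ROC curve; sweeping the threshold yields the AUC/AP evaluation used in §4.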

3. THEORETICAL GUARANTEES FOR THE W 1 MINIMIZATION

Here and in §D we theoretically establish the superiority of using the W_1 distance over the KL divergence. We formulate a simplified setting that aims to isolate the minimization of the WGAN-type structure introduced in §2.2, while ignoring unnecessarily complex components of MAW. We assume a mixture parameter η > 1/2 and a separation parameter ε > 0, and denote by R the regularizing function, which can be either the KL divergence or W_1, and by S^K_+ and S^K_++ the sets of K × K positive semidefinite and positive definite matrices, respectively. For µ_0 ∈ R^K and Σ_0 ∈ S^K_++, we consider the minimization problem

min_{µ_1, µ_2 ∈ R^K; Σ_1, Σ_2 ∈ S^K_+ s.t. ‖µ_1 − µ_2‖_2 ≥ ε}  η R(N(µ_1, Σ_1), N(µ_0, Σ_0)) + (1 − η) R(N(µ_2, Σ_2), N(µ_0, Σ_0)).   (11)

We further motivate it in §D.1. For MAW, µ_0 = 0 and Σ_0 = I, but our generalization helps clarify things. This minimization aims to approximate the "prior" distribution N(µ_0, Σ_0) with a Gaussian mixture distribution. The constraint ‖µ_1 − µ_2‖_2 ≥ ε distinguishes between the inlier and outlier modes, and it is a realistic assumption as long as ε is sufficiently small. Our cleanest result is when Σ_0, Σ_1 and Σ_2 coincide. It demonstrates robustness to the outlier component of the W_1 (or W_p, p ≥ 1) minimization, but not of the KL minimization (its proof is in §D.2).

Proposition 3.1. If µ_0 ∈ R^K, Σ_0 ∈ S^K_++, ε > 0 and 1 > η > 1/2, then the minimizer of (11) with R = W_p, p ≥ 1, and the additional constraint Σ_0 = Σ_1 = Σ_2 satisfies µ_1 = µ_0, and thus the recovered inlier distribution coincides with the "prior distribution". However, the minimizer of (11) with R = KL and the same constraint satisfies µ_0 = η µ_1 + (1 − η) µ_2.

In §D.3, we analyze the case where Σ_1 is low rank and Σ_2 ∈ S^K_++. We show that (11) is ill-defined when R = KL. The R = W_1 case is hard to analyze, but we can fully analyze the R = W_2 case and demonstrate exact recovery of the prior distribution by the inlier distribution when η approaches 1.
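The dichotomy of Proposition 3.1 can be checked numerically in a simplified one-dimensional instance (µ_0 = 0, unit variances, η = 5/6, ε = 1), using the standard facts that for equal unit variances W_p(N(m, 1), N(0, 1)) = |m| and KL(N(m, 1) ‖ N(0, 1)) = m²/2. The grid search below is our own illustration, not part of the paper's proof.

```python
import numpy as np

# Take the constraint |mu1 - mu2| >= eps to be active with mu2 = mu1 - eps
# and minimize the objective of (11) over mu1 on a fine grid.
eta, eps = 5 / 6, 1.0
mu1_grid = np.linspace(-2, 2, 40001)
mu2_grid = mu1_grid - eps

# Wasserstein objective: eta*|mu1| + (1-eta)*|mu2|  (equal-covariance case).
w_obj = eta * np.abs(mu1_grid) + (1 - eta) * np.abs(mu2_grid)
# KL objective: eta*mu1^2/2 + (1-eta)*mu2^2/2.
kl_obj = eta * mu1_grid**2 / 2 + (1 - eta) * mu2_grid**2 / 2

mu1_w = mu1_grid[np.argmin(w_obj)]
mu1_kl = mu1_grid[np.argmin(kl_obj)]
mu2_kl = mu1_kl - eps

# Wasserstein: the inlier mean lands on the prior mean mu0 = 0 (robustness).
assert abs(mu1_w) < 1e-6
# KL: only the mixture mean matches, mu0 = eta*mu1 + (1-eta)*mu2, so the
# inlier mean itself is dragged toward the outlier mode.
assert abs(eta * mu1_kl + (1 - eta) * mu2_kl) < 1e-3
```

Since η > 1/2, the piecewise-linear Wasserstein objective is minimized by placing the inlier mean exactly at µ_0, whereas the quadratic KL objective splits the error between the two modes, which is precisely the non-robust behavior the proposition identifies.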

4. EXPERIMENTS

We describe the competing methods and experimental choices in §4.1. We report on the comparison with the competing methods in §4.2. We demonstrate the importance of the novel features of MAW in §4.3.

4.1. COMPETING METHODS AND EXPERIMENTAL CHOICES

We compared MAW with the following methods (descriptions and code links are in §E): Deep Autoencoding Gaussian Mixture Model (DAGMM) (Zong et al., 2018), Deep Structured Energy-Based Models (DSEBMs) (Zhai et al., 2016), Isolation Forest (IF) (Liu et al., 2008), Local Outlier Factor (LOF) (Breunig et al., 2000), One-class Novelty Detection Using GANs (OCGAN) (Perera et al., 2019), One-Class SVM (OCSVM) (Heller et al., 2003) and RSR Autoencoder (RSRAE) (Lai et al., 2020). DAGMM, DSEBMs, OCGAN and OCSVM were proposed for novelty detection. IF, LOF and RSRAE were originally proposed for outlier detection, and we thus apply their trained models to the test data. For MAW and the above four reconstruction-based methods, that is, DAGMM, DSEBMs, OCGAN and RSRAE, we use the following structure of encoders and decoders, which varies with the type of data (images or non-images). For non-images, which are mapped to feature vectors of dimension D, the encoder is a fully connected network with output channels (32, 64, 128, 128 × 4). The decoder is a fully connected network with output channels (128, 64, 32, D), followed by a normalization layer at the end. For image datasets, the encoder has three convolutional layers with output channels (32, 64, 128), kernel sizes (5 × 5, 5 × 5, 3 × 3) and strides (2, 2, 2). Its output is flattened to lie in R^128 and then mapped into a (128 × 4)-dimensional vector using a dense layer (with output channels 128 × 4). The decoder for image datasets first applies a dense layer from R^2 to R^128 and then three deconvolutional layers with output channels (64, 32, 3), kernel sizes (3 × 3, 5 × 5, 5 × 5) and strides (2, 2, 2). For MAW we set the following parameters, where additional details are in §A: intrinsic dimension d = 2; mixture parameter η = 5/6; sampling number T = 5; and size of A (used for dimension reduction) 128 × 2. For all experiments, the discriminator is a fully connected network with output channels (32, 64, 128, 1).

4.2. COMPARISON OF MAW WITH STATE-OF-THE-ART METHODS

We use five datasets for novelty detection: KDDCUP-99 (Dua & Graff, 2017), Reuters-21578 (Lewis, 1997), the COVID-19 Radiography database (Chowdhury et al., 2020), Caltech101 (Fei-Fei et al., 2004) and Fashion MNIST (Xiao et al., 2017). We distinguish between image datasets (COVID-19, Caltech101 and Fashion MNIST) and non-image datasets (KDDCUP-99 and Reuters-21578). We describe each dataset, common preprocessing procedures and choices of their largest clusters in §F. Each dataset contains several clusters (2 for KDDCUP-99, the 5 largest ones for Reuters-21578, 3 for COVID-19, the 11 largest ones for Caltech101 and 10 for Fashion MNIST). We arbitrarily fix a class and uniformly sample N training inliers and N_test testing inliers from that class. We let N = 6000, 350, 160, 100, 300 and N_test = 1200, 140, 60, 100, 60 for KDDCUP-99, Reuters-21578, COVID-19, Caltech101 and Fashion MNIST, respectively. We then fix c in {0.1, 0.2, 0.3, 0.4, 0.5} and uniformly sample outliers from the rest of the clusters so that they form a fraction c of the training data. We also fix c_test in {0.1, 0.3, 0.5, 0.7, 0.9} and uniformly sample outliers from the rest of the clusters so that they form a fraction c_test of the testing data. Using all possible thresholds for the finite datasets, we compute the AUC (area under curve) and AP (average precision) scores, while considering the outliers as "positive". For each fixed c = 0.1, 0.2, 0.3, 0.4, 0.5, we average these results over the values of c_test, the different choices of inlier clusters (among all possible clusters), and three runs with different random initializations for each of these choices. We also compute the corresponding standard deviations. We report these results in Figs. 2 and 3 and further specify numerical values in §H.1. We observe state-of-the-art performance of MAW on all of these datasets. On Reuters-21578, DSEBMs performs slightly better than MAW and OCSVM has comparable performance.
However, these two methods are not competitive in the rest of the datasets. In §G, we report results for a different scenario where the outliers of the training and test sets have different characteristics. In this setting, we show that MAW performs even better when compared to other methods. 

4.3. TESTING THE EFFECT OF THE NOVEL FEATURES OF MAW

We experimentally validate the effect of the following five novel features of MAW: the least absolute deviation loss for reconstruction, the W_1 metric for the regularization of the latent distribution, the Gaussian mixture model assumption, the full covariance matrices resulting from the dimension reduction component, and the lower-rank constraint for the inlier mode. The following methods respectively replace each of the above components of MAW with a traditional one: MAW-MSE, MAW-KL divergence, MAW-single Gaussian, MAW-diagonal cov. and MAW-same rank. In addition, we consider a standard variational autoencoder (VAE). Additional details for the latter six methods are in §B. We compared the above six methods with MAW using two datasets, KDDCUP-99 and COVID-19, with training outlier ratios c = 0.1, 0.2 and 0.3. We followed the experimental setting described in §4.1. Fig. 4 reports the averages and standard deviations of the computed AUC and AP scores, where the corresponding numerical values are further recorded in §H.2. The results indicate a clear decrease of accuracy when any of the novel components of MAW is missing or when a standard VAE is used.

5. CONCLUSION AND FUTURE WORK

We introduced MAW, a robust VAE-type framework for novelty detection that can tolerate high corruption of the training data. We proved that the Wasserstein regularization used in MAW has better robustness to outliers and is more suitable to a low-dimensional inlier component than the KL divergence. We demonstrated state-of-the-art performance of MAW on a variety of datasets and experimentally validated that omitting any of the new ideas results in a significant decrease of accuracy. We hope to further extend our proposal in the following ways. First, we plan to extend and test some of our ideas for the different problem of robust generation, in particular, for building generative networks which are robust against adversarial training data. Second, we would like to carefully study the virtue of our idea of modeling the most significant mode in training data. In particular, when extending the work to generation, one has to verify that this idea does not lead to mode collapse. Furthermore, we would like to explore any tradeoff of this idea, as well as of our setting of robust novelty detection, with fairness. Finally, we hope to further extend our theoretical guarantees. For example, two problems that currently seem intractable are the study of the W_1 version of Proposition D.1 and of the minimizer of (14).

A ADDITIONAL EXPLANATIONS AND IMPLEMENTATION DETAILS OF MAW

In §A.1 we review the ELBO function and explain how ELBOW is obtained from ELBO. Additional implementation details of MAW are in §A.2. Finally, §A.3 provides algorithmic boxes for training MAW and applying it for novelty detection.

A.1 REVIEW OF ELBO AND ITS RELATIONSHIP WITH ELBOW

A standard VAE framework would minimize the expected KL divergence from p(z|x) to q(z|x) in Q, where the expectation is taken over p(x). By Bayes' rule, this is equivalent to maximizing the evidence lower bound (ELBO):

ELBO(q) = E_{p(x)} E_{q(z|x)} log p(x|z) − E_{p(x)} KL(q(z|x) ‖ p(z)).

The first term of ELBO is the reconstruction likelihood. Its second term restricts the deviation of q(z|x) from p(z) and can be viewed as a regularization term. ELBOW is a more robust version of ELBO with a different regularization. That is, it replaces E_{p(x)} KL(q(z|x) ‖ p(z)) with W_1(q(z), p(z)). We remark that the W_1 distance cannot be computed between q(z|x) and p(z); ELBOW thus practically replaces q(z|x) with its expected distribution q(z) = E_{p(x)} q(z|x) (or a discrete approximation of it).
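For intuition on the ELBO regularizer, the following sketch verifies the standard closed form of the Gaussian KL term against direct numerical integration in one dimension; the parameters µ and s are arbitrary illustrative choices.

```python
import numpy as np

# Closed form: KL(N(mu, s^2) || N(0, 1)) = (s^2 + mu^2 - 1 - log s^2) / 2.
mu, s = 0.7, 0.5
closed_form = (s**2 + mu**2 - 1 - np.log(s**2)) / 2

# Numerical check: integrate q(z) * (log q(z) - log p(z)) on a fine grid.
z = np.linspace(-10.0, 10.0, 200001)
log_q = -0.5 * ((z - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
log_p = -0.5 * z**2 - np.log(np.sqrt(2 * np.pi))
numeric = np.sum(np.exp(log_q) * (log_q - log_p)) * (z[1] - z[0])

assert abs(closed_form - numeric) < 1e-5
```

This closed form is what makes the KL term of a diagonal-Gaussian VAE cheap to evaluate; by contrast, the W_1 term of ELBOW has no such closed form, which is why MAW estimates it with the WGAN-type dual formulation (8).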

A.2 ADDITIONAL IMPLEMENTATION DETAILS OF MAW

The matrix A and the network parameters of the encoders, decoders and discriminators are initialized with the Glorot uniform initializer (Glorot & Bengio, 2010). The neural networks within MAW are implemented in TensorFlow (Abadi et al., 2015) and trained for 100 epochs with batch size 128. We apply batch normalization to each layer of every neural network. For the VAE structure of MAW, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.00005. For the WGAN-type discriminator of MAW, we use RMSprop (Bengio & Monperrus, 2005) with learning rate 0.0005, following the recommendation of Arjovsky et al. (2017) for WGAN.
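For reference, the Glorot uniform initializer draws weights from U(−ℓ, ℓ) with ℓ = sqrt(6 / (fan_in + fan_out)). The following numpy sketch (our illustration, not the TensorFlow code used in the paper) makes this explicit:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot (Xavier) uniform initializer: samples from U(-limit, limit)
    with limit = sqrt(6 / (fan_in + fan_out)) (Glorot & Bengio, 2010)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# example: a 128 -> 64 dense layer weight matrix
W = glorot_uniform(128, 64, np.random.default_rng(0))
limit = np.sqrt(6.0 / (128 + 64))
```

The scale keeps activation variances roughly constant across layers, which is why it is a common default for dense encoders and decoders.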

A.3 ALGORITHMIC BOXES FOR MAW

Algorithms 1 and 2 describe training MAW and applying MAW for novelty detection, respectively. In these descriptions, we denote by θ, ϕ and δ the trainable parameters of the encoder E, decoder D and discriminator Dis, respectively. Recall that A contains the trained parameters of the dimension reduction component.

2: for each batch {x^(i)}_{i∈I} do
3:   µ^(i)_{0,1}, µ^(i)_{0,2}, s^(i)_{0,1}, s^(i)_{0,2} ← E(x^(i))
4:   µ^(i)_j ← A^T µ^(i)_{0,j},  M^(i)_j ← A^T diag(s^(i)_{0,j}) A,  j = 1, 2
5:   Compute M^(i)_1 according to (5) and (6)
6:   Σ^(i)_1 ← M^(i)_1 M^(i)T_1,  Σ^(i)_2 ← M^(i)_2 M^(i)T_2
7:   for t = 1, ..., T do
8:     sample a batch {z^(i,t)_gen}_{i∈I} ∼ ηN(µ^(i)_1, Σ^(i)_1) + (1 − η)N(µ^(i)_2, Σ^(i)_2)
9:     sample a batch {z^(i,t)_pri}_{i∈I} from the prior p(z)
10:  end for
11:   (θ, A, ϕ) ← (θ, A, ϕ) − α∇_{(θ,A,ϕ)} L_VAE(θ, A, ϕ) according to (7)
12:   δ ← δ − α∇_δ L_W1(δ) according to (9)
13:   δ ← clip(δ, [−1, 1])
14:   (θ, A) ← (θ, A) − α∇_{(θ,A)} L_GEN(θ, A) according to (10)
15: end for
16: end for
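The core latent computation of lines 3–8 can be sketched in numpy as follows. The toy encoder outputs and the omission of the low-rank correction of (5)–(6) are simplifications of ours; the mapping µ_j = A^T µ_{0,j}, M_j = A^T diag(s_{0,j}) A, Σ_j = M_j M_j^T and the mixture sampling follow the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, eta = 32, 4, 5.0 / 6.0      # encoder output dim, latent dim, mixture weight

A = rng.standard_normal((D, K))   # dimension reduction matrix (trainable in MAW)
# stand-in encoder outputs for one input: means and positive scales, two modes
mu0 = [rng.standard_normal(D), rng.standard_normal(D)]
s0 = [rng.uniform(0.1, 1.0, D), rng.uniform(0.1, 1.0, D)]

mu, Sigma = [], []
for j in range(2):
    m = A.T @ mu0[j]                   # reduced mean, in R^K
    M = A.T @ np.diag(s0[j]) @ A       # reduced square-root factor
    mu.append(m)
    Sigma.append(M @ M.T)              # symmetric PSD covariance by construction

def sample_latent(T):
    """Draw T samples from eta*N(mu_1,Sigma_1) + (1-eta)*N(mu_2,Sigma_2)."""
    out = np.empty((T, K))
    for t in range(T):
        j = 0 if rng.random() < eta else 1   # 0: inlier mode, 1: outlier mode
        out[t] = rng.multivariate_normal(mu[j], Sigma[j])
    return out

z = sample_latent(64)
```

Note that the paper additionally forces Σ_1 to have rank d/2 via (5)–(6); the sketch keeps both covariances full rank for brevity.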

B MORE DETAILS ON TESTING THE EFFECT OF THE NOVEL FEATURES OF MAW

In §4.3 we experimentally validated the essential components of MAW by implementing variants of MAW that replace each novel component with a standard one. We notice that the AUC and AP scores in Figs. 2 and 3 consistently decrease as the outlier ratio increases, and thus the chosen training outlier ratios (c = 0.1, 0.2 and 0.3) are sufficient to demonstrate the effectiveness of MAW over its variants. We provide additional details on each of these variants of MAW.

• MAW-MSE replaces the least absolute deviation loss L_VAE with the common mean squared error (MSE). That is, it replaces ‖x^(i) − D(z^(i,t)_gen)‖_2 in (7) with ‖x^(i) − D(z^(i,t)_gen)‖_2^2.

• MAW-KL divergence replaces the Wasserstein regularization L_W1 with the KL divergence. This is implemented by replacing the WGAN-type structure of the discriminator with a standard GAN.

• MAW-same rank uses the same rank d for both covariance matrices Σ^(i)_1 and Σ^(i)_2, instead of forcing Σ^(i)_1 to have lower rank d/2.

• MAW-single Gaussian replaces the Gaussian mixture model for the latent distribution with a single Gaussian distribution with a full covariance matrix.

• MAW-diagonal cov. replaces the full covariance matrices resulting from the dimension reduction component with diagonal covariances. Its encoder directly produces 2-dimensional means and diagonal covariances (one of rank 1 for the inlier mode and one of rank 2 for the outlier mode).

• VAE has the same encoder and decoder structures as MAW. Instead of a dimension reduction component, it uses a dense layer that maps the output of the encoder to a 4-dimensional vector composed of a 2-dimensional mean and a 2-dimensional diagonal covariance, as is common for a traditional VAE.
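The robustness motivation behind the least absolute deviation loss can be illustrated numerically. In this sketch (ours, with toy data), a single grossly corrupted training point inflates the MSE objective far more than the LAD objective, so gradients under LAD are less dominated by outliers:

```python
import numpy as np

def lad_loss(x, x_hat):
    """Least absolute deviation: mean of per-sample l2 norms of residuals."""
    return np.mean(np.linalg.norm(x - x_hat, axis=1))

def mse_loss(x, x_hat):
    """Mean squared error: mean of per-sample squared l2 norms."""
    return np.mean(np.linalg.norm(x - x_hat, axis=1) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 8))
x_hat = x + 0.1 * rng.standard_normal((100, 8))  # small reconstruction errors
x_out = x.copy()
x_out[0] += 50.0                                 # one grossly corrupted point

# relative growth of each loss when the single outlier appears
lad_growth = lad_loss(x_out, x_hat) / lad_loss(x, x_hat)
mse_growth = mse_loss(x_out, x_hat) / mse_loss(x, x_hat)
```

The quadratic penalty grows with the square of the corruption magnitude, while the LAD penalty grows only linearly, which matches the design choice behind MAW-MSE performing worse.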

C SENSITIVITY OF HYPERPARAMETERS

We examine the sensitivity of some of the reported results to changes of some hyperparameters. In §C.1, we report the sensitivity to choices of the intrinsic dimension. In §C.2, we report the sensitivity to choices of the mixture parameter.

Algorithm 2 Applying MAW to novelty detection

Input: Test data {y^(j)}_{j=1}^N; sampling number T; trained MAW model; threshold τ; similarity measure S(·,·)
Output: Binary novelty labels for j = 1, ..., N
1: for j = 1, ..., N do
2:   µ^(j)_{0,1}, s^(j)_{0,1} ← E(y^(j))
3:   µ^(j)_1 ← A^T µ^(j)_{0,1},  M^(j)_1 ← A^T diag(s^(j)_{0,1}) A
4:   Compute M^(j)_1 according to (5) and (6)
5:   Σ^(j)_1 ← M^(j)_1 M^(j)T_1
6:   for t = 1, ..., T do
7:     sample z^(j,t)_in ∼ N(µ^(j)_1, Σ^(j)_1)
8:     ỹ^(j,t) ← D(z^(j,t)_in)
9:     compute S(y^(j), ỹ^(j,t))
10:  end for
11:   S^(j) ← T^{−1} Σ_{t=1}^T S(y^(j), ỹ^(j,t))
12:   if S^(j) ≥ τ then
13:     y^(j) is a normal example
14:   else
15:     y^(j) is a novelty
16:   end if
17: end for
18: return normality labels for j = 1, ..., N
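The scoring logic of lines 6–17 can be sketched as follows. The cosine similarity, the stand-in decoder and the threshold value are our assumptions for illustration; in MAW the reconstructions come from the trained encoder/decoder pair and only the inlier mode is sampled:

```python
import numpy as np

def cosine_sim(y, y_hat):
    """Cosine similarity, one natural choice for S (an assumption here)."""
    return float(y @ y_hat / (np.linalg.norm(y) * np.linalg.norm(y_hat) + 1e-12))

def novelty_labels(Y, decode_samples, T=10, tau=0.5):
    """Label a test point normal (True) if its average similarity to
    T decoded inlier-mode reconstructions reaches the threshold tau.

    decode_samples(y, T) stands in for the trained MAW pipeline."""
    labels = []
    for y in Y:
        sims = [cosine_sim(y, y_hat) for y_hat in decode_samples(y, T)]
        labels.append(np.mean(sims) >= tau)
    return np.array(labels)

rng = np.random.default_rng(0)

def fake_decoder(y, T):
    """Stand-in model: inlier-like points reconstruct well, novelties do not."""
    if np.linalg.norm(y) < 5:
        return [y + 0.05 * rng.standard_normal(y.shape) for _ in range(T)]
    return [rng.standard_normal(y.shape) for _ in range(T)]

Y = np.vstack([0.5 * rng.standard_normal((5, 16)),   # inlier-like points
               10 + rng.standard_normal((5, 16))])   # novelty-like points
labels = novelty_labels(Y, fake_decoder, T=10, tau=0.5)
```

Averaging the similarity over T latent samples reduces the variance of the score, which is why T appears both in training and in testing.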

C.1 SENSITIVITY TO DIFFERENT INTRINSIC DIMENSIONS

In all other experiments in this paper the default value of the intrinsic dimension is d = 2. Here we study the sensitivity of our numerical results to the following choices of intrinsic dimension: d = 2, 4, 8, 16, 32 and 64, using the KDDCUP-99 and COVID-19 datasets. We fix the intermediate training outlier ratio c = 0.3 for demonstration purposes. We compute the AUC and AP scores averaged over the testing outlier ratios c_test = 0.1, 0.3, 0.5, 0.7 and 0.9, with three runs per setting. Fig. 5 reports the averaged results and their standard deviations, which are indicated by error bars. We can see from Fig. 5 that our default choice of intrinsic dimension d = 2 results in the best performance. For COVID-19 we see a clear decrease of accuracy as the intrinsic dimension increases. For KDDCUP-99 we still see a preference for d = 2, but the decrease at higher dimensions is not as noticeable as for COVID-19. These experiments confirm our default choice and indicate that the accuracy may decrease when the intrinsic dimension is not sufficiently small.

C.2 SENSITIVITY TO MIXTURE PARAMETERS

In the rest of our experiments the default value of the mixture parameter η is 5/6; namely, we assume that the inlier mode has the larger weight in the Gaussian mixture. In this section, we study the sensitivity of the accuracy of MAW to the mixture parameters {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5/6, 0.9}. We use 5/6 ≈ 0.83, instead of the nearby value 0.8, since it was already tested for MAW. The training outlier ratios are 0.1, 0.2 and 0.3. We report results on both KDDCUP-99 and COVID-19 in Fig. 6. We notice that the AUC and AP scores mildly increase as the mixture parameter η increases (though they may slightly decrease at 0.9). It seems that MAW learns the inlier mode better when this mode has larger weight, and consequently gains more robustness. Nevertheless, the variation of the accuracy as a function of η is not large in general.

D ADDITIONAL THEORETICAL GUARANTEES FOR THE W 1 MINIMIZATION

In §D.1 we fully motivate our focus on studying (11) in order to understand the advantage of the Wasserstein distance over the KL divergence in the framework of MAW. In §D.2 we prove Proposition 3.1. Additional, more technical propositions that involve low-rank inliers are stated and proved in the subsequent subsections.

Figure 6: AUC and AP scores with mixture parameters η = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5/6 and 0.9 for KDDCUP-99 (on the left) and COVID-19 (on the right). From the top to the bottom row, the training outlier ratios are c = 0.1, 0.2 and 0.3, respectively.

D.1 MOTIVATION OF STUDYING (11)

The implementation of any VAE or its variants, such as AAE, WAE and MAW, requires the optimization of a regularization penalty R, which measures the discrepancy between the latent distribution and the prior distribution. This penalty is typically the KL divergence, though one may use appropriate metrics such as W_2 or W_1. Therefore, one needs to minimize

R((1/L) Σ_{i=1}^L q(z|x^(i)), p(z))    (12)

over the variational family Q = {q(z|x)}, which is indexed by the parameters of q. Here, L is the batch size of the input data and (1/L) Σ_{i=1}^L q(z|x^(i)) is its observed aggregated distribution. Since explicit expressions of the regularization measurements between aggregated distributions are unknown, it is not feasible to study the minimizer of (12). We thus consider the following approximation of (12):

Σ_{i=1}^L (1/L) R(q(z|x^(i)), p(z)).    (13)

We can minimize one term of this sum at a time, that is, minimize R(q(z|x), p(z)) over Q. This minimization strategy is common in the study of the Wasserstein barycenter problem (Agueh & Carlier, 2011; Peyré et al., 2019; Chen et al., 2018). One of the underlying assumptions of MAW is that the prior distribution p(z) is Gaussian and q(z|x) is a Gaussian mixture. That is, p(z) ∼ N(µ_0, Σ_0) and q(z|x) ∼ ηN(µ_1, Σ_1) + (1 − η)N(µ_2, Σ_2). This gives rise to the following minimization problem:

min_{µ_1,µ_2∈R^K; Σ_1,Σ_2∈S^K_+} R(ηN(µ_1, Σ_1) + (1 − η)N(µ_2, Σ_2), N(µ_0, Σ_0)).    (14)

Similarly to approximating (12) by (13), we approximate (14) by the following minimization problem:

min_{µ_1,µ_2∈R^K; Σ_1,Σ_2∈S^K_+} ηR(N(µ_1, Σ_1), N(µ_0, Σ_0)) + (1 − η)R(N(µ_2, Σ_2), N(µ_0, Σ_0)).

Recall that in MAW, N(µ_1, Σ_1) and N(µ_2, Σ_2) are associated with the inlier and outlier distributions, respectively. We further assume that there is a sufficiently small threshold ε > 0 for which ‖µ_1 − µ_2‖_2 ≥ ε. This is a reasonable assumption since, in practice, if µ_1 and µ_2 are very close, the reconstruction loss will be large.
These assumptions lead to the optimization problem (11) proposed in §3.
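As a sanity check on the role of R, the following sketch (ours) computes W_1 between two one-dimensional Gaussians via the quantile-coupling formula W_1 = ∫_0^1 |F_1^{-1}(q) − F_0^{-1}(q)| dq. For equal variances it reduces exactly to the distance between the means, previewing Step I of the proof of Proposition 3.1:

```python
import numpy as np
from scipy.stats import norm

def w1_gaussians_1d(m1, s1, m0, s0, n=100000):
    """W1 between two 1-D Gaussians via the quantile-coupling formula,
    evaluated with a midpoint rule over the quantile levels."""
    q = (np.arange(n) + 0.5) / n
    return np.mean(np.abs(norm.ppf(q, m1, s1) - norm.ppf(q, m0, s0)))

# equal variances: W1 collapses to |m1 - m0|
d_equal = w1_gaussians_1d(3.0, 1.5, 1.0, 1.5)   # equals |3 - 1| = 2

# unequal variances: an extra term E|Z| * |s1 - s0| appears
d_scale = w1_gaussians_1d(0.0, 2.0, 0.0, 1.0)   # equals sqrt(2/pi)
```

This one-dimensional picture is only illustrative; the multivariate case in MAW is handled adversarially.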

D.2 PROOF OF PROPOSITION 3.1

Recall that µ_0 ∈ R^K is the mean of the prior Gaussian, ε > 0 is the fixed separation parameter for the means of the two modes and η > 1/2 is the fixed mixture parameter. For i = 0, 1, 2, we denote the corresponding Gaussian probability distribution by N(µ_i, Σ_i). Since in our setting Σ_0 = Σ_1 = Σ_2, we denote the common covariance matrix in S^K_{++} by Σ; that is, Σ = Σ_i for i = 0, 1, 2. We first analyze the solution of (11) with R = W_p, where p ≥ 1, and then analyze the solution of (11) with R = KL.

The case R = W_p, p ≥ 1: We follow the next three steps to prove that the minimizer of (11) satisfies µ_1 = µ_0.

Step I: We prove that

W_p(ν_i, ν_0) ≡ W_p(N(µ_i, Σ), N(µ_0, Σ)) = ‖µ_i − µ_0‖_2 for p ≥ 1 and i = 1, 2.    (15)

First, we note that using the definition of W_p, p ≥ 1, and the common notation Π(ν_i, ν_0) for the set of distributions on R^K × R^K with marginals ν_i and ν_0,

W_p^p(ν_i, ν_0) = inf_{π∈Π(ν_i,ν_0)} E_{(x,y)∼π} ‖x − y‖_2^p ≥ inf_{π∈Π(ν_i,ν_0)} ‖E_{(x,y)∼π}(x − y)‖_2^p = ‖µ_i − µ_0‖_2^p,    (16)

where the inequality follows from the convexity of ‖·‖_2^p and Jensen's inequality. On the other hand, for i = 1 or i = 2, let x* be an arbitrary random vector with distribution ν_i, and let y* = x* − µ_i + µ_0. The distribution of y* is Gaussian with mean µ_0 and covariance Σ_i, that is, this distribution is ν_0. Let π* be the joint distribution of the random variables x* and y*. We note that π* is in Π(ν_i, ν_0) and that E_{(x,y)∼π*} ‖x − y‖_2^p = ‖µ_i − µ_0‖_2^p. Therefore,

W_p^p(ν_i, ν_0) = inf_{π∈Π(ν_i,ν_0)} E_{(x,y)∼π} ‖x − y‖_2^p ≤ E_{(x,y)∼π*} ‖x − y‖_2^p = ‖µ_i − µ_0‖_2^p.    (17)

The combination of (16) and (17) immediately yields (15).

Step II: We prove that (11) with R = W_p, p ≥ 1, is equivalent to

min_{µ_1,µ_2∈R^K s.t. µ_0,µ_1,µ_2 colinear and ‖µ_1−µ_2‖_2 ≥ ε} η‖µ_1 − µ_0‖_2 + (1 − η)‖µ_2 − µ_0‖_2.    (18)

We first note that (11) with R = W_p, p ≥ 1, is equivalent to

min_{µ_1,µ_2∈R^K s.t. ‖µ_1−µ_2‖_2 ≥ ε} η‖µ_1 − µ_0‖_2 + (1 − η)‖µ_2 − µ_0‖_2.    (19)
Indeed, this is a direct consequence of the expression derived in Step I for R in this case. It is thus left to show that if µ_1, µ_2 ∈ R^K minimize (19), then we can construct µ̃_1, µ̃_2 ∈ R^K that are colinear with µ_0 and also minimize (19). For any µ_1 and µ_2 in R^K with ‖µ_1 − µ_2‖_2 ≥ ε and for the given µ_0 ∈ R^K, we define µ̃_0, µ̃_1 and µ̃_2 ∈ R^K and demonstrate them in Fig. 7. The point µ̃_0 is the projection of µ_0 onto the line through µ_1 and µ_2, and µ̃_i := µ_i + µ_0 − µ̃_0 for i = 1, 2. We observe the following properties, which can be proved by direct calculation, though Fig. 7 also clarifies them:

‖µ_i − µ_0‖_2 ≥ ‖µ̃_i − µ_0‖_2 for i = 1, 2, and consequently, η‖µ_1 − µ_0‖_2 + (1 − η)‖µ_2 − µ_0‖_2 ≥ η‖µ̃_1 − µ_0‖_2 + (1 − η)‖µ̃_2 − µ_0‖_2;    (20)

‖µ̃_1 − µ̃_2‖_2 = ‖µ_1 − µ_2‖_2 ≥ ε;    (21)

µ̃_1, µ̃_2 and µ_0 are colinear.    (22)

Clearly, the combination of (20), (21) and (22) concludes the proof of Step II. That is, it implies that if µ_1, µ_2 ∈ R^K minimize (19), then µ̃_1 and µ̃_2 defined above are colinear with µ_0 and also minimize (19).

Step III: We directly solve (18) and consequently (11) with R = W_p, p ≥ 1. Due to the colinearity constraint in (18), we can write

µ_0 = (1 + t)µ_1 − tµ_2 for some t ∈ R.    (23)

The objective function in (18) can then be written as

‖µ_1 − µ_2‖_2 (η|t| + (1 − η)|1 + t|) ≥ ε (η|t| + (1 − η)|1 + t|),

where equality is achieved if and only if ‖µ_1 − µ_2‖_2 = ε. We thus define r(t) = η|t| + (1 − η)|1 + t| and note that

r(t) = t + (1 − η) for t ≥ 0;  r(t) = (1 − 2η)t + (1 − η) for −1 ≤ t ≤ 0;  r(t) = −t + (η − 1) for t ≤ −1,

and its derivative is

r′(t) = 1 for t > 0;  r′(t) = 1 − 2η for −1 < t < 0;  r′(t) = −1 for t < −1.

The above expressions for r and r′ and the assumption that η > 1/2 imply that r is increasing for t > 0, decreasing for t < 0, and r(0) = 1 − η < η = r(−1). Thus r has a global minimum at t = 0. Hence, it follows from (23) that the minimizer of (18), and equivalently of (11) with R = W_p, p ≥ 1, satisfies µ_1 = µ_0.

The case R = KL: We prove that the solution of (11) with R = KL satisfies µ_0 = ηµ_1 + (1 − η)µ_2.
We practically follow similar steps as in the proof above.

Step I: We derive an expression for KL(ν_i ‖ ν_0), where i = 1, 2. We use the following general formula, which holds when Σ_0, Σ_1 and Σ_2 are general covariance matrices in S^K_{++} (see, e.g., (2) in Hershey & Olsen (2007)):

KL(ν_i ‖ ν_0) = (1/2) ( log(det Σ_0 / det Σ_i) − K + tr(Σ_0^{−1} Σ_i) + (µ_i − µ_0)^T Σ_0^{−1} (µ_i − µ_0) ).    (24)

Since in our setting Σ_1 = Σ_2 = Σ_0 = Σ, this expression has the simpler form

KL(ν_i ‖ ν_0) = (1/2) (µ_i − µ_0)^T Σ^{−1} (µ_i − µ_0).

Independently of η, the inlier covariance is obtained by an appropriate projection of the covariance Σ_0 onto a κ-dimensional subspace. We similarly note that as η → 1, Σ_2 → diag(1_κ; ∞_{K−κ}), so that the outliers will disperse. We further note that Proposition D.2 implies that the KL divergence is unsuitable for low-rank covariance modeling, as it leads to an infinite value in the optimization problem. At last, we note that the inlier and outlier covariances, Σ_1 and Σ_2, obtained by Proposition D.1, are diagonal. Furthermore, the proof of Proposition D.1 clarifies that the underlying minimization problem of this proposition may assume without loss of generality that the inlier and outlier covariances are diagonal (see, e.g., (32), which is formulated below). On the other hand, the numerical results in §4.3 support the use of full covariances instead of diagonal ones. Nonetheless, we claim that the full covariance matrices of MAW come naturally from its dimension reduction component. This component also contains trainable parameters for the covariances, and they affect the weights of the encoder, that is, they affect both the W_1 minimization and the reconstruction loss. Thus the analysis of the W_1 minimization component is sufficient for inferring the whole behavior of MAW. For tractability purposes, the minimization in (11) ignores the dimension reduction component.
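The general formula (24) and its equal-covariance simplification can be checked numerically. This numpy sketch (ours) is a direct transcription of (24):

```python
import numpy as np

def kl_gaussians(mu1, S1, mu0, S0):
    """KL(N(mu1,S1) || N(mu0,S0)) for full-rank covariances, as in (24)
    (see also Hershey & Olsen, 2007)."""
    K = len(mu1)
    S0_inv = np.linalg.inv(S0)
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    diff = mu1 - mu0
    return 0.5 * (ld0 - ld1 - K + np.trace(S0_inv @ S1) + diff @ S0_inv @ diff)

rng = np.random.default_rng(0)
K = 3
B = rng.standard_normal((K, K))
S = B @ B.T + np.eye(K)           # a shared full-rank covariance
mu0, mu1 = np.zeros(K), np.array([1.0, -2.0, 0.5])

kl = kl_gaussians(mu1, S, mu0, S)
# with equal covariances the formula reduces to half a Mahalanobis distance
quad = 0.5 * (mu1 - mu0) @ np.linalg.inv(S) @ (mu1 - mu0)
```

With equal covariances the log-determinant terms cancel and the trace term equals K, so only the quadratic term survives, exactly as in Step I.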
For completeness, we remark that there are two other differences between the use of (11) in Proposition D.1 and the way it arises in MAW, which may also contribute to the advantage of using full covariances in MAW. First, the minimization in Proposition D.1 uses R = W_2, whereas MAW uses R = W_1, which we find intractable under the rest of the setting of Proposition D.1. Second, the optimization problem (11) with R = W_1 is an approximation of the minimization of W_1((1/L) Σ_{i=1}^L q(z|x^(i)), p(z)) (see §D.1 for an explanation), which is also intractable (even if one uses R = W_2). In §D.4 we prove Proposition D.1 and in §D.5 we prove Proposition D.2.

D.4 PROOF OF PROPOSITION D.1

We follow the same steps as in the proof of Proposition 3.1.

Step I: We verify the formula

W_2^2(N(µ_i, Σ_i), N(0, I)) = ‖µ_i‖_2^2 + ‖Σ_i^{1/2} − I‖_F^2 for i = 1, 2.    (30)

We use the following general formula, which holds when Σ_0, Σ_1 and Σ_2 are general covariance matrices in S^K_+ (see, e.g., (4) in Panaretos & Zemel (2019)):

W_2^2(N(µ_i, Σ_i), N(µ_0, Σ_0)) = ‖µ_i − µ_0‖_2^2 + tr(Σ_i + Σ_0 − 2(Σ_i^{1/2} Σ_0 Σ_i^{1/2})^{1/2}), i = 1, 2.    (31)

Indeed, (30) is obtained as a direct consequence of (31) using the identity

tr(Σ_i + I − 2Σ_i^{1/2}) = tr((Σ_i^{1/2} − I)^2) = ‖Σ_i^{1/2} − I‖_F^2.

Step II: We reformulate the underlying minimization problem in two stages. We first claim that the minimizer of (11) with R = W_2 and the constraint that Σ_1 is of rank κ and Σ_2 is of rank K can be expressed as the minimizer of

min_{µ_1,µ_2∈R^K s.t. ‖µ_1−µ_2‖_2 = ε; Σ_1,Σ_2 diagonal in R^{K×K}, rank(Σ_1)=κ, rank(Σ_2)=K} η(‖µ_1‖_2^2 + ‖Σ_1^{1/2} − I‖_F^2) + (1 − η)(‖µ_2‖_2^2 + ‖Σ_2^{1/2} − I‖_F^2).    (32)

In view of (11) and (30), we only need to prove that the minimizer of (32) is the same if one removes the constraint that Σ_1 and Σ_2 are both diagonal and instead requires them to lie in S^K_+. This is easy to show. Indeed, if for i = 1 or i = 2, Σ_i ∈ S^K_+, then it can be diagonalized as Σ_i = U_i^T Λ_i U_i, where Λ_i ∈ S^K_+ is diagonal and U_i is orthogonal. Hence Σ_i^{1/2} = U_i^T Λ_i^{1/2} U_i and

‖Σ_i^{1/2} − I‖_F^2 = ‖U_i^T Λ_i^{1/2} U_i − I‖_F^2 = ‖U_i^T (Λ_i^{1/2} − I) U_i‖_F^2 = ‖Λ_i^{1/2} − I‖_F^2.

Consequently, W_2(N(µ_i, Σ_i), N(0, I)) = W_2(N(µ_i, Λ_i), N(0, I)) for i = 1, 2, and the above claim is concluded. Next, we vectorize the minimization problem in (32) as follows. We denote by R_+ the set of positive real numbers. Let b be a general vector in R^K_+, let a′ be a general vector in R^κ_+ and set a := (a′; 0_{K−κ}) ∈ R^K.
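Formulas (30) and (31) can be verified numerically. The following sketch (ours) implements (31) with scipy's matrix square root and checks the identity used to derive (30) for a standard Gaussian prior:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_sq_gaussians(mu1, S1, mu0, S0):
    """Squared W2 between Gaussians, as in (31):
    ||mu1-mu0||^2 + tr(S1 + S0 - 2 (S1^{1/2} S0 S1^{1/2})^{1/2})."""
    r1 = sqrtm(S1)
    cross = sqrtm(r1 @ S0 @ r1)
    return float(np.sum((mu1 - mu0) ** 2) + np.trace(S1 + S0 - 2 * cross).real)

rng = np.random.default_rng(0)
K = 3
B = rng.standard_normal((K, K))
S = B @ B.T + np.eye(K)
mu = np.array([1.0, 2.0, -1.0])

# identity (30): against N(0, I) the trace term equals ||S^{1/2} - I||_F^2
lhs = w2_sq_gaussians(mu, S, np.zeros(K), np.eye(K))
rhs = float(np.sum(mu ** 2) + np.linalg.norm(sqrtm(S).real - np.eye(K), "fro") ** 2)
```

Unlike the KL divergence, this expression stays finite when a covariance is rank deficient, which is the point of Propositions D.1 and D.2.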
Given the constraints on Σ_1 and Σ_2, we can parametrize the diagonal elements of Σ_1^{1/2} and Σ_2^{1/2} by a and b, respectively.

We remark that for the neural-network-based methods (DAGMM, DSEBMs, OCGAN and RSRAE), we followed similar implementation details to those described in §A.2 for MAW.

Deep Autoencoding Gaussian Mixture Model (DAGMM) (Zong et al., 2018) is a deep autoencoder model. It optimizes an end-to-end structure that contains both an autoencoder and an estimator for a Gaussian mixture model, and anomalies are detected using this mixture model. We remark that this mixture model is proposed for the inliers.

Deep Structured Energy-Based Models (DSEBMs) (Zhai et al., 2016) make decisions based on an energy function, namely the negative log-probability that a sample follows the data distribution. The energy-based model is connected to an autoencoder in order to avoid the need for complex sampling methods.

Isolation Forest (IF) (Liu et al., 2008) iteratively constructs special binary trees for the training set and identifies anomalies in the test set as the points with short average path lengths in the trees.

Local Outlier Factor (LOF) (Breunig et al., 2000) measures how isolated a data point is from its surrounding neighborhood. This measure is based on an estimate of the local density of a data point using its k nearest neighbors. In the novelty detection setting, it identifies novelties according to the low-density regions learned from the training data.

One-class Novelty Detection Using GANs (OCGAN) (Perera et al., 2019) is composed of four neural networks: a denoising autoencoder, two adversarial discriminators, and a classifier. It aims to adversarially push the autoencoder to learn only the inlier features.

One-Class SVM (OCSVM) (Heller et al., 2003) estimates the margin of the training set, which is used as the decision boundary for the test set. Usually it utilizes a radial basis function kernel to obtain flexibility.

Robust Subspace Recovery Autoencoder (RSRAE) (Lai et al., 2020) uses an autoencoder structure together with a linear RSR layer with a penalty based on the ℓ_{2,1} energy. The RSR layer extracts features of inliers in the latent code while helping to reject outliers. Instances with higher reconstruction errors are viewed as outliers. RSRAE trains a model using the training data; we then apply this model for detecting novelties in the test data.
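The three classical baselines (IF, LOF and OCSVM) are available in scikit-learn. This minimal usage sketch (ours; default hyperparameters, not necessarily those used in the experiments) shows the shared novelty-detection interface of fitting on training inliers and predicting ±1 labels on test data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 5))                # inlier training data
X_test = np.vstack([rng.standard_normal((20, 5)),      # inlier-like test points
                    8 + rng.standard_normal((5, 5))])  # 5 obvious novelties

models = {
    "IF": IsolationForest(random_state=0),
    "LOF": LocalOutlierFactor(novelty=True),  # novelty=True enables .predict
    "OCSVM": OneClassSVM(kernel="rbf", gamma="scale"),
}
# predictions are +1 for points deemed normal and -1 for novelties
preds = {name: m.fit(X_train).predict(X_test) for name, m in models.items()}
```

Note that LocalOutlierFactor must be constructed with novelty=True to expose predict on unseen data, matching the semi-supervised setting of this paper.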

F ADDITIONAL DETAILS ON THE DIFFERENT DATASETS

Below we provide additional details on the five datasets used in our experiments. We remark that each dataset contains several clusters (2 for KDDCUP-99, 5 for Reuters-21578, 3 for COVID-19, the 11 largest ones for Caltech101 and 10 for Fashion MNIST). We summarize the number of inliers and outliers per dataset (for both training and testing) in Table 1.

KDDCUP-99 is a classic dataset for intrusion detection. It contains feature vectors of connections between internet protocols and a binary label for each feature vector identifying normal vs. abnormal ones. The abnormal ones are associated with an "attack" or "intrusion".

Reuters-21578 contains 21,578 documents with 90 text categories having multiple labels. Following Lai et al. (2020), we consider the five largest classes with single labels. We utilize the scikit-learn vectorizers TFIDF and HashingVectorizer (Rajaraman & Ullman, 2011).

Fashion MNIST is an image dataset containing 10 categories of grayscale images of clothing and accessory items. Each image is of size 28 × 28 and we rescale the pixel intensities to lie in [−1, 1].

We remark that COVID-19, Caltech101 and Reuters-21578 separate between training and testing datapoints. For KDDCUP-99, we randomly split the data into training and testing datasets of equal sizes. The competitive advantage of MAW in comparison to the rest of the methods is also noticeable in this setting. We note that OCSVM, the traditional distance-based method, and IF, the traditional density-based method, perform poorly in this scenario, whereas they performed well in our original setting.
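The rescaling of pixel intensities to [−1, 1] mentioned above can be done with the affine map x/127.5 − 1; the exact formula used in the paper is our assumption, but this map is a common convention for 8-bit images:

```python
import numpy as np

def rescale_to_unit_interval(img_uint8):
    """Map 8-bit pixel intensities {0, ..., 255} affinely onto [-1, 1].
    The specific formula (x / 127.5 - 1) is one standard convention."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

img = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = rescale_to_unit_interval(img)
```

Mapping into [−1, 1] matches a tanh-style decoder output range, a frequent pairing in VAE and GAN pipelines.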

H NUMERICAL RESULTS OF EXPERIMENTS

We present as tables the numerical values depicted in Figs. 2 and 3 in §H.1, and those in Fig. 4 in §H.2.



Figure 1: Demonstration of the architecture of MAW for novelty detection.

Figure 2: AUC (on left) and AP (on right) scores with training outlier ratios c = 0.1, 0.2, 0.3, 0.4 and 0.5 for the two non-image datasets: KDDCUP-99 and Reuters-21578.

Figure 3: AUC (on left) and AP (on right) scores with training outlier ratios c = 0.1, 0.2, 0.3, 0.4 and 0.5 for the three image datasets: COVID-19, Caltech101 and Fashion MNIST.

Figure 4: AUC (on left) and AP (on right) scores for variants of MAW (missing a novel component) with training outlier ratios c = 0.1, 0.2, 0.3 using the KDDCUP-99 and COVID-19 datasets.

Algorithm 1 Training MAW
Input: Training data {x^(i)}_{i=1}^L; initialized parameters θ, ϕ and δ of E, D and Dis, respectively; initialized A; weight η; number of epochs; batch size I; sampling number T; learning rate α
Output: Trained parameters θ, ϕ and A
1: for each epoch do

Figure 5: AUC and AP scores with intrinsic dimensions d = 2, 4, 8, 16, 32 and 64 for KDDCUP-99 (on the left) and COVID-19 (on the right), where c = 0.3.

Figure 7: Illustration of the points μ0 , μ1 and μ2 and their properties.

These vectorizers preprocess the documents into 26,147-dimensional vectors. COVID-19 (Radiography) contains chest X-ray RGB images, labeled according to the following three categories: COVID-19 positive, normal, and bacterial pneumonia. We resize the images to 64 × 64 and rescale the pixel intensities to lie in [−1, 1]. Caltech101 contains RGB images of objects from 101 categories with identifying labels. Following Lai et al. (2020), we use the 11 largest classes, resize their images to 32 × 32 and rescale the pixel intensities to lie in [−1, 1].

G EXPERIMENTS WITH DIFFERENT OUTLIER TYPES

In this section, we test the performance of MAW and the benchmark methods when the training and test sets are corrupted by outliers with different structures. We generate a dataset, which we call "Mix Caltech101", in the following way. We fix the largest class of Caltech101 (containing airplane images) as the inlier class and randomly split it into the training inlier class (68.75%) and the testing inlier class (31.25%). We form the training set by corrupting the training inlier class with random samples from the ten classes of CIFAR10 (Krizhevsky et al., 2009) with training outlier ratio c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For the test set, we corrupt the testing inlier class with "tile images" from the MVTec dataset (Bergmann et al., 2019) with testing outlier ratio c_test ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. The rest of the settings of the experiments are identical to the description in §4.2. We present the AUC and AP scores and their standard deviations in Fig. 8.

Figure 8: AUC and AP scores with training outlier ratio c ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for the Mix Caltech101 dataset.

Table 1: Numbers of inliers and outliers for training and testing used in the five datasets.

H.1 TABLE REPRESENTATION FOR FIGS. 2 AND 3

Tables 2-11 report the averaged AUC and AP scores with training outlier ratios c ∈ {0.1, 0.2, 0.3, 0.4, 0.5} that were depicted in Figs. 2 and 3. Each table describes one of the averaged scores (AUC or AP) for one of the five datasets (KDDCUP-99, Reuters-21578, COVID-19, Caltech101 and Fashion MNIST) and also indicates the standard deviation of each value. The outperforming methods are marked in bold.

H.2 TABLE REPRESENTATION FOR FIG. 4

Tables 12-15 record the averaged AUC and AP scores with training outlier ratios c = 0.1, 0.2, 0.3 that were depicted in Fig. 4. Each table describes one of the averaged scores (AUC or AP) for one of the two representative datasets (KDDCUP-99 and COVID-19) and also indicates the standard deviation of each value. The outperforming methods are marked in bold.


Step II: We reformulate the optimization problem. Step I implies that (11) with R = KL can be written as

min_{µ_1,µ_2∈R^K s.t. ‖µ_1−µ_2‖_2 ≥ ε} η(µ_1 − µ_0)^T Σ^{−1}(µ_1 − µ_0) + (1 − η)(µ_2 − µ_0)^T Σ^{−1}(µ_2 − µ_0).    (25)

We express the eigenvalue decomposition of Σ^{−1} as Σ^{−1} = U Λ U^T, where Λ ∈ S^K_+ is diagonal and U is orthogonal. Applying the change of variables µ′_i = Λ^{1/2} U^T µ_i for i = 0, 1, 2, we rewrite (25) as

min η‖µ′_1 − µ′_0‖_2^2 + (1 − η)‖µ′_2 − µ′_0‖_2^2.    (26)

At last, applying the same colinearity argument as above (supported by Fig. 7), we conclude the following equivalent formulation of (26), with µ′_0, µ′_1, µ′_2 constrained to be colinear:

min_{µ′_1,µ′_2 colinear with µ′_0} η‖µ′_1 − µ′_0‖_2^2 + (1 − η)‖µ′_2 − µ′_0‖_2^2.    (27)

Step III: We directly solve (27). Due to the colinearity constraint, we can write

µ′_0 = (1 + t)µ′_1 − tµ′_2 for some t ∈ R,    (28)

and express the objective function of (27) as

‖µ′_1 − µ′_2‖_2^2 (ηt^2 + (1 − η)(1 + t)^2),

where equality in the corresponding lower bound is achieved if and only if ‖µ_1 − µ_2‖_2 = ε. We thus define r(t) = ηt^2 + (1 − η)(1 + t)^2 and note that r′(t) = 2(t + (1 − η)) and r″(t) = 2, and thus conclude that r attains its global minimum at t = η − 1. This observation and (28) imply that the minimizers µ_1 and µ_2 of (11) with R = KL satisfy µ_0 = ηµ_1 + (1 − η)µ_2.

We study the minimization problem (11) when Σ_1 has low rank and Σ_2 ∈ S^K_{++}, for R = W_2 or R = KL. Unfortunately, the case R = W_1 is hard to analyze and compute. We first formulate our result for R = W_2. In this case we assume that the prior distribution is a standard Gaussian distribution on R^K; that is, it has mean µ_0 = 0_K and covariance Σ_0 = I_{K×K}. We further denote by 1_K the vector (1, ..., 1) ∈ R^K. Similarly, we may define, for any n ∈ N, 0_n, 1_n and I_{n×n}; when it is clear from the context we only use 0, 1 and I.
For vectors a ∈ R^n and b ∈ R^m, we denote the concatenated vector in R^{n+m} by (a; b). Proposition D.1 asserts that, for η > 1/2 and an explicit u ∈ (0, 1), the minimizer of (11) with R = W_2 and with the constraint that Σ_1 is of rank κ and Σ_2 is of rank K, or equivalently, the minimizer of (32), takes an explicit form. We next formulate our simple result on the ill-posedness of (11) with R = KL and with the same constraint as in Proposition D.1: the solution of (11) with R = KL, with the additional constraint that Σ_1 is of rank κ and Σ_0 = I, is ill-posed.

Next we clarify the implications of both propositions. Note that Proposition D.1 implies that as η → 1, u → 0. Hence, for the inlier component, µ_1 → 0_K as η → 1 and Σ_1 = diag(1_κ; 0_{K−κ}), so in the limit the inlier distribution has the same mean as the prior distribution.

We now continue the proof of Proposition D.1, independently of b; that is, we set Σ_1^{1/2} = diag(a) and Σ_2^{1/2} = diag(b) and write the objective function of (32) accordingly. Combining this last expression and the same colinearity argument as in §D.2 (supported by Fig. 7), (32) is equivalent to a minimization problem, which we refer to as (33).

Step III: We solve (33). By the colinearity constraint, we can write (µ_1; a) and (µ_2; b) in terms of a scalar u ∈ R. Furthermore, denoting the coordinates of a and b by {a_i}_{i=1}^κ and {b_i}_{i=1}^K, we similarly obtain coordinate-wise relations, which we label (34) and (35). Combining (30), (34) and the above relations, we rewrite the objective function of (33), where equality is achieved if and only if ‖µ_1 − µ_2‖_2 = ε. One can make the following two observations: u = 0 does not yield a minimizer of (33), and for any u ≠ 0, (36) attains its minimum at a = 1_κ. In view of these observations and the derivation above, we define a function f(u) and note that (33) is equivalent to minimizing f. We rewrite f(u) in terms of two auxiliary functions r_1 and r_2, compute their derivatives, and find the critical points of r_1 and of r_2. We note that r_1 is increasing on (u^(2)_{r_1}, ∞) and decreasing on (−∞, u^(2)_{r_1}).
On the other hand, r_2 is increasing on (u^(2)_{r_2}, ∞) and decreasing on (−∞, u^(2)_{r_2}), where u^(2)_{r_2} ∈ (0, 1). Computing the derivative of f with respect to u, we consequently obtain that the minimum of f is attained at u* := u^(2)_{r_2}. By (34) and (35), the means µ_1, µ_2 and the covariance matrices Σ_1, Σ_2 then satisfy the formulas stated in Proposition D.1. Moreover, the norms of µ_1 and µ_2 can be computed from (35) as u*ε and (1 − u*)ε, respectively.

D.5 PROOF OF PROPOSITION D.2

Notice that since Σ_0 ∈ S^K_{++}, det(Σ_0) > 0. On the other hand, since Σ_1 has rank κ < K, det(Σ_1) = 0. This observation and (24) imply that KL(N(µ_1, Σ_1) ‖ N(µ_0, Σ_0)) = ∞, so the objective of (11) with R = KL is infinite whenever Σ_1 is rank deficient; that is, the problem is ill-posed.
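The divergence can be seen numerically: for the centered case, KL(N(0, Σ) ‖ N(0, I)) = (1/2)(−log det Σ − K + tr Σ), which blows up as an eigenvalue of Σ shrinks toward zero. A small sketch (ours):

```python
import numpy as np

def kl_to_std_normal(Sigma):
    """KL(N(0, Sigma) || N(0, I)) for a full-rank Sigma, specializing (24)."""
    K = Sigma.shape[0]
    _, ld = np.linalg.slogdet(Sigma)
    return 0.5 * (-ld - K + np.trace(Sigma))

# shrink one eigenvalue toward 0: the log-determinant term diverges,
# so the KL objective is ill-posed in the exactly low-rank limit
kls = [kl_to_std_normal(np.diag([1.0, eps])) for eps in (1e-1, 1e-3, 1e-6)]
```

The W_2 objective of (30)-(31), by contrast, remains finite for singular covariances, which is why MAW can model a low-rank inlier component under Wasserstein regularization.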

E ADDITIONAL DETAILS ON THE BENCHMARK METHODS

We overview the benchmark methods compared with MAW, presented in alphabetical order. We will include all tested codes in a supplemental webpage. For completeness, we mention the following links (or papers with links) that we used for the different codes. For DSEBMs and DAGMM we used the codes of Golan & El-Yaniv (2018). For LOF, OCSVM and IF we used the scikit-learn (Buitinck et al., 2013) packages for novelty detection. For OCGAN we used its TensorFlow implementation from https://pypi.org/project/ocgan. For RSRAE, we adapted the code of Lai et al. (2020) to novelty detection. All experiments were executed on a Linux machine with 64GB RAM and four GTX1080Ti GPUs.

