RESTRICTED GENERATIVE PROJECTION FOR ONE-CLASS CLASSIFICATION AND ANOMALY DETECTION

Anonymous

Abstract

We present a novel framework for one-class classification and anomaly detection. The core idea is to learn a mapping to transform the unknown distribution of training (normal) data to a known distribution that is supposed to be different from the transformed distribution of unknown abnormal data. Crucially, the target distribution of training data should be sufficiently simple, compact, and informative. The simplicity is to ensure that we can sample from the distribution easily, the compactness is to ensure that the decision boundary between normal data and abnormal data is clear and reliable, and the informativeness is to ensure that the transformed data preserve the important information of the original data. Therefore, we propose to use truncated Gaussian, uniform in hyperball, uniform on hypersphere, or uniform between hyperspheres, as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Our model is simple and easy to train especially compared with those based on generative models. Comparative studies on a few benchmark datasets verify the effectiveness of our method in comparison to baselines.

1. INTRODUCTION

Anomaly detection (AD) aims to distinguish normal data from abnormal data using a model trained only on normal data, without any information about abnormal data (Chandola et al., 2009; Pang et al., 2021; Ruff et al., 2021). AD is useful in numerous real problems such as intrusion detection for video surveillance, fraud detection in finance, and fault detection for sensors. Many AD methods have been proposed in the past decades (Schölkopf et al., 1999; 2001; Tax & Duin, 2004; Liu et al., 2008). For instance, Schölkopf et al. (2001) proposed the one-class support vector machine (OC-SVM), which finds, in a high-dimensional kernel feature space, a hyperplane yielding a large distance between the normal training data and the origin. Tax & Duin (2004) presented the support vector data description (SVDD), which obtains a spherical boundary (with minimum volume) around the normal training data to identify abnormal samples.
There are also many deep learning based AD methods (Erfani et al., 2016; Ruff et al., 2018; Golan & El-Yaniv, 2018; Hendrycks et al., 2018; Abati et al., 2019; Pidhorskyi et al., 2018; Zong et al., 2018; Wang et al., 2019; Liznerski et al., 2020; Qiu et al., 2021; Raghuram et al., 2021; Wang et al., 2021). Deep learning based AD methods may be organized into three categories. The first category is based on compression and reconstruction. These methods usually use an autoencoder (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2013) to learn a low-dimensional representation from which the high-dimensional data can be reconstructed (Vincent et al., 2008; Wang et al., 2021). It is expected that an autoencoder learned on the normal training data has a much higher reconstruction error on unknown abnormal data than on normal data.
The second category combines classical one-class classification (Tax & Duin, 2004; Golan & El-Yaniv, 2018) with deep learning (Ruff et al., 2018; 2019; 2020; Perera & Patel, 2019; Bhattacharya et al., 2021; Shenkar & Wolf, 2022; Chen et al., 2022). For instance, Ruff et al. (2018) proposed a method called deep one-class SVDD. The main idea is to use deep learning to construct a minimum-radius hypersphere that includes all the training data, while the unknown abnormal data are expected to fall outside. The last category is based on generative learning or adversarial learning (Malhotra et al., 2016; Deecke et al., 2018; Pidhorskyi et al., 2018; Nguyen et al., 2019; Perera et al., 2019; Goyal et al., 2020; Raghuram et al., 2021; Yan et al., 2021). For example, Perera et al. (2019) proposed to use the generative adversarial network (GAN) (Goodfellow et al., 2014) with a constrained latent representation to detect anomalies in image data. Goyal et al. (2020) presented a method called deep robust one-class classification (DROCC), which aims to find a low-dimensional manifold that accommodates the normal data via an adversarial optimization approach.
Although deep learning AD methods have shown promising performance on various datasets, they still have limitations. For instance, one-class classification methods such as Deep SVDD (Ruff et al., 2018) only ensure that the normal data are included in a hypersphere but cannot guarantee that they are distributed evenly in it, which may leave large empty regions in the hypersphere and hence yield an incorrect decision boundary. Adversarial learning methods such as those of Nguyen et al. (2019), Perera et al. (2019), and Goyal et al. (2020) may suffer from high computational cost and instability in optimization.
In this work, we present a restricted generative projection framework for one-class classification and anomaly detection.
The model of the framework is efficient to train and provides reliable decision boundaries for precise anomaly detection. Our main idea is to train a deep neural network to convert the distribution of normal training data to a target distribution that is simple, compact, and informative, which provides a reliable decision boundary for separating abnormal data from normal data. There are many choices for the target distribution, such as truncated Gaussian and uniform on hypersphere. Our contributions are three-fold.
• We present a novel framework for one-class classification and anomaly detection. It aims to transform the data distribution to target distributions that are easily violated by unknown abnormal data.
• We present a few simple, compact, and informative target distributions and propose to minimize the distances between the converted data distribution and these target distributions by minimizing the maximum mean discrepancy.
• We conduct extensive experiments to compare the performance of different target distributions and to compare our method with state-of-the-art competitors. The numerical results on five benchmark datasets verify the effectiveness of our methods.

2. METHODOLOGY

Suppose we have a set of $m$-dimensional training data $X = \{x_1, x_2, \ldots, x_n\}$ drawn from an unknown bounded distribution $\mathcal{D}_x$, and any sample drawn from $\mathcal{D}_x$ is normal. We want to train a model $M$ on $X$ to determine whether a test point $x_{\mathrm{new}}$ is drawn from $\mathcal{D}_x$ or not. One may consider estimating the density function $p_x$ of $\mathcal{D}_x$ using techniques such as kernel density estimation (Rosenblatt, 1956). If the estimate $\hat{p}_x$ is good enough, one can determine whether $x_{\mathrm{new}}$ is normal according to the value of $\hat{p}_x(x_{\mathrm{new}})$: if $\hat{p}_x(x_{\mathrm{new}})$ is zero or close to zero, $x_{\mathrm{new}}$ is an abnormal data point; otherwise, $x_{\mathrm{new}}$ is a normal data point¹. However, the dimensionality of the data is often high, and hence it is very difficult to obtain a good estimate $\hat{p}_x$.
We propose to learn a mapping $\mathcal{T}: \mathbb{R}^m \to \mathbb{R}^d$ that transforms the unknown bounded distribution $\mathcal{D}_x$ to a known distribution $\mathcal{D}_z$, while there still exists a mapping $\mathcal{T}': \mathbb{R}^d \to \mathbb{R}^m$ that can approximately recover $\mathcal{D}_x$ from $\mathcal{D}_z$. Let $p_z$ be the density function of $\mathcal{D}_z$. Then we can determine whether $x_{\mathrm{new}}$ is normal according to the value of $p_z(\mathcal{T}(x_{\mathrm{new}}))$. More precisely, we want to solve the following problem:
$$\min_{\mathcal{T},\,\mathcal{T}'} \ \mathcal{M}\big(\mathcal{T}(\mathcal{D}_x), \mathcal{D}_z\big) + \lambda\,\mathcal{M}\big(\mathcal{T}'(\mathcal{T}(\mathcal{D}_x)), \mathcal{D}_x\big), \tag{1}$$
where $\mathcal{M}(\cdot, \cdot)$ denotes some distance metric between two distributions and $\lambda$ is a trade-off parameter for the two terms. Note that if $\lambda = 0$, $\mathcal{T}$ may convert any distribution to $\mathcal{D}_z$ and lose the ability to distinguish normal data from abnormal data. Based on the universal approximation theorems (Pinkus, 1999; Lu et al., 2017) and the substantial success of neural networks, we use deep neural networks (DNNs) to model $\mathcal{T}$ and $\mathcal{T}'$. Let $f_\theta$ and $g_\phi$ be two DNNs with parameters $\theta$ and $\phi$, respectively. We solve the following problem:
$$\min_{\theta,\,\phi} \ \mathcal{M}\big(\mathcal{D}_{f_\theta(x)}, \mathcal{D}_z\big) + \lambda\,\mathcal{M}\big(\mathcal{D}_{g_\phi(f_\theta(x))}, \mathcal{D}_x\big), \tag{2}$$
where $f_\theta$ and $g_\phi$ serve as encoder and decoder, respectively.
However, problem (2) is intractable because $\mathcal{D}_x$ is unknown and $\mathcal{D}_{f_\theta(x)}$ and $\mathcal{D}_{g_\phi(f_\theta(x))}$ cannot be computed analytically. Note that the samples of $\mathcal{D}_x$ and $\mathcal{D}_{g_\phi(f_\theta(x))}$ are given and paired. Hence the second term in the objective of (2) can be replaced by a sample reconstruction error such as $\frac{1}{n}\sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|^2$. On the other hand, we can also sample from $\mathcal{D}_{f_\theta(x)}$ and $\mathcal{D}_z$ easily, but their samples are not paired. Hence, the metric $\mathcal{M}$ in the first term of the objective of (2) should be able to measure the distance between two distributions using their finite samples. To this end, we propose to use the kernel maximum mean discrepancy (MMD) (Gretton et al., 2012) to measure the distance between $\mathcal{D}_{f_\theta(x)}$ and $\mathcal{D}_z$. In statistics, MMD is often used for two-sample tests, and its principle is to find a function that takes different expectations on two different distributions:
$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{\|f\|_{\mathcal{H}} \le 1} \big(\mathbb{E}_p[f(x)] - \mathbb{E}_q[f(y)]\big), \tag{3}$$
where $p, q$ are probability distributions, $\mathcal{F}$ is a class of functions $f: \mathcal{X} \to \mathbb{R}$, and $\mathcal{H}$ denotes a reproducing kernel Hilbert space. Its empirical estimate is given by
$$\mathrm{MMD}^2[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \ne i}^{m} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i}^{n} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j), \tag{4}$$
where $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ are samples consisting of i.i.d. observations drawn from $p$ and $q$, respectively (in (4), $m$ and $n$ denote the two sample sizes), and $k(\cdot, \cdot)$ denotes a kernel function, e.g., the Gaussian kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$. Based on the above analysis, we reformulate (2) as
$$\min_{\theta,\,\phi} \ \mathrm{MMD}^2(Z_\theta, Z_T) + \frac{\lambda}{n} \sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|^2, \tag{5}$$
where $Z_\theta = \{f_\theta(x_1), f_\theta(x_2), \ldots, f_\theta(x_n)\}$ and $Z_T = \{z_i : z_i \sim \mathcal{D}_z,\ i = 1, \ldots, n\}$. The first term of the objective function in (5) makes $f_\theta$ learn the mapping $\mathcal{T}$ from the data distribution $\mathcal{D}_x$ to the target distribution $\mathcal{D}_z$, and the second term ensures that $f_\theta$ preserves the main information of the observations, provided that $\lambda$ is sufficiently large.
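The empirical estimate (4) and the objective (5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `gaussian_kernel`, `mmd2`, and `rgp_loss`, the value of `gamma`, and the toy Gaussian samples are all hypothetical choices, and the encoder output and reconstruction are passed in as plain arrays rather than produced by trained networks.

```python
import numpy as np


def gaussian_kernel(a, b, gamma):
    """Return the matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(x, y, gamma=1.0):
    """Empirical MMD^2 of equation (4) for samples x (m rows) and y (n rows)."""
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, gamma)
    k_yy = gaussian_kernel(y, y, gamma)
    k_xy = gaussian_kernel(x, y, gamma)
    # The within-sample sums exclude the diagonal terms (j != i).
    within_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    within_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return within_x + within_y - 2.0 * k_xy.mean()


def rgp_loss(x, z_theta, x_rec, z_target, lam=1.0, gamma=1.0):
    """Objective (5): MMD^2(Z_theta, Z_T) + (lam / n) * sum_i ||x_i - x_rec_i||^2."""
    recon = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    return mmd2(z_theta, z_target, gamma) + lam * recon


# Toy check (assumed data): two samples from the same Gaussian versus a shifted one.
rng = np.random.default_rng(0)
p_a = rng.standard_normal((200, 2))
p_b = rng.standard_normal((200, 2))
q = rng.standard_normal((200, 2)) + 3.0

mmd_same = mmd2(p_a, p_b, gamma=0.5)     # same distribution: close to zero
mmd_shifted = mmd2(p_a, q, gamma=0.5)    # shifted distribution: clearly larger
```

Because the diagonal terms are excluded, the estimate can be slightly negative when the two samples come from the same distribution; in training, $f_\theta$ and $g_\phi$ would be differentiable networks and the same expression would be evaluated on each mini-batch.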
Figure 1 shows the paradigm of our method.
GiHS (Gaussian in hypersphere, Figure 2a) is actually a truncated Gaussian. Suppose we want to draw $n$ samples from GiHS. A simple approach is to draw $(1 + \rho)n$ samples from a standard $d$-dimensional Gaussian and discard the $\rho n$ samples with the largest $\ell_2$ norms. The maximum $\ell_2$ norm of the remaining $n$ points is the radius of the hypersphere. One may also use the inverse transform method (Marsaglia, 1963). We have the following proposition (a simple proof can be found in the appendix).
Proposition 2.1. Suppose $z_1, z_2, \ldots, z_n$ are sampled from $\mathcal{N}(0, I_d)$ independently. Then for any $r > d/2$, the following inequality holds:
$$P\big(\|z_j\| \ge r\big) \le \exp\left(\frac{-r^2 + \sqrt{d(2r^2 - d)}}{2}\right), \quad j = 1, \ldots, n. \tag{6}$$
From the proposition, using the union bound, we have
$$P\Big(\max_{1 \le j \le n} \|z_j\| \le r\Big) \ge 1 - n \exp\left(\frac{-r^2 + \sqrt{d(2r^2 - d)}}{2}\right). \tag{7}$$
It means a hypersphere of radius $r$ can include all the $n$ samples with probability at least $1 - n\rho$, where $\rho = \exp\big((-r^2 + \sqrt{d(2r^2 - d)})/2\big)$. On the other hand, according to (6), if we draw $n/\rho$ samples from $\mathcal{N}(0, I_d)$, the expected number of samples falling into a hypersphere of radius $r$ is at least $n$. Thus, if we draw $n'$ samples and remove the $n' - n$ samples with the largest $\ell_2$ norms, the expected radius of the smallest hypersphere enclosing the remaining $n$ samples is at most $r_0$, where $r_0$ is the solution of the equation $\log(n'/n) = \big(r^2 - \sqrt{d(2r^2 - d)}\big)/2$ with respect to $r$.
UiHS (uniform in hypersphere, Figure 2b) is a hyperball in which all the samples are distributed uniformly. To sample from UiHS, we first sample from $\mathcal{U}(-1, 1)^d$. Then we discard all the data points outside the radius-1 hyperball centered at the origin. The following proposition gives a probability result for sampling from a $d$-dimensional uniform distribution.
Proposition 2.2. Suppose $z_1, z_2, \ldots, z_n$ are sampled from $\mathcal{U}(-1, 1)^d$ independently. Then for any $r > 0$, the following inequality holds:
$$P\big(\|z_j\| \ge r\big) \le \frac{4d}{5(3r^2 - d)^2}, \quad j = 1, \ldots, n. \tag{8}$$
Using the union bound for (8), we obtain
$$P\Big(\max_{1 \le j \le n} \|z_j\| \le r\Big) \ge 1 - \frac{4dn}{5(3r^2 - d)^2}. \tag{9}$$
It means a hypersphere of radius $r$ can include all the $n$ samples with probability at least $1 - n\rho$, where $\rho = 0.8\,d\,(3r^2 - d)^{-2}$. On the other hand, (8) indicates that if we draw $n/\rho$ samples from $\mathcal{U}(-1, 1)^d$, the expected number of samples falling into a hypersphere of radius $r$ is at least $n$. In fact, sampling from UiHS is closely related to the curse of dimensionality: we need to draw a large number of points from $\mathcal{U}(-1, 1)^d$ if $d$ is large, because only a small fraction of the volume of the hypercube lies inside the hyperball. More precisely, letting $V_{\mathrm{hypercube}}$ be the volume of a hypercube with side length $2r$ and $V_{\mathrm{hyperball}}$ be the volume of a hyperball with radius $r$, we have
$$\frac{V_{\mathrm{hyperball}}}{V_{\mathrm{hypercube}}} = \frac{\pi^{d/2}}{d\,2^{d-1}\,\Gamma(d/2)} \triangleq \eta, \tag{10}$$
where $\Gamma$ is the gamma function. Therefore, we need to draw $n/\eta$ samples from $\mathcal{U}(-1, 1)^d$ to ensure that the expected number of samples inside the hyperball is $n$, where $\eta$ is small if $d$ is large.
UbHS (uniform between hyperspheres, Figure 2c) can be obtained via UiHS. We first sample from UiHS and then remove all samples enclosed by a smaller hypersphere. Since the volume ratio of two hyperballs with radii $r_1$ and $r_2$ is $(r_1/r_2)^d$, we need to draw $n/(1 - (r_2/r_1)^d)$ samples from UiHS to ensure that the expected number of samples between the two hyperspheres is $n$.
UoHS (uniform on hypersphere, Figure 2d) can be easily obtained by sampling from $\mathcal{N}(0, I_d)$. Specifically, for every $z_i$ drawn from $\mathcal{N}(0, I_d)$, we normalize it as $z_i \leftarrow r z_i / \|z_i\|$, where $r$ is the predefined radius of the hypersphere.
In the testing stage, we only use the trained $f_\theta^*$ to calculate anomaly scores.
For a given test sample $x_{\mathrm{new}}$, we define the anomaly score $s$ for each target distribution by
$$s(x_{\mathrm{new}}) = \begin{cases} \big|\,\|f_\theta^*(x_{\mathrm{new}})\|_2 - r\,\big|, & \text{for UoHS}, \\ \|f_\theta^*(x_{\mathrm{new}})\|_2, & \text{for GiHS or UiHS}, \\ \big(\|f_\theta^*(x_{\mathrm{new}})\|_2 - r_1\big) \cdot \big(\|f_\theta^*(x_{\mathrm{new}})\|_2 - r_2\big), & \text{for UbHS}. \end{cases}$$
We call our method Restricted Generative Projection (RGP). It has four variants, denoted by RGP-GiHS, RGP-UiHS, RGP-UbHS, and RGP-UoHS, though any bounded target distribution applies.
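The piecewise score above translates directly into code. The sketch below assumes the latent codes $f_\theta^*(x_{\mathrm{new}})$ have already been computed and are passed in as the rows of an array; the function name `anomaly_score`, its argument names, and the default radii are hypothetical and chosen only for illustration.

```python
import numpy as np


def anomaly_score(z, target, r=1.0, r1=1.0, r2=0.5):
    """Anomaly score of latent codes z (one row per test sample).

    target: one of "UoHS", "GiHS", "UiHS", "UbHS".
    r:      sphere radius for UoHS (assumed default).
    r1, r2: outer and inner radii for UbHS (assumed defaults).
    """
    norms = np.linalg.norm(z, axis=-1)
    if target == "UoHS":
        # Distance of the code from the sphere of radius r.
        return np.abs(norms - r)
    if target in ("GiHS", "UiHS"):
        # Codes far from the origin are more anomalous.
        return norms
    if target == "UbHS":
        # Negative between the two spheres, positive inside the inner
        # sphere or outside the outer one.
        return (norms - r1) * (norms - r2)
    raise ValueError(f"unknown target distribution: {target}")


# Three toy latent codes with l2 norms 0.6, 1.0 and 5.0.
z_demo = np.array([[0.6, 0.0], [1.0, 0.0], [3.0, 4.0]])
scores_uohs = anomaly_score(z_demo, "UoHS", r=1.0)
scores_gihs = anomaly_score(z_demo, "GiHS")
scores_ubhs = anomaly_score(z_demo, "UbHS", r1=1.0, r2=0.5)
```

In all four cases a larger score indicates a more anomalous sample, so the scores can be thresholded or fed directly into a ranking metric such as AUC.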

3. CONNECTION WITH PREVIOUS WORK

Our method has a connection with the variational autoencoder (VAE) (Kingma & Welling, 2013). Both methods are autoencoders. The latent distribution in VAE is often Gaussian and unbounded, while the latent distribution in our method is more general and bounded. The optimizations also differ: VAE involves the KL-divergence while our method involves MMD.
Our method is closely related to Deep SVDD (Ruff et al., 2018). Both our method and Deep SVDD aim to project the normal training data into some space such that a decision boundary between normal data and unknown abnormal data can be found easily. Deep SVDD minimizes the sum of the squared distances between the projected data and a predefined center to find a hypersphere that includes the normal training data, which cannot ensure that the data in the hypersphere are evenly distributed or close to Gaussian and hence may lead to an ineffective decision boundary.
Compared with the autoencoder based anomaly detection method NAE (Yoon et al., 2021), which uses reconstruction error to normalize the autoencoder, our method pays more attention to learning a mapping that can transform the unknown data distribution into a simple and compact target distribution.
Similar to our method, Perera et al. (2019) also considered a bounded latent distribution in an autoencoder for anomaly detection. They proposed to train a denoising autoencoder with a hypercube-supported (multi-dimensional uniform) latent space via adversarial training. Both the latent distribution and the optimization of our method differ from theirs. In addition, the latent distributions of our method are more compact than the multi-dimensional uniform latent distribution of their method.

4.1. DATASETS AND BASELINES

In this section, we compare the proposed method with several state-of-the-art anomaly detection methods on three tabular datasets and two widely used image datasets for one-class classification. The datasets are detailed as follows.
• Abalone (Dua, 2017) is a dataset of physical measurements of abalone used to predict the age. It contains 1,920 instances with 8 attributes.
• Arrhythmia (Rayana, 2016) is an ECG dataset. It was used to identify arrhythmic samples in five classes and contains 452 instances with 279 attributes.
• Thyroid (Rayana, 2016) is a hypothyroid disease dataset that contains 3,772 instances with 6 attributes.
• Fashion-MNIST (Xiao et al., 2017) contains 70,000 grey-scale images with 10 classes.
• CIFAR-10 (Krizhevsky et al., 2009) is a widely used benchmark for image anomaly detection. It contains 60,000 color images with 10 classes.
We compare our method with three classic shallow models, four deep autoencoder based methods, three deep generative model based methods, and some of the latest anomaly detection methods.
• Classic shallow models: local outlier factor (LOF) (Breunig et al., 2000), one-class support vector machine (OC-SVM) (Schölkopf et al., 2001), isolation forest (IF) (Liu et al., 2008).
• Deep autoencoder based methods: denoising auto-encoder (DAE) (Vincent et al., 2008), DCAE (Seeböck et al., 2016), E2E-AE and DAGMM (Zong et al., 2018), DCN (Caron et al., 2018).
• Deep generative model based methods: AnoGAN (Schlegl et al., 2017), ADGAN (Deecke et al., 2018), OCGAN (Perera et al., 2019).
• Some of the latest anomaly detection methods: DeepSVDD (Ruff et al., 2018), GOAD (Bergman & Hoshen, 2020), DROCC (Goyal et al., 2020), HRN (Hu et al., 2020), SCADN (Yan et al., 2021), NeuTraL AD (Qiu et al., 2021), GOCC (Shenkar & Wolf, 2022).

4.2. IMPLEMENTATION DETAILS AND EVALUATION METRICS

In this section, we introduce the implementation details of the proposed method RGP and describe the experimental settings for image and tabular datasets. Note that none of the compared methods uses a pre-trained feature extractor.
For the three tabular datasets (Abalone, Arrhythmia, and Thyroid), $f_\theta$ and $g_\phi$ in our method are both MLPs. We follow the dataset preparation of Zong et al. (2018) to preprocess the three tabular datasets for the one-class classification task. The hyper-parameter $\lambda$ is set to 1.0 for the three datasets.
For the two image datasets (Fashion-MNIST and CIFAR-10), $f_\theta$ and $g_\phi$ in our method are both CNNs. Since both image datasets contain 10 different classes, we conduct 10 independent one-class classification tasks on each dataset. In each task on CIFAR-10, there are 5,000 training samples and 10,000 testing samples. In each task on Fashion-MNIST, there are 6,000 training samples and 10,000 testing samples. The hyper-parameter $\lambda$ is chosen from {0.05, 0.1, 0.2, 1.0} and varies across classes.
For the restricted target distribution GiHS, we first generate a large number (denoted by $N$) of samples from a Gaussian or uniform distribution, sort the samples according to their $\ell_2$ norms, and set $r$ to be the $pN$-th smallest $\ell_2$ norm, where $p = 0.9$. In each iteration (mini-batch) of the optimization, we resample $Z_T$ according to $r$. The sampling for UiHS and UbHS is similar to that for GiHS and hence omitted here for simplicity. For UoHS, we draw samples from a Gaussian and normalize them to have unit $\ell_2$ norm, so that they lie uniformly on a unit hypersphere. The procedure is repeated in each iteration (mini-batch) of the optimization.
We use Adam (Kingma & Ba, 2014) as the optimizer in our method. For Fashion-MNIST, CIFAR-10, and Arrhythmia, the learning rate is set to 0.0001. For Abalone and Thyroid, the learning rate is set to 0.001. The details of our network settings are provided in the supplementary material.
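The target-sampling procedure described above can be sketched as follows. This is one possible reading, not the authors' released code: the radius is taken as the $pN$-th smallest norm of a large Gaussian sample, and each mini-batch of $Z_T$ is then drawn by rejection sampling. All function names, the batch size, the over-draw factor of 4, and the radii used for UiHS and UbHS are assumptions made for illustration. Rejection from the hypercube becomes expensive as $d$ grows, as quantified by the ratio $\eta$.

```python
import math
import numpy as np


def ball_to_cube_ratio(d):
    """eta: volume of a d-dimensional ball over that of its bounding hypercube."""
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))


def gaussian(rng, n, d):
    return rng.standard_normal((n, d))


def uniform_cube(half_width):
    def draw(rng, n, d):
        return rng.uniform(-half_width, half_width, size=(n, d))
    return draw


def radius_from_quantile(draw, d, rng, big_n=100_000, p=0.9):
    """Set r to the (p * big_n)-th smallest l2 norm among big_n base samples."""
    norms = np.sort(np.linalg.norm(draw(rng, big_n, d), axis=1))
    return norms[int(p * big_n) - 1]


def rejection_sample(rng, n, d, draw, accept):
    """Draw batches with `draw` and keep rows whose l2 norm passes `accept`."""
    kept, count = [], 0
    while count < n:
        z = draw(rng, 4 * n, d)
        z = z[accept(np.linalg.norm(z, axis=1))]
        kept.append(z)
        count += len(z)
    return np.concatenate(kept)[:n]


def sample_gihs(rng, n, d, r):
    return rejection_sample(rng, n, d, gaussian, lambda s: s <= r)


def sample_uihs(rng, n, d, r):
    return rejection_sample(rng, n, d, uniform_cube(r), lambda s: s <= r)


def sample_ubhs(rng, n, d, r_outer, r_inner):
    between = lambda s: (s >= r_inner) & (s <= r_outer)
    return rejection_sample(rng, n, d, uniform_cube(r_outer), between)


def sample_uohs(rng, n, d, r=1.0):
    z = rng.standard_normal((n, d))
    return r * z / np.linalg.norm(z, axis=1, keepdims=True)


rng = np.random.default_rng(0)
d, batch = 2, 256                       # small d keeps rejection cheap
r = radius_from_quantile(gaussian, d, rng)
z_gihs = sample_gihs(rng, batch, d, r)
z_uihs = sample_uihs(rng, batch, d, 1.0)
z_ubhs = sample_ubhs(rng, batch, d, 1.0, 0.5)
z_uohs = sample_uohs(rng, batch, d)
```

In a training loop, one of the `sample_*` functions would be called once per mini-batch to produce a fresh $Z_T$, while `radius_from_quantile` would be evaluated only once before training.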
All experiments were run on an AMD EPYC CPU with 64 cores and an NVIDIA Tesla A100 GPU with CUDA 11.6. To evaluate the performance of all methods, we follow previous works such as Ruff et al. (2018) and Zong et al. (2018) and use AUC (area under the ROC curve) for image datasets and F1-score for tabular datasets.
• RGP has superior performance on most classes of Fashion-MNIST and CIFAR-10 under the UoHS setting (uniform distribution on hypersphere). Furthermore, our four settings have relatively close performance on Fashion-MNIST. On CIFAR-10, UoHS outperformed the other settings consistently.

4.3. RESULTS ON IMAGE DATASETS

Table 1: Average AUC(%) of one-class anomaly detection on Fashion-MNIST. For the competing methods we report only their mean performance due to space limits, while we also report the standard deviation for the proposed methods. * denotes results obtained by running the officially released code, and the best two results are marked in bold.
Table 2: Average AUC(%) of one-class anomaly detection on CIFAR-10. For the competing methods we report only their mean performance due to space limits, while we also report the standard deviation for the proposed methods. * denotes results obtained by running the officially released code, and the best two results are marked in bold.

4.4. RESULTS ON TABULAR DATASETS

In Table 3, we report the F1-scores of our methods in comparison to ten baselines on Abalone, Arrhythmia, and Thyroid. Our four methods significantly outperform all baseline methods in all cases. In particular, RGP-UoHS improves the F1-score over the runner-up baseline by 18.73, 8.70, and 16.96 percentage points on Abalone, Arrhythmia, and Thyroid, respectively. The improvements of RGPs on the three tabular datasets are larger than those on Fashion-MNIST and CIFAR-10. In addition to the quantitative results, we choose Thyroid (with 6 attributes) as an example and transform the data distribution to 2-dimensional target distributions, which are visualized in Figure 3. We see that RGPs are effective in transforming the data distribution to the restricted target distributions. Therefore, RGPs can learn clear boundaries for normal data and hence achieve satisfactory performance in anomaly detection.
Table 3: Average F1-scores(%) with standard deviation on three tabular datasets. * denotes that we ran the officially released code of NeuTraL AD to obtain the result on Abalone; the results on Arrhythmia and Thyroid are from the original paper (Qiu et al., 2021). The best two results are marked in bold.
| Methods | Abalone | Arrhythmia | Thyroid |
|---|---|---|---|
| OC-SVM (Schölkopf et al., 2001) | 48.00 ± | 46.00 ± 0.00 | 39.00 ± 1.00 |
| LOF (Breunig et al., 2000) | 33.00 ± 1.00 | 51.00 ± 1.00 | 54.00 ± 1.00 |
| DCN (Caron et al., 2018) | 40.00 ± 1.00 | 38.00 ± 3.00 | 33.00 ± 3.00 |
| E2E-AE (Zong et al., 2018) | 33.00 ± 3.00 | 45.00 ± 3.00 | 13.00 ± 4.00 |
| DAGMM (Zong et al., 2018) | 20.00 ± 3.00 | 49.00 ± 3.00 | 49.00 ± 4.00 |
| DeepSVDD (Ruff et al., 2018) | 62.00 ± 1.00 | 54.00 ± 1.00 | 73.00 ± 0.00 |
| GOAD (Bergman & Hoshen, 2020) | 61.00 ± 2.00 | 51.00 ± 2.00 | 72.00 ± 1.00 |
| DROCC (Goyal et al., 2020) | 68.00 ± 2.00 | 69.00 ± 2.00 | 78.00 ± 3.00 |

4.5. ABLATION STUDY

There is one hyperparameter in our method, namely $\lambda$ in problem (5). We now show the influence of $\lambda$ on the performance of our method. Figure 4 shows the F1-scores of our methods with $\lambda$ varying from 0 to 100 on the three tabular datasets. A $\lambda$ that is too small or too large lowers the performance of RGP. When $\lambda$ is very small, the reconstruction term of (5) has little impact on the training objective, and $f_\theta$ can easily transform the training data to the target distribution but ignores the original data distribution (see Figure 5). On the other hand, when $\lambda$ is very large, the MMD term becomes negligible in the training objective, and $f_\theta$, constrained by the reconstruction term, concentrates on the original data distribution and cannot learn a good mapping from the data distribution to the target distribution.

5. CONCLUSION

We have presented a novel and simple framework for one-class classification and anomaly detection. Our method RGP aims to convert the data distribution to a simple, compact and informative target distribution that can be easily violated by abnormal data. We presented four target distributions and the numerical results showed that uniform on hypersphere is more effective than other distributions in anomaly detection. The reason is that the uniform on hypersphere is much more compact than other distributions and hence provides smaller room for abnormal data points to fall into.

A MORE NUMERICAL RESULTS

In addition to the results in Tables 1 and 2, we report the average AUC over all classes in Figure 6. On Fashion-MNIST, the performance of our methods is competitive with that of IF and HRN. On CIFAR-10, our RGP-UoHS outperformed all other methods.

We also conduct one-class anomaly detection on MNIST and report the experimental results in Table 4. In addition to comparing evaluation metrics, we measure the training time of different types of anomaly detection methods with and without GPU acceleration. We run the one-class classification experiments on MNIST, Fashion-MNIST, and CIFAR-10 and report the time cost in Tables 5, 6, and 7. Table 5 compares the training time of Deep SVDD (Ruff et al., 2018), DROCC (Goyal et al., 2020), OCGAN (Perera et al., 2019), and RGP on MNIST with the same batch size and number of training epochs, without GPU acceleration. Table 6 and Table 7 show the training time of Deep SVDD (Ruff et al., 2018), DROCC (Goyal et al., 2020), and RGP on Fashion-MNIST and CIFAR-10 with one NVIDIA RTX 2080 GPU. The training hyperparameters of Deep SVDD (Ruff et al., 2018) and DROCC (Goyal et al., 2020) in Table 6 and Table 7 both follow the official settings and are listed in Table 8. We did not obtain the time cost of OCGAN (Perera et al., 2019) with GPU acceleration because the official code raises a system error when run on a GPU.

C PROOF FOR PROPOSITION 2.2

Proof. Let $z_1, z_2, \ldots, z_d$ be i.i.d. variables drawn from $\mathcal{U}(-1, 1)$. Then $\mathbb{E}(z_i^2) = \frac{1}{3}$ and $\mathrm{Var}(z_i^2) = \frac{4}{45}$, $i = 1, 2, \ldots, d$. Using Chebyshev's inequality, for any $t > 0$, we obtain
$$P\left(\left|\sum_{i=1}^{d} z_i^2 - \frac{d}{3}\right| \ge t\right) \le \frac{4d}{45t^2}. \tag{14}$$
It follows that
$$P\left(\sum_{i=1}^{d} z_i^2 \ge \frac{d}{3} + t\right) \le \frac{4d}{45t^2}. \tag{15}$$
Letting $z = (z_1, z_2, \ldots, z_d)^\top$ and $r^2 = \frac{d}{3} + t$, we have
$$P\big(\|z\| \ge r\big) \le \frac{4d}{5(3r^2 - d)^2}, \tag{16}$$
and the same bound holds for every sample $z_j$, $j = 1, 2, \ldots, n$. Note that one may obtain a tighter tail bound than (16).
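The moments used in the proof and the tail bound (16) can be checked numerically. The following Monte Carlo sketch is only a sanity check under assumed settings: the dimension `d = 8`, the number of samples, the radii tested, and the function name `tail_bound` are all hypothetical choices, and the radii are picked so that $r^2 > d/3$ (that is, $t > 0$).

```python
import numpy as np


def tail_bound(r, d):
    """Right-hand side of (16); meaningful when r**2 > d / 3."""
    return 4.0 * d / (5.0 * (3.0 * r**2 - d) ** 2)


rng = np.random.default_rng(0)
d, n_samples = 8, 200_000
z = rng.uniform(-1.0, 1.0, size=(n_samples, d))
sq_norms = np.sum(z**2, axis=1)

# The proof uses E[sum z_i^2] = d / 3 and Var[sum z_i^2] = 4d / 45.
mean_sq = sq_norms.mean()
var_sq = sq_norms.var()

results = []
for r in (2.0, 2.2, 2.5):
    empirical = float(np.mean(sq_norms >= r**2))
    results.append((r, empirical, tail_bound(r, d)))
    print(f"r={r:.1f}  empirical tail={empirical:.5f}  bound={tail_bound(r, d):.5f}")
```

In runs of this kind the empirical tail probability sits well below the Chebyshev bound, which is consistent with the remark that tighter tail bounds are possible.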



¹ Here we assume that the distributions of normal data and abnormal data do not overlap. Otherwise, it is difficult to determine whether a single point is normal or not.



Figure 1: The training stage of Restricted Generative Projection.

Figure 2: Samples (in orange) from 2-D target (bounded) distributions. Plots (a), (b), (c), (d) denote GiHS, UiHS, UbHS, UoHS respectively.

Table 3 (continued).
| Methods | Abalone | Arrhythmia | Thyroid |
|---|---|---|---|
| NeuTraL AD* (Qiu et al., 2021) | 62.07 ± 2.81 | 60.30 ± 1.10 | 76.80 ± 1.90 |
| GOCC (Shenkar & Wolf, 2022) | - | 61.80 ± 1.80 | 76.80 ± 1.20 |
| RGP-GiHS (Ours) | 84.79 ± 0.36 | 78.63 ± 0.55 | 92.60 ± 0.37 |
| RGP-UiHS (Ours) | 84.79 ± 0.65 | 76.98 ± 0.72 | 91.83 ± 0.39 |
| RGP-UbHS (Ours) | 83.79 ± 0.81 | 74.22 ± 0.78 | 89.53 ± 0.00 |
| RGP-UoHS (Ours) | 86.73 ± 1.10 | 77.70 ± 0.40 | 94.96 ± 0.28 |

Figure 3: Visualization of mapping the data distribution to a 2-dimensional target distribution on the Thyroid dataset. Plots (a), (b), (c), (d) refer to GiHS, UiHS, UbHS, UoHS, respectively. The blue, orange, green, and red points denote samples from the target distribution, samples from the training data, normal samples from the testing set, and abnormal samples from the testing set, respectively.

Figure 5 illustrates the influence of the hyperparameter $\lambda$ on the training set of the Thyroid dataset. It can be clearly observed that $f_\theta$ transforms the training data to the target distribution better as $\lambda$ decreases.

Figure 4: The ablation study of hyper-parameter λ on the testing set of three tabular datasets under four different restrictions. Plots (a), (b), (c), (d) correspond to GiHS, UiHS, UbHS, UoHS, respectively. λ is chosen from {0, 0.1, 0.5, 1, 5, 10, 100}.

Figure 5: The ablation study of hyper-parameter λ on training set of Thyroid dataset under four different restrictions. Plots (a), (b), (c), (d) correspond to GiHS, UiHS, UbHS, UoHS, respectively.

Figure 6: The Average AUC(%) score of Fashion-MNIST and CIFAR-10 on all classes.

• In contrast to classic shallow methods such as OC-SVM (Schölkopf et al., 2001) and IF (Liu et al., 2008), our RGP has significantly higher AUC scores on all classes of CIFAR-10 and most classes of Fashion-MNIST. An interesting phenomenon is that most deep learning based methods, including ours, perform worse than IF (Liu et al., 2008) on the class 'Sandal' of Fashion-MNIST.
• Our methods outperformed the deep autoencoder based methods and generative model based methods in most cases and have competitive performance compared to the state-of-the-art in all cases.

Table 4: Average AUC(%) of one-class anomaly detection on MNIST.

Table 6 and Table 7 indicate that, with GPU acceleration, the training times of the three methods are of the same order of magnitude, and that our method has a slight advantage on both Fashion-MNIST and CIFAR-10 when the models are trained to convergence.

Table 5: Training time (seconds) of one-class anomaly detection on MNIST without GPU acceleration.

Table 6: Training time (seconds) of one-class anomaly detection on Fashion-MNIST with GPU acceleration.

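Table 7: Training time (seconds) of one-class anomaly detection on CIFAR10 with GPU acceleration. Columns 1 to 10 are the ten one-class tasks; the last column is their mean.
| Methods | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Deep SVDD (Ruff et al., 2018) | 943.97 | 946.16 | 941.57 | 932.88 | 944.01 | 943.51 | 942.04 | 942.93 | 941.69 | 941.69 | 942.04 |
| DROCC (Goyal et al., 2020) | 970.90 | 976.35 | 974.08 | 971.07 | 971.99 | 972.03 | 971.09 | 974.67 | 971.60 | 972.54 | 972.63 |
| RGP-GiHS (Ours) | 747.48 | 748.08 | 747.83 | 748.57 | 748.34 | 747.33 | 747.33 | 746.74 | 747.48 | 746.03 | 747.52 |
| RGP-UiHS (Ours) | 746.95 | 747.17 | 747.51 | 746.31 | 747.72 | 747.80 | 747.58 | 747.78 | 747.80 | 747.36 | 747.39 |
| RGP-UbHS (Ours) | 746.12 | 746.11 | 745.21 | 747.21 | 746.97 | 746.08 | 745.06 | 746.01 | 746.49 | 745.45 | 746.07 |
| RGP-UoHS (Ours) | 713.25 | 712.59 | 712.70 | 712.61 | 712.97 | 712.86 | 712.59 | 713.02 | 712.72 | 712.77 | 712.80 |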

Table 8: Training settings of one-class anomaly detection on Fashion-MNIST, CIFAR10 and MNIST.
Proof of Proposition 2.1. Let $z_1, z_2, \ldots, z_d$ be i.i.d. Gaussian variables with mean 0 and variance 1. According to Lemma 1 of Laurent & Massart (2000) (in which we let $a_1 = \cdots = a_D = 1$), the following inequality holds for any $t$
Letting $z = (z_1, z_2, \ldots, z_d)^\top$ and $r^2 = d + 2$

D DETAILED NEURAL NETWORKS ARCHITECTURE

For the two image datasets (Fashion-MNIST and CIFAR-10), $f_\theta$ and $g_\phi$ in our method are both CNNs. The detailed neural network architecture is shown in Table 9.
16), LeakyReLU
Conv2d(in_channels=32, out_channels=in_channels, kernel_size=3, padding=1, bias=False)
Basic-Block(in_channels, out_channels):
Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
BatchNorm2d(out_channels), LeakyReLU
Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
BatchNorm2d(out_channels)
Table 9: Architecture of the CNN-based neural network for CIFAR-10 and Fashion-MNIST.
For the three tabular datasets (Abalone, Arrhythmia, Thyroid), $f_\theta$ and $g_\phi$ in our method are both MLPs. The detailed neural network architecture is shown in Table 10.
| $f_\theta$ | $g_\phi$ |
|---|---|
| Linear(input_dim, 64, bias=False), LeakyReLU | Linear(128, input_dim, bias=False) |
| Linear(64, 128, bias=False) | |
Table 10: Architecture of the MLP-based neural network for tabular datasets.
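To make the layer shapes of the MLP encoder and decoder concrete, the forward pass can be sketched with plain NumPy matrices. This is an illustration only, under several assumptions: the assignment of layers to $f_\theta$ and $g_\phi$ follows one reading of the flattened table, the weights are random rather than trained, the LeakyReLU slope of 0.01 and the initialisation are arbitrary choices not stated in the text, and `input_dim = 6` simply mirrors the 6 attributes of Thyroid.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """LeakyReLU activation; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)


rng = np.random.default_rng(0)
input_dim = 6  # e.g. Thyroid has 6 attributes


def init_weight(fan_in, fan_out):
    """Random bias-free linear layer, scaled by 1/sqrt(fan_in) (assumed init)."""
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)


# Encoder f_theta: Linear(input_dim, 64) -> LeakyReLU -> Linear(64, 128).
w_enc1 = init_weight(input_dim, 64)
w_enc2 = init_weight(64, 128)
# Decoder g_phi: Linear(128, input_dim).
w_dec = init_weight(128, input_dim)


def f_theta(x):
    return leaky_relu(x @ w_enc1) @ w_enc2


def g_phi(z):
    return z @ w_dec


x_batch = rng.standard_normal((32, input_dim))   # stand-in for a mini-batch
z_batch = f_theta(x_batch)                        # latent codes
x_rec = g_phi(z_batch)                            # reconstructions
recon_error = float(np.mean(np.sum((x_batch - x_rec) ** 2, axis=1)))
```

The quantity `recon_error` corresponds to the sample reconstruction error used as the second term of the training objective; in practice the weights would be learned with Adam rather than drawn at random.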

