LEARNING AND EVALUATING REPRESENTATIONS FOR DEEP ONE-CLASS CLASSIFICATION

Abstract

We present a two-stage framework for deep one-class classification. We first learn self-supervised representations from one-class data, and then build one-class classifiers on the learned representations. The framework not only allows learning better representations, but also permits building one-class classifiers that are faithful to the target task. We argue that classifiers inspired by the statistical perspective of generative or discriminative models are more effective than existing approaches, such as a normality score from a surrogate classifier. We thoroughly evaluate different self-supervised representation learning algorithms under the proposed framework for one-class classification. Moreover, we present a novel distribution-augmented contrastive learning method that extends training distributions via data augmentation to obstruct the uniformity of contrastive representations. In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks, including novelty and anomaly detection. Finally, we present visual explanations, confirming that the decision-making process of deep one-class classifiers is intuitive to humans.

1. INTRODUCTION

One-class classification aims to identify whether an example belongs to the same distribution as the training data. It has several applications, such as anomaly detection and outlier detection, where we learn a classifier that distinguishes anomalous/outlier data, which are not accessible at training time, from the normal/inlier data that are. This problem is common in various domains, such as manufacturing defect detection and financial fraud detection. Generative models, such as kernel density estimation (KDE), are popular for one-class classification [1, 2], as they model the distribution by assigning high density to the training data; at test time, examples with low density are flagged as outliers. Unfortunately, the curse of dimensionality hinders accurate density estimation in high dimensions [3]. Deep generative models (e.g., [4, 5, 6]) have demonstrated success in modeling high-dimensional data (e.g., images) and have been applied to anomaly detection [7, 8, 9, 10, 11]. However, learning deep generative models on raw inputs remains challenging, as they appear to assign high density to background pixels [10] or to learn local pixel correlations [12]; a good representation might still benefit these models. Alternatively, discriminative models such as one-class SVM (OC-SVM) [13] or support vector data description (SVDD) [14] learn classifiers describing the support of the one-class distribution to distinguish it from outliers. These methods are powerful when equipped with non-linear kernels, but their performance is still limited by the quality of the input data representations. In either the generative or the discriminative approach, the fundamental limitation of one-class classification centers on learning good high-level data representations.
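The KDE mechanism described above can be sketched in a few lines. The sketch below uses scikit-learn's `KernelDensity` on hypothetical low-dimensional feature vectors (the random data, feature dimension, and bandwidth are illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical 16-dim feature vectors for the inlier (normal) class only.
rng = np.random.default_rng(0)
train_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 16))

# Fit a kernel density estimator on the one-class training features.
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(train_feats)

# Score test examples: low log-density suggests an outlier.
inlier_like = rng.normal(0.0, 1.0, size=(10, 16))
outlier_like = rng.normal(5.0, 1.0, size=(10, 16))  # shifted distribution

inlier_scores = kde.score_samples(inlier_like)
outlier_scores = kde.score_samples(outlier_like)

# Inliers receive higher log-density than the shifted outliers.
print(inlier_scores.mean() > outlier_scores.mean())  # True
```

In raw pixel space this separation degrades quickly with dimension (the curse of dimensionality noted above), which is why the density is better estimated on learned representations.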
Following the success of deep learning [15], deep one-class classifiers [16, 17, 18], which extend discriminative one-class classification with trainable deep neural networks, have shown promising results compared to their kernel counterparts. However, naive training of deep one-class classifiers leads to a degenerate solution that maps all data to a single representation, also known as "hypersphere collapse" [16]. Previous works circumvent this issue by constraining network architectures [16], autoencoder pretraining [16, 17], surrogate multi-class classification on simulated outliers [19, 20, 21, 22], or injecting noise [18]. In this work, we present a two-stage framework for building deep one-class classifiers. As shown in Figure 1, in the first stage, we train a deep neural network to obtain a high-level data representation. In the second stage, we build a one-class classifier, such as OC-SVM or KDE, on representations from the first stage. Compared to using surrogate losses [20, 21], our framework allows building a classifier that is more faithful to one-class classification. Decoupling representation learning from classifier construction further opens up opportunities to use state-of-the-art representation learning methods, such as self-supervised contrastive learning [23]. While vanilla contrastive representations are less compatible with one-class classification, as they are uniformly distributed on the hypersphere [24], we show that, with proper fixes, contrastive learning provides representations whose one-class classification performance is competitive with the previous state of the art. Furthermore, we propose distribution-augmented contrastive learning, a novel variant of contrastive learning with distribution augmentation [25]. It is particularly effective for learning one-class representations, as it reduces both the class collision between examples from the same class [26] and the uniformity of representations [24].
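The two-stage recipe can be sketched end to end. Here a fixed random ReLU projection stands in for the frozen self-supervised encoder of stage one (the encoder, data, and OC-SVM hyperparameters are illustrative assumptions; in the paper, stage one is a deep network trained with a self-supervised objective):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stage 1 (stand-in): a fixed random projection plays the role of a
# frozen, pretrained feature extractor.
rng = np.random.default_rng(1)
W = rng.normal(size=(256, 32))  # hypothetical encoder weights

def encode(x):
    """Map raw inputs to L2-normalized representations."""
    z = np.maximum(x @ W, 0.0)  # ReLU projection
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# One-class training data: raw inputs from the normal class only.
x_train = rng.normal(0.0, 1.0, size=(400, 256))

# Stage 2: fit a shallow one-class classifier on the frozen representations.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
clf.fit(encode(x_train))

# decision_function: higher score = more normal; 0 is the usual threshold.
x_test = rng.normal(0.0, 1.0, size=(5, 256))
scores = clf.decision_function(encode(x_test))
print(scores.shape)  # (5,)
```

Because the classifier is fit after representation learning, swapping OC-SVM for KDE, or swapping the encoder for a better self-supervised one, requires no change to the rest of the pipeline.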
Lastly, although the representations are not optimized for one-class classification as in end-to-end trainable deep one-class classifiers [16], we demonstrate state-of-the-art performance on visual one-class classification benchmarks. We summarize our contributions as follows:

• We present a two-stage framework for building deep one-class classifiers using unsupervised and self-supervised representations followed by shallow one-class classifiers.

• We systematically study representation learning methods for one-class classification, including augmentation prediction, contrastive learning, and the proposed distribution-augmented contrastive learning method that extends training data distributions via data augmentation.

• We show that, with a good representation, both discriminative (OC-SVM) and generative (KDE) classifiers, while being competitive with each other, are better than surrogate classifiers based on simulated outliers [20, 21].

• We achieve strong performance on visual one-class classification benchmarks, such as CIFAR-10/100 [27], Fashion MNIST [28], Cat-vs-Dog [29], CelebA [30], and MVTec AD [31].

• We extensively study one-class contrastive learning and the realistic evaluation of anomaly detection under unsupervised and semi-supervised settings. Finally, we present visual explanations of our deep one-class classifiers to better understand their decision-making processes.

2. RELATED WORK

One-class classification [32] has broad applications, including fraud detection [33], spam filtering [34], medical diagnosis [35], and manufacturing defect detection [31], to name a few. Due to the lack of granular semantic information for one-class data, learning from unlabeled data has been employed for one-class classification. Generative models, which model the density of the training data distribution, can flag an example as an outlier when it has low density [8, 35, 36]. These include simple methods such as kernel density estimation and mixture models [37], as well as more advanced ones [4, 5, 6, 38, 39, 40, 41]. However, the density from generative models of high-dimensional data can be misleading [9, 12, 42, 43], and new detection mechanisms based on typicality [44] or likelihood ratios [10] have been proposed to improve out-of-distribution detection. Self-supervised learning is commonly used to learn representations from unlabeled data by solving proxy tasks, such as jigsaw puzzles [45], rotation prediction [46], clustering [47], instance discrimination [48], and contrastive learning [23, 49, 50]. The learned representations are then used for multi-class classification or transfer learning, both of which require labeled data for downstream tasks. They have also been extended to one-class classification. For example, contrastive learning has been adopted to improve out-of-distribution detection in the multi-class setting [51], whereas our work focuses on learning from a single class of examples, leading us to propose a novel distribution-augmented contrastive learning. Notably, learning to predict geometric transformations [20, 21, 22] extends rotation prediction [46] to more geometric transformations as prediction targets. Unlike typical applications of self-supervised learning, where the classifier or projection head [23] is discarded after training, the geometric transformation classifier is used as a surrogate for one-class classification. As we show in Section 4.1, however, the surrogate classifier optimized for the self-supervised proxy task is suboptimal for one-class classification, and replacing it with simple one-class classifiers, such as OC-SVM or KDE, improves performance.
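As a concrete illustration of the distribution-augmented contrastive idea discussed in the introduction, the sketch below computes an NT-Xent loss in which each rotated copy of an image is treated as a distinct instance, so rotations of the same image act as negatives rather than positives. Everything here is a toy stand-in (random features, `np.roll` in place of a real 90-degree image rotation, additive noise in place of real view augmentations), not the paper's training code:

```python
import numpy as np

def nt_xent(z, pos_pairs, temperature=0.2):
    """NT-Xent loss over embeddings z (n, d).
    pos_pairs[i] is the index of i's positive view."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(z)), pos_pairs])

rng = np.random.default_rng(2)
batch = rng.normal(size=(8, 16))  # hypothetical image features

def rotate(x, k):  # toy stand-in for a 90-degree image rotation
    return np.roll(x, k, axis=1)

# Distribution augmentation: each rotated copy joins the training
# distribution as its own instance, so rotations of one image are
# negatives of each other, not positives.
views = []
for k in range(4):  # 0, 90, 180, 270 degrees
    rot = rotate(batch, k)
    views.append(rot + 0.05 * rng.normal(size=rot.shape))  # view 1
    views.append(rot + 0.05 * rng.normal(size=rot.shape))  # view 2
z = np.concatenate(views)  # (64, 16)

# Positives pair the two noisy views of the same (image, rotation).
n = len(batch)
pos = np.arange(len(z))
for k in range(4):
    a, b = 2 * k * n, (2 * k + 1) * n
    pos[a:a + n], pos[b:b + n] = np.arange(b, b + n), np.arange(a, a + n)

loss = nt_xent(z, pos)
print(float(loss) > 0)  # the loss is a positive scalar
```

Enlarging the training distribution this way gives the contrastive objective more negatives without collapsing the original class, which is one intuition for why it counteracts the uniformity pressure on the inlier representations.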