ONE-CLASS CLASSIFICATION ROBUST TO GEOMETRIC TRANSFORMATIONS

Abstract

Recent studies on one-class classification have achieved remarkable performance by employing a self-supervised classifier that predicts the geometric transformation applied to in-class images. However, they cannot identify in-class images at all when the input images are geometrically transformed (e.g., rotated), because their classification-based in-class scores assume that input images always have a fixed viewpoint, similar to the images used for training. Pointing out that humans can easily recognize such transformed images as the same class, in this work we propose a one-class classifier robust to geometrically-transformed inputs, named GROC. To this end, we introduce a conformity score which indicates how strongly an input image agrees with one of the predefined in-class transformations, then utilize the conformity score with our proposed agreement measures for one-class classification. Our extensive experiments demonstrate that GROC accurately distinguishes in-class images from out-of-class images regardless of whether the inputs are geometrically transformed, whereas the existing methods fail.

1. INTRODUCTION

One-class classification refers to the problem of identifying whether an input example belongs to a single target class (in-class) or to any novel class (out-of-class). The main challenge of this task is that only in-class examples are available at training time. Thus, using only positive examples, a model has to learn a decision boundary that distinguishes in-class examples from out-of-class examples, whose distribution is assumed to be unknown in practice. Early work on one-class classification mainly utilized kernel-based methods (Schölkopf et al., 2000; Tax & Duin, 2004) to find a hypersphere (or hyperplane) enclosing all training in-class examples, or density estimation techniques (Parzen, 1962) to measure the likelihood of an input example. In the era of deep learning, numerous studies have employed deep neural networks to effectively learn high-dimensional data (e.g., images). Most of them aim to detect out-of-class examples based on density estimation, adopting the architecture of autoencoders (Ruff et al., 2018; Zong et al., 2018) or generative adversarial networks (GANs) (Schlegl et al., 2017; Zenati et al., 2018). Nevertheless, their supervision is not informative enough to capture the semantics of high-dimensional data for a target class, which eventually limits their performance. Recently, there have been several attempts to make use of self-supervised learning (Golan & El-Yaniv, 2018; Hendrycks et al., 2019; Bergman & Hoshen, 2020) for more informative supervision on the target class, which achieved a major breakthrough on this problem. They build a self-labeled image set by applying a set of geometric transformations to training images, then train a classifier to accurately predict the transformation applied to the original input images. This approach achieved the state-of-the-art performance for one-class classification even without modeling the latent distribution of in-class examples for density estimation.
However, all the aforementioned methods are quite vulnerable to spatial variance within the images, because they were developed under the assumption that in-class (and out-of-class) images have a fixed viewpoint. In particular, the existing self-supervised methods completely fail on inputs with various viewpoints, because their capability of predicting the geometric transformation relies on the fixed viewpoint. Note that humans usually recognize that images of a target object with different viewpoints belong to the same class; in this sense, one-class classifiers should also be robust to the viewpoint of input images. In other words, we need to prevent geometrically-transformed in-class images from being identified as out-of-class, from the perspective that a geometric transformation (e.g., rotation and x,y-translation) changes only the viewpoint, not the semantics (i.e., the object class). The goal of our work is to propose an effective strategy that circumvents this limitation of viewpoint sensitivity without compromising the performance on images with the fixed viewpoint. We first present several evaluation settings for validating the robustness to flexible viewpoints, artificially introduced by geometric transformations. Then, we describe our proposed solution, termed GROC, which measures a conformity score indicating how confidently an input image matches one of the predefined (anchor) in-class transformations. In this work, we offer two measures for the conformity score, the inner product similarity and the conditional likelihood, and show how they can be optimized using the training in-class images. The empirical experiments on the proposed evaluation scenarios show that GROC considerably outperforms all the competing methods in terms of robustness to geometric transformations.

2. PRELIMINARIES

2.1 PROBLEM FORMULATION

Let $\mathcal{X}$ be the set of all images, and let $\mathcal{X}_{in} \subseteq \mathcal{X}$ and $\mathcal{X}_{out} = \mathcal{X} \setminus \mathcal{X}_{in}$ be the sets of all in-class and out-of-class images, respectively. Given training in-class data $\mathcal{X}^{tr}_{in} \subseteq \mathcal{X}_{in}$, we consider the one-class classification problem of differentiating in-class from out-of-class data. The problem aims to build a classifier using only the known in-class data for training. The classifier learns an in-class score function $S_{in}: \mathcal{X} \to \mathbb{R}$, where a higher score indicates that the input $x$ is more likely to be in $\mathcal{X}_{in}$. Based on this score, the classifier determines whether the input belongs to the in-class or not.
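Concretely, the score function induces a binary decision by thresholding. A minimal sketch with a toy distance-based score; the threshold, the score function, and all names below are illustrative, not from the paper:

```python
import numpy as np

def classify(x, score_fn, threshold):
    """Label x as in-class (True) iff its in-class score exceeds the threshold."""
    return score_fn(x) >= threshold

# toy score: negative distance to an assumed in-class mean (higher = more in-class)
mean = np.zeros(2)
score = lambda x: -np.linalg.norm(x - mean)

assert classify(np.array([0.1, 0.0]), score, threshold=-1.0)       # near the mean
assert not classify(np.array([5.0, 5.0]), score, threshold=-1.0)   # far away
```

In practice the threshold is swept to compute AUROC rather than fixed in advance.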

2.2. SELF-SUPERVISED LEARNING METHODS FOR ONE-CLASS CLASSIFICATION

Recently, self-supervised learning methods (Golan & El-Yaniv, 2018; Hendrycks et al., 2019; Bergman & Hoshen, 2020) have achieved state-of-the-art performance in one-class classification. For self-supervised learning, they first create a self-labeled dataset and use it to train a multi-class classifier. Concretely, let $\mathcal{T} = \{T_0, \ldots, T_i, \ldots, T_{K-1}\}$ be a set of predefined (anchor) geometric transformations, where $T_0(x) = x$ is the identity mapping and each transformation $T_i$ is a composition of multiple unit transformations (i.e., rotation and x,y-translation). The self-labeled dataset consists of transformed images and their corresponding labels:

$$\mathcal{D}_{self} = \{(T_i(x), i) \mid x \in \mathcal{X}^{tr}_{in},\ 0 \le i < K\},$$

where $T_i(\cdot)$ is the $i$-th transformation operator and its label $i$ is the transformation id of $T_i(\cdot)$. Using the self-labeled dataset, these methods train a softmax classifier with a multi-class classification loss (i.e., cross-entropy) to discriminate among the transformations. For one-class classification, they define an in-class score under the assumption that a well-trained classifier predicts the applied transformation better for in-class images than for out-of-class images. In the end, the in-class score of an unseen image $x$ is defined as the sum of the softmax probabilities that its transformed images are correctly classified as their labels (Golan & El-Yaniv, 2018; Bergman & Hoshen, 2020):

$$S_{in}(x) = \sum_{i=0}^{K-1} p(y = i \mid T_i(x)),$$

where $p(y = i \mid T_i(x))$ is the softmax probability that $T_i(x)$ is classified as the $i$-th transformation. The state-of-the-art method based on this self-supervised approach (Hendrycks et al., 2019) significantly improves the performance by formulating the classification task in a multi-label manner.
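The classification-based in-class score above can be sketched in a few lines of NumPy. The quarter-turn transformations and the toy template-matching "classifier" below are our illustrative stand-ins, not the paper's trained network:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def in_class_score(x, transforms, logits_fn):
    """S_in(x) = sum_i p(y=i | T_i(x)): each transformed copy should be
    classified as its own transformation id."""
    return sum(softmax(logits_fn(T(x)))[i] for i, T in enumerate(transforms))

# anchor transformations: rotations by i quarter-turns (T_0 is the identity)
transforms = [lambda x, k=k: np.rot90(x, k) for k in range(4)]
base = np.arange(9.0).reshape(3, 3)   # an asymmetric toy "image"

def logits_fn(img):
    # toy classifier: compare against every rotated version of the known image
    return np.array([-np.abs(img - np.rot90(base, k)).sum() for k in range(4)])

s = in_class_score(base, transforms, logits_fn)
assert 3.99 < s < 4.0   # every rotation is identified, so the score is near K = 4
```

An input with a shifted viewpoint would break the label-to-transformation correspondence and drive this score down, which is exactly the failure mode discussed in Section 3.1.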
Since each transformation is determined by a combination of unit transformations from three categories¹ (i.e., rotation, (horizontal) x-translation, and (vertical) y-translation), the unit transformations applied to an input image can be predicted independently for each category. Thus, they attach a softmax head for each transformation category, then train the classifier to predict the degree of transformation within each category. The final in-class score is accordingly replaced with one that aggregates all the softmax heads, each of which predicts the unit transformation applied to the input.

3.1. MOTIVATION

The underlying concept of the self-supervised methods based on transformation classification is to learn discriminative features of in-class images, in order to classify the various viewpoints produced by the geometric transformations. The precondition for this approach is that the viewpoint of training images is always the same; otherwise, the classifier cannot be trained due to inconsistent supervision. However, at test time, the input images can have viewpoints different from those appearing in the training images. We remark that images of the same object with different viewpoints belong to the same class, as usually recognized by humans. In this sense, it is desirable that in-class images with various viewpoints are identified as in-class, not out-of-class. That is, robustness to geometric transformations should be considered for one-class classification.

In this respect, the existing self-supervised methods totally fail to compute effective in-class scores for inputs with various viewpoints. We observe that they produce undesirable in-class scores especially when the input image has the same (or a similar) viewpoint as one represented by the anchor transformations $\mathcal{T} \setminus \{T_0\}$. For example, suppose a classifier is trained on $\mathcal{D}_{self}$ with the transformations $\mathcal{T}$ of clockwise rotations $\{0°, 90°, 180°, 270°\}$. Given two images of sea lions $x'$ and $x''$, let $x'$ have the same viewpoint as the training images and $x''$ have the 90° rotated viewpoint, i.e., $x'' = T_1(x')$. As illustrated in Figure 1, the softmax probability of each transformed image has a high value for the input $x'$, but a low value for $x''$. Consequently, the classifier cannot correctly identify $x''$ as in-class, though it comes from the target class as well. We point out that setting the target label of each transformed image to the applied transformation is no longer valid when the input viewpoint is changed.
A straightforward solution to this challenge is augmenting the training dataset so that it covers various viewpoints of in-class images. Unfortunately, this data augmentation technique is not applicable, because it results in inconsistent supervision for the task of discriminating the viewpoints, which is the learning objective of the self-supervised methods. On the other hand, there exist several one-class classification methods (Ruff et al., 2018; Zong et al., 2018) that can adopt the data augmentation technique. However, they cannot achieve performance as high as the self-supervised methods even when all input images have a fixed viewpoint; this will be further discussed in Section 4. To sum up, we need another strategy to develop a robust one-class classifier that works well even for input images with various viewpoints.

3.2. PROPOSED SETUPS

We first propose three evaluation setups for testing the robustness to various viewpoints: 1) fixed viewpoint, 2) anchor viewpoint, and 3) random viewpoint. We artificially introduce spatial variance (i.e., changes of viewpoint) into test images by using geometric transformations. Note that $\mathcal{X}^{te}$ denotes the test data, which contains both in-class and out-of-class images.

Fixed viewpoint setup. In this setup, we consider only the fixed viewpoint used for training, as done in previous work. We do not change the viewpoint of the original test images: $\mathcal{X}^{te}_{fv} = \mathcal{X}^{te}$.

Anchor viewpoint setup. This setup verifies the robustness to the viewpoints induced by the anchor transformations. We build a test dataset $\mathcal{X}^{te}_{av} = \{T(x) \mid T \sim \mathcal{T},\ x \in \mathcal{X}^{te}\}$, where $T$ is randomly sampled from the set of anchor transformations $\mathcal{T}$ for each image $x$.

Random viewpoint setup. This setup further considers geometric transformations that are not included in the set of anchor transformations. We first define the superset of $\mathcal{T}$, denoted by $\mathcal{T}^*$, which includes transformations with continuous degrees. A test dataset for this setup is built as $\mathcal{X}^{te}_{rv} = \{T(x) \mid T \sim \mathcal{T}^*,\ x \in \mathcal{X}^{te}\}$, where $T$ is sampled for each image $x$.

As a preliminary result, we plot the in-class score distributions for in-class and out-of-class test images, computed by the state-of-the-art self-supervised method (Hendrycks et al., 2019). In Figure 2, we observe that the score distributions of in-class and out-of-class images in $\mathcal{X}^{te}_{fv}$ are clearly distinguishable, which supports the strong performance for one-class classification. On the contrary, the two score distributions almost completely overlap for $\mathcal{X}^{te}_{av}$ and $\mathcal{X}^{te}_{rv}$, strongly indicating that the method fails to identify in-class images due to their various viewpoints. We additionally investigate the performance drop for geometrically/non-geometrically transformed inputs.
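A minimal sketch of how the anchor-viewpoint and random-viewpoint test sets could be built, assuming quarter-turn rotations as the anchors. For the continuous rotations of $\mathcal{T}^*$ we only record the sampled angle to keep the sketch dependency-free; a real pipeline would apply the rotation, e.g., with scipy.ndimage.rotate:

```python
import numpy as np

rng = np.random.default_rng(0)

# anchor transformations: quarter-turn rotations, T_0 = identity (illustrative)
anchors = [lambda x, k=k: np.rot90(x, k) for k in range(4)]

def anchor_viewpoint(images):
    """X_av: apply one randomly sampled *anchor* transformation per image."""
    return [anchors[rng.integers(len(anchors))](x) for x in images]

def random_viewpoint(images):
    """X_rv: sample a continuous rotation angle from the superset T*;
    here we just pair each image with its sampled angle."""
    return [(x, rng.uniform(0.0, 360.0)) for x in images]

imgs = [np.arange(9.0).reshape(3, 3) for _ in range(5)]
assert len(anchor_viewpoint(imgs)) == 5
assert all(0.0 <= a < 360.0 for _, a in random_viewpoint(imgs))
```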
In Figure 2(d), it is evident that geometric transformations cause the self-supervised method to fail completely, while non-geometric transformations (e.g., brightness, contrast, sharpness, and color temperature) hardly degrade the final one-class classification performance.

3.3. PROPOSED STRATEGY

To deal with the viewpoint sensitivity of the self-supervised methods, we note that in-class images match the in-class transformations better than out-of-class images do, regardless of their viewpoints. Our proposed strategy, named GROC, defines the in-class score as the sum of the conformity scores of the $K$ transformed images; $S_{conf}(\cdot\,; \mathcal{T})$ measures how conformable an input image is to the given set $\mathcal{T}$. Formally, it is defined as the maximum similarity between the representation of an input image and that of each anchor transformation:

$$S_{in}(x) = \sum_{i=0}^{K-1} S_{conf}(T_i(x); \mathcal{T}), \quad \text{where} \quad S_{conf}(x; \mathcal{T}) = \max_{T_j \in \mathcal{T}} \left[ \mathrm{sim}(x, T_j) \right]. \quad (3)$$

The foremost condition for GROC is that the representations of the anchor transformations should be discriminative, so that the similarity measure can effectively capture the viewpoint of input images. Note that the similarity between an image $x$ and a transformation $T_j$, denoted by $\mathrm{sim}(x, T_j)$, can be defined in various ways. In the following subsections, we offer two similarity measures for the conformity score, modeled by the inner product similarity and the conditional likelihood respectively, and present how the representations of input images and anchor transformations are optimized.
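Equation (3) can be sketched directly, assuming a toy similarity measure that compares an image against anchor-rotated templates of a known in-class image (all names here are illustrative, not the paper's learned representations):

```python
import numpy as np

def groc_score(x, transforms, sim_fn):
    """S_in(x) = sum_i max_j sim(T_i(x), T_j)  (Eq. 3): each transformed copy
    only needs to agree with *some* anchor, not a fixed label."""
    K = len(transforms)
    return sum(max(sim_fn(T(x), j) for j in range(K)) for T in transforms)

base = np.arange(9.0).reshape(3, 3)                       # toy in-class image
transforms = [lambda x, k=k: np.rot90(x, k) for k in range(4)]
# toy similarity: negative L1 distance to the j-th anchor-rotated template
sim_fn = lambda img, j: -np.abs(img - np.rot90(base, j)).sum()

# a rotated in-class input still matches *some* anchor exactly (score 0 = max)
assert groc_score(np.rot90(base, 1), transforms, sim_fn) == 0.0
# an out-of-class input matches no anchor well, so its score is strictly lower
assert groc_score(np.ones((3, 3)), transforms, sim_fn) < 0.0
```

The key property is visible in the asserts: because the score takes a max over anchors instead of indexing a fixed label, rotating the input does not hurt the in-class score.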

3.3.1. INNER PRODUCT SIMILARITY FOR CONFORMITY SCORE

To model the similarity measure for the conformity score, we use an encoder network $f(\cdot\,;\theta): \mathcal{X} \to \mathbb{R}^d$ which outputs the representation of an input image, and a weight matrix $W \in \mathbb{R}^{K \times d}$ whose rows $w_j$ parameterize the representations of the $K$ anchor transformations. We first present GROC-IP, whose similarity measure is simply the inner product of $f(x;\theta)$ and $w_j$:

$$\mathrm{sim}(x, T_j) = f(x;\theta)^\top w_j.$$

Based on the inner product similarity, the encoder network needs to map all in-class images with the same viewpoint close to their corresponding transformation vector, while keeping the transformation vectors far from each other. In other words, it has to extract discriminative features for classifying input images according to their viewpoint; this is exactly what the conventional softmax classifier achieves in a self-supervised manner. Therefore, we adopt the optimization strategy of the existing self-supervised methods for one-class classification. After building a softmax classifier by adding a linear classification layer with weights $W$ on top of the encoder network $f(\cdot\,;\theta)$, we train both the weight matrix and the network using the cross-entropy loss on $\mathcal{D}_{self}$. In the case of GROC-IP, the conformity score is equivalent to the maximum logit value computed by the softmax classifier.
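A minimal sketch of the GROC-IP conformity score under a randomly initialized weight matrix; in the paper, $f$ and $W$ are trained with cross-entropy on $\mathcal{D}_{self}$, so both the feature and the weights below are stand-ins:

```python
import numpy as np

K, d = 4, 8
rng = np.random.default_rng(1)
W = rng.normal(size=(K, d))   # rows: (untrained) anchor-transformation vectors

def groc_ip_conformity(feat, W):
    """GROC-IP: conformity = max inner product over the K anchor rows,
    i.e., the maximum logit of the softmax classifier."""
    return np.max(W @ feat)

feat = rng.normal(size=d)     # stand-in for f(x; theta)
logits = W @ feat
assert groc_ip_conformity(feat, W) == logits.max()
```

The design point: no softmax and no label indexing are involved at test time, only the raw maximum logit, which is what makes the score viewpoint-agnostic.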

3.3.2. CONDITIONAL LIKELIHOOD FOR CONFORMITY SCORE

Our second method, GROC-CL, defines the similarity measure using the likelihood of an input image conditioned on each transformation. For simplicity, we assume that the conditional likelihood has the form of an isotropic Gaussian distribution, with mean $\mu_j \in \mathbb{R}^d$ and standard deviation $\sigma_j \in \mathbb{R}^+$ for condition $j$ (i.e., transformation $T_j$). In short, GROC-CL models the representation of $T_j$ as $(\mu_j, \sigma_j)$ rather than $w_j$. Similar to Section 3.3.1, all the Gaussian distributions need to be separable for discrimination among different viewpoints. Based on this assumption, the similarity between $x$ and $T_j$ is defined by the log-likelihood as follows:

$$\mathrm{sim}(x, T_j) = \log \mathcal{N}\!\left(f(x;\theta) \mid \mu_j, \sigma_j^2 I\right) \approx -\left( \frac{\| f(x;\theta) - \mu_j \|_2^2}{2\sigma_j^2} + \log \sigma_j^d \right). \quad (5)$$

Note that this similarity can be interpreted as the Mahalanobis distance between $f(x;\theta)$ and $\mu_j$ with the covariance matrix $\sigma_j^2 I$. The challenge here is to optimize the encoder network so that its outputs follow $\mathcal{N}(\mu_j, \sigma_j^2 I)$ for all in-class inputs having the viewpoint corresponding to $T_j$. We train the parameters of the Gaussian distributions $(\mu, \sigma)$ and the network $f(\cdot\,;\theta)$ with the following objective:

$$\max_{\theta, \mu, \sigma, b} \sum_{x \in \mathcal{X}^{tr}_{in}} \sum_{j=0}^{K-1} \left[ \mathrm{sim}(T_j(x), T_j) + \frac{1}{\nu} \log \frac{\exp\!\left(\mathrm{sim}(T_j(x), T_j) + b_j\right)}{\sum_{k=0}^{K-1} \exp\!\left(\mathrm{sim}(T_j(x), T_k) + b_k\right)} \right], \quad (6)$$

where $b_j \in \mathbb{R}$ is the bias term for transformation $j$. As discussed in (Lee et al., 2020), this objective aims to learn discriminative conditional likelihoods modeled by $K$ separable Gaussian distributions. To be precise, the first term enforces that $f(T_j(x))$ follows the $j$-th Gaussian distribution, and the second term makes the distributions distinguishable based on Gaussian discriminant analysis.
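The log-likelihood similarity of Equation (5), with the constant term dropped, can be sketched as:

```python
import numpy as np

def sim_cl(feat, mu, sigma):
    """Eq. (5): log N(feat | mu, sigma^2 I) up to an additive constant:
    -( ||feat - mu||^2 / (2 sigma^2) + d * log(sigma) )."""
    d = feat.shape[0]
    return -(np.sum((feat - mu) ** 2) / (2.0 * sigma ** 2) + d * np.log(sigma))

mu = np.zeros(4)
# a feature at the Gaussian mean attains the highest similarity for that sigma
assert sim_cl(mu, mu, sigma=1.0) == 0.0
# similarity decreases monotonically with distance from the mean
assert sim_cl(np.ones(4), mu, sigma=1.0) < sim_cl(0.5 * np.ones(4), mu, sigma=1.0)
```

With sigma fixed per transformation, this is exactly the (negative, scaled) Mahalanobis distance mentioned in the text.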

4.1. COMPETING METHODS

In our experiments, we consider a variety of approaches to one-class classification as competing methods. We choose One-class SVM (OCSVM) (Schölkopf et al., 2000) and Deep SVDD (DSVDD) (Ruff et al., 2018) as the non-self-supervised methods. The main competitors are the self-supervised methods based on transformation classification: Geometric Transformation (GT) (Golan & El-Yaniv, 2018) and Multi-labeled Geometric Transformation (MGT) (Hendrycks et al., 2019). The details of these methods are presented in Section 2.2. For the anchor transformations, GT adopts four transformation categories (i.e., horizontal flipping, x-translation, y-translation, and rotation), while MGT excludes horizontal flipping from the above categories. The last competing method is SimCLR (Chen et al., 2020), which learns transformation-invariant representations of input images in a self-supervised manner. Since it is optimized to maximize the agreement among images differently transformed from a single image, it can alleviate the viewpoint sensitivity to some degree. Several recent works on representation learning based on this approach (Chen et al., 2020; Grill et al., 2020; He et al., 2020) showed remarkable performance on a wide range of downstream tasks. Note that SimCLR is not originally designed for one-class classification, so we tailor it for our task by defining the final in-class score as

$$S_{in}(x) = \sum_{i=1}^{K-1} \frac{f(T_0(x);\theta)^\top f(T_i(x);\theta)}{\| f(T_0(x);\theta) \|_2 \, \| f(T_i(x);\theta) \|_2}.$$

For its optimization, we use the set of anchor transformations adopted by MGT. More details on SimCLR are provided in Appendix A.
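The tailored SimCLR in-class score can be sketched as follows, with a toy rotation-invariant "encoder" standing in for the trained network (all names are illustrative):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def simclr_score(x, transforms, encode):
    """S_in(x) = sum_{i=1}^{K-1} cos( f(T_0(x)), f(T_i(x)) ): an invariant
    encoder yields high agreement across all transformed views."""
    z0 = encode(transforms[0](x))
    return sum(cosine(z0, encode(T(x))) for T in transforms[1:])

transforms = [lambda x, k=k: np.rot90(x, k) for k in range(4)]
# toy rotation-invariant encoder: summary statistics unaffected by rot90
encode = lambda img: np.array([img.sum(), (img ** 2).sum()])

x = np.arange(9.0).reshape(3, 3)
# perfect invariance: each of the K-1 = 3 cosine terms equals 1
assert abs(simclr_score(x, transforms, encode) - 3.0) < 1e-9
```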

4.2. DATASETS

We validate the effectiveness of the proposed methods using three benchmark image datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) , and SVHN (Netzer et al., 2011) . We scale the pixel values of all images to be in [-1, 1] as done in (Golan & El-Yaniv, 2018) . Note that CIFAR-100 has 20 super-classes and we use these super-classes rather than 100 full classes.

4.3. EXPERIMENTAL SETTINGS

Following the experimental settings of previous studies (Golan & El-Yaniv, 2018; Ruff et al., 2018; Zenati et al., 2018), we employ the one-vs-all evaluation scheme. For a dataset with C classes, we generate C one-class classification settings: the images of a target class are regarded as in-class data, and the images belonging to the remaining C−1 classes are regarded as out-of-class data. We use the Area Under the ROC curve (AUROC) as the evaluation metric. We build the set of anchor transformations from combinations of the following unit transformations: x-translation ∈ {−8, 0, +8}, y-translation ∈ {−8, 0, +8}, and rotation ∈ {0°, 90°, 180°, 270°}. As presented in Section 3.2, we evaluate our method using the three proposed setups: fixed viewpoint, anchor viewpoint, and random viewpoint. For the random viewpoint setup, we build the test set by randomly sampling the transformation degrees for the x,y-translation in the range [−8, 8] and for the rotation in the range [0°, 360°]. In the case of SVHN, we exclude the x,y-translation because it may change the semantics (i.e., class label) of the images; the label of each image is determined by the digit in the middle of the image.
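The anchor set described above is the Cartesian product of the three unit-transformation categories, giving K = 3 × 3 × 4 = 36 anchors; a quick sketch:

```python
from itertools import product

# unit transformations from Sec. 4.3 (translations in pixels, rotations in degrees)
x_shifts = (-8, 0, +8)
y_shifts = (-8, 0, +8)
rotations = (0, 90, 180, 270)

# every anchor transformation is one (dx, dy, angle) combination; (0, 0, 0) is T_0
anchors = list(product(x_shifts, y_shifts, rotations))
assert len(anchors) == 36     # K = 3 * 3 * 4
assert (0, 0, 0) in anchors   # the identity mapping is included
```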

4.4.1. COMPARISONS WITH SELF-SUPERVISED METHODS

The experimental results of the self-supervised methods on the three setups are presented in Table 1. For each dataset, the original class name is replaced by its class id due to the limited space. In summary, our methods effectively overcome the limitation of viewpoint sensitivity without compromising the performance in the fixed viewpoint setup. We analyze the results from various perspectives.

Fixed viewpoint setup. In this setup, the state-of-the-art competitor MGT consistently shows the best results, and our methods are comparable to MGT. We also observe that the performance of SimCLR is not as good as those of the classification-based methods. Note that the classification-based methods aim to discriminate among the differently-transformed images, whereas SimCLR tries to make them indistinguishable. From this observation, we conclude that directly learning transformation-invariant visual features is less effective for separating in-class and out-of-class images in one-class classification.

Anchor viewpoint setup. When the input images have anchor viewpoints, the classification-based methods fail to distinguish in-class images from out-of-class images; their performances are even worse than a random guess, whose AUROC is 0.5. Because their capability of discriminating the geometric transformations depends on the fixed viewpoint, they do not work at all for input images with various viewpoints, as discussed in Section 3.1. In contrast, our methods considerably outperform all the competing methods, providing outstanding performance robust to changes of the viewpoint. These results show that our methods successfully identify in-class images irrespective of their viewpoint, with the help of the conformity score that measures how confidently an input image matches one of the in-class transformations. Interestingly, the performance of SimCLR is higher than that of the classification-based methods in this setup.
This is because its learning objective, which encourages multiple transformations of an input image to be similar to each other, makes the one-class classifier less affected by the viewpoint.

Random viewpoint setup. In the hardest setting, where the input images have random viewpoints, the classification-based methods cannot beat random guessing, similarly to the anchor viewpoint setup. Our methods perform the best on all the datasets, which strongly indicates their robustness to changes of the viewpoint. In conclusion, both of our methods (i.e., GROC-IP and GROC-CL) are able to correctly classify images with diverse viewpoints as in-class or out-of-class, even for viewpoints that have not been seen during training.

For all the setups, the competing methods show poorer performances compared to our methods. Specifically, the data augmentation technique yields limited performance gains in the anchor/random setups, and even brings an adverse effect in the fixed setup. This implies that this simple approach is not sufficient to address the viewpoint sensitivity of one-class classifiers.

4.4.3. FURTHER ANALYSIS

We also provide in-depth analyses of the performance of the self-supervised methods on the SVHN dataset. In the fixed viewpoint setup, we observe a distinct performance improvement of our GROC over MGT, especially for in-class 0, 1, and 8. Figure 4 shows the in-class score distributions obtained by GROC and MGT, where class 0 is set to in-class. Since the digit '0' has a symmetric shape, it is difficult for MGT to differentiate its transformations between the rotations 0° and 180° (or 90° and 270°). For this reason, as illustrated in the rightmost figure, MGT outputs relatively low scores for images containing a single '0' but high scores for images with other digits around the '0'. On the contrary, GROC produces similar (and high) in-class scores for both kinds of images (i.e., with or without other digits), which can be separated from the scores of out-of-class images. This reduces the overlap between the in-class scores of in-class and out-of-class images, and as a result leads to higher AUROC compared to MGT. On the other hand, for in-class 6 and 9, the performance of GROC slightly degrades because the 180° rotation of the out-of-class digit '9' is likely to be conformable to the in-class digit '6', and vice versa. Nevertheless, under the assumption that the input images have various viewpoints, it is impossible even for humans to accurately determine whether images that look like '6' (or '9') belong to in-class or out-of-class.

5. CONCLUSION

This paper proposes a novel one-class classification method robust to geometric transformations, which effectively addresses the problem that in-class images cannot be correctly distinguished from out-of-class images when they have various viewpoints. We first present new evaluation setups that cover diverse viewpoints by artificially introducing spatial variance into test images. Then, we define the conformity-based in-class score, which measures how strongly an input image conforms to one of the anchor transformations, whose representations are optimized to be discriminative. Extensive experiments demonstrate that the proposed GROC keeps its outstanding performance even in the anchor/random viewpoint setups, where the input images have various viewpoints, whereas the state-of-the-art methods perform even worse than random guessing.

A SIMCLR

In the experiments, we slightly modified SimCLR (Chen et al., 2020) for the one-class classification task. The main idea of SimCLR is to learn representations by maximizing the agreement among images differently transformed from the same image, via a contrastive loss in the latent space. For the optimization of SimCLR, we use the set of anchor transformations $\mathcal{T}$ adopted by the classification-based self-supervised method (i.e., MGT); this set is different from the one used by the original SimCLR. Let $x \in \mathcal{X}^{tr}_{in}$ and $\tilde{x} = T(x)$, where $T$ is a transformation operator randomly sampled from the set of anchor transformations $\mathcal{T}$. Given a batch $\mathcal{B} = \{x_1, \ldots, x_N\} \subset \mathcal{X}^{tr}_{in}$, we define $\tilde{\mathcal{B}} = \{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{2N-1}, \tilde{x}_{2N}\}$, where $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$ are generated by applying different transformations to each image $x_k$ in the batch. The loss function for a pair of two differently-transformed images $(\tilde{x}_{2k-1}, \tilde{x}_{2k})$ from the input image $x_k$ is defined as follows:

$$\ell(\tilde{x}_{2k-1}, \tilde{x}_{2k}) = -\log \frac{\exp\!\left( \mathrm{sim}\!\left(f(\tilde{x}_{2k-1};\theta), f(\tilde{x}_{2k};\theta)\right)/\tau \right)}{\sum_{i=1}^{2N} \mathbb{1}[i \neq 2k-1]\, \exp\!\left( \mathrm{sim}\!\left(f(\tilde{x}_{2k-1};\theta), f(\tilde{x}_i;\theta)\right)/\tau \right)},$$

where $N$ is the number of images in a batch, $f(\cdot)$ is an encoder network including a projection layer, and $\mathrm{sim}(u, v) = u^\top v / \|u\|_2 \|v\|_2$ is the cosine similarity. It is worth noting that we include the projection layer in the encoder network in order to obtain transformation-invariant representations, unlike the original version of SimCLR, which discards the projection layer for its downstream tasks. In the end, the objective function of SimCLR for a batch $\tilde{\mathcal{B}}$ is defined as

$$\mathcal{L}_{SimCLR} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(\tilde{x}_{2k-1}, \tilde{x}_{2k}) + \ell(\tilde{x}_{2k}, \tilde{x}_{2k-1}) \right].$$
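The pairwise loss and its batch aggregation can be sketched in NumPy, assuming rows 2k and 2k+1 (0-indexed) of the view matrix are the two views of image k; this is a vectorized stand-in for the per-pair formula above:

```python
import numpy as np

def nt_xent(Z, tau=0.5):
    """SimCLR contrastive loss over 2N projected views; rows 2k and 2k+1
    (0-indexed) are the two views of image k."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # cosine sim via unit rows
    S = Z @ Z.T / tau
    np.fill_diagonal(S, -np.inf)                       # I[i != anchor]: drop self-pairs
    logprob = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    n2 = Z.shape[0]
    pos = np.arange(n2) ^ 1                            # each row's positive partner
    return -logprob[np.arange(n2), pos].mean()         # averages l(a,b) and l(b,a)

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))                       # 2N = 8 views of N = 4 images
assert nt_xent(views) > 0.0                            # -log of a probability < 1
```

Perfectly aligned positive pairs with dissimilar negatives would drive this loss toward zero, which is what pushes the encoder toward transformation-invariant representations.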

B IMPLEMENTATION DETAILS

We choose a WideResNet 16-4 (Zagoruyko & Komodakis, 2016) as the backbone architecture and adopt the training strategy for multi-label classification proposed in (Hendrycks et al., 2019). During training, we use cosine annealing for learning-rate scheduling (Loshchilov & Hutter, 2016) with an initial learning rate of 0.1 and Nesterov momentum. The dropout rate (Srivastava et al., 2014) is set to 0.3. All the self-supervised methods based on transformation classification use the same backbone architecture and training hyperparameters as ours. For DSVDD, we use a LeNet-style network (LeCun et al., 1998) as described in the paper and its implementation². For SimCLR, we employ ResNet18 (He et al., 2016) with a fixed τ value of 0.5. For GROC-CL, the regularization coefficient ν for optimizing the conditional likelihoods is set to 0.0001.
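The cosine annealing schedule with initial learning rate 0.1 follows the usual closed form; a quick sketch (the helper name is ours):

```python
import math

def cosine_lr(step, total_steps, lr0=0.1):
    """Cosine-annealed learning rate (Loshchilov & Hutter, 2016):
    decays from lr0 at step 0 to 0 at total_steps."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))

assert cosine_lr(0, 100) == 0.1                 # starts at the initial rate
assert abs(cosine_lr(50, 100) - 0.05) < 1e-12   # halfway through: half the rate
assert cosine_lr(100, 100) < 1e-9               # decays to (numerically) zero
```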

C EXPERIMENTAL RESULTS

In Table 2 , we report the full comparison results with the non-self-supervised methods for one-class classification (summarized in Figure 3 ).

D THEORETICAL BACKGROUNDS FOR GROC-CL

GROC-CL utilizes the encoder network $f$ to induce a latent space in which the similarity between the representation of an input image and that of each anchor transformation is modeled by an isotropic Gaussian distribution (conditioned on the transformation). Formally, the similarity between $x$ and $T_j$ can be described as $\mathrm{sim}(x, T_j) = \log p(x \mid T_j) = \log \mathcal{N}(f(x) \mid \mu_j, \sigma_j^2 I)$, where $p(x \mid T_j)$ is the class-conditional (or transformation-conditional) probability. As discussed in Section 3.3, the representation of each anchor transformation should be distinguishable from the others (the foremost condition for GROC) in order to effectively calculate the conformity score and identify in-class/out-of-class images from the score. In this sense, GROC-CL can be understood from the perspective of Gaussian Discriminant Analysis (GDA). To this end, we optimize the encoder network by maximizing the posterior probability that a transformed image $T_j(x)$ has the maximum similarity with transformation $j$, denoted by $p(T_j \mid T_j(x))$. For simplicity, we assume that the prior probability of each class (or transformation) follows a categorical distribution, i.e., $p(T_j) = \beta_j / \sum_k \beta_k$:

$$p(T_j \mid T_j(x)) = \frac{p(T_j)\, p(T_j(x) \mid T_j)}{\sum_{k=0}^{K-1} p(T_k)\, p(T_j(x) \mid T_k)} = \frac{\exp\!\left( -(2\sigma_j^2)^{-1} \| f(T_j(x)) - \mu_j \|^2 - \log \sigma_j^d + \log \beta_j \right)}{\sum_{k=0}^{K-1} \exp\!\left( -(2\sigma_k^2)^{-1} \| f(T_j(x)) - \mu_k \|^2 - \log \sigma_k^d + \log \beta_k \right)} = \frac{\exp\!\left(\mathrm{sim}(T_j(x), T_j) + b_j\right)}{\sum_{k=0}^{K-1} \exp\!\left(\mathrm{sim}(T_j(x), T_k) + b_k\right)}.$$

Note that taking the log of this equation yields the second term in Equation (6). In addition, we need to force the empirical class-conditional distribution to follow the isotropic Gaussian distribution, and to make the empirical class mean approach the learned class mean $\mu_j$. Thus, we minimize the Kullback-Leibler (KL) divergence between the $j$-th empirical class-conditional distribution $\hat{P}_j$ and the corresponding Gaussian distribution $\mathcal{N}(\mu_j, \sigma_j^2 I)$.
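The posterior above is simply a softmax over $\mathrm{sim}(\cdot, T_k) + b_k$ with $b_k = \log \beta_k$; a numerical sketch with toy similarity values:

```python
import numpy as np

def posterior(sims, biases):
    """p(T_j | x) as the softmax of sim(x, T_k) + b_k over the K anchors;
    the bias b_k = log(beta_k) absorbs the class prior."""
    z = sims + biases
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

sims = np.array([-0.1, -2.0, -3.5, -1.2])   # toy log-likelihood similarities
b = np.log(np.full(4, 0.25))                # uniform prior beta_k = 1/4

p = posterior(sims, b)
assert abs(p.sum() - 1.0) < 1e-12           # a proper probability distribution
assert p.argmax() == 0                      # highest-similarity anchor wins
```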
The empirical class-conditional distribution for transformation $j$ is defined as

$$\hat{P}_j = \frac{1}{|\mathcal{X}^{tr}_{in}|} \sum_{x \in \mathcal{X}^{tr}_{in}} \delta\!\left(z - f(T_j(x))\right),$$

where $\delta(\cdot)$ is the Dirac measure. The KL divergence is then obtained by

$$\begin{aligned}
\mathrm{KL}\!\left( \hat{P}_j \,\|\, \mathcal{N}(\mu_j, \sigma_j^2 I) \right)
&= -\int \hat{P}_j(z) \log\!\left[ \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left( -\frac{\|z - \mu_j\|^2}{2\sigma_j^2} \right) \right] dz + \int \hat{P}_j(z) \log \hat{P}_j(z)\, dz \\
&= -\frac{1}{|\mathcal{X}^{tr}_{in}|} \sum_{x \in \mathcal{X}^{tr}_{in}} \log\!\left[ \frac{1}{(2\pi\sigma_j^2)^{d/2}} \exp\!\left( -\frac{\| f(T_j(x)) - \mu_j \|^2}{2\sigma_j^2} \right) \right] + \log \frac{1}{|\mathcal{X}^{tr}_{in}|} \\
&= \frac{1}{|\mathcal{X}^{tr}_{in}|} \sum_{x \in \mathcal{X}^{tr}_{in}} \left( \frac{\| f(T_j(x)) - \mu_j \|^2}{2\sigma_j^2} + \log \sigma_j^d \right) + \mathrm{constant} \\
&= -\frac{1}{|\mathcal{X}^{tr}_{in}|} \sum_{x \in \mathcal{X}^{tr}_{in}} \mathrm{sim}(T_j(x), T_j) + \mathrm{constant}.
\end{aligned}$$

The final form of the KL divergence is derived using the definition of the Dirac measure. After the constant term is excluded, it becomes the same as the first term in Equation (6). Minimizing the KL term for all the classes (or transformations) matches the empirical class-conditional distributions with the isotropic Gaussian distributions.



¹ They build the set of transformations T by combining the following unit transformations: rotation ∈ {0°, 90°, 180°, 270°}, x-translation ∈ {−8, 0, +8}, and y-translation ∈ {−8, 0, +8}.

² https://github.com/lukasruff/Deep-SVDD-PyTorch



Figure 1: A toy example of computing the in-class scores for two in-class (sea lion) images with different viewpoints. Each row represents how the in-class score is calculated for a given input.

Figure 2: The one-class classification performance of the self-supervised method (Hendrycks et al., 2019). (a-c) The in-class score distributions of in-class and out-of-class test images (Dataset: CIFAR-10, In-class: Horse), (d) AUROC for geometrically/non-geometrically transformed inputs.

OCSVM is a classical kernel-based method for one-class classification, which finds a maximum-margin hyperplane that separates the training in-class examples from the origin. DSVDD, a deep-learning variant of OCSVM, explicitly models a latent space in which the training in-class examples gather around a specific center point.

Figure 3 presents the comparison results against the non-self-supervised methods on the three evaluation setups. Due to the limited space, we report the score averaged over the C in-class settings for each dataset.

Figure 4: The in-class score distributions of in-class and out-of-class test images in the fixed viewpoint setup (Dataset: SVHN, In-class: 0).

Table 1: The AUROC of self-supervised methods for one-class classification on the three evaluation setups. The best results are marked in bold face.

Table 2: The AUROC of non-self-supervised methods for one-class classification on the three evaluation setups.

