SSD: A UNIFIED FRAMEWORK FOR SELF-SUPERVISED OUTLIER DETECTION

Abstract

We ask the following question: what training information is required to design an effective outlier/out-of-distribution (OOD) detector, i.e., one that detects samples lying far away from the training distribution? Since unlabeled data is easily accessible for many applications, the most compelling approach is to develop detectors based on only unlabeled in-distribution data. However, we observe that most existing detectors based on unlabeled data perform poorly, often equivalent to a random prediction. In contrast, existing state-of-the-art OOD detectors achieve impressive performance but require access to fine-grained data labels for supervised training. We propose SSD, an outlier detector based on only unlabeled in-distribution data. We use self-supervised representation learning followed by Mahalanobis distance based detection in the feature space. We demonstrate that SSD outperforms most existing detectors based on unlabeled data by a large margin. Additionally, SSD achieves performance on par with, and sometimes even better than, supervised training based detectors. Finally, we expand our detection framework with two key extensions. First, we formulate few-shot OOD detection, in which the detector has access to only one to five samples from each class of the targeted OOD dataset. Second, we extend our framework to incorporate training data labels, if available. We find that our detection framework based on SSD displays enhanced performance with these extensions, and achieves state-of-the-art performance.

1. INTRODUCTION

Deep neural networks are the cornerstone of multiple safety-critical applications, ranging from autonomous driving (Ramanagopal et al., 2018) to biometric authentication (Masi et al., 2018; Günther et al., 2017). When trained on a particular data distribution, referred to as in-distribution data, deep neural networks are known to fail against test inputs that lie far away from the training distribution, commonly referred to as outliers or out-of-distribution (OOD) samples (Grubbs, 1969; Hendrycks & Gimpel, 2017). This vulnerability motivates the use of an outlier detector before feeding input samples to downstream neural network modules. However, a key question is: what training information is crucial for effective outlier detection? Will the detector require fine-grained annotation of training data labels, or even access to a set of outliers during training? Since neither data labels nor outliers are ubiquitous, the most compelling option is to design outlier detectors based on only unlabeled in-distribution data. However, we observe that most existing outlier detectors based on unlabeled data fail to scale up to complex data modalities, such as images. For example, autoencoder (AE) (Hawkins et al., 2002) based outlier detectors have achieved success in applications such as intrusion detection (Mirsky et al., 2018) and fraud detection (Schreyer et al., 2017). However, this approach achieves close-to-chance performance on image datasets. Similarly, density modeling based methods, such as PixelCNN++ (Salimans et al., 2017) and Glow (Kingma & Dhariwal, 2018), are known to assign even higher likelihood to outliers than to in-distribution data (Nalisnick et al., 2019).
In contrast, existing state-of-the-art OOD detectors achieve high success on image datasets but assume the availability of fine-grained labels for in-distribution samples (Hendrycks & Gimpel, 2017; Bendale & Boult, 2016; Liang et al., 2018; Dhamija et al., 2018; Winkens et al., 2020). This is a strong assumption, since labels, in particular fine-grained labels, can be very costly to collect in some applications (Google AI Pricing, 2020), which further motivates the use of unlabeled data. The inability of supervised detectors to use unlabeled data and the poor performance of existing unsupervised approaches naturally give rise to the following question: can we design an effective out-of-distribution (OOD) data detector with access to only unlabeled data from the training distribution?

A framework for outlier detection with unlabeled data involves two key steps: 1) learning a good feature representation with unsupervised training methods, and 2) modeling the features of in-distribution data without requiring class labels. For example, autoencoders attempt to learn the representation with a bottleneck layer, under the expectation that successful reconstruction requires learning a good set of representations. Though useful for tasks such as dimensionality reduction, we find that these representations are not good enough to sufficiently distinguish in-distribution data from outliers. We argue that if unsupervised training can develop a rich understanding of the key semantics of in-distribution data, then the absence of such semantics in outliers can cause them to lie far away in the feature space, making them easy to detect. Recently, self-supervised representation learning methods have made rapid progress, commonly measured by the accuracy achieved on a downstream classification task (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Misra & Maaten, 2020; Tian et al., 2020).
We leverage these representations in our proposed cluster-conditioned framework based on the Mahalanobis distance (Mahalanobis, 1936). Our key result is that self-supervised representations are highly effective for the task of outlier detection in our self-supervised outlier detection (SSD) framework, where they not only perform far better than most previous unsupervised representation learning methods but also perform on par with, and sometimes even better than, supervised representations.

What if access to a fraction of OOD data or to training data labels is available? How do we move past a detector based on unlabeled data and design a framework which can take advantage of such information? Though access to outliers during training is a strong assumption, it may be feasible to obtain a few prior instances of such outliers (Görnitz et al., 2013). We characterize this setting as few-shot OOD detection, where we assume access to very few, often one to five, samples from the targeted set of outliers. While earlier approaches (Liang et al., 2018; Lee et al., 2018b) mostly use such data to calibrate the detector, we find that access to just a few outliers can bring an additional boost in the performance of our detector. Crucial to this success is the reliable estimation of first- and second-order statistics of OOD data in the high-dimensional feature space with just a few samples.

Finally, if class labels are available in the training phase, how can we incorporate them in the SSD framework for outlier detection? Recent works have proposed adding the supervised cross-entropy and self-supervised learning losses with a tunable parameter, which may require tuning for the optimal setting on each dataset (Hendrycks et al., 2019b; Winkens et al., 2020). We demonstrate that incorporating labels directly in the contrastive loss achieves 1) a tuning-parameter-free detector, and 2) state-of-the-art performance.

1.1. KEY CONTRIBUTIONS

SSD for unlabeled data. We propose SSD, an unsupervised framework for outlier detection based on unlabeled in-distribution data. We demonstrate that SSD outperforms most existing unsupervised outlier detectors by a large margin while also performing on par with, and sometimes even better than, supervised training based detection methods. We validate our observation across four different datasets: CIFAR-10, CIFAR-100, STL-10, and ImageNet.

Extensions of SSD. We provide two extensions of SSD to further improve its performance. First, we formulate few-shot OOD detection and propose detection methods which achieve a significant gain in performance with access to only a few targeted OOD samples. Next, we extend SSD, without using any tuning parameter, to also incorporate in-distribution data labels and achieve state-of-the-art performance.

2. RELATED WORK

OOD detection with unsupervised detectors. Interest in unsupervised outlier detection goes back to Grubbs (1969). We categorize these approaches into three groups: 1) reconstruction-error based detection using autoencoders (Hawkins et al., 2002; Mirsky et al., 2018; Schreyer et al., 2017) or variational autoencoders (Abati et al., 2019; An & Cho, 2015); 2) classification based, such as Deep-SVDD (Ruff et al., 2018; El-Yaniv & Wiener, 2010; Geifman & El-Yaniv, 2017); and 3) probabilistic detectors, such as density models like Glow and PixelCNN++ (Ren et al., 2019; Nalisnick et al., 2019; Salimans et al., 2017; Kingma & Dhariwal, 2018). We compare with detectors from each category and find that SSD outperforms them by a wide margin.

OOD detection with supervised learning. Supervised detectors have been most successful with complex input modalities, such as images and language (Chalapathy et al., 2018a; DeVries & Taylor, 2018; Dhamija et al., 2018; Jiang et al., 2018; Yoshihashi et al., 2018; Lee et al., 2018a). Most of these approaches model features of in-distribution data at the output layer (Liang et al., 2018; Hendrycks & Gimpel, 2017; Dhamija et al., 2018) or in the feature space (Lee et al., 2018b; Winkens et al., 2020) for detection. We show that SSD can achieve performance on par with these supervised detectors, without using data labels. A subset of these detectors also leverages generic OOD data to boost performance (Hendrycks et al., 2019a; Mohseni et al., 2020).

Access to OOD data at training time. Some recent detectors also require OOD samples for hyperparameter tuning (Liang et al., 2018; Lee et al., 2018b; Zisselman & Tamar, 2020). We extend SSD to this setting but assume access to only a few OOD samples, referred to as few-shot OOD detection, which our framework can efficiently utilize to bring further gains in performance.

In conjunction with supervised training. Vyas et al. (2018) use an ensemble of leave-one-out classifiers, Winkens et al. (2020) use contrastive self-supervised training, and Hendrycks et al. (2019b) use a rotation based self-supervised loss, in conjunction with the supervised cross-entropy loss, to achieve state-of-the-art performance in OOD detection. Here we extend SSD to incorporate data labels, when available, and achieve better performance than the existing state-of-the-art.

Anomaly detection. In parallel to OOD detection, this research direction focuses on the detection of semantically related anomalies in applications such as intrusion detection, spam detection, disease detection, image classification, and video surveillance. We refer the interested reader to Pang et al. (2020) for a detailed review. While a large number of works focus on developing methods particularly for single-class modeling in anomaly detection (Perera et al., 2019; Schlegl et al., 2017; Ruff et al., 2018; Chalapathy et al., 2018b; Golan & El-Yaniv, 2018; Wang et al., 2019), some recent works achieve success in both OOD detection and anomaly detection (Tack et al., 2020; Bergman & Hoshen, 2020). We provide a detailed comparison of our approach with previous work in both categories.

3. SSD: SELF-SUPERVISED OUTLIER/OUT-OF-DISTRIBUTION DETECTION

In this section, we first provide the necessary background on outlier/out-of-distribution (OOD) detection and then present the underlying formulation of our self-supervised detector (SSD), which relies on only unlabeled in-distribution data. Finally, we describe two extensions of SSD to (optionally) incorporate targeted OOD samples and in-distribution data labels (if available).

Notation. We represent the input space by X and the corresponding label space by Y. Labeled in-distribution data is sampled from P^in_{X×Y}; in the absence of data labels, it is sampled from the marginal distribution P^in_X. Out-of-distribution data is sampled from P^ood_X. We denote the feature extractor by f : X → Z, where Z ⊂ R^d, a function which maps a sample from the input space to the d-dimensional feature space Z. The feature extractor is often parameterized by a deep neural network. In supervised learning, we obtain a classification confidence for each of the c classes via g ∘ f : X → R^c. In most cases, g is parameterized by a shallow neural network, generally a linear classifier.

Problem formulation: Outlier/out-of-distribution (OOD) detection. Given a collection of samples drawn from both P^in_X and P^ood_X, the objective is to correctly identify the source distribution, i.e., P^in_X or P^ood_X, of each sample. We use the term supervised OOD detectors for detectors which use in-distribution data labels, i.e., train the neural network (g ∘ f) on P^in_{X×Y} using supervised training techniques. Unsupervised OOD detectors aim to solve the aforementioned OOD detection task with access to only P^in_X. In this work, we focus on developing effective unsupervised OOD detectors.

Background: Contrastive self-supervised representation learning. Given unlabeled training data, it aims to train a feature extractor that learns a good set of representations by discriminating between individual instances in the data. Using image transformations, it first creates two views of each image, commonly referred to as positives.
Next, it optimizes to pull each instance close to its positive instances while pushing it away from other images, commonly referred to as negatives. Assuming (x_i, x_j) is the positive pair for the i-th image from a batch of N images, h(·) is a projection head, and τ is the temperature, contrastive training minimizes the following loss over each batch, referred to as normalized temperature-scaled cross-entropy (NT-Xent):

L_batch = (1/2N) Σ_{i=1}^{2N} −log [ e^{u_i^T u_j / τ} / Σ_{k=1}^{2N} 1(k ≠ i) e^{u_i^T u_k / τ} ],  where u_i = h(f(x_i)) / ||h(f(x_i))||_2   (1)
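The NT-Xent loss above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the helper name `nt_xent` and the interleaved layout of positive pairs (rows 2i and 2i+1 are the two views of image i) are assumptions of this example.

```python
import numpy as np

def nt_xent(features, temperature=0.5):
    """NT-Xent loss over a batch of 2N projected features (Equation 1 sketch).

    `features` has shape (2N, d); rows 2i and 2i+1 are assumed to be the
    two augmented views (positives) of image i.
    """
    # l2-normalize to obtain u_i = h(f(x_i)) / ||h(f(x_i))||_2
    u = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = u @ u.T / temperature          # pairwise similarities u_i^T u_k / tau
    np.fill_diagonal(sim, -np.inf)       # enforce k != i in the denominator
    two_n = u.shape[0]
    pos = np.arange(two_n) ^ 1           # index of each row's positive: 0<->1, 2<->3, ...
    log_prob = sim[np.arange(two_n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

As a sanity check, a batch where the two views of each image are identical should incur a lower loss than a batch of random features, since every positive pair then attains the maximum similarity.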

3.1. UNSUPERVISED OUTLIER DETECTION WITH SSD

Leveraging contrastive self-supervised training. In the absence of data labels, SSD consists of two steps: 1) training a feature extractor using unsupervised representation learning, and 2) developing an effective OOD detector based on hidden features that isn't conditioned on data labels. We leverage contrastive self-supervised training for representation learning in our outlier detection framework, particularly due to its state-of-the-art performance (Chen et al., 2020; Tian et al., 2020). We discuss the effect of different representation learning methods later in Section 4.2.

Cluster-conditioned detection. In the absence of data labels, we develop a cluster-conditioned detection method in the feature space. We first partition the features of the in-distribution training data into m clusters, and represent the features of each cluster as Z_m. We use the k-means clustering method due to its effectiveness and low computation cost. Next, we model the features in each cluster independently and calculate the outlier score s_x = min_m D(z_x, Z_m) for each test input x, where D(·, ·) is a distance metric in the feature space. We discuss the choice of the number of clusters in Section 4.2.

Choice of distance metric: Mahalanobis distance. We use the Mahalanobis distance to calculate the outlier score as follows:

s_x = min_m (z_x − µ_m)^T Σ_m^{-1} (z_x − µ_m)   (2)

where µ_m and Σ_m are the sample mean and sample covariance of the features (Z_m) of the in-distribution training data. We justify this choice with quantitative results in Figure 1. With the eigendecomposition of the sample covariance (Σ_m = Q_m Λ_m Q_m^{-1}),

s_x = min_m (Q_m^T (z_x − µ_m))^T Λ_m^{-1} (Q_m^T (z_x − µ_m))

which is equivalent to a euclidean distance scaled by the eigenvalues in the eigenspace. Components with higher eigenvalues dominate the euclidean distance but are least helpful for outlier detection; the Mahalanobis distance avoids this bias through appropriate scaling and performs much better.
Figure 1: We discriminate between in-distribution (CIFAR-10) and OOD (CIFAR-100) data along each principal eigenvector (using AUROC; higher is better). With the euclidean distance, i.e., in the absence of scaling, components with higher eigenvalues have more weight but provide the least discrimination. Scaling with the eigenvalues removes this bias, making the Mahalanobis distance effective for outlier detection in the feature space.
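The cluster-conditioned detection above can be sketched as follows; this is an illustrative sketch under stated assumptions, not the paper's code: the helper names `fit_clusters` and `outlier_score` are hypothetical, and the small `eps` added to the covariance for numerical invertibility is an assumption of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_clusters(train_feats, n_clusters=1, eps=1e-6):
    """Fit per-cluster (mean, inverse covariance) on in-distribution features."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(train_feats)
    params = []
    for m in range(n_clusters):
        z = train_feats[labels == m]
        mu = z.mean(axis=0)
        # regularize the sample covariance so it is invertible (assumption)
        cov = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])
        params.append((mu, np.linalg.inv(cov)))
    return params

def outlier_score(x, params):
    """s_x = min over clusters of the Mahalanobis distance (Equation 2 sketch)."""
    return min((x - mu) @ prec @ (x - mu) for mu, prec in params)
```

A test feature near the in-distribution clusters should receive a lower score than one far away, which is the property the detector thresholds on.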

3.2. FEW-SHOT OOD DETECTION (SSD k )

In this extension of the SSD framework, we consider the scenario where a few samples from the OOD dataset used at inference time are also available at the time of training. We focus on one-shot and five-shot detection, which refer to access to only one and five samples, respectively, from each class of the targeted OOD dataset. Our hypothesis is that in-distribution samples and OOD samples will be closer to other inputs from their respective distributions in the feature space, while lying further away from each other. We incorporate this hypothesis using the following formulation of the outlier score:

s_x = (z_x − µ_in)^T Σ_in^{-1} (z_x − µ_in) − (z_x − µ_ood)^T Σ_ood^{-1} (z_x − µ_ood)   (3)

where µ_in, Σ_in and µ_ood, Σ_ood are the sample means and sample covariances in the feature space for in-distribution and OOD data, respectively.

Challenge. The key challenge is to reliably estimate the statistics of the OOD data with access to only a few samples. The sample covariance is not an accurate estimator of the covariance when the number of samples is less than the dimension of the feature space (Stein, 1975), which for deep neural networks is often in the order of thousands.

Shrunk covariance estimators and data augmentation. We overcome this challenge using the following two techniques: 1) we use shrunk covariance estimators (Ledoit & Wolf, 2004), and 2) we amplify the number of OOD samples using data augmentation. We use shrunk covariance estimators due to their ability to estimate the covariance better than the sample covariance, especially when the number of samples is less than the feature dimension. To further improve the estimation, we amplify the number of samples using data augmentation at the input stage. We use common image transformations, such as geometric and photometric changes, to create multiple different images from a single source image from the OOD dataset. Thus, given a set of k OOD samples {u_1, u_2, . . . , u_k}, we first create a set of k × n samples using data augmentation, U = {u_1^1, . . . , u_1^n, . . . , u_k^1, . . . , u_k^n}. Using this set, we calculate the outlier score for a test sample as:

s_x = (z_x − µ_in)^T Σ_in^{-1} (z_x − µ_in) − (z_x − µ_U)^T S_U^{-1} (z_x − µ_U)   (4)

where µ_U and S_U are the sample mean and the covariance estimated with shrunk covariance estimators over the set U, respectively.
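The few-shot score with a shrunk covariance estimator can be sketched as below, using scikit-learn's Ledoit-Wolf estimator, which remains well-conditioned even when the number of OOD samples is smaller than the feature dimension. The helper name `few_shot_score` is hypothetical; this is a minimal sketch of Equation 4, not the authors' implementation.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def few_shot_score(z, in_feats, ood_feats):
    """Mahalanobis distance to in-distribution statistics minus distance
    to (few-shot, possibly augmented) OOD statistics.

    Ledoit-Wolf shrinkage replaces the sample covariance, which would be
    singular with fewer samples than feature dimensions.
    """
    lw_in, lw_ood = LedoitWolf().fit(in_feats), LedoitWolf().fit(ood_feats)
    mu_in, prec_in = in_feats.mean(axis=0), lw_in.precision_
    mu_u, prec_u = ood_feats.mean(axis=0), lw_ood.precision_
    d_in = (z - mu_in) @ prec_in @ (z - mu_in)
    d_ood = (z - mu_u) @ prec_u @ (z - mu_u)
    return d_in - d_ood
```

With only ten OOD feature vectors in a 32-dimensional space, the sample covariance would be singular, but the shrunk estimate still yields a usable precision matrix, so OOD-like test points score higher than in-distribution ones.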

3.3. HOW TO BEST USE DATA LABELS (SSD+)

If fine-grained labels for in-distribution data are available, an immediate question is how to incorporate them in training to improve the success in detecting OOD samples.

Conventional approach: Additive self-supervised and supervised training loss. A common theme in recent works (Hendrycks et al., 2019b; Winkens et al., 2020) is to add the self-supervised (L_ssl) and supervised (L_sup) training loss functions, i.e., L_training = L_sup + α L_ssl, where the hyperparameter α is chosen for the best performance on OOD detection. A common loss function for supervised training is cross-entropy.

Our approach: Incorporating labels in contrastive self-supervised training. As we show in Section 4.2, even without labels, contrasting between instances using self-supervised learning is highly successful for outlier detection. We argue for a similar instance-based contrastive training, where labels can also be incorporated to further improve the learned representations. To this end, we use the recently proposed supervised contrastive training loss function (Khosla et al., 2020), which uses labels for a more effective selection of positive and negative instances for each image. In particular, we minimize the following loss function:

L_batch = (1/2N) Σ_{i=1}^{2N} −log [ (1/(2N_{y_i} − 1)) Σ_{k=1}^{2N} 1(k ≠ i) 1(y_k = y_i) e^{u_i^T u_k / τ} / Σ_{k=1}^{2N} 1(k ≠ i) e^{u_i^T u_k / τ} ]   (5)

where N_{y_i} refers to the number of images with label y_i in the batch. In comparison to the contrastive NT-Xent loss (Equation 1), we now use images with identical labels in each batch as positives. We will show the superior performance of this approach compared to earlier approaches, and note that it is also a parameter-free approach which doesn't require additional OOD data to tune parameters. We further use the proposed cluster-conditioned framework with the Mahalanobis distance, as we find it results in better performance than conditioning detection on data labels. We summarize our framework in Algorithm 1.
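The supervised contrastive loss in Equation 5 can be sketched as follows. This is an illustrative NumPy version, not the reference implementation of Khosla et al. (2020); the helper name `sup_con_loss` is hypothetical, and the loop-based positive selection is chosen for clarity over speed.

```python
import numpy as np

def sup_con_loss(features, labels, temperature=0.5):
    """Supervised contrastive loss (Equation 5 sketch): every other image
    in the batch with the same label serves as a positive."""
    u = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = u @ u.T / temperature
    np.fill_diagonal(sim, -np.inf)            # enforce k != i
    log_denom = np.log(np.exp(sim).sum(axis=1))
    losses = []
    for i, y in enumerate(labels):
        positives = [k for k in range(len(labels)) if k != i and labels[k] == y]
        if not positives:                     # skip anchors with no positives
            continue
        # average log-probability over all same-label positives
        losses.append(-np.mean([sim[i, k] - log_denom[i] for k in positives]))
    return float(np.mean(losses))
```

A batch whose same-label features are tightly clustered should incur a lower loss than a batch of random features under the same labels, which is the behavior the loss rewards.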
Algorithm 1: Self-supervised outlier detection framework (SSD)
Input: X_in, X_test, feature extractor (f), projection head (h), required true-positive rate (T); Optional: X_ood, Y_in   # X_in ∼ P^in_X, X_ood ∼ P^ood_X
Output: Is outlier or not? ∀ x ∈ X_test
Function getFeatures(X): return {f(x_i)/||f(x_i)||_2, ∀ x_i ∈ X}
Function SSDScore(Z, µ, Σ): return {(z − µ)^T Σ^{-1} (z − µ), ∀ z ∈ Z}
Function SSDkScore(Z, µ_in, Σ_in, µ_ood, Σ_ood): return {(z − µ_in)^T Σ_in^{-1} (z − µ_in) − (z − µ_ood)^T Σ_ood^{-1} (z − µ_ood), ∀ z ∈ Z}
Partition X_in into a training set (X_train) and a calibration set (X_cal)
if Y_in is not available then train the feature extractor f by minimizing the NT-Xent loss (Equation 1) on X_train
else train f by minimizing the supervised contrastive loss (Equation 5) on X_train
Estimate µ and Σ from getFeatures(X_train)
if X_ood is available then estimate µ_ood and Σ_ood from getFeatures(X_ood) and compute scores = SSDkScore(getFeatures(X_test), µ, Σ, µ_ood, Σ_ood)
else compute scores = SSDScore(getFeatures(X_test), µ, Σ)
Set the detection threshold using the scores of getFeatures(X_cal) to achieve true-positive rate T; flag each x ∈ X_test whose score exceeds the threshold as an outlier

4.1. COMMON SETUP ACROSS ALL EXPERIMENTS

We use the recently proposed NT-Xent loss function from the SimCLR (Chen et al., 2020) method for self-supervised training. We use the ResNet-50 network in all key experiments but also provide ablations with the ResNet-18, ResNet-34, and ResNet-101 network architectures. We train each network, for both supervised and self-supervised training, with stochastic gradient descent for 500 epochs, a starting learning rate of 0.5 with cosine decay, and weight decay and batch size set to 1e-4 and 512, respectively. We set the temperature parameter to 0.5 in the NT-Xent loss. We evaluate our detector with three performance metrics, namely FPR (at TPR=95%), AUROC, and AUPR. For the supervised training baseline, we use an identical training budget to SSD while also using the Mahalanobis distance based detection in the feature space. Due to space constraints, we present results with AUROC, which refers to the area under the receiver operating characteristic curve, in the main paper and provide detailed results with the other performance metrics in Appendix B.5. Our setup incorporates six image datasets along with additional synthetic datasets based on random noise. We report average results over three independent runs in most experiments.

Number of clusters.

We find that the choice of the number of clusters depends on which layer of the residual network we extract features from. While for the first three blocks we find an increase in AUROC with the number of clusters, the trend is reversed for the last block (Appendix B.2). Since the last-block features achieve the highest detection performance, we model the in-distribution features as a single cluster in subsequent experiments.

4.2. PERFORMANCE OF SSD

Comparison with unsupervised learning based detectors. We present this comparison in Table 1. We find that SSD improves the average AUROC by up to 55 points compared to standard unsupervised outlier detectors. A common limitation of each of these three detectors is that they score images from the SVHN dataset as more in-distribution when trained on the CIFAR-10 or CIFAR-100 dataset. In contrast, SSD successfully detects a large fraction of outliers from the SVHN dataset. We also experiment with the Rotation-loss (Gidaris et al., 2018), a non-contrastive self-supervised training objective. We find that SSD with the contrastive NT-Xent loss achieves 9.6% higher average AUROC compared to using the Rotation-loss.

Ablation studies. We ablate along individual parameters of self-supervised training with CIFAR-10 as in-distribution data (Figure 2). While the architecture doesn't have a very large effect on AUROC for most OOD datasets, we find that the number of training epochs and the batch size play a key role in detecting outliers from the CIFAR-100 dataset, which is the hardest to detect among the four OOD datasets. We also find an increase in the size of the training dataset helpful in the detection of all four OOD datasets.

Comparison with supervised representations. We earlier asked whether data labels are even necessary to learn representations crucial for OOD detection. To answer this, we compare SSD with a supervised network, trained with an identical budget as SSD while also using the Mahalanobis distance in the feature space, across sixteen different pairs of in-distribution and out-of-distribution datasets (Table 2). We observe that self-supervised representations achieve even better performance than supervised representations on 56% of the tasks in Table 2.

Success in anomaly detection. We now measure the performance of SSD in anomaly detection, where we consider one of the CIFAR-10 classes as in-distribution and the rest of the classes as the source of anomalies.
Similar to the earlier setup, we train a ResNet-50 network using self-supervised training with the NT-Xent loss function. While the contrastive loss attempts to separate individual instances in the feature space, we find that adding an ℓ2 regularization in the feature space helps improve performance. In particular, we add this regularization (with a scaling coefficient of 0.01) to bring individual instance features close to the mean of all feature vectors in the batch. Additionally, we reduce the temperature from 0.5 to 0.1 to reduce the separability of individual instances due to the contrastive loss. Overall, we find that our approach outperforms all previous works and achieves competitive performance with the concurrent work of Tack et al. (2020) (Table 3).

4.3. FEW-SHOT OOD DETECTION (SSD k )

Setup. We focus on one-shot and five-shot OOD detection, i.e., we set k to 1 or 5 in Equation 4 and use the Ledoit-Wolf (Ledoit & Wolf, 2004) estimator for covariance estimation. To avoid a bias from the selected samples, we report average results over 25 random trials.

Results. Compared to the baseline SSD detector, the one-shot and five-shot settings improve the average AUROC, across all OOD datasets, by 1.6 and 2.1 points, respectively (Tables 1 and 2). In particular, we observe large gains with CIFAR-100 as in-distribution and CIFAR-10 as OOD, where five-shot detection improves the AUROC from 69.6 to 78.3. We find the use of shrunk covariance estimators most critical to the success of our approach: shrunk covariance estimation by itself improves the AUROC from 69.6 to 77.1, and data augmentation further improves it to 78.3 for five-shot detection. With an increasing number of transformed copies of each sample, we also observe an improvement in AUROC, though it plateaus at around ten copies (Appendix B.3).

What if additional OOD images are available? Note that some earlier works, such as Liang et al. (2018), assume that 1,000 OOD inputs are available for tuning the detector. We find that with access to this larger amount of OOD samples, SSD_k can improve on the state-of-the-art by an even larger margin. For example, with CIFAR-100 as in-distribution and CIFAR-10 as out-of-distribution, it achieves 89.4 AUROC, which is 14.2% higher than the current state-of-the-art (Winkens et al., 2020).

4.4. SUCCESS WHEN USING DATA LABELS (SSD+)

Now we integrate training-data labels into our framework and compare it with existing state-of-the-art detectors. We report our results in Table 4. Our approach improves the average AUROC by 0.8 over the previous state-of-the-art detector, and achieves equal or better performance than the previous state-of-the-art across individual pairs of in-distribution and out-of-distribution datasets. For example, using labels in our framework improves the AUROC of the Mahalanobis detector from 55.5 to 72.1 for CIFAR-100 as in-distribution and CIFAR-10 as the OOD dataset. Training a two-layer MLP on the learned representations, instead of using the simple softmax probabilities, further improves the AUROC to 78.3. Combining SSD+ with a five-shot OOD detection method brings a further gain of 1.4 in the average AUROC.

5. DISCUSSION AND CONCLUSION

On tuning hyperparameters in SSD. In our framework, we either explicitly avoid the use of additional tuning parameters (such as when combining self-supervised and supervised loss functions (Hendrycks et al., 2019b; Winkens et al., 2020) in SSD+) or refrain from tuning the existing set of parameters for each OOD dataset. For example, we use a standard set of parameters for self-supervised training and model the learned features with a single cluster.

Why is contrastive self-supervised learning effective in the SSD framework? We focus on the NT-Xent loss function, which is parameterized by a temperature variable (τ). Its objective is to pull positive instances, i.e., different transformations of an image, together while pushing them away from other instances. Earlier works have shown that such contrastive training forces the network to learn a good set of feature representations. However, a smaller value of temperature quickly saturates the loss, discouraging it from further improving the feature representations. We find that the performance of SSD also degrades at lower temperatures, suggesting that learning a good set of feature representations is necessary for effective outlier detection (Table 5).

How the discriminative ability of feature representations evolves over the course of training. We analyze this effect in Figure 3.

Performance of SSD improves with the amount of available unlabeled data. A compelling advantage of unsupervised learning is the ability to learn from unlabeled data, which can be easily collected. As presented in Figure 2, we find that the performance of SSD increases with the size of the training dataset. We conduct another experiment with the STL-10 dataset, where in addition to the 5k training images, we also use an additional 10k images from the unlabeled set.
This improves the AUROC from 94.7 to 99.4 for CIFAR-100 as the OOD dataset, further demonstrating the success of SSD in leveraging unlabeled data (Appendix B.4). In conclusion, our framework provides an effective and flexible approach for outlier detection using unlabeled data.

To avoid hyperparameter selection, we set the image augmentation pipeline to be the same as the one used in training. Finally, we use the Ledoit-Wolf method (Ledoit & Wolf, 2004) to estimate the covariance of the OOD samples.

A.2 PERFORMANCE METRICS FOR OUTLIER DETECTORS

We use the following three performance metrics to evaluate the performance of outlier detectors.

• FPR at TPR=95%. It refers to the false positive rate (= FP / (FP+TN)) when the true positive rate (= TP / (TP+FN)) is equal to 95%. Effectively, its goal is to measure what fraction of outliers go undetected when it is desirable to have a true positive rate of 95%.

• AUROC. It refers to the area under the receiver operating characteristic curve. We measure it by calculating the area under the curve when we plot TPR against FPR.

• AUPR. It refers to the area under the precision-recall curve, where precision = TP / (TP+FP) and recall = TP / (TP+FN). Similar to AUROC, AUPR is a threshold-independent metric.
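The three metrics above can be computed from outlier scores as sketched below. This is an illustrative sketch, not the paper's evaluation code: the helper name `detection_metrics` is hypothetical, and following the convention above it treats in-distribution samples as the positive class, so FPR at 95% TPR is the fraction of outliers accepted when 95% of in-distribution samples are accepted.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def detection_metrics(scores_in, scores_ood):
    """FPR at 95% TPR, AUROC, and AUPR from outlier scores
    (higher score = more likely OOD)."""
    scores_in, scores_ood = np.asarray(scores_in), np.asarray(scores_ood)
    # in-distribution = positive class; negate scores so positives rank higher
    y = np.concatenate([np.ones(len(scores_in)), np.zeros(len(scores_ood))])
    s = np.concatenate([-scores_in, -scores_ood])
    auroc = roc_auc_score(y, s)
    aupr = average_precision_score(y, s)
    # accept everything below the 95th percentile of in-distribution scores
    # -> exactly 95% of in-distribution samples accepted (TPR = 95%)
    thr = np.percentile(scores_in, 95)
    fpr95 = np.mean(scores_ood <= thr)   # outliers that go undetected
    return fpr95, auroc, aupr
```

For well-separated score distributions, AUROC and AUPR approach 1 and FPR at 95% TPR approaches 0.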

A.3 DATASETS USED IN THIS WORK

We use the following datasets in this work. Whenever there is a mismatch between the resolution of in-distribution and out-of-distribution (OOD) images, we appropriately rescale the OOD images with bilinear interpolation. When the classes of the in-distribution and OOD datasets overlap, we remove the common classes from the OOD dataset.

• CIFAR-10 (Krizhevsky et al., 2009). It consists of 50,000 training images and 10,000 test images from 10 classes. Each image is 32×32 pixels.

• CIFAR-100 (Krizhevsky et al., 2009). CIFAR-100 also has 50,000 training images and 10,000 test images. However, it has 100 classes, which are further organized into 20 super-classes. Note that its classes are not identical to those of CIFAR-10, with the slight exception of the class truck in CIFAR-10 and pickup truck in CIFAR-100. However, the two datasets share multiple similar semantics, making it hard to catch outliers from the other dataset.

• SVHN (Netzer et al., 2011). SVHN is a real-world street-view house number dataset. It has 73,257 digits for training and 26,032 digits for testing. Similar to the CIFAR-10/100 datasets, its images are 32×32 pixels.

• STL-10 (Coates et al., 2011). STL-10 has the same classes as CIFAR-10 but focuses on unsupervised learning. It has 5,000 training images, 8,000 test images, and a set of 100,000 unlabeled images. Unlike the previous three datasets, its images are 96×96 pixels.

• DTD (Cimpoi et al., 2014). The Describable Textures Dataset (DTD) is a collection of textural images in the wild. It includes a total of 5,640 images, split equally among 47 categories, with image sizes ranging between 300×300 and 640×640 pixels.

• ImageNet (Deng et al., 2009). ImageNet is a large-scale dataset of 1,000 categories with 1.2 million training images and 50,000 validation images. It has high diversity in both inter- and intra-class images and is known to generalize well to other datasets.

• Blobs. Similar to Hendrycks et al. (2019a), we algorithmically generate these amorphous shapes with definite edges.

• Gaussian Noise. We generate images with Gaussian noise using a mean of 0.5 and a standard deviation of 0.25, clipping pixel values to the valid range [0, 1].

• Uniform Noise. Images where each pixel value is uniformly sampled from the [0, 1] range.
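The three synthetic OOD sources can be generated in a few lines. A sketch following the parameters stated above (the 32×32×3 image size is our assumption here, and the blob generation via thresholded blurred noise is only one plausible implementation; the exact procedure in Hendrycks et al. (2019a) may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
shape = (32, 32, 3)  # assumed image size; the paper matches the in-distribution resolution

# Gaussian noise: mean 0.5, std 0.25, clipped to the valid pixel range [0, 1].
gaussian_img = np.clip(rng.normal(0.5, 0.25, shape), 0.0, 1.0)

# Uniform noise: each pixel sampled uniformly from [0, 1].
uniform_img = rng.uniform(0.0, 1.0, shape)

# Blobs: amorphous shapes with definite edges, here obtained by thresholding
# heavily blurred uniform noise (a hypothetical stand-in for the paper's method).
blob_img = (gaussian_filter(rng.uniform(size=shape), sigma=4) > 0.5).astype(np.float32)
```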

B.1 LIMITATIONS OF OUTLIER DETECTORS BASED ON SUPERVISED TRAINING

Existing supervised training based detectors assume that fine-grained labels are available for the training data. What happens to their performance if we relax this assumption and assume that only coarse labels are present? We simulate this setup by combining consecutive classes of the CIFAR-10 dataset into two groups, referred to as CIFAR-2, or five groups, referred to as CIFAR-5. We use CIFAR-100 as the out-of-distribution dataset. We find that the performance of existing detectors degrades significantly when only coarse labels are present (Figure 4). In contrast, SSD operates on unlabeled data and thus does not suffer from a similar performance degradation.
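The coarse-label simulation above reduces to mapping each fine CIFAR-10 label to the group containing it. A minimal sketch (the helper name `coarsen` is ours, not from the paper):

```python
def coarsen(labels, num_groups, num_classes=10):
    """Map fine labels 0..num_classes-1 to num_groups groups of
    consecutive classes, e.g. CIFAR-2 (num_groups=2) or CIFAR-5 (num_groups=5)."""
    group_size = num_classes // num_groups
    return [y // group_size for y in labels]

fine = list(range(10))
print(coarsen(fine, 2))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  -> CIFAR-2
print(coarsen(fine, 5))  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]  -> CIFAR-5
```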

B.2 ON CHOICE OF NUMBER OF CLUSTERS

We find that the optimal number of clusters depends on which layer of the residual network we use as the feature extractor. We demonstrate this trend in Figure 5, with CIFAR-10 as the in-distribution dataset and CIFAR-100 as the out-of-distribution dataset. We extract features from the last layer of each block in the residual network and measure SSD performance with them. While for the first three blocks AUROC increases with the number of clusters, the trend is reversed for the last block (Figure 5). Since the last-block features achieve the highest detection performance, we model in-distribution features using a single cluster.
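Cluster-conditioned Mahalanobis scoring can be sketched as follows: cluster the in-distribution features with k-means, then score each test feature by its minimum Mahalanobis distance to any cluster. This is a simplified illustration under our own assumptions (per-cluster covariance with a small regularizer), not a verbatim reproduction of the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def ood_score(train_feats, test_feats, n_clusters=1):
    """Minimum squared Mahalanobis distance of each test feature to the
    clusters fitted on in-distribution training features (a sketch)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(train_feats)
    scores = np.full(len(test_feats), np.inf)
    for c in range(n_clusters):
        members = train_feats[km.labels_ == c]
        mu = members.mean(axis=0)
        # Regularize the covariance for numerical stability.
        cov = np.cov(members, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
        prec = np.linalg.inv(cov)
        d = test_feats - mu
        dist = np.einsum("ij,jk,ik->i", d, prec, d)  # per-sample Mahalanobis^2
        scores = np.minimum(scores, dist)
    return scores

# Toy usage: far-away features should receive much higher scores.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, (500, 8))
in_scores = ood_score(train, rng.normal(0.0, 1.0, (100, 8)))
out_scores = ood_score(train, rng.normal(5.0, 1.0, (100, 8)))
```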

B.3 ABLATION STUDY FOR FEW-SHOT OOD DETECTION

For few-shot OOD detection, we ablate along the number of transformations used for each sample. We choose CIFAR-100 as the in-distribution and CIFAR-10 as the OOD dataset with SSD_k, set k to five, and use the ResNet-18 network architecture. When increasing the number of transformations from 1 to 5, 10, 20, and 50, the AUROC of the detector is 74.3, 75.7, 76.1, 76.3, and 76.7, respectively. To balance performance and computational cost, we use ten transformations for each sample in our final experiments.
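The ablation above varies how many augmented copies of each few-shot sample contribute to its score. A hedged sketch of this aggregation, where `score_fn` and `transform_fn` are hypothetical stand-ins for the detector's scoring function and augmentation pipeline:

```python
import numpy as np

def aggregated_score(sample, score_fn, transform_fn, num_transforms=10):
    """Average a detection score over several random transformations of
    one sample (a sketch of the ablation's aggregation step)."""
    return np.mean([score_fn(transform_fn(sample)) for _ in range(num_transforms)])

# Toy usage with placeholder score/transform functions.
rng = np.random.default_rng(0)
sample = rng.normal(size=16)
score = aggregated_score(
    sample,
    score_fn=lambda x: float(np.linalg.norm(x)),          # placeholder score
    transform_fn=lambda x: x + rng.normal(0, 0.1, x.shape),  # placeholder augmentation
    num_transforms=10,
)
```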

B.4 PERFORMANCE OF SSD IMPROVES WITH AMOUNT OF UNLABELED DATA

With easy access to unlabeled data, it is compelling to develop detectors that can benefit from increasing amounts of such data. We earlier demonstrated this ability of SSD on the CIFAR-10 dataset in Figure 2. Now we present similar results with the STL-10 dataset. We first train a self-supervised network and an equivalent supervised network with 5,000 training images from the STL-10 dataset. We refer to these networks as SSD-5k and Sup-5k, respectively. Next, we include an additional 10,000 images from the 100,000 unlabeled images available in the dataset. As we show in Figure 6, SSD achieves large gains in performance with access to the additional unlabeled training data.

B.5 RESULTS WITH DIFFERENT PERFORMANCE METRICS

We provide detailed experimental results for each component of the SSD framework, with three different performance metrics, in Tables 6 and 7.



Our code is publicly available at https://github.com/inspire-group/SSD. We refer to OOD detection without using class labels of in-distribution data as unsupervised OOD detection. For the ImageNet dataset, we use the commonly used ILSVRC 2012 release.



Figure 1: AUROC along individual principal eigenvectors with CIFAR-10 as in-distribution and CIFAR-100 as OOD. Directions with higher eigenvalues dominate the Euclidean distance but are the least helpful for outlier detection. The Mahalanobis distance avoids this bias with appropriate scaling and performs much better.

Figure 2: Ablating across different training parameters in SSD under the following setup: in-distribution dataset = CIFAR-10, OOD dataset = CIFAR-100, training epochs = 500, batch size = 512.

Figure 3: AUROC over the course of training with CIFAR-10 as the in-distribution and CIFAR-100 as the OOD set.

Figure 4: Existing supervised detectors require fine-grained labels. In contrast, SSD achieves similar performance with only unlabeled data.


Comparison of SSD with different outlier detectors using only unlabeled training data.

Comparing the performance of self-supervised (SSD) and supervised representations. We also provide results for few-shot OOD detection (SSD_k) for comparison with our baseline SSD detector.

Comparison of SSD with other detectors for anomaly detection task on CIFAR-10 dataset.

Comparison of SSD+, i.e., incorporating labels in the SSD detector, with state-of-the-art detectors based on supervised training.

Uses a 4× wider ResNet-50 network. † Requires additional out-of-distribution data for tuning.

Test accuracy and AUROC with different temperature values in the NT-Xent loss (Equation 1), using CIFAR-10 as the in-distribution and CIFAR-100 as the OOD dataset with a ResNet-18 network.

Experimental results of the SSD detector with multiple metrics on the CIFAR-10, CIFAR-100, and STL-10 datasets.


Acknowledgments We would like to thank Chong Xiang, Liwei Song, and Arjun Nitin Bhagoji for their helpful feedback on the paper. This work was supported in part by the National Science Foundation under grants CNS-1553437 and CNS-1704105, by a Qualcomm Innovation Fellowship, by the Army Research Office Young Investigator Prize, by Army Research Laboratory (ARL) Army Artificial Intelligence Institute (A2I2), by Office of Naval Research (ONR) Young Investigator Award, by Facebook Systems for ML award, by Schmidt DataX Fund, and by Princeton E-ffiliates Partnership.


Published as a conference paper at ICLR 2021 

