ARE ALL OUTLIERS ALIKE? ON UNDERSTANDING THE DIVERSITY OF OUTLIERS FOR DETECTING OODS

Abstract

Deep neural networks (DNNs) are known to produce incorrect predictions with very high confidence on out-of-distribution (OOD) inputs. This limitation is one of the key challenges in the adoption of deep learning models in high-assurance systems such as autonomous driving, air traffic management, and medical diagnosis. This challenge has received significant attention recently, and several techniques have been developed to detect inputs where the model's prediction cannot be trusted. These techniques use different statistical, geometric, or topological signatures. This paper presents a taxonomy of OOD outlier inputs based on their source and nature of uncertainty. We demonstrate how different existing detection approaches fail to detect certain types of outliers. We utilize these insights to develop a novel integrated detection approach that uses multiple attributes corresponding to different types of outliers. Our results include experiments on CIFAR10, SVHN, and MNIST as in-distribution data and Imagenet, LSUN, SVHN (for CIFAR10), CIFAR10 (for SVHN), KMNIST, and F-MNIST as OOD data across different DNN architectures such as ResNet34, WideResNet, DenseNet, and LeNet5.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved remarkable performance levels in many areas such as computer vision (Gkioxari et al., 2015), speech recognition (Hannun et al., 2014), and text analysis (Majumder et al., 2017). But their deployment in safety-critical systems such as self-driving vehicles (Bojarski et al., 2016), aircraft collision avoidance (Julian & Kochenderfer, 2017), and medical diagnosis (De Fauw et al., 2018) is hindered by their brittleness. One major challenge is the inability of DNNs to be self-aware of when new inputs are outside the training distribution and likely to produce incorrect predictions. It has been widely reported in the literature (Guo et al., 2017a; Hendrycks & Gimpel, 2016) that deep neural networks exhibit overconfident incorrect predictions on inputs which are outside the training distribution. The responsible deployment of deep neural network models in high-assurance applications necessitates detection of out-of-distribution (OOD) data so that DNNs can abstain from making decisions on those. Recent approaches for OOD detection consider different statistical, geometric, or topological signatures in data that differentiate OODs from the training distribution. For example, the changes in the softmax scores due to input perturbations and temperature scaling have been used to detect OODs (Hendrycks & Gimpel, 2016; Liang et al., 2017; Guo et al., 2017b). Papernot & McDaniel (2018) use the conformance among the labels of the nearest neighbors, while Tack et al. (2020) use cosine similarity (modulated by the norm of the feature vector) to the nearest training sample for the detection of OODs. Lee et al. (2018) consider the Mahalanobis distance of an input from the in-distribution data to detect OODs.
Several other metrics have also been used to detect OODs: reconstruction error (An & Cho, 2015), the likelihood ratio between in-distribution and OOD samples (Ren et al., 2019), trust scores (the ratio of the distance to the nearest class different from the predicted class to the distance to the predicted class) (Jiang et al., 2018), density functions (Liu et al., 2020; Hendrycks et al., 2019a), and the probability distribution of the softmax scores (Lee et al., 2017; Hendrycks et al., 2019b; Tack et al., 2020; Hendrycks et al., 2019a). All these methods attempt to develop a uniform approach with a single signature to detect all OODs, accompanied by empirical evaluations that use datasets such as CIFAR10 as in-distribution data and other datasets such as SVHN as OOD. Our study shows that OODs can be of diverse types with different defining characteristics. Consequently, an integrated approach that takes this diversity into account is needed for effective OOD detection. We make the following three contributions in this paper:

• Taxonomy of OODs. We define a taxonomy that classifies OOD samples into different types based on aleatoric vs. epistemic uncertainty (Hüllermeier & Waegeman, 2019), distance from the predicted class vs. distance from the tied training distribution, and uncertainty in the principal components vs. uncertainty in the non-principal components with low variance.

• Incompleteness of existing uniform OOD detection approaches. We examine the limitations of state-of-the-art approaches in detecting various types of OOD samples. We observe that not all outliers are alike and that existing approaches fail to detect particular types of OODs. We use a toy dataset comprising two halfmoons as two different classes to demonstrate these limitations.

• An integrated OOD detection approach. We propose an integrated approach that can detect different types of OOD inputs.
We demonstrate the effectiveness of our approach on several benchmarks, and compare against state-of-the-art OOD detection approaches such as ODIN (Liang et al., 2017) and the Mahalanobis distance method (Lee et al., 2018).

2. OOD TAXONOMY AND EXISTING DETECTION METHODS

DNNs predict the class of a new input based on the classification boundaries learned from the samples of the training distribution. Aleatoric uncertainty is high for inputs which are close to the classification boundaries, and epistemic uncertainty is high when the input is far from the learned distributions of all classes (Hora, 1996; Hüllermeier & Waegeman, 2019). Given the predicted class of a DNN model on a given input, we can observe the distance of the input from the distribution of this particular class and identify it as an OOD if this distance is high. We use this top-down inference approach to detect this type of OOD, which is characterized by an inconsistency between the model's prediction and the input's distance from the distribution of the predicted class. Further, typical inputs to DNNs are high-dimensional and can be decomposed into principal and non-principal components based on the direction of high variation; this yields another dimension for the classification of OODs. We, thus, categorize an OOD using the following three criteria:

1. Is the OOD associated with higher epistemic or aleatoric uncertainty, i.e., is the input away from the in-distribution data, or can it be confused between multiple classes?
2. Is the epistemic uncertainty of an OOD sample unconditional, or is it conditioned on the class predicted by the DNN model?
3. Is the OOD an outlier due to an unusually high deviation in the principal components of the data, or due to a small deviation in the non-principal (and hence, statistically invariant) components?

Different approaches differ in their ability to detect different OOD types, as illustrated in Figure 3.

• Figure 3(a) shows that the Mahalanobis distance (Lee et al., 2018) from the mean and tied covariance of all the training data in the feature space cannot detect OODs in clusters B and C, corresponding to class-conditional epistemic uncertainty and aleatoric uncertainty, respectively.
It attains an overall true negative rate (TNR) of 39.09% at 95% true positive rate (TPR).

• Figure 3(b) shows that the softmax prediction probability (SPB) (Hendrycks & Gimpel, 2016) cannot detect the OODs in cluster A, corresponding to high epistemic uncertainty. The TNR (at 95% TPR) reported by the SPB technique is 60.91%.

• Figure 3(c) shows that class-wise Principal Component Analysis (PCA) (Hoffmann, 2007) cannot detect OODs in cluster C, corresponding to high aleatoric uncertainty. We performed PCA of the two classes separately in the feature space and used the minimum reconstruction error to detect OODs. This obtained an overall TNR of 80.91% (at 95% TPR).

• Figure 3(d) shows the limitation of the K-Nearest Neighbor (kNN) approach (Papernot & McDaniel, 2018).

Our integrated approach fuses the following four attributes:

1. Mahalanobis distance from the in-distribution density estimate, using either a tied (Lee et al., 2018) or a class-wise covariance estimate. This attribute captures the overall or class-conditional epistemic uncertainty of an OOD. Our refinement to also use the class-wise covariance significantly improves detection of OODs when coupled with the PCA approach described below.

2. Conformance measure among the variance of the Annoy (Bernhardsson, 2018) nearest neighbors, calculated as the Mahalanobis distance of the input's conformance to the closest class conformance. Our experiments found this to be very effective in capturing aleatoric uncertainty. This new attribute is a fusion of the nearest-neighbor and Mahalanobis distance methods in the literature.

3. Prediction confidence of the classifier, computed as the maximum softmax score on the perturbed input, where the perturbation is the same as in the ODIN approach (Liang et al., 2017). This boosts the detection of high aleatoric uncertainty by sharpening the class-wise distributions.

4. Reconstruction error using the top 40% of PCA components, where the components are obtained via class-conditional PCA of the training data.
This boosts the detection of high class-wise epistemic uncertainty by eliminating irrelevant features. This fusion of attributes from existing state-of-the-art detection methods and new attributes was found to be the most effective integrated approach capable of detecting the different types of OODs. We evaluated it on several benchmarks as discussed in Section 4, with an ablation study in the Appendix. The perturbed input for the prediction-confidence attribute is computed as

x̃ = x − ε · sign(−∇x log Sŷ(x; T))

The values of the magnitude of the noise (ε) and the temperature scaling parameter (T) are chosen from one of the following three categories:
• ε = 0 and T = 0
• ε = 0 and T = 10
• ε = 0.005 and T = 10

4. Conformance measure among the nearest neighbors: We compute an m-dimensional feature vector to capture the conformance among the input's nearest neighbors in the training samples, where m is the dimension of the input. We call this m-dimensional feature vector the conformance vector. The conformance vector is calculated by taking the mean deviation, along each dimension, of the nearest neighbors from the input. We hypothesize that this deviation for in-distribution samples differs from that for OODs due to aleatoric uncertainty. The value of the conformance measure is calculated by computing the Mahalanobis distance of the input's conformance vector to the closest class conformance distribution. Similar to the distance for the in-distribution density estimate, the parameters of this Mahalanobis distance are chosen from the following two categories:
• empirical class means and tied empirical covariance on the conformance vectors of the training samples
• empirical class means and empirical class covariances on the conformance vectors of the training samples
The number of nearest neighbors is chosen from the set {10, 20, 30, 40, 50} via validation. We used Annoy (Approximate Nearest Neighbors Oh Yeah) (Bernhardsson, 2018) to compute the nearest neighbors.
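The conformance measure above can be sketched in a few lines of numpy. This is an illustrative reconstruction from the description, not the authors' code: brute-force nearest neighbors stand in for the Annoy index, and only the tied-covariance variant of the Mahalanobis distance is shown.

```python
import numpy as np

def conformance_vector(x, train_X, k=10):
    # Mean deviation, along each dimension, of x's k nearest training
    # samples from x (brute-force neighbours stand in for Annoy here).
    d = np.linalg.norm(train_X - x, axis=1)
    nn = train_X[np.argsort(d)[:k]]
    return np.mean(nn - x, axis=0)

def mahalanobis_sq(v, mean, cov_inv):
    diff = v - mean
    return float(diff @ cov_inv @ diff)

def conformance_score(x, train_X, train_y, k=10):
    # Mahalanobis distance of x's conformance vector to the closest
    # class-conditional conformance distribution (tied-covariance variant).
    classes = np.unique(train_y)
    class_confs = {c: np.array([conformance_vector(s, train_X, k)
                                for s in train_X[train_y == c]])
                   for c in classes}
    all_confs = np.vstack(list(class_confs.values()))
    cov_inv = np.linalg.pinv(np.cov(all_confs, rowvar=False))
    v = conformance_vector(x, train_X, k)
    return min(mahalanobis_sq(v, class_confs[c].mean(axis=0), cov_inv)
               for c in classes)
```

For an in-distribution input the nearest neighbors surround it and the mean deviation is small; for an outlier the neighbors all lie to one side, producing a conformance vector far from every class conformance distribution.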
The weights of the four attributes forming the signature of the OOD detector are generated in the following manner. We use a small subset (1000 samples) of both the in-distribution and the generated OOD data to train a binary classifier using the logistic loss. The OOD data used to train the classifier is generated by perturbing the in-distribution data using the Fast Gradient Sign Method (FGSM) attack (Goodfellow et al., 2014). The trained classifier (or OOD detector) is then evaluated on the real OOD dataset at a True Positive Rate of 95%. The best result, in terms of the highest TNR on the validation dataset (from the training phase of the OOD detector), from the twelve combinations of the aforementioned sub-categories (one from each of the four attributes) is then reported on the test (or real) OOD datasets.

Datasets and metrics. We evaluate the proposed integrated OOD detection on benchmarks such as CIFAR10 (Krizhevsky et al., 2009) and SVHN (Netzer et al., 2011). We consider standard metrics (Hendrycks & Gimpel, 2016; Liang et al., 2017; Lee et al., 2018) such as the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), and the detection accuracy (DTACC) to evaluate our performance.
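The detector training described above amounts to logistic regression over the four attribute values. A minimal numpy sketch, assuming the per-attribute scores have already been computed for each sample (the FGSM generation of surrogate OODs is not shown):

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_ood_detector(in_scores, ood_scores, lr=0.5, steps=3000):
    # Learn logistic-loss weights over the per-attribute scores.
    # in_scores, ood_scores: (n, 4) arrays of the four attribute values for
    # in-distribution samples and surrogate OODs (FGSM-perturbed
    # in-distribution samples in the paper).
    X = np.vstack([in_scores, ood_scores])
    y = np.concatenate([np.zeros(len(in_scores)), np.ones(len(ood_scores))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = _sigmoid(X @ w + b)        # P(OOD | attribute scores)
        g = p - y                      # gradient of the logistic loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def detect(scores, w, b, threshold=0.5):
    # Flag an input as OOD when the weighted score exceeds the threshold.
    return _sigmoid(scores @ w + b) > threshold
```

In practice the threshold is set on validation data to fix the TPR at 95%, and the weights are retrained for each of the twelve attribute sub-category combinations.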

DNN-based classifier architectures.

To demonstrate that the proposed approach generalizes across various network architectures, we consider a wide range of DNN models such as ResNet (He et al., 2016), WideResNet (Zagoruyko & Komodakis, 2016), and DenseNet (Huang et al., 2017).

Comparison with the state-of-the-art. We compare our approach with three state-of-the-art approaches: SPB (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2017), and Mahalanobis (Lee et al., 2018). For the ODIN method, the perturbation noise is chosen from the set {0, 0.0005, 0.001, 0.0014, 0.002, 0.0024, 0.005, 0.01, 0.05, 0.1, 0.2}, and the temperature T is chosen from the set {1, 10, 100, 1000}. These values are chosen on a validation set of adversarial samples of the in-distribution data generated by the FGSM attack. For the Mahalanobis method, we consider their best results, obtained after feature ensemble and input pre-processing, with the hyperparameters of their OOD detector tuned on the in-distribution and adversarial samples generated by the FGSM attack. The magnitude of the noise used in pre-processing of the inputs is chosen from the set {0.0, 0.01, 0.005, 0.002, 0.0014, 0.001, 0.0005}.

CIFAR10. With CIFAR10 as in-distribution, we consider SVHN (Netzer et al., 2011), Tiny-Imagenet (Deng et al., 2009), and LSUN (Yu et al., 2015) as the OOD datasets. For CIFAR10, we consider two DNNs: ResNet50 and WideResNet. Table 1 shows the results.
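For reference, the temperature-scaled maximum softmax probability that ODIN (and our prediction-confidence attribute) relies on can be sketched as follows. Only the scoring step is shown, since the input-perturbation step requires model gradients:

```python
import numpy as np

def odin_score(logits, T=10.0):
    # Temperature-scaled maximum softmax probability: the scoring half of
    # ODIN. Larger T flattens the softmax, which separates confident
    # in-distribution predictions from OOD ones more cleanly.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())
```

Sweeping T over {1, 10, 100, 1000} (and the perturbation magnitude over the listed set) then selects the configuration with the best validation TNR.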

SVHN.

With SVHN as in-distribution, we consider CIFAR10, Imagenet, and LSUN as the OOD datasets. For SVHN, we use the DenseNet classifier. Table 1 shows the results.

Key observations. We do not consider pre-processing of the inputs in our integrated OOD detector. Even without input pre-processing, and with the exception of the CIFAR10 OOD dataset for SVHN in-distribution trained on DenseNet, we perform as well as (and even outperform in most cases) the Mahalanobis method on its best results, which are generated after pre-processing the input. We also consider a Subset-CIFAR100 as OODs for CIFAR10. Specifically, from the CIFAR100 classes, we select sea, road, bee, and butterfly as OODs, which are visually similar to the ship, automobile, and bird classes in CIFAR10, respectively. Thus, there can be numerous OOD samples due to aleatoric and class-conditional epistemic uncertainty, which makes OOD detection challenging. Figure 5 shows the t-SNE (Maaten & Hinton, 2008) plot of the penultimate features from the ResNet50 model trained on CIFAR10. We show 4 examples of OODs (2 due to epistemic and 2 due to aleatoric uncertainty) from Subset-CIFAR100. These OODs were detected by our integrated approach but missed by the Mahalanobis approach. These observations justify the effectiveness of integrating multiple attributes to detect OOD samples.

Additional experimental results in the appendix. We also compare the performance of the integrated OOD detector with the SPB, ODIN, and Mahalanobis detectors in supervised settings, as reported by the Mahalanobis method for OOD detection (Lee et al., 2018). These results include experiments on CIFAR10, SVHN, and MNIST as in-distribution data and Imagenet, LSUN, SVHN (for CIFAR10), CIFAR10 (for SVHN), KMNIST, and F-MNIST as OOD data across different DNN architectures such as ResNet34, WideResNet, DenseNet, and LeNet5.
All these results, along with the ablation studies on OOD detectors with single attributes are included in the Appendix. In almost all of the reported results in the Appendix, our OOD detector could outperform the compared state-of-the-art methods with improvements of even 2X higher TNR at 95% TPR in some cases.

5. DISCUSSION AND FUTURE WORK

Recent techniques propose refinements in the training process of the classifiers for OOD detection. Some of these techniques fine-tune the classifier's training with an auxiliary cost function for OOD detection (Hendrycks et al., 2019a; Liu et al., 2020). Other techniques make use of self-supervised models for OOD detection (Tack et al., 2020; Hendrycks et al., 2019b). We perform preliminary experiments to compare the performance of these techniques with our integrated OOD detector, which makes use of the feature space of pre-trained classifiers to distinguish in-distribution samples from OODs. Our approach does not require modification of the training cost function of the original task. These results are reported in the Appendix. We consider making use of the feature space of such refined classifiers in our OOD detection technique as a promising direction for future work. Another direction of future work is to explore the score functions used in these refined training processes (Liu et al., 2020; Hendrycks et al., 2019a; Tack et al., 2020; Hendrycks et al., 2019b) as attributes (or categories of attributes) forming the signature of the integrated OOD detector.

6. CONCLUSION

We introduced a taxonomy of OODs and proposed an integrated approach to detect different types of OODs. Our taxonomy classifies OODs on the nature of their uncertainty, and we demonstrated that no single state-of-the-art approach detects all these OOD types. Motivated by this observation, we formulated an integrated approach that fuses multiple attributes to target different types of OODs. We have performed extensive experiments on a synthetic dataset and several benchmark datasets (e.g., MNIST, CIFAR10, SVHN). Our experiments show that our approach can accurately detect various types of OODs coming from a wide range of OOD datasets such as KMNIST, Fashion-MNIST, SVHN, LSUN, and Imagenet. We have shown that our approach generalizes over multiple DNN architectures and performs robustly when the OOD samples are similar to in-distribution data.

A.2.2 COMPARISON OF TNR, AUROC, DTACC, AUPR WITH SPB, ODIN AND MAHALANOBIS METHODS

We compare our results with the state-of-the-art methods in supervised settings, as reported by the Mahalanobis method for OOD detection (Lee et al., 2018). In supervised settings, a small subset of the real OOD dataset is used in the training of the OOD detector.

Datasets and metrics. We evaluate the proposed integrated OOD detection on benchmarks such as MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky et al., 2009), and SVHN (Netzer et al., 2011). We consider standard metrics (Hendrycks & Gimpel, 2016; Liang et al., 2017; Lee et al., 2018) such as the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve with either the in-distribution or the OOD samples as the positive class (AUPR IN and AUPR OUT, respectively), and the detection accuracy (DTACC) to evaluate our performance.
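The TNR-at-95%-TPR metric used throughout can be computed directly from the two score populations. A small numpy sketch, assuming the convention that higher scores indicate in-distribution (positive) samples:

```python
import numpy as np

def tnr_at_tpr(in_scores, ood_scores, tpr=0.95):
    # Pick the threshold that keeps `tpr` of the in-distribution (positive)
    # scores above it, then report the fraction of OOD (negative) scores
    # that fall below it.
    threshold = np.quantile(in_scores, 1.0 - tpr)
    return float(np.mean(ood_scores < threshold))
```

AUROC and AUPR are threshold-free summaries over the same two populations; DTACC is the accuracy at the best single threshold.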

DNN-based classifier architectures.

To demonstrate that the proposed approach generalizes across various network architectures, we consider a wide range of DNN models such as LeNet (LeCun et al., 1998), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017).

Comparison with the state-of-the-art. We compare our approach with three state-of-the-art approaches: SPB (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2017), and Mahalanobis (Lee et al., 2018). Since these experiments are performed in supervised settings, we fix T = 10 and ε = 0.005 for generating results from the ODIN method. For the Mahalanobis method, we consider the distance in the penultimate layer feature space as well as features from all the layers of the DNN, without pre-processing of the input in either setting.

MNIST. With MNIST as in-distribution, we consider KMNIST (Clanuwat et al., 2018) and Fashion-MNIST (F-MNIST) (Xiao et al., 2017) as OOD datasets. For MNIST, we use the LeNet5 (LeCun et al., 1998) DNN. Results in terms of TNR (at 95% TPR), AUROC, and DTACC are reported in Tables 6, 7, and 8. Table 6 shows the results with the features from the penultimate layer in comparison to the ODIN and Mahalanobis methods. Table 7 shows the results with the features from all the layers in comparison to the Mahalanobis method. Table 8 shows the results with the features from the penultimate layer in comparison to the SPB method. In all these settings, our approach outperforms the state-of-the-art approaches for both OOD datasets. Results in terms of AUPR IN and AUPR OUT are shown in Table 9; here also, our technique outperforms all three OOD detectors on all the test cases.

CIFAR10. With CIFAR10 as in-distribution, we consider STL10 (Coates et al., 2011), SVHN (Netzer et al., 2011), Imagenet (Deng et al., 2009), LSUN (Yu et al., 2015), and a subset of CIFAR100 (SCIFAR100) (Krizhevsky et al., 2009) as OOD datasets. For CIFAR10, we consider three DNNs: DenseNet, ResNet34, and ResNet50.
Results in terms of TNR (at 95% TPR), AUROC, and DTACC are reported in Tables 6, 7, and 8. Table 6 shows the results with the features from the penultimate layer in comparison to the ODIN and Mahalanobis methods. Table 7 shows the results with the features from all the layers in comparison to the Mahalanobis method. Table 8 shows the results with the features from the penultimate layer in comparison to the SPB method. Results in terms of AUPR IN and AUPR OUT are shown in Tables 10, 11, and 12. Here also, the integrated OOD detection technique outperforms the other three detectors on most of the test cases. Note that images from STL10 and the subset of CIFAR100 are quite similar to CIFAR10 images. Furthermore, from the CIFAR100 classes, we select sea, road, bee, and butterfly as OODs, which are visually similar to the ship, automobile, and bird classes in CIFAR10, respectively.

SVHN.

With SVHN as in-distribution, we consider STL10, CIFAR10, Imagenet, LSUN, and SCIFAR100 as OOD datasets. For SVHN, we consider two DNNs: DenseNet and ResNet34. Results in terms of TNR (at 95% TPR), AUROC, and DTACC are reported in Tables 6, 7, and 8. Table 6 shows the results with the features from the penultimate layer in comparison to the ODIN and Mahalanobis methods. Table 7 shows the results with the features from all the layers in comparison to the Mahalanobis method. Table 8 shows the results with the features from the penultimate layer in comparison to the SPB method. Results in terms of AUPR IN and AUPR OUT are shown in Tables 13 and 14. Here also, the integrated OOD detection technique outperforms the other three detectors on most of the test cases.

Key observations. As shown in Tables 6, 7, and 8, our approach outperforms the state-of-the-art on all three datasets and with various DNN architectures. On CIFAR10, in terms of the TNR metric, our approach with ResNet50 outperforms Mahalanobis by 56% when SVHN is OOD, and our approach with ResNet34 outperforms ODIN by 36% when LSUN is OOD. The images from both STL10 and Subset-CIFAR100 are quite similar to CIFAR10 images; thus, there can be numerous OOD samples due to aleatoric and class-conditional epistemic uncertainty, which makes detection challenging. Although our performance is low on the STL10 dataset, it still outperforms the state-of-the-art; for instance, the proposed approach achieves a 27% better TNR score than Mahalanobis using ResNet50. On SVHN, in terms of the TNR metric, our approach outperforms ODIN and Mahalanobis by 63% and 13%, respectively, on SCIFAR100 using ResNet34. The above observations justify the effectiveness of integrating multiple attributes to detect OOD samples.

A.2.3 ABLATION STUDY

We report an ablation study on OOD detection with individual attributes and compare it with our integrated approach on the penultimate feature space of the classifier in the supervised settings described in the previous section. We call the OOD detector with Mahalanobis distance estimated from class means and tied covariance (Lee et al., 2018) Mahala-Tied. The detector based on Mahalanobis distance estimated from class means and class covariances is referred to as Mahala-Class. Similarly, conformance among the K-nearest neighbors (KNN) measured by Mahala-Tied and Mahala-Class is referred to as KNN-Tied and KNN-Class, respectively, in these experiments. Results for this study on CIFAR10 with the DenseNet architecture and on SVHN with the DenseNet and ResNet34 architectures are shown in Tables 15, 16, and 17, respectively. The integrated approach outperforms all the single-attribute OOD detectors in all the tested cases due to its detection of diverse OODs. An important observation from these experiments is that the performance of the single-attribute methods can depend on the architecture of the classifier. For example, while the performance of PCA was considerably worse than all other methods with DenseNet (for both CIFAR10 and SVHN), it outperformed all but the integrated approach for SVHN on ResNet34.
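The class-wise PCA attribute evaluated in this ablation can be sketched as follows. This is an illustrative numpy reconstruction (via SVD) of the reconstruction-error score, not the authors' implementation:

```python
import numpy as np

def pca_reconstruction_error(x, class_feats, frac=0.4):
    # Reconstruction error of x using the top `frac` of principal
    # components fitted to one class's training features (the paper keeps
    # the top 40% of components).
    mu = class_feats.mean(axis=0)
    _, _, Vt = np.linalg.svd(class_feats - mu, full_matrices=False)
    k = max(1, int(frac * Vt.shape[0]))
    P = Vt[:k]                              # top-k principal directions
    recon = (x - mu) @ P.T @ P + mu         # project onto the PCA subspace
    return float(np.linalg.norm(x - recon))

def min_class_reconstruction_error(x, feats_by_class, frac=0.4):
    # Class-conditional variant: score x against its best-fitting class.
    return min(pca_reconstruction_error(x, f, frac) for f in feats_by_class)
```

Inputs that deviate along the discarded, low-variance directions (Type 5 OODs in the taxonomy) produce large reconstruction errors even when their overall distance to the class mean is small.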



Figure 1: The different types of OODs in a 2D space with three different classes. The class distributions are represented as Gaussians with black boundaries and the tied distribution of all training data is a Gaussian with red boundary.

Figure 1 demonstrates the different types of OODs, which differ along these criteria. Type 1 OODs have high epistemic uncertainty and are away from the in-distribution data. Type 2 OODs have high epistemic uncertainty with respect to each of the 3 classes, even though approximating all in-distribution (ID) data using a single Gaussian distribution will miss these outliers. Type 3 OODs have high aleatoric uncertainty as they are close to the decision boundary between class 0 and class 1. Types 4 and 5 have high epistemic uncertainty with respect to their closest classes. While Type 4 OODs are far from the distribution along the principal axis, Type 5 OODs vary along a relatively invariant axis, where even a small deviation marks them as outliers.
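The gap between the tied and class-wise estimates that distinguishes Type 2 OODs can be reproduced on a toy numpy example. The cluster placement below is illustrative, not the paper's data:

```python
import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)

# Three tight class clusters placed apart, mimicking Figure 1.
rng = np.random.default_rng(0)
centers = [(-4.0, 0.0), (4.0, 0.0), (0.0, 6.0)]
classes = [rng.normal(c, 0.3, size=(200, 2)) for c in centers]
all_data = np.vstack(classes)

# A Type 2 OOD: close to the centroid of all the data, far from every class.
x = np.array([0.0, 2.0])

tied_d = mahalanobis_sq(x, all_data.mean(axis=0),
                        np.linalg.pinv(np.cov(all_data, rowvar=False)))
class_d = min(mahalanobis_sq(x, c.mean(axis=0),
                             np.linalg.pinv(np.cov(c, rowvar=False)))
              for c in classes)
# The tied distance looks ordinary while every class-wise distance is
# large, so a detector using only the tied estimate misses this point.
```

This is exactly the failure mode of Figure 3(a): the tied Gaussian covers the region between the classes, while the class-wise estimates correctly flag it as far from every class.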

Figure 3: Detected OODs are shown in blue and undetected OODs are in red. Different techniques fail to detect different types of OODs.

Figure 4: Complementary information about different types of OODs improves detection. (Top-left) 15% TNR with non-conformance among the labels of the nearest neighbors. (Top-right) Adding Mahalanobis distance over the tied in-distribution improves TNR to 54.09%. (Bottom-left) Adding Class-wise PCA further improves TNR to 95.91% TNR. (Bottom-right) Adding softmax score further improves TNR to 99.55%. TPR is 95% in all the cases.

Figure 5: t-SNE plot of the penultimate layer feature space of ResNet50 trained on CIFAR10. We show four OOD images from SCIFAR100. OOD 1 and OOD 2 are far from the distributions of all classes and thus represent OODs due to epistemic uncertainty. OOD 3 and OOD 4 are OODs due to aleatoric uncertainty as they lie close to two class distributions: the third OOD is closer to the cat and frog classes of the ID, and the fourth OOD is closer to the airplane and automobile classes of the ID. The Mahalanobis distance cannot detect these OODs, but our integrated approach can.

Comparison of TNR, AUROC, DTACC, AUPR with SPB, ODIN and Mahalanobis methods

Another avenue of future work is to explore OOD generation techniques other than adversarial examples generated by the FGSM attack for training of the integrated OOD detector.

Results with Energy based OOD detector (Liu et al., 2020) / Our method.

Results with Outlier Exposure based OOD detector (Hendrycks et al., 2019a) / Our method.

Results with self-supervised learning based OOD detector (Hendrycks et al., 2019b) / Our method.

Results with contrastive learning based OOD detector (Tack et al., 2020) / Our method.

Results with ODIN/Mahalanobis/Our method. The best results are highlighted.

Results with Mahalanobis/Our method with feature ensemble. The best results are highlighted.

Experimental results with SPB/Our method. The best results are highlighted.

Experimental Results with MNIST on Lenet5 for AUPR IN and AUPR OUT. The best results are highlighted.

Experimental Results with CIFAR10 on DenseNet for AUPR IN and AUPR OUT. The best results are highlighted.

Experimental Results with CIFAR10 on ResNet34 for AUPR IN and AUPR OUT. The best results are highlighted.

Experimental Results with CIFAR10 on ResNet50 for AUPR IN and AUPR OUT. The best results are highlighted.

Experimental Results with SVHN on DenseNet for AUPR IN and AUPR OUT. The best results are highlighted.

Experimental Results with SVHN on ResNet34 for AUPR IN and AUPR OUT. The best results are highlighted.

Ablation study with CIFAR10 on DenseNet. The best results are highlighted.

Ablation study with SVHN on DenseNet. The best results are highlighted.

Ablation study with SVHN on ResNet34. The best results are highlighted.

A APPENDIX

We first present preliminary results for comparison with the OOD detection techniques based on fine-tuning of the classifiers (Hendrycks et al., 2019a; Liu et al., 2020; Tack et al., 2020; Hendrycks et al., 2019b). We then present our results on various vision datasets and different architectures of the pre-trained DNN-based classifiers for these datasets in comparison to the ODIN, Mahalanobis, and SPB methods in supervised settings. Finally, we report results from the ablation study on OOD detection with individual attributes and compare it with our integrated approach.

A.2.1 COMPARISON WITH THE OOD DETECTION TECHNIQUES BASED ON REFINEMENT OF THE TRAINING PROCESS FOR CLASSIFIERS

Recent techniques propose refinements in the training process of the classifiers for OOD detection. Some of these techniques include fine-tuning the training of classifiers with a trainable cost function for OOD detection (Hendrycks et al., 2019a; Liu et al., 2020) and self-supervised training of the classifiers to enhance OOD detection (Tack et al., 2020; Hendrycks et al., 2019b). We perform preliminary experiments to compare the performance of these techniques with our integrated OOD detector, which uses features of the pre-trained classifiers to distinguish in-distribution samples from OODs. Table 2 compares TNR (at 95% TPR), AUROC, and AUPR for the energy-based OOD detector (Liu et al., 2020) and our integrated OOD detector on CIFAR10 with the pre-trained WideResNet model. The integrated OOD detector was trained on in-distribution samples and adversarial samples generated by the FGSM attack. Table 3 compares the results of the WideResNet model trained on CIFAR10 and fine-tuned with outlier exposure from the 80 Million Tiny Images with our OOD detector that uses features from the pre-trained WideResNet model trained on CIFAR10. Since the 80 Million Tiny Images dataset is no longer available for use, we used a small subset of ImageNet (treated as an OOD dataset for the CIFAR10 and SVHN datasets (Lee et al., 2018)) for generating OODs for training of the integrated OOD detector. Table 4 compares the OOD detection performance of the self-supervised training based OOD detector with our method. We trained our OOD detector with CIFAR10 as in-distribution samples and adversarial samples generated by the FGSM attack from the test dataset of CIFAR10 as OODs. The trained OOD detector was then tested on LSUN as OODs, and the results are reported in Table 4. With ResNet-50 as the classifier for CIFAR10, we trained our OOD detector with CIFAR10 as in-distribution samples and adversarial samples generated by FGSM from the test dataset of CIFAR10 as OODs.
The trained OOD detector was then tested on SVHN as OODs, and these results are compared with the contrastive-learning-based OOD detection method (Tack et al., 2020) in Table 5.

