A SIMPLE AND EFFECTIVE BASELINE FOR OUT-OF-DISTRIBUTION DETECTION USING AN ABSTENTION CLASS Anonymous

Abstract

Refraining from confidently predicting when faced with categories of inputs different from those seen during training is an important requirement for the safe deployment of deep learning systems. While simple to state, this has been a particularly challenging problem in deep learning, where models often end up making overconfident predictions in such situations. In this work we present a simple but highly effective approach to out-of-distribution detection that uses the principle of abstention: when encountering a sample from an unseen class, the desired behavior is to abstain from predicting. Our approach uses a network with an extra abstention class, trained on a dataset augmented with an uncurated set of a large number of out-of-distribution (OoD) samples that are assigned the label of the abstention class; the model thereby learns an effective discriminator between in- and out-of-distribution samples. We compare this relatively simple approach against a wide variety of more complex methods proposed both for out-of-distribution detection and for uncertainty modeling in deep learning, and empirically demonstrate its effectiveness on a wide variety of benchmarks and deep architectures for image recognition and text classification, often outperforming existing approaches by significant margins. Given its simplicity and effectiveness, we propose that this method be used as a new additional baseline for future work in this domain.

1. INTRODUCTION AND RELATED WORK

Most of supervised machine learning has been developed under the assumption that the distribution of classes seen at train and test time is the same. However, the real world is unpredictable and open-ended, and making machine learning systems robust to the presence of unknown categories and out-of-distribution samples has become increasingly essential for their safe deployment. While refraining from predicting when uncertain should be intuitively obvious to humans, the peculiarities of DNNs make them overconfident on unknown inputs Nguyen et al. (2015), and this makes the problem challenging to solve in deep learning. A very active sub-field of deep learning, known as out-of-distribution (OoD) detection, has emerged in recent years that attempts to impart to deep neural networks the quality of "knowing when it doesn't know". The most straightforward approach in this regard is to use the DNN's output as a proxy for predictive confidence. For example, a simple baseline for detecting OoD samples using thresholded softmax scores was presented in Hendrycks & Gimpel (2016), where the authors provided empirical evidence that in-distribution predictions of DNN classifiers do tend to have higher winning scores than OoD samples, thus empirically justifying the use of softmax thresholding as a useful baseline. However, this approach is vulnerable to the pathologies discussed in Nguyen et al. (2015). Subsequently, increasingly sophisticated methods have been developed to attack the OoD problem. Liang et al. (2018) introduced a detection technique that involves perturbing the inputs in the direction of increasing the confidence of the network's predictions on a given input, based on the observation that the magnitude of gradients on in-distribution data tends to be larger than on OoD data. The method proposed in Lee et al. (2018) also involves input perturbation, but confidence in this case is measured by a Mahalanobis distance score using the computed mean and covariance of the pre-softmax scores. A drawback of such methods, however, is that they introduce a number of hyperparameters that need to be tuned on the OoD dataset, which is infeasible in many real-world scenarios, as one does not often know the properties of unknown classes in advance. A modified version of the perturbation approach was recently proposed in Hsu et al. (2020) that circumvents some of these issues, though one still needs to ascertain an ideal perturbation magnitude, which might not generalize from one OoD set to another. Given that one might expect a classifier to be more uncertain when faced with OoD data, many methods developed for estimating the uncertainty of DNN predictions have also been used for OoD detection. A useful baseline in this regard is the temperature scaling method of Guo et al. (2017), which was proposed for calibrating DNN predictions on in-distribution data and has been observed to also serve as a useful OoD detector in some scenarios. Further, label smoothing techniques like mixup Zhang et al. (2017) have also been shown to improve OoD detection performance in DNNs Thulasidasan et al. (2019). An ensemble-of-deep-models approach, augmented with adversarial examples during training, described in Lakshminarayanan et al. (2017), was also shown to improve predictive uncertainty and was successfully applied to OoD detection. In the Bayesian realm, methods such as Maddox et al. (2019) and Osawa et al. (2019) have also been used for OoD detection, though at increased computational cost. However, it has been argued that for OoD detection, Bayesian priors on the data are not completely justified, since one does not have access to the prior of the open set Boult et al. (2019).
Nevertheless, simple approaches like dropout, which has been shown to approximate Bayesian inference in deep Gaussian processes Gal & Ghahramani (2016), have been used as baselines for OoD detection. Training the model to recognize unknown classes by using data from categories that do not overlap with the classes of interest has been shown to be quite effective for out-of-distribution detection, and a slew of methods that use additional data for discriminating between ID and OoD data have been proposed. DeVries & Taylor (2018) describe a method that uses a separate confidence branch together with misclassified training samples that serve as a proxy for OoD samples. In the outlier exposure technique described in Hendrycks et al. (2018), the predictions on natural outlier images used in training are regularized against the uniform distribution to encourage high-entropy posteriors on outlier samples. An approach that uses an extra class for outlier samples is described in Neal et al. (2018), where, instead of natural outliers, counterfactual images that lie just outside the class boundaries of known classes are generated using a GAN and assigned the extra class label. A similar approach using generated samples for the extra class, but using a conditional Variational Auto-Encoder Kingma & Welling (2013) for generation, is described in Vernekar et al. (2019). A method to force a DNN to produce high-entropy (i.e., low-confidence) predictions and suppress the magnitude of feature activations for OoD samples was discussed in Dhamija et al. (2018); arguing that methods that use an extra background class for OoD samples force all such samples to lie in one region of the feature space, that work also enforces separation by suppressing the activation magnitudes of samples from unknown classes. The above works have shown that the use of known OoD samples (or known unknowns) often generalizes well to unknown unknowns.
Indeed, even though the space of unknown classes is potentially infinite, and one can never know in advance the myriad of inputs that can occur at test time, this approach has been empirically shown to work. The abstention method that we describe in the next section borrows ideas from many of the above methods: as in Hendrycks et al. (2018), we use additional samples of real images and text from non-overlapping categories to train the model to abstain, but instead of entropy regularization over OoD samples, our method uses an extra abstention class. While it has sometimes been argued in the literature that using an additional abstention (or rejection) class is not an effective approach for OoD detection Dhamija et al. (2018); Lee et al. (2017), the comprehensive experiments we conduct in this work demonstrate that this is not the case. Indeed, we find that such an approach is not only simple but also highly effective for OoD detection, often outperforming existing methods that are more complicated and involve tuning of multiple hyperparameters. The main contributions of this work are as follows:
• To the best of our knowledge, this is the first work to comprehensively demonstrate the efficacy of using an extra abstention (or rejection) class in combination with outlier training data for effective OoD detection.
• In addition to being effective, our method is also simple: we introduce no additional hyperparameters in the loss function, and train with regular cross-entropy. From a practical standpoint, this is especially useful for deep learning practitioners who might not wish to modify the loss function while training deep models. In addition, since outlier data is simply an additional training class, no architectural modifications to existing networks are needed.
• Due to the simplicity and effectiveness of this method, we argue that this approach be considered a strong baseline for comparing new methods in the field of OoD detection.

2. OUT-OF-DISTRIBUTION DETECTION WITH AN ABSTAINING CLASSIFIER (DAC)

Our approach uses a DNN trained with an extra abstention class for detecting out-of-distribution and novel samples; from here on, we will refer to this as the deep abstaining classifier (DAC). We augment our training set of in-distribution samples ($D_{\text{in}}$) with an auxiliary dataset of known out-of-distribution samples ($\widetilde{D}_{\text{out}}$) that is known to be mostly disjoint from the main training set (we will use $D_{\text{out}}$ to denote the unknown out-of-distribution samples used for testing). We assign the training label $K+1$ to all the outlier samples in $\widetilde{D}_{\text{out}}$ (where $K$ is the number of known classes) and train with cross-entropy; the minimization problem then becomes

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim D_{\text{in}}}\left[-\log P_\theta(y=\hat{y}\mid x)\right] \;+\; \mathbb{E}_{x\sim \widetilde{D}_{\text{out}}}\left[-\log P_\theta(y=K+1\mid x)\right] \tag{1}$$

where $\theta$ are the weights of the neural network. This is somewhat similar to the approaches described in Hendrycks et al. (2018) as well as in Lee et al. (2017), with the main difference being that those methods do not use an extra class; instead, predictions on outliers are regularized against the uniform distribution. Further, in those methods the loss on the outlier samples is weighted by a hyperparameter λ that has to be tuned; in contrast, our approach does not introduce any additional hyperparameters. In our experiments, we find that the presence of an abstention class that is used to capture the mass in $\widetilde{D}_{\text{out}}$ significantly increases the ability to detect $D_{\text{out}}$ during testing. For example, in Figure 1, we show the distribution of the winning logits (pre-softmax activations) in a regular DNN (left). For the same experimental setup, the abstention logit of the DAC (right) produces near-perfect separation of the in- and out-of-distribution logits, indicating that using an abstention class for mapping outliers can be a very effective approach to OoD detection.
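To make the training objective above concrete, the combined (K + 1)-way cross-entropy can be sketched in a few lines of numpy; the random logits stand in for network outputs, and all names, batch sizes, and values here are our illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

K = 10  # number of known (in-distribution) classes; the network has K + 1 outputs

def abstention_xent(logits, labels):
    """Mean cross-entropy over a batch; column K (the last) is the abstention class."""
    z = logits - logits.max(axis=1, keepdims=True)                # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
in_logits  = rng.normal(size=(32, K + 1))   # stand-in for network outputs on a D_in batch
out_logits = rng.normal(size=(32, K + 1))   # stand-in for outputs on a D~_out batch
in_labels  = rng.integers(0, K, size=32)    # true class labels for D_in samples
out_labels = np.full(32, K)                 # every outlier gets the abstention label

# One combined mini-batch realizes both expectations in the objective:
loss = abstention_xent(np.vstack([in_logits, out_logits]),
                       np.concatenate([in_labels, out_labels]))
```

In an actual training loop the logits would come from the network and the loss would be backpropagated; the point is simply that outliers enter the loss as ordinary examples of the abstention class, with no extra weighting hyperparameter.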
Theoretically, it might be argued that the abstention class might only capture data that is aligned with the weight vector of that class, and that this approach might thus fail to detect the myriad of OoD inputs that can span the entire input region. Comprehensive experiments over a wide variety of benchmarks, described in the subsequent sections, however, empirically demonstrate that while detection is not perfect, it performs very well, and indeed much better than more complicated approaches. Once the model is trained, we use a simple thresholding mechanism for detection. Concretely, the detector $g(x) : X \to \{0, 1\}$ assigns label 1 (OoD) if the softmax score of the abstention class, $p_{K+1}(x)$, is above some threshold δ, and label 0 otherwise:

$$g(x) = \begin{cases} 1 & \text{if } p_{K+1}(x) \ge \delta \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

As in other methods, the threshold δ has to be determined based on the acceptable risk, which might be specific to the application. However, using performance metrics like the area under the ROC curve (AUROC), we can determine the threshold-independent performance of various methods, and we use this as one of our evaluation metrics in all our experiments.
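The detector in Eq. (2) then reduces to a single comparison on the abstention-class softmax score; a minimal numpy sketch (function names are ours) illustrates it:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def ood_detector(logits, delta):
    """g(x): label 1 (OoD) iff the abstention-class score p_{K+1}(x) >= delta.
    The abstention class is assumed to be the last output column."""
    p_abstain = softmax(logits)[:, -1]
    return (p_abstain >= delta).astype(int)

# Two toy inputs: the first puts its mass on a known class,
# the second on the abstention class.
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 0.0, 10.0]])
labels = ood_detector(logits, delta=0.5)   # -> [0, 1]
```

Here `labels` comes out as `[0, 1]`: the first input is accepted as in-distribution, the second is flagged as OoD.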

3. EXPERIMENTS

The experiments we describe here can be divided into two sets: in the first, we compare against methods that are explicitly designed for OoD detection, while in the second, we compare against methods that are known to improve predictive uncertainty in deep learning. In both cases, we report results over a variety of architectures to demonstrate the efficacy of our method.

3.1. DATASETS

For all computer vision experiments, we use CIFAR-10 and CIFAR-100 Krizhevsky & Hinton (2009) as the in-distribution datasets, in addition to augmenting our training set with 100K unlabeled samples from the Tiny Images dataset Torralba et al. (2008). For the out-of-distribution datasets, we test on the following:
• SVHN Netzer et al. (2011), a large set of 32×32 color images of house numbers, comprising ten classes of digits 0-9. We use a subset of the 26K images in the test set.
• LSUN Yu et al. (2015), the Large-scale Scene Understanding dataset, comprising 10 different types of scenes.
• Places365 Zhou et al. (2017), a large collection of pictures of scenes that fall into one of 365 classes.
• Tiny ImageNet tin (2017) (not to be confused with Tiny Images), which consists of images belonging to 200 categories that are a subset of ImageNet categories. The images are 64×64 color, which we scale down to 32×32 when testing.
• Gaussian, a synthetically generated dataset consisting of 32×32 random Gaussian-noise images, where each pixel is sampled from an i.i.d. Gaussian distribution.
For the NLP experiments, we use the 20 Newsgroups Lang (1995), TREC, and SST Socher et al. (2013) datasets as our in-distribution datasets, which are the same as those used by Hendrycks et al. (2018) to facilitate direct comparison. We use the 50-category version of TREC, and for SST, we use binarized labels where neutral samples are removed. For our OoD training data, we use unlabeled samples from WikiText-2 by assigning them to the abstention class. We test our model on the following OoD datasets:
• SNLI Bowman et al. (2015) is a dataset of premises and hypotheses for natural language inference. We use the hypotheses for testing.
• IMDB Maas et al. (2011) is a sentiment classification dataset of movie reviews, with similar statistics to those of SST.
• Multi30K Barrault et al. (2018) is a dataset of English-German image descriptions, of which we use the English descriptions.
• WMT16 Bojar et al. (2016) is a dataset of English-German language pairs designed for machine translation tasks. We use the English portion of the test set from WMT16.
• Yelp Zhang et al. (2015) is a dataset of restaurant reviews.

3.2. COMPARISON AGAINST OOD METHODS

In this section, we compare against a slew of recent state-of-the-art methods that have been explicitly designed for OoD detection. For the image experiments, we compare against the following:
• Deep Outlier Exposure, as described in Hendrycks et al. (2018) and discussed in Section 1.
• Ensemble of Leave-out Classifiers Vyas et al. (2018), where each classifier is trained by leaving out a random subset of the training data (which is treated as OoD data), while the rest is treated as ID data.
• ODIN, as described in Liang et al. (2018) and discussed in Section 1. ODIN uses input perturbation and temperature scaling to differentiate between ID and OoD samples.
• Deep Mahalanobis Detector, proposed in Lee et al. (2018), which estimates the class-conditional distributions over the hidden-layer features of a deep model using Gaussian discriminant analysis, thresholds on a Mahalanobis distance-based confidence score, and, similar to ODIN, uses input perturbation at test time.
• OpenMax, as described in Bendale & Boult (2016) for novel category detection. This method uses the mean activation vectors of ID classes observed during training, followed by Weibull fitting, to determine whether a given sample is novel or out-of-distribution.
For all of the above methods, we use published results when available, keeping the architecture and datasets the same as in the experiments described in the respective papers. For the NLP experiments, we only compare against the published results in Hendrycks et al. (2018). For OpenMax, we re-implement the authors' published algorithm using the PyTorch framework Paszke et al. (2019).

3.2.1. METRICS

Following established practices in the literature, we use the following metrics to measure the detection performance of our method:
• AUROC, or Area Under the Receiver Operating Characteristic curve, depicts the relationship between the True Positive Rate (TPR, also known as Recall) and the False Positive Rate (FPR), and can be interpreted as the probability that a positive example is assigned a higher detection score than a negative example Fawcett (2006). Unlike 0/1 accuracy, the AUROC has the desirable property that it is not affected by class imbalance.¹
• FPR at 95% TPR, which is the probability that a negative sample is misclassified as a positive sample when the TPR (or recall) on the positive samples is 95%.
In the work that we compare against, the out-of-distribution samples are treated as the positive class, so we do the same here, and treat the in-distribution samples as the negative class.
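Both metrics can be computed directly from the detection scores of the OoD (positive) and in-distribution (negative) samples; the following simplified numpy sketch (our own, ignoring tie handling in the rank-based AUROC) illustrates the definitions:

```python
import numpy as np

def auroc(pos_scores, neg_scores):
    """P(a random positive scores higher than a random negative), via the
    Mann-Whitney U statistic; ties are ignored for simplicity."""
    ranks = np.concatenate([pos_scores, neg_scores]).argsort().argsort() + 1
    n_p, n_n = len(pos_scores), len(neg_scores)
    r_pos = ranks[:n_p].sum()                      # rank-sum of the positives
    return (r_pos - n_p * (n_p + 1) / 2) / (n_p * n_n)

def fpr_at_tpr(pos_scores, neg_scores, tpr_level=0.95):
    """FPR at the smallest threshold that still recalls >= tpr_level of positives."""
    k = int(np.floor((1 - tpr_level) * len(pos_scores)))
    thresh = np.sort(pos_scores)[k]                # drops at most 5% of positives
    return float((neg_scores >= thresh).mean())

# Toy abstention scores: OoD (positive) vs. in-distribution (negative) samples.
# Perfect separation gives AUROC = 1.0 and FPR-at-95%-TPR = 0.0.
pos = np.array([0.9, 0.8, 0.7])
neg = np.array([0.1, 0.2, 0.3])
```
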

3.2.2. RESULTS

Detailed results against the various OoD methods are shown in Tables 1 through 3, for vision and language respectively, where a clear trend emerges: in almost all cases, the DAC outperforms the other methods, often by significant margins, especially when the in-distribution data is more complex, as is the case with CIFAR-100. While the Outlier Exposure method Hendrycks et al. (2018) (shown at the top in Table 1) is conceptually similar to ours, the presence of an extra abstention class in our model often bestows significant performance advantages. Further, we do not need to tune a separate hyperparameter that determines the weight of the outlier loss, as done in Hendrycks et al. (2018). In fact, the simplicity of our method is one of its striking features: we do not introduce any additional hyperparameters, which makes our approach significantly easier to implement than methods such as ODIN and the Mahalanobis detector; these methods need to be tuned separately on each OoD dataset, which is usually not possible, as one does not have access to the distribution of unseen classes in advance. Indeed, when the performance of these methods is tested without tuning on the OoD test set, the DAC significantly outperforms methods such as the Mahalanobis detector (shown at the bottom of Table 1). We also show performance against the OpenMax approach of Bendale & Boult (2016) in Table 2; in every case, the DAC outperforms OpenMax by significant margins. While the abstention approach uses an extra class and OoD samples while training, and thus incurs some training overhead, it is significantly less expensive at test time, as the forward pass is no different from that of a regular DNN. In contrast, methods like ODIN and the Mahalanobis detector require gradient calculations with respect to the input in order to apply the input perturbation; the DAC approach thus offers a computationally simpler alternative.
Also, even though the DAC introduces additional network parameters in the final linear layer (due to the presence of an extra abstention class), and thus might be more prone to overfitting, we find this not to be the case, as evidenced by the generalization of OoD performance to different types of test datasets.

3.3. COMPARISON AGAINST UNCERTAINTY-BASED METHODS

Table 1: Comparison of the extra-class method (ours) with various other out-of-distribution detection methods when trained on CIFAR-10 and CIFAR-100 and tested on other datasets. All numbers for comparison methods are sourced from their respective original publications. For our method, we also report the standard deviation over five runs (indicated by the subscript), and treat the performance of other methods within one standard deviation as equivalent to ours. For fair comparison with the Mahalanobis detector (MAH) Lee et al. (2018), we use results when their method was not tuned separately on each OoD test set (Table 6 in Lee et al. (2018)).

Next, we perform experiments to compare the OoD detection performance of the DAC against various methods that have been proposed for improving predictive uncertainty in deep learning. In these cases, one expects that such methods will cause the DNN to predict with less confidence when presented with inputs from a different distribution or from novel categories. We compare against the following methods:
• Softmax Thresholding. This is the simplest baseline, where OoD samples are detected by thresholding on the winning softmax score; scores falling below a threshold are rejected.
• Entropy Thresholding. Another simple baseline, where OoD samples are rejected if the Shannon entropy calculated over the softmax posteriors is above a certain threshold.
• Monte Carlo Dropout. A Bayesian-inspired approach proposed in Gal & Ghahramani (2016) for improving predictive uncertainty in deep learning. We found a dropout probability of p = 0.5 to perform well, and use 100 forward passes per sample during prediction.
• Temperature Scaling, which improves DNN calibration as described in Guo et al. (2017). The scaling temperature T is tuned on a held-out subset of the validation set of the in-distribution data.
• Mixup. As shown in Thulasidasan et al. (2019), mixup can be an effective OoD detector, so we also use this as one of our baselines.
• Deep Ensembles, introduced in Lakshminarayanan et al. (2017) for improving uncertainty estimates for both classification and regression. In this approach, multiple versions of the same model are trained using different random initializations, and adversarial samples are generated during training to improve model robustness. We use an ensemble size of 5 as suggested in their paper.
• SWAG, as described in Maddox et al. (2019), which is a Bayesian approach to deep learning that exploits the fact that SGD itself can be viewed as approximate Bayesian inference Mandt et al. (2017). We use an ensemble size of 30 as proposed in the original paper.
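The first two baselines above differ only in the score they threshold; the following numpy sketch (thresholds and names are our illustrative choices, not the paper's) makes the contrast explicit:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def max_softmax_score(logits):
    """Winning softmax probability; low values suggest OoD inputs."""
    return softmax(logits).max(axis=1)

def entropy_score(logits):
    """Shannon entropy of the softmax posterior; high values suggest OoD inputs."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

# A confident (in-distribution-like) and a flat (OoD-like) posterior:
logits = np.array([[8.0, 0.0, 0.0],    # peaked on one class
                   [0.1, 0.0, 0.1]])   # nearly uniform
is_ood_softmax = max_softmax_score(logits) < 0.5   # threshold the winning score
is_ood_entropy = entropy_score(logits) > 0.5       # threshold the entropy
```

Both detectors flag only the second, nearly uniform posterior as OoD; in practice the two scores are closely related but not equivalent, since entropy also accounts for the mass spread over the losing classes.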

3.3.1. RESULTS

Detailed results are shown in Table 4, where the best-performing method for each metric is shown in bold. The DAC is the only method in this set of experiments that uses an augmented dataset, and as is clearly evident from the results, this confers a significant advantage over the other methods in most cases. Calibration methods like temperature scaling, while producing well-calibrated scores on in-distribution data, end up reducing the confidence on in-distribution data as well, and thus lose discriminative power between the two types of data. We also note that many of the methods listed in the table, like temperature scaling and deep ensembles, can be combined with the abstention approach. Indeed, the addition of an extra abstention class and training with OoD data is compatible with most uncertainty modeling techniques in deep learning; we leave the exploration of such combination approaches for future work.

4. CONCLUSION

We presented a simple but highly effective method for open-set and out-of-distribution detection that clearly demonstrates the efficacy of using an extra abstention class and augmenting the training set with outliers. While previous work has shown the efficacy of outlier exposure Hendrycks et al. (2018), here we demonstrated an alternative approach for exploiting outlier data that further improves upon existing methods, while also being simpler to implement than many of the other methods. The ease of implementation, absence of additional hyperparameter tuning, and computational efficiency during testing make this a very viable approach for improving out-of-distribution and novel category detection in real-world deployments; we hope that it will also serve as an effective baseline for comparing future work in this domain.



¹An alternate area-under-the-curve metric, known as the Area under the Precision-Recall Curve (AUPRC), is used when the size of the negative class is large compared to the positive class. We do not report the AUPRC here, as we keep our in-distribution and out-of-distribution sets balanced in these experiments.



Figure 1: An illustration of the separability of scores on in- and out-of-distribution data for a regular DNN (left) and the DAC (right).




DAC vs. OE on the NLP classification tasks. The OE implementation was based on code available at https://github.com/hendrycks/outlier-exposure


The performance of the DAC as an OoD detector, evaluated on various metrics and compared against competing baselines. All experiments used the ResNet-34 architecture, except for MC Dropout, for which we used the WideResNet 28x10 network. ↑ and ↓ indicate that higher and lower values are better, respectively. Best-performing methods (ignoring statistically insignificant differences) on each metric are shown in bold.

