PRACTICAL EVALUATION OF OUT-OF-DISTRIBUTION DETECTION METHODS FOR IMAGE CLASSIFICATION

Anonymous

Abstract

We reconsider the evaluation of OOD detection methods for image recognition. Although many studies have been conducted to build better OOD detection methods, most of them follow Hendrycks and Gimpel's work in their experimental evaluation. While a unified evaluation method is necessary for fair comparison, it is questionable whether its choice of tasks and datasets reflects real-world applications and whether the evaluation results generalize to other OOD detection scenarios. In this paper, we experimentally evaluate the performance of representative OOD detection methods in three scenarios, i.e., irrelevant input detection, novel class detection, and domain shift detection, on various datasets and classification tasks. The results show that differences in scenarios and datasets alter the relative performance of the methods. Our results can also serve as a guide for practitioners in selecting OOD detection methods.

1. INTRODUCTION

Despite their high performance on various visual recognition tasks, convolutional neural networks (CNNs) often show unpredictable behavior on out-of-distribution (OOD) inputs, i.e., those sampled from a distribution different from the training data. For instance, CNNs often classify irrelevant images into one of the known classes with high confidence. A visual recognition system should therefore be equipped with the ability to detect such OOD inputs upon real-world deployment. There are many studies of OOD detection, based on diverse motivations and purposes. However, as far as recent studies targeting visual recognition are concerned, most of them follow the work of Hendrycks & Gimpel (2017), which provides a formal problem statement of OOD detection and an experimental procedure to evaluate the performance of methods. Employing this procedure, recent studies focus mainly on increasing detection accuracy, measuring performance on the same datasets. On the one hand, the shared experimental procedure has arguably brought about rapid research progress in a short period. On the other hand, little attention has been paid to how well the employed procedure models real-world problems and applications, which are diverse in purpose and domain and obviously cannot be covered by a single problem setting with a narrow range of datasets. In this study, to address this issue, we consider multiple, more realistic scenarios for the application of OOD detection, and experimentally compare representative methods. Specifically, we consider three scenarios: detection of irrelevant inputs, detection of novel class inputs, and detection of domain shift. The first two scenarios differ in the closeness between ID and OOD samples. Unlike the first two, domain shift detection is not precisely OOD detection.
Nonetheless, it shares with the other two the goal of judging whether the model can make a meaningful inference on a novel input. In other words, we can generalize OOD detection to the problem of making this judgment. The above three scenarios then naturally fall into the same group of problems, and it becomes natural to consider applying OOD detection methods to the third scenario. It is noteworthy that domain shift detection has been poorly studied in the community: despite strong demand from practitioners, there is no established method in the context of deep learning for image classification. Based on the above generalization of OOD detection, we propose a meta-approach in which any OOD detection method can be used as a component. For each of the three scenarios, we compare the following methods: the confidence-based baseline (Hendrycks & Gimpel, 2017), MC dropout (Gal & Ghahramani, 2016), ODIN (Liang et al., 2017), cosine similarity (Techapanurak et al., 2019; Hsu et al., 2020), and the Mahalanobis detector (Lee et al., 2018). Domain shift detection is studied in (Elsahar & Gallé, 2019) on natural language processing tasks, where proxy-A distance (PAD) is reported to perform the best; thus, we test it in our experiments as well. In choosing the compared methods, we follow the argument shared by many recent studies (Shafaei et al., 2019; Techapanurak et al., 2019; Yu & Aizawa, 2019; Yu et al., 2020; Hsu et al., 2020) that OOD detection methods should not assume the availability of explicit OOD samples at training time. Although this may sound obvious given the nature of OOD, some recent methods (e.g., Liang et al. (2017); Lee et al. (2018)) use a certain amount of OOD samples as validation data to determine their hyperparameters.
Recent studies (Shafaei et al., 2019; Techapanurak et al., 2019) show that these methods perform poorly when they encounter OOD inputs sampled, at test time, from a distribution different from the assumed one. Thus, for ODIN and the Mahalanobis detector, we employ their variants (Hsu et al., 2020; Lee et al., 2018) that work without OOD samples. The other compared methods do not need OOD samples. The contributions of this study are summarized as follows. i) Listing three problems that practitioners frequently encounter, we evaluate the existing OOD detection methods on each of them. ii) We show a practical approach to domain shift detection that is applicable to CNNs for image classification. iii) We present an experimental evaluation of representative OOD detection methods on these problems, revealing each method's effectiveness and ineffectiveness in each scenario.

2.1. PRACTICAL SCENARIOS OF OOD DETECTION

We consider image recognition tasks in which a CNN classifies a single image x into one of C known classes. The CNN is trained using pairs of x and its label, where x is sampled according to x ∼ p(x). At test time, it encounters an unseen input x, which is usually from p(x) but is sometimes from a different, unknown distribution p′(x). In this study, we consider the following three scenarios.

Detecting Irrelevant Inputs

The new input x does not belong to any of the known classes and is out of concern. Suppose we want to build a smartphone app that recognizes dog breeds. We train a CNN on a dataset containing various dog images, enabling it to perform the task with reasonable accuracy. We then point the smartphone at a sofa, shoot an image, and feed it to our classifier. The classifier could label the image a Bull Terrier with high confidence. Naturally, we want to avoid this by detecting the irrelevance of x. Most studies of OOD detection assume this scenario for evaluation.

Detecting Novel Classes

The input x belongs to a novel class, which differs from any of the C known classes, and furthermore, we want our CNN to learn to classify it later, e.g., after additional training. For instance, suppose we are building a system that recognizes insects in the wild, with the ambition of covering all the insects on Earth. Further, suppose an image of one of the endangered (and thus rare) insects is input to the system during operation. If we can detect it as a novel class, we can update the system in several ways. The problem is the same as the first scenario in that we want to detect whether or not x ∼ p(x). The difference is that x is more similar to samples of the learned classes, or equivalently, p′(x) is closer to p(x), arguably making detection more difficult. Note that, for simplicity, we do not consider distinguishing whether x is an irrelevant input or a novel class input; we leave this for future study.

Detecting Domain Shift

The input x belongs to one of the C known classes, but its underlying distribution is p′(x), not p(x). We are especially interested in the case where a distributional shift p(x) → p′(x) occurs either suddenly or gradually while running a system over the long term. Our CNN may or may not generalize beyond this shift to p′(x); we want to detect when it does not. If we can do so, we can take some action, such as re-training the network with new training data (Elsahar & Gallé, 2019). We consider the case where no information is available other than the incoming inputs x. A good example is a surveillance system using a camera deployed outdoors. Suppose the image quality deteriorates some time after deployment, for instance, due to the camera's aging. The latest images will then follow a distribution different from that of the training data. Unlike the above two cases, where we must decide for a single input, here we can use multiple inputs; we should do so, especially when the quality of the input images deteriorates gradually over time. The problem has three differences from the above two scenarios. First, the input is a valid sample belonging to a known class, neither an irrelevant sample nor a novel class sample. Second, we are primarily interested in the accuracy of our CNN on the latest input(s), not in whether x ∼ p(x) or p′(x). Third, as mentioned above, we can use multiple inputs {x_i}_{i=1,...,n} for the judgment. A few additional remarks on this scenario are in order. Assuming a temporal sequence of inputs, the distributional shift is also called concept drift (Gama et al., 2014). Concept drift includes several different subproblems, and the one considered here is called virtual concept drift in that terminology. Mathematically, concept drift occurs when p(x, y) changes with time; it is called virtual when p(x) changes while p(y|x) does not.
Intuitively, this is the case where the classes (i.e., the concept) remain the same but p(x) changes, requiring the classifier to deal with inputs drawn from p′(x). We are then usually interested in predicting whether x lies in a region of the data space for which our classifier is well trained and can classify it correctly. If not, we might want to retrain the classifier using additional data or invoke unsupervised domain adaptation methods (Ganin & Lempitsky, 2015; Tzeng et al., 2017).

2.2. COMPARED METHODS

We select five representative OOD detection methods that do not use real OOD samples of the kind encountered at test time.

Baseline: Max-softmax

Hendrycks & Gimpel (2017) showed that the maximum of the softmax outputs, or the confidence, can be used to detect OOD inputs. We use it as the score of an input being in-distribution (ID) and refer to this method as Baseline. It is well known that the confidence can be calibrated with a temperature to better reflect classification accuracy (Guo et al., 2017; Li & Hoiem, 2020). We also evaluate this calibrated confidence, referred to as Calib.
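The Baseline and Calib. scores above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the temperature value used for Calib. is a placeholder, as the paper does not state the fitted value.

```python
# Max-softmax ID score (Baseline) and its temperature-calibrated
# variant (Calib.): a higher score means "more in-distribution".
import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """ID score per input: maximum of the (temperature-scaled) softmax."""
    probs = F.softmax(logits / temperature, dim=1)
    return probs.max(dim=1).values

logits = torch.tensor([[2.0, 0.5, -1.0],   # confident prediction -> high ID score
                       [0.3, 0.2, 0.1]])   # flat prediction -> low ID score
baseline = max_softmax_score(logits)            # T = 1: Baseline
calib = max_softmax_score(logits, temperature=2.0)  # placeholder T: Calib.
```

In practice the temperature for Calib. is fitted on held-out ID validation data (Guo et al., 2017), not chosen by hand as in this sketch.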

MC Dropout

The confidence (i.e., the max-softmax) can also be regarded as a measure of the uncertainty of a prediction, but it captures only aleatoric uncertainty (Hüllermeier & Waegeman, 2019). Bayesian neural networks (BNNs) can also take epistemic uncertainty into account, which is theoretically more relevant to OOD detection. MC (Monte-Carlo) dropout (Gal & Ghahramani, 2016) is an approximation of BNNs that is computationally more efficient than an ensemble of networks (Lakshminarayanan et al., 2017). To be specific, applying dropout (Srivastava et al., 2014) at test time provides multiple prediction samples, from which the average of the max-softmax values is calculated and used as the ID score.

Cosine Similarity

It was recently shown in Techapanurak et al. (2019); Hsu et al. (2020) that using scaled cosine similarities in the last layer of a CNN, similar to the angular softmax used for metric learning, enables accurate OOD detection. To be specific, the method first computes the cosine similarities between the final-layer feature vector and the class centers (or equivalently, the normalized weight vectors of the classes). They are multiplied by a scale and then normalized by softmax to obtain class scores. The scale, which is the inverse temperature, is predicted from the same feature vector. These computations are performed by a single layer replacing the last layer of a standard CNN. The maximum of the cosine similarities (without the scale) gives the ID score. The method is free of hyperparameters for OOD detection. We refer to it as Cosine.

ODIN (with OOD-sample-free Extension)

ODIN was proposed by Liang et al. (2017) to improve Baseline by perturbing an input as x → x + ε · sgn(δx), where δx is the direction that maximally increases the max-softmax, and by temperature scaling. Thus, there are two hyperparameters, the perturbation size ε and the temperature T. In Liang et al. (2017), they are chosen by assuming the availability of explicit OOD samples. Recently, Hsu et al. (2020) proposed to select ε = argmax_ε Σ y_κ(x + ε · sgn(δx)), where y_κ is the max-softmax and the summation is taken over the ID samples in the validation set. As for the temperature, they set T = 1000. The ID score is given by y_κ(x + ε · sgn(δx)). To distinguish it from the original ODIN, we refer to this variant as ODIN*.
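The scaled-cosine final layer used by Cosine can be sketched as follows. This is our own minimal illustration under stated assumptions: the class name `CosineHead` and the single-linear-unit scale predictor are ours, not from the cited papers, which may parameterize the scale differently.

```python
# Scaled-cosine classification head: class scores are softmax over
# (scale * cosine similarity to class weight vectors); the ID score
# is the maximum cosine similarity WITHOUT the scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        # The scale (inverse temperature) is predicted from the feature
        # vector itself; softplus keeps it positive.
        self.scale_net = nn.Linear(feat_dim, 1)

    def forward(self, feat: torch.Tensor):
        cos = F.linear(F.normalize(feat, dim=1), F.normalize(self.weight, dim=1))
        scale = F.softplus(self.scale_net(feat))   # one positive scalar per input
        logits = scale * cos                       # fed to softmax + cross-entropy
        id_score = cos.max(dim=1).values           # ID score: max cosine, no scale
        return logits, id_score

head = CosineHead(feat_dim=8, num_classes=3)
logits, score = head(torch.randn(4, 8))
```

This head replaces the standard final linear layer, and the rest of the network is trained as usual with the softmax cross-entropy loss.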

Mahalanobis Detector

The above three methods are based on the confidence. Another approach is to formulate the problem as unsupervised anomaly detection. Lee et al. (2018) proposed to model the distribution of an intermediate layer's activations with a Gaussian distribution for each class, with a covariance matrix shared among the classes. Given an input, the Mahalanobis distance with respect to the predicted class is calculated at each layer. An OOD score is given by the weighted sum of the distances calculated at the different layers. The weights are predicted by logistic regression, which is fitted by assuming the availability of OOD samples. To be free of this assumption, another method is suggested that generates adversarial examples from ID samples and regards them as OOD samples. It is also reported in (Hsu et al., 2020) that setting all the weights to one works reasonably well. We evaluate the last two variants, which do not need OOD samples. Although the original method optionally uses input perturbation similar to ODIN, we do not use it, because our experiments show that its improvement is very small despite its high computational cost.

Effects of Fine-tuning a Pre-trained Network

It is well known that fine-tuning a pre-trained network on a downstream task improves its prediction accuracy, especially when only a small amount of training data is available. It was pointed out in (He et al., 2019) that the improvement is small when there is sufficient training data. Hendrycks et al. (2019) then showed that even in that case, using a pre-trained network increases the overall robustness of the inference, including improved OOD detection performance, robustness to adversarial attacks, better confidence calibration, and robustness to covariate shift. However, their experimental validation was performed only on a single configuration with a few datasets. It remains unclear whether the improvement generalizes to a broader range of purposes and settings that may differ in image size, the number of training samples, and ID/OOD combinations.
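The class-conditional Gaussian scoring at the heart of the Mahalanobis detector described above can be sketched as follows. This is a simplified single-layer illustration; the per-layer weighting and the optional input perturbation from Lee et al. (2018) are omitted, and the helper names are ours.

```python
# Mahalanobis ID score: fit one Gaussian per class with a shared
# ("tied") covariance matrix on ID features, then score a new feature
# by the negative squared Mahalanobis distance to the closest class mean.
import numpy as np

def fit_gaussians(feats, labels, num_classes):
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]              # residuals w.r.t. class means
    cov = centered.T @ centered / len(feats)      # shared covariance estimate
    prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return means, prec

def mahalanobis_id_score(x, means, prec):
    diffs = means - x                             # (C, d)
    d2 = np.einsum('cd,de,ce->c', diffs, prec, diffs)
    return -d2.min()                              # higher = more in-distribution

# Toy ID data: two well-separated classes in 4-D feature space.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
means, prec = fit_gaussians(feats, labels, 2)
```

An ID feature close to a class mean receives a high score, while a feature far from all class means receives a very negative one.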

3. EXPERIMENTAL RESULTS

We use ResNet-50 (He et al., 2016) as the base network. Baseline, ODIN*, and Mahalanobis share the same network with the same weights, which will be referred to as Standard. For MC dropout, we apply dropout to the last fully-connected layer with p = 0.5 and draw ten samples. For Cosine, we modify the last layer and the loss function following Techapanurak et al. (2019). We use the ImageNet pre-trained models provided by the Torchvision library. Following previous studies, we employ AUROC to evaluate OOD detection performance in the first two scenarios. We use five datasets, which will be referred to as Dog, Plant, Food, Bird, and Car. They are diverse in image contents, the number of classes, difficulty of the tasks (e.g., fine-grained/coarse-grained), etc. Choosing one of the five as ID and training a network on it, we regard each of the other four as OOD, measuring the OOD detection performance of each method on the 5 × 4 ID-OOD combinations. We train each network three times to measure the average and standard deviation for each configuration. Table 1 shows the accuracy on the five datasets/tasks for the three networks (i.e., Standard, MC dropout, and Cosine), trained from scratch and fine-tuned from a pre-trained model, respectively. There is a large gap between training from scratch and fine-tuning a pre-trained model for the datasets with fewer training samples. Full per-pair results are given in Tables 5 and 6 in Appendix A. The upper row of Fig. 2 shows the results with the networks trained from scratch. The ranking of the compared methods is mostly similar across the ID datasets.
Across the five datasets, Cosine is consistently in the top group; Mahalanobis ranks next, as it performs only moderately well for Dog and Food. For the tasks with low classification accuracy (Dog, Bird, and Car, as shown in Table 1), the OOD detection accuracy also tends to be low; however, there is no consistent relation between the ranking of the OOD detection methods and the ID classification accuracy.
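The AUROC used as the evaluation metric above has a simple probabilistic reading: it is the probability that a randomly chosen ID sample receives a higher ID score than a randomly chosen OOD sample. A minimal NumPy sketch (equivalent to the usual threshold-sweep definition, with ties counted as 0.5):

```python
# AUROC for OOD detection: rank every ID score against every OOD score.
import numpy as np

def auroc(id_scores, ood_scores):
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = id_scores[:, None] - ood_scores[None, :]   # all ID-vs-OOD pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

perfect = auroc([0.9, 0.8, 0.7], [0.6, 0.5])   # full separation -> 1.0
chance = auroc([0.9, 0.4], [0.6, 0.5])         # half the pairs ordered correctly
```

For large sample sets, a library routine such as sklearn's `roc_auc_score` computes the same quantity more efficiently.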

3.1. DETECTION OF IRRELEVANT INPUTS

The lower row of Fig. 2 shows the results with the fine-tuned networks. It is first observed, for every dataset and method, that the OOD detection accuracy is significantly higher than with the networks trained from scratch. This reinforces the argument of Hendrycks et al. (2019) that the use of pre-trained networks improves OOD detection performance. Furthermore, the performance increase is much larger in several cases than reported in their experiments, which use CIFAR-10/100 and Tiny ImageNet (Deng et al., 2009). The detection accuracy is pushed to a near-maximum in each case; thus, there is only a small difference among the methods. Cosine and Mahalanobis (sum) show slightly better performance for some datasets.

3.2. DETECTION OF NOVEL CLASSES

We conducted two experiments with different datasets. The first uses the Oxford-IIIT Pet dataset (Parkhi et al., 2012), consisting of 25 dog breeds and 12 cat breeds. We use only the dog breeds and split them into 20 and 5 breeds. We then train each network on the first 20 dog breeds using the standard train/test splits per class. The remaining five breeds (i.e., Scottish Terrier, Shiba Inu, Staffordshire Bull Terrier, Wheaten Terrier, and Yorkshire Terrier) are treated as OOD. It should be noted that the ImageNet dataset contains 118 dog breeds, some of which overlap with them; we intentionally keep this overlap to simulate a situation that could occur in practice. In the second experiment, we use the Food-101 dataset. We remove the eight classes contained in the ImageNet dataset and split the remaining 93 classes into 46 and 47 classes, called Food-A and Food-B, respectively. Each network is trained on Food-A, which we split into 800/100/100 samples per class to form train/val/test sets. Treating Food-B as OOD, we evaluate the methods' performance. Table 2 shows the methods' performance at detecting OOD (i.e., novel class) samples. In the table, we separate the Mahalanobis detector from the others; the latter are all based on the confidence or a variant of it, whereas Mahalanobis is not. The ranking of the methods is similar between the two experiments. Cosine attains the top performance for both training methods. While this is similar to the results of irrelevant input detection (Fig. 2), the gap to the second-best group (Baseline, Calib., and MC dropout) is much larger here, especially for training from scratch. Another difference is that neither variant of Mahalanobis performs well; both are even worse than Baseline. This is likely attributable to the similarity between the ID and OOD samples in this scenario. The classification accuracy on the original tasks, Dog and Food-A, is given in Table 7 in Appendix B.

3.3.1. PROBLEM FORMULATION

Given a network trained on a dataset D_s, we wish to estimate its classification error on a different dataset D_t. In practice, a meta-system monitoring the network estimates the classification error on each of the incoming datasets D_t^(1), D_t^(2), ..., which are chosen from the incoming data stream. It issues an alert if the predicted error for the latest D_t^(T) is higher than a pre-fixed target. We use an OOD score S for this purpose. To be specific, given D_t = {x_i}_{i=1,...,n}, we calculate the average score S = (1/n) Σ_{i=1}^n S_i, where S_i is the OOD score for x_i; note that an OOD score is simply the negative of an ID score. We want to use S to predict the classification error err = (1/n) Σ_{i=1}^n 1(y_i ≠ t_i), where y_i and t_i are the prediction and the true label, respectively. Following Elsahar & Gallé (2019), we train a regressor f to do this, as err ≈ f(S). We assume multiple labeled datasets D_o's are available, none of which shares inputs with D_s or D_t. Choosing a two-layer MLP for f, we train it on the D_o's plus D_s. As they have labels, we can obtain the pair of err and S for each of them. Note that D_t does not have labels. It is reported in Elsahar & Gallé (2019) that Proxy-A Distance (PAD) (Ben-David et al., 2007) performs well on several NLP tasks; thus, we also test this method (rigorously, the one called PAD* in their paper) for comparison. It first trains a binary classifier on portions of D_s and D_t to distinguish the two. The classifier's accuracy is then evaluated on held-out samples of D_s and D_t and used as a metric of the distance between their underlying distributions. Intuitively, the classification is easy when the distance is large, and vice versa. We use 1 − (mean absolute error) for S, as in the previous work.
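The meta-approach above reduces to two steps: average the per-input OOD scores over a batch to get S, then regress err on S. A minimal sketch under stated assumptions: the calibration data is hypothetical, and a 1-D polynomial least-squares fit stands in for the paper's two-layer MLP f.

```python
# Domain shift detection as error-prediction: S = mean OOD score of a
# batch; f maps S to a predicted classification error.
import numpy as np

def batch_ood_score(id_scores):
    """S for a batch: mean of per-input OOD scores (negative ID scores)."""
    return float(np.mean(-np.asarray(id_scores)))

# Hypothetical (S, err) pairs measured on labeled sets D_o (placeholders).
S_train = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
err_train = np.array([0.05, 0.12, 0.25, 0.40, 0.60])
f = np.polynomial.Polynomial.fit(S_train, err_train, deg=2)  # stand-in for the MLP

# Unlabeled incoming batch D_t: ID scores of its inputs.
S_new = batch_ood_score(id_scores=[-0.55, -0.65, -0.60])
predicted_err = float(f(S_new))   # alert if above the pre-fixed error target
```

The meta-system would compare `predicted_err` against the target error and raise an alert (e.g., trigger re-training) when it is exceeded.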

3.3.2. DOMAIN SHIFT BY IMAGE CORRUPTION

We first consider the case where the shift is caused by deterioration of image quality. An example is a surveillance camera deployed in an outdoor environment. Its images are initially of high quality, but later their quality deteriorates, gradually or suddenly, for some reason, e.g., dirt on the lens, failure of focus adjustment, or seasonal/climate changes. We want to detect this when it affects classification accuracy. We consider two classification datasets/tasks, Food-A (i.e., the 46 classes selected from Food-101 as explained in Sec. 3.2) and ImageNet (the original 1,000-class object classification). For Food-A, we first train each network on the training split, which consists only of the original images. We divide the test split into three sets of 1,533, 1,533, and 1,534 images, respectively. The first is used for D_s as is (i.e., without corruption). We apply image corruption to the second and third sets. To be specific, splitting the 19 corruption types into 6 and 13, we apply the 6 corruptions to the second set to make the D_o's and the 13 corruptions to the third to make the D_t's. As each corruption has five severity levels, there are 30 (= 6 × 5) D_o's and 65 (= 13 × 5) D_t's. The former are used for training f (precisely, 20 for training and 10 for validation), and the latter for evaluating f. For ImageNet, we choose 5,000, 2,000, and 5,000 images from the validation split without overlap and use them to make D_s, the D_o's, and the D_t's, respectively. As with Food-A, we apply the 6 and 13 types of corruption to the second and third sets, making 30 D_o's and 65 D_t's, respectively. For the evaluation of f, we calculate the mean absolute error (MAE) and root mean squared error (RMSE) of the predicted err over the 65 D_t's. We repeat this 20 times with different splits of the corruption types (19 → 6 + 13), reporting the mean and standard deviation. Table 3 shows the results for Food-A and ImageNet.
(The accuracies on the original classification tasks of Food-A and ImageNet are reported in Tables 7 and 10 in Appendices B and C.) For both datasets, Cosine achieves top-level accuracy irrespective of the training method. For Food-A, using a pre-trained network boosts the performance of the confidence-based methods (i.e., from Baseline to ODIN*), with the result that MC dropout performs best; Cosine attains almost the same accuracy. On the other hand, Mahalanobis and PAD do not perform well regardless of the dataset and training method. This clearly demonstrates the difference between detecting the distributional shift p(x) → p′(x) and detecting the deterioration of classification accuracy. We show scatter plots of S vs. err in Figs. 4 and 5 in Appendix C, which provide a similar, or even clearer, picture.
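The MAE and RMSE used to evaluate f over the held-out D_t's can be computed as follows; a minimal sketch with placeholder error values, not the paper's measured numbers.

```python
# Evaluation of the error-prediction regressor f: MAE and RMSE between
# predicted and true classification errors over the D_t's.
import numpy as np

def mae_rmse(pred_err, true_err):
    pred_err = np.asarray(pred_err, dtype=float)
    true_err = np.asarray(true_err, dtype=float)
    mae = np.abs(pred_err - true_err).mean()
    rmse = np.sqrt(((pred_err - true_err) ** 2).mean())
    return mae, rmse

# Placeholder values standing in for (predicted, true) errors on three D_t's.
mae, rmse = mae_rmse([0.20, 0.35, 0.50], [0.18, 0.40, 0.45])
```

Note that RMSE ≥ MAE always holds, with the gap growing when a few D_t's have unusually large prediction errors.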

3.3.3. OFFICE-31

To study another type of domain shift, we employ the Office-31 dataset (Saenko et al., 2010), which is popular in the study of domain adaptation. The dataset consists of three subsets, Amazon, DSLR, and Webcam, which share the same 31 classes but are collected from different domains. We train our CNNs on Amazon and evaluate the compared methods in terms of their accuracy at predicting the classification errors for samples in DSLR and Webcam. The classification accuracy of the CNNs on Amazon is provided in Table 11 in Appendix D. To obtain D_o's for training f, we employ the same image corruption methods as in Sec. 3.3.2; we apply them to Amazon samples to create virtual domain-shifted samples. Whether these samples model the truly shifted data, i.e., DSLR and Webcam, is unknown a priori and needs to be validated experimentally; if it works, it will be practically useful. Specifically, we split the test split of Amazon, containing 754 images, evenly into two sets. We use one for D_s and the other for creating the D_o's. We apply all the corruption types, yielding 95 (= 19 × 5) D_o's. We then split them into those generated by four corruptions and those generated by the rest; the latter are used for training f, and the former for validation. We iterate this 20 times with different random splits of the corruption types, reporting the average over 20 × 3 trials, as there are three CNN models trained from different initial weights. To evaluate each method (i.e., f based on an OOD score), we split DSLR and Webcam into subsets containing 50 samples each, yielding 18 D_t's in total. We apply f to each of them, reporting the average error of the predicted classification errors. Table 4 shows the results. Cosine works well with both training methods. The two variants of Mahalanobis show good performance when using a pre-trained model, but this may be a coincidence, as explained below.
Figure 3 shows scatter plots of the OOD score vs. the classification error for each method. The green dots indicate the D_o's, i.e., corrupted Amazon images used for training f, and the blue dots indicate the D_t's, i.e., subsets from DSLR and Webcam containing 50 samples each. For a method whose green dots distribute with a narrower spread, the regressor f will yield more accurate predictions. Fig. 3 should thus be read together with Table 4 when judging how reliable each method's error prediction is.

3.4. ANALYSES OF THE RESULTS

Our findings can be summarized as follows. i) Using a pre-trained network improves performance in all the scenarios, confirming the report of Hendrycks et al. (2019). ii) The detector using the cosine similarity consistently works well throughout the three scenarios. It should be the first choice if it is acceptable to modify the network's final layer. iii) The Mahalanobis detector, a state-of-the-art method, works well only for irrelevant input detection. This does not contradict previous reports, since they consider only this scenario. The method fits a Gaussian distribution to the ID samples of each class, with a covariance matrix shared by all the classes. This strategy works well in easy cases, where incoming OOD samples are mapped far from the Gaussian distributions. However, such simple modeling does not work in more challenging cases; for instance, incoming OOD samples can be mapped near the ID distributions, as in novel class detection. In such cases, the ID distribution needs to be modeled very precisely, for which the assumption of Gaussian distributions with a single covariance matrix is inadequate. iv) Despite its name, domain shift detection requires detecting the deterioration of classification accuracy, not a distributional shift of the inputs. This theoretically favors the confidence-based methods; indeed, they (particularly MC dropout) work well when used with a pre-trained network. The Mahalanobis detector, in contrast, is closer to an anomaly detection method, although its similarity to a softmax classifier is suggested in (Lee et al., 2018); an input that the network classifies correctly may still be detected as an 'anomaly' by the Mahalanobis detector.

4. RELATED WORK

Many studies of OOD detection have been conducted so far, most of which propose new methods; those not mentioned above include (Vyas et al., 2018; Yu & Aizawa, 2019; Sastry & Oore, 2020; Zisselman & Tamar, 2020; Yu et al., 2020). An experimental evaluation similar to ours, but on the estimation of prediction uncertainty, is provided in (Ovadia et al., 2019). In (Hsu et al., 2020), the authors present a scheme for conceptually classifying domain shifts along two axes, semantic shift and non-semantic shift. Semantic shift (S) represents OOD samples coming from the distribution of an unseen class, and non-semantic shift (NS) represents OOD samples coming from an unseen domain. Through experiments using the DomainNet dataset (Peng et al., 2019), they conclude that OOD detection is more difficult in the order S > NS > S+NS. In this study, we instead classify the problems into three types from an application perspective. One might view this as somewhat arbitrary and vague; unfortunately, Hsu et al.'s scheme does not help here. For instance, according to their scheme, novel class detection is S and domain shift is NS, but it is unclear whether irrelevant input detection should be classified as S or S+NS. Moreover, their conclusion (i.e., S > NS > S+NS) does not hold for our results; the difficulty depends on the closeness between classes and between domains. After all, we think that only applications can determine what constitutes a domain and what constitutes classes. Further discussion is left for a future study. As mentioned earlier, the detection of domain shift in the context of deep learning has not been well studied in the community. The authors are not aware of any study targeting image classification, and find only a few studies, e.g., Elsahar & Gallé (2019), even in other fields.
On the other hand, there is a large body of work on domain adaptation (DA); see (Ganin & Lempitsky, 2015; Tzeng et al., 2017; Zhang et al., 2017; Toldo et al., 2020; Zou et al., 2019), to name a few. DA aims to make a model that has learned a task on data from one domain work on data from a different domain. Researchers have studied several problem settings, e.g., closed-set, partial, open-set, and boundless DA (Toldo et al., 2020). However, these studies all assume that the source and target domains are known in advance; none considers the case where the domain of the incoming inputs is unidentified. Thus, they do not provide a hint as to how to detect domain shift.

5. SUMMARY AND CONCLUSION

In this paper, we first classified OOD detection into three scenarios from an application perspective: irrelevant input detection, novel class detection, and domain shift detection. We presented a meta-approach to domain shift detection, a problem that has been poorly studied in the community, which can employ any OOD detection method as its component. We then experimentally evaluated various OOD detection methods in these scenarios. The results show the effectiveness of the above approach to domain shift detection, as well as several findings, such as which method works in which scenario.

A DETECTION OF IRRELEVANT INPUTS

A.1 ADDITIONAL RESULTS

In our experiment for irrelevant input detection, using five datasets, we consider every pair of them, one for ID and the other for OOD. In the main paper, we reported only the average detection accuracy over the four such pairs for each ID dataset. We report here the results for all the ID-OOD pairs. Tables 5 and 6 show the performance of the compared methods for training from scratch and for fine-tuning a pre-trained network, respectively. [Tables 5 and 6: AUROC for every ID-OOD pair (rows and columns D1-D5), for training from scratch and for fine-tuning, respectively.]

B NOVEL CLASS DETECTION

B.1 CLASSIFICATION ACCURACY

In our experiments for novel class detection, we employ two datasets, Dog and Food-A. Table 7 shows the classification accuracy for each of them. For Dog, using a pre-trained model boosts the accuracy. There is a tendency similar to that seen in Table 1, namely that Cosine outperforms the others when training from scratch. For Food-A, using a pre-trained model brings only a modest improvement, owing to the availability of a sufficient number of training samples.

B.2 ADDITIONAL RESULTS

In one of the experiments explained in Sec. 3.2, we use only the dog classes from the Oxford-IIIT Pet dataset. We show here additional results obtained when using the cat classes. Choosing nine of the 12 cat breeds contained in the dataset, we train the networks to classify these nine breeds and test novel class detection on the remaining three breeds. In another experiment, we use Food-A for ID and Food-B for OOD; we report here the results for the reverse configuration. Table 8 shows the classification accuracy of the new tasks, and Table 9 shows the performance of the compared methods on novel class detection. Observations similar to those in the experiments of Sec. 3.2 can be made.

D.2 ADDITIONAL RESULTS

As with the experiments on image corruption, we evaluate how accurately the compared methods can predict the classification error on incoming datasets D_t. Table 4 and Fig. 3 show the error of the predicted classification accuracy and the scatter plots of the OOD score versus the true classification accuracy, where the D_t's are created by splitting DSLR and Webcam into sets containing 50 samples each. We show here additional results obtained for D_t's created differently. Table 12 and Fig. 6 show the prediction errors and the scatter plots for D_t's containing 30 samples; Table 13 and Fig. 7 show those for D_t's of 100 samples; Table 14 and Fig. 8 show those using the entire DSLR and Webcam sets as D_t's (thus there are only two D_t's). The standard deviations are computed over 20 × 3 trials (20 random splittings of corruption types into train/val and 3 network models trained from random initial weights), as explained in Sec. 3.3.3.
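The prediction step described above, which maps a dataset-level OOD score to an expected classification error via the regressor f, can be sketched as follows. This is a minimal sketch assuming a linear f fitted by least squares on the corrupted ID sets (D_o's); the variable names are illustrative and the paper's actual regressor may differ.

```python
import numpy as np

def fit_error_regressor(scores, errors):
    """Fit a linear regressor f mapping a dataset-level OOD score S
    to classification error, via least squares over the D_o's."""
    A = np.stack([scores, np.ones_like(scores)], axis=1)
    coef, _, _, _ = np.linalg.lstsq(A, errors, rcond=None)
    return coef  # (slope, intercept)

def predict_error(coef, score):
    """Predict the classification error of an incoming dataset D_t
    from its OOD score."""
    slope, intercept = coef
    return slope * score + intercept

# Toy example: mean OOD score and true error of five corrupted ID sets.
scores = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
errors = np.array([0.05, 0.15, 0.25, 0.35, 0.45])
coef = fit_error_regressor(scores, errors)
```

A narrower spread of points around the fitted line (as in the scatter plots of Figs. 3 and 4) corresponds to a more accurate error prediction.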

E EFFECTIVENESS OF ENSEMBLES

An ensemble of multiple models is known to perform better than the MC dropout we considered in the main experiments for tasks such as uncertainty estimation. It is also known to be a better approximation of Bayesian neural networks. We therefore experimentally evaluate ensembles. We consider an ensemble of five models and train each model in two ways, i.e., "from-scratch" and "fine-tuning." For the former, we randomly initialize all the weights of each model. For the latter, we initialize the last layer randomly and the other layers with the pre-trained model's weights. We evaluate ensembles for Baseline and Cosine. Tables 15, 16, 17, and 18 show the results for the three scenarios. In the tables, "(con.)" means that confidence is used as an ID score, or equivalently, that negative confidence is used as an OOD score; "(en.)" means that entropy is used as an OOD score. We can observe the following from the tables:

• An ensemble of models performs better than a single model. This is always true for Baseline. The same is true for Cosine except for domain shift detection (the reason is not clear).
• An ensemble of Baseline models still performs worse than a single Cosine model in most cases. It sometimes shows better performance for fine-tuned models, but the margin is small.
• Using entropy as the OOD score tends to perform slightly better than using confidence.

We conclude that Cosine's superiority remains true even when ensembles are taken into consideration.

Table 15: Irrelevant input detection performance of the ensemble models.

Method          | From-scratch | Fine-tune
Baseline (con.) | 61.4(12.1)   | 97.7(3.1)
Ensemble (con.) | 67.8(13.7)   | 98.3(2.2)
Baseline (en.)  | 64.8(13.7)   | 99.2(0.9)
Ensemble (en.)  | 73.4(15.3)   | 99.5(0.5)
Cosine          | 83.9(11.4)   | 99.0(0.7)
Ensemble Cosine | 85.7(12.9)   | 99.1(0.7)

F.1 TRAINING DETAILS

As mentioned in the main paper, we employ ResNet-50 in all the experiments. For optimization, we use SGD with the momentum set to 0.9 and the weight decay set to 10^-4.
The learning rate starts at 0.1 and is then divided by 10 based on the performance on the validation dataset. To fine-tune a pre-trained network, we use a learning rate of 0.001 for the standard network and the one with MC dropout. For the network used with Cosine, we apply a learning rate of 0.001 to the backbone and a higher learning rate of 0.1 to the fully-connected layer; the weight decay for the fully-connected layer is set to 0, following Techapanurak et al. (2019) and Hsu et al. (2020).
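The "(con.)" and "(en.)" scores used for the ensembles in Appendix E can be sketched as follows. This is a minimal NumPy sketch assuming the members' softmax outputs are already available; the array names are illustrative.

```python
import numpy as np

def ensemble_ood_scores(member_probs):
    """Given softmax outputs of M ensemble members over N inputs and
    C classes, shape (M, N, C), return the two OOD scores per input:
    '(con.)' = negative confidence of the averaged predictive
    distribution; '(en.)' = its entropy."""
    p = member_probs.mean(axis=0)                      # average over members
    conf_score = -p.max(axis=1)                        # negative confidence
    ent_score = -(p * np.log(p + 1e-12)).sum(axis=1)   # predictive entropy
    return conf_score, ent_score

# Toy example: two members, two inputs (one peaked, one uniform), 3 classes.
probs = np.array([
    [[0.9, 0.05, 0.05], [1/3, 1/3, 1/3]],
    [[0.8, 0.10, 0.10], [1/3, 1/3, 1/3]],
])
conf, ent = ensemble_ood_scores(probs)
```

For both scores, the uniform (uncertain) input receives the higher OOD score, which is the behavior reported in the tables where entropy tends to perform slightly better than confidence.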

F.2 DATASETS

Table 19 shows the specifications of the datasets used in our experiments. Note that we modify some of the datasets and use them in several experiments. In the experiments of domain shift detection, we employ image corruption to simulate domain shift. Examples of the corrupted images are shown in Fig. 9.
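As a rough illustration of severity-controlled corruption, a single Gaussian-noise corruption might look like the sketch below. This is a stand-in for one corruption type only, not the actual Hendrycks & Dietterich (2019) code; the per-severity noise scales here are illustrative, not the benchmark's parameters.

```python
import numpy as np

# Illustrative noise scales indexed by severity level 1-5.
NOISE_SCALE = {1: 0.04, 2: 0.06, 3: 0.08, 4: 0.09, 5: 0.10}

def gaussian_noise(image, severity=1, rng=None):
    """Apply Gaussian-noise corruption to an image with values in [0, 1]."""
    rng = np.random.default_rng(rng)
    noisy = image + rng.normal(0.0, NOISE_SCALE[severity], image.shape)
    return np.clip(noisy, 0.0, 1.0)

# Corrupt a uniform gray image at severity 3.
img = np.full((8, 8, 3), 0.5)
out = gaussian_noise(img, severity=3, rng=0)
```

The actual benchmark provides 19 such corruption types (noise, blur, weather, digital), each at five severity levels.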



Apple Pie, Breakfast Burrito, Chocolate Mousse, Guacamole, Hamburger, Hot Dog, Ice Cream, Pizza



Figure 1: Example images for the five datasets. We employ the following five tasks and datasets: dog breed recognition (120 classes and 10,222 images; Khosla et al. (2011)), plant seedling classification (12 classes and 5,544 images; Giselsson et al. (2017)), Food-101 (101 classes and 101,000 images; Bossard et al. (2014)), CUB-200 (200 classes and 11,788 images; Welinder et al. (2010)), and Stanford Cars (196 classes and 16,185 images; Krause et al. (2013)). These datasets will be referred to as Dog, Plant, Food, Bird, and Cars. They are diverse in terms of image contents, the number of classes, difficulty of tasks (e.g., fine-grained/coarse-grained), etc. Choosing one of the five as ID and training a network on it, we regard each of the other four as OOD, measuring the OOD detection performance of each method on the 5 × 4 ID-OOD combinations. We train each network three times to measure the average and standard deviation for each configuration. Table 1 shows the accuracy on the five datasets/tasks for the three networks (i.e., Standard, MC dropout, and Cosine), trained from scratch and fine-tuned from a pre-trained model, respectively. There is a large gap between training from scratch and fine-tuning a pre-trained model for the datasets with fewer training samples.
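The detection performance in this evaluation is measured by AUROC, which can be computed directly from the OOD scores assigned to ID and OOD samples. A minimal sketch, using the rank-statistic form of AUROC (the function name is illustrative):

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """AUROC of OOD detection: the probability that a randomly drawn
    OOD sample receives a higher OOD score than a randomly drawn ID
    sample, with ties counted as 0.5 (equivalent to the ROC area)."""
    s_id = np.asarray(scores_id, dtype=float)[:, None]
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]
    greater = (s_ood > s_id).mean()   # fraction of correctly ordered pairs
    ties = (s_ood == s_id).mean()     # tied pairs contribute half
    return greater + 0.5 * ties
```

Perfect separation of the two score distributions gives 1.0, and an uninformative detector gives 0.5, matching the scale of the bars in Fig. 2.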

Figure 2 shows the average AUROC of the compared OOD detection methods for each ID dataset, over the four OOD datasets and three trials each. The error bars indicate the minimum and maximum values.

Figure 2: OOD detection performance of the compared methods. 'Dog' indicates their performance when Dog is ID and all the other four datasets are OOD, etc. Each bar shows the average AUROC of a method, and the error bar indicates its minimum and maximum values. Upper: The networks are trained from scratch. Lower: Pre-trained models are fine-tuned.

Figure 3: OOD score vs. the true classification error for 95 (= 19 × 5) D_o's (corrupted Amazon images used for training the regressor f; in green), 18 D_t's (subsets of DSLR and Webcam containing 50 samples each; in blue), and D_s (original Amazon images; in red).

Figure 4: OOD score vs. classification error for 95 (= 19 × 5) datasets, i.e., D_o's and D_t's (corrupted Food-A images).

Figure 9: Examples of the corrupted images obtained by applying different types of image corruption (Hendrycks & Dietterich, 2019).

Classification accuracy (mean and standard deviation in parentheses) of the three networks on the five datasets/tasks.

Novel class detection performance of the compared methods measured by AUROC.

Errors of the predicted classification error by the compared methods.

To simulate multiple types of image deterioration, we employ the method and code for generating image corruptions developed by Hendrycks & Dietterich (2019). It can generate 19 types of image corruption, each of which has five levels of severity.

Errors of the predicted classification error by the compared methods on 50-sample subsets of DSLR and Webcam. The CNN is trained on Amazon and the regressor f is trained using corrupted images of Amazon.

The OOD detection performance (AUROC) for networks trained from scratch. D1=Dog, D2=Plant, D3=Food, D4=Bird, and D5=Cars.

Classification accuracy for the two tasks, Dog (classification of 20 dog breeds) and Food-A (classification of 46 food classes), for which novel class detection is examined.

Classification accuracy for Cat (classification of 9 cat breeds) and Food-B (classification of 47 food classes).

Novel class detection performance (AUROC) of the compared methods. The OOD samples for Cat and Food-B are the three held-out cat breeds and Food-A, respectively.

D.1 CLASSIFICATION ACCURACY ON IMAGENET

Table 10 shows the accuracy of the three networks used by the compared OOD detection methods for 1,000-class classification on the ImageNet dataset. We use center-cropping at test time. The Cosine network shows lower classification accuracy here.

Classification accuracy on ImageNet for each network.

SCATTER PLOTS OF OOD SCORE VS. CLASSIFICATION ERROR

In Sec. 3.3.2, we showed experimental results of domain shift detection using Food-A. Given a set D_t of samples, each of the compared methods calculates an OOD score S for it, from which the average classification error err over samples from D_t is predicted. Figure 4 shows scatter plots of the relation between the OOD score S and the true classification error for a number of such datasets (i.e., D_t's). We have 95 (= 19 × 5) such datasets, each containing images undergoing one of the combinations of 19 image corruptions and 5 severity levels. A method with a narrower spread of dots provides a more accurate estimation. These scatter plots depict well which methods work and which do not, agreeing with Table 3. The same holds true for the plots for ImageNet shown in

The classification accuracy of the three networks trained on Amazon for Amazon, DSLR, and Webcam.

Errors of the predicted classification error by the compared methods on 30-sample subsets of DSLR and Webcam. The CNN is trained on Amazon and the regressor f is trained using corrupted images of Amazon.

Errors of the predicted classification error by the compared methods on the entire set of DSLR and Webcam. The CNN is trained on Amazon and the regressor f is trained using corrupted images of Amazon.

Novel class detection performance of the ensemble models.

Errors of the predicted classification error by the ensemble models.

Errors of the predicted classification error by the ensemble models on 50-sample subsets of DSLR and Webcam.

Specifications of the datasets used in the experiments.

