PRACTICAL EVALUATION OF OUT-OF-DISTRIBUTION DETECTION METHODS FOR IMAGE CLASSIFICATION

Anonymous

Abstract

We reconsider the evaluation of OOD detection methods for image recognition. Although many studies have been conducted to build better OOD detection methods, most of them follow Hendrycks and Gimpel's work for their experimental evaluation. While a unified evaluation protocol is necessary for fair comparison, it is questionable whether its choice of tasks and datasets reflects real-world applications and whether the evaluation results generalize to other OOD detection scenarios. In this paper, we experimentally evaluate the performance of representative OOD detection methods in three scenarios, i.e., irrelevant input detection, novel class detection, and domain shift detection, on various datasets and classification tasks. The results show that differences in scenarios and datasets alter the relative performance of the methods. Our results can also serve as a guide for practitioners in selecting OOD detection methods.

1. INTRODUCTION

Despite their high performance on various visual recognition tasks, convolutional neural networks (CNNs) often show unpredictable behaviors on out-of-distribution (OOD) inputs, i.e., those sampled from a distribution different from that of the training data. For instance, CNNs often classify irrelevant images into one of the known classes with high confidence. A visual recognition system should therefore be equipped with the ability to detect such OOD inputs upon real-world deployment.

There are many studies of OOD detection, based on diverse motivations and purposes. However, as far as recent studies targeting visual recognition are concerned, most follow the work of Hendrycks & Gimpel (2017), which provides a formal problem statement of OOD detection and an experimental procedure for evaluating methods. Employing this procedure, recent studies focus mainly on increasing detection accuracy, measured on the same datasets. On the one hand, this shared experimental procedure has arguably brought about rapid research progress in a short period. On the other hand, little attention has been paid to how well the procedure models real-world problems and applications, which are diverse in purpose and domain and obviously cannot be covered by a single problem setting with a narrow range of datasets.

In this study, to address this issue, we consider multiple, more realistic scenarios for the application of OOD detection and experimentally compare representative methods. Specifically, we consider three scenarios: detection of irrelevant inputs, detection of novel class inputs, and detection of domain shift. The first two scenarios differ in the closeness between ID samples and OOD samples. Unlike the first two, domain shift detection is not precisely OOD detection.
Nonetheless, it is the same as the other two in that we want to judge whether the model can make a meaningful inference for a novel input. In other words, we can generalize OOD detection to this judgment problem. The three scenarios then naturally fall into the same group of problems, and it becomes natural to consider applying OOD detection methods to the third scenario. It is noteworthy that domain shift detection has received little attention in the community; despite strong demand from practitioners, there is no established method in the context of deep learning for image classification. Based on the above generalization of OOD detection, we propose a meta-approach in which any OOD detection method can be used as a component.

For each of the three scenarios, we compare the following methods: the confidence-based baseline (Hendrycks & Gimpel, 2017), MC dropout (Gal & Ghahramani, 2016), ODIN (Liang et al., 2017), cosine similarity (Techapanurak et al., 2019; Hsu et al., 2020), and the Mahalanobis detector (Lee et al., 2018). Domain shift detection is studied in (Elsahar & Gallé, 2019) for natural language processing tasks, where proxy-A distance (PAD) is reported to perform best; thus we also test it in our experiments. In choosing the compared methods, we follow the argument shared by many recent studies (Shafaei et al., 2019; Techapanurak et al., 2019; Yu & Aizawa, 2019; Yu et al., 2020; Hsu et al., 2020) that OOD detection methods should not assume the availability of explicit OOD samples at training time. Although this may sound obvious given the nature of OOD inputs, some recent methods (e.g., Liang et al. (2017); Lee et al. (2018)) use a certain amount of OOD samples as validation data to determine their hyperparameters.
Recent studies (Shafaei et al., 2019; Techapanurak et al., 2019) show that these methods perform poorly when the OOD inputs encountered at test time are sampled from a distribution different from the assumed one. Thus, for ODIN and the Mahalanobis detector, we employ their variants (Hsu et al., 2020; Lee et al., 2018) that work without OOD samples. The other compared methods do not need OOD samples.

The contributions of this study are summarized as follows. i) Listing three problems that practitioners frequently encounter, we evaluate existing OOD detection methods on each of them. ii) We present a practical approach to domain shift detection that is applicable to CNNs for image classification. iii) We report an experimental evaluation of representative OOD detection methods on these problems, revealing each method's effectiveness and ineffectiveness in each scenario.
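The meta-approach of reusing a per-input OOD score for domain shift detection can be illustrated with a minimal sketch: monitor recent inputs' scores in a sliding window and flag a possible shift when their mean rises. This is an illustrative sketch, not the paper's implementation; the `score_fn` interface, window size, and threshold are all assumptions.

```python
from collections import deque


class DomainShiftMonitor:
    """Sliding-window monitor that flags a possible domain shift when the
    mean per-input OOD score of recent inputs exceeds a threshold.

    ``score_fn`` is any per-input OOD scoring function (confidence-based,
    cosine similarity, Mahalanobis, ...) where larger values mean "more
    OOD"; the window size and threshold are illustrative defaults, not
    values from the paper.
    """

    def __init__(self, score_fn, window=100, threshold=0.5):
        self.score_fn = score_fn
        self.scores = deque(maxlen=window)  # keeps only the last `window` scores
        self.threshold = threshold

    def update(self, x):
        # Score the new input, then report whether the recent window
        # looks shifted on average.
        self.scores.append(self.score_fn(x))
        mean = sum(self.scores) / len(self.scores)
        return mean > self.threshold
```

Because the monitor only assumes a scalar score per input, any of the compared detectors can be plugged in unchanged, which is the point of the generalization.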

2.1. PRACTICAL SCENARIOS OF OOD DETECTION

We consider image recognition tasks in which a CNN classifies a single image x into one of C known classes. The CNN is trained on pairs of x and its label, where x is sampled according to x ∼ p(x). At test time, it encounters an unseen input x, which is usually drawn from p(x) but is sometimes drawn from p′(x), a different, unknown distribution. In this study, we consider the following three scenarios.
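As a concrete reference point for this setting, the confidence-based baseline (Hendrycks & Gimpel, 2017) flags x as OOD when the maximum softmax probability over the C classes falls below a threshold. A minimal NumPy sketch; the logits and the threshold value are illustrative assumptions, not values from our experiments:

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def msp_score(logits):
    # Maximum softmax probability: high for confident (likely ID) inputs.
    return softmax(logits).max(axis=-1)


def is_ood(logits, threshold=0.5):
    # Flag inputs whose confidence falls below the threshold as OOD.
    return msp_score(logits) < threshold


# Example with C = 3 classes: one confident prediction and one
# near-uniform (low-confidence) prediction.
logits = np.array([[8.0, 0.5, 0.2],
                   [0.4, 0.5, 0.3]])
print(is_ood(logits))  # only the second, low-confidence input is flagged
```

The other compared methods replace `msp_score` with a different scalar score (e.g., a cosine similarity or a Mahalanobis distance to class means) while keeping the same thresholding decision rule.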

Detecting Irrelevant Inputs

The new input x does not belong to any of the known classes and is out of concern. Suppose we want to build a smartphone app that recognizes dog breeds. We train a CNN on a dataset containing various dog images, enabling it to perform the task with reasonable accuracy. We then point the smartphone at a sofa, shoot an image, and feed it to our classifier. The classifier could label the image a Bull Terrier with high confidence. Naturally, we want to avoid this by detecting the irrelevance of x. Most studies of OOD detection assume this scenario for evaluation.

Detecting Novel Classes

The input x belongs to a novel class, i.e., one differing from the C known classes, and furthermore, we want our CNN to learn to classify it later, e.g., after additional training. For instance, suppose we are building a system that recognizes insects in the wild, with an ambition to make it cover all the insects on earth. Further, suppose an image of one of the endangered (and thus rare) insects is inputted to the system while it is operating. If we can detect it as a novel class, we can update the system in several ways. The problem is the same as the first scenario in that we want to detect whether x ∼ p(x) or not. The difference is that x is more similar to samples of the learned classes, or equivalently, p′(x) is closer to p(x), arguably making the detection more difficult. Note that in this study, for simplicity, we do not consider distinguishing whether x is an irrelevant input or a novel class input; we leave this for future study.

Detecting Domain Shift

The input x belongs to one of the C known classes, but its underlying distribution is p′(x), not p(x). We are especially interested in the case where a distributional shift p(x) → p′(x) occurs either suddenly or gradually while a system runs over the long term. Our CNN may or may not generalize beyond this shift to p′(x). Thus, we want to detect if it does not.
If we can do this, we would take some actions, such as re-training the network with new training

