PRACTICAL EVALUATION OF OUT-OF-DISTRIBUTION DETECTION METHODS FOR IMAGE CLASSIFICATION

Anonymous

Abstract

We reconsider the evaluation of OOD detection methods for image recognition. Although many studies have been conducted to build better OOD detection methods, most of them follow Hendrycks and Gimpel's work in their experimental evaluation. While a unified evaluation protocol is necessary for fair comparison, it is unclear whether its choice of tasks and datasets reflects real-world applications and whether the evaluation results generalize to other OOD detection scenarios. In this paper, we experimentally evaluate the performance of representative OOD detection methods in three scenarios, i.e., irrelevant input detection, novel class detection, and domain shift detection, on various datasets and classification tasks. The results show that differences in scenarios and datasets alter the relative performance among the methods. Our results can also serve as a guide for practitioners in selecting OOD detection methods.

1. INTRODUCTION

Despite their high performance on various visual recognition tasks, convolutional neural networks (CNNs) often show unpredictable behaviors on out-of-distribution (OOD) inputs, i.e., those sampled from a distribution different from the training data. For instance, CNNs often classify irrelevant images into one of the known classes with high confidence. A visual recognition system should therefore be equipped with the ability to detect such OOD inputs upon its real-world deployment. There are many studies of OOD detection based on diverse motivations and purposes. However, as far as recent studies targeting visual recognition are concerned, most of them follow the work of Hendrycks & Gimpel (2017), which provides a formal problem statement of OOD detection and an experimental procedure for evaluating the performance of methods. Employing this procedure, recent studies focus mainly on increasing detection accuracy, where performance is measured on the same datasets. On the one hand, the adoption of this shared experimental procedure has arguably brought about rapid research progress in a short period. On the other hand, little attention has been paid to how well the employed procedure models real-world problems and applications. These applications are diverse in purpose and domain, and obviously cannot be covered by a single problem setting with a narrow range of datasets. In this study, to address this issue, we consider multiple, more realistic scenarios for the application of OOD detection, and then experimentally compare representative methods. To be specific, we consider three scenarios: detection of irrelevant inputs, detection of novel class inputs, and detection of domain shift. The first two scenarios differ in the closeness between ID samples and OOD samples. Unlike the first two, domain shift detection is not OOD detection in the strict sense.
Nonetheless, it is the same as the other two in that what we want is to judge whether the model can make a meaningful inference for a novel input. In other words, we can generalize OOD detection to the problem of making this judgment. The above three scenarios then naturally fall into the same group of problems, and it becomes natural to consider applying OOD detection methods to the third scenario. It is noteworthy that domain shift detection has been little studied in the community: despite strong demand from practitioners, there is no established method in the context of deep learning for image classification.
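To make the baseline concrete, the procedure of Hendrycks & Gimpel (2017) scores each input by the maximum softmax probability (MSP) of the classifier and flags low-confidence inputs as OOD. The following is a minimal sketch of this scoring rule; the classifier is abstracted as precomputed logits, and the threshold value is an illustrative assumption (in practice it is chosen on held-out data or bypassed by threshold-free metrics such as AUROC).

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # MSP score: the largest softmax probability per input.
    # Higher values indicate the input looks in-distribution.
    return softmax(logits).max(axis=-1)

def detect_ood(logits, threshold=0.5):
    # Flag inputs whose confidence falls below the (assumed) threshold as OOD.
    return msp_score(logits) < threshold

# Illustrative logits: one confident prediction, one near-uniform one.
logits = np.array([[5.0, 0.0, 0.0],
                   [1.0, 1.0, 1.1]])
print(detect_ood(logits))  # the second, low-confidence input is flagged
```

Note that this rule only reranks inputs by classifier confidence; the methods compared in this paper replace or refine this score, while the detection pipeline itself stays the same.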

