RE-BENCHMARKING OUT-OF-DISTRIBUTION DETECTION IN DEEP NEURAL NETWORKS

Abstract

Out-of-distribution (OOD) detection is a key challenge in making machine learning models robust in the real world, where models should be aware of uncertainty outside their training data distribution. Despite the rapid development of OOD detection algorithms, their experimental settings are often inconsistent, e.g., in datasets, evaluation metrics, model selection, and implementation choices. In this paper, we aim to understand OOD detection fundamentally and provide a comprehensive benchmark of current state-of-the-art OOD detection methods under a consistent and realistic evaluation setting. The benchmark comprises a series of dataset splits, model selection criteria, and OOD detection algorithms, and the experimental framework can be easily extended to new algorithms, datasets, and model selection criteria. We conduct extensive experiments on this benchmark and re-evaluate the algorithms it contains.

1. INTRODUCTION

Deep learning has achieved superior performance when the training and test data are sampled from the same distribution, i.e., are independent and identically distributed (i.i.d.) (Krizhevsky et al., 2012; He et al., 2016; Redmon et al., 2016). However, the real world is open and presents diverse inputs to modern deep learning models. Out-of-distribution (OOD) detection, which aims to determine whether an input differs from the training distribution and to reject it (or defer it to human users for safe handling), is of great importance to AI safety (Yang et al., 2021; Hendrycks & Gimpel, 2017). Extensive recent research has been devoted to the problem of out-of-distribution detection (Liang et al., 2018; Hsu et al., 2020; Huang et al., 2021; Ming et al., 2022; Sun et al., 2022; Meinke & Hein, 2020), including softmax-score-based and generative methods. The goal of OOD detection is to effectively separate ID and OOD data. Most previous works with discriminative models focus on deriving score criteria for OOD samples from different spaces of modern deep neural networks, e.g., the output space (Hendrycks & Gimpel, 2017; Liu et al., 2020), feature space (Lee et al., 2018), and gradient space (Huang et al., 2021). Generative methods leverage deep generative models for OOD detection, although such models have been shown to assign high likelihood to OOD test data (Ren et al., 2019; Schirrmeister et al., 2020; Xiao et al., 2020). Although a plethora of OOD detection algorithms have been proposed recently, they are evaluated with different OOD datasets, evaluation metrics, and model selection criteria. Inspired by standard benchmarks such as ImageNet (Russakovsky et al., 2015) and DomainBed (Gulrajani & Lopez-Paz, 2021), we propose a comprehensive testbed for out-of-distribution detection. This benchmark includes five OOD datasets, five baseline algorithms, and three evaluation criteria.
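As a concrete illustration of an output-space score, the maximum softmax probability (MSP) baseline of Hendrycks & Gimpel (2017) scores an input by the largest softmax probability over the classes, with a higher score indicating the input is more likely in-distribution. A minimal sketch (the function name and NumPy implementation are ours, for illustration only):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (MSP) score over a batch of logits.

    logits: array of shape (batch, num_classes).
    Returns one score per input; higher means more in-distribution.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)
```

An OOD detector then flags inputs whose score falls below some threshold; feature-space and gradient-space methods follow the same score-then-threshold pattern with different score functions.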
In this work, we provide a framework for reproducible experimentation in OOD detection and propose a simple yet effective method for adaptive model selection. We carefully implement five OOD detection algorithms on five representative OOD datasets under three evaluation criteria, and we find that the variable threshold used to separate ID and OOD data may be one cause of sub-optimal OOD detection performance, while existing works lack in-depth analysis of model selection and evaluation criteria for OOD detection algorithms. One recommendation follows directly from this analysis: the threshold of an OOD detection algorithm should be stable under different model selection criteria. Based on our observations and analysis, we make several recommendations for future OOD detection research:
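The threshold sensitivity discussed above can be made concrete with the widely used FPR@95TPR metric: the threshold is fixed so that 95% of ID data is accepted, and one then measures how much OOD data still passes. A minimal sketch, assuming higher scores indicate in-distribution inputs (the function and variable names are illustrative, not taken from any specific benchmark code):

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD data at 95% true positive rate on ID data.

    Assumes higher score = more in-distribution. The threshold is the
    5th percentile of ID scores, so 95% of ID inputs are accepted;
    the returned value is the fraction of OOD inputs also accepted.
    """
    id_scores = np.asarray(id_scores, dtype=np.float64)
    ood_scores = np.asarray(ood_scores, dtype=np.float64)
    threshold = np.percentile(id_scores, 5.0)  # yields 95% TPR on ID data
    return float(np.mean(ood_scores >= threshold))
```

Because this threshold is recomputed from the ID score distribution of whichever checkpoint is selected, two model selection criteria that pick different checkpoints can report quite different FPR values for the same algorithm, which is the instability the recommendation above targets.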

