LEARN WHAT YOU CAN'T LEARN: REGULARIZED ENSEMBLES FOR TRANSDUCTIVE OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Machine learning models are often deployed in practice once they achieve good generalization results on in-distribution (ID) holdout data. To predict test sets in the wild, they should detect samples they cannot predict well. We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios, e.g., when OOD data consists of unseen classes or corrupted measurements. This paper studies how such "hard" OOD scenarios can benefit from tuning the detection method after observing a batch of the test data. This transductive setting is relevant when the advantage of even a slightly delayed OOD detection outweighs the financial cost of the additional tuning. We propose a novel method that uses an artificial labeling scheme for the test data and early stopping regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch. We show via comprehensive experiments that our approach significantly outperforms both inductive and transductive baselines on difficult OOD detection scenarios, such as unseen classes on CIFAR-10/CIFAR-100, severe corruptions (CIFAR-C), and strong covariate shift (ImageNet vs. ObjectNet).

1. INTRODUCTION

Modern machine learning (ML) systems can achieve good test set performance and are gaining popularity in many real-world applications, from aiding medical diagnosis (Beede et al., 2020) to making recommendations for the justice system (Angwin et al., 2016). In reality, however, some of the data points in a test set may come from a different distribution than the training (in-distribution) data. For example, sampling biases can lead to spurious correlations in the training set (Sagawa et al., 2020), a faulty sensor can produce novel data corruptions (Lu et al., 2019), or new unseen classes can emerge over time, like undiscovered bacteria (Ren et al., 2019). Many of these samples differ so much from the training distribution that the model does not have enough information to predict their labels, yet it still outputs predictions with high confidence. It is important to identify these out-of-distribution (OOD) samples in the test set and flag them, for example to at least temporarily abstain from prediction (Geifman & El-Yaniv, 2017) and involve a human in the loop. To achieve this, Bayesian methods (Gal & Ghahramani, 2016; Malinin & Gales, 2018) and alternatives such as Deep Ensembles (Lakshminarayanan et al., 2017) try to identify samples on which a given model cannot predict reliably. Their aim is to obtain predictive models that simultaneously have low error on in-distribution (ID) data and perform well at OOD detection. Other approaches try to identify samples with low probability under the training distribution, independently of any prediction model, using, for instance, density estimation (Nalisnick et al., 2019) or statistics of the intermediate layers of a neural network (Lee et al., 2018). Most prior work has reported good OOD detection performance, reaching area under the ROC curve (AUROC) values close to the perfect score of 1.
However, these settings generally consider differentiating two vastly different data sets, such as SVHN vs. CIFAR-10. We show that the picture is very different in many other relevant settings. Specifically, for unseen classes within CIFAR-10 or for data with strong distribution shifts (e.g., (resized) ImageNet vs. ObjectNet (Barbu et al., 2019)), the AUROC of state-of-the-art methods often drops below 0.8.
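To make the evaluation protocol concrete, the sketch below scores each test sample by the disagreement among an ensemble's predictions and measures how well that score separates ID from OOD samples via the AUROC. This is a minimal illustration in plain NumPy, not the method proposed in this paper: the function names (`disagreement_score`, `auroc`) and the majority-vote disagreement score are illustrative assumptions, and the AUROC is computed with the rank-sum (Mann-Whitney U) identity under the assumption of no tied scores.

```python
import numpy as np

def disagreement_score(probs):
    """probs: array of shape (n_models, n_samples, n_classes) with softmax outputs.
    Returns, per sample, the fraction of ensemble members that deviate from the
    majority-vote class. ID samples should yield ~0; OOD samples, higher values."""
    preds = probs.argmax(axis=-1)            # (n_models, n_samples) hard predictions
    n_models, n_samples = preds.shape
    scores = np.empty(n_samples)
    for i in range(n_samples):
        _, counts = np.unique(preds[:, i], return_counts=True)
        scores[i] = 1.0 - counts.max() / n_models  # 0 if all models agree
    return scores

def auroc(scores_id, scores_ood):
    """AUROC via the Mann-Whitney U statistic: the probability that a randomly
    chosen OOD sample receives a higher score than a randomly chosen ID sample.
    Assumes continuous scores (no ties) for simplicity."""
    scores = np.concatenate([scores_id, scores_ood])
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_id, n_ood = len(scores_id), len(scores_ood)
    u = ranks[n_id:].sum() - n_ood * (n_ood + 1) / 2
    return u / (n_id * n_ood)
```

For example, an ensemble whose members all agree on one sample but split 2-to-1 on another assigns disagreement scores of 0 and 1/3, respectively; a detector whose OOD scores all exceed its ID scores attains the perfect AUROC of 1.0.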

