LEARN WHAT YOU CAN'T LEARN: REGULARIZED ENSEMBLES FOR TRANSDUCTIVE OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Machine learning models are often used in practice once they achieve good generalization results on in-distribution (ID) holdout data. To predict test sets in the wild, they should detect samples they cannot predict well. We show that current out-of-distribution (OOD) detection algorithms for neural networks produce unsatisfactory results in a variety of OOD detection scenarios, e.g. when OOD data consists of unseen classes or corrupted measurements. This paper studies how such "hard" OOD scenarios can benefit from tuning the detection method after observing a batch of the test data. This transductive setting is relevant when the advantage of even a slightly delayed OOD detection outweighs the financial cost of additional tuning. We propose a novel method that uses an artificial labeling scheme for the test data and early stopping regularization to obtain ensembles of models that produce contradictory predictions only on the OOD samples in a test batch. We show via comprehensive experiments that our approach is indeed able to significantly outperform both inductive and transductive baselines on difficult OOD detection scenarios, such as unseen classes on CIFAR-10/CIFAR-100, severe corruptions (CIFAR-C), and strong covariate shift (ImageNet vs. ObjectNet).

1. INTRODUCTION

Modern machine learning (ML) systems can achieve good test set performance and are gaining popularity in many real-world applications, from aiding medical diagnosis (Beede et al., 2020) to making recommendations for the justice system (Angwin et al., 2016). In reality, however, some of the data points in a test set could come from a different distribution than the training (in-distribution) data. For example, sampling biases can lead to spurious correlations in the training set (Sagawa et al., 2020), a faulty sensor can produce novel data corruptions (Lu et al., 2019), or new unseen classes can emerge over time, like undiscovered bacteria (Ren et al., 2019). Many of these samples are so different from the training distribution that the model does not have enough information to predict their labels, yet it still outputs predictions with high confidence. It is important to identify these out-of-distribution (OOD) samples in the test set and flag them, for example to at least temporarily abstain from prediction (Geifman & El-Yaniv, 2017) and involve a human in the loop. To achieve this, Bayesian methods (Gal & Ghahramani, 2016; Malinin & Gales, 2018) and alternatives such as Deep Ensembles (Lakshminarayanan et al., 2017) try to identify samples on which a given model cannot predict reliably. Their aim is to obtain predictive models that simultaneously have low error on in-distribution (ID) data and perform well on OOD detection. Other approaches try to identify samples with low probability under the training distribution, independent of any prediction model, using, for instance, density estimation (Nalisnick et al., 2019) or statistics of the intermediate layers of a neural network (Lee et al., 2018). Most prior work reports good OOD detection performance, with area under the ROC curve (AUROC) values close to the perfect score of 1.
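OOD detection is typically evaluated by treating the detector's per-sample score as a binary classifier between ID and OOD samples and computing the AUROC. The following minimal sketch illustrates this evaluation protocol on synthetic scores; the Gaussian score distributions are purely illustrative stand-ins for a real detector's output, not data from any method discussed here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical detector scores: higher score = "more OOD".
scores_id = rng.normal(0.2, 0.1, 500)   # scores on in-distribution test points
scores_ood = rng.normal(0.6, 0.2, 500)  # scores on out-of-distribution test points

# Binary labels for the detection task: 0 = ID, 1 = OOD.
y_true = np.concatenate([np.zeros(500), np.ones(500)])
y_score = np.concatenate([scores_id, scores_ood])

# AUROC: probability that a random OOD sample scores higher than a
# random ID sample (1.0 = perfect separation, 0.5 = chance level).
auroc = roc_auc_score(y_true, y_score)
print(f"AUROC = {auroc:.3f}")
```

The AUROC is threshold-free, which is why it is the standard metric in this literature: it measures how well the score ranks OOD above ID samples without committing to a particular rejection threshold.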
However, these settings generally consider differentiating two vastly different data sets, such as SVHN vs. CIFAR-10. We show that the picture is very different in many other relevant settings. Specifically, for unseen classes within CIFAR-10 or for data with strong distribution shifts (e.g. (resized) ImageNet vs. ObjectNet (Barbu et al., 2019)), the AUROC of state-of-the-art methods often drops below 0.8. Almost all of these methods assume a setting where no training is possible at test time, so the OOD detection method can only be trained beforehand. This inductive setting allows real-time decision-making and is hence more broadly used. However, in many cases we can indeed do batch predictions, for example when sensor readings come in every second but it is sufficient to make a prediction and decision every few minutes (e.g. an automatic irrigation system). In this case, we have a batch of unlabeled test data that we want to predict (and be warned about), and that we can use together with the labeled training set to detect the OOD points in that batch. We call this the transductive OOD setting (related to, but quite different from, transductive classification (Vapnik, 1998)). Even in an online setting, transductive OOD detection could be very useful (see Section 2.1). (How) Can we achieve significantly better OOD detection performance in the transductive setting? Even though the transductive setting improves test accuracy in small-data settings for tasks such as classification or zero-shot learning, it is unclear how to successfully leverage the simultaneous availability of training and test set in the transductive OOD setting, which is quite distinct from these problems. A concurrent recent work (Yu & Aizawa, 2019) tackles this challenge by encouraging two classifiers to maximally disagree on the test set (i.e. to produce different predictions on test samples).
However, this leads to models that disagree to a similar degree on both ID and OOD data, so one cannot distinguish between the two, as indicated by the low AUROC in Figure 1. We introduce a new method for overparameterized models, called Regularized Ensembles for Transductive OOD detection (RETO), which heavily uses regularization to make sure that the ensemble disagrees only on the OOD samples in the test set, but not on the ID samples. In summary, our main contributions in this paper are as follows:

• We experimentally identify many realistic OOD scenarios where SOTA methods achieve a subpar AUROC below 0.84. We hence argue that the field of OOD detection is far from satisfactorily solved, and we expect future methods to include these (or other) hard OOD cases as benchmarks.

• For the transductive OOD detection setting, we propose a new procedure, RETO, that manages to diversify the output of an ensemble only on the OOD portion of the test set and hence achieves significant improvements over SOTA methods (see Figure 1), with a relative gain of at least 32%.
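The mechanism described above, artificial labels on the test batch plus regularization so that only OOD samples can absorb them, can be sketched on a toy problem. The sketch below is our own illustrative reconstruction, not the paper's implementation: each ensemble member sees the unlabeled test batch artificially labeled with one fixed class, and a capped iteration budget on an `MLPClassifier` serves as a crude stand-in for the early-stopping regularization. All data and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy setup: two ID Gaussian classes for training, and a test batch that
# mixes ID samples with an OOD cluster far from both training classes.
n = 200
X_train = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
y_train = np.array([0] * n + [1] * n)
X_test = np.vstack([
    rng.normal(-2.0, 1.0, (50, 2)),   # ID, true class 0
    rng.normal(2.0, 1.0, (50, 2)),    # ID, true class 1
    rng.normal(8.0, 1.0, (50, 2)),    # OOD cluster
])

# One ensemble member per class: member c is trained on the real training
# data plus the entire test batch artificially labeled as class c.
member_preds = []
for c in range(2):
    X_aug = np.vstack([X_train, X_test])
    y_aug = np.concatenate([y_train, np.full(len(X_test), c)])
    # The capped iteration budget mimics early-stopping regularization:
    # the member can fit the artificial label only where the training
    # data does not contradict it, i.e. on the OOD samples.
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=c)
    clf.fit(X_aug, y_aug)
    member_preds.append(clf.predict(X_test))

member_preds = np.stack(member_preds)  # shape: (num_members, num_test)
disagreement = (member_preds[0] != member_preds[1]).astype(float)
print("mean disagreement on ID :", disagreement[:100].mean())
print("mean disagreement on OOD:", disagreement[100:].mean())
```

On ID regions the true training labels dominate, so both members converge to the same predictions; on the OOD cluster each member is free to fit its own artificial label, and the resulting disagreement acts as the OOD score.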

2. REGULARIZED ENSEMBLES FOR TRANSDUCTIVE OOD DETECTION

We focus on classification tasks, and our main goal is to detect samples that lie outside of the training distribution. We are only interested in situations where we can obtain a model that generalizes well given the training data: if a model does not generalize well in-distribution (ID), then the primary task should be to find a better classifier instead. Given a classifier with good generalization, the next challenge becomes to ensure that samples on which the model cannot make confident predictions (e.g. samples that are too far from the training data) are correctly identified. This constitutes the



Figure 1: Left: Transductive OOD detection setting: a labeled training set and an unlabeled test set with samples both in- and outside the support of the training distribution. Right: Performance of RETO and several baselines on data sets ranked by their difficulty (Appendix C contains details on the hardness metric). The shaded area represents the gap in area under the ROC curve (AUROC) between RETO and the next-best baselines. The gap is wider for hard OOD detection settings.

