INFORMATIVE OUTLIER MATTERS: ROBUSTIFYING OUT-OF-DISTRIBUTION DETECTION USING OUTLIER MINING

Anonymous

Abstract

Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in an open-world setting. However, existing OOD detection solutions can be brittle in the open world, facing various types of adversarial OOD inputs. While methods leveraging auxiliary OOD data have emerged, our analysis reveals a key insight: the majority of auxiliary OOD examples may not meaningfully improve the decision boundary of the OOD detector. In this paper, we provide a theoretically motivated method, Adversarial Training with informative Outlier Mining (ATOM), which improves the robustness of OOD detection. We show that, by mining informative auxiliary OOD data, one can significantly improve OOD detection performance and, somewhat surprisingly, generalize to unseen adversarial attacks. ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks. For example, on the CIFAR-10 in-distribution dataset, ATOM reduces the FPR95 by up to 57.99% under adversarial OOD inputs, surpassing the previous best baseline by a large margin.

In this section, we formulate the problem of robust out-of-distribution detection and provide background on the use of auxiliary data for OOD detection.

Problem Statement. We consider a training dataset D_in^train drawn i.i.d. from a data distribution P_{X,Y}, where X is the sample space and Y = {1, 2, ..., K} is the set of labels. A classifier f(x) is trained on the in-distribution P_X, the marginal distribution of P_{X,Y}. The OOD examples are revealed at test time; they come from a different distribution Q_X, potentially with perturbations added. The task of robust out-of-distribution detection is to learn a detector G : x → {-1, 1} that outputs 1 for x from P_X and -1 for a clean or perturbed OOD example x from Q_X. Formally, let Ω(x) be a set of small perturbations on an OOD example x.
The detector is evaluated on x from P_X and on the worst-case input inside Ω(x) for an OOD example from Q_X. The false negative rate (FNR) and false positive rate (FPR) are defined as:

FNR(G) = E_{x∼P_X} I[G(x) = -1],    FPR(G; Q_X, Ω) = E_{x∼Q_X} max_{δ∈Ω(x)} I[G(x + δ) = 1].

Note that no data from the test OOD distribution Q_X are available for training.
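These definitions can be made concrete with a small NumPy sketch. The toy 1-D detector and the two-point discretization of the perturbation set below are illustrative assumptions, not part of the formal setup:

```python
import numpy as np

def fnr(detector, in_samples):
    # FNR(G): fraction of in-distribution samples labeled -1 (flagged OOD).
    return np.mean([detector(x) == -1 for x in in_samples])

def worst_case_fpr(detector, ood_samples, omega):
    # FPR(G; Q_X, Omega): an OOD sample is a false positive if ANY allowed
    # perturbed version x + delta in Omega(x) is labeled 1 (in-distribution).
    return np.mean([any(detector(x_pert) == 1 for x_pert in omega(x))
                    for x in ood_samples])

# Toy 1-D detector: inputs with |x| <= 2 are declared in-distribution (+1).
det = lambda x: 1 if abs(x) <= 2.0 else -1
# Omega(x): interval of radius 0.5 around x, discretized at its two endpoints.
omega = lambda x: [x - 0.5, x + 0.5]
```

For example, `worst_case_fpr(det, [2.4, 5.0], omega)` is 0.5: the OOD point 2.4 can be perturbed to 1.9 and accepted, while 5.0 cannot reach the acceptance region.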

1. INTRODUCTION

Out-of-distribution (OOD) detection has become an indispensable part of building reliable open-world machine learning models (Amodei et al., 2016). An OOD detector determines whether an input is from the same distribution as the training data or from a different distribution (i.e., out-of-distribution). The performance of the OOD detector is central to safety-critical applications such as autonomous driving (Eykholt et al., 2018) or rare disease identification (Blauwkamp et al., 2019). Despite exciting progress made in OOD detection, previous methods mostly focused on clean OOD data (Hendrycks & Gimpel, 2016; Liang et al., 2018; Lee et al., 2018; Lakshminarayanan et al., 2017; Hendrycks et al., 2018; Mohseni et al., 2020). Scant attention has been paid to the robustness aspect of OOD detection. Recent works (Hein et al., 2019; Sehwag et al., 2019; Bitterwolf et al., 2020) considered worst-case OOD detection under adversarial perturbations (Papernot et al., 2016; Goodfellow et al., 2014; Biggio et al., 2013; Szegedy et al., 2013). For example, an OOD image (e.g., a mailbox) can be perturbed to be misclassified by the OOD detector as in-distribution (traffic sign data). Such an adversarial OOD example is then passed to the image classifier and triggers an undesirable prediction and action (e.g., speed limit 70). Therefore, it remains an important question how to make out-of-distribution detection algorithms robust to small perturbations of OOD inputs. In this paper, we begin by formally formulating the task of robust OOD detection and providing a theoretical analysis under a simple Gaussian data model. While recent OOD detection methods (Hendrycks et al., 2018; Hein et al., 2019; Meinke & Hein, 2019; Mohseni et al., 2020) have leveraged auxiliary OOD data, they typically sample uniformly at random from the auxiliary dataset.
Contrary to this common practice, our analysis reveals a key insight: the majority of auxiliary OOD examples may not provide useful information to improve the decision boundary of the OOD detector. Under a Gaussian model of the data, we theoretically show that outlier mining significantly improves the error bound of the OOD detector in the presence of non-informative auxiliary OOD data. Motivated by this insight, we propose Adversarial Training with informative Outlier Mining (ATOM), which justifies the theoretical intuitions above and achieves state-of-the-art performance on a broad family of classic and adversarial OOD evaluation tasks for modern neural networks. We show that, by carefully choosing which OOD data to train on, one can significantly improve the robustness of an OOD detector and, somewhat surprisingly, generalize to unseen adversarial attacks. We note that while hard negative mining has been extensively used in various learning tasks such as object recognition (Felzenszwalb et al., 2009; Gidaris & Komodakis, 2015; Shrivastava et al., 2016), to the best of our knowledge, we are the first to exploit the novel connection between hard example mining and OOD detection. We show both empirically and theoretically that hard example mining significantly improves the generalization and robustness of OOD detection. To evaluate our method, we provide a unified framework that allows examining the robustness of OOD detection algorithms under a broad family of OOD inputs, as illustrated in Figure 1. Our evaluation includes the existing classic OOD evaluation task, Natural OOD, and the adversarial OOD evaluation task, L∞ OOD. In addition, we introduce two new adversarial OOD evaluation tasks: Corruption OOD and Compositional OOD. Under these evaluation tasks, ATOM achieves state-of-the-art performance compared to eight competitive OOD detection methods (refer to Appendix B.3 for a detailed description of these methods).
On the Natural OOD evaluation task, ATOM achieves comparable, and often better, performance than current state-of-the-art methods. On the L∞ OOD evaluation task, ATOM outperforms the current state-of-the-art method ACET by a large margin (e.g., on CIFAR-10, by 53.9%). Under the new Corruption OOD evaluation task, where the attack is unknown at training time, ATOM also achieves much better results than previous methods (e.g., on CIFAR-10, outperforming the previous best method by 30.99%). While almost every method fails under the hardest, Compositional OOD evaluation task, ATOM still achieves impressive results (e.g., on CIFAR-10, it reduces the FPR by 57.99%). The performance is noteworthy since ATOM is not trained explicitly on corrupted OOD inputs. In summary, our contributions are:

• Firstly, we contribute a theoretical analysis formalizing the intuition of mining hard outliers for improving the robustness of OOD detection.
• Secondly, we contribute a theoretically motivated method, ATOM, which leads to state-of-the-art performance on both classic and adversarial OOD evaluation tasks. We conduct extensive evaluations and ablation analyses to demonstrate the effectiveness of informative outlier mining.
• Lastly, we provide a unified evaluation framework that allows future research to examine the robustness of OOD detection algorithms under a broad family of OOD inputs.

3. THEORETICAL ANALYSIS

In this section, we present a theoretical analysis that motivates the use of informative outlier mining for OOD detection. To establish formal guarantees, we use a Gaussian data model to model P_X, Q_X, and U_X. Different from previous work by Schmidt et al. (2018) and Carmon et al. (2019), our analysis gives rise to a separation result with or without informative outlier mining for OOD detection. To this end, we note that while hard negative mining has been explored in different domains of learning (e.g.,
object detection and deep metric learning; please refer to Section 6 for details), the vast literature on out-of-distribution detection has not explored this idea. Moreover, most uses of hard negative mining are heuristic; in our case, the simplicity of the definition of OOD (see Section 2) allows us to derive precise formal guarantees, which further distinguishes our work from previous studies of hard negative mining. As a remark, our analysis also establishes formal evidence of the importance of using auxiliary outlier data for OOD detection, which is lacking in current OOD detection studies. We refer readers to Section A for these results. At a high level, our analysis provides two important insights: (1) First, we show that a detection algorithm can work very well if all auxiliary data are informative, yet it can fail completely in a natural setting where informative auxiliary data are mixed with non-informative auxiliary data. (2) Second, we show that tweaking the algorithm with a simple thresholding step to choose mildly hard auxiliary data (in our setting, these are exactly the informative ones) leads to good detection performance. Together, these provide direct evidence of the importance of hard negative mining for OOD detection.

Gaussian Data Model. We now describe the Gaussian data model, inspired by the models in (Schmidt et al., 2018; Carmon et al., 2019) but with important adjustments for the OOD detection setting. In particular, our setting has a family Q of possible test OOD distributions, and only in-distribution data are available for training, modeling that the test OOD distribution is unknown at training time. Given µ ∈ R^d, σ > 0, ν > 0, we consider the following model:

• P_X (in-distribution data): N(µ, σ²I); the in-distribution data {x_i}_{i=1}^n are drawn from P_X.
• Q_X (out-of-distribution data): any distribution from the family Q = {N(-µ + v, σ²I) : v ∈ R^d, ||v||_2 ≤ ν}.
• Hypothesis class of OOD detectors: G = {G_θ(x) = sign(θᵀx) : θ ∈ R^d}.
A concrete instance of the model is defined by a set of parameter values for d, µ, σ, and ν; see Appendix A.2 for the family of instances we analyze. While the Gaussian model is much simpler than practical data, its simplicity is desirable for analytically demonstrating the insights. Furthermore, the analysis in this simple model has implications for more complicated and practical methods, which we present in Section 4. Finally, the analysis can be generalized to mixtures of Gaussians, which model practical data much better. Below, we consider the FNR and the FPR under L∞ perturbations of magnitude ε. Since Q_X is not accessible at training time, our goal is to bound sup_{Q_X ∈ Q} FPR(G; Q_X, Ω_{∞,ε}(x)).

Failing a good detector by mixing in non-informative auxiliary data. We start by considering the case where all auxiliary data are informative: that is, all auxiliary data {x̃_i}_{i=1}^{n'} come from a uniform mixture of the possible test OOD distributions in Q. In this case, it is straightforward to show that a simple averaging-based detector,

θ̂_{n,n'} = 1/(n + n') ( Σ_{i=1}^{n} x_i − Σ_{i=1}^{n'} x̃_i ),

performs very well (see Proposition 4 in the appendix). Unfortunately, this detector can easily be fooled by the following simple auxiliary data distribution U_mix, which mixes the ideal auxiliary data with non-informative data:

• U_mix (non-ideal mixture): U_mix is a uniform mixture of N(-µ, σ²I) and N(µ_o, σ²I) with µ_o = 10µ.

Importantly, the distribution U_mix models the case where the auxiliary OOD data contain some non-informative outliers, as well as a small probability mass of samples (e.g., the tail of N(-µ, σ²I)) inside the support of the in-distribution. In this case, the simple averaging method yields E[θ̂_{n,n'}] = -7µ/4, a large error, since θ̂_{n,n'} is misled by auxiliary data from N(µ_o, σ²I) and by the tail of N(-µ, σ²I) in the support of the in-distribution.

Fixing the detector with informative outlier mining.
We now show an important modification of the detection algorithm, using informative outlier mining, which leads to good detection performance. Specifically, we first use the in-distribution data to obtain an intermediate solution: θ̂_int = (1/n) Σ_{i=1}^{n} x_i. Then, we use a simple thresholding mechanism to pick only points with mild confidence scores, which removes non-informative outliers. Specifically, we select only outliers x whose confidence score f(x) = 1/(1 + exp(-xᵀθ̂_int / d)) falls in an interval [a, b]. The final solution θ̂_om is -1 times the average of the selected outliers. We can prove the following:

Proposition 1 (Error bound with outlier mining). For any ε ∈ (0, 1/2) and any integer n_0 > 0, there exists a family of instances of the Gaussian data model such that the following is true. Suppose we have n in-distribution data and n' auxiliary OOD data from U_mix specified above. There exist thresholds a and b for θ̂_om and a universal constant c > 0 such that if the number of in-distribution data n ≥ c(n_0 log d + √(d n_0)) and the number of auxiliary data n' ≥ (d + n_0 · 4ε²) d/n_0, then θ̂_om has small errors:

E[FNR(G_{θ̂_om})] ≤ 10⁻³,   E[sup_{Q_X ∈ Q} FPR(G_{θ̂_om}; Q_X, Ω_{∞,ε}(x))] ≤ 10⁻³.   (3)

Intuitively, the mining method removes the misleading points (most points from N(µ_o, σ²I) and the tail of N(-µ, σ²I) in the support of the in-distribution). Outliers selected in this way are mostly informative and thus yield an accurate final detector, which justifies outlier mining in the presence of non-informative data. Comparing this bound with the case without auxiliary data (Proposition 3), we see that for sufficiently high dimension d and the same amount of in-distribution data, any algorithm without outliers must fail, while our outlier mining method learns a good detector. We also note that our analysis and result hold for many other auxiliary data distributions U_mix; the particular U_mix used here is chosen for simplicity of exposition (see the appendix for more discussion).
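The contrast between the naive averaging detector and the mined detector can be reproduced in a small NumPy simulation of the Gaussian model. The dimensions, sample sizes, and thresholds a = 0.1, b = 0.45 below are illustrative choices for one instance, not the values from the proposition:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 100, 2000, 1.0
mu = np.ones(d)                      # one illustrative instance with ||mu||^2 = d

x_in = rng.normal(mu, sigma, (n, d))                 # P_X = N(mu, sigma^2 I)
# U_mix: half informative N(-mu, .), half non-informative N(10*mu, .)
aux = np.vstack([rng.normal(-mu, sigma, (n // 2, d)),
                 rng.normal(10 * mu, sigma, (n // 2, d))])

# Naive averaging detector: misled by the non-informative component.
theta_naive = (x_in.sum(0) - aux.sum(0)) / (2 * n)

# Outlier mining: score outliers with the in-distribution-only estimate
# theta_int and keep only those with mildly low confidence.
theta_int = x_in.mean(0)
conf = 1.0 / (1.0 + np.exp(-aux @ theta_int / d))
mined = aux[(conf > 0.1) & (conf < 0.45)]
theta_om = -mined.mean(0)

# Alignment with mu (normalized): the naive estimator points the wrong way
# (approx. -7/4, matching the text), while the mined one aligns with mu.
print(theta_naive @ mu / d, theta_om @ mu / d)
```

With this seed, the naive detector's normalized alignment is negative (so it labels typical in-distribution points as OOD), while the mined detector's is close to +1.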
In the following section, we design a practical algorithm based on this insight and present empirical evidence of its effectiveness.

4. ATOM: ADVERSARIAL TRAINING WITH INFORMATIVE OUTLIER MINING

We train a (K+1)-way classifier F_θ, where the (K+1)-th class label indicates out-of-distribution. The training objective is

minimize_θ  E_{(x,y)∼D_in^train}[ℓ(x, y; F_θ)] + λ · E_{x∼D_out^train}[max_{x'∈Ω_{∞,ε}(x)} ℓ(x', K+1; F_θ)],   (4)

where ℓ is the cross-entropy loss and D_out^train is the OOD training dataset. We use Projected Gradient Descent (PGD) (Madry et al., 2017) to solve the inner max of the objective, and apply it to half of each minibatch while keeping the other half clean to ensure performance on both clean and perturbed data. Once trained, the OOD detector G(x) can be constructed as:

G(x) = -1 if F(x)_{K+1} ≥ γ;   G(x) = 1 if F(x)_{K+1} < γ,   (5)

where γ is the threshold; in practice, it can be chosen on the in-distribution data so that a high fraction of the test examples are correctly classified by G. We call F(x)_{K+1} the OOD score of x. For an input labeled as in-distribution by G, one can obtain its semantic label using:

ŷ(x) = argmax_{y∈{1,2,...,K}} F(x)_y.   (6)

Informative Outlier Mining. Motivated by our theoretical analysis in Section 3 and our empirical observation shown in Figure 2, we propose to adaptively choose OOD training examples on which the detector is uncertain. We provide the complete training algorithm using informative outlier mining in Algorithm 1. Our method differs from the random sampling used in previous works (Hendrycks et al., 2018; Hein et al., 2019; Meinke & Hein, 2019; Mohseni et al., 2020). Specifically, during each training epoch, we randomly sample N data points from the auxiliary OOD dataset D_out^auxiliary and use the current model to infer their OOD scores. Next, we sort the data points by OOD score and select a subset of n < N data points, starting with the qN-th data point in the sorted list. We then use the selected samples as the OOD training data D_out^train for the next epoch of training. Intuitively, q determines the informativeness of the sampled points with respect to the OOD detector: the larger q is, the less informative the sampled examples become.
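As a sketch, the detector of Eq. (5) and the classifier of Eq. (6) can be implemented on top of the (K+1)-way softmax outputs as follows; the logits and the threshold γ here are made-up illustrative values (in practice γ is chosen on in-distribution data):

```python
import numpy as np

def detect_and_classify(logits, gamma):
    """Eq. (5) and Eq. (6): the last softmax output F(x)_{K+1} is the OOD
    score; inputs whose score reaches the threshold gamma are flagged OOD."""
    z = logits - logits.max(axis=1, keepdims=True)      # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ood_score = p[:, -1]                                # F(x)_{K+1}
    g = np.where(ood_score >= gamma, -1, 1)             # Eq. (5): -1 = OOD
    y_hat = p[:, :-1].argmax(axis=1)                    # Eq. (6): K classes
    return g, y_hat, ood_score

# Two inputs for a K = 2 problem: the first looks in-distribution (class 0),
# the second puts most mass on the OOD class.
g, y_hat, s = detect_and_classify(
    np.array([[5.0, 1.0, 0.0], [0.0, 0.0, 6.0]]), gamma=0.5)
```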
Algorithm 1 ATOM: Adversarial Training with informative Outlier Mining
for each training epoch do
    Randomly sample N data points S from the auxiliary OOD dataset D_out^auxiliary.
    Compute OOD scores V = {F(x)_{K+1} | x ∈ S}. Sort the scores in V from lowest to highest.
    D_out^train ← examples with scores V[qN : qN + n]   {q ∈ [0, 1 - n/N]}
    Train F_θ for one epoch using the training objective (4).
end for
Build the detector and the classifier using (5) and (6), respectively.

Note that informative outlier mining is performed on (non-adversarial) auxiliary OOD data; the selected examples are then used in the robust training objective (4). To see how informative outlier mining alone improves OOD detection, we also consider the following objective without adversarial training:

minimize_θ  E_{(x,y)∼D_in^train}[ℓ(x, y; F_θ)] + λ · E_{x∼D_out^train}[ℓ(x, K+1; F_θ)],

which we name Natural Training with informative Outlier Mining (NTOM).
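The selection step of Algorithm 1 is a one-liner over the sorted OOD scores; a minimal sketch (the scores and parameters are made up for illustration):

```python
import numpy as np

def mine_informative_outliers(ood_scores, q, n):
    # Sort ascending: the lowest OOD scores are the hardest (most informative)
    # outliers. Skip the first qN of them (possible near-duplicates of
    # in-distribution data), then take the next n.
    order = np.argsort(ood_scores)
    start = int(q * len(ood_scores))
    return order[start:start + n]        # indices into the sampled pool S

scores = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.05, 0.95, 0.2])
idx = mine_informative_outliers(scores, q=0.25, n=3)   # -> indices 7, 3, 2
```

With q = 0.25 and N = 8, the two hardest points (scores 0.05 and 0.1) are skipped and the next three hardest are selected.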

5. EXPERIMENTS

In this section, we describe our experimental setup (Section 5.1) and show that ATOM can substantially improve OOD detection performance on both clean OOD data and adversarially perturbed OOD inputs. We also conduct extensive ablation analyses to explore different aspects of our algorithm. Our experiments are mainly on image data, which is common in previous work, but we believe that our insights and method can be applied to other types of data; we leave this for future work.

5.1. SETUP

In-distribution Datasets. We use SVHN (Netzer et al., 2011), CIFAR-10, and CIFAR-100 (Krizhevsky et al., 2009) as in-distribution datasets.

Auxiliary OOD Datasets. By default, we use 80 Million Tiny Images (TinyImages) (Torralba et al., 2008) as D_out^auxiliary, which is a common setting in prior work. We also use ImageNet-RC, a variant of ImageNet (Chrabaszcz et al., 2017), as an alternative auxiliary OOD dataset.

Out-of-distribution Datasets. For OOD test datasets, we follow the procedure in (Liang et al., 2018; Hendrycks et al., 2018) and use six different natural image datasets. For CIFAR-10 and CIFAR-100, we use SVHN, Textures (Cimpoi et al., 2014), Places365 (Zhou et al., 2017), LSUN (crop), LSUN (resize) (Yu et al., 2015), and iSUN (Xu et al., 2015). For SVHN, we use CIFAR-10, Textures, Places365, LSUN (crop), LSUN (resize), and iSUN. Besides natural image datasets, we also consider Gaussian Noise and Uniform Noise as OOD test data.

Hyperparameters. The hyperparameter q is chosen on a separate validation set from TinyImages, which does not depend on test-time OOD data (see Appendix B.7). Based on the validation results in Table 5, we set q = 0 for SVHN, q = 0.125 for CIFAR-10, and q = 0.5 for CIFAR-100. To ensure a fair comparison, in each epoch ATOM uses the same amount of outlier data as OE, where n is twice the in-distribution data size (i.e., 50,000). For all experiments, we set λ = 1. For CIFAR-10 and CIFAR-100, we set N = 400,000 and n = 100,000; for SVHN, we set N = 586,056 and n = 146,514. More details about the experimental setup are in Appendix B.1.

Robust OOD Evaluation Tasks. We consider the following family of OOD inputs, for which we provide visualizations in Appendix B.5:

• Natural OOD: This is equivalent to the classic OOD evaluation with clean OOD input x and Ω(x) = ∅.
• L∞ attacked OOD (white-box): We consider small L∞-norm bounded perturbations of an OOD input x (Madry et al., 2017; Athalye et al., 2018), which induce the model to produce high confidence scores (or low OOD scores) for OOD inputs. We denote the adversarial perturbations by Ω_{∞,ε}(x), where ε is the adversarial budget. We provide attack algorithms for all eight OOD detection methods in Appendix B.4.
• Corruption attacked OOD (black-box): We consider a more realistic type of attack based on common corruptions (Hendrycks & Dietterich, 2019), which could appear naturally in the physical world. For each OOD image, we generate 75 corrupted images (15 corruption types × 5 severity levels) and then select the one with the lowest OOD score.
• Compositionally attacked OOD (white-box): Lastly, we consider applying the L∞-norm bounded attack and the corruption attack jointly to an OOD input x, as considered in (Laidlaw & Feizi, 2019).

Evaluation Metrics. We measure the following metrics: the false positive rate (FPR) at 5% false negative rate (FNR), and the area under the receiver operating characteristic curve (AUROC).
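Both metrics can be computed directly from OOD scores; a minimal NumPy sketch under the convention that a higher score is more OOD-like (the rank-based AUROC assumes distinct scores):

```python
import numpy as np

def fpr_at_95_tpr(score_in, score_out):
    """FPR when the OOD-score threshold keeps 95% of in-distribution data
    correctly accepted (i.e., 5% FNR)."""
    gamma = np.percentile(score_in, 95)     # 5% of in-dist scores exceed gamma
    return np.mean(score_out < gamma)       # OOD samples passing as in-dist

def auroc(score_in, score_out):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample (Mann-Whitney formulation, no sklearn needed)."""
    all_scores = np.concatenate([score_in, score_out])
    ranks = all_scores.argsort().argsort()  # 0-based ranks; assumes no ties
    n_in, n_out = len(score_in), len(score_out)
    rank_sum_out = ranks[n_in:].sum()
    return (rank_sum_out - n_out * (n_out - 1) / 2) / (n_in * n_out)
```

For perfectly separated scores (all OOD scores above all in-distribution scores), `auroc` returns 1.0 and `fpr_at_95_tpr` returns 0.0.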

5.2. RESULTS

How does ATOM compare to existing solutions? We show in Table 1 that ATOM outperforms competitive OOD detection methods on both classic and adversarial OOD evaluation tasks. First, on the classic OOD evaluation task (clean OOD data), ATOM achieves comparable, and often even better, performance than current state-of-the-art methods. Second, on the existing adversarial OOD evaluation task, L∞ OOD, ATOM outperforms the current state-of-the-art method ACET (Hein et al., 2019) by a large margin (e.g., on CIFAR-10, by 53.9% measured by FPR). Third, while ACET is somewhat brittle under the new Corruption OOD evaluation task, our method generalizes surprisingly well to the unknown corruption attacked OOD inputs, outperforming the best baseline by a large margin (e.g., on CIFAR-10, by up to 30.99% measured by FPR). Finally, while almost every method fails under the hardest, compositional OOD evaluation task, our method still achieves impressive results (e.g., on CIFAR-10, it reduces the FPR by 57.99%). The performance is noteworthy since our method is not trained explicitly on corrupted OOD inputs. Our training method leads to improved OOD detection while preserving classification performance on in-distribution data. Consistent performance improvements are observed on other in-distribution datasets (SVHN and CIFAR-100), an alternative network architecture (WideResNet), and an alternative auxiliary dataset (ImageNet-RC).

How does ATOM compare to NTOM? We perform an ablation study that isolates the effect of adversarial training. In Table 1, we show that NTOM achieves comparable performance to ATOM on natural OOD and corruption OOD. However, NTOM is less robust under L∞ OOD (e.g., a 79.35% gap in FPR on CIFAR-10) and compositional OOD inputs. This underscores the importance of combining adversarial training and outlier mining (ATOM) for overall good performance.

How does the sampling parameter affect performance?
Table 2 shows the performance with different values of the sampling parameter q. For all three datasets, training primarily on OOD inputs with large OOD scores (i.e., examples that are too easy, q = 0.75) worsens the performance, which suggests the necessity of including examples on which the OOD detector is uncertain. We also show that informative outlier mining overall works better than random sampling under a properly chosen q. Interestingly, in the setting where the in-distribution data and the auxiliary OOD data are disjoint (e.g., SVHN/TinyImages), q = 0 is optimal, which suggests that the hardest outliers are the most useful for training. However, in a more realistic setting, the auxiliary OOD data almost always contain data similar to the in-distribution data (e.g., CIFAR/TinyImages). Even without exhaustively removing near-duplicates, ATOM can adaptively avoid training on those near-duplicates of in-distribution data (e.g., using q = 0.125 for CIFAR-10 and q = 0.5 for CIFAR-100).

How does the choice of auxiliary OOD dataset affect performance? To see this, we additionally experiment with ImageNet-RC as the auxiliary OOD data. We observe consistent improvements with ATOM, in many cases with better performance than with TinyImages. For example, on CIFAR-100, the FPR under natural OOD inputs is reduced from 32.20% (w/ TinyImages) to 15.49% (w/ ImageNet-RC). Interestingly, for all three datasets, q = 0 (the hardest outliers) yields the optimal performance, since there are substantially fewer near-duplicates between ImageNet-RC and the in-distribution data. This ablation suggests that ATOM's success does not depend on a particular auxiliary dataset. Full results are provided in Table 6 (Appendix B.8).

6. RELATED WORK

Robustness of OOD Detection. Worst-case aspects of OOD detection have previously been studied in (Hein et al., 2019; Sehwag et al., 2019). However, these papers are primarily concerned with L∞-norm bounded adversarial attacks, while our evaluation also includes common image corruption attacks. Besides, Meinke & Hein (2019) and Hein et al. (2019) only evaluate adversarial robustness of OOD detection on random noise images, while we also evaluate it on natural OOD images. Hein et al. (2019) theoretically analyzed why ReLU networks can yield high-confidence but wrong predictions for OOD data. Meinke & Hein (2019) showed the first provable guarantees for worst-case OOD detection on some balls around uniform noise, and Bitterwolf et al. (2020) recently studied provable guarantees for worst-case OOD detection not only on noise but also on images from related but different image classification tasks. In this paper, we propose ATOM, which achieves state-of-the-art performance on a broader family of clean and perturbed OOD inputs. The key difference of our method from prior work is the informative outlier mining technique, which significantly improves the generalization and robustness of OOD detection.

Discriminative-Based Out-of-Distribution Detection. Hendrycks & Gimpel (2016) introduced a baseline approach for OOD detection using the maximum softmax probability from a pre-trained network. Several works attempt to improve OOD uncertainty estimation by using deep ensembles (Lakshminarayanan et al., 2017), the calibrated softmax score (Liang et al., 2018), and the Mahalanobis distance-based confidence score (Lee et al., 2018). Some methods also modify the neural network by re-training or fine-tuning on auxiliary anomalous data that are either realistic (Hendrycks et al., 2018; Mohseni et al., 2020; Papadopoulos et al., 2019) or artificially generated by GANs (Lee et al., 2017).
Many other works (Subramanya et al., 2017; Malinin & Gales, 2018; Bevandić et al., 2018) also regularize the model to have lower confidence on anomalous examples.

Generative-Modeling-Based Out-of-Distribution Detection. Generative models (Dinh et al., 2016; Kingma & Welling, 2013; Rezende et al., 2014; Van den Oord et al., 2016; Tabak & Turner, 2013) are alternative approaches for detecting OOD examples, as they directly estimate the in-distribution density and can declare a test sample to be out-of-distribution if it lies in a low-density region. However, as shown by Nalisnick et al., deep generative models can assign a high likelihood to out-of-distribution data. Deep generative models can be made more effective for out-of-distribution detection using alternative metrics (Choi & Jang, 2018), likelihood ratios (Ren et al., 2019; Serrà et al., 2019), and modified training techniques (Hendrycks et al., 2018). Recently, Pope et al. showed that flow-based generative models are sensitive to adversarial attacks. Note that we mainly consider discriminative approaches, which can be more competitive due to the availability of label information (and, in some cases, auxiliary OOD data (Hein et al., 2019; Hendrycks et al., 2018; Meinke & Hein, 2019; Mohseni et al., 2020)).

Adversarial Robustness. Adversarial examples (Goodfellow et al., 2014; Papernot et al., 2016; Biggio et al., 2013; Szegedy et al., 2013) have received considerable attention in recent years. Many defense methods have been proposed to mitigate this problem. One of the most effective is adversarial training (Madry et al., 2017), which uses robust optimization techniques to render deep learning models resistant to adversarial attacks.

Hard Example Mining. Hard example mining was introduced in (Sung, 1996) for training face detection models, where the set of background examples is gradually grown by selecting those examples for which the detector triggers a false alarm.
The idea has since been used extensively in the object detection literature (Felzenszwalb et al., 2009; Gidaris & Komodakis, 2015; Shrivastava et al., 2016). It has also been used extensively in deep metric learning (Cui et al., 2016; Simo-Serra et al., 2015; Wang & Gupta, 2015; Suh et al., 2019) and deep embedding learning (Yuan et al., 2017; Smirnov et al., 2018; Wu et al., 2017; Duan et al., 2019). To the best of our knowledge, we are the first to explore hard example mining for out-of-distribution detection.

7. CONCLUSION

In this paper, we propose Adversarial Training with informative Outlier Mining (ATOM), a method that enhances the robustness of the OOD detector. We show the merit of adaptively selecting OOD training examples on which the OOD detector is uncertain. Extensive experiments show that ATOM can significantly improve the decision boundary of the OOD detector, achieving state-of-the-art performance under a broad family of clean and perturbed OOD evaluation tasks. We also provide a theoretical analysis that justifies the benefits of outlier mining. Further, our unified evaluation framework allows future research to examine the robustness of OOD detectors. We hope our research raises attention to a broader view of robustness in out-of-distribution detection.

Supplementary Material

Informative Outlier Matters: Robustifying Out-of-distribution Detection Using Outlier Mining

A THEORETICAL ANALYSIS

In this section, we provide theoretical analysis addressing the following two questions: 1) Why are auxiliary OOD data useful for training, even when they come from a different distribution than the test OOD distribution? 2) For practical auxiliary OOD data, which contain non-informative samples, how can we make use of the auxiliary data to significantly improve detection performance?

For the first question, we provide an error bound to justify the benefit of auxiliary OOD data. We note that a key difference of OOD detection from typical classification is that the test OOD distribution is not accessible for training, so one needs to use auxiliary OOD data from a different distribution. This makes OOD detection more challenging. Our intuition is that even if the auxiliary data differ from the test OOD data, they can still calibrate detectors in quite general situations, so that the detector can generalize to the test OOD data. To formalize this, we adopt the domain adaptation framework for our analysis.

For the second question, we provide analysis in a generative model of the data and motivate the importance of carefully selecting informative auxiliary OOD data (i.e., informative outlier mining). Intuitively, since the auxiliary OOD data differ from the test OOD data, they may not all be useful. However, it is unclear why outlier mining can lead to significant improvements (see our experimental results in Section 5) and how to formalize this. Our intuition is that some of the auxiliary OOD data can be non-informative or even harmful, and they can overwhelm the benefit of informative outliers, leading to a drastic drop in detection performance. To formalize this, one needs distributional assumptions on the data. We thus use a Gaussian data model and derive concrete bounds to illustrate our intuition.
A.1 GENERALIZATION FROM AUXILIARY OOD DATA TO TEST OOD DATA

To see why detectors trained on the auxiliary OOD data U_X can generalize to the test OOD distribution Q_X, we adopt the domain adaptation framework (Ben-David et al., 2010). Recall that in domain adaptation there are two domains s and t, each a distribution over the input space X and label space {-1, 1}. A classifier is trained on s and then applied to t. At a high level, we view our OOD detection problem as classification, where the source domain s is P_X with label 1 and U_X with label -1, and the target domain t is P_X with label 1 and Q_X with label -1. We focus on the FPR metric below; the argument for FNR is similar. Suppose we learn the OOD detector from a hypothesis class G. Following Ben-David et al. (2010), we define a variant of the divergence of Q_X and U_X w.r.t. the hypothesis class G as

d_G(Q_X, U_X) = sup_{G,G' ∈ G} |v(G, G'; Q_X) - v(G, G'; U_X)|,

where v(G, G'; D) = FPR(G; D, Ω) - FPR(G'; D, Ω) is the error difference of G and G' on the distribution D. The divergence upper bounds the change of the hypothesis error difference between Q_X and U_X. If it is small, then for any G, G' ∈ G where G has a smaller error than G' on U_X, we know that G will also have a smaller (or not too much larger) error than G' on Q_X. That is, if the divergence is small, then the ranking of the hypotheses w.r.t. the error is roughly the same in both distributions. This rank-preserving property ensures that a good hypothesis learned on U_X will also be good for Q_X. Now we show that, if d_G(Q_X, U_X) is small (i.e., Q_X and U_X are aligned w.r.t. the class G), then a detector G with a small FPR on U_X will also have a small FPR on Q_X.

[Figure 3: An illustrative example of why U_X helps to obtain a good detector G_r. With U_X, we can prune away hypotheses G_r for any r ≥ 1.9. Thus, the resulting detector G_r can detect OOD samples from Q_X successfully and robustly.]
Proposition 2. For any $G \in \mathcal{G}$,
$$ \mathrm{FPR}(G; Q_X, \Omega) \le \inf_{G^* \in \mathcal{G}} \mathrm{FPR}(G^*; Q_X, \Omega) + \mathrm{FPR}(G; U_X, \Omega) + d_{\mathcal{G}}(Q_X, U_X). $$
Proof. For simplicity, we omit Ω from FPR(G; Q_X, Ω). For any $G^* \in \mathcal{G}$, we have
$$ \mathrm{FPR}(G; Q_X) = \mathrm{FPR}(G^*; Q_X) + \mathrm{FPR}(G; Q_X) - \mathrm{FPR}(G^*; Q_X) \quad (8) $$
$$ = \mathrm{FPR}(G^*; Q_X) + \mathrm{FPR}(G; U_X) - \mathrm{FPR}(G^*; U_X) \quad (9) $$
$$ + \left[ \left(\mathrm{FPR}(G; Q_X) - \mathrm{FPR}(G^*; Q_X)\right) - \left(\mathrm{FPR}(G; U_X) - \mathrm{FPR}(G^*; U_X)\right) \right]. \quad (10) $$
The last term is
$$ \left(\mathrm{FPR}(G; Q_X) - \mathrm{FPR}(G^*; Q_X)\right) - \left(\mathrm{FPR}(G; U_X) - \mathrm{FPR}(G^*; U_X)\right) \quad (11) $$
$$ = v(G, G^*; Q_X) - v(G, G^*; U_X) \quad (12) $$
$$ \le d_{\mathcal{G}}(Q_X, U_X). \quad (13) $$
Therefore, $\mathrm{FPR}(G; Q_X) \le \mathrm{FPR}(G^*; Q_X) + \mathrm{FPR}(G; U_X) + d_{\mathcal{G}}(Q_X, U_X)$. Taking the infimum over $G^* \in \mathcal{G}$ completes the proof.

The error of the detector is bounded by three terms: the best achievable error, the error on the training distribution, and the divergence between Q_X and U_X. Assuming that there exists a ground-truth detector with a small test error, and that optimization can lead to a small training error, the test error is then characterized by the divergence. So in this case, as long as the rankings of the hypotheses (according to the error) on Q_X and U_X are similar, detectors learned on U_X can generalize to Q_X.

An illustrative example. In this example, the in-distribution P_X is uniform over the disk around the origin in $\mathbb{R}^2$ with radius 1, U_X is uniform over the disk around (0, 3) with radius 1, and Q_X is uniform over the disk around (3, 0) with radius 1. Assume the adversary budget is $\epsilon = 0.1$, i.e., $\Omega_{\infty,\epsilon} = \{\delta : \|\delta\|_\infty \le 0.1\}$. The hypothesis class for the detector contains all functions of the form $G_r(x) = 2\,\mathbb{I}[\|x\|_2 \le r] - 1$ with parameter r. See Figure 3. The example first shows the effect of the auxiliary OOD data: U_X helps prune away hypotheses G_r for any r >= 1.9. Furthermore, it also shows how learning over U_X can generalize to Q_X.
Although Q_X and U_X have non-overlapping supports, U_X helps to calibrate the error of the hypotheses, so any good detector trained on P_X and U_X can be used to distinguish P_X and Q_X. Formally, the divergence $d_{\mathcal{G}}$ in Proposition 2 is small. The analysis also shows the importance of training on perturbed instances from the auxiliary OOD data U_X. Not using perturbation is equivalent to using Ω = {0}. In this case, the analysis only guarantees the error on unperturbed instances from Q_X, even if Q_X and U_X have small divergence and the learned detector has small training error on U_X.
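The disk example can be checked numerically. Below is a small Monte Carlo sketch (our own construction for illustration; the geometry and the ε = 0.1 budget follow the example above), using the fact that for the detector $G_r$ the worst-case ℓ∞ adversary simply shrinks each coordinate of an OOD point toward 0 by at most ε:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_disk(center, radius, n):
    """Uniform samples from a 2-D disk (polar coordinates, sqrt for uniform area)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n))
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1) + center

def worst_case_norm(x, eps):
    """Smallest l2 norm reachable from x under an l_inf perturbation of size eps:
    the adversary shrinks each coordinate toward 0 by at most eps."""
    return np.linalg.norm(np.sign(x) * np.maximum(np.abs(x) - eps, 0.0), axis=1)

def fnr(r, x_in):                # in-distribution flagged as OOD
    return float(np.mean(np.linalg.norm(x_in, axis=1) > r))

def robust_fpr(r, x_out, eps):   # worst-case perturbed OOD flagged as in-distribution
    return float(np.mean(worst_case_norm(x_out, eps) <= r))

x_in = sample_disk(np.array([0.0, 0.0]), 1.0, 20000)    # P_X
x_aux = sample_disk(np.array([0.0, 3.0]), 1.0, 20000)   # U_X (auxiliary)
x_test = sample_disk(np.array([3.0, 0.0]), 1.0, 20000)  # Q_X (test OOD)

# G_r with r = 1.5 has zero error on P_X and U_X, and it also generalizes:
# zero robust FPR on the unseen Q_X.
assert fnr(1.5, x_in) == 0.0
assert robust_fpr(1.5, x_aux, 0.1) == 0.0 and robust_fpr(1.5, x_test, 0.1) == 0.0
# A hypothesis with r >= 1.9 (here r = 2.5) is pruned away by U_X,
# and it indeed also fails on Q_X.
assert robust_fpr(2.5, x_aux, 0.1) > 0.1 and robust_fpr(2.5, x_test, 0.1) > 0.1
```

The detectors that survive training on U_X (small radius) are exactly those that succeed on Q_X, matching the rank-preserving argument.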

A.2 IMPORTANCE OF OUTLIER MINING: ANALYSIS IN A GAUSSIAN DATA MODEL

To understand how the outlier training data affect generalization, we study a concrete distributional model inspired by the models in (Schmidt et al., 2018; Carmon et al., 2019). In this model, we establish in Section A.2.2 a separation between the in-distribution sample sizes needed in the two cases: with and without auxiliary OOD data for training. We also demonstrate in Section A.2.3 the benefit of outlier mining when the auxiliary OOD data contain uninformative outliers. While the theoretical model is simple (in fact, much simpler than practical data distributions), its simplicity is actually desirable for our analytical purpose: the separation of the sample sizes under this simple model suggests that the same phenomenon can happen in more complicated models. This means the auxiliary OOD data not only help training but are necessary for obtaining detectors with reasonable performance when in-distribution data is limited.

Gaussian Model. To specify a distributional model for our robust OOD formulation, we need the in-distribution P_X, the family of OOD distributions Q, and the hypothesis class H for the OOD detector G. When auxiliary OOD data are available, we also need to specify their distribution U_X. Let $\mu \in \mathbb{R}^d$ be the mean vector, $\sigma > 0$ be the variance parameter, and $\nu > 0$ be a parameter. In our $(\mu, \sigma, \nu)$-Gaussian model:
• $P_X$ is $N(\mu, \sigma^2 I)$.
• $\mathcal{Q} = \{N(-\mu + v, \sigma^2 I) : v \in \mathbb{R}^d, \|v\|_2 \le \nu\}$.
• $\mathcal{H} = \{G_\theta(x) = \mathrm{sign}(\theta^\top x) : \theta \in \mathbb{R}^d\}$.
Here $G_\theta(x) = 1$ means x is predicted to be an in-distribution example, and $G_\theta(x) = -1$ means it is predicted to be OOD. We are interested in the False Negative Rate FNR(G) and the worst-case False Positive Rate $\sup_{Q_X \in \mathcal{Q}} \mathrm{FPR}(G; Q_X, \Omega_{\infty,\epsilon}(x))$ under ℓ∞ perturbations of magnitude ε. For simplicity, we denote them as FNR(G) and FPR(G; Q_X) in our proofs.

Parameter Setting. The model parameters are set such that:
1. There exists a classifier that achieves very low FPR and FNR.
2. We need n in-distribution data points from P_X to learn a classifier with non-trivial robust errors.
3. Using n_0 in-distribution examples from P_X and n' auxiliary OOD data from U_X, where n_0 is much smaller than n, we can learn a classifier with non-trivial robust errors.
Here n_0, n, n' are sample sizes whose values are specified later in our analysis. The family of instances of the Gaussian data model used for our analysis is as follows. First, fix an integer $n_0 > 0$ and an $\epsilon \in (0, 1/2)$, then set the following parameter values:
$$ d \gtrsim \frac{n_0}{\epsilon^4} + n_0 \log^2 d, \qquad \|\mu\|_2^2 \in \left[\frac{9d}{10}, \frac{11d}{10}\right], \qquad \sigma^2 = \sqrt{d\, n_0}, \qquad \nu \le \frac{\|\mu\|_2}{4}. \quad (15) $$
To interpret the parameter setting, one can view $n_0$, $\epsilon$ as fixed and $d/n_0$ as a large number.

A.2.1 EXISTENCE OF ROBUST CLASSIFIERS

We give closed forms of the errors and show that using θ = µ gives small errors under the parameter setting in equation 15. The calculation largely follows that in (Carmon et al., 2019), with slight modifications.

Closed Forms of the Errors. By definition, the FNR of a detector $G_\theta$ (on P_X) is:
$$ \mathrm{FNR}(G_\theta) = P_{x \sim P_X}[\theta^\top x \le 0] = P\left[N\left(\frac{\mu^\top \theta}{\sigma \|\theta\|_2}, 1\right) \le 0\right] = \bar{\Phi}\left(\frac{\mu^\top \theta}{\sigma \|\theta\|_2}\right) \quad (16) $$
where
$$ \bar{\Phi}(x) := \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt \quad (17) $$
is the Gaussian tail function. Given a test OOD distribution $Q_v = N(-\mu + v, \sigma^2 I)$, the robust FPR of $G_\theta$ on $Q_v$ is:
$$ \mathrm{FPR}(G_\theta; Q_v) = P_{x \sim Q_v}\left[\sup_{\|\delta\|_\infty \le \epsilon} \theta^\top (x + \delta) \ge 0\right] \quad (18) $$
$$ = P_{x \sim Q_v}\left[\theta^\top x + \epsilon \|\theta\|_1 \ge 0\right] \quad (19) $$
$$ = P\left[N\left((v - \mu)^\top \theta,\ (\sigma \|\theta\|_2)^2\right) \ge -\epsilon \|\theta\|_1\right] \quad (20) $$
$$ = \bar{\Phi}\left(\frac{(\mu - v)^\top \theta}{\sigma \|\theta\|_2} - \frac{\epsilon \|\theta\|_1}{\sigma \|\theta\|_2}\right). \quad (21) $$
Then the worst-case robust FPR of $G_\theta$ on Q is:
$$ \sup_{Q_v \in \mathcal{Q}} \mathrm{FPR}(G_\theta; Q_v) = \sup_{\|v\|_2 \le \nu} \bar{\Phi}\left(\frac{(\mu - v)^\top \theta}{\sigma \|\theta\|_2} - \frac{\epsilon \|\theta\|_1}{\sigma \|\theta\|_2}\right) \quad (22) $$
$$ = \bar{\Phi}\left(\frac{\mu^\top \theta}{\sigma \|\theta\|_2} - \frac{\nu}{\sigma} - \frac{\epsilon \|\theta\|_1}{\sigma \|\theta\|_2}\right) \quad (23) $$
$$ \le \bar{\Phi}\left(\frac{\mu^\top \theta}{\sigma \|\theta\|_2} - \frac{\nu}{\sigma} - \frac{\epsilon \sqrt{d}}{\sigma}\right). \quad (24) $$
Small Errors of $G_\mu$. Given the closed forms, we can now show that $G_\mu$ achieves small FNR and FPR in our parameter setting:
$$ \mathrm{FNR}(G_\mu) = \bar{\Phi}\left(\frac{\|\mu\|_2}{\sigma}\right) \le \bar{\Phi}\left(\left(\frac{9}{10}\right)^{1/2} \left(\frac{d}{n_0}\right)^{1/4}\right) \le e^{-\frac{9}{20}\sqrt{d/n_0}}, \quad (25) $$
$$ \sup_{Q_v \in \mathcal{Q}} \mathrm{FPR}(G_\mu; Q_v) \le \bar{\Phi}\left(\frac{\|\mu\|_2}{\sigma} - \frac{\nu}{\sigma} - \frac{\epsilon \sqrt{d}}{\sigma}\right) \quad (26) $$
$$ \le \bar{\Phi}\left(\left(\frac{3}{4}\left(\frac{9}{10}\right)^{1/2} - \epsilon\right) \left(\frac{d}{n_0}\right)^{1/4}\right) \le e^{-\frac{1}{45}\sqrt{d/n_0}}. \quad (27) $$
Therefore, in the regime $d/n_0 \gg 1$, the detector $G_\mu$ achieves both small FNR on P_X and small robust FPR on every test OOD distribution in Q.
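The closed-form expressions can be evaluated directly. The sketch below (our own numerical check; the instantiation $\|\mu\|_2^2 = d$ and $\nu = \|\mu\|_2/4$ follows the parameter setting in equation 15) confirms that both errors of $G_\mu$ vanish as $d/n_0$ grows:

```python
import math

def Phi_bar(x):
    """Gaussian tail function: P[N(0,1) >= x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def errors_of_G_mu(d, n0, eps):
    """Closed-form FNR (eq. 16) and the upper bound (eq. 24) on the worst-case
    robust FPR for theta = mu, with ||mu||_2^2 = d, sigma^2 = sqrt(d*n0),
    nu = ||mu||_2 / 4."""
    mu_norm = math.sqrt(d)
    sigma = (d * n0) ** 0.25
    nu = mu_norm / 4.0
    fnr = Phi_bar(mu_norm / sigma)
    fpr = Phi_bar(mu_norm / sigma - nu / sigma - eps * math.sqrt(d) / sigma)
    return fnr, fpr

fnr1, fpr1 = errors_of_G_mu(d=1000, n0=10, eps=0.1)
fnr2, fpr2 = errors_of_G_mu(d=100000, n0=10, eps=0.1)
assert fnr2 < fnr1 and fpr2 < fpr1   # both errors shrink as d/n0 grows
assert fnr2 < 1e-6 and fpr2 < 1e-3
```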

A.2.2 BENEFIT OF AUXILIARY OOD DATA

We first consider the case where auxiliary OOD data are not available and give a lower bound (Proposition 3). We then consider the case where auxiliary OOD data are used and give an upper bound (Proposition 4). Comparing Proposition 3 and Proposition 4 then justifies the benefit of the auxiliary OOD data.

Learning Without Auxiliary OOD Data. Given in-distribution data $x_1, x_2, \ldots, x_n$, we consider the detector $G_{\hat{\theta}_n}$ given by $\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$. Next we show a lower bound on the in-distribution data needed in the case without auxiliary outliers: a sample size of order $n_0 \cdot 2^{\epsilon^2\sqrt{d/n_0}} / \log d$ is necessary for any algorithm to obtain both non-trivial robust FPR and FNR. We emphasize that this lower bound is information-theoretic, i.e., it holds without restrictions on the computational power of the learning algorithm or on the hypothesis class used for the OOD detector.

Proposition 3. (Bound without auxiliary data) Consider the same family of instances as in Proposition 1, without any auxiliary OOD data. If $n \le \frac{n_0 \cdot 2^{\epsilon^2\sqrt{d/n_0}}}{16 \log d}$, then for any algorithm $A_n$, there exists an instance in the family with
$$ \mathbb{E}\left[\mathrm{FNR}(A_n(S)) + \sup_{Q_X \in \mathcal{Q}} \mathrm{FPR}(A_n(S); Q_X, \Omega_{\infty,\epsilon}(x))\right] \ge \frac{1}{4}\left(1 - d^{-1}\right). $$
Proof. The key observation is that robust classification is a special case of our robust OOD problem. More precisely, consider the following robust classification problem. The data (x, y) with $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$ is generated as follows: first draw y uniformly at random, and then draw x from $N(y \cdot \mu, \sigma^2 I)$. Given training data $\{(x_i, y_i)\}_{i=1}^n$, the goal is to find a classifier $f_\theta(x) = \mathrm{sign}(\theta^\top x)$ with small robust classification error
$$ \mathrm{err}_{\infty,\epsilon}(f_\theta) = \mathbb{E}_{(x,y)} \max_{\|\delta\|_\infty \le \epsilon} \mathbb{I}[f_\theta(x + \delta) \ne y] $$
under ℓ∞ perturbations of magnitude ε.
It has been shown (Theorem 6 in (Schmidt et al., 2018), or Theorem 1 in (Carmon et al., 2019)) that when $\mu \sim N(0, I)$ and $n \le \frac{n_0 \cdot 2^{\epsilon^2\sqrt{d/n_0}}}{8 \log d}$, under the parameter setting in equation 15, for any learning algorithm $A_n$,
$$ \mathbb{E}\, \mathrm{err}_{\infty,\epsilon}(A_n(S)) \ge \frac{1}{2}\left(1 - d^{-1}\right). $$
Now consider the following variant of the robust OOD problem in the proposition. Suppose $\mu \sim N(0, I)$, and suppose that besides the data from P_X, we also have n i.i.d. samples from a test OOD distribution $Q_0 = N(-\mu, \sigma^2 I)$. Then the robust classification problem above can be reduced to this variant of robust OOD detection by viewing the in-distribution data as having label +1 and the outliers as having label -1. Furthermore, it is clear that the sum of the FNR and FPR is larger than the robust classification error. Then
$$ \mathbb{E}\left\{\mathrm{FNR}(A_n(S)) + \mathrm{FPR}(A_n(S); Q_0)\right\} \ge \frac{1}{2}\left(1 - d^{-1}\right). $$
When d is sufficiently large, µ satisfies the condition in equation 15 with probability at least 9/10. Then
$$ \mathbb{E}\left[\mathrm{FNR}(A_n(S)) + \mathrm{FPR}(A_n(S); Q_0) \,\middle|\, \|\mu\|_2^2 \in (9d/10, 11d/10)\right] \ge \frac{1}{4}\left(1 - d^{-1}\right). $$
After conditioning, this variant can be reduced to the original robust OOD problem in the proposition, and furthermore $Q_0 \in \mathcal{Q}$. Moreover, the fact that the expectation over the conditional distribution of µ is large implies that there exists an instance of µ with a large error. The statement then follows.

Learning With Auxiliary OOD Data. Assume we have access to auxiliary OOD data from a distribution U_X, where:
• U_X is defined by the following process: first draw v uniformly at random from the ball $\{v \in \mathbb{R}^d : \|v\|_2 \le \nu\}$, then draw x from $N(-\mu + v, \sigma^2 I)$.
Roughly speaking, U_X is a uniform mixture of the distributions in Q. Given in-distribution data $x_1, x_2, \ldots, x_n$ from P_X and auxiliary OOD data $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{n'}$ from U_X, we consider the detector $G_{\hat{\theta}_{n,n'}}$ given by
$$ \hat{\theta}_{n,n'} = \frac{1}{n + n'}\left(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n'} \tilde{x}_i\right). \quad (33) $$
We will show that with $n = n_0$ and sufficiently large $n'$, the detector has small errors.
Again, as the closed forms show, the key quantity determining the errors is $\frac{\mu^\top \hat{\theta}_{n,n'}}{\sigma \|\hat{\theta}_{n,n'}\|_2}$. The following lemma bounds this term.

Lemma 1. There exist numerical constants $c_0, c_1, c_2$ such that under the parameter setting in equation 15 and $d/n_0 > c_0$,
$$ \frac{\mu^\top \hat{\theta}_{n,n'}}{\sigma \|\hat{\theta}_{n,n'}\|_2} \ge \frac{9}{10} \left[\left(\sqrt{\frac{n_0}{d}} + \frac{n_0}{n + n'}\right)\left(1 + c_1 \left(\frac{n_0}{d}\right)^{1/8}\right)\right]^{-1/2} \quad (34) $$
with probability $\ge 1 - e^{-c_2 (d/n_0)^{1/4} \min\{n + n',\, (d/n_0)^{1/4}\}} - e^{-c_2 n'}$.

Proof. The proof follows the argument of Lemma 1 in (Carmon et al., 2019), with modifications accommodating the difference in how θ is learned. Recall the generation of $\tilde{x}_i$: first draw $v_i$ uniformly at random from the ball $B(\nu) := \{v \in \mathbb{R}^d : \|v\|_2 \le \nu\}$, then draw $x'_i$ from $N(\mu, \sigma^2 I)$, and finally let $\tilde{x}_i = v_i - x'_i$. So we have
$$ \hat{\theta}_{n,n'} = \frac{1}{n + n'}\left(\sum_{i=1}^{n} x_i + \sum_{i=1}^{n'} x'_i\right) - \frac{1}{n + n'} \sum_{i=1}^{n'} v_i = \mu + \delta + \delta_v \quad (35\text{-}36) $$
where
$$ \delta = \frac{1}{n + n'}\left(\sum_{i=1}^{n} x_i + \sum_{i=1}^{n'} x'_i\right) - \mu \sim N\left(0, \frac{\sigma^2}{n + n'} I\right), \qquad \delta_v = -\frac{1}{n + n'} \sum_{i=1}^{n'} v_i. $$
To lower bound the term $\frac{\mu^\top \hat{\theta}_{n,n'}}{\|\hat{\theta}_{n,n'}\|_2}$, we upper bound its squared inverse:
$$ \frac{\|\hat{\theta}_{n,n'}\|_2^2}{(\mu^\top \hat{\theta}_{n,n'})^2} = \frac{\|\mu + \delta + \delta_v\|_2^2}{\left(\|\mu\|_2^2 + \mu^\top \delta + \mu^\top \delta_v\right)^2} \quad (39) $$
$$ = \frac{1}{\|\mu\|_2^2} + \frac{\|\delta + \delta_v\|_2^2 - \frac{1}{\|\mu\|_2^2}\left(\mu^\top \delta + \mu^\top \delta_v\right)^2}{\left(\|\mu\|_2^2 + \mu^\top \delta + \mu^\top \delta_v\right)^2} \quad (40) $$
$$ \le \frac{1}{\|\mu\|_2^2} + \frac{2\|\delta\|_2^2 + 2\|\delta_v\|_2^2}{\left(\|\mu\|_2^2 + \mu^\top \delta + \mu^\top \delta_v\right)^2}. \quad (41) $$
For δ, we have
$$ \|\delta\|_2^2 \sim \frac{\sigma^2}{n + n'} \chi^2_d \quad \text{and} \quad \frac{\mu^\top \delta}{\|\mu\|_2} \sim N\left(0, \frac{\sigma^2}{n + n'}\right). \quad (42) $$
So standard concentration bounds give
$$ P\left[\|\delta\|_2^2 \ge \frac{\sigma^2}{n + n'}\, d\left(1 + \frac{1}{\sigma}\right)\right] \le e^{-d/(8\sigma^2)} \quad \text{and} \quad P\left[\frac{|\mu^\top \delta|}{\|\mu\|_2} \ge (\sigma \|\mu\|_2)^{1/2}\right] \le 2e^{-(n + n')\|\mu\|_2/(2\sigma)}. \quad (43) $$
For $\delta_v$, sub-Gaussian concentration bounds give $P\left[\|\delta_v\|_2 \ge \frac{C\nu}{\sqrt{n'}}\right] \le e^{-cn'}$ for some numerical constants c and C. Suppose the event $\|\delta_v\|_2 < \frac{C\nu}{\sqrt{n'}}$ holds. Then
$$ |\mu^\top \delta_v| \le \|\mu\|_2 \|\delta_v\|_2 \le \frac{C\nu \|\mu\|_2}{\sqrt{n'}}. \quad (44) $$
Plugging the concentration bounds into equation 39 and carrying out the same manipulation leads to the bound. To finish the proof, we also need to show $\mu^\top \hat{\theta}_{n,n'} > 0$, which follows from a similar argument. We then get the following guarantee.
Again, the error bound $10^{-3}$ is chosen for simplicity of the statement; it can be made arbitrarily small.

Proposition 4. (Error bound with ideal auxiliary data) Consider the same family of instances as in Proposition 1, with $n'$ auxiliary OOD data from the distribution U_X specified above. If $n \ge n_0$ and $n' \ge n_0 \cdot 4^{\epsilon^2\sqrt{d/n_0}}$, then
$$ \mathbb{E}_{\hat{\theta}_{n,n'}}\, \mathrm{FNR}(G_{\hat{\theta}_{n,n'}}) \le 10^{-3}, \qquad \mathbb{E}_{\hat{\theta}_{n,n'}} \sup_{Q_X \in \mathcal{Q}} \mathrm{FPR}(G_{\hat{\theta}_{n,n'}}; Q_X, \Omega_{\infty,\epsilon}(x)) \le 10^{-3}. $$
Proof. The proposition follows from Lemma 1, the parameter setting in equation 15, and the closed-form expressions (16) and (22) of the errors.
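The benefit of the auxiliary data can be illustrated with a small simulation (our own sketch; the sample sizes are toy values, the detector is the one in equation 33, and we use the closed-form FNR of equation 16 as the error measure):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
d, n0 = 1000, 5
sigma = (d * n0) ** 0.25
mu = rng.standard_normal(d)
mu *= math.sqrt(d) / np.linalg.norm(mu)        # ||mu||_2^2 = d

n, n_aux = n0, 5000                             # few ID samples, many outliers
x_in = mu + sigma * rng.standard_normal((n, d))          # from P_X
x_aux = -mu + sigma * rng.standard_normal((n_aux, d))    # from U_X (with v = 0)

theta_no_aux = x_in.mean(axis=0)                                   # ID data only
theta_aux = (x_in.sum(axis=0) - x_aux.sum(axis=0)) / (n + n_aux)   # eq. (33)

def fnr(theta):
    """Closed-form FNR (eq. 16): Gaussian tail of mu^T theta / (sigma ||theta||_2)."""
    key = float(mu @ theta) / (sigma * float(np.linalg.norm(theta)))
    return 0.5 * math.erfc(key / math.sqrt(2.0))

# With only n0 = 5 in-distribution points the detector is poorly aligned with mu;
# adding auxiliary OOD data recovers a nearly optimal detector.
assert fnr(theta_aux) < 1e-3 < fnr(theta_no_aux)
```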

A.2.3 BENEFIT OF OUTLIER MINING

The Gaussian example above shows the benefit of having auxiliary OOD data for training. All the auxiliary OOD data in that example are implicitly related to the ideal detector parameter θ* = µ and are thus informative for learning the detector. However, this may not be the case in practice: typically only part of the auxiliary OOD data are informative, while the rest are not very useful and can even be harmful for learning. In this section, we study such an example and show how outlier mining can help to identify informative data and improve the learning performance.

Suppose the algorithm gets n in-distribution data points $\{x_1, x_2, \ldots, x_n\}$ i.i.d. from P_X and $n'$ auxiliary OOD data points $\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{n'}\}$ for training. Instead of coming from the U_X specified above, the auxiliary OOD data are i.i.d. from the distribution U_mix.
• U_mix is a uniform mixture of $N(-\mu, \sigma^2 I)$ and $N(\mu_o, \sigma^2 I)$ for $\mu_o = 10\mu$. That is, the distribution is defined by the following process: with probability 1/2, sample the outlier from the informative part $N(-\mu, \sigma^2 I)$, and with probability 1/2, sample the outlier from the uninformative part $N(\mu_o, \sigma^2 I)$.
We note that $\mu_o = 10\mu$ is chosen for simplicity of the analysis. $\mu_o$ can also be $c\mu$ for some sufficiently large $c > 1$, or even $\mu_o = c\mu + c'\mu_\perp$ for a sufficiently large $c > 1$, a small $c'$, and a unit vector $\mu_\perp$ perpendicular to µ. Our analysis still goes through under such assumptions.

Naïve Method Without Outlier Mining. Naïvely applying the method of the previous section can lead to high errors: with n in-distribution examples from P_X and $n' = n$ auxiliary OOD data from U_mix, when $n \to \infty$, we have $\hat{\theta}_{n,n'} \to -7\mu/4$, which has the worst errors among all detectors.

With Outlier Mining. Here we analyze the following algorithm based on the outlier mining approach. The algorithm is simpler than the one used in Section 4 but shares the same intuition.
First, we use the n in-distribution data points to get an intermediate solution:
$$ \hat{\theta}_{\mathrm{int}} = \frac{1}{n} \sum_{i=1}^{n} x_i. \quad (48) $$
We define the confidence score of a point x being in-distribution as
$$ f(x) = \sigma(t(x)), \quad \text{where} \quad t(x) = \frac{x^\top \hat{\theta}_{\mathrm{int}}}{d} $$
and $\sigma(t) = \frac{1}{1 + e^{-t}}$ is the sigmoid function. We then select the outlier training data whose confidence falls into an interval [a, b] and use them to learn the final solution:
$$ \hat{\theta}_{\mathrm{om}} = \frac{\sum_{i=1}^{n'} (-\tilde{x}_i)\, \mathbb{I}\{f(\tilde{x}_i) \in [a, b]\}}{\sum_{i=1}^{n'} \mathbb{I}\{f(\tilde{x}_i) \in [a, b]\}} \quad (49) $$
where $\mathbb{I}\{\cdot\}$ is the indicator function.

Proposition 1. (Error bound with outlier mining) For any $\epsilon \in (0, 1/2)$ and any integer $n_0 > 0$, there exists a family of instances of the Gaussian data model such that the following is true. Suppose we have $n'$ auxiliary OOD data from the U_mix specified above. There exist thresholds a and b for $\hat{\theta}_{\mathrm{om}}$ and a universal constant $c > 0$ such that if the number of in-distribution data points $n \ge c(n_0 \log d + \sqrt{d n_0})$ and the number of auxiliary data points $n' \ge d + n_0 \cdot 4^{\epsilon^2\sqrt{d/n_0}}$, then $\hat{\theta}_{\mathrm{om}}$ has small errors:
$$ \mathbb{E}_{\hat{\theta}_{\mathrm{om}}}\, \mathrm{FNR}(G_{\hat{\theta}_{\mathrm{om}}}) \le 10^{-3}, \qquad \mathbb{E}_{\hat{\theta}_{\mathrm{om}}} \sup_{Q_X \in \mathcal{Q}} \mathrm{FPR}(G_{\hat{\theta}_{\mathrm{om}}}; Q_X, \Omega_{\infty,\epsilon}(x)) \le 10^{-3}. \quad (3) $$
Proof. Let $a = \sigma(-3/2)$, $b = \sigma(-1/2)$. By definition we have
$$ \delta_{\mathrm{om}} := \hat{\theta}_{\mathrm{om}} - \mu = \frac{\sum_{i=1}^{n'} (-\mu - \tilde{x}_i)\, \mathbb{I}\{f(\tilde{x}_i) \in [a, b]\}}{\sum_{i=1}^{n'} \mathbb{I}\{f(\tilde{x}_i) \in [a, b]\}}. $$
By the closed-form expressions (16) and (22) of the errors, it suffices to lower bound the key term $\frac{\mu^\top \hat{\theta}_{\mathrm{om}}}{\|\hat{\theta}_{\mathrm{om}}\|_2}$, which comes down to showing that $\delta_{\mathrm{om}}$ is small.

First, consider $\hat{\theta}_{\mathrm{int}}$ and let $\delta_{\mathrm{int}} := \hat{\theta}_{\mathrm{int}} - \mu$. Then $\|\delta_{\mathrm{int}}\|_2^2 \sim \frac{\sigma^2}{n} \chi^2_d$ and $\frac{\mu^\top \delta_{\mathrm{int}}}{\|\mu\|_2} \sim N(0, \frac{\sigma^2}{n})$. So standard concentration bounds give
$$ P\left[\|\delta_{\mathrm{int}}\|_2^2 \ge \frac{\sigma^2}{n}\, d\left(1 + \frac{1}{\sigma}\right)\right] \le e^{-d/(8\sigma^2)} \quad \text{and} \quad P\left[\frac{|\mu^\top \delta_{\mathrm{int}}|}{\|\mu\|_2} \ge \sqrt{\frac{d}{n}}\right] \le 2e^{-d/(2\sigma^2)}. $$
So with probability $\ge 1 - 3e^{-d/(8\sigma^2)}$ over the randomness of the n in-distribution points, the good event $\mathcal{G}_{\mathrm{int}}$ holds: $\|\delta_{\mathrm{int}}\|_2^2 \le \frac{\sigma^2}{n} d(1 + \frac{1}{\sigma})$ and $\frac{|\mu^\top \delta_{\mathrm{int}}|}{\|\mu\|_2} \le \sqrt{\frac{d}{n}}$. Now, condition on a fixed $\hat{\theta}_{\mathrm{int}}$ satisfying $\mathcal{G}_{\mathrm{int}}$, and consider $\hat{\theta}_{\mathrm{om}}$.
Define $z_i := -\mu - \tilde{x}_i$, $\mathbb{I}_{0i} := \mathbb{I}\{f(\tilde{x}_i) \in [a, b]\}$, $\mathbb{I}_{1i} := \mathbb{I}\{\tilde{x}_i \text{ is from } N(-\mu, \sigma^2 I)\}$, and $\mathbb{I}_{2i} := \mathbb{I}\{\tilde{x}_i \text{ is from } N(\mu_o, \sigma^2 I)\}$. For simplicity, we omit the subscript i and consider a sample $\tilde{x}$ from U_mix with the corresponding variables $z, \mathbb{I}_0, \mathbb{I}_1, \mathbb{I}_2$. Since $\mathbb{I}_1 + \mathbb{I}_2 = 1$,
$$ (-\mu - \tilde{x})\, \mathbb{I}\{f(\tilde{x}) \in [a, b]\} = z\mathbb{I}_0\mathbb{I}_1 + z\mathbb{I}_0\mathbb{I}_2. $$
Case 1. First consider the case when $\tilde{x}$ is from $N(-\mu, \sigma^2 I)$; more precisely, condition on a fixed $\hat{\theta}_{\mathrm{int}}$ and on $\mathbb{I}_1 = 1$. Then $z \sim N(0, \sigma^2 I)$, and it can be decomposed along the direction $\bar{\theta}_{\mathrm{int}} := \hat{\theta}_{\mathrm{int}} / \|\hat{\theta}_{\mathrm{int}}\|_2$ as
$$ z = s \cdot \bar{\theta}_{\mathrm{int}} + z_\perp \quad (58) $$
where $s \sim N(0, \sigma^2)$ and $z_\perp$ is Gaussian in the subspace orthogonal to $\bar{\theta}_{\mathrm{int}}$. Then
$$ t(\tilde{x}) = \frac{\tilde{x}^\top \hat{\theta}_{\mathrm{int}}}{d} = -\frac{\mu^\top \hat{\theta}_{\mathrm{int}}}{d} - \frac{s \|\hat{\theta}_{\mathrm{int}}\|_2}{d}. $$
Therefore, we have
$$ \mathbb{E}[z\mathbb{I}_0\mathbb{I}_1 \mid \mathbb{I}_1 = 1, \hat{\theta}_{\mathrm{int}}] = \mathbb{E}[s \cdot \bar{\theta}_{\mathrm{int}} \mathbb{I}_0 \mid \mathbb{I}_1 = 1, \hat{\theta}_{\mathrm{int}}] + \mathbb{E}[z_\perp \mathbb{I}_0 \mid \mathbb{I}_1 = 1, \hat{\theta}_{\mathrm{int}}]. $$
The second term is 0 since $z_\perp \mathbb{I}_0$ is symmetric. So
$$ \mathbb{E}[z\mathbb{I}_0\mathbb{I}_1 \mid \mathbb{I}_1 = 1, \hat{\theta}_{\mathrm{int}}] = \mathbb{E}\left[s \cdot \mathbb{I}\{s \in [a', b']\}\right] \cdot \bar{\theta}_{\mathrm{int}} \quad (61\text{-}63) $$
where
$$ a' = \frac{-\mu^\top \hat{\theta}_{\mathrm{int}} - \sigma^{-1}(b)\, d}{\|\hat{\theta}_{\mathrm{int}}\|_2} = \frac{-2\mu^\top \hat{\theta}_{\mathrm{int}} + d}{2\|\hat{\theta}_{\mathrm{int}}\|_2} = \frac{-2\mu^\top \delta_{\mathrm{int}} - 2\|\mu\|_2^2 + d}{2\|\hat{\theta}_{\mathrm{int}}\|_2}, \quad (64\text{-}66) $$
$$ b' = \frac{-\mu^\top \hat{\theta}_{\mathrm{int}} - \sigma^{-1}(a)\, d}{\|\hat{\theta}_{\mathrm{int}}\|_2} = \frac{-2\mu^\top \hat{\theta}_{\mathrm{int}} + 3d}{2\|\hat{\theta}_{\mathrm{int}}\|_2} = \frac{-2\mu^\top \delta_{\mathrm{int}} - 2\|\mu\|_2^2 + 3d}{2\|\hat{\theta}_{\mathrm{int}}\|_2}. \quad (67\text{-}69) $$
By the bound on $|\mu^\top \delta_{\mathrm{int}}|$, we have
$$ |\mathbb{E}[s \cdot \mathbb{I}\{s \in [a', b']\}]| \le \int_{\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(8/10 - 4/\sqrt{n})}^{\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(12/10 + 4/\sqrt{n})} \sigma t \cdot \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt \quad (70) $$
$$ \le \frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2} \cdot \sigma \cdot \frac{2d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{d}{4\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}\right)^2}. \quad (71) $$
Given the bound on $\|\delta_{\mathrm{int}}\|_2^2$, we have
$$ \|\hat{\theta}_{\mathrm{int}}\|_2 \le \|\mu\|_2 + \|\delta_{\mathrm{int}}\|_2 \le \|\mu\|_2 + \sqrt{\frac{\sigma^2}{n} d\left(1 + \frac{1}{\sigma}\right)} \le \|\mu\|_2 + \sqrt{\frac{2\sigma^2 d}{n}}, $$
$$ \|\hat{\theta}_{\mathrm{int}}\|_2 \ge \|\mu\|_2 - \|\delta_{\mathrm{int}}\|_2 \ge \|\mu\|_2 - \sqrt{\frac{\sigma^2}{n} d\left(1 + \frac{1}{\sigma}\right)} \ge \|\mu\|_2 - \sqrt{\frac{2\sigma^2 d}{n}}. $$
Since $n \ge C n_0 \log d$ and $d \ge C^2 n_0 \log^2 d$ for a sufficiently large C, we have
$$ \frac{\sigma^2 \|\hat{\theta}_{\mathrm{int}}\|_2^2}{d^2} \le \frac{\sigma^2 d\,(11/10 + 2\sigma^2/n)}{d^2} \le 4\sqrt{\frac{n_0}{d}} + \frac{8n_0}{n} \le \frac{12}{C \log d} \quad (74) $$
and thus
$$ |\mathbb{E}[s \cdot \mathbb{I}\{s \in [a', b']\}]| \le \frac{d^2}{\sigma \|\hat{\theta}_{\mathrm{int}}\|_2^2} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{d}{4\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}\right)^2} \le \frac{d^2}{\sigma \|\hat{\theta}_{\mathrm{int}}\|_2^2} e^{-\frac{d^2}{32\sigma^2\|\hat{\theta}_{\mathrm{int}}\|_2^2}} \le \frac{1}{d^2}. \quad (75\text{-}77) $$
Combining this with $\mathbb{E}[z\mathbb{I}_0\mathbb{I}_1 \mid \mathbb{I}_1 = 0, \hat{\theta}_{\mathrm{int}}] = 0$, we get
$$ \mathbb{E}[z\mathbb{I}_0\mathbb{I}_1 \mid \hat{\theta}_{\mathrm{int}}] = c_1 \cdot \bar{\theta}_{\mathrm{int}} \quad (78) $$
for some $c_1$ with $|c_1| \le 1/d^2$. Furthermore, $z\mathbb{I}_0\mathbb{I}_1 \mid \hat{\theta}_{\mathrm{int}}$ is truncated Gaussian and thus sub-Gaussian with sub-Gaussian norm bounded by σ. Then sub-Gaussian concentration bounds give
$$ P\left[\Big|\sum_{i=1}^{n'} \mu^\top z_i \mathbb{I}_{0i}\mathbb{I}_{1i} - \sum_{i=1}^{n'} \mu^\top \mathbb{E}[z_i \mathbb{I}_{0i}\mathbb{I}_{1i} \mid \hat{\theta}_{\mathrm{int}}]\Big| \ge \sqrt{n'}\, d \,\Big|\, \hat{\theta}_{\mathrm{int}}\right] \le e^{-cd/\sigma^2}, \quad (79) $$
$$ P\left[\Big\|\sum_{i=1}^{n'} z_i \mathbb{I}_{0i}\mathbb{I}_{1i}\Big\|_2 \ge 4\sigma\sqrt{n'd} + \frac{2\sqrt{n'}}{d} \,\Big|\, \hat{\theta}_{\mathrm{int}}\right] \le e^{-d/\sigma^2} \quad (80) $$
for some constant $c > 0$. In other words, with probability $\ge 1 - 2e^{-cd/\sigma^2}$, we have
$$ \Big|\sum_{i=1}^{n'} \mu^\top z_i \mathbb{I}_{0i}\mathbb{I}_{1i}\Big| \le \sqrt{n'}\, d + \frac{n'}{d^{3/2}}, \qquad \Big\|\sum_{i=1}^{n'} z_i \mathbb{I}_{0i}\mathbb{I}_{1i}\Big\|_2 \le 6\sigma\sqrt{n'd}. \quad (81\text{-}82) $$
Conditioned on $\mathbb{I}_1 = 1$, we also have
$$ \mathbb{E}[\mathbb{I}_0\mathbb{I}_1 \mid \mathbb{I}_1 = 1, \hat{\theta}_{\mathrm{int}}] = P(s \in [a', b']) \ge 1 - 2\int_{\frac{d}{4\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt \ge 1 - 2\int_{\sqrt{\frac{C \log d}{12}}}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt \ge 1 - \frac{1}{d}. \quad (83\text{-}87) $$
Let $m = \frac{n'}{2}\left(1 - \frac{1}{d}\right)$. Then by Chernoff's bound, we have
$$ P\left[\Big|\sum_{i=1}^{n'} \mathbb{I}_{0i}\mathbb{I}_{1i} - m\Big| \ge \frac{1}{2} m\right] \le e^{-c'm} $$
for an absolute constant $c' > 0$. That is, with probability $\ge 1 - e^{-cn'}$, we have $\sum_{i=1}^{n'} \mathbb{I}_{0i}\mathbb{I}_{1i} \ge n'/5$.

Case 2. Next, consider the case when $\tilde{x}$ is from $N(\mu_o, \sigma^2 I)$; more precisely, condition on a fixed $\hat{\theta}_{\mathrm{int}}$ and on $\mathbb{I}_2 = 1$. Similarly to Case 1, we have
$$ z = -11\mu + s \cdot \bar{\theta}_{\mathrm{int}} + z_\perp \quad (89) $$
where $s \sim N(0, \sigma^2)$ and $z_\perp$ is Gaussian in the subspace orthogonal to $\bar{\theta}_{\mathrm{int}}$. So
$$ \mathbb{E}[(z + 11\mu)\mathbb{I}_0\mathbb{I}_2 \mid \mathbb{I}_2 = 1, \hat{\theta}_{\mathrm{int}}] = \mathbb{E}[s\mathbb{I}_0 \mid \mathbb{I}_2 = 1, \hat{\theta}_{\mathrm{int}}] \cdot \bar{\theta}_{\mathrm{int}} = \mathbb{E}\left[s \cdot \mathbb{I}\{s \in [a'', b'']\}\right] \cdot \bar{\theta}_{\mathrm{int}}, \quad (90\text{-}91) $$
where
$$ a'' = \frac{-20\mu^\top \delta_{\mathrm{int}} - 20\|\mu\|_2^2 + d}{2\|\hat{\theta}_{\mathrm{int}}\|_2}, \qquad b'' = \frac{-20\mu^\top \delta_{\mathrm{int}} - 20\|\mu\|_2^2 + 3d}{2\|\hat{\theta}_{\mathrm{int}}\|_2}. \quad (92\text{-}93) $$
By the bounds on $|\mu^\top \delta_{\mathrm{int}}|$ and $\|\delta_{\mathrm{int}}\|_2$, we have
$$ |\mathbb{E}[s \cdot \mathbb{I}\{s \in [a'', b'']\}]| \le \int_{-\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(21 + 20/\sqrt{n})}^{-\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(15 - 20/\sqrt{n})} \sigma |t| \cdot \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt \quad (94) $$
$$ \le \frac{8d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2} \cdot \sigma \cdot \frac{22d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{14d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}\right)^2} \le \frac{44d^2}{\sigma\|\hat{\theta}_{\mathrm{int}}\|_2^2}\, e^{-\frac{20d^2}{\sigma^2\|\hat{\theta}_{\mathrm{int}}\|_2^2}} \le \frac{1}{d^2}. \quad (95\text{-}97) $$
We also have
$$ \mathbb{E}[\mathbb{I}_0\mathbb{I}_2 \mid \mathbb{I}_2 = 1, \hat{\theta}_{\mathrm{int}}] \le \int_{-\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(21 + 20/\sqrt{n})}^{-\frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}(15 - 20/\sqrt{n})} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt \le \frac{d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}\left(6 + \frac{40}{\sqrt{n}}\right) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{14d}{2\sigma\|\hat{\theta}_{\mathrm{int}}\|_2}\right)^2} \le \frac{1}{d^3}. \quad (98\text{-}101) $$
Combining the above, $\mathbb{E}[(z + 11\mu)\mathbb{I}_0\mathbb{I}_2 \mid \hat{\theta}_{\mathrm{int}}] = c_1' \cdot \bar{\theta}_{\mathrm{int}}$ for a constant $c_1'$ with $|c_1'| \le 1/d^2$. Furthermore, $(z + 11\mu)\mathbb{I}_0\mathbb{I}_2 \mid \hat{\theta}_{\mathrm{int}}$ is truncated Gaussian and thus sub-Gaussian with sub-Gaussian norm bounded by σ. Sub-Gaussian concentration bounds then give
$$ P\left[\Big|\sum_{i=1}^{n'} \mu^\top (z_i + 11\mu)\mathbb{I}_{0i}\mathbb{I}_{2i} - \sum_{i=1}^{n'} \mu^\top \mathbb{E}[(z_i + 11\mu)\mathbb{I}_{0i}\mathbb{I}_{2i} \mid \hat{\theta}_{\mathrm{int}}]\Big| \ge \sqrt{n'}\, d \,\Big|\, \hat{\theta}_{\mathrm{int}}\right] \le e^{-cd/\sigma^2}, \quad (103) $$
$$ P\left[\Big\|\sum_{i=1}^{n'} (z_i + 11\mu)\mathbb{I}_{0i}\mathbb{I}_{2i}\Big\|_2 \ge 4\sigma\sqrt{n'd} + \frac{2\sqrt{n'}}{d} \,\Big|\, \hat{\theta}_{\mathrm{int}}\right] \le e^{-d/\sigma^2} \quad (104) $$
for some constant $c > 0$. Also, by Hoeffding's bound, we have
$$ P\left[\Big|\sum_{i=1}^{n'} \mathbb{I}_{0i}\mathbb{I}_{2i} - \sum_{i=1}^{n'} \mathbb{E}[\mathbb{I}_{0i}\mathbb{I}_{2i} \mid \hat{\theta}_{\mathrm{int}}]\Big| \ge \frac{\sqrt{n'd}}{\sigma} \,\Big|\, \hat{\theta}_{\mathrm{int}}\right] \le 2e^{-2d/\sigma^2}. \quad (105) $$
In other words, with probability $\ge 1 - 4e^{-cd/\sigma^2}$, we have
$$ \Big|\sum_{i=1}^{n'} \mu^\top z_i \mathbb{I}_{0i}\mathbb{I}_{2i}\Big| \le \sqrt{n'}\, d + \frac{22d\sqrt{n'd}}{\sigma} + \frac{n'}{d^{3/2}} + \frac{22n'}{d^2} \le \sqrt{n'}\, d\left(1 + \frac{22\sqrt{d}}{\sigma}\right) + \frac{2n'}{d^{3/2}}, \quad (106) $$
$$ \Big\|\sum_{i=1}^{n'} z_i \mathbb{I}_{0i}\mathbb{I}_{2i}\Big\|_2 \le 6\sigma\sqrt{n'd} + \frac{22\sqrt{d}\sqrt{n'd}}{\sigma} + \frac{22n'}{d^{5/2}} \le \sqrt{n'd}\left(6\sigma + \frac{22\sqrt{d}}{\sigma}\right) + \frac{22n'}{d^{5/2}}. \quad (107) $$
Combining equations (79), (80), (88) and equations (106), (107), we get that with probability $\ge 1 - Ce^{-cd/\sigma^2}$,
$$ |\mu^\top \delta_{\mathrm{om}}| \le C\frac{d}{\sqrt{n'}}\left(1 + \frac{22\sqrt{d}}{\sigma}\right) + \frac{C}{d^{3/2}}, \qquad \|\delta_{\mathrm{om}}\|_2 \le C\sqrt{\frac{d}{n'}}\left(6\sigma + \frac{22\sqrt{d}}{\sigma}\right) + \frac{C}{d^{5/2}}. \quad (108\text{-}109) $$
Then $\frac{\mu^\top \hat{\theta}_{\mathrm{om}}}{\|\hat{\theta}_{\mathrm{om}}\|_2}$ can be lower bounded by
$$ \frac{\mu^\top \hat{\theta}_{\mathrm{om}}}{\|\hat{\theta}_{\mathrm{om}}\|_2} = \frac{\mu^\top \mu + \mu^\top \delta_{\mathrm{om}}}{\|\mu + \delta_{\mathrm{om}}\|_2} \ge \frac{\|\mu\|_2^2 + \mu^\top \delta_{\mathrm{om}}}{\|\mu\|_2 + \|\delta_{\mathrm{om}}\|_2} \ge \frac{d(1 - 1/\sqrt{d})}{\sqrt{d}(1 + 1/\sqrt{d})} \ge \sqrt{d}\left(1 - \frac{2}{\sqrt{d}}\right). \quad (110\text{-}113) $$
The proof is completed by plugging this into the closed-form expressions (16) and (22) of the errors.
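The mining step of equations (48)–(49) is easy to simulate (our own sketch with toy dimensions; the thresholds $a = \sigma(-3/2)$, $b = \sigma(-1/2)$ are the ones from the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n0 = 500, 20
sigma = (d * n0) ** 0.25
mu = rng.standard_normal(d)
mu *= np.sqrt(d) / np.linalg.norm(mu)          # ||mu||_2^2 = d

n, n_aux = 1000, 8000
x_in = mu + sigma * rng.standard_normal((n, d))
# U_mix: informative N(-mu, s^2 I) w.p. 1/2, uninformative N(10*mu, s^2 I) w.p. 1/2
centers = np.where(rng.random((n_aux, 1)) < 0.5, -mu, 10.0 * mu)
x_aux = centers + sigma * rng.standard_normal((n_aux, d))

def cos_to_mu(theta):
    return float(mu @ theta / (np.linalg.norm(mu) * np.linalg.norm(theta)))

# Naive detector (eq. 33) on U_mix: dragged toward the uninformative mode.
theta_naive = (x_in.sum(axis=0) - x_aux.sum(axis=0)) / (n + n_aux)

# Outlier mining (eqs. 48-49): score outliers with the intermediate solution
# and keep only those with mid-range confidence f in [sigmoid(-3/2), sigmoid(-1/2)].
theta_int = x_in.mean(axis=0)                         # eq. (48)
conf = 1.0 / (1.0 + np.exp(-(x_aux @ theta_int) / d))
a, b = 1.0 / (1.0 + np.e ** 1.5), 1.0 / (1.0 + np.e ** 0.5)
mined = x_aux[(conf >= a) & (conf <= b)]
theta_om = -mined.mean(axis=0)                        # eq. (49)

assert cos_to_mu(theta_naive) < 0       # naive solution points away from mu
assert cos_to_mu(theta_om) > 0.9        # mined solution is well aligned with mu
```

Without mining, the uninformative mode at $10\mu$ dominates the average and flips the detector; the confidence interval $[a, b]$ keeps essentially only the informative outliers.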

B DETAILS OF EXPERIMENTS B.1 EXPERIMENTAL SETTINGS

Software and Hardware. We run all experiments with PyTorch and NVIDIA GeForce RTX 2080Ti GPUs. Number of Evaluation Runs. We run all experiments once with fixed random seeds. In-distribution Datasets. We use SVHN (Netzer et al., 2011) , CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) as in-distribution datasets. SVHN has 10 classes and contains 73,257 training images. CIFAR-10 and CIFAR-100 have 10 and 100 classes, respectively. Both datasets consist of 50,000 training images and 10,000 test images. Auxiliary OOD Datasets. We provide the details of auxiliary OOD datasets below. For each auxiliary OOD dataset, we use random cropping with padding of 4 pixels to generate 32 × 32 images, and further augment the data by random horizontal flipping. We don't use any image corruptions to augment the data. 1. TinyImages. 80 Million Tiny Images (TinyImages) (Torralba et al., 2008 ) is a dataset that contains 79,302,017 images collected from the Web. The images in the dataset are stored as 32 × 32 color images. Since CIFAR-10 and CIFAR-100 are labeled subsets of the TinyImages dataset, we need to remove those images in the dataset that belong to CIFAR-10 or CIFAR-100. We follow the same deduplication procedure as in (Hendrycks et al., 2018) and remove all examples in this dataset that appear in CIFAR-10 or CIFAR-100. Even after deduplication, the auxiliary OOD dataset may still contain some in-distribution data if we use CIFAR-10 or CIFAR-100 as in-distribution datasets, but the fraction of them is low. 2. ImageNet-RC. We use the downsampled ImageNet dataset (ImageNet64 × 64) (Chrabaszcz et al., 2017) , which is a downsampled variant of the original ImageNet dataset. It contains 1,281,167 images with image size of 64 × 64 and 1,000 classes. Some of the classes overlap with CIFAR-10 or CIFAR-100 classes. Since we don't use any label information from the dataset, we can say that the auxiliary OOD dataset is unlabeled. 
Since we randomly crop the 64 × 64 images into 32 × 32 images with padding of 4 pixels, with high probability the resulting images will not contain objects belonging to the in-distribution classes, even if the original images do. Therefore, we still have plenty of OOD data for training, and the fraction of in-distribution data in the auxiliary OOD dataset is low. We call this auxiliary OOD dataset ImageNet-RC.

OOD Test Datasets. We provide the details of the OOD test datasets below. All images are of size 32 × 32. 1. SVHN. The SVHN dataset (Netzer et al., 2011) contains color images of house numbers. There are ten classes of digits 0-9. The original test set has 26,032 images. We randomly select 1,000 test images per class to form a new test dataset of 10,000 images for evaluation. 2. Textures. The Describable Textures Dataset (DTD) (Cimpoi et al., 2014) contains textural images in the wild. We include the entire collection of 5,640 images for evaluation. 3. Places365. The Places365 dataset (Zhou et al., 2017) contains large-scale photographs of scenes with 365 scene categories. There are 900 images per category in the test set. We randomly sample 10,000 images from the test set for evaluation. 4. LSUN (crop) and LSUN (resize). The Large-scale Scene UNderstanding dataset (LSUN) has a test set of 10,000 images of 10 different scenes (Yu et al., 2015). We construct two datasets, LSUN-C and LSUN-R, by randomly cropping image patches of size 32 × 32 and by downsampling each image to size 32 × 32, respectively. 5. iSUN. The iSUN dataset (Xu et al., 2015) consists of a subset of SUN images. We include the entire collection of 8,925 images in iSUN. 6. CIFAR-10. We use the test set of CIFAR-10, which contains 10,000 images. 7. Gaussian Noise. The synthetic Gaussian noise dataset consists of 10,000 random 2D Gaussian noise images, where each RGB value of every pixel is sampled from an i.i.d. Gaussian distribution with mean 0.5 and unit variance.
We further clip each pixel value into the range [0, 1]. 8. Uniform Noise. The synthetic uniform noise dataset consists of 10,000 images where each RGB value of every pixel is independently and identically sampled from a uniform distribution on [0, 1].

Architectures and Training Configurations. We use the state-of-the-art neural network architectures DenseNet (Huang et al., 2017) and WideResNet (Zagoruyko & Komodakis, 2016). For DenseNet, we follow the same setup as in (Huang et al., 2017), with depth L = 100, growth rate k = 12 (Dense-BC), and dropout rate 0. For WideResNet, we also follow the same setup as in (Zagoruyko & Komodakis, 2016), with a depth of 40 and widening parameter k = 4 (WRN-40-4). All neural networks are trained with stochastic gradient descent with Nesterov momentum (Duchi et al., 2011; Kingma & Ba, 2014). We set the momentum to 0.9 and use ℓ2 weight decay with a coefficient of 10^-4 for all model training. Specifically, for SVHN, we train the networks for 20 epochs, and the initial learning rate of 0.1 decays by a factor of 10 at epochs 10, 15, and 18; for CIFAR-10 and CIFAR-100, we train the networks for 100 epochs, and the initial learning rate of 0.1 decays by a factor of 10 at epochs 50, 75, and 90. In ATOM and NTOM, we use a batch size of 64 for in-distribution data and 128 for out-of-distribution data. To solve the inner max of the robust training objective in ATOM, we use PGD with ε = 8/255, 5 iterations, a step size of 2/255, and random start.

B.2 AVERAGE RUNTIME

We run our experiments using a single GPU on a machine with 4 GPUs and 32 cores. The estimated average runtime for each method is summarized in Table 3 .

B.3 OOD DETECTION METHODS

We consider eight common OOD detection methods. ODIN. ODIN (Liang et al., 2018) computes calibrated confidence scores using temperature scaling and input perturbation techniques. In all of our experiments, we set the temperature scaling parameter T = 1000. We choose the perturbation magnitude η by validating on 1,000 images randomly sampled from the in-distribution test set D^test_in and 1,000 images randomly sampled from the auxiliary OOD dataset D^auxiliary_out, which does not depend on prior knowledge of the test OOD datasets. For DenseNet, we set η = 0.0006 for SVHN, η = 0.0016 for CIFAR-10, and η = 0.0012 for CIFAR-100. For WideResNet, we set η = 0.0002 for SVHN, η = 0.0006 for CIFAR-10, and η = 0.0012 for CIFAR-100. Mahalanobis. For the Mahalanobis method (Lee et al., 2018), we generate adversarial examples with FGSM (Goodfellow et al., 2014) with a perturbation size of 0.05 to train the logistic regression model and to tune the noise perturbation magnitude η. η is chosen from {0.0, 0.01, 0.005, 0.002, 0.0014, 0.001, 0.0005}, and the optimal parameters are chosen to minimize the FPR at FNR 5%. Outlier Exposure (OE). Outlier Exposure (Hendrycks et al., 2018) fine-tunes the network on auxiliary outliers, encouraging a uniform softmax distribution on them.
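As a concrete reference for the ODIN scoring rule above, the sketch below implements temperature scaling and input perturbation for a plain linear classifier standing in for the network (our own minimal sketch; the analytic gradient replaces backpropagation, and T = 1000, η = 0.0006 are the values quoted above):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def odin_score(W, b, x, T=1000.0, eta=0.0006):
    """ODIN score for a linear classifier z = W x + b: temperature-scaled max
    softmax probability after a small input perturbation that increases the
    confidence of the predicted class."""
    p = softmax((x @ W.T + b) / T)
    yhat = p.argmax(axis=1)
    # Analytic gradient of -log p_yhat w.r.t. x for the linear model:
    # d(-log p_y)/dx = (1/T) * (p - onehot(y)) @ W
    grad = ((p - np.eye(W.shape[0])[yhat]) @ W) / T
    x_pert = x - eta * np.sign(grad)     # FGSM-style step toward higher confidence
    return softmax((x_pert @ W.T + b) / T).max(axis=1)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((10, 8)), rng.standard_normal(10)
scores = odin_score(W, b, rng.standard_normal((4, 8)))
assert scores.shape == (4,)
assert np.all(scores >= 0.1) and np.all(scores <= 1.0)  # max softmax over 10 classes
```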

B.4 ADVERSARIAL ATTACKS FOR OOD DETECTION METHODS

We propose adversarial attack objectives for the different OOD detection methods. We consider a family of adversarial perturbations of the OOD inputs: (1) L∞-norm bounded attack (white-box); (2) common image corruption attack (black-box); (3) compositional attack, which combines the common image corruption attack and the L∞-norm bounded attack (white-box).

L∞-norm bounded attack. For a data point $x \in \mathbb{R}^d$, the set of L∞-norm bounded perturbations is defined as
$$ \Omega_{\infty,\epsilon}(x) = \{\delta \in \mathbb{R}^d : \|\delta\|_\infty \le \epsilon \ \wedge\ x + \delta \text{ is valid}\}, $$
where ε is the adversarial budget. $x + \delta$ is considered valid if the values of $x + \delta$ are in the image pixel value range. For the MSP, ODIN, OE, ACET, and CCU methods, we propose the following attack objective to generate an adversarial OOD example from a clean OOD input x:
$$ x^* = \arg\max_{x' - x \in \Omega_{\infty,\epsilon}(x)} -\frac{1}{K} \sum_{i=1}^{K} \log F(x')_i, $$
where F(x) is the softmax output of the classifier network. For the Mahalanobis method, we propose the following attack objective to generate an adversarial OOD example from an OOD input x:
$$ x^* = \arg\max_{x' - x \in \Omega_{\infty,\epsilon}(x)} -\log \frac{1}{1 + e^{-\left(\sum_{\ell} \alpha_\ell M_\ell(x') + b\right)}}, $$
where $M_\ell(x')$ is the Mahalanobis distance-based confidence score of $x'$ from the ℓ-th feature layer, and $\{\alpha_\ell\}$ and b are the parameters of the logistic regression model. For the SOFL method, we propose the following attack objective:
$$ x^* = \arg\max_{x' - x \in \Omega_{\infty,\epsilon}(x)} -\log \sum_{i=K+1}^{K+R} F(x')_i, $$
where F(x) is the softmax output of the whole neural network (including the auxiliary head) and R is the number of reject classes. For the ROWL and ATOM methods, we propose the following attack objective:
$$ x^* = \arg\max_{x' - x \in \Omega_{\infty,\epsilon}(x)} -\log F(x')_{K+1}, $$
where F(x) is the softmax output of the (K+1)-way neural network. Due to computational constraints, by default we use PGD with ε = 8/255, 40 iterations, a step size of 1/255, and random start to solve these attack objectives.
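For concreteness, here is a minimal sketch of solving the ROWL/ATOM attack objective with PGD (our own illustration on a linear (K+1)-way classifier standing in for the network, with the analytic gradient in place of backpropagation; ε = 8/255, 40 iterations, step 1/255, and random start as above):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_attack_atom(W, b, x, eps=8/255, alpha=1/255, iters=40, seed=0):
    """PGD on the ATOM/ROWL objective: maximize -log F(x')_{K+1}, where the
    last class of the (K+1)-way linear classifier z = W x + b is the reject
    (OOD) class. Keeps x' in the l_inf ball around x and in [0, 1]."""
    rng = np.random.default_rng(seed)
    x_adv = np.clip(x + rng.uniform(-eps, eps, x.shape), 0.0, 1.0)  # random start
    onehot_reject = np.eye(W.shape[0])[-1]
    for _ in range(iters):
        p = softmax(x_adv @ W.T + b)
        grad = (p - onehot_reject) @ W    # analytic grad of -log p_{K+1} w.r.t. x'
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid image
    return x_adv

rng = np.random.default_rng(1)
K, d = 10, 32
W, b = rng.standard_normal((K + 1, d)), rng.standard_normal(K + 1)
x = rng.uniform(0.0, 1.0, (4, d))                  # stand-in "OOD inputs"
x_adv = pgd_attack_atom(W, b, x)
assert np.all(np.abs(x_adv - x) <= 8/255 + 1e-9)   # perturbation stays in budget
assert np.all((x_adv >= 0.0) & (x_adv <= 1.0))     # valid pixel range
```

The attack drives down the reject-class probability of the OOD input, which is exactly how the adversarial OOD inputs for ROWL and ATOM are generated in our evaluation.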
We also perform ablation study experiments on the attack strength for ACET and ATOM, see Appendix B.10. 

B.5 VISUALIZATIONS OF FOUR TYPES OF OOD SAMPLES

We show visualizations of four types of OOD samples in Figure 4 .

B.6 HISTOGRAM OF OOD SCORES

In Figure 5, we show histograms of OOD scores for model snapshots trained on CIFAR-10 (in-distribution) using objective (4) without informative outlier mining. We plot every ten epochs for a model trained for a total of 100 epochs. We observe that the model quickly converges to a solution where the OOD score distribution is dominated by easy examples with scores close to 1. This is exacerbated as the model is trained for longer.

B.7 CHOOSE BEST Q USING VALIDATION DATASET

We create a validation OOD dataset by sampling 10,000 images from 80 Million Tiny Images (Torralba et al., 2008), disjoint from our training data. We choose q from {0, 0.125, 0.25, 0.5, 0.75}. The results on the validation dataset are shown in Table 5. We select the best model based on the average FPR at 5% FNR across the four types of OOD inputs. Based on the results, the optimal q is 0 for SVHN, 0.125 for CIFAR-10, and 0.5 for CIFAR-100.
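The role of q can be made concrete with a small sketch of the selection step (our own illustration; in ATOM, the candidate pool is sorted by OOD score in ascending order so that low-score outliers look most in-distribution, and training keeps the examples starting at the q-quantile):

```python
import numpy as np

def mine_outliers(ood_scores, q, n):
    """Return indices of the n candidate outliers starting at the q-quantile of
    the OOD score (ascending). q = 0 keeps the hardest outliers; a larger q
    skips the very hardest fraction, which may contain in-distribution leakage."""
    order = np.argsort(ood_scores)
    start = int(q * len(ood_scores))
    return order[start:start + n]

rng = np.random.default_rng(0)
pool = rng.uniform(0.0, 1.0, 1000)        # toy OOD scores of the candidate pool
idx = mine_outliers(pool, q=0.125, n=100)
assert len(idx) == 100
# The selected scores are exactly the slice [125, 225) of the sorted pool.
assert np.allclose(np.sort(pool[idx]), np.sort(pool)[125:225])
```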

B.8 EFFECT OF AUXILIARY OOD DATASETS

We present results in Table 6, where we use an alternative auxiliary OOD dataset, ImageNet-RC. The details of ImageNet-RC are provided in Section B.1. We use the same hyperparameters as used in training with the TinyImages auxiliary data. For all three in-distribution datasets, we find that using q = 0 results in the optimal performance.

B.9 EFFECT OF NETWORK ARCHITECTURE

We perform experiments to evaluate different OOD detection methods using WideResNet, see Table 7 . For both ATOM and NTOM, we use the same hyperparameters as those selected for DenseNet and find that it also leads to good results.

B.10 EFFECT OF PGD ATTACK STRENGTH

To assess the effect of stronger PGD attacks, we evaluate ACET (the best baseline) and ATOM on L∞-attacked and compositionally attacked OOD inputs with 100 iterations and 5 random restarts. Results are provided in Table 8. Under the stronger PGD attack, ATOM still outperforms ACET.
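The stronger attack simply reruns PGD from several random starting points inside the perturbation ball and keeps the worst-case candidate. A minimal sketch under the same toy assumptions as before (a hypothetical 1-D objective `score_fn` in place of the network, finite differences in place of backprop):

```python
import random

def attack_once(score_fn, x0, eps, step, iters, rng):
    # Random start inside the L-inf ball, then signed-gradient ascent.
    delta = rng.uniform(-eps, eps)
    for _ in range(iters):
        h = 1e-5
        g = (score_fn(x0 + delta + h) - score_fn(x0 + delta - h)) / (2 * h)
        delta = max(-eps, min(eps, delta + step * (1.0 if g >= 0 else -1.0)))
        delta = max(-x0, min(1.0 - x0, delta))   # keep the input valid
    return x0 + delta

def attack_with_restarts(score_fn, x0, eps=8/255, step=1/255,
                         iters=100, restarts=5, seed=0):
    """Return the strongest adversarial candidate over several restarts."""
    rng = random.Random(seed)
    cands = [attack_once(score_fn, x0, eps, step, iters, rng)
             for _ in range(restarts)]
    return max(cands, key=score_fn)
```

Multiple restarts reduce the chance that PGD stalls in a poor local optimum, which is why this setting constitutes a stronger evaluation.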

B.11 EVALUATION ON RANDOM NOISE OOD DATA

We report the performance of OOD detectors using DenseNet on random noise OOD test datasets in Table 9 (SVHN), Table 10 (CIFAR-10) and Table 11 (CIFAR-100).

B.12 PERFORMANCE OF OOD DETECTOR AND CLASSIFIER ON IN-DISTRIBUTION DATA

We summarize the performance of OOD detector G(x) and image classifier f (x) on in-distribution test data. See Table 12 for DenseNet and Table 13 for WideResNet. From the results, we can see that ATOM improves the OOD detection performance while achieving in-distribution classification accuracy that is on par with a pre-trained network.
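The end-to-end metric described above can be computed from the detector scores and classifier predictions. Below is a minimal sketch under the convention that higher score means more in-distribution-looking; function and variable names are illustrative, not from our implementation.

```python
def end_to_end_accuracy(scores, preds, labels, accept_rate=0.95):
    """An in-distribution input counts as correct only if the detector
    accepts it as in-distribution AND the classifier predicts its label
    correctly. The threshold is chosen so that a fraction `accept_rate`
    of in-distribution test points are accepted."""
    s = sorted(scores)
    thresh = s[int((1 - accept_rate) * len(s))]
    correct = sum(1 for sc, p, y in zip(scores, preds, labels)
                  if sc >= thresh and p == y)
    return correct / len(labels)
```

Because the threshold rejects 5% of in-distribution data by construction, end-to-end accuracy is upper-bounded by 95% of the classifier's standalone accuracy under this convention.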

B.13 COMPLETE EXPERIMENTAL RESULTS

We report the performance of OOD detectors using DenseNet on each of the six natural OOD test datasets in Table 14 (SVHN), Table 15 (CIFAR-10), and Table 16 (CIFAR-100).



Due to lack of space, proofs are deferred to Appendix A. The error bound in the proposition can be made arbitrarily small with high probability; the current bound is presented for simplicity. Since the inference stage can be fully parallelized, outlier mining can be applied with relatively low overhead.



Figure 1: When deploying an image classification system (OOD detector G(x) + image classifier f (x)) in an open world, there can be multiple types of out-of-distribution examples. We consider a broad family of OOD inputs, including (a) Natural OOD, (b) L∞ OOD, (c) corruption OOD, and (d) Compositional OOD. A detailed description of these OOD inputs can be found in Section 5.1. In (b-d), a perturbed OOD input (e.g., a perturbed mailbox image) can mislead the OOD detector to classify it as an in-distribution sample. This can trigger the downstream image classifier f (x) to predict it as one of the in-distribution classes (e.g., speed limit 70). Through adversarial training with informative outlier mining (ATOM), our method can robustify the decision boundary of OOD detector G(x), which leads to improved performance across all types of OOD inputs. Solid lines are actual computation flow.

Figure 2: On CIFAR-10, we train a DenseNet with objective (4) for 100 epochs without informative outlier mining. At epoch 30, we randomly sample 400,000 data points from D auxiliary out and plot the OOD score frequency distribution (a). We observe that the model quickly converges to a solution where the OOD score distribution is dominated by easy examples with scores close to 1, as shown in (b). Training on these easy OOD data points therefore can no longer improve the decision boundary of the OOD detector. (c) shows the hardest examples mined from TinyImages w.r.t. CIFAR-10.

4 ATOM: ADVERSARIAL TRAINING WITH INFORMATIVE OUTLIER MINING

In this section, we introduce Adversarial Training with informative Outlier Mining (ATOM), which realizes our theoretical analysis and demonstrates its effectiveness in the context of modern neural networks. We first present the adversarial training objective and then describe how we use informative outlier mining to robustify OOD detection.

Training Objective. We consider a (K + 1)-way classifier network f, where the (K + 1)-th class label indicates the out-of-distribution class. Denote by $F_\theta(x)$ the softmax output of f on x. The robust training objective is given by

$\underset{\theta}{\text{minimize}} \;\; \mathbb{E}_{(x,y)\sim D_{in}^{train}}\left[\ell(F_\theta(x), y)\right] + \lambda \cdot \mathbb{E}_{x\sim D_{out}^{train}}\left[\max_{x' \in \Omega_{\infty,\epsilon}(x)} \ell(F_\theta(x'), K+1)\right], \quad (4)$

where $\ell$ is the cross-entropy loss.

Lee et al. propose using Mahalanobis distance-based confidence scores to detect OOD samples. Following Lee et al., we use 1,000 examples randomly selected from the in-distribution test set D test in, together with adversarial examples generated by FGSM, to train the logistic regression model.

Outlier Exposure (OE). OE makes use of a large, auxiliary OOD dataset D auxiliary out to enhance the performance of existing OOD detection methods. We train from scratch with λ = 0.5, and use an in-distribution batch size of 64 and an out-of-distribution batch size of 128 in our experiments. Other training parameters are specified in Section B.1.

Self-Supervised OOD Feature Learning (SOFL). Mohseni et al. add an auxiliary head to the network and train it for the OOD detection task. They first use fully-supervised training on the in-distribution training data for the main classification head, followed by self-supervised training with the OOD training set for the auxiliary head. Following the original setting, we set λ = 5 and use an in-distribution batch size of 64 and an out-of-distribution batch size of 320 in all of our experiments. For SVHN and CIFAR-10, we use 5 reject classes, while for CIFAR-100, we use 10 reject classes. We first train the model with fully-supervised learning using the training parameters specified in Section B.1, and then continue training with self-supervised OOD feature learning using the same parameters. We use the large, auxiliary OOD dataset D auxiliary out as the out-of-distribution training set.

Adversarial Confidence Enhancing Training (ACET). Hein et al. propose Adversarial Confidence Enhancing Training to enforce low model confidence on OOD data points, as well as on worst-case adversarial examples in the neighborhood of an OOD example. For a fair comparison, we use the large, auxiliary OOD dataset D auxiliary out as the OOD training dataset instead of random noise data. In all of our experiments, we set λ = 1.0 and use a batch size of 128 for both in-distribution and out-of-distribution data.
To solve the inner max of the training objective, we also apply PGD with ε = 8/255, 5 iterations, a step size of 2/255, and random start to half of each minibatch, while keeping the other half clean to ensure proper performance on both perturbed and clean OOD examples. Other training parameters are specified in Section B.1.

Certified Certain Uncertainty (CCU). Certified Certain Uncertainty (Meinke & Hein, 2019) gives guarantees on the confidence of the classifier decision far away from the training data. We use the same training setup as in the paper and code, except that we use our training configurations specified in Section B.1.

Common OOD detection methods and the family of natural and perturbed OOD examples we consider: MSP, ODIN, Mahalanobis, SOFL, OE, ACET, CCU, ROWL, and ATOM.

Robust Open-World Deep Learning (ROWL). Sehwag et al. propose introducing additional background classes for OOD datasets and performing adversarial training on both the in- and out-of-distribution datasets to achieve robust open-world classification. When an input is classified into a background class, it is considered an OOD example; thus, ROWL assigns binary OOD scores (either 0 or 1) to inputs. In our experiments, we use a single background class and randomly sample data points from the large, auxiliary OOD dataset D auxiliary out to form the OOD dataset. To ensure data balance across classes, we include 7,325 OOD data points for SVHN, 5,000 for CIFAR-10, and 500 for CIFAR-100. During training, we mix the in-distribution and OOD data, and use a batch size of 128. To solve the inner max of the training objective, we use PGD with ε = 8/255, 5 iterations, a step size of 2/255, and random start. Other training parameters are specified in Section B.1.

Figure 4: Examples of four types of OOD samples.

Figure 5: On CIFAR-10, we train the model with objective (4) for 100 epochs without informative outlier mining. Every 10 epochs, we randomly sample 400,000 data points from the large auxiliary OOD dataset and use the current model snapshot to calculate the OOD scores.

Algorithm 1: ATOM, Adversarial Training with informative Outlier Mining. (Key step: compute OOD scores on the sampled set S using the current model Fθ.)
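The outlier mining step of Algorithm 1 reduces to ranking the sampled auxiliary pool by OOD score and keeping a slice of it. A minimal sketch follows; the function name and toy scores are illustrative, and the slicing convention (skip the first q fraction, keep the next n) follows the description of q and the mined subset in the paper.

```python
def mine_outliers(pool_scores, n, q):
    """Informative outlier mining step. Rank the sampled auxiliary
    pool by OOD score (ascending, so the lowest-scoring, i.e. hardest,
    most in-distribution-looking, outliers come first), skip the first
    q*|pool| examples, and keep the next n for this training epoch."""
    order = sorted(range(len(pool_scores)), key=lambda i: pool_scores[i])
    start = int(q * len(pool_scores))
    return order[start:start + n]
```

With q = 0 the hardest outliers are used directly; larger q discards a fraction of the very hardest examples, which can help when the auxiliary pool contains mislabeled in-distribution data.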

Comparison with competitive OOD detection methods. We use DenseNet as the network architecture.

Ablation study on informative outlier mining. We use DenseNet as the network architecture. ↑ indicates that a larger value is better, and ↓ indicates that a lower value is better. All values are percentages and are averaged over the six natural OOD test datasets mentioned in Section 5.1. We do not use the OOD test set for tuning q; please refer to Table 5 for validation results.

Communications on Pure and Applied Mathematics, 66(2):145-164, 2013.

Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.

Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725, 2019.

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790-4798, 2016.

Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794-2802, 2015.

The estimated average runtime for each result. We use DenseNet as the network architecture; h denotes hours. For MSP, ODIN, and Mahalanobis, we use standard training. The evaluation includes the four OOD detection tasks listed in Section 2.

Common Image Corruptions attack. We use the common image corruptions introduced in (Hendrycks & Dietterich, 2019). We apply 15 types of algorithmically generated corruptions from the noise, blur, weather, and digital categories to each OOD image. Each type of corruption has five levels of severity, resulting in 75 distinct corruptions. Thus, for each OOD image, we generate 75 corrupted images and then select the one with the lowest OOD score (i.e., the highest confidence of being in-distribution). Note that we only need the outputs of the OOD detectors to construct such adversarial OOD examples; thus, it is a black-box attack.

Compositional attack. For each OOD image, we first apply the common image corruptions attack, and then apply the L∞-norm bounded attack to generate adversarial OOD examples.
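The corruption attack is a pure search over detector outputs: apply every corruption/severity candidate and keep the one the detector scores as most in-distribution-looking. A minimal sketch, with hypothetical corruption functions and a toy score in place of the detector:

```python
def corruption_attack(x, ood_score, corruptions):
    """Black-box corruption attack: apply each candidate corruption and
    return the corrupted input with the LOWEST OOD score, i.e. the one
    most likely to slip past the detector. Only detector outputs are
    queried, so no gradient access is needed."""
    candidates = [c(x) for c in corruptions]
    return min(candidates, key=ood_score)
```

In our setting the candidate list would hold the 75 corruption/severity combinations; here three toy lambdas stand in for them.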

SVHN), Table 15 (CIFAR-10), and Table 16 (CIFAR-100).

Evaluate models on validation dataset. We use DenseNet as network architecture. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages and are averaged over six different OOD test datasets mentioned in section 5.1. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use ImageNet-RC as the auxiliary OOD dataset (see section B.1 for the details) for SOFL, OE, ACET, CCU, NTOM and ATOM. We use DenseNet as network architecture for all methods. We evaluate on four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages and are averaged over six natural OOD test datasets described in section 5.1. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use WideResNet as network architecture for all methods. We evaluate on four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages and are averaged over six different OOD test datasets described in section 5.1. Bold numbers are superior results.

Evaluation on L∞ attacked OOD and compositionally attacked OOD inputs with strong PGD attack (100 iterations and 5 random restarts). We use DenseNet as network architecture for all methods. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages and are averaged over six different OOD test datasets described in section 5.1. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use SVHN as in-distribution dataset and use DenseNet as network architecture for all methods. We evaluate the performance on all four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use CIFAR-10 as the in-distribution dataset and DenseNet as the network architecture for all methods.

Comparison with competitive OOD detection methods. We use CIFAR-100 as in-distribution dataset and use DenseNet as network architecture for all methods. We evaluate the performance on all four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages.

The performance of OOD detector and classifier on in-distribution test data. We use DenseNet for all methods. We use three metrics: FNR, Prediction Accuracy and End-to-end Prediction Accuracy. We pick the threshold for the OOD detectors such that 95% of in-distribution test data points are classified as in-distribution. Prediction Accuracy measures the accuracy of the classifier on in-distribution test data. End-to-end Prediction Accuracy measures the accuracy of the open world classification system (detector+classifier), where an example is classified correctly if and only if the detector treats it as in-distribution and the classifier predicts its label correctly.

The performance of OOD detector and classifier on in-distribution test data. We use WideResNet for all methods. We use three metrics: FNR, Prediction Accuracy and End-to-end Prediction Accuracy. We pick the threshold for the OOD detectors such that 95% of in-distribution test data points are classified as in-distribution. Prediction Accuracy measures the accuracy of the classifier on in-distribution test data. End-to-end Prediction Accuracy measures the accuracy of the open world classification system (detector+classifier), where an example is classified correctly if and only if the detector treats it as in-distribution and the classifier predicts its label correctly.

Comparison with competitive OOD detection methods. We use SVHN as in-distribution dataset and use DenseNet as network architecture for all methods. We evaluate the performance on all four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use CIFAR-10 as in-distribution dataset and use DenseNet as network architecture for all methods. We evaluate the performance on all four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages. Bold numbers are superior results.

Comparison with competitive OOD detection methods. We use CIFAR-100 as in-distribution dataset and use DenseNet as network architecture for all methods. We evaluate the performance on all four types of OOD inputs: (1) natural OOD, (2) corruption attacked OOD, (3) L∞ attacked OOD, and (4) compositionally attacked OOD inputs. ↑ indicates larger value is better, and ↓ indicates lower value is better. All values are percentages. Bold numbers are superior results.

