ON THE IMPORTANCE OF IN-DISTRIBUTION CLASS PRIOR FOR OUT-OF-DISTRIBUTION DETECTION

Anonymous authors

Abstract

Given a pre-trained in-distribution (ID) model, inference-time out-of-distribution (OOD) detection methods aim to recognize upcoming OOD data at inference time. However, some representative methods share an unproven assumption that the probability that OOD data belong to every ID class should be the same, i.e., the probabilities that OOD data belong to ID classes form a uniform distribution. In this paper, we theoretically and empirically show that this assumption makes these methods incapable of recognizing OOD data when the ID model is trained with class-imbalanced data. Fortunately, by analyzing the causal relations between ID/OOD classes and features, we identify several common scenarios where the probabilities that OOD data belong to ID classes should instead follow the ID-class-prior distribution. Based on this finding, we propose two effective strategies to modify previous inference-time OOD detection methods: 1) if they explicitly use the uniform distribution, we replace it with the ID-class-prior distribution; 2) otherwise, we reweight their scores according to the similarity between the ID-class-prior distribution and the softmax outputs of the pre-trained model. Extensive experiments show that both strategies significantly improve the accuracy of recognizing OOD data when the ID model is pre-trained with imbalanced data. As a highlight, when evaluated on the iNaturalist dataset, our method achieves a ∼36% increase in AUROC and a ∼61% decrease in FPR95 compared with the original Energy method, reflecting the importance of the ID-class prior in OOD detection and opening a new direction for studying this problem.

1. INTRODUCTION

How to reliably deploy machine learning models in real-world scenarios has been attracting more and more attention (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020). In real-world scenarios, test data usually contain both known and unknown classes (Hendrycks & Gimpel, 2017). We expect the deployed model to eliminate the interference of unknown classes while classifying known classes well. Nevertheless, current models tend to be overconfident on unknown classes (Nguyen et al., 2015) and thus confuse known and unknown classes, which increases the risk of deploying these models in the real world. Especially when the scenarios are life-critical (e.g., autonomous driving), we cannot take the risk of deploying unreliable models in them. This motivates researchers to study out-of-distribution (OOD) detection, where we need to identify unknown classes (i.e., OOD classes) and, at the same time, classify known classes (i.e., in-distribution (ID) classes) well (Hendrycks & Gimpel, 2017; Hendrycks et al., 2019).

In OOD detection, a well-known branch develops inference-time/post hoc OOD detection methods (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020; Hendrycks & Gimpel, 2017; Lee et al., 2018b; Sun et al., 2021), where we are given a pre-trained ID model and aim to recognize upcoming OOD data well. The key advantage of inference-time OOD detection methods is that the classification performance on ID data is unaffected, since we only use the ID model instead of changing it. A general way to design a large-scale-friendly inference-time OOD detection method is to propose a score function based on the ID model's information. For example, maximum softmax probability (MSP) uses the ID model's outputs (Hendrycks & Gimpel, 2017), and GradNorm uses the ID model's gradients (Huang et al., 2021). If the score of a data point is smaller, that data point is OOD with higher probability.
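As a minimal sketch of such a score function, MSP simply keeps the largest softmax probability of the ID model's output; the logits below are illustrative, not from any real model:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Maximum softmax probability (Hendrycks & Gimpel, 2017).
    Lower scores indicate more OOD-like inputs."""
    return softmax(logits).max(axis=-1)

# A peaked (confident) prediction scores higher than a flat one,
# so the flat prediction is flagged as more likely OOD.
id_logits = np.array([8.0, 0.5, 0.3])    # peaked -> likely ID
ood_logits = np.array([1.1, 1.0, 0.9])   # flat   -> likely OOD
assert msp_score(id_logits) > msp_score(ood_logits)
```

Thresholding this score then separates ID from OOD inputs without touching the pre-trained model.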
However, some representative methods share an unproven assumption: the probability that an OOD data point x_out belongs to each ID class i is always the same. Namely, for any x_out, they assume

[Pr(x_out belongs to class 1), ..., Pr(x_out belongs to class K)] = [1/K, ..., 1/K]_{1×K} := u,  (1)

where u is a uniform distribution and K is the number of ID classes. Taking GradNorm (Huang et al., 2021), a state-of-the-art OOD detection method, as an example¹, let f_Θ(x) be the ID model's output for a data point x; the score function of GradNorm is

S_GradNorm(f_Θ, x) = ‖ ∂D_KL(u ∥ softmax(f_Θ(x))) / ∂Θ ‖_1,  (2)

where Θ denotes the ID model's parameters, softmax(f_Θ(x)) is the vector of predicted probabilities that x belongs to each ID class, and D_KL(· ∥ ·) is the Kullback-Leibler divergence. Clearly, GradNorm takes u as a reference distribution to distinguish between ID and OOD data: if the divergence between softmax(f_Θ(x)) and u is smaller, then x is OOD with higher probability. Nonetheless, since this assumption is unproven, we do not know whether it is correct. If it is not, u-based score functions (e.g., Eq. 2) are ill-defined, because they cannot guarantee that the lowest score corresponds to the most OOD-like data.

In this paper, we theoretically analyze the above assumption (i.e., Eq. 1) under three common causal graphs (Figure 1) and find that it holds only when the ID-class prior is u, i.e., the ID model is trained with class-balanced data. In other cases, the reference distribution for OOD data should be the ID-class-prior distribution P_{Y_in} (Theorem 1), i.e.,

[Pr(x_out belongs to class 1), ..., Pr(x_out belongs to class K)] = P_{Y_in}.  (3)

Specifically, assume we have K classes in the training data (i.e., ID data). Let n_j be the number of samples in class j; the total number of samples is then N = Σ_{j=1}^{K} n_j. Thus, we have P_{Y_in} = [n_1/N, n_2/N, ..., n_K/N].
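Strategy 1 above can be sketched numerically. The snippet below computes the ID-class prior P_{Y_in} from (hypothetical, illustrative) class counts and compares a simplified GradNorm-style score under the uniform reference u versus the prior reference. For simplicity it differentiates the KL term with respect to the logits (where the gradient is softmax(f_Θ(x)) − reference), rather than the last-layer weights as GradNorm proper does:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_grad_score(logits, ref):
    """L1 norm of the gradient of D_KL(ref || softmax(logits)) w.r.t. the
    logits, which equals ||softmax(logits) - ref||_1. A simplified stand-in
    for GradNorm's gradient w.r.t. model parameters; lower = more OOD-like."""
    return np.abs(softmax(logits) - ref).sum()

# Hypothetical imbalanced class counts n_j, K = 3.
counts = np.array([900, 90, 10])
prior = counts / counts.sum()               # P_{Y_in} = [n_1/N, ..., n_K/N]
u = np.full(len(counts), 1.0 / len(counts)) # uniform reference

# On an imbalanced model, an OOD input's softmax output tends to echo the
# class prior rather than u; here we construct logits whose softmax is
# exactly the prior to model that case.
ood_logits = np.log(prior)
score_u = kl_grad_score(ood_logits, u)          # large: u misses this OOD point
score_prior = kl_grad_score(ood_logits, prior)  # ~0: prior reference flags it
assert score_prior < score_u
```

Swapping the reference distribution is the only change: the prior-based score assigns this prior-like output the lowest possible value, so it is correctly ranked as the most OOD-like.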



¹ Note that MSP also relies on this assumption; we discuss it in Section 3.2.



Figure 1: Three common causal graphs in OOD detection. Under these graphs, we prove that the probabilities that an OOD data point x_out belongs to ID classes should be the ID-class-prior distribution P_{Y_in} (Theorem 1). However, some representative OOD detection methods (Huang et al., 2021; Hendrycks & Gimpel, 2017) assume such probabilities to be a uniform distribution u (e.g., GradNorm in Eq. 2). In this figure, each node represents a random variable, and gray nodes indicate observable variables. X stands for features, Y stands for classes, and S stands for styles. In all three graphs, features are generated by classes.

