ON THE IMPORTANCE OF IN-DISTRIBUTION CLASS PRIOR FOR OUT-OF-DISTRIBUTION DETECTION

Anonymous authors

Abstract

Given a pre-trained in-distribution (ID) model, inference-time out-of-distribution (OOD) detection methods aim to recognize OOD data at inference time. However, some representative methods share an unproven assumption that the probability of OOD data belonging to every ID class is the same, i.e., that the probabilities that OOD data belong to ID classes form a uniform distribution. In this paper, we theoretically and empirically show that this assumption makes these methods incapable of recognizing OOD data when the ID model is trained with class-imbalanced data. Fortunately, by analyzing the causal relations between ID/OOD classes and features, we identify several common scenarios in which the probabilities that OOD data belong to ID classes should follow the ID-class-prior distribution. Based on this finding, we propose two effective strategies to modify previous inference-time OOD detection methods: 1) if a method explicitly uses the uniform distribution, we replace the uniform distribution with the ID-class-prior distribution; 2) otherwise, we reweight its scores according to the similarity between the ID-class-prior distribution and the softmax outputs of the pre-trained model. Extensive experiments show that both strategies significantly improve the accuracy of recognizing OOD data when the ID model is pre-trained with imbalanced data. As a highlight, when evaluated on the iNaturalist dataset, our method achieves a ∼36% increase in AUROC and a ∼61% decrease in FPR95 compared with the original Energy method, reflecting the importance of the ID-class prior in OOD detection and opening a new direction for studying this problem.

1. INTRODUCTION

How to reliably deploy machine learning models in real-world scenarios has been attracting more and more attention (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020). In real-world scenarios, test data usually contain both known and unknown classes (Hendrycks & Gimpel, 2017). We expect a deployed model to eliminate the interference of unknown classes while classifying known classes well. Nevertheless, current models tend to be overconfident on unknown classes (Nguyen et al., 2015) and thus confuse known and unknown classes, which increases the risk of deploying these models in the real world. Especially when the scenarios are life-critical (e.g., car-driving scenarios), we cannot take the risk of deploying unreliable models. This motivates researchers to study out-of-distribution (OOD) detection, where we need to identify unknown classes (i.e., OOD classes) and, at the same time, classify known classes (i.e., in-distribution (ID) classes) well (Hendrycks & Gimpel, 2017; Hendrycks et al., 2019). In OOD detection, a well-known branch develops inference-time/post hoc OOD detection methods (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020; Hendrycks & Gimpel, 2017; Lee et al., 2018b; Sun et al., 2021), where we are given a pre-trained ID model and then aim to recognize upcoming OOD data well. The key advantage of inference-time OOD detection methods is that the classification performance on ID data is unaffected, since we only use the ID model instead of changing it. A general way to design a large-scale-friendly inference-time OOD detection method is to propose a score function based on the ID model's information. For example, maximum softmax probability (MSP) uses the ID model's outputs (Hendrycks & Gimpel, 2017), and GradNorm uses the ID model's gradients (Huang et al., 2021). If the score of a data point is smaller, then that data point is OOD with a higher probability.
Figure 1: Three common causal graphs in OOD detection. Under these graphs, we prove that the probabilities that an OOD data point $x^{\text{out}}$ belongs to ID classes should be the ID-class-prior distribution $P_{Y^{\text{in}}}$ (Theorem 1). However, some representative OOD detection methods (Huang et al., 2021; Hendrycks & Gimpel, 2017) assume such probabilities to be a uniform distribution $u$ (e.g., GradNorm in Eq. 2). In this figure, each node represents a random variable, and gray nodes indicate observable variables. $X$ stands for features, $Y$ stands for classes, and $S$ stands for styles. In the three graphs, features are generated by classes (i.e., $Y \to X$) (Gong et al., 2016; Stojanov et al., 2021) or by classes and styles (i.e., $Y \to X \leftarrow S$) (Yao et al., 2021). These causal graphs broadly exist in common datasets. For example, (a) corresponds to datasets consisting of sketch images, like ImageNet-Sketch (Wang et al., 2019), where ID classes could be cars and OOD classes could be animals; (b) and (c) correspond to datasets of common images, like ImageNet (Deng et al., 2009) and MNIST (LeCun et al., 1998). In (b), the ID classes could be cars in ImageNet and the OOD classes could be numbers in MNIST (different styles). In (c), the ID classes could be numbers in MNIST and the OOD classes could be classes in Fashion-MNIST (Xiao et al., 2017) (the same style). Through these graphs, it is clear that $Y^{\text{in}} \perp\!\!\!\perp X^{\text{out}}$, i.e., $Y^{\text{in}}$ and $X^{\text{out}}$ are independent. However, some representative methods share an unproven assumption: the probability that an OOD data point $x^{\text{out}}$ belongs to each ID class $i$ is always the same. Namely, for any $x^{\text{out}}$, they assume
$$[\Pr(x^{\text{out}} \text{ belongs to class } 1), \ldots, \Pr(x^{\text{out}} \text{ belongs to class } K)] = [1/K, \ldots, 1/K]_{1 \times K} := u, \quad (1)$$
where $u$ is the uniform distribution and $K$ is the number of ID classes.
Take GradNorm (Huang et al., 2021), a state-of-the-art OOD detection method, as an example. Let $f_\Theta(x)$ be the ID model's output for a data point $x$; the score function of GradNorm is
$$S_{\text{GradNorm}}(f_\Theta, x) = \left\| \frac{\partial D_{\text{KL}}(u \,\|\, \mathrm{softmax}(f_\Theta(x)))}{\partial \Theta} \right\|_1, \quad (2)$$
where $\Theta$ represents the ID model's parameters, $\mathrm{softmax}(f_\Theta(x))$ is the vector of predicted probabilities that $x$ belongs to each ID class, and $D_{\text{KL}}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. Clearly, GradNorm takes $u$ as a reference distribution to distinguish between ID and OOD data: if the divergence between $\mathrm{softmax}(f_\Theta(x))$ and $u$ is smaller, then $x$ is an OOD data point with higher probability. Nonetheless, since this assumption is unproven, we do not know whether it is correct. If it is not, the $u$-based score functions (e.g., Eq. 2) are ill-defined because they cannot guarantee that the lowest score corresponds to the most OOD-like data. In this paper, we theoretically analyze the above assumption (i.e., Eq. 1) under three common causal graphs (Figure 1) and find that it holds only when the ID-class prior is $u$, i.e., when the ID model is trained with class-balanced data. In all other cases, the reference distribution for OOD data should be the ID-class-prior distribution $P_{Y^{\text{in}}}$ (Theorem 1), i.e.,
$$[\Pr(x^{\text{out}} \text{ belongs to class } 1), \ldots, \Pr(x^{\text{out}} \text{ belongs to class } K)] = P_{Y^{\text{in}}}. \quad (3)$$
Specifically, assume we have $K$ classes in the training (i.e., ID) data. Let $n_j$ be the number of samples in class $j$; the total number of samples is then $N = \sum_{j=1}^{K} n_j$, so $P_{Y^{\text{in}}} = [n_1/N, n_2/N, \ldots, n_K/N]$. Empirically, we test the performance of OOD detection methods when the data are not class-balanced (Figure 2a), i.e., $P_{Y^{\text{in}}} \neq u$. We find that GradNorm, a state-of-the-art OOD detection method, suffers in the imbalanced setting (see the cyan and yellow bars in Figure 2b).
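The GradNorm score above can be sketched in PyTorch. This is a minimal illustration with a hypothetical helper name (`gradnorm_score`); the original method computes the gradient with respect to the last layer only, while for simplicity this sketch sums the L1 norm over all parameters:

```python
import torch
import torch.nn.functional as F

def gradnorm_score(model, x, num_classes):
    """GradNorm-style OOD score: L1 norm of the gradient of
    KL(u || softmax(f(x))) with respect to the model parameters.
    Lower scores suggest more OOD-like inputs."""
    model.zero_grad()
    logits = model(x)                                     # shape (1, K)
    u = torch.full((1, num_classes), 1.0 / num_classes)   # uniform reference
    # F.kl_div expects log-probabilities as input and probabilities as target
    kl = F.kl_div(F.log_softmax(logits, dim=1), u, reduction="batchmean")
    kl.backward()
    return sum(p.grad.abs().sum().item() for p in model.parameters())
```

At deployment, one would compute this score per test point and compare it with a threshold calibrated on held-out ID data.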
Interestingly, Energy (Liu et al., 2020), another representative OOD detection method that does not explicitly use $u$, also suffers in this setting (see the cyan and yellow bars in Figure 2c). Based on Theorem 1 and Eq. 3, we propose two effective strategies that modify previous score-based OOD detection methods using the ID-class-prior distribution: the replacing (RP) strategy and the reweighting (RW) strategy. In the RP strategy, for previous methods that explicitly use the uniform distribution (like GradNorm), we replace $u$ with the ID-class-prior distribution $P_{Y^{\text{in}}}$. For example, we can modify the score function of GradNorm by replacing $u$ in Eq. 2 with $P_{Y^{\text{in}}}$:
$$S_{\text{RP+GradNorm}}(f_\Theta, x) = \left\| \frac{\partial D_{\text{KL}}(P_{Y^{\text{in}}} \,\|\, \mathrm{softmax}(f_\Theta(x)))}{\partial \Theta} \right\|_1. \quad (4)$$
For methods that do not explicitly use the uniform distribution to compute scores (like Energy (Liu et al., 2020)), we use the RW strategy to reweight their scores according to the similarity between the ID-class-prior distribution $P_{Y^{\text{in}}}$ and the softmax outputs of the pre-trained model, $\mathrm{softmax}(f_\Theta(x))$. Namely,
$$S_{\text{RW+Method}}(f_\Theta, x) = S_{\text{Method}}(f_\Theta, x) \cdot \cos(\mathrm{softmax}(f_\Theta(x)), P_{Y^{\text{in}}}),$$
where $S_{\text{Method}}(f_\Theta, x)$ is a score function proposed in previous studies (like Energy (Liu et al., 2020)). We conduct extensive experiments to verify the effectiveness of the RP and RW strategies. After our modification, the results (red bars in Figure 2) show a significant improvement, which illustrates the effectiveness of our theory. Meanwhile, our method achieves state-of-the-art performance on four evaluation tasks. As a highlight, when evaluating OOD detection performance on the iNaturalist dataset, our method achieves a ∼36% increase in AUROC and a ∼61% decrease in FPR95 compared with the original Energy method (Liu et al., 2020) (see Table 1). This further validates that we cannot default the reference distribution of OOD data to the uniform distribution.
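The two ingredients of these strategies can be sketched as follows. The helper names (`class_prior`, `rw_energy_score`) are hypothetical; `rw_energy_score` follows the general reweighting formula above, applied to the Energy score:

```python
import torch
import torch.nn.functional as F

def class_prior(counts):
    """ID-class-prior distribution P_{Y_in} from per-class training counts."""
    counts = torch.as_tensor(counts, dtype=torch.float32)
    return counts / counts.sum()

def rw_energy_score(logits, prior, T=1.0):
    """RW strategy applied to the Energy score: the energy-based score is
    reweighted by the cosine similarity between the softmax output and
    the ID-class prior."""
    energy_score = T * torch.logsumexp(logits / T, dim=1)   # Energy score
    probs = F.softmax(logits, dim=1)
    weight = F.cosine_similarity(probs, prior.expand_as(probs), dim=1)
    return energy_score * weight
```

For OOD inputs, whose softmax outputs tend toward $P_{Y^{\text{in}}}$ under an imbalanced prior, the cosine weight is close to 1, while for confidently classified ID inputs it is smaller; the weight thus encodes how prior-like the output is.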
To improve the generalizability of OOD detection methods, the class-prior distribution of the training data should be taken into account, which may benefit future research in the community.

2. PRELIMINARIES

Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y}^{\text{in}} = \{1, \ldots, K\}$ be the feature space and the ID label space. Let $X^{\text{in}} \in \mathcal{X}$, $X^{\text{out}} \in \mathcal{X}$ and $Y^{\text{in}} \in \mathcal{Y}^{\text{in}}$ denote the random variables over $\mathcal{X}$ and $\mathcal{Y}^{\text{in}}$. $P(X^{\text{in}}, Y^{\text{in}})$ is the ID joint distribution, $P_{X^{\text{in}}}$ is the ID marginal distribution, and $P_{X^{\text{out}}}$ is the OOD marginal distribution.

OOD Detection. Given training data $D^{\text{train}}_{\text{in}} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn independent and identically distributed (i.i.d.) from $P(X^{\text{in}}, Y^{\text{in}})$, the aim of OOD detection is to learn a model $h$ using $D^{\text{train}}_{\text{in}}$ such that, for any test data point $x$ drawn from $P_{X^{\text{in}}}$ or $P_{X^{\text{out}}}$: 1) if $x$ is drawn from $P_{X^{\text{in}}}$, then $h$ classifies $x$ into the correct ID class; and 2) if $x$ is drawn from $P_{X^{\text{out}}}$, then $h$ detects $x$ as OOD data.

Inference-time OOD Detection. A well-known branch of OOD detection develops inference-time (or post hoc) OOD detection methods (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020; Hendrycks & Gimpel, 2017; Lee et al., 2018b; Sun et al., 2021), where we are given a pre-trained ID model and then aim to recognize upcoming OOD data well. The key advantage of inference-time OOD detection methods is that the classification performance on ID data is unaffected, since we only use the ID model instead of changing it.

Score Functions. Many representative OOD detection methods use a score-based strategy: given a threshold $\gamma$, an ID model $f_\Theta$, and a scoring function $S$, a data point $x$ is detected as ID if $S(f_\Theta, x) \geq \gamma$:
$$G_\gamma(x) = \begin{cases} \text{ID}, & \text{if } S(f_\Theta, x) \geq \gamma \\ \text{OOD}, & \text{if } S(f_\Theta, x) < \gamma \end{cases} \quad (6)$$
The performance of OOD detection depends on designing a scoring function $S$ that makes OOD data obtain lower scores while ID data obtain higher scores; thus, we can distinguish ID from OOD data.
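The detection rule $G_\gamma$, together with a standard way to calibrate $\gamma$ (the 95%-TPR operating point underlying the FPR95 metric), can be sketched as follows; the helper names are illustrative:

```python
import numpy as np

def choose_gamma(id_val_scores, tpr=0.95):
    """Pick gamma so that a fraction `tpr` of held-out ID scores are >= gamma
    (the 95%-TPR operating point used by the FPR95 metric)."""
    return float(np.quantile(id_val_scores, 1.0 - tpr))

def detect(score, gamma):
    """G_gamma: classify a test point as ID iff S(f_Theta, x) >= gamma."""
    return "ID" if score >= gamma else "OOD"
```

Note that $\gamma$ is calibrated on ID data alone, which is what makes the score-based strategy usable without any OOD samples at calibration time.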

3. METHODOLOGY

Clearly, without any assumptions or conditions, OOD detection cannot be addressed well due to the unavailability of OOD data (Zhang et al., 2021). Therefore, to investigate the feasibility of OOD detection, in this section we consider a natural case in which ID classes and OOD features do not interfere with each other.

3.1. ASSUMPTION AND THEOREM

Assumption 1. The random variables $X^{\text{out}}$ and $Y^{\text{in}}$ are independent, i.e., $P(X^{\text{out}}, Y^{\text{in}}) = P_{X^{\text{out}}} P_{Y^{\text{in}}}$.

Justification of Assumption 1. To justify that Assumption 1 is realistic, we summarize three common causal graphs in Figure 1. These graphs illustrate how the data are generated through the lens of causality. Notably, in Figure 1c, we observe that $X^{\text{in}}$ and $X^{\text{out}}$ are actually dependent, which is very common in daily life. It may seem that the dependence of $X^{\text{in}}$ and $X^{\text{out}}$ could invalidate Assumption 1. However, since $X^{\text{in}}$ and $X^{\text{out}}$ are dependent only because of the shared style ($S$ in Figure 1) rather than the classes ($Y$ in Figure 1) (Yao et al., 2021), their dependence does not conflict with Assumption 1. In fact, many practical scenarios match the causal structure in Figure 1c, e.g., MNIST and Fashion-MNIST (Xiao et al., 2017). Under this assumption, we can prove our main theorem, which provides the theoretical foundation of this paper.

Theorem 1. If Assumption 1 holds, then $P_{Y^{\text{in}}|X^{\text{out}}}(y|x) = P_{Y^{\text{in}}}(y)$ for any $y \in \mathcal{Y}^{\text{in}}$. (The proof is in Appendix B.1.)

3.2. RETHINKING MSP AND GRADNORM BY THEOREM 1

According to Eq. 6, the score-based strategy carries an implicit assumption: if a data point $x$ has a lower score, then $x$ is detected as OOD with a higher probability. Based on this assumption, we consider the ideal case: what happens if a data point $x$ has the smallest possible score? We answer this question first for the MSP score.

Rethinking MSP by Theorem 1. We consider the MSP score and answer the above question with Theorem 2.

Theorem 2. Given a data point $x \in \mathcal{X}$, if $f^*_\Theta(x) \in \arg\min_{f_\Theta(x)} S_{\text{MSP}}(f_\Theta, x)$, then $\mathrm{softmax}(f^*_\Theta(x)) = u$, where $u = [1/K, \ldots, 1/K] \in \mathbb{R}^K$.

The proof of Theorem 2 is in Appendix B.3. According to the implicit assumption, when a data point has the smallest score, $x$ is detected as OOD with the highest probability. Theorem 2 then shows that in this ideal case, the output for $x$ is the uniform distribution $u$, which conflicts with our observation in the ID class-imbalance case (i.e., $\mathrm{softmax}(f_\Theta(x)) \approx P_{Y^{\text{in}}}$ if $x$ is an OOD data point). Therefore, to avoid this contradiction, we can replace the uniform distribution in MSP as follows:
$$S_{\text{RP+MSP}}(f_\Theta, x) = \max_{i \in \{1,\ldots,K\}} \left( \mathrm{softmax}_i(f_\Theta(x)) - P_{Y^{\text{in}}}(i) \right). \quad (7)$$

Rethinking GradNorm by Theorem 1. Here, we discuss how to adjust the GradNorm score. By Eq. 2, it is clear that in the ideal case, $\mathrm{softmax}(f_\Theta(x)) \approx u$; formally,
$$\lim_{\gamma \to 0} \mathrm{softmax}(f_\Theta(x)) = u, \quad \text{where } f_\Theta(x) \text{ satisfies } S_{\text{GradNorm}}(f_\Theta, x) < \gamma. \quad (8)$$
Therefore, Eq. 8 is inconsistent with our observation in the ID class-imbalance case (i.e., $\mathrm{softmax}(f_\Theta(x)) \approx P_{Y^{\text{in}}}$ if $x$ is an OOD data point). As in the MSP scenario, the basic idea is to replace the uniform distribution $u$ with the ID-class-prior distribution $P_{Y^{\text{in}}} = [P_{Y^{\text{in}}}(1), \ldots, P_{Y^{\text{in}}}(K)]$, i.e.,
$$S_{\text{RP+GradNorm}}(f_\Theta, x) = \left\| \frac{\partial D_{\text{KL}}(P_{Y^{\text{in}}} \,\|\, \mathrm{softmax}(f_\Theta(x)))}{\partial \Theta} \right\|_1. \quad (9)$$

3.3. OUR PROPOSAL: REPLACING AND REWEIGHTING STRATEGIES

Replacing (RP) Strategy.
For those methods (e.g., MSP and GradNorm) whose score functions are deeply tied to the uniform distribution $u$, the simple and straightforward modification is to replace $u$ with the ID-class-prior distribution $P_{Y^{\text{in}}}$. As mentioned in Section 3.2, we modify the score functions of MSP and GradNorm as
$$S_{\text{RP+MSP}}(f_\Theta, x) = \max_{i \in \{1,\ldots,K\}} \left( \mathrm{softmax}_i(f_\Theta(x)) - P_{Y^{\text{in}}}(i) \right), \quad (10)$$
$$S_{\text{RP+GradNorm}}(f_\Theta, x) = \left\| \frac{\partial D_{\text{KL}}(P_{Y^{\text{in}}} \,\|\, \mathrm{softmax}(f_\Theta(x)))}{\partial \Theta} \right\|_1. \quad (11)$$

Reweighting (RW) Strategy. For the methods that have no obvious connection to the uniform distribution $u$ (e.g., ODIN (Liang et al., 2018) and Energy (Liu et al., 2020)), we design the RW strategy as a complement to the RP strategy. The RW strategy reweights their scores according to the similarity between the ID-class-prior distribution $P_{Y^{\text{in}}}$ and $\mathrm{softmax}(f_\Theta(x))$. Here, we want the weights not to perturb the OOD scores too strongly. In this paper, we use the cosine function as the weight function; it is one of the most popular distance and similarity functions in contrastive learning (Chen et al., 2020; Grill et al., 2020; He et al., 2020), and, being bounded, it is suitable as a weighting factor after normalization. Specifically,
$$S_{\text{RW+Method}}(f_\Theta, x) = -S_{\text{Method}}(f_\Theta, x) \cdot \cos(\mathrm{softmax}(f_\Theta(x)), P_{Y^{\text{in}}}) = -S_{\text{Method}}(f_\Theta, x) \cdot \frac{\mathrm{softmax}(f_\Theta(x)) \cdot P_{Y^{\text{in}}}^\top}{\|\mathrm{softmax}(f_\Theta(x))\| \cdot \|P_{Y^{\text{in}}}\|}, \quad (12)$$
where $S_{\text{Method}}(f_\Theta, x)$ is a score function proposed in previous studies, e.g., ODIN and Energy. Next, we detail the reweighted ODIN and the reweighted Energy. Compared to MSP, the main improvement of ODIN is the use of a temperature-scaling strategy. We can modify ODIN as follows: for a temperature $T > 0$,
$$S_{\text{RW+ODIN}}(f_\Theta, x) = -\max_{i \in \{1,\ldots,K\}} \frac{\exp(f_i(x)/T)}{\sum_{j=1}^{K} \exp(f_j(x)/T)} \cdot \cos(\mathrm{softmax}(f_\Theta(x)), P_{Y^{\text{in}}}). \quad (13)$$
Energy (Liu et al., 2020) proposes to replace the softmax function with the energy function (LeCun et al., 2006) for OOD detection. The energy function has a property that is highly correlated with the output distribution: a system with a more concentrated probability distribution has lower energy, while a system with a more dispersed distribution (closer to the uniform distribution) has higher energy (LeCun et al., 2006). Thus, the energy of ID data is lower than that of OOD data. Based on Eq. 12, we modify Energy as follows:
$$S_{\text{RW+Energy}}(f_\Theta, x) = T \cdot \log \sum_{i=1}^{K} e^{f_i(x)/T} \cdot \cos(\mathrm{softmax}(f_\Theta(x)), P_{Y^{\text{in}}}). \quad (14)$$
In this paper, we realize our strategies mainly via Eqs. 10, 11, 13 and 14.
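The modified scores for MSP and ODIN can be sketched as follows (hypothetical helper names; the sign of the ODIN score follows Eq. 13, and ODIN's input-perturbation step is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def rp_msp_score(logits, prior):
    """RP+MSP (Eq. 10): max over classes of softmax_i(f(x)) - P_{Y_in}(i),
    replacing MSP's implicit uniform reference with the ID-class prior."""
    probs = F.softmax(logits, dim=1)
    return (probs - prior.unsqueeze(0)).max(dim=1).values

def rw_odin_score(logits, prior, T=1000.0):
    """RW+ODIN (Eq. 13): temperature-scaled maximum softmax probability,
    reweighted by the cosine similarity between the (unscaled) softmax
    output and the ID-class prior."""
    probs = F.softmax(logits, dim=1)
    scaled_msp = F.softmax(logits / T, dim=1).max(dim=1).values
    weight = F.cosine_similarity(probs, prior.expand_as(probs), dim=1)
    return -scaled_msp * weight  # sign follows Eq. 13
```

Both functions take a batch of logits and a prior vector and return one score per example, so they can be dropped into the thresholding rule of Eq. 6 without retraining the model.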

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

In this section, we construct a series of imbalanced ID datasets whose data are sampled from ImageNet-1K (Deng et al., 2009). We then train ID classifiers on them as pre-trained ID models and use the large-scale ImageNet OOD detection benchmark (Huang & Li, 2021) to evaluate our methods, i.e., RP+MSP (Eq. 10), RP+GradNorm (Eq. 11), RW+ODIN (Eq. 13), and RW+Energy (Eq. 14). In addition, we also evaluate our methods on a real-world imbalanced dataset, iNaturalist (Horn et al., 2018); see Appendix A.2.

Imbalanced ID Datasets. The class frequencies of our imbalanced datasets are sampled from a Pareto distribution with density $p(x) = \frac{a m^a}{x^{a+1}}$. In Appendix B.5, we show that the parameter $m$ does not affect the level of imbalance; thus, we set $m = 1$. The level of imbalance depends on the tail index $a$ (see Figure 3), so to evaluate our methods under different degrees of imbalance, we vary the tail index $a$. The class-frequency distributions of the sampled datasets are shown in Figure 3. As the tail index $a$ increases, the sampled datasets become more imbalanced; thus, the ImageNet-LT-a8 dataset is the most imbalanced.

OOD Datasets. At inference time, we use the large-scale benchmark proposed by Huang & Li (2021). In this benchmark, the OOD datasets include subsets of iNaturalist (Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2018), and Textures (Cimpoi et al., 2014). Note that there are no overlapping classes between the ID and OOD datasets (Huang & Li, 2021).

Evaluation Metrics. We use two common metrics to evaluate OOD detection methods (Huang et al., 2021): the false positive rate of classifying OOD data as ID data when 95% of ID data are correctly classified (FPR95) (Provost et al., 1998), and the area under the receiver operating characteristic curve (AUROC) (Huang et al., 2021).

Baselines. To verify the effectiveness of our strategies, we select MSP, ODIN, Energy, GradNorm and Dice as baselines, where Dice is the state-of-the-art (SOTA) method.
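The Pareto-based class-size construction can be sketched as follows. The `max_count` scaling and the descending sort are illustrative assumptions for producing a long-tailed count profile, not the paper's exact recipe:

```python
import numpy as np

def pareto_class_counts(num_classes=1000, a=8, m=1.0, max_count=1280, seed=0):
    """Sample relative class frequencies from a Pareto(a, m) distribution
    (density a*m^a / x^(a+1), x >= m) and scale them to per-class sample
    counts. A larger tail index a yields a more imbalanced dataset."""
    rng = np.random.default_rng(seed)
    # numpy's pareto() draws Lomax samples; m * (1 + X) is classical Pareto
    freqs = m * (1.0 + rng.pareto(a, size=num_classes))
    freqs = np.sort(freqs)[::-1]                    # head classes first
    counts = np.maximum(1, (max_count * freqs / freqs.max()).astype(int))
    return counts
```

With `a = 8` most classes collapse toward the minimum count, reproducing the heavy head/tail split the paper calls ImageNet-LT-a8; smaller `a` spreads the counts more evenly.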
Following Huang et al. (2021) and Liang et al. (2018), the temperature parameter $T$ is set to 1000 for ODIN and 1 for GradNorm.

Models and Hyperparameters. We use mmclassification (Contributors, 2020), released under the Apache-2.0 license, to train the ID models. The training details of ResNet (He et al., 2016) and MobileNet (Howard et al., 2019) follow the default settings in mmclassification. All methods are implemented in PyTorch 1.6.0 with CUDA 10.2, and we use several NVIDIA Tesla V100 GPUs.
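The FPR95 and AUROC metrics used throughout the experiments can be computed from raw scores with NumPy alone. In this sketch, ID is treated as the positive class and higher scores mean more ID-like:

```python
import numpy as np

def auroc_fpr95(id_scores, ood_scores):
    """AUROC and FPR95 for an OOD score where ID data should score higher.
    AUROC = P(score_id > score_ood), with ties counted as 1/2; FPR95 is the
    fraction of OOD scores above the threshold keeping 95% of ID data."""
    id_s = np.asarray(id_scores, dtype=float)
    ood_s = np.asarray(ood_scores, dtype=float)
    # AUROC via the Mann-Whitney U statistic (pairwise comparison)
    greater = (id_s[:, None] > ood_s[None, :]).mean()
    ties = (id_s[:, None] == ood_s[None, :]).mean()
    auroc = greater + 0.5 * ties
    # Threshold at 95% TPR = 5th percentile of the ID scores
    thr = np.quantile(id_s, 0.05)
    fpr95 = (ood_s >= thr).mean()
    return auroc, fpr95
```

The pairwise AUROC is quadratic in the number of samples, which is fine for sanity checks; for benchmark-sized evaluations a rank-based implementation would be preferable.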

4.2. EXPERIMENTAL RESULTS AND ANALYSIS

Verification of the Two Strategies. Our strategies are applicable to the various score functions used by OOD detection methods. The performance of our methods and the baselines is shown in Table 1. Overall, after modifying previous methods with our strategies, their performance is significantly improved, indicating the effectiveness of our strategies. Specifically, RW+Energy achieves the highest AUROC (78.59%) among all methods. As a highlight, RW+Energy shows the most significant performance improvement on all four datasets: a ∼61% FPR95 decrease on iNaturalist, ∼17% on SUN, ∼10% on Places and ∼12% on Textures. Moreover, our strategies outperform the existing baselines on all evaluation tasks. Compared with the best baseline, RW+Energy increases AUROC from 71.31% to 78.59%, while RP+GradNorm reduces FPR95 from 76.11% to 70.12%. These results show that our strategies significantly outperform the baselines in ID-class-imbalanced scenarios.

Analysis of Detection Results on Different ID Classes. We divide the ID classes into head, middle, and tail groups according to the number of samples in the training dataset. Then, we evaluate the OOD detection performance on three datasets: ID Head+OOD, ID Mid+OOD and ID Tail+OOD (details can be found in Appendix A.1.3). If a method performs better on ID Tail+OOD than on ID Head+OOD, then it handles tailed classes together with OOD data better.

In the case of GradNorm, the experimental results in Figure 4 show that our method RP+GradNorm improves the performance on all three datasets (ID Head+OOD, ID Mid+OOD, and ID Tail+OOD). Taking a closer look, we notice that the overall improvement of RP+GradNorm is mainly due to the significant improvement on the ID Tail+OOD dataset. This result indicates that previous methods, like GradNorm, confuse OOD data with ID tailed classes, which hinders their OOD detection performance, and that our strategies can overcome this issue. More detailed results are shown in Appendix A.1.3.

4.3. ABLATION STUDY

Analysis regarding the Tail Index a. Here, we report the performance of our method and the baselines when varying the tail index $a \in \{2, 3, \ldots, 8\}$. We conduct repeated experiments on these seven datasets (ImageNet-LT-a2, ImageNet-LT-a3, ..., ImageNet-LT-a8), and the results are shown in Figure 5. Overall, our method RP+GradNorm always outperforms the baselines across different degrees of imbalance. More importantly, the performance gap between RP+GradNorm and each baseline gradually widens as the degree of imbalance increases. This indicates that RP+GradNorm handles different imbalanced scenarios better. More detailed results are in Appendix A.1.2.

Analysis regarding Network Architecture. We evaluate all methods on a different network architecture, MobileNet-V3 (Howard et al., 2019). The results in Table 2 show that our methods (RP+GradNorm and RW+Energy) still outperform the baselines on the four evaluation tasks even when the network architecture changes. In addition, RP+GradNorm achieves better FPR95 while RW+Energy achieves higher AUROC, consistent with the behavior of GradNorm and Energy.

Analysis regarding Model Size. We study the effect of model size on RP+GradNorm by comparing ResNet50, ResNet101 and ResNet152 trained on the ImageNet-LT-a8 dataset. The results are shown in Table 3. The optimal model is the smallest one (ResNet50), and we observe that the performance decreases as the model size increases. One possible reason is that smaller models are less prone to overfitting and thus more suitable for OOD detection in imbalanced scenarios.

More Experiments and Exploration. First, we can also regard the cosine similarity weights in the RW strategy as a standalone score function; see the experiments in Appendix A.1.4. Second, to evaluate the stability of our strategies, we conduct 10 independent replicate experiments in Appendix A.1.5.
Third, to further explore how existing methods combine with our strategies, we conduct experiments on RW+MSP, RW+GradNorm, and RW+RP+MSP/GradNorm, as shown in Appendix A.1.6. We then consider what happens if the ID-class prior is inaccurate in practical applications and conduct the relevant experiments in Appendix A.1.7. Finally, we show in Appendix A.1.8 that our methods still work well when models are trained with long-tailed learning strategies (Cao et al., 2019; Park et al., 2022) during the training phase.

5. RELATED WORK

Inference-time OOD Detection: Recently, GradNorm (Huang et al., 2021) used the similarity between the model-predicted probability distribution and the uniform distribution to improve OOD detection and achieved state-of-the-art performance. In this paper, we mainly work on inference-time OOD detection methods and aim at improving the generalizability of OOD detection in real-world scenarios.

Training-time OOD Detection: Other methods (Hsu et al., 2020; Hein et al., 2019; Bitterwolf et al., 2020; Wang et al., 2021b) address ID tasks and OOD detection simultaneously at training time. Bitterwolf et al. (2020) use adversarial learning to process OOD data at training time and make the model predict lower confidence scores for them. Wang et al. (2021b) generate pseudo OOD data by adversarial learning to retrain a (K+1)-class model for OOD detection. These methods usually require auxiliary OOD data to be available during training; thus, the model is affected by both ID and OOD data. It is important for these methods to explore the inherent trade-off (Liu et al., 2019; Vaze et al., 2022; Yang et al., 2021) between ID tasks and OOD detection.

6. CONCLUSION

This paper theoretically and empirically shows that the unproven uniform-distribution assumption in previous methods is invalid when the training dataset is imbalanced. Moreover, by analyzing the causal relations between ID/OOD classes and features, we show that the proper reference distribution for OOD data is the ID-class-prior distribution. Based on this, we propose two simple and effective strategies to correct the uniform-distribution assumption in previous inference-time OOD detection methods: the RP strategy suits methods that directly use the uniform distribution in their score functions, while the RW strategy is designed for methods that use the assumption implicitly. Extensive experiments show that both strategies can significantly improve OOD detection performance on large-scale image classification benchmarks.

7. ETHIC STATEMENT

This paper does not raise any ethics concerns. This study does not involve any human subjects, practices to data set releases, potentially harmful insights, methodologies and applications, potential conflicts of interest and sponsorship, discrimination/bias/fairness concerns, privacy and security issues, legal compliance, and research integrity issues.

8. REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our experimental results, we provide the main code in Appendix D. The experimental setups for training and evaluation, as well as the hyperparameters, are described in detail in Section 4.1.

A.1.1 EVALUATION AGAINST A BALANCED TRAINING SET

We randomly sample a balanced dataset from ImageNet-1K with the same number of samples as the imbalanced datasets. We conduct experiments on the balanced data, as shown in Figure 6 and Figure 7. All methods show a similar trend: the performance drops considerably when the training dataset becomes imbalanced (from cyan bars to yellow bars). Moreover, our method shows a significant improvement over previous methods on all evaluation tasks (from yellow bars to red bars).

A.1.2 DETAILED ANALYSIS REGARDING THE TAIL INDEX

The ImageNet-LT-a8 dataset is the most imbalanced, while ImageNet-LT-a2 is the most balanced. We conduct repeated experiments on these seven datasets, and the results are shown in Table 4. Clearly, the more imbalanced the training ID dataset becomes, the more our methods (RP+GradNorm and Cosine Similarity) demonstrate their superiority over the other methods on all evaluation tasks. It is noticeable that the detection performance of GradNorm is relatively stable regardless of the imbalance ratio, compared with the other existing methods (such as MSP, ODIN, and Energy). These methods explicitly or implicitly use the discrepancy between the classifier's output and the uniform distribution; thus, they are strongly affected when the prior distribution changes from uniform to an imbalanced/long-tailed one. As for GradNorm, we conjecture that working in the gradient space may confer robustness to changes of the prior (such as from a uniform prior to an imbalanced one). To verify this conjecture, we run an experiment in which the KL divergence is directly used to measure the discrepancy between the classifier's output and the uniform distribution (i.e., GradNorm without the gradient-norm step). The results are shown in Table 5.
This KL-based method is also significantly affected by the imbalanced setting, which verifies the conjecture: GradNorm's robustness to the imbalance ratio comes from operating in the gradient space.

A.1.3 ANALYSIS OF DETECTION RESULTS ON DIFFERENT ID CLASSES

We calculate the evaluation metrics for the three categories (ID-Head, ID-Mid, ID-Tail) by randomly sampling OOD data in proportion to the number of samples in each category. For example, suppose we have 50000 ID samples and 10000 OOD samples in total. If the Head category contains 10000 ID samples, we accordingly sample 2000 OOD samples and use the resulting 12000 samples to calculate AUROC and FPR95. The results reflect the degree of confusion between ID Head data and OOD data from the viewpoint of OOD detection methods. We conduct experiments to analyze the performance on the different data types, as shown in Table 6. We also analyze tailed categories using a confusion matrix; some cases are listed in Table 7. The result takes the form $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$, where $A$ is the number of ID samples in the current class correctly classified as ID and $C$ is the number of ID samples in the current class misclassified as OOD; $D$ is the number of OOD samples close to the ID samples of the current class correctly classified as OOD, and $B$ is the number of such OOD samples misclassified as ID. In Classes #88 and #94, more OOD samples are correctly classified after applying our strategies, while the performance on Class #671 remains stable. In Class #671, more ID samples are correctly classified, while some ID-sample performance in Class #94 is sacrificed for a larger improvement on OOD samples. Moreover, we visualize the OOD score distributions in Figures 8 to 11. The results and figures clearly show that previous methods tend to confuse OOD data with the minority classes, which hinders their OOD detection performance, and that our strategies reduce this confusion and improve OOD detection.

A.1.4 COSINE SIMILARITY AS A SCORE FUNCTION

We can also regard the cosine similarity weights in the RW strategy as a score function, and we conduct several experiments in Table 4, Table 8 and Table 9.
We notice that the cosine similarity alone also achieves a significant improvement over the baselines on the main evaluation tasks. However, the cosine similarity is sensitive to the ID data distribution, since its performance in the random-sampling experiments (see Table 8) is not as good as that of RP+GradNorm.

A.1.5 PERFORMANCE EVALUATION UNDER RANDOM SAMPLING

To further evaluate the performance of our method RP+GradNorm, we conduct experiments on 10 different ID datasets, generated randomly from ImageNet-1K using the Pareto distribution with a = 2 and a = 8. The results on the ImageNet-LT-a8 datasets are reported in the corresponding tables.

A.1.6 ABLATION STUDY BETWEEN PROPOSED STRATEGIES

To further explore how existing methods combine with our strategies, we conduct experiments on RW+MSP, RW+GradNorm, and RW+RP+MSP/GradNorm, as shown in Table 11. Since the cosine distance performs better than MSP, the RW strategy can modify the output of MSP and improves it more than the RP strategy does. However, for GradNorm, the RP strategy performs better than the RW strategy, because the cosine distance does not perform much better than GradNorm and the RP strategy better matches the idea of the original method. Therefore, we suggest choosing between the two strategies by preferentially matching the idea of the original method. As for RP+RW+MSP/GradNorm, these combinations do not perform well. Taking GradNorm as an example, we believe the main reason is that RP+GradNorm already performs very well, while the RW strategy performs worse than RP+GradNorm; adding the RW strategy to RP+GradNorm therefore has a negative effect and significantly reduces the performance.

A.1.8 EVALUATION WITH LONG-TAILED TRAINING STRATEGIES

We train ResNet50 with long-tailed methods such as LDAM (Cao et al., 2019) and CMO (Park et al., 2022), and then evaluate it with different OOD detection methods and our strategies, as shown in Table 13. Models trained with the LDAM loss do perform better than those trained with the cross-entropy loss, but after applying our strategies there are still significant improvements for all methods. However, CMO does not improve OOD detection as LDAM does, and even performs worse than cross-entropy. This indicates that not all long-tailed training methods help improve the OOD detector. Nevertheless, the results show that our strategies still work well when the models try to overcome class imbalance at training time.
Finally, we reiterate our view on class-imbalanced OOD detection:

• Data imbalance is a common phenomenon, and even a slight imbalance (as in the ImageNet-LT-a2 dataset) can reduce the performance of an OOD detector. Applying our strategies mitigates this degradation.

• Developers do not necessarily use strategies to overcome data imbalance during the training phase; this depends on whether the specific application requires more attention to minority classes.

• Even if developers use strategies to overcome data imbalance during training, it is very hard to obtain a class-balanced classifier. Experimental results show that our method achieves performance improvements with or without a strategy for overcoming data imbalance.

A.1.9 EVALUATION ON THE BALANCED DATASET

For the RP strategy, when the training dataset is balanced (the class-prior distribution in Eqs. 10 and 11 is the uniform distribution), RP+Method is identical to the original method. For the RW strategy, we conduct experiments on the full ImageNet dataset (which is balanced); the results are shown in Table 14. The performances of Energy and ODIN are quite close to those of Energy and ODIN with the RW strategy on the balanced dataset.
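The A.1.9 observation that RP+Method reduces to the original method under a uniform prior can be checked with a small numpy sketch. This is our own illustration (the names `msp_score` and `rp_msp_score` are ours), taking the RP-modified MSP score as the maximum of the softmax output minus the prior, analogous to the RP+ODIN score in Appendix B.4:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def msp_score(logits):
    """Original MSP score: maximum softmax probability."""
    return softmax(logits).max()

def rp_msp_score(logits, prior):
    """RP-modified MSP (illustrative): max of softmax output minus prior."""
    return (softmax(logits) - prior).max()

K = 5
uniform = np.full(K, 1.0 / K)
logits_a = np.array([2.0, 0.5, 0.1, -1.0, 0.3])
logits_b = np.array([0.4, 0.3, 0.2, 0.1, 0.0])

# With a uniform prior, RP shifts every MSP score by the constant 1/K,
# so the ranking of samples (hence AUROC/FPR95) is unchanged.
shift_a = msp_score(logits_a) - rp_msp_score(logits_a, uniform)
shift_b = msp_score(logits_b) - rp_msp_score(logits_b, uniform)
```

Because the shift is the same constant for every input, the detector's decisions are identical to those of the original method.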

A.2.2 EXPERIMENTAL RESULTS

We evaluate our methods and previous methods on the proposed benchmark. The results in Table 15 show that our methods bring a significant improvement in OOD detection over previous methods. In addition, the RW strategy appears to be less sensitive to the performance of the original method than the RP strategy: the RW strategy brings substantial gains even when the original method performs particularly poorly, whereas the RP strategy does not.

B FURTHER ANALYSIS

B.1 PROOF OF THEOREM 1

Theorem 1. If Assumption 1 holds, then P_{Y^in|X^out}(y|x) = P_{Y^in}(y) for any y ∈ Y^in.

Proof. Using Assumption 1 in the second equality, we have

P_{Y^in|X^out}(y|x) = P(Y^in = y ∧ X^out = x) / P(X^out = x) = P(Y^in = y) P(X^out = x) / P(X^out = x) = P_{Y^in}(y).

B.2 ALTERNATIVE CHOICE FOR ID-CLASS-PRIOR DISTRIBUTION

When the labels of the training dataset are not available, we can use the model's predictions as an alternative to estimate the empirical ID-class-prior distribution. Specifically, for each sample x_i in the training dataset, the model's prediction is softmax(f_Θ(x_i)), and we set P_{Y^in} = (1/N) Σ_{i=1}^{N} softmax(f_Θ(x_i)). We also conduct experiments to confirm this choice, and the results are shown in Table 16. Notably, the OOD detection performances with the two kinds of ID-class-prior distribution are similar.

B.3 PROOF OF THEOREM 2

Proof. By the definition of the softmax function, Σ_{i=1}^{K} softmax_i(f_Θ(x)) = 1 and softmax_i(f_Θ(x)) ≥ 0 for all i = 1, ..., K.

Existence. Suppose that softmax_i(f_Θ(x)) < 1/K for all i = 1, ..., K. Then Σ_{i=1}^{K} softmax_i(f_Θ(x)) < 1, which contradicts Σ_{i=1}^{K} softmax_i(f_Θ(x)) = 1. Hence there is at least one i such that softmax_i(f_Θ(x)) ≥ 1/K, which implies min_{f_Θ(x)} S_MSP(f_Θ, x) ≥ 1/K. Note that when softmax(f_Θ(x)) = u, S_MSP(f_Θ, x) = 1/K, which implies that there exists f̂_Θ(x) ∈ argmin_{f_Θ(x)} S_MSP(f_Θ, x) such that u = softmax(f̂_Θ(x)).

Uniqueness. Suppose there is f*_Θ(x) ∈ argmin_{f_Θ(x)} S_MSP(f_Θ, x) such that softmax(f*_Θ(x)) ≠ u. Since the minimal score is 1/K, every entry of softmax(f*_Θ(x)) is at most 1/K, and at least one entry is strictly smaller than 1/K, so Σ_{i=1}^{K} softmax_i(f*_Θ(x)) < 1, which contradicts Σ_{i=1}^{K} softmax_i(f_Θ(x)) = 1. Therefore softmax(f*_Θ(x)) = u. Combining the existence and uniqueness results completes the proof.

B.4 DISCUSSION ABOUT RP+ODIN

ODIN (Liang et al., 2018) is an enhanced version of MSP whose main improvement is the introduction of a temperature scaling strategy. The temperature parameter T smooths the prediction distribution of the softmax function, making the prediction closer to the uniform distribution:

S_ODIN(f_Θ, x) = max_i exp(f_i(x)/T) / Σ_{j=1}^{C} exp(f_j(x)/T).    (16)

Since ODIN maps the softmax prediction into another distribution space, while we need to measure the similarity between the class-prior distribution and the model-predicted distribution, we must apply the same mapping to the class-prior distribution P_{Y^in} = [p_1, p_2, ..., p_C]:

P'_{Y^in} = [exp(p_1/T)/Σ_{j=1}^{C} exp(p_j/T), exp(p_2/T)/Σ_{j=1}^{C} exp(p_j/T), ..., exp(p_C/T)/Σ_{j=1}^{C} exp(p_j/T)].

We then use this new class-prior distribution P'_{Y^in} to modify ODIN with the RP strategy as in Eq. (19):

S_RP+ODIN(f_Θ, x) = max(h_Θ(x) − P'_{Y^in}).    (19)

When we follow ODIN's default setting T = 1000, we notice that P'_{Y^in} is quite close to the uniform distribution, with each element close to 1/C. Thus, Eq. (19) can be regarded as h_Θ(x) minus a constant.
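A small numerical sketch makes the collapse of P'_{Y^in} concrete (the prior values and the helper `temperature_softmax` are our own illustrative choices):

```python
import numpy as np

def temperature_softmax(p, T):
    """Map a distribution through the temperature-scaled softmax, as for P'_{Y^in}."""
    e = np.exp(p / T)
    return e / e.sum()

# A heavily imbalanced (long-tailed) class prior over C = 10 classes.
prior = np.array([0.5, 0.2, 0.1, 0.07, 0.05, 0.03, 0.02, 0.015, 0.01, 0.005])
uniform = np.full(prior.size, 1.0 / prior.size)

dist_T1 = np.abs(temperature_softmax(prior, T=1.0) - uniform).max()
dist_T1000 = np.abs(temperature_softmax(prior, T=1000.0) - uniform).max()
# At T = 1000 the mapped prior is nearly uniform, so Eq. (19) subtracts
# (almost) a constant from h_Θ(x) and the RP modification has little effect.
```

This is exactly why RP+ODIN under the default T = 1000 behaves like the original ODIN score minus a constant.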

B.5 DISCUSSION ABOUT M IN PARETO DISTRIBUTION

For each class x_i, the sample number is

y_i = N × p(x_i) = N × a m^a / x_i^{a+1},

where a is the tail index, m is a constant, and N is the number of samples in the ImageNet-1K dataset. After sampling, the new data distribution over classes is

p(y_i) = y_i / Σ_{i=1}^{K} y_i = (N × a m^a / x_i^{a+1}) / Σ_{i=1}^{K} (N × a m^a / x_i^{a+1}) = (1/x_i^{a+1}) / Σ_{i=1}^{K} (1/x_i^{a+1}).

Clearly, the value of m does not affect the imbalance degree of the sampled datasets. Thus, we keep m = 1 unchanged.
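The derivation above can be checked with a short sketch (function names are our own; the numbers are toy values), confirming that m cancels out after normalization:

```python
import numpy as np

def pareto_class_counts(N, K, a, m=1.0):
    """Per-class sample numbers y_i = N * a * m^a / x_i^(a+1) for x_i = 1..K."""
    x = np.arange(1, K + 1, dtype=float)
    return N * a * m**a / x**(a + 1)

def class_distribution(counts):
    """Normalize counts into the sampled class distribution p(y_i)."""
    return counts / counts.sum()

# m scales every class count by the same constant, so the normalized
# distribution (and hence the imbalance degree) is independent of m.
p_m1 = class_distribution(pareto_class_counts(N=100_000, K=10, a=2, m=1.0))
p_m3 = class_distribution(pareto_class_counts(N=100_000, K=10, a=2, m=3.0))
```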

B.6 DISCUSSION ABOUT FEATURE-BASED METHODS

Feature-based methods, like KNN (Sun et al., 2022), need a training set to generate class prototypes, i.e., an average feature vector for each category. Under class-imbalanced situations, the prototypes of tail classes are less reliable than those of head classes due to the limited number of training samples. We think that using the ID-class-prior distribution to reweight features may be an effective way to address the imbalance problem in feature space.

B.7 DISCUSSION ABOUT CLASS-DEPENDENT THRESHOLDING

Guarrera et al. (2022) design an optimization procedure for threshold selection rather than an OOD score. Class-dependent thresholding is designed for p_train ≠ p_test, not for data imbalance. This approach only affects the precision and recall metrics at the deployment stage of the OOD detector, not the AUROC and FPR95 metrics that our paper focuses on, as can be seen in Table 1 of Guarrera et al. (2022).

B.8 DISCUSSION ABOUT THE POSSIBILITY FOR RP+MSP

To explore the possibility of aligning the minimizer of the score function with the class prior, we conduct experiments with the score max_i(softmax_i(f(x))/P_{Y^in}(i)) and report the corresponding results in the table below. Our experiments show that max_i(softmax_i(f(x))/P_{Y^in}(i)) also performs very well.

C DETAILED RELATED WORKS

OOD Detection. OOD detection is a crucial problem for reliably deploying machine learning models in real-world scenarios. It can be divided into two categories according to whether the classifier is re-trained for OOD detection. 1) Inference-time/post hoc OOD detection: Some methods (Huang et al., 2021; Liang et al., 2018; Liu et al., 2020; Hendrycks & Gimpel, 2017; Lee et al., 2018b; Sun et al., 2021) focus on designing OOD score functions for OOD detection at inference time and are easy to use without changing the model's parameters.
This property is important for deploying OOD detection methods in real-world scenarios where re-training is prohibitively expensive and time-consuming. MSP (Hendrycks & Gimpel, 2017) directly takes the maximum value of the model's prediction as the OOD score. Building on MSP, ODIN (Liang et al., 2018) uses a temperature scaling strategy and input perturbation to improve OOD detection performance. Moreover, Liu et al. (2020) and Wang et al. (2021a) propose to replace the softmax function with energy functions for OOD detection. Recently, GradNorm (Huang et al., 2021) uses the similarity between the model-predicted probability distribution and the uniform distribution to improve OOD detection and achieves state-of-the-art performance. In this paper, we mainly work on inference-time OOD detection methods and aim to improve the generalizability of OOD detection in real-world scenarios. 2) Training-time OOD detection: Other methods (Hsu et al., 2020; Hein et al., 2019; Bitterwolf et al., 2020; Wang et al., 2021b) complete ID tasks and OOD detection simultaneously at training time. Bitterwolf et al. (2020) use adversarial learning to process OOD data at training time, making the model predict lower confidence scores for such data. Wang et al. (2021b) generate pseudo OOD data via adversarial learning to re-train a (K+1)-way model for OOD detection. These methods usually require auxiliary OOD data during training, so the model is affected by both ID and OOD data; it is thus important for these methods to explore the inherent trade-off (Liu et al., 2019; Vaze et al., 2022; Yang et al., 2021) between ID tasks and OOD detection. Wang et al. (2022) study training-time OOD detection and use OOD data to fine-tune the model; after fine-tuning, the model can deal with both the imbalance issue and the OOD problem.
Their problem is similar to ours, but the setting is completely different from our paper (inference-time OOD detection): we do not change any parameters of the model and design methods to deal with the imbalance issue in OOD detection. Note that our work and this work are not comparable due to the different problem settings. Open Set Recognition. In open set recognition, machine learning models (Huang & Li, 2021; Lee et al., 2018a; Perera & Patel, 2019; Perera et al., 2020; Shalev et al., 2018; Radford et al., 2021; Fort et al., 2021) are required both to correctly classify known (ID) data from the closed set and to detect unknown (OOD) data from the open set. Some works (Lee et al., 2018a; Huang & Li, 2021) use information in the label space for OOD detection, dividing the large semantic space into multiple levels so that models can understand it more easily.



Note that MSP also relies on this assumption; we discuss it in Section 3.2. In fact, ODIN uses a modified softmax function with temperature T, which is also related to the uniform distribution, so we can also modify ODIN with the RP strategy: we can map the class-prior distribution into the same space as ODIN's OOD scores via the temperature T. However, under ODIN's default setting (T = 1000), ∥P_{Y^in} − u∥/T ≈ 0, so RP+ODIN may not work. We discuss this issue in Appendix B.4. https://github.com/open-mmlab/mmclassification



Figure 2: (a) Plot showing the data distribution of balanced and imbalanced datasets. OOD detection performances of (b) GradNorm and (c) Energy. Smaller FPR95 values are better. Cyan (left) bar: the original method on a balanced dataset. Yellow (middle) bar: the original method on an imbalanced dataset. Red (right) bar: the original method with our method on an imbalanced dataset. For a fair comparison, the sample numbers of balanced and imbalanced datasets are the same. More detailed results are shown in Appendix A.1.1.

Figure 3: Data distribution with different tail index a.

Figure 4: Performance comparison of different data types. The figure shows the OOD detection performance of GradNorm and RP+GradNorm on four OOD datasets.

Figure 5: OOD detection performance with ResNet101 trained on different imbalanced ID datasets. ↑ indicates larger values are better and ↓ indicates smaller values are better.

Figure 6: OOD detection performance (AUROC) of (a) MSP, (b) ODIN, (c) Energy, and (d) GradNorm. Larger AUROC values are better. Cyan (left) bar: the original method on the balanced dataset. Yellow (middle) bar: the original method on the imbalanced dataset. Red (right) bar: the original method with our method on the imbalanced dataset.

Figure 7: OOD detection performance (FPR95) of (a) MSP, (b) ODIN, (c) Energy, and (d) GradNorm. Smaller FPR95 values are better. Cyan (left) bar: the original method on the balanced dataset. Yellow (middle) bar: the original method on the imbalanced dataset. Red (right) bar: the original method with our method on the imbalanced dataset.

Figure 8: OOD score distribution of (a) MSP and (b) RP+MSP.

Figure 10: OOD score distribution of (a) Energy and (b) RW+Energy.

Next, we discuss how to utilize this novel observation to improve existing score-based OOD methods. When the labels of the training dataset are not available, we can use the model's predictions as an alternative to estimate the empirical ID-class-prior distribution P_{Y^in}. The detailed analysis and experiments can be found in Appendix B.2.
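A minimal sketch of the two prior-acquisition methods (illustrative numpy code; the toy data and the function names are our own assumptions):

```python
import numpy as np

def prior_from_labels(labels, K):
    """Empirical ID-class prior from training labels: normalized class counts."""
    counts = np.bincount(labels, minlength=K).astype(float)
    return counts / counts.sum()

def prior_from_predictions(softmax_outputs):
    """Label-free alternative: P_{Y^in} = (1/N) * sum_i softmax(f_Θ(x_i))."""
    return softmax_outputs.mean(axis=0)

# Toy imbalanced training set with near-one-hot model predictions.
labels = np.array([0, 0, 0, 0, 1, 1, 2])
preds = np.eye(3)[labels] * 0.9 + 0.05
preds = preds / preds.sum(axis=1, keepdims=True)

p_labels = prior_from_labels(labels, K=3)
p_preds = prior_from_predictions(preds)
# Both priors are valid distributions and rank the classes identically.
```

When the model is reasonably accurate, the averaged predictions track the label-based class frequencies closely, which is consistent with the similar OOD detection performance reported for the two variants.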

OOD detection performance comparison with other competitive score-based OOD detection methods. All methods are based on ResNet101 trained on ImageNet-LT-a8. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages. Bold indicates the best performance and underline the second best.

OOD detection performance with MobileNet trained on ImageNet-LT-a8. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages. The bold indicates the best performance while the underline indicates the second.

OOD detection performance as model size increases. The RP+GradNorm method is trained on ImageNet-LT-a8. All values are percentages.

Table of Contents of Appendix

A Further Experiments
  A.1 Evaluation on ImageNet Benchmark
    A.1.1 Evaluation on Imbalanced Data and Balanced Data
    A.1.2 Analysis regarding Tail Index
    A.1.3 Analysis of Detection Results on Different ID Classes
    A.1.4 Cosine Similarity as A Score Function
    A.1.5 Performance Evaluation under Random Sampling
    A.1.6 Ablation Study between Proposed Strategies
    A.1.7 Robustness to Inaccurate Class Prior Distribution
    A.1.8 OOD Detection with Long-tailed Learning
    A.1.9 Evaluation on the Balanced Dataset
  A.2 Evaluation on iNaturalist Benchmark
    A.2.1 Experiment Setup
    A.2.2 Experimental Results

OOD detection performance with ResNet101 trained on different imbalanced ID datasets. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages.

OOD detection performance with ResNet101 trained on different imbalanced ID datasets. KL stands for using only KL divergence as the OOD detection function. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages.

Performance comparison of different data types. All methods are based on ResNet101 trained on ImageNet-LT-a8. All values are percentages.

Confusion matrix for some tail categories.

Performance comparison under random sampling. All methods are based on ResNet101 trained on imbalanced ID datasets with tail index a = 8. The results are means ± standard errors over ten randomly sampled datasets. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages.

OOD detection performance as model size increases. The RP+GradNorm method is trained on ImageNet-LT-a8. All values are percentages.



Performance comparison under random sampling. All methods are based on ResNet101 trained on imbalanced ID datasets with tail index a = 2. The results are means ± standard errors over ten randomly sampled datasets. ↑ indicates larger values are better and ↓ indicates smaller values are better. All values are percentages.

Ablation study between proposed strategies. All methods are trained on ImageNet-LT-a8 dataset with ResNet101.

OOD detection performances with different levels of noise.

We further consider what happens if the ID-class prior is inaccurate in practical applications, and conduct experiments accordingly. Specifically, we simulate prior errors by adding noise of different intensities to the ID-class-prior distribution. Let std denote the standard deviation of the ID-class-prior distribution; the added noise follows N(0, k · std), where k controls the noise intensity. The results are shown in Table 12. After adding noise, the OOD detection performance does decrease, but not by much, which shows that our method is robust to inaccurate class-prior distributions.
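The noise-injection procedure can be sketched as follows (hypothetical code; the clipping and renormalization steps are our own assumptions, needed to keep the perturbed prior a valid distribution):

```python
import numpy as np

def noisy_prior(prior, k, rng):
    """Perturb the prior with N(0, k * std) noise, where std is the standard
    deviation of the prior itself, then renormalize to a valid distribution."""
    std = prior.std()
    perturbed = prior + rng.normal(0.0, k * std, size=prior.size)
    perturbed = np.clip(perturbed, 1e-8, None)  # keep probabilities non-negative
    return perturbed / perturbed.sum()

rng = np.random.default_rng(0)
prior = np.array([0.5, 0.25, 0.12, 0.08, 0.05])
p_mild = noisy_prior(prior, k=0.1, rng=rng)    # low noise intensity
p_strong = noisy_prior(prior, k=1.0, rng=rng)  # high noise intensity
```

The perturbed priors can then be plugged into the RP/RW strategies in place of the exact class prior to measure the performance drop at each k.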

OOD detection performance with long-tailed learning methods.

OOD detection performances on full ImageNet dataset.

OOD detection performance on the iNaturalist benchmark.

Model and Hyperparameters. We use mmclassification (Contributors, 2020), released under the Apache-2.0 license, to train ID models. The training details of ResNet (He et al., 2016) follow the default settings in mmclassification. We use the model trained on ImageNet as the pre-trained model and fine-tune it on iNaturalist. All methods are implemented in PyTorch 1.6.0 with CUDA 10.2, and we use several NVIDIA Tesla V100 GPUs.

Performance comparison of two different ID-class-prior distribution acquisition methods. All methods are trained on the ImageNet-LT-a8 dataset with ResNet50.

OOD detection performances on ImageNet-LT-a8 dataset.

Perera & Patel (2019) design two parallel networks trained on different datasets and use a membership loss to encourage high activations for ID data while reducing activations for OOD data. Perera et al. (2020) use self-supervision and data augmentation to improve the network's ability to detect OOD data; input images are augmented with representations obtained from a generative model. In this paper, we consider a more complex open set, which is large-scale and imbalanced, to achieve OOD detection.


    with torch.no_grad():
        x = x.cuda()
        logits, _ = model_forward(model, x)
        softmax_output = m(logits)  # m is a softmax layer
        ood_score, _ = torch.max(softmax_output, dim=-1)
        ood_scores.extend(ood_score.data)
    return ood_scores

def MSP_RP(data_loader, model, ID_Prior):
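The truncated MSP_RP above might plausibly be completed as follows. This is our own framework-agnostic sketch (numpy in place of the paper's PyTorch loop, and all names are illustrative), taking the RP+MSP score as the maximum of the softmax output minus the ID-class prior, and also including the B.8 ratio variant max_i(softmax_i / P_{Y^in}(i)):

```python
import numpy as np

def softmax(logits):
    """Row-wise numerically stable softmax."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def msp_rp_scores(logits_batch, id_prior):
    """RP+MSP sketch: subtract the ID-class prior before taking the max."""
    return (softmax(logits_batch) - id_prior).max(axis=-1)

def msp_ratio_scores(logits_batch, id_prior):
    """B.8 variant sketch: max_i softmax_i(f(x)) / P_{Y^in}(i)."""
    return (softmax(logits_batch) / id_prior).max(axis=-1)

prior = np.array([0.7, 0.2, 0.1])
logits = np.array([np.log([0.7, 0.2, 0.1]),  # prediction equal to the prior
                   [3.0, 0.0, 0.0]])         # confident head-class prediction
rp = msp_rp_scores(logits, prior)
ratio = msp_ratio_scores(logits, prior)
# The prior-matching prediction attains the minimal scores (0 and 1),
# i.e., the minimizer is the class prior rather than the uniform distribution.
```

Both variants flag the prior-matching prediction as the most OOD-like input, which is precisely the behavior Theorem 2 motivates when the ID data are imbalanced.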

