RG: OUT-OF-DISTRIBUTION DETECTION WITH REACTIVATE GRADNORM

Abstract

Detecting out-of-distribution (OOD) data is critical to building reliable machine learning systems in the open world. Previous works mainly perform OOD detection in feature space or output space. Recently, researchers have achieved promising results using gradient information, which combines information from both the feature and output spaces for OOD detection. However, existing works still suffer from the problem of overconfidence. To address this problem, we propose a novel method called "Reactivate Gradnorm (RG)", which exploits the norm of the clipped feature vector and the energy in the output space for OOD detection. To verify the effectiveness of our method, we conduct experiments on four benchmark datasets. Experimental results demonstrate that RG outperforms existing state-of-the-art approaches by 2.06% in average AUROC. Meanwhile, RG is easy to implement and requires neither additional OOD data nor a fine-tuning process: OOD detection is realized in a single forward pass of any pretrained model.

1. INTRODUCTION

Beyond predictive accuracy, increasing attention has been paid to whether a model can reject inputs it has never seen. We want models that are not only accurate on their familiar data distribution but also aware of their uncertainty outside the training distribution. This gives rise to out-of-distribution (OOD) detection, which determines whether an input is in-distribution (ID) or OOD. OOD detection is widely used in fields with high safety requirements, such as medical diagnosis (Nair et al., 2020) and autonomous driving (Amini et al., 2018). Deep neural networks can easily make overconfident predictions on OOD inputs, which makes separating ID and OOD data challenging (Van den Oord et al., 2016; Chen et al., 2021). For instance, a model may wrongly but confidently classify an image of a crab into the clapping class, even though no crab-related concept appears in the training set. Previous works focused on deriving OOD uncertainty measures from the output space (Hendrycks & Gimpel, 2016; Liu et al., 2020) or the feature space (Lee et al., 2018; Sun et al., 2022). A recent gradient-based approach (Huang et al., 2021) intrigued us: as the backpropagation algorithm shows, gradient information can often be decomposed into a feature-space term and an output-space term. However, this method still leaves room for improvement, which encourages us to use information from both the output space and the feature space jointly for better OOD detection. In this paper, we propose Reactivate Gradnorm (RG), a simple and effective method that detects OOD inputs using only the inputs and outputs of the last layer of a neural network.
Specifically, RG uses as its OOD score the product of the 1-norm of the clipped input to the last layer and the logarithm of the exponential sum of the outputs (the free energy). The 1-norm of the hidden features is informative because neurons tend to be activated for ID samples. The motivation for clipping comes from the fact that a few OOD samples exhibit a small number of extremely strong features; appropriate clipping reduces the feature 1-norm of such OOD samples without excessively affecting that of ID samples. We select the energy in logit space rather than information in probability space (as in MSP (Hendrycks & Gimpel, 2016)) because the softmax layer discards the relative magnitudes of the logits, an information loss; moreover, the energy score enjoys both theoretical and practical support as an OOD evaluation score. Empirically, we establish excellent performance on the large-scale ImageNet benchmark: RG outperforms ReAct (Sun et al., 2021), which applies the energy score after clipping, by 8.9% AUROC; our source of inspiration GradNorm (Huang et al., 2021) by 5.86% AUROC; and MOS (Huang & Li, 2021) by 2.06% AUROC. Our key results and contributions are summarized as follows: • We propose RG, a simple and effective OOD uncertainty estimation method that is label-agnostic (no labels required), OOD-agnostic (no outlier data required), and training-data-agnostic (only a pretrained model is used, with no fine-tuning or extra training). • We conduct extensive experiments on combining information from the output space and the feature space, which helps us better understand the effectiveness of OOD detection methods. RG improves average AUROC by 2.06% over the current best method under the same pretrained model and dataset.
Experiments show that using information from both the feature space and the output space benefits OOD detection. • We provide a simple theoretical analysis showing that jointly using feature-space and output-space information helps model the distribution of the training data, which facilitates OOD detection, and we unify several previous approaches under equation 10 in a new framework.

2. BACKGROUND

In supervised learning, we denote by $\mathcal{X} = \mathbb{R}^d$ the input space and $\mathcal{Y} = \{1, 2, \ldots, C\}$ the label space. A neural network $f(x, \theta) = \{f_i(x, \theta)\}_{i=1}^{C}$ with parameters $\theta$, abbreviated $f(x)$, is a mapping from $\mathcal{X}$ to $\mathbb{R}^C$. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, the supervised learning task is to minimize

$$R(f) = \mathbb{E}_{(x,y) \in D}\, \ell_{CE}(f(x), y),$$

where $\ell_{CE}$ is usually the cross-entropy loss

$$\ell_{CE}(f(x), y) = -\log \frac{e^{f_y(x)}}{\sum_{i=1}^{C} e^{f_i(x)}},$$

with $y$ the ground-truth label.

Problem statement. OOD detection can be formulated as a binary classification problem. The goal is to design a discriminator $G(x)$, a mapping from $\mathcal{X}$ to $\mathbb{R}$. Given a threshold $c$, we declare a sample $x$ OOD if and only if $G(x) < c$. The discriminator $G$ is often built from the neural network model $f(x, \theta)$, allowing the model to reject recognition when $G(x) < c$. Typically, $c$ is set so that 95% of in-distribution (ID) data is identified as ID. The key challenge is to derive a scoring function $G(x)$ that captures OOD uncertainty. Previous OOD detection approaches primarily rely on the output or feature space for deriving OOD scores, and there has been recent interest in utilizing gradient information. We will show that an effective gradient-based OOD detection method in fact combines information from the output space and the feature space, and based on this observation we propose a more efficient method in the following sections.
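The 95%-TPR threshold described above is easy to compute in practice. The sketch below (NumPy only; the score arrays are synthetic stand-ins for ID/OOD scores produced by any detector $G$) picks $c$ and evaluates the resulting false positive rate, the FPR95 metric reported later in the paper:

```python
import numpy as np

def threshold_at_95_tpr(id_scores):
    """Choose c so that 95% of ID samples satisfy G(x) >= c."""
    return np.percentile(id_scores, 5)

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples still accepted as ID at that threshold."""
    c = threshold_at_95_tpr(id_scores)
    return float(np.mean(ood_scores >= c))

# Synthetic scores from a detector that separates ID and OOD reasonably well.
rng = np.random.default_rng(0)
id_scores = rng.normal(5.0, 1.0, 10_000)
ood_scores = rng.normal(2.0, 1.0, 10_000)
print(fpr_at_95_tpr(id_scores, ood_scores))  # small, since the score distributions barely overlap
```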

3. MOTIVATION AND METHOD

In this section, we first describe the gradient-based OOD detection method and then show, in Section 3.1, that it synthesizes information from the feature space and information from the output space. This inspires us to design, in Section 3.2, an OOD score that combines the norm of the clipped feature vector with the energy in the output space. In Section 3.3, we unify several previous approaches under equation 10 in a new framework.

3.1. GRADIENT-BASED OOD DETECTION

We start by introducing the loss function for backpropagation and then describe how to design the gradient norm for OOD uncertainty estimation. We provide a perspective to revisit gradient-based OOD detection. The idea stems from the following observation: for a fully trained network, if we keep feeding it training samples, compute the loss against the ground truth, and backpropagate, the gradients of the parameters will be small, precisely because the network is fully trained. At test time, however, the ground truth is missing, so we cannot use it to compute a prediction loss. A natural idea is to use the uniform distribution as a substitute for the ground truth; the resulting gradient will then be large for an ID sample (Huang et al., 2021):

$$G(x) = \left\| \frac{\partial\, KL(u \,\|\, \mathrm{softmax}(f(x)))}{\partial w} \right\|_1, \qquad (3)$$

where $u = \{\frac{1}{C}\}_{i=1}^{C}$, $KL(u \,\|\, \mathrm{softmax}(f(x))) = -\frac{1}{C} \sum_{i=1}^{C} \log \frac{e^{f_i(x)}}{\sum_{j=1}^{C} e^{f_j(x)}}$ (up to an additive constant), and $w$ denotes the network parameters. Another way to replace the ground truth is to assume the true label is class $y$ with probability $p_y = \frac{e^{f_y(x)}}{\sum_{i=1}^{C} e^{f_i(x)}}$. We can then design the score, as Igoe et al. (2022) mention:

$$G(x) = \mathbb{E}_{y \sim p(x)} \left\| \frac{\partial \log p_y}{\partial w} \right\|_1. \qquad (4)$$

Each sample belongs to the $i$-th class with probability $p_i$, so the expected gradient of the classification loss will be small for ID samples. Note that the negative log-likelihood is used when calculating the loss function, so $G(x)$ will again be larger for ID data than for OOD data.
Similarly, to avoid the problem of the missing ground truth, we can also use the loss function $-\sum_{i=1}^{C} e^{f_i(x)}$ and design our own novel score:

$$G(x) = \left\| \frac{\partial \sum_{i=1}^{C} e^{f_i(x)}}{\partial w} \right\|_1. \qquad (5)$$

Under a special setting where only the gradient of the last layer is used, equation 3 becomes

$$G(x) = U \cdot V, \qquad (6)$$

where $V$ is the $L_1$ norm of the input features of the last layer and $U = \sum_{i=1}^{C} \left| \frac{1}{C} - p_i \right|$. Equation 4 also has the form of equation 6, with the same $V$ and $U = 2\sum_{i=1}^{C} p_i (1 - p_i)$, and so does equation 5, with $U = \sum_{i=1}^{C} e^{f_i(x)}$. From these expressions, such scores still struggle to overcome the problem of overconfident predictions on OOD samples. Recall the example from the introduction: a model may wrongly but confidently classify an image of a crab into the clapping class, even though no crab-related concept appears in the training set. If the crab has a strong feature related to the recognition of the clapping class, it will also produce a large 1-norm in the feature space. This means there is room for improvement in both $U$ and $V$.

Summary. Some gradient-based OOD detection methods can thus be rewritten as a combination of feature information and output information. This is not surprising: in the backpropagation algorithm, the gradient at the last layer equals the product of the propagated error, which depends only on the output space, and the input value, which depends only on the feature space. This observation motivates us to explore suitable choices of $U$ and $V$ for OOD detection.
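To make the factorization in equation 6 concrete, the following sketch (toy dimensions chosen for illustration, NumPy only) checks numerically that the $L_1$ norm of the last-layer gradient of the KL-to-uniform loss equals $U \cdot V$:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def gradnorm_last_layer(v, W, b):
    """L1 norm of the gradient of KL(u || softmax(f)) w.r.t. the last-layer
    weights W, where f = W @ v + b and u is the uniform distribution."""
    p = softmax(W @ v + b)
    C = len(p)
    grad = np.outer(p - 1.0 / C, v)   # d KL / d W_{ij} = (p_i - 1/C) * v_j
    return np.abs(grad).sum()

def factored_score(v, W, b):
    """The same quantity written as U * V (equation 6):
    U = sum_i |1/C - p_i|, V = ||v||_1."""
    p = softmax(W @ v + b)
    C = len(p)
    return np.abs(1.0 / C - p).sum() * np.abs(v).sum()

rng = np.random.default_rng(0)
v = rng.random(8)                      # post-ReLU features are non-negative
W, b = rng.standard_normal((5, 8)), rng.standard_normal(5)
print(np.isclose(gradnorm_last_layer(v, W, b), factored_score(v, W, b)))  # True
```

The equality holds because the gradient's absolute values factor into a row term depending only on the output probabilities and a column term depending only on the features.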

3.2. THE CHOICE OF U AND V

U comes from the output space. In OOD detection, using the maximum softmax probability (Hendrycks & Gimpel, 2016) is a natural choice. However, OOD samples may also receive very confident predictions, so we use an energy-based score as $U$:

$$U = T \log \sum_{i=1}^{C} e^{f_i(x)/T}. \qquad (7)$$

The energy score has good theoretical and practical justification as an indicator for OOD detection. Estimating prediction certainty by summing activated network outputs is also a common strategy, as in Dirichlet-distribution-based certainty estimation, which uses $\mathrm{softplus}(\cdot) + 1$ as the activation and sums the result (Sensoy et al., 2018).

V comes from the feature space. A common strategy is to model OOD data with a standard Gaussian distribution in feature space: for OOD data, the $i$-th element of the feature vector is $v_i = \max(0, z_i)$ with $z_i \sim \mathcal{N}(0, 1)$. The OOD-ness of a feature is measured by $e^{-v_i^2}$, and correspondingly $1 - e^{-v_i^2}$ measures ID-ness. Under the assumption that the features are independent, we would multiply all the per-feature measures to obtain $V$, but this is numerically unstable: if any factor is 0, the product is 0. So we instead use the sum $V = \sum_i (1 - e^{-v_i^2})$, and in practice we take the approximation $V = \sum_i \min(1, v_i)$. Both are based on the same idea: to avoid overconfident predictions on OOD samples with a few strong features, the contribution of each individual feature to the overall score should be capped. In general, the OOD score we use is

$$G(x) = T \log \left( \sum_i e^{f_i(x)/T} \right) \cdot \sum_i \min(v_i, k), \qquad (8)$$

where $T$ is the temperature in the energy function (default 1) and $k$ is the clipping threshold for each feature in the feature vector (default 1). In this paper, equation 8 is used as the OOD detection score by default.
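Equation 8 takes only a few lines to implement. The sketch below is a minimal NumPy version; the toy feature vectors at the end are purely illustrative (not from the paper's models) and show why clipping caps the influence of a single extreme feature:

```python
import numpy as np

def rg_score(v, W, b, T=1.0, k=1.0):
    """RG score (equation 8): free energy U times the clipped feature 1-norm V.

    v : post-ReLU feature vector feeding the last fully connected layer
    W, b : last-layer weights and biases, so the logits are f = W @ v + b
    """
    f = W @ v + b
    U = T * np.log(np.exp(f / T).sum())   # free energy over the logits
    V = np.minimum(v, k).sum()            # clip each feature at k, then sum
    return U * V

# Both vectors have the same unclipped 1-norm (6.0), but clipping at k = 1
# suppresses the vector dominated by one extremely strong feature.
v_broad = np.full(8, 0.75)                # many moderately active features
v_spike = np.zeros(8); v_spike[0] = 6.0   # one extremely strong feature
print(np.minimum(v_broad, 1.0).sum())     # 6.0
print(np.minimum(v_spike, 1.0).sum())     # 1.0
```

This mirrors the crab-vs-clapping example: an OOD input that activates one feature very strongly no longer inherits a large feature-norm term.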

3.3. ADDITION-BASED COMBINATION OF U AND V

Different from the multiplicative combination of $U$ and $V$ in the previous section, in this section we combine $U$ and $V$ additively and provide a perspective that unifies previous approaches. To explain this, consider the final fully connected layer of the network. Suppose for a sample $x$ the input to the last layer is $v = \{v_j\}_{j=1}^{N}$, and suppose the joint probability that the feature $v$ belongs to class $i$ is $P(v, C_i) = e^{g(v) + w_i^{T} v + b_i}$, where $g(\cdot)$ is a mapping from $\mathbb{R}^N$ to $\mathbb{R}$, $w_i \in \mathbb{R}^N$, and $b_i \in \mathbb{R}$. The choice of $g(\cdot)$ cannot be too arbitrary, as it is constrained by probability normalization. The probability that sample $x$ belongs to class $i$ is

$$P(C_i \mid v) = \frac{P(v, C_i)}{\sum_{j=1}^{C} P(v, C_j)} = \frac{e^{g(v) + w_i^{T} v + b_i}}{\sum_{j=1}^{C} e^{g(v) + w_j^{T} v + b_j}} = \frac{e^{w_i^{T} v + b_i}}{\sum_{j=1}^{C} e^{w_j^{T} v + b_j}}, \qquad (9)$$

which is exactly what the final fully connected layer followed by the softmax computes. The combination of the fully connected layer and the softmax thus ignores $g(v)$. This implies that if the information about $v$ from the feature space can be used effectively, OOD detection can outperform methods that use probability-space information alone, such as MSP (Hendrycks & Gimpel, 2016). Then

$$\log P(v) = \log \sum_{i=1}^{C} P(v, C_i) = \log \sum_{i=1}^{C} e^{g(v) + w_i^{T} v + b_i} = g(v) + \log \sum_{i=1}^{C} e^{f_i(x)}. \qquad (10)$$

So we can use $g(v)$ from the feature space and $\log \sum_{i=1}^{C} e^{f_i(x)}$ from the output space to characterize the probability of a sample appearing: the larger $P(v)$, the more likely the sample is ID. When $g(v) = \max_j \left( -\log \sum_{i \neq j} e^{w_i^{T} v + b_i} \right)$, this is equal to MSP (Hendrycks & Gimpel, 2016). When $g(v) = 0$, it is equal to Energy (Liu et al., 2020). When $g(v)$ is the negative norm of the residual of projecting $v$ onto the principal subspace, it is equal to VIM (Wang et al., 2022).
When $g(v)$ is quadratic, we recover GEM (Morteza & Li, 2022). We can thus view these methods under a unified framework. In practice, if we believe that ID samples are likely to lie close to the set $D_k = \{v \in \mathbb{R}^N \mid v_i \geq k,\ i = 1, \ldots, N\}$, then $g(v)$ can be used to penalize samples that do not belong to $D_k$, taking the penalty to be the $L_1$ distance from $v$ to $D_k$. This yields the novel score

$$G(x) = \sum_i \left( \min(v_i, k) - k \right) + \log \sum_i e^{f_i(x)}. \qquad (11)$$

Ignoring the constant term and introducing a balance coefficient $\alpha$ between the feature-space and output-space terms, we can use the score

$$G(x) = \log \sum_i e^{f_i(x)} + \alpha \sum_i \min(v_i, k). \qquad (12)$$

Similar to the method proposed in equation 8, both scores use the same $U$ and $V$; the difference is that the combination changes from multiplication to addition. A natural idea is to choose a balance coefficient $\alpha$ that makes the standard deviations of the two terms close. Since the choice of the balance coefficient is heuristic, we use equation 8 by default in the experimental part, but we also run experiments to explore appropriate balance coefficients.
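A minimal sketch of the additive score in equation 12 and of the std-matching heuristic for $\alpha_s$; the array shapes and the sample of training features are assumptions for illustration, not the paper's exact pipeline:

```python
import numpy as np

def additive_score(v, f, k=1.0, alpha=1.0):
    """Additive combination (equation 12): energy plus alpha * clipped 1-norm."""
    m = f.max()
    energy = m + np.log(np.exp(f - m).sum())   # numerically stable log-sum-exp
    return energy + alpha * np.minimum(v, k).sum()

def balance_alpha(features, logits, k=1.0):
    """Heuristic alpha_s: match the standard deviations of the two terms,
    estimated over a sample of training data (one row per sample)."""
    m = logits.max(axis=1, keepdims=True)
    energy = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()
    clipped = np.minimum(features, k).sum(axis=1)
    return energy.std() / clipped.std()
```

After scaling by `balance_alpha`, the feature-space term has the same standard deviation as the energy term, so neither side dominates the score.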

4. EXPERIMENT

In this section, we evaluate RG on a large-scale OOD detection benchmark with ImageNet-1k as an in-distribution dataset. We describe the experimental setup in Section 4.1 and demonstrate the superior performance of RG over existing approaches in Section 4.2, followed by extensive ablations and analyses that improve the understanding of our approach.

4.1. EXPERIMENTAL SETUP

Dataset. We evaluate our method on the large-scale ImageNet benchmark proposed by Huang & Li (2021), which is rich in both data sources and categories. OOD detection for the ImageNet model is more challenging due to both a larger feature space (dim = 2048) and a larger label space (C = 1000). In particular, the large-scale evaluation is relevant to real-world applications, where deployed models often operate on high-resolution images with many class labels. Moreover, as the number of feature dimensions increases, noisy signals may increase accordingly, which can make OOD detection more challenging. We evaluate on four OOD test datasets, drawn from subsets of iNaturalist (Van Horn et al., 2018), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Textures (Cimpoi et al., 2014), with non-overlapping categories w.r.t. ImageNet-1k. The OOD datasets cover various domains, including fine-grained images, scene images, and textural images. The amount of OOD data is also large: with the exception of Textures, which has 5,640 images, each dataset has 10,000 images.

Model and hyperparameters

We mainly use Google BiT-S models (Kolesnikov et al., 2020) pretrained on ImageNet-1k with a ResNetv2-101 architecture (He et al., 2016). The BiT-S model is adopted not only for its excellent classification performance on ImageNet-1k but also for a fair comparison with the GradNorm (Huang et al., 2021) method. We use a clipping threshold of 1 by default and explore the effect of other clipping thresholds in Section 4.2. The temperature parameter T is set to 1 unless specified otherwise, and we explore the effect of different temperatures in Section 4.2. We also report performance on another architecture, DenseNet-121 (Huang et al., 2017). At test time, all images are resized to 480 × 480.

4.2. RESULTS AND ABLATION STUDIES

Comparison with benchmark methods. The results for the ImageNet evaluations are shown in Table 1, where our method (RG) demonstrates superior performance. We report OOD detection performance for each OOD test dataset, as well as the average over the four datasets. For a fair comparison, all methods use the same pretrained backbone, without regularizing with auxiliary outlier data. Since our method is inspired by GradNorm (Huang et al., 2021), we compare against the same methods as that paper, such as MSP (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2017), Mahalanobis (Lee et al., 2018), and Energy (Liu et al., 2020). We also compare with KL Matching (Hendrycks et al., 2019) and MOS (Huang & Li, 2021), which use the same pretrained model on the same dataset. For Igoe et al. (2022), we use the L1 norm of the features and the energy. In addition, we compare with ReAct (Sun et al., 2021), which uses the same clipping threshold of 1 and takes the post-clipping energy as its score; we reproduce ReAct ourselves, while the other results are reported from MOS (Huang & Li, 2021) or Igoe et al. (2022). RG outperforms the best gradient-based baseline, GradNorm, by 5.86% in AUROC; the competitive feature-based method Mahalanobis by 44.79% in FPR95; and ReAct by 30.4% in FPR95. Compared with the group-based OOD detection method MOS, RG improves AUROC by 2.06%. RG is stable, with relatively small differences across the four OOD datasets. Besides, OOD detection is achieved in one forward pass, without the additional backward pass required by GradNorm; our method's computational cost and storage requirements are almost the same as MSP or Energy. Ablation on U and V.
We conduct experiments using U and V alone and in combination. There are two directions of ablation: separating the feature space and output space, and ablating against GradNorm. As described in Section 3, combining information from the feature space and the output space surpasses either alone in OOD detection. On the other hand, we notice that the output-space information plays the leading role, which also reflects the effectiveness of previous output-space methods such as Energy. Compared to GradNorm, replacing U alone with the energy yields an improvement, as does replacing V alone with the clipped 1-norm. This shows that our U and V are suitable for OOD detection with the BiT network. The effect of the clipping threshold. We evaluate RG with different clipping thresholds from k = 0.1 to k = 5. As shown in Table 3, k = 0.5 or 0.7 is optimal, while either increasing or decreasing the threshold degrades performance. The appropriate clipping threshold is related to the distribution of ID samples in feature space. If the threshold is relatively large, OOD samples with a few strong features will be identified as ID; if it is relatively small, too much feature-space information is lost and ID samples will be identified as OOD. As the threshold increases, the results converge to the unclipped result, shown in the last row. When the threshold is small, performance is worse than with no clipping. This shows that a suitable clipping threshold helps filter out OOD samples that receive overconfident predictions because of a few extremely strong features. The effect of temperature. We evaluate our method RG with different temperatures T.
As shown in Table 4, T = 1 is optimal, while either increasing or decreasing the temperature degrades performance. This can be explained as the temperature balancing the information from the feature space and the output space: the higher the temperature, the stronger the dependence on the feature space; the lower the temperature, the stronger the dependence on the output space. As the temperature increases, the results converge to the penultimate row of Table 2, which depends entirely on the feature space. U plus V exploration. As described in Section 3.3, OOD detection can also be performed with an additive combination of U and V. We test OOD detection performance based on equation 12. The key to the experiment is the choice of the hyperparameter α, which balances the effect of the feature space and the output space. A heuristic is to make the standard deviations of the two terms close; we denote this value by α_s. α_s needs to be estimated from the network's training data; to compute it quickly, we randomly select 1k samples from the training set. The experimental results are shown in Table 5. All of the average AUROC values are better than MOS, which achieves an average AUROC of 0.901. On datasets where the output space benefits OOD detection more, the optimal α is slightly less than 1; on datasets where the feature space benefits more, the optimal α is slightly greater than 1. On DenseNet-121 (Table 6), RG is consistently effective, outperforming our source of inspiration GradNorm by 10.29% in FPR95 and 4.29% in AUROC. If we use the optimal clipping threshold of 0.5 shown in Table 3, FPR95 drops by a further 0.6% compared to the default threshold of 1. This shows that the appropriate clipping threshold varies across network structures.
Additionally, we compare with the state-of-the-art nonparametric feature-space method KNN (Sun et al., 2022). Because that method requires relatively high storage, we compare on the same ResNet-50 model trained on ImageNet. The results are shown in Table 7; we report the KNN-based results from Sun et al. (2022). This also shows that our method is effective on another pretrained model.
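The clipping-threshold ablation above amounts to a simple sweep. The sketch below is illustrative: the rank-based AUROC estimate and the random feature/logit arrays are stand-ins for features and logits extracted from a real pretrained model.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Rank-based (Mann-Whitney) AUROC: P(random ID score > random OOD score).
    Ties are broken arbitrarily, which is fine for a quick sketch."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = scores.argsort().argsort() + 1
    n_id, n_ood = len(id_scores), len(ood_scores)
    return (ranks[:n_id].sum() - n_id * (n_id + 1) / 2) / (n_id * n_ood)

def energy(logits):
    """Stable log-sum-exp of each row of a logits matrix."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

def sweep_clip_threshold(feats_id, logits_id, feats_ood, logits_ood,
                         ks=(0.1, 0.5, 0.7, 1.0, 2.0, 5.0)):
    """AUROC of the RG score (equation 8, T = 1) for each candidate threshold k."""
    out = {}
    for k in ks:
        s_id = energy(logits_id) * np.minimum(feats_id, k).sum(axis=1)
        s_ood = energy(logits_ood) * np.minimum(feats_ood, k).sum(axis=1)
        out[k] = auroc(s_id, s_ood)
    return out
```

With features and logits cached for the ID and OOD test sets, one call per k reproduces the shape of the ablation table.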

5. RELATED WORKS

OOD detection by output-based methods. The earliest OOD detection method is MSP, which uses the maximum softmax probability as the indicator score of ID data (Hendrycks & Gimpel, 2016). Researchers' interest then turned to studying OOD scores in the output space (Sastry & Oore, 2020; Dong et al., 2022). ODIN (Liang et al., 2017; Hsu et al., 2020) is an output-based method that uses temperature scaling and input perturbation to increase the separability of ID and OOD. After that, interest shifted from the softmax space to the logit space: Liu et al. (2020) proposed using an energy score for OOD detection, which enjoys a theoretical interpretation from a likelihood perspective (Morteza & Li, 2022). The JointEnergy score (Wang et al., 2021) was then proposed for OOD detection in multi-label classification networks. Recent studies have shown that one reason for overconfidence on OOD data is the abnormally high activation of a few neurons, so appropriate suppression of activated neurons benefits OOD detection, as in ReAct (Sun et al., 2021). Sun & Li (2022) then propose DICE, a weight-sparsification-based OOD detection framework. These methods have the advantage of being easy to use without modifying the training procedure or objective.

OOD detection by feature-based methods. OOD detection based on feature space often rests on the assumption that, after modeling the density function of ID data, OOD data falls in low-density regions or far from the centers of ID samples (Xiao et al., 2010; Zong et al.; Ming et al., 2022). A simple density assumption is that the features follow a class-conditional Gaussian distribution (Lee et al., 2018). For more complex distributions, flow techniques can be used (Zisselman & Tamar, 2020). Nonparametric density estimation methods have also recently emerged (Cook et al., 2020), and OOD detection using k-nearest neighbors has shown good performance (Sun et al., 2022), though it relies on a large amount of known ID data, and finding k-nearest neighbors in practical applications is not easy in terms of storage and computation.

OOD detection by fusing information from feature space and output space. Recently, some methods directly or indirectly mix feature-space information and output-space information for OOD detection.
GradNorm (Huang et al., 2021) uses feature-space and output-space information implicitly. VIM (Wang et al., 2022) uses the reconstruction error in the feature space and the energy in the output space. Igoe et al. (2022) also use information from both spaces for OOD detection. Experiments show that our method performs better because we use a more appropriate distance in the feature space.

6. DISCUSSION

Gradient-based OOD detection can often be transformed into a combination of feature-space and output-space information, as in Section 3.1; the process of the backpropagation algorithm indirectly reflects this. An important reason our method outperforms the baselines is that it uses information from both spaces, and a proper fusion method matters for OOD detection: fusing the energy with feature-space information is a good choice. Our method also benefits from the network structure. Because the BN layer (Ioffe & Szegedy, 2015) is widely used in image recognition, the training data does not shift too far from 0, which is an important reason the reactivation strategy improves performance. The reactivation strategy in our method is similar to Sun et al. (2021), but we do not use the reactivated feature vector to compute the energy score. In Section 3.3 we revisit the reactivation method from a distance-based view: we penalize the deviation from the set $D$ to obtain $g(\cdot)$. VIM (Wang et al., 2022) takes $D$ to be a subspace of the training data; KNN-based OOD detection (Sun et al., 2022) takes $D$ to be a subset of training feature vectors. However, the choice of $g(\cdot)$ in Section 3.3 is only heuristic; a strict $g(\cdot)$ should also satisfy the normalization equation $\int \sum_{i=1}^{C} e^{g(v) + w_i^{T} v + b_i}\, dv = 1$. In the future, we might place assumptions on $g(\cdot)$ and learn its parameters from the training data.

7. CONCLUSION

In this paper, we propose RG, a novel OOD uncertainty estimation approach utilizing information extracted from the feature space and the output space. We also propose a framework for combining feature-space metrics with the energy in the output space for OOD detection. Experimental results show that our method improves OOD detection performance by up to 2.06% in AUROC, establishing superior performance, and extensive ablations provide further understanding of our approach. We believe that considering both feature-space and output-space information can improve OOD detection, and we hope our work draws attention to the strong promise of OOD detection methods that combine information from both spaces.




Main Results. OOD detection performance comparison between RG and baselines. All methods utilize the standard ResNetv2-101 model trained on ImageNet. The classification model is trained on ID data only. All values are percentages.

Ablation on U and V. OOD detection performance for different choices of U and V. All methods utilize the standard ResNetv2-101 model trained on ImageNet. The classification model is trained on ID data only. All values are percentages. U and V are combined by multiplication. The first line uses the U and V of GradNorm (GN); the second line uses the V of GradNorm with our U; the fourth line uses only our U.

The effect of the clipping thresholds. OOD detection based on our method, using the default temperature. All values are percentages.

The effect of temperature. OOD detection based on our method, using the default clipping threshold. All values are percentages.

The effect of balance factors. OOD detection based on equation 12. All values are percentages.

RG is effective on an alternative neural network architecture. We evaluate RG on a different architecture, DenseNet-121, and report performance in Table 6. For a fair comparison, we reproduced ReAct (Sun et al., 2021) and Igoe et al. (2022); other numbers are reported from Huang et al. (2021).

OOD detection performance comparison with KNN-based OOD detection. All methods utilize the ResNet-50 model trained on ImageNet. The classification model is trained on ID data only. All values are percentages.

