HOW USEFUL ARE GRADIENTS FOR OOD DETECTION REALLY?

Abstract

One critical challenge in deploying machine learning models in real-life applications is out-of-distribution (OOD) detection. Given a predictive model which is accurate on in-distribution (ID) data, an OOD detection system can further equip the model with the option to defer prediction when the input is novel and the model has low confidence. Notably, there has been some recent interest in utilizing gradient information in pre-trained models for OOD detection. While these methods are competitive, we argue that previous works conflate their performance with the necessity of gradients. In this work, we provide an in-depth analysis and comparison of gradient-based methods and elucidate the key components that warrant their OOD detection performance. We further demonstrate that a general, non-gradient-based family of OOD detection methods is just as competitive, casting doubt on the usefulness of gradients for OOD detection.

1. INTRODUCTION

Recent advances in algorithms, models, and training infrastructure have brought about unprecedented performance of machine learning (ML) methods across a wide range of data types and tasks. Despite their demonstrated potential on benchmark settings and domains, one obstacle which limits ML methods' applicability in real-world applications is the uncertainty or confidence of the predictions. Without any deliberate mechanisms, ML models will output a prediction for any given input, and the question of whether this prediction can be trusted is especially critical in many high-risk decision-making settings (e.g. self-driving).



It is unrealistic to expect to train a model that has perfect predictions for all possible inputs, partly because real-world datasets are limited in their scope. Thus, in lieu of trying to make predictions for all test inputs, one can attempt to first detect whether the input is covered by the support of the training data. This is the motivation behind OOD detection. Among the diverse approaches to OOD detection for image recognition, a recent line of work has suggested utilizing the information in gradients to derive efficient and performant methods for OOD detection (Liang et al., 2017; Lee & AlRegib, 2020; Agarwal et al., 2020; Lee & AlRegib, 2021; Huang et al., 2021; Sun et al., 2022; Kokilepersaud et al., 2022). We motivate our work by first exploring the claim that gradients are useful for OOD detection. Through a comparison with various extensions of gradient-based scores, we analyze the key components that actually drive the performance of these methods, and we argue that gradient computations are not essential for deriving performant post hoc OOD detectors. Rather, these methods ultimately rely on the magnitude of the learned feature embedding and the predicted output distribution. We thereby refute many of the intuitions with which previous works motivate their methods. Based on our analysis, we advocate for the study of a more general, non-gradient-based framework for producing performant score functions, and we provide a comprehensive empirical evaluation of various instantiations of the score within this framework. The rest of this paper is structured as follows. We first provide a formal statement of the problem setting and the related works in Section 2. We then introduce existing and new gradient-based detectors and discuss how both can be simplified into intuitive forms (Section 3). Following this, we perform empirical evaluations of the methods (Section 4) and discuss their implications in Section 5.
2. PROBLEM SETTING AND RELATED WORKS

[Figure 1: a visual comparison of the output-based score terms, including max_k p^(k).]

Given an input x, the network predicts a probability vector, p, via the softmax function of the network outputs:

p^{(k)}(x) = P(Y = k \mid x) = \frac{\exp\left(f_\theta^{(k)}(x)/T\right)}{\sum_{k'=1}^{C} \exp\left(f_\theta^{(k')}(x)/T\right)},

where the superscripts denote the index of the vector and T is the temperature. If not otherwise specified, we will assume T = 1. Although p(x) depends on both θ and x, we will always exclude the former and often exclude the latter when it is clear from context or unimportant. Lastly, we often abuse notation and use Y ∼ p(x) to mean that Y is sampled from the categorical distribution parameterized by p(x).

The OOD Detection Problem. In a real-world scenario, the model may be given an input, x ∈ R^d, during deployment that is substantially different from any of the datapoints in the training set. For example, a classifier trained to identify numeric digits may be given an image of a cat. Since the model's prediction p(x) cannot be trusted, it would be advantageous to flag such instances and defer prediction. This is the problem that OOD detection addresses. More formally, the goal in OOD detection is to derive a binary classifier which labels whether a given input, x, is ID (in distribution) or OOD. This goal is commonly addressed by learning a score function S : R^d → R which quantifies the degree to which the input is OOD. Existing works have approached this problem of learning the mapping S from various perspectives, which generally vary by where the signal to generate S is extracted from. Here, we introduce broad groupings of methodologies to provide context for our work, and we refer the reader to Yang et al. (2021); Salehi et al. (2021) for an in-depth survey of the field.
One class of methods focuses on the input space and learns score functions based on characteristics that can be derived from the input features. For example, distance-based methods build on the intuition that OOD data should lie "far away" from ID data and define S with distances between the input point and reference points that are representative of ID data (Lee et al., 2018; Techapanurak et al., 2020; Van Amersfoort et al., 2020). Meanwhile, density-based methods utilize probabilistic models to describe the density of ID data and argue that OOD points should occur in areas of low density. Thus, generative models are often used, and the score function is often derived from the predicted likelihood of the test input point (Ren et al., 2019; Serrà et al., 2019; Zisselman & Tamar, 2020). Another class of methods aims to directly influence a predictive model's behavior on OOD data through explicit training. These methods often assume access to OOD examples during training and incorporate them into a predictive model's training procedure to maximize the separability between ID and OOD inputs. This is usually achieved by setting aside actual samples from an OOD test distribution (Hendrycks et al., 2018; Yu & Aizawa, 2019; Liu et al., 2020), or, if they are not available, by synthesizing OOD examples via adversarial training, perturbations, or sampling from boundaries or low-density regions (Lakshminarayanan et al., 2016; Lee et al., 2017; Vernekar et al., 2019). Many works have instead turned attention to the information that can be extracted from a predictive model that is fully trained on ID data. With the intuition that an ideal predicted distribution should have low uncertainty (heavily concentrated on the predicted class) for an ID point and high uncertainty (close to a uniform distribution) for an OOD input, some methods focus on generating scores from the predicted distribution of pre-trained models Hendrycks & Gimpel (2016); Linmans et al.
(2020); Liu et al. (2020). Meanwhile, some works have examined the information available when backpropagating gradients of a loss function. Liang et al. (2017); Agarwal et al. (2020) utilize information in the gradient w.r.t. the input, while Lee & AlRegib (2020); Huang et al. (2021) examine gradients w.r.t. the model parameters. Huang et al. (2021), in particular, proposed GRADNORM, an efficient post hoc score function which simply measures the norm of model-parameter gradients and thus requires no additional training or hyperparameter tuning. Despite this simplicity, GRADNORM achieves state-of-the-art (SotA) performance among post hoc methods. We explain GRADNORM in more detail in the next section.

3. GRADIENT-BASED OOD METHODS

Lee & AlRegib (2020) initially proposed using model gradients as a signal to detect OOD data. Specifically, they provide a fully trained network with an input point and backpropagate the cross-entropy loss between a uniform probability distribution and the network's predicted class probabilities. We summarize their intuition for this algorithm in the following hypothesis:

Hypothesis 3.1 (The Feature-Extraction Hypothesis). If learning a point (x, ỹ) requires a large change in a well-trained network, then x must have been novel, because most of a deep network is dedicated to feature extraction, which by assumption should already perform well on ID data.

Meanwhile, GRADNORM contends the opposite case: its authors argue that trying to fit the predicted distribution of ID data to a uniform-distribution label will necessitate larger gradients to the model parameters than with OOD data. We note that this intuition relies on the predicted distributions actually displaying high entropy for OOD inputs and low entropy for ID inputs. Despite the conflicting intuitions, both works ultimately propose taking the KL divergence (identically, the cross-entropy loss) w.r.t. the uniform distribution and measuring a norm of the model gradients as a score function.
Through extensive empirical ablations, GRADNORM hones in on the L1 norm of the last-layer gradients of the KL divergence w.r.t. the uniform distribution as the score function,

S_{GN}(x) = \left\| \frac{\partial D_{KL}\left(u \,\|\, \mathrm{softmax}(f_\theta(x))\right)}{\partial W_L} \right\|_1.  (1)

We can expand the divergence term and simplify this score into the following expression:

S_{GN}(x) = \left\| \mathbb{E}_{Y \sim \mathrm{unif}} \nabla_{W_L} \log p^{(Y)} \right\|_1.  (2)

That is, GRADNORM measures the norm of the expected gradient according to a uniform labeling distribution. Note that ∇_{W_L} denotes the gradient w.r.t. the parameters in the final layer of f_θ.

Another Gradient Approach. We note that Eq. 2 is not a unique measure that leverages gradients. To aid our analysis, we also consider the following variant:

S_{EG}(x) = \mathbb{E}_{Y \sim p(x)} \left\| \nabla_{\theta} \log p^{(Y)} \right\|_p.  (3)

We will show later that this score also has interpretable properties and helps us understand the central components of gradient-based score functions. Due to the outer positioning of the expectation, we refer to the score function in Eq. 3 as EXGRAD. Two key differences in this score are that 1) the label distribution of Y comes from the model's own predicted distribution, p, not the uniform distribution, and 2) it calculates the expected norm of the gradient, whereas Eq. 2 calculates the norm of the expected gradient. Other gradient-based scores are readily plausible, and we visit a suite of scores with empirical comparisons in Section 4.1.
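Both scores are cheap to sketch when restricted to the last layer (the "Shallow" variants evaluated later): since the logits are z = W_L h, the gradient has the closed form ∇_{W_L} log p^{(Y)} = (e_Y − p) h^⊤, so no automatic differentiation is needed. The NumPy sketch below uses made-up h and W standing in for a pre-trained network's penultimate encoding and last-layer weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gradnorm_score(h, W):
    """S_GN (Eq. 2) restricted to the last layer: the L1 norm of the expected
    gradient of log p^(Y) under a uniform label, via the closed form
    grad_W log p^(Y) = (e_Y - p) h^T, whose expectation under unif is (u - p) h^T."""
    p = softmax(W @ h)
    u = np.full_like(p, 1.0 / p.size)
    return np.abs(np.outer(p - u, h)).sum()   # = grad of D_KL(u || p) w.r.t. W

def exgrad_score(h, W):
    """S_EG (Eq. 3) with the L1 norm, restricted to the last layer: the
    expected gradient norm under the model's own predicted distribution."""
    p = softmax(W @ h)
    C = p.size
    norms = [np.abs(np.outer(np.eye(C)[y] - p, h)).sum() for y in range(C)]
    return float(p @ np.array(norms))

rng = np.random.default_rng(0)
h = rng.normal(size=8)        # made-up penultimate-layer encoding
W = rng.normal(size=(5, 8))   # made-up last-layer weights
print(gradnorm_score(h, W), exgrad_score(h, W))
```

Swapping the order of the norm and the expectation is the only structural difference between the two functions, mirroring the distinction between Eqs. 2 and 3.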

3.1. DECOMPOSITION OF GRADIENT METHODS

To give further grounds for their method, Huang et al. (2021) analyze Eq. 2 and show that it can be decomposed into two terms: one characterized by the magnitude of the encoding fed to the last layer of the network, and one characterized by the output of the network. In particular, they derive

S_{GN}(x) = \frac{1}{T} \|h\|_1 \sum_{k=1}^{C} \left| \frac{1}{C} - p^{(k)} \right|,  (4)

where h is the encoding fed to the last layer of the network. Following their notation, we will denote U to be the part of the score characterized by h and V to be the part characterized by the network output, i.e.

U = \|h\|_1, \quad V = \frac{1}{2} \sum_{k=1}^{C} \left| \frac{1}{C} - p^{(k)} \right|, \quad \text{so that} \quad S_{GN} = \frac{2}{T} U V.

Huang et al. (2021) show that U and V have the same trend: their values are large for ID points and lower for OOD points. While each can be used as an OOD detector alone, the product of the two terms results in stronger performance.

Although not mentioned in their analysis, the V term in GRADNORM's score is simply the total variation (TV) distance between a discrete uniform distribution and the model's predicted distribution. This characterization of V gives insight as to why GRADNORM works: when the input image is in distribution, the network will likely have higher confidence, making the TV distance between p and the discrete uniform distribution large.

Decomposition of EXGRAD. Doing a similar analysis, our alternative gradient-based detector, EXGRAD, can be broken down in a similar way. In particular,

S_{EG}(x) = \frac{2}{T} U V, \quad U = \|h\|_1, \quad V = \sum_{k=1}^{C} p^{(k)} \left(1 - p^{(k)}\right).  (5)

The full derivation is shown in Appendix B. Like with GRADNORM, the V term for EXGRAD turns out to be an interpretable quantity. Let B_k ∼ Bernoulli(p^{(k)}) be the random variable corresponding to the event that x belongs to class k. Then V = \sum_{k=1}^{C} \mathrm{Var}(B_k). Intuitively, when the input image is in distribution and there is high confidence on a single class, the variance of each Bernoulli random variable will be low.
Note that this is the opposite trend of the TV distance and ∥h∥_1. A visual comparison of these scores can be seen in Figure 1. This general UV-style score will be of particular interest to us throughout this paper, and we will often refer to it as an "Encoding-Output" composition, as it relies on both the network's encoding of the image (at least at the penultimate layer) and its output. We note that recent work on the "Familiarity Hypothesis" in Dietterich & Guyer (2022), as well as work on the role of the feature norm in open-set recognition and OOD detection in Vaze et al. (2022), further motivates the study of U and suggests a plausible alternative theory to the gradient-based "Feature-Extraction Hypothesis".
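The decomposition identities above (Eqs. 4 and 5, with T = 1) can be checked numerically. The sketch below uses random h and W standing in for a real network and verifies that the closed-form last-layer gradients reproduce S_GN = 2·U·V_TV and S_EG = 2·U·V_Var.

```python
import numpy as np

rng = np.random.default_rng(1)
C, d = 10, 32
h = rng.normal(size=d)            # made-up penultimate encoding
W = rng.normal(size=(C, d))       # made-up last-layer weights

z = W @ h
p = np.exp(z - z.max()); p = p / p.sum()
u = np.full(C, 1.0 / C)

U = np.abs(h).sum()                    # encoding term ||h||_1
V_tv = 0.5 * np.abs(u - p).sum()       # TV distance to uniform (GRADNORM's V)
V_var = (p * (1 - p)).sum()            # sum of Bernoulli variances (EXGRAD's V)

# Gradient-based scores via the closed form grad_W log p^(Y) = (e_Y - p) h^T:
S_gn = np.abs(np.outer(p - u, h)).sum()                                  # norm of expected gradient
S_eg = sum(p[y] * np.abs(np.outer(np.eye(C)[y] - p, h)).sum() for y in range(C))

assert np.isclose(S_gn, 2 * U * V_tv)    # S_GN = (2/T) U V with V = TV distance
assert np.isclose(S_eg, 2 * U * V_var)   # S_EG = (2/T) U V with V = sum of variances
print(round(S_gn, 4), round(2 * U * V_tv, 4))
```

The key step is that the entrywise L1 norm of an outer product factorizes, ∥a h^⊤∥_1 = ∥a∥_1 ∥h∥_1, which is what separates the encoding term U from the output term V.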

4. EMPIRICAL EVALUATION

The previous section introduced variations of existing gradient-based scores and decomposed the scores into interpretable components. This naturally raises the questions: which scores actually perform well, and what should their favorable performance be attributed to? In this section, we explore these questions by comparing the performance of multiple gradient-based scores against simpler, non-gradient-based approaches on OOD detection tasks in image classification. Specifically, the goals of these experiments are as follows:

1. to investigate whether different gradient-based scores are useful for OOD detection or whether only particular variants perform well (i.e. is it essential that the gradient-based detector is derived by taking the derivative of the KL divergence?);
2. to examine the plausibility of the Feature-Extraction Hypothesis in explaining the strong performance of gradient-based OOD detection methods;
3. to investigate the claim that gradient-based approaches offer unique performance advantages in the context of post hoc OOD detection tasks.

Indeed, we show that SotA OOD detection performance is achievable by leveraging information solely from the predicted class distribution and latent encoding, calling into question the additional utility gained from gradient-based score functions. Furthermore, we find that there is significant variability in performance across tasks for different gradient-based score functions. Moreover, we show that gradient-based methods are often worse than computationally simpler approaches that require no backpropagation. Lastly, we perform experiments that challenge the plausibility of one emerging theory that attempts to explain the role of gradients in SotA OOD methods.

4.1. INVESTIGATING THE PERFORMANCE OF GRADIENT-BASED SCORE VARIANTS

Experimental Setup. A key motivation of our work is the design of post hoc OOD detection mechanisms. In light of this, for each OOD method, we take a deep neural network pre-trained on a particular dataset (the ID dataset) and use data from one of the remaining datasets as OOD data. These models were obtained from a popular open-source library of pre-trained neural network image classifiers and achieve competitive performance on ID data. We investigate the performance of seven gradient-based OOD scores, one of which has been previously described in the literature (GRADNORM) and another which we described in detail in Section 3 (EXGRAD). We include additional experimental results on natural variants that can be derived through simple design choices in implementation, specifically by interchanging norms and expectations, the choice of norm, and the choice of distribution used to generate the synthetic label at test time. Finally, we compare these gradient-based methods against two non-gradient-based approaches inspired by Huang et al. (2021) and our analysis in Section 3. We refer to these scores as "V term" scores. We used MNIST (Deng, 2012), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2009) as our base datasets for the first experimental setup, and use 10,000 samples from the test split of each dataset to form our ID and OOD datasets. Each OOD method defines a score function S(x), which we then use to define an OOD classifier I[S(x) > ϵ]. We then calculate AUROCs for each method by varying the threshold ϵ. Table 1 reports the mean and standard deviation of the AUROC for each method across the six ID-OOD dataset combinations. Full experimental results showing performance for each ID-OOD combination are available in Appendix A. The columns "Deep" and "Shallow" refer to variants of each method that compute gradients w.r.t. all parameters of the network (Deep) or w.r.t. just the parameters of the final layer (Shallow), i.e. the weights of the layer that generates the logits.
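The evaluation protocol above can be sketched as follows. Computing the AUROC of the thresholded classifier I[S(x) > ϵ] over all ϵ is equivalent to a rank comparison between ID and OOD scores; here Gaussian samples stand in for real detector outputs, with scores oriented so that larger means more ID (as with GRADNORM).

```python
import numpy as np

def auroc(scores_id, scores_ood):
    """AUROC of the detector I[S(x) > eps] swept over all thresholds:
    the probability that a random ID score exceeds a random OOD score,
    counting ties as 1/2."""
    s_id = np.asarray(scores_id, dtype=float)[:, None]
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]
    return (s_id > s_ood).mean() + 0.5 * (s_id == s_ood).mean()

rng = np.random.default_rng(0)
scores_id = rng.normal(1.0, 1.0, size=2000)   # made-up: ID scores tend higher
scores_ood = rng.normal(0.0, 1.0, size=2000)
print(round(auroc(scores_id, scores_ood), 3))
```

This pairwise formulation is quadratic in the number of samples; a rank-based (Mann-Whitney) implementation would scale better, but the simple version suffices to illustrate the metric.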

Analysis of Results

We first observe that, on average over the six ID-OOD benchmark tasks, EXGRAD outperforms the previous SotA gradient-based post hoc OOD method, GRADNORM. Moreover, no method consistently dominates all other methods across ID-OOD splits (see Appendix A). This suggests that the singular focus in the literature on backpropagating KL divergence losses is unwarranted. Our next observation is that, in instances where extending gradients to the entire network produces a change in performance, such changes are at best modest gains and can in fact hurt performance. This aligns with results found in Huang et al. (2021). Given that the last layer of a deep neural network merely takes linear combinations of learned features, this observation serves as further evidence against the Feature-Extraction Hypothesis. Lastly, we note that the highest-performing score functions are the non-gradient-based approaches. In particular, the two score functions inspired by the decomposition analyses improve beyond their best gradient-based variants by 2.9 and 0.7 percentage points, respectively. These observations provide the basis for a general framework for achieving high-performing post hoc OOD detectors that do not rely on calculating test-time gradients: combining a norm of the learned encodings with a function of the predicted class distribution. In the following section, we provide additional experiments that leverage this proposed template to further illustrate the ability to achieve high performance without relying on test-time gradients.

4.2. EXPLORING ENCODING-OUTPUT COMPOSITIONS

The ImageNet Benchmark. In addition to the previous experimental setup, in this section we also evaluate on the large-scale ImageNet benchmark proposed by Huang & Li (2021). For this benchmark, the ImageNet-1k dataset (Deng et al., 2009) is used as the ID dataset. This ID dataset differs from MNIST, CIFAR10, and SVHN in that it is composed of higher-resolution images and has C = 1,000 classes instead of C = 10. The iNaturalist (Van Horn et al., 2017), SUN (Xiao et al., 2010), Places (Zhou et al., 2017), and Textures (Cimpoi et al., 2014) datasets are used as OOD datasets. This benchmark uses the 50,000 images in the ImageNet validation set as the ID data and 10,000 images from each of the OOD datasets (except for Textures, which uses 5,640). The pre-trained model is from Google BiT-S (Kolesnikov et al., 2020) and uses the ResNetv2-101 architecture (He et al., 2016). Our code for these experiments was built on top of the code from Huang et al. (2021), and more details about this baseline can be found in their paper.

Table 2 caption: The leftmost group uses only the output of the network (V only); the second group is the product of these measures with the 1-norm of the encoding passed to the last layer of the network, ∥h∥_1 V; and the last group comprises detectors that leverage gradients of the last layer of the network. The highest AUROC found for each OOD task is bolded. We separately report the average of the ImageNet baselines and all of the other baselines. Note that we do not list ∥h∥_1 × TV in this table since it is exactly GRADNORM.

Table 1: Mean (± std.) AUROC of gradient-based score variants across the six ID-OOD dataset combinations.

Score | AUROC | Gradient Depth
∥E_{Y∼p} ∇_θ log p^{(Y)}∥²₂ | 0.725 (± 0.086) | Deep
∥E_{Y∼p} ∇_θ log p^{(Y)}∥²₂ | 0.741 (± 0.081) | Shallow
E_{Y∼Uniform} ∥∇_θ log p^{(Y)}∥₁ | 0.825 (± 0.120) | Deep
E_{Y∼Uniform} ∥∇_θ log p^{(Y)}∥₁ | 0.850 (± 0.135) | Shallow
E_{Y∼Uniform} ∥∇_θ log p^{(Y)}∥²₂ | 0.867 (± 0.148) | Deep
E_{Y∼Uniform} ∥∇_θ log p^{(Y)}∥²₂ | 0.887 (± 0.107) | Shallow
∥∇_θ E_{Y∼Uniform} log p^{(Y)}∥₁ (GRADNORM) | 0.892 (± 0.087) | Deep
∥∇_θ E_{Y∼Uniform} log p^{(Y)}∥₁ (GRADNORM) | 0.906 (± 0.092) | Shallow
E_{Y∼p} [(log p^{(Y)}/p^{(Y)}) ∥∇_θ log p^{(Y)}∥²₂] | 0.910 (± 0.090) | Shallow
E_{Y∼p} ∥∇_θ log p^{(Y)}∥²₂ | 0.919 (± 0.041) | Shallow
E_{Y∼p} ∥∇_θ log p^{(Y)}∥²₂ | 0.921 (± 0.034) | Deep
E_{Y∼p} [(log p^{(Y)}/p^{(Y)}) ∥∇_θ log p^{(Y)}∥²₂] | 0.921 (± 0.106) | Deep
E_{Y∼p} ∥∇_θ log p^{(Y)}∥₁ (EXGRAD) | 0.925 (± 0.047) | Shallow
E_{Y∼p} ∥∇_θ log p^{(Y)}∥₁ (EXGRAD) | 0.926 (± 0.053) | Deep
Σ_{k=1}^C p^{(k)}(1 − p^{(k)}) (EXGRAD V term) | 0.933 (± 0.063) | N/A
Σ_{k=1}^C |1/C − p^{(k)}| (GRADNORM V term) | 0.935 (± 0.064) | N/A

Experimenting with the Encoding-Output Composition

In what follows, we test a variety of detectors that adhere to the encoding-output composition described in Section 3.1. In particular, each of the methods we test here is a product of a U term, which is a function of the encoding h, and a V term, which is a function of the outputted probability vector p. We test all combinations of U ∈ {1, ∥h∥_1} and V ∈ {Energy, VarSum, MSP, TV}. Here, U = 1 denotes not using any encoding of the input and only using the output score (i.e. the V term alone). For the V term, TV is the TV distance between p and a discrete uniform distribution (the V term from Eq. 4). The scores for Energy (Liu et al., 2020) and MSP (Hendrycks & Gimpel, 2016) are as follows:

S_{Energy} = T \log \sum_{k=1}^{C} e^{f^{(k)}(x)/T}, \quad S_{MSP} = \max_{k \in [C]} p^{(k)}.

Note that the energy score we use here is the negative version of the one originally introduced in Liu et al. (2020). Like before, we assume T = 1 for all experiments. VarSum is a term inspired by the decomposition of EXGRAD (i.e. the V term from Eq. 5). Because V = Σ_{k=1}^C p^{(k)}(1 − p^{(k)}) is anti-correlated with every other score, we make VarSum a correlated version of it; in particular, VarSum = 1 − Σ_{k=1}^C p^{(k)}(1 − p^{(k)}). Each of the V terms explored here is visualized in Figure 1, with the exception of Energy, since logits cannot be deduced from probabilities alone. Table 2 displays the OOD detection performance for each of these methods in both the small-scale benchmark setting (with the MNIST, CIFAR10, and SVHN datasets) and the large-scale benchmark setting with the ImageNet dataset. The small-scale experiments assume the same setting as Section 4.1.
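The U·V compositions described above can be sketched as follows. This is a minimal NumPy illustration with T = 1; the helper names and dictionary keys are our own, and the logits are made-up stand-ins for a network's outputs.

```python
import numpy as np

def v_terms(logits, T=1.0):
    """The four V terms, all oriented so that larger values indicate ID."""
    z = np.asarray(logits, dtype=float) / T
    zmax = z.max()
    p = np.exp(z - zmax); p = p / p.sum()
    C = p.size
    return {
        "Energy": T * (zmax + np.log(np.exp(z - zmax).sum())),  # negative of Liu et al.'s energy
        "MSP":    p.max(),                                      # maximum softmax probability
        "TV":     0.5 * np.abs(p - 1.0 / C).sum(),              # TV distance to uniform
        "VarSum": 1.0 - (p * (1 - p)).sum(),                    # correlated sum-of-variances
    }

def composed_scores(h, logits, T=1.0):
    """All combinations of U in {1, ||h||_1} with the four V terms."""
    U = np.abs(h).sum()
    return {name: {"V only": v, "||h||_1 x V": U * v}
            for name, v in v_terms(logits, T).items()}

# Confident (ID-like) logits yield high MSP, TV, and VarSum:
print(composed_scores(np.array([1.0, -2.0]), np.array([10.0, 0.0, 0.0])))
```

Note that only Energy needs the raw logits; MSP, TV, and VarSum are functions of p alone, which is why Energy cannot be plotted on the same probability axis as the others.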

Analysis of Results

It is immediately clear that there is a significant difference between the ImageNet baseline and the other baselines, and as such, we average the AUROCs separately. Starting with the small-scale setting, we find the best performers to be methods which look only at the outputted predicted distribution. Interestingly, the two best scores appear to be VarSum and the TV distance, which, to the best of our knowledge, have not been recommended for OOD detection by previous works. The literature generally warns against exclusively using the model's predicted distribution for OOD detection (Hein et al., 2019; Kirsch et al., 2021), since even OOD points can produce confident predictions with probabilities that are highly concentrated on a single class. These results indicate that this issue is milder in smaller-scale models (MNIST, SVHN) and exacerbated for large models which are highly over-parameterized (CIFAR10, ImageNet). This result is generally in line with observations in calibration, which point out the over-confidence of neural network models as they become more over-parameterized (Guo et al., 2017; Wang et al., 2021). Following this, it seems that information about the penultimate-layer encoding is important for high-performing detectors in the ImageNet baselines. Besides VarSum, every possible V improves across all ImageNet baselines when multiplied by ∥h∥_1. It is unclear why VarSum and EXGRAD do not follow this trend and have subpar performance; however, we believe it may be related to how the sum-of-variances landscape changes with a dramatic increase in the number of classes. To the best of our knowledge, at the time of writing, GRADNORM is the state-of-the-art post hoc method for all considered benchmarks. Importantly, however, we find that ∥h∥_1 × Energy is a strictly better detector than GRADNORM on the ImageNet benchmark. These results further strengthen our claim that gradients do not necessarily provide unique benefits for OOD detection performance.
It is perhaps only the encoding-output decomposition that results in strong performance, rather than some unique property of gradients.

4.3. EXPLORING ENCODING CHOICES

While the previous section limited the feature-encoding term to the L_1 norm, in this section we study the effect of the encoding design choice by varying the norm applied. Although Huang et al. (2021) perform an ablation that changes the order of the norm, they do so on the gradients themselves. By focusing on the norm of h, we have more direct control over the score and the resulting detector. Figure 2 displays the average AUROC performance on the ImageNet benchmark for 48 different U·V score pairs, where U ∈ {∥h∥_p : p ∈ {0, 0.1, 0.3, 0.5, 0.8, 1, 2, 3, 4, 5, 6, ∞}} and V ∈ {Energy, TV, MSP, VarSum}. Heatmaps for every experiment are shown in Appendix C. The results indicate that the choice of encoding does have a significant effect on a score's OOD detection performance. In particular, ∥h∥_{0.3} × Energy achieves an AUROC of 0.917, which is a drastic improvement over any score in Table 2. It is also better than some previously proposed non-post hoc methods such as MOS (Huang & Li, 2021), which achieves an average AUROC of 0.901. We note that this comparison is not entirely fair, since the parameter scan was done over the benchmark test set. Nevertheless, this is an encouraging result for what could be achieved by a method that considers both the image encoding and the network output. In particular, we believe that devising more sophisticated U terms than the naive ones tested here could be a promising direction for improving OOD detection performance within this framework.
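The norm sweep above can be sketched as follows. We assume ∥h∥_p for p < 1 uses the usual power-sum expression (a quasi-norm rather than a proper norm), with p = 0 counting nonzero entries and p = ∞ taking the max; h and the logits below are random stand-ins for real network activations.

```python
import numpy as np

def h_norm(h, p_ord):
    """||h||_p for the orders scanned in the sweep: p = 0 counts nonzeros,
    p = inf is the max absolute value, and p < 1 gives a quasi-norm."""
    a = np.abs(np.asarray(h, dtype=float))
    if p_ord == 0:
        return float((a > 0).sum())
    if p_ord == np.inf:
        return float(a.max())
    return float((a ** p_ord).sum() ** (1.0 / p_ord))

def energy(logits, T=1.0):
    """Negative-energy V term, computed with a stable log-sum-exp."""
    z = np.asarray(logits, dtype=float) / T
    zmax = z.max()
    return T * (zmax + np.log(np.exp(z - zmax).sum()))

rng = np.random.default_rng(0)
h = rng.normal(size=64)           # made-up penultimate encoding
logits = rng.normal(size=1000)    # made-up logits for C = 1000 classes
for p_ord in [0, 0.3, 1, 2, np.inf]:
    print(p_ord, round(h_norm(h, p_ord) * energy(logits), 3))
```

In a full experiment, each (p, V) pair would be scored on ID and OOD sets and compared by AUROC, producing a heatmap like those in Appendix C.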

4.4. CHALLENGING THE FEATURE-EXTRACTION HYPOTHESIS

As described above, one emerging theory that attempts to explain the role of gradients in OOD detection is the Feature-Extraction Hypothesis. Quoting Lee & AlRegib (2020): "Gradient-based optimization involves larger updates when there is a larger gap between predictions and correct labels for given inputs. It implies that the model requires more significant adjustments to its parameters, as it has not learned enough features to represent the inputs or relationships between learned features and classes for correct prediction." We challenge the plausibility of the Feature-Extraction Hypothesis in explaining the role of gradients with the following observation: the gradient calculated at a singleton test point x can be associated with a learning problem that requires no feature learning in order to minimise the loss. In particular, we note that previous gradient-based scores involve calculating gradients for optimisation problems of the form min_θ E_Y[ℓ(Y, p(x))]. Notably, degenerate mappings lie in the solution space for such optimisation problems, and such mappings by definition cannot extract features from the input data. In order to investigate the plausibility of the Feature-Extraction Hypothesis, we define a score function that by construction excludes degenerate mappings from its solution space. Specifically, we define the score

S_{BG}(x) = \mathbb{E}_{Y \sim p} \left\| \nabla_\theta \left( \log p^{(Y)}(x) + \sum_{i=1}^{C} \log p^{(Y_i^{ID})}(x_i^{ID}) \right) \right\|_1,  (6)

where (x_i^{ID}, Y_i^{ID}) is an in-distribution training point belonging to class i. We refer to the OOD classifier based on this score function as BATCHGRAD. In particular, we note that the loss function inside the parentheses of Eq. 6 defines an optimisation problem that, for sufficiently expressive networks, cannot have degenerate mappings as its solution. This is because the second term inside the parentheses penalizes choices of θ that parameterise a degenerate mapping.
Note that this is in contrast to other loss functions, such as those used in EXGRAD and GRADNORM, where degenerate mappings lie in the solution spaces of the associated optimisation problems. The motivation for the BATCHGRAD loss function is to design an experiment where gradients are guaranteed to point towards a true (i.e. non-degenerate) mapping, giving a better chance of observing the Feature-Extraction Hypothesis in action, should it be true. In particular, if the Feature-Extraction Hypothesis were true, then we would be especially likely to see a reduction in OOD detection performance when restricting gradients to the final layer of BATCHGRAD. However, as shown in Table 3, we find that the opposite holds: the variant of BATCHGRAD with gradients restricted to just the last layer is more informative for OOD detection than the variant using gradients w.r.t. the whole network. This observation is consistent with previous experiments showing the sufficiency, and at times superiority, of restricting gradients to the final-layer parameters. This experimental result, in combination with our analysis demonstrating the significance of the last-layer gradients, leads us to conclude that the Feature-Extraction Hypothesis is not an appropriate explanation for the high performance of gradient-based OOD detection methods.
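A last-layer sketch of the BATCHGRAD score (Eq. 6): using the closed form ∇_W log p^{(Y)}(x) = (e_Y − p(x)) h(x)^⊤, the batch term contributes a fixed gradient from one ID anchor point per class. All encodings below are random stand-ins for a real network's activations, and the restriction to the last-layer weights is our own simplification for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def batchgrad_score(h_test, W, h_id, y_id):
    """Sketch of S_BG restricted to the last layer. The gradient of
    log p^(Y)(x) + sum_i log p^(Y_i)(x_i) w.r.t. W is
    (e_Y - p(x)) h(x)^T + sum_i (e_{Y_i} - p(x_i)) h(x_i)^T,
    and the score averages its L1 norm under Y ~ p(x)."""
    C = W.shape[0]
    p_test = softmax(W @ h_test)
    # Fixed contribution from one ID anchor encoding per class:
    g_id = sum(np.outer(np.eye(C)[y] - softmax(W @ h), h)
               for h, y in zip(h_id, y_id))
    score = 0.0
    for y in range(C):
        g = np.outer(np.eye(C)[y] - p_test, h_test) + g_id
        score += p_test[y] * np.abs(g).sum()
    return score

rng = np.random.default_rng(0)
C, d = 5, 16
W = rng.normal(size=(C, d))        # made-up last-layer weights
h_id = rng.normal(size=(C, d))     # made-up ID encoding per class
print(batchgrad_score(rng.normal(size=d), W, h_id, list(range(C))))
```

Because the ID batch term is shared across test inputs, only the test point's own gradient varies, which keeps the score as cheap to evaluate as the other shallow detectors.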

5. DISCUSSION

In this work, we experimentally investigated gradient-based OOD detection methods. While gradient-based approaches can achieve strong performance, we find previous explanations attributing this performance to gradients unsatisfactory. Although prior works have focused on the interpretation of taking the gradient with respect to the KL divergence between the outputted distribution and a discrete uniform distribution (Lee & AlRegib, 2020; Huang et al., 2021), we find that other gradient-based scores that do not have this interpretation also perform well, especially on smaller-scale problems. We also question the idea that the gradient space of the neural network holds key information for OOD detection. Our experiments provide evidence against the hypothesis that gradient-based methods are informed by the large changes needed in the network to capture unseen, OOD images. Moreover, we show that we can derive better-performing detectors that are agnostic to gradients and only use the encoding-output decomposition discussed in Huang et al. (2021). As such, we believe the strength of GRADNORM comes not from its leverage of gradients, but solely from the fact that it fuses information about network encodings and outputted distributions. Hence, while it is possible that gradients contain useful information for OOD detection, we do not believe that previous methods leverage information from gradients that cannot be derived more easily through other means.

Future Work Our hope is that the insights provided in this work can be used to further improve OOD detection. In particular, we believe that more advanced methods can be developed that fuse together information from the network encoding and scores derived from the model's predicted distribution. For example, we evaluated the choice of input encodings by varying the order of the norm applied to the last hidden layer outputs.
Investigating the utility of other hidden-layer features or auxiliary encoders, and studying which properties of an encoding help detect OOD data, could provide further insights into devising stronger OOD detectors. Another interesting direction is the role that task difficulty and model capacity play in how these detectors perform. We found that when the in-distribution task was easier, top-performing detectors depended only on network outputs; however, using both outputs and input encodings was essential for the more complex ImageNet tasks. As mentioned in Section 4.2, existing works in uncertainty quantification and calibration have noted the unreliability of a deep model's predicted class distribution. While calibration is orthogonal to the scope of this work, we believe there could be fundamental ties in determining the boundary between relying solely on the predicted output distribution for OOD detection (i.e. utilizing the V term only) and additionally requiring extracted information from the input space via input encodings (utilizing the U term). We leave investigating this direction for future work.
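The encoding-output fusion advocated in this section reduces to a one-line score. A minimal sketch, assuming the $L_p$ norm of the last hidden layer encoding for the U term and the GRADNORM-style output term $V = \sum_k p^{(k)}(1 - p^{(k)})$ (both choices are illustrative instantiations, not the only ones evaluated):

```python
# Gradient-free "U * V" score: U(h) is a norm of the last hidden layer
# encoding, V(p) is a scalar summary of the predicted class distribution.
def uv_score(h, p, norm_order=1):
    U = sum(abs(x) ** norm_order for x in h) ** (1.0 / norm_order)
    V = sum(q * (1.0 - q) for q in p)   # GRADNORM-style output term
    return U * V

print(uv_score([0.5, -1.2, 0.3], [0.7, 0.2, 0.1]))  # ~0.92
```

Swapping `V` for the maximum softmax probability recovers an MSP-flavoured variant, which is how the framework spans the configurations in the heatmap experiments.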

A ADDITIONAL EXPERIMENTAL DETAILS

The tables below show detailed breakdowns of AUROC results for the small-scale experiments. Note that the top row describes the ID dataset, and the two entries beneath describe the OOD datasets.

B DERIVATION OF THE GRADNORM SCORE

For $k' \neq k$,

$$\frac{\partial}{\partial f^{(k')}_\theta} \log \frac{e^{f^{(k)}_\theta / T}}{\sum_{k''=1}^C e^{f^{(k'')}_\theta / T}} = -\frac{1}{T} \cdot \frac{e^{f^{(k')}_\theta / T}}{\sum_{k''=1}^C e^{f^{(k'')}_\theta / T}} = -\frac{p^{(k')}}{T},$$

and the corresponding derivative with respect to $f^{(k)}_\theta$ is $(1 - p^{(k)})/T$. Building on this, with $h \in \mathbb{R}^D$ the last hidden layer output and $W$ the last layer weights (so $f_\theta = Wh$),

$$\frac{\partial}{\partial W} \log \frac{e^{f^{(k)}_\theta / T}}{\sum_{k'=1}^C e^{f^{(k')}_\theta / T}} = \frac{1}{T} \begin{pmatrix} -p^{(1)} \\ \vdots \\ 1 - p^{(k)} \\ \vdots \\ -p^{(C)} \end{pmatrix} h^\top.$$

The L1 norm of the gradient with respect to $W$ is the sum of the absolute values of its entries:

$$\frac{1}{T} \sum_{i=1}^D \left( (1 - p^{(k)}) |h_i| + \sum_{k' \neq k} p^{(k')} |h_i| \right) = \frac{1}{T} \sum_{i=1}^D 2(1 - p^{(k)}) |h_i| = \frac{2}{T} (1 - p^{(k)}) \|h\|_1.$$

Putting it all together,

$$S(x) = \mathbb{E}_{k \sim p} \left\| \nabla_W \log \frac{e^{f^{(k)}_\theta / T}}{\sum_{k'=1}^C e^{f^{(k')}_\theta / T}} \right\|_1 = \frac{2}{T} \, \mathbb{E}_{k \sim p}\!\left[ 1 - p^{(k)} \right] \|h\|_1 = \frac{2}{T} \|h\|_1 \sum_{k=1}^C p^{(k)} (1 - p^{(k)}).$$

C ADDITIONAL HEATMAPS

In this section, we present additional heatmaps (similar to Figure 2) which indicate the AUROC for each of the methods evaluated in Section 4.3, separately for each of the ID-OOD dataset settings tested in the experiments section. The title of each plot indicates the ID and OOD dataset in order, i.e. "ID x OOD". Each cell of the heatmap shows the AUROC for a different configuration of probability output score (y-axis) and order of the norm on the encoding fed to the last layer (x-axis).
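The closed form derived for the GRADNORM score, $\|\nabla_W \log p^{(k)}\|_1 = \frac{2}{T}(1 - p^{(k)})\|h\|_1$, can be sanity-checked numerically. The following pure-Python sketch (toy weights and activations, $T = 2$; not values from the paper) compares a finite-difference L1 gradient norm against the closed form:

```python
# Numerical check that || d/dW log softmax_k(Wh/T) ||_1 = (2/T)(1-p_k)||h||_1.
import math

T = 2.0
h = [0.5, -1.2, 0.3]            # toy last hidden layer activations (D = 3)
W = [[0.1, -0.4, 0.2],          # toy last-layer weights (C = 2 classes)
     [-0.3, 0.6, 0.05]]

def softmax(W, h, T):
    z = [sum(w * x for w, x in zip(row, h)) / T for row in W]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def log_softmax_k(W, h, k, T):
    z = [sum(w * x for w, x in zip(row, h)) / T for row in W]
    m = max(z)
    return z[k] - (m + math.log(sum(math.exp(v - m) for v in z)))

# Central-difference L1 norm of the gradient w.r.t. every entry of W.
k, eps = 0, 1e-6
grad_l1 = 0.0
for r in range(len(W)):
    for c in range(len(h)):
        W[r][c] += eps
        plus = log_softmax_k(W, h, k, T)
        W[r][c] -= 2 * eps
        minus = log_softmax_k(W, h, k, T)
        W[r][c] += eps          # restore the original weight
        grad_l1 += abs((plus - minus) / (2 * eps))

p = softmax(W, h, T)
closed_form = (2.0 / T) * (1.0 - p[k]) * sum(abs(x) for x in h)
print(abs(grad_l1 - closed_form) < 1e-4)  # True
```

Averaging the closed form over $k \sim p$ then gives the final expression $\frac{2}{T}\|h\|_1 \sum_k p^{(k)}(1 - p^{(k)})$, which depends only on the encoding norm and the output distribution.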



https://github.com/aaron-xichen/pytorch-playground
https://github.com/google-research/big_transfer
https://github.com/deeplearning-wisc/gradnorm_ood

Note that by "true mapping", we mean a mapping that is not degenerate, and by "degenerate mapping" we mean a mapping that maps to the same value for all inputs.




Figure 1: Output Score Components. The probability output components (V components from Section 4.2) are shown on a 3-class probability simplex where each of the 3 vertices signifies probability 1 for a single class. From left to right, these are the output score terms used for EXGRAD, MSP, and GRADNORM. Lighter to darker shades indicate lower to higher values.

2 PRELIMINARIES AND RELATED WORK

Problem Setting and Notation We focus on the classification setting where the task is to predict a class label $y \in [C]$ given an input $x \in \mathbb{R}^d$, where $C$ is the total number of possible classes and $d$ is the dimensionality of the input. We assume we have access to training data $\{(x_i, y_i)\}_{i=1}^N$ in order to train the parameters $\theta$ of a deep neural network $f_\theta : \mathbb{R}^d \to \mathbb{R}^C$. Here, $\theta$ is a collection of all of the weights from each layer in the network; thus $\theta = \{W_l : l \in [L]\}$, where $l$ denotes the index of each layer, and we assume the network has $L$ layers in total.

Figure 2: Average AUROC over ImageNet Benchmark Each cell of the heatmap shows the average AUROC for a different configuration of probability output score (y-axis) and order of the norm on the encoding fed to the last layer (x-axis).


AUROC scores. The table shows AUROC scores for several detectors grouped by category.

BATCHGRAD Small-Scale Experimental Results. This table shows AUROC results, calculated across the 6 ID-OOD dataset combinations from {MNIST, CIFAR-10, SVHN}.


Reproducibility Statement Our experimental setup closely follows that of our main baseline, Huang et al. (2021). We follow their code as provided in the public GitHub repo https://github.com/deeplearning-wisc/gradnorm_ood, which includes the pre-trained models and data processing. For the "small-scale" experiments (any experiments involving the MNIST, CIFAR-10, and SVHN datasets), we use the pre-trained models as provided in the public GitHub repo https://github.com/aaron-xichen/pytorch-playground, and follow the rest of the protocol as provided in the earlier repo.

