HOW USEFUL ARE GRADIENTS FOR OOD DETECTION REALLY?

Abstract

One critical challenge in deploying machine learning models in real-life applications is out-of-distribution (OOD) detection. Given a predictive model that is accurate on in-distribution (ID) data, an OOD detection system can further equip the model with the option to defer prediction when the input is novel and the model has low confidence. Notably, there has been recent interest in utilizing gradient information in pre-trained models for OOD detection. While these methods are competitive, we argue that previous works conflate their performance with the necessity of gradients. In this work, we provide an in-depth analysis and comparison of gradient-based methods and elucidate the key components that warrant their OOD detection performance. We further demonstrate that a general, non-gradient-based family of OOD detection methods is just as competitive, casting doubt on the usefulness of gradients for OOD detection.

1 INTRODUCTION

Recent advances in algorithms, models, and training infrastructure have brought about unprecedented performance of machine learning (ML) methods across a wide range of data types and tasks. Despite their demonstrated potential on benchmark settings and domains, one obstacle that limits ML methods' applicability in real-world applications is the uncertainty or confidence of their predictions. Without any deliberate mechanisms, ML models will output a prediction for any given input, and the question of whether this prediction can be trusted is especially critical in many high-risk decision-making settings (e.g. self-driving cars).



It is unrealistic to expect to train a model that makes perfect predictions for all possible inputs, partly because real-world datasets are limited in their scope. Thus, in lieu of trying to make predictions for all test inputs, one can attempt to first detect whether the input is covered by the support of the training data. This is the motivation behind OOD detection. Among the diverse approaches to OOD detection for image recognition, a recent line of work has suggested utilizing the information in gradients to derive efficient and performant methods for OOD detection (Liang et al., 2017; Lee & AlRegib, 2020; Agarwal et al., 2020; Lee & AlRegib, 2021; Huang et al., 2021; Sun et al., 2022; Kokilepersaud et al., 2022).

We motivate our work by first examining the claim that gradients are useful for OOD detection. Through a comparison with various extensions of gradient-based scores, we analyze the key components that actually drive the performance of these methods, and we argue that gradient computations are not essential in deriving performant post hoc OOD detectors. Rather, these methods ultimately rely on the magnitude of the learned feature embedding and the predicted output distribution. We thereby refute many of the intuitions with which previous works motivate their methods. Based on our analysis, we advocate for the study of a more general, non-gradient-based framework for producing performant score functions, and we provide a comprehensive empirical evaluation of various instantiations of the score within this framework.

The rest of this paper is structured as follows. We first provide a formal statement of the problem setting and the related works in Section 2. We then introduce existing and new gradient-based detectors and discuss how both can be simplified into intuitive forms (Section 3). Following this, we perform empirical evaluations of the methods (Section 4) and discuss their implications in Section 5.
[Figure 1 equation labels: ∑_{k=1}^C p^(k)(1 − p^(k)); max_k p^(k)]

Given an input x, the network predicts a probability vector, p, via the softmax function of the network outputs:

p^(k)(x) = P(Y = k | x) = exp(f_θ^(k)(x)/T) / ∑_{k'=1}^C exp(f_θ^(k')(x)/T),

where the superscripts denote the index of the vector and T is the temperature. If not otherwise specified, we will assume T = 1. Although p(x) depends on both θ and x, we will always exclude the former and often exclude the latter when it is clear from context or unimportant.
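As a minimal illustration of the temperature-scaled softmax described above, the following NumPy sketch computes p from a vector of network outputs; the function name and the example logits are illustrative, not from the paper.

```python
import numpy as np

def softmax_probs(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax: p^(k) = exp(f^(k)/T) / sum_k' exp(f^(k')/T).

    `logits` holds the network outputs f_theta(x) for one input, shape (C,).
    """
    z = logits / T
    z = z - z.max()  # subtracting the max is for numerical stability only
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax_probs(logits)             # T = 1, the default assumed in the text
p_flat = softmax_probs(logits, T=10)  # a higher temperature flattens p
```

Note that raising T moves p toward the uniform distribution while leaving the argmax (the predicted class) unchanged.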
Lastly, we often abuse notation and use Y ∼ p(x) to mean that Y is sampled from the categorical distribution parameterized by p(x).

The OOD Detection Problem. In a real-world scenario, the model may be given an input x ∈ R^d during deployment that is substantially different from any of the datapoints in the training set. For example, a classifier trained to identify numeric digits may be given an image of a cat. Since the model's prediction p(x) cannot be trusted, it would be advantageous to flag such instances and defer prediction. This is the problem that OOD detection addresses. More formally, the goal in OOD detection is to derive a binary classifier which labels whether a given input x is ID or OOD. This goal is commonly addressed by learning a score function S : R^d → R which quantifies the degree to which the input is OOD. Existing works have approached the problem of learning the mapping S from various perspectives, which generally differ in where the signal used to compute S is extracted from. Here, we introduce broad groupings of methodologies to provide context for our work, and we refer the reader to Yang et al. (2021); Salehi et al. (2021) for an in-depth survey of the field.

One class of methods focuses on the input space and learns score functions based on characteristics that can be derived from the input features. For example, distance-based methods are based on the intuition that OOD data should lie "far away" from ID data and define S via distances between the input point and reference points that are representative of ID data (Lee et al., 2018; Techapanurak et al., 2020; Van Amersfoort et al., 2020). Meanwhile, density-based methods utilize probabilistic models to describe the density of ID data and argue that OOD points should occur in regions of low density. Generative models are thus often used, and the score function is typically derived from the predicted likelihood of the test input (Ren et al., 2019; Serrà et al., 2019; Zisselman & Tamar, 2020).
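To make the score-function view concrete, here is a sketch of one distance-based instantiation of S: the distance from a test point to its nearest ID reference point, thresholded into the binary ID/OOD decision. The Euclidean feature space, the synthetic reference set, and the threshold tau are illustrative assumptions, not the construction of any specific cited method.

```python
import numpy as np

def knn_ood_score(x: np.ndarray, ref_points: np.ndarray) -> float:
    """Distance-based score S(x): distance from x to its nearest ID reference
    point. Larger values suggest x lies farther from the ID data."""
    dists = np.linalg.norm(ref_points - x, axis=1)
    return float(dists.min())

def is_ood(x: np.ndarray, ref_points: np.ndarray, tau: float) -> bool:
    """Turn the score S into the binary ID/OOD classifier: flag if S(x) > tau."""
    return knn_ood_score(x, ref_points) > tau

rng = np.random.default_rng(0)
ref_points = rng.normal(size=(500, 8))  # stand-in for ID reference points
x_id = ref_points[0] + 0.01             # a point near the ID data
x_ood = np.full(8, 10.0)                # a point far from the ID data
```

In practice the reference points would be features of training data, and tau is chosen to meet a target false-positive rate on held-out ID data.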
Another class of methods aims to directly shape a predictive model's behavior on OOD data through explicit training. These methods often assume access to OOD examples during training and incorporate them into the model's training procedure to maximize the separability between ID and OOD inputs. This is usually achieved by setting aside actual samples from an OOD test distribution (Hendrycks et al., 2018; Yu & Aizawa, 2019; Liu et al., 2020) or, if such samples are not available, by synthesizing OOD examples via adversarial training, perturbations, or sampling from decision boundaries or low-density regions (Lakshminarayanan et al., 2016; Lee et al., 2017; Vernekar et al., 2019).
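As one sketch of how auxiliary OOD samples can be folded into training (in the spirit of the outlier-exposure line of work cited above), the loss below combines standard cross-entropy on an ID example with a term that pushes the prediction on an OOD example toward the uniform distribution. The specific penalty and its weighting lam are illustrative assumptions, not the exact objective of any cited method.

```python
import numpy as np

def cross_entropy(p: np.ndarray, y: int) -> float:
    """Standard cross-entropy loss for a predicted distribution p and label y."""
    return -float(np.log(p[y] + 1e-12))

def uniformity_penalty(p: np.ndarray) -> float:
    """Cross-entropy between p and the uniform distribution over C classes:
    (1/C) * sum_k -log p^(k). Minimized when p is uniform."""
    return -float(np.mean(np.log(p + 1e-12)))

def mixed_loss(p_id: np.ndarray, y_id: int, p_ood: np.ndarray,
               lam: float = 0.5) -> float:
    """CE on the ID example plus lam times the uniformity term on the OOD one."""
    return cross_entropy(p_id, y_id) + lam * uniformity_penalty(p_ood)

p_conf = np.array([0.9, 0.05, 0.05])  # confident prediction
p_unif = np.ones(3) / 3               # maximally uncertain prediction
```

Under this objective, a model that is confident on ID inputs and uncertain on OOD inputs incurs a lower loss than one with the opposite behavior, which is exactly the separability the training procedure targets.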



Figure 1: Output Score Components. The probability output components (V components from Section 4.2) are shown on a 3-class probability simplex where each of the 3 vertices assigns probability 1 to a single class. From left to right, these are the output score terms used for EXGRAD, MSP, and GRADNORM. Lighter to darker shades indicate lower to higher values.

2 PRELIMINARIES AND RELATED WORK

Problem Setting and Notation. We focus on the classification setting where the task is to predict a class label y ∈ [C] given an input x ∈ R^d, where C is the total number of possible classes and d is the dimensionality of the input. We assume access to training data {(x_i, y_i)}_{i=1}^N in order to train the parameters θ of a deep neural network f_θ : R^d → R^C. Here, θ is the collection of all weights from each layer in the network, i.e., θ = {W_l : l ∈ [L]}, where l denotes the index of each layer and L is the total number of layers.
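Under our reading of Figure 1, the two leftmost output terms are ∑_k p^(k)(1 − p^(k)) for EXGRAD and the familiar maximum softmax probability max_k p^(k) for MSP; this pairing of terms to methods is our assumption from the figure, and GRADNORM's term is not reproduced here. The sketch below evaluates both terms on example probability vectors.

```python
import numpy as np

def msp_component(p: np.ndarray) -> float:
    """Maximum softmax probability: max_k p^(k). Largest at simplex vertices."""
    return float(p.max())

def exgrad_component(p: np.ndarray) -> float:
    """Assumed EXGRAD output term: sum_k p^(k) (1 - p^(k)). Smallest for
    confident (near-vertex) predictions, largest at the simplex center."""
    return float(np.sum(p * (1.0 - p)))

p_confident = np.array([0.98, 0.01, 0.01])  # near a vertex of the simplex
p_uniform = np.ones(3) / 3                  # center of the simplex
```

The two terms move in opposite directions with confidence, matching the shading described in the caption: MSP peaks at the vertices, while the EXGRAD term peaks at the center of the simplex.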


