OUT-OF-DISTRIBUTION DETECTION BASED ON IN-DISTRIBUTION DATA PATTERNS MEMORIZATION WITH MODERN HOPFIELD ENERGY

Abstract

Out-of-Distribution (OOD) detection is essential for safety-critical applications of deep neural networks. OOD detection is challenging because DNN models may produce very high logit values even for OOD samples, so it is difficult to discriminate OOD data by directly adopting Softmax on the output logits as the confidence score. Instead, we detect OOD samples with Hopfield energy in a store-then-compare paradigm. In more detail, penultimate-layer outputs on the training set are considered representations of in-distribution (ID) data; they can thus be transformed into stored patterns that serve as anchors to measure the discrepancy of unseen data for OOD detection. Starting from the energy function defined in the Modern Hopfield Network for the discrepancy score calculation, we derive a simplified version, SHE, with theoretical analysis. In SHE, we use only one stored pattern to represent each class, obtained by simply averaging the penultimate-layer outputs of the training samples within that class. SHE is hyperparameter-free and computationally efficient. Evaluations on nine widely-used OOD datasets show the promising performance of this simple yet effective approach and its superiority over state-of-the-art models. Code is available at https://github.com/zjs975584714/SHE_ood_detection.

1. INTRODUCTION

Deep Neural Networks (DNNs) have yielded remarkable achievements in a broad range of fields in recent years (He et al., 2016; Huang et al., 2017), and are extensively deployed in numerous real-world scenarios (Krizhevsky et al., 2017; Redmon & Farhadi, 2017). One of their powerful capabilities lies in the promising generalization from training data to unseen in-distribution (ID) data. However, finite training data cannot guarantee the completeness of the data distribution, so it is inevitable to encounter out-of-distribution (OOD) data. Softmax-based prediction allows OOD samples to gain high confidence in specific classes, which is unacceptable in practice, especially in safety-related areas: it can lead to collisions in autonomous driving or irreparably large financial losses. Therefore, OOD detection is critical for AI safety (Amodei et al., 2016).

Existing efforts on OOD detection for DNNs can be roughly divided into two categories. The first group of approaches designs and retrains new auxiliary networks specifically for OOD detection rather than directly using already trained models (Denouden et al., 2018; DeVries & Taylor, 2018; Yu & Aizawa, 2019; Zhang et al., 2020). The objective must be modified accordingly, and OOD samples are sometimes introduced to train the new networks. However, it is almost impossible to exhaust all kinds of OOD samples, and retraining can be cumbersome. Methods of the second category elaborate on confidence measures over the network outputs, e.g., the logits, the Softmax probability (Liang et al., 2017; Liu et al., 2020; Sun et al., 2021), or embedding features (Lee et al., 2018; Sehwag et al., 2021; Sun et al., 2022). By these means, there is no need to modify the backbone model or the objective, which motivates us to explore OOD detection in this manner. In deep learning, an intermediate-layer output can be regarded as the representation of the input data in the latent space.
Further, as shown in Figure 1 (left), guided by the training process, the representations of ID samples of the same category tend to present common patterns that support prediction accuracy. In contrast, the representations of OOD samples should deviate from this commonality since they are not considered during training. Based on this intuition, OOD detection can be formulated as a store-then-compare process: representations of ID samples within each category are maintained as stored patterns, and a test pattern is compared against the stored patterns; if there is a noticeable discrepancy, the sample is judged as OOD. The critical question is how to measure the discrepancy between an OOD sample and the stored patterns under this setting. To accomplish this goal, we adopt the key idea of a classic memory network, the Hopfield Network. The Hopfield Network with binary states was first introduced in (Hopfield, 1982), and (Hopfield, 1984) proposed a continuous-state version. The Modern Hopfield Network (both continuous and binary) was introduced in (Krotov & Hopfield, 2016), and (Ramsauer et al., 2020) proposed a new energy function for continuous-state Hopfield Networks and pointed out its relationship with the Transformer. The Hopfield Network targets recovering distorted test patterns so that the recovered patterns are as close to the stored patterns as possible. It achieves this goal through specific update rules that minimize a predefined energy function: the more a recovered pattern resembles a stored pattern, the lower the energy. The energy function therefore plays a vital role, as it indicates the gap between the recovered patterns and the stored patterns. For OOD detection, the energy function of the Hopfield Network is thus well-suited as a measure of the discrepancy between an OOD sample and the stored patterns.
In this paper, we propose a new OOD detection method, HE, based on memorization of ID data patterns and the Modern Hopfield energy function (Ramsauer et al., 2020). In more detail, the representations of training ID samples are stored as patterns for each category in advance, and OOD samples are detected via the energy function. As the intermediate results are more informative than the highly compressed final output logits, we preserve the outputs of the penultimate layer (i.e., the input of the final output layer) as representations. Furthermore, to address the memory cost of pattern memorization, we derive a Simplified Hopfield Energy-based method, SHE. In SHE, only one pattern is required per category and there is no hyperparameter to tune. Theoretical analysis proves the effectiveness of our design. The remarkable performance on nine widely-used OOD detection datasets with three different networks demonstrates the superiority of our proposed SHE (and HE) over state-of-the-art methods. We summarize the main contributions of our paper as follows:

• We propose a Modern Hopfield Energy-based method, HE, for out-of-distribution detection. It uses a store-then-compare paradigm that compares test samples with pre-stored patterns to measure the discrepancy from in-distribution data according to the Hopfield energy.

• We derive a simplified version of HE, named SHE, which greatly reduces the memory and computation cost. In addition, SHE is hyperparameter-free. Theoretical analysis is conducted to illustrate the effectiveness of SHE.

• Extensive experiments on nine OOD detection datasets with three prominent computer vision backbone networks indicate both the effectiveness and the efficiency of our designed methods. Experiments on large-scale datasets (e.g., ImageNet-1k) also show the superiority of our approach. In-depth analyses and ablation studies shed light on the mechanism behind it.

2. RELATED WORK

Network Redesign and Retrain. Given that original network architectures are designed for target tasks like classification, a straightforward paradigm of OOD detection is to elaborate on the network architecture. All these methods aim to redesign or introduce new layers or auxiliary networks, with corresponding objectives, for OOD detection. Retraining a network can be extremely time-consuming, especially when the parameter scale is substantially large. The modified objective that considers OOD detection in addition to the original task may also have the side effect of degrading model performance on the original task. In addition, some methods require OOD samples as input, which imposes additional requirements on the datasets; a potential risk is that the model will achieve poor results on data beyond the distribution of the trained OOD samples. Unlike those approaches, our method does not make any changes (to either architecture or objective) to the original network and does not require any additional training, which is a desirable property for real applications.

Network Output Transformation. Apart from adjusting the network structure or retraining the network with revised objectives, transforming network outputs to obtain the desired measure is the other classic OOD detection paradigm. The first study on network output transformation was proposed by (Hendrycks & Gimpel, 2016), which used the Maximum Softmax Probability (MSP) to measure the confidence of test samples. Intuitively, ID data is more likely than OOD data to obtain high confidence from the Softmax measure. In (Liang et al., 2017), input data were perturbed with ID-sample-friendly perturbations and the Softmax probability was re-scaled by a temperature parameter T, making OOD data and ID data more separable.
(Lee et al., 2018) first generated class-conditional Gaussian distributions from the middle-layer outputs of an already trained network on training data and then calculated the confidence of test samples under the Mahalanobis distance. (Serrà et al., 2019) made assumptions about the complexity of output and input images, and advocated estimating the complexity of the input image to adjust the output for efficient OOD detection. (Liu et al., 2020) detected OOD samples by an energy-based score function on the final output logits. Note that this energy function is different from ours, as it calculates the score based merely on the output logits of the test sample instead of comparing against stored patterns. Based on the observation that the mean activation of OOD samples has larger variations, (Sun et al., 2021) set a threshold to clip the output of the penultimate layer, thereby reducing the output magnitude of OOD samples in the last layer. These approaches can be directly applied to OOD detection without additional training, which is more practical for real-world applications. (Sehwag et al., 2021) exploited the advantages of self-supervised training. (Sun et al., 2022) calculated the Euclidean distance to each training-sample pattern and used the k-th smallest distance as the metric for OOD detection.
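For concreteness, the k-th nearest-neighbour scoring of (Sun et al., 2022) can be sketched as follows (a minimal NumPy sketch; the function name and the choice of k are ours, not from the original paper):

```python
import numpy as np

def knn_ood_score(xi, train_feats, k=50):
    """k-th nearest-neighbour OOD score in the spirit of Sun et al. (2022):
    the k-th smallest Euclidean distance from the test feature xi to the
    stored training features. A larger distance suggests a more OOD-like
    sample."""
    dists = np.linalg.norm(train_feats - xi, axis=1)  # distance to every pattern
    return float(np.sort(dists)[k - 1])               # k-th smallest distance
```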

3. METHODOLOGY

In this section, we first introduce the preliminaries of the OOD detection task and the Hopfield Network. Then we elaborate on how to leverage the critical concepts of the Hopfield Network on OOD detection. More precisely, the energy function defined in the Modern Hopfield Network (Ramsauer et al., 2020) is introduced as the basis of our store-then-compare OOD detection paradigm. A simplified energy function is further proposed to reduce the memory demand, which is of high computational efficiency and free of hyperparameters. Finally, we compare the difference in pattern choice, i.e., patterns derived from the penultimate layer outputs versus final logits. We also conduct in-depth theoretical analyses of our method.

3.1. OUT-OF-DISTRIBUTION DETECTION

A neural network f aims to learn a mapping from a training sample x to its corresponding label y as y = f(x; θ) with parameters θ. A test sample x′ is then fed into the trained network f for the prediction y′. When x′ and the training samples x are from the same data distribution, x′ is called an ID sample; otherwise, it is regarded as an OOD sample. Prediction results for OOD samples fail to be meaningful; more severely, blindly classifying OOD samples into an existing class may raise fatal risks in safety-critical scenarios. Thus, the OOD detection task is to design a measure function D(f; x′) that allows OOD samples to be as clearly distinguishable from ID samples as possible. Eventually, OOD detection can be formulated as follows:

$$x' \sim \begin{cases} \text{OOD} & \text{if } D(f; x') = 0, \\ \text{ID} & \text{if } D(f; x') = 1. \end{cases} \quad (1)$$

3.2. HOPFIELD NETWORK

The Hopfield Network (Hopfield, 1984; Krotov & Hopfield, 2016; Ramsauer et al., 2020) can store and retrieve continuous patterns. By minimizing a predefined energy function, it gradually updates the input test pattern ξ ∈ R^{d×1} to a converged pattern that is similar to one of the stored patterns. We denote all stored patterns as a stored pattern set S ∈ R^{d×N}, with each column s_j ∈ R^{d×1} representing one specific stored pattern; N is the total number of stored patterns and d is the dimension of the patterns. The energy function that guides the updating procedure in the Modern Hopfield Network can be written as:

$$E = -\mathrm{LSE}\left(\beta, \xi^{T} S\right) + \frac{1}{2}\xi^{T}\xi + c, \quad (2)$$

$$\mathrm{LSE}(\beta, e) = \beta^{-1} \log \left( \sum_{j=1}^{N} \exp\left(\beta e_{j}\right) \right), \quad (3)$$

where LSE denotes the log-sum-exp function defined in Eq. 3, and β and c are two constants. The vector e denotes ξ^T S, where e_j represents the inner product of the input test pattern ξ and the j-th stored pattern s_j. The second term ξ^T ξ on the right of Eq. 2 serves as a regularization on the magnitude of ξ.
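To make the energy function concrete, here is a minimal NumPy sketch of the Modern Hopfield energy (the function and variable names are ours; a numerically stable log-sum-exp is used):

```python
import numpy as np

def hopfield_energy(xi, S, beta=1.0, c=0.0):
    """Modern Hopfield energy (Eq. 2) of a test pattern xi against the
    stored patterns S (one pattern per column). Lower energy means xi
    lies closer to the stored patterns."""
    e = S.T @ xi                          # inner products xi^T s_j, shape (N,)
    m = np.max(beta * e)                  # shift for numerical stability
    lse = (m + np.log(np.sum(np.exp(beta * e - m)))) / beta  # LSE (Eq. 3)
    return -lse + 0.5 * float(xi @ xi) + c
```

A pattern matching one of the stored columns yields a lower energy than an unrelated pattern of the same norm, which is exactly the behaviour exploited for detection.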
Revisiting the energy function, we can see that it is essentially a measure of the similarity between the stored patterns and a test pattern.

3.3. HE: OOD DETECTION WITH MODERN HOPFIELD ENERGY

As described above, the energy function of the Modern Hopfield Network is an appropriate candidate for measuring the discrepancy between OOD instances and ID instances. We denote all stored patterns for class i as a stored pattern set S_i ∈ R^{d×N_i}, with each column s_ij ∈ R^{d×1} representing one specific stored pattern; N_i is the total number of stored patterns within class i. More precisely, we preserve the penultimate-layer outputs of ID training samples as stored patterns: for each class i, the stored pattern set S_i is derived from the penultimate-layer outputs of the ID training samples of that class that are correctly classified by the network. When testing a new sample, we obtain its penultimate-layer output ξ as well as its prediction result, e.g., class i, from the trained model. We then compute the similarity between ξ and the corresponding stored pattern set S_i. Notice that the energy function Eq. 2 contains the magnitude regularization term ξ^T ξ. Since it is introduced to prevent the input pattern from scaling up during the pattern-updating process of the Modern Hopfield Network, it is not necessary when measuring the discrepancy between the input pattern and the stored patterns; we therefore omit this term, as the magnitude of the input pattern ξ can itself provide information for ID/OOD discrimination. We also omit the constant c in Eq. 2, because a constant in the measure function does not change the OOD detection result. In summary, the Hopfield energy-based OOD detection measure can be written as (i denotes the classification result of ξ by the already trained model):

$$\mathrm{HE}(\xi) = \mathrm{LSE}\left(\beta, \xi^{T} S_{i}\right). \quad (4)$$

Note that Eq. 4 measures similarity instead of discrepancy: a higher score indicates ID data, while a lower score indicates OOD data.
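A sketch of the HE score of Eq. 4, assuming the per-class stored patterns are kept in a dict keyed by class index (names and data layout are our own):

```python
import numpy as np

def he_score(xi, stored_patterns, pred_class, beta=0.01):
    """HE similarity score (Eq. 4): LSE(beta, xi^T S_i) over the stored
    patterns of the predicted class i. Higher means more ID-like.

    stored_patterns maps a class index to a (d, N_i) matrix whose columns
    are the penultimate-layer features of correctly classified samples."""
    e = stored_patterns[pred_class].T @ xi   # xi^T s_ij for every stored j
    m = np.max(beta * e)                     # stable log-sum-exp
    return float((m + np.log(np.sum(np.exp(beta * e - m)))) / beta)
```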
It is worth mentioning that, different from most traditional OOD detection methods that only consider the information from the input test sample itself with the trained network, we leverage the information from all training samples of the predicted class to make the comparison for better OOD detection.

3.4. SHE: OOD DETECTION WITH SIMPLIFIED HOPFIELD ENERGY

Although HE has a theoretical foundation from the Modern Hopfield Network and is proven effective through empirical evaluation, the need to store patterns of all correctly classified training samples may prevent it from generalizing to real-world applications. Particularly when the dataset is extremely large or the latent representation is ultra-high dimensional, it imposes a considerable burden on storage and computation. During the evaluation, we discover that the hyperparameter β in Eq. 4 should be small enough in case any element e_ij (the inner product between the test pattern and the j-th stored pattern of class i) has an extremely large value, which would degrade the robustness of OOD detection under a large β. When β is relatively small, we can transform Eq. 4 with two first-order Taylor approximations, exp(x) ≈ 1 + x and log(1 + x) ≈ x:

$$\mathrm{LSE}(\beta, e_{i}) = \frac{1}{\beta} \log \left( \sum_{j=1}^{N_{i}} \exp(\beta e_{ij}) \right) \approx \frac{1}{\beta} \log \left( \sum_{j=1}^{N_{i}} (1 + \beta e_{ij}) \right) = \frac{1}{\beta} \log \left( N_{i} + \beta \sum_{j=1}^{N_{i}} e_{ij} \right)$$

$$= \frac{1}{\beta} \log \left[ N_{i} \left( 1 + \frac{\beta \sum_{j=1}^{N_{i}} e_{ij}}{N_{i}} \right) \right] = \frac{1}{\beta} \log N_{i} + \frac{1}{\beta} \log \left( 1 + \frac{\beta\, \xi^{T} \sum_{j=1}^{N_{i}} s_{ij}}{N_{i}} \right). \quad (5)$$

All test samples predicted to be class i share the same stored pattern count N_i, so the first term β^{-1} log N_i is the same for all of them, can be regarded as a constant, and is ignored. Applying log(1 + x) ≈ x to the second term, the measure function Eq. 4 simplifies from the LSE function (Eq. 3) to the inner product of ξ and s̄_i, since LSE(β, e_i) is positively related to ξ^T s̄_i:

$$\mathrm{SHE}(\xi) = \xi^{T} \bar{s}_{i}, \quad \text{where } \bar{s}_{i} = \frac{1}{N_{i}} \sum_{j=1}^{N_{i}} s_{ij}. \quad (6)$$

We can interpret SHE from another perspective: s̄_i is the average of the vectors in the stored pattern set S_i. In other words, the stored pattern set S_i is reduced to a single representative vector s̄_i.
Considering the redundancy of patterns that frequently appears in deep learning, and that samples from the same class usually have similar patterns, it is reasonable to prune the stored pattern set into a single average vector, as also shown in Figure 1 (right). This eliminates the memory overhead of storing a large number of ID data patterns and further reduces the computational cost. Besides, the only hyperparameter β also disappears, so nothing needs to be tuned. In summary, SHE is highly efficient regarding both storage and computation and has no hyperparameters, properties that facilitate its deployment in practice.
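The whole SHE pipeline described above — build one mean pattern per class from the correctly classified training features, then score a test feature by an inner product (Eq. 6) — can be sketched as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def build_she_patterns(feats, labels, preds):
    """One stored pattern per class: the mean penultimate-layer feature of
    the correctly classified training samples of that class."""
    return {int(c): feats[(labels == c) & (preds == c)].mean(axis=0)
            for c in np.unique(labels)}

def she_score(xi, patterns, pred_class):
    """SHE score (Eq. 6): inner product of the test pattern with the mean
    stored pattern of its predicted class. Higher means more ID-like."""
    return float(xi @ patterns[pred_class])
```

Note that no hyperparameter appears anywhere in this pipeline, and the memory cost is one d-dimensional vector per class.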

3.5. PENULTIMATE LAYER VERSUS LOGITS LAYER FOR PATTERN STORAGE

In this section, we analyze the benefits of choosing outputs from the penultimate layer, rather than the logits layer, as the pattern.

Intuitive Explanation. Given that most OOD detection methods use the output logits, one interesting question is which layer output to choose as the patterns: the penultimate-layer output or the output logits. Note that, when we calculate the energy function by Eq. 6, the stored pattern s̄_i comes from the same category i that the test pattern ξ is classified into. Therefore, when we use the output logits as patterns, no matter whether ξ comes from an ID or an OOD sample, the position of its maximum logit is always the same as that of s̄_i by design, namely the category index of ξ. We call this alignment of the maximum logit positions "Peak Alignment"; it leads more easily to a high energy score for OOD patterns and in turn raises the difficulty of discriminating ID from OOD samples. In contrast, when we use the penultimate-layer output as patterns, there is no such "Peak Alignment" effect between ξ and s̄_i, since the values of the penultimate-layer output are not as concentrated; this makes the similarity score calculated from the energy function more separable for ID and OOD patterns, as shown in Figure 2. Moreover, we provide a theoretical analysis in Appendix B.

Figure 2: Distribution of the Hopfield energy score calculated from 2,000 ID and 2,000 OOD samples; the pattern is derived from the penultimate layer (left) and the output logits (right), respectively. When using the penultimate layer, the scores of ID and OOD samples are more distinguishable. The ID and OOD datasets are CIFAR10 (Krizhevsky et al., 2009) and SVHN (Netzer et al., 2011), respectively, and the backbone network is ResNet18.

4. EXPERIMENTS

In this section, we conduct experiments on nine OOD detection datasets with three backbone networks and two ID datasets to evaluate the performance of our methods.

4.1. DATASET

There are two types of datasets in the experiments: the in-distribution (ID) datasets and the out-of-distribution (OOD) datasets. The former are utilized during the training procedure, while the latter serve only to test models and do not contain any ID samples.

ID Datasets. CIFAR10, CIFAR100, and ImageNet-1k are the three ID datasets in our experiments. CIFAR10 (Krizhevsky et al., 2009) is composed of 60,000 images in 10 categories, each containing 5,000 training images and 1,000 test images. CIFAR100 (Krizhevsky et al., 2009) consists of 60,000 images in 100 categories, with 500 training images and 100 test images per class. ImageNet-1k (Deng et al., 2009) is composed of 1,350,000 images in 1,000 object categories, each containing 1,300 training images and 50 test images.

OOD Datasets. Nine OOD datasets are used for evaluation: SVHN (Netzer et al., 2011), LSUN-C (crop) and LSUN-R (resize) (Yu et al., 2015), iSUN (Xu et al., 2015), Places (Zhou et al., 2017), DTD (Cimpoi et al., 2014), SUN (Xiao et al., 2010), iNaturalist (Van Horn et al., 2018), and Tiny-ImageNet (resize) (Deng et al., 2009). Details can be found in the original papers.

4.2. EXPERIMENT SETTINGS

Backbone Networks. We choose ResNet18 (He et al., 2016), ResNet34 (He et al., 2016), and WRN40-2 (Zagoruyko & Komodakis, 2016) as backbone networks, trained on the ID datasets CIFAR10 and CIFAR100, respectively. For ImageNet, we choose ResNet50 (He et al., 2016) as the backbone network.

Baseline Methods. To evaluate the performance of our proposed design, we also conduct experiments on the Softmax-based approach MSP (Hendrycks & Gimpel, 2016) and other strong methods: Energy (Liu et al., 2020), ODIN (Liang et al., 2017), Mahalanobis (Lee et al., 2018), and ReAct (Sun et al., 2021). ReAct here is combined with Energy as described in (Sun et al., 2021) and was the previous state-of-the-art method.

Evaluation Metrics. We adopt standard OOD detection metrics; in particular, we report FPR95, the false positive rate of OOD samples at the threshold where the true positive rate of ID samples is 95%.

4.3.1. MAIN RESULTS

In this section, we demonstrate the effectiveness of our method SHE through extensive experiments. The experimental results on nine OOD datasets with CIFAR10 as ID training data are organized in Table 1. As shown in Table 1, MSP has the worst results, while Energy brings substantial improvement, illustrating the potential of energy-based solutions for OOD detection. Our approach SHE obtains almost all (26/27) of the best results on the nine OOD datasets for the three different backbones. More precisely, our approach reduces the average FPR95 by 16.81% for ResNet18, 6.29% for ResNet34, and 12.18% for WRN40-2 compared with the best baseline. Our approach also performs well, albeit with some limitations, when CIFAR100 or ImageNet-1k is chosen as the ID training data; the detailed Table 5 (CIFAR100) and Table 7 (ImageNet-1k) are provided in Appendix A. We also provide Figure 3 to demonstrate model performance directly. Under the peak of the ID score distribution (colored red), less blue area indicates a better capability to distinguish OOD data from ID data. SHE obtains the minimum blue area under the red peak among the three methods.
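The FPR95 numbers reported above can be computed as follows (a minimal sketch; `id_scores` and `ood_scores` are hypothetical arrays of detection scores where higher means more ID-like):

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: the fraction of OOD samples still accepted at the score
    threshold that accepts 95% of ID samples (lower is better)."""
    thresh = np.percentile(id_scores, 5)   # 95% of ID scores lie above this
    return float(np.mean(ood_scores >= thresh))
```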

4.3.2. SHE VERSUS HE

Since SHE is derived from HE, which is inspired by the Modern Hopfield Network, we make a comparison between them. As discussed in Section 3, HE detects OOD samples via the energy function in Eq. 4. We evaluate the feasibility of HE and show the results in Table 2, which demonstrate that the energy function (Eq. 4) is suitable for OOD detection. We can also discover from Table 2 that SHE and HE are competitive, indicating that the patterns of samples from the same class are similar and contain redundancy for OOD detection. For SHE, the memory cost is reduced from the number of samples to the number of classes, and the energy-based measure is simplified into the familiar inner product without any hyperparameter, which is elegant and efficient. We also provide detailed comparison results on the nine OOD datasets in Table 9, showing that in most cases SHE achieves performance similar to HE.

4.3.3. PERTURBATIONS ON SHE

Adding perturbations to input samples for OOD detection was proposed by (Liang et al., 2017) and can be formulated as:

$$\tilde{x}' = x' + \varepsilon \, \mathrm{sign}\left(\nabla_{x'} \log S(x')\right),$$

where x′ denotes the test sample to be detected and x̃′ is the perturbed one; S(x′) is the maximum Softmax probability of the network outputs, and ε is the perturbation magnitude. Such perturbation-based methods have been shown to benefit ID data more than OOD data and are adopted in extensive studies (Lee et al., 2018; Hsu et al., 2020; DeVries & Taylor, 2018). To verify whether perturbation applies to SHE, we carry out the experiments and organize the results in Table 3. It shows that introducing perturbations is beneficial, as the performance of SHE can be further improved. Note, however, that the perturbation requires an additional forward-backward pass to retrieve the gradient, increasing computational complexity and time. As a comparison, we record the time overhead before and after adding perturbation to SHE: with ResNet34 as the backbone, CIFAR100 as ID data, and Tiny-ImageNet as OOD data, the time rises from 35.61s to 105.88s, roughly a threefold increase. Therefore, the balance between accuracy and computational overhead needs to be considered when adopting perturbation. Nevertheless, SHE itself remains efficient and effective, and can be combined with perturbations to achieve even higher accuracy.

4.3.4. PENULTIMATE LAYER VERSUS LOGITS LAYER

Note that all stored patterns and test patterns in our approach are obtained from the penultimate-layer output of the neural network, whereas most methods adopt the logits-layer (i.e., the last layer) output for the confidence computation (Hendrycks & Gimpel, 2016; Liang et al., 2017; Liu et al., 2020; Sun et al., 2021).
To verify the significance of layer selection, we also experimentally evaluate SHE choosing patterns from the penultimate-layer outputs or the final logits, respectively. The average results on the nine datasets are presented in Table 4. Our approach achieves much better performance when using the penultimate-layer output, instead of the final logits, as patterns for SHE.
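As an illustration of the input perturbation from Section 4.3.3, here is a minimal sketch for a toy linear softmax classifier, where the gradient of the log maximum Softmax probability has a closed form (in practice the gradient is obtained by backpropagating through the trained network; `W` and the step size are illustrative):

```python
import numpy as np

def perturb_input(x, W, eps=0.0014):
    """ODIN-style perturbation: move x a small step in the direction that
    increases the maximum Softmax probability of a linear classifier Wx."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = int(np.argmax(logits))
    # d/dx log softmax_k(Wx) = W_k - sum_j p_j W_j (closed form, linear model)
    grad = W[k] - p @ W
    return x + eps * np.sign(grad)
```

The intuition from (Liang et al., 2017) is that such a step raises the confidence of ID inputs more than that of OOD inputs, widening the gap between the two score distributions.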

5. CONCLUSION

In this paper, we propose a novel approach named HE for OOD detection based on a new "store-then-compare" paradigm. The key idea is to store patterns of ID data and then leverage the energy function defined in the Modern Hopfield Network (Ramsauer et al., 2020) to measure the similarity between new test patterns and the stored ID patterns. To reduce storage and computational overhead, we simplify the energy function through theoretical analysis with appropriate approximations and obtain the simplified approach named SHE. In addition to its efficiency and effectiveness, SHE has no hyperparameters to tune, which is more convenient than most OOD detection methods with cumbersome hyperparameter tuning. Also, unlike most OOD detection methods that focus on the final output logits, we find that the penultimate-layer output is more suitable as patterns in our approach for OOD detection. The conducted evaluations demonstrate the superiority of our simple yet effective approach on nine widely-used OOD datasets.

A. EXPERIMENTAL RESULTS

The ReAct in this table refers to 'Energy + ReAct' as described in Sun et al. (2021). As an auxiliary method that can be combined with other SOTA methods, ReAct can also be combined with SHE (which is better than Energy), which then outperforms SHE itself; the results are shown in Table 6.

B. THEORETICAL ANALYSIS

Let y := [y_1, y_2, ..., y_m] and z := [z_1, z_2, ..., z_n] be the output logits and the penultimate-layer output, respectively. Here, m is the dimension of the logits output (i.e., the number of classes) and n is the dimension of the penultimate-layer output. We superscript a vector with id or ood (e.g., y^id denotes the logits output derived from an ID sample) to indicate the input type. For every z_j^ood in z^ood, we assume that they are independent random variables following the same Gaussian distribution, i.e., z_j^ood ∼ N(0, σ²).
Let [v_1, v_2, ..., v_m] be the weight matrix of the last (linear) layer, with each v_j denoting the categorical vector for class j; thus, y = [v_1, v_2, ..., v_m]^T z. For a test sample ξ from OOD data, we assume that it is classified as category k, and we have the following, where M represents the distribution of the maximum of m Gaussian random variables:

$$y_{k}^{ood} = \max\left(v_{1}^{T} z^{ood}, v_{2}^{T} z^{ood}, ..., v_{m}^{T} z^{ood}\right),$$

$$y_{k}^{ood} \sim M^{ood} = \max\left(\mathcal{N}\left(0, \sigma^{2}\|v_{1}\|_{2}^{2}\right), \mathcal{N}\left(0, \sigma^{2}\|v_{2}\|_{2}^{2}\right), ..., \mathcal{N}\left(0, \sigma^{2}\|v_{m}\|_{2}^{2}\right)\right),$$

$$y_{q \neq k}^{ood} \sim \mathcal{N}\left(0, \sigma^{2}\|v_{q}\|_{2}^{2}\right). \quad (9)$$

For y_k^id, we assume that it follows another distribution I whose expected value is a positive number larger than the expected value of y_{q≠k}^ood. When calculated by Eq. 6, the expectations of the scores computed from the output logits and from the penultimate-layer outputs are given in Eq. 10 and Eq. 11, respectively (* denotes the inner product). Therefore, compared with using the penultimate layer as the pattern, the output logits assign OOD samples higher scores, making it more challenging to distinguish ID samples from OOD samples.

C. ABLATION EXPERIMENT

The ablation experiment consists of two parts. First, we evaluate different metrics, instead of the inner product used in our approach, to measure the similarity between the stored pattern and the test pattern; specifically, we use Euclidean distance and cosine similarity, with results shown in Table 12. Second, we use the output from other layers (e.g., layer3 in ResNet) instead of the penultimate layer (layer4 in ResNet) as the representation; the results are shown in Table 13.

Table 12: OOD detection performance with different metrics; the ID dataset is CIFAR10. The inner product used in our approach performs better than the other two metrics.



Figure 1: Visualization of the distribution of ID/OOD patterns (left) and ID/stored patterns (right) by t-SNE (Van der Maaten & Hinton, 2008). ID/OOD refers to CIFAR10 (Krizhevsky et al., 2009) and LSUN (Yu et al., 2015), respectively; the backbone network is ResNet18 (He et al., 2016). In the left figure, the OOD patterns lie away from the ID patterns. In the right figure, each pentagram denotes the stored pattern in SHE corresponding to one category; the stored patterns represent the corresponding ID patterns well.

Figure 3: Confidence distribution of ID data and OOD data calculated from ResNet18. ID/OOD refers to CIFAR10 and SUN (Xiao et al., 2010); performance is compared with MSP (Hendrycks & Gimpel, 2016) and Energy (Liu et al., 2020).

Figure 4: The distribution of the number of outputs for each class over the nine OOD datasets. The backbone network is ResNet34 and the ID dataset is CIFAR100.

Figure 6: Heatmap of the stored/ID/OOD patterns derived by ResNet18; the ID and OOD datasets are CIFAR10 and SUN (Xiao et al., 2010), respectively. For visualization, each pattern is reshaped from dimension 512 to 32×16. The figure shows that the expectation of the ID/stored patterns is larger than that of the OOD pattern, which supports the theoretical analysis in Appendix B.

$$\mathbb{E}\left(y^{id} * y^{ood}\right) = \mathbb{E}\left( z^{id\,T} v_{k} v_{k}^{T} z^{ood} + \sum_{j=1, j \neq k}^{m} z^{id\,T} v_{j} v_{j}^{T} z^{ood} \right) = \mathbb{E}\left(I^{id}\right) \mathbb{E}\left(M^{ood}\right) > 0, \quad (10)$$

$$\mathbb{E}\left(z^{id} * z^{ood}\right) = \mathbb{E}\left(z^{id}\right)^{T} \mathbb{E}\left(z^{ood}\right) = 0 < \mathbb{E}\left(y^{id} * y^{ood}\right). \quad (11)$$

Table 1: OOD detection performance of SHE using CIFAR10 as the ID dataset.

Table 2: OOD detection performance comparison of HE and SHE. All values are averaged over the nine OOD datasets described in Section 4.1. The detailed Table 9 is displayed in Appendix A.

Table 3: OOD detection performance comparison of SHE and SHEP (i.e., SHE + Perturbation). All values are averaged over the nine OOD datasets described in Section 4.1. The detailed Table 10 is displayed in the Appendix.

Table 4: OOD detection performance comparison when deriving patterns from the penultimate layer versus the logits layer. All values are averaged over the nine OOD datasets described in Section 4.1. The detailed Table 11 is displayed in the Appendix.

Table 5: OOD detection performance of SHE using CIFAR100 as the ID dataset.

Table 6: OOD detection performance comparison of "ReAct + SHE" and "ReAct + Energy".

Table 7: OOD detection performance of SHE using ImageNet-1k as the ID dataset and ResNet50 as the backbone network.

Table 8: OOD detection performance of Energy+ReAct and SHE+ReAct using ImageNet as the ID dataset. From the results, ReAct+Energy is better, which is a limitation of our method that we will explore in the future.

Table 9: OOD detection performance comparison of HE and SHE. All values are percentages. Bold numbers with gray cells are superior results. The hyperparameter β used for HE is 0.01 for ResNet18 and ResNet34, and 0.2 for WRN40-2.

Table 10: OOD detection performance comparison of SHE and SHEP (SHE + Perturbation). All values are percentages. Bold numbers with gray cells are superior results.

Table 11: OOD detection performance comparison when deriving patterns from the penultimate layer versus the logits layer. All values are percentages. Bold numbers with gray cells are superior results.

Table 13: OOD detection performance comparison of a shallower layer (the layer before the penultimate layer) versus the penultimate layer as the representation. All values are averaged over the nine OOD datasets.

