ENERGY-BASED OUT-OF-DISTRIBUTION DETECTION FOR MULTI-LABEL CLASSIFICATION

Abstract

Out-of-distribution (OOD) detection is essential to prevent anomalous inputs from causing a model to fail during deployment. Improved methods for OOD detection in multi-class classification have emerged, while OOD detection methods for multi-label classification remain underexplored and rely on rudimentary techniques. We propose SumEnergy, a simple and effective method that estimates OOD indicator scores by aggregating energy scores from multiple labels. We show that SumEnergy can be mathematically interpreted from a joint likelihood perspective. Our results show consistent improvement over previous methods based on maximum-valued scores, which fail to capture joint information from multiple labels. We demonstrate the effectiveness of our method on three common multi-label classification benchmarks: MS-COCO, PASCAL-VOC, and NUS-WIDE. We show that SumEnergy can reduce FPR95 by up to 10.05% compared to the previous best baseline, establishing state-of-the-art performance.

1. INTRODUCTION

Out-of-distribution (OOD) detection is central to reliably deploying machine learning models in open-world environments, where new forms of test-time data may appear that were nonexistent at training time. The problem of OOD detection has gained significant research attention lately, given its importance for safety-critical applications such as unseen disease identification (Cao et al., 2020). However, recent studies have primarily focused on detecting OOD examples in multi-class classification, where each sample is assigned to one and only one label (Bevandić et al., 2018; Hein et al., 2019; Hendrycks & Gimpel, 2016; Lakshminarayanan et al., 2017; Lee et al., 2018; Liang et al., 2018; Mohseni et al., 2020; Chen et al., 2020; Hsu et al., 2020; Liu et al., 2020). This can be restrictive in many real-world settings where images often contain multiple objects of interest. For example, self-driving cars must differentiate between the road, traffic signs, and obstacles within a frame. In the medical domain, multiple abnormalities may be present in a medical image (Wang et al., 2017). Multi-label classification is desirable since there is no constraint on the number of classes an instance can be assigned to. Currently, OOD detection in multi-label classification remains relatively underexplored. Of the multi-label methods evaluated in (Hendrycks et al., 2019), MaxLogit achieved the best performance. However, simply using the maximum-valued logit is limiting because it does not incorporate information available from other possible labels. As seen in Figure 1, MaxLogit can only capture the difference between the dominant outputs for dog (in-dist.) and car (OOD), while positive information from another dominant label, cat (in-dist.), is dismissed.
Other baseline methods such as ODIN (Liang et al., 2018) and Mahalanobis (Lee et al., 2018) also derive scores based on the maximum statistic (e.g., the calibrated softmax score or the Mahalanobis distance), and fail to capture the joint information. While energy scores have recently demonstrated superior OOD detection performance in the multi-class setting (Liu et al., 2020), the method does not trivially generalize to the multi-label classification setting, where labels are not mutually exclusive. A key challenge therefore lies in how to leverage information across different labels. In this paper, we propose an energy-based method for OOD detection in the multi-label setting, which estimates OOD indicator scores jointly from multiple labels. We propose a simple and effective aggregation mechanism, SumEnergy, that combines the label-wise energy scores derived from individual, independently predicted classes. We show that the SumEnergy scores are mathematically meaningful and can be interpreted from a joint likelihood perspective. Intuitively, an input with multiple dominant labels is more likely to be in-distribution, which is the key property that SumEnergy capitalizes on. As shown in Figure 1, summing label-wise energies over all labels amplifies the difference in scores between in-distribution and OOD inputs. Our method is parameter-free and can be conveniently used with any pre-trained multi-label classification model. Below we describe our contributions in detail.

Contributions First, we propose a theoretically motivated scoring function, SumEnergy, that is based on aggregation over label-wise energy scores. Extensive experiments show that SumEnergy outperforms existing methods on three common multi-label classification benchmarks, establishing state-of-the-art performance.
For example, on a DenseNet trained on MS-COCO (Lin et al., 2014), our method reduces the false positive rate (at 95% TPR) by 10.05% when evaluated against ImageNet OOD data, compared to the best performing baseline. Consistent performance improvement is also observed on a different OOD test dataset, Texture (Cimpoi et al., 2014a), as well as on an alternative network architecture. Additionally, we perform a comparative analysis of how an alternative aggregation method affects OOD detection performance. In particular, we consider MaxEnergy, which takes the maximum energy score among the individual labels as the OOD indicator score. On a DenseNet trained on the PASCAL-VOC dataset, SumEnergy yields 4.05% lower FPR95 than MaxEnergy, which underlines the importance of taking into account scores derived from multiple labels. Lastly, as an ablation, we demonstrate that energy scores are more compatible with the proposed summation aggregation than previous scoring functions such as the logit (Hendrycks et al., 2019), MSP (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2018), and the Mahalanobis distance (Lee et al., 2018). To see this, we explore the effectiveness of applying aggregation methods to previously popular scoring functions, which helps understand their applicability in the multi-label setting. We find that summing labels' scores using previous methods is inferior to summing labels' energies, emphasizing the need for SumEnergy. For example, simply summing over the logits across labels results in up to 51.93% degradation in FPR95 on MS-COCO, since the outputs are a mix of positive and negative numbers. In contrast, SumEnergy does not suffer from this issue because the signs of label-wise energy scores are uniform. More importantly, the label-wise energy is provably aligned with the probability density of the corresponding label's training data.
Our study therefore underlines the importance of properly choosing both the label-wise scoring function and the aggregation method. We show strong compatibility between the label-wise energy function and aggregation function, supported by both mathematical interpretation and empirical results.

2.1. BACKGROUND: ENERGY FUNCTION IN MULTI-CLASS CLASSIFICATION

We consider a multi-class neural classifier f(x): R^D → R^K, which maps an input x ∈ R^D to K real-valued numbers known as logits. A softmax function is used to derive a categorical distribution:

p(y_i | x) = e^{f_{y_i}(x)} / Σ_{j=1}^{K} e^{f_{y_j}(x)},  (1)

which indicates the probability for an input x to be of class y_i, with i ∈ {1, 2, ..., K}. A multi-class classifier can be interpreted from an energy-based perspective (Grathwohl et al., 2019) by viewing the logit f_{y_i}(x) of class y_i as an energy function E(x, y_i) = -f_{y_i}(x). Therefore, Equation 1 can be rewritten as:

p(y_i | x) = e^{-E(x, y_i)} / Σ_{j=1}^{K} e^{-E(x, y_j)}  (2)
           = e^{-E(x, y_i)} / e^{-E(x)}.  (3)

By equating the two denominators above, we can express the free energy function E(x) for any given input x ∈ R^D as:

E(x) = -log Σ_{i=1}^{K} e^{f_{y_i}(x)}.  (4)
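For a given logit vector, the free energy in Equation 4 is a negated logsumexp and can be computed stably with the standard max-shift trick. A minimal sketch in plain Python (the function name is ours):

```python
import math

def free_energy(logits):
    """Free energy E(x) = -log sum_i exp(f_{y_i}(x)) (Eq. 4).

    The maximum logit is factored out so the exponentials cannot overflow.
    """
    m = max(logits)
    return -(m + math.log(sum(math.exp(f - m) for f in logits)))
```

Lower free energy corresponds to higher total unnormalized density; a single logit of 0 gives E(x) = 0, and raising any logit lowers E(x).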

2.2. ENERGY FUNCTION FOR MULTI-LABEL CLASSIFICATION

In this work, we propose using an energy-based method for OOD detection in multi-label classification networks, where an input can have several labels (see Figure 1). In what follows, we first introduce a label-wise energy function, and then propose aggregation methods that can leverage the joint information across labels in a theoretically meaningful way.

Label-wise Free Energy In multi-label classification, the prediction for each binary label y_i, with i ∈ {1, 2, ..., K}, is independently made by a binary logistic classifier:

p(y_i = 1 | x) = e^{f_{y_i}(x)} / (1 + e^{f_{y_i}(x)}).  (5)

For brevity, we use y_i in the probabilistic derivations to indicate the label being positive, i.e., y_i = 1. The logistic classifier output can be viewed as a softmax over two logits, 0 and f_{y_i}(x). For each class y_i, we can define the label-wise free energy as:

E_{y_i}(x) = -log(1 + e^{f_{y_i}(x)}).  (6)

Aggregation Methods We propose and contrast two aggregation mechanisms that seek to combine the label-wise energy scores derived above:

Max: E_max(x) = max_i -E_{y_i}(x),
Sum: E_sum(x) = Σ_{i=1}^{K} -E_{y_i}(x).

In particular, MaxEnergy takes the largest label-wise energy score among all labels, whereas SumEnergy takes the summation of energy scores across all labels. Note that in the above equations, the label-wise energy E_{y_i}(x) is by definition a negative value, and the aggregation methods output a positive value by negation.
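The label-wise energy and the two aggregations are a few lines of code on top of the logits; note that -E_{y_i}(x) is exactly the softplus of the logit. A small sketch with our own helper names:

```python
import math

def label_energy(logit):
    """Label-wise free energy E_{y_i}(x) = -log(1 + e^{f_{y_i}(x)}) (Eq. 6).

    Computed as a numerically stable negated softplus; always negative."""
    return -(max(logit, 0.0) + math.log1p(math.exp(-abs(logit))))

def max_energy(logits):
    """MaxEnergy: the largest negated label-wise energy among all labels."""
    return max(-label_energy(f) for f in logits)

def sum_energy(logits):
    """SumEnergy: negated label-wise energies summed over all K labels."""
    return sum(-label_energy(f) for f in logits)
```

Both aggregations are positive, and SumEnergy is always at least as large as MaxEnergy since every term in the sum is nonnegative.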

Mathematical Interpretation

We provide mathematical interpretations of the different aggregation methods. First, by resorting to the energy-based model (LeCun et al., 2006), we show that the label-wise energy score is provably aligned with the conditional likelihood function. The conditional likelihood p(x | y_i) is given by:

p(x | y_i) = e^{-E_{y_i}(x)} / Z_{y_i},  (8)

where Z_{y_i} = ∫_{x|y_i} e^{-E_{y_i}(x)} dx is the normalizing density. Since Z_{y_i} is the same for all x with label y_i, the denominator in Equation 8 does not affect the overall distribution of label-wise energy scores. By taking the logarithm on both sides of Equation 8, we have:

E_{y_i}(x) ∝ -log p(x | y_i).  (9)

Given this, MaxEnergy can be interpreted as taking the maximum of the conditional log-likelihood among all labels:

E_max(x) ∝ max_i log p(x | y_i),

which does not take into account the joint information across labels. In contrast, SumEnergy is the first method to consider the joint estimation of OOD scores across labels, and can be interpreted from the joint likelihood perspective:

E_sum(x) = Σ_{i=1}^{K} log (p(x | y_i) · Z_{y_i})  (10)
         = Σ_{i=1}^{K} log p(x | y_i) + Σ_{i=1}^{K} log Z_{y_i},  (11)

where the second term, denoted Z, is constant for all x. By applying the Bayesian rule to each term log p(x | y_i) in Equation 11, we have:

E_sum(x) = log Π_{i=1}^{K} [p(y_i | x) · p(x) / p(y_i)] + Z  (12)
         = log Π_{i=1}^{K} p(y_i | x) + K · log p(x) + (Z - log Π_{i=1}^{K} p(y_i)),  (13)

where the last term, denoted C, is constant for all x. Given that all labels y_i are conditionally independent, we have Π_{i=1}^{K} p(y_i | x) = p(y_1, y_2, ..., y_K | x). Therefore, Equation 13 is equivalent to:

E_sum(x) = log p(y_1, y_2, ..., y_K | x) + K · log p(x) + C  (14)
         = log [p(x | y_1, y_2, ..., y_K) · Π_{i=1}^{K} p(y_i) / p(x)] + K · log p(x) + C  (15)
         = log p(x | y_1, y_2, ..., y_K) + (K - 1) · log p(x) + Z,  (16)

where the first term is the joint conditional log-likelihood, the second term is the log data density (higher for in-distribution data), and Z is constant for all x.

The equation above suggests that E_sum(x) is in fact linearly aligned with the joint conditional log-likelihood and the log data density. The second term is desirable for OOD detection since it is aligned with the underlying data density, which is higher for in-distribution data x and vice versa. The first term takes into account the joint estimation across labels, which is new to our multi-label setting and was not previously considered in the multi-class setting (Liu et al., 2020). It allows even further discriminativity between in-distribution vs. OOD data, since OOD data is expected to have a lower joint conditional likelihood (i.e., not associated with any of the labels). In contrast, having multiple dominant labels is indicative of an in-distribution input, which is exactly the characteristic that SumEnergy captures. More importantly, our method does not require estimating the constant Z explicitly, as it is sample-independent and does not affect the overall SumEnergy score distribution. We note that our derivation is based on the assumption that all labels are independent, which is in accordance with the standard multi-label classification loss that treats each label as a binary prediction problem (Tsoumakas & Katakis, 2007). We consider this setting for its simplicity and generality. Our work also opens up an interesting future direction of OOD detection that considers the structural dependency among labels (Chen et al., 2017; 2019b; c).

2.3. AGGREGATED ENERGY FOR MULTI-LABEL OOD DETECTION

We propose using the aggregated energy functions E(x) defined in Section 2.2 for OOD detection:

G(x; τ) = out, if E(x) ≤ τ;  in, if E(x) > τ,  (17)

where τ is the energy threshold, which can be chosen so that a high fraction of in-distribution data is correctly classified by G(x; τ). E(x) can take the form of either Max or Sum. A data point with higher aggregated energy E(x) is considered in-distribution, and vice versa (see Figure 1). We explore and provide the tradeoff of different aggregation methods in Section 3.2.
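A common way to set τ in Equation 17 is from a held-out in-distribution split: pick the value that keeps 95% of the in-distribution scores strictly above the threshold. A small sketch (function names are ours):

```python
def choose_threshold(in_scores, tpr=0.95):
    """Pick tau so that roughly `tpr` of in-distribution scores fall above it."""
    s = sorted(in_scores)
    cutoff = max(int((1.0 - tpr) * len(s)) - 1, 0)
    return s[cutoff]

def detect(score, tau):
    """Detection rule G(x; tau) of Eq. 17: higher aggregated energy => in-dist."""
    return "in" if score > tau else "out"
```

With 100 distinct in-distribution scores and tpr=0.95, 95 of them end up classified as "in".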

3. EXPERIMENTS

In this section, we describe our experimental setup (Section 3.1) and demonstrate the effectiveness of our method on several OOD evaluation tasks (Section 3.2). We also conduct extensive ablation studies and comparative analyses that lead to an improved understanding of the different methods.

3.1. SETUP

In-distribution Datasets We consider three multi-label datasets: MS-COCO (Lin et al., 2014), PASCAL-VOC (Everingham et al., 2015), and NUS-WIDE (Chua et al., 2009).

Training Details We train three multi-label classifiers, one for each dataset above. The classifiers have a DenseNet-121 backbone architecture, with the final layer replaced by 2 fully connected layers. Each classifier is pre-trained on ImageNet-1K and then fine-tuned with the logistic sigmoid function on its corresponding multi-label dataset. We use the Adam optimizer (Kingma & Ba, 2014) with standard parameters (β1 = 0.9, β2 = 0.999). The initial learning rate is 10^-4 for the fully connected layers and 10^-5 for the convolutional layers. We also augment the data with random crops and random flips to obtain color images of size 256 × 256. After training, the mAP is 87.51% for PASCAL-VOC, 73.83% for MS-COCO, and 60.22% for NUS-WIDE.
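The backbone and optimizer setup described above can be sketched in PyTorch as follows. This is a hypothetical reconstruction: the hidden width of the 2-layer head (1024) is our assumption, and the label counts are the standard ones for each dataset rather than values stated in the paper.

```python
import torch.nn as nn
import torchvision.models as models
from torch.optim import Adam

NUM_LABELS = 80  # e.g., MS-COCO; 20 for PASCAL-VOC, 81 for NUS-WIDE

# DenseNet-121 pre-trained on ImageNet-1K.
backbone = models.densenet121(pretrained=True)
in_feats = backbone.classifier.in_features

# Replace the final layer with 2 fully connected layers, as described above.
# The hidden width (1024) is our assumption, not specified in the paper.
backbone.classifier = nn.Sequential(
    nn.Linear(in_feats, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, NUM_LABELS),
)

# One logistic sigmoid per label; training uses binary cross-entropy on logits.
criterion = nn.BCEWithLogitsLoss()

# Different learning rates for the head (1e-4) vs. the convolutional trunk (1e-5).
optimizer = Adam(
    [
        {"params": backbone.features.parameters(), "lr": 1e-5},
        {"params": backbone.classifier.parameters(), "lr": 1e-4},
    ],
    betas=(0.9, 0.999),
)
```

The per-parameter-group learning rates mirror the setup in the text; random-crop and random-flip augmentation would be added in the data pipeline.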

Out-of-distribution Datasets

To evaluate the models trained on the in-distribution datasets above, we follow the same setup as in (Hendrycks et al., 2019) and use ImageNet (Deng et al., 2009) for its generality. In addition, we evaluate against the Textures dataset (Cimpoi et al., 2014b) as OOD. For ImageNet, we use the same set of 20 classes chosen from ImageNet-22K as in (Hendrycks et al., 2019). These classes are chosen not to overlap with ImageNet-1K, since the multi-label classifiers are pre-trained on ImageNet-1K. Specifically, we use the following classes for evaluating the MS-COCO and PASCAL-VOC pre-trained models: dolphin, deer, bat, rhino, raccoon, octopus, giant clam, leech, venus flytrap, cherry tree, Japanese cherry blossoms, redwood, sunflower, croissant, stick cinnamon, cotton, rice, sugar cane, bamboo, and turmeric. Since NUS-WIDE contains high-level concepts like animal, plants, and flowers, we use a different set of classes that are distinct from NUS-WIDE: asterism, battery, cave, cylinder, delta, fabric, filament, fire bell, hornet nest, kazoo, lichen, naval equipment, newspaper, paperclip, pythium, satellite, thumb, x-ray tube, yeast, and zither.

Evaluation Metrics We measure the following metrics that are commonly used for OOD detection: (1) the false positive rate (FPR95) of OOD examples when the true positive rate of in-distribution examples is at 95%; (2) the area under the receiver operating characteristic curve (AUROC); and (3) the area under the precision-recall curve (AUPR).
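FPR95 can be computed directly from the two score populations: fix the threshold at the in-distribution 95%-TPR point, then count the OOD scores that still land above it. A minimal sketch, assuming higher scores mean more in-distribution (as with SumEnergy):

```python
def fpr_at_tpr(in_scores, out_scores, tpr=0.95):
    """False positive rate on OOD scores at the threshold achieving `tpr`
    true positive rate on in-distribution scores (higher = more in-dist)."""
    s = sorted(in_scores)
    tau = s[max(int((1.0 - tpr) * len(s)) - 1, 0)]  # ~tpr of in-scores lie above tau
    return sum(1 for v in out_scores if v > tau) / len(out_scores)
```

AUROC and AUPR would typically come from a library routine (e.g., `sklearn.metrics.roc_auc_score` over the concatenated scores), so we do not re-implement them here.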

3.2. RESULTS

How do energy-based approaches compare to common OOD detection methods? In Table 1, we compare energy-based approaches against competitive OOD detection methods in the literature, where SumEnergy demonstrates state-of-the-art performance. For fair comparison, we consider approaches that rely on pre-trained models (without retraining or fine-tuning). Following the setup in (Hendrycks et al., 2019), all the numbers reported are evaluated on ImageNet OOD test data, as described in Section 3.1. We provide additional evaluation results for the Texture OOD test dataset in Appendix A. Most baselines, such as MaxLogit (Hendrycks et al., 2019), Maximum Softmax Probability (MSP) (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2018), and Mahalanobis (Lee et al., 2018), derive OOD indicator scores based on the maximum-valued statistic among all labels. Local Outlier Factor (LOF) (Breunig et al., 2000) uses K-nearest neighbors (KNN) to estimate the local density, where OOD examples are detected by having lower density compared to their neighbors. Isolation Forest (Liu et al., 2008) is a tree-based approach, which detects anomalies based on the path length from the root node to the terminating node. Among the different approaches, SumEnergy outperforms the best-performing baseline across all three multi-label classifiers considered. In particular, on a network trained with the MS-COCO dataset, SumEnergy reduces FPR95 by 10.05% compared to MaxLogit. We provide the AUROC curves for our method SumEnergy in Figure 2, for all three in-distribution datasets considered. The y-axis is the true positive rate (TPR), whereas the x-axis is the FPR. The curves indicate how the OOD detection performance changes as we vary the threshold τ in Equation 17. We additionally evaluate on a different architecture, ResNet (He et al., 2016), for which we observe consistent improvement and provide details in the Appendix.
We also note here that existing approaches using a pretrained model (such as ODIN (Liang et al., 2018) and Mahalanobis (Lee et al., 2018) ) have hyperparameters that need to be tuned. In contrast, using an energy-based method on pre-trained models is parameter-free and easy to use and deploy. In particular, the Mahalanobis approach is based on the assumption that feature representation forms class-conditional Gaussian distributions, and hence may not be well suited for the multi-label setting (which requires joint distribution to be learned).

How do different aggregation methods affect OOD detection performance?

In Table 1, we also perform a comparative analysis of the effect of different aggregation functions that combine label-wise energy scores. Among those, we observe that MaxEnergy does not outperform SumEnergy, which utilizes information jointly from across the labels. The performance of MaxEnergy is on par with MaxLogit, since MaxEnergy, given by max_i log(1 + e^{f_{y_i}(x)}), is approximately equal to MaxLogit when f_{y_i}(x) is large. The results underline the importance of taking into account information from other labels, not just the maximum-valued label. This is because, in multi-label classification, the model may assign high probabilities to several classes. Theoretically, SumEnergy is also more meaningful, and can be interpreted from a joint likelihood perspective as shown in Section 2.2.

Table 2: Ablation study on the effect of summation for prior approaches. We use DenseNet (Huang et al., 2017) to train on the in-distribution datasets. We use ImageNet as OOD test data as described in Section 3.1. Note that Sum does not apply to tree-based or KNN-based approaches (e.g., LOF and Isolation Forest).

What is the effect of applying the aggregation method to prior methods? As an extension, we explore the effectiveness of applying the aggregation method to previous scoring functions such as the logit (Hendrycks et al., 2019), MSP (Hendrycks & Gimpel, 2016), and ODIN (Liang et al., 2018). The results are summarized in Table 2. We calculate scores based on the logit f_{y_i}(x), the sigmoid of the logit 1 / (1 + e^{-f_{y_i}(x)}), the ODIN score, as well as the Mahalanobis distance score M_{y_i}(x), by treating each label independently. We then perform summation across the label-wise scores as the overall OOD score. This ablation essentially replaces the Max aggregation with Sum, which helps understand the extent to which previous approaches are amenable to the multi-label setting.
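The contrast between summing logits and summing energies can be seen with a toy example: raw logits mix positive and negative values, so irrelevant labels can swamp the positive evidence, while label-wise energies (the softplus of the logit) are near zero for negative logits. A small sketch with made-up logit values:

```python
import math

def softplus(f):
    # -E_{y_i}(x) = log(1 + e^{f}), computed stably
    return max(f, 0.0) + math.log1p(math.exp(-abs(f)))

# Hypothetical logits over K = 20 labels.
in_logits = [6.0, 5.0] + [-8.0] * 18   # two dominant labels (in-distribution)
ood_logits = [7.0] + [-2.0] * 19       # one dominant label (OOD)

sum_logit = lambda fs: sum(fs)
sum_energy = lambda fs: sum(softplus(f) for f in fs)

# Summing raw logits ranks the OOD input HIGHER, driven entirely by how
# negative the irrelevant logits happen to be (-133.0 vs. -31.0), whereas
# summing energies correctly ranks the in-distribution input higher.
```

The negative logits contribute almost nothing to the energy sum, so the two dominant in-distribution labels dominate the score, as desired.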
Note that the summation aggregation method does not apply to tree-based or KNN-based approaches such as LOF and Isolation Forest. Interestingly, we find that applying summation over the individual logit/MSP/ODIN/Mahalanobis scores from each label does not yield competitive results, and in many cases worsens the performance. For example, simply summing the logits across labels leads to severe degradation in performance, since the outputs are a mix of positive and negative numbers. On MS-COCO, the FPR95 degrades from 43.53% using MaxLogit to 95.46% using SumLogit. In contrast, SumEnergy does not suffer from this issue, since the energy scores for each individual label have a uniform sign. More importantly, the label-wise score derived from energy is theoretically more meaningful than logit/MSP/ODIN/Mahalanobis, since it is provably aligned with the probability density of the training data corresponding to the label. This underlines the importance of choosing a label-wise scoring function that is compatible with the aggregation method.

SumEnergy vs. SumProb We highlight the advantage of SumEnergy over SumProb both empirically and theoretically. As seen in Table 2, the performance difference between SumEnergy and SumProb is substantial. In particular, on MS-COCO, our method outperforms SumProb by 11.56% (FPR95). For the threshold-independent metric AUROC, SumEnergy consistently outperforms SumProb by 3.38% (MS-COCO), 4.57% (PASCAL), and 4.48% (NUS-WIDE). SumEnergy is a mathematically meaningful measurement and can be interpreted from a joint likelihood perspective (see Section 2.2), whereas SumProb cannot. In fact, one can show that the probability score for each individual label is not aligned with the conditional data density function.
To see this, we can derive the log probability for each binary logistic classifier as:

log p(y_i | x) = log [e^{f_{y_i}(x)} / (1 + e^{f_{y_i}(x)})] = f_{y_i}(x) + E_{y_i}(x).

The first term f_{y_i}(x) is larger for in-distribution data with label y_i, whereas the second term E_{y_i}(x) ∝ -log p(x | y_i) is smaller for in-distribution data with label y_i. This leads to a biased scoring distribution that is no longer proportional to the label-conditional log-likelihood log p(x | y_i). SumProb, as a result, inherits this weakness theoretically and performs worse than SumEnergy.

Qualitative case study Lastly, to provide further insight into our method, we qualitatively examine examples from the multi-label classification dataset PASCAL-VOC (in-dist.) and an OOD input from ImageNet that are correctly classified by SumEnergy but not by MaxLogit. In Figure 3 (left), we see an in-distribution example labeled as dog, car, chair, and person, with MaxLogit score 1.63 and SumEnergy score 3.23. We also show an OOD input (Figure 3, right) with a single dominant activation on the bird class, with MaxLogit score 2.14 and SumEnergy score 2.19. In this example, taking the sum appropriately results in a higher score for the in-distribution image than for the OOD image. Contrarily, MaxLogit's score for the in-distribution image is lower than that of the OOD image, which results in ineffective detection.
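The decomposition log p(y_i | x) = f_{y_i}(x) + E_{y_i}(x) can be checked numerically using stable softplus-based forms (helper names are ours):

```python
import math

def log_sigmoid(f):
    # log p(y_i = 1 | x) = -softplus(-f), computed stably
    return -(max(-f, 0.0) + math.log1p(math.exp(-abs(f))))

def label_energy(f):
    # E_{y_i}(x) = -log(1 + e^{f}) = -softplus(f)
    return -(max(f, 0.0) + math.log1p(math.exp(-abs(f))))

# The identity log p(y_i | x) = f_{y_i}(x) + E_{y_i}(x) holds for any logit:
for f in (-8.0, -0.5, 0.0, 2.0, 8.0):
    assert abs(log_sigmoid(f) - (f + label_energy(f))) < 1e-12
```

The check makes the bias visible: for a large positive logit, log p(y_i | x) saturates near 0 because the growing f term is cancelled by the shrinking energy term, so the probability score discards exactly the density information that E_{y_i}(x) carries.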

4. RELATED WORK

Multi-label classification The task of identifying multiple classes within an input example is of significant interest in many applications (Tsoumakas & Katakis, 2007), with deep neural networks being commonly used as the classifier. Natural images usually contain several objects and may have many associated tags (Wang et al., 2016). Gong et al. (2013) use convolutional neural networks (CNNs) to annotate images with 3 or 5 tags on the NUS-WIDE dataset. Chen et al. (2019a) use CNNs to tag images of road scenes from 52 possible labels. In the medical domain, Wang et al. (2017) present a chest X-ray dataset in which one image may contain multiple abnormalities. Multi-label classification is also prominent in natural language processing (Nam et al., 2014). Recent work also provides a theoretical analysis of multi-label classification under various measures (Wu & Zhu, 2020). Our proposed method is therefore relevant to a wide range of real-world applications.

Out-of-distribution uncertainty for pre-trained models The softmax confidence score has become a common baseline for OOD detection (Hendrycks & Gimpel, 2016). A theoretical investigation (Hein et al., 2019) shows that neural networks with ReLU activation can produce arbitrarily high softmax confidence for OOD inputs. Several works attempt to improve OOD uncertainty estimation by using deep ensembles (Lakshminarayanan et al., 2017), the ODIN score (Liang et al., 2018), the Mahalanobis distance-based confidence score (Lee et al., 2018), and the generalized ODIN score (Hsu et al., 2020). DeVries & Taylor (2018) propose to learn the confidence score by using an auxiliary branch to derive the OOD score. Recent work using the energy score demonstrated state-of-the-art performance on OOD detection tasks (Liu et al., 2020). However, previous methods have primarily focused on multi-class classification networks.
In contrast, in our work, we propose a parameter-free measurement that allows effective OOD detection in the underexplored multi-label setting, where the information from across various labels is combined in a theoretically meaningful manner.

Out-of-distribution detection with model fine-tuning While our work primarily focuses on OOD detection for pre-trained neural networks, a parallel line of research has also explored using auxiliary outlier OOD data to help the OOD detector generalize better. Auxiliary data allows the model to be explicitly regularized through fine-tuning, producing lower confidence on anomalous examples (Bevandić et al., 2018; Geifman & El-Yaniv, 2019; Malinin & Gales, 2018; Mohseni et al., 2020; Subramanya et al., 2017). A loss function can be used to force the predictive distribution of OOD samples toward a uniform distribution (Lee et al., 2017). Recently, Mohseni et al. (2020) explore training by adding a background class for an OOD score. Chen et al. (2020) propose using hard outlier mining, which improves the OOD detection performance on both clean and perturbed natural images. However, existing works have primarily focused on the multi-class classification setting. We leave the fine-tuning aspect with auxiliary data for multi-label classification as future exploration.

Generative modeling based out-of-distribution detection Generative models (Dinh et al., 2016; Kingma & Welling, 2013; Rezende et al., 2014; Van den Oord et al., 2016; Tabak & Turner, 2013) can be alternative approaches for detecting OOD examples, as they directly estimate the in-distribution density and can declare a test sample to be out-of-distribution if it lies in the low-density regions. However, as shown by Nalisnick et al. (2018), deep generative models can assign a high likelihood to out-of-distribution data.
Deep generative models can be made more effective for out-of-distribution detection using improved metrics (Choi & Jang, 2018), including the likelihood ratio (Ren et al., 2019; Serrà et al., 2019). Though our work is based on discriminative classification models, we show that label-wise energy scores can be theoretically interpreted from a data density perspective. More importantly, generative models (Hinz et al., 2019) can be prohibitively challenging to train and optimize, especially on the large and complex multi-label datasets that we consider (e.g., MS-COCO, NUS-WIDE, etc.). In contrast, our method relies on a discriminative multi-label classifier, which can be much easier to optimize using standard SGD.

Energy-based learning Energy-based machine learning models date back to Boltzmann machines (Ackley et al., 1985; Salakhutdinov & Larochelle, 2010). Energy-based learning (LeCun et al., 2006; Ranzato et al., 2007a; b) provides a unified framework for many probabilistic and non-probabilistic approaches to learning. Recent work (Zhao et al., 2019) also demonstrated using energy functions to train GANs (Goodfellow et al., 2014), where the discriminator uses energy values to differentiate between real and generated images. Xie et al. (2016) first showed that a discriminative classifier can be interpreted from an energy-based perspective. Subsequent works (Xie et al., 2017; 2019; 2018b; a) explored video generation, 3D shape pattern generation, and text generation (Deng et al., 2019) through EBMs. Energy-based methods are also used in structured prediction (Belanger & McCallum, 2016; Tu & Gimpel, 2018). Grathwohl et al. (2019) propose JEM, whose optimization objective estimates the joint distribution p(x, y) from a generative perspective, which requires estimating normalized densities and can be intractable and unstable to compute. Liu et al. (2020) propose to use an energy score for OOD detection derived from a purely discriminatively trained classifier, which demonstrated superior performance for multi-class classification networks. In contrast, our work focuses on the multi-label setting, where we contribute aggregation methods that utilize information jointly from across all labels.

5. CONCLUSION

In this work, we propose energy scores for OOD detection in the multi-label classification setting. We show that aggregating energies over all labels via SumEnergy results in better discrimination between in-distribution and OOD inputs than using information from only a single label (as in MaxLogit, MSP, ODIN, or Mahalanobis). Additionally, we justify the mathematical interpretation of SumEnergy from a joint likelihood perspective. SumEnergy obtains better OOD detection performance than competitive baseline methods, establishing a new state of the art on this task. Applications of multi-label classification can benefit from our methods, and we anticipate further research in OOD detection to extend this work. We hope our work will increase attention toward a broader view of OOD uncertainty estimation from an energy-based perspective.



Figure 1: Energy-based out-of-distribution detection for multi-label classification. During inference time, input x is passed through classifier f , and label-wise scores are computed for each label. OOD indicator scores are either the maximum-valued score (denoted by green outlines) or the sum of all scores. Taking the sum results in a larger difference in scores and more separation between in-distribution and OOD inputs (denoted by red lines), resulting in better OOD detection. Plots in the bottom right depict the probability densities of maximum-valued versus summed scores.

Figure 2: AUROC curves for the OOD detector obtained from three in-distribution multi-label classification datasets.

Figure 3: Label-wise energy scores -Ey i (x) for in-distribution example from PASCAL-VOC (left), and OOD input from ImageNet (right). The OOD input is misclassified using MaxLogit score since the dominant output has a high activation, making it indistinguishable from an in-distribution data's MaxLogit score. In contrast, SumEnergy correctly classifies both images since it results in larger differences in scores between in-distribution and OOD inputs.

Table 1: OOD detection performance comparison of energy-based approaches vs. competitive baselines. We use DenseNet (Huang et al., 2017) to train on the in-distribution datasets. We use a subset of ImageNet classes as OOD test data, as described in Section 3.1. All values are percentages. ↑ indicates larger values are better, and ↓ indicates smaller values are better. Bold numbers are superior results. Descriptions of baseline methods, additional evaluation results on different OOD test data, and a different architecture (e.g., ResNet) can be found in the Appendix. Each column group reports FPR95 ↓ / AUROC ↑ / AUPR ↑ for MS-COCO, PASCAL-VOC, and NUS-WIDE, respectively.

MaxLogit (Hendrycks et al., 2019)   43.53 / 89.11 / 93.74   45.06 / 89.22 / 83.14   56.46 / 83.58 / 94.32
MaxProb                             43.53 / 89.11 / 93.74   45.06 / 89.22 / 83.14   56.46 / 83.58 / 94.32
MSP (Hendrycks & Gimpel, 2016)      79.90 / 73.70 / 85.37   74.05 / 79.32 / 72.54   88.50 / 60.81 / 87.00

A APPENDIX

A.1 EVALUATION ON DIFFERENT ARCHITECTURE

We provide additional evaluation results for ResNet (He et al., 2016). The classifiers have a ResNet-101 backbone architecture, but with the final layer replaced by two fully connected layers. Each classifier is pre-trained on ImageNet-1K and then fine-tuned with the logistic sigmoid function on its corresponding multi-label dataset. We use the same training settings as in the main paper. After training, the mAP is 87.73% for PASCAL-VOC, 72.77% for MS-COCO, and 61.47% for NUS-WIDE.

In Table 3, we show the performance comparison of various OOD detection approaches, evaluated on ImageNet as the OOD test set. The ablation of applying summation over baseline methods is provided in Table 4.

Table 4: Ablation study on the effect of aggregation methods for prior approaches. We use ResNet (He et al., 2016) to train on the in-distribution datasets. We use ImageNet as OOD test data as described in Section 3.1. Note that Sum is not applicable to tree-based or KNN-based approaches (e.g., LOF and Isolation Forest).

A.2 EVALUATION ON DIFFERENT OOD TEST DATA

In addition to ImageNet, we also evaluate on a different OOD test dataset, Textures (Cimpoi et al., 2014a). The results are reported in Table 5 and Table 6.

A.3 BASELINE METHODS

In multi-label classification, the prediction for each label $y_i$ with $i \in \{1, 2, \ldots, K\}$ is independently made by a binary logistic classifier:

$$p(y_i = 1 \mid x) = \frac{1}{1 + e^{-f_{y_i}(x)}}$$

We consider the following baseline methods under maximum aggregation:

$$\mathrm{ODIN}(x) = \max_i \frac{e^{f_{y_i}(x)/T}}{1 + e^{f_{y_i}(x)/T}} \tag{21}$$

In particular, ODIN was originally designed for the multi-class setting, but we adapt it to the multi-label case by taking the maximum of the calibrated label-wise predictions. The input perturbation is computed as $\tilde{x} = x - \epsilon \cdot \mathrm{sign}(-\nabla_x \ell_{\hat{y}_i}(x))$, where $\ell_{\hat{y}_i}$ is the binary cross-entropy loss for the label $\hat{y}_i$ with the largest output, i.e., $\hat{y}_i = \arg\max_i p(y_i = 1 \mid x)$. For the Mahalanobis distance, we extract the feature embedding $\phi(x)$ for a given sample; $\mu_{y_i}$ is the class-conditional mean for label $y_i$, and $\Sigma^{-1}$ is the inverse of the covariance matrix.

Table 6: Ablation study on the effect of aggregation methods for prior approaches. We use ResNet (He et al., 2016) to train on the in-distribution datasets. We use Textures (Cimpoi et al., 2014a) as OOD test data as described in Section 3.1. Note that Sum is not applicable to tree-based or KNN-based approaches (e.g., LOF and Isolation Forest).
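The max-aggregated ODIN score of Eq. 21 reduces to a temperature-scaled sigmoid followed by a maximum over labels; a minimal sketch (the input-perturbation step is omitted here, since it requires backpropagation through the classifier):

```python
import numpy as np

def odin_score(logits, T=1000.0):
    """Multi-label ODIN score (Eq. 21): the maximum over labels of the
    temperature-scaled sigmoid sigma(f_{y_i}(x) / T)."""
    z = np.asarray(logits, dtype=float) / T
    # exp(z - log(1 + exp(z))) is a numerically stable sigmoid.
    return float(np.max(np.exp(z - np.logaddexp(0.0, z))))

# At high temperature, all calibrated outputs are squeezed toward 0.5;
# the max over labels serves as the OOD indicator score.
print(odin_score([2.0, -1.0, 0.5], T=1000.0))
```

The temperature T and perturbation magnitude are the hyperparameters tuned in Section A.5.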

A.4 VALIDATION DATA FOR BASELINES

We use a combination of the following validation datasets to select hyperparameters for ODIN (Liang et al., 2018) and Mahalanobis (Lee et al., 2018). The validation set consists of:

• Gaussian noise sampled i.i.d. from an isotropic Gaussian distribution;
• uniform noise, where each pixel is sampled from U[-1, 1];
• in-distribution data corrupted into OOD data by applying (1) the pixel-wise arithmetic mean of a random pair of in-distribution images; (2) the pixel-wise geometric mean of a random pair of in-distribution images; and (3) a random permutation of 16 equally sized patches of an in-distribution image.

A.5 HYPERPARAMETER TUNING FOR BASELINES

ODIN (Liang et al., 2018) and Mahalanobis (Lee et al., 2018) require hyperparameter tuning, such as the temperature and the magnitude of the input perturbation. We use the validation data above to select the optimal hyperparameters. For ODIN, the temperature T is chosen from [1, 10, 100, 1000], and the perturbation magnitude is chosen from 21 evenly spaced values from 0 to 0.004. For Mahalanobis, the perturbation magnitude is chosen from [0, 0.0005, 0.0014, 0.001, 0.002, 0.005]. The optimal parameters are chosen to minimize the FPR at 95% TPR on the validation set.
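The three in-distribution corruption transforms can be sketched as follows (single-channel float arrays scaled to [0, 1] and a 4x4 patch grid are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def arithmetic_mix(a, b):
    """(1) Pixel-wise arithmetic mean of a random pair of in-dist. images."""
    return (a + b) / 2.0

def geometric_mix(a, b):
    """(2) Pixel-wise geometric mean (pixels assumed scaled to [0, 1])."""
    return np.sqrt(a * b)

def patch_shuffle(img, grid=4):
    """(3) Randomly permute grid*grid = 16 equally sized patches."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    patches = [img[i*h:(i+1)*h, j*w:(j+1)*w].copy()
               for i in range(grid) for j in range(grid)]
    out = img.copy()
    for k, p in enumerate(rng.permutation(len(patches))):
        i, j = divmod(k, grid)
        out[i*h:(i+1)*h, j*w:(j+1)*w] = patches[p]
    return out

a, b = rng.random((32, 32)), rng.random((32, 32))
ood_val = [arithmetic_mix(a, b), geometric_mix(a, b), patch_shuffle(a)]
print([v.shape for v in ood_val])
```

Each transform preserves local image statistics while destroying global semantic structure, which is what makes the outputs usable as proxy OOD data for hyperparameter selection.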

