ENERGY-BASED OUT-OF-DISTRIBUTION DETECTION FOR MULTI-LABEL CLASSIFICATION

Abstract

Out-of-distribution (OOD) detection is essential to prevent anomalous inputs from causing a model to fail during deployment. While increasingly effective OOD detection methods have emerged for multi-class classification, OOD detection for multi-label classification remains underexplored and relies on rudimentary techniques. We propose SumEnergy, a simple and effective method that estimates OOD indicator scores by aggregating energy scores across multiple labels. We show that SumEnergy can be mathematically interpreted from a joint likelihood perspective. Our results show consistent improvement over previous methods based on maximum-valued scores, which fail to capture joint information from multiple labels. We demonstrate the effectiveness of our method on three common multi-label classification benchmarks: MS-COCO, PASCAL-VOC, and NUS-WIDE. SumEnergy reduces FPR95 by up to 10.05% compared to the previous best baseline, establishing state-of-the-art performance.

1. INTRODUCTION

Out-of-distribution (OOD) detection is central to reliably deploying machine learning models in open-world environments, where new forms of test-time data may appear that were absent during training. The problem of OOD detection has gained significant research attention lately, given its importance for safety-critical applications such as unseen disease identification (Cao et al., 2020). However, recent studies have primarily focused on detecting OOD examples in multi-class classification, where each sample is assigned one and only one label (Bevandić et al., 2018; Hein et al., 2019; Hendrycks & Gimpel, 2016; Lakshminarayanan et al., 2017; Lee et al., 2018; Liang et al., 2018; Mohseni et al., 2020; Chen et al., 2020; Hsu et al., 2020; Liu et al., 2020). This can be restrictive in many real-world settings where images often contain multiple objects of interest. For example, self-driving cars must differentiate between the road, traffic signs, and obstacles within a frame. In the medical domain, multiple abnormalities may be present in a single medical image (Wang et al., 2017). Multi-label classification is desirable because it places no constraint on the number of classes an instance can be assigned to.

Currently, OOD detection in multi-label classification remains relatively underexplored. Of the multi-label methods evaluated in (Hendrycks et al., 2019), MaxLogit achieved the best performance. However, simply using the maximum-valued logit is limiting because it discards information available from other possible labels. As seen in Figure 1, MaxLogit captures only the difference between the dominant outputs for dog (in-dist.) and car (OOD), while positive evidence from another dominant label, cat (in-dist.), is dismissed.
Other baseline methods such as ODIN (Liang et al., 2018) and Mahalanobis (Lee et al., 2018) likewise derive scores from a maximum (e.g., the calibrated softmax score or Mahalanobis distance) and fail to capture joint information. While energy scores have recently demonstrated superior OOD detection performance in the multi-class setting (Liu et al., 2020), the method does not trivially generalize to multi-label classification, where labels are not mutually exclusive. A key challenge lies in how to leverage information across different labels.

In this paper, we propose an energy-based method for OOD detection in the multi-label setting, which estimates OOD indicator scores jointly from multiple labels. We propose a simple and effective aggregation mechanism, SumEnergy, which combines the label-wise energy scores derived from the individual labels. We show that SumEnergy scores are mathematically meaningful and can be interpreted from a joint likelihood perspective. Intuitively, an input with multiple dominant labels is more likely to be in-distribution, which is the key property that SumEnergy capitalizes on. As shown in Figure 1, summing label-wise energies over all labels amplifies the score gap between in-distribution and OOD inputs. Our method is parameter-free and can be conveniently applied to any pre-trained multi-label classification model. Below we describe our contributions in detail.

Contributions. First, we propose a theoretically motivated scoring function, SumEnergy, based on the aggregation of label-wise energy scores. Extensive experiments show that SumEnergy outperforms existing methods on three common multi-label classification benchmarks, establishing state-of-the-art performance.
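As a concrete illustration, the scoring just described can be sketched in a few lines of NumPy. This is a minimal sketch under assumed conventions (logits f_i(x) from a sigmoid-based multi-label network; the dog/cat/car logit values are hypothetical and chosen to mirror Figure 1), not the authors' released implementation:

```python
import numpy as np

def sum_energy(logits):
    """SumEnergy OOD indicator score.

    For a multi-label model, the label-wise free energy of label i is
    E_i(x) = -log(1 + exp(f_i(x))).  SumEnergy aggregates the negated
    energies over all labels, so inputs with several dominant labels
    score higher (more in-distribution).
    """
    logits = np.asarray(logits, dtype=np.float64)
    # log(1 + exp(z)) computed stably as logaddexp(0, z)
    return np.logaddexp(0.0, logits).sum(axis=-1)

def max_energy(logits):
    # Alternative aggregation: keep only the single dominant label.
    logits = np.asarray(logits, dtype=np.float64)
    return np.logaddexp(0.0, logits).max(axis=-1)

# Hypothetical logits in the spirit of Figure 1: the in-distribution
# image has two dominant labels (dog, cat); the OOD image has one (car).
in_dist = [6.0, 5.5, -3.0, -4.0]
ood     = [6.0, -3.5, -3.0, -4.0]

# The dominant logits are equal, so max-based scores barely differ,
# while summing amplifies the gap via the second dominant label.
print(max_energy(in_dist) - max_energy(ood))  # ≈ 0
print(sum_energy(in_dist) - sum_energy(ood))  # ≈ 5.5
```

Note that no extra parameters are introduced: the score is a pure function of the logits, so it can be bolted onto any pre-trained multi-label classifier at inference time.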
For example, on a DenseNet trained on MS-COCO (Lin et al., 2014), our method reduces the false positive rate (at 95% TPR) by 10.05% when evaluated against ImageNet OOD data, compared to the best-performing baseline. Consistent performance improvement is also observed on a different OOD test dataset, Texture (Cimpoi et al., 2014a), as well as on an alternative network architecture.

Additionally, we perform a comparative analysis of how an alternative aggregation method affects OOD detection performance. In particular, we consider MaxEnergy, which takes the maximum energy score among the individual labels as the OOD indicator score. On a DenseNet trained on the PASCAL-VOC dataset, MaxEnergy yields a 4.05% worse FPR95 than SumEnergy, which underlines the importance of taking into account scores derived from multiple labels.

Lastly, as an ablation, we demonstrate that energy scores are more compatible with the proposed summation aggregation than previous scoring functions such as logits (Hendrycks et al., 2019), MSP (Hendrycks & Gimpel, 2016), ODIN (Liang et al., 2018), and Mahalanobis distance (Lee et al., 2018). To see this, we explore the effectiveness of applying aggregation methods to these popular scoring functions, which helps clarify their applicability in the multi-label setting. We find that summing per-label scores from previous methods is inferior to summing per-label energies, emphasizing the need for SumEnergy. For example, simply summing the logits across labels degrades FPR95 by up to 51.93% on MS-COCO, since the outputs mix positive and negative values. In contrast, SumEnergy does not suffer from this issue because the signs of the label-wise energy scores are uniform. More importantly, the label-wise energy is provably aligned with the probability density of the corresponding label's training data.
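The mixed-sign failure mode of summing raw logits can be illustrated with a small hypothetical example (the logit values below are invented for illustration; in a real multi-label dataset, most labels are absent for any given image, so most logits are strongly negative):

```python
import numpy as np

def sum_logits(logits):
    # Naive aggregation: raw logits have mixed signs, so the many
    # confidently-absent labels of an in-distribution image can swamp
    # its few confidently-present ones.
    return np.asarray(logits, dtype=np.float64).sum(axis=-1)

def sum_energy(logits):
    # Label-wise scores -E_i(x) = log(1 + exp(f_i(x))) are uniformly
    # non-negative: absent labels contribute ~0 instead of large
    # negative values.
    return np.logaddexp(0.0, np.asarray(logits, dtype=np.float64)).sum(axis=-1)

# Hypothetical 10-label logits: an in-distribution image with two strong
# labels and eight strongly-absent ones, vs. an OOD image with ten
# uniformly weak responses.
in_dist = [6.0, 5.5] + [-8.0] * 8   # summed logits: -52.5
ood     = [-2.0] * 10               # summed logits: -20.0

print(sum_logits(in_dist) > sum_logits(ood))  # → False: wrong ordering
print(sum_energy(in_dist) > sum_energy(ood))  # → True: correct ordering
```

Here the summed logits rank the OOD input as more in-distribution, because the eight confidently-absent labels dominate the sum; the uniformly non-negative energies avoid this cancellation entirely.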
Our study therefore underlines the importance of properly choosing both the label-wise scoring function and the aggregation method. We show strong compatibility between the label-wise energy score and the summation aggregation, supported by both mathematical interpretation and empirical results.



Figure 1: Energy-based out-of-distribution detection for multi-label classification. At inference time, input x is passed through classifier f, and label-wise scores are computed for each label. OOD indicator scores are either the maximum-valued score (denoted by green outlines) or the sum of all scores. Taking the sum yields a larger difference in scores and more separation between in-distribution and OOD inputs (denoted by red lines), resulting in better OOD detection. Plots in the bottom right depict the probability densities of maximum-valued versus summed scores.

