CONCEPT-BASED EXPLANATIONS FOR OUT-OF-DISTRIBUTION DETECTORS

Anonymous

Abstract

Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution (ID) and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfies the desired properties of detection completeness and concept separability, and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD detection techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.

1. INTRODUCTION

It is well known that machine learning (ML) models can yield uncertain and unreliable predictions on out-of-distribution (OOD) inputs, i.e., inputs from outside the training distribution (Amodei et al., 2016; Goodfellow et al., 2015; Nguyen & O'Connor, 2015). The most common line of defense in this situation is to augment the ML model (e.g., a DNN classifier) with a detector that can identify and flag such inputs as OOD. The ML model can then abstain from making predictions on such inputs (Hendrycks et al., 2019; Lin et al., 2021; Mohseni et al., 2020). Currently, the problem of explaining the decisions of an OOD detector remains largely unexplored. Much of the focus in learning OOD detectors has been on improving their detection performance (Liu et al., 2020; Mohseni et al., 2020; Lin et al., 2021; Chen et al., 2021; Sun et al., 2021; Cao & Zhang, 2022), not on improving their explainability. A potential approach would be to run an existing interpretation method for DNN classifiers on ID and OOD data separately, and then inspect the difference between the generated explanations. However, it is not known whether an explanation method that is effective for explaining in-distribution class predictions will also be effective for OOD detectors. For instance, feature attributions, the most popular type of explanation (Sundararajan et al., 2017; Ribeiro et al., 2016), may not capture visual differences in the generated explanations between ID and OOD inputs (Adebayo et al., 2020). Moreover, explanations based on pixel-level activations may not be the most intuitive form of explanation for humans. This paper addresses the above research gap by proposing the first method (to our knowledge) to help interpret the decisions of an OOD detector in a human-understandable way.
We build upon recent advances in concept-based explanations for DNN classifiers (Ghorbani et al., 2019; Zhou et al., 2018a; Bouchacourt & Denoyer, 2019; Yeh et al., 2020), which offer the advantage of providing explanations in terms of high-level concepts for classification tasks. We make the first effort at extending their utility to the problem of OOD detection. Consider Figure 1, which illustrates our concept-based explanations for inputs that are all classified as "Dolphin" by a DNN classifier, but detected as either ID or OOD by an OOD detector. We observe that the OOD detector predicts a certain input as ID when its concept-score patterns are similar to those of ID images from the Dolphin class. Likewise, the detector predicts an input as OOD when its concept-score patterns are very different from those of
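To make the idea of comparing concept-score patterns concrete, here is a minimal sketch. It assumes (hypothetically) that concept scores are obtained by projecting a DNN's feature-layer activations onto learned unit-norm concept vectors, and that ID/OOD dissimilarity is measured as cosine similarity to a class's mean ID concept profile; the names (`concept_scores`, `profile_similarity`, etc.) are illustrative and not the paper's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
num_concepts, feat_dim = 5, 64

# Hypothetical learned concept vectors (one per row), unit-normalized.
concepts = rng.normal(size=(num_concepts, feat_dim))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

def concept_scores(phi):
    """Project a feature-layer activation vector onto each concept direction."""
    return concepts @ phi

# Mean concept-score profile of ID examples from one class (e.g., "Dolphin").
id_feats = rng.normal(size=(100, feat_dim))
id_profile = np.mean([concept_scores(f) for f in id_feats], axis=0)

def profile_similarity(phi):
    """Cosine similarity between an input's concept-score pattern and the
    class's ID profile; a low value suggests the input may be OOD."""
    s = concept_scores(phi)
    return float(s @ id_profile /
                 (np.linalg.norm(s) * np.linalg.norm(id_profile)))

test_input = rng.normal(size=feat_dim)
sim = profile_similarity(test_input)
```

An input whose concept-score pattern aligns with the class's ID profile yields a similarity near 1, while a dissimilar pattern yields a low (or negative) similarity, mirroring the ID/OOD contrast illustrated in Figure 1.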

