CONCEPT-BASED EXPLANATIONS FOR OUT-OF-DISTRIBUTION DETECTORS

Anonymous

Abstract

Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability, and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD detection techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.

1. INTRODUCTION

It is well known that machine learning (ML) models can yield uncertain and unreliable predictions on out-of-distribution (OOD) inputs, i.e., inputs from outside the training distribution (Amodei et al., 2016; Goodfellow et al., 2015; Nguyen & O'Connor, 2015). The most common line of defense in this situation is to augment the ML model (e.g., a DNN classifier) with a detector that can identify and flag such inputs as OOD. The ML model can then abstain from making predictions on such inputs (Hendrycks et al., 2019; Lin et al., 2021; Mohseni et al., 2020).
Currently, the problem of explaining the decisions of an OOD detector remains largely unexplored. Much of the focus in learning OOD detectors has been on improving their detection performance (Liu et al., 2020; Mohseni et al., 2020; Lin et al., 2021; Chen et al., 2021; Sun et al., 2021; Cao & Zhang, 2022), but not on improving their explainability. A potential approach would be to run an existing interpretation method for DNN classifiers on in-distribution (ID) and OOD data separately, and then inspect the difference between the generated explanations. However, it is not known whether an explanation method that is effective for explaining in-distribution class predictions will also be effective for OOD detectors. For instance, feature attributions, the most popular type of explanation (Sundararajan et al., 2017; Ribeiro et al., 2016), may not capture visual differences in the generated explanations between ID and OOD inputs (Adebayo et al., 2020). Moreover, explanations based on pixel-level activations may not be the most intuitive form of explanation for humans.
This paper addresses the above research gap by proposing the first method (to our knowledge) to help interpret the decisions of an OOD detector in a human-understandable way.
We build upon recent advances in concept-based explanations for DNN classifiers (Ghorbani et al., 2019; Zhou et al., 2018a; Bouchacourt & Denoyer, 2019; Yeh et al., 2020), which offer the advantage of providing explanations in terms of high-level concepts for classification tasks. We make the first effort at extending their utility to the problem of OOD detection. Consider Figure 1, which illustrates our concept-based explanations for inputs that are all classified as "Dolphin" by a DNN classifier, but detected as either ID or OOD by an OOD detector. We observe that the OOD detector predicts an input as ID when its concept-score pattern is similar to that of ID images from the Dolphin class. Likewise, the detector predicts an input as OOD when its concept-score pattern is very different from that of ID inputs. A user can verify whether the OOD detector makes decisions based on concepts that are aligned with human intuition (e.g., C90 and C1), and that an incorrect detection (as in Figure 1b) is an understandable mistake rather than a misbehavior of the OOD detector. Such explanations can help the user evaluate the reliability of the OOD detector and decide upon its adoption in practice. We aim to provide a general interpretability framework that is applicable to a wide range of OOD detectors.
Figure 1: Our concept-based explanation for the Energy OOD detector (Liu et al., 2020). On the x-axis, we present the top-5 important concepts that describe the detector's behavior given images classified as "Dolphin". Concepts C90 and C1 represent "dolphin-like skin" and "wavy surface of the sea" respectively (see Figure 12b). The ID profile shows the concept-score patterns for ID images predicted as Dolphin.
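To make the comparison shown in Figure 1 concrete, the following is a minimal sketch of how an input's concept-score pattern could be compared against a class-level ID profile. It rests on assumptions that are not stated in this section: concept scores are taken to be inner products between a feature vector and concept vectors, the ID profile is taken to be the mean score per concept, and the Euclidean distance is used only as an illustrative measure of dissimilarity. All function names and toy values are hypothetical.

```python
import math

def concept_scores(features, concept_vectors):
    """Inner product of a feature vector with each concept vector (assumed form)."""
    return [sum(f * c for f, c in zip(features, vec)) for vec in concept_vectors]

def id_profile(score_rows):
    """Mean concept scores over ID images predicted as one class."""
    return [sum(col) / len(score_rows) for col in zip(*score_rows)]

def profile_distance(scores, profile):
    """Euclidean distance between one input's scores and the class ID profile."""
    return math.sqrt(sum((s - p) ** 2 for s, p in zip(scores, profile)))

# Toy data: two hypothetical concept directions in a 2-D feature space.
concepts = [[1.0, 0.0], [0.0, 1.0]]
id_dolphins = [[0.9, 0.8], [1.1, 0.6]]  # features of ID images predicted as Dolphin
profile = id_profile([concept_scores(f, concepts) for f in id_dolphins])

# A query whose concept scores sit far from the profile looks unlike ID inputs.
query = concept_scores([0.1, -0.5], concepts)
print(profile_distance(query, profile))
```

The distance here is only a reading aid for the figure; the OOD detector itself remains a black box and makes its decision independently of this computation.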
Accordingly, a research question we ask is: without relying on the internal mechanism of an OOD detector, can we determine a good set of concepts that are appropriate for understanding why the OOD detector predicts a certain input to be ID or OOD? A key contribution of this paper is to show that this can be done in an unsupervised manner without any human annotations. We make the following contributions in this paper:
• We motivate and propose new metrics to quantify the effectiveness of concept-based explanations for a black-box OOD detector, namely detection completeness and concept separability (§ 2.2, § 3.1, and § 3.2).
• We propose a concept-learning objective with suitable regularization terms that, given an OOD detector for a DNN classifier, learns a set of concepts with good detection completeness and concept separability (§ 3.3).
• By treating a given OOD detector as a black box, we show that our method can be applied to explain a variety of existing OOD detection methods. By identifying prominent concepts that contribute to an OOD detector's decisions via a modified Shapley value score based on the detection completeness, we demonstrate how the discovered concepts can be used to understand the OOD detector (§ 4).
Related Work. In the literature on OOD detection, recent studies have designed various scoring functions based on the representation from the final or penultimate layers (Liang et al., 2018; DeVries & Taylor, 2018), or a combination of different internal layers of the DNN model (Lee et al., 2018; Lin et al., 2021; Raghuram et al., 2021). A recent survey on generalized OOD detection can be found in Yang et al. (2021). Our work aims to provide post-hoc explanations applicable to a wide range of black-box OOD detectors without modifying their internals.
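As one concrete instance of a scoring function computed from the final layer, the sketch below implements the energy score used by the Energy detector of Liu et al. (2020), namely T · log Σ_i exp(z_i / T) over the classifier logits z, together with the abstention behavior described in the introduction. The threshold value, the function names, and the example logits are illustrative assumptions; in practice the threshold would be calibrated on held-out ID data.

```python
import math

def energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T); larger means more ID-like."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def predict_or_abstain(logits, threshold, temperature=1.0):
    """Return the predicted class index, or None when the input is flagged as OOD."""
    if energy_score(logits, temperature) < threshold:
        return None  # abstain: the detector treats this input as OOD
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical logits: a confident prediction and a flat, low-magnitude one.
print(predict_or_abstain([8.0, 0.5, 0.1], threshold=3.0))
print(predict_or_abstain([0.2, 0.1, 0.0], threshold=3.0))
```

From the point of view of an explanation method, only the binary ID/OOD output of `predict_or_abstain` needs to be observed, which is what treating the detector as a black box amounts to.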
Among different interpretability approaches, concept-based explanation (Koh et al., 2020; Alvarez Melis & Jaakkola, 2018) has gained popularity as it is designed to be better aligned with human reasoning (Armstrong et al., 1983; Tenenbaum, 1999) and intuition (Ghorbani et al., 2019; Zhou et al., 2018a; Bouchacourt & Denoyer, 2019; Yeh et al., 2020). There have been limited attempts to assess the use of concept-based explanations under data distribution changes such as adversarial manipulation (Kim et al., 2018) or spurious correlations (Adebayo et al., 2020). Designing concept-based explanations for OOD detection requires further exploration and is the focus of our work.
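The contributions above refer to a Shapley value-based importance score over concepts. As background, the sketch below computes the standard (unmodified) Shapley value by exact enumeration of concept subsets; it is not the modified score proposed in this paper, whose definition appears later. The value function `toy_completeness` is a hypothetical stand-in for a completeness-style metric evaluated on a subset of concepts, and the concept name C7 is invented for illustration (C90 and C1 are taken from Figure 1).

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

# Hypothetical additive value function: each concept adds a fixed amount.
gains = {"C90": 0.5, "C1": 0.3, "C7": 0.0}

def toy_completeness(subset):
    return sum(gains[c] for c in subset)

print(shapley_values(list(gains), toy_completeness))
```

Exact enumeration evaluates the value function on every subset, so its cost grows exponentially with the number of concepts; with many concepts, a sampling-based approximation would typically be used instead.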

2. PROBLEM SETUP AND BACKGROUND

Notations. Let $\mathcal{X} \subseteq \mathbb{R}^{a_0 \times b_0 \times d_0}$ denote the space of inputs¹ $x$, where $d_0$ is the number of channels and $a_0$ and $b_0$ are the image size along each channel. Let $\mathcal{Y} := \{1, \cdots, L\}$ denote the space of



¹ We focus on images, but the proposed method extends to other domains.



Figure 1 panels: (a) Correct detection: ID (or OOD) dolphin image correctly detected as ID (or OOD). (b) Wrong detection: ID (or OOD) dolphin image falsely detected as OOD (or ID).

