CONCEPT-BASED EXPLANATIONS FOR OUT-OF-DISTRIBUTION DETECTORS Anonymous

Abstract

Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability, and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD detection techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.

1. INTRODUCTION

It is well known that machine learning (ML) models can yield uncertain and unreliable predictions on out-of-distribution (OOD) inputs, i.e., inputs from outside the training distribution (Amodei et al., 2016; Goodfellow et al., 2015; Nguyen & O'Connor, 2015) . The most common line of defense in this situation is to augment the ML model (e.g., a DNN classifier) with a detector that can identify and flag such inputs as OOD. The ML model can then abstain from making predictions on such inputs (Hendrycks et al., 2019; Lin et al., 2021; Mohseni et al., 2020) . Currently, the problem of explaining the decisions of an OOD detector remains largely unexplored. Much of the focus in learning OOD detectors has been on improving their detection performance (Liu et al., 2020; Mohseni et al., 2020; Lin et al., 2021; Chen et al., 2021; Sun et al., 2021; Cao & Zhang, 2022) , but not on improving their explainability. A potential approach would be to run an existing interpretation method for DNN classifiers with ID and OOD data separately, and then inspect the difference between the generated explanations. However, it is not known if an explanation method that is effective for explaining in-distribution class predictions will also be effective for OOD detectors. For instance, feature attributions, the most popular type of explanation (Sundararajan et al., 2017; Ribeiro et al., 2016) , may not capture visual differences in the generated explanations between ID and OOD inputs (Adebayo et al., 2020) . Moreover, their explanations based on pixel-level activations may not provide the most intuitive form of explanations for humans. This paper addresses the above research gap by proposing the first method (to our knowledge) to help interpret the decisions of an OOD detector in a human-understandable way. 
Figure 1: Our concept-based explanation for the Energy OOD detector (Liu et al., 2020). On the x-axis, we present the top-5 important concepts that describe the detector's behavior given images classified as "Dolphin". Concepts C90 and C1 represent "dolphin-like skin" and "wavy surface of the sea" respectively (see Figure 12b). ID profile shows the concept score patterns for ID images predicted as Dolphin.

We build upon recent advances in concept-based explanations for DNN classifiers (Ghorbani et al., 2019; Zhou et al., 2018a; Bouchacourt & Denoyer, 2019; Yeh et al., 2020), which offer the advantage of providing explanations in terms of high-level concepts for classification tasks. We make the first effort at extending their utility to the problem of OOD detection. Consider Figure 1, which illustrates our concept-based explanations for inputs that are all classified as "Dolphin" by a DNN classifier, but detected as either ID or OOD by an OOD detector. We observe that the OOD detector predicts an input as ID when its concept-score pattern is similar to that of ID images from the Dolphin class. Likewise, the detector predicts an input as OOD when its concept-score pattern is very different from that of ID inputs. A user can verify whether the OOD detector makes decisions based on concepts that are aligned with human intuition (e.g., C90 and C1), and whether an incorrect detection (as in Figure 1b) is an understandable mistake rather than a misbehavior of the OOD detector. Such explanations can help the user evaluate the reliability of the OOD detector and decide upon its adoption in practice. We aim to provide a general interpretability framework that is applicable to a wide range of OOD detectors.
Accordingly, the research question we ask is: without relying on the internal mechanism of an OOD detector, can we determine a good set of concepts that are appropriate for understanding why the OOD detector predicts a certain input to be ID / OOD? A key contribution of this paper is to show that this can be done in an unsupervised manner without any human annotations. We make the following contributions in this paper:
• We motivate and propose new metrics to quantify the effectiveness of concept-based explanations for a black-box OOD detector, namely detection completeness and concept separability (§ 2.2, § 3.1, and § 3.2).
• We propose a concept-learning objective with suitable regularization terms that, given an OOD detector for a DNN classifier, learns a set of concepts with good detection completeness and concept separability (§ 3.3).
• By treating a given OOD detector as a black box, we show that our method can be applied to explain a variety of existing OOD detection methods. By identifying prominent concepts that contribute to an OOD detector's decisions via a modified Shapley value score based on the detection completeness, we demonstrate how the discovered concepts can be used to understand the OOD detector (§ 4).

Related Work.

In the literature of OOD detection, recent studies have designed various scoring functions based on the representation from the final or penultimate layers (Liang et al., 2018; DeVries & Taylor, 2018) , or a combination of different internal layers of DNN model (Lee et al., 2018; Lin et al., 2021; Raghuram et al., 2021) . A recent survey on generalized OOD detection can be found in Yang et al. (2021) . Our work aims to provide post-hoc explanations applicable to a wide range of black-box OOD detectors without modifying their internals. Among different interpretability approaches, concept-based explanation (Koh et al., 2020; Alvarez Melis & Jaakkola, 2018) has gained popularity as it is designed to be better-aligned with human reasoning (Armstrong et al., 1983; Tenenbaum, 1999) and intuition (Ghorbani et al., 2019; Zhou et al., 2018a; Bouchacourt & Denoyer, 2019; Yeh et al., 2020) . There have been limited attempts to assess the use of concept-based explanations under data distribution changes such as adversarial manipulation (Kim et al., 2018) or spurious correlations (Adebayo et al., 2020) . Designing concept-based explanations for OOD detection requires further exploration and is the focus of our work.

2. PROBLEM SETUP AND BACKGROUND

Notations. Let $\mathcal{X} \subseteq \mathbb{R}^{a_0 \times b_0 \times d_0}$ denote the space of inputs $x$, where $d_0$ is the number of channels and $a_0$ and $b_0$ are the image dimensions along each channel. Let $\mathcal{Y} := \{1, \cdots, L\}$ denote the space of output class labels $y$. Let $\Delta_L$ denote the set of all probability distributions over $\mathcal{Y}$ (the simplex in $L$ dimensions). We assume that natural inputs to the DNN classifier are sampled from an unknown probability distribution $P_{\text{in}}$ over the space $\mathcal{X} \times \mathcal{Y}$. The compact notation [...]

OOD Detector. The goal of an OOD detector is to determine if a test input to the classifier is ID (i.e., from the distribution $P_{\text{in}}$); otherwise the input is declared to be OOD (Yang et al., 2021). Given a trained classifier $f : \mathcal{X} \to \Delta_L$, the decision function of an OOD detector can be generally defined as $D_\gamma(x, f) = \mathbb{1}[S(x, f) \geq \gamma]$, where $S(x, f) \in \mathbb{R}$ is the score function of the detector for an input $x$ and $\gamma$ is the threshold. We follow the convention that larger scores correspond to ID inputs, and that detector outputs of 1 and 0 correspond to ID and OOD respectively. We assume the availability of a pre-trained DNN classifier and a paired OOD detector that is trained to detect OOD inputs for the classifier.

Consider a pre-trained DNN classifier $f : \mathcal{X} \to \Delta_L$ that maps an input $x$ to its corresponding predicted class probabilities. Without loss of generality, we can partition the DNN at a convolutional layer $\ell$ into two parts, i.e., $f = h \circ \phi$ where: 1) $\phi : \mathcal{X} \to \mathcal{Z} := \mathbb{R}^{a_\ell b_\ell \times d_\ell}$ is the first half of $f$ that maps an input $x$ to the intermediate feature representation $\phi(x)$, and 2) $h : \mathcal{Z} \to \Delta_L$ is the second half of $f$ that maps $\phi(x)$ to the predicted class probabilities $h(\phi(x))$. We denote the predicted probability of a class $y$ by $f_y(x) = h_y(\phi(x))$, and the prediction of the classifier by $\hat{y}(x) = \arg\max_y f_y(x)$.

2.1. PROJECTION INTO CONCEPT SPACE

Our work is based on the common implicit assumption of linear interpretability in the concept-based explanation literature, i.e., high-level concepts lie in a linearly-projected subspace of the feature representation space $\mathcal{Z}$ of the classifier (Kim et al., 2018). Consider a projection matrix $C = [c_1, \cdots, c_m] \in \mathbb{R}^{d_\ell \times m}$ (with $m \ll d_\ell$) that maps from the space $\mathcal{Z}$ into a reduced-dimension concept space. $C$ consists of $m$ unit vectors, where $c_i \in \mathbb{R}^{d_\ell}$ is referred to as the concept vector representing the $i$-th concept (e.g., "stripe" or "oval face"), and $m$ is the number of concepts. We define the concept score for $x$ as the linear projection of the high-dimensional layer representation $\phi(x) \in \mathbb{R}^{a_\ell b_\ell \times d_\ell}$ into the concept space (Yeh et al., 2020), i.e., $v_C(x) := \phi(x)\, C \in \mathbb{R}^{a_\ell b_\ell \times m}$. We also define a mapping from the projected concept space back to the feature space by a non-linear function $g : \mathbb{R}^{a_\ell b_\ell \times m} \to \mathbb{R}^{a_\ell b_\ell \times d_\ell}$. The reconstructed feature representation at layer $\ell$ is then defined as $\widehat{\phi}_{g,C}(x) := g(v_C(x))$.
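The projection into concept space can be illustrated with a minimal NumPy sketch. The array sizes below (a 4×4 spatial grid, 32 channels, 5 concepts) and the function names are hypothetical and chosen only for illustration; in practice $\phi(x)$ would come from the chosen layer of the classifier and $C$ would be learned.

```python
import numpy as np

def normalize_columns(C):
    """Scale each column of C to unit L2 norm, so each column is a unit concept vector."""
    return C / np.linalg.norm(C, axis=0, keepdims=True)

def concept_scores(phi_x, C):
    """Linear projection v_C(x) = phi(x) C.

    phi_x: (a*b, d) layer representation, one row per spatial location.
    C:     (d, m) matrix whose columns are concept vectors.
    Returns an (a*b, m) matrix of concept scores.
    """
    return phi_x @ C

# Hypothetical sizes: a*b = 16 spatial locations, d = 32 channels, m = 5 concepts.
rng = np.random.default_rng(0)
phi_x = rng.normal(size=(16, 32))
C = normalize_columns(rng.normal(size=(32, 5)))
v = concept_scores(phi_x, C)
```

The reconstruction map $g$ is not sketched here, since the text only specifies that it is a non-linear function parameterized by a neural network.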

2.2. CANONICAL WORLD AND CONCEPT WORLD

As shown in Fig. 2, we consider a "two-world" view of the classifier and OOD detector consisting of the canonical world and the concept world, which are defined as follows:

Canonical World. In this case, both the classifier and OOD detector use the original layer representation $\phi(x)$ for their predictions. The prediction of the classifier is $f(x) = h(\phi(x))$, and the decision function of the detector is $D_\gamma(x, h \circ \phi)$ with a score function $S(x, h \circ \phi)$.

Concept World. We use the following observation in constructing the concept-world formulation: both the classifier and the OOD detector can be modified to make predictions based on the reconstructed feature representation, i.e., using $\widehat{\phi}_{g,C}(x)$ instead of $\phi(x)$. Accordingly, we define the corresponding classifier, detector, and score function in the concept world as follows:

$$
\begin{aligned}
f^{\text{con}}(x) &:= h(\widehat{\phi}_{g,C}(x)) = h(g(v_C(x))) \\
D^{\text{con}}_\gamma(x, f) &:= D_\gamma(x, h \circ \widehat{\phi}_{g,C}) = D_\gamma(x, h \circ g \circ v_C) \\
S^{\text{con}}(x, f) &:= S(x, h \circ \widehat{\phi}_{g,C}) = S(x, h \circ g \circ v_C).
\end{aligned}
\tag{1}
$$

We further elaborate on this two-world view and introduce the following two desirable properties.

Detection Completeness. Given a fixed algorithmic approach for learning the classifier and OOD detector, and with fixed internal parameters of $f$, we would ideally like the classifier prediction and the detection score to be indistinguishable between the two worlds. In other words, for the concepts to sufficiently explain the OOD detector, we require $D^{\text{con}}_\gamma(x, f)$ to closely mimic $D_\gamma(x, f)$. Likewise, we require $f^{\text{con}}(x)$ to closely mimic $f(x)$, since the detection mechanism of $D_\gamma$ is closely paired to the classifier. We refer to this property as the completeness of a set of concepts with respect to the OOD detector and its paired classifier. As discussed in § 3.1, this extends the notion of classification completeness introduced by Yeh et al. (2020) to an OOD detector and its paired classifier.

Concept Separability. To improve the interpretability of the resulting explanations for the OOD detector, we require another desirable property from the learned concepts: data detected as ID by $D_\gamma$ (henceforth referred to as detected-ID data) and data detected as OOD by $D_\gamma$ (henceforth referred to as detected-OOD data) should be well-separated in the concept-score space. Since our goal is to help an analyst understand which concepts distinguish the detected-ID data from the detected-OOD data, we would like to learn a set of concepts that have a well-separated concept score pattern for inputs from these two groups (e.g., the concepts "stripe" and "oval face" in Fig. ?? have distinct concept scores).

3. PROPOSED APPROACH

Given a trained DNN classifier f , a paired OOD detector D γ , and a set of concepts C, we address the following questions: 1) Are the concepts sufficient to capture the prediction behavior of both the classifier and OOD detector? (see § 3.1); 2) Do the concepts show clear distinctions in their scores between detected-ID data and detected-OOD data? (see § 3.2). We first propose new metrics for quantifying the set of learned concepts, followed by a general framework for learning concepts that possess these properties (see § 3.3).

3.1. METRICS FOR DETECTION COMPLETENESS

Definition 1. Given a trained DNN classifier $f = h \circ \phi$ and a set of concept vectors $C$, the classification completeness with respect to $P_{\text{in}}(x, y)$ is defined as (Yeh et al., 2020):

$$
\eta_f(C) := \frac{\sup_g \; \mathbb{E}_{(x,y) \sim P_{\text{in}}} \mathbb{1}[y = \arg\max_{y'} h_{y'}(\widehat{\phi}_{g,C}(x))] - a_r}{\mathbb{E}_{(x,y) \sim P_{\text{in}}} \mathbb{1}[y = \arg\max_{y'} h_{y'}(\phi(x))] - a_r},
$$

where $a_r = 1/L$ is the accuracy of a random $L$-class classifier. The denominator of $\eta_f(C)$ is the accuracy of the original classifier $f$ (offset by $a_r$), while the numerator is the maximum accuracy that can be achieved in the concept world using the feature representation reconstructed from the concept scores (offset by $a_r$). The maximization is over the parameters of the neural network $g$ that reconstructs the feature representation from the vector of concept scores.

Definition 2. Given a trained DNN classifier $f = h \circ \phi$, a trained OOD detector with score function $S(x, f)$, and a set of concept vectors $C$, we define the detection completeness score with respect to the ID distribution $P_{\text{in}}(x, y)$ and OOD distribution $P_{\text{out}}(x)$ as follows:

$$
\eta_{f,S}(C) := \frac{\sup_g \; \text{AUC}(h \circ \widehat{\phi}_{g,C}) - b_r}{\text{AUC}(h \circ \phi) - b_r},
$$

where $\text{AUC}(f)$ is the area under the ROC curve of an OOD detector based on $f$, defined as

$$
\text{AUC}(f) := \mathbb{E}_{(x,y) \sim P_{\text{in}}} \, \mathbb{E}_{x' \sim P_{\text{out}}} \, \mathbb{1}[S(x, f) > S(x', f)],
$$

and $b_r = 0.5$ is the AUROC of a random detector. The numerator term is the maximum achievable AUROC in the concept world via features reconstructed from the concept scores (offset by $b_r$). In practice, $\text{AUC}(f)$ is estimated using the test datasets $D^{\text{te}}_{\text{in}}$ and $D^{\text{te}}_{\text{out}}$.

Both the classification completeness and detection completeness scores are designed to be in the range $[0, 1]$. However, this is not strictly guaranteed, since the classifier or OOD detector in the concept world may empirically achieve a better value of the corresponding metric on a given ID/OOD dataset. A completeness score close to 1 indicates that the set of concepts $C$ is close to complete in characterizing the behavior of the classifier and/or the OOD detector.
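As an illustration of Definition 2, the following sketch estimates the AUROC empirically from detector scores and forms the completeness ratio. It assumes that the reconstruction map $g$ has already been trained, so the supremum over $g$ is not computed here. The function names and toy scores are hypothetical.

```python
def empirical_auroc(scores_in, scores_out):
    """Fraction of (ID, OOD) pairs with S(x) > S(x'), following the AUC definition above.

    Ties contribute zero, matching the strict inequality in the definition.
    """
    wins = sum(1 for s in scores_in for t in scores_out if s > t)
    return wins / (len(scores_in) * len(scores_out))

def detection_completeness(auc_concept, auc_canonical, b_r=0.5):
    """Ratio of concept-world to canonical-world AUROC, each offset by b_r."""
    return (auc_concept - b_r) / (auc_canonical - b_r)

# Toy detector scores (hypothetical): larger means "more ID".
canon_in, canon_out = [0.9, 0.8, 0.7], [0.1, 0.2, 0.75]
con_in, con_out = [0.9, 0.6, 0.7], [0.1, 0.65, 0.75]

auc_canonical = empirical_auroc(canon_in, canon_out)   # 8 of 9 pairs ordered correctly
auc_concept = empirical_auroc(con_in, con_out)         # 6 of 9 pairs ordered correctly
eta = detection_completeness(auc_concept, auc_canonical)
```

In this toy case the concept-world detector recovers only part of the canonical detector's margin over a random detector, so the score falls well below 1.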

3.2. CONCEPT SEPARABILITY SCORE

Concept Scores. In Section 2.1, we introduced a projection matrix $C \in \mathbb{R}^{d_\ell \times m}$ that maps $\phi(x)$ to $v_C(x)$, and consists of $m$ unit concept vectors $C = [c_1 \cdots c_m]$. The inner product between the feature representation and a concept vector is referred to as the concept score, and it quantifies how close an input is to the given concept (Kim et al., 2018; Ghorbani et al., 2019). Specifically, the concept score corresponding to concept $i$ is defined as $v_{c_i}(x) := \langle \phi(x), c_i \rangle = \phi(x)\, c_i \in \mathbb{R}^{a_\ell b_\ell}$. The matrix of concept scores from all the concepts is simply the concatenation of the individual concept scores, i.e., $v_C(x) = \phi(x)\, C = [v_{c_1}(x) \cdots v_{c_m}(x)] \in \mathbb{R}^{a_\ell b_\ell \times m}$. We also define a dimension-reduced version of the concept scores that takes the maximum of the inner product over the $a_\ell \times b_\ell$ patches as follows:

$$
\widetilde{v}_C(x)^T = [\widetilde{v}_{c_1}(x), \cdots, \widetilde{v}_{c_m}(x)] \in \mathbb{R}^m, \quad \text{where } \widetilde{v}_{c_i}(x) = \max_{p,q} |\langle \phi^{p,q}(x), c_i \rangle| \in \mathbb{R}.
$$

Here $\phi^{p,q}(x)$ is the feature representation corresponding to the $(p, q)$-th patch of input $x$ (i.e., receptive field (Araujo et al., 2019)). This reduction operation is done to capture the most important correlations from each patch, and the $m$-dimensional concept score will be used to define our concept separability metric as follows. We would like the set of concept-score vectors from the detected-ID class, $V_{\text{in}}(C) := \{\widetilde{v}_C(x),\; x \in D^{\text{tr}}_{\text{in}} \cup D^{\text{tr}}_{\text{out}} : D_\gamma(x, f) = 1\}$, and the set of concept-score vectors from the detected-OOD class, $V_{\text{out}}(C) := \{\widetilde{v}_C(x),\; x \in D^{\text{tr}}_{\text{in}} \cup D^{\text{tr}}_{\text{out}} : D_\gamma(x, f) = 0\}$, to be well separated. Let $J_{\text{sep}}(V_{\text{in}}(C), V_{\text{out}}(C)) \in \mathbb{R}$ define a general measure of separability between the two subsets, such that a larger value corresponds to higher separability. We discuss a specific choice for $J_{\text{sep}}$ for which it is possible to tractably optimize concept separability as part of the learning objective in Section 3.3.

Global Concept Separability. Class separability metrics have been well studied in the pattern recognition literature, particularly for the two-class case (Fukunaga, 1990b). Motivated by Fisher's linear discriminant analysis (LDA), we explore the use of class-separability measures based on the within-class and between-class scatter matrices (Murphy, 2012). The goal of LDA is to find a projection vector (direction) such that data from the two classes are maximally separated and form compact clusters upon projection. Rather than finding an optimal projection direction, we are more interested in ensuring that the concept-score vectors from the detected-ID and detected-OOD data have high separability. Consider the within-class and between-class scatter matrices based on $V_{\text{in}}(C)$ and $V_{\text{out}}(C)$, given by

$$
\begin{aligned}
S_w &= \sum_{v \in V_{\text{in}}(C)} (v - \mu_{\text{in}})(v - \mu_{\text{in}})^T + \sum_{v \in V_{\text{out}}(C)} (v - \mu_{\text{out}})(v - \mu_{\text{out}})^T, \\
S_b &= (\mu_{\text{out}} - \mu_{\text{in}})(\mu_{\text{out}} - \mu_{\text{in}})^T,
\end{aligned}
\tag{3}
$$

where $\mu_{\text{in}}$ and $\mu_{\text{out}}$ are the mean concept-score vectors from $V_{\text{in}}(C)$ and $V_{\text{out}}(C)$ respectively. We define the following separability metric based on the generalized eigenvalue equation solved by Fisher's LDA (Fukunaga, 1990b):

$$
J_{\text{sep}}(C) := J_{\text{sep}}(V_{\text{in}}(C), V_{\text{out}}(C)) = \operatorname{tr}\left(S_w^{-1} S_b\right).
$$

Maximizing the above metric is equivalent to maximizing the sum of eigenvalues of the matrix $S_w^{-1} S_b$, which in turn encourages a large between-class scatter and a small within-class scatter for the detected-ID and detected-OOD concept scores. We refer to this as a global concept separability metric because it does not analyze the separability on a per-class level. The separability metric is closely related to the Bhattacharyya distance, which yields an upper bound on the Bayes error rate (see Appendix A.1).
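A minimal NumPy sketch of the global separability metric $\operatorname{tr}(S_w^{-1} S_b)$ is given below. It assumes that the reduced concept-score vectors have already been split into detected-ID and detected-OOD groups and that $S_w$ is invertible (enough samples relative to $m$). The synthetic data and function name are hypothetical.

```python
import numpy as np

def global_separability(V_in, V_out):
    """Compute tr(S_w^{-1} S_b) for two sets of concept-score vectors.

    V_in:  (n_in, m) reduced concept scores of detected-ID inputs.
    V_out: (n_out, m) reduced concept scores of detected-OOD inputs.
    """
    mu_in, mu_out = V_in.mean(axis=0), V_out.mean(axis=0)
    D_in, D_out = V_in - mu_in, V_out - mu_out
    S_w = D_in.T @ D_in + D_out.T @ D_out       # within-class scatter
    diff = mu_out - mu_in
    S_b = np.outer(diff, diff)                  # between-class scatter (rank one)
    # Solve S_w X = S_b instead of forming the inverse explicitly.
    return float(np.trace(np.linalg.solve(S_w, S_b)))

# Synthetic example: a far-away OOD cluster versus a nearly overlapping one.
rng = np.random.default_rng(0)
V_in = rng.normal(size=(200, 5))
V_out_far = rng.normal(size=(200, 5)) + 5.0
V_out_near = rng.normal(size=(200, 5)) + 0.1
j_far = global_separability(V_in, V_out_far)
j_near = global_separability(V_in, V_out_near)
```

Because $S_b$ has rank one, the trace equals the quadratic form $(\mu_{\text{out}} - \mu_{\text{in}})^T S_w^{-1} (\mu_{\text{out}} - \mu_{\text{in}})$, which is why the well-separated cluster yields the larger value.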

3.3. PROPOSED CONCEPT LEARNING -KEY IDEAS

Prior Approaches and Limitations. Among post-hoc concept-discovery methods for a DNN classifier with ID data, unlike Kim et al. and Ghorbani et al., that [...]

$$
\operatorname*{argmax}_{C,\, g} \;\; \mathbb{E}_{(x,y) \sim P_{\text{in}}} \left[ \log h_y(g(v_C(x))) \right] + \lambda_{\text{expl}} R_{\text{expl}}(C).
\tag{5}
$$

Here $C$ and $g$ (parameterized by a neural network) are jointly optimized, and $R_{\text{expl}}(C)$ is a regularization term used to ensure that the learned concept vectors have high spatial coherency and low redundancy among themselves (see Yeh et al. (2020) for the definition). While the objective (5) of Yeh et al. can learn a set of sufficient concepts that have a high classification completeness score, we find that it does not necessarily replicate the per-instance prediction behavior of the classifier in the concept world. Specifically, there can be discrepancies in the reconstructed feature representation, whose effect propagates through the remaining part of the classifier. Since many widely-used OOD detectors rely on the feature representations and/or the classifier's predictions, this discrepancy in the existing concept learning approaches makes it hard to closely replicate the OOD detector in the concept world (see Fig. 3). Furthermore, the scope of Yeh et al. is confined to concept learning for explaining the classifier's predictions based on ID data, and there is no guarantee that the learned concepts would be useful for explaining an OOD detector. To address these gaps, we propose a general method for concept learning that complements prior work by imposing additional instance-level constraints on the concepts, and by considering both the OOD detector and OOD data.

Concept Learning Objective. We define a concept learning objective that aims to find a set of concepts $C$ and a mapping $g$ that have the following properties: 1) high detection completeness w.r.t. the OOD detector; 2) high classification completeness w.r.t. the DNN classifier; and 3) high separability in the concept-score space between detected-ID data and detected-OOD data.
Inspired by recent works on transferring feature information from a teacher model to a student model (Hinton et al., 2015; Zhou et al., 2018b), we encourage accurate reconstruction of the layer representation from the concept scores by adding a regularization term that is the squared $\ell_2$ distance between the original and reconstructed representations:

$$
J_{\text{norm}}(C, g) = \mathbb{E}_{x \sim P_{\text{in}}} \left\| \phi(x) - \widehat{\phi}_{g,C}(x) \right\|^2.
$$

In order to close the gap between the scores of the OOD detector in the concept world and the canonical world on a per-sample level, we introduce the following mean-squared-error (MSE) based regularization:

$$
J_{\text{mse}}(C, g) = \mathbb{E}_{x \sim P_{\text{in}}} \left( S(x, h \circ \widehat{\phi}_{g,C}) - S(x, f) \right)^2 + \mathbb{E}_{x \sim P_{\text{out}}} \left( S(x, h \circ \widehat{\phi}_{g,C}) - S(x, f) \right)^2.
\tag{6}
$$

The MSE terms are computed with both the ID and OOD data because we want to ensure that the ROC curves corresponding to the two score functions are close to each other (which requires OOD data). Finally, we include a regularization term to maximize the separability metric between the detected-ID and detected-OOD data in the concept-score space, resulting in our final concept learning objective:

$$
\operatorname*{argmax}_{C,\, g} \;\; \mathbb{E}_{(x,y) \sim P_{\text{in}}} \left[ \log h_y(g(v_C(x))) \right] + \lambda_{\text{expl}} R_{\text{expl}}(C) - \lambda_{\text{mse}} J_{\text{mse}}(C, g) - \lambda_{\text{norm}} J_{\text{norm}}(C, g) + \lambda_{\text{sep}} J_{\text{sep}}(C).
\tag{7}
$$

The $\lambda$ coefficients are non-negative hyper-parameters that are further discussed in Section 4. We note that both $J_{\text{mse}}(C, g)$ and $J_{\text{sep}}(C)$ depend on the OOD detector. We use the SGD-based Adam optimizer (Kingma & Ba, 2014) to solve the learning objective. The expectations involved in the objective terms are calculated using sample estimates from the training ID and OOD datasets. Specifically, $D^{\text{tr}}_{\text{in}}$ and $D^{\text{tr}}_{\text{out}}$ are used to compute the expectations over $P_{\text{in}}$ and $P_{\text{out}}$, respectively. Our complete concept learning procedure is summarized in Algorithm 1 (Appendix A.4).

4. EXPERIMENTS

We briefly describe the experimental setup here and provide additional details in Appendix B. Datasets. For the ID dataset, we use Animals with Attributes (AwA) (Xian et al., 2018) with 50 animal classes, and split it into a train set (29841 images), validation set (3709 images), and test set (3772 images). We use the MSCOCO dataset (Lin et al., 2014) as the auxiliary OOD dataset D tr out for training and validation. For the OOD test dataset D te out , we follow a common setting in the literature of large-scale OOD detection (Huang & Li, 2021) and use three different image datasets: Places365 (Zhou et al., 2017) , SUN (Xiao et al., 2010) , and Textures (Cimpoi et al., 2014) . Models. We apply our framework to interpret five prominent OOD detectors from the literature: MSP (Hendrycks & Gimpel, 2016) , ODIN (Liang et al., 2018) , Generalized-ODIN (Hsu et al., 2020) , Energy (Liu et al., 2020) and Mahalanobis (Lee et al., 2018) . The OOD detectors are paired with the widely-used Inception-V3 model (Szegedy et al., 2016) (following the setup in prior works (Yeh et al., 2020; Ghorbani et al., 2019; Kim et al., 2018) ) trained on the Animals-with-Attributes (AwA) dataset (Xian et al., 2018) , which has a test accuracy of 92.10%.

Metrics.

For each set of concepts learned with different OOD detectors and hyperparameters, we report the classification completeness $\eta_f(C)$, detection completeness $\eta_{f,S}(C)$, and the relative concept separability metric (defined below). In contrast to the completeness scores, which are almost always bounded to the range $[0, 1]$, it is hard to gauge the possible range of the separability score $J_{\text{sep}}(C)$ (or $J^y_{\text{sep}}(C)$) across different settings (datasets, classification models, and OOD detectors), and whether the value represents a significant improvement in separability. Hence, we define the relative concept separability, which captures the relative improvement in concept separability using concepts $C$ compared to a different set of concepts $C'$, as follows:

$$
J_{\text{sep}}(C, C') = \operatorname{Median}\left( \left\{ \frac{J^y_{\text{sep}}(C) - J^y_{\text{sep}}(C')}{J^y_{\text{sep}}(C')} \right\}_{y=1}^{L} \right).
$$

We choose $C'$ to be the set of concepts learned by the baseline (Yeh et al., 2020), which is a special case of our learning objective when $\lambda_{\text{mse}} = \lambda_{\text{norm}} = \lambda_{\text{sep}} = 0$. The set of concepts $C$ is obtained via our concept learning objective, with various combinations of hyperparameter values.
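The relative concept separability is a median of per-class relative improvements, which the following short sketch makes concrete. The per-class separability values are made up for illustration, and the function name is hypothetical.

```python
from statistics import median

def relative_separability(J_C, J_C_prime):
    """Median over classes of (J_sep^y(C) - J_sep^y(C')) / J_sep^y(C').

    J_C, J_C_prime: per-class separability scores for the concept sets C and C',
    listed in the same class order. Values for C' are assumed to be nonzero.
    """
    return median((a - b) / b for a, b in zip(J_C, J_C_prime))

# Hypothetical per-class separability for three classes.
J_ours = [2.0, 3.0, 1.5]
J_baseline = [1.0, 2.0, 1.5]
rel = relative_separability(J_ours, J_baseline)  # per-class ratios are 1.0, 0.5, 0.0
```

Using the median rather than the mean keeps the summary from being dominated by a few classes whose baseline separability is close to zero.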

Table 1: Hyperparameters and test OOD datasets. The ID dataset is AwA for both training and test, and the auxiliary OOD dataset is MSCOCO. Hyperparameters are given in the order $(\lambda_{\text{mse}}, \lambda_{\text{norm}}, \lambda_{\text{sep}})$, and their values are set based on the scale of the corresponding regularization terms in Eqn. (7), for a specific choice of the OOD detector. Further investigation, including an ablation study on each regularization term, can be found in Appendix B.2. Across the rows (for a given OOD detector and OOD dataset), the best value is boldfaced, and the second-best value is underlined. The 95% confidence intervals are estimated by bootstrapping the test set over 200 trials.

Columns: OOD detector | Hyperparameters | $\eta_f(C)$ ↑ | Places: $\eta_{f,S}(C)$ ↑, $J_{\text{sep}}(C, C')$ ↑ | SUN: $\eta_{f,S}(C)$ ↑, $J_{\text{sep}}(C, C')$ ↑ | Textures: $\eta_{f,S}(C)$ ↑, $J_{\text{sep}}(C, C')$ ↑

MSP (0, 0, 0) 0.

4.1. EFFECTIVENESS OF OUR METHOD

In this subsection, we carry out experiments to answer the following question: does our concept learning objective effectively encourage concepts to have the desired properties of detection completeness and concept separability? Table 1 summarizes the results of concept learning with various combinations of hyperparameters for the proposed regularization terms in Eqn. (7): i) all the hyperparameters are set to 0 (first row); ii) only the terms directly relevant to detection completeness (i.e., reconstruction error $J_{\text{norm}}(C, g)$ and mean-squared error $J_{\text{mse}}(C, g)$) are included (second row); iii) only the term responsible for concept separability, $J_{\text{sep}}(C)$, is included (third row); iv) all the regularization terms are included (fourth row). In all cases, we observe that the baseline achieves a high classification completeness score, but the lowest detection completeness and concept separability. This indicates that the concepts discovered by Yeh et al. (2020) are sufficient to describe the DNN classifier, but are inappropriate for explaining the OOD detector. In contrast, compared to the baseline, our method significantly improves the detection completeness (and even the classification completeness) by setting $\lambda_{\text{mse}} > 0$ and $\lambda_{\text{norm}} > 0$, and the concept separability by setting $\lambda_{\text{sep}} > 0$. Moreover, combining all three terms gives the best results in almost all cases.

Detection completeness and accurate reconstruction of Z. Additionally, we examine whether the proposed evaluation metrics are well-aligned with the interpretability of the resulting concept-based explanations. In Fig. 3, we observe that the concepts of Yeh et al. (2020), which have low detection completeness ($\eta_{f,S}(C) = 0.782$ for MSP and $\eta_{f,S}(C) = 0.682$ for Energy), lead to a strong mismatch between the concept-world and canonical-world score distributions on both ID data and OOD data. In contrast, concepts learned by our method with high detection completeness ($\eta_{f,S}(C) = 0.961$ for the MSP detector, and $\eta_{f,S}(C) = 0.941$ for the Energy detector) approximate the target score distributions more closely on both ID data and OOD data. Reducing the performance gap of the OOD detector between the canonical world and the concept world thus leads to more accurate explanations for OOD detectors.

4.2. CONCEPT-BASED EXPLANATIONS FOR OOD DETECTORS

Contribution of each concept to detection. The proposed concept learning algorithm learns concepts for both the classifier and OOD detector considering all the classes, and we address the following question: how much does each concept contribute to the detection results for inputs predicted into a particular class? Recent works have adopted the Shapley value from the coalitional game theory literature (Shapley, 1953; Fujimoto et al., 2006) for scoring the importance of a feature subset towards the predictions of a model (Chen et al., 2018; Lundberg & Lee, 2017; Sundararajan & Najmi, 2020). Extending this idea, we modify the characteristic function of the Shapley value using our per-class detection completeness metric (Eqn. (11) in Appendix A.2). The modified Shapley value of a concept $c_i \in C$ with respect to the predicted class $j \in [L]$ is defined as

$$
\text{SHAP}(\eta^j_{f,S}, c_i) := \sum_{C' \subseteq C \setminus \{c_i\}} \frac{(m - |C'| - 1)! \; |C'|!}{m!} \left[ \eta^j_{f,S}(C' \cup \{c_i\}) - \eta^j_{f,S}(C') \right],
$$

where $C'$ is a subset of $C$ excluding concept $c_i$, and $\eta^j_{f,S}$ is the per-class detection completeness with respect to class $j$. This Shapley value captures the average marginal contribution of concept $c_i$ towards explaining the decisions of the OOD detector for inputs that are predicted into class $j$.

Finally, we interpret the behavior of the given OOD detector by plotting the concept score patterns with respect to the concepts ranked by the above Shapley importance score. Figure 4 illustrates the generated explanations given correctly-detected inputs (ID / OOD input detected as ID / OOD; first row of the figure), and incorrectly-detected inputs (ID / OOD input detected as OOD / ID; second row of the figure). Overall, we observe that the OOD detector predicts an input as ID when the concept scores show a similar pattern to the ID profile, and predicts an input as OOD when the concept score pattern is far from the ID profile. For instance, the fourth input is an OOD image from the Places dataset but is detected as ID, since its score for C54 (furry dog skin) is as high as that of typical ID Collie images (and the image does indeed contain this feature). Thus, we conclude this to be an understandable mistake by the OOD detector.

We also provide a qualitative comparison between Yeh et al. and our method in the resulting explanations for the OOD detector. We observe that Yeh et al. fails to generate visually distinguishable explanations between detected-ID and detected-OOD inputs. The separation between the solid green bars and the orange bars in each figure becomes more visible in our explanations, which enables more intuitive interpretation for human users and reflects our design goal for concept separability. It is also noteworthy that our concepts that are most important for distinguishing ID Collie from OOD Collie (i.e., C54 and C30) are more specific, finer-grained characteristics of Collie, while Yeh et al. finds concepts that are vaguely similar to the features of a dog, but rather generic (i.e., C43 and C29). This is the reason we require a larger number of concepts to achieve detection completeness and concept separability, compared to solely considering the classification completeness.
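To make the modified Shapley value in this subsection concrete, the sketch below enumerates all subsets $C'$ exactly. The characteristic function `eta` is a stand-in supplied by the caller; here it is a hypothetical additive toy function rather than the per-class detection completeness, which would require retraining or re-evaluating the concept-world detector for each subset. Exact enumeration costs $O(2^m)$ evaluations per concept, so it is only feasible for small $m$; sampling-based approximations are a common alternative.

```python
from itertools import combinations
from math import factorial

def shapley_values(concepts, eta):
    """Exact Shapley value of each concept under the characteristic function eta.

    concepts: list of hashable concept identifiers.
    eta:      function mapping a frozenset of concepts to a real-valued score.
    """
    m = len(concepts)
    values = {}
    for c in concepts:
        others = [o for o in concepts if o != c]
        total = 0.0
        for k in range(m):  # k = |C'| ranges over 0, ..., m - 1
            weight = factorial(m - k - 1) * factorial(k) / factorial(m)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (eta(s | {c}) - eta(s))
        values[c] = total
    return values

# Hypothetical additive characteristic function: each concept adds a fixed amount.
toy_gain = {"C54": 0.5, "C30": 0.3, "C1": 0.2}
eta_toy = lambda subset: sum(toy_gain[c] for c in subset)
shap = shapley_values(list(toy_gain), eta_toy)
```

For an additive characteristic function, each concept's Shapley value equals its own contribution, which gives a simple sanity check on the weights.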

5. CONCLUSION

In this work, we make a first attempt at developing an unsupervised and human-interpretable explanation method for black-box OOD detectors based on high-level concepts derived from the internal layer representations of a (paired) DNN classifier. We propose novel metrics, viz. detection completeness and concept separability, to evaluate the completeness (sufficiency) and quality of the learned concepts for OOD detection. The proposed concept learning method is quite general and applies to a broad class of off-the-shelf OOD detectors. Through extensive experiments and qualitative examples, we demonstrate the practical utility of our method for understanding and debugging an OOD detector. We discuss additional aspects of our method such as the choice of auxiliary OOD dataset, human subject study, and societal impact in Appendix E.

We refer to these per-class variations as per-class concept separability. The scatter matrices $S^y_w$ and $S^y_b$ are defined similarly to Eq. (3), using the per-class subset of concept scores $V^y_{\text{in}}(C)$ or $V^y_{\text{out}}(C)$, and the mean concept-score vectors from the detected-ID and detected-OOD datasets are also defined at a per-class level.

A.4 ALGORITHM FOR CONCEPT LEARNING

To provide the readers with a clear overview of the proposed concept learning approach, we include Algorithm 1. Note that in line 7 of Algorithm 1, the dimension reduction step in $V_{in}(C) = \{ v_C(x),\ x \in D^{tr}_{in} \cup D^{tr}_{out} : D_\gamma(x, f) = 1 \}$ and $V_{out}(C) = \{ v_C(x),\ x \in D^{tr}_{in} \cup D^{tr}_{out} : D_\gamma(x, f) = 0 \}$ involves the maximum function, which is not differentiable; specifically, the step $v_{c_i}(x) = \max_{p,q} |\langle \phi_{p,q}(x), c_i \rangle|$. For calculating the gradients (backward pass), we use the log-sum-exp function with a temperature parameter to get a differentiable approximation of the maximum function, i.e.,

$$\max_{p,q} |\langle \phi_{p,q}(x), c_i \rangle| \;\approx\; \alpha \log \sum_{p,q} \exp\!\left( \frac{1}{\alpha}\, |\langle \phi_{p,q}(x), c_i \rangle| \right) \quad \text{as } \alpha \to 0.$$

In our experiments, we set the temperature constant $\alpha = 0.001$ upon checking that the approximate value of $v_{c_i}(x)$ is sufficiently close to the original value using the maximum function.

    3: Compute the prediction accuracy of the concept-world classifier $f^{con}$ using $D^{tr}_{in}$.
    4: Compute the explainability regularization term as defined in Yeh et al. (2020).
    5: Compute difference of feature representation between canonical world and concept world (i.e., $J_{norm}(C, g)$).
    6: Compute difference of detector outputs between canonical world and concept world using Eqn. (6).
    7: Compute $V_{in}(C)$ and $V_{out}(C)$ using $D^{tr}$, $D_\gamma$ and $C$.
    8: Compute separability between $V_{in}(C)$ and $V_{out}(C)$ using Eqn. (10) or Eqn. (13).
    9: Perform a batch-SGD update of $C$ and $g$ using Eqn. (7) as the objective.
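The log-sum-exp approximation of the maximum described above can be illustrated with a small NumPy sketch. The function names `smooth_max` and `concept_score`, the array shapes, and the max-subtraction trick for numerical stability are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def smooth_max(values, alpha=1e-3):
    """Temperature-scaled log-sum-exp approximation of max(values)."""
    values = np.asarray(values, dtype=np.float64)
    m = values.max()
    # Subtracting the maximum before exponentiating avoids overflow
    # when the temperature alpha is very small.
    return m + alpha * np.log(np.sum(np.exp((values - m) / alpha)))

def concept_score(patch_features, concept, alpha=1e-3):
    """Return (exact, smoothed) concept scores for one input.

    patch_features: hypothetical (a*b, d) matrix of flattened spatial features.
    concept: hypothetical (d,) concept vector.
    """
    projections = np.abs(patch_features @ concept)
    return projections.max(), smooth_max(projections, alpha)

rng = np.random.default_rng(0)
features = rng.normal(size=(49, 16))   # e.g. a 7x7 feature map with 16 channels
concept = rng.normal(size=16)
exact, approx = concept_score(features, concept)
```

The smoothed value always lies between the exact maximum and the exact maximum plus `alpha * log(n)`, where `n` is the number of spatial positions, which is why a small temperature keeps the two values close.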

A.5 ACCURATE RECONSTRUCTION OF CLASSIFIER OUTPUTS

We have performed additional experiments to understand whether the proposed method can provide improvements in the classification setting. Let $C_1$ denote the concept matrix learned by the method of Yeh et al. Let $C_2$ denote the concept matrix learned by our method with $\lambda_{mse} = \lambda_{sep} = 0$ and $\lambda_{norm} = 0.1$ (set based on the scale of the regularization term $J_{norm}$). The idea is that we exclude the terms in the concept-learning objective (Eqn. 7) that depend on the OOD detector, but include the $\ell_2$ norm based reconstruction error of the layer representation. To evaluate the utility of these two sets of concepts for classification, we calculated the per-sample Hellinger distance between the predicted class probabilities of the original classifier and the concept-world classifier (based on either $C_1$ or $C_2$). Figure 5 compares the empirical distribution of the Hellinger distance for both sets of concepts $C_1$ and $C_2$. We observe that the distribution is more skewed towards zero, with a higher density near zero and a shorter (right) tail, in the case of $C_2$ (red curve) compared to $C_1$ (blue curve). This suggests that the class predictions are more accurately reconstructed by the concepts learned using our method with only the reconstruction error-based regularization. This can in turn benefit the concept-based explanations for the classifier.
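As a concrete illustration of the per-sample comparison above, the following sketch computes the Hellinger distance between two batches of predicted class probabilities using its standard definition for discrete distributions. The arrays `p_original` and `p_concept` are hypothetical stand-ins for the outputs of the original classifier and the concept-world classifier.

```python
import numpy as np

def hellinger_distance(p, q):
    """Per-sample Hellinger distance between rows of p and q.

    p, q: arrays of shape (n_samples, n_classes) whose rows sum to 1.
    The result lies in [0, 1]; 0 means identical distributions.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1))

# Hypothetical class-probability outputs for three inputs and two classes.
p_original = np.array([[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]])
p_concept = np.array([[0.9, 0.1], [0.6, 0.4], [0.0, 1.0]])
distances = hellinger_distance(p_original, p_concept)
```

A histogram or kernel density estimate of `distances` over a test set would then give the kind of empirical distribution that the comparison between $C_1$ and $C_2$ relies on.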

A.6 ACCURATE RECONSTRUCTION OF OOD SCORES

In addition to Figure 3, where we compared the reconstruction accuracy of OOD scores using the concepts of Yeh et al. (2020) and ours, Figure 6 confirms that the same observation applies to the Energy detector.

(b) Distribution of $S^{con}(x, f)$ in the concept world, using concepts by Yeh et al. (2020). (c) Distribution of $S^{con}(x, f)$ in the concept world, using concepts by ours.

A.7 CONCEPT SEPARABILITY AND VISUAL DISTINCTION IN EXPLANATIONS

In Fig. 7, taking the Energy detector as an example, we average the concept scores $V_{in}(C)$ (or $V_{out}(C)$) over the inputs that are predicted as class $y$ and detected as ID (or OOD). We observe a noticeably more distinguishable pattern between detected-ID and detected-OOD concept scores when using concepts with higher concept separability ($J_{sep}(C, C') = 3.421$), compared to the concepts with low concept separability ($J_{sep}(C, C') = 0.675$) by Yeh et al. (2020). These observations confirm our design motivation for the concept separability metric: a higher value of the metric enables better visual distinction between the concept score patterns, suggesting better interpretability for humans.

B IMPLEMENTATION DETAILS

We run all experiments with TensorFlow, Keras, and NVIDIA GeForce RTX 2080Ti GPUs. We use test-set bootstrapping with 200 runs to obtain the confidence interval for each hyperparameter set of concept learning.

B.1 EXPERIMENTAL SETTING

OOD Datasets. For the auxiliary OOD dataset for concept learning ($D^{tr}_{out}$), we use the unlabeled images from the MSCOCO dataset (120K images in total) (Lin et al., 2014). We carefully curate the dataset to make sure that no images contain animal objects that overlap with our ID dataset (i.e., the 50 animal classes of Animals-with-Attributes (Xian et al., 2018)), then randomly sample 30K images. For the OOD datasets for evaluation ($D^{te}_{out}$), we use the high-resolution image datasets processed by Huang & Li (2021).
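The test-set bootstrapping mentioned at the start of this section can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how the interval is derived from the 200 runs, so the percentile interval used here, the pairwise AUROC helper `auroc`, and the synthetic scores are all hypothetical choices rather than the authors' procedure.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def bootstrap_ci(metric_fn, labels, scores, n_runs=200, level=0.95, seed=0):
    """Resample the test set with replacement and return (mean, lower, upper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    n = len(labels)
    stats = np.empty(n_runs)
    for r in range(n_runs):
        idx = rng.integers(0, n, size=n)          # one bootstrap resample
        stats[r] = metric_fn(labels[idx], scores[idx])
    # Assumption: percentile interval over the bootstrap statistics.
    lower, upper = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return stats.mean(), lower, upper

# Synthetic detector scores: ID inputs (label 1) score higher than OOD (label 0).
rng = np.random.default_rng(1)
labels = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(2.0, 1.0, 100), rng.normal(0.0, 1.0, 100)])
mean_auc, lower, upper = bootstrap_ci(auroc, labels, scores)
```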

B.2 ADDITIONAL RESULTS

Ablation study for concept learning. We perform an ablation study that isolates the effect of each regularization term in our concept learning objective (Eqn. 7) on our evaluation metrics: classification completeness, detection completeness, and relative concept separability. We also observe the coherency among the learned concepts by varying $\lambda_{mse}$ and $\lambda_{sep}$. Coherency of concepts was introduced by Ghorbani et al. (2019) to ensure that the generated concept-based explanations are understandable to humans. It captures the idea that the examples for a concept should be similar to each other, while being different from the examples corresponding to other concepts. For the specific case of the image domain, the receptive fields most correlated to a concept $i$ (e.g., "stripe pattern") should look different from the receptive fields for a different concept $j$ (e.g., "wavy surface of sea"). Yeh et al. (2020) proposed to quantify the coherency of concepts as

$$\frac{1}{mK} \sum_{i=1}^{m} \sum_{x' \in T_{c_i}} \langle \phi(x'), c_i \rangle,$$

where $T_{c_i}$ is the set of $K$-nearest neighbor patches of the concept vector $c_i$ from the ID training set $D^{tr}_{in}$. We use this metric to quantify how understandable our concepts are for different hyperparameter choices.

Figure 8 shows that, in line with our intuition, a large $\lambda_{mse}$ helps to improve the detection completeness. Having a non-zero $\lambda_{mse}$ also improves the classification completeness even further and, surprisingly, concept separability as well, without sacrificing the coherency of concepts. On the other hand, on the right side of Figure 8, we observe that large relative concept separability with large $\lambda_{sep}$ comes at the expense of lower detection completeness and coherency. Recall that when visualizing what each concept represents, for the human's convenience we apply a threshold of 0.8 and present only the examples that pass it (see Figure 10). Low coherency with respect to Eqn. 14 (i.e., 0.768 with $\lambda_{sep} = 75$) means that far fewer examples can pass the threshold, so users can hardly understand what the concepts at hand entail. This observation suggests that one needs to balance concept coherency against concept separability, depending on which property would be more useful for a specific application of concepts.

Transferability of concepts across OOD detectors. Our work essentially suggests using a different set of concepts for each target OOD detector, since $J_{mse}(C, g)$ and $J_{sep}(C)$ in Eqn. (7) depend on the choice of OOD detector. In practice, however, one might not have enough computational capacity to prepare multiple sets of concepts for all types of OOD detectors at hand. Here, we inspect whether the concepts targeted for a certain type of OOD detector are also suitable for other OOD detectors. We explore the transferability of concepts targeted to the MSP detector (Hendrycks & Gimpel, 2016) in Table 2a, and to the Energy detector (Liu et al., 2020) in Table 2b. Not surprisingly, we observe that concepts targeted for Energy yield the best detection completeness score when tested with the same type of OOD detector, but still make a meaningful improvement with other detectors as well. Relative concept separability transfers even better across different OOD detectors. For instance, with Textures, the best relative concept separability is achieved with the ODIN detector (i.e., $J_{sep}(C, C') = 0.862$), which is even higher than the best result we could obtain using the set of concepts targeted for ODIN (i.e., $J_{sep}(C, C') = 0.414$ with $\lambda_{mse} = 0$, $\lambda_{norm} = 0$, $\lambda_{sep} = 50$ in Table 1).

C DISCUSSION ON THE CHOICE OF AUXILIARY OOD DATASET IN CONCEPT LEARNING

Under circumstances where access to an auxiliary OOD dataset for concept learning is not feasible, we suggest that one could use generative methods to generate a synthetic dataset, or apply data augmentation techniques (Hendrycks et al., 2022). Figure 9 shows an example of an AwA image augmented by the method of Hendrycks et al. (2022). We evaluate the effectiveness of our concept learning objective when such an augmented AwA train set is used as the auxiliary OOD dataset.

| OOD detector | ($\lambda_{mse}$, $\lambda_{norm}$, $\lambda_{sep}$) | $\eta_f(C)$ ↑ | Places: $\eta_{f,S}(C)$ ↑ | Places: $J_{sep}(C, C')$ ↑ | SUN: $\eta_{f,S}(C)$ ↑ | SUN: $J_{sep}(C, C')$ ↑ | Textures: $\eta_{f,S}(C)$ ↑ | Textures: $J_{sep}(C, C')$ ↑ |
| Energy | (1, 0.1, 50) | 0.955 ± 0.0006 | 0.940 ± 0.0005 | 1.746 ± 0.0712 | 0.9410 ± 0.0005 | 3.0703 ± 0.0580 | 0.927 ± 0.0005 | 3.417 ± 0.1419 |

Moreover, in Fig. 11, we compare the important concepts discovered by the baseline method of Yeh et al. (2020) (denoted "baseline") vs. ours. With the baseline, where the learned concepts are solely intended for reconstructing the behavior of the classifier, we observe that the interpretation of both the classifier and the OOD detector depends on a common set of concepts (i.e., concepts 32, 10, and 47). On the other hand, the concepts learned by our method focus on reconstructing the behavior of both the OOD detector and the classifier. In this case, we observe that distinct sets of important concepts are selected for classification and OOD detection. We also observe that our method requires more concepts in order to address the decisions of both the classifier and the OOD detector. For instance, the numbers of concepts obtained by our method and the baseline are 78 and 53, respectively, out of a total of 100 concepts, after the duplicate removal of concept vectors. In short, when the concepts are only targeted at explaining the DNN classifier (as in the baseline of Yeh et al. (2020)), the behavior of the OOD detector is merely described by the common set of concepts that are important for the DNN classifier. On the other hand, when not only the DNN classifier but also the OOD detector is taken into consideration during concept learning (i.e., our method), we obtain a more diverse and expanded set of concepts, and different concepts play a major role in interpreting the classification and detection results.

We demonstrate randomly sampled images that are predicted by the classifier into this class. We compare the top-4 important concepts to describe the DNN classifier (and Energy detector), ranked by the Shapley value based on classification completeness $\mathrm{SHAP}(\eta^j_f, c_i)$ (and detection completeness $\mathrm{SHAP}(\eta^j_{f,S}, c_i)$). "Baseline" corresponds to the case when the concepts are learned with $\lambda_{mse} = \lambda_{norm} = \lambda_{sep} = 0$, whereas "Ours" corresponds to the concepts learned with $\lambda_{mse} = 1$, $\lambda_{norm} = 0.1$, $\lambda_{sep} = 0$.

D.2 MORE EXAMPLES OF OUR CONCEPT-BASED EXPLANATION

In Figure 12, we provide additional examples of our concept-based explanations. To verify the important concepts identified by our modified Shapley value, we perform a counterfactual analysis that addresses the following question: if the OOD detector thought the input had a different score for this concept, would the detection result be different? As we do not assume access to ground-truth annotations for concepts, we construct concept-score profiles of detected-ID (or detected-OOD) inputs from a held-out ID (or OOD) dataset, and refer to these as the ID (or OOD) concept profile. Guided by the ID and OOD concept profiles, we intervene on the concept scores of mis-detected inputs. Specifically, for ID data mis-detected as OOD, we update their concept scores using the ID profile, and similarly, for OOD data mis-detected as ID, their concept scores are updated with the OOD profile. The number of concepts to be intervened on can be varied. As shown in Figure 13, when intervening on a larger number of important concepts (ranked by $\mathrm{SHAP}(\eta_{f,S}, c_i)$), we observe an improved performance of the OOD detector in the concept world.
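The intervention step described above can be sketched in a few lines. The text does not spell out the exact update rule, so this sketch makes two assumptions that should be read as illustrative only: a concept profile is taken to be a single vector of typical scores (for example, the mean over held-out detected-ID inputs), and intervening means overwriting the scores of the top-k most important concepts with the profile values. The function name `intervene` and all numbers are hypothetical.

```python
import numpy as np

def intervene(concept_scores, profile, importance, k):
    """Overwrite the k most important concept scores with profile values.

    concept_scores: (m,) scores of one mis-detected input.
    profile: (m,) assumed ID (or OOD) concept profile.
    importance: (m,) per-concept importance, e.g. Shapley-based scores.
    """
    # Indices of the k concepts with the largest importance.
    top_k = np.argsort(importance)[::-1][:k]
    updated = np.array(concept_scores, dtype=np.float64, copy=True)
    updated[top_k] = np.asarray(profile, dtype=np.float64)[top_k]
    return updated

# An ID input mis-detected as OOD, pulled toward the ID profile on 2 concepts.
scores = np.array([0.1, 0.2, 0.3, 0.4])
id_profile = np.array([0.9, 0.8, 0.7, 0.6])
importance = np.array([0.5, 0.1, 0.9, 0.2])
updated = intervene(scores, id_profile, importance, k=2)
```

Re-running the concept-world detector on `updated` for increasing `k` would then show whether the detection result changes as more important concepts are intervened on.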

E DISCUSSION AND SOCIETAL IMPACT

Auxiliary OOD Dataset. A limitation of our approach is its requirement of an auxiliary OOD dataset for concept learning, which could be hard to access in certain applications. To overcome this, one research direction would be to design generative models that simulate domain shifts or anomalous behavior and could create the auxiliary OOD dataset synthetically, allowing additional control over the extent of distributional changes that the resulting concepts can deal with (see Appendix C for further discussion).

Human Subject Study. Performing a human-subject (or user) study would be the ultimate way to evaluate the effectiveness of explanations, but this remains largely unexplored even for in-distribution classification tasks. We emphasize that designing such a usability test for OOD detectors would be even more challenging than for in-distribution classification tasks, due to the characteristics of the OOD detection task. For in-distribution classifiers, users could potentially generate hypotheses about which high-level concepts should contribute to the class prediction, and compare their hypotheses to the provided explanations to determine the classifier's reliability. On the other hand, assessing the reliability of OOD detection involves checking whether a given input belongs to any of the natural distributions of concepts; this is essentially limited by whether users' mental models of such global distributions can be accurately probed via a couple of presented local instances. We believe that designing a thorough probing method for human interpretability of OOD detection would be an interesting yet challenging research quest by itself, and our paper does not address it.

Societal Impact. Our work helps explain the detection results of OOD detectors, giving practitioners the ability to explain the model's decision to invested parties. Our explanations can also be used to keep a data point as an understood mistake by the model rather than throwing it away without further analysis, which could help guide how to improve the OOD detector with respect to the concepts. However, this also means that more trust is placed in the human practitioner not to abuse or misrepresent the explanations.



We focus on images, but the proposed method extends to other domains.

We flatten the first two dimensions of the feature representation, thus changing an $a_\ell \times b_\ell \times d_\ell$ tensor to an $a_\ell b_\ell \times d_\ell$ matrix, where $a_\ell$ and $b_\ell$ are the filter size and $d_\ell$ is the number of channels.

In our problem, the two classes correspond to detected-ID and detected-OOD.

See Appendix A.2 and A.3 for per-class variations of detection completeness and concept separability.

This dependence may not be obvious for the separability term, but it is clear from its definition.

In Figure 4, after concept learning with $m = 100$ and duplicate removal, we find 44 non-redundant concepts for Yeh et al. ($\lambda_{mse} = \lambda_{norm} = \lambda_{sep} = 0$), and 100 distinct concepts for ours ($\lambda_{mse} = 1$, $\lambda_{norm} = 0.1$, $\lambda_{sep} = 10$).



(a) Correct detection: ID (or OOD) dolphin image correctly detected as ID (or OOD). (b) Wrong detection: ID (or OOD) dolphin image falsely detected as OOD (or ID).

$[n]$ denotes $\{1, \cdots, n\}$ for a positive integer $n$. Boldface symbols are used to denote both vectors and tensors. $\langle x, x' \rangle$ denotes the standard inner product between a pair of vectors. The indicator function $\mathbb{1}[c]$ takes value 1 (0) when the condition $c$ is true (false).

ID and OOD Datasets. Consider a labeled ID training dataset $D^{tr}_{in} = \{(x_i, y_i),\ i = 1, \cdots, N^{tr}_{in}\}$ sampled from the distribution $P_{in}$. We assume the availability of an unlabeled training dataset $D^{tr}_{out} = \{x_i,\ i = 1, \cdots, N^{tr}_{out}\}$ from a different distribution, referred to as the auxiliary OOD dataset. Similarly, we define the ID test dataset (sampled from $P_{in}$) as $D^{te}_{in}$, and the OOD test dataset as $D^{te}_{out}$. Note that the auxiliary OOD dataset $D^{tr}_{out}$ and the test OOD dataset $D^{te}_{out}$ are from different distributions. All the OOD datasets are unlabeled since their label space is usually different from $\mathcal{Y}$.

Figure 2: Our two-world view of the classifier and OOD detector.

(a) Target distribution of $S(x, f)$ in the canonical world. (b) Reconstructed distribution of $S^{con}(x, f)$ in the concept world, using concepts by Yeh et al. (2020). (c) Reconstructed distribution of $S^{con}(x, f)$ in the concept world, using concepts by ours.

Figure 3: Detection completeness and estimated density of OOD score S(x, f ) from MSP detector. Concepts by ours are learned using λ mse = 10, λ norm = 0.1, λ sep = 50. Comparison is made between AwA test set (ID; blue) vs. SUN (OOD; red).

Figure 4: Concept-based explanations for Energy OOD detector using concepts by Yeh et al. (2020) vs. ours. Images are randomly selected from AwA test set (ID) and Places (OOD), and all predicted to class "Collie". ID profile shows the normal concept-score patterns for ID Collie images.

Algorithm 1: Learning concepts for OOD detector
INPUT: Entire training set $D^{tr} = \{D^{tr}_{in}, D^{tr}_{out}\}$, entire validation set $D^{val} = \{D^{val}_{in}, D^{val}_{out}\}$, classifier $f$, detector $D_\gamma$.
INITIALIZE: Concept vectors $C = [c_1 \cdots c_m]$ and parameters of the network $g$.
OUTPUT: $C$ and $g$.
1: Calculate threshold $\gamma$ for $D_\gamma$ using $D^{val}$ as the score at which true positive rate is 95%.
2: for $t = 1, \ldots, T$ epochs do

Figure 5: Examples for correct detection

Figure 6: Detection completeness and estimated density of OOD score S(x, f ) from Energy detector. Comparison is made between AwA test set (ID; blue) vs. SUN (OOD; red).

Figure 7: Concept separability and visual distinction in the concept score patterns. For the class "Giraffe", we compare the concept score patterns using two different sets of concepts. Left: Averaged scores of the top-10 important concepts out of the concepts learned by Yeh et al. (2020). Right: Averaged scores of the top-10 important concepts out of the concepts learned by our method ($\lambda_{mse} = 1$, $\lambda_{norm} = 0.1$, $\lambda_{sep} = 50$ with Energy detector). Concept importance is measured using the Shapley value of Eqn. (9).

(a) Ablation study varying λmse; we set λnorm = 0.1, λsep = 0 (b) Ablation study varying λsep; we set λmse = 0, λnorm = 0.

Figure 8: Ablation study with respect to $J_{mse}(C, g)$ and $J_{sep}(C)$. We fix $m = 100$, $\lambda_{expl} = 10$, and the OOD detector used for concept learning and evaluation is Energy (Liu et al., 2020).

Figure 9: Random example of augmented AwA dataset. Left: original image in AwA train set. Right: corresponding image augmented by Hendrycks et al. (2022).

Figure 10: Top-6 important concepts for Energy with respect. Left: class "Sheep". Right: class "Giraffe"

(a) class "Collie", Energy OOD detector. Images randomly selected from AwA test set and SUN. (b) class "Dolphin", Energy OOD detector. Images randomly selected from AwA test set and Places.

(c) class "Dolphin", MSP OOD detector. Images randomly selected from AwA test set and Places.

Figure 12: Concept-based explanations using concepts by Yeh et al. vs. ours. ID profile shows the concept-score patterns for normal ID images.

do not support imposing required conditions into the concept discovery, Yeh et al. devised a learning-based approach where classification completeness and the saliency of concepts are optimized via a regularized objective given by

, 0.1, 50) | 0.991 ± 0.0005 | 0.973 ± 0.0009 | 1.813 ± 0.0268 | 0.969 ± 0.0010 | 4.000 ± 0.0094 | 0.945 ± 0.0006 | 3.662 ± 0.1005
± 0.0004 | 0.859 ± 0.0007 | 1.814 ± 0.0685 | 0.803 ± 0.0006 | 4.204 ± 0.0159 | 0.826 ± 0.0008 | 4.014 ± 0.2246
(10 6 , 0.1, 50) | 0.990 ± 0.0005 | 0.971 ± 0.0010 | 1.835 ± 0.0669 | 0.963 ± 0.0004 | 4.287 ± 0.0284 | 0.951 ± 0.0005 | 3.695 ± 0.1921

Results of concept learning with different parameter settings across various OOD detectors.


Transferability of concepts across different OOD detectors.

Results of concept learning with augmented AwA train set as auxiliary OOD in concept learning.

For each figure with a fixed choice of class prediction, we present receptive fields from ID test set corresponding to top concepts that contribute the most to the decisions of each OOD detector. All receptive fields passed the threshold test that the inner product between the feature representation and the corresponding concept vector is over 0.85.

Appendix

In Section A, we discuss the connection of the proposed concept separability to the Bhattacharyya distance and the per-class variations of detection completeness and concept separability, followed by the overall algorithm for concept learning. In Section B, we provide the detailed setup for the experiments. In Section C, we discuss whether our concept learning objective remains effective even when a synthesized auxiliary OOD dataset similar to the target ID data is used. In Section D, we illustrate additional examples of our concept-based explanations.

A CONCEPT LEARNING

A.1 CONNECTION TO THE BHATTACHARYYA DISTANCE

We note that the proposed separability metric in Section 3.2 is closely related to the Bhattacharyya distance (Bhattacharyya, 1943) for the special case when the concept scores from both ID and OOD data follow a multivariate Gaussian density. The Bhattacharyya distance is a well-known measure of divergence between two probability distributions, and it has the nice property of yielding an upper bound on the Bayes error rate in the two-class case (Fukunaga, 1990a). For the special case when the concept scores from both ID and OOD data follow a multivariate Gaussian with a shared covariance matrix, it can be shown that the Bhattacharyya distance reduces to the separability metric:

where $\mathrm{AUC}_j(h \circ \phi_{g,C})$ is the AUROC of the detector conditioned on the event that the class predicted by the concept-world classifier $h \circ \phi_{g,C}$ is $j$ (note that the denominator has the global AUROC). The baseline AUROC $b_r$ is equal to 0.5 as before. This per-class detection completeness is used in the modified Shapley value defined in Section 4.2.

A.3 PER-CLASS CONCEPT SEPARABILITY

In Section 3.2, we focused on the separability between the concept scores of ID and OOD data without considering the class prediction of the classifier. However, it would be more appropriate to impose a high separability between the concept scores on a per-class level. In other words, we would like the concept scores of detected-ID and detected-OOD data that are predicted by the classifier into any given class $y \in [L]$ to be well separated. Consider the set of concept-score vectors from the detected-ID (or detected-OOD) dataset that are also predicted into class $y$:

$$= (\mu^y_{out} - \mu^y_{in})^T (S^y_w)^{-1} (\mu^y_{out} - \mu^y_{in}). \qquad (13)$$
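A small NumPy sketch of the quadratic form on the right-hand side of Eqn. (13) is given below. Since Eq. (3) is not reproduced in this part of the document, the within-class scatter matrix is assumed here to be the sum of the scatter matrices of the detected-ID and detected-OOD concept scores; the small ridge term, the function names, and the toy data are likewise assumptions for illustration only.

```python
import numpy as np

def separability(v_in, v_out, ridge=1e-6):
    """(mu_out - mu_in)^T S_w^{-1} (mu_out - mu_in) for two sets of concept scores."""
    v_in = np.asarray(v_in, dtype=np.float64)
    v_out = np.asarray(v_out, dtype=np.float64)
    mu_in, mu_out = v_in.mean(axis=0), v_out.mean(axis=0)
    c_in, c_out = v_in - mu_in, v_out - mu_out
    # Assumption: within-class scatter is the sum of the two per-group scatters.
    s_w = c_in.T @ c_in + c_out.T @ c_out
    # Assumption: a small ridge keeps the linear solve well conditioned.
    s_w = s_w + ridge * np.eye(s_w.shape[0])
    diff = mu_out - mu_in
    return float(diff @ np.linalg.solve(s_w, diff))

def per_class_separability(scores, detected_id, predicted_class, y):
    """Restrict both groups to inputs predicted into class y, then compare them."""
    in_class = predicted_class == y
    return separability(scores[in_class & detected_id],
                        scores[in_class & ~detected_id])

# Toy data: one concept, five inputs, the last one predicted into another class.
scores = np.array([[0.0], [2.0], [4.0], [6.0], [10.0]])
detected_id = np.array([True, True, False, False, True])
predicted_class = np.array([0, 0, 0, 0, 1])
j_sep_class0 = per_class_separability(scores, detected_id, predicted_class, y=0)
```

In this toy case the two group means are 1 and 5 and the pooled scatter is 4, so the value is close to 16 / 4 = 4; larger values indicate that detected-ID and detected-OOD concept scores for that class are further apart relative to their spread.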

