POST-HOC CONCEPT BOTTLENECK MODELS

Abstract

Concept Bottleneck Models (CBMs) map inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice, as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck Models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining. The code for our paper can be found at https://github.com/mertyg/post-hoc-cbm.

1. INTRODUCTION

There is growing interest in developing deep learning models that are interpretable and yet still flexible. One approach is concept analysis (Kim et al., 2018), where the goal is to understand if and how high-level human-understandable features are "engineered" and used by neural networks. For instance, we may like to probe a skin lesion classifier to understand if the Irregular Streaks concept is encoded in the embedding space of the classifier and used later to make the prediction. Our work builds on the earlier idea of concept bottlenecks, specifically Concept Bottleneck Models (CBMs) (Koh et al., 2020). Concept bottlenecks are inspired by the idea that we can solve the task of interest by applying a function to an underlying set of human-interpretable concepts. For instance, when trying to classify whether a skin tumor is malignant, dermatologists look for different visual patterns; e.g., the existence of Blue-Whitish Veils can be a useful indicator of melanoma (Menzies et al., 1996; Lucieri et al., 2020). CBMs train an entire model in an end-to-end fashion by first predicting concepts (e.g. the presence of Blue-Whitish Veils), then using these concepts to predict the label. By constraining the model to rely only on a set of concepts and an interpretable predictor, we can explain what information the model is using when classifying an input by looking at the weights/rules in the interpretable predictor, and understand when the model made a particular mistake due to incorrect concept predictions.

While CBMs provide several of the benefits mentioned above, they have several key limitations:

1. Data: CBMs require access to concept labels during model training, i.e. the training data should be annotated with which concepts are present. Even though there are a number of densely annotated datasets such as CUB (Wah et al., 2011), this is particularly restrictive for real-world use cases, where training datasets rarely have concept annotations.

2. Performance: CBMs often do not match the accuracy of an unrestricted model, potentially reducing the incentive to use them in practice. When the concepts are not enough to solve the desired task, it is not clear how to improve the CBM to match the original model performance while retaining the interpretability benefits.

3. Model editing: Koh et al. (2020) discuss intervening on the model to fix the prediction for a single input, yet it is not shown how to holistically edit and improve the model itself. Intervening only changes the model behavior for a single sample, whereas global editing changes the model behavior as a whole. When the model picks up an unintended cue or learns spurious associations, editing the concept bottleneck can improve the model performance more generally than an intervention tailored toward one specific input. Prior work on CBMs does not discuss how to globally edit a model's behavior. Ideally, we would like to edit models with the help of human input in order to lower computational costs and remove assumptions about data access.

Our contributions. In this work, we propose the Post-hoc Concept Bottleneck Model (PCBM) to address these important challenges. PCBMs can convert any pre-trained model into a concept bottleneck model in a data-efficient manner, and enhance the model with the desired interpretability benefits. When the training data does not have concept annotations, which is often the case, PCBM can flexibly leverage concepts annotated in other datasets and natural language descriptions of concepts. When applicable, PCBMs can remove the laborious concept annotation process by leveraging multimodal models to obtain concept representations; this results in richer and more expressive bottlenecks built from natural language descriptions of concepts, making PCBMs more accessible in various settings.
Furthermore, when the available concepts are not sufficiently rich, we introduce a residual modeling step to the PCBM to recover the original black-box model's performance. In experiments across several tasks, we show that PCBMs achieve performance comparable to black-box models.

Figure 1: Overview of PCBMs. With the CAV approach, for each concept, e.g. stripes, we train a linear SVM to distinguish the embeddings of examples that contain the concept and use the vector normal to the boundary (the CAV). When annotations are hard to obtain, we can leverage multimodal models and use the text encoder to map each concept to a vector. Next, we project the embeddings produced by the backbone onto the concept subspace defined by the set of vectors. We then train an interpretable predictor to classify the examples from their projections. When the concept library is incomplete, we can construct a PCBM-h by sequentially introducing a residual predictor that maps the embeddings to the target space.

2. POST-HOC CONCEPT BOTTLENECK MODELS (PCBM)

There are two main steps to building a PCBM. We let f : X → R^d be any pretrained backbone model, where d is the size of the corresponding embedding space and X is the input space. For instance, f can be the image encoder of CLIP (Radford et al., 2021) or the layers up to the penultimate layer of a ResNet (He et al., 2016). An overview of PCBMs can be found in Figure 1, and we describe the steps in detail below.

Learning the Concept Subspace (C): To learn concept representations, we make use of Concept Activation Vectors (CAVs) (Kim et al., 2018). In particular, we first define a concept library I = {i_1, i_2, ..., i_{N_c}}, where N_c denotes the number of concepts. The concepts in the library can be selected by a domain expert or learned automatically from the data (Ghorbani et al., 2019; Yeh et al., 2020).
For each concept i, we collect embeddings for the positive examples that exhibit the concept, denoted by the set P_i = {f(x_{p_1}), ..., f(x_{p_{N_p}})}, and for the negative examples N_i = {f(x_{n_1}), ..., f(x_{n_{N_n}})} that do not contain the concept. In practice, densely annotated datasets (Caesar et al., 2018; Fong & Vedaldi, 2018) are used to collect points that are positive/negative for the concept. Importantly, unlike CBMs, these samples can be different from the data used to train the backbone model. Following Kim et al. (2018), we train a linear SVM using P_i and N_i to learn the corresponding CAV, that is, the vector normal to the linear classification boundary. We denote the CAV for concept i as c_i. Let C ∈ R^{N_c × d} denote the matrix of concept vectors, where each row c_i represents the corresponding concept i. Given an input, we project the embedding of the input onto the subspace spanned by the concept vectors (the concept subspace). Particularly, we let f_C(x) = proj_C f(x) ∈ R^{N_c}, where the i-th entry is f_C^{(i)}(x) = ⟨f(x), c_i⟩ / ||c_i||_2^2 ∈ R. It is important to observe that we do not need to annotate the training data with concepts; the dataset used to learn concepts can be different from the original training data. In several of our experiments, we use held-out datasets to learn the concept subspace.

Leveraging multimodal models to learn concepts: Annotating images with concepts is a laborious process, and in practice this can be an important roadblock to using CBMs. To address this, we show that we can also leverage natural language concept descriptions and multimodal models to implement concept bottlenecks. This approach alleviates the need to collect labeled data to construct the concept subspace. Multimodal models such as CLIP (Radford et al., 2021) have a text encoder f_text along with an image encoder, which maps a description to the shared embedding space. We leverage the text encoder to obtain concept vectors directly.
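The CAV construction and concept-subspace projection described above can be sketched as follows (a minimal illustration with scikit-learn; the function names and the SVM regularization value are our own choices, not taken from the paper's released code):

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_cav(pos_embeddings, neg_embeddings, C=0.01):
    """Learn a Concept Activation Vector (Kim et al., 2018): the vector
    normal to a linear SVM boundary separating embeddings of examples
    that contain the concept from examples that do not."""
    X = np.concatenate([pos_embeddings, neg_embeddings])
    y = np.concatenate([np.ones(len(pos_embeddings)),
                        np.zeros(len(neg_embeddings))])
    svm = LinearSVC(C=C, max_iter=10000).fit(X, y)
    return svm.coef_.flatten()  # c_i, shape (d,)

def project_to_concept_subspace(embeddings, cavs):
    """Project backbone embeddings f(x) onto the concept subspace.
    The i-th coordinate is <f(x), c_i> / ||c_i||_2^2."""
    norms_sq = (cavs ** 2).sum(axis=1)     # ||c_i||_2^2 per concept
    return embeddings @ cavs.T / norms_sq  # shape (n, N_c)
```

In the paper's setting, 50 positive/negative image pairs per concept are used to train each SVM.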
For instance, we can have c_stripes^text = f_text("stripes"), where the concept vector for stripes is obtained by mapping the prompt through the text encoder. For each concept description, we collect the text embedding, and together these embeddings form our multimodal concept bank C_text. For a given classification task, we can use ConceptNet (Speer et al., 2017) to obtain concepts that are relevant to the classes. ConceptNet is an open knowledge graph in which we can find concepts that have particular relations to a query concept. For instance, we can find relations of the form "A Cat has {whiskers, four legs, sharp claws, ...}". Similarly, we can find "parts" of a given class (e.g. "bumper", "roof" for "truck"), or the superclass of a given class (e.g. "animal", "canine" for "dog"). We restrict ourselves to five sets of relations for each class: the hasA, isA, partOf, HasProperty, and MadeOf relations in ConceptNet. We collect all concepts that have these relations with the classes in each classification task to build the concept subspace.

Learning the Interpretable Predictor: Next, we define an interpretable predictor that maps the concept subspace to the model prediction. Concretely, let g : R^{N_c} → Y be an interpretable predictor, such as a sparse linear model or a decision tree, where Y = {1, 2, ..., K} denotes the label space. An interpretable predictor is desirable because it provides insight into which concepts the model relies on when making a decision. If a domain expert observes a counter-intuitive phenomenon in the predictor, they can edit the predictor to improve the model. To learn the PCBM, we solve the following problem:

min_g E_{(x,y)∼D} [ L(g(f_C(x)), y) ] + (λ / (N_c K)) Ω(g)     (1)

where f_C(x) = proj_C f(x) is the projection onto the concept subspace, L(ŷ, y) is a loss function such as the cross-entropy loss, Ω(g) is a complexity measure used to regularize the model, and λ is the regularization strength. Note that the concept subspace is kept fixed during PCBM training.
In this work, we use sparse linear models as the interpretable predictor, where g(f_C(x)) = w^T f_C(x) + b. We apply the softmax function to g if the problem is a classification problem. We define Ω(g) = α ||w||_1 + (1 − α) ||w||_2^2 to be the elastic-net penalty parameterized by α, normalized by the number of classes and concepts. The expressiveness of the concept subspace is crucial to PCBM performance. However, even when we have a rich concept subspace, concepts may not be enough to solve a task of interest. When the performance of the PCBM does not match the performance of the original model, users have less incentive to use interpretable models like concept bottlenecks. Next, we aim to address this limitation.

Recovering the original model performance with residual modeling: What happens when the concept bank is not sufficiently expressive and the PCBM performs worse than the original model? For instance, there may be skin lesion descriptors that are not available in the concept library. Ideally, we would like to preserve the original model's accuracy while retaining the interpretability benefits. Drawing inspiration from the semiparametric model literature on fitting residuals (Härdle et al., 2004), we introduce Hybrid Post-hoc CBMs (PCBM-h). After fixing the concept bottleneck and the interpretable predictor, we re-introduce the embeddings to 'fit the residuals'. In particular, we solve the following:

min_r E_{(x,y)∼D} [ L(g(f_C(x)) + r(f(x)), y) ]     (2)

where r : R^d → Y is the residual predictor. Note that in Equation 2, while training the residual predictor, the trained concept bottleneck (the concept subspace f_C and the interpretable predictor g) is kept fixed, i.e. fitting a PCBM-h is a sequential procedure. We hypothesize that the residual predictor will compensate for what is missing from the concept bank and recover the original model's accuracy. We implement the residual predictor as a linear model, i.e. r(f(x)) = w_r^T f(x) + b_r.
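The sequential residual fit above can be sketched in PyTorch: the concept bottleneck is frozen and only the linear residual head is trainable (a minimal sketch, with weight shapes assumed rather than taken from the released code):

```python
import torch
import torch.nn as nn

class PCBMh(nn.Module):
    """Hybrid PCBM: a frozen concept bottleneck g plus a trainable
    linear residual predictor r over the raw embeddings f(x)."""
    def __init__(self, concept_weights, concept_bias, d, n_classes):
        super().__init__()
        n_concepts = concept_weights.shape[1]
        self.g = nn.Linear(n_concepts, n_classes)
        with torch.no_grad():
            self.g.weight.copy_(concept_weights)
            self.g.bias.copy_(concept_bias)
        for p in self.g.parameters():
            p.requires_grad = False       # bottleneck stays fixed
        self.r = nn.Linear(d, n_classes)  # r(f(x)) = W_r f(x) + b_r

    def forward(self, f_c, f_x):
        # f_c: concept projections f_C(x); f_x: backbone embeddings f(x)
        return self.g(f_c) + self.r(f_x)
```

Training then minimizes the classification loss over this sum while only r's parameters receive gradients; dropping r recovers the plain PCBM's predictions.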
Given a trained PCBM-h, if we would like to observe model performance in the absence of the residual predictor, we can simply drop r (in other words, concept contributions do not depend on the residual step).

3. EXPERIMENTS

We evaluate the PCBM and PCBM-h in challenging image classification and medical settings, demonstrating several use cases for PCBMs. We further address practical concerns and show that PCBMs can be used without a loss in the original model performance. We used the following datasets to systematically evaluate the PCBM and PCBM-h:

CIFAR10, CIFAR100 (Krizhevsky et al., 2009): We use CLIP-ResNet50 (Radford et al., 2021) as the backbone model. For the concept bottleneck, we use the 170 concepts introduced in (Abid et al., 2022), which are extracted from the BRODEN visual concepts dataset (Fong & Vedaldi, 2018). These include objects (e.g. dog), settings (e.g. snow), textures (e.g. stripes), and image qualities (e.g. blurriness). The full list of concepts can be found in (Abid et al., 2022).

COCO-Stuff (Caesar et al., 2018) is a dataset derived from MS-COCO (Lin et al., 2014), consisting of scenes with various object annotations. Previous work (Singh et al., 2020) has shown that there are severe co-occurrence biases in this dataset. For instance, images containing a wine glass often also contain a dining table. Singh et al. (2020) identify 20 classes with heavy co-occurrence biases, and we train PCBMs to recognize these 20 objects, minimizing the binary cross-entropy loss individually for each class. In this scenario, we again use CLIP-ResNet50 (Radford et al., 2021) as the backbone and the BRODEN visual concepts as the set of concepts.

CUB (Wah et al., 2011): In the 200-way bird identification task, we use a ResNet18 (He et al., 2016) trained on the CUB dataset. We use the same training/validation splits and 112 concepts as in (Koh et al., 2020). These concepts include wing shape, back pattern, and eye color.

HAM10000 (Tschandl et al., 2018) is a dataset of dermoscopic images, which contain skin lesions from a representative set of diagnostic categories. The task is detecting whether a skin lesion is benign or malignant. We use the Inception (Szegedy et al., 2015) model trained on this dataset, which is available from (Daneshjou et al., 2021). Following the setting in (Lucieri et al., 2020), we collect concepts from the Derm7pt (Kawahara et al., 2018) dataset. The 8 concepts obtained from this dataset include Blue Whitish Veil, Pigmented Networks, and Regression Structures, which are reportedly associated with the malignancy of a lesion.

SIIM-ISIC (Rotemberg et al., 2021): To test a real-world transfer learning use case, we evaluate the model trained on HAM10000 on a subset of the SIIM-ISIC Melanoma Classification dataset. We use the same concepts described for HAM10000.

We use linear probing on top of the backbones for CIFAR, COCO, and ISIC to obtain the baseline model performance. For CUB and HAM10000, we use the out-of-the-box models trained on the respective datasets without any additional training. We emphasize that in all of our experiments except CUB, we learn concepts using a held-out dataset. Namely, while the original CBMs need concept annotations for the training images, we remove this limitation by post-hoc learning of concepts with any dataset. To learn the concept subspace, we use 50 pairs of images for each concept to train a linear SVM. We refer the reader to the Appendix for further details on datasets and training.

PCBMs achieve comparable performance to the original model: In Table 1, we report results over these five datasets. PCBM achieves comparable performance to the original model in all datasets except CIFAR100, and PCBM-h matches the performance in all scenarios. Strikingly, PCBMs match the performance of the original model in HAM10000 and ISIC with as few as 8 human-interpretable concepts. In CIFAR100, we hypothesize that the available concept bank is not sufficient to classify finer-grained classes, and hence there is a performance gap between the PCBM and the original model. When the concept bank is not sufficient to solve the task, PCBM-h can be introduced to recover the original model performance while retaining the benefits of PCBM (see next sections).

Comparing to CBM: A comparison to CBM was not possible in most datasets, as CBMs cannot be trained due to the lack of dense annotations. To compare the data efficiency of both methods, we report a comparison of CBM to PCBM when trained on CUB in Appendix C, finding that CBM can only reach a similar performance to PCBM with 20x more annotations.

Analyzing the residual predictor: To understand whether PCBM-h overrides PCBM predictions, in Appendix B we look at the consistency between PCBM and PCBM-h predictions.
We show that the residual component in PCBM-h intervenes only when the prediction is wrong, and fixes mistakes. When PCBM is confident, PCBM-h does not modify the prediction or significantly increase the confidence. In general, the residual component may dominate when the bottleneck is insufficient; future work can aim to explicitly limit the residual component, e.g. via the PIE (Wang et al., 2021) regularizer.

Table 1: PCBMs achieve comparable performance to the original model. We report performance over different scenarios for the original model and PCBMs with concept datasets. In CIFAR100, PCBM performs poorly since the concept bank is not expressive enough to solve a finer-grained task; however, PCBM-h recovers the original model's accuracy. Strikingly, PCBMs match the performance of the original model in HAM10000 and ISIC with as few as 8 human-interpretable concepts. Original CBMs cannot be trained on CIFAR/HAM10000/ISIC/COCO-Stuff, as these datasets do not have concept labels for the training data. The mean and standard errors are reported over 10 random seeds. We report AUROC for HAM10000 and ISIC, mAP for COCO-Stuff, and accuracy for CIFAR and CUB.

PCBM using CLIP concepts:

Here, we show that when labeled examples to learn concepts are not available, we can use multimodal representations such as CLIP to generate concepts without the laborious annotation cost. Using ConceptNet, we obtain concept descriptions for the CIFAR10, CIFAR100, and COCO-Stuff tasks (206, 527, and 822 concepts, respectively), using the relations described in Section 2. The concept subspace is obtained by using text embeddings for the concept descriptions generated by the ResNet50 variant of CLIP. In Table 2, we give an overview of the results. We observe that with a more expressive multimodal bottleneck, we can almost recover the original model accuracy in CIFAR10, and we reduce the performance gap in CIFAR100. CLIP contains extensive knowledge of natural images, and thus captures natural image concepts (e.g. red, thorns). However, CLIP is less reliable at representing more granular or domain-specific concepts (e.g. blue whitish veils), and hence this methodology is less applicable to more domain-specific applications. While there is a recent trend in training multimodal models for biomedical applications (Zhang et al., 2020; EleutherAI), these models are trained with smaller datasets and are less competitive. We hope that as these models improve, PCBMs will make it easier to deploy concept bottlenecks in more specialized domains.

Explanations in PCBMs: In Figure 2, we provide sample concept weights in the corresponding PCBMs. For instance, in HAM10000, PCBMs use Blue Whitish Veils, Atypical Pigment Networks, and Irregular Streaks to identify malignant lesions, whereas Typical Pigment Networks, Regular Streaks, and Regular Dots and Globules are used to identify benign lesions. These associations are consistent with medical knowledge (Menzies et al., 1996; Lucieri et al., 2020). We observe similar cases in CIFAR-10 and CIFAR-100, where the class Rose is associated with the concepts Flower, Thorns, and Red.
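The multimodal concept bank used above can be sketched as follows. The `tokenize` and `encode_text` callables stand in for a multimodal model's text interface (for CLIP these would be `clip.tokenize` and `model.encode_text`); treating them as parameters is our own choice to keep the sketch model-agnostic:

```python
import torch

def build_text_concept_bank(concept_names, tokenize, encode_text):
    """Build concept vectors from natural-language descriptions using
    a multimodal model's text encoder, e.g. c_stripes = f_text("stripes")."""
    tokens = tokenize(concept_names)
    with torch.no_grad():
        bank = encode_text(tokens).float()  # (N_c, d)
    # L2-normalize each concept vector so projections are comparable
    return bank / bank.norm(dim=-1, keepdim=True)
```

The resulting bank plays the same role as the CAV matrix C: backbone embeddings are projected onto it before the sparse linear predictor is fit.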
In Appendix Section F, we further analyze the global explanations of COCO-Stuff and show that PCBM reveals co-occurrence biases in the training data. Note that the concept weights for PCBM-h are exactly the same as for PCBM, since the residual model is fit after the concept weights are fixed. However, given the addition of the residual component, the net effect of concepts may be different. We also note that the concept weights discussed may not reflect causal effects. Estimating the causal effect of a concept requires careful treatment (Goyal et al., 2019), e.g. accounting for the interactions between concepts, and we leave this to future work. We give more details and the results of all domains in the Appendix.

4. MODEL EDITING EXPERIMENTS

We use an ImageNet-pretrained ResNet50 variant as the backbone and the visual concept bank described in the Experiments section. Given a PCBM, we evaluate three editing strategies:

1. Prune: We set the weight of a particular concept on a particular class prediction to 0, i.e. for a concept indexed by i, we let w_i = 0.

2. Prune+Normalize: After pruning, we rescale the remaining concept weights. Let P denote the indices of the positive weights that are pruned, Q denote the indices of the weights that remain, and w_P, w_Q be the corresponding weight vectors for the particular class. We rescale each remaining weight to match the original norm of the vector by letting, for all i ∈ Q, w̃_i = w_i (1 + ||w_P||_1 / ||w_Q||_1), which gives ||w̃||_1 = ||w||_1. The normalization step alleviates the imbalance between class weights upon pruning concepts with large weights for a particular class.

3. Fine-tune (Oracle): We compare our simple editing strategies to fine-tuning on the test domain, which can be considered an oracle. Particularly, we fine-tune the PCBM using samples from the test domain and then test the fine-tuned model on a set of held-out samples.

In the context of the MetaShift experiments, where training and test domains exhibit controlled distribution shifts, we simply edit the concept spuriously correlated with a particular class. Concretely, for the domain table(dog), we prune the weight of the dog concept for the class table. In Table 3, we report the results of our editing experiments over 10 scenarios. We observe that for PCBMs, the Prune+Normalize strategy can recover almost half of the accuracy gains achieved by fine-tuning on the original test domain. Further, we observe that PCBM-h obtains lower accuracy gains with model editing. In PCBM, we can remove the concept from the model, but in PCBM-h, there may still be leftover information about the spurious concept in the residual part of the model. This is in line with our expectations, i.e. PCBM-h is a 'less' interpretable but more powerful variant of PCBM. While it still offers partial editing improvements that would not be possible with the vanilla model, it does not bring as much improvement as the PCBM. Even though our editing strategy is extremely simple, we can recover almost half of the gains made by fine-tuning the model. It is particularly easy to use since it can be applied without fine-tuning or using any knowledge or data from the target domain. However, this methodology requires knowledge of the spurious concept in the training domain. This may not be realistic in practice, and in the next section, we turn to human users for guidance.
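The Prune and Prune+Normalize edits amount to a few operations on one class's weight vector; a minimal sketch with our own variable names:

```python
import numpy as np

def prune_and_normalize(w, pruned_idx):
    """Edit a PCBM's weight vector for one class: zero out the pruned
    (spurious) concepts, then rescale the surviving weights so the
    L1 norm of the class weight vector is preserved."""
    w = w.copy()
    pruned_mass = np.abs(w[pruned_idx]).sum()   # ||w_P||_1
    w[pruned_idx] = 0.0
    kept = np.flatnonzero(w)
    kept_mass = np.abs(w[kept]).sum()           # L1 mass of survivors
    if kept_mass > 0:
        w[kept] *= 1 + pruned_mass / kept_mass  # restores ||w||_1
    return w
```

Note this sketch rescales all surviving weights, whereas the paper's formulation describes pruning positive weights; in both cases the class vector's L1 norm is restored, which is what prevents the edited class from being under-weighted relative to the others.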

4.2. USER STUDY: EDITING PCBM WITH HUMAN GUIDANCE

One of the advantages of pruning is that it is naturally amenable to bringing humans into the loop. Rather than attempting to automate the edits, we can rely on a human to pick concepts that make logical sense to prune. We show that users make fast and insightful pruning decisions that improve model performance on test data when there is a distribution shift. Notably, these decisions can be made even when the user knows little about the model architecture and has no knowledge of the true underlying spurious correlation in the training data. We conduct a user study where we explore the benefits of human-guided pruning on PCBM and PCBM-h performance.

Figure 3: The concept-selection interface shown to users, with the instruction "Remove concepts from the model that you think may hinder generalization for classifying Keyboard. The model was trained to distinguish Beach, Bus, Airplane, Keyboard and Bird."

Study Design:

We construct 9 MetaShift scenarios with a distribution shift for one of the classes between the training and testing datasets, which we refer to as the "shifted class". For instance, all of the training images for the class Keyboard also contain a Cat in the image, whereas test images do not have a cat in the image. The list of the 9 scenarios can be found in the Appendix. For each scenario, we train a PCBM and a PCBM-h with CLIP concepts on the training dataset and task the users with making edits to the trained models by selecting concepts to prune. We use CLIP's ResNet50 variant as our backbone, and leverage ConceptNet to construct the concept subspace. Figure 3 shows the concept-selection interface. We refer to the Appendix for the full set of instructions given to the user. The display consists of the classification task the model was trained on and the concepts with the ten most positive weights for the shifted class. The user selects a subset of the concepts to prune. Neither the original model accuracy nor the accuracy after pruning is revealed to the user until all scenarios are completed. The target audience for the study was machine learning practitioners and researchers; 30 volunteers participated for a total of 30 × 9 = 270 experiments. The IRB has determined that this does not require review. Human-guided editing is fast and improves model accuracy: On average, users prune 3.65 ± 0.39 concepts in 34.3 ± 6.4s per scenario. We evaluate the effectiveness of user pruning by comparing the model accuracy after user pruning to the accuracy of the unedited model on a held-out dataset from the test domain. All 30/30 users improve the model accuracy when averaged over scenarios. 8/30 users improve model accuracy in all scenarios and 26/30 users improve model accuracy in at least 6 scenarios. We compare human-guided editing to three baselines: 1. 
Random Pruning: We uniformly select, without replacement, a subset of the top ten concepts to prune, matching the number of concepts pruned by the user for a fair comparison. 2. Greedy Pruning (Oracle): We greedily select one of the top 10 concepts that, when pruned, improves the model accuracy the most. We match the number of concepts pruned by the user. 3. Fine-tune (Oracle): The model is fine-tuned on samples from the test domain.

Table 4: Human-guided editing improves model accuracy for PCBMs with CLIP concepts (N=30). We show the test accuracy on the shifted class averaged across 9 MetaShift scenarios, before and after the model is edited. Accuracies and edit gains are reported as mean ± standard error. (For the pruning strategies, we first compute the mean test accuracy across users within each scenario, then calculate the overall mean and standard error from the within-scenario means.) We see that user pruning surpasses random pruning and can achieve nearly 50% of the accuracy gains made by fine-tuning and over 80% of the accuracy gains made by greedy pruning. We consider greedy pruning and retraining to be oracles since they require "leaked" knowledge of the test domain.

Table 4 reports the model performance improvements for PCBM and PCBM-h, averaged over all 270 experiments in the user study. Improvements within each separate scenario can be found in the Appendix. We note that even with the residual component in PCBM-h, we can still improve model performance by editing the concept bottleneck. We observe that user pruning achieves marked improvement in model performance compared to the unedited model, attaining around 50% of the accuracy gains from fine-tuning and 80% of the gains from greedy pruning.
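The greedy-pruning oracle can be sketched as follows; `accuracy_fn` is a hypothetical callable that evaluates an edited weight vector on held-out test-domain data (which is exactly the "leaked" knowledge that makes this baseline an oracle):

```python
import numpy as np

def greedy_prune(accuracy_fn, w, candidate_idx, budget):
    """Oracle greedy-pruning baseline: repeatedly zero out the single
    candidate concept whose removal most improves held-out accuracy,
    pruning `budget` concepts in total (matched to the user's count)."""
    w = w.copy()
    pruned = []
    for _ in range(budget):
        best_gain, best_i = -float("inf"), None
        for i in candidate_idx:
            if i in pruned:
                continue
            trial = w.copy()
            trial[i] = 0.0
            gain = accuracy_fn(trial) - accuracy_fn(w)
            if gain > best_gain:
                best_gain, best_i = gain, i
        w[best_i] = 0.0
        pruned.append(best_i)
    return w, pruned
```

Random pruning replaces the inner argmax with a uniform draw from the same candidate set.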

5. RELATED WORKS

Concepts: Using human concepts to interpret model behavior has been drawing increasing interest (Kim et al., 2018; Bau et al., 2017; 2020). Related work focuses on understanding whether neural networks encode and use concepts (Lucieri et al., 2020; Kim et al., 2018; McGrath et al., 2021), or generates counterfactual explanations to understand model behavior (Ghandeharioun et al., 2021; Abid et al., 2022; Akula et al., 2020). Recent works have evaluated the causal validity of explanations (Feder et al., 2021; Elazar et al., 2021; Goyal et al., 2019), e.g. to eliminate potential confounding effects. There is further increasing interest in automatically discovering the concepts used by a model (Yeh et al., 2020; Ghorbani et al., 2019; Lang et al., 2021).

Concept-based Models: Concept bottleneck models (CBMs) (Koh et al., 2020) extend the earlier idea (Lampert et al., 2009; Kumar et al., 2009) of first predicting the concepts, then using the concepts to predict the target. CBMs bring about interpretability benefits but require training the model using concept labels for the entire training dataset, which is a key limitation. CBMs have not been analyzed in terms of model edits. Recent work reveals that end-to-end learned CBMs encode information beyond the desired concepts (Mahinpei et al., 2021; Margeloiu et al., 2021). Concept Whitening (Chen et al., 2020) aims to align each concept with an individual dimension in a layer. While a subset of dimensions is aligned, this approach lacks a bottleneck since there are dimensions that are potentially not aligned with any concept (comparable to PCBM-h). Barnett et al. (2021) learn prototypes and use them as concepts, where the distance to prototypes determines the model prediction. PCBM-h is inspired by semiparametric models fitting residuals (Härdle et al., 2004). PIE (Wang et al., 2021) takes a similar approach, combining individual features with more complicated interaction terms for tabular data.
Model Editing: Sotoudeh & Thakur (2021) propose "repairing" models by finding minimal parameter updates that satisfy a given specification. One thread of work focuses on intervening on the latent space of neural networks to alter the generated output towards a desired state, e.g. the removal of artifacts or the manipulation of object positions (Sinitsin et al., 2020; Bau et al., 2020; Santurkar et al., 2021). For instance, Bau et al. (2020) edit generative models. Santurkar et al. (2021) edit classifiers by modifying 'rules', such as making a model perceive the concept of a 'snowy road' as the concept of a 'road', achieved through minimal updates to specific feature extractors. FIND (Lertvittayakumjorn et al., 2020) prunes individual neurons chosen by users and later fine-tunes the model. Right for the Right Concepts (Stammer et al., 2021) similarly fine-tunes a neuro-symbolic model with user-provided feedback maps. Bontempelli et al. (2021) give an overview of concept-based models and a discussion of debugging strategies, while the concurrent work ProtoPDebug (Bontempelli et al., 2022) proposes an efficient debugging strategy.

6. LIMITATIONS AND CONCLUSION

In this work, we presented Post-hoc CBMs as a way of converting any model into a CBM, retaining the original model performance without losing the interpretability benefits. We leveraged multimodal models as an interface to use concepts, without the laborious concept annotation steps. Further, in addition to the local intervention benefits of CBMs, we demonstrated that PCBMs can be leveraged to perform global model interventions. Many benefits of CBMs depend on the quality of the concept library: the concept set should be expressive enough to solve the task of interest, and users should be careful about the concept dataset used to learn concepts, which can reflect various biases. While there are several real-world tasks where this holds, it is an open question whether human-constructed concept bottlenecks can solve larger-scale tasks (e.g. at the ImageNet level). Hence, finding concept subspaces for models in an unsupervised fashion is an active area of research that will help with the usability and expressivity of concept bottlenecks. Multimodal models provide an effective interface for concept-level reasoning, yet it is unclear what the optimal way to keep humans in the loop is. Here, we presented a simple way of taking human input via concept pruning; how to incorporate richer feedback is an interesting direction for future work.

A TRAINING DETAILS

Hyperparameters: In all of our experiments, the ElasticNet sparsity ratio parameter was α = 0.99. We trained all our models on a single NVIDIA Titan Xp GPU. All models were trained for a total of 10 epochs. We tune the regularization strength on a subset of the training set that is kept as a validation set. PCBMs are fitted using scikit-learn's SGDClassifier class, with 5000 maximum steps. Hybrid parts are trained with PyTorch, where we used Adam as the optimizer with a 0.01 learning rate and 0.01 L2 regularization on the residual classifier weights, trained for 10 epochs. Dataset Details: 1. Metashift: All Metashift experiments are 5-class classification problems. We have 50 images per class in the training set and 50 images per class in the test set, disjoint from the training images. The regularization strength for Metashift experiments is 0.002. 2. CIFAR: We use the original training and test splits of the CIFAR datasets. The regularization strength for CIFAR10 and CIFAR100 is 2.0/(KN_c), where K is the number of classes and N_c is the number of concepts. We use linear probing to evaluate the original model. 3. COCO-Stuff: We use the original training and test splits of the COCO dataset. We sample 500 training and 250 test images from the dataset for each class, and we upsample the images whenever there are not enough images. Recognizing each of the 20 biased classes (Singh et al., 2020) is treated as a binary classification task, where we minimize the binary cross-entropy loss separately for the 20 classes and compute the mean average precision metric. The regularization strength for COCO-Stuff is 0.001. We evaluate the original model performance using linear probes with CLIP. 4. HAM10k: We split 80% of the HAM10k dataset and use it as the training set, and use the remaining 20% as the test data. The regularization strength for HAM10k is 2.0/(KN_c). 5. ISIC: We use 2000 images for training (400 malignant, 1600 benign) and evaluate the model on a held-out set of 500 images (100 malignant, 400 benign).
The regularization strength for ISIC is 0.001/(KN_c). We evaluate the original model performance using a linear probe. 6. CUB: We use the training and test splits provided in Koh et al. (2020). The regularization strength for CUB is 0.01/(KN_c). Visual Concepts: We used the Broden visual concept bank for CIFAR and the controlled editing experiments, CUB's training data for CUB concepts, and the derm7pt dataset for dermatology concepts. For each of these, we use 50 pairs of positive and negative images and learn a linear SVM, using the vector normal to the decision boundary as the concept vector. Multimodal concepts: For concept learning, we leveraged the ConceptNet hierarchy. For each classification task, we searched ConceptNet for the class name and obtained concepts that have one of the following relations with the query concept: HasA, IsA, PartOf, HasProperty, MadeOf. We share the code for obtaining natural language concepts using ConceptNet. For each of these concepts, we obtain the text embedding using CLIP and use it as the concept vector.
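The CAV construction described above (50 positive/negative pairs, a linear SVM, and the vector normal to its decision boundary) can be sketched as follows. This is a minimal sketch: the function name and toy data are our own, and scikit-learn's LinearSVC stands in for whichever SVM implementation was actually used.

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_cav(pos_embeddings, neg_embeddings, C=0.1):
    """Fit a linear SVM separating embeddings of examples with/without the
    concept, and return the unit vector normal to its decision boundary
    (the concept activation vector, CAV)."""
    X = np.concatenate([pos_embeddings, neg_embeddings])
    y = np.concatenate([np.ones(len(pos_embeddings)),
                        np.zeros(len(neg_embeddings))])
    svm = LinearSVC(C=C, max_iter=5000).fit(X, y)
    w = svm.coef_.flatten()        # normal to the separating hyperplane
    return w / np.linalg.norm(w)   # normalized concept vector
```

The resulting unit vector points from concept-negative towards concept-positive embeddings and is stacked with the other concepts' vectors to form the concept bank.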

B RESIDUAL COMPONENT INTERVENES ONLY WHEN NECESSARY

Does PCBM-h alter predictions made by the PCBM? We hypothesized that the residual component would intervene only when the concept bottleneck is not sufficient. To better understand the effect of the residual predictor, we analyzed the prediction consistency between the two models for CIFAR10 and CIFAR100, using the models with labeled concepts (i.e. 170 concepts). In Figures 4 and 5, the x-axis denotes the confidence of the PCBM, the orange line gives the accuracy for samples with the given confidence, and the blue line gives the consistency between PCBM and PCBM-h for the same samples, i.e. whether they make the same prediction. In CIFAR10 and CIFAR100, we show that PCBM and PCBM-h consistency is high when the model confidence is high and the model accuracy is high. Consequently, PCBM-h changes the model prediction mostly when the PCBM prediction is likely a mistake and the confidence is low. In Figure 5, we show that all of the predictions changed by PCBM-h fix model mistakes. Further, in Figure 6 we show the PCBM confidence and the mean absolute deviation between the PCBM confidence and the PCBM-h confidence; we observe that PCBM-h has less effect on model confidence as the model gets more confident.
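The consistency analysis above can be reproduced with a short script: bin samples by PCBM confidence, then report per-bin accuracy and PCBM/PCBM-h agreement. The binning scheme and function name below are our own choices, not taken from the paper's code.

```python
import numpy as np

def consistency_by_confidence(pcbm_probs, pcbmh_preds, labels, n_bins=10):
    """For each PCBM-confidence bin, report PCBM accuracy and the fraction
    of samples where PCBM and PCBM-h make the same prediction (consistency).
    pcbm_probs: (n, n_classes) predicted probabilities of the PCBM."""
    pcbm_preds = pcbm_probs.argmax(axis=1)
    conf = pcbm_probs.max(axis=1)
    edges = np.linspace(conf.min(), 1.0, n_bins + 1)
    # assign each sample to a confidence bin
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    stats = []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        acc = float((pcbm_preds[mask] == labels[mask]).mean())
        cons = float((pcbm_preds[mask] == pcbmh_preds[mask]).mean())
        stats.append({"bin": (edges[b], edges[b + 1]), "n": int(mask.sum()),
                      "accuracy": acc, "consistency": cons})
    return stats
```

Plotting accuracy and consistency per bin reproduces the curves in Figures 4 and 5: high-confidence bins should show both high accuracy and high agreement.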

C COMPARISON TO CBM

In most of our experiments, there is no clear way to run CBM as a benchmark, since none of the datasets have dense concept annotations, except CUB. In CUB, we run CBM as a baseline. In particular, we use the joint training strategy from Koh et al. (2020), where it was reported to give the best results on the CUB dataset. To make the comparison equal, we use the same set of concepts in both models and a linear predictor layer. Further, in both cases, we use the same frozen ResNet18 backbone. We searched over a grid of learning rates {0.001, 0.01, 0.1, 1.0}, λ ∈ {0.001, 0.01, 0.1, 1.0}, where λ is the coefficient of the concept predictors in the joint training objective (see the CBM paper, Section 3), and ElasticNet regularization strengths in {0.001, 0.01, 0.1, 1.0}. In Table 5, we report the model performance. Overall, we observe some benefit from using dense concept annotations over the entire training dataset: CBMs achieve slightly better performance than PCBMs and the original backbone. We note that this comes at the cost of 112 concept annotations for each training sample. We further analyze the behavior of CBM under different numbers of annotations. In Figure 7, we train the CBM with a varying number of annotations; since CBMs require dense annotations, 11200 annotations correspond to 11200/112 = 100 images. We observe that CBMs require a much larger number of annotations to achieve the same accuracy as PCBMs.

For instance, training images for the class keyboard also contain a cat in the image. In Table 7, we list the classes in the underlying classification task and the spurious correlation for each scenario.
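For reference, the joint training objective of Koh et al. (2020) over which we grid-search λ has the following form (written in our own notation: x is the input, y the label, c_j the concept annotations, g the concept predictor, and f the label predictor):

```latex
\min_{f,\,g}\; \sum_{i} \Big[\, L_Y\big(f(g(x^{(i)})),\, y^{(i)}\big)
\;+\; \lambda \sum_{j} L_{C_j}\big(g_j(x^{(i)}),\, c_j^{(i)}\big) \Big]
```

Larger λ pushes the bottleneck activations to match the concept annotations; λ → 0 recovers a standard end-to-end model with an uninterpreted bottleneck.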
One important limitation of this analysis is that the desired concepts should exist in the concept bank. To further improve the pipeline, automatic concept discovery approaches may be a fruitful research direction (Ghorbani et al., 2019; Yeh et al., 2020).



The CUB pretrained model is obtained from https://github.com/osmr/imgclsmob



Figure 1: Post-hoc Concept Bottleneck Models. First, we learn the vectors in our concept bank. With the CAV approach, for each concept, e.g. stripes, we train a linear SVM to distinguish the embeddings of examples that contain the concept and use the vector normal to the boundary (CAV). When annotations are hard to obtain, we can leverage multimodal models and use the text encoder to map each concept to a vector. Next, we project the embeddings produced by the backbone onto the concept subspace defined by the set of vectors. We then train an interpretable predictor to classify the examples from their projections. When the concept library is incomplete, we can construct a PCBM-h by sequentially introducing a residual predictor that maps the embeddings to the target space.

Figure 3: User Study Interface. We train PCBMs on MetaShift scenarios, each with a distribution shift between the training and test datasets. The user selects concepts to prune from the model.

Model editing aims to achieve the removal or modification of information in a given neural network. Several works investigated editing factual knowledge in language models: Zhu et al. (2020) use variants of fine-tuning to achieve this objective while retaining performance on unmodified factual knowledge, and Mitchell et al. (2021); De Cao et al. (2021); Hase et al. (2021) update the model by training a separate network that modifies model parameters to achieve the desired edit.

Figure 4: Residual component intervenes mostly when the confidence is low. Here, we look at the consistency between PCBM and PCBM-h predictions (i.e. whether both models make the same prediction). Namely, at each confidence level for the PCBM, we report the accuracy and consistency with the PCBM-h predictions. Overall, we see that PCBM-h is most likely to change the model prediction when the model is making a mistake, and otherwise, predictions are consistent.

Figure 6: Effect of the residual predictor on the confidence. Here we show the PCBM confidence and mean absolute deviation between the PCBM confidence and the PCBM-h confidence. We observe that PCBM-h has less effect on model confidence as the model gets more confident.

Figure 8: Launch page for user study. Before starting the editing tasks, participants are shown a launch page containing a brief background section on PCBMs and a set of instructions.

Figure 9: Summary page for user study. After completing all 9 scenarios, participants are shown a summary page that includes the user's choice of concepts, the accuracy of the unedited model, and the accuracy achieved by the edited model.

Concept Bottlenecks with CLIP concepts. When a concept bank is not available or is insufficient, we can use natural language descriptions of concepts with CLIP to implement CBMs.
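Building concept vectors from natural-language descriptions can be sketched as below. Here `encode_text` abstracts over a multimodal text encoder (in the paper, CLIP's); the prompt template, function name, and unit-normalization are our own choices for illustration.

```python
import numpy as np

def build_text_concept_bank(encode_text, concept_names, template="{}"):
    """Map each concept name to a unit-norm concept vector via a text encoder.
    encode_text: callable taking a list of strings and returning an (n, d)
    array; in practice this would be CLIP's text encoder."""
    texts = [template.format(name) for name in concept_names]
    embs = np.asarray(encode_text(texts), dtype=float)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)
```

The resulting bank plugs into the same projection-and-linear-predictor pipeline as CAV-based concepts, which is what lets PCBMs sidestep concept annotation entirely.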

4. MODEL EDITING WITH POST-HOC CONCEPT BOTTLENECKS

When we observe a counter-intuitive phenomenon or a spurious correlation in the concept bottleneck, can we make the model perform better through simple edit operations? In this section, we show that PCBMs come with the benefit of easy, global concept-level model feedback. Koh et al. (2020) demonstrate the ability of CBMs to incorporate local interventions: if a user identifies a misprediction in the concept bottleneck, they can intervene on the concept prediction and change the prediction for that particular instance; we call this a local intervention or edit. Here, we are interested in editing models to perform 'global' edits. Namely, with simple concept-level feedback, we would like to edit a PCBM and change the entire model behavior. For instance, in our experiments, we investigate the use case of adapting a PCBM to a new distribution. Unlike most existing model editing approaches, we do not need any data from either the training or target domains to perform the model edit, which can be a significant advantage in practice when data is inaccessible. Given a trained PCBM, we edit our concept bottleneck by manipulating the concept weights. For simplicity, in the experiments below we only focus on positive weights in the linear model.

4.1 CONTROLLED EXPERIMENT: EDITING PCBM WHEN THE SPURIOUS CONCEPT IS KNOWN

For our editing experiments, we use the Metashift (Liang & Zou, 2022) dataset to simulate distribution shifts. We use 10 different scenarios where there is a distribution shift for a particular class between the training and test datasets. For instance, during training, we only use table images that also contain a dog in the image, and test the model with table images where there is no dog in the image. We denote the training domain as table(dog).
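Manipulating the concept weights for a global edit amounts to zeroing the pruned concept's weights in the linear bottleneck predictor. The sketch below (the function name is our own) shows this for a weight matrix with one row per class:

```python
import numpy as np

def prune_concepts(weights, concept_indices):
    """Globally remove concepts from a PCBM's linear predictor.
    weights: (n_classes, n_concepts) weight matrix of the interpretable layer.
    Zeroing a column removes that concept's influence on every prediction;
    the original matrix is left untouched."""
    edited = weights.copy()
    edited[:, list(concept_indices)] = 0.0
    return edited
```

Because the edit changes the weights rather than a single prediction, it alters the model's behavior on all inputs at once, which is what makes this a global (rather than per-instance) intervention, and it needs no data from the training or target domain.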

Model edits with Post-Hoc CBMs. We report results over 10 distribution shift experiments generated using Metashift. Accuracy and edit gains are averaged over 10 scenarios and are reported as mean ± standard error. We observe that very simple editing strategies in the concept subspace provide almost 50% of the gains made by fine-tuning on the test domain.

Residual component intervenes only to fix mistakes. Here we show the number of mistakes, the number of predictions changed by PCBM-h, and the number of mistakes fixed by PCBM-h. We see that PCBM-h only changes the model predictions to fix model mistakes.

Comparison to CBM. We compare PCBM to CBM on the CUB dataset. The accuracy over the test set is reported.

Effect of the number of annotations in CBM. Here we train the CBM with a varying number of annotations. CBMs require dense annotations, e.g. 11200 annotations would mean 11200/112 = 100 images. On the x-axis, we give the number of annotations used to train the CBM. On the y-axis, we give the accuracy on the test set. We observe that CBMs require a much larger number of annotations to achieve the same accuracy as PCBMs.

D CONTROLLED METASHIFT EXPERIMENTS FOR MODEL EDITING

For Metashift, we have 2 tasks. Both are 5-class object recognition tasks: in the first, the classes are airplane, bed, car, cow, keyboard; in the second, beach, computer, motorcycle, stove, table. For each of these, we use a ResNet18 pretrained on ImageNet as the backbone of the PCBM, and then use 100 images per class to train the concept bottleneck. For all experiments, we use the Adam optimizer with a learning rate of 0.05 and the regularization parameters λ = 0.05, α = 0.99. Similar to the CIFAR experiments, we use the Broden concept dataset. Below we give the entire set of results.

Results of the Metashift editing experiments.

Classification tasks and spurious correlations in user study scenarios.

Below, we report the test accuracies for each of the 9 scenarios in our user study, both for the shifted class and for the overall classification task.

Model accuracy of PCBM-h with CLIP concepts after editing (N=30). We report the test accuracy in the shifted class and the overall test accuracy for each individual scenario in the user study. Accuracy for the pruning strategies is averaged over users and is shown as mean ± standard error.

Singh et al. (2020) identify co-occurrence biases in the COCO-Stuff dataset, where 20 categories frequently co-occur with other identified categories. In Table 10, the reader can find the concepts, taken from Singh et al. In Table 10, we report the Top-5 concepts for the PCBM trained with CLIP concepts. In the Biased Context column, we give the category that co-occurs frequently with the given context. In the PCBM Top-5 Concepts column, we give the concepts that have the highest

Model accuracy of PCBM with CLIP concepts after editing (N=30). We report the test accuracy in the shifted class and the overall test accuracy for each individual scenario in the user study. Accuracy for the pruning strategies is averaged over users and is shown as mean ± standard error.

COCO-Stuff Concepts.

weight for the given category. Looking at the PCBM concepts, we can see that we can identify the biased contexts. For instance, the cup category has table, dining table; apple has fruit, fresh vegetables; and car has traffic circle, intersection among the concepts identified as important, which parallel the biased contexts. In other cases, our model surfaced different potential co-occurrence biases, such as chimney for clock, and toilet for hair drier.

ACKNOWLEDGMENTS

We would like to thank Duygu Yilmaz, Adarsh Jeewajee, Carlos Guestrin, Edward Chen, Kyle Swanson, Omer Faruk Akgun, and Roxana Daneshjou for their support and comments on the manuscript, and all members of the Zou Lab and Guestrin Lab for helpful discussions. We thank the anonymous reviewers for their suggestions to improve this manuscript. J.Z. is supported by NSF CAREER 1942926 and the Chan-Zuckerberg Biohub.

ETHICAL STATEMENT

In our user study, we did not collect any user information. We applied to the IRB within our institution and "IRB has determined that this research does not involve human subjects as defined in 45 CFR 46.102(f) and therefore does not require review by the IRB".

REPRODUCIBILITY STATEMENT

The code to reproduce all experiments will be released at https://github.com/mertyg/post-hoc-cbm. All of the backbones we used to construct concept bottlenecks are publicly released checkpoints, and we stated all of these in the manuscript.

