POST-HOC CONCEPT BOTTLENECK MODELS

Abstract

Concept Bottleneck Models (CBMs) map inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck Models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available for the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous work on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining. The code for our paper can be found at https://github.com/mertyg/post-hoc-cbm.

1. INTRODUCTION

There is growing interest in developing deep learning models that are interpretable yet still flexible. One approach is concept analysis (Kim et al., 2018), where the goal is to understand if and how high-level, human-understandable features are "engineered" and used by neural networks. For instance, we may want to probe a skin lesion classifier to understand whether the Irregular Streaks concept is encoded in the embedding space of the classifier and later used to make the prediction. Our work builds on the earlier idea of concept bottlenecks, specifically Concept Bottleneck Models (CBMs) (Koh et al., 2020). Concept bottlenecks are inspired by the idea that we can solve the task of interest by applying a function to an underlying set of human-interpretable concepts. For instance, when trying to classify whether a skin tumor is malignant, dermatologists look for different visual patterns; e.g., the existence of Blue-Whitish Veils can be a useful indicator of melanoma (Menzies et al., 1996; Lucieri et al., 2020). CBMs train an entire model in an end-to-end fashion by first predicting concepts (e.g., the presence of Blue-Whitish Veils), then using these concepts to predict the label. By constraining the model to rely only on a set of concepts and an interpretable predictor, we can (i) explain what information the model uses when classifying an input by inspecting the weights/rules of the interpretable predictor, and (ii) understand when the model made a particular mistake due to incorrect concept predictions. While CBMs provide several of the benefits mentioned above, they have several key limitations:

1. Data: CBMs require access to concept labels during model training, i.e., the training data should be annotated with which concepts are present. Even though there are a number of densely annotated datasets such as CUB (Wah et al., 2011), this is particularly restrictive for real-world use cases, where training datasets rarely have concept annotations.

2. Performance: CBMs often do not match the accuracy of an unrestricted model, potentially reducing the incentive to use them in practice. When the concepts are not sufficient to solve the desired task, it is not clear how to improve the CBM to match the original model's performance while retaining the interpretability benefits.

3. Model editing: Koh et al. (2020) discuss intervening on the model to fix the prediction for a single input, yet it is not shown how to holistically edit and improve the model itself. Intervening changes the model's behavior only for a single sample, whereas global editing changes the model's behavior across all inputs. When the model picks up an unintended cue or learns spurious associations, editing the concept bottleneck can improve the model's performance more generally than an intervention tailored to one specific input. Prior work on CBMs does not discuss how to globally edit a model's behavior. Ideally, we would like to edit models with the help of human input, lowering computational costs and removing assumptions about data access.
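Concretely, a global edit of this kind can be as simple as zeroing the weights that connect a spurious concept to the class scores in an interpretable linear layer. The sketch below is illustrative only, not code from the paper; the concept names and weight values are hypothetical:

```python
import numpy as np

def edit_out_concept(W, concept_names, concept):
    """Globally edit a concept-bottleneck linear layer by zeroing the class
    weights attached to one concept. Unlike a per-example intervention,
    this changes the model's behavior on every input."""
    W_edited = W.copy()  # W has shape (n_classes, n_concepts)
    W_edited[:, concept_names.index(concept)] = 0.0
    return W_edited

# Hypothetical bottleneck: suppose "water" is a spurious background cue.
concepts = ["stripes", "water", "dots"]
W = np.array([[2.0, 1.5, 0.1],    # class 0 weights over the concepts
              [0.2, -1.0, 0.8]])  # class 1 weights
W_edited = edit_out_concept(W, concepts, "water")
```

Because the edit acts on the predictor's parameters rather than on one example's concept values, every subsequent prediction ignores the pruned concept.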



Our contributions. In this work, we propose the Post-hoc Concept Bottleneck Model (PCBM) to address these important challenges. PCBMs can convert any pre-trained model into a concept bottleneck model in a data-efficient manner, enhancing the model with the desired interpretability benefits. When the training data does not have concept annotations, which is often the case, PCBM can flexibly leverage concepts annotated in other datasets or natural language descriptions of concepts. When applicable, PCBMs can remove the laborious concept annotation process by leveraging multimodal models to obtain concept representations; this results in richer and more expressive bottlenecks built from natural language descriptions of a concept, making PCBMs more accessible in various settings. Furthermore, when the available concepts are not sufficiently rich, we introduce a residual modeling step to the PCBM to recover the original black-box model's performance. In experiments across several tasks, we show that PCBMs achieve performance comparable to black-box models. While prior work (Koh et al., 2020) demonstrated the possibility of performing local model interventions to change individual predictions, here we propose interventions that change global model behavior. Through user studies, we show that PCBM enables efficient global model edits without retraining or access to data from the target domain, and that users can improve PCBM performance by using concept-level feedback to drive editing decisions.
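The residual modeling step can be sketched as follows: the interpretable predictor over concept projections is fit (or fixed) first, and a linear residual head over the raw embeddings is then trained on top of its frozen logits, so the residual only explains what the bottleneck cannot. Everything below is a toy illustration with synthetic data; the single fixed "concept predictor" is a stand-in, not the paper's actual training procedure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_residual(emb, concept_logits, labels, lr=0.5, epochs=500):
    """Fit a linear residual head u on raw embeddings by gradient descent
    on the logistic loss, keeping the concept predictor's logits frozen."""
    u = np.zeros(emb.shape[1])
    for _ in range(epochs):
        p = sigmoid(concept_logits + emb @ u)
        u -= lr * emb.T @ (p - labels) / len(labels)
    return u

# Synthetic demo: the label depends on two directions of the embedding,
# but the concept bank covers only the first, so the bottleneck alone
# is insufficient and the residual head closes the gap.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 16))
labels = (emb[:, 0] + emb[:, 1] > 0).astype(float)
concept_logits = emb[:, 0]  # toy frozen concept predictor (dim 0 only)

acc_before = np.mean((concept_logits > 0) == labels)
u = fit_residual(emb, concept_logits, labels)
acc_after = np.mean((concept_logits + emb @ u > 0) == labels)
```

Training the residual sequentially (rather than jointly) keeps the concept weights, and hence the concept-level explanations, intact.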

Figure 1: Post-hoc Concept Bottleneck Models. First, we learn the vectors in our concept bank. With the CAV approach, for each concept, e.g. stripes, we train a linear SVM to distinguish the embeddings of examples that contain the concept and use the vector normal to the boundary (CAV). When annotations are hard to obtain, we can leverage multimodal models and use the text encoder to map each concept to a vector. Next, we project the embeddings produced by the backbone onto the concept subspace defined by the set of vectors. We then train an interpretable predictor to classify the examples from their projections. When the concept library is incomplete, we can construct a PCBM-h by sequentially introducing a residual predictor that maps the embeddings to the target space.

