CAUSAL PROXY MODELS FOR CONCEPT-BASED MODEL EXPLANATIONS

Abstract

Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model N because it is trained to have the same actual input/output behavior as N while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of N. Furthermore, we show that the best CPM for N performs comparably to N in making factual predictions, which means that the CPM can simply replace N, leading to more explainable deployed models.

1. INTRODUCTION

The gold standard for explanation methods in AI should be to elucidate the causal role that a model's representations play in its overall behavior: to truly explain why the model makes the predictions it does. Causal explanation methods seek to do this by resolving the counterfactual question of what the model would do if input X were changed to a relevant counterfactual version X′. Unfortunately, even though neural networks are fully observed, deterministic systems, we still encounter the fundamental problem of causal inference (Holland, 1986): for a given ground-truth input X, we never observe the counterfactual inputs X′ necessary for isolating the causal effects of model representations on outputs. The issue is especially pressing in domains where it is hard to synthesize approximate counterfactuals. In response, explanation methods typically do not explicitly train on counterfactuals at all. In this paper, we show that robust explanation methods for NLP models can be obtained using texts approximating true counterfactuals. The heart of our proposal is the Causal Proxy Model (CPM). CPMs are trained to mimic both the factual and counterfactual behavior of a black-box model N. We explore two different methods for training such explainers. These methods share a distillation-style objective that pushes them to mimic the factual behavior of N, but they differ in their counterfactual objectives. The input-based method CPM_IN appends to the factual input a new token associated with the counterfactual concept value. The hidden-state method CPM_HI employs the Interchange Intervention Training (IIT) method of Geiger et al. (2022) to localize information about the target concept in specific hidden states. Figure 1 provides a high-level overview.
We evaluate these methods on the CEBaB benchmark for causal explanation methods (Abraham et al., 2022), which provides large numbers of original examples (restaurant reviews) with human-created counterfactuals for specific concepts (e.g., service quality), with all the texts labeled for their concept-level and text-level sentiment. We consider two types of approximate counterfactuals derived from CEBaB: texts written by humans to approximate a specific counterfactual, and texts sampled using metadata-guided heuristics. Both approximate counterfactual strategies lead to state-of-the-art performance on CEBaB for both CPM_IN and CPM_HI. We additionally identify two other benefits of using CPMs to explain models. First, both CPM_IN and CPM_HI have factual performance comparable to that of the original black-box model N and can explain their own behavior extremely well. Thus, the CPM for N can actually replace N, leading to more explainable deployed models. Second, CPM_HI models localize concept-level information in their hidden representations, which makes their behavior on specific inputs very easy to explain. We illustrate this using Path Integrated Gradients (Sundararajan et al., 2017), which we adapt to allow input-level attributions to be mediated by the intermediate states that were targeted for localization. Thus, while both CPM_IN and CPM_HI are comparable as explanation methods according to CEBaB, the qualitative insights afforded by CPM_HI models may give them the edge when it comes to explanations.

2. RELATED WORK

Understanding model behavior serves many goals for large-scale AI systems, including transparency (Kim, 2015; Lipton, 2018; Pearl, 2019; Ehsan et al., 2021) , trustworthiness (Ribeiro et al., 2016; Guidotti et al., 2018; Jacovi & Goldberg, 2020; Jakesch et al., 2019) , safety (Amodei et al., 2016; Otte, 2013) , and fairness (Hardt et al., 2016; Kleinberg et al., 2017; Goodman & Flaxman, 2017; Mehrabi et al., 2021) . With CPMs, our goal is to achieve explanations that are causally motivated and concept-based, and so we concentrate here on relating existing methods to these two goals. Feature attribution methods estimate the importance of features, generally by inspecting learned weights directly or by perturbing features and studying the effects this has on model behavior (Molnar, 2020; Ribeiro et al., 2016) . Gradient-based feature attribution methods extend this general mode of explanation to the hidden representations in deep networks (Zeiler & Fergus, 2014; Springenberg et al., 2014; Binder et al., 2016; Shrikumar et al., 2017; Sundararajan et al., 2017) . Concept Activation Vectors (CAVs; Kim et al. 2018; Yeh et al. 2020) can also be considered feature attribution methods, as they probe for semantically meaningful directions in the model's internal representations and use these to estimate the importance of concepts on the model predictions. While some methods in this space do have causal interpretations (e.g., Sundararajan et al. 2017; Yeh et al. 2020) , most do not. In addition, most of these methods offer explanations in terms of specific (sets of) features/neurons. (Methods based on CAVs operate directly in terms of more abstract concepts.) Intervention-based methods study model representations by modifying them in systematic ways and observing the resulting model behavior. These methods are generally causally motivated and allow for concept-based explanations. 
Examples of methods in this space include causal mediation analysis (Vig et al., 2020; De Cao et al., 2021; Ban et al., 2022) , causal effect estimation (Feder et al., 2020; Elazar et al., 2021; Abraham et al., 2022; Lovering & Pavlick, 2022) , tensor product decomposition (Soulos et al., 2020) , and causal abstraction analysis (Geiger et al., 2020; 2021) . CPMs are most closely related to the method of IIT (Geiger et al., 2021) , which extends causal abstraction analysis to optimization. Probing is another important class of explanation method. Traditional probes do not intervene on the target model, but rather only seek to find information in it via supervised models (Conneau et al., 2018; Tenney et al., 2019) or unsupervised models (Clark et al., 2019; Manning et al., 2020; Saphra & Lopez, 2019) . Probes can identify concept-based information, but they cannot offer guarantees that probed information is relevant for model behavior (Geiger et al., 2021) . For causal guarantees, it is likely that some kind of intervention is required. For example, Elazar et al. (2021) and Feder et al. (2020) remove information from model representations to estimate the causal role of that information. Our CPMs employ a similar set of guiding ideas but are not limited to removing information. Counterfactual explanation methods aim to explain model behavior by providing a counterfactual example that changes the model behavior (Goyal et al., 2019; Verma et al., 2020; Wu et al., 2021) . Counterfactual explanation methods are inherently causal. If they can provide counterfactual examples with regard to specific concepts, they are also concept-based. Some explanation methods train a model making explicit use of intermediate variables representing concepts. Manipulating these intermediate variables at inference time yields causal concept-based model explanations (Koh et al., 2020; Künzel et al., 2019) . Evaluating methods in this space has been a persistent challenge. 
In prior literature, explanation methods have often been evaluated against synthetic datasets (Feder et al., 2020; Yeh et al., 2020). In response, Abraham et al. (2022) introduced the CEBaB dataset, which provides a human-validated concept-based dataset to truthfully evaluate different causal concept-based model explanation methods. Our primary evaluations are conducted on CEBaB.

[Figure 1a: A structural causal model leading to an actual text x_{u,v} and its counterfactual text x_{u,v}^{C_i←c′}. U is an exogenous variable over experiences, c_1, ..., c_k are mediating concepts, and V is an exogenous variable capturing the writing (and star-rating) experience. At right, we create a counterfactual in which concept C_i takes on a different value. Unfortunately, we cannot truly create such counterfactual situations, so we never observe pairs of texts like these; we must rely on approximate counterfactuals.]

[Figure 1b: Approximate counterfactuals. Let x_{u,v} be a text written in situation (u, v). Human-created x̃_{u,v}^{C_i←c′}: a crowdworker edits x_{u,v} to express that C_i had value c′, seeking to keep all else constant and thereby simulate a causal intervention. Metadata-sampled x̃_{u,v}^{C_i←c′}: a separate text sampled to express C_i = c′ while agreeing with x_{u,v} on all other concepts.]

[Figure 1e: The CPM processes the factual input x_{u,v} under an intervention in which specific internal states are replaced by those the CPM computes for a source input x_{u′,v′}^{C_i=c′}, a distinct example sampled with the sole criterion that it express C_i = c′. The effect of this intervention is to localize information about concept C_i at the intervention site, since the only indication the CPM gets about C_i ← c′ is via the intervention.]

3. CAUSAL PROXY MODELS

A Structural Causal Model Our discussion is grounded in the causal model depicted in Figure 1a, which aligns well with the CEBaB benchmark. Two exogenous variables U and V together represent the complete state of the world and generate some textual data X. The effect of exogenous variable U on the data X is completely mediated by a set of intermediate variables C_1, C_2, ..., C_k, which we refer to as concepts. Therefore, we can think of U as the part of the world that gives rise to these concepts {C_i}_{i=1}^{k}. Using this causal model, we can describe counterfactual data: data that arose under a counterfactual state of the world (right diagram in Figure 1a). Our factual text is x_{u,v}, and we write x_{u,v}^{C_i←c′} for the counterfactual text obtained by intervening on concept C_i to set its value to c′. The counterfactual x_{u,v}^{C_i←c′} describes the text that arises when the value of C_i is set to c′, all else being held equal.

Approximate Counterfactuals Unfortunately, pairs like (x_{u,v}, x_{u,v}^{C_i←c′}) are never observed, and thus we need strategies for creating approximate counterfactuals x̃_{u,v}^{C_i←c′}. Figure 1b describes the two strategies we use in this paper. In the human-created strategy, we rely on a crowdworker to edit x_{u,v} to achieve a particular counterfactual goal, say, making the evaluation of the restaurant's food negative. CEBaB contains an abundance of such pairs (x_{u,v}, x̃_{u,v}^{C_i←c′}). However, CEBaB is unusual in having so many human-created approximate counterfactuals, so we also explore a simpler strategy in which x̃_{u,v}^{C_i←c′} is sampled with the requirement that it match x_{u,v} on all concepts but set C_i to c′. This strategy is supported in many real-world datasets; for example, the OpenTable reviews underlying CEBaB all have the needed metadata (Abraham et al., 2022).

CPM_IN: Input-based CPM Given a dataset of approximate counterfactual pairs (x_{u,v}, x̃_{u,v}^{C_i←c′}) and a black-box model N, we train a new CPM_IN model P with the counterfactual objective

L_IN = CE_S( N(x̃_{u,v}^{C_i←c′}), P(x_{u,v}; t_{C_i←c′}) )    (1)

where x_{u,v}; t_{C_i←c′} in Eqn. 1 denotes the concatenation of the factual input and a randomly initialized learnable token embedding t_{C_i←c′} describing the intervention C_i ← c′. CE_S is the smoothed cross-entropy loss (Hinton et al., 2015), measuring the divergence between the output logits of the two models. The objective in Eqn. 1 pushes P to predict the counterfactual behavior of N when a descriptor of the intervention is given (Figure 1d). At inference time, approximate counterfactuals are inaccessible. To explain model N, we append the trained token embedding t_{C_i←c′} to a factual input, upon which P predicts a counterfactual output for this input, which we use to estimate the counterfactual behavior of N under this intervention.
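The distillation-style objective CE_S can be sketched concretely. The following is a minimal pure-Python illustration (not the paper's implementation, which uses PyTorch): the teacher logits stand in for N's prediction on the approximate counterfactual, the student logits for P's prediction on the factual input plus intervention token; the function name `smoothed_ce` and the example logits are our own.

```python
import math

def softmax(logits, temperature=2.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_ce(teacher_logits, student_logits, temperature=2.0):
    """Smoothed cross-entropy (Hinton et al., 2015): cross-entropy between
    the teacher's softened distribution and the student's softened one."""
    p = softmax(teacher_logits, temperature)   # N's counterfactual output
    q = softmax(student_logits, temperature)   # P(x_{u,v}; t_{C_i<-c'})
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# L_IN is minimized (up to the teacher's entropy) when the student's
# logits match the teacher's.
teacher = [2.0, 0.5, -1.0]
loss_matched = smoothed_ce(teacher, teacher)
loss_mismatched = smoothed_ce(teacher, [-1.0, 0.5, 2.0])
```

Because cross-entropy is lower-bounded by the teacher's entropy, any mismatch between the two distributions strictly increases the loss, which is what drives P toward N's counterfactual behavior.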
CPM_HI: Hidden-state CPM Our CPM_HI models are trained on the same data and with the same set of goals as CPM_IN: to mimic both the factual and counterfactual behavior of N. The key difference is how information about the intervention C_i ← c′ is exposed to the model. Specifically, we adapt Interchange Intervention Training (Geiger et al., 2022) to train our CPM_HI models for concept-based model explanation. A conventional intervention on a hidden representation H of a neural network N fixes the value of H to a constant. In an interchange intervention, we instead fix H to the value it would have had when processing a separate source input s. The result of the interchange intervention is a new model, which we write as N_{H←H_s}, where ← is the conventional intervention operator and H_s is the value of hidden representation H when processing input s. Given a dataset of approximate counterfactual input pairs (x_{u,v}, x̃_{u,v}^{C_i←c′}) and a black-box model N, we train a new CPM_HI model P with the following counterfactual objective:

L_HI = CE_S( N(x̃_{u,v}^{C_i←c′}), P_{H_{C_i}←H_{C_i}^{s}}(x_{u,v}) )    (2)

Here H_{C_i} are the hidden states designated for concept C_i. In essence, we train P to fully mediate the effect of intervening on C_i through the hidden representation H_{C_i}. The source input s is any input x_{u′,v′}^{C_i=c′} that has C_i = c′. Because P only receives information about the concept-level intervention C_i ← c′ via the interchange intervention H_{C_i} ← H_{C_i}^{s}, the model is forced to store all causally relevant information about C_i in the corresponding hidden representation. This process is depicted in Figure 1e. In the ideal situation, the source input x_{u′,v′}^{C_i=c′} and x_{u,v} share the same value only for C_i and differ on all others, so that the counterfactual signal needed for localization is pure. However, we do not insist on this when we sample.
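The interchange-intervention operation itself is mechanically simple: run the network on the source input, cache the targeted hidden values, and overwrite the corresponding hidden values when processing the base input. A toy sketch (our own two-layer network; in practice this is done on Transformer hidden states):

```python
# Toy two-layer network: layer1 produces the hidden vector H, layer2 the output.
def layer1(x):
    return [x[0] + x[1], x[0] - x[1], 2.0 * x[0]]

def layer2(h):
    return sum(h)

def interchange_intervention(base, source, site):
    """Compute N_{H <- H_s}(base): run the network on `base`, but overwrite
    the hidden units indexed by `site` with the values they take when the
    network processes the `source` input."""
    h_base = layer1(base)
    h_source = layer1(source)
    for i in site:
        h_base[i] = h_source[i]
    return layer2(h_base)

plain = layer2(layer1([1.0, 2.0]))                             # ordinary forward pass
patched = interchange_intervention([1.0, 2.0], [5.0, 0.0], site=[2])
```

When `source` equals `base`, the intervention is a no-op; when they differ, the output reflects a mix of the two computations, which is exactly the signal L_HI trains against.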
In addition, we allow null effect pairs in which x_{u,v} and x̃_{u,v}^{C_i←c′} are identical. For additional details on this sampling procedure, see Appendix A.2. At inference time, approximate counterfactuals are inaccessible, as before. To explain model N with regard to intervention C_i ← c′, we manipulate the internal states of model P by intervening on the localized representation H_{C_i} for concept C_i. To achieve this, we sample a source input x_{u′,v′}^{C_i=c′} from the train set, i.e., any input x that has C_i = c′, and use it to derive H_{C_i}^{s}.

Training Objectives We include another distillation objective that pushes P to predict the same output as N under conventional circumstances:

L_Mimic = CE_S( N(x_{u,v}), P(x_{u,v}) )

The overall training objective for our models is

L = λ_1 L_Mimic + λ_2 L_Counterfactual

where L_Counterfactual is either L_IN or L_HI, and we set λ_1 and λ_2 to 1.0 and 3.0, respectively.
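The overall objective is a simple weighted sum; as a sketch (the per-batch loss values below are hypothetical placeholders, the weights are the paper's):

```python
def cpm_loss(l_mimic, l_counterfactual, lam1=1.0, lam2=3.0):
    """Overall CPM objective: L = lam1 * L_Mimic + lam2 * L_Counterfactual,
    where L_Counterfactual is L_IN or L_HI. Weights (1.0, 3.0) follow the
    paper's setting."""
    return lam1 * l_mimic + lam2 * l_counterfactual

total = cpm_loss(0.2, 0.5)  # hypothetical per-batch loss values
```

The heavier weight on the counterfactual term reflects that factual mimicry is easy (the CPM is initialized from N), while counterfactual mimicry carries the new training signal.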

4.1. CAUSAL ESTIMATION-BASED BENCHMARK (CEBAB)

CEBaB (Abraham et al., 2022) is a large benchmark of high-quality, labeled approximate counterfactuals for the task of sentiment analysis on restaurant reviews. The benchmark was created starting from a set of 2,299 original restaurant reviews from OpenTable. For each of these original reviews, approximate counterfactual examples were written by human annotators; the annotators were tasked with editing the original text to reflect a specific intervention, like 'change the food evaluation from negative to positive' or 'change the service evaluation from positive to unknown'. In this way, the original reviews were expanded with approximate counterfactuals to a total of 15,089 texts. The groups of originals and corresponding approximate counterfactuals are partitioned over train, dev, and test sets. The pairs in the development and test sets are used to benchmark explanation methods. Each text in CEBaB was labeled by five crowdworkers with a 5-star sentiment score. In addition, each text was annotated at the concept level for four mediating concepts {C_ambiance, C_food, C_noise, C_service}, using the labels {negative, unknown, positive}, again with five crowdworkers annotating each concept-level label. We refer to Appendix A.1 and Abraham et al. (2022) for additional details. As discussed above (Section 3 and Figure 1b), we consider two sources of approximate counterfactuals using CEBaB. For human-created counterfactuals, we use the edited restaurant reviews of the train set. For metadata-sampled counterfactuals, we sample factual inputs from the train set that have the desired combination of mediating concepts. Using all the human-created edits leads to 19,684 training pairs of factuals and corresponding approximate counterfactuals. Sampling counterfactuals leads to 74,574 pairs. We use these approximate counterfactuals to train explanation methods. Appendix A.2 provides more information about our pairing process.

4.2. EVALUATION METRICS

Much of the value of a benchmark like CEBaB derives from its support for directly calculating the Estimated Individual Causal Concept Effect (ICaCE_N) for a model N given a human-generated approximate counterfactual pair (x_{u,v}, x̃_{u,v}^{C_i←c′}):

ICaCE_N(x_{u,v}, x̃_{u,v}^{C_i←c′}) = N(x̃_{u,v}^{C_i←c′}) − N(x_{u,v})    (3)

This is simply the difference between the vectors of output scores for the two examples. We do not expect to have pairs (x_{u,v}, x̃_{u,v}^{C_i←c′}) at inference time, and this is what drives the development of explanation methods E_N that estimate this quantity using only a factual input x_{u,v} and a description of the intervention C_i ← c′. To benchmark such methods, we follow Abraham et al. (2022) in using the ICaCE-Error:

ICaCE-Error_N^D(E) = (1/|D|) Σ_{(x_{u,v}, x̃_{u,v}^{C_i←c′}) ∈ D} Dist( ICaCE_N(x_{u,v}, x̃_{u,v}^{C_i←c′}), E_N(x_{u,v}; C_i ← c′) )    (4)

Here, we assume that D is a dataset consisting entirely of approximate counterfactual pairs (x_{u,v}, x̃_{u,v}^{C_i←c′}). Dist measures the distance between the ICaCE_N for the model N and the effect predicted by the explanation method. Abraham et al. (2022) consider three values for Dist: L2, which captures both direction and magnitude; Cosine distance, which captures the direction of effects but not their magnitude; and NormDiff (absolute difference of L2 norms), which captures magnitude but not direction. We report all three metrics.
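The three Dist variants are standard vector comparisons. A minimal sketch (the output-score vectors below are hypothetical):

```python
import math

def icace(n_out_cf, n_out_factual):
    """ICaCE_N: difference between the output-score vectors for the
    counterfactual and factual inputs."""
    return [a - b for a, b in zip(n_out_cf, n_out_factual)]

def l2(u, v):
    """Direction and magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_dist(u, v):
    """Direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def normdiff(u, v):
    """Magnitude only: absolute difference of L2 norms."""
    return abs(math.sqrt(sum(a * a for a in u)) -
               math.sqrt(sum(b * b for b in v)))

true_effect = icace([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])  # roughly [0.6, 0.0, -0.6]
predicted = [0.3, 0.0, -0.3]  # hypothetical explainer estimate
```

Note how the example separates the failure modes: the predicted effect points in exactly the right direction (cosine distance near zero) but has half the magnitude, which only L2 and NormDiff penalize.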

4.3. BASELINES

BEST CEBaB

We compare our results with the best results previously obtained on the CEBaB benchmark. Crucially, BEST CEBaB consists of the aggregated best results from a set of methods including CONEXP (Goyal et al., 2020), TCAV (Kim et al., 2018), ConceptSHAP (Yeh et al., 2020), INLP (Ravfogel et al., 2020), CausaLM (Feder et al., 2020), and S-Learner (Künzel et al., 2019).

S-Learner

S-Learner predicts intermediate concept values with a probe B and maps these predicted concept values to outputs with a logistic-regression head LR_N, trained to mimic the factual behavior of N:

L_Mimic^{S,B} = CE_S( N(x_{u,v}), LR_N(B(x_{u,v})) )    (5)

By intervening on the intermediate predicted concept values at inference time, we can hope to simulate the counterfactual behavior of N:

E_N^{S,B}(x_{u,v}; C_i ← c′) = LR_N( B(x_{u,v})_{C_i←c′} ) − LR_N( B(x_{u,v}) )    (6)

When using S-Learner in conjunction with approximate counterfactual inputs at train time, we simply add this counterfactual data on top of the observational data that is typically used to train S-Learner.

GPT-3 Large language models such as GPT-3 (175B) have shown extraordinary power in terms of in-context learning (Brown et al., 2020). We use GPT-3 to generate a new approximate counterfactual at inference time given a factual input and a descriptor of the intervention. This generated counterfactual is directly used to estimate the change in model behavior:

E_N^{GPT-3}(x_{u,v}; C_i ← c′) = N( GPT-3(x_{u,v}; C_i ← c′) ) − N(x_{u,v})    (7)

where GPT-3(x_{u,v}; C_i ← c′) represents the GPT-3-generated counterfactual edit. We prompt GPT-3 with demonstrations containing approximate counterfactual inputs. Full details on how these prompts are constructed can be found in Appendix A.7.
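The S-Learner intervention in Eqn. 6 can be illustrated with toy stand-ins (everything below is hypothetical: the probe always returning fixed labels, the linear head, and the weights are ours, not the paper's learned models):

```python
# Toy stand-ins: a concept probe B and a linear head standing in for the
# logistic-regression model LR_N over predicted concept values.
LABELS = {"negative": -1.0, "unknown": 0.0, "positive": 1.0}
WEIGHTS = {"food": 0.6, "service": 0.3, "ambiance": 0.05, "noise": 0.05}

def probe_B(text):
    """Toy probe: a real S-Learner predicts these labels from the text."""
    return {"food": "positive", "service": "negative",
            "ambiance": "unknown", "noise": "unknown"}

def lr_head(concepts):
    """Linear score over concept values (stand-in for LR_N)."""
    return sum(WEIGHTS[c] * LABELS[v] for c, v in concepts.items())

def s_learner_effect(text, concept, new_value):
    """E^{S,B}_N: intervene on one predicted concept value and take the
    difference of the head's outputs."""
    concepts = probe_B(text)
    intervened = dict(concepts, **{concept: new_value})
    return lr_head(intervened) - lr_head(concepts)

effect = s_learner_effect("the food was great but service was slow",
                          "service", "positive")
```

Setting a concept to the value the probe already predicts yields a zero effect, which matches the null-effect behavior one would expect from Eqn. 6.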

4.4. CAUSAL PROXY MODELS

We train CPMs for the publicly available models released for CEBaB, fine-tuned as five-way sentiment classifiers on the factual data. This includes four model architectures: bert-base-uncased (BERT; Devlin et al. 2019), RoBERTa-base (RoBERTa; Liu et al. 2019), GPT-2 (GPT-2; Radford et al. 2019), and LSTM+GloVe (LSTM; Hochreiter & Schmidhuber 1997; Pennington et al. 2014). All Transformer-based models (Vaswani et al., 2017) have 12 Transformer layers. Before training, each CPM is initialized with the architecture and weights of the black-box model we aim to explain. Thus, the CPMs are rooted in the factual behavior of N from the start. We include details about our setup in Appendix A.3. At inference time, the models are compared as follows, where P in Eqn. 8 and Eqn. 9 refers to the CPM trained under the CPM_IN and CPM_HI objectives, respectively:

E_N^{CPM_IN}(x_{u,v}; C_i ← c′) = P(x_{u,v}; t_{C_i←c′}) − N(x_{u,v})    (8)

E_N^{CPM_HI}(x_{u,v}; C_i ← c′) = P_{H_{C_i}←H_{C_i}^{s}}(x_{u,v}) − N(x_{u,v})    (9)

Here, s is a source input with C_i = c′, and H_{C_i} is the neural representation associated with C_i, which takes value H_{C_i}^{s} on the source input s. As H_{C_i}, we use the representation of the [CLS] token. Specifically, for BERT we use slices of width 192 taken from the 1st intermediate token of the 10th layer. For RoBERTa, we use the 8th layer instead. For GPT-2, we pick the final token of the 12th layer, again with a slice width of 192. For LSTM, we consider slices of the attention-gated sentence embedding with width 64. Appendix A.5 studies the impact of intervention location and size. Following the guidance on IIT given by Geiger et al. (2022), we train CPM_HI with an additional multi-task probing objective:

L_Multi = Σ_{C_i ∈ C} CE( MLP(H_{C_i}^{x}), c )

where the probe is parameterized by a multilayer perceptron MLP, and H_{C_i}^{x} is the value of the hidden representation for concept C_i when processing an input x whose concept label for C_i is c.
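The per-concept slicing can be sketched as follows. This is a simplified illustration under the assumption that the four concept sites tile a 768-dimensional [CLS] vector in a fixed order; the exact layout in the paper is an implementation detail we do not reproduce.

```python
# Assumed layout: the [CLS] representation (width 768) is partitioned into
# disjoint slices of width 192, one per mediating concept.
SLICE_WIDTH = 192
CONCEPTS = ["ambiance", "food", "noise", "service"]

def concept_slice(concept):
    """Index range of H_{C_i} inside the [CLS] hidden vector."""
    start = CONCEPTS.index(concept) * SLICE_WIDTH
    return start, start + SLICE_WIDTH

def patch_concept(h_base, h_source, concept):
    """H_{C_i} <- H_{C_i}^s: copy the source slice into the base vector,
    leaving the other concepts' slices untouched."""
    start, end = concept_slice(concept)
    return h_base[:start] + h_source[start:end] + h_base[end:]

h_base = [0.0] * 768
h_source = [1.0] * 768
h_patched = patch_concept(h_base, h_source, "food")
```

Because the slices are disjoint, an intervention on one concept cannot directly overwrite information stored for another, which is what makes the localized representations interpretable.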

5. RESULTS

We first benchmark both CPM variants and our baseline methods on CEBaB. We show that the CPMs achieve state-of-the-art performance, for both types of approximate counterfactuals used during training (Section 5.1). Given the good factual performance achieved by CPMs, we subsequently investigate whether CPMs can be deployed both as predictor and explanation method at the same time (Section 5.2) and find that they can. Finally, we show that the localized representations of CPM HI give rise to concept-aware feature attributions (Section 5.3). Our supplementary materials report on detailed ablation studies and explore the potential of our methods for model debiasing. S-Learner, one of the best individual explainers from the original CEBaB paper (Abraham et al., 2022) , shows only a marginal improvement when naively incorporating sampled and human-created counterfactuals during training over using no counterfactuals. This indicates that the large performance gains achieved by our CPMs over previous explainers are most likely due to the explicit use of a counterfactual training signal, and not primarily due to the addition of extra (counterfactual) data.

5.1. CEBAB PERFORMANCE

GPT-3 occasionally performs on par with our CPMs, generally only slightly underperforming our best explainer on human-created counterfactuals, while being significantly worse on sampled counterfactuals. While the GPT-3 explainer also explicitly uses approximate counterfactual data, the results indicate that our proposed counterfactual mimic objectives give better results. The better performance of CPMs over GPT-3 on sampled counterfactuals shows that our approach is more robust to the quality of the approximate counterfactuals used. While the GPT-3 explainer is easy to set up (no training required), it might not be suitable for some explanation applications regardless of performance, due to the latency and cost involved in querying the GPT-3 API. Across the board, explainers trained with human-created counterfactuals are better than those trained with sampled counterfactuals. This shows that the performance of explanation methods depends on the quality of the approximate counterfactual training data. While human counterfactuals give excellent performance, they may be expensive to create. Sampled counterfactuals are cheaper if the relevant metadata is available. Thus, under budgetary constraints, sampled counterfactuals may be more efficient. Finally, CPM_IN is conceptually the simpler of the two CPM variants. However, we discuss in Section 5.3 how the localized representations of CPM_HI lead to additional explainability benefits.

5.2. SELF-EXPLANATION WITH CPM

As outlined in Section 3, CPMs learn to mimic both the factual and counterfactual behavior of the black-box models they explain. We show in Table 2 that our CPMs achieve a factual Macro-F1 score comparable to the black-box finetuned models. We investigate whether we can simply replace the black-box model with our CPM and use the CPM both as factual predictor and counterfactual explainer. To answer this question, we measure the self-explanation performance of CPMs by simply replacing the black-box model N in Eqn. 4 with our factual CPM predictions at inference time.

[Table 4 color scale: −1 means the word contributes most negatively to predicting the target class (red); +1 means it contributes most positively (green).]

5.3. CONCEPT-AWARE FEATURE ATTRIBUTION WITH CPM HI

We have shown that CPM_HI provides trustworthy explanations (Section 5.1). We now investigate whether CPM_HI learns representations that mediate the effects of different concepts. We adapt Integrated Gradients (IG; Sundararajan et al. 2017) to provide concept-aware feature attributions, by only considering gradients flowing through the hidden representation associated with a given concept. We formalize this version of IG in Appendix A.8. In Table 4, we compare concept-aware feature attributions for two variants of CPM_HI (IIT and Multi-task) and the original black-box (Finetuned) model. For IIT we remove the multi-task objective L_Multi during training, and for Multi-task we remove the interchange intervention objective L_HI. This helps isolate the individual effects of both losses on concept localization. All three models predict a neutral final sentiment score for the considered input, but they show vastly different feature attributions. Only IIT reliably highlights words that are semantically related to each concept. For instance, when we restrict the gradients to flow only through the intervention site of the noise concept, "loud" is the word highlighted most as contributing negatively. When we consider the service concept, words like "friendly" and "waiter" are highlighted most as contributing positively. These contrasts are missing for representations of the Multi-task and Finetuned models. Only the IIT training paradigm pushes the model to learn causally localized representations. For the service concept, we notice that the IIT model wrongfully attributes "delicious". This could be useful for debugging purposes and could be used to highlight potential failure modes of the model.
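The idea of routing attributions through a concept's hidden units can be made concrete on a toy linear network (our own construction, not the paper's models; for a linear map the IG path integral collapses to a constant gradient, which keeps the sketch exact):

```python
# Toy linear network: h = W1 x (hidden layer), y = w2 . h (output).
W1 = [[1.0, 2.0],
      [0.5, -1.0]]
w2 = [1.0, 3.0]

def forward(x):
    h = [sum(W1[k][j] * x[j] for j in range(2)) for k in range(2)]
    return sum(w2[k] * h[k] for k in range(2))

def concept_ig(x, baseline, site):
    """Integrated gradients from `baseline` to `x`, with the gradient routed
    only through the hidden units in `site` (a concept's intervention site).
    For this linear model the path integral is a constant gradient."""
    attrs = []
    for j in range(2):
        grad_j = sum(w2[k] * W1[k][j] for k in site)  # dy/dx_j via `site`
        attrs.append((x[j] - baseline[j]) * grad_j)
    return attrs

full = concept_ig([1.0, 2.0], [0.0, 0.0], site=[0, 1])  # all hidden units
only_h0 = concept_ig([1.0, 2.0], [0.0, 0.0], site=[0])  # one concept's slice
```

With `site` covering all hidden units, the attributions satisfy IG's completeness axiom (they sum to F(x) − F(baseline)); restricting `site` decomposes that total into per-concept contributions.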

6. CONCLUSION

We explored the use of approximate counterfactual training data to build more robust causal explanation methods. We introduced Causal Proxy Models (CPMs), which learn to mimic both the factual and counterfactual behaviors of a black-box model N . Using CEBaB, a benchmark for causal concept-based explanation methods, we demonstrated that both versions of our technique (CPM IN and CPM HI ) significantly outperform previous explanation methods without demanding the full causal graph associated with the dataset. Interestingly, we find that our GPT-3 based explanation method performs on-par with our best CPM model in some settings. Our results suggest that CPMs can be more than just explanation methods. They achieve factual performance on par with the model they aim to explain, and they can explain their own behavior. This paves the way to using them as deployed models that both perform tasks and offer explanations. In addition, the causally localized representations of our CPM HI variant are very intuitive, as revealed by our concept-aware feature attribution technique. We believe that causal localization techniques could play a vital role in further model explanation efforts.

A APPENDIX

A.1 CEBAB DATASET STATISTICS

Human-created Counterfactuals CEBaB contains multiple counterfactual sentences for each original review. To achieve this, the dataset creators asked annotators to edit the original sentence to achieve a specified goal (e.g., 'change the evaluation of the restaurant's food to negative'). These originals and corresponding edits form our human-created pairs.

Metadata-sampled Counterfactuals Human-created counterfactuals are not always available. With CEBaB, we simulate a second type of approximate counterfactual using metadata-guided heuristics: for a given original sentence, we sample a counterfactual from the train set by matching concept labels while allowing only one label to be changed. During training, we also consider null effect pairs in our sampling setup. These pairs represent cases where our approximate counterfactual sentence is identical to the original sentence. When training our models on these pairs, we expect our models to predict the same counterfactual and factual output.
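The metadata-guided sampling heuristic amounts to a filtered lookup over concept labels. A minimal sketch with hypothetical records (real CEBaB metadata covers four concepts; two suffice to show the matching rule):

```python
import random

# Toy metadata records: each review carries labels for its concepts.
REVIEWS = [
    {"text": "r1", "food": "positive", "service": "negative"},
    {"text": "r2", "food": "negative", "service": "negative"},
    {"text": "r3", "food": "negative", "service": "positive"},
    {"text": "r4", "food": "positive", "service": "positive"},
]

def sample_counterfactual(original, concept, new_value, pool, rng=random):
    """Return a pool text whose metadata matches `original` on every
    concept except `concept`, which must equal `new_value`."""
    matches = [
        r for r in pool
        if r[concept] == new_value
        and all(r[c] == v for c, v in original.items()
                if c not in (concept, "text"))
    ]
    return rng.choice(matches) if matches else None

original = REVIEWS[0]  # food: positive, service: negative
cf = sample_counterfactual(original, "food", "negative", REVIEWS)
```

Because all non-target concepts must agree, the sampled text approximates the single-concept intervention C_i ← c′ while leaving the other mediators fixed at the metadata level.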

A.3 TRAINING REGIMES

CPM IN To train CPM_IN, we use the same model architecture as N and initialize it with the weights of N. The maximum number of training epochs is set to 30 with a learning rate of 5e-5 and an effective batch size of 128. The learning rate linearly decays to 0 over the 30 training epochs. We employ an early stopping strategy on COS ICaCE over the dev set, evaluated every 50 steps with early stopping patience set to 20. We set the max sequence length to 128 and the dropout rate to 0.1. We take a weighted sum of two objectives as the loss term for training CPM_IN. Specifically, we use [w_Mimic, w_IN] = [1.0, 3.0]. For the smoothed cross-entropy loss, we use a temperature of 2.0. CPM HI To train CPM_HI, we use the same model architecture as N and initialize it with the weights of N. The maximum number of training epochs is set to 30 with a learning rate of 8e-5 and an effective batch size of 256. We use a higher learning rate of 0.001 for the LSTM model as it enables quicker convergence. The learning rate linearly decays to 0 over the 30 training epochs. We employ an early stopping strategy on COS ICaCE over the dev set, evaluated every 10 steps with early stopping patience set to 20. We set the max sequence length to 128 and the dropout rate to 0.1. We take a weighted sum of three objectives as the loss term for training CPM_HI. Specifically, we use [w_Mimic, w_Multi, w_HI] = [1.0, 1.0, 3.0]. In Appendix A.6, we conduct a set of ablation studies to isolate the individual contributions from each objective. For the smoothed cross-entropy loss, we use a temperature of 2.0. Our models are all implemented in PyTorch (Paszke et al., 2019) using the HuggingFace library (Wolf et al., 2019). All of our results are aggregated over three distinct random seeds. To foster reproducibility, we will release our code repository and model artifacts to the public.

A.4 ADDITIONAL BASELINE RESULTS

Table 6 shows baselines adapted from Abraham et al. (2022), which contains the present state-of-the-art explanation methods for the CEBaB benchmark. We report the best scores across these explanation methods in Table 1. These baselines are trained without using counterfactual data. We therefore build additional baselines that do use counterfactual data, shown in Table 7. S-Learner is selected as the best-performing of these models and is included in Table 1 for comparison. The equations for the additional baselines are as follows:

E^approx_N(x_{u,v}; C_i ← c') = N(s_approx) − N(x_{u,v})

E^random_N(x_{u,v}; C_i ← c') = N(s_random) − N(x_{u,v})

E^CaCE_N(C_i ← c') = (1 / |D_{C_i←c'}|) Σ_{(x_{u,v}, x̃^{C_i←c'}_{u,v}) ∈ D_{C_i←c'}} [ N(x̃^{C_i←c'}_{u,v}) − N(x_{u,v}) ]

E^ATE(C_i ← c') = (1 / |D_{C_i←c'}|) Σ_{(x_{u,v}, x̃^{C_i←c'}_{u,v}) ∈ D_{C_i←c'}} [ f(x̃^{C_i←c'}_{u,v}) − f(x_{u,v}) ]

where s_random is a randomly sampled training input, s_approx is a training input sampled to match the concept-level labels of the true counterfactual under intervention C_i ← c', D_{C_i←c'} is the set of all approximate counterfactual training pairs that represent a C_i ← c' intervention, and f is a look-up function that returns the ground-truth label associated with an input. The signatures of E^ATE and E^CaCE_N reflect that they are independent of the specific factual input x_{u,v} considered. Furthermore, E^ATE is independent of N, given that this explainer only uses ground-truth training labels to estimate causal effects. Additionally, we consider X-Learner, a variant of S-Learner (Künzel et al., 2019). Our X-Learner consists of three steps. First, we cluster examples into groups by their concept and predicted concept-label pairs (e.g., select all examples with food being positive). For each group, we fit a logistic regression model μ_{(C_i,c)} to predict the factual output of the black-box model N using the concept labels of each example, excluding the labels for C_i.
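On a toy black-box model, the difference-based estimators above reduce to a few lines. The toy stand-in `toy_N` (a keyword counter playing the role of N) and the function names are our own illustration, not the benchmark's implementation.

```python
def e_approx(N, x_factual, s_approx):
    # E^approx: output difference between a concept-matched sample and the factual input
    return N(s_approx) - N(x_factual)

def e_random(N, x_factual, s_random):
    # E^random: same form, but the source input is sampled at random
    return N(s_random) - N(x_factual)

def e_cace(N, pairs):
    # E^CaCE: average effect over all approximate counterfactual pairs for C_i <- c'
    return sum(N(cf) - N(x) for x, cf in pairs) / len(pairs)

# Hypothetical stand-in for the black-box model: +1 per positive keyword.
toy_N = lambda text: sum(w in ("great", "delicious") for w in text.split())
```

E^ATE has the same shape as `e_cace`, with the ground-truth look-up f in place of N.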
Next, we use the models from the first step to build training sets for our individual treatment effect (ITE) estimators. We calculate the ITE for each example as

D̂^{C_i:c←c'}_{u,v} = μ_{(C_i,c')}(B(x^{C_i=c}_{u,v})') − N(x^{C_i=c}_{u,v})

where B(x^{C_i=c}_{u,v})' excludes the concept label for concept C_i. This measures the ITE for x_{u,v} when we change the concept label of C_i from c to c'. We aggregate D̂^{C_i:c←c'}_{u,v} over examples based on their editing concepts and concept labels. Next, we fit a set of linear regression models τ_{C_i:c←c'} to predict the ITE of changing the concept label of C_i, given the concept labels of an example excluding those for C_i. Lastly, we use τ_{C_i:c←c'} to predict counterfactual output changes as

E^{X-Learner}_N(x_{u,v}; C_i ← c') = p · τ_{C_i:c←c'}(B(x^{C_i=c}_{u,v})') + (1 − p) · τ_{C_i:c'←c}(B(x^{C_i=c}_{u,v})')

where p is the propensity score, calculated using B as the probability of C_i taking concept label c' for input example x_{u,v}, considering the two potential concept labels c and c'.
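The three X-Learner steps can be sketched end-to-end on toy data. This is a heavily simplified illustration: the logistic and linear regressions are replaced by per-configuration mean predictors, examples are dictionaries of concept labels, N reads its output directly from those labels, and all function names are our own.

```python
from collections import defaultdict

def fit_mean_predictor(rows, targets):
    # stand-in for the regressions: predict the mean target per configuration
    # of the remaining concept labels; fall back to the overall mean
    sums, counts = defaultdict(float), defaultdict(int)
    for r, t in zip(rows, targets):
        sums[r] += t
        counts[r] += 1
    overall = sum(targets) / len(targets)
    return lambda r: sums[r] / counts[r] if counts[r] else overall

def x_learner_effect(examples, N, concept, c, c_prime, p=0.5):
    def rest(e):  # concept labels excluding the intervened concept C_i
        return tuple(sorted((k, v) for k, v in e.items() if k != concept))
    group_c = [e for e in examples if e[concept] == c]
    group_cp = [e for e in examples if e[concept] == c_prime]
    # Step 1: per-group outcome models mu predicting N's factual output
    mu_c = fit_mean_predictor([rest(e) for e in group_c], [N(e) for e in group_c])
    mu_cp = fit_mean_predictor([rest(e) for e in group_cp], [N(e) for e in group_cp])
    # Step 2: pseudo-ITEs D-hat in both directions
    d_fwd = [mu_cp(rest(e)) - N(e) for e in group_c]
    d_rev = [N(e) - mu_c(rest(e)) for e in group_cp]
    # Step 3: ITE models tau, combined via the propensity score p
    tau_fwd = fit_mean_predictor([rest(e) for e in group_c], d_fwd)
    tau_rev = fit_mean_predictor([rest(e) for e in group_cp], d_rev)
    return lambda e: p * tau_fwd(rest(e)) + (1 - p) * tau_rev(rest(e))
```

On data where flipping the concept shifts N's output by a constant, the estimator recovers that constant exactly.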

A.5 INTERVENTION SITE LOCATION AND SIZE

Previous work shows that neurons in different layers and groups can encode different high-level concepts (Vig et al., 2020; Koh et al., 2020). CPM HI pushes concept-related information to localize at the targeted intervention site (the aligned neural representations for each concept). In this section, we investigate how the location and the size of the intervention site impact CPM HI performance. We use the optimal location and size found in this study for the other results presented in this paper.

Location As shown in the top panel of Figure 2, intervention location significantly affects CPM HI performance. Our results show that layer 10 for BERT, layer 8 for RoBERTa, and layer 12 for GPT-2 lead to the best performance, suggesting that layers differ in their efficacy for information localization. Our results also show that intervening on deeper layers tends to provide better performance. However, for both BERT and RoBERTa, intervening on the last layer results in slightly worse performance than earlier layers. This suggests that leaving Transformer blocks after the intervention site helps the localized information to be processed by the network.

Size For Transformer-based models, we vary the size of the intervention site d_c for each concept. Specifically, we set d_c ∈ {1, 16, 64, 128, 192}. For instance, when d_c = 1, we use a single dimension of the "[CLS]" token embedding to represent each concept, starting from the first dimension of the vector. For our non-Transformer-based model (LSTM), we intervene on the attention-gated sentence embedding, whose dimension size is set to 300; accordingly, we set d_c ∈ {1, 16, 64, 75}. As shown in Figure 2, larger intervention sites lead to better performance for all Transformer-based models. For LSTM, we find that the optimal size is instead the second largest.
On the other hand, our results suggest that the performance gain from increasing the intervention-site size diminishes as the size grows, for all model architectures.

A.6 ABLATION STUDY OF CPM HI

Geiger et al. (2022) show that training with a multi-task objective helps IIT to improve generalizability. In this experiment, we investigate whether the multi-task objective we added for CPM HI plays an important role in achieving good performance. Specifically, we conduct two ablation studies: removing the multi-task objective by setting w Multi = 0.0, and removing the IIT objective by setting w HI = 0.0. Table 8 shows our results, which demonstrate that the IIT objective is the main factor driving CPM HI performance. Our results also suggest that the multi-task objective brings relatively small but consistent performance gains. Overall, our findings corroborate those of Geiger et al. (2022) and provide concrete evidence that the combination of the two objectives always results in the best-performing explanation methods across all model architectures.

Additionally, we explore two baselines for CPM HI. First, we randomly initialize the weights of CPM HI. Second, we take the original black-box model as our CPM HI. Compared to the results in Table 1, these two baselines fail catastrophically, suggesting the importance of our IIT paradigm.

As mentioned in Section 3, we sample a source input x^{C_i=c'}_{u',v'} from the train set as any input x that has C_i = c' to estimate the counterfactual output. We also explore two additional sampling strategies. First, we create a baseline where we randomly sample a source input from the train set without any concept-label matching. Second, we sample a source input from the train set using the predicted concept label of our multi-task probe, instead of the true concept label from the dataset. As shown in Table 9, the quality of our source inputs impacts performance significantly. For instance, when sampling source inputs at random, CPM HI fails catastrophically on all evaluation metrics. On the other hand, when we sample sources based on the labels predicted by the multi-task probe, CPM HI maintains its performance.
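The interchange operation underlying IIT, and the per-concept sites it acts on, can be sketched as follows. This is our own simplification: real CPM HI interventions act on Transformer hidden states during the forward pass, the contiguous-block site layout is an assumption, and the function names are hypothetical.

```python
def concept_slices(concepts, d_c):
    # assign each concept a contiguous block of d_c dimensions of the
    # intervention-layer representation, starting from the first dimension
    return {c: slice(i * d_c, (i + 1) * d_c) for i, c in enumerate(concepts)}

def interchange(base, source, site):
    # replace the aligned slice of the base representation with the
    # corresponding slice of the source representation
    out = list(base)
    out[site] = source[site]
    return out
```

Under L_HI, the model's output after such an interchange is trained to match N's output on the approximate counterfactual text.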

A.7 GPT-3 GENERATION PROCESS

We use the 175B-parameter davinci GPT-3 model (Brown et al., 2020) as a few-shot learner to generate approximate counterfactual data. Let x_{u,v} be a review text with an original value c for the mediating concept C_i and an overall review sentiment y (e.g., a restaurant review that is negative about the service and neutral about the overall dining experience), and let c' be the target value of C_i for which we would like to create a counterfactual review (e.g., change the text to become positive about the mediating concept service). To use GPT-3 as an n-shot learner, we sample n = 6 approximate counterfactual pairs (x_{u',v'}, x̃^{C_i←c'}_{u',v'}), where x_{u',v'} shares with x_{u,v} the same value c for C_i and the same overall sentiment, and the counterfactual review x̃^{C_i←c'}_{u',v'} has the target value c' for C_i. We prompt the model with these pairs, followed by the original review x_{u,v}, and collect the text completed by GPT-3 as the GPT-3 counterfactual review. An example of this n-shot prompt and completion is in Figure 3. In addition, we also prompt GPT-3 with pairs of original reviews and metadata-sampled counterfactuals, generating another set of GPT-3 counterfactual reviews for comparison; we sample n = 4 approximate counterfactual pairs in this case. An example of metadata-sampled counterfactual generation with GPT-3 can be seen in Figure 4.

For each few-shot learning prompt, we insert an initial string of the form "Make the following restaurant reviews include c' mentions of C_i.", where c' is expressed as one of {"POSITIVE", "NEGATIVE", "NOT"} ("NOT" corresponds to making the review unknown regarding the concept C_i) and C_i is one of {"AMBIANCE", "FOOD", "NOISE", "SERVICE"}. We sample with a temperature of 0.9, without any frequency or presence penalties (since we expect the counterfactual review to be similar to the original review). In preliminary experimentation, we found that capitalizing the mediating concept and target value and inserting line breaks between examples made for better completions, although there is room for future research in this area.

We used the OpenAI API to access GPT-3. At the current price rate of $0.02 per 1,000 tokens, the total cost of creating our counterfactuals (around 4,000 examples) was approximately $50 per approximate-counterfactual creation strategy.
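The prompt layout shown in Figure 3 can be assembled as follows. The function name is hypothetical; the header string, the capitalized label lines, and the blank lines between examples follow the format described above.

```python
def build_gpt3_prompt(examples, original, concept, target_value):
    # examples: list of (original_review, edited_review) approximate
    # counterfactual pairs sharing the original concept value and overall
    # sentiment; concept and target value are capitalized per Appendix A.7
    header = ("Make the following restaurant reviews include "
              f"{target_value.upper()} mentions of {concept.upper()}.")
    label = f"{target_value.upper()} mentions of {concept.upper()}:"
    blocks = [header]
    for orig, edit in examples:
        blocks.append(f"Original: {orig}\n{label} {edit}")
    # GPT-3 completes the text after the final, unanswered label line
    blocks.append(f"Original: {original}\n{label}")
    return "\n\n".join(blocks)
```

The returned string would then be sent to the completions endpoint with temperature 0.9 and no frequency or presence penalties.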

A.8 INTEGRATED GRADIENTS

We adapt the Integrated Gradients (IG) method of Sundararajan et al. (2017) to qualitatively assess whether CPM HI learned explainable representations of mediated concepts at its intervention sites. The IG algorithm computes the average gradient from the model output to its input by incrementally interpolating from a "blank" input x' (consisting only of "[PAD]" tokens) to the original input x. Eqn. 16 is the integrated gradients equation originally proposed in Sundararajan et al. (2017), applied to a CPM model P on input x:

IntegratedGrads_j(x) = (x_j − x'_j) · ∫_{α=0}^{1} [∂P(x' + α · (x − x')) / ∂x_j] dα    (16)

Here, ∂P(x)/∂x_j is the derivative of P along the j-th dimension of x. In our implementation of IG, we wish to show the per-token attribution of input x on the model's final output P(x), mediated by the hidden representation of a concept in P. That is, we would like to ask: "What is the effect of the word 'delicious' in the input on the model's output, when we restrict our focus to the model's representation of the concept food?" To answer this question, we compute the gradient of the model output P(x) with respect to the input x but restrict the gradient to flow through the intervention site for a particular concept. This allows us to capture the per-token attribution of the model's final output (whether particular words contributed to a positive, negative, or neutral sentiment prediction), mediated by the concept that is represented by the specified intervention site. For example, in Table 4, we can see that "delicious" has a positive attribution to the output of the model when we focus on its representation of the concept food.

Formally, consider a trained CPM model P, an input x, and a mediating concept C_i. Let H_{C_i} be the activation of P at the intervention site for C_i. We define the gradient of P(x) along dimension j, mediated by C_i, as

∂P(x)/∂x_j |_{mediated by C_i} = [∂P(x)/∂H_{C_i}] · [∂H_{C_i}/∂x_j]    (17)

Eqn. 17 restricts the gradient to only flow through the hidden representation of the concept along which we would like to interpret our model. We integrate these mediated gradients over a straight path between input x and baseline x', analogous to Eqn. 16. We implement our IG method using the Captum library. We use the default parameters for our runs, with the number of iterations set to 50, and we set the integration method to gausslegendre. We set the multiply-by-inputs flag to True. To visualize individual word importance, we apply z-score normalization to the attribution scores over input tokens per concept, and then linearly scale the scores to [-1, +1]. Table 10 extends Table 4 in the main text with additional ablation studies on our training objectives.

A.9 MODEL DEBIASING

Being able to accurately predict outputs for counterfactual inputs enables explanation methods to faithfully debias a model with regard to a desired concept. For instance, with CEBaB, debiasing a concept (e.g., "food") is equivalent to estimating the counterfactual output when we set that concept's label to unknown. In this section, we briefly study the extent to which CPM HI can function as a debiasing method. To debias a concept, we force the sampled source input s in Eqn. 2 to have unknown as its label for the concept to be debiased. To show that our method can faithfully debias a targeted concept, we evaluate the correlations between the predicted overall sentiment label and the concept labels for each concept. Without any debiasing technique, we expect concept labels to be highly correlated with the overall sentiment label (e.g., if food is positive, the overall sentiment is more likely to be positive). We use a CPM HI trained for the BERT model architecture as an example, and use examples in the test set. Figure 5 shows correlation plots for the black-box model as well as CPM HI.
As expected, the correlation of the food concept is weakened through the debiasing pipeline by 57.50%. Our results also suggest that correlations of other concepts are affected, which points to a future research direction focused on minimizing the impact of the debiasing pipeline on irrelevant concepts. We include results for the remaining concepts in Appendix A.9.

Table 11: Visualizations of word importance scores using Integrated Gradients (IG), using the same methods as in Table 4 and Table 10.

Table 11 visualizes word importance scores using our version of Integrated Gradients (IG). Unlike Table 4 and Table 10, which show visualizations for our optimized model, we show per-epoch results for CPM HI, with our best model appended at the end. Our results suggest that early checkpoints in the training process focus on drastically different input words than later checkpoints, though all models predict neutral for this given sentence. In addition, gradient aggregations over input words are rather stable towards the end of training. More importantly, CPM HI gradually learns to highlight words that are semantically related to each concept. For instance, we can see a clear trend of emphasizing the word "decorations" for the ambiance concept throughout the training process. This suggests that our training procedure gradually induces causally motivated gradients over input words.
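The mediated Integrated Gradients computation of Appendix A.8 can be illustrated on a toy model with a hand-computable chain rule. This is purely didactic: the real implementation uses Captum on a Transformer, whereas here the "food" hidden unit, the model, and all function names are our own toy constructions, with the integral of Eqn. 16 approximated by a midpoint Riemann sum.

```python
def hidden_food(x):
    # hypothetical "food concept" hidden unit: h_food = x0 * x1
    return x[0] * x[1]

def model_output(x):
    # toy model P(x) = 2 * h_food + 3 * (x0 + x1)
    return 2.0 * hidden_food(x) + 3.0 * (x[0] + x[1])

def mediated_grad(x):
    # chain rule restricted to the h_food path (Eqn. 17):
    # dP/dh_food = 2;  dh_food/dx = (x1, x0)
    return (2.0 * x[1], 2.0 * x[0])

def mediated_ig(x, baseline=(0.0, 0.0), steps=50):
    # midpoint Riemann approximation of Eqn. 16 with the mediated gradient
    attr = [0.0, 0.0]
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = tuple(b + alpha * (xi - b) for xi, b in zip(x, baseline))
        g = mediated_grad(point)
        for j in range(2):
            attr[j] += (x[j] - baseline[j]) * g[j] / steps
    return attr
```

The attributions satisfy completeness with respect to the mediated component: they sum to the change in 2·h_food between baseline and input, ignoring the unmediated 3·(x0 + x1) term.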



CAUSAL PROXY MODEL (CPM)

Causal Proxy Models (CPMs) are causal concept-based explanation methods. Given a factual input x_{u,v} and a description of a concept intervention C_i ← c', they estimate the effect of the intervention on model output. The present section introduces our two core CPM variants in detail. We concentrate here on introducing the structure of these models and their objectives, deferring discussion of the associated metrics for explanation methods.

Footnotes: For the sake of clarity, our objective is stated with regard to a single approximate counterfactual pair; at train time, we aggregate the objective over all considered training pairs. We take C_i to always represent the intervened-upon concept. The weights of N are frozen. We use the largest davinci model publicly available at https://beta.openai.com/playground. We use the finetuned concept-level sentiment analysis models B released by Abraham et al. (2022) for concept label prediction, identical to the ones used in S-Learner in Section 4.3. https://captum.ai/



L_HI: Examples x_{u,v} and x^{C_i←c'}_{u,v} are an approximate counterfactual pair. The CPM (middle) is given input x_{u,v}; the objective is for it to mimic N (top) given x^{C_i←c'}_{u,v}.

Figure 1: Causal Proxy Model (CPM) summary. Every CPM for model N is trained to mimic the factual behavior of N (L Mimic ). For CPM IN , the counterfactual objective is L IN . For CPM HI , the counterfactual objective is L HI .

Our version of S-Learner (Künzel et al., 2019) learns to mimic the factual behavior of the black-box model N while making the intermediate concepts explicit. Given a factual input, a finetuned model B is trained to predict the label of each concept as an aspect-based sentiment classification task. Then, a logistic regression model LR_N is trained to map these intermediate concept values to the factual output of the black-box model N.
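The two-stage S-Learner pipeline and its effect estimate (predict concept labels, flip the intervened label, and compare the mapped outputs) can be sketched as follows. The stand-ins `toy_b` and `toy_lr` and the function names are our own illustration; the real pipeline uses a finetuned concept classifier B and a fitted logistic regression LR_N.

```python
def s_learner_effect(lr, b, x, concept, c_prime):
    # b: concept predictor (stand-in for the finetuned model B), text -> labels
    # lr: stand-in for LR_N, mapping a concept-label dict to N's predicted output
    labels = b(x)
    edited = dict(labels, **{concept: c_prime})  # apply C_i <- c'
    return lr(edited) - lr(labels)

# Hypothetical stand-ins: B reads labels off keywords; LR_N sums label scores.
toy_b = lambda x: {"food": "pos" if "delicious" in x else "neg",
                   "service": "pos" if "friendly" in x else "neg"}
score = {"pos": 1.0, "neg": 0.0}
toy_lr = lambda labels: score[labels["food"]] + 0.5 * score[labels["service"]]
```

The effect estimate is the change in LR_N's output when only the intervened concept's label is edited.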

Figure 2: CEBaB scores for different intervention site locations and sizes for CPM HI . The scores are measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Results averaged over three distinct seeds. Task performance as Macro-F1 score is reported when applicable. Shaded areas outline ± SD.

Figure 3: Example GPT-3 prompt (gray) and GPT-3 completion (bold). Note that all original examples convey the same sentiment towards service (c = negative) and same overall sentiment (y = neutral), and that the counterfactual examples are all edited such that the sentiment towards service is the same (c ′ = positive).

Figure 4: Example GPT-3 prompt (gray) and GPT-3 completion (bold). Note that all original examples convey the same sentiment towards service (c = unknown) and same overall sentiment (y = negative), and that the counterfactual examples are all metadata-sampled such that the sentiment towards service is the same (c ′ = positive).

(a) Visualization for debiasing the ambiance concept. (b) Visualization for debiasing the food concept. (c) Visualization for debiasing the noise concept. (d) Visualization for debiasing the service concept.

Figure 5: Debiasing visualizations for different concepts of a CPM HI with BERT model architecture. Individual plots are correlation plots between concept labels of a concept and the overall sentence sentiment label.

Figure 5a to Figure 5d show debiasing visualizations for three concepts: ambiance, noise, and service. We use a CPM HI for the BERT model architecture as an example. We calculate the distributions with examples in the test set.

Figure 6 shows three different metrics measured on the dev and the test sets for a CPM HI trained for the BERT model architecture. Since we use COS ICaCE on the dev set to early-stop our training process, we find that CPM HI reaches a local minimum on COS ICaCE while L2 ICaCE and NormDiff ICaCE are still trending downward. This suggests that future research may need to choose which metric to optimize during training so that early stopping yields the best-performing model.

L_Mimic: All CPMs (bottom) are trained to mimic the behavior of the neural model N to be explained (top) for all factual inputs x_{u,v}. L_IN: Examples x_{u,v} and x^{C_i←c'}_{u,v} are an approximate counterfactual pair. The CPM is given x_{u,v} augmented with a special token t_{C_i←c'} and trained to mimic the target model N when its input is x^{C_i←c'}_{u,v}.

CEBaB scores measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Lower is better. Results averaged over three distinct seeds, standard deviations in parentheses. The metrics are described in Section 4. Best averaged result is bolded (including ties) per approximate counterfactual creation strategy.

Table 1 presents our main results. The results are grouped by the type of approximate counterfactual used during training. Both CPM IN and CPM HI beat BEST CEBaB in every evaluation setting by a large margin, establishing state-of-the-art explanation performance. Interestingly, CPM HI slightly outperforms CPM IN with sampled approximate counterfactuals, while slightly underperforming CPM IN with human-created approximate counterfactuals. Appendix A.6 reports ablation studies indicating that, for CPM HI, this state-of-the-art performance is primarily driven by the role of IIT in localizing concepts.

Task performance measured as Macro-F1 score on the test set. Results averaged over three distinct seeds; standard deviations in parentheses.

Self-explanation CEBaB scores measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Lower is better. Results averaged over three distinct seeds, standard deviations in parentheses.

The table reports these results. We find that both CPM IN and CPM HI achieve better self-explanation performance than when providing explanations for a separate black-box model. Furthermore, CPM HI provides better self-explanation than CPM IN, suggesting that our interchange intervention procedure leads the model to localize concept-based information in its hidden representations. This shows that CPMs may be viable replacements for their black-box counterparts, since they provide similar task performance while providing faithful counterfactual explanations of both the black-box model and themselves.

Visualizations of word importance scores using Integrated Gradients (IG), obtained by restricting gradient flow through the corresponding intervention site of the targeted concept. Our target class pools positive and very positive. Individual word importance is the sum of neuron-level importance scores for each input, normalized to [-1, +1].



CEBaB scores measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Results are adapted from Abraham et al. (2022). Lower is better; standard deviations over 5 distinct seeds in parentheses. Results are aggregated over all aspects and all directional concept label changes. Details about these evaluation metrics can be found in Section 4. Results are based on † Abraham et al. (2022), ‡ Künzel et al. (2019), and § Ravfogel et al. (2020).

CEBaB scores for additional baselines we considered. CEBaB scores are measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Lower is better. Results averaged over three distinct seeds, standard deviations in parentheses. Details about these evaluation metrics can be found in Section 4.

Ablation study of our CPM HI method trained with human approximate counterfactual strategy. CEBaB scores measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Lower is better. Results averaged over three distinct seeds, standard deviations in parentheses.

Ablation study of our CPM HI method for different source input s sampling strategies at inference time. CEBaB scores measured in three different metrics on the test set for four different model architectures as a five-class sentiment classification task. Lower is better. Results averaged over three distinct seeds, standard deviations in parentheses.


Additional visualizations of word importance scores using Integrated Gradients (IG), obtained by restricting gradient flow through the corresponding intervention site of the targeted concept. This table extends Table 4 in the main text.

