CAUSAL PROXY MODELS FOR CONCEPT-BASED MODEL EXPLANATIONS

Abstract

Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model N because it is trained to have the same actual input/output behavior as N while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of N. Furthermore, we show that the best CPM for N performs comparably to N in making factual predictions, which means that the CPM can simply replace N, leading to more explainable deployed models.

1. INTRODUCTION

The gold standard for explanation methods in AI should be to elucidate the causal role that a model's representations play in its overall behavior, that is, to truly explain why the model makes the predictions it does. Causal explanation methods seek to do this by resolving the counterfactual question of what the model would do if input X were changed to a relevant counterfactual version X′. Unfortunately, even though neural networks are fully observed, deterministic systems, we still encounter the fundamental problem of causal inference (Holland, 1986): for a given ground-truth input X, we never observe the counterfactual inputs X′ necessary for isolating the causal effects of model representations on outputs. The issue is especially pressing in domains where it is hard to synthesize approximate counterfactuals. In response to this, explanation methods typically do not explicitly train on counterfactuals at all.
In this paper, we show that robust explanation methods for NLP models can be obtained using texts approximating true counterfactuals. The heart of our proposal is the Causal Proxy Model (CPM). CPMs are trained to mimic both the factual and counterfactual behavior of a black-box model N. We explore two different methods for training such explainers. These methods share a distillation-style objective that pushes them to mimic the factual behavior of N, but they differ in their counterfactual objectives. The input-based method CPM_IN appends to the factual input a new token associated with the counterfactual concept value. The hidden-state method CPM_HI employs the Interchange Intervention Training (IIT) method of Geiger et al. (2022) to localize information about the target concept in specific hidden states. Figure 1 provides a high-level overview.
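As a rough illustration of the two-part objective described above, the following sketch combines a distillation-style term on the factual input with a counterfactual term for the input-based variant. It is a toy under stated assumptions, not the training code used in this work: the concept-token format, the use of soft cross-entropy for both terms, the `weight` parameter, and all function names are hypothetical choices made for illustration, and the models are stood in for by lists of output logits.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def append_concept_token(text, concept, value):
    """Append a token naming the counterfactual concept value (hypothetical format)."""
    return f"{text} [{concept}={value}]"


def cpm_loss(n_factual, cpm_factual, n_counterfactual, cpm_intervened, weight=1.0):
    """Assumed combined objective: factual mimicry plus a counterfactual term.

    n_factual        -- logits of black-box model N on the factual text
    cpm_factual      -- logits of the CPM on the same factual text
    n_counterfactual -- logits of N on an approximate counterfactual text
    cpm_intervened   -- logits of the CPM on the factual text after the
                        intervention (here: with the concept token appended)
    """
    mimic = soft_cross_entropy(n_factual, cpm_factual)
    counterfactual = soft_cross_entropy(n_counterfactual, cpm_intervened)
    return mimic + weight * counterfactual


# The CPM would be run on this augmented input to obtain `cpm_intervened`.
augmented = append_concept_token("Great food.", "service", "negative")
```

In this sketch, a hidden-state variant would keep `cpm_loss` unchanged and obtain `cpm_intervened` by swapping selected hidden states instead of appending a token; that intervention is not shown here.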
We evaluate these methods on the CEBaB benchmark for causal explanation methods (Abraham et al., 2022), which provides large numbers of original examples (restaurant reviews) with human-created counterfactuals for specific concepts (e.g., service quality), with all the texts labeled for their concept-level and text-level sentiment. We consider two types of approximate counterfactuals derived from CEBaB: texts written by humans to approximate a specific counterfactual, and texts sampled using metadata-guided heuristics. Both approximate counterfactual strategies lead to state-of-the-art performance on CEBaB for both CPM_IN and CPM_HI. We additionally identify two other benefits of using CPMs to explain models. First, both CPM_IN and CPM_HI have factual performance comparable to that of the original black-box model N and can explain their own behavior extremely well. Thus, the CPM for N can actually replace N, leading to more explainable deployed models. Second, CPM_HI models localize concept-level information in

