EXPLAINING THE EFFICACY OF COUNTERFACTUALLY AUGMENTED DATA

Abstract

In attempts to produce machine learning models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which, given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented (original and revised) data appear, empirically, to rely less on semantically irrelevant words and to generalize better out of domain. While this work draws loosely on causal thinking, the underlying causal model (even at an abstract level) and the principles underlying the observed out-of-domain improvements remain unclear. In this paper, we introduce a toy analog based on linear Gaussian models, observing interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Our analysis provides some insights that help to explain the efficacy of CAD. Moreover, we develop the hypothesis that while adding noise to causal features should degrade both in-domain and out-of-domain performance, adding noise to non-causal features should lead to relative improvements in out-of-domain performance. This idea inspires a speculative test for determining whether a feature attribution technique has identified the causal spans. If adding noise (e.g., via random word flips) to the highlighted spans degrades both in-domain and out-of-domain performance on a battery of challenge datasets, but adding noise to the complement yields out-of-domain improvements, this suggests that we have identified the causal spans. Accordingly, we present a large-scale empirical study comparing spans edited to create CAD with those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD.

1. INTRODUCTION

Despite machine learning (ML)'s many practical breakthroughs, formidable obstacles obstruct its deployment in consequential applications. Of particular concern, these models have been shown to rely on spurious signals, such as surface-level textures in images (Jo & Bengio, 2017; Geirhos et al., 2018), and background scenery, even when the task is to recognize foreground objects (Beery et al., 2018). Other studies have uncovered a worrisome reliance on gender in models trained to recommend jobs (Dastin, 2018), and on race in prioritizing patients for medical care (Obermeyer et al., 2019). Moreover, while modern ML performs remarkably well on independent and identically distributed (iid) holdout data, performance often decays catastrophically under both naturally occurring and adversarial distribution shift (Quionero-Candela et al., 2009; Sugiyama & Kawanabe, 2012; Szegedy et al., 2014; Ovadia et al., 2019; Filos et al., 2020). These two problems, (i) reliance on semantically irrelevant signals, raising concerns about bias, and (ii) the brittleness of models under distribution shift, might appear unrelated, but they share important conceptual features. Concerns about bias stem in part from principles of procedural fairness (Blader & Tyler, 2003; Miller, 2017; Grgic-Hlaca et al., 2018; Lipton et al., 2018), according to which decisions should be based on qualifications, not on distant proxies that are spuriously associated with the outcome of interest.

Figure 1: Toy causal models with one hidden confounder. Panels: (a) causal setting; (b) noisy measurement in the causal setting; (c) anticausal setting; (d) noisy measurements in the anticausal setting. In 1a and 1c, the observed covariates are x1, x2; in 1b and 1d, the observed covariates include noisy measurements. In all cases, y denotes the label.

Arguably, one key distinction of an actual qualification might be that it
actually exerts causal influence on the outcome of interest. In an interesting parallel, one line of work on distribution shift has focused on causal graphical models, addressing settings where some parts of the model remain stable over time while others do not. One common assumption is that the relationship between the target and its direct causal ancestors remains invariant (Peters et al., 2016; Ghassami et al., 2017; Rojas-Carulla et al., 2018; Kuang et al., 2018; Magliacane et al., 2018; Christiansen & Peters, 2020; Weichwald & Peters, 2020). While these papers contribute insight, they focus on toy settings with few variables related by a known model. However, in complex domains with high-dimensional data, which variables are relevant and what graph relates them is typically unclear. Recently in NLP, Kaushik et al. (2020) proposed Counterfactually Augmented Data (CAD), injecting causal thinking into real-world settings by leveraging human-in-the-loop feedback to identify causally relevant features (versus those that merely happen to be predictive due to confounding). Human editors are presented with document-label pairs and tasked with editing documents to render counterfactual labels applicable. The instructions restrict editors to making only those modifications necessary to flip the label's applicability. The key result is that many spurious correlations present in the original dataset are absent in the CAD. In the case of sentiment analysis, Kaushik et al. (2020) demonstrated that linear classifiers trained to predict the sentiment of movie reviews based on bag-of-words representations assign high-magnitude weights to seemingly irrelevant terms, including "will", "my", "has", "especially", and "script", among others. Notably, "horror" featured among the most negative terms, while "romance" featured among the most positive, despite both communicating genre, not sentiment.
Interestingly, in the revised data, each "horror" review retains the word "horror" (per the instruction not to make unnecessary edits) but is associated with the opposite sentiment label. Models trained on the augmented data (original and revised) perform well on both original and revised data, and assign little weight to the associated but irrelevant terms. Intuitively, one might expect the spurious patterns to generalize less reliably out of domain. Most consumer products do not belong to movie genres, but words like "excellent" and "awful" continue to connote positive and negative sentiment, respectively. Indeed, Kaushik et al. (2020) demonstrated that models trained on CAD enjoyed out-of-domain performance benefits on Tweets, and on Amazon and Yelp reviews. In this paper, we make some initial attempts towards explaining CAD's efficacy. While CAD plainly draws on causal thinking (invoking interventions and counterfactuals), foundational questions remain open: What is the assumed causal structure underlying settings where CAD might be effective? What are the principles underlying its out-of-domain benefits? Must humans really intervene, or could automatic feature attribution methods, e.g., attention (DeYoung et al., 2020), or cheaper feedback mechanisms, e.g., feature feedback (Zaidan et al., 2007), produce similar results? To begin, we consider linear Gaussian models (Figure 1; Wright, 1934), with the following goals: (i) to gain qualitative insights into when a predictor might rely on spurious signals in the first place; and (ii) to provide a mechanism of action explaining the efficacy of CAD. First, we analyze the causal setting (features cause the label). When the features share a common cause and the predictor is well-specified (linear), it will assign zero weight (in expectation) to non-causal features. However, when the causal features are subject to observation noise (measurement error), the non-causal features are assigned non-zero weight.
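This phenomenon is easy to verify by simulation. The sketch below (illustrative structural coefficients of our choosing) instantiates the graph of Figure 1a, with a hidden confounder z, a causal feature x1, and a non-causal feature x2, and fits least-squares predictors with and without measurement noise on x1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Structural model of Figure 1a: z is a hidden confounder,
# x1 is the causal feature (x1 -> y), x2 is non-causal (z -> x2).
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)          # z -> x1
x2 = z + rng.normal(size=n)          # z -> x2
y = 2.0 * x1 + rng.normal(size=n)    # x1 -> y; true weights: (2, 0)

def ols(X, y):
    """Least-squares weights, no intercept (all variables are centered)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Clean measurements: the well-specified linear predictor assigns
# (approximately) zero weight to the non-causal x2.
w_clean = ols(np.column_stack([x1, x2]), y)

# Noisy measurement of the causal feature (Figure 1b): the predictor
# now leans on x2, which proxies for x1 through the confounder z.
x1_noisy = x1 + rng.normal(scale=2.0, size=n)
w_noisy = ols(np.column_stack([x1_noisy, x2]), y)

print("clean:", w_clean)   # weight on x2 near 0
print("noisy:", w_noisy)   # weight on x2 clearly nonzero
```

With these coefficients, the population solution in the noisy case places weight 8/11 ≈ 0.73 on x2, even though x2 has no causal effect on y: attenuation of the noisy causal feature shifts predictive burden onto the confounded proxy.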
Conversely, when we inject noise into non-causal features, predictors rely more heavily on causal features, which we expect to result in better out-of-domain generalization. In the

