EXPLAINING THE EFFICACY OF COUNTERFACTUALLY AUGMENTED DATA

Abstract

In attempts to produce machine learning models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which, given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented (original and revised) data appear, empirically, to rely less on semantically irrelevant words and to generalize better out of domain. While this work draws loosely on causal thinking, the underlying causal model (even at an abstract level) and the principles underlying the observed out-of-domain improvements remain unclear. In this paper, we introduce a toy analog based on linear Gaussian models, observing interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Our analysis provides some insights that help to explain the efficacy of CAD. Moreover, we develop the hypothesis that while adding noise to causal features should degrade both in-domain and out-of-domain performance, adding noise to non-causal features should lead to relative improvements in out-of-domain performance. This idea inspires a speculative test for determining whether a feature attribution technique has identified the causal spans. If adding noise (e.g., by random word flips) to the highlighted spans degrades both in-domain and out-of-domain performance on a battery of challenge datasets, but adding noise to the complement gives improvements out-of-domain, this suggests we have identified causal spans. Thus, we present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD.

1. INTRODUCTION

Despite machine learning (ML)'s many practical breakthroughs, formidable obstacles obstruct its deployment in consequential applications. Of particular concern, these models have been shown to rely on spurious signals, such as surface-level textures in images (Jo & Bengio, 2017; Geirhos et al., 2018), and background scenery, even when the task is to recognize foreground objects (Beery et al., 2018). Other studies have uncovered a worrisome reliance on gender in models trained for the purpose of recommending jobs (Dastin, 2018), and on race in prioritizing patients for medical care (Obermeyer et al., 2019). Moreover, while modern ML performs remarkably well on independent and identically distributed (iid) holdout data, performance often decays catastrophically under both naturally occurring and adversarial distribution shift (Quionero-Candela et al., 2009; Sugiyama & Kawanabe, 2012; Szegedy et al., 2014; Ovadia et al., 2019; Filos et al., 2020). These two problems: (i) reliance on semantically irrelevant signals, raising concerns about bias; and (ii) the brittleness of models under distribution shift; might appear unrelated, but share important conceptual features. Concerns about bias stem in part from principles of procedural fairness (Blader & Tyler, 2003; Miller, 2017; Grgic-Hlaca et al., 2018; Lipton et al., 2018), according to which decisions should be based on qualifications, not on distant proxies that are spuriously associated with the outcome of interest. Arguably, one key distinction of an actual qualification might be that it actually exerts causal influence on the outcome of interest.

[Figure 1: Toy causal models with one hidden confounder. In 1a and 1c, the observed covariates are x_1, x_2. In 1b and 1d, x_1 is observed only through a noisy measurement (panel (d): noisy measurements in the anticausal setting). In all cases, y denotes the label.]
In an interesting parallel, one line of work on distribution shift has focused on causal graphical models, addressing settings where some parts of the model remain stable over time but others do not. One common assumption is that the relationship between the target and its direct causal ancestors remains invariant (Peters et al., 2016; Ghassami et al., 2017; Rojas-Carulla et al., 2018; Kuang et al., 2018; Magliacane et al., 2018; Christiansen & Peters, 2020; Weichwald & Peters, 2020). While these papers contribute insight, they focus on toy settings, with few variables related by a known model. However, in complex domains with high-dimensional data, which variables are relevant and what graph relates them is typically unclear. Recently in NLP, Kaushik et al. (2020) proposed Counterfactually Augmented Data (CAD), injecting causal thinking into real-world settings by leveraging human-in-the-loop feedback to identify causally relevant features (versus those that merely happen to be predictive due to confounding). Human editors are presented with document-label pairs and tasked with editing documents to render counterfactual labels applicable. The instructions restrict editors to make only those modifications necessary to flip the label's applicability. The key result is that many spurious correlations present in the original dataset are absent in the CAD. In the case of sentiment analysis, Kaushik et al. (2020) demonstrated that linear classifiers trained to predict the sentiment of movie reviews based on bag-of-words representations assign high-magnitude weights to seemingly irrelevant terms, including "will", "my", "has", "especially", and "script", among others. Notably, "horror" featured among the most negative terms, while "romance" featured among the most positive, despite both communicating genre, not sentiment.
Interestingly, in the revised data, each "horror" review retains the word "horror" (per the instruction not to make unnecessary edits) but is associated with the opposite sentiment label. Models trained on the augmented data (original and revised) perform well on both original and revised data, and assign little weight to the associated but irrelevant terms. Intuitively, one might imagine that the spurious patterns would generalize less reliably out of domain. Most consumer products do not belong to movie genres, but words like "excellent" and "awful" continue to connote positive and negative sentiment, respectively. Indeed, Kaushik et al. (2020) demonstrated that models trained on CAD enjoyed out-of-domain performance benefits on Tweets, and on Amazon and Yelp reviews. In this paper, we make some initial attempts towards explaining CAD's efficacy. While CAD plainly draws on causal thinking (invoking interventions and counterfactuals), foundational questions remain open: What is the assumed causal structure underlying settings where CAD might be effective? What are the principles underlying its out-of-domain benefits? Must humans really intervene, or could automatic feature attribution methods, e.g., attention (DeYoung et al., 2020), or cheaper feedback mechanisms, e.g., feature feedback (Zaidan et al., 2007), produce similar results? To begin, we consider linear Gaussian models (Figure 1; Wright, 1934), with the following goals: (i) to gain qualitative insights into when a predictor might rely on spurious signals in the first place; and (ii) to provide a mechanism of action that explains the efficacy of CAD. First, we analyze the causal setting (features cause the label). When the features share a common cause and the predictor is well-specified (linear), it will assign zero weight (in expectation) to non-causal features. However, when the causal features are subject to observation noise (measurement error), the non-causal features are assigned non-zero weight.
Conversely, when we inject noise on non-causal features, predictors rely more on causal features, which we expect to result in better out-of-domain generalization. In the causal framework, we observe that CAD might be usefully formalized as a process analogous to intervening on the causal features, thus d-separating the label from the non-causal features (Pearl, 1985). Alternatively, we might conceptualize CAD with an anticausal model (Schölkopf et al., 2012). In this setup, the label of interest is one of several latent attributes that directly causes some (but not all) features. In this interpretation, we imagine that we have intervened on the label and the editor's role is to simulate the counterfactual document that would flow from the alternative label, holding other attributes constant. Note that this too d-separates the label from the spurious correlate. In both cases, any model trained on the resulting data ought to rely only on the causal features. Our toy abstraction points to a useful diagnostic test. If indeed CAD involves interventions on spans that are (in some sense) analogous to the causal features in our toy model, then injecting noise on these words should increase model reliance on the non-causal features and thus (in general) lead to deteriorating performance out of domain. On the other hand, injecting noise on the non-causal features should lead the model to rely more on the causal features, leading to improved performance out of domain. Through a series of large-scale empirical experiments addressing sentiment analysis and natural language inference (NLI) tasks, we inject noise on the spans marked as causal vs. non-causal. We compare the effects of injecting noise on the spans revised by the CAD editors, the spans selected through feature feedback (Zaidan et al., 2007), and spans selected automatically using feature attribution heuristics such as attention- and gradient-based saliency methods.
If indeed the hypotheses that (i) identifying causal features requires human intervention; and (ii) models relying on causal features generalize better out of domain, hold, we might expect that (compared to automatic attribution methods) noising human-provided rationales would deteriorate out-of-domain performance, while noising non-rationales should prove beneficial. We show that an SVM sentiment analysis model trained on the original 1.7k IMDb reviews from Kaushik et al. (2020) obtains 87.8% accuracy on the IMDb test set and 79.9% on Yelp reviews, but when all rationales are replaced with noise, the classifier experiences a drop of ≈ 11% in in-sample accuracy and an even bigger drop of ≈ 28.7% on Yelp. However, as non-rationales are replaced with noise, in-domain accuracy goes down by ≈ 10% but out-of-domain accuracy increases by 1.5%. Similarly, in NLI, the accuracy of a BERT classifier fine-tuned on a subsample of e-SNLI (DeYoung et al., 2020) goes down by ≈ 20% when rationales are replaced with noise, whereas the out-of-domain accuracy goes down by 21.3-31.5% on various datasets. If non-rationales are replaced with noise, in-sample accuracy goes down by 6.2% but out-of-domain accuracy drops by only 2.3-5.5%. Similar patterns are observed across both tasks, on all datasets and models. However, when using attention masks, the resulting changes in model performance do not appear to follow these trends. In another test to probe whether human feedback is indeed necessary to produce the observed quantitative results of CAD, we experiment with style transfer methods for converting Positive reviews into Negative and vice versa. Compared to an SVM classifier trained on style-transfer-augmented data, training on CAD leads to a gain of 5-16.4% in accuracy on Amazon and 3.7-17.8% on Yelp. Similarly, a BERT classifier fine-tuned on CAD outperforms the same classifier fine-tuned on style-transfer-augmented data by 4.9-21.5% on Amazon and 1.9-9.5% on Yelp.

2. RELATED WORK

NLP papers on spurious associations have addressed social biases (Dixon et al., 2018; Zhao et al., 2018; Kiritchenko & Mohammad, 2018; Dinan et al., 2019; May et al., 2019), spurious signals owing to annotation heuristics (Gururangan et al., 2018; Poliak et al., 2018), and artifacts from automatic data generation (Chen et al., 2016; Kaushik & Lipton, 2018). Researchers have also demonstrated vulnerabilities to synthetic transformations, such as distractor phrases (Jia & Liang, 2017; Wallace et al., 2019), document paraphrases (Iyyer et al., 2018; Pfeiffer et al., 2019), and synthetic but meaning-preserving modifications (Ribeiro et al., 2018; Glockner et al., 2018; Shen et al., 2018). Researchers have proposed incorporating human feedback solicited through a variety of mechanisms, including highlighting rationales: spans of text indicative of the label (Zaidan et al., 2007; Zaidan & Eisner, 2008; Poulis & Dasgupta, 2017). To combat gender stereotypes, Lu et al. (2018); Zmigrod et al. (2019); Maudslay et al. (2019) describe data augmentation approaches that programmatically alter text. More recently, Kaushik et al. (2020) employed crowd workers to edit text to make an opposite label applicable; their experiments show that classifiers trained on CAD generalize well out of domain. Teney et al. (2020) show the benefits of CAD in computer vision and NLP, and Srivastava et al. (2020) employ crowdworkers to augment their training data to capture potential unmeasured variables. A growing body of work has also looked at reducing reliance on spurious correlations by exploiting the stability of relationships between the target variable and its (graph) neighbors. Peters et al. (2016) propose invariant causal prediction to obtain a causal predictor from multiple datasets. Ghassami et al. (2017) discuss a similar approach but do not assume that the exogenous noise of the target variable stays fixed across environments.
They also demonstrate the benefits of their approach (compared to Peters et al. (2016) ) in identifying all direct ancestors of the target variable. Arjovsky et al. (2019) propose invariant risk minimization, with the goal of learning a data representation such that the optimal predictor is shared across environments.

3. ANALYSIS OF A TOY MODEL

We briefly review the OLS estimator for the model $Y = X\beta + \epsilon$, where $Y \in \mathbb{R}^n$ is the target, $X \in \mathbb{R}^{n \times p}$ the design matrix, $\beta \in \mathbb{R}^p$ the coefficient vector we want to estimate, and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ an iid noise term. The OLS estimate $\hat{\beta}^{\mathrm{ols}}$ satisfies $\mathrm{Cov}(X, X)\hat{\beta}^{\mathrm{ols}} = \mathrm{Cov}(X, Y)$. Writing $\mathrm{Var}[x_i]$ as $\sigma^2_{x_i}$ and $\mathrm{Cov}(x_i, x_j)$ as $\sigma_{x_i,x_j}$, if we observe only two covariates ($p = 2$), then:

$$\hat{\beta}^{\mathrm{ols}}_1 = \frac{\sigma^2_{x_2}\sigma_{x_1,y} - \sigma_{x_1,x_2}\sigma_{x_2,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}, \qquad \hat{\beta}^{\mathrm{ols}}_2 = \frac{\sigma^2_{x_1}\sigma_{x_2,y} - \sigma_{x_1,x_2}\sigma_{x_1,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}. \tag{1}$$

Our analysis adopts the structural causal model (SCM) framework (Pearl, 2009), formalizing causal relationships via Directed Acyclic Graphs (DAGs). Each edge of the form $A \to B \in E$ in a DAG $G = (V, E)$ indicates that the variable $A$ is (potentially) a direct cause of variable $B$. All measured variables $X \in V$ in the model are deterministic functions of their corresponding parents $\mathrm{Pa}(X) \subseteq V$ and a set of jointly independent noise terms. For simplicity, we work with linear Gaussian SCMs in the presence of a single confounder, where each variable is a linear function of its parents and the noise terms are assumed to be additive and Gaussian. We look at both causal and anticausal learning settings. In the former, we assume that the document causes the applicability of the label (as in annotation, where the document truly causes the label). In the latter, we assume that the label is one latent variable (among many) that causes features of the document (as when a reviewer's "actual sentiment" influences what they write). For simplicity, we assume that the latent variables are correlated due to confounding but that each latent causes a distinct set of observed features. Without loss of generality, we assume that all variables have zero mean. Both DAGs contain the four random variables $z, x_1, x_2, y$, and the anticausal DAG also contains an additional latent variable $q$ (Figure 1). The derivations are standard and are included in Appendix A.
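To make Eq. 1 concrete, the following sketch (an illustration we add here, with arbitrary ground-truth coefficients, not from the paper) recovers a two-covariate OLS fit from sample (co)variances and checks it against the matrix solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Arbitrary illustrative model y = 2*x1 - 3*x2 + noise, with correlated covariates.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Closed-form OLS for p = 2 using (co)variances, as in Eq. 1.
C = np.cov(np.vstack([x1, x2, y]))
s_x1, s_x2 = C[0, 0], C[1, 1]
s_12, s_1y, s_2y = C[0, 1], C[0, 2], C[1, 2]
den = s_x1 * s_x2 - s_12**2
beta1 = (s_x2 * s_1y - s_12 * s_2y) / den
beta2 = (s_x1 * s_2y - s_12 * s_1y) / den

# Cross-check against the matrix OLS solution (zero-mean data, no intercept).
beta_hat = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]
print(beta1, beta2)  # both close to the ground truth (2, -3)
```

The covariance-based and matrix-based estimates agree up to sampling error, which shrinks as n grows.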

3.1. THE CAUSAL SETTING

We now focus on the causal setting (Figures 1a, 1b). Let the Gaussian SCM be defined as follows, where the noise term for variable $x$ is denoted $u_x$:

$$z = u_z, \quad x_1 = bz + u_{x_1}, \quad x_2 = cz + u_{x_2}, \quad y = ax_1 + u_y, \tag{2}$$

with $u_z \sim \mathcal{N}(0, \sigma^2_{u_z})$, $u_{x_1} \sim \mathcal{N}(0, \sigma^2_{u_{x_1}})$, $u_{x_2} \sim \mathcal{N}(0, \sigma^2_{u_{x_2}})$, and $u_y \sim \mathcal{N}(0, \sigma^2_{u_y})$. Applying OLS, we obtain $\hat{\beta}^{\mathrm{ols}}_1 = a$ and $\hat{\beta}^{\mathrm{ols}}_2 = 0$. However, consider what happens if we observe $x_1$ only via a noisy proxy $\tilde{x}_1 = x_1 + \epsilon_{x_1}$, with $\epsilon_{x_1} \sim \mathcal{N}(0, \sigma^2_{\epsilon_{x_1}})$ (Figure 1b). Assuming $\epsilon_{x_1} \perp\!\!\!\perp (x_1, x_2, y)$, from Eq. 1 we get the estimates $\tilde{\beta}^{\mathrm{ols}}_1$ and $\tilde{\beta}^{\mathrm{ols}}_2$ in the presence of observation noise on $x_1$:

$$\tilde{\beta}^{\mathrm{ols}}_1 = \frac{a\left(\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}}\right)}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})}, \qquad \tilde{\beta}^{\mathrm{ols}}_2 = \frac{abc\,\sigma^2_{\epsilon_{x_1}}\sigma^2_{u_z}}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})}. \tag{3}$$

As $\sigma^2_{\epsilon_{x_1}}$ increases, $|\tilde{\beta}^{\mathrm{ols}}_1|$ shrinks toward zero, whereas $\tilde{\beta}^{\mathrm{ols}}_2$ converges to a finite non-zero value. On the other hand, observing a noisy version of $x_2$ will not affect our OLS estimates if there is no measurement error on $x_1$. These simple graphs provide qualitative insights into when we should expect a model to rely on spurious patterns. In the causal setting, under perfect measurement, the causal variable d-separates the non-causal variable from the label (Figure 1a). However, under observation noise, a predictor will rely on the non-causal variable (Eq. 3). Moreover, when the causal feature is noisily observed, additional observation noise on non-causal features yields models that are more reliant on causal features. We argue that while review text is not noisily observed per se, learning with imperfect feature representations acquired by training deep networks on finite samples has an effect that is analogous to learning with observation noise.
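The effect of measurement noise in this SCM is easy to reproduce numerically. The following sketch (our illustration, with arbitrary parameter values a = 1.5, b = c = 1, and unit noise variances) fits OLS with and without observation noise on x_1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
a, b, c = 1.5, 1.0, 1.0   # illustrative SCM coefficients (our choice, not from the paper)

def fit(noise_sd):
    """Simulate the causal SCM and regress y on (possibly noisy) x1 and x2."""
    z = rng.normal(size=n)
    x1 = b * z + rng.normal(size=n)
    x2 = c * z + rng.normal(size=n)
    y = a * x1 + rng.normal(size=n)
    x1_obs = x1 + noise_sd * rng.normal(size=n)   # measurement noise on x1
    return np.linalg.lstsq(np.column_stack([x1_obs, x2]), y, rcond=None)[0]

b_clean = fit(0.0)   # close to (a, 0): the spurious feature x2 gets no weight
b_noisy = fit(2.0)   # weight leaks onto x2, and the weight on x1 is attenuated
print(b_clean, b_noisy)
```

Without noise the fit recovers (a, 0); with noise, weight shifts onto the spurious feature x_2, mirroring Eq. 3.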
Connection to Counterfactually Augmented Data In the causal setting, intervening on the causal feature d-separates the label y from the non-causal feature x_2, and thus models trained on samples from the interventional distribution will rely solely on the causal feature, even when it is noisily observed. We argue that, in a qualitative sense, the process of generating CAD resembles such an intervention; however, instead of intervening randomly, we ensure that for each example we produce two values of x_1: one such that the label is applicable, and one such that it is not. One is given in the dataset; the other is produced via the revision.
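This intervention analogy can also be simulated. In the sketch below (again our illustration, not the authors' procedure), setting x_1 independently of the confounder z drives the weight on x_2 to zero even under measurement noise:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
a, b, c = 1.5, 1.0, 1.0   # illustrative SCM coefficients

z = rng.normal(size=n)
x2 = c * z + rng.normal(size=n)

# Observational data: x1 depends on the confounder z.
x1_obs_world = b * z + rng.normal(size=n)
# "Interventional" data: x1 set independently of z, loosely analogous to CAD
# revisions that vary the causal spans while everything else stays fixed.
x1_do = np.sqrt(2.0) * rng.normal(size=n)   # variance matched for comparability

betas = []
for x1 in (x1_obs_world, x1_do):
    y = a * x1 + rng.normal(size=n)
    x1_noisy = x1 + 2.0 * rng.normal(size=n)   # measurement noise on x1
    betas.append(np.linalg.lstsq(np.column_stack([x1_noisy, x2]), y, rcond=None)[0])

print(betas[0])  # observational: spurious weight on x2 under noise
print(betas[1])  # interventional: x2 weight near 0 despite the same noise
```

Under the intervention, x_2 is uncorrelated with both x_1 and y, so measurement noise attenuates the causal coefficient but cannot shift weight onto the spurious feature.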

3.2. AN ANTICAUSAL INTERPRETATION

Alternatively, rather than thinking of features causing the applicable label, we might think of the "causal feature" as a direct effect of the label (not a cause). In this case, so long as the relationship is truly not deterministic, then even absent noisy observation, conditioning on the causal feature does not d-separate the label from the non-causal feature, and thus models should be expected to assign weight to both causal and non-causal variables. As in the causal setting, as we increase observation noise on the causal variable, the weight assigned to the non-causal variable should increase. Conversely, as in the causal setting with observation noise on x_1, as observation noise on the non-causal feature x_2 increases, we expect the learned predictor to rely more on the causal feature. We derive the OLS coefficients (including in the presence of observation noise, Figure 1d) in this setting in Appendix A.2.
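A numerical sketch of the anticausal graph (our illustration, with arbitrary coefficients) exhibits both predicted behaviors: non-zero weight on the spurious correlate even without measurement noise, and a shift of weight toward the causal feature when x_2 is noised:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
a_, b_, c_, d_ = 1.0, 1.0, 1.0, 1.5   # illustrative anticausal SCM coefficients

z = rng.normal(size=n)
q = a_ * z + rng.normal(size=n)
y = b_ * z + rng.normal(size=n)
x1 = d_ * y + rng.normal(size=n)      # direct effect of the label
x2 = c_ * q + rng.normal(size=n)      # spurious correlate via the confounder z

def fit(noise_sd_x2):
    """Regress y on x1 and a (possibly noised) x2."""
    X = np.column_stack([x1, x2 + noise_sd_x2 * rng.normal(size=n)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_clean = fit(0.0)   # both coefficients non-zero: x1 does not d-separate y from x2
b_noisy = fit(3.0)   # noising x2 shifts weight toward the causal feature x1
print(b_clean, b_noisy)
```

Even with no measurement noise, the spurious feature receives weight here, in contrast to the causal setting.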

Connection to Counterfactually Augmented Data

In this interpretation, we think of CAD as a process by which we (the designers of the experiment) intervene on the label itself, and the human editors play the role of a simulator that we imagine capable of generating a counterfactual example, holding all other latent variables constant. In the sentiment case, we can think of the editors as providing us with the review that would have existed had the sentiment been flipped, holding all other aspects of the review constant. Note that by intervening on the label, we d-separate it from the spurious correlate x_2 (Figure 1c).

3.3. INSIGHTS AND TESTABLE HYPOTHESES

In both the causal and anticausal models, the mechanism underlying the causal relationship that binds x_1 to y (regardless of direction) is the binding of language to a semantic concept (such as sentiment), which we expect to be more stable across settings than the more capricious relationships among the background variables, e.g., those linking genre and production quality. In that spirit, if the spans edited to generate counterfactually revised data (CRD) are analogous to the causal (or anticausal) variables in the causal (or anticausal) graphs, then we might expect that noising those spans (e.g., by random word replacement) should lead to models that rely more on non-causal features and perform worse on out-of-domain data. On the other hand, we expect that noising unedited spans should have the opposite effect, leading to degraded in-domain performance but comparatively better out-of-domain performance. In the remainder of the paper, we investigate these hypotheses, finding evidence that qualitatively confirms the predictions of our theory. We freely acknowledge the speculative nature of this analysis and concede that the mapping between the messy unstructured data we wish to model and the neatly disentangled portrait captured by our linear Gaussian models leaves a gap to be closed through further iterations of theoretical refinement and scientific experiment. Ultimately, our argument is not that this simple analysis fully accounts for counterfactually augmented data, but rather that it is a useful abstraction for formalizing two (very different) perspectives on how to conceive of CAD, and for suggesting interesting hypotheses amenable to empirical verification.

4. EMPIRICAL RESULTS

If spans marked as rationales by humans, via editing or highlighting, are analogous to causal features, then noising those spans should lead to models that rely more on non-causal features and thus perform worse on out-of-domain data, while noising the unmarked spans (analogous to non-causal features) should have the opposite effect. In this section, we test these hypotheses empirically on real-world datasets. Additionally, we investigate whether the feedback from human workers yields anything qualitatively different from spans marked by automated feature attribution methods such as attention and saliency. Along similar lines, we ask whether CAD offers qualitative advantages in the first place over what might be achieved via automatic sentiment-flipping methods, through experiments with text style transfer algorithms. We conduct experiments on sentiment analysis (Zaidan et al., 2007; Kaushik et al., 2020) and NLI (DeYoung et al., 2020). All datasets are accompanied by human feedback (tokens deemed relevant to the label's applicability), which we refer to as rationales. For the first set of experiments, we rely on four models: Support Vector Machines (SVMs), Bidirectional Long Short-Term Memory Networks (BiLSTMs) with Self-Attention (Graves & Schmidhuber, 2005), BERT (Devlin et al., 2019), and Longformer (Beltagy et al., 2020). For the second set, we rely on four state-of-the-art style transfer models, each representative of a different approach to automatically generating new examples with flipped labels (Hu et al., 2017; Li et al., 2018; Sudhakar et al., 2019; Madaan et al., 2020). To evaluate classifier performance on the resulting augmented data, we consider SVMs, Naive Bayes (NB), BiLSTMs with Self-Attention, and BERT. We relegate implementation details to Appendix B.
In each document, we replace a fraction of rationale (or non-rationale) tokens with random tokens sampled from the vocabulary, and train our models, repeating the process 5 times. We perform similar experiments for NLI using BERT. As an individual premise-hypothesis pair is often not as long as a movie review, many pairs have only one or two words marked as rationales. To observe the effects of gradually injecting noise on rationales or non-rationales, we select only those premise-hypothesis pairs that have a minimum of 10 tokens marked as rationales. Since no neutral pairs exist with 10 or more rationale tokens, we consider only a binary classification setting (entailment-contradiction), and downsample the majority class to ensure a 50:50 label split. Figures 2 and 3 show the difference in mean accuracy over 5 runs. For all classifiers, as the noise in rationales increases, in-sample accuracy stays relatively stable compared to out-of-domain accuracy. An SVM classifier trained on the original 1.7k IMDb reviews from Kaushik et al. (2020) obtains 87.8% accuracy on the IMDb test set and 79.9% on Yelp reviews. As a greater fraction of rationales are replaced with random words from the vocabulary, the classifier experiences a drop of ≈ 11% in in-sample accuracy by the time all rationale tokens are replaced with noise, but a 28.7% drop in accuracy on Yelp reviews. Similarly, on the same datasets, a fine-tuned BERT classifier sees its in-sample accuracy drop by 18.4%, and its Yelp accuracy by 31.4%, as the fraction of rationale tokens replaced by noise goes from 0 to 100%. However, as more non-rationales are replaced with noise, in-sample accuracy for the SVM goes down by ≈ 10% while Yelp accuracy increases by 1.5%. For BERT, in-sample accuracy decreases by only 16.1%, and Yelp accuracy by only 13.6% (see also Appendix Table 3 and Appendix Figure 4a). We obtain similar results using rationales identified via feature feedback: an SVM classifier trained on reviews from Zaidan et al. (2007) shows similar trends (Appendix Figure 5a). For NLI, the in-sample accuracy of BERT fine-tuned on an SNLI subsample drops by ≈ 20% when rationales are replaced with noise, and out-of-domain accuracy goes down by 21.3-31.5% on various datasets (Table 10). In contrast, if non-rationales are replaced with noise, in-sample accuracy goes down by 6.2% but out-of-domain accuracy drops by only 2.3-5.5%. These results support our hypothesis that spans marked by humans as causing a label are analogous to causal variables. Interestingly, in our NLI experiments, for various models the drops in both in-sample and out-of-domain accuracy are greater in magnitude when noise is injected in rationales than when it is injected in non-rationales, the opposite of what we observe in sentiment analysis. We conjecture that this is because, in our NLI experiment design, we keep only those premise-hypothesis pairs that contain at least 10 tokens marked as rationales, so that we can observe the difference in accuracy as the amount of noise increases. A consequence of this selection is that many selected pairs have many more tokens marked as rationales than non-rationales, whereas in sentiment analysis the opposite holds. Hence, in NLI, noising some percentage of rationales edits many more tokens than noising the corresponding percentage of non-rationales. To compare human feedback to automatic feature attribution methods such as attention (Bahdanau et al., 2015) and gradient-based saliency methods (Li et al., 2016), we conduct the same set of experiments, treating the tokens attended to (or not) by an attention-based classifier (BiLSTM with Self-Attention), or identified as highly influential by a gradient-based feature attribution method (salience scores), as the new rationales (or non-rationales).
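The token-level noising procedure used throughout these experiments (replacing a fraction of rationale or non-rationale tokens with random draws from the vocabulary) can be sketched as follows; this is a minimal illustration with toy inputs and a hypothetical `noise_spans` helper, not the authors' released code:

```python
import random

def noise_spans(tokens, span_mask, fraction, vocab, seed=0):
    """Replace `fraction` of the tokens selected by `span_mask` (rationales
    or their complement) with random vocabulary tokens; all other tokens
    are left untouched."""
    rng = random.Random(seed)
    idx = [i for i, m in enumerate(span_mask) if m]
    to_flip = rng.sample(idx, k=int(round(fraction * len(idx))))
    out = list(tokens)
    for i in to_flip:
        out[i] = rng.choice(vocab)
    return out

tokens = "this movie was absolutely wonderful despite the genre".split()
rationale = [t in {"absolutely", "wonderful"} for t in tokens]  # toy rationale mask
vocab = ["the", "film", "plot", "banana", "green"]              # toy vocabulary
noised = noise_spans(tokens, rationale, fraction=1.0, vocab=vocab)
print(noised)  # rationale positions replaced; everything else preserved
```

Passing the complement of the mask (`[not m for m in rationale]`) noises the non-rationale spans instead.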
In this case, unlike our findings with human feedback, we observe markedly different behavior than predicted by our analysis of the toy causal model (see Figures 2b, 2c, 3b, and 3c; and Appendix Tables 4, 5, 7, and 8). While we might not expect spurious signals to be as reliable out of domain, that does not mean that they will always fail. For example, associations between genre and sentiment learned from a dataset of book reviews might not hold in a dataset of kitchen appliance reviews, but might nevertheless hold in a dataset of audiobook reviews. In such settings, even though noising non-causal features would lead models to rely more on causal features, this may not result in better out-of-domain performance. We also ask whether we really need to go through the process of collecting CAD (or human-annotated rationales) at all, or whether automated methods for generating "counterfactuals" might obtain similar gains in out-of-domain performance, as the former can be expensive. We experiment with state-of-the-art style transfer methods to convert Positive reviews into Negative and vice versa. Ideally, we would expect these methods to preserve a document's "content" while modifying the attributes that relate to sentiment (if they obtained perfect disentanglement in the feature space). Sentiment classifiers trained on original and sentiment-flipped reviews generated using style transfer methods often give better out-of-domain performance compared to training only on original data of the same size (Table 2). However, models trained on CAD perform even better across all datasets, hinting at the value of human feedback.

5. CONCLUSION

While prior work offers promising clues to the benefits of CAD generated through human-in-the-loop mechanisms, it lacked formal frameworks for thinking about the technique, or comparisons to plausible alternatives. In this paper, through simple analysis of toy linear Gaussian models followed by a large-scale empirical investigation on sentiment analysis and NLI tasks, we formalize CAD and take some initial steps towards understanding its practical efficacy. Our analysis suggests that data corrupted by adding noise to rationale spans (analogous to adding noise to causal features) will degrade out-of-domain performance, while noise added to non-causal features may make models more robust out of domain. Our empirical study focuses on sentiment analysis and NLI, and our findings remain consistent across datasets and models. Furthermore, the two tasks are subjectively very different: sentiment analysis places greater weight on expressions of opinion than on stated facts, whereas NLI is the opposite. We also show that models trained on the augmentation of original data with revised data generated by style transfer methods had better out-of-domain generalization in some cases compared to models trained on original data alone, but performed worse than models trained on CAD. In future work, we will examine how these findings generalize to other domains, including computer vision, and investigate the surprisingly low susceptibility of pre-trained transformers to spurious associations.

A OLS ESTIMATION UNDER NOISY MEASUREMENT

A.1 CAUSAL SETTING

Let the Gaussian SCM be defined as follows, where the noise term for a variable $v$ is denoted $u_v$:
\begin{align}
z &= u_z, & u_z &\sim \mathcal{N}(0, \sigma^2_{u_z}), \nonumber\\
x_1 &= bz + u_{x_1}, & u_{x_1} &\sim \mathcal{N}(0, \sigma^2_{u_{x_1}}), \nonumber\\
x_2 &= cz + u_{x_2}, & u_{x_2} &\sim \mathcal{N}(0, \sigma^2_{u_{x_2}}), \nonumber\\
y &= ax_1 + u_y, & u_y &\sim \mathcal{N}(0, \sigma^2_{u_y}). \tag{4}
\end{align}
The implied variances and covariances are
\begin{align}
\sigma^2_{x_1} &= b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}}, &
\sigma^2_{x_2} &= c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}, &
\sigma_{x_1,x_2} &= bc\,\sigma^2_{u_z}, \nonumber\\
\sigma_{x_1,y} &= ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}}, &
\sigma_{x_2,y} &= abc\,\sigma^2_{u_z}. & & \tag{5}
\end{align}
Then, if we were to solve the linear regression problem $y = x_1\beta_1 + x_2\beta_2 + \beta_0$, using Eq. 1 we obtain the following values for $\beta^{ols}_1$ and $\beta^{ols}_2$:
\begin{align}
\beta^{ols}_1 &= \frac{\sigma^2_{x_2}\sigma_{x_1,y} - \sigma_{x_1,x_2}\sigma_{x_2,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}
= \frac{(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})(ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}}) - (bc\sigma^2_{u_z})(abc\sigma^2_{u_z})}{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}})(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}) - b^2c^2\sigma^4_{u_z}} \tag{6}\\
&= a\,\frac{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}})(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}) - b^2c^2\sigma^4_{u_z}}{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}})(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}) - b^2c^2\sigma^4_{u_z}} = a, \nonumber\\
\beta^{ols}_2 &= \frac{\sigma^2_{x_1}\sigma_{x_2,y} - \sigma_{x_1,x_2}\sigma_{x_1,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}
= \frac{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}})(abc\sigma^2_{u_z}) - (bc\sigma^2_{u_z})(ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}})}{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}})(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}) - b^2c^2\sigma^4_{u_z}} = 0. \tag{7}
\end{align}
However, if the setting is slightly different and we observe a noisy version of $x_1$, given by $\tilde{x}_1$:
\begin{align}
\tilde{x}_1 = x_1 + \epsilon_{x_1}, \qquad \epsilon_{x_1} \sim \mathcal{N}(0, \sigma^2_{\epsilon_{x_1}}). \tag{8}
\end{align}
Since $\epsilon_{x_1} \perp\!\!\!\perp (x_1, x_2, y)$,
\begin{align}
\sigma^2_{\tilde{x}_1} &= \mathrm{Var}[x_1 + \epsilon_{x_1}] = b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}}, \tag{9}\\
\sigma_{\tilde{x}_1,y} &= \sigma_{x_1,y} = \mathbb{E}[(bz + u_{x_1})(ax_1 + u_y)] = ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}}, \tag{10}\\
\sigma_{\tilde{x}_1,x_2} &= \sigma_{x_1,x_2} = bc\,\sigma^2_{u_z}. \tag{11}
\end{align}
Plugging these values into Eq. 1, we get the OLS estimates $\tilde{\beta}^{ols}_1$ and $\tilde{\beta}^{ols}_2$ in the presence of observation noise on $x_1$:
\begin{align}
\tilde{\beta}^{ols}_1 &= \frac{(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})(ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}}) - (bc\sigma^2_{u_z})(abc\sigma^2_{u_z})}{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}})(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}) - b^2c^2\sigma^4_{u_z}} \nonumber\\
&= \frac{a\big(\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}}\big)}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})}
= \frac{\beta^{ols}_1}{1 + \lambda_c}, \nonumber\\
\lambda_c &= \frac{\sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}}}, \nonumber\\
\tilde{\beta}^{ols}_2 &= \frac{(b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}})(abc\sigma^2_{u_z}) - (bc\sigma^2_{u_z})(ab^2\sigma^2_{u_z} + a\sigma^2_{u_{x_1}})}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})} \nonumber\\
&= \frac{abc\,\sigma^2_{\epsilon_{x_1}}\sigma^2_{u_z}}{\sigma^2_{u_z}(b^2\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_{x_1}}) + \sigma^2_{u_{x_1}}\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_1}}(c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}})}. \nonumber
\end{align}
As we can see, $\lambda_c > 0$ and $\lambda_c \propto \sigma^2_{\epsilon_{x_1}}$. This shows that as $\sigma^2_{\epsilon_{x_1}}$ increases, $|\tilde{\beta}^{ols}_1|$ (the magnitude of the coefficient for $x_1$) decreases and $|\tilde{\beta}^{ols}_2|$ (the magnitude of the coefficient for $x_2$) increases. In the limit,
\begin{align}
\lim_{\sigma^2_{\epsilon_{x_1}} \to \infty} \tilde{\beta}^{ols}_1 = 0, \qquad
\lim_{\sigma^2_{\epsilon_{x_1}} \to \infty} \tilde{\beta}^{ols}_2 = \frac{abc\,\sigma^2_{u_z}}{c^2\sigma^2_{u_z} + \sigma^2_{u_{x_2}}}. \nonumber
\end{align}

A.2 ANTICAUSAL SETTING

Once again we assume that each variable $V$ is a linear function of its parents $\mathrm{Pa}(V)$. The noise terms are Gaussian and jointly independent:
\begin{align}
z &= u_z, & u_z &\sim \mathcal{N}(0, \sigma^2_{u_z}), \nonumber\\
q &= az + u_q, & u_q &\sim \mathcal{N}(0, \sigma^2_{u_q}), \nonumber\\
y &= bz + u_y, & u_y &\sim \mathcal{N}(0, \sigma^2_{u_y}), \nonumber\\
x_2 &= cq + u_{x_2}, & u_{x_2} &\sim \mathcal{N}(0, \sigma^2_{u_{x_2}}), \nonumber\\
x_1 &= dy + u_{x_1}, & u_{x_1} &\sim \mathcal{N}(0, \sigma^2_{u_{x_1}}). \nonumber
\end{align}
The implied variances and covariances are
\begin{align}
\sigma^2_{x_1} &= d^2b^2\sigma^2_{u_z} + d^2\sigma^2_{u_y} + \sigma^2_{u_{x_1}}, &
\sigma^2_{x_2} &= c^2a^2\sigma^2_{u_z} + c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}}, \nonumber\\
\sigma_{x_1,x_2} &= abcd\,\sigma^2_{u_z}, &
\sigma_{x_1,y} &= db^2\sigma^2_{u_z} + d\sigma^2_{u_y}, \nonumber\\
\sigma_{x_2,y} &= abc\,\sigma^2_{u_z}. & & \nonumber
\end{align}
If we were to solve the linear regression problem $y = x_1\beta_1 + x_2\beta_2 + \beta_0$, then using Eq. 1 we get the OLS estimates:
\begin{align}
\beta^{ols}_1 &= \frac{\sigma^2_{x_2}\sigma_{x_1,y} - \sigma_{x_1,x_2}\sigma_{x_2,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}
= \frac{(c^2a^2\sigma^2_{u_z} + c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}})(db^2\sigma^2_{u_z} + d\sigma^2_{u_y}) - (abcd\sigma^2_{u_z})(abc\sigma^2_{u_z})}{(d^2b^2\sigma^2_{u_z} + d^2\sigma^2_{u_y} + \sigma^2_{u_{x_1}})(c^2a^2\sigma^2_{u_z} + c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}}) - a^2b^2c^2d^2\sigma^4_{u_z}} \nonumber\\
&= \frac{d\big(a^2c^2\sigma^2_{u_z}\sigma^2_{u_y} + (c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}})(b^2\sigma^2_{u_z} + \sigma^2_{u_y})\big)}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}, \nonumber\\
\beta^{ols}_2 &= \frac{\sigma^2_{x_1}\sigma_{x_2,y} - \sigma_{x_1,x_2}\sigma_{x_1,y}}{\sigma^2_{x_1}\sigma^2_{x_2} - \sigma^2_{x_1,x_2}}
= \frac{abc\,\sigma^2_{u_z}\sigma^2_{u_{x_1}}}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}. \tag{15}
\end{align}
However, if we observe a noisy version of $x_1$, given by $\tilde{x}_1 = x_1 + \epsilon_{x_1}$ with $\epsilon_{x_1} \sim \mathcal{N}(0, \sigma^2_{\epsilon_{x_1}})$, then since $\epsilon_{x_1} \perp\!\!\!\perp (x_2, y)$, to obtain the OLS estimates $\tilde{\beta}^{ols}_1, \tilde{\beta}^{ols}_2$ under observation noise we only need to replace $\sigma^2_{u_{x_1}}$ in Eq. 15 with
\begin{align}
\sigma^2_{u_{\tilde{x}_1}} = \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}}. \nonumber
\end{align}
This yields
\begin{align}
\tilde{\beta}^{ols}_1 &= \frac{d\big(a^2c^2\sigma^2_{u_z}\sigma^2_{u_y} + (c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}})(b^2\sigma^2_{u_z} + \sigma^2_{u_y})\big)}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}
= \frac{\beta^{ols}_1}{1 + \lambda^{x_1}_{ac}}, \nonumber\\
\tilde{\beta}^{ols}_2 &= \frac{abc\,\sigma^2_{u_z}(\sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}})}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + \sigma^2_{\epsilon_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}
= \frac{\beta^{ols}_2}{1 + \lambda^{x_1}_{ac}}\left(1 + \frac{\sigma^2_{\epsilon_{x_1}}}{\sigma^2_{u_{x_1}}}\right), \tag{20}\\
\lambda^{x_1}_{ac} &= \frac{\sigma^2_{\epsilon_{x_1}}(c^2a^2\sigma^2_{u_z} + c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}})}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}, \nonumber
\end{align}
where $\lambda^{x_1}_{ac} > 0$ and $\lambda^{x_1}_{ac} \propto \sigma^2_{\epsilon_{x_1}}$. Thus, as $\sigma^2_{\epsilon_{x_1}}$ increases, $|\tilde{\beta}^{ols}_1|$ decreases. The asymptotic OLS estimates under infinite observation noise are
\begin{align}
\lim_{\sigma^2_{\epsilon_{x_1}} \to \infty} \tilde{\beta}^{ols}_1 = 0, \qquad
\lim_{\sigma^2_{\epsilon_{x_1}} \to \infty} \tilde{\beta}^{ols}_2 = \frac{abc\,\sigma^2_{u_z}}{c^2a^2\sigma^2_{u_z} + c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}}}. \nonumber
\end{align}
Similarly, if we observe a noisy version of $x_2$, given by $\tilde{x}_2 = x_2 + \epsilon_{x_2}$ with $\epsilon_{x_2} \sim \mathcal{N}(0, \sigma^2_{\epsilon_{x_2}})$, then since $\epsilon_{x_2} \perp\!\!\!\perp (x_1, y)$, to obtain the OLS estimates under observation noise on the non-causal feature we only need to replace $\sigma^2_{u_{x_2}}$ in Eq. 15 with
\begin{align}
\sigma^2_{u_{\tilde{x}_2}} = \sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_2}}. \nonumber
\end{align}
This yields
\begin{align}
\tilde{\beta}^{ols}_1 &= \frac{d\big(a^2c^2\sigma^2_{u_z}\sigma^2_{u_y} + (c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_2}})(b^2\sigma^2_{u_z} + \sigma^2_{u_y})\big)}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}} \nonumber\\
&= \frac{\beta^{ols}_1}{1 + \lambda^{x_2}_{ac}}\left(1 + \frac{\sigma^2_{\epsilon_{x_2}}(b^2\sigma^2_{u_z} + \sigma^2_{u_y})}{a^2c^2\sigma^2_{u_z}\sigma^2_{u_y} + (c^2\sigma^2_{u_q} + \sigma^2_{u_{x_2}})(b^2\sigma^2_{u_z} + \sigma^2_{u_y})}\right), \nonumber\\
\tilde{\beta}^{ols}_2 &= \frac{abc\,\sigma^2_{u_z}\sigma^2_{u_{x_1}}}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + \sigma^2_{\epsilon_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}
= \frac{\beta^{ols}_2}{1 + \lambda^{x_2}_{ac}}, \tag{25}\\
\lambda^{x_2}_{ac} &= \frac{\sigma^2_{\epsilon_{x_2}}(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})}{(d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})(\sigma^2_{u_{x_2}} + c^2\sigma^2_{u_q}) + (\sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y})c^2a^2\sigma^2_{u_z}}, \nonumber
\end{align}
where $\lambda^{x_2}_{ac} > 0$ and $\lambda^{x_2}_{ac} \propto \sigma^2_{\epsilon_{x_2}}$. Thus, as $\sigma^2_{\epsilon_{x_2}}$ increases, $|\tilde{\beta}^{ols}_1|$ increases. The asymptotic OLS estimates under infinite observation noise are
\begin{align}
\lim_{\sigma^2_{\epsilon_{x_2}} \to \infty} \tilde{\beta}^{ols}_2 = 0, \qquad
\lim_{\sigma^2_{\epsilon_{x_2}} \to \infty} \tilde{\beta}^{ols}_1 = \frac{d(b^2\sigma^2_{u_z} + \sigma^2_{u_y})}{d^2b^2\sigma^2_{u_z} + \sigma^2_{u_{x_1}} + d^2\sigma^2_{u_y}}. \nonumber
\end{align}
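The attenuation results above can be sanity-checked numerically. The following sketch (assuming NumPy) simulates the causal-setting SCM of Eq. 4 and fits OLS with and without measurement noise on $x_1$; the particular coefficient and noise values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a, b, c = 2.0, 1.5, -1.0                      # structural coefficients (Eq. 4)

# Simulate the SCM: z confounds x1 and x2; only x1 causes y.
z = rng.normal(0.0, 1.0, n)
x1 = b * z + rng.normal(0.0, 1.0, n)          # causal feature
x2 = c * z + rng.normal(0.0, 1.0, n)          # spurious (confounded) feature
y = a * x1 + rng.normal(0.0, 1.0, n)

def ols(f1, f2, y):
    """Least-squares fit of y on (f1, f2, intercept); returns (beta1, beta2)."""
    X = np.column_stack([f1, f2, np.ones_like(y)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0], beta[1]

b1, b2 = ols(x1, x2, y)                       # clean features: b1 ~ a, b2 ~ 0
x1_noisy = x1 + rng.normal(0.0, 3.0, n)       # measurement noise on the causal feature
nb1, nb2 = ols(x1_noisy, x2, y)               # |nb1| shrinks, |nb2| grows
```

With these values, Eq. 6-7 predict $(\beta_1, \beta_2) \approx (2, 0)$, and the noisy-feature formulas predict the weight shifting onto the spurious feature $x_2$.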

B MODEL IMPLEMENTATION DETAILS FOR SECTION 4

Standard Methods We use scikit-learn (Pedregosa et al., 2011) implementations of SVMs and Naïve Bayes for sentiment analysis. We train these models on TF-IDF bag-of-words feature representations of the reviews (Jones, 1972). We identify hyperparameters for both classifiers via grid search over the validation set.

BiLSTM We restrict the vocabulary to the 20k most frequent tokens, replacing out-of-vocabulary tokens with UNK. We fix the maximum input length at 330 tokens when training on reviews from Kaushik et al. (2020) and at 2678 when training on Zaidan et al. (2007), padding shorter reviews. Each token is represented by a randomly initialized 300-dimensional embedding. Our model consists of a bidirectional LSTM (hidden dimension 128) with recurrent dropout (probability 0.5) and self-attention following the embedding layer. We use the self-attention implementation described in Lin et al. (2017) with hyperparameter values d = 64 and r = 64. To generate output, we feed this (fixed-length) representation through a fully-connected hidden layer (hidden dimension 32) and then a fully-connected output layer with softmax activation. We train all models for a maximum of 20 epochs using Adam (Kingma & Ba, 2015) with a learning rate of 1e-4 and a batch size of 16, applying early stopping when validation loss does not decrease for 5 epochs.
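As a rough sketch of the standard-methods setup (not our exact configuration), a scikit-learn pipeline with grid search over the SVM regularization strength might look like the following; the toy corpus and parameter grid are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus; the real experiments use IMDb movie reviews.
docs = ["great acting and a wonderful plot",
        "terrible pacing and an awful script",
        "a wonderful, moving film",
        "awful dialogue, a terrible movie"] * 10
labels = [1, 0, 1, 0] * 10

# TF-IDF bag-of-words features feeding a linear SVM.
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("svm", LinearSVC())])

# Grid search over the regularization strength C.
grid = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1.0, 10.0]}, cv=2)
grid.fit(docs, labels)
```

In our experiments the grid search is run against a held-out validation set rather than cross-validation folds.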

Pretrained Transformers

We use off-the-shelf uncased BERT Base and Longformer Base models (Wolf et al., 2019), fine-tuning each for the task at hand. We use BERT for experiments on the smaller IMDb dataset from Kaushik et al. (2020) (maximum review length of 330 tokens) and Longformer for the dataset presented by Zaidan et al. (2007) (maximum review length of 2678 tokens). To account for BERT's sub-word tokenization, we set the maximum token length at 350 for sentiment analysis and 50 for NLI; for Longformer, it is 3072. We fine-tune BERT for up to 20 epochs with the same early stopping criteria as for the BiLSTM, using the BERT Adam optimizer with a batch size of 16 (to fit on a 16GB Tesla V-100 GPU). We found learning rates of 5e-5 and 1e-5 to work best for sentiment analysis and NLI, respectively. We fine-tune Longformer for 10 epochs with early stopping, using a batch size of 8 (to fit on 64GB of GPU memory).

Style Transfer Methods For Hu et al. (2017), Sudhakar et al. (2019), and Madaan et al. (2020), we found the default hyperparameters used by the authors to work best on our task. For Li et al. (2018), we followed the training schedule presented in the paper. However, since that paper does not present results on IMDb reviews, we experimented with multiple values of the salience ratio and used a salience ratio of 5.5 for our downstream task, based on transfer accuracy and BLEU scores achieved on the validation set. For all style transfer methods, we experimented with multiple sequence lengths and found that the models worked best on sentence-level (versus review-level) data with a sequence length of 30, truncating longer sentences in the process. For each review, we passed individual sentences through each model and reconstructed the whole review by joining the resulting sentiment-flipped sentences.
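As a rough illustration of this sentence-level reconstruction step, the sketch below splits a review into sentences, truncates each to the sequence length, and rejoins the flipped sentences; `toy_flip` is a dummy lexicon-based stand-in for a trained style transfer model, and `flip_review` is a hypothetical helper.

```python
import re

def flip_review(review, flip_sentence, max_len=30):
    """Split a review into sentences, truncate each to max_len tokens,
    pass it through a sentence-level style transfer model, and rejoin."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]
    flipped = []
    for s in sentences:
        tokens = s.split()[:max_len]          # truncate longer sentences
        flipped.append(flip_sentence(" ".join(tokens)))
    return " ".join(flipped)

# Dummy stand-in for a trained sentiment style transfer model.
lexicon = {"great": "awful", "awful": "great", "loved": "hated"}
def toy_flip(sentence):
    out = []
    for w in sentence.split():
        core = w.strip(".,!?")                # keep trailing punctuation intact
        out.append(w.replace(core, lexicon.get(core, core)))
    return " ".join(out)

review = "I loved this film. The acting was great."
flipped = flip_review(review, toy_flip)
```

In the actual experiments, each of the four style transfer models plays the role of `flip_sentence`.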



Notes: The out-of-domain evaluation sets in Kaushik et al. (2020) do not have a 50:50 label split; we enforce this split to observe when a classifier approaches random baseline performance. All datasets can be found at https://github.com/acmi-lab/counterfactually-augmented-data

While similar trends are observed for both feature feedback and CAD, it is less clear how to effectively incorporate feature feedback when training deep neural networks and pre-trained transformer architectures, whereas training (or fine-tuning) models on CAD is straightforward.

Longformer is better suited to longer texts than BERT: the maximum length of a review in Zaidan et al. (2007) is 2678 tokens, whereas in Kaushik et al. (2020) it is only 330 tokens.

Code for the style transfer methods:
https://github.com/asyml/texar/tree/master/examples/text_style_transfer
https://github.com/agaralabs/transformer-drg-style-transfer
https://github.com/tag-and-generate/
https://github.com/lijuncen/Sentiment-and-Style-Transfer




Figure 2: Change in classifier accuracy as noise is injected on rationales/non-rationales for IMDb reviews from Kaushik et al. (2020).

Figure 3: Change in classifier accuracy as noise is injected on rationales/non-rationales for IMDb reviews from Zaidan et al. (2007). In both Figures 2 and 3, the vertical dashed line indicates the fraction of the median length of non-rationales equal to the median length of rationales.


Figure 6: Most important features learned by an SVM classifier trained on TF-IDF bag of words. Rationales are identified by humans.

Figure 8: Most important features learned by an SVM classifier trained on TF-IDF bag of words. Rationales are identified as tokens marked by the AllenNLP Saliency Interpreter.

Figure 9: Most important features learned by an SVM classifier trained on TF-IDF bag of words. All noise inserted on rationales identified by humans.


Accuracy of BERT trained on SNLI (DeYoung et al., 2020) as noise is injected on human-identified rationales/non-rationales. RP and RH are the Revised Premise and Revised Hypothesis test sets in Kaushik et al. (2020). MNLI-M and MNLI-MM are the MNLI (Williams et al., 2018) dev sets.

Out-of-domain accuracy of models trained on original data only, on CAD, and on original plus sentiment-flipped reviews.

Accuracy of various sentiment analysis classifiers trained on 1.7k original reviews from Kaushik et al. (2020) as noise is injected on rationales/non-rationales identified via human feedback.

Accuracy of various sentiment analysis classifiers trained on 1.7k original reviews from Kaushik et al. (2020) as noise is injected on rationales/non-rationales identified via attention masks.

Accuracy of various sentiment analysis classifiers trained on 1.7k original reviews from Kaushik et al. (2020) as noise is injected on rationales/non-rationales identified via the AllenNLP Saliency Interpreter.

Accuracy of various sentiment analysis classifiers trained on reviews from Zaidan et al. (2007) as noise is injected on rationales/non-rationales identified via human feedback.

Accuracy of various sentiment analysis classifiers trained on reviews from Zaidan et al. (2007) as noise is injected on rationales/non-rationales identified via the AllenNLP Saliency Interpreter.

Accuracy of various models for sentiment analysis trained with various datasets. O refers to the in-sample test set from Kaushik et al. (2020), whereas R refers to its counterfactually revised counterpart.

Accuracy of BERT trained on a subsample of SNLI (DeYoung et al., 2020) (where the numbers of rationale and non-rationale tokens are within 30% of one another) as noise is injected on human-identified rationales/non-rationales. RP and RH are the Revised Premise and Revised Hypothesis test sets in Kaushik et al. (2020). MNLI-M and MNLI-MM are the MNLI (Williams et al., 2018) dev sets.

ACKNOWLEDGEMENTS

The authors are grateful to NVIDIA for providing GPUs to conduct the experiments, Salesforce Research and Facebook AI for their financial support, and Sanket Mehta, Sina Fazelpour and Tejas Khot for our discussions and their valuable feedback.

