EXPLAINING THE EFFICACY OF COUNTERFACTUALLY AUGMENTED DATA

Abstract

In attempts to produce machine learning models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which, given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented (original and revised) data appear, empirically, to rely less on semantically irrelevant words and to generalize better out of domain. While this work draws loosely on causal thinking, the underlying causal model (even at an abstract level) and the principles underlying the observed out-of-domain improvements remain unclear. In this paper, we introduce a toy analog based on linear Gaussian models, observing interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Our analysis provides some insights that help to explain the efficacy of CAD. Moreover, we develop the hypothesis that while adding noise to causal features should degrade both in-domain and out-of-domain performance, adding noise to non-causal features should lead to relative improvements in out-of-domain performance. This idea inspires a speculative test for determining whether a feature attribution technique has identified the causal spans: if adding noise (e.g., by random word flips) to the highlighted spans degrades both in-domain and out-of-domain performance on a battery of challenge datasets, but adding noise to the complement yields improvements out of domain, this suggests that we have identified the causal spans. Thus, we present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous challenge domains and models, we find that the hypothesized phenomenon is pronounced for CAD.
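The linear-Gaussian intuition stated in the abstract can be sketched numerically. The snippet below is an illustrative toy, not the paper's actual model: the particular coefficients, noise scales, and the choice to break the spurious correlation entirely out of domain are assumptions made for the example. A causal feature generates the target, a spurious feature proxies the target in domain, and we compare ordinary least squares fits when measurement noise is injected into one feature or the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000


def sample(n, ood=False):
    """Toy linear-Gaussian data: x_c causes y; x_s is a spurious proxy
    of y in domain, but independent of y out of domain (assumed shift)."""
    x_c = rng.normal(size=n)
    y = x_c + 0.5 * rng.normal(size=n)       # y is caused by x_c
    if ood:
        x_s = rng.normal(size=n)             # spurious association broken
    else:
        x_s = y + 0.3 * rng.normal(size=n)   # spurious proxy of y
    return np.column_stack([x_c, x_s]), y


def fit(X, y):
    # ordinary least squares (no intercept; all variables are zero-mean)
    return np.linalg.lstsq(X, y, rcond=None)[0]


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


def noised(X, col, sigma=1.0):
    # inject measurement noise into one feature column at training time
    X = X.copy()
    X[:, col] += sigma * rng.normal(size=len(X))
    return X


X_tr, y_tr = sample(n)
X_id, y_id = sample(n)                 # clean in-domain test set
X_ood, y_ood = sample(n, ood=True)     # out-of-domain test set

w_base = fit(X_tr, y_tr)               # train on clean features
w_nc = fit(noised(X_tr, 0), y_tr)      # noise on the causal feature
w_ns = fit(noised(X_tr, 1), y_tr)      # noise on the spurious feature

for name, w in [("baseline", w_base),
                ("noise on causal", w_nc),
                ("noise on spurious", w_ns)]:
    print(f"{name:18s} ID MSE {mse(w, X_id, y_id):.3f}"
          f"  OOD MSE {mse(w, X_ood, y_ood):.3f}")
```

Under these assumptions, noising the spurious feature attenuates its learned coefficient, costing a little in-domain accuracy but substantially improving out-of-domain error, while noising the causal feature shifts weight onto the spurious proxy and makes out-of-domain error worse, mirroring the hypothesized asymmetry.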

1. INTRODUCTION

Despite the many practical breakthroughs of machine learning (ML), formidable obstacles obstruct its deployment in consequential applications. Of particular concern, these models have been shown to rely on spurious signals, such as surface-level textures in images (Jo & Bengio, 2017; Geirhos et al., 2018) and background scenery, even when the task is to recognize foreground objects (Beery et al., 2018). Other studies have uncovered a worrisome reliance on gender in models trained to recommend jobs (Dastin, 2018), and on race in models that prioritize patients for medical care (Obermeyer et al., 2019). Moreover, while modern ML performs remarkably well on independent and identically distributed (iid) holdout data, performance often decays catastrophically under both naturally occurring and adversarial distribution shift (Quionero-Candela et al., 2009; Sugiyama & Kawanabe, 2012; Szegedy et al., 2014; Ovadia et al., 2019; Filos et al., 2020). These two problems, (i) reliance on semantically irrelevant signals, raising concerns about bias, and (ii) the brittleness of models under distribution shift, might appear unrelated, but they share important conceptual features. Concerns about bias stem in part from principles of procedural fairness (Blader & Tyler, 2003; Miller, 2017; Grgic-Hlaca et al., 2018; Lipton et al., 2018), according to which decisions should be based on qualifications, not on distant proxies that are spuriously associated with the outcome of interest. Arguably, one key distinction of an actual qualification might be that it

