LEARNING FROM OTHERS: SIMILARITY-BASED REGULARIZATION FOR MITIGATING ARTIFACTS

Abstract

Common methods for mitigating spurious correlations in natural language understanding (NLU) usually operate in the output space, encouraging a main model to behave differently from a bias model by down-weighting examples on which the bias model is confident. While these methods improve out-of-distribution (OOD) performance, it was recently observed that the internal representations of the presumably debiased models are actually more, rather than less, biased. We propose SimReg, a new method for debiasing internal model components via similarity-based regularization, in representation space: we encourage the model to learn representations that are either similar to those of an unbiased model or dissimilar from those of a biased model. We experiment with three NLU tasks and different kinds of biases. We find that SimReg improves OOD performance, with little in-distribution degradation. Moreover, the representations learned by SimReg are less biased than those produced by other methods.

1. INTRODUCTION

Recent studies (McCoy et al., 2019; Geirhos et al., 2020, inter alia) show that in many cases neural models tend to exploit spurious correlations (a.k.a. dataset biases, or artifacts) in datasets and learn shortcut solutions rather than the intended function. For example, in MNLI, a popular natural language inference dataset, there is a high correlation between negation words such as "not" and "don't" and the contradiction label (Gururangan et al., 2018). Thus, models trained on MNLI confidently predict contradiction whenever there is a negation word in the input, without considering the whole meaning of the sentence. As a result of relying on such shortcuts, these models, commonly known as 'biased' models, fail to generalize and perform poorly when tested on out-of-distribution (OOD) data in which such associative patterns are not present (McCoy et al., 2019). Moreover, this behavior limits their practical applicability in cases where the real-world data distribution differs from the training distribution.

Recent efforts to mitigate the learning of spurious correlations (a.k.a. debiasing methods) down-weight the importance of training samples that contain such correlations, effectively performing data re-weighting (Schuster et al., 2019; Utama et al., 2020a; Sanh et al., 2021; Cadene et al., 2019). Typically, a bias-only model is trained, and its confidence is used to re-weight training samples. One might expect such extrinsic debiasing to "suppress the model from capturing non-robust features" (Du et al., 2022). However, Mendelson & Belinkov (2021) showed a counter-intuitive trend: the more extrinsically debiased a model is, the more biased are its representations; i.e., higher accuracy of such models on OOD challenge sets is correlated with an increase in intrinsic bias.foot_0 Such superficial debiasing is problematic, as the bias may reappear when the model is used in another setting (Orgad et al., 2022).

Inspired by this finding, we propose to perform intrinsic debiasing on internal model components. We develop SimReg, a new debiasing method based on similarity regularization, in which we encourage the internal representations to be either (i) similar to the representations of an unbiased model,foot_1 or (ii) dissimilar from the representations of a biased model. We further apply the (dis)similarity regularization to either the model's representations or its gradients.
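To make the extrinsic, output-space re-weighting scheme concrete, here is a minimal sketch (not the exact formulation of any particular paper); `bias_confidence` stands for the bias-only model's probability on the gold label, an assumed input:

```python
import numpy as np

def reweighted_loss(per_example_loss, bias_confidence):
    # Down-weight examples on which the bias-only model is confident:
    # weight_i = 1 - p_bias(gold label | x_i), so "easy" biased examples
    # contribute little to the main model's training signal.
    weights = 1.0 - bias_confidence
    return float((weights * per_example_loss).sum() / weights.sum())
```

An example on which the bias model is maximally confident (confidence 1.0) receives zero weight and is effectively ignored during training.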
Our regularization framework allows us to impose constraints on how the data should be represented (representation regularization) or on how sensitive the model should be to the representations of the data (gradient regularization). This differs from previous methods, where the main model usually learns to avoid the errors of bias models. Our approach allows us to transfer knowledge from other models about "good" representations of the data and to discourage "biased" representations. We evaluate our approach on three tasks (natural language inference, fact checking, and paraphrase identification) and multiple spurious correlations attested in the literature: lexical overlap, partial inputs, and unknown biases from weak models (see Section 2.1). We demonstrate that our approach improves performance on out-of-distribution (OOD) challenge sets while incurring little degradation in in-distribution (ID) performance. Finally, by measuring bias extractability, we find that SimReg representations are less biased than those obtained with competing debiasing methods.
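A minimal sketch of the representation-space regularizer, using cosine similarity as a stand-in similarity measure (the exact similarity function, layer choices, and names here are illustrative assumptions, not the paper's precise formulation):

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two (batch, dim) matrices.
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def simreg_loss(task_loss, h_main, h_ref, lam=1.0, encourage_similar=True):
    # If encourage_similar: pull the main model's representations toward an
    # unbiased reference model. Otherwise: push them away from a biased one.
    sim = cosine_sim(h_main, h_ref).mean()
    reg = (1.0 - sim) if encourage_similar else sim
    return task_loss + lam * reg
```

The same scalar regularizer can, in principle, be applied to gradients of the loss with respect to the representations instead of the representations themselves, matching the gradient-regularization variant.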

2. RELATED WORK

A growing body of work has revealed that models tend to exploit spurious correlations found in their training data (Geirhos et al., 2020). Spurious correlations are non-causal correlations between certain features of the input and certain labels. Models tend to fail when tested on out-of-distribution data, where said correlations do not hold. We briefly mention several relevant cases and refer to Du et al. (2022) for a recent overview of shortcut learning and its mitigation in natural language understanding.

2.1. DATASET BIAS

Partial-input bias. A common spurious correlation in sentence-pair classification tasks, such as natural language inference (NLI), is partial-input bias: the association between words in one of the sentences and certain labels. For example, negation words are correlated with a 'contradiction' label when present in the hypothesis in NLI datasets (Gururangan et al., 2018; Poliak et al., 2018), and with a 'refutes' label when present in the claim in fact verification datasets (Schuster et al., 2019). A common approach for revealing the presence of such spurious correlations is to train a partial-input baseline (Feng et al., 2019). When such a model performs well despite having access to only part of the input, it indicates that this part contains spurious correlations.

Lexical overlap bias. Another common bias arises when certain labels are associated with lexical overlap between the two input sentences. McCoy et al. (2019) found that high lexical overlap between the premise and the hypothesis correlates with 'entailment' in NLI datasets. As a result, NLI models fail when evaluated on HANS, a challenge set where that correlation does not hold. Similarly, Zhang et al. (2019) found that models trained on a paraphrase identification dataset fail to predict 'non-duplicate' for questions that have high lexical overlap.

Unknown biases. Identifying the preceding biases assumes prior knowledge of the type of bias existing in the dataset. A few studies have instead used weak learners to identify unknown biases (Sanh et al., 2021; Utama et al., 2020b): when either a model's capacity or its training data is limited, it tends to exploit simple patterns.
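To make the lexical-overlap statistic concrete, a simple whitespace-tokenized version of the feature such biased models latch onto (an illustrative sketch, not the exact metric used by any of the cited papers):

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    # Fraction of hypothesis tokens that also appear in the premise.
    # HANS-style examples have overlap 1.0 yet are often non-entailed.
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)
```

For instance, the pair "the doctor saw the actor" / "the actor saw the doctor" has overlap 1.0, yet the hypothesis is not entailed, which is exactly the kind of example on which overlap-biased models fail.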

2.2. DEBIASING METHODS

Spurious correlation mitigation can be performed at different levels.

Data-based mitigation, where the data is augmented with samples that do not align with the bias found in the dataset (Wang & Culotta, 2021; Kaushik et al., 2020, inter alia).

Model/training-based mitigation, where either the model or the training procedure is modified. A common strategy in this approach is to train a bias model, which latches onto the bias in the dataset, and to use its outputs to train the final, debiased, main model. He et al. (2019) and Clark et al. (2019) used variants of product-of-experts (PoE) to combine the outputs of the bias and main models during training, encouraging the main model to "ignore" biased samples. Utama et al. (2020a) proposed confidence regularization, performing self-distillation with teacher outputs re-weighted using bias-weighted scaling, i.e., they induce the model to be less confident on biased samples. These methods can be viewed as data re-weighting methods, similar to Liu et al. (2021), who proposed to up-weight examples that are misclassified by an initial model trained in the standard way.
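A sketch of the product-of-experts combination used at training time, following the general recipe of adding the bias model's log-probabilities to the main model's logits (details vary across the cited papers; this is a simplified illustration):

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def poe_log_probs(main_logits, bias_log_probs):
    # PoE: p(y|x) is proportional to p_main(y|x) * p_bias(y|x). Training with
    # cross-entropy on these combined log-probs gives the main model little
    # gradient on examples the bias model already predicts confidently.
    return log_softmax(main_logits + bias_log_probs)
```

When the bias model is uninformative (uniform over labels), the combination reduces to the main model's own distribution, so the main model trains normally on unbiased examples.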



foot_0: We measure representation bias (intrinsic bias) by the ease of identifying the spurious correlations in the representations (more details in Sec. 6.1).
foot_1: In our case, an unbiased model is a model that does not rely on the spurious correlations in its decision making.
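The notion of representation bias in foot_0 can be illustrated with a minimal probing sketch, here using a nearest-class-centroid probe as a simplified stand-in for a trained probing classifier (all names and the probe choice are illustrative assumptions):

```python
import numpy as np

def bias_extractability(reprs, bias_labels):
    # Accuracy of a nearest-class-centroid probe that predicts a bias feature
    # (e.g., "hypothesis contains negation") from frozen representations.
    # Higher probe accuracy means the bias is more easily extractable.
    classes = np.unique(bias_labels)
    centroids = np.stack([reprs[bias_labels == c].mean(axis=0) for c in classes])
    dists = ((reprs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == bias_labels).mean())
```

Under this view, a debiasing method lowers intrinsic bias if it reduces the probe's accuracy on the main model's representations.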




