LEARNING FROM OTHERS: SIMILARITY-BASED REGULARIZATION FOR MITIGATING ARTIFACTS

Abstract

Common methods for mitigating spurious correlations in natural language understanding (NLU) usually operate in the output space, encouraging a main model to behave differently from a bias model by down-weighting examples on which the bias model is confident. While such methods improve out-of-distribution (OOD) performance, it was recently observed that the internal representations of the presumably debiased models are actually more, rather than less, biased. We propose SimReg, a new method for debiasing internal model components via similarity-based regularization in representation space: we encourage the model to learn representations that are either similar to those of an unbiased model or dissimilar from those of a biased model. We experiment with three NLU tasks and different kinds of biases, and find that SimReg improves OOD performance with little in-distribution degradation. Moreover, the representations learned by SimReg are less biased than those produced by other methods.

1. INTRODUCTION

Recent studies (McCoy et al., 2019; Geirhos et al., 2020, inter alia) show that neural models often exploit spurious correlations (a.k.a. dataset biases, or artifacts) in datasets and learn shortcut solutions rather than the intended function. For example, in MNLI, a popular natural language inference dataset, negation words such as "not" and "don't" are highly correlated with the contradiction label (Gururangan et al., 2018). Models trained on MNLI therefore confidently predict contradiction whenever the input contains a negation word, without considering the meaning of the whole sentence. As a result of relying on such shortcuts, these models, commonly called 'biased' models, fail to generalize and perform poorly on out-of-distribution (OOD) data in which such associative patterns are absent (McCoy et al., 2019). This behavior also limits their practical applicability in cases where the real-world data distribution differs from the training distribution.

Recent efforts to mitigate the learning of spurious correlations (a.k.a. debiasing methods) down-weight training samples that contain such correlations, effectively performing data reweighting (Schuster et al., 2019; Utama et al., 2020a; Sanh et al., 2021; Cadene et al., 2019). Typically, a bias-only model is trained, and its confidence is used to reweight training samples. One might expect such extrinsic debiasing to "suppress the model from capturing non-robust features" (Du et al., 2022). However, Mendelson & Belinkov (2021) showed a counter-intuitive trend: the more extrinsically debiased a model is, the more biased its internal representations are; i.e., higher accuracy of such models on OOD challenge sets is correlated with an increase in intrinsic bias.¹ Such superficial debiasing is problematic, as the bias may reappear when the model is used in another setting (Orgad et al., 2022).

Inspired by this finding, we propose to perform intrinsic debiasing on the internal model components. We develop SimReg, a new debiasing method based on similarity regularization, which encourages the internal representations to be either (i) similar to the representations of an unbiased model²; or (ii) dissimilar from the representations of a biased model. We further apply the (dis)similarity regularization to either the model's representations or its gradients.

¹ We measure representation bias (intrinsic bias) by how easily the spurious correlations can be identified in the representations (more details in Sec. 6.1).
² In our case, an unbiased model is a model that does not rely on the spurious correlations in its decision making.
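The confidence-based reweighting used by prior extrinsic methods can be sketched as follows. This is a minimal illustration, not the exact formulation of any cited paper: it assumes the common scheme in which each example's loss is scaled by one minus the bias-only model's probability on the gold label, so examples the bias model already solves confidently contribute little to training. All variable names here are hypothetical.

```python
import numpy as np

def reweighted_nll(main_probs, bias_probs, labels):
    """Scale each example's loss by how *unsure* the bias-only model
    is about the gold label (a common extrinsic debiasing scheme)."""
    idx = np.arange(len(labels))
    # w_i = 1 - p_bias(y_i | x_i): confidently "biased" examples are
    # down-weighted; hard anti-bias examples keep (near) full weight.
    weights = 1.0 - bias_probs[idx, labels]
    nll = -np.log(main_probs[idx, labels])
    return np.mean(weights * nll)

# Toy batch: 3 examples, 2 classes.
main_probs = np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])
bias_probs = np.array([[0.99, 0.01], [0.5, 0.5], [0.2, 0.8]])
labels = np.array([0, 1, 0])
loss = reweighted_nll(main_probs, bias_probs, labels)
```

Here the first example, which the bias model predicts with 0.99 confidence, is almost entirely removed from the loss, while the third (where the bias model is wrong) is nearly fully weighted.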


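The similarity-regularization idea above can be sketched with a cosine-similarity penalty added to the task loss. This is only a schematic of the two regimes (pulling toward an unbiased model's representations, or pushing away from a biased model's); the paper's actual objective and hyperparameters are defined in later sections, and the function names, the choice of cosine similarity, and the repel-term form are assumptions for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    # Mean cosine similarity between paired rows of two
    # representation matrices of shape (batch, dim).
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.mean(np.sum(a_n * b_n, axis=1))

def simreg_loss(task_loss, h_main, h_ref, mode="attract", lam=0.1):
    """Add a similarity penalty to the task loss.
    mode="attract": pull h_main toward an unbiased model's h_ref.
    mode="repel":   push h_main away from a biased model's h_ref."""
    sim = cosine_sim(h_main, h_ref)
    if mode == "attract":
        return task_loss + lam * (1.0 - sim)  # reward high similarity
    return task_loss + lam * np.abs(sim)      # penalize any similarity
```

For example, when the main model's representations already match the unbiased reference exactly, the attract penalty vanishes and the combined loss reduces to the task loss alone.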