EVALUATING GENDER BIAS IN NATURAL LANGUAGE INFERENCE

Anonymous authors
Paper under double-blind review

Abstract

Gender stereotypes have recently raised significant ethical concerns in natural language processing. However, progress in the detection and evaluation of gender bias in natural language understanding through inference is limited and requires further investigation. In this work, we propose an evaluation methodology to measure these biases by constructing a challenge task which involves pairing a gender-neutral premise against gender-specific hypotheses. We use our challenge task to probe state-of-the-art NLI models for the presence of gender stereotypes associated with occupations. Our findings suggest that three models (BERT, RoBERTa, BART) trained on the MNLI and SNLI datasets are significantly prone to gender-induced prediction errors. We also find that debiasing techniques, such as augmenting the training data to ensure a gender-balanced dataset, can help reduce such bias in certain cases.

1. INTRODUCTION

Machine learning algorithms trained on natural language processing tasks have exhibited various forms of systemic racial and gender bias. These biases have been found in many subtasks of NLP, ranging from learned word embeddings (Bolukbasi et al., 2016; Brunet et al., 2019) and natural language inference (He et al., 2019a) to hate speech detection (Park et al., 2018), dialog (Henderson et al., 2018; Dinan et al., 2019), and coreference resolution (Zhao et al., 2018b). This has prompted a large body of research attempting to evaluate and mitigate such biases, either by removing bias at the dataset level (Barbosa & Chen, 2019), through model architecture (Gonen & Goldberg, 2019), or both (Zhou & Bansal, 2020). In this work, we revisit the notion of detecting gender bias in Natural Language Inference (NLI) systems using targeted inspection. The NLI task requires a model to understand the inferential relation between a pair of sentences (premise and hypothesis) and to predict a three-way classification of their relationship: entailment, contradiction, or neutral. Since NLI requires representational understanding of the given sentences, it is critical that production-ready models for this task exhibit little to no perceivable stereotypical bias. Typically, NLI systems are trained on datasets collected through large-scale crowd-sourcing, a process with its own share of issues that result in lexical bias in the trained models (He et al., 2019b; Clark et al., 2019). Gender bias, loosely defined as the stereotypical association of professions with gendered pronouns, has also been found in many NLP tasks and datasets (Rudinger et al., 2017; 2018).
With the advent of large-scale pre-trained language models, we have witnessed a phenomenal rise of interest in adapting pre-trained models to downstream applications in NLP, leading to superior performance (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2019). These pre-trained models are typically trained over a massive corpus of text, increasing the probability that stereotypical bias is introduced into the representation space. It is thus crucial to study how these models reflect such bias after fine-tuning on a downstream task, and to try to mitigate it without significant loss of performance. The efficacy of pre-trained models on downstream tasks also raises a question in detecting and mitigating bias in NLP systems: is the data or the model at fault? Since we fine-tune these pre-trained models on the downstream corpus, we can no longer conclusively determine the source of the bias. Thus, it is imperative to revisit the question of detecting bias from the final sentence representations. To that end, we propose a challenge task methodology to detect stereotypical gender bias in the representations learned by pre-trained language models after fine-tuning on the natural language inference task. Specifically, we construct targeted sentences inspired by Yin et al. (2019), through which we measure gender bias in the representation space through the lens of natural language inference. We evaluate a range of publicly available NLI datasets (SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), ANLI (Nie et al., 2020) and QNLI (Rajpurkar et al., 2016)) and pair them with pre-trained language models (BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and BART (Lewis et al., 2019)) to evaluate their sensitivity to gender bias. Using our challenge task, we detect gender bias via the same task the language models are fine-tuned for (NLI). Our challenge task also highlights the direct effect and consequences of deploying these models by testing on the same downstream task, thereby achieving a thorough test of generalization.
We posit that a biased NLI model that has learnt gender-based correlations during training will produce different predictions for two hypotheses that differ only in their gender-specific connotations. Furthermore, we use our challenge task to define a simple debiasing technique through data augmentation. Data augmentation has been shown to be remarkably effective in achieving robust generalization performance in computer vision (DeVries & Taylor, 2017) as well as NLP (Andreas, 2020b). We investigate the extent to which we can mitigate gender bias in NLI models by augmenting the training set with our probe challenge examples. Concretely, our contributions in this paper are:

• We propose an evaluation methodology, built around a challenge task, to demonstrate that gender bias is exhibited in the outputs of state-of-the-art fine-tuned Transformer-based NLI models (Section 3).

• We test augmentation as an existing debiasing technique and study its efficacy on various state-of-the-art NLI models (Section 4). We find that this debiasing technique is effective in reducing stereotypical gender bias and has negligible impact on model performance.

• Our results suggest that the tested models reflect significant bias in their predictions. We also find that augmenting the training dataset to ensure a gender-balanced distribution is effective in reducing bias while maintaining accuracy on the original dataset.

2. PROBLEM STATEMENT

The MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets can be represented as D = <P, H, L>, with p ∈ P the premise, h ∈ H the hypothesis and l ∈ L the label (entailment, neutral, contradiction). These datasets are created by a crowdsourcing process in which crowd-workers are asked to come up with three sentences that entail, are neutral with, and contradict a given sentence (the premise) drawn from an existing corpus. Can social stereotypes such as gender prejudice be passed on as a result of this process?
To evaluate this, we design a challenge dataset D = <P, F, M>, with p ∈ P the premise and f ∈ F and m ∈ M two different hypotheses differing only in the gender they represent. We say a model exhibits gender bias if its learned representation of p results in a change in the predicted label when p is paired with f and with m separately. A model trained to associate words with genders is prone to incorrect predictions when tested on a distribution where such associations no longer exist. In the next two sections, we discuss our evaluation and analysis in detail and also investigate possibilities for mitigating such biases.
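The pair construction and bias check described above can be sketched as follows. This is an illustrative sketch, not the released implementation: the swap table, the `swap_gender` and `is_biased` helper names, and the stub predictor are our own assumptions standing in for a fine-tuned NLI model.

```python
# Hypothetical sketch of the challenge-pair construction: given a hypothesis f,
# produce its counterpart m differing only in gender terms, then flag a model
# as biased on the pair if its predicted label changes.

# Minimal map between gender-specific terms; a real probe set would be larger.
GENDER_SWAP = {
    "woman": "man", "man": "woman",
    "she": "he", "he": "she",
    "her": "his", "his": "her",
}

def swap_gender(hypothesis: str) -> str:
    """Produce the counterpart hypothesis differing only in gender terms."""
    return " ".join(GENDER_SWAP.get(t.lower(), t) for t in hypothesis.split())

def is_biased(predict, premise: str, hyp_f: str, hyp_m: str) -> bool:
    """Biased on this pair if the label changes when only gender changes."""
    return predict(premise, hyp_f) != predict(premise, hyp_m)

# Stub classifier standing in for a fine-tuned NLI model: it encodes a
# stereotyped association between "nurse" and "she" for demonstration.
def stub_predict(premise: str, hypothesis: str) -> str:
    if "nurse" in premise and "she" in hypothesis.split():
        return "entailment"
    return "neutral"

premise = "The nurse finished the shift."
hyp_f = "she finished the shift."
hyp_m = swap_gender(hyp_f)  # "he finished the shift."
print(is_biased(stub_predict, premise, hyp_f, hyp_m))  # → True
```

A surface-level token swap like this is only a sketch; real probes must also handle case, morphology, and gendered nouns beyond pronouns.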

3. MEASURING GENDER BIAS

We create our evaluation sets using sentences from publicly available NLI datasets. We then test three models trained on the MNLI and SNLI datasets. We show that a change in the gender represented by the hypothesis results in a difference in prediction, indicating bias.
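One way to aggregate the per-pair check into a single score, and to implement the gender-balanced augmentation used for debiasing, is sketched below. The function names (`flip_rate`, `augment_balanced`) and the toy data are ours, not the paper's; the sketch assumes each challenge example is a (premise, female hypothesis, male hypothesis) triple.

```python
# Sketch: aggregate bias as the fraction of challenge triples whose predicted
# NLI label flips when the hypothesis gender is swapped, plus the
# gender-balanced training augmentation. All names here are illustrative.

def flip_rate(predict, triples):
    """triples: list of (premise, female_hypothesis, male_hypothesis).
    Returns the fraction of triples whose label changes with gender."""
    flips = sum(predict(p, f) != predict(p, m) for p, f, m in triples)
    return flips / len(triples)

def augment_balanced(dataset, swap):
    """Return the training set plus a gender-swapped copy of each example,
    keeping the original label, so both genders appear equally often."""
    return dataset + [(p, swap(h), l) for (p, h, l) in dataset]

# Toy usage with a trivial swap function and a one-example training set.
swap = lambda h: h.replace("She", "He")
train = [("A doctor is working.", "She is busy.", "neutral")]
augmented = augment_balanced(train, swap)
# The augmented set now contains a swapped copy of each original example.
print(len(augmented))  # → 2
```

Because the swapped copies keep their original labels, this augmentation removes the statistical association between gender terms and labels without changing the label distribution of the training set.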




