AN EMPIRICAL STUDY OF METRICS TO MEASURE REPRESENTATIONAL HARMS IN PRE-TRAINED LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data, which contains latent societal biases and toxic content. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify implicit representational harms manifested in PTLMs towards 13 marginalized demographics. Using this metric, we conduct an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into how the proposed metric correlates with other related metrics for representational harm; we observe that it correlates with most of the gender-specific metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: the depth and width of the networks. We find that prioritizing depth over width mitigates representational harms in some PTLMs. Our code and data can be found at [placement].

1. INTRODUCTION

Large-scale Pre-Trained Language Models (PTLMs) such as BERT (Devlin et al., 2019) and GPT models (Radford et al., 2019; Brown et al., 2020) have recently achieved great success on a variety of Natural Language Processing (NLP) tasks. These large-scale PTLMs capture knowledge from massive labeled and unlabeled human-written data, which can contain harmful content and societal biases. The goal of a language model is to estimate the probability of a sequence of words in a given language. One can argue that representational harms are present when the data on which the model was trained differs, at a semantic level, from the desired behavior of the model. Several recent studies have highlighted the manifestation of societal biases in language models and proposed metrics and datasets to quantify them based on sentiment (Kurita et al., 2019), regard (Sheng et al., 2019), stereotypes (Zhao et al., 2019; Nadeem et al., 2021), style (Smith et al., 2022), or morality (Schramowski et al., 2022). In this work, we focus on PTLMs' propensity to associate specific individuals or groups with negative perceptions. These negative perceptions result from microaggressions, stereotypes, or implicit hate speech in the pre-training corpora of large language models. Such harmful representations are usually overlooked by toxic language detectors (Breitfeller et al., 2019; Hartvigsen et al., 2022), yet they can resurface in language technologies and disadvantage already disadvantaged groups of people. Moreover, existing metrics usually fail to conceptualize these harms, which is a prerequisite for effective measurement; even when the desired construct is clearly articulated, its measurement is often not well matched to its conceptualization (Blodgett et al., 2021). Our contributions are twofold.
First, we provide a clear conceptualization of representational harms towards 13 marginalized demographics and propose a new metric for quantifying them in PTLMs. Our proposed metric can be applied to any dataset that contains harmful versus benign examples, and it addresses several shortcomings of existing metrics. Second, we conduct an empirical study of representational harms in 24 well-known PTLMs with respect to demographic group, correlation with existing metrics, and network architecture.
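To make the language-modeling framing above concrete, the sketch below illustrates the general recipe of a likelihood-based harm metric: score paired harmful and benign examples with a language model and check which one the model prefers. This is only an illustrative sketch with toy per-token probabilities; the function names (`sentence_logprob`, `harm_score`) and the aggregation rule are our own simplifications, not the metric proposed in the paper.

```python
import math

def sentence_logprob(token_probs):
    """Log-probability of a sentence from per-token probabilities,
    as a language model assigns it under the chain rule."""
    return sum(math.log(p) for p in token_probs)

def harm_score(pairs):
    """Fraction of (harmful, benign) pairs for which the model assigns
    a higher log-probability to the harmful sentence; 0.5 = no preference."""
    prefers_harmful = sum(
        sentence_logprob(harmful) > sentence_logprob(benign)
        for harmful, benign in pairs
    )
    return prefers_harmful / len(pairs)

# Toy per-token probabilities for two (harmful, benign) sentence pairs.
pairs = [
    ([0.2, 0.5, 0.4], [0.1, 0.3, 0.2]),  # model prefers the harmful variant
    ([0.1, 0.2, 0.3], [0.3, 0.4, 0.5]),  # model prefers the benign variant
]
print(harm_score(pairs))  # 0.5
```

In practice the per-token probabilities would come from a PTLM's language-modeling head rather than being hand-specified.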

2. RELATED WORK

Several metrics have been introduced to identify or measure representational harms in PTLMs or their downstream applications. We categorize these metrics into extrinsic and intrinsic approaches: extrinsic metrics are tied to a downstream application, while intrinsic metrics operate on the contextual representations of words and sentences.

2.1. EXTRINSIC

Coreference Resolution Tasks

Coreference resolution is the task of linking expressions that refer to the same entity. The WinoBias (WB) (Zhao et al., 2018) and WinoGender (WG) (Rudinger et al., 2018) datasets contain author-crafted pronoun-resolution tests. Each test is a pair of sentences that differ only in the gender of the pronoun. These datasets measure stereotypical bias in a system by testing whether the system links pronouns to occupations dominated by a specific gender¹. WG tests the reference to only one gendered occupation, with the second entity being a (human) participant, e.g., "someone". Recently, Blodgett et al. (2021) exposed several issues in the reliability of both the WB and WG datasets.

Natural Language Understanding (NLU) Tasks

NLU is the task of understanding human language using syntactic and semantic properties of the text, such as language inference. The GLUE dataset (Wang et al., 2018) is a widely used benchmark for NLU tasks. Qian et al. (2022) trained an automatic Seq2Seq perturbation model to perturb the GLUE test sets with respect to gender, race, and age. They then measured the percentage of classifier labels that change when models are tested on the original GLUE test sets versus the perturbed versions. This perturbation model is trained on the Perturbation Augmentation NLP DAtaset (PANDA) (Qian et al., 2022), a human-generated dataset of 100,000 demographically perturbed sentences, the majority targeting gender (70%), followed by race (14.7%) and age (14.6%). Moreover, Kiritchenko & Mohammad (2018) created the Equity Evaluation Corpus (EEC), which consists of templated sentences for examining gender and race biases in sentiment analysis systems.

Natural Language Generation (NLG) Tasks

NLG is the task of producing a human-readable language response based on some input. It is a core component of virtual assistants, chatbots, machine translation, and summarization.
Recently, representational harms manifested in these systems have received considerable attention (Sheng et al., 2021). One approach to identifying issues in NLG systems is engineering prompts that provoke the societal biases embedded in them. The BOLD dataset (Dhamala et al., 2021) is a collection of English prompts automatically generated for profession, gender, race, religion, and political-ideology demographics. BOLD prompts are sourced from Wikipedia, which contains relatively formal language and is not directly engineered to probe for stereotypes. In addition, BOLD uses names as demographic proxies for race and gender, although the correspondence between names and these groups has not been validated (Blodgett et al., 2021). According to Cao et al. (2022), the automatically generated prompts in BOLD can be noisy and contain toxic and stereotyped prompts. Similarly, the HolisticBias dataset (Smith et al., 2022) is a collection of author-crafted American-English prompts containing 600 descriptor terms across 13 different demographics. Existing work measures representational harms in the responses generated by NLG systems via automatic classifiers such as regard (Sheng et al., 2019), sentiment (Groenwold et al., 2020), style (Smith et al., 2020), and toxicity (Dhamala et al., 2021). These classifiers identify representational harms loosely as inequality in label ratios across demographics and are prone to manifesting societal biases themselves. We refer the reader to Sheng et al. (2021) for a comprehensive survey of existing work on societal biases in NLG.
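The two extrinsic measurements surveyed above reduce to simple statistics over classifier outputs: a label-flip rate under demographic perturbation (the GLUE/PANDA evaluation) and a gap in target-label rates across demographic groups (the regard/sentiment/toxicity classifiers). A rough sketch, with function names and toy labels of our own (not code from any of the cited works):

```python
def flip_rate(original_preds, perturbed_preds):
    """Percentage of classifier labels that change when the test set is
    demographically perturbed, as in the GLUE/PANDA evaluation."""
    assert len(original_preds) == len(perturbed_preds)
    flips = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return 100.0 * flips / len(original_preds)

def label_ratio_gap(labels_by_group, target="negative"):
    """Gap between the highest and lowest rate of a target label
    (e.g. negative regard, or toxic) across demographic groups."""
    rates = [
        labels.count(target) / len(labels)
        for labels in labels_by_group.values()
    ]
    return max(rates) - min(rates)

# 1 of 4 labels flips under perturbation -> 25%.
print(flip_rate(["pos", "neg", "pos", "pos"],
                ["pos", "neg", "neg", "pos"]))  # 25.0

# Group A receives "negative" 50% of the time, group B 25% -> gap 0.25.
print(label_ratio_gap({
    "group_A": ["negative", "positive", "negative", "positive"],
    "group_B": ["positive", "positive", "negative", "positive"],
}))  # 0.25
```

As the paper notes, both statistics depend on the classifier itself, which may carry its own societal biases into the measurement.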

2.2. INTRINSIC

Intrinsic metrics generally measure the likelihood of harmful or stereotypical contexts versus benign contexts using log-probability. The CrowS-Pairs (CP) dataset (Nangia et al., 2020) contains contrastive pairs of minimally distant stereotypical and anti-stereotypical sentences. This dataset was created by asking crowd workers to perturb the target groups in each sentence such that the pair demonstrate a



¹ Gender statistics of occupations were obtained from the U.S. Bureau of Labor Statistics.

