AN EMPIRICAL STUDY OF METRICS TO MEASURE REPRESENTATIONAL HARMS IN PRE-TRAINED LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data, which contains latent societal biases and toxic content. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in PTLMs towards 13 marginalized demographics. Using this metric, we conduct an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into the correlation between our proposed metric and other related metrics for representational harm. We observe that our metric correlates with most of the gender-specific metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: the depth and the width of the networks. We find that prioritizing depth over width mitigates representational harms in some PTLMs. Our code and data can be found at [placement].

1. INTRODUCTION

Large-scale Pre-Trained Language Models (PTLMs) such as BERT (Devlin et al., 2019) and GPT models (Radford et al., 2019; Brown et al., 2020) have recently achieved great success in a variety of Natural Language Processing (NLP) tasks. These large-scale PTLMs capture knowledge from massive labeled and unlabeled human-written data, which can potentially contain harmful content and societal biases. The goal of a language model is to estimate the probability of a sequence of words in a given language. One can argue that representational harms are present when the data on which the model was trained differs, at a semantic level, from the desired behavior of the model. Several recent studies have highlighted the manifestation of societal biases in language models and proposed metrics and datasets to quantify them based on sentiment (Kurita et al., 2019), regard (Sheng et al., 2019), stereotypes (Zhao et al., 2019; Nadeem et al., 2021), style (Smith et al., 2022), or morality (Schramowski et al., 2022). In this work, we focus on the PTLMs' propensity to associate specific individuals or groups with negative perceptions. These negative perceptions are the result of microaggressions, stereotypes, or implicit hate speech in the pre-training corpora of large language models. Such harmful representations are usually overlooked by toxic language detectors (Breitfeller et al., 2019; Hartvigsen et al., 2022), yet they can resurface in language technologies and disadvantage an already disadvantaged group of people. Moreover, existing metrics usually fail to conceptualize these harms, which is a prerequisite for effective measurement. And even when the desired construct is clearly articulated, its measurement is often not well matched to its conceptualization (Blodgett et al., 2021). Our contributions are twofold.
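The language-modeling objective mentioned above can be made concrete: a model factorizes the probability of a word sequence via the chain rule, and perplexity summarizes how well the model predicts the sequence. A minimal sketch in plain Python, using hypothetical per-token log-probabilities in place of a real PTLM's outputs:

```python
import math

def sequence_log_prob(token_log_probs):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | w_<i)."""
    return sum(token_log_probs)

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sequence_log_prob(token_log_probs) / n)

# Hypothetical per-token log-probs a PTLM might assign to a 4-token sentence.
lp = [math.log(0.2), math.log(0.1), math.log(0.5), math.log(0.25)]
print(round(perplexity(lp), 3))
```

In practice the per-token log-probabilities would come from the PTLM itself (e.g., a causal model's next-token distribution, or a masked model's pseudo-likelihood); lower perplexity means the model finds the sentence more plausible.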
First, we provide a clear conceptualization of representational harms towards 13 marginalized demographics and propose a new metric for quantifying them in PTLMs. Our proposed metric can be applied to any dataset that contains harmful versus benign examples, and it addresses some of the shortcomings of existing metrics. Second, we conduct an empirical study of representational harms in 24 well-known PTLMs with respect to demographics, correlation with existing metrics, and network architecture.
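The general recipe of scoring harmful versus benign examples with a language model can be sketched as follows. This is an illustrative stand-in, not the paper's exact formulation: `score` is a hypothetical function returning a model's sentence score (e.g., log-likelihood), and the statistic shown is simply the fraction of pairs where the harmful variant is preferred:

```python
def harm_preference_rate(pairs, score):
    """Fraction of (harmful, benign) sentence pairs for which the model
    assigns a higher score (e.g., log-likelihood) to the harmful variant.
    A rate of 0.5 would indicate no systematic preference."""
    prefer_harmful = sum(1 for harmful, benign in pairs
                         if score(harmful) > score(benign))
    return prefer_harmful / len(pairs)

# Toy example with a hypothetical scoring function (shorter = higher score),
# standing in for a PTLM's sentence log-likelihood.
toy_pairs = [("harmful sent a", "benign sentence a"),
             ("harmful sent b", "benign b")]
toy_score = lambda s: -len(s)
print(harm_preference_rate(toy_pairs, toy_score))  # prints 0.5
```

A paired design like this lets the benign example control for topic and length, so the comparison isolates the model's preference for the harmful phrasing rather than its familiarity with the subject matter.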

