AN EMPIRICAL STUDY OF METRICS TO MEASURE REPRESENTATIONAL HARMS IN PRE-TRAINED LANGUAGE MODELS

Anonymous authors
Paper under double-blind review

Abstract

Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data which contains latent societal biases and toxic content. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in PTLMs towards 13 marginalized demographics. Using this metric, we conducted an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into the correlation between the proposed metric and other related metrics for representational harm. We observe that our metric correlates with most of the gender-specific metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: depth and width of the networks. We find that prioritizing depth over width mitigates representational harms in some PTLMs. Our code and data can be found at [placement].

1. INTRODUCTION

Large-scale Pre-Trained Language Models (PTLMs) such as BERT (Devlin et al., 2019) and GPT models (Radford et al., 2019; Brown et al., 2020) have recently achieved great success in a variety of Natural Language Processing (NLP) tasks. These large-scale PTLMs capture knowledge from massive labeled and unlabeled human-written data which can potentially contain harmful content and societal biases. The goal of a language model is to estimate the probability of a sequence of words in a given language. One can argue that when the data on which a model was trained differs from the desired behavior of the model at a semantic level, representational harms are present. Several recent studies have highlighted the manifestation of societal biases in language models and proposed metrics and datasets to quantify them based on sentiment (Kurita et al., 2019), regard (Sheng et al., 2019), stereotypes (Zhao et al., 2019; Nadeem et al., 2021), style (Smith et al., 2022), or morality (Schramowski et al., 2022). In this work, we focus on the propensity of PTLMs to associate specific individuals or groups with negative perceptions. These negative perceptions are the result of microaggressions, stereotypes, or implicit hate speech in the pre-training corpus of large language models. Such harmful representations are usually overlooked by toxic language detectors (Breitfeller et al., 2019; Hartvigsen et al., 2022), yet they can resurface in language technologies and disadvantage an already disadvantaged group of people. Moreover, existing metrics usually fail at conceptualizing these harms, which is a prerequisite for effective measurement; even when the desired construct is clearly articulated, its measurement is often not well matched to its conceptualization (Blodgett et al., 2021). Our contributions are twofold.
First, we provide a clear conceptualization of representational harms towards 13 marginalized demographics and propose a new metric for quantifying them in PTLMs. Our proposed metric can be applied to any dataset that contains harmful versus benign examples, and it addresses some of the shortcomings of existing metrics. Second, we conduct an empirical study of representational harms in 24 well-known PTLMs with respect to demographics, correlation with existing metrics, and network architecture.

2. RELATED WORK

Several metrics have been introduced to identify or measure representational harms in PTLMs or their downstream applications. We categorize these metrics into extrinsic and intrinsic approaches, where extrinsic metrics are associated with a downstream application and intrinsic metrics measure harms embedded in the contextual representations of words and sentences.

2.1. EXTRINSIC

Coreference Resolution Tasks

Coreference resolution is the task of linking expressions that refer to the same entity. The WinoBias (WB) (Zhao et al., 2018) and WinoGender (WG) (Rudinger et al., 2018) datasets contain author-crafted pronoun-resolution tests. Each test is a pair of sentences that differ only in the gender of a pronoun. These datasets measure the stereotypical bias of a system by testing whether the system links pronouns to occupations dominated by a specific gender [1]. WG tests the reference to only one gendered occupation, with the second entity being a (human) participant, e.g., "someone". Recently, Blodgett et al. (2021) exposed several issues in the reliability of both the WB and WG datasets.

Natural Language Understanding (NLU) Tasks

NLU is the task of understanding human language using syntactic and semantic properties of the text, such as language inference. The GLUE benchmark (Wang et al., 2018) is widely used for NLU tasks. Qian et al. (2022) trained an automatic Seq2Seq perturbation model to perturb GLUE test sets with respect to gender, race, and age. They then measured the percentage of classifier labels that change when models are tested on the original GLUE test sets versus the perturbed versions. This perturbation model is trained on the Perturbation Augmentation NLP DAtaset (PANDA) (Qian et al., 2022), a human-generated dataset of 100,000 demographically perturbed sentences, the majority concerning gender (70%), followed by race (14.7%) and age (14.6%). Moreover, Kiritchenko & Mohammad (2018) created the Equity Evaluation Corpus (EEC), which consists of templated sentences to examine gender and race biases in sentiment analysis systems.

Natural Language Generation (NLG) Tasks

NLG is the task of producing a human-readable language response based on some input. It is a core component of virtual assistants, chat bots, machine translation, and summarization.
Recently, representational harms manifested in these systems have received a lot of attention (Sheng et al., 2021). One approach to identifying issues in NLG systems is engineering prompts to provoke the societal biases embedded in them. The BOLD dataset (Dhamala et al., 2021) is a collection of English prompts automatically generated for profession, gender, race, religion, and political ideology demographics. BOLD prompts are sourced from Wikipedia, which contains more formal language and is not directly engineered to probe for stereotypes. In addition, BOLD uses names as demographic proxies for race and gender, while the correspondence between names and these groups has not been validated (Blodgett et al., 2021). According to Cao et al. (2022), the automatically generated prompts in BOLD can be noisy and contain toxic and stereotyped prompts. Similarly, the HolisticBias dataset (Smith et al., 2022) is a collection of author-crafted American-English prompts containing 600 descriptor terms across 13 different demographics. Existing works measure representational harms in the responses generated by NLG systems via automatic classifiers such as regard (Sheng et al., 2019), sentiment (Groenwold et al., 2020), style (Smith et al., 2020), and toxicity (Dhamala et al., 2021). These classifiers identify representational harms loosely as inequality in demographic label ratios and are prone to manifesting societal biases themselves. We refer the reader to Sheng et al. (2021) for a comprehensive survey of existing work on societal biases in NLG.

2.2. INTRINSIC

Intrinsic metrics generally measure the likelihood of harmful or stereotypical contexts versus benign contexts using log-probability. The CrowS-Pairs (CP) dataset (Nangia et al., 2020) contains contrastive pairs of minimally distant stereotypical and anti-stereotypical sentences. This dataset was created by asking crowd workers to perturb the target groups in each sentence such that the pair demonstrates a stereotype and an anti-stereotype concept. Similarly, the StereoSet (SS) dataset (Nadeem et al., 2021) includes inter-sentence and intra-sentence tests to capture stereotypical bias about gender, race, profession, and religion in PTLMs. The intra-sentence tests were obtained by asking crowd workers to minimally perturb a sentence by varying attributes corresponding to a target group, creating stereotypical, anti-stereotypical, and irrelevant contexts. The inter-sentence tests include context sentences about a target group followed by three sentences corresponding to a stereotype, an anti-stereotype, and an unrelated option. Blodgett et al. (2021) have raised concerns about the reliability of the SS and CP datasets due to several issues, including a lack of meaningful stereotypes [2]. Another intrinsic metric is Causal Mediation Analysis (CMA) (Vig et al., 2020), which examines the role of individual neurons and attention heads of PTLMs in mediating gender bias on three datasets, including WB and WG. The test includes a prompt associated with a profession and a pair of stereotypical and anti-stereotypical pronouns. This method frames neurons and attention heads as mediators along the causal path between model inputs and outputs and provides the effect of interventions on model inputs as a proxy for gender bias.
Moreover, several other metrics have been developed for measuring societal biases in contextualized word representations (Kurita et al., 2019; May et al., 2019; Guo & Caliskan, 2021), which are extensions of the Word Embedding Association Test (WEAT) (Caliskan et al., 2017). WEAT compares two sets of target words to two sets of attribute words (pleasant versus unpleasant) in the word embedding space. These metrics are designed to measure sentiment towards several demographics. A recent work by Cao et al. (2022) examined the correlation among some of the extrinsic and intrinsic metrics in the NLG task. They emphasized the importance of alignment in the target demographics, the notion of representational harms (sentiment/toxicity/stereotypes/regard/style), the downstream application, and the quality of the evaluation dataset when it comes to aligning intrinsic and extrinsic metrics. Therefore, we propose a new intrinsic metric that is aligned with the NLG task and quantifies the toxicity notion of representational harms in PTLMs.

3. MEASUREMENT MODELING

We follow the measurement modeling approach, which originated in the social sciences, to quantify representational harms in PTLMs based on the recommendation of Blodgett et al. (2021). Measurement modeling is composed of two stages. The first stage is conceptualization: clarifying what entity is being measured. The second stage is operationalization: explaining how this entity is measured.

3.1. CONCEPTUALIZATION

According to Blodgett et al. (2021), conceptualization of stereotyping is a prerequisite for effective measurement. In this section, we clarify our conceptualization of representational harms towards marginalized groups. First, we pick the target demographics, who are frequently the targets of oppression, discrimination, or prejudice, from a U.S. socio-cultural perspective [3]. The target demographics include African American (Black), women, Native American, Mexican, Latinx, people with disabilities, Asian, Chinese, Jewish, Muslim, LGBTQ, and Middle Eastern people. Next, we define representational harms as the systematic association of marginalized groups with negative perceptions and stereotypes in PTLMs. In the next section, we explain how we quantify this behavior in PTLMs.

3.2. OPERATIONALIZATION

We operationalize representational harms towards a marginalized demographic by measuring the language modeling likelihood of implicitly harmful statements versus benign statements. Previous work has leveraged power dynamics between two groups to quantify representational harms (Zhao et al., 2018; Rudinger et al., 2018; Zhao et al., 2019; Vig et al., 2020; Nadeem et al., 2021; Nangia et al., 2020). However, Seyranian et al. (2008) raise doubts about whether social psychology can ever settle on a stable definition of majority and minority groups; therefore, we do not use power dynamics to compare minority groups with a perceived majority group in this work. In the following sections, we explain the metric and the dataset we use for quantifying representational harms.

3.2.1. DATASET

We use a human-annotated subset of the ToxiGen dataset (Hartvigsen et al., 2022), which contains implicitly harmful and benign English sentences towards 13 marginalized demographics. These sentences were generated by GPT-3, and about 10,000 of them were annotated by crowd workers (3 annotators per sentence) drawn from balanced demographics. Annotators were asked to rate the toxicity of each sentence on a 1-5 scale, with 1 being clearly benign and 5 indicating very harmful text. The annotators were also asked whether the sentence is lewd, whether it uses human-like language, and whether it refers to a demographic. Based on their annotations, the harmful sentences in the ToxiGen dataset are not overtly offensive, and the percentage of lewd sentences is only 4%. The non-harmful sentences in the dataset do not necessarily contrast with or subvert the stereotypes; they are simply neutral or desirable statements about specific minorities. In order to reduce noise in the ToxiGen human-annotated set, we only selected the sentences on which all annotators agree about the target demographic group. After this post-processing step, our evaluation set was reduced to 6,541 sentences. Figure 1 depicts the distribution of implicitly harmful and benign sentences towards the 13 marginalized demographics in our evaluation dataset. Moreover, Hartvigsen et al. (2022) report that, on average, 90.5% of machine-generated examples in the evaluation dataset were thought to be human-written by most annotators, indicating that the sentences are mostly human-like statements. We note that the demographic groups in the evaluation dataset are situated in the U.S. context. However, the dataset is generated by GPT-3, which is trained on English from around the globe. Therefore, we believe this dataset can be used to evaluate English PTLMs.
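The unanimous-agreement filter described above can be sketched in a few lines. The record layout below is hypothetical, not the actual ToxiGen schema; it only illustrates the post-processing step:

```python
# Keep only the sentences for which all annotators agree on the
# target demographic group. The record layout is a hypothetical
# stand-in for the ToxiGen annotations, not the real schema.

def filter_unanimous(records):
    kept = []
    for rec in records:
        groups = rec["annotator_groups"]   # one group label per annotator
        if len(set(groups)) == 1:          # unanimous agreement
            kept.append(rec)
    return kept

data = [
    {"text": "example 1", "annotator_groups": ["women", "women", "women"]},
    {"text": "example 2", "annotator_groups": ["asian", "chinese", "asian"]},
]
print(len(filter_unanimous(data)))  # 1: only the first record survives
```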

3.2.2. METRIC

We leverage the language modeling objective, which is the pre-training task of large-scale PTLMs. A language model is a probability distribution over tokens, and perplexity measures how well this distribution predicts a sample. Given a tokenized sentence $W = [w_1, w_2, \dots, w_T]$, perplexity is defined as

$PPL(W) = \exp\left(-\frac{1}{|W|}\sum_{i=1}^{T} \log p_\theta(w_i \mid w_1, \dots, w_{i-1})\right)$

Note that perplexity is not well-defined in auto-encoder models, which are bi-directional. Therefore, we use the pseudo-perplexity proposed by Salazar et al. (2020) in place of perplexity for auto-encoder PTLMs. We first compute the perplexity of each statement in the dataset and divide it by its toxicity score; we refer to these values as scaled perplexities. Scaling by the toxicity score emphasizes the potential harmfulness of statements. Ideally, a fair PTLM should have very high scaled perplexity for harmful sentences and low scaled perplexity for benign sentences. Next, we use the Mann-Whitney U-test (Mann & Whitney, 1947) to quantify the propensity of PTLMs for generating either benign or implicitly harmful sentences. The Mann-Whitney U-test is a non-parametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X > Y equals the probability of Y > X. It does not assume any specific distribution of the samples (such as normality) for calculating the test statistic and p-values, and it can be applied to very small samples. Let $X_1, X_2, \dots, X_n$ be the perplexities of harmful statements and $Y_1, Y_2, \dots, Y_m$ be the perplexities of benign statements. The Mann-Whitney U statistic is defined as

$U = \sum_{i=1}^{n}\sum_{j=1}^{m} F\left(\frac{X_i}{t_i}, \frac{Y_j}{t_j}\right) \quad (1)$

where $t_i$ and $t_j$ refer to the toxicity scores of $X_i$ and $Y_j$, respectively.
$F(X, Y)$ is a pair-wise ranking function that compares every harmful statement with every benign statement and assigns a ranking score to the pair:

$F(X, Y) = \begin{cases} 1 & \text{if } X > Y \\ 1/2 & \text{if } X = Y \\ 0 & \text{if } X < Y \end{cases} \quad (2)$

Using Equation 1, we define the safety score S, which is simply the effect size of the U statistic:

$S = \frac{U}{nm} \quad (3)$

In a healthy PTLM, the safety score should equal 1, meaning that all harmful sentences have higher scaled perplexity than all benign sentences. Conversely, when S = 0, every benign sentence is less likely to be produced by the PTLM than every harmful sentence.
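Under these definitions, the safety score reduces to a direct double loop over the two sets, which is adequate for evaluation-sized data (a rank-based routine such as `scipy.stats.mannwhitneyu` would be faster on large sets). The sketch below assumes perplexities and toxicity scores are already computed; the toy values are illustrative, not from the paper:

```python
def safety_score(harmful, benign):
    """Effect size S = U / (n*m) of the Mann-Whitney U statistic over
    toxicity-scaled perplexities. `harmful` and `benign` are lists of
    (perplexity, toxicity_score) pairs. S = 1 means every harmful
    sentence has higher scaled perplexity than every benign one."""
    n, m = len(harmful), len(benign)
    u = 0.0
    for px, tx in harmful:
        for py, ty in benign:
            x, y = px / tx, py / ty            # scaled perplexities
            u += 1.0 if x > y else (0.5 if x == y else 0.0)
    return u / (n * m)

# A perfectly "safe" toy model: every harmful sentence is less likely
# (higher scaled perplexity) than every benign one.
harmful = [(100.0, 5), (90.0, 4)]   # scaled: 20.0, 22.5
benign = [(12.0, 1), (8.0, 2)]      # scaled: 12.0, 4.0
print(safety_score(harmful, benign))  # 1.0
```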

4. RESULTS AND DISCUSSION

4.1. EXPERIMENT SETUP

We calculated safety scores (Equation 3) for 13 marginalized demographics using 24 widely used PTLMs [4]. The safety scores are reported in Table 1. In the next section, we dive deeper into the validity of the safety score on the evaluation dataset.

4.2. LANGUAGE MODELING

For the safety score to be meaningful, the statements in the evaluation dataset must be reasonably likely to be generated by each PTLM. We use log-perplexity to evaluate the likelihood of both benign and harmful sentences: the higher the log-perplexity, the lower the chance of those statements being generated by that model. We measure the log-perplexity of each sentence in the evaluation dataset and report the mean and standard deviation of these values for the benign and harmful sets of each PTLM (Table 2). We observe that most models are in a reasonable range. For example, GPT-2-xl (Radford et al., 2019) has an average log-perplexity of 2.9 on WikiText (Merity et al., 2016), a well-known language modeling benchmark. This is comparable with the log-perplexity scores on our evaluation dataset, so we conclude that the PTLMs are likely to generate the statements in both categories. Note that auto-encoder models such as BERT usually have lower log-perplexity scores due to their bi-directional architecture.
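The log-perplexity of a sentence follows directly from the perplexity definition in Section 3.2.2. A minimal sketch over a hypothetical list of token probabilities (no actual PTLM involved):

```python
import math

def log_perplexity(token_probs):
    """Log-perplexity of a sentence from the conditional probabilities
    p(w_i | w_1, ..., w_{i-1}) of its tokens: -(1/T) * sum_i log p_i."""
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T

# Sanity check: a uniform model over a vocabulary of size V assigns
# p = 1/V to every token, so its log-perplexity is log(V).
print(round(log_perplexity([0.1] * 5), 4))  # 2.3026 (= log 10)
```

In practice the conditional probabilities would come from a causal PTLM's next-token distribution (or from masked-token scoring, per Salazar et al. (2020), for auto-encoders).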

4.3. REPRESENTATIONAL HARMS TOWARDS MARGINALIZED DEMOGRAPHICS

In this section, we analyze the representational harms towards marginalized demographics. Figure 2 shows box plots of the safety scores of the PTLMs grouped by demographic. The figure shows that PTLMs in general are less likely to embed harmful content about Asian, African American, Chinese, and Jewish people compared to other demographics. However, the safety scores for all these groups are below 0.5, which is far worse than an ideal system.

4.4. CORRELATION BETWEEN REPRESENTATIONAL HARMS METRICS

In this section, we compare our safety score with other metrics on the intersection of their marginalized groups and notions of bias. Since measuring gender stereotypes has been well studied (Sheng et al., 2019; Zhao et al., 2018; Rudinger et al., 2018; Vig et al., 2020; Nadeem et al., 2021), we picked the Women demographic for our comparison. The only metric that shares a similar notion of representational harms with our safety score is Regard (Sheng et al., 2019). Regard is a BERT classifier trained on human-annotated examples to measure regard towards a certain demographic based on gender (woman, man), sexual orientation (gay, straight), or race (black, white). We also use two intrinsic metrics for measuring stereotyping: CMA (Vig et al., 2020) and SS (Nadeem et al., 2021). CMA measures gender stereotyping with respect to occupation. We used the total effects reported in Vig et al. (2020) for some of the PTLMs and measured the SS scores and Regard scores [5] for auto-encoder and auto-regressive PTLMs, respectively. We calculated the Pearson Correlation Coefficient (PCC) between these metrics for both auto-encoder and auto-regressive models. Tables 3 and 4 show the correlations between these metrics. Our metric is negatively correlated with the CMA and SS metrics in auto-encoder models. These disparities could be due to the fact that SS and CMA study the notion of gender stereotyping, while our metric measures the toxicity notion of representational harms towards women. As shown in Table 4, our metric is positively correlated with the CMA and Regard metrics. The notion of representational harms in Regard is close to implicit hate. However, Regard is an automatic classifier, which is itself prone to manifesting representational harms. In addition to the Regard classifier, we utilized the HateBERT (ElSherief et al., 2021) and RoBERTa-ToxiGen (Hartvigsen et al., 2022) classifiers, which are trained to detect implicit hate in a sentence.
We report the correlations between several metrics in Table 4. We observe either negative or weak correlation between our metric and toxic language detection models. This indicates that existing toxic language detectors are not yet able to capture the implicit toxicity in our evaluation set. Moreover, in auto-regressive models, perplexity is well-defined, and our safety score is correlated with the CMA metric; this indicates that our safety score correlates with gender stereotyping metrics when the perplexities are accurate. Overall, the negative and weakly positive correlations between our metric and existing metrics indicate that these metrics most likely overlook implicit hate in PTLMs, suggesting that our metric is complementary to the existing suite of representational harms metrics.
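The pairwise correlations above can be reproduced with a textbook PCC computation; the sketch below implements it from the definition with hypothetical metric values (in practice `scipy.stats.pearsonr` would be used, which also returns a p-value):

```python
import math

def pcc(xs, ys):
    """Pearson correlation coefficient between two equal-length lists:
    cov(x, y) / (std(x) * std(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for four models under two metrics: a perfectly
# anti-correlated pair yields PCC = -1.
print(round(pcc([0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]), 6))  # -1.0
```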

4.5. SAFETY SCORES ON IMPLICIT HATE SPEECH DATASET

The safety score can be applied to any dataset with a balanced set of benign and toxic sentences targeting minority groups. To test this hypothesis, we selected a subset of the Implicit Hate dataset (ElSherief et al., 2021). The examples in this subset are either implicit hate or neutral, and we down-sampled the neutral examples to obtain equal numbers of harmful and benign examples. Implicit Hate does not provide the target demographic or a toxicity level for each sentence. Harmful examples in ToxiGen have toxicity scores of 4 or 5 and benign examples have scores of 1, 2, or 3; therefore, for comparability, we assign a toxicity score of 1 to benign examples and 2.25 to harmful examples, the linear mapping of the average toxicity scores in each category. The correlation between the safety scores measured on ToxiGen and Implicit Hate is 0.68, which demonstrates an almost linear relationship between these measurements.

4.6. EFFECT OF DEPTH AND WIDTH OF THE NETWORK ON SAFETY SCORE

In this section, we study the effect of network architecture and size on the safety score. Figure 3 shows the relation between model size (number of parameters) and the average safety score across demographics for different families of PTLMs. We observe that the average safety score decreases as the model size grows in the majority of PTLM families; Vig et al. (2020) made a similar observation using CMA for gender stereotyping. Moreover, uncased BERT models are safer than their cased variants, and RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020) have the highest safety scores. The pre-training corpus for RoBERTa contains stories and news, which could be the reason it is safer compared to other PTLMs. In addition, ALBERT has a very deep architecture in which all layers share parameters. To better understand the effect of network architecture, we selected families of PTLMs with three or more variants. For each family, we studied the correlation between their average safety scores and their number of layers, number of attention heads, and hidden dimension. Table 5 contains the PCC for GPT-2, ALBERT, and ELECTRA (Clark et al., 2020). In auto-encoder models, average safety scores have a stronger negative correlation with the width of the network than with its depth (number of layers), indicating that wider auto-encoder models are more prone to manifesting harmful representations. GPT-2 has roughly similar negative correlations with both depth and width, indicating that they affect the average safety score equally. One explanation for the difference could be the weight sharing between layers in ALBERT and between the generator and discriminator in ELECTRA.
In ALBERT, for example, this strategy reduces the effective depth complexity. Overall, we hypothesize that increasing the number of parameters in a PTLM increases its capacity to memorize the implicit toxicity in the pre-training corpus. In the next section, we further study the effect of network architecture on the safety score through knowledge distillation.

4.7. SAFETY SCORE IN DISTILLED MODELS

The large size of PTLMs presents challenges for fine-tuning and online serving in applications due to latency and capacity constraints. Therefore, several approaches have been proposed to compress these language models (teachers) into smaller models (students) that produce performance similar to the large models. Many of these approaches are fundamentally based on the concept of Knowledge Distillation (KD) proposed by Hinton et al. (2015). We study the effect of KD in both auto-encoder and auto-regressive models using BERT and GPT-2 as teachers. We leverage the 24 Distilled-BERT models provided by Turc et al. (2019). These student models were pre-trained with a language modeling objective and distilled from BERT-large-uncased (the teacher); we measured their average safety scores (Table 6). Similarly, we pre-trained 23 student models with a language modeling objective on the OpenWebText corpus (Gokaslan & Cohen, 2019) for 1 epoch. We then used KD to distill these students from GPT-2 (the teacher) using a cross-entropy loss over the soft target probabilities of GPT-2. We measure the perplexity of the student models on language modeling benchmarks including WikiText-2, WikiText-103 (Merity et al., 2016), Lambada (Paperno et al., 2016), and the Penn Treebank (Marcus et al., 1993) (Appendix A.5, Table 15). Table 7 contains the safety scores for the student and teacher (L=12, H=768) models. We observe that reducing the hidden dimension has a larger negative impact on the language modeling objective and a positive impact on the safety score. Distilled-GPT-2 models with reasonable language modeling performance have better safety scores than their teacher. However, in the Distilled-BERT models the safety score does not improve significantly compared to the teacher. We selected distilled models with reasonable downstream task performance (NLU, language modeling) and calculated the PCC between average safety scores and the depth and width of the networks (Table 8).
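The distillation objective used for the GPT-2 students, cross-entropy of the student's next-token distribution against the teacher's soft target probabilities, can be sketched over toy logits as follows. A real implementation would operate on batched tensors and typically adds a softmax temperature; this is only a minimal illustration of the loss:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits):
    """Soft-target cross-entropy H(p_teacher, p_student) over
    next-token distributions: -sum_v p_teacher(v) * log p_student(v)."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

# The loss is minimized when the student matches the teacher exactly
# (where it equals the teacher's entropy), so a copy of the teacher
# scores lower than a uniform student.
teacher = [2.0, 0.5, -1.0]
print(kd_loss(teacher, teacher) < kd_loss([0.0, 0.0, 0.0], teacher))  # True
```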
These PCCs are aligned with our previous observations on the effect of network depth and width on the safety score.

5. CONCLUSION

This work presented an empirical study of representational harms in PTLMs using a new metric based on the language modeling objective and implicit toxicity. Our experiments highlighted that PTLMs have higher tendencies to manifest representational harms towards some marginalized demographics than others, including groups that have not been well studied in the representational harm literature, such as Middle Eastern people, Hispanic people, and people with disabilities. The correlation study between related representational harm metrics confirms that our metric quantifies a different notion of representational harms, namely toxicity, compared to the existing metrics. We also observed that this notion of representational harms is overlooked by existing toxic language detection models. We conducted an ablation study to understand the effect of PTLM size and architecture on our safety score. Our findings are as follows. First, we should prioritize depth over width in auto-encoder models, both for better downstream NLU task performance and for reducing representational harms. Second, in auto-regressive models, there exists a trade-off between language modeling performance and representational harms: more depth does not hurt the safety score, but the wider the network, the more capable it is of manifesting implicit hate. Finally, our work is a complementary step to the existing effort of expanding the notion of representational harms metrics. Our work can be extended in multiple ways. First, the safety score can be used as an objective function to reduce implicit hate. Second, our evaluation dataset can be extended with more examples for intersections of marginalized demographics, such as Middle Eastern women.

ETHICS STATEMENT

In this work, we leverage a synthetic dataset that was generated using GPT-3 and verified by human annotators. We understand that annotators' biases can manifest in the annotations even though the crowd workers were selected from different demographics. Moreover, the dataset used in this work does not cover intersections of marginalized demographics, such as Black women, and is limited to English. Representational harms in language are context-dependent, ever-changing, and human-centric; our metric may therefore fail to capture the full complexity of these issues in language models. We should approach this problem from a multi-disciplinary point of view, leveraging fields such as the social sciences and keeping humans in the process of measuring and reducing representational harms. Finally, representational harms are task-dependent and need to be measured in relation to downstream tasks. In this work we proposed a safety score based on the language modeling task, which may not transfer to NLU tasks.

A.3 REGARD SCORES

We refer to the Regard score as the percentage of neutral and positive predictions by the Regard classifier. The distribution of Regard scores over all 24 PTLMs for each marginalized demographic is shown in Figure 4. Table 11 contains the Regard scores for all PTLMs and marginalized demographics, and Table 12 contains our safety scores based on Regard classifier predictions for all PTLMs and marginalized demographics.

A.4 PRE-TRAINED LANGUAGE MODEL PARAMETERS

The number of layers, attention heads, and hidden dimension for each PTLM, alongside their average safety scores, are provided in Table 13.

A.5 GPT-2 PRE-TRAINING AND DISTILLATION

We used the OpenWebText corpus to pre-train 23 miniature GPT-2 models using the GPT-2 pre-training hyper-parameters and vocabulary. All students share hyper-parameters and differ only in their architecture. The average training loss for language modeling after 1 epoch is 10. We then used KD to distill these models from GPT-2; each student was distilled for 1 epoch over OpenWebText. Finally, we fine-tuned these models on 4 language modeling benchmarks using only 500 examples to evaluate their few-shot performance. Table 14 presents the network sizes and perplexity scores on the benchmark test sets after fine-tuning; note that the last line is the original GPT-2 model (teacher). The few-shot performance averaged over all benchmarks is provided in Table 15.



Footnotes:
[1] Gender statistics of occupations were obtained from the U.S. Bureau of Labor.
[2] The authors of CP do not recommend using this dataset, as stated on their website (https://github.com/nyu-mll/crows-pairs/).
[3] https://www.hsph.harvard.edu/magazine/magazine_article/discrimination-in-america/
[4] We used the PTLMs in the Hugging Face library (https://huggingface.co).
[5] We refer to the percentage of positive and neutral predictions from the Regard classifier as the Regard score.



Figure 1: Distribution of implicitly harmful and benign sentences towards 13 demographics in our evaluation dataset.

Figure 2: Distribution of safety scores of 24 PTLMs for each demographic.


Figure 3: Average safety score for different families of models versus number of parameters in the model.

A.2 SAFETY SCORES ON IMPLICIT HATE SPEECH DATASET

We selected a subset of the Implicit Hate dataset. The examples in this subset are either implicit hate or neutral, and we down-sampled the neutral examples to obtain equal numbers of harmful and benign examples. Implicit Hate does not provide the target demographic or a toxicity level for each sentence. Harmful examples in ToxiGen have toxicity scores of 4 or 5 and benign examples have scores of 1, 2, or 3; therefore, for comparability, we assign a toxicity score of 1 to benign examples and 2.25 to harmful examples, the linear mapping of the average toxicity scores in each category. Table 10 contains the safety scores for the 24 PTLMs on the Implicit Hate dataset. The correlation between the safety scores measured on ToxiGen and Implicit Hate is 0.68, which demonstrates an almost linear relationship between these measurements.

Table 1: Safety scores.

Table 2: Log-perplexity (mean, standard deviation) averaged over variants of PTLMs.





Table 5: PCC between safety score and network architecture in PTLMs.

Table 6: Safety scores for Distilled-BERT models and the teacher model (BERT-large-uncased, L=24, H=1024). L refers to the number of layers and H to the hidden dimension. The number of attention heads is H/64.

Table 7: Safety scores for Distilled-GPT-2 models and the teacher model (GPT-2, L=12, H=768). L refers to the number of layers and H to the hidden dimension. The number of attention heads is H/64.

Table 8: PCC between safety score and network architecture in distilled PTLMs. Based on Table 6 and the results of Turc et al. (2019), we should prioritize depth over width in auto-encoder models, both for better downstream NLU task performance and for increased safety.

A APPENDIX

A.1 LANGUAGE MODELING

We measure the log-perplexity of each sentence in the evaluation dataset and report the mean and standard deviation of these values for both the benign and harmful sets in Table 9.

