HOW (UN)FAIR IS TEXT SUMMARIZATION?

Abstract

Creating a good summary requires carefully choosing details from the original text to accurately represent it in a limited space. If a summary contains biased information about a group, it risks passing this bias off to readers as fact. These risks increase if we consider not just one biased summary, but rather a biased summarization algorithm. Despite this, no work has measured whether automatic summarizers demonstrate biased performance; most work in summarization instead focuses on improving performance, ignoring questions of bias. In this paper, we demonstrate that automatic summarizers both amplify and introduce bias in how they treat information about under-represented groups. Additionally, we show that summarizers are highly sensitive to document structure, making the summaries they generate unstable under changes that are semantically meaningless to humans, which poses a further fairness risk. Given these results, and the large-scale potential for harm posed by biased summarization, we recommend that bias analyses be performed and reported for summarizers, to ensure that new automatic summarization methods do not introduce bias into the summaries they generate.

1. INTRODUCTION

In any piece of text, bias against a group may be expressed. This bias may be explicit or implicit, and it can be displayed in what information is included (e.g., including information that is exclusively negative about one group and exclusively positive about another), in where in the article that information comes from (e.g., only selecting sentences from the start of articles), or in how it is written (e.g., saying "a man thought to be involved in crime died last night after an officer-involved shooting" vs. "a police officer shot and killed an unarmed man in his home last night"). The presence of any bias in a longer text may be made worse by summarizing it. A summary can be seen as a presentation of the most salient points of a larger piece of text, where the definition of "salient information" varies according to the ideologies a person holds. Due to this subjectivity and the space constraints a summary imposes, there is a heightened potential for summaries to contain bias.

Readers, however, expect that summaries faithfully represent articles; therefore, bias in the text of a summary is likely to go unquestioned. If a summary presents information in a way that is biased against a group, readers are likely to believe that the article exhibited this same bias, as checking the truth of these assumptions requires considerable effort. This poses several risks. The first is an echo chamber effect, where the bias in generated summaries agrees with biases the reader already holds. The opposite is also a risk: an article may present roughly equal amounts of information about multiple groups while its summary includes more information about one group, leading readers to believe that the most important information in the article concerns only that group.

As writing summaries manually carries a large cost, automatic summarization is an appealing solution. However, where one biased summary is a problem, a biased summarization algorithm, capable of summarizing thousands of articles in the time it takes a human to write one, is a disaster. In recent years, automatic summarization has become increasingly available, for both personal and commercial use. Summarization algorithms have been suggested for use on news articles, medical notes, business documents, legal texts, personal documents [1], and conversation transcripts [2]. Despite the sensitivity of these applications, to the best of our knowledge, no work has measured bias towards groups in the summaries generated by common summarization algorithms.

In this paper, we present the first empirical analysis measuring the bias of automatic summarizers, covering seven different techniques (ranging from methods based on information retrieval to those using large language models, and including high-impact techniques provided by Microsoft and OpenAI). We design a method to quantitatively measure the extent to which article structure influences bias in the generated summaries, as well as the inclusion of information about different gender, racial/ethnic, and religious groups. We also study the causes of bias by varying influential factors, including the summarization parameters and the distribution of input documents. We show that summarizers:

1. Can suppress information about minority groups in the original text.
2. Prioritize information about certain groups over others, regardless of the amount of information about them in the original text.
3. Amplify patterns of bias already shown in human summaries in where information is selected from in the original text.
4. Demonstrate sensitivity to the structure of the articles they summarize, and are fragile to changes in those articles that are meaningless to humans.

These findings indicate that it is not safe to use automatic summarization at scale or in situations where the generated text could influence large numbers of readers; doing so risks misinforming readers and increasing the bias they hold. We conclude that assessments of bias in summarization algorithms should be performed, and this bias should be reduced, before models are released.

2. WHAT IS BIAS IN A SUMMARIZER?

Let us start by defining what a biased summary is. We consider a biased summary to be any summary that misrepresents a group in some way relative to the original text [3], whether in the amount or in the kind of information expressed. We further divide our definition of bias into two subcategories based on cause: content bias and structure bias. These definitions draw on existing taxonomies of representational harms in NLP (Hovy & Spruit, 2016; Barocas et al., 2017), but are tailored to summarization.

We define content bias as bias towards the mention of a group in a text. Under this type of bias, if group A is mentioned in the article, the summarizer will generate a biased summary; changing the structure of the text will have no effect on how biased the summary is. We further divide content bias into five subcategories: under-representation, inclusion/exclusion, inaccuracy, sentiment bias, and framing bias, each explained in Table 1.

On the other hand, we define structure bias as bias that results from the structure of the text being summarized. In contrast to content bias, structure bias is invariant to the presence of information about different groups, but will change when the document structure is modified. We define three structure bias subcategories: position bias, sentiment bias, and style bias, further explained in Table 1.

There are situations where the line between content and structure bias is not so clear cut. For example, if articles about group A are written in a way that a summarizer has a structure bias against, but articles about group B are written in a way that does not elicit this bias, the summarizer will show a consistent bias against group A. Though this is the result of a structure bias, it is also an example of content bias, because the article structure and the group being discussed are dependent. Regardless of the cause, biased summaries have the potential for real, negative effects on marginalized groups.
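To make the under-representation subcategory concrete, the sketch below compares each group's share of mentions in an article with its share in the corresponding summary; a negative gap means the summary suppresses information about that group. This is a minimal sketch under simplifying assumptions, not the measurement pipeline used in our experiments: the keyword lexicons and the token-matching heuristic are illustrative placeholders, and a real measurement would need coreference resolution and richer term lists.

```python
import re

# Hypothetical keyword lexicons, for illustration only; these are not the
# group term lists used in our experiments.
GROUP_TERMS = {
    "group_a": ["she", "her", "woman", "women"],
    "group_b": ["he", "him", "man", "men"],
}

def mention_share(text: str) -> dict[str, float]:
    """Return each group's share of all group mentions in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {group: sum(tokens.count(term) for term in terms)
              for group, terms in GROUP_TERMS.items()}
    total = sum(counts.values()) or 1  # avoid division by zero
    return {group: count / total for group, count in counts.items()}

def representation_gap(article: str, summary: str) -> dict[str, float]:
    """Summary share minus article share per group: a negative value means
    the summary under-represents that group relative to the article."""
    article_share = mention_share(article)
    summary_share = mention_share(summary)
    return {group: summary_share[group] - article_share[group]
            for group in GROUP_TERMS}
```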

3. OUR METHODS FOR MEASURING BIAS

Each of the types of bias defined in Table 1 represents a real problem in automatic summarization. However, we are not able to explore all of these biases in this paper, and some (e.g., framing bias) would be better explored using manual linguistic analysis rather than computational techniques. Instead, we aim to explore the content and structure bias of summarizers along three axes: underrepresentation,
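As one illustration of how position bias could be probed, the sketch below maps each summary sentence to the most lexically similar article sentence and records that sentence's relative position. This is a simplified stand-in, not the exact procedure used in our experiments: the sentence splitter is naive and `difflib` similarity is a placeholder for a proper sentence-alignment method. A mean position well below 0.5 across a corpus suggests lead bias, and comparing the position distributions before and after shuffling an article's sentences probes sensitivity to structure alone.

```python
import difflib

def split_sentences(text: str) -> list[str]:
    """Naive period-based sentence splitter, for illustration only."""
    normalized = text.replace("!", ".").replace("?", ".")
    return [s.strip() for s in normalized.split(".") if s.strip()]

def source_positions(article: str, summary: str) -> list[float]:
    """For each summary sentence, the relative position (0 = start of the
    article, 1 = end) of the most similar article sentence."""
    article_sents = split_sentences(article)
    positions = []
    for sent in split_sentences(summary):
        similarities = [difflib.SequenceMatcher(None, sent, cand).ratio()
                        for cand in article_sents]
        best = similarities.index(max(similarities))
        positions.append(best / max(len(article_sents) - 1, 1))
    return positions
```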



[1] https://ai.googleblog.com/2022/03/auto-generated-summaries-in-google-docs.html?m=1
[2] https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/summarization/overview
[3] We recognize that it could be argued that a perfectly unbiased summarizer would not misrepresent groups at all, even if the original text does; however, this is a strong constraint to place on a summarization algorithm that has no concept of social bias.

