PERFORMANCE DISPARITIES BETWEEN ACCENTS IN AUTOMATIC SPEECH RECOGNITION

Abstract

Automatic speech recognition (ASR) services are ubiquitous. Past research has identified discriminatory ASR performance as a function of racial group and nationality. In this paper, we expand the discussion by performing an audit of some of the most popular English language ASR services using a large and global data set of speech from The Speech Accent Archive. We show that, even when controlling for multiple linguistic covariates, ASR service performance has a statistically significant relationship to the political alignment of the speaker's birth country with respect to the United States' geopolitical power. We discuss this bias in the context of the historical use of language to maintain global and political power.

1. INTRODUCTION

Automatic speech recognition (ASR) services are a key component of the vision for the future of human-computer interaction. However, many users are familiar with the frustrating experience of repeatedly not being understood by their voice assistant (Harwell, 2018), so much so that frustration with ASR has become a culturally shared source of comedy (Connell & Florence, 2015; Mitchell, 2018). Bias auditing of ASR services has quantified these experiences. English language ASR has higher error rates: for Black Americans compared to white Americans (Koenecke et al., 2020; Tatman & Kasten, 2017); for Scottish speakers compared to speakers from California and New Zealand (Tatman, 2017); and for speakers who self-identify as having Indian accents compared to speakers who self-identify as having American accents (Meyer et al., 2020).

It should go without saying, but everyone has an accent: there is no "unaccented" version of English (Lippi-Green, 2012). Due to colonization and globalization, different Englishes are spoken around the world. While some English accents may be favored by those with class, race, and national origin privilege, there is no technical barrier to building an ASR system which works well on any particular accent. So we are left with the question: why does ASR performance vary as it does as a function of the global English accent spoken?

This paper attempts to address this question quantitatively using a large public data set, The Speech Accent Archive (Weinberger, 2015), which is larger in number of speakers (2,713), number of first languages (212), and number of birth countries (171) than other data sets previously used to audit ASR services, and thus allows us to answer richer questions about ASR biases. Further, by observing historical patterns in how language has shifted power, our paper provides a means for readers to understand how ASR may be operating today.
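Audits of this kind typically report word error rate (WER): the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR output, normalized by the length of the reference. A minimal sketch of the metric (the function name and example strings are our own illustration; the phrase is the opening of the Speech Accent Archive's elicitation paragraph):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please call stella", "please call stella"))  # 0.0
print(wer("please call stella", "police call stella"))  # one substitution in three words
```

A perfect transcription scores 0.0; a WER above 1.0 is possible when the hypothesis inserts many spurious words.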
Historically, accent and language have been used as tools of colonialism and justifications of oppression. Colonial powers, originally Britain and then its former colonies, used English as a tool to "civilize" their colonized subjects (Kachru, 1986), and used their subjects' accents to justify their lower status. English as a lingua franca today provides power to those for whom English is a first language. People around the world are compelled to promote English language learning in education systems in order to avail themselves of the privilege it can provide in the globalized economy (Price, 2014). This spread of the English language may be "reproducing older forms of imperial political, economic, and cultural dominance", but it also exacerbates inequality along neoliberal political economic lines (Price, 2014). In short, the dominance of the English language around the world shifts power in ways that exacerbate inequality. Within the United States, English has also been used to enforce the cultural assimilation of immigrants (Lippi-Green, 2012). We note that, even within the United States, "Standard American English" is a theoretical concept divorced from the reality of wide variations in spoken English across geographical areas, race and ethnicity, age, class, and gender (Lippi-Green, 2012). As stated in a resolution of the Conference on College Composition and Communication in 1972, "The claim that any one dialect is unacceptable amounts to an attempt of one social group to exert its dominance over another" (Lippi-Green, 2012). The social construct of language has real and significant consequences (Nee et al., 2022), for example, allowing people with accents to be passed over for hiring in the United States, despite the Civil Rights Act prohibiting discrimination based on national origin (Matsuda, 1991). Accent-based discrimination can take many forms: people with accents deemed foreign are rated as less intelligent, loyal, and influential (Lawrence, 2021).
Systems based on ASR automatically enforce the requirement that one code switch or assimilate in order to be understood, rejecting the "communicative burden" in which two people will "find a communicative middle ground and foster mutual intelligibility when they are motivated, socially and psychologically, to do so" (Lippi-Green, 2012). By design, then, ASR services operate like people who reject their communicative burden, which Lippi-Green reports is often due to their "negative social evaluation of the accent in question" (Lippi-Green, 2012). As Halcyon Lawrence reports from experience as a speaker of Caribbean English, "to create conditions where accent choice is not negotiable by the speaker is hostile; to impose an accent upon another is violent" (Lawrence, 2021). Furthermore, we are concerned about discriminatory performance of ASR services because of its potential to create a class of people who are unable to use voice assistants, smart devices, and automatic transcription services. If technologists decide that the only user interface for a smart device will be via voice, a person who cannot be accurately recognized will be unable to use the device at all. As such, ASR technologies have the potential to create a new disability, similar to how print technologies created the print disability, "which unites disparate individuals who cannot read printed materials" (Whittaker et al., 2019). The biased performance of ASR, if combined with an assumption that ASR works for everyone, creates a dangerous situation in which those with particular English language accents may find themselves unable to obtain ASR service. The consequences for someone lacking the ability to obtain reliable ASR may range from inconvenient to dangerous. Serious medical errors may result from incorrect transcription of physicians' notes (Zhou et al., 2018), which are increasingly transcribed by ASR.
There is, currently, an alarmingly high rate of transcription errors that could result in significant patient consequences, according to physicians who use ASR (Goss et al., 2019). Other ASR users could face increased danger: for example, with smart wearables that users rely on to call for help in an emergency (Mrozek et al., 2021); if one must repeat oneself multiple times when using a voice-controlled navigation system while driving a vehicle (and is thus distracted while driving); or if ASR is one's only means of controlling one's robotic wheelchair (Venkatesan et al., 2021).

Given that English language speakers have a multitude of dialects across the world, it is important to consider the ability of English language ASR services to accurately transcribe the speech of their global users. Given past research results (Tatman, 2017; Meyer et al., 2020) and the United States headquarters of Amazon, Google, and Microsoft, we hypothesize that ASR services will transcribe with less error for people who were born in the United States and whose first language is English. We hypothesize that performance of ASR systems is related to the age of onset, the age at which a person first started speaking English, which is known to be highly correlated with perceived accent (Flege et al., 1995; Moyer, 2007; Dollmann et al., 2019). But beyond this, based on the nationalist and neoliberal ways in which language is used to reinforce power, we hypothesize that ASR performance can be explained in part by the power relationship between the United States and speakers' birth countries. That is, for the same age of onset of English and other related covariates among speakers not born in the United States, we expect that speakers born in countries which are political allies of the United States will have ASR performance that is significantly better than those born in nations which are not aligned politically with the United States.
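These hypotheses amount to regressing transcription error on speaker covariates and testing the sign and significance of the coefficients. The following sketch illustrates the idea on synthetic data; the variable names (age_of_onset, us_ally) and effect sizes are illustrative assumptions of ours, not the study's actual covariates or findings:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic covariates: age at which the speaker began speaking English,
# and a binary indicator for political alignment of the birth country
# with the United States (both hypothetical for this illustration).
age_of_onset = rng.uniform(1.0, 30.0, n)
us_ally = rng.integers(0, 2, n).astype(float)
# Synthetic word error rate consistent with the hypotheses above:
# later onset raises error; alignment with the US lowers it.
wer = 0.10 + 0.005 * age_of_onset - 0.04 * us_ally + rng.normal(0.0, 0.02, n)

# Ordinary least squares with an intercept: wer ~ 1 + onset + ally
X = np.column_stack([np.ones(n), age_of_onset, us_ally])
beta, *_ = np.linalg.lstsq(X, wer, rcond=None)
print(beta)  # [intercept, onset slope (positive), alliance effect (negative)]
```

In an actual audit one would also control for additional linguistic covariates and report standard errors; the sketch only shows how the hypothesized effects would appear as regression coefficients.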
This paper tests and validates these hypotheses by utilizing a data set with significantly more speakers, across a large number of first languages and birth countries, than those which have previously been used for the evaluation of English ASR services.

