PERFORMANCE DISPARITIES BETWEEN ACCENTS IN AUTOMATIC SPEECH RECOGNITION

Abstract

Automatic speech recognition (ASR) services are ubiquitous. Past research has identified discriminatory ASR performance as a function of racial group and nationality. In this paper, we expand the discussion by performing an audit of some of the most popular English-language ASR services using a large and global data set of speech from The Speech Accent Archive. We show that, even when controlling for multiple linguistic covariates, ASR service performance has a statistically significant relationship to the political alignment of the speaker's birth country with respect to the United States' geopolitical power. We discuss this bias in the context of the historical use of language to maintain global and political power.

1. INTRODUCTION

Automatic speech recognition (ASR) services are a key component of the vision for the future of human-computer interaction. However, many users are familiar with the frustrating experience of repeatedly not being understood by their voice assistant (Harwell, 2018), so much so that frustration with ASR has become a culturally shared source of comedy (Connell & Florence, 2015; Mitchell, 2018). Bias audits of ASR services have quantified these experiences. English-language ASR has higher error rates: for Black Americans compared to white Americans (Koenecke et al., 2020; Tatman & Kasten, 2017); for Scottish speakers compared to speakers from California and New Zealand (Tatman, 2017); and for speakers who self-identify as having Indian accents compared to speakers who self-identify as having American accents (Meyer et al., 2020).

It should go without saying, but everyone has an accent: there is no "unaccented" version of English (Lippi-Green, 2012). Due to colonization and globalization, different Englishes are spoken around the world. While some English accents may be favored by those with class, race, and national origin privilege, there is no technical barrier to building an ASR system that works well on any particular accent. We are thus left with the question: why does ASR performance vary as it does as a function of the global English accent spoken?

This paper addresses this question quantitatively using a large public data set, The Speech Accent Archive (Weinberger, 2015), which is larger in number of speakers (2,713), number of first languages (212), and number of birth countries (171) than the data sets previously used to audit ASR services, and thus allows us to answer richer questions about ASR biases. Further, by observing historical patterns in how language has shifted power, our paper provides a means for readers to understand how ASR may be operating today.
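The audits cited above compare ASR services by their error rates; the standard metric in this literature is word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the service's transcript into the reference transcript, divided by the reference length. A minimal sketch of how such a metric is computed (the metric itself is standard; the function name and example strings here are illustrative, not drawn from the audits cited):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fox"))   # 0.0
print(wer("please call my phone", "please fall my phone")) # 0.25
```

A disparity finding of the kind reported above amounts to the mean of this quantity differing significantly between speaker groups on the same recording prompts.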
Historically, accent and language have been used as a tool of colonialism and a justification of oppression. Colonial powers, originally Britain and then its former colonies, used English as a tool to "civilize" their colonized subjects (Kachru, 1986), and used those subjects' accents to justify their lower status. English as a lingua franca today provides power to those for whom English is a first language. People around the world are compelled to promote English language learning in education systems in order to avail themselves of the privilege it can provide in the globalized economy (Price, 2014). This spread of the English language may be "reproducing older forms of imperial political, economic, and cultural dominance", but it also exacerbates inequality along neoliberal political economic lines (Price, 2014). In short, the dominance of the English language around the world shifts power in ways that exacerbate inequality. Further, English is and has historically been used as a nationalist tool in the United States to justify white conservative fears that immigrants pose an economic and political threat to them and has been

