THE ETHICAL AMBIGUITY OF AI DATA ENRICHMENT: MEASURING GAPS IN RESEARCH ETHICS NORMS AND PRACTICES

Anonymous authors
Paper under double-blind review

Abstract

The technical progression of artificial intelligence (AI) research has been built on breakthroughs in fields such as computer science, statistics, and mathematics. However, in the past decade AI researchers have increasingly looked to the social sciences, turning to human interactions to solve the challenges of model development. Paying crowdsourcing workers to generate or curate data, or 'data enrichment', has become indispensable for many areas of AI research, from natural language processing to inverse reinforcement learning. Other fields that routinely interact with crowdsourcing workers, such as Psychology, have developed common governance requirements and norms to ensure research is undertaken ethically. This study explores how, and to what extent, comparable research ethics requirements and norms have developed for AI research and data enrichment. We focus on the approaches taken by two leading conferences, ICLR and NeurIPS, and by the journal publisher Springer. In a longitudinal study of accepted papers, and via a comparison with Psychology and CHI papers, this work finds that leading AI venues have begun to establish protocols for human data collection, but that these are inconsistently followed by authors. Whilst Psychology papers engaging with crowdsourcing workers frequently disclose ethics reviews, payment data, demographic data, and other information, similar disclosures are far less common in leading AI venues despite similar guidance. The work concludes with hypotheses to explain these gaps in research ethics practices and a consideration of their implications.

1. INTRODUCTION

When the creators of the seminal image recognition benchmark, ImageNet, pronounced that the use of Amazon's Mechanical Turk (MTurk) was a "godsend" for their research, they foreshadowed the monumental impact crowdsourcing platforms were set to have on AI research (Li, 2019). In the decade that has followed, crowdsourced workers, or 'crowdworkers', have been central contributors to machine learning research, enabling low-cost human data collection at scale. Ethics questions posed by research involving human participants are traditionally overseen by governance groups, such as Institutional Review Boards (IRBs) in the United States (US). Whilst medical fields and the social sciences have a long history of IRB engagement, the relatively recent rise of crowdsourcing tasks in AI research means that research ethics guidelines and norms for such work have only emerged in the past few years. These guidelines and publication policies have proliferated alongside critiques of crowdsourced AI work focused on issues such as payment and worker maltreatment. In response, this study seeks to understand how AI research involving crowdworkers engages with research ethics. It does so by assessing the expectations that publication venues place on researchers, and by analysing how these expectations translate into practice. To make this determination, the policies and practices of two major AI conferences, ICLR and NeurIPS, along with AI research submitted to Springer journals, are reviewed. These are compared with other benchmarks to understand whether AI research at these venues follows the norms of more established disciplines. The results show that AI research at these venues involving crowdworkers lacks robust research ethics norms, with venue policies not consistently translated into practice. Whilst ICLR, NeurIPS and Springer provide research ethics guidance, its interpretation appears inconsistent and fails to meet the standards of disclosure seen in other fields engaging with crowdworkers.

2. RELATED WORK

Oversight of research involving human subjects is no recent phenomenon, with the Nuremberg Code of 1947 formalising the idea that humans involved in research require protection (Sass, 1983; Shuster, 1997). Formal research ethics oversight in the United States arose during the 1960s, prompted by various scandals in biomedical research and followed by scandals in social science studies (Beecher, 1966; Stark, 2016; Emanuel, 2008; Heller; Milgram, 1963; Zimbardo, 1972). These cases led to regulation standardising Institutional Review Board (IRB) oversight of research involving human subjects, a requirement that remains in force today, with similar processes in place in over 80 countries globally (Grady, 2015). IRBs only oversee research involving living 'human subjects', as defined by the Code of Federal Regulations (Office for Human Research Protections, 2017).

2.1. CROWDSOURCING AND AI

In the twenty-first century the scope of research ethics has been extended by the rise of internet research (Taylor, 2000). The ability to recruit, engage with, and study human subjects online has led crowdsourcing platforms, such as MTurk, to become a key tool across a variety of academic disciplines (Howe, 2006). Launched in 2005, MTurk was an early pioneer of the crowdsourcing model (Cobanoglu et al., 2021), and it has remained popular due to its low cost, ease of access, and large user base (Williamson, 2016). The platform has been of particular use to the AI field, with Amazon marketing it as "artificial artificial intelligence" (Schwartz, 2019; Stephens, 2022). Whilst early AI crowdwork often involved labelling tasks, such as Fei-Fei Li's seminal ImageNet work, the use of crowdworkers has since diversified (Deng et al., 2009; Vaughan, 2018).

Shmueli et al. offer three categories of data collection seen in NLP research papers: (1) labelling, (2) evaluation, and (3) production (Shmueli et al., 2021). For the purposes of this work these categories can be generalised across AI research. Labelling involves a crowdworker processing existing data and then selecting or composing one or more labels for that data. Labelling can be objective, for example asking crowdworkers to label objects in images (e.g. dogs or cats), or subjective, with one study asking MTurk workers to label images with their predicted political leanings (Thomas & Kovashka, 2019). Evaluation involves an assessment of outputs or data according to predefined criteria, such as fluency. This could involve asking humans to provide feedback on model-generated language or to produce a 'mean opinion score' by assessing the outputs of various models (Clark et al., 2021; Défossez et al., 2018; Stiennon et al., 2020). Production studies ask workers to produce their own data, rather than label or evaluate existing data. For example, studies might explicitly ask crowdworkers to write questions for a question-answer dataset (Talmor et al., 2018). These categories are broadly encapsulated by the Partnership on AI's (PAI) definition of 'data enrichment' work: data curation tasks which require human judgement and intelligence (Partnership on AI, 2021).

Data enrichment does not, however, include research studying the behaviour of crowdworkers themselves (Vaughan, 2018). For example, a researcher might assess how individuals respond to interaction with algorithms deployed in an educational setting, or assess human perception of artificial systems (Fahid et al., 2021; Latikka et al., 2021; Koster et al., 2022). Behavioural studies differ from data enrichment tasks in that they treat crowdworkers as the subject of research, rather than as workers providing input to a model which is itself the subject of research.

2.2. ETHICS IMPLICATIONS OF CROWDWORK

In parallel to the rise of crowdsourcing in AI research, critics have questioned the ethics of these practices in the absence of employment law protections for workers (Aloisi, 2016). Concerns centre on issues of payment, maltreatment, power asymmetry, and demographics. Crowdsourcing platforms are often chosen for their low costs, and consequently many critiques of crowdwork relate to payment (Scholz, 2016). MTurk allows requesters to place tasks online

