IMPROVE LEARNING COMBINING CROWDSOURCED LABELS BY WEIGHTING AREAS UNDER THE MARGIN

Abstract

In supervised learning, for instance in image classification, modern massive datasets are commonly labeled by a crowd of workers. The labels obtained in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard each task's ambiguity. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful for the learning step. In a standard supervised learning setting, with one label per task and balanced classes, the Area Under the Margin (AUM) statistic is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by worker- and task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements with respect to feature-blind aggregation strategies, both for simulated settings and for the CIFAR-10H crowdsourced dataset.

1. INTRODUCTION

Crowdsourcing labels for supervised learning has become quite common in the last two decades, notably for image classification with datasets such as CIFAR-10 and ImageNet. Using a crowd of workers is fast, simple (see Figure 1) and less expensive than using experts. Furthermore, aggregating crowdsourced labels instead of working directly with a single one enables modeling the sources of possible ambiguities and taking them directly into account in the training pipeline (Aitchison, 2021). With deep neural networks nowadays common in many applications, both the architecture and the data quality have a direct impact on model performance (Müller et al., 2019; Northcutt et al., 2021b) and on calibration (Guo et al., 2017). Yet, depending on the crowd and the platform's control mechanisms, the obtained label quality might vary and harm generalization (Snow et al., 2008). Popular label aggregation schemes take into account the uncertainty related to workers' abilities: for example, by estimating confusions between classes, or by using a latent variable representing each worker's trust (Dawid & Skene, 1979; Kim & Ghahramani, 2012; Sinha et al., 2018; Camilleri & Williams, 2019). This leads to scoring workers without taking into account the inherent difficulty of the task at stake. Inspired by Item Response Theory (IRT) (Birnbaum, 1968), Whitehill et al. (2009) combined both the task difficulty and the worker's ability in a feature-blind fashion for label aggregation: they only require the labels, not the associated features¹. In the classical supervised learning setting, the labels are said to be hard, i.e., a Dirac mass on one class. Multiple crowdsourced labels instead induce soft labels, i.e., probability distributions over the classes, for each task. Our motivation is to identify ambiguous tasks from their associated features, hence discarding hurtful tasks (such as the one illustrated in Figure 2b).
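The hard/soft distinction above can be made concrete with a small sketch. The vote values and the choice of 4 classes below are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical toy example: one task, 4 classes, 5 crowd workers.
votes = [2, 2, 3, 2, 1]          # answers collected for this single task
n_classes = 4

# Hard label: a Dirac mass on a single class (here the majority answer).
hard = np.eye(n_classes)[2]      # [0., 0., 1., 0.]

# Soft label: the empirical distribution of the workers' answers.
soft = np.bincount(votes, minlength=n_classes) / len(votes)  # [0., 0.2, 0.6, 0.2]
```

The soft label retains the disagreement among workers (classes 1 and 3 received votes), whereas the hard label discards it entirely.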
Recent work on data cleaning in supervised learning (Han et al., 2019; Pleiss et al., 2020; Northcutt et al., 2021a) has shown that some images might be too corrupted or too ambiguous to be labeled by humans. Hence, one should not consider these tasks for label aggregation and learning, since they might harm generalization. In this work, we combine task difficulty scores with worker ability scores, but we measure the task difficulty by incorporating feature information. We thus introduce the Weighted Area Under the Margin (WAUM).
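The underlying AUM statistic (Pleiss et al., 2020) admits a short implementation sketch: for each training sample, the margin between the assigned label's logit and the largest other logit is averaged over training epochs, and low scores flag likely mislabeled samples. The array shapes and the convention of recording logits per epoch below are our own assumptions, not a prescription from the paper:

```python
import numpy as np

def aum(logits_per_epoch, assigned_labels):
    """Area Under the Margin, sketched from Pleiss et al. (2020).

    logits_per_epoch: (n_epochs, n_samples, n_classes) logits recorded
    during training; assigned_labels: (n_samples,) training labels.
    Returns one AUM score per sample: the margin between the assigned
    label's logit and the largest *other* logit, averaged over epochs.
    """
    n_epochs, n_samples, n_classes = logits_per_epoch.shape
    idx = np.arange(n_samples)
    # Logit of the assigned label at each epoch: shape (n_epochs, n_samples).
    assigned = logits_per_epoch[:, idx, assigned_labels]
    # Mask out the assigned label to find the largest competing logit.
    masked = logits_per_epoch.copy()
    masked[:, idx, assigned_labels] = -np.inf
    margins = assigned - masked.max(axis=2)
    return margins.mean(axis=0)
```

A sample whose assigned label is consistently dominated by another class accumulates a negative AUM, which is the signal the WAUM reweights with worker- and task-dependent scores.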

2. RELATED WORK

Inferring a learning consensus from a crowd is a challenging task. In Table 1, we summarize the features used by standard strategies to address it. In this work, we do not consider methods with prior knowledge on workers, since most platforms do not provide this information². Likewise, we do not rely on ground-truth knowledge for any task. Hence, trapping-set or control-item based algorithms like ELICE or CLUBS (Khattak, 2017) do not match our framework. Some algorithms rely on self-reported confidence: they directly ask workers how confident they are in their answers and integrate this in the model (Albert et al., 2012; Oyama et al., 2013; Hoang et al., 2021). We discard such cases for several reasons. First, self-reported confidence might not be beneficial without a reject option (Li & Varshney, 2017). Second, workers have a tendency to be under- or overconfident, raising questions on how to present self-evaluation and how to infer their scores (Draws et al., 2021). The most common aggregation step is majority voting (MV), where one selects the label most often answered by workers. MV does not infer any trust score on workers and thus does not leverage workers' abilities. MV is also very sensitive to under-performing workers (Gao & Zhou, 2013; Zhou et al., 2015), to biased workers (Kamar et al., 2015), to spammers (Raykar & Yu, 2011), and to a lack of experts for hard tasks (James, 1998; Gao & Zhou, 2013; Germain et al., 2015). Closely related to MV, naive soft labeling goes beyond hard labels (also referred to as one-hot labels) by computing the frequency of answers per label. In practice, training a neural network with soft labels improves calibration (Guo et al., 2017) with respect to using hard labels. However, both methods are sensitive to spammers (e.g., workers answering all tasks randomly) and to worker biases (e.g., workers answering some tasks randomly).
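MV and naive soft labeling can be sketched in a few lines. The answer-matrix layout and the `-1` convention for missing annotations below are our own choices, not from the paper:

```python
import numpy as np

def soft_labels(answers, n_classes):
    """Naive soft labeling: per-task frequency of each answered class.

    answers: (n_tasks, n_workers) int array of class indices,
    with -1 where a worker did not annotate the task.
    """
    counts = np.zeros((answers.shape[0], n_classes))
    for k in range(n_classes):
        counts[:, k] = (answers == k).sum(axis=1)
    return counts / counts.sum(axis=1, keepdims=True)

def majority_vote(answers, n_classes):
    """Hard label: most frequent answer (ties broken by lowest class index)."""
    return soft_labels(answers, n_classes).argmax(axis=1)
```

A single spammer shifts every soft label it touches, which illustrates the sensitivity discussed above: neither function weighs workers by reliability.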
Hence, the noise induced by workers' labeling might not be representative of the actual task difficulty (Jamison & Gurevych, 2015). Another class of methods leverages latent variables, defining a probabilistic model of workers' responses. The most popular one, proposed by Dawid & Skene (1979) (DS), estimates a single confusion matrix per worker as a measure of that worker's expertise. The vanilla DS model assumes that a worker answers according to a multinomial distribution, yielding a joint estimation procedure of the error rates and the soft labels through an Expectation-Maximization (EM) algorithm (see Algorithm 2 in Appendix A). Variants of the original DS algorithm include accelerated versions (Sinha et al., 2018), sparse versions (Servajean et al., 2017) and clustered versions (Imamura et al., 2018).
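The vanilla DS model described above can be sketched as a compact EM loop. This is an illustrative implementation under our own conventions (dense answer matrix, `-1` for missing annotations, majority-vote initialization), not the paper's Algorithm 2:

```python
import numpy as np

def dawid_skene(answers, n_classes, n_iter=50):
    """EM sketch of the vanilla Dawid & Skene (1979) model.

    answers: (n_tasks, n_workers) int array, -1 where a worker
    did not annotate a task. Returns soft labels T (n_tasks, n_classes),
    confusion matrices pi (n_workers, n_classes, n_classes) and priors rho.
    """
    n_tasks, n_workers = answers.shape
    # Initialize soft labels with normalized vote counts (majority voting).
    T = np.zeros((n_tasks, n_classes))
    for i in range(n_tasks):
        for j in range(n_workers):
            if answers[i, j] >= 0:
                T[i, answers[i, j]] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per worker.
        rho = T.mean(axis=0)
        pi = np.full((n_workers, n_classes, n_classes), 1.0 / n_classes)
        for j in range(n_workers):
            counts = np.zeros((n_classes, n_classes))
            for i in np.flatnonzero(answers[:, j] >= 0):
                counts[:, answers[i, j]] += T[i]
            denom = counts.sum(axis=1, keepdims=True)
            pi[j] = np.where(denom > 0,
                             counts / np.maximum(denom, 1e-12),
                             1.0 / n_classes)
        # E-step: posterior over true labels, in log space for stability.
        logT = np.tile(np.log(rho + 1e-12), (n_tasks, 1))
        for i in range(n_tasks):
            for j in range(n_workers):
                if answers[i, j] >= 0:
                    logT[i] += np.log(pi[j, :, answers[i, j]] + 1e-12)
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T, pi, rho
```

The joint estimation is visible in the loop: the soft labels T and the per-worker confusion matrices pi are refined against each other, so a worker who systematically disagrees with the consensus is down-weighted, yet the model remains feature-blind.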



¹ We use the term task interchangeably with the term feature in this work.
² For instance, by default Amazon Mechanical Turk (https://www.mturk.com/) does not provide it.



Figure 1: Crowdsourcing labels scheme, from label collection using a crowd to training a neural network on aggregated training labels. High ambiguity, from either crowd workers or tasks' intrinsic difficulty, can lead to mislabeled data and harm generalization performance. To illustrate our notation, here the set of tasks annotated by worker w3 is T(w3) = {1, 3}, while the set of workers annotating task x3 is A(x3) = {1, 3, 4}.

