IMPROVING LEARNING WHEN COMBINING CROWDSOURCED LABELS BY WEIGHTING AREAS UNDER THE MARGIN

Abstract

In supervised learning (for instance in image classification), modern massive datasets are commonly labeled by a crowd of workers. The labels obtained in this crowdsourcing setting are then aggregated for training, and the aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard the ambiguity of each task. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful for the learning step. In a standard supervised learning setting, with one label per task and balanced classes, the Area Under the Margin (AUM) statistic is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by worker- and task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over feature-blind aggregation strategies, both in simulated settings and on the CIFAR-10H crowdsourced dataset.

1. INTRODUCTION

Crowdsourcing labels for supervised learning has become quite common in the last two decades, notably for image classification with datasets such as CIFAR-10 and ImageNet. Using a crowd of workers is fast, simple (see Figure 1) and less expensive than using experts. Furthermore, aggregating crowdsourced labels instead of working directly with a single one enables modeling the sources of possible ambiguities and taking them directly into account in the training pipeline (Aitchison, 2021). With deep neural networks nowadays common in many applications, both the architecture and the data quality have a direct impact on model performance (Müller et al., 2019; Northcutt et al., 2021b) and on calibration (Guo et al., 2017). Yet, depending on the crowd and the platform's control mechanisms, the obtained label quality might vary and harm generalization (Snow et al., 2008). Popular label aggregation schemes take into account the uncertainty related to workers' abilities, for example by estimating confusions between classes, or by using a latent variable representing each worker's trust (Dawid & Skene, 1979; Kim & Ghahramani, 2012; Sinha et al., 2018; Camilleri & Williams, 2019). This leads to scoring workers without taking into account the inherent difficulty of the task at stake. Inspired by Item Response Theory (IRT) (Birnbaum, 1968), Whitehill et al. (2009) combined both the task difficulty and the worker's ability in a feature-blind fashion for label aggregation: they only require the labels, not the associated features¹. In the classical supervised learning setting, the labels are said to be hard, i.e., a Dirac mass on one class. Multiple crowdsourced labels instead induce soft labels, i.e., probability distributions over the classes, for each task. Our motivation is to identify ambiguous tasks from their associated features, hence discarding harmful tasks (such as the one illustrated in Figure 2b).
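The two ingredients discussed above can be made concrete with a minimal sketch: aggregating several workers' hard labels into a soft label, and computing the AUM of a task from its per-epoch logits, following the definition of Pleiss et al. (2020) as the margin between the assigned label's logit and the largest other logit, averaged over training epochs. The function names are ours, for illustration only.

```python
import numpy as np

def soft_label(votes, n_classes):
    """Aggregate crowdsourced hard labels (worker votes) into a soft
    label, i.e., an empirical probability distribution over classes."""
    counts = np.bincount(votes, minlength=n_classes)
    return counts / counts.sum()

def aum(logits_per_epoch, label):
    """Area Under the Margin: average over training epochs of the margin
    between the assigned label's logit and the largest other logit."""
    margins = []
    for z in logits_per_epoch:
        other = np.max(np.delete(z, label))  # best competing logit
        margins.append(z[label] - other)
    return float(np.mean(margins))

# Three workers vote on a 3-class task: two say class 0, one says class 1.
print(soft_label(np.array([0, 0, 1]), 3))  # [0.667 0.333 0.   ]

# Logits recorded at two epochs; assigned label is class 0.
print(aum([np.array([2.0, 1.0, 0.0]),
           np.array([3.0, 1.0, 0.0])], 0))  # 1.5
```

A clean, unambiguous task keeps a large positive margin throughout training (high AUM), while a mislabeled or ambiguous one hovers near or below zero.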
Recent work on data cleaning in supervised learning (Han et al., 2019; Pleiss et al., 2020; Northcutt et al., 2021a) has shown that some images might be too corrupted or too ambiguous to be labeled by humans. Hence, one should not consider these tasks for label aggregation and learning, since they might be harmful for generalization. In this work, we combine task difficulty scores with worker ability scores, but we measure the task difficulty by incorporating feature information. We thus introduce the Weighted Area Under



¹ We use the term task interchangeably with the term feature in this work.

