DEEP LEARNING FROM CROWDSOURCED LABELS: COUPLED CROSS-ENTROPY MINIMIZATION, IDENTIFIABILITY, AND REGULARIZATION

Abstract

Using noisy crowdsourced labels from multiple annotators, a deep learning-based end-to-end (E2E) system aims to learn the label correction mechanism and the neural classifier simultaneously. To this end, many E2E systems concatenate the neural classifier with multiple annotator-specific "label confusion" layers and co-train the two parts in a parameter-coupled manner. The formulated coupled cross-entropy minimization (CCEM)-type criteria are intuitive and work well in practice. Nonetheless, theoretical understanding of the CCEM criterion has been limited. The contribution of this work is twofold: First, performance guarantees of the CCEM criterion are presented. Our analysis reveals for the first time that the CCEM can indeed correctly identify the annotators' confusion characteristics and the desired "ground-truth" neural classifier under realistic conditions, e.g., when only incomplete annotator labeling and finite samples are available. Second, based on the insights learned from our analysis, two regularized variants of the CCEM are proposed. The regularization terms provably enhance the identifiability of the target model parameters in various more challenging cases. A series of synthetic and real data experiments are presented to showcase the effectiveness of our approach.

1. INTRODUCTION

The success of deep learning has escalated the demand for labeled data to an unprecedented level. Some learning tasks can easily consume millions of labeled data items (Najafabadi et al., 2015; Goodfellow et al., 2016). However, acquiring data labels is a nontrivial task: it often requires a pool of annotators with sufficient domain expertise to manually label the data items. For example, the popular Microsoft COCO dataset contains 2.5 million images, and around 20,000 work hours aggregated over multiple annotators were spent on its category labeling (Lin et al., 2014). Crowdsourcing is considered an important working paradigm for data labeling. In crowdsourcing platforms, e.g., Amazon Mechanical Turk (Buhrmester et al., 2011), Crowdflower (Wazny, 2017), and ClickWork (Vakharia & Lease, 2013), data items are dispatched to and labeled by many annotators; the annotations are then integrated to produce reliable labels. A notable challenge is that annotator-output labels can be considerably noisy. Training machine learning models with noisy labels can seriously degrade system performance (Arpit et al., 2017; Zhang et al., 2016a). In addition, the labels provided by individual annotators are often largely incomplete, as a dataset is typically divided and dispatched to different annotators. Early crowdsourcing methods often treat annotation integration and downstream operations, e.g., classification, as separate tasks; see (Dawid & Skene, 1979; Karger et al., 2011a; Whitehill et al., 2009; Snow et al., 2008; Welinder et al., 2010; Liu et al., 2012; Zhang et al., 2016b; Ibrahim et al., 2019; Ibrahim & Fu, 2021). This pipeline first estimates the annotators' confusion parameters (e.g., the confusion matrices under the Dawid & Skene (DS) model (Dawid & Skene, 1979)). Then, the corrected and integrated labels, along with the data, are used to train the downstream tasks' classifiers.
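For intuition, the integration stage of this two-stage pipeline admits a simple Bayes rule under the DS model: given (estimated) confusion matrices and a class prior, the posterior over the true label multiplies each annotator's confusion-matrix column for the label they reported. The sketch below illustrates only this integration step (confusion-matrix estimation itself is omitted); the function name and toy numbers are illustrative, not from the paper.

```python
import numpy as np

def ds_aggregate(annotations, confusions, prior):
    """Integrate one item's annotator labels under the DS model.

    annotations : dict {annotator m: observed label k}; annotators who
                  skipped the item are simply absent (incomplete labeling).
    confusions  : dict {m: (K, K) array}, confusions[m][c, k] ~
                  P(annotator m reports k | true class c).
    prior       : (K,) array of class priors.
    Returns the posterior distribution over the true class.
    """
    log_post = np.log(prior)
    for m, k in annotations.items():
        # multiply in annotator m's likelihood column (in log domain)
        log_post += np.log(confusions[m][:, k])
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

Annotators with near-uniform confusion rows contribute almost nothing to the posterior, so reliable annotators automatically dominate the integrated label.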
However, simultaneously learning the annotators' confusions and the classifier in an end-to-end (E2E) manner has shown substantially improved performance in practice (Raykar et al., 2010; Khetan et al., 2018; Tanno et al., 2019; Rodrigues & Pereira, 2018; Chu et al., 2021; Guan et al., 2018; Cao et al., 2019; Li et al., 2020; Wei et al., 2022; Chen et al., 2020). To achieve E2E learning under crowdsourced labels, a class of methods concatenates a "confusion layer" for each annotator to the output of a neural classifier and jointly learns the two parts in a parameter-coupled manner. This gives rise to a coupled cross-entropy minimization (CCEM) criterion (Rodrigues & Pereira, 2018; Tanno et al., 2019; Chen et al., 2020; Chu et al., 2021; Wei et al., 2022). In essence, the CCEM criterion models the observed labels as the annotators' confused outputs of the "ground-truth" predictor (GTP)—i.e., the classifier as if it had been trained on the noiseless labels with perfect generalization—which is natural and intuitive. A notable advantage of the CCEM criteria is that they often lead to computationally convenient optimization problems, as the confusion characteristics are modeled simply as additional structured layers appended to the neural network. Despite its simplicity, the CCEM criterion and its variants have shown promising performance. For example, the crowdlayer method is a typical CCEM approach, which has served as a widely used benchmark for E2E crowdsourcing since its proposal (Rodrigues & Pereira, 2018). In (Tanno et al., 2019), a similar CCEM-type criterion is employed, with a trace regularization added. More CCEM-like learning criteria appear in (Chen et al., 2020; Chu et al., 2021; Wei et al., 2022), with additional constraints and considerations. Challenges. Despite its empirical success, theoretical understanding of the CCEM criterion has been limited.
Particularly, it is often unclear whether CCEM can correctly identify the annotators' confusion characteristics (often modeled as "confusion matrices") and the GTP under reasonable settings (Rodrigues & Pereira, 2018; Chu et al., 2021; Chen et al., 2020; Wei et al., 2022), yet identifiability is the key to guaranteed performance. The only existing identifiability result on CCEM was derived under restrictive conditions, e.g., the availability of infinite samples and the assumption that there exist annotators with diagonally dominant confusion matrices (Tanno et al., 2019). To the best of our knowledge, model identifiability of CCEM has not been established under realistic conditions, e.g., in the presence of incomplete annotator labeling and non-experts, with finite samples. We should note that a couple of non-CCEM approaches have proposed identifiability-guaranteed solutions for E2E crowdsourcing. The work in (Khetan et al., 2018) showed identifiability under an expectation-maximization (EM)-based learning approach, but the result is only applicable to binary classification. An information-theoretic loss-based learning approach proposed in (Cao et al., 2019) presents some identifiability results, but they require a group of independent expert annotators and infinite samples; these conditions are hard to meet or verify. In addition, these methods are often computationally more complex than CCEM-based methods.

Contributions. Our contributions are as follows:

• Identifiability Characterizations of CCEM-based E2E Crowdsourcing. In this work, we show that the CCEM criterion can provably identify the annotators' confusion matrices and the GTP up to inconsequential ambiguities under mild conditions. Specifically, we show that, if the number of annotator-labeled items is sufficiently large and some other reasonable assumptions hold, the two parts can be recovered with bounded errors.
Our result is the first finite-sample identifiability result for CCEM-based crowdsourcing. Moreover, our analysis reveals that the success of CCEM does not rely on conditional independence among the annotators. This is both favorable (as annotator independence is a stringent requirement) and surprising, since conditional independence is often used to derive the CCEM criterion; see, e.g., (Tanno et al., 2019; Rodrigues & Pereira, 2018; Chu et al., 2021).

• Regularization Design for CCEM With Provably Enhanced Identifiability. Based on the key insights revealed by our analysis, we propose two types of regularization that provide enhanced identifiability guarantees in challenging scenarios. Specifically, the first regularization term ensures that the confusion matrices and the GTP can be identified without any expert annotators, provided that a sufficiently large amount of data is available. The second regularization term ensures identifiability when class specialists are present among the annotators. These identifiability-enhanced approaches demonstrate promising label integration performance in our experiments.

Notation. The notations are summarized in the supplementary material.
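To make the CCEM setup discussed above concrete, the following is a minimal numerical sketch: a softmax classifier output is passed through per-annotator row-stochastic confusion layers, and cross-entropy is accumulated only over the observed (possibly incomplete) annotations, with an optional trace regularizer in the spirit of Tanno et al. (2019). All names and the parameterization are illustrative, not this paper's exact formulation or its two proposed regularizers.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax (stable via max subtraction)."""
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def ccem_objective(logits, confusion_weights, annotations, lam=0.0):
    """Sketch of a coupled cross-entropy minimization (CCEM) objective.

    logits            : (N, K) classifier outputs f(x_n) before softmax
                        (the would-be GTP after training).
    confusion_weights : (M, K, K) unconstrained weights; a row-wise softmax
                        turns each into a row-stochastic confusion matrix
                        A_m, A_m[c, k] ~ P(annotator m says k | class c).
    annotations       : iterable of (n, m, k) triples listing only the
                        observed labels, so incomplete annotator labeling
                        is handled naturally.
    lam               : weight of an optional trace regularizer in the
                        spirit of Tanno et al. (2019).
    """
    probs = softmax(logits)                 # (N, K) classifier posteriors
    confusions = softmax(confusion_weights) # row-stochastic A_m's
    loss, count = 0.0, 0
    for n, m, k in annotations:
        # annotator m's predicted label distribution entry: [A_m^T f(x_n)]_k
        p_k = confusions[m][:, k] @ probs[n]
        loss -= np.log(p_k + 1e-12)
        count += 1
    loss /= max(count, 1)
    # smaller traces push the A_m's away from identity, discouraging the
    # classifier from absorbing the annotators' confusion itself
    reg = np.mean([np.trace(A) for A in confusions])
    return loss + lam * reg
```

With near-identity confusion layers the objective reduces to an ordinary cross-entropy on the classifier output; jointly optimizing `logits` (via the network weights) and `confusion_weights` over all annotators is what couples the two parts.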

