DEEP LEARNING FROM CROWDSOURCED LABELS: COUPLED CROSS-ENTROPY MINIMIZATION, IDENTIFIABILITY, AND REGULARIZATION

Abstract

Using noisy crowdsourced labels from multiple annotators, a deep learning-based end-to-end (E2E) system aims to learn the label correction mechanism and the neural classifier simultaneously. To this end, many E2E systems concatenate the neural classifier with multiple annotator-specific "label confusion" layers and co-train the two parts in a parameter-coupled manner. The formulated coupled cross-entropy minimization (CCEM)-type criteria are intuitive and work well in practice. Nonetheless, theoretical understanding of the CCEM criterion has been limited. The contribution of this work is twofold: First, performance guarantees of the CCEM criterion are presented. Our analysis reveals for the first time that the CCEM can indeed correctly identify the annotators' confusion characteristics and the desired "ground-truth" neural classifier under realistic conditions, e.g., when only incomplete annotator labeling and finite samples are available. Second, based on the insights learned from our analysis, two regularized variants of the CCEM are proposed. The regularization terms provably enhance the identifiability of the target model parameters in various more challenging cases. A series of synthetic and real data experiments are presented to showcase the effectiveness of our approach.

1. INTRODUCTION

The success of deep learning has escalated the demand for labeled data to an unprecedented level. Some learning tasks can easily consume millions of labeled data items (Najafabadi et al., 2015; Goodfellow et al., 2016). However, acquiring data labels is a nontrivial task: it often requires a pool of annotators with sufficient domain expertise to manually label the data items. For example, the popular Microsoft COCO dataset contains 2.5 million images, and around 20,000 work hours aggregated from multiple annotators were used for its category labeling (Lin et al., 2014). Crowdsourcing is considered an important working paradigm for data labeling. In crowdsourcing platforms, e.g., Amazon Mechanical Turk (Buhrmester et al., 2011), Crowdflower (Wazny, 2017), and ClickWork (Vakharia & Lease, 2013), data items are dispatched to and labeled by many annotators; the annotations are then integrated to produce reliable labels. A notable challenge is that annotator-output labels are sometimes considerably noisy. Training machine learning models using noisy labels could seriously degrade system performance (Arpit et al., 2017; Zhang et al., 2016a). In addition, the labels provided by individual annotators are often largely incomplete, as a dataset is often divided and dispatched to different annotators. Early crowdsourcing methods often treat annotation integration and downstream operations, e.g., classification, as separate tasks; see, e.g., (Dawid & Skene, 1979; Karger et al., 2011a; Whitehill et al., 2009; Snow et al., 2008; Welinder et al., 2010; Liu et al., 2012; Zhang et al., 2016b; Ibrahim et al., 2019; Ibrahim & Fu, 2021). This pipeline estimates the annotators' confusion parameters (e.g., the confusion matrices under the Dawid & Skene (DS) model (Dawid & Skene, 1979)) in the first stage. Then, the corrected and integrated labels, along with the data, are used for training the downstream classifiers.
However, simultaneously learning the annotators' confusions and the classifier in an end-to-end (E2E) manner has shown substantially improved performance in practice (Raykar et al., 2010; Khetan et al., 2018; Tanno et al., 2019; Rodrigues & Pereira, 2018; Chu et al., 2021; Guan et al., 2018; Cao et al., 2019; Li et al., 2020; Wei et al., 2022; Chen et al., 2020). To achieve E2E learning under crowdsourced labels, a class of methods concatenates a "confusion layer" for each annotator to the output of a neural classifier, and jointly learns these parts in a parameter-coupled manner. This gives rise to a coupled cross-entropy minimization (CCEM) criterion (Rodrigues & Pereira, 2018; Tanno et al., 2019; Chen et al., 2020; Chu et al., 2021; Wei et al., 2022). In essence, the CCEM criterion models the observed labels as annotators' confused outputs from the "ground-truth" predictor (GTP), i.e., the classifier as if trained using the noiseless labels and with perfect generalization, which is natural and intuitive. A notable advantage of the CCEM criteria is that they often lead to computationally convenient optimization problems, as the confusion characteristics are modeled as just additional structured layers added to the neural network. Despite its simplicity, the CCEM criterion and its variants have shown promising performance. For example, the crowdlayer method is a typical CCEM approach, which has served as a widely used benchmark for E2E crowdsourcing since its proposal (Rodrigues & Pereira, 2018). In (Tanno et al., 2019), a similar CCEM-type criterion is employed, with a trace regularization added. More CCEM-like learning criteria appear in (Chen et al., 2020; Chu et al., 2021; Wei et al., 2022), with some additional constraints and considerations. Challenges. Despite its empirical success, theoretical understanding of the CCEM criterion has been limited.
Particularly, it is often unclear whether CCEM can correctly identify the annotators' confusion characteristics (often modeled as "confusion matrices") and the GTP under reasonable settings (Rodrigues & Pereira, 2018; Chu et al., 2021; Chen et al., 2020; Wei et al., 2022), yet identifiability is key to guaranteed performance. The only existing identifiability result on CCEM was derived under restricted conditions, e.g., the availability of infinite samples and the assumption that there exist annotators with diagonally dominant confusion matrices (Tanno et al., 2019). To the best of our knowledge, model identifiability of CCEM has not been established under realistic conditions, e.g., in the presence of incomplete annotator labeling and non-experts under finite samples. We should note that a couple of non-CCEM approaches proposed identifiability-guaranteed solutions for E2E crowdsourcing. The work in (Khetan et al., 2018) showed identifiability under their expectation maximization (EM)-based learning approach, but the result is only applicable to binary classification. An information-theoretic loss-based learning approach proposed in (Cao et al., 2019) presents some identifiability results, but under the presence of a group of independent expert annotators and infinite samples. These conditions are hard to meet or verify. In addition, the computation of these methods is often more complex relative to the CCEM-based methods. Contributions. Our contributions are as follows: • Identifiability Characterizations of CCEM-based E2E Crowdsourcing. In this work, we show that the CCEM criterion can indeed provably identify the annotators' confusion matrices and the GTP up to inconsequential ambiguities under mild conditions. Specifically, we show that, if the number of annotator-labeled items is sufficiently large and some other reasonable assumptions hold, the two parts can be recovered with bounded errors.
Our result is the first finite-sample identifiability result for CCEM-based crowdsourcing. Moreover, our analysis reveals that the success of CCEM does not rely on conditional independence among the annotators. This is favorable (as annotator independence is a stringent requirement) and surprising, since conditional independence is often used to derive the CCEM criterion; see, e.g., (Tanno et al., 2019; Rodrigues & Pereira, 2018; Chu et al., 2021). • Regularization Design for CCEM With Provably Enhanced Identifiability. Based on the key insights revealed in our analysis, we propose two types of regularization that provide enhanced identifiability guarantees under challenging scenarios. To be specific, the first regularization term ensures that the confusion matrices and the GTP can be identified without any expert annotators if one has a sufficiently large amount of data. The second regularization term ensures identifiability when class specialists are present among the annotators. These identifiability-enhanced approaches demonstrate promising label integration performance in our experiments. Notation. The notation is summarized in the supplementary material.

2. BACKGROUND

Crowdsourcing is a core working paradigm for labeling data, as individual annotators are often not reliable enough to produce quality labels. Consider a set of N items {x_n}_{n=1}^N, where each item belongs to one of K classes and x_n ∈ R^D denotes the feature vector of the nth data item. Let {y_n}_{n=1}^N denote the set of ground-truth labels, where y_n ∈ [K], ∀n. The ground-truth labels {y_n}_{n=1}^N are unknown. We ask M annotators to provide labels for their assigned items. Consequently, each item x_n is labeled by a subset of annotators indexed by S_n ⊆ [M]. Let {ŷ_n^{(m)}}_{(m,n)∈S} denote the set of labels provided by all annotators, where S = {(m, n) | m ∈ S_n, n ∈ [N]} and ŷ_n^{(m)} ∈ [K]. Note that |S| ≪ NM usually holds and that the annotators may often incorrectly label the items, leading to an incomplete and noisy {ŷ_n^{(m)}}_{(m,n)∈S}. Here, "incomplete" means that S does not cover all possible combinations of (m, n), i.e., not all annotators label all items. The goal of an E2E crowdsourcing system is to train a reliable machine learning model (often a classifier) using such noisy and incomplete annotations. This is normally done by blindly estimating the confusion characteristics of the annotators from the noisy labels. Classic Approaches. Besides the naive majority voting method, one of the most influential crowdsourcing models is by Dawid & Skene (Dawid & Skene, 1979), which models the confusions of the annotators using the conditional probabilities Pr(ŷ_n^{(m)} | y_n). Consequently, annotator m's confusion is fully characterized by a matrix defined as follows: A_m(k, k′) := Pr(ŷ_n^{(m)} = k | y_n = k′), ∀k, k′ ∈ [K]. Dawid & Skene also proposed an EM algorithm that estimates the A_m's under a naive Bayes generative model. Many crowdsourcing methods follow Dawid & Skene's modeling ideas (Ghosh et al., 2011; Dalvi et al., 2013; Karger et al., 2013; Zhang et al., 2016b; Traganitis et al., 2018; Ibrahim et al., 2019; Ibrahim & Fu, 2021).
Like (Dawid & Skene, 1979), many of these methods do not exploit the data features x_n while learning the confusion matrices of the annotators. They often treat confusion estimation and classifier training as two sequential tasks. Such two-stage approaches may suffer from error propagation.
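To make the two-stage pipeline concrete, below is a minimal NumPy sketch (our own illustration, not code from any cited work) of the classic recipe: first integrate the noisy labels by majority voting, then estimate each annotator's confusion matrix A_m empirically against the integrated labels. Entries of -1 mark items an annotator did not label.

```python
import numpy as np

def majority_vote(labels, K):
    """labels: (M, N) array with entries in {0..K-1}, or -1 if not labeled."""
    N = labels.shape[1]
    votes = np.zeros((K, N))
    for m, n in zip(*np.nonzero(labels >= 0)):
        votes[labels[m, n], n] += 1
    return votes.argmax(axis=0)  # integrated label per item

def estimate_confusions(labels, y_hat, K):
    """Empirical A_m(k, k') = Pr(annotator m says k | integrated label is k')."""
    M = labels.shape[0]
    A = np.zeros((M, K, K))
    for m in range(M):
        for n in np.nonzero(labels[m] >= 0)[0]:
            A[m, labels[m, n], y_hat[n]] += 1
        col = A[m].sum(axis=0, keepdims=True)
        A[m] /= np.maximum(col, 1)  # normalize columns into conditional PMFs
    return A
```

A downstream classifier would then be trained on `y_hat` in a separate second stage, which is exactly where estimation errors can propagate.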

E2E Crowdsourcing and

The Identifiability Challenge. The E2E approaches in (Rodrigues & Pereira, 2018; Tanno et al., 2019; Khetan et al., 2018) model the probability of the mth annotator's response to the data item x_n as follows:

Pr(ŷ_n^{(m)} = k | x_n) = Σ_{k′=1}^{K} Pr(ŷ_n^{(m)} = k | y_n = k′) Pr(y_n = k′ | x_n),  k ∈ [K],   (1)

where the assumption that the annotator confusion is data-independent has been used to derive the right-hand side. In the above, the distribution Pr(ŷ_n^{(m)} | y_n) models the confusions in annotator responses given the true label y_n, and Pr(y_n | x_n) denotes the true label distribution of the data item x_n. The distribution Pr(y_n | x_n) can be represented using a mapping f : R^D → [0, 1]^K with [f(x_n)]_k = Pr(y_n = k | x_n). Define a probability vector p_n^{(m)} ∈ [0, 1]^K for every m and n, such that [p_n^{(m)}]_k = Pr(ŷ_n^{(m)} = k | x_n). With these notations, the following model holds:

p_n^{(m)} = A_m f(x_n),  ∀m, n.   (2)

In practice, one does not observe p_n^{(m)} but only a label drawn from it, i.e., ŷ_n^{(m)} ∼ Cat(p_n^{(m)}). The goal of E2E crowdsourcing is to identify A_m and f in (2) from the noisy labels {ŷ_n^{(m)}}_{(m,n)∈S} and the data {x_n}_{n=1}^N. Note that even if p_n^{(m)} in (2) is observed, A_m and f(·) are in general not identifiable. The reason is that one can easily find nonsingular matrices Q ∈ R^{K×K} such that p_n^{(m)} = Ã_m f̃(x_n), where Ã_m = A_m Q and f̃(x_n) = Q^{-1} f(x_n). If an E2E approach does not ensure the identifiability of A_m and f, but instead outputs f̃(x_n) = Q^{-1} f(x_n), then the learned predictor is unlikely to work well. Identifying the ground-truth f not only recovers the true labels of the training data (data for which the annotations have been collected), but is also useful for predicting the true labels of unseen test items. In addition, correctly identifying the A_m's helps quantify how reliable each annotator is.
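This rotation ambiguity is easy to reproduce numerically. The sketch below (an illustrative example of ours) builds a random column-stochastic A_m and a simplex vector f(x_n), applies an arbitrary nonsingular Q, and confirms that both factorizations yield the same annotator-response PMF. Note that Ã_m and f̃(x_n) generally violate nonnegativity and sum-to-one, which is precisely what constraints and regularization later exploit.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

# Ground-truth confusion matrix (columns are conditional PMFs) and a GTP output.
A = rng.dirichlet(np.ones(K), size=K).T   # K x K, each column sums to 1
f_x = rng.dirichlet(np.ones(K))           # f(x_n), a point on the probability simplex

# Any nonsingular Q yields an alternative factorization of the same p_n^(m).
Q = np.eye(K) + 0.1 * rng.standard_normal((K, K))
A_tilde = A @ Q                           # "rotated" confusion matrix
f_tilde = np.linalg.solve(Q, f_x)         # Q^{-1} f(x_n)

assert np.allclose(A @ f_x, A_tilde @ f_tilde)  # identical observed PMF
```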

CCEM-based Approaches and Challenges.

A number of existing E2E methods are inspired by the model in (2) to formulate CCEM-based learning losses. • Crowdlayer (Rodrigues & Pereira, 2018): For each (m, n) ∈ S, the generative model in (2) naturally leads to a cross-entropy based learning loss, i.e., CE(A_m f(x_n), ŷ_n^{(m)}). As some annotators co-label items, the components f(x_n) associated with the co-labeled items are shared by these annotators. With such latent component coupling, and putting the cross-entropy losses for all (m, n) ∈ S together, the work in (Rodrigues & Pereira, 2018) formulates the following CCEM loss:

minimize_{f∈F, {A_m∈R^{K×K}}}  (1/|S|) Σ_{(m,n)∈S} CE(softmax(A_m f(x_n)), ŷ_n^{(m)}),   (4)

where F ⊆ {f(x) ∈ R^K | f(x) ∈ Δ_K, ∀x} is a function class and the softmax operator is applied because the output of A_m f(x_n) is supposed to be a probability mass function (PMF). The method based on the formulation in (4) is called crowdlayer (Rodrigues & Pereira, 2018). The name comes from the fact that the confusion matrices of the "crowd," i.e., the A_m's, can be understood as additional layers if f is represented by a neural network. The loss function in (4) is relatively easy to optimize, as any off-the-shelf neural network training algorithm can be directly applied. This seemingly natural and simple CCEM approach demonstrated substantial performance improvement upon classic methods. However, there is no performance characterization for (4). • TraceReg (Tanno et al., 2019): The work in (Tanno et al., 2019) proposed a similar loss function, with a regularization term added in order to establish identifiability of the A_m's and f:

minimize_{f∈F, {A_m∈A}}  (1/|S|) Σ_{(m,n)∈S} CE(A_m f(x_n), ŷ_n^{(m)}) + λ Σ_{m=1}^{M} trace(A_m),   (5)

where A is the constrained set {A ∈ R^{K×K} | A ≥ 0, 1^⊤ A = 1^⊤}.
It was shown that (i) if p_n^{(m)} instead of ŷ_n^{(m)} is observed for every m and n, and (ii) if the mean of the A_m's is diagonally dominant, then solving (5) with the term CE(A_m f(x_n), ŷ_n^{(m)}) replaced by enforcing p_n^{(m)} = A_m f(x_n) identifies A_m for all m, up to a common column permutation. Even though the result provides some justification for using the CCEM criterion with trace regularization, it is clearly unsatisfactory, as it requires many stringent assumptions to establish identifiability. • SpeeLFC (Chen et al., 2020), Union-Net (Wei et al., 2022), CoNAL (Chu et al., 2021): A number of other variants of the CCEM criterion have also been proposed for E2E learning with crowdsourced labels. The work in (Chen et al., 2020) uses a similar criterion as in (4), but with the softmax operator applied to the columns of the A_m's instead of to the output of A_m f(x_n). The work in (Wei et al., 2022) concatenates the A_m's vertically and applies the softmax operator to the columns of the concatenated matrix while formulating the CCEM criterion. The method in (Chu et al., 2021) models a common confusion matrix in addition to the annotator-specific confusion matrices A_m and employs a CCEM-type criterion to learn both in a coupled fashion. Nonetheless, none of these approaches have successfully tackled the identifiability aspect. The work in (Wei et al., 2022) claims theoretical support for their approach, but the proof is flawed. In fact, understanding of the success of CCEM-based methods under practical E2E learning settings is still lacking. Other Related Works. A couple of non-CCEM approaches have also considered the identifiability aspect (Khetan et al., 2018; Cao et al., 2019). The work in (Khetan et al., 2018) presents an EM-inspired alternating optimization strategy. Such a procedure involves training the neural classifier multiple times, making it computationally intensive.
In addition, its identifiability guarantees rely on certain strict assumptions, e.g., binary classification, identical confusions, and conditional independence across annotators, which are hard to satisfy in practice. The approach in (Cao et al., 2019) employs an information-theoretic loss function to jointly train a predictor network and an annotation aggregation network. The identifiability guarantees are again under restrictive assumptions, e.g., infinite data, no missing annotations, and the existence of mutually independent expert annotators. We should also note that there exist works considering data-dependent annotator confusions, e.g., (Zhang et al., 2020) (multiple-annotator case) and (Cheng et al., 2021; Xia et al., 2020; Zhu et al., 2022) (single-annotator case), which leverage more complex models and learning criteria. In our work, we focus on the criterion designed using the model in (1), as it is shown to be effective in practice.
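For concreteness, the crowdlayer-style CCEM objective in (4) can be sketched as follows (a simplified NumPy evaluation of the loss only; in a real implementation f is a neural network and the A_m's are trainable layers, optimized jointly by backpropagation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ccem_loss(A, F, S, y_noisy):
    """Coupled cross-entropy loss, crowdlayer-style (Eq. (4)).

    A: (M, K, K) per-annotator confusion layers.
    F: (N, K) classifier outputs f(x_n) on the simplex.
    S: list of observed (m, n) annotation pairs.
    y_noisy: dict mapping (m, n) -> observed label in {0..K-1}.
    """
    loss = 0.0
    for (m, n) in S:
        p = softmax(A[m] @ F[n])            # annotator m's predicted label PMF
        loss -= np.log(p[y_noisy[(m, n)]])  # cross entropy with the noisy label
    return loss / len(S)
```

Lower loss is achieved when A_m f(x_n) places mass on the observed noisy labels; all annotators are coupled through the shared component f(x_n).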

3. IDENTIFIABILITY OF CCEM-BASED E2E CROWDSOURCING

In this section, we offer an identifiability analysis of the CCEM learning loss under realistic settings. We consider the following re-expressed objective function of CCEM:

minimize_{f, {A_m}}  −(1/|S|) Σ_{(m,n)∈S} Σ_{k=1}^{K} I[ŷ_n^{(m)} = k] log [A_m f(x_n)]_k   (6a)
subject to  f ∈ F,  A_m ∈ A, ∀m,   (6b)

where F ⊆ {f(x) ∈ R^K | f(x) ∈ Δ_K, ∀x} is a function class and A is the constrained set {A ∈ R^{K×K} | A ≥ 0, 1^⊤ A = 1^⊤}, since each column of the A_m's is a conditional probability distribution. The rationale is that the confusion matrices A_m act as correction terms for the annotators' labeling noise, so that f outputs true-label predictions. Intuitively, the objective in (6a) encourages the output estimates f̂, {Â_m} to satisfy the relation p_n^{(m)} = Â_m f̂(x_n), ∀m, n. One can easily see that identifiability of A_m and f (i.e., whether the ground-truth A_m and the ground-truth f can be identified) is not straightforward, since there exist infinitely many nonsingular matrices Q ∈ R^{K×K} such that p_n^{(m)} = (Â_m Q)(Q^{-1} f̂(x_n)). To proceed, we make the following assumptions. Here, we use the sensitive complexity parameter introduced in (Lin & Zhang, 2019) as the complexity measure R_F. In essence, R_F gets larger as the neural network function class F gets deeper and wider; see also the supplementary material, Sec. K, for more discussion.

Assumption 1 Each data item x_n, n ∈ [N]

Assumption 3 There exists a GTP f♮ : R^D → [0, 1]^K such that [f♮(x_n)]_k = Pr(y_n = k | x_n), where y_n denotes the true label of the data item x_n. In addition, there exists f̄ ∈ F such that ∥f̄(x_n) − f♮(x_n)∥_2 ≤ ν for all x_n ∼ D, where 0 ≤ ν < ∞.

Assumption 4 (Near-Class Specialist Assumption) Let A♮_m denote the ground-truth confusion matrix of annotator m. For each class k, there exists a near-class specialist, indexed by m_k, such that ∥A♮_{m_k}(k, :) − e_k^⊤∥_2 ≤ ξ_1, where 0 ≤ ξ_1 < ∞.
Assumption 5 (Near-Anchor Point Assumption) For each class k, there exists a near-anchor point x_{n_k} such that ∥f♮(x_{n_k}) − e_k∥_2 ≤ ξ_2, where 0 ≤ ξ_2 < ∞.

Assumptions 1-3 are standard conditions for analyzing the performance of neural models under finite-sample settings. The Near-Class Specialist Assumption (NCSA) implies that, for each class k ∈ [K], there exist annotators who can correctly distinguish items belonging to class k from those belonging to other classes. It is important to note that the NCSA is much milder than commonly used assumptions such as the existence of all-class expert annotators (Cao et al., 2019) or diagonally dominant A♮_m's (Tanno et al., 2019). Instead, the NCSA only requires that for each class there exists one annotator specialized in it (i.e., a class specialist); the annotator need not be an all-class expert; see Fig. 1 in the supplementary material for an illustration. The Near-Anchor Point Assumption (NAPA) is a relaxed version of the exact anchor point assumption commonly used in provable label-noise learning methods (Xia et al., 2019; Li et al., 2021). Under Assumptions 1-5, we have the following result:

Theorem 1 Assume that each [A♮_m f♮(x_n)]_k and [A_m f(x_n)]_k, ∀A_m ∈ A, ∀f ∈ F, is at least 1/β. Also assume that σ_max(A♮_m) ≤ σ, ∀m, for a certain σ > 0. Then, for any α > 0, with probability greater than 1 − K/N^α, any optimal solution Â_m, f̂ of problem (6) satisfies the following relations:

min_Π ∥Â_m − A♮_m Π∥_F^2 = O(K σ^2 (η + ξ_1 + ξ_2)),  ∀m ∈ [M],
E_{x∼D} min_Π ∥f̂(x) − Π^⊤ f♮(x)∥_2^2 = O(K (η + ξ_1 + ξ_2)),

where η^2 = O( (βMN^α/√S)(√(M log S) + (∥X∥_F R_F)^{1/4}) + β√K M N^α ν ), Π ∈ {0, 1}^{K×K} is a permutation matrix, and X = [x_{n_1}, . . . , x_{n_S}] with (m_s, n_s) ∈ S, provided that the conditions ξ_1, ξ_2 ≤ 1/K, ν ≤ 1/(βK^2 M^2 N^α), and S = |S| = Ω( β^2 M^2 N^{2α} K^2 max{M log S, ∥X∥_F R_F} ) hold. The proof of Theorem 1 is relegated to the supplementary material, Sec. B.
The takeaways are as follows: First, the CCEM can provably identify the A♮_m's and the GTP f♮ under finite samples; such finite-sample identifiability of CCEM was not established before. Second, some restrictive conditions used in the literature for establishing CCEM's identifiability (e.g., the existence of all-class experts (Tanno et al., 2019; Cao et al., 2019)) are actually not needed. Third, many works derived the CCEM criterion under the maximum likelihood principle by assuming that the annotators are conditionally independent; see (Rodrigues & Pereira, 2018; Chen et al., 2020; Tanno et al., 2019; Chu et al., 2021). Interestingly, as revealed in our analysis, the identifiability of A♮_m and f♮ under the CCEM criterion does not rely on the annotators' independence. These new findings support and explain the effectiveness of CCEM in the wide range of scenarios observed in the literature.
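The error metric in Theorem 1 removes the inherent column-permutation ambiguity. For small K, it can be evaluated exactly by enumerating all K! permutations, as in this brute-force NumPy helper (our own utility, kept simple for clarity):

```python
import numpy as np
from itertools import permutations

def perm_aligned_error(A_hat, A_true):
    """Compute min over permutation matrices P of ||A_hat - A_true @ P||_F."""
    K = A_true.shape[1]
    best = np.inf
    for perm in permutations(range(K)):
        P = np.eye(K)[:, list(perm)]  # permutation matrix for this ordering
        best = min(best, np.linalg.norm(A_hat - A_true @ P))
    return best
```

If Â_m equals A♮_m with permuted columns, this error is zero, which is exactly the "inconsequential ambiguity" allowed by the theorem.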

4. ENHANCING IDENTIFIABILITY VIA REGULARIZATION

Theorem 1 confirms that CCEM is a provably effective criterion for identifying the A♮_m's and f♮. The caveats lie in Assumptions 4 and 5. Essentially, these assumptions require that some rows of the collection {A♮_m}_{m=1}^M and some vectors of the collection {f♮(x_n)}_{n=1}^N are close to the canonical vectors. These are reasonable assumptions, but satisfying them simultaneously may not always be possible. A natural question is whether these assumptions can be relaxed, at least partially. In this section, we propose variants of CCEM that admit enhanced identifiability. To this end, we use a condition that is often adopted in the nonnegative matrix factorization (NMF) literature:

Definition 1 (SSC) (Fu et al., 2019; Gillis, 2020) Consider the second-order cone C = {x ∈ R^K | √(K−1) ∥x∥_2 ≤ 1^⊤ x}. A nonnegative matrix Z ∈ R_+^{L×K} satisfies the sufficiently scattered condition (SSC) if (i) C ⊆ cone(Z^⊤) and (ii) cone(Z^⊤) ⊆ cone(Q) does not hold for any orthogonal Q ∈ R^{K×K} except for permutation matrices.

Geometrically, a matrix Z satisfying the SSC means that its rows span a large "area" within the nonnegative orthant, so that the conic hull of the rows contains the second-order cone C as a subset. Note that the SSC subsumes the NCSA (Assumption 4 with ξ_1 = 0) and the NAPA (Assumption 5 with ξ_2 = 0) as special cases. Using the SSC, we first show that the CCEM criterion in (6) attains identifiability of the A♮_m's and f♮ without using the NCSA or the NAPA, when the problem size grows sufficiently large.

Theorem 2 Suppose that the incomplete labeling paradigm in Assumption 1 holds with S = Ω(t) and N = O(t^3), for a certain t > 0. Also, assume that there exists Z = {n_1, . . . , n_T} such that (F♮_Z)^⊤ = [f♮(x_{n_1}), . . . , f♮(x_{n_T})]^⊤ satisfies the SSC and that W♮ = [A_1^{♮⊤}, . . . , A_M^{♮⊤}]^⊤ satisfies the SSC as well.
Then, at the limit of t → ∞, if f♮ ∈ F, any optimal solution ({Â_m}, f̂) of (6) satisfies Â_m = A♮_m Π for all m and f̂(x) = Π^⊤ f♮(x) for all x ∼ D, where Π is a permutation matrix.

The proof proceeds by connecting the CCEM problem to the classic nonnegative matrix factorization (NMF) problem at the limit of t → ∞; see Sec. D. It is also worth noting that Theorem 2 does not require complete observation of the annotations to guarantee identifiability. Another remark is that requiring W♮ and F♮ to satisfy the SSC is more relaxed than requiring them to satisfy the NCSA and NAPA, respectively. For example, one may need class specialists with much higher expertise to satisfy the NCSA than to satisfy the SSC on W♮; see also the geometric illustration in Sec. L. Compared to the result in Theorem 1, Theorem 2 implies that when N is very large, the CCEM should work under conditions less stringent than the NCSA and NAPA. Nonetheless, the assumption that both the GTP and the annotators' confusion matrices meet certain requirements may not always hold. In the following, we further show that the SSC on either W♮ or F♮ can be relaxed:

Theorem 3 Suppose that the incomplete labeling paradigm in Assumption 1 holds with S = Ω(t) and N = O(t^3), for a certain t > 0. Then, at the limit of t → ∞, if f♮ ∈ F, we have A*_m = A♮_m Π for all m and f*(x) = Π^⊤ f♮(x) for all x ∼ D, when either of the following conditions holds: (a) there exists Z = {n_1, . . . , n_T} such that (F♮_Z)^⊤ = [f♮(x_{n_1}), . . . , f♮(x_{n_T})]^⊤ satisfies the SSC, rank(W♮) = K, and ({A*_m}, f*) is the optimal solution of (6) with the maximal log det(F* F*^⊤); (b) W♮ = [A_1^{♮⊤}, . . . , A_M^{♮⊤}]^⊤ satisfies the SSC, rank(F♮) = K, and ({A*_m}, f*) is the optimal solution of (6) with the maximal log det((W*)^⊤ W*).
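The log det terms in Theorem 3 can be read as volume measures: classifier outputs scattered over the simplex yield a larger log det(F F^⊤) than collapsed outputs, so maximizing this term pushes the solution toward the SSC-type geometry. The toy comparison below is our own illustration (the ε jitter is only for numerical safety); it contrasts outputs spread toward the simplex vertices with outputs clustered near the simplex center.

```python
import numpy as np

def logdet_reg(F, eps=1e-9):
    """log det(F F^T + eps*I) for F of size K x N (columns f(x_n) on the simplex)."""
    K = F.shape[0]
    sign, val = np.linalg.slogdet(F @ F.T + eps * np.eye(K))
    return val

K, N = 3, 300
rng = np.random.default_rng(0)
F_scattered = rng.dirichlet(0.2 * np.ones(K), size=N).T   # spread toward vertices
F_collapsed = rng.dirichlet(20.0 * np.ones(K), size=N).T  # clustered near the center
assert logdet_reg(F_scattered) > logdet_reg(F_collapsed)
```

The regularized criteria (8) and (9) subtract exactly such log-determinant terms (on F and on W, respectively) from the CCEM loss.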
Theorem 3(a) shows that if the number of annotated data items is large and the GTP outputs associated with the x_n's are diverse enough to satisfy the SSC, then f♮ and the A♮_m's are identifiable even if no class specialists are available. Compared to Theorem 2, the identifiability is clearly enhanced, as the conditions on the annotators have been relaxed. Theorem 3(b) implies that if W♮ satisfies the SSC, identifiability can be established irrespective of the geometry of F♮. Intuitively, when more annotators are available, W♮ is more likely to satisfy the SSC (the relation between the size of W♮ and the SSC was shown in (Ibrahim et al., 2019, Theorem 4) under reasonable conditions). Hence, in practice, Theorem 3(b) may offer ensured performance in cases where more annotators are available. Similar to the proof of Theorem 2, the proof here connects the CCEM-based E2E crowdsourcing problem to the simplex-structured matrix factorization (SSMF) problem (see (Fu et al., 2015; 2018)) at the limit of t → ∞. The detailed proof is provided in Sec. E. We should mention that various connections between matrix factorization models and data labeling problems were also observed in prior works, e.g., the classic approaches based on the Dawid & Skene model (Dawid & Skene, 1979; Zhang et al., 2016b; Ibrahim et al., 2019; Ibrahim & Fu, 2021) and E2E learning under noisy labels (Li et al., 2021). Nonetheless, the former do not consider cross-entropy losses or involve any feature extractor, whereas the latter does not consider multiple annotators or incomplete labeling. Implementation via Regularization. Theorem 3 naturally leads to regularized versions of CCEM. Correspondingly, we have the following two cases. Following Theorem 3(a), when the GTP is believed to be diverse over different x_n's, we propose to employ the following regularized CCEM:

minimize_{A_m∈A,∀m, f∈F}  −(1/|S|) Σ_{(m,n)∈S} Σ_{k=1}^{K} I[ŷ_n^{(m)} = k] log [A_m f(x_n)]_k − λ log det(F F^⊤),   (8)

where F = [f(x_1), . . . , f(x_N)]. In a similar spirit, following Theorem 3(b), we propose the following regularized CCEM when M is large (so that W♮ likely satisfies the SSC):

minimize_{A_m∈A,∀m, f∈F}  −(1/|S|) Σ_{(m,n)∈S} Σ_{k=1}^{K} I[ŷ_n^{(m)} = k] log [A_m f(x_n)]_k − λ log det(W^⊤ W),   (9)

where W = [A_1^⊤, . . . , A_M^⊤]^⊤. The implementations of (8) and (9) are called geometry-regularized crowdsourcing networks, abbreviated as GeoCrowdNet(F) and GeoCrowdNet(W), respectively.

Remark 1 Based on our analyses, our suggested rule of thumb for choosing the regularization is as follows: When M is large but N is relatively small, GeoCrowdNet(W) is recommended, as W is more likely to satisfy the SSC than F^⊤. However, when N is large, GeoCrowdNet(F) is expected to perform better, as F^⊤ is likely to satisfy the SSC in this case. Under big data settings, N is often large and the GTP outputs of the x_n's are often reasonably diverse. Hence, (8) oftentimes works well, as will be seen in our experiments. Nonetheless, for cases where the geometry of F^⊤ violates the SSC, the employment of more annotators can help via (9), which is also an intuitive advantage of crowdsourcing.

5. EXPERIMENTS

Baselines. The baselines include the CCEM-based methods CrowdLayer (Rodrigues & Pereira, 2018) and TraceReg (Tanno et al., 2019), as well as MaxMIG (Cao et al., 2019). In addition to these baselines, we also learn neural network classifiers using the labels aggregated by majority voting and by the classic EM algorithm of Dawid & Skene (Dawid & Skene, 1979), denoted as NN-MV and NN-DSEM, respectively. Real-Data Experiments with Machine Annotations. The settings are as follows. Dataset. We use the MNIST dataset (Deng, 2012) and the Fashion-MNIST dataset (Xiao et al., 2017); see more details in Sec. N. Noisy Label Generation. We train a number of machine classifiers to produce noisy labels. Five types of classifiers, including support vector machines (SVM), k-nearest neighbors (kNN), logistic regression, convolutional neural networks (CNN), and fully connected neural networks, are considered to act as annotators.
To create more annotators, we vary some of their parameters (e.g., the number of nearest neighbors k for kNN and the number of training epochs for the CNN). We consider two different cases. (i) Case 1: NCSA (Assumption 4) holds: Each annotator is chosen to be an all-class expert with probability 0.1. If an annotator is chosen to be an all-class expert, it is trained carefully so that its classification accuracy exceeds a certain threshold; see details in Sec. N. Note that the existence of an expert implies that the NCSA and the SSC are likely satisfied by W♮, but the reverse is not necessarily true. We use this strategy to enforce the NCSA for validation purposes only; it does not mean that the CCEM criterion needs experts to work. (ii) Case 2: NCSA does not hold: Each machine annotator is trained on a randomly chosen small subset of the training data (100 to 500 samples) so that its accuracy is low. This way, no all-class expert or class specialist is likely to exist. We use the two cases to validate our theorems. According to our analyses, GeoCrowdNet(W) is expected to work better under Case 1, as the NCSA approximately implies the SSC. In addition, GeoCrowdNet(F) does not rely on the geometry of W♮ and thus can work under both cases, if N is reasonably large (which makes (F♮)^⊤ more likely to satisfy the SSC). Once the machine annotators are trained, we let them label unseen data items of size N. To evaluate the methods under incomplete labeling, an annotator labels any given data item with probability p = 0.1. Under this labeling strategy, every data item is labeled by only a small subset of the annotators. Settings. The neural network classifier architecture for the MNIST dataset is chosen to be LeNet-5 (Lecun et al., 1998), which consists of two sets of convolutional and max pooling layers, followed by a flattening convolutional layer, two fully connected layers, and finally a softmax layer.
For the Fashion-MNIST dataset, we use the ResNet-18 architecture (He et al., 2016). Adam (Kingma & Ba, 2015) is used as the optimizer, with a weight decay of 10^-4 and a batch size of 128. The regularization parameter λ and the initial learning rate of the Adam optimizer are chosen via grid search on the validation set over {0.01, 0.001, 0.0001} and {0.01, 0.001}, respectively. We use the same neural network structures for all the baselines. The confusion matrices are initialized with identity matrices of size K for the proposed methods and for the baselines TraceReg and CrowdLayer. Results. Table 1 presents the average label prediction accuracy on the test data of MNIST and Fashion-MNIST over 5 random trials, for various cases. One can observe that GeoCrowdNet(F) performs best when there is a larger number of annotated items, even when no class-specialist annotators are present (i.e., Case 2). On the other hand, GeoCrowdNet(W) starts showing its advantage over GeoCrowdNet(F) when there are more annotators, with class specialists among them. Both observations are consistent with our theorems. In addition, the baseline MaxMIG performs competitively when there are all-class experts, which is consistent with its identifiability analysis. However, when there are no such experts (Case 2), its performance drops relative to the proposed methods, especially GeoCrowdNet(F), whose identifiability does not rely on the existence of all-class experts or class specialists. Overall, GeoCrowdNet(F) exhibits consistently good performance. Real-Data Experiments with Human Annotators. The settings are as follows. Datasets. We consider two datasets, namely, LabelMe (M = 59) (Rodrigues et al., 2017; Russell et al., 2007) and Music (M = 44) (Rodrigues et al., 2014), both having noisy and incomplete labels provided by human workers from AMT. Settings.
For LabelMe, we employ the pretrained VGG-16 embeddings followed by a fully connected layer, as in (Rodrigues & Pereira, 2018). For Music, we choose a similar architecture, but with batch normalization layers. We use a batch size of 128 for LabelMe and 100 for Music. All other settings are the same as in the machine-annotator case. More details on the datasets and architectures can be found in Sec. N.

Results. Table 2 shows the average label prediction accuracies on the test data over 5 random trials. One can see that GeoCrowdNet (λ=0) works reasonably well relative to the machine-annotation case in the previous experiment. This may be because the numbers of annotators for LabelMe and Music are reasonably large (M = 59 and M = 44, respectively), which makes it easier than before for W ♮ to satisfy Assumption 4. This validates our claims in Theorem 1, but it also shows a limitation of the plain-vanilla CCEM: it may require a fairly large number of annotators to start showing promising results. Similar to the previous experiment, both GeoCrowdNet(F) and GeoCrowdNet(W) outperform the unregularized version GeoCrowdNet (λ=0), showing the advantages of the identifiability-enhancing regularization terms. More experiments and details are presented in Sec. N, due to page limitations.
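As a concrete sketch of the incomplete-labeling protocol used throughout these experiments (each annotator labels any data item independently with probability p = 0.1), the following minimal simulation may help; all names here are ours, not from the authors' code, and the labels are placeholders rather than outputs of trained machine annotators:

```python
import numpy as np

def simulate_incomplete_labeling(M, N, K, p=0.1, seed=0):
    """Simulate M annotators labeling N items over K classes.

    Each (annotator, item) pair is observed with probability p, so every
    item ends up labeled by only a small subset of the annotators.
    Returns Y (M x N, with -1 marking "not labeled") and the observation mask.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((M, N)) < p              # the observed index set S
    noisy = rng.integers(0, K, size=(M, N))    # placeholder noisy labels
    Y = np.where(mask, noisy, -1)
    return Y, mask

Y, mask = simulate_incomplete_labeling(M=30, N=1000, K=10, p=0.1)
```

With p = 0.1 and M = 30, each item is labeled by about p·M = 3 annotators on average, matching the sparse-annotation regime described above.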

6. CONCLUSION

In this work, we revisited the CCEM criterion, one of the most popular E2E learning criteria for crowdsourced label integration. We provided the first finite-sample identifiability characterization of the confusion matrices and the neural classifier learned using the CCEM criterion. Compared to many existing identifiability results, our guarantees hold under more relaxed and more realistic settings. In particular, a take-home point revealed by our analysis is that CCEM can provably identify the desired model parameters even if the annotators are dependent, which is a surprising but favorable result. We also proposed two regularized variants of the CCEM, based on the insights learned from our identifiability analysis of the plain-vanilla CCEM. The regularized CCEM criteria provably enhance the identifiability of the confusion matrices and the neural classifier under more challenging scenarios. We evaluated the proposed approaches on various synthetic and real datasets. The results corroborate our theoretical claims.

A NOTATION

x, x, and X represent a scalar, a vector, and a matrix, respectively. [x]_i and x(i) both denote the ith entry of the vector x. X(:, j) denotes the jth column vector of X and X(i, :) denotes the ith row vector of X. [X]_{i,j} and X(i, j) both mean the (i, j)th entry of X. ∥x∥_2 and ∥X∥_F denote the Euclidean norm and the Frobenius norm of the argument, respectively. [I] denotes the integer set {1, 2, . . . , I}. ⊤ denotes transpose. ∆_K = {x ∈ R^K : Σ_i x(i) = 1, x ≥ 0} denotes the probability simplex. X ≥ 0 means that all the entries of the matrix X are nonnegative. I[A] denotes the indicator function of the event A, i.e., I[A] = 1 if the event A happens and I[A] = 0 otherwise. CE(x, y) = −Σ_{k=1}^{K} I[y = k] log(x(k)) denotes the cross-entropy function. I_K denotes the identity matrix of size K × K. σ_max(X) and σ_min(X) denote the largest and the smallest singular values of the matrix X, respectively.
cone(X) denotes the conic hull formed by the columns of the matrix X. det(X) represents the determinant of X. vec(X) denotes the vectorization operation that concatenates the columns of X. trace(X) outputs the trace (the sum of the diagonal elements) of X.
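As a small self-contained illustration of the indicator and cross-entropy notation defined above (our own sketch, not from the paper):

```python
import numpy as np

def indicator(event):
    """I[A]: 1 if the event A happens, otherwise 0."""
    return 1 if event else 0

def cross_entropy(x, y):
    """CE(x, y) = -sum_k I[y = k] log(x(k)) for x in the probability simplex."""
    K = len(x)
    return -sum(indicator(y == k) * np.log(x[k]) for k in range(K))

x = np.array([0.7, 0.2, 0.1])   # a point in Delta_3
loss = cross_entropy(x, y=0)    # equals -log(0.7)
```

Only the entry x(y) contributes to the sum, which is why CE reduces to the negative log-likelihood of the observed label.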

B PROOF OF THEOREM 1

If the vectors p_n^{(m)} for all m, n are available, then one can represent the model in (2) as a low-rank nonnegative matrix factorization (NMF) model as follows:

$$\underbrace{\begin{bmatrix} p_1^{(1)} & \cdots & p_N^{(1)} \\ \vdots & \ddots & \vdots \\ p_1^{(M)} & \cdots & p_N^{(M)} \end{bmatrix}}_{P \in \mathbb{R}^{MK\times N}} = \underbrace{\begin{bmatrix} A_1 \\ \vdots \\ A_M \end{bmatrix}}_{W \in \mathbb{R}^{MK\times K}} \underbrace{[f(x_1)\ \cdots\ f(x_N)]}_{F \in \mathbb{R}^{K\times N}},$$

where the factors W and F are both nonnegative per their physical meaning. Let P^♮ denote the ground truth and p_n^{♮(m)} ∈ R^K the (m, n)th block in P^♮, following the representation in (10). Note that we do not observe the entire P^♮ but only the y_n^{(m)} sampled from p_n^{♮(m)} for (m, n) ∈ S.

We first show that the CCEM criterion in (6) implicitly estimates P^♮ from incomplete labels indexed by S. To this end, we consider the objective function in (6a):

$$D_S(P; Y) \triangleq -\frac{1}{|S|} \sum_{(m,n)\in S} \sum_{k=1}^{K} \mathbb{I}[y_n^{(m)} = k] \log P((m-1)K+k,\, n),$$

where Y denotes the set of observed noisy labels, i.e., {y_n^{(m)}}_{(m,n)∈S}. Let {Â_m}_{m=1}^{M} and f̂ denote the estimates given by the learning criterion in (6). Then, the criterion (6) helps define the following term:

$$\widehat{P} \triangleq \arg\min_{P\in\mathcal{P}} D_S(P; Y), \quad \text{where } \widehat{P} = \widehat{W}[\widehat{f}(x_1)\ \cdots\ \widehat{f}(x_N)],\ \widehat{W} = [\widehat{A}_1^\top, \ldots, \widehat{A}_M^\top]^\top,$$
$$\mathcal{P} \triangleq \{P \in \mathbb{R}^{MK\times N} \mid P = W[f(x_1)\ \cdots\ f(x_N)],\ W \in \mathcal{W},\ f \in \mathcal{F}\},$$
$$\mathcal{W} \triangleq \{W = [A_1^\top, \ldots, A_M^\top]^\top \in \mathbb{R}^{MK\times K} \mid 1^\top A_m = 1^\top,\ A_m \ge 0,\ \forall m\},$$

and F ⊂ {f(x) ∈ R^K | f(x) ∈ ∆_K, ∀x ∈ R^D} is the neural network function class. Our main goal is to characterize the estimation errors of {Â_m}_{m=1}^{M} and f̂. This can be done by first bounding the estimation errors ∥p_n^{♮(m)} − p̂_n^{(m)}∥_2^2, where p_n^{♮(m)} is given by

$$p_n^{♮(m)} = A_m^♮ f^♮(x_n), \quad \forall m, n,$$

in which A_m^♮ and f^♮ denote the mth ground-truth confusion matrix and the GTP, respectively, and p̂_n^{(m)} denotes the (m, n)th block in P̂.

B.1 ESTIMATING ∥p_n^{♮(m)} − p̂_n^{(m)}∥_2^2

We first show the following proposition:

Proposition 1 Under the assumptions in Theorem 1, the following result holds with probability at least 1 − δ:

$$\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} \|p_n^{♮(m)} - \widehat{p}_n^{(m)}\|_2^2 \le 4\sqrt{2}K\beta\,\mathcal{R}_S(\mathcal{P}) + 10\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + 2\beta\sqrt{K}\nu,$$

where $\mathcal{R}_S(\mathcal{P}) = \frac{16}{\sqrt{S}}\big(MK\log(4S\sqrt{K}) + (2\|X\|_F R_{\mathcal{F}})^{1/2}\big)$ and is related to the empirical Rademacher complexity of the set P under the observed samples S. The proof is provided in Sec. C. Proposition 1 provides an upper bound for ∥P^♮ − P̂∥_F^2.
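To make the factorization concrete, the sketch below (our own construction, not the authors' code) stacks M column-stochastic confusion matrices into W and multiplies by simplex-valued classifier outputs F; every block p_n^{(m)} = A_m f(x_n) of the resulting P then lies on the probability simplex:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N = 5, 3, 8

# Column-stochastic confusion matrices A_1, ..., A_M (each column in Delta_K).
A = rng.random((M, K, K))
A /= A.sum(axis=1, keepdims=True)

# Classifier outputs f(x_1), ..., f(x_N) on the simplex, stacked as F (K x N).
F = rng.random((K, N))
F /= F.sum(axis=0, keepdims=True)

W = A.reshape(M * K, K)   # vertical stacking of the A_m, as in the NMF model
P = W @ F                 # P is MK x N and nonnegative
```

Each K-row block of P is column-stochastic, matching the interpretation of p_n^{(m)} as a conditional label distribution.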
To achieve the main goal, i.e., characterizing the estimation errors of {Â_m}_{m=1}^{M} and f̂, we require an upper bound for Σ_{m=1}^{M} ∥p_n^{♮(m)} − p̂_n^{(m)}∥_2^2 for each fixed n as well. The following result gives a tighter upper bound for this per-sample sum:

Proposition 2 Under the assumptions in Theorem 1, for any α > 0 and δ ∈ (0, 1), the following result holds with probability at least 1 − 1/N^α:

$$\frac{1}{M}\sum_{m=1}^{M} \|p_n^{♮(m)} - \widehat{p}_n^{(m)}\|_2^2 \le N^{\alpha}\left(4\sqrt{2}K\beta\,\mathcal{R}_S(\mathcal{P}) + 10\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + 2\beta\sqrt{K}\nu + 4\delta\right), \quad \forall n,$$

where $\mathcal{R}_S(\mathcal{P}) = \frac{16}{\sqrt{S}}\big(MK\log(4S\sqrt{K}) + (2\|X\|_F R_{\mathcal{F}})^{1/2}\big)$. The proof is provided in Sec. I. Note that the bound in (14) looks divergent at first glance, as the second term on the R.H.S. could go to infinity. However, given a large enough S, one can choose appropriate α and δ such that the bound in (14) does not explode.

B.2 ESTIMATION ERRORS FOR Â_m AND f̂

In this section, we characterize the estimation errors of the confusion matrices and the GTP. We start by using simplified notations to represent the results in Propositions 1-2, i.e.,

$$\|P^♮ - \widehat{P}\|_F^2 \le \zeta^2, \qquad \sum_{m=1}^{M} \|p_n^{♮(m)} - \widehat{p}_n^{(m)}\|_2^2 \le \varphi^2, \quad \forall n,$$

with probability greater than 1 − δ and 1 − 1/N^α, respectively. Assumptions 4-5 imply that there exist index sets Λ = {m̃_1, . . . , m̃_K} and Ψ = {ñ_1, . . . , ñ_K} such that

$$W^♮(\Lambda, :) = I_K + N_W, \qquad F^♮(:, \Psi) = I_K + N_F,$$

where ∥N_W∥_F ≤ √K ξ_1 and ∥N_F∥_F ≤ √K ξ_2. Note that these conditions do not require any confusion matrix to be near identity; i.e., we do not need any annotator to be an all-class specialist. Nonetheless, for notational simplicity, we let Λ = {1, . . . , K} and Ψ = {1, . . . , K}, which is without loss of generality for our proof in the sequel. Under this simplification, the following holds:

$$W^♮ = \begin{bmatrix} A_1^♮ \\ A_2^♮ \\ \vdots \\ A_M^♮ \end{bmatrix}, \qquad F^♮ = [f^♮(x_1)\ \cdots\ f^♮(x_N)] = [F_1^♮\ \ F_2^♮],$$

where A_1^♮ = I_K + N_W, F_1^♮ = I_K + N_F, F_2^♮ ∈ R^{K×(N−K)}, ∥N_W∥_F ≤ √K ξ_1, and ∥N_F∥_F ≤ √K ξ_2. Let Â_m and f̂ denote the estimates of A_m^♮ and f^♮, respectively, obtained using the learning criterion (6), and construct

$$\widehat{W} = \begin{bmatrix} \widehat{A}_1 \\ \widehat{A}_2 \\ \vdots \\ \widehat{A}_M \end{bmatrix}, \qquad \widehat{F} = [\widehat{f}(x_1)\ \cdots\ \widehat{f}(x_N)] = [\widehat{F}_1\ \ \widehat{F}_2],$$

where Â_m ∈ R^{K×K}, F̂_1 ∈ R^{K×K}, and F̂_2 ∈ R^{K×(N−K)}. Then, we have

$$\|P^♮ - \widehat{P}\|_F^2 = \|W^♮ F^♮ - \widehat{W}\widehat{F}\|_F^2 = \|I_K + N_W + N_F + N_W N_F - \widehat{A}_1 \widehat{F}_1\|_F^2 + \|F_2^♮ + N_W F_2^♮ - \widehat{A}_1 \widehat{F}_2\|_F^2 + \sum_{m=2}^{M} \|A_m^♮ + A_m^♮ N_F - \widehat{A}_m \widehat{F}_1\|_F^2 + \sum_{m=2}^{M} \|A_m^♮ F_2^♮ - \widehat{A}_m \widehat{F}_2\|_F^2. \tag{17}$$

Let us define the error matrices as below:

$$E_A^{(1)} \triangleq \widehat{A}_1 - \underbrace{(I_K + N_W)\Pi}_{A_1^♮\Pi}, \qquad E_A^{(m)} \triangleq \widehat{A}_m - A_m^♮\Pi, \quad \forall m > 1,$$
$$E_F^{(1)} \triangleq \widehat{F}_1 - \underbrace{\Pi^\top(I_K + N_F)}_{\Pi^\top F_1^♮}, \qquad E_F^{(2)} \triangleq \widehat{F}_2 - \Pi^\top F_2^♮,$$

where Π ∈ {0, 1}^{K×K} is a column permutation matrix and is the same across all the error blocks. We hope to characterize the norms of these error matrices.

Upper bound for ∥E_A^{(1)}∥_F and ∥E_F^{(1)}∥_F. We start by considering the first term on the R.H.S. of (17), i.e., ∥I_K + N_W + N_F + N_W N_F − Â_1 F̂_1∥_F^2. From (16), we have the following with probability greater than 1 − K/N^α:

$$\varphi^2 \ge \|I_K + N_W + N_F + N_W N_F - \widehat{A}_1\widehat{F}_1\|_F^2 = \sum_{k=1}^{K} |1 + N(k,k) - \widehat{A}_1(k,:)\widehat{F}_1(:,k)|^2 + \sum_{j=1}^{K}\sum_{k\ne j} |N(j,k) - \widehat{A}_1(j,:)\widehat{F}_1(:,k)|^2, \tag{20}$$

where N = N_W + N_F + N_W N_F. The absolute value of the largest entry of N can be bounded by ξ_1 + ξ_2 + √K ξ_1ξ_2. Let us denote κ = φ + ξ_1 + ξ_2 + √K ξ_1ξ_2 for conciseness. Then, from (20), we get the following conditions:

$$1 - \kappa \le |\widehat{A}_1(k,:)\widehat{F}_1(:,k)| \le 1 + \kappa, \quad \forall k \in [K], \tag{21a}$$
$$|\widehat{A}_1(k,:)\widehat{F}_1(:,j)| \le \kappa, \quad \forall k \ne j. \tag{21b}$$

From the conditions (21a) and (21b), we have the following result:

Lemma 1 Assume that the conditions in (21a) and (21b) hold and that κ ≤ 1/(K+1). Then, the following relations are satisfied:

$$\arg\max_{\ell} \widehat{A}_1(k, \ell) \ne \arg\max_{\ell} \widehat{A}_1(j, \ell), \quad \forall k \ne j, \tag{22a}$$
$$\arg\max_{q} \widehat{F}_1(q, k) \ne \arg\max_{q} \widehat{F}_1(q, j), \quad \forall k \ne j, \tag{22b}$$
$$\arg\max_{\ell} \widehat{A}_1(k, \ell) = \arg\max_{q} \widehat{F}_1(q, k), \quad \forall k. \tag{22c}$$

The proof of the lemma is given in Sec. F. From the lower-bound condition in (21a), we have the following result:

$$\sum_{k=1}^{K}(1-\kappa) \le \sum_{k=1}^{K}\big(\widehat{A}_1(k,1)\widehat{F}_1(1,k) + \cdots + \widehat{A}_1(k,K)\widehat{F}_1(K,k)\big) \le \sum_{k=1}^{K}\max_{\ell\in[K]}\widehat{A}_1(k,\ell)\big(\widehat{F}_1(1,k)+\cdots+\widehat{F}_1(K,k)\big) = \sum_{k=1}^{K}\max_{\ell\in[K]}\widehat{A}_1(k,\ell), \tag{23}$$

where the last equality is obtained due to the probability simplex constraints on the columns of F̂_1. From the lower-bound condition in (21a), we also have

$$\sum_{k=1}^{K}(1-\kappa) \le \sum_{k=1}^{K}\big(\widehat{A}_1(k,1)\widehat{F}_1(1,k)+\cdots+\widehat{A}_1(k,K)\widehat{F}_1(K,k)\big) \le \sum_{k=1}^{K}\max_{\ell}\widehat{F}_1(k,\ell)\big(\widehat{A}_1(1,k)+\cdots+\widehat{A}_1(K,k)\big) = \sum_{k=1}^{K}\max_{\ell}\widehat{F}_1(k,\ell), \tag{24}$$

where the last equality is obtained due to the probability simplex constraints on the columns of Â_1. Next, we characterize the term ∥E_A^{(1)}∥_F^2. To achieve this, by employing the result (22a) in Lemma 1, we fix the column permutation Π as follows:

$$\Pi(j,k) = \begin{cases} 1, & j = \arg\max_{\ell} \widehat{A}_1(k,\ell), \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$
We get the following set of relations:

$$\|\widehat{A}_1 - I_K\Pi\|_F^2 = \sum_{k=1}^{K}\|\widehat{A}_1(k,:) - e_k^\top\Pi\|_2^2 = \sum_{k=1}^{K}\|\widehat{A}_1(k,:)\|_2^2 + K - 2\sum_{k=1}^{K}\widehat{A}_1(k,:)\Pi^\top e_k = \|\widehat{A}_1\|_F^2 + K - 2\sum_{k=1}^{K}\widehat{A}_1(k,:)\Pi^\top e_k \le 2K - 2\sum_{k=1}^{K}\widehat{A}_1(k,:)\Pi^\top e_k \overset{(a)}{=} 2K - 2\sum_{k=1}^{K}\max_{\ell}\widehat{A}_1(k,\ell) \overset{(b)}{\le} 2K - 2\sum_{k=1}^{K}(1-\kappa) = 2K\kappa,$$

where the last inequality (b) is by (23) and the relation (a) is obtained by choosing Π as defined in (25). Hence, we have the following with probability greater than 1 − K/N^α:

$$\|E_A^{(1)}\|_F \le \|\widehat{A}_1 - I_K\Pi\|_F + \|N_W\|_F \le \sqrt{2K\kappa} + \sqrt{K}\xi_1. \tag{26}$$

We proceed to consider the error term ∥E_F^{(1)}∥_F^2. To achieve this, consider the following:

$$\|\widehat{F}_1 - \Pi^\top I_K\|_F^2 = \sum_{k=1}^{K}\|\widehat{F}_1(k,:) - e_k^\top\Pi^\top\|_2^2 = \sum_{k=1}^{K}\|\widehat{F}_1(k,:)\|_2^2 + K - 2\sum_{k=1}^{K}\widehat{F}_1(k,:)\Pi^\top e_k = \|\widehat{F}_1\|_F^2 + K - 2\sum_{k=1}^{K}\widehat{F}_1(k,:)\Pi^\top e_k \le 2K - 2\sum_{k=1}^{K}\widehat{F}_1(k,:)\Pi^\top e_k \overset{(a)}{=} 2K - 2\sum_{k=1}^{K}\max_{\ell}\widehat{F}_1(k,\ell) \overset{(b)}{\le} 2K - 2\sum_{k=1}^{K}(1-\kappa) = 2K\kappa,$$

where the last inequality (b) is by (24) and the relation (a) is by combining the definition of Π in (25) and (22c) in Lemma 1. Hence, we get the following with probability greater than 1 − K/N^α:

$$\|E_F^{(1)}\|_F \le \|\widehat{F}_1 - \Pi^\top I_K\|_F + \|N_F\|_F \le \sqrt{2K\kappa} + \sqrt{K}\xi_2. \tag{27}$$

Upper bound for ∥E_F^{(2)}∥_F. Next, we consider the second term in (17), i.e., ∥F_2^♮ + N_W F_2^♮ − Â_1 F̂_2∥_F. From (15), we have the following with probability greater than 1 − δ:

$$\zeta \ge \|F_2^♮ + N_W F_2^♮ - \widehat{A}_1\widehat{F}_2\|_F = \|F_2^♮ + N_W F_2^♮ - (I_K\Pi + N_W\Pi + E_A^{(1)})(E_F^{(2)} + \Pi^\top F_2^♮)\|_F = \|(I_K\Pi + N_W\Pi + E_A^{(1)})E_F^{(2)} + E_A^{(1)}\Pi^\top F_2^♮\|_F = \|\widehat{A}_1 E_F^{(2)} + E_A^{(1)}\Pi^\top F_2^♮\|_F \overset{(a)}{\ge} \|\widehat{A}_1 E_F^{(2)}\|_F - \|E_A^{(1)}\Pi^\top F_2^♮\|_F \overset{(b)}{\ge} \sigma_{\min}(\widehat{A}_1)\|E_F^{(2)}\|_F - \|E_A^{(1)}\|_F\|F_2^♮\|_F \overset{(c)}{\ge} \sigma_{\min}(\widehat{A}_1)\|E_F^{(2)}\|_F - (\sqrt{2K\kappa}+\sqrt{K}\xi_1)\|F_2^♮\|_F, \tag{28}$$

where the inequality (a) is by the triangle inequality, (b) is obtained by applying the following two relations for any two matrices A ∈ R^{K×K}, B ∈ R^{K×L}, L ≥ K:

$$\|AB\|_F \ge \sigma_{\min}(A)\|B\|_F, \tag{29}$$
$$\|AB\|_F \le \|A\|_F\|B\|_F, \tag{30}$$

and the inequality (c) is obtained by applying (26). Hence, (28) combined with the fact that ∥F_2^♮∥_F ≤ ∥F^♮∥_F gives the following relation with probability greater than 1 − δ − K/N^α:

$$\|E_F^{(2)}\|_F \le \frac{\zeta + (\sqrt{2K\kappa}+\sqrt{K}\xi_1)\|F^♮\|_F}{\sigma_{\min}(\widehat{A}_1)}. \tag{31}$$

Next, we use the following lemma to characterize σ_min(Â_1):

Lemma 2 Suppose that the matrix X ∈ R^{K×K} takes the form X = I_K + E_1 + E_2. Assume that ∥E_1∥_F ≤ υ_1 and ∥E_2∥_F ≤ υ_2, for certain υ_1, υ_2 > 0. Then, we have σ_min(X) ≥ |1 − υ_1 − υ_2|.

The proof is relegated to Sec. G. Applying Lemma 2, we get the following with probability greater than 1 − δ − K/N^α:

$$\|E_F^{(2)}\|_F \le \frac{\zeta + (\sqrt{2K\kappa}+\sqrt{K}\xi_1)\|F^♮\|_F}{|1 - \sqrt{2K\kappa} - 2\sqrt{K}\xi_1|} \le \frac{c_1(\zeta + \sqrt{K\kappa}\,\|F^♮\|_F)}{|1-\sqrt{K\kappa}|} \tag{32}$$

for a certain constant c_1 > 0, where we used the fact that ξ_1 ≤ √κ since κ = φ + ξ_1 + ξ_2 + √K ξ_1ξ_2 and κ ≤ 1.
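The permutation-fixing step used above (choose Π via the row-wise argmax of Â_1, as in the Eq.-(25)-type construction) can be illustrated numerically; the sketch below is our own and uses a synthetic Â_1 close to a permuted identity:

```python
import numpy as np

rng = np.random.default_rng(7)
K = 4

# Ground truth: a permutation matrix P0, observed through small noise.
perm = rng.permutation(K)
P0 = np.eye(K)[perm]                        # row k has a 1 at column perm[k]
A1_hat = P0 + 0.05 * rng.random((K, K))     # estimate near a permuted identity

# Fix the permutation via the row-wise argmax of A1_hat.
Pi_hat = np.zeros((K, K))
Pi_hat[np.arange(K), A1_hat.argmax(axis=1)] = 1.0
```

Here Pi_hat recovers P0 exactly, and A1_hat @ Pi_hat.T is entrywise close to I_K, mirroring the bound on ∥Â_1 − I_KΠ∥_F.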

Upper bound for ∥E_A^{(m)}∥_F, m > 1. We consider the third term on the R.H.S. of (17). From (16), for each m, we have the following with probability greater than 1 − K/N^α:

$$\sqrt{K}\varphi \ge \|A_m^♮ + A_m^♮ N_F - \widehat{A}_m\widehat{F}_1\|_F = \|A_m^♮ + A_m^♮N_F - (A_m^♮\Pi + E_A^{(m)})(\Pi^\top I_K + \Pi^\top N_F + E_F^{(1)})\|_F = \|E_A^{(m)}(\Pi^\top I_K + \Pi^\top N_F + E_F^{(1)}) + A_m^♮\Pi E_F^{(1)}\|_F = \|E_A^{(m)}\widehat{F}_1 + A_m^♮\Pi E_F^{(1)}\|_F \overset{(a)}{\ge} \|E_A^{(m)}\widehat{F}_1\|_F - \|A_m^♮\Pi E_F^{(1)}\|_F \overset{(b)}{\ge} \sigma_{\min}(\widehat{F}_1)\|E_A^{(m)}\|_F - \sigma_{\max}(A_m^♮)\|E_F^{(1)}\|_F \overset{(c)}{\ge} \sigma_{\min}(\widehat{F}_1)\|E_A^{(m)}\|_F - (\sqrt{2K\kappa}+\sqrt{K}\xi_2)\sigma_{\max}(A_m^♮),$$

where the relation (a) is by the triangle inequality, (b) is by applying (29) and (30), and (c) is via (27). Hence, for all m > 1, we get the following with probability greater than 1 − 2K/N^α:

$$\|E_A^{(m)}\|_F \le \frac{\sqrt{K}\varphi + (\sqrt{2K\kappa}+\sqrt{K}\xi_2)\sigma_{\max}(A_m^♮)}{|1-\sqrt{2K\kappa}-2\sqrt{K}\xi_2|} \le \frac{c_2\sqrt{K\kappa}\,\sigma_{\max}(A_m^♮)}{|1-\sqrt{K\kappa}|} \tag{33}$$

for a certain constant c_2 > 0, where we used the facts that φ ≤ √κ and ξ_2 ≤ √κ, since κ = φ + ξ_1 + ξ_2 + √K ξ_1ξ_2 and κ ≤ 1. In the above, we have also applied Lemma 2 and (27), which together give σ_min(F̂_1) ≥ |1 − √(2Kκ) − 2√K ξ_2|.

Putting Together. From (33), we have the following with probability greater than 1 − 2K/N^α:

$$\|E_A^{(m)}\|_F^2 = \|\widehat{A}_m - A_m^♮\Pi\|_F^2 \le \frac{c_2^2 K^2\kappa}{|1-\sqrt{K\kappa}|^2}, \quad \forall m, \tag{34}$$

where we have used the fact that σ_max(A_m^♮) ≤ ∥A_m^♮∥_F ≤ √K. Similarly, by combining (27) and (32), we have the following with probability greater than 1 − δ − 2K/N^α:

$$\|\widehat{F} - \Pi^\top F^♮\|_F^2 = \|E_F^{(1)}\|_F^2 + \|E_F^{(2)}\|_F^2 \le (\sqrt{2K\kappa}+\sqrt{K}\xi_2)^2 + \frac{c_1^2(\zeta+\sqrt{K\kappa}\,\|F^♮\|_F)^2}{|1-\sqrt{K\kappa}|^2} \le \frac{c_3(\zeta+\sqrt{NK\kappa})^2}{|1-\sqrt{K\kappa}|^2} \tag{35}$$

for a certain constant c_3 > 0, where we have used the fact that ∥F^♮∥_F ≤ √N. We hope to characterize the generalization performance of the learned function f̂ from (35). Towards this, we have the following result:

Lemma 3 Under Assumptions 1 and 2, the following holds with probability greater than 1 − δ:

$$\mathbb{E}_{x\sim\mathcal{D}}\|\widehat{f}(x) - \Pi^\top f^♮(x)\|_2^2 \le \frac{1}{N}\|\widehat{F} - \Pi^\top F^♮\|_F^2 + 64N^{-5/8}(2\|X\|_F R_{\mathcal{F}})^{1/4} + 16\sqrt{\frac{2\log(4/\delta)}{N}}. \tag{36}$$

The proof is given in Sec. H. Hence, combining Lemma 3 with (35), we have the following with probability greater than 1 − 2δ − 2K/N^α:

$$\mathbb{E}_{x\sim\mathcal{D}}\|\widehat{f}(x) - \Pi^\top f^♮(x)\|_2^2 \le \frac{c_3(\zeta+\sqrt{NK\kappa})^2}{N|1-\sqrt{K\kappa}|^2} + 64N^{-5/8}(2\|X\|_F R_{\mathcal{F}})^{1/4} + 16\sqrt{\frac{2\log(4/\delta)}{N}}, \tag{37}$$

where

$$\kappa \le \varphi + 2\sqrt{K}(\xi_1+\xi_2),$$
$$\zeta^2 \le NM\left(4\sqrt{2}K\beta\,\mathcal{R}_S(\mathcal{P}) + 10\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + 2\beta\sqrt{K}\nu\right),$$
$$\varphi^2 \le N^{\alpha}M\left(4\sqrt{2}K\beta\,\mathcal{R}_S(\mathcal{P}) + 10\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + 2\beta\sqrt{K}\nu + 4\delta\right),$$
$$\mathcal{R}_S(\mathcal{P}) = \frac{16}{\sqrt{S}}\big(MK\log(4S\sqrt{K}) + (2\|X\|_F R_{\mathcal{F}})^{1/2}\big).$$

Also note that the final bounds in (34) and (37) hold only if κ ≤ 1/(K+1), as required by Lemma 1. This gives the final conditions on ξ_1, ξ_2, ν, and S. By letting δ = 1/S, we obtain the final results in Theorem 1.

C PROOF OF PROPOSITION 1

Let us consider the following notation: P(ω) = p_n^{(m)}, where ω = (m, n) ∈ [M] × [N]; i.e., P(ω) "reads out" the (m, n)th block in (10). We also define Y(ω) = y_n^{(m)} and

$$D_\Pi(P, Y) \triangleq \mathbb{E}_{S\sim\Pi}[\mathrm{CE}(P(\omega), Y(\omega))] = \sum_{m=1}^{M}\sum_{n=1}^{N} \pi_n^{(m)}\,\mathrm{CE}(p_n^{(m)}, y_n^{(m)}), \tag{38}$$

where Π denotes the uniform distribution and π_n^{(m)} = 1/(NM) denotes the probability of observing the annotation for the index pair (m, n). Eq. (12) implies

$$D_S(\widehat{P}; Y) \le D_S(\overline{P}; Y), \tag{39}$$

where P̄ is defined using the construction P̄ = W^♮[f̄(x_1) ⋯ f̄(x_N)], and f̄ is a learning function constructed under Assumption 3. To be specific, f̄ satisfies ∥f̄(x) − f^♮(x)∥_2 ≤ ν for all x ∼ D. Using (39), we have the error decomposition

$$\mathbb{E}[D_\Pi(\widehat{P}; Y)] - \mathbb{E}[D_\Pi(P^♮; Y)] \le \big(\mathbb{E}[D_\Pi(\widehat{P}; Y)] - D_S(\widehat{P}; Y)\big) + \big(D_S(P^♮; Y) - \mathbb{E}[D_\Pi(P^♮; Y)]\big) + \big(D_S(\overline{P}; Y) - D_S(P^♮; Y)\big), \tag{40}$$

where the first inequality is obtained from (39). Let us consider the L.H.S. of (40):

$$\mathbb{E}[D_\Pi(\widehat{P}; Y)] - \mathbb{E}[D_\Pi(P^♮; Y)] = \sum_{n=1}^{N}\sum_{m=1}^{M}\pi_n^{(m)}\sum_{k=1}^{K} P^♮((m-1)K+k, n)\log\frac{P^♮((m-1)K+k, n)}{\widehat{P}((m-1)K+k, n)} = D_{\mathrm{KL}}(P^♮, \widehat{P}), \tag{41}$$

where the expectation is taken w.r.t. Y (while taking the expectation, we used the uniform probability π_n^{(m)} = 1/(NM) and E[I[y_n^{(m)} = k]] = P^♮((m−1)K+k, n)), and D_KL(P^♮, P̂) is the average Kullback-Leibler (KL) divergence between the blocks of the matrices P^♮ and P̂ ∈ R^{MK×N}, given by

$$D_{\mathrm{KL}}(P^♮,\widehat{P}) = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} D_{\mathrm{KL}}(p_n^{♮(m)}, \widehat{p}_n^{(m)}).$$

Upper-bounding the first term on the R.H.S. of (40). We invoke the following theorem (Theorem 26.5 in (Shalev-Shwartz & Ben-David, 2014)):

Theorem 4 (Shalev-Shwartz & Ben-David, 2014, Theorem 26.5) Assume that for all y and for all x, we have |CE(x; y)| ≤ z_max. Then, for any P ∈ 𝒫, the following holds with probability greater than 1 − δ:

$$\mathbb{E}[D_\Pi(P; Y)] - D_S(P; Y) \le 2\mathcal{R}_S(\ell\circ\mathcal{P}\circ S) + 4z_{\max}\sqrt{\frac{2\log(4/\delta)}{S}}, \tag{42}$$

To apply Theorem 4, we characterize R_S(ℓ∘𝒫∘S), defined as follows (Shalev-Shwartz & Ben-David, 2014):

$$\mathcal{R}_S(\ell\circ\mathcal{P}\circ S) \triangleq \frac{1}{S}\mathbb{E}\Big[\sup_{P\in\mathcal{P}}\sum_{s=1}^{S}\sigma_s\,\mathrm{CE}(P(\omega_s); Y(\omega_s))\Big], \tag{43}$$

where the expectation is w.r.t. the independent Rademacher random variables σ_s ∈ {−1, 1}. Note that P(ω) is vector-valued [see Eq. (38)]. Hence, we invoke the following contraction result to upper-bound R_S(ℓ∘𝒫∘S):

Lemma 4 (Maurer, 2016) Let 𝒫 be a class of mappings {P : X → R^K}, where X is any set and (ω_1, . . . , ω_S) ∈ X^S. Also assume that ℓ : R^K → R has Lipschitz constant L. Then

$$\mathbb{E}\Big[\sup_{P\in\mathcal{P}}\sum_{s=1}^{S}\sigma_s\,\ell(P(\omega_s))\Big] \le \sqrt{2}L\,\mathbb{E}\Big[\sup_{P\in\mathcal{P}}\sum_{s,k}\sigma_{sk}P_k(\omega_s)\Big],$$

where P_k(ω) denotes the kth component of P(ω), σ_{sk} is an independent (doubly indexed) Rademacher random variable, and the expectations are taken w.r.t. the Rademacher random variables.

Let us define the vector z ≜ (P_1(ω_1), P_2(ω_1), . . . , P_{K−1}(ω_S), P_K(ω_S)) ∈ R^{SK} and the set Z ≜ {z = (P_1(ω_1), . . . , P_K(ω_S)) | P ∈ 𝒫}. With these definitions, we apply Lemma 4 in (43) and obtain

$$\mathcal{R}_S(\ell\circ\mathcal{P}\circ S) \le \frac{\sqrt{2}\beta}{S}\mathbb{E}\Big[\sup_{P\in\mathcal{P}}\sum_{s,k}\sigma_{s,k}P_k(\omega_s)\Big] = \frac{\sqrt{2}\beta}{S}\mathbb{E}\Big[\sup_{z\in Z}\sum_i \sigma_i z(i)\Big] = \sqrt{2}\beta K\cdot\frac{1}{SK}\mathbb{E}\Big[\sup_{z\in Z}\sum_i\sigma_i z(i)\Big] = \sqrt{2}\beta K\,\mathcal{R}_S(Z), \tag{44}$$

where β is an upper bound of the Lipschitz constant of the cross-entropy loss CE(x; y) = −Σ_{k=1}^{K} I[y = k] log x(k) when x ∈ ∆_K with x(k) > 1/β for all k.

Next, we characterize R_S(Z) using the covering number of the set Z.

Definition 2 (Vershynin, 2012) An ϵ-net of the set Z (denoted Z̄) is a finite subset of Z (i.e., Z̄ ⊆ Z) such that for any z ∈ Z there exists z̄ ∈ Z̄ satisfying ∥z − z̄∥_2^2 ≤ ϵ. The smallest cardinality of the ϵ-nets of Z is known as the covering number of Z, denoted N(ϵ, Z).

Let us consider a pair of vectors z, z̄ ∈ Z as below:

$$z = (P_1(\omega_1), \ldots, P_K(\omega_S)) \in \mathbb{R}^{SK},\ \ P(\omega_s) = A_{m_s}F(:, n_s); \qquad \bar{z} = (\bar{P}_1(\omega_1), \ldots, \bar{P}_K(\omega_S)) \in \mathbb{R}^{SK},\ \ \bar{P}(\omega_s) = \bar{A}_{m_s}\bar{F}(:, n_s),$$

where ω_s = (m_s, n_s) and all of A_m, Ā_m, F, and F̄ satisfy the nonnegativity constraints and have unit ℓ_1-norm columns. Then, we have

$$\|z-\bar{z}\|_2^2 = \sum_{s=1}^{S}\|A_{m_s}F(:,n_s) - \bar{A}_{m_s}\bar{F}(:,n_s)\|_2^2 \le \sum_{s=1}^{S}\|A_{m_s}F(:,n_s) - \bar{A}_{m_s}\bar{F}(:,n_s)\|_2 \le \sum_{s=1}^{S}\big(\|A_{m_s}-\bar{A}_{m_s}\|_F\|F(:,n_s)\|_2 + \|\bar{A}_{m_s}\|_F\|F(:,n_s)-\bar{F}(:,n_s)\|_2\big) \le \sum_{s=1}^{S}\|A_{m_s}-\bar{A}_{m_s}\|_F + \sqrt{K}\sum_{s=1}^{S}\|F(:,n_s)-\bar{F}(:,n_s)\|_2,$$

where the first inequality is by ∥x∥_2^2 ≤ ∥x∥_2 when the entries of x are smaller than 1, the second is by the triangle inequality, and the last uses the facts that the Frobenius norms of the A_m are bounded by √K and the ℓ_2 norm of any column of F is bounded by 1. Hence, to obtain an ε-net of the set Z (i.e., ∥z − z̄∥ ≤ ε), it suffices to have an ε²/(2√K)-net for F∘S and an ε²/(2S)-net for each A_m, since Σ_{s=1}^{S} ε²/(2S) + √K · ε²/(2√K) = ε². Here F∘S denotes

$$\mathcal{F}\circ S = \{[f(x_{n_1}), \ldots, f(x_{n_S})] \in \mathbb{R}^{K\times S} \mid f \in \mathcal{F}\},$$

where (m_s, n_s) = ω_s ∈ S. Note that the full-rank matrix A_m ∈ R^{K×K} can be represented as a K²-dimensional vector whose Euclidean norm is bounded by √K. Hence, the cardinality of the ε²/(2S)-net for A_m ∈ R^{K×K} is at most (4SK√K/ε²)^{K²} (Shalev-Shwartz & Ben-David, 2014). Next, we consider the covering number of the function class F∘S. Using Lemma 14 of (Lin & Zhang, 2019), we get

$$N\Big(\frac{\varepsilon^2}{2\sqrt{K}}, \mathcal{F}\circ S\Big) \le \exp\Big(\frac{2\sqrt{K}\|X\|_F R_{\mathcal{F}}}{\varepsilon^2}\Big),$$

where X = [x_{n_1}, . . . , x_{n_S}] ∈ R^{d×S} and the parameter R_F is from Assumption 2. Using these covering-number results, the cardinality of the ε-net of Z is bounded as

$$N(\varepsilon, Z) \le \Big(\frac{4SK\sqrt{K}}{\varepsilon^2}\Big)^{MK^2}\exp\Big(\frac{2\sqrt{K}\|X\|_F R_{\mathcal{F}}}{\varepsilon^2}\Big). \tag{45}$$

Now that we have characterized N(ϵ, Z), we invoke the following lemma to obtain the Rademacher complexity R_S(Z):

Lemma 5 (Bartlett et al., 2017, Lemma A.5) The empirical Rademacher complexity of the set Z with respect to the observed set S of size SK is upper-bounded as follows:

$$\mathcal{R}_S(Z) \le \inf_{a>0}\Big(\frac{4a}{\sqrt{SK}} + \frac{12}{SK}\int_a^{\sqrt{SK}}\sqrt{\log N(\mu, Z)}\,d\mu\Big).$$

We apply (45) in Lemma 5 and obtain

$$\mathcal{R}_S(Z) \overset{(a)}{\le} \inf_{a>0}\Big(\frac{4a}{\sqrt{SK}} + \frac{12}{\sqrt{SK}}\sqrt{\log N(a, Z)}\Big) \overset{(b)}{\le} \inf_{a>0}\Big(\frac{4a}{\sqrt{SK}} + \frac{12}{\sqrt{SK}}\Big(MK^2\log\frac{4SK\sqrt{K}}{a^2} + \frac{2\sqrt{K}\|X\|_F R_{\mathcal{F}}}{a^2}\Big)^{1/2}\Big) \overset{(c)}{\le} \frac{4}{\sqrt{S}} + \frac{12}{\sqrt{SK}}\big(MK^2\log(4S\sqrt{K}) + (2\|X\|_F R_{\mathcal{F}})^{1/2}\big) \le \frac{16}{\sqrt{S}}\big(MK\log(4S\sqrt{K}) + (2\|X\|_F R_{\mathcal{F}})^{1/2}\big), \tag{47}$$

where the first inequality (a) uses the relation ∫_a^{√(SK)} √(log N(μ, Z)) dμ ≤ √(SK) √(log N(a, Z)). Combining the upper bound of R_S(Z) in (47) with the upper bound of R_S(ℓ∘𝒫∘S) in (44) and the result in (42), we get that with probability greater than 1 − δ,

$$\mathbb{E}[D_\Pi(\widehat{P}; Y)] - D_S(\widehat{P}; Y) \le 2\sqrt{2}\beta K\,\mathcal{R}_S(Z) + 4z_{\max}\sqrt{\frac{2\log(4/\delta)}{S}}, \tag{48}$$

where z_max is the upper bound of the value of CE(x, y), characterized as

$$z_{\max} = \max_{x(k)>1/\beta,\ y\in[K]} \mathrm{CE}(x, y) \le \max_{x(k)>1/\beta,\ y\in[K]} -\sum_{k=1}^{K}\mathbb{I}[y=k]\log x(k) \le \max_{u>1/\beta}(-\log u) = \log(\beta). \tag{49}$$

Upper-bounding the second term on the R.H.S. of (40). Next, we proceed to upper-bound the second term on the R.H.S. of (40). Let us consider Hoeffding's inequality:

Lemma 6 Let Z_1, . . . , Z_S be independent bounded random variables with Z_s ∈ [z_min, z_max] for all s, where −∞ < z_min ≤ z_max < ∞. Then, for all t ≥ 0,

$$\Pr\Big(\frac{1}{S}\sum_{s=1}^{S}(Z_s - \mathbb{E}[Z_s]) \ge t\Big) \le \exp\Big(-\frac{2St^2}{(z_{\max}-z_{\min})^2}\Big).$$

To use Lemma 6, define the random variable Z_n^{(m)} ≜ CE(p_n^{♮(m)}, y_n^{(m)}), where p_n^{♮(m)} = A_m^♮f^♮(x_n). The maximum and minimum values of Z_n^{(m)} are z_max = log(β) (see (49)) and z_min = 0, respectively. Then, invoking Lemma 6, one can obtain

$$\Pr\Big(D_S(P^♮; Y) - \mathbb{E}[D_\Pi(P^♮; Y)] \ge t\Big) \le \exp\Big(-\frac{2St^2}{(\log\beta)^2}\Big). \tag{50}$$

Hence, by substituting t = log(β)√(log(1/δ)/(2S)), where δ ∈ (0, 1), in (50), we get that with probability greater than 1 − δ,

$$D_S(P^♮; Y) - \mathbb{E}[D_\Pi(P^♮; Y)] \le \log(\beta)\sqrt{\frac{\log(1/\delta)}{2S}}. \tag{51}$$

Upper-bounding the third term on the R.H.S. of (40).

$$D_S(\overline{P}; Y) - D_S(P^♮; Y) = -\frac{1}{S}\sum_{(m,n)\in S}\sum_{k=1}^{K}\mathbb{I}[y_n^{(m)}=k]\log\overline{P}((m-1)K+k, n) + \frac{1}{S}\sum_{(m,n)\in S}\sum_{k=1}^{K}\mathbb{I}[y_n^{(m)}=k]\log P^♮((m-1)K+k, n) \le \frac{1}{S}\sum_{(m,n)\in S}\sum_{k=1}^{K}\mathbb{I}[y_n^{(m)}=k]\,\big|\log[A_m^♮f^♮(x_n)]_k - \log[A_m^♮\overline{f}(x_n)]_k\big| \le \frac{1}{S}\sum_{(m,n)\in S}\sum_{k=1}^{K}\mathbb{I}[y_n^{(m)}=k]\,\beta\big|[A_m^♮f^♮(x_n)]_k - [A_m^♮\overline{f}(x_n)]_k\big| \le \frac{1}{S}\sum_{(m,n)\in S}\sum_{k=1}^{K}\mathbb{I}[y_n^{(m)}=k]\,\beta\|A_m^♮(k,:)\|_2\|f^♮(x_n)-\overline{f}(x_n)\|_2 \le \beta\sqrt{K}\nu, \tag{52}$$

where the first inequality uses the triangle inequality, the second uses the Lipschitz continuity of the log function, the third is via the Cauchy-Schwarz inequality, and the last employs Assumption 3.

Putting Together. Hence, by combining the result in (48) with (41), (40), (51), and (52), we get that with probability greater than 1 − 2δ,

$$D_{\mathrm{KL}}(P^♮, \widehat{P}) \le 2\sqrt{2}\beta K\,\mathcal{R}_S(Z) + 4\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + \log(\beta)\sqrt{\frac{\log(1/\delta)}{2S}} + \beta\sqrt{K}\nu. \tag{53}$$

Using Pinsker's inequality (S, 1960; Fedotov et al., 2003), we get

$$D_{\mathrm{KL}}(P^♮, \widehat{P}) = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} D_{\mathrm{KL}}(p_n^{♮(m)}, \widehat{p}_n^{(m)}) \ge \frac{1}{2NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\|p_n^{♮(m)}-\widehat{p}_n^{(m)}\|_1^2 \ge \frac{1}{2NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\|p_n^{♮(m)}-\widehat{p}_n^{(m)}\|_2^2,$$

where the last inequality uses the fact that ∥x∥_1 ≥ ∥x∥_2. The above relation combined with (53) implies that with probability greater than 1 − 2δ,

$$\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\|p_n^{♮(m)}-\widehat{p}_n^{(m)}\|_2^2 \le 4\sqrt{2}\beta K\,\mathcal{R}_S(Z) + 10\log(\beta)\sqrt{\frac{2\log(4/\delta)}{S}} + 2\beta\sqrt{K}\nu,$$

where R_S(Z) is upper-bounded in (47).
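The Pinsker step used above, D_KL(p, q) ≥ ½∥p − q∥_1² ≥ ½∥p − q∥_2², is easy to sanity-check numerically (our own sketch, not part of the proof):

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence between discrete distributions with p, q > 0 entrywise."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
p = rng.random(5)
p /= p.sum()
q = rng.random(5)
q /= q.sum()

pinsker_lower = 0.5 * np.linalg.norm(p - q, 1) ** 2   # (1/2) ||p - q||_1^2
l2_lower = 0.5 * np.linalg.norm(p - q, 2) ** 2        # (1/2) ||p - q||_2^2
```

Since ∥x∥_1 ≥ ∥x∥_2, the ℓ_2 lower bound is the weaker of the two, exactly as used in the last inequality of the proof.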

D PROOF OF THEOREM 2

The proof of Theorem 2 utilizes some key results from the proof of Theorem 1. We start by employing the result from Proposition 1. Let us fix δ = 1/S and α = 1/8 in the result of Proposition 1. Then, it

E PROOF OF THEOREM 3

To prove Theorem 3, let us first consider the CCEM criterion in (6). Let Ŵ and F̂ be any optimal solution of (6) and P̂ = ŴF̂. From Proposition 1, by fixing δ = 1/S and α = 1/8 and using the conditions in the statement of Theorem 3, i.e., f^♮ ∈ F (implying ν = 0), S ≥ C_1t, and N ≤ C_2t³ for certain constants C_1, C_2 > 0, we get the following in the limit t → ∞:

$$\|P^♮ - \widehat{P}\|_F^2 = 0 \text{ with probability } 1, \text{ i.e., } \widehat{P} = W^♮F^♮. \tag{62}$$

Proof of Theorem 3(a): We consider the following result, which is distilled and summarized from the proof of Theorem 1 in (Fu et al., 2015):

Lemma 8 Suppose a matrix Y ∈ R^{K×J} satisfies Y ≥ 0, 1^⊤Y = 1^⊤, rank(Y) = K, and the SSC. Then, for any Ȳ = QY satisfying Ȳ ≥ 0, 1^⊤Ȳ = 1^⊤, the following holds: |det(Q)| ≤ 1, with equality only if Q is a permutation matrix.

Let us start by considering the following criterion:

$$\underset{W,F}{\text{maximize}}\ \det(FF^\top) \quad \text{s.t.}\ \widehat{P} = WF,\ F \ge 0,\ 1^\top F = 1^\top, \tag{63}$$

where P̂ is the optimal solution from the CCEM criterion (6) and hence, as shown before, satisfies (62). Let W* and F* be optimal solutions of (63); then the following holds:

$$\det(F^*F^{*\top}) \ge \det(F^♮F^{♮\top}). \tag{64}$$

From (62), we observe that W* and F* satisfy W* = W^♮Q^{-1} and F* = QF^♮ for a certain invertible matrix Q. Let us assume that Q is not a permutation matrix. Then, we have

$$\det(F^*F^{*\top}) = \det(QF^♮F^{♮\top}Q^\top) = |\det(Q)|^2\det(F^♮F^{♮\top}) < \det(F^♮F^{♮\top}),$$

where the last inequality is from Lemma 8 using the SSC condition on F^♮. Note that this contradicts (64). Hence, Q must be a permutation matrix. This implies that the optimal solutions W* and F* of Problem (63) satisfy

$$W^* = W^♮\Pi, \qquad F^* = \Pi^\top F^♮. \tag{65}$$

Following the last part of Theorem 2, by using Lemma 3 and Fact 1, we have A_m^* = A_m^♮Π for all m and f^*(x) = Π^⊤f^♮(x) for all x ∼ D.

Note that since log is a monotonically increasing function, using log det(FF^⊤) in (63) in place of det(FF^⊤) does not change the optimal solution, yet it keeps the objective function in a differentiable domain (this is because det(X) becomes zero if X is singular). Hence, we can conclude that the criterion

$$\underset{W,F}{\text{maximize}}\ \log\det(FF^\top) \quad \text{s.t.}\ \widehat{P} = WF,\ F \ge 0,\ 1^\top F = 1^\top \tag{66}$$

also results in the same optimal solutions W* and F* satisfying (65).

Proof of Theorem 3(b): We consider the following result, which is a modified version of Lemma A.1 from (Huang et al., 2015):

Lemma 9 Suppose a matrix X ∈ R^{I×K} satisfies X ≥ 0, 1^⊤X = ρ1^⊤, ρ > 0, rank(X) = K, and the SSC. Then, for any X̄ = XQ satisfying X̄ ≥ 0, 1^⊤X̄ = ρ1^⊤, the following holds: |det(Q)| ≤ 1, with equality only if Q is a permutation matrix.

Note that, in Lemma 9, we are given that 1^⊤X = ρ1^⊤. Then, by multiplying Q on both sides, we obtain 1^⊤XQ = ρ1^⊤Q (67). Since 1^⊤XQ = ρ1^⊤ also holds by the assumption in Lemma 9, combining with (67) we get ρ1^⊤Q = ρ1^⊤, i.e., 1^⊤Q = 1^⊤ (68). The result in (68) is used in order to conclude Lemma 9 from Lemma A.1 of (Huang et al., 2015). We consider the following criterion:

$$\underset{W,F}{\text{maximize}}\ \det(W^\top W) \quad \text{s.t.}\ \widehat{P} = WF,\ W \ge 0,\ 1^\top W = M1^\top, \tag{69}$$

where we used the constraint 1^⊤W = M1^⊤ since 1^⊤A_m = 1^⊤ for all m. Let W* and F* be optimal solutions of (69); then we have

$$\det(W^{*\top}W^*) \ge \det(W^{♮\top}W^♮). \tag{70}$$

Applying (62), W* and F* satisfy W* = W^♮Q and F* = Q^{-1}F^♮ for a certain invertible matrix Q. Let us assume that Q is not a permutation matrix. Then, we have

$$\det(W^{*\top}W^*) = \det(Q^\top W^{♮\top}W^♮Q) = |\det(Q)|^2\det(W^{♮\top}W^♮) < \det(W^{♮\top}W^♮),$$

where the last inequality is from Lemma 9 using the SSC condition on W^♮, which contradicts (70). Hence, Q must be a permutation matrix. This implies that the optimal solutions W* and F* of Problem (69) satisfy W* = W^♮Π and F* = Π^⊤F^♮. Similar to the last part of Theorem 2, by using Lemma 3 and Fact 1, we have A_m^* = A_m^♮Π for all m and f^*(x) = Π^⊤f^♮(x) for all x ∼ D. In this case as well, the optimal solutions do not change if we employ log det(W^⊤W) in (69) in place of det(W^⊤W).

Remark 2 Geometrically, Theorem 3 seeks maximum-volume solutions w.r.t. the conic hulls of F and W^⊤, respectively. One can also note that in both cases of Theorem 3, the corresponding optimal solutions do not change if we minimize log det(W^⊤W) in (63) and minimize log det(FF^⊤) in (69). However, relying on the SSC of F^♮ (W^♮) and minimizing the volume of the conic hull of W^⊤ (F) may be inefficient, since P^♮ = W^♮F^♮ does not hold exactly in practice.
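The determinant objectives in (63) and (69) are typically evaluated through log det for numerical stability; below is a minimal sketch (our own illustration, not the GeoCrowdNet implementation) that computes log det(FF^⊤) and log det(W^⊤W) for feasible factors using slogdet:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 6, 3, 50

# Simplex-valued classifier outputs F (columns sum to 1), feasible for (63).
F = rng.random((K, N))
F /= F.sum(axis=0, keepdims=True)

# Stacked column-stochastic confusion matrices W, so 1^T W = M 1^T as in (69).
A = rng.random((M, K, K))
A /= A.sum(axis=1, keepdims=True)
W = A.reshape(M * K, K)

# slogdet avoids overflow/underflow compared to det() on Gram matrices.
sign_F, logdet_F = np.linalg.slogdet(F @ F.T)
sign_W, logdet_W = np.linalg.slogdet(W.T @ W)
```

A positive sign confirms the Gram matrices are nonsingular, so the log-det volume terms are well defined, which is exactly the differentiability point made after (65).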

F PROOF OF LEMMA 1

Let us define some notations: ℓ * k = arg max ℓ∈[K] A 1 (k, ℓ) q * k = arg max q∈[K] F 1 (q, k). Consider the following conditions given in Lemma 1 1 -κ ≤ | A 1 (k, :) F 1 (:, k)| ≤ 1 + κ, ∀k ∈ [K] (72a) | A 1 (k, :) F 1 (:, j)| ≤ κ, ∀k ̸ = j. (72b) F.1 PROVING ℓ * k ̸ = q * j , ∀k ̸ = j We begin by noticing the below relation from ( 72a): (1 -κ) ≤ A 1 (k, 1) F 1 (1, k) + • • • + A 1 (k, K) F 1 (K, k) ≤ max ℓ∈[K] A 1 (k, ℓ)( F 1 (1, k) + • • • + F 1 (K, k)) = max ℓ∈[K] A 1 (k, ℓ) = A 1 (k, ℓ * k ), ∀k, where the first equality employs the probability simplex constraints on the columns of F . We further proceed to prove by using contradiction. First, assume the below for certain k ̸ = j, ℓ * k = arg max ℓ A 1 (k, ℓ) = arg max q F 1 (q, j) = q * j . (74) Under the assumption (74), consider (72b). For (72b) to hold, we need to satisfy the below for a certain k ̸ = j: A 1 (k, ℓ * k ) F 1 (q * j , j) < κ =⇒ F 1 (q * j , j) < κ 1 -κ where the last relation is obtained from (73). This also implies that 1 = K q=1 F 1 (q, k) < Kκ 1 -κ =⇒ κ > 1 K + 1 . However, Lemma 1 assumes that κ ≤ 1 K+1 , Hence, the assumption (74) does not hold and we get arg max ℓ A 1 (k, ℓ) ̸ = arg max q F 1 (q, j), k ̸ = j. (75) F.2 PROVING ℓ * k = q * k , ∀k From the result in (75), we consider (72b) for a certain k ̸ = j. A 1 (k, ℓ * k ) F 1 (q, j) ≤ κ =⇒ F 1 (q, j) ≤ κ 1 -κ , q ̸ = q * j . ( ) where the last relation is obtained from (73). Since q F 1 (q, j) = 1, we get F 1 (q * j , j) ≥ 1 -Kκ 1 -κ , ∀j. From the result in (75) and the condition in (72b), we also have the following for a certain k ̸ = j A 1 (k, ℓ) F 1 (q * j , j) ≤ κ =⇒ A 1 (k, ℓ) ≤ κ(1 -κ) 1 -Kκ , ℓ ̸ = ℓ * k , where the last relation by ( 77). Next, we proceed to prove by contradiction. First, assume the below for certain k ℓ * k ̸ = q * k . 
Hence, for each $k$, we have
$$|\widehat{A}_1(k,:)\widehat{F}_1(:,k)| = \widehat{A}_1(k,1)\widehat{F}_1(1,k) + \cdots + \widehat{A}_1(k,K)\widehat{F}_1(K,k) \le \sum_{q \ne q_k^*} \widehat{F}_1(q,k) + \widehat{A}_1(k,\ell) \le \frac{(K-1)\kappa}{1-\kappa} + \frac{\kappa(1-\kappa)}{1-K\kappa}, \quad \ell \ne \ell_k^*, \quad (80)$$
where we have used the fact that all the entries of $\widehat{A}_1$ and $\widehat{F}_1$ are smaller than 1 in the first inequality, and have applied (76) and (78) in the last inequality. The lower-bound condition in (72a) gives that, for each $k$,
$$|\widehat{A}_1(k,:)\widehat{F}_1(:,k)| \ge 1 - \kappa. \quad (81)$$
Then, comparing (80) and (81), we would need
$$\frac{(K-1)\kappa}{1-\kappa} \ge 1 - \kappa \ \Longrightarrow\ \kappa \ge \frac{K+1-\sqrt{(K+1)^2-4}}{2}.$$
However, $\frac{K+1-\sqrt{(K+1)^2-4}}{2} \ge \frac{1}{K+1}$, which leads to a contradiction to our assumption that $\kappa \le \frac{1}{K+1}$. Hence, the assumption (79) does not hold and we get
$$\arg\max_{\ell} \widehat{A}_1(k,\ell) = \arg\max_{q} \widehat{F}_1(q,k), \ \forall k. \quad (82)$$
Combining (75) and (82), we get
$$\arg\max_{\ell} \widehat{A}_1(k,\ell) \ne \arg\max_{\ell} \widehat{A}_1(j,\ell), \ \forall k \ne j, \qquad \arg\max_{q} \widehat{F}_1(q,k) \ne \arg\max_{q} \widehat{F}_1(q,j), \ \forall k \ne j.$$

G PROOF OF LEMMA 2

We have
$$\sigma_{\min}(X) = \sigma_{\min}(I_K + E_1 + E_2) = \min_{\|x\|_2=1} \|(I_K + E_1 + E_2)x\|_2 \overset{(a)}{\ge} \min_{\|x\|_2=1} \big|\,\|I_K x\|_2 - \|(E_1+E_2)x\|_2\,\big| \ge \min_{\|x\|_2=1} \|I_K x\|_2 - \max_{\|x\|_2=1} \|(E_1+E_2)x\|_2 = \big|1 - \|E_1+E_2\|_2\big| \overset{(b)}{\ge} \big|1 - \|E_1\|_F - \|E_2\|_F\big| \overset{(c)}{\ge} |1 - \upsilon_1 - \upsilon_2|, \quad (83)$$
where the inequality (a) employs the triangle inequality, (b) is obtained via the matrix norm relations $\|E_1+E_2\|_2 \le \|E_1+E_2\|_F \le \|E_1\|_F + \|E_2\|_F$, and (c) is obtained from the assumptions that $\|E_1\|_F \le \upsilon_1$ and $\|E_2\|_F \le \upsilon_2$.

H PROOF OF LEMMA 3

First, let us define the following w.r.t. the loss function $g(f,y) = \|f - y\|_2^2$, $\forall f \in \mathcal{F}, y \in \mathbb{R}^K$, and the input data $X = (x_1,\ldots,x_N)$, where each $x_n$ is sampled i.i.d. from the distribution $\mathcal{D}$ under Assumption 1:
$$\mathcal{L}_X(\widehat{f}) \triangleq \frac{1}{N}\sum_{n=1}^N g\big(\widehat{f}(x_n), \Pi^\top f^\natural(x_n)\big) = \frac{1}{N}\sum_{n=1}^N \|\widehat{f}(x_n) - \Pi^\top f^\natural(x_n)\|_2^2 = \frac{1}{N}\|\widehat{F} - \Pi^\top F^\natural\|_F^2, \qquad \mathcal{L}_{\mathcal{D}}(\widehat{f}) \triangleq \mathbb{E}_{x\sim\mathcal{D}}\, g\big(\widehat{f}(x), \Pi^\top f^\natural(x)\big) = \mathbb{E}_{x\sim\mathcal{D}} \|\widehat{f}(x) - \Pi^\top f^\natural(x)\|_2^2. \quad (84)$$
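The chain of inequalities in Lemma 2 can be sanity-checked numerically. The sketch below draws random perturbations scaled to assumed Frobenius-norm levels $\upsilon_1 = 0.2$, $\upsilon_2 = 0.3$ (so that $\upsilon_1 + \upsilon_2 < 1$) and verifies $\sigma_{\min}(I_K + E_1 + E_2) \ge |1 - \upsilon_1 - \upsilon_2|$; the specific dimensions and levels are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
for _ in range(100):
    # Random perturbations scaled to assumed Frobenius-norm levels
    # u1 = 0.2, u2 = 0.3 (so that u1 + u2 < 1).
    E1 = rng.standard_normal((K, K))
    E1 *= 0.2 / np.linalg.norm(E1)
    E2 = rng.standard_normal((K, K))
    E2 *= 0.3 / np.linalg.norm(E2)
    u1, u2 = np.linalg.norm(E1), np.linalg.norm(E2)  # Frobenius norms
    smin = np.linalg.svd(np.eye(K) + E1 + E2, compute_uv=False)[-1]
    # Lemma 2: sigma_min(I_K + E1 + E2) >= |1 - u1 - u2|.
    assert smin >= abs(1 - u1 - u2) - 1e-12
```

Since the bound only uses the triangle inequality and norm equivalences, it holds for every draw, not just on average.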
We invoke Theorem 26.5 in (Shalev-Shwartz & Ben-David, 2014) and get that, with probability greater than $1-\delta$,
$$\mathcal{L}_{\mathcal{D}}(\widehat{f}) \le \mathcal{L}_X(\widehat{f}) + 2\mathcal{R}_N(g \circ \mathcal{F}) + 4c\sqrt{\frac{2\log(4/\delta)}{N}} \ \Longrightarrow\ \mathbb{E}_{x\sim\mathcal{D}}\|\widehat{f}(x) - \Pi^\top f^\natural(x)\|_2^2 \le \frac{1}{N}\|\widehat{F} - \Pi^\top F^\natural\|_F^2 + 4\mathcal{R}_N(\mathcal{F}) + 16\sqrt{\frac{2\log(4/\delta)}{N}},$$
where the last inequality utilizes the definitions in (84) and the contraction lemma (Lemma 26.9 from (Shalev-Shwartz & Ben-David, 2014)), and applies $c = 4$ since $|g(f,y)| \le 4$ in our case. The term $\mathcal{R}_N(\mathcal{F})$ denotes the empirical Rademacher complexity of the neural network function class $\mathcal{F}$, which is upper-bounded via the sensitive complexity parameter $R_{\mathcal{F}}$ as follows (Lin & Zhang, 2019):
$$\mathcal{R}_N(\mathcal{F}) \le 16\, N^{-5/8}\,(2\|X\|_F R_{\mathcal{F}})^{1/4}.$$
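To get a feel for the rates, the helper below evaluates the two sample-dependent terms of the bound, $4\mathcal{R}_N(\mathcal{F})$ (via the upper bound above) and $16\sqrt{2\log(4/\delta)/N}$, under purely hypothetical values of $\|X\|_F$ and $R_{\mathcal{F}}$; it is only a sketch of how the gap shrinks as $N$ grows, not a statement about any particular network.

```python
import math

def generalization_gap_bound(N, X_fro, R_F, delta=0.05):
    """Evaluate the sample-dependent terms of the bound above.

    N: number of samples; X_fro: an assumed value for ||X||_F;
    R_F: an assumed value of the sensitive complexity parameter.
    """
    rademacher_term = 4 * 16 * N ** (-5 / 8) * (2 * X_fro * R_F) ** 0.25
    concentration_term = 16 * math.sqrt(2 * math.log(4 / delta) / N)
    return rademacher_term + concentration_term

# The bound shrinks as N grows (hypothetical parameter values).
b_small = generalization_gap_bound(10_000, X_fro=100.0, R_F=1e3)
b_large = generalization_gap_bound(1_000_000, X_fro=100.0, R_F=1e3)
assert b_large < b_small
```

Note that in practice $\|X\|_F$ itself grows with $N$, so the $N^{-5/8}$ rate is partially offset; the sketch holds the data norm fixed for readability.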

I PROOF OF PROPOSITION 2

Our goal is to bound $\frac{1}{M}\sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2$ for each $n$. To achieve this, let us define the random variable
$$U := \frac{1}{NM}\sum_{n=1}^N \sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2.$$
From Proposition 1, the following result holds:
$$\Pr(U \le \vartheta(\delta)) \ge 1 - \delta,$$
where the randomness is due to all $x_n$'s, $\widehat{y}_n^{(m)}$'s, and $\mathcal{S}$, and $\vartheta(\delta)$ is given by the R.H.S. of (54). Then we have
$$\mathbb{E}[U] = \int_0^{u_{\max}} h(u)\,u\,\mathrm{d}u = \int_0^{\vartheta(\delta)} h(u)\,u\,\mathrm{d}u + \int_{\vartheta(\delta)}^{u_{\max}} h(u)\,u\,\mathrm{d}u \le \vartheta(\delta)\int_0^{\vartheta(\delta)} h(u)\,\mathrm{d}u + u_{\max}\int_{\vartheta(\delta)}^{u_{\max}} h(u)\,\mathrm{d}u \le \vartheta(\delta) + u_{\max}\delta \le \vartheta(\delta) + 4\delta, \quad (86)$$
where $u_{\max}$ denotes the maximum value of the random variable $U$ and satisfies $u_{\max} \le 4$, and $h(u)$ denotes the probability density function of $U$. Also, we have
$$\mathbb{E}[U] = \frac{1}{N}\sum_{n=1}^N \mathbb{E}[U_n], \quad \text{where } U_n \triangleq \frac{1}{M}\sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2. \quad (87)$$
To proceed, we have the following lemma:

Lemma 10 Let $U_n \triangleq \frac{1}{M}\sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2$. Then $\mathbb{E}[U_n] = \mathbb{E}[U_{n'}]$, $\forall n, n'$, where the expectation is taken w.r.t. all $x_n$'s, $\widehat{y}_n^{(m)}$'s, and $\mathcal{S}$.

The proof is provided in Sec. J. Combining Lemma 10 with (86) and (87), we get $\mathbb{E}[U_n] \le \vartheta(\delta) + 4\delta$. Applying the Markov inequality, we get the following for any $\tau > 0$:
$$\Pr(U_n \le \tau \mathbb{E}[U_n]) \ge 1 - \frac{1}{\tau} \ \Longrightarrow\ \Pr(U_n \le \tau\vartheta(\delta) + 4\tau\delta) \ge 1 - \frac{1}{\tau}.$$
Letting $\tau = N^{\alpha}$, where $0 < \alpha < 1$, we get that, with probability greater than $1 - \frac{1}{N^{\alpha}}$,
$$\frac{1}{M}\sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2 \le N^{\alpha}\vartheta(\delta) + 4N^{\alpha}\delta.$$
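The last step is a plain Markov-inequality argument. The sketch below checks $\Pr(U_n \le \tau\,\mathbb{E}[U_n]) \ge 1 - 1/\tau$ empirically on a bounded nonnegative stand-in for $U_n$ (the Beta-based distribution is an arbitrary illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for U_n: a bounded nonnegative random variable taking
# values in [0, 4], mirroring u_max <= 4 in the proof above.
samples = 4.0 * rng.beta(2, 5, size=200_000)
mean = samples.mean()
for tau in (2.0, 5.0, 10.0):
    frac = np.mean(samples <= tau * mean)  # empirical Pr(U_n <= tau E[U_n])
    assert frac >= 1 - 1 / tau             # Markov bound, with slack
```

Markov's inequality is loose for concentrated distributions, which is why the proof trades the $1 - 1/\tau$ guarantee for the mild $N^{\alpha}$ inflation of the bound.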

J PROOF OF LEMMA 10

Let us define the following:
$$u(x_n, \theta_{v_1,\ldots,v_n,\ldots,v_{n'},\ldots,v_N}) \triangleq \frac{1}{M}\sum_{m=1}^M \|p_n^{\natural(m)} - \widehat{p}_n^{(m)}\|_2^2 = \frac{1}{M}\sum_{m=1}^M \|A_m^\natural f^\natural(x_n) - \widehat{A}_m \widehat{f}(x_n)\|_2^2,$$
where $v_n = \{x_n, \{\widehat{y}_n^{(m)}\}_{m\in\mathcal{S}_n}, \mathcal{S}_n\}$ and $\theta$ denotes the estimates $\{\widehat{A}_m\}$ and $\widehat{f}$, which are obtained using the data $\{v_1,\ldots,v_n,\ldots,v_{n'},\ldots,v_N\} = \{x_1,\ldots,x_N, Y, \mathcal{S}\}$. One can observe that, for any $n' \ne n$,
$$\theta_{v_1,\ldots,v_n,\ldots,v_{n'},\ldots,v_N} = \theta_{v_1,\ldots,v_{n'},\ldots,v_n,\ldots,v_N}, \quad (88)$$
since the order of the $v_n$'s does not affect the value of the estimates. Then we have
$$\mathbb{E}_{x_n, x_{n'}, x_j,\, j\ne n,n'}\big[u(x_n, \theta_{v_1,\ldots,v_n,\ldots,v_{n'},\ldots,v_N})\big] = \mathbb{E}_{x_n, x_{n'}, x_j,\, j\ne n,n'}\big[u(x_n, \theta_{v_1,\ldots,v_{n'},\ldots,v_n,\ldots,v_N})\big] = \mathbb{E}_{x_n, x_{n'}, x_j,\, j\ne n,n'}\big[u(x_{n'}, \theta_{v_1,\ldots,v_{n'},\ldots,v_n,\ldots,v_N})\big],$$
where the first equality is by (88) and the last equality is obtained since $x_n$ and $x_{n'}$ are identically distributed.

K ON THE SENSITIVE COMPLEXITY PARAMETER $R_{\mathcal{F}}$

Suppose $i \in \mathcal{G}_C$. Then we have $\gamma_i(H_i)z_{i-1} = H_i \circledast z_{i-1}$, where the matrix $H_i \in \mathbb{R}^{c_i \times r_i}$ contains $c_i$ convolutional filters, each of which has dimension $r_i$, $z_{i-1}$ denotes the output of the $(i-1)$th layer, and $\circledast$ denotes the convolution operation. Here $d_0 = d$ is the dimension of the input data items $x_n$. Then, the sensitive complexity $R_{\mathcal{F}}$ is defined as follows:
$$R_{\mathcal{F}} = 2\rho^L L^2 \Big(\sum_{i\in\mathcal{G}_F} d_i^2 d_{i-1}^2 + \sum_{i\in\mathcal{G}_C} c_i^2 r_i^2 d_i / c_i\Big).$$
Note that such a general CNN architecture covers many popular neural networks, e.g., fully connected neural networks and CNNs such as LeNet-5 (Lecun et al., 1998), VGG-16 (Liu & Deng, 2015), and so on.

L NEAR-CLASS SPECIALIST ASSUMPTION AND SSC

NCSA is a relaxed condition compared to having all-class expert annotators or having diagonally dominant annotators; see Fig. 1. The NCSA does not require any single annotator to be a specialist with respect to all classes. Nonetheless, having an all-class expert annotator m* satisfies NCSA, but not vice versa.

For machine annotation, we train a number of machine classifiers to act as annotators. In Table 1, under Case 2, we choose M = 5 annotators. Specifically, we train a linear SVM, logistic regression, k-NN with k = 5, a CNN with two convolution layers followed by a max pooling layer, a fully connected layer, and a softmax layer, and a fully connected neural network (FCNN) with 1 hidden layer and 128 hidden units. The CNN is trained for 5 epochs and the FCNN is trained for 10 epochs. The classifiers are intentionally not well trained, so that they mimic error-prone annotators. For example, for the MNIST dataset, the individual label prediction accuracy of these annotators on unseen data items of size N ranges from 15.88% to 82.23% (averaged over 5 random trials), and 2 annotators have less than 50% accuracy on average. Also note that our algorithm does not know which annotator is more accurate. For Case 2, we train more annotators by changing some parameters of the above-mentioned machine classifiers. Specifically, we use a linear SVM, a polynomial kernel-based SVM, and a Gaussian SVM as different variants of the SVM. For k-NN, we choose k from {3, 5, 7, 10}. The FCNN, CNN, and logistic regression are trained with different numbers of epochs {10, 15, 20, 25} and with random initializations for each case. Under Case 1, if an annotator is chosen to be a specialist, it is trained with more samples (10,000 samples) and more epochs, such that its label prediction accuracy is higher than 90%.
Under this strategy, in Table 1 for the case M = 15, the individual label prediction accuracy of the annotators on unseen data items of size N ranges from 3.15% to 95.27% (2 annotators with more than 90% accuracy). For the case with M = 20, it ranges from 3.24% to 95.73% (3 annotators with more than 90% accuracy).
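The "deliberately weak annotator" idea can be emulated without the actual sklearn classifiers and MNIST data described above; the toy stand-in below fits a nearest-centroid classifier on only 2 samples per class of synthetic Gaussian data, so its label predictions are imperfect but still informative (all names and constants here are illustrative, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(4)
K, d = 3, 5
means = 3.0 * rng.standard_normal((K, d))  # class means (synthetic data)

def sample(n_per_class):
    y = np.repeat(np.arange(K), n_per_class)
    return means[y] + rng.standard_normal((len(y), d)), y

def nearest_centroid(X_train, y_train):
    C = np.stack([X_train[y_train == k].mean(axis=0) for k in range(K)])
    return lambda X: np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)

# Intentionally under-trained "annotator": only 2 samples per class.
weak_annotator = nearest_centroid(*sample(2))
X_unseen, y_unseen = sample(500)
noisy_labels = weak_annotator(X_unseen)         # the annotator's outputs
acc = float((noisy_labels == y_unseen).mean())  # imperfect, not useless
assert 0.0 < acc <= 1.0
```

Training several such under-fit classifiers with different amounts of data yields a pool of annotators with heterogeneous accuracies, mirroring the setup in Table 1.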



https://github.com/shahanaibrahimosu/end-to-end-crowdsourcing



Assumption 1 Each data item $x_n$ is drawn from a distribution $\mathcal{D}$ independently at random and has a bounded $\ell_2$ norm. The observed index pairs $(m, n)$ are included in $\mathcal{S}$ uniformly at random.

Assumption 2 The neural network function class $\mathcal{F}$ has a complexity measure denoted as $R_{\mathcal{F}}$.

$$D_{\mathcal{S}}(P; Y) = \frac{1}{S}\sum_{s=1}^{S} \mathrm{CE}\big(P(\omega_s), Y(\omega_s)\big),$$
where $\mathrm{CE}(x, y) = -\sum_{k=1}^K \mathbb{I}[y = k]\log(x(k))$, $S = |\mathcal{S}|$, and $\omega_s = (m_s, n_s) \in \mathcal{S}$. Under Assumption 1, we can define:

Hence, by taking expectation w.r.t. $Y$, we have
$$\mathbb{E}[D_\Pi(\widehat{P}; Y) - D_\Pi(P^\natural; Y)] = \mathbb{E}[D_\Pi(\widehat{P}; Y)] - D_{\mathcal{S}}(\widehat{P}; Y) + D_{\mathcal{S}}(P^\natural; Y) - \mathbb{E}[D_\Pi(P^\natural; Y)] + D_{\mathcal{S}}(\widehat{P}; Y) - D_{\mathcal{S}}(P^\natural; Y)$$
$$\le \mathbb{E}[D_\Pi(\widehat{P}; Y)] - D_{\mathcal{S}}(\widehat{P}; Y) + D_{\mathcal{S}}(P^\natural; Y) - \mathbb{E}[D_\Pi(P^\natural; Y)]$$
$$\le \sup_{P\in\mathcal{P}} \big|D_{\mathcal{S}}(P; Y) - \mathbb{E}[D_\Pi(P; Y)]\big| + \big|D_{\mathcal{S}}(P^\natural; Y) - \mathbb{E}[D_\Pi(P^\natural; Y)]\big|,$$
where the first inequality holds since $D_{\mathcal{S}}(\widehat{P}; Y) \le D_{\mathcal{S}}(P^\natural; Y)$ by the optimality of $\widehat{P}$.

$\forall n, m$, for observing each annotation $\widehat{y}_n^{(m)}$, and used the ground-truth probability $P^\natural((m-1)K + k, n)$ for each event $\mathbb{I}[\widehat{y}_n^{(m)} = k]$.

where $\ell \circ \mathcal{P} \circ \mathcal{S}$ denotes the set
$$\ell \circ \mathcal{P} \circ \mathcal{S} \triangleq \Big\{ \big(\mathrm{CE}(P(\omega_1); Y(\omega_1)), \ldots, \mathrm{CE}(P(\omega_S); Y(\omega_S))\big) \ \Big|\ P \in \mathcal{P} \Big\},$$
and $\mathcal{R}_{\mathcal{S}}(\mathcal{X})$ denotes the empirical Rademacher complexity of the set $\mathcal{X}$.

which holds because $\log \mathcal{N}(\mu, \mathcal{Z})$ decreases monotonically as $\mu$ increases. The inequality (b) is obtained by applying (45), and (c) is obtained by fixing $a = \sqrt{K}$, which is smaller than $\sqrt{SK}$.

Figure 1: (left) $W^\natural = [A_1^{\natural\top}, \ldots, A_M^{\natural\top}]^\top$ satisfying NCSA with K = 3, meaning that there exist specialists for each class $k \in [K]$; (right) an all-class expert annotator $m^*$ among M annotators, meaning $A_{m^*}^\natural \approx I_K$.

Figure 2: The dots denote the rows of $W^\natural$ and the circle denotes the second-order cone $\mathcal{C}$.


Figure 4: The illustration of the ground-truth confusion matrices and the estimated confusion matrices by the proposed approach GeoCrowdNet(W) for CIFAR-10 dataset, with M = 5 machine annotators and γ = 0.01. The average mean squared error (MSE) of the confusion matrices is 0.102.

Average test accuracy (± std) of the proposed methods and the baselines on the MNIST & Fashion-MNIST datasets under various (N, M)'s; labels are produced by machine annotators; p = 0.1.
GeoCrowdNet(F): 79.89 ± 3.08, 82.18 ± 3.48, 85.92 ± 2.73, 87.21 ± 2.47, 78.98 ± 2.83, 84.47 ± 1.64, 80.60 ± 0.46, 83.68 ± 2.17
GeoCrowdNet(W): 80.97 ± 1.31, 83.69 ± 2.37, 77.79 ± 8.97, 82.37 ± 9.18, 79.80 ± 4.23, 85.56 ± 1.91, 72.36 ± 3.84, 74.03 ± 7.41
GeoCrowdNet(λ = 0): 71.15 ± 6.73, 69.17 ± 2.61, 71.66 ± 4.48, 60.29 ± 7.91, 70.92 ± 4.14, 81.88 ± 4.41, 69.31 ± 4.77, 73.04 ± 7.56
The proposed methods are compared with a number of existing E2E crowdsourcing methods, namely, TraceReg (Tanno et al., 2019), MBEM (Khetan et al., 2018), CrowdLayer (Rodrigues & Pereira, 2018), CoNAL (Chu et al., 2021), and Max-MIG.

Average test accuracy of the proposed methods and the baselines on LabelMe (M = 59, K = 8) and Music (M = 44, K = 10) datasets.

Xiaobo Xia, Tongliang Liu, Bo Han, Nannan Wang, Mingming Gong, Haifeng Liu, Gang Niu, Dacheng Tao, and Masashi Sugiyama. Part-dependent label noise: Towards instance-dependent label noise. In Advances in Neural Information Processing Systems, volume 33, pp. 7597-7610, 2020.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint, arXiv:1708.07747, 2017.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of International Conference on Learning Representations, 2016a.

Le Zhang, Ryutaro Tanno, Mou-Cheng Xu, Chen Jin, Joseph Jacob, Olga Cicarrelli, Frederik Barkhof, and Daniel Alexander. Disentangling human error from ground truth in segmentation of medical images. In Advances in Neural Information Processing Systems, volume 33, pp. 15750-15762, 2020.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17(102):1-44, 2016b.

Zhaowei Zhu, Jialu Wang, and Yang Liu. Beyond images: Label noise transition matrix estimation for tasks with lower-quality features. In Proceedings of International Conference on Machine Learning, volume 162, pp. 27633-27653, 2022.

Supplementary Material of "Deep Learning From Crowdsourced Labels: Coupled Cross-Entropy Minimization, Identifiability, and Regularization"

Average test accuracy of the proposed methods and the baselines on the CIFAR-10 dataset (K = 10), labeled by M = 5 synthetic annotators.
GeoCrowdNet(F): 84.82 ± 0.43, 71.66 ± 1.08, 67.97 ± 1.44
GeoCrowdNet(W): 82.76 ± 0.46, 69.20 ± 0.59, 63.86 ± 1.64
GeoCrowdNet(λ = 0): 82.50 ± 0.38, 69.14 ± 1.29, 63.73 ± 2.64
TraceReg: 82.54 ± 0.34, 70.01 ± 1.16, 65.01 ± 2.30
CrowdLayer: 83.01 ± 0.23, 69.91 ± 1.48, 65.70 ± 2.32
MBEM: 81.01 ± 0.32, 69.45 ± 1.56, 64.15 ± 2.21
CoNAL: 81.41 ± 0.43, 68.56 ± 1.78, 63.21 ± 1.98
Max-MIG: 82.01 ± 0.78, 69.96 ± 1.23, 62.01 ± 2.46
NN-MV: 50.96 ± 0.83, 37.90 ± 1.19, 33.24 ± 1.17
NN-DSEM: 52.02 ± 0.80, 37.95 ± 1.41, 33.64 ± 0.69


Acknowledgement. This work is supported in part by the National Science Foundation under Project NSF IIS-2007836. 


implies that when $f^\natural \in \mathcal{F}$, i.e., $\nu = 0$, $S \ge C_1 t$, and $N \le C_2 t^3$ for certain constants $C_1, C_2 > 0$, in the limit $t \to \infty$ we get $\|P^\natural - \widehat{P}\|_F^2 = 0$ with probability 1. On the other hand, the matrix $\widehat{P}$ can be constructed using the estimates of the CCEM criterion (6). Hence, in order to identify $W^\natural$ and $F^\natural$ from the NMF model in (55), we invoke the following result:

Lemma 7 (Huang et al., 2014) Consider the matrix factorization model $Z = XY$, where $X \in \mathbb{R}^{I\times K}$, $Y \in \mathbb{R}^{K\times J}$, and $\mathrm{rank}(X) = \mathrm{rank}(Y) = K$. If $X, Y \ge 0$ and both $X$ and $Y$ satisfy the SSC, then any $\widehat{X}$ and $\widehat{Y}$ that satisfy $Z = \widehat{X}\widehat{Y}$ must have the form $\widehat{X} = X\Pi\Sigma$, $\widehat{Y} = \Sigma^{-1}\Pi^\top Y$, where $\Pi$ is a column permutation matrix and $\Sigma$ is a diagonal nonnegative scaling matrix.

Next, we show that Lemma 7 can be applied given the conditions in Theorem 2. The matrix $F_Z^\natural$ includes a subset of the columns of $F^\natural$, i.e., $\mathrm{cone}(F_Z^\natural) \subseteq \mathrm{cone}(F^\natural)$; this gives (57). Also, for any orthogonal matrix $Q \in \mathbb{R}^{K\times K}$ except for the permutation matrices, (58) holds. Eqs. (57) and (58) imply that $F^\natural$ satisfies the SSC, given the assumption that $F_Z^\natural$ satisfies the SSC. In addition, the column scaling does not affect the conic hull, i.e., $\mathrm{cone}(F^\natural) = \mathrm{cone}(\Sigma^{-1}F^\natural)$ and $\mathrm{cone}(W^{\natural\top}) = \mathrm{cone}(\Sigma W^{\natural\top})$. Since both $W^\natural$ and $F^\natural$ satisfy the SSC, $\mathrm{rank}(F^\natural) = \mathrm{rank}(W^\natural) = K$ holds (see (Huang et al., 2016)). Hence, the conditions in Lemma 7 hold for (55). Comparing (55) and (56) and invoking Lemma 7, we get $\widehat{W} = W^\natural\Pi$ and $\widehat{F} = \Pi^\top F^\natural$, where the scaling $\Sigma$ is automatically removed due to the sum-to-one constraints on the columns of the $\widehat{A}_m$'s and $\widehat{F}$.
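The way the sum-to-one constraints remove the scaling $\Sigma$ in Lemma 7 can be illustrated with a small numpy sketch (all matrices here are randomly generated for illustration only): the column sums of a candidate $\widehat{A}_m = A_m^\natural \Pi \Sigma$ equal the diagonal of $\Sigma$, so enforcing column-stochasticity forces $\Sigma = I$.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
# A column-stochastic "ground-truth" block A (columns sum to one).
A = rng.random((K, K))
A /= A.sum(axis=0, keepdims=True)
Pi = np.eye(K)[:, rng.permutation(K)]           # permutation matrix
Sigma = np.diag(rng.uniform(0.5, 2.0, size=K))  # diagonal scaling
A_hat = A @ Pi @ Sigma                          # candidate solution block
# Since 1^T A = 1^T, the column sums of A_hat expose the scaling:
# sum_k A_hat[k, j] = Sigma[j, j].
assert np.allclose(A_hat.sum(axis=0), np.diag(Sigma))
# Hence requiring the columns of A_hat to sum to one forces Sigma = I.
```

This is why only the permutation ambiguity $\Pi$, which is benign for classification, survives in the statement of the theorem.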

The first result (59a) implies that $\widehat{A}_m = A_m^\natural \Pi$, $\forall m$, since $\widehat{W} = [\widehat{A}_1^\top, \ldots, \widehat{A}_M^\top]^\top$.

Applying the second result (59b) in Lemma 3, along with the assumption that $N \to \infty$, we get
$$\mathbb{E}_{x\sim\mathcal{D}}\|\widehat{f}(x) - \Pi^\top f^\natural(x)\|_2^2 = 0. \quad (60)$$

Fact 1 Let $X$ be a nonnegative random variable with $\mathbb{E}[X] = 0$. Then $X$ is zero almost surely, i.e., $\Pr(X = 0) = 1$.

Employing Fact 1 in (60), we get that $\widehat{f}(x) = \Pi^\top f^\natural(x)$, $\forall x \sim \mathcal{D}$, due to the nonnegativity of $f^\natural$ and $\widehat{f}$.

The SSC (Definition 1) is a relaxed condition compared to the NCSA (Assumption 4). To illustrate this, we consider the geometry of the matrix $W^\natural$ satisfying the NCSA and the SSC, as shown in Fig. 2.

M ALGORITHM DESCRIPTION

In this section, we detail the implementation of the regularized criteria in (8) and (9). The neural network predictor function $f \in \mathcal{F}$ is parameterized using $\theta$ and can be denoted as $f_\theta$; the criteria are optimized over batches $\mathcal{B} \subset [N]$.

N.1 CIFAR-10 EXPERIMENTS WITH SYNTHETIC ANNOTATORS

Dataset. CIFAR-10 consists of 60,000 labeled color images of animals, vehicles, and so on, each having a size of 32 × 32 and belonging to K = 10 different classes. We use 45,000 images for training, 5,000 images for validation, and 10,000 images for testing.

Noisy Label Generation. In order to produce noisy annotations for the images of the training data, we simulate M = 5 synthetic annotators. We randomly choose an annotator m* among the M annotators, and its confusion matrix is generated by $A^\natural_{m^*} = I_K + \gamma\,\mathrm{rand}(K, K)$, followed by normalization w.r.t. the $\ell_1$ norm of the corresponding columns. Here, $I_K$ denotes the identity matrix of size K, and $\mathrm{rand}(K, K)$ denotes a K × K matrix with entries drawn uniformly at random from [0, 1]. The parameter γ controls how well the annotator m* correctly identifies the ground-truth labels. The confusion matrices of the remaining M - 1 annotators are generated such that they provide labels uniformly at random, i.e., the M - 1 annotators are all unreliable. This type of modeling for the confusion matrices resembles the hammer-spammer model employed in the works (Rodrigues & Pereira, 2018; Tanno et al., 2019). Using the ground-truth labels $y_n$ provided by the dataset and the generated confusion matrices, the noisy annotations $\widehat{y}^{(m)}_n$ are generated; each annotation is observed independently at random such that only 20% of the total annotations are available for training. Note that the true labels are not accessible by any of the methods.

Neural Network Architecture and Settings. For the CIFAR-10 dataset, we choose the ResNet-9 architecture (He et al., 2016). Adam (Kingma & Ba, 2015) is used as the optimizer with a weight decay of $10^{-4}$ and a batch size of 128.
The regularization parameter λ and the initial learning rate of the Adam optimizer are chosen via a grid search over the validation set from {0.01, 0.001, 0.0001} and {0.01, 0.001}, respectively. We choose the same neural network structures for all the baselines. The confusion matrices are initialized with identity matrices for the proposed methods and for the baselines TraceReg and CrowdLayer.

Results. Table 3 presents the results under various values of γ; when γ is smaller, the chance of annotator m* correctly labeling the data items is higher. One can see that the proposed approach GeoCrowdNet(F) outperforms the baselines in all the scenarios under test. Even when annotator m*'s labeling accuracy drops drastically (i.e., when γ becomes larger), GeoCrowdNet(F) performs the best. This implies that even if there are no class specialists, GeoCrowdNet(F) works well in the large-N cases. Figs. 3, 4, and 5 show the ground-truth confusion matrices and the corresponding estimates when γ = 0.01 for the GeoCrowdNet methods. One can see that all methods estimate the confusion matrices reasonably well, corroborating our identifiability claims.
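The synthetic annotator construction described above can be sketched in a few lines of numpy (a minimal hammer-spammer-style generator; the variable names, the stand-in labels, and the loop structure are illustrative, while γ, M, K, and the 20% observation rate follow the text).

```python
import numpy as np

rng = np.random.default_rng(3)
K, M, N, gamma = 10, 5, 1000, 0.01

# Annotator m*: near-identity confusion, column-normalized (l1 norm).
A_star = np.eye(K) + gamma * rng.random((K, K))
A_star /= A_star.sum(axis=0, keepdims=True)
# The remaining M - 1 annotators label uniformly at random ("spammers").
confusions = [A_star] + [np.full((K, K), 1.0 / K) for _ in range(M - 1)]

y_true = rng.integers(K, size=N)       # stand-in for the dataset's labels
annotations = {}                       # (m, n) -> observed noisy label
for m, A in enumerate(confusions):
    for n in range(N):
        if rng.random() < 0.2:         # only 20% of annotations observed
            # Column y_true[n] of A is Pr(annotator outputs k | true label).
            annotations[(m, n)] = int(rng.choice(K, p=A[:, y_true[n]]))

assert all(np.allclose(A.sum(axis=0), 1.0) for A in confusions)
```

The resulting sparse dictionary of (annotator, item) pairs mirrors the incomplete-annotation setting S used throughout the paper.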

N.2 REAL DATA EXPERIMENTS -MACHINE ANNOTATIONS

Here, we provide additional details of the real data experiments. For experiments with machine annotations, we consider the MNIST and Fashion-MNIST datasets. Each image in these datasets is of size 28 × 28 in grey-scale format. We use 57,000 images as training data, 3,000 images for validation, and 10,000 images for testing. For experiments with annotations collected from AMT, we use the LabelMe dataset (Rodrigues et al., 2017; Russell et al., 2007) and the Music dataset (Rodrigues et al., 2014).

The LabelMe dataset (Rodrigues et al., 2017; Russell et al., 2007) is an image classification dataset consisting of 2,688 images from K = 8 different classes, namely, highway, inside city, tall building, street, forest, coast, mountain, and open country. From the available images, 1,000 images are annotated by M = 59 AMT workers. In total, about 2,547 image annotations are obtained from the AMT workers, whose labeling accuracy ranges from 0% to 100% with a mean accuracy of 69.2%. In order to enrich the training dataset, standard augmentation techniques such as rescaling, cropping, horizontal flips, etc., are employed; accordingly, the training dataset consists of 10,000 images annotated by 59 workers; see more details in (Rodrigues & Pereira, 2018; Chu et al., 2021). The validation set consists of 500 images, and the remaining 1,188 images are used for testing.

The Music dataset (Rodrigues et al., 2014) consists of audio samples of 10 different genres of music: classical, country, disco, hiphop, jazz, rock, blues, reggae, pop, and metal. The dataset has about 1,000 samples of songs, each having a duration of 30 seconds. About 700 samples are annotated by 44 AMT workers (overall, about 2,946 annotations are observed) and the remaining 300 are allocated for testing.
Out of the 700 annotated samples, we use 595 samples for training, and the remaining 105 samples are used for validation.

For the LabelMe dataset, we employ the settings used in (Rodrigues & Pereira, 2018). The pretrained VGG-16 embeddings for the images are given as inputs to a fully connected (FC) neural network with one hidden layer having 128 hidden units and ReLU activation functions. A dropout layer with rate 50% is also used, followed by a softmax layer. For the Music dataset, we choose the same FC layer and softmax layer as employed in the LabelMe settings, but with batch normalization layers before each of these layers.

