RECOVERING TOP-TWO ANSWERS AND CONFUSION PROBABILITY IN MULTI-CHOICE CROWDSOURCING

Abstract

Crowdsourcing has emerged as an effective platform for labeling a large volume of data in a cost- and time-efficient manner. Most previous works have focused on designing an efficient algorithm to recover only the ground-truth labels of the data. In this paper, we consider multi-choice crowdsourced labeling with the goal of recovering not only the ground truth but also the most confusing answer and the confusion probability. The most confusing answer provides useful information about the task by revealing the most plausible answer other than the ground truth and how plausible it is. To theoretically analyze such scenarios, we propose a model where there are top-two plausible answers for each task, distinguished from the rest of the choices. Task difficulty is quantified by the confusion probability between the top two, and worker reliability is quantified by the probability of giving an answer among the top two. Under this model, we propose a two-stage inference algorithm to infer the top-two answers as well as the confusion probability. We show that our algorithm achieves the minimax optimal convergence rate. We conduct both synthetic and real-data experiments and demonstrate that our algorithm outperforms other recent algorithms. We also show the applicability of our algorithms in inferring the difficulty of tasks and training neural networks with the soft labels composed of the top-two most plausible classes.

To model the effect of confusion in multi-choice crowdsourcing problems, we propose a new model under which each task can have a confusing answer other than the ground truth, with a varying confusion probability across tasks. The task difficulty is quantified by the confusion probability, and the worker skill is modeled by the probability of giving an answer among the top two, to distinguish reliable workers from pure spammers who just provide random guesses among possible choices. We justify the proposed top-two model with public datasets.

Under this new model, we aim to recover both the ground truth and the most confusing answer, together with the confusion probability, which indicates how plausible the recovered ground truth is compared to the most confusing answer. We provide an efficient two-stage inference algorithm to recover the top-two plausible answers and the confusion probability. The first stage of our algorithm uses the spectral method to get an initial estimate for the top-two answers as well as the confusion probability, and the second stage uses this initial estimate to estimate the worker reliabilities and to refine the estimates for the top-two answers. Our algorithm achieves the minimax optimal convergence rate. We then perform experiments where we compare our method to recent crowdsourcing algorithms on both synthetic and real datasets, and show that our method outperforms other methods in recovering the top-two answers. This result demonstrates that our model better explains real-world datasets, including errors from confusion.

Our key contributions can be summarized as follows.
• Top-two model: We propose a new model for multi-choice crowdsourcing tasks where each task has top-two answers and the difficulty of the task is quantified by the confusion probability between the top two. We justify the proposed model by analyzing six public datasets and showing that the top-two structure explains the real datasets well.
• Inference algorithm and its applications: We propose a two-stage algorithm that recovers the top-two answers and the confusion probability of each task at the minimax optimal convergence rate. We demonstrate the potential applications of our algorithm not only in crowdsourced labeling but also in quantifying task difficulty and training neural networks for classification with soft labels that include the top-two information and the task difficulty.

1. INTRODUCTION

Crowdsourcing has been widely adopted to solve a large number of tasks in a time- and cost-efficient manner with the aid of human workers. In this paper, we consider 'multiple-choice' tasks where a worker is asked to provide a single answer among multiple choices. Some examples are as follows: 1) Using crowdsourcing platforms such as MTurk, we solve object counting or classification tasks on a large collection of images. Answers can be noisy either due to the difficulty of the scene or due to unreliable workers who provide random guesses. 2) Scores are collected from reviewers for papers submitted to a conference. For certain papers, scores can vary widely among reviewers, either due to the paper's inherent nature (clear pros and cons) or due to the reviewer's subjective interpretation of the scoring scale (Stelmakh et al., 2019; Liu et al., 2022). In the above scenarios, responses provided by human workers may be inconsistent not only due to the existence of unreliable workers but also due to the inherent difficulty of the tasks. In particular, for multiple-choice tasks, there could exist plausible answers other than the ground truth, which we call confusing answers. For tasks with confusing answers, even reliable workers may provide wrong answers due to confusion. Thus, we need to separate the two different causes of wrong answers: low reliability of workers and confusion due to task difficulty. Most previous models for multi-choice crowdsourcing, however, fall short of modeling the confusion. For example, in the single-coin Dawid-Skene model (Dawid & Skene, 1979), which is the most widely studied crowdsourcing model in the literature, it is assumed that a worker is associated with a single skill parameter fixed across all tasks, which models the probability of giving a correct answer for every task.
Under this model, any algorithm that infers the worker skill would count a confused labeling as the worker's error and lower its accuracy estimate for the worker, which results in a wrong estimate for their true skill level.

2. MODEL AND PROBLEM SETUP

We consider a crowdsourcing model to infer the top-two most plausible answers among $K$ choices for each task. There are $n$ workers and $m$ tasks. For each task $j \in [m] := \{1, \dots, m\}$, we denote the correct answer by $g_j \in [K]$ and the next most plausible, or most confusing, answer by $h_j \in [K]$. We call the pair $(g_j, h_j)$ the top-two answers for task $j \in [m]$. Let $p \in [0, 1]^n$ and $q \in (1/2, 1]^m$ be parameters modeling the reliability of workers and the difficulty of tasks, respectively. For every pair $(i, j)$, the $j$-th task is assigned to the $i$-th worker independently with probability $s$. We use a matrix $A \in \mathbb{R}^{n \times m}$ to represent the responses of workers, where $A_{ij} = 0$ if the $j$-th task is not assigned to the $i$-th worker, and $A_{ij}$ is equal to the received label if it is assigned. The distribution of $A_{ij}$ is specified by the worker reliability $p_i$ and task difficulty $q_j$ as follows:

$$
A_{ij} =
\begin{cases}
g_j, & \text{with prob. } s\left(p_i q_j + \frac{1-p_i}{K}\right), \\
h_j, & \text{with prob. } s\left(p_i (1-q_j) + \frac{1-p_i}{K}\right), \\
\text{each } b \in [K] \setminus \{g_j, h_j\}, & \text{with prob. } s\,\frac{1-p_i}{K}, \\
0, & \text{with prob. } 1-s.
\end{cases}
\tag{1}
$$

Here $p_i$ stands for the reliability of the $i$-th worker in giving an answer from the most plausible top two $(g_j, h_j)$. If $p_i = 0$, the worker is considered a spammer who provides random answers among the $K$ choices, and a larger value of $p_i$ indicates a higher worker reliability. The parameter $q_j$ represents the inherent difficulty of task $j$ in distinguishing between the top-two answers: for an easy task, $q_j$ is closer to 1, and for a hard task, $q_j$ is closer to 1/2. We call $q_j$ the confusion probability. Our goal is to recover the top-two answers $(g_j, h_j)$ for all $j \in [m]$ with high probability with the minimum possible sampling probability $s$. We assume that the model parameters $(p, q)$ are unknown. We propose the top-two model to reflect common attributes of public crowdsourcing datasets, summarized in Appendix §A.
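The sampling model in equation 1 can be simulated directly. The following is a minimal sketch, assuming NumPy; the function name `sample_responses` and its argument layout are illustrative choices, not part of the original paper.

```python
import numpy as np

def sample_responses(p, q, g, h, K, s, rng):
    """Draw an n x m response matrix from the top-two model (equation 1).

    p: (n,) worker reliabilities, q: (m,) confusion probabilities,
    g, h: (m,) integer arrays of top-two answers in 1..K (g != h),
    s: sampling probability. Entry 0 means "task not assigned".
    """
    n, m = len(p), len(q)
    # Every label starts with the random-guess mass (1 - p_i) / K.
    probs = np.broadcast_to(((1.0 - p) / K)[:, None, None], (n, m, K)).copy()
    cols = np.arange(m)
    # Add the top-two mass: p_i * q_j on g_j and p_i * (1 - q_j) on h_j.
    probs[:, cols, g - 1] += p[:, None] * q[None, :]
    probs[:, cols, h - 1] += p[:, None] * (1.0 - q)[None, :]
    # Inverse-CDF sampling of one label per (worker, task) pair.
    cdf = probs.cumsum(axis=2)
    u = rng.random((n, m, 1))
    labels = np.minimum((u > cdf).sum(axis=2) + 1, K)
    # Each task is assigned to each worker independently with probability s.
    assigned = rng.random((n, m)) < s
    return np.where(assigned, labels, 0)
```

For instance, calling it with `p` and `q` set to all ones and `s = 1` returns a matrix in which every worker answers `g_j`, as the model predicts for perfectly reliable workers on tasks without confusion.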
The most important observation is that the top-two answers dominate the overall answers, and only the second-dominating answer has an incidence rate comparable to that of the ground truth. In other words, the standard deviation in the incidence rate of the second-dominating answer overlaps with that of the ground truth, but this is not the case for the third- or fourth-dominating answers. This indicates that assuming a unique 'confusing answer' is sufficient to model the confusion stemming from task difficulty. More details are available in Appendix §A.

Binary conversion. The $K$-ary task can be decomposed into $(K-1)$ binary tasks (Karger et al., 2013): define $A^{(k)}$ for $1 \le k < K$ such that the $(i, j)$-th entry $A^{(k)}_{ij}$ indicates whether the answer $A_{ij}$ is larger than $k$, i.e., $A^{(k)}_{ij} = -1$ if $1 \le A_{ij} \le k$; $A^{(k)}_{ij} = 1$ if $k < A_{ij} \le K$; and $A^{(k)}_{ij} = 0$ if $A_{ij} = 0$. We show that $E[A^{(k)}]$, after subtracting a known constant matrix, is rank-1 and that its singular value decomposition (SVD) can reveal the top-two answers $\{(g_j, h_j)\}_{j=1}^m$ and the confusion probability vector $q$.

Proposition 1. For every $1 \le k < K$, the binary-mapped matrix $A^{(k)} \in \{-1, 0, 1\}^{n \times m}$ satisfies

$$
E[A^{(k)}] - \frac{s(K-2k)}{K} \mathbf{1}_{n \times m} = 2 s\, p\, (r^{(k)})^\top,
$$

where $r^{(k)} = [r^{(k)}_1 \cdots r^{(k)}_m]^\top$ is defined as

$$
\text{Case I: } g_j > h_j \quad
r^{(k)}_j :=
\begin{cases}
\frac{k}{K} & \text{where } k < h_j; \\
\frac{k}{K} - (1 - q_j) & \text{where } h_j \le k < g_j; \\
\frac{k}{K} - 1 & \text{where } g_j \le k,
\end{cases}
\qquad
\text{Case II: } g_j < h_j \quad
r^{(k)}_j :=
\begin{cases}
\frac{k}{K} & \text{where } k < g_j; \\
\frac{k}{K} - q_j & \text{where } g_j \le k < h_j; \\
\frac{k}{K} - 1 & \text{where } h_j \le k.
\end{cases}
$$

By defining $\Delta r^{(k)}_j := r^{(k)}_j - r^{(k-1)}_j$ for $k \in [K]$ with $r^{(0)}_j := 0$ and $r^{(K)}_j := 0$ for all $j$, we have

$$
\Delta r^{(k)}_j =
\begin{cases}
\frac{1}{K} - q_j & \text{where } k = g_j, \\
\frac{1}{K} - (1 - q_j) & \text{where } k = h_j, \\
\frac{1}{K} & \text{otherwise.}
\end{cases}
\tag{2}
$$

Note that $\Delta r^{(k)}_j$ has its minimum at $k = g_j$ and its second smallest value at $k = h_j$ for $q_j \in (1/2, 1]$. If one can specify $g_j$, the task difficulty $q_j$ can also be revealed from $\frac{1}{K} - \Delta r^{(g_j)}_j$. In the next section, we use this structure of $r^{(k)}$ for $k \in [K]$ to infer the top-two answers and the confusion probability.
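To make the binary conversion and equation 2 concrete, here is a small sketch assuming NumPy. The helper names `binary_convert` and `r_profile` are hypothetical; `r_profile` builds $r^{(k)}_j$ for a single task as the cumulative sum of the increments in equation 2.

```python
import numpy as np

def binary_convert(A, k):
    """A^(k): -1 if 1 <= A_ij <= k, +1 if k < A_ij <= K, 0 if unassigned."""
    return np.where(A == 0, 0, np.where(A <= k, -1, 1))

def r_profile(g, h, q, K):
    """Return (r, delta) for one task with top-two answers (g, h).

    r[k] is r^(k) for k = 0..K (with r[0] = 0), and delta[k-1] is the
    increment r^(k) - r^(k-1) from equation 2.
    """
    delta = np.full(K + 1, 1.0 / K)
    delta[0] = 0.0
    delta[g] = 1.0 / K - q          # smallest increment, at k = g
    delta[h] = 1.0 / K - (1.0 - q)  # second smallest increment, at k = h
    return np.cumsum(delta), delta[1:]

r, delta = r_profile(g=4, h=2, q=0.7, K=5)
g_rec = int(np.argmin(delta)) + 1       # index of the minimum increment
q_rec = 1.0 / 5 - delta[g_rec - 1]      # task difficulty read off from equation 2
```

In this example the minimum increment sits at $k = 4$ and the second smallest at $k = 2$, and `r[5]` returns to zero, matching the boundary convention $r^{(K)}_j = 0$.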
[Footnote 2] We assume that $\eta\sqrt{n} \le \|p\|_2 \le \sqrt{n}$ for some $\eta > 0$, i.e., there are only $o(n)$ spammers ($p_i = 0$), and $\|r^{(k)}\|_2 = \Theta(\sqrt{m})$ for every $k \in [K]$, which, from equation 2, is satisfied except in exceptional cases.

Algorithm 1 Spectral Method for Initial Estimation (TopTwo1 Algorithm)
1: Input: data matrix $A^1 \in \{0, 1, \dots, K\}^{n \times m}$ and parameter $\eta > 0$ where $\eta\sqrt{n} \le \|p\|_2 \le \sqrt{n}$.
2: Randomly split (with equal probabilities) and convert $A^1$ into binary matrices $X^{(k)} \in \{-1, 0, 1\}^{n \times m}$ and $Y^{(k)} \in \{-1, 0, 1\}^{n \times m}$ for $1 \le k < K$ as described in Sec. 3.1.
3: Let $u^{(k)}$ be the leading normalized left singular vector of $X^{(k)}$. Trim the abnormally large components of $u^{(k)}$ by setting $u^{(k)}_i$ to zero if $u^{(k)}_i > \frac{2}{\eta\sqrt{n}}$, and denote the resulting vector by $\tilde{u}^{(k)}$.
4: Calculate the estimate of $\|p\|_2 r^{(k)}$ by $v^{(k)} := \frac{1}{s'} (Y^{(k)})^\top \tilde{u}^{(k)}$. Assume $v^{(0)} := 0$ and $v^{(K)} := 0$.
5: For $k \in [K]$, calculate $\Delta v^{(k)}_j := v^{(k)}_j - v^{(k-1)}_j$. Estimate the top-two answers for $j \in [m]$ by
$$
\hat{g}_j := \arg\min_{k \in [K]} \Delta v^{(k)}_j; \qquad \hat{h}_j := \arg\min_{k \ne \hat{g}_j,\, k \in [K]} \Delta v^{(k)}_j. \tag{3}
$$
6: Estimate $\|p\|_2$ by defining $l_j := \frac{K}{K-2} \sum_{k \ne \hat{g}_j,\, k \ne \hat{h}_j} \Delta v^{(k)}_j$ and $l := \frac{1}{m} \sum_{j=1}^m l_j$.
7: Estimate $q_j$ for $j \in [m]$ by defining
$$
\hat{q}_j := 1/K - \Delta v^{(\hat{g}_j)}_j / l. \tag{4}
$$
8: Output: estimated top-two answers $\{(\hat{g}_j, \hat{h}_j)\}_{j=1}^m$ and confusion probability vector $\hat{q}$.

3. PROPOSED ALGORITHM

Our algorithm consists of two stages. In Stage 1, we compute an initial estimate of the top-two answers and the confusion probability $q$. In Stage 2, we estimate the worker reliability vector $p$ by using the result of the first stage, and use the estimated $p$ and $q$ to refine our estimates for the top-two answers. Assume that we randomly split the original response matrix $A$ into $A^1$ and $A^2$ with probability $s_1$ and $1 - s_1$, respectively, and use only $A^1$ for Stage 1 and $(A^1, A^2)$ for Stage 2.

3.1. STAGE 1: INITIAL ESTIMATES USING SVD

The first stage begins by randomly splitting $A^1$ again into two independent matrices $B$ and $C$ with equal probabilities. We then convert $B$ and $C$ into $(K-1)$ binary matrices $B^{(k)}$ and $C^{(k)}$, $1 \le k < K$, as explained in Sec. 2. Define $X^{(k)}$ and $Y^{(k)}$ as

$$
X^{(k)} := B^{(k)} - \frac{s'(K-2k)}{K} \mathbf{1}_{n \times m} \quad \text{and} \quad Y^{(k)} := C^{(k)} - \frac{s'(K-2k)}{K} \mathbf{1}_{n \times m}
$$

for $s' = s \cdot s_1 / 2$. We have $E[X^{(k)}] = E[Y^{(k)}] = s' p (r^{(k)})^\top$ from Prop. 1. We use $X^{(k)}$ and $Y^{(k)}$ to estimate $p^* := p / \|p\|_2$ and $\|p\|_2 r^{(k)}$, respectively. The estimators are denoted by $u^{(k)}$ and $v^{(k)}$, respectively. We define $u^{(k)}$ as the left singular vector of $X^{(k)}$ with the largest singular value. The sign ambiguity of the singular vector is resolved by choosing, between $u^{(k)}$ and $-u^{(k)}$, the one in which at least half of the entries are positive. After trimming abnormally large components of $u^{(k)}$ and denoting the trimmed vector by $\tilde{u}^{(k)}$, we calculate $v^{(k)} := \frac{1}{s'} (Y^{(k)})^\top \tilde{u}^{(k)}$, which is an estimate for $\|p\|_2 r^{(k)}$. By using $v^{(k)}$ for $1 \le k < K$, we get estimates for the top-two answers $(\hat{g}_j, \hat{h}_j)$ based on the observation in equation 2. Lastly, we estimate $\|p\|_2$ and use $v^{(k)} / \|p\|_2 \approx r^{(k)}$ to estimate the confusion probability vector $q$. See Algorithm 1 for details.
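The steps of Stage 1 can be sketched as follows, assuming NumPy. This is an illustrative reading of Algorithm 1 rather than the authors' implementation: the function name `top_two_spectral`, the use of a full thin SVD instead of power iterations, and the shortcut for the sum in step 6 are assumptions made here. Any common scale factor in $v^{(k)}$ cancels in the ratio of step 7, so the estimate of $q$ does not depend on it.

```python
import numpy as np

def binary_convert(A, k):
    """-1 if 1 <= A_ij <= k, +1 if k < A_ij <= K, 0 if unassigned."""
    return np.where(A == 0, 0, np.where(A <= k, -1, 1)).astype(float)

def top_two_spectral(A1, K, s_prime, eta, rng):
    """Sketch of Algorithm 1. A1 has entries in {0, ..., K}; requires K > 2.

    s_prime is the probability that an entry is observed in each half of the
    split (s * s1 / 2 in the text); eta is the trimming parameter.
    """
    n, m = A1.shape
    # Step 2: split A1 into B and C with equal probabilities.
    to_B = rng.random((n, m)) < 0.5
    B, C = np.where(to_B, A1, 0), np.where(to_B, 0, A1)
    v = np.zeros((K + 1, m))  # v[0] and v[K] stay 0 by convention
    for k in range(1, K):
        shift = s_prime * (K - 2 * k) / K
        X = binary_convert(B, k) - shift
        Y = binary_convert(C, k) - shift
        # Step 3: leading left singular vector, sign fix, trimming.
        u = np.linalg.svd(X, full_matrices=False)[0][:, 0]
        if np.sum(u > 0) < n / 2:
            u = -u
        u = np.where(u > 2.0 / (eta * np.sqrt(n)), 0.0, u)
        # Step 4: project the independent half Y onto the trimmed vector.
        v[k] = Y.T @ u / s_prime
    # Step 5: increments; row k-1 of dv holds v^(k) - v^(k-1).
    dv = np.diff(v, axis=0)
    order = np.argsort(dv, axis=0)
    g_hat, h_hat = order[0] + 1, order[1] + 1
    cols = np.arange(m)
    # Step 6: the increments sum to v^(K) - v^(0) = 0 for each task, so the
    # sum over k outside {g_hat, h_hat} equals minus the two selected ones.
    rest = -(dv[g_hat - 1, cols] + dv[h_hat - 1, cols])
    l_bar = K / (K - 2) * rest.mean()
    # Step 7: confusion probability estimate.
    q_hat = 1.0 / K - dv[g_hat - 1, cols] / l_bar
    return g_hat, h_hat, q_hat
```

When every task is assigned to every worker and all of the data goes to Stage 1 ($s = s_1 = 1$), the appropriate value is `s_prime = 0.5`.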

3.2. STAGE 2: PLUG-IN MAXIMUM LIKELIHOOD ESTIMATOR (MLE)

The second stage uses the result of Stage 1 to estimate the worker reliability vector $p$. We first propose an estimate for the worker reliability vector $p$ by using the estimated top-two answers $\{(\hat{g}_j, \hat{h}_j)\}_{j=1}^m$ from Algorithm 1. We randomly split the original response matrix $A$ into $A^1$ and $A^2$ with probability $s_1$ and $1 - s_1$, respectively, and use $A^1$ only for Algorithm 1 and $A^2$ only for calculating the estimator $\hat{p}$. Our estimate for the worker reliability $p_i$ is defined as

$$
\hat{p}_i = \frac{K}{K-2} \left( \frac{1}{s(1-s_1)} \left( \frac{1}{m} \sum_{j=1}^m \mathbb{1}(A^2_{ij} = \hat{g}_j \text{ or } \hat{h}_j) \right) - \frac{2}{K} \right). \tag{5}
$$

Algorithm 2 Plug-in MLE (TopTwo2 Algorithm)
1: Input: data matrix $A \in \{0, 1, \dots, K\}^{n \times m}$ and the sample splitting rate $s_1 > 0$.
2: Randomly split $A$ into $A^1$ and $A^2$ by defining $A^1 := A \circ S$ and $A^2 := A \circ (\mathbf{1}_{n \times m} - S)$, where $S$ is an $n \times m$ matrix whose entries are i.i.d. Bern($s_1$) and $\circ$ is the entrywise product.
3: Apply Algorithm 1 to $A^1$ to yield estimates for the top-two answers $\{(\hat{g}_j, \hat{h}_j)\}_{j=1}^m$ and the confusion probability vector $\hat{q}$.
4: By using $\{(\hat{g}_j, \hat{h}_j)\}_{j=1}^m$ and $A^2$, calculate the estimate $\hat{p}$ as in equation 5.
5: By using the whole $A$ and $(\hat{p}, \hat{q})$, find the plug-in MLE estimates $(\hat{g}^{\mathrm{MLE}}_j, \hat{h}^{\mathrm{MLE}}_j)$ by
$$
\arg\max_{(a,b) \in [K]^2,\, a \ne b} \sum_{i=1}^n \log\left( \frac{K \hat{p}_i \hat{q}_j}{1 - \hat{p}_i} + 1 \right) \mathbb{1}(A_{ij} = a) + \log\left( \frac{K \hat{p}_i (1 - \hat{q}_j)}{1 - \hat{p}_i} + 1 \right) \mathbb{1}(A_{ij} = b). \tag{6}
$$
6: Output: estimated top-two answers $\{(\hat{g}^{\mathrm{MLE}}_j, \hat{h}^{\mathrm{MLE}}_j)\}_{j=1}^m$.

Our plug-in MLE uses the estimated $(\hat{p}, \hat{q})$ in place of $(p, q)$ in the oracle MLE, which finds $(\hat{g}_j, \hat{h}_j) \in [K]^2 \setminus \{(1, 1), (2, 2), \dots, (K, K)\}$ such that $(\hat{g}_j, \hat{h}_j) := \arg\max_{(a,b) \in [K]^2,\, a \ne b} \sum_{i=1}^n \log P(A_{ij} \mid p, q_j, (a, b))$, as in equation 6. Details can be found in Alg. 2. The time complexity of Alg. 2 is $O(m^2 \log m + nmK^2)$, since the SVD in Alg. 1 can be computed via power iterations within $O(m^2 \log m)$ steps (Boutsidis et al., 2015), and finding the pair of answers maximizing equation 6 requires $O(nmK^2)$ steps.
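The reliability estimate in equation 5 and the pairwise search in equation 6 can be sketched as below, assuming NumPy. The function names and the `clip` parameter are illustrative; in particular, clipping $\hat{p}_i$ away from 1 and $\hat{q}_j$ into $[1/2, 1]$ is an assumption added here only to keep the logarithms finite.

```python
import numpy as np

def estimate_reliability(A2, g_hat, h_hat, K, rate):
    """Equation 5; `rate` is s * (1 - s1), the sampling probability of A2."""
    in_top_two = (A2 == g_hat[None, :]) | (A2 == h_hat[None, :])
    frac = in_top_two.mean(axis=1)  # fraction of tasks answered within the top two
    return K / (K - 2) * (frac / rate - 2.0 / K)

def plug_in_mle(A, p_hat, q_hat, K, clip=1e-3):
    """Equation 6: best ordered pair (a, b), a != b, for every task."""
    m = A.shape[1]
    p = np.clip(p_hat, 0.0, 1.0 - clip)  # assumption: avoid division by zero
    q = np.clip(q_hat, 0.5, 1.0)
    ratio = K * p / (1.0 - p)
    w_g = np.log1p(ratio[:, None] * q[None, :])          # weight of a vote for a
    w_h = np.log1p(ratio[:, None] * (1.0 - q)[None, :])  # weight of a vote for b
    # G[a-1, j] and H[b-1, j]: weighted vote totals for each label and task.
    G = np.stack([(w_g * (A == a)).sum(axis=0) for a in range(1, K + 1)])
    H = np.stack([(w_h * (A == b)).sum(axis=0) for b in range(1, K + 1)])
    score = G[:, None, :] + H[None, :, :]                # shape (K, K, m)
    score[np.arange(K), np.arange(K), :] = -np.inf       # exclude a == b
    flat = score.reshape(K * K, m).argmax(axis=0)
    return flat // K + 1, flat % K + 1
```

Because the objective separates into a term depending on $a$ and a term depending on $b$, the weighted vote totals are computed once per label and then combined over all $K^2$ pairs.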

4. PERFORMANCE ANALYSIS

To state our main theoretical results, we first introduce some notation and assumptions. Let $\mu^{(i,j)}_{(a,b),k}$ denote the probability that worker $i \in [n]$ gives label $k \in [K]$ for an assigned task $j \in [m]$ whose top-two answers are $(g_j, h_j) = (a, b)$. Note that $\mu^{(i,j)}_{(a,b),k}$ can be written in terms of $(p_i, q_j)$ from the distribution in equation 1 for every $(a, b, k) \in [K]^3$. Let $\mu^{(i,j)}_{(a,b)} = [\mu^{(i,j)}_{(a,b),1}\ \mu^{(i,j)}_{(a,b),2} \cdots \mu^{(i,j)}_{(a,b),K}]^\top$. We introduce a quantity that measures the average ability of workers in distinguishing the ground-truth pair of top-two answers $(g_j, h_j)$ from any other pair $(a, b) \in [K]^2 \setminus \{(g_j, h_j)\}$ for task $j \in [m]$. We define

$$
D^{(j)} := \min_{(a,b) \ne (g_j, h_j)} \frac{1}{n} \sum_{i=1}^n D_{\mathrm{KL}}\left( \mu^{(i,j)}_{(g_j,h_j)}, \mu^{(i,j)}_{(a,b)} \right); \qquad D := \min_{j \in [m]} D^{(j)},
$$

where $D_{\mathrm{KL}}(P, Q) := \sum_i P(i) \log(P(i)/Q(i))$ is the KL-divergence between $P$ and $Q$. Note that $D^{(j)}$ is strictly positive if there exists at least one worker $i$ with $p_i > 0$ and $q_j \in (1/2, 1)$ for the distribution in equation 1, so that $(g_j, h_j)$ can be statistically distinguished from any other $(a, b) \in [K]^2 \setminus \{(g_j, h_j)\}$. We define $D$ as the minimum of $D^{(j)}$ over $j \in [m]$, indicating the average ability of workers in distinguishing $(g_j, h_j)$ from any other $(a, b)$ for the most difficult task in the set of tasks. We split the performance analysis of our algorithm into two parts. First, Theorem 1 states the performance guarantees for Alg. 1.

Theorem 1 (Performance Guarantees for Algorithm 1). For any $\epsilon, \delta_1 > 0$, if the sampling probability $s \cdot s_1 = \Omega\left( \frac{1}{\delta_1^2 \|p\|_2^2} \log \frac{K}{\epsilon} \right)$, Algorithm 1 guarantees the recovery of the ordered top-two answers $(g_j, h_j)$ with probability at least $1 - \epsilon$ for any $j \in [m]$ with $q_j \in (1/2, 1)$, i.e., $P\left( (\hat{g}_j, \hat{h}_j) = (g_j, h_j) \right) \ge 1 - \epsilon$ for all $j \in [m]$ with $q_j \in (1/2, 1)$, and the recovery of the confusion probability $q_j$ with $P(|\hat{q}_j - q_j| < \delta_1) \ge 1 - \epsilon$ for all $j \in [m]$, for every sufficiently large number $m$ of tasks and number of workers $n = O(m / \log m)$.

By using Theorem 1, we can also find sufficient conditions to guarantee the recovery of the paired top-two answers for all tasks and of $q$ with an arbitrarily small $\ell_\infty$-norm error.

Corollary 1. For any $\epsilon, \delta_1 > 0$, if the sampling probability $s \cdot s_1 = \Omega\left( \frac{1}{\delta_1^2 \|p\|_2^2} \log \frac{mK}{\epsilon} \right)$, Algorithm 1 guarantees the recovery of $\{(g_j, h_j)\}_{j=1}^m$ and $q$ with probability at least $1 - \epsilon$ as $m \to \infty$, such that $P\left( (\hat{g}_j, \hat{h}_j) = (g_j, h_j),\ \forall j \in [m] \right) \ge 1 - \epsilon$ and $P\left( \|\hat{q} - q\|_\infty < \delta_1 \right) \ge 1 - \epsilon$.

Proofs of Theorem 1 and Corollary 1 are available in Appendix §G. We next analyze the performance of Alg. 2, which uses Alg. 1 as the first stage. Before providing the main theorem for Alg. 2, we state a lemma characterizing a sufficient condition for estimating the worker reliability vector $p$ from equation 5 with an arbitrarily small $\ell_\infty$-norm error.

Lemma 1. Conditioned on $(\hat{g}_j, \hat{h}_j) = (g_j, h_j)$ for all $j \in [m]$, if $s(1 - s_1) = \Omega\left( \frac{1}{\delta_2^2 m} \log \frac{n}{\epsilon} \right)$, the estimator $\hat{p}_i$ defined in equation 5 of Alg. 2 guarantees $P\left( \|\hat{p} - p\|_\infty < \delta_2 \right) \ge 1 - \epsilon$ for any $\epsilon > 0$.

Combining Corollary 1 and Lemma 1, we obtain estimators $(\hat{p}, \hat{q})$ for the worker reliability vector $p$ and the confusion probability vector $q$ with $\ell_\infty$-norm error bounded by any arbitrarily small $\delta > 0$ with probability at least $1 - 2\epsilon$ if

$$
s = s \cdot s_1 + s(1 - s_1) = \Omega\left( \frac{\log(mK/\epsilon)}{\delta^2 \|p\|_2^2} + \frac{\log(n/\epsilon)}{\delta^2 m} \right) = \Omega\left( \frac{\log(mK/\epsilon)}{\delta^2 \|p\|_2^2} \right), \tag{11}
$$

where the last equality is from the assumption that $\|p\|_2 = \Theta(\sqrt{n})$ and $n = O(m / \log m)$. In this regime, the sample complexity for estimating the task difficulty $q$ is larger than that for estimating the worker reliability $p$. To make sure that the sampling probability $s < 1$, we need $n = \Omega(\log m)$. Our second theorem, Theorem 2, characterizes the sufficient condition on the sampling probability $s$ to guarantee the recovery of the pair of top-two answers for all tasks by equation 6 of Alg. 2, when a sufficiently accurate estimate of $(p, q)$ is given.

Theorem 2. Assume that there is a positive scalar $\rho$ such that $\mu^{(i,j)}_{(g_j,h_j),c} \ge \rho$ for all $(i, j, g_j, h_j, c) \in [n] \times [m] \times [K]^3$. For any $\epsilon > 0$, if $(\hat{p}, \hat{q})$ are given with

$$
\max\{ \|\hat{p} - p\|_\infty, \|\hat{q} - q\|_\infty \} \le \delta := \min\left\{ \frac{\rho}{4}, \frac{\rho D}{4(6 + D)} \right\}, \tag{12}
$$

and the sampling probability $s = \Omega\left( \frac{\log(1/\rho) \log(mK^2/\epsilon) + D \log(m/\epsilon)}{nD} \right)$, then the estimates of $\{(g_j, h_j)\}_{j=1}^m$ from equation 6 of Algorithm 2 guarantee $P\left( (\hat{g}_j, \hat{h}_j) = (g_j, h_j),\ \forall j \in [m] \right) \ge 1 - \epsilon$.

Proofs of Lemma 1 and Theorem 2 are available in Appendix §H. The assumption in Theorem 2 that $\mu^{(i,j)}_{(g_j,h_j),c} \ge \rho$ for some $\rho > 0$ holds if $p_i < 1$ for all $i \in [n]$, i.e., there is no perfectly reliable worker. This assumption can also be satisfied by adding an arbitrarily small random noise to the worker answers. By combining the statements in Corollary 1, Lemma 1, and Theorem 2 with $\delta_1 = \delta_2 = \delta$ for $\delta$ defined in equation 12, we get the overall performance guarantee for Alg. 2.

Corollary 2 (Performance Guarantees for Alg. 2). Alg. 2 guarantees the recovery of the top-two answers for all tasks with $P\left( (\hat{g}_j, \hat{h}_j) = (g_j, h_j),\ \forall j \in [m] \right) \ge 1 - \epsilon$ for any $\epsilon > 0$ if $s$ satisfies

$$
s = \Omega\left( \frac{\log(mK/\epsilon)}{\delta^2 \|p\|_2^2} + \frac{\log(1/\rho) \log(mK^2/\epsilon) + D \log(m/\epsilon)}{nD} \right) = \Omega\left( \frac{\log(m/\epsilon)}{\delta^2 \|p\|_2^2} + \frac{\log(m/\epsilon)}{nD} \right). \tag{14}
$$

In equation 14, the first term guarantees accurate estimates of $p$ and $q$ with $\ell_\infty$-norm error bounded by $\delta$, and the second term guarantees the recovery of the top-two answers from the MLE with high probability. Since $\|p\|_2^2 = \Theta(n)$, the two terms effectively have the same order but with different constant scaling, depending on the model-specific parameters $(p, q)$. Lastly, we show the optimality of the convergence rates of Alg. 1 and Alg. 2 with respect to two types of minimax errors, respectively. The proof of Theorem 3 is available in Appendix §I.

Theorem 3. (a) Let $F_{\bar{p}}$ be a set of $p \in [0, 1]^n$ such that the collective quality of workers, measured by $\|p\|_2$, is parameterized by $\bar{p}$ as $F_{\bar{p}} := \{ p : \frac{1}{n} \|p\|_2^2 = \bar{p} \}$. Assume that $\bar{p} \le 2/3$. If the average number $ns$ of samples (queries) per task is less than $(1/2\bar{p}) \log(1/\epsilon)$, then

$$
\min_{\hat{g}} \max_{p \in F_{\bar{p}},\, g \in [K]^m} \frac{1}{m} \sum_{j \in [m]} P(\hat{g}_j \ne g_j) \ge \epsilon. \tag{15}
$$

(b) There is a universal constant $c > 0$ such that for any $p \in [0, 1]^n$, $q \in (1/2, 1]^m$, if the sampling probability $s < \Omega(1/(nD))$, then

$$
\min_{(\hat{g}, \hat{h})} \max_{\substack{(g,h) \in [K]^m \times [K]^m \\ g_j \ne h_j,\ \forall j \in [m]}} \frac{1}{m} \sum_{j \in [m]} P\left( (\hat{g}_j, \hat{h}_j) \ne (g_j, h_j) \right) \ge c. \tag{16}
$$

From part (a) of Theorem 3, it is necessary to have $s > \Omega\left( (1/\|p\|_2^2) \log(1/\epsilon) \right)$ to make the minimax error in equation 15 less than $\epsilon$. Since Theorem 1 shows that Alg. 1 recovers $(g_j, h_j)$ with probability at least $1 - \epsilon$ if $s > \Omega\left( (1/\|p\|_2^2) \log(1/\epsilon) \right)$ when $s_1 = 1$, we can conclude that Alg. 1 achieves the minimax optimal rate for a fixed collective intelligence of workers, measured by $\|p\|_2$. From part (b) of Theorem 3, for any $(p, q)$, unless we have $s > \Omega(1/(nD))$ there always exists a constant fraction of tasks for which the recovered top-two answers are incorrect. This bound matches our sufficient condition on $s$ for Alg. 2 in equation 14 up to logarithmic factors, as long as $\delta^2 \|p\|_2^2 \gtrsim nD$, showing the minimax optimality of Alg. 2 for any $(p, q)$. More discussion of the theoretical results is available in Appendix §E.
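The divergence $D^{(j)}$ used throughout this section can be evaluated numerically from equation 1. The sketch below assumes NumPy and the hypothetical helpers `label_dist` and `task_divergence`; it restricts the minimum to ordered pairs with $a \ne b$ (an assumption matching the search space of the MLE) and requires $p_i < 1$ so that all label probabilities are positive.

```python
import numpy as np
from itertools import permutations

def label_dist(p, q_j, a, b, K):
    """Rows are mu^{(i,j)}_{(a,b)}: label distribution of each worker, given assignment."""
    mu = np.repeat(((1.0 - p) / K)[:, None], K, axis=1)
    mu[:, a - 1] += p * q_j
    mu[:, b - 1] += p * (1.0 - q_j)
    return mu

def task_divergence(p, q_j, g, h, K):
    """D^(j): worker-averaged KL divergence to the closest competing pair."""
    P = label_dist(p, q_j, g, h, K)
    best = np.inf
    for a, b in permutations(range(1, K + 1), 2):
        if (a, b) == (g, h):
            continue
        Q = label_dist(p, q_j, a, b, K)
        kl = np.sum(P * np.log(P / Q), axis=1).mean()
        best = min(best, kl)
    return best

p = np.array([0.5, 0.8])
d_confusing = task_divergence(p, 0.5, g=1, h=2, K=5)  # order of the top two is not identifiable
d_regular = task_divergence(p, 0.8, g=1, h=2, K=5)
```

With $q_j = 1/2$ the swapped pair $(h_j, g_j)$ induces exactly the same distribution, so the divergence vanishes, which illustrates why the positivity statement needs $q_j \in (1/2, 1)$.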

5. EXPERIMENTS

We evaluate the proposed algorithm under diverse scenarios of synthetic datasets in Sec. 5.1, and in two applications: identifying difficult tasks in real datasets in Sec. 5.2, and training neural network models with soft labels defined from the top-two plausible labels in Sec. 5.3.

5.1. EXPERIMENTS ON SYNTHETIC DATASET

We compare the empirical performance of Alg. 1 and Alg. 2 (referred to as TopTwo1 and TopTwo2) with other baselines: majority voting (MV), OPT-D&S and MV-D&S (Zhang et al., 2014), PGD (Ma et al., 2018), M-MSR (Ma & Olshevsky, 2020), and oracle-MLE, whose details can be found in Appx. §C. We choose these baselines since they have the strongest established guarantees in the literature and they are all MLE-based approaches, from which the top-two answers can be inferred. Oracle-MLE, which uses the ground-truth model parameters, provides the best possible performance. We devise four scenarios, described in Table 1, to verify the robustness of our model for various $(p, q)$ ranges, at $(n, m) = (50, 500)$ with $s \in (0, 0.2]$. The number of choices for each task is fixed at 5. Fig. 1 reports the empirical error probability $\frac{1}{m} \sum_{j=1}^m P\left( (\hat{g}_j, \hat{h}_j) \ne (g_j, h_j) \right)$ averaged over 30 runs, with 95% confidence intervals (shaded region). The four columns correspond to the four scenarios, respectively. The prediction errors for $g_j$ and $h_j$ are plotted in Fig. 6 of Appx. §D.1. We observe that TopTwo2 achieves the best performance, near that of the oracle-MLE, in recovering $(g_j, h_j)$ for all the considered scenarios. The reason TopTwo2 outperforms the baselines differs by scenario. For the Easy scenario, since $q_j$ is close to 1, it is easy to distinguish $g_j$ from $h_j$ but hard to distinguish $h_j$ from other labels. Our algorithm achieves the best performance in estimating $h_j$ by a large margin (Fig. 6). For the Hard scenario, it is hard to distinguish $g_j$ and $h_j$, but our algorithm, which uses an accurate $\hat{q}_j$, can better distinguish them. For Few-smart, our algorithm achieves the biggest gain compared to other methods, since it can effectively distinguish the few smart workers from spammers. High-variance shows the effect of having diverse $q_j$ in a dataset. We remark that our algorithm achieves the best performance, near that of the oracle-MLE, for all the scenarios, while the next-best performer changes depending on the scenario. For example, OPT-D&S is the second-best performer in the Easy scenario, while it is the worst performer in the Few-smart scenario. We also show the robustness of our algorithm against changes in model parameters in Appendix §D.

Table 1: Parameter ranges for the four synthetic scenarios.
Scenario        Worker reliability p_i                            Task difficulty q_j
Easy            p_i ∈ [0, 1]                                      q_j ∈ [0.9, 1]
Hard            p_i ∈ [0, 1]                                      q_j ∈ (0.5, 0.6]
Few-smart       90%: p_i ∈ [0, 0.1]; 10%: p_i ∈ [0.9, 1]          q_j ∈ (0.5, 1]
High-variance   p_i ∈ [0, 1]                                      50%: q_j ∈ (0.5, 0.6]; 50%: q_j ∈ [0.9, 1.0]
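The error metric reported in this section can be computed as in the following sketch, assuming NumPy. The helper names are hypothetical, and the normal-approximation interval is an assumption: the text does not state how the 95% confidence intervals were obtained.

```python
import numpy as np

def pair_error(g_hat, h_hat, g, h):
    """Fraction of tasks whose ordered pair (g_j, h_j) is not recovered exactly."""
    g_hat, h_hat, g, h = map(np.asarray, (g_hat, h_hat, g, h))
    return float(np.mean((g_hat != g) | (h_hat != h)))

def mean_with_ci(errors, z=1.96):
    """Mean over runs and half-width of a normal-approximation 95% interval."""
    errors = np.asarray(errors, dtype=float)
    half_width = z * errors.std(ddof=1) / np.sqrt(len(errors))
    return float(errors.mean()), float(half_width)

err = pair_error([1, 2, 3], [2, 3, 1], [1, 2, 3], [2, 1, 1])
mean, half = mean_with_ci([0.1, 0.2, 0.3])
```

A task counts as an error when either the ground truth or the confusing answer is wrong, so this metric is at least as large as the error in $g_j$ alone.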

5.2. EXPERIMENTS ON REAL-WORLD DATASET: INFERRING TASK DIFFICULTIES

We next provide experimental results using real-world data collected from MTurk and show that our algorithm can be used for inferring task difficulties. We devised a color comparison task where we asked the crowd to choose the color, among six given choices, that looks the most similar to a reference color of each task. See Fig. 4 in Appx. §A.1 for example tasks. After randomly generating a reference color and the six choices, we identified the ground truth and the most confusing answer for each task by measuring the distance between colors using the CIEDE2000 color difference formula (Sharma et al., 2005). If the distance from the reference color to the ground truth is much shorter than that to the most confusing answer, then the task was considered easy. We designed 1000 tasks and distributed them to 200 workers, collecting an average of 19.5 responses per task. After collecting the data, we subsampled it to simulate how the prediction error decreases as the number of responses per task increases. Fig. 2a shows the performance in detecting $(g_j, h_j)$, $g_j$ and $h_j$, averaged over 10 random subsamplings, with 95% confidence intervals (shaded region). The TopTwo2 algorithm achieved the best performance in detecting $(g_j, h_j)$, $g_j$ and $h_j$ in all ranges. We further examined the correlation between the task difficulty, quantified by the gap between the distances from the reference color to the ground truth and to the most confusing answer, and the estimated confusion probability $\hat{q}_j$ across tasks. We selected the top 50 most difficult and top 50 easiest tasks according to the estimated confusion probability $\hat{q}_j$ and plotted the histograms of the distance gap for the two groups in Fig. 2b. We can see that the difficult group (blue, having the lowest $\hat{q}_j$) tends to have a smaller distance gap than the easy task group (red). This result shows that our algorithm can identify difficult tasks in real datasets.

Figure 2b: Histogram of the color distance gap for the task groups with the highest $\hat{q}_j$ (easiest tasks) and the lowest $\hat{q}_j$ (most difficult tasks). The difficult task group (blue) tends to have a smaller color distance gap.
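Selecting the two task groups compared in Fig. 2b amounts to sorting tasks by the estimated confusion probability. A minimal sketch, assuming NumPy and a hypothetical helper `split_by_difficulty`:

```python
import numpy as np

def split_by_difficulty(q_hat, size=50):
    """Indices of the `size` hardest (lowest q_hat) and easiest (highest q_hat) tasks."""
    order = np.argsort(q_hat, kind="stable")
    return order[:size], order[-size:]

q_hat = np.array([0.9, 0.55, 0.7, 0.6])
hard, easy = split_by_difficulty(q_hat, size=2)
```

The distance gaps of the tasks indexed by `hard` and `easy` can then be passed to any histogram routine to reproduce a comparison of the kind described above.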

6. DISCUSSION

We proposed a new model for multiple-choice crowdsourcing, with top-two confusable answers and varying confusion probability over tasks. We provided an algorithm to infer the top-two answers and the confusion probability. This work can benefit several query-based data acquisition systems such as MTurk or review systems by providing additional information about the task such as the most plausible answer other than the ground truth and how plausible it is, which can be used to quantify the accuracy of the ground truth or to classify the tasks based on difficulty. The topic of confusion is getting increasing attention in the machine learning community for designing reliable classifiers (Jin et al., 2017; Luque et al., 2019; Chang et al., 2017) . We also demonstrated possible applications of our algorithm in designing soft labels for better generalization of neural networks.

A VERIFICATION FOR THE PROPOSED TOP-TWO MODEL

We proposed the top-two model to reflect the key attributes of seven datasets, namely Adult2, Dog, Web, Flag, Food, Plot, and Color, whose details are summarized in Appendix A.1. Table 3 shows empirical distributions of the mean incidence of responses for the top-three dominating answers, sorted by the dominance proportions, for the six public datasets and the Color dataset that we collected, with the standard deviation over the tasks in each dataset. In Fig. 3, we also plot empirical distributions of the mean incidence of responses sorted by the dominant proportion, with error bars indicating the standard deviation. The i-th data point represents the average incidence of the i-th highest response in each task. For example, in the Adult2 dataset, the most dominating answer accounts for a 0.8 fraction of the total answers, and the next dominating answer accounts for a 0.14 fraction of the total answers on average. From the table and figure, we can observe that for all the considered public datasets the top-two answers dominate the overall answers, i.e., about 65-90% of the total answers belong to the top two. Furthermore, the average ratio of the most dominating answer to the second one is 4:1, while that of the second to the third is 7.5:1. There often exist overlaps in the error bars between the ground truth and the second dominating answer, e.g., for the Web, Plot, and Color datasets, but no such overlap is found between the ground truth and the third dominating answer. What we can call a 'confusing answer' is an answer that has an incidence rate comparable to that of the ground truth. In all the considered datasets, only the second dominating answer shows such a tendency, and thus we can conclude that the third dominating answer cannot be called a 'confusing answer', and that the top-two model in equation 1 describes well the errors in answers caused by confusion.
Moreover, from the public datasets, we also observe that the task difficulty can be quantified by the confusion probability between the top-two answers. As an example, for the Web dataset, when we select the easiest 500 tasks and the hardest 500 tasks by ordering tasks by the ratio of correct answers, the ratio of the ground truth to the second-best answer was 10.7:1 for the easiest group, while it was 1.5:1 for the hardest group. This observation shows that the ratio between the top-two answers indeed captures task difficulty, as does our model parameter for task difficulty $q_j$ in equation 1.
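The incidence statistics described in this appendix can be computed from a response matrix as sketched below, assuming NumPy, the encoding of Sec. 2 (labels 1..K, 0 for unassigned), and a hypothetical helper name `sorted_incidence`.

```python
import numpy as np

def sorted_incidence(A, K):
    """Mean and std over tasks of the i-th most frequent answer's share."""
    counts = np.stack([(A == k).sum(axis=0) for k in range(1, K + 1)])  # shape (K, m)
    totals = counts.sum(axis=0)
    keep = totals > 0                          # skip tasks with no responses
    frac = counts[:, keep] / totals[keep]
    frac = -np.sort(-frac, axis=0)             # descending within each task
    return frac.mean(axis=1), frac.std(axis=1)

A = np.array([[1, 2],
              [1, 2],
              [2, 3],
              [0, 2]])
mean_share, std_share = sorted_incidence(A, K=3)
```

Here the first entry of `mean_share` is the average share of the most dominating answer and the second entry is that of the second dominating answer.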

A.1 DATASETS

We collect six publicly available multi-class datasets: Adult2, Dog, Web, Flag, Food and Plot. Since these datasets do not provide information about the most confusing answer or the task difficulty, we additionally create a new dataset called 'Color', for which we can identify the most confusing answer and also quantify the task difficulty for all the included tasks.
• Color is a dataset where the task is to find the most similar color to the reference color among six different choices. For each task, we randomly create a reference color and then choose six colors as choices. The distance from the reference color to the ground-truth color is between 4.5 and 5.5, the distance to the most confusing answer is between 5.5 and 6.5, and the distance to the rest of the choices is between 11 and 12, where the distance between pairs of colors is measured by the CIEDE2000 color difference formula (Sharma et al., 2005). The tasks are ordered in terms of their difficulty levels by measuring the gap between the distance from the reference color to the ground truth and that to the most confusing answer. If the distance from the reference color to the ground truth is much shorter than that to the most confusing answer, then the task is considered easy. Using MTurk, we collected 19600 labels from 196 workers for 1000 tasks. Each Human Intelligence Task (HIT) is composed of 100 randomly selected tasks, and we pay $1 to each worker who completes a HIT. Fig. 4 shows an example task for the Color dataset.
• Adult2 (Ipeirotis et al., 2010) is a 4-class dataset where the task is to classify web pages into four categories (G, PG, R, X) depending on the adult level of the websites. This dataset contains 3317 labels for 333 websites, provided by 269 workers.
• Dog (Zhang et al., 2014) is a 4-class dataset where the task is to identify the breed (out of Norfolk Terrier, Norwich Terrier, Irish Wolfhound, and Scottish Deerhound) of a given dog. This dataset contains 7354 labels collected from 52 workers for 807 tasks.
• Web (Zhou et al., 2012) is a 5-class dataset where the task is to determine the relevance of query-URL pairs with a 5-level rating (from 1 to 5). The dataset contains 15567 labels for 2665 query-URL pairs, provided by 177 workers.
• Flag (Krivosheev et al., 2020) is a dataset for multiple-choice tasks where each task is to identify the country for a given flag from 10 given choices. A total of 1600 votes are collected from 220 workers for 100 tasks.
• Food (Krivosheev et al., 2020) is a dataset for multiple-choice tasks where each task asks to identify a picture of a given food or dish from 5 given choices. This dataset contains 1220 labels for 76 tasks collected from 177 workers.
• Plot (Krivosheev et al., 2020) is a dataset for multiple-choice tasks where the task is to identify a movie from a description of its plot from 10 given choices. Only workers who correctly solved the first 10 test questions can answer the rest of the tasks. A total of 1937 labels are collected from 122 workers for 100 tasks.
Table 4 summarizes the introduced datasets.

Figure 5: Training images with (a) the lowest confusion probabilities q (considered to be hard) and (b) the highest confusion probabilities q (considered to be easy).

that the first answer and the second answer are hard to distinguish.

B.2 MODEL

We trained two simple CNN architectures, VGG-19 and ResNet-18, to show the usefulness of the second answer and the confusion probability. For each model, the loss function is the cross-entropy between the softmax output and the two-hot vector (whose entries are q and 1 - q for g and h, respectively). We compare the results of our top-two label training with those of full-distribution training and hard-label (one-hot vector) training.
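The two-hot target described above can be sketched as follows. This is a minimal NumPy illustration, not the training code used in the experiments; the function names `top_two_soft_label` and `soft_cross_entropy`, and the use of 0-based class indices, are assumptions made for the example.

```python
import numpy as np

def top_two_soft_label(g, h, q, num_classes):
    """Two-hot target: mass q on the ground truth g, 1 - q on the confusing answer h.

    Hypothetical helper; g and h are 0-based class indices.
    """
    if g == h:
        raise ValueError("g and h must be different classes")
    target = np.zeros(num_classes)
    target[g] = q
    target[h] = 1.0 - q
    return target

def log_softmax(logits):
    """Numerically stable log-softmax of a 1-D logit vector."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and the softmax of the logits."""
    return float(-np.sum(target * log_softmax(logits)))
```

Setting q = 1 recovers the usual one-hot (hard-label) cross-entropy, so the same loss function covers both training modes compared in this appendix.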

B.3 TRAINING

We train each model using 10-fold cross validation (using 90% of the images for training and 10% of the images for validation) and average the results across 5 runs. We run a grid search over learning rates, with the base learning rate chosen from {0.1, 0.01, 0.001}. We find 0.1 to be optimal in all cases. We train each model for a maximum of 150 epochs using the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. Our neural networks are trained using NVIDIA GeForce 3090 GPUs.

C BASELINE METHODS

In this section, we explain the baseline methods with which we compare the performance of our algorithms. To analyze the performance in recovering the top-two answers, we considered the ML-based algorithms, including the Spectral-EM algorithm (MV-D&S and OPT-D&S) (Zhang et al., 2014), Projected Gradient Descent (PGD) (Ma et al., 2018) and M-MSR (Ma & Olshevsky, 2020), which provide a "score" for each label so that we can recover the top-two answers.

• Spectral-EM algorithm (MV-D&S and OPT-D&S) (Zhang et al., 2014) is a two-stage algorithm for multi-class crowd labeling problems. These algorithms are built for the D&S model, where each worker has his/her own confusion matrix. In the first stage of the algorithm, the confusion matrix of each worker is estimated via the spectral method (OPT-D&S) or majority voting (MV-D&S), and in the second stage, the estimates of the confusion matrices are refined by optimizing the objective function of the D&S estimator via the Expectation Maximization (EM) algorithm.
• Projected Gradient Descent (PGD) (Ma et al., 2018) is an approach to estimate the skill of each worker in the single-coin D&S model. The authors formulate the skill estimation problem as a rank-one correlation-matrix completion problem and propose a projected gradient descent method to solve it.
• M-MSR (Ma & Olshevsky, 2020) is an approach to estimate the reliability of each worker in the multi-class D&S model. The M-MSR algorithm exploits the fact that the rank of the response matrix is one. To estimate the reliability of the workers, it uses update rules to find the left and right singular vectors of the response matrix. In this process, extreme values are filtered out to guarantee the stable convergence of the algorithm.

D SYNTHETIC EXPERIMENTS

D.1 ADDITIONAL PLOTS FOR SYNTHETIC DATA EXPERIMENTS IN SEC. 5.1

In Section 5.1, we devised four scenarios described in Table 1 to verify the robustness of our model for various (p, q) ranges, with (n, m, s) = (50, 500, 0.2). The performance of the algorithms is measured by the empirical average error probabilities in recovering $g_j$, $h_j$ and $(g_j, h_j)$, i.e., $\frac{1}{m}\sum_{j=1}^{m} P(\hat g_j \neq g_j)$, $\frac{1}{m}\sum_{j=1}^{m} P(\hat h_j \neq h_j)$ and $\frac{1}{m}\sum_{j=1}^{m} P((\hat g_j, \hat h_j) \neq (g_j, h_j))$, and plotted in Fig. 6. We can observe that for all the considered scenarios TopTwo2 achieves the best performance, near the oracle MLE, in recovering $(g_j, h_j)$. Depending on the scenario, though, the reason TopTwo2 outperforms the other methods differs. For the Easy scenario, since $q_j$ is close to 1, it becomes easy to distinguish $g_j$ from $h_j$ but hard to distinguish $h_j$ from other labels. Our algorithm achieves the best performance in estimating $h_j$ by a large margin. For the Hard scenario, it becomes hard to distinguish $g_j$ and $h_j$, but our algorithm, which uses an accurate estimate $\hat q_j$, can better distinguish $g_j$ and $h_j$. The High-variance scenario shows the effect of having diverse $q_j$ in a dataset. For the Few-smart scenario, our algorithm achieves the biggest gain compared to other methods, since it can effectively distinguish the few smart workers from spammers. We remark that even though the performance gap between TopTwo2 and the next best performer is not significant in some cases, our algorithm always achieves the best performance, near that of the oracle MLE, for all the scenarios, while the next best performer keeps changing depending on the scenario. For example, OPT-D&S is the second best performer in the 'Easy' scenario, while it is the worst performer in the 'Few-smart' scenario.

Figure 6: Prediction error for $(g_j, h_j)$ (top row), $g_j$ (middle) and $h_j$ (bottom) for four scenarios. Our algorithm (TopTwo2) achieves the best performance, near the oracle MLE for all the scenarios.
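For a single simulation run, the three error metrics above reduce to fractions of mis-recovered tasks. The following is a minimal sketch under that reading; the function name `top_two_error_rates` is hypothetical, and averaging over repeated runs (to approximate the probabilities) is left to the caller.

```python
import numpy as np

def top_two_error_rates(g_hat, h_hat, g, h):
    """Fraction of tasks with a wrong g, a wrong h, or a wrong (g, h) pair.

    Hypothetical helper: all inputs are length-m sequences of labels.
    """
    g_hat, h_hat, g, h = map(np.asarray, (g_hat, h_hat, g, h))
    g_wrong = g_hat != g
    h_wrong = h_hat != h
    return {
        "g": float(np.mean(g_wrong)),
        "h": float(np.mean(h_wrong)),
        # The pair is wrong as soon as either component is wrong.
        "pair": float(np.mean(g_wrong | h_wrong)),
    }
```

By construction the pair error is at least as large as each individual error, which is why the pair panel is the most demanding of the three.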

D.2 ROBUSTNESS OF OUR METHODS

In this section, we present a set of four additional synthetic experiments to demonstrate the robustness of our methods, Alg. 1 and Alg. 2 (referred to as TopTwo1 and TopTwo2). In each experiment, we change a parameter of our synthetic error model and compare the prediction error of our algorithms to the baselines: majority voting (MV), OPT-D&S and MV-D&S (Zhang et al., 2014), PGD (Ma et al., 2018) and Oracle-MLE. We measure the performance of each algorithm by the empirical average error probabilities in recovering the ground truth $g_j$, the most confusing answer $h_j$ and the pair of top two $(g_j, h_j)$, i.e., $\frac{1}{m}\sum_{j=1}^{m} P(\hat g_j \neq g_j)$, $\frac{1}{m}\sum_{j=1}^{m} P(\hat h_j \neq h_j)$ and $\frac{1}{m}\sum_{j=1}^{m} P((\hat g_j, \hat h_j) \neq (g_j, h_j))$. Oracle-MLE provides a lower bound on the prediction error.

Changing the dimension of observed matrix: We first check the robustness of our methods against changes in the dimensions of the observation matrix $A \in \{0, 1, \dots, K\}^{n \times m}$ with $n \le m$. We vary the number of workers (n) or the number of tasks (m) while fixing the other dimension. The default values of n and m are 50 and 500, respectively, and the sampling probability s is fixed as 0.1 throughout the experiments. The worker reliability $p_i$ and the task difficulty $q_j$ are sampled uniformly at random from [0, 1] and (1/2, 1], respectively, for all $i \in [n]$ and $j \in [m]$. In Fig. 7a and 7b, we report the results when we change n for a fixed m and s, and when we change m for a fixed n and s, respectively. From Fig. 7a, we can see that as the number of workers increases, the performance of every algorithm improves since the number of samples per task scales as ns for a fixed s. Our algorithm achieves performance close to that of Oracle-MLE over the whole considered range, which implies that the worker reliabilities $\{p_i\}$ are well estimated by our methods. From Fig. 7b, we can see that our algorithm achieves robust performance against changes in the number of tasks, and its performance gets closer to that of Oracle-MLE as the number of tasks increases. Since our method uses SVD in the first stage, a larger dimension is beneficial for the concentration of the random perturbation matrix with respect to the expectation of the observation matrix. This phenomenon is also observed for other baseline methods that are based on the spectral method, such as OPT-D&S.

Changing the variance of worker reliability: In this experiment, we change the range of $p_i$, the parameter for worker skill/reliability, for $i \in [n]$, with a fixed mean in order to observe the impact of the variance of the worker reliability on the prediction error. We randomly sample $p_i$ from the window [0.5 - x, 0.5 + x] with x varying from 0.05 to 0.25. The mean of the worker reliability is fixed as 0.5. As shown in Fig. 7c, when the variance of the worker reliability increases, the baseline methods that estimate worker reliabilities perform better than majority voting. Our TopTwo2 algorithm achieves the best performance, close to Oracle-MLE, as the standard deviation increases, i.e., as the workers become more heterogeneous.

Changing the variance of task difficulty: We also design an experiment to observe the impact of the variance of $q_j$, $j \in [m]$, the parameter for task difficulty, on the prediction error. We randomly sample $q_j$ from the window [0.75 - x, 0.75 + x] with x varying from 0.05 to 0.25. The mean of the task difficulty is fixed as 0.75. If the variance of the task difficulty is small, it could be sufficient to estimate only the worker reliability, since all the tasks have almost the same difficulty. As shown in Fig. 7d, when the variance of the task difficulty increases, our TopTwo2 algorithm performs better than the other baselines. This is evidence for the validity of our method in estimating the task difficulty.
Changing the portion of spammers: Spammers who provide random answers always exist in crowdsourcing systems. To improve the inference performance, it is important to distinguish spammers from reliable workers. In our experimental setup, we define a spammer as a worker whose reliability parameter p i is in the range [0, 0.1]. We change the portion of spammers among the workers from 0.1 to 0.9 and compare the prediction error of our methods to those of other baseline methods. In Fig. 7e , we can see that our algorithm achieves the best performance among all the considered baselines except Oracle-MLE, which can exactly distinguish spammers from reliable workers. This result demonstrates the superiority of our methods in detecting spammers compared to other methods.
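The synthetic experiments in this section draw observation matrices from the top-two model. A minimal sampler is sketched below, assuming the response distribution that can be read off from the expression for ρ in the proof of Proposition 1: a worker responds with probability s; a response falls in the top two with probability p_i (g_j with probability q_j, h_j otherwise) and is a uniform guess over the K labels with probability 1 - p_i. The function name `sample_responses` and the encoding of unobserved entries as 0 are assumptions of this sketch.

```python
import numpy as np

def sample_responses(p, q, g, h, K, s, rng):
    """Draw an n x m response matrix with entries in {0, 1, ..., K}.

    Hypothetical sampler: 0 marks an unobserved (worker, task) pair,
    and labels are 1-based as in the observation matrix A.
    """
    p, q, g, h = map(np.asarray, (p, q, g, h))
    n, m = len(p), len(q)
    observed = rng.random((n, m)) < s               # worker i answers task j
    in_top_two = rng.random((n, m)) < p[:, None]    # reliable response
    pick_g = rng.random((n, m)) < q[None, :]        # g_j rather than h_j
    top_two_answer = np.where(pick_g, g[None, :], h[None, :])
    random_answer = rng.integers(1, K + 1, size=(n, m))  # uniform guess
    answers = np.where(in_top_two, top_two_answer, random_answer)
    return np.where(observed, answers, 0)
```

For the spammer experiment, one could draw p_i from [0, 0.1] for the chosen fraction of workers before calling the sampler; how the remaining workers' reliabilities are drawn is not specified here and would be a further assumption.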

D.3 ESTIMATING THE WORKER RELIABILITY VECTOR AND THE TASK DIFFICULTY VECTOR

In this section, we examine the accuracy of our estimates for the worker reliability vector p and the task difficulty vector q. The worker reliability is estimated by $\hat p$ defined in equation 5 of Algorithm 2, and the task difficulty is estimated by $\hat q$ defined in equation 4 of Algorithm 1. To analyze the accuracy of these estimators, we compute the mean squared errors (MSE), $\frac{1}{n}\|\hat p - p\|_2^2$ and $\frac{1}{m}\|\hat q - q\|_2^2$, respectively. To analyze the estimation accuracy for the worker reliability, we first sample $p_i$ uniformly at random from [0, 1] for all $i \in [n]$ and fix the worker reliability vector p. Then, we randomly sample the task difficulty vector $q \in (1/2, 1]^m$ fifty times and sample the observation matrices from the distribution in equation 1 for each (p, q) pair with the fixed p. For each observation matrix, we subsample the data with varying probabilities and apply Algorithm 2 to get the estimate $\hat p$, which is then used to calculate the MSE of $\hat p$. We report the MSE averaged over these fifty cases. Similarly, to analyze the estimation accuracy for the task difficulty, we randomly sample and fix a task difficulty vector $q \in (1/2, 1]^m$ and generate fifty different observation matrices while varying the worker reliability vector p. We again report the MSE averaged over these fifty cases. The numbers of workers and tasks are set to (50, 500) for the worker reliability estimation, and to (100, 1000) for the task difficulty estimation. In Fig. 8a and 8b, we plot the MSE for $\hat p$ and $\hat q$, respectively, as the average number of queries per task increases. We can see that for both $\hat p$ and $\hat q$, the MSE converges to near zero as the average number of queries per task increases. However, estimating the task difficulty requires more samples, as our theory (equation 11) suggests.
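The per-coordinate mean squared error used above can be computed as in the following short sketch; the helper name `mean_squared_error` is hypothetical, and the same function applies to both the reliability vector (length n) and the difficulty vector (length m).

```python
import numpy as np

def mean_squared_error(estimate, truth):
    """Squared Euclidean distance divided by the vector length."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if estimate.shape != truth.shape:
        raise ValueError("estimate and truth must have the same shape")
    return float(np.mean((estimate - truth) ** 2))
```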

E DISCUSSION OF THEORETICAL RESULTS

In this section, we present a discussion of the main theoretical results.

• Theorem 1 asserts that a sampling probability of $\Omega\left(\frac{1}{\delta_1^2 \|p\|_2^2} \log\frac{K}{\epsilon}\right)$ is sufficient to recover the top-two answers $(g_j, h_j)$ for any task $j \in [m]$ and to estimate the confusion probability $q_j$ with accuracy $|\hat q_j - q_j| < \delta_1$ by Algorithm 1 with probability at least $1 - \epsilon$. Combined with Theorem 3 part (a), we can see that this sample complexity is the minimax optimal rate for a fixed collective quality of workers, measured by $\|p\|_2^2$.
• It is also worth comparing our algorithm with the simple majority voting (MV) scheme. The MV scheme infers the top-two answers by counting the majority of the received answers. A simple analysis shows that the MV scheme requires a sampling probability s such that $ns = \Theta\left(\left(\frac{1}{n}\sum_i p_i\right)^{-2} \log\frac{1}{\epsilon}\right)$ to recover $(g_j, h_j)$ with probability $1 - \epsilon$. Recall that Algorithm 1 requires $ns = \Omega\left(\frac{n}{\delta_1^2 \|p\|_2^2} \log\frac{K}{\epsilon}\right)$ samples per task. Since $\frac{1}{n}\|p\|_2^2 = \frac{1}{n}\sum_i p_i^2 \ge \left(\frac{1}{n}\sum_i p_i\right)^2$ by the Cauchy-Schwarz inequality, Algorithm 1 achieves strictly better trade-offs unless $p_i$ is the same for all workers $i \in [n]$. As an example, for a spammer-hammer model where an $\alpha \in (0, 1)$ fraction of workers are hammers with $p_i = 1$ and the rest are spammers with $p_i = 0$, Algorithm 1 requires $ns = \Theta\left(\frac{1}{\alpha} \log\frac{1}{\epsilon}\right)$ samples per task, while MV requires $ns = \Theta\left(\frac{1}{\alpha^2} \log\frac{1}{\epsilon}\right)$ samples per task to recover the top-two answers with probability $1 - \epsilon$.
• Theorem 2 shows that when we have an entrywise bound on the estimated worker reliability vector $\hat p$ and the task difficulty vector $\hat q$, the plug-in MLE estimator used in Algorithm 2 guarantees the recovery of the top-two answers if the sampling probability satisfies $s = \Omega\left(\frac{\log(m/\epsilon)}{n D}\right)$, where D, which depends on (p, q), indicates the average reliability of workers in distinguishing the top-two answers from any other pair for the most difficult task. Combined with Theorem 3 part (b), we can see that this sample complexity is the minimax optimal rate for any (p, q), ignoring logarithmic terms.
• Combining the conditions for the accurate estimation of model parameters in equation 11 and the convergence of the plug-in MLE (Theorem 2), Corollary 2 shows the condition on the sample complexity to guarantee the performance of Algorithm 2.
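The gap between Algorithm 1 and majority voting in the spammer-hammer example comes down to comparing the mean of the squared reliabilities with the square of the mean reliability. The sketch below computes both quantities; the helper names are hypothetical, and constants and logarithmic factors in the sample-complexity bounds are ignored.

```python
import numpy as np

def collective_quality(p):
    """(1/n) * ||p||_2^2, the quantity governing the rate of Algorithm 1."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(p ** 2))

def squared_mean_reliability(p):
    """((1/n) * sum_i p_i)^2, the quantity governing the rate of majority voting."""
    p = np.asarray(p, dtype=float)
    return float(np.mean(p) ** 2)

def spammer_hammer(n, alpha):
    """Reliability vector with a fraction alpha of hammers (p_i = 1), rest spammers (p_i = 0)."""
    num_hammers = int(round(alpha * n))
    return np.concatenate([np.ones(num_hammers), np.zeros(n - num_hammers)])
```

For example, with n = 100 and α = 0.2, the first quantity is 0.2 and the second is 0.04, so the samples-per-task requirements scale as 1/α = 5 versus 1/α² = 25, matching the comparison in the second bullet.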

F PROOF OF PROPOSITION 1

For each task j and label k, define four indicator functions:

$$\Pi_a(j,k) := \mathbb{1}(g_j > k,\, h_j > k), \quad \Pi_b(j,k) := \mathbb{1}(g_j \le k,\, h_j > k), \quad \Pi_c(j,k) := \mathbb{1}(g_j > k,\, h_j \le k), \quad \Pi_d(j,k) := \mathbb{1}(g_j \le k,\, h_j \le k), \tag{17}$$

which satisfy $\Pi_a(j,k) + \Pi_b(j,k) + \Pi_c(j,k) + \Pi_d(j,k) = 1$. For notational simplicity, we will often drop $(j,k)$ from $\Pi_*$. The pmf of $A^{(k)}_{ij}$ is given by

$$A^{(k)}_{ij} = \begin{cases} -1 & \text{with probability } s(1 - \rho^{(k)}_{ij}), \\ 1 & \text{with probability } s\rho^{(k)}_{ij}, \\ 0 & \text{with probability } 1 - s, \end{cases}$$

where

$$\rho^{(k)}_{ij} = \Pi_a(j,k)\,p_i + \Pi_b(j,k)\,p_i(1 - q_j) + \Pi_c(j,k)\,p_i q_j + \frac{(K-k)(1-p_i)}{K},$$

and its expectation is $E[A^{(k)}_{ij}] = s(2\rho^{(k)}_{ij} - 1)$. Note that by using $\Pi_a = 1 - \Pi_b - \Pi_c - \Pi_d$, the probability $\rho^{(k)}_{ij}$ can be written as

$$\rho^{(k)}_{ij} = p_i\left(q_j(\Pi_c - \Pi_b) - (\Pi_c + \Pi_d) + \frac{k}{K}\right) + \frac{K-k}{K}.$$

Thus, by defining

$$r^{(k)}_j := q_j(\Pi_c - \Pi_b) - (\Pi_c + \Pi_d) + \frac{k}{K}, \tag{19}$$

the expectation of $A^{(k)}_{ij}$ can be written as

$$E[A^{(k)}_{ij}] = s(2\rho^{(k)}_{ij} - 1) = s\left(2 p_i r^{(k)}_j + \frac{K - 2k}{K}\right),$$

and

$$E[A^{(k)}] - \frac{s(K - 2k)}{K}\mathbf{1}_{n \times m} = 2 s\, p\,(r^{(k)})^\top. \tag{21}$$

Note that

Case I ($g_j > h_j$): $\Pi_a(j,k) = 1$ where $k < h_j$; $\Pi_c(j,k) = 1$ where $h_j \le k < g_j$; $\Pi_d(j,k) = 1$ where $g_j \le k$;
Case II ($g_j < h_j$): $\Pi_a(j,k) = 1$ where $k < g_j$; $\Pi_b(j,k) = 1$ where $g_j \le k < h_j$; $\Pi_d(j,k) = 1$ where $h_j \le k$.

Thus, $r^{(k)}_j$ in equation 19 is equal to

Case I ($g_j > h_j$):
$$r^{(k)}_j = \begin{cases} \frac{k}{K} & \text{where } k < h_j; \\ \frac{k}{K} - (1 - q_j) & \text{where } h_j \le k < g_j; \\ \frac{k}{K} - 1 & \text{where } g_j \le k, \end{cases}$$

Case II ($g_j < h_j$):
$$r^{(k)}_j = \begin{cases} \frac{k}{K} & \text{where } k < g_j; \\ \frac{k}{K} - q_j & \text{where } g_j \le k < h_j; \\ \frac{k}{K} - 1 & \text{where } h_j \le k. \end{cases}$$
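The case analysis above can be checked mechanically. The sketch below uses exact rational arithmetic to compare the piecewise expression for r with the probability that a response exceeds k, computed directly from the response model (top-two answer with probability p, uniform guess otherwise). The function names `r_value` and `rho_direct` are hypothetical, and labels are taken in 1..K with g ≠ h.

```python
from fractions import Fraction

def r_value(k, g, h, q, K):
    """Piecewise r_j^(k) from Case I (g > h) and Case II (g < h)."""
    lo, hi = min(g, h), max(g, h)
    base = Fraction(k, K)
    if k < lo:
        return base              # both top-two answers exceed k
    if k >= hi:
        return base - 1          # neither top-two answer exceeds k
    # lo <= k < hi: exactly one of the top-two answers exceeds k
    return base - (1 - q) if g > h else base - q

def rho_direct(k, g, h, p, q, K):
    """P(response > k | worker responded), computed from the model itself."""
    top_two = q * (g > k) + (1 - q) * (h > k)
    return p * top_two + (1 - p) * Fraction(K - k, K)
```

With these two functions, the identity ρ = p·r + (K - k)/K used to obtain equation 21 can be verified for every admissible (g, h, k) on a small K.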

G PERFORMANCE ANALYSIS OF ALGORITHM 1

G.1 PROOFS OF THEOREM 1 AND COROLLARY 1

In Algorithm 1, we use the data matrix $A_1$, which is obtained by randomly splitting the original data matrix $A$ into $A_1$ and $A_2$ with probability $s_1$ and $(1 - s_1)$, respectively. The first stage of Algorithm 1 then randomly splits $A_1$ again into two independent matrices $B$ and $C$ with equal probabilities, and converts $B$ and $C$ into $(K-1)$ binary matrices $B^{(k)}$ and $C^{(k)}$ as explained in Sec. 2. We define $X^{(k)} := B^{(k)} - \frac{s'(K-2k)}{K}\mathbf{1}_{n \times m}$ and $Y^{(k)} := C^{(k)} - \frac{s'(K-2k)}{K}\mathbf{1}_{n \times m}$, where $s' = s \cdot s_1/2$. We have $E[X^{(k)}] = E[Y^{(k)}] = s'\,p\,(r^{(k)})^\top$ from Prop. 1. For notational simplicity, we will ignore this random splitting for a moment and just pretend that $X^{(k)}$ and $Y^{(k)}$ are sampled independently with $s' = s$ throughout this section.

We first outline the proof. Based on the observation that $E[X^{(k)}] = s\,p\,(r^{(k)})^\top$, if $E[X^{(k)}]$ is available we can recover $p^* = p/\|p\|_2$ by SVD, and by using $p^*$ it is possible to recover $\|p\|_2 r^{(k)}$, which then reveals $\{(g_j, h_j)\}_{j=1}^m$ as well as $q$ from the relation in equation 2. To estimate $p^*$ from $X^{(k)}$, we first bound the spectral norm of the perturbation, $\|X^{(k)} - E[X^{(k)}]\|_2$. We then use this bound and the Wedin SinΘ theorem to bound $\sin\theta(u^{(k)}, p^*)$, where $u^{(k)}$ is the left singular vector of $X^{(k)}$ with the largest singular value. We trim the abnormally large components of $u^{(k)}$ and denote the resulting vector by $\tilde u^{(k)}$. After trimming, it is still possible to show that $\sin\theta(\tilde u^{(k)}, p^*)$ can be bounded in the same order as that of $\sin\theta(u^{(k)}, p^*)$. Finally, we provide an entrywise bound between $v^{(k)} = \frac{1}{s}(Y^{(k)})^\top \tilde u^{(k)}$ and $\|p\|_2 r^{(k)}$ in Lemma 5, which is the main lemma to prove Theorem 1.

We state our main technical lemmas first and then prove Theorem 1. Let us define the perturbation matrix

$$E := X^{(k)} - E[X^{(k)}] = B^{(k)} - \frac{s(K-2k)}{K}\mathbf{1}_{n \times m} - s\,p\,(r^{(k)})^\top = B^{(k)} - E[B^{(k)}], \tag{23}$$

where

$$B^{(k)}_{ij} = \begin{cases} -1 & \text{w.p. } s(1 - \rho^{(k)}_{ij}), \\ 1 & \text{w.p. } s\rho^{(k)}_{ij}, \\ 0 & \text{w.p. } 1 - s, \end{cases}$$

and $\rho^{(k)}_{ij} = \Pi_a(j,k)\,p_i + \Pi_b(j,k)\,p_i(1 - q_j) + \Pi_c(j,k)\,p_i q_j + \frac{(K-k)(1-p_i)}{K}$ for $(\Pi_a, \Pi_b, \Pi_c, \Pi_d)$ defined in equation 17. For the perturbation matrix $E$ in equation 23, we have $E[E_{i,j}] = 0$ and $|E_{i,j}| \le 2$ for $1 \le i \le n$, $1 \le j \le m$, and also

$$\mathrm{var}(E_{ij}) = \mathrm{var}(B^{(k)}_{ij}) = E[(B^{(k)}_{ij})^2] - (E[B^{(k)}_{ij}])^2 = s - \left(s(2\rho^{(k)}_{ij} - 1)\right)^2 \le s.$$

Note that $\{E_{ij}\}$ are independent across all $i, j$. Define

$$\nu := \max\left\{\max_i \sum_j E[E_{i,j}^2],\; \max_j \sum_i E[E_{i,j}^2]\right\} \le \max\{m, n\}\,s.$$

By applying the spectral norm bound for random matrices with independent entries, which appeared in Bandeira & Van Handel (2016) and is summarized in Theorem 4, we can bound the spectral norm of $E$ as below.

Lemma 2 (Spectral norm bound of E). With probability $1 - (n + m)^{-8}$, we have

$$\|E\| \le 4\sqrt{s \max(m, n)} + c\sqrt{\log(n + m)}$$

for some constant $c > 0$ when $m \ge n$. For some sufficiently large $m$, assuming $n = o(m)$ and $s = \Omega(\log(n + m)/m)$, the spectral norm of $E$ can be further bounded by

$$\|E\| \le 5\sqrt{sm}. \tag{29}$$

Using the bounded spectral norm of $E$ in equation 29 and applying the Wedin SinΘ theorem, summarized in Theorem 5, we can bound the angle between $u^{(k)}$ and $p^*$.

Lemma 3. For some sufficiently large $m$, assuming $n = o(m)$ and $s = \Omega(\log(n + m)/m)$, we have

$$\sin\theta(u^{(k)}, p^*) \le \Theta(1/\sqrt{sn}) \tag{30}$$

with probability at least $1 - (n + m)^{-8}$.

Proof. By applying the Wedin SinΘ theorem (Theorem 5), we have

$$\sin\theta(u^{(k)}, p^*) \le \frac{\sqrt{2}\,\|E\|}{s\,\|p\|_2 \cdot \|r^{(k)}\|_2 - \|E\|}.$$

We have $\|p\|_2 = \Theta(\sqrt{n})$ and $\|r^{(k)}\|_2 = \Theta(\sqrt{m})$ by the assumptions on the model parameters. By Lemma 2, for some sufficiently large $m$, assuming $n = o(m)$ and $s = \Omega(\log(n + m)/m)$, we have $\|E\| \le 5\sqrt{sm}$ with probability at least $1 - (n + m)^{-8}$. Combining these bounds, we get

$$\sin\theta(u^{(k)}, p^*) \le \frac{\Theta(\sqrt{sm})}{\Theta(s\sqrt{mn}) - \Theta(\sqrt{sm})} = \frac{1}{\Theta(\sqrt{sn})}.$$

We trim the abnormally large components of $u^{(k)}$ by setting $u^{(k)}_i$ to zero if $u^{(k)}_i > 2/(\eta\sqrt{n})$, and denote the resulting vector by $\tilde u^{(k)}$. This process is required to control the maximum entry size of $\tilde u^{(k)}$, which is used later in the proof.
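In the next lemma, we show that after the trimming process, the norm of $\tilde u^{(k)}$ is still close to 1 and the angle between $\tilde u^{(k)}$ and $p^*$ has the same order as that of $\sin\theta(u^{(k)}, p^*)$.

Lemma 4. Given $\|p\|_2 \ge \eta\sqrt{n}$, we have

$$\|\tilde u^{(k)}\|_2 \ge \sqrt{1 - 50\sin^2\theta(u^{(k)}, p^*)}, \tag{33}$$
$$\sin\theta(\tilde u^{(k)}, p^*) \le 6\sqrt{2}\,\sin\theta(u^{(k)}, p^*). \tag{34}$$

The proof of Lemma 4 is provided in Section G.2. Finally, we provide our main lemma giving the entrywise bound on the difference between $v^{(k)} = \frac{1}{s}(Y^{(k)})^\top \tilde u^{(k)}$ and $\|p\|_2 r^{(k)}$.

Lemma 5 (Entrywise Bound). For any $\delta_1, \epsilon > 0$, and any task $j \in [m]$ and label index $k \in [K]$, if the sampling probability $s \ge \Theta\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{1}{\epsilon}\right)$, then we can guarantee

$$P\left(\left|\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle - \|p\|_2 r^{(k)}_j\right| < \delta_1\|p\|_2\right) > 1 - \epsilon \tag{35}$$

as $m \to \infty$ when $n = O(m/\log m)$.

Proof. For notational simplicity, denote $\theta(\tilde u^{(k)}, p^*)$ by $\theta$. To prove equation 35, we show bounds on two probabilities,

$$P\left(\left|\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle - \|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta\right| > \frac{\delta_1\|p\|_2}{2}\right) < \epsilon/2, \tag{36}$$
$$P\left(\left|\|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\right| > \frac{\delta_1\|p\|_2}{2}\right) < \epsilon/2. \tag{37}$$

Then, the triangle inequality implies equation 35. We first prove equation 36. Recall that we randomly split the input matrix $A$ and define the two independent binary-converted matrices $X^{(k)}$ and $Y^{(k)}$, for $1 \le k < K$, which are used to estimate $\tilde u^{(k)}$ and $v^{(k)}$, respectively. Thus, $\tilde u^{(k)}$ is independent of $Y^{(k)}$, and this independence is used when we bound the first and second moments of $v^{(k)}_j = \frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle$. For any $1 \le j \le m$, the first and second moments of $v^{(k)}_j$ satisfy

$$E\left[\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle\right] = \langle p, \tilde u^{(k)}\rangle r^{(k)}_j = \|p\|_2\|\tilde u^{(k)}\|_2(\cos\theta)\,r^{(k)}_j = \Theta(\sqrt{n}) \tag{38}$$

if $r^{(k)}_j \ne 0$ by Lemmas 3 and 4, and

$$\mathrm{var}\left(\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle\right) \le \frac{1}{s^2}\sum_{i=1}^{n}(\tilde u^{(k)}_i)^2 E[(Y^{(k)}_{ij})^2] = \Theta\left(\frac{1}{s}\right) \tag{39}$$

since $E[(Y^{(k)}_{ij})^2] = \Theta(s)$ and $\sum_{i=1}^{n}(\tilde u^{(k)}_i)^2 = \Theta(1)$ by Lemmas 3 and 4. Furthermore, we have $\max_{1 \le i \le n}|Y^{(k)}_{ij}\tilde u^{(k)}_i| \le \Theta\left(\frac{1}{\sqrt{n}}\right)$ since $|\tilde u^{(k)}_i| \le \frac{2}{\eta\sqrt{n}}$.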
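By applying Bernstein's inequality, we can show that

$$P\left(\left|\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle - \|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta\right| > \frac{\delta_1\|p\|_2}{2}\right) \le 2\exp\left(-\frac{\Theta(\delta_1^2\|p\|_2^2)}{\Theta\left(\frac{1}{s}\right) + \Theta(\delta_1\|p\|_2/\sqrt{n})}\right) \le \exp\left(-\Theta(s\delta_1^2\|p\|_2^2)\right),$$

where the second inequality is due to the assumption $\|p\|_2 = \Theta(\sqrt{n})$. To make this probability less than $\frac{\epsilon}{2}$, it is sufficient to have $s \ge \Omega\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{1}{\epsilon}\right)$.

We next prove equation 37 by bounding $\left|\|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\right|$. By the triangle inequality, we have

$$\left|\|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\right| \le \left|\|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\cos\theta\right| + \left|\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\right|. \tag{41}$$

Note that

$$\frac{1}{\|p\|_2}\cdot\left|\|\tilde u^{(k)}\|_2\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\cos\theta\right| = \left|r^{(k)}_j\cos\theta\left(\|\tilde u^{(k)}\|_2 - 1\right)\right| \le \Theta(\sin^2\theta(u^{(k)}, p^*)) = \frac{1}{\Theta(ns)}$$

with probability $1 - (n + m)^{-8}$ by Lemmas 3 and 4, and also note that

$$\frac{1}{\|p\|_2}\cdot\left|\|p\|_2 r^{(k)}_j\cos\theta - \|p\|_2 r^{(k)}_j\right| = \left|r^{(k)}_j(1 - \cos\theta)\right| \le \Theta(\sin^2\theta(u^{(k)}, p^*)) = \frac{1}{\Theta(ns)}$$

with probability $1 - (n + m)^{-8}$ by Lemmas 3 and 4. To make these errors of order $1/\Theta(ns)$ less than $\frac{\delta_1}{2}$, it is sufficient to have $s \ge \Omega\left(\frac{1}{\delta_1 n}\right)$. By combining the above results, it can be guaranteed that

$$\left|\frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle - \|p\|_2 r^{(k)}_j\right| < \delta_1\|p\|_2$$

with probability at least $1 - \epsilon$, if the sampling probability satisfies

$$s \ge \max\left\{\Omega\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{1}{\epsilon}\right),\; \Omega\left(\frac{1}{\delta_1 n}\right)\right\} = \Omega\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{1}{\epsilon}\right), \tag{44}$$

where the last equality is due to $\|p\|_2 = \Theta(\sqrt{n})$. The condition $s = \Omega(\log(n + m)/m)$ in Lemma 3 is immediately satisfied by equation 44 when $n = O(m/\log m)$.

Proof of Theorem 1. By using Lemma 5, we next prove Theorem 1. By applying the union bound over $k \in [K]$, if $s \ge \Theta\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{K}{\epsilon}\right)$ then we have

$$\|p\|_2(r^{(k)}_j - \delta_1) \le v^{(k)}_j = \frac{1}{s}\langle Y^{(k)}_{*j}, \tilde u^{(k)}\rangle \le \|p\|_2(r^{(k)}_j + \delta_1), \quad \forall k \in [K] \tag{45}$$

for any $\delta_1 > 0$ and $j \in [m]$ with probability at least $1 - \epsilon$. Under the condition in equation 45, for any $q_j \in (1/2, 1)$ and $\delta < \min\{\ldots\}$, we can guarantee that

$$\frac{1}{K} - q_j + \delta < \frac{1}{K} - (1 - q_j) - \delta \quad\text{and}\quad \frac{1}{K} - (1 - q_j) + \delta < \frac{1}{K} - \delta,$$

which implies $(\hat g_j, \hat h_j) = (g_j, h_j)$ for $(\hat g_j, \hat h_j)$ defined in equation 3. This proves equation 8 of Theorem 1.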
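We next prove equation 9, the accuracy guarantee in estimating the task difficulty vector $q$. After estimating $\|p\|_2 r^{(k)}$ by $v^{(k)} = \frac{1}{s}(Y^{(k)})^\top \tilde u^{(k)}$, we estimate $\|p\|_2$ by calculating $l$, where

$$l_j := \frac{K}{K-2}\sum_{k \ne \hat g_j,\, k \ne \hat h_j}\Delta v^{(k)}_j \quad\text{and}\quad l := \frac{1}{m}\sum_{j=1}^{m} l_j.$$

Assume that $|\|p\|_2 - l| \le \|p\|_2\,\delta'$. We will specify the required order of $\delta'$ later. Recall that the estimate for $q_j$ is defined as $\hat q_j := \frac{1}{K} - \frac{\Delta v^{(\hat g_j)}_j}{l}$. Under the condition that $\hat g_j = g_j$ and $|v^{(k)}_j - \|p\|_2 r^{(k)}_j| \le \|p\|_2\delta_1$, both of which are satisfied under the conditions of Lemma 5, we have

$$\frac{\frac{1}{K} - q_j - 2\delta_1}{1 + \delta'} \le \frac{\Delta v^{(\hat g_j)}_j}{l} \le \frac{\frac{1}{K} - q_j + 2\delta_1}{1 - \delta'}.$$

By the Taylor expansion $\frac{1}{1-x} = 1 + x + \Theta(x^2)$ as $x \to 0$, we have

$$|\hat q_j - q_j| \le 2\delta_1 + \delta'\left(\frac{1}{K} - q_j + 2\delta_1\right) + \Theta(\delta'^2) = \Theta(\delta_1 + \delta').$$

Thus, both the order of $\delta'$, which is the estimation error of $\|p\|_2$, and that of $\delta_1$, which is the estimation error of $\|p\|_2 r^{(k)}_j$, govern the estimation accuracy of $q_j$. We next show that we can have $\delta' = \Theta(\delta_1)$. By Lemma 5, we have $|v^{(k)}_j - \|p\|_2 r^{(k)}_j| \le \|p\|_2\delta_1$, which implies

$$\|p\|_2(\Delta r^{(k)}_j - 2\delta_1) \le \Delta v^{(k)}_j \le \|p\|_2(\Delta r^{(k)}_j + 2\delta_1). \tag{49}$$

Under the condition $(\hat g_j, \hat h_j) = (g_j, h_j)$, since $\Delta r^{(k)}_j = \frac{1}{K}$ for $k \ne \hat g_j, \hat h_j$, we have

$$\|p\|_2 - \|p\|_2\frac{2\delta_1 K}{K-2} \le l_j = \frac{K}{K-2}\sum_{k \ne \hat g_j,\, k \ne \hat h_j}\Delta v^{(k)}_j \le \|p\|_2 + \|p\|_2\frac{2\delta_1 K}{K-2},$$

and thus $\delta' = \frac{2\delta_1 K}{K-2} = \Theta(\delta_1)$. Thus, it is enough to have $s = \Omega\left(\frac{1}{\delta_1^2\|p\|_2^2}\log\frac{K}{\epsilon}\right)$ to guarantee equation 9.

Proof of Corollary 1. By using Lemma 5 and taking the union bound over all tasks $j \in [m]$ as well as $k \in [K]$, we can prove Corollary 1 in a similar way as Theorem 1.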
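The chain of steps analysed in this subsection (leading left singular vector, trimming, and projection of the independent copy) can be sketched as below. This is a minimal illustration under the simplification used in the proof outline, namely that the input matrices have expectation s·p·rᵀ. The function name `first_stage_scores`, the sign-fixing rule, and passing η as an explicit argument are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def first_stage_scores(X, Y, s, eta):
    """Return v = (1/s) * Y^T u_trimmed, an estimate of ||p||_2 * r.

    Hypothetical helper: X and Y are independent n x m centered matrices.
    """
    n = X.shape[0]
    # Leading left singular vector of X.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    u = U[:, 0]
    # Assumption: resolve the SVD sign ambiguity so that u aligns with the
    # entrywise nonnegative vector p / ||p||_2.
    if u.sum() < 0:
        u = -u
    # Trimming: zero out abnormally large entries, as described in the proof.
    u_trimmed = np.where(u > 2.0 / (eta * np.sqrt(n)), 0.0, u)
    # Project the independent copy Y onto the trimmed direction.
    return (Y.T @ u_trimmed) / s
```

On noise-free input X = Y = s·p·rᵀ with no entry trimmed, the output equals ‖p‖₂·r exactly, which is the quantity that Lemma 5 shows is approximated entrywise in the noisy case.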

G.2 PROOF OF LEMMA 4

We first prove equation 33, $\|\tilde u^{(k)}\|_2 \ge \sqrt{1 - 50\sin^2\theta(u^{(k)}, p^*)}$. Let $I$ be the set of indices $1 \le i \le n$ such that $u^{(k)}_i \ge \frac{2}{\eta\sqrt{n}}$. Then, we have $|u^{(k)}_i - p^*_i| \ge \frac{1}{\eta\sqrt{n}}$ for all $i \in I$ since $p^*_i = p_i/\|p\|_2 \le \frac{1}{\eta\sqrt{n}}$ due to the assumption that $\|p\|_2^2 \ge \eta^2 n$. Thus, we have

$$\frac{|I|}{\eta^2 n} \le \sum_{i \in I}(u^{(k)}_i - p^*_i)^2 \le \|u^{(k)} - p^*\|_2^2. \tag{51}$$

By using the triangle inequality, we can show that

$$\sqrt{\sum_{i \in I}(u^{(k)}_i)^2} \le \sqrt{\sum_{i \in I}\left(u^{(k)}_i - \frac{2}{\eta\sqrt{n}}\right)^2} + \sqrt{\frac{4|I|}{\eta^2 n}} \le \sqrt{\sum_{i \in I}\left(p^*_i - \frac{2}{\eta\sqrt{n}}\right)^2} + \sqrt{\sum_{i \in I}(u^{(k)}_i - p^*_i)^2} + \sqrt{\frac{4|I|}{\eta^2 n}}$$
$$\le \sqrt{\frac{4|I|}{\eta^2 n}} + \sqrt{\sum_{i \in I}(u^{(k)}_i - p^*_i)^2} + \sqrt{\frac{4|I|}{\eta^2 n}} \le 5\|u^{(k)} - p^*\|_2. \tag{52}$$

By convexity and using Jensen's inequality, the average probability of error is lower bounded by

$$P(\hat g_j \ne g_j) \ge \frac{1}{m}\sum_{j \in [m]} P(\hat g_j \ne g_j) \ge \frac{K-1}{K}(1 - p)^{l} \ge \frac{K-1}{K}e^{-(p + p^2)l} \ge \frac{K-1}{K}e^{-2pl}.$$

The inequality in equation 89 implies that if $l$ is less than $\frac{1}{2p}\log\frac{K-1}{K\epsilon}$, then no algorithm can make the minimax error in equation 89 less than $\epsilon$. Since the average number of queries per task in our model is $ns$, it implies that it is necessary to have $s = \Omega(\ldots)$

Let $A := \{A_{ij} : i \in [n], j \in [m]\}$ be the set of observations. Define two probability measures $P_0$ and $P_1$, such that $P_0$ is the measure of $A$ conditioned on $(g_j, h_j) = (g_c, h_c)$, while $P_1$ is that on $(g_j, h_j) = (a_c, b_c)$. Then, we have

$$\sum_{(v,u) \in \{(g_c,h_c),(a_c,b_c)\}^m} Q((v,u))\,E\left[\mathbb{1}((\hat g_j, \hat h_j) \ne (g_j, h_j)) \mid (g, h) = (v, u)\right]$$
$$= Q((g_j, h_j) = (g_c, h_c))\,P_0((\hat g_j, \hat h_j) \ne (g_c, h_c)) + Q((g_j, h_j) = (a_c, b_c))\,P_1((\hat g_j, \hat h_j) \ne (a_c, b_c))$$
$$\ge \frac{1}{2} - \frac{1}{2}\|P_0 - P_1\|_{\mathrm{TV}} \ge \frac{1}{2} - \frac{1}{4}D_{\mathrm{KL}}(P_0, P_1),$$

where the second to last inequality is by Le Cam's method and the last inequality is by Pinsker's inequality. Conditioned on $(g_j, h_j)$, the set of random variables $A_j := \{A_{ij} : i \in [n]\}$ is independent of $A \setminus A_j$ for both $P_0$ and $P_1$, and thus

$$D_{\mathrm{KL}}(P_0, P_1) = D_{\mathrm{KL}}(P_0(A_j), P_1(A_j)) + D_{\mathrm{KL}}(P_0(A \setminus A_j), P_1(A \setminus A_j)) = D_{\mathrm{KL}}(P_0(A_j), P_1(A_j)). \tag{93}$$



This phenomenon is evident on public datasets: for the 'Web' dataset (Zhou et al., 2012), which has five labels, the two most dominant answers account for 80% of the overall answers and the ratio between the top two is 2.4:1.

As in Peterson et al. (2019), we used the original 10000 test examples of CIFAR10 for training and the 50000 training examples for testing. Thus, the final accuracy is lower than usual.

Since CIFAR10H is collected from selected 'reliable' workers who answered a set of test examples with an accuracy higher than 75%, we directly used the top-two dominating answers and the fraction between the two in designing the soft label vector (top2).

The total variation distance between probability distributions $P$ and $Q$ defined on a set $\mathcal{X}$ is defined as the maximum difference between the probabilities they assign to subsets of $\mathcal{X}$: $\|P - Q\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P(A) - Q(A)|$.



Figure 1: Prediction error for (g, h) for four scenarios as the avg. number of queries per task changes. Our TopTwo2 alg. achieves the best performance, near the oracle MLE for all the scenarios.

Figure 2: (a) Prediction error for $(g_j, h_j)$, $g_j$ and $h_j$ (from left to right) for color comparison tasks using real data collected from MTurk. Our TopTwo2 algorithm achieves the best performance. (b) Histogram of color distance gap for the task groups with the highest $q_j$ (easiest tasks) and lowest $q_j$ (most difficult tasks). The difficult task group (blue) tends to have a smaller color distance gap.

Figure 3: Empirical distribution of the mean incidence of responses sorted by the dominant proportion, averaged over all tasks in each dataset. The i-th data point represents the average incidence of the i-th highest response in each task. The error bars indicate the standard deviation of the mean incidence of the i-th dominating answer over the tasks in the dataset.

Figure 4: Example tasks for 'Color' dataset where the ground truth g and the most confusing answer h are determined by the color distance from the reference color (top).

(a) Effect of the number of workers on the performance (b) Effect of the number of tasks on the performance (c) Effect of the variance of worker reliability on the performance (d) Effect of the variance of task difficulty on the performance (e) Effect of the portion of spammers on the performance

Figure 7: Prediction error for $(g_j, h_j)$ (first column), $g_j$ (second column), and $h_j$ (third column) for five different setups. The solid lines represent the mean prediction errors of each algorithm averaged over 10 runs, and the shaded regions represent the standard deviations.

Figure 8: Mean squared errors in estimating the worker reliability vector p (left) and the task difficulty vector q (right), respectively.

… $\|p\|_2\delta_1$, which implies $\|p\|_2(\Delta r^{(k)}_j - 2\delta_1) \le \Delta v^{(k)}_j \le \|p\|_2(\Delta r^{(k)}_j + 2\delta_1)$.

… $l_i \le l$. By assuming $p \le 2/3$, we have $(1 - p) \ge e^{-(p + p^2)}$. Thus,

OF PART (B)

To prove the second part of the theorem, we use proof techniques from Zhang et al. (2014), but generalize the results to pairs of top-two answers. We assume that $j_c \in [m]$, $(g_c, h_c) \in [K]^2$ and $(a_c, b_c) \in [K]^2$ are the task index and the pairs of labels such that … in equation 60. Let $Q$ be a uniform distribution over the set $\{(g_c, h_c), (a_c, b_c)\}^m$. For any $(\hat g, \hat h)$, we have

$$\max_{\substack{(v,u) \in [K]^m \times [K]^m \\ v_j \ne u_j,\ \forall j \in [m]}} E\left[\mathbb{1}((\hat g_j, \hat h_j) \ne (g_j, h_j)) \mid (g, h) = (v, u)\right] \ge \sum_{(v,u) \in \{(g_c,h_c),(a_c,b_c)\}^m} Q((v,u))\,E\left[\mathbb{1}((\hat g_j, \hat h_j) \ne (g_j, h_j)) \mid (g, h) = (v, u)\right].$$

Parameters for synthetic data experiments under diverse scenarios.

NEURAL NETWORKS WITH SOFT LABELS HAVING TOP-TWO INFORMATION

An appealing example where we can use the knowledge of the second best answer is in training deep neural networks for classification tasks. Traditionally, a hard label (one ground-truth label per image) has been used to train a classifier. In recent works, it has been shown that using a soft label (a full label distribution that reflects human perceptual uncertainty) is sometimes beneficial in obtaining a model with better generalization capability (Peterson et al., 2019). However, obtaining an accurate full label distribution requires much higher sample complexity than recovering only the ground truth. For example, Peterson et al. (2019) provided the CIFAR10H dataset with full human label distributions for 10000 instances of CIFAR10 test examples by collecting on average 50 judgements per image, which is about 5-10 times larger than those of usual datasets (Table 4 in Appendix A.1).

Comparison of performances for CIFAR10H dataset with hard/soft label training

Proportions of top-three dominating answers in public datasets

Dataset information

Dataset | # workers | # tasks | # labels or choices | sparsity | $d_{\text{task}}$ | $d_{\text{worker}}$


Therefore, we get

By the law of cosines, we have

$$\|p^* - u^{(k)}\|_2^2 = \sin^2\theta(u^{(k)}, p^*) + \big(1 - \cos\theta(u^{(k)}, p^*)\big)^2 = 2 - 2\cos\theta(u^{(k)}, p^*) = 2\Big(1 - \sqrt{1 - \sin^2\theta(u^{(k)}, p^*)}\Big) = \frac{2\sin^2\theta(u^{(k)}, p^*)}{1 + \sqrt{1 - \sin^2\theta(u^{(k)}, p^*)}} \le 2\sin^2\theta(u^{(k)}, p^*). \tag{54}$$

Combining equation 53 and equation 54 proves equation 33.

We next prove equation 34, $\sin\theta(\tilde u^{(k)}, p^*) \le 6\sqrt{2}\, \sin\theta(u^{(k)}, p^*)$. First, note that $\|\tilde u^{(k)} - u^{(k)}\|_2^2 = \sum_{i \in I} \big(u_i^{(k)}\big)^2$. We have (55), where the last inequality is from equation 52. Combined with equation 54, we get equation 34.
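The chain of equalities in equation 54 can be checked numerically. The sketch below normalizes two vectors and compares $\|p^* - u\|_2^2$ with $2 - 2\cos\theta$ and $2\sin^2\theta$. The helper name `chord_and_bounds` is hypothetical. The final inequality relies on $\cos\theta \ge 0$, which the step $\cos\theta = \sqrt{1 - \sin^2\theta}$ implicitly assumes.

```python
import numpy as np

def chord_and_bounds(u, p_star):
    """Return (||p* - u||^2, 2 - 2 cos(theta), 2 sin^2(theta)) for unit-normalized inputs."""
    u = np.asarray(u, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    u = u / np.linalg.norm(u)
    p_star = p_star / np.linalg.norm(p_star)
    cos_t = float(np.clip(u @ p_star, -1.0, 1.0))
    sin_sq = 1.0 - cos_t ** 2
    dist_sq = float(np.sum((p_star - u) ** 2))
    return dist_sq, 2.0 - 2.0 * cos_t, 2.0 * sin_sq
```

For entrywise nonnegative vectors the inner product is nonnegative, so the first two returned values coincide and neither exceeds the third.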

H PERFORMANCE ANALYSIS OF ALGORITHM 2

H.1 PROOF OF LEMMA 1

In this lemma, we show that, conditioned on $(\hat g_j, \hat h_j) = (g_j, h_j)$ for all $j \in [m]$, if $s(1 - s_1) = \Omega\big(\frac{1}{\delta_2^2 m} \log\frac{n}{\epsilon}\big)$, the estimator $\hat p_i$ defined in equation 5 satisfies $\|\hat p - p\|_\infty < \delta_2$ with probability at least $1 - \epsilon$.

Given $(\hat g_j, \hat h_j) = (g_j, h_j)$ for all $j \in [m]$, since $A_2$ is independent of $(\hat g_j, \hat h_j)$, we have

By applying Bernstein's inequality, we can show that

Thus, if the sampling probability satisfies

then we can guarantee that $\mathbb{P}(|\hat p_i - p_i| < \delta_2) \ge 1 - \epsilon$. By taking the union bound over $i \in [n]$, if the sampling probability satisfies

then we can guarantee that $\mathbb{P}(\|\hat p - p\|_\infty < \delta_2) \ge 1 - \epsilon$.
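The union-bound step at the end of this proof can be written out explicitly. The display below is a sketch under the assumption, made here only for illustration, that the sampling condition is strong enough to push each per-worker failure probability down to $\epsilon / n$:

$$\mathbb{P}\big(\|\hat p - p\|_\infty \ge \delta_2\big) = \mathbb{P}\Big(\bigcup_{i \in [n]} \big\{|\hat p_i - p_i| \ge \delta_2\big\}\Big) \le \sum_{i \in [n]} \mathbb{P}\big(|\hat p_i - p_i| \ge \delta_2\big) \le n \cdot \frac{\epsilon}{n} = \epsilon.$$

Tightening the per-worker level from $\epsilon$ to $\epsilon / n$ is what introduces the dependence on $n$ in the sampling condition.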

H.2 PROOF OF THEOREM 2

To prove this theorem, we use proof techniques similar to those of Zhang et al. (2014). Since Zhang et al. (2014) focus on recovering only the ground-truth label for each task, we generalize the techniques to recover not only the ground-truth label but also the most confusing answer.

We first introduce some notation. Let $\mu^{(i,j)}_{(a,b),k}$ denote the probability that worker $i \in [n]$ gives label $k \in [K]$ for the assigned task $j \in [m]$ whose top-two answers are $(g_j, h_j) = (a, b)$. Let µ. We introduce a quantity that measures the average ability of workers in distinguishing the ground-truth pair of top-two answers $(g_j, h_j)$ from any other pair $(a, b) \in [K]^2 \setminus \{(g_j, h_j)\}$ for the task $j \in [m]$. We define

where $D_{\mathrm{KL}}(P, Q) := \sum_i P(i) \log(P(i)/Q(i))$ is the KL-divergence between $P$ and $Q$. Note that $\bar D^{(j)}$ is strictly positive if $q_j \in (1/2, 1)$ and there exists at least one worker $i$ with $p_i > 0$ for the distribution in equation 1, so that $(g_j, h_j)$ can be distinguished statistically from any other $(a, b)$. We define $\bar D$ as the minimum of $\bar D^{(j)}$ over $j \in [m]$, indicating the average ability of workers in distinguishing $(g_j, h_j)$ from any other $(a, b)$ for the most difficult task in the set.

Let us define an event $\mathcal{E}$ that will be shown to hold with high probability:

Define

We can see that $l_1, \dots, l_n$ are mutually independent conditioned on any value of $(g_j, h_j)$, and each $l_i$ belongs to the interval $[0, \log(1/\rho)]$ where µ

We define

The following lemma shows that the second moment of $l_i$ is bounded above by the KL-divergence between the label distribution under the pair $(g_j, h_j)$ and the label distribution under the pair $(a, b)$.

Lemma 6. Conditioning on any value of $(g_j, h_j)$, we have

The proof of this lemma can be obtained by following the proof of the similar result, Lemma 4 of Zhang et al. (2014).

According to Lemma 6, the aggregated second moment of $l_i$ is bounded by (66). Thus, applying Bernstein's inequality, we have

Since $\rho \le 1/2$ and $D \ge n \bar D^{(j)} \ge n \bar D$, combining the above inequality with the union bound over $j \in [m]$, we have

The maximum likelihood estimator finds a pair $(a, b)$

The plug-in MLE in equation 6, on the other hand, finds a pair $(a, b)$

where

for the assigned task $j \in [m]$ whose top-two answers are $(g_j, h_j) = (a, b)$, assuming $p_i = \hat p_i$ from equation 5 and $q_j = \hat q_j$ from equation 4 in the distribution in equation 1. Thus, for the plug-in MLE to correctly find the ground-truth top-two answers $(g_j, h_j)$, we need the following event to hold:

For any arbitrary $(a, b) \ne (g_j, h_j)$, consider the quantity

which can be written as

Assuming that there exists $\rho > \delta_3$ such that

we have

By Bernstein's inequality, we also have

By taking the union bound over $j \in [m]$, we have

Under the intersection of the event

and the event $\mathcal{E}$, we can guarantee

for every $j \in [m]$, where the last inequality holds if

In summary, under the event

and

we can guarantee that the plug-in MLE in equation 70 successfully recovers the pair of top-two answers $(g_j, h_j)$ for all the tasks $j \in [m]$. To make the right-hand sides of equation 68 and equation 77 less than $\epsilon/2$, it is sufficient to have

Lastly, when we have

$$\max\{\|\hat p - p\|_\infty, \|\hat q - q\|_\infty\} \le \delta, \tag{83}$$

we can guarantee that |μ̂

Thus, it is sufficient to guarantee equation 83 with

I PROOF OF THEOREM 3
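As a concrete companion to the plug-in MLE analyzed in Section H.2, the sketch below scores every ordered pair $(a, b)$ by its plug-in log-likelihood for a single task. The form of distribution (1) is not reproduced in this section, so the code assumes one consistent with the model description: a worker with reliability $p_i$ answers $a$ with probability $p_i q_j + (1 - p_i)/K$, answers $b$ with probability $p_i (1 - q_j) + (1 - p_i)/K$, and gives any other label with probability $(1 - p_i)/K$. The names `label_probs`, `plug_in_mle`, and the `floor` parameter are hypothetical.

```python
import itertools
import numpy as np

def label_probs(p_i, q_j, a, b, K):
    """Assumed label distribution of one worker when the top-two answers are (a, b)."""
    mu = np.full(K, (1.0 - p_i) / K)   # uniform guess with probability 1 - p_i
    mu[a] += p_i * q_j                 # ground truth
    mu[b] += p_i * (1.0 - q_j)         # most confusing answer
    return mu

def plug_in_mle(answers, p_hat, q_hat_j, K, floor=1e-12):
    """Return the ordered pair (a, b), a != b, maximizing the plug-in log-likelihood.

    answers: dict mapping worker index -> observed label for this task.
    p_hat:   estimated worker reliabilities; q_hat_j: estimated confusion probability.
    """
    best_pair, best_ll = None, -np.inf
    for a, b in itertools.permutations(range(K), 2):
        ll = 0.0
        for i, k in answers.items():
            prob = label_probs(p_hat[i], q_hat_j, a, b, K)[k]
            ll += np.log(max(prob, floor))  # floor guards against log(0) when p_hat[i] = 1
        if ll > best_ll:
            best_pair, best_ll = (a, b), ll
    return best_pair
```

The exhaustive search over $K(K-1)$ ordered pairs is chosen for clarity; it is inexpensive for the small $K$ typical of multi-choice tasks.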

I.1 PROOF OF PART (A)

To prove this minimax bound, we use arguments similar to those of Karger et al. (2014). In particular, we consider a spammer-hammer model such that

Assume that a total of $l_j$ workers randomly sampled from $[n]$ provide answers for task $j$. Under the spammer-hammer model, the oracle estimator makes a mistake on task $j$ with probability $(K-1)/K$ if the task is assigned only to spammers. When $l_j$ is the number of assignments, we have

where $\mathbb{P}(X)$ denotes the distribution of $X$ with respect to the probability measure $\mathbb{P}$. Given $(g_j, h_j)$, since $A_{1j}, \dots, A_{nj}$ are independent, we can show that

Combining equation 91-equation 94, we have (95). Thus, if $s \le \frac{1}{4 n \bar D}$, then the above quantity is lower bounded by $3/8$. This completes the proof.
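The spammer-only argument can be made concrete with a small sketch. Since the model definition is not reproduced above, the code assumes, for illustration, that each assigned worker is independently a hammer with probability `hammer_frac` (a hypothetical parameter name) and a spammer otherwise. A task with $l_j$ assignments then sees only spammers with probability $(1 - \texttt{hammer\_frac})^{l_j}$, and the error is at least $(K-1)/K$ times that. The exponential form uses the elementary inequality $1 - x \ge e^{-(x + x^2)}$, valid for $0 \le x \le 2/3$.

```python
import math

def all_spammer_prob(hammer_frac, l_j):
    """Probability that all l_j assigned workers are spammers (assumed i.i.d. worker types)."""
    return (1.0 - hammer_frac) ** l_j

def error_lower_bound(hammer_frac, l_j, K):
    """Oracle error is at least (K-1)/K times the all-spammer probability."""
    return (K - 1) / K * all_spammer_prob(hammer_frac, l_j)

def exp_lower_bound(hammer_frac, l_j, K):
    """Weaker exponential bound via 1 - x >= exp(-(x + x^2)) for 0 <= x <= 2/3."""
    if not 0.0 <= hammer_frac <= 2.0 / 3.0:
        raise ValueError("the exponential bound requires 0 <= hammer_frac <= 2/3")
    return (K - 1) / K * math.exp(-(hammer_frac + hammer_frac ** 2) * l_j)
```

The exponential form is looser but easier to sum or average over tasks, which is why bounds of this type are usually stated that way.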

J USEFUL INEQUALITIES

In this section, we summarize the useful inequalities used in the proofs of the main results.

The following inequality, which appeared in Bandeira & Van Handel (2016), provides a non-asymptotic spectral norm bound for random matrices with independent random entries.

Theorem 4 (Spectral norm bound of a random matrix with independent entries). Consider a random matrix $X \in \mathbb{R}^{n \times m}$, whose entries are independently generated and obey $\mathbb{E}[X_{i,j}] = 0$, and

Define

Then there exists some universal constant $c > 0$ such that for any $t > 0$,

We also present a useful corollary of Theorem 4, which can be shown from equation 98 by setting $\tilde c = \sqrt{9c}$ and $t = B\sqrt{9c \log(n + m)}$.

Corollary 3 (Corollary of Theorem 4). If $\mathbb{E}[X_{i,j}^2] \le \sigma^2$ for all $i, j$ and the conditions in Theorem 4 are satisfied, then we have

with probability $1 - (n + m)^{-8}$ for some constant $\tilde c > 0$.

We next summarize the eigenspace perturbation theory for asymmetric matrices with singular value decomposition (SVD). Suppose $X := [X_0, X_1]$ and $Z := [Z_0, Z_1]$ are orthonormal matrices. We define the distance between two subspaces $X_0$ and $Z_0$ by

Given $\|X_0^\top Z_0\| \le 1$, we write the SVD of $X_0^\top Z_0 \in \mathbb{R}^{r \times r}$ as $X_0^\top Z_0 := U \cos\Theta\, V^\top$, where $\cos\Theta = \mathrm{diag}(\cos\theta_1, \dots, \cos\theta_r)$. We call $\{\theta_1, \dots, \theta_r\}$ the principal angles between $X_0$ and $Z_0$. Then, we have

Let $M^*$ and $M = M^* + E$ be two matrices in $\mathbb{R}^{n \times m}$ with $n \le m$, whose SVDs are represented by

The matrices $U_0^*$ and $V_0^*$ are defined analogously.

Theorem 5 (Wedin $\sin\Theta$ Theorem). If $\|E\| < \sigma_r^* - \sigma_{r+1}^*$, then one has

where $U_0^*$ ($V_0^*$) and $U_0$ ($V_0$) are subspaces spanned by the largest $r$ left (right) singular vectors of $M^*$ and $M$, respectively.

Lastly, we also write down two useful concentration inequalities.

Theorem 6 (Hoeffding). Let $X_1, X_2, \dots, X_n$ be independent random variables such that $X_i \in [a_i, b_i]$ for $1 \le i \le n$. Then, we have

Theorem 7 (Bernstein). Let $X_1, X_2, \dots, X_n$ be independent random variables such that $X_i \in [a_i, b_i]$ for $1 \le i \le n$. Let $C := \max_{1 \le i \le n} (b_i - a_i)$ and $\sigma^2 = \sum_{i=1}^n \mathrm{var}(X_i)$. Then we have

