RECOVERING TOP-TWO ANSWERS AND CONFUSION PROBABILITY IN MULTI-CHOICE CROWDSOURCING

Abstract

Crowdsourcing has emerged as an effective platform for labeling a large volume of data in a cost- and time-efficient manner. Most previous works have focused on designing efficient algorithms to recover only the ground-truth labels of the data. In this paper, we consider multi-choice crowdsourced labeling with the goal of recovering not only the ground truth but also the most confusing answer and the confusion probability. The most confusing answer provides useful information about a task by revealing the most plausible answer other than the ground truth and how plausible it is. To theoretically analyze such scenarios, we propose a model in which each task has top-two plausible answers, distinguished from the rest of the choices. Task difficulty is quantified by the confusion probability between the top two, and worker reliability is quantified by the probability of giving an answer among the top two. Under this model, we propose a two-stage inference algorithm to infer the top-two answers as well as the confusion probability, and we show that our algorithm achieves the minimax optimal convergence rate. We conduct both synthetic and real-data experiments and demonstrate that our algorithm outperforms other recent algorithms. We also show the applicability of our algorithm in inferring the difficulty of tasks and in training neural networks with soft labels composed of the top-two most plausible classes.

1. INTRODUCTION

Crowdsourcing has been widely adopted to solve a large number of tasks in a time- and cost-efficient manner with the aid of human workers. In this paper, we consider 'multiple-choice' tasks, where a worker is asked to provide a single answer among multiple choices. Some examples are as follows: 1) Using crowdsourcing platforms such as MTurk, we solve object counting or classification tasks on a large collection of images. Answers can be noisy either due to the difficulty of the scene or due to unreliable workers who provide random guesses. 2) Scores are collected from reviewers for papers submitted to a conference. For certain papers, scores can vary widely among reviewers, either due to the paper's inherent nature (clear pros and cons) or due to the reviewers' subjective interpretations of the scoring scale (Stelmakh et al., 2019; Liu et al., 2022).

In the above scenarios, responses provided by human workers may not be consistent among themselves, not only due to the existence of unreliable workers but also due to the inherent difficulty of the tasks. In particular, for multiple-choice tasks, there can exist plausible answers other than the ground truth, which we call confusing answers.¹ For tasks with confusing answers, even reliable workers may provide wrong answers due to confusion. Thus, we need to decompose the two different causes of wrong answers: low reliability of workers and confusion due to task difficulty.

Most previous models for multi-choice crowdsourcing, however, fall short of modeling this confusion. For example, in the single-coin Dawid-Skene model (Dawid & Skene, 1979), the most widely studied crowdsourcing model in the literature, each worker is associated with a single skill parameter fixed across all tasks, which models the probability of giving a correct answer for every task.
Under this model, any algorithm that infers the worker skill would count a confused labeling as the worker's error and lower its accuracy estimate for the worker, which results in a wrong estimate of the worker's true skill level.

To model the effect of confusion in multi-choice crowdsourcing problems, we propose a new model under which each task can have a confusing answer other than the ground truth, with a confusion probability that varies across tasks. The task difficulty is quantified by the confusion probability, and the worker skill is modeled by the probability of giving an answer among the top two, to distinguish reliable workers from pure spammers who just provide random guesses among the possible choices. We justify the proposed top-two model with public datasets.

Under this new model, we aim to recover both the ground truth and the most confusing answer along with the confusion probability, indicating how plausible the recovered ground truth is compared to the most confusing answer. We provide an efficient two-stage inference algorithm to recover the top-two plausible answers and the confusion probability. The first stage of our algorithm uses the spectral method to obtain an initial estimate of the top-two answers as well as the confusion probability, and the second stage uses this initial estimate to estimate the worker reliabilities and to refine the estimates of the top-two answers. Our algorithm achieves the minimax optimal convergence rate.

We then perform experiments comparing our method to recent crowdsourcing algorithms on both synthetic and real datasets, and show that our method outperforms the others in recovering top-two answers. This result demonstrates that our model better explains real-world datasets that include errors from confusion. Our key contributions can be summarized as follows.
• Top-two model: We propose a new model for multi-choice crowdsourcing tasks where each task has top-two answers and the difficulty of the task is quantified by the confusion probability between the top two. We justify the proposed model by analyzing six public datasets, showing that the top-two structure explains the real datasets well.
• Inference algorithm and its applications: We propose a two-stage algorithm that recovers the top-two answers and the confusion probability of each task at the minimax optimal convergence rate. We demonstrate the potential applications of our algorithm not only in crowdsourced labeling but also in quantifying task difficulty and in training neural networks for classification with soft labels that include the top-two information and the task difficulty.
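To make the proposed model concrete, the following is a minimal simulation sketch of one natural reading of the generative process described above. All variable names (`g`, `c`, `q`, `p`) and parameter ranges are our own illustrative choices, not values from the paper:

```python
import numpy as np

def simulate_top_two(n_tasks=200, n_workers=30, K=5, seed=0):
    """Sample a worker-by-task response matrix under (our reading of)
    the top-two model:
      g[t] : ground-truth choice of task t
      c[t] : most confusing choice of task t (c[t] != g[t])
      q[t] : confusion probability of task t (task difficulty)
      p[w] : reliability of worker w = P(answer lands in the top two)
    """
    rng = np.random.default_rng(seed)
    g = rng.integers(K, size=n_tasks)                # ground truths
    c = (g + rng.integers(1, K, size=n_tasks)) % K   # some choice other than g
    q = rng.uniform(0.0, 0.5, size=n_tasks)          # per-task difficulty
    p = rng.uniform(0.6, 1.0, size=n_workers)        # per-worker skill

    A = np.empty((n_workers, n_tasks), dtype=int)
    for w in range(n_workers):
        for t in range(n_tasks):
            if rng.random() < p[w]:
                # reliable answer: lands in the top two, confused w.p. q[t]
                A[w, t] = c[t] if rng.random() < q[t] else g[t]
            else:
                # spammer-like behavior: uniform guess over all K choices
                A[w, t] = rng.integers(K)
    return A, g, c, q, p
```

Under this sketch, wrong answers arise from two distinct sources, exactly as the text describes: low worker reliability (the uniform-guess branch) and task confusion (the `c[t]` branch), which an inference algorithm must disentangle.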

Related works

In crowdsourcing (Welinder et al., 2010; Liu & Wang, 2012; Demartini et al., 2012; Aydin et al., 2014), one of the most widely studied models is the Dawid-Skene (D&S) model (Dawid & Skene, 1979). In this model, each worker is associated with a single confusion matrix fixed across all tasks, which models the probability of giving a label b ∈ [K] when the true label is a ∈ [K] for K-ary classification tasks. In the single-coin D&S model, the model is further simplified such that each worker possesses a fixed skill level regardless of the true label or the task. Under the D&S model, various methods were proposed to estimate the confusion matrix or skill of each worker by spectral methods (Zhang et al., 2014; Dalvi et al., 2013; Ghosh et al., 2011; Karger et al., 2013), belief propagation or iterative algorithms (Karger et al., 2014; 2011; Li & Yu, 2014; Liu et al., 2012; Ok et al., 2016), or rank-1 matrix completion (Ma et al., 2018; Ma & Olshevsky, 2020; Ibrahim et al., 2019). The estimated skill can be used to infer the ground-truth answer by approximating maximum likelihood (ML)-type estimators (Gao & Zhou, 2013; Gao et al., 2016; Zhang et al., 2014; Karger et al., 2013; Li & Yu, 2014; Raykar et al., 2010; Smyth et al., 1994; Ipeirotis et al., 2010; Berend & Kontorovich, 2014). In contrast to the D&S models, our model allows a worker to have different probabilities of error caused by confusion. Thus, our algorithm needs to estimate not only the worker skill but also the task difficulty. Since the number of tasks is often much larger than the number of workers in practice, estimating the task difficulties is much more challenging than estimating worker skills. We provide a statistically efficient algorithm to estimate the task difficulties and use this estimate to infer the top-two answers.
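For contrast with the top-two model, the single-coin D&S baseline described above can be sketched as follows. This is a toy illustration with our own variable names; its key property is that each worker's error probability is one fixed number `s[w]`, with no per-task notion of a confusing answer:

```python
import numpy as np

def simulate_single_coin_ds(n_tasks=200, n_workers=30, K=5, seed=0):
    """Single-coin Dawid-Skene: worker w answers every task correctly
    with one fixed probability s[w]; errors fall uniformly over the
    K-1 wrong choices. No task-dependent confusion is modeled."""
    rng = np.random.default_rng(seed)
    g = rng.integers(K, size=n_tasks)             # ground truths
    s = rng.uniform(0.5, 1.0, size=n_workers)     # one skill per worker

    A = np.empty((n_workers, n_tasks), dtype=int)
    for w in range(n_workers):
        correct = rng.random(n_tasks) < s[w]
        # a uniformly random wrong label for each task
        wrong = (g + rng.integers(1, K, size=n_tasks)) % K
        A[w] = np.where(correct, g, wrong)
    return A, g, s
```

Comparing this with a top-two process makes the modeling gap concrete: here a confused-but-skilled worker and an unreliable worker are indistinguishable, since both simply lower the single estimated skill `s[w]`.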
We also remark that there have been some recent attempts to model task difficulties (Khetan & Oh, 2016; Shah et al., 2020; Krivosheev et al., 2020; Shah & Lee, 2018; Bachrach et al., 2012; Li et al., 2019; Tian & Zhu, 2015). However, these works are either restricted to binary tasks (Khetan & Oh, 2016; Shah et al., 2020; Shah & Lee, 2018) or focus on grouping confusable classes (Krivosheev et al., 2020; Li et al., 2019; Tian & Zhu, 2015). Our result, on the other hand, applies to any set of multi-choice tasks, where the choices of each task are not necessarily restricted to a fixed set of classes.

Notation. For a vector x, x_i denotes the i-th component of x. For a matrix M, M_ij denotes the (i, j)-th entry of M. The ℓ2- and ℓ∞-norms of a vector x are denoted by ∥x∥_2 and ∥x∥_∞, respectively. We follow the standard definitions of the asymptotic notations Θ(·), O(·), o(·), and Ω(·).



¹ This phenomenon is evident in public datasets: in the 'Web' dataset (Zhou et al., 2012), which has five label choices, the two most dominant answers account for 80% of all responses, and the ratio between the top two is 2.4:1.

