TRAINING IMAGE CLASSIFIERS USING SEMI-WEAK LABEL DATA

Abstract

This paper introduces a new semi-weak label learning paradigm which provides additional information compared to weak-label classification. We define semi-weak label data as data for which we know not only the presence or absence of a given class, but also the exact count of each class, as opposed to knowing only label proportions. We propose a three-stage framework to address the problem of learning from semi-weak labels, which leverages the fact that counting information is naturally non-negative and discrete. Experiments are conducted on bags generated from CIFAR-10, and we compare our model with a fully supervised baseline, an MIL-based weakly supervised baseline, and a learning-from-proportions (LLP) baseline. Our framework not only outperforms both the weakly supervised and LLP baselines, but also gives results comparable to the fully supervised model. Further, we conduct thorough ablation studies across datasets, analyzing variation with batch size, losses, architectural changes, bag size, and regularization, thereby demonstrating the robustness of our approach.

1. INTRODUCTION

In a traditional fully supervised machine learning setting, training samples are "strongly" supervised, i.e. every training instance is labeled. In practice, though, strongly labeled data are expensive to collect. An alternate approach is to collect "weak" labels: labels which only indicate the presence or absence of instances of a class in sets of training samples. This form of labelling is particularly useful for data such as images or sounds. For instance, in an image it is relatively easy to annotate whether the image contains instances of (say) dogs, but much harder to tag the bounding box of every dog in the image. The former represents a weak label, while the latter is a strong label. Similarly, in sound recordings it is much easier to merely annotate whether a recording includes gunshots (weak label) than to identify the onset and offset times of each instance of a shot (strong label). At an abstract level, it is useful to think of data such as images or sound recordings as bags of candidate instances (e.g. candidate regions of the image or candidate sections of the recording). Labels are now assigned to bags, rather than instances. A negative label assigned to a bag indicates that the original data (image or recording) did not contain the target class(es), and hence none of the instances in the bag formed from it are positive for any class. On the other hand, a positive label assigned to a bag indicates that some instances in the bag are positive, although it is unknown which or how many. In the real world, it is much easier to collect such bag-level, or weak, labels. A number of algorithms have been proposed to train classifiers with such weak labels Shah et al. (2018); Pappas & Popescu-Belis (2014); Carbonneau et al. (2018); Vanwinckelen et al. (2016). However, there remains a gap between the performance of fully supervised models Kumar & Raj (2016); Shah et al. (2018) (trained from data where every instance is labelled) and that of weakly supervised models (trained from data with weak labels). This gap in performance limits the extent to which weak supervision (i.e. supervision with weak labels) can be relied upon.
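To make the bag abstraction concrete, the following sketch (hypothetical helper code, not from the paper) constructs weak presence/absence labels from bags of instance-labeled samples:

```python
import numpy as np

def weak_labels(bags, num_classes):
    """One presence/absence vector per bag: 1 if the class occurs at least once."""
    return np.array([
        [int(np.any(np.asarray(bag) == c)) for c in range(num_classes)]
        for bag in bags
    ])

# Two bags of instance labels (e.g. CIFAR-10 class indices).
bags = [[0, 0, 2], [1, 1, 1, 3]]
weak_labels(bags, num_classes=4).tolist()  # [[1, 0, 1, 0], [0, 1, 0, 1]]
```

A semi-weak label would instead record the count of each class per bag (e.g. [2, 0, 1, 0] for the first bag), which is the extra information the paradigm introduced below exploits.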
In this paper, we introduce a middle ground between the two settings of full and weak supervision. In many settings, it is possible to annotate count information, which specifies the number of occurrences of each class in a bag, for not much extra effort over simple weak labeling that merely tags presence or absence. For instance, it is often fairly straightforward to annotate how many dogs there are in an image if the annotation of exact bounding boxes is not required. Similarly, it is reasonable to expect that it takes little extra effort to indicate how many (e.g.) gunshots were heard in an audio recording if it is not required that their exact locations be tagged as well. We refer to such labels as "semi-weak" labels, and to the problem of learning from such data as learning with semi-weak supervision, or learning from counts. In our paper, we show that classifiers trained using semi-weak labels can classify test instances with much greater accuracy than those trained with merely weak labels.

Figure 1: An example of semi-weak labels for audio event detection. Semi-weak labels give count information, while weak labels only provide information about the presence or absence of classes.

Our proposed solution for learning from semi-weak labels is to pose training as a constraint satisfaction problem. The constraint to satisfy is that the sum of the counts predicted per category in each bag must equal the size of the bag. Thus, we first learn to predict whether each class is present in the bag, as in the weak label learning setup, and simultaneously learn an expected count of each class in the bag. Then, given the bag size and the expected count for each class, we solve an optimization problem to translate the real-valued expected counts into non-negative integer counts.
Finally, given the predicted counts as "ground truth" and the instance-level logits, we find the best assignment of labels such that the counting constraint is satisfied and the likelihood of the bag is maximized.
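As a rough illustration of the last two stages, the sketch below uses largest-remainder rounding for the real-to-integer count step and a min-cost bipartite matching for the label-assignment step. Both are stand-ins chosen only to satisfy the stated constraints, not necessarily the optimization the paper solves, and all names are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def round_counts(expected, bag_size):
    """Map real-valued expected class counts to non-negative integers that
    sum exactly to the bag size (largest-remainder rounding)."""
    expected = np.clip(np.asarray(expected, dtype=float), 0.0, None)
    total = expected.sum()
    # Rescale so the expected counts sum to the bag size (uniform if all zero).
    expected = (expected * bag_size / total if total > 0
                else np.full(len(expected), bag_size / len(expected)))
    counts = np.floor(expected).astype(int)
    remainder = bag_size - counts.sum()
    order = np.argsort(-(expected - counts))  # largest fractional parts first
    counts[order[:remainder]] += 1
    return counts

def assign_labels(logits, counts):
    """Assign each instance one class so per-class totals match `counts` and
    the total log-likelihood is maximized, via min-cost matching."""
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    slots = np.repeat(np.arange(len(counts)), counts)  # one column per class slot
    rows, cols = linear_sum_assignment(-log_probs[:, slots])
    labels = np.empty(len(rows), dtype=int)
    labels[rows] = slots[cols]
    return labels

round_counts([1.2, 2.7, 0.1], bag_size=4).tolist()  # [1, 3, 0]
```

Expanding each class into `counts[c]` identical "slots" turns the count-constrained assignment into a square assignment problem, which the Hungarian algorithm solves exactly.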

2. RELATED WORK

2.1. MULTIPLE INSTANCE LEARNING

Learning from bag-level labels is often referred to as multiple instance learning (MIL). Carbonneau et al. (2018) provide an extensive survey detailing most aspects of multiple instance learning. Dietterich et al. (1997) first introduced multiple instance learning, applying it to drug activity prediction. Since then, several algorithms have been introduced to address MIL. To deal with label noise, a natural solution is to count the number of positive instances in a bag and apply a threshold to tag a bag as positive. This was summarized in Foulds & Frank (2010) as the threshold-based assumption for multiple instance learning. The threshold-based assumption, under which a bag is positive if and only if the number of positive instances lies within a given range, was the first time that counting information was brought into multiple instance learning. Since then, several efforts Tao et al. (2004a;b) have used count-based assumptions for bag-level prediction. Foulds & Frank (2010) extended the assumption and proposed an SVM-based algorithm to predict the bag label. One common problem with these methods was scalability; they also do not generalize to multi-class classification.

Multiple Instance Regression: MIL regression consists of assigning a real value to a bag. Compared to MIL classification, MIL regression has attracted far less attention in the literature. One line of research assumes that some primary instance contributes largely to the bag label. This motivated sparsity-based approaches that assign sparse weights to instances and use regularization methods such as L1 and L2 regularizers Pappas & Popescu-Belis (2014); Wagstaff & Lane (2007); Pappas & Popescu-Belis (2017). However, most of these methods work only for small-scale data and focus on the accuracy of predicted results rather than attempting to identify the
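For concreteness, the threshold-based assumption described above can be written in a few lines (an illustrative sketch; the function and parameter names are ours):

```python
def bag_is_positive(instance_labels, lower=1, upper=None):
    """Threshold-based MIL assumption: a bag is positive iff the number of
    positive instances falls within [lower, upper] (upper=None means unbounded)."""
    count = sum(instance_labels)  # instance_labels: 1 = positive, 0 = negative
    return count >= lower and (upper is None or count <= upper)

# The standard MIL assumption is the special case lower=1, upper=None.
bag_is_positive([0, 1, 1, 0])           # True: at least one positive instance
bag_is_positive([0, 1, 1, 0], lower=3)  # False: only two positive instances
```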




