ROBUST LOSS FUNCTIONS FOR COMPLEMENTARY LABELS LEARNING

Abstract

In ordinary-label learning, the correct label is given for each training sample. In complementary-label learning, a complementary label is instead provided for each training sample; a complementary label indicates a class that the example does not belong to. Robust learning of classifiers has been investigated from many viewpoints under label noise, but little attention has been paid to complementary-label learning. In this paper, we present a new algorithm for complementary-label learning built on robust loss functions. We also provide two sufficient conditions on a loss function so that the minimizer of the risk for complementary labels is theoretically guaranteed to be consistent with the minimizer of the risk for ordinary labels. Finally, empirical results validate our method's superiority over current state-of-the-art techniques. In particular, on CIFAR10 our algorithm achieves much higher test accuracy than the gradient ascent algorithm, and our model has less than half the parameters of the ResNet-34 they used.

1. INTRODUCTION

Deep neural networks have exhibited excellent performance in many real-world applications. Yet their superior performance relies on correctly labeled large-scale training sets, and labeling such datasets is time-consuming and expensive. For example, crowd-workers must select the correct label for each sample from 100 candidate labels in CIFAR100. To mitigate this problem, researchers have proposed many approaches to learning from weak supervision: noisy-label learning Li et al. (2017); Hu et al. (2019); Lee et al. (2018); Xia et al. (2019), semi-supervised learning Zhai et al. (2019); Berthelot et al. (2019); Rasmus et al. (2015); Miyato et al. (2019); Sakai et al. (2017), similar-unlabeled learning Tanha (2019); Bao et al. (2018); Zelikovitz & Hirsh (2000), unlabeled-unlabeled learning Lu et al. (2018); Chen et al. (2020a;b), positive-unlabeled learning Elkan & Noto (2008); du Plessis et al. (2014); Kiryo et al. (2017), contrastive learning Chen et al. (2020a;b), partial-label learning Cour et al. (2011); Feng & An (2018); Wu & Zhang (2018), and others. In this paper we investigate complementary-label learning Ishida et al. (2017). A complementary label only indicates a class label that a sample does not belong to. From the viewpoint of label noise, complementary labels can also be viewed as noisy labels, but with no true labels in the training set. Our task is to learn a classifier from the given complementary labels that predicts the correct label for a given sample. Collecting complementary labels is much easier and more efficient than precisely choosing the true class from many candidate classes. For example, if a labeling system uniformly chooses a label for a sample, that label is the ordinary label with probability 1/k but a complementary label with probability (k-1)/k. Another potential application of complementary labels is data privacy: for some privacy-sensitive problems, it is much easier to collect complementary labels than ordinary labels.
Robust learning of classifiers has been investigated from many viewpoints in the presence of label noise Ghosh et al. (2017), but little attention has been paid to complementary-label learning. We call a loss function robust if the minimizer of the risk under that loss function with complementary labels is the same as that with ordinary labels. The robustness of risk minimization relies on the loss function used on the training set. This paper presents a general risk formulation under which the categorical cross-entropy loss (CCE) can be used to learn with complementary labels and achieve robustness. We then offer some new analytical results on robust loss functions under complementary labels. Robustness of risk minimization helps select the best hyper-parameter by empirical risk, since there are no ordinary labels in the validation set. We derive two sufficient conditions on a loss function to be robust for learning with complementary labels. We then examine some popular loss functions used for ordinary-label learning, such as CCE, mean squared error (MSE), and mean absolute error (MAE), and show that CCE and MAE satisfy our sufficient conditions. Finally, we present a learning algorithm for learning with complementary labels, named the exclusion algorithm. The empirical results demonstrate the advantage of our theoretical results and verify our algorithm's superiority over current state-of-the-art methods. The contributions of this paper can be summarized as: • We present a general risk formulation that can be viewed as a framework for employing any loss function satisfying our robustness sufficient conditions to learn from complementary labels. • We derive two sufficient conditions on a loss function to be robust for learning with complementary labels. • We prove that the minimizer of the risk for complementary labels is theoretically guaranteed to be consistent with the minimizer of the risk for ordinary labels.
• The empirical results validate the superiority of our method to current state-of-the-art methods.

2. RELATED WORKS

A complementary label indicates that the pattern does not belong to the given label. Learning from complementary labels is a new topic in supervised learning, first proposed by Ishida et al. (2017), who introduced the idea to address the time and expense of tagging large-scale datasets. In their early work Ishida et al. (2017), they assume every complementary label has the same probability of being selected for a sample. Then, based on the ordinary one-versus-all (OVA) and pairwise-comparison (PC) multi-class loss functions Zhang (2004), they proposed a modified loss for learning with complementary labels. Even though they provided a theoretical analysis with a statistical consistency guarantee, the loss function was subject to a strong restriction: it needs to be symmetric ($\phi(z) + \phi(-z) = 1$). Such a severe limitation allows only the OVA and PC loss functions with symmetric non-convex binary losses; the categorical cross-entropy loss widely used in deep learning cannot be employed in the two losses they defined. Later, Yu et al. (2018a) assume there is some bias among the complementary labels and present a different formulation for biased complementary labels, using the forward loss-correction technique Patrini et al. (2017) to modify traditional loss functions. Their suggested risk estimator is not necessarily unbiased, and they proved that learning with complementary labels can theoretically converge to the optimal classifier learned from ordinary labels, based on the estimated transition matrix. However, the key to the forward loss-correction technique is to estimate the transition matrix correctly. Hence, one needs to assess the transition matrix beforehand, which is difficult without strong assumptions. Moreover, such a setup restricts the setting to a small complementary-label space in order to provide more information.
Thus, it is necessary to encourage workers to provide more challenging complementary labels, for example by giving higher rewards for specific classes; otherwise, the complementary labels given by workers may be too obvious and uninformative. For example, stating that class three and class five are not class one is evident but uninformative. This paper focuses on the uniform (symmetric) assumption and studies a random distribution as the biased (asymmetric, non-uniform) assumption. Under the uniform assumption, Ishida et al. (2019) proposed an unbiased risk estimator with a general loss function for complementary labels. It makes any loss function available, not only the softmax cross-entropy loss. Their new framework is a generalization of previous complementary-label learning Ishida et al. (2017). However, their unbiased risk estimator has the issue that the classification risk can attain negative values during learning, leading to overfitting Ishida et al. (2019). They then offered a non-negative correction to the original unbiased risk estimator, which is no longer guaranteed to be unbiased. In this paper, our proposed risk estimator is also not unbiased, but the minimizer of the risk for complementary labels is theoretically guaranteed to be consistent with the minimizer of the risk for ordinary labels, under both uniform and non-uniform distributions.

3.1. LEARNING WITH ORDINARY LABELS

In the context of learning with ordinary labels, let $X \subset \mathbb{R}^d$ be the feature space and $Y = \{1, \dots, k\}$ the set of class labels. A multi-class loss function is a map $L(f_\theta(x), y): X \times Y \to \mathbb{R}_+$. A classifier can be written as

$$h(x) = \arg\max_{i \in [k]} f_\theta^{(i)}(x),$$

where $f_\theta(x) = (f_\theta^{(1)}(x), \dots, f_\theta^{(k)}(x))$, $\theta$ is the set of parameters of the network, and $f_\theta^{(i)}(x)$ is the predicted probability for class $i$. Although $h(x)$ is the final classifier, we follow the convention of calling $f_\theta(x)$ itself the classifier. Given a dataset $S = \{(x_i, y_i)\}_{i=1}^N$ together with a loss function $L$, for any $f_\theta \in F$ ($F$ being the function space searched over), the $L$-risk is defined as

$$R_L^S(f_\theta) = \mathbb{E}_S[L(f_\theta(x), y)] = \frac{1}{N}\sum_{i=1}^N L(f_\theta(x_i), y_i).$$

Popular multi-class loss functions include CCE, MAE, and MSE. Specifically,

$$\ell(f_\theta(x), y) = \ell(u, y) = \begin{cases} \sum_{i=1}^k e_y^{(i)} \log\frac{1}{\mu_i} = \log\frac{1}{\mu_y} & \text{CCE}, \\ \|e_y - u\|_1 = 2 - 2\mu_y & \text{MAE}, \\ \|e_y - u\|_2^2 = \|u\|_2^2 + 1 - 2\mu_y & \text{MSE}, \end{cases}$$

where $u = f_\theta(x) = (\mu_1, \dots, \mu_k)$ and $e_y$ is the one-hot vector whose $y$-th component equals 1 and whose other components are 0. The goal of multi-class classification is to learn a classifier $f_\theta(x)$ that minimizes the classification risk $R_L^S$ under the multi-class loss $L$.
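For concreteness, the three ordinary losses above can be computed directly from their closed forms; below is a small NumPy sketch (illustrative only, with 0-based class indices):

```python
import numpy as np

def cce(u, y):
    # Categorical cross-entropy: log(1 / mu_y)
    return -np.log(u[y])

def mae(u, y):
    # L1 distance to the one-hot target: ||e_y - u||_1 = 2 - 2 * mu_y
    e = np.zeros_like(u)
    e[y] = 1.0
    return np.abs(e - u).sum()

def mse(u, y):
    # Squared L2 distance: ||e_y - u||_2^2 = ||u||_2^2 + 1 - 2 * mu_y
    e = np.zeros_like(u)
    e[y] = 1.0
    return ((e - u) ** 2).sum()

u = np.array([0.7, 0.2, 0.1])   # softmax output, sums to 1
```

Each function agrees with the closed-form expression on the right-hand side of the case analysis above.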

3.2. LEARNING WITH COMPLEMENTARY LABELS

In contrast to ordinary-label learning, a complementary-label (CL) dataset contains only labels indicating a class that each sample does not belong to. Corresponding to the ordinary-label dataset $S$, the independent and identically distributed (i.i.d.) complementary-label dataset is denoted $\bar{S} = \{(x_i, \bar{y}_i)\}_{i=1}^N$, where $N$ is the size of $S$ and $\bar{y}$ indicates that pattern $x$ does not belong to class $\bar{y}$. The general label distribution of $\bar{S}$ is

$$P(\bar{y} \mid y) = \begin{pmatrix} 0 & p_{12} & \cdots & p_{1k} \\ p_{21} & 0 & \cdots & p_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k1} & \cdots & p_{k(k-1)} & 0 \end{pmatrix}_{k \times k}, \qquad (5)$$

where $p_{ij}$ is the probability that a pattern $x$ of the $i$-th class is labeled as $j$, with $\sum_{j=1}^k p_{ij} = 1$ and $p_{ij} = 0$ for $j = i$. Supposing the labeling system uniformly selects a label from $\{1, \dots, k\} \setminus \{y\}$ for each sample $x$, Eq. (5) becomes

$$P(\bar{y} \mid y) = \begin{pmatrix} 0 & \frac{1}{k-1} & \cdots & \frac{1}{k-1} \\ \frac{1}{k-1} & 0 & \cdots & \frac{1}{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{k-1} & \cdots & \frac{1}{k-1} & 0 \end{pmatrix}_{k \times k}. \qquad (6)$$

Yu et al. (2018b) make the stronger assumption that there is some bias in Eq. (5), while Ishida et al. (2017; 2019) focus on the assumption of Eq. (6). In this paper, we study both kinds of distribution.
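For reference, sampling from the uniform transition of Eq. (6) amounts to drawing $\bar{y}$ uniformly from the $k-1$ wrong classes. A short sketch (the helper name is ours, with 0-based labels):

```python
import numpy as np

def uniform_complementary(y, k, rng):
    # Uniformly pick y_bar from {0, ..., k-1} \ {y}: each wrong class
    # has probability 1 / (k - 1), matching Eq. (6).
    j = rng.integers(k - 1)   # draw from k - 1 slots
    return j if j < y else j + 1   # skip the true class y

rng = np.random.default_rng(0)
ybars = [uniform_complementary(2, 5, rng) for _ in range(2000)]
```

With true class 2 and k = 5, the draws cover {0, 1, 3, 4} and never hit the true class.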

4. METHODOLOGY

In this section, we first propose a general risk formulation for learning with complementary labels. We then prove that some loss functions designed for ordinary-label learning, such as categorical cross-entropy and mean absolute error, are robust to complementary labels under our risk formulation.

4.1. GENERAL RISK FORMULATION

The goal of learning with complementary labels is to learn a classifier that predicts the correct label for any sample drawn from the same distribution. Because there are no ordinary labels available, we need to design a loss function or model for learning with complementary labels. The key to learning a classifier in ordinary-label learning is to maximize the predicted probability of the true label. One intuitive way to do so is instead to minimize the predicted probability of the complementary label. In this paper, with a slight abuse of notation, we let

$$u = f_\theta(x) = (\mu_1, \dots, \mu_k), \qquad v = \mathbf{1} - f_\theta(x) = (1 - \mu_1, \dots, 1 - \mu_k). \qquad (7)$$

Definition 1. (CL-loss) Given a loss function $\ell$ designed for ordinary-label learning, the loss for learning with complementary labels is

$$\bar{\ell}(f_\theta(x), \bar{y}) = \bar{\ell}(u, \bar{y}) = \ell(v, \bar{y}). \qquad (8)$$
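Definition 1 in code: the complementary loss simply evaluates the ordinary loss on the flipped prediction $v = 1 - u$. A minimal NumPy sketch (function names are ours, for illustration):

```python
import numpy as np

def cce(p, y):
    # Ordinary categorical cross-entropy: log(1 / p_y)
    return -np.log(p[y])

def cl_loss(base_loss, u, y_bar):
    # Definition 1: feed v = 1 - u to a loss designed for ordinary labels
    v = 1.0 - u
    return base_loss(v, y_bar)

u = np.array([0.7, 0.2, 0.1])      # softmax output
# CL-CCE with complementary label 1: log(1 / (1 - mu_1)) = log(1 / 0.8)
loss = cl_loss(cce, u, 1)
```

Minimizing this loss pushes $\mu_{\bar{y}}$ down, which is the exclusion intuition stated above.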

4.2. THEORETICAL RESULTS

Definition 2. (Robust loss) In the framework of risk minimization, a loss function is called robust if the minimizer of the risk with complementary labels is the same as the minimizer of the risk with ordinary labels, i.e.,

$$R^{\bar{S}}_{\bar{\ell}}(f_{\theta^*}) - R^{\bar{S}}_{\bar{\ell}}(f_\theta) \le 0 \;\Rightarrow\; R^S_\ell(f_{\theta^*}) - R^S_\ell(f_\theta) \le 0. \qquad (9)$$

Theorem 1. Together with $\ell$, $\bar{\ell}$ is a robust loss function for learning with complementary labels if $\bar{\ell}$ satisfies

$$\frac{\partial \bar{\ell}(u, \bar{y})}{\partial \mu_{\bar{y}}} > 0, \qquad \frac{\partial \bar{\ell}(u, \bar{y})}{\partial \mu_i} = 0, \;\forall i \in \{1, \dots, k\} \setminus \{\bar{y}\}. \qquad (10)$$

Note that Eq. (10) means that $\bar{\ell}$ is a monotonically increasing loss function in $\mu_{\bar{y}}$ only.

Proof. Recall that for any $f_\theta$ and any $\ell$,

$$R^S_\ell(f_\theta) = \mathbb{E}_{(x,y)}[\ell(f_\theta(x), y)] = \frac{1}{|S|}\sum_{(x,y)\in S} \ell(f_\theta(x), y). \qquad (11)$$

For any complementary-label distribution of the form in Eq. (5) and any loss function $\ell$, we have

$$R^{\bar{S}}_{\bar{\ell}}(f_\theta) = \mathbb{E}_{(x,\bar{y})}[\bar{\ell}(f_\theta(x), \bar{y})] = \frac{1}{|\bar{S}|}\sum_{i=1}^k \sum_{x \in S_i} \sum_{j \ne i} p_{ij}\,\bar{\ell}(f_\theta(x), j), \qquad (12)$$

where $p_{ij}$ is a component of the complementary-label distribution matrix $P$ and $S_1 \cup \dots \cup S_k = S$. Suppose $f_{\theta^*}$ is the optimal classifier learned from the complementary labels; then for all $f_\theta \in F$ ($F$ being the function space searched over),

$$R^{\bar{S}}_{\bar{\ell}}(f_{\theta^*}) - R^{\bar{S}}_{\bar{\ell}}(f_\theta) = \frac{1}{|\bar{S}|}\sum_{i=1}^k \sum_{x \in S_i} \sum_{j \ne i} p_{ij}\left[\bar{\ell}(f_{\theta^*}(x), j) - \bar{\ell}(f_\theta(x), j)\right] \le 0, \qquad (13)$$

where $p_{ij} \ne 0$. If there exists $x' \in S$ such that $\bar{\ell}(f_{\theta^*}(x'), \bar{y}) > \bar{\ell}(f_\theta(x'), \bar{y})$, let $f_{\theta'}$ satisfy

$$f_{\theta'}(x) = \begin{cases} f_{\theta^*}(x) & x \in S \setminus \{x'\}, \\ f_\theta(x) & x = x', \end{cases} \qquad (14)$$

then by Eqs. (12) and (13), $R^{\bar{S}}_{\bar{\ell}}(f_{\theta'}) < R^{\bar{S}}_{\bar{\ell}}(f_{\theta^*})$, so $f_{\theta^*}$ is not the optimal classifier. This contradicts the hypothesis that $f_{\theta^*}$ is optimal. Thus, for all $\bar{y} \in \{1, \dots, k\} \setminus \{y\}$, we have

$$\bar{\ell}(f_{\theta^*}(x), \bar{y}) \le \bar{\ell}(f_\theta(x), \bar{y}). \qquad (15)$$

Since, by Eq. (10), $\bar{\ell}$ is monotonically increasing in $\mu_{\bar{y}}$ only, for all $\bar{y} \in \{1, \dots, k\} \setminus \{y\}$,

$$f^{(\bar{y})}_{\theta^*}(x) \le f^{(\bar{y})}_\theta(x). \qquad (16)$$

Thus,

$$f^{(y)}_{\theta^*}(x) \ge f^{(y)}_\theta(x) \quad \left(f^{(y)}_\theta(x) = 1 - \sum_{\bar{y} \ne y} f^{(\bar{y})}_\theta(x)\right), \qquad (17)$$

and then

$$\ell(f_{\theta^*}(x), y) \le \ell(f_\theta(x), y), \qquad (18)$$

thus

$$R^S_\ell(f_{\theta^*}) - R^S_\ell(f_\theta) \le 0. \qquad (19)$$
Theorem 2. Together with $\ell$, $\bar{\ell}$ is a robust loss function for learning with complementary labels under the symmetric (uniform) distribution if $\bar{\ell}$ satisfies

$$\frac{\partial \bar{\ell}(u, \bar{y})}{\partial \mu_{\bar{y}}} > 0, \qquad \sum_{i=1}^k \bar{\ell}(u, i) = C \;\text{($C$ a constant)}. \qquad (20)$$

It should be noted that Eq. (20) means that $\bar{\ell}$ is a symmetric loss ($\sum_{i=1}^k \bar{\ell}(u, i) = C$) and that $\bar{\ell}$ is monotonically increasing in $\mu_{\bar{y}}$ for any $\bar{y}$.

Proof. For the complementary-label distribution of Eq. (6) and any loss function $\ell$, we have

$$R^{\bar{S}}_{\bar{\ell}}(f_\theta) = \mathbb{E}_{(x,\bar{y})}[\bar{\ell}(f_\theta(x), \bar{y})] = \frac{1}{|\bar{S}|}\sum_{i=1}^k \sum_{x \in S_i} \sum_{j \ne i} \frac{1}{k-1}\bar{\ell}(f_\theta(x), j) = \frac{1}{|\bar{S}|}\sum_{i=1}^k \sum_{x \in S_i} \frac{1}{k-1}\left(C - \bar{\ell}(f_\theta(x), i)\right) = \frac{C}{k-1} - \frac{1}{k-1} R^S_{\bar{\ell}}(f_\theta), \qquad (21)$$

where $S_1 \cup \dots \cup S_k = S$. Suppose $f_{\theta^*}$ is the optimal classifier learned from the complementary labels; then for all $f_\theta \in F$ ($F$ being the function space searched over),

$$R^{\bar{S}}_{\bar{\ell}}(f_{\theta^*}) - R^{\bar{S}}_{\bar{\ell}}(f_\theta) = \frac{1}{k-1}\left(R^S_{\bar{\ell}}(f_\theta) - R^S_{\bar{\ell}}(f_{\theta^*})\right) \le 0, \qquad (22)$$

so $\bar{\ell}(f_\theta(x), y) \le \bar{\ell}(f_{\theta^*}(x), y)$. By the first constraint in Eq. (20), we then have

$$f^{(y)}_\theta(x) \le f^{(y)}_{\theta^*}(x), \qquad (23)$$

and then

$$\ell(f_{\theta^*}(x), y) \le \ell(f_\theta(x), y), \qquad (24)$$

thus

$$R^S_\ell(f_{\theta^*}) - R^S_\ell(f_\theta) \le 0. \qquad (25)$$

Algorithm 1 Exclusion algorithm
Input: complementary-label training set $\bar{S}_{train}$, learning rate $\beta$, classifier $f_\theta$
1: for each epoch do
2:   for $(x_i, \bar{y}_i)$ in $\bar{S}_{train}$ do
3:     $f_\theta(x_i) = (\mu_1, \dots, \mu_k)$;
4:     $v = \mathbf{1} - f_\theta(x_i) = (1 - \mu_1, \dots, 1 - \mu_k)$;
5:     $\text{loss} = \ell(v, \bar{y}_i)$;
6:     $w \leftarrow w - \beta \frac{\partial \text{loss}}{\partial w}, \; w \in \theta$;
7:   end for
8: end for
9: return $f_\theta(x)$

Together with some well-known multi-class loss functions such as CCE, MAE, and MSE, the losses for learning with complementary labels under our definition are

$$\bar{\ell}(f_\theta(x), \bar{y}) = \ell(v, \bar{y}) = \begin{cases} \sum_{i=1}^k e^{(i)}_{\bar{y}} \log\frac{1}{1 - \mu_i} = \log\frac{1}{1 - \mu_{\bar{y}}} & \text{CCE}, \\ \|e_{\bar{y}} - v\|_1 = k - 2 + 2\mu_{\bar{y}} & \text{MAE}, \\ \|e_{\bar{y}} - v\|_2^2 = k - 3 + \|u\|_2^2 + 2\mu_{\bar{y}} & \text{MSE}, \end{cases} \qquad (26)$$

where $e_{\bar{y}}$ is the one-hot vector whose $\bar{y}$-th component equals 1 and whose other components are 0. As shown in Eq. (26), the CCE and MAE losses satisfy Theorem 1, MAE also satisfies Theorem 2, while MSE satisfies neither. Zhang & Sabuncu (2018) propose the GCE loss function for learning with label noise:

$$\ell_{GCE}(f_\theta(x), y) = \frac{1 - \mu_y^q}{q}, \quad q \in (0, 1). \qquad (27)$$
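As a quick numeric sanity check of the symmetry condition in Theorem 2: the complementary MAE in Eq. (26) is $k - 2 + 2\mu_{\bar{y}}$, so summing over all $k$ labels gives $k(k-2) + 2\sum_i \mu_i = k^2 - 2k + 2$, a constant independent of $u$. A small NumPy check (illustrative only):

```python
import numpy as np

def cl_mae(u, y_bar):
    # Complementary MAE (Eq. 26): ||e_ybar - (1 - u)||_1 = k - 2 + 2 * mu_ybar
    k = len(u)
    e = np.zeros(k)
    e[y_bar] = 1.0
    return np.abs(e - (1.0 - u)).sum()

u1 = np.array([0.5, 0.3, 0.2])       # softmax outputs, each sums to 1
u2 = np.array([0.9, 0.05, 0.05])
total1 = sum(cl_mae(u1, i) for i in range(3))
total2 = sum(cl_mae(u2, i) for i in range(3))
# Both totals equal k^2 - 2k + 2 = 5 for k = 3, regardless of u,
# so the symmetric-sum condition of Theorem 2 holds for CL-MAE.
```

The same check applied to the complementary MSE would give a $u$-dependent total, consistent with MSE failing the theorem.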
It is easy to verify that the complementary version of this loss satisfies the constraints in Theorem 1; thus, it can also be used for learning with complementary labels.
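Under Definition 1, the complementary GCE applies the ordinary GCE loss to $v = 1 - u$, giving $(1 - (1 - \mu_{\bar{y}})^q)/q$, which depends on $\mu_{\bar{y}}$ alone and increases in it, as Theorem 1 requires. A minimal numeric check (illustrative function name):

```python
import numpy as np

def cl_gce(u, y_bar, q=0.7):
    # GCE applied to v = 1 - u (Definition 1):
    # (1 - (1 - mu_ybar)^q) / q, a function of mu_ybar only
    return (1.0 - (1.0 - u[y_bar]) ** q) / q

# Monotone in mu_ybar: a larger predicted probability on the
# complementary class must incur a larger loss (Theorem 1).
lo = cl_gce(np.array([0.2, 0.8]), 0)   # mu_ybar = 0.2
hi = cl_gce(np.array([0.5, 0.5]), 0)   # mu_ybar = 0.5
```

Because the loss ignores every coordinate other than $\mu_{\bar{y}}$, the second condition of Eq. (10) holds trivially.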

4.3. EXCLUSION ALGORITHM FOR LEARNING FROM COMPLEMENTARY LABELS

Based on the loss function we designed for complementary-label learning, we present an algorithm, named the exclusion algorithm, that learns a classifier from complementary labels with our loss function (the complementary label specifies that the sample does not belong to that class). The details are shown in Alg. 1. Furthermore, our algorithm is easily combined with models designed for ordinary-label learning, requiring only a minus operation, and can be viewed as a framework for using losses designed for ordinary-label learning to learn the optimal classifier from complementary labels.

Setup. We test our method on MNIST LeCun et al. (1998), FASHION-MNIST Xiao et al. (2017), and CIFAR10 Krizhevsky (2009). Specifically, we generate two types of complementary labels, symmetric and asymmetric, to verify our method's effectiveness and the theorems proved in the previous section. For symmetric complementary labels, we fix a label distribution as in Eq. (6) to generate the complementary-label training set. The validation set is split from the training set and contains no ordinary labels; thus, the lower the validation accuracy (measured against the complementary labels), the better the classifier learned from the training set. For asymmetric complementary labels, we randomly generate a matrix as in Eq. (5), with the $p_{ij}$ unknown, as the complementary-label distribution and use it to create the complementary labels. The test accuracy of all experiments is measured on a clean dataset that contains only ordinary labels.
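To make Alg. 1 concrete, here is a minimal NumPy sketch of the exclusion algorithm on a toy problem, using the complementary CCE loss $-\log(1 - \mu_{\bar{y}})$ with a linear softmax model and full-batch gradient descent. This is only an illustration under those assumptions (the paper's experiments use a CNN with mini-batch SGD), and the analytic gradient below is specific to CL-CCE:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_exclusion(X, y_bar, k, epochs=300, lr=0.5, seed=0):
    """Exclusion algorithm (Alg. 1) with CL-CCE: minimize -log(1 - mu_ybar)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], k))
    n = len(X)
    rows = np.arange(n)
    losses = []
    for _ in range(epochs):
        mu = softmax(X @ W)                  # predicted class probabilities
        mu_bar = mu[rows, y_bar]             # probability of the complementary class
        losses.append(-np.log(1.0 - mu_bar).mean())
        # dL/dz_j = mu_bar * (delta_{j, ybar} - mu_j) / (1 - mu_bar)
        coef = mu_bar / (1.0 - mu_bar)
        G = -mu * coef[:, None]
        G[rows, y_bar] += coef
        W -= lr * (X.T @ G) / n              # gradient-descent step
    return W, losses

# Toy data: 3 well-separated Gaussian clusters, uniform complementary labels.
rng = np.random.default_rng(1)
centers = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = rng.integers(3, size=150)
X = centers[y] + 0.3 * rng.standard_normal((150, 2))
y_bar = (y + 1 + rng.integers(2, size=150)) % 3   # uniform over the two wrong classes
W, losses = train_exclusion(X, y_bar, k=3)
```

Each step pushes down the predicted probability of the complementary class, so the complementary-label risk decreases over training.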

5. EXPERIMENTS

Approaches. We test our loss with $\bar{\ell}_{CCE}$, $\bar{\ell}_{MAE}$, $\bar{\ell}_{MSE}$, $\bar{\ell}_{GCE}$ and compare with state-of-the-art methods for learning with complementary labels. The loss functions we use or compare against are as follows. 1) CCE: the categorical cross-entropy loss, neither symmetric nor bounded, widely used in machine learning and deep learning due to its fast convergence. 2) MAE: the mean absolute error, symmetric and bounded, which has been proved noise-tolerant Ghosh et al. (2017). 3) MSE: the mean squared error, not symmetric but bounded, widely used in regression. 4) GCE: uses a hyper-parameter q to interpolate between MAE and CCE, achieving noise robustness through its boundedness; we use the standard GCE with q = 0.7. 5) GA: gradient ascent, a learning algorithm for complementary-label learning used to tackle the overfitting problem of the unbiased estimator proposed in Ishida et al. (2019). Network architecture. Following Ghosh et al. (2017), we use a five-layer network for all experiments: a convolution layer with 32 filters of size (3,3), a max-pooling layer with pooling size (3,3) and strides (2,2), two fully connected layers with 1024 units each, and a fully connected output layer with softmax activation whose number of units equals the number of categories. The Rectified Linear Unit (ReLU) is used as the activation function in the network's hidden layers. Implementation details. The implementation of our method is shown in Alg. 1. We train the network with stochastic gradient descent through back-propagation. Each experiment trains for 200 epochs with a mini-batch size of 64. To exploit each loss function's best performance, we try three initial learning rates for each loss function in each experiment and report the best accuracy among the three.
CCE is set to [1e-3, 5e-4, 1e-4], while GCE, MAE, and MSE are set to [1.0, 0.5, 0.1]. The learning rate is halved every 50 epochs. Robustness. As shown in Fig. 1, with the CCE, MAE, and GCE losses, our algorithm is strongly robust to both symmetric and asymmetric complementary labels, which verifies the robustness proved in Theorems 1 and 2. Even though MAE satisfies both theorems, it achieves lower test accuracy than CCE and GCE because it treats all labels the same (it is not sensitive to the labels). The subfigures in the last column of Fig. 1 show that the MSE loss first reaches its highest test accuracy and then drops sharply over the epochs. Because MSE satisfies neither of the two theorems, it easily overfits the training set's complementary labels.

5.2. EXPERIMENTAL RESULTS

Such a trend is the same for asymmetric complementary-label learning. The results show that the algorithm we design for complementary labels is effective and confirm the theoretical results analyzed in the previous section. Performance comparison. The first four columns of Table 1 show that the CCE and GCE losses achieve the best two test accuracies with our algorithm. On MNIST, CCE achieves slightly lower test accuracy than GCE, the same accuracy on FASHION-MNIST, and slightly higher accuracy on CIFAR10, since that dataset is more challenging and CCE is more sensitive to the labels. Even though MAE is robust to complementary labels, its performance is worse than the others because it is a linear loss that is not sensitive to the labels. As shown in Fig. 1, MSE is not robust to complementary labels, yet with a small learning rate of 0.1 it exhibited only slight overfitting in Table 1. Furthermore, as shown in Table 1, with the CCE and GCE losses our algorithm achieves test accuracy higher than 95% on MNIST, comparable to learning with ordinary labels. For a fair comparison, the last three columns are taken directly from Ishida et al. (2019), even though those results report the maximum test accuracy. On the first two datasets, all loss functions with our algorithm achieve higher test accuracy than GA, although GA used an MLP base model, simpler than ours. On CIFAR10, they used ResNet-34 (21.62M parameters) He et al. (2016) and DenseNet Huang et al. (2017) as their base models, much bigger than ours (8.43M parameters), yet we achieve much higher test accuracy. The results validate the superiority of our algorithm over current state-of-the-art methods.

6. CONCLUSION

This paper designs an algorithm for learning from complementary labels using loss functions designed for ordinary-label learning. We provide a theoretical analysis showing that the losses we design for learning from complementary labels are robust to complementary labels, i.e., the optimal classifier learned from complementary labels theoretically converges to the optimal classifier learned from ordinary labels. The two theorems we present give sufficient conditions for a loss function to be robust to complementary labels. Experimental results show that although complementary-label learning is a new topic in supervised learning, it is highly competitive. In future work, more methods should be studied to improve the performance of complementary-label learning, such as Amid et al. (2019b) and Amid et al. (2019a).




6) PC: pairwise comparison with ramp loss, designed for complementary-label learning Ishida et al. (2017). 7) Fwd: forward correction Patrini et al. (2017); Yu et al. (2018a), designed for learning with complementary labels.

Figure 1: Accuracy of the CCE, MAE, GCE, and MSE loss functions over epochs on CIFAR10 with symmetric complementary labels (SCL) and asymmetric complementary labels (ACL). Legends are shown in the first subfigure of the first row.


Table 1: Test accuracy and standard deviation (5 trials) for each loss function, under different complementary-label distribution assumptions, on MNIST, FASHION-MNIST, and CIFAR10. We report the average test accuracy over the last ten percent of epochs. For a fair comparison, the last three columns are copied directly from Table 2 in Ishida et al. (2019), where GA Ishida et al. (2019): gradient ascent, PC Ishida et al. (2017): pairwise comparison, Fwd Yu et al. (2018b): forward correction. The top two accuracies are in boldface.

