PROVABLE ROBUST LEARNING FOR DEEP NEURAL NETWORKS UNDER AGNOSTIC CORRUPTED SUPERVISION

Abstract

Training deep neural models in the presence of corrupted supervision is challenging, as the corrupted data points may significantly degrade generalization performance. To alleviate this problem, we present an efficient robust algorithm that achieves strong guarantees without any assumption on the type of corruption and provides a unified framework for both classification and regression problems. Unlike many existing approaches that quantify the quality of individual data points (e.g., by loss values) and filter them accordingly, the proposed algorithm focuses on controlling the collective impact of data points on the averaged gradient. Even when a corrupted data point fails to be excluded by the proposed algorithm, it has very limited impact on the overall loss, as compared with state-of-the-art methods that filter data points based on loss values. Extensive empirical results on multiple benchmark datasets demonstrate the robustness of the proposed method under different types of corruption.

1. INTRODUCTION

Corrupted supervision is a common issue in real-world learning tasks, where the learning targets are inaccurate due to various factors in the data collection process. The problem is especially severe for deep learning models, whose large degree of freedom makes them easily memorize corrupted examples and thus susceptible to overfitting (Zhang et al., 2016). There have been extensive efforts to achieve robustness against corrupted supervision. A natural approach for deep neural networks (DNNs) is to reduce the model's exposure to corrupted data points during training: by detecting and filtering (or re-weighting) possibly corrupted samples, the learning is expected to deliver a model similar to one trained on clean data (without corruption) (Kumar et al., 2010; Han et al., 2018; Zheng et al., 2020). Different criteria have been designed to identify corrupted data points during training. For example, Kumar et al. (2010); Han et al. (2018); Jiang et al. (2018) leveraged the loss values of data points; Zheng et al. (2020) tapped prediction uncertainty for filtering data; Malach & Shalev-Shwartz (2017) used the disagreement between two deep networks; Reed et al. (2014) utilized the prediction consistency of neighboring iterations. The success of these methods depends heavily on how effectively the detection criteria identify the corrupted data points. Since the corrupted labels remain unknown throughout learning, such "unsupervised" detection approaches may not be effective: they either lack theoretical guarantees of robustness (Han et al., 2018; Reed et al., 2014; Malach & Shalev-Shwartz, 2017; Li et al., 2017) or provide guarantees only under the assumption of prior knowledge about the type of corruption (Zheng et al., 2020; Shah et al., 2020; Patrini et al., 2017; Yi & Wu, 2019).
Another limitation of many existing approaches is that they are designed exclusively for classification problems (e.g., Malach & Shalev-Shwartz (2017); Reed et al. (2014); Menon et al. (2019); Zheng et al. (2020)) and are not straightforward to extend to regression. To tackle these challenges, this paper presents a unified optimization framework that has robustness guarantees without any assumptions on how supervision is corrupted, and that is applicable to both classification and regression problems. Instead of developing an accurate criterion for detecting corrupted samples, we adopt a novel perspective and focus on limiting the collective impact of corrupted samples during learning through robust mean estimation of gradients. Specifically, if the estimated average gradient is close to the gradient of the clean data throughout the learning iterations, then the final model will be close to the model trained on clean data. As such, a corrupted data point can still be used during training when it does not considerably alter the averaged gradient. This observation has a remarkable impact on our algorithm design: instead of explicitly quantifying (and identifying) individual corrupted data points, which is a hard problem in itself, we deal with an easier task, namely eliminating training data points that significantly distort the mean gradient estimate. One immediate consequence of this design is that even when a corrupted data point fails to be excluded by the proposed algorithm, it is likely to have very limited impact on the overall loss, as compared with state-of-the-art methods that filter data points based on loss values. We perform experiments on both regression and classification with corrupted supervision on multiple benchmark datasets. The results show that the proposed method outperforms state-of-the-art approaches.

2. BACKGROUND

Learning from corrupted data (Huber, 1992) has attracted considerable attention in the machine learning community (Natarajan et al., 2013). Many recent studies have investigated the robustness of classification tasks with noisy labels. For example, Kumar et al. (2010) proposed self-paced learning (SPL), which assigns higher weights to examples with smaller loss. A similar idea is used in curriculum learning (Bengio et al., 2009), in which the model learns easy samples before harder ones. Alternative methods inspired by SPL include learning the data weights (Jiang et al., 2018) and collaborative learning (Han et al., 2018; Yu et al., 2019). Label correction (Patrini et al., 2017; Li et al., 2017; Yi & Wu, 2019) is another approach, which revises the original labels with the goal of recovering clean labels from corrupted ones. However, since we do not know which data points are corrupted, it is hard to obtain provable guarantees for label correction without strong assumptions on the corruption type. Accurate gradient estimation is a key step for successful optimization, and the relationship between gradient estimation and final convergence has been widely studied in the optimization community. Since computing an approximate (and potentially biased) gradient is often more efficient than computing the exact gradient, many studies optimize models with approximate gradients, which suffer from the biased-estimation problem if no assumptions are placed on the gradient estimate (d'Aspremont, 2008; Schmidt et al., 2011; Bernstein et al., 2018; Hu et al., 2020; Ajalloeian & Stich, 2020). A closely related topic is robust mean estimation: given corrupted data, the goal is to produce an estimate μ̂ such that the distance ‖μ̂ − µ‖₂ to the mean µ of the clean data is minimized.
It has been shown that the median or the trimmed mean is the optimal statistic for mean estimation in one-dimensional data (Huber, 1992). However, robustness in high dimensions is quite challenging, since applying the coordinate-wise optimal robust estimator incurs an error factor of O(√d) that scales with the data dimension. Although some classical estimators, such as the Tukey median (Tukey, 1975), avoid the O(√d) error, they cannot be computed in polynomial time. More recently, Diakonikolas et al. (2016); Lai et al. (2016) designed polynomial-time algorithms with dimension-free error bounds, and these results have been widely applied to improve algorithmic efficiency in various scenarios (Dong et al., 2019; Cheng et al., 2020). Robust optimization aims to optimize a model given corrupted data. Many previous studies improve the robustness of optimization in different problem settings, but most of them either study linear regression and its variants (Bhatia et al., 2015; 2017; Shen & Sanghavi, 2019) or convex optimization (Prasad et al., 2018); those results therefore cannot be directly generalized to deep neural networks. Diakonikolas et al. (2019) give a very general non-convex optimization method with an agnostic corruption guarantee, but the space complexity of the algorithm is high, so it cannot be applied to deep neural networks given current hardware limitations.

3. METHODOLOGY

Before introducing our algorithm, we first formalize corrupted supervision. To characterize agnostic corruption, we consider an adversary that corrupts the supervision of a clean dataset. There is no limitation on how the adversary corrupts the supervision: it can randomly permute the targets, or corrupt them in a way that maximizes the negative impact (i.e., lowers performance). First, the adversary can choose up to an ε-fraction of the rows of the clean targets D_y ∈ R^{n×q} and change the selected rows to arbitrary valid values, producing D̃_y ∈ R^{n×q}. The adversary then returns the corrupted dataset (D_x, D̃_y) to our learning algorithm A. In this process, the only constraint on the adversary is the fraction ε; the adversary has full knowledge of the data, and even of the learning algorithm A. A natural question to ask is: given a dataset (D_x ∈ R^{n×p}, D̃_y) with ε-fraction corrupted supervision and a learning objective φ : R^p × R^q × R^d → R parameterized by θ, can we output parameters θ ∈ R^d such that ‖∇_θ φ(θ; D_x, D_y)‖ is minimized? When ε = 0 we have D̃_y = D_y, and learning is done on clean data: stochastic gradient descent converges to a stationary point, where ∇_θ φ(θ; D_x, D_y) = 0. However, when the supervision is corrupted as above, this is no longer the case, because the error in θ is affected by the corrupted data. We thus want an efficient algorithm that finds a model θ minimizing ‖∇_θ φ(θ; D_x, D_y)‖. A robust model θ should have a small value of ‖∇_θ φ(θ; D_x, D_y)‖, and we hypothesize that a smaller value leads to better generalization.
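As a concrete illustration of this threat model (our own sketch, not part of the paper's algorithm), the following NumPy snippet simulates an ε-fraction supervision corruption. The particular noise added here is an arbitrary choice: the adversary defined above may replace the selected targets with any values, and the function name `corrupt_targets` is ours.

```python
import numpy as np

def corrupt_targets(y, eps, rng):
    """Corrupt an eps-fraction of target rows with arbitrary values.

    This hypothetical adversary adds large-magnitude Gaussian noise to the
    chosen rows; the setting in the text allows *any* corruption of the
    selected rows, including one that depends on the learner.
    """
    y_tilde = y.copy()
    n = y.shape[0]
    num_corrupt = int(eps * n)  # the only constraint on the adversary
    idx = rng.choice(n, size=num_corrupt, replace=False)
    y_tilde[idx] += rng.normal(scale=50.0, size=(num_corrupt,) + y.shape[1:])
    return y_tilde, idx

rng = np.random.default_rng(0)
y = rng.normal(size=(100, 3))            # clean targets D_y with n=100, q=3
y_tilde, idx = corrupt_targets(y, 0.2, rng)
```

The learner then sees only `(X, y_tilde)`, never `idx` or `y`.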

3.1. STOCHASTIC GRADIENT DESCENT WITH BIASED GRADIENT

A direct consequence of corrupted supervision is biased gradient estimation. In this section, we first analyze how such biased gradient estimation affects the robustness of learning. The classical analysis of stochastic gradient descent (SGD) requires access to a stochastic gradient oracle that is an unbiased estimate of the true gradient. However, corrupted supervision leads to corrupted gradients, and it is thus difficult to obtain an unbiased gradient estimate without assumptions on how the gradients are corrupted. We start the analysis with the following informal theorem (without an elaborated discussion of its assumptions) on how a biased gradient affects the convergence of SGD; the formal version is provided in Theorem 4 in the Appendix. Theorem 1 (Convergence of Biased SGD (Informal)) Under mild assumptions, denote by ζ the maximum ℓ₂ norm of the difference between the clean minibatch gradient g and the corrupted minibatch gradient ĝ, i.e., ‖ĝ − g‖ ≤ ζ. Then, using the biased gradient estimate, SGD converges to a ζ-approximate stationary point: E‖∇φ(θ_t)‖² = O(ζ²). Remark 1 In the corrupted supervision setting, let ĝ be the gradient estimated from the corrupted data D̃ and g the gradient estimated from the clean data D. Assuming ‖ĝ − g‖ ≤ ζ, it follows that SGD run on the corrupted dataset converges to a ζ-approximate stationary point of the objective defined by the clean data. Note that the difference between the above theorem and a typical convergence theorem is that a biased gradient estimate is used. According to Theorem 1 and the remark, a robust estimate of the gradient g is the key to ensuring a robust model (one that converges to the clean solution). We also assume the loss function has the form L(y, ŷ); many commonly used loss functions fall into this category.
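The behavior described by Theorem 1 can be checked on a tiny example of our own (not from the paper): run gradient descent on φ(θ) = ½‖θ‖² with an oracle whose bias b has norm ζ. The iterates stall at the point where the biased gradient vanishes, and the true gradient norm there is exactly ζ, i.e., a ζ-approximate stationary point.

```python
import numpy as np

# Minimal illustration: true gradient of phi(theta) = 0.5*||theta||^2 is theta.
zeta = 0.3
b = np.array([zeta, 0.0])          # fixed bias with ||b|| = zeta
theta = np.array([5.0, -3.0])
for _ in range(1000):
    g = theta + b                  # biased oracle: true gradient plus bias
    theta = theta - 0.1 * g
grad_norm = np.linalg.norm(theta)  # true gradient norm at the limit point
```

The fixed point is θ = −b, so `grad_norm` equals ζ = 0.3: more iterations do not reduce the clean gradient norm below the bias level.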

3.2. ROBUST GRADIENT ESTIMATION FOR GENERAL DATA CORRUPTION

We first introduce Algo. 2 for general corruption (i.e., corruption of features and/or supervision). The algorithm excludes the data points with the largest gradient norms and uses the empirical mean of the remaining points as the gradient estimate. Thm. 2 gives its robustness property.

Algorithm 2: Robust Mean Estimation for Corrupted Gradients
input: gradient matrix G̃ ∈ R^{m×d}, corruption rate ε; return: estimated mean μ̂ ∈ R^d.
1. For each row z_i of G̃, compute the ℓ₂ norm ‖z_i‖.
2. Select the ε-fraction of rows with the largest ‖z_i‖.
3. Remove the selected rows and return the empirical mean of the remaining rows as μ̂.

Assumption 1 (Individual L-smooth loss) For every individual loss function φ_i there exists a constant L > 0 such that, for a clean sample i, |φ_i(x) − φ_i(y)| ≤ L|x − y| for any x, y. Theorem 2 (Robust Gradient Estimation for Data Corruption) Let G̃ ∈ R^{m×d} be a corrupted gradient matrix, G ∈ R^{m×d} the clean gradient matrix, and µ the empirical mean function. Then, under Asm. 1, the output μ̂ of Algo. 2 satisfies ‖µ(G) − μ̂‖ = O(εL). The robustness guarantee states that even when training on generally corrupted data (corrupted supervision is a special case), Algo. 2 guarantees that the gradient norm on the remaining data cannot be too large. Since Thm. 2 gives a dimension-free error bound when Asm. 1 holds, Corollary 1 likewise gives a dimension-free robustness guarantee under Asm. 1. We defer the detailed discussion of the O(εL) bound to later sections. A dimension-free bound, meaning the error does not grow with the dimension, is critical when working with neural networks, whose gradient dimension (the number of network parameters) is extremely large. Although the O(εL) bound appears strong, it has several drawbacks: Thm. 2 is dimension-free only when Asm. 1 holds, which is quite strong; moreover, even when Asm. 1 holds, L can be large, leading to a large gradient estimation error.
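A minimal NumPy sketch of the filtering estimator just described (the function name and toy data are ours): drop the ε-fraction of rows with the largest ℓ₂ norms and average the rest.

```python
import numpy as np

def robust_gradient_mean(G, eps):
    """Drop the eps-fraction of rows of G with the largest l2 norms and
    return the empirical mean of the remaining rows (filtering estimator)."""
    m = G.shape[0]
    num_drop = int(np.ceil(eps * m))
    norms = np.linalg.norm(G, axis=1)
    keep = np.argsort(norms)[: m - num_drop]  # rows with the smallest norms
    return G[keep].mean(axis=0)

# Toy check: 8 clean per-sample gradients near [1, 0] plus 2 corrupted rows.
rng = np.random.default_rng(1)
clean = rng.normal([1.0, 0.0], 0.01, size=(8, 2))
corrupted = np.array([[500.0, 500.0], [-800.0, 100.0]])
G = np.vstack([clean, corrupted])
mu_hat = robust_gradient_mean(G, 0.2)      # close to the clean mean [1, 0]
```

The naive empirical mean of `G` would be dominated by the two corrupted rows, while the filtered mean stays near the clean mean.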
Existing work (Diakonikolas et al., 2019) already achieves a dimension-free O(√ε) guarantee under general corruption, a much stronger theoretical result than the theorem above. However, in practice we found that the gradient norms of deep neural networks for individual data points are usually not very large, even at the beginning of training, which may be partly due to the network structure. Further discussion of this issue is beyond the scope of this paper, but the theoretical bound above indicates that, for general models, robustness depends on the number of parameters. Another concern with Algo. 2 is efficiency: it requires computing individual gradients. Although advanced approaches exist for obtaining individual gradients, e.g., Goodfellow (2015), they are still relatively slow compared to standard back-propagation. Moreover, these methods are usually incompatible with popular components such as batch normalization (BN), since individual gradients are not independent inside BN; using them forfeits the benefits of parallelization.

3.3. ROBUST GRADIENT ESTIMATION FOR ONE DIMENSIONAL CORRUPTED SUPERVISION

In this section, we show that the robustness bound above can be improved if the corruption comes only from the supervision. Moreover, by fully exploiting the gradient structure under corrupted supervision, our algorithm becomes much more efficient while remaining compatible with batch normalization. We use the one-dimensional supervision setting (binary classification or single-target regression) to illustrate the intuition and extend it to more general settings in the next section. Consider a high-dimensional supervised learning problem with X ∈ R^{n×p} and y ∈ R^n. The goal is to learn a function f parameterized by θ ∈ R^d minimizing the loss min_θ Σ_{i=1}^n φ_i = min_θ Σ_{i=1}^n L(y_i, f(x_i, θ)). The gradient for a data point i is ∇_θφ_i = (∂ℓ_i/∂f_i)(∂f_i/∂θ) = α_i g_i. One key observation is that when only the supervision is corrupted, the corruption enters only through the term α_i = ∂ℓ_i/∂f_i, which is a scalar in the one-dimensional setting. In other words, given the clean gradient g_i ∈ R^d of the i-th point, corrupted supervision can only rescale the gradient vector, changing it from α_i g_i to δ_i g_i, where δ_i = ∂ℓ̃_i/∂f_i. If α_i and δ_i were known, we could easily eliminate the impact of corrupted supervision; but they are not, since we observe only the possibly corrupted target ŷ_i rather than the ground truth y_i. On the other hand, the fact that corrupted supervision rescales the clean gradient can be used to reshape the robust optimization problem. Recall that in every iteration we update the model by θ⁺ = θ − γµ(G), where µ denotes the empirical mean function and G = [∇_θφ_1^T, …, ∇_θφ_m^T] ∈ R^{m×d} is the gradient matrix for a minibatch of size m.
We then have the following: Problem 1 (Robust Gradient Estimation for Corrupted Supervision, One-Dimensional Case) Given a clean gradient matrix G ∈ R^{m×d} and an ε-corrupted matrix G̃ in which at most an ε-fraction of rows are corrupted from α_i g_i to δ_i g_i, design an algorithm A : R^{m×d} → R^d that minimizes ‖µ(G) − A(G̃)‖. Note that when |δ_i| is large, the corrupted gradient has a large effect on the empirical mean, and vice versa. This motivates an algorithm that filters out data points by the loss-layer gradient ∂ℓ_i/∂f_i: if the norm of a data point's loss-layer gradient is large (in the one-dimensional case this gradient reduces to a scalar and the norm is its absolute value), we exclude the data point when computing the empirical mean of gradients for that iteration. This algorithm is applicable to both regression and classification problems. In particular, when using the mean squared error (MSE) loss for regression, sorting by the loss-layer gradient norm is equivalent to sorting by the loss itself, and the algorithm reduces to self-paced learning (Kumar et al., 2010). We summarize the procedure in Alg. 3 and extend it to the more general multi-dimensional case in the next section.
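One filtering step of this rule can be sketched for linear regression with squared loss (a sketch under our own setup; the function name and data are ours). For ℓ_i = ½(f_i − y_i)², the loss-layer gradient is the residual f_i − y_i, so ranking by its absolute value matches ranking by loss, as noted above.

```python
import numpy as np

def prl_l_step(X, y, theta, eps, lr):
    """One step of loss-layer filtering for linear least squares:
    drop the eps-fraction of points with the largest |f_i - y_i|,
    then take a gradient step on the empirical mean of the rest."""
    r = X @ theta - y                        # loss-layer gradients (scalars)
    m = len(y)
    keep = np.argsort(np.abs(r))[: m - int(np.ceil(eps * m))]
    grad = X[keep].T @ r[keep] / len(keep)   # mean gradient over kept points
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_star = np.arange(1.0, 6.0)
y = X @ theta_star
y[:40] += 100.0                              # 20% corrupted supervision
theta = np.zeros(5)
for _ in range(300):
    theta = prl_l_step(X, y, theta, 0.2, 0.1)
```

On this toy problem the corrupted points carry residuals near 100, are dropped every iteration, and `theta` converges to `theta_star`, whereas plain SGD on all points would be pulled away by the corrupted targets.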

3.4. EXTENSION TO MULTI-DIMENSIONAL CORRUPTED SUPERVISION

To extend our algorithm and analysis to the multi-dimensional case, let q be the supervision dimension. The gradient for data point i is ∇_θφ_i = (∂ℓ_i/∂f_i)(∂f_i/∂θ), where ∂ℓ_i/∂f_i ∈ R^q is the gradient of the loss with respect to the model outputs and ∂f_i/∂θ ∈ R^{q×d} is the gradient of the model outputs with respect to the model parameters. As before, when the supervision is corrupted, the corruption enters through the term ∂ℓ_i/∂f_i, which is now a vector. Let δ_i = ∂ℓ̃_i/∂f_i ∈ R^q, α_i = ∂ℓ_i/∂f_i ∈ R^q, W_i = ∂f_i/∂θ ∈ R^{q×d}, and let m be the minibatch size. Denote the clean gradient matrix by G ∈ R^{m×d}, whose i-th row is g_i = α_i W_i. The multi-dimensional robust gradient estimation problem is then: Problem 2 (Robust Gradient Estimation for Corrupted Supervision, Multi-Dimensional Case) Given a clean gradient matrix G and an ε-corrupted matrix G̃ in which at most an ε-fraction of rows are corrupted from α_i W_i to δ_i W_i, design an algorithm A : R^{m×d} → R^d that minimizes ‖µ(G) − A(G̃)‖. We start the analysis by investigating the effect of a filtering-based algorithm, i.e., using the empirical mean gradient of a (1 − ε)-fraction subset to estimate the empirical mean of the clean gradient matrix. We have the following for a randomized filtering-based algorithm (proof in Appendix): Lemma 1 (Gradient Estimation Error for Dropping an ε-fraction of the Data) Let G̃ ∈ R^{m×d} be a corrupted matrix generated as in Problem 2, and G ∈ R^{m×d} the original clean gradient matrix. Suppose an arbitrary (1 − ε)-fraction of rows is selected from G̃ to form the matrix N ∈ R^{n×d}, and let µ be the empirical mean function. Assume the clean gradients before the loss layer have bounded operator norm, ‖W_i‖_op ≤ C, let k = max_{i∈G} ‖α_i‖ be the maximum clean loss-layer gradient, and let v = max_{i∈N} ‖δ_i‖ be the maximum corrupted loss-layer gradient in N. Then ‖µ(G) − µ(N)‖ ≤ Ck(3ε − 4ε²)/(1 − ε) + Cvε/(1 − ε). We see that v is the only term related to the corrupted supervision.
If v is large, the bound is not safe, since the right-hand side can be arbitrarily large (i.e., an adversary can change the labels so that v is extremely large). Controlling the magnitude of v thus effectively reduces the bound: for example, if we manage to ensure v ≤ k, the bound is safe. This can be achieved by sorting the gradient norms at the loss layer and discarding the largest ε-fraction of data points. We thus have the following result. Theorem 3 (Robust Gradient Estimation for Supervision Corruption) Let G̃ be a corrupted matrix generated as in Problem 2, q the label dimension, and µ the empirical mean of the clean matrix G. Assume the clean gradients before the loss layer have bounded operator norm, ‖W_i‖_op ≤ C. Then the gradient estimate μ̂ of Algo. 3 satisfies ‖µ − μ̂‖ = O(ε√q) ≈ O(ε). Comparing Thm. 2 and Thm. 3, we see that when the corruption comes only from the supervision, the dependence on d is reduced to a dependence on q, and in most deep learning cases q ≪ d. Applying Thm. 1 then shows that our algorithm is also robust in multi-label settings.
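For intuition, here is a minimal sketch of the per-sample filtering score in the multi-dimensional case (function names and toy data are ours). With cross-entropy, the loss-layer gradient is softmax(o_i) − y_i, so the score can be computed from the logits and labels alone, with no per-parameter gradients.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def loss_layer_grad_norms(logits, Y):
    """Per-sample l2 norm of the cross-entropy loss-layer gradient
    softmax(o_i) - y_i."""
    return np.linalg.norm(softmax(logits) - Y, axis=1)

def keep_indices(logits, Y, eps):
    """Indices of the (1 - eps)-fraction of the batch with smallest scores."""
    m = len(Y)
    scores = loss_layer_grad_norms(logits, Y)
    return np.argsort(scores)[: m - int(np.ceil(eps * m))]

# A confidently predicted sample whose label was flipped gets a large score.
logits = np.array([[5.0, 0.0, 0.0]] * 4)
Y = np.eye(3)[[0, 0, 0, 2]]        # last label flipped from class 0 to 2
kept = keep_indices(logits, Y, 0.25)
```

Only the flipped sample (index 3) is dropped; the three consistent samples are kept.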

3.5. COMPARISON WITH DIAKONIKOLAS ET AL. (2019) AND OTHER METHODS

SEVER (Diakonikolas et al., 2019) established promising state-of-the-art theoretical results for general corruption, achieving a dimension-free O(√ε) guarantee. Compared to Diakonikolas et al. (2019), we make two contributions: (a) by assuming the corruption comes from the labels (admittedly a much stronger assumption than the general corruption setting), we obtain a better error rate; (b) our algorithm scales to deep neural networks while Diakonikolas et al. (2019) does not. We consider this a meaningful contribution given that DNN-based models are the current state of the art for noisy-label learning, at least in empirical performance. Although Diakonikolas et al. (2019) achieves very strong theoretical results, it unfortunately cannot be applied to DNNs with the current best hardware configuration. It builds on dimension-free robust mean estimation breakthroughs, and most such estimators filter data by computing each point's projection score onto the top singular vector. In particular, Diakonikolas et al. (2019) requires an SVD of the n × d individual-gradient matrix, where n is the sample size and d is the number of parameters. This works well for small datasets and small models, when both n and d fit within memory limits, but for deep neural networks the matrix is far beyond current GPU memory capacity. That may be why Diakonikolas et al. (2019) reports only ridge regression and SVM results on small data (we are not suggesting they should have provided DNN results). In our experiments, n is 60,000 and d is in the millions (the number of network parameters); it is impractical to store 60,000 copies of a neural network on a single GPU card. In contrast, our algorithm does not need to store the full gradient matrix.
By considering only the loss-layer gradient norm, we can easily extend our algorithm to DNNs, and we show that this simple strategy works well both in theory and on challenging empirical tasks. We note that some linear (Bhatia et al., 2015; 2017) and convex (Prasad et al., 2018) methods achieve better robustness guarantees; however, most of them cannot be directly applied to deep neural networks. In particular, filtering by loss values (as in SPL) is unreliable for identifying corrupted samples, since neural networks can achieve zero training loss even on corrupted data (Zhang et al., 2016). By sorting the error φ(x_i) for every data point, SPL in effect sorts a lower bound of the gradient norm when the PL condition holds. However, the ranking by gradient norm and the ranking by loss can be very different, since there is no guarantee that the gradient norm increases monotonically with the loss value. We give a geometric illustration of why SPL is not robust in the appendix; here we show that even for the simple squared loss, the monotonic relationship easily breaks. One easy counterexample is φ(x₁, x₂) = 0.5x₁² + 50x₂²: for the two points (1000, 1) and (495, −49.5), the first has the larger loss but the smaller gradient norm, so the monotonic relationship fails. Nocedal et al. (2002) showed that the monotonic relationship holds for the squared loss φ(x) = ½(x − x*)ᵀQ(x − x*) only if the condition number of Q is smaller than 3 + 2√2, which is a strong assumption, especially when x is high-dimensional. For more general loss functions (e.g., neural networks), the required conditions on the curvature can only be stronger, again breaking the monotonic relationship. Thus, although SPL sorts a lower bound of the gradient norm under mild assumptions, our algorithm differs significantly from SPL and its variants. We now discuss the relationship between SPL and Algorithm 3 under supervision corruption.
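The counterexample is easy to verify numerically:

```python
import numpy as np

# phi(x1, x2) = 0.5*x1^2 + 50*x2^2, with gradient (x1, 100*x2).
def phi(x):
    return 0.5 * x[0] ** 2 + 50 * x[1] ** 2

def grad_norm(x):
    return np.linalg.norm([x[0], 100 * x[1]])

a = np.array([1000.0, 1.0])
b = np.array([495.0, -49.5])
# a has the larger loss (500050 vs 245025) yet the smaller gradient norm
# (~1005 vs ~4975), so sorting by loss (SPL) and sorting by gradient
# norm (our criterion) disagree on which point to drop.
```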
SPL has the same form as Algorithm 3 when using mean squared error for regression, since the loss-layer gradient norm ranking coincides with the loss ranking. In classification, however, Algorithm 3 differs from SPL. To understand the algorithm better, we analyze the difference between SPL and our algorithm for the cross-entropy loss. Denoting the output logits by o, cross-entropy is H(y_i, f_i) = −⟨y_i, log softmax(o_i)⟩ = −⟨y_i, log f_i⟩, and its gradient with respect to o_i is ∂H_i/∂o_i = softmax(o_i) − y_i = f_i − y_i. Thus the loss-layer gradient norm is the ℓ₂ distance between f_i and y_i, an MSE-type quantity. Next, we investigate when MSE and cross-entropy give a non-monotonic relationship. For simplicity, we study only a sufficient condition for the non-monotonic relationship, given in Lemma 2. Lemma 2 Let y ∈ R^q with y_k = 1 and y_i = 0 for i ≠ k, and let α, β be two q-dimensional vectors in the probability simplex. Without loss of generality, suppose α has the smaller cross-entropy loss, i.e., α_k ≥ β_k. Then a sufficient condition for ‖α − y‖ ≥ ‖β − y‖ is Var_{i≠k}({α_i}) − Var_{i≠k}({β_i}) ≥ q/(q − 1)² · (α_k − β_k)(2 − α_k − β_k). Since α_k ≥ β_k, the right-hand side is non-negative. In other words, when MSE ranks a point differently from cross-entropy, the discarded point's non-true-class probabilities have larger variance. For example, suppose the ground-truth vector is y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0] and we have two predictions α = [0.08, 0.28, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08, 0.08] and β = [0.1, 0.3, 0.34, 0.05, 0.05, 0.1, 0.03, 0.03, 0, 0]. The prediction α has the smaller MSE loss while β has the smaller cross-entropy loss. Intuitively, β is more likely to come from noisy data, since its prediction has two peaks (0.3 and 0.34); however, cross-entropy considers only one dimension and cannot detect this situation.
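The example above can be checked directly:

```python
import numpy as np

y = np.zeros(10)
y[1] = 1.0                                   # true class k = 1
alpha = np.array([0.08, 0.28, 0.08, 0.08, 0.08,
                  0.08, 0.08, 0.08, 0.08, 0.08])
beta = np.array([0.1, 0.3, 0.34, 0.05, 0.05,
                 0.1, 0.03, 0.03, 0.0, 0.0])

def mse(p):
    """Squared loss-layer gradient norm ||p - y||^2 (the Algo. 3 score)."""
    return np.sum((p - y) ** 2)

def cross_entropy(p):
    """Cross-entropy looks only at the true-class probability."""
    return -np.log(p[1])

# beta has the smaller cross-entropy but the larger MSE score:
# cross-entropy ignores the two-peaked, suspicious shape of beta.
```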
Compared with cross-entropy, the loss-layer gradient norm (an MSE-type quantity) considers all dimensions of the prediction, and thus accounts for the overall prediction distribution.

5. COMBINING WITH CO-TEACHING STYLE TRAINING

Motivated by co-teaching (Han et al., 2018), one of the current state-of-the-art deep methods for learning under noisy labels, we propose Co-PRL(L), which has the same framework as co-teaching but uses the loss-layer gradient to select data. The full algorithm is given in Algorithm 4 in the appendix, where the meaning of every hyperparameter is the same as in the original Han et al. (2018). Compared with Algorithm 3, besides sampling data according to the loss-layer gradient norm, Co-PRL(L) has two additional components: first, we gradually increase the fraction of data that is dropped; second, the two networks exchange their selected data to update each other's parameters.
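The two components can be sketched as follows (a hedged sketch of ours, not the full Algorithm 4: the function names are placeholders, and the keep-rate schedule R(T) = 1 − min(T/T_k · τ, τ) follows the co-teaching convention used in the appendix).

```python
import numpy as np

def keep_rate(epoch, tau, t_k):
    """Co-teaching style schedule: R(T) = 1 - min(T/T_k * tau, tau),
    so the dropped fraction grows from 0 to tau over the first t_k epochs."""
    return 1.0 - min(epoch / t_k * tau, tau)

def co_prl_l_select(scores_f, scores_g, epoch, tau, t_k):
    """Each network ranks the batch by its own loss-layer gradient norms;
    the kept subsets are then *exchanged* for the parameter updates."""
    m = len(scores_f)
    keep = int(keep_rate(epoch, tau, t_k) * m)
    n_f = np.argsort(scores_f)[:keep]    # data chosen by network f ...
    n_g = np.argsort(scores_g)[:keep]    # ... and by network g
    return n_g, n_f                      # f trains on g's picks, and vice versa

scores_f = np.array([0.1, 2.0, 0.2, 0.3])
scores_g = np.array([0.2, 1.5, 0.1, 0.4])
batch_for_f, batch_for_g = co_prl_l_select(scores_f, scores_g,
                                           epoch=10, tau=0.25, t_k=10)
```

At epoch 10 with τ = 0.25, each network keeps 3 of the 4 samples; the high-score sample (index 1, a likely corrupted point) is excluded by both.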

6. EXPERIMENT

In this section, we perform experiments on benchmark regression and classification datasets. The code is available in the supplementary materials of the submission. We compare PRL(G) (Algo. 2), PRL(L) (Algo. 3), and Co-PRL(L) with baselines including SPL and co-teaching.

6.1. REGRESSION EXPERIMENT

Given a human face image, the goal is to predict the coordinates of 10 face landmarks in the image. We add different types of noise to the landmark coordinates. We preprocess the CelebA data as follows: we train a three-layer CNN on the 162,770 training images to predict the clean coordinates (using the 19,867 validation images for early stopping), then use the trained network to extract 512-dimensional features on the test set. The final data for our experiments thus have features X ∈ R^{19962×512} and targets Y ∈ R^{19962×10}. We further split these data into training and test sets, with the training set containing 80% of the data, and then manually add linadv, signflip, uninoise, pairflip, and mixture types of supervision noise to the training targets. The corruption rate for every type of corruption is varied from 0.1 to 0.4. We use a 3-layer fully connected network in the experiments. The r-squared values averaged over the last 10 epochs are reported in Table 1.

6.2. CLASSIFICATION EXPERIMENT

We perform experiments on CIFAR10 and CIFAR100 to illustrate the effectiveness of our algorithm in the classification setting, using the same 9-layer convolutional neural network as Han et al. (2018). Since most baselines include batch normalization, which makes it difficult to compute individual gradients efficiently, we drop the ignormclip and PRL(G) baselines. In the appendix, we report results when both co-teaching and Co-PRL(L) drop the batch normalization module: co-teaching cannot maintain robustness while our method remains robust; the reason is discussed in the appendix. We consider pairflip and symmetric supervision corruption in the experiments. To compare with the current state-of-the-art methods, for symmetric noise we also use corruption rates beyond 0.5. Although our theoretical analysis assumes the noise rate is smaller than 0.5, we show empirically that when the noise is not adversarial (i.e., symmetric), our method can also handle such rates. Results on CIFAR10 and CIFAR100 are given in Table 2. Whether using one network (PRL(L) vs. SPL) or two networks (Co-PRL(L) vs. co-teaching), our method performs significantly better. Since in real-world problems the ground-truth corruption rate is unknown, we also perform a sensitivity analysis on the classification tasks to show the effect of overestimating and underestimating ε. The results are in Table 3; more discussion of the sensitivity analysis can be found in the appendix.

7. CONCLUSION

In this paper, we proposed an efficient algorithm to defend against agnostic supervision corruption. Both theoretical and empirical analyses demonstrated the effectiveness of the algorithm. Two open questions deserve future study: whether the O(ε) error bound can be further improved, or shown to be tight; and whether further properties of neural networks, such as gradient sparsity, can be exploited to obtain better algorithms.

APPENDIX

A.1 THE CO-PRL(L) ALGORITHM

Algorithm 4 (Co-PRL(L), sketch): in each iteration, each network samples the R(T)% of instances with the smallest loss-layer gradient norms, by score_f and score_g, to obtain N_f and N_g; the two networks exchange the selected data and update w_f = w_f − η∇_{w_f}L(N_f, w_f) and w_g = w_g − η∇_{w_g}L(N_g, w_g), and the model is updated with the robust mean gradient x_{t+1} = x_t − γ_t μ̂; finally the keep rate is updated as R(T) = 1 − min(T/T_k · τ, τ).

A.2 ILLUSTRATION OF LOSS FILTERING VS. GRADIENT FILTERING

Figure 1: (a) When the gradient filtering method fails to pick out the corrupted data, the remaining corrupted data are relatively smooth and thus have limited impact on the overall loss surface. (b) When the loss filtering method fails to pick out the corrupted data, the remaining corrupted data can be extremely sharp and thus have a large impact on the overall loss surface.

In this section, we further illustrate the difference between SPL and PRL(G); consider Figures 1a and 1b. Since we are in the agnostic label-corruption setting, it is difficult to filter out exactly the corrupted data. The figures show the two situations in which loss filtering and gradient filtering fail. When loss filtering fails, the remaining corrupted data can have a large impact on the overall loss surface, whereas when gradient filtering fails, the remaining corrupted data have only limited impact on the loss surface, which yields robustness.

A.3 NETWORKS AND HYPERPARAMETERS

The hyperparameters are in Table 4 . For Classification, we use the same hyperparameters in Han et al. (2018) . For CelebA, we use 3-layer fully connected network with 256 hidden nodes in hidden layer and leakly-relu as activation function. We also attached our code in supplementary materials. The curve for CelebA data is showed in Figure 2 .

A.5 CLASSIFICATION CURVE

The classification curves are in Figure 3.

A.6 SENSITIVITY ANALYSIS

Since in real-world problems the ground-truth corruption rate is hard to know, we perform a sensitivity analysis in the classification tasks to show the effect of $\epsilon$. The results are in Table 5. The performance is stable when we overestimate the corruption rate: only when we overestimate $\epsilon$ can we guarantee that the gradient norm of the remaining set is small. When we underestimate the corruption rate, in the worst case there is no guarantee that the gradient norm of the remaining set is small. Because the empirical mean is used, even one large bad individual gradient can ruin the gradient estimate, and by the convergence analysis of biased gradient descent, the final solution can then be very poor on clean data. This explains why underestimating the corruption rate gives bad results. Table 5 also shows that using the ground-truth corruption rate leads to small variance.
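The effect of over- versus under-estimating the corruption rate can be reproduced in a toy setting. The sketch below is illustrative only: the batch size, dimension, and corruption magnitude are arbitrary choices, not values from our experiments. It trims a corrupted gradient batch by norm with different estimates $\hat{\epsilon}$ and measures the error of the resulting mean.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eps = 1000, 10, 0.2                    # batch size, dimension, true corruption rate

clean = rng.normal(size=(m, d))              # per-sample clean gradients
corrupted = clean.copy()
corrupted[: int(eps * m)] = 100.0            # eps-fraction replaced by huge gradients

def trimmed_mean(g, eps_hat):
    """Average after dropping the eps_hat fraction of rows with largest l2 norm."""
    keep = int((1 - eps_hat) * len(g))
    idx = np.argsort(np.linalg.norm(g, axis=1))[:keep]
    return g[idx].mean(axis=0)

target = clean.mean(axis=0)
errs = {e: np.linalg.norm(trimmed_mean(corrupted, e) - target)
        for e in (0.1, 0.2, 0.3)}            # under-estimate, exact, over-estimate
for e, v in errs.items():
    print(f"eps_hat = {e:.2f}: estimation error = {v:.3f}")
```

Underestimating ($\hat{\epsilon} = 0.1$) leaves some of the huge gradients in the batch, and their contribution dominates the mean; overestimating ($\hat{\epsilon} = 0.3$) only drops extra clean points, which leaves the mean nearly unchanged, matching the discussion above.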

A.7 EMPIRICAL RESULTS ON RUNNING TIME

As we claimed in the paper, Algorithm 2 is not efficient. Here we attach the execution time of one epoch for three methods: Standard, PRL(G), and PRL(L). For a fair comparison, we replace all batch normalization modules with group normalization, since it is hard to compute individual gradients when batch normalization is used. The results are shown in Table 6.

A.8 PROOF OF CONVERGENCE OF BIASED SGD

We give the proof of the theorem on how a biased gradient affects the convergence of SGD. We first introduce several assumptions and a definition.

Assumption 2 (L-smoothness) The function $\phi: \mathbb{R}^d \to \mathbb{R}$ is differentiable and there exists a constant $L > 0$ such that for all $\theta_1, \theta_2 \in \mathbb{R}^d$, we have $\phi(\theta_2) \le \phi(\theta_1) + \langle \nabla\phi(\theta_1), \theta_2 - \theta_1 \rangle + \frac{L}{2}\|\theta_2 - \theta_1\|^2$.

Definition 1 (Biased gradient oracle) A map $g: \mathbb{R}^d \times \mathcal{D} \to \mathbb{R}^d$ such that $g(\theta, \xi) = \nabla\phi(\theta) + b(\theta, \xi) + n(\theta, \xi)$ for a bias $b: \mathbb{R}^d \times \mathcal{D} \to \mathbb{R}^d$ and zero-mean noise $n: \mathbb{R}^d \times \mathcal{D} \to \mathbb{R}^d$, that is, $\mathbb{E}_\xi\, n(\theta, \xi) = 0$.

Compared to the standard stochastic gradient oracle, the above definition introduces the bias term $b$. In noisy-label settings, $b$ is generated by the data with corrupted labels.

Assumption 3 ($\sigma$-bounded noise) There exists a constant $\sigma > 0$ such that $\mathbb{E}_\xi \|n(\theta, \xi)\|^2 \le \sigma^2$ for all $\theta \in \mathbb{R}^d$.

Assumption 4 ($\zeta$-bounded bias) There exists a constant $\zeta > 0$ such that for any $\xi$, $\|b(\theta, \xi)\|^2 \le \zeta^2$ for all $\theta \in \mathbb{R}^d$.

For simplicity, assume a constant learning rate $\gamma$; in every iteration, biased SGD performs the update $\theta_{t+1} \leftarrow \theta_t - \gamma\, g(\theta_t, \xi)$. The following theorem shows the convergence of the gradient norm under biased SGD.

Theorem 4 (Convergence of biased SGD, formal) Under Assumptions 2, 3, and 4, define $F = \phi(\theta_0) - \phi^*$ and step size $\gamma = \min\{\frac{1}{L}, \sqrt{\frac{F}{L\sigma^2 T}}\}$, and denote the desired accuracy by $k$. Then $T = \mathcal{O}\left(\frac{1}{k} + \frac{\sigma^2}{k^2}\right)$ iterations are sufficient to obtain $\min_{t \in [T]} \mathbb{E}\|\nabla\phi(\theta_t)\|^2 = \mathcal{O}(k + \zeta^2)$.

Remark 2 Let $k = \zeta^2$; then $T = \mathcal{O}\left(\frac{1}{\zeta^2} + \frac{\sigma^2}{\zeta^4}\right)$ iterations are sufficient to get $\min_{t \in [T]} \mathbb{E}\|\nabla\phi(\theta_t)\|^2 = \mathcal{O}(\zeta^2)$, and performing more iterations does not improve the accuracy in terms of convergence.
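The plateau behavior in Remark 2 can be visualized with a small simulation. The sketch below uses a hypothetical setup of our own choosing: a simple quadratic $\phi(\theta) = \frac{1}{2}\|\theta\|^2$ and a fixed bias vector of norm $\zeta$. Biased SGD drives the true gradient norm down to roughly the bias level $\zeta$ and no further.

```python
import numpy as np

rng = np.random.default_rng(1)
d, zeta, sigma, gamma = 20, 0.5, 0.1, 0.1   # dimension, bias norm, noise level, step size

bias = np.full(d, zeta / np.sqrt(d))        # fixed bias vector with ||b|| = zeta
theta = np.full(d, 5.0)                     # phi(theta) = 0.5*||theta||^2, so grad = theta

norms = []
for _ in range(2000):
    g = theta + bias + sigma * rng.normal(size=d)  # biased stochastic gradient oracle
    theta -= gamma * g
    norms.append(np.linalg.norm(theta))            # true gradient norm ||grad phi(theta)||

plateau = float(np.mean(norms[-100:]))
print(f"gradient norm plateaus near the bias level zeta = {zeta}: {plateau:.3f}")
```

The iterate converges to a neighborhood of $-b$ rather than the optimum, so the measured gradient norm stalls near $\zeta = 0.5$; running more iterations does not reduce it, as Remark 2 states.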
Since this is a standard result, with similar statements appearing in Bernstein et al. (2018); Devolder et al. (2014); Hu et al. (2020); Ajalloeian & Stich (2020), we provide the proof here for completeness.

Proof: By L-smoothness, we have $\phi(\theta_2) \le \phi(\theta_1) + \langle \nabla\phi(\theta_1), \theta_2 - \theta_1 \rangle + \frac{L}{2}\|\theta_2 - \theta_1\|^2$. Using $\gamma \le \frac{1}{L}$, we have
$$\begin{aligned}
\mathbb{E}\,\phi(\theta_{t+1}) &\le \phi(\theta_t) - \gamma\langle\nabla\phi(\theta_t), \mathbb{E}g_t\rangle + \frac{\gamma^2 L}{2}\left(\mathbb{E}\|g_t - \mathbb{E}g_t\|^2 + \|\mathbb{E}g_t\|^2\right) \\
&= \phi(\theta_t) - \gamma\langle\nabla\phi(\theta_t), \nabla\phi(\theta_t) + b_t\rangle + \frac{\gamma^2 L}{2}\left(\mathbb{E}\|n_t\|^2 + \|\nabla\phi(\theta_t) + b_t\|^2\right) \\
&\le \phi(\theta_t) + \frac{\gamma}{2}\left(-2\langle\nabla\phi(\theta_t), \nabla\phi(\theta_t) + b_t\rangle + \|\nabla\phi(\theta_t) + b_t\|^2\right) + \frac{\gamma^2 L}{2}\mathbb{E}\|n_t\|^2 \\
&= \phi(\theta_t) + \frac{\gamma}{2}\left(-\|\nabla\phi(\theta_t)\|^2 + \|b_t\|^2\right) + \frac{\gamma^2 L}{2}\mathbb{E}\|n_t\|^2.
\end{aligned}$$
Since $\|b_t\|^2 \le \zeta^2$ and $\mathbb{E}\|n_t\|^2 \le \sigma^2$, plugging in the learning-rate constraint gives
$$\mathbb{E}\,\phi(\theta_{t+1}) \le \phi(\theta_t) - \frac{\gamma}{2}\|\nabla\phi(\theta_t)\|^2 + \frac{\gamma}{2}\zeta^2 + \frac{\gamma^2 L}{2}\sigma^2.$$
Moving the gradient norm to the left-hand side and summing over iterations, we get
$$\frac{1}{2T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla\phi(\theta_t)\|^2 \le \frac{F}{T\gamma} + \frac{\zeta^2}{2} + \frac{\gamma L \sigma^2}{2}.$$
Taking the minimum over $t$ and substituting the learning-rate condition directly yields the result.

A.9 PROOF OF THEOREM 2

Denote by $\tilde{G}$ the corrupted minibatch and by $G$ the original clean minibatch, with $|G| = |\tilde{G}| = m$. Let $N$ be the set of remaining data; according to our algorithm, $|N| = n = (1-\epsilon)m$. Define $A$ as the set of individual clean gradients not discarded by Algorithm 1, and $B$ as the set of individual corrupted gradients not discarded. By definition, $N = A \cup B$. Let $AD$ be the set of individual good gradients that are discarded, and $AR$ the set of individual good gradients that are replaced by corrupted data, so $G = A \cup AD \cup AR$. $BD$ is the set of individual corrupted gradients discarded by our algorithm. Denote the good gradients by $g_i = \alpha_i W_i$ and the bad gradients by $\tilde{g}_i$; according to our assumption, $\|\tilde{g}_i\| \le L$.
Now we bound the $\ell_2$ norm error:
$$\begin{aligned}
\|\mu(G) - \mu(N)\| &= \Big\|\frac{1}{m}\sum_{i\in G} g_i - \frac{1}{n}\sum_{i\in A} g_i - \frac{1}{n}\sum_{i\in B} \tilde{g}_i\Big\| \\
&= \Big\|\frac{1}{n}\sum_{i=1}^m \frac{n}{m}\, g_i - \frac{1}{n}\sum_{i\in A} g_i - \frac{1}{n}\sum_{i\in B} \tilde{g}_i\Big\| \\
&= \Big\|\frac{1}{n}\sum_{i\in A}\frac{n-m}{m}\, g_i + \frac{1}{n}\sum_{i\in AD}\frac{n}{m}\, g_i + \frac{1}{n}\sum_{i\in AR}\frac{n}{m}\, g_i - \frac{1}{n}\sum_{i\in B}\tilde{g}_i\Big\| \\
&\le \frac{1}{n}\sum_{i\in A}\frac{m-n}{m}\|g_i\| + \frac{1}{m}\sum_{i\in AD}\|g_i\| + \frac{1}{m}\sum_{i\in AR}\|g_i\| + \frac{1}{n}\sum_{i\in B}\|\tilde{g}_i\| \\
&\le |A|\,\frac{m-n}{nm}L + |AD|\,\frac{1}{m}L + |AR|\,\frac{1}{m}L + |B|\,\frac{1}{n}L.
\end{aligned}$$
With $|A| = x$, $|AD| = n - x$, $|AR| = m - n$, $|B| = n - x$, and $n = (1-\epsilon)m$, this becomes
$$\|\mu(G) - \mu(N)\| \le x\,\frac{m-n}{nm}L + (n-x)\,\frac{1}{m}L + (m-n)\,\frac{1}{m}L + (n-x)\,\frac{1}{n}L = xL\,\frac{2\epsilon - 2}{n} + 2L = 2L - \frac{2x}{m}L.$$
Since $2\epsilon - 2 < 0$, the upper bound is largest when $x$ is as small as possible. In our setting, $x \ge n - \epsilon m = (1 - 2\epsilon)m$; substituting $x = (1-2\epsilon)m$, we have
$$\|\mu(G) - \mu(N)\| \le 2L - 2(1-2\epsilon)L = 4\epsilon L.$$
Since $\epsilon < 0.5$, we conclude $\|\mu(G) - \mu(N)\| = \mathcal{O}(\epsilon L)$. Note that if the Lipschitz-continuity assumption does not hold, $L$ will be dimension-dependent.

A.10 PROOF FOR THE RANDOMIZED FILTERING ALGORITHM

Lemma 3 (Gradient estimation error for randomized filtering) Let $\tilde{G} \in \mathbb{R}^{m\times d}$ be a corrupted matrix generated as in Problem 2 and $G \in \mathbb{R}^{m\times d}$ the original clean gradient matrix. Suppose we arbitrarily select $n = (1-\epsilon)m$ rows from $\tilde{G}$ to obtain the remaining set $N \in \mathbb{R}^{n\times d}$. Let $\mu$ be the empirical mean function. Assume the clean gradient before the loss layer has bounded operator norm $\|W\|_{op} \le C$, the maximum clean gradient in the loss layer is $\max_i \|\alpha_i\| = k$, and the maximum corrupted gradient in the loss layer is $\max_i \|\delta_i\| = v$. Assuming $\epsilon < 0.5$, we have
$$\|\mu(G) - \mu(N)\| \le Ck\,\frac{3\epsilon - 4\epsilon^2}{1-\epsilon} + Cv\,\frac{\epsilon}{1-\epsilon}.$$

A.10.1 PROOF OF LEMMA 3

Denote by $\tilde{G}$ the corrupted minibatch and by $G$ the original clean minibatch, with $|G| = |\tilde{G}| = m$. Let $N$ be the set of remaining data; according to our algorithm, $|N| = n = (1-\epsilon)m$.
Define $A$ as the set of individual clean gradients not discarded by Algorithm 3, and $B$ as the set of individual corrupted gradients not discarded. By definition, $N = A \cup B$. Let $AD$ be the set of individual good gradients that are discarded, and $AR$ the set of individual good gradients that are replaced by corrupted data, so $G = A \cup AD \cup AR$. $BD$ is the set of individual corrupted gradients discarded by our algorithm. Denote the good gradients by $g_i = \alpha_i W_i$ and the bad gradients by $\tilde{g}_i = \delta_i W_i$; according to our assumption, $\|W_i\|_{op} \le C$.
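The $\mathcal{O}(\epsilon L)$ guarantee of norm filtering can be sanity-checked numerically. The sketch below is illustrative (the batch size, dimension, bound $L$, and corruption magnitude are our own choices): it builds a clean gradient batch with $\|g_i\| \le L$, corrupts an $\epsilon$-fraction with large arbitrary vectors, filters by gradient norm, and compares the estimation error of the filtered mean against both the $4\epsilon L$ bound and the unfiltered empirical mean.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, eps, L = 500, 10, 0.2, 5.0             # batch size, dim, corruption rate, norm bound

clean = rng.normal(size=(m, d))
norms = np.linalg.norm(clean, axis=1, keepdims=True)
clean /= np.maximum(norms / L, 1.0)          # enforce ||g_i|| <= L on clean gradients

corrupted = clean.copy()
n_bad = int(eps * m)
corrupted[:n_bad] = rng.normal(size=(n_bad, d)) * 50  # arbitrary large corruptions

keep = int((1 - eps) * m)
idx = np.argsort(np.linalg.norm(corrupted, axis=1))[:keep]  # norm filtering

err_filtered = np.linalg.norm(corrupted[idx].mean(axis=0) - clean.mean(axis=0))
err_naive = np.linalg.norm(corrupted.mean(axis=0) - clean.mean(axis=0))
print(f"filtered error {err_filtered:.3f} (bound 4*eps*L = {4 * eps * L:.1f}), "
      f"naive error {err_naive:.3f}")
```

The filtered estimate stays well inside the $4\epsilon L$ bound, while the naive empirical mean is dragged away by the corrupted rows.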



RELATIONSHIP TO SELF-PACED LEARNING (SPL) SPL looks very similar to our method at first glance: instead of keeping the data points with small gradient norm, SPL keeps the data points with small loss. The gradient norm and the loss value can be tied together by the well-known Polyak-Łojasiewicz (PL) condition. The PL condition assumes there exists some constant $s > 0$ such that $\frac{1}{2}\|\nabla\phi(x)\|^2 \ge s\,(\phi(x) - \phi^*)$ for all $x$. When the neural network is highly over-parameterized, $\phi^*$ can be assumed to be equal across different data points.
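For a concrete instance of the PL condition, consider least squares: for $\phi(\theta) = \frac{1}{2}\|A\theta - b\|^2$ with $b$ in the range of a full-column-rank $A$, the condition holds with $s = \sigma_{\min}(A)^2$. The following quick numerical check is an illustrative sketch of ours, not part of the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5
A = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
b = A @ theta_star                               # realizable target, so phi* = 0

s = np.linalg.svd(A, compute_uv=False)[-1] ** 2  # PL constant: sigma_min(A)^2

def phi(theta):
    return 0.5 * np.sum((A @ theta - b) ** 2)

def grad(theta):
    return A.T @ (A @ theta - b)

# Check 0.5*||grad phi(theta)||^2 >= s * (phi(theta) - phi*) at random points
ok = all(0.5 * np.sum(grad(t) ** 2) >= s * phi(t) - 1e-9
         for t in rng.normal(size=(100, d)))
print("PL condition holds at all sampled points:", ok)
```

The inequality follows because the residual $A\theta - b$ lies in the column space of $A$, where $\|A^\top r\|^2 \ge \sigma_{\min}(A)^2 \|r\|^2$; under PL, small loss and small gradient norm single out the same points up to the constant $s$.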



Algorithm 3: (PRL(L)) Efficient Provable Robust Learning for Corrupted Supervision
input: dataset $D_x$, $D_{\tilde{y}}$ with corrupted supervision, learning rate $\gamma_t$
return: model parameter $\theta$
for t = 1 to maxiter do
    Randomly sample a minibatch $M$ from $D_x$, $D_{\tilde{y}}$
    Compute the predicted labels $\hat{Y}$ from $M$
    Calculate the loss-layer gradient norm (i.e., $\|\hat{y} - y\|$ for mean square error or cross entropy) for each data point in $M$
    Remove the top $\tau$-fraction of data from $M$ according to $\|\hat{y} - y\|$
    Return the empirical mean of the remaining $M$ as the robust mean estimate $\hat{\mu}$
    Update the model: $\theta_{t+1} = \theta_t - \gamma_t \hat{\mu}$
end
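A minimal sketch of a PRL(L)-style update for linear regression with MSE follows. It is illustrative only: the toy data, signflip corruption, learning rate, and $\tau$ are our own choices, not the paper's experimental setup.

```python
import numpy as np

def prl_l_step(theta, X, y, tau, lr):
    """One PRL(L)-style update for linear regression with MSE (a sketch).

    Filters the tau-fraction of points with largest loss-layer gradient
    norm |y_hat - y|, then averages the surviving per-sample gradients."""
    y_hat = X @ theta
    score = np.abs(y_hat - y)                  # loss-layer gradient norm for MSE
    keep = np.argsort(score)[: int((1 - tau) * len(y))]
    # per-sample gradient of 0.5*(y_hat - y)^2 w.r.t. theta is (y_hat - y) * x
    g = ((y_hat[keep] - y[keep])[:, None] * X[keep]).mean(axis=0)
    return theta - lr * g

# Toy run: 20% of the targets have their sign flipped (a crude corruption)
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true
y[:100] = -y[:100]                             # signflip corruption

theta = np.zeros(5)
for _ in range(300):
    theta = prl_l_step(theta, X, y, tau=0.25, lr=0.5)
print("parameter error:", np.linalg.norm(theta - theta_true))
```

As the iterate approaches the clean solution, the corrupted points acquire large loss-layer gradient norms and are filtered out, so the update increasingly behaves like clean least squares.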

Algorithm 4: Co-PRL(L)
input: initial $w_f$ and $w_g$, learning rate $\eta$, fixed $\tau$, epochs $T_k$ and $T_{max}$, iterations $N_{max}$
return: model parameters $w_f$ and $w_g$
for T = 1, 2, ..., $T_{max}$ do
    for N = 1, ..., $N_{max}$ do
        Randomly sample a minibatch $M$ from $D_x$, $D_{\tilde{y}}$ (noisy dataset)
        Get the predicted labels $\hat{Y}_f$ and $\hat{Y}_g$ from $M$ using $w_f$ and $w_g$
        Calculate the individual losses $l_f = L(Y, \hat{Y}_f)$, $l_g = L(Y, \hat{Y}_g)$
        Calculate the loss-layer gradient norms $score_f = \|\frac{\partial l_f}{\partial \hat{y}_f}\|$, $score_g = \|\frac{\partial l_g}{\partial \hat{y}_g}\|$
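The cross-update at the heart of Co-PRL(L) can be sketched for linear models as follows. This is a simplified illustration, not the code used in our experiments; `loss_layer_scores` and `co_prl_step` are hypothetical helpers of ours for the MSE case.

```python
import numpy as np

def loss_layer_scores(theta, X, y):
    """Loss-layer gradient norm |y_hat - y| per sample (linear regression, MSE)."""
    return np.abs(X @ theta - y)

def co_prl_step(theta_f, theta_g, X, y, keep_frac, lr):
    """One Co-PRL(L)-style cross-update: each model selects its small-score
    samples and the *other* model is updated on that selection."""
    n_keep = int(keep_frac * len(y))
    sel_f = np.argsort(loss_layer_scores(theta_f, X, y))[:n_keep]
    sel_g = np.argsort(loss_layer_scores(theta_g, X, y))[:n_keep]

    def update(theta, idx):
        r = X[idx] @ theta - y[idx]
        return theta - lr * (r[:, None] * X[idx]).mean(axis=0)

    return update(theta_f, sel_g), update(theta_g, sel_f)  # exchange selections

# Toy run with signflip corruption on 20% of the targets
rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true
y[:80] = -y[:80]

theta_f, theta_g = 0.1 * rng.normal(size=5), 0.1 * rng.normal(size=5)
for _ in range(300):
    theta_f, theta_g = co_prl_step(theta_f, theta_g, X, y, keep_frac=0.75, lr=0.5)
print("error:", np.linalg.norm(theta_f - theta_true))
```

Exchanging selections between two differently initialized models is the same device as in Co-teaching, except that the selection score is the loss-layer gradient norm rather than the loss value.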

Figure 1: Geometric illustration of the difference between loss filtering and gradient norm filtering.

A.2 FURTHER ILLUSTRATION OF THE DIFFERENCE BETWEEN SPL AND PRL(G)

Figure 2: CelebA testing curves during training. The X axis is the epoch number and the Y axis is the testing R-squared. In some experiments there is no curve for Standard and NormClip, since they gave negative R-squared values, which would affect the plotting scale. The shaded region is the confidence interval, computed across 5 random seeds. As shown, PRL(G), PRL(L), and Co-PRL(L) are robust against different types of corruptions.

A.4 REGRESSION R-SQUARED CURVES ON TESTING DATA

Figure 3: CIFAR10 and CIFAR100 testing curves during training. The X axis is the epoch number and the Y axis is the testing accuracy. The shaded region is the confidence interval, computed across 3 random seeds. As shown, PRL(L) and Co-PRL(L) are robust against different types of corruptions.

By using the filtering algorithm, we can guarantee that $\|\tilde{g}_i\| \le L$. Let $|A| = x$; then $|B| = n - x = (1-\epsilon)m - x$, $|AR| = m - n = \epsilon m$, and $|AD| = m - |A| - |AR| = m - x - (m - n) = n - x = (1-\epsilon)m - x$. Thus, we have:

We compare our methods, PRL(L) (Algo. 3) and Co-PRL(L) (Algo. 4), to the following baselines. Standard: standard training without filtering data (MSE for regression, cross entropy for classification); Normclip: standard training with norm clipping; Huber: standard training with Huber loss (regression only); Decouple: decoupled networks, updating two networks using their disagreement (Malach & Shalev-Shwartz, 2017) (classification only); Bootstrap: uses a weighted combination of predicted and original labels as the corrected labels and then performs backpropagation (Reed et al., 2014) (classification only); Min-sgd: choosing the smallest-loss sample in the minibatch to update the model (Shah et al., 2020); SPL: self-paced learning, dropping the data with the largest losses (the same as PRL(L) in the regression setting with MSE loss); Ignormclip: clipping individual gradients, then averaging them to update the model (regression only); Co-teaching: collaboratively training a pair of SPL models that exchange selected data with each other (Han et al., 2018) (classification only). It is hard to design experiments for agnostic corrupted supervision, and we tried our best to include different types of supervision noise. The supervision corruption settings are as follows: linadv: the corrupted supervision is generated by a random, wrong linear relationship with the features (regression); signflip: the sign of the supervision is flipped (regression); uninoise: corrupted supervision sampled from a uniform distribution (regression); mixture: a mixture of the above corruption types (regression); pairflip: shuffling the coordinates (e.g., eyes to mouth in CelebA, or cat to dog in CIFAR) (regression and classification); symmetric: randomly assigning a wrong class label (classification). For classification we use classification accuracy as the evaluation metric, and R-squared is used to evaluate the regression experiments. Due to space limits, we only show the average evaluation score on the testing data over the last 10 epochs.
The whole training curves are attached in the appendix. All experiments are repeated 5 times for regression and 3 times for classification. The main hyperparameters are shown in the appendix.
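The symmetric and pairflip classification corruptions described above can be generated as follows. This is a sketch; the function and variable names are ours, not from the released code.

```python
import numpy as np

def corrupt_labels(y, num_classes, rate, kind, rng):
    """Corrupt a `rate` fraction of integer labels (sketch).

    symmetric: each corrupted label is reassigned to a uniformly random
    *other* class; pairflip: each corrupted label k becomes (k + 1) % C."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    if kind == "symmetric":
        shift = rng.integers(1, num_classes, size=len(idx))  # never 0 => always wrong
        y[idx] = (y[idx] + shift) % num_classes
    elif kind == "pairflip":
        y[idx] = (y[idx] + 1) % num_classes
    return y

rng = np.random.default_rng(5)
y = rng.integers(0, 10, size=10000)
for kind in ("symmetric", "pairflip"):
    y_bad = corrupt_labels(y, 10, 0.45, kind, rng)
    print(kind, "observed noise rate:", (y_bad != y).mean())
```

Because every corrupted label is guaranteed to differ from the original, the observed noise rate matches the requested `rate` exactly.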

CelebA results. The numbers are R-squared on clean testing data; the standard deviation is computed over the last ten epochs and 5 random seeds.

Classification results on CIFAR10 and CIFAR100 for symmetric and pairflip label corruption. The numbers are classification accuracy on clean testing data; the standard deviation is computed over the last ten epochs and 3 random seeds.

Main Hyperparameters



Execution Time of a Single Epoch on CIFAR-10. For PRL(G), we use the opacus library (https://opacus.ai/) to compute individual gradients.


Table 5 column settings: $\epsilon - 0.1$, $\epsilon - 0.05$, $\epsilon + 0.05$, $\epsilon + 0.1$.


Now, we have the $\ell_2$ norm error: ... Thus, we have: ... For the individual gradients, according to the label-corruption gradient definition in Problem 2 and assuming $\|W\|_{op} \le C$, we have ... Note that the above upper bound holds for any $x$; thus, we would like to take the minimum of the upper bound with respect to $x$. Rearranging the terms, since the coefficient of $x$ is negative, $x$ should be as small as possible to continue the bound. According to our algorithm, we know $n$ ... According to Algorithm 3, we can guarantee that $v \le k$; then, we have: ...

A.12 COMPARISON BETWEEN SORTING THE LOSS-LAYER GRADIENT NORM AND SORTING THE LOSS VALUE

Assume we have a $d$-class label $y \in \mathbb{R}^d$, where $y_k = 1$ and $y_i = 0$ for $i \neq k$. With a slight abuse of notation, suppose we have two predictions $p \in \mathbb{R}^d$ and $q \in \mathbb{R}^d$. Without loss of generality, we can assume that $p$ has the smaller cross-entropy loss, which implies $p_k \ge q_k$. For MSE, assume we have the opposite result:
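To see concretely that the two orderings can disagree for cross entropy, consider two predicted distributions for the same one-hot label (the numbers below are our own illustrative choice): one spreads its residual mass over many wrong classes, the other concentrates it on a single wrong class.

```python
import numpy as np

y = np.zeros(11)
y[0] = 1.0                                   # one-hot label, true class k = 0

p = np.array([0.5] + [0.05] * 10)            # residual mass spread over 10 wrong classes
q = np.array([0.55, 0.45] + [0.0] * 9)       # residual mass on a single wrong class

def ce(pred):                                # cross-entropy loss for class 0
    return -np.log(pred[0])

def gnorm(pred):                             # loss-layer (softmax-logit) gradient norm
    return np.linalg.norm(pred - y)

print(f"loss:          p = {ce(p):.3f},  q = {ce(q):.3f}   (q is smaller)")
print(f"gradient norm: p = {gnorm(p):.3f},  q = {gnorm(q):.3f}   (q is larger)")
```

Here sorting by loss would keep $q$ while sorting by loss-layer gradient norm would keep $p$: the two criteria genuinely differ for cross entropy, whereas for MSE the loss is a monotone function of $\|\hat{y} - y\|$ and the two orderings coincide.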

