HETEROSKEDASTIC AND IMBALANCED DEEP LEARNING WITH ADAPTIVE REGULARIZATION

Abstract

Real-world large-scale datasets are heteroskedastic and imbalanced: labels have varying levels of uncertainty, and label distributions are long-tailed. Heteroskedasticity and imbalance challenge deep learning algorithms because it is difficult to distinguish among mislabeled, ambiguous, and rare examples. Addressing heteroskedasticity and imbalance simultaneously is under-explored. We propose a data-dependent regularization technique for heteroskedastic datasets that regularizes different regions of the input space differently. Inspired by the theoretical derivation of the optimal regularization strength in a one-dimensional nonparametric classification setting, our approach adaptively regularizes the data points in higher-uncertainty, lower-density regions more heavily. We test our method on several benchmark tasks, including a real-world heteroskedastic and imbalanced dataset, WebVision. Our experiments corroborate our theory and demonstrate a significant improvement over other methods in noise-robust deep learning.¹

1. INTRODUCTION

In real-world machine learning applications, even well-curated training datasets have various types of heterogeneity. Two main types of heterogeneity are: (1) data imbalance: the input or label distribution often has a long-tailed density, and (2) heteroskedasticity: the labels given inputs have varying levels of uncertainty across subsets of data, stemming from sources such as the intrinsic ambiguity of the data or annotation errors. Many deep learning algorithms have been proposed for imbalanced datasets (see, e.g., Wang et al., 2017; Cao et al., 2019; Cui et al., 2019; Liu et al., 2019, and the references therein). However, heteroskedasticity, a classical notion studied extensively in the statistics community (Pintore et al., 2006; Wang et al., 2013; Tibshirani et al., 2014), has so far been under-explored in deep learning. This paper focuses on addressing heteroskedasticity and its interaction with data imbalance in deep learning.

Heteroskedasticity is often studied in regression analysis and refers to the property that the distribution of the error varies across inputs. In this work, we mostly focus on classification, though the developed technique also applies to regression. Here, heteroskedasticity reflects how the uncertainty in the conditional distribution $y \mid x$, or the entropy of $y \mid x$, varies as a function of $x$. Real-world datasets are often heteroskedastic. For example, Li et al. (2017) show that the WebVision dataset has a varying number of ambiguous or truly noisy examples across classes.² Conversely, we consider a dataset to be homoskedastic if every example is mislabeled with a fixed probability, as assumed by many prior theoretical and empirical works on label corruption (Ghosh et al., 2017; Han et al., 2018; Jiang et al., 2018; Mirzasoleiman et al., 2020).
We note that varying uncertainty in $y \mid x$ can come from at least two sources: the intrinsic semantic ambiguity of the input, and the (data-dependent) mislabeling introduced by the annotation process. Our approach can handle both types of noisy examples in a unified way, but for the sake of comparison with past methods, we call them "ambiguous examples" and "mislabeled examples" respectively, and refer to both of them as "noisy examples".

Figure 2: Real-world datasets have various sources of heterogeneity, and it can be hard to distinguish one from another. They require mutually exclusive reweighting strategies, but they all benefit from stronger regularization.

Overparameterized deep learning models tend to overfit more to the noisy examples (Arpit et al., 2017; Frénay & Verleysen, 2013; Zhang et al., 2016). To address this issue, a common approach is to detect noisy examples by selecting those with large training losses, and then remove them from the (re-)training process. However, an input's training loss can also be large because the input is rare or ambiguous (Hacohen & Weinshall, 2019; Wang et al., 2019), as shown in Figure 1. Noise-cleaning methods can therefore fail to distinguish mislabeled examples from rare or ambiguous ones (see Section 3.1 for empirical evidence). Though dropping the former is desirable, dropping the latter loses important information. Another popular approach is reweighting, which reduces the contribution of noisy examples to the optimization. However, failing to distinguish between mislabeled and rare or ambiguous examples makes the choice of weights tricky: mislabeled examples require small weights, whereas rare or ambiguous examples benefit from larger weights (Cao et al., 2019; Shu et al., 2019).

We propose a regularization method that deals with noisy and rare examples in a unified way. We observe that mislabeled, ambiguous, and rare examples all benefit from stronger regularization (Hu et al., 2020; Cao et al., 2019).
We apply a Lipschitz regularizer (Wei & Ma, 2019a; b) with a regularization strength that varies with the particular data point. Through theoretical analysis in the one-dimensional setting, we derive the optimal regularization strength for each training example. The optimal strength is larger for rarer and noisier examples. Our proposed algorithm, heteroskedastic adaptive regularization (HAR), first estimates the noise level and density of each example, and then optimizes a Lipschitz-regularized objective whose input-dependent regularization strength is given by the theoretical formula.

In summary, our main contributions are: (i) we propose to learn from heteroskedastic and imbalanced datasets under a unified framework, and theoretically study the optimal regularization strength on one-dimensional data; (ii) we propose an algorithm, heteroskedastic adaptive regularization (HAR), which applies stronger regularization to data points with high uncertainty and low density; (iii) we experimentally show that HAR achieves significant improvements over other noise-robust deep learning methods on simulated vision and language datasets with controllable degrees of data noise and data imbalance, as well as on a real-world heteroskedastic and imbalanced dataset, WebVision.

2.1. BACKGROUNDS

We first introduce general nonparametric tools that we use in our analysis, and review the dependency of optimal regularization strength on the sample size and noise level.

Figure 3: A one-dimensional example with a three-layer neural network in a heteroskedastic and imbalanced regression setting (panels: Ground Truth, Weak Unif-reg, Strong Unif-reg, Adapt-reg (HAR)). The blue curve is the underlying ground truth and the dots are observations with heteroskedastic noise. This example shows that uniformly weak regularization overfits the noisy and rare data (on the right half), whereas uniformly strong regularization underfits the frequent and oscillating data (on the left half). Adaptive regularization does not underfit the oscillating data but still denoises the noisy data. We note that standard nonparametric methods such as cubic splines do not work here because they also use uniform regularization.

Over-parameterized neural networks as nonparametric methods. We use nonparametric methods as a surrogate for neural networks because the two have been shown to be closely related. Recent work (Savarese et al., 2019) shows that the minimum-norm two-layer ReLU network that fits the training data is in fact a linear spline interpolation. Parhi & Nowak (2019) extend this result to a broader family of neural networks with a broader family of activations.

Given a training dataset $\{(x_i, y_i)\}_{i=1}^n$, a nonparametric method with a penalty works as follows. Let $\mathcal{F}: \mathbb{R} \to \mathbb{R}$ be a twice-differentiable model family. We aim to fit the data with a smoothness penalty:

$$\min_{f} \; \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \int (f'(x))^2 \, dx \qquad (1)$$

Lipschitz regularization for neural networks. Lipschitz regularization has been shown to be effective for deep neural networks as well. Wei & Ma (2019a) prove a generalization bound for neural networks that depends on the Lipschitzness of each layer with respect to all intermediate layers on the training data, and show that, empirically, regularizing the Lipschitzness improves generalization. Sokolić et al. (2017) show similar results in data-limited settings. In Section 2.3, we extend the Lipschitz regularization technique to the heteroskedastic setting.

Regularization strength as a function of noise level and sample size. Finally, we briefly review existing theoretical insights on the optimal choice of regularization strength.
Generally, the optimal regularization strength for a given model family increases with the label noise level and decreases with the sample size. As a simple example, consider linear ridge regression $\min_\theta \frac{1}{n} \sum_{i=1}^n (x_i^\top \theta - y_i)^2 + \lambda \|\theta\|_2^2$, where $x_i, \theta \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. We assume $y_i = x_i^\top \theta^* + \xi$ for some ground-truth parameter $\theta^*$, with $\xi \sim \mathcal{N}(0, \sigma^2)$. Then the optimal regularization strength is $\lambda_{\mathrm{opt}} = d\sigma^2 / (n \|\theta^*\|_2^2)$. Results of a similar nature can also be found in nonparametric statistics (Wang et al., 2013; Tibshirani et al., 2014).
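The dependence of $\lambda_{\mathrm{opt}}$ on noise and sample size in the ridge example can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `optimal_ridge_lambda` and the problem sizes are hypothetical, and the code simply evaluates the closed-form expression quoted above.

```python
def optimal_ridge_lambda(dim, noise_std, n_samples, theta_star_sq_norm):
    """Return lambda_opt = d * sigma^2 / (n * ||theta*||_2^2)."""
    if n_samples <= 0 or theta_star_sq_norm <= 0:
        raise ValueError("n_samples and theta_star_sq_norm must be positive")
    return dim * noise_std ** 2 / (n_samples * theta_star_sq_norm)


# Hypothetical problem sizes, chosen only to illustrate the trend.
base = optimal_ridge_lambda(dim=10, noise_std=1.0, n_samples=1000, theta_star_sq_norm=4.0)
noisier = optimal_ridge_lambda(dim=10, noise_std=2.0, n_samples=1000, theta_star_sq_norm=4.0)
more_data = optimal_ridge_lambda(dim=10, noise_std=1.0, n_samples=4000, theta_star_sq_norm=4.0)

print(base, noisier, more_data)
```

Doubling the noise standard deviation multiplies the optimal strength by four, while quadrupling the sample size divides it by four.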

2.2. HETEROSKEDASTIC NONPARAMETRIC CLASSIFICATION ON ONE-DIMENSIONAL DATA

We consider a one-dimensional binary classification problem where $\mathcal{X} = [0, 1] \subset \mathbb{R}$ and $\mathcal{Y} = \{-1, 1\}$. We assume $Y$ given $X$ follows a logistic model with ground-truth function $f^*$, i.e.,

$$\Pr[Y = y \mid X = x] = \frac{1}{1 + \exp(-y f^*(x))}. \qquad (2)$$

The training objective is the cross-entropy loss plus Lipschitz regularization, i.e.,

$$\hat{f} = \operatorname*{argmin}_f \; L(f) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \int_0^1 \rho(x) (f'(x))^2 \, dx, \qquad (3)$$

where $\ell(a, y) = \log(1 + \exp(-ya))$, and $\rho(x)$ is a smoothing parameter that is a function of the noise level and density of $x$. Let $I(x)$ be the Fisher information conditioned on the input, i.e., $I(x) \triangleq \mathbb{E}[\nabla_a^2 \ell(a, Y)|_{a = f^*(X)} \mid X = x]$. When $(X, Y)$ follows the logistic model in equation 2,

$$I(x) = \frac{1}{(1 + \exp(f^*(x)))(1 + \exp(-f^*(x)))} = \mathrm{Var}(Y \mid X = x).$$

Therefore, $I(x)$ captures the aleatoric uncertainty of $x$. For example, when $Y$ is deterministic conditioned on $X = x$, we have $I(x) = 0$, indicating perfect certainty.

Define the test metric as the mean-squared error on the test set $\{(x_i, y_i)\}_{i=1}^n$, i.e.,³

$$\mathrm{MSE}(\hat{f}) \triangleq \mathbb{E}_{\{(x_i, y_i)\}_{i=1}^n} \int_0^1 (\hat{f}(t) - f^*(t))^2 \, dt. \qquad (4)$$

Our main goal is to derive the optimal choice of $\rho(x)$ that minimizes the MSE. We start with an analytical characterization of the test error. Let $W_2^2 = \{f : f' \text{ is absolutely continuous and } f'' \in L^2[0, 1]\}$. We denote the density of $X$ by $q(x)$. The following theorem analytically computes the MSE under the regularization strength $\rho(\cdot)$, building upon Wang et al. (2013) for regression problems. The proof of the theorem is deferred to Appendix A.

Theorem 1. Assume that $f^*, q, I \in W_2^2$. Let $r(t) = -1/(q(t) I(t))$ and $L_0 = \int_{-\infty}^{\infty} \frac{1}{4} \exp(-2|t|) \, dt$. If we choose $\lambda = C_0 n^{-2/5}$ for some constant $C_0 > 0$, the asymptotic mean squared error is

$$\lim_{n \to \infty} \mathrm{MSE}(\hat{f}) = C_n \int_0^1 \left\{ \lambda^2 r^2(t) \left[ \frac{d}{dt} \big( \rho(t) (f^*)'(t) \big) \right]^2 + L_0 \, r(t)^{1/2} \rho(t)^{-1/2} \right\} dt$$

in probability, where $C_n$ is a scalar that only depends on $n$.

Using the analytical formula of the test error above, we want to derive an approximately optimal choice of $\rho(x)$.
A precise computation is infeasible, so we restrict ourselves to $\rho(x)$ that is constant within groups of examples. We introduce additional structure: we assume the data can be divided into $k$ groups $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k)$. Each group $[a_j, a_{j+1})$ consists of an interval of data with approximately the same aleatoric uncertainty. We approximate $\rho(t)$ as constant on each group $[a_j, a_{j+1})$ with value $\rho_j$. Plugging this piecewise-constant $\rho$ into the asymptotic MSE in Theorem 1, we obtain

$$\lim_{n \to \infty} \mathrm{MSE}(\hat{f}) = \sum_j \left[ \rho_j^2 \int_{a_j}^{a_{j+1}} r^2(t) \left( \frac{d^2}{dt^2} f^*(t) \right)^2 dt + \rho_j^{-1/2} L_0 \int_{a_j}^{a_{j+1}} r^{1/2}(t) \, dt \right].$$

Minimizing the above formula over each $\rho_j$ separately, we derive the optimal weights,

$$\rho_j = \left[ \frac{L_0 \int_{a_j}^{a_{j+1}} r(t)^{1/2} \, dt}{4 \int_{a_j}^{a_{j+1}} r^2(t) \left( \frac{d^2}{dt^2} f^*(t) \right)^2 dt} \right]^{2/5}.$$

In practice, we do not know $f^*$ and $q(x)$, so we make the following simplifications. We assume that $q(t)$ and $I(t)$ are constant on each interval $[a_j, a_{j+1}]$. In other words, we assume that $q(t) = q_j$ and $I(t) = I_j$ for all $t \in [a_j, a_{j+1}]$. We further assume that $\frac{d^2}{dt^2} f^*(t)$ is close to a constant on the entire space, because estimating the curvature in high dimension is difficult.
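The closed-form minimizer above follows from setting the derivative of $A\rho^2 + B\rho^{-1/2}$ to zero, which gives $\rho^{5/2} = B/(4A)$. The sketch below checks this numerically against a brute-force grid search. Here `a` stands for the integral $\int r^2 (f^{*\prime\prime})^2 dt$ and `b` for $L_0 \int r^{1/2} dt$ on one group; the numeric values and function names are hypothetical and serve only as a sanity check.

```python
def group_objective(rho, a, b):
    """Per-group contribution a * rho^2 + b * rho^(-1/2), defined for rho > 0."""
    return a * rho ** 2 + b * rho ** -0.5


def optimal_rho(a, b):
    """Closed-form minimizer (b / (4 a))^(2/5) of the per-group objective."""
    return (b / (4.0 * a)) ** 0.4


def grid_minimizer(a, b, lo=1e-3, hi=10.0, steps=100000):
    """Brute-force minimizer on a uniform grid, used only as a cross-check."""
    best_rho = lo
    best_val = group_objective(lo, a, b)
    for k in range(1, steps + 1):
        rho = lo + (hi - lo) * k / steps
        val = group_objective(rho, a, b)
        if val < best_val:
            best_rho, best_val = rho, val
    return best_rho


closed_form = optimal_rho(2.0, 3.0)      # hypothetical integral values
brute_force = grid_minimizer(2.0, 3.0)
print(closed_form, brute_force)
```

The two values agree up to the grid resolution, since the per-group objective is convex on $\rho > 0$.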

Under these simplifying assumptions, we obtain

$$\rho_j \propto \left[ \frac{q_j^{-1/2} I_j^{-1/2}}{q_j^{-2} I_j^{-2}} \right]^{2/5} = q_j^{3/5} I_j^{3/5}.$$

We find that the simplification works well in practice.

Adaptive regularization with importance sampling. It is practically infeasible to implement the integration in equation 3 for high-dimensional data. We use importance sampling to approximate the integral:

$$\operatorname*{minimize}_f \; L(f) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \sum_{i=1}^n \tau_i f'(x_i)^2 \qquad (5)$$

Suppose $x_i \in [a_j, a_{j+1})$. Then $\tau_i$ should satisfy $\tau_i q_j = \rho_j$, so that the expectation of the regularization term in equation 5 equals that in equation 3. Hence, $\tau_i = I_j^{3/5} q_j^{-2/5} = I(x_i)^{3/5} q(x_i)^{-2/5}$.

Adaptive regularization for multi-class classification and regression. In fact, Theorem 1 is proved for a general loss $\ell(a, y)$. Therefore, we can directly generalize it to multi-class classification and regression problems. For a regression problem, $\ell(a, y)$ is the square loss $\ell(a, y) = 0.5 (y - a)^2$, and the Fisher information is $I(x) = 1$. Therefore, for a regression problem, we can choose the regularization weight $\tau_i = q(x_i)^{-2/5}$.
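As a small illustration of the weight formula $\tau_i = I(x_i)^{3/5} q(x_i)^{-2/5}$, the sketch below evaluates the Fisher information of the logistic model and the resulting weight. The function names and the example logits and densities are hypothetical, and the logits are assumed to be of moderate magnitude so that `math.exp` does not overflow.

```python
import math


def fisher_information_logistic(logit):
    """I(x) = 1 / ((1 + e^f)(1 + e^-f)) = p * (1 - p), with p = sigmoid(f)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return p * (1.0 - p)


def regularization_weight(fisher_information, density):
    """tau = I^(3/5) * q^(-2/5): larger for uncertain and for rare inputs."""
    if fisher_information < 0 or density <= 0:
        raise ValueError("need fisher_information >= 0 and density > 0")
    return fisher_information ** 0.6 * density ** -0.4


# An ambiguous input (logit 0) in a rare region versus a confident input
# (logit 4) in a frequent region; both settings are made up for illustration.
tau_ambiguous_rare = regularization_weight(fisher_information_logistic(0.0), density=0.01)
tau_confident_common = regularization_weight(fisher_information_logistic(4.0), density=0.2)

# Regression with the square loss has I(x) = 1, so only the density matters.
tau_regression = regularization_weight(1.0, density=0.01)

print(tau_ambiguous_rare, tau_confident_common, tau_regression)
```

The ambiguous, rare input receives a much larger weight than the confident, frequent one, which matches the qualitative behavior the derivation calls for.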

2.3. PRACTICAL IMPLEMENTATION ON NEURAL NETWORKS WITH HIGH-DIMENSIONAL DATA

We heuristically extend the Lipschitz regularization technique discussed in Section 2.2 from nonparametric models to over-parameterized deep neural networks. Let $(x, y)$ be an example and $f_\theta$ be an $r$-layer neural network. We denote by $h^{(j)}$ the $j$-th hidden layer of the network, and by $J^{(j)}(x) \triangleq \frac{\partial}{\partial h^{(j)}} \mathcal{L}(f(x), y)$ the Jacobian of the loss w.r.t. $h^{(j)}$. We replace the regularization term $f'(x)^2$ in equation 5 by

$$R(x) = \left( \sum_{j=1}^r \| J^{(j)}(x) \|_F^2 \right)^{1/2},$$

which was proposed by Wei & Ma (2019a). As a proof of concept, we visualize the behavior of our algorithm in Figure 3, where we observe that the error on rare and noisy examples improves significantly thanks to stronger regularization. In contrast, uniform regularization either overfits or underfits different subsets.

Note that the differences from the 1-D case include the following three aspects.
1. The derivative is taken w.r.t. all the hidden layers for deep models, which has been shown to have superior generalization guarantees for neural networks by Wei & Ma (2019a; b).
2. An additional square root is taken in computing $R(x)$. This modified version may have milder curvature and be easier to tune.
3. We take the derivative of the loss instead of the derivative of the model, which outputs $k$ numbers for multi-class classification. This is because the derivative of the model takes $k$ times longer to compute.

The regularized training objective is consequently

$$\operatorname*{minimize}_f \; L(f) \triangleq \frac{1}{n} \sum_{i=1}^n \big( \ell(f(x_i), y_i) + \lambda \tau_i R(x_i) \big), \qquad (6)$$

where $\tau_i$ is chosen to be $\tau_i = I(x_i)^{3/5} / q(x_i)^{2/5}$ following equation 5 in Section 2.2, and $\lambda$ is a hyperparameter that controls the overall scale of the regularization strength. We note that we do not expect this choice of $\tau_i$ to be optimal for the high-dimensional case with all the modifications above; the optimal choice does depend on the nuances.
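To make the regularizer $R(x)$ and the per-example term of equation 6 concrete, here is a minimal sketch on a toy scalar network with two hidden activations and hand-written gradients. Everything about the network (its shape, the weights, the helper names) is a hypothetical stand-in chosen so the example runs with the standard library only; a real model would obtain the same gradients through automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(a, y):
    """Logistic loss log(1 + exp(-y * a)) for a label y in {-1, +1}."""
    return math.log1p(math.exp(-y * a))


def forward(x, weights):
    """Toy scalar network: h1 = tanh(w1 * x), h2 = tanh(w2 * h1), logit = w3 * h2."""
    w1, w2, w3 = weights
    h1 = math.tanh(w1 * x)
    h2 = math.tanh(w2 * h1)
    return h1, h2, w3 * h2


def loss_from_h1(h1, y, weights):
    """Loss as a function of the first hidden activation (used for checking)."""
    _, w2, w3 = weights
    return logistic_loss(w3 * math.tanh(w2 * h1), y)


def loss_from_h2(h2, y, weights):
    """Loss as a function of the second hidden activation (used for checking)."""
    return logistic_loss(weights[2] * h2, y)


def jacobian_regularizer(x, y, weights):
    """R(x) = sqrt(sum_j |d loss / d h^(j)|^2) over the two hidden layers."""
    _, w2, w3 = weights
    _, h2, logit = forward(x, weights)
    dloss_dlogit = -y * sigmoid(-y * logit)       # derivative of the logistic loss
    grad_h2 = dloss_dlogit * w3                   # chain rule through logit = w3 * h2
    grad_h1 = grad_h2 * (1.0 - h2 ** 2) * w2      # chain rule through h2 = tanh(w2 * h1)
    return math.sqrt(grad_h1 ** 2 + grad_h2 ** 2)


def regularized_loss(x, y, weights, lam, tau):
    """Per-example term of the regularized objective: loss + lam * tau * R(x)."""
    _, _, logit = forward(x, weights)
    return logistic_loss(logit, y) + lam * tau * jacobian_regularizer(x, y, weights)


weights = (0.7, -1.2, 1.5)  # hypothetical parameters
print(jacobian_regularizer(0.9, 1, weights))
print(regularized_loss(0.9, 1, weights, lam=0.1, tau=2.0))
```

For scalar activations the Frobenius norm reduces to an absolute value; with vector-valued layers each `grad_h` would be a vector and the squares would become squared norms.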
However, we also observe that the empirical performance is not sensitive to the form of $\tau$ as long as it is increasing in $I(x)$ and decreasing in $q(x)$. That is, the more uncertain or rare an example is, the stronger the regularization that should be applied.

In order to estimate the relative regularization strength $\tau_i$, the key difficulty lies in estimating the uncertainty $I(x)$. As in the 1-D setting, we divide the examples into $k$ groups $G_1, \ldots, G_k$ (e.g., each group can correspond to a class), and estimate the uncertainty on each group. In the binary setting, $I(x) = \mathrm{Var}(Y \mid X = x) = \Pr[Y = 1 \mid X] \cdot \Pr[Y = 0 \mid X]$ can be approximated by

$$\tilde{I}(x) = 1 - \max_{k \in \{0, 1\}} \Pr[Y = k \mid X = x]$$

up to a factor of at most 2. We use the same formula in the multi-class setting as the approximation of the uncertainty. (As a sanity check, when $Y$ is concentrated on a single outcome, the uncertainty is 0.) Note that $\tilde{I}(x)$ is essentially the minimum possible error of any deterministic prediction on the data point $x$. Assuming that we have a sufficiently accurate pre-trained model, we can use its validation error to estimate $\tilde{I}(x)$. For all $x \in G_j$, we estimate $q(x)$ and $I(x)$ by

$$q(x) \propto |G_j|, \qquad I(x) \propto \text{average validation error of a pre-trained model } f_{\tilde{\theta}} \text{ on } G_j. \qquad (7)$$

The whole training pipeline is summarized in Algorithm 1.

Algorithm 1 Heteroskedastic Adaptive Regularization (HAR)
Require: Dataset $D = \{(x_i, y_i)\}_{i=1}^n$; a parameterized model $f_\theta$
1: Split the training set $D$ into $D_{\mathrm{train}}$ and $D_{\mathrm{val}}$
2: $f_{\tilde{\theta}} \leftarrow$ standard SGD training on $D_{\mathrm{train}}$
3: Estimate $I(x)$, $q(x)$ with equation 7 using $f_{\tilde{\theta}}$ on $D_{\mathrm{val}}$, and compute $\tau_i = I(x_i)^{3/5} / q(x_i)^{2/5}$
4: Initialize the model parameters $\theta$ randomly
5: $f_\theta \leftarrow$ SGD with the regularized objective in equation 6 on the full dataset $D$
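Step 3 of Algorithm 1 can be sketched as follows, under the assumption that each validation example comes with a group id and a flag saying whether the pre-trained model classified it correctly. The helper names (`uncertainty_proxy`, `estimate_group_weights`) and the toy data are hypothetical; group sizes are measured on the validation split, which is proportional to $|G_j|$ when the split is random.

```python
from collections import Counter


def uncertainty_proxy(p_positive):
    """I~(x) = 1 - max_k Pr[Y = k | X = x] for a binary label."""
    return 1.0 - max(p_positive, 1.0 - p_positive)


def estimate_group_weights(group_ids, is_correct):
    """Return {group: tau} with tau = I^(3/5) / q^(2/5).

    q is the group's share of the validation examples and I is the group's
    average validation error, following equation 7 up to constant factors.
    """
    sizes = Counter(group_ids)
    errors = Counter(g for g, ok in zip(group_ids, is_correct) if not ok)
    total = len(group_ids)
    weights = {}
    for group, size in sizes.items():
        density = size / total
        uncertainty = errors[group] / size
        weights[group] = uncertainty ** 0.6 / density ** 0.4
    return weights


# Toy validation set: a frequent, mostly clean group and a rare, noisy group.
group_ids = ["frequent"] * 8 + ["rare"] * 2
is_correct = [True] * 7 + [False] + [False, True]
weights = estimate_group_weights(group_ids, is_correct)
print(weights)
```

The proxy $\tilde{I}$ stays within a factor of 2 of $p(1-p)$ because $p(1-p) = \tilde{I}(1 - \tilde{I})$ and $1 - \tilde{I} \in [1/2, 1]$.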

3. EXPERIMENTS

We experimentally show that our proposed algorithm HAR (Algorithm 1) improves the test performance of the noisier and rarer groups of examples (by stronger regularization) without negatively affecting the training and test performance of the other groups. We evaluate our algorithm on three vision datasets and one NLP dataset: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), IMDB-review (Maas et al., 2011) (see Appendix C.1), and WebVision (Li et al., 2017), a real-world heteroskedastic and imbalanced dataset. Please refer to Appendix B for low-level implementation details.

Baselines. We compare our proposed HAR with the following baselines. The simplest one is
(1) Empirical risk minimization (ERM): the vanilla cross-entropy loss with all examples having the same loss weight.
We select two representatives from the noise-cleaning line of work.
(2) Co-teaching (Han et al., 2018): two deep networks are trained simultaneously. Each network aims to identify clean data points that have small losses and uses them to guide the training of the other network.
(3) INCV (Chen et al., 2019): it extends Co-teaching to an iterative version to estimate the noise ratio and select data.
We consider three representatives from the reweighting-based methods, including two that learn the weighting using meta-learning.
(4) MentorNet (Jiang et al., 2018): it pretrains a teacher network that outputs weights for examples, which are used to train the student network with reweighting.
(5) L2RW (Ren et al., 2018): it directly optimizes the weight of each example in the training set by minimizing the corresponding loss on a small meta validation set.
(6) MW-Net (Shu et al., 2019): it extends L2RW by explicitly defining a weighting function that depends only on the loss of the example.
We also compare against two representatives from the robust loss function line of work.
(7) GCE (Zhang & Sabuncu, 2018): it generalizes mean absolute error and cross-entropy loss to obtain a new loss function.
(8) DMI (Xu et al., 2019): it designs a new loss function based on generalized mutual information.
In addition, as an essential ablation study, we consider vanilla uniform regularization.
(9) Unif-reg: we apply the Jacobian regularizer on all examples with equal strength, and tune the strength to get the best possible validation accuracy.

3.1. SIMULATING HETEROSKEDASTIC AND IMBALANCED DATASETS ON CIFAR

Setting. Unlike previous works that test on uniform random or asymmetric noise, which is often not the case in reality, we test our method on more realistic noisy settings, as suggested by Patrini et al. (2017) and Zhang & Sabuncu (2018). In order to simulate heteroskedasticity, we only corrupt semantically similar classes. For CIFAR-10, we exchange 40% of the labels between classes 'cat' and 'dog', and between 'truck' and 'automobile'. CIFAR-100 has 100 classes grouped into 20 super classes. For each of the 5 classes under the super classes 'vehicles 1' and 'vehicles 2', we corrupt the labels with 40% probability, uniformly at random, to the other four classes under the same super class. As a result, the 10 classes under the super classes 'vehicles 1' and 'vehicles 2' have a high label noise level, and the corruption stays within the same super class.

Heteroskedasticity of the labels and imbalance of the inputs commonly coexist in real-world settings. HAR can take both of them into account. To understand the challenge imposed by the entanglement of heteroskedasticity and imbalance, and to compare HAR with the aforementioned baselines, we inject data imbalance concurrently with the heteroskedastic noise. We remove samples from the corrupted classes to simulate the most difficult scenario, in which the rare and noisy groups overfit significantly. (A more benign interaction between noise and imbalance is that the rare classes have a lower noise level; we defer it to Appendix C.3.) We use the imbalance ratio to denote the frequency ratio between the frequent (and clean) classes and the rare (and corrupted) classes. We consider imbalance ratios of 10 and 100.

Result. Table 1 summarizes the results. Since examples from rare classes tend to have larger training and validation losses regardless of whether their labels are correct, noise-cleaning methods might drop an excessive number of examples with correct labels.
We examined the noise ratio of the samples dropped by INCV with an imbalance ratio of 10. Among all dropped examples, only 19.2% are truly noisy examples. In addition, the rare-class examples that are selected still contain 29.8% label noise. This explains the significant decrease in the accuracy of Co-teaching and INCV on corrupted and rare classes. Reweighting-based methods tend to suffer from a loss of accuracy on the other, more frequent classes, which is aligned with the findings in Cao et al. (2019). While the aforementioned baselines struggle to deal with heteroskedasticity and imbalance together, HAR is able to put them under the same regularization framework and achieve significant improvements. Notably, HAR also improves over uniform regularization with optimally tuned strength. This clearly demonstrates the importance of regularizing examples adaptively for a better trade-off. A more detailed ablation study on the trade-off between training accuracy and validation accuracy can be found in Section 3.3.
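The CIFAR-10 corruption and subsampling described in the Setting paragraph of this section can be sketched on label lists alone. Several details below are assumptions rather than facts from the text: each affected label is flipped independently with probability 0.4, subsampling is applied to the clean labels before corruption, and the helper names are hypothetical. The class order is the standard CIFAR-10 one.

```python
import random

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def subsample_classes(labels, rare_classes, imbalance_ratio, rng):
    """Return sorted indices to keep: every example of the frequent classes
    and roughly 1 / imbalance_ratio of the examples of each rare class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    keep = []
    for label, indices in by_class.items():
        if label in rare_classes:
            n_keep = max(1, len(indices) // imbalance_ratio)
            indices = rng.sample(indices, n_keep)
        keep.extend(indices)
    return sorted(keep)


def swap_labels(labels, class_pairs, prob, rng):
    """Flip each label in a listed pair to its partner with probability prob."""
    partner = {}
    for a, b in class_pairs:
        partner[a] = b
        partner[b] = a
    return [partner[y] if y in partner and rng.random() < prob else y
            for y in labels]


rng = random.Random(0)
clean = [c for c in range(10) for _ in range(100)]  # toy label list, 100 per class
cat, dog = CIFAR10_CLASSES.index("cat"), CIFAR10_CLASSES.index("dog")
truck, auto = CIFAR10_CLASSES.index("truck"), CIFAR10_CLASSES.index("automobile")
rare = {cat, dog, truck, auto}

keep = subsample_classes(clean, rare, imbalance_ratio=10, rng=rng)
kept_clean = [clean[i] for i in keep]
noisy = swap_labels(kept_clean, [(cat, dog), (truck, auto)], prob=0.4, rng=rng)
print(len(keep), sum(c != n for c, n in zip(kept_clean, noisy)))
```

Only the four paired classes are ever relabeled, and only those classes are subsampled, so the rare groups are also the noisy ones, as in the most difficult scenario described above.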

3.2. ABLATION STUDY ON CIFAR

We disentangle the problem setting to show the effectiveness of our unified framework.

Simulating heteroskedastic noise on CIFAR. We study the uncertainty part of HAR by testing under a setting with only heteroskedastic noise. The type of noise injection is the same as in Section 3.1. We report the top-1 validation accuracy of various methods in Table 2. In line with our analysis in Section 4, we observe that neither noise-cleaning nor reweighting-based methods achieve accuracy on the noisy classes comparable to that obtained by applying strong regularization (λ = 0.1) in this heteroskedastic setting. We also observe that overly strong regularization impedes the model from fitting informative samples, and thus can decrease the accuracy on clean classes. Conversely, overly weak regularization leads to overfitting the noisy examples, so the accuracy on noisy classes does not reach its optimum. Interestingly, we find that even the well-studied CIFAR-100 dataset has intrinsic heteroskedasticity, and HAR can improve over uniform regularization to some extent. Please refer to Appendix C.2 for the results on CIFAR-100 and Appendix C.1 for the results on IMDB-review.

Simulating data imbalance on CIFAR. We study the density part of HAR by testing under a setting with only data imbalance. We follow the same setting as Cao et al. (2019) to create imbalanced CIFAR. Long-tailed imbalance follows an exponential decay in sample sizes across different classes. In the step imbalance setting, all rare classes have the same sample size, as do all frequent classes. Our approach achieves better results than LDAM-DRW and is comparable to recent state-of-the-art methods under the imbalanced setting.
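The two imbalance profiles can be written down as per-class sample counts. The sketch below is one plausible parameterization, not necessarily the exact one used by Cao et al. (2019): the exponential interpolation between the largest and smallest class, the default of making half the classes rare, and the function names are all assumptions.

```python
def long_tailed_counts(max_count, num_classes, imbalance_ratio):
    """Exponentially decaying class sizes, from max_count for the first class
    down to max_count / imbalance_ratio for the last class."""
    decay = 1.0 / imbalance_ratio
    return [int(max_count * decay ** (c / (num_classes - 1.0)))
            for c in range(num_classes)]


def step_counts(max_count, num_classes, imbalance_ratio, rare_fraction=0.5):
    """Two-level class sizes: frequent classes keep max_count examples and
    rare classes keep max_count / imbalance_ratio examples."""
    num_rare = int(num_classes * rare_fraction)
    rare_count = int(max_count / imbalance_ratio)
    return [max_count] * (num_classes - num_rare) + [rare_count] * num_rare


# CIFAR-10 has 5,000 training images per class.
long_tailed = long_tailed_counts(5000, 10, imbalance_ratio=100)
step = step_counts(5000, 10, imbalance_ratio=100)
print(long_tailed)
print(step)
```

In both profiles the imbalance ratio is the size of the largest class divided by the size of the smallest one.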

3.3. EVALUATION ON WEBVISION WITH REAL-WORLD HETEROGENEITY

WebVision (Li et al., 2017) contains 2.4 million images crawled from Google and Flickr using 1,000 labels shared with the ImageNet dataset. Its training set is both heteroskedastic and imbalanced (detailed statistics can be found in Li et al. (2017)), and it is a popular benchmark for noise-robust learning. As the full dataset is very large, we follow Jiang et al. (2018) and use a mini version, which contains the first 50 classes of the Google subset of the data. Following the standard protocol (Jiang et al., 2018), we test the trained model on the WebVision validation set and the ImageNet validation set. We use ResNet-50 for the ablation study and InceptionResNet-v2 for a fair comparison with the baselines. We report results comparing against other state-of-the-art approaches in Table 5. Strikingly, HAR achieves significant improvement.

Ablation study. We demonstrate the trade-off between training accuracy and validation accuracy on mini WebVision with various uniform regularization strengths and with HAR in Table 4. It is evident that as we gradually increase the overall uniform regularization strength, the training accuracy continues to decrease, and the validation accuracy reaches its peak at 5e-2. While strong regularization can improve generalization, it reduces the capacity of deep networks to fit the training data. With our proposed HAR, however, we only enforce strong regularization on a subset of the data, so that we improve generalization on the noisier groups while leaving the overall training accuracy unaffected.

4. RELATED WORK

Our work is closely related to the following methods and directions.

Noise-cleaning. The key idea of noise-cleaning is to identify and remove (or re-label) examples with wrong annotations. The general procedure for identifying mislabeled instances has a long history (Brodley & Friedl, 1999; Wilson & Martinez, 1997; Zhao & Nishida, 1995). Some recent works tailor this idea to deep neural networks. Veit et al. (2017) train a label-cleaning network on a small set of data with clean labels, and use this model to identify noise in large datasets. To circumvent the requirement of a clean subset, Malach & Shalev-Shwartz (2017) train two networks simultaneously and perform update steps only in case of disagreement. Similarly, in co-teaching (Han et al., 2018), each network selects a certain number of small-loss samples and feeds them to its peer network. Chen et al. (2019) further extend the co-training strategy and come up with an iterative version.

Reweighting. Reweighting training data has shown its effectiveness on noisy data (Liu & Tao, 2015). Its challenge lies in the difficulty of estimating the weights. Ren et al. (2018) propose a meta-learning algorithm that assigns weights to training examples based on how well their gradient directions align with those on a clean validation set. Recently, Shu et al. (2019) propose to learn an explicit loss-weight function to mitigate the optimization issue of Ren et al. (2018). Another line of work resorts to curriculum learning, either by designing an easy-to-hard training strategy (Guo et al., 2018) or by introducing an extra network (Jiang et al., 2018) to assign weights.

Noise-cleaning and reweighting methods usually rely on the empirical loss to determine whether a sample is noisy. However, when the dataset is heteroskedastic, each example's training or validation loss no longer correlates well with its noise level.
In such cases, we argue that changing the strength of regularization is a more conservative adaptation and suffers less from uncertain estimation than changing the weights of the losses (see Section C.4 for empirical justification).

Robust loss function. Another line of work has attempted to design robust loss functions (Ghosh et al., 2017; Xu et al., 2019; Zhang & Sabuncu, 2018; Patrini et al., 2017; Cheng et al., 2017; Menon et al., 2016). These methods usually rely on prior assumptions about a latent transition matrix that might not hold in practice. In contrast, we focus on more realistic settings.

Regularization. Regularization-based techniques have also been explored to combat label noise. Li et al. (2019) prove that SGD with early stopping is robust to label noise. Hu et al. (2020) provide a theoretical analysis of two additional regularization methods. While these methods consider uniform regularization on all training examples, our work emphasizes adjusting the weights of the regularizers in search of better generalization than a uniform assignment provides.

5. CONCLUSION

We propose a unified framework (HAR) for training on heteroskedastic and imbalanced datasets. Our method achieves significant improvements over the previous state of the art on a variety of benchmark vision and language tasks. We provide theoretical results as well as empirical justification by showing that ambiguous, mislabeled, and rare examples all benefit from stronger regularization. We further provide a formula for the optimal weighting of the regularization. Heteroskedasticity of datasets is a fascinating direction worth exploring, and studying it is an important step towards a better understanding of real-world scenarios in the wild.

A PROOFS OF THEOREM 1

We prove a more general theorem here, from which Theorem 1 follows.

Theorem 2. Assume that $f^*, q, I \in W_2^2$. Suppose (1) $\ell(a, y)$ is convex and three times continuously differentiable with respect to $a$, and (2) there exist constants $0 < c < C < \infty$ such that $c \le I(X) \le C$ almost surely, and $\epsilon \triangleq \nabla_a \ell(f^*(X), Y)$ satisfies $\mathbb{E}[\epsilon \mid X] = 0$, $\mathbb{E}[\epsilon^2 \mid X] = I(X)$ and $\mathbb{E}[\epsilon^4 \mid X] < \infty$ almost surely. Let $r(t) = -1/(q(t) I(t))$ and $L_0 = \int_{-\infty}^{\infty} \frac{1}{4} \exp(-2|t|) \, dt$. If we choose $\lambda = C_0 n^{-2/5}$ for some constant $C_0 > 0$, the asymptotic mean squared error of $\hat{f}$ defined by equation 4 is

$$\lim_{n \to \infty} \mathrm{MSE}(\hat{f}) = C_n \int_0^1 \left\{ \lambda^2 r^2(t) \left[ \frac{d}{dt} \big( \rho(t) (f^*)'(t) \big) \right]^2 + L_0 \, r(t)^{1/2} \rho(t)^{-1/2} \right\} dt$$

in probability, where $C_n$ is a scalar that only depends on $n$.

It is easy to check that the logistic loss satisfies the conditions of the theorem. The proof strategy of Theorem 2 is adapted from the proof of Theorem 2 of Wang et al. (2013), generalizing it from the least-squares loss to the logistic loss. The high-level idea is to reformulate $\hat{f}$ as the solution to ordinary differential equations. Let $(\gamma_v, h_v)$ be the (normalized) solutions of the following equation:

$$-\rho(t) h_v''(t) = \gamma_v I(t) q(t) h_v(t), \quad h_v(0) = h_v(1) = 0, \quad \text{and} \quad h_v'(0) = h_v'(1) = 0. \qquad (8)$$

We define the leading term of $\hat{f} - f^*$ as $S_{n,\lambda}(f^*)$, given by

$$S_{n,\lambda}(f^*) = \frac{1}{n} \sum_i \epsilon_i K_{X_i} - W_\lambda f^*, \qquad (9)$$

where $K_t(\cdot) = \sum_v \frac{h_v(t)}{1 + \lambda \gamma_v} h_v(\cdot)$ and $W_\lambda h_v(\cdot) = \frac{\lambda \gamma_v}{1 + \lambda \gamma_v} h_v(\cdot)$. By Proposition 2.1 and Theorem 3.4 of Shang et al. (2013), we have

$$\sup_x \big| \hat{f}(x) - f^*(x) - S_{n,\lambda}(f^*)(x) \big| = o_P(n^{-1/3}). \qquad (10)$$

Following the same proof as Theorem 2 of Wang et al. (2013), we can simplify the definitions of $K_t$ and $W_\lambda$ as

$$K_t(x) = I(t) q(t) J(t, x) \quad \text{and} \quad W_\lambda f^*(t) = \lambda r(t) \frac{d}{dt} \big( \rho(t) (f^*)'(t) \big), \qquad (11)$$

where $J(t, s) = \beta \rho(s) Q_\beta(s) L_0(\beta |Q_\beta(t) - Q_\beta(s)|)$, $Q_\beta(t) = \int_0^t (r(s) \rho(s))^{-1/2} (1 + O(\beta^{-1})) \, ds$, and $\beta = 1/\sqrt{\lambda}$. Plugging equation 11 into equation 10, we then have

$$\lim_{n \to \infty} \mathrm{MSE}(\hat{f}) = C_n \int_0^1 \left\{ \lambda^2 r^2(t) \left[ \frac{d}{dt} \big( \rho(t) (f^*)'(t) \big) \right]^2 + L_0 \, r(t)^{1/2} \rho(t)^{-1/2} \right\} dt.$$

B IMPLEMENTATION DETAILS

We develop our core algorithm in PyTorch (Paszke et al., 2017).

Implementation details for CIFAR. We follow the simple data augmentation used in He et al. (2016), with only random crop and horizontal flip. We use ResNet-32 as our base network and repeat all experiments for 3 runs. We use standard SGD with momentum of 0.9 and weight decay of $1 \times 10^{-4}$ for training. The model is trained with a batch size of 128 for 120 epochs. We anneal the learning rate by a factor of 10 at epochs 80 and 100. We group the data by class labels, and by default we split $D$ equally and randomly into $D_{\mathrm{train}}$ and $D_{\mathrm{val}}$. For the Jacobian regularizer, we sum the Frobenius norms of the gradients of the classification loss with respect to the activations of all normalization layers (BN by default). For the HAR experiments, we tune $\lambda$ so that the largest enforced regularization strength ($\lambda \tau_i$) is 0.1. We train each model on a single NVIDIA GeForce RTX 2080 Ti.
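Two of the settings above are simple enough to sketch directly: the step learning-rate schedule and the rescaling of $\lambda$ so that the largest $\lambda \tau_i$ equals 0.1. The initial learning rate is not stated in the text, so `base_lr` below is a hypothetical placeholder, epochs are assumed to be counted from 0, and the function names are illustrative.

```python
def learning_rate(epoch, base_lr, milestones=(80, 100), factor=10.0):
    """Divide the learning rate by `factor` at each milestone epoch reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr


def scale_lambda(taus, max_strength=0.1):
    """Pick lambda so that the largest lambda * tau_i equals max_strength."""
    largest = max(taus)
    if largest <= 0:
        raise ValueError("at least one tau must be positive")
    return max_strength / largest


base_lr = 0.1                      # hypothetical value, not given in the text
schedule = [learning_rate(e, base_lr) for e in (0, 79, 80, 99, 100, 119)]

taus = [0.5, 2.0, 4.0]             # hypothetical per-group weights
lam = scale_lambda(taus)
strengths = [lam * t for t in taus]
print(schedule, lam, strengths)
```

Fixing the largest strength rather than $\lambda$ itself keeps the overall regularization scale comparable across datasets whose $\tau_i$ values differ in magnitude.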

C.3 SIMULATING HETEROSKEDASTIC AND IMBALANCED DATASETS ON CIFAR

As mentioned in Section 3.1, we consider another variant of the heteroskedastic and imbalanced dataset in which the rare classes have a low noise level. To simulate this setting, we make the clean classes have fewer examples than the corrupted classes on the heteroskedastic CIFAR-10 we created in Section ??.

As discussed in Section 4, we train several classifiers with alternative, suboptimal weight-selection schemes. We consider the following two alternatives. (1) random: we draw the weights from a uniform distribution with the same mean as the weights of MW-Net and HAR. (2) inverse: we take the inverse of the weights learned by MW-Net and HAR and then normalize them so that the average reweighting/regularization strength remains the same. We conducted experiments on the heteroskedastic CIFAR-10 introduced in Section ??, and the results are summarized in Table 9. We conclude that changing the weights of the regularizer is a more conservative adaptation and is less susceptible to inaccurate estimates than reweighting.
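The two alternative weighting schemes can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, the uniform distribution is assumed to be supported on [0, 2 × mean] (the text only fixes its mean), and the learned weights are assumed to be strictly positive.

```python
import random


def random_weights(weights, seed=0):
    """Draw weights uniformly on [0, 2 * mean], so that the expected
    weight matches the mean of the learned weights."""
    rng = random.Random(seed)
    mean = sum(weights) / len(weights)
    return [rng.uniform(0.0, 2.0 * mean) for _ in weights]


def inverse_weights(weights):
    """Invert the learned weights, then rescale them so that their
    average equals the average of the original weights."""
    mean = sum(weights) / len(weights)
    inv = [1.0 / w for w in weights]
    inv_mean = sum(inv) / len(inv)
    return [v * mean / inv_mean for v in inv]


learned = [1.0, 2.0, 4.0]   # toy weights; larger means stronger regularization
print(random_weights(learned))
print(inverse_weights(learned))
```

Both schemes keep the average strength fixed, so any change in accuracy can be attributed to how the strength is distributed across examples rather than to its overall magnitude.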

C.5 VISUALIZATION

To better understand how the proposed HAR works on real-world heteroskedastic datasets, we plot the per-class key statistics used by HAR and the validation errors in Figure 4. We observe that HAR outperforms the tuned uniform regularization baseline on the majority of the classes.



Code available at https://github.com/kaidic/HAR.

See Figure 4 of Li et al. (2017); the number of votes for each example indicates the level of uncertainty of that example.

Note that we integrate the error without weighting because we are interested in the balanced test performance.



Figure 1: Histogram of the distributions of losses on an imbalanced and noisy CIFAR-10 dataset. Clean but rare examples tend to have larger losses, similar to the noisy examples in frequent classes.

Top-1 validation accuracy (averaged over 3 runs) of ResNet-32 on heteroskedastic and imbalanced CIFAR-10. HAR significantly improves noisy and rare classes, while keeping the accuracy on other classes almost unaffected.

Top-1 validation accuracy (averaged over 3 runs) of ResNet-32 on heteroskedastic CIFAR-10 and CIFAR-100 for the noisy classes and the clean classes.

Top-1 validation errors of ResNet-32 on imbalanced CIFAR-10 and CIFAR-100.

Validation accuracy of ResNet-50 when tuning the regularization strength on mini WebVision. HAR escapes the trade-off between fitting and generalization.

Validation accuracy of InceptionResNet-v2 on WebVision and ImageNet validation sets. HAR demonstrates significant improvements over the previous state of the art.

Table 8 summarizes the results. When the imbalance ratio is 10, INCV automatically drops 34.1% of the examples from the clean and rare classes, which decreases the mean accuracy on the rare and clean classes. HAR achieves improvements on both the noisy classes and the rare classes by enforcing the optimal regularization.

Top-1 validation accuracy (averaged over 3 runs) of ResNet-32 on heteroskedastic and imbalanced CIFAR-10.

Ours (HAR)   76.1 ± 0.8   72.1 ± 1.0   73.0 ± 1.6   26.1 ± 0.8

C.4 COMPARING THE EFFECT OF WEIGHTS ON LOSSES AND REGULARIZERS

Top-1 validation accuracy (averaged over 3 runs) of ResNet-32 on heteroskedastic CIFAR-10 by changing the weighting scheme.

ACKNOWLEDGEMENTS

Toyota Research Institute ("TRI") provided funds and computational resources to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. YC is supported by Stanford Graduate Fellowship. TM acknowledges support of Google Faculty Award. The work is also partially supported by SDSI and SAIL at Stanford.


Implementation details for IMDB-review. We train a two-layer bidirectional LSTM (Huang et al., 2015) with 256 units, followed by dropout with rate 0.5 before the linear classifier. The network is trained for 20 epochs with the Adam optimizer (Kingma & Ba, 2014). For HAR, we tune $\lambda$ so that the largest enforced regularization strength ($\lambda\tau_i$) is 0.1. We train each model on one NVIDIA GeForce RTX 2080 Ti.

Implementation details for WebVision. We use the same standard data augmentation as He et al. (2016), including random crop and horizontal flip. For mini WebVision, we train the network for 90 epochs using standard SGD with a batch size of 128. The initial learning rate is 0.1 and is annealed by a factor of 10 at epochs 60 and 90. For full WebVision, we train the network for 50 epochs using standard SGD with a batch size of 256. The initial learning rate is 0.1 and is annealed by a factor of 10 at epochs 30 and 40. For the HAR experiments, we tune $\lambda$ so that the largest enforced regularization strength ($\lambda\tau_i$) is 0.1. We train each model on 8 NVIDIA Tesla V100 GPUs.

Runtime analysis. Because the pre-trained model is trained only once, and only on half of the training data, the run time of HAR is at most twice that of ERM. Many baselines in our paper use sophisticated pipelines and are slower than HAR. For example, INCV trains two models simultaneously, four times, from random initialization to obtain a clean training set. MW-Net has a very slow convergence rate, which is a common issue for meta-learning.
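The step learning-rate schedules described above can be written as a small function. This is a sketch under assumptions: epochs are counted from zero, and a drop takes effect from the milestone epoch onward, which mirrors the convention of a multi-step scheduler such as `torch.optim.lr_scheduler.MultiStepLR`. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(60, 90), factor=10.0):
    """Learning rate at a zero-indexed epoch: base_lr divided by
    `factor` once for every milestone that has been reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


# Mini WebVision: 90 epochs, initial rate 0.1, milestones 60 and 90.
mini = [step_lr(e) for e in range(90)]

# Full WebVision: 50 epochs, initial rate 0.1, milestones 30 and 40.
full = [step_lr(e, milestones=(30, 40)) for e in range(50)]

print(mini[0], mini[59], mini[60])
print(full[29], full[30], full[40])
```

Under this zero-indexed convention, a milestone equal to the total number of epochs (as in the 90-epoch mini WebVision schedule) is never reached during training, so only the drop at epoch 60 takes effect there.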

C ADDITIONAL RESULTS

C.1 SIMULATING HETEROSKEDASTIC NOISE ON IMDB-REVIEW

The IMDB-review dataset contains 50,000 movie reviews (25,000 positive and 25,000 negative) for binary sentiment classification (Maas et al., 2011). To simulate heteroskedastic noise for this binary classification problem, we flip 5% of the labels of negative reviews to positive, and 40% in the reverse direction. Table 6 summarizes the results. The proposed HAR outperforms the ERM baseline across various strengths of uniform regularization.

Table 6: Top-1 validation accuracy (averaged over 3 runs) on the heteroskedastic IMDB-review dataset.

Reg Strength          Acc. of neg. reviews   Acc. of pos. reviews   Mean Acc.
0                     91.9 ± 2.0             50.9 ± 1.8             71.4 ± 0.5
Unif-reg (λ = 0.01)   94.3 ± 1.8             51.9 ± 2.0             73.1 ± 0.3
Unif-reg (λ = 0.1)    91.5 ± 1.9             64.3 ± 1.6             77.9 ± 0.4
Ours (HAR)            93.1 ± 1.5             72.8 ± 1.7             83.0 ± 0.3
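The class-dependent label noise used for IMDB-review can be simulated along the following lines. This is a sketch, not the authors' script: the function name `flip_labels` is hypothetical, labels are assumed to be encoded as 0 (negative) and 1 (positive), and an exact fraction of each class is flipped rather than flipping each label independently at random.

```python
import random


def flip_labels(labels, neg_to_pos=0.05, pos_to_neg=0.40, seed=0):
    """Return a copy of `labels` with class-dependent label noise.

    A fraction `neg_to_pos` of the negative (0) labels becomes positive,
    and a fraction `pos_to_neg` of the positive (1) labels becomes negative.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    for cls, rate in ((0, neg_to_pos), (1, pos_to_neg)):
        idxs = [i for i, y in enumerate(labels) if y == cls]
        n_flip = round(rate * len(idxs))
        for i in rng.sample(idxs, n_flip):
            noisy[i] = 1 - cls
    return noisy


clean = [0] * 1000 + [1] * 1000     # toy balanced label list
noisy = flip_labels(clean)
flipped_neg = sum(noisy[:1000])                       # negatives now labeled positive
flipped_pos = sum(1 for y in noisy[1000:] if y == 0)  # positives now labeled negative
print(flipped_neg, flipped_pos)
```

Because the flip rate depends on the true class, the uncertainty of the observed label varies with the input, which is the heteroskedastic setting studied here. A single shared rate would give the homoscedastic case instead.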

C.2 EVALUATION ON CIFAR-100 WITH REAL-WORLD HETEROSKEDASTICITY

It is known that the CIFAR-100 training set contains noisy examples. For instance, some "tiger" examples are labeled as "leopard" ("tiger" is a defined class as well). There are also noisy examples that contain multiple objects or are ambiguous in terms of identity (Song et al., 2020). We find that, because of this heteroskedasticity, HAR improves over uniform regularization on the well-studied CIFAR-100. The results are reported in Table 7.

Table 7: Top-1 validation accuracy (averaged over 3 runs) of ResNet-32 on the original CIFAR-100.

