WHEN OPTIMIZING f-DIVERGENCE IS ROBUST WITH LABEL NOISE

Abstract

We show when maximizing a properly defined f-divergence measure between a classifier's predictions and the supervised labels is robust to label noise. Leveraging its variational form, we derive a useful decoupling property for a family of f-divergence measures when label noise is present: the divergence decomposes into a linear combination of the variational difference defined on the clean distribution and a bias term introduced by the noise. This derivation lets us analyze the robustness of different f-divergence functions. With robustness established, this family of f-divergence functions serves as useful metrics for learning with noisy labels that do not require specifying the labels' noise rates. When a divergence is possibly not robust, we propose fixes to make it so. In addition to the analytical results, we present thorough experimental evidence.

1. INTRODUCTION

A machine learning system continuously observes noisy training annotations, and it remains a challenge to perform robust training in such scenarios. Earlier and classical approaches rely on estimation procedures to learn the noise rates of the labels and then leverage this knowledge to perform label correction (Patrini et al., 2017; Lukasik et al., 2020), loss correction (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017), or both, among many other more carefully designed approaches (please refer to our related work section for more detailed coverage). Recent works have started to propose robust loss functions or metrics that do not require the above estimation (Charoenphakdee et al., 2019; Xu et al., 2019; Liu & Guo, 2020; Cheng et al., 2021). Clear advantages of the latter approaches include their ease of implementation, as well as their robustness to noisy estimates of the parameters. This work mainly contributes to the second line of studies and aims to propose loss functions and measures that are inherently robust to label noise. We start by formulating the problem of maximizing an f-divergence defined between a classifier's predictions and the labels:

h*_f = argmax_h D_f(P_{h×Y} || Q_{h×Y}),    (1)

where D_f is an f-divergence function, and P_{h×Y} and Q_{h×Y} are the joint and product (marginal) distributions of the classifier h's predictions on a feature space X and the labels Y. Though optimizing this f-divergence measure is in general not the same as finding the Bayes optimal classifier, we show these measures encourage a classifier that maximizes an extended definition of f-mutual information between the classifier's predictions and the true label distribution. We also provide analysis of when the maximizer of this f-divergence coincides with the Bayes optimal classifier.
Building on a careful treatment of its variational form, we then reveal a property that helps establish the robustness of the f-divergence specified in Eqn. (1): the variational difference term defined with noisy labels is an affine transformation of the clean variational difference, subject to the addition of a bias term. Using this result, we analyze under which conditions maximizing an f-divergence measure is robust to label noise. In particular, we demonstrate strong robustness results for Total Variation divergence, and identify conditions under which several other divergences, including Jensen-Shannon divergence and Pearson χ² divergence, are robust. The resulting f-divergence functions offer ways to learn with noisy labels without estimating the noise parameters. As mentioned above, this distinguishes our solutions from a major line of previous studies that require such estimates. When the f-divergence functions are possibly not robust to label noise, our analysis also offers a new way to perform "loss correction". We would like to emphasize that instead of offering one method/loss/measure, our results effectively offer a family of functions that can be used for this noisy training task. Our contributions are summarized as follows:

• We show that a certain set of f-divergence measures is robust to label noise (some under certain conditions). The corresponding f-divergence functions provide the community with robust learning measures that do not require knowledge of the noise rates.

• When the f-divergence measures are possibly not robust to label noise, our analysis provides ways to correct the f-divergence functions to restore robustness. This process requires estimating the noise rates, and our results contribute new ways to leverage existing estimation techniques to make training more robust.

• We empirically verify the effectiveness of optimizing f-divergences when noisy labels are present.
We open-source our solutions at https://github.com/UCSC-REAL/Robust-f-divergence-measures.

1.1. RELATED WORKS

The currently most popular approach to dealing with label noise is to first estimate the noise transition matrix and then use this knowledge to perform loss or sample correction (Scott et al., 2013; Natarajan et al., 2013; Patrini et al., 2017; Lu et al., 2018; Han et al., 2018; Tanaka et al., 2018; Yao et al., 2020; Zhu et al., 2021). In particular, the surrogate loss (Scott et al., 2013; Natarajan et al., 2013; Scott, 2015; Van Rooyen et al., 2015; Menon et al., 2015) uses the transition matrix to define unbiased estimates of the true losses. Other works include (Sukhbaatar & Fergus, 2014; Xiao et al., 2015), which consider building a neural network to facilitate the learning of the noise rates or noise transition matrix. Symmetric losses have been studied, and conditions have been identified under which there is no need to estimate noise rates (Manwani & Sastry, 2013; Ghosh et al., 2015; 2017; Van Rooyen et al., 2015; Charoenphakdee et al., 2019). Nonetheless, it remains a challenge to develop training approaches that do not require knowing the noise rates in more generic settings. More recently, (Zhang & Sabuncu, 2018; Amid et al., 2019) proposed robust losses for neural networks. When noise rates are asymmetric (label class-dependent), (Xu et al., 2019) proposed an information-theoretic loss that is also robust to asymmetric noise rates. There have also been attempts to modify the regularization term to improve generalization in the presence of label noise (Jenni & Favaro, 2018; Yi & Wu, 2019), and to provide complementary negative labels (Kim et al., 2019). Peer Loss (Liu & Guo, 2020) is a recently proposed loss function that does not require knowing noise rates. f-divergence is a popular information-theoretic measure and has been widely used and studied. Most relevant to us, f-GAN was proposed in (Nowozin et al., 2016) to study f-divergence in training generative neural samplers.
To the best of our knowledge, ours is the first work to study the robustness of f-divergence measures in the context of improving the robustness of training with noisy labels.

2. LEARNING WITH NOISY LABELS USING f -DIVERGENCE

Our solution ties to the definition of f-divergence. The f-divergence between two distributions P and Q, with probability density functions p and q over Z ∈ Z (see the footnote), is defined as:

D_f(P || Q) = ∫_Z q(z) f(p(z)/q(z)) dz.    (2)

Here f(·) is a convex function such that f(1) = 0. Examples include the KL divergence with f(v) = v log v and the Total Variation (TV) divergence with f(v) = ½|v − 1|. Other examples can be found in Table 1. Following Fenchel's convex duality, f-divergence admits the following variational form:

D_f(P || Q) = sup_{g: Z → dom(f*)} E_{Z∼P}[g(Z)] − E_{Z∼Q}[f*(g(Z))],

where f* is the Fenchel conjugate of the function f(·), defined as f*(u) = sup_{v∈ℝ} {uv − f(v)}. We use dom(f*) to denote the domain of f*.

We consider the classification problem of learning a classifier h: X → Y that maps features X ∈ X to labels Y ∈ Y := {1, 2, ..., K}, where X and Y denote the random variables for features and labels, and (X, Y) is jointly drawn from a distribution D. For a clear presentation, we often focus on the binary classification setting Y = {−1, +1}, but most of our core results extend to multi-class classification problems, and we provide corresponding justifications. Instead of having access to training data sampled from (X, Y), we consider a setting with noisy labels, where the noisy label Ỹ is generated according to a transition matrix T defined between Ỹ and the true label Y. The (i, j) entry of T is defined as T_{i,j} = P(Ỹ = j | Y = i), where i, j ∈ {1, ..., K}. For ease of presentation in the binary case, we adopt the following notation:

e_+ := P(Ỹ = −1 | Y = +1),  e_− := P(Ỹ = +1 | Y = −1),  e_+ + e_− < 1.

Suppose we have access to a noisy training dataset {(x_n, ỹ_n)}_{n=1}^N, where ỹ_n is generated according to Ỹ.
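As an illustrative sketch (not part of the paper's implementation), the definition in Eqn. (2) can be evaluated directly for discrete distributions; the helper function and toy distributions below are our own choices:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_z q(z) * f(p(z)/q(z)) for discrete P, Q with full support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

f_kl = lambda v: v * np.log(v)           # KL divergence generator
f_tv = lambda v: 0.5 * np.abs(v - 1.0)   # Total Variation generator

p, q = np.array([0.7, 0.3]), np.array([0.5, 0.5])
print(f_divergence(p, q, f_tv))  # 0.5 * (|0.7-0.5| + |0.3-0.5|) = 0.2
print(f_divergence(p, p, f_kl))  # 0.0: D_f vanishes when P = Q, since f(1) = 0
```

Note the second call illustrates the defining property f(1) = 0: when P = Q, every density ratio is 1, so the divergence is exactly zero.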

2.1. LEARNING USING D f

We start by presenting our idea of training a classifier using D_f with the clean training data; we then proceed to the case with noisy labels. For an arbitrary classifier h, denote by P_{h×Y} the joint distribution of h(X) and Y:

Joint distribution: P_{h×Y} := P(h(X) = y, Y = y'),  y, y' ∈ Y,

and use Q_{h×Y} to denote the product (marginal) distribution of h(X) and Y:

Product distribution: Q_{h×Y} := P(h(X) = y) · P(Y = y'),  y, y' ∈ Y.

When it is clear from context, we shorthand the above two distributions as P and Q. We formulate the problem of learning using f-divergence as follows: the goal of the learner is to find a classifier h that maximizes the divergence measure between P and Q:

Learning using D_f:  h*_f = argmax_h D_f(P_{h×Y} || Q_{h×Y}).    (3)

Effectively, the goal is to find a classifier that maximizes the divergence between the joint distribution and the product distribution. Defining an f-mutual information based on f-divergence, M_f(h(X); Y) = D_f(P_{h×Y} || Q_{h×Y}), the maximization in Eqn. (3) equivalently finds the classifier that maximizes the f-mutual information between the classifier's output distribution and the true label distribution. A notable example is f(v) = v log v, for which the corresponding D_f and M_f become the familiar KL divergence and mutual information. It is important to note that, in general, maximizing (f-)mutual information between the classifier's predictions and labels does not promise the Bayes optimal classifier h* = argmax_h P(h(X) = Y). Nonetheless, maximizing it often returns a high-quality one. We provide further analysis in Section 2.2.

Variational representation  As mentioned earlier, f-divergence admits a variational form, which further allows us to focus on maximizing the following variational difference:

h*_f = argmax_h sup_g E_{Z∼P_{h×Y}}[g(Z)] − E_{Z∼Q_{h×Y}}[f*(g(Z))],

where we use Z to shorthand the tuple [h(X), Y].
Denote the variational difference as

VD_f(h, g) := E_{Z∼P_{h×Y}}[g(Z)] − E_{Z∼Q_{h×Y}}[f*(g(Z))],

and let g* be the corresponding optimal variational function g for VD_f(h, g). This variational form allows us to use a training dataset {(x_n, y_n)}_{n=1}^N to solve the maximization problem in Eqn. (3) (Nowozin et al., 2016). A list of f-divergence functions, together with the optimal variational functions g* and conjugate functions f*, is summarized in Table 1.

Name             D_f(P || Q)                                           g*                        dom(f*)         f*(u)
Total Variation  ½ ∫ |p(z) − q(z)| dz                                  ½ sign(p(z)/q(z) − 1)     u ∈ [−½, ½]     u
Jensen-Shannon   ½ ∫ [p(z) log(2p(z)/(p(z)+q(z))) + q(z) log(2q(z)/(p(z)+q(z)))] dz   log(2p(z)/(p(z)+q(z)))   u < log 2       −log(2 − e^u)
Pearson χ²       ∫ (p(z) − q(z))²/q(z) dz                              2(p(z)/q(z) − 1)          ℝ               u²/4 + u
KL               ∫ p(z) log(p(z)/q(z)) dz                              1 + log(p(z)/q(z))        ℝ               e^{u−1}

Table 1: D_f's, optimal variational functions g*, and conjugate functions f*. A more complete table, including Jeffrey, Squared Hellinger, Neyman χ², and Reverse KL, is provided in the Appendix.
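To make the f-mutual information of Eqn. (3) concrete, here is a small sketch (the helper and the hypothetical joint tables are our own) computing M_f from a discrete joint distribution of (h(X), Y):

```python
import numpy as np

def f_mutual_information(joint, f):
    """M_f(h(X); Y) = D_f(P_{h×Y} || Q_{h×Y}) from a joint table
    joint[a, b] = P(h(X)=a, Y=b)."""
    joint = np.asarray(joint, dtype=float)
    # Product of marginals Q_{h×Y}, via outer product of the row/column sums.
    product = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    return float(np.sum(product * f(joint / product)))

f_kl = lambda v: v * np.log(v)   # with this f, M_f is the usual mutual information

correlated = np.array([[0.4, 0.1], [0.1, 0.4]])   # predictions track the labels
independent = np.outer([0.5, 0.5], [0.5, 0.5])    # predictions ignore the labels
```

With f(v) = v log v, the independent table yields M_f = 0 (joint equals product), while the correlated table yields a strictly positive value, matching the intuition that a maximizer of Eqn. (3) must make its predictions informative about Y.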

2.2. HOW GOOD IS h*_f?

As mentioned earlier, maximizing our defined f-divergence measures (or, equivalently, the f-mutual information) between the classifier's predictions and labels does not always return the Bayes optimal classifier. However, for a binary classification problem, we prove below that with a balanced dataset, maximizing the Total Variation (TV) divergence returns the Bayes optimal classifier:

Theorem 1. For TV, when P(Y = +1) = P(Y = −1) (balanced), h*_f is the Bayes optimal classifier.

Remark 2. The above theorem extends to the multi-class setting when we restrict attention to confident classifiers. See Appendix for details.

The above observation does not hold as readily for other f-divergences. Nonetheless, denote by Y*(X = x) the Bayes optimal label for an instance x: Y*(X = x) = argmax_y P(Y = y | X = x), and denote by P_{h×Y*}, Q_{h×Y*} the joint and product distributions P, Q defined w.r.t. h(X) and Y*. We prove:

Theorem 3. When P(Y* = +1) = P(Y* = −1) (balanced), maximizing D_f(P_{h×Y*} || Q_{h×Y*}) returns the Bayes optimal classifier if f(v) is monotonically increasing in |v − 1| on dom(f).

For example, Pearson χ² (f(v) = (v − 1)²) satisfies the monotonicity condition. In practice, when the label distribution P(Y | X = x) has small uncertainty, the ground truth labels are approximately equivalent to the Bayes optimal labels. Therefore, the above theorem implies that maximizing D_f(P_{h×Y} || Q_{h×Y}) is also likely to return a high-quality classifier for other f-divergences.
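Theorem 1 can be checked by brute force on a toy balanced problem (the distribution below is our own construction): enumerating all classifiers over a three-point feature space, the Bayes optimal classifier attains the maximal TV divergence. Note the divergence is invariant to flipping every prediction, so the maximizer is unique only up to that flip.

```python
import itertools
import numpy as np

# Toy balanced task: X uniform on {0, 1, 2}, and P(Y=+1|x) averages to 1/2.
eta = np.array([0.9, 0.2, 0.4])   # P(Y = +1 | X = x)
px = np.full(3, 1.0 / 3.0)        # P(X = x)

def tv_divergence(h):
    """TV between the joint of (h(X), Y) and the product of its marginals.
    In a 2x2 table with fixed marginals, all four cells share the same gap,
    so TV = 2 * |P(h=+1, Y=+1) - P(h=+1) P(Y=+1)|."""
    p_joint = np.sum(px * eta * (h == 1))   # P(h(X)=+1, Y=+1)
    p_h = np.sum(px * (h == 1))             # P(h(X)=+1)
    return 2.0 * abs(p_joint - p_h * 0.5)   # P(Y=+1) = 1/2 (balanced)

classifiers = [np.array(c) for c in itertools.product([1, -1], repeat=3)]
best = max(tv_divergence(h) for h in classifiers)
bayes = np.where(eta > 0.5, 1, -1)          # Bayes optimal: [+1, -1, -1]
```

Among all 2³ = 8 classifiers, the Bayes optimal one (and its prediction-flipped twin) achieves the largest TV value, consistent with the theorem.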

2.3. LEARNING WITH NOISY LABELS

Consider an arbitrary classifier h. Denote by P̃_{h×Ỹ} the joint distribution of h(X) and Ỹ:

Joint noisy distribution: P̃_{h×Ỹ} := P(h(X) = y, Ỹ = y'),  y, y' ∈ Y.

Similarly, we use Q̃_{h×Ỹ} to denote the product (marginal) distribution of h(X) and Ỹ:

Product noisy distribution: Q̃_{h×Ỹ} := P(h(X) = y) · P(Ỹ = y'),  y, y' ∈ Y.

When it is clear from context, we shorthand these as P̃ and Q̃. We are interested in understanding the robustness of maximizing D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}). Using training samples {(x_n, ỹ_n)}_{n=1}^N, there exist algorithms to compute the gradient of D_f leveraging its variational form (Nowozin et al., 2016), such that one can apply gradient ascent to optimize it. We provide details in Section 5.
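As a small sketch of the noise model (the rates, seed, and helper name are our own choices), noisy labels can be simulated by sampling each ỹ from the row of the transition matrix T indexed by the true label:

```python
import numpy as np

def flip_labels(y, T, rng):
    """Sample noisy labels: P(ỹ = j | y = i) = T[i, j]."""
    T = np.asarray(T, dtype=float)
    return np.array([rng.choice(T.shape[1], p=T[yi]) for yi in y])

# Binary noise, encoding class 0 as label -1 and class 1 as label +1.
e_plus, e_minus = 0.2, 0.1          # P(ỹ=-1|y=+1), P(ỹ=+1|y=-1); e_+ + e_- < 1
T = np.array([[1 - e_minus, e_minus],
              [e_plus, 1 - e_plus]])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20000)       # clean labels
y_noisy = flip_labels(y, T, rng)         # observed noisy labels
```

Empirical flip frequencies in `y_noisy` concentrate around e_+ and e_−, so a large sample of (x, ỹ) pairs behaves like draws from the noisy joint P̃.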

3. VARIATIONAL DIFFERENCE WITH NOISY LABELS

For an arbitrary g, we define the variational difference term w.r.t. the noisy labels as

ṼD_f(h, g) := E_{Z̃∼P̃_{h×Ỹ}}[g(Z̃)] − E_{Z̃∼Q̃_{h×Ỹ}}[f*(g(Z̃))],

where Z̃ denotes [h(X), Ỹ]. Denote by g̃* the corresponding optimal variational function g for ṼD_f(h, g). In this section, we show that the variational difference term under noisy labels is closely related to the one defined on the clean distributions P, Q. Define the quantity

Δ^y_f(h, g) := E_X[g(h(X), y)] − E_X[f*(g(h(X), y))],

for example, Δ^{+1}_f(h, g) := E_X[g(h(X), +1)] − E_X[f*(g(h(X), +1))]. For a binary classification problem, further denote

Bias_f(h, g) := e_+ · Δ^{−1}_f(h, g) + e_− · Δ^{+1}_f(h, g).

We derive the following fact:

Theorem 4. For binary classification, the variational difference between the noisy distributions P̃ and Q̃ relates to the one defined on the clean distributions in the following way:

ṼD_f(h, g) = (1 − e_+ − e_−) · VD_f(h, g) + Bias_f(h, g).    (4)

This decoupling result is instructive: Bias_f(h, g) can be viewed as the additional bias introduced by label noise. If this term has a negligible effect on the maximization problem, maximizing the noisy variational difference is equivalent to maximizing (1 − e_+ − e_−) · VD_f(h, g), and therefore the clean variational difference; in that case, we have established the robustness of the corresponding f-divergence. The result also points out that when the effects of the bias term are non-negligible, finding ways to counter it will help us retain the robustness of D_f measures. Next we show that Theorem 4 extends to the multi-class setting under two broad families of noise models, both covering the binary setting as a special case.
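The identity in Theorem 4 holds for any variational function g and can be verified numerically on population-level tables; the joint table, noise rates, choice of g, and the Pearson χ² conjugate below are our own toy choices:

```python
import numpy as np

# Clean joint P(h(X)=a, Y=b) over a, b ∈ {0, 1}, with column 0 standing for label -1.
P = np.array([[0.35, 0.10],
              [0.05, 0.50]])
e_plus, e_minus = 0.2, 0.1   # P(ỹ=-1|y=+1), P(ỹ=+1|y=-1)

# Noisy joint: the label columns mix according to the binary transition matrix.
P_noisy = np.empty_like(P)
P_noisy[:, 1] = (1 - e_plus) * P[:, 1] + e_minus * P[:, 0]   # Ỹ = +1
P_noisy[:, 0] = (1 - e_minus) * P[:, 0] + e_plus * P[:, 1]   # Ỹ = -1

def vd(joint, g, f_star):
    """Variational difference E_P[g] - E_Q[f*(g)] on a 2x2 joint table."""
    product = joint.sum(1, keepdims=True) * joint.sum(0, keepdims=True)
    return float(np.sum(joint * g) - np.sum(product * f_star(g)))

g = np.array([[0.3, -0.2],             # an arbitrary variational function g(a, b)
              [-0.5, 0.8]])
f_star = lambda u: 0.25 * u**2 + u     # Pearson chi^2 conjugate

marg_h = P.sum(axis=1)                 # P(h(X)=a); unchanged by label noise
delta = marg_h @ g - marg_h @ f_star(g)        # Delta^y for y in {-1, +1}
bias = e_plus * delta[0] + e_minus * delta[1]  # Bias_f(h, g)

lhs = vd(P_noisy, g, f_star)
rhs = (1 - e_plus - e_minus) * vd(P, g, f_star) + bias
```

The two sides agree to machine precision, confirming the decoupling of the noisy variational difference into a scaled clean term plus the bias term.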
Multi-class extension of Theorem 4: uniform off-diagonal case  We first consider a uniform off-diagonal transition matrix, where e_j = T_{i,j}, ∀i ≠ j; that is, every other class i ≠ j has the same chance of being flipped to class j. The diagonal entry T_{i,i} (the chance of a correct label) becomes 1 − Σ_{j≠i} e_j. We further require that Σ_j e_j < 1. Note that the binary noise model is a special case of a uniform off-diagonal transition matrix.

Theorem 5. [Multi-class] For the uniform off-diagonal noise transition model, the noisy variational difference term relates to the clean one in the following way:

ṼD_f(h, g) = (1 − Σ_{j=1}^K e_j) · VD_f(h, g) + Σ_{j=1}^K e_j · Δ^j_f(h, g).

If we define Bias_f(h, g) := Σ_{j=1}^K e_j · Δ^j_f(h, g), we recover the result in Theorem 4 for the binary case: relabel class 1 → +1, 2 → −1; then e_1 := P(Ỹ = +1 | Y = −1) = e_− and e_2 := P(Ỹ = −1 | Y = +1) = e_+.

The other noise model we consider is sparse noise. Mathematically, assuming K is an even number, the sparse noise model specifies K/2 disjoint pairs of classes (i_c, j_c), where c ∈ [K/2] and i_c < j_c, and labels flip only between each pair. We provide details in the Appendix.

4. WHEN D f IS ROBUST WITH LABEL NOISE

Denote by H an arbitrary hypothesis space for training a candidate classifier h. We focus on H throughout this section and, abusing notation slightly, let h*_f = argmax_{h∈H} D_f(P_{h×Y} || Q_{h×Y}). We first define formally what we mean by the robustness of D_f(P_{h×Y} || Q_{h×Y}).

Definition 1. D_f(P_{h×Y} || Q_{h×Y}) is H-robust if h*_f = argmax_{h∈H} D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}).

The definition states that label noise does not disrupt the optimality of h*_f when maximizing D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}) instead of D_f(P_{h×Y} || Q_{h×Y}).

4.1. IMPACT OF THE BIAS TERMS

In this section, we take a closer look at the Bias terms and argue that they have diminishing effects compared to the VD terms as label noise increases. Recall that g* and g̃* are the optimal variational functions for VD_f(h, g) and ṼD_f(h, g), respectively.

Total Variation (TV)  For TV, since f(v) = ½|v − 1| and f*(u) = u, we immediately have g(h = y', y) − f*(g(h = y', y)) = 0 for all y, and therefore

Δ^y_f(h, g) = E_X[g(h(X), y)] − E_X[f*(g(h(X), y))] ≡ 0, ∀y,

and further Bias_f(h, g) ≡ 0. This fact helps establish the robustness of the TV divergence measure (Theorem 7).
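A quick numerical sketch of this observation (the marginals and randomly drawn g are our own choices): since TV's conjugate is the identity, Δ^y vanishes for every g within dom(f*), not only the optimal one:

```python
import numpy as np

f_star_tv = lambda u: u                    # TV conjugate: identity on [-1/2, 1/2]
rng = np.random.default_rng(1)
g = rng.uniform(-0.5, 0.5, size=(2, 2))    # any g taking values in dom(f*)
marg_h = np.array([0.6, 0.4])              # P(h(X) = a)

# Delta^y = E_X[g(h(X), y)] - E_X[f*(g(h(X), y))] cancels term by term.
delta = marg_h @ g - marg_h @ f_star_tv(g)
```

Since every Δ^y is identically zero, Bias_f(h, g) = e_+ · Δ^{−1} + e_− · Δ^{+1} is zero regardless of the noise rates.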

Other divergences

The above nice property generally does not hold for other f-divergence functions. Next we focus on the binary classification setting and prove the following lemma:

Lemma 1. For the f-divergences listed in Table 6 (Appendix), Bias_f(h, g̃*) = O((1 − e_+ − e_−)²).

Note that the variational form is used when optimizing D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}) (and therefore we will be using g̃*). This lemma simplifies Eqn. (4) to ṼD_f(h, g̃*) ∝ VD_f(h, g̃*) + O(1 − e_+ − e_−). Since 0 < 1 − e_+ − e_− ≤ 1, when the noise rate e_+ + e_− is high, the effect of the Bias term diminishes. When the Bias term becomes negligible, we have ṼD_f(h, g̃*) ∝ VD_f(h, g̃*) as e_+ + e_− → 1, establishing that optimizing ṼD_f(h, g) is approximately the same as optimizing VD_f(h, g).

4.2. HOW ROBUST ARE D_f MEASURES?

We first prove the following result:

Theorem 6. D_f is H-robust when Bias_f(h, g) satisfies either of the following conditions: (I) ∀h ∈ H, Bias_f(h, g) ≡ const.; (II) ∀h ∈ H, h ≠ h*_f, Bias_f(h, g̃*) ≤ Bias_f(h*_f, g̃*).

Theorem 6 gives sufficient conditions under which the Bias term does not get in the way of reaching the optimum h*_f. Intuitively, when Bias_f(h*_f, g̃*) is an upper bound of Bias_f(h, g̃*), the Bias term does not interfere with the convergence of the VD term. Next we provide specific examples of f-divergence functions that satisfy these conditions.

Total Variation (TV) is robust  For TV, the fact that Δ^y_f(h, g) ≡ 0 allows us to prove:

Theorem 7. For the TV divergence, Bias_f(h, g) ≡ const. and D_f(P_{h×Y} || Q_{h×Y}) is H-robust to label noise for any arbitrary hypothesis space H.

This result establishes TV as a strong measure that does not require specifying the noise rates.

Divergences that are conditionally robust  Other divergence functions do not enjoy the nice property that TV has, and their robustness needs a more careful analysis. Define the following measure, which captures the degree to which a classifier fits a particular label distribution:

Definition 2. The fitness of h to R ∈ {Y, Ỹ} is defined as FIT(h = y, R = y') := P(h(X) = y | R = y') / P(h(X) = y).

FIT measures capture the degree of fit of the classifier to the corresponding label distribution. A high FIT(h = y, Ỹ = y) (same label) indicates a potential overfit to the noisy labels. Denote by

H* := {h ∈ H : min_y FIT(h = y, Ỹ = y) ≥ max_y FIT(h*_f = y, Y = y) ≥ 1} ∪ {h*_f}.

The 1 in the "≥ 1" above corresponds to the FIT of a random classifier. H* contains the classifiers that are likely to overfit to the noisy labels. We argue, as also observed in training, that H* is the set of classifiers the training should avoid converging to, especially when the training only sees noisy labels. Suppose P(Y = +1) = P(Y = −1) (balanced clean labels) and e_+ = e_− (symmetric noise rates); we have the following theorem for binary classification:

Theorem 8. The f-divergences listed in Table 6 (Appendix), except for Jeffrey, are H*-robust.

4.3. MAKING D f MEASURES ROBUST TO LABEL NOISE

For the general case, to further improve the robustness of D_f measures, we will need to estimate the noise rates (e.g., e_+, e_−) and then subtract Bias_f(h, g) from the noisy variational difference term to correct the bias introduced by the noisy labels. As a corollary of Theorem 4 we have:

Corollary 1. Maximizing the following bias-corrected ṼD_f(h, g) defined over P̃ and Q̃ leads to h*_f:

h*_f = argmax_{h∈H} sup_g E_{Z̃∼P̃_{h×Ỹ}}[g(Z̃)] − E_{Z̃∼Q̃_{h×Ỹ}}[f*(g(Z̃))] − Bias_f(h, g).

By removing the Bias_f term, maximizing E_{Z̃∼P̃}[g(Z̃)] − E_{Z̃∼Q̃}[f*(g(Z̃))] becomes the same as maximizing the divergence defined on the clean distribution, (1 − Σ_{j=1}^K e_j) · VD_f(h, g); the corollary follows directly from this fact. Calculating the Bias terms requires the noise rates as inputs. Our work does not particularly focus on noise rate estimation; rather, we can leverage existing results on efficient noise rate estimation. There is existing literature on estimating noise rates (the noise transition matrix) that can be implemented without ground truth labels; interested readers may refer to (Liu & Tao, 2015; Menon et al., 2015; Harish et al., 2016; Patrini et al., 2017; Arazo et al., 2019; Yao et al., 2020; Zhu et al., 2021). We test the effectiveness of this bias correction step in Section 5.
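A minimal population-level sketch of Corollary 1's correction (the helper name and toy tables are ours; in practice the expectations are replaced by mini-batch means, and the estimated rates come from one of the cited noise-rate estimators):

```python
import numpy as np

def corrected_vd(noisy_joint, g, f_star, e_plus_hat, e_minus_hat):
    """Bias-corrected variational difference for binary labels:
    VD-tilde minus the estimated Bias term (column 0 stands for label -1)."""
    noisy_joint = np.asarray(noisy_joint, dtype=float)
    product = noisy_joint.sum(1, keepdims=True) * noisy_joint.sum(0, keepdims=True)
    vd_noisy = np.sum(noisy_joint * g) - np.sum(product * f_star(g))
    marg_h = noisy_joint.sum(axis=1)               # P(h(X)=a)
    delta = marg_h @ g - marg_h @ f_star(g)        # Delta^y terms
    bias = e_plus_hat * delta[0] + e_minus_hat * delta[1]
    return float(vd_noisy - bias)

# Toy check: a clean joint table, its noisy counterpart, and exact noise rates.
P = np.array([[0.35, 0.10], [0.05, 0.50]])
e_p, e_m = 0.2, 0.1
P_noisy = np.column_stack([(1 - e_m) * P[:, 0] + e_p * P[:, 1],
                           (1 - e_p) * P[:, 1] + e_m * P[:, 0]])
f_star = lambda u: 0.25 * u**2 + u               # Pearson chi^2 conjugate
g = np.array([[0.3, -0.2], [-0.5, 0.8]])
prod = P.sum(1, keepdims=True) * P.sum(0, keepdims=True)
clean_vd = float(np.sum(P * g) - np.sum(prod * f_star(g)))
```

With exact noise rates, `corrected_vd` recovers (1 − e_+ − e_−) times the clean variational difference, exactly as Theorem 4 predicts; with estimated rates, the recovery is approximate.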

5. EXPERIMENTS

In this section, we validate our analysis of the robustness of D_f measures via a set of empirical evaluations on 5 datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), and Clothing1M (Xiao et al., 2015). Omitted experiment details are available in the Appendix.

Baselines  We compare our approach with five baseline methods: Cross-Entropy (CE), Backward (BLC) and Forward Loss Correction (FLC) as introduced in (Patrini et al., 2017), the determinant-based mutual information (DMI) method introduced in (Xu et al., 2019), and Peer Loss (PL) functions (Liu & Guo, 2020). BLC and FLC require estimating the noise transition matrix; DMI and PL do not require such estimation.

Noise model

We test three types of noise transition models: uniform noise, sparse noise, and random noise. Full details of the noise are in the Appendix; here we briefly overview them. The uniform and sparse noise are as specified at the end of Section 3, on which our theoretical analyses mainly focus. The noise rates of low-level uniform noise and sparse noise are both approximately 0.2 (the average probability of a label being wrong); the high levels are about 0.55 and 0.4, respectively. In the random noise setting, each class flips to one of the 10 classes at random with probability p (Random p). For CIFAR-100, the noise rate of uniform noise is about 0.25; the sparse label noise is generated by randomly dividing the 100 classes into 50 pairs, with a noise rate of about 0.4.

Optimizing D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}) using noisy samples  With the noisy training dataset {(x_n, ỹ_n)}_{n=1}^N, we optimize D_f(P̃_{h×Ỹ} || Q̃_{h×Ỹ}) by gradient ascent on its variational form. A sketch is given in Algorithm 1. For the bias-corrected version of our algorithm, the gradient simply includes ∇Bias_f(h, g). The variational function g* can be updated progressively or fixed beforehand using an approximate activation function for each f (see e.g., (Nowozin et al., 2016)).

Algorithm 1  Maximizing D_f measures: one gradient step
1: Inputs: training data {(x_n, ỹ_n)}_{n=1}^N, divergence f, variational function g*, conjugate f*, classifier h_t.
2: Randomly sample three mini-batches {(x_n, ỹ_n)}_{n=1}^B, {(x†_n, ỹ†_n)}_{n=1}^B, {(x'_n, ỹ'_n)}_{n=1}^B from {(x_n, ỹ_n)}_{n=1}^N. The batch {(x_n, ỹ_n)}_{n=1}^B simulates samples from P̃; pairing the features x†_n with the labels ỹ'_n from the third batch simulates Q̃.
3: Use h_{t,x}[ỹ] to denote the model's prediction on x for label ỹ, and E_{(x_n,ỹ_n)}, E_{(x†_n,ỹ'_n)} to denote the empirical sample means calculated using the mini-batch data.
4: At step t, update h_t by ascending its stochastic gradient with learning rate η_t:

h_{t+1} := h_t + η_t · ∇_{h_t} ( E_{(x_n,ỹ_n)}[g*(h_{t,x_n}[ỹ_n])] − E_{(x†_n,ỹ'_n)}[f*(g*(h_{t,x†_n}[ỹ'_n]))] ).

Tip: in practice, we suggest (and implement in our experiments) using the fixed form of g*, which appears as g_f(v) in Table 6 (Appendix).
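A sketch of the mini-batch objective in step 4, stripped of the model update (the helper name, the toy predictions, and the choice g(v) = v − ½, which keeps probabilities v ∈ [0, 1] inside TV's dom(f*) = [−½, ½], are our own; the paper's experiments use the g_f(v) forms from Table 6):

```python
import numpy as np

def minibatch_vd(probs_p, labels_p, probs_q, labels_q, g, f_star):
    """Empirical variational difference: the first batch (x, ỹ) simulates
    P-tilde; the second pairs features with independently drawn labels to
    simulate Q-tilde. probs_* hold per-class predicted probabilities."""
    term_p = np.mean([g(row[y]) for row, y in zip(probs_p, labels_p)])
    term_q = np.mean([f_star(g(row[y])) for row, y in zip(probs_q, labels_q)])
    return float(term_p - term_q)

g_tv = lambda v: v - 0.5        # maps [0, 1] predictions into [-1/2, 1/2]
f_star_tv = lambda u: u         # TV conjugate

probs = np.array([[0.9, 0.1], [0.2, 0.8]])        # toy model predictions
vd_hat = minibatch_vd(probs, [0, 1], probs, [1, 0], g_tv, f_star_tv)
```

Ascending the gradient of this scalar w.r.t. the model parameters (e.g., via an autodiff framework) yields the update in step 4; a confident model scores high on the P̃ batch and low on the label-shuffled Q̃ batch, so `vd_hat` is large.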

5.1. HOW GOOD IS h*_f ON CLEAN DATA?

As a supplement to Section 2.2, we validate the quality of h*_f on the clean datasets of MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. In experiments, since the estimation of the product noisy distribution is unstable when training on the CIFAR-100 training set, we use CE as a warm-up (120 epochs) and then switch to training with D_f measures; for the other datasets, we train with D_f measures without the warm-up stage. Results are summarized in Table 2.

5.2. ROBUSTNESS OF D_f MEASURES

As a demonstration, we apply the uniform noise model to the CIFAR-10 dataset to test the robustness of three D_f measures: Total Variation (TV), Jensen-Shannon (JS), and Pearson (PS). We trained models with D_f measures using Algorithm 1 on 10 noise settings with noise rates increasing from 0% to approximately 81%. The D_f values and accuracy w.r.t. noise rates are visualized in Figure 1; both are calculated on the reserved clean test data. We observe that almost all D_f measures are robust to noisy labels, especially when the percentage of noisy labels is not overwhelmingly large, e.g., ≤ 70%. Note that the curves for other f-divergences are almost the same as the curve of Total Variation (TV), which is proved to be robust theoretically. This partially validates the analytical evidence we provided for the robustness of other f-divergences in Sections 4.1 and 4.2.

5.3. PERFORMANCE EVALUATION AND COMPARISON

From Table 3, several D_f measures arise as competitive solutions in a variety of noise scenarios. Among the proposed f-divergences, Total Variation (TV) is consistently ranked as one of the top-performing methods, which aligns with our analysis that TV is inherently robust. For most settings, the presented f-divergences outperformed the baselines we compare to, while they fell short of DMI (once) and Peer Loss (five times) in several cases, particularly when the noise is sparse and high. The sparse high-noise setting tends to be challenging for all methods. We conjecture this is because sparse high noise creates a highly imbalanced dataset, so model training is more likely to converge to a sub-optimal solution early in the training process. It is also possible that with sparse noise, the impact of the Bias terms becomes non-negligible. We do observe better performance with very careful and intensive hyper-parameter tuning, but the improvements were not consistent, so we chose not to report them. Fully understanding the limitation of our approach in this setting remains an interesting ongoing investigation.

In Table 4 (full details on MNIST and Fashion-MNIST can be found in the Appendix), we use the noise transition estimation method of (Patrini et al., 2017) to estimate the noise rates. The estimates allow us to define the bias term and perform bias correction for D_f measures. We observe that while adding bias correction can further improve the performance of several divergence functions (positive Gap), the improvement or difference is not significant. This partially justifies our analysis of the bias term, especially when the noise is dense and high (uniform and random high).

6. CONCLUSION

In this paper, we explored the robustness of a properly defined f-divergence measure when used to train a classifier in the presence of label noise. We identified a set of nice robustness properties for a family of f-divergence functions and experimentally verified our findings. Our work primarily contributes to the problem of learning with noisy labels without requiring knowledge of the noise rates. Beyond this problem, our derivation and analysis may be useful for understanding the robustness of f-divergences in other learning tasks.



Footnote: We use Z instead of X, as is conventionally done, for a good reason: we reserve X to explicitly denote the features.



Figure 1: Robustness of TV, JS, PS divergences.

Table 2: Experiment results comparison on clean datasets. We report the maximum accuracy of CE and each D_f measure, along with (mean ± standard deviation); Gap: mean performance relative to CE. Numbers highlighted in blue indicate a gap of less than 1%. The results demonstrate that optimizing f-divergence on clean data returns a high-quality h*_f: even though D_f measures cannot outperform CE on clean data, the gap between CE and D_f measures is negligible; for example, the largest gap of Total Variation (TV) is only 0.81% among the four datasets.

Table 3: Experiment results comparison (w/o bias correction). The best performance in each setting is highlighted in blue. We report the maximum accuracy of each D_f measure, along with (mean ± standard deviation). All f-divergences are highlighted if their mean performance is better than (or no worse than) all baselines we compare to. A supplementary table including Pearson χ² and Jeffrey (JF) is attached in Table 7 (Appendix).

Table 4: D_f measures with bias correction on CIFAR-10. Numbers highlighted in blue indicate better than all baseline methods; Gap: relative performance w.r.t. the version w/o bias correction (Table 3); numbers in red indicate better than w/o bias correction.

Clothing1M  Clothing1M is a large-scale clothes dataset with comprehensive annotations and can be categorized as a feature-dependent, human-level noise dataset. Although this noise setting does not exactly follow our assumptions, we are interested in testing the robustness of our f-divergence approaches. Experiment results in Table 5 demonstrate the robustness of the D_f measures: TV and KL divergences outperform the other baseline methods.

Table 5: Experiment results comparison on the Clothing1M dataset.

ACKNOWLEDGEMENT

This work is partially supported by the National Science Foundation (NSF) under grant IIS-2007951 and the Office of Naval Research under grant N00014-20-1-22.

