WHEN OPTIMIZING f-DIVERGENCE IS ROBUST WITH LABEL NOISE

Abstract

We show when maximizing a properly defined f-divergence measure with respect to a classifier's predictions and the supervised labels is robust to label noise. Leveraging its variational form, we derive a useful decoupling property for a family of f-divergence measures in the presence of label noise: the divergence is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced by the noise. This derivation helps us analyze the robustness of different f-divergence functions. With established robustness, this family of f-divergence functions arises as useful metrics for the problem of learning with noisy labels, requiring no specification of the labels' noise rates. When they are possibly not robust, we propose fixes to make them so. In addition to the analytical results, we present thorough experimental evidence.

1. INTRODUCTION

A machine learning system continuously observes noisy training annotations, and it remains a challenge to perform robust training in such scenarios. Earlier and classical approaches rely on estimation procedures to learn the noise rates of the labels and then leverage this knowledge to perform label correction (Patrini et al., 2017; Lukasik et al., 2020), loss correction (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017), or both, among many other more carefully designed approaches (please refer to our related work section for more detailed coverage). Recent works have started to propose robust loss functions or metrics that do not require the above estimation (Charoenphakdee et al., 2019; Xu et al., 2019; Liu & Guo, 2020; Cheng et al., 2021). Clear advantages of the latter approaches include their ease of implementation, as well as their robustness to noisy estimates of the noise parameters. This work mainly contributes to the second line of studies and aims to propose loss functions and measures that are inherently robust to label noise.

We start by formulating the problem of maximizing an f-divergence defined between a classifier's predictions and the labels:

h*_f = argmax_h D_f(P_{h×Y} || Q_{h×Y}),   (1)

where D_f is an f-divergence function, and P and Q are the joint and product (marginal) distributions of the classifier h's predictions on a feature space X and the label Y. Though optimizing the f-divergence measure is in general not the same as finding the Bayes optimal classifier, we show these measures encourage a classifier that maximizes an extended definition of f-mutual information between the classifier's predictions and the true label distribution. We will also provide analysis for when the maximizer of this f-divergence coincides with the Bayes optimal classifier.
Building on a careful treatment of its variational form, we then reveal a useful property that helps establish the robustness of the f-divergence specified in Eqn. (1): the variational difference term defined with noisy labels is an affine transformation of the clean variational difference, subject to the addition of a bias term. Using this result, we analyze under which conditions maximizing an f-divergence measure is robust to label noise. In particular, we demonstrate strong robustness results for the Total Variation divergence, and identify conditions under which several other divergences, including the Jensen-Shannon divergence and the Pearson χ² divergence, are robust. The resultant f-divergence functions offer ways to learn with noisy labels without estimating the noise parameters. As mentioned above, this distinguishes our solutions from a major line of previous studies that require such estimates. When the f-divergence functions are possibly not robust to label noise, our analysis also offers a new way to perform "loss correction". We would like to emphasize that instead of offering one method/loss/measure, our results effectively offer a family of functions that can be used for this noisy training task. Our contributions are summarized as follows:

• We show that a certain set of f-divergence measures are robust to label noise (some under certain conditions). The corresponding f-divergence functions provide the community with robust learning measures that do not require knowledge of the noise rates.

• When the f-divergence measures are possibly not robust to label noise, our analysis provides ways to correct the f-divergence functions to restore robustness. This process requires estimation of the noise rates, and our results contribute new ways to leverage existing estimation techniques to make training more robust.

• We empirically verify the effectiveness of optimizing f-divergences in the presence of noisy labels.
We open-source our solutions at https://github.com/UCSC-REAL/Robust-f-divergence-measures.
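As a concrete sketch of the variational machinery referenced above (a minimal illustration under our own assumptions, not the paper's implementation), the snippet below estimates the Total Variation variational difference E_P[g] − E_Q[g] from finite samples: P is the empirical joint distribution of (prediction, label) pairs, while Q, the product of marginals, is approximated by pairing predictions with independently permuted labels. The function names and the particular choice of the variational function g are illustrative assumptions.

```python
import numpy as np

def tv_variational_difference(preds, labels, g, rng=None):
    """Monte-Carlo sketch of the Total Variation variational difference.

    For TV, f(u) = |u - 1| / 2, and its conjugate satisfies f*(t) = t on
    |t| <= 1/2, so D_TV(P || Q) = sup_{|g| <= 1/2} E_P[g] - E_Q[g].
    P is the empirical joint of (prediction, label); Q (the product of
    marginals) is approximated by pairing predictions with shuffled labels.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    preds, labels = np.asarray(preds), np.asarray(labels)
    joint_term = np.mean([g(p, y) for p, y in zip(preds, labels)])
    shuffled = rng.permutation(labels)  # breaks prediction-label dependence
    product_term = np.mean([g(p, y) for p, y in zip(preds, shuffled)])
    return joint_term - product_term

# An illustrative variational function, bounded in [-1/2, 1/2] as TV requires:
g = lambda p, y: 0.5 if p == y else -0.5
```

On perfectly correlated, balanced binary data this estimate approaches 1/2, the TV divergence between the joint and the product of its marginals; in actual training one would instead parameterize g (e.g., by a neural network) and maximize the difference over both g and the classifier.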

1.1. RELATED WORKS

The currently most popular approach to dealing with label noise is to first estimate the noise transition matrix and then use this knowledge to perform loss or sample correction (Scott et al., 2013; Natarajan et al., 2013; Patrini et al., 2017; Lu et al., 2018; Han et al., 2018; Tanaka et al., 2018; Yao et al., 2020; Zhu et al., 2021). In particular, the surrogate loss (Scott et al., 2013; Natarajan et al., 2013; Scott, 2015; Van Rooyen et al., 2015; Menon et al., 2015) uses the transition matrix to define unbiased estimates of the true losses. Other works include (Sukhbaatar & Fergus, 2014; Xiao et al., 2015), which consider building a neural network to facilitate the learning of the noise rates or the noise transition matrix. Symmetric losses have been studied, and conditions have been identified under which there is no need to estimate the noise rate (Manwani & Sastry, 2013; Ghosh et al., 2015; 2017; Van Rooyen et al., 2015; Charoenphakdee et al., 2019). Nonetheless, it remains a challenge to develop training approaches that do not require knowing the noise rates in more generic settings. More recently, (Zhang & Sabuncu, 2018; Amid et al., 2019) proposed robust losses for neural networks. When noise rates are asymmetric (label class-dependent), (Xu et al., 2019) proposed an information-theoretic loss that is robust to asymmetric noise rates. There have also been attempts to modify the regularization term to improve generalization in the presence of label noise (Jenni & Favaro, 2018; Yi & Wu, 2019), and to provide complementary negative labels (Kim et al., 2019). Peer loss (Liu & Guo, 2020) is a recently proposed loss function that does not require knowing the noise rates.

f-divergence is a popular information-theoretic measure that has been widely used and studied. Most relevant to us, f-GAN was proposed in (Nowozin et al., 2016) to study f-divergences in training generative neural samplers.
To the best of our knowledge, ours is the first work to study the robustness of f-divergence measures in the context of improving the robustness of training with noisy labels.

2. LEARNING WITH NOISY LABELS USING f -DIVERGENCE

Our solution ties to the definition of f-divergence. The f-divergence between two distributions P and Q, with probability density functions p and q defined over Z ∈ Z¹, is:

D_f(P || Q) = ∫_Z q(z) f( p(z) / q(z) ) dz.   (2)



¹ We use Z instead of X, as is conventionally done, for a good reason: we will be reserving X to explicitly denote the features.
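As a minimal sketch of Eqn. (2) in the discrete case (where the integral becomes a sum), the snippet below evaluates D_f for a few standard generators f. The generator formulas are standard, but the code organization and names are our own, and the sketch assumes q(z) > 0 wherever p(z) > 0 (and strictly positive ratios for the KL and Jensen-Shannon generators).

```python
import numpy as np

# Common generators f (each convex with f(1) = 0).  Under this convention
# the Jensen-Shannon entry yields twice the standard JSD; the KL and
# Jensen-Shannon generators assume u = p/q > 0.
F_GENERATORS = {
    "KL":              lambda u: u * np.log(u),
    "Total-Variation": lambda u: 0.5 * np.abs(u - 1.0),
    "Pearson-Chi2":    lambda u: (u - 1.0) ** 2,
    "Jensen-Shannon":  lambda u: u * np.log(u) - (u + 1.0) * np.log((u + 1.0) / 2.0),
}

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_z q(z) * f(p(z) / q(z)) for discrete densities p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0                      # convention: terms with q(z) = 0 are dropped
    u = p[mask] / q[mask]
    return float(np.sum(q[mask] * f(u)))
```

For example, the Total Variation divergence between a point mass on the first of two outcomes and the uniform distribution is 1/2, matching 0.5 · Σ|p − q|; in our setting p and q would be the discretized joint and product distributions of predictions and labels from Eqn. (1).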

