WHEN OPTIMIZING f-DIVERGENCE IS ROBUST WITH LABEL NOISE

Abstract

We show when maximizing a properly defined f-divergence measure with respect to a classifier's predictions and the supervised labels is robust to label noise. Leveraging its variational form, we derive a decoupling property for a family of f-divergence measures in the presence of label noise: the divergence on the noisy distribution is shown to be a linear combination of the variational difference defined on the clean distribution and a bias term introduced by the noise. This derivation helps us analyze the robustness of different f-divergence functions. With established robustness, this family of f-divergence functions arises as a set of useful metrics for the problem of learning with noisy labels, requiring no specification of the labels' noise rates. When a divergence is possibly not robust, we propose fixes to make it so. In addition to the analytical results, we present thorough experimental evidence.

1. INTRODUCTION

A machine learning system continuously observes noisy training annotations, and it remains a challenge to perform robust training in such scenarios. Earlier and classical approaches rely on estimation procedures to infer the noise rates of the labels and then leverage this knowledge to perform label correction (Patrini et al., 2017; Lukasik et al., 2020), loss correction (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017), or both, among many other more carefully designed approaches (please refer to our related work section for more detailed coverage). Recent works have started to propose robust loss functions or metrics that do not require the above estimation (Charoenphakdee et al., 2019; Xu et al., 2019; Liu & Guo, 2020; Cheng et al., 2021). Clear advantages of the latter approaches include their ease of implementation, as well as their robustness to noisy estimates of the noise parameters. This work mainly contributes to the second line of studies and aims to propose loss functions and measures that are inherently robust with label noise. We start by formulating the problem of maximizing an f-divergence defined between a classifier's predictions and the labels:
$$h^*_f = \arg\max_h \, D_f\big(P_{h \times Y} \,\|\, Q_{h \times Y}\big), \quad (1)$$
where $D_f$ is an f-divergence function, and $P_{h \times Y}$ and $Q_{h \times Y}$ are, respectively, the joint distribution and the product of the marginal distributions of the classifier $h$'s predictions on a feature space $X$ and the label $Y$. Though optimizing the f-divergence measure is in general not the same as finding the Bayes optimal classifier, we show these measures encourage a classifier that maximizes an extended definition of f-mutual information between the classifier's prediction and the true label distribution. We will also provide analysis for when the maximizer of this f-divergence coincides with the Bayes optimal classifier.
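As a concrete illustration (not the paper's implementation), the objective in Eqn. (1) can be estimated for discrete predictions and labels by forming the empirical joint distribution of (h(X), Y) and the product of its marginals, then applying the definition of D_f directly. The sketch below, with hypothetical function names, uses the KL generator f(t) = t log t, for which D_f reduces to the (f-)mutual information between the prediction and the label; other choices of f plug into the same skeleton.

```python
import numpy as np

def f_kl(t):
    # KL generator f(t) = t * log(t), with f(1) = 0.
    return t * np.log(t)

def f_divergence_pred_label(preds, labels, f=f_kl, num_classes=2):
    """Estimate D_f(P_{h x Y} || Q_{h x Y}) from samples.

    P is the empirical joint distribution of (h(X), Y); Q is the
    product of the empirical marginals. With f(t) = t log t this is
    the empirical (f-)mutual information between h(X) and Y.
    """
    joint = np.zeros((num_classes, num_classes))
    for p, y in zip(preds, labels):
        joint[p, y] += 1.0
    joint /= joint.sum()                                      # empirical P
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # empirical Q
    mask = joint > 0        # skip zero cells: 0 * f(0/q) contributes 0 here
    ratio = joint[mask] / product[mask]
    # D_f(P || Q) = sum over outcomes of Q * f(P / Q)
    return float(np.sum(product[mask] * f(ratio)))

# A classifier matching the labels attains high f-mutual information;
# a constant classifier attains zero.
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print(f_divergence_pred_label(labels.copy(), labels))        # log(2) ~ 0.693
print(f_divergence_pred_label(np.zeros_like(labels), labels))  # 0.0
```

Note that the paper's analysis works with the variational form of D_f (a supremum over variational functions), which is what makes the objective trainable with neural networks; the direct plug-in estimate above is only meant to make the quantity in Eqn. (1) concrete for the discrete case.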
Building on a careful treatment of its variational form, we then reveal a decoupling property that helps establish the robustness of the f-divergence specified in Eqn. (1): the variational difference term defined with noisy labels is an affine transformation of the clean variational difference, subject to

