WASSERSTEIN DISTRIBUTIONAL NORMALIZATION: NONPARAMETRIC STOCHASTIC MODELING FOR HANDLING NOISY LABELS

Abstract

We propose a novel Wasserstein distributional normalization (WDN) algorithm to handle noisy labels for accurate classification. In this paper, we split our data into uncertain and certain samples based on the small-loss criterion. We investigate the geometric relationship between these two types of samples and exploit this relation to extract useful information, even from uncertain samples. To this end, we impose geometric constraints on the uncertain samples by normalizing them into the Wasserstein ball centered on the certain samples. Experimental results demonstrate that our WDN outperforms other state-of-the-art methods on the Clothing1M and CIFAR-10/100 datasets, which contain diverse types of label noise. The proposed WDN is highly compatible with existing classification methods, meaning it can be easily plugged into various methods to improve their accuracy significantly.

1. INTRODUCTION

The successful results of deep neural networks (DNNs) on supervised classification tasks rely heavily on accurate and high-quality label information. However, annotating large-scale datasets is an extremely expensive and time-consuming task. Because obtaining high-quality datasets is very difficult, in most conventional works training data have instead been collected via crowd-sourcing platforms (Yu et al., 2018), which inevitably introduces noisy labels into the annotated samples. While numerous methods can deal with noisy labeled data, recent methods actively adopt the small-loss criterion, which enables the construction of classification models that are not susceptible to noise corruption. In this learning scheme, a neural network is first trained on easy samples in the early stages of training; harder samples are then gradually selected to train mature models as training proceeds. Jiang et al. (2018) suggested collaborative learning models, in which a mentor network delivers a data-driven curriculum loss to a student network. Han et al. (2018) and Yu et al. (2019) proposed dual networks that jointly generate gradient information from easy samples and use this information to let the networks teach each other. Wei et al. (2020) adopted a disagreement strategy, which determines the gradient information to update based on disagreement values between dual networks. Han et al. (2020) used accumulated gradients to help the optimization process escape from over-parameterization and obtain more generalized results.

In this paper, we tackle major issues raised by the aforementioned methods based on the small-loss criterion, as follows. Through comprehensive experiments, these methods have gained empirical insight into network behavior under noisy labels; however, theoretical and quantitative explanations have not been closely investigated.
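The small-loss selection scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used by any of the cited works; the function name, the fixed keep ratio, and the toy loss values are all assumptions for exposition.

```python
import numpy as np

def split_by_small_loss(losses, keep_ratio):
    """Split sample indices into 'certain' (small-loss) and 'uncertain' sets.

    losses: per-sample loss values from the current network.
    keep_ratio: fraction of smallest-loss samples treated as certain; in
    curriculum-style training this fraction is typically grown over epochs
    so that harder samples are admitted as the model matures.
    """
    order = np.argsort(losses)            # indices sorted by ascending loss
    n_keep = int(len(losses) * keep_ratio)
    certain = order[:n_keep]              # easy / likely-clean samples
    uncertain = order[n_keep:]            # hard / possibly noisy samples
    return certain, uncertain

# toy example: 6 samples, keep the 50% with the smallest loss
losses = np.array([0.1, 2.3, 0.4, 1.9, 0.2, 3.0])
certain, uncertain = split_by_small_loss(losses, 0.5)
```

Here samples 0, 4, and 2 (losses 0.1, 0.2, 0.4) would be treated as certain, and the rest as uncertain.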
In contrast, we give strong theoretical and empirical explanations of network behavior under noisy labels. In particular, we present an in-depth analysis of the small-loss criterion in a probabilistic sense. We exploit the stochastic properties of noisy labeled data and develop probabilistic descriptions of data under the small-loss criterion, as follows. Let P be a probability measure for the pre-softmax logits of the training samples, l be an objective function for classification, and 1{•} be an indicator function. Then, our central objects are the truncated measures defined as

X ∼ μ|ζ = 1{X; l(X) > ζ} P / P[l(X) > ζ],   Y ∼ ξ|ζ = 1{Y; l(Y) ≤ ζ} P / P[l(Y) ≤ ζ].   (1)

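The conditioning in Eq. (1) has a direct empirical analogue: restrict the samples to the event {l > ζ} (or {l ≤ ζ}) and renormalize by that event's probability. The sketch below illustrates this with hypothetical exponentially distributed losses and an assumed threshold ζ = 1; none of these numbers come from the paper.

```python
import numpy as np

# hypothetical per-sample losses; zeta plays the role of the threshold in Eq. (1)
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=10_000)
zeta = 1.0

# empirical analogues of the truncated measures:
# mu|zeta keeps mass on {l(X) > zeta}, xi|zeta on {l(Y) <= zeta},
# each renormalized by the empirical probability of its event.
uncertain_mask = losses > zeta
certain_mask = ~uncertain_mask

p_uncertain = uncertain_mask.mean()   # estimates P[l(X) > zeta]
p_certain = certain_mask.mean()       # estimates P[l(Y) <= zeta]

# samples drawn from the two conditional (truncated) distributions;
# together they partition the original sample, so the two event
# probabilities sum to one (law of total probability)
mu_given_zeta = losses[uncertain_mask]
xi_given_zeta = losses[certain_mask]
```

The split is exhaustive and disjoint: every sample lands in exactly one of the two truncated measures, which is what lets the method treat certain and uncertain samples as two related distributions over the same logit space.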
