WASSERSTEIN DISTRIBUTIONAL NORMALIZATION: NONPARAMETRIC STOCHASTIC MODELING FOR HANDLING NOISY LABELS

Abstract

We propose a novel Wasserstein distributional normalization (WDN) algorithm to handle noisy labels for accurate classification. In this paper, we split our data into uncertain and certain samples based on the small-loss criterion. We investigate the geometric relationship between these two types of samples and enhance this relation to exploit useful information, even from uncertain samples. To this end, we impose geometric constraints on the uncertain samples by normalizing them into a Wasserstein ball centered on the certain samples. Experimental results demonstrate that our WDN outperforms other state-of-the-art methods on the Clothing1M and CIFAR-10/100 datasets, which contain diverse types of noisy labels. The proposed WDN is highly compatible with existing classification methods, meaning it can be easily plugged into various methods to significantly improve their accuracy.

1. INTRODUCTION

The successful results of deep neural networks (DNNs) on supervised classification tasks heavily rely on accurate and high-quality label information. However, annotating large-scale datasets is extremely expensive and time-consuming. Because obtaining high-quality datasets is very difficult, most conventional works have instead collected large-scale training data through crowd-sourcing platforms Yu et al. (2018), which inevitably introduces noisy labels into the annotated samples. While there are numerous methods that can deal with noisily labeled data, recent methods actively adopt the small-loss criterion, which enables the construction of classification models that are not susceptible to noise corruption. In this learning scheme, a neural network is first trained on easy samples in the early stages of training; harder samples are then gradually selected to train mature models as training proceeds. Jiang et al. (2018) suggested collaborative learning models, in which a mentor network delivers a data-driven curriculum loss to a student network. Han et al. (2018); Yu et al. (2019) proposed dual networks that jointly generate gradient information from easy samples and employ this information to teach each other. Wei et al. (2020) adopted a disagreement strategy, which determines the gradient information to update based on disagreement values between dual networks. Han et al. (2020) implemented accumulated gradients to let the optimization process escape over-parameterization and to obtain more generalized results.

In this paper, we tackle the major issues raised by the aforementioned methods based on the small-loss criterion, as follows. Through comprehensive experiments, these methods gained empirical insight into network behavior under noisy labels; however, theoretical and quantitative explanations have not been closely investigated.
In contrast, we give strong theoretical and empirical explanations for understanding networks under noisy labels. In particular, we present an in-depth analysis of the small-loss criterion in a probabilistic sense. We exploit the stochastic properties of noisily labeled data and develop probabilistic descriptions of data under the small-loss criterion, as follows. Let P be a probability measure for the pre-softmax logits of the training samples, l be an objective function for classification, and 1{·} be an indicator function. Then, our central objects are the truncated measures defined as

$$X \sim \mu|\zeta = \frac{\mathbb{1}\{X;\,l(X) > \zeta\}\,P}{P[l(X) > \zeta]}, \qquad Y \sim \xi|\zeta = \frac{\mathbb{1}\{Y;\,l(Y) \le \zeta\}\,P}{P[l(Y) \le \zeta]}, \tag{1}$$

where X and Y, which are sampled from µ|ζ and ξ|ζ, denote uncertain and certain samples defined in the pre-softmax feature space¹ (i.e., R^d), respectively. In equation 1, µ and ξ denote the probability measures of uncertain and certain samples, respectively, and ζ is a constant. Most previous works have focused on the usage of Y and the sampling strategy for ζ, but the poor generalization caused by the abundance of discarded uncertain samples X has not been thoroughly investigated, even though these samples potentially contain important information. To understand the effect of noisy labels on generalization bounds, we provide a concentration inequality for the uncertain measure µ, which renders the probabilistic relation between µ and ξ and the learnability of the network under noisy labels.

While most conventional methods Han et al. (2018); Wei et al. (2020); Li et al. (2019a); Yu et al. (2019) require additional dual networks to guide misinformed noisy samples, their scalability is not guaranteed due to the existence of dual architectures, which have the same number of parameters as the base network. To alleviate this problem, we build statistical machinery that is fully non-parametric, simple to implement, and computationally efficient, reducing the computational complexity of conventional approaches while maintaining the concept of the small-loss criterion.
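As a concrete reading of the truncated measures in equation 1, the following sketch partitions a mini-batch into certain and uncertain samples by thresholding per-sample losses at ζ. The function name and toy data are ours, not the paper's; this is only the empirical (sample-level) counterpart of the measure-level definition.

```python
import numpy as np

def split_small_loss(logits, losses, zeta):
    """Empirical version of the truncated measures in equation 1:
    samples with loss above zeta follow mu|zeta (uncertain),
    samples with loss at most zeta follow xi|zeta (certain).

    logits : (n, d) pre-softmax features
    losses : (n,)   per-sample classification losses l(.)
    zeta   : loss threshold
    """
    uncertain_mask = losses > zeta   # X ~ mu | zeta
    certain_mask = ~uncertain_mask   # Y ~ xi | zeta
    return logits[uncertain_mask], logits[certain_mask]

# toy batch: 6 samples in a 3-dimensional pre-softmax space
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
losses = np.array([0.1, 2.3, 0.4, 1.7, 0.2, 3.0])

X_uncertain, Y_certain = split_small_loss(logits, losses, zeta=1.0)
print(X_uncertain.shape, Y_certain.shape)  # (3, 3) (3, 3)
```

In practice, ζ is typically scheduled over training (e.g., keeping a growing fraction of small-loss samples), which is the sampling strategy for ζ discussed above.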
Based on the empirical observation of ill-behaved certain/uncertain samples, we propose a gradient flow in the Wasserstein space, which can be induced by simulating a non-parametric stochastic differential equation (SDE) of Ornstein-Uhlenbeck type to control the ill-behaved dynamics. The reason for selecting these dynamics is thoroughly discussed in the following sections. The key contributions of our work are as follows.

• We theoretically verify that there exists a strong correlation between model confidence and the statistical distance between X and Y. We empirically show that classification accuracy worsens when the upper bound ε of the 2-Wasserstein distance, W₂(µ, ξ) ≤ ε (i.e., the distributional distance between certain and uncertain samples), drastically increases. Due to the empirical nature of the upper bound ε, it can be used as an estimator to determine whether a network suffers from over-parameterization.

• Based on these empirical observations, we develop a simple, non-parametric, and computationally efficient stochastic model to control the observed ill-behaved sample dynamics. As a primal object, we propose stochastic gradient-flow dynamics (i.e., an Ornstein-Uhlenbeck process) simulated by a simple, non-parametric SDE. Thus, our method does not require any additional learning parameters.

• We provide important theoretical results. First, we derive the controllable upper bound ε with an inverse-exponential ratio, which indicates that our method can efficiently control the diverging effect of the Wasserstein distance. Second, we present a concentration inequality for the transported uncertain measure, which clearly renders the probabilistic relation between µ and ξ.
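To make the Ornstein-Uhlenbeck-type transport concrete, the sketch below runs an Euler-Maruyama simulation of the SDE dXₜ = −(Xₜ − m_Y) dt + σ dWₜ, whose drift pulls uncertain samples toward the empirical mean m_Y of the certain samples at an exponential rate, in the spirit of the inverse-exponential bound on ε mentioned above. This is an illustrative toy version under our own assumptions (mean-reversion target, step sizes, and function names are ours), not the paper's exact normalization scheme.

```python
import numpy as np

def ou_normalize(X, Y, n_steps=50, dt=0.05, sigma=0.1, seed=0):
    """Euler-Maruyama simulation of an Ornstein-Uhlenbeck-type SDE
        dX_t = -(X_t - m_Y) dt + sigma dW_t,
    drifting uncertain samples X toward the empirical mean m_Y of the
    certain samples Y. The deterministic part contracts the distance
    to m_Y roughly like exp(-t), so the distributional gap between the
    two sample sets shrinks as the simulation runs.
    """
    rng = np.random.default_rng(seed)
    m_Y = Y.mean(axis=0)
    X = X.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=X.shape)
        X += -(X - m_Y) * dt + sigma * np.sqrt(dt) * noise
    return X

rng = np.random.default_rng(1)
Y = rng.normal(loc=0.0, size=(100, 3))  # certain samples
X = rng.normal(loc=5.0, size=(100, 3))  # uncertain samples, far from Y
X_norm = ou_normalize(X, Y)

# the gap between the sample means shrinks after transport
d_before = np.linalg.norm(X.mean(0) - Y.mean(0))
d_after = np.linalg.norm(X_norm.mean(0) - Y.mean(0))
print(d_after < d_before)
```

Because the update uses only sample statistics and simulated noise, it is non-parametric in the sense used above: no learnable parameters are introduced beyond those of the base network.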



¹ Due to technical difficulties, we define our central objects on the pre-softmax space rather than the label space, i.e., the space of σ(X), σ(Y), where σ denotes the softmax function. Please refer to the Appendix for more details.




Sample Selection. Most of the aforementioned methods are based on sample selection frameworks. However, these methods only consider a small number of selected samples, and a large portion of the training samples are gradually eliminated and excluded by the end of training, which inevitably leads to poor generalization. By contrast, our method can extract useful information from the unselected samples X ∼ µ (i.e., uncertain samples) and enhance these samples (e.g., X ∼ Fµ) for more accurate classification. Chen et al. (2019) iteratively apply cross-validation to randomly partitioned noisily labeled data to identify the samples that most likely have correct labels; to generate such partitions, they adopt the small-loss criterion for selecting samples.

Loss Correction & Label Correction. Patrini et al. (2017a); Hendrycks et al. (2018); Ren et al. (2018) either explicitly or implicitly transformed noisy labels into clean labels by correcting classification losses. Unlike these methods, our method transforms the holistic information from uncertain samples into certain samples, which implicitly reduces the effects of potentially noisy labels. Because correcting label noise by modifying the loss dynamics does not perform well under extreme noise, Arazo et al. (2019) adopt a label augmentation method called MixUp Zhang et al. (2018).
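To illustrate the loss-correction family discussed above, the sketch below scores noisy labels under a forward-corrected predictive distribution in the spirit of Patrini et al. (2017a): the clean-class softmax outputs are pushed through a known noise transition matrix T before computing the negative log-likelihood. The function name, toy probabilities, and symmetric-noise T are our own illustrative assumptions, not the cited papers' exact formulation.

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, T):
    """Forward loss correction (illustrative sketch).

    probs        : (n, c) softmax outputs over clean classes
    noisy_labels : (n,)   observed (possibly corrupted) labels
    T            : (c, c) noise transition matrix,
                   T[i, j] = P(noisy label = j | clean label = i)
    Scores the observed labels under the corrupted distribution
    probs @ T rather than under probs itself.
    """
    noisy_probs = probs @ T  # predicted distribution over noisy labels
    n = probs.shape[0]
    picked = noisy_probs[np.arange(n), noisy_labels]
    return -np.mean(np.log(picked + 1e-12))

# toy setup: 3 classes with 20% symmetric label noise
c, eps = 3, 0.2
T = (1 - eps) * np.eye(c) + (eps / (c - 1)) * (1 - np.eye(c))
probs = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.7, 0.2]])
loss = forward_corrected_nll(probs, np.array([0, 1]), T)
print(round(loss, 4))
```

When T is estimated well, minimizing this corrected loss makes the clean-class predictions consistent despite label noise; our method instead sidesteps explicit correction by transporting uncertain samples, as described above.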

