WASSERSTEIN DISTRIBUTIONAL NORMALIZATION: NONPARAMETRIC STOCHASTIC MODELING FOR HANDLING NOISY LABELS

Abstract

We propose a novel Wasserstein distributional normalization (WDN) algorithm to handle noisy labels for accurate classification. In this paper, we split data into uncertain and certain samples based on the small-loss criterion. We investigate the geometric relationship between these two types of samples and enhance it to exploit useful information, even from uncertain samples. To this end, we impose geometric constraints on the uncertain samples by normalizing them into a Wasserstein ball centered on certain samples. Experimental results demonstrate that our WDN outperforms other state-of-the-art methods on the Clothing1M and CIFAR-10/100 datasets, which contain diverse types of label noise. The proposed WDN is highly compatible with existing classification methods, meaning it can easily be plugged into various methods to improve their accuracy significantly.

1. INTRODUCTION

The success of deep neural networks (DNNs) on supervised classification tasks relies heavily on accurate and high-quality label information. However, annotating large-scale datasets is extremely expensive and time-consuming. Because obtaining high-quality datasets is very difficult, many conventional works instead collect training data through crowd-sourcing platforms Yu et al. (2018), which inevitably introduces noisy labels into the annotated samples. While numerous methods can deal with noisy labeled data, recent methods actively adopt the small-loss criterion, which enables the construction of classification models that are not susceptible to noise corruption. In this learning scheme, a neural network is first trained using easy samples in the early stages of training; harder samples are then gradually selected to train mature models as training proceeds. Jiang et al. (2018) suggested collaborative learning models, in which a mentor network delivers a data-driven curriculum loss to a student network. Han et al. (2018); Yu et al. (2019) proposed dual networks that jointly generate gradient information using easy samples and employed this information to allow the networks to teach each other. Wei et al. (2020) adopted a disagreement strategy, which determines the gradient information to update based on disagreement values between dual networks. Han et al. (2020) implemented accumulated gradients to help the optimization process escape over-parameterization and to obtain more generalized results. In this paper, we tackle major issues raised by the aforementioned methods based on the small-loss criterion, as follows. Through comprehensive experiments, the aforementioned methods gain empirical insight into network behavior under noisy labels; however, theoretical and quantitative explanations have not been closely investigated.
In contrast, we give strong theoretical/empirical explanations for understanding the network under noisy labels. In particular, we present an in-depth analysis of the small-loss criterion in a probabilistic sense. We exploit the stochastic properties of noisy labeled data and develop probabilistic descriptions of data under the small-loss criterion, as follows. Let P be a probability measure for the pre-softmax logits of the training samples, l be an objective function for classification, and 1_{•} be an indicator function. Then, our central object is a pair of truncated measures defined as

$$X \sim \mu|\zeta = \frac{\mathbf{1}_{\{X;\, l(X) > \zeta\}}\, \mathbb{P}}{\mathbb{P}[l(X) > \zeta]}, \qquad Y \sim \xi|\zeta = \frac{\mathbf{1}_{\{Y;\, l(Y) \le \zeta\}}\, \mathbb{P}}{\mathbb{P}[l(Y) \le \zeta]}, \tag{1}$$

where X and Y, which are sampled from µ|ζ and ξ|ζ, denote uncertain and certain samples defined in the pre-softmax feature space (i.e., R^d), respectively. In equation 1, µ and ξ denote the probability measures of uncertain and certain samples, respectively, and ζ is a constant. Most previous works have focused on the usage of Y and the sampling strategy for ζ, but the poor generalization caused by an abundance of uncertain samples X has not been thoroughly investigated, even though these samples potentially contain important information. To understand the effect of noisy labels on generalization bounds, we provide a concentration inequality for the uncertain measure µ, which renders the probabilistic relation between µ and ξ and the learnability of the network under noisy labels. While most conventional methods Han et al. (2018); Wei et al. (2020); Li et al. (2019a); Yu et al. (2019) require additional dual networks to guide misinformed noisy samples, their scalability is not guaranteed due to the dual architectures, which have the same number of parameters as the base network.
To alleviate this problem, we build statistical machinery that is fully non-parametric, simple to implement, and computationally efficient, reducing the computational complexity of conventional approaches while maintaining the concept of the small-loss criterion. Based on empirical observations of ill-behaved certain/uncertain samples, we propose a gradient flow in the Wasserstein space, which can be induced by simulating a non-parametric stochastic differential equation (SDE) of Ornstein-Uhlenbeck type to control the ill-behaved dynamics. The reason for selecting these dynamics is thoroughly discussed in the following sections. The key contributions of our work are as follows. • We theoretically verify that there exists a strong correlation between model confidence and the statistical distance between X and Y. We empirically observe that classification accuracy worsens when the upper bound ε of the 2-Wasserstein distance, W_2(µ, ξ) ≤ ε (i.e., the distributional distance between certain and uncertain samples), drastically increases. Due to the empirical nature of the upper bound ε, it can be used as an estimator to determine whether a network suffers from over-parameterization. • Based on these empirical observations, we develop a simple, non-parametric, and computationally efficient stochastic model to control the observed ill-behaved sample dynamics. As a primal object, we propose the stochastic dynamics of a gradient flow (i.e., an Ornstein-Uhlenbeck process) simulated as a simple, non-parametric stochastic differential equation. Thus, our method does not require any additional learning parameters. • We provide important theoretical results. First, a controllable upper bound ε with an inverse exponential ratio is induced, which indicates that our method can efficiently control the diverging Wasserstein distance.
Second, a concentration inequality for the transported uncertain measure is presented, which clearly renders the probabilistic relation between µ and ξ. Prior methods (2018) either explicitly or implicitly transformed noisy labels into clean labels by correcting classification losses. Unlike these methods, our method transforms the holistic information from uncertain samples into certain samples, which implicitly reduces the effect of potentially noisy labels. Because correcting label noise by modifying the loss dynamics does not perform well under extreme noise, Arazo et al. (2019) adopted a label augmentation method called MixUp Zhang et al. (2018).
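The small-loss split of equation 1 can be sketched numerically. Below is a minimal NumPy illustration (function names and the quantile-based choice of ζ are our own, not from the paper's implementation) of partitioning a batch into certain samples Y (loss ≤ ζ) and uncertain samples X (loss > ζ):

```python
import numpy as np

def split_by_small_loss(losses, zeta):
    """Split sample indices into certain (loss <= zeta) and uncertain
    (loss > zeta) index sets, mirroring the truncated measures xi|zeta
    and mu|zeta in equation 1."""
    losses = np.asarray(losses)
    certain = np.where(losses <= zeta)[0]    # Y ~ xi|zeta
    uncertain = np.where(losses > zeta)[0]   # X ~ mu|zeta
    return certain, uncertain

def quantile_threshold(losses, rho):
    """One common way to pick zeta: the rho-quantile of the batch losses,
    so that roughly a fraction rho of the batch is treated as certain."""
    return np.quantile(np.asarray(losses), rho)
```

This split is purely index-based and adds no learnable parameters, which is consistent with the non-parametric design goal stated above.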

2. RELATED WORK

Distillation. Li et al. (2019b) updated mean-teacher parameters by calculating the exponential moving average of student parameters to mitigate the impact of gradients induced by noisy labels. Lukasik et al. (2020) deeply investigated the effects of label smoothing under noisy labels and linked label smoothing to loss correction in a distillation framework. Similar to these methods, our method leverages the useful properties of distillation models. We set ν as a pivot measure, which guides our normalization functional Fµ for uncertain measures. This is similar to self-distillation because uncertain training samples are forced to be normalized toward those of past states. Other methods.

3. DISTRIBUTIONAL NORMALIZATION

Because our main target object is a probability measure (distribution), we first define an objective function in a distributional sense. Let l be the cross-entropy loss and r̃ be a corrupted label random vector, obtained from a clean label r (which is independent of X) through an unknown label transition matrix Q. Then, a conventional objective function for classification with noisy labels can be defined as follows:

$$\min_{\mu} \mathcal{J}[\mu] = \min_{\mu} \mathbb{E}_{X \sim \mu,\, \tilde{r}|Q}\left[l(X; \tilde{r})\right]. \tag{2}$$

However, due to the significant corruption of label information, the conventional objective function in equation 2 cannot be used for accurate classification. Instead of directly using uncertain samples X ∼ µ as in previous works, we normalize µ in the form of a metric ball and present a holistic constraint. For a clear mathematical description, we first introduce the following definition.

Definition 1. (Wasserstein ambiguity set) Let $\mathcal{P}_2(\mathbb{R}^d) = \{\mu : \mathbb{E}_{\mu}[d_E^2(x_0, x)] < \infty,\ \forall x_0 \in \mathbb{R}^d\}$ be the 2-Wasserstein space, where d denotes the number of classes and d_E is the Euclidean distance on R^d. Then, we define a Wasserstein ambiguity set (i.e., metric ball) in this space as follows:

$$\mathcal{B}_{W_2}(\nu, \varepsilon) = \left\{\mu \in \mathcal{P}_2(\mathbb{R}^d) : W_2(\mu, \nu) \le \varepsilon\right\}, \tag{3}$$

where W_2 denotes the 2-Wasserstein distance and ν is a pivot measure. Then, we propose a new objective function by imposing geometric constraints on µ as follows:

$$\min_{\mathcal{F}\mu \in \mathcal{B}_{W_2}(\nu, \varepsilon),\, \xi} \mathcal{J}[\mathcal{F}\mu] + \mathcal{J}[\xi] = \min_{\theta} \mathbb{E}_{X \sim \mathcal{F}\mu_{\theta},\, \tilde{r}}\left[l(X; \tilde{r})\right] + \mathbb{E}_{Y \sim \xi_{\theta},\, \tilde{r}}\left[l(Y; \tilde{r})\right], \tag{4}$$

where $\mathcal{F} : \mathcal{P}_2(\mathbb{R}^d) \to \mathcal{P}_2(\mathbb{R}^d)$ is a functional on probability measures that enforces the constraint on Fµ (i.e., Fµ ∈ B_{W_2}(ν, ε)) and is our main object of study. The right-hand side of equation 4 is the vectorial form equivalent to the distributional form on the left-hand side. While our main objects are defined on pre-softmax logits, both probability measures µ_θ and ξ_θ are parameterized by a neural network with parameters θ.
This newly proposed objective function uses the geometrically enhanced version Fµ of the uncertain measure together with the certain measure ξ. In equation 4, the pivot measure ν is defined as follows:

$$\nu = \arg\min_{\xi_j,\, j \in I_{k-1}} \mathcal{J}[\xi_j], \qquad I_{k-1} = \{1, \cdots, k-1\},$$

where ξ_k denotes the certain measure at the current k-th iteration. In other words, our method finds the probability measure that best represents all certain samples observed so far during training, and the uncertain measures are transported to lie in the Wasserstein ball centered on ν. In equation 4, the Wasserstein constraint on Fµ enforces that uncertain measures statistically resemble ν from a geometric perspective (i.e., W_2(ν, Fµ) ≤ ε). Now, an important question naturally stems from this analysis: how can we select the optimal radius ε? Clearly, finding an F that induces a small ε ≈ 0 is suboptimal, because then Fµ ≈ ν, and using the objective J[Fµ ≈ ν] leads to the following critical problem: as optimization proceeds, the enhanced uncertain samples X ∼ Fµ contribute less and less, because they are statistically identical to ν, meaning our objective in equation 4 would receive little benefit from these transported uncertain samples. By contrast, if we adopt a large radius ε, the enhanced uncertain samples will be statistically and geometrically unrelated to ν, which causes the normalized measure Fµ to yield large losses and violates our objective. To overcome these two problems and select the radius, we make a detour through a Gaussian measure on the path between ν and Fµ (i.e., ν → N(m_ν, Σ_ν) → Fµ) rather than directly calculating the geodesic between ν and Fµ (i.e., ν → Fµ). Specifically, we decompose the original constraint in equation 4 into two terms using the triangle inequality of the Wasserstein distance:

$$W_2(\nu, \mathcal{F}\mu) \le \varepsilon = \underbrace{W_2\!\left(\nu, \mathcal{N}(m_\nu, \Sigma_\nu)\right)}_{d_1:\ \text{intrinsic statistics}} + \underbrace{W_2\!\left(\mathcal{N}(m_\nu, \Sigma_\nu), \mathcal{F}\mu\right)}_{d_2:\ \text{Wasserstein normalization}}. \tag{5}$$

The first, intrinsic-statistics term sets a detour point as a Gaussian measure whose mean and covariance match those of ν (i.e., m_ν = E_{Y∼ν}[Y] and Σ_ν = Cov_{Y∼ν}[Y]). The Wasserstein upper bound of this term depends only on the statistical structure of ν, because (m_ν, Σ_ν) is determined by ν. Thus, this term induces a data-dependent, non-zero constant upper bound whenever ν ≠ N(m_ν, Σ_ν), and it prevents the upper bound from collapsing to ε → 0, regardless of F. This is a huge advantage when dealing with ε, because the first term can be treated as a fixed constant during training. The second, normalization term represents our central objective: F facilitates geometric manipulation in the Wasserstein space and prevents the uncertain measure µ from diverging, where µ is normalized onto the Wasserstein ambiguity set B_{W_2}(ν, ε), as shown in Fig. 1. The theoretical/numerical advantages of setting the detour measure to be Gaussian are explained in the following section.
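The Gaussian detour is convenient partly because 2-Wasserstein distances involving Gaussian measures admit closed forms. As a reference point (this is the standard Bures-Wasserstein formula, not code from the paper), the distance between two Gaussians can be computed directly:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """Closed-form 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    W2^2 = ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    rS2 = sqrtm(S2)
    cross = np.real(sqrtm(rS2 @ S1 @ rS2))  # sqrtm may return tiny imaginary parts
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))
```

For instance, two unit-covariance Gaussians whose means differ by a vector of length 5 are exactly W_2-distance 5 apart, since the covariance term vanishes.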

3.1. WASSERSTEIN NORMALIZATION

In the previous section, we presented a novel objective function that imposes a geometric constraint on µ such that the transformed measure Fµ lies in B_{W_2}(ν, ε). Now, we specify F and relate it to the Gaussian measure (more generally, a Gibbs measure). For simplicity, we denote N_ν = N(m_ν, Σ_ν).

Proposition 1. Let $\mathcal{F} : \mathbb{R}^+ \times \mathcal{P}_2 \to \mathcal{P}_2$ be a functional on probability measures such that F[t, µ] = µ_t, where p_t and q_t denote the densities of µ_t and N_ν, respectively, and µ_t is a solution to the following continuity equation:

$$\partial_t \mu_t = \nabla \cdot (\mu_t v_t), \tag{6}$$

which reads $\partial_t p(t, x) = \nabla \cdot \left(p(t, x)\, \nabla \log q(t, x)\right)$ in the distributional sense. Then, the uniquely defined functional F_t[•] = F[t, •] normalizes µ onto $\mathcal{B}_{W_2}(\mathcal{N}_\nu, e^{-t} K_2(\mu))$, where K_2(µ) > 0 is a constant that depends on µ.

It is well known that the solution to equation 6 induces a geodesic in the 2-Wasserstein space (Villani (2008)), which is the shortest path from µ = µ_{t=0} to N_ν. The functional F_t generates a path for µ_t along which the distance decays exponentially in the auxiliary variable t with constant K_2, meaning $W_2(\mathcal{N}_\nu, \mathcal{F}_t\mu) \le K_2 e^{-t}$. This theoretical result indicates that the Wasserstein distance in the second term of equation 5 can be reduced/controlled at an exponential ratio. Thus, by setting different values of t, our method can efficiently control the diverging distance in equation 5. Unfortunately, it is typically intractable to compute the partial differential equation (PDE) in equation 6.
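The exponential contraction in Proposition 1 can be illustrated in a simplified special case (our own illustration, not the general construction): in one dimension with a Gaussian initial measure and unit relaxation rate, the OU semigroup targeting N(m, s²) maps N(m₀, s₀²) to a Gaussian whose W₂-distance to the target shrinks at least as fast as e^{-t}:

```python
import numpy as np

def w2_gauss_1d(m1, s1, m2, s2):
    """2-Wasserstein distance between 1-D Gaussians N(m1, s1^2), N(m2, s2^2)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def ou_pushforward(m0, s0, m, s, t):
    """Gaussian image of N(m0, s0^2) under the OU semigroup targeting N(m, s^2):
    the mean relaxes as m + e^{-t}(m0 - m) and the variance as
    s^2 + e^{-2t}(s0^2 - s^2)."""
    mt = m + np.exp(-t) * (m0 - m)
    st = np.sqrt(s ** 2 + np.exp(-2 * t) * (s0 ** 2 - s ** 2))
    return mt, st
```

Checking W₂(µ_t, N_ν) ≤ e^{-t} W₂(µ₀, N_ν) numerically for several t confirms the e^{-t} decay rate claimed for the Wasserstein geodesic in this toy setting.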

Algorithm 1 Wasserstein Distributional Normalization

Require: α ∈ [0, 0.2], ρ ∈ [0.1, 0.65], T = 64, ∆t = 10^{-4}, τ = 0.001
for k = 1 to K (i.e., the total number of training iterations) do
  1) Select (1 − ρ)N uncertain and ρN certain samples from the mini-batch of size N:
     {Y_k^n}_{n ≤ ρN} ∼ ξ_k, {X_k^n}_{n ≤ (1−ρ)N} ∼ µ_k
  2) Update the most certain measure ν:
     if J[ξ_k] < J[ν] then ν ← ξ_k, m_ν ← E[Y_k], and Σ_ν ← Cov[Y_k] end if
  3) Update the moving geodesic average N(m_α, Σ_α):
     Solve the Riccati equation T Σ_ν T = Σ_{ξ_k}.
     Σ_α = ((1 − α)I_d + αT) Σ_ν ((1 − α)I_d + αT) and m_α = (1 − α)m_ν + α m_{ξ_k}
  4) Simulate the discrete SDE for T steps:
     for t = 0 to T − 1 do
       X_{k,t+1}^n = X_{k,t}^n − ∇φ(X_{k,t}^n; m_α)∆t + √(2τ^{-1}∆t) Σ_α^{1/2} Z_t^n, s.t. X_{k,t=0}^n ∼ µ_k, X_{k,t=T}^n ∼ F_T µ_k
     end for
  5) Update the network with the objective function:
     J[F_T µ_k] + J[ξ_k] = E_{F_T µ_k}[l(X_{k,T}; r̃)] + E_{ξ_k}[l(Y_k; r̃)]
end for

To solve this problem, we adopt particle-based stochastic dynamics, which enable tractable computation. There exists a unique iterative form corresponding to the PDE in equation 6, called the multi-dimensional Ornstein-Uhlenbeck process, which can be approximated using particle-based dynamics. In particular, we draw (1 − ρ)N uncertain samples from a single batch of N samples using equation 1 for a hyper-parameter 0 ≤ ρ ≤ 1. We then simulate a discrete stochastic differential equation (SDE) for each particle using the Euler-Maruyama scheme as follows:

$$X_{t+1}^n = X_t^n - \nabla\phi(X_t^n; m_\nu)\,\Delta_t + \sqrt{2\tau^{-1}\Delta_t}\, \Sigma^{\frac{1}{2}} Z^n, \tag{7}$$

where $\phi(X_t; m_\nu) = \frac{\tau}{2} d_E^2(X_t, m_\nu)$, n ∈ {1, ..., (1 − ρ)N}, d_E is the Euclidean distance, N is the mini-batch size, and Z^n ∼ N(0, I). We selected the OU process as our stochastic dynamics for the following reasons. First, we want to build a computationally efficient, non-parametric method to estimate/minimize the second term of equation 5.
The SDE in equation 7, corresponding to the OU process, has a simple form with fixed drift and diffusion terms that are invariant over time, which allows us to induce a non-parametric representation of the SDE simulation. Because simulating equation 7 amounts to a non-parametric for-loop in the implementation, our method is computationally very efficient compared to other baseline methods such as Han et al. (2018). Second, when estimating the empirical upper bound of the Wasserstein distance, the OU process allows us to use an explicit form called Mehler's formula, which can be estimated efficiently (please refer to the Appendix for more details). The overall procedure for our method is summarized in Algorithm 1.
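Step 4 of Algorithm 1 can be sketched as a short NumPy loop. This is a minimal illustration under our reading of equation 7: the drift uses ∇φ(X; m) = τ(X − m), and the diffusion factor is taken as a covariance square root Σ^{1/2}; function and argument names are illustrative, not the paper's released code:

```python
import numpy as np

def ou_normalize(X, m, Sigma_root, tau=0.001, dt=1e-4, T=64, seed=0):
    """Euler-Maruyama simulation of the OU-type SDE (equation 7).
    The drift -grad phi(X; m) = -tau * (X - m) pulls uncertain logits
    toward the pivot mean m; the diffusion term adds
    sqrt(2 * dt / tau) * Sigma^{1/2} Z Gaussian noise per step.
    X: (n, d) particles, m: (d,) pivot mean, Sigma_root: (d, d)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    for _ in range(T):
        Z = rng.standard_normal(X.shape)
        X = X - tau * (X - m) * dt + np.sqrt(2.0 / tau * dt) * Z @ Sigma_root.T
    return X
```

With the noise switched off (Σ^{1/2} = 0), each step contracts the particles toward m by a factor (1 − τ∆t), which makes the deterministic part of the dynamics easy to verify.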

3.2. WASSERSTEIN MOVING GEODESIC AVERAGE

In our experiments, we observe that the best measure ν is not updated for a few epochs after training begins. This is problematic because ν diverges significantly from the current certain measure ξ_k, which is equivalent to the normalized measure Fµ_k diverging from ξ_k, meaning X_T and Y become increasingly statistically inconsistent. To alleviate this statistical distortion, we modify the detour measure from N_ν to another Gaussian measure, which allows us to capture the statistics of both ξ_k and ν. Inspired by the moving average of Gaussian parameters in batch normalization Ioffe & Szegedy (2015), we propose the Wasserstein moving geodesic average. Specifically, we replace the Gaussian parameters {m_ν, Σ_ν} with {m_α, Σ_α} such that $m_\alpha = (1-\alpha)m_\nu + \alpha m_{\xi_k}$ and $\Sigma_\alpha = ((1-\alpha)I_d + \alpha T)\, \Sigma_\nu\, ((1-\alpha)I_d + \alpha T)$, where T is a solution to the Riccati equation $T \Sigma_\nu T = \Sigma_{\xi_k}$. Therefore, our final detour Gaussian measure is set to $\mathcal{N}_\nu^\alpha := \mathcal{N}(m_\alpha, \Sigma_\alpha)$, with $0 \le \alpha \le \frac{1}{2}$.
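The Riccati equation T Σ_ν T = Σ_ξ has the standard closed-form solution T = Σ_ν^{-1/2} (Σ_ν^{1/2} Σ_ξ Σ_ν^{1/2})^{1/2} Σ_ν^{-1/2} (for positive-definite Σ_ν), which is enough to implement the moving geodesic average. A sketch (function name is ours; assumes Σ_ν is invertible):

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def wasserstein_moving_geodesic(m_nu, S_nu, m_xi, S_xi, alpha=0.2):
    """Geodesic interpolation between N(m_nu, S_nu) and N(m_xi, S_xi).
    T solves the Riccati equation T S_nu T = S_xi via
    T = S_nu^{-1/2} (S_nu^{1/2} S_xi S_nu^{1/2})^{1/2} S_nu^{-1/2}.
    Returns the interpolated mean m_alpha and covariance Sigma_alpha."""
    r = np.real(sqrtm(S_nu))
    r_inv = inv(r)
    T = r_inv @ np.real(sqrtm(r @ S_xi @ r)) @ r_inv
    A = (1 - alpha) * np.eye(len(m_nu)) + alpha * T
    return (1 - alpha) * m_nu + alpha * m_xi, A @ S_nu @ A.T
```

At α = 0 this recovers (m_ν, Σ_ν), and at α = 1 it recovers (m_ξ, Σ_ξ), so intermediate α values trace the Wasserstein geodesic between the two Gaussians.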

4. THEORETICAL ANALYSIS

In equation 5, we select the detour point as a Gaussian measure because this measure provides a statistical structure similar to that of the optimal ν. In addition to this heuristic motivation, setting the detour point to a Gaussian measure (Gibbs measure) also provides theoretical advantages, e.g., a theoretical upper bound for the Wasserstein constraint terms. In this section, we investigate the explicit upper bounds of the two terms in equation 5, which are naturally induced by the SDE.

Proposition 2. There exists a scalar 0 < β < ∞ depending on ν such that the following inequality holds:

$$W_2(\nu, \mathcal{F}_t\mu) \le \varepsilon = K_1(\nu) \vee \left(e^{-t}K_2(\mu) + K_2(\nu)\right), \tag{8}$$

where λ_max(Σ_ν) denotes the maximum eigenvalue of the covariance matrix Σ_ν and, for some constant 0 < K_1 < ∞, we have $K_1(\nu) = d\beta\lambda_{\max}(\Sigma_\nu) + \mathbb{E}_\nu\|Y\|^2$, which depends only on ν. Intuitively, K_2(µ) can be interpreted as an indicator of how diffuse the uncertain measure µ is, whereas the term e^{-t}K_2(µ) controls the upper bound of the Wasserstein distance through the variable t. The other term K_2(ν) does not vanish even for very large t, which ensures a non-collapsing upper bound ε.

Proposition 3. (Concentration inequality for the normalized uncertain measure). Assume that there exist constants T ∈ [1/η, ∞), η ≥ 0 such that the following inequality holds:

$$\mathbb{E}_{\mathcal{F}_T\mu}[f^2] - \left(\mathbb{E}_{\mathcal{F}_T\mu}[f]\right)^2 \le (1+\eta)\, \mathbb{E}_{\mathcal{F}_T\mu}\left[\nabla f^T A\, \nabla f\right], \quad f \in C_0^\infty(\mathbb{R}^d), \tag{9}$$

for $A \in \mathrm{Sym}_d^+$ with $D(A, \Sigma_\nu) \le a\eta$ for some a > 0 and any metric D defined on $\mathrm{Sym}_d^+$. In this case, there is a δ such that the following probability inequality for the uncertain measure is induced:

$$\mathcal{F}_T\mu\left(\left|\sigma - \mathbb{E}_\nu[\sigma]\right| \ge \delta\right) \le 6\, e^{-\frac{\sqrt{2}\,\delta^{3/2}}{K_2(\mu)}}, \tag{10}$$

where σ denotes the soft-max function. Equation 10 shows that the label information induced by the normalized uncertain measure is close to that of the most certain measure, E_ν[σ], where the upper bound depends exponentially on the initial diffuseness of µ (i.e., K_2(µ)).
Because the upper bound of the probability inequality does not collapse to zero and F T µ is concentrated around the most certain labels (i.e., E ν [σ]), the uncertain sample X T ∼ F T µ helps our method avoid over-parameterization.

4.1. EMPIRICAL UNDERSTANDINGS

We investigate the theoretical upper bound of the Wasserstein ambiguity (i.e., the radius of the Wasserstein ball) for Fµ and its corresponding probability inequality. To provide more in-depth insights into the proposed method, we approximate the upper bound and demonstrate that our Wasserstein normalization actually makes neural networks more robust to label noise. As verified in Proposition 2, the following inequality holds:

$$W_2(\mathcal{F}_t\mu, \nu) \le \varepsilon = K_1(\nu) \vee \left(K_2(\nu) + K_2(\mathcal{F}_t\mu)\right). \tag{11}$$

Because the first term K_1(ν) is a constant dependent on ν and generally small compared to the second term for t ≤ T, we only examine the behavior of the second term K_2(ν) + K_2(F_tµ), which can be efficiently approximated in a simple form. Because our detour measure is Gaussian, the following inequality holds for any $h \in C_0^\infty(\mathbb{R}^d)$:

$$\hat{K}_2(\mu) = \lim_{s \to 0} \frac{1}{s}\, \mathbb{E}_{X \sim \mu,\, Z \sim \mathcal{N}(0, I)}\left[h\!\left(e^{-s}X + \sqrt{1 - e^{-2s}}\left(\Sigma_\nu^{\frac{1}{2}} Z + m_\nu\right)\right) - h(X)\right] \le K_2(\mu), \tag{12}$$

where equality holds if h is selected to attain the supremum over the set C_0^∞. For approximation, we simply use $h(X) = \|X\|^2$ as a test function. In this case, the following inequality naturally holds:

$$\hat{\varepsilon} = \hat{K}_2(\nu) + \hat{K}_2(\mathcal{F}\mu) \le K_2(\nu) + K_2(\mathcal{F}\mu) \le K_1(\nu) \vee \left(K_2(\nu) + K_2(\mathcal{F}\mu)\right) = \varepsilon.$$

Thus, ε̂ can be considered an approximation of the theoretical upper bound ε suggested in Proposition 2. Subsequently, we investigate the effects of Wasserstein normalization through K̂_2(µ) in equation 12. (1) The proposed WDN ensures that the Wasserstein ambiguity is bounded. We examine the relation between ε̂ and test accuracy in an experiment on the CIFAR-10 dataset with symmetric noise at a ratio of 0.5. Fig. 2 presents the landscape of the log_10-scaled cumulative average of ε̂ and the test accuracy over epochs. The red dotted lines represent the landscape of the vanilla network with cross-entropy loss, where $\hat{\varepsilon}_k = \hat{K}_2(\nu_k) + \hat{K}_2(\mathcal{F}_{t=0}\mu_k)$ and k is the epoch index.
In this case, the time constant t is set to zero because Wasserstein normalization is not employed in the vanilla network. The black lines indicate the landscape of the proposed method, where $\hat{\varepsilon}_k = \hat{K}_2(\nu_k) + \hat{K}_2(\mathcal{F}_{t=T}\mu_k)$.

Figure 2: Relation between the approximated upper bound ε̂ and test accuracy.

It is noteworthy that the test accuracy of the vanilla network begins to decrease after 13 epochs (red dotted vertical lines in the top-right plot), whereas the Wasserstein ambiguity (i.e., the upper bound of the Wasserstein distance) increases quadratically in the top-left plot. These experimental results verify that the distance between the uncertain and most certain measure (i.e., ν) becomes large in the 2-Wasserstein space without any constraints in vanilla networks. They also indicate a definite relationship between Wasserstein ambiguity and test accuracy. In the proposed WDN, the Wasserstein ambiguity is efficiently bounded (i.e., $\limsup_k \hat{\varepsilon}_k \approx 2.15$) while the test accuracy continues to increase, even after 13 epochs. For a detailed analysis, we compute the deviation of the empirical upper bound as $\hat{\Delta}_k = \hat{\varepsilon}_k - \hat{\varepsilon}_{k-1}$. In the gray regions, the deviation for the vanilla network is greater than 2.5 × 10^{-2}, i.e., $\hat{\Delta}_k > 2.5 \times 10^{-2}$; its test accuracy then begins to drop, as shown in Fig. 2. In contrast to the vanilla network, the maximum deviation of the proposed WDN is bounded above by a very small value ($\sup_k \hat{\Delta}_k \le 8 \times 10^{-3}$). (2) The proposed WDN helps networks escape from over-parameterization. To analyze the behavior of deep neural networks under over-parameterization with and without the proposed WDN, we design several variants of the WDN that begin at delayed epochs. The green, orange, and blue curves in the second row of Fig. 2 represent the landscapes when our WDN is applied after k_d ∈ {10, 15, 20} epochs, respectively.
In this experiment, the upper bound ε̂_k is defined as

$$\hat{\varepsilon}_k = \begin{cases} \hat{K}_2(\nu_k) + \hat{K}_2(\mathcal{F}_{t=0}\mu_k), & \text{if } k < k_d, \\ \hat{K}_2(\nu_k) + \hat{K}_2(\mathcal{F}_{t=T}\mu_k), & \text{if } k \ge k_d. \end{cases} \tag{13}$$

Consider k_d = 20, which is represented by the blue dotted vertical lines. Before our WDN is applied (i.e., k < k_d), the network suffers from over-parameterization, which induces a significant performance drop, as indicated by the blue curve in the bottom-right plot. However, the network rapidly recovers to normal accuracy once Wasserstein normalization is applied (i.e., k ≥ k_d). Note that similar behavior can be observed in the green and orange curves. In particular, the orange curve exhibits less fluctuation than the blue curve in terms of test accuracy. This indicates that the proposed WDN can help a network escape from over-parameterization by imposing the proposed geometric constraints in the Wasserstein space. (3) The proposed WDN can derive data-dependent bounds for different noise levels. Another interesting point in Fig. 2 is that all curves, excluding the red curve, converge to specific numbers: $2.15 = \underline{\varepsilon} := \liminf_k \hat{\varepsilon}_k \le \limsup_k \hat{\varepsilon}_k =: \bar{\varepsilon} = 2.2$. The upper bound ε̄ is neither overly enlarged nor collapsed to zero, while the lower bound ε̲ is identical for all curves. We argue that this behavior stems from the geometric characteristics of the proposed method, in which the first term in equation 5, namely $W_2(\nu, \mathcal{N}_\nu) \propto \hat{K}_2(\nu)$, is a non-zero, data-dependent term that is minimized by the proposed geometric constraint. Therefore, we can derive the following relationship:

$$W_2(\nu, \mathcal{F}\mu)\ \downarrow\ \ \le\ \ W_2(\nu, \mathcal{N}_\nu) + W_2(\mathcal{N}_\nu, \mathcal{F}\mu)\ \downarrow\ \ \propto\ \ \hat{K}_2(\nu) + \hat{K}_2(\mathcal{F}\mu) = \hat{\varepsilon}\ \downarrow. \tag{14}$$

This empirical observation verifies that a detour point set as a Gaussian measure induces the data-dependent bounds (ε̲, ε̄), which can vary according to different noise levels and efficiently leverage data-dependent statistics. Fig. 2 indicates that classification models with more stable data-dependent bounds also exhibit more stable convergence in test accuracy.
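The surrogate K̂₂(µ) of equation 12 can be estimated by Monte Carlo with a small finite s in place of the limit s → 0. The sketch below (our own finite-difference illustration with h(X) = ‖X‖²; names and the choice of s are illustrative) shows the computation:

```python
import numpy as np

def k2_estimate(X, m_nu, Sigma_root, s=1e-3, seed=0):
    """Monte Carlo estimate of the upper-bound surrogate in equation 12
    with test function h(X) = ||X||^2. A small s > 0 approximates the
    limit s -> 0 of the Mehler-type finite difference.
    X: (n, d) uncertain logits, m_nu: (d,) pivot mean, Sigma_root: (d, d)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(X.shape)
    Y = np.exp(-s) * X + np.sqrt(1 - np.exp(-2 * s)) * (Z @ Sigma_root.T + m_nu)
    h = lambda V: np.sum(V ** 2, axis=1)
    return float(np.mean(h(Y) - h(X)) / s)
```

As a sanity check, for X concentrated at the origin with m_ν = 0 and Σ_ν = I in d dimensions, the estimator approaches 2d as s → 0, since h(Y) − h(X) ≈ (1 − e^{-2s})‖Z‖² and E‖Z‖² = d.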

5. EXPERIMENTS

5.1 EXPERIMENTS ON THE CIFAR-10/100 DATASET

We used settings similar to those proposed by Laine & Aila (2016); Han et al. (2018) for our experiments on the CIFAR-10/100 datasets. We used a 9-layer CNN as the baseline architecture with a batch size of 128. We used the Adam optimizer with (β_1, β_2) = (0.9, 0.99), where the learning rate linearly decreased from 10^{-3} to 10^{-5}. Synthetic Noise. We injected label noise into clean datasets using a noise transition matrix Q_{i,j} = Pr(r̃ = j | r = i), where a noisy label r̃ is obtained from a true clean label r. We defined Q_{i,j} by following the approach of Han et al. (2018). For symmetric noise, we used the polynomial ρ = −1.11r² + 1.78r + 0.04 for 0.2 ≤ r ≤ 0.65, where r is the noise ratio. For asymmetric noise, we set ρ to 0.35. To select the enhanced detour measure, we set α to 0.2 for the Wasserstein moving geodesic average in all experiments. We trained our classification model over 500 epochs because the test accuracy of our method continued to increase, whereas those of the other methods did not. We compared our method with other state-of-the-art methods, including MentorNet Jiang et al. (2018). As shown in Table 1, the proposed WDN significantly outperformed the other baseline methods. Note that our WDN uses a simple Gaussian measure as the target pivot measure; thus, there are potential risks when handling highly concentrated and non-smooth types of noise (e.g., asymmetric noise). Nevertheless, the proposed WDN still produced accurate results, even with asymmetric noise. In this case, a variant of our WDN (i.e., WDN_cot) exhibited the best performance. Open-set Noise. In this experiment, we considered the open-set noisy scenario suggested by Wang et al. (2018), in which a large number of training images are sampled from the CIFAR-100 dataset but are still labeled according to the classes of the CIFAR-10 dataset. We used the same 9-layer CNN as in our previous experiment.
For the hyper-parameters, we set ρ and α to 0.5 and 0.2, respectively. As shown in Table 2, our method achieved state-of-the-art accuracy. Collaboration with Other Methods. Because our core methodology is based on the small-loss criterion, our method can collaborate with co-teaching methods. In Han et al. (2018), only certain samples (Y ∼ ξ) were used for updating the colleague networks, where the number of uncertain samples gradually decreased until it reached a predetermined portion. To enhance potentially bad statistics for co-teaching, we taught the dual networks using the set of samples (Y, X_T), where X_T ∼ F_T µ are uncertain samples enhanced using equation 7. Table 1 shows the test accuracy of the proposed collaboration model with a co-teaching network (WDN_cot). This collaboration model achieved the most accurate performance on the CIFAR-100 dataset with asymmetric noise, which verifies that our WDN can be integrated into existing methods to improve their performance significantly, particularly when the density of pre-logits is highly concentrated. Fig. 3 reveals that co-teaching quickly falls into over-parameterization and exhibits a drastic drop in accuracy after the 15th epoch. WDN_cot also exhibits a slight accuracy drop, but it surpassed the baseline co-teaching method by a large margin (+7%) during training. This demonstrates that our enhanced samples X_T can alleviate the over-parameterization issues faced by conventional co-teaching models, which helps improve their accuracy significantly.
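The symmetric-noise injection described above follows the standard transition-matrix recipe; a minimal NumPy sketch (function names are ours; the actual experiments use the open-source generator of Han et al. (2018)):

```python
import numpy as np

def symmetric_noise_matrix(num_classes, noise_ratio):
    """Build Q[i, j] = Pr(noisy = j | clean = i) for symmetric noise:
    keep the clean label with probability 1 - noise_ratio and spread
    noise_ratio uniformly over the remaining classes."""
    Q = np.full((num_classes, num_classes), noise_ratio / (num_classes - 1))
    np.fill_diagonal(Q, 1.0 - noise_ratio)
    return Q

def inject_noise(labels, Q, seed=0):
    """Resample each label from the row of Q indexed by its clean label."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(Q), p=Q[y]) for y in labels])
```

Each row of Q sums to one, so the noisy labels remain a valid categorical distribution; a noise ratio of zero leaves the labels unchanged.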

5.2. EXPERIMENTS ON A REAL-WORLD DATASET

To evaluate our method on real-world datasets, we employed the Clothing1M dataset presented by Xiao et al. (2015), which consists of 1M noisy, labeled, large-scale cloth images with 14 classes collected from shopping websites. It contains 50K, 10K, and 14K clean images for training, testing, and validation, respectively. We used only the noisy set for training and the clean set for testing. We set α = 0.2 and ρ = 0.1. For a fair comparison, we followed the settings suggested in previous works. We used a pre-trained ResNet-50 as the baseline architecture with a batch size of 48. For pre-processing, we applied a random center crop, random flipping, and normalization to 224 × 224 pixels. We adopted the Adam optimizer with a learning rate starting at 10^{-5} that linearly decayed to 5 × 10^{-6} at 24K iterations. Regarding the baseline methods, we compared the proposed method to GCE Zhang & Sabuncu (2018).

6. CONCLUSION

We proposed a novel method called WDN for accurate classification with noisy labels. The proposed method normalizes uncertain measures toward data-dependent Gaussian measures by imposing geometric constraints in the 2-Wasserstein space. We simulated the discrete SDE using the Euler-Maruyama scheme, which makes our method fast, computationally efficient, and non-parametric. In our theoretical analysis, we derived an explicit upper bound for the proposed Wasserstein normalization and experimentally demonstrated a strong relationship between this upper bound and over-parameterization. We conducted experiments on both the CIFAR-10/100 and Clothing1M datasets. The results demonstrated that the proposed WDN significantly outperforms other state-of-the-art methods.

A OPEN-SOURCE DATASET

Transition matrix for CIFAR-10/100. For the experiment summarized in Table 1, we used open-source code to generate the noise transition matrix of Han et al. (2018), as well as the 9-layer CNN architecture (https://github.com/bhanML/Co-teaching). Open-set noise. For the experiment summarized in Table 2, we used the same open-set noisy-label dataset presented by Lee et al. (2019) (https://github.com/pokaxpoka/RoGNoisyLabel). Clothing1M. For the experiment summarized in Table 3, we used the open-source dataset presented by Xiao et al. (2015) (https://github.com/Cysu/noisy_label). Because the solution to the Fokker-Planck equation can be explicitly calculated without any additional parameters, our method is fully non-parametric (in terms of parameters beyond those of the original neural network). By contrast, co-teaching is parametric because it requires a clone network whose additional parameters are copies of those of the original network. Similarly, MLNT requires an additional teacher network for training, which also contains a large number of parameters.

B COMPARISONS TO RELATED WORKS

Many methods based on the small-loss criterion select only certain samples, whereas our method uses the combination of ρN certain and (1 − ρ)N normalized uncertain samples. Therefore, our method can fully leverage each training batch, where (1 − ρ)N + ρN = N. Additionally, our method does not assume any class-dependent prior knowledge. Rather than considering class-wise prior knowledge, our method uses holistic information from both certain and uncertain samples (i.e., Y and X_T) in the logit space. Other meta-class-based models, such as MLNT, assume class-wise meta prior knowledge from a teacher network. Arazo et al. (2019) assumed a beta-mixture model as the label distribution on the label space. However, because the type of noisy label distribution is non-deterministic, this model sometimes fails to train under extremely non-uniform noise; for example, Arazo et al. (2019) reported a failure case on the Clothing1M dataset. The fundamental assumption on the noise model of mixup may be improved in future work. Similar to that method, our method has trouble dealing with synthetic asymmetric noise at high ratios, where a relatively large performance drop is observed in Table 1 (although our method still produces the second-best performance in the table). The most recent work, Li et al. (2019a), also adopts co-training by implementing an additional dual network, but with a much more sophisticated methodology called Co-divide/guessing based on semi-supervised learning (SSL). We predict that the Wasserstein distance between labeled and unlabeled probability measures is well-controlled in their method, and we believe that applying the OT/Markov theory presented in our paper to their method would broaden the understanding of the LNL problem. In contrast to sample-weighting methods such as GCE and NPCL, which require prior knowledge regarding the cardinality of the training samples to be weighted, our method is free from such assumptions because our Wasserstein normalization is applied in a batch-wise manner.
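The batch composition described above (ρN certain plus (1 − ρ)N uncertain samples) can be sketched as follows. This is a hypothetical minimal version of the small-loss split, assuming the per-sample losses are already computed by the network:

```python
import numpy as np

def split_by_small_loss(losses: np.ndarray, rho: float):
    """Small-loss criterion: the rho * N lowest-loss samples are treated as
    'certain', the remaining (1 - rho) * N as 'uncertain'."""
    n_certain = int(round(rho * len(losses)))
    order = np.argsort(losses)            # ascending: smallest losses first
    return order[:n_certain], order[n_certain:]

# Toy per-sample losses for a batch of N = 6.
losses = np.array([0.11, 4.8, 0.25, 3.1, 0.4, 2.2])
certain_idx, uncertain_idx = split_by_small_loss(losses, rho=0.5)
```

Every index lands in exactly one of the two groups, so the whole batch is used: len(certain_idx) + len(uncertain_idx) = N.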
C TECHNICAL DIFFICULTY OF APPLYING GENERAL OPTIMAL TRANSPORT/MARKOV THEORY TO THE LABEL SPACE

Let X, Y be uncertain and certain samples in the pre-softmax feature space, and assume that we impose the distributional constraint on the label space (the space of σ(X), σ(Y), where σ denotes the softmax function). This space is not suitable for defining an objective function such as equation 5. Every sample in the label space has the form σ(X) = [a_1, a_2, ..., a_d] with Σ_{i=1}^d a_i = 1; thus, the label space is the d-dimensional affine simplex U^d, which is a subset of Euclidean space, U^d ⊂ R^d. In this case, the definition of the Wasserstein space in equation 4 is not applicable, because d_E is not a true metric on U^d. Moreover, the Wasserstein space P_2(U^d) is scarcely investigated in the mathematical literature, which makes it impossible to use the technical details, assumptions, and theories developed for P_2(R^d), which form the theoretical basis of our work. However, if we view the problem from a slightly different perspective and take the pre-softmax space R^d, with P_2(R^d), as our base space, then all the technical issues that arise when using OT tools in P_2(U^d) can be avoided. Because the softmax is a non-parametric one-to-one function connecting the pre-softmax feature space R^d to U^d, there exists a unique label in U^d as the mapped point of each manipulated uncertain sample. Even though our objects are defined on the pre-softmax space, the theoretical analysis in Proposition 3 involves the softmax function to evaluate the concentration inequality of the proposed transformation F as it acts on the label space U^d.

D MATHEMATICAL BACKGROUND

In this section, we introduce important definitions, notations, and propositions used in our proofs and the main paper. 

D.2 DIFFUSION-INVARIANCE AND HYPER-CONTRACTIVITY

Definition 2. The Markov semigroup (P_t)_{t≥0} on R^d acting on a function f ∈ C_0^∞ is defined as P_t f(x) = ∫ f(x′) p_t(x, dx′), where p_t(x, dx′) is a transition kernel, i.e., a probability measure for every t ≥ 0.

Definition 3. (Diffusion Operator) Given a Markov semigroup P_t at time t, the diffusion operator (i.e., infinitesimal generator) L of P_t is defined as

Lg(y) = lim_{t→0} (1/t)(P_t g(y) − g(y)) = Σ_{i,j} (∂²/∂y_i ∂y_j)[B_{ij}(y) g(y)] − Σ_i A_i(y) (∂/∂y_i) g(y),

where B and A are matrix- and vector-valued measurable functions, respectively. B_{ij} denotes the (i, j)-th function of B, and A_i denotes the i-th component function of A.

Definition 4. (Diffusion-invariant Measure) Given the diffusion operator L, the probability measure µ is called an invariant measure for L when E_{X∼µ}[Lf(X)] = 0 for any f ∈ C_0^∞.

Lemma 1. (Infinitesimal generator for the multivariate Gaussian measure, Bolley & Gentil (2010).) The Gaussian measure N_ν := N(m_ν, Σ_ν) with mean m_ν and covariance Σ_ν is an invariant measure for the following diffusion operator L:

Lf(x) = Tr(Σ_ν Hess[f](x)) − (x − m_ν)^T ∇f(x), ∀f ∈ C_0^∞(R^d),

where B_{ij}(x) := [Σ_ν]_{ij} is a constant function and A_i(x) := x_i − m_ν^i. This generator serves as our main tool for the geometric analysis of the upper bound ε. In Section 4.1 of the main paper, we introduced an approximate upper bound K̂_2(µ) without a general description of the inequality involved; we now introduce the mathematics underlying equation 12. Because our detour measure is Gaussian, there is a unique semigroup P_t, the multidimensional Ornstein-Uhlenbeck semigroup, that is invariant to N_ν. Specifically, P_s is defined as

P_s h(X) = E_{Z∼N_I}[ h( e^{−s} X + √(1 − e^{−2s}) (Σ_ν^{1/2} Z + m_ν) ) ], ∀h ∈ C_0^∞.

The invariance of P_t with respect to our detour measure is naturally induced by the following proposition. Proposition 4.
We define C : R^d → R^d, C(X) = AX + b with A ∈ Sym_d^+ and b ∈ R^d, and select an arbitrary smooth h ∈ C_0^∞(R^d). We then define the diffusion Markov semigroup P_s as

P_s h(X) = E_{Z∼N}[ h( e^{−s} X + √(1 − e^{−2s}) C(Z) ) ].

Then N(A², b) is invariant with respect to P_s, meaning the following equality holds for every h and s ≥ 0:

∫_{R^d} [P_s h(X) − h(X)] dN(A², b)(X) = 0.

Proof. For simplicity, we denote N(A², b) := N_C. Then

∫ P_s h(X) dN_C(X) = ∫∫ h( e^{−s} X + √(1 − e^{−2s}) C(Z) ) dN_C(X) dN(Z) = ∫∫ h ∘ C( e^{−s} Z′ + √(1 − e^{−2s}) Z ) dN(Z′) dN(Z).

The second equality holds because C is affine on R^d. Let e^{−s} = cos θ and √(1 − e^{−2s}) = sin θ for some 0 ≤ θ ≤ 2π. Then we define φ(Z′, Z) = e^{−s} Z′ + √(1 − e^{−2s}) Z = cos(θ) Z′ + sin(θ) Z and π(Z′, Z) = Z. Based on the rotation invariance of the standard Gaussian measure, one can induce the following equality:

(N ⊗ N) ∘ (C ∘ φ)^{−1} = ((N ⊗ N) ∘ φ^{−1}) ∘ C^{−1} = N ∘ C^{−1}.

However, we know that dN[C^{−1}(X)] = dN_C(X) = (2π)^{−d/2} |A²|^{−1/2} e^{−0.5 (X−b)^T A^{−2} (X−b)} dX. By combining equation 21 and equation 22, one can derive the following result:

∫∫ h ∘ C( e^{−s} Z′ + √(1 − e^{−2s}) Z ) d[N ⊗ N] = ∫ h(X) d[(N ⊗ N) ∘ φ^{−1} ∘ C^{−1}](X) = ∫ h(X) d[N ∘ C^{−1}](X) = ∫ h(X) dN[C^{−1}(X)] = ∫ h(X) dN_C(X).

We are now ready to define the approximation of K_2(µ) in terms of semigroup invariance. Specifically, for any real-valued smooth h, we define the following inequality:

K̂_2(µ) = E_{X∼µ}[Lh(X)] = lim_{s→0} E_{X∼µ}[ (1/s)(P_s h(X) − h(X)) ] = lim_{s→0} (1/s) E_{X∼µ, Z∼N_I}[ h( e^{−s} X + √(1 − e^{−2s}) (Σ_ν^{1/2} Z + m_ν) ) − h(X) ] ≤ K_2(µ).

The inequality is tight when h is selected to attain the supremum over the set C_0^∞, where sup_h K̂_2(µ, h) = sup_h E_{X∼µ}[Lh(X)] = K_2(µ). Although a more sophisticated design for the test function h would induce a tighter upper bound for K̂_2, we found that the L²-norm is generally sufficient.

Definition 5. (Diffuseness of the probability measure) We define the integral operator K_2 : W_2(R^d) → R^+ as

K_2(µ) = sup_{f∈C_0^∞} ∫_{R^d} |Lf(x)| dµ(x).
According to Definition 4, we know that ∫ Lf(X) dN_ν(X) = 0 for any f. Based on this observation, it is intuitive that K_2 estimates how far the probability measure µ is distorted in terms of diffusion invariance. Because this measure takes a supremum over the function space C_0^∞, it searches for the function that attains the maximal distortion. Because the value of K_2 is entirely dependent on the structure of µ, K_2 can be treated as a constant for the sake of simplicity if the uncertain measure µ is fixed over one iteration of training.

Definition 6. (Diffusion carré du champ) Let f, g ∈ C_0^∞(R^d). Then we define a bilinear form Γ_e on C_0^∞(R^d) × C_0^∞(R^d) as

Γ_e(f, g) = (1/2)[ L Γ_{e−1}(f, g) − Γ_{e−1}(f, Lg) − Γ_{e−1}(g, Lf) ], e ≥ 1,

where Γ_0(f, g) := fg (the standard convention). We also denote Γ(f) ≡ Γ_1(f, f). The bilinear form Γ can be considered a generalization of the integration-by-parts formula, where ∫ [f Lg + Γ(f, g)] dµ = 0 for the invariant measure µ of L.

Definition 7. (Curvature-Dimension condition, Ambrosio et al. (2015)) The infinitesimal generator L is said to satisfy the curvature-dimension condition CD(ρ, ∞) if ρΓ_1(f) ≤ Γ_2(f) for all f ∈ C_0^∞.

Because our diffusion operator generates a semigroup with respect to the Gibbs measure, the curvature-dimension condition can be calculated explicitly. Through a simple calculation, the first-order (e = 1) diffusion carré du champ is induced as

Γ_1(f) = [∇f]^T Σ_ν ∇f.

Similarly, the second-order (e = 2) diffusion carré du champ is calculated as

Γ_2(f) = (1/2)[ L Γ_1(f) − 2 Γ_1(f, Lf) ] = Tr( (Σ_ν ∇²f)² ) + [∇f]^T Σ_ν ∇f = Tr( (Σ_ν ∇²f)² ) + Γ_1(f),

for an arbitrary f ∈ C_0^∞(R^d). Because Tr((Σ_ν ∇²f)²) is non-negative, we can infer that Γ_1 ≤ Γ_2. In this case, the diffusion operator L defined in Lemma 1 satisfies the CD(ρ = 1, ∞) curvature-dimension condition. For other diffusion operators, please refer to Bolley & Gentil (2010). Proposition 5.
(Decay of Fisher information along a Markov semigroup, Bakry et al. (2013).) If we assume the curvature-dimension condition CD(ρ, ∞), then I(µ_t | N_ν) ≤ e^{−2ρt} I(µ | N_ν). The exponential decay of the Fisher information in Proposition 5 is the core ingredient behind the exponential decay of the Wasserstein distance, which is used in the proof of Proposition 2.
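The diffusion invariance in Definition 4, specialized to the Ornstein-Uhlenbeck semigroup above, can be sanity-checked numerically. The sketch below (our own construction, with a mean-zero detour Gaussian for simplicity; all names are ours) verifies by Monte Carlo that E_{X∼N(0,Σ)}[P_s h(X) − h(X)] ≈ 0:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
M = rng.standard_normal((d, d))
Sigma = M @ M.T + np.eye(d)        # positive-definite covariance of N(0, Sigma)
L = np.linalg.cholesky(Sigma)      # plays the role of Sigma^{1/2}

def h(x):
    """An arbitrary smooth bounded test function."""
    return np.tanh(x).sum(axis=-1)

def ou_semigroup(x, s, n_mc=1000):
    """P_s h(x) for the OU semigroup invariant to N(0, Sigma), via Monte Carlo
    over Z ~ N(0, I): average of h(e^{-s} x + sqrt(1 - e^{-2s}) Sigma^{1/2} Z)."""
    z = rng.standard_normal((n_mc, d)) @ L.T
    return h(np.exp(-s) * x + np.sqrt(1.0 - np.exp(-2.0 * s)) * z).mean()

# Diffusion invariance: the semigroup leaves N(0, Sigma) fixed for every s >= 0.
X = rng.standard_normal((1000, d)) @ L.T        # samples from N(0, Sigma)
gap = np.mean([ou_semigroup(x, s=0.7) for x in X]) - h(X).mean()
print(abs(gap))   # small: only Monte Carlo error remains
```

Because e^{-2s}Σ + (1 - e^{-2s})Σ = Σ, the transformed sample has exactly the law N(0, Σ) again, so the gap above vanishes up to sampling noise.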

D.3 FOKKER-PLANK EQUATION, SDE

Definition 8. (Over-damped Langevin Dynamics) Consider the SDE

dX_t = −∇φ(X_t; m_ν) dt + √(2 τ^{−1} Σ_ν) dW_t,

where φ(X_t; m_ν) = (τ/2) d²(X_t, m_ν), W_t denotes Brownian motion, and d denotes the Euclidean distance. The particle X_t is distributed as X_t ∼ p_t. The probability density p(x, t) converges, as t → ∞, to the Gaussian density of X_∞ = √(Σ_ν)(Z + m_ν), that is, p_∞(x) = q(x) ∝ e^{−d(x, m_ν)^T Σ_ν^{−1} d(x, m_ν)}. In the classical SDE literature, it is known that E[sup_{0≤t≤T} |X̂_t − X_t|] ≤ G(T) N^{−1/2}, where G(T) is a constant that depends only on T, X̂ denotes the discretized solution, and X denotes the true solution of the SDE in equation 29. Because the number of uncertain samples satisfies N > 40 in our setting, our method exhibits acceptable convergence.
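A minimal Euler-Maruyama discretization of the over-damped Langevin dynamics above illustrates the convergence to the Gaussian stationary law. For simplicity we take τ = 1 and identity covariance (so the stationary measure is N(m, I)); this is an illustrative sketch, not the exact training configuration:

```python
import numpy as np

def euler_maruyama(x0, m, tau=1.0, dt=1e-2, n_steps=2000, seed=0):
    """Simulate dX_t = -tau (X_t - m) dt + sqrt(2 / tau) dW_t, the over-damped
    Langevin SDE with phi(x; m) = (tau / 2) ||x - m||^2 and identity covariance."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    sigma = np.sqrt(2.0 / tau)
    for _ in range(n_steps):
        x = x - tau * (x - m) * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Evolve 5000 independent particles; their empirical law approaches N(m, I).
m = np.array([1.0, -2.0])
XT = euler_maruyama(np.zeros((5000, 2)), m)
print(XT.mean(axis=0))   # approaches m
print(XT.std(axis=0))    # approaches 1
```

The stationary variance σ²/(2θ) = (2/τ)/(2τ) equals 1 for τ = 1, matching the empirical spread of the simulated particles.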

D.4 GAUSSIAN WASSERSTEIN SUBSPACES

It is known that the space of non-degenerate Gaussian measures (i.e., those with positive-definite covariance matrices) forms a subspace of the 2-Wasserstein space, denoted W_{2,g} ≅ Sym_d^+ × R^d. Because the 2-Wasserstein space can be considered a Riemannian manifold equipped with Riemannian metrics (Villani (2008)), W_{2,g} can be endowed with a Riemannian structure that also induces the Wasserstein metric (McCann (1997)). In this Riemannian submanifold of Gaussian measures, the geodesic between two points γ(0) = N_A and γ(1) = N_B is defined as follows (Malagò et al. (2018)):

γ(α) = N(m(α), Σ(α)), where m(α) = (1 − α) m_A + α m_B and Σ(α) = [(1 − α)I + αT] Σ_A [(1 − α)I + αT], with T Σ_A T = Σ_B.

In Section 3.2, we set (m_A, Σ_A) → (m_ν, Σ_ν) and (m_B, Σ_B) → (m_{ξ_k}, Σ_{ξ_k}). Regardless of how ν is updated, the statistical information of the current certain measure ξ_k is incorporated into the detour Gaussian measure, which yields a much smoother geometric constraint on µ.
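The geodesic above, together with the closed-form 2-Wasserstein distance between non-degenerate Gaussians, can be sketched as follows (a self-contained illustration; the helper names are ours). The key property is that γ(α) divides the geodesic proportionally: W_2(γ(0), γ(α)) = α W_2(N_A, N_B).

```python
import numpy as np

def psd_sqrt(M):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(mA, SA, mB, SB):
    """Closed-form 2-Wasserstein distance between N(mA, SA) and N(mB, SB)."""
    rA = psd_sqrt(SA)
    cross = psd_sqrt(rA @ SB @ rA)
    return np.sqrt(np.sum((mA - mB) ** 2) + np.trace(SA + SB - 2.0 * cross))

def gaussian_geodesic(mA, SA, mB, SB, alpha):
    """gamma(alpha): the W2 geodesic between two non-degenerate Gaussians."""
    rA = psd_sqrt(SA)
    rA_inv = np.linalg.inv(rA)
    T = rA_inv @ psd_sqrt(rA @ SB @ rA) @ rA_inv   # optimal-map matrix, T SA T = SB
    C = (1.0 - alpha) * np.eye(len(mA)) + alpha * T
    return (1.0 - alpha) * mA + alpha * mB, C @ SA @ C

mA, SA = np.zeros(2), np.eye(2)
mB, SB = np.array([3.0, 4.0]), np.diag([4.0, 1.0])
m_half, S_half = gaussian_geodesic(mA, SA, mB, SB, 0.5)
```

For two Gaussians with the same covariance, the distance reduces to the Euclidean distance between the means, which is the fact used in the proof of Proposition 8.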

E PROOFS

Proposition 6. Let Γ(µ, ν) be the set of couplings between µ and ν, and assume that the noisy label r is independent of X. For the functional J[µ] = E_{X∼µ}[l(X; r)], we define D(µ, ν) as D(µ, ν) = inf_{γ∈Γ(µ,ν)} |J[µ] − J[ν]|, where D : P_2 × P_2 → R. Then, D is a metric on P_2 that is weaker than the Wasserstein metric, with D(µ, ν) ≤ α W_2(µ, ν) for α = E[c_0^{−1} r + c_1^{−1}(1 − r)] and some constants c_0, c_1 > 0.

Proof.

|J[ν] − J[µ]| = |E_µ[l(X; r)] − E_ν[l(Z; r)]|
= |E_{µ⊗ν}[ r(log σ(X) − log σ(Z)) − (1 − r)(log(1 − σ(X)) − log(1 − σ(Z))) ]|
≤ E[r] E_{µ⊗ν}|log σ(X) − log σ(Z)| + E[1 − r] E_{µ⊗ν}|log(1 − σ(X)) − log(1 − σ(Z))|
≤ c_0^{−1} E[r] E_{µ⊗ν}|X − Z| + c_1^{−1} E[1 − r] E_{µ⊗ν}|Z − X|
= E[c_0^{−1} r + c_1^{−1}(1 − r)] E_{µ⊗ν}|X − Z|.

By taking the infimum of the aforementioned inequality over the set of couplings γ ∈ Γ(µ, ν), we obtain the following inequality:

D(ν, µ) = inf_{γ∈Γ(µ,ν)} |J[ν] − J[µ]| ≤ E[c_0^{−1} r + c_1^{−1}(1 − r)] inf_{γ∈Γ(µ,ν)} E_γ|X − Z| = E[c_0^{−1} r + c_1^{−1}(1 − r)] W_1(µ, ν) ≤ E[c_0^{−1} r + c_1^{−1}(1 − r)] W_2(µ, ν),

which completes the proof. Proposition 6 follows from the Lipschitzness of the functional J, where D searches for the best coupling that yields the minimal loss difference between two probability measures. This proposition indicates that inf |J[ν] − J[Fµ]| is bounded by the Wasserstein distance, which justifies our geometric constraint presented in equation 4. It should be noted that the prior assumption regarding noisy labels is essential for the Lipschitzness.

Proposition 7. Let F : R^+ × P_2 → P_2 be a functional on probability measures such that F[t, µ] = µ_t, where dµ_t = p_t dN_ν and dN_ν = q_t dx, and let µ_t be a solution of the continuity equation in the 2-Wasserstein space defined as

∂_t µ_t = ∇ • (µ_t ∇Φ_t),

which is represented as ∂_t p(t, x) = ∇ • (p(t, x) ∇ log q(t, x)) in the distributional sense.
Then, the functional F_t[•] = F[t, •] is uniquely defined and normalizes µ onto B_{W_2}(N_ν, e^{−t} K_2(µ)), where K_2(µ) < ∞ is the integral operator of Definition 5 applied to µ.

Proof. We assume that the probability measure µ_t is absolutely continuous with respect to the detour Gaussian measure N(m_ν, Σ_ν) = N_ν, i.e., µ_t ≪ N_ν. In this case, by the Radon-Nikodym theorem, there is a unique probability density q(t, x) = q_t(x) ∈ C_0^∞ such that dµ_t = q_t dN_ν.

Lemma 2. (WI-inequality, Otto & Villani (2000)) If the stationary state of µ_t with respect to P_t satisfies lim_{t→∞} E_µ[P_t f] = 0 for any f ∈ C_0^∞, then the following inequality holds:

(d^+/dt) W_2(µ, µ_t) ≤ √(I(µ_t | N_ν)).

By integrating both sides of the inequality in Lemma 2 with respect to t ∈ (0, ∞), the following inequality can be obtained:

W_2(µ, N_ν) = ∫_0^∞ (d^+/dt) W_2(µ, µ_t) dt ≤ ∫_0^∞ √(I(µ_t | N_ν)) dt.

In the aforementioned inequality, we express the Fisher information through the diffusion generator L:

I(µ_t | N_ν) = ∫ [P_t q]^{−1} Γ(P_t q) dN_ν = ∫ L(−log P_t q) dµ_t.

This equality is derived by leveraging the properties of the bilinear operator Γ (Bakry et al. (2013); Villani (2008)) with respect to the diffusion operator L:

∫ [P_t q]^{−1} Γ(P_t q) dN_ν = −∫ L(log P_t q) q_t dN_ν = ∫ L(−log P_t q) dµ_t ≥ 0.

For simplicity, we denote |g| = g^+ for any g ∈ C_0^∞. According to Proposition 5, we can relate F_t µ = µ_t to the initial measure µ = µ_{t=0} as follows:

∫_0^∞ √( ∫ L(−log P_t q)(X) d[F_t µ](X) ) dt ≤ ∫_0^∞ √(e^{−2ρt}) √( ∫ L(−log P_{t=0} q)(X) dµ(X) ) dt ≤ ∫_0^∞ √(e^{−2ρt}) dt · sup_{g∈C_0^∞} ∫ L^+ g(X) dµ(X) = ρ^{−1} K_2(µ).

The second inequality is naturally induced because the proposed objective selects the maximal element over the set of functions g ∈ C_0^∞ and Lg ≤ L^+ g.
If the integral interval is set to (0, s), then we can induce W_2(µ, F_s µ) ≤ (1/ρ)(1 − e^{−ρs}) K_2(µ). Our diffusion operator induces ρ = 1, which completes the proof.

Proposition 8. There is a scalar 0 < β < ∞, dependent on ν, such that the following inequality holds:

W_2(ν, F_t µ) ≤ [√(dβ λ_max(Σ_ν)) + E_ν‖Y‖_2] ∨ [e^{−t} K_2(µ) + K_2(ν)].

As a motivation for setting the detour measure to N_ν, we mentioned the natural non-collapsing property of the Wasserstein distance W_2(ν, N_ν). However, it is unclear from a geometric perspective exactly how the upper bound (i.e., W_2(ν, N_ν) ≤ ?) can be induced from the intrinsic statistics term (i.e., d_1 in Fig. 1). Specifically, in the situation where the covariance matrices of ν and N_ν are identical, it is difficult to determine a theoretical upper bound without additional tools. The first part of this proof focuses on resolving this issue; the second part is naturally induced by Proposition 1. Please note that in the following proposition, the Wasserstein moving-average parameter is set to α = 0 for clarity.

Proof. Before proceeding with the first part of the proof, we define a constant β as follows:

β = sup_{1≤j≤d} ∫_0^1 (1/s) E_{Y_s}[v_{s,j}²(Y_s)] ds.

If we assume a mild condition such that inf_{1≤j≤d} O(v_{s,j}) ≥ O(√s) uniformly in s, then the integral term in β is finite and well-defined. This value will directly yield the upper bound of the Kullback-Leibler (KL) divergence of ν. First, we introduce the following identity, de Bruijn's identity, with the score function defined as v_s(x) = ∇ log p_s(x) with respect to the random variable Y_s:

KL(ν | N(0, Σ_ν)) = ∫_0^1 Tr( (1/2s) Σ_ν E_{Y_s∼p_s}[v_s(Y_s) v_s(Y_s)^T] ) ds.

From equation 42, we can derive the relation between the KL divergence and the constant β defined earlier.
∫_0^1 (1/2s) Tr( Σ_ν E[v_s(Y_s) v_s(Y_s)^T] ) ds = ∫_0^1 (1/2s) Tr( Σ_ν [E[v_{s,i} v_{s,j}]]_{i,j=1}^d ) ds ≤ ∫_0^1 (1/2) λ_max(Σ_ν) Σ_{j=1}^d (E[v_{s,j}²(Y_s)]/s) ds ≤ (1/2) λ_max(Σ_ν) ∫_0^1 Σ_{j=1}^d β ds = (1/2) λ_max(Σ_ν) dβ.

The second inequality holds by the following property of symmetric positive-definite matrices: Tr(AB) ≤ ‖A‖_op Tr(B) = λ_max(A) Tr(B), ∀A, B ∈ Sym_d^+. It should be noted that because the distribution of ν is compactly supported (i.e., supp(q) is compact), the maximum eigenvalue of the covariance Σ_ν is finite. The remaining relations follow from the definition of β. Next, we naturally relate the KL divergence to the 2-Wasserstein distance.

Definition 9. (Talagrand inequality for Gaussian measures, Otto & Villani (2000)) For any non-degenerate Gaussian measure N with mean 0, the following inequality is satisfied:

W_2(ν, N) ≤ √(2 KL(ν | N)), ∀ν ∈ P_2(R^d).

By combining Definition 9 and equation 43, we can derive the following expression:

W_2(ν, N(0, Σ_ν)) ≤ √(2 KL(ν | N(0, Σ_ν))) ≤ √(dβ λ_max(Σ_ν)) < ∞.

According to the triangle inequality for the 2-Wasserstein distance, we obtain

W_2(ν, N(m_ν, Σ_ν)) ≤ W_2(ν, N(0, Σ_ν)) + W_2(N(m_ν, Σ_ν), N(0, Σ_ν)).

In Appendix C.3, we showed that the geodesic distance between two Gaussian measures with the same covariance is equivalent to the Euclidean distance between their means. Therefore, we can obtain the following:

W_2(N(m_ν, Σ_ν), N(0, Σ_ν)) = W_2(ι_{m_ν #}[N(0, Σ_ν)], N(0, Σ_ν)) = ‖m_ν − 0‖_2 ≤ E_ν‖Y‖_2,

where ι_a(X) = X + a for any vector a ∈ supp(q). Now, by combining the two inequalities defined earlier, we obtain

W_2(ν, N(m_ν, Σ_ν)) ≤ E_ν‖Y‖_2 + √(dβ λ_max(Σ_ν)),

where it is easily seen that the upper bound depends only on the statistical structure of ν. Specifically, the term E_ν‖Y‖_2 represents the center of mass of the density of ν, and √(dβ λ_max(Σ_ν)) is related to the covariance structure of ν.
By applying the inequality above and Proposition 7 to both F_t µ and ν, we can easily recover equation 5 as follows:

W_2(ν, F_t µ) ≤ ε = W_2(ν, N(m_ν, Σ_ν)) + W_2(N(m_ν, Σ_ν), F_t µ) ≤ [(E_ν‖Y‖_2 + √(dβ λ_max(Σ_ν))) ∧ K_2(ν)] + e^{−t} K_2(µ) ≤ [√(dβ λ_max(Σ_ν)) + E_ν‖Y‖_2] ∨ [e^{−t} K_2(µ) + K_2(ν)].

The second inequality is easily obtained because (a ∧ b) + c ≤ a ∨ (b + c) for any a, b, c ≥ 0, which completes the proof.

Proposition 9. (Concentration inequality for uncertain measures) Assume that there exist constants s′ ∈ [1/η, ∞), η ≥ 0, such that the following inequality is satisfied:

E_{F_{s′}µ}[f²] − (E_{F_{s′}µ}[f])² ≤ (1 + η) E_{F_{s′}µ}[∇f^T A ∇f],

for some A ∈ Sym_d^+ with D(A, Σ_ν) ≤ aη for some a > 0 and any metric D defined on Sym_d^+. In this case, there is a δ such that the following probability inequality for the uncertain measure is induced:

F_{s′}µ(|σ − E_ν[σ]| ≥ δ) ≤ 6 e^{−√2 δ^{3/2}/(κ K_2)},

where κ denotes the Lipschitz constant of σ.

Proof. Before proceeding with the main proof, we first prove the existence of s′. The interval converges to the singleton {∞} as η → 0, i.e., I = lim_{η→0} [1/η, ∞). In this case, equation 51 coincides with the Poincaré inequality for the Gaussian measure N_ν, which can be written as

lim_{η→0} ( E_{F_{s′}µ}[f²] − (E_{F_{s′}µ}[f])² ) ≤ lim_{η→0} (1 + η) E_{F_{s′}µ}[∇f^T A ∇f] = E_{F_{s′}µ}[∇f^T Σ_ν ∇f].

Because the Poincaré inequality in equation 53 is uniquely defined, we can find at least one value s′ satisfying equation 51. Let X(t, w) = X_t(w) denote the stochastic process with respect to q_t(x) defined in the proof of Proposition 2, and let c = E_ν[σ] − E_{F_{s′}µ}[σ]. Then, we can obtain the following inequality:

c = E_ν[σ] − E_{F_{s′}µ}[σ] = κ( E_ν[σ/κ] − E_{F_{s′}µ}[σ/κ] ) ≤ κ sup_{g∈Lip_1}( E_ν g − E_{F_{s′}µ} g ) ≤ κ W_1(F_{s′}µ, ν) ≤ κ W_2(F_{s′}µ, ν) ≤ κ K_2(µ)/(1 + η).

Additionally, we can derive the following probability inequality:

F_{s′}µ( σ(X_{s′}(w)) ≥ E_{F_{s′}µ}[σ] + δ ) ≤ 3 e^{−δ/(√(1+η) κ)},

where the Poincaré constant for F_{s′}µ is naturally 1 + η and ‖σ‖_Lip = κ. Next, we will derive the desired form from equation 55.
First, we introduce the following inequality:

σ(X_{s′}) ≥ E_{F_{s′}µ}[σ] + δ ≥ E_ν[σ] + δ − κK_2/(1 + η). (56)

The last inequality is directly induced by equation 54, because −c ≥ −κK_2/(1 + η). Because η, κ, and K_2 are constants with respect to w, the following set inclusion can be obtained naturally:

S_1 = {w : σ(X_{s′}(w)) ≥ E_{F_{s′}µ}[σ] + δ} ⊇ {w : σ(X_{s′}(w)) ≥ E_ν[σ] + δ − κK_2/(1 + η)} = S_2. (57)

Evaluating the probability measure F_{s′}µ[•] on the sets S_1 and S_2 yields

3 e^{−δ/(√(1+η) κ)} ≥ F_{s′}µ({w : σ(X_{s′}(w)) ≥ E_{F_{s′}µ}[σ] + δ}) ≥ F_{s′}µ({w : σ(X_{s′}(w)) ≥ E_ν[σ] + δ − κK_2/(1 + η)}).

The concentration inequality around E_ν[σ] is obtained by combining the inequalities induced by σ and −σ:

F_{s′}µ({w : |σ(X_{s′}(w)) − E_ν[σ]| ≥ δ − κK_2/(1 + η)}) ≤ 6 e^{−δ/(√(1+η) κ)}. (59)

The inequality in equation 59 is the general form containing the relation between the upper bound of the probability and (η, κ, K_2). Because this form is complicated and highly technical, we chose not to present all of its details in the main paper; instead, we rewrite it in a simplified form for clarity. Specifically, by setting κK_2/(1 + η) = 0.5δ and rescaling δ to 2δ, the inequality in equation 59 can be converted into the following simpler form:

F_{s′}µ({w : |σ(X_{s′}(w)) − E_ν[σ]| ≥ δ}) ≤ 6 e^{−√2 δ^{3/2}/(κ K_2)}.

Finally, if we set σ = Softmax, then the Lipschitz constant is κ = 1. The proof is completed by setting s′ := T.
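As a small numerical illustration of the Talagrand inequality in Definition 9, one can choose ν itself Gaussian so that both sides have closed forms. The following sketch (our own, purely illustrative) checks W_2(ν, N(0, I)) ≤ √(2 KL(ν | N(0, I))) for ν = N(m, σ²I):

```python
import numpy as np

def kl_gaussian_to_std(m, sigma, d):
    """KL( N(m, sigma^2 I) || N(0, I) ) in d dimensions (closed form)."""
    return 0.5 * (d * sigma**2 + np.dot(m, m) - d - d * np.log(sigma**2))

def w2_gaussian_to_std(m, sigma, d):
    """W2( N(m, sigma^2 I), N(0, I) ): closed form for isotropic Gaussians."""
    return np.sqrt(np.dot(m, m) + d * (sigma - 1.0) ** 2)

d = 2
m = np.array([1.0, 0.0])
sigma = 1.5
w2 = w2_gaussian_to_std(m, sigma, d)
talagrand_bound = np.sqrt(2.0 * kl_gaussian_to_std(m, sigma, d))
print(w2 <= talagrand_bound)   # True
```

Here w2 = √1.5 ≈ 1.22 while the bound is ≈ 1.37, so the inequality holds with room to spare; equality is approached as ν approaches the reference Gaussian.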



Due to technical difficulties, we define our central objects on the pre-softmax space rather than the label space, i.e., the space of σ(X), σ(Y), where σ denotes the softmax function; please refer to the Appendix for more details. Please refer to Appendix C.4 for more details. Please refer to Appendix C.2 for additional details.



(a) Divergence or convergence of µ from ν (b) Classification accuracy Figure 1: Accuracy begins to drop when the uncertain measure µ begins to diverge from ν. In classification models with vanilla cross-entropy losses, the uncertain measure µ can easily diverge from ν in the Wasserstein space (the red dotted line in (a)), which induces an accuracy drop (the red dotted line in (b)). By contrast, the proposed WDN prevents such divergence by normalizing µ onto the Wasserstein ambiguity set B_{W_2}(ν, ε) (the black dotted line in (a)) and consistently enhances accuracy as iterations proceed (the black line in (b)). Please note that d_1 and d_2 denote the first and second terms in equation 5, respectively, and ε = d_1 + d_2.

[Co-teaching, Han et al. (2018)], [Co-teaching+, Yu et al. (2019)], [GCE, Zhang & Sabuncu (2018)], [RoG, Lee et al. (2019)], [JoCoR, Wei et al. (2020)], [NPCL, Lyu & Tsang (2020b)], [SIGUA, Han et al. (2020)], and [DivideMix, Li et al. (2019a)]. As shown in Table

Figure 3: Test accuracy for the proposed collaboration model with co-teaching.

[D2L, Ma et al. (2018)], [FW, Patrini et al. (2017b)], [WAR, Damodaran et al. (2019)], [SL, Wang et al. (2019)], [JOFL, Tanaka et al. (2018)], [DMI, Xu et al. (2019)], [PENCIL, Yi & Wu (2019)], and [MLNT, Li et al. (2019b)].

NOTATION

We denote f_#µ as the push-forward of µ through f. C_0^∞(R^d) denotes the set of C^∞ functions with compact support in R^d. For the L^p-norm of a function f, we write ‖f‖_{p,ν} = (∫ |f|^p dν)^{1/p}. The Hessian matrix of f is denoted Hess[f] = [∂_i ∂_j f]_{i,j=1}^d. Sym_d^+ denotes the space of positive semi-definite symmetric matrices of size d × d. ‖f‖_Lip denotes the Lipschitz norm of f. For any matrix A ∈ M_d, ‖A‖_op denotes the operator norm of A.

the invariance property of the defined semigroup. If we set A = Σ_ν^{1/2} and b = m_ν, then we can recover equation 18.

(de Bruijn's identity, Johnson & Suhov (2001); Nourdin et al. (2014)) We let Y ∼ ν, let Z ∼ N(0, I) denote a standard Gaussian random variable, and define Y_s =

The first inequality is induced by the assumption of κ-Lipschitzness of the function σ, and the second inequality is induced by the Kantorovich-Rubinstein theorem. The third inequality is natural because W_a(•, •) ≤ W_b(•, •) for any 1 ≤ a ≤ b < ∞. Because equation 51 is equivalent to the Poincaré inequality for the measure F_{s′}µ, it satisfies the Bakry-Émery curvature-dimension condition CD(1 + η, ∞). Thus, as shown in the proof of Proposition 2 (i.e., equation 39), the last inequality is induced. Additionally, based on the concentration inequality for F_{s′}µ [Proposition 4.4.2, Bakry et al. (2013)], we can derive the following probability inequality:

Lee et al. (2019) induced a robust generative classifier based on pre-trained deep models. Similar to our method, Damodaran et al. (2019) designed a constraint on the Wasserstein space and adopted an adversarial framework for classification models of noisy labeled data by implementing a semantic Wasserstein distance. Pleiss et al. (2020) identify noisy labeled samples by considering the AUM statistic, which exploits differences in the training dynamics of clean and mislabeled samples. In the most recent work, Li et al. (2019a) adopt semi-supervised learning (SSL) methods to deal with noisy labels, where the student network utilizes both labeled and unlabeled samples to perform semi-supervised learning guided by another teacher network.

Average test accuracy (%) on the CIFAR-10/100 dataset over the last 10 epochs with various noise corruptions (each cell reports CIFAR-10/CIFAR-100). The symbol indicates scores provided by the corresponding authors. WDN_cot denotes our WDN combined with a co-teaching network. The best results are boldfaced.

?            | ?.43/40.44 ± .36        | 49.54 ± .41/21.34 ± .27  | 49.06 ± 1.02/31.85 ± .85
MentorNet    | 80.76 ± .36/52.13 ± .40 | 71.10 ± .48/39.00 ± 1.00 | 58.14 ± .38/31.60 ± .51
GCE          | 84.68 ± .05/51.86 ± .09 | 61.80 ± .11/37.60 ± .08  | 61.09 ± .18/33.13 ± .14
?            | ?.18/49.41 ± .25        | 68.93 ± .33/34.24 ± .63  | ?
WDN          | 87.40 ± .23/59.18 ± .29 | 82.89 ± .13/48.45 ± .27  | 76.12 ± .29/38.23 ± .31
Co-teaching  | 78.23 ± .27/53.89 ± .09 | 72.81 ± .20/34.96 ± .50  | 70.46 ± .58/34.55 ± .12
Co-teaching+ | 80.64 ± .15/56.15 ± .09 | 58.43 ± .30/37.88 ± .06  | 70.78 ± .11/32.88 ± .25
WDN_cot      | 87.12 ± .16/57.27 ± .33 | 76.06 ± .28/42.38 ± .28  | 74.11 ± .35/44.41 ± .37

Test accuracy on the CIFAR-10 dataset with open-set noisy labels from CIFAR-100.

Table 3 reveals that our method achieved competitive performance in comparison with other baseline methods. Test accuracy (mean, %) on the Clothing1M dataset. For methods that use additional networks, such as JoCoR and DivideMix, the number of network parameters (8.86M) is twice that of the vanilla network (4.43M). In Table 4, we compare the average training time over the first 5 epochs for various baseline methods under symmetric noise on the CIFAR-10 dataset. While non-parametric methods such as GCE and WDN require less than 12% additional time, methods that require additional networks spend more time than the non-parametric ones. The average time can vary across different experimental environments; in Table 4, we measured the time using publicly available code provided by the authors.

Average training time over the first 5 epochs (sec) on the CIFAR-10 dataset.

