MITIGATING MEMORIZATION OF NOISY LABELS VIA REGULARIZATION BETWEEN REPRESENTATIONS

Abstract

Designing robust loss functions is popular in learning with noisy labels while existing designs did not explicitly consider the overfitting property of deep neural networks (DNNs). As a result, applying these losses may still suffer from overfitting/memorizing noisy labels as training proceeds. In this paper, we first theoretically analyze the memorization effect and show that a lower-capacity model may perform better on noisy datasets. However, it is non-trivial to design a neural network with the best capacity given an arbitrary task. To circumvent this dilemma, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of a DNN by a representation regularizer. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs. Our proposed framework is easily extendable and can incorporate many other robust loss functions to further improve performance. Extensive experiments and theoretical analyses support our claims.

1. INTRODUCTION

Deep Neural Networks (DNNs) have achieved remarkable performance in many areas including speech recognition (Graves et al., 2013) , computer vision (Krizhevsky et al., 2012; Lotter et al., 2016) , natural language processing (Zhang & LeCun, 2015) , etc. The high-achieving performance often builds on the availability of quality-annotated datasets. In a real-world scenario, data annotation inevitably brings in label noise (Wei et al., 2022d; e) , which degrades the performance of the network, primarily due to DNNs' capability in "memorizing" noisy labels (Zhang et al., 2016) . In the past few years, a number of methods have been proposed to tackle the problem of learning with noisy labels. Notable achievements include robust loss design (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Liu & Guo, 2020; Wang et al., 2021) , sample selection (Han et al., 2018; Yu et al., 2019; Cheng et al., 2021; Xia et al., 2021b) , transition matrix estimation (Patrini et al., 2017; Zhu et al., 2021b; Xia et al., 2019; 2020b) and loss correction/reweighting based on noise transition matrix (Natarajan et al., 2013; Liu & Tao, 2015; Patrini et al., 2017; Jiang et al., 2021; Zhu et al., 2021b; Wei et al., 2022a; Zhu et al., 2022c) . However, these methods still suffer from limitations because they are agnostic to the model complexity and do not explicitly take the over-fitting property of DNN into consideration when designing these methods (Wei et al., 2021; Liu et al., 2022) . In the context of representation learning, DNN is prone to fit/memorize noisy labels as training proceeds (Wei et al., 2022d; Zhang et al., 2016) , i.e., the memorization effect. Thus when the noise rate is high, even though the robust losses have some theoretical guarantees in expectation, they are still unstable during training (Cheng et al., 2021) . It has been shown that early stopping helps mitigate memorizing noisy labels (Rolnick et al., 2017; Xia et al., 2020a) . 
But intuitively, early stopping will handle overfitting wrong labels at the cost of underfitting clean samples if not tuned properly. An alternative approach is to use a regularizer to penalize/avoid overfitting (Liu & Guo, 2020; Cheng et al., 2021; Liu et al., 2020); such regularizers are mainly built by editing labels. In this paper, we study the effectiveness of a representation regularizer. To fully understand the memorization effect on learning with noisy labels, we decouple the generalization error into estimation error and approximation error. By analyzing these two errors, we find that a DNN behaves differently on various label noise types and that the key to preventing over-fitting is to control model complexity. However, specifically designing the model structure for learning with noisy labels is hard. One tractable solution is to use representation regularizers to cut off some redundant function space without hurting the optima. Therefore, we propose a unified framework that utilizes representations to mitigate the memorization effect. We list our main contributions below:

• We first theoretically analyze the memorization effect by decomposing the generalization error into estimation error and approximation error in the context of learning with noisy labels and show that a lower-capacity model may perform better on noisy datasets.

• Since designing a neural network with the best capacity for an arbitrary task requires formidable effort, instead of changing the model architecture, we decouple DNNs into an encoder followed by a linear classifier and propose to restrict the function space of DNNs by the structural information between representations. Particularly, we require the distance between two self-supervised features to be positively related to the distance between the corresponding two supervised model outputs.

• The effectiveness of the proposed regularizer is demonstrated by both theoretical analyses and numerical experiments. Our framework can incorporate many current robust losses and help them further improve performance.

1.1. RELATED WORKS

Learning with Noisy Labels Many works design robust loss to improve the robustness of neural networks when learning with noisy labels (Ghosh et al., 2017; Zhang & Sabuncu, 2018; Liu & Guo, 2020; Xu et al., 2019; Feng et al., 2021; Yong et al.; Xia et al., 2022; Wei et al., 2022c; b) . (Ghosh et al., 2017) proves MAE is inherently robust to label noise. However, MAE has a severe under-fitting problem. (Zhang & Sabuncu, 2018) proposes GCE loss which can combine the advantage of MAE and CE, exhibiting good performance on noisy datasets. (Liu & Guo, 2020) introduces peer loss, which is statistically robust to label noise without knowing noise rates. The extension of peer loss also shows good performance on instance-dependent label noise (Cheng et al., 2021; Zhu et al., 2021a) . Another efficient approach to combat label noise is by sample selection (Jiang et al., 2018; Han et al., 2018; Yu et al., 2019; Northcutt et al., 2021; Yao et al., 2020; Wei et al., 2020; Zhang et al., 2020; Xia et al., 2021a) . These methods regard "small loss" examples as clean ones and train multiple networks to select clean samples. Semi-supervised learning is also popular and effective on learning with noisy labels in recent years. Some works (Li et al., 2020; Nguyen et al., 2020) perform clustering on the sample loss and divide the samples into clean ones and noisy ones. Then drop the labels of the "noisy samples" and perform semi-supervised learning on all the samples. However, the semi-supervised pseudo labels can cause disparate impact on different groups of data Zhu et al. (2022b) . Recently, some works apply self-supervised learning to handle noisy labels (Ghosh & Lan, 2021; Li et al., 2022a; Wei et al., 2023) . Our work can also explain some findings from (Ghosh & Lan, 2021) . Knowledge Distillation Our proposed learning framework is related to knowledge distillation (KD). (Hinton et al., 2015) shows that a small, shallow network can be improved through a teacher-student framework. 
Due to its great applicability, KD has gained more and more attention in recent years and numerous methods have been proposed to perform efficient distillation (Mirzadeh et al., 2020; Zhang et al., 2018b; 2019). However, the dataset used in KD is assumed to be clean. Thus it is hard to connect KD with learning with noisy labels. In this paper, we theoretically and experimentally show that a regularizer generally used in KD (Park et al., 2019) can alleviate the over-fitting problem on noisy data by using DNN features, which offers a new alternative for dealing with label noise.

Examples $(x_n, y_n)$ are drawn according to random variables $(X, Y)$ from a joint distribution $\mathcal{D}$. The classification task aims to identify a classifier $C$ that maps $X$ to $Y$ accurately. In real-world applications, the learner can only observe noisy labels $\tilde{y}$ drawn from $\tilde{Y}|X$ (Wei et al., 2022d), e.g., human annotators may wrongly label some images containing cats as ones that contain dogs accidentally or irresponsibly. The corresponding noisy dataset and distribution are denoted by $\tilde{D} := \{(x_n, \tilde{y}_n)\}_{n\in[N]}$ and $\tilde{\mathcal{D}}$. Define the expected risk of a classifier $C$ as $R(C) = \mathbb{E}_{\mathcal{D}}[\mathbb{1}(C(X) \neq Y)]$. The goal is to learn a classifier $C$ from the noisy distribution $\tilde{\mathcal{D}}$ which also minimizes $R(C)$, i.e., learn the Bayes optimal classifier such that $C^*(x) = \arg\max_{i\in[K]} P(Y = i|X = x)$.

Noise transition matrix. The label noise of each instance is characterized by $T_{ij}(X) = P(\tilde{Y} = j|X, Y = i)$, where $T(X)$ is called the (instance-dependent) noise transition matrix (Zhu et al., 2021b; Li et al., 2022b). There are two special noise regimes (Han et al., 2018) for the simplicity of theoretical analyses: symmetric noise and asymmetric noise. In symmetric noise, each clean label is randomly flipped to the other labels uniformly w.p. $\epsilon$, where $\epsilon$ is the noise rate. Therefore, $T_{ii} = 1-\epsilon$ and $T_{ij} = \frac{\epsilon}{K-1}$, $i \neq j$, $i, j \in [K]$. In asymmetric noise, each clean label is randomly flipped to its adjacent label, i.e., $T_{ii} = 1-\epsilon$, $T_{ii} + T_{i,(i+1)_K} = 1$, where $(i+1)_K := i \bmod K + 1$.
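The two noise regimes above can be written down directly as transition matrices. The following is a minimal sketch, assuming NumPy and 0-indexed classes (so the adjacent label $(i+1)_K$ becomes `(i + 1) % K`); the function names `symmetric_T`, `asymmetric_T` and `corrupt` are illustrative and do not come from the paper.

```python
import numpy as np

def symmetric_T(K, eps):
    """Symmetric noise: T_ii = 1 - eps, T_ij = eps / (K - 1) for i != j."""
    T = np.full((K, K), eps / (K - 1))
    np.fill_diagonal(T, 1.0 - eps)
    return T

def asymmetric_T(K, eps):
    """Asymmetric noise: T_ii = 1 - eps, and eps goes to the adjacent class.

    Classes are 0-indexed here, so the adjacent class of i is (i + 1) % K.
    """
    T = np.eye(K) * (1.0 - eps)
    for i in range(K):
        T[i, (i + 1) % K] = eps
    return T

def corrupt(labels, T, rng):
    """Draw one noisy label per clean label from the matching row of T."""
    K = T.shape[0]
    return np.array([rng.choice(K, p=T[y]) for y in labels])
```

Each row of either matrix sums to one, and `corrupt` simulates the observed noisy labels for a given array of clean labels.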

Empirical risk minimization

The empirical risk on a noisy dataset with classifier $C$ writes as $\frac{1}{N}\sum_{n\in[N]} \ell(C(x_n), \tilde{y}_n)$, where $\ell$ is usually the cross-entropy (CE) loss. Existing works adapt $\ell$ to make it robust to label noise, e.g., loss correction (Natarajan et al., 2013; Patrini et al., 2017), loss reweighting (Liu & Tao, 2015), generalized cross-entropy (GCE) (Zhang & Sabuncu, 2018), peer loss (Liu & Guo, 2020), and $f$-divergence (Wei & Liu, 2021). To distinguish their optimization from the vanilla empirical risk minimization (ERM), we call them the adapted ERM.

Memorization effects of DNNs. Without special treatments, minimizing the empirical risk on noisy distributions makes the model overfit the noisy labels. As a result, the corrupted labels will be memorized (Wei et al., 2022d; Han et al., 2020; Xia et al., 2020a) and the test accuracy on clean data will drop in the late stage of training even though the training accuracy is consistently increasing. See Figure 1 for an illustration. Therefore, it is important to study robust methods to mitigate memorizing noisy labels.

Outline. The rest of the paper is organized as follows. In Section 3, we theoretically analyze the memorization effect through the relationship among noise rates, sample size, and model capacity, which motivates us to design a regularizer in Section 4 that alleviates the memorization effect by restricting model capacity. Section 5 empirically validates our analyses and proposal.

3. IMPACTS OF MISLABELED DATA ON DNN PERFORMANCE

We quantify the harmfulness of memorizing noisy labels by analyzing the generalization errors on clean data when learning on a noisy dataset $\tilde{D}$ and optimizing over function space $\mathcal{C}$. Let $\hat{C}_{\tilde{D}}$ denote the classifier learned from the noisy dataset $\tilde{D}$, $C_{\tilde{\mathcal{D}}}$ the best classifier in $\mathcal{C}$ on the noisy distribution $\tilde{\mathcal{D}}$, and $C^*$ the Bayes optimal classifier. The generalization error decomposes as

$$\mathbb{E}[\ell(\hat{C}_{\tilde{D}}(X), Y)] - \mathbb{E}[\ell(C^*(X), Y)] = \text{Error}_E(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}) + \text{Error}_A(C_{\tilde{\mathcal{D}}}, C^*),$$

where the estimation error $\text{Error}_E$ and the approximation error $\text{Error}_A$ can be written as

$$\text{Error}_E(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}) = \mathbb{E}[\ell(\hat{C}_{\tilde{D}}(X), Y)] - \mathbb{E}[\ell(C_{\tilde{\mathcal{D}}}(X), Y)], \qquad \text{Error}_A(C_{\tilde{\mathcal{D}}}, C^*) = \mathbb{E}[\ell(C_{\tilde{\mathcal{D}}}(X), Y)] - \mathbb{E}[\ell(C^*(X), Y)].$$

We analyze each part respectively.

Estimation error. We first study the noise consistency from the aspect of expected loss.

Definition 1 (Noise consistency). One label noise regime satisfies the noise consistency under loss $\ell$ if the following affine relationship holds:

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = \gamma_1 \mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \gamma_2,$$

where $\gamma_1 > 0$ and $\gamma_2$ are constants in a fixed noise setting.

Following the probabilistic decomposition of label noise (Natarajan et al., 2013; Ghosh et al., 2017; Cheng et al., 2021), we have Lemmas 1 and 2.

Lemma 1. A general noise regime with noise transitions $T_{ij}(X) := P(\tilde{Y} = j|Y = i, X)$ can be decoupled to the following form:

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = \underline{T} \cdot \mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \sum_{j\in[K]}\sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[U_{ij}(X)\,\ell(C(X), j)],$$

where $U_{ij}(X) = T_{ij}(X), \forall i \neq j$, $U_{jj}(X) = T_{jj}(X) - \underline{T}$, and $\underline{T} := \min_{X,i} T_{ii}(X)$.

Lemma 1 shows that general instance-dependent label noise is hard to make consistent, since the second term is not a constant unless we add more restrictions to $T(X)$. Specifically, in Lemma 2, we consider two typical noise regimes for multi-class classification: symmetric noise and asymmetric noise.

Lemma 2. The symmetric noise is consistent with 0-1 loss:

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = \gamma_1 \mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \gamma_2, \quad \text{where } \gamma_1 = 1 - \frac{\epsilon K}{K-1}, \ \gamma_2 = \frac{\epsilon}{K-1}.$$

The asymmetric noise is not consistent:

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = (1-\epsilon) \cdot \mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \epsilon \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), (i+1)_K)].$$

With Lemma 2, we can upper bound the estimation errors in Theorem 1.

Theorem 1. With probability at least $1-\delta$, learning with symmetric/asymmetric noise and 0-1 loss has the following estimation error:

$$\text{Error}_E(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}) \leq \Delta_E(\mathcal{C}, \varepsilon, \delta) := 16\sqrt{\frac{|\mathcal{C}| \log(N e/|\mathcal{C}|) + \log(8/\delta)}{2N(1-\varepsilon)^2}} + \text{Bias}(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}),$$

where $e \approx 2.718$ is the base of the natural logarithm and $|\mathcal{C}|$ is the VC-dimension of function class $\mathcal{C}$ (Bousquet et al., 2003; Devroye et al., 2013). The noise rate parameter $\varepsilon$ satisfies $\varepsilon = \frac{\epsilon K}{K-1}$ for symmetric noise and $\varepsilon = \epsilon$ for asymmetric noise. The bias satisfies $\text{Bias}(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}) = 0$ for symmetric noise and

$$\text{Bias}(C_{\tilde{\mathcal{D}}}, \hat{C}_{\tilde{D}}) = \frac{\epsilon}{1-\epsilon} \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[\ell(C_{\tilde{\mathcal{D}}}(X), (i+1)_K) - \ell(\hat{C}_{\tilde{D}}(X), (i+1)_K)\big]$$

for asymmetric noise.

Approximation error. Analyzing the approximation error for an arbitrary DNN is an open problem and beyond our scope. Generally, according to the trade-off between approximation error and estimation error (a.k.a. the bias-complexity trade-off (Shalev-Shwartz & Ben-David, 2014)), a large function space $\mathcal{C}$ reduces the approximation error at the cost of increasing the estimation error. Note that an alternative to this generalization bound uses the Rademacher complexity of $\mathcal{C}$. Both bounds help reveal the trade-off between approximation error and estimation error w.r.t. $\mathcal{C}$. We use the VC-dimension since it shows a clearer relationship between the model capacity $|\mathcal{C}|$ (numerator) and the effective number of samples $N(1-\varepsilon)^2$ (denominator).

Trade-off. From Theorem 1 and the above analyses, the bias-complexity trade-off is more severe in the presence of label noise. In the case of symmetric label noise, a larger $|\mathcal{C}|$ will lead to a smaller approximation error but at the cost of a larger estimation error given large $N$. Although the trade-off may not be remarkable in traditional learning with clean data ($\epsilon = 0$), it is critical when label noise exists, since the noise significantly reduces the effective data size from $N$ to $N(1-\varepsilon)^2$.
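To make the effect of the noise rate on the bound concrete, the capacity-dependent part of the error proxy in Theorem 1 can be evaluated numerically. This is a small sketch under stated assumptions: it reads the bound as a square root over the fraction with denominator $2N(1-\varepsilon)^2$, it omits the bias term (which is zero for symmetric noise), and the function names and the example VC-dimension are hypothetical.

```python
import math

def effective_noise(eps, K, symmetric=True):
    """Noise-rate parameter: eps*K/(K-1) for symmetric noise, eps for asymmetric."""
    return eps * K / (K - 1) if symmetric else eps

def estimation_proxy(vc_dim, N, noise_param, delta=0.05):
    """Capacity term of the proxy in Theorem 1 (bias term omitted).

    Assumed form: 16 * sqrt((|C| log(N e / |C|) + log(8/delta))
                            / (2 N (1 - noise_param)^2)).
    """
    num = vc_dim * math.log(N * math.e / vc_dim) + math.log(8.0 / delta)
    return 16.0 * math.sqrt(num / (2.0 * N * (1.0 - noise_param) ** 2))
```

For a fixed capacity and sample size, moving the noise parameter from 0 to 0.5 doubles the proxy, which is the same as shrinking the sample size from $N$ to $N/4$.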
In practice, it is non-trivial to find the best function space or design the best neural network given an arbitrary task for reducing the total generalization error. We will introduce a tractable solution in the following sections. raw features and the linear classifier g maps representations to label classes, i.e., C(X) = g(f (X)). Clearly, the function space can be reduced significantly if we only optimize the linear classifier g. But the performance of the classifier depends heavily on the encoder f . By this decomposition, we transform the problem of finding good function spaces to finding good representations. Now we analyze the effectiveness of such decomposition. Figure 2 illustrates three learning paths. Path-1 is the traditional learning path that learns both encoder f and linear classifier g at the same time (Patrini et al., 2017) . In Path-2, a pre-trained encoder f is adopted as an initialization of DNNs and both f and g are fine-tuned on noisy data distributions D (Ghosh & Lan, 2021) . The pre-trained encoder f is also adopted in Path-3. But the encoder f is fixed/frozen throughout the later training procedures and only the linear classifier g is updated with D. We compare the generalization errors of different paths to provide insights for the effects of representations on learning with noisy labels.  (C G•F D , C * ) and Error A (C G|f D , C * ). Assume Error A (C G•F D , C * ) < Error A (C G|f D , C * ). Note the assumption holds generally for analyzing the bias-complexity trade-off; otherwise we should always prefer a fixed encoder. With the error proxy ∆ E (C, ε, δ) in Theorem 1, we reveal the relationship among total generalization error, model capacity and noise rate in Corollary 1.

Corollary 1. When $\text{Error}_A(C^{\mathcal{G}\circ\mathcal{F}}_{\tilde{\mathcal{D}}}, C^*) < \text{Error}_A(C^{\mathcal{G}|f}_{\tilde{\mathcal{D}}}, C^*)$, the error proxy of the expected generalization error for the fixed encoder ($\mathcal{G}|f$) is not greater than that for the unfixed one ($\mathcal{G}\circ\mathcal{F}$) when

$$1 - \frac{\epsilon K}{K-1} \leq \beta'(\mathcal{G}\circ\mathcal{F}, \mathcal{G}|f) := \frac{16}{\sqrt{2N}} \cdot \frac{\sqrt{|\mathcal{G}\circ\mathcal{F}| \log(4Ne/|\mathcal{G}\circ\mathcal{F}|)} - \sqrt{|\mathcal{G}| \log(4Ne/|\mathcal{G}|)}}{\text{Error}_A(C^{\mathcal{G}|f}_{\tilde{\mathcal{D}}}, C^*) - \text{Error}_A(C^{\mathcal{G}\circ\mathcal{F}}_{\tilde{\mathcal{D}}}, C^*)}.$$

Recall that $|\cdot|$ for a hypothesis space denotes its VC-dimension. The RHS is greater than 0 and the LHS decreases as $\epsilon$ increases. Corollary 1 implies that, for symmetric noise, a fixed encoder is likely better in high-noise settings. Section 5 provides empirical results to validate this claim. We also provide empirical evidence in Appendix D.4 that a shallow network (a low-capacity model, similar to fixing the encoder) performs better than a deeper network in high-noise regimes.

Other noise. Based on Theorem 1, for asymmetric label noise, the noise consistency is broken and the bias term makes the learning error hard to bound. As a result, the relationship between the noise rate $\epsilon$ and the generalization error is not clear, and simply fixing the encoder may induce a larger generalization error. For general instance-dependent label noise, the bias term is more complicated, thus the benefit of fixing the encoder is less clear.
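The condition in Corollary 1 can be turned into a noise-rate threshold above which the fixed encoder has the smaller error proxy. The sketch below assumes the reading of $\beta'$ in which each capacity term sits under its own square root and the approximation-error gap is the denominator; the VC-dimensions and the gap passed in are hypothetical inputs, since neither is known in practice.

```python
import math

def beta_prime(vc_full, vc_linear, N, approx_gap):
    """beta' of Corollary 1 under the assumed square-root form.

    approx_gap stands for Error_A(G|f) - Error_A(G o F), assumed positive.
    """
    a_full = math.sqrt(vc_full * math.log(4 * N * math.e / vc_full))
    a_lin = math.sqrt(vc_linear * math.log(4 * N * math.e / vc_linear))
    return 16.0 / math.sqrt(2 * N) * (a_full - a_lin) / approx_gap

def min_noise_rate_for_fixed_encoder(beta, K):
    """Smallest symmetric noise rate eps with 1 - eps*K/(K-1) <= beta, clipped to [0, 1]."""
    eps = (1.0 - beta) * (K - 1) / K
    return min(max(eps, 0.0), 1.0)
```

A larger capacity gap between the two function spaces, or a smaller approximation-error gap, raises `beta_prime` and therefore lowers the noise rate at which fixing the encoder starts to pay off.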

Insights and Takeaways

With the above analyses, we know learning with an unfixed encoder is not stable, which is easier to be affected by noisy patterns and yields a worse result than a properly selected fixed encoder when the noise rate is high. Restricting the search space makes the convergence stable (reducing estimation error) with the cost of increasing approximation errors. This motivates us to find a way to compromise between a fixed and unfixed encoder. We explore towards this direction in the next section. 

4. COMBATING MEMORIZATION BY REPRESENTATION REGULARIZATION

Our understandings in Section 3 motivate us to use the information from representations to regularize the model predictions. Intuitively, as long as the encoder is not fixed, the approximation error could be low enough. If the ERM is properly regularized, the search space and the corresponding estimation error could be reduced.

4.1. TRAINING FRAMEWORK

The training framework is shown in Figure 3, where a new learning path (self-supervised learning, SSL) $f \to h$ is added in parallel to Path-2 $f \to g$ (SL-training) in Figure 2. The newly added projection head $h$ is a one-hidden-layer MLP (Multi-Layer Perceptron) whose output represents SSL features (after dimension reduction). Its output is employed to regularize the output of the linear classifier $g$. Given an example $(x_n, \tilde{y}_n)$ and a random batch of features $B$ ($x_n \in B$), the loss is defined as:

$$L((x_n, \tilde{y}_n); f, g, h) = \underbrace{\ell(g(f(x_n)), \tilde{y}_n)}_{\text{SL Training}} + \underbrace{\ell_{\text{Info}}(h(f(x_n)), B)}_{\text{SSL Training}} + \lambda \underbrace{\ell_{\text{Reg}}(h(f(x_n)), g(f(x_n)), B)}_{\text{Representation Regularizer}}, \tag{1}$$

where $\lambda$ controls the scale of the regularizer. The loss $\ell$ for SL training could be either the traditional CE loss or a recent robust loss such as loss correction/reweighting (Patrini et al., 2017; Liu & Tao, 2015), GCE (Zhang & Sabuncu, 2018), or peer loss (Liu & Guo, 2020). The SSL features are learned by InfoNCE (Van den Oord et al., 2018):

$$\ell_{\text{Info}}(h(f(x_n)), B) := -\log \frac{\exp(\text{sim}(h(f(x_n)), h(f(x'_n))))}{\sum_{x_{n'} \in B, n' \neq n} \exp(\text{sim}(h(f(x_n)), h(f(x_{n'}))))}.$$

Note that InfoNCE and CE share a common encoder, inspired by the design of self distillation (Zhang et al., 2019). The regularization loss $\ell_{\text{Reg}}$ writes as:

$$\ell_{\text{Reg}}(h(f(x_n)), g(f(x_n)), B) = \frac{1}{|B|-1} \sum_{x_{n'} \in B, n \neq n'} d(\phi^w(t_n, t_{n'}), \phi^w(s_n, s_{n'})),$$

where $d(\cdot)$ is a distance measure for two inputs, e.g., $l_1$ or squared $l_2$ distance, $t_n = h(f(x_n))$, $s_n = g(f(x_n))$, and $\phi^w(t_n, t_{n'}) = \frac{1}{m}\|t_n - t_{n'}\|_w$, where $w \in \{1, 2\}$ selects the $l_1$ norm or the squared $l_2$ metric, and $m$ normalizes the distance over a batch:

$$m = \frac{1}{|B|(|B|-1)} \sum_{x_n, x_{n'} \in B, n \neq n'} \|t_n - t_{n'}\|_w. \tag{2}$$

The design of $\ell_{\text{Reg}}$ follows the idea of clusterability (Zhu et al., 2021b; 2022a) and is inspired by relational knowledge distillation (Park et al., 2019): instances with similar SSL features should have the same true label, and instances with different SSL features should have different true labels. Because SSL features are learned from the raw features $X$ and are independent of the noisy labels $\tilde{Y}$, using SSL features to regularize SL features is expected to mitigate memorizing noisy labels. We provide more theoretical understanding in the following subsection to show the effectiveness of this design.
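The regularizer defined above can be sketched in a few lines. This is an illustrative NumPy version that only computes the forward value on a batch; it is not the authors' implementation. It assumes $d$ is the squared difference, uses $w = 2$ for SSL features and $w = 1$ for classifier outputs by default (the choice used in the analysis of Section 4.2), and averages $\ell_{\text{Reg}}$ over the batch. The function names and the small `eps` added to the normalizer are assumptions.

```python
import numpy as np

def pairwise_dist(z, w):
    """Pairwise distances between rows of z: l1 norm (w=1) or squared l2 (w=2)."""
    diff = z[:, None, :] - z[None, :, :]
    if w == 1:
        return np.abs(diff).sum(axis=-1)
    if w == 2:
        return (diff ** 2).sum(axis=-1)
    raise ValueError("w must be 1 or 2")

def normalized_dist(z, w, eps=1e-12):
    """phi^w: pairwise distances divided by their mean m over all ordered pairs."""
    d = pairwise_dist(z, w)
    off_diag = ~np.eye(z.shape[0], dtype=bool)
    m = d[off_diag].mean()
    return d / (m + eps)  # eps guards against a batch of identical rows

def reg_loss(t, s, w_t=2, w_s=1):
    """Batch mean of l_Reg with d chosen as the squared difference.

    t: SSL features h(f(x)), shape (|B|, dim_t)
    s: classifier outputs g(f(x)), shape (|B|, dim_s)
    """
    off_diag = ~np.eye(t.shape[0], dtype=bool)
    gap = normalized_dist(t, w_t) - normalized_dist(s, w_s)
    return float((gap ** 2)[off_diag].mean())
```

Because both distance matrices are normalized by their batch means, the loss only compares the relative geometry of the two feature spaces, so rescaling either set of features leaves it unchanged.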

4.2. THEORETICAL UNDERSTANDING

We theoretically analyze how $\ell_{\text{Reg}}$ mitigates memorizing noisy labels in this subsection. As we discussed previously, SSL features are supposed to pull the model away from memorizing wrong labels due to clusterability (Zhu et al., 2021b). However, since the SL training is performed on the noisy data, when it achieves zero loss, the minimizer should be either memorizing each instance (for CE loss) or their claimed optimum (for other robust loss functions). Therefore, the global optimum should be at least affected by both SL training and representation regularization, where the scale is controlled by $\lambda$. For a clear presentation, we focus on analyzing the effect of $\ell_{\text{Reg}}$ in binary classification, whose minimizer approximates the global minimizer when $\lambda$ is sufficiently large.

Consider a randomly sampled batch $B$. Denote by $\mathcal{X}^2 := \{(x_i, x_j) | x_i \in B, x_j \in B, i \neq j\}$ the set of data pairs, and $d_{i,j} = d(\phi^w(t_i, t_j), \phi^w(s_i, s_j))$. The regularization loss of batch $B$ is decomposed as:

$$\frac{1}{|B|} \sum_{n | x_n \in B} \ell_{\text{Reg}}(h(f(x_n)), g(f(x_n)), B) = \frac{1}{|\mathcal{X}^2|} \Big( \underbrace{\sum_{(x_i, x_j) \in \mathcal{X}^2_T} d_{i,j}}_{\text{Term-1}} + \underbrace{\sum_{(x_i, x_j) \in \mathcal{X}^2_F} d_{i,j}}_{\text{Term-2}} + \underbrace{\sum_{x_i \in \mathcal{X}_T, x_j \in \mathcal{X}_F} 2 d_{i,j}}_{\text{Term-3}} \Big), \tag{3}$$

where $\mathcal{X} = \mathcal{X}_T \cup \mathcal{X}_F$, and $\mathcal{X}_T$/$\mathcal{X}_F$ denotes the set of instances whose labels are true/false. Note that the regularizer mainly works when SSL features "disagree" with SL features, i.e., Term-3. Denote by $X^+ \sim P(X|Y = 1)$, $X^- \sim P(X|Y = 0)$, $X^T \sim P(X|\tilde{Y} = Y)$, $X^F \sim P(X|\tilde{Y} \neq Y)$. For further analyses, we write Term-3 in the form of an expectation with $d$ chosen as the squared $l_2$ distance, i.e., the MSE loss:

$$L_c = \mathbb{E}_{X^T, X^F} \left( \frac{\|g(f(X^T)) - g(f(X^F))\|_1}{m_1} - \frac{\|h(f(X^T)) - h(f(X^F))\|^2}{m_2} \right)^2,$$

where $m_1$ and $m_2$ are normalization terms in Eqn (2). Note that in $L_c$, we use $w = 1$ for SL features and $w = 2$ for SSL features. Denote the variance by $\text{var}(\cdot)$. In the setting of binary classification, define the notations $X^F_+ \sim P(X|\tilde{Y} = 1, Y = 0)$ and $X^F_- \sim P(X|\tilde{Y} = 0, Y = 1)$.
To find a tractable way to analytically measure and quantify how feature correction relates to network robustness, we make three assumptions as follows:

Assumption 1 (Memorize clean instances). $\forall n \in \{n | \tilde{y}_n = y_n\}$, $\ell(g(f(x_n)), y_n) = 0$.

Assumption 2 (Same overfitting). $\text{var}(g(f(X^F_+))) = 0$ and $\text{var}(g(f(X^F_-))) = 0$.

Assumption 3 (Gaussian-distributed SSL features). The SSL features follow Gaussian distributions, i.e., $h(f(X^+)) \sim \mathcal{N}(\mu_1, \Sigma)$ and $h(f(X^-)) \sim \mathcal{N}(\mu_2, \Sigma)$, where $\Sigma$ is the covariance matrix.

Assumption 1 implies that a DNN has confident predictions on clean samples. Assumption 2 implies that a DNN has the same degree of overfitting for different classes of noisy samples. For example, an over-parameterized DNN can memorize all the noisy labels (Zhang et al., 2016), which is the focus of this paper. Thus these two assumptions are reasonable (we also provide empirical evidence for Assumption 2 in Appendix D.5). Assumption 3 assumes that SSL features follow Gaussian distributions when we add the regularizer. Intuitively, to provide useful regularization, the SSL features should not be arbitrarily bad. In our experiments, we find that the SSL features of CIFAR10 are relatively good, so the regularizer can be added from the very beginning, while for CIFAR100 we need some warmup epochs before adding the regularizer. Next, we present Theorem 2 to analyze the effect of $L_c$. Consider the case where the model is over-parameterized and traditional training can memorize all samples. Let $e_+ = P(\tilde{Y} = 0|Y = 1)$ and $e_- = P(\tilde{Y} = 1|Y = 0)$. We have:

Theorem 2. Based on Assumptions 1-3, when $e_- = e_+ = e$ and $P(Y = 1) = P(Y = 0)$, assuming that the Bayes classifier achieves zero error and that $N$ and $B$ are sufficiently large, minimizing $L_c$ over $g$, $h$, and $f$ on a DNN results in:

$$\mathbb{E}_{\mathcal{D}}[\mathbb{1}(g^*(f^*(X)), Y)] = e \cdot \left( \frac{1}{2} - \frac{1}{2 + \Delta(\Sigma, \mu_1, \mu_2)} \right),$$

where $\Delta(\Sigma, \mu_1, \mu_2) := 8 \cdot \text{tr}(\Sigma)/\|\mu_1 - \mu_2\|^2$, $\text{tr}(\cdot)$ denotes the matrix trace, and $g^*$, $f^*$ denote the optimal model.
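The closed form in Theorem 2 is easy to evaluate for given SSL feature statistics. The sketch below is only a numerical reading of the stated formula with hypothetical inputs; the function names are illustrative.

```python
import numpy as np

def delta_term(Sigma, mu1, mu2):
    """Delta(Sigma, mu1, mu2) = 8 * tr(Sigma) / ||mu1 - mu2||^2."""
    gap = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return 8.0 * np.trace(np.asarray(Sigma, dtype=float)) / np.sum(gap ** 2)

def theorem2_risk(e, Sigma, mu1, mu2):
    """Expected risk e * (1/2 - 1 / (2 + Delta)) stated in Theorem 2."""
    return e * (0.5 - 1.0 / (2.0 + delta_term(Sigma, mu1, mu2)))
```

For instance, with identity covariance in two dimensions and class means at squared distance 16, $\Delta = 1$ and the risk is $e/6$; as the class means move apart or the covariance shrinks, the value tends to 0, and it never exceeds $e/2$.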
Theorem 2 reveals a clean relationship between the quality of SSL features (given by $h(f(X))$) and the network robustness on noisy samples. When $\text{tr}(\Sigma) \to 0$ or $\|\mu_1 - \mu_2\| \to \infty$, the expected risk of the model $\mathbb{E}_{\mathcal{D}}[\mathbb{1}(g(f(X)), Y)]$ approaches 0, i.e., for any sample $x$, the model predicts its clean label. Note that the proof of Theorem 2 does not rely on any SSL training process. This makes it possible to use pre-trained encoders from other tasks. In the Appendix, we also provide a theoretical understanding of the regularizer from the perspective of information theory.

(He et al., 2016) for DogCat and ResNet34 for CIFAR10 and CIFAR100. The encoder is pre-trained by SimCLR (Chen et al., 2020). Detailed settings are reported in the Appendix.

5.1. THE EFFECT OF REPRESENTATIONS

We perform experiments to study the effect of representations on learning with noisy labels. Figure 5 shows the learning dynamics on symmetric label noise while Figure 4 shows the learning dynamics on asymmetric and instance-dependent label noise. From these two figures, given a good representation, we have some key observations:

• Observation-1: Fix encoders for high symmetric label noise.
• Observation-2: Do not fix encoders for low symmetric label noise.
• Observation-3: Do not fix the encoder when bias exists.
• Observation-4: A fixed encoder is more stable during learning.

[...] is very helpful for instance-dependent label noise since down-sampling can reduce noise rate imbalance (we provide an illustration of the binary case in the Appendix), which could lower the estimation error. Ideally, if down-sampling could make the noise-rate pattern symmetric, we could achieve noise consistency (Definition 1), which results in zero Bias in Theorem 1. Another interesting phenomenon is that, in Figure 5 (a) (b) (c), the crossing point is different for each dataset. This phenomenon can be explained by Corollary 1. Corollary 1 implies that if the encoder is learned very well, i.e., $\text{Error}_A(C^{\mathcal{G}\circ\mathcal{F}}_{\tilde{\mathcal{D}}}, C^*) \approx \text{Error}_A(C^{\mathcal{G}|f}_{\tilde{\mathcal{D}}}, C^*)$, fixing the encoder has benefits over an unfixed encoder even when the noise rate is small. In the DogCat, CIFAR10 and CIFAR100 datasets, each class has 12500, 5000 and 500 samples, respectively, so when applying self-supervised learning on these datasets, the encoder quality is DogCat > CIFAR10 > CIFAR100. Thus the crossing point is small for DogCat and large for CIFAR100.

5.2. THE PERFORMANCE OF USING REPRESENTATIONS AS A REGULARIZER

Experiments on synthetic label noise We first show that Regularizer can alleviate the over-fitting problem when ℓ in Equation ( 1) is simply chosen as Cross-Entropy loss. The experiments are shown in Figure 6 . Regularizer is added at the very beginning since recent studies show that for a randomly initialized network, the model tends to fit clean labels first (Arpit et al., 2017) and we hope the regularizer can improve the network robustness when DNN begins to fit noisy labels. From Figure 6 (c) (d), for CE training, the performance first increases then decreases since the network over-fits noisy labels as training proceeds. However, for CE with the regularizer, the performance is more stable after it reaches the peak. For 60% noise rate, the peak point is also much higher than vanilla CE training. For Figure 6 (a) (b), since the network is not randomly initialized, it over-fits noisy labels at the very beginning and the performance gradually decreases. However, for CE with the regularizer, it helps the network gradually increase the performance as the network reaches the lowest point (over-fitting state). This observation supports Theorem 2 that the regularizer can prevent over-fitting. Next, we show the regularizer can complement any other loss functions to further improve performance on learning with noisy labels. I.e., we choose ℓ in Equation (1) to be other robust losses. The overall experiments are shown in Table 1 . It can be observed that our regularizer can complement other loss functions or methods and improve their performance, especially for the last epoch accuracy. Note that we do not apply any tricks when incorporating other losses since we mainly want to observe the effect of the regularizer. It is possible to use other techniques to further improve performance such as multi-model training (Li et al., 2020) or mixup (Zhang et al., 2018a) .

Experiments on real-world label noise

We also test our regularizer on datasets with real-world label noise: CIFAR10N, CIFAR100N (Wei et al., 2022d) and Clothing1M (Xiao et al., 2015). The results are shown in Table 2 and Table 3. We find that our regularizer is also effective on datasets with real-world label noise even when $\ell$ in Equation (1) is simply chosen to be the cross-entropy loss. More experiments, analyses, and ablation studies can be found in the Appendix.

6. CONCLUSIONS

In 

APPENDIX

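Outline. The Appendix is arranged as follows: Section A proves the Lemmas and Theorems in Section 3. Section B proves Theorem 2 in Section 4 and provides a high-level understanding of the regularizer from the perspective of information theory. Section C illustrates why down-sampling can decrease the gap of noise rates. Section D provides the effect of the distance measure in Eqn (2) ($w = 1$ or $2$); the ablation study for Section 4; the effect of different SSL pre-training methods; the performance of shallow and deeper networks in high-noise settings; the empirical validation of Assumption 2; and experiments with the regularizer on the CIFAR100 dataset. Section E elaborates the detailed experimental settings of all the experiments in the paper.

A PROOF FOR LEMMAS AND THEOREMS IN SECTION 3

A.1 PROOF FOR LEMMA 1

Let $\underline{T} := \min_{X,i} T_{ii}(X)$. Considering a general instance-dependent label noise where $T_{ij}(X) = P(\tilde{Y} = j|Y = i, X)$, we have (Cheng et al., 2021)

$$\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})]
&= \sum_{j\in[K]} \int_x P(\tilde{Y} = j, X = x)\,\ell(C(x), j)\,dx \\
&= \sum_{i\in[K]}\sum_{j\in[K]} \int_x P(\tilde{Y} = j, Y = i, X = x)\,\ell(C(x), j)\,dx \\
&= \sum_{i\in[K]}\sum_{j\in[K]} P(Y = i) \int_x P(\tilde{Y} = j|Y = i, X = x)\,P(X = x|Y = i)\,\ell(C(x), j)\,dx \\
&= \sum_{i\in[K]}\sum_{j\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}\big[P(\tilde{Y} = j|Y = i, X)\,\ell(C(X), j)\big] \\
&= \sum_{i\in[K]}\sum_{j\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)\,\ell(C(X), j)] \\
&= \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ii}(X)\,\ell(C(X), i)] + \sum_{i\in[K]}\sum_{j\in[K], j\neq i} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)\,\ell(C(X), j)] \\
&= \underline{T} \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), i)] + \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[(T_{ii}(X) - \underline{T})\,\ell(C(X), i)] \\
&\quad + \sum_{i\in[K]}\sum_{j\in[K], j\neq i} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)\,\ell(C(X), j)] \\
&= \underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \sum_{j\in[K]}\sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[U_{ij}(X)\,\ell(C(X), j)],
\end{aligned}$$

where $U_{ij}(X) = T_{ij}(X), \forall i \neq j$, and $U_{jj}(X) = T_{jj}(X) - \underline{T}$.

A.2 PROOF FOR LEMMA 2

Consider the symmetric label noise. Let $T(X) \equiv T, \forall X$, where $T_{ii} = 1-\epsilon$, $T_{ij} = \frac{\epsilon}{K-1}, \forall i \neq j$. The general form in Lemma 1 can be simplified as

$$\begin{aligned}
\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})]
&= (1-\epsilon)\,\mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \frac{\epsilon}{K-1} \sum_{j\in[K]}\sum_{i\in[K], i\neq j} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), j)] \\
&= \Big(1 - \epsilon - \frac{\epsilon}{K-1}\Big)\,\mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \frac{\epsilon}{K-1} \sum_{j\in[K]}\sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), j)].
\end{aligned}$$

When $\ell$ is the 0-1 loss, we have $\sum_{j\in[K]}\sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), j)] = 1$ and

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = \Big(1 - \frac{\epsilon K}{K-1}\Big)\,\mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \frac{\epsilon}{K-1}.$$

Consider the asymmetric label noise. Let $T(X) \equiv T, \forall X$, where $T_{ii} = 1-\epsilon$, $T_{i,(i+1)_K} = \epsilon$. The general form in Lemma 1 can be simplified as

$$\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(C(X), \tilde{Y})] = (1-\epsilon)\,\mathbb{E}_{\mathcal{D}}[\ell(C(X), Y)] + \epsilon \sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{\mathcal{D}|Y=i}[\ell(C(X), (i+1)_K)].$$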

A.3 PROOF FOR THEOREM 1

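For symmetric noise, we have:

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] = \frac{\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)]}{1 - \epsilon K/(K-1)} - \frac{\epsilon/(K-1)}{1 - \epsilon K/(K-1)}.
$$

Thus the learning error is

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] - \mathbb{E}_{D}[\ell(C_{D}(X), Y)] = \frac{1}{1 - \epsilon K/(K-1)}\Big(\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)]\Big).
$$

Let $\hat{\mathbb{E}}_{\tilde D}[\ell(C(X), \tilde Y)] := \frac{1}{N}\sum_{n\in[N]} \ell(C(x_n), \tilde y_n)$. Noting $\hat{\mathbb{E}}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] - \hat{\mathbb{E}}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] \ge 0$, we have the following upper bound:

$$
\begin{aligned}
&\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] \\
&\le \mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \hat{\mathbb{E}}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] + \hat{\mathbb{E}}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] \\
&\le \big|\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \hat{\mathbb{E}}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)]\big| + \big|\hat{\mathbb{E}}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)]\big|.
\end{aligned}
$$

Recall $C \in \mathcal{C}$. Denote the VC-dimension of $\mathcal{C}$ by $|\mathcal{C}|$ (Bousquet et al., 2003; Devroye et al., 2013). By Hoeffding's inequality with function space $\mathcal{C}$, with probability at least $1 - \delta$, we have

$$
\begin{aligned}
&\big|\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \hat{\mathbb{E}}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)]\big| + \big|\hat{\mathbb{E}}_{\tilde D}[\ell(C_{D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)]\big| \\
&\le 2\max_{C\in\mathcal{C}} \big|\mathbb{E}_{\tilde D}[\ell(C(X), \tilde Y)] - \hat{\mathbb{E}}_{\tilde D}[\ell(C(X), \tilde Y)]\big| \\
&\le 16\sqrt{\frac{|\mathcal{C}|\log(Ne/|\mathcal{C}|) + \log(8/\delta)}{2N}}.
\end{aligned}
$$

Thus

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] - \mathbb{E}_{D}[\ell(C_{D}(X), Y)] \le 16\sqrt{\frac{|\mathcal{C}|\log(Ne/|\mathcal{C}|) + \log(8/\delta)}{2N\big(1 - \frac{\epsilon K}{K-1}\big)^2}}.
$$

Similarly, for asymmetric noise, we have:

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] = \frac{\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)]}{1 - \epsilon} - \mathrm{Bias}(\hat C_{\tilde D}),
$$

where

$$
\mathrm{Bias}(\hat C_{\tilde D}) = \frac{\epsilon}{1-\epsilon}\sum_{i\in[K]} P(Y = i)\,\mathbb{E}_{D|Y=i}\big[\ell(\hat C_{\tilde D}(X), (i+1)_K)\big].
$$

Thus the learning error is

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] - \mathbb{E}_{D}[\ell(C_{D}(X), Y)] = \frac{1}{1-\epsilon}\Big(\mathbb{E}_{\tilde D}[\ell(\hat C_{\tilde D}(X), \tilde Y)] - \mathbb{E}_{\tilde D}[\ell(C_{D}(X), \tilde Y)]\Big) + \big(\mathrm{Bias}(C_{D}) - \mathrm{Bias}(\hat C_{\tilde D})\big).
$$

By repeating the derivation for the symmetric noise, we have

$$
\mathbb{E}_{D}[\ell(\hat C_{\tilde D}(X), Y)] - \mathbb{E}_{D}[\ell(C_{D}(X), Y)] \le 16\sqrt{\frac{|\mathcal{C}|\log(Ne/|\mathcal{C}|) + \log(8/\delta)}{2N(1-\epsilon)^2}} + \big(\mathrm{Bias}(C_{D}) - \mathrm{Bias}(\hat C_{\tilde D})\big).
$$

A.4 PROOF FOR COROLLARY 1

Symmetric noise. Let $\mathcal{C}_1 = \mathcal{G}\circ\mathcal{F}$ and $\mathcal{C}_2 = \mathcal{G}|f$. From Lemma A.4 in (Shalev-Shwartz & Ben-David, 2014) and our Theorem 1, we know

$$
\mathbb{E}_{\delta}\,\big|\mathrm{Error}_E(C_{D}, \hat C_{\tilde D})\big| \le 16\,\frac{\sqrt{|\mathcal{C}|\log(4Ne/|\mathcal{C}|)} + 2}{\sqrt{2N\big(1 - \frac{\epsilon K}{K-1}\big)^2}}.
$$

Therefore, by requiring the difference between the two total generalization errors to be no less than 0, we have:

$$
\begin{aligned}
&\mathbb{E}_{\delta}|\Delta_E(\mathcal{C}_1, \epsilon, \delta)| + \Delta_A(\mathcal{C}_1) - \mathbb{E}_{\delta}|\Delta_E(\mathcal{C}_2, \epsilon, \delta)| - \Delta_A(\mathcal{C}_2) \ge 0 \\
\Leftrightarrow\;& 16\,\frac{\sqrt{|\mathcal{G}\circ\mathcal{F}|\log(4Ne/|\mathcal{G}\circ\mathcal{F}|)} + 2}{\sqrt{2N\big(1 - \frac{\epsilon K}{K-1}\big)^2}} - 16\,\frac{\sqrt{|\mathcal{G}|\log(4Ne/|\mathcal{G}|)} + 2}{\sqrt{2N\big(1 - \frac{\epsilon K}{K-1}\big)^2}} + \mathrm{Error}_A(C_{D}^{\mathcal{G}\circ\mathcal{F}}, C^*) - \mathrm{Error}_A(C_{D}^{\mathcal{G}|f}, C^*) \ge 0 \\
\Leftrightarrow\;& 1 - \frac{\epsilon K}{K-1} \le \frac{16}{\sqrt{2N}}\cdot\frac{\sqrt{|\mathcal{G}\circ\mathcal{F}|\log(4Ne/|\mathcal{G}\circ\mathcal{F}|)} - \sqrt{|\mathcal{G}|\log(4Ne/|\mathcal{G}|)}}{\mathrm{Error}_A(C_{D}^{\mathcal{G}|f}, C^*) - \mathrm{Error}_A(C_{D}^{\mathcal{G}\circ\mathcal{F}}, C^*)}.
\end{aligned}
$$

B PROOF FOR THEOREMS IN SECTION 4

Lemma 3. If $X$ and $Y$ are independent and follow Gaussian distributions $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ and $Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$, then $\mathbb{E}_{X,Y}\big(\|X - Y\|^2\big) = \|\mu_X - \mu_Y\|^2 + \mathrm{tr}(\Sigma_X + \Sigma_Y)$.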
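Lemma 3 is easy to check by simulation. The following sketch compares the closed form with a Monte Carlo estimate; the dimension, the random means and covariances, the sample size, and the function name `expected_sq_dist` are arbitrary assumptions made for illustration.

```python
import numpy as np

def expected_sq_dist(mu_x, cov_x, mu_y, cov_y):
    """Closed form of E||X - Y||^2 for independent Gaussians (Lemma 3)."""
    diff = np.asarray(mu_x) - np.asarray(mu_y)
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y))

rng = np.random.default_rng(0)
d, n = 3, 200_000                      # assumed dimension and sample size
mu_x, mu_y = rng.normal(size=d), rng.normal(size=d)
a, b = rng.normal(size=(d, d)), rng.normal(size=(d, d))
cov_x, cov_y = a @ a.T, b @ b.T        # symmetric positive semi-definite

x = rng.multivariate_normal(mu_x, cov_x, size=n)
y = rng.multivariate_normal(mu_y, cov_y, size=n)   # drawn independently of x
monte_carlo = float(np.mean(np.sum((x - y) ** 2, axis=1)))
closed_form = expected_sq_dist(mu_x, cov_x, mu_y, cov_y)
print(monte_carlo, closed_form)
```

With this many samples the two values should agree to within a small relative error.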

B.1 PROOF FOR THEOREM 2

Before the derivation, we define some notation for better presentation. Following the notation in Section 4, define the labels of $X_T$ as $Y_T$ and the labels of $X_F$ as $Y_F$. Under the label noise, it is easy to verify that

$$
P(Y_T = 1) = \frac{P(Y=1)(1-e_+)}{P(Y=1)(1-e_+) + P(Y=0)(1-e_-)}, \qquad P(Y_F = 1) = \frac{P(Y=0)\,e_-}{P(Y=0)\,e_- + P(Y=1)\,e_+}.
$$

Let $p_1 = P(Y_T = 1)$ and $p_2 = P(Y_F = 1)$, and abbreviate $g(f(X))$ and $h(f(X))$ as $gf(X)$ and $hf(X)$. In the case of binary classification, $gf(x)$ is a one-dimensional value that denotes the network prediction on $x$ belonging to $Y = 1$. $L_c$ can be written as:

$$
\begin{aligned}
L_c &= \mathbb{E}_{X_T, X_F}\underbrace{\Big(\frac{\|gf(X_T) - gf(X_F)\|_1}{m_1} - \frac{\|hf(X_T) - hf(X_F)\|^2}{m_2}\Big)^2}_{\text{denoted as } \Psi(X_T, X_F)} \\
&\overset{(a)}{=} \mathbb{E}_{(X_T, Y_T), (X_F, Y_F)}\Psi(X_T, X_F) \\
&= p_1 p_2\,\mathbb{E}_{X_T^+, X_F^+}\Psi(X_T^+, X_F^+) + (1-p_1)\,p_2\,\mathbb{E}_{X_T^-, X_F^+}\Psi(X_T^-, X_F^+) \\
&\quad + p_1(1-p_2)\,\mathbb{E}_{X_T^+, X_F^-}\Psi(X_T^+, X_F^-) + (1-p_1)(1-p_2)\,\mathbb{E}_{X_T^-, X_F^-}\Psi(X_T^-, X_F^-),
\end{aligned}
$$

where $m_1$ and $m_2$ are the normalization terms from Equation (2). Specifically,

$$
m_1 := \lim_{|B|\to\infty}\frac{1}{|B|(|B|-1)}\sum_{x_n, x_{n'}\in B,\, n\neq n'}\|gf(x_n) - gf(x_{n'})\|_1, \qquad
m_2 := \lim_{|B|\to\infty}\frac{1}{|B|(|B|-1)}\sum_{x_n, x_{n'}\in B,\, n\neq n'}\|hf(x_n) - hf(x_{n'})\|^2.
$$

(a) is satisfied because $\Psi(X_T, X_F)$ is irrelevant to the labels. We derive $\Psi(X_T^+, X_F^+)$ as follows:

$$
\begin{aligned}
\mathbb{E}_{X_T^+, X_F^+}\Psi(X_T^+, X_F^+)
&\overset{(b)}{=} \mathbb{E}_{X_T^+, X_F^+}\Big(\frac{\|1 - gf(X_F^+)\|_1}{m_1} - \frac{\|hf(X_T^+) - hf(X_F^+)\|^2}{m_2}\Big)^2 \\
&\overset{(c)}{=} \mathbb{E}_{X_T^+, X_F^+}\Big(\frac{1 - gf(X_F^+)}{m_1} - \frac{\|hf(X_T^+) - hf(X_F^+)\|^2}{m_2}\Big)^2 \\
&\overset{(d)}{=} \mathbb{E}_{X_T^+, X_F^+}\Big(\frac{gf(X_F^+)}{m_1} - \Big(\frac{1}{m_1} - \frac{\|hf(X_T^+) - hf(X_F^+)\|^2}{m_2}\Big)\Big)^2.
\end{aligned}
$$

(b) is satisfied because, from Assumption 1, the DNN has confident predictions on clean samples. (c) is satisfied because $gf(X)$ is a one-dimensional value that ranges from 0 to 1. From Assumption 3, $hf(X^+)$ and $hf(X^-)$ follow Gaussian distributions with parameters $(\mu_1, \Sigma)$ and $(\mu_2, \Sigma)$. Thus, according to Lemma 3, we have $\mathbb{E}_{X_T^+, X_F^+}\|hf(X_T^+) - hf(X_F^+)\|^2 = \|\mu_1 - \mu_2\|^2 + 2\,\mathrm{tr}(\Sigma)$. Similarly, one can calculate $\mathbb{E}_{X_T^-, X_F^+}\|hf(X_T^-) - hf(X_F^+)\|^2 = 2\,\mathrm{tr}(\Sigma)$.

It can be seen that (d) is a function of $gf(X_F^+)$. Similarly, $\Psi(X_T^-, X_F^+)$ is also a function of $gf(X_F^+)$, while $\Psi(X_T^+, X_F^-)$ and $\Psi(X_T^-, X_F^-)$ are functions of $gf(X_F^-)$. Denote $d(+,+) = \mathbb{E}_{X_T^+, X_F^+}\|hf(X_T^+) - hf(X_F^+)\|^2$, and define $d(-,+)$, $d(+,-)$ and $d(-,-)$ analogously. After organizing $\Psi(X_T^+, X_F^+)$ and $\Psi(X_T^-, X_F^+)$, we have:

$$
\begin{aligned}
&\min_{gf(X_F^+)}\; p_1 p_2\,\mathbb{E}_{X_T^+, X_F^+}\Psi(X_T^+, X_F^+) + (1-p_1)\,p_2\,\mathbb{E}_{X_T^-, X_F^+}\Psi(X_T^-, X_F^+) \\
\Rightarrow\;&\min_{gf(X_F^+)}\; \big(\mathbb{E}_{X_F^+}\,gf(X_F^+)\big)^2 - \Big(2p_1\Big(1 - \frac{m_1\,d(+,+)}{m_2}\Big) + 2(1-p_1)\frac{m_1\,d(-,+)}{m_2}\Big)\,\mathbb{E}_{X_F^+}\,gf(X_F^+) + \text{constant w.r.t. } gf(X_F^+). \qquad (5)
\end{aligned}
$$

Note that in Equation (5) we use $(\mathbb{E}_{X_F^+}\,gf(X_F^+))^2$ to approximate $\mathbb{E}_{X_F^+}\,gf(X_F^+)^2$ since, from Assumption 2, $\mathrm{var}(g(f(X_F^+))) \to 0$. Now we calculate $m_1$ and $m_2$ from Equation (2):

$$
\begin{aligned}
m_1 &= p_1 p_2\big(1 - \mathbb{E}_{X_F^+}\,gf(X_F^+)\big) + (1-p_1)\,p_2\,\mathbb{E}_{X_F^+}\,gf(X_F^+) + p_1(1-p_2)\big(1 - \mathbb{E}_{X_F^-}\,gf(X_F^-)\big) + (1-p_1)(1-p_2)\,\mathbb{E}_{X_F^-}\,gf(X_F^-), \\
m_2 &= p_1 p_2\,d(+,+) + (1-p_1)\,p_2\,d(-,+) + p_1(1-p_2)\,d(+,-) + (1-p_1)(1-p_2)\,d(-,-). \qquad (6)
\end{aligned}
$$

Under the condition $P(Y=1) = P(Y=0)$ and $e_- = e_+$, we have $p_1 = p_2 = \frac{1}{2}$, $m_2 = \frac{4\,\mathrm{tr}(\Sigma) + \|\mu_1 - \mu_2\|^2}{2}$ and $m_1 = \frac{1}{2}$, which are constant with respect to $\mathbb{E}_{X_F^+}\,gf(X_F^+)$ and $\mathbb{E}_{X_F^-}\,gf(X_F^-)$ in Equation (6). Thus Equation (5) is a quadratic in $\mathbb{E}_{X_F^+}\,gf(X_F^+)$. When Equation (5) achieves its global minimum, we have:

$$
\mathbb{E}_{X_F^+}\,gf(X_F^+) = p_1 - \frac{m_1}{m_2}\big(p_1\,d(+,+) - (1-p_1)\,d(-,+)\big) = \frac{1}{2} - \frac{1}{2 + \frac{8\,\mathrm{tr}(\Sigma)}{\|\mu_1 - \mu_2\|^2}}. \qquad (7)
$$

Similarly, organizing $\Psi(X_T^+, X_F^-)$ and $\Psi(X_T^-, X_F^-)$ gives the solution for $\mathbb{E}_{X_F^-}\,gf(X_F^-)$:

$$
\mathbb{E}_{X_F^-}\,gf(X_F^-) = p_1 + \frac{m_1}{m_2}\big(p_1\,d(-,-) - (1-p_1)\,d(+,-)\big) = \frac{1}{2} + \frac{1}{2 + \frac{8\,\mathrm{tr}(\Sigma)}{\|\mu_1 - \mu_2\|^2}}. \qquad (8)
$$

Denote $\Delta(\Sigma, \mu_1, \mu_2) = 8\,\mathrm{tr}(\Sigma)/\|\mu_1 - \mu_2\|^2$. Now we can write the expected risk as:

$$
\begin{aligned}
\mathbb{E}_{D}\big[\mathbb{1}(g(f(X)), Y)\big]
&= (1-e)\,\mathbb{E}_{X_T, Y}\,\mathbb{1}(g(f(X_T)), Y) + e\,\mathbb{E}_{X_F, Y}\,\mathbb{1}(g(f(X_F)), Y) \\
&\overset{(a)}{=} e\,\mathbb{E}_{X_F, Y}\,\mathbb{1}(g(f(X_F)), Y) \\
&\overset{(b)}{=} e\Big(\frac{1}{2}\,\mathbb{E}_{X_F^+, Y=0}\,\mathbb{1}(g(f(X_F^+)), 0) + \frac{1}{2}\,\mathbb{E}_{X_F^-, Y=1}\,\mathbb{1}(g(f(X_F^-)), 1)\Big) \\
&\overset{(c)}{=} e\Big(\frac{1}{2} - \frac{1}{2 + \Delta(\Sigma, \mu_1, \mu_2)}\Big). \qquad (9)
\end{aligned}
$$

Proof done.
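To see how Equation (9) behaves, the closed form can be evaluated directly. The sketch below is illustrative only: the function names `delta` and `expected_risk`, the identity covariance, and the two example cluster separations are assumptions chosen to show the trend, not settings from the paper.

```python
import numpy as np

def delta(cov, mu1, mu2):
    """Delta(Sigma, mu1, mu2) = 8 * tr(Sigma) / ||mu1 - mu2||^2."""
    gap = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return 8.0 * float(np.trace(cov)) / float(gap @ gap)

def expected_risk(e, cov, mu1, mu2):
    """Right-hand side of Equation (9): e * (1/2 - 1 / (2 + Delta))."""
    return e * (0.5 - 1.0 / (2.0 + delta(cov, mu1, mu2)))

cov = np.eye(2)                                   # assumed SSL feature covariance
risk_close = expected_risk(0.4, cov, [0.0, 0.0], [1.0, 0.0])    # Delta = 16
risk_far = expected_risk(0.4, cov, [0.0, 0.0], [10.0, 0.0])     # Delta = 0.16
print(risk_close, risk_far)
```

Well-separated, tight SSL clusters give a small Δ and a risk close to 0, while overlapping clusters give a large Δ and a risk approaching e/2.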

B.2 HIGH LEVEL UNDERSTANDING ON THE REGULARIZER

Even though Theorem 2 shows that SL features can benefit from the structure of SSL features through regularization, a high-level understanding of what the regularization is doing is still lacking. Theorem 3 provides one insight: the regularization implicitly maximizes the mutual information between SL features and SSL features.

Theorem 3. Suppose there exists a function $\xi$ such that $C(X) = \xi(h(f(X)))$. The mutual information $I(h(f(X)), C(X))$ achieves its maximum when $L_c = 0$ in Eqn. (4).

The above result facilitates a better understanding of what the regularizer is doing. Note that mutual information itself has several popular estimators (Belghazi et al., 2018; Hjelm et al., 2018). Developing regularizers based on MI that utilize SSL features is an interesting future direction.

Thus the optimal down-sampling rate is $r = \sqrt{\frac{e_-(1-e_+)}{e_+(1-e_-)}}$, which can be calculated if $e_-$ and $e_+$ are known.

Proof for Proposition 2: If the down-sampling strategy is to make $P(\tilde Y = 1) = P(\tilde Y = 0)$, then $r\,(e_+ + 1 - e_-) = 1 - e_+ + e_-$, so $r = \frac{1 - e_+ + e_-}{1 - e_- + e_+}$. Thus $e_+^*$ can be calculated as:

$$
e_+^* = \frac{r\,e_+}{1 - e_+ + r\,e_+} = \frac{(1 - e_+ + e_-)\,e_+}{(1-e_+)(1 - e_- + e_+) + e_+(1 - e_+ + e_-)}.
$$

Denote $\alpha = \frac{1 - e_+ + e_-}{(1-e_+)(1 - e_- + e_+) + e_+(1 - e_+ + e_-)}$. Since $e_+ > e_-$, we have $1 - e_- + e_+ > 1 - e_+ + e_-$, and therefore

$$
\alpha < \frac{1 - e_+ + e_-}{(1-e_+)(1 - e_+ + e_-) + e_+(1 - e_+ + e_-)} = 1.
$$

Similarly, $e_-^*$ can be calculated as:

$$
e_-^* = \frac{e_-}{e_- + r\,(1-e_-)} = \frac{(1 - e_- + e_+)\,e_-}{e_-(1 - e_- + e_+) + (1-e_-)(1 - e_+ + e_-)}.
$$

Denote $\beta = \frac{1 - e_- + e_+}{e_-(1 - e_- + e_+) + (1-e_-)(1 - e_+ + e_-)}$. By the same argument, $\beta > 1$. Since $\alpha e_+ < e_+$ and $\beta e_- > e_-$, we have $e_+^* - e_-^* = \alpha e_+ - \beta e_- < e_+ - e_-$.

Next, we prove $e_+^* > e_-^*$:

$$
\begin{aligned}
e_+^* > e_-^* &\Longleftrightarrow \frac{r\,e_+}{1 - e_+ + r\,e_+} > \frac{e_-}{e_- + r\,(1-e_-)} \Longleftrightarrow r^2 > \frac{e_-(1-e_+)}{e_+(1-e_-)} \\
&\Longleftrightarrow \Big(\frac{1 - e_+ + e_-}{1 - e_- + e_+}\Big)^2 > \frac{e_-(1-e_+)}{e_+(1-e_-)} \\
&\Longleftrightarrow e_+(1-e_+) + \frac{e_+\,e_-^2}{1-e_+} > e_-(1-e_-) + \frac{e_-\,e_+^2}{1-e_-}.
\end{aligned}
$$

Let

$$
f(e_+) = e_+(1-e_+) + \frac{e_+\,e_-^2}{1-e_+} - e_-(1-e_-) - \frac{e_-\,e_+^2}{1-e_-}.
$$

Since we have assumed $e_- < e_+$ and $e_- + e_+ < 1$, proving $e_+^* > e_-^*$ is identical to proving $f(e_+) > 0$ when $e_- < e_+ < 1 - e_-$. Firstly, it is easy to verify that $f(e_+) = 0$ when $e_+ = e_-$ or $e_+ = 1 - e_-$. By the mean value theorem, there must exist a point $e_0$ with $e_- < e_0 < 1 - e_-$ that satisfies $f'(e_0) = 0$.

Next, we differentiate $f(e_+)$ as follows:

$$
f'(e_+) = \frac{(1-e_+)^2(1-e_-) + e_-^2(1-e_-) - 2e_+(1-e_+)^2}{(1-e_+)^2(1-e_-)}.
$$

It can be verified that $f'(e_-) = \frac{(1-2e_-)^2}{(1-e_-)^2} > 0$ and $f'(1-e_-) = \frac{0}{e_-^2(1-e_-)} = 0$. Differentiating $f'(e_+)$ further, we get $f''(e_+) < 0$ when $e_+ < 1 - ((1-e_-)\,e_-^2)^{1/3}$ and $f''(e_+) > 0$ when $e_+ > 1 - ((1-e_-)\,e_-^2)^{1/3}$. Since $e_- < e_+$ and $e_- + e_+ < 1$, we have $e_- < \frac{1}{2}$ and $e_- < 1 - ((1-e_-)\,e_-^2)^{1/3} < 1 - e_-$, i.e., $1 - ((1-e_-)\,e_-^2)^{1/3}$ lies between $e_-$ and $1 - e_-$. Thus $f'$ decreases on $e_- < e_+ < 1 - ((1-e_-)\,e_-^2)^{1/3}$ and increases afterwards. Because $f'(e_-) > 0$ and $f'(1-e_-) = 0$, $f'$ has exactly one zero $e_0$ inside the interval: when $e_- < e_+ < e_0$, $f(e_+)$ monotonically increases, and when $e_0 < e_+ < 1 - e_-$, $f(e_+)$ monotonically decreases. Since $f(e_-) = f(1-e_-) = 0$, we have $f(e_+) > 0$ when $e_- < e_+ < 1 - e_-$.

Proof done.

Figure 8: Type 1 denotes using the $\ell_2$ norm to calculate the distance between SL features and the squared $\ell_2$ norm to calculate the distance between SSL features, which is adopted in our paper. Type 2 denotes using the $\ell_2$ norm to calculate the distance for both SL and SSL features.

Figure 7 illustrates the effect of the down-sampling strategy. The curves in the figure support our propositions and proof. When $e_+ - e_-$ is large, down-sampling to make $P(\tilde Y = 1) = P(\tilde Y = 0)$ substantially decreases the gap even if we do not know the true values of $e_-$ and $e_+$.
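The effect of balancing the noisy labels can also be reproduced with a few lines of arithmetic. The sketch below assumes a balanced binary dataset in which the over-represented noisy class is kept at rate r, so that the post-sampling noise rates are r·e+ / (1 - e+ + r·e+) and e- / (e- + r·(1 - e-)); the function name `downsample_noise_rates` and the example rates 0.4 and 0.1 are illustrative assumptions.

```python
def downsample_noise_rates(e_plus, e_minus):
    """Noise rates after down-sampling so that both noisy classes are equally frequent.

    Assumes a balanced binary dataset with e_plus > e_minus and e_plus + e_minus < 1.
    Returns (r, e_plus_star, e_minus_star).
    """
    # Keep-rate of the over-represented noisy class that balances the noisy labels.
    r = (1 - e_plus + e_minus) / (1 - e_minus + e_plus)
    e_plus_star = r * e_plus / (1 - e_plus + r * e_plus)
    e_minus_star = e_minus / (e_minus + r * (1 - e_minus))
    return r, e_plus_star, e_minus_star

r, e_plus_star, e_minus_star = downsample_noise_rates(0.4, 0.1)
print(f"r={r:.4f}  e+*={e_plus_star:.4f}  e-*={e_minus_star:.4f}")
print("gap before:", 0.4 - 0.1, "gap after:", e_plus_star - e_minus_star)
```

For these example rates the gap between the two noise rates shrinks while their order is preserved, which is what Proposition 2 states.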

D MORE DISCUSSIONS AND EXPERIMENTS

D.1 THE EFFECT OF DISTANCE MEASURE IN EQN (2)

In this paper and its experiments, we use the $\ell_2$ norm to calculate the distance between SL features and the squared $\ell_2$ norm to calculate the distance between SSL features. This choice leads to good performance, as supported by Theorem 2 and Figure 6. Practically, since the structure regularization mainly captures the relations, a different choice has little effect on performance. The experiment in Figure 8 shows that the performance of both types is quite close.

D.2 ABLATION STUDY

In Figure 3, SSL training provides SSL features to regularize the output of the linear classifier g. However, SSL training itself may have a positive effect on the DNN. To show that the robustness mainly comes from the regularizer rather than from SSL training, we perform an ablation study in Figure 9. The experiments show that it is the regularizer that alleviates the over-fitting problem of the DNN.

D.3 THE EFFECT OF DIFFERENT SSL-PRETRAINED METHODS

Our experiments are not restricted to any specific SSL method; other SSL methods can also be adopted to pre-train the SSL encoder. In Figure 5, SimCLR (Chen et al., 2020) is adopted to pre-train the SSL encoder. For comparison, we pre-train an encoder with MoCo on CIFAR10 and fine-tune the linear classifier on noisy labels in Table 4.

Apart from the empirical results in Section 5, we also provide empirical evidence that a shallow network performs better than a deeper network in high symmetric noise settings, which further validates Corollary 1: similar to fixing the encoder, a shallow network also has lower capacity than a deeper network. The experimental settings are as follows: network structure: ResNet18 vs. ResNet50; dataset: CIFAR-10; loss: cross entropy; number of epochs: 100; batch size: 64; learning rate: 0.1 for the first 50 epochs and 0.01 for the last 50 epochs; optimizer: SGD. We report the best-epoch test accuracy in Table 5. It can be observed that the shallow network behaves better at high symmetric noise ratios, which supports our claim in the paper.

D.5 VALIDATION OF ASSUMPTION 2

The zero-variance assumption (Assumption 2) used in the proof of Theorem 2 is backed up by Zhang et al. (2016), who show that a DNN memorizes all the noisy samples when it converges, resulting in near-zero loss. We perform experiments on CIFAR-100 with symmetric label noise to validate this assumption. The results are reported in Table 6. It can be seen that the variance of noisy

We perform experiments on CIFAR100 under symmetric label noise ratio 0.6 with our regularizer for different batch sizes. Table 7 shows that increasing the batch size yields a slight performance gain.

D.7 EXPERIMENTS TOWARDS REGULARIZER ON CIFAR100

In this section, we examine our regularizer on CIFAR100 dataset with certain SOTA methods from (Liu et al., 2020) . Results are reported in Table 8 from which we can see that our proposed regularizer can also improve performance on CIFAR100 dataset.

E DETAILED SETTING OF EXPERIMENTS

Datasets: We use DogCat, CIFAR10, CIFAR100, CIFAR10N, CIFAR100N and Clothing1M for experiments. DogCat has 25000 images; we randomly choose 24000 images for training and 1000 images for testing. For CIFAR10 and CIFAR100, we follow the standard setting, using 50000 images for training and 10000 images for testing. CIFAR10N and CIFAR100N contain the same images as CIFAR10 and CIFAR100, except that the labels are annotated by real humans via Amazon MTurk and therefore contain real-world human noise. For Clothing1M, we use noisy data for training and clean data for testing.

Setting for Figure 1: We use ResNet34 for the experiments. All the experiments in Figure 1 are trained from scratch with the following hyper-parameters: learning rate (0.1 for the first 50 epochs and 0.01 for the last 50 epochs), batch size (256), optimizer (SGD).

Setting in Section 5.1 (Figure 5 and Figure 4): SimCLR is deployed for SSL pre-training, with ResNet50 for DogCat and ResNet34 for CIFAR10 and CIFAR100. Each model is pre-trained for 1000 epochs with the Adam optimizer (lr = 1e-3) and batch size 512. During fine-tuning, we fine-tune the classifier on the noisy dataset with Adam (lr = 1e-3) for 100 epochs with batch size 256.

Setting in Section 5.2: For Table 1, all the methods are trained from scratch with the learning rate set to 0.1 initially and decayed by a factor of 0.1 at 50 epochs. For Table 2 and Table 3, the encoder is pre-trained by SimCLR and we fine-tune the encoder on the noisy dataset with CE + Regularizer. The optimizer is Adam with learning rate 1e-3 and batch size 256. Note that in Eqn (4) we use the MSE loss for measuring the relations between SL features and SSL features. However, since the MSE loss may cause gradient explosion when the prediction is far from the ground truth, we use the smooth $\ell_1$ loss instead. The smooth $\ell_1$ loss is an enhanced version of the MSE loss: when the prediction is not very far from the ground truth, it behaves as MSE, and when the prediction is far, it behaves as MAE.
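The switch between the quadratic and the linear regime of the smooth $\ell_1$ loss can be written out explicitly. This is a minimal NumPy sketch of the commonly used definition with a threshold parameter; the threshold `beta = 1.0` and the function name `smooth_l1` are assumptions, since the text above does not state which threshold was used.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for |error| < beta, linear otherwise.

    beta is an assumed threshold; it is not specified in the text above.
    """
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * diff ** 2 / beta      # MSE-like regime near the target
    linear = diff - 0.5 * beta              # MAE-like regime far from the target
    return float(np.mean(np.where(diff < beta, quadratic, linear)))

small_error = smooth_l1([0.5], [0.0])   # falls in the quadratic regime
large_error = smooth_l1([3.0], [0.0])   # falls in the linear regime
print(small_error, large_error)
```

Because the linear branch has a bounded gradient, a far-off prediction contributes a gradient of constant magnitude instead of one that grows with the error.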



Practically, different choices make negligible effects on performance. See more details in Appendix.



Figure 1: Training and test accuracies on CIFAR-10 with symmetric noise at noise rates 0.4 (blue curves) and 0.6 (red curves). We use ResNet34 for the experiments (see the detailed setting in Appendix E).

THEORETICAL TOOLS

Denote by $C_{D} := \arg\min_{C\in\mathcal{C}} \mathbb{E}_{D}[\ell(C(X), Y)]$ the optimal clean classifier, by $C_{\tilde D} := \arg\min_{C\in\mathcal{C}} \mathbb{E}_{\tilde D}[\ell(C(X), \tilde Y)]$ the optimal noisy classifier, and by $\hat C_{\tilde D} := \arg\min_{C\in\mathcal{C}} \sum_{n\in[N]} \ell(C(x_n), \tilde y_n)$ the classifier learned on the noisy dataset. The expected risk w.r.t. the Bayes optimal classifier $C^*$ can be decomposed into two parts:

Figure 2: Illustration of different learning paths (distinguished by colors). The curve with arrow between two green dots indicates the effort (e.g., number of training instances) of training a model from one state to another state.

Now we instantiate function spaces $\mathcal{C}_1$ and $\mathcal{C}_2$ with different representations. With traditional training or an unfixed encoder (Path-1 or Path-2), classifier $C$ is optimized over function space $\mathcal{C}_1 = \mathcal{G}\circ\mathcal{F}$ with raw data. With a fixed encoder (Path-3), classifier $C$ is optimized over function space $\mathcal{G}$ given representations $f(X)$. Symmetric noise. Let $\mathcal{C}_1 = \mathcal{G}\circ\mathcal{F}$ and $\mathcal{C}_2 = \mathcal{G}|f$. Denote the optimal classifiers learned within the above two function spaces by $C_{D}^{\mathcal{G}\circ\mathcal{F}}$ and $C_{D}^{\mathcal{G}|f}$, respectively. Then the approximation errors of both cases can be denoted by Error A

Figure 3: The training framework of using representations (SSL features) to regularize learning with noisy labels (SL features).

Figure 5: (a) (b) (c): Performance of CE on DogCat, CIFAR10 and CIFAR100 under symmetric noise rate. For each noise rate, the best epoch test accuracy is recorded. The blue line represents training with fixed encoder and the red line represents training with unfixed encoder; (d): test accuracy of CIFAR10 on each training epoch under symmetric 0.6 noise rate. We use ResNet50(He et al., 2016) for DogCat and ResNet34 for CIFAR10 and CIFAR100. Encoder is pre-trained by SimCLR(Chen et al., 2020). Detailed settings are reported in the Appendix.

Figure 4: (a) Performance of CE on asymmetric label noise. (b) Performance of CE on instance-dependent label noise. The generation of instance-dependent label noise follows CORES (Cheng et al., 2021).

Observation-1 and Observation-2 are verified by Figure 5(a)(b)(c), and Observation-3 is verified by Figure 4(a)(b). Observation-4 is verified by Figure 5(d). These four observations are consistent with our analyses in Section 3. We also find an interesting phenomenon in Figure 4(b): down-sampling (making $P(\tilde Y = i) = P(\tilde Y = j)$ in the noisy dataset) is very helpful for instance-dependent label noise, since down-sampling can reduce the noise-rate imbalance (we provide an illustration for the binary case in the Appendix), which could lower the estimation error. Ideally, if down-sampling could make the noise-rate pattern symmetric, we could achieve noise consistency (Definition 1), which results in 0 Bias in Theorem 1. Another interesting phenomenon is that, in Figure 5(a)(b)(c), the crossing point is different for each dataset. This phenomenon can be explained by Corollary 1. Corollary 1 implies that if the encoder is learned very well, i.e., Error A (C G•F

Figure 6: Experiments w.r.t. regularizer (λ = 1) on CIFAR10. ResNet34 is deployed for the experiments. (a) (b): Encoder is pre-trained by SimCLR. Symmetric noise rate is 20% and 40%, respectively; (c) (d): Encoder is randomly initialized with noise rate 40% and 60%, respectively.

Table 1: Comparing each method on CIFAR10. The model is learned from scratch without SSL pre-training for all methods with λ = 1. Best and last epoch test accuracies are reported: best/last.

Denote the optimal classifiers learned within the above two function spaces by $C_{D}^{\mathcal{G}\circ\mathcal{F}}$ and $C_{D}^{\mathcal{G}|f}$, respectively. Then the approximation errors of both cases can be denoted by $\mathrm{Error}_A(C_{D}^{\mathcal{G}\circ\mathcal{F}}, C^*)$ and $\mathrm{Error}_A(C_{D}^{\mathcal{G}|f}, C^*)$. Assume $\mathrm{Error}_A(C_{D}^{\mathcal{G}\circ\mathcal{F}}, C^*) < \mathrm{Error}_A(C_{D}^{\mathcal{G}|f}, C^*)$. Note that the assumption holds generally, and the bias-complexity trade-off does not exist if the assumption does not hold.

(a) is satisfied because of Assumption 1 that model can perfectly memorize clean samples. (b) is satisfied because of balanced label and error rate assumption. (c) is satisfied by taking the results from Equation (7) and Equation (8).

$$
e_-^* = \frac{e_-}{e_- + r\,(1-e_-)} = \frac{(1 - e_- + e_+)\,e_-}{e_-(1 - e_- + e_+) + (1-e_-)(1 - e_+ + e_-)}.
$$

Denote $\beta = \frac{1 - e_- + e_+}{e_-(1 - e_- + e_+) + (1-e_-)(1 - e_+ + e_-)}$. Since $e_+ > e_-$, we have $1 - e_- + e_+ > 1 - e_+ + e_-$, and therefore

$$
\beta > \frac{1 - e_- + e_+}{e_-(1 - e_- + e_+) + (1-e_-)(1 - e_- + e_+)} = 1.
$$

Since $\alpha e_+ < e_+$ and $\beta e_- > e_-$, we have $e_+^* - e_-^* = \alpha e_+ - \beta e_- < e_+ - e_-$. Next, we prove $e_+^* > e_-^*$, following the derivation below:

$$
\begin{aligned}
e_+^* > e_-^* &\Longrightarrow \frac{r\,e_+}{1 - e_+ + r\,e_+} > \frac{e_-}{e_- + r\,(1-e_-)} \Longrightarrow r^2 > \frac{e_-(1-e_+)}{e_+(1-e_-)} \\
&\Longrightarrow \Big(\frac{1 - e_+ + e_-}{1 - e_- + e_+}\Big)^2 > \frac{e_-(1-e_+)}{e_+(1-e_-)} \\
&\Longrightarrow e_+(1-e_+) + \frac{e_+\,e_-^2}{1-e_+} > e_-(1-e_-) + \frac{e_-\,e_+^2}{1-e_-},
\end{aligned}
$$

$$
f(e_+) = e_+(1-e_+) + \frac{e_+\,e_-^2}{1-e_+} - e_-(1-e_-) - \frac{e_-\,e_+^2}{1-e_-}.
$$

Figure 7: Visualizing decreased gap by down-sampling strategy.

Figure 9: Ablation study of using the regularizer to train DNN on noisy dataset.

Test accuracy for each method on CIFAR10N and CIFAR100N.

Test accuracy for each method on Clothing1M dataset. All the methods use ResNet50 backbones. DS: Down-Sampling. Reg: With structural regularizer.

Zhaowei Zhu, Yiwen Song, and Yang Liu. Clusterability as an alternative to anchor points when learning with noisy labels. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 12912-12923. PMLR, 18-24 Jul 2021b. Zhaowei Zhu, Zihao Dong, and Yang Liu. Detecting corrupted labels without training a model to predict. In International Conference on Machine Learning, pp. 27412-27427. PMLR, 2022a.

Comparing different SSL methods on CIFAR10 with symmetric label noise

CE (fixed encoder with SimCLR init)    91.06    90.73    90.2     88.24
CE (fixed encoder with MoCo init)      91.55    91.12    90.45    88.51

It can be observed that different SSL methods have very similar results.

Comparing the performance of different network structures on CIFAR-10 with symmetric label noise

Variance of each noisy id in CIFAR-100 with training epochs

All the methods are trained from scratch with learning rate 0.001. The optimizer is Adam and the training epochs is 100. Note when applying regularizer with each method, for example, Regularizer + CE, we first use CE to warmup DNN for certain epochs, then apply regularizer to prevent overfitting.

ACKNOWLEDGEMENT

This work is partially supported by the National Science Foundation (NSF) under grants IIS-2007951, IIS-2143895, CCF-2023495, and the Office of Naval Research under grant N00014-20-1-22.

availability

//github.com/

Proof for Theorem 3: We first refer to a property of mutual information:

$$
I(X; Y) = I(\psi(X); \phi(Y)), \qquad (10)
$$

where $\psi$ and $\phi$ are any invertible functions. This property shows that mutual information is invariant to invertible transformations (Cover, 1999). Thus, to prove the theorem, we only need to prove that $\xi$ in Theorem 3 must be an invertible function when Equation (4) is minimized to 0, since when $\xi$ is invertible, Equation (10) gives $I(h(f(X)); C(X)) = I(h(f(X)); h(f(X)))$, which is the largest value this mutual information can take. We prove this by contradiction. Let $t_i = h(f(x_i))$ and $s_i = g(f(x_i))$. Suppose $\xi$ is not invertible; then there must exist $s_i$ and $s_j$ with $s_i \neq s_j$ that satisfy $t_j = \xi(s_i) = t_i$. However, under this condition, $t_i - t_j = 0$ and $s_i - s_j \neq 0$, so Equation (4) cannot be minimized to 0. Thus, when Equation (4) is minimized to 0, $\xi$ must be an invertible function.

Proof done.
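The invariance property in Equation (10) can be illustrated on a small discrete example, where an invertible map is a relabeling (permutation) of outcomes and a non-invertible map merges outcomes. The joint table below and the function name `mutual_information` are made-up illustrations, not data from the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in nats for a discrete joint table joint[x, y] = P(X = x, Y = y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical joint distribution of two 3-valued variables.
joint = np.array([[0.30, 0.05, 0.05],
                  [0.05, 0.20, 0.05],
                  [0.05, 0.05, 0.20]])

mi = mutual_information(joint)
# Invertible map on Y: relabel its outcomes (permute the columns).
mi_permuted = mutual_information(joint[:, [2, 0, 1]])
# Non-invertible map on Y: merge the last two outcomes into one.
merged = np.stack([joint[:, 0], joint[:, 1] + joint[:, 2]], axis=1)
mi_merged = mutual_information(merged)
print(mi, mi_permuted, mi_merged)
```

Relabeling leaves the mutual information unchanged, whereas merging outcomes cannot increase it, which is why the proof only needs ξ to be invertible.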

B.3 PROOF FOR LEMMA 3

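By the independence condition, $Z = X - Y$ also follows a Gaussian distribution, with parameters $(\mu_X - \mu_Y, \Sigma_X + \Sigma_Y)$. Hence $\mathbb{E}\|Z\|^2 = \|\mathbb{E}Z\|^2 + \mathrm{tr}(\mathrm{Cov}(Z)) = \|\mu_X - \mu_Y\|^2 + \mathrm{tr}(\Sigma_X + \Sigma_Y)$.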

C ILLUSTRATING DOWN-SAMPLING STRATEGY

We illustrate the case of binary classification with $e_+ + e_- < 1$. Suppose the dataset is balanced and, at the initial state, $e_+ > e_-$. After down-sampling, the noise rates become $e_+^*$ and $e_-^*$. We aim to prove two propositions: Proposition 1. If $e_+$ and $e_-$ are known, the optimal down-sampling rate can be calculated by

