HOW WEAKLY SUPERVISED INFORMATION HELPS CONTRASTIVE LEARNING

Abstract

Contrastive learning has shown outstanding performance in both supervised and unsupervised learning. However, little is known about when and how weakly supervised information helps improve contrastive learning, especially from a theoretical perspective. The major challenge is that the existing theory of contrastive learning based on supervised learning frameworks fails to distinguish between supervised and unsupervised contrastive learning. Therefore, we turn to unsupervised learning frameworks, and based on the posterior probability of labels, we translate the weakly supervised information into a similarity graph under the framework of spectral clustering. In this paper, we investigate two typical weakly supervised learning problems, noisy label learning and semi-supervised learning, and analyze their influence on contrastive learning within a unified framework. Specifically, we analyze the effect of weakly supervised information on the augmentation graph of unsupervised contrastive learning, and consequently on its corresponding error bound. Numerical experiments are carried out to verify the theoretical findings.

1. INTRODUCTION

Contrastive learning has shown state-of-the-art empirical performance in both supervised and unsupervised learning. In unsupervised learning, contrastive learning algorithms (Chen et al., 2020; He et al., 2020; Chen et al., 2021; Chen and He, 2021) learn good representations of high-dimensional observations from a large amount of unlabeled data, by pulling together an anchor and its augmented views in the embedding space. On the other hand, supervised contrastive learning (Khosla et al., 2020) uses same-class examples and their corresponding augmentations as positive samples, and achieves significantly better performance than the state-of-the-art cross entropy loss, especially on large-scale datasets. Recently, contrastive learning has been introduced to solve weakly supervised learning problems such as noisy label learning (Tan et al., 2021; Wang et al., 2022) and semi-supervised learning. For noisy label learning, most methodological studies use contrastive learning as a tool to select confident samples based on the learned representations (Yao et al., 2021; Ortego et al., 2021; Li et al., 2022), whereas the theoretical studies focus on proving the robustness of downstream classifiers with features learned by self-supervised contrastive learning (Cheng et al., 2021; Xue et al., 2022). For semi-supervised learning, the contrastive loss is often used as a regularization to improve the precision of pseudo labeling (Lee et al., 2022; Yang et al., 2022). However, none of the existing studies use weakly supervised information to improve contrastive learning. Perhaps the closest attempt is Yan et al. (2022), which leverages the negative correlations from the noisy data to avoid same-class negatives for contrastive learning. Nonetheless, only empirical results are presented, without showing when and how the weakly supervised information helps improve contrastive learning.
Moreover, a proper theoretical framework for weakly supervised contrastive learning is especially lacking. The major challenge lies in the fact that the existing theoretical frameworks compatible with both supervised and unsupervised contrastive learning (Arora et al., 2019; Nozawa and Sato, 2021; Ash et al., 2022; Bao et al., 2022) fail to distinguish between the two settings. To be specific, in order to build a relationship with supervised learning losses, such studies assume that the positive pairs for unsupervised contrastive learning are generated from the same latent class, and this is exactly how positive samples for supervised contrastive learning are selected. Consequently, such mathematical modeling cannot tell the difference between supervised and unsupervised contrastive learning. Therefore, in this paper, we instead base our theoretical analysis on an unsupervised learning framework. Based on the posterior probability of labeled samples, we translate the weakly supervised information into a similarity graph under the framework of spectral clustering. This enables us to analyze the effect of the label information on the augmentation graph of unsupervised spectral clustering (HaoChen et al., 2021), and consequently on its corresponding error bound. The contributions of this paper are summarized as follows.
• We establish, for the first time, a theoretical framework for weakly supervised contrastive learning, which is compatible with both noisy label learning and semi-supervised learning.
• By formulating the label information into a similarity graph based on the posterior probability of labels, we derive the downstream error bound of contrastive learning from both weakly supervised labels and feature information. We show that both noisy labels and semi-supervised labels can improve the error bound of unsupervised contrastive learning under certain constraints on the noise rate and labeled sample size.
• We empirically verify our theoretical results.

2. RELATED WORKS

Theoretical Frameworks of Contrastive Learning. The theoretical frameworks of unsupervised contrastive learning can be divided into two major categories. The first category is devoted to building the relationship between unsupervised contrastive learning and supervised downstream classification. Arora et al. (2019) first introduces the concept of latent classes, hypothesizes that semantically similar points are sampled from the same latent class, and proves that the unsupervised contrastive loss serves as an upper bound of the downstream supervised learning loss. Nozawa and Sato (2021); Ash et al. (2022); Bao et al. (2022) further investigate the effect of negative samples, and establish surrogate bounds for the downstream classification loss that better match the empirical observations on the negative sample size. However, studies in this category have to assume the existence of supervised latent classes, and that the positive pairs are conditionally independently drawn from the same latent class. This assumption fails to distinguish between supervised and unsupervised contrastive learning, and cannot be used to analyze the weakly supervised setting. Another major approach is to analyze contrastive learning by modeling the feature similarity. HaoChen et al. (2021) first introduces the concept of the augmentation graph to represent the feature similarity of the augmented samples, and analyzes contrastive learning from the perspective of spectral clustering. Shen et al. (2022) uses a stochastic block model to analyze spectral contrastive learning for the problem of unsupervised domain adaptation. Similarly, Wang et al. (2021) proposes the concept of augmentation overlap to formulate how the positive samples are aligned.
Moreover, contrastive learning is also understood through other existing theoretical frameworks of unsupervised learning, such as nonlinear independent component analysis (ICA) (Zimmermann et al., 2021), neighborhood component analysis (NCA) (Ko et al., 2022), and the variational autoencoder (VAE) (Aitchison, 2021). In this paper, we follow the second category of approaches, and formulate the weakly supervised information into a similarity graph based on both label and feature information.
Contrastive Learning for Noisy Label Learning. Yan et al. (2022) follows the idea of negative learning (Kim et al., 2019; 2021), and leverages the negative correlations from the noisy data to avoid same-class negatives in contrastive learning. For theoretical studies, Cheng et al. (2021) analyzes the robustness of cross-entropy with self-supervised features, and Xue et al. (2022) proves the robustness of the downstream classifier in contrastive learning.
Contrastive Learning for Semi-supervised Learning. Lee et al. (2022); Yang et al. (2022) use contrastive regularization to enhance the reliability of pseudo-labeling in semi-supervised learning. Kim et al. (2021) introduces a semi-supervised learning method that combines self-supervised contrastive pre-training and semi-supervised fine-tuning based on augmentation consistency regularization. Zhang et al. (2022) uses a contrastive loss to model pairwise similarities among samples, generates pseudo labels from the cross entropy loss, and in turn calibrates the prediction distributions of the two branches. For both noisy label learning and semi-supervised learning, the existing studies all focus on using contrastive learning as a tool to improve weakly supervised learning performance, whereas to the best of our knowledge, none of the previous works show the effect of weak supervision on contrastive learning itself.
To fill this gap, in this paper we establish a theoretical framework for weakly supervised contrastive learning, which is compatible with both noisy label learning and semi-supervised learning tasks.

3. PRELIMINARIES

Notations. Suppose that the random variable $X \in \mathcal{X} := \mathbb{R}^d$. Denote the adjacency matrix of the augmentation graph as $A := (w_{xx'})_{x,x'\in\mathcal{X}} \in \mathbb{R}^{n\times n}$, and the normalized adjacency matrix as $\bar{A} := D^{-1/2} A D^{-1/2}$, where $D := \mathrm{diag}(w_x)_{x\in\mathcal{X}}$ and $w_x := \sum_{x'\in\mathcal{X}} w_{xx'}$. In this paper, we consider the spectral contrastive loss proposed by HaoChen et al. (2021), that is, for an embedding function $f: \mathcal{X}\to\mathbb{R}^k$,
$$\mathcal{L}(f) := -2\,\mathbb{E}_{x,x^+}\big[f(x)^\top f(x^+)\big] + \mathbb{E}_{x,x'}\big[\big(f(x)^\top f(x')\big)^2\big]. \quad (1)$$
The spectral contrastive loss is proved to be equivalent to the matrix factorization loss, i.e., for $F := (u_x)_{x\in\mathcal{X}} \in \mathbb{R}^{n\times k}$ with $u_x := w_x^{1/2} f(x)$,
$$\mathcal{L}_{\mathrm{mf}}(F) := \|\bar{A} - FF^\top\|_F^2 = \mathcal{L}(f) + \mathrm{const}. \quad (2)$$
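As a concrete illustration of equation 2, the sketch below builds a toy normalized augmentation graph and checks that the matrix factorization loss is minimized by embeddings built from the top-$k$ eigenpairs of $\bar{A}$ (an Eckart–Young-type argument). This is our own minimal numerical sketch, not the paper's implementation; the toy graph and all function names are ours.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization: Abar = D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def matrix_factorization_loss(Abar, F):
    """L_mf(F) = ||Abar - F F^T||_F^2  (equation 2)."""
    return np.linalg.norm(Abar - F @ F.T, ord="fro") ** 2

# toy augmentation graph: two well-connected clusters with weak cross edges
A = np.full((6, 6), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
Abar = normalized_adjacency(A)

# the loss over rank-k PSD factors F F^T is minimized by the top-k
# eigenvectors of Abar, scaled by the square roots of the (clipped) eigenvalues
w, V = np.linalg.eigh(Abar)            # eigenvalues in ascending order
k = 2
F_opt = V[:, -k:] * np.sqrt(np.clip(w[-k:], 0.0, None))

loss_opt = matrix_factorization_loss(Abar, F_opt)
loss_rand = matrix_factorization_loss(
    Abar, np.random.default_rng(0).normal(size=(6, k)))
assert loss_opt <= loss_rand
```

The optimum separates the two clusters in embedding space, which is exactly the spectral clustering view taken in the rest of the paper.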

3.2. NOISY LABEL LEARNING

Recall that we denote the true label of a given instance $x\in\mathcal{X}$ by $y$. One common assumption on the generation of label noise is as follows: given the true labels, the noisy label $\tilde{y}$ is randomly flipped to another label with some probability. In this paper, we take the widely adopted symmetric label noise assumption as an example. For notational simplicity, we write the symmetric label noise assumption in matrix form. Denote $Y := (\eta_j(x_i))_{i\in[n],j\in[r]}$ with $\eta_j(x) = \mathbb{P}(Y=j\,|\,x)$ as the posterior probability matrix of the clean label distribution, and $\tilde{Y} := (\tilde{\eta}_j(x_i))_{i\in[n],j\in[r]}$ with $\tilde{\eta}_j(x) = \mathbb{P}(\tilde{Y}=j\,|\,x)$ as that of the noisy label distribution. In Assumption 1, we assume that the flipping probability is conditionally independent of the input data, and that the flipping probabilities to all other classes are uniform.
Assumption 1. For symmetric label noise with noise rate $\gamma\in(0,1)$, denote the transition matrix $T = (t_{i,j})_{i,j\in[r]}$, where
$$t_{i,i} = 1-\gamma, \qquad t_{i,j} = \frac{\gamma}{r-1} \ \text{ for } j\neq i. \quad (3)$$
Then the noisy label posterior distribution is assumed to be
$$\tilde{Y} = YT. \quad (4)$$
Under Assumption 1, $T$ is symmetric. Specifically, when $\gamma=0$, $T$ degenerates to the identity matrix $I_{r\times r}$. Moreover, to guarantee PAC-learnability, we usually assume the true label is the dominating class, i.e., $\gamma < \frac{r-1}{r}$.
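The transition matrix of Assumption 1 can be checked numerically. The following sketch (our own illustration; the helper name is ours) constructs $T$ and verifies the properties stated above: symmetry, row-stochasticity, and degeneration to the identity when $\gamma = 0$.

```python
import numpy as np

def symmetric_noise_T(r, gamma):
    """Transition matrix of Assumption 1: keep the label w.p. 1 - gamma,
    flip uniformly to any of the other r - 1 classes w.p. gamma / (r - 1)."""
    T = np.full((r, r), gamma / (r - 1))
    np.fill_diagonal(T, 1 - gamma)
    return T

r, gamma = 4, 0.2
T = symmetric_noise_T(r, gamma)
assert np.allclose(T, T.T)                               # T is symmetric
assert np.allclose(T.sum(axis=1), 1.0)                   # rows are distributions
assert np.allclose(symmetric_noise_T(r, 0.0), np.eye(r)) # gamma = 0 -> identity

# noisy posterior Ytilde = Y T (equation 4): rows remain distributions
Y = np.eye(r)[np.array([0, 1, 2, 3, 0])]   # deterministic clean posteriors
Ytilde = Y @ T
assert np.allclose(Ytilde.sum(axis=1), 1.0)
```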

3.3. SEMI-SUPERVISED LEARNING

For $j\in[r]$, let $n_j$ be the number of labeled samples of class $j$. Let $n_L = \sum_{j\in[r]} n_j$ be the number of all labeled samples, and $n_U$ the number of unlabeled samples. Obviously, we have $n_L + n_U = n$. Usually, the number of labeled samples is much smaller than that of the unlabeled ones because human annotation is costly and labor-intensive; that is, we can naturally assume $n_L \ll n_U$. In the following parts of the paper, we analyze the settings of noisy label learning and semi-supervised learning in a unified framework. Without loss of generality, we assume $(x_1,\ldots,x_{n_L})$ are labeled with noise rate $\gamma\in[0,\frac{r-1}{r})$, and denote the corresponding clean and noisy posterior probability matrices as $Y_L$ and $\tilde{Y}_L$, respectively. Then we have $\tilde{Y}_L = Y_L T$. Specifically, when $\gamma=0$, our analysis framework degenerates to the standard setting of semi-supervised learning, and when $n_L = n$, it reduces to standard noisy label learning.

4. MATHEMATICAL FORMULATIONS

We mention that our formulation of the "similarity graph" is not a distributional assumption on the underlying similarity among data, but a way to formulate a possible probability of drawing positive samples in contrastive learning that takes both label and feature information into consideration. Specifically, in Sections 4.1 and 4.2, we only discuss the similarity graph describing the weakly supervised labels and neglect feature similarity. Then in Section 4.3, we take both label and feature similarity into consideration through a convex combination.

4.1. SIMILARITY GRAPH DESCRIBING NOISY LABEL INFORMATION

To leverage the label information in the form of a similarity graph, we first consider a simple example where the noise rate $\gamma = 0$ and the label distribution is deterministic, i.e., for a sample $x$ with true label $y$, the posterior probability $\eta_y(x) = 1$ and $\eta_j(x) = 0$ for $j\neq y$. In this case, we can naturally assume that in the similarity graph describing label information, the intra-class vertices are fully connected and the inter-class vertices are disconnected. That is, $w_{xx'} = 1$ if $x$ and $x'$ have the same label, and $w_{xx'} = 0$ otherwise. Then we consider the more general stochastic label scenario. Recall that for unsupervised spectral contrastive learning, the edge weight $w_{xx'}$ in an augmentation graph $G$ describes the marginal probability of generating $x$ and $x'$ from the same natural data; that is, $w_{xx'}$ describes the joint probability of a pair of positive samples. Similarly, since the positive samples for supervised contrastive learning (Khosla et al., 2020) are selected as all same-class samples, we can naturally define the edge weight $w_{xx'}$ as the probability of the two views $x$ and $x'$ being generated from the same class, i.e., $w_{xx'} = \sum_{j\in[r]}\eta_j(x)\eta_j(x')$, and therefore $A_L := Y_L Y_L^\top$. Moreover, we denote $\bar{A}_L$ as the normalized adjacency matrix. For notational simplicity, we consider the case where the data is class-balanced, i.e., $n_1 = \ldots = n_r = n_L/r$. Then we have $\bar{A}_L = \frac{r}{n_L} A_L$. Next, we add label noise to our mathematical formulation. To be specific, when performing supervised contrastive learning on noisy labeled data, we naturally select as positive samples those samples sharing the same noisy label. According to Assumption 1, we have $\tilde{Y}_L = Y_L T$, where $T$ is symmetric. Then the adjacency matrix of the similarity graph induced by noisy labels is formulated as
$$\tilde{A}_L := \tilde{Y}_L\tilde{Y}_L^\top = Y_L T (Y_L T)^\top = Y_L T T^\top Y_L^\top = Y_L T^2 Y_L^\top. \quad (5)$$
Similarly, when the data is class-balanced, the normalized adjacency matrix is $\bar{\tilde{A}}_L = \frac{r}{n_L}\tilde{A}_L$.
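A small numerical sketch (our own, with hypothetical toy data) of the construction above: starting from deterministic one-hot posteriors $Y_L$, it forms the clean graph $A_L = Y_L Y_L^\top$ and the noisy graph of equation 5, and checks how noise perturbs the intra- and inter-class edge weights.

```python
import numpy as np

r, n_L, gamma = 3, 9, 0.1
labels = np.repeat(np.arange(r), n_L // r)   # class-balanced toy labels
Y = np.eye(r)[labels]                        # deterministic posteriors

T = np.full((r, r), gamma / (r - 1))
np.fill_diagonal(T, 1 - gamma)

A_clean = Y @ Y.T                # w_xx' = sum_j eta_j(x) eta_j(x')
A_noisy = (Y @ T) @ (Y @ T).T    # Ytilde_L Ytilde_L^T

# equation 5: Ytilde Ytilde^T = Y T^2 Y^T, since T is symmetric
assert np.allclose(A_noisy, Y @ T @ T @ Y.T)

# clean graph: intra-class weight 1, inter-class weight 0
assert A_clean[0, 1] == 1.0 and A_clean[0, 3] == 0.0
# noise shrinks intra-class weights and raises inter-class weights
assert A_noisy[0, 1] < 1.0 and A_noisy[0, 3] > 0.0
```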

4.2. SIMILARITY GRAPH DESCRIBING SEMI-SUPERVISED NOISY LABEL INFORMATION

Under the setting of semi-supervised learning, we have no prior knowledge about the labels of the unlabeled samples. From the perspective of unsupervised contrastive learning, the unlabeled samples can be viewed as having unique class labels. Therefore, to construct the similarity graph, we attach sample-specific labels to the unlabeled samples, so that the posterior probability matrix of the unlabeled samples $Y_U$ is the identity matrix $I_{n_U\times n_U}$. Note that here we only discuss the similarity graph describing supervised information, so the feature similarity between samples is not included in the similarity graph. Combining both labeled and unlabeled samples, the posterior probability matrix of all semi-supervised samples can be written as
$$\tilde{Y} = \begin{bmatrix} \tilde{Y}_L & 0 \\ 0 & \tilde{Y}_U \end{bmatrix} = \begin{bmatrix} Y_L T & 0 \\ 0 & I_{n_U\times n_U} \end{bmatrix}. \quad (6)$$
Therefore, the similarity graph of samples with $n_L$ noisy labels can be written as
$$\tilde{A} = \tilde{Y}\tilde{Y}^\top = \begin{bmatrix} Y_L T^2 Y_L^\top & 0 \\ 0 & I_{n_U\times n_U} \end{bmatrix}. \quad (7)$$
In Lemma 1 we present the influence of symmetric label noise with noise rate $\gamma$ on the similarity graph $\tilde{A}$.
Lemma 1. Under Assumption 1, if the data is class-balanced, i.e., $n_1 = \ldots = n_r = \frac{n_L}{r}$, then there holds
$$\bar{\tilde{A}} = \begin{bmatrix} \alpha(\gamma)\bar{A}_L + \beta(\gamma)\frac{r}{n_L}\vec{1}_{n_L}\vec{1}_{n_L}^\top & 0 \\ 0 & I_{n_U\times n_U} \end{bmatrix}, \quad (8)$$
where $\alpha(\gamma) := \big(1-\frac{r}{r-1}\gamma\big)^2$ and $\beta(\gamma) := \frac{\gamma}{r-1}\big(2-\frac{r}{r-1}\gamma\big)$. Note that without label noise, i.e., $\gamma = 0$, we have $\alpha(\gamma) = 1$ and $\beta(\gamma) = 0$. For the sake of simplicity, in the following we write $\alpha$ and $\beta$ instead of $\alpha(\gamma)$ and $\beta(\gamma)$ when no ambiguity arises. Lemma 1 shows that the effect of symmetric label noise is to add a uniform weight to the edges between all labeled samples. This uniform weight increases the confusion between intra- and inter-class similarities: under the deterministic label scenario, the original intra-class similarity is uniformly shrunk from 1 to $\alpha$, while the inter-class similarity increases from 0 to $\beta$.
Moreover, as the noise rate $\gamma$ increases, $\alpha$ decreases and $\beta$ increases, which results in more severe confusion between the intra- and inter-class similarities.
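The closed forms of $\alpha(\gamma)$ and $\beta(\gamma)$ in Lemma 1, together with the decomposition $T^2 = \alpha I + \beta\vec{1}\vec{1}^\top$ used in its proof, can be verified numerically. The sketch below is our own check, not part of the paper.

```python
import numpy as np

def alpha_beta(r, gamma):
    """alpha(gamma) and beta(gamma) from Lemma 1."""
    alpha = (1 - r * gamma / (r - 1)) ** 2
    beta = gamma / (r - 1) * (2 - r * gamma / (r - 1))
    return alpha, beta

r = 5
for gamma in [0.0, 0.1, 0.3]:
    T = np.full((r, r), gamma / (r - 1))
    np.fill_diagonal(T, 1 - gamma)
    alpha, beta = alpha_beta(r, gamma)
    # decomposition used in the proof of Lemma 1: T^2 = alpha I + beta 1 1^T
    assert np.allclose(T @ T, alpha * np.eye(r) + beta * np.ones((r, r)))
    # normalization constraint: alpha + r * beta = 1
    assert np.isclose(alpha + r * beta, 1.0)

# on [0, (r-1)/r): alpha decreases and beta increases with gamma
a1, b1 = alpha_beta(r, 0.1)
a2, b2 = alpha_beta(r, 0.3)
assert a1 > a2 and b1 < b2
```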

4.3. SIMILARITY GRAPH DESCRIBING BOTH LABEL AND FEATURE INFORMATION

In this part, we take both label and feature information into consideration. For the feature information, we denote $A_0$ as the augmentation graph built from the samples without using any labels. For the label information, we take the similarity graph $\tilde{A}$ describing the semi-supervised noisy labeled (augmented) samples, as given in equation 7. When leveraging both label and feature information in contrastive learning, we mix the two similarity graphs by convex combination, i.e., for $\theta\in(0,1)$, the mixed similarity graph of all augmented samples is
$$A_{\theta,\gamma,n_L} := (1-\theta)\bar{A}_0 + \theta\bar{\tilde{A}}, \quad (9)$$
where $\bar{A}_0$ and $\bar{\tilde{A}}$ denote the normalizations of $A_0$ and $\tilde{A}$, respectively. Recall that the edge weights of the similarity graphs represent the probability of two augmented views being drawn as a pair of positive samples. The similarity graph in equation 9 can thus be understood as selecting positive pairs based on both weakly supervised labels and feature information.
Proposition 1. For arbitrary $Y$, assume that the labeled data is class-balanced, i.e., $\sum_{i\in[n_L]}\eta_j(x_i) = n_L/r$ for $j\in[r]$. Assume that the eigenvalues of $\bar{A}_L$ are $\mu_1,\ldots,\mu_{n_L}$ (in descending order). Then under Assumption 1, the eigenvalues of $\bar{\tilde{A}}$ are
$$\tilde{\mu}_1 = \ldots = \tilde{\mu}_{n_U+1} = 1, \quad (10)$$
$$\tilde{\mu}_j = \alpha\mu_{j-n_U} = \mu_{j-n_U}\Big(1-\frac{r}{r-1}\gamma\Big)^2, \quad \text{for } j = n_U+2,\ldots,n. \quad (11)$$
Proposition 1 shows that the eigenvalues of $\bar{\tilde{A}}$ rely on the eigenvalues of $\bar{A}_L$, and consequently on the posterior probabilities of the clean labels. Specifically, if the true label has higher posterior probability, i.e., $\max_{j\in[r]}\mathbb{P}(Y=j\,|\,x)$ is larger, then the eigenvalues of $\bar{A}_L$ are larger. On the other hand, the existence of label noise uniformly shrinks the eigenvalues of $\bar{\tilde{A}}$ except for the largest ones, and a larger noise rate $\gamma$ results in a smaller $\alpha$ and thus smaller eigenvalues of $\bar{\tilde{A}}$. Moreover, the number of largest eigenvalues is determined by the number of unlabeled samples. Note that $\mathrm{rank}(\tilde{A}) \le \mathrm{rank}(Y_L) + n_U \le n_U + r$, and therefore we have $\tilde{\mu}_{n_U+r+1} = \ldots = \tilde{\mu}_n = 0$. Specifically, under the deterministic label scenario, we have $\mu_2 = \ldots = \mu_r = 1$. Then the eigenvalues of $\bar{\tilde{A}}$ become
$$\tilde{\mu}_1 = \ldots = \tilde{\mu}_{n_U+1} = 1, \quad (12)$$
$$\tilde{\mu}_{n_U+2} = \ldots = \tilde{\mu}_{n_U+r} = \alpha = \Big(1-\frac{r}{r-1}\gamma\Big)^2, \quad (13)$$
$$\tilde{\mu}_{n_U+r+1} = \ldots = \tilde{\mu}_n = 0. \quad (14)$$
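The eigenvalue structure of Proposition 1 under the deterministic scenario can be confirmed on a toy block matrix (our own sketch with hypothetical sizes): eigenvalue 1 with multiplicity $n_U + 1$, $\alpha$ with multiplicity $r - 1$, and 0 for the rest.

```python
import numpy as np

r, n_L, n_U, gamma = 3, 6, 4, 0.1
n = n_L + n_U
labels = np.repeat(np.arange(r), n_L // r)   # class-balanced
Y_L = np.eye(r)[labels]
T = np.full((r, r), gamma / (r - 1))
np.fill_diagonal(T, 1 - gamma)

# normalized label graph: with class balance, every degree equals n_L / r,
# so normalization is multiplication by r / n_L on the labeled block
A_L_noisy = Y_L @ T @ T @ Y_L.T
Abar_label = np.zeros((n, n))
Abar_label[:n_L, :n_L] = (r / n_L) * A_L_noisy
Abar_label[n_L:, n_L:] = np.eye(n_U)

eig = np.sort(np.linalg.eigvalsh(Abar_label))[::-1]
alpha = (1 - r * gamma / (r - 1)) ** 2

# Proposition 1 (deterministic labels): 1 with multiplicity n_U + 1,
# alpha with multiplicity r - 1, and 0 for the rest
assert np.allclose(eig[: n_U + 1], 1.0)
assert np.allclose(eig[n_U + 1 : n_U + r], alpha)
assert np.allclose(eig[n_U + r :], 0.0)
```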

5.2. EIGENVALUES OF SIMILARITY GRAPH DESCRIBING BOTH LABEL AND FEATURE INFORMATION

In the following proposition, we discuss the eigenvalues of the mixed similarity graph $A_{\theta,\gamma,n_L}$ describing both weak labels and feature information.
Proposition 2. Denote $\lambda_1,\ldots,\lambda_n$ as the eigenvalues of $A_{\theta,\gamma,n_L}$. Given the eigenvalues $\nu_1,\ldots,\nu_n$ of $\bar{A}_0$ and the eigenvalues $\mu_1,\ldots,\mu_{n_L}$ of $\bar{A}_L$ (in descending order), when $k \le n_U$, there holds
$$\lambda_{k+1} \ge \max\Big\{\theta + (1-\theta)\nu_{n_L+k},\ \max_{i=n_L+k-r+1,\ldots,n_L+k-1}\{\theta\alpha\mu_{n+k+1-i} + (1-\theta)\nu_i\},\ (1-\theta)\nu_{k+1}\Big\}, \quad (15)$$
when $n_U < k < n_U + r$, there holds
$$\lambda_{k+1} \ge \max\Big\{\max_{i=n_L+k-r+1,\ldots,n_L+k-1}\{\theta\alpha\mu_{n+k+1-i} + (1-\theta)\nu_i\},\ (1-\theta)\nu_{k+1}\Big\}, \quad (16)$$
and when $k \ge n_U + r$, there holds
$$\lambda_{k+1} \ge (1-\theta)\nu_{k+1}. \quad (17)$$
According to Proposition 2, the lower bound of $\lambda_{k+1}$ depends on the eigenvalues of both the unsupervised augmentation graph and the label similarity graph. Specifically, under the deterministic scenario, we have for $k \le n_U$,
$$\lambda_{k+1} \ge \max\big\{\theta + (1-\theta)\nu_{n_L+k},\ \theta\alpha + (1-\theta)\nu_{n_L+k+1-r},\ (1-\theta)\nu_{k+1}\big\}, \quad (18)$$
for $n_U < k < n_U + r$,
$$\lambda_{k+1} \ge \max\big\{\theta\alpha + (1-\theta)\nu_{n_L+k+1-r},\ (1-\theta)\nu_{k+1}\big\}, \quad (19)$$
and for $k \ge n_U + r$,
$$\lambda_{k+1} \ge (1-\theta)\nu_{k+1}. \quad (20)$$
We see that under the deterministic scenario, the lower bound of the $(k+1)$-th largest eigenvalue $\lambda_{k+1}$ of $A_{\theta,\gamma,n_L}$ depends on the eigenvalues $\nu_{n_L+k}$, $\nu_{n_L+k+1-r}$, and $\nu_{k+1}$ of the unsupervised augmentation graph $A_0$. The value of $\lambda_{k+1}$ is also affected by the weighting parameter $\theta$ and the noise rate $\gamma$. A perhaps counter-intuitive conclusion is that when $k \ge n_U + r$, the lower bound of $\lambda_{k+1}$ is unaffected by the noise rate. That is, when $k$ is large enough, the weak labels do not affect the $(k+1)$-th largest eigenvalue of the mixed similarity graph.
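The lower bounds of Proposition 2 rest on Weyl-type eigenvalue inequalities for sums of symmetric matrices. The sketch below (our own illustration on random toy graphs, not the paper's construction) checks the simplest consequence, $\lambda_{k+1} \ge (1-\theta)\nu_{k+1}$, which holds because the label graph, being of the form $\tilde{Y}\tilde{Y}^\top$ up to normalization, is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1) ** -0.5
    return d[:, None] * A * d[None, :]

n, theta = 12, 0.3
M = rng.random((n, n))
A0bar = normalize((M + M.T) / 2)          # stand-in feature (augmentation) graph
B = rng.random((n, n))
At_bar = normalize(B @ B.T)               # PSD stand-in for the label graph
A_mix = (1 - theta) * A0bar + theta * At_bar

lam = np.sort(np.linalg.eigvalsh(A_mix))[::-1]
nu = np.sort(np.linalg.eigvalsh(A0bar))[::-1]

# Weyl's inequality: lambda_{k+1}(X + Y) >= lambda_{k+1}(X) + lambda_min(Y);
# since theta * At_bar is PSD, lambda_min >= 0, giving the baseline bound (17)
for k in range(n):
    assert lam[k] >= (1 - theta) * nu[k] - 1e-9
```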

5.3. WEAK SUPERVISION HELPS REDUCE ERROR BOUND

Recall that the goal of contrastive representation learning is to learn an embedding function $f: \mathcal{X}\to\mathbb{R}^k$. The quality of the learned embedding is often evaluated through linear evaluation. To be specific, denote $B\in\mathbb{R}^{k\times r}$ as the weights of the downstream linear classifier, and the linear predictor is
$$\bar{g}_{f,B}(\bar{x}) = \arg\max_{i\in[r]}\ \mathbb{P}_{x\sim\mathcal{A}(\cdot|\bar{x})}\big(g_{f,B}(x) = i\big), \quad \text{where } g_{f,B}(x) = \arg\max_{i\in[r]}\big(f(x)^\top B\big)_i.$$
In this paper, we focus on analyzing the error bound of the best possible downstream linear classifier $g_{f^*_{\mathrm{pop}},B^*}$, where $f^*_{\mathrm{pop}}\in\arg\min_{f:\mathcal{X}\to\mathbb{R}^k}\mathcal{L}(f)$ is the minimizer of the population spectral contrastive loss $\mathcal{L}(f)$ defined in equation 1, and $B^*$ is the optimal weight for the downstream linear classifier. Following HaoChen et al. (2021), we assume that the labels are recoverable from the augmentations, i.e., there exists a classifier $\hat{y}: \mathcal{X}\to[r]$ that predicts the label $y(\bar{x})$ from an augmentation $x$ of $\bar{x}$ with error at most $\delta\in(0,1)$.
Assumption 2. Let $P_{\bar{X}}$ be the probability distribution of the original input data $\bar{x}$. Denote $x$ as an augmented sample and $y$ as its label. Assume that for some $\delta > 0$, there holds
$$\mathbb{E}_{\bar{x}\sim P_{\bar{X}},\,x\sim\mathcal{A}(\cdot|\bar{x})}\,\mathbb{1}[\hat{y}(x)\neq y] \le \delta, \quad (21)$$
and
$$\mathbb{E}_{x\sim\mathrm{Unif}(\mathcal{X})}\,\mathbb{1}[\hat{y}(x)\neq y] \le \delta. \quad (22)$$
Compared with Assumption 3.5 in HaoChen et al. (2021), Assumption 2 additionally assumes the recoverability of labels in expectation under the uniform distribution. We mention that Assumption 2 is a minor revision of the original assumption: the additional condition in equation 22 does not change the nature of the original idea of label recovery, and will be used to bound the error term of learning from weakly supervised labels. In the following theorem, we derive the error bound of downstream linear evaluation for weakly supervised contrastive learning.
Theorem 1. For arbitrary $Y$, assume that the labeled data is class-balanced, i.e., $\sum_{i\in[n_L]}\eta_j(x_i) = n_L/r$ for $j\in[r]$. Denote $\nu_1,\ldots,\nu_n$ as the eigenvalues of $\bar{A}_0$ (in descending order). Denote
$$\mathcal{E} := \mathbb{P}_{\bar{x}\sim P_{\bar{X}},\,x\sim\mathcal{A}(\cdot|\bar{x})}\big(\bar{g}_{f^*_{\mathrm{pop}},B^*}(x)\neq y(\bar{x})\big)$$
as the linear evaluation error, where $B^*\in\mathbb{R}^{r\times k}$ with norm $\|B^*\|_F\le 1/\lambda_k$. Then under the deterministic scenario and Assumptions 1 and 2, for $k\le n_U$, there holds
$$\mathcal{E} \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{\min\{(1-\theta)(1-\nu_{n_L+k}),\ (1-\theta)(1-\nu_{n_L+k+1-r})+\theta(1-\alpha),\ (1-\theta)(1-\nu_{k+1})+\theta\}} + 8\delta, \quad (23)$$
for $n_U+1\le k\le n_U+r-1$, there holds
$$\mathcal{E} \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{\min\{(1-\theta)(1-\nu_{n_L+k+1-r})+\theta(1-\alpha),\ (1-\theta)(1-\nu_{k+1})+\theta\}} + 8\delta, \quad (24)$$
and for $k\ge n_U+r$, there holds
$$\mathcal{E} \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{(1-\theta)(1-\nu_{k+1})+\theta} + 8\delta, \quad (25)$$
where $\alpha := \big(1-\frac{r}{r-1}\gamma\big)^2$.
Theorem 1 shows that the form of the linear evaluation error bound depends on the embedding dimension $k$, whereas the bound grows as the noise rate $\gamma$ and the label recovery error $\delta$ grow, regardless of $k$. Recall that in HaoChen et al. (2021), the error bound of purely unsupervised contrastive learning is $\frac{4\delta}{1-\nu_{k+1}} + 8\delta$. Under the setting of standard semi-supervised classification, i.e., when $\gamma = 0$, and usually $k\le n_U$, we have
$$\mathcal{E} \le \frac{4\delta}{1-\nu_{n_L+k}} + 8\delta, \quad (26)$$
which improves the error bound of purely unsupervised contrastive learning since $\nu_{k+1}\ge\nu_{n_L+k}$. Next, we discuss the situation when label noise exists, i.e., $\gamma > 0$.
• For $k\ge n_U+r$:
  - if $1-\alpha > \frac{2r\delta}{r-1}\cdot\frac{\nu_{k+1}}{1-\nu_{k+1}}$, then when $\theta = 0$, there holds
$$\mathcal{E} \le \frac{2\delta}{1-\nu_{k+1}} + 8\delta; \quad (27)$$
  - if $1-\alpha \le \frac{2r\delta}{r-1}\cdot\frac{\nu_{k+1}}{1-\nu_{k+1}}$, then when $\theta = 1$, there holds
$$\mathcal{E} \le \frac{r-1}{r}(1-\alpha) + 10\delta. \quad (28)$$
• For $n_U+1\le k\le n_U+r-1$, or $k\le n_U$ and $\delta\le\frac{r-1}{2r}(1-\nu_{n_L+k+1-r})$:
  - if $1-\alpha > \frac{2r\delta}{r-1}\cdot\frac{\nu_{k+1}}{1-\nu_{k+1}}$, then when $\theta = 0$,
$$\mathcal{E} \le \frac{2\delta}{1-\nu_{k+1}} + 8\delta; \quad (29)$$
  - if $1-\alpha \le \frac{2r\delta}{r-1}\cdot\frac{\nu_{k+1}}{1-\nu_{k+1}}$, then when $\theta = \frac{\nu_{k+1}-\nu_{n_L+k+1-r}}{\nu_{k+1}-\nu_{n_L+k+1-r}+\alpha}$,
$$\mathcal{E} \le \frac{2\delta + \frac{r-1}{r}(1-\alpha)\frac{\nu_{k+1}-\nu_{n_L+k+1-r}}{\nu_{k+1}-\nu_{n_L+k+1-r}+\alpha}}{1-\frac{\alpha}{\nu_{k+1}-\nu_{n_L+k+1-r}+\alpha}\nu_{k+1}} + 8\delta. \quad (30)$$
We conclude that when $k < n_U + r$ and $1-\alpha \le \frac{2r\delta}{r-1}\cdot\frac{\nu_{k+1}}{1-\nu_{k+1}}$, which is equivalent to the noise rate $\gamma$ being smaller than a certain threshold, the weakly supervised information can improve the downstream error bound by leveraging both label and feature information and selecting a proper weighting parameter $\theta$. However, when the noise rate $\gamma$ is large enough, the introduction of noisy labels cannot directly improve the linear evaluation error, regardless of the embedding dimension $k$. Fortunately, for contrastive learning under severe label noise, one can apply empirical techniques such as using the spatial relationship of features to select confident samples and reduce the noise rate, using the pre-filtered weakly supervised data to improve contrastive learning, and in turn using the improved feature embedding to further reduce the label noise. This philosophy has been empirically shown to be effective in many methodological studies (Yao et al., 2021; Ortego et al., 2021; Li et al., 2022).
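The threshold condition above is easy to evaluate in practice. The helper below (our own sketch; the function name and the example numbers are hypothetical) checks whether, for given $\gamma$, $\delta$, $r$, and $\nu_{k+1}$, Theorem 1 suggests mixing in label information ($\theta > 0$) or falling back to purely unsupervised learning ($\theta = 0$).

```python
def mixing_helps(gamma, delta, r, nu_k1):
    """Check 1 - alpha <= (2 r delta / (r - 1)) * nu_{k+1} / (1 - nu_{k+1}),
    the condition under which a nonzero mixing weight theta improves the
    error bound of Theorem 1."""
    alpha = (1 - r * gamma / (r - 1)) ** 2
    threshold = 2 * r * delta / (r - 1) * nu_k1 / (1 - nu_k1)
    return (1 - alpha) <= threshold

r, delta, nu_k1 = 10, 0.01, 0.5
assert mixing_helps(0.005, delta, r, nu_k1)    # mild noise: labels help
assert not mixing_helps(0.2, delta, r, nu_k1)  # heavy noise: set theta = 0
```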

6. EXPERIMENTS

In this section, we empirically verify our theoretical result that mixing noisy labels and feature information can improve the performance of contrastive learning.

Loss function.

Recall that in the theoretical analysis, we investigate the mixed similarity graph $A_{\theta,\gamma,n_L} := (1-\theta)\bar{A}_0 + \theta\bar{\tilde{A}}$. By the convexity of the squared Frobenius norm, the matrix factorization loss satisfies
$$\mathcal{L}_{\mathrm{mf}}(F) = \|A_{\theta,\gamma,n_L} - FF^\top\|_F^2 = \|(1-\theta)\bar{A}_0 + \theta\bar{\tilde{A}} - FF^\top\|_F^2 \le (1-\theta)\|\bar{A}_0 - FF^\top\|_F^2 + \theta\|\bar{\tilde{A}} - FF^\top\|_F^2. \quad (31)$$
According to equation 2, the spectral contrastive loss is equivalent to the matrix factorization loss. Therefore, in the experiments, we use a convex combination of the supervised and unsupervised contrastive losses to leverage the noisy label and feature information, i.e., for $\theta\in(0,1)$,
$$\mathcal{L}_{\mathrm{mix}} := (1-\theta)\mathcal{L}_{\mathrm{unsup}} + \theta\mathcal{L}_{\mathrm{sup}}. \quad (32)$$
Setup. We conduct numerical comparisons on the CIFAR-10 and TinyImagenet-200 benchmark image datasets (the results on TinyImagenet-200 can be found in Appendix A.2), and follow the settings of SimCLR (Chen et al., 2020) and SupCon (Khosla et al., 2020). We use the SGD optimizer, ResNet-50 as the encoder, and a 2-layer MLP as the projection head. We run experiments on 4 NVIDIA Tesla V100 32GB GPUs. The data augmentations are random crop and resize (with random flip), color distortion, and color dropping. Each model is trained with batch size 1024 for 1000 epochs. We evaluate the self-supervised representation by the linear evaluation protocol, where a linear classifier is trained on top of the encoder, and we regard its test accuracy as the performance of the encoder. The symmetric noisy labels are generated by flipping the labels of a given proportion of training samples uniformly to one of the other class labels. In Table 1, we compare the performance of unsupervised contrastive learning (SimCLR), supervised contrastive learning (SupCon), and weakly supervised contrastive learning (Mix) under noise rates $\gamma = 5\%$ and $\gamma = 20\%$. In SimCLR, we neglect all labels in the training procedure, and in SupCon, we select as positive samples those with the same noisy annotations.
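Below is a minimal sketch of the mixed objective in equation 32, using empirical versions of the spectral contrastive loss (equation 1) on a small batch of embeddings. This is our own illustration with hypothetical data, not the training code used in the experiments; positives come from augmentation pairing for the unsupervised term and from (noisy) label agreement for the supervised term.

```python
import numpy as np

def spectral_loss(F, pos_pairs):
    """Empirical spectral contrastive loss (equation 1): attract positive
    pairs, repel all pairs through the squared inner-product term."""
    attract = -2.0 * np.mean([F[i] @ F[j] for i, j in pos_pairs])
    repel = np.mean((F @ F.T) ** 2)
    return attract + repel

def mixed_loss(F, aug_pairs, label_pairs, theta):
    """L_mix = (1 - theta) * L_unsup + theta * L_sup  (equation 32)."""
    return ((1 - theta) * spectral_loss(F, aug_pairs)
            + theta * spectral_loss(F, label_pairs))

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4))                            # toy batch embeddings
aug_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]           # two views per image
noisy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # possibly corrupted
label_pairs = [(i, j) for i in range(8) for j in range(8)
               if i != j and noisy_labels[i] == noisy_labels[j]]

loss = mixed_loss(F, aug_pairs, label_pairs, theta=0.2)
assert np.isfinite(loss)
# theta = 0 recovers the purely unsupervised loss
assert np.isclose(mixed_loss(F, aug_pairs, label_pairs, 0.0),
                  spectral_loss(F, aug_pairs))
```

In the actual experiments, the same convex combination is applied to the SimCLR and SupCon objectives rather than to this toy batch loss.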
The parameter grid of $\theta$ for Mix is $\{0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95\}$. Performance comparisons with more noise rates can be found in Appendix A.2. The best results are marked in bold, and the standard deviation is also reported. Table 1 shows that if the noise rate is small ($\gamma = 5\%$), SupCon performs better than SimCLR, whereas when the noise rate rises to $\gamma = 20\%$, the noisy labels actually harm the performance of contrastive learning. Nonetheless, under both $\gamma = 5\%$ and $\gamma = 20\%$, the weakly supervised Mix outperforms both unsupervised SimCLR and supervised SupCon. This verifies the result in Theorem 1 that when the noise rate is smaller than a certain threshold, leveraging both weakly supervised and feature information helps improve the performance of contrastive learning. In Figure 1 we conduct a parameter analysis of $\theta$ for weakly supervised contrastive learning (Mix). As $\theta$ increases from 0 to 1, the performance of Mix first increases and then decreases. Moreover, a larger noise rate requires a smaller optimal $\theta$: the optimal $\theta$ for $\gamma = 5\%$ is larger than that for $\gamma = 20\%$. That is, under more severe label noise, less supervised information should be used in weakly supervised contrastive learning.

7. CONCLUSION

In this paper, we establish a theoretical framework for weakly supervised contrastive learning, which is compatible with the settings of both noisy label learning and semi-supervised learning. By formulating a mixed similarity graph describing both weakly supervised label information and unsupervised feature information, we analyze weakly supervised spectral contrastive learning within the framework of spectral clustering, and derive the downstream linear evaluation error bound. Our theoretical results show that semi-supervised noisy labels improve the downstream error bound when the noise rate is smaller than a certain threshold. Our theoretical framework reveals the effect of weak supervision on contrastive learning, and has the potential to explain existing weakly supervised learning algorithms based on contrastive approaches and to inspire new ones. For future work, we will investigate the effect of more complex weak supervision, such as active learning and label-dependent label noise, on contrastive learning.

A APPENDIX

A.1 PROOFS Proof of Lemma 1. Under Assumption 1, we have (T 2 ) i,j = { (1 -γ) 2 + γ 2 /(r -1), i = j 2γ(1 -γ)/(r -1) + (r -2)γ 2 /(r -1) 2 , i ̸ = j =    (1 -γ) 2 + γ 2 /(r -1), i = j γ r -1 ( 2 - r r -1 γ ) , i ̸ = j. ( ) That is, we have T 2 = [ (1 -γ) 2 + γ 2 /(r -1) - γ r -1 ( 2 - r r -1 γ )] I r×r + γ r -1 ( 2 - r r -1 γ ) ⃗ 1 r ⃗ 1 ⊤ r = ( 1 - r r -1 γ ) 2 I r×r + γ r -1 ( 2 - r r -1 γ ) ⃗ 1 r ⃗ 1 ⊤ r := αI r×r + β ⃗ 1 r ⃗ 1 ⊤ r . ( ) Given γ ∈ [0, 1), we have ÃL = Y L T 2 Y ⊤ L = Y L ( αI r×r + β ⃗ 1 r ⃗ 1 ⊤ r ) Y ⊤ L = αY L Y ⊤ L + βY L ⃗ 1 r ⃗ 1 ⊤ r Y ⊤ L = αA L + β ⃗ 1 n L ⃗ 1 ⊤ n L , where the last equality holds because ∑ j η j (x i ) = 1 for i ∈ [n]. and the normalized augmentation graph is Ā = D-1/2 Ã D-1/2 , ( ) where D = [ DL 0 0 I n U ×n U ] , ( ) DL = diag(d i ), and d i = ∑ j∈[n L ] Ãi,j = α ∑ j∈[n L ] ∑ ℓ∈[r] η ℓ (x i )η ℓ (x j ) + n L β = α ∑ ℓ∈[r] η ℓ (x i ) ∑ j∈[n L ] η ℓ (x j ) + n L β = α ∑ ℓ∈[r] η ℓ (x i )n ℓ + n L β (39) Specifically, when the labeled data is class-balanced, i.e. n 1 = . . . = n r = n L /r. Then we have d i = n L r α ∑ ℓ∈[r] η ℓ (x i ) + n L β = n L r α + nβ = n L r , and thus Ā = [ α r n L A L + β r n L ⃗ 1 n L ⃗ 1 ⊤ n L 0 0 I n U ×n U ] . ( ) Proof of Proposition 1. We first prove that v 1 = 1 √ n L ⃗ 1 n L is an eigenvector of ĀL := r n L A L with eigenvalue µ 1 = 1. To be specific, ĀL • 1 √ n L ⃗ 1 n L = 1 √ n L • r n L A L • ⃗ 1 n L = 1 √ n L • r n L Y L Y ⊤ L ⃗ 1 n L = 1 √ n L • r n L Y L n L r ⃗ 1 r = 1 √ n L ⃗ 1 n L , ( ) where the second last equality is due to class balance, i.e. ∑ i∈[n L ] η j (x i ) = n L /r for j ∈ [r] , and the last equality holds because ∑ j∈[r] η j (x i ) = 1 for i ∈ [n L ]. Therefore, we can rewrite ĀL as ĀL = [ 1 √ n L ⃗ 1 n L , v 2 , . . . , v n L ]     1 0 . . . 0 0 µ 2 . . . 0 . . . . . . . . . 0 0 . . . µ n L          1 √ n L ⃗ 1 ⊤ n L v ⊤ 2 . . . v ⊤ n L      . 
Note that $\frac{1}{n_L} \vec{1}_{n_L} \vec{1}_{n_L}^\top$ can be decomposed as
$$
\frac{1}{n_L} \vec{1}_{n_L} \vec{1}_{n_L}^\top = \Big(\frac{1}{\sqrt{n_L}} \vec{1}_{n_L}\Big)\Big(\frac{1}{\sqrt{n_L}} \vec{1}_{n_L}\Big)^\top = V \,\mathrm{diag}(1, 0, \dots, 0)\, V^\top,
$$
where $V := \big[\frac{1}{\sqrt{n_L}} \vec{1}_{n_L}, v_2, \dots, v_{n_L}\big]$. Then we have
$$
\bar{A}_L := \alpha \frac{r}{n_L} A_L + r\beta \frac{1}{n_L} \vec{1}_{n_L} \vec{1}_{n_L}^\top
= V \,\mathrm{diag}(\alpha, \alpha\mu_2, \dots, \alpha\mu_{n_L})\, V^\top + V \,\mathrm{diag}(r\beta, 0, \dots, 0)\, V^\top
= V \,\mathrm{diag}(\alpha + r\beta, \alpha\mu_2, \dots, \alpha\mu_{n_L})\, V^\top.
$$
Since $\alpha + r\beta = 1$, the eigenvalues of $\bar{A}_L$ are $1, \alpha\mu_2, \dots, \alpha\mu_{n_L}$. Thus the eigenvalues of
$$
\bar{A} = \begin{bmatrix} \bar{A}_L & 0 \\ 0 & I_{n_U \times n_U} \end{bmatrix} \tag{46}
$$
$$
\sum_{i,j \in [n]} \mathbb{E}_{\bar{x} \sim P_{\bar{X}}}\, A(x_i|\bar{x}) A(x_j|\bar{x})\, \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
\le \sum_{i,j \in [n]} \mathbb{E}_{\bar{x} \sim P_{\bar{X}}}\, A(x_i|\bar{x}) A(x_j|\bar{x}) \big( \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(\bar{x})] + \mathbf{1}[\hat{y}(x_j) \neq \hat{y}(\bar{x})] \big)
= 2 \sum_{i \in [n]} \mathbb{E}_{\bar{x} \sim P_{\bar{X}}}\, A(x_i|\bar{x})\, \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(\bar{x})] = 2\delta,
$$
where the second equality uses $\sum_{j \in [n]} A(x_j|\bar{x}) = 1$. The second term is
$$
\frac{1}{n_L} \sum_{i,j \in [n]} \bar{A}_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
= \frac{1}{n_L} \sum_{i,j \in [n_L]} (\bar{A}_L)_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
+ \frac{1}{n_L} \sum_{i > n_L} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_i)]
+ \frac{2}{n_L} \sum_{i \le n_L,\, j > n_L} \bar{A}_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)].
$$
According to the definition of $\bar{A}$, the last two terms are equal to 0.
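The eigenvalue structure of the mixed labeled block can likewise be checked numerically. The sketch below (variable names and sizes are our own; it uses a class-balanced, one-hot $Y_L$, i.e. the deterministic special case) verifies that the eigenvalues of $\alpha \frac{r}{n_L} A_L + r\beta \frac{1}{n_L} \vec{1}\vec{1}^\top$ are $1, \alpha\mu_2, \dots, \alpha\mu_{n_L}$:

```python
import numpy as np

r, per_class = 4, 6
n_L = r * per_class
gamma = 0.2
alpha = (1.0 - r * gamma / (r - 1)) ** 2
beta = gamma / (r - 1) * (2.0 - r * gamma / (r - 1))

# Class-balanced one-hot label matrix Y_L (deterministic posteriors eta).
Y = np.zeros((n_L, r))
Y[np.arange(n_L), np.repeat(np.arange(r), per_class)] = 1.0

A_L = Y @ Y.T                      # label similarity graph
base = (r / n_L) * A_L             # normalized labeled block, eigenvalues mu_1 >= mu_2 >= ...
mixed = alpha * base + r * beta * np.ones((n_L, n_L)) / n_L

mu = np.sort(np.linalg.eigvalsh(base))[::-1]
mixed_eigs = np.sort(np.linalg.eigvalsh(mixed))[::-1]

# Top eigenvalue is alpha + r*beta = 1; the remaining ones are alpha * mu_j.
expected = np.sort(np.concatenate(([1.0], alpha * mu[1:])))[::-1]
assert np.allclose(mixed_eigs, expected)
```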
Then by Lemma 1, the second term on the RHS of equation 56 becomes
$$
\frac{1}{n_L} \sum_{i,j \in [n]} \bar{A}_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
= \frac{1}{n_L} \sum_{i,j \in [n_L]} \Big( \alpha \frac{r}{n_L} (A_L)_{i,j} + \beta \frac{r}{n_L} \Big) \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
$$
$$
\le \alpha \frac{r}{n_L^2} \sum_{i,j \in [n_L]} \sum_{\ell \in [r]} \eta_\ell(x_i) \eta_\ell(x_j) \big( \mathbf{1}[\hat{y}(x_i) \neq \ell] + \mathbf{1}[\hat{y}(x_j) \neq \ell] \big)
+ \beta \frac{r}{n_L^2} \sum_{i,j \in [n_L]} \big( \mathbf{1}[\hat{y}(x_i) \neq y_i] + \mathbf{1}[\hat{y}(x_j) \neq y_j] + \mathbf{1}[y_i \neq y_j] \big)
$$
$$
= \alpha \frac{r}{n_L^2} \sum_{\ell \in [r]} 2\, \frac{n_L}{r} \sum_{i \in [n_L]} \eta_\ell(x_i) \mathbf{1}[\hat{y}(x_i) \neq \ell]
+ \beta \frac{r}{n_L^2} \Big( 2 n_L \sum_{i \in [n_L]} \mathbf{1}[\hat{y}(x_i) \neq y_i] + \sum_{i,j \in [n_L]} \mathbf{1}[y_i \neq y_j] \Big)
$$
$$
\le 2\alpha\, \frac{1}{n_L} \sum_{i \in [n_L]} \sum_{\ell \in [r]} \eta_\ell(x_i) \mathbf{1}[\hat{y}(x_i) \neq \ell]
+ 2\beta r\, \frac{1}{n_L} \sum_{i \in [n_L]} \mathbf{1}[\hat{y}(x_i) \neq y_i] + \beta(r-1).
$$
Under the deterministic scenario, we have $\eta_\ell(x_i) = \mathbf{1}[y_i = \ell]$, and hence $\sum_{\ell \in [r]} \eta_\ell(x_i) \mathbf{1}[\hat{y}(x_i) \neq \ell] = \mathbf{1}[\hat{y}(x_i) \neq y_i]$.
$$
E \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{1 - \lambda_{k+1}} + 8\delta.
$$
Combined with Proposition 2, we have: for $k \le n_U$, there holds
$$
E \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{\min\big\{(1-\theta)(1-\nu_{n_L+k}),\ (1-\theta)(1-\nu_{n_L+k+1-r}) + \theta(1-\alpha),\ (1-\theta)(1-\nu_{k+1}) + \theta\big\}} + 8\delta;
$$
for $n_U + 1 \le k \le n_U + r - 1$, there holds
$$
E \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{\min\big\{(1-\theta)(1-\nu_{n_L+k+1-r}) + \theta(1-\alpha),\ (1-\theta)(1-\nu_{k+1}) + \theta\big\}} + 8\delta;
$$
and for $k \ge n_U + r$, there holds
$$
E \le \frac{2\big[2\delta + \theta(1-\alpha)\frac{r-1}{r}\big]}{(1-\theta)(1-\nu_{k+1}) + \theta} + 8\delta.
$$

A.2 ADDITIONAL EXPERIMENTS

We run additional experiments on the CIFAR-10 dataset with the noise rate γ varying from 0% to 60% in Table 2. The parameter grid of θ for Mix is {0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95}. We also run additional experimental comparisons on the TinyImagenet-200 dataset with noise rate γ = 0.4. The parameter grid of θ is {0.1, 0.2, 0.4, 0.6}. We additionally adopt Gaussian blur for data augmentation and keep the other experimental setups the same as in Table 1. The results are presented in Table 3, which shows that Mix outperforms both SimCLR and SupCon on the TinyImagenet-200 dataset.
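For intuition about the final bounds of Theorem 1, the common expression $2[2\delta + \theta(1-\alpha)\frac{r-1}{r}]/\mathrm{gap} + 8\delta$ is easy to evaluate numerically. The helper below is purely illustrative (ours): `gap` stands for whichever denominator applies, e.g. $1 - \lambda_{k+1}$, and the parameter values are made up.

```python
def linear_probe_error_bound(delta, theta, alpha, r, gap):
    """Upper bound E <= 2*[2*delta + theta*(1-alpha)*(r-1)/r] / gap + 8*delta.

    `gap` is the relevant spectral denominator, e.g. 1 - lambda_{k+1}."""
    phi = 2.0 * delta + theta * (1.0 - alpha) * (r - 1) / r
    return 2.0 * phi / gap + 8.0 * delta

# With clean labels (alpha = 1) the label-noise term vanishes, and adding
# label information (theta > 0) can only help through a larger spectral gap.
b_unsup = linear_probe_error_bound(delta=0.01, theta=0.0, alpha=1.0, r=10, gap=0.2)
b_mixed = linear_probe_error_bound(delta=0.01, theta=0.5, alpha=1.0, r=10, gap=0.4)
assert b_mixed < b_unsup
```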



5. THEORETICAL RESULTS

In this section, we first compute the eigenvalues of the similarity graph describing both label and feature information, which play a key role in deriving the error bound of contrastive learning. Then in Section 5.3, we show that the weakly supervised information helps reduce the error of the best possible linear classifier on the representations learned by weakly supervised contrastive learning.

5.1 EIGENVALUES OF SIMILARITY GRAPH DESCRIBING WEAKLY SUPERVISED LABEL



Figure 1: Parameter analysis of θ (noise rate γ = 20%).

That is,
$$
\frac{1}{n_L} \sum_{i,j \in [n]} \bar{A}_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
= 2\alpha\, \frac{1}{n_L} \sum_{i \in [n_L]} \mathbf{1}[\hat{y}(x_i) \neq y_i]
+ 2\beta r\, \frac{1}{n_L} \sum_{i \in [n_L]} \mathbf{1}[\hat{y}(x_i) \neq y_i] + \beta(r-1)
= 2\alpha\delta + 2\beta r \delta + \beta(r-1)
= 2\alpha\delta + 2(1-\alpha)\delta + (1-\alpha)\frac{r-1}{r}
= 2\delta + (1-\alpha)\frac{r-1}{r}, \tag{60}
$$
where the second-last equation holds due to $\alpha + r\beta = 1$. Combining equation 56, equation 57 and equation 60, we have
$$
\phi^{\hat{y}} \le (1-\theta)\,2\delta + \theta \Big( 2\delta + (1-\alpha)\frac{r-1}{r} \Big) = 2\delta + \theta(1-\alpha)\frac{r-1}{r}. \tag{61}
$$
Therefore, by equation 55, we have
$$
E := \mathbb{P}_{\bar{x} \sim P_{\bar{X}},\, x \sim A(\cdot|\bar{x})} \big( g_{f^*_{\mathrm{pop}}, B^*}(x) \neq y(\bar{x}) \big)
$$
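The algebraic simplifications in equations 60 and 61 rely only on $\beta = (1-\alpha)/r$; a brute-force numerical check (ours) over random parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(100):
    r = int(rng.integers(2, 20))
    alpha = rng.uniform(0, 1)
    delta = rng.uniform(0, 0.5)
    theta = rng.uniform(0, 1)
    beta = (1 - alpha) / r  # from alpha + r*beta = 1

    # Equation 60: the deterministic-scenario bound simplifies as claimed.
    lhs = 2 * alpha * delta + 2 * beta * r * delta + beta * (r - 1)
    rhs = 2 * delta + (1 - alpha) * (r - 1) / r
    assert np.isclose(lhs, rhs)

    # Equation 61: the convex combination with weight theta.
    phi = (1 - theta) * 2 * delta + theta * rhs
    assert np.isclose(phi, 2 * delta + theta * (1 - alpha) * (r - 1) / r)
```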

and $Y \in [r] := \{1, \dots, r\}$. Let the input natural data $\{(\bar{x}_i, y_i)\}_{i \in [N]}$ be i.i.d. sampled from the joint distribution $P(\bar{X}, Y)$. Given a natural data point $\bar{x} \in \bar{\mathcal{X}}$, we use $A(\cdot|\bar{x})$ to denote the distribution of its augmentations, and use $\mathcal{X}$ to denote the set of all augmented data, which is assumed to be finite but exponentially large. Denote $n = |\mathcal{X}|$. … $\sum_{x' \in \mathcal{X}} w_{xx'} = 1$. The adjacency matrix of the augmentation graph is denoted as $A$
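Under this setup, the augmentation-graph weights $w_{xx'} = \mathbb{E}_{\bar{x}}\, A(x|\bar{x}) A(x'|\bar{x})$ (following HaoChen et al., 2021) form a symmetric matrix whose total weight is 1 when $\bar{x}$ is drawn from a probability distribution over natural data. A toy construction (all sizes and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 3, 8  # number of natural data points and of augmented data points

# A[i, :] is the augmentation distribution A(. | x_bar_i) over the n augmentations.
A = rng.random((N, n))
A /= A.sum(axis=1, keepdims=True)
p = np.full(N, 1.0 / N)  # uniform distribution over natural data

# Edge weights of the augmentation graph: w_{xx'} = E_{x_bar} A(x|x_bar) A(x'|x_bar).
W = (A * p[:, None]).T @ A

assert np.allclose(W, W.T)       # the graph is undirected
assert np.isclose(W.sum(), 1.0)  # total edge weight is 1
```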

Table 1: Performance comparisons on the CIFAR-10 dataset.

It can be seen from Table 2 that our Mix consistently outperforms SimCLR and SupCon.

Table 2: Additional performance comparisons on the CIFAR-10 dataset.

Table 3: Performance comparisons on the TinyImagenet-200 dataset.

INFORMATION

We first compute the eigenvalues of the similarity graph describing the semi-supervised noisy labels. The eigenvalues of $\bar{A}$ are
$$
\tilde{\mu}_1 = \dots = \tilde{\mu}_{n_U+1} = 1, \tag{47}
$$
$$
\tilde{\mu}_j = \alpha \mu_{j - n_U}, \quad \text{for } j = n_U + 2, \dots, n. \tag{48}
$$

Proof of Proposition 2. By equation 13 in Fulton (2000), for two real symmetric $n \times n$ matrices $(1-\theta)\bar{A}_0$ and $\theta\bar{A}$, the $(k+1)$-th largest eigenvalue of $A_{\theta,\lambda,n_L} := (1-\theta)\bar{A}_0 + \theta\bar{A}$ can take any value in the interval … . By Proposition 1, we have … (50). Therefore, we have …, where the last equality holds because $\{\nu_i\}_{i \in [n]}$ is ranked in descending order. Then when $k \le n_U$, …, and when $k \ge n_U + r$, ….

Proof of Theorem 1. By Lemma B.3 of HaoChen et al. (2021), for any labeling function $\hat{y} : \mathcal{X} \to [r]$, there exists a linear probe …, where, according to the definition of $A_{\theta,\lambda,n_L}$, …. We investigate the two terms on the RHS of equation 56 respectively. The first term is
$$
\sum_{i,j \in [n]} (A_0)_{i,j} \mathbf{1}[\hat{y}(x_i) \neq \hat{y}(x_j)]
$$
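The interval in the proof of Proposition 2 is an instance of Weyl's eigenvalue inequalities for sums of real symmetric matrices (cf. Fulton, 2000); a quick numerical check on random symmetric matrices (ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, theta = 12, 4, 0.3

M = rng.standard_normal((n, n)); A0 = (M + M.T) / 2
M = rng.standard_normal((n, n)); A1 = (M + M.T) / 2

def eigs_desc(X):
    """Eigenvalues of a symmetric matrix, in descending order."""
    return np.sort(np.linalg.eigvalsh(X))[::-1]

a = eigs_desc((1 - theta) * A0)
b = eigs_desc(theta * A1)
s = eigs_desc((1 - theta) * A0 + theta * A1)

# Weyl: lambda_{k+1}(A) + lambda_min(B) <= lambda_{k+1}(A + B)
#                                       <= lambda_{k+1}(A) + lambda_max(B).
assert a[k] + b[-1] <= s[k] + 1e-9
assert s[k] <= a[k] + b[0] + 1e-9
```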

