HOW DOES SEMI-SUPERVISED LEARNING WITH PSEUDO-LABELERS WORK? A CASE STUDY

Abstract

Semi-supervised learning is a popular machine learning paradigm that utilizes a large amount of unlabeled data as well as a small amount of labeled data to facilitate learning tasks. While semi-supervised learning has achieved great success in training neural networks, its theoretical understanding remains largely open. In this paper, we aim to theoretically understand a semi-supervised learning approach based on pre-training and linear probing. In particular, the semi-supervised learning approach we consider first trains a two-layer neural network based on the unlabeled data with the help of pseudo-labelers. Then it linearly probes the pre-trained network on a small amount of labeled data. We prove that, under a certain toy data generation model and two-layer convolutional neural network, the semi-supervised learning approach can achieve nearly zero test loss, while a neural network directly trained by supervised learning on the same amount of labeled data can only achieve constant test loss. Through this case study, we demonstrate a separation between semi-supervised learning and supervised learning in terms of test loss provided the same amount of labeled data.

1. INTRODUCTION

With the help of human-annotated labels, supervised learning has achieved remarkable success in several computer vision tasks (Girshick et al., 2014; Long et al., 2015; Krizhevsky et al., 2012; Tran et al., 2015). However, annotating large-scale datasets (e.g., video datasets with temporal dimensions) is time-consuming and costly. In order to reduce the number of labels used for training while maintaining a good prediction performance, a variety of methods have been proposed. Among these methods, semi-supervised learning (Scudder, 1965; Fralick, 1967; Agrawala, 1970), which leverages both a small amount of labeled data and a large amount of unlabeled data to improve learning performance, is one of the most widely used approaches. It has been shown to achieve promising performance for a wide variety of tasks, including image classification (Rasmus et al., 2015; Springenberg, 2015; Laine & Aila, 2016), image generation (Kingma et al., 2014; Odena, 2016; Salimans et al., 2016), domain adaptation (Saito et al., 2017; Shu et al., 2018; Lee et al., 2019), and word embedding (Turian et al., 2010; Peters et al., 2017). One of the popular semi-supervised learning approaches is pseudo-labeling (Lee et al., 2013), which generates pseudo-labels of unlabeled data for pre-training. This approach has been remarkably successful in improving performance on many tasks. For example, in image classification, one can first train a teacher network on a small labeled dataset and use it as a pseudo-labeler to generate pseudo-labels for large unlabeled datasets. Then one can train a student network on the combination of labeled and pseudo-labeled images (Xie et al., 2020; Pham et al., 2021b; Rizve et al., 2021).

In order to theoretically understand semi-supervised learning with pseudo-labelers, Oymak & Gulcu (2021) considered learning a linear classifier in the Gaussian mixture model setting.
They showed that in the high-dimensional limit, the predictors found by semi-supervised learning are correlated with the Bayes-optimal predictor. Frei et al. (2022c) further proved that the semi-supervised learning algorithm can provably converge to the Bayes-optimal predictor for mixture models. However, their analyses are limited to linear classifiers, and cannot explain the success of semi-supervised learning with neural networks.

In this paper, we attempt to theoretically explain the success of semi-supervised learning with pseudo-labelers in training neural networks. Specifically, we focus on a toy data model that contains both signal patches and noise patches, where the signal patch is correlated with the label while the noise patch is not. We consider semi-supervised learning with pre-training and linear probing. In the pre-training stage, we train a two-layer convolutional neural network (CNN) on an unlabeled dataset with pseudo-labels. We then fine-tune the pre-trained model using linear probing on a small amount of labeled data. We provide a comprehensive analysis of the learning process in both the pre-training and linear probing stages. The contributions of our work are summarized as follows.

• We theoretically show that with the help of pseudo-labelers, a CNN can learn the feature representation during the pre-training stage. Moreover, the learned feature is highly correlated with the true labels of the data, even though the true labels are unknown and not used during the pre-training stage.

• Based on our analysis of the pre-training process, we further show that when linear-probing the pre-trained model in the downstream task, the final classifier can achieve near-zero test loss and test error. Notably, these guarantees of small test loss and error only require a very small number of labeled training data.

• As a comparison, we show that standard supervised learning cannot learn a good classifier under the same setting. Specifically, we show that, even when the training process converges to a global minimum of the training loss, the learned two-layer CNN can only achieve constant-level test loss. This, together with the aforementioned results for semi-supervised learning, demonstrates the advantage of semi-supervised learning over standard supervised learning.

Notation. We use lower case letters, lower case bold face letters, and upper case bold face letters to denote scalars, vectors, and matrices respectively. For a scalar x, we use 

2. RELATED WORK

Semi-supervised learning methods in practice. Since the invention of semi-supervised learning in Scudder (1965); Fralick (1967); Agrawala (1970), a wide range of semi-supervised learning approaches have been proposed, including generative models (Miller & Uyar, 1996; Nigam et al., 2000), semi-supervised support vector machines (Bennett & Demiriz, 1998; Xu et al., 2007; 2009), graph-based methods (Zhu et al., 2003; Belkin et al., 2006; Zhou et al., 2003), and co-training (Blum & Mitchell, 1998). For a comprehensive review of classical semi-supervised learning methods, please refer to Chapelle et al. (2010); Zhu & Goldberg (2009). In the past years, a number of deep semi-supervised learning approaches have been proposed, such as generative methods (Odena, 2016; Li et al., 2019), consistency regularization methods (Sajjadi et al., 2016; Laine & Aila, 2016; Rasmus et al., 2015; Tarvainen & Valpola, 2017) and pseudo-labeling methods (Lee et al., 2013; Zhai et al., 2019; Xie et al., 2020; Pham et al., 2021a). In this work, we will focus on pseudo-labeling methods.

Theory of semi-supervised learning. To understand semi-supervised learning, Castelli & Cover (1995; 1996) studied the relative value of labeled data over unlabeled data under a parametric assumption on the marginal distribution of input features. Later, a series of works proved that semi-supervised learning can possess better sample complexity or generalization performance than supervised learning under certain assumptions on the marginal distribution (Niyogi, 2013; Globerson et al., 2017) or the ratio of labeled and unlabeled samples (Singh et al., 2008; Darnstädt, 2015), while Balcan & Blum (2010) provided a unified PAC framework able to analyze both sample-complexity and algorithmic issues. Oymak & Gulcu (2021) and Frei et al. (2022c) considered semi-supervised learning with pseudo-labelers for learning a linear classifier on mixture models, and studied its correlation with and convergence to the Bayes-optimal predictor.

Self-supervised learning in practice. A closely related learning paradigm to semi-supervised learning is called self-supervised learning, which creates human-designed supervised learning problems to leverage natural structures and learn representations from unlabeled data. Representative self-supervised learning approaches include contrastive learning and pretext-based self-supervised learning. Contrastive learning (Caron et al., 2020; He et al., 2020; Chen et al., 2020) aims to group similar examples closer and dissimilar examples far from each other by utilizing a similarity metric, while pretext-based self-supervised learning tries to learn a good representation from pretext tasks generated from the unlabeled data to facilitate downstream learning tasks. In practice, various pretext tasks have been proposed, which include (1) generation-based ones such as colorizing grayscale images (Zhang et al., 2016), image inpainting (Pathak et al., 2016), image and video generation with GAN (Goodfellow et al., 2014; Brock et al., 2018; Karras et al., 2019; Vondrick et al., 2016; Tulyakov et al., 2018); and (2) context-based ones such as image jigsaw puzzles (Noroozi & Favaro, 2016), geometric transformation (Gidaris et al., 2018; Jing et al., 2018), frame order verification and recognition (Lee et al., 2017; Misra et al., 2016; Wei et al., 2018). The semi-supervised learning approach with pseudo-labelers studied in this paper is related to pretext-based self-supervised learning because the unlabeled data with pseudo-labels can be seen as a particular pretext task.

Theory of self-supervised learning. 
In order to understand self-supervised learning, there is a line of work towards understanding contrastive learning (Saunshi et al., 2019; Tsai et al., 2020; Mitrovic et al., 2020; Tian et al., 2020; Wang & Isola, 2020; Tosh et al., 2021b; a; HaoChen et al., 2021; Wen & Li, 2021; Saunshi et al., 2022), which is one of the most used self-supervised learning approaches based on data augmentation. Unlike contrastive learning, the theoretical understanding of pretext-based self-supervised learning is still rather limited. The only notable works are Lee et al. (2020) and Wei et al. (2020). Lee et al. (2020) proved generalization guarantees for self-supervised algorithms using empirical risk minimization on the pretext task under certain conditional independence assumptions. Wei et al. (2020) proved that under an "expansion" assumption, the minimizer of the population loss based on self-training and input-consistency regularization will achieve high prediction accuracy. Since semi-supervised learning with pseudo-labelers can be seen as a special case of pretext-based self-supervised learning (the pretext task is generated by the pseudo-labelers), we believe the case study in the current paper and its theoretical understanding can shed light on pretext-based self-supervised learning as well.

Feature learning by neural networks. Our work is also closely related to several recent works that study how neural networks learn features. Allen-Zhu & Li (2020a) showed that adversarial training purifies the learned features by removing certain "dense mixtures" in the hidden layer weights of the network. Allen-Zhu & Li (2020b) studied how ensemble and knowledge distillation work in deep learning when the data have "multi-view" features. Zou et al. (2021) studied an aspect of feature learning by Adam and GD and showed that GD can learn the sparse features while Adam may fail even with proper regularization. 
Notably, there are two concurrent works studying the benign overfitting phenomenon in learning neural networks: Frei et al. (2022a) established theoretical guarantees for benign overfitting of two-layer fully connected neural networks with zero training error and test error close to the Bayes-optimal error, while Cao et al. (2022) studied the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN), achieving arbitrarily small training and test loss. Our work studies a different aspect of feature learning afforded by semi-supervised learning versus supervised learning: given a small amount of labeled data, semi-supervised learning can learn the features with the help of pseudo-labelers, while supervised learning fails to learn the features and tends to overfit the noise in the training data.

3. PROBLEM SETUP AND PRELIMINARIES

In this section, we first give a brief overview of the semi-supervised learning pipeline using pseudo-labelers. Then we introduce our data model, the convolutional neural network, and the details of the training algorithms considered in this paper.

3.1. SEMI-SUPERVISED LEARNING PIPELINE WITH PSEUDO-LABELERS

In this paper, we consider a kind of semi-supervised learning (Xie et al., 2020; Pham et al., 2021b; Rizve et al., 2021), which leverages pseudo-labelers for pre-training. Such a semi-supervised learning method is related to a special kind of pretext-based self-supervised learning, whose pretext task is designed by generating pseudo-labels for unlabeled data with the help of pseudo-labelers (Zhai et al., 2019). The typical pipeline of this kind of semi-supervised learning is shown in Figure 1. Moreover, the case study we carry out is shown in Figure 2. The pretext task trains a two-layer convolutional neural network with the help of pseudo-labelers, and the downstream task trains a linear probe using the pre-trained models.

3.2. DATA DISTRIBUTION

Inspired by recent work (Allen-Zhu & Li, 2020b; Zou et al., 2021; Shen et al., 2022; Cao et al., 2022), we consider a toy data model where each data input $\mathbf{x}$ consists of two patches $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, each of dimension $d$. We focus on the binary classification task and present our data distribution $\mathcal{D}$ in the following definition.

Definition 3.1. Each data point $(\mathbf{x}, y)$ with $\mathbf{x} = [\mathbf{x}^{(1)\top}, \mathbf{x}^{(2)\top}]^\top \in \mathbb{R}^{2d}$ and $y \in \{-1, +1\}$ is generated as follows: the label $y$ is generated as a Rademacher random variable; one of $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}$ is given by the feature vector $y \cdot \mathbf{v}$, and the other is given by a noise vector $\boldsymbol{\xi}$ generated from the $d$-dimensional Gaussian distribution $N(\mathbf{0}, \sigma_p^2 (\mathbf{I} - \mathbf{v}\mathbf{v}^\top / \|\mathbf{v}\|_2^2))$. We denote by $\mathcal{D}$ the joint distribution of $(\mathbf{x}, y)$, and by $\mathcal{D}_{\mathbf{x}}$ the marginal distribution of $\mathbf{x}$.

The most natural way to think of our data model is to treat the patches $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ as embeddings of the image data: one of them is a signal that is label-dependent, and the other is noise that is label-independent. For simplicity, we assume that the noise patch is generated from the Gaussian distribution $N(\mathbf{0}, \sigma_p^2 \cdot (\mathbf{I} - \mathbf{v}\mathbf{v}^\top \cdot \|\mathbf{v}\|_2^{-2}))$ to ensure that the noise vector is orthogonal to the signal vector $\mathbf{v}$, and only consider the case where the data consists of one signal patch and one noise patch. However, our results and proof techniques can be easily extended to cover the setting with non-orthogonal signal/noise and multiple signal/noise patches. With this simple data model, in this case study we aim to show the effectiveness of semi-supervised learning and explain the mechanism behind semi-supervised learning with neural networks.

Since the positions of the signal and noise patches are not specified in Definition 3.1, it is natural to use a classifier with a convolutional structure that applies the same function to each patch. 
More specifically, we consider learning a CNN with $n_l$ labeled examples $S' = \{(\mathbf{x}'_i, y'_i)\}_{i=1}^{n_l}$ generated from the distribution $\mathcal{D}$ and $n_u$ unlabeled examples $S = \{\mathbf{x}_i\}_{i=1}^{n_u}$ generated from the marginal distribution $\mathcal{D}_{\mathbf{x}}$, where $n_l$ is significantly smaller than the dimension $d$. If we only use the labeled data, the CNN can easily overfit the training dataset by memorizing the noise patches $\boldsymbol{\xi}_i$. Consequently, the CNN will perform badly on fresh test data. Therefore, our case is hard to learn without using unlabeled examples.
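
The data model of Definition 3.1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `sample_data` is hypothetical, and the choice to place the signal patch in either position with probability 1/2 is an assumption, since Definition 3.1 only says that one of the two patches carries the signal.

```python
import numpy as np

def sample_data(n, v, sigma_p, rng):
    """Draw n points (x, y) from the toy distribution of Definition 3.1."""
    d = v.shape[0]
    y = rng.choice([-1.0, 1.0], size=n)           # Rademacher labels
    # Gaussian noise projected onto the orthogonal complement of v; the result
    # has covariance sigma_p^2 * (I - v v^T / ||v||^2).
    g = sigma_p * rng.standard_normal((n, d))
    xi = g - np.outer(g @ v, v) / (v @ v)
    signal = y[:, None] * v[None, :]              # signal patch y * v
    # Assumption: the signal occupies patch 0 or patch 1 with equal probability.
    signal_first = rng.random(n) < 0.5
    x = np.empty((n, 2, d))
    x[:, 0] = np.where(signal_first[:, None], signal, xi)
    x[:, 1] = np.where(signal_first[:, None], xi, signal)
    return x, y, xi
```

Discarding the returned labels `y` gives a draw of the unlabeled set from the marginal distribution of x.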

3.3. SUPERVISED LEARNING MODELS

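For supervised learning, we consider a two-layer CNN whose filters are applied to the patches $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ respectively, and whose second-layer parameters are fixed to $\pm 1$. The CNN can then be written as $f_{\mathbf{W}}(\mathbf{x}) = f^{+1}_{\mathbf{W}}(\mathbf{x}) - f^{-1}_{\mathbf{W}}(\mathbf{x})$, where $f^{+1}_{\mathbf{W}}(\mathbf{x})$ and $f^{-1}_{\mathbf{W}}(\mathbf{x})$ are defined as

$$f^{+1}_{\mathbf{W}}(\mathbf{x}) = \sum_{j=1}^{m} \big[\sigma(\langle \mathbf{w}_j, \mathbf{x}^{(1)} \rangle) + \sigma(\langle \mathbf{w}_j, \mathbf{x}^{(2)} \rangle)\big], \qquad f^{-1}_{\mathbf{W}}(\mathbf{x}) = \sum_{j=m+1}^{2m} \big[\sigma(\langle \mathbf{w}_j, \mathbf{x}^{(1)} \rangle) + \sigma(\langle \mathbf{w}_j, \mathbf{x}^{(2)} \rangle)\big]. \qquad (3.1)$$

Here $\sigma$ is the activation function $\mathrm{ReLU}^q(\cdot) = [\cdot]_+^q$ with $q > 2$, $m$ is the width of the network, $\mathbf{w}_j \in \mathbb{R}^d$ denotes the $j$-th filter, and $\mathbf{W}$ is the collection of all filters $\{\mathbf{w}_j\}_{j=1}^{2m}$. Given the labeled training dataset $S' = \{(\mathbf{x}'_i, y'_i)\}_{i=1}^{n_l}$, we train the CNN model by minimizing the empirical cross-entropy loss

$$L_{S'}(\mathbf{W}) = \frac{1}{n_l} \sum_{i=1}^{n_l} L_i(\mathbf{W}),$$

where $L_i(\mathbf{W}) = \ell(y'_i \cdot f_{\mathbf{W}}(\mathbf{x}'_i))$ with $\ell(z) = \log(1 + \exp(-z))$ denotes the individual loss for the training example $(\mathbf{x}'_i, y'_i)$. We minimize the empirical loss $L_{S'}(\mathbf{W})$ with gradient descent as follows:

$$\mathbf{w}_j^{(t+1)} = \mathbf{w}_j^{(t)} - \eta \cdot \nabla_{\mathbf{w}_j} L_{S'}(\mathbf{W}^{(t)}), \qquad \mathbf{w}_j^{(0)} \sim N(\mathbf{0}, \sigma_0^2 \mathbf{I}), \quad j \in [2m],$$

where $\eta > 0$ is the learning rate and $\sigma_0$ defines the scale of random initialization.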
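
As a concrete reading of (3.1) and the gradient descent update, the following NumPy sketch implements the forward pass, the logistic loss, and one gradient step. The function names, the array layout (filters as rows of a `(2m, d)` matrix, inputs as an `(n, 2, d)` array), and the optional weight-decay argument `lam` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def cnn_forward(W, x, q=3):
    """f_W(x) = f^{+1}_W(x) - f^{-1}_W(x); W is (2m, d), x is (n, 2, d)."""
    m = W.shape[0] // 2
    pre = np.einsum("jd,npd->njp", W, x)          # <w_j, x^(p)> for each patch p
    act = np.maximum(pre, 0.0) ** q               # ReLU^q activation
    per_filter = act.sum(axis=2)                  # sum over the two patches
    return per_filter[:, :m].sum(axis=1) - per_filter[:, m:].sum(axis=1)

def logistic_loss(W, x, y, q=3):
    """Empirical loss with l(z) = log(1 + exp(-z))."""
    return np.mean(np.logaddexp(0.0, -y * cnn_forward(W, x, q)))

def gd_step(W, x, y, eta, q=3, lam=0.0):
    """One gradient descent step on the loss plus (lam / 2) * ||W||_F^2."""
    m = W.shape[0] // 2
    pre = np.einsum("jd,npd->njp", W, x)
    margin = y * cnn_forward(W, x, q)
    c = np.exp(-np.logaddexp(0.0, margin))        # -l'(margin) = 1 / (1 + e^margin)
    sign = np.concatenate([np.ones(m), -np.ones(m)])   # fixed second layer (+1 / -1)
    dact = q * np.maximum(pre, 0.0) ** (q - 1)    # derivative of ReLU^q
    coeff = (c * y)[:, None, None] * dact * sign[None, :, None]
    grad = -np.einsum("njp,npd->jd", coeff, x) / x.shape[0] + lam * W
    return W - eta * grad
```

Setting `lam=0` corresponds to the supervised update above; a positive `lam` adds the weight-decay term that appears in the pre-training objective.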

3.4. SEMI-SUPERVISED LEARNING MODELS

For semi-supervised pre-training, we assume that we have access to $K$ pseudo-labelers $\{f_{\mathbf{w}_k}\}_{k=1}^{K}$. The accuracy of the $k$-th pseudo-labeler is $p_k \in (1/2, 1)$. We use the $K$ pseudo-labelers to generate $K$ pseudo-labeled datasets $\{S_k\}_{k=1}^{K}$, where $S_k := \{(\mathbf{x}_i, y_{k,i}) \mid y_{k,i} = f_{\mathbf{w}_k}(\mathbf{x}_i)\}_{i=1}^{n_u}$. Next we solve $K$ pre-training tasks with two-layer CNN models $\{f_{\mathbf{W}_k}\}_{k=1}^{K}$ defined in (3.1), using $\{S_k\}_{k=1}^{K}$ respectively. Note that our result covers $K = 1$ as a special case, where there is only one pseudo-labeler.

We consider learning the model parameter $\mathbf{W}_k$ by optimizing the empirical loss over both the pseudo-labeled dataset $S_k$ and the labeled dataset $S' = \{(\mathbf{x}'_i, y'_i)\}_{i=1}^{n_l}$ with weight decay regularization:

$$L_{S_k \cup S'}(\mathbf{W}_k) = \frac{1}{n_u + n_l} \Big[ \sum_{i=1}^{n_u} L_i(\mathbf{W}_k) + \sum_{i'=1}^{n_l} L_{i'}(\mathbf{W}_k) \Big] + \frac{\lambda}{2} \|\mathbf{W}_k\|_F^2,$$

where $\lambda \geq 0$ is the regularization parameter, $L_i(\mathbf{W}_k) = \ell(y_{k,i} \cdot f_{\mathbf{W}_k}(\mathbf{x}_i))$ denotes the individual loss for the pseudo-labeled data $(\mathbf{x}_i, y_{k,i})$, and $L_{i'}(\mathbf{W}_k) = \ell(y'_{i'} \cdot f_{\mathbf{W}_k}(\mathbf{x}'_{i'}))$ denotes the individual loss for the labeled data $(\mathbf{x}'_{i'}, y'_{i'})$. Our result covers $n_l = 0$, which corresponds to the case where no labeled data is used during pre-training. In this case, our semi-supervised learning framework reduces to a special kind of pretext-based self-supervised learning, where the pretext tasks are generated by pseudo-labelers.

We use gradient descent to minimize the regularized loss function $L_{S_k \cup S'}(\mathbf{W}_k)$. Starting from the initialization $\mathbf{W}_k^{(0)} := \{\mathbf{w}_{k,j}^{(0)}, j \in [2m]\}$, the gradient descent update rule is

$$\mathbf{w}_{k,j}^{(t+1)} = \mathbf{w}_{k,j}^{(t)} - \eta \cdot \nabla_{\mathbf{w}_{k,j}} L_{S_k \cup S'}(\mathbf{W}_k^{(t)}), \qquad \mathbf{w}_{k,j}^{(0)} \sim N(\mathbf{0}, \sigma_0^2 \mathbf{I}_d), \quad j \in [2m], \; k \in [K],$$

where $\eta > 0$ is the learning rate and $\sigma_0$ defines the scale of random initialization.

• Downstream Task: Linear Model. The semi-supervised pre-training gives us $K$ CNN models with parameters $\{\mathbf{W}^*_k\}_{k=1}^{K}$. 
Based on them, for the downstream task, we consider a linear model $g_{\mathbf{a}}(\mathbf{x}) = \sum_{k=1}^{K} a_k f_{\mathbf{W}^*_k}(\mathbf{x})$, where $a_k \in \mathbb{R}$ denotes the trainable weight for the $k$-th pre-trained model. Then, given $\{f_{\mathbf{W}^*_k}\}_{k=1}^{K}$ and labeled training data $S' = \{(\mathbf{x}'_i, y'_i)\}_{i=1}^{n}$, we consider learning the downstream linear model parameter $\mathbf{a}$ by optimizing the following empirical loss:

$$L_{S'}(\mathbf{a}) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y'_i \cdot g_{\mathbf{a}}(\mathbf{x}'_i)\big).$$

We initialize $\mathbf{a}$ as the all-zero vector and optimize the empirical loss by gradient descent with learning rate $\eta$, i.e., $\mathbf{a}^{(t+1)} = \mathbf{a}^{(t)} - \eta \cdot \nabla_{\mathbf{a}} L_{S'}(\mathbf{a}^{(t)})$, $\mathbf{a}^{(0)} = \mathbf{0}$.
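
The linear probing step can be sketched as plain gradient descent on the logistic loss over the outputs of the frozen pre-trained models. The function name `linear_probe` and the convention that `features[i, k]` stores the output of the k-th pre-trained model on the i-th labeled example are assumptions made for this illustration.

```python
import numpy as np

def linear_probe(features, y, eta=0.1, steps=100):
    """Gradient descent on a for g_a(x) = sum_k a_k f_k(x), starting from a = 0.

    features[i, k] is the k-th pre-trained model's output on labeled example i.
    """
    n, K = features.shape
    a = np.zeros(K)                                   # all-zero initialization
    for _ in range(steps):
        margin = y * (features @ a)                   # y_i * g_a(x_i)
        c = np.exp(-np.logaddexp(0.0, margin))        # -l'(margin)
        grad = -(features * (c * y)[:, None]).mean(axis=0)
        a = a - eta * grad
    return a
```

Only the K entries of a are trained here; the pre-trained filters stay fixed, which is why a constant number of labeled examples can suffice once the features already align with the label.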

4. MAIN RESULTS

In this section, we present the main theoretical results in this paper. We start with a condition that is required by our analysis.

Condition 4.1. The strength of the signal is $\|\mathbf{v}\|_2^2 = \Theta(d)$, the noise standard deviation is $\sigma_p = \Theta(d^{\epsilon})$, where $0 < \epsilon < 1/8$ is a small constant, and the width of the network satisfies $m = \mathrm{polylog}(d)$. We also assume that the size of the unlabeled dataset is $n_u = \Omega(d^{4\epsilon})$ and the size of the labeled dataset is $n_l = \Theta(1)$. For both the supervised learning and semi-supervised learning settings, we initialize the weights with $\sigma_0 = \Theta(d^{-3/4})$. For semi-supervised learning, we require $\lambda = o(d^{3/4})$ and assume that there exists a constant $C$ such that the test accuracy of every pseudo-labeler satisfies $p_k > 1/2 + C$.

Since we generate the noise patch from the Gaussian distribution, the strength of the noise patch is $\|\boldsymbol{\xi}\|_2^2 \approx \sigma_p^2 d = \Theta(d^{1+2\epsilon})$ by standard concentration inequalities, which is larger than the strength of the signal patch $\|\mathbf{v}\|_2^2 = \Theta(d)$. Therefore, Condition 4.1 defines a setting with large noise. The condition $d \gg n_u \gg n_l$ further ensures that learning is in a sufficiently over-parameterized setting. Here we only require the neural network width $m$ to be polylogarithmic in the dimension $d$ and require the pseudo-labelers to perform better than a random guess.

Theorem 4.2 (Semi-supervised Learning: Pre-training). Let $k \in [K]$ and consider the semi-supervised pre-training of $f_{\mathbf{W}_k}(\mathbf{x})$. For any test data point $(\mathbf{x}, y)$, denote $y = f_{\mathbf{w}_k}(\mathbf{x})$. Then under Condition 4.1, after $T_0 = \Theta(d^{-3/4} \eta^{-1})$ training iterations with learning rate $\eta = O(d^{-1.1})$, the trained neural network can achieve nearly zero test error on the distribution $\mathcal{D}$:

$$\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}\big[y \cdot f_{\mathbf{W}_k^{(T_0)}}(\mathbf{x}) \leq 0\big] = o(1).$$

Theorem 4.2 characterizes the prediction power of the feature representation learned in the pre-trained models using unlabeled data. For any test data point $(\mathbf{x}, y)$, the sign of $y$ can be predicted based on $f_{\mathbf{W}^{(T_0)}}(\mathbf{x})$ with high probability. 
1), the obtained $\mathbf{a}^{(T')}$ satisfies:

• Training error is 0: $\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \cdot g_{\mathbf{a}^{(T')}}(\mathbf{x}_i) \leq 0] = 0$.

• Test error and loss are nearly 0: $\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}[y \cdot g_{\mathbf{a}^{(T')}}(\mathbf{x}) \leq 0] = o(1)$, $L_{\mathcal{D}}(\mathbf{a}^{(T')}) = o(1)$.

Theorem 4.3 shows that the feature representation learned through semi-supervised pre-training can ensure small training and test errors for the supervised downstream task. Notably, this result holds even though we assume that there is only a constant number of labeled data. This shows that semi-supervised learning can significantly reduce the need for a large labeled training dataset. For comparison, we also have the following guarantees on the performance of standard supervised learning of CNNs.

Theorem 4.4 (Supervised Learning). Under the supervised learning setting, run gradient descent for $T = \Theta(d^{(1/4 - \epsilon)q - 3/2} \eta^{-1})$ iterations with learning rate $\eta = O(d^{-1-2\epsilon})$. Then there exists $t \leq T$ such that, with probability $1 - o(1)$, the CNN defined in (3.1) with parameter $\mathbf{W}^{(t)}$ satisfies:

• Training loss is nearly zero: $L_{S'}(\mathbf{W}^{(t)}) = o(1)$.

• Test loss is high: $L_{\mathcal{D}}(\mathbf{W}^{(t)}) = \Theta(1)$.

Theorem 4.4 shows that although standard supervised learning can train a CNN model with nearly zero training loss, the obtained CNN model generalizes poorly to test data. Comparing Theorem 4.4 with Theorem 4.3 shows that the generalization behaviors of semi-supervised learning and supervised learning are largely different. The reason behind this difference is that pre-training, with a relatively large number of unlabeled training data, helps learn a feature representation that captures the feature $\mathbf{v}$ in our data model, while direct application of supervised learning can only memorize the noises $\boldsymbol{\xi}'_i$, $i \in [n_l]$, in the training dataset, which are independent of the labels of the data.

A recent line of work (Oymak & Gulcu, 2021; Frei et al., 2022c) studies semi-supervised learning methods with pseudo-labelers. 
Our results differ from theirs in several aspects: (i) we consider learning with CNNs rather than a linear model, so the problem is highly non-convex with various local minima, which makes the optimization analysis more challenging; (ii) the Bayes-optimal predictor is no longer unique for CNNs, so we measure the quality of the learned features via the downstream task instead of making a comparison with the Bayes-optimal predictor; (iii) they can only deal with the case where the teacher network (pseudo-labeler) is the same as the student network (Frei et al., 2022c) or the case where the teacher network (pseudo-labeler) is at least as complex as the student network (Oymak & Gulcu, 2021). In contrast, our teacher network (pseudo-labeler) is not specified and can have any structure, such as a linear network. Therefore, we can handle the case where the student network is more complex than the teacher network, one of the most natural settings for semi-supervised learning with pseudo-labelers (Xie et al., 2020).

5. PROOF SKETCH

In this section, we present the proof sketch for the semi-supervised learning setting. The proof sketch for the supervised learning setting is given in the appendix.

Semi-supervised Pre-training. We consider learning $K$ functions $f_{\mathbf{W}_k}(\mathbf{x})$, $k \in [K]$, based on the pre-training. Since the learning process of these $K$ functions can be analyzed in exactly the same way, here we only focus on the learning of one of these functions. For simplicity of notation, we drop the subscript $k$ in the following proof sketch. Our study of the pre-training focuses on two aspects of the training process: feature learning and noise memorization. Specifically, we aim to monitor how the filters in the CNN model learn the feature vector $\mathbf{v}$ and the noise vectors $\boldsymbol{\xi}_i$. Therefore, we introduce the following notations:

$$\Lambda_1^{(t)} := \max_{1 \leq j \leq m} \langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle, \qquad \bar{\Lambda}_1^{(t)} := \max_{1 \leq j \leq m} -\langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle,$$
$$\Lambda_{-1}^{(t)} := \max_{m+1 \leq j \leq 2m} -\langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle, \qquad \bar{\Lambda}_{-1}^{(t)} := \max_{m+1 \leq j \leq 2m} \langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle,$$
$$\Gamma_i^{(t)} := \max_{1 \leq j \leq 2m} \langle \mathbf{w}_j^{(t)}, \boldsymbol{\xi}_i \rangle, \qquad \Gamma_i'^{(t)} := \max_{1 \leq j \leq 2m} \langle \mathbf{w}_j^{(t)}, \boldsymbol{\xi}'_i \rangle, \qquad \Gamma^{(t)} = \max\Big\{ \max_{i \in [n_u]} \Gamma_i^{(t)}, \; \max_{i \in [n_l]} \Gamma_i'^{(t)} \Big\}. \qquad (5.1)$$

Based on the above definitions, for $r \in \{\pm 1\}$, a larger $\Lambda_r^{(t)}$ implies better feature learning along the positive feature direction $\mathbf{v}$, while a larger $\bar{\Lambda}_r^{(t)}$ implies better feature learning along the negative feature direction $-\mathbf{v}$. Moreover, a larger $\Gamma^{(t)}$ implies a higher level of noise memorization. Based on the update rule of gradient descent, for the inner products $\langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle$ and $\langle \mathbf{w}_j^{(t)}, \boldsymbol{\xi}_l \rangle$ with $j \in [2m]$, $l \in [n_u]$, we can obtain the iterative equations in (A.1). With the help of the iterative equations and the definitions in (5.1), we can further show the following lemmas.

Lemma 5.1. Assume we use both unlabeled data with pseudo-labels generated by the pseudo-labeler and labeled data for the training of our CNN model. For $r \in \{\pm 1\}$, let $T_r$ be the first iteration at which $\Lambda_r^{(t)}$ reaches $\Theta(1/m)$. Then for $t \in [0, T_r]$, we have

$$\Lambda_r^{(t+1)} \geq (1 - \eta\lambda) \cdot \Lambda_r^{(t)} + \eta \cdot C \cdot \Theta(d) \cdot (\Lambda_r^{(t)})^{q-1}, \quad r \in \{\pm 1\},$$
$$\bar{\Lambda}_r^{(t+1)} \leq (1 - \eta\lambda) \cdot \bar{\Lambda}_r^{(t)}, \quad r \in \{\pm 1\},$$
$$\Gamma^{(t+1)} \leq (1 - \eta\lambda) \cdot \Gamma^{(t)} + \eta \cdot \Theta(d^{1-2\epsilon}) \cdot (\Gamma^{(t)})^{q-1},$$

where $C$ is defined in Condition 4.1.

Lemma 5.2. Assume we use only labeled data for the training of our CNN model. For $i \in [n_l]$, let $T'_i$ be the first iteration at which $\Gamma_i'^{(t)}$ reaches $\Theta(1/m)$. Then we have

$$\Lambda_r^{(t+1)} \leq (1 - \eta\lambda) \cdot \Lambda_r^{(t)} + \eta \cdot \Theta(d) \cdot \big[ (\Lambda_r^{(t)})^{q-1} + (\bar{\Lambda}_r^{(t)})^{q-1} \big], \quad r \in \{\pm 1\},$$
$$\bar{\Lambda}_r^{(t+1)} \leq (1 - \eta\lambda) \cdot \bar{\Lambda}_r^{(t)}, \quad r \in \{\pm 1\},$$
$$\Gamma_i'^{(t+1)} \geq (1 - \eta\lambda) \cdot \Gamma_i'^{(t)} + \eta \cdot \Theta(d^{1+2\epsilon}) \cdot (\Gamma_i'^{(t)})^{q-1}, \quad i \in [n_l],$$

for $t \in [0, T'_i]$.

Based on the results in Lemma 5.1, we can observe that if both pseudo-labeled and labeled data are used for training, the CNN will learn the positive direction of the feature vector $\mathbf{v}$, while barely fitting the negative direction of the feature vector or memorizing the noise. If only labeled data are used, the CNN will fit the noise faster than the feature, which can be seen from Lemma 5.2. Leveraging Lemmas 5.1 and 5.2, we can obtain the following Lemmas 5.3 and 5.4, which characterize the magnitude of feature learning and noise memorization.

Lemma 5.3. If both pseudo-labeled and labeled data are used to train the CNN, for $r \in \{\pm 1\}$, let $T_r$ be the first iteration at which $\Lambda_r^{(t)}$ reaches $\Theta(1/m)$. Let $T_0 = \max_{r \in \{\pm 1\}} \{T_r\}$. Then it holds that $\Lambda_r^{(T_0)} = \Theta(1)$, $\bar{\Lambda}_r^{(t)} = O(d^{-1/4})$ and $\Gamma^{(t)} = O(d^{-1/4+\epsilon})$ for all $t \in [0, T_0]$.

Lemma 5.4. If only labeled data are used to train the CNN, for $i \in [n_l]$, let $T'_i$ be the first iteration at which $\Gamma_i'^{(t)}$ reaches $\Theta(1/m)$. Let $T'_0 = \max_{i \in [n_l]} T'_i$. Then it holds that $\Lambda_r = O(d^{-1/4})$, $\bar{\Lambda}_r = O(d^{-1/4})$ for $r \in \{\pm 1\}$ and $\Gamma_i'^{(t)} = \Theta(1)$ for $i \in [n_l]$.

The above results highlight the difference between the two settings. 
To see why, consider a sequence $\{x_t\}$ with the iterative equation $x_{t+1} = x_t + \eta \cdot C_t x_t^{q-1}$. If we only use labeled data, as shown in Lemma 5.2, $\Gamma_i'^{(t)}$ has $C_t = \Theta(d^{1+2\epsilon})$ while $\Lambda_r^{(t)}$ has $C_t = \Theta(d)$, and therefore $\Gamma_i'^{(t)}$ increases faster than $\Lambda_r^{(t)}$. In contrast, if we use both labeled data and pseudo-labeled data, $C_t$ will be $\Theta(d^{1-2\epsilon})$ for $\Gamma_i'^{(t)}$ and $\Theta(d)$ for $\Lambda_r^{(t)}$, leading to a slower increase of $\Gamma_i'^{(t)}$.

Downstream task. After the pre-training, we have obtained $K$ CNN classifiers $\{f_{\mathbf{W}_k^{(T_0^k)}}\}_{k=1}^{K}$. Now we train the second-layer parameters $\mathbf{a}$ with the training data whose true labels are available. The following lemma shows that the $\ell_1$-norm of $\mathbf{a}$ increases at a logarithmic rate.

Lemma 5.5. For any learning rate $\eta = \Theta(1)$, we have $\|\mathbf{a}^{(t)}\|_1 = \log(t)/\Theta(1)$. For any labeled data $(\mathbf{x}'_i, y'_i) \in S'$, we have with high probability that $y'_i \cdot f_{\mathbf{W}^{(t)}}(\mathbf{x}'_i) = \|\mathbf{a}^{(t)}\|_1 \cdot \Theta(1)$. For any newly generated data $(\mathbf{x}, y) \sim \mathcal{D}$, we also have with high probability that $y \cdot f_{\mathbf{W}^{(t)}}(\mathbf{x}) = \|\mathbf{a}^{(t)}\|_1 \cdot \Theta(1)$.

With the help of the above lemma, and noting that the training error and test error are related to $y \cdot f_{\mathbf{W}^{(T_0)}}(\mathbf{x})$ while the test loss is related to $\|\mathbf{a}^{(T_0)}\|_1$, we can prove that after $T = \Theta(d^{0.1}/\eta)$ iterations with learning rate $\eta = \Theta(1)$, the model can achieve nearly zero training error, test error, training loss and test loss.
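
The effect of the coefficient $C_t$ on the growth of such a sequence can be illustrated numerically. The sketch below is a simplification under stated assumptions: all constants hidden in the $\Theta(\cdot)$ terms are set to 1, weight decay is dropped, the starting values $d^{-1/4}$ (feature) and $d^{-1/4+\epsilon}$ (noise) are taken from the scales in Lemma 5.3, and the values of `d`, `eps` and `eta` are illustrative only. The helper `steps_to_threshold` is a hypothetical name.

```python
def steps_to_threshold(x0, coeff, eta, q=3, threshold=1.0, max_steps=10**6):
    """Iterate x <- x + eta * coeff * x**(q-1); count steps until x >= threshold."""
    x, t = x0, 0
    while x < threshold and t < max_steps:
        x += eta * coeff * x ** (q - 1)
        t += 1
    return t

d, eps, eta = 10_000, 0.05, 1e-6
feat0 = d ** -0.25              # assumed initial scale of the feature inner product
noise0 = d ** (-0.25 + eps)     # assumed initial scale of the noise inner product

t_feature = steps_to_threshold(feat0, d, eta)                        # C_t ~ d
t_noise_sup = steps_to_threshold(noise0, d ** (1 + 2 * eps), eta)    # labeled data only
t_noise_semi = steps_to_threshold(noise0, d ** (1 - 2 * eps), eta)   # with pseudo-labels
print(t_noise_sup, t_feature, t_noise_semi)
```

Under these assumptions the noise sequence with coefficient $d^{1+2\epsilon}$ reaches the threshold before the feature sequence, while the noise sequence with coefficient $d^{1-2\epsilon}$ reaches it after, which mirrors the ordering described above.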

6. EXPERIMENTS

In this section, we perform numerical experiments on synthetic datasets, generated according to Definition 3.1, to verify our main theoretical results. The code and data for our experiments can be found on GitHub. In particular, we set the problem dimension $d = 10000$, labeled training sample size $n_l = 20$ (10 positive samples and 10 negative samples), pseudo-labeled training sample size $n_u = 20000$ (10000 positive samples and 10000 negative samples), feature vector $\mathbf{v}$ sampled from the distribution $N(\mathbf{0}, \mathbf{I})$ and noise vector sampled from the distribution $N(\mathbf{0}, \sigma_p^2 \mathbf{I})$, where $\sigma_p = 10 d^{0.01}$.

For the semi-supervised learning task, we have a linear pseudo-labeler with test error 0.196 ± 0.044. We use this classifier to generate pseudo-labels for the $n_u = 20000$ unlabeled samples. After that, for pre-training, we use these pseudo-labeled samples and the $n_l$ labeled samples together to train a CNN with network width $m = 20$, activation function $\sigma(z) = [z]_+^3$, regularization parameter $\lambda = 0.1$ and learning rate $\eta = 1 \times 10^{-4}$. Besides, we initialize the CNN parameters from $N(0, \sigma_0^2)$, where $\sigma_0 = 0.1 \times d^{-3/4}$. After 200 iterations, we obtain a CNN model with training error close to the error of the pseudo-labeler and zero test error, according to Table 1. For the downstream task, we use the $n_l$ labeled samples to train a linear probe. With learning rate $\eta = 0.1$ and after $T = 100$ iterations, we obtain a final model with low training and test loss as well as 100% training accuracy and test accuracy.

For the supervised learning task, we directly use the $n_l$ labeled data to train a CNN with network width $m = 20$, activation function $\sigma(z) = [z]_+^3$ and learning rate $\eta = 1 \times 10^{-4}$. After 200 iterations, we obtain a CNN with 0 training error and small training loss, but about 0.5 test error and high test loss, which indicates that supervised learning yields a model that performs no better than a random guess.

Table 1: Training error and loss, test error and loss for semi-supervised and supervised learning.
Test loss | 0.2200±0.0886 | 0.0182±0.0021 | 0.6931±0.0005

Moreover, for the synthetic data experiments, we also calculate the inner products $\max_{j \in [m]} \langle \mathbf{w}_j^{(t)}, \mathbf{v} \rangle$ and $\max_{j \in [2m]} \max\{ \max_{i \in [n_u]} \langle \mathbf{w}_j^{(t)}, \boldsymbol{\xi}_i \rangle, \max_{i \in [n_l]} \langle \mathbf{w}_j^{(t)}, \boldsymbol{\xi}'_i \rangle \}$, i.e., $\Lambda_1^{(t)}$ and $\Gamma^{(t)}$, representing feature learning and noise memorization respectively, to verify our key lemmas. The results are reported in Figure 3. It can be seen from Figure 3 that under the semi-supervised learning setting, feature learning dominates noise memorization even though the noise patch has a larger norm than the signal patch, while under the supervised learning setting, the algorithm entirely ignores the feature and fits the noise. This verifies Lemmas 5.3 and 5.4.
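
The two monitored quantities can be computed directly from the filter matrix. The following is a minimal sketch, assuming filters are stored as rows of a `(2m, d)` array and all noise vectors (from both the unlabeled and the labeled set) are stacked as rows of one array; the function name is hypothetical and this is not the authors' experiment code.

```python
import numpy as np

def feature_and_noise_stats(W, v, noise):
    """Return (Lambda_1, Gamma) for filters W (2m, d), feature v (d,), noise (n, d)."""
    m = W.shape[0] // 2
    lambda_1 = np.max(W[:m] @ v)        # max over j in [m] of <w_j, v>
    gamma = np.max(W @ noise.T)         # max over j in [2m] and all i of <w_j, xi_i>
    return lambda_1, gamma
```

Logging these two numbers at every iteration gives the kind of feature-learning and noise-memorization curves that the text attributes to Figure 3.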

7. CONCLUSION AND FUTURE WORK

In this paper, we study semi-supervised learning with pseudo-labelers and provide a theoretical understanding of its success. We show the advantage of semi-supervised learning over supervised learning through a case study. By considering a simple data model and a two-layer CNN, we present a comprehensive analysis of the training procedure from a beyond-NTK feature learning perspective. We prove that the final classifier obtained by semi-supervised learning can achieve near-zero test loss and error with only a small amount of labeled training data, while its supervised counterpart fails to achieve the same performance with the same amount of labeled data. In the current paper, we only focus on the simplest possible data and neural network models to study semi-supervised learning. For example, the second layer of the CNN is fixed during training. What if the second layer is trainable? In addition, the stride is the same as the filter size in the current CNN, whereas it is reasonable to have a stride smaller than the filter size. It would also be interesting to consider linearly non-separable data (Shi et al., 2022; Frei et al., 2022b; Damian et al., 2022) and the ReLU activation function with the help of pre-activation noise (Allen-Zhu & Li, 2020a). We leave these extensions as future work.

A PROOF FOR SEMI-SUPERVISED LEARNING SETTING

We consider learning $K$ functions $f_{W_k}(x)$, $k \in [K]$, based on the pre-training. Since the learning process of these $K$ functions can be analyzed in exactly the same way, here we only focus on the learning of one of these functions. For simplicity of notation, we drop the subscript $k$ in the proofs of Sections A.1 to A.7.

A.1 GRADIENT CALCULATION

Lemma A.1 (Gradient Calculation). The gradient of the loss function $L_{S \cup S'}(W)$ with respect to the weight parameter $w_j$ is

$$\nabla_{w_j} L_{S \cup S'}(W) = -\frac{q}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} c_i \hat{y}_i \big( [\langle w_j, y_i v \rangle]_+^{q-1} \, y_i v + [\langle w_j, \xi_i \rangle]_+^{q-1} \, \xi_i \big) + \sum_{i=1}^{n_l} b_i y'_i \big( [\langle w_j, y'_i v \rangle]_+^{q-1} \, y'_i v + [\langle w_j, \xi'_i \rangle]_+^{q-1} \, \xi'_i \big) \Big] + \lambda w_j$$

for $1 \le j \le m$, and

$$\nabla_{w_j} L_{S \cup S'}(W) = \frac{q}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} c_i \hat{y}_i \big( [\langle w_j, y_i v \rangle]_+^{q-1} \, y_i v + [\langle w_j, \xi_i \rangle]_+^{q-1} \, \xi_i \big) + \sum_{i=1}^{n_l} b_i y'_i \big( [\langle w_j, y'_i v \rangle]_+^{q-1} \, y'_i v + [\langle w_j, \xi'_i \rangle]_+^{q-1} \, \xi'_i \big) \Big] + \lambda w_j$$

for $m + 1 \le j \le 2m$, where $c_i$ denotes $-\ell'(\hat{y}_i f_W(x_i)) = \exp[-\hat{y}_i f_W(x_i)] / (1 + \exp[-\hat{y}_i f_W(x_i)])$ and $b_i$ denotes $-\ell'(y'_i f_W(x'_i)) = \exp[-y'_i f_W(x'_i)] / (1 + \exp[-y'_i f_W(x'_i)])$. Here $\hat{y}_i$ is the pseudo-label of the unlabeled sample $x_i$ and $y_i$ is its true label.

Proof of Lemma A.1.
When $1 \le j \le m$,

$$\nabla_{w_j} \ell(\hat{y}_i f_W(x_i)) = \ell'(\hat{y}_i f_W(x_i)) \, \hat{y}_i \, \nabla_{w_j} f_W(x_i) = -c_i \hat{y}_i \, \nabla_{w_j} f_W(x_i) = -c_i \hat{y}_i \big( \sigma'(\langle w_j, y_i v \rangle) \, y_i v + \sigma'(\langle w_j, \xi_i \rangle) \, \xi_i \big) = -q c_i \hat{y}_i \big( [\langle w_j, y_i v \rangle]_+^{q-1} \, y_i v + [\langle w_j, \xi_i \rangle]_+^{q-1} \, \xi_i \big),$$

$$\nabla_{w_j} \ell(y'_i f_W(x'_i)) = \ell'(y'_i f_W(x'_i)) \, y'_i \, \nabla_{w_j} f_W(x'_i) = -b_i y'_i \, \nabla_{w_j} f_W(x'_i) = -b_i y'_i \big( \sigma'(\langle w_j, y'_i v \rangle) \, y'_i v + \sigma'(\langle w_j, \xi'_i \rangle) \, \xi'_i \big) = -q b_i y'_i \big( [\langle w_j, y'_i v \rangle]_+^{q-1} \, y'_i v + [\langle w_j, \xi'_i \rangle]_+^{q-1} \, \xi'_i \big),$$

and when $m + 1 \le j \le 2m$,

$$\nabla_{w_j} \ell(\hat{y}_i f_W(x_i)) = q c_i \hat{y}_i \big( [\langle w_j, y_i v \rangle]_+^{q-1} \, y_i v + [\langle w_j, \xi_i \rangle]_+^{q-1} \, \xi_i \big),$$

$$\nabla_{w_j} \ell(y'_i f_W(x'_i)) = q b_i y'_i \big( [\langle w_j, y'_i v \rangle]_+^{q-1} \, y'_i v + [\langle w_j, \xi'_i \rangle]_+^{q-1} \, \xi'_i \big).$$

Noting that

$$\nabla_{w_j} L_{S \cup S'}(W) = \frac{1}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \nabla_{w_j} \ell(\hat{y}_i f_W(x_i)) + \sum_{i=1}^{n_l} \nabla_{w_j} \ell(y'_i f_W(x'_i)) \Big] + \lambda w_j,$$

we have proved the lemma.
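As a sanity check on the closed-form gradient, the per-sample expression can be compared against central finite differences. The sketch below is illustrative only: it uses a single pseudo-labeled sample, toy dimensions, and assumes the regularizer is $\frac{\lambda}{2} \|W\|_F^2$ (which is what a gradient term $\lambda w_j$ corresponds to); the function names are hypothetical.

```python
import math
import random

Q = 3  # activation power: sigma(z) = max(z, 0) ** Q

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def f(W, patches):
    """Network output: filters j < m count with sign +1, the rest with sign -1."""
    m = len(W) // 2
    return sum((1.0 if j < m else -1.0) * max(dot(w, p), 0.0) ** Q
               for j, w in enumerate(W) for p in patches)

def loss(W, patches, label, lam):
    reg = 0.5 * lam * sum(dot(w, w) for w in W)  # assumed form of the regularizer
    return math.log1p(math.exp(-label * f(W, patches))) + reg

def grad_closed_form(W, j, patches, label, lam):
    """Single-sample gradient w.r.t. w_j in the form given by the lemma."""
    u = 1.0 if j < len(W) // 2 else -1.0
    c = 1.0 / (1.0 + math.exp(label * f(W, patches)))  # c = -l'(label * f)
    g = [lam * wk for wk in W[j]]
    for p in patches:
        scale = u * Q * c * label * max(dot(W[j], p), 0.0) ** (Q - 1)
        g = [gk - scale * pk for gk, pk in zip(g, p)]
    return g

def grad_numeric(W, j, patches, label, lam, h=1e-6):
    """Central finite differences on the coordinates of w_j."""
    g = []
    for k in range(len(W[j])):
        plus = [list(w) for w in W]
        minus = [list(w) for w in W]
        plus[j][k] += h
        minus[j][k] -= h
        g.append((loss(plus, patches, label, lam)
                  - loss(minus, patches, label, lam)) / (2 * h))
    return g

rng = random.Random(1)
d, m, lam = 5, 2, 0.1
v = [rng.gauss(0.0, 1.0) for _ in range(d)]
xi = [rng.gauss(0.0, 1.0) for _ in range(d)]
y_true, y_pseudo = 1, -1           # a sample whose pseudo-label is flipped
patches = ([y_true * a for a in v], xi)
W = [[rng.gauss(0.0, 0.3) for _ in range(d)] for _ in range(2 * m)]
max_err = max(abs(a - b)
              for j in range(2 * m)
              for a, b in zip(grad_closed_form(W, j, patches, y_pseudo, lam),
                              grad_numeric(W, j, patches, y_pseudo, lam)))
print("max abs difference:", max_err)
```

The example deliberately uses a flipped pseudo-label to exercise the case where the label in the loss differs from the sign of the signal patch.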

A.2 INNER PRODUCT UPDATE RULE CALCULATION

When the model is trained by gradient descent, the update rule can be formulated as

$$w_j^{(t+1)} = w_j^{(t)} - \eta \cdot \nabla_{w_j} L_{S \cup S'}(W^{(t)}), \quad j \in [2m]. \tag{A.1}$$

We study the entire training process from two perspectives: feature learning and noise memorization. Mathematically, we focus on two quantities: $\langle w_j^{(t)}, v \rangle$ and $\langle w_j^{(t)}, \xi_l \rangle$. We have the following lemma for the inner product update rule.

Lemma A.2 (Inner Product Update Rule). The feature learning and noise memorization performance of gradient descent can be formulated as

$$\langle w_j^{(t+1)}, v \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, v \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} [\langle w_j^{(t)}, y_i v \rangle]_+^{q-1} \|v\|_2^2 + \sum_{i=1}^{n_l} b_i^{(t)} [\langle w_j^{(t)}, y'_i v \rangle]_+^{q-1} \|v\|_2^2 \Big],$$

$$\langle w_j^{(t+1)}, \xi_l \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, \xi_l \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i c_i^{(t)} [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, \xi_l \rangle + \sum_{i=1}^{n_l} y'_i b_i^{(t)} [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, \xi_l \rangle \Big],$$

$$\langle w_j^{(t+1)}, \xi'_l \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, \xi'_l \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i c_i^{(t)} [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, \xi'_l \rangle + \sum_{i=1}^{n_l} y'_i b_i^{(t)} [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, \xi'_l \rangle \Big],$$

where $j \in [2m]$, $l \in [n_u]$ and $u_j := \mathbb{1}_{[1 \le j \le m]} - \mathbb{1}_{[m+1 \le j \le 2m]}$.

Proof of Lemma A.2.
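According to Lemma A.1 and the gradient descent update rule (A.1), we have

$$w_j^{(t+1)} = (1 - \eta\lambda) w_j^{(t)} + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} c_i^{(t)} \hat{y}_i \big( [\langle w_j^{(t)}, y_i v \rangle]_+^{q-1} \, y_i v + [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \, \xi_i \big) + \sum_{i=1}^{n_l} b_i^{(t)} y'_i \big( [\langle w_j^{(t)}, y'_i v \rangle]_+^{q-1} \, y'_i v + [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \, \xi'_i \big) \Big].$$

Taking the inner product with the feature vector $v$ and with the noise patches $\xi_l$ and $\xi'_l$, and noting that $v$ is orthogonal to every noise patch according to the data model, we have

$$\langle w_j^{(t+1)}, v \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, v \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} c_i^{(t)} \hat{y}_i \big( [\langle w_j^{(t)}, y_i v \rangle]_+^{q-1} \, y_i \|v\|_2^2 + [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, v \rangle \big) + \sum_{i=1}^{n_l} b_i^{(t)} y'_i \big( [\langle w_j^{(t)}, y'_i v \rangle]_+^{q-1} \, y'_i \|v\|_2^2 + [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, v \rangle \big) \Big]$$
$$= (1 - \eta\lambda) \langle w_j^{(t)}, v \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} [\langle w_j^{(t)}, y_i v \rangle]_+^{q-1} \|v\|_2^2 + \sum_{i=1}^{n_l} b_i^{(t)} [\langle w_j^{(t)}, y'_i v \rangle]_+^{q-1} \|v\|_2^2 \Big],$$

$$\langle w_j^{(t+1)}, \xi_l \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, \xi_l \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} c_i^{(t)} \hat{y}_i \big( [\langle w_j^{(t)}, y_i v \rangle]_+^{q-1} \, y_i \langle v, \xi_l \rangle + [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, \xi_l \rangle \big) + \sum_{i=1}^{n_l} b_i^{(t)} y'_i \big( [\langle w_j^{(t)}, y'_i v \rangle]_+^{q-1} \, y'_i \langle v, \xi_l \rangle + [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, \xi_l \rangle \big) \Big]$$
$$= (1 - \eta\lambda) \langle w_j^{(t)}, \xi_l \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i c_i^{(t)} [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, \xi_l \rangle + \sum_{i=1}^{n_l} y'_i b_i^{(t)} [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, \xi_l \rangle \Big],$$

$$\langle w_j^{(t+1)}, \xi'_l \rangle = (1 - \eta\lambda) \langle w_j^{(t)}, \xi'_l \rangle + \frac{q \eta u_j}{n_l + n_u} \Big[ \sum_{i=1}^{n_u} \hat{y}_i c_i^{(t)} [\langle w_j^{(t)}, \xi_i \rangle]_+^{q-1} \langle \xi_i, \xi'_l \rangle + \sum_{i=1}^{n_l} y'_i b_i^{(t)} [\langle w_j^{(t)}, \xi'_i \rangle]_+^{q-1} \langle \xi'_i, \xi'_l \rangle \Big],$$

which completes the proof.

A.3 ESTIMATE $\Lambda_r^{(0)}$, $\bar{\Lambda}_r^{(0)}$, $\Gamma_i^{(0)}$, $\Gamma_i'^{(0)}$

Let

$$\Lambda_1^{(t)} = \max_{1 \le j \le m} \langle w_j^{(t)}, v \rangle, \quad \Lambda_{-1}^{(t)} = \max_{m+1 \le j \le 2m} -\langle w_j^{(t)}, v \rangle, \quad \bar{\Lambda}_1^{(t)} = \max_{m+1 \le j \le 2m} \langle w_j^{(t)}, v \rangle, \quad \bar{\Lambda}_{-1}^{(t)} = \max_{1 \le j \le m} -\langle w_j^{(t)}, v \rangle,$$

which characterize the feature learning aspect of the training process. An easy way to distinguish between $\Lambda_r^{(t)}$ and $\bar{\Lambda}_r^{(t)}$ is that $\Lambda_r^{(t)}$ should be large while $\bar{\Lambda}_r^{(t)}$ should be small.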
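Let

$$\Gamma_i^{(t)} = \max_{1 \le j \le 2m} \langle w_j^{(t)}, \xi_i \rangle, \; i \in [n_u], \qquad \Gamma_i'^{(t)} = \max_{1 \le j \le 2m} \langle w_j^{(t)}, \xi'_i \rangle, \; i \in [n_l],$$

which characterize the noise memorization aspect of the training process with respect to a particular sample. Let

$$\Gamma^{(t)} = \max\Big\{ \max_{i \in [n_u]} \Gamma_i^{(t)}, \; \max_{i \in [n_l]} \Gamma_i'^{(t)} \Big\},$$

which characterizes the noise memorization aspect of the training process regardless of the sample index. We first provide the concentration inequality for $\Lambda_r^{(0)}$ and $\bar{\Lambda}_r^{(0)}$ in the following lemma.

Lemma A.3. With probability at least $1 - 4\delta$ with respect to the randomness of the initialization of $w$, we have

$$\big| \Lambda_r^{(0)} - \mathbb{E}[\Lambda_r^{(0)}] \big| < \sqrt{8 \log \tfrac{1}{\delta}} \; \sigma_0 \|v\|_2, \qquad \big| \bar{\Lambda}_r^{(0)} - \mathbb{E}[\bar{\Lambda}_r^{(0)}] \big| < \sqrt{8 \log \tfrac{1}{\delta}} \; \sigma_0 \|v\|_2,$$

and

$$\mathbb{E}[\Lambda_r^{(0)}] \asymp \sqrt{\log(m)} \, \sigma_0 \|v\|_2, \qquad \mathbb{E}[\bar{\Lambda}_r^{(0)}] \asymp \sqrt{\log(m)} \, \sigma_0 \|v\|_2, \qquad r \in \{\pm 1\}.$$

Proof of Lemma A.3. Note that $\Lambda_1^{(0)} = \max_{1 \le j \le m} \langle w_j^{(0)}, v \rangle$, $\Lambda_{-1}^{(0)} = \max_{m+1 \le j \le 2m} -\langle w_j^{(0)}, v \rangle$, $\bar{\Lambda}_1^{(0)} = \max_{m+1 \le j \le 2m} \langle w_j^{(0)}, v \rangle$ and $\bar{\Lambda}_{-1}^{(0)} = \max_{1 \le j \le m} -\langle w_j^{(0)}, v \rangle$, where $w \sim N(0, \sigma_0^2 I)$ and $v$ is a fixed vector. Therefore, $\langle w_j^{(0)}, v \rangle \sim N(0, \sigma_0^2 \|v\|_2^2)$ and $-\langle w_j^{(0)}, v \rangle \sim N(0, \sigma_0^2 \|v\|_2^2)$ for all $1 \le j \le 2m$, and $\Lambda_r^{(0)}$, $\bar{\Lambda}_r^{(0)}$, $r \in \{\pm 1\}$, are identically distributed. Without loss of generality, we therefore only need to discuss the concentration of $\Lambda_1^{(0)}$. By applying Lemma C.1, we have

$$\mathbb{P}\big( \big| \Lambda_1^{(0)} - \mathbb{E}[\Lambda_1^{(0)}] \big| > t \big) \le 2 \exp\Big( -\frac{t^2}{2 \sigma_0^2 \|v\|_2^2} \Big).$$

By applying Lemma C.2, we have $\mathbb{E}[\Lambda_1^{(0)}] \asymp \sqrt{\log(m)} \, \sigma_0 \|v\|_2$, which completes the proof.

Then we provide a concentration inequality for $\Gamma_i^{(0)}$ in the following lemma.

Lemma A.4. Suppose that $d \ge \Omega(\log(m(n_u + n_l)/\delta))$ and $m = \Omega(\log(1/\delta))$. Then with probability at least $1 - \delta$,

$$\frac{\sigma_0 \sigma_p \sqrt{d}}{4} \le \Gamma_i^{(0)} \le 2 \sqrt{\log(16 m (n_u + n_l)/\delta)} \cdot \sigma_0 \sigma_p \sqrt{d} \quad \text{for all } i \in [n_u],$$

$$\frac{\sigma_0 \sigma_p \sqrt{d}}{4} \le \Gamma_i'^{(0)} \le 2 \sqrt{\log(16 m (n_u + n_l)/\delta)} \cdot \sigma_0 \sigma_p \sqrt{d} \quad \text{for all } i \in [n_l].$$

Proof of Lemma A.4. By Lemma C.3, with probability at least $1 - \delta/4$,

$$\sigma_p \sqrt{d} / \sqrt{2} \le \|\xi_i\|_2 \le \sqrt{3/2} \cdot \sigma_p \sqrt{d} \;\; \text{for } i \in [n_u], \qquad \sigma_p \sqrt{d} / \sqrt{2} \le \|\xi'_i\|_2 \le \sqrt{3/2} \cdot \sigma_p \sqrt{d} \;\; \text{for } i \in [n_l]. \tag{A.2}$$

Therefore, by the Gaussian tail bound and a union bound, with probability at least $1 - \delta/4$,

$$\langle w_j^{(0)}, \xi_i \rangle \le |\langle w_j^{(0)}, \xi_i \rangle| \le \sqrt{2 \log(8m/\delta)} \cdot \sigma_0 \|\xi_i\|_2 \;\; \text{for } i \in [n_u], \qquad \langle w_j^{(0)}, \xi'_i \rangle \le |\langle w_j^{(0)}, \xi'_i \rangle| \le \sqrt{2 \log(8m/\delta)} \cdot \sigma_0 \|\xi'_i\|_2 \;\; \text{for } i \in [n_l]. \tag{A.3}$$

Note that $\mathbb{P}\big( \sigma_0 \sigma_p \sqrt{d}/4 > \langle w_j^{(0)}, \xi_i \rangle \big)$ is an absolute constant, and therefore by the condition on $m$ we have

$$\mathbb{P}\Big( \frac{\sigma_0 \sigma_p \sqrt{d}}{4} \le \Gamma_i^{(0)} \Big) = \mathbb{P}\Big( \frac{\sigma_0 \sigma_p \sqrt{d}}{4} \le \max_{j \in [2m]} \langle w_j^{(0)}, \xi_i \rangle \Big) = 1 - \mathbb{P}\Big( \frac{\sigma_0 \sigma_p \sqrt{d}}{4} > \max_{j \in [2m]} \langle w_j^{(0)}, \xi_i \rangle \Big) = 1 - \mathbb{P}\Big( \frac{\sigma_0 \sigma_p \sqrt{d}}{4} > \langle w_j^{(0)}, \xi_i \rangle \Big)^{2m} \ge 1 - \frac{\delta}{4},$$

$$\mathbb{P}\Big( \frac{\sigma_0 \sigma_p \sqrt{d}}{4} \le \Gamma_i'^{(0)} \Big) \ge 1 - \frac{\delta}{4}.$$

On the other hand, according to (A.2) and (A.3), we have

$$\mathbb{P}\Big( \Gamma_i^{(0)} \le 2 \sqrt{\log(16 m (n_u + n_l)/\delta)} \cdot \sigma_0 \sigma_p \sqrt{d} \Big) = \mathbb{P}\Big( \max_{j \in [2m]} \langle w_j^{(0)}, \xi_i \rangle \le 2 \sqrt{\log(16 m (n_u + n_l)/\delta)} \cdot \sigma_0 \sigma_p \sqrt{d} \Big) \ge 1 - \frac{\delta}{4},$$

$$\mathbb{P}\Big( \Gamma_i'^{(0)} \le 2 \sqrt{\log(16 m (n_u + n_l)/\delta)} \cdot \sigma_0 \sigma_p \sqrt{d} \Big) \ge 1 - \frac{\delta}{4},$$

which completes the proof.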
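The initialization scale of the feature-learning term can be checked by simulation. The sketch below is a hedged illustration rather than part of the proof: it uses the fact stated above that each $\langle w_j^{(0)}, v \rangle$ is a centered Gaussian with standard deviation $\sigma_0 \|v\|_2$, samples that scalar directly, and compares the empirical mean of the maximum over $m$ filters with the standard upper bound $\sqrt{2 \log m}$ times the standard deviation. The variable `scale` is a stand-in for $\sigma_0 \|v\|_2$.

```python
import math
import random

def max_of_gaussians(m, scale, rng):
    """One draw of max_j <w_j, v>, with each <w_j, v> ~ N(0, scale^2) independently."""
    return max(rng.gauss(0.0, scale) for _ in range(m))

rng = random.Random(0)
m, trials = 20, 2000
scale = 1.0   # stands for sigma_0 * ||v||_2; the maximum scales linearly in it
draws = [max_of_gaussians(m, scale, rng) for _ in range(trials)]
mean_max = sum(draws) / trials
upper = scale * math.sqrt(2 * math.log(m))
print(f"empirical mean of the maximum: {mean_max:.3f}")
print(f"sqrt(2 log m) * scale:         {upper:.3f}")
```

For moderate $m$ the empirical mean sits below $\sqrt{2 \log m}$ but within a constant factor of it, which is the sense in which the expected maximum is of order $\sqrt{\log m}$ times the per-filter standard deviation.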

A.4 STAGE I OF GD: ON-DIAGONAL FEATURE LEARNING

In this stage, $\Lambda_1^{(t)}$ and $\Lambda_{-1}^{(t)}$ increase to magnitude $\Theta(1/m)$, while $\bar{\Lambda}_1^{(t)}$, $\bar{\Lambda}_{-1}^{(t)}$ and $\Gamma_i^{(t)}$ remain small, at the same magnitude as at initialization. In order to characterize the behaviour of feature learning and noise memorization during Stage I, we decompose the analysis into the following three parts:

1. First, in Lemma A.9, we provide a lower bound on the update rule of the on-diagonal feature learning terms $\Lambda_1^{(t)}$, $\Lambda_{-1}^{(t)}$ to lower-bound their increasing speed, and an upper bound on the off-diagonal feature learning terms $\bar{\Lambda}_1^{(t)}$, $\bar{\Lambda}_{-1}^{(t)}$ to show that they decrease.
2. Second, in Lemma A.11, we provide an upper bound on the update rule of the noise memorization term $\Gamma^{(t)}$ to upper-bound its increasing speed.
3. Third, we provide a useful lemma, derived from Claim C.20 in Allen-Zhu & Li (2020b), which is called the tensor power method. By applying the tensor power method, we will prove that:
   • When $\Lambda_1^{(t)}$ reaches $\Theta(1/m)$ at $T_1$, $\bar{\Lambda}_1^{(t)}$ and $\Gamma^{(t)}$ remain at a magnitude no larger than at initialization.
   • When $\Lambda_{-1}^{(t)}$ reaches $\Theta(1/m)$ at $T_{-1}$, $\bar{\Lambda}_{-1}^{(t)}$ and $\Gamma^{(t)}$ remain at a magnitude no larger than at initialization.
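The race between feature learning and noise memorization in Stage I can be illustrated with a toy scalar simulation of two polynomial recursions of the form $x_{t+1} = x_t + \eta C x_t^{q-1}$. This is only a sketch under strong assumptions: all $\Theta(\cdot)$ constants are set to 1, weight decay is dropped, the threshold is taken to be 1 rather than $\Theta(1/m)$, and $\epsilon = 0.1$ is an illustrative value. The signal recursion uses coefficient $d$ and starts at $d^{-1/4}$; the noise recursion uses coefficient $d^{1-2\epsilon}$ and starts at $d^{-1/4+\epsilon}$. The helper names are hypothetical.

```python
def first_hit(x0, coef, eta, threshold, q=3, max_steps=10**6):
    """Iterate x <- x + eta * coef * x**(q-1); return the first step with x >= threshold."""
    x, t = x0, 0
    while x < threshold and t < max_steps:
        x += eta * coef * x ** (q - 1)
        t += 1
    return t

def run(x0, coef, eta, steps, q=3):
    """Run the same recursion for a fixed number of steps and return the final value."""
    x = x0
    for _ in range(steps):
        x += eta * coef * x ** (q - 1)
    return x

d, eps, eta = 10**4, 0.1, 1e-6     # illustrative values, not the paper's settings
signal0 = d ** -0.25               # initialization scale of the signal term
noise0 = d ** (-0.25 + eps)        # initialization scale of the noise term
T = first_hit(signal0, float(d), eta, threshold=1.0)
noise_T = run(noise0, d ** (1 - 2 * eps), eta, T)
print("steps until the signal term reaches the threshold:", T)
print("noise growth factor over the same steps:", noise_T / noise0)
```

In this toy run the signal term grows by a factor of ten while the noise term stays within a small constant factor of its starting value, which is the qualitative picture the tensor power method makes rigorous.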

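A.4.1 UPPER BOUND AND LOWER BOUND FOR $\Lambda_1^{(t)}$, $\Lambda_{-1}^{(t)}$ AND $\bar{\Lambda}_1^{(t)}$, $\bar{\Lambda}_{-1}^{(t)}$

We first consider Stage I of GD, when $\max_{r \in \{\pm 1\}} \{\Lambda_r^{(t)}, \bar{\Lambda}_r^{(t)}\} \le \Theta(m^{-1})$. In this stage, we first prove the following lemma.

Lemma A.5. As long as $\max_{r \in \{\pm 1\}} \{\Lambda_r^{(t)}, \bar{\Lambda}_r^{(t)}\} \le \Theta(m^{-1})$, the quantities $c_i^{(t)} := -\ell'(\hat{y}_i f_{W^{(t)}}(x_i))$ and $b_i^{(t)} := -\ell'(y'_i f_{W^{(t)}}(x'_i))$ remain $1/2 \pm o(1)$.

Proof of Lemma A.5. Note that $\ell(z) = \log(1 + \exp(-z))$ and $-\ell'(z) = \exp(-z) / (1 + \exp(-z))$. Without loss of generality assuming $\hat{y}_i = y_i = 1$, we can express $c_i^{(t)}$ as follows:

$$c_i^{(t)} = -\ell'(f_{W^{(t)}}(x_i)) = \frac{e^{\sum_{j=m+1}^{2m} [\sigma(\langle w_j^{(t)}, v \rangle) + \sigma(\langle w_j^{(t)}, \xi_i \rangle)]}}{e^{\sum_{j=1}^{m} [\sigma(\langle w_j^{(t)}, v \rangle) + \sigma(\langle w_j^{(t)}, \xi_i \rangle)]} + e^{\sum_{j=m+1}^{2m} [\sigma(\langle w_j^{(t)}, v \rangle) + \sigma(\langle w_j^{(t)}, \xi_i \rangle)]}}.$$

Since $\sigma(\langle w_j^{(t)}, v \rangle)$ dominates $\sigma(\langle w_j^{(t)}, \xi_i \rangle)$ for $j \in [m]$, which will be proved later by using the tensor power method, we have

$$c_i^{(t)} = \frac{e^{\sum_{j=m+1}^{2m} [\sigma(\langle w_j^{(t)}, v \rangle) + \sigma(\langle w_j^{(t)}, \xi_i \rangle)]}}{e^{\sum_{j=1}^{m} \sigma(\langle w_j^{(t)}, v \rangle) + \{\text{lower order term}\}} + e^{\sum_{j=m+1}^{2m} [\sigma(\langle w_j^{(t)}, v \rangle) + \sigma(\langle w_j^{(t)}, \xi_i \rangle)]}}.$$

On the one side,

$$c_i^{(t)} \ge \frac{1}{e^{\sum_{j=1}^{m} \sigma(\langle w_j^{(t)}, v \rangle) + \{\text{lower order term}\}} + 1} \ge \frac{1}{e^{m (\Lambda_1^{(t)})^{q-1}} + 1} \ge \frac{1}{e^{\Theta(m^{-(q-1)})} + 1} = \frac{1}{2 + o(1)} = \frac{1}{2} - o(1).$$

On the other side, according to Lemma 5.3, we have $\bar{\Lambda}_1^{(t)} = O(d^{-1/4})$ and $\Gamma^{(t)} = O(d^{-1/4 + \epsilon})$. It follows that

$$c_i^{(t)} \le \frac{e^{m (\bar{\Lambda}_1^{(t)})^{q-1} + m (\Gamma^{(t)})^{q-1}}}{e^{\sum_{j=1}^{m} \sigma(\langle w_j^{(t)}, v \rangle) + \{\text{lower order term}\}} + e^{m (\bar{\Lambda}_1^{(t)})^{q-1} + m (\Gamma^{(t)})^{q-1}}} = \frac{1 + o(1)}{e^{\sum_{j=1}^{m} \sigma(\langle w_j^{(t)}, v \rangle) + \{\text{lower order term}\}} + 1 + o(1)} \le \frac{1 + o(1)}{1 + 1 + o(1)} = \frac{1}{2} + o(1).$$

Therefore, we have $c_i^{(t)} = 1/2 \pm o(1)$ if $\hat{y}_i = y_i = 1$, and the other cases ($\hat{y}_i = y_i = -1$, $\hat{y}_i = -y_i$, and $b_i^{(t)}$) can be proved in a similar way.

By applying the above lemma, we can obtain the following lemma.

Lemma A.6. For any $\delta < 1/2$, with probability at least $1 - 2\delta$ over the pseudo-labels generated by the pseudo-labeler, we have

$$\Big| \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) \Big| < \sqrt{\frac{1}{8 n_u} \log \frac{1}{\delta}} + o(1),$$

where $o(1)$ is with respect to $d$.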
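If we denote $\{(x_i, y_i) \mid y_i = 1, i \in [n_u]\}$ by $S_1$, $\{(x_i, y_i) \mid y_i = -1, i \in [n_u]\}$ by $S_{-1}$, $|S_1|$ by $n_1$ and $|S_{-1}|$ by $n_{-1}$, then with probability at least $1 - 4\delta$ we have

$$\Big| \frac{1}{n_1} \sum_{i \in S_1} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) \Big| < \sqrt{\frac{1}{8 n_1} \log \frac{1}{\delta}} + o(1) \quad \text{and} \quad \Big| \frac{1}{n_{-1}} \sum_{i \in S_{-1}} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) \Big| < \sqrt{\frac{1}{8 n_{-1}} \log \frac{1}{\delta}} + o(1).$$

Proof of Lemma A.6. First, according to Lemma A.5, we have

$$\frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} = \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i \Big( c_i^{(t)} - \frac{1}{2} \Big) + \frac{1}{2 n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i = \frac{1}{2 n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i \pm o(1). \tag{A.4}$$

Then, according to Hoeffding's inequality with $a_i = -1$, $b_i = 1$, we have

$$\mathbb{P}\Big( \Big| \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i - \mathbb{E}\Big[ \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i \Big] \Big| \ge t \Big) \le 2 \exp\Big( -\frac{2 n_u^2 t^2}{\sum_{i=1}^{n_u} (a_i - b_i)^2} \Big) = 2 \exp(-2 n_u t^2).$$

Since the pseudo-label $\hat{y}_i$ generated by the pseudo-labeler takes the value $y_i$ with probability $p$ and $-y_i$ with probability $1 - p$, we have $\mathbb{E}\big[ \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i \big] = \frac{1}{n_u} \sum_{i=1}^{n_u} \mathbb{E}[\hat{y}_i y_i] = 2p - 1$. It follows that

$$\mathbb{P}\Big( \Big| \frac{1}{2 n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i - \Big( p - \frac{1}{2} \Big) \Big| \ge t \Big) \le 2 \exp(-8 n_u t^2),$$

and therefore

$$\Big| \frac{1}{2 n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i - \Big( p - \frac{1}{2} \Big) \Big| < \sqrt{\frac{1}{8 n_u} \log \frac{1}{\delta}} \tag{A.5}$$

holds with probability at least $1 - 2\delta$. According to (A.4) and (A.5), we have

$$\Big| \frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) \Big| < \sqrt{\frac{1}{8 n_u} \log \frac{1}{\delta}} + o(1),$$

which verifies the first statement of the lemma. The other part of the lemma can be proved in a similar way.

According to the above lemma, and noting that $n_u, n_1, n_{-1} = \omega(1)$, we further have

$$\frac{1}{n_u} \sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) = o(1), \qquad \frac{1}{n_r} \sum_{i \in S_r} \hat{y}_i y_i c_i^{(t)} - \Big( p - \frac{1}{2} \Big) = o(1), \quad r \in \{\pm 1\}, \tag{A.6}$$

with high probability. Besides, we also need an approximation of $n_1$ and $n_{-1}$, which is given by the following lemma.

Lemma A.7. For $r \in \{\pm 1\}$, it holds with probability at least $1 - 2\delta$ that

$$\Big| n_r - \frac{n_u}{2} \Big| < \sqrt{\frac{n_u}{2} \log \frac{1}{\delta}},$$

where $n_r := |\{(x_i, y_i) \mid y_i = r, i \in [n_u]\}|$.

Proof of Lemma A.7.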
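Note that $n_r = \sum_{i=1}^{n_u} \mathbb{1}[X_i = r]$, $r \in \{\pm 1\}$, where $X_i$ takes the label $+1$ or $-1$ with equal probability $1/2$. According to Hoeffding's inequality, we have

$$\mathbb{P}\Big( \Big| \sum_{i=1}^{n_u} \mathbb{1}[X_i = r] - \mathbb{E}\Big[ \sum_{i=1}^{n_u} \mathbb{1}[X_i = r] \Big] \Big| \ge t \Big) \le 2 \exp\Big( -\frac{2 t^2}{n_u} \Big), \quad r \in \{\pm 1\},$$

and it follows that

$$\mathbb{P}\Big( \Big| n_r - \frac{n_u}{2} \Big| \ge t \Big) \le 2 \exp\Big( -\frac{2 t^2}{n_u} \Big), \quad r \in \{\pm 1\},$$

leading to $\big| n_r - \frac{n_u}{2} \big| < \sqrt{\frac{n_u}{2} \log \frac{1}{\delta}}$ with probability at least $1 - 2\delta$.

For the labeled dataset $S' = \{(x'_i, y'_i)\}_{i=1}^{n_l}$, we also have the following lemma.

Lemma A.8. For $r \in \{\pm 1\}$, it holds with probability at least $1 - 2\delta$ that

$$\Big| n'_r - \frac{n_l}{2} \Big| < \sqrt{\frac{n_l}{2} \log \frac{1}{\delta}},$$

where $n'_r := |\{(x'_i, y'_i) \mid y'_i = r, i \in [n_l]\}|$.

We are now prepared to estimate a lower bound on the increasing speed of $\Lambda_r^{(t)}$ and an upper bound on the decreasing speed of $\bar{\Lambda}_r^{(t)}$ in the following lemma.

Lemma A.9. For $\Lambda_1^{(t)} := \max_{1 \le j \le m} \langle w_j^{(t)}, v \rangle$ and $\Lambda_{-1}^{(t)} := \max_{m+1 \le j \le 2m} \langle w_j^{(t)}, -v \rangle$, we have with high probability that

$$\Lambda_r^{(t+1)} \ge (1 - \eta\lambda) \cdot \Lambda_r^{(t)} + \eta \cdot \Big( p - \frac{1}{2} \Big) \cdot \Theta(d) \cdot (\Lambda_r^{(t)})^{q-1}, \quad r \in \{\pm 1\}.$$

For $\bar{\Lambda}_1^{(t)} := \max_{m+1 \le j \le 2m} \langle w_j^{(t)}, v \rangle$ and $\bar{\Lambda}_{-1}^{(t)} := \max_{1 \le j \le m} \langle w_j^{(t)}, -v \rangle$, we have with high probability that

$$\bar{\Lambda}_r^{(t+1)} \le (1 - \eta\lambda) \cdot \bar{\Lambda}_r^{(t)}, \quad r \in \{\pm 1\}.$$

Proof of Lemma A.9. We first prove the former inequality. Let $j^* = \arg\max_{1 \le j \le m} \langle w_j^{(t)}, v \rangle$ and note that $u_{j^*} = \mathbb{1}_{[1 \le j^* \le m]} - \mathbb{1}_{[m+1 \le j^* \le 2m]} = 1$. Then we have

$$\Lambda_1^{(t+1)} \ge \langle w_{j^*}^{(t+1)}, v \rangle = (1 - \eta\lambda) \cdot \langle w_{j^*}^{(t)}, v \rangle + \frac{q \eta}{n_l + n_u} \Big[ \underbrace{\sum_{i=1}^{n_u} \hat{y}_i y_i c_i^{(t)} [\langle w_{j^*}^{(t)}, y_i v \rangle]_+^{q-1} \|v\|_2^2}_{\clubsuit} + \underbrace{\sum_{i=1}^{n_l} b_i^{(t)} [\langle w_{j^*}^{(t)}, y'_i v \rangle]_+^{q-1} \|v\|_2^2}_{\star} \Big].$$

We now estimate the terms $\clubsuit$ and $\star$ respectively.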
For ♣, note the definition of j * that Λ  y i y i c (t) i [⟨w (t) j * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ = i∈S1 y i y i c (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i • ∥v∥ 2 2 • Λ (t) 1 q-1 = n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 , (A.7) where S 1 := {(x i , y i )|y i = 1, i ∈ [n u ]}, S -1 := {(x i , y i )|y i = -1, i ∈ [n u ]}, n 1 = |S 1 | and the last equality is due to (A.6). For ⋆, similarly we have n l i=1 b (t) i [⟨w (t) j * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ = i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S ′ -1 b (t) i [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 = i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 = i∈S ′ 1 b (t) i • ∥v∥ 2 2 • Λ (t) 1 q-1 = n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 , (A.8) where S ′ 1 = {(x ′ i , y ′ i )|y ′ i = 1, i ∈ [n l ]}, S ′ -1 = {(x ′ i , y ′ i )|y ′ i = -1, i ∈ [n l ]}, n ′ 1 = |S ′ 1 | and the last equality is due to Lemma A.5. According to (A.7) and (A.8), we have Λ (t+1) 1 ≥ (1 -ηλ) • Λ (t) 1 + qη n l + n u n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qηn 1 n l + n u • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + qηn ′ 1 n l + n u • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qη • n 1 n l + n u • p - 1 2 ± o(1) + n ′ 1 n l + n u • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qη • n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ ±o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 . 
(A.9) According to Lemma A.7 and Lemma A.8, and note that n l = Θ(1), n u = ω(d 4ϵ ), we have for ♠ that with probability at least 1 -4δ n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ - n u 2(n l + n u ) • p - 1 2 - n l 2(n l + n u ) • 1 2 ≤ |n 1 -nu 2 | n l + n u • p - 1 2 + |n ′ 1 -n l 2 | n l + n u • 1 2 ≤ nu 2 log 1 δ n l + n u • p - 1 2 + n l 2 log 1 δ n l + n u • 1 2 = Θ 1 √ n u = o(1) Therefore, note that n u = ω(n l ) and n u = ω(1), we have n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ = n u 2(n l + n u ) • p - 1 2 + n l 2(n l + n u ) • 1 2 ± o(1) = 1 2 • p - 1 2 ± o(1) (A.10) Plugging (A.10) into (A.9), we have Λ (t+1) 1 ≥ (1 -ηλ) • Λ (t) 1 + qη • 1 2 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + η • p - 1 2 • Θ(d) • Λ (t) 1 q-1 , (A.11) which verifies the first inequality of case r = 1 in the lemma. Let j * * = argmax m+1≤j≤2m ⟨w (t) j , -v⟩ and note that u  j * * = 1 [1≤j≤m] -1 [m+1≤j≤2m] = -1, we have Λ (t+1) -1 ≥ ⟨w (t+1) j * , -v⟩ = (1 -ηλ) • ⟨w (t) j * * , -v⟩ + qη n l + n u nu i=1 y i y i c (t) i [⟨w (t) j * * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ + n l i=1 b (t) i [⟨w (t) j * * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ For ♣, y i y i c (t) i [⟨w (t) j * * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ = i∈S-1 y i y i c (t) i [⟨w (t) j * * , -v⟩] q-1 + ∥v∥ 2 2 = n -1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) -1 q-1 , (A.12) where S -1 := {(x i , y i )|y i = -1, i ∈ [n u ]}, n -1 = |S -1 |. For ⋆, according to Lemma A.5, similarly we have n l i=1 b (t) i [⟨w (t) j * * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ = i∈S ′ -1 b (t) i [⟨w (t) j * * , -v⟩] q-1 + ∥v∥ 2 2 = n ′ -1 • 1 2 ±o(1) •∥v∥ 2 2 • Λ (t) -1 q-1 , (A.13) where S ′ -1 = {(x ′ i , y ′ i )|y ′ i = -1, i ∈ [n l ]} and n ′ -1 = |S ′ -1 |. According to (A.12) and (A.13), we have Λ (t+1) -1 ≥ (1 -ηλ) • Λ (t) -1 + qη • n -1 n l + n u • p - 1 2 + n ′ -1 n l + n u • 1 2 ♠ ±o(1) • ∥v∥ 2 2 • Λ (t) -1 q-1 . 
(A.14) According to Lemma A.7 and Lemma A.8, and note that n l = Θ(1), n u = ω(d 4ϵ ), we have for ♠ that with probability at least 1 -4δ n -1 n l + n u • p - 1 2 + n ′ -1 n l + n u • 1 2 ♠ - n u 2(n l + n u ) • p - 1 2 - n l 2(n l + n u ) • 1 2 ≤ |n -1 -nu 2 | n l + n u • p - 1 2 + |n ′ -1 -n l 2 | n l + n u • 1 2 ≤ nu 2 log 1 δ n l + n u • p - 1 2 + n l 2 log 1 δ n l + n u • 1 2 = Θ 1 √ n u = o(1). Therefore, note that n u = ω(n l ) and n u = ω(1), we have n -1 n l + n u • p - 1 2 + n ′ -1 n l + n u • 1 2 ♠ = n u 2(n l + n u ) • p - 1 2 + n l 2(n l + n u ) • 1 2 ± o(1) = 1 2 • p - 1 2 ± o(1) (A.15) Plugging (A.15) into (A.14), we have Λ (t+1) -1 ≥ (1 -ηλ) • Λ (t) -1 + qη • 1 2 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) -1 q-1 = (1 -ηλ) • Λ (t) -1 + η • p - 1 2 • Θ(d) • Λ (t) -1 q-1 , (A.16) which verifies the first inequality of case r = -1 in the lemma. Published as a conference paper at ICLR 2023 Next, we prove the latter part of the lemma. Let j ♮ = arg max m+1≤j≤2m ⟨w (t+1) j , v⟩, then we have: Λ(t+1) 1 = ⟨w (t+1) j ♮ , v⟩ = (1 -ηλ) • ⟨w (t) j ♮ , v⟩ - qη n l + n u nu i=1 y i y i c (t) i [⟨w (t) j ♮ , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ + n l i=1 b (t) i [⟨w (t) j ♮ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ . For ♣, according to (A.6), we have nu i=1 y i y i c (t) i [⟨w (t) j ♮ , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ = i∈S1 y i y i c (t) i [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i [⟨w (t) j ♮ , -v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i • [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i • [⟨w (t) j ♮ , -v⟩] q-1 + ∥v∥ 2 2 = n 1 • p - 1 2 ± o(1) • [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 + n -1 • p - 1 2 ± o(1) • [⟨w (t) j ♮ , -v⟩] q-1 + ∥v∥ 2 2 ≥ 0, and for ⋆ it's obvious that n l i=1 b (t) i [⟨w (t) j ♮ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ ≥ 0. Therefore, it follows that Λ(t+1) 1 ≤ (1 -ηλ) • ⟨w (t) j ♮ , v⟩ ≤ (1 -ηλ) Λ(t) 1 . 
Let j ♮♮ = arg max 1≤j≤m ⟨w (t+1) j , -v⟩, then we have: Λ(t+1) -1 = ⟨w (t+1) j ♮♮ , -v⟩ = (1 -ηλ) • ⟨w (t) j ♮♮ , -v⟩ - qη n l + n u nu i=1 y i y i c (t) i [⟨w (t) j ♮♮ , y i • v⟩] q-1 + ∥v∥ 2 2 + n l i=1 b (t) i [⟨w (t) j ♮♮ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ≤ (1 -ηλ) • ⟨w (t) j ♮♮ , -v⟩ ≤ (1 -ηλ) • Λ(t) -1 , which verifies the second part of the lemma. Although the accuracy of pseudo-labeler is larger than 1/2, which is used as an assumption in the previous proof, we can also analyse the model with high label flipping probability and the accuracy of pseudo-labeler p is smaller than 1/2. In this case, the neural network for pre-training will turn to fit the opposite direction of feature vector, Λ(t) r will increase and Λ (t) r will decrease, which is formulated as the following lemma.  Λ(t+1) r ≥ (1 -ηλ) • Λ(t) r + η • 1 2 -p • Θ(d) • ( Λ(t) r ) q-1 , r ∈ {±1}. Proof of Lemma A.10. First, we prove the former part of this lemma. Let j * = arg max 1≤j≤m ⟨w (t+1) j , v⟩ and note that u j * = 1 [1≤j≤m] -1 [m+1≤j≤2m] = 1, then we have Λ (t+1) 1 = ⟨w (t+1) j * , v⟩ = (1 -ηλ) • ⟨w (t) j * , v⟩ + qη n l + n u nu i=1 y i y i c (t) i [⟨w (t) j * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ + n l i=1 b (t) i [⟨w (t) j * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ . 
For ♣, according to (A.6), we have nu i=1 y i y i c (t) i [⟨w (t) j * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ = i∈S1 y i y i c (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i • [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 = n 1 • p - 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + n -1 • p - 1 2 ± o(1) • [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 , For ⋆, according to (A.6), we have n l i=1 b (t) i [⟨w (t) j * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ = i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S ′ -1 b (t) i [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 = n ′ 1 • 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + n ′ -1 • 1 2 ± o(1) • [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 , It follows that nu i=1 y i y i c (t) i [⟨w (t) j * , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ + n l i=1 b (t) i [⟨w (t) j * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ = n 1 • p - 1 2 ± o(1) + n ′ 1 • 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + n -1 • p - 1 2 ± o(1) + n ′ -1 • 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 . According to Lemma A.7 and note that n u = ω(n l ), it holds with probability at least 1 -8δ that n ′ 1 • 1 2 ± o(1) ≤ n l 2 + n l 2 log 1 δ • 1 2 ± o(1) = Θ(n l ) = o(n u ) ≤ n u 2 + n u 2 log 1 δ • 1 2 -p ± o(1) ≤ n 1 • 1 2 -p ± o(1) , n ′ -1 • 1 2 ± o(1) ≤ n l 2 + n l 2 log 1 δ • 1 2 ± o(1) = Θ(n l ) = o(n u ) ≤ n u 2 + n u 2 log 1 δ • 1 2 -p ± o(1) ≤ n -1 • 1 2 -p ± o(1) , leading to ♣ + ⋆ ≤ 0. Therefore, Λ (t+1) 1 ≤ (1 -ηλ)⟨w (t) j * , v⟩ ≤ (1 -ηλ) • Λ (t) 1 . And we can prove in a similar way that Λ (t+1) -1 ≤ (1 -ηλ) • Λ (t) -1 . Next, we prove the second part of the lemma. Let j ♮ = arg max m+1≤j≤2m ⟨w (t) j , v⟩ and note that u j ♮ = 1 [1≤j≤m] -1 [m+1≤j≤2m] = -1, then we have Λ(t+1) 1 ≥ ⟨w (t+1) j ♮ , v⟩ = (1 -ηλ) • ⟨w (t) j ♮ , v⟩ - qη n l + n u nu i=1 y i y i c (t) i [⟨w (t) j ♮ , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ + n l i=1 b (t) i [⟨w (t) j ♮ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ . 
For ♣, note the definition of j ♮ that Λ(t) 1 = ⟨w (t) j ♮ , v⟩ and note the increasing property of Λ(t) 1 in this case and Λ(0) 1 > 0 with high probability, we have ⟨w (t) j ♮ , v⟩ > 0. It follows that nu i=1 y i y i c (t) i [⟨w (t) j ♮ , y i • v⟩] q-1 + ∥v∥ 2 2 ♣ = i∈S1 y i y i c (t) i [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i [-⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 = i∈S1 y i y i c (t) i • ∥v∥ 2 2 • Λ(t) 1 q-1 = n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) 1 q-1 , (A.17) For ⋆, similarly we have n l i=1 b (t) i [⟨w (t) j ♮ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ⋆ = i∈S ′ 1 b (t) i [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 + i∈S ′ -1 b (t) i [-⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 = i∈S ′ 1 b (t) i [⟨w (t) j ♮ , v⟩] q-1 + ∥v∥ 2 2 = i∈S ′ 1 b (t) i • ∥v∥ 2 2 • Λ(t) 1 q-1 = n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) 1 q-1 . (A.18) According to Lemma A.7, (A.17) and (A.18), we have n ′ 1 = o(n 1 ) with high probability, therefore ♣ + ⋆ = n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) 1 q-1 , leading to Λ(t+1) 1 ≥ (1 -ηλ) • ⟨w (t) j ♮ , v⟩ - qηn 1 n l + n u • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) 1 q-1 = (1 -ηλ) • Λ(t) 1 + η • 1 2 -p • Θ(d) • Λ(t) 1 q-1 . And we can prove in a similar way that Λ(t+1) 1 ≥ (1 -ηλ) • Λ(t) 1 + η • 1 2 -p • Θ(d) • Λ(t) 1 q-1 . In this case (p < 1/2), given a small amount of labeled data, downstream task parameter a will learn the negative direction and the main theorems still hold. A.4.2 UNIFORM UPPER BOUND FOR Γ (t) The following lemma provides an upper bound for the increasing rate of Γ (t) . Lemma A.11. 
For Γ (t) i := max j∈[2m] ⟨w j , ξ i ⟩, i ∈ [n u ], Γ ′(t) i := max j∈[2m] ⟨w j , ξ ′ i ⟩, i ∈ [n l ], Γ (t) := max{max i∈[nu] Γ (t) i , max i∈[n l ] Γ ′(t) i }, we have with high probability that Γ (t+1) i ≤ (1 -ηλ) • Γ (t) i + η • max Θ(d 1 2 +2ϵ ), Θ d 1+2ϵ n u • Γ (t) q-1 , i ∈ [n l ], Γ ′(t+1) i ≤ (1 -ηλ) • Γ ′(t) i + η • max Θ(d 1 2 +2ϵ ), Θ d 1+2ϵ n u • Γ (t) q-1 , i ∈ [n l ], and Γ (t+1) ≤ (1 -ηλ) • Γ (t) + η • max Θ(d 1 2 +2ϵ ), Θ d 1+2ϵ n u • Γ (t) q-1 , where ϵ < 1/8. Proof of Lemma A.11. We first prove the former inequality. Let j ⋆ = arg max 1≤j≤2m ⟨w (t+1) j , ξ l ⟩, where l ∈ [n u ] is fixed. According to Lemma A.2, we have Γ (t+1) l = ⟨w (t+1) j ⋆ , ξ l ⟩ = (1 -ηλ) • ⟨w (t) j ⋆ , ξ l ⟩ + qηu j ⋆ n l + n u nu i=1 y i c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + ⟨ξ i , ξ l ⟩ + n l i=1 y ′ i b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + ⟨ξ ′ i , ξ l ⟩ ≤ (1 -ηλ) • ⟨w (t) j ⋆ , ξ l ⟩ + qη n l + n u nu i=1 c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + |⟨ξ i , ξ l ⟩| ♣ + n l i=1 b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + |⟨ξ ′ i , ξ l ⟩| ⋆ , (A.19) where the last inequality is due to triangle inequality. For ♣, note that l ∈ [n u ] and there exists an i ∈ [n u ] equivalent to l, it follows that nu i=1 c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + |⟨ξ i , ξ l ⟩| ♣ = i∈[nu],i̸ =l c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + |⟨ξ i , ξ l ⟩| + c (t) l [⟨w (t) j ⋆ , ξ l ⟩] q-1 + ∥ξ l ∥ 2 2 ≤ (n u -1) • 1 2 + o(1) • Θ(d 1 2 +2ϵ ) • Γ (t) q-1 + 1 2 + o(1) • Θ(d 1+2ϵ ) • Γ (t) q-1 = (n u -1) • Θ(d 1 2 +2ϵ ) • Γ (t) q-1 + Θ(d 1+2ϵ ) • Γ (t) q-1 , (A.20) where the inequality is due to Lemma A.5, ∥ξ l ∥ 2 2 = Θ(dσ 2 p ) = Θ(d 1+2ϵ ), |⟨ξ i , ξ l ⟩| = Θ(d 1 2 σ 2 p ) = Θ(d 1 2 +2ϵ ) according to Lemma C.3 and the definition of Γ (t) . 
For ⋆, we have n l i=1 b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + |⟨ξ ′ i , ξ l ⟩| ⋆ ≤ n l • 1 2 +o(1) • Θ(d 1 2 +2ϵ )• Γ (t) q-1 = n l • Θ(d 1 2 +2ϵ )• Γ (t) q-1 , (A.21) Plugging (A.20) and (A.21) into (A.19), we have Γ (t+1) l ≤ (1 -ηλ) • Γ (t) l + η • q n l + n u • (n u + n l -1) • Θ(d 1 2 +2ϵ ) + Θ(d 1+2ϵ ) • Γ (t) q-1 ≤ (1 -ηλ) • Γ (t) l + η • max Θ(d 1 2 +2ϵ ), Θ d 1+2ϵ n u • Γ (t) q-1 , which is the first part of this lemma. Let j ⋆ = argmax 1≤j≤2m ⟨w (t+1) j , ξ ′ l ⟩, where l ∈ [n l ] is fixed. According to Lemma A.2, we have Γ ′(t+1) l = ⟨w (t+1) j ⋆ , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ⋆ , ξ ′ l ⟩ + qηu j ⋆ n l + n u nu i=1 y i c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + ⟨ξ i , ξ ′ l ⟩ + n l i=1 y ′ i b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + ⟨ξ i , ξ ′ l ⟩ ≤ (1 -ηλ) • ⟨w (t) j ⋆ , ξ ′ l ⟩ + qη n l + n u nu i=1 c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + |⟨ξ i , ξ ′ l ⟩| ♣ + n l i=1 b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + |⟨ξ ′ i , ξ ′ l ⟩| ⋆ , (A.22) For ♣, we have nu i=1 c (t) i [⟨w (t) j ⋆ , ξ i ⟩] q-1 + |⟨ξ i , ξ l ⟩| ♣ ≤ nu i=1 1 2 ±o(1) • Θ(d 1 2 +2ϵ )• Γ (t) q-1 = n u • Θ(d 1 2 +2ϵ )• Γ (t) q-1 , (A.23) where the inequality is due to Lemma A.5, |⟨ξ i , ξ l ⟩| = Θ(d 1 2 σ 2 p ) = Θ(d 1 2 +2ϵ ) and the definition of Γ (t) . For ⋆, note that l ∈ [n l ] and there exists an i ∈ [n l ] equivalent to l, it follows that n l i=1 b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + |⟨ξ ′ i , ξ ′ l ⟩| ⋆ = i∈[n l ],i̸ =l b (t) i [⟨w (t) j ⋆ , ξ ′ i ⟩] q-1 + |⟨ξ ′ i , ξ ′ l ⟩| + b (t) l [⟨w (t) j ⋆ , ξ ′ l ⟩] q-1 + ∥ξ ′ l ∥ 2 2 ≤ (n l -1) • 1 2 + o(1) • Θ(d 1 2 +2ϵ ) • Γ (t) q-1 + 1 2 + o(1) • Θ(d 1+2ϵ ) • Γ (t) q-1 = (n l -1) • Θ(d 1 2 +2ϵ ) + Θ(d 1+2ϵ ) • Γ (t) q-1 (A.24) Plugging (A.23) and (A.24) into (A.22), we have Γ ′(t+1) l ≤ (1 -ηλ) • Γ ′(t+1) l + η • q n l + n u • (n u + n l -1) • Θ(d 1 2 +2ϵ ) + Θ(d 1+2ϵ ) • Γ (t) q-1 ≤ (1 -ηλ) • Γ ′(t+1) l + η • max Θ(d 1 2 +2ϵ ), Θ d 1+2ϵ n u • Γ (t) q-1 , which verifies the second inequality in this lemma. 
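Note that $\Gamma^{(t)} = \max\{\max_{l \in [n_u]} \Gamma_l^{(t)}, \max_{l \in [n_l]} \Gamma_l'^{(t)}\}$. Without loss of generality, we assume $\Gamma^{(t)} = \max_{l \in [n_u]} \Gamma_l^{(t)}$ and let $l^* = \arg\max_{l \in [n_u]} \Gamma_l^{(t+1)}$. Then we have

$$\Gamma^{(t+1)} = \Gamma_{l^*}^{(t+1)} \le (1 - \eta\lambda) \cdot \Gamma_{l^*}^{(t)} + \eta \cdot \max\Big\{ \Theta(d^{\frac{1}{2} + 2\epsilon}), \Theta\Big( \frac{d^{1 + 2\epsilon}}{n_u} \Big) \Big\} \cdot (\Gamma^{(t)})^{q-1} \le (1 - \eta\lambda) \cdot \Gamma^{(t)} + \eta \cdot \max\Big\{ \Theta(d^{\frac{1}{2} + 2\epsilon}), \Theta\Big( \frac{d^{1 + 2\epsilon}}{n_u} \Big) \Big\} \cdot (\Gamma^{(t)})^{q-1},$$

which verifies the third inequality in this lemma.

A.4.3 TENSOR POWER METHOD: PROVING $\Gamma^{(t)} = O(\Gamma^{(0)})$ DURING $[0, T_r]$ AND COMPUTING THE MAGNITUDE OF $T_r$

In this section, we first show that the off-diagonal correlation ($\bar{\Lambda}_r^{(t)}$ for $p > 1/2$ and $\Lambda_r^{(t)}$ for $p < 1/2$) remains at its initialization magnitude during $[0, T_r]$. If the accuracy of the pseudo-labeler satisfies $p > 1/2$, the off-diagonal correlation satisfies $\bar{\Lambda}_r^{(t+1)} \le (1 - \eta\lambda) \cdot \bar{\Lambda}_r^{(t)}$ for $r \in \{\pm 1\}$, and therefore $\bar{\Lambda}_r^{(t)} = O(\bar{\Lambda}_r^{(0)}) = O(d^{-1/4})$. If $p < 1/2$, the off-diagonal correlation satisfies $\Lambda_r^{(t+1)} \le (1 - \eta\lambda) \cdot \Lambda_r^{(t)}$ for $r \in \{\pm 1\}$, and therefore $\Lambda_r^{(t)} = O(\Lambda_r^{(0)}) = O(d^{-1/4})$. In this paper, we mainly focus on $p > 1/2$.

According to Sections A.4.1 and A.4.2, we have obtained the following upper and lower bounds for the feature learning terms $\Lambda_r^{(t)}$, $\bar{\Lambda}_r^{(t)}$, $r \in \{\pm 1\}$, and the noise memorization term $\Gamma^{(t)}$. When $t \in [0, T_r]$, we have

$$\Lambda_r^{(t+1)} \ge \Lambda_r^{(t)} + \eta \cdot (2p - 1) \cdot \Theta(d) \cdot (\Lambda_r^{(t)})^{q-1} \quad \text{and} \quad \bar{\Lambda}_r^{(t+1)} \le (1 - \eta\lambda) \cdot \bar{\Lambda}_r^{(t)}, \quad \text{for } r \in \{\pm 1\};$$

$$\Gamma^{(t+1)} \le (1 - \eta\lambda) \cdot \Gamma^{(t)} + \eta \cdot \max\Big\{ \Theta(d^{\frac{1}{2} + 2\epsilon}), \Theta\Big( \frac{d^{1 + 2\epsilon}}{n_u} \Big) \Big\} \cdot (\Gamma^{(t)})^{q-1}. \tag{A.25}$$

According to Condition 4.1, assuming $n_u = \Omega(d^{4\epsilon})$ and noting that $\epsilon < 1/8$, we have

$$\max\Big\{ \Theta(d^{\frac{1}{2} + 2\epsilon}), \Theta\Big( \frac{d^{1 + 2\epsilon}}{n_u} \Big) \Big\} = \max\big\{ \Theta(d^{\frac{1}{2} + 2\epsilon}), O(d^{1 - 2\epsilon}) \big\} = O(d^{1 - 2\epsilon}),$$

leading to

$$\Gamma^{(t+1)} \le (1 - \eta\lambda) \cdot \Gamma^{(t)} + \eta \cdot \Theta(d^{1 - 2\epsilon}) \cdot (\Gamma^{(t)})^{q-1}.$$

By leveraging the tensor power method introduced in Lemma C.4, we can prove the following lemma about the magnitude of $\Gamma^{(t)}$.

Lemma A.12. $\Gamma^{(t)}$ remains at its initialization magnitude during $[0, \max_{r \in \{\pm 1\}} \{T_r\}]$.

Proof of Lemma A.12.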
Let T * r be the first iteration t in which Λ (t) r ≥ A for r ∈ {±1}, let T * be the first iteration t in which Γ (t) ≥ A ′ , then according to Lemma C.4, we know t≥0,xt≤A η ≤ δ (1 -(1 + δ) -(q-2) )x 0 C 1 + η • C 2 C 1 (1 + δ) q-1 1 + log (A/x 0 ) log (1 + δ) , t≥0,xt≤A η ≥ δ 1 -(x 0 /A) q-2 (1 + δ) q-1 1 -(1 + δ) -(q-2) x 0 C 2 -η • (1 + δ) -(q-1) 1 + log (A/x 0 ) log (1 + δ) . And it follows that η • T * r ≤ δ (1 -(1 + δ) -(q-2) ) Λ (0) r C 1 + η • C 2 C 1 (1 + δ) q-1 1 + log (A/ Λ (0) r ) log (1 + δ) , η • T * ≥ δ ′ 1 -(x 0 /A ′ ) q-2 (1 + δ) q-1 1 -(1 + δ) -(q-2) Γ (0) C ′ 2 -η • (1 + δ ′ ) -(q-1) 1 + log (A ′ /Γ (0) ) log (1 + δ ′ ) , ♣ + qη n l + n u i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S ′ 1 b (t) i [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ⋆ . (A.28) For ♣, according to Lemma A.6, we have i∈S1 y i y i c (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S-1 y i y i c (t) i [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ♣ = n 1 • p - 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + n -1 • p - 1 2 ± o(1) • [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ≤ n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + n -1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) -1 q-1 = n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 , (A.29) where the last equality is due to Λ (t) 1 = ω( Λ(t) -1 ). For ⋆, according to Lemma A.5, we have i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + i∈S ′ 1 b (t) i [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ⋆ = n ′ 1 • 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + n ′ -1 • 1 2 ± o(1) • [-⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ≤ n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + n ′ -1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ(t) -1 q-1 = n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 , (A.30) where the last equality is due to Λ (t) 1 = ω( Λ(t) -1 ). 
Plugging (A.29) and (A.30) into (A.28), we have Λ (t+1) 1 ≤ (1 -ηλ) • Λ (t) 1 + qη n l + n u n 1 • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qηn 1 n l + n u • p - 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 + qηn ′ 1 n l + n u • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qη • n 1 n l + n u • p - 1 2 ± o(1) + n ′ 1 n l + n u • 1 2 ± o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 = (1 -ηλ) • Λ (t) 1 + qη • n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ ±o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 . (A.31) Note that we have already proved in (A.9) that Λ (t+1) 1 ≤ (1 -ηλ) • Λ (t) 1 + qη • n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ ±o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 . (A.32) Note we have already prove in (A.10) that n 1 n l + n u • p - 1 2 + n ′ 1 n l + n u • 1 2 ♠ = 1 2 • p - 1 2 ± o(1) Therefore, we have Λ (t+1) 1 ≥ (1 -ηλ) • Λ (t) 1 + qη • p - 1 2 -o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 , Λ (t+1) 1 ≤ (1 -ηλ) • Λ (t) 1 + qη • p - 1 2 + o(1) • ∥v∥ 2 2 • Λ (t) 1 q-1 . In a similar way, we can prove that Λ (t+1) -1 ≥ (1 -ηλ) • Λ (t) -1 + qη • p - 1 2 -o(1) • ∥v∥ 2 2 • Λ (t) -1 q-1 , Λ (t+1) -1 ≤ (1 -ηλ) • Λ (t) -1 + qη • p - 1 2 + o(1) • ∥v∥ 2 2 • Λ (t) -1 q-1 , which completes the proof of this lemma. Lemma A.14 (Length of pre-training). For r ∈ {±1}, let T r be the first iteration that Λ (t) r reaches Θ(1/m) respectively. Then T r = Θ(d -3 4 )/η for all r ∈ {±1}. Proof of Lemma A.14. 
By leveraging tensor power method given in Lemma C.4, t≥0,xt≤A η ≤ δ (1 -(1 + δ) -(q-2) )x 0 C 1 + η • C 2 C 1 (1 + δ) q-1 1 + log (A/x 0 ) log (1 + δ) , t≥0,xt≤A η ≥ δ 1 -(x 0 /A) q-2 (1 + δ) q-1 1 -(1 + δ) -(q-2) x 0 C 2 -η • (1 + δ) -(q-1) 1 + log (A/x 0 ) log (1 + δ) , we have for r ∈ {±1} that η • T * r = t≥0, Λ (t) r ≤A η ≤ δ (1 -(1 + δ) -(q-2) ) Λ (0) r C 1 (i) + η • C 2 C 1 (1 + δ) q-1 1 + log (A/ Λ (0) r ) log (1 + δ) (ii) , η • T * r = t≥0, Λ (t) r ≤A η ≥ δ 1 -(x 0 /A) q-2 (1 + δ) q-1 1 -(1 + δ) -(q-2) Λ (0) r C 2 (iii) -η • (1 + δ) -(q-1) 1 + log (A/ Λ (0) r ) log (1 + δ) (iv) , where C 1 is taken as q p -1 2 -o(1) • ∥v∥ 2 2 and C 2 is taken as q p -1 2 + o(1) • ∥v∥ 2 2 according to Lemma A.13. Taking δ = 1 k , A = Θ(1/m) and note that terms (ii), (iv) are respectively dominated by terms (i), (iii) when η is sufficiently small and letting k → ∞, we have  1 Λ (0) r C 2 -{lower order terms} ≤ η • T * r ≤ 1 Λ := max{T 1 , T -1 }, off-diagonal Λ(t) 1 , Λ(t) -1 still remain initialization magnitude O(d -1 4 ), Γ (t) 1 , Γ (t) -1 remain initialization magnitude O(d -1 4 +ϵ ), while on-diagonal Λ (t) 1 , Λ -1 reach and then remain Θ(1). A.5 PROOF OF LEMMA 5.2 If we only use labeled data S ′ for the optimization of CNN, according to Lemma B.1, we have w (t+1) j = w (t) j -∇ wj L S ′ (W) = (1 -ηλ) • w (t) j + qηu j n l n l i=1 b (t) i y ′ i [⟨w j , y ′ i • v⟩] q-1 + • y ′ i • v + [⟨w j , ξ ′ i ⟩] q-1 + • ξ ′ i , where u j := 1 [1≤j≤m] -1 [m+1≤2m] , b (t) i = -ℓ ′ (y ′ i • f W (x ′ i )) = exp[-y ′ i • f W (x ′ i )]/(1 + exp[-y ′ i • f W (x ′ i )]). Published as a conference paper at ICLR 2023 Notice that v and ξ ′ i are orthogonal to each other, we have ⟨w (t+1) j , v⟩ = (1 -ηλ) • ⟨w (t) j , v⟩ + qηu j n l n l i=1 b (t) i • [⟨w (t) j , y ′ i • v⟩] q-1 + • ∥v∥ 2 2 , ⟨w (t+1) j , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j , ξ i ⟩ + qηu j n l n l i=1 b (t) i y ′ i • [⟨w (t) j , ξ ′ i ⟩] q-1 + • ⟨ξ ′ i , ξ ′ l ⟩, i ∈ [n l ]. 
Let T ′ i be the first iteration that Γ ′(t) i reaches Θ(1/m), then we have following lemma: Lemma A.15. As long as Γ ′(t) i ≤ Θ(1/m), b i := -ℓ ′ (y ′ i • f W (t) (x ′ i )) will remain 1/2 ± o(1). Proof of Lemma A.15. Note that ℓ(z) = log(1+exp (-z)) and -ℓ ′ (z) = exp (-z)/ 1+exp (-z) , and without loss of generality assuming y ′ i = 1, we can express b (t) i as follow: b (t) i = -ℓ ′ (f W (t) (x ′ i )) = e 2m j=m+1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξ ′ i ⟩)] e m j=1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξ ′ i ⟩)] + e 2m j=m+1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξ ′ i ⟩)] , Since σ(⟨w (t) j , ξ⟩) will dominate σ(⟨w (t) j , v⟩) , which will be proved later by using tensor power method, we have b (t) i = -ℓ ′ (f W (t) (x ′ i )) = e 2m j=m+1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξ ′ i ⟩)] e m j=1 σ(⟨w (t) j ,ξ ′ i ⟩){+lower order term} + e 2m j=m+1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξ ′ i ⟩)] , On the one side, b i ≥ 1 e m j=1 σ(⟨w (t) j ,ξ ′ i ⟩){+lower order term} + 1 ≥ 1 e m(Γ ′(t) i ) q {+lower order term} + 1 ≥ 1 e Θ(m -(q-1) ) + 1 = 1 2 + o(1) = 1 2 -o(1). On the other side, according to Lemma 5.4, we have Λ(t) 1 = O(d -1 4 ), it follows that b (t) i ≤ e m( Λ(t) 1 ) q +o(1) e m j=1 σ(⟨w (t) j ,ξ ′ i ⟩)+{lower order term} + e m( Λ(t) 1 ) q +o(1) = 1 + o(1) e m j=1 σ(⟨w (t) j ,ξ ′ i ⟩)+{lower order term} + 1 + o(1) ≤ 1 + o(1) 1 + 1 + o(1) = 1 2 + o(1). Therefore, we have b (t) i = 1/2 ± o(1) and the other case of y i = -1 can be proved in a similar way. With the help of above lemma, we are now ready to prove Lemma 5.2. Proof of Lemma 5.2. 
Let j * = arg max 1≤j≤m ⟨w (t+1) j , v⟩ and note that u j = 1, according to Lemma A.15, we have Λ (t+1) 1 = ⟨w (t+1) j * , v⟩ = (1 -ηλ) • ⟨w (t) j * , v⟩ + qη n l n l i=1 b (t) i [⟨w (t) j * , y ′ i • v⟩] q-1 + ∥v∥ 2 2 = (1 -ηλ) • ⟨w (t) j * , v⟩ + qη n l i∈S ′ 1 b (t) i [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 + qη n l i∈S ′ -1 b (t) i [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 = (1 -ηλ) • ⟨w (t) j * , v⟩ + qη n l i∈S ′ 1 1 2 ± o(1) [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ♣ + qη n l i∈S ′ -1 1 2 ± o(1) [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 ⋆ (A.34) For ♣, we have i∈S ′ 1 1 2 ± o(1) [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ♣ = n ′ 1 • 1 2 ± o(1) • [⟨w (t) j * , v⟩] q-1 + ∥v∥ 2 2 ≤ n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ (t) 1 ) q-1 . (A.35) For ⋆, we have i∈S ′ -1 1 2 ± o(1) [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 ⋆ = n ′ -1 • 1 2 ± o(1) • [⟨w (t) j * , -v⟩] q-1 + ∥v∥ 2 2 ≤ n ′ -1 • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ(t) -1 ) q-1 . (A.36) By plugging (A.35) and (A.36) in (A.34), and according to Lemma A.8, we have with probability at least 1 -4δ that Λ (t+1) 1 ≤ (1 -ηλ) • Λ (t) 1 + qη n l n ′ 1 • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ (t) 1 ) q-1 + n ′ -1 • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ(t) -1 ) q-1 ≤ (1 -ηλ) • Λ (t) 1 + qη n l n l 2 + n l 2 log 1 δ • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ (t) 1 ) q-1 + n l 2 + n l 2 log 1 δ • 1 2 ± o(1) • ∥v∥ 2 2 • ( Λ(t) -1 ) q-1 = (1 -ηλ) • Λ (t) 1 + qη 1 4 ± o(1) • ∥v∥ 2 2 • ( Λ (t) 1 ) q-1 + 1 4 ± o(1) • ∥v∥ 2 2 • ( Λ(t) -1 ) q-1 = (1 -ηλ) • Λ (t) 1 + η • Θ(d) • ( Λ (t) 1 ) 2 + ( Λ(t) -1 ) q-1 . And we can prove in the same way that with probability at least 1 -4δ we have Λ (t+1) -1 ≤ (1 -ηλ) • Λ (t) -1 + η • Θ(d) • ( Λ (t) -1 ) q-1 + ( Λ(t) 1 ) q-1 . Let j ⋆ = arg max m+1≤j≤2m ⟨w (t+1) j , v⟩ and note that u j = -1, we have Λ(t+1) 1 = ⟨w (t+1) j ⋆ , v⟩ = (1 -ηλ) • ⟨w (t) j ⋆ , v⟩ - qη n l n l i=1 b (t) i [⟨w (t) j ⋆ , y ′ i • v⟩] q-1 + ∥v∥ 2 2 ≤ (1 -ηλ) • ⟨w (t) j ⋆ , v⟩ ≤ (1 -ηλ) • Λ(t) 1 . (A.37) And we can prove in the same way that Λ(t+1) -1 ≤ (1 -ηλ) • Λ(t) -1 . 
Next, we consider the increasing rate of Γ ′(t) l where l ∈ [n l ] is fixed. If y l = 1, let j ♮ = argmax 1≤j≤m ⟨w (t) j , ξ ′ l ⟩ and note that u j = 1, we have Γ ′(t+1) l ≥ ⟨w (t+1) j ♮ , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ + qη n l n l i=1 b (t) i y ′ i • [⟨w (t) j ♮ , ξ ′ i ⟩] q-1 + • ⟨ξ ′ i , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ + qη n l b (t) l [⟨w (t) j ♮ , ξ ′ l ⟩] q-1 + ∥ξ ′ l ∥ 2 2 + qη n l i∈[n l ],i̸ =l b (t) i y ′ i [⟨w (t) j ♮ , ξ ′ i ⟩] q-1 + ⟨ξ ′ i , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ + qη n l b (t) l [⟨w (t) j ♮ , ξ ′ l ⟩] q-1 + ∥ξ ′ l ∥ 2 2 {± lower order terms} ≥ (1 -ηλ) • Γ ′(t) l + qη n l • 1 2 -o(1) • ∥ξ ′ l ∥ 2 2 • (Γ ′(t) l ) q-1 = (1 -ηλ) • Γ ′(t) l + η • Θ(d 1+2ϵ ) • (Γ ′(t) l ) q-1 , (A.38) where the third equality holds if we properly choose the order of λ. If y l = -1, let j ♯ = argmax m+1≤j≤2m ⟨w (t) j , ξ ′ l ⟩ and note that u j = -1, we have Γ ′(t+1) l ≥ ⟨w (t+1) j ♮ , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ - qη n l n l i=1 b (t) i y ′ i • [⟨w (t) j ♮ , ξ ′ i ⟩] q-1 + • ⟨ξ ′ i , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ + qη n l b (t) l [⟨w (t) j ♮ , ξ ′ l ⟩] q-1 + ∥ξ ′ l ∥ 2 2 - qη n l i∈[n l ],i̸ =l b (t) i y ′ i [⟨w (t) j ♮ , ξ ′ i ⟩] q-1 + ⟨ξ ′ i , ξ ′ l ⟩ = (1 -ηλ) • ⟨w (t) j ♮ , ξ ′ l ⟩ + qη n l b (t) l [⟨w (t) j ♮ , ξ ′ l ⟩] q-1 + ∥ξ ′ l ∥ 2 2 {± lower order terms} ≥ (1 -ηλ) • Γ ′(t) l + qη n l • 1 2 -o(1) • ∥ξ ′ l ∥ 2 2 • (Γ ′(t) l ) q-1 = (1 -ηλ) • Γ ′(t) l + η • Θ(d 1+2ϵ ) • (Γ ′(t) l ) q-1 , (A.39) where the third equality holds if we properly choose the order of λ. According to (A.38) and (A.39), we always have Γ ′(t+1) l ≥ (1 -ηλ) • Γ ′(t) l + η • Θ(d 1+2ϵ ) • (Γ ′(t) l ) q-1 . A.6 PROOF OF LEMMA 5.4 By applying Lemma C.4 to Γ (t) i and taking C 1 = Θ(d 1+2ϵ ), δ = 1/2, A = Θ(1/m), we have t≥0,Γ (t) i ≤A η ≤ Θ(1/C 1 Γ (t) i ) = Θ(d -3 4 -3ϵ ). 
and following upper bound for f W (T 0 ) (x i ): f W (T 0 ) (x i ) = m j=1 σ ⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ - 2m j=m+1 σ ⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ ≤ m( Λ (T0) 1 ) q + m(Γ (T0) i ) q - Λ(T0) 1 q -Γ (T0) i q ≤ ( Λ (T0) 1 ) q {+ lower order terms}. If y i = -1, we have following upper bound for f W (T 0 ) (x i ): f W (T 0 ) (x i ) = m j=1 σ -⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ - 2m j=m+1 σ -⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ ≤ m Λ(T0) -1 q + m Γ (T0) i q -Λ (T0) -1 q -Γ (T0) i q ≤ -Λ (T0) -1 q {+ lower order terms}, and following lower bound for f W (T 0 ) (x i ): f W (T 0 ) (x i ) = m j=1 σ -⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ - 2m j=m+1 σ -⟨w (T0) j , v⟩ + σ ⟨w (T0) j , ξ i ⟩ ≥ Λ(T0) -1 q + Γ (T0) i q -m Λ (T0) -1 q -m Γ (T0) i q ≥ -m Λ(T0) -1 q {lower order terms}. Therefore, for unlabeled data, we have y i •f W (T 0 ) (x i ) ∈ 1-o(1) •( Λ (T0) yi ) q , m+o(1) •( Λ (T0) yi ) q and hence sign f W (T 0 ) (x i ) = sign(y i ) holds with high probability. We can also prove for labeled data (x ′ i , y ′ i ) that y ′ i •f W (T 0 ) (x ′ i ) ∈ 1-o(1) •( Λ (T0) y ′ i ) q , m+o(1) •( Λ (T0) y ′ i ) q , sign f W (T 0 ) (x ′ i ) = sign(y ′ i ) in the same way. Note that y i takes y i with probability p, -y i with probability p and n l = o(n u ), the first statement in this lemma follows obviously. To prove the other two statement, we need to give an upper bound for the norm of w j . 
According to the update rule of w (t) j , we have w (t+1) j = (1 -ηλ) • w (t) j + qηu j n l + n u • nu i=1 c i y i [⟨w (t) j , y i • v⟩] q-1 + • y i • v + [⟨w (t) j , ξ i ⟩] q-1 + • ξ i + n l i=1 b i y ′ i [⟨w (t) j , y ′ i • v⟩] q-1 + • y ′ i • v + [⟨w (t) j , ξ ′ i ⟩] q-1 + • ξ ′ i , leading to ∥w (t+1) j ∥ 2 ≤ (1 -ηλ) • ∥w (t) j ∥ 2 + qη n l + n u • nu i=1 [⟨w (t) j , y i • v⟩] q-1 + • ∥v∥ 2 + [⟨w (t) j , ξ i ⟩] q-1 + • ∥ξ i ∥ 2 + n l i=1 [⟨w (t) j , y ′ i • v⟩] q-1 + • ∥v∥ 2 + [⟨w (t) j , ξ ′ i ⟩] q-1 + • ∥ξ ′ i ∥ 2 ≤ (1 -ηλ) • ∥w (t) j ∥ 2 + qη n l + n u • (n l + n u ) • ∥v∥ 2 • max r∈{±1} { Λ (t) r , Λ(t) r } q-1 + i∈[nu] ∥ξ i ∥ 2 + i∈[n l ] ∥ξ ′ i ∥ 2 • Γ (t) q-1 ≤ ∥w (t) j ∥ 2 + η • Θ(d 1 2 ) • Θ(1) + Θ(d 1 2 +ϵ ) • O(d (q-1)(-1 4 +ϵ) ) = ∥w (t) j ∥ 2 + η • Θ(d 2 ), (A.43) Lemma A.17 (Logarithmic increasing rate). For any learning rate η > 0, a k will always increase for ∈ [K] and hence (t) 1 = K k=1 a (t) k . And it holds that a (t) 1 = Θ(log(t)). In order to give the increasing rate of a (t) 1 , we introduce and prove the following lemma: Lemma A.18. Consider following sequence {x t } ∞ t=1 with x t+1 = x t + C • a -xt , x 0 = 0, where a > 1 and C > 0 are constants, and it follows that log a ln a • C • t + 1 ≤ x t ≤ log a ln a • C • t + 1 + C, and x t+1 -x t ≤ C C • ln a • t + 1 . Proof of Lemma A.18. Note that x i+1 -x i = C • a -xi ⇐⇒ a xi (x i+1 -x i ) = C, by adding up above equation from i = 0 to i = t -1, we have t-1 i=0 a xi (x i+1 -x i ) = C • t (A.45) =⇒ xt x0 a x dx ≥ C • t =⇒ a xt -a x0 ln a ≥ C • t =⇒ a xt ≥ C • ln a • t + 1 =⇒ x t ≥ log a C • ln a • t + 1 , x t+1 -x t = C • a -xt ≤ C C•ln a•t+1 , where the first arrow is due to a x is monotone increasing. 
On the other hand,
$$a^{x_{i+1}} = a^{x_i + C\cdot a^{-x_i}} = a^{x_i}\cdot a^{C\cdot a^{-x_i}} \le a^{x_i}\cdot a^{C/(C\cdot\ln a\cdot i+1)} \le a^{x_i}\cdot a^{C},$$
which implies
$$\sum_{i=0}^{t-1} a^{x_{i+1}}\cdot(x_{i+1}-x_i) \le a^{C}\sum_{i=0}^{t-1} a^{x_i}\cdot(x_{i+1}-x_i) \Longrightarrow \sum_{i=0}^{t-1} a^{x_{i+1}}\cdot(x_{i+1}-x_i) \le a^{C}\cdot Ct \Longrightarrow \int_{x_0}^{x_t} a^{x}\,dx \le a^{C}\cdot Ct,$$
where the first arrow is due to (A.45) and the last arrow holds because $a^x$ is monotone increasing. This leads to
$$x_t \le \log_a\big(\ln a\cdot C\cdot a^{C}\cdot t+1\big) \le \log_a\big(\ln a\cdot C\cdot a^{C}\cdot t+a^{C}\big) = \log_a\big(\ln a\cdot C\cdot t+1\big)+C.$$
Therefore, we have
$$\log_a\big(\ln a\cdot C\cdot t+1\big) \le x_t \le \log_a\big(\ln a\cdot C\cdot t+1\big)+C, \quad\text{and}\quad x_{t+1}-x_t \le \frac{C}{\ln a\cdot C\cdot t+1}.$$

Now we are ready to prove Lemma A.17.

Proof of Lemma A.17. Note that we take the downstream model $g_a(x)$ as
$$g_a(x) = \sum_{k=1}^{K} a_k\Big[\sum_{j=1}^{m}\Big(\sigma\big(\langle w_{k,j}^{(T_0^k)}, y\cdot v\rangle\big)+\sigma\big(\langle w_{k,j}^{(T_0^k)}, \xi\rangle\big)\Big) - \sum_{j=m+1}^{2m}\Big(\sigma\big(\langle w_{k,j}^{(T_0^k)}, y\cdot v\rangle\big)+\sigma\big(\langle w_{k,j}^{(T_0^k)}, \xi\rangle\big)\Big)\Big] = \sum_{k=1}^{K} a_k f_{W_k^{(T_0^k)}}(x).$$
Then, we have the following update rule for the model parameter $a$:
$$a_k^{(t+1)} = a_k^{(t)} - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big)\cdot y'_i f_{W_k^{(T_0^k)}}(x'_i),$$
where we initialize $a_k$ as zero for all $k\in[K]$. Next, we prove the following statements by induction: when $t \ge 1$,
• $a_k^{(t)}$, $\forall k\in[K]$, is non-negative and increasing.
• $\|a^{(t)}\|_1 = \sum_{k=1}^{K} a_k^{(t)}$.
• $a_k^{(t+1)} = a_k^{(t)} + \eta\cdot\Theta(1)\cdot\exp\big(-\|a^{(t)}\|_1\cdot\Theta(1)\big)$, $\forall k\in[K]$.

Note that $a_k^{(0)} = 0$ for all $k\in[K]$, and therefore $g_{a^{(0)}}(x'_i) = 0$, $\ell'\big(y'_i\cdot g_{a^{(0)}}(x'_i)\big) = \ell'(0) = -1/2$, and
$$a_k^{(1)} = a_k^{(0)} - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(0)}}(x'_i)\big)\cdot y'_i f_{W_k^{(T_0^k)}}(x'_i) = a_k^{(0)} + \eta\cdot\frac{1}{2n_l}\sum_{i=1}^{n_l} y'_i f_{W_k^{(T_0^k)}}(x'_i) = \eta\cdot\frac{1}{2n_l}\sum_{i=1}^{n_l} y'_i f_{W_k^{(T_0^k)}}(x'_i)$$
for all $k\in[K]$. Since the accuracy of the $k$-th pseudo-labeler satisfies $p_k > 1/2$, according to the proof of Lemma A.16 we have
$$f_{W_k^{(T_0^k)}}(x'_i) = \sum_{j=1}^{m}\Big(\sigma\big(\langle w_{k,j}^{(T_0^k)}, y'_i\cdot v\rangle\big)+\sigma\big(\langle w_{k,j}^{(T_0^k)}, \xi'_i\rangle\big)\Big) - \sum_{j=m+1}^{2m}\Big(\sigma\big(\langle w_{k,j}^{(T_0^k)}, y'_i\cdot v\rangle\big)+\sigma\big(\langle w_{k,j}^{(T_0^k)}, \xi'_i\rangle\big)\Big) = y'_i\cdot\Theta\Big(\big(\Lambda^{(T_0^k)}_{y'_i}\big)^{q}\Big),$$
for all $k\in[K]$.
Therefore
$$a_k^{(1)} = \eta\cdot\frac{1}{2n_l}\sum_{i=1}^{n_l} y'_i f_{W_k^{(T_0^k)}}(x'_i) \ge \frac{\eta}{2}\cdot\Theta\Big(\big(\Lambda^{(T_0^k)}_{y'_i}\big)^{q}\Big) > 0, \quad \forall k\in[K].$$
It follows that $\|a^{(1)}\|_1 = \sum_{k=1}^{K}|a_k^{(1)}| = \sum_{k=1}^{K} a_k^{(1)}$. Note that
$$y'_i\cdot g_{a^{(1)}}(x'_i) = y'_i\cdot\sum_{k=1}^{K} a_k^{(1)} f_{W_k^{(T_0^k)}}(x'_i) = \sum_{k=1}^{K} a_k^{(1)}\cdot y'_i\cdot f_{W_k^{(T_0^k)}}(x'_i) = \sum_{k=1}^{K} a_k^{(1)}\cdot\Theta\Big(\big(\Lambda^{(T_0^k)}_{y'_i}\big)^{q}\Big) = \sum_{k=1}^{K} a_k^{(1)}\cdot\Theta(1) = \|a^{(1)}\|_1\cdot\Theta(1). \tag{A.46}$$
This leads to
$$\ell'\big(y'_i\cdot g_{a^{(1)}}(x'_i)\big) = -\frac{\exp\big(-y'_i\cdot g_{a^{(1)}}(x'_i)\big)}{1+\exp\big(-y'_i\cdot g_{a^{(1)}}(x'_i)\big)} = -c\cdot\exp\big(-y'_i\cdot g_{a^{(1)}}(x'_i)\big) = -c\cdot\exp\big(-\|a^{(1)}\|_1\cdot\Theta(1)\big),$$
where the second equality is due to $y'_i\cdot g_{a^{(1)}}(x'_i) > 0$, $\exp\big(-y'_i\cdot g_{a^{(1)}}(x'_i)\big) < 1$ and $c\in(1/2,1)$; the last equality is due to (A.46). It follows that
$$a_k^{(2)} = a_k^{(1)} - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(1)}}(x'_i)\big)\cdot y'_i f_{W_k^{(T_0^k)}}(x'_i) = a_k^{(1)} + \eta\cdot c\cdot\Theta(1)\cdot\exp\big(-\|a^{(1)}\|_1\cdot\Theta(1)\big), \quad \forall k\in[K],$$
where $c\in(1/2,1)$. This proves the induction hypotheses for $t = 1$. Next, assume the induction hypotheses hold for $t$. For $t+1$, we have
$$a_k^{(t+1)} = a_k^{(t)} - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\underbrace{\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big)}_{<0}\cdot\underbrace{y'_i f_{W_k^{(T_0^k)}}(x'_i)}_{>0} > a_k^{(t)} > 0.$$
It follows that
$$\|a^{(t+1)}\|_1 = \sum_{k=1}^{K} a_k^{(t+1)} \quad\text{and}\quad y'_i\cdot g_{a^{(t+1)}}(x'_i) = \|a^{(t+1)}\|_1\cdot\Theta(1), \tag{A.47}$$
leading to
$$\ell'\big(y'_i\cdot g_{a^{(t+1)}}(x'_i)\big) = -c\cdot\exp\big(-\|a^{(t+1)}\|_1\cdot\Theta(1)\big),\ c\in(1/2,1), \qquad a_k^{(t+2)} = a_k^{(t+1)} + \eta\cdot\Theta(1)\cdot\exp\big(-\|a^{(t+1)}\|_1\cdot\Theta(1)\big),\ \forall k\in[K].$$
This indicates that if the induction hypotheses hold for $t$, then they hold for $t+1$. Summing over $k\in[K]$, we obtain
$$\|a^{(t+1)}\|_1 = \|a^{(t)}\|_1 + \eta\cdot\Theta(1)\cdot\exp\big(-\Theta(1)\cdot\|a^{(t)}\|_1\big). \tag{A.48}$$
According to Lemma A.18, we know that $\|a^{(t)}\|_1 = \log t/\Theta(1)$ {± lower order terms w.r.t. $t$}.

The following lemma (Lemma A.19) gives the convergence guarantee of the downstream task; it means that within polynomially many steps, gradient descent is guaranteed to find a point with small gradient.

Proof of Lemma A.19.
Note that
$$\|\nabla_a L_{S'}(a^{(t)})\|_1 = \sum_{k=1}^{K}\big|\partial_{a_k} L_{S'}(a^{(t)})\big| = -\sum_{k=1}^{K}\partial_{a_k} L_{S'}(a^{(t)}) = \sum_{k=1}^{K}\frac{a_k^{(t+1)}-a_k^{(t)}}{\eta} = \frac{\|a^{(t+1)}\|_1-\|a^{(t)}\|_1}{\eta}.$$
Then, according to Lemma A.18 and (A.48), we know
$$\|a^{(t+1)}\|_1-\|a^{(t)}\|_1 \le \frac{\eta\cdot\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}. \tag{A.49}$$
It follows that
$$\|\nabla_a L_{S'}(a^{(t)})\|_1 \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1},$$
which shows that within polynomially many steps, gradient descent is guaranteed to find a point with small gradient. Note that
$$\partial_{a_k} L_{S'}(a) = \frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a}(x'_i)\big)\cdot y'_i f_{W_k^{(T_0^k)}}(x'_i), \qquad \partial_{a_k}\partial_{a_j} L_{S'}(a) = \frac{1}{n_l}\sum_{i=1}^{n_l}\ell''\big(y'_i\cdot g_{a}(x'_i)\big)\cdot f_{W_k^{(T_0^k)}}(x'_i)\cdot f_{W_j^{(T_0^j)}}(x'_i)$$
for all $k, j\in[K]$. Denote $\big(f_{W_1^{(T_0^1)}}(x'_i),\cdots,f_{W_K^{(T_0^K)}}(x'_i)\big)^{\top}$ as $f_{W^*}(x'_i)$; then
$$\nabla_a^2 L_{S'}(a) = \frac{1}{n_l}\sum_{i=1}^{n_l}\ell''\big(y'_i\cdot g_{a}(x'_i)\big)\cdot f_{W^*}(x'_i)\cdot f_{W^*}(x'_i)^{\top}.$$
Since $f_{W^*}(x'_i)\cdot f_{W^*}(x'_i)^{\top}$ is a non-negative definite matrix, $\ell''(z) = \exp(-z)/\big(1+\exp(-z)\big)^2 > 0$, and a sum of non-negative definite matrices is still non-negative definite, it follows that $\nabla_a^2 L_{S'}(a) \succeq 0$.

where the last equality is due to $a_k^{(T_{dt})} > 0$ according to Lemma A.17. For the test loss, we have
$$L_D\big(\ell(y\cdot g_{a^{(T_{dt})}}(x))\big) = \mathbb{E}_{(x,y)\sim D}\big[\ell(y\cdot g_{a^{(T_{dt})}}(x))\big],$$
i.e., we estimate the magnitude of $\ell(y\cdot g_{a^{(t)}}(x))$ for newly generated data $(x, y)$. To do so, we first estimate $\ell(y'_i\cdot g_{a^{(t)}}(x'_i))$. Then, we show that $\ell(y\cdot g_{a^{(t)}}(x))$ and $\ell(y'_i\cdot g_{a^{(t)}}(x'_i))$ are nearly equal. According to the update rule of $a_k^{(t)}$, we have
$$a_k^{(t+1)} = a_k^{(t)} - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big)\cdot y'_i f_{W_k^{(T_0^k)}}(x'_i).$$
Summing the above equation over $k\in[K]$, we obtain
$$\|a^{(t+1)}\|_1 = \|a^{(t)}\|_1 - \eta\cdot\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big)\cdot y'_i\sum_{k=1}^{K} f_{W_k^{(T_0^k)}}(x'_i).$$
According to (A.49), we have
$$\|a^{(t+1)}\|_1-\|a^{(t)}\|_1 \le \frac{\eta\cdot\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1},$$
and therefore
$$-\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big)\cdot y'_i\sum_{k=1}^{K} f_{W_k^{(T_0^k)}}(x'_i) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}.$$
Since $K = \Theta(1)$ and $y'_i\cdot f_{W_k^{(T_0^k)}}(x'_i) = \Theta(1)$ for all $k\in[K]$, it follows that
$$-\frac{1}{n_l}\sum_{i=1}^{n_l}\ell'\big(y'_i\cdot g_{a^{(t)}}(x'_i)\big) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}.$$
Since $n_l = \Theta(1)$, according to Lemma A.8 there exist a positive sample $(x'_{i_1}, y'_{i_1})$ and a negative sample $(x'_{i_2}, y'_{i_2})$ with the property that
$$-\ell'\big(y'_{i_1}\cdot g_{a^{(t)}}(x'_{i_1})\big) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}, \qquad -\ell'\big(y'_{i_2}\cdot g_{a^{(t)}}(x'_{i_2})\big) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}.$$
Since $\ell(z) = \log(1+\exp(-z))$ and $\ell'(z) = -\exp(-z)/\big(1+\exp(-z)\big)$, we know that for $z > 0$,
$$-\ell'(z) = c\cdot\exp(-z), \qquad \ell(z) < \exp(-z) = -\ell'(z)/c, \qquad c\in(1/2,1).$$

It follows that

$$\ell\big(y'_{i_1}\cdot g_{a^{(t)}}(x'_{i_1})\big) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}, \qquad \ell\big(y'_{i_2}\cdot g_{a^{(t)}}(x'_{i_2})\big) \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}.$$
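The $1/t$ decay above rests on the logarithmic growth of the recursion in Lemma A.18. The following is a minimal numerical sketch of that recursion, not part of the original analysis: it iterates $x_{t+1} = x_t + C\cdot a^{-x_t}$ from $x_0 = 0$ and checks the two-sided bound and the increment bound stated in the lemma. The function names, the default constants, and the floating-point tolerance `tol` are illustrative assumptions.

```python
import math


def simulate(C, a, steps):
    """Iterate x_{t+1} = x_t + C * a**(-x_t) from x_0 = 0; return [x_0, ..., x_steps]."""
    xs = [0.0]
    for _ in range(steps):
        xs.append(xs[-1] + C * a ** (-xs[-1]))
    return xs


def lower_bound(C, a, t):
    """Lower bound of Lemma A.18: log_a(ln(a) * C * t + 1)."""
    return math.log(math.log(a) * C * t + 1.0, a)


def upper_bound(C, a, t):
    """Upper bound of Lemma A.18: the lower bound shifted by C."""
    return lower_bound(C, a, t) + C


def check_lemma_a18(C=0.5, a=math.e, steps=10_000, tol=1e-9):
    """Return True if both statements of Lemma A.18 hold for t = 0, ..., steps - 1.

    tol is a hypothetical slack that only absorbs floating-point rounding.
    """
    xs = simulate(C, a, steps)
    for t in range(steps):
        within = lower_bound(C, a, t) - tol <= xs[t] <= upper_bound(C, a, t) + tol
        small_step = xs[t + 1] - xs[t] <= C / (C * math.log(a) * t + 1.0) + tol
        if not (within and small_step):
            return False
    return True
```

Calling `check_lemma_a18()` with other constants $a > 1$ and $C > 0$ exercises the same bounds; in the downstream analysis the role of $x_t$ is played by $\|a^{(t)}\|_1$ up to $\Theta(1)$ factors.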

B PROOF OF SUPERVISED LEARNING SETTING

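Here we prove Theorem 4.4. First, we give the following lemma to facilitate the proof.

Lemma B.1 (Gradient Calculation). The gradient of the loss function $L_{S'}(W)$ with respect to the weight parameter $w_j$ is
$$\nabla_{w_j} L_{S'}(W) = -\frac{q u_j}{n_l}\sum_{i=1}^{n_l} b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big),$$
where $u_j := \mathbb{1}_{[1\le j\le m]} - \mathbb{1}_{[m+1\le j\le 2m]}$ and $-\ell'\big(y'_i\cdot f_W(x'_i)\big) = \exp[-y'_i\cdot f_W(x'_i)]/\big(1+\exp[-y'_i\cdot f_W(x'_i)]\big)$ is denoted as $b_i$.

Proof of Lemma B.1. When $1\le j\le m$,
$$\nabla_{w_j}\ell\big(y'_i\cdot f_W(x'_i)\big) = \ell'\big(y'_i\cdot f_W(x'_i)\big)\cdot y'_i\cdot\nabla_{w_j} f_W(x'_i) = -b_i\cdot y'_i\cdot\nabla_{w_j} f_W(x'_i) = -b_i y'_i\Big(\sigma'\big(\langle w_j, y'_i\cdot v\rangle\big)\cdot y'_i\cdot v + \sigma'\big(\langle w_j, \xi'_i\rangle\big)\cdot\xi'_i\Big) = -q b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big),$$
and when $m+1\le j\le 2m$,
$$\nabla_{w_j}\ell\big(y'_i\cdot f_W(x'_i)\big) = q b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big).$$
Combining the above two cases, we have
$$\nabla_{w_j}\ell\big(y'_i\cdot f_W(x'_i)\big) = -q\big(\mathbb{1}_{[1\le j\le m]} - \mathbb{1}_{[m+1\le j\le 2m]}\big) b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big) = -q u_j b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big),$$
and therefore
$$\nabla_{w_j} L_{S'}(W) = \frac{1}{n_l}\sum_{i=1}^{n_l}\nabla_{w_j} L_i(W) = \frac{1}{n_l}\sum_{i=1}^{n_l}\nabla_{w_j}\ell\big(y'_i\cdot f_W(x'_i)\big) = -\frac{q u_j}{n_l}\sum_{i=1}^{n_l} b_i y'_i\Big([\langle w_j, y'_i\cdot v\rangle]_+^{q-1}\cdot y'_i\cdot v + [\langle w_j, \xi'_i\rangle]_+^{q-1}\cdot\xi'_i\Big).$$

Proof. Define $\widetilde{w}_j := m^{1/q}\cdot w_j$; then we have
$$f_W(x) = \sum_{j=1}^{m}\Big(\sigma\big(\langle m^{-1/q}\cdot\widetilde{w}_j, y\cdot v\rangle\big)+\sigma\big(\langle m^{-1/q}\cdot\widetilde{w}_j, \xi\rangle\big)\Big) - \sum_{j=m+1}^{2m}\Big(\sigma\big(\langle m^{-1/q}\cdot\widetilde{w}_j, y\cdot v\rangle\big)+\sigma\big(\langle m^{-1/q}\cdot\widetilde{w}_j, \xi\rangle\big)\Big).$$
Since the standard deviation of the Gaussian initialization of $w_j$ is $\sigma_0$ and $\widetilde{w}_j := m^{1/q}\cdot w_j$, the standard deviation of the Gaussian initialization of $\widetilde{w}_j$ is $m^{1/q}\sigma_0 := \widetilde{\sigma}_0$.

Therefore,
$$\sum_{t\in[\tau_g,\tau_{g+1})}\eta\cdot C_t\big[(1+\delta)^{g}x_0\big]^{q-1} \le \delta(1+\delta)^{g}x_0 + \eta\cdot C_{\tau_{g+1}-1}(1+\delta)^{(q-1)(g+1)}x_0^{q-1},$$
$$\sum_{t\in[\tau_g,\tau_{g+1})}\eta\cdot C_t\big[(1+\delta)^{g+1}x_0\big]^{q-1} \ge \delta(1+\delta)^{g}x_0 - \eta\cdot C_{\tau_g-1}(1+\delta)^{(q-1)g}x_0^{q-1}.$$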
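These imply that
$$\sum_{t\in[\tau_g,\tau_{g+1})}\eta\cdot C_t \le \frac{\delta}{(1+\delta)^{(q-2)g}x_0} + \eta\cdot C_{\tau_{g+1}-1}(1+\delta)^{q-1} \le \frac{\delta}{(1+\delta)^{(q-2)g}x_0} + \eta\cdot C_2(1+\delta)^{q-1},$$
$$\sum_{t\in[\tau_g,\tau_{g+1})}\eta\cdot C_t \ge \frac{\delta}{(1+\delta)^{(q-2)g+(q-1)}x_0} - \eta\cdot C_{\tau_g-1}(1+\delta)^{-(q-1)} \ge \frac{\delta}{(1+\delta)^{(q-2)g+(q-1)}x_0} - \eta\cdot C_2(1+\delta)^{-(q-1)}.$$
Recall that $b$ is the smallest integer such that $(1+\delta)^{b}x_0 \ge A$, so we can calculate that
$$\sum_{t\ge 0,\,x_t\le A}\eta\cdot C_t \ge \sum_{g=0}^{b-1}\frac{\delta}{(1+\delta)^{(q-2)g+(q-1)}x_0} - \eta\cdot C_2(1+\delta)^{-(q-1)}b = \frac{\delta\big(1-(1+\delta)^{-(q-2)b}\big)}{(1+\delta)^{q-1}\big(1-(1+\delta)^{-(q-2)}\big)x_0} - \eta\cdot C_2(1+\delta)^{-(q-1)}b \ge \frac{\delta\big(1-(x_0/A)^{q-2}\big)}{(1+\delta)^{q-1}\big(1-(1+\delta)^{-(q-2)}\big)x_0} - \eta\cdot C_2(1+\delta)^{-(q-1)}b,$$
$$\sum_{t\ge 0,\,x_t\le A}\eta\cdot C_t \le \sum_{g=0}^{b-1}\frac{\delta}{(1+\delta)^{(q-2)g}x_0} + \eta\cdot C_2(1+\delta)^{q-1}b = \frac{\delta\big(1-(1+\delta)^{-(q-2)b}\big)}{\big(1-(1+\delta)^{-(q-2)}\big)x_0} + \eta\cdot C_2(1+\delta)^{q-1}b \le \frac{\delta}{\big(1-(1+\delta)^{-(q-2)}\big)x_0} + \eta\cdot C_2(1+\delta)^{q-1}b,$$
where the last inequality of the lower bound is due to $(1+\delta)^{b}x_0 \ge A$. Note that $(1+\delta)^{b-1}x_0 < A$, i.e., $b \le 1+\frac{\log(A/x_0)}{\log(1+\delta)}$; therefore
$$\sum_{t\ge 0,\,x_t\le A}\eta\cdot C_t \le \frac{\delta}{\big(1-(1+\delta)^{-(q-2)}\big)x_0} + \eta\cdot C_2(1+\delta)^{q-1}\Big(1+\frac{\log(A/x_0)}{\log(1+\delta)}\Big),$$
$$\sum_{t\ge 0,\,x_t\le A}\eta\cdot C_t \ge \frac{\delta\big(1-(x_0/A)^{q-2}\big)}{(1+\delta)^{q-1}\big(1-(1+\delta)^{-(q-2)}\big)x_0} - \eta\cdot C_2(1+\delta)^{-(q-1)}\Big(1+\frac{\log(A/x_0)}{\log(1+\delta)}\Big).$$
Since $C_1 \le C_t \le C_2$, we have
$$\sum_{t\ge 0,\,x_t\le A}\eta \le \frac{\delta}{\big(1-(1+\delta)^{-(q-2)}\big)x_0 C_1} + \frac{\eta\cdot C_2}{C_1}(1+\delta)^{q-1}\Big(1+\frac{\log(A/x_0)}{\log(1+\delta)}\Big),$$
$$\sum_{t\ge 0,\,x_t\le A}\eta \ge \frac{\delta\big(1-(x_0/A)^{q-2}\big)}{(1+\delta)^{q-1}\big(1-(1+\delta)^{-(q-2)}\big)x_0 C_2} - \eta\cdot(1+\delta)^{-(q-1)}\Big(1+\frac{\log(A/x_0)}{\log(1+\delta)}\Big).$$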
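The two bounds of Lemma C.4 can be compared against a direct simulation. The sketch below is an illustration under stated assumptions rather than part of the proof: it fixes $q = 3$ (so that $x_0^{q-2} = x_0$), takes a constant coefficient $C_t = C_1 = C_2 = C$, and measures $\sum_{t\ge 0,\,x_t\le A}\eta$ as $\eta$ times the number of iterates not exceeding $A$. The function names and the default parameter values are hypothetical.

```python
import math


def hitting_time(x0, A, C, eta, q=3):
    """Return eta * #{t >= 0 : x_t <= A} for x_{t+1} = x_t + eta * C * x_t**(q-1)."""
    x, steps = x0, 0
    while x <= A:
        x += eta * C * x ** (q - 1)
        steps += 1
    return eta * steps


def lemma_c4_bounds(x0, A, C1, C2, eta, delta, q=3):
    """Evaluate the lower and upper bounds of Lemma C.4 on the hitting time.

    Written with x0**(q - 2), which coincides with x0 for the assumed q = 3.
    """
    growth = 1.0 + delta
    shrink = 1.0 - growth ** (-(q - 2))
    # Upper bound on the number b of geometric epochs between x0 and A.
    epochs = 1.0 + math.log(A / x0) / math.log(growth)
    upper = (delta / (shrink * x0 ** (q - 2) * C1)
             + eta * (C2 / C1) * growth ** (q - 1) * epochs)
    lower = (delta * (1.0 - (x0 / A) ** (q - 2))
             / (growth ** (q - 1) * shrink * x0 ** (q - 2) * C2)
             - eta * growth ** (-(q - 1)) * epochs)
    return lower, upper


def compare(x0=0.01, A=1.0, C=1.0, eta=1e-3, delta=0.1):
    """Return (lower, simulated, upper) for a constant coefficient C_t = C."""
    lower, upper = lemma_c4_bounds(x0, A, C, C, eta, delta)
    return lower, hitting_time(x0, A, C, eta), upper
```

For small $\eta$ the simulated time is close to the continuous-time value $(1/x_0 - 1/A)/C$, and shrinking $\delta$ tightens both bounds around it, which mirrors the limiting argument (taking $\delta = 1/k$ with $k \to \infty$) used in the proof of Lemma A.14.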



https://github.com/uclaml/SSL Pseudo Labeler



$[x]_+$ to denote $\max\{x, 0\}$. For a vector $v = (v_1,\cdots,v_d)^{\top}$, we denote by $\|v\|_2$ its $\ell_2$ norm, and use $\mathrm{supp}(v) := \{j : v_j \neq 0\}$ to denote its support. For two sequences $\{a_k\}$ and $\{b_k\}$, we denote $a_k = O(b_k)$ if $|a_k| \le C|b_k|$ for some absolute constant $C$, denote $a_k = \Omega(b_k)$ if $b_k = O(a_k)$, and denote $a_k = \Theta(b_k)$ if $a_k = O(b_k)$ and $a_k = \Omega(b_k)$. We also denote $a_k = o(b_k)$ if $\lim |a_k/b_k| = 0$. Finally, we use $\widetilde{\Theta}(\cdot)$, $\widetilde{O}(\cdot)$ and $\widetilde{\Omega}(\cdot)$ to omit logarithmic terms in the notations.

Figure 1: The general pipeline of semi-supervised learning with pre-training and linear probing.

Figure 2: Illustration of our model. The left figure shows the semi-supervised pre-training scheme: the NN is trained by minimizing the errors between the pseudo-labels $y$ and the predictions $f_W(x)$. After semi-supervised pre-training has finished, the learned parameters $\{W_k^*\}_{k=1}^{K}$ serve as pre-trained models and are adapted to a downstream task using linear probing, as shown in the right figure.

neural networks trained according to the K pre-training tasks, and consider the learning of the downstream task based in f W Condition 4.1, after T ′ = Θ(d 0.1 /η) iterations with learning rate η = Θ(1), with probability 1 -o(

Figure 3: Visualization of the feature learning and noise memorization in the training process. (Left: Semi-supervised, Right: Supervised)

                  Semi-supervised                      Supervised
                  Pre-train         Downstream
Training error    0.1753±0.0259     0                  0
Test error        0                 0                  0.4982±0.0208
Training loss     0.4155±0.0418     0.0150±0.0022      (6.473±5.031)×10^-7
Test loss         0.2200±0.0886     0.0182±0.0021      0.6931±0.0005

Table 1: Training error and loss, test error and loss for semi-supervised and supervised learning.
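One way to read the supervised column of Table 1, offered here as an interpretation rather than a claim made by the experiments: under the logistic loss $\ell(z) = \log(1+\exp(-z))$ used throughout the paper, a predictor whose output is zero incurs loss $\log 2$, which agrees with the reported supervised test loss to four decimals.

```latex
\ell(0) = \log\bigl(1 + e^{0}\bigr) = \log 2 \approx 0.6931
```

This is consistent with the supervised test error of roughly $1/2$: on fresh data the supervised network appears to produce outputs close to zero, so its predictions carry almost no information about the label.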

j ,v⟩)+{lower order term} + e 2m j=m+1 [σ(⟨w (t) j ,v⟩)+σ(⟨w (t) j ,ξi⟩)]

with high probability, we have ⟨w (t) j * , v⟩ > 0. It follows that nu i=1

A.3, we have η • T * r = Θ(1/q p -1 2 ∥v∥ 2 2 • log(m)σ 0 ∥v∥ 2 ) = Θ(d -3/4 ), which completes the proof.The discussion in this section verifies Lemma 5.3 and provides a clear understanding about howΛ (t) r , Λ(t)r varies within the iteration range [0, T r ] for r ∈ {±1}. Note that the iteration numbers when Λ Θ(1/m) (T 1 and T -1 ) are different, however, since T -1 and T 1 have the same magnitude, it remains clear that although T 1 ̸ = T -1 (wlog, assume T 1 < T -1 ), we still have Λ the iteration range [T 1 , T -1 ], since off-diagonal feature learning also costs time no less than order Θ(1/ησ 0 ∥v∥ 3 2 √ log m), which is higher order than |T 1 -T -1 | = Θ(1/ησ 0 ∥v∥ 3 2 log m), according to (A.33) and Lemma A.3. Therefore, at time T 0

Lemma A.19 (Convergence Guarantee). For any learning rate $\eta > 0$, $\|\nabla_a L_{S'}(a^{(t)})\|_1 \le \frac{\Theta(1)}{\eta\cdot\Theta(1)\cdot t+1}$ and $\nabla_a^2 L_{S'}(a) \succeq 0$ for any $a\in\mathbb{R}^{K}$,

Theorem A.20 (Restatement of Theorem 4.3). Under the semi-supervised learning setting, for the downstream task, suppose $K$ early stopped classifiers $\{f_{W_k^*}\}_{k=1}^{K}$ are obtained after the pre-training of $K$ CNN models has finished. Then, after $T_{dt} = \Theta(d^{0.1}/\eta)$ iterations with learning rate $\eta = \Theta(1)$, we can find a linear model $a^{(T_{dt})}$ for which both the test error and the test loss are nearly 0, i.e., $\mathbb{P}_{(x,y)\sim D}[y\cdot g_{a^{(T_{dt})}}(x) \le 0] = o(1)$ and $L_D\big(\ell(y\cdot g_{a^{(T_{dt})}}(x))\big) = o(1)$.

Proof of Theorem A.20. For the test error, we have $\mathbb{P}_{(x,y)\sim D}[y\cdot g_{a^{(T_{dt})}}(x) \le 0] = \mathbb{P}_{(x,y)\sim D}$

of Theorem 4.4. Recall the definition of f W in (3.1) that f W (x) = m j=1 σ ⟨w j , y • v⟩ + σ ⟨w j , ξ⟩ -2m j=m+1 σ ⟨w j , y • v⟩ + σ ⟨w j , ξ⟩ .

⟨ w j , y • v⟩ + σ ⟨ w j , ξ⟩ -1 m 2m j=m+1 σ ⟨ w j , y • v⟩ + σ ⟨ w j , ξ⟩ : = f W (x).
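The rescaling step used here relies only on the activation $\sigma(z) = [z]_+^{q}$ being $q$-homogeneous, so that $\sigma(\langle m^{-1/q}\cdot w, x\rangle) = \frac{1}{m}\sigma(\langle w, x\rangle)$. The following is a small numerical sketch of that identity, under illustrative assumptions: the dimensions, the random draws, and all function names are hypothetical, and plain Python lists stand in for vectors.

```python
import random


def sigma(z, q):
    """Polynomial ReLU activation [z]_+^q."""
    return max(z, 0.0) ** q


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def f_original(W, signal, noise, q):
    """f_W(x): the first m filters enter with sign +1, the last m with sign -1."""
    m = len(W) // 2
    total = 0.0
    for j, w in enumerate(W):
        sign = 1.0 if j < m else -1.0
        total += sign * (sigma(dot(w, signal), q) + sigma(dot(w, noise), q))
    return total


def f_rescaled(W_tilde, signal, noise, q):
    """Rescaled network: the same signed sum over the rescaled filters, times 1/m."""
    m = len(W_tilde) // 2
    return f_original(W_tilde, signal, noise, q) / m


def rescaling_demo(m=4, d=6, q=3, seed=0):
    """Return (f_W(x), rescaled value) for one random draw; the two should agree."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(2 * m)]
    signal = [rng.gauss(0.0, 1.0) for _ in range(d)]  # plays the role of y * v
    noise = [rng.gauss(0.0, 1.0) for _ in range(d)]   # plays the role of xi
    # Rescaled filters: each w_j multiplied by m**(1/q).
    W_tilde = [[m ** (1.0 / q) * wi for wi in w] for w in W]
    return f_original(W, signal, noise, q), f_rescaled(W_tilde, signal, noise, q)
```

The two returned values agree up to floating-point rounding, which is the sense in which the original model with initialization scale $\sigma_0$ and the rescaled model with scale $m^{1/q}\sigma_0$ are equivalent.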

note the definition of j * * that Λ

Lemma A.10. For Λ

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their helpful comments. YK, ZC and QG are supported in part by the National Science Foundation IIS-2008981 and the Sloan Research Fellowship.


where C 1 , C 2 = (2p -1) • Θ(d) and C ′ 1 , C ′ 2 = Θ(d 1-2ϵ ) according to (A.25) . Taking A = Θ(1/m), A ′ = C • Γ (t) where C is a large constant and C = Θ(1), δ = δ ′ = 1 2 and note that Λ -1 reach Θ(1/m), Γ (t) remain the same magnitude as initialization.By leveraging tensor power method, we can also estimate the length of Stage I, i.e. T 1 , T -1 , by applying tensor power To use tensor power method, we need to upper-bound the increasing speed of Λ (t) r . We have the following lemma: Lemma A.13. For r ∈ {±1}, we have with high probability thatProof of Lemma A.13. Let j * = arg max 1≤j≤m ⟨w (t+1) j, v⟩ and note thatAnd note the definition of(A.40) In Lemma 5.2, we have already prove thatr , Λ(t) r }, according to (A.41), we haveBy applying Lemma C.4 to Λ (t) , and takingAccording to (A.40) and (A.42), we have

A.7 EMPIRICAL, TEST ERROR AND LOSS FOR EARLY STOPPED CLASSIFIER

Assume the accuracy of pseudo-labeler p is larger than 1/2. We first estimate the empirical loss for early stopped classifier f W (T 0 ) , where T 0 = max r∈{±1} {T r } and T r is defined as the first iteration that Λ (t) r reaches Θ(1/m). According to Section A.4.3 and Lemma A.12, we have ΛWe have the following lemma: Lemma A.16. Early stopped classifier f W (T 0 ) (x) possesses following properties:1. Training error of early stopped classifier f W (T 0 ) (x) is asymptotically 1 -p:Test error is nearly 1 -p, if we use pseudo-label y generated by pseudo-labeler as target:3. Test error is nearly 0, if we use true label y as target:where p is the accuracy of the pseudo-labeler. We can regard p as the probability that x i is paired with true label y i , 1 -p is the probability that x i is paired with wrong label -y i .Proof of Lemma A.16. Recall the definition of f W in (3.1) thatAccording to Section A.4.3 and Lemma A.12, we have Λwhere the first inequality triangle inequality; the second inequality is due to the of, the last inequality is due to Lemma 5.3. According to Lemma A.14, we know thatTherefore, for any (x, y) sampled from distribution D whereAnd this indicates that ⟨w (T0) j, ξ⟩ will still be dominated by ⟨w, v⟩, therefore it holds for newly sampled (x, y) that. This verifies the third statement that test error is nearly zero.For the second statement, note thatwhich verifies the second statement.A.8 DOWNSTREAM TASK For downstream tasks, we use early stopped classifiers, which are stopped when on-diagonal feature Λ (t) r are learned while off-diagonal feature Λ(t) r and noise Γ (t) are not memorized. Assume we have learned K early stopped classifiers fby using n u pseudo-labeled data generated by pseudo-labeler f w 1 , • • • , f w K and n l labeled data. Then, we want to design a classifier on the learned representation fHere we consider training a downstream linear modelwhere a k ∈ R denotes the weight as the k-th pre-trained model. 
Given labeled training data, we want to optimize the empirical loss function where $\ell(z) = \log(1+\exp(-z))$ denotes the cross-entropy loss. We initialize $a$ as zero and optimize the empirical loss function by gradient descent, i.e. In order to estimate the training error and test error for the downstream task, we first introduce the following lemma about the increasing rate of $\|a^{(t)}\|_1$.

where the last equality is due to (A.44) and Lemma 5.3. Plugging (A.52) into (A.51), we have (A.54) Plugging (A.53) and (A.54) into (A.50), we have Taking $\eta = \Theta(1)$ and $T_{dt} = \Theta(d^{\alpha}/\eta)$ where $\alpha > 0$ is a sufficiently small constant, we know that $= o(1)$, which completes the proof.

On the other hand, note that the update rule of $w^{(t)}$, and in Lemma B.1, we have It follows that By plugging $w_j = m^{-1/q}\cdot\widetilde{w}_j$ into (B.1), we have). Therefore, our data model and training algorithm are equivalent to the model and algorithm below:, and we use gradient descent with learning rate $\eta$ and cross-entropy loss to optimize such a data model, i.e. where $\ell(z) = \log(1+\exp(-z))$, $\widetilde{\sigma}_0 = m^{1/q}\sigma_0$. Note that the new model matches the one used in Cao et al. (2022). To leverage their result, we introduce condition 4. 2022)). Dimension $d$ is sufficiently large that $d = \Omega(m^{2\vee[4/(q-2)]}n^{4\vee[(2q-2)/(q-2)]})$. Training sample size $n$ and neural network width $m$ satisfy $n, m = \Omega(\mathrm{polylog}(d))$. Learning rate $\eta$ satisfies $\eta \le O(\min\{\|v\|_2^{-2}, \sigma_p^{-2}d^{-1}\})$. The standard deviation of Gaussian initialization $\sigma_0$ is appropriately chosen such that O(nd

Theorem B.3 (Theorem 4.4 in Cao et al. (2022)). For any $\epsilon > 0$, let then with probability at least $1-d^{-1}$, there exists $0 \le t \le T$ such that:
1. The training loss converges to δ, i.e., $L_S(W^{(t)}) \le$ δ.
2. The trained CNN has a constant order test loss: $L_D(W^{(t)}) = \Theta(1)$.

Note that in our setting, $m = \Theta(\mathrm{polylog}(d))$, $n_l = \Theta(1)$, $\|v\|_2 = \Theta(d)$, and it is not difficult to verify that Condition B.2 holds. Besides, $\mathrm{SNR} = d^{-0.01}$, $n^{-1}\cdot\mathrm{SNR}^{-q} = \Theta(d^{q\epsilon}) = \Omega(1)$.
Therefore, the conclusion of Theorem B.3 holds for

C AUXILIARY LEMMAS

For the estimation of Λ(0) and Λ (0) , we introduce the following Lemma C.1 (Borell-TIS inequality). Let X a centered Gaussian on R m and set σ 2 X := max E(X 2 i ). Then for each t > 0,For the expectation of Λ (0) r and Λ(0) r , we give the following lemma. Lemma C.2. Let Y = max 1≤i≤m X i , where X i ∼ N (0, σ 2 ) are i.i.d. random variables. ThenFor the estimation of ∥ξ i ∥ 2 2 and ⟨ξ i , ξ l ⟩, we introduce following lemma. Lemma C.3 (Lemma B.2 in Cao et al. (2022) ). Suppose that δ > 0 and d = Ω(log(4n/δ)). Then with probability at least 1 -δ,Besides, we introduce following lemma about tensor power method.Lemma C.4. Consider an increasing sequence x t ≥ 0 defined as x t+1 = x t + η • C t x q-1 t , and C 1 ≤ C t ≤ C 2 for all t > 0, then we have for A > x 0 , every δ > 0, and every η > 0:(1 + δ) q-1 1 + log (A/x 0 ) log (1 + δ) , t≥0,xt≤A η ≥ δ 1 -(x 0 /A) q-2 (1 + δ) q-1 1 -(1 + δ) -(q-2) x 0 C 2 -η • (1 + δ) -(q-1) 1 + log (A/x 0 ) log (1 + δ) .Proof of Lemma C.4. For every g = 0, 1, 2, • • • , let τ g be the first iteration such that x t ≥ (1+δ) g x 0 .Let b be the smallest integer such that (1 + δ) b x 0 ≥ A. By the definition of τ g , we have x t ∈ [(1 + δ) g x 0 , (1 + δ) g+1 x 0 ) for all t ∈ [τ g , τ g+1 ) and x τg+1 ≥ (1 + δ) g+1 x 0 , x τg-1 < (1 + δ) g x 0 , leading tofollowing lower bound for x τg+1 -x τg :x τg+1 -x τg = x τg+1 -x τg-1 -η • C τg-1 x q-1 τg-1≥ (1 + δ) g+1 x 0 -(1 + δ) g x 0 -η • C τg-1 [(1 + δ) g x 0 ] q-1 = δ(1 + δ) g x 0 -η • C τg-1 (1 + δ) (q-1)g x q-1 0 , and following upper bound for x τg+1 -x τg :x τg+1 -x τg = x τg+1-1 + η • C τg+1-1 x q-1 τg+1-1 -x τg ≤ (1 + δ) g+1 x 0 + η • C τg+1-1 [(1 + δ) (g+1) x 0 ] q-1 -(1 + δ) g x 0 = δ(1 + δ) g x 0 + η • C τg+1-1 (1 + δ) (q-1)(g+1) x q-1 0 .

