TOWARDS UNDERSTANDING WHY MASK RECONSTRUCTION PRETRAINING HELPS IN DOWNSTREAM TASKS

Abstract

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE (He et al., 2021) and data2vec (Baevski et al., 2022), randomly mask input patches and then reconstruct the pixels or semantic features of the masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional "supervised learning" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two-layered convolution encoder and a one-layered convolution decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, since the pretraining dataset is of huge size and high diversity and thus covers most features in the downstream dataset, the pretrained encoder captures as many features as possible in the downstream dataset during fine-tuning, and provably does not lose these features. In contrast, SL captures only a random subset of the features, as explained by the lottery ticket hypothesis. Hence, MRP provably achieves better performance than SL on classification tasks. Experimental results testify to our data assumptions and our theoretical implications.

1. INTRODUCTION

Self-supervised learning (SSL) has emerged as a popular and effective method to learn unsupervised representations, with great success witnessed on many downstream tasks, e.g. image classification (He et al., 2016a), object detection (Girshick et al., 2015; Tan et al., 2020) and segmentation (Ronneberger et al., 2015; He et al., 2017). In SSL, one first creates an artificial supervised learning problem, a.k.a. a pretext task, whose pseudo labels are obtained from the data itself through careful task design, and then trains a network on this artificial supervised task to learn how to capture useful data features. For example, one representative SSL family, contrastive learning (He et al., 2020a; Chen et al., 2020b), constructs a supervised problem on an unlabeled dataset by regarding random augmentations of an image as a separate class, and then performs supervised instance discrimination. Since it requires no manual annotations and enjoys great success, SSL has paved a new way to solve unsupervised learning problems and has attracted increasing research interest. In this work, we are particularly interested in the recently proposed mask-reconstruction pretraining (MRP) family of SSL (Xie et al., 2021; Dong et al., 2021), e.g. MAE (He et al., 2021) and data2vec (Baevski et al., 2022). The core idea of this MRP family is to randomly mask patches of the input image and then reconstruct the pixels or semantic features of the masked patches via an auto-encoder. After pretraining on a large-scale unsupervised dataset, MRP fine-tunes the encoder on a specific downstream task to learn more task-specific representations. This pretraining mechanism generally enjoys remarkable test performance improvement on the same downstream task, and also much better generalization to out-of-distribution data than standard end-to-end "supervised learning".
It also achieves better fine-tuning performance than other state-of-the-art SSL approaches, including contrastive learning (He et al., 2020a; Chen et al., 2020b) and clustering learning (Caron et al., 2018; Wu et al., 2018). Because of its simplicity and strong compatibility, MRP has attracted wide interest and is seeing increasingly many applications. However, theoretical analyses and understanding of MRP still largely lag behind its practical applications. Specifically, it is not clear how MRP performs feature learning via the mask-reconstruction task, though this is heavily desired. Moreover, the theoretical reasons for the superior test performance of MRP over end-to-end supervised learning are rarely investigated. Most existing theoretical works (Wen & Li, 2021; Arora et al., 2019; Tosh et al., 2021a; b) focus on analyzing contrastive learning, and few works study MRP, which differs greatly from contrastive learning. Cao et al. (2022) analyzed the patch-based attention in MAE via an integral kernel but did not study the core questions of this work, i.e., 1) what features does MRP learn, and 2) why does MRP beat conventional supervised learning?

Contributions. In this work, we provide a theoretical viewpoint to understand the semantic feature learning process of MRP. Moreover, we analyze the test performance of MRP to show its superiority over supervised learning on downstream classification tasks. Our contributions are highlighted below. Firstly, based on the multi-view data assumption from (Allen-Zhu & Li, 2020), where multiple/single discriminative features exist in multi-view/single-view data, we prove that on an auto-encoder with a two-layered convolution encoder and a one-layered convolution decoder, the pretrained encoder in MRP captures all discriminative features of each semantic class in the pretraining dataset. Moreover, each convolution kernel in the encoder captures at most one feature. These properties benefit downstream tasks.
As the pretraining dataset is often much larger than the downstream dataset, it (approximately) covers all features in the downstream dataset, so the kernels of the pretrained encoder also grab the features in downstream datasets well. Besides, as each kernel is associated with at most one feature, the semantic features are not fused together, allowing a network to easily establish the relations between kernels and semantic class labels in the downstream classification task. Secondly, we theoretically show that after fine-tuning on the downstream dataset, MRP enjoys test performance superior to that of end-to-end supervised learning on downstream tasks, using classification as an example. Assuming the pretraining and downstream datasets share the same distribution, we prove that after fine-tuning, MRP classifies new samples correctly with high probability for both multi-view and single-view test data. This result is superior to (Allen-Zhu & Li, 2020), which shows that conventional SL attains only about half test accuracy on single-view test data.

2. RELATED WORKS

SSL approaches. According to the pretext tasks, current SSL approaches can be grouped into contrastive learning, e.g. (Hjelm et al., 2018; Oord et al., 2018), clustering learning, e.g. (Caron et al., 2018; Wu et al., 2018), and mask-reconstruction pretraining (MRP) (He et al., 2021; Baevski et al., 2022). Given random augmentations of an image, contrastive learning, e.g. MoCo (He et al., 2020a) and SimCLR (Chen et al., 2020a), brings different crops of the same image together and pushes crops of different images far away from each other in the feature space. Clustering learning aims to cluster similar samples into the same group. However, both contrastive learning and clustering learning heavily depend on multi-crop augmentations. The recently proposed MRP is a simpler SSL method. This MRP family, e.g. MAE (He et al., 2021) and SimMIM (Xie et al., 2021), randomly masks image patches and then reconstructs the masked patches via an auto-encoder. Later, MaskFeat (Wei et al., 2021) and data2vec (Baevski et al., 2022) empirically found better performance from reconstructing semantic features. MRP has now surpassed end-to-end supervised learning on many downstream tasks, e.g. image classification (Dong et al., 2021) and object detection (He et al., 2021), and is seeing more applications because of its effectiveness and strong compatibility.

SSL analysis. Despite its remarkable success in practice, the theoretical understanding of SSL is still largely absent. Arora et al. (2019) provided generalization guarantees for contrastive learning on linear classification models under the assumption that different positives belong to the same latent class. Wang & Isola (2020) showed that contrastive learning trades off the alignment and uniformity of features on a hypersphere. HaoChen et al. (2021) proposed and analyzed a spectral version of the contrastive loss with provable accuracy guarantees under linear-probing evaluation. Tian et al.
(2020) proved that SimCLR only captures feature variability across data points. However, these theoretical works mainly study contrastive learning, which essentially differs from MRP. The works most closely related to ours are (Lee et al., 2021; Cao et al., 2022). Cao et al. (2022) analyzed the patch-based attention in MAE via an integral kernel, showing the benefits of patchifying, the equivalence between the attention mechanism in MAE and a learnable integral kernel transform, etc. However, they did not reveal any feature properties of MRP or the reasons for MRP's superiority over conventional supervised learning. Lee et al. (2021) showed the benefits of reconstructing one part of the data from another part for reducing the sample complexity of downstream tasks, under the condition that the two parts are independent conditioned on their semantic label. But this independence condition does not hold in realistic cases, where the two parts of the same image share a significant amount of information not explained by the label (Bansal et al., 2020). Moreover, these works do not study how features are learned by networks, which is essential to understanding MRP in practice.

Figure 1: Visualization of ResNet50 (He et al., 2016b) trained by conventional supervised learning. We use Eigen-CAM to localize the class-specific image regions that explain why the model predicts the image as the corresponding class. Though ResNet50 predicts all car images correctly, it actually locates different regions, e.g. front, side window, car nose, taillight, and wheel, for different images, indicating multiple independent features for each class and thus the "multi-view" data assumption.

3. PROBLEM SETUP

Here we first introduce the "multi-view" data assumption from (Allen-Zhu & Li, 2020), then present the framework of mask-reconstruction pretraining (MRP). Finally, following most MRP works, we use a $k$-class classification task as the downstream task for analysis. In this work, we use $O, \Omega, \Theta$ to hide constants w.r.t. $k$, and $\widetilde{O}, \widetilde{\Omega}, \widetilde{\Theta}$ to hide polylogarithmic factors w.r.t. $k$. We use $\mathrm{poly}(k)$ ($\mathrm{polylog}(k)$) to denote $\Theta(k^C)$ ($\Theta(\log^C k)$) with a constant $C > 0$. $[n]$ denotes $\{1, 2, \ldots, n\}$.

3.1. MULTI-VIEW DATA DISTRIBUTION

On realistic data, for each semantic class, there are actually several independent features in effect during classification. As shown in Fig. 1, we adopt Eigen-CAM (Muhammad & Yeasin, 2020) to localize the class-specific regions that explain why a model predicts the image as the corresponding class. Here we test ResNet50 (He et al., 2016b) trained by the PyTorch team in a supervised manner. For all the car images in Fig. 1, though ResNet50 predicts them correctly, Eigen-CAM locates different class-specific regions, e.g. car front, side window, taillight, and wheel, on different images. These results directly testify to the presence of multiple independent discriminative features in a semantic class. Such a data structure is called "multi-view data" and was first verified in (Allen-Zhu & Li, 2020). In the following, we make the multi-view assumption on realistic data for analysis; the mathematical formulation is similar to (Allen-Zhu & Li, 2020). Assume there are $k$ semantic classes, and each data pair is denoted by $(X, y)$, where $X = (x_1, x_2, \ldots, x_P) \in (\mathbb{R}^d)^P$ has $P$ patches (e.g. (non-)overlapping image patches), each of which is $d$-dimensional, and $y \in [k]$ is the label of $X$. Then suppose there are multiple discriminative features associated with each semantic class. For simplicity, we assume two features per class and define the two feature vectors as $v_{i,1}, v_{i,2} \in \mathbb{R}^d$ for each class $i \in [k]$; our analysis technique can also be extended to more features. We further assume the feature vectors are orthonormal, i.e., $\forall i, i' \in [k]$, $\forall l, l' \in [2]$, $\|v_{i,l}\|_2 = 1$ and $v_{i,l} \perp v_{i',l'}$ when $(i, l) \neq (i', l')$. Denote the set of all discriminative features of the $k$ classes as $\mathcal{V} = \{v_{i,1}, v_{i,2}\}_{i=1}^k$. Now we introduce the multi-view distribution $\mathcal{D}_m$ and the single-view distribution $\mathcal{D}_s$: samples from $\mathcal{D}_m$ have multiple features, while samples from $\mathcal{D}_s$ have only a single main feature.
Let $C_p$ be a universal constant, $s$ be a universal parameter to control feature sparsity, $\sigma_p = \frac{1}{\sqrt{d}\,\mathrm{polylog}(k)}$ be a parameter to control the magnitude of random noise, and $\gamma$ be a parameter to control the feature noise.

Definition 3.1 (Multi-view data (Allen-Zhu & Li, 2020)). Data distribution $\mathcal{D}$ consists of data from the multi-view distribution $\mathcal{D}_m$ with probability $1-\mu$ and from the single-view distribution $\mathcal{D}_s$ with probability $\mu$. We define $(X, y) \sim \mathcal{D}$ by uniformly at random selecting a label $y \in [k]$ and generating data $X$ as follows.
1) Sample a set of features $\mathcal{V}'$ uniformly at random from $\{v_{i,1}, v_{i,2}\}_{i \neq y}$, each with probability $\frac{s}{k}$.
2) Denote $\mathcal{V}(X) = \mathcal{V}' \cup \{v_{y,1}, v_{y,2}\}$ as the set of feature vectors used in data $X$.
3) For each $v \in \mathcal{V}(X)$, pick $C_p$ disjoint patches in $[P]$ and denote this set as $\mathcal{P}_v(X)$ (the distribution of these patches can be arbitrary). We denote $\mathcal{P}(X) = \cup_{v \in \mathcal{V}(X)} \mathcal{P}_v(X)$.
4) If $\mathcal{D} = \mathcal{D}_s$ is the single-view distribution, pick a value $l = l(X) \in [2]$ uniformly at random.
5) For each $p \in \mathcal{P}_v(X)$ for some $v \in \mathcal{V}(X)$, given feature noise $\alpha_{p,v'} \in [0, \gamma]$, we set $x_p = z_p v + \sum_{v' \in \mathcal{V}} \alpha_{p,v'} v' + \xi_p$, where $\xi_p \sim \mathcal{N}(0, \sigma_p I)$ is an independent random Gaussian noise. The coefficients $z_p \geq 0$ satisfy:
• For "multi-view" data $(X, y) \in \mathcal{D}_m$, $\sum_{p \in \mathcal{P}_v(X)} z_p \in [1, O(1)]$ and $\sum_{p \in \mathcal{P}_v(X)} z_p^q \in [1, O(1)]$ for an integer $q \geq 2$ when $v \in \{v_{y,1}, v_{y,2}\}$; $z_p$ is uniformly distributed over the $C_p$ patches, and the marginal distribution of $\sum_{p \in \mathcal{P}_v(X)} z_p$ is left-close.
• For "single-view" data $(X, y) \in \mathcal{D}_s$, when $v = v_{y,l}$, $\sum_{p \in \mathcal{P}_v(X)} z_p \in [1, O(1)]$ and $\sum_{p \in \mathcal{P}_v(X)} z_p^q \in [1, O(1)]$ for $q \geq 2$; when $v = v_{y,3-l}$, $\sum_{p \in \mathcal{P}_v(X)} z_p \in [\rho, O(\rho)]$ (here we set $\rho = k^{-0.01}$ for simplicity); $z_p$ is uniformly distributed over the $C_p$ patches.
• $\sum_{p \in \mathcal{P}_v(X)} z_p \in [\Omega(1), 0.4]$ when $v \in \mathcal{V}(X) \setminus \{v_{y,1}, v_{y,2}\}$, and the marginal distribution of $\sum_{p \in \mathcal{P}_v(X)} z_p$ is right-close.
6) For each $p \in [P] \setminus \mathcal{P}(X)$, with an independent random Gaussian noise $\xi_p \sim \mathcal{N}(0, \frac{\gamma^2 k^2}{d} I)$, we set $x_p = \sum_{v' \in \mathcal{V}} \alpha_{p,v'} v' + \xi_p$, where each $\alpha_{p,v'} \in [0, \gamma]$ is a feature noise.
Intuitively, multi-view data D m refers to the data with multiple features distributed over patches plus some noise from other features and background noise, while only a single main feature exists in single-view data D s . Their mixed distribution D can well characterize realistic data. Based on distribution D, in Sec. 4 we will define the datasets used for pretraining and downstream fine-tuning.
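To make Definition 3.1 concrete, the following toy sampler instantiates it with illustrative constants (feature vectors taken as standard basis vectors, specific values for $C_p$, $s$, $\mu$, $\gamma$, and the per-patch feature-noise terms omitted for brevity); all function and variable names are ours, not the paper's:

```python
import numpy as np

def sample_multi_view(k=50, d=256, P=16, Cp=2, s=4, mu=0.2, gamma=0.01, rng=None):
    """Toy sampler for Definition 3.1 (illustrative constants, not the
    paper's asymptotics; feature-noise terms alpha_{p,v'} v' are omitted)."""
    rng = rng or np.random.default_rng(0)
    # 2k orthonormal features: the first 2k standard basis vectors of R^d (d >= 2k).
    V = np.eye(d)[:2 * k].reshape(k, 2, d)            # V[i, m] = v_{i,m+1}
    sigma_p = 1.0 / (np.sqrt(d) * np.log(k) ** 2)      # sigma_p = 1/(sqrt(d) polylog(k))
    y = rng.integers(k)
    single_view = rng.random() < mu                    # D_s with prob. mu, else D_m
    l = rng.integers(2)                                # main view for single-view data
    # Background-noise patches: xi_p ~ N(0, gamma^2 k^2 / d I).
    X = rng.normal(0.0, gamma * k / np.sqrt(d), size=(P, d))
    # Both class features plus off-class features, each sampled with prob. s/k.
    used = [(i, m) for i in range(k) for m in range(2)
            if i == y or rng.random() < s / k]
    free = list(rng.permutation(P))
    for (i, m) in used:
        if len(free) < Cp:                             # out of disjoint patches
            break
        patches, free = free[:Cp], free[Cp:]           # Cp disjoint patches per feature
        if i == y:
            total = 1.0 if (not single_view or m == l) else k ** -0.01  # rho for minor view
        else:
            total = 0.4 * rng.random()                 # off-class weight in (0, 0.4]
        for p in patches:                              # z_p spread uniformly over Cp patches
            X[p] = (total / Cp) * V[i, m] + rng.normal(0.0, sigma_p, d)
    return X, y

X, y = sample_multi_view()
print(X.shape, y)
```

The sampler returns a $(P, d)$ patch matrix and its label; the single-view branch plants only one strong class feature, which is exactly the case where supervised learning later fails half the time.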

3.2. MASK-RECONSTRUCTION PRETRAINING FRAMEWORK

As a representative MRP, MAE (He et al., 2021) randomly masks the patches of an input image and then reconstructs the pixels of these masked patches via an auto-encoder. Recently, many works have shown that reconstructing semantic features often achieves higher performance, where the semantic features are obtained by feeding the vanilla full input into a teacher network, e.g. a pretrained network (Wei et al., 2021) or an exponential moving average (EMA) of the encoder in MAE (Dong et al., 2021; Baevski et al., 2022). In this paper, we analyze both the Teacher-Student framework and MAE, but focus more on the former because of its slightly higher performance.

Network. The encoder is a two-layer CNN and the decoder is a linear layer. For the encoder, its output is defined as $H(X) = [h_1(X), h_2(X), \ldots, h_{km}(X)]$, where $h_r(X) = \sum_{p \in [P]} \widetilde{\mathrm{ReLU}}(\langle w_r, x_p \rangle)$. Here $\widetilde{\mathrm{ReLU}}$ is a smoothed ReLU (Allen-Zhu & Li, 2020) defined as follows: for an integer $q \geq 2$ and a threshold $\varrho = \frac{1}{\mathrm{polylog}(k)}$, $\widetilde{\mathrm{ReLU}}(z) = 0$ if $z \leq 0$, $\widetilde{\mathrm{ReLU}}(z) = \frac{z^q}{q \varrho^{q-1}}$ if $z \in [0, \varrho]$, and $\widetilde{\mathrm{ReLU}}(z) = z - (1 - 1/q)\varrho$ if $z \geq \varrho$. The desirable property of the smoothed ReLU is that it is linear in $z$ when $z$ is large and much smaller when $z$ is small; it thus shrinks low-magnitude feature noise to better separate true features from feature noise. The decoder is a linear layer parameterized by $b_r$ ($r \in [km]$), and its output is $h'(X) = [h'_1(X), h'_2(X), \ldots, h'_{km}(X)]$, where $h'_r(X) = b_r h_r(X)$, $r \in [km]$. Following the practice in MRP (Baevski et al., 2022; Dong et al., 2021), the teacher network shares the same architecture as the student network and is a smoothed ReLU network parameterized by $\hat{w}_r$, $r \in [km]$. Its output is defined as $\hat{h}(X) = [\hat{h}_1(X), \hat{h}_2(X), \ldots, \hat{h}_{km}(X)]$, where $\hat{h}_r(X) = \sum_{p \in [P]} \widetilde{\mathrm{ReLU}}(\langle \hat{w}_r, x_p \rangle)$.

Pretraining of MRP on the Pretext Task. Let $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_P)$, where each $\epsilon_p$ is an independent Bernoulli variable with $\Pr(\epsilon_p = 1) = \theta$. In pretraining, for an input (image or text tokens) with $P$ patches, this framework randomly masks patches to obtain $\epsilon X = (\epsilon_1 x_1, \ldots, \epsilon_P x_P)$, and feeds $\epsilon X$ into the student encoder $H$ for a latent vector $H(\epsilon X)$. Then the student decoder takes $H(\epsilon X)$ as input and outputs $h'$ over all patches to predict the output $\hat{h}$ of the teacher, whose input is the vanilla $X$. For the MAE framework, the decoder has an additional layer that maps the output of the encoder to recover the $P$ patches (see Fig. 6 in Appendix).
For the student decoder, we set all its parameters as $b_r = c(\theta) = \frac{1}{\theta}$ for simplicity, which is provably sufficient for improving downstream tasks. Now we define the empirical mean-squared pretraining loss: $L(H; \epsilon) = \frac{1}{2N} \sum_{n \in [N]} L(H; X_n, \epsilon) = \frac{1}{2N} \sum_{n \in [N]} \sum_{r \in [km]} \|\hat{h}_r(X_n) - h'_r(\epsilon X_n)\|_2^2$, where $N$ is the number of pretraining data points and $\epsilon X = (\epsilon_1 x_1, \epsilon_2 x_2, \ldots, \epsilon_P x_P)$. Now we discuss how to pretrain. Following MRP (Baevski et al., 2022; Dong et al., 2021), we use the student to update the teacher by setting the teacher kernel parameters as $\hat{w}_r^{(t)} = \tau w_r^{(t)}$, where $\tau = 1 + c_0$ and $c_0 = \frac{1-\theta}{C_p \theta} + \Theta(\frac{1}{t+1})$. Then we use gradient descent to update the student encoder parameters: $w_r^{(t+1)} = w_r^{(t)} - \eta \mathbb{E}_\epsilon[\nabla_{w_r} L(H; \epsilon)]$. (2)

Fine-tuning of MRP on the Classification Downstream Task. Here we consider a classification downstream task. Specifically, we fine-tune the pretrained student encoder with an extra linear layer using $N_2$ labeled samples, by minimizing the empirical cross-entropy loss $L_{\mathrm{down}}(F) = \frac{1}{N_2} \sum_{n \in [N_2]} L_{\mathrm{down}}(F; X_n, y_n)$, where $F_i(X) = \sum_{r \in [km]} u_{i,r} h_r(X)$, $i \in [k]$. Here $L_{\mathrm{down}}(F; X, y) = -\log \frac{e^{F_y(X)}}{\sum_{j \in [k]} e^{F_j(X)}}$, and $u_{i,r}$, $r \in [km]$, $i \in [k]$, denote the weights of the extra linear layer. Then we adopt gradient descent to fine-tune the kernels $w_r$ of the pretrained encoder and to update the parameters $u_{i,r}$: $w_r^{(t+1)} = w_r^{(t)} - \eta_1 \nabla_{w_r} L_{\mathrm{down}}(F)$, $u_{i,r}^{(t+1)} = u_{i,r}^{(t)} - \eta_2 \nabla_{u_{i,r}} L_{\mathrm{down}}(F)$, where the learning rate $\eta_1$ is often much smaller than $\eta_2$ in practice.
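The smoothed ReLU, the student encoder, and one pretraining step can be sketched as a minimal NumPy mock-up. This is illustrative only: the gradient is taken numerically over a single mask draw rather than the expectation $\mathbb{E}_\epsilon$, the teacher target is held fixed (stop-gradient), and all constants and names are ours:

```python
import numpy as np

def smoothed_relu(z, q=3, rho=0.1):
    """Smoothed ReLU (Sec. 3.2): degree-q polynomial below threshold rho,
    shifted-linear above it; the two branches meet continuously at rho/q."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    mid = (z > 0) & (z < rho)
    out[mid] = z[mid] ** q / (q * rho ** (q - 1))
    hi = z >= rho
    out[hi] = z[hi] - (1 - 1 / q) * rho
    return out

def encoder(X, W, q=3, rho=0.1):
    """h_r(X) = sum_p smoothed_relu(<w_r, x_p>); X: (P, d), W: (km, d)."""
    return smoothed_relu(X @ W.T, q, rho).sum(axis=0)

def mrp_pretrain_step(X, W, eta=0.1, theta=0.75, Cp=2, t=0, rng=None):
    """One gradient step on 0.5 * || h_hat(X) - b * H(eps X) ||^2, with a
    numerical gradient standing in for the paper's analytic E_eps gradient."""
    rng = rng or np.random.default_rng(0)
    eps = (rng.random(X.shape[0]) < theta).astype(float)   # Bernoulli(theta) mask
    tau = 1 + (1 - theta) / (Cp * theta) + 1 / (t + 1)     # teacher scale tau = 1 + c_0
    b = 1 / theta                                          # decoder weights b_r = 1/theta
    target = encoder(X, tau * W)                           # teacher on full input (stop-grad)

    def loss(Ws):
        return 0.5 * np.sum((target - b * encoder(eps[:, None] * X, Ws)) ** 2)

    grad, h = np.zeros_like(W), 1e-6
    for idx in np.ndindex(*W.shape):                       # central-difference gradient
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        grad[idx] = (loss(Wp) - loss(Wm)) / (2 * h)
    return W - eta * grad
```

The decoder scale $b_r = 1/\theta$ and the teacher scale $\tau$ compensate, in expectation, for the mass removed by masking, which is why the masked student output is a sensible predictor of the unmasked teacher output.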

4. MAIN RESULTS

Here we first reveal the semantic feature learning process of mask-reconstruction pretraining (MRP), and then theoretically show why MRP helps downstream tasks by taking the classification task as an example. Finally, we intuitively discuss the benefits of MRP to other downstream tasks.

4.1. FEATURE LEARNING PROCESS OF PRETRAINING

Here we mainly show that pretraining can capture the whole feature set $\mathcal{V}$ (defined in Sec. 3.1) of the pretraining dataset, by showing that the correlation scores between the features and the kernels of the student encoder gradually increase during training. For brevity, we first define $M^{(0)}_{i,l} := \{ r \in [km] : \langle w^{(0)}_r, v_{i,l} \rangle \geq \Lambda^{(0)}_{i,l} (1 - O(1/\log k)) \}$, where $\Lambda^{(t)}_{i,l} := \max_{r \in [km]} [\langle w^{(t)}_r, v_{i,l} \rangle]^+$. Here $w^{(t)}_r$ denotes the $r$-th convolution kernel of the student encoder at the $t$-th iteration, and $\Lambda^{(t)}_{i,l}$ denotes the highest positive correlation score between the $l$-th feature $v_{i,l}$ of the $i$-th class and all the $km$ kernels $w^{(t)}_r$. A larger $\Lambda^{(t)}_{i,l}$ means the network better captures the feature $v_{i,l}$. The set $M^{(0)}_{i,l}$ is composed of the kernels whose correlation scores are only slightly smaller than the maximum score $\Lambda^{(0)}_{i,l}$ at the initial stage. For analysis, we pose some assumptions on the data and the network as follows.

Assumption 1. (1) The pretraining dataset $\mathcal{Z}$ has $N$ samples i.i.d. drawn from the distribution $\mathcal{D}$ defined in Definition 3.1, with $N \geq \mathrm{poly}(k)$. (2) Each kernel $w^{(0)}_r$ ($r \in [km]$) is initialized from a Gaussian distribution $\mathcal{N}(0, \sigma_0^2 I)$ with $\sigma_0 = O(1/\sqrt{k})$. Moreover, $m \in [\mathrm{polylog}(k), \sqrt{k}]$.

Assumption 1 means that there are about $(1-\mu)N$ "multi-view" samples and $\mu N$ "single-view" samples in the pretraining dataset $\mathcal{Z}$. According to Definition 3.1, a multi-view sample contains multiple discriminative features distributed over patches plus some noise from other features and background noise, while a single-view sample has only a single main feature plus noise. We use Gaussian initialization as it is the standard initialization in practice. Note that pretraining uses no labels. Theorem 1 states the feature learning process of MRP.

Theorem 1. Suppose Assumption 1 holds and the learning rate satisfies $\eta \leq \frac{1}{\mathrm{poly}(k)}$ in the gradient descent steps (2).
After $T = \frac{\mathrm{poly}(k)}{\eta}$ iterations, for sufficiently large $k$, the learned kernels $\{w^{(T)}_r\}_{r \in [km]}$ satisfy the following properties with high probability.
1) Under the Teacher-Student framework, when $q \geq 3$, for every $v_{i,l} \in \mathcal{V}$ and every $(X, y) \in \mathcal{Z}$:
(a) $\Lambda^{(0)}_{i,l} \in [\widetilde{\Omega}(\sigma_0), \widetilde{O}(\sigma_0)]$, $\Lambda^{(T)}_{i,l} \in [1/\mathrm{polylog}(k), \widetilde{O}(1)]$, and $r^* \in M^{(0)}_{i,l}$, where $r^* = \arg\max_{r \in [km]} [\langle w^{(T)}_r, v_{i,l} \rangle]^+$.
(b) For each $r \in M^{(0)}_{i,l}$, $\langle w^{(T)}_r, v_{i',l'} \rangle \leq \widetilde{O}(\sigma_0)$ when $(i, l) \neq (i', l')$.
(c) For each $r \notin M^{(0)}_{i,l}$, $\langle w^{(T)}_r, v_{i,l} \rangle \leq \widetilde{O}(\sigma_0)$.
2) Under the MAE framework, when $q \geq 4$, properties (a)-(c) also hold.
See the proofs for the Teacher-Student framework in Appendix F and for the MAE framework in Appendix H. Theorem 1 states that under both frameworks, the pretrained model captures all features, but MAE needs a slightly more restrictive assumption: it requires $q \geq 4$, where $q$ is the smoothness parameter of the smoothed ReLU (see Sec. 3.2), since a larger $q$ compresses small feature noise more heavily and thus better separates true features from feature noise. In the Teacher-Student framework, this functionality is instead implemented by the teacher, which filters out feature noise and gives cleaner targets. Theorem 1 (a) shows that among the kernels winning the lottery ticket at random initialization (i.e. kernels $w^{(0)}_r$ with $r \in M^{(0)}_{i,l}$), at least one wins out through the course of training and captures the feature $v_{i,l}$. Specifically, at initialization, for any feature $v_{i,l}$ in the feature set $\mathcal{V}$ of the pretraining dataset, its correlation score $\Lambda^{(0)}_{i,l}$ with any kernel in $\{w_r\}_{r=1}^{mk}$ is at most $\widetilde{O}(1/\sqrt{k})$. After MRP pretraining, for any feature $v_{i,l}$, there always exists at least one kernel $w^{(T)}_r$ such that the correlation score $\Lambda^{(T)}_{i,l}$ between $v_{i,l}$ and $w^{(T)}_r$ has increased to at least $1/\mathrm{polylog}(k)$. So each feature in $\mathcal{V}$ is captured by at least one convolution kernel of the student encoder.
Besides, as the pretraining dataset is often much larger than the downstream dataset, the features in the pretraining dataset (approximately) cover all features in the downstream dataset, so the kernels of the pretrained student encoder can capture as many features as possible in downstream datasets. Theorem 1 (b) and (c) mainly guarantee a correspondence between kernels and features: each kernel captures at most one feature. Specifically, Theorem 1 (b) indicates that the kernels $w_r$ with $r \in M^{(0)}_{i,l}$, which mainly capture the semantic feature $v_{i,l}$, capture only little information of other features $v_{i',l'}$ with $(i', l') \neq (i, l)$, since for any such $w_r$, its correlation score with $v_{i',l'}$ is no larger than $\widetilde{O}(\sigma_0)$ and stays small throughout training. Theorem 1 (c) shows that the kernels $w_r$ with $r \notin M^{(0)}_{i,l}$ keep losing the lottery ticket during training and capture only little information of the feature $v_{i,l}$. Together, Theorem 1 (b) and (c) guarantee that each kernel mainly captures at most one feature and grabs very little information of the others, so the multiple features captured by the encoder kernels are separated rather than entangled with each other. This property is very important for fine-tuning: intuitively, since each kernel is associated with at most one feature, a linear classifier can directly establish the relations between kernels and semantic class labels. See more discussion in Sec. 4.2.
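The quantities $\Lambda^{(t)}_{i,l}$ and $M^{(0)}_{i,l}$ are directly computable from the kernels and the feature vectors; a small sketch (function names and the concrete $1/\log k$ margin are ours):

```python
import numpy as np

def correlation_scores(W, V):
    """Lambda_{i,l} = max_r [<w_r, v_{i,l}>]_+ for kernels W: (km, d) and
    features V: (k, 2, d). Returns Lambda: (k, 2) and the argmax kernel ids."""
    scores = np.maximum(np.einsum('rd,ild->ilr', W, V), 0.0)  # [<w_r, v_{i,l}>]_+
    return scores.max(axis=-1), scores.argmax(axis=-1)

def winning_set(W0, V, logk):
    """M^{(0)}_{i,l}: kernels within a (1 - O(1/log k)) factor of the initial
    maximum correlation; returns a boolean mask of shape (k, 2, km)."""
    scores = np.maximum(np.einsum('rd,ild->ilr', W0, V), 0.0)
    lam0 = scores.max(axis=-1, keepdims=True)
    return scores >= lam0 * (1 - 1 / logk)
```

Tracking `correlation_scores` over training iterations is exactly how Theorem 1 (a)-(c) would be checked empirically: the maximum score for each feature grows from $\widetilde{O}(\sigma_0)$ to at least $1/\mathrm{polylog}(k)$, while off-feature scores stay at the initialization scale.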

4.2. BENEFIT JUSTIFICATION OF MRP ON DOWNSTREAM TASKS

Classification Downstream Task. Here we first analyze the performance of MRP on the classification downstream task. After pretraining, following the practice in (Wei et al., 2021; Dong et al., 2021; He et al., 2021; Xie et al., 2021; Baevski et al., 2022), we only fine-tune the student encoder with an extra linear layer on the labeled training data of the downstream dataset. See the details of fine-tuning in Sec. 3.2. Before the analysis, we first make some mild assumptions.

Assumption 2. (1) The downstream dataset $\mathcal{Z}_{\mathrm{down}}$ of $N_2$ samples is i.i.d. drawn from the distribution $\mathcal{D}$ defined in Definition 3.1, with $N_2 \geq k$. (2) We initialize $u$.

Assumption 2 actually assumes that the pretraining and downstream datasets share the same distribution $\mathcal{D}$. This data assumption accords with the practice in many SSL works (Wei et al., 2021; Dong et al., 2021), e.g. MAE (He et al., 2021), SimMIM (Xie et al., 2021) and data2vec (Baevski et al., 2022), which pretrain and fine-tune on the same dataset, e.g. ImageNet, with significant improvement over conventional supervised learning. Based on Theorem 1, we then analyze the test performance on the classification downstream task and summarize the results in Theorem 2. We denote the fine-tuning network as a function $F(\cdot) \in \mathbb{R}^k$ which outputs a $k$-dimensional prediction for the $k$ classes.

Theorem 2 (Test performance analysis). Suppose Assumption 2 holds. Let $F(\cdot)$ be either the student encoder in the Teacher-Student framework or the encoder in MAE, with an extra linear layer. By fine-tuning $F(\cdot)$ with $N_2$ labeled samples, for any new data point $(X, y) \sim \mathcal{D}$, $F(\cdot)$ satisfies $\Pr_{(X,y)\sim\mathcal{D}}\left( F_y(X) \geq \max_{j \neq y} F_j(X) + \widetilde{O}(1) \right) \geq 1 - e^{-\Omega(\log^2 k)}$, where $F_y(X)$ denotes the $y$-th element of $F(X)$, i.e. the predicted probability for the class $y$. See its proof in Appendix G. Theorem 2 guarantees that, for both single-view and multi-view data $(X, y) \sim \mathcal{D}$, the fine-tuned classifier $F(\cdot)$ correctly predicts the label $y$ with high probability.
This is because, intuitively, as proved in Theorem 1 (a), after pretraining, for each discriminative feature $v_{i,l}$ in the feature set $\mathcal{V}$, at least one kernel $w_r$ in the pretrained student encoder captures it. This means that even at the beginning of fine-tuning, the encoder in $F(\cdot)$ is already capable of discovering and grabbing all features in $\mathcal{V}$. Then, as shown in Theorem 1 (b) and (c), each kernel captures at most one feature. In this way, for a single-view sample containing a single feature $v_{i,l}$, the corresponding kernels in the encoder capture it and output a large correlation score ($\geq 1/\mathrm{polylog}(k)$) at the corresponding positions and small correlation scores ($\leq \widetilde{O}(1/\sqrt{k})$) at the remaining positions. Similarly, for multi-view samples including several features, the corresponding kernels have large correlation scores at some specific positions and small ones at the rest. For a given class, these positions of high scores do not change, because all features are captured and each kernel grabs at most one feature. Based on this, the last linear layer in $F(\cdot)$ can easily establish the correspondence between large-score positions and class labels, and learns to classify. We then compare the test performance with conventional end-to-end "supervised learning" (SL) on the same downstream dataset. Under the same data distribution and the same network $F$, Allen-Zhu & Li (2020) analyzed the test performance of SL (see Lemma 1).

Lemma 1. Suppose the data assumption in Assumption 2 holds and now let the sample number $N_2 \geq \mathrm{poly}(k)$. Let the learning rate $\eta \leq \frac{1}{\mathrm{poly}(k)}$. Then, by training $F$ for $T = \frac{\mathrm{poly}(k)}{\eta}$ iterations with supervised training, with probability $\geq 1 - e^{-\Omega(\log^2 k)}$, the supervised trained model $F^{(T)}_{\mathrm{SL}}$ satisfies $\Pr_{(X,y)\sim\mathcal{D}}\left( \exists i \in [k] : F^{(T)}_{\mathrm{SL},y}(X) < F^{(T)}_{\mathrm{SL},i}(X) \right) \in [0.49\mu, 0.51\mu]$.

Lemma 1 shows that the supervised trained model $F^{(T)}_{\mathrm{SL}}$ has only about 50% accuracy on single-view data, whose ratio among all data is $\mu$.
Both our theorems and Lemma 1 are proved via the lottery ticket hypothesis (Allen-Zhu & Li, 2020; Wen & Li, 2021; Allen-Zhu & Li, 2022; Frankle & Carbin, 2019); we discuss the detailed differences between our work and (Allen-Zhu & Li, 2020) in terms of network architecture, objective loss, and fine-tuning on downstream tasks in Appendix C. For supervised training, when the convolution kernels are randomly initialized, for each class one feature correlates more with the kernels than the other features do; with more training iterations, this feature becomes a winning lottery ticket over the others. Thus, on single-view data whose main feature is not among the captured ones, the classifier errs, yielding only about half accuracy on such data. By comparison, MRP, in both the Teacher-Student and MAE frameworks, captures all features during pretraining and thus has a stronger capacity to capture more kinds of features for each class. This also accords with our empirical observations in Sec. 5: Fig. 3 visualizes the class-specific image regions for models trained by SL and MRP, and by comparison, MRP captures multiple discriminative features in an image while SL captures only a single feature.

Discussion on Other Downstream Tasks. Besides the classification downstream task, our conclusions could intuitively generalize to other downstream tasks, e.g. transfer learning and detection, because in the pretraining phase the encoder provably captures all features in the images. For transfer learning, the representative task is classification $T_{\mathrm{cls}}$ (He et al., 2020b; 2021), which pretrains a model on large-scale unlabeled data $\mathcal{D}_{\mathrm{pre}}$ and then fine-tunes it on a classification downstream dataset $\mathcal{D}_{\mathrm{fine}}$. Denote the feature set of $\mathcal{D}_{\mathrm{fine}}$ as $\mathcal{V}'$. We discuss transfer learning in three cases: 1) the datasets $\mathcal{D}_{\mathrm{fine}}$ and $\mathcal{D}_{\mathrm{pre}}$ share the same feature set (i.e. $\mathcal{V}' = \mathcal{V}$) but can have different data distributions (e.g.
different ratios of single- and multi-view data); 2) $\mathcal{V} \subset \mathcal{V}'$; 3) the pretraining and downstream tasks share only part of the feature set, i.e., $\mathcal{V} \cap \mathcal{V}' \neq \emptyset$. For cases (1) and (2), following a proof process similar to that of Theorem 2, the fine-tuned model also obtains high classification accuracy on downstream tasks. For case (3), since our training networks are over-parameterized, i.e., $m \in [\mathrm{polylog}(k), \sqrt{k}]$, and the size of the lottery-ticket winning set is $O(\mathrm{polylog}(k))$ at the end of pretraining, the number of kernels that finally capture features is much smaller than the total number of kernels; the remaining kernels can still capture new features in the downstream fine-tuning phase. In this way, MRP still improves the overall transfer learning performance. The experimental results on transfer learning in Table 1 also support this analysis: although the pretraining dataset, ImageNet, and the downstream dataset, VOC07, have many different categories and thus different features, MRP still performs better than the supervised baseline, which validates our analysis. For the object detection downstream task, there are two targets: 1) finding the bounding boxes of possible objects, and 2) classifying the bounding boxes. For bounding box detection, since a) the encoder pretrained by MRP grabs all features and b) the desired bounding boxes should contain features, the encoder can detect the bounding boxes precisely, or at least does not miss many object bounding boxes. In contrast, supervised learning often captures only a random subset of features (Allen-Zhu & Li, 2020) and thus cannot detect bounding boxes as well. For the second target, i.e. classification, we can draw conclusions similar to the transfer-learning classification task: MRP often performs better than supervised learning. So, considering the advantages of MRP over supervised learning on both targets of object detection, we expect MRP to achieve better detection performance than supervised learning.

Comparison to Other SSL Analysis. Only a few works have studied MRP. Cao et al. (2022) revealed the benefits of patchifying, the equivalence between the attention mechanism in MAE and an integral kernel transform, etc., but did not analyze the feature learning process of MRP or the reasons for its superiority over supervised learning. Lee et al.
(2021) showed that, by splitting an input into two pieces, the pretext task of reconstructing one piece from the other can decrease the sample complexity of downstream tasks. Though promising, they require the two pieces to be approximately independent conditioned on their feature label, which contradicts many realistic cases where two parts of the same image share a significant amount of information not explained by the label (Bansal et al., 2020). In contrast, under multi-view data assumptions verified by (Allen-Zhu & Li, 2020) and our experimental results, we reveal how features are learned by MRP, and also explicitly show the performance improvement on downstream tasks, which is essential to understanding MRP in practice.

Discussion on CNN Architectures. We analyze CNNs for two reasons. One is for easy comparison with the supervised results of the same encoder network in (Allen-Zhu & Li, 2020). The other is the difficulty of analyzing the feature learning process of Transformers due to their correlated manipulations and highly nonlinear attentions. Still, this work is the first to analyze the feature learning process of MRP on explicit non-linear networks. Moreover, as shown in (Jing et al., 2022; Fang et al., 2022) (see our Appendix B) and our Sec. 5, CNNs pretrained by MRP empirically surpass supervised ones on various downstream tasks, e.g. classification, detection and segmentation. Finally, Appendix B shows that Transformers pretrained by MRP also capture more features than supervised ones, which accords with our observations on CNNs and validates our theory.

5. EXPERIMENTS

Assumption Investigation. To verify our "multi-view" data assumption, we investigate whether there are multiple discriminative features for some classes in ImageNet (Deng et al., 2009).

Results on CNN. We investigate the performance of MRP on CNNs. We use the recently proposed SimMIM (Xie et al., 2021), a representative MRP, on ResNet50. We use SimMIM rather than MAE (He et al., 2021), since MAE removes the masked patches before the encoder and the convolution operations in a CNN encoder cannot handle such input, while SimMIM replaces the masked patches with a mask token and can thus use CNNs. We implement the encoder with ResNet50 and the decoder with a 4-layer transformer. Next, we pretrain for 300 epochs on ImageNet, and fine-tune the pretrained ResNet50 for 100 epochs on ImageNet. Table 1 reports the top-1 accuracy on ImageNet and shows that, on ResNet50, MRP improves supervised training by a large margin. Moreover, we fine-tune our pretrained ResNet50 on the transfer learning classification downstream task on VOC07 and the detection task on VOC07+12. The results show that CNNs pretrained by MRP generalize well on various downstream tasks and indeed often surpass supervised baselines. These results accord with our theoretical implication that MRP helps downstream tasks by achieving superior performance over conventional SL. See more results on downstream tasks in Appendix B. Finally, we use Eigen-CAM to localize class-specific image regions for models trained by supervised learning (SL) and by MRP. For each pair in Fig. 3, the left image is the visualization of SL, while the right one is from MRP. By comparison, MRP often captures several discriminative features in an image, e.g. the front window and door handle in the first pair, while SL only grabs a single feature, e.g. the front window in the first pair. See similar observations on Transformer in Appendix B. These results accord with our theory that the advantages of MRP come from its stronger capacity to capture more kinds of class features during pretraining.
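To make the masking mechanism concrete, below is a minimal, self-contained sketch of the SimMIM-style objective described above: replace masked patches with a mask token, run the corrupted input through an auto-encoder, and penalize the reconstruction only on the masked patches. All sizes, the zero mask token, and the random linear "auto-encoder" are our own illustrative assumptions, not SimMIM's actual implementation.

```python
import numpy as np

def mrp_loss(patches, theta=0.4, rng=None):
    """Toy mask-reconstruction objective (hypothetical sketch, not SimMIM's code).

    patches: (P, d) array of flattened image patches.
    theta:   probability of KEEPING a patch visible (mask ratio = 1 - theta).
    """
    rng = np.random.default_rng(rng)
    P, d = patches.shape
    keep = rng.random(P) < theta            # Bernoulli(theta) visibility mask
    mask_token = np.zeros(d)                # masked patches replaced by a token
    inp = np.where(keep[:, None], patches, mask_token)

    # Stand-in "auto-encoder": a fixed random linear encoder/decoder pair.
    W_enc = rng.standard_normal((d, d)) / np.sqrt(d)
    W_dec = rng.standard_normal((d, d)) / np.sqrt(d)
    recon = inp @ W_enc @ W_dec

    # The loss is computed ONLY on the masked patches, as in MRP.
    masked = ~keep
    if masked.sum() == 0:
        return 0.0
    return float(np.mean((recon[masked] - patches[masked]) ** 2))
```

Training the encoder/decoder to minimize this loss on many images is what forces the encoder to store the semantic features needed to fill in the masked patches.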

6. CONCLUSION

In this work, we analyze the feature learning process of mask-reconstruction pretraining (MRP) and show its superiority on downstream tasks. In pretraining, the encoder in MRP provably captures all discriminative features in the pretraining dataset. Because of its huge size and high diversity, the pretraining dataset covers the features in the downstream dataset. So in fine-tuning, the encoder captures as many features as it can in the downstream dataset, while supervised learning only randomly captures some features due to its random initialization. Thus, the fine-tuned encoder in MRP provably achieves better accuracy than supervised learning on classification tasks. Experimental results validate our assumptions and our theoretical implications.

A OUTLINE OF APPENDIX

This supplementary document contains the main proofs of the two main theorems. It is structured as follows. Appendix B first provides more experimental results comparing the features learnt by conventional supervised learning and by mask-reconstruction pretraining; it also provides results on other downstream tasks under the CNN backbone and visualizations under the Transformer backbone. In Appendix C, we present the necessary assumptions and the main results on the feature learning process of mask-reconstruction pretraining. In Appendix D, we introduce the main ideas for proving our main results. In Appendix E, we show some technical results.

B MORE EXPERIMENTAL RESULTS

In this section, we provide more visualizations to compare the features learnt by conventional supervised learning (SL) and by mask-reconstruction pretraining (MRP). Moreover, we provide more results on other downstream tasks.

More visualization results. As in Sec. 5 of the manuscript, we use Eigen-CAM to localize class-specific image regions for models trained by SL and MRP. Note that since 1) CAM-style methods all need a well-trained classifier on top of the backbone, and 2) the pretrained model of MRP has only an encoder backbone but no classifier, we visualize the fine-tuned model instead of the pretrained model for MRP. For each pair in Fig. 4, the left image is the visualization of SL, while the right one is from MRP. By comparison, MRP often captures several discriminative features in an image, e.g. the airplane head and airplane tail in the first pair, while SL only captures a single feature, e.g. the airplane tail in the first pair. These results also accord with our theory that the advantages of MRP come from its stronger capacity to capture more kinds of features for each class in the pretraining phase.

Results on other downstream tasks. Actually, some works, e.g.
(Jing et al., 2022; Fang et al., 2022), also empirically find that CNNs pretrained by MRP generalize well on various downstream tasks, e.g., detection and segmentation. Specifically, Fang et al. (2022) showed that on ResNet50, MRP achieves 38.0 mIoU on the ADE20K semantic segmentation task, greatly improving over the 36.1 mIoU of the supervised learning baseline. See the detailed results therein.

Visualization under Transformer backbone. Besides ResNet, we further provide visualization results on Transformer (Dosovitskiy et al., 2020) to localize class-specific image regions for models trained by SL and MRP. For each group in Fig. 5, the left image is the visualization of SL, the middle one is from MAE (He et al., 2021), and the right one is from data2vec (Baevski et al., 2022). Here we directly use the officially released ViT-base models of SL, MAE and data2vec. One can observe that both MAE and data2vec usually capture several discriminative features in an image, while SL only captures a single feature. For example, in the first comparison group, SL only captures one side of the car tail; MAE grabs both sides of the car tail; and data2vec locates both sides of the car tail and captures more, including the car wheel and car window. These results are consistent with those on ResNet50. All these results show the generality of the implications of our theory on MRP.

C MAIN RESULT ON FEATURE LEARNING PROCESS OF MASK-RECONSTRUCTION PRETRAINING

In this section, we first show the main result on the feature learning process of mask-reconstruction pretraining (MRP). To introduce it, we first characterize the kernels at random initialization and during training. Define
$$\Lambda^{(t)}_{i,l} := \max_{r\in[km]} \big[\langle w^{(t)}_r, v_{i,l}\rangle\big]^+,$$
where $w^{(t)}_r$ denotes the $r$-th convolution kernel of the student encoder at the $t$-th iteration. Here $\Lambda^{(t)}_{i,l}$ is the largest positive correlation score between the $l$-th feature $v_{i,l}$ of the $i$-th class and all $km$ kernels $w^{(t)}_r$. We also define
$$\mathcal{M}^{(0)}_{i,l} := \Big\{ r \in [km] : \langle w^{(0)}_r, v_{i,l}\rangle \ge \Lambda^{(0)}_{i,l}\Big(1 - O\Big(\frac{1}{\log k}\Big)\Big) \Big\}.$$
The set $\mathcal{M}^{(0)}_{i,l}$ is formed by the kernels whose correlation scores are only slightly smaller than the maximum score $\Lambda^{(0)}_{i,l}$ at the initial stage. If a kernel $w_r$ is not in $\mathcal{M}^{(0)}_{i,l}$, the magnitude of $v_{i,l}$ inside the random initialization $w^{(0)}_r$ is non-trivially lagging behind the other kernels. Later we will prove that, through the course of training, those kernels $w_r$ lose the lottery and learn nothing useful about the feature $v_{i,l}$.

We also have some properties of the kernels at initialization ($t=0$). The following lemma has been proved in (Allen-Zhu & Li, 2020, Fact B.1).

Lemma C.1 (Size of $\mathcal{M}^{(0)}_{i,l}$ at initialization). With high probability at least $1 - e^{-\Omega(\log^5 k)}$, we have $|\mathcal{M}^{(0)}_{i,l}| \le m_0$, where $m_0 := O(\log^5 k)$.

To show our main result, we need some assumptions on the parameters and an induction hypothesis.

Assumption C.2 (Parameter Assumption). The parameters introduced in the paper satisfy the following conditions:
• $\varrho$ is the threshold of the smoothed ReLU activation; we assume $\varrho = \frac{1}{\mathrm{polylog}(k)}$.
• $q \ge 3$ and $\sigma_0^{q-2} \le \frac{1}{k}$.
• $\gamma$ controls the feature noise; $\gamma \le \tilde O\big(\frac{\sigma_0}{k}\big)$.
• $s$ controls the feature sparsity; $s = \Theta(\mathrm{polylog}(k))$.
• $N \ge \omega\big(\frac{k}{\sigma_0^{2q-1}}\big)$, $\sqrt{d} \ge \omega(k/\sigma_0^{2q-1})$, $\sqrt{d} \ge \omega(k^{5/2}/\eta^{1/q})$, and $P \le \sigma_0^{-q+1/2}$.
• $\mathrm{polylog}(k) \le m \le \sqrt{k}$.
• $\eta \ge \frac{1}{k^{q(q-2)}}$ and $\eta \le \frac{1}{\mathrm{poly}(k)}$.
• $c(\theta) = \frac{1}{\theta}$.
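The two quantities just defined can be sketched numerically. The snippet below (a hypothetical illustration; the sizes k, m, d, the Gaussian initialization scale, and the one-hot "feature" are our own assumptions) computes the best correlation score Λ between a feature and km random kernels, together with the near-maximal "winning set" M^(0).

```python
import numpy as np

def winning_set(kernels, v, slack):
    """kernels: (km, d) rows w_r; v: (d,) unit feature; slack ~ O(1/log k).

    Returns (Lambda, M): Lambda = max_r [<w_r, v>]^+, and the indices of
    kernels whose score is within a (1 - slack) factor of Lambda.
    """
    scores = kernels @ v                       # <w_r, v> for every kernel
    lam = max(scores.max(), 0.0)               # Lambda = max_r [<w_r, v>]^+
    members = np.nonzero(scores >= lam * (1.0 - slack))[0]
    return lam, members

rng = np.random.default_rng(0)
k, m, d = 50, 4, 64
W0 = rng.standard_normal((k * m, d)) / np.sqrt(d)   # sigma_0 ~ 1/sqrt(d) init
v = np.zeros(d); v[0] = 1.0                         # a one-hot "feature"
lam0, M0 = winning_set(W0, v, slack=1.0 / np.log(k))
```

At random initialization the winning set is a small subset of the km kernels, matching the O(polylog(k)) bound of Lemma C.1 in spirit.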
• $\tau$ is the parameter controlling the update of the teacher network's weights; $\tau = 1 + \frac{1-\theta}{C_p\theta} + \Theta\big(\frac{1}{t^{1/q}+1}\big)$.

The following induction hypothesis is important as it captures the main properties of the kernels during training.

Induction Hypothesis C.3. For every $v_{i,l} \in \mathcal{V}$, for each $r \in \mathcal{M}^{(0)}_{i,l}$, and for every $(X,y) \in \mathcal{Z}$:
(a) for every $p \in \mathcal{P}_{v_{i,l}}(X)$, we have $\langle w^{(t)}_r, x_p\rangle = \langle w^{(t)}_r, v_{i,l}\rangle z_p + \tilde o(\sigma_0)$;
(b) for every $p \in \mathcal{P}(X)\setminus\mathcal{P}_{v_{i,l}}(X)$, we have $|\langle w^{(t)}_r, x_p\rangle| \le \tilde O(\sigma_0)$;
(c) for every $p \in [P]\setminus\mathcal{P}(X)$, we have $|\langle w^{(t)}_r, x_p\rangle| \le \tilde O(\sigma_0\gamma k)$.
For every $r \notin \cup_{i\in[k],l\in[2]}\mathcal{M}^{(0)}_{i,l}$ and for every $(X,y)\in\mathcal{Z}$:
(d) for every $p \in \mathcal{P}(X)$, $|\langle w^{(t)}_r, x_p\rangle| \le \tilde O(\sigma_0)$;
(e) for every $p \in [P]\setminus\mathcal{P}(X)$, $|\langle w^{(t)}_r, x_p\rangle| \le \tilde O(\sigma_0\gamma k)$.
Moreover, for every $v_{i,l}\in\mathcal{V}$:
(f) $\Lambda^{(t)}_{i,l} \in [\Omega(\sigma_0), \tilde O(1)]$;
(g) for each $r \in \mathcal{M}^{(0)}_{i,l}$, $\langle w^{(t)}_r, v_{i,l}\rangle \ge -\tilde O(\sigma_0)$;
(h) for each $r \notin \mathcal{M}^{(0)}_{i,l}$, $\langle w^{(t)}_r, v_{i,l}\rangle \le \tilde O(\sigma_0)$.

Now we have the following result on the feature learning process of MRP.

Theorem C.4 (Feature learning process of MRP). Suppose Assumption C.2 holds. Running the gradient descent step in (2) with learning rate $\eta \le \frac{1}{\mathrm{poly}(k)}$, after $T = \frac{\mathrm{poly}(k)}{\eta}$ iterations, for sufficiently large $k > 0$, Induction Hypothesis C.3 holds for all iterations $t = 0, 1, \ldots, T$ with high probability.

Differences from the work (Allen-Zhu & Li, 2020). We use the same multi-view data assumption (Assumption 3.1) as (Allen-Zhu & Li, 2020), since we find it reasonable and practical, and it is helpful in proving what is actually learned in mask-reconstruction pretraining. Besides, we adopt the same network (a two-layer CNN with smoothed ReLU activation) as in (Allen-Zhu & Li, 2020) for our encoder. This is mainly for easy comparison between the supervised learning results in (Allen-Zhu & Li, 2020) and the self-supervised results proved by us, which better illustrates the benefits of self-supervised pretraining.
But there are three main differences between our work and (Allen-Zhu & Li, 2020): the network architecture, the objective loss, and the teacher. For network architecture, MRP contains both an encoder and a decoder, while supervised learning in (Allen-Zhu & Li, 2020) only considers the encoder. As for the objective loss, MRP uses a reconstruction loss on a masked input, while supervised learning in (Allen-Zhu & Li, 2020) uses a distillation loss and a cross-entropy loss on a non-masked input. Finally, in terms of the teacher, MRP uses an online teacher whose parameters change along with training and is thus more dynamic and complex, while supervised learning in (Allen-Zhu & Li, 2020) uses a well-trained teacher whose parameters are fixed and which thus gives a fixed target for each image. These three major differences lead to different lottery-ticket winning/losing processes during training. This can be observed by comparing our Induction Hypothesis C.3 with Induction Hypothesis B.3 of (Allen-Zhu & Li, 2020): in ours, no semantic feature loses the lottery, while in theirs some semantic features are missed during training. Due to the different induction hypotheses, the analysis of mask-reconstruction pretraining is non-trivial, which is one part of our novel contributions. Another part of our contributions is that, after pretraining, we further show the test performance on the downstream classification task. Here we use the same cross-entropy loss as (Allen-Zhu & Li, 2020), which is also popularly adopted in supervised training. But different from (Allen-Zhu & Li, 2020), which simply fixes the linear coefficients on the outputs of the encoder's (backbone's) convolution kernels to 1, we need to train the weights of an extra linear layer and fine-tune the weights of the encoder's convolution kernels at the same time (see (15) in Appendix G).

D PROOF OVERVIEW OF THEOREM C.4

In this section, we introduce the main steps of the proof of Theorem C.4. The proof consists of two stages: the initial stage $t \le T_0$, where $T_0 = \Theta\big(\frac{k}{\eta\sigma_0^{2q-2}}\big)$, and the convergence stage $t \in [T_0, T]$, where $T - T_0 \le \tilde O\big(\frac{kT_0^{1/q}}{\eta}\big)$.

D.1 INITIAL STAGE

The initial stage of the training process is defined as the iterations $t \le T_0$, where $T_0 = \Theta\big(\frac{k}{\eta\sigma_0^{2q-2}}\big)$. In this stage, kernels in $\mathcal{M}^{(0)}_{i,l}$ focus on learning the feature $v_{i,l}$. More formally, we will prove that at least one kernel in $\mathcal{M}^{(0)}_{i,l}$ captures the feature $v_{i,l}$, i.e.,
$$\max_{r\in\mathcal{M}^{(0)}_{i,l}} \langle w^{(T_0)}_r, v_{i,l}\rangle \ge \varrho = \frac{1}{\mathrm{polylog}(k)}.$$
This indicates that the maximum correlation between the kernels inside $\mathcal{M}^{(0)}_{i,l}$ and the feature $v_{i,l}$ grows. Since the kernels inside $\mathcal{M}^{(0)}_{i,l}$ capture the main correlations with the feature $v_{i,l}$, what about the kernels outside $\mathcal{M}^{(0)}_{i,l}$? To answer this question, we will show that for $w_r \notin \mathcal{M}^{(0)}_{i,l}$,
$$\langle w^{(T_0)}_r, v_{i,l}\rangle \le \tilde O(\sigma_0) = \tilde O\Big(\frac{1}{\sqrt{k}}\Big),$$
i.e., their correlations with the feature $v_{i,l}$ stay small. A kernel whose magnitude of $v_{i,l}$ lags behind at initialization loses the lottery and captures little of the feature $v_{i,l}$. Furthermore, will the kernels inside $\mathcal{M}^{(0)}_{i,l}$ capture another feature $v_{j,l'} \ne v_{i,l}$? The answer is no: we also prove that for $r\in\mathcal{M}^{(0)}_{i,l}$, $\langle w^{(T_0)}_r, v_{j,l'}\rangle \le \tilde O(\sigma_0)$ for all $v_{j,l'} \ne v_{i,l}$. Besides, the kernels are also not influenced by the noise: for all $r \in [km]$ and every $p \in [P]$, $\langle w^{(T_0)}_r, \xi_p\rangle \le \tilde O\big(\frac{1}{\mathrm{poly}(k)}\big)$.

D.2 CONVERGENCE STAGE

In this stage, $t \in [T_0, T]$, since some kernels have won the lottery, their correlations with the corresponding features continue to hold. Meanwhile, the gradients become small, which drives the learning process to converge. The intuition is that once the correlation between a kernel and its corresponding feature grows beyond the threshold $\varrho$, the gradient becomes small; that is, once a kernel has learned its feature, the increase of the correlation is small, driving the learning process to converge. We will show that for $t \ge T_0$ and $w_r \in \mathcal{M}^{(0)}_{i,l}$,
$$\big\langle \nabla_{w^{(t)}_r} L(H), v_{i,l}\big\rangle \le \tilde O\Big(\frac{1}{T_0^{1/q}}\Big) = \tilde O\Big(\frac{1}{\mathrm{poly}(k)}\Big),$$
while the gradients with respect to the other features (those not captured by this kernel) remain even smaller. In this way, the correlation between a kernel and its corresponding feature does not grow too large, and we will show that $\max_{r\in\mathcal{M}^{(0)}_{i,l}}\langle w^{(T)}_r, v_{i,l}\rangle \le \tilde O(1)$.

E SOME TECHNICAL RESULTS

In this section, we first show the gradient and its approximations. We also state some consequences of our Induction Hypothesis C.3. All of them are useful in the later proofs of the main results.

E.1 GRADIENTS AND THEIR APPROXIMATIONS

Recall
$$L(H;X,\epsilon) = \frac12\sum_{r\in[km]}\Big(\sum_{p\in[P]}\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \sum_{p\in[P]} c(\theta)\,\mathrm{ReLU}(\langle w_r, \epsilon_p x_p\rangle)\Big)^2$$
and
$$L(H;X) = E_\epsilon[L(H;X,\epsilon)] = \frac12\sum_{r\in[km]}\Big(\sum_{p\in[P]}\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \sum_{p\in[P]}\mathrm{ReLU}(\langle w_r, x_p\rangle)\Big)^2 + \frac12\Big(\frac1\theta - 1\Big)\sum_{r\in[km]}\sum_{p\in[P]}\mathrm{ReLU}(\langle w_r, x_p\rangle)^2. \qquad (3)$$

The role of $\theta$. Since we take an expectation of the pretraining loss $L(H;X,\epsilon)$ over $\epsilon$ when performing gradient descent, the expectation of $\epsilon$ (here $E[\epsilon_p]=\theta$) matters in the analysis. More specifically, based on (3) and the definition of $\theta$: when $\theta\to1$, $P(\epsilon_p=1)\to1$, i.e., there is nearly no masking. Then, according to our choice of the teacher's network parameters, $\hat w_r \approx w_r$ as $\theta\to1$, which means the loss stays small and there is nearly no update of the parameters of the student and teacher models; in this case, the student model cannot learn the useful semantic features in the data. When $\theta\to0$ (i.e., we mask all the data), $\hat w_r\to\infty$ and $L(H;X)\to\infty$; in this case, the student model again cannot learn useful semantic features. Overall, there is a trade-off in the mask ratio $\theta$. The analysis also holds under the MAE model in Appendix H: when $\theta\to0$, $L(H;X)\to\infty$; when $\theta\to1$, there is a trivial solution making $L(H;X)\to0$. In both extremes, no semantic features are learned.

The derivation of the above loss function is as follows. For simplicity, denote $\hat y_{r,p} = \mathrm{ReLU}(\langle\hat w_r, x_p\rangle)$ and $y_{r,p} = \mathrm{ReLU}(\langle w_r, x_p\rangle)$. Then the loss function is
$$L(H;X,\epsilon) = \frac12\sum_{r\in[km]}\Big(\sum_{p\in[P]}\hat y_{r,p} - \sum_{p\in[P]} c(\theta)\,\epsilon_p\, y_{r,p}\Big)^2.$$
We also have
$$\begin{aligned}
E_\epsilon\Big[\Big(\sum_{p\in[P]} c(\theta)\,\epsilon_p\, y_{r,p}\Big)^2\Big]
&= \frac{1}{\theta^2} E_\epsilon\Big[\sum_{p\in[P]}\epsilon_p y_{r,p}\sum_{p'\in[P]}\epsilon_{p'} y_{r,p'}\Big]
= \frac{1}{\theta^2} E_\epsilon\Big[\sum_{p\in[P]}\epsilon_p y_{r,p}\sum_{p'\ne p}\epsilon_{p'} y_{r,p'}\Big] + \frac{1}{\theta^2} E_\epsilon\Big[\sum_{p\in[P]}\epsilon_p y_{r,p}\sum_{p'=p}\epsilon_{p'} y_{r,p'}\Big]\\
&= \frac{1}{\theta^2} E_\epsilon\Big[\sum_{p\in[P]}\sum_{p'\ne p}\epsilon_p\epsilon_{p'}\, y_{r,p}\, y_{r,p'}\Big] + \frac{1}{\theta^2} E_\epsilon\Big[\sum_{p\in[P]}\sum_{p'=p}\epsilon_p\epsilon_{p'}\, y_{r,p}\, y_{r,p'}\Big]\\
&\overset{(a)}{=} \sum_{p\in[P]} y_{r,p}\sum_{p'\ne p} y_{r,p'} + \frac1\theta\sum_{p\in[P]} y_{r,p}^2
= \sum_{p\in[P]} y_{r,p}\sum_{p'} y_{r,p'} + \Big(\frac1\theta-1\Big)\sum_{p\in[P]} y_{r,p}^2
= \Big(\sum_{p\in[P]} y_{r,p}\Big)^2 + \Big(\frac1\theta-1\Big)\sum_{p\in[P]} y_{r,p}^2,
\end{aligned}$$
where (a) holds because $E_\epsilon[\epsilon_p\epsilon_{p'}] = \theta^2$ when $p\ne p'$, as the entries of $\epsilon$ are independent Bernoulli variables with $\Pr(\epsilon_p=1)=\theta$, and $E_\epsilon[\epsilon_p\epsilon_{p'}] = \theta$ when $p=p'$. Combining the above results, we have
$$L(H;X) = E_\epsilon[L(H;X,\epsilon)] = \frac12\sum_{r\in[km]}\Big(\sum_{p\in[P]}\hat y_{r,p} - \sum_{p\in[P]} y_{r,p}\Big)^2 + \frac12\Big(\frac1\theta-1\Big)\sum_{r\in[km]}\sum_{p\in[P]} y_{r,p}^2,$$
which is our result. We define
$$\Phi_r(X) := \sum_{p\in[P]}\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \sum_{p\in[P]}\mathrm{ReLU}(\langle w_r, x_p\rangle).$$

Fact 2.1 (Gradients). Given a data point $(X,y)\in\mathcal D$, for every $w_r$, $r\in[km]$,
$$-\nabla_{w_r}L(X) = \sum_{p\in[P]}\Big(\Phi_r(X) - \Big(\frac1\theta-1\Big)\mathrm{ReLU}(\langle w_r, x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r, x_p\rangle)\,x_p,$$
where $\mathrm{ReLU}'$ is the derivative of ReLU. Besides, we set $\hat w^{(t)}_r = \tau w^{(t)}_r$. We define several terms that will be used in our proofs.

Definition E.1.
$$\begin{aligned}
V_{r,i,l}(X) &= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\mathrm{ReLU}'(\langle w_r, x_p\rangle)\,z_p, &
\bar V_{r,i,l}(X) &= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\mathrm{ReLU}'(\langle w_r, x_p\rangle),\\
W_{r,i,l}(X) &= \Big(\frac1\theta-1\Big) I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\mathrm{ReLU}(\langle w_r, x_p\rangle)\mathrm{ReLU}'(\langle w_r, x_p\rangle)\,z_p, &
\hat W_{r,i,l}(X) &= \Big(\frac1\theta-1\Big) I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\mathrm{ReLU}(\langle w_r, x_p\rangle)\mathrm{ReLU}'(\langle w_r, x_p\rangle),\\
\Delta_{r,i,l}(X) &= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\big(\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \mathrm{ReLU}(\langle w_r, x_p\rangle)\big), &
U_{i,l}(X) &= I_{\{v_{i,l}\in\mathcal V(X)\}}\cdot\tilde O(\sigma_0^{2q-1}).
\end{aligned}$$

Definition E.2 (small error terms, for ease of notation).
$$E_1 = \tilde O(\sigma_0^{q-1})(\gamma+\sigma_p)s,\quad E_2 = \tilde O((\sigma_0\gamma k)^{q-1})(\gamma+\sigma_p)P,\quad E_3 = \tilde O(\sigma_0^{2q-1})(\gamma+\sigma_p)s,\quad E_4 = \tilde O((\sigma_0\gamma k)^{2q-1})(\gamma+\sigma_p)P,\quad E_5 = \tilde O(\sigma_0^q)(s+1),\quad E_6 = \tilde O((\sigma_0\gamma k)^q)P.$$

We have the following lemma to approximate the gradient.
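The expectation identity just derived can be checked numerically. The sketch below (with hypothetical values of θ and of the activations y_{r,p}) Monte-Carlo estimates E_ε[(Σ_p c(θ)ε_p y_{r,p})²] and compares it with the closed form (Σ_p y_{r,p})² + (1/θ − 1)Σ_p y_{r,p}².

```python
import numpy as np

def masked_second_moment(y, theta, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of E_eps[(sum_p (1/theta) * eps_p * y_p)^2],
    where eps_p are i.i.d. Bernoulli(theta) mask variables."""
    rng = np.random.default_rng(seed)
    eps = rng.random((n_samples, y.size)) < theta   # Bernoulli(theta) masks
    vals = (eps @ y) / theta                        # sum_p c(theta) eps_p y_p
    return float(np.mean(vals ** 2))

# Hypothetical activations y_{r,p} for one kernel r and P = 5 patches.
theta = 0.4
y = np.array([0.3, -1.2, 0.7, 2.0, -0.5])
closed_form = y.sum() ** 2 + (1 / theta - 1) * (y ** 2).sum()
```

The extra variance term (1/θ − 1)Σ_p y² is exactly the second term of the expected loss in (3), and it vanishes as θ → 1 (no masking).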
We first approximate the term $\Phi_r(X)$.

Claim E.3 (Bounds on $\Phi_r(X)$). Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds at iteration $t$. Then for every $v_{i,l}\in\mathcal V$, every $r\in\mathcal M^{(0)}_{i,l}$, and every $(X,y)\in\mathcal Z$,
$$\Phi_r(X) = \Delta_{r,i,l}(X) \pm E_5 \pm E_6.$$

Proof of Claim E.3. Using Induction Hypothesis C.3, for every $v_{i,l}\in\mathcal V$, every $r\in\mathcal M^{(0)}_{i,l}$, and every $(X,y)\in\mathcal Z$,
$$\begin{aligned}
\Phi_r(X) &= \sum_{p\in[P]}\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \sum_{p\in[P]}\mathrm{ReLU}(\langle w_r, x_p\rangle)\\
&= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\big(\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \mathrm{ReLU}(\langle w_r, x_p\rangle)\big) + \sum_{p\in\mathcal P(X)\setminus\mathcal P_{v_{i,l}}(X)}\big(\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \mathrm{ReLU}(\langle w_r, x_p\rangle)\big) + \sum_{p\in[P]\setminus\mathcal P(X)}\big(\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \mathrm{ReLU}(\langle w_r, x_p\rangle)\big)\\
&\overset{(a)}{=} I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\big(\mathrm{ReLU}(\langle\hat w_r, x_p\rangle) - \mathrm{ReLU}(\langle w_r, x_p\rangle)\big) \pm \tilde O(\sigma_0^q)(s+1) \pm \tilde O((\sigma_0\gamma k)^q) P,
\end{aligned}$$
where (a) holds since $C_p$ is a universal constant. □

Claim E.4 (Approximations of gradients). Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds at iteration $t$. Then for every $v_{i,l}\in\mathcal V$, every $r\in\mathcal M^{(0)}_{i,l}$, and $(X,y)\in\mathcal Z$:
(a)
$$\langle -\nabla_{w_r}L(X), v_{i,l}\rangle = V_{r,i,l}(X)\Delta_{r,i,l}(X) - W_{r,i,l}(X) + \Delta_{r,i,l}(X)(E_1+E_2) - E_3 - E_4 \pm \big(V_{r,i,l}(X) + \bar V_{r,i,l}(X)(\gamma+\sigma_p)\big)(E_5+E_6) \pm (E_5+E_6)(E_1+E_2);$$
(b) for $v_{j,l'}\ne v_{i,l}$ (note that $v_{j,l'}\ne v_{i,l}$ means $j=i, l'\ne l$, or $j\ne i$),
$$\langle -\nabla_{w_r}L(X), v_{j,l'}\rangle = \big(\bar V_{r,i,l}(X)\Delta_{r,i,l}(X) - \hat W_{r,i,l}(X)\big)(\gamma+\sigma_p) \pm \bar V_{r,i,l}(X)(E_5+E_6)(\gamma+\sigma_p) \pm U_{r,j,l'}(X) + \Delta_{r,i,l}(X)(E_1+E_2) \pm (E_5+E_6)(E_1+E_2) - E_3 - E_4.$$

Proof of Claim E.4. We first prove (a).
Using Induction Hypothesis C.3 and the fact that $C_p$ is a universal constant, for $v_{i,l}\in\mathcal V$ and every $r\in\mathcal M^{(0)}_{i,l}$ we have
$$\begin{aligned}
\langle -\nabla_{w_r}L(X), v_{i,l}\rangle &= \sum_{p\in[P]}\Big(\Phi_r(X) - \big(\tfrac1\theta - 1\big)\mathrm{ReLU}(\langle w_r,x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)\,\langle x_p, v_{i,l}\rangle\\
&= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\Big(\Phi_r(X) - \big(\tfrac1\theta-1\big)\mathrm{ReLU}(\langle w_r,x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)\big(z_p + \alpha_{p,v_{i,l}} + \langle v_{i,l},\xi_p\rangle\big)\\
&\quad + \sum_{p\in\mathcal P(X)\setminus\mathcal P_{v_{i,l}}(X)}\Big(\Phi_r(X) - \big(\tfrac1\theta-1\big)\mathrm{ReLU}(\langle w_r,x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)\big(\alpha_{p,v_{i,l}} + \langle v_{i,l},\xi_p\rangle\big)\\
&\quad + \sum_{p\in[P]\setminus\mathcal P(X)}\Big(\Phi_r(X) - \big(\tfrac1\theta-1\big)\mathrm{ReLU}(\langle w_r,x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)\big(\alpha_{p,v_{i,l}} + \langle v_{i,l},\xi_p\rangle\big)\\
&= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\Big(\Delta_{r,i,l}(X) - \big(\tfrac1\theta-1\big)\mathrm{ReLU}(\langle w_r,x_p\rangle)\Big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)\,z_p \pm V_{r,i,l}(X)(E_5+E_6) \pm \bar V_{r,i,l}(X)(E_5+E_6)(\gamma+\sigma_p)\\
&\quad + \big(\Delta_{r,i,l}(X) \pm E_5 \pm E_6 - \tilde O(\sigma_0^q)\big)\cdot\tilde O(\sigma_0^{q-1})(\gamma+\sigma_p)(s+1) + \big(\Delta_{r,i,l}(X) \pm E_5 \pm E_6 - \tilde O((\sigma_0\gamma k)^q)\big)\cdot\tilde O((\sigma_0\gamma k)^{q-1})(\gamma+\sigma_p)P\\
&= V_{r,i,l}(X)\Delta_{r,i,l}(X) - W_{r,i,l}(X) + \Delta_{r,i,l}(X)(E_1+E_2) \pm \big(V_{r,i,l}(X) + \bar V_{r,i,l}(X)(\gamma+\sigma_p)\big)(E_5+E_6) \pm (E_5+E_6)(E_1+E_2) - E_3 - E_4.
\end{aligned}$$
Now we show (b). Using Induction Hypothesis C.3, for $v_{i,l}\in\mathcal V$, every $r\in\mathcal M^{(0)}_{i,l}$, and $v_{j,l'}\ne v_{i,l}$, writing $G_{r,p} := \big(\Phi_r(X) - (\tfrac1\theta-1)\mathrm{ReLU}(\langle w_r,x_p\rangle)\big)\mathrm{ReLU}'(\langle w_r,x_p\rangle)$ for brevity, we have
$$\begin{aligned}
\langle -\nabla_{w_r}L(X), v_{j,l'}\rangle &= \sum_{p\in[P]} G_{r,p}\,\langle x_p, v_{j,l'}\rangle\\
&= I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)} G_{r,p}\big(\alpha_{p,v_{j,l'}} + \langle v_{j,l'},\xi_p\rangle\big) + I_{\{v_{j,l'}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{j,l'}}(X)} G_{r,p}\big(z_p + \alpha_{p,v_{j,l'}} + \langle v_{j,l'},\xi_p\rangle\big)\\
&\quad + \sum_{p\in\mathcal P(X)\setminus\{\mathcal P_{v_{i,l}}(X)\cup\mathcal P_{v_{j,l'}}(X)\}} G_{r,p}\big(\alpha_{p,v_{j,l'}} + \langle v_{j,l'},\xi_p\rangle\big) + \sum_{p\in[P]\setminus\mathcal P(X)} G_{r,p}\big(\alpha_{p,v_{j,l'}} + \langle v_{j,l'},\xi_p\rangle\big)\\
&= \big(\bar V_{r,i,l}(X)\Delta_{r,i,l}(X) - \hat W_{r,i,l}(X)\big)(\gamma+\sigma_p) \pm \bar V_{r,i,l}(X)(E_5+E_6)(\gamma+\sigma_p) \pm U_{r,j,l'}(X) + \Delta_{r,i,l}(X)(E_1+E_2) \pm (E_5+E_6)(E_1+E_2) - E_3 - E_4. \qquad\square
\end{aligned}$$
E.2 SOME RESULTS FROM INDUCTION HYPOTHESIS C.3

E.2.1 GROWTH OF $\Lambda^{(t)}_{i,l}$

The following claim shows at which iteration $\Lambda^{(t)}_{i,l}$ exceeds the threshold $\varrho$ in the definition of the smoothed ReLU function.

Claim E.5. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds at iteration $t$. For every $v_{i,l}$, suppose $\Lambda^{(t)}_{i,l} \le \varrho$. Then
$$\Lambda^{(t+1)}_{i,l} = \Lambda^{(t)}_{i,l} + \Theta\Big(\frac{\eta}{k}\Big)\,\mathrm{ReLU}(\Lambda^{(t)}_{i,l})\,\mathrm{ReLU}'(\Lambda^{(t)}_{i,l}).$$

Proof of Claim E.5. Recall that $\Lambda^{(t)}_{i,l} := \max_{r\in[km]}[\langle w^{(t)}_r, v_{i,l}\rangle]^+$. Choose any $r\in[km]$ such that $\langle w^{(t)}_r, v_{i,l}\rangle \ge \Omega(\sigma_0)$. The update is
$$\langle w^{(t+1)}_r, v_{i,l}\rangle = \langle w^{(t)}_r, v_{i,l}\rangle + \eta\, E_{(X,y)\sim\mathcal Z}\big[\langle -\nabla_{w_r}L(X), v_{i,l}\rangle\big].$$
Using Claim E.4, we have
$$\langle -\nabla_{w_r}L(X), v_{i,l}\rangle = V_{r,i,l}(X)\Delta_{r,i,l}(X) - W_{r,i,l}(X) + \Delta_{r,i,l}(X)(E_1+E_2) - E_3 - E_4 \pm \big(V_{r,i,l}(X)+\bar V_{r,i,l}(X)(\gamma+\sigma_p)\big)(E_5+E_6) \pm (E_5+E_6)(E_1+E_2).$$
Recall the definitions of $V_{r,i,l}$, $\Delta_{r,i,l}$, $W_{r,i,l}$. Since $\Lambda^{(t)}_{i,l}\le\varrho$ and by the definition of the smoothed ReLU function, we can simplify the above by keeping only the main increasing term:
$$\langle -\nabla_{w_r}L(X), v_{i,l}\rangle = \Delta_{r,i,l}(X)V_{r,i,l}(X) - W_{r,i,l}(X).$$
This is obtained by taking $\langle w^{(t)}_r, v_{i,l}\rangle \ge \Omega(\sigma_0)$ and comparing its order with the remaining terms; it is indeed the main increasing term. For $(X,y)\in\mathcal Z$,
$$V_{r,i,l}(X) = I_{\{v_{i,l}\in\mathcal V(X)\}}\,\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle)\sum_{p\in\mathcal P_{v_{i,l}}(X)} z_p^q, \qquad (4)$$
$$\Delta_{r,i,l}(X) = I_{\{v_{i,l}\in\mathcal V(X)\}}\,\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\sum_{p\in\mathcal P_{v_{i,l}}(X)} (\tau^q - 1)\, z_p^q, \qquad (5)$$
$$W_{r,i,l}(X) = I_{\{v_{i,l}\in\mathcal V(X)\}}\,\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\,\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle)\Big(\frac1\theta - 1\Big)\sum_{p\in\mathcal P_{v_{i,l}}(X)} z_p^{2q}. \qquad (6)$$
Then
$$\Delta_{r,i,l}(X)V_{r,i,l}(X) - W_{r,i,l}(X) = \sum_{p\in\mathcal P_{v_{i,l}}(X)} z_p^q\Big(\sum_{p'\in\mathcal P_{v_{i,l}}(X)}(\tau^q-1)z_{p'}^q - \Big(\frac1\theta-1\Big)z_p^q\Big)\times I_{\{v_{i,l}\in\mathcal V(X)\}}\,\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\,\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle).$$
According to our choice of $\tau$, and since $z_p$ is uniformly distributed over the $C_p$ patches, when $(X,y)\in\mathcal Z_m$ with $i=y$, or $(X,y)\in\mathcal Z_s$ with $i=y$ and $l=\hat l$ (the view present in the single-view sample), we have
$$E_{(X,y)\in\mathcal Z}\Big[\sum_{p\in\mathcal P_{v_{i,l}}(X)} z_p^q\Big(\sum_{p'\in\mathcal P_{v_{i,l}}}(\tau^q-1)z_{p'}^q - \Big(\frac1\theta-1\Big)z_p^q\Big) I_{\{v_{i,l}\in\mathcal V(X)\}}\Big] \in [\Omega(1), O(1)].$$
When $(X,y)\in\mathcal Z_s$ with $i=y$ and $l=3-\hat l$, the same expectation over $\mathcal Z_s$ lies in $[\Omega(\rho), O(\rho)]$. When $(X,y)\in\mathcal Z$ and $i\ne y$, it lies in $\frac{s}{k}[\Omega(1), O(1)]$. Combining all the above results, we obtain
$$\langle w^{(t+1)}_r, v_{i,l}\rangle = \langle w^{(t)}_r, v_{i,l}\rangle + \Theta\Big(\frac{\eta}{k}\Big)\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\,\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle). \qquad\square$$

Using Claim E.5 and $\Omega(\sigma_0) \le \Lambda^{(0)}_{i,l} \le \tilde O(\sigma_0)$, we have the following result.

Claim E.6. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for every iteration. Define $T_0 := \Theta\big(\frac{k}{\eta\sigma_0^{2q-2}}\big)$. Then for $t \ge T_0$, we have $\Lambda^{(t)}_{i,l} \ge \Theta\big(\frac{1}{\mathrm{polylog}(k)}\big)$.

Proof of Claim E.6. Using the result of Claim E.5 and starting from $\Lambda^{(0)}_{i,l} = \Theta(\sigma_0)$, we have
$$\Lambda^{(t)}_{i,l} \approx \Lambda^{(0)}_{i,l}\Big(1 + \Theta\Big(\frac{\eta}{k}\cdot\frac{\sigma_0^{2q-2}}{\varrho^{2q-2}}\Big)\Big)^t.$$
Thus, when $T_0 = \Theta\big(\frac{k}{\eta\sigma_0^{2q-2}}\big)$, we have $\Lambda^{(T_0)}_{i,l} \approx \Theta(\sigma_0)e^{\mathrm{polylog}(k)}$, which means $\Lambda^{(T_0)}_{i,l} = \Theta\big(\frac{1}{\mathrm{polylog}(k)}\big)$. □
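The growth recursion of Claim E.5 can be illustrated numerically. The sketch below (all constants q, ϱ, η/k, and the piecewise form of the smoothed ReLU are our own hypothetical choices, used only to show the qualitative dynamics) iterates Λ ← Λ + (η/k)·ReLU(Λ)·ReLU′(Λ): the correlation grows geometrically from its small initialization and eventually crosses the threshold ϱ, after which the kernel has "won the lottery".

```python
def smoothed_relu(x, q=3, rho=0.1):
    """A common smoothed ReLU: polynomial of degree q below rho, linear above."""
    if x <= 0:
        return 0.0
    if x <= rho:
        return x ** q / (q * rho ** (q - 1))
    return x - (1 - 1 / q) * rho

def smoothed_relu_grad(x, q=3, rho=0.1):
    if x <= 0:
        return 0.0
    if x <= rho:
        return (x / rho) ** (q - 1)
    return 1.0

def iterations_to_cross(sigma0, eta_over_k=5.0, rho=0.1, max_steps=100_000):
    """Number of update steps until Lambda crosses the threshold rho."""
    lam, t = sigma0, 0
    while lam < rho and t < max_steps:
        lam += eta_over_k * smoothed_relu(lam, rho=rho) * smoothed_relu_grad(lam, rho=rho)
        t += 1
    return t
```

Larger initial correlations cross the threshold sooner, which is exactly why the near-maximal kernels in M^(0) win while lagging kernels do not.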

F PROOF OF THEOREM C.4

Before formally proving Theorem C.4, we need some lemmas. First, we prove that for every feature $v_{i,l}\in\mathcal V$, at least one of the "diagonal" correlations $\langle w^{(t)}_r, v_{i,l}\rangle$, $r\in\mathcal M^{(0)}_{i,l}$, grows, while the "off-diagonal" correlations $\langle w^{(t)}_r, v_{j,l'}\rangle$, $v_{j,l'}\ne v_{i,l}$, remain small. To show this, we provide three lemmas: a lower and an upper bound on $\langle w^{(t)}_r, v_{i,l}\rangle$ for $r\in\mathcal M^{(0)}_{i,l}$, and an upper bound on $\langle w^{(t)}_r, v_{j,l'}\rangle$ for $v_{j,l'}\ne v_{i,l}$ and $r\in\mathcal M^{(0)}_{i,l}$.

F.1 DIAGONAL CORRELATIONS

The first lemma is used to upper bound $\Lambda^{(t)}_{i,l}$.

Lemma F.1. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for all iterations $< t$. Then for all $v_{i,l}\in\mathcal V$: $\Lambda^{(t)}_{i,l} \le \tilde O(1)$.

Proof of Lemma F.1. Based on Claim E.4, for every $r\in\mathcal M^{(0)}_{i,l}$,
$$\langle w^{(t+1)}_r, v_{i,l}\rangle = \langle w^{(t)}_r, v_{i,l}\rangle + \eta\,E_{(X,y)\sim\mathcal Z}\Big[V_{r,i,l}(X)\Delta_{r,i,l}(X) - W_{r,i,l}(X) + \Delta_{r,i,l}(X)(E_1+E_2) \pm \big(V_{r,i,l}(X)+\bar V_{r,i,l}(X)(\gamma+\sigma_p)\big)(E_5+E_6) \pm (E_5+E_6)(E_1+E_2) - E_3 - E_4\Big].$$
Taking the positive part, there exists some $\delta^{(t)}_{r,i,l}\in[0,1]$ such that
$$[\langle w^{(t+1)}_r, v_{i,l}\rangle]^+ = [\langle w^{(t)}_r, v_{i,l}\rangle]^+ + \eta\,\delta^{(t)}_{r,i,l}\,E_{(X,y)\sim\mathcal Z}\big[\,\cdot\,\big],$$
with the same bracketed expression as above. Suppose we are at some iteration $t > T_0$. In this stage, $\Lambda^{(t)}_{i,l} \ge 1/\mathrm{polylog}(k)$. Since $T_0 = \Theta\big(\frac{k}{\eta\sigma_0^{2q-2}}\big)$ and $\eta \le \frac{1}{\mathrm{poly}(k)}$, we have
$$\Delta_{r,i,l}(X)V_{r,i,l}(X) - W_{r,i,l}(X) = \sum_{p\in\mathcal P_{v_{i,l}}(X)} z_p\Big(\sum_{p'\in\mathcal P_{v_{i,l}}(X)}(\tau-1)z_{p'} - \Big(\frac1\theta-1\Big)z_p\Big)\times I_{\{v_{i,l}\in\mathcal V(X)\}}\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle) = O\Big(\frac{1}{t^{1/q}}\Big)\cdot I_{\{v_{i,l}\in\mathcal V(X)\}}\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle).$$
Using Claim E.5 and again keeping only the main increasing term,
$$[\langle w^{(t+1)}_r, v_{i,l}\rangle]^+ \le [\langle w^{(t)}_r, v_{i,l}\rangle]^+ + \tilde O\Big(\frac{\eta}{kT_0^{1/q}}\Big)\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle) \le [\langle w^{(t)}_r, v_{i,l}\rangle]^+ + \tilde O\Big(\frac{\eta}{kT_0^{1/q}}\Big)\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle).$$
Taking the maximum over $r$ on both sides (recall $t > T_0$),
$$\max_{r\in\mathcal M^{(0)}_{i,l}}[\langle w^{(t+1)}_r, v_{i,l}\rangle]^+ \le \max_{r\in\mathcal M^{(0)}_{i,l}}[\langle w^{(t)}_r, v_{i,l}\rangle]^+\Big(1 + \tilde O\Big(\frac{\eta}{kT_0^{1/q}}\Big)\Big).$$
When $t \le T = T_0 + \tilde O\big(\frac{kT_0^{1/q}}{\eta}\big)$, we have $\Lambda^{(t)}_{i,l} \le \tilde O(1)$. □

The second lemma lower bounds $\langle w^{(t)}_r, v_{i,l}\rangle$, $r\in\mathcal M^{(0)}_{i,l}$, and indicates that the diagonal correlations are nearly non-negative.

Lemma F.2. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for all iterations $< t$. Then for all $v_{i,l}\in\mathcal V$ and all $r\in\mathcal M^{(0)}_{i,l}$: $\langle w^{(t)}_r, v_{i,l}\rangle \ge -\tilde O(\sigma_0)$.
Proof of Lemma F.2. We start from any iteration $t$ with $\langle w^{(t)}_r, v_{i,l}\rangle \le -\Omega(\sigma_0)$ and examine how negative the next iterate can become. Without loss of generality, we consider the case where $\langle w^{(t')}_r, v_{i,l}\rangle \le -\Omega(\sigma_0)$ holds for every $t' \ge t$. Based on Claim E.4,
$$\begin{aligned}
\langle w^{(t+1)}_r, v_{i,l}\rangle &= \langle w^{(t)}_r, v_{i,l}\rangle + \eta\,E_{(X,y)\sim\mathcal Z}\Big[V_{r,i,l}(X)\Delta_{r,i,l}(X) - W_{r,i,l}(X) + \Delta_{r,i,l}(X)(E_1+E_2) - E_3 - E_4 \pm \big(V_{r,i,l}(X)+\bar V_{r,i,l}(X)(\gamma+\sigma_p)\big)(E_5+E_6) \pm (E_5+E_6)(E_1+E_2)\Big]\\
&\overset{(a)}{\ge} \langle w^{(t)}_r, v_{i,l}\rangle + \eta\,E_{(X,y)\sim\mathcal Z}\big[-E_3 - E_4 - (E_5+E_6)(E_1+E_2)\big]
\ge \langle w^{(t)}_r, v_{i,l}\rangle - \eta\big(E_3 + E_4 + (E_5+E_6)(E_1+E_2)\big),
\end{aligned}$$
where (a) holds because, under the assumption $\langle w^{(t)}_r, v_{i,l}\rangle \le -\Omega(\sigma_0)$,
$$W_{r,i,l}(X) = \Big(\frac1\theta-1\Big)I_{\{v_{i,l}\in\mathcal V(X)\}}\sum_{p\in\mathcal P_{v_{i,l}}(X)}\mathrm{ReLU}\big(\langle w_r, v_{i,l}\rangle z_p \pm \tilde o(\sigma_0)\big)\,\mathrm{ReLU}'\big(\langle w_r, v_{i,l}\rangle z_p \pm \tilde o(\sigma_0)\big)\,z_p = 0,$$
and similar results hold for $\Delta_{r,i,l}$, $V_{r,i,l}$, $\bar V_{r,i,l}$. This shows that when $t \le T_0$,
$$\begin{aligned}
\langle w^{(t+1)}_r, v_{i,l}\rangle &\ge \langle w^{(t)}_r, v_{i,l}\rangle - \eta\,\tilde O(\sigma_0^{2q-1})(\gamma+\sigma_p)s^2 - \eta\,\tilde O((\sigma_0\gamma k)^{2q-1})(\gamma+\sigma_p)P^2 - \eta\,\tilde O(\sigma_0^{2q-1}(\gamma k)^{q-1})(\gamma+\sigma_p)Ps\\
&\ge -\tilde O(\sigma_0) - \eta T_0\,\tilde O(\sigma_0^{2q-1})(\gamma+\sigma_p)s^2 - \eta T_0\,\tilde O((\sigma_0\gamma k)^{2q-1})(\gamma+\sigma_p)P^2 - \eta T_0\,\tilde O(\sigma_0^{2q-1}(\gamma k)^{q-1})(\gamma+\sigma_p)Ps\\
&\ge -\tilde O(\sigma_0) - \tilde O\Big(\sigma_0^2 + \frac{k\sigma_0}{\sqrt d}\Big) - \tilde O\Big(\sigma_0^2 + \frac{k\sigma_0}{\sqrt d}\Big)(\gamma k)^{2q-1}P^2 \ge -\tilde O(\sigma_0).
\end{aligned}$$
When $t \in [T_0, T]$, we similarly have
$$\langle w^{(t)}_r, v_{i,l}\rangle \ge \langle w^{(T_0)}_r, v_{i,l}\rangle - \eta(T-T_0)\Big(\tilde O(\sigma_0^{2q-1})(\gamma+\sigma_p)s^2 + \tilde O((\sigma_0\gamma k)^{2q-1})(\gamma+\sigma_p)P^2 + \tilde O(\sigma_0^{2q-1}(\gamma k)^{q-1})(\gamma+\sigma_p)Ps\Big) \ge -\tilde O(\sigma_0). \qquad\square$$

F.2 OFF-DIAGONAL CORRELATIONS

Lemma F.3. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for all iterations $< t$. Then for all $v_{i,l}\in\mathcal V$, all $r\in\mathcal M^{(0)}_{i,l}$, and all $v_{j,l'} \ne v_{i,l}$: $|\langle w^{(t)}_r, v_{j,l'}\rangle| \le \tilde O(\sigma_0)$.

Proof of Lemma F.3. For every $r\in\mathcal M^{(0)}_{i,l}$, using Claim E.4, we have
$$|\langle w^{(t+1)}_r, v_{j,l'}\rangle| \le |\langle w^{(t)}_r, v_{j,l'}\rangle| + \eta\,E_{(X,y)\sim\mathcal Z}\Big[\big(\bar V_{r,i,l}(X)\Delta_{r,i,l}(X) - \hat W_{r,i,l}(X)\big)(\gamma+\sigma_p) + \bar V_{r,i,l}(X)(E_5+E_6)(\gamma+\sigma_p) + U_{r,j,l'}(X) + \Delta_{r,i,l}(X)(E_1+E_2) + (E_5+E_6)(E_1+E_2) - E_3 - E_4\Big].$$

Stage I. We first consider the stage $t \le T_0$. In this stage, similar to the analysis in the proof of Claim E.5, we have
$$E_{(X,y)\sim\mathcal Z}\big[(\bar V_{r,i,l}(X)\Delta_{r,i,l}(X) - \hat W_{r,i,l}(X))(\gamma+\sigma_p)\big] \le \tilde O\Big(\frac1k\Big)(\gamma+\sigma_p)\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle),$$
where $\tilde O(\frac1k)$ is the probability of $v_{i,l}\in\mathcal V(X)$; moreover,
$$E_{(X,y)\sim\mathcal Z}\big[\bar V_{r,i,l}(X)(E_5+E_6)(\gamma+\sigma_p)\big] \le \tilde O\Big(\frac1k\Big)(E_1+E_2)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle), \qquad E_{(X,y)\sim\mathcal Z}\big[U_{r,j,l'}(X)\big] \le \tilde O\Big(\frac1k\Big)\sigma_0^{2q-1},$$
where $\tilde O(\frac1k)$ in the last bound is the probability of $v_{j,l'}\in\mathcal V(X)$. Thus, when $t\le T_0$, again keeping only the main increasing term, we obtain
$$|\langle w^{(t)}_r, v_{j,l'}\rangle| \le |\langle w^{(0)}_r, v_{j,l'}\rangle| + \tilde O\Big(\frac{\eta}{k}\Big)(\gamma+\sigma_p)\sum_{t'=0}^{T_0}\mathrm{ReLU}(\Lambda^{(t')}_{i,l})\mathrm{ReLU}'(\Lambda^{(t')}_{i,l}). \qquad (8)$$
From Claim E.5, we have
$$\Theta\Big(\frac{\eta}{k}\Big)\sum_{t'=0}^{T_0-1}\mathrm{ReLU}(\Lambda^{(t')}_{i,l})\mathrm{ReLU}'(\Lambda^{(t')}_{i,l}) = \sum_{t'=0}^{T_0-1}\Lambda^{(t'+1)}_{i,l} - \sum_{t'=0}^{T_0-1}\Lambda^{(t')}_{i,l} = \Lambda^{(T_0)}_{i,l} - \Lambda^{(0)}_{i,l} \le \frac{1}{\mathrm{polylog}(k)}. \qquad (9)$$
Putting (9) into (8), for every $t\le T_0$,
$$|\langle w^{(t)}_r, v_{j,l'}\rangle| \le |\langle w^{(0)}_r, v_{j,l'}\rangle| + \tilde O\Big(\frac{\sigma_0}{k} + \frac{1}{\sqrt d}\Big) + \tilde O\Big(\frac{\eta}{k}\Big)(\gamma+\sigma_p)\mathrm{ReLU}(\Lambda^{(T_0)}_{i,l})\mathrm{ReLU}'(\Lambda^{(T_0)}_{i,l}) \le \tilde O(\sigma_0).$$

Stage II. In the second stage, when $t \ge T_0$, we have
$$E_{(X,y)\sim\mathcal Z}\big[(\bar V_{r,i,l}(X)\Delta_{r,i,l}(X) - \hat W_{r,i,l}(X))(\gamma+\sigma_p)\big] \le \tilde O\Big(\frac{1}{kT_0^{1/q}}\Big)(\gamma+\sigma_p)\mathrm{ReLU}(\langle w^{(t)}_r, v_{i,l}\rangle)\mathrm{ReLU}'(\langle w^{(t)}_r, v_{i,l}\rangle) \le \tilde O\Big(\frac{1}{kT_0^{1/q}}\Big)(\gamma+\sigma_p),$$
where the first inequality follows from Lemma F.1.
Thus, when t ∈ [T 0 , T ], |⟨w (t) r , v j,l ′ ⟩| ≤ |⟨w (T0) r , v j,l ′ ⟩| + Õ η(T -T 0 ) kT 1/q 0 • (γ + σ p ) ≤ |⟨w (T0) r , v j,l ′ ⟩| + Õ(σ 0 /k) + Õ(1/ √ d) ≤ Õ(σ 0 ). Combining all of the above results completes the proof.

F.3 LOTTERY WINNING: KERNELS INSIDE M (0) i,l

In this subsection, we prove that the amount of the feature v i,l captured by kernels not in M (0) i,l is negligible. To prove this result, we first need a lemma from (Allen-Zhu & Li, 2020, Lemma C.19) that compares the growth speeds of two sequences of updates of the form x t+1 ← x t + ηC t x q-1 t . Lemma F.4. Let q ≥ 3 be a constant and x 0 , y 0 = o(1). Let {x t , y t } t≥0 be two positive sequences updated as • x t+1 ≥ x t + ηC t x q-1 t for some C t = Θ(1), • y t+1 ≤ y t + ηSC t y q-1 t for some constant S = Θ(1). Suppose x 0 ≥ y 0 S 1/(q-2) 1 + 1 polylog(k) . Then for every A = O(1), letting T x be the first iteration at which x t ≥ A, we must have y Tx ≤ O(y 0 • polylog(k)). Now we begin to prove our result. Lemma F.5. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for all iterations < t. Then ∀v i,l ∈ V, ∀r / ∈ M (0) i,l : ⟨w (t) r , v i,l ⟩ ≤ Õ(σ 0 ). Proof of Lemma F.5. When r ∈ M (0) j,l ′ , (v j,l ′ ̸ = v i,l ), we have already proved that ⟨w (t) r , v i,l ⟩ ≤ Õ(σ 0 ) in Lemma F.3. So we only need to prove the case when r / ∈ ∪ i∈[k],l∈[2] M (0) i,l . Suppose there exists a kernel w r ′ / ∈ ∪ i∈[k],l∈[2] M (0) i,l such that Induction Hypothesis C.3 (a)-(c) holds for every (X, y) ∈ Z. We want to check whether the sequence ⟨w (t) r ′ , v i,l ⟩ can increase faster than max r∈M (0) i,l ⟨w (t) r , v i,l ⟩. Under this assumption, we have that (here we also only keep the main increasing term) ⟨w (t+1) r ′ , v i,l ⟩ = ⟨w (t) r ′ , v i,l ⟩ + ηE (X,y)∼Z V r,i,l (X)∆ r,i,l (X) -W r,i,l (X) . Stage I. We first consider the stage when t ≤ T 0 . In this stage, Λ (t) i,l ≤ ϱ. We define two sequences. First, we take w r * = argmax r∈M (0) i,l ⟨w (0) r , v i,l ⟩ and define x t := ⟨w (t) r * , v i,l ⟩ • s qk 1/2q 1 ϱ (2q-1)/2q . We also define y t = max{⟨w (t) r ′ , v i,l ⟩ • s qk 1/2q 1 ϱ (2q-1)/2q , σ 0 }.
From Claim E.5, when t ≤ T 0 , we have that ⟨w (t+1) r * , v i,l ⟩ = ⟨w (t) r * , v i,l ⟩ + Θ sη k ReLU(⟨w (t) r * , v i,l ⟩)ReLU ′ (⟨w (t) r * , v i,l ⟩) ≥ ⟨w (t) r * , v i,l ⟩ + Θ sη k 1 qϱ 2q-1 ([⟨w (t) r * , v i,l ⟩] + ) 2q-1 . Let S = 1+C/(log(k)-C) 1+1/ log(k) q-2 , C > 1. We have ⟨w (t+1) r ′ , v i,l ⟩ = ⟨w (t) r ′ , v i,l ⟩ + Θ sη k ReLU(⟨w (t) r ′ , v i,l ⟩)ReLU ′ (⟨w (t) r ′ , v i,l ⟩) ≤ ⟨w (t) r ′ , v i,l ⟩ + Θ sη k 1 qϱ 2q-1 ([⟨w (t) r ′ , v i,l ⟩] + ) 2q-1 S. Set C t = 1. Then we have that x t+1 ≥ x t + ηC t x 2q-1 t , y t+1 ≤ y t + ηSC t y 2q-1 t . Besides, x 0 = Λ (0) i,l and y 0 ≤ Λ (0) i,l (1 -O(1/ log(k))) based on the definition of M (0) i,l . Here we assume y 0 ≤ Λ (0) i,l (1 -C/ log(k)). Thus, we have x 0 ≥ y 0 1 + C log(k) -C = y 0 S 1 q-2 1 + 1 log(k) . So using the result from Lemma F.4, when ⟨w (t) r * , v i,l ⟩ reaches Ω(1), which necessarily happens at some iteration t ≥ T 0 , we still have that y t ≤ Õ(y 0 ) =⇒ ⟨w (t) r ′ , v i,l ⟩ ≤ Õ(σ 0 ). Stage II. We now consider when t ∈ [T 0 , T ]. In this stage, using Induction Hypothesis C.3 (d) and (e), we have that E (X,y)∼Z ⟨∇ wr L(H; X), v i,l ⟩ ≤ Õ 1 k σ 2q-1 0 . Thus, ⟨w (t+1) r ′ , v i,l ⟩ ≤ ⟨w (t) r ′ , v i,l ⟩ + Õ η k σ 2q-1 0 ≤ ⟨w (T0) r ′ , v i,l ⟩ + Õ η(T -T 0 ) k σ 2q-1 0 ≤ Õ(σ 0 ).
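As a toy numerical illustration (not part of the proof; all constants below are arbitrary choices), the two-sequence comparison in Lemma F.4 can be simulated directly: a small multiplicative head start for x 0 lets x reach a constant A while y barely grows, mirroring the "lottery winning" of the head-start kernels.

```python
import numpy as np

# Toy simulation (illustration only; all constants are arbitrary choices) of
# the two-sequence comparison in Lemma F.4: both sequences follow updates of
# the form x_{t+1} = x_t + eta * C_t * x_t^{q-1}, with y slowed/scaled by S.
# A small multiplicative head start for x_0 lets x reach a constant A while
# y stays within a small factor of y_0.
q, eta, C, S, A = 4, 1e-2, 1.0, 1.0, 1.0
k = 1000.0
y0 = 0.01
# head start: x_0 >= y_0 * S^{1/(q-2)} * (1 + 1/log k), with extra slack 1.5
x0 = y0 * S ** (1.0 / (q - 2)) * (1.0 + 1.0 / np.log(k)) * 1.5
x, y, steps = x0, y0, 0
while x < A:
    x += eta * C * x ** (q - 1)
    y += eta * S * C * y ** (q - 1)
    steps += 1
# when x first reaches A, y is still of order y0 (within a small factor)
```

The cubic self-reinforcing update makes the head start decisive: most of the time is spent while both sequences are tiny, but x exits this regime first and then explodes, freezing y near its initialization.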

F.4 NOISE CORRELATION

In this subsection, we prove that the kernels correlate only weakly with the random noise. Lemma F.6. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds for all iterations < t. For every v i,l ∈ V, for every r ∈ M i,l , for every (X, y) ∈ Z, we have (a) For every p ∈ P v i,l (X), |⟨w r , ξ p ⟩| ≤ õ(σ 0 ). (b) For every p ∈ P(X) \ P v i,l (X), |⟨w r , ξ p ⟩| ≤ Õ(σ 0 ). (c) For every p ∈ [P ] \ P(X), |⟨w (t) r , ξ p ⟩| ≤ Õ(σ 0 γk). Moreover, for every r / ∈ ∪ i∈[k],l∈[2] M i,l , for every (X, y) ∈ Z, we have (d) for every p ∈ P(X), |⟨w r , ξ p ⟩| ≤ Õ(σ 0 ). (e) for every p ∈ [P ] \ P(X), |⟨w (t) r , ξ p ⟩| ≤ Õ(σ 0 γk). Proof of Lemma F.6. For every r ∈ [km], for every (X * , y * ) ∈ Z and every p * ∈ [P ], we have that ⟨-∇ wr L(X), ξ p * ⟩ = p∈[P ] Φ r (X) - 1 θ -1 ReLU(⟨w r , x p ⟩) ReLU ′ (⟨w r , x p ⟩)⟨x p , ξ p * ⟩. When X ̸ = X * , we have |⟨x p , ξ p * ⟩| ≤ Õ(σ p ) ≤ o(1/ √ d); and when X = X * but p ̸ = p * , we also have |⟨x p , ξ p * ⟩| ≤ Õ(σ p ) ≤ o(1/ √ d). Therefore, we have E (X,y)∼Z ⟨-∇ wr L(X), ξ p * ⟩ = E (X,y)∈Z I X=X * ⟨-∇ wr L(X), ξ p * ⟩ + I X̸ =X * ⟨-∇ wr L(X), ξ p * ⟩ . For the first term, E (X,y)∼Z I X=X * ⟨-∇ wr L(X), ξ p * ⟩ = 1 N E (X * ,y * )∼Z Φ r (X * )ReLU ′ (⟨w r , x p * ⟩)⟨x p * , ξ p * ⟩ - 1 θ -1 ReLU(⟨w r , x p * ⟩)ReLU ′ (⟨w r , x p * ⟩)⟨x p * , ξ p * ⟩ ± o 1 √ d (a) = Θ 1 N E (X * ,y * )∼Z Φ r (X * )ReLU ′ (⟨w r , x p * ⟩) - 1 θ -1 ReLU(⟨w r , x p * ⟩)ReLU ′ (⟨w r , x p * ⟩) ± o 1 √ d = Θ 1 N E (X * ,y * )∼Z I v i,l ∈V(X * ) p∈Pv i,l (X) ReLU(⟨ ŵr , x p ⟩) -ReLU(⟨w r , x p ⟩) × ReLU ′ (⟨w r , x p * ⟩) - 1 θ -1 ReLU(⟨w r , x p * ⟩)ReLU ′ (⟨w r , x p * ⟩) ± o 1 √ d , where (a) is because ∥ξ p * ∥ 2 2 = Θ(1). For the second term, E (X,y)∼Z I X̸ =X * ⟨-∇ wr L(X), ξ p * ⟩ = ±o 1 √ d . Now we begin to prove (a).
For every v i,l ∈ V, for every r ∈ M (0) i,l , for every p * ∈ P v i,l (X * ), using the induction hypothesis C.3, when t ∈ [0, T 0 ], we have E (X,y)∼Z ⟨-∇ wr L(X), ξ p * ⟩ = Θ 1 N E (X,y)∼Z I v i,l ∈V(X * ) p∈Pv i,l (X * ) z q-1 p * p ′ ∈Pv i,l (X * ) (τ q -1)z q p ′ - 1 θ -1 z q p * × ReLU(⟨w r , v i,l ⟩)ReLU ′ (⟨w r , v i,l ⟩) ± o 1 √ d . Thus, we have ⟨w (t+1) r , ξ p * ⟩ ≤ ⟨w (t) r , ξ p * ⟩ + Õ η N ReLU(⟨w r , v i,l ⟩)ReLU ′ (⟨w r , v i,l ⟩) + o η √ d , Now we use the results from Lemma F.1, when t ≤ T 0 , ⟨w (t) r , ξ p * ⟩ ≤ ⟨w (0) r , ξ p * ⟩ + Õ ηT 0 N + o ηT 0 √ d So when N ≥ ω k σ 2q-1 0 and √ d ≥ ω(k/σ 2q-1 0 ), we have ⟨w (t) r , ξ p * ⟩ ≤ õ(σ 0 ). Then when t ∈ [T 0 , T ], we have E (X,y)∈Z I X=X * ⟨-∇ wr L(X), ξ p * ⟩ = Θ 1 N T 1/q 0 E (X,y)∼Z I v i,l ∈V(X * ) ReLU(⟨w r , v i,l ⟩)ReLU ′ (⟨w r , v i,l ⟩) ± o 1 N √ d . Therefore, for t ∈ [T 0 , T ], we have ⟨w (t) r , ξ p * ⟩ ≤ ⟨w (T0) r , ξ p * ⟩ + Õ η(t -T 0 ) N T 1/q 0 + o η(t -T 0 ) √ d ≤ õ(σ 0 ), when √ d ≥ ω(k 5/2 /η 1/q ). Now we begin to prove (b). For every p * ∈ P(X * ) \ P v i,l (X * ), using the induction hypothesis C.3, when t ∈ [0, T 0 ], we have E (X,y)∈Z I X=X * ⟨-∇ wr L(X), ξ p * ⟩ ≤ Õ 1 N E (X * ,y * )∼Z   I v i,l ∈V(X * ) p∈Pv i,l (X) ReLU(⟨ ŵr , x p ⟩) -ReLU(⟨w r , x p ⟩) Õ(σ (q-1) 0 ) ± o 1 √ d   ≤ Õ 1 N E (X * ,y * )∼Z Õ(σ (q-1) 0 ) ± o 1 √ d Thus, when t ≤ T 0 , we have ⟨w (t) r , ξ p * ⟩ ≤ ⟨w (0) r , ξ p * ⟩ + Õ ηT 0 N σ (q-1) 0 + o ηT 0 √ d ≤ Õ(σ 0 ), when N ≥ k σ q 0 and √ d ≥ k/σ 2q-1 0 . Then when t ∈ [T 0 , T ], we have ⟨w (t) r , ξ p * ⟩ ≤ ⟨w (T0) r , ξ p * ⟩ + Õ η(t -T 0 ) N σ (q-1) 0 + o η(t -T 0 ) √ d ≤ Õ(σ 0 ). We begin to prove (c). For every p ∈ [P ]\P(X), using the induction hypothesis C.3, when t ∈ [0, T 0 ], we have E (X,y)∈Z I X=X * ⟨-∇ wr L(X), ξ p * ⟩ ≤ Õ 1 N E (X * ,y)∼Z Õ((σ 0 γk) (q-1) ) ± o 1 √ d .

G.1 MAIN RESULTS

We add an extra linear layer on the pretrained encoder. We collect labeled data points Z down = {(X i , y i )} N2 i=1 ∼ D and use these labeled data points to update the weights u i,r , i ∈ [k], r ∈ [km] of the extra linear layer and to fine-tune the kernels w r , r ∈ [km] of the pretrained encoder. The output of the linear layer is denoted as F i (X) = r∈[km] u i,r h r (X). The loss function on downstream tasks is L down (F ) = 1 N 2 i∈[N2] L down (F ; X i , y i ), where L down (F ; X, y) = -log e Fy(X) j∈[k] e Fj (X) . We define logit i (F ; X) = e Fi(X) j∈[k] e Fj (X) . The gradient of L down (F ; X, y) with respect to u i,r is -∇ ui,r L down (F ; X, y) = (I i=y -logit i (F ; X))h r (X). We initialize u (0) i,r = 0, i ∈ [k], r ∈ [km], and the initialization of w (0) r , r ∈ [km] is w (T ) r , i.e., the kernels of the pretrained encoder. We update the weights using gradient descent: u (t+1) i,r = u (t) i,r -η 2 ∇ ui,r L down (F ; X, y), w (t+1) r = w (t) r -η 1 ∇ wr L down (F ; X, y). We set η 1 to be much smaller than η 2 . The following lemma states that Induction Hypothesis C.3 still holds during the training on classification tasks. Lemma G.1. For N 2 ≥ k many samples, setting the learning rates η 2 = Θ(k) and η 1 ≤ Θ(k), after T down ≥ poly(k) η1η2 many iterations, for sufficiently large k > 0, Induction Hypothesis C.3 holds for all iterations with high probability. Then we have the following theorem showing the performance on the downstream classification test. Theorem G.2 (Performance on downstream classification tasks). For N 2 ≥ k many samples, setting the learning rates η 2 = Θ(k) and η 1 ≤ Θ(k), after T down ≥ poly(k) η1η2 many iterations, with high probability, we have (a) (training loss is small) the training loss on Z down is small, i.e., L down (F ) = E (X,y)∼Z down [L down (F ; X, y)] ≤ 1 poly(k) . (b) (test performance is good) for a new data point (X, y) ∼ D, the test performance satisfies Pr (X,y)∈D F y (X) ≥ max j̸ =y F j (X) + Õ(1) ≥ 1 -e -Ω(log 2 k) .
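As a quick sanity check (toy code, not the paper's implementation), the closed-form gradient -∇ ui,r L down (F ; X, y) = (I i=y -logit i (F ; X))h r (X) of the softmax cross-entropy loss for the linear head can be verified against a finite-difference approximation; the sizes k, km and the random inputs below are arbitrary.

```python
import numpy as np

# Toy verification of the linear-head gradient: for
#   F_i(X) = sum_r u_{i,r} h_r(X),  L = -log(exp(F_y) / sum_j exp(F_j)),
# the gradient is grad_{u_{i,r}} L = (logit_i - 1{i=y}) * h_r(X),
# i.e., -grad = (1{i=y} - logit_i) * h_r(X). Sizes here are arbitrary.
rng = np.random.default_rng(0)
k, km = 4, 6                      # toy number of classes and kernels
h = rng.normal(size=km)           # encoder outputs h_r(X)
u = rng.normal(size=(k, km))      # linear-head weights u_{i,r}
y = 2                             # true label

def loss(u):
    F = u @ h
    return -F[y] + np.log(np.exp(F).sum())

F = u @ h
logits = np.exp(F) / np.exp(F).sum()
# closed-form gradient of the loss w.r.t. u[i, r]
closed = (logits - (np.arange(k) == y).astype(float))[:, None] * h[None, :]

# central finite differences for comparison
eps = 1e-6
numeric = np.zeros_like(u)
for i in range(k):
    for r in range(km):
        up, dn = u.copy(), u.copy()
        up[i, r] += eps
        dn[i, r] -= eps
        numeric[i, r] = (loss(up) - loss(dn)) / (2 * eps)

max_err = np.abs(closed - numeric).max()
```

The same check also confirms the initialization argument in the text: with u = 0 all logits equal 1/k, which is the value used for logit i (F ; X) at t = 0 in Section G.2.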

G.2 FINETUNING OF DOWNSTREAM CLASSIFICATION MODELS

In this subsection, we fine-tune the weights w r , r ∈ [km] of pretrained encoder of Student network and update the weights of the linear layer u i,r , i ∈ [k], r ∈ [km]. G.2.1 UPDATES OF u i,r We first define several terms which will be used frequently. Definition G.3. Z i,l (X) = I v i,l ∈V(X) p∈Pv i,l (X) z p , ψ r,i,l = [⟨w r , v i,l ⟩] + , Ψ i,l = r∈M (0) i,l ψ 2 r,i,l , Ψ i = l∈[2] Ψ i,l . (c) when y ̸ = i, for every (X, y) ∈ Z down , -∇ ui,r L(F ; X, y) = -logit i (F ; X)h r (X), p∈Pv i,l (X) z p ∈ [Ω(1), 0.4]. Now we begin to show the full gradients. As we assume the ratio of single-view data is µ = 1 poly(k) , it has little influence on the update of weights. So we ignore single-view data and only focus on (X, y) ∈ Z down,m . Then when r ∈ M (0) j,l , j ̸ = i, we have E (X,y)∼Z down [-∇ ui,r L(F ; X, y)] = E (X,y)∼Z down I {y=i} (1 -logit i (F ; X)) 0.4s k • ψ r,j,l + E 5 + E 6 -E (X,y)∼Z down I {y=j} logit i (F ; X) [O(1) • ψ r,j,l + E 5 + E 6 ] -E (X,y)∼Z down I {y̸ =i,y̸ =j} logit i (F ; X) 0.4s k • ψ r,j,l + E 5 + E 6 = k -1 k 2 • 0.4s k • ψ r,j,l + E 5 + E 6 - 1 k 2 [ψ r,j,l • O(1) + E 5 + E 6 ] - k -2 k 2 0.4s k • ψ r,j,l + E 5 + E 6 , where 1 k , 1 k , k-2 k is the ratios for each type of data and at t = 0, we have logit i (F ; X) = 1 k , i ∈ [k] because we initialize u i,r = 0. Therefore, if we ignore the small term, at t = 1, we have u (1) i,r ≈ η 2 0.4(k -1)s k 3 - O(1) k 2 - 0.4(k -2)s k 3 ψ r,j,l + η 2 (E 5 + E 6 ) k ≈ η 2 0.4s k 3 - O(1) k 2 ψ r,j,l + η 2 (E 5 + E 6 ) k < 0. ( ) Using this weight, we could also obtain the bounds of loss function after the update of w r , r ∈ [km] (we will show the following inequality in (24) after we update w r ): 0 ≤ 1 -logit y (F ; X) ≤ Õ 1 k , 0 ≤ logit i (F ; X) ≤ Õ 1 k , ∀i ∈ [k] \ y. 
Thus, at t = 2, we have u (2) i,r ≥ η 2 0.4s k 3 - O(1) k 2 ψ r,j,l + η 2 (E 5 + E 6 ) k -Õ η 2 k 2 ψ r,j,l - η 2 (E 5 + E 6 ) k 2 , u (2) i,r ≤ η 2 0.4s k 3 - O(1) k 2 ψ r,j,l + η 2 (E 5 + E 6 ) k + Õ η 2 k 3 ψ r,j,l + η 2 (E 5 + E 6 ) k 2 . So the approximation of u (2) i,r is u (2) i,r ≈ -Õ η 2 k 2 ψ r,j,l + η 2 (E 5 + E 6 ) k . Then for t > 2, as we continue to train to minimize the loss function, 1 -logit y (F ; X) will become smaller and so as logit i (F ; X), i ∈ [k] \ y. So the main term in u i,r , i ∈ [k], r ∈ [km] is the term of the first two updates and there is nearly no order changes on values of weights after the first two step of gradient descent. Thus, for simplicity of analysis, we could take u (t) i,r ≈ -Õ η 2 k 2 ψ r,j,l + η 2 (E 5 + E 6 ) k , for t ≥ 2. ( ) Similar to the former case, when r ∈ M (0) i,l , we have E (X,y)∼Z down [-∇ ui,r L(F ; X, y)] = k -1 k 2 [ψ r,i,l • O(1) + E 5 + E 6 ] - k -1 k 2 0.4s k • ψ r,i,l + E 5 + E 6 Then if we ignore the small term, at t = 1, we have u (1) i,r ≈ η 2 O(1) • (k -1) k 2 - 0.4s(k -1) k 3 ψ r,i,l + η 2 (E 5 + E 6 ) k > 0. ( ) Similar to the former analysis, for simplicity of analysis, we also take that u (t) i,r ≈ Õ η 2 k ψ r,i,l + η 2 (E 5 + E 6 ) k , for t ≥ 2. G.2.2 FINETUNING OF w r AND PROOF OF LEMMA G.1 After the update of u i,r , we then finetune w r . We have the gradients: -∇ wr L(F ; X, y) = (1 -logit y (F ; X))u y,r p∈[P ] ReLU ′ (⟨w r , x p ⟩)x p - i∈[k]\y logit i (F ; X)u i,r p∈[P ] ReLU ′ (⟨w r , x p ⟩)x p = (1 -logit y (F ; X))u y,r - j∈[k]\y logit j (F ; X)u j,r p∈[P ] ReLU ′ (⟨w r , x p ⟩)x p Diagonal correlations. For r ∈ M i,l , as we initialize w r by the pretrained encoder, we have p∈[P ] ReLU ′ (⟨w r , x p ⟩)⟨x p , v i,l ⟩ = I {v i,l ∈V(X)} p∈Pv i,l (X) ReLU ′ (⟨w r , x p ⟩)(z p + γ + σ p ) + Õ(σ q-1 0 ) • (γ + σ p ) • (s + 1) + Õ((σ 0 γk) q-1 ) • (γ + σ p ) • P. Thus, ⟨-∇ wr L(F ; X, y), v i,l ⟩ = (V r,i,l + E 1 + E 2 ) (1 -logit y (F ; X))u y,r - j∈[k]\y logit j (F ; X)u j,r . 
(20) At t = 1, for every (X, y) ∼ Z down , when i = y, put ( 16) and ( 18) into (20), we have ⟨-∇ wr L(F ; X, y), v i,l ⟩ = η 2 O(1) k - 0.4s k 2 (V r,i,l + E 1 + E 2 )(1 -logit y (F ; X))ψ r,i,l + (V r,i,l + E 1 + E 2 ) • η 2 (E 5 + E 6 ) k Similarly, when y ̸ = i, ⟨-∇ wr L(F ; X, y), v i,l ⟩ = η 2 - O(1) k + 0.4s k 2 (V r,i,l + E 1 + E 2 )logit i (F ; X)ψ r,i,l + (V r,i,l + E 1 + E 2 ) • η 2 (E 5 + E 6 ) k Denote S i,l = p∈Pv i,l (X) z p . We have E (X,y)∼Z down [⟨-∇ wr L(F ; X, y), v i,l ⟩] = 1 k η 2 O(1) k - 0.4s k 2 S i,l k -1 k ψ r,i,l + k -1 k η 2 - O(1) k + 0.4s k 2 S i,l s k 2 ψ r,i,l + η 2 (E 5 + E 6 ) k = k -1 k 2 - (k -1)s k 3 η 2 O(1) k - 0.4s k 2 S i,l ψ r,i,l + η 2 (E 5 + E 6 ) k . Thus, at t = 1, we have ⟨w (1) r , v i,l ⟩ = ⟨w (0) r , v i,l ⟩ + O η 1 η 2 k 2 ψ r,i,l + η 1 η 2 (E 5 + E 6 ) k (21) ≤ Λ (T ) i,l + O η 1 η 2 k 2 Λ (T ) i,l + η 1 η 2 (E 5 + E 6 ) k ≤ Õ(1), when η 1 η 2 ≤ Õ(k 2 ). The lower bound on ⟨w (1) r , v i,l ⟩ can be easily obtained by similar methods. Besides, for t > 1, for every (X, y) ∼ Z down , when i = y, putting ( 17) and ( 19) into (20) and keeping the main term, we have ⟨-∇ wr L(F ; X, y), v i,l ⟩ ≈ Õ η 2 k (V r,i,l + E 1 + E 2 )(1 -logit y (F ; X))ψ r,i,l + (V r,i,l + E 1 + E 2 ) • η 2 (E 5 + E 6 ) k Similarly, when y ̸ = i, ⟨-∇ wr L(F ; X, y), v i,l ⟩ ≈ -Õ η 2 k (V r,i,l + E 1 + E 2 )logit i (F ; X)ψ r,i,l + (V r,i,l + E 1 + E 2 ) • η 2 (E 5 + E 6 ) k Suppose induction hypothesis C.3 holds at time t. Now we have that E (X,y)∼Z down,m [⟨-∇ wr L(F ; X, y), v i,l ⟩] (a) ≥ Õ η 2 k ψ r,i,l E (X,y)∼Z down,m I {y=i} (1 -logit y (F ; X)) -0.4I {y̸ =i} logit i (F ; X) + η 2 (E 5 + E 6 ) k . where (a) is because for y ̸ = i, V r,i,l = 0.4I {v i,l ∈V(X)} ≤ 0.4, and E (X,y)∼Z down,s [⟨-∇ wr L(F ; X, y), v i, l⟩] (a) ≥ Õ η 2 k ψ r,i, lE (X,y)∼Z down,s I {y=i} (1 -logit y (F ; X)) -0.4I {y̸ =i} logit i (F ; X) + η 2 (E 5 + E 6 ) k . 
Thus, using the result ψ r,i,l ≥ 1 polylog(k) , we obtain i∈[k] ⟨w (t+1) r , v i,l ⟩ ≥ i∈[k] ⟨w (t) r , v i,l ⟩ + Ω η 1 η 2 k E (X,y)∼Z down (1 -logit y (F ; X)) + η 1 η 2 (E 5 + E 6 ). As the induction hypothesis C.3 still holds in the training process, we have Λ (t) i,l ≤ Õ(1). Thus, T down t=1 E (X,y)∼Z down (1 -logit y (F ; X)) + Õ kT down (E 5 + E 6 ) ≤ Õ k 2 η 1 η 2 . ( ) So, if we assume induction hypothesis C.3 holds for all iteration < t, then ⟨w (t) r , v i,l ⟩ ≤ ⟨w (1) r , v i,l ⟩ + Õ η 1 η 2 k 2 t t=1 E (X,y)∼Z down (1 -logit y (F ; X)) + tη 1 η 2 (E 5 + E 6 ) k ≤ Õ(1). Off-diagonal correlations. For r ∈ M (0) i,l , as we initialize w r by the pretrained encoder, we have p∈[P ] ReLU ′ (⟨w r , x p ⟩)⟨x p , v j,l ′ ⟩ = Vr,i,l (X)(γ + σ p ) + I {v j,l ′ ∈V(X)} Õ(σ q-1 0 ) + E 1 + E 2 . Thus, ⟨-∇ wr L(F ; X, y), v j,l ′ ⟩ = ( Vr,i,l (X)(γ + σ p ) + I {v j,l ′ ∈V(X)} Õ(σ q-1 0 ) + E 1 + E 2 ) × (1 -logit y (F ; X))u y,r - j∈[k]\y logit j (F ; X)u j,r . At t = 1, for every (X, y) ∼ Z down , when i = y, put ( 16) and ( 18) into ( 23), we have ⟨-∇ wr L(F ; X, y), v j,l ′ ⟩ = ((γ + σ p ) + I {v j,l ′ ∈V(X)} Õ(σ q-1 0 )) × η 2 O(1) k - 0.4s k 2 (1 -logit y (F ; X))ψ r,i,l + η 2 (E 5 + E 6 ) k Similarly, when y ̸ = i but y = j, ⟨-∇ wr L(F ; X, y), v j,l ⟩ = I {v i,l ∈V(X)} (γ + σ p ) + Õ(σ q-1 0 ) + E 1 + E 2 × η 2 - O(1) k + 0.4s k 2 logit i (F ; X)ψ r,i,l + η 2 (E 5 + E 6 ) k . When y ̸ = i and y ̸ = j, ⟨-∇ wr L(F ; X, y), v j,l ⟩ = I {v i,l ∈V(X)} (γ + σ p ) + I {v j,l ′ ∈V(X)} Õ(σ q-1 0 ) + E 1 + E 2 × η 2 - O(1) k + 0.4s k 2 logit i (F ; X)ψ r,i,l + η 2 (E 5 + E 6 ) k . 
Therefore, E (X,y)∼Z down [⟨-∇ wr L(F ; X, y), v j,l ′ ⟩] = 1 k ((γ + σ p ) + Õ(sσ q-1 0 /k))η 2 O(1) k - 0.4s k 2 k -1 k ψ r,i,l + 1 k s k (γ + σ p ) + Õ(σ q-1 0 ) η 2 - O(1) k + 0.4s k 2 1 k ψ r,i,l + k -2 k s k (γ + σ p ) + Õ(sσ q-1 0 /k) η 2 - O(1) k + 0.4s k 2 1 k ψ r,i,l + η 2 (γ + σ p )(E 5 + E 6 ) k = - s k ((γ + σ p ) + Õ(σ q-1 0 ))η 2 O(1) k - 0.4s k 2 ψ r,i,l + η 2 (γ + σ p )(E 5 + E 6 ) k . Thus, at t = 1, we have ⟨w (1) r , v j,l ′ ⟩ ≤ ⟨w (0) r , v j,l ′ ⟩ + Õ η 1 η 2 k 2 (γ + σ p ) ψ r,i,l + η 1 η 2 (γ + σ p )(E 5 + E 6 ) k ≤ Õ(σ 0 ), when η 1 η 2 ≤ Õ(k 2 ). Suppose induction hypothesis C.3 holds for all iterations < t. We have ⟨w (t) r , v j,l ′ ⟩ ≤ ⟨w (1) r , v j,l ′ ⟩ + Õ η 1 η 2 k 2 (γ + σ p ) T down t=1 E (X,y)∼Z down (1 -logit y (F ; X)) + T down η 1 η 2 (γ + σ p )(E 5 + E 6 ) k ≤ Õ(σ 0 ) Kernels outside ∪ i∈[k],l∈[2] M (0) i,l . For r / ∈ M (0) i,l , as we initialize w r by the pretrained encoder, we have p∈[P ] ReLU ′ (⟨w r , x p ⟩)⟨x p , v i,l ⟩ = Õ(σ q-1 0 ) + E 1 + E 2 , which is very small and there is nearly no increase on ⟨w r , v i,l ⟩. Thus, when induction hypothesis C.3 holds for all iterations < t, for r / ∈ M (0) i,l , we have ⟨w (t) r , v i,l ⟩ ≤ Õ(σ 0 ). Noise correlations. For every r ∈ [km], for every (X * , y * ) ∈ Z and every p * ∈ [P ], we have that E (X,y)∼Z [I X=X * ⟨-∇ wr L(F ; X, y), ξ p * ⟩] = Θ 1 N 2 E (X,y)∼Z (ReLU ′ (⟨w r , x p * ⟩ ± o(1/ √ d)) × (1 -logit y (F ; X * ))u y,r - j∈[k]\y logit j (F ; X * )u j,r , E (X,y)∼Z [I X̸ =X * ⟨-∇ wr L(F ; X, y), ξ p * ⟩] = ±o(1/ √ d). For every v i,l ∈ V, for every r ∈ M (0) i,l , for every p * ∈ P v i,l (X * ), when i = y, we have E (X,y)∼Z [I {i=y} ⟨-∇ wr L(F ; X, y), ξ p * ⟩] = Θ η 2 N 2 ReLU ′ (⟨w r , x p * ⟩) O(1) k - 0.4s k 2 (1 -logit y (F ; X * ))ψ r,i,l ± o(1/ √ d) (a) = Θ η 2 N 2 O(1) k - 0.4s k 2 ψ r,i,l + η 2 (E 5 + E 6 ) N 2 k ± o(η 2 / √ d), where (a) is because 1 -logit y (F ; X * ) = k-1 k at t = 0. 
When i ̸ = y, we have E (X,y)∼Z [I {i̸ =y} ⟨-∇ wr L(F ; X, y), ξ p * ⟩] = Θ 1 (k -1)N 2 η 2 - O(1) k + 0.4s k 2 ψ r,i,l + η 2 (E 5 + E 6 ) N 2 k(k -1) ± o(η 2 / √ d), Thus, we have E (X,y)∼Z [⟨-∇ wr L(F ; X, y), ξ p * ⟩] = η 2 (E 5 + E 6 ) N 2 k 2 ± o(η 2 / √ d), and ⟨w (1) r , ξ p ⟩ = ⟨w (0) r , ξ p ⟩ + η 1 η 2 (E 5 + E 6 ) N 2 k 2 ± o(η 1 η 2 / √ d) ≤ õ(σ 0 ). Thus, when induction hypothesis C.3 holds for all iterations < t, we have ⟨w (t) r , ξ p ⟩ ≤ ⟨w (1) r , ξ p ⟩ + T down η 1 η 2 (E 5 + E 6 ) N 2 k 2 ± o(T down η 1 η 2 / √ d) ≤ õ(σ 0 ). Similarly, following the similar step as in the proof of Lemma F.6, we can also prove other claims about the noise correlations in the downstream tasks. We skip the similar steps here. Combining all above results, we can prove the Lemma G.1.

G.2.3 TRAINING LOSS AND PROOF OF THEOREM G.2 (A)

We set η 2 to be O(k). The reason for this choice of step size is that after the first step, the negative-part weights (< 0) and the positive-part weights (> 0) are well separated. Thus, by setting a suitable step size η 2 = O(k), we can obtain a small loss in the first update of (15). We show below that the training loss is small. After one-step training, at t = 1, for (X, y) ∈ Z down,m , we have F j (X) -F y (X) = 2 l=1 r∈M (0) j,l (u j,r -u y,r ) ψ r,j,l • Z j,l (X) + E 5 + E 6 + 2 l=1 r∈M (0) y,l (u j,r -u y,r ) ψ r,y,l • Z y,l (X) + E 5 + E 6 + i∈[k]\{j,y},l∈[2] r∈M (0) i,l (u j,r -u y,r ) ψ r,v ′ • Z v ′ (X) + E 5 + E 6 (a) = 2 l=1 r∈M (0) j,l (u j,r -u y,r ) ψ r,j,l • Z j,l (X) + E 5 + E 6 + 2 l=1 r∈M (0) y,l (u j,r -u y,r ) ψ r,y,l • Z y,l (X) + E 5 + E 6 + η 2 m 0 (E 5 + E 6 ) = η 2 2 l=1 r∈M (0) j,l O(1) • (k -1) k 2 - 0.4s(k -1) k 3 - 0.4s k 3 + O(1) k 2 ψ 2 r,j,l • Z j,l (X) + η 2 2 l=1 r∈M (0) y,l 0.4s k 3 - O(1) k 2 - O(1) • (k -1) k 2 + 0.4s(k -1) k 3 ψ 2 r,y,l • Z y,l (X) + η 2 m 0 (E 5 + E 6 ) = η 2 O(1) k - 0.4s k 2 2 l=1 r∈M (0) j,l ψ 2 r,j,l • Z j,l (X) - 2 l=1 r∈M (0) y,l ψ 2 r,y,l • Z y,l (X) + η 2 m 0 (E 5 + E 6 ) = η 2 O(1) k - 0.4s k 2 0.4 2 l=1 r∈M (0) j,l I {v j,l ∈V(X)} ψ 2 r,j,l - 2 l=1 r∈M (0) y,l ψ 2 r,y,l + η 2 m 0 (E 5 + E 6 ), where (a) is because the third term is nearly zero. We can show a similar result for single-view data. At t = 1, for (X, y) ∈ Z down,s , we have F j (X) -F y (X) = η 2 O(1) k - 0.4s k 2 0.4 2 l=1 r∈M (0) j,l I {v j,l ∈V(X)} ψ 2 r,j,l - r∈M (0) y,l ψ 2 r,y,l -ρ r∈M (0) y,3-l ψ 2 r,y,3-l + η 2 m 0 (E 5 + E 6 ). Thus, at t = 1, we have E (X,y)∼Z down,m logit y (F ; X) ≈ 2s k - 2s 2 k 2 1 1 + i∈[k]\y e 0.4Ψ i,l -Ψy + s 2 k 2 1 1 + i∈[k]\y e 0.4Ψi-Ψy + 1 - s k 2 1 1 + i∈[k]\y e 0.4s/k-Ψy ≥ 1 -Õ 1 k , where the last inequality uses the facts that ψ r,i,l ≥ 1 polylog(k) and ψ r,i,l ≤ Õ(1) from Lemma F.1 at initialization, and |M (0) i,l | ≤ O(log 5 k) from Lemma C.1.
We can obtain similar results for single-view data. Finally, if we set T down ≥ poly(k) η1η2 , then according to (22), it is easy to verify that 1 T down T down t=1 E (X,y)∼Z down -log e Fy(X) j∈[k] e Fj (X) ≤ 1 T down T down t=1 E (X,y)∼Z down 1 -logit y (F ; X) ≤ 1 poly(k) . This implies that the training loss is small, which proves Theorem G.2 (a).

G.2.4 PROOF OF THEOREM G.2 (B)

In this subsection, we prove Theorem G.2 (b). For (X, y) ∼ D m , due to our definition of the data structure in Definition 1, with probability at least 1 -e -Ω(log 2 k) , it satisfies that for every j ∈ [k] \ y, F j (X) -F y (X) ≈ O(1) • 0.4 2 l=1 I {v j,l ∈V(X)} Ψ j,l - 2 l=1 Ψ y,l , and for (X, y) ∼ D s , F j (X) -F y (X) ≈ O(1) • 0.4 2 l=1 I {v j,l ∈V(X)} Ψ j,l -ρΨ y,3-l -Ψ y,l . (26) To prove Theorem G.2 (b), we need the following lemma: Lemma G.4. For every (X, y) ∈ Z down , 1 -logit y (F ; X) ≤ Õ k 4 s 2 • E (X,y)∼Z down [1 -logit y (F ; X)]. (The same also holds with probability ≥ 1 -e -Ω(log 2 k) for every (X, y) ∼ D on the left-hand side.) Furthermore, if E (X,y)∼Z down [1 -logit y (F ; X)] ≤ 1 k 5 is sufficiently small, then for every j ∈ [k] \ y, F j (X) -F y (X) ≤ -Õ(1). Proof of Lemma G.4. The proof of Lemma G.4 for multi-view data has been shown in (Allen-Zhu & Li, 2020, Claim C.16). Now we prove that this lemma also holds for single-view data. For a data point (X, y) ∈ Z down,s , let us denote by H(X) the set of all i ∈ [k] \ {y} such that l∈[2] p∈Pv i,l (X) z p ≥ 0.8 - 1 100 log k , l∈[2] p∈Pv y,l (X) z p ≤ 1 + ρ + 1 100 log k . Now suppose 1 -logit y (F ; X) = ζ(X). Then using min{1, β} ≤ 2 1 - 1 1+β , we have min 1, i∈[k]\{y} e Fi(X)-Fy(X) ≤ 2ζ(X). By (26) and our definition of H(X), this implies that min 1, i∈H(X) e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ 4ζ(X). Now we define ϕ = E (X,y)∼Z down,s [1 -logit y (F ; X)]. Then E (X,y)∼Z down,s min 1, i∈H(X) e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ 4ϕ =⇒ E (X,y)∼Z down,s i∈H(X) min 1 k , e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ 4ϕ. Equivalently, j∈[k] i∈[k] I {i̸ =j} E (X,y)∼Z down,s [I {j=y} I {i∈H(X)} ] min 1 k , e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ 4ϕ. Note that for every i ̸ = j ∈ [k], the probability of choosing a single-view sample (X, y) from Z down,s with y = j and i ∈ H(X) is at least Ω 1 k • s 2 k 2 . This implies j∈[k] i∈[k]\j min 1 k , e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ Õ k 3 s 2 ϕ .
Finally, using 1 - 1 1+β ≤ min{1, β}, for every (X, y) ∼ Z down,s , we have 1 -logit y (F ; X) ≤ min 1, i∈[k]\y 2e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ k • i∈[k]\y min 1 k , e O(1)•(0.4Ψi-ρΨ y,3-l -Ψ y,l ) ≤ Õ k 4 s 2 ϕ . This implies that when E (X,y)∼Z down,s (1 -logit y (F ; X)) ≤ 1 k 5 , we have 0.4Ψ i -ρΨ y,3-l -Ψ y,l ≤ -Õ(1). As we have proved in Section G.2.3 that E (X,y)∼Z down (1 -logit y (F ; X)) ≤ 1 poly(k) , we can set T down ≥ Õ k 7 η1η2 and then, based on Lemma G.4, we have Pr (X,y)∈D F y (X) ≥ max j̸ =y F j (X) + Õ(1) ≥ 1 -e -Ω(log 2 k) .
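The proof above repeatedly uses two elementary facts about β ↦ 1 - 1/(1+β); a quick numerical scan (illustrative only) confirms both bounds for β ≥ 0.

```python
import numpy as np

# Numerical check (illustration only) of the elementary inequalities used in
# the proof of Lemma G.4:
#   1 - 1/(1+beta) <= min(1, beta) <= 2 * (1 - 1/(1+beta))   for all beta >= 0.
# (The upper bound is tight exactly at beta = 1.)
beta = np.linspace(0.0, 50.0, 100001)
lhs = 1.0 - 1.0 / (1.0 + beta)
mid = np.minimum(1.0, beta)
ok_lower = bool(np.all(lhs <= mid + 1e-12))
ok_upper = bool(np.all(mid <= 2.0 * lhs + 1e-12))
```

Both inequalities are easy to verify case by case (β ≤ 1 vs. β ≥ 1), which is why the proof can freely pass between 1 - logit y (F ; X) and the truncated sum of exponentials.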

H EXTENSIONS ON OTHER MRP METHODS

We have proved in the sections above that Theorem C.4 holds, which means that under the Teacher-Student framework, the pretraining phase can capture all features. In this section, we extend our proof techniques to other popular mask-reconstruction pretraining methods. Here we mainly consider the masked autoencoder (MAE) structure (He et al., 2021). For simplicity of analysis, we set the weights of the decoder as a copy of the encoder weights and add a linear layer with b i = c(θ), i ∈ [P ], to finally obtain the recovered patches. The explicit framework is shown in Fig. 6. Denote the position encoding of patch p as e p ∈ R P , whose p-th element equals 1 and whose other elements equal 0. Recall that ϵX = (ϵ 1 x 1 , ϵ 2 x 2 , . . . , ϵ P x P ). Under this framework, the loss function is L(H; X, ϵ) = 1 2 p∈[P ] x p -c(θ) r∈[km] w r ReLU(⟨w r , e T p ϵX⟩) 2 2 = 1 2 p∈[P ] x p -c(θ) r∈[km] w r ReLU(⟨w r , ϵ p x p ⟩) 2 2 and L(H; X) = E ϵ [L(H; X, ϵ)] = 1 2 p∈[P ] x p - r∈[km] w r ReLU(⟨w r , x p ⟩) 2 2 + 1 2 1 -θ θ p∈[P ] r∈[km] w r ReLU(⟨w r , x p ⟩) 2 2 . Denote A r,p (X) = ReLU(⟨w r , x p ⟩) + ReLU ′ (⟨w r , x p ⟩)[⟨w r , x p ⟩] + . We have that -∇ wr L(X) = p∈[P ] A r,p x p - 1 θ r ′ ∈[km] w r ′ ReLU(⟨w r ′ , x p ⟩) . To prove that the pretraining can also capture all features under the MAE framework, we use the same induction hypothesis as Induction Hypothesis C.3, but now the parameter assumptions are a little different. Our new assumptions are as follows: Assumption H.1 (Parameter Assumption: MAE framework). The parameters introduced in the paper need to satisfy the following conditions: • ϱ is the threshold for the smoothed ReLU activation. We assume ϱ = 1 polylog(k) . • q ≥ 4 and σ q-2 0 ≤ 1 k . • γ controls feature noise. γ ≤ Õ σ0 k . • s controls feature sparsity. s = Θ(polylog(k)). • N ≥ ω k σ q-1 0 , √ d ≥ ω(k/σ q-1 0 ), and P ≤ σ -q+1/2 0 . • polylog(k) ≤ m ≤ √ k. • η ≥ 1 k q(q-2) and η ≤ 1 poly(k) . • c(θ) = 1 θ .
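The expectation step from L(H; X, ϵ) to L(H; X) above can be checked numerically on a toy instance (illustration code with arbitrary sizes, not the paper's implementation): since ϵ p ∈ {0, 1} implies ReLU(⟨w, ϵ p x p ⟩) = ϵ p ReLU(⟨w, x p ⟩), and c(θ) = 1/θ, the expectation over each ϵ p is an exact two-point average that matches the closed form.

```python
import numpy as np

# Toy check (arbitrary sizes) of the per-patch masking identity behind E_eps:
#   E_eps || x - (1/theta) * eps * f(x) ||^2
#     = || x - f(x) ||^2 + (1 - theta)/theta * || f(x) ||^2,
# where f(x) = sum_r w_r ReLU(<w_r, x>) and eps ~ Bernoulli(theta).
# Because eps is {0,1}-valued, the expectation is an exact two-point average.
rng = np.random.default_rng(0)
d, m, theta = 8, 5, 0.3
x = rng.normal(size=d)
W = rng.normal(size=(m, d)) / np.sqrt(d)  # rows are the kernels w_r

def f(v):
    # sum_r w_r * ReLU(<w_r, v>)
    return W.T @ np.maximum(W @ v, 0.0)

fx = f(x)
# exact expectation over eps in {0, 1}: keep with prob theta, drop otherwise
two_point = theta * np.sum((x - fx / theta) ** 2) + (1 - theta) * np.sum(x ** 2)
# closed form from the displayed decomposition of L(H; X)
closed = np.sum((x - fx) ** 2) + (1 - theta) / theta * np.sum(fx ** 2)
```

Expanding both sides gives ∥x∥² - 2⟨x, f(x)⟩ + ∥f(x)∥²/θ, which is why the inverse-keep-probability scaling c(θ) = 1/θ makes the masked loss an unbiased surrogate plus a regularization term on the reconstruction norm.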
Now we have the following result on the feature learning process of MAE. Theorem H.2 (Feature learning process of MAE). Suppose Assumption H.1 holds. By running the gradient descent step based on gradient Fact 2.2 with learning rate η ≤ 1 poly(k) , after T = poly(k) η iterations, for sufficiently large k > 0, Induction Hypothesis C.3 holds for all iterations t = 0, 1, . . . , T with high probability. See its proof in Appendix H.2.5. Similarly, we also have the following result on the performance on downstream classification tasks. Theorem H.3 (Performance on downstream classification tasks under MAE pretraining). For N 2 ≥ k many samples, setting the learning rates η 2 = Θ(k) and η 1 ≤ Θ(k), after T down ≥ poly(k) η1η2 many iterations, with high probability, we have (a) (training loss is small) the training loss on Z down is small, i.e., L down (F ) = E (X,y)∼Z down [L down (F ; X, y)] ≤ 1 poly(k) . (b) (test performance is good) for a new data point (X, y) ∼ D, the test performance satisfies Pr (X,y)∈D F y (X) ≥ max j̸ =y F j (X) + Õ(1) ≥ 1 -e -Ω(log 2 k) . See its proof in Appendix H.2.6. Theorem H.2 guarantees that under MAE pretraining, the pretrained convolution kernels can capture all discriminative features in the data, and each convolution kernel grabs at most one discriminative feature. Such a result accords with the result for MRP in Theorem 1 of the manuscript; please refer to the manuscript for a more detailed discussion and analysis of Theorem 1. We also note that the assumptions under the MAE framework are stricter than those of the Teacher-Student framework. In the assumptions for MAE, we need q ≥ 4, which means the low-magnitude feature noises must be compressed much more strongly so that the true features can be separated from the feature noises. Then Theorem H.3 shows that, as all features have been captured in the MAE pretraining phase, we can also obtain very high accuracy with high probability on the downstream classification tasks.
Therefore, compared with supervised learning, MAE also shows better performance on downstream classification tasks. This result demonstrates the generality of our analysis framework. To prove Theorem H.2 and Theorem H.3, we mainly follow the framework used to prove Theorem 1 and Theorem 2 in the manuscript. To begin with, we first prove some auxiliary results from which the desired theorems can be easily obtained.

H.1 SOME RESULTS FROM INDUCTION HYPOTHESIS C.3 UNDER MAE

We first introduce some claims about the terms in the gradients. Claim H.4. Suppose Assumption H.1 and Induction Hypothesis C.3 hold at iteration t. Then for every r ∈ M (0) i,l , we have • if p ∈ P v i,l (X), A r,p (X) = I v i,l ∈V(X) ReLU(⟨w r , x p ⟩) + I v i,l ∈V(X) ReLU ′ (⟨w r , x p ⟩)[⟨w r , x p ⟩] + . • if p ∈ P(X) \ P v i,l (X), A r,p (X) ≈ Õ(σ q 0 ). • if p ∈ [P ] \ P(X), A r,p (X) ≈ Õ((σ 0 γk) q ). We also denote ∆ p (X) = x p - 1 θ r ′ ∈[km] w r ′ ReLU(⟨w r ′ , x p ⟩). Claim H.5. Suppose Assumption H.1 and Induction Hypothesis C.3 hold at iteration t. • When p ∈ P v i,l (X), we have ⟨∆ p (X), v i,l ⟩ = z p I v i,l ∈V(X) - 1 θ r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩I v i,l ∈V(X) ReLU(⟨w r ′ , x p ⟩) - r ′ / ∈M (0) i,l Õ(σ 0 ) Õ(σ q 0 ) = z p I v i,l ∈V(X) - 1 θ r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩I v i,l ∈V(X) ReLU(⟨w r ′ , x p ⟩) ± Õ(σ q-1 2 0 ). • When p / ∈ P v i,l (X) but p ∈ P v j,l ′ (X) for v j,l ′ ̸ = v i,l , we have ⟨∆ p (X), v i,l ⟩ = γ ± Õ(σ q-1 2 0 ) ± r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩ Õ(σ q 0 ) ± r ′ ∈M (0) j,l ′ I v j,l ′ ∈V(X) Õ(σ 0 )ReLU(⟨w r ′ , x p ⟩). • When p ∈ [P ] \ P(X), we have ⟨∆ p (X), v i,l ⟩ = γ ± Õ(σ q+1 0 (γk) q ) ± r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩ Õ((σ 0 γk) q ). Now we state some claims for the gradients; their proofs follow directly from Claim H.4 and Claim H.5. Claim H.6. Suppose Assumption H.1 and Induction Hypothesis C.3 hold at iteration t. Then for every v i,l ∈ V, for every r ∈ M (0) i,l . Intuitions on Claim H.6. From Claim H.6, we can find that the positive-correlation gradient -⟨∇ wr L(X), v i,l ⟩ has a non-small term that drives the correlation between w r and v i,l to increase during training when r ∈ M (0) i,l . How the correlation increases is shown in Claim H.7 below. On the other hand, the negative correlations remain small, since the negative-correlation gradients -⟨∇ wr L(X), v j,l ′ ⟩ always consist of small terms.
These intuitions are the same as those under the Teacher-Student framework, and thus we can also prove that MAE pretraining captures all features. The following claim shows at which iteration Λ (t) i,l becomes greater than ϱ. Claim H.7. Suppose Assumption C.2 holds and Induction Hypothesis C.3 holds at iteration t. For every v i,l , suppose Λ (t) i,l = max r∈[km] [⟨w (t) r , v i,l ⟩] + . We choose any r ∈ [km] that makes ⟨w (t) r , v i,l ⟩ ≥ Ω(σ 0 ). Now we show the updates. We know that ⟨w (t+1) r , v i,l ⟩ = ⟨w (t) r , v i,l ⟩ + ηE (X,y)∼Z [⟨-∇ wr L(X), v i,l ⟩]. Using Claim H.6 and following the method in the proof of Claim E.4, we have -⟨∇ wr L(X), v i,l ⟩ = (1 + q)z q+1 p . As the term (1/θ) r ′ ∈M (0) i,l ReLU(⟨w r ′ , v i,l ⟩)⟨w r ′ , v i,l ⟩ is small compared with the constant 1 at the initial stage, we have Λ (t+1) i,l ≈ Λ (t) i,l + Θ η k ReLU(⟨w r , v i,l ⟩). Using Claim E.5, and Ω(σ 0 ) ≤ Λ (0) i,l ≤ Õ(σ 0 ), we have the following result: Claim H.8. Suppose Assumption H.1 holds and Induction Hypothesis C.3 holds for every iteration. Define T 0 := Θ k ησ q-1 0 . We have that when t ≥ T 0 , it satisfies Λ (t) i,l ≥ ϱ. Proof. Suppose we are now at some iteration t > T 0 . In this stage, Λ (t) i,l ≥ 1/polylog(k). As T 0 = Θ k ησ q-1 0 and η ≤ 1 poly(k) , we have -⟨∇ wr L(X), v i,l ⟩ = p∈Pv i,l (X) A r,p z p I v i,l ∈V(X) - 1 θ r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩I v i,l ∈V(X) ReLU(⟨w r ′ , x p ⟩) = I v i,l ∈V(X) [⟨w r , v i,l ⟩] + 1 -(1/θ) r ′ ∈M (0) i,l ⟨w r ′ , v i,l ⟩ 2 p∈Pv i,l (X) 2z 2 p . Then we have [⟨w (t+1) r , v i,l ⟩] + ≤ [⟨w (t) r , v i,l ⟩] + + Õ η k [⟨w (t) r , v i,l ⟩] + . Taking the maximum on both sides, and since we are at t > T 0 , we have max r∈M (0) i,l [⟨w (t+1) r , v i,l ⟩] + ≤ max r∈M (0) i,l [⟨w (t) r , v i,l ⟩] + 1 + Õ η k . When t ≤ T = T 0 + Õ k η , we have Λ (t) i,l ≤ Õ(1).
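The last step of this argument bounds multiplicative growth of the form (1 + Õ(η/k)) per iteration compounded over Õ(k/η) extra iterations. Ignoring logarithmic factors and setting all hidden constants to 1 (a deliberate simplification for illustration), the compounded factor stays bounded:

```python
import math

# Toy computation (hidden constants set to 1, log factors ignored): a per-step
# growth factor (1 + eta/k) compounded over T = k/eta steps gives
#   (1 + eta/k)^(k/eta) <= e,
# so the maximal correlation remains O(1) for t <= T_0 + O(k/eta), matching
# the conclusion Lambda^(t)_{i,l} <= O~(1) at the end of the proof above.
eta, k = 1e-3, 100.0
T = int(k / eta)
growth = (1.0 + eta / k) ** T
bound = math.e  # the limiting value of (1 + 1/n)^n
```

This is the standard (1 + 1/n)^n → e bound; the Õ(·) factors in the actual claim only change the exponent by polylogarithmic factors, which the Õ(1) conclusion absorbs.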

H.2.4 NOISE CORRELATION

As our noise-correlation result is similar to Lemma F.6, we do not repeat it here, but only prove that it holds under the MAE framework and under our new parameter assumptions.

H.2.5 PROOF OF THEOREM H.2

Theorem H.2 can be easily obtained by following steps similar to the proof of Theorem C.4, given Lemma H.10-Lemma H.12.



https://pytorch.org/hub/pytorch_vision_resnet/

The officially trained SL model can be downloaded at https://github.com/facebookresearch/deit/blob/main/README_deit.md. The officially trained MAE model can be downloaded at https://github.com/facebookresearch/mae/blob/main/FINETUNE.md. The officially trained data2vec model can be downloaded at https://github.com/facebookresearch/fairseq/tree/main/examples/data2vec.



Architectures. Formally, as shown in Fig. 2, we implement the encoder of the student network as a two-layer convolution smoothed-ReLU network with km kernels, denoted by w_r ∈ R^d, r ∈ [km].

Figure 2: Teacher-Student framework studied in this work. Given an input X = [x_1, . . . , x_P] (image or text tokens) with P patches, the framework randomly masks patches to obtain ϵX = [ϵ_1 x_1, . . . , ϵ_P x_P], where each Bernoulli variable ϵ_p decides the masking, and feeds ϵX into the student encoder H to obtain a latent vector H(ϵX). Then the student decoder takes H(ϵX) as input and outputs h′ for all patches to predict the output h of a teacher fed with the vanilla input X. The encoder is a two-layer CNN, and the decoder is a linear layer. In the MAE framework, the decoder has an additional layer that maps the encoder output back to the P patches (see Fig. 6 in Appendix).
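A minimal numpy sketch of this masking pipeline may help fix ideas. The patch dimensions, the particular smoothed-ReLU form, and the identity teacher below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

P, d, km, theta = 8, 16, 4, 0.5     # patches, patch dim, kernels, keep probability

W = rng.normal(0.0, 0.1, (km, d))   # student encoder kernels w_r
U = rng.normal(0.0, 0.1, (d, km))   # linear decoder

def relu_q(z, q=3):
    # Smoothed ReLU used in this kind of analysis: z^q on [0, 1], linear above.
    # The exact smoothing is an assumption for illustration.
    z = np.maximum(z, 0.0)
    return np.where(z < 1.0, z ** q, q * (z - 1.0) + 1.0)

def student(X, eps):
    # Mask patches with Bernoulli eps_p, encode each patch, decode all patches.
    Xm = X * eps[:, None]            # epsilon X = [eps_1 x_1, ..., eps_P x_P]
    H = relu_q(Xm @ W.T)             # (P, km) latent H(eps X)
    return H @ U.T                   # (P, d) per-patch prediction h'

X = rng.normal(0.0, 1.0, (P, d))
eps = rng.binomial(1, theta, P).astype(float)
h_teacher = X.copy()                 # simplest stand-in teacher: identity features
loss = np.mean((student(X, eps) - h_teacher) ** 2)
```

Training would update W and U to drive the reconstruction loss down; only the forward pass is sketched here.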

We initialize u_{i,r}, i ∈ [k], r ∈ [km], by 0, and initialize each kernel w_r by random Gaussian initialization of scale σ_0.

Figure 3: Visualization of ResNet50 (He et al., 2016b) trained by supervised learning and by MRP, respectively. We use Eigen-CAM to localize class-specific image regions. For each pair, the left figure is given by the supervised model, while the right figure comes from the pretrained model. By comparison, the pretrained model often captures more kinds of features than the supervised model.

To this end, we use the widely used Eigen-CAM (Muhammad & Yeasin, 2020) to visualize which part of an image plays a key role in deciding its predicted class. We follow the default setting of Eigen-CAM and use the network parameters of the fourth block to compute the projection of an image in the ResNet50 (He et al., 2016b) released by the PyTorch team¹. As shown in Fig. 1 of Sec. 3.1, though ResNet50 predicts all the car images correctly, Eigen-CAM locates different class-specific regions, e.g. car front, side window, car nose, taillight, and wheel, for different images. This indicates the existence of multiple independent discriminative features in a semantic class and validates our "multi-view" data assumption.
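Eigen-CAM itself is simple to reproduce: it projects the chosen block's activation map onto its first principal component. The following from-scratch sketch illustrates the computation; the shapes and the min-max normalization are our assumptions, and the official pytorch-grad-cam implementation differs in details.

```python
import numpy as np

def eigen_cam(activations):
    # activations: (C, H, W) feature map from a chosen conv block.
    C, H, W = activations.shape
    A = activations.reshape(C, H * W).T            # (H*W, C): one row per location
    A = A - A.mean(axis=0, keepdims=True)          # center the channel responses
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    cam = (A @ vt[0]).reshape(H, W)                # project onto 1st principal comp.
    # The sign of a principal component is arbitrary; min-max normalize to [0, 1].
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)

rng = np.random.default_rng(0)
cam = eigen_cam(rng.normal(size=(8, 14, 14)))      # stand-in for block-4 activations
```

In practice the (C, H, W) tensor would come from a forward hook on the network's fourth block, and the resulting map is upsampled onto the input image.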

Figure 4: Class-specific visualization of ResNet50 (He et al., 2016b) trained by MRP. For each pair, the left figure is given by the supervised model, while the right figure comes from the pretrained model. By comparison, the pretrained model often captures more kinds of features than the supervised model.

Figure 5: Class-specific visualization of ViT-base (Dosovitskiy et al., 2020) trained by MRP. For each group, the left image is the visualization of the supervised model, while the middle one is from MAE and the right one from data2vec. By comparison, the models trained by MAE and data2vec often capture more kinds of features than the supervised model.

Because 1) we set c(θ) = 1/θ and 2) each ϵ_p is i.i.d. Bernoulli with E[ϵ_p] = θ, we obtain
E_ϵ [ 2 ( Σ_{p∈[P]} ŷ_{r,p} ) ( Σ_{p∈[P]} c(θ) ϵ_p y_{r,p} ) ] = 2 ( Σ_{p∈[P]} ŷ_{r,p} ) ( Σ_{p∈[P]} y_{r,p} ).
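The unbiasedness used here, E[c(θ) ϵ_p y_{r,p}] = y_{r,p} for c(θ) = 1/θ, is easy to verify numerically; the concrete values of θ, y, and the sample count below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                        # keep probability, E[eps_p] = theta
y = 2.5                            # a fixed target value standing in for y_{r,p}
n = 200_000                        # Monte Carlo sample size

eps = rng.binomial(1, theta, n)            # i.i.d. Bernoulli masks
est = np.mean((1.0 / theta) * eps * y)     # empirical E[c(theta) * eps_p * y]
# est should be close to y, since E[c(theta) * eps_p] = (1/theta) * theta = 1
```

The 1/θ rescaling exactly cancels the mask's expected attenuation, which is why the expectation over ϵ recovers the unmasked sum.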

Figure 6: Masked Autoencoder.

Fact 2.2. Given a data point (X, y) ∈ D, for every w_r, r ∈ [km],

Claim H.6 (restated). For every v_{i,l} ∈ V, every r ∈ M^{(0)}_{i,l}, and every (X, y) ∈ Z, we have
(a) −⟨∇_{w_r} L(X), v_{i,l}⟩ = Σ_{p∈P_{v_{i,l}}(X)} A_{r,p}(X) ( z_p I[v_{i,l} ∈ V(X)] − (1/θ) Σ_{r′∈M^{(0)}_{i,l}} ⟨w_{r′}, v_{i,l}⟩ I[v_{i,l} ∈ V(X)] ReLU(⟨w_{r′}, x_p⟩) ) ± Õ(σ_0^{2q}) · P;
(b) for v_{j,l′} ≠ v_{i,l}, the gradient −⟨∇_{w_r} L(X), v_{j,l′}⟩ consists only of small terms of the form ⟨w_{r′}, v_{j,l′}⟩ I[v_{j,l′} ∈ V(X)] ReLU(⟨w_{r′}, x_p⟩) together with lower-order errors.

Proof of Claim H.7. Recall that Λ^{(t)}_{i,l} := max_{r∈[km]} [⟨w^{(t)}_r, v_{i,l}⟩]_+.


Lemma H.9. Suppose Assumption H.1 holds and Induction Hypothesis C.3 holds for all iterations < t. We have, for every v_{i,l} ∈ V, that Λ^{(t)}_{i,l} ≤ Õ(1).

Proof of Lemma F.6 under the MAE framework. For every r ∈ [km], every (X*, y*) ∈ Z, and every p* ∈ [P], we have
⟨−∇_{w_r} L(X), ξ_{p*}⟩ = Σ_{p∈[P]} A_{r,p} ( ⟨x_p, ξ_{p*}⟩ − (1/θ) Σ_{r′∈[km]} ⟨w_{r′}, ξ_{p*}⟩ ReLU(⟨w_{r′}, x_p⟩) ).
When X ≠ X*, we have |⟨x_p, ξ_{p*}⟩| ≤ Õ(σ_p) ≤ o(1/√d); and when X = X* but p ≠ p*, we also have |⟨x_p, ξ_{p*}⟩| ≤ Õ(σ_p) ≤ o(1/√d). Therefore, we have
E_{(X,y)∼Z} ⟨−∇_{w_r} L(X), ξ_{p*}⟩ = E_{(X,y)∼Z} [ I_{X=X*} ⟨−∇_{w_r} L(X), ξ_{p*}⟩ + I_{X≠X*} ⟨−∇_{w_r} L(X), ξ_{p*}⟩ ].
Now we begin to prove (a). For every v_{i,l} ∈ V, every r ∈ M^{(0)}_{i,l}, and every p* ∈ P_{v_{i,l}}(X*), using Induction Hypothesis C.3, when t ∈ [0, T_0] the first term is controlled through ⟨x_{p*}, ξ_{p*}⟩ − (1/θ) Σ_{r′∈[km]} ⟨w_{r′}, ξ_{p*}⟩ ReLU(⟨w_{r′}, x_{p*}⟩), while the second term satisfies E_{(X,y)∼Z} I_{X≠X*} ⟨−∇_{w_r} L(X), ξ_{p*}⟩ = ±o(1/√d). Hence
⟨w^{(t+1)}_r, ξ_{p*}⟩ ≤ ⟨w^{(t)}_r, ξ_{p*}⟩ + Õ(η/N) õ(σ_0) ReLU(Λ^{(t)}_{i,l}).
Now, using the result of Lemma H.9, when t ≤ T_0 we have
⟨w^{(t)}_r, ξ_{p*}⟩ ≤ ⟨w^{(t−1)}_r, ξ_{p*}⟩ + õ(η σ_0) ReLU(Λ^{(t−1)}_{i,l}),
so ⟨w^{(T_0)}_r, ξ_{p*}⟩ ≤ õ(σ_0). Therefore, for t ∈ [T_0, T], we have
⟨w^{(t)}_r, ξ_{p*}⟩ ≤ ⟨w^{(T_0)}_r, ξ_{p*}⟩ + Õ(η(t − T_0)/ω(k)).
Following the similar process, we can also prove (b)-(e).

Performance on Various Downstream Tasks. We use the official MRP setting to train ResNet50.

Z. Wu, Y. Xiong, S. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3733-3742, 2018.

Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. SimMIM: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.




Then the process to prove (c) is similar to the proof of (b). To prove (d) and (e), for every p* ∈ P(X*), using Induction Hypothesis C.3, we have
E_{(X,y)∈Z} I_{X=X*} ⟨−∇_{w_r} L(X), ξ_{p*}⟩ ≤ Õ(1/N) Õ(σ_0^{2q−1}),
and for every p* ∈ [P] \ P(X*), using Induction Hypothesis C.3, we obtain an analogous bound. Following the similar process, we can also prove (d) and (e).

F.5 PROOF OF THEOREM C.4

In this subsection, we combine all the lemmas and prove Theorem C.4.

Proof of Theorem C.4. At iteration t, we argue according to the data structure defined in Definition 1. It is easy to verify that Induction Hypothesis C.3 holds at iteration t = 0. Suppose Induction Hypothesis C.3 holds for all iterations < t; we have established Lemma F.2 and the subsequent lemmas. To prove Induction Hypothesis C.3(a), we plug (12) and (13) into (10) and use |⟨w_r, ξ_p⟩| ≤ õ(σ_0) from Lemma F.6(a).

G TEST PERFORMANCE ON DOWNSTREAM CLASSIFICATION TASKS

In this section, we analyze the performance of mask-reconstruction pretraining on downstream classification tasks to show its superiority over supervised training.

When r ∈ M^{(0)}_{i,l}, at t = 0, using Induction Hypothesis C.3, we have the stated bound; when r ∉ M^{(0)}_{i,l}, at t = 0, using Induction Hypothesis C.3, we have the corresponding bound. The gradients with respect to the output F_i(X) fall into three types.

(1) Near-zero gradients. For u_{i,r} with r outside all sets M^{(0)}_{j,l}, the gradient −∇_{u_{i,r}} L(F; X, y) is very small. Thus there are nearly no updates on those weights, and they stay near zero.

(2) Negative gradients. For u_{i,r} with r ∈ M^{(0)}_{j,l}, j ≠ i, we analyze the gradients −∇_{u_{i,r}} L(F; X, y) for the different types of data points, e.g. (a) when y = i, for every (X, y) ∼ Z_down.

(3) Positive gradients. For u_{i,r} with r ∈ M^{(0)}_{i,l}, we analyze the gradients −∇_{u_{i,r}} L(F; X, y) for the different types of data points, e.g. (a) when y = i, for every (X, y) ∼ Z_{down,m} or (X, y) ∈ Z_{down,s} with the corresponding feature index l.

Besides, we also need a further condition, which shows that, as training proceeds, the increase in the positive correlations tends to zero and the training process begins to converge.

Lemma H.10. Suppose Assumption H.1 holds and Induction Hypothesis C.3 holds for all iterations < t. Then the negative correlations ⟨w^{(t)}_r, v_{i,l}⟩ remain small in magnitude.

Proof. We start with any iteration t at which ⟨w^{(t)}_r, v_{i,l}⟩ ≤ −Ω(σ_0), to see how negative the next iterate can become. Without loss of generality, we consider the case where ⟨w^{(t′)}_r, v_{i,l}⟩ ≤ −Ω(σ_0) holds for every t′ ≥ t. The claim then follows from Claim H.6 together with the corresponding bound for t ≤ T_0.

H.2.2 OFF-DIAGONAL CORRELATIONS

Lemma H.11. Suppose Assumption H.1 holds and Induction Hypothesis C.3 holds for all iterations < t. Then the off-diagonal correlations remain controlled.

Proof. Stage I. We first consider the stage t ≤ T_0. For every r ∈ M^{(0)}_{i,l}, using Claim H.6, we obtain the gradient bound, and from Claim H.7 we obtain the growth rate; combining the two gives the bound when t ≤ T_0. Stage II. When t ∈ [T_0, T], the bound follows similarly.

Lemma H.12. Suppose Assumption H.1 holds and Induction Hypothesis C.3 holds for all iterations < t. We assume that there exists an r′ ∉ M^{(0)}_{i,l} such that Induction Hypothesis C.3(a)-(c) holds for every (X, y) ∈ Z, and we examine whether the sequence ⟨w^{(t)}_{r′}, v_{i,l}⟩ can increase more quickly than max_{r∈M^{(0)}_{i,l}} ⟨w^{(t)}_r, v_{i,l}⟩.

Proof. Stage I. We first consider t ≤ T_0. In this stage, Λ^{(t)}_{i,l} ≤ ϱ. We define two sequences: first, we take w_{r*} = argmax_{r∈M^{(0)}_{i,l}} ⟨w^{(0)}_r, v_{i,l}⟩ and define x_t := ⟨w^{(t)}_{r*}, v_{i,l}⟩. Then, following the same process as in the proof of Lemma F.5, we obtain the claim for this stage. The proof of Stage II is also similar to Lemma F.5.

H.2.6 PROOF OF THEOREM H.3

Theorem H.3 can be easily obtained by following the same steps as in the proof of Theorem G.2 in Section G.

H.3 DISCUSSION ON BEIT METHODS

We have proved that the pretraining phase can capture all features under both the Teacher-Student framework and the MAE framework. Now we discuss the BEiT framework (Bao et al., 2021). For BEiT, if we regard the pretrained encoder of BEiT as a fixed teacher and stack an additional layer that maps the patch-token features of the BEiT encoder to discrete pseudo labels, then this setting becomes very similar to ours. The only difference is that the BEiT encoder (teacher) is fixed, while our teacher encoder is learned online (updated along with the weights of the student). Since the two frameworks share many similarities, our proof methods naturally extend to BEiT with suitable choices of the additional layers.

