SELF-DISTILLATION FOR FURTHER PRE-TRAINING OF TRANSFORMERS

Abstract

The application of pre-training large transformer models on massive amounts of unlabeled data and fine-tuning them on labeled datasets for diverse downstream tasks has demonstrated remarkable success in various vision and natural language processing tasks. However, the direct fine-tuning approach may result in suboptimal performance if there exists a significant discrepancy between the pre-training and fine-tuning domains. To address this issue, some previous studies have proposed further pre-training strategies to continue pre-training the model on the target unlabeled dataset before fine-tuning. However, these strategies are limited to language models and may result in overfitting when applied to Vision Transformers. To overcome this limitation, we present a novel approach of self-distillation as a regularization method for the further pre-training stage. Our method first further pre-trains the initial pre-trained model on the target unlabeled data, and then uses it as a teacher for self-distillation. Then we take the same initial pre-trained model as a student, and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. Our experiments demonstrate the superiority of self-distillation over relevant baselines on various benchmark datasets for image and text classification tasks. Furthermore, we provide a theoretical analysis of our proposed method using a simplified model to shed light on how self-distillation for further pre-training can potentially enhance the performance of downstream tasks.

1. INTRODUCTION

Pre-trained transformer models (Devlin et al., 2019; Brown et al., 2020; Liu et al., 2019; He et al., 2022) have been effective on various vision and natural language processing tasks. Pre-trained models learn general representations from a large volume of unlabeled data so that they generalize well to various downstream tasks when fine-tuned on each task with a labeled dataset. However, in many real-world applications, it takes considerable effort to adapt a pre-trained model to a specific downstream domain, since there is a significant distributional discrepancy between the data used for pre-training and for fine-tuning. Moreover, it is difficult to collect a large amount of labeled data for such specific domains, which makes adaptation of the pre-trained model to downstream tasks even more challenging. Several works have tackled the problem of adapting pre-trained models to a specific domain. A prevalent approach is further pre-training, where we continue to update the parameters of the pre-trained model on additionally curated domain-specific unlabeled data with self-supervision (Beltagy et al., 2019; Lee et al., 2020) before fine-tuning it on the target labeled data, as depicted in Figure 2b. Gururangan et al. (2020) also show that further pre-training only with the target unlabeled data is still effective without any extra data. However, most existing further pre-training approaches have focused on language models, and we empirically find that naive further pre-training is not effective for Vision Transformers.

Our contributions include the following:

• We extensively validate our method on various image and text classification datasets with pre-trained transformers and show that ours outperforms the relevant baselines.

2. RELATED WORK

Self-Distillation  Knowledge distillation transfers knowledge from a teacher to a student by minimizing a divergence between the outputs of the teacher and the student (Hinton et al., 2014). When the parameterization of the student and teacher is identical, we call it self-distillation, a special case of knowledge distillation. Although no new information is introduced during self-distillation, Furlanello et al. (2018) have shown that the student obtained by self-distillation achieves better generalization performance than the teacher. A similar phenomenon has been consistently observed in other works (Yang et al., 2019; Ahn et al., 2019). Several works propose self-distillation without a pre-trained teacher network (Sun et al., 2019; Zhang et al., 2019; 2022): they add auxiliary classifiers to intermediate layers and train them to minimize the divergence between the output of the last-layer classifier and that of the auxiliary classifiers. Mobahi et al. (2020) theoretically analyze how self-distillation induces regularization and reduces overfitting in a Hilbert space. However, all of these works focus on self-distillation for supervised learning. Instead, we empirically and theoretically show that self-distillation for further pre-training with self-supervision leads to better generalization on downstream tasks after fine-tuning the self-distilled model on target labeled data.

Further Pre-training  Lee et al. (2020); Beltagy et al. (2019); Sun et al. (2020) have shown the success of continually pre-training a language model on large corpora collected from the target domain and then fine-tuning it on the target labeled dataset. However, further pre-training on a large amount of unlabeled text is computationally expensive, and collecting unlabeled data at such a scale may not be feasible in certain domains. Instead, Gururangan et al. (2020) devise task-adaptive pre-training, which uses only the target unlabeled data for further pre-training the language model before fine-tuning it on the target labeled data. To improve the effectiveness of further pre-training, Kang et al. (2020); Ye et al. (2021) propose learning to mask the input for masked auto-encoding with bilevel optimization, which incurs a prohibitive computational cost. However, all of these works focus solely on pre-trained language models, and we empirically find that naive further pre-training is not effective for Vision Transformers.

Regularization for Fine-tuning

Several works propose regularization methods for fine-tuning a pre-trained model. Chen et al. (2020a) propose RecAdam, a modification of the Adam optimizer (Kingma & Ba, 2015) that keeps the fine-tuned model close to the initial pre-trained model by minimizing the L2 distance between the fine-tuned and initial pre-trained weights. Similarly, Gouk et al. (2021) project the fine-tuned weights after every gradient descent update so that they lie within a sphere centered on the initial pre-trained weights, with the distance induced by the maximum absolute row sum (MARS) matrix norm. Instead of explicitly minimizing a distance, Aghajanyan et al. (2021), motivated by trust-region theory, propose to minimize a symmetric KL-divergence between the model output for an original input and that for the input perturbed by Gaussian noise. However, none of these methods considers adaptation of the pre-trained model to a specific target domain, which results in worse generalization on downstream tasks than a simple fine-tuning strategy.

3. METHOD

3.1. PRELIMINARIES

Problem Statement

We assume we are given parameters (θ_init, ϕ_init) of a neural network g_{ϕ_init} ∘ f_{θ_init} pre-trained on a large volume of unlabeled data with the masked auto-encoding objective, where f_{θ_init} is an encoder extracting a hidden representation of the input and g_{ϕ_init} is a decoder reconstructing the masked input. Our goal is to fine-tune the pre-trained encoder f_{θ_init}, together with a randomly initialized task-specific head h_ω, on the labeled dataset D^tr = {(x^(i), y^(i))}_{i=1}^n of a downstream classification task such that the model generalizes well to an unseen test dataset D^test. A typical approach to achieve this goal is empirical risk minimization,

$$\underset{\theta,\omega}{\text{minimize}}\;\mathcal{L}_{\text{CE}}(\theta,\omega;\mathcal{D}^{\text{tr}}),\qquad (\theta^*,\omega^*)=\mathcal{A}(\mathcal{L}_{\text{CE}};\theta_{\text{init}},\mathcal{D}^{\text{tr}}), \tag{1}$$

Algorithm 1 Self-Distillation

Require: Unlabeled dataset D^u, initialization θ_init, ϕ_init, learning rate α ∈ R≥0, number of self-distillation rounds T′ ∈ N+, masking probability γ ∈ (0, 1), and batch size B.
1: θ_0 ← FurtherPretrain(D^u, θ_init, ϕ_init, α, γ, B)
2: for t ← 1, . . . , T′ do
3:   Initialize θ_t ← θ_init and ϕ_t ← ϕ_init
4:   while not converged do
5:     Sample a mini-batch {x^(j)}_{j=1}^B from D^u
6:     for j ← 1, . . . , B do
7:       Sample a mask z^(j) ∼ p_{γ,K}(z)
8:       Z^(j) ← Σ_{k=1}^K z_k^(j)
9:       Get a masked input x̂^(j) with z^(j)
10:      ℓ¹_j ← −Σ_{k=1}^K (z_k^(j) / Z^(j)) log p_{θ_t,ϕ_t}(x_k^(j) | x̂^(j))
11:      ℓ²_j ← ∥f_{θ_t}(x^(j)) − StopGrad(f_{θ_{t−1}}(x^(j)))∥²₂
12:    end for
13:    L₁ ← (1/B) Σ_{j=1}^B ℓ¹_j,   L₂ ← (1/B) Σ_{j=1}^B ℓ²_j
14:    θ_t ← θ_t − α ∂(L₁ + L₂)/∂θ |_{θ=θ_t}
15:    ϕ_t ← ϕ_t − α ∂L₁/∂ϕ |_{ϕ=ϕ_t}
16:  end while
17: end for
18: return θ_{T′}

where L_CE is a cross-entropy loss and A denotes a stochastic gradient descent algorithm minimizing L_CE on the dataset D^tr from the initialization θ_init.

Further Pre-training  However, the pre-trained model is prone to overfitting when it is fine-tuned on a small amount of domain-specific labeled data. Gururangan et al. (2020) have shown that further pre-training, where we continue to pre-train the model g_{ϕ_init} ∘ f_{θ_init} on the target unlabeled dataset D^u = {x^(i)}_{i=1}^n and then fine-tune it on D^tr, is effective for improving generalization when there is not enough domain-specific labeled data. Note that D^u is exactly the same as D^tr except that the labels y^(i) are removed. In this work, we focus on masked auto-encoding (Devlin et al., 2019; He et al., 2022) as the pre-training objective because of its generality compared to other self-supervised methods (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Chen & He, 2021; Caron et al., 2021), which require well-defined data augmentations to construct positive pairs for self-supervised learning.

Masked Auto-Encoding  We briefly describe the masked auto-encoding objective (Liu et al., 2019; He et al., 2022) for a language model such as RoBERTa (Liu et al., 2019) and for the Vision Transformer (ViT) (Dosovitskiy et al., 2021). Let x^(i) = (x_1^(i), . . . , x_K^(i)) be a sequence of patches of an image or tokens of a sentence with length K. We independently sample a binary mask for each x_k^(i) from a Bernoulli distribution with probability γ, denoted z^(i) = (z_1^(i), . . . , z_K^(i)). If z_k^(i) = 1, then x_k^(i) is replaced with a special "mask" token; otherwise, x_k^(i) is kept unchanged in the masked input. Let x̂^(i) = (x̂_1^(i), . . . , x̂_K^(i)) be the masked input, and let f_θ and g_ϕ be an encoder and a decoder, respectively. The final objective for masked auto-encoding is

$$\mathcal{L}_{\text{MAE}}(\theta,\phi;\mathcal{D}^u)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{z^{(i)}\sim p_{\gamma,K}(z)}\!\left[-\sum_{k=1}^{K}\frac{z_k^{(i)}}{Z^{(i)}}\log p_{\theta,\phi}\!\left(x_k^{(i)}\,\middle|\,\hat{x}^{(i)}\right)\right],\qquad Z^{(i)}=\sum_{k=1}^{K}z_k^{(i)}, \tag{2}$$

where p_{γ,K}(z) denotes the distribution of K independent Bernoulli trials, each with probability γ that z_k = 1. Note that the negative log-likelihood is instantiated as a cross-entropy loss for language models and as a mean squared error for Vision Transformers. See Appendix C for more detail.
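For illustration, the mask sampling and masked-position normalization in the objective above can be sketched as follows. This is a minimal NumPy sketch using the mean-squared-error instantiation (as for Vision Transformers); the linear `encode`/`decode` maps, the zero `mask_token`, and all shapes are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def masked_autoencoding_loss(x, encode, decode, gamma, mask_token, rng):
    """Masked auto-encoding loss for one sequence x of shape (K, d).

    Each position is masked independently with probability gamma; the
    reconstruction error is averaged over the masked positions only,
    mirroring the z_k / Z normalization of the objective (MSE variant).
    """
    K, _ = x.shape
    z = rng.random(K) < gamma            # binary mask z ~ p_{gamma,K}
    if not z.any():                      # guard: ensure at least one masked token
        z[rng.integers(K)] = True
    x_masked = np.where(z[:, None], mask_token, x)
    recon = decode(encode(x_masked))     # reconstruct all positions
    per_token = ((recon - x) ** 2).sum(axis=1)
    return (z * per_token).sum() / z.sum()   # average over masked tokens only

# Toy linear encoder/decoder (illustrative stand-ins for f_theta, g_phi).
rng = np.random.default_rng(0)
K, d, h = 8, 4, 3
W_enc = rng.normal(size=(d, h))
W_dec = rng.normal(size=(h, d))
encode = lambda a: a @ W_enc
decode = lambda a: a @ W_dec

x = rng.normal(size=(K, d))
mask_token = np.zeros(d)
loss = masked_autoencoding_loss(x, encode, decode, gamma=0.6,
                                mask_token=mask_token, rng=rng)
print(loss)
```

Note that the loss is averaged only over the masked positions, mirroring the z_k^(i)/Z^(i) weighting: reconstruction quality on unmasked tokens does not contribute.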

3.2. SELF-DISTILLATION FOR FURTHER PRE-TRAINING

Although the further pre-training strategy has been effective in the text domain (Gururangan et al., 2020; Lee et al., 2020; Sun et al., 2020), we empirically find that a ViT with further pre-training overfits the target unlabeled data and does not generalize well to downstream image classification tasks. To tackle this issue, we propose self-distillation as a regularization for further pre-training. Specifically, given a pre-trained model g_{ϕ_init} ∘ f_{θ_init}, we first continue to train the model on the target unlabeled data D^u with the masked auto-encoding objective in equation 2 to obtain the encoder f_{θ_0} and decoder g_{ϕ_0}. We discard the decoder and take the encoder f_{θ_0} as the teacher for self-distillation. We then take a copy of the initial pre-trained network g_{ϕ_init} ∘ f_{θ_init} as the student and further pre-train it with the masked auto-encoding objective while enforcing the hidden representations of the student encoder to be close to those of the teacher f_{θ_0}:

$$(\theta_1,\phi_1)\in\underset{\theta,\phi}{\arg\min}\;\big(\mathcal{L}_{\text{MAE}}(\theta,\phi;\mathcal{D}^u)+\mathcal{L}_{\text{Distill}}(\theta;\theta_0,\mathcal{D}^u)\big),\qquad \mathcal{L}_{\text{Distill}}(\theta;\theta_0,\mathcal{D}^u)=\frac{1}{n}\sum_{i=1}^{n}\left\|f_\theta(x^{(i)})-\text{StopGrad}\big(f_{\theta_0}(x^{(i)})\big)\right\|_2^2, \tag{3}$$

where θ and ϕ are initialized with the pre-trained parameters θ_init and ϕ_init, respectively, and StopGrad denotes the stop-gradient operation, which does not back-propagate through its input. As described in Algorithm 1, we can repeat this process to perform multiple rounds of self-distillation (T′ > 1), where the student of the previous round becomes the teacher and a new student is initialized with the pre-trained weights θ_init and ϕ_init for the next round. We empirically find that the first round of self-distillation plays the most significant role in improving the final generalization performance on downstream tasks, and our theoretical analysis likewise shows that the first round has the largest regularization effect. Thus, we perform a single round of self-distillation for computational efficiency. After self-distillation, we discard the decoder g_{ϕ_1} and fine-tune the student encoder f_{θ_1}, along with a randomly initialized task-specific head h_ω, by minimizing L_CE(θ, ω; D^tr) from the initialization θ_1 as described in equation 1.
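A minimal sketch of the distillation term may help make the stop-gradient explicit. Here the teacher's representations are precomputed and treated as constants, which is exactly what StopGrad achieves in an autodiff framework; the linear mean-pooling encoder and all shapes are illustrative assumptions, not the transformer used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, d, h = 16, 8, 4, 3

# Hypothetical stand-ins: theta_init for the initial pre-trained encoder
# weights and theta_teacher for the further pre-trained teacher weights.
theta_init = rng.normal(size=(d, h))
theta_teacher = theta_init + 0.1 * rng.normal(size=(d, h))
X = rng.normal(size=(n, K, d))  # n sequences of K token embeddings

def features(theta, x):
    # Toy encoder f_theta: linear map followed by mean pooling over tokens.
    return (x @ theta).mean(axis=0)

def distill_loss(theta_student, theta_teacher, X):
    """L_Distill of equation 3: mean squared distance between student and
    teacher representations. The teacher features are precomputed constants
    and never differentiated, playing the role of StopGrad."""
    t_feats = np.stack([features(theta_teacher, x) for x in X])  # constants
    s_feats = np.stack([features(theta_student, x) for x in X])
    return float(((s_feats - t_feats) ** 2).sum(axis=1).mean())

theta_student = theta_init.copy()  # the student restarts from theta_init
print(distill_loss(theta_student, theta_teacher, X))  # > 0 before training
print(distill_loss(theta_teacher, theta_teacher, X))  # exactly 0.0
```

In a real implementation the same effect is obtained by detaching the teacher's features from the computation graph (e.g., `.detach()` in PyTorch) so that gradients flow only through the student.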

4. THEORETICAL ANALYSIS

In this section, we analyze how self-distillation affects the final model after fine-tuning in terms of generalization and regularization. We prove a generalization bound on the supervised loss for our method and show that the bound strictly decreases as the number of self-distillation rounds increases. Moreover, we show that self-distillation acts as a regularizer on the distance between the initial weight before further pre-training and the final weight after fine-tuning. The regularization effect is shown to be largest in the first round of self-distillation, which suggests that the first round plays a more significant role in the final performance than the other rounds.

We consider the dynamics of the weight vector w_{t,τ} over time τ of fine-tuning after t rounds of self-distillation, where w_{0,0} is the result of further pre-training and w_{t,0} ∈ argmin_w L_t(w) is the result of t rounds of self-distillation with

$$\mathcal{L}_t(w)=\frac{1}{n}\sum_{i=1}^{n}\|f(x_i,w)-f(x_i,w_{t-1,0})\|_2^2+\lambda\|w\|_2^2$$

for some λ > 0. After t rounds of self-distillation, we consider the dynamics over fine-tuning time τ via gradient flow (Saxe et al., 2014; Kawaguchi, 2021),

$$\frac{dw_{t,\tau}}{d\tau}=-\nabla\mathcal{L}(w_{t,\tau}),$$

with the initialization w_{t,0} obtained by self-distillation, where L(w) = (1/2) Σ_{i=1}^n ℓ(w, x_i, y_i) with ℓ(w, x, y) = ∥f(x, w) − y∥²₂ and y ∈ R^p. Here, the self-distillation and fine-tuning phases share the same training dataset s = {(x_i, y_i)}_{i=1}^n. To obtain theoretical insights, we consider the regime d > n and a simple abstract model f(x, w) = Wφ(x) ∈ R^p with some nonlinear map φ and weight matrix W ∈ R^{p×d}, where w = vec[W^⊤] ∈ R^{dp} and φ(x) ∈ R^d. Here, vec[W^⊤] is the vectorization of the matrix W^⊤. We fix the fine-tuning time length T with 1 < τ ≤ T < ∞. Since d > n, there are infinitely many solutions to the problem of minimizing L(w). Thus, the finite time length T and the over-parameterization d > n together imply that the initialization w_{t,0} of the fine-tuning phase, obtained via self-distillation, plays an important role.

Let δ > 0 and t ∈ N_0. We define F_t = {A_t(s) : s ∈ S}, where S is a set of training datasets of size n such that, with probability at least 1 − δ, the training dataset s is in S. For each training dataset s ∈ S, A_t(s) = w_{t,T} is the final weight vector of the model after t rounds of self-distillation and T time of fine-tuning. Define the matrix Φ ∈ R^{d×n} by Φ_{ij} = φ(x_j)_i. We assume that Φ is of full rank, i.e., rank(Φ) = n (recall d ≥ n); this is typically satisfied, since rank(Φ) < n would imply redundancy among the feature vectors φ(x_j). Denote by [I_p ⊗ Φ] ∈ R^{dp×np} the Kronecker product of the identity matrix I_p ∈ R^{p×p} and the matrix Φ, and write its singular value decomposition as [I_p ⊗ Φ] = UΣV^⊤, where U = [u_1, u_2, . . . , u_{dp}] ∈ R^{dp×dp} contains the left singular vectors u_i ∈ R^{dp} and Σ ∈ R^{dp×np} is a rectangular diagonal matrix with Σ_{ii} = σ_i ∈ R≥0 for i ∈ {1, . . . , np} and σ_1 ≥ σ_2 ≥ · · · ≥ σ_{np} ≥ 0. Define M to be an upper bound on the loss, ℓ(w, x, y) ≤ M, and R to be an upper bound on the expected squared norm of the features, E_x[∥φ(x)∥²₂] ≤ R². We assume that w_{0,0} ≠ 0; if w_{0,0} = 0, the target function in the self-distillation phase is identically zero, since f(x_i, w_{0,0}) = 0 for all i, which is unlikely in practice. We define w_init ∈ R^{dp} to be the weight before further pre-training and Y = vec[[y_1, . . . , y_n]^⊤] ∈ R^{np}. The following theorem shows that the generalization bound on the supervised loss ℓ(w_{t,T}, x, y) of the fine-tuning phase strictly decreases as the number t of self-distillation rounds in the further pre-training phase increases: Theorem 1.
There exists a constant c (depending only on M) such that, with probability at least 1 − δ,

$$\mathbb{E}_{x,y}[\ell(w_{t,T},x,y)]\le\frac{1}{n}\sum_{i=1}^{n}\ell(w_{t,T},x_i,y_i)+\zeta(t)\sqrt{\frac{4c^2R^2p}{n}}+M\sqrt{\frac{\ln(2/\delta)}{2n}},$$

where the function ζ(t) is strictly decreasing in t ∈ N_0.

The proofs of all results in this section are presented in Appendix A. Moreover, the following theorem shows that the tight upper bound on the distance ∥w_init − w_{t,T}∥₂ between the initial weight w_init and the final weight w_{t,T} after T time of fine-tuning strictly decreases as the number t of self-distillation rounds increases:

Theorem 2. There is a function ψ : N_0 → R≥0 such that (1) ∥w_init − w_{t,T}∥₂ = ψ(t) for some w_init ∈ R^{dp}, (2) ∥w_init − w_{t,T}∥₂ ≤ ψ(t) for all w_init ∈ R^{dp}, (3) the function ψ(t) is strictly decreasing in t ∈ N_0, and (4) ψ(t) can be decomposed as ψ(t) = G_1 + ψ_1(t) + 1{t = 0}B + G_2 with constants G_1, G_2 ≥ 0 independent of t, where ψ_1(t) is strictly decreasing in t ∈ N_0 and B = Σ_{i=np+1}^{dp} (u_i^⊤ w_{0,0})² ≥ 0.

Theorem 2 shows that self-distillation acts as a regularizer on the distance between the initial weight w_init and the final weight w_{t,T}. Since the Rademacher complexity of a set of vectors is invariant to a shift by a constant vector, this distance has been shown to control generalization bounds in various models and settings, including deep neural networks (Bartlett & Mendelson, 2002; Bartlett et al., 2017; Nagarajan & Kolter, 2019). This suggests that self-distillation helps generalization via a regularization effect on the distance. Moreover, the first round of self-distillation is expected to have the largest impact, since Theorem 2 shows that the unnecessary component B of w_{0,0} is completely removed in the first round of self-distillation. We verify these theoretical predictions in our experiments, where we show the correlation between the improvement from self-distillation and the distance appearing in the generalization bound of previous work (Nagarajan & Kolter, 2019).
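The mechanism behind these results can be checked numerically in the simplified linear setting: each round of the ridge-regularized self-distillation objective rescales the coefficient along the i-th singular direction by σ_i²/(σ_i² + nλ) ∈ (0, 1), so the norm of w_t shrinks monotonically in t. A small NumPy check of this closed form (toy dimensions and an arbitrary λ; u_i are orthonormal, so norms are computed coefficient-wise):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 10, 0.5
sigma = np.sort(rng.uniform(0.5, 3.0, size=n))[::-1]   # singular values
y_tilde = rng.normal(size=n)                           # projected targets

def w_norm_after_rounds(t):
    """Norm of w_t = sum_i (1/sigma_i) (sigma_i^2/(sigma_i^2 + n*lam))^t y~_i u_i,
    the closed form after t rounds of self-distillation in the linear model."""
    shrink = (sigma**2 / (sigma**2 + n * lam)) ** t
    coeff = shrink * y_tilde / sigma
    return float(np.linalg.norm(coeff))

norms = [w_norm_after_rounds(t) for t in range(6)]
print(norms)  # strictly decreasing in t
```

Each coefficient is multiplied by the same factor in (0, 1) every round, so ∥w_t∥ decreases geometrically; the additional first-round effect in Theorem 2 comes from removing the component B of w_{0,0} outside the span of the data directions, which this toy check does not model.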

5. EXPERIMENT

Dataset  For image classification, we use six datasets: FGVC Aircraft (Aircraft) (Maji et al., 2013), Caltech-UCSD Birds 200 (CUB) (Wah et al., 2011), Chest X-ray (Kermany et al., 2018), Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Stanford Dogs (Khosla et al., 2011), and Oxford 102 Flower (Nilsback & Zisserman, 2008). For text classification, we use four datasets: Chemprot (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SCIERC (Luan et al., 2018), and Twitter-Emotion (Mohammad et al., 2018). Please see Appendix D for more detail.

Implementation Detail  For image classification, we use a Vision Transformer pre-trained on the unlabeled ImageNet dataset with masked auto-encoding (He et al., 2022) and fine-tune it on each downstream task with the AdamW optimizer (Loshchilov & Hutter, 2019) for 10,000 steps with batch size 32. For further pre-training and self-distillation, we continue to pre-train the model for 20,000 steps with batch size 64. We evaluate the Vision Transformers with accuracy. For text classification, following the experimental setup of Gururangan et al. (2020), we use RoBERTa (Liu et al., 2019) as the backbone network and fine-tune it on the target labeled dataset with the AdamW optimizer for 10 epochs with batch size 32. For further pre-training and self-distillation, we further pre-train RoBERTa for 100 epochs with batch size 128. We evaluate the models with macro F1 for the SCIERC, ACL-ARC, and Twitter-Emotion datasets, and micro F1 for the Chemprot dataset.

Baselines  We compare our method against the following baselines for fine-tuning pre-trained models. All models are initialized with the pre-trained weights θ_init and ϕ_init.

1. Fine-tuning: The model fine-tuned on the target labeled dataset D^tr without any further pre-training or regularization other than dropout and weight decay.
2. RecAdam (Chen et al., 2020a): The model trained with the RecAdam optimizer, a variant of Adam (Kingma & Ba, 2015) that additionally penalizes the L2 distance between the fine-tuned and the initial pre-trained weights.
3. MARS (Gouk et al., 2021): The model trained to minimize the cross-entropy loss along with a regularization projecting the fine-tuned weights to lie within a sphere centered on the initial pre-trained weights. For each layer, the distance induced by the maximum absolute row sum (MARS) matrix norm, max_j Σ_i |W_{j,i} − U_{j,i}|, is used for the regularization.
4. R3F (Aghajanyan et al., 2021): The model trained to minimize the cross-entropy loss as well as a symmetric KL-divergence between the softmax output of the original input and that of the input perturbed by Gaussian noise.
5. Further Pre-training (Gururangan et al., 2020): Task-adaptive pre-training, where we further pre-train the model on the unlabeled target dataset D^u with the masked auto-encoding objective and fine-tune it on the target labeled dataset D^tr.
6. Self-Distillation: Our model, which is further pre-trained on the unlabeled target dataset D^u with equation 3 and fine-tuned on the target labeled dataset D^tr.
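For concreteness, the two weight-space distances used by the RecAdam and MARS baselines can be computed as follows (a sketch; `W` denotes a fine-tuned weight matrix and `U` the corresponding pre-trained matrix, following the notation in the MARS description above):

```python
import numpy as np

def l2_penalty(W, U):
    """RecAdam-style penalty: squared L2 distance to the pre-trained weights."""
    return float(((W - U) ** 2).sum())

def mars_distance(W, U):
    """MARS distance: maximum absolute row sum of (W - U),
    i.e., max_j sum_i |W[j, i] - U[j, i]| (the induced infinity-norm)."""
    return float(np.abs(W - U).sum(axis=1).max())

U = np.zeros((2, 3))
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, -1.0]])
print(l2_penalty(W, U))     # 1 + 4 + 0.25 + 0 + 1 + 1 = 7.25
print(mars_distance(W, U))  # max(1 + 2 + 0.5, 0 + 1 + 1) = 3.5
```

MARS constrains each row of the weight update independently, whereas the L2 penalty pools the deviation over the entire matrix.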

5.1. MAIN RESULTS

As shown in Table 1, self-distillation consistently outperforms all the regularization methods and the further pre-training method on the image datasets. Notably, our method significantly improves performance on the Chest X-ray dataset, which consists of grey-scale images for the diagnosis of pneumonia. In addition, self-distillation effectively tackles the Flower dataset, which contains only 2,040 labeled examples. In contrast, the other baselines do not show consistent improvement across all the image datasets. For instance, further pre-training is effective on the Aircraft dataset but significantly degrades test accuracy on the DTD dataset. Regularization methods such as RecAdam, MARS, and R3F barely improve generalization performance on most datasets, or underperform the simple fine-tuning strategy on certain datasets. This empirical evidence supports that regularizations enforcing the fine-tuned model to stay close to the initial pre-trained weights are not effective for adapting a pre-trained model to target datasets from specific domains. Furthermore, as shown in Table 2, we provide additional experimental results for text classification tasks. Again, self-distillation significantly outperforms all of the baselines across all four datasets, except RecAdam on the Chemprot dataset. In contrast to the previous experiment, the further pre-training method improves on the test F1 score of simple fine-tuning, yet it still underperforms our model. The regularization methods RecAdam, MARS, and R3F do not achieve consistent improvement across the datasets: RecAdam moderately improves the F1 score on the SCIERC and Chemprot datasets but significantly degrades generalization performance on the ACL-ARC dataset, while MARS and R3F show poor performance on the SCIERC and ACL-ARC datasets and are slightly worse than the Fine-tuning method on the Chemprot dataset.

Result for Low Resource Data

We further perform experiments to show how self-distillation handles low-resource labeled data. Given the full CIFAR-100 dataset (Krizhevsky et al., 2009), which contains 50,000 training pairs of images and corresponding labels, we plot the test accuracy of each model while varying the number of training instances. Note that we also reduce the number of unlabeled images used for further pre-training or self-distillation accordingly. As shown in Figure 3, self-distillation consistently improves the generalization performance of both the fine-tuning method and the model further pre-trained on images from the CIFAR-100 dataset. Notably, the gain from self-distillation becomes larger when the models are trained with an extremely small number of instances. For example, self-distillation achieves 13% and 6% improvements in test accuracy over simple fine-tuning when there are 1,000 and 2,500 labeled examples, respectively. These empirical results verify that self-distillation can effectively adapt the pre-trained model to the target dataset even with extremely small amounts of labeled data.

Ablation Study

We perform an ablation study to verify the effectiveness of each component of self-distillation. In Table 3, we show empirical results on the CUB and SCIERC datasets while removing or replacing various components of self-distillation. First, we remove the masked auto-encoding objective L_MAE and train the model with only the distillation loss L_Distill before fine-tuning. On the image dataset CUB, this does not make a significant difference; however, removing the masked auto-encoding objective degrades the generalization performance of the language model on the text classification dataset SCIERC. Alternatively, we remove the distillation loss L_Distill in equation 3, which recovers the further pre-training method. We also continue to pre-train the model for twice as many steps as the original further pre-training method, denoted Further Pre-train×2, to show that the higher test accuracy of self-distillation is not a consequence of longer pre-training. Both models significantly underperform self-distillation, which shows the effectiveness of the self-distillation loss. Lastly, we perform experiments with variants of the distillation loss L_Distill in equation 3. Instead of matching the representations of teacher and student, we enforce the reconstructions of the masked input by teacher and student to be consistent, i.e., we minimize ∥g_ϕ ∘ f_θ(x̂) − g_{ϕ_0} ∘ f_{θ_0}(x̂)∥²₂ over (θ, ϕ) for ViT, or Σ_{k=1}^K D_KL(p_{θ_0,ϕ_0}(x_k | x̂) ∥ p_{θ,ϕ}(x_k | x̂)) for RoBERTa, denoted Prediction-Matching. Furthermore, we replace the distillation loss with one minimizing the L2 or MARS distance between the parameters of student and teacher, denoted Weight-Matching. As shown in Table 3, none of these variants is as effective as minimizing the distance between the hidden representations of the student and teacher.
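As an illustration of the Prediction-Matching variant for RoBERTa, the per-position KL term can be sketched as follows (toy logits over a hypothetical vocabulary; the shapes and the `softmax` helper are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prediction_matching_loss(teacher_logits, student_logits):
    """Sum over token positions of KL(p_teacher || p_student), the
    Prediction-Matching distillation variant from the ablation."""
    p = softmax(teacher_logits)              # teacher distributions
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits))  # student log-probabilities
    return float((p * (log_p - log_q)).sum())

rng = np.random.default_rng(3)
t_logits = rng.normal(size=(5, 11))  # 5 masked positions, vocabulary of 11
s_logits = t_logits.copy()
print(prediction_matching_loss(t_logits, s_logits))  # 0.0 when outputs match
```

The loss is exactly zero when the student reproduces the teacher's predictive distributions at every masked position and strictly positive otherwise.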
Multi-Round of Self-Distillation  Lastly, we empirically show that the first round of self-distillation plays the most significant role in improving generalization performance. Specifically, we fine-tune each model after t rounds of self-distillation and plot the test accuracy on the Oxford 102 Flower dataset, where 0 rounds of self-distillation (t = 0) denotes the model with further pre-training. As shown in Figure 4, the first round of self-distillation significantly improves the test accuracy of the model with further pre-training, and the gain from self-distillation becomes marginal after the first round. Considering the extra computational cost and the marginal improvement of multi-round self-distillation, we perform a single round of self-distillation in all the experiments.

5.2. FURTHER ANALYSIS

In this subsection, we present numerical experiments that analyze why self-distillation can help improve the generalization performance of downstream tasks compared to further pre-training, and we empirically show that Theorems 1 and 2 extend to deep neural networks such as transformers.

(a) Generalization gap: In Figure 5a, we plot the generalization gap (test loss minus training loss on each labeled dataset) of the self-distillation and further pre-training methods. Self-distillation improves the generalization gap of the further pre-training method across all the datasets. This is consistent with Theorem 1, which shows that self-distillation with a simplified model strictly decreases the generalization bound on the supervised loss of the fine-tuning stage.

(b) Effect of self-distillation on distance: To empirically validate Theorem 2, concerning the regularization effect of self-distillation on the L2 distance between the initial pre-trained weight θ_init and the final weight after fine-tuning, we plot the distances obtained with self-distillation and with further pre-training. Specifically, we compare ∥θ_init − θ_{1,T}∥₂ and ∥θ_init − θ_{0,T}∥₂, where θ_{t,τ} is the parameter after t rounds of self-distillation and τ steps of gradient descent for fine-tuning. As shown in Figure 5b, self-distillation consistently decreases the distance, and the reduced distance correlates with the better generalization gap in Figure 5a. These empirical results confirm the connection between the L2 distance from the initialization and the generalization bound (Nagarajan & Kolter, 2019).

(c) Effect of multi-round self-distillation: Lastly, we empirically verify the part of Theorem 2 showing that the first round of self-distillation plays the most critical role in regularizing the L2 distance between the initial pre-trained weight θ_init and the final weight θ_{t,T} (the parameter after t rounds of self-distillation and T steps of gradient descent for fine-tuning) on the Oxford 102 Flower dataset. As shown in Figure 5c, self-distillation significantly decreases the distance at the first round (t = 1), and the regularization effect on the distance diminishes afterward, where 0 rounds of self-distillation (t = 0) denotes the model with further pre-training but without self-distillation.
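The distance ∥θ_init − θ_{t,T}∥₂ reported in this analysis can be measured by concatenating all (flattened) encoder parameters into a single vector; a minimal sketch with a dict-of-arrays stand-in for a framework state dict (the parameter names and shapes are illustrative):

```python
import numpy as np

def param_distance(params_a, params_b):
    """L2 distance between two parameter sets, treating the concatenation of
    all flattened tensors as a single weight vector."""
    diffs = [np.ravel(params_a[k] - params_b[k]) for k in sorted(params_a)]
    return float(np.linalg.norm(np.concatenate(diffs)))

theta_init = {"layer1": np.zeros((2, 2)), "layer2": np.zeros(3)}
theta_final = {"layer1": np.ones((2, 2)), "layer2": np.zeros(3)}
print(param_distance(theta_init, theta_final))  # sqrt(1+1+1+1) = 2.0
```

With real checkpoints, the same computation applies to the encoder entries of two state dicts sharing identical keys and shapes.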

6. CONCLUSION

To effectively adapt pre-trained transformers to a target domain, we proposed self-distillation as a regularization for further pre-training. Specifically, we first took the initial pre-trained transformer, continued to pre-train it with the masked auto-encoding objective on the target unlabeled dataset, and used the encoder part of the resulting model as a teacher for self-distillation. We then took a copy of the same initial pre-trained model as a student and enforced the representations of the student to be close to those of the teacher while optimizing the student with the masked auto-encoding objective on the target unlabeled dataset. Finally, we fine-tuned the self-distilled student on the target labeled dataset. Our empirical evaluation on various image and text classification benchmark datasets showed that self-distillation consistently improves generalization performance compared to the relevant baselines. Lastly, we provided a theoretical analysis of the proposed method with a simplified model to understand how self-distillation for further pre-training can help improve the generalization performance of downstream tasks.

REPRODUCIBILITY STATEMENT

We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface to implement all the baselines and our proposed method in the experiments. We have described our method of self-distillation for further pre-training in Algorithm 1 and specified all the experimental setup including hyperparameters in Section 5 and Appendix E. For theoretical analysis, we have provided all the proofs in Appendix A. w t = Ū BA t-1 V ⊤ vec[f 0 ] = Ū      σ 1 1 σ 2 1 +nλ . . . σ 1 np σ 2 np +nλ           σ 2 1 σ 2 1 +nλ . . . σ 2 np σ 2 np +nλ      t-1     ỹ1 ỹ2 . . . ỹnp     = [u 1 u 2 • • • u np ]      σ 1 (σ 2 1 + nλ) -1 (σ 2 1 (σ 2 1 + nλ) -1 ) t-1 ỹ1 σ 2 (σ 2 2 + nλ) -1 (σ 2 2 (σ 2 2 + nλ) -1 ) t-1 ỹ2 . . . σ np (σ 2 np + nλ) -1 (σ 2 np (σ 2 np + nλ) -1 ) t-1 ỹnp      = np i=1 σ i (σ 2 i + nλ) -1 (σ 2 i (σ 2 i + nλ) -1 ) t-1 ỹi u i = r i=1 σ i (σ 2 i + nλ) -1 (σ 2 i (σ 2 i + nλ) -1 ) t-1 ỹi u i where the last line follows from the fact that σ i (σ 2 i + nλ) -1 (σ 2 i (σ 2 i + nλ) -1 ) t-1 = 0 for all i > r. Since σ i (σ 2 i + nλ) -1 (σ 2 i (σ 2 i + nλ) -1 ) t-1 = σ 2t-1 i (σ 2 i + nλ) -t = 1 σi ( σ 2 i σ 2 i +nλ ) t for i ≤ r, this implies that w t = r i=1 1 σ i σ 2 i σ 2 i + nλ t ỹi u i = r i=1 1 σ i 1 1 + (nλ/σ 2 i ) t ỹi u i Since t ∈ N + was arbitrary, this holds for any t ∈ N + . This proves the first statement of the theorem for any t ∈ N + . For t = 0, since ỹi = (V ⊤ vec[f 0 ]) i = (V ⊤ [I p ⊗ Φ] ⊤ w 0,0 ) i = (V ⊤ V Σ ⊤ U ⊤ w 0,0 ) i = (Σ ⊤ U ⊤ w 0,0 ) i = σ i u ⊤ i w 0,0 , we have that r i=1 1 σ i 1 1 + (nλ/σ 2 i ) t ỹi u i = r i=1 u i u ⊤ i w 0,0 = Ũ Ũ ⊤ w 0,0 . Thus, w 0,0 = Ũ Ũ ⊤ w 0,0 + (I -Ũ Ũ ⊤ )w 0,0 = r i=1 1 σ i 1 1 + (nλ/σ 2 i ) t ỹi u i + (I -Ũ Ũ ⊤ )w 0,0 . Since (I -Ũ Ũ ⊤ )w 0,0 = P r w 0,0 , this completes the first statement of the theorem for any t ∈ N 0 . A.1 PROOF OF THEOREM 1 Proof. Define Z := [I p ⊗ Φ] ⊤ ∈ R n× d where n = np and d = dp. 
Then, $L(w) = \frac{1}{2}\sum_{i=1}^{n}\|f(x_i, w) - y_i\|_2^2 = \frac{1}{2}\|Zw - Y\|_2^2$, where $Y = \mathrm{vec}[[y_1, \dots, y_n]^\top] \in \mathbb{R}^{\bar{n}}$. Since $\nabla L(w_{t,\tau}) = Z^\top(Zw_{t,\tau} - Y)$,

$$\frac{dw_{t,\tau}}{d\tau} = -Z^\top(Zw_{t,\tau} - Y).$$

Since $\mathrm{rank}(\Phi) = n$ and $d \ge n$, we have $\mathrm{rank}(Z) = \bar{n}$ by the property of the Kronecker product with the identity matrix. Since $\mathrm{rank}(Z) = \bar{n}$, there exists $v \in \mathbb{R}^{\bar{d}}$ such that $Y = Zv$. Thus,

$$\frac{dw_{t,\tau}}{d\tau} = -Z^\top(Zw_{t,\tau} - Zv) = -Z^\top Z(w_{t,\tau} - v).$$

Since $Z^\top = U\Sigma V^\top$, we have $Z^\top Z = U\Sigma\Sigma^\top U^\top = \sum_{i=1}^{\bar{n}}\sigma_i^2 u_i u_i^\top$. Thus,

$$\frac{dw_{t,\tau}}{d\tau} = -\sum_{i=1}^{\bar{n}}\sigma_i^2 u_i u_i^\top (w_{t,\tau} - v).$$

Since the columns of $U$ form a basis of $\mathbb{R}^{\bar{d}}$ and $w_{t,\tau}, v \in \mathbb{R}^{\bar{d}}$, we can write $w_{t,\tau} = \sum_{k=1}^{\bar{d}} c_k^{(t,\tau)} u_k$ and $v = \sum_{k=1}^{\bar{d}} q_k u_k$ for some $c_k^{(t,\tau)}$ and $q_k$. Thus,

$$\frac{dw_{t,\tau}}{d\tau} = -\sum_{i=1}^{\bar{n}}\sigma_i^2 u_i u_i^\top \sum_{k=1}^{\bar{d}}(c_k^{(t,\tau)} - q_k)u_k = -\sum_{i=1}^{\bar{n}}\sum_{k=1}^{\bar{d}}\sigma_i^2(c_k^{(t,\tau)} - q_k)u_i u_i^\top u_k = -\sum_{i=1}^{\bar{n}}\sigma_i^2(c_i^{(t,\tau)} - q_i)u_i.$$

Using $w_{t,\tau} = \sum_{k=1}^{\bar{d}} c_k^{(t,\tau)} u_k$ for the left-hand side too, we have that

$$\frac{d}{d\tau}\sum_{i=1}^{\bar{d}} c_i^{(t,\tau)} u_i = -\sum_{i=1}^{\bar{n}}\sigma_i^2(c_i^{(t,\tau)} - q_i)u_i.$$

This implies that $\frac{d}{d\tau}c_i^{(t,\tau)} = -\sigma_i^2(c_i^{(t,\tau)} - q_i)$ for all $i \in \{1, \dots, \bar{n}\}$, and $\frac{d}{d\tau}c_i^{(t,\tau)} = 0$ for all $i \notin \{1, \dots, \bar{n}\}$. This can also be seen from the fact that $\frac{dw_{t,\tau}}{d\tau} = -Z^\top(Zw_{t,\tau} - Zv)$ with $Z^\top = U\Sigma V^\top$: the dynamics only add components of $u_i$ for $i \in \{1, \dots, \bar{n}\}$, and not for $i \notin \{1, \dots, \bar{n}\}$, so for the components of $u_i$ with $i \notin \{1, \dots, \bar{n}\}$ the initial values stay. In other words, for $i \notin \{1, \dots, \bar{n}\}$, $c_i^{(t,\tau)} = c_i^{(t,0)}$. On the other hand, for $i \in \{1, \dots, \bar{n}\}$, since $\frac{d}{d\tau}q_i = 0$,

$$\frac{d}{d\tau}(c_i^{(t,\tau)} - q_i) = \frac{d}{d\tau}c_i^{(t,\tau)} = -\sigma_i^2(c_i^{(t,\tau)} - q_i).$$

Solving this for $(c_i^{(t,\tau)} - q_i)$, we have that for $i \in \{1, \dots, \bar{n}\}$,

$$c_i^{(t,\tau)} - q_i = (c_i^{(t,0)} - q_i)e^{-\sigma_i^2\tau},$$

which implies that $c_i^{(t,\tau)} = q_i + (c_i^{(t,0)} - q_i)e^{-\sigma_i^2\tau} = q_i(1 - e^{-\sigma_i^2\tau}) + c_i^{(t,0)}e^{-\sigma_i^2\tau}$.
Combining these with $w_{t,T} = \sum_{k=1}^{\bar{d}} c_k^{(t,T)} u_k$,

$$w_{t,T} = \sum_{i=1}^{\bar{d}} c_i^{(t,T)} u_i = \sum_{i=1}^{\bar{n}} q_i(1 - e^{-\sigma_i^2 T})u_i + \sum_{i=1}^{\bar{n}} c_i^{(t,0)}e^{-\sigma_i^2 T}u_i + \sum_{i=\bar{n}+1}^{\bar{d}} c_i^{(t,0)}u_i. \quad (5)$$

Therefore, for any particular $s \in S$, since $U = [u_1, u_2, \dots, u_{dp}] \in \mathbb{R}^{dp \times dp}$ is an orthogonal matrix,

$$\|A_t(s)\|_2^2 = \|w_{t,T}\|_2^2 \le \sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2 + \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T} + \sum_{i=\bar{n}+1}^{\bar{d}} (c_i^{(t,0)})^2, \quad (6)$$

where $q_i$, $\sigma_i$, and $c_i^{(t,0)}$ all depend on $s$. By using Lemma 4 of Pham et al. (2021) and taking a union bound with $P(s \notin S) \le \delta$, with probability at least $1 - \delta$, we have that $w_{t,T} \in \mathcal{F}_t$ and the following holds:

$$\mathbb{E}_{x,y}[\ell(w_{t,T}, x, y)] \le \frac{1}{n}\sum_{i=1}^{n}\ell(w_{t,T}, x_i, y_i) + 2\mathcal{R}_n(\mathcal{F}_t) + M\sqrt{\frac{\ln(2/\delta)}{2n}}, \quad (7)$$

where $\mathcal{R}_n(\mathcal{F}_t) = \mathbb{E}_{s,\xi}\big[\sup_{w \in \mathcal{F}_t}\frac{1}{n}\sum_{i=1}^{n}\xi_i\|W\varphi(x_i) - y_i\|_2^2\big]$, $s = ((x_i, y_i))_{i=1}^{n}$, $w = \mathrm{vec}[W^\top]$, and $\xi_1, \dots, \xi_n$ are independent uniform random variables taking values in $\{-1, 1\}$. By using Corollary 4 of Maurer (2016), there exists a constant $c$ (depending only on $M$) such that

$$\mathcal{R}_n(\mathcal{F}_t) \le \frac{c}{n}\,\mathbb{E}_{s,\xi}\Big[\sup_{w \in \mathcal{F}_t}\sum_{i=1}^{n}\sum_{k=1}^{p}\xi_{ik}W_k\varphi(x_i)\Big] = \frac{c}{n}\,\mathbb{E}_{s,\xi}\Big[\sup_{w \in \mathcal{F}_t}\sum_{k=1}^{p}W_k\sum_{i=1}^{n}\xi_{ik}\varphi(x_i)\Big] = \frac{c}{n}\,\mathbb{E}_{s,\xi}\Big[\sup_{w \in \mathcal{F}_t} w^\top h\Big],$$

where $W_k$ is the $k$-th row of $W$, the $\xi_{ik}$ are independent uniform random variables taking values in $\{-1, 1\}$, $h = \mathrm{vec}[H] \in \mathbb{R}^{dp}$, and $H \in \mathbb{R}^{d \times p}$ with $H_{jk} = \sum_{i=1}^{n}\xi_{ik}\varphi(x_i)_j$. Thus,

$$\mathcal{R}_n(\mathcal{F}_t) \le \frac{c}{n}\,\mathbb{E}_{s,\xi}\Big[\sup_{w \in \mathcal{F}_t}\|w\|_2\|h\|_2\Big] = \frac{c\,(\sup_{w \in \mathcal{F}_t}\|w\|_2)}{n}\,\mathbb{E}_{s,\xi}[\|h\|_2].$$

Here,

$$\mathbb{E}_{s,\xi}[\|h\|_2] = \mathbb{E}_{s,\xi}\sqrt{\sum_{j=1}^{d}\sum_{k=1}^{p}\Big(\sum_{i=1}^{n}\xi_{ik}\varphi(x_i)_j\Big)^2} \le \sqrt{\sum_{j=1}^{d}\sum_{k=1}^{p}\mathbb{E}_{s,\xi}\Big(\sum_{i=1}^{n}\xi_{ik}\varphi(x_i)_j\Big)^2} = \sqrt{\sum_{j=1}^{d}\sum_{k=1}^{p}\mathbb{E}_{s}\sum_{i=1}^{n}(\varphi(x_i)_j)^2} \quad (8)$$

$$= \sqrt{\sum_{k=1}^{p}\sum_{i=1}^{n}\mathbb{E}_{s}\sum_{j=1}^{d}(\varphi(x_i)_j)^2} = \sqrt{\sum_{k=1}^{p}\sum_{i=1}^{n}\mathbb{E}_{s}\|\varphi(x_i)\|_2^2} \le R\sqrt{pn}.$$

Equation 8 holds since $\mathbb{E}_{s,\xi}[(\xi_{ik}\varphi(x_i)_j)\cdot(\xi_{lk}\varphi(x_l)_j)] = \mathbb{E}_{s}[\mathbb{1}\{i = l\}\varphi(x_i)_j\varphi(x_l)_j]$ for all $i, l \in [n]$. Thus,

$$\mathcal{R}_n(\mathcal{F}_t) \le \frac{cR\sqrt{p}\,(\sup_{w \in \mathcal{F}_t}\|w\|_2)}{\sqrt{n}}. \quad (9)$$

Define

$$\zeta_t(s) := \sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2 + \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T} + \sum_{i=\bar{n}+1}^{\bar{d}} (c_i^{(t,0)})^2,$$
where $q_i$, $\sigma_i$, and $c_i^{(t,0)}$ all depend on $s$. With this, we define $\zeta(t) := \sup_{s \in S}\zeta_t(s)$. Then, by combining equation 6, equation 7, and equation 9, with probability at least $1 - \delta$, the following holds:

$$\mathbb{E}_{x,y}[\ell(w_{t,T}, x, y)] \le \frac{1}{n}\sum_{i=1}^{n}\ell(w_{t,T}, x_i, y_i) + \sqrt{\frac{4c^2R^2p\,\zeta(t)}{n}} + M\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Finally, from Lemma 1, for any $t \in \mathbb{N}_0$ and $i \in \{1, \dots, \bar{n}\}$,

$$(c_i^{(t,0)})^2 = \left(\frac{1}{\sigma_i}\Big(\frac{1}{1+(n\lambda/\sigma_i^2)}\Big)^t\tilde{y}_i\right)^2.$$

Since $\frac{1}{1+(n\lambda/\sigma_i^2)} < 1$ (because $n\lambda/\sigma_i^2 > 0$), the value of $\big(\frac{1}{1+(n\lambda/\sigma_i^2)}\big)^{2t}$ strictly decreases as $t$ increases. Since $\frac{1}{\sigma_i^2} > 0$ and $\tilde{y}_i^2 \ge 0$, this implies that $(c_i^{(t,0)})^2$ is strictly decreasing in $t \in \mathbb{N}_0$ unless $c_i^{(t,0)} = 0$. Moreover, from Lemma 1, we have that $w_{t,0} = \sum_{i=1}^{\bar{n}}\alpha_{i,t}\tilde{y}_i u_i + \mathbb{1}\{t = 0\}(I - \tilde{U}\tilde{U}^\top)w_{0,0}$. Since $\{u_1, \dots, u_{\bar{d}}\}$ is an orthonormal basis for $\mathbb{R}^{\bar{d}}$ with inner product $\langle x, y\rangle = y^\top x$, we get $w_{0,0} = \sum_{i=1}^{\bar{d}}(u_i^\top w_{0,0})u_i$. Since $\tilde{U}\tilde{U}^\top w_{0,0} = \sum_{i=1}^{\bar{n}}(u_i^\top w_{0,0})u_i$, we have that

$$(I - \tilde{U}\tilde{U}^\top)w_{0,0} = \sum_{i=1}^{\bar{d}}(u_i^\top w_{0,0})u_i - \sum_{i=1}^{\bar{n}}(u_i^\top w_{0,0})u_i = \sum_{i=\bar{n}+1}^{\bar{d}}(u_i^\top w_{0,0})u_i,$$

which implies that the $u_i$ components of $w_{t,0}$ for $i \in \{\bar{n}+1, \dots, \bar{d}\}$ are only present in $(I - \tilde{U}\tilde{U}^\top)w_{0,0}$. In other words, $w_{t,0} = \sum_{i=1}^{\bar{n}}\alpha_{i,t}\tilde{y}_i u_i + \sum_{i=\bar{n}+1}^{\bar{d}}\mathbb{1}\{t = 0\}(u_i^\top w_{0,0})u_i$. Thus, for any $t \in \mathbb{N}_0$ and $i \in \{\bar{n}+1, \dots, \bar{d}\}$, we have that $(c_i^{(t,0)})^2 = \mathbb{1}\{t = 0\}(u_i^\top w_{0,0})^2$. These facts imply that $\zeta(t)$ is strictly decreasing in $t \in \mathbb{N}_0$ unless $w_{0,0} = 0$.
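The closed form from Lemma 1 can be checked numerically. In the simplified model, one round of self-distillation solves a ridge regression onto the previous model's predictions, i.e. $w_t = (Z^\top Z + n\lambda I)^{-1}Z^\top Z\,w_{t-1}$; the following NumPy sketch (our own toy instantiation, with $Z$, $\lambda$, and the number of rounds chosen arbitrarily) verifies that iterating this recursion reproduces the per-component shrinkage factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, T = 5, 8, 0.1, 4            # n samples, d features (d > n), ridge weight, rounds

Z = rng.standard_normal((n, d))        # feature matrix (rank n almost surely)
w0 = rng.standard_normal(d)            # initial pre-trained weights w_{0,0}

# Multi-round self-distillation: each round fits the previous model's own
# predictions Z @ w under ridge regularization with strength n * lam.
A = Z.T @ Z + n * lam * np.eye(d)
w = w0.copy()
for _ in range(T):
    w = np.linalg.solve(A, Z.T @ (Z @ w))

# Closed form: the component along u_i shrinks by (sigma_i^2 / (sigma_i^2 + n*lam))^T,
# and the component in the null space of Z is removed after the first round.
U, S, Vt = np.linalg.svd(Z.T, full_matrices=True)   # Z.T = U[:, :n] @ diag(S) @ Vt
w_closed = np.zeros(d)
for i in range(n):
    shrink = (S[i] ** 2 / (S[i] ** 2 + n * lam)) ** T
    w_closed += shrink * (U[:, i] @ w0) * U[:, i]

assert np.allclose(w, w_closed)
```

Each factor lies strictly between 0 and 1, which is the mechanism behind the monotone decrease of $\zeta(t)$ above.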

A.2 PROOF OF THEOREM 2

Proof. In this proof, we continue to use the results and the notation from the proof of Theorem 1. By using equation 5 in the proof of Theorem 1, we have that $\|w_{\mathrm{init}} - w_{t,T}\|_2 = \|w_{\mathrm{init}} - v_t\|_2$, where

$$v_t = \sum_{i=1}^{\bar{n}} q_i(1 - e^{-\sigma_i^2 T})u_i + \sum_{i=1}^{\bar{n}} c_i^{(t,0)}e^{-\sigma_i^2 T}u_i + \sum_{i=\bar{n}+1}^{\bar{d}} c_i^{(t,0)}u_i.$$

If $w_{\mathrm{init}} = -\alpha v_t$ for some $\alpha > 0$, then

$$\|w_{\mathrm{init}} - v_t\|_2 = \|v_t + \alpha v_t\|_2 = (1+\alpha)\|v_t\|_2 = \|v_t\|_2 + \|\alpha v_t\|_2 = \sqrt{\sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2 + \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T} + \sum_{i=\bar{n}+1}^{\bar{d}} (c_i^{(t,0)})^2} + \|w_{\mathrm{init}}\|_2.$$

On the other hand, for any $w_{\mathrm{init}} \in \mathbb{R}^{dp}$,

$$\|w_{\mathrm{init}} - v_t\|_2 \le \|v_t\|_2 + \|w_{\mathrm{init}}\|_2 \le \sqrt{\sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2 + \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T} + \sum_{i=\bar{n}+1}^{\bar{d}} (c_i^{(t,0)})^2} + \|w_{\mathrm{init}}\|_2.$$

Thus, setting $\psi(t)$ to be the following function satisfies conditions (1) and (2) in the statement:

$$\psi(t) := \sqrt{\sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2 + \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T} + \sum_{i=\bar{n}+1}^{\bar{d}} (c_i^{(t,0)})^2} + \|w_{\mathrm{init}}\|_2.$$

Finally, from Lemma 1, for any $t \in \mathbb{N}_0$ and $i \in \{1, \dots, \bar{n}\}$,

$$(c_i^{(t,0)})^2 = \left(\frac{1}{\sigma_i}\Big(\frac{1}{1+(n\lambda/\sigma_i^2)}\Big)^t\tilde{y}_i\right)^2,$$

which is strictly decreasing in $t \in \mathbb{N}_0$ unless $c_i^{(t,0)} = 0$ for all $i \in \{1, \dots, \bar{n}\}$, as shown in the proof of Theorem 1. Moreover, from Lemma 1, for any $t \in \mathbb{N}_0$ and $i \in \{\bar{n}+1, \dots, \bar{d}\}$, $(c_i^{(t,0)})^2 = \mathbb{1}\{t = 0\}(u_i^\top w_{0,0})^2$. That is,

$$\psi(t) = \sqrt{G_1 + \psi_1(t) + \sum_{i=\bar{n}+1}^{\bar{d}}\mathbb{1}\{t = 0\}(u_i^\top w_{0,0})^2} + G_2,$$

where $G_1 := \sum_{i=1}^{\bar{n}} q_i^2(1 - e^{-\sigma_i^2 T})^2$ and $G_2 := \|w_{\mathrm{init}}\|_2$ are constants in $t$, and $\psi_1(t) := \sum_{i=1}^{\bar{n}} (c_i^{(t,0)})^2 e^{-2\sigma_i^2 T}$. Since $e^{-2\sigma_i^2 T} > 0$ is a constant in $t$ and each $(c_i^{(t,0)})^2$ is strictly decreasing in $t$ unless $c_i^{(t,0)} = 0$, the quantity $\psi_1(t) + \sum_{i=\bar{n}+1}^{\bar{d}}\mathbb{1}\{t = 0\}(u_i^\top w_{0,0})^2$ is strictly decreasing in $t \in \mathbb{N}_0$ unless $w_{0,0} = 0$. It implies that both $\psi_1(t)$ and $\psi(t)$ are strictly decreasing in $t \in \mathbb{N}_0$ unless $w_{0,0} = 0$.

Remark 1. Note that Theorem 2 also bounds the distance between the weights of the teacher $w_{t-1,T}$ and the initial pre-trained weights $w_{\mathrm{init}}$ for all $t \in \mathbb{N}_+$. Since the teacher at round $t' \in \mathbb{N}_+$ of self-distillation was the student of round $t'-1$, and Theorem 2 holds for all non-negative integers $t \in \mathbb{N}_0$, the distance $\|w_{\mathrm{init}} - w_{t-1,T}\|_2$ between the initial weights and the teacher weights strictly decreases for all $t \in \mathbb{N}_+$.
For instance, at $t = 1$, we obtain the inequality $\|w_{\mathrm{init}} - w_{0,T}\|_2 \le \psi(0)$, where $w_{0,T}$ is the weight of the initial teacher, obtained without self-distillation.
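The shrinkage that drives Theorem 2 is easy to observe numerically: in the simplified model, each self-distillation round multiplies every signal component by $\sigma_i^2/(\sigma_i^2 + n\lambda) \in (0,1)$ and removes the null-space component after the first round, so the norm of the iterate strictly decreases with the round count. A toy NumPy check, under our own arbitrary choice of $Z$ and $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 4, 6, 0.2
Z = rng.standard_normal((n, d))        # feature matrix of the simplified model
A = Z.T @ Z + n * lam * np.eye(d)      # ridge system matrix

# Track the weight norm across self-distillation rounds: each round solves a
# ridge regression onto the previous iterate's own predictions Z @ w.
w = rng.standard_normal(d)
norms = [np.linalg.norm(w)]
for _ in range(6):
    w = np.linalg.solve(A, Z.T @ (Z @ w))
    norms.append(np.linalg.norm(w))

# Every component is scaled by a factor in [0, 1), so the norm strictly shrinks.
assert all(norms[t + 1] < norms[t] for t in range(len(norms) - 1))
```

This mirrors the statement that $\psi(t)$, and hence the distance bound to the initial weights, decreases with each additional round.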

B ADDITIONAL EXPERIMENTS

In this section, we perform additional experiments to better analyze the proposed method, self-distillation for further pre-training.

For language models, reconstructing a masked token $x_k$ is a classification problem over the vocabulary, and its conditional probability given the masked input $\bar{x}$ is parameterized as follows:

$$p_{\theta,\phi}(x_k \mid \bar{x}) = \frac{\exp(u_{x_k})}{\sum_{j=1}^{V}\exp(u_j)}, \quad \text{where } (u_1, \dots, u_V) = g_\phi(h_k) \in \mathbb{R}^V, \quad [h_1 \cdots h_K] = f_\theta(\bar{x}) \in \mathbb{R}^{h \times K}.$$

For ViT, the sequence $x$ consists of image patches, and reconstructing the masked input amounts to predicting the pixel values of each masked patch, which is a regression problem. Thus, we parameterize the conditional probability of a patch $x_k = (x_{k,1}, \dots, x_{k,m}) \in \mathbb{R}^m$ given $\bar{x}$ as follows:

$$p_{\theta,\phi}(x_k \mid \bar{x}) = \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_{k,i} - \mu_{k,i})^2}{2\sigma^2}\right), \quad \text{where } \mu_k = (\mu_{k,1}, \dots, \mu_{k,m}) \in \mathbb{R}^m, \quad [\mu_1 \cdots \mu_K] = f_\theta(\bar{x}) \in \mathbb{R}^{m \times K}.$$

Since $\sigma > 0$ and $\pi$ are constants with respect to $\theta$ and $\phi$, maximizing this log-likelihood is equivalent to minimizing the squared error between the predicted means and the pixel values of the masked patches.
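As a quick numerical illustration of the reduction from the Gaussian likelihood to a squared-error objective (toy patch values of our own choosing, with $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 4, 1.0
x_k = rng.standard_normal(m)    # true pixel values of one masked patch (toy)
mu_k = rng.standard_normal(m)   # predicted pixel values for that patch (toy)

# Negative log-likelihood of the factorized Gaussian parameterization.
nll = 0.5 * np.sum((x_k - mu_k) ** 2) / sigma**2 \
      + 0.5 * m * np.log(2 * np.pi * sigma**2)

# Up to the additive constant (m/2) * log(2*pi*sigma^2) and the scale
# 1/(2*sigma^2), the NLL is exactly the squared reconstruction error.
mse_term = 0.5 * np.sum((x_k - mu_k) ** 2)
assert np.isclose(nll - mse_term, 0.5 * m * np.log(2 * np.pi))
```

Since the constant does not depend on $\theta$ or $\phi$, minimizing the NLL and minimizing the squared error give the same optimizer.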

D DATASET

We describe the statistics of all the image and text classification datasets used in our experiments in Tables 7 and 8.



Figure 1: Accuracy with a varying number of further pre-training steps.

Figure 3: Accuracy with a varying number of training examples.

Figure 4: Test accuracy with a varying number of self-distillation rounds.

Figure 5: (a) Generalization gap: difference between supervised test loss and training loss. (b) Effect of self-distillation on distance: distance between the initial pre-trained weights and the final fine-tuned weights for further pre-training and self-distillation. (c) Effect of multi-round self-distillation: distance between the initial pre-trained weights and the final fine-tuned weights for each round of self-distillation $t \in \mathbb{N}_+$.



Algorithm 1 inputs: unlabeled dataset $D_u$, initialization $\theta_{\mathrm{init}}, \phi_{\mathrm{init}}$, learning rate $\alpha \in \mathbb{R}_{\ge 0}$, masking probability $\gamma \in (0, 1)$, and batch size $B$.

Average and standard deviation of accuracy with 5 runs for image classification datasets.

—                    |       ± 1.13 | 55.55 ± 0.54 | 77.15 ± 0.52 | 67.56 ± 0.52 | 62.53 ± 0.57 | 88.78 ± 0.65
RecAdam              | 70.76 ± 1.25 | 55.22 ± 1.29 | 77.29 ± 1.32 | 67.59 ± 1.03 | 61.65 ± 0.92 | 88.97 ± 0.44
MARS                 | 72.74 ± 0.57 | 55.35 ± 0.73 | 77.28 ± 1.80 | 66.79 ± 0.90 | 62.24 ± 0.96 | 87.93 ± 1.21
R3F                  | 72.95 ± 0.46 | 55.91 ± 0.79 | 76.86 ± 0.97 | 65.32 ± 1.25 | 62.15 ± 0.48 | 88.92 ± 0.78
Further Pre-training | 73.38 ± 0.64 | 55.72 ± 0.46 | 77.79 ± 2.06 | 65.55 ± 1.12 | 62.34 ± 0.39 | 88.63 ± 0.35

Average and standard deviation of F1 score with 5 runs for text classification datasets.

Ablation on CUB and SCIERC.


Table 7: The number of training instances and classes for each image classification dataset.

Table 8: The number of training instances and classes for each text classification dataset.

ACKNOWLEDGMENTS

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), the IITP grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), the IITP grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics), the IITP grant funded by the Korea government (MSIT) (No. 2022-0-00713), the KAIST-NAVER Hypercreative AI Center, and Samsung Electronics (IO201214-08145-01).

APPENDIX A PROOFS

We also define the model output vector $f_t \in \mathbb{R}^{n \times p}$ by $(f_t)_{ij} = f(x_i, w_t)_j$. For example, $f_0$ is the initial teacher label matrix. Let $[n] = \{1, \dots, n\}$. Denote the rank of $[I_p \otimes \Phi]$ by $r = \mathrm{rank}([I_p \otimes \Phi]) \le np$. Define $\tilde{U} = [u_1, u_2, \dots, u_r] \in \mathbb{R}^{dp \times r}$ and $P_r = I - \tilde{U}\tilde{U}^\top$, which is the projection matrix onto the null space of $\tilde{U}^\top$. We first prove the following lemma, which will be used in the proofs of Theorem 1 and Theorem 2 later:

Lemma 1. For any $t \in \mathbb{N}_0$,

$$w_{t,0} = \sum_{i=1}^{r}\alpha_{i,t}\tilde{y}_i u_i + \mathbb{1}\{t = 0\}v, \quad \text{where } \alpha_{i,t} = \frac{1}{\sigma_i}\left(\frac{1}{1+(n\lambda/\sigma_i^2)}\right)^t \text{ and } v = P_r w_{0,0}.$$

Proof of Lemma 1. Define $w_t := w_{t,0}$ for $t \in \mathbb{N}_+$. The necessary condition on the solution $w_t$ at step $t$ is that $\nabla L(w_t) = 0$. Solving $\nabla L(w_t) = 0$ for $w_t$, we have that $w_t = ([I_p \otimes \Phi][I_p \otimes \Phi]^\top + n\lambda I)^{-1}[I_p \otimes \Phi]\,\mathrm{vec}[f_{t-1}]$. By using the singular value decomposition $[I_p \otimes \Phi] = U\Sigma V^\top$, we have that $w_t = U(\Sigma\Sigma^\top + n\lambda I)^{-1}\Sigma V^\top\mathrm{vec}[f_{t-1}] = UBV^\top\mathrm{vec}[f_{t-1}]$, where $B = (\Sigma\Sigma^\top + n\lambda I)^{-1}\Sigma$. Here, we can rewrite the matrix $B \in \mathbb{R}^{dp \times np}$ as

$$B = \begin{bmatrix} \bar{B} \\ 0_{(dp-np) \times np} \end{bmatrix},$$

where $0_{(dp-np) \times np}$ is the $(dp-np)$-by-$np$ matrix with all entries being zero, and $\bar{B} \in \mathbb{R}^{np \times np}$ is the diagonal matrix with entries $\bar{B}_{ii} = \sigma_i(\sigma_i^2 + n\lambda)^{-1}$. Unrolling the recursion over the self-distillation steps, this can be further simplified into the closed form of the lemma.

Down Weighting of Masked Auto-Encoding. Although we fix the weights of both the self-distillation and masked auto-encoding objectives to 1 in equation 3, we vary $\lambda \in (0, 1]$, the weight of the masked auto-encoding objective (i.e., $\lambda L_{\mathrm{MAE}}(\theta, \phi; D_u) + L_{\mathrm{Distill}}(\theta; \theta_0, D_u)$), and report the test accuracy of the Vision Transformer with self-distillation on the CUB dataset. As shown in Table 6, our proposed method is insensitive to the value of $\lambda$, and thus there is no benefit to tuning the weight of the masked auto-encoding objective.

Extra Training Time

To better analyze the extra computational cost of further pre-training and self-distillation, we report the wall-clock training time for fine-tuning, further pre-training, and self-distillation, respectively. We train a Vision Transformer (Dosovitskiy et al., 2021) on the CUB dataset with an RTX 3090 GPU and an Intel(R) Xeon(R) Silver 4210R CPU. Fine-tuning the transformer takes 32 minutes and 18 seconds. Further pre-training requires an extra 1 hour, 29 minutes, and 33 seconds, and self-distillation an extra 5 hours, 13 minutes, and 22 seconds.

C MASKED AUTO-ENCODING

In this section, we describe the masked auto-encoding objective from equation 2 in more detail. Given a sequence $x = (x_1, \dots, x_K)$ of length $K$, we sample a mask $z = (z_1, \dots, z_K)$ whose entries are drawn independently from a Bernoulli distribution with success probability $\gamma \in (0, 1)$. For each $x_k$, we replace it with the special token "mask" if $z_k = 1$; otherwise, we keep $x_k$ unchanged in the masked input. Let $\bar{x} = (\bar{x}_1, \dots, \bar{x}_K)$ be the masked input, and let $f_\theta$ and $g_\phi$ be the encoder and decoder, respectively. We want to compute the log-likelihood of the reconstructed input, $\sum_{k=1}^{K} z_k \log p_{\theta,\phi}(x_k \mid \bar{x})$. For language models, reconstructing a masked token $\bar{x}_k$ means predicting which token was masked out of a pre-defined vocabulary of size $V$, where each token is represented as an integer from $\{1, \dots, V\}$. Thus, the conditional probability of $x_k \in \{1, \dots, V\}$ given $\bar{x}$ is parameterized as a softmax over the $V$ logits produced by the decoder.
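The masking procedure above can be sketched as follows. This is a minimal NumPy version of our own; the vocabulary size, mask-token id, and sequence values are illustrative, and a real tokenizer would supply its own mask token:

```python
import numpy as np

def mask_sequence(x, gamma, mask_token, rng):
    """Independently replace each position with `mask_token` with probability gamma.

    Returns the masked sequence and the 0/1 mask z (z_k = 1 means masked)."""
    z = rng.binomial(1, gamma, size=len(x))       # per-position Bernoulli(gamma) draws
    x_masked = np.where(z == 1, mask_token, x)
    return x_masked, z

rng = np.random.default_rng(0)
V, MASK = 10, -1                                  # toy vocabulary size and mask-token id
x = np.array([3, 7, 1, 9, 2, 5])                  # toy token sequence

x_masked, z = mask_sequence(x, gamma=0.5, mask_token=MASK, rng=rng)

# Masked positions carry the mask token; unmasked positions are unchanged.
assert np.all(x_masked[z == 1] == MASK)
assert np.all(x_masked[z == 0] == x[z == 0])
```

The training loss then scores only the masked positions, weighting each position's log-probability by $z_k$ exactly as in the objective above.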

E HYPERPARAMETERS

In Table 9, we summarize all the hyperparameters for the Vision Transformer and RoBERTa.

