SELF-DISTILLATION FOR FURTHER PRE-TRAINING OF TRANSFORMERS

Abstract

Pre-training large transformer models on massive amounts of unlabeled data and then fine-tuning them on labeled datasets has demonstrated remarkable success across diverse vision and natural language processing tasks. However, direct fine-tuning may result in suboptimal performance when there is a significant discrepancy between the pre-training and fine-tuning domains. To address this issue, previous studies have proposed further pre-training strategies, which continue pre-training the model on the target unlabeled dataset before fine-tuning. However, these strategies are limited to language models and may result in overfitting when applied to Vision Transformers. To overcome this limitation, we present a novel approach that uses self-distillation as a regularization method for the further pre-training stage. Our method first further pre-trains the initial pre-trained model on the target unlabeled data and then uses it as a teacher for self-distillation. We then take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. Our experiments demonstrate the superiority of self-distillation over relevant baselines on various benchmark datasets for image and text classification tasks. Furthermore, we provide a theoretical analysis of the proposed method using a simplified model to shed light on how self-distillation for further pre-training can potentially enhance performance on downstream tasks.
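As a minimal sketch of the student objective described above, the masked auto-encoding loss can be combined with a term that pulls the student's hidden representations toward those of the frozen teacher. The squared-error distance, the weighting coefficient `lam`, and all function names below are illustrative assumptions; the abstract only states that the representations are enforced to be close.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_reconstruction_loss(pred, target, mask):
    """Average per-token MSE over masked tokens only (mask[i] == 1)."""
    masked = [mse(p, t) for p, t, m in zip(pred, target, mask) if m]
    return sum(masked) / len(masked) if masked else 0.0


def distillation_loss(student_hidden, teacher_hidden):
    """Average per-token MSE between student and teacher representations.

    The teacher is treated as fixed: only the student would be updated.
    The squared-error distance is an assumption made for this sketch.
    """
    per_token = [mse(s, t) for s, t in zip(student_hidden, teacher_hidden)]
    return sum(per_token) / len(per_token)


def self_distillation_objective(pred, target, mask,
                                student_hidden, teacher_hidden, lam=1.0):
    """Masked auto-encoding loss plus a weighted distillation term.

    `lam` is a hypothetical weighting coefficient, not a value from the paper.
    """
    return (masked_reconstruction_loss(pred, target, mask)
            + lam * distillation_loss(student_hidden, teacher_hidden))
```

In this sketch the teacher would be the further pre-trained model and the student would start from the same initial pre-trained weights; when the student's representations match the teacher's, the distillation term vanishes and only the masked auto-encoding loss remains.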

1. INTRODUCTION

Pre-trained transformer models (Devlin et al., 2019; Brown et al., 2020; Liu et al., 2019; He et al., 2022) have been effective on various vision and natural language processing tasks. These models learn general representations from a large volume of unlabeled data, so they generalize well to various downstream tasks when fine-tuned on each task with a labeled dataset. However, in many real-world applications, adapting the pre-trained model to a specific downstream task domain requires considerable effort, since there is a significant distributional discrepancy between the data for the pre-training and fine-tuning stages. Moreover, it is difficult to collect a large amount of labeled data for such specific domains, which makes adapting the pre-trained model to downstream tasks even more challenging. Several works have proposed to tackle the problem of adapting pre-trained models to a specific domain. A prevalent approach is further pre-training, where we continue to update the parameters of the pre-trained model on additionally curated domain-specific unlabeled data with self-supervision (Beltagy et al., 2019; Lee et al., 2020), before fine-tuning it on the target labeled data, as depicted in Figure 2b. Gururangan et al. (2020) also show that further pre-training with only the target unlabeled data, without any extra data, is still effective. However, most existing further pre-training approaches have focused on language models, and we find that the further pre-training



Figure 1: Accuracy with a varying number of further pre-training steps.

