

Abstract

Large pretrained foundation models (such as CLIP) are among the most significant recent advances in the AI community, and their implications are profound. This paper examines the value of these foundation models as a model knowledge base: we aim to distill the knowledge in these foundation models to train lightweight models designed for specific tasks in practical application scenarios with improved performance. Despite abundant progress in knowledge distillation (KD) for traditional models trained under the supervision of class labels encoded as integers, distilling such text-image contrastive learning models has not been explored extensively. Meanwhile, KD is known to suffer from the capacity-gap problem (i.e., distilling knowledge from a teacher significantly larger than a student often degrades the student's performance). The teacher-student capacity gap in distilling foundation models is even larger, and how to overcome this potential issue remains unclear. This paper presents detailed analyses of these questions, aiming to successfully tap into a pretrained foundation model (CLIP) to boost the student's performance. Beyond the practical performance benefits, several interesting discoveries are unveiled: (1) CLIP does not suffer from the capacity gap, which may prompt a re-evaluation of whether the "capacity-gap" issue is really caused by the capacity gap; (2) we find the reason is largely that CLIP is not over-confident in the wrong labels when it misclassifies input images.

1. INTRODUCTION

Large, pretrained foundation models (e.g., CLIP (Radford et al., 2021), DALL-E 2 (Ramesh et al., 2022), and GPT-3 (Brown et al., 2020)) are capable of many complex tasks, such as zero-shot prediction (the ability to predict, at test time, the classes to which input samples belong without previous exposure to samples from those classes during training), generating images from text prompts, generating images inspired by originals, translation, reading comprehension, etc. However, the scales of these models, i.e., the numbers of parameters they contain, are so large that it is difficult to deploy them on devices with limited computing power such as mobile phones, tablets, and laptops. In addition, even though these foundation models are versatile, demonstrating great competence in many tasks considered challenging for regular neural networks, in some situations we may only need a part of, or even a derivative of, their functions. These facts indicate that deploying a full foundation model in all use cases could waste computational resources and memory, and could even be impractical in some situations. Therefore, techniques that compress such foundation models, or enable the utilization of a portion of their functions, are valuable and necessary.

To use a portion or derivatives of the functions of these huge, pretrained foundation models, one promising mechanism is to transfer knowledge from the foundation models to lightweight, task-specific models. Hinton et al. (2014) proposed a knowledge distillation (KD) algorithm that improves the task-specific performance of a smaller model (the student network) by transferring knowledge to it from a larger model with better task-specific performance (the teacher network).
The KD algorithm proposed by Hinton et al. (2014) (HKD) minimizes both the Kullback-Leibler (KL) divergence between the outputs of the teacher network and the student network and the cross-entropy loss between the student's outputs and the class labels. However, given the differences in network architecture and pretraining method between foundation models and conventional models, applying HKD directly to foundation models may not be an effective way to exploit the knowledge within foundation models to benefit lightweight models designed for specific tasks. First, the teacher network, in this case a foundation model, is not pretrained to optimize its performance on the particular task the student network is designed for. Moreover, the intrinsic properties of the dataset used for pretraining foundation models could differ from those of the dataset adopted for a specific task. In addition, Cho & Hariharan (2019) and Mirzadeh et al. (2020) argue that the "capacity gap" between a teacher and a student is the major factor preventing a student network's performance from improving further as the teacher network grows in parameters and task-specific performance. When a foundation model, which contains considerably more parameters than conventional models, is adopted as the teacher for knowledge distillation, this problem could become even more severe. In this paper, we focus on the image classification task, exploring and investigating knowledge distillation-related properties of the pretrained foundation model CLIP (Radford et al., 2021) under various experimental settings.
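The HKD objective above can be sketched in a few lines; the following is a minimal pure-Python illustration, assuming the common formulation with a temperature T, a weight alpha balancing the two terms, and the usual T² scaling on the KL term (these hyperparameter choices are ours, not prescribed by this paper):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hkd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.9):
    """Hinton-style KD loss for a single sample:
    alpha * T^2 * KL(teacher || student)  (softened distributions)
    + (1 - alpha) * cross-entropy(student, ground-truth label)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[label])  # standard CE at T = 1
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

When student and teacher logits coincide, the KL term vanishes and only the label term remains, which is why a poor teacher distribution (e.g., one over-confident in a wrong class) directly pulls the student away from the ground truth.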
Our contributions are:
• We observe that naively distilling knowledge from CLIP (Radford et al., 2021) to student networks does not lead to satisfactory results: such student networks do not outperform those distilled from more commonly adopted teacher networks (e.g., ResNet-34/50 (He et al., 2016)). We therefore propose to fine-tune CLIP on image classification before knowledge distillation, improving the teacher network's accuracy, which upper-bounds the accuracy of the student network.
• We find that distilling from CLIP is not vulnerable to the "capacity gap" issue, even when the teacher network has more than a thousand times as many parameters as the student network. Moreover, when only limited training samples are available, the advantage of CLIP in knowledge distillation increases. Our experimental results suggest the reason may well lie in the training recipe of CLIP rather than its network architecture. Our further quantitative analysis of teacher-network outputs reveals that image classification models trained with the cross-entropy criterion are more likely to give a high score to a wrong label upon misclassification, which can later mislead the student network during knowledge distillation. By contrast, models trained under the CLIP paradigm are less likely to assign relatively high scores to wrong labels. This can have a profound impact on the understanding of the capacity-gap issue.
• Based on these findings, we let our fine-tuned CLIP supervise the training of the lightweight model MobileNetV3 (Howard et al., 2019), a network designed for CPU deployment. The achieved performance is notably higher than that of the same network trained from scratch or under the supervision of regular teacher networks.
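The quantitative analysis of wrong-label confidence in the second contribution can be illustrated with a simple statistic. The helper below is a hypothetical sketch: the function name and the specific choice of averaging the top-1 probability over misclassified samples are our illustration, not necessarily the exact metric used in the paper.

```python
def wrong_label_confidence(prob_rows, labels):
    """Average top-1 probability on misclassified samples, i.e., how
    confidently a teacher backs the wrong class when it errs.
    prob_rows: list of per-class probability lists (one row per sample).
    labels: list of ground-truth class indices."""
    confidences = []
    for probs, y in zip(prob_rows, labels):
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != y:  # only misclassified samples contribute
            confidences.append(probs[pred])
    return sum(confidences) / len(confidences) if confidences else 0.0
```

Under this statistic, a cross-entropy-trained teacher that errs with probability 0.9 on the wrong class transfers a far more misleading target distribution to the student than a CLIP-style teacher that errs with probability 0.4.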

2. RELATED WORK

Knowledge distillation. Buciluǎ et al. (2006) and Hinton et al. (2014) proposed to improve the performance of lightweight models on particular tasks (e.g., image classification, speech recognition) by forcing such models (the students) to mimic cumbersome, over-parameterized models (the teachers) at the output level. Romero et al. (2015) followed this notion and proposed to maximize the similarity between the student and the teacher with respect to the feature maps of hidden layers. Tian et al. (2020) proposed a contrastive learning objective that allows a student network to capture more of the important information in the data representation produced by a teacher network. In other knowledge distillation works, the knowledge to be transferred from a teacher to a student is defined as the associations among input samples (Park et al., 2019; Tung & Mori, 2019), the probabilistic distributions of features (Passalis & Tefas, 2018), etc.

Capacity gap issues. Intuitively, with the supervision of a more complex teacher network comprising more parameters, the student network should be trained to perform better. In reality, however, the performance of a student network cannot be enhanced indefinitely or become arbitrarily close to that of its teacher network. Cho & Hariharan (2019) pointed out that larger models may not correspond to better-performing student networks, which they attributed to "mismatched capacity". They proposed to adopt early-stopped knowledge distillation (Cho

