

Abstract

Large pretrained foundation models (such as CLIP) are among the most significant recent advances in the AI community, and their implications are profound. This paper examines the value of these foundation models as a model knowledge base: we aim to distill the knowledge in these foundation models to train lightweight models designed for specific tasks in practical application scenarios with improved performance. Despite abundant progress in knowledge distillation (KD) for traditional models trained under the supervision of class labels encoded as integers, distilling such text-image contrastive learning models has not been explored extensively. Meanwhile, KD is known to suffer from the capacity-gap problem (i.e., distilling knowledge from a teacher significantly larger than a student often degrades the student's performance). The teacher-student capacity gap when distilling foundation models is even larger, so how to overcome this potential issue also remains unclear. This paper presents detailed analyses of these questions, aiming to successfully tap into a pretrained foundation model (CLIP) to boost the student's performance. Beyond the practical performance benefits, several interesting discoveries are unveiled: (1) CLIP is not hampered by the capacity gap, which may lead us to re-evaluate whether the "capacity-gap" issue really stems from the gap in capacity; (2) we find this is largely because CLIP is not over-confident in the wrong labels when it misclassifies input image samples.

1. INTRODUCTION

Large pretrained foundation models (e.g., CLIP (Radford et al., 2021), DALL-E 2 (Ramesh et al., 2022), and GPT-3 (Brown et al., 2020)) are capable of many complex tasks, such as zero-shot prediction (predicting, at test time, the classes to which input samples belong without exposure to samples from those classes during training), generating images from text prompts, generating images inspired by existing ones, translation, and reading comprehension. However, these models contain so many parameters that deploying them on devices with limited computing power, such as mobile phones, tablets, and laptops, would be difficult. In addition, even though these foundation models are versatile and demonstrate great competence in many tasks considered challenging for regular neural networks, in some situations we may need only a portion, or even a derivative, of their functions. Deploying a full foundation model in all use cases could therefore waste computational resources and memory, and may even be impractical in some situations. The study of techniques to compress such foundation models, or to enable the use of a portion of their functions, is thus valuable and necessary.

To use a portion or derivatives of the functions of these huge pretrained foundation models, one promising mechanism is to transfer knowledge from the foundation models to lightweight, task-specific models. In (Hinton et al., 2014), a knowledge distillation (KD) algorithm is proposed that improves the task-specific performance of a smaller model (the student network) by transferring knowledge to it from a larger model with better task-specific performance (the teacher network).
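The knowledge-transfer objective of Hinton et al. (2014) can be sketched as follows. This is a minimal numpy illustration, not an implementation from this paper; the temperature `T` and mixing weight `alpha` are illustrative defaults, and the `T**2` scaling follows the common convention of rescaling the soft-label gradient.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax: higher T yields softer probabilities.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hkd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation objective (illustrative hyperparameters):
    alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, labels)."""
    p_t = softmax(teacher_logits, T)   # softened teacher distribution
    p_s = softmax(student_logits, T)   # softened student distribution
    # KL divergence between softened teacher and student outputs
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1).mean()
    # Standard cross-entropy of the student against the integer class labels
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce
```

When teacher and student logits agree, the KL term vanishes and only the (weighted) cross-entropy with the hard labels remains, which is the sanity check one would run first on any such implementation.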
The KD algorithm proposed by Hinton et al. (2014) (HKD) minimizes both the Kullback-Leibler (KL) divergence between the outputs of the teacher network and the student network and the cross-entropy loss between the student network's outputs and the class labels. However, given the differences in network architectures and pretraining methods between foundation models and conventional models, applying HKD directly to foundation models

