

Abstract

Large pretrained foundation models (such as CLIP) are among the most significant recent advances in the AI community, and their implications are profound. This paper examines the value of these foundation models as a model knowledge base: we aim to distill the knowledge in these foundation models to train lightweight models designed for specific tasks in practical application scenarios, with improved performance. Despite abundant progress in knowledge distillation (KD) for traditional models trained under the supervision of integer-encoded class labels, distilling such text-image contrastive learning models has not been explored extensively. Meanwhile, KD is well known to suffer from the capacity gap problem (i.e., distilling knowledge from a teacher significantly larger than the student often degrades the performance of the student). The teacher-student capacity gap in distilling foundation models is even larger, and how to overcome this potential issue remains unclear. This paper presents detailed analyses of these questions, aiming to successfully tap into a pretrained foundation model (CLIP) to boost the student's performance. Besides the practical performance benefits, several interesting discoveries are unveiled: (1) CLIP is not bothered by the capacity gap, which may prompt us to re-evaluate whether the "capacity gap" issue is really due to the capacity gap; (2) we find that the reason is largely that CLIP is not over-confident on the wrong labels when it misclassifies input image samples.

1. INTRODUCTION

Large pretrained foundation models (e.g., CLIP (Radford et al., 2021), DALL-E 2 (Ramesh et al., 2022) and GPT-3 (Brown et al., 2020)) are capable of many complex tasks, such as zero-shot prediction (the ability of a model to predict the classes of input samples at test time without previous exposure to samples from those classes during training), generating images from text prompts, generating images inspired by their originals, translation, and reading comprehension. However, the scale of these models, that is, the number of parameters they contain, is so large that it is difficult to deploy them on devices with limited computing power such as mobile phones, tablets, and laptops. In addition, even though these foundation models are versatile and demonstrate great competence in many tasks that are considered challenging for regular neural networks, in some situations we may only need part of their functions, or even a derivative of them, rather than everything they are capable of. These facts indicate that deploying a full foundation model in all use cases could waste computational resources and memory, and could even be impractical in some situations. Therefore, techniques that compress such foundation models, or that enable the use of a portion of their functions, are valuable and necessary. To use a portion or derivatives of the functions of these huge pretrained foundation models, one promising mechanism is to transfer the knowledge from the foundation models to lightweight, task-specific models. Hinton et al. (2014) proposed a knowledge distillation (KD) algorithm that improves the task-specific performance of a smaller model (the student network) by transferring to it knowledge from a larger model with better task-specific performance (the teacher network).
The KD algorithm proposed by Hinton et al. (2014) (HKD) minimizes both the Kullback-Leibler divergence (KL divergence) loss between the outputs of the teacher network and the student network and the cross-entropy loss between the student network output and the class labels. However, given the differences in network architecture and pretraining method between foundation models and conventional models, applying HKD directly to foundation models may not be an effective way to exploit the knowledge within them for the benefit of lightweight models designed for specific tasks. Firstly, the teacher network, in this case a foundation model, is not pretrained to optimize its performance on the particular task that the student network is designed for. Moreover, the intrinsic properties of the dataset used for pretraining foundation models could differ from those of the dataset we adopt for a specific task. In addition, in (Cho & Hariharan, 2019; Mirzadeh et al., 2020), the existence of a "capacity gap" between a teacher and a student is believed to be the major factor that prevents the performance of a student network from improving further when the teacher network contains more parameters and has better task-specific performance. When a foundation model, which contains considerably more parameters than conventional models, is adopted as the teacher network for knowledge distillation, this problem could become even more severe. In this paper, we focus on the image classification task, exploring and investigating the knowledge distillation-related properties of the pretrained foundation model CLIP (Radford et al., 2021) under various experimental settings.
Our contributions are:
• We notice that naively distilling knowledge from CLIP (Radford et al., 2021) to student networks does not lead to satisfactory results: such student networks do not outperform those distilled from more commonly adopted teacher networks (e.g., ResNet 34, 50 (He et al., 2016)). We hence propose finetuning CLIP to improve the accuracy of the teacher network on image classification before knowledge distillation. This accuracy is the upper bound of that of the student network.
• We find that distilling from CLIP is not vulnerable to the "capacity gap" issue, even when the teacher network has more than a thousand times as many parameters as the student network. Moreover, when only limited training samples are available, the superiority of CLIP in knowledge distillation increases. Our experimental results suggest that the reason may well be related to the training recipe of CLIP rather than the network architecture. Our further quantitative analysis of the output of teacher networks reveals that image classification models trained with the cross-entropy criterion are more likely to give a high score to a wrong label on misclassification, which can later mislead the student network in knowledge distillation. In contrast, models trained under the CLIP paradigm are less likely to give a relatively high score to wrong labels. This can have a profound impact on the understanding of the capacity gap issue.
• Based on these findings, we use our finetuned CLIP to supervise the training of the lightweight model MobileNetV3 (Howard et al., 2019), a network designed for CPU deployments. The achieved performance is notably higher than that of models trained from scratch or under the supervision of regular networks.
Foundation models.
Foundation models are models of vast scale trained on a large amount of data, such that they are competent in various downstream tasks (Bommasani et al., 2021). In (Brown et al., 2020), GPT-3, a model with 175 billion parameters, demonstrated prominent ability in reading comprehension, commonsense reasoning, translation, etc. The "Bidirectional Encoder Representations from Transformers", or BERT (Kenton & Toutanova, 2019), a language model pretrained on unlabeled text, can be adapted to a variety of natural language-related tasks (e.g., language inference, question answering) without major architectural modifications for specific tasks. Wang et al. (2022) proposed multimodal adaptive distillation to improve the performance of unimodal encoders in vision-language tasks (e.g., visual commonsense reasoning, visual question answering, visual entailment, etc.).

3.1. PREREQUISITES: KNOWLEDGE DISTILLATION

In this paper, we adopt the knowledge distillation proposed by Hinton et al. (2014) as the technique to transfer knowledge from CLIP to lightweight models, and we refer to this algorithm as Hinton knowledge distillation (HKD). The complete objective of HKD is a linear combination of two sub-objectives: $L_{\mathrm{HKD}} = \alpha L_{\mathrm{KLDiv}} + \beta L_{\mathrm{CE}}$, where $\alpha$ and $\beta$ are adjustable hyper-parameters weighting the Kullback-Leibler (KL) divergence loss $L_{\mathrm{KLDiv}}$ and the cross-entropy loss $L_{\mathrm{CE}}$ respectively. The cross-entropy loss asks a student network to learn from the hard labels of datasets: $L_{\mathrm{CE}} = H(y, y^{(s)})$, where $y$ denotes the class labels encoded as integers and $y^{(s)}$ denotes the output of a student network. The KL divergence loss asks the student to mimic the teacher on the output level. In the calculation of the KL divergence loss, a hyperparameter named distillation temperature $\tau$ is introduced to soften the output of both the teacher and the student, allowing the probability distribution of the teacher output to be more informative: $L_{\mathrm{KLDiv}} = \tau^2 \, \mathrm{KL}(\sigma(y^{(t)}/\tau) \,\|\, \sigma(y^{(s)}/\tau))$, where $y^{(t)}$ denotes the teacher output and $\sigma$ denotes the softmax function.
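As an illustration of the objective above, the following is a minimal sketch for a single sample, written with the Python standard library only. The function names are hypothetical and are not taken from the authors' code; the default values of `alpha`, `beta` and `tau` follow the settings reported later in the hyperparameter description, and the example logits are made up.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of raw scores."""
    scaled = [v / tau for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, student_logits):
    """L_CE = H(y, y_s) for one integer-encoded label."""
    return -math.log(softmax(student_logits)[label])

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def hkd_loss(label, teacher_logits, student_logits, alpha=0.5, beta=0.5, tau=1.0):
    """L_HKD = alpha * L_KLDiv + beta * L_CE for a single sample."""
    l_kl = tau ** 2 * kl_div(softmax(teacher_logits, tau),
                             softmax(student_logits, tau))
    l_ce = cross_entropy(label, student_logits)
    return alpha * l_kl + beta * l_ce

# Made-up logits for a three-class problem.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 0.5, 0.2]
print(hkd_loss(0, teacher, student))
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy term remains.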

3.2. FINETUNING CLIP

CLIP (Radford et al., 2021) stands for contrastive language-image pretraining. The two major components of the model are an image encoder and a text encoder. The inputs of CLIP are text-image pairs, and pretraining enables the image encoder and the text encoder to generate adequate representations of the input images and text respectively. In addition, the representation of an image is trained to match the corresponding text representation by maximizing the cosine similarity between positive pairs while minimizing that between negative pairs. That is, let $I_{1,\dots,i}$ be the normalized image features and $T_{1,\dots,j}$ be the normalized text features. The objective of pretraining is to maximize $I_i^{\top} T_j$ for $i = j$ and minimize $I_i^{\top} T_j$ for $i \neq j$. The finetuning of CLIP aims at optimizing the system performance on specific tasks and datasets, and consists of two processes: the improvement of text prompts and the refinement of the model output. Figure 1 illustrates the system architecture in which CLIP (Radford et al., 2021) is embedded.
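To make the pairwise objective concrete, the sketch below computes the cosine-similarity matrix between a batch of image and text features and a symmetric cross-entropy over its rows and columns, which is the general form of loss described by Radford et al. (2021). The function names, the fixed `scale` value (CLIP learns its temperature) and the toy feature vectors are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_feats, text_feats):
    """Entry [i][j] is the cosine similarity between image i and text j."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i_vec, t_vec)) for t_vec in txts]
            for i_vec in imgs]

def log_softmax_at(row, idx, scale):
    """Log-probability of entry idx after a scaled softmax over the row."""
    scaled = [scale * v for v in row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return scaled[idx] - log_z

def symmetric_contrastive_loss(sim, scale=10.0):
    """Average of image-to-text and text-to-image cross-entropy.

    The positive pair for row i (and column i) is the diagonal entry.
    `scale` is a fixed, assumed value standing in for a learned temperature.
    """
    n = len(sim)
    img_to_txt = -sum(log_softmax_at(sim[i], i, scale) for i in range(n)) / n
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax_at(cols[j], j, scale) for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

# Toy features: image k is paired with text k.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8]]
sim = similarity_matrix(images, texts)
aligned_loss = symmetric_contrastive_loss(sim)
swapped_loss = symmetric_contrastive_loss(similarity_matrix(images, texts[::-1]))
print(aligned_loss, swapped_loss)
```

Swapping the two texts breaks the pairing, so the loss on the swapped batch is larger than on the aligned one.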

3.2.1. TEXT PROMPTS IMPROVEMENT

For the image classification task, the text input of CLIP is usually a class name embedded in a sentence or phrase (e.g., this is a photo of a {class}); for the CIFAR-10 dataset (Krizhevsky et al., 2009), the class names could be automobile, airplane, horse, etc. However, for larger datasets with more classes, such as ImageNet (Deng et al., 2009a), some class names become ambiguous due to polysemy while others are proper nouns. Our approach is to provide extra descriptions for certain labels or to specify their parent class. For instance, we replace the label Model T, which refers to a type of motor vehicle manufactured by Ford, with Model T, automobile, car, and substitute Saint Bernard with Saint Bernard, a type of dog, according to what the labels in the dataset actually refer to.
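The label rewriting described above can be sketched as a lookup table applied before the class name is inserted into the prompt template. The table and function names are hypothetical; the two override entries and the template are the examples given in the text, and a real table would cover every ambiguous label in the dataset.

```python
# Prompt template quoted in the text.
TEMPLATE = "this is a photo of a {}"

# Hypothetical override table; the two entries mirror the examples in the text.
LABEL_OVERRIDES = {
    "Model T": "Model T, automobile, car",
    "Saint Bernard": "Saint Bernard, a type of dog",
}

def build_prompts(class_names, overrides=LABEL_OVERRIDES, template=TEMPLATE):
    """Return one text prompt per class, using the enriched label if one exists."""
    return [template.format(overrides.get(name, name)) for name in class_names]

prompts = build_prompts(["horse", "Model T", "Saint Bernard"])
for prompt in prompts:
    print(prompt)
```

Labels without an entry in the table pass through unchanged, so the same function serves datasets such as CIFAR-10 where no disambiguation is needed.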

3.2.2. OUTPUT REFINEMENT

To improve the performance of CLIP on a specific task and dataset, the output of CLIP is refined. The procedure for refining the output of CLIP is: (1) appending to the CLIP model an extra multilayer perceptron (MLP) fitted to a particular image classification dataset; (2) freezing all the parameters in CLIP; (3) optimizing the parameters in the MLP on the dataset under the supervision of integer-encoded labels, using the conventional cross-entropy criterion on the image classification task. This process can be formulated mathematically as follows. Let $x$ and $t$ be the image and text input to CLIP respectively, and $w_{\mathrm{CLIP}}$ be the parameters in CLIP. Then we denote the output of CLIP as:
$$y = f(w_{\mathrm{CLIP}}, x, t). \tag{4}$$
The contrastive output of CLIP $y$, in which an image embedding is matched to its corresponding text embedding, is passed to the input layer of the adjoining MLP. Let $w_{\mathrm{MLP}}$ be the parameters in the MLP; its output can then be written as:
$$z = h(w_{\mathrm{MLP}}, y). \tag{5}$$
Let $z'$ represent the ground-truth vector of labels encoded as integers. The objective function of output refinement can then be expressed as $L_{\mathrm{refine}} = H(z', z)$, which is minimized with respect to $w_{\mathrm{MLP}}$ through training.
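The refinement procedure can be sketched as follows. This is a deliberately simplified illustration: a single linear layer stands in for the MLP, the frozen CLIP outputs are made-up vectors rather than real model outputs, and all names and the learning rate are assumptions. The point it shows is that only the head's parameters are updated while the CLIP outputs stay fixed.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_forward(weights, bias, y):
    """z = h(w_MLP, y): a single linear layer standing in for the MLP."""
    return [sum(w * v for w, v in zip(row, y)) + b
            for row, b in zip(weights, bias)]

def refine_step(weights, bias, y, label, lr=0.1):
    """One SGD step on L_refine = H(z', z); only the head's parameters change."""
    probs = softmax(head_forward(weights, bias, y))
    loss = -math.log(probs[label])
    for c in range(len(weights)):
        grad = probs[c] - (1.0 if c == label else 0.0)  # d loss / d z_c
        for k in range(len(y)):
            weights[c][k] -= lr * grad * y[k]
        bias[c] -= lr * grad
    return loss

# Made-up frozen CLIP outputs y = f(w_CLIP, x, t) for four images, three classes.
frozen_outputs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 0]

weights = [[0.0] * 3 for _ in range(3)]
bias = [0.0] * 3
losses = []
for epoch in range(50):
    total = sum(refine_step(weights, bias, y, t)
                for y, t in zip(frozen_outputs, labels))
    losses.append(total / len(labels))
print(losses[0], losses[-1])
```

Because CLIP is frozen, its outputs could be computed once and cached, which is why the sketch treats them as constants.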

3.3. EXPLORING THE CAPACITY GAP

Capacity gap, or mismatched capacity (Cho & Hariharan, 2019), is considered to be a significant factor behind the phenomenon that a teacher with higher capability does not necessarily further enhance the performance of a given student network in knowledge distillation. In this part, we propose approaches to examine the capacity gap resistance of CLIP, together with a metric that helps explain why distilling from CLIP is not bothered by the capacity gap.

3.3.1. EXAMINE CAPACITY GAP RESISTANCE PROPERTY

Baseline comparison. Under this setting, a common, relatively small-scale convolutional neural network (CNN) is selected as the student, while regular CNNs and the finetuned CLIP model are selected as the candidate teacher networks. We compare the accuracy of student networks that have identical structures but are distilled from different teacher networks. Regular CNNs of different scales are included to demonstrate the negative impact of the capacity gap.

Reduced student network width. Reducing the width of a student network (CNN) means decreasing the number of filters in each of its convolution layers, which reduces the number of parameters in the student network. With the candidate teachers unchanged, the difference in parameter number, or the gap in capacity, within teacher-student pairs is enlarged, and hence the capacity gap resistance of the teacher networks can be further examined.

Low-shot classification. In this case, the students and teachers are both exposed to a limited quantity of training images. Specifically, given a dataset $D$, the train set of $D$ is denoted as $D_{\mathrm{train}}$. For each class in $D$, $k$ pictures in $D_{\mathrm{train}}$ are selected to form the train set for low-shot classification, while the test set of $D$, denoted as $D_{\mathrm{test}}$, is adopted directly without any modification. This setting investigates the influence of the capacity gap when few training samples are available.
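The construction of the low-shot train set can be sketched as below. The text only states that k pictures per class are selected, so uniform random sampling with a fixed seed is an assumption here, as are the function name and the toy file names.

```python
import random
from collections import defaultdict

def low_shot_subset(train_samples, k, seed=0):
    """Pick k training samples per class.

    train_samples is a list of (item, label) pairs. Uniform sampling without
    replacement is an assumption; the selection rule is not specified.
    """
    by_class = defaultdict(list)
    for item, label in train_samples:
        by_class[label].append(item)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < k:
            raise ValueError(f"class {label} has only {len(items)} samples, need {k}")
        subset.extend((item, label) for item in rng.sample(items, k))
    return subset

# Toy train set: 3 classes with 10 images each. The test set is left untouched.
train = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(10)]
subset = low_shot_subset(train, k=4)
print(len(subset))
```

Using the same subset for the teacher and the student matches the setting in which both are exposed to the same limited training data.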

3.3.2. CAPACITY GAP RESISTANCE RELATED METRIC

In knowledge distillation, the medium through which knowledge is transferred from the teacher to the student is their output. The resistance to the capacity gap should therefore be related to one or more quantifiable features of the output of the teacher networks. In the image classification task, the model output corresponding to an input image is an $n$-dimensional vector $y^o$, where $n$ is the total number of classes in the given dataset. A well-trained model assigns the highest score to the element of $y^o$ matching the class to which the input image belongs; otherwise, the input image is deemed to be misclassified. That is, for label $\in \{0, \dots, i, \dots, n-1\}$ and $y^o = [y^o_0, \dots, y^o_i, \dots, y^o_{n-1}]$, where $i \in [0, n-1]$, an input image with label $i$ is considered to be correctly classified if and only if $y^o_i$ is the maximum element in $y^o$. We believe that when the teacher misclassifies an input image, meaning the highest score is assigned to an element of $y^o$ that does not match the label of the input image, the student can be misguided if that score is relatively high (over-confidence). Therefore, we propose a probabilistic metric to evaluate this phenomenon in the teacher network output:
$$p = \frac{N_{\mathrm{err\&oc}}}{N_{\mathrm{err}}}.$$

1 ) in neural networks: fully connected layers, batch normalization layers, dropout layers and we use ReLU as the activation function.

Hyperparameter settings. We adopted part of the settings in (Matsubara, 2021), in which a slightly higher student accuracy was reported (71.37%) compared to that reported in (Hinton et al., 2014) (70.66%). The hyperparameter configuration was modified to suit our hardware. The number of training epochs for pretraining or finetuning teacher networks and for knowledge distillation is set to 100, and the batch size is 128.
The initial learning rate is set to 0.1, with a multi-step decay schedule that multiplies the learning rate by 0.1 at epochs 60 and 90. We use the stochastic gradient descent optimizer with a momentum of 0.9 and a weight decay of 1e-4. In knowledge distillation experiments, we set the distillation temperature τ to 1, indicating that no label softening is applied. The cross-entropy loss (weighted by β) between the student output and the integer-encoded class labels and the KL divergence loss (weighted by α) between the outputs of the teacher and the student contribute equally to the total loss in knowledge distillation. That is, α = β = 0.5.
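The settings above can be collected in a small configuration, with the multi-step schedule written out explicitly. The dictionary keys and the function name are hypothetical, and the convention that the decay takes effect from epoch 60 and epoch 90 onward with zero-based epoch indices is an assumption.

```python
# Values taken from the hyperparameter description; key names are hypothetical.
CONFIG = {
    "epochs": 100,
    "batch_size": 128,
    "base_lr": 0.1,
    "milestones": (60, 90),
    "lr_decay": 0.1,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "tau": 1.0,
    "alpha": 0.5,
    "beta": 0.5,
}

def learning_rate(epoch, base_lr=CONFIG["base_lr"],
                  milestones=CONFIG["milestones"], decay=CONFIG["lr_decay"]):
    """Multi-step schedule: multiply by `decay` once for every milestone reached.

    Assumes zero-based epochs and that a milestone applies from that epoch on.
    """
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** reached

for epoch in (0, 59, 60, 89, 90, 99):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the learning rate is 0.1 for the first 60 epochs, about 0.01 for the next 30, and about 0.001 for the final 10.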

4.2. BASELINE PERFORMANCE COMPARISON

We adopt ResNet 18 (He et al., 2016) to be the student network, ResNet 34, 101, CLIP (Radford et al., 2021) without being finetuned (Raw CLIP) and finetuned CLIP to be the candidate teacher

We further examine the observation from Section 4.2 that, although finetuned CLIP has the largest number of parameters among all candidate teacher networks, knowledge distillation from it shows no sign of being influenced by the capacity gap. We enlarge the parameter number difference in a teacher-student pair by fixing the candidate teacher networks while reducing the number of filters in the convolution layers of the residual blocks in ResNet 18 (He et al., 2016), which is the student network in our work. In our experiment, the filter number in the student network is reduced to 1/8 of that in the original structure. Results in Table 2 show that even when the parameter number gap between finetuned CLIP and the student network reaches around 1.5 thousand times, the student network trained under its supervision maintains the highest accuracy among all students in the different teacher-student pairs. In contrast, in the ResNet 101-ResNet 18 pair, even though the gap in parameter number is smaller than that in the finetuned CLIP-ResNet 18 pair, the student accuracy falls even lower than that of the student trained from scratch. This can be viewed as a signal that the capacity gap has a detrimental impact on knowledge distillation in the ResNet 101-ResNet 18 pair.

of CLIP, the above-mentioned four models are assigned to be the candidate teacher networks and the ResNet 18 with reduced width is adopted as the student network in the following knowledge distillation experiment (see Table 4).
Compared to a model that adopts the pretraining method of CLIP, a model trained under the regular cross-entropy paradigm is more likely to give the wrong label a high score when it misclassifies a sample, and we deem that this over-confidence misguides a student network in knowledge distillation.

4.6. EXTRA KNOWLEDGE DISTILLATION EXPERIMENTS

We extend the superior performance of CLIP in knowledge distillation to supervise the training of MobileNetV3 (Howard et al., 2019) and perform a series of knowledge distillation experiments on our imagenet100 dataset. See Table 5.

Table 5: Accuracy (%) comparison on test set of imagenet100 among different teacher-student pairs. MobileNetV3-L (Howard et al., 2019) is adopted as the student network. "Params" stands for the number of parameters in models and it is calculated in millions. "Param Gap" denotes the gap in parameter number between the teacher and student, which is measured in the number of times. "Scratch/None" stands for training from scratch without the supervision of teacher networks.

5. CONCLUSION

In this paper, we have extensively examined the robustness of CLIP to capacity gap issues in knowledge distillation under different experimental settings (extra small student network, few available training samples). We have demonstrated that the pretraining method of CLIP allows the model to overcome capacity gap issues because the model is less likely to be over-confident on the wrong class label when it misclassifies an input sample, a behavior that could mislead a student network during knowledge distillation. This encouraging result suggests that the knowledge within CLIP could be further exploited through knowledge distillation to benefit even smaller networks designed for deployment on devices with limited computational resources, such as mobile phones, or networks designed for tasks other than image classification.

ETHICS STATEMENT

To the best of our knowledge, the methods we proposed in this work and the experiments we conducted do not pose potential negative impacts on society and they comply with the ethical research standards of ICLR.

REPRODUCIBILITY STATEMENT

Code related to this work will be released upon publication to support reproducibility.



Instead, finetuning the pretrained model with an extra output layer would be sufficient. CLIP (Radford et al., 2021) models, trained on a dataset consisting of 400 million text-image pairs, are competent in the zero-shot task on multiple datasets. DALL-E 2 (Ramesh et al., 2022), a model that adopts the framework of CLIP (Radford et al., 2021), is able to generate images based on text input as well as variations of images inspired by the originals.

Knowledge distillation on foundation models. Knowledge from foundation models can be transferred to a variety of lightweight, task-specific models through knowledge distillation, improving their performance. In (Chen et al., 2019; Tang et al., 2019), knowledge within pretrained BERT (Kenton & Toutanova, 2019) was distilled into small-scale models designed for specific tasks such as natural language understanding, text generation, sentiment classification, etc. Notably, before knowledge distillation, the teacher network, BERT (Kenton & Toutanova, 2019), was finetuned. Jiao et al. (2019) managed to shrink BERT (Kenton & Toutanova, 2019) as a whole through knowledge distillation without damaging its versatility and performance on different tasks. Wang et al. (

Figure 2: Low-shot distillation. Accuracy of student networks trained with the supervision of different teachers and exposed to different numbers of training samples in imagenet100. We choose candidate sample numbers in each class to be {50, 100, 200, 500, 1000}.

The labels are given extra context information to eliminate ambiguities and to explain proper nouns (e.g., kuvatz → kuvatz, a type of dog; fig → fig, a type of fruit). The pretrained CLIP gives positive text-image pairs high cosine similarity while suppressing that of the negative pairs, such that an image is matched to its corresponding label in the form of text. A multi-layer perceptron appended to CLIP (Radford et al., 2021) takes in its output. Parameters in the MLP are

Table 1: Student accuracy (%) on test set of imagenet100 among different teacher-student pairs. ResNet 18 (He et al., 2016) is adopted as the student network. "Params" stands for the number of parameters in models and it is calculated in millions. "Param Gap" denotes the gap in parameter number between the teacher and student, which is measured in the number of times. "Scratch/None" stands for training from scratch without the supervision of teacher networks.

where $N_{\mathrm{err}}$ represents the number of misclassified samples and $N_{\mathrm{err\&oc}}$ represents the number of samples that are both misclassified and whose corresponding output vectors exhibit over-confidence. In this paper, we define an output vector $y^o$ of a model to be over-confident if:

Table 2: Accuracy (%) comparison on test set of imagenet100 among different teacher-student pairs, with reduced student network width. ResNet 18 (He et al., 2016) is adopted as the student network. "(1/8)" indicates that the number of filters in each convolution layer in the residual blocks of the student network has been reduced to 1/8 of that in the original structure. "Params" stands for the number of parameters in models and it is calculated in millions. "Param Gap" denotes the gap in parameter number between the teacher and student, which is measured in the number of times. "Scratch/None" stands for training from scratch without the supervision of teacher networks.

Except for raw CLIP, all models are pretrained or finetuned on our imagenet100 dataset. From the results shown in Table 1, we observe that even when the difference in parameter number between the finetuned CLIP and ResNet 18 reaches 26 times, the student still achieves the highest accuracy. In comparison, the parameter number difference between ResNet 101 and ResNet 18 is only 3.8 times, but the student accuracy is slightly lower than that of the student distilled from ResNet 34, which is a sign that the ResNet 101-ResNet 18 pair is negatively influenced by the capacity gap in knowledge distillation while the finetuned CLIP-ResNet 18 pair is not. Further experiments are conducted to verify this observation.

Table 3: Low-shot distillation, where both students and teachers are exposed to a limited number of training samples. Accuracy (%) comparison on test set of imagenet100 among different teacher-student pairs. ResNet 18 (He et al., 2016) is adopted as the student network. "Scratch/None" stands for training from scratch without the supervision of teacher networks. "Param Gap" denotes the gap in parameter number between the teacher and student, which is measured in the number of times. "k" denotes the number of training samples per class.

We explore the capacity gap resistance under low-shot settings, meaning that only a limited number of training samples are available. As in Section 4.2, the candidate teachers are ResNet 34, 101, and finetuned CLIP, with ResNet 18 as the student network. The training data is a portion of our imagenet100 dataset: for each class in imagenet100, k images in the original training set are sampled to form the training set for low-shot classification, where k ∈ {50, 100, 200, 500, 1000}. Each teacher is trained or finetuned on the low-shot training set and later used to supervise the training of the student network. In other words, within one low-shot setting, the regular candidate teacher networks (ResNets), the MLP appended to CLIP, and the student network are all exposed to the same low-shot training set. The results of low-shot classification are shown in Table 3 and Figure 2. We observe that with fewer training samples per class, regular teacher networks become more vulnerable to the capacity gap, meaning that the student accuracy degrades rapidly (ResNet 101-ResNet 18). In contrast, the student network distilled from finetuned CLIP consistently outperforms the rest, especially when the number of available training samples is limited.

Table 4: Comparing models with different pretraining methods, CLIP (Radford et al., 2021) versus the regular cross-entropy paradigm. "*-img-enc" denotes a pretrained image encoder in CLIP (Radford et al., 2021); "*-mod-CE" denotes a model with the same structure as the corresponding image encoder but trained with cross-entropy loss. We again use ResNet 18 (He et al., 2016) with a reduced filter number in residual blocks as the student. p is the metric proposed in Section 3.3.2, which measures the probability that the model gives a relatively high score to the wrong label, given that the highest score is assigned to a wrong label (misclassification). γ is the threshold for determining whether a score is relatively high, where the scores are elements of a model output vector passed through the softmax function. In our experiment, the threshold is set to 0.5.
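The metric p can be sketched as below. The exact over-confidence condition is not recoverable from this excerpt, so the sketch assumes that a misclassified sample counts as over-confident when the softmax score of the predicted (wrong) class exceeds γ. The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def overconfidence_rate(outputs, labels, gamma=0.5):
    """p = N_err&oc / N_err over a set of model outputs.

    Assumed condition: a misclassified sample is over-confident when the
    softmax score of the wrongly predicted class is greater than gamma.
    Returns 0.0 when nothing is misclassified (a choice made for this sketch).
    """
    n_err = 0
    n_err_oc = 0
    for logits, label in zip(outputs, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != label:
            n_err += 1
            if probs[pred] > gamma:
                n_err_oc += 1
    return n_err_oc / n_err if n_err else 0.0

# Made-up logits: samples 2 and 3 are misclassified, only sample 2 confidently.
outputs = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.5, 0.4, 0.3], [2.0, 0.0, 0.0]]
labels = [0, 0, 1, 0]
p = overconfidence_rate(outputs, labels, gamma=0.5)
print(p)
```

In this toy example two samples are misclassified and one of them receives a softmax score above 0.5 on the wrong class, so p is 0.5.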

