TRANSFERRING INDUCTIVE BIASES THROUGH KNOWLEDGE DISTILLATION

Abstract

Having the right inductive biases can be crucial in many tasks or scenarios where data or computing resources are a limiting factor, or where training data is not perfectly representative of the conditions at test time. However, defining, designing, and efficiently adapting inductive biases is not necessarily straightforward. The inductive biases of a model affect its generalization behavior and influence the solution it converges to in various ways. In this paper, we investigate the power of knowledge distillation in transferring the effects of the inductive biases of a teacher model to a student model with a different architecture. We consider different families of models, LSTMs vs. Transformers and CNNs vs. MLPs, in the context of tasks and scenarios from language and vision applications where having the right inductive biases is critical. We train our models in different setups: no knowledge distillation, self-distillation, and distillation using a teacher with a better inductive bias for the task at hand. We show that in the latter setup, compared to no distillation and self-distillation, we not only improve the performance of the students, but the solutions they converge to also become similar to those of their teachers with respect to a wide range of properties, including task-specific performance metrics, per-sample behavior, representational similarity and how the representational space of the models evolves during training, performance on out-of-distribution datasets, confidence calibration, and finally whether the converged solutions fall within the same basins of attraction.

1. INTRODUCTION

Inductive biases are the characteristics of learning algorithms that influence their generalization behavior, independent of data. They are one of the main driving forces that push learning algorithms toward particular solutions (Mitchell, 1980). Having the right inductive biases is especially important for obtaining high performance when data or compute is a limiting factor, or when training data is not perfectly representative of the conditions at test time. Moreover, in the absence of strong inductive biases, a model can be equally attracted to several local minima on the loss surface, and the converged solution can be arbitrarily affected by random variations such as the initial state or the order of training examples (Sutskever et al., 2013; McCoy et al., 2020; Dodge et al., 2020). There are different ways to inject inductive biases into learning algorithms, for instance through architectural choices, the objective function, the curriculum, or the optimization regime. In this paper, we exploit the power of Knowledge Distillation (KD) to transfer the effect of inductive biases between neural networks. KD refers to the process of transferring knowledge from a teacher model to a student model, where the logits from the teacher are used to train the student. KD is best known as an effective method for model compression (Buciluǎ et al., 2006; Hinton et al., 2015; Sanh et al., 2019), which allows taking advantage of a huge number of parameters during training while having an efficient smaller model during inference. The advantage of KD goes beyond model compression, however: it can be used to combine the strengths of different learning algorithms (Kuncoro et al., 2019; 2020). Different algorithms vary in terms of their computational/memory efficiency at training/inference time and their inductive biases for learning particular patterns. This makes them better at solving certain problems and worse at others, i.e., there is no "one size fits all" learning algorithm.
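The KD objective described above, in which the student is trained on the teacher's softened output distribution, can be sketched in a few lines. The snippet below is a generic illustration of the temperature-scaled soft targets of Hinton et al. (2015), not code from our experiments; the function names and the default temperature are illustrative choices.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened
    output distributions (the soft-target term of KD)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return -temperature ** 2 * sum(
        pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
```

In practice this soft-target term is usually combined with the ordinary cross-entropy on the ground-truth labels via a mixing coefficient.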
Hence, it is important to explore the potential of KD for finding better trade-offs. The question we ask in this paper is: "In KD, are the preferences of the teacher that are rooted in its inductive biases also reflected in its dark knowledge[1], and can they thus be transferred to the student?" We are interested in cases where the student model can realize functions that are realizable by the teacher, i.e., the student model is efficient with respect to the teacher model (Cohen et al., 2016), while the teacher has an inductive bias that makes the desired solutions easily learnable for it (Seung et al., 1991). We consider two scenarios where the teacher and the student are neural networks with heterogeneous architectures and hence different inductive biases. We train the models, both independently and using KD, on tasks for which having the right inductive biases is crucial. In the first test case, we study RNNs vs. Transformers (Vaswani et al., 2017) on the subject-verb agreement prediction task (Linzen et al., 2016). We use LSTMs (Hochreiter & Schmidhuber, 1997) as the most widely used RNN variant. LSTMs have been shown to perform better than vanilla Transformers on this task, and their superior performance is attributed to their so-called "recurrent" inductive bias (Tran et al., 2018). First, we identify the sources of the recurrent inductive bias of LSTMs, namely sequentiality, memory bottleneck, and recursion, and design experiments to show the benefits of each. Then, we show that by distilling the knowledge of LSTMs into Transformers, the solutions that the Transformer models learn become more similar to those learned by LSTMs. In the second test case, we study CNNs vs. MLPs in the context of the MNIST-C (Corrupted MNIST) benchmark (Mu & Gilmer, 2019), which is designed to measure the out-of-distribution robustness of models. We train our models on MNIST and evaluate them on Translated/Scaled MNIST.
The particular form of parameter sharing in CNNs, combined with the pooling mechanism, makes them equivariant to these kinds of transformations (Goodfellow et al., 2016), which leads to better generalization in these scenarios compared to MLPs. In our experiments and analysis on these two test cases[2], we compare the behavior of different models from a wide range of perspectives, when trained in different setups: (1) without KD, i.e., directly from the data; (2) with KD using a teacher with an architecture similar to the student's, i.e., self-distillation; and (3) with KD using a teacher with a different architecture whose inductive biases suit the task better than the student's. As the first step, in setup (1), i.e., no KD, we demonstrate how inductive biases arising from different architectural choices affect the generalization behavior of the models we study (§2.1 and §3.1). We show that the models with more suitable inductive biases not only have better accuracy, but the solutions they converge to are also better in terms of other metrics. We also show that different instances of the model with stronger inductive biases have less variance in terms of all the metrics. Then, we apply KD to train the models and contrast the behavior of models trained with setups (2) and (3) against models trained with setup (1), i.e., with KD vs. without KD. We show that regardless of the properties of the teacher, KD is a powerful technique in which the teacher model drives the student toward a particular set of solutions that is more restricted than the set of solutions the student can converge to when it learns directly from data (§2.2, §3.2, and Appendix C).
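As a toy illustration of the claim about parameter sharing and pooling (a sketch we add for intuition, not part of the paper's pipeline): a convolution applies one shared kernel at every position, so shifting the input merely shifts the feature map, and a global pooling step then discards position entirely. Circular shifts and circular convolution are used below to sidestep border effects.

```python
def circular_conv2d(image, kernel):
    """Cross-correlate a shared kernel over a 2D image with circular
    (wrap-around) boundaries; this is the weight-sharing step of a CNN."""
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            out[i][j] = sum(
                kernel[di][dj] * image[(i + di) % H][(j + dj) % W]
                for di in range(kh) for dj in range(kw))
    return out

def global_max_pool(feature_map):
    """Collapse the feature map to a single position-free response."""
    return max(max(row) for row in feature_map)

def shift_right(image, k):
    """Circularly shift every row of the image k pixels to the right."""
    return [row[-k:] + row[:-k] for row in image]
```

Because the convolution is equivariant (convolving a shifted image gives a shifted feature map) and max pooling ignores where the maximum occurs, the pooled response of a shifted image equals that of the original; an MLP, whose weights are not shared across positions, offers no such guarantee.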



[1] Dark knowledge refers to the information encoded in the output logits of a neural network (Hinton et al., 2015).
[2] The code for the input pipelines, models, analysis, and the details of the hyper-parameters used in our experiments is available at https://ANONYMIZED.



Figure 1: Training paths of different models on the Translated MNIST task. Each point represents the state of a model at a different epoch, from the initial state to convergence. The visualization is based on a 2D projection of the representational similarity of the activations from the penultimate layer for examples from the validation set, i.e., Translated MNIST (more details in Appendix B).
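One common way to quantify the representational similarity underlying a projection like this is linear Centered Kernel Alignment (CKA; Kornblith et al., 2019). The sketch below is our own illustrative implementation; the measure actually used for the figure is specified in Appendix B, so treat the choice of linear CKA here as an assumption.

```python
import math

def centered(X):
    """Center each column (feature) of an n-by-d activation matrix."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frob_sq(X, Y):
    """Squared Frobenius norm of X^T Y, where the rows of X and Y are the
    activations of the same n examples under two different models."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            s = sum(X[i][a] * Y[i][b] for i in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA similarity in [0, 1]; it equals 1 when one representation
    is an orthogonal transform plus isotropic rescaling of the other."""
    Xc, Yc = centered(X), centered(Y)
    return cross_frob_sq(Xc, Yc) / math.sqrt(
        cross_frob_sq(Xc, Xc) * cross_frob_sq(Yc, Yc))
```

Computing such a similarity between every pair of model checkpoints yields a similarity matrix that can then be embedded in 2D to draw training paths like those in Figure 1.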

