STUDENTS IN THE RIGHT FAMILY CAN LEARN FROM GOOD TEACHERS

Abstract

State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models. However, widespread use is constrained by device hardware limitations, resulting in a substantial performance gap between state-of-the-art models and those that can be effectively deployed on small devices. While Knowledge Distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise. Neural Architecture Search (NAS) appears as a natural solution to this problem, but most approaches are inefficient, as the bulk of the computation is spent comparing architectures sampled from the same distribution, with negligible differences in performance. In this paper, we propose to instead search for a family of student architectures sharing the property of being good at learning from a given teacher. Our approach AutoKD, powered by Bayesian Optimization, explores a flexible graph-based search space, enabling us to automatically learn the optimal student architecture distribution and KD parameters, while being 20× more sample-efficient than the existing state of the art. We evaluate our method on three datasets; on large images specifically, we match the teacher's performance while using 3× less memory and 10× fewer parameters. Finally, while AutoKD uses the traditional KD loss, it outperforms more advanced KD variants that use hand-designed students.

1. INTRODUCTION

Recently developed deep learning models have achieved remarkable performance in a variety of tasks. However, breakthroughs leading to state-of-the-art (SOTA) results often rely on very large models: GPipe, Big Transfer and GPT-3 use 556 million, 928 million and 175 billion parameters, respectively (Huang et al., 2019; Kolesnikov et al., 2020; Brown et al., 2020). Deploying these models on user devices (e.g. smartphones) is currently impractical, as they require large amounts of memory and computation; even when large devices are an option (e.g. GPU clusters), the cost of large-scale deployment (e.g. continual inference) can be very high (Cheng et al., 2017). Additionally, target hardware does not always natively or efficiently support all operations used by SOTA architectures. The applicability of these architectures is therefore severely limited, and workarounds using smaller or simplified models lead to a performance gap between the technology available at the frontier of deep learning research and that usable in industry applications.

Knowledge Distillation (KD) emerges as a potential solution for bridging this gap, allowing small student models to learn from, and emulate the performance of, large teacher models (Hinton et al., 2015a). The student model can be constrained in its size and in the type of operations it uses, so that it satisfies the requirements of the target computational environment. Unfortunately, achieving this in practice is extremely challenging and requires extensive human expertise. For example, while we know that the architecture of the student is important for distillation (Liu et al., 2019b), it remains unclear how to design the optimal network under given hardware constraints.

Neural Architecture Search (NAS) makes it possible to discover an optimal student architecture. NAS automates the choice of neural network architecture for a specific task and dataset, given a search space of architectures and a search strategy to navigate that space (Pham et al., 2018; Real et al., 2017; Liu et al., 2019a; Carlucci et al., 2019; Zela et al., 2018; Ru et al., 2020). One important limitation of most NAS approaches is that the search space is very restricted, with a high proportion of resources spent on evaluating very similar architectures, which limits the effectiveness of the search (Yang et al., 2020). This is because traditional NAS approaches have no tools for distinguishing between architectures that are similar and architectures that are very different; as a consequence, computational resources are spent comparing even insignificant changes to the model. Conversely, properly exploring a large space requires huge computational resources: for example, recent work by Liu et al. (2019b) on finding the optimal student requires evaluating 10,000 models.

To overcome these limitations, we propose an automated approach to knowledge distillation in which we search for a family of good students rather than for a specific model. By comparing distributions over architectures instead of individual architectures, we spend computational resources only on meaningful differences and therefore search significantly more efficiently: we evaluate 33× fewer architectures than the work most closely related to ours (Liu et al., 2019b). We find that even though our method, AutoKD, does not output one specific architecture, all architectures sampled from the optimal family of students perform well when trained with KD.
This reformulation of the NAS problem provides a more expressive search space containing very diverse architectures, thus increasing the effectiveness of the search procedure in finding good student networks. Our contributions are as follows: (A) a framework for combining KD with NAS that effectively emulates large models while using a fraction of the memory and of the parameters; (B) by searching for an optimal student family, rather than for specific architectures, our algorithm is up to 20× more sample-efficient than alternative NAS-based KD solutions; (C) we significantly outperform advanced KD methods on a benchmark of vision datasets despite using the traditional KD loss, showcasing the efficacy of the students we find.
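To make the idea of searching over a family of students, rather than over individual architectures, more concrete, the sketch below shows one way such a search could be organized: family-level hyperparameters define a distribution from which concrete student architectures are sampled, and a family is scored by the average performance of a few sampled students trained with KD. This is only an illustrative sketch, not the authors' implementation: all names are hypothetical, a dummy evaluator stands in for KD training, and a simple enumeration stands in for the Bayesian Optimization loop used by AutoKD.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical family-level hyperparameters: a "family" is a distribution over
# student architectures, here parameterised by depth, width and connectivity.
@dataclass
class FamilyParams:
    max_depth: int            # upper bound on the number of layers per student
    width_choices: List[int]  # candidate channel widths
    skip_prob: float          # probability of adding a skip connection

def sample_student(params: FamilyParams, rng: random.Random) -> Dict:
    """Draw one concrete student architecture from the family distribution."""
    depth = rng.randint(2, params.max_depth)
    widths = [rng.choice(params.width_choices) for _ in range(depth)]
    skips = [(i, j) for i in range(depth) for j in range(i + 2, depth)
             if rng.random() < params.skip_prob]
    return {"widths": widths, "skips": skips}

def score_family(params: FamilyParams,
                 train_with_kd: Callable[[Dict], float],
                 n_students: int = 4,
                 seed: int = 0) -> float:
    """Score a family by the average validation score of a few sampled students."""
    rng = random.Random(seed)
    return sum(train_with_kd(sample_student(params, rng))
               for _ in range(n_students)) / n_students

if __name__ == "__main__":
    # Dummy evaluator standing in for "train the student with the KD loss and
    # return its validation accuracy"; here it simply favours ~6-layer students.
    dummy_eval = lambda arch: 1.0 / (1.0 + abs(len(arch["widths"]) - 6))
    # A real search would let a Bayesian Optimizer propose FamilyParams;
    # here we just enumerate a few candidate families and keep the best.
    candidates = [FamilyParams(max_depth=d, width_choices=[32, 64, 128], skip_prob=p)
                  for d in (4, 8, 12) for p in (0.1, 0.3)]
    best = max(candidates, key=lambda fp: score_family(fp, dummy_eval))
    print("best family:", best)
```

The key design point this sketch illustrates is that the search operates on the parameters of a distribution over architectures, so feedback from a handful of sampled students is enough to compare families, instead of exhaustively comparing near-identical individual networks.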

2. RELATED WORK

Model compression has been studied since the beginning of the machine learning era, and multiple solutions have been proposed (Choudhary et al., 2020; Cheng et al., 2017). Pruning-based methods remove non-essential parameters from the model with little to no drop in final performance; their primary motivation was to reduce storage requirements, but they can also be used to speed up the model (LeCun et al., 1990; Han et al., 2015; Li et al., 2016a). Quantization methods reduce the number of bits used to represent the weights and activations of a model; depending on the specific implementation, this can lead to reduced storage, reduced memory consumption and a general speed-up of the network (Fiesler et al., 1990; Soudry et al., 2014; Rastegari et al., 2016; Zhu et al., 2016). In low-rank factorization approaches, a given weight matrix is decomposed into a product of smaller ones, for example using singular value decomposition; applied to fully connected layers this reduces storage, while applied to convolutional filters it leads to faster inference (Choudhary et al., 2020). All of the above techniques can successfully reduce the complexity of a given model, but they are not designed to substitute specific operations. For example, specialized hardware devices might only support a small subset of the operations offered by modern deep learning frameworks.

In Knowledge Distillation approaches, a large model (the teacher) distills its knowledge into a smaller student architecture (Hinton et al., 2015b). This knowledge is assumed to be represented in the neural network's output distribution; hence, in the standard KD framework, the output distribution of the student network is optimized to match the teacher's output distribution on all the training data (Yun et al., 2020; Ahn et al., 2019; Yuan et al., 2020; Tian et al., 2020; Tung & Mori, 2019).

The work of Liu et al. (2019b) shows that the architecture of a student network is a contributing factor in its ability to learn from a given teacher. The authors propose combining KD with a traditional NAS pipeline, based on Reinforcement Learning, to find the optimal student. While this setup leads to good results, it does so at a huge computational cost, requiring over 5 days on 200 TPUs. Similarly, Gu & Tresp (2020) also look for the optimal student architecture, but do so by searching for a subgraph of the original teacher; their method therefore cannot be used to substitute unsupported operations. Orthogonal approaches, looking at how KD can improve NAS, are explored by Trofimov et al. (2020) and Li et al. (2020): the first establishes that KD improves the correlation between different budgets in multi-fidelity methods, while the second uses the teacher's supervision to search for the architecture in a blockwise fashion.
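For reference, the standard KD objective described above blends a distillation term, which matches temperature-softened teacher and student output distributions, with the usual cross-entropy on the hard labels. The following is a minimal PyTorch sketch of that loss; the function name and hyperparameter values are illustrative assumptions, not taken from any of the cited implementations.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            targets: torch.Tensor,
            temperature: float = 4.0,
            alpha: float = 0.9) -> torch.Tensor:
    """Standard KD loss: KL divergence between temperature-softened teacher and
    student distributions, blended with cross-entropy on the hard labels."""
    # Soft targets from the teacher; detach so no gradient reaches the teacher.
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    distill = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, targets)
    return alpha * distill + (1.0 - alpha) * hard

if __name__ == "__main__":
    # Example: a batch of 8 examples over 10 classes.
    s = torch.randn(8, 10, requires_grad=True)
    t = torch.randn(8, 10)
    y = torch.randint(0, 10, (8,))
    loss = kd_loss(s, t, y)
    loss.backward()
    print(float(loss))
```

In practice the teacher is run in evaluation mode with gradients disabled, so only the student's parameters are updated by this loss.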

