STUDENTS IN THE RIGHT FAMILY CAN LEARN FROM GOOD TEACHERS

Abstract

State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models. However, widespread use is constrained by device hardware limitations, resulting in a substantial performance gap between state-of-the-art models and those that can be effectively deployed on small devices. While Knowledge Distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise. Neural Architecture Search (NAS) appears as a natural solution to this problem, but most approaches can be inefficient, as most of the computation is spent comparing architectures sampled from the same distribution, with negligible differences in performance. In this paper, we propose to instead search for a family of student architectures sharing the property of being good at learning from a given teacher. Our approach, AutoKD, powered by Bayesian Optimization, explores a flexible graph-based search space, enabling us to automatically learn the optimal student architecture distribution and KD parameters, while being 20× more sample-efficient than the existing state of the art. We evaluate our method on 3 datasets; on large images specifically, we reach the teacher performance while using 3× less memory and 10× fewer parameters. Finally, while AutoKD uses the traditional KD loss, it outperforms more advanced KD variants that use hand-designed students.
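As background for the "traditional KD loss" mentioned above: this refers to the soft-target objective of Hinton et al. (2015a). The snippet below is a minimal illustrative sketch of that objective, not code from AutoKD; the temperature T, the mixing weight alpha, and the name kd_loss are generic choices for illustration rather than values or names used in this paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic KD objective: soft-target KL term plus hard-label cross-entropy.

    T (temperature) and alpha (mixing weight) are illustrative defaults,
    not hyperparameters reported in this paper.
    """
    # Teacher and student output distributions, softened by the temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between the softened distributions; the T^2 factor keeps
    # its gradient scale comparable to the hard-label term (Hinton et al., 2015a).
    distill_term = F.kl_div(log_soft_student, soft_teacher,
                            reduction="batchmean") * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard_term = F.cross_entropy(student_logits, labels)
    return alpha * distill_term + (1.0 - alpha) * hard_term
```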

1. INTRODUCTION

Recently developed deep learning models have achieved remarkable performance in a variety of tasks. However, breakthroughs leading to state-of-the-art (SOTA) results often rely on very large models: GPipe, Big Transfer and GPT-3 use 556 million, 928 million and 175 billion parameters, respectively (Huang et al., 2019; Kolesnikov et al., 2020; Brown et al., 2020). Deploying these models on user devices (e.g. smartphones) is currently impractical, as they require large amounts of memory and computation; and even when large devices are an option (e.g. GPU clusters), the cost of large-scale deployment (e.g. continual inference) can be very high (Cheng et al., 2017). Additionally, target hardware does not always natively or efficiently support all operations used by SOTA architectures. The applicability of these architectures is therefore severely limited, and workarounds using smaller or simplified models lead to a performance gap between the technology available at the frontier of deep learning research and that usable in industry applications.

In order to bridge this gap, Knowledge Distillation (KD) emerges as a potential solution, allowing small student models to learn from, and emulate the performance of, large teacher models (Hinton et al., 2015a). The student model can be constrained in its size and in the type of operations it uses, so that it satisfies the requirements of the target computational environment. Unfortunately, successfully achieving this in practice is extremely challenging and requires extensive human expertise. For example, while we know that the architecture of the student is important for distillation (Liu et al., 2019b), it remains unclear how to design the optimal network given some hardware constraints.

With Neural Architecture Search (NAS) it is possible to discover an optimal student architecture. NAS automates the choice of neural network architecture for a specific task and dataset, given a search space of architectures and a search strategy to navigate that space (Pham et al., 2018; Real et al., 2017; Liu et al., 2019a; Carlucci et al., 2019; Zela et al., 2018; Ru et al., 2020). One im-

