LEARNING FROM DEEP MODEL VIA EXPLORING LOCAL TARGETS

Abstract

Deep neural networks often have a huge number of parameters, which poses challenges for deployment in application scenarios with limited memory and computation capacity. Knowledge distillation is one approach to deriving compact models from larger ones. However, it has been observed that a converged heavy teacher model imposes an overly strong constraint when training a compact student network, and can drive the optimization toward poor local optima. In this paper, we propose ProKT, a new model-agnostic method that projects the supervision signals of a teacher model into the student's parameter space. The projection is implemented by decomposing the training objective into local intermediate targets with an approximate mirror descent technique. The proposed method is less sensitive to quirks during optimization and can therefore reach better local optima. Experiments on both image and text datasets show that ProKT consistently achieves state-of-the-art performance compared to existing knowledge distillation methods.

1. INTRODUCTION

Advanced deep learning models have shown impressive abilities in solving numerous machine learning tasks (Devlin et al., 2018b; Radford et al., 2018; He et al., 2016). However, these heavy models are incompatible with many real-world application scenarios due to their low inference efficiency and high energy consumption. Hence, preserving model capacity with fewer parameters has been an active research direction in recent years (Polino et al., 2018; Wu et al., 2016; Hinton et al., 2015). Knowledge distillation (Hinton et al., 2015) is a prominent model-agnostic method in this field, in which a model with fewer parameters (the student) is optimized to minimize some statistical discrepancy between its prediction distribution and that of a higher-capacity model (the teacher). Recently, it has been observed that employing a static target as the distillation objective limits the effectiveness of knowledge distillation (Jin et al., 2019; Mirzadeh et al., 2019) when the capacity gap between the student and teacher models is large. The underlying reason is that optimizing deep learning models with gradient descent favors targets that lie close to the student's model family (Phuong & Lampert, 2019). To counter this issue, designing intermediate targets has been a popular solution: Teacher-Assistant learning (Mirzadeh et al., 2019) shows that, within the same architecture family, gradually increasing the teacher size improves distillation performance; Route-Constrained Optimization (RCO) (Jin et al., 2019) uses intermediate models from the teacher's own training process as anchors to constrain the optimization path of the student, which can close the performance gap between the student and teacher models.
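The statistical discrepancy minimized in standard knowledge distillation (Hinton et al., 2015) is typically the KL divergence between temperature-softened teacher and student output distributions. A minimal NumPy sketch of this objective is given below; the temperature value and function names are illustrative, not part of ProKT.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a comparable magnitude across temperatures."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return (temperature ** 2) * kl.mean()
```

A higher temperature flattens both distributions, exposing the teacher's relative preferences over non-argmax classes ("dark knowledge"); the loss is zero exactly when the student matches the teacher's logits up to a shift.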
One reasonable explanation for the above observations can be derived from the perspective of curriculum learning (Bengio et al., 2009): the learning process is boosted if the goals are set to match the learner's underlying preference (bias). The most common arrangement is to gradually increase task difficulty over the learning procedure, as in pre-training (Sutskever et al., 2009). Correspondingly, TA-learning treats models of more similar capacity/model size as the easier tasks, while RCO treats models of more similar performance as the easier tasks. In this paper, we argue that the utility of the teacher is not fully exploited by previous approaches. First, the intermediate targets usually discretize the training process into several periods, and the unsmooth target changes during the optimization procedure will hurt the very property of

