LEARNING FROM DEEP MODEL VIA EXPLORING LOCAL TARGETS

Abstract

Deep neural networks often have a huge number of parameters, which poses challenges for deployment in application scenarios with limited memory and computation capacity. Knowledge distillation is one approach to deriving compact models from bigger ones. However, it has been observed that a converged heavy teacher model imposes a strong constraint on learning a compact student network and can drive the optimization into poor local optima. In this paper, we propose ProKT, a new model-agnostic method that projects the supervision signals of a teacher model into the student's parameter space. This projection is implemented by decomposing the training objective into local intermediate targets using an approximate mirror descent technique. The proposed method can be less sensitive to quirks during optimization, which can result in a better local optimum. Experiments on both image and text datasets show that ProKT consistently achieves state-of-the-art performance compared to all existing knowledge distillation methods.

1. INTRODUCTION

Advanced deep learning models have shown impressive abilities in solving numerous machine learning tasks (Devlin et al., 2018b; Radford et al., 2018; He et al., 2016). However, these heavy models are not compatible with many real-world application scenarios because of their low inference efficiency and high energy consumption. Hence, preserving model capacity with fewer parameters has been an active research direction in recent years (Polino et al., 2018; Wu et al., 2016; Hinton et al., 2015). Knowledge distillation (Hinton et al., 2015) is an essential approach in this field: it is a model-agnostic method in which a model with fewer parameters (the student) is optimized to minimize some statistical discrepancy between its predictive distribution and that of a higher-capacity model (the teacher). Recently, it has been observed that using a static target as the distillation objective limits the effectiveness of knowledge distillation (Jin et al., 2019; Mirzadeh et al., 2019) when the capacity gap between the student and teacher models is large. The underlying reason is that optimizing deep learning models with gradient descent favors targets that are close to their own model family (Phuong & Lampert, 2019). To counter this issue, designing intermediate targets has been a popular solution: Teacher-Assistant learning (TA-learning) (Mirzadeh et al., 2019) shows that, within the same architecture setting, gradually increasing the teacher size improves the distillation performance; Route-Constrained Optimization (RCO) (Jin et al., 2019) uses intermediate models from the teacher's training process as anchors to constrain the optimization path of the student, which can close the performance gap between the student and teacher models.
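As a concrete reference point for the vanilla distillation objective described above, the following is a minimal sketch in which the statistical discrepancy is taken to be the KL divergence between temperature-softened output distributions. The function names `softmax` and `kd_loss`, the NumPy implementation, and the default temperature are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Batch mean of KL(teacher || student) on temperature-softened outputs.

    Assumption: KL divergence is used as the discrepancy; other choices are possible.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return float(kl.mean())

# Hypothetical logits for a batch of two examples and three classes.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.zeros((2, 3))
loss_far = kd_loss(student, teacher)   # student disagrees with the teacher
loss_same = kd_loss(teacher, teacher)  # identical predictions
```

In this sketch the teacher's logits are fixed, which corresponds to the static target discussed above: the loss depends on the student only through its current predictions, and the target never adapts to the student's progress.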
One reasonable explanation for the benefit of intermediate targets can be derived from the perspective of curriculum learning (Bengio et al., 2009): the learning process is boosted if the goal is set to suit the underlying learning preference (bias). The most common arrangement of tasks is to gradually increase the difficulty during the learning procedure, as in pre-training (Sutskever et al., 2009). Correspondingly, TA-learning treats a model with more similar capacity (model size) as the easier task, while RCO treats a model with more similar performance as the easier task. In this paper, we argue that the utility of the teacher is not fully explored in previous approaches. First, the intermediate targets usually discretize the training process into several periods, and the resulting abrupt changes of target during optimization undermine the very benefit of introducing intermediate goals. Second, the learning procedure must be designed manually, which is hard to control and to adapt across different tasks. Finally, the statistical dependency between the student and the intermediate target is never explicitly constrained. To overcome these obstacles, we propose ProKT, a new knowledge distillation method that better leverages the supervision signal of the teacher to improve the optimization path of the student. Our method is mainly inspired by guided policy search in reinforcement learning (Levine & Koltun, 2013), where the intermediate target constructed by the teacher should be approximately projected onto the student's parameter space. More intuitively, the key motivation is to make the teacher model aware of the optimization progress of the student model, so that the student receives "hands-on" supervision that helps it escape poor minima or bypass barriers in the optimization landscape.
The main contribution of this paper is a simple yet effective model-agnostic method for knowledge distillation, in which intermediate targets are constructed by a model with the same architecture as the teacher and trained by approximate mirror descent. We empirically evaluate our method on a variety of challenging knowledge distillation settings on both image and text data. We find that our method consistently outperforms the vanilla knowledge distillation approach by a large margin, yields significant improvements over several strong baselines, and achieves state-of-the-art results on several knowledge distillation benchmark settings.
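The intuition of a teacher that is aware of the student's progress can be illustrated with a toy sketch. This is not the paper's algorithm: the specific losses, the weight `lam`, the learning rate, the update order, and the use of free logits for a single training example are all assumptions made for illustration. The teacher takes a gradient step on its task loss plus a KL term that keeps its predictions close to the student's current predictions, and the student then takes a step toward the refreshed teacher.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector."""
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def alternate_updates(label, num_classes=3, steps=200, lr=0.5, lam=1.0):
    """Toy alternating updates on free logits for one example (illustrative only).

    `lam` is a hypothetical weight on the term that pulls the teacher
    toward the student's current predictions.
    """
    y = np.eye(num_classes)[label]
    teacher_logits = np.zeros(num_classes)
    student_logits = np.zeros(num_classes)
    for _ in range(steps):
        p_t = softmax(teacher_logits)
        p_s = softmax(student_logits)
        # Teacher step: gradient of CE(y, p_t) + lam * KL(p_s || p_t)
        # with respect to the teacher logits is (p_t - y) + lam * (p_t - p_s).
        teacher_logits -= lr * ((p_t - y) + lam * (p_t - p_s))
        # Student step: gradient of KL(p_t || p_s) with respect to the
        # student logits is (p_s - p_t), using the refreshed teacher.
        p_t = softmax(teacher_logits)
        student_logits -= lr * (p_s - p_t)
    return softmax(teacher_logits), softmax(student_logits)

teacher_probs, student_probs = alternate_updates(label=0)
```

Because both models here are unconstrained logit vectors, the sketch does not model a capacity gap; it only shows how the target the student chases can move with the student rather than being fixed in advance.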

2. RELATED WORK

In this section, we discuss the literature most closely related to our work in model miniaturization and knowledge distillation.
Model Miniaturization. There has been a fruitful line of research dedicated to modifying the model structure to achieve fast inference at test time. For instance, MobileNet (Howard et al., 2017) and ShuffleNet (Zhang et al., 2018a) modify the convolution operator to reduce the computational burden. Model pruning compresses a large network by removing its redundant connections, which are selected based either on the weight magnitude or on their impact on the loss function. One important hyperparameter of model pruning is the compression ratio of each layer. He et al. (2018) propose an automatic tuning strategy instead of setting the ratio manually, which is shown to improve performance.
Knowledge Distillation. Knowledge distillation focuses on boosting performance while the small network architecture is fixed. Buciluǎ et al. (2006) and Hinton et al. (2015) introduced the idea of distilling knowledge from a heavy model into a relatively smaller and faster model that preserves the generalization power. To this end, Buciluǎ et al. (2006) propose to match the logits of the student and teacher models, and Hinton et al. (2015) minimize the statistical discrepancy between the output probability distributions of the student model and the teacher model. Zhang et al. (2018b) propose deep mutual learning, which demonstrates that a bidirectional learning process can boost the distillation performance. Orthogonal to output matching, many works match the student and teacher models by enforcing alignment on the latent representations (Yim et al., 2017; Jiao et al., 2019a; Sun et al., 2019).
This branch of work typically relies on prior knowledge of the network architectures of the student and teacher models, and is therefore better suited to distilling from a model with the same architecture. In the context of knowledge distillation, our method is most closely related to TA-learning (Mirzadeh et al., 2019) and Route-Constrained Optimization (RCO) (Jin et al., 2019), which improve the optimization of the student model by designing a sequence of intermediate targets that constrain the optimization path. Both methods can be well motivated in the context of curriculum learning, although their underlying assumptions differ: TA-learning assumes that an increasing order of model capacity implies a suitable learning trajectory, while RCO assumes that an increasing order of model performance forms a favorable learning curriculum for the student. However, these methods have several limitations. For example, the sequence of learning targets is set before the training process and needs to be designed manually. In addition, the targets are independent of the state of the student, so these methods do not enjoy all the merits of curriculum learning.
Connections to Other Fields. Introducing a local target within the training procedure is a widely applied idea in many fields of machine learning. Montgomery & Levine (2016) introduce guided policy search, where a local policy is introduced to provide a locally improved trajectory, which

