MODREDUCE: A MULTI-KNOWLEDGE DISTILLATION FRAMEWORK WITH ONLINE LEARNING

Abstract

Deep neural networks have produced revolutionary results in many applications; however, the computational resources required to use such models are expensive in terms of processing power and memory. Knowledge distillation addresses this by transferring knowledge from large networks into smaller ones, aiming to enhance the performance of the smaller models. The literature defines three types of knowledge that can be transferred: response-based, relation-based, and feature-based. To the best of our knowledge, prior work has studied transferring only one or two knowledge types; transferring all three remains unexplored. In this paper, we propose ModReduce, a framework that transfers the three knowledge types in a unified manner using a combination of offline and online knowledge distillation. We also conduct an extensive experimental study of how combining different knowledge types affects student models' generalization and overall performance. Our experiments show that ModReduce outperforms state-of-the-art knowledge distillation methods in terms of Average Relative Improvement.

1. INTRODUCTION

The term knowledge distillation was formally popularized by Hinton et al. (2015), and it refers to transferring knowledge from a large pre-trained model to a smaller one with the aim of retaining performance comparable to the large model. Knowledge distillation has received increasing attention from the research community due to its promising results. The methods by which knowledge can be distilled vary widely based on several factors, such as the knowledge type, the distillation algorithm, and the teacher-student architecture (Gou et al., 2021). Response-based knowledge treats the teacher model's logits as the knowledge to be transferred. The main idea is that the student optimizes its training over the soft targets, i.e., the softened probability distribution produced by the teacher model, instead of using discrete labels (Hinton et al., 2015). While this method has shown great success, a major drawback is that it disregards the knowledge the teacher model retains in its intermediate layers. This encouraged researchers to introduce methods that capture the knowledge in the intermediate layers of the teacher model, known as feature-based knowledge. Feature-based algorithms use the features of the teacher model's intermediate layers to guide the student's learning. The challenge is that the teacher and student models have different abstraction levels, which makes determining the best layer associations for maximum performance one of the objectives of the distillation process (Tung & Mori, 2019; Passalis et al., 2020; Kornblith et al., 2019). Relation-based methods focus on the relationships between different data instances and between different activations and neurons (Gou et al., 2021). Several algorithms and methods have been introduced that distill one or two of these knowledge sources from a teacher model to a student model.
While they show promising results, to the best of our knowledge no prior work addresses distilling all three knowledge types rather than only one or two; ModReduce is the first work to explore this area. Moreover, we explore combining offline and online distillation strategies to achieve this goal. For online distillation, we have explored four different techniques for knowledge aggregation: Peer Collaborative Learning (PCL) (Wu & Gong, 2021), On-the-fly Native Ensembling (ONE) (Zhu et al., 2018), Fully Connected Layers (FC), and Weighted Averaging. ModReduce exhibits the following characteristics:
1. It is agnostic of the offline distillation algorithm used. The platform supports different implementations for each knowledge category, including state-of-the-art algorithms.
2. It is capable of executing different online learning algorithms.
3. It is agnostic of the teacher and student model architectures.
Moreover, we introduce a new benchmark that is a union of the benchmarks available in Chen et al. (2021) and Tian et al. (2019). From this broad benchmark, we draw several findings.
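To make the simplest of these aggregation strategies concrete, the following is a minimal sketch of logit-level weighted averaging across student peers. The function name, the uniform default weights, and the tensor layout are illustrative assumptions, not ModReduce's actual implementation:

```python
import numpy as np

def weighted_average_logits(peer_logits, weights=None):
    # peer_logits: array of shape (num_peers, batch, num_classes).
    # Aggregates the peers' logits into a single ensemble target;
    # defaults to uniform weights over the peers.
    peer_logits = np.asarray(peer_logits, dtype=float)
    if weights is None:
        weights = np.full(peer_logits.shape[0], 1.0 / peer_logits.shape[0])
    weights = np.asarray(weights, dtype=float)
    # Contract the peer axis against the weight vector.
    return np.tensordot(weights, peer_logits, axes=1)
```

The resulting ensemble logits can then serve as a soft target that each peer distills from during online training.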

2. RELATED WORK

Before digging deeper into the different knowledge distillation schemes, we first review the different knowledge representations a neural network possesses.

2.1.1. RESPONSE-BASED

Response-based knowledge is defined as the response of a neural network when it is presented with a particular input. The most popular response-based knowledge distillation scheme in image classification uses "soft targets," where the output probability distribution is softened using a temperature factor T. According to Hinton et al. (2015), these soft targets carry informative "dark knowledge" from the teacher model. This soft-targets approach remains the most popular and best-performing method utilizing this type of knowledge. However, this kind of knowledge is blind to the inner features in the hidden layers of the model, as it focuses only on the final outputs.
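A minimal NumPy sketch of this soft-target loss follows. It computes the KL divergence between the temperature-softened teacher and student distributions; the T² scaling, which keeps gradient magnitudes comparable to the hard-label loss, follows Hinton et al. (2015), while the function names are our own:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: larger T yields a softer distribution.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(teacher_logits, student_logits, T=4.0):
    # KL divergence between softened teacher and student outputs,
    # averaged over the batch and scaled by T^2.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```

In practice this term is combined with the ordinary cross-entropy on the ground-truth labels, weighted by a mixing coefficient.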

2.1.2. FEATURE-BASED

Feature-based knowledge extends the idea of response-based knowledge by taking the outputs of intermediate layers into consideration. Accounting for the intermediate layers is important because deep neural networks learn features at different levels of abstraction. For instance, a deep CNN can learn simple features such as straight and curved lines in the shallowest layers while detecting features of higher complexity in the deeper layers (Bengio et al., 2013). This idea is useful for constructing teacher-student architectures, since this type of knowledge can be used in training the student network. Many knowledge distillation techniques use a distillation loss function that accounts for feature-based knowledge. Equation 1 gives a general form of this loss, where $f_t(x)$ and $f_s(x)$ are the feature maps of the teacher and student models respectively, $\phi_t$ and $\phi_s$ are the transformation functions applied to the teacher and student feature maps, and $l_f(\cdot)$ is a similarity measure between the feature maps.

$L_{FeaD}(f_t(x), f_s(x)) = l_f(\phi_t(f_t(x)), \phi_s(f_s(x)))$   (1)

The state-of-the-art framework in feature knowledge distillation was introduced in Chen et al. (2021), which aimed to match the semantics between the teacher and student. The authors introduced semantic calibration for cross-layer knowledge distillation, making better use of the intermediate knowledge by matching the semantic level of the transferred knowledge. They then used an attention mechanism to automatically learn a soft layer association with multiple targets, which helps the student model learn from multiple semantically matched hidden layers instead of just one fixed layer.
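Equation 1 can be sketched in a few lines of NumPy. Here the similarity measure $l_f$ is instantiated as mean-squared error and the transforms $\phi_t$, $\phi_s$ as global average pooling; both are hypothetical choices for illustration, since real methods typically learn the projections:

```python
import numpy as np

def feature_distillation_loss(f_t, f_s, phi_t, phi_s, l_f=None):
    # General form of Equation 1: compare transformed teacher and
    # student feature maps under a similarity measure l_f.
    if l_f is None:
        l_f = lambda a, b: float(np.mean((a - b) ** 2))  # MSE as l_f
    return l_f(phi_t(np.asarray(f_t, dtype=float)),
               phi_s(np.asarray(f_s, dtype=float)))

# Illustrative transforms: global average pooling maps feature maps of
# different spatial sizes into a common per-channel space.
pool = lambda f: f.mean(axis=(1, 2))  # (C, H, W) -> (C,)
```

For example, a teacher map of shape (8, 4, 4) and a student map of shape (8, 2, 2) both pool to shape (8,), after which the MSE is well-defined despite the mismatched spatial resolutions.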



2.1.3. RELATION-BASED

While response-based distillation captures the knowledge in the output layer of the teacher model, and feature-based distillation captures the knowledge contained in the intermediate layers, relation-based distillation captures the interrelations between training data examples. Several techniques have been proposed to capture these relations. Instance Relationship Graph (IRG) (Liu et al., 2019) introduced a knowledge distillation methodology based on constructing a graph in which features are represented as vertices and relations as edges. Relational Knowledge Distillation (RKD) (Park et al., 2019) proposed measuring the relations between training data

