MODREDUCE: A MULTI-KNOWLEDGE DISTILLATION FRAMEWORK WITH ONLINE LEARNING

Abstract

Deep neural networks have produced revolutionary results in many applications; however, deploying such models demands substantial processing power and memory. Knowledge distillation addresses this cost by transferring knowledge from large networks into smaller ones, enhancing the performance of the smaller models. The literature defines three types of transferable knowledge: response-based, relational-based, and feature-based. To the best of our knowledge, prior work has studied transferring only one or two of these types; transferring all three remains unexplored. In this paper, we propose ModReduce, a framework that transfers all three knowledge types in a unified manner by combining offline and online knowledge distillation. Moreover, we perform an extensive experimental study of how combining different knowledge types affects student models' generalization and overall performance. Our experiments show that ModReduce outperforms state-of-the-art knowledge distillation methods in terms of Average Relative Improvement.

1. INTRODUCTION

The term knowledge distillation was formally popularized by Hinton et al. (2015), and it refers to transferring knowledge from a large pre-trained model to a smaller one, aiming to retain performance comparable to the large model. Knowledge distillation has received increasing attention from the research community due to its promising results. The methods by which knowledge can be distilled vary widely based on several factors, such as the knowledge type, the distillation algorithm, and the teacher-student architecture (Gou et al., 2021). Response-based distillation uses the teacher model's logits as the transferred knowledge. The main idea is that the student optimizes its training over the soft targets, i.e., the softened probability distribution produced by the teacher model, instead of using discrete labels (Hinton et al., 2015). While this method showed great success, one of its major drawbacks is that it disregards the knowledge the teacher model retains in its intermediate layers. This encouraged researchers to introduce methods that capture this knowledge, known as feature-based knowledge. Feature-based algorithms use the features of the teacher model's intermediate layers to guide the student's learning. The challenge is that the teacher and student models operate at different abstraction levels, so determining the best layer associations for maximum performance becomes one of the objectives of the distillation process (Tung & Mori, 2019; Passalis et al., 2020; Kornblith et al., 2019). Relational-based methods focus on the relationships between data instances and between activations and neurons (Gou et al., 2021). Several algorithms and methods have been introduced, each distilling one or two of these knowledge sources from a teacher model to a student model.
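The response-based idea above can be illustrated with a minimal stdlib-Python sketch: the teacher's logits are softened by a temperature before the student matches them with a KL-divergence loss. The function names and the temperature value of 4.0 are illustrative choices, not taken from this paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between teacher and student soft targets,
    scaled by T^2 as in the classic soft-target formulation."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student soft predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (temperature ** 2) * kl
```

In practice this loss is combined with the standard cross-entropy on the discrete labels; the sketch shows only the soft-target term that distinguishes response-based distillation.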
While these methods show promising results, no prior research addresses distilling all three discussed knowledge types rather than only one or two. To the best of our knowledge, ModReduce is the first work to explore this area. Moreover, we explore combining offline and online distillation strategies to achieve this goal. For online distillation, we have explored four different techniques for knowledge aggregation: Peer Collaborative Learning (PCL) (Wu & Gong, 2021), On-the-fly Native Ensembling (ONE) (Zhu et al., 2018), Fully Connected Layers (FC), and Weighted Averaging. ModReduce exhibits the following characteristics:
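Of the four aggregation techniques named above, the simplest, Weighted Averaging, can be sketched in plain Python: the peers' logits are combined into a single ensemble signal using normalized weights. The function name and uniform default weighting are hypothetical illustrations, not details taken from the paper.

```python
def aggregate_logits(peer_logits, weights=None):
    """Combine peer students' logit vectors into one ensemble signal
    via a normalized weighted average."""
    n = len(peer_logits)
    if weights is None:
        weights = [1.0 / n] * n  # default: uniform weighting across peers
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so weights sum to 1
    dim = len(peer_logits[0])
    return [sum(w * logits[i] for w, logits in zip(weights, peer_logits))
            for i in range(dim)]
```

The other aggregation schemes (PCL, ONE, FC) replace this fixed average with learned or attention-based combinations, but the aggregated output plays the same role: a teacher-like signal for the online peers.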

