IS LABEL SMOOTHING TRULY INCOMPATIBLE WITH KNOWLEDGE DISTILLATION? AN EMPIRICAL STUDY

Abstract

This work aims to empirically clarify the recently proposed view that label smoothing is incompatible with knowledge distillation (Müller et al., 2019). We begin by introducing the motivation behind this incompatibility claim, i.e., that label smoothing erases relative information between teacher logits. We provide a novel connection on how label smoothing affects the distributions of semantically similar and dissimilar classes. Then we propose a metric to quantitatively measure the degree of erased information in a sample's representation. After that, we examine the one-sidedness and imperfection of the incompatibility view through extensive analyses, visualizations, and comprehensive experiments on image classification, binary networks, and neural machine translation. Finally, we broadly discuss several circumstances in which label smoothing does indeed lose its effectiveness.

1. INTRODUCTION

Label smoothing (Szegedy et al., 2016) and knowledge distillation (Hinton et al., 2015) are two widely adopted techniques for training deep neural networks and have been applied in many state-of-the-art models, such as language translation (Vaswani et al., 2017; Tan et al., 2019; Zhou et al., 2020), image classification (Xie et al., 2019; He et al., 2019) and speech recognition (Chiu et al., 2018; Pereyra et al., 2017; Chorowski & Jaitly, 2017). Recently, a large body of studies has focused on exploring the underlying relationship between these two methods. For instance, Müller et al. (2019) discovered that label smoothing implicitly improves calibration but hurts the effectiveness of knowledge distillation. Yuan et al. (2019) regarded knowledge distillation as a dynamic form of label smoothing, as it delivers a regularization effect during training. A recent study (Lukasik et al., 2020) further noticed that label smoothing can help mitigate label noise, showing that when distilling models from noisy data, a teacher trained with label smoothing is helpful. Despite these intensive research efforts, how to use label smoothing and knowledge distillation in practice remains unclear, divergent, and under-explored. Moreover, it is hard to answer when and why label smoothing works well or not under a variety of discrepant circumstances.

View of incompatibility between label smoothing and knowledge distillation. Recently, Müller et al. proposed the standpoint that teachers trained with label smoothing distill inferior students compared to teachers trained with hard labels, even though label smoothing improves the teacher's accuracy. The authors found that label smoothing tends to "erase" the information contained intra-class across individual examples, which indicates that the relative information between logits is erased to some extent when the teacher is trained with label smoothing.
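To make the two mechanisms under discussion concrete, the following is a minimal NumPy sketch of the standard label-smoothing target (Szegedy et al., 2016) and the knowledge-distillation objective (Hinton et al., 2015). All function names and numeric values are illustrative, not from the paper; note how the smoothed target assigns an identical probability to every non-target class, which is the intuition behind the "erased relative information" claim.

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    # Label smoothing: mix the one-hot target with a uniform
    # distribution over the K classes.
    k = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / k

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Knowledge distillation: cross-entropy between the
    # temperature-softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()

# With alpha = 0.1 and K = 3, the target [1, 0, 0] becomes roughly
# [0.933, 0.033, 0.033]: both wrong classes receive the same mass,
# regardless of their semantic similarity to the true class.
target = smooth_labels(np.eye(3)[0], alpha=0.1)
```

The KD loss, by contrast, transfers the teacher's full output distribution, so the relative ordering of the teacher's non-target logits (which label smoothing allegedly flattens during teacher training) is exactly the signal the student is meant to learn from.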
This view is becoming increasingly dominant and has been cited in a large number of recent works (Arani et al., 2019; Tang et al., 2020; Mghabbar & Ratnamogan, 2020; Shen et al., 2020; Khosla et al., 2020). However, this seemingly reasonable observation exhibits many inconsistencies in practice when performing knowledge distillation with smoothing-trained teachers. Thus, we would like to challenge whether this perspective is entirely correct. To make label smoothing and knowledge distillation less mysterious, in this paper, we first systematically introduce the mechanism and correlation



Project page: http://zhiqiangshen.com/projects/LS_and_KD/index.html

