CLASS INTERFERENCE OF DEEP NEURAL NETWORKS

Abstract

Recognizing similar objects and telling them apart is hard even for human beings. In this paper, we show that there is a phenomenon of class interference in all deep neural networks. Class interference reflects the learning difficulty in data, and it accounts for the largest share of the generalization errors made by deep networks. To understand class interference, we propose cross-class tests, class ego directions, and interference models. We show how to use these definitions to study the minima flatness and class interference of a trained model. We also show how to detect class interference during training through label dancing patterns and class dancing notes.

1. INTRODUCTION

Deep neural networks are very successful for classification (LeCun et al., 2015; Goodfellow et al., 2016) and sequential decision making (Mnih et al., 2015; Silver et al., 2016). However, we still lack a good understanding of why they work well and where the bottleneck lies. For example, it is well known that larger learning rates and smaller batch sizes tend to train models that generalize better. Keskar et al. (2016) found that large batch sizes lead to models whose loss looks sharp around the minima. According to Hochreiter & Schmidhuber (1997), flat minima generalize better because of the minimum-description-length principle: low-complexity networks generalize well in practice. However, some works hold different opinions on this matter (Kawaguchi et al., 2017; Dinh et al., 2017; Li et al., 2018). Dinh et al. (2017) showed that sharp minima can also generalize well and that, for ReLU-based deep nets, a flat minimum can always be constructed from a sharp one by exploiting inherent geometric symmetry. Li et al. (2018) presented an experiment in which a small-batch minimizer is considerably sharper yet still generalizes better than a large-batch minimizer once weight decay is turned on. Large-batch training with good generalization also exists in the literature (De et al., 2017; Goyal et al., 2017). By adjusting the number of iterations, Hoffer et al. (2017) showed that there is no generalization gap between small-batch and large-batch training. These works have greatly improved our understanding of the generalization of deep networks. However, it still remains largely a mystery. In this paper, we show that there is an important phenomenon in deep neural networks whereby certain classes pose a great challenge for classifiers to tell apart at test time, causing class interference.
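The geometric symmetry exploited by Dinh et al. (2017) rests on the positive homogeneity of ReLU: scaling one layer's weights by a positive factor and the next layer's weights by its inverse leaves the network function unchanged while rescaling the loss curvature, so sharpness alone cannot determine generalization. A minimal NumPy sketch of this invariance on a hypothetical two-layer network (all names and shapes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer ReLU network: f(x) = W2 @ relu(W1 @ x).
W1 = rng.standard_normal((8, 4))
W2 = rng.standard_normal((3, 8))

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W1, W2):
    return W2 @ relu(W1 @ x)

x = rng.standard_normal(4)

# Scale the first layer by alpha > 0 and the second by 1/alpha.
# Since relu(alpha * z) = alpha * relu(z) for alpha > 0, the
# network computes exactly the same function, even though the
# loss surface around the new parameters has different curvature
# (i.e., a different apparent "sharpness").
alpha = 10.0
y_original = f(x, W1, W2)
y_rescaled = f(x, alpha * W1, W2 / alpha)

assert np.allclose(y_original, y_rescaled)
```

Because the two parameter settings realize the same function, any sharpness measure that changes under this rescaling cannot by itself predict generalization, which is the core of the counter-argument to the flat-minima hypothesis.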
Popular methods of understanding the generalization of deep neural networks are based on minima flatness, usually by visualizing the loss along an interpolation between two models (Goodfellow et al., 2015; Keskar et al., 2016; Im et al., 2016; Jastrzebski et al., 2017; Draxler et al., 2018; Li et al., 2018; Lucas et al., 2021; Vlaar & Frankle, 2022; Doknic & Möller, 2022). Simply plotting the losses during training is not enough to understand generalization. Linearly interpolating between the initial model and the final trained model provides more information about the minima. A basic finding in this regard is the monotonic property: as the interpolation approaches the final model, the loss decreases monotonically (Goodfellow et al., 2015). Lucas et al. (2021) studied the monotonic property in greater depth, giving sufficient conditions for it as well as counter-examples where it does not hold. Vlaar & Frankle (2022) showed that certain hidden layers are more sensitive to the initial model, and that the shape of the linear path is not indicative of the generalization performance of the final model. Li et al. (2018) explored visualization using two random directions and showed that it is important to normalize the filters. However, random directions produce stochastic loss contours, which is problematic when comparing models. We take a deterministic approach and

