CLASS INTERFERENCE OF DEEP NEURAL NETWORKS

Abstract

Telling similar objects apart can be hard even for human beings. In this paper, we show that deep neural networks exhibit a phenomenon we call class interference. Class interference reflects the learning difficulty inherent in the data, and it accounts for the largest share of the generalization error of deep networks. To understand class interference, we propose cross-class tests, class ego directions, and interference models. We show how to use these definitions to study minima flatness and the class interference of a trained model. We also show how to detect class interference during training, through a pattern we call label dancing and through class dancing notes.

1. INTRODUCTION

Deep neural networks are very successful at classification (LeCun et al., 2015; Goodfellow et al., 2016) and sequential decision making (Mnih et al., 2015; Silver et al., 2016). However, we still lack a good understanding of why they work well and where the bottleneck lies. For example, it is well known that larger learning rates and smaller batch sizes tend to train models that generalize better. Keskar et al. (2016) found that large batch sizes lead to models whose loss landscape looks sharp around the minima. According to Hochreiter & Schmidhuber (1997), flat minima generalize better because of the minimum-description-length principle: low-complexity networks generalize well in practice. However, some works take a different view on this matter (Kawaguchi et al., 2017; Dinh et al., 2017; Li et al., 2018). Dinh et al. (2017) showed that sharp minima can also generalize well, and that for ReLU-based deep nets a flat minimum can always be constructed from a sharp one by exploiting inherent geometric symmetry. Li et al. (2018) presented an experiment in which a small-batch minimizer is considerably sharper yet still generalizes better than a large-batch minimizer when weight decay is turned on. Large-batch training with good generalization has also been reported in the literature (De et al., 2017; Goyal et al., 2017). By adjusting the number of iterations, Hoffer et al. (2017) showed that there is no generalization gap between small-batch and large-batch training. These works have greatly helped us understand the generalization of deep networks. Nonetheless, it remains largely mysterious. In this paper, we show that there is an important phenomenon in deep neural networks, in which certain classes pose a great challenge for classifiers to tell apart at test time, causing class interference.
Popular methods of understanding the generalization of deep neural networks are based on minima flatness, usually by visualizing the loss via interpolation between two models (Goodfellow et al., 2015; Keskar et al., 2016; Im et al., 2016; Jastrzebski et al., 2017; Draxler et al., 2018; Li et al., 2018; Lucas et al., 2021; Vlaar & Frankle, 2022; Doknic & Möller, 2022). Merely plotting the losses during training is not enough to understand generalization. Linearly interpolating between the initial model and the final trained model provides more information about the minima. A basic finding in this regard is the monotonic property: as the interpolation approaches the final model, the loss decreases monotonically (Goodfellow et al., 2015). Lucas et al. (2021) studied the monotonic property in more depth, giving sufficient conditions as well as counter-examples where it does not hold. Vlaar & Frankle (2022) showed that certain hidden layers are more sensitive to the initial model, and that the shape of the linear path is not indicative of the generalization performance of the final model. Li et al. (2018) explored visualization using two random directions and showed that it is important to normalize the filters. However, taking random directions produces stochastic loss contours, which is problematic when comparing models. We instead take a deterministic approach and study the loss function in the space of class ego directions, along which a parameter update minimizes the training loss for an individual class. The contributions of this paper are as follows.

• Using a metric called the CCTM that evaluates class interference on a test set, we show that class interference is the major source of generalization error for deep network classifiers. We show that class interference has a symmetry pattern: in particular, deep models have a similar amount of trouble telling "class A objects are not class B" and "class B objects are not class A".
• To understand class interference, we introduce the definitions of class ego directions and interference models.

• In the class ego spaces, small learning rates can lead to extremely sharp minima, while learning rate annealing leads to minima located in large lowlands, terrains much bigger than the flat minima previously discovered for big learning rates.

• The loss shapes in class ego spaces are indicative of interference: classes that share similar loss shapes in other classes' ego spaces are likely to interfere.

• We show that class interference can also be observed during training. In particular, it can be detected from a special pattern called label dancing, which can be understood further by plotting the dancing notes during training. Dancing notes reveal interesting interference between classes; for example, we were surprised to find that FROG interferes with CAT, for good reasons, in the CIFAR-10 data set.

2.1. GENERALIZATION TESTS AND THE CLASS INTERFERENCE PHENOMENON

Let c1 and c2 be class labels. We use the following cross-class test of generalization, which is the percentage of c2 predictions among the c1 objects in the test set:

CCTM(c1, c2) = (# of c1 test objects predicted as c2) / (# of c1 test objects in total).

Note that whether this test is an accuracy or an error metric depends on whether the two classes are the same. Calculating the measure for all pairs of classes over the test set gives a matrix. We refer to this measure as the CCT matrix, or simply the CCTM. The CCTM extends the confusion matrix in the literature to a probability measure, which can be viewed as a combination of the true positive rates and false positive rates in matrix form.* This extension facilitates visualizing the generalization performance as a heat map. Figure 1 shows the CCTM for VGG19 (Simonyan & Zisserman, 2015) and ResNet18 (He et al., 2015) on the CIFAR-10 (Krizhevsky et al., 2009) test set as a heat map. The models were trained with SGD (see Section 3 for the training details). From the map, we can see that the most significant generalization errors come from CAT and DOG for both models. This difficulty is not specific to the models; it reflects class similarity and learning difficulty in the data. For example, in Table 1, the accuracies in the CAT and DOG columns are significantly lower than the other columns for all four deep models. It is also observable that class interference has a symmetry pattern: if a classifier has trouble recognizing that c1 objects are not class c2, it will also have a hard time ruling out class c1 for c2 objects. This can be observed for CAT and DOG in the plotted CCTM. We call such generalization difficulties of deep neural networks between classes like CAT and DOG class interference. If CCTM(c1, c2) is large, we say that class c2 interferes with c1, or that class c1 has interference from c2. Class interference happens simply when classes are similar.
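The CCTM above can be computed directly from the true and predicted labels on a test set. The following is a minimal sketch (the function name `cctm` and the toy label arrays are ours, not from the paper); each row corresponds to a true class, so rows sum to 1, the diagonal is per-class recall, and off-diagonal entries measure interference:

```python
import numpy as np

def cctm(y_true, y_pred, num_classes):
    """Cross-class test matrix: entry (c1, c2) is the fraction of test
    objects with true label c1 that the model predicts as class c2."""
    m = np.zeros((num_classes, num_classes))
    for c1 in range(num_classes):
        mask = (y_true == c1)          # all test objects of class c1
        total = mask.sum()
        if total == 0:
            continue
        for c2 in range(num_classes):
            # fraction of c1 objects predicted as c2
            m[c1, c2] = np.sum(y_pred[mask] == c2) / total
    return m

# Toy example with 3 classes (labels are integer class indices).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0])
M = cctm(y_true, y_pred, 3)
```

The resulting matrix can be passed to any heat-map plotting routine to reproduce a Figure-1-style visualization.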
In this case, cats and dogs are hard to recognize for humans as well, especially when the image resolution is low. Examining only the overall test error would not reveal the class interference phenomenon, because it aggregates over all classes, whose individual test accuracies differ widely. For example, for VGG19, the recall accuracy of CAT, i.e., CCTM(CAT, CAT), is only about 84.5%.
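The per-class view and the symmetry pattern discussed above can both be read off a CCTM. A small illustrative sketch (the helper name `interference_summary` and the 3-class matrix are ours, chosen to mimic a nearly symmetric CAT/DOG-style pair, not numbers from the paper):

```python
import numpy as np

def interference_summary(M):
    """Given a CCTM (rows = true class, columns = predicted class),
    return per-class recall (the diagonal) and, for each pair (a, b),
    the asymmetry |M[a, b] - M[b, a]|: a small value means the model
    confuses a-for-b about as often as b-for-a."""
    recall = np.diag(M)
    asym = np.abs(M - M.T)  # diagonal is zero by construction
    return recall, asym

# Illustrative 3-class CCTM: classes 0 and 1 interfere almost symmetrically.
M = np.array([[0.90, 0.08, 0.02],
              [0.07, 0.91, 0.02],
              [0.01, 0.01, 0.98]])
recall, asym = interference_summary(M)
```

Here the large, nearly equal off-diagonal entries M[0, 1] and M[1, 0] are the kind of symmetric interference the paper reports for CAT and DOG, while the low diagonal entries flag the classes whose recall drags down the overall accuracy.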



* See https://en.wikipedia.org/wiki/Sensitivity_and_specificity for example.

