INTRACLASS CLUSTERING: AN IMPLICIT LEARNING ABILITY THAT REGULARIZES DNNS

Abstract

Several works have shown that the regularization mechanisms underlying deep neural networks' generalization performance are still poorly understood (Neyshabur et al., 2015; Zhang et al., 2017). In this paper, we hypothesize that deep neural networks are regularized through their ability to extract meaningful clusters among the samples of a class. This constitutes an implicit form of regularization, as no explicit training mechanisms or supervision target such behaviour. To support our hypothesis, we design four different measures of intraclass clustering, based on the neuron- and layer-level representations of the training data. We then show that these measures constitute accurate predictors of generalization performance across variations of a large set of hyperparameters (learning rate, batch size, optimizer, weight decay, dropout rate, data augmentation, network depth and width).

1. INTRODUCTION

Figure 1: In standard image classification datasets, classes are typically composed of multiple clusters of similar-looking images. We call intraclass clustering a model's ability to differentiate such clusters despite their association with identical labels.

The generalization ability of deep neural networks remains largely unexplained. In particular, the traditional view that explicit forms of regularization (e.g. dropout, L2-regularization, data augmentation) are the sole factors behind the generalization performance of state-of-the-art neural networks has been experimentally invalidated (Neyshabur et al., 2015; Zhang et al., 2017). Today's conventional wisdom rather conjectures the presence of implicit forms of regularization, emerging from the interactions between neural network architectures, optimization, and the inherent structure of the data itself (Arpit et al., 2017). One structural component that seems to occur in most image classification datasets is the presence of multiple clusters among the samples of a class (or intraclass clusters, cf. Figure 1). Extracting such structure in the context of supervised learning is not self-evident, as today's standard training algorithms are designed to group the samples of a class together, without any consideration for potential intraclass clusters.

This paper hypothesizes that the identification of intraclass clusters emerges during supervised training of deep neural networks, despite the absence of supervision or explicit training mechanisms targeting this behaviour. Moreover, our study suggests that this phenomenon improves the generalization ability of deep neural networks, hence constituting an implicit form of regularization. To verify our hypotheses, we define four measures of intraclass clustering and inspect the correlation between these measures and a network's generalization performance.
These measures are designed to capture intraclass clustering from four different perspectives, defined by the representation level (neuron vs. layer) and the amount of knowledge about the data's inherent structure (datasets with or without hierarchical labels). To evaluate the measures' predictive power, we train more than 500 models, varying standard hyperparameters in a principled way in order to generate a wide range of generalization performances. The measures are then evaluated qualitatively, through visual inspection of their relationship with generalization, and quantitatively, through the granulated Kendall rank-correlation coefficient introduced by Jiang et al. (2020). Both evaluations reveal a tight connection between intraclass clustering measures and generalization ability, providing important evidence to support this work's hypotheses.
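As a concrete illustration, the ranking statistic can be sketched in a few lines of Python. The granulated variant of Jiang et al. (2020) computes Kendall's tau only within groups of runs that differ in a single hyperparameter, then averages the resulting coefficients; the `runs` format and the `granulated_tau` helper below are illustrative choices for this sketch, not the paper's actual code.

```python
from itertools import combinations

def sign(v):
    return (v > 0) - (v < 0)

def kendall_tau(xs, ys):
    """Kendall tau-a: (#concordant pairs - #discordant pairs) / #pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    s = sum(sign(xs[i] - xs[j]) * sign(ys[i] - ys[j]) for i, j in pairs)
    return s / len(pairs)

def granulated_tau(runs, hp):
    """Average Kendall tau over groups of runs that differ only in `hp`.

    `runs` is a list of (hyperparams: dict, measure: float, gen: float).
    Each group fixes every hyperparameter except `hp`, so the correlation
    is attributed to that single hyperparameter axis.
    """
    groups = {}
    for params, measure, gen in runs:
        key = tuple(sorted((k, v) for k, v in params.items() if k != hp))
        groups.setdefault(key, []).append((measure, gen))
    taus = [kendall_tau([m for m, _ in g], [v for _, v in g])
            for g in groups.values() if len(g) > 1]
    return sum(taus) / len(taus)
```

The overall coefficient reported by Jiang et al. (2020) then averages `granulated_tau` over all hyperparameter axes.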

2. MEASURING INTRACLASS CLUSTERING IN INTERNAL REPRESENTATIONS

A challenge of our work lies in measuring a model's ability to differentiate intraclass clusters without knowing which mechanisms underlie this ability. In this context, designing multiple complementary measures (i) offers different perspectives that help better characterize intraclass clustering mechanisms and (ii) reduces the risk that their potential correlation with generalization is induced by other phenomena, independent of intraclass clustering. This section thus describes four measures of intraclass clustering, differing in terms of representation level (neuron vs. layer) and the amount of knowledge about the data's inherent structure (datasets with or without hierarchical labels).

2.1. TERMINOLOGY AND NOTATIONS

D denotes the training dataset. Let I be the number of classes in the dataset D; we denote the set of samples from class i by C_i, with i ∈ I = {1, 2, ..., I}. In the case of hierarchical labels, C_i denotes the samples from subclass i and S_{s(i)} the samples from the superclass containing subclass i. We denote by N = {1, 2, ..., N} and L = {1, 2, ..., L} the indices of the N neurons and L layers of a network, respectively. Neurons are considered across all the layers of a network, not a specific layer. The way indices are assigned to neurons or layers does not matter. We further denote by mean_{j∈J} and median_{j∈J} the mean and median operations over the index j, respectively. Moreover, mean^k_{j∈J} corresponds to the mean of the k highest values over the index j. We call pre-activations (and activations) the values preceding (respectively, following) the application of the ReLU activation function (Nair & Hinton, 2010). In our experiments, batch normalization (Ioffe & Szegedy, 2015) is applied before the ReLU, and pre-activation values are collected after batch normalization. In convolutional layers, a neuron refers to an entire feature map. The spatial dimensions of such a neuron's (pre-)activations are reduced through a global max pooling operation before applying our measures.
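The reduction of a convolutional feature map to a single per-sample value can be sketched as follows (a minimal numpy sketch, assuming activations are stored in (batch, channels, height, width) layout; the function name is ours):

```python
import numpy as np

def neuron_preactivations(feature_maps):
    """Reduce conv feature maps of shape (batch, channels, H, W) to
    per-neuron values of shape (batch, channels) via global max pooling,
    since each feature map is treated as a single neuron."""
    return feature_maps.max(axis=(2, 3))
```

After this reduction, every neuron (fully connected or convolutional) yields one scalar per training sample, which is the representation our measures operate on.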

2.2. MEASURES BASED ON LABEL HIERARCHIES

The first two measures take advantage of datasets that include a hierarchy of labels. For example, CIFAR100 is organized into 20 superclasses (e.g. flowers), each comprising 5 subclasses (e.g. orchids, poppies, roses, sunflowers, tulips). We hypothesize that these hierarchical labels reflect an inherent structure of the data. In particular, we expect the subclasses to approximately correspond to different clusters amongst the samples of a superclass. Hence, measuring the extent to which a network differentiates subclasses when trained on superclasses should reflect its ability to extract intraclass clusters during training.
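To make the idea concrete, the hypothetical helper below scores how well a single neuron separates one subclass from the rest of its superclass; it is an illustrative proxy of this sketch, not necessarily one of the four measures defined in the following subsections.

```python
import numpy as np

def subclass_separation(acts, subclass_ids):
    """Illustrative neuron-level score. For each subclass of a superclass,
    compare the neuron's median (pre-)activation on that subclass with its
    median on the superclass's remaining samples, normalized by the
    neuron's overall std; return the best-separated subclass's score.

    acts: 1-D array of one neuron's (pre-)activations on the samples of a
    single superclass; subclass_ids: matching subclass labels (at least
    two distinct subclasses assumed).
    """
    acts = np.asarray(acts, dtype=float)
    subclass_ids = np.asarray(subclass_ids)
    std = acts.std() + 1e-8  # avoid division by zero for dead neurons
    scores = []
    for s in np.unique(subclass_ids):
        inside = np.median(acts[subclass_ids == s])
        outside = np.median(acts[subclass_ids != s])
        scores.append(abs(inside - outside) / std)
    return max(scores)
```

A high score indicates that the neuron responds distinctively to one subclass even though training only ever exposed it to the superclass label.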

