REDESIGNING THE CLASSIFICATION LAYER BY RANDOMIZING THE CLASS REPRESENTATION VECTORS

Anonymous

Abstract

Neural image classification models typically consist of two components. The first is an image encoder, which is responsible for encoding a given raw image into a representative vector. The second is the classification component, which is often implemented by projecting the representative vector onto target class vectors. The target class vectors, along with the rest of the model parameters, are estimated so as to minimize the loss function. In this paper, we analyze how simple design choices for the classification layer affect the learning dynamics. We show that standard cross-entropy training implicitly captures visual similarities between different classes, which might deteriorate accuracy or even prevent some models from converging. We propose to draw the class vectors randomly and keep them fixed during training, thus invalidating the visual similarities encoded in these vectors. We analyze the effects of keeping the class vectors fixed and show that doing so can increase the inter-class separability, intra-class compactness, and overall model accuracy, while maintaining robustness to image corruptions and the generalization of the learned concepts.

1. INTRODUCTION

Deep learning models have achieved breakthroughs in classification tasks, setting state-of-the-art results in various fields such as speech recognition (Chiu et al., 2018), natural language processing (Vaswani et al., 2017), and computer vision (Huang et al., 2017). In image classification, the most common approach to training the models is as follows: first, a convolutional neural network (CNN) is used to extract a representative vector, denoted here as the image representation vector (also known as the feature vector). Then, at the classification layer, this vector is projected onto a set of weight vectors of the different target classes to create the class scores, as depicted in Fig. 1. Last, a softmax function is applied to normalize the class scores. During training, the parameters of both the CNN and the classification layer are updated to minimize the cross-entropy loss. We refer to this procedure as the dot-product maximization approach, since such training ends up maximizing the dot-product between the image representation vector and the target weight vector. Recently, it was demonstrated that despite the excellent performance of the dot-product maximization approach, it does not necessarily encourage discriminative learning of features, nor does it enforce intra-class compactness and inter-class separability (Liu et al., 2016; Wang et al., 2017; Liu et al., 2017). The intra-class compactness indicates how closely image representations from the same class relate to each other, whereas the inter-class separability indicates how far away image representations from different classes are. Several works have proposed different approaches to address these caveats (Liu et al., 2016; 2017; Wang et al., 2017; 2018b; a). One of the most effective yet most straightforward solutions is NormFace (Wang et al., 2017), which proposed to maximize the cosine-similarity between vectors by normalizing both the image and class vectors.
However, the authors found that when maximizing the cosine-similarity directly, the models fail to converge, and hypothesized that the cause is the bounded range of the logits vector. To allow convergence, the authors added a scaling factor that multiplies the logits vector. This approach has been widely adopted by multiple works (Wang et al., 2018b; Wojke & Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019). Here we refer to this approach as the cosine-similarity maximization approach. This paper focuses on redesigning the classification layer and on its role when kept fixed during training. We show that the visual similarity between classes is implicitly captured by the class vectors when they are learned by maximizing either the dot-product or the cosine-similarity between the image representation vector and the class vectors, and that the class vectors of visually similar categories end up close in angle to one another. We investigate the effects of excluding the class vectors from training and simply drawing them randomly, distributed over a hypersphere. We demonstrate that this process, which eliminates the visual similarities from the classification layer, boosts accuracy and improves the inter-class separability (using either dot-product maximization or cosine-similarity maximization). Moreover, we show that fixing the class representation vectors can resolve the convergence failures of some models (under the cosine-similarity maximization approach), and can further increase the intra-class compactness. Last, we show that neither the generalization to the learned concepts nor the robustness to noise is harmed by ignoring the visual similarities encoded in the class vectors. Recent work by Hoffer et al. (2018) suggested fixing the classification layer to gain computational and memory efficiency.
The authors showed that the performance of models with a fixed classification layer is on par with, or slightly below (up to 0.5% in absolute accuracy), that of models with a non-fixed classification layer, while allowing a substantial reduction in the number of learned parameters. However, in that paper the authors compared the performance of dot-product maximization models with a non-fixed classification layer against cosine-similarity maximization models with a fixed classification layer and an integrated scaling factor. Such a comparison might not indicate the benefits of fixing the classification layer, since the dot-product maximization is linear with respect to the image representation while the cosine-similarity maximization is not. In our paper, by contrast, we compare fixed and non-fixed dot-product maximization models as well as fixed and non-fixed cosine-maximization models, and show that by fixing the classification layer the accuracy can improve by up to 4% in absolute accuracy. Moreover, while cosine-maximization models were suggested to improve the intra-class compactness, we reveal that integrating a scaling factor to multiply the logits decreases the intra-class compactness. We demonstrate that by fixing the classification layer in cosine-maximization models, the models can converge and achieve high performance without the scaling factor, and significantly improve their intra-class compactness. The outline of this paper is as follows. In Sections 2 and 3, we formulate dot-product and cosine-similarity maximization models, respectively, and analyze the effects of fixing the class vectors. In Section 4, we describe the training procedure, compare the learning dynamics, and assess the generalization and robustness to corruptions of the evaluated models. We conclude the paper in Section 5.

2. FIXED DOT-PRODUCT MAXIMIZATION

Assume an image classification task with m possible classes. Denote the training set of N examples by S = {(x_i, y_i)}_{i=1}^N, where x_i ∈ X is the i-th instance and y_i ∈ {1, ..., m} is the corresponding class. In image classification, a dot-product maximization model consists of two parts. The first is the image encoder, denoted as f_θ : X → R^d, which is responsible for representing the input image as a d-dimensional vector, f_θ(x) ∈ R^d, where θ is a set of learnable parameters. The second part of the model is the classification layer, which is composed of learnable parameters denoted as W ∈ R^{m×d}. The matrix W can be viewed as m vectors, w_1, ..., w_m, where each vector w_i ∈ R^d can be considered the representation vector associated with the i-th class. For simplicity, we omit the bias terms and assume they can be absorbed into W. A key design consideration for the classification layer is the operation applied between the matrix W and the image representation vector f_θ(x). Most commonly, a dot-product operation is used, and the resulting vector is referred to as the logits vector. For training, a softmax operation is applied over the logits vector, and the result is passed to a cross-entropy loss which should be minimized. That is,

$$\arg\min_{w_1,\ldots,w_m,\theta} \sum_{i=1}^{N} -\log \frac{e^{w_{y_i} \cdot f_\theta(x_i)}}{\sum_{j=1}^{m} e^{w_j \cdot f_\theta(x_i)}} = \arg\min_{w_1,\ldots,w_m,\theta} \sum_{i=1}^{N} -\log \frac{e^{\|w_{y_i}\| \|f_\theta(x_i)\| \cos(\alpha_{y_i})}}{\sum_{j=1}^{m} e^{\|w_j\| \|f_\theta(x_i)\| \cos(\alpha_j)}}. \quad (1)$$

The equality holds since w_{y_i} · f_θ(x_i) = ‖w_{y_i}‖ ‖f_θ(x_i)‖ cos(α_{y_i}), where α_k is the angle between the vectors w_k and f_θ(x_i). We trained three dot-product maximization models with different known CNN architectures over four datasets, varying in image size and number of classes, as described in detail in Section 4.1.
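For concreteness, the classification layer and loss in Eq. 1 can be sketched in a few lines of NumPy. This is an illustrative toy (the dimensions and the class index y are arbitrary choices, not the paper's setup); it also checks the dot-product/cosine identity underlying the equality in Eq. 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5                       # toy feature dimension and number of classes
f = rng.normal(size=d)            # image representation vector f_theta(x)
W = rng.normal(size=(m, d))       # class representation vectors w_1..w_m (rows)

logits = W @ f                    # dot-product logits, one score per class

# Each logit equals ||w_j|| * ||f|| * cos(alpha_j), as used in Eq. 1
cos_alpha = (W / np.linalg.norm(W, axis=1, keepdims=True)) @ (f / np.linalg.norm(f))
assert np.allclose(logits, np.linalg.norm(W, axis=1) * np.linalg.norm(f) * cos_alpha)

# Softmax over the logits and the cross-entropy loss for a ground-truth class y
y = 2
probs = np.exp(logits - logits.max())   # shift by max for numerical stability
probs /= probs.sum()
loss = -np.log(probs[y])
```

Minimizing this loss over both θ and W is exactly the non-fixed dot-product maximization procedure described above.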
Since these models optimize the dot-product between the image vector and its corresponding learnable class vector, we refer to them as non-fixed dot-product maximization models. Inspecting the matrix W of the trained models reveals that visually similar classes have their corresponding class vectors close in space. On the left panel of Fig. 2, we plot the cosine-similarity between the class vectors learned by the non-fixed model trained on the STL-10 dataset. It can be seen that the vectors representing vehicles are relatively close to each other, and far away from the vectors representing animals. Furthermore, when we inspect the class vectors of non-fixed models trained on CIFAR-100 (100 classes) and Tiny ImageNet (200 classes), we find even larger similarities between vectors, due to the high visual similarity between classes such as boy and girl or apple and orange. By placing the vectors of visually similar classes close to each other, the inter-class separability is decreased. Moreover, we find a strong Spearman correlation between the closeness of class vectors and the number of misclassified examples. On the right panel of Fig. 2, we plot the cosine-similarity between two class vectors, w_i and w_j, against the number of examples from category i that were wrongly classified as category j. As shown in the figure, as the class vectors get closer in space, the number of misclassifications increases. In STL-10, CIFAR-10, CIFAR-100, and Tiny ImageNet, we find correlations of 0.82, 0.77, 0.61, and 0.79, respectively (note that all possible class pairs were considered in the computation of the correlation). These findings reveal that as two class vectors get closer in space, the confusion between the two corresponding classes increases. We examined whether the models benefit from the high angular similarities between the vectors.
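The correlation analysis above can be reproduced with a simple rank-correlation routine. The sketch below uses made-up similarity and confusion values purely for illustration (the actual numbers come from the trained models, which are not shown here), and a toy tie-free Spearman implementation rather than a library call.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation via Pearson correlation of ranks
    (toy implementation; assumes no ties in either array)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical illustration: cosine-similarities of several class-vector pairs
# and the confusion counts for those same pairs (made-up numbers)
cos_sims   = np.array([-0.2, 0.0, 0.3, 0.5, 0.7])
confusions = np.array([   1,   3,   8,  15,  40])
rho = spearman(cos_sims, confusions)   # 1.0 here, since the relation is monotone
```

In the paper's experiments this correlation is computed over all class pairs, yielding the 0.61–0.82 values reported above.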
We trained the same models, but instead of learning the class vectors, we drew them randomly, normalized them (‖w_j‖ = 1), and kept them fixed during training. We refer to these models as the fixed dot-product maximization models. Since the target vectors are initialized randomly, the cosine-similarity between vectors is low even for visually similar classes; see the middle panel of Fig. 2. Notice that by fixing the class vectors and bias term during training, the model can minimize the loss in Eq. 1 only by optimizing the vector f_θ(x_i). With fixed class vectors, the prediction is influenced mainly by the angle between f_θ(x_i) and the fixed w_{y_i}, since the magnitude of f_θ(x_i) multiplies the scores of all classes equally and the magnitude of each class vector is fixed to 1. Thus, the model is forced to optimize the angle of the image vector towards its randomized class vector. Table 1 compares the classification accuracy of models with a fixed and non-fixed classification layer. Results suggest that learning the matrix W during training is not necessarily beneficial, and might reduce accuracy when the number of classes is high or when the classes are visually close. Additionally, we empirically found that models with fixed class vectors can be trained with a higher learning rate; due to space limitations, we report these results in the appendix (Tables 7, 8, and 9). By randomly drawing the class vectors, we ignore possible visual similarities between classes and force the models to minimize the loss by increasing the inter-class separability and encoding images from visually similar classes into vectors far apart in space; see Fig. 3.

3. FIXED COSINE-SIMILARITY MAXIMIZATION

In the cosine-similarity maximization approach (Wang et al., 2017), both the image representation and the class vectors are l_2-normalized before taking the dot-product, and the objective becomes:

$$\arg\min_{w_1,\ldots,w_m,\theta} \sum_{i=1}^{N} -\log \frac{e^{\frac{w_{y_i} \cdot f_\theta(x_i)}{\|w_{y_i}\| \|f_\theta(x_i)\|}}}{\sum_{j=1}^{m} e^{\frac{w_j \cdot f_\theta(x_i)}{\|w_j\| \|f_\theta(x_i)\|}}}. \quad (2)$$

Comparing the right-hand side of Eq.
1 shows that the cosine-similarity maximization model simply requires normalizing f_θ(x) and each of the class representation vectors w_1, ..., w_m by dividing them by their l_2-norm during the forward pass. The main motivation for this reformulation is the ability to learn more discriminative features in face verification by encouraging intra-class compactness and enlarging the inter-class separability. The authors showed that dot-product maximization models learn a radial feature distribution; thus, the inter-class separability and intra-class compactness are not optimal (for more details, see the discussion in Wang et al. (2017)). However, the authors found that cosine-similarity maximization models as given in Eq. 2 fail to converge, and added a scaling factor S ∈ R to multiply the logits vector as follows:

$$\arg\min_{w_1,\ldots,w_m,\theta} \sum_{i=1}^{N} -\log \frac{e^{S \cdot \frac{w_{y_i} \cdot f_\theta(x_i)}{\|w_{y_i}\| \|f_\theta(x_i)\|}}}{\sum_{j=1}^{m} e^{S \cdot \frac{w_j \cdot f_\theta(x_i)}{\|w_j\| \|f_\theta(x_i)\|}}}. \quad (3)$$

According to Wang et al. (2017), cosine-similarity maximization models fail to converge when S = 1 due to the low range of the logits vector, where each entry is bounded between [-1, 1]. This low range prevents the predicted probabilities from getting close to 1 during training; as a result, the distribution over target classes remains close to uniform and the training loss is trapped at a high value. Intuitively, this may sound like a reasonable explanation for why directly maximizing the cosine-similarity (S = 1) fails to converge. Note that even if an example is correctly classified and well separated, in the best scenario it achieves a cosine-similarity of 1 with its ground-truth class vector, while for all other classes the cosine-similarity is -1. Thus, for a classification task with m classes, the predicted probability of the example above would be:

$$P(Y = y_i \mid x_i) = \frac{e^{1}}{e^{1} + (m-1) \cdot e^{-1}}. \quad (4)$$

Notice that if the number of classes is m = 200, the predicted probability of this correctly classified example would be at most 0.035, and cannot be further optimized towards 1.
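The probability ceiling in Eq. 4 is easy to check numerically; the short sketch below evaluates it for an arbitrary number of classes (an illustrative helper, not part of the paper's code).

```python
import math

def best_case_prob(m):
    """Best-case softmax probability under direct cosine maximization (S = 1):
    cosine 1 with the ground-truth class vector and -1 with the other m - 1
    class vectors, as in Eq. 4."""
    return math.exp(1) / (math.exp(1) + (m - 1) * math.exp(-1))

ceiling = best_case_prob(200)   # roughly 0.036 for m = 200 classes
```

Even a perfectly placed example cannot exceed this probability, so its cross-entropy loss stays large no matter how well the encoder does.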
As a result, the loss function would yield a high value for a correctly classified example, even if its image vector points precisely in the direction of its ground-truth class vector. As in the previous section, we trained the same models over the same datasets, but instead of optimizing the dot-product, we optimized the cosine-similarity by normalizing f_θ(x_i) and w_1, ..., w_m in the forward pass. We denote these models as non-fixed cosine-similarity maximization models. Additionally, we trained the same cosine-similarity maximization models with fixed random class vectors, denoting these models as fixed cosine-similarity maximization models. In all models (fixed and non-fixed) we set S = 1 to directly maximize the cosine-similarity; results are shown in Table 2. Surprisingly, we find that the low range of the logits vector is not what prevents cosine-similarity maximization models from converging. As can be seen in the table, fixed cosine-maximization models achieve accuracy higher by up to 53% (absolute) than non-fixed models. Moreover, fixed cosine-maximization models with S = 1 can also outperform dot-product maximization models. This finding demonstrates that while the logits are bounded between [-1, 1], the models can still learn high-quality representations and decision boundaries. Table 2: Classification accuracy of fixed and non-fixed cosine-similarity maximization models. In all models, S = 1. We further investigated the effects of S and trained for comparison the same fixed and non-fixed models, this time using a grid search for the best-performing S value. As can be seen in Table 3, increasing the scaling factor S allows non-fixed models to achieve higher accuracies over all datasets. Yet, in terms of accuracy, there is no benefit to learning the class representation vectors over randomly drawing them and keeping them fixed during training.
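The normalized forward pass used by all cosine-maximization models in this section can be sketched as follows. This is a minimal NumPy illustration (toy shapes; the function name is ours, not the paper's); with S = 1 every logit is a cosine and therefore lies in [-1, 1].

```python
import numpy as np

def cosine_logits(F, W, S=1.0):
    """Logits for cosine-similarity maximization: l2-normalize both the image
    features (rows of F) and the class vectors (rows of W), then scale by S,
    as in Eqs. 2 and 3."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return S * (Fn @ Wn.T)

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 8))    # a toy batch of 4 image representation vectors
W = rng.normal(size=(5, 8))    # 5 class vectors
L = cosine_logits(F, W)        # with S = 1, every entry lies in [-1, 1]
```

In the fixed variant, W is drawn once at initialization and excluded from the optimizer; only the encoder producing F is trained.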
To better understand the cause that prevents non-fixed cosine-maximization models from converging when S = 1, we compared these models with the same models trained with the optimal S. For each model, we measured the distances between its learned class vectors and compared these distances to demonstrate the effect of S on them. Interestingly, we found that as S increases, the cosine-similarity between the class vectors decreases. That is, increasing S pushes the class vectors further apart from each other. Compare, for example, the left and middle panels of Fig. 4, which show the cosine-similarity between the class vectors of models trained on STL-10 with S = 1 and S = 20, respectively. On the right panel of Fig. 4, we plot the number of misclassifications as a function of the cosine-similarity between the class vectors of the non-fixed cosine-maximization model trained on STL-10 with S = 1. It can be seen that the confusion between classes is high when the angular distance between them is low. As in the previous section, we observed strong correlations between the closeness of the class vectors and the number of misclassifications. We found correlations of 0.85, 0.87, 0.81, and 0.83 in models trained on STL-10, CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. By integrating the scaling factor S into Eq. 4 we get:

$$P(Y = y_i \mid x_i) = \frac{e^{S \cdot 1}}{e^{S \cdot 1} + (m-1) \cdot e^{S \cdot (-1)}}. \quad (5)$$

Note that by increasing S, the predicted probability in Eq. 5 increases. This is true even when the cosine-similarity between f_θ(x_i) and w_{y_i} is less than 1. When S is set to a large value, the gaps between the logits increase, and the predicted probability after the softmax is closer to 1. As a result, the model is discouraged from optimizing the cosine-similarity between the image representation and its ground-truth class vector towards 1, since the loss is already close to 0.
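The monotone effect of S in Eq. 5 can be verified directly. The sketch below evaluates the best-case probability for a few hypothetical S values with m = 200 classes (illustrative values only):

```python
import math

def eq5_prob(S, m):
    """Eq. 5: best-case predicted probability with scaling factor S
    (cosine 1 with the true class, -1 with the remaining m - 1 classes)."""
    return math.exp(S * 1) / (math.exp(S * 1) + (m - 1) * math.exp(S * (-1)))

# The probability ceiling rises toward 1 as S grows, so the loss can
# approach 0 well before the cosine-similarity itself reaches 1
probs = [eq5_prob(S, 200) for S in (1, 5, 10, 20)]
```

This is the mechanism behind the observation above: with a large S, the loss saturates early and no longer pushes image vectors all the way toward their class vectors.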
In Table 4, we show that as we increase S, the cosine-similarity between the image vectors and their predicted class vectors decreases. These observations can explain why non-fixed models with S = 1 fail to converge. By setting S to a large scalar, the image vectors are spread around their class vectors to a larger degree, preventing the class vectors from getting close to each other. As a result, the inter-class separability increases and the misclassification rate between visually similar classes decreases. In contrast, setting S = 1 allows models to place the class vectors of visually similar classes closer in space, which leads to a high number of misclassifications. However, a disadvantage of setting S to a large number is that the intra-class compactness is violated, since image vectors from the same class are spread and encoded relatively far from each other (see Fig. 5). By randomly drawing the class vectors, models are required to encode images from visually similar classes into vectors that are far apart in space; therefore, the inter-class separability is high. Additionally, the intra-class compactness is improved since S can be set to 1, so models are encouraged to maximize the cosine-similarity towards 1 and place image vectors from the same class close to their class vector. We validated this empirically by measuring the average cosine-similarity between image vectors and their predicted classes' vectors in fixed cosine-maximization models with S = 1. We obtained an average cosine-similarity of roughly 0.95 in all experiments, meaning that images from the same class were encoded compactly near their class vectors. In conclusion, although non-fixed cosine-similarity maximization models were proposed to remedy the caveats of dot-product maximization by improving the inter-class separability and intra-class compactness, their performance is significantly lower without a scaling factor multiplying the logits vector.
Integrating the scaling factor and setting it to S > 1 decreases the intra-class compactness and introduces a trade-off between accuracy and intra-class compactness. By fixing the class vectors, cosine-similarity maximization models can have both high performance and improved intra-class compactness. This means that multiple previous works (Wang et al., 2018b; Wojke & Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019) that adopted the cosine-maximization method and integrated a scaling factor for convergence might benefit from improved results by fixing the class vectors. Table 4: Average cosine-similarity between image vectors and their predicted class vectors when S is set to 1, 20, and 40. Results are from non-fixed cosine-similarity maximization models trained on CIFAR-10 (C-10), CIFAR-100 (C-100), STL, and Tiny ImageNet (TI).

4. GENERALIZATION AND ROBUSTNESS TO CORRUPTIONS

In this section, we explore the generalization of the evaluated models to the learned concepts and measure their robustness to image corruptions. We do not aim to set state-of-the-art results, but rather to validate that fixing the class vectors of a model keeps its generalization ability and robustness to corruptions competitive.

4.1. TRAINING PROCEDURE

To evaluate the impact of ignoring the visual similarities in the classification layer, we evaluated models with fixed and non-fixed class vectors on CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), as well as on STL-10 and Tiny ImageNet. All models were trained using stochastic gradient descent with momentum, with the standard normalization and data augmentation techniques. Due to space limitations, the values of the hyperparameters used for training can be found under our code repository. We normalized the randomly drawn, fixed class representation vectors by dividing them by their l_2-norm. All reported results are averaged over 3 runs.
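The initialization of the fixed class vectors described above can be sketched as follows. This is an illustrative NumPy version (the function name and the 512-dimensional feature size are our assumptions; e.g., a ResNet-style encoder commonly outputs 512-d features); the resulting matrix is simply excluded from the optimizer.

```python
import numpy as np

def fixed_class_vectors(m, d, seed=0):
    """Draw m class vectors of dimension d from a standard normal and
    l2-normalize each row. The result is kept frozen (never updated)
    throughout training."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, d))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

W = fixed_class_vectors(m=10, d=512)   # e.g., CIFAR-10 with 512-d features
```

Because the rows are random directions on the unit hypersphere, their pairwise cosine-similarities are low regardless of any visual similarity between the corresponding classes.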

4.2. GENERALIZATION

To measure how well the models generalize to the learned concepts, we evaluated them on images containing objects from the same target classes appearing in their training dataset. For evaluating the models trained on STL-10 and CIFAR-100, we manually collected 2000 and 6000 images, respectively, from the publicly available Open Images V4 dataset (Krasin et al., 2017). For CIFAR-10, we used the CIFAR-10.1 dataset (Recht et al., 2018). All collected sets contain an equal number of images per class. We omitted models trained on Tiny ImageNet from this evaluation since we were not able to collect images for all classes appearing in this set. Table 5 summarizes the results for all models. The results suggest that excluding the class representation vectors from training does not decrease the generalization to the learned concepts.

4.3. ROBUSTNESS TO CORRUPTIONS

Next, we verified that excluding the class vectors from training does not decrease the models' robustness to image corruptions. To this end, we applied three types of algorithmically generated corruptions to the test set and evaluated the accuracy of the models on these sets. The corruptions are impulse noise, JPEG compression, and defocus blur. Corruptions were generated using Jung (2018) and are available under our repository. The results, shown in Table 6, suggest that randomly drawn fixed class vectors allow models to remain highly robust to image corruptions.
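The paper generates these corruptions with the imgaug library (Jung, 2018). As a rough, self-contained illustration of the impulse-noise corruption only, the following simplified salt-and-pepper sketch (our stand-in, not the library's implementation) flips a random fraction of pixels to black or white:

```python
import numpy as np

def impulse_noise(img, p=0.05, seed=0):
    """Simplified impulse (salt-and-pepper) noise: a random fraction p of
    pixel locations is set to 0 or 255. A toy stand-in for the imgaug
    corruption used in the paper, applied to a uint8 image."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    hit = rng.random(img.shape[:2]) < p      # which pixel locations are corrupted
    salt = rng.random(img.shape[:2]) < 0.5   # white vs. black impulses
    out[hit & salt] = 255
    out[hit & ~salt] = 0
    return out

noisy = impulse_noise(np.full((32, 32, 3), 128, dtype=np.uint8), p=0.1)
```

JPEG compression and defocus blur require codec and filtering machinery and are best taken directly from the corruption library.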

5. CONCLUSION

In this paper, we proposed randomly drawing the parameters of the classification layer and excluding them from training. We showed that doing so can improve the inter-class separability, intra-class compactness, and overall accuracy of the model, when maximizing either the dot-product or the cosine-similarity between the image representation and the class vectors. We analyzed the cause that prevents non-fixed cosine-maximization models from converging. We also presented the generalization abilities of models with fixed and non-fixed classification layers.



Tiny ImageNet: https://tiny-imagenet.herokuapp.com/



Figure 1: A scheme of an image classification model with three target classes. Edges of the same color compose a class representation vector.

Figure 2: The matrices show the cosine-similarity between the class vectors of non-fixed (left) and fixed (middle) dot-product maximization models trained on the STL-10 dataset. Right: the number of misclassifications as a function of the cosine-similarity between class vectors.

Figure 3: Feature distribution visualization of non-fixed and fixed dot-product maximization models trained on CIFAR-10.

NormFace (Wang et al., 2017) reported improved results for the face verification task, and many recent variants also integrated the scaling factor S for convergence when optimizing the cosine-similarity (Wang et al., 2018b; Wojke & Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019).

Figure 4: The matrices show the cosine-similarity between the class vectors of non-fixed cosine-similarity maximization models trained on the STL-10 dataset with S = 1 (left) and S = 20 (middle). Right: the number of misclassifications as a function of the cosine-similarity between class vectors.

Fixed cosine-maximization models successfully converge when S = 1, since the class vectors are drawn randomly and kept fixed, so they remain well separated throughout training.

Figure 5: Feature distribution visualization of fixed and non-fixed cosine-maximization ResNet-18 models trained on CIFAR-10.

Table 1: Classification accuracy of fixed and non-fixed dot-product maximization models.

Table 3: Classification accuracy of fixed and non-fixed cosine-similarity maximization models with their optimal S.

Table 5: Classification accuracy of fixed and non-fixed models on the generalization sets.

Table 6: Classification accuracy of fixed and non-fixed models on corrupted test-set images.

