REDESIGNING THE CLASSIFICATION LAYER BY RANDOMIZING THE CLASS REPRESENTATION VECTORS

Anonymous

Abstract

Neural image classification models typically consist of two components. The first is an image encoder, which is responsible for encoding a given raw image into a representative vector. The second is the classification component, which is often implemented by projecting the representative vector onto target class vectors. The target class vectors, along with the rest of the model parameters, are estimated so as to minimize the loss function. In this paper, we analyze how simple design choices for the classification layer affect the learning dynamics. We show that standard cross-entropy training implicitly captures visual similarities between different classes, which might deteriorate accuracy or even prevent some models from converging. We propose to draw the class vectors randomly and keep them fixed during training, thus invalidating the visual similarities encoded in these vectors. We analyze the effects of keeping the class vectors fixed and show that doing so can increase the inter-class separability, intra-class compactness, and overall model accuracy, while maintaining robustness to image corruptions and the generalization of the learned concepts.

1. INTRODUCTION

Deep learning models have achieved breakthroughs in classification tasks, setting state-of-the-art results in various fields such as speech recognition (Chiu et al., 2018), natural language processing (Vaswani et al., 2017), and computer vision (Huang et al., 2017). In image classification tasks, the most common approach to training the models is as follows: first, a convolutional neural network (CNN) is used to extract a representative vector, denoted here as the image representation vector (also known as the feature vector). Then, at the classification layer, this vector is projected onto a set of weight vectors of the different target classes to create the class scores, as depicted in Fig. 1. Last, a softmax function is applied to normalize the class scores. During training, the parameters of both the CNN and the classification layer are updated to minimize the cross-entropy loss. We refer to this procedure as the dot-product maximization approach, since such training ends up maximizing the dot-product between the image representation vector and the target weight vector.

Recently, it was demonstrated that despite the excellent performance of the dot-product maximization approach, it does not necessarily encourage discriminative learning of features, nor does it enforce intra-class compactness and inter-class separability (Liu et al., 2016; Wang et al., 2017; Liu et al., 2017). The intra-class compactness indicates how closely image representations from the same class relate to each other, whereas the inter-class separability indicates how far apart image representations from different classes are. Several works have proposed different approaches to address these caveats (Liu et al., 2016; 2017; Wang et al., 2017; 2018b;a). One of the most effective yet most straightforward solutions is NormFace (Wang et al., 2017), where it was suggested to maximize the cosine-similarity between vectors by normalizing both the image and class vectors.
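To make the procedure concrete, the following is a minimal NumPy sketch of a dot-product maximization layer trained with softmax cross-entropy; all dimensions, the batch size, and the initialization scale are illustrative choices, not values taken from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 5                                 # feature dimension and number of classes

# Learned class representation vectors: the rows of the classification matrix W.
W = rng.normal(scale=0.01, size=(m, d))

def class_scores(feats, W):
    """Dot-product maximization: project each image representation
    onto every class vector to obtain the class scores (logits)."""
    return feats @ W.T                       # shape: (batch, m)

def softmax_cross_entropy(logits, labels):
    """Normalize the scores with softmax and take the mean negative
    log-likelihood of the target class, as in standard training."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

feats = rng.normal(size=(8, d))              # stand-in for the encoder output f_theta(x)
labels = rng.integers(0, m, size=8)
loss = softmax_cross_entropy(class_scores(feats, W), labels)
```

During training, the gradient of this loss would flow into both W and the encoder parameters, which is what lets the class vectors absorb visual similarities between classes.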
However, the authors found that when maximizing the cosine-similarity directly, the models fail to converge, and hypothesized that the cause is the bounded range of the logits vector. To allow convergence, the authors added a scaling factor that multiplies the logits vector. This approach has been widely adopted by multiple works (Wang et al., 2018b; Wojke & Bewley, 2018; Deng et al., 2019; Wang et al., 2018a; Fan et al., 2019). Here we refer to this approach as the cosine-similarity maximization approach.

This paper is focused on redesigning the classification layer and on its role while kept fixed during training. We show that the visual similarity between classes is implicitly captured by the class vectors when they are learned by maximizing either the dot-product or the cosine-similarity between the image representation vector and the class vectors, and that the class vectors of visually similar categories are close in their angle in the space. We investigate the effects of excluding the class vectors from training and simply drawing them randomly, distributed over a hypersphere. We demonstrate that this process, which eliminates the visual similarities from the classification layer, boosts accuracy and improves the inter-class separability (using either dot-product maximization or cosine-similarity maximization). Moreover, we show that fixing the class representation vectors can solve the issues that prevent some models from converging (under the cosine-similarity maximization approach), and can further increase the intra-class compactness. Last, we show that generalization of the learned concepts and robustness to noise are both not influenced by ignoring the visual similarities encoded in the class vectors.

Recent work by Hoffer et al. (2018) suggested fixing the classification layer to allow increased computational and memory efficiency.
The authors showed that the performance of models with a fixed classification layer is on par with, or slightly below (up to 0.5% in absolute accuracy), that of models with a non-fixed classification layer, while allowing a substantial reduction in the number of learned parameters. In that paper, the authors compared the performance of dot-product maximization models with a non-fixed classification layer against the performance of cosine-similarity maximization models with a fixed classification layer and an integrated scaling factor. Such a comparison might not indicate the benefits of fixing the classification layer, since dot-product maximization is linear with respect to the image representation while cosine-similarity maximization is not. In our paper, by contrast, we compare fixed against non-fixed dot-product maximization models as well as fixed against non-fixed cosine-maximization models, and show that fixing the classification layer can increase accuracy by up to 4% in absolute terms. Moreover, while cosine-maximization models were suggested to improve the intra-class compactness, we reveal that integrating a scaling factor that multiplies the logits decreases the intra-class compactness. We demonstrate that by fixing the classification layer in cosine-maximization models, the models can converge and achieve high performance without the scaling factor, and significantly improve their intra-class compactness.

The outline of this paper is as follows. In Sections 2 and 3, we formulate dot-product and cosine-similarity maximization models, respectively, and analyze the effects of fixing the class vectors. In Section 4, we describe the training procedure, compare the learning dynamics, and assess the generalization and robustness to corruptions of the evaluated models. We conclude the paper in Section 5.
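As a concrete illustration of the two design choices discussed above, the following NumPy sketch contrasts a scaled cosine-similarity layer with fixed, randomly drawn class vectors; the dimensions, batch size, and scale value are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, s = 16, 5, 20.0        # feature dim, class count, scaling factor (illustrative)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine-similarity maximization: both the image representations and the
# class vectors are L2-normalized, so each logit is a cosine in [-1, 1];
# the scaling factor s stretches this bounded range so softmax can saturate.
def cosine_logits(feats, W, s):
    return s * (normalize(feats) @ normalize(W).T)

# Fixed variant: draw the class vectors once at random on the unit
# hypersphere and never update them during training.
W_fixed = normalize(rng.normal(size=(m, d)))

feats = rng.normal(size=(8, d))              # stand-in for f_theta(x)
labels = rng.integers(0, m, size=8)
logits = cosine_logits(feats, W_fixed, s)

# One hand-written softmax cross-entropy backward step: with W_fixed
# frozen, the gradient flows only toward the encoder side.
z = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = z / z.sum(axis=1, keepdims=True)
onehot = np.eye(m)[labels]
# Gradient w.r.t. the normalized features (the chain through the
# normalization and the encoder itself is omitted for brevity).
grad_feats_hat = (probs - onehot) @ (s * W_fixed) / len(labels)
# The corresponding gradient w.r.t. W is simply never computed: W_fixed is fixed.
```

Removing the gradient step on W is what excludes the class vectors from training; the encoder must then adapt its representations to the randomly placed targets.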

2. FIXED DOT-PRODUCT MAXIMIZATION

Assume an image classification task with m possible classes. Denote the training set of N examples by S = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ X is the i-th instance, and y_i is the corresponding class such that y_i ∈ {1, ..., m}. In image classification, a dot-product maximization model consists of two parts. The first is the image encoder, denoted as f_θ : X → R^d, which is responsible for representing the input image as a d-dimensional vector, f_θ(x) ∈ R^d, where θ is a set of learnable parameters. The second



Figure 1: A scheme of an image classification model with three target classes. Edges of the same color compose a class representation vector.

