REDESIGNING THE CLASSIFICATION LAYER BY RANDOMIZING THE CLASS REPRESENTATION VECTORS

Anonymous

Abstract

Neural image classification models typically consist of two components. The first is an image encoder, which is responsible for encoding a given raw image into a representative vector. The second is the classification component, which is often implemented by projecting the representative vector onto target class vectors. The target class vectors, along with the rest of the model parameters, are estimated so as to minimize the loss function. In this paper, we analyze how simple design choices for the classification layer affect the learning dynamics. We show that standard cross-entropy training implicitly captures visual similarities between different classes, which might deteriorate accuracy or even prevent some models from converging. We propose to draw the class vectors randomly and keep them fixed during training, thus invalidating the visual similarities encoded in these vectors. We analyze the effects of keeping the class vectors fixed and show that it can increase the inter-class separability, intra-class compactness, and the overall model accuracy, while maintaining the robustness to image corruptions and the generalization of the learned concepts.

1. INTRODUCTION

Deep learning models have achieved breakthroughs in classification tasks, setting state-of-the-art results in various fields such as speech recognition (Chiu et al., 2018), natural language processing (Vaswani et al., 2017), and computer vision (Huang et al., 2017). In image classification, the most common training approach is as follows: first, a convolutional neural network (CNN) is used to extract a representative vector, denoted here as the image representation vector (also known as the feature vector). Then, at the classification layer, this vector is projected onto a set of weight vectors of the different target classes to create the class scores, as depicted in Fig. 1. Last, a softmax function is applied to normalize the class scores. During training, the parameters of both the CNN and the classification layer are updated to minimize the cross-entropy loss. We refer to this procedure as the dot-product maximization approach, since such training ends up maximizing the dot-product between the image representation vector and the target weight vector. Recently, it was demonstrated that despite the excellent performance of the dot-product maximization approach, it does not necessarily encourage discriminative learning of features, nor does it
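The forward pass described above can be illustrated with a minimal NumPy sketch. The dimensions, the random stand-in for the encoder output, and the unit-norm initialization of the fixed class vectors are illustrative assumptions, not the paper's exact configuration; the sketch only contrasts the standard trainable classification layer with the proposed fixed randomly drawn class vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    # Numerically stable softmax over the class dimension.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target):
    # Negative log-likelihood of the target class.
    return -np.log(probs[target])

# Hypothetical dimensions: a 512-d representation vector and 10 classes.
dim, num_classes = 512, 10

# Standard dot-product maximization: the class representation vectors W
# are trainable parameters, updated jointly with the CNN encoder.
W_trainable = rng.normal(scale=0.01, size=(num_classes, dim))

# Proposed alternative: draw the class vectors once at random and keep
# them fixed (excluded from gradient updates) throughout training.
W_fixed = rng.normal(size=(num_classes, dim))
W_fixed /= np.linalg.norm(W_fixed, axis=1, keepdims=True)

# Forward pass for a single image representation vector.
z = rng.normal(size=dim)        # stand-in for the CNN feature vector
scores = W_fixed @ z            # dot product with each class vector
probs = softmax(scores)         # normalized class scores
loss = cross_entropy(probs, target=3)
```

In a deep learning framework, the fixed variant amounts to disabling gradient updates for the final linear layer's weights after random initialization, so only the encoder adapts to the randomly assigned class directions.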



Figure 1: A scheme of an image classification model with three target classes. Edges of the same color compose a class representation vector.

