DEMYSTIFYING LOSS FUNCTIONS FOR CLASSIFICATION

Abstract

It is common to use the softmax cross-entropy loss to train neural networks on classification datasets where a single class label is assigned to each example. However, it has been shown that modifying softmax cross-entropy with label smoothing or regularizers such as dropout can lead to higher performance. In this paper, we compare a variety of loss functions and output layer regularization strategies that improve performance on image classification tasks. We find differences in the outputs of networks trained with these different objectives, in terms of accuracy, calibration, out-of-distribution robustness, and predictions. However, differences in hidden representations of networks trained with different objectives are restricted to the last few layers; representational similarity reveals no differences among network layers that are not close to the output. We show that all objectives that improve over vanilla softmax loss produce greater class separation in the penultimate layer of the network, which potentially accounts for improved performance on the original task, but results in features that transfer worse to other tasks.

1. INTRODUCTION

Softmax cross-entropy (Bridle, 1990a;b) is the canonical loss function for multi-class classification in deep learning. However, the popularity of softmax cross-entropy appears to be driven by the aesthetic appeal of its probabilistic interpretation, rather than by practical superiority. Early studies reported no empirical advantage of softmax cross-entropy over squared-error loss (Richard & Lippmann, 1991; Weigend, 1993; Dietterich & Bakiri, 1994), and more recent work has found other objectives that yield better performance on certain tasks (e.g. Szegedy et al., 2016; Liu et al., 2016; Beyer et al., 2020). These studies show that it is possible to achieve meaningful improvements in accuracy simply by changing the loss function. Nonetheless, there has been little comparison among these alternative objectives, and even less investigation of why some objectives work better than others.

In this paper, we perform a comprehensive empirical study of the properties of 9 common and less-common loss functions and regularizers for deep learning, on standard image classification benchmarks. Most existing work in this area has proposed a new loss function or regularizer and attempted to demonstrate its superiority over a limited set of alternatives on benchmark tasks. This approach creates strong incentives to demonstrate the superiority of the proposed loss and little incentive to understand its limitations. Our goal is instead to understand when one might want to use one loss function or regularizer over another and, more broadly, to understand the extent to which neural network performance and representations can be manipulated through the choice of objective alone. Our key contributions are as follows:

• We rigorously benchmark 9 training objectives on standard image classification tasks, measuring accuracy, calibration, and out-of-distribution robustness. Many objectives improve over vanilla softmax cross-entropy loss, but no single objective performs best on all benchmarks.

• We demonstrate that different loss functions and regularizers produce different patterns of predictions, but combining them does not appear to improve accuracy. However, regularization that affects the input, such as AutoAugment (Cubuk et al., 2019) and Mixup (Zhang et al., 2017), can provide further gains. Our best models achieve state-of-the-art accuracy (79.1%/94.5% top-1/top-5) on ImageNet for unmodified ResNet-50 architectures trained from scratch.

• Using centered kernel alignment (CKA), we measure the similarity of the hidden representations of networks trained with different objectives. We show that the choice of objective affects representations in network layers close to the output, but earlier layers are highly similar regardless of what loss function is used.

• We show that all objectives that improve accuracy over softmax cross-entropy also lead to greater separation between representations of different classes in the penultimate layer. This improvement in class separation may be related to the boost in accuracy these objectives provide. However, representations with greater class separation are also more heavily specialized for the original task, and linear classifiers operating on these features perform substantially worse on transfer tasks.
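The CKA comparisons above can be computed from raw activation matrices alone. The following NumPy sketch implements the linear-kernel variant of CKA; the text does not specify which kernel is used, so this choice, the function name, and the synthetic test data are illustrative rather than taken from the paper:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X: (n_examples, d1), Y: (n_examples, d2). Returns a value in [0, 1];
    1 indicates identical representations up to rotation and scaling.
    """
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denom = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / denom

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
# Linear CKA is invariant to orthogonal transformation and isotropic scaling,
# so comparing X to a rotated, rescaled copy of itself yields 1.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
print(linear_cka(X, 2.0 * X @ Q))  # ≈ 1.0 up to floating point
```

This invariance is what makes CKA suitable for comparing layers of independently trained networks, where representations can match only up to such transformations.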

2. LOSS FUNCTIONS AND OUTPUT LAYER REGULARIZERS

We investigate 9 loss functions and output layer regularizers. Let $z \in \mathbb{R}^K$ denote the network's output ("logit") vector, and let $t \in \{0, 1\}^K$ denote a one-hot vector of targets, where $\sum_k t_k = 1$. Let $x \in \mathbb{R}^M$ denote the vector of penultimate layer activations, which gives rise to the output vector as $z = Wx + b$, where $W \in \mathbb{R}^{K \times M}$ is the matrix of final layer weights and $b$ is a vector of biases. All investigated loss functions include a term that encourages $z$ to have a high dot product with $t$. To avoid solutions that make this dot product large simply by increasing the scale of $z$, these loss functions must also include one or more contractive terms and/or normalize $z$. Many "regularizers" correspond to additional contractive terms added to the loss, so we do not draw a firm distinction between losses and regularizers. We describe each loss in detail below. Hyperparameters are provided in Appendix A.1.

Softmax cross-entropy (Bridle, 1990a;b) is the de facto loss function for multi-class classification in deep learning. It can be written as:

$$\mathcal{L}_{\mathrm{softmax}}(z, t) = -\sum_{k=1}^{K} t_k \log \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} = -\sum_{k=1}^{K} t_k z_k + \log \sum_{k=1}^{K} e^{z_k}. \tag{1}$$

The loss consists of a term that maximizes the dot product between the logits and targets, as well as a contractive term that minimizes the LogSumExp of the logits.

Label smoothing (Szegedy et al., 2016) "smooths" the targets for softmax cross-entropy loss. The new targets are given by mixing the original targets with a uniform distribution over all labels, $t' = (1 - \alpha)t + \alpha/K$, where $\alpha$ determines the weighting of the original and uniform targets. In order to maintain the same scale for the gradient with respect to the positive logit, in our experiments we scale the label smoothing loss by $1/(1 - \alpha)$. The resulting loss is:

$$\mathcal{L}_{\mathrm{smooth}}(z, t; \alpha) = -\frac{1}{1 - \alpha} \sum_{k=1}^{K} \left( (1 - \alpha)\, t_k + \frac{\alpha}{K} \right) \log \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \tag{2}$$

$$= -\sum_{k=1}^{K} t_k z_k + \frac{1}{1 - \alpha} \log \sum_{k=1}^{K} e^{z_k} - \frac{\alpha}{(1 - \alpha) K} \sum_{k=1}^{K} z_k. \tag{3}$$
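The algebraic identities in these equations are easy to verify numerically. The NumPy sketch below (function names are ours, not from the paper) computes softmax cross-entropy and the rescaled label smoothing loss directly from the definitions, then checks them against the decomposed forms:

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax: z - logsumexp(z).
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def softmax_xent(z, t):
    # -sum_k t_k log softmax(z)_k
    return -(t * log_softmax(z)).sum()

def label_smoothing_xent(z, t, alpha=0.1):
    # Mix the one-hot targets with a uniform distribution and
    # rescale the loss by 1/(1 - alpha).
    K = len(t)
    t_smooth = (1 - alpha) * t + alpha / K
    return -(t_smooth * log_softmax(z)).sum() / (1 - alpha)

z = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 0.0])
alpha, K = 0.1, 3
lse = np.log(np.exp(z).sum())

# Decomposed form of the softmax loss: -sum_k t_k z_k + logsumexp(z).
assert np.isclose(softmax_xent(z, t), -(t * z).sum() + lse)

# Decomposed form of the label smoothing loss:
# -sum_k t_k z_k + logsumexp(z)/(1-a) - a/((1-a)K) * sum_k z_k.
decomposed = -(t * z).sum() + lse / (1 - alpha) \
    - alpha / ((1 - alpha) * K) * z.sum()
assert np.isclose(label_smoothing_xent(z, t, alpha), decomposed)
```

The final term in the decomposed label smoothing loss is what pushes all logits upward relative to plain softmax cross-entropy.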
Compared to softmax cross-entropy loss, label smoothing adds an additional term that encourages the logits to be positive. Müller et al. (2019) previously showed that label smoothing improves calibration and encourages class centroids to lie at the vertices of a regular simplex.

Dropout (Srivastava et al., 2014) is among the most prominent regularizers in the deep learning literature. We consider dropout applied to the penultimate layer of the neural network, i.e., when inputs to the final layer are randomly kept with some probability $\rho$. When employing dropout, we replace the penultimate layer activations $x$ with $\tilde{x} = x \odot \xi / \rho$, where $\xi_i \sim \mathrm{Bernoulli}(\rho)$. Writing the dropped-out logits as $\tilde{z} = W\tilde{x} + b$, the dropout loss is:

$$\mathcal{L}_{\mathrm{dropout}}(W, b, x, t; \rho) = \mathbb{E}_{\xi}\left[\mathcal{L}_{\mathrm{softmax}}(\tilde{z}, t)\right].$$

Dropout produces both implicit regularization, by introducing noise into the optimization process, and explicit regularization, by altering the representation that minimizes the loss (Wei et al., 2020). Wager et al. (2013) previously derived a quadratic approximation to the explicit regularizer for logistic regression and other generalized linear models; this strategy can also be used to approximate the explicit regularization imposed by dropout on the penultimate layer of a neural network with softmax loss. However, we observe that penultimate layer dropout has similar effects to extra final layer $L_2$ regularization, suggesting that implicit regularization is the more important component.

Extra final layer $L_2$ regularization: It is common to place the same $L_2$ regularization on the final layer as elsewhere in the network. However, we find that applying greater $L_2$ regularization to the

