DEMYSTIFYING LOSS FUNCTIONS FOR CLASSIFICATION

Abstract

It is common to use the softmax cross-entropy loss to train neural networks on classification datasets where a single class label is assigned to each example. However, it has been shown that modifying softmax cross-entropy with label smoothing or regularizers such as dropout can lead to higher performance. In this paper, we compare a variety of loss functions and output layer regularization strategies that improve performance on image classification tasks. We find differences in the outputs of networks trained with these different objectives, in terms of accuracy, calibration, out-of-distribution robustness, and predictions. However, differences in hidden representations of networks trained with different objectives are restricted to the last few layers; representational similarity reveals no differences among network layers that are not close to the output. We show that all objectives that improve over vanilla softmax loss produce greater class separation in the penultimate layer of the network, which potentially accounts for improved performance on the original task, but results in features that transfer worse to other tasks.

1. INTRODUCTION

Softmax cross-entropy (Bridle, 1990a;b) is the canonical loss function for multi-class classification in deep learning. However, the popularity of softmax cross-entropy appears to be driven by the aesthetic appeal of its probabilistic interpretation, rather than by practical superiority. Early studies reported no empirical advantage of softmax cross-entropy over squared-error loss (Richard & Lippmann, 1991; Weigend, 1993; Dietterich & Bakiri, 1994), and more recent work has found other objectives that yield better performance on certain tasks (e.g. Szegedy et al., 2016; Liu et al., 2016; Beyer et al., 2020). These studies show that it is possible to achieve meaningful improvements in accuracy simply by changing the loss function. Nonetheless, there has been little comparison among these alternative objectives, and even less investigation of why some objectives work better than others. In this paper, we perform a comprehensive empirical study of the properties of 9 common and less-common loss functions and regularizers for deep learning, on standard image classification benchmarks. Most existing work in this area has proposed a new loss function or regularizer and attempted to demonstrate its superiority over a limited set of alternatives on benchmark tasks. This approach creates strong incentives to demonstrate the superiority of the proposed loss and little incentive to understand its limitations. Our goal is instead to understand when one might want to use one loss function or regularizer over another and, more broadly, to understand the extent to which neural network performance and representations can be manipulated through the choice of objective alone. Our key contributions are as follows:

• We rigorously benchmark 9 training objectives on standard image classification tasks, measuring accuracy, calibration, and out-of-distribution robustness. Many objectives improve over vanilla softmax cross-entropy loss, but no single objective performs best on all benchmarks.

• We demonstrate that different loss functions and regularizers produce different patterns of predictions, but combining them does not appear to improve accuracy. However, regularization that affects the input, such as AutoAugment (Cubuk et al., 2019) and Mixup (Zhang et al., 2017), can provide further gains. Our best models achieve state-of-the-art accuracy (79.1%/94.5% top-1/top-5) on ImageNet for unmodified ResNet-50 architectures trained from scratch.

• Using centered kernel alignment (CKA), we measure the similarity of the hidden representations of networks trained with different objectives. We show that the choice of objective affects representations in network layers close to the output, but earlier layers are highly similar regardless of what loss function is used.

• We show that all objectives that improve accuracy over softmax cross-entropy also lead to greater separation between representations of different classes in the penultimate layer. This improvement in class separation may be related to the boost in accuracy these objectives provide. However, representations with greater class separation are also more heavily specialized for the original task, and linear classifiers operating on these features perform substantially worse on transfer tasks.

2. LOSS FUNCTIONS AND OUTPUT LAYER REGULARIZERS

We investigate 9 loss functions and output layer regularizers. Let $z \in \mathbb{R}^K$ denote the network's output ("logit") vector, and let $t \in \{0, 1\}^K$ denote a one-hot vector of targets, where $\|t\|_1 = 1$. Let $x \in \mathbb{R}^M$ denote the vector of penultimate layer activations, which gives rise to the output vector as $z = Wx + b$, where $W \in \mathbb{R}^{K \times M}$ is the matrix of final layer weights and $b$ is a vector of biases. All investigated loss functions include a term that encourages $z$ to have a high dot product with $t$. To avoid solutions that make this dot product large simply by increasing the scale of $z$, these loss functions must also include one or more contractive terms and/or normalize $z$. Many "regularizers" correspond to additional contractive terms added to the loss, so we do not draw a firm distinction between losses and regularizers. We describe each loss in detail below. Hyperparameters are provided in Appendix A.1.

Softmax cross-entropy (Bridle, 1990a;b) is the de facto loss function for multi-class classification in deep learning. It can be written as:

$$\mathcal{L}_{\mathrm{softmax}}(z, t) = -\sum_{k=1}^K t_k \log \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} = -\sum_{k=1}^K t_k z_k + \log \sum_{k=1}^K e^{z_k}. \tag{1}$$

The loss consists of a term that maximizes the dot product between the logits and targets, as well as a contractive term that minimizes the LogSumExp of the logits.

Label smoothing (Szegedy et al., 2016) "smooths" the targets for softmax cross-entropy loss. The new targets are given by mixing the original targets with a uniform distribution over all labels, $t' = t(1 - \alpha) + \alpha/K$, where $\alpha$ determines the weighting of the original and uniform targets. In order to maintain the same scale for the gradient with respect to the positive logit, in our experiments, we scale the label smoothing loss by $1/(1-\alpha)$. The resulting loss is:

$$\mathcal{L}_{\mathrm{smooth}}(z, t; \alpha) = -\frac{1}{1-\alpha}\sum_{k=1}^K \left((1-\alpha)t_k + \frac{\alpha}{K}\right) \log \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}} \tag{2}$$

$$= -\sum_{k=1}^K t_k z_k + \frac{1}{1-\alpha}\log\sum_{k=1}^K e^{z_k} - \frac{\alpha}{(1-\alpha)K}\sum_{k=1}^K z_k. \tag{3}$$
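To make these losses concrete, below is a minimal NumPy sketch of softmax cross-entropy and the scaled label smoothing loss in its expanded form (Eq. 3). This is illustrative only; our experiments use standard framework implementations, and the function names here are ours:

```python
import numpy as np

def softmax_xent(z, t):
    """Softmax cross-entropy: -<t, z> + LogSumExp(z)."""
    zmax = z.max()  # subtract the max for numerical stability
    return -np.dot(t, z) + np.log(np.sum(np.exp(z - zmax))) + zmax

def label_smoothing_xent(z, t, alpha=0.1):
    """Label smoothing loss in expanded form (Eq. 3),
    scaled by 1/(1 - alpha) as described in the text."""
    K = len(z)
    zmax = z.max()
    lse = np.log(np.sum(np.exp(z - zmax))) + zmax
    return (-np.dot(t, z) + lse / (1 - alpha)
            - alpha / ((1 - alpha) * K) * z.sum())
```

As a sanity check, the expanded form agrees with applying softmax cross-entropy directly to the smoothed targets $t' = (1-\alpha)t + \alpha/K$ and rescaling by $1/(1-\alpha)$.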
Compared to softmax cross-entropy loss, label smoothing adds an additional term that encourages the logits to be positive. Müller et al. (2019) previously showed that label smoothing improves calibration and encourages class centroids to lie at the vertices of a regular simplex.

Dropout (Srivastava et al., 2014) is among the most prominent regularizers in the deep learning literature. We consider dropout applied to the penultimate layer of the neural network, i.e., inputs to the final layer are randomly kept with some probability $\rho$. When employing dropout, we replace the penultimate layer activations $x$ with $\tilde{x} = x \odot \xi/\rho$, where $\xi_i \sim \mathrm{Bernoulli}(\rho)$. Writing the dropped-out logits as $\tilde{z} = W\tilde{x} + b$, the dropout loss is:

$$\mathcal{L}_{\mathrm{dropout}}(W, b, x, t; \rho) = \mathbb{E}_\xi\left[\mathcal{L}_{\mathrm{softmax}}(\tilde{z}, t)\right] \tag{4}$$

Dropout produces both implicit regularization, by introducing noise into the optimization process, and explicit regularization, by altering the representation that minimizes the loss (Wei et al., 2020). Wager et al. (2013) previously derived a quadratic approximation to the explicit regularizer for logistic regression and other generalized linear models; this strategy can also be used to approximate the explicit regularization imposed by dropout on the penultimate layer of a neural network with softmax loss. However, we observe that penultimate layer dropout has similar effects to extra final layer $L_2$ regularization, suggesting that implicit regularization is the more important component.

Extra final layer $L_2$ regularization: It is common to place the same $L_2$ regularization on the final layer as elsewhere in the network. However, we find that applying greater $L_2$ regularization to the final layer can improve performance. In architectures with batch normalization, adding additional $L_2$ regularization has no explicit regularizing effect if the learnable scale ($\gamma$) parameters are unregularized, but it still exerts an implicit regularizing effect by altering optimization.
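The expectation defining the dropout loss has no closed form; in practice it is approximated with one dropout sample per training step. A Monte Carlo sketch (illustrative only, with hypothetical shapes; names are ours):

```python
import numpy as np

def softmax_xent(z, t):
    """Softmax cross-entropy with a stabilized LogSumExp."""
    zmax = z.max()
    return -np.dot(t, z) + np.log(np.sum(np.exp(z - zmax))) + zmax

def dropout_loss_mc(W, b, x, t, rho=0.8, n_samples=100, seed=0):
    """Monte Carlo estimate of E_xi[L_softmax(W(x * xi / rho) + b, t)],
    where each xi_i ~ Bernoulli(rho) keeps unit i with probability rho."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        xi = rng.binomial(1, rho, size=x.shape)
        total += softmax_xent(W @ (x * xi / rho) + b, t)
    return total / n_samples
```

With $\rho = 1$ no units are dropped and the estimate reduces exactly to the plain softmax cross-entropy loss.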
Logit penalty: Whereas label smoothing encourages logits not to be too negative, and dropout imposes a penalty on the logits that depends on the covariance of the weights, an alternative possibility is simply to explicitly constrain the logits to be small in $L_2$ norm:

$$\mathcal{L}_{\mathrm{logit\_penalty}}(z, t; \beta) = \mathcal{L}_{\mathrm{softmax}}(z, t) + \beta\|z\|_2^2 \tag{5}$$

Logit normalization: We consider the use of $L_2$ normalization, rather than regularization, of the logits. Because the entropy of the output of the softmax function depends on the scale of the logits, which is lost after normalization, we introduce an additional temperature parameter $\tau$ that controls the magnitude of the logit vector, and thus, indirectly, the minimum entropy of the output distribution:

$$\mathcal{L}_{\mathrm{logit\_norm}}(z, t; \tau) = \mathcal{L}_{\mathrm{softmax}}(z/(\tau\|z\|_2), t) \tag{6}$$

Cosine softmax: We additionally consider $L_2$ normalization of both the penultimate layer features and the final layer weights corresponding to each class. This loss is equivalent to softmax cross-entropy loss if the logits are given by the cosine similarity $\mathrm{sim}(x, y) = x^\top y/(\|x\|\|y\|)$ between the weight vector and the penultimate layer features, plus a per-class bias:

$$\mathcal{L}_{\mathrm{cos\_softmax}}(W, b, x, t; \tau) = -\sum_{k=1}^K t_k\left(\mathrm{sim}(W_{k,:}, x)/\tau + b_k\right) + \log\sum_{k=1}^K e^{\mathrm{sim}(W_{k,:}, x)/\tau + b_k} \tag{7}$$

where $\tau$ is a temperature parameter as above. Similar losses have appeared in previous literature (Ranjan et al., 2017; Wojke & Bewley, 2018; Wang et al., 2018a;b; Deng et al., 2019; Liu et al., 2017), and variants have introduced explicit additive or multiplicative margins to this loss that we do not consider here (Liu et al., 2017; Wang et al., 2018a;b; Deng et al., 2019). It is possible that performance could be enhanced by employing one of these margin schemes, although we observe that manipulating the temperature alone has a large impact on observed class separation.

Sigmoid cross-entropy is the natural analog to softmax cross-entropy for multi-label classification problems.
Although we investigate only single-label multi-class classification tasks, we train networks with sigmoid cross-entropy and evaluate accuracy by ranking the logits of the sigmoids. This approach is related to the one-versus-rest strategy for converting binary classifiers to multi-class classifiers. The sigmoid cross-entropy loss is:

$$\mathcal{L}_{\mathrm{sigmoid}}(z, t) = -\sum_{k=1}^K \left(t_k \log \frac{e^{z_k}}{e^{z_k}+1} + (1-t_k)\log\left(1 - \frac{e^{z_k}}{e^{z_k}+1}\right)\right) \tag{8}$$

$$= -\sum_{k=1}^K t_k z_k + \sum_{k=1}^K \log(e^{z_k}+1). \tag{9}$$

The LogSumExp term of softmax loss is replaced with the sum of the softplus-transformed logits. We initialize the biases of the logits $b$ to $-\log(K)$ so that the initial output probabilities are approximately $1/K$. Beyer et al. (2020) have previously shown that sigmoid cross-entropy loss leads to improved accuracy on ImageNet relative to softmax cross-entropy.

Squared error: Finally, we investigate squared error loss, as formulated by Hui & Belkin (2020):

$$\mathcal{L}_{\mathrm{squared\_error}}(z, t; \kappa, M) = \frac{1}{K}\sum_{k=1}^K \left(\kappa t_k (z_k - M)^2 + (1-t_k) z_k^2\right) \tag{10}$$

where $\kappa$ and $M$ are hyperparameters. $\kappa$ sets the strength of the loss for the correct class relative to incorrect classes, whereas $M$ controls the magnitude of the correct class target. When $\kappa = M = 1$, the loss is simply the mean squared error between $z$ and $t$. Like Hui & Belkin (2020), we find that placing greater weight on the correct class slightly improves ImageNet accuracy.
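A minimal NumPy sketch of the cosine softmax loss (Eq. 7) and the simplified forms of the sigmoid cross-entropy and squared error losses above; this is illustrative only, and the function names are ours:

```python
import numpy as np

def cosine_softmax(W, b, x, t, tau=0.05):
    """Cosine softmax (Eq. 7): softmax cross-entropy over logits
    sim(W_k, x)/tau + b_k, where sim is cosine similarity."""
    sims = (W @ x) / (np.linalg.norm(W, axis=1) * np.linalg.norm(x))
    z = sims / tau + b
    zmax = z.max()  # stabilize the LogSumExp
    return -np.dot(t, z) + np.log(np.sum(np.exp(z - zmax))) + zmax

def sigmoid_xent(z, t):
    """Sigmoid cross-entropy: -<t, z> + sum_k softplus(z_k);
    np.logaddexp(0, z) is a numerically stable softplus."""
    return -np.dot(t, z) + np.sum(np.logaddexp(0.0, z))

def squared_error(z, t, kappa=1.0, M=1.0):
    """Rescaled squared error. With kappa = M = 1 this is the
    mean squared error between z and the one-hot targets t."""
    return np.mean(kappa * t * (z - M) ** 2 + (1 - t) * z ** 2)
```

Note that `cosine_softmax` depends on `x` only through its direction, so the loss is invariant to rescaling of the penultimate layer features; the temperature τ alone controls the attainable confidence.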

3. RESULTS

For each loss, we trained 8 ResNet-50 (He et al., 2016; Gross & Wilber, 2016) models on ImageNet. To tune loss hyperparameters and the epoch for early stopping, we performed 3 training runs per hyperparameter configuration in which we held out a validation set of 50,046 ImageNet training examples. We also trained 25 batch-normalized All-CNN-C (Springenberg et al., 2014) models for each loss on CIFAR-10 (Krizhevsky & Hinton, 2009), where we performed extensive hyperparameter tuning for the learning rate and weight decay in addition to loss hyperparameters. We provide further details regarding training and hyperparameter selection in Appendix A.1.

3.1. REGULARIZERS AND ALTERNATIVE LOSSES ENHANCE ACCURACY

We found that, when properly tuned, many investigated objectives often provide a statistically significant improvement over softmax cross-entropy, as shown in Table 1. The range of improvements was small but meaningful, with sigmoid cross-entropy and cosine softmax both leading to an improvement of 0.9% in top-1 accuracy over the baseline for ResNet-50 on ImageNet. No single loss performed best across all benchmarks, although cosine softmax, logit penalty, and sigmoid were frequently among the top-performing losses. Losses that yielded large improvements in top-1 accuracy on ImageNet did not necessarily improve top-5 accuracy. For ResNet-50, sigmoid cross-entropy led to a large (0.9%) improvement in top-1 accuracy over vanilla softmax cross-entropy, but only a small (0.1%) improvement in top-5 accuracy. Cosine softmax performed comparably to sigmoid cross-entropy in terms of top-1 accuracy, but better in top-5 accuracy, with a 0.4% improvement over the baseline. Similar patterns were observed for Inception v3 (Table B.1), where sigmoid cross-entropy was the best-performing model in terms of top-1 accuracy but performed worse than the softmax baseline in terms of top-5 accuracy.

Losses also differed in out-of-distribution robustness and in the calibration of the resulting predictions. Table B.2 shows results on the out-of-distribution test sets ImageNet-v2 (Recht et al., 2019), ImageNet-A (Hendrycks et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-R (Hendrycks et al., 2020), and ImageNet-C (Hendrycks & Dietterich, 2019). In almost all cases, alternative loss functions outperformed softmax cross-entropy, with logit normalization and cosine softmax typically performing slightly better than the alternatives. Effects on calibration, shown in Table B.3, were mixed. Label smoothing substantially reduced expected calibration error (Guo et al., 2017), as previously shown by Müller et al. (2019), although cosine softmax achieved a lower negative log likelihood. However, there was no clear relationship between calibration and accuracy. Although logit penalty performed well in terms of accuracy, it provided the worst calibration of any objective investigated.

Our attempts to achieve higher accuracy by combining objectives were unsuccessful. As described in Appendix C, adding additional regularization did not improve the performance of well-tuned loss functions, and normalized variants of sigmoid cross-entropy loss failed to improve accuracy on ImageNet. However, it was still possible to improve networks' performance substantially using AutoAugment (Cubuk et al., 2019) or Mixup (Zhang et al., 2017), and gains from improved losses and these data augmentation strategies were approximately additive (Table C.2).

3.2. DIFFERENT LOSSES PRODUCE DIFFERENT PREDICTIONS

Given that the effects of regularization were non-additive, we sought to determine whether different regularizers and losses had similar effects on network predictions. For each pair of models, we measured the percentage of images in the ImageNet validation set for which both models predicted the same class. The results are shown in Figure 1. We also examined the percentage of images for which both models were either correct or incorrect, as well as the agreement on examples that both models got incorrect (Figure D.1). All ways of measuring similarity of predictions yielded similar results. Models' predictions clustered into distinct groups according to their loss functions. Models trained from different initializations with the same loss function were more similar than models trained with different loss functions. However, all models trained with (regularized) softmax loss or sigmoid loss were more similar to each other than they were to models trained with logit or feature + weight normalization. Networks trained with squared error were dissimilar to all others examined. Variability in the predictions of models trained with the same loss but different random initializations was large. Although standard deviations in top-1 accuracy were <0.2% for all losses, even the most similar pair of models disagreed on 13.9% of test set examples. When ensembling the 8 models trained with the same loss but different random initializations, the least similar losses (softmax and squared error) disagreed on only 11.5% of examples (Figure D.2). The accuracy of ensembles of models trained with different losses was closely related to the accuracies of the constituent models; ensembling models trained with the two best losses yielded only modest accuracy improvements over ensembles trained with either loss alone (Figure D.3).
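The pairwise agreement underlying Figure 1 reduces to a simple computation over top-1 predictions. A sketch (illustrative; array shapes and names are ours):

```python
import numpy as np

def agreement_matrix(preds):
    """preds: (n_models, n_examples) integer array of top-1 predictions.
    Returns an (n_models, n_models) matrix whose (i, j) entry is the
    fraction of examples on which models i and j predict the same class."""
    P = np.asarray(preds)
    n = P.shape[0]
    A = np.eye(n)  # every model agrees with itself on all examples
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = np.mean(P[i] == P[j])
    return A
```

The same matrix, restricted to examples both models classify incorrectly, gives the error-agreement variant shown in Figure D.1.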

3.3. LOSSES PRIMARILY AFFECT HIDDEN REPRESENTATIONS CLOSE TO THE OUTPUT

Loss functions differ not only in their predictions, but also in their effects on the internal representations of neural networks. In Figure 2, we show the sparsity of the activations of layers of networks trained with different loss functions. In all networks, the percentage of non-zero ReLU activations decreased with depth, attaining its minimum at the last convolutional layer. In the first three ResNet stages, activation sparsity was broadly similar regardless of the loss. However, in the final stage and penultimate average pooling layer, there were substantial differences. Given these observations, we wondered whether the choice of loss had any effect on representations in earlier layers at all. We used linear centered kernel alignment (CKA) (Kornblith et al., 2019a; Cortes et al., 2012; Cristianini et al., 2002) to measure the similarity between networks' hidden representations. As shown in Figure 3, representations of corresponding early, but not late, network layers were highly similar regardless of the loss function. These results provide further confirmation that the effects of the loss function are limited to later network layers.
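Linear CKA between two sets of activations has a compact closed form; a sketch under the feature-space formulation of Kornblith et al. (2019a):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n x p1) and Y (n x p2)
    of the same n examples, computed from column-centered features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return (np.linalg.norm(Y.T @ X, 'fro') ** 2
            / (np.linalg.norm(X.T @ X, 'fro')
               * np.linalg.norm(Y.T @ Y, 'fro')))
```

CKA is invariant to orthogonal transformation and isotropic scaling of either representation, which is what makes it suitable for comparing layers of independently trained networks.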

3.4. REGULARIZATION IMPROVES CLASS SEPARATION

Is there a feature of the investigated regularizers that can potentially explain their beneficial effect on accuracy? We demonstrate that all investigated regularizers and alternative losses force the network to shrink or eliminate directions in the penultimate layer representation space that are not aligned with the final layer weight vectors. The universality of this finding suggests it may relate to the accuracy-enhancing properties of these losses. The ratio of the average within-class cosine distance to the overall average cosine distance provides a measure, between 0 and 1, of how dispersed the examples within each class are. We take one minus this quantity to obtain a closed-form measure of class separation:

$$R^2 = 1 - \frac{\frac{1}{K}\sum_{k=1}^K \frac{1}{N_k^2}\sum_{m=1}^{N_k}\sum_{n=1}^{N_k}\left(1 - \mathrm{sim}(x_{k,m}, x_{k,n})\right)}{\frac{1}{K^2}\sum_{j=1}^K\sum_{k=1}^K \frac{1}{N_j N_k}\sum_{m=1}^{N_j}\sum_{n=1}^{N_k}\left(1 - \mathrm{sim}(x_{j,m}, x_{k,n})\right)} \tag{11}$$

where $x_{k,m}$ is the embedding of example $m$ in class $k$, $N_k$ is the number of examples in class $k$, and $\mathrm{sim}(x, y) = x^\top y/(\|x\|\|y\|)$ is the cosine similarity between vectors. If the embeddings are first $L_2$ normalized, then $1 - R^2$ is the ratio of the average within-class variance to the weighted total variance, where the weights are inversely proportional to the number of examples in each class. For a balanced dataset, $R^2$ is also equivalent to centered kernel alignment (Cortes et al., 2012; Cristianini et al., 2002) between the embeddings and the one-hot label matrix, with a cosine kernel. We also examined alternative class separation metrics (Appendix E); results were similar.

As shown in Table 2 and Figure 4, all regularizers and alternative loss functions resulted in greater class separation in penultimate (average pooling) layer representations as compared to softmax loss. Whereas additional final layer $L_2$ regularization, logit penalty, and squared error also produced greater class separation before the penultimate layer, other losses did not.
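Eq. (11) can be computed directly from penultimate layer embeddings; a naive sketch that loops over class pairs (illustrative only):

```python
import numpy as np

def class_separation(X, labels):
    """R^2 of Eq. (11): one minus the ratio of the average within-class
    cosine distance to the average cosine distance over all class pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    classes = np.unique(labels)
    K = len(classes)
    D = np.zeros((K, K))  # D[j, k] = mean cosine distance, class j vs class k
    for a, j in enumerate(classes):
        for c, k in enumerate(classes):
            D[a, c] = np.mean(1.0 - Xn[labels == j] @ Xn[labels == k].T)
    within = np.trace(D) / K  # numerator of Eq. (11)
    total = D.mean()          # denominator of Eq. (11)
    return 1.0 - within / total
```

For classes that are perfectly collapsed along distinct directions, the within-class distance vanishes and $R^2 = 1$; for embeddings with no class structure, $R^2 \approx 0$.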
Although losses that improve class separation also improve accuracy on the ImageNet validation set, they result in penultimate layer features that are substantially less useful for other tasks. Kornblith et al. (2019b) previously showed that networks trained with label smoothing and dropout learn less transferable features. As in that work, we trained logistic regression classifiers to classify a selection of transfer datasets (Bossard et al., 2014; Krizhevsky & Hinton, 2009; Berg et al., 2014; Xiao et al., 2010; Krause et al., 2013; Parkhi et al., 2012; Nilsback & Zisserman, 2008), using fixed features from networks trained with different losses. As shown in Table 3, features from networks trained with vanilla softmax loss yield the highest transfer accuracy. However, when we attempted to relearn the original 1000-way ImageNet classifier using 50,046 training set examples, features from networks trained with vanilla softmax loss performed worst. Thus, the ease with which ImageNet classifier weights can be relearned from representations is inversely related to the performance of these representations when they are used to classify other datasets (Figure 5). To confirm this relationship between class separation, ImageNet accuracy, and transfer, we trained models with cosine softmax with varying values of the temperature parameter τ. As shown in Table 4, lower temperatures resulted in lower top-1 accuracies and worse class separation, and made the ImageNet classifier weights more difficult to recover. However, even though the lowest temperature achieved 2.7% lower accuracy on ImageNet compared to higher temperatures, it yielded better features for nearly all transfer datasets. Thus, τ controls a tradeoff between the generalizability of penultimate-layer features and accuracy on the target dataset.

4. RELATED WORK

Theoretical analysis of loss functions is challenging; in most cases, solutions cannot be expressed in closed form even when the predictor is linear. However, Soudry et al. (2018) have previously shown that, on linearly separable data, gradient descent on the unregularized logistic or multinomial logistic regression objectives (i.e., linear models with sigmoid or softmax cross-entropy loss) eventually converges to the minimum norm solution. These results can be extended to neural networks in certain restricted settings (Soudry et al., 2018; Gunasekar et al., 2018; Wei et al., 2019) . Our study of class separation in penultimate layers of neural networks is related to work investigating angular visual hardness (Chen et al., 2019) , which measures the arccosine-transformed cosine similarity between the weight vectors and examples. This metric is similar to the class separation metric we apply (Eq. 11), but fails to differentiate between networks trained with softmax and sigmoid cross-entropy; see Appendix Figure E.1. Other work has investigated how class information evolves through the hidden layers of neural networks, using linear classifiers (Alain & Bengio, 2016) , binning estimators of mutual information (Shwartz-Ziv & Tishby, 2017; Saxe et al., 2019; Goldfeld et al., 2018) , Euclidean distances (Schilling et al., 2018) , and manifold geometry (Cohen et al., 2020) . However, this previous work has not analyzed how training objectives affect these measures. The loss functions we investigate are only a subset of those explored in past literature. We have excluded loss functions that require specially constructed batches from the current investigation (Snell et al., 2017; Khosla et al., 2020) , as well as losses designed for situations with high label noise (Jindal et al., 2016; Ghosh et al., 2017; Patrini et al., 2017; Amid et al., 2019; Lukasik et al., 2020) . 
Other work has investigated replacing the softmax function with other functions that lead to normalized class probabilities (de Brébisson & Vincent, 2015; Laha et al., 2018) . Our approach is related to previous studies of metric learning (Musgrave et al., 2020) and optimizers (Choi et al., 2019) .

5. CONCLUSION

Our study identifies many similarities among networks trained with different objectives. On CIFAR-10, CIFAR-100, and ImageNet, different losses and regularizers achieve broadly similar accuracies. Although the accuracy differences are large enough to be meaningful in some contexts, the largest is still <1.5%. Representational similarity analysis using centered kernel alignment indicates that the choice of loss function affects representations in only the last few layers of the network, suggesting inherent limitations to what can be achieved by manipulating the loss. However, we also show that different objectives lead to substantially different penultimate layer representations. We find that class separation is an important factor that distinguishes these different penultimate layer representations, and show that it is inversely related to transferability of representations to other tasks. 



Footnotes:

1. Training at low temperatures was unstable, so we scaled the loss by the temperature, which slightly worsened overall ImageNet accuracy. Relationships for temperatures ≥ 0.05 remain consistent without loss scaling.

2. The torchvision ResNet-50 model and the "official" TensorFlow ResNet both implement this architecture, which was first proposed by Gross & Wilber (2016) and differs from the ResNet v1 described by He et al. (2016) in performing strided convolution in the first 3 × 3 convolution in each stage rather than the first 1 × 1 convolution. Our implementation initializes the γ parameters of the last batch normalization layer in each block to 0, as in Goyal et al. (2017).

3. Due to the large number of hyperparameter configurations, for squared error, we performed only 1 run per configuration to select hyperparameters, but 3 to select the epoch at which to stop. We manually narrowed the hyperparameter search range until all trained networks achieved similar accuracy. The resulting hyperparameters performed better than those suggested by Hui & Belkin (2020).



Figure 1: Different losses produce different predictions. a: Percentages of ImageNet validation set examples for which models assign the same top-1 predictions, for 8 seeds of ResNet-50 models. b: Dendrogram based on similarity of predictions. All models naturally cluster according to loss, except for "Dropout" and "More Final Layer L2" models. See also Figure D.1.

Figure 2: Loss functions affect sparsity of later layer representations. Plot shows the average % non-zero activations for each ResNet-50 block, after the residual connection and subsequent nonlinearity, on the ImageNet validation set. Dashed lines indicate boundaries between stages.

Figure 3: The loss function has little impact on representations in early network layers. All plots show linear centered kernel alignment (CKA) between representations computed on the ImageNet validation set. a: CKA between network layers, for pairs of networks trained from different initializations. b: CKA between representations extracted from architecturally corresponding layers of networks trained with different loss functions. Diagonal reflects similarity of networks with the same loss function trained from different initializations.

Figure 4: Class separation in different layers of ResNet-50 models, on the ImageNet training set.

Figure 5: Transfer accuracy and accuracy of relearned ImageNet weights are negatively related. a: Average transfer task accuracy versus accuracy of a classifier trained on 50,046 ImageNet training set examples and tested on the validation set for different objectives. b: Relationship of transfer accuracy and relearned ImageNet accuracy with cosine softmax temperature.

Figure E.3: The distribution of cosine distance between examples. Kernel density estimate of the cosine distance between examples of the same class (solid lines) and of different classes (dashed lines), for penultimate layer embeddings of 10,000 training set examples from ResNet-50 on ImageNet. Top and bottom plots show the same data with different y scales.

Table 1: Regularizers and alternative losses improve ImageNet accuracy. Accuracy of models trained with different losses/regularizers on the ImageNet validation set (mean ± standard error of 8 models) and the CIFAR-10 and CIFAR-100 test sets (mean ± standard error of 25 models). Losses are sorted from lowest to highest ImageNet top-1 accuracy. Accuracy values not significantly different from the best (p > 0.05, t-test) are bold-faced.

Table 2: Regularization and alternative losses improve class separation in the penultimate layer. Results averaged over 8 ResNet-50 models per loss on the ImageNet training set.

Table 3: Regularized networks learn features specialized to ImageNet. Accuracy of linear classifiers (L2-regularized multinomial logistic regression) trained to classify different datasets using fixed penultimate layer features. IN(50k) reflects the accuracy of a classifier trained on 50,046 examples from the ImageNet training set and tested on the validation set. See Appendix A.2 for training details.

Table 4: Temperature of cosine softmax loss controls ImageNet top-1 accuracy, class separation (R²), and linear transfer accuracy.

Table B.2: Regularizers and alternative losses improve performance on out-of-distribution test sets. Accuracy averaged over 8 ResNet-50 models per loss.

Table B.3: Regularizers and alternative losses may or may not improve calibration. We report negative log likelihood (NLL) and expected calibration error (ECE) for each loss on the ImageNet validation set, before and after scaling the temperature of the probability distribution to minimize NLL, as in Guo et al. (2017). ECE is computed with 15 evenly spaced bins. For networks trained with sigmoid loss, we normalize the probability distribution by summing probabilities over all classes.

Table C.1: Combining final-layer regularizers and/or improved losses does not enhance performance. ImageNet holdout set accuracy of ResNet-50 models when combining losses and regularizers. All results reflect the maximum accuracy on the holdout set at any point during training, averaged across 3 training runs. Accuracy numbers are higher on the holdout set than on the official ImageNet validation set. This difference in accuracy is likely due to a difference in image distributions between the ImageNet training and validation sets, as previously noted in Section C.3.1 of Recht et al. (2019).

Table C.2: AutoAugment and Mixup provide consistent accuracy gains beyond well-tuned losses and regularizers. Top-1 accuracy of ResNet-50 models trained with and without AutoAugment, averaged over 3 (with AutoAugment) or 8 (without AutoAugment) runs. Models trained with AutoAugment use the loss hyperparameters chosen for models trained without AutoAugment, but the point at which to stop training was chosen independently on our holdout set. For models trained with Mixup, the mixing parameter α is chosen from [0.1, 0.2, 0.3, 0.4] on the holdout set. Best results in each column, as well as results not significantly different from the best (p > 0.05, t-test), are bold-faced.

Loss            Baseline top-1 / top-5       +AutoAugment top-1 / top-5   +Mixup top-1 / top-5
…               … ± 0.06 / 93.40 ± 0.02      77.7 ± 0.05 / 93.74 ± 0.05   78.0 ± 0.05 / 93.98 ± 0.03
Sigmoid         77.9 ± 0.05 / 93.50 ± 0.02   78.5 ± 0.04 / 93.82 ± 0.02   78.5 ± 0.07 / 93.94 ± 0.04
Logit penalty   77.7 ± 0.02 / 93.83 ± 0.02   78.3 ± 0.05 / 94.10 ± 0.03   78.0 ± 0.05 / 93.95 ± 0.05
Cosine softmax  77.9 ± 0.02 / 93.86 ± 0.01   78.3 ± 0.02 / 94.12 ± 0.04   78.4 ± 0.04 / 94.14 ± 0.02

Figure E.2: Singular value spectra of activations and weights learned by different losses. Singular value spectra computed for penultimate layer activations, final layer weights, and class centroids of ResNet-50 models on the ImageNet training set. Penultimate layer activations and final layer weights fail to differentiate sigmoid cross-entropy from softmax cross-entropy. By contrast, the singular value spectrum of the class centroids clearly distinguishes these losses.
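Spectra like those in Figure E.2 can be computed from penultimate-layer features along the following lines (a sketch; `singular_value_spectrum` is a hypothetical helper, not code from our experiments):

```python
import numpy as np

def singular_value_spectrum(features, labels=None, center=True):
    """Singular value spectrum of a feature matrix of shape [n, d].

    If integer `labels` are given, the spectrum is computed over the
    per-class centroids (class means) instead of raw activations."""
    x = np.asarray(features, dtype=np.float64)
    if labels is not None:
        classes = np.unique(labels)
        # One centroid per class: mean feature vector over its examples.
        x = np.stack([x[labels == c].mean(axis=0) for c in classes])
    if center:
        # Remove the global mean so the spectrum reflects variation
        # around the centroid of the data, not its offset from zero.
        x = x - x.mean(axis=0, keepdims=True)
    return np.linalg.svd(x, compute_uv=False)
```

The same function applies to final-layer weight matrices by passing them directly without labels.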

A DETAILS OF TRAINING AND HYPERPARAMETER TUNING

A.1 TRAINING AND TUNING NEURAL NETWORKS

ImageNet. We trained ImageNet models (ResNet-50 "v1.5" (He et al., 2016; Gross & Wilber, 2016; Goyal et al., 2017) and Inception v3 (Szegedy et al., 2016)) with SGD with Nesterov momentum of 0.9, a batch size of 4096, and weight decay of 8 × 10^-5 (applied to the weights but not batch norm parameters). After 10 epochs of linear warmup to a maximum learning rate of 1.6, we decayed the learning rate by a factor of 0.975 per epoch. We took an exponential moving average of the weights over training as in Szegedy et al. (2016), with a momentum factor of 0.9999. We used standard data augmentation comprising random crops of 10-100% of the image with aspect ratios of 0.75 to 1.33 and random horizontal flips. At test time, we resized images to 256 pixels on their shortest side and took a 224 × 224 center crop.

To tune hyperparameters, we initially performed a set of training runs with a wide range of different parameters, and then narrowed the hyperparameter range to the range shown in Table A.1. To further tune the hyperparameters and the epoch for early stopping, we performed 3 training runs per configuration in which we held out a validation set of approximately 50,000 ImageNet training examples. We tuned loss hyperparameters for ResNet-50 only. For Inception v3, we used the same loss hyperparameters as for ResNet-50, but we still performed 3 training runs with the held-out validation set to select the point at which to stop training for each loss.

CIFAR. We trained CIFAR-10 and CIFAR-100 models using SGD with Nesterov momentum of 0.9 and a cosine learning rate decay schedule without restarts, and without weight averaging. For CIFAR-10, we used a batch size of 128; for CIFAR-100, we used a batch size of 256. For these networks, we performed hyperparameter tuning to select the learning rate and weight decay parameters.
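The ImageNet learning rate schedule and weight averaging described above correspond to the following minimal sketch (names are illustrative, and weights are represented as a dict of scalars for brevity):

```python
def imagenet_learning_rate(epoch, peak_lr=1.6, warmup_epochs=10, decay=0.975):
    """Linear warmup to peak_lr over warmup_epochs, then exponential
    decay by `decay` per epoch (fractional epoch values are allowed)."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    return peak_lr * decay ** (epoch - warmup_epochs)

class EMA:
    """Exponential moving average of model weights (momentum 0.9999)."""
    def __init__(self, params, momentum=0.9999):
        self.momentum = momentum
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        # shadow <- m * shadow + (1 - m) * current, per parameter.
        m = self.momentum
        for k, v in params.items():
            self.shadow[k] = m * self.shadow[k] + (1 - m) * float(v)
```

In a real training loop the EMA update would run once per step over tensors rather than scalars, and the averaged weights would be used for evaluation.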
We started by selecting the learning rate from {10^-2, 10^-1.5, 10^-1, 10^-0.5, 1.0, 10^0.5} and the weight decay from {10^-4.5, 10^-4, 10^-3.5, 10^-3}, where we parameterize the weight decay so that it is divided by the learning rate. We manually inspected hyperparameter grids and expanded the learning rate and weight decay ranges when the best accuracy was on the edge of the searched grid. After finding the best hyperparameters in this coarse search, we performed a finer search in the vicinity of the best coarse hyperparameters with double the granularity, e.g., for an optimal learning rate of 10^-1 in the coarse search, our fine grid would include learning rates of {10^-0.5, 10^-0.75, 10^-1, 10^-1.25, 10^-1.5}. All results show optimal hyperparameters from this finer grid. During both coarse and fine hyperparameter tuning, we computed accuracies averaged over 5 different initializations for each configuration to reduce the bias toward selecting high-variance hyperparameter combinations when searching over a large number of configurations. Hyperparameters are shown in Table A.2.

The architecture we used for CIFAR-10 experiments was based on the All-CNN-C architecture of Springenberg et al. (2014), with batch normalization added between layers and the global average pooling operation moved before the final convolutional layer. On CIFAR-100, we used the Wide ResNet 16-8 architecture from Zagoruyko & Komodakis (2016). Our CIFAR-100 architecture applied weight decay to batch normalization parameters, but our CIFAR-10 architecture did not.
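The coarse-to-fine search described above can be sketched as follows (the `evaluate` routine standing in for a full training run, and the helper names, are hypothetical):

```python
import numpy as np

def coarse_search(evaluate, lrs_log10, wds_log10, n_seeds=5):
    """Grid search over log10 learning rates and weight decays.
    evaluate(lr, wd, seed) -> accuracy is a stand-in for training a model.
    The weight decay is parameterized as divided by the learning rate."""
    best, best_acc = None, -np.inf
    for lr_e in lrs_log10:
        for wd_e in wds_log10:
            lr = 10.0 ** lr_e
            wd = 10.0 ** wd_e / lr
            # Average over several seeds to avoid selecting
            # high-variance configurations by chance.
            acc = np.mean([evaluate(lr, wd, s) for s in range(n_seeds)])
            if acc > best_acc:
                best, best_acc = (lr_e, wd_e), acc
    return best, best_acc

def refine_grid(best_log10, step=0.5, n_side=1):
    """Fine grid around the best coarse value with double the granularity,
    e.g. best 10^-1 with coarse step 0.5 -> exponents
    {-0.5, -0.75, -1.0, -1.25, -1.5}."""
    fine_step = step / 2
    return [best_log10 + i * fine_step
            for i in range(2 * n_side, -2 * n_side - 1, -1)]
```

A second call to `coarse_search` over `refine_grid` values then yields the final hyperparameters.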
Loss                  CIFAR-10                                     CIFAR-100
Squared error         η = 0.1, λ = 10^-3.5, κ = 8, M = 0.83        η = 0.1, λ = 10^-3.75, κ = 6, M = 12
Softmax               η = 10^-0.75, λ = 10^-3.75                   η = 0.1, λ = 10^-4
Logit normalization   η = 0.01, λ = 10^-4, τ = 0.14                η = 10^-2.25, λ = 10^-3.75, τ = 0.11
Extra final layer L2  η = 0.1, λ = 10^-3.5, λ_final = 10^-1.5      η = 0.1, λ = 10^-3.75, λ_final = 10^-3.33
Cosine softmax        η = 10^-2.25, λ = 10^-4, τ = 0.08            η = 0.01, λ = 10^-3.75, τ = 0.1
Dropout               η = 0.1, λ = 10^-3.75, ρ = 0.65              η = 10^-1.25, λ = 10^-3.75, ρ = 0.75
Sigmoid               η = 1, λ = 10^-3.75                          η = 0.1, λ = 10^-3.75
Label smoothing       η = 0.1, λ = 10^-3.75, α = 0.04              η = 0.1, λ = 10^-3.5, α = 0.18
Logit penalty         η = 10^-0.75, λ = 10^-3.75, β = 10^-2.83     η = 10^-1.25, λ = 10^-3.75, β = 10^-2.83

A.2 TRAINING AND TUNING MULTINOMIAL LOGISTIC REGRESSION CLASSIFIERS

To train multinomial logistic regression classifiers on fixed features, we follow a similar approach to Kornblith et al. (2019b). We first extracted features for every image in the training set by resizing images to 224 pixels on the shortest side and taking a 224 × 224 pixel center crop. We held out a validation set from the training set, and used this validation set to select the L2 regularization hyperparameter, which we selected from 45 logarithmically spaced values between 10^-6 and 10^5, applied to the sum of the per-example losses. Because the optimization problem is convex, we used the previous weights as a warm start as we increased the L2 regularization hyperparameter. After finding the optimal hyperparameter on this validation set, we retrained on the entire training set and evaluated accuracy on the test set.

E OTHER CLASS SEPARATION METRICS

Figure E.1: Angular visual hardness (Chen et al., 2019) scores of the 50,000 examples in the ImageNet validation set, computed with a Gaussian kernel of bandwidth 5 × 10^-6, for ResNet-50 networks trained with different losses. Legend shows ImageNet top-1 accuracy for each loss function in parentheses. Although alternative loss functions generally reduce angular visual hardness vs. softmax loss, sigmoid loss does not, yet it is tied for the highest accuracy of any loss function.
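The warm-started L2 sweep of Section A.2 above can be sketched with a simple gradient-descent solver (illustrative only: our experiments used a convex optimizer on real features, and the step-size scaling below is a stability heuristic for this toy solver, not part of our method):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_logreg(x, y, lam, w=None, lr=0.1, steps=2000):
    """Multinomial logistic regression by gradient descent. The L2
    penalty lam is applied to the sum of per-example losses; passing
    `w` warm-starts from a previous solution."""
    n, d = x.shape
    k = int(y.max()) + 1
    if w is None:
        w = np.zeros((d, k))
    onehot = np.eye(k)[y]
    step = lr / (1.0 + lam)  # shrink steps for large penalties (heuristic)
    for _ in range(steps):
        grad = x.T @ (softmax(x @ w) - onehot) + lam * w
        w = w - step * grad / n
    return w

def sweep_l2(x, y, xv, yv, steps=500):
    """Warm-started sweep over 45 log-spaced L2 strengths in [1e-6, 1e5];
    returns the strength and accuracy that are best on the validation set."""
    w, best = None, (None, -1.0)
    for lam in np.logspace(-6, 5, 45):
        w = fit_logreg(x, y, lam, w=None if w is None else w.copy(),
                       steps=steps)
        acc = float((softmax(xv @ w).argmax(axis=1) == yv).mean())
        if acc > best[1]:
            best = (lam, acc)
    return best
```

Because the objective is convex, warm-starting from the previous weights as the penalty increases only saves optimization time; it does not change the solution each fit converges to.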

