TEMPERATURE CHECK: THEORY AND PRACTICE FOR TRAINING MODELS WITH SOFTMAX-CROSS-ENTROPY LOSSES

Abstract

The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function carries a single hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse temperature β as well as the magnitude of the logits at initialization, ||βz||_2. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on β is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of β as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal β to be sensitive to the architecture, our results suggest that tuning β over the range 10^-2 to 10^1 improves performance across all architectures studied. We find that smaller β may lead to better peak performance at the cost of learning stability.

1. INTRODUCTION

Deep learning has led to breakthroughs across a slew of classification tasks (LeCun et al., 1989; Krizhevsky et al., 2012; Zagoruyko and Komodakis, 2017). Crucial components of this success have been the use of the softmax function to model predicted class probabilities, combined with the cross-entropy loss function as a measure of distance between the predicted distribution and the label (Kline and Berardi, 2005; Golik et al., 2013). Significant work has gone into improving the generalization performance of softmax-cross-entropy learning. A particularly successful approach has been to reduce overfitting by reducing model confidence; this has been done by regularizing outputs using confidence regularization (Pereyra et al., 2017) or by label smoothing (Müller et al., 2019; Szegedy et al., 2016). Another way to manipulate model confidence is to tune the temperature of the softmax function, which is otherwise commonly set to one. Adjusting the softmax temperature during training has been shown to be important in metric learning (Wu et al., 2018; Zhai and Wu, 2019) and when performing distillation (Hinton et al., 2015), as well as for post-training calibration of prediction probabilities (Platt, 2000; Guo et al., 2017). The interplay between temperature, learning, and generalization is complex and not well understood in the general case. Although significant recent theoretical progress has been made in understanding generalization and learning in wide neural networks approximated as linear models, analysis of linearized learning dynamics has largely focused on the case of squared-error losses (Jacot et al., 2018; Du et al., 2019; Lee et al., 2019; Novak et al., 2019a; Xiao et al., 2019). Infinitely wide networks trained with softmax-cross-entropy loss have been shown to converge to max-margin classifiers in a particular function-space norm (Chizat and Bach, 2020), but the timescales of convergence are not known.
Additionally, many well-performing models operate best away from the linearized regime (Novak et al., 2019a; Aitchison, 2019). This means that understanding the deviations of models from their linearization around initialization is important for understanding generalization (Lee et al., 2019; Chizat et al., 2019). In this paper, we investigate the training of neural networks with softmax-cross-entropy losses. In general this problem is analytically intractable; to make progress we pursue a strategy that combines analytic insights at short times with a comprehensive set of experiments that capture the entirety of training. At short times, models can be understood in terms of a linearization about their initial parameters along with nonlinear corrections. In the linear regime we find that networks trained with different inverse temperatures, β = 1/T, behave identically provided the learning rate is rescaled as η̃ = ηβ^2. Here, networks begin to learn over a timescale τ_z ∼ ||Z_0||_2/η̃, where Z_0 are the initial logits of the network after being multiplied by β. This implies that we expect learning to begin sooner for networks with smaller initial logits. The learning dynamics become nonlinear over another, independent, timescale τ_nl ∼ β/η, suggesting more nonlinear learning for small β. From previous results we expect neural networks to perform best in the regime where they quickly exit the linear regime (Chizat et al., 2019; Lee et al., 2020; Lewkowycz et al., 2020). We combine these analytic results with extensive experiments on competitive neural networks across a range of architectures and domains, including: Wide Residual Networks (Zagoruyko and Komodakis, 2017) on CIFAR10 (Krizhevsky, 2009), ResNet-50 (He et al., 2016) on ImageNet (Deng et al., 2009), and GRUs (Chung et al., 2014) on the IMDB sentiment analysis task (Maas et al., 2011).
In the case of residual networks, we consider architectures with and without batch normalization, which can appreciably change the learning dynamics (Ioffe and Szegedy, 2015). For all models studied, we find that generalization performance is poor when ||Z_0||_2 ≫ 1, but is otherwise largely independent of ||Z_0||_2. Moreover, learning becomes slower and less stable at very small β; indeed, the optimal learning rate scales as η* ∼ 1/β, and the resulting early-learning timescale can be written as τ*_z ∼ ||Z_0||_2/β. For all models studied, we observe strong performance for β ∈ [10^-2, 10^1], although the specific optimal β is architecture dependent. Emphatically, the optimal β is often far from 1. For models without batch normalization, smaller β can give stronger results on some training runs, with others failing to train due to instability. Overall, these results suggest that model performance can often be improved by tuning β over the range [10^-2, 10^1].
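The claim that networks trained with different inverse temperatures behave identically in the linear regime, once the learning rate is rescaled by β^2 and the rescaled initial logits Z_0 are matched, can be checked directly for a linear model, where the linearization is exact. The following numpy sketch is illustrative (the model, data, and hyperparameters are our own choices, not the paper's experimental setup):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_logits(beta, w0, eta, x, y, steps):
    """Gradient descent on L = -sum_i y_i * log softmax(beta * w @ x)_i.

    For a linear model z = w @ x the gradient is
    dL/dw = beta * (softmax(beta*z) - y) x^T, so the rescaled logits
    Z = beta*z evolve as Z <- Z - eta*beta^2*||x||^2 (softmax(Z) - y):
    the Z-dynamics depend on eta and beta only through eta*beta^2.
    """
    w = w0.copy()
    Zs = []
    for _ in range(steps):
        p = softmax(beta * (w @ x))
        w -= eta * beta * np.outer(p - y, x)
        Zs.append(beta * (w @ x))
    return np.array(Zs)

rng = np.random.default_rng(0)
K, N = 5, 10
x = rng.normal(size=N)
y = np.eye(K)[2]                      # one-hot label for class 2
w1 = rng.normal(size=(K, N))

b1, b2, eta1 = 1.0, 10.0, 0.1
# Match the rescaled initial logits Z_0 = beta * w @ x and the
# product eta * beta^2 between the two runs.
w2 = w1 * (b1 / b2)
eta2 = eta1 * (b1 / b2) ** 2

Z_run1 = train_logits(b1, w1, eta1, x, y, steps=20)
Z_run2 = train_logits(b2, w2, eta2, x, y, steps=20)
assert np.allclose(Z_run1, Z_run2)    # identical trajectories of Z
```

For a deep network the Jacobian is not constant, so this equivalence only holds at short times; the deviation between the two trajectories is one way to see the nonlinear timescale emerge.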

2. THEORY

We begin with a precise description of the problem setting before discussing a theory of learning at short times. We will show the following:

• The inverse temperature β and the logit scale ||Z_0||_2 control timescales which determine the rate of change of the loss, the relative change in the logits, and the time for learning to leave the linear learning regime.
• Small β causes training to access the non-linear learning regime. We will see empirically that increased access to the non-linear regime can improve generalization.
• The largest allowable learning rate is set by the timescale to leave the linearized learning regime, which suggests that networks with small β will train more slowly.

All numerical results in this section use a Wide ResNet (Zagoruyko and Komodakis, 2017) trained on CIFAR10.
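For reference, the scales invoked in these points, introduced in Section 1, are

```latex
\tilde{\eta} = \eta \beta^{2}, \qquad
\tau_{z} \sim \frac{\lVert Z_{0} \rVert_{2}}{\tilde{\eta}}, \qquad
\tau_{\mathrm{nl}} \sim \frac{\beta}{\eta},
```

so that decreasing β shortens τ_nl (earlier entry into the nonlinear regime), while combining τ_z with the optimal learning rate η* ∼ 1/β gives the early-learning timescale τ*_z ∼ ||Z_0||_2/β.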

2.1. BASIC MODEL AND NOTATION

We consider a classification task with K classes. For an N-dimensional input x, let z(x, θ) be the pre-softmax output of a classification model parameterized by θ ∈ R^P, such that the classifier predicts the class i corresponding to the largest output value z_i. We will mainly consider θ trained by SGD on a training set (X, Y) of M input-label pairs. We focus on models trained with cross-entropy loss with a non-trivial inverse temperature β. Writing σ for the softmax function and Z ≡ βz for the rescaled logits, the softmax-cross-entropy loss can be written as

L(θ, X, Y) = −∑_{i=1}^{K} Y_i · ln(σ(βz(X, θ))_i) = −∑_{i=1}^{K} Y_i · ln(σ(Z(X, θ))_i)
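As a concrete reference for the loss above, a minimal numerically stable numpy sketch (variable names are our own) might look like:

```python
import numpy as np

def tempered_xent(z, y, beta=1.0):
    """Softmax-cross-entropy loss L = -sum_i y_i * ln softmax(beta*z)_i.

    z : (K,) pre-softmax model outputs (logits)
    y : (K,) one-hot (or soft) label distribution
    beta : inverse temperature; beta = 1 recovers the standard loss
    """
    Z = beta * z                        # rescaled logits Z = beta * z
    Z = Z - Z.max()                     # shift for numerical stability
    # ln softmax(Z)_i = Z_i - ln sum_j exp(Z_j)
    log_p = Z - np.log(np.exp(Z).sum())
    return -(y * log_p).sum()

z = np.array([2.0, 1.0, 0.0])
y = np.array([1.0, 0.0, 0.0])
loss_std = tempered_xent(z, y, beta=1.0)   # standard cross-entropy
loss_soft = tempered_xent(z, y, beta=0.1)  # softer predictions; larger loss on this example
```

Subtracting the maximum before exponentiating leaves log_p unchanged but prevents overflow when ||βz||_2 is large, which matters precisely in the large-β regimes studied here.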

