HOW TO KEEP COOL WHILE TRAINING

Abstract

Modern neural networks used for classification are notoriously prone to overly confident predictions. With multiple calibration methods proposed so far, there has been noteworthy progress in addressing overconfidence issues. However, to the best of our knowledge, prior methods have exclusively focused on those factors that affect calibration, leaving open the question of how (mis)calibration circles back to negatively impact network training. Aiming to better understand such dependencies, we propose a temperature-based Cooling method to calibrate classification neural networks during training. Cooling results in better gradient scaling and reduces the need for a learning rate schedule. We investigate different variants of Cooling, with the simplest, last layer Cooling, being also the best-performing one, improving network performance for a range of datasets, network architectures, and hyperparameter settings.

1. INTRODUCTION

Training neural networks can be a challenging task, with optimal performance depending on the right setting of hyperparameters. For this reason, finding a suitable network configuration can often take multiple costly training runs with varying parameters of the learning rate schedule, the optimizer and the batch size. Apart from standard learning rate schedules like piecewise constant schedules and exponential decay schedules, there has been active research in developing better schedules: among the most prominent of these are learning rate warmup (Goyal et al., 2017; He et al., 2016a) and cosine decay (Loshchilov & Hutter, 2017) schedules. Complementary to these challenges, (Guo et al., 2017) found that modern convolutional classification networks are often poorly calibrated, leading to overly confident predictions. They investigated multiple methods to improve calibration, with a simple temperature scaling method performing best: the network's output logits are multiplied by a temperature parameter, optimised on a validation dataset after training. Importantly, this leaves the maximal value and therefore the predicted class label unchanged, since all logits are multiplied by the same temperature value. Since then, multiple papers (Kull et al., 2019; Kumar et al., 2019; 2018; Müller et al., 2019; Gupta et al., 2021) have proposed methods aiming at even better calibrated networks. More recently, (Desai & Durrett, 2020; Minderer et al., 2021) investigated the calibration of state-of-the-art non-convolutional Transformer networks (Vaswani et al., 2017; Dosovitskiy et al., 2021) and MLP-Mixers (Tolstikhin et al., 2021). They concluded that such architectures may have calibration benefits, with further work needed to fully understand the contributing factors.
Despite initially leaving the accuracy unchanged, we have noticed that temperature scaling can have an intriguing effect as training continues: scaling the output logits changes the cross-entropy loss, which in turn leads to scaled gradient updates and subsequently new parameter values. During training, this can lead to a significant increase in accuracy. To the best of our knowledge, temperature scaling has until now only been applied post hoc, after completing network training. However, our investigation shows that networks become gradually overconfident during training (they overheat), which seems to have a detrimental effect on learning. This has motivated us to modify the original temperature scaling and propose a Cooling method to calibrate neural networks during training. Our main contributions are:
• A Cooling method for calibrating classification neural networks during training. We propose two basic variants, called last layer Cooling and distributed Cooling, and one hybrid variant, called periodically redistributed Cooling.
• A mathematical analysis of the effect of Cooling on the network gradients, with a comparison of the different Cooling variants.
• An empirical investigation of the effects of Cooling on a range of metrics, including network weights, gradients, output logits and the ECE (expected calibration error) calibration measure.
• A broad set of experiments for different tasks (image classification and semantic segmentation), datasets and network architectures. We also include an extensive ablation study, involving different activation functions, optimizers, and hyperparameters such as the learning rate schedule, the Cooling factor and the use of weight decay and data augmentation. Our experiments indicate an interplay between the learning rate and calibration during training. Importantly, if well-calibrated, networks can train well without a learning rate schedule.
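The gradient-scaling effect described above can be illustrated numerically. The sketch below is a minimal NumPy illustration, not the paper's implementation; the logits, label and temperature value are made up. It uses the fact that for L = H(y, σ(τz)), the gradient with respect to the logits is ∂L/∂z = τ(σ(τz) − y):

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_wrt_logits(z, y, tau):
    # For L = H(y, softmax(tau * z)), the gradient is dL/dz = tau * (softmax(tau * z) - y).
    return tau * (softmax(tau * z) - y)

# A made-up overconfident prediction: huge logit for class 0, but the label is class 1.
z = np.array([8.0, 1.0, 0.0])
y = np.array([0.0, 1.0, 0.0])

g_hot  = ce_grad_wrt_logits(z, y, tau=1.0)  # standard training, no temperature
g_cool = ce_grad_wrt_logits(z, y, tau=0.5)  # cooled logits
```

In this overconfident example, cooling with τ = 0.5 noticeably shrinks the gradient norm, consistent with the scaled gradient updates described above.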

2. BACKGROUND AND NOTATION

Let f_θ : R^d → R^s denote the function of a classification neural network with parameters θ, mapping a d-dimensional input (in our case an image) x to an s-dimensional logits vector z = f_θ(x). During training, each input x comes with a class-probability or label vector y, denoting the probabilities of x belonging to each of the s classes. This is usually (but not necessarily) a one-hot vector corresponding to a so-called ground-truth class label i*. In its simplest variant, we suppose the network consists of L affine (dense or convolutional) layers, each followed by a non-linear activation function. For the i-th layer (1 ≤ i ≤ L) this gives an expression of the form x_i = ρ(p_i) = ρ(W_i x_{i−1} + b_i), with weight matrices W_i, bias vectors b_i, non-linearities ρ, pre-activation values p_i, and layer input x_{i−1} and output x_i, respectively. (More generally, our method can be applied to any neural network, involving arbitrary functions and layers such as attention, batch normalization and skip connections.) The output logits are then passed through the softmax function σ, which results in a vector ŷ = σ(z) of class probabilities. The classification network is trained to minimize the categorical cross-entropy loss function

L(z) = H(y, σ(z)) = − Σ_{i=1}^{s} y_i log(ŷ_i) .    (2.1)

We say that a network is well-calibrated if the output values ŷ can be interpreted as true probabilities. Intuitively, if a network makes 100 predictions with 90% confidence, we would expect that 90% of them are correctly classified. (Guo et al., 2017) observed that convolutional neural networks tend to display over-confidence in their results, in that ŷ_{i*} gives an over-estimate of the probability that i* is the correct label. Such networks are badly calibrated, which we metaphorically express by saying that they become overheated.
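Equation (2.1) can be made concrete in a few lines of NumPy (a minimal sketch for illustration only; the example logits and one-hot label are made up):

```python
import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, y):
    # Eq. (2.1): L(z) = H(y, sigma(z)) = -sum_i y_i * log(yhat_i)
    y_hat = softmax(z)
    return -np.sum(y * np.log(y_hat), axis=-1)

z = np.array([2.0, 0.5, -1.0])   # logits for s = 3 classes
y = np.array([1.0, 0.0, 0.0])    # one-hot label: ground-truth class i* = 0
loss = cross_entropy(z, y)       # equals -log(yhat_0)
```

For a one-hot label, the loss reduces to −log ŷ_{i*}, the negative log-probability assigned to the ground-truth class.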
(Guo et al., 2017) found that simply multiplying the pre-softmax logits z by a factor τ does an excellent job of improving the network's calibration. Thus, calibrating the network amounts to finding a constant τ that corrects its output, so that it becomes

ŷ = σ(τ z) = σ(τ f_θ(x)) .    (2.2)

The optimal τ is found by minimizing the negative log-likelihood on a small calibration set held back from the training data. This operation is carried out once the network is fully trained. Usually, one finds that the optimal value is τ < 1. This process is known as temperature scaling. We refer to (Guo et al., 2017) for a more detailed introduction to network calibration.
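The post-hoc procedure can be sketched as follows. This is a simplified illustration, not the original implementation: we use a coarse grid search over τ on synthetic held-out logits rather than a gradient-based optimiser, and all data and helper names are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, tau):
    # Negative log-likelihood of the true labels under sigma(tau * z), cf. Eq. (2.2).
    probs = softmax(tau * logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, taus=np.linspace(0.05, 2.0, 200)):
    # Pick the tau that minimizes NLL on the held-out calibration set.
    return min(taus, key=lambda t: nll(logits, labels, t))

# Synthetic overconfident network: 80% accuracy, but near-certain predictions.
n, s = 500, 3
labels = np.arange(n) % s
logits = np.zeros((n, s))
logits[np.arange(400), labels[:400]] = 6.0                # confidently correct
logits[np.arange(400, n), (labels[400:] + 1) % s] = 6.0   # confidently wrong
tau = fit_temperature(logits, labels)                     # comes out below 1
```

Because this synthetic network is right only 80% of the time yet nearly certain on every input, the fitted temperature comes out well below 1, matching the typical finding τ < 1 reported above.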

