HOW TO KEEP COOL WHILE TRAINING

Abstract

Modern neural networks used for classification are notoriously prone to overly confident predictions. With multiple calibration methods proposed so far, there has been noteworthy progress in addressing overconfidence issues. However, to the best of our knowledge, prior methods have exclusively focused on those factors that affect calibration, leaving open the question of how (mis)calibration circles back to negatively impact network training. Aiming to better understand such dependencies, we propose a temperature-based Cooling method to calibrate classification neural networks during training. Cooling results in better gradient scaling and reduces the need for a learning rate schedule. We investigate different variants of Cooling, with the simplest, last layer Cooling, being also the best-performing one, improving network performance for a range of datasets, network architectures, and hyperparameter settings.

1. INTRODUCTION

Training neural networks can be a challenging task, with optimal performance depending on the right setting of hyperparameters. For this reason, finding a suitable network configuration can often take multiple costly training runs with varying parameters of the learning rate schedule, the optimizer and the batch size. Apart from standard learning rate schedules like piecewise constant schedules and exponential decay schedules, there has been active research in developing better schedules. Among the most prominent of these are learning rate warmup (Goyal et al., 2017; He et al., 2016a) and cosine decay (Loshchilov & Hutter, 2017) schedules. Complementary to these challenges, Guo et al. (2017) found that modern convolutional classification networks are often poorly calibrated, leading to overly confident predictions. They investigated multiple methods to improve calibration, with a simple temperature scaling method performing best: the network's output logits are multiplied by a temperature parameter, optimised on a validation dataset after training. Importantly, this leaves the position of the maximal logit and therefore the predicted class label unchanged, since all the logits are multiplied by the same positive temperature value. Since then, multiple papers (Kull et al., 2019; Kumar et al., 2019; 2018; Müller et al., 2019; Gupta et al., 2021) have proposed methods aiming at even better calibrated networks. More recently, Desai & Durrett (2020) and Minderer et al. (2021) investigated the calibration of state-of-the-art non-convolutional Transformer networks (Vaswani et al., 2017; Dosovitskiy et al., 2021) and MLP-Mixers (Tolstikhin et al., 2021). They concluded that such architectures may have benefits, with further work needed to fully understand the factors contributing to calibration.
Although temperature scaling initially leaves the accuracy unchanged, we have noticed that it can have an intriguing effect as training continues: scaling the output logits changes the cross-entropy loss, which in turn leads to scaled gradient updates and subsequently new parameter values. During training, this can lead to a significant increase in accuracy. To the best of our knowledge, temperature scaling has until now only been applied post hoc after completing network training. However, our investigation shows that networks become gradually overconfident during training (they overheat), which seems to have a detrimental effect on learning. This has motivated us to modify the original temperature scaling and propose a Cooling method to calibrate neural networks during training.

Our Contributions

• A Cooling method for calibrating classification neural networks during training. We propose two basic variants called last layer Cooling and distributed Cooling, and one hybrid variant called periodically redistributed Cooling.
• A mathematical analysis of the effect of Cooling on the network gradients, with a comparison of different Cooling variants.
• An empirical investigation of the effects of Cooling on a range of metrics, including network weights, gradients, output logits and the expected calibration error (ECE).
• A broad set of experiments for different tasks (image classification and semantic segmentation), datasets and network architectures. We also include an extensive ablation study, involving different activation functions, optimizers, and hyperparameters such as the learning rate schedule, the Cooling factor and the use of weight decay and data augmentation. Our experiments indicate an interplay between the learning rate and calibration during training. Importantly, if well-calibrated, networks can train well without the use of a learning rate schedule.

2. BACKGROUND AND NOTATION

Let $f_\theta : \mathbb{R}^d \to \mathbb{R}^s$ denote the function of a classification neural network with parameters $\theta$, mapping a $d$-dimensional input (in our case an image) $x$ to an $s$-dimensional logits vector $z = f_\theta(x)$. During training, each input $x$ comes with a class-probability or label vector $y$, denoting the probabilities of $x$ belonging to each of the $s$ classes. This is usually (but not necessarily) a one-hot vector corresponding to a so-called ground-truth class label, $i^*$. In its simplest variant, we suppose the network consists of $L$ affine (dense or convolutional) layers, each followed by a non-linear activation function. For the $i$-th layer ($1 \le i \le L$) this gives an expression of the form

$$x_i = \rho(p_i) = \rho(W_i x_{i-1} + b_i)$$

with weight matrices $W_i$, bias vectors $b_i$, non-linearities $\rho$, pre-activation values $p_i$, and layer inputs $x_{i-1}$ and outputs $x_i$. (More generally, our method can be applied to any neural network, involving arbitrary functions and layers such as attention, batch normalization and skip connections.) The output logits are then passed through the softmax function $\sigma$, which results in a vector $\hat{y} = \sigma(z)$ of predicted class probabilities. The classification network is trained to minimize the categorical cross-entropy loss function

$$\mathcal{L}(z) = H(y, \sigma(z)) = -\sum_{i=1}^{s} y_i \log(\hat{y}_i). \tag{2.1}$$

We say that a network is well-calibrated if the output values $\hat{y}$ can be interpreted as true probabilities. Intuitively, if a network makes 100 predictions with 90% confidence, we would expect that 90 of them are correctly classified. Guo et al. (2017) observed that convolutional neural networks tend to display over-confidence in their results, in that $\hat{y}_{i^*}$ gives an over-estimate of the probability that $\lambda_i$ is the correct label. Thus the networks are badly calibrated, which we metaphorically express by saying that the networks become overheated.
Guo et al. (2017) found that simply multiplying the pre-softmax logits $z_i$ by a factor $\tau$ does an excellent job of improving the network's calibration. Thus, the task of correcting the calibration of the network is to find a constant $\tau$ to correct its output, so that the predicted probabilities become

$$\hat{y} = \sigma(\tau z) = \sigma(\tau f_\theta(x)). \tag{2.2}$$

The optimal $\tau$ is found by minimizing the negative log-likelihood cost function on a small calibration set, held back from the training data. This operation is carried out when the network is fully trained. Usually, one finds that the optimum value is $\tau < 1$. This process is known as temperature scaling. We refer to Guo et al. (2017) for a more detailed introduction to network calibration.
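As an illustration of post hoc temperature scaling as described above, the following minimal sketch fits a multiplicative $\tau$ on a toy calibration set by minimising the negative log-likelihood. The helper names (`softmax`, `nll`, `fit_temperature`), the golden-section search and the toy logits are assumptions made for this example and are not taken from the paper or from Guo et al. (2017); any one-dimensional optimiser could be substituted.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll(tau, logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(tau * z)."""
    total = 0.0
    for z, label in zip(logits, labels):
        probs = softmax([tau * v for v in z])
        total -= math.log(probs[label])
    return total / len(labels)

def fit_temperature(logits, labels, lo=1e-3, hi=10.0, iters=100):
    """Golden-section search for the tau in [lo, hi] that minimises the NLL.

    The NLL is convex in tau, so a simple bracketing search suffices here.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = nll(c, logits, labels), nll(d, logits, labels)
    for _ in range(iters):
        if fc < fd:  # the minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(c, logits, labels)
        else:        # the minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(d, logits, labels)
    return 0.5 * (a + b)

# Toy calibration set (hypothetical): confident logits, but only 3 of 4 correct.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 0, 1, 1]
tau = fit_temperature(val_logits, val_labels)
print(f"fitted tau = {tau:.4f}")
```

For this toy set the likelihood is maximised when the top-class probability equals 3/4, which corresponds to $\tau = \ln(6)/4 \approx 0.448$, so the fitted value is below 1, in line with the overconfident case discussed in the text.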

3. METHOD: COOLING WHILE TRAINING

3.1. FUNDAMENTALS

When the network overheats, the predicted values $\hat{y}_i$ become too close to 0 or 1. This can cause problems with gradients becoming large. We hypothesise therefore that keeping the network at the correct temperature during training can lead to improved convergence. Our proposed operation is to periodically correct the network, by multiplying the logits $z_i$ by the optimal temperature-correcting constant. There are multiple ways to implement Cooling, the most basic being last layer Cooling:

Definition 3.1. A network performs last layer Cooling if before the softmax function there is a final scaling layer, multiplying the network output logits $z$ by a constant scalar $\tau > 0$. This value $\tau$ is not modified during the batch updates of the gradients, but is corrected using a held-out validation set at the end of the Cooling period.

Cooling factor. We investigate the effect of taking the optimal temperature parameter $\tau$ to some power $\kappa$, which we call the Cooling factor. Let us assume that $\tau < 1$ (which is mostly the case). Then for $\kappa > 1$, multiplying by $\tau^\kappa$ results in smaller logits, a scenario which we call overcooling. Conversely, $\kappa < 1$ produces larger logits, resulting in an undercooling of the network. We note that as $\kappa \to 0$, we approach standard network training without Cooling. As we show in the experiments, using a suitable value for $\kappa$ can have positive effects on the training stability and performance.

Cooling period. We call the periodic time interval after which we perform temperature scaling the Cooling period. Typically, we let the Cooling period be equal to one epoch of training.

We trained the VGG-style network described in §4.2 on CIFAR10, using an Nvidia GeForce GTX Titan Z GPU and a 16-core Intel Xeon CPU E5-2640 v3.
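To make the role of the Cooling factor concrete, the sketch below scales a single logits vector by $\tau^\kappa$ for several values of $\kappa$ and reports the resulting top-class confidence. The function name `cooled_probabilities` and the example logits and $\tau$ are hypothetical choices for illustration only.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cooled_probabilities(logits, tau, kappa):
    """Softmax of the logits after multiplying them by tau ** kappa."""
    scale = tau ** kappa
    return softmax([scale * v for v in logits])

logits = [3.0, 1.0, 0.0]   # hypothetical overconfident logits
tau = 0.5                  # hypothetical optimal temperature correction (< 1)

# kappa = 0 is no Cooling, kappa < 1 undercools, kappa > 1 overcools.
confidences = {}
for kappa in (0.0, 0.5, 1.0, 2.0):
    confidences[kappa] = max(cooled_probabilities(logits, tau, kappa))
    print(f"kappa = {kappa:3.1f}  scale = {tau ** kappa:.3f}  "
          f"top confidence = {confidences[kappa]:.3f}")
```

With $\tau < 1$, larger values of $\kappa$ shrink the logits more strongly and therefore give a flatter softmax output, while $\kappa = 0$ reproduces the unscaled prediction.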

3.2. DISTRIBUTED COOLING

Instead of waiting to perform scaling after the final layer, it is possible to redistribute the temperature correction across the network, by scaling layers other than the last. Suppose the optimal temperature is $\tau \in \mathbb{R}^+$. When we redistribute the temperature across the network, we would like the temperature correction to gradually take effect. More precisely, we want to ensure that (1) each layer multiplies its input by an additional factor of $\beta = \tau^{1/L}$, so that the output of the $i$-th layer is now $x'_i = \beta^i x_i$. Moreover, (2) the inputs to the non-linearities $\rho$ have to be the same as before scaling because we would otherwise change the network output in a non-linear manner. Finally, (3) the output logits $z$ should be multiplied by $\beta^L = \tau$.

Definition 3.2. Let the notation be as in § 2. Let $\beta = \tau^{1/L}$. A network performs distributed Cooling if after each Cooling period,
1. the weight matrix $W_i$ is multiplied by $\beta$, resulting in a new matrix $W'_i := \beta W_i$;
2. the bias vector $b_i$ is multiplied by $\beta^i$, resulting in a new vector $b'_i := \beta^i b_i$;
3. the activation $\rho$ is changed to $\rho_{\beta^i}$, defined by $\rho_{\beta^i}(x) = \beta^i \rho(\beta^{-i} x)$.

Lemma 3.3. If a network performs distributed Cooling, then, compared to no Cooling,
1. the output of the $i$-th layer is $x'_i = \beta^i x_i$, for $1 \le i \le L-1$;
2. the input to each of the non-linearities is left unchanged;
3. the output logits are scaled by a factor of $\tau$: $z' = \tau z$.

Proof. We give the proof in Appendix A.

Algorithm 3.1 (fragment; the preceding steps are not shown)
    for i = 1 to L do
        bias_scale ← bias_scale * kernel_scale
        W_i ← W_i / kernel_scale
        b_i ← b_i / bias_scale
    Update s_τ using new τ

It may be observed that if $\rho$ is the ReLU function, then $\rho_{\beta^i} = \rho$, so the activation layer is not changed. (More generally, this holds for piecewise-linear ReLU variants such as CReLU, Leaky ReLU and PReLU.)
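The statements of Lemma 3.3 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a tiny fully connected network with arbitrary, hand-picked weights, ReLU activations on the hidden layers (so the activation itself needs no modification) and a linear last layer producing the logits. The function names `forward` and `distribute_cooling` are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def forward(weights, biases, x):
    """Return the output of every layer (ReLU on hidden layers, linear logits)."""
    outputs = []
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        p = [s + c for s, c in zip(matvec(W, x), b)]
        x = p if i == last else [max(0.0, v) for v in p]
        outputs.append(x)
    return outputs

def distribute_cooling(weights, biases, tau):
    """Apply Definition 3.2: W_i <- beta * W_i and b_i <- beta**i * b_i (1-based i)."""
    beta = tau ** (1.0 / len(weights))
    new_weights = [[[beta * w for w in row] for row in W] for W in weights]
    new_biases = [[beta ** (i + 1) * c for c in b] for i, b in enumerate(biases)]
    return new_weights, new_biases, beta

# Arbitrary example parameters for a 2 -> 3 -> 3 -> 2 network (L = 3).
weights = [
    [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]],
    [[0.2, 0.4, -0.6], [-0.3, 0.9, 0.5], [1.0, -0.5, 0.2]],
    [[0.7, -0.2, 0.3], [-0.4, 0.6, 0.9]],
]
biases = [[0.1, -0.2, 0.3], [0.05, 0.1, -0.1], [0.0, 0.2]]
x0 = [1.0, 2.0]
tau = 0.5

before = forward(weights, biases, x0)
cooled_w, cooled_b, beta = distribute_cooling(weights, biases, tau)
after = forward(cooled_w, cooled_b, x0)

# Largest deviation from x'_i = beta**i * x_i over all layers and units.
max_dev = max(
    abs(a - beta ** (i + 1) * b)
    for i, (xa, xb) in enumerate(zip(after, before))
    for a, b in zip(xa, xb)
)
print("logits before:", before[-1])
print("logits after: ", after[-1])
print("max deviation from beta**i scaling:", max_dev)
```

Since $\beta^L = \tau$, the last entry of `after` is the original logits vector multiplied by $\tau$, up to floating-point rounding.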
Note also that $\tau$ is usually less than 1, indicating overheating of the network, and hence multiplications by $\beta$ and $\beta^i$ result in a decrease in the values of the parameters $W_i$ and $b_i$. When performing last layer Cooling, we have observed that the network can correct for this by overheating even more. This means that the required temperature correction $\tau$ becomes smaller and smaller, towards zero. At the same time, this means that the parameters of the layers (for instance $W$ and $b$ in affine layers) become larger and larger, eventually causing numerical overflow. A possible solution to this problem is to keep track of the overheating parameter $\tau$, and when it becomes too small, redistribute the temperature correction over all layers:

Definition 3.4. A network performs periodically redistributed Cooling if for some $\tau_{\max}, \tau_{\mathrm{re}} > 0$,
• it performs last layer Cooling as long as $\tau < \tau_{\max}$ (i.e. the optimal temperature is less than a specified maximal temperature);
• it redistributes the excess temperature $\tau / \tau_{\mathrm{re}}$ across the layers as in Definition 3.2 if $\tau > \tau_{\max}$.

The values $\tau_{\max}$ and $\tau_{\mathrm{re}}$ are manually specified. In our experiments, we used $\tau_{\max} = \tau_{\mathrm{re}} = 100$ for periodically redistributed Cooling, but other values can also be considered.

Summary. Algorithm 3.1 gives the pseudocode of our proposed Cooling method. The algorithm shows periodically redistributed Cooling, since it is the most general case. We can recover last layer Cooling and distributed Cooling as special cases with $(\tau_{\max}, \tau_{\mathrm{re}}) = (\infty, \infty)$ and $(\tau_{\max}, \tau_{\mathrm{re}}) = (0, 1)$, respectively. The algorithm displayed above is simplified for the sake of clarity: it does not take account of layers like attention and batch normalization, and it excludes skip connections. These layers are straightforward to address in a general implementation of the Cooling method.

3.3. EFFECTS ON GRADIENT VALUES

We now analyse the effect of Cooling on the network gradients.

3.3.1. LAST LAYER COOLING

Proposition 3.5 (Gradients of last layer Cooling). Let the notation be as in § 2. If $C = \mathcal{L}(\sigma(\tau z))$ where $\sigma$ is the softmax function and $\mathcal{L}$ is the cross-entropy loss, the derivative with respect to any network parameter $w \in \theta$ is given by

$$\frac{\partial C}{\partial w} = \left\langle \sigma(\tau z) - y,\ \frac{\partial (\tau z)}{\partial w} \right\rangle = \tau \left\langle \sigma(\tau z) - y,\ \frac{\partial z}{\partial w} \right\rangle, \tag{3.1}$$

expressed as the inner product of two vectors.

Proof. When $\tau = 1$, Equation 3.1 is a known result of a simple computation. The general case is an application of the chain rule.

Interpretation. The difference $\epsilon = \sigma(z) - y$ may be termed the residual, namely the difference between the ground-truth label probabilities $y$ and the label probabilities $\sigma(z)$ computed by the network. The derivative $\partial C / \partial w$ is then the inner product of this residual with the vector $\partial z / \partial w$. We can use this formula to analyze the effect of temperature scaling by $\tau$. Suppose $\tau = 1$, so there is no heating correction. As is known, in this case probabilities tend to be overestimated, so that $\sigma(z)$ approaches $y$. All values of $\sigma(z)$ other than the ground truth become very small, meaning that values of $\partial z_i / \partial w$ are multiplied by small values, and so are ultimately ignored, with harmful effects on convergence. Setting $\tau < 1$ results in a more evenly distributed (less peaked) vector $\sigma(\tau z)$, meaning that all values of $z_i$ and $\partial z_i / \partial w$ have an effect on the gradient.
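Equation 3.1 can be sanity-checked with finite differences. The sketch below assumes the simplest possible network, a single linear layer $z = Wx$ whose parameters $w$ are the entries of $W$, so that $\langle \sigma(\tau z) - y, \partial z / \partial W_{jk} \rangle = (\sigma(\tau z)_j - y_j)\, x_k$. The weights, input and helper names are arbitrary choices for this example.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, x):
    """Linear model z = W x (assumed here purely for illustration)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def loss(W, x, y, tau):
    """Cross-entropy between y and softmax(tau * z)."""
    probs = softmax([tau * v for v in logits(W, x)])
    return -sum(yi * math.log(pi) for yi, pi in zip(y, probs))

def analytic_grad(W, x, y, tau):
    """Equation 3.1 for z = W x: dC/dW_jk = tau * (softmax(tau z)_j - y_j) * x_k."""
    probs = softmax([tau * v for v in logits(W, x)])
    return [[tau * (p - yi) * xk for xk in x] for p, yi in zip(probs, y)]

def numeric_grad(W, x, y, tau, eps=1e-6):
    """Central finite differences with respect to every entry of W."""
    grad = []
    for j, row in enumerate(W):
        grad_row = []
        for k in range(len(row)):
            plus = [r[:] for r in W]
            minus = [r[:] for r in W]
            plus[j][k] += eps
            minus[j][k] -= eps
            grad_row.append((loss(plus, x, y, tau) - loss(minus, x, y, tau)) / (2 * eps))
        grad.append(grad_row)
    return grad

W = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]]  # 3 classes, 2 input features
x = [1.0, 2.0]
y = [0.0, 1.0, 0.0]                          # one-hot ground truth
tau = 0.5

ga = analytic_grad(W, x, y, tau)
gn = numeric_grad(W, x, y, tau)
max_error = max(abs(a - n) for ra, rn in zip(ga, gn) for a, n in zip(ra, rn))
print("max |analytic - numeric| =", max_error)
```

The two gradients agree up to finite-difference error, which is consistent with the factor $\tau$ appearing in front of the inner product.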

3.3.2. DISTRIBUTED COOLING

Now, we consider what happens to gradients when distributed Cooling is applied. We simplify the analysis by thinking of scaling as occurring in two steps. First, global temperature scaling is applied by modifying the final layer so that its output is multiplied by $\tau$. Subsequently, distributed scaling is applied to all layers, resulting in the output of the $i$-th layer being multiplied by $\beta_i$, in such a way that the network output is unchanged. The effect of the final-layer scaling on gradients was addressed earlier. Now we concentrate on the effect of distributed scaling on gradients in the network.

Consider a network with $N$ layers, labelled $0$ to $N-1$, let $x_i$ be the input to the $i$-th layer (which is also the output of the $(i-1)$-th layer), and $x_N$ the output of the last layer. Let another network have layer inputs denoted by $x'_i$.

Definition 3.6. We will say that two networks are scale-equivalent if for inputs $x_0 = x'_0$ there are constants $\beta_i$ with $\beta_0 = \beta_N = 1$ such that $x'_i = \beta_i x_i$.

Evidently, for the same input $x_0 = x'_0$, the outputs $x_N = x'_N$ are the same, since $\beta_N = 1$. It will be observed, however, that the gradients of the parameters of these networks will be different. Let the first (unprimed) network be represented by $x_{i+1} = \rho_i(W_i x_i + b_i)$, where $\rho_i$ is an activation function, possibly different for each $i$. Then, given numbers $\beta_i$ with $\beta_0 = \beta_N = 1$, an equivalent network is given by $x'_{i+1} = \rho'_i(W'_i x'_i + b'_i)$, where

$$W'_i = \beta_{i+1} \beta_i^{-1} W_i, \qquad b'_i = \beta_{i+1} b_i, \qquad \rho'_i = \rho_{\beta_{i+1}}, \tag{3.2}$$

and $\rho_\beta$ is a modified activation function given by $\rho_\beta(x) = \beta \rho(\beta^{-1} x)$. (It should be noted that if $\rho$ is a ReLU activation, then $\rho = \rho_\beta$.) Then, $x'_i = \beta_i x_i$ for all $i$, as required. Thus, with $T_i$ representing the transformation $x_i \mapsto \rho_i(W_i x_i + b_i) = x_{i+1}$ (and similarly $T'_i$), we compare the two networks

$$\sigma \circ T_{N-1} \circ T_{N-2} \circ \dots \circ T_0 (x_0) \quad \text{and} \quad \sigma \circ T'_{N-1} \circ T'_{N-2} \circ \dots \circ T'_0 (x_0),$$

where $\sigma$ represents the final softmax layer. It is evident that these two networks carry out the same operation. However, it will be shown that if they are optimized using a gradient-descent based method, the updates of their parameters will be different, and the trajectories of the parameters towards the optimum during training will be quite different. Let $W_{i,jk}$ be one of the entries of $W_i$ and $b_{i,j}$ be one of the entries of $b_i$. Similarly, let $W'_{i,jk}$ and $b'_{i,j}$ be the corresponding parameters of the primed (distributively scaled) network. The following will be shown:

Theorem 3.7. Let $C = \mathcal{L}(z)$ where $z = \sigma \circ f_\theta(x_0) = \sigma \circ f'_{\theta'}(x_0)$ for scale-equivalent networks $f_\theta$ and $f'_{\theta'}$. The functions $\sigma$ and $\mathcal{L}$ are the softmax and loss functions. Let $W_{i,jk}$ be the $(j,k)$-th entry of the parameter matrix $W_i$ and $b_{i,j}$ the $j$-th entry of the parameter vector $b_i$. Then

$$\frac{\partial C}{\partial W'_{i,jk}} = \frac{\beta_i}{\beta_{i+1}} \frac{\partial C}{\partial W_{i,jk}} \qquad \text{and} \qquad \frac{\partial C}{\partial b'_{i,j}} = \frac{1}{\beta_{i+1}} \frac{\partial C}{\partial b_{i,j}}. \tag{3.3}$$

Proof. Define $p_{i+1} = W_i x_i + b_i$ and $x_{i+1} = \rho_i(p_{i+1})$, and similarly for the primed quantities. We see that $x'_i = \beta_i x_i$ and $p'_i = \beta_i p_i$ for all $i$. Let $f_{i+1}$ be the mapping defined by

$$C = f_{i+1}(p_{i+1}) = \mathcal{L} \circ \sigma \circ T_{N-1} \circ T_{N-2} \circ \dots \circ T_{i+1} \circ \rho_i (p_{i+1}),$$

namely, the part of the network "downstream" from $p_{i+1}$ (including the activation function $\rho_i$ in the $i$-th layer, and the softmax and loss functions). The function $f'_{i+1}$ is similarly defined for the primed network. We apply the chain rule:

$$\frac{\partial C}{\partial W_{i,jk}} = \frac{\partial C}{\partial p_{i+1}} \frac{\partial p_{i+1}}{\partial W_{i,jk}}.$$

A similar formula holds for the primed case. Now, since $C = f_{i+1}(p_{i+1}) = f'_{i+1}(p'_{i+1}) = f'_{i+1}(\beta_{i+1} p_{i+1})$, we see that

$$\frac{\partial C}{\partial p'_{i+1}} = \beta_{i+1}^{-1} \frac{\partial C}{\partial p_{i+1}}. \tag{3.4}$$

Next, we compare $\partial p_{i+1} / \partial W_{i,jk}$ and $\partial p'_{i+1} / \partial W'_{i,jk}$. Let $E_{jk}$ be the matrix with an entry 1 in position $(j,k)$ and 0 elsewhere. Then $\partial p_{i+1} / \partial W_{i,jk} = E_{jk} x_i$. On the other hand,

$$\frac{\partial p'_{i+1}}{\partial W'_{i,jk}} = E_{jk} x'_i = E_{jk} (\beta_i x_i) = \beta_i \frac{\partial p_{i+1}}{\partial W_{i,jk}}.$$

Putting this together with Equation 3.4, we see that $\partial C / \partial W'_{i,jk} = (\beta_i / \beta_{i+1})\, \partial C / \partial W_{i,jk}$, as required. In the case where $b_{i,j}$ is an entry of $b_i$, we see that $\partial p'_{i+1} / \partial b'_{i,j} = \partial p_{i+1} / \partial b_{i,j}$, so $\partial C / \partial b'_{i,j} = (1 / \beta_{i+1})\, \partial C / \partial b_{i,j}$.
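The gradient relations of Theorem 3.7 can also be checked numerically. The following sketch is an illustration under assumptions chosen for this example only: a three-layer network with two units per layer, tanh activations (so that the modified activation $\rho_\beta(x) = \beta \tanh(x / \beta)$ actually differs from $\rho$), arbitrary weights, and central finite differences instead of backpropagation. All names are hypothetical.

```python
import copy
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(weights, biases, scales, x, y):
    """Cross-entropy of softmax(x_N); layer i uses rho_s(p) = s * tanh(p / s)."""
    for W, b, s in zip(weights, biases, scales):
        p = [sum(w * v for w, v in zip(row, x)) + c for row, c in zip(W, b)]
        x = [s * math.tanh(v / s) for v in p]
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(x)))

def perturbed(params, index, delta):
    """Copy of a nested list with the entry at `index` shifted by delta."""
    out = copy.deepcopy(params)
    target = out
    for idx in index[:-1]:
        target = target[idx]
    target[index[-1]] += delta
    return out

def derivative(fn, params, index, eps=1e-6):
    """Central finite difference of fn(params) w.r.t. one entry."""
    return (fn(perturbed(params, index, eps)) - fn(perturbed(params, index, -eps))) / (2 * eps)

weights = [[[0.5, -0.3], [0.8, 0.2]], [[-0.4, 0.7], [0.6, 0.9]], [[1.0, -0.5], [0.3, 0.4]]]
biases = [[0.1, -0.1], [0.2, 0.05], [-0.2, 0.1]]
betas = [1.0, 0.8, 0.5, 1.0]          # beta_0 = beta_N = 1
x0, y = [1.0, -2.0], [1.0, 0.0]
N = len(weights)
ones = [1.0] * N

# Scale-equivalent network from Equation 3.2.
weights_p = [[[betas[i + 1] / betas[i] * w for w in row] for row in W]
             for i, W in enumerate(weights)]
biases_p = [[betas[i + 1] * c for c in b] for i, b in enumerate(biases)]
scales_p = betas[1:]

loss_diff = abs(loss(weights, biases, ones, x0, y) - loss(weights_p, biases_p, scales_p, x0, y))

max_weight_err = max_bias_err = 0.0
for i in range(N):
    for j in range(2):
        g = derivative(lambda b: loss(weights, b, ones, x0, y), biases, (i, j))
        g_p = derivative(lambda b: loss(weights_p, b, scales_p, x0, y), biases_p, (i, j))
        max_bias_err = max(max_bias_err, abs(g_p - g / betas[i + 1]))
        for k in range(2):
            g = derivative(lambda w: loss(w, biases, ones, x0, y), weights, (i, j, k))
            g_p = derivative(lambda w: loss(w, biases_p, scales_p, x0, y), weights_p, (i, j, k))
            max_weight_err = max(max_weight_err, abs(g_p - betas[i] / betas[i + 1] * g))

print("loss difference:", loss_diff)
print("max deviation from Equation 3.3 (weights, biases):", max_weight_err, max_bias_err)
```

Both networks give the same loss, while their gradients differ by exactly the factors stated in Equation 3.3, up to finite-difference error.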

Relative gradients.

Since a small change to a small parameter is more important than the same change to a large parameter, it is perhaps more important to determine the ratio $(\partial C / \partial \theta) / \theta$, which determines by what ratio a parameter is changed during a gradient update. This gives:

$$\frac{\partial C / \partial W'_{i,jk}}{W'_{i,jk}} = \frac{\beta_i^2}{\beta_{i+1}^2} \, \frac{\partial C / \partial W_{i,jk}}{W_{i,jk}}, \qquad \frac{\partial C / \partial b'_{i,j}}{b'_{i,j}} = \frac{1}{\beta_{i+1}^2} \, \frac{\partial C / \partial b_{i,j}}{b_{i,j}}.$$

Interpretation. The effect of distributed scaling is to individually change the relative effect of gradients over the network. In particular, if $\beta_i = \tau^{i/N}$ for $i = 0, \dots, N-1$ and $\beta_N = 1$, distributing the scale evenly across the network, with $\tau < 1$, then the effect is to modify the gradients and relative gradients across the network. This may help to mitigate vanishing gradients. Distributed scaling can be used to control the magnitudes of the outputs $x_i$ at each level.
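As a worked example of the factors above, the sketch below tabulates the gradient and relative-gradient multipliers for the evenly distributed choice $\beta_i = \tau^{i/N}$ ($i < N$), $\beta_N = 1$. The function name `gradient_factors` and the values $\tau = 0.5$, $N = 4$ are illustrative assumptions.

```python
def gradient_factors(tau, n_layers):
    """Per-layer multipliers implied by Theorem 3.7 for beta_i = tau**(i/N), beta_N = 1.

    Returns tuples (layer, weight factor, bias factor,
                    relative weight factor, relative bias factor).
    """
    betas = [tau ** (i / n_layers) for i in range(n_layers)] + [1.0]
    rows = []
    for i in range(n_layers):
        w_factor = betas[i] / betas[i + 1]   # dC/dW' = w_factor * dC/dW
        b_factor = 1.0 / betas[i + 1]        # dC/db' = b_factor * dC/db
        rows.append((i, w_factor, b_factor, w_factor ** 2, b_factor ** 2))
    return rows

rows = gradient_factors(tau=0.5, n_layers=4)
for i, w, b, rel_w, rel_b in rows:
    print(f"layer {i}: grad W x{w:.3f}, grad b x{b:.3f}, "
          f"relative W x{rel_w:.3f}, relative b x{rel_b:.3f}")
```

Under these assumptions, every layer except the last has its weight gradients multiplied by $\tau^{-1/N} > 1$, whereas the last layer, which has to restore $\beta_N = 1$, has them multiplied by $\tau^{(N-1)/N} < 1$.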

4.1. GENERAL SETUP

We explore the use of Cooling on two image classification datasets and one semantic segmentation dataset. We use 99% of the CIFAR "training sets" (corresponding to 50,000 images in total) for training and 1% as a validation set to optimise the temperature τ on. We train our networks using either the SGD optimizer with a momentum of 0.9 or the Adam optimizer (Kingma & Ba, 2015) with ϵ = 0.1, β 1 = 0.9, β 2 = 0.999. All networks are trained using the TensorFlow (Abadi et al., 2015) framework.

4.2. IMAGE CLASSIFICATION: CIFAR10

Setup. We train a small VGG-style network (Simonyan & Zisserman, 2014) on the CIFAR10 dataset (Krizhevsky, 2009). The network consists of a sequence of 6 convolutional layers of filter size 3 × 3 with 32, 32, 64, 64, 128 and 128 channels, respectively, followed by two dense layers with 128 and 10 output nodes, respectively. In total, the network has approximately 620,000 trainable parameters. The hidden layers either use the ReLU (Fukushima, 1980; Nair & Hinton, 2010) or the CReLU (Shang et al., 2016) activation. When we use learning rate warmup, we linearly increase the learning rate from 0 to 0.01 over 2 epochs. When we do not use warmup, we directly start with a learning rate of 0.01. In our learning rate schedule ablations, we experiment with (1) no schedule, (2) a piecewise constant schedule which drops by a factor of 0.1 after 30 and 40 epochs, (3) a schedule with linear decay from the initial rate to 0, (4) a schedule with exponential decay with a total drop by either a factor of 0.01 ("slow") or 0.001 ("fast") and (5) a cosine decay schedule (Loshchilov & Hutter, 2017). We train the network for 50 epochs. We use a batch size of 64. The effect of the Cooling factor on the inverse of the network temperatures is shown in Figure 2. In the left plot, where a smaller Cooling factor and last layer Cooling is used, the inverse temperature grows much more slowly and only exceeds 100 after 19 epochs. In the right plot, where periodically redistributed Cooling with $\tau_{\mathrm{re}} = \tau_{\max} = 100$ is used, the first temperature reset already happens after 9 epochs.
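The learning rate schedules listed in the setup above can be written as simple functions of the (possibly fractional) epoch. The sketch below is one plausible reading, not the configuration used in the experiments: the exact functional forms (for instance, cosine decay to zero without restarts, or an exponential schedule whose total drop is reached at the final epoch) are assumptions, as are all function names.

```python
import math

BASE_LR = 0.01   # initial learning rate stated in the setup
EPOCHS = 50      # total number of training epochs stated in the setup

def warmup(epoch, warmup_epochs=2.0):
    """Linear warmup from 0 to BASE_LR over the first `warmup_epochs` epochs."""
    return BASE_LR * min(1.0, epoch / warmup_epochs)

def piecewise_constant(epoch, boundaries=(30, 40), factor=0.1):
    """Drop the rate by `factor` once each boundary epoch has been reached."""
    drops = sum(1 for boundary in boundaries if epoch >= boundary)
    return BASE_LR * factor ** drops

def linear_decay(epoch):
    """Linear decay from BASE_LR at epoch 0 to 0 at the final epoch."""
    return BASE_LR * (1.0 - epoch / EPOCHS)

def exponential_decay(epoch, total_drop=0.01):
    """Exponential decay reaching BASE_LR * total_drop at the final epoch.

    total_drop = 0.01 corresponds to "slow", 0.001 to "fast".
    """
    return BASE_LR * total_drop ** (epoch / EPOCHS)

def cosine_decay(epoch):
    """Cosine decay from BASE_LR to 0 (assumed: no restarts, no floor)."""
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * epoch / EPOCHS))

for epoch in (0, 1, 25, 35, 45, 50):
    print(f"epoch {epoch:2d}: warmup={warmup(epoch):.4f} "
          f"piecewise={piecewise_constant(epoch):.5f} linear={linear_decay(epoch):.4f} "
          f"exp_slow={exponential_decay(epoch):.6f} cosine={cosine_decay(epoch):.4f}")
```

The smooth schedules here are evaluated per epoch for readability; evaluating them per batch only requires passing a fractional epoch.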

Results. As shown in

We present further experimental results on CIFAR10 in Appendix B.

4.3. IMAGE CLASSIFICATION: CIFAR100

Setup. We train a ResNet50 network (He et al., 2016b) on the CIFAR100 dataset (Krizhevsky, 2009). The network consists of a sequence of 50 convolutional layers of varying filter sizes and channel numbers, followed by a global average pooling layer. The network uses skip connections and has approximately 23.5 million trainable parameters in total. The hidden layers either use the ReLU (Fukushima, 1980; Nair & Hinton, 2010) or the CReLU (Shang et al., 2016) activation. In our learning rate schedule ablations, we try out (1) no schedule and (2) a piecewise constant schedule with drops by a factor of 0.1 after 80, 120, 160 and 180 epochs. We train the network for 200 epochs. We use a batch size of 64.

Results. Similar to the CIFAR10 experiments, we present an ablation on the effect of different Cooling factors on last layer Cooling. In Figure 1 (right) we notice the same pattern: Cooling factors κ that do not exceed 1.0 yield neural network models that outperform the baseline. On the other hand, we observe once more that κ > 1 leads to the divergence of network training.

4.4. SEMANTIC SEGMENTATION: ADE20K DATASET

Setup. We train a small U-Net (Ronneberger et al., 2015) architecture on the challenging ADE20K dataset (Zhou et al., 2019), which includes 150 semantic categories. This dataset contains 20,000 images for training and 2,000 images for validation, on which we report results. We leave aside 320 images from the training set when performing Cooling. We work with images of size 256 × 256 and our U-Net architecture has ≈ 9 million trainable parameters.

Results. We compare our proposed last layer Cooling against the baseline, with no temperature scaling. For the former we obtain 22.1 mIoU and 71.9% accuracy, while for the latter we obtain 21.0 mIoU and 70.9% accuracy. This shows that our proposed Cooling method is also beneficial on denser, pixel-wise classification tasks. Further investigation on larger architectures and across multiple design choices could further reveal the full potential of Cooling.

5. DISCUSSION AND CONCLUSION

Our proposed Cooling method to adaptively calibrate classification neural networks produces significant benefits in terms of network performance and training stability. Theoretical and empirical findings point to significant benefits resulting from differently scaled gradients during network training. As a result of experiments on different tasks, datasets and network architectures, as well as an ablation study on different hyperparameter settings, we find that Cooling gives a significant performance benefit over relevant baselines. In particular, we notice that Cooling greatly reduces the need for a learning rate schedule. Even though all versions of Cooling re-scale the network to the same mathematical function, they all produce differently parameterised networks. This reparameterisation has a strong impact on gradients, resulting in different network functions as training progresses. This raises the question: What are the general conditions on the parameterisation of a network to achieve optimal training? Another highlight of our work is the connection between calibration and the learning rate. There are indications that well-calibrated networks are more stable in training and less reliant on the 'right' learning rate schedule. Since training stability is critical in a number of classification tasks (e.g. when training the discriminator of a GAN), a deeper investigation into the relation between calibration and training stability could be a promising direction for future research.



Figure 1: Plots showing the test accuracy of networks trained with last layer Cooling and various Cooling factors. Left: VGG network trained on CIFAR10. Right: ResNet50 network trained on CIFAR100. The non-Cooling baselines are shown as dashed. All Cooling factors κ ≤ 1 outperform the baseline, the best one by 4.6%. Training diverges for Cooling factors κ > 1. (All means and standard deviations are computed over three runs.)

Figure 2: Comparison of the increase of network temperatures for different Cooling factors. Both images show the training of a VGG network on CIFAR10 with ReLU activations. Left: last layer Cooling, CF: 0.75. Right: periodically redistributed Cooling, CF: 1.0. We note that a larger Cooling factor causes a much steeper increase in the temperature. For last layer Cooling, the temperature keeps on growing, whereas for periodically redistributed Cooling, the temperature is redistributed whenever it exceeds a value of 100.

Algorithm 3.1 Periodically Redistributed Cooling (simplified)
Inputs:
    Neural network $f_\theta$ with $L$ linear layers $(W_i, b_i)$
    Final Cooling layer $s_\tau$
    Validation data set $X_{\mathrm{val}}$
    Cooling factor $\kappa$
    Maximal temperature $\tau_{\max}$; reset temperature $\tau_{\mathrm{re}}$
Output:
    Calibrated network $f_{\theta'}$

Cooling can have a significantly positive impact on network performance. We see a stark difference between various LR schedules. Whereas smooth schedules (where the learning rate changes after each batch) hardly benefit from Cooling, there is a noticeable benefit for piecewise constant schedules and a drastic improvement when no schedule is employed. Starting from the lowest performance at 74.7% test accuracy, last layer Cooling increases the test accuracy by 4.6%. Figure 1 (left) shows an ablation of last layer Cooling, involving various Cooling factors κ. Last layer Cooling shows little sensitivity to the Cooling factor, as long as κ ≤ 1.0. Values greater than 1.0 lead to divergence. On the other hand, all values of κ ≤ 1.0 produce networks outperforming the non-Cooling baseline.

Table 1: Test accuracy of a VGG network trained on CIFAR10 with various learning rate schedules and Cooling modes. The different Cooling modes perform at least on par with baselines on all LR schedules. Cooling considerably outperforms the baseline when no schedule is used. Significant gains are also achieved when a piecewise constant schedule is used. (All means and standard deviations are computed over three runs.)

Table 2 shows that Cooling works well for both ReLU and CReLU activation functions. In particular, CReLU diverged in all of our experiments without learning rate schedules, but converged when Cooling was used. As for ReLU, we note that last layer and periodically redistributed Cooling outperform pure distributed Cooling.

Table 2: Test accuracies of a VGG network trained without a learning rate schedule on CIFAR10 with different activation functions and Cooling modes. Last layer scaling and periodically redistributed scaling perform best for both activation functions. Training diverges for CReLU without Cooling and without a LR schedule. (All means and standard deviations are computed over three runs.)

