HOW TO KEEP COOL WHILE TRAINING

Abstract

Modern neural networks used for classification are notoriously prone to overly confident predictions. Multiple calibration methods have been proposed, yielding noteworthy progress in addressing overconfidence. However, to the best of our knowledge, prior work has focused exclusively on the factors that affect calibration, leaving open the question of how (mis)calibration in turn negatively impacts network training. Aiming to better understand such dependencies, we propose a temperature-based Cooling method to calibrate classification neural networks during training. Cooling results in better gradient scaling and reduces the need for a learning rate schedule. We investigate different variants of Cooling, with the simplest variant, last-layer Cooling, also being the best-performing one, improving network performance across a range of datasets, network architectures, and hyperparameter settings.

1. INTRODUCTION

Training neural networks can be a challenging task, with optimal performance depending on the right setting of hyperparameters. For this reason, finding a suitable network configuration can often take multiple costly training runs with varying parameters of the learning rate schedule, the optimizer, and the batch size. Apart from standard learning rate schedules like piecewise constant and exponential decay schedules, there has been active research in developing better schedules: among the most prominent are learning rate warmup (Goyal et al., 2017; He et al., 2016a) and cosine decay (Loshchilov & Hutter, 2017) schedules. Complementary to these challenges, Guo et al. (2017) found that modern convolutional classification networks are often poorly calibrated, leading to overly confident predictions. They investigated multiple methods to improve calibration, with a simple temperature scaling method performing best: the network's output logits are multiplied by a temperature parameter, optimised on a validation dataset after training. Importantly, this leaves the maximal value, and therefore the predicted class label, unchanged, since all the logits are multiplied by the same temperature value. Since then, multiple papers (Kull et al., 2019; Kumar et al., 2019; 2018; Müller et al., 2019; Gupta et al., 2021) have proposed methods aiming at even better-calibrated networks. More recently, Desai & Durrett (2020) and Minderer et al. (2021) investigated the calibration of state-of-the-art nonconvolutional Transformer networks (Vaswani et al., 2017; Dosovitskiy et al., 2021) and MLP-Mixers (Tolstikhin et al., 2021). They concluded that such architectures may have benefits, with further work needed to fully understand the factors contributing to calibration.
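Temperature scaling as described above can be sketched in a few lines; this is a minimal illustration with hypothetical logits, not the paper's implementation. Because every logit is multiplied by the same positive factor, the argmax (and thus the predicted label) is preserved, while the confidence of the softmax output changes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits for a 4-class prediction.
logits = np.array([2.0, 1.0, 0.5, -1.0])

# Temperature scaling: multiply every logit by the same factor t.
# t < 1 softens (cools) the distribution; t > 1 sharpens it.
t = 0.5
probs = softmax(logits)
cooled = softmax(t * logits)

print(np.argmax(probs) == np.argmax(cooled))  # True: predicted class unchanged
print(probs.max(), cooled.max())              # confidence is reduced for t < 1
```

Note that post-hoc temperature scaling, as in Guo et al. (2017), fits `t` on a held-out validation set after training, so only the reported confidences change.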
Although temperature scaling initially leaves the accuracy unchanged, we have noticed that it can have an intriguing effect as training continues: scaling the output logits changes the cross-entropy loss, which in turn leads to scaled gradient updates and subsequently new parameter values. During training, this can lead to a significant increase in accuracy. To the best of our knowledge, temperature scaling has until now only been applied post hoc, after network training is complete. However, our investigation shows that networks become gradually overconfident during training (they overheat), which seems to have a detrimental effect on learning. This has motivated us to modify the original temperature scaling and propose a Cooling method to calibrate neural networks during training.
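The gradient-scaling effect can be made concrete with a small sketch (assumed names and values, not the paper's code). For softmax cross-entropy, the gradient with respect to the logits z is softmax(z) - y; if the logits are first multiplied by a temperature t, the chain rule gives t * (softmax(t*z) - y), so the temperature rescales both the probabilities and the magnitude of the gradient that flows back into the network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def ce_grad_wrt_logits(z, y, t=1.0):
    # Gradient of CE(softmax(t*z), y) with respect to the unscaled
    # logits z: by the chain rule, t * (softmax(t*z) - y).
    return t * (softmax(t * z) - y)

z = np.array([2.0, 1.0, -0.5])   # hypothetical logits
y = np.array([1.0, 0.0, 0.0])    # one-hot target

g_plain = ce_grad_wrt_logits(z, y, t=1.0)
g_cooled = ce_grad_wrt_logits(z, y, t=0.5)

# Cooling (t < 1) shrinks the gradient norm here, which is the
# mechanism by which in-training temperature scaling alters learning.
print(np.linalg.norm(g_plain), np.linalg.norm(g_cooled))
```

The same factor reaches every earlier layer through backpropagation, which is why changing the temperature during training behaves somewhat like an implicit adjustment of the learning rate.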

