CALIBRATING THE RIGGED LOTTERY: MAKING ALL TICKETS RELIABLE

Abstract

Although sparse training has been successfully used in various resource-limited deep learning tasks to save memory, accelerate training, and reduce inference time, the reliability of the resulting sparse models remains largely unexplored. Previous research has shown that deep neural networks tend to be over-confident, and we find that sparse training exacerbates this problem. Calibrating the sparse models is therefore crucial for reliable prediction and decision-making. In this paper, we propose a new sparse training method that produces sparse models with improved confidence calibration. In contrast to previous research that uses only one mask to control the sparse topology, our method utilizes two masks: a deterministic mask and a random mask. The former efficiently searches for and activates important weights by exploiting the magnitudes of weights and gradients, while the latter brings better exploration and finds more appropriate weight values through random updates. Theoretically, we prove that our method can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process. Extensive experiments on multiple datasets, model architectures, and sparsities show that our method reduces ECE values by up to 47.8% while maintaining or even improving accuracy, with only a slight increase in computation and storage burden.

1. INTRODUCTION

Sparse training is gaining increasing attention and has been used in various deep neural network (DNN) learning tasks (Evci et al., 2020; Dietrich et al., 2021; Bibikar et al., 2022). In sparse training, a certain percentage of connections is removed to save memory, accelerate training, and reduce inference time, enabling DNNs in resource-constrained situations. The sparse topology is usually controlled by a mask, and various sparse training methods have been proposed to find a suitable mask that achieves comparable or even higher accuracy than dense training (Evci et al., 2020; Liu et al., 2021; Schwarz et al., 2021). However, in order to deploy sparse models in real-world applications, a key question remains to be answered: how reliable are these models?

There has been a line of work studying the reliability of dense DNNs, in the sense that a DNN should know what it does not know (Guo et al., 2017; Nixon et al., 2019; Wang et al., 2021). In other words, a model's confidence (the probability associated with the predicted class label) should reflect its ground-truth correctness likelihood. A widely used reliability metric is the Expected Calibration Error (ECE) (Guo et al., 2017), which measures the difference between confidence and accuracy, with a lower ECE indicating higher reliability. However, prior research has shown that DNNs tend to be over-confident (Guo et al., 2017; Rahaman et al., 2021; Patel et al., 2022), suggesting DNNs may be too confident to notice incorrect decisions, leading to safety issues in real-world applications, e.g., automated healthcare and self-driving cars (Jiang et al., 2012; Bojarski et al., 2016). In this work, we for the first time identify and study the reliability problem of sparse training. We start with the question of how reliable current sparse training is.
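To make the metric concrete, ECE can be computed by binning predictions by confidence and taking a weighted average of the per-bin gap between accuracy and confidence. The sketch below is our minimal illustration; the bin count and function interface are our own choices, not taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()   # mean confidence in this bin
            avg_acc = correct[in_bin].mean()        # empirical accuracy in this bin
            ece += (in_bin.sum() / n) * abs(avg_acc - avg_conf)
    return ece
```

An over-confident model shows up as bins whose mean confidence exceeds their empirical accuracy, which is exactly the gap (blue area) visualized in reliability diagrams.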
We find that sparse training exacerbates over-confidence. When the accuracy is comparable to dense training (sparsity 0%-95%), we observe that the ECE values increase with sparsity, implying that the problem of over-confidence becomes more severe at higher sparsity. When the accuracy decreases sharply (sparsity >95%), the ECE value first decreases and then increases again. This produces a double descent phenomenon (Nakkiran et al., 2021) when the ECE curve is viewed from left to right (99.9%-0% sparsity) (see Section 6 for more discussion).

To improve reliability, we propose a new sparse training method that produces well-calibrated predictions while maintaining high accuracy. We call our method "The Calibrated Rigged Lottery", or CigL. Unlike previous sparse training methods that use only one mask, our method employs two masks, a deterministic mask and a random mask, to better explore the sparse topology and weight space. The deterministic mask efficiently searches for and activates important weights by exploiting the magnitudes of weights and gradients. The random mask, inspired by dropout, adds more exploration and leads to better convergence. Near the end of training, we collect the weights and masks at each epoch and apply a weight & mask averaging procedure to obtain a single sparse model. Theoretically, we show that our method can be viewed as a hierarchical variational approximation (Ranganath et al., 2016) to a probabilistic deep Gaussian process (Gal & Ghahramani, 2016), which yields a larger family of variational distributions and better Bayesian posterior approximations.

Our contributions are summarized as follows:

• We for the first time identify and study the reliability problem of sparse training and find that sparse training exacerbates the over-confidence problem of DNNs.

• We propose CigL, a new sparse training method that improves confidence calibration while achieving comparable or even higher accuracy.
• We prove that CigL can be viewed as a hierarchical variational approximation to a probabilistic deep Gaussian process, which improves calibration by better characterizing the posterior.

• We perform extensive experiments on multiple benchmark datasets, model architectures, and sparsities. CigL reduces ECE values by up to 47.8% and simultaneously maintains or even improves accuracy with only a slight increase in computational and storage burden.
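The two-mask idea can be sketched as follows. This is our illustrative reading of the description above, not the authors' exact algorithm: the function names, the top-k thresholding, and the drop probability are all our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_mask(weights, sparsity):
    """Keep the largest-magnitude weights (ties may keep a few extra)."""
    k = int(round((1.0 - sparsity) * weights.size))  # number of weights to keep
    if k == 0:
        return np.zeros_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[-k]
    return (np.abs(weights) >= threshold).astype(float)

def random_mask(base_mask, drop_prob=0.1):
    """Dropout-style mask: randomly disable a fraction of the active weights."""
    keep = rng.random(base_mask.shape) >= drop_prob
    return base_mask * keep

# One illustrative step: apply both masks before the forward pass.
w = rng.standard_normal((4, 4))
m_det = deterministic_mask(w, sparsity=0.75)   # deterministic topology search
m_rand = random_mask(m_det, drop_prob=0.1)     # stochastic exploration on top
w_sparse = w * m_det * m_rand
```

In this reading, the deterministic mask fixes which connections exist, while the random mask perturbs them from step to step; averaging the collected weights and masks near the end of training then plays the role of the weight & mask averaging procedure described above.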

2.1. SPARSE TRAINING

As the scale of models continues to grow, sparse training, which maintains sparse weights throughout the training process, has attracted increasing attention. Different sparse training methods have been investigated with various pruning and growth criteria, such as weight/gradient magnitude,
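A prune-and-grow update based on such criteria can be sketched as below, in the spirit of RigL (Evci et al., 2020): drop the smallest-magnitude active weights and regrow the same number of connections where the dense gradient magnitude is largest. The function name and the update fraction are illustrative assumptions, not a specific method's exact procedure.

```python
import numpy as np

def prune_and_grow(weights, mask, grads, update_frac=0.3):
    """Drop the smallest-|w| active connections, then regrow the same number
    of inactive connections with the largest |gradient|."""
    active = np.flatnonzero(mask.ravel())
    n_update = int(len(active) * update_frac)
    if n_update == 0:
        return mask
    flat_w, flat_g = weights.ravel(), grads.ravel()
    flat_m = mask.ravel().copy()

    # Prune: smallest-magnitude weights among currently active connections.
    drop = active[np.argsort(np.abs(flat_w[active]))[:n_update]]
    flat_m[drop] = 0.0

    # Grow: largest-magnitude gradients among currently inactive connections.
    inactive = np.flatnonzero(flat_m == 0)
    grow = inactive[np.argsort(np.abs(flat_g[inactive]))[-n_update:]]
    flat_m[grow] = 1.0
    return flat_m.reshape(mask.shape)
```

Note that the total number of active connections is preserved, so the overall sparsity level stays fixed while the topology is explored.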



Figure 1: Reliability diagrams for (a) the dense model and (b) the sparse model. The sparse model is more over-confident than the dense model. (c) The scatter plot of test accuracy (%) and ECE value at different sparsities. From the highly sparse model to the dense model, the ECE value first decreases, then increases, and then decreases again, showing a double descent pattern.

The over-confidence problem becomes even more pronounced when sparse training is applied to ResNet-50 on CIFAR-100. Figures 1(a)-(b) show that the gap (blue area) between confidence and accuracy of the sparse model (95% sparsity) is larger than that of the dense model (0% sparsity), implying the sparse model is more over-confident than the dense model. Figure 1(c) shows the test accuracy (pink curve) and ECE value (blue curve, a measure of reliability) (Guo et al., 2017) at different sparsities.

