CALIBRATING THE RIGGED LOTTERY: MAKING ALL TICKETS RELIABLE

Abstract

Although sparse training has been successfully used in various resource-limited deep learning tasks to save memory, accelerate training, and reduce inference time, the reliability of the produced sparse models remains unexplored. Previous research has shown that deep neural networks tend to be over-confident, and we find that sparse training exacerbates this problem. Therefore, calibrating the sparse models is crucial for reliable prediction and decision-making. In this paper, we propose a new sparse training method that produces sparse models with improved confidence calibration. In contrast to previous research that uses only one mask to control the sparse topology, our method utilizes two masks: a deterministic mask and a random mask. The former efficiently searches for and activates important weights by exploiting the magnitudes of weights and gradients, while the latter brings better exploration and finds more appropriate weight values through random updates. Theoretically, we prove that our method can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process. Extensive experiments on multiple datasets, model architectures, and sparsities show that our method reduces ECE values by up to 47.8% while maintaining or even improving accuracy, with only a slight increase in computation and storage burden.

1. INTRODUCTION

Sparse training is gaining increasing attention and has been used in various deep neural network (DNN) learning tasks (Evci et al., 2020; Dietrich et al., 2021; Bibikar et al., 2022). In sparse training, a certain percentage of connections are removed to save memory, accelerate training, and reduce inference time, enabling DNNs for resource-constrained situations. The sparse topology is usually controlled by a mask, and various sparse training methods have been proposed to find a suitable mask that achieves comparable or even higher accuracy than dense training (Evci et al., 2020; Liu et al., 2021; Schwarz et al., 2021). However, in order to deploy sparse models in real-world applications, a key question remains to be answered: how reliable are these models? There has been a line of work studying the reliability of dense DNNs, in the sense that DNNs should know what they do not know (Guo et al., 2017; Nixon et al., 2019; Wang et al., 2021). In other words, a model's confidence (the probability associated with the predicted class label) should reflect its ground-truth correctness likelihood. A widely used reliability metric is the Expected Calibration Error (ECE) (Guo et al., 2017), which measures the difference between confidence and accuracy, with a lower ECE indicating higher reliability. However, prior research has shown that DNNs tend to be over-confident (Guo et al., 2017; Rahaman et al., 2021; Patel et al., 2022), suggesting that DNNs may be too confident to notice incorrect decisions, leading to safety issues in real-world applications, e.g., automated healthcare and self-driving cars (Jiang et al., 2012; Bojarski et al., 2016). In this work, we for the first time identify and study the reliability problem of sparse training. We start with the question of how reliable current sparse training is. We find that the over-confidence problem becomes even more pronounced when sparse training is applied to ResNet-50 on CIFAR-100.
Figures 1(a)-(b) show that the gap (blue area) between confidence and accuracy of the sparse model (95% sparsity) is larger than that of the dense model (0% sparsity), implying that the sparse model is more over-confident than the dense model. Figure 1(c) shows the test accuracy (pink curve) and ECE value (blue curve, a measure of reliability) (Guo et al., 2017) at different sparsities. When the accuracy is comparable to dense training (0%-95%), the ECE values increase with sparsity, implying that the over-confidence problem becomes more severe at higher sparsity. When the accuracy decreases sharply (>95%), the ECE value first decreases and then increases again. This produces a double descent phenomenon (Nakkiran et al., 2021) when the ECE curve is viewed from left to right (99.9%-0%) (see Section 6 for more discussion). To improve reliability, we propose a new sparse training method that produces well-calibrated predictions while maintaining high accuracy. We call our method "The Calibrated Rigged Lottery" or CigL. Unlike previous sparse training methods with only one mask, our method employs two masks, a deterministic mask and a random mask, to better explore the sparse topology and weight space. The deterministic one efficiently searches for and activates important weights by exploiting the magnitudes of weights/gradients, while the random one, inspired by dropout, adds more exploration and leads to better convergence. Near the end of training, we collect weights & masks at each epoch and use the designed weight & mask averaging procedure to obtain one sparse model. Theoretically, we show that our method can be viewed as a hierarchical variational approximation (Ranganath et al., 2016) to a probabilistic deep Gaussian process (Gal & Ghahramani, 2016), which leads to a larger family of variational distributions and better Bayesian posterior approximations.
Our contributions are summarized as follows:
• We for the first time identify and study the reliability problem of sparse training and find that sparse training exacerbates the over-confidence problem of DNNs.
• We propose CigL, a new sparse training method that improves confidence calibration with comparable or even higher accuracy.
• We prove that CigL can be viewed as a hierarchical variational approximation to a probabilistic deep Gaussian process, which improves calibration by better characterizing the posterior.
• We perform extensive experiments on multiple benchmark datasets, model architectures, and sparsities. CigL reduces ECE values by up to 47.8% and simultaneously maintains or even improves accuracy with only a slight increase in computational and storage burden.

2. RELATED WORK

2.1. SPARSE TRAINING

As the scale of models continues to grow, increasing attention is being paid to sparse training, which maintains sparse weights throughout the training process. Different sparse training methods have been investigated, with various pruning and growth criteria, such as weight/gradient magnitude (Mocanu et al., 2018; Bellec et al., 2018; Frankle & Carbin, 2019; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Jayakumar et al., 2020; Liu et al., 2021; Özdenizci & Legenstein, 2021; Zhou et al., 2021; Schwarz et al., 2021; Yin et al., 2022). However, sparse training makes weight-space exploration more challenging because sparsity constraints cut off update routes and produce spurious local minima (Evci et al., 2019; Sun & Li, 2021; He et al., 2022). Some studies have started to promote exploration, but they primarily pursue high accuracy and may add additional costs (Liu et al., 2021; Huang et al., 2022). Most sparse training methods use only one mask to determine the sparse topology, which is insufficient for adequate exploration, and existing multi-mask methods are not designed to improve exploration (Xia et al., 2022; Bibikar et al., 2022) (more details in Section D.4).

2.2. CONFIDENCE CALIBRATION IN DNNS

Many studies have investigated whether the confidences of DNNs are well-calibrated (Guo et al., 2017; Nixon et al., 2019; Zhang et al., 2020), and existing research has found that DNNs tend to be over-confident (Guo et al., 2017; Rahaman et al., 2021; Patel et al., 2022), which may mislead our choices and cause unreliable decisions in real-world applications. To improve confidence calibration, a widely used method is temperature scaling (Guo et al., 2017), which adds a scaling parameter to the softmax formulation and tunes it on a validation set. Other works incorporate regularization into training, such as Mixup (Zhang et al., 2017) and label smoothing (Szegedy et al., 2016). In addition, Bayesian methods, such as Monte Carlo Dropout (Gal & Ghahramani, 2016) and Bayesian deep ensembles (Ashukha et al., 2020), have also shown the ability to improve calibration. However, these works mainly focus on dense training. Studies have been conducted on the reliability of sparse DNNs (more details in Section D.5), but they target pruning, which starts with a dense model and gradually increases sparsity, reducing the exploration challenge (Venkatesh et al., 2020; Chen et al., 2022). They still find that uncertainty measures are more sensitive to pruning than generalization metrics, indicating the sensitivity of reliability to sparsity. Yin et al. (2022) study sparse training, but aim to boost performance and bring only limited improvement in reliability. Therefore, how to obtain a well-calibrated DNN in sparse training is more challenging and remains unknown.

3. METHOD

We propose a new sparse training method, CigL, to improve the confidence calibration of the produced sparse models, which simultaneously maintains comparable or even higher accuracy. Specifically, CigL starts with a random sparse network and uses two masks to control the sparse topology and explore the weight space, including a deterministic mask and a random mask. The former is updated periodically to determine the non-zero weights, while the latter is sampled randomly in each iteration to bring better exploration in the model update. Then, with the designed weight & mask averaging, we combine information about different aspects of the weight space to obtain a single output sparse model. Our CigL method is outlined in Algorithm 1.

3.1. DETERMINISTIC MASK & RANDOM MASK

In our CigL, we propose to utilize two masks, a deterministic mask M and a random mask Z, to search for a sparse model with improved confidence calibration and SOTA accuracy. We first describe the two masks in detail and then discuss how to set their sparsity. The deterministic mask controls the overall sparse topology with the aim of finding a well-performing sparse model. That is, the mask determines which weights should be activated and which should not. Inspired by the widely used sparse training method RigL (Evci et al., 2020), we posit that a larger weight/gradient magnitude implies that the weight is more helpful for loss reduction and should be activated. Thus, CigL removes a portion of the weights with small magnitudes and activates new weights with large gradient magnitudes at fixed time intervals ∆T. The random mask allows the model to better explore the weight space under sparsity constraints. In each iteration, prior to backpropagation, the mask is drawn randomly from a Bernoulli distribution. In this way, the mask randomly selects a portion of the non-zero weights to be temporarily deactivated and forces the model to explore other directions of the weight space. This adds randomness to the weight update step and leads to better exploration of the weight space than a single-mask strategy. As a result, the model is more likely to escape spurious local minima while avoiding deviations from the sparse topology found by the deterministic mask. The weight update and the collection step of Algorithm 1 are, for iteration t with mini-batch B_t and learning rate α_t:

    W^(t) = W^(t−1) − α_t · M ⊙ Z^(t) ⊙ ∇L(M ⊙ Z^(t) ⊙ W^(t−1); B_t)
    if t mod m = 0 and t > T* then
        if W_CigL = None then
            W_CigL = M ⊙ Z^(t) ⊙ W^(t);  n_models = 1
        else
            W_CigL = (W_CigL · n_models + M ⊙ Z^(t) ⊙ W^(t)) / (n_models + 1);  n_models = n_models + 1
        end if
    end if
    Output: sparse model weights W_CigL

The sparsity settings of the two masks are as follows. On the one hand, the deterministic mask is responsible for the overall sparsity of the output sparse model.
Suppose we want to train a network with 95% sparsity; the deterministic mask will then also have 95% sparsity, with 5% of its elements being 1. On the other hand, the random mask deactivates some non-zero weights during the training process, producing temporary models with increased sparsity. Since highly sparse models (e.g., 95% sparsity) are sensitive to further increases in sparsity, we set a low drop rate, such as 10%, for the random mask, so that these temporary models see no significant increase in sparsity and no dramatic degradation in performance.
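The two-mask update above can be sketched in a few lines of numpy. This is a toy illustration on a quadratic loss, not the paper's implementation; the helper names (`make_deterministic_mask`, `cigl_step`) and the toy `grad_fn` are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_deterministic_mask(w, sparsity):
    """Keep the largest-magnitude weights; zero out the rest."""
    k = int(round(w.size * (1.0 - sparsity)))            # number of weights kept
    thresh = np.sort(np.abs(w).ravel())[::-1][k - 1]
    return (np.abs(w) >= thresh).astype(w.dtype)

def cigl_step(w, M, grad_fn, lr=0.1, drop_rate=0.1):
    """One CigL-style update: draw a fresh random mask Z over the active
    weights, then take a gradient step on the doubly masked weights."""
    Z = (rng.random(w.shape) >= drop_rate).astype(w.dtype)  # random mask
    g = grad_fn(M * Z * w)                                  # gradient at masked weights
    return w - lr * M * Z * g, Z

# toy quadratic loss L(v) = 0.5 * ||v - target||^2, so grad(v) = v - target
target = np.ones((4, 4))
grad_fn = lambda v: v - target

w = rng.normal(size=(4, 4))
M = make_deterministic_mask(w, sparsity=0.5)   # 50% sparse topology
for _ in range(200):
    w, Z = cigl_step(w, M, grad_fn)
# active weights converge toward the target; pruned weights receive no updates
```

The deterministic mask M fixes the topology (pruned weights never move), while a fresh Bernoulli mask Z perturbs each step, mirroring the update W^(t) = W^(t−1) − α_t M ⊙ Z^(t) ⊙ ∇L(M ⊙ Z^(t) ⊙ W^(t−1); B_t).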

3.2. WEIGHT & MASK AVERAGING

With the two masks designed above, we propose a weight & mask averaging procedure to obtain a single sparse model with improved confidence calibration and comparable or even higher accuracy. We formalize this procedure as follows. We first iteratively update the two masks and model weights. Consistent with widely used sparse training methods (Evci et al., 2020; Liu et al., 2021), the deterministic mask stops updating near the end of the training process, while we continue to draw different random masks from the Bernoulli distribution and collect a pair of sparse weights and a random mask {Z^(t), W^(t)} at each epoch after the T*-th epoch. We can then produce multiple temporary sparse models Z^(t) ⊙ W^(t) with different weight values and different sparse topologies, which contain more knowledge about the weight space than single-mask training methods. Finally, inspired by a popular way of combining models (Izmailov et al., 2018; Wortsman et al., 2022), we obtain the single output sparse model by averaging the weights of these temporary sparse models, which can be viewed as a mask-based weighted averaging.
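The averaging step admits a simple running-mean sketch. The random snapshots below are hypothetical stand-ins for the per-epoch pairs {Z^(t), W^(t)}, and `weight_mask_average` is our illustrative name, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def weight_mask_average(snapshots):
    """Average the masked models Z_t * W_t collected after epoch T*.
    `snapshots` is a list of (Z, W) pairs; the result is one sparse model."""
    avg, n = None, 0
    for Z, W in snapshots:
        masked = Z * W
        avg = masked if avg is None else (avg * n + masked) / (n + 1)
        n += 1
    return avg

# hypothetical collected pairs: fixed topology M, fresh Bernoulli Z each epoch
M = (rng.random((3, 3)) < 0.5).astype(float)
snapshots = []
for _ in range(5):
    W = M * rng.normal(loc=2.0, size=(3, 3))           # weights already obey M
    Z = (rng.random((3, 3)) >= 0.1).astype(float)      # 10% random drop rate
    snapshots.append((Z, W))

W_cigl = weight_mask_average(snapshots)
```

Because every snapshot respects the deterministic mask M, the averaged model stays exactly zero outside the topology, so the output keeps the target sparsity.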

4.1. CIGL WITH BETTER CONFIDENCE CALIBRATION

Obtaining reliable DNNs is more challenging in sparse training, and we show why CigL provides a solution to this problem. Bayesian methods have shown the ability to improve confidence calibration (Gal & Ghahramani, 2016; Ashukha et al., 2020), but it becomes more difficult for them to fit the posterior well under sparsity constraints, limiting their ability to address unreliability. We find that CigL can be viewed as a hierarchical Bayesian method (shown in Section 4.2), which improves confidence calibration by performing better posterior approximations in the two ways discussed below. On the one hand, it is more challenging for the model to fully explore the weight space due to the sparsity constraint, and inappropriate weight values can also negatively affect the mask search. During sparse training, a large percentage of connections are removed, cutting off update routes and thus narrowing the family of Bayesian proposal distributions, which makes optimization and sampling more difficult. To overcome this issue, CigL adds a hierarchical structure to the variational distributions so that we have a larger family of distributions, allowing it to capture more complex marginal distributions and reducing the difficulty of fitting the posterior. On the other hand, when sparsity constraints are added, the posterior landscape changes, leading to more complex posterior distributions. One example is the stronger correlation between hidden variables, such as the random mask Z and the weights W (shown in Appendix C.1). In a dense model, the accuracy does not change much if we randomly draw Z and use Z ⊙ W instead of W. However, at high sparsity such as 95%, we see a significant accuracy drop when Z ⊙ W is used instead of W. Thus, in CigL, the pairings of Z and W are collected to capture this correlation, leading to a better posterior approximation.

4.2. CIGL AS A HIERARCHICAL BAYESIAN APPROXIMATION

We prove that training sparse neural networks with our CigL is mathematically equivalent to approximating the probabilistic deep GP (Damianou & Lawrence, 2013; Gal & Ghahramani, 2016) with hierarchical variational inference. We show that the objective of CigL is in fact to minimize the Kullback-Leibler (KL) divergence between a hierarchical variational distribution and the posterior of a deep GP. Our analysis does not restrict the architecture, making the results applicable to a wide range of applications. The detailed derivation is given in Appendix B. We first present the minimisation objective of CigL for a sparse neural network (NN) model with L layers and loss function E. The sparse weights and biases of the l-th layer are denoted by W_l ∈ R^{K_l × K_{l−1}} and b_l ∈ R^{K_l} (l = 1, ..., L), and the output prediction is denoted by ŷ_i. Given data {x_i, y_i}, we train the NN model by iteratively updating the deterministic mask and the sparse weights. Since the random mask is drawn from a Bernoulli distribution, it has no parameters that need to be updated. For deterministic mask updates, we prune weights with smaller weight magnitudes and regrow weights with larger gradient magnitudes. For the weight update, we minimise Eq. (1), which consists of the discrepancy between ŷ_i and the true label y_i plus an L2 regularisation term:

    L_CigL := (1/N) Σ_{i=1}^N E(y_i, ŷ_i) + λ Σ_{l=1}^L (‖W_l‖²₂ + ‖b_l‖²₂).    (1)

Then, we derive the minimization objective of the deep GP, a flexible probabilistic NN model that can model distributions over functions (Gal & Ghahramani, 2016). Taking regression as an example, we assume that each W_l is a random matrix, write w = {W_l}_{l=1}^L, and denote the prior by p(w). The predictive distribution of the deep GP can then be expressed as Eq. (2), where τ > 0 is a precision parameter:

    p(y | x, X, Y) = ∫ p(y | x, w) p(w | X, Y) dw,    (2)
    p(y | x, w) = N(y; ŷ, τ⁻¹ I),
    ŷ = √(1/K_L) W_L σ( ... √(1/K_1) W_2 σ(W_1 x + u_1) ... ).
The posterior distribution p(w | X, Y) is intractable, and one way of training the deep GP is variational inference, where a family of tractable distributions q(w) is chosen to approximate the posterior. Specifically, we define the hierarchy of q(w) as Eq. (3):

    q(W_lij | Z_lij, U_lij, M_l) ~ Z_lij · N(M_lij U_lij, σ²) + (1 − Z_lij) · N(0, σ²),
    q(M_l | U_l) ∝ exp(M_l ⊙ (|U_l| + |∇U_l|)),
    U_lij ~ N(V_lij, σ²),    Z_lij ~ Bernoulli(p_l),    (3)

where l, i, and j denote the layer, row, and column indices, M_l is a matrix of 0's and sparsity-constrained 1's, W_l are the sparse weights, U_l are the variational parameters, and V_l are the variational hyperparameters. We then iteratively update M_l and W_l to approximate the posterior. For the update of M_l, we obtain a point estimate by maximising q(M_l | U_l) under the sparsity constraint. In the pruning step, since the gradient magnitudes |∇U_l| can be relatively small compared to the weight magnitudes |U_l| after training, we can use exp(M_l ⊙ |U_l|) to approximate the distribution. In the regrowth step, since the inactive weights are zero, we directly compare the gradient magnitudes via exp(M_l ⊙ |∇U_l|). The update of M_l is therefore aligned with the mask update in CigL. For W_l, we minimise the KL divergence between q(w) and the posterior of the deep GP, as in Eq. (4):

    −∫ q(w) log p(Y | X, w) dw + D_KL(q(w) ‖ p(w)).    (4)

The first term in Eq. (4) can be rewritten as −Σ_{n=1}^N ∫ q(w) log p(y_n | x_n, w) dw, and each integral in the sum can be approximated with a single sample ŵ. The second term in Eq. (4) can be approximated as Σ_{l=1}^L ((p_l/2) ‖U_l‖²₂ + (1/2) ‖u_l‖²₂). As a result, we can derive the objective

    L_GP := (1/N) Σ_{n=1}^N (−log p(y_n | x_n, ŵ)) / τ + Σ_{l=1}^L ((p_l/2) ‖U_l‖²₂ + (1/2) ‖u_l‖²₂),

which has the same form as the objective in Eq. (1) with appropriate hyperparameters for the deep GP. Thus, the update of W_l is also consistent with the weight update in CigL.
This suggests that our CigL can be viewed as an approximation to the deep GP using hierarchical variational inference. The final weight & mask averaging procedure can be incorporated into the Bayesian paradigm as an approximation to the posterior distribution (Srivastava et al., 2014; Maddox et al., 2019) .

4.3. CONNECTION TO DROPOUT

Our CigL can be seen as a new version of Dropout, and our random mask Z is related to the Dropout mask. Dropout is a widely used method to overcome overfitting (Hinton et al., 2012; Wan et al., 2013; Srivastava et al., 2014). Two widely used variants are unit dropout and weight dropout, which randomly discard units (neurons) and individual weights at each training step, respectively. Both methods use dropout only in the training phase and remove it in the testing phase, which is equivalent to discarding Z and using only W for prediction. However, simply dropping Z can be detrimental to the fit of the posterior. MC dropout therefore collects multiple models by randomly drawing multiple dropout masks, which is equivalent to extracting multiple Z's and using one W for prediction. However, using only one W neither fully expresses the posterior landscape nor captures the correlation between Z and W. In contrast, our CigL uses multiple pairings of Z and W, which better approximates the posterior under sparsity constraints.
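The contrast with MC dropout can be shown schematically: MC dropout averages predictions over many masks Z applied to a single W, whereas CigL averages over collected (Z, W) pairs. The one-layer `tanh` model and the per-epoch weight perturbation below are toy assumptions, not the trained models from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def predict(W, x):
    """Toy one-layer model; stands in for a full network."""
    return np.tanh(x @ W)

x = rng.normal(size=(1, 5))

# MC dropout: one weight matrix, many random masks
W = rng.normal(size=(5, 3))
mc_preds = []
for _ in range(100):
    Z = (rng.random(W.shape) >= 0.1).astype(float)
    mc_preds.append(predict(Z * W, x))
mc_mean = np.mean(mc_preds, axis=0)

# CigL-style: paired (Z_t, W_t) snapshots keep the Z-W correlation
pairs = []
for _ in range(100):
    W_t = W + 0.05 * rng.normal(size=W.shape)   # hypothetical per-epoch weights
    Z_t = (rng.random(W.shape) >= 0.1).astype(float)
    pairs.append((Z_t, W_t))
cigl_mean = np.mean([predict(Z_t * W_t, x) for Z_t, W_t in pairs], axis=0)
```

Both averages have the same shape and cost at prediction time; the difference is that the second marginalises over joint (Z, W) samples rather than over Z alone.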

4.4. CONNECTION TO WEIGHT AVERAGING

Our weight & mask averaging can be seen as an extension of weight averaging (WA), which averages the weights of multiple model samples to produce a single output model (Izmailov et al., 2018; Wortsman et al., 2022). Compared to deep ensembles (Ashukha et al., 2020), WA outputs only one model, which reduces the forward FLOPs and speeds up prediction. When the model samples are located in one low-error basin, it usually leads to wider optima and better generalization. However, although WA can produce better generalization, it does not improve confidence calibration (Wortsman et al., 2022). In contrast to WA, our weight & mask averaging uses masks for weighted averaging and improves confidence calibration with similar prediction FLOPs.

5. EXPERIMENTS

We perform a comprehensive empirical evaluation of CigL, comparing it with the popular baseline RigL (Evci et al., 2020), a sparse training method that uses weight magnitudes to prune and gradient magnitudes to grow connections. Datasets & Model Architectures: We follow the settings in Evci et al. (2020) for a comprehensive comparison. Our experiments are based on three benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) and ImageNet-2012 (Russakovsky et al., 2015). For model architectures, we use ResNet-50 (He et al., 2016) and Wide-ResNet-22-2 (Zagoruyko & Komodakis, 2016). We repeat all experiments 3 times and report the mean and standard deviation. Sparse Training Settings: We evaluate multiple sparsities, including 80%, 90%, 95%, and 99%, which sufficiently reduce the memory requirement and are of the most practical interest. Implementations: We follow the settings in (Evci et al., 2020; Sundar & Dwaraknath, 2021). The parameters are optimized by SGD with momentum, and the learning rate follows a piecewise constant decay schedule. For CIFAR-10 and CIFAR-100, we train all models for 250 epochs with a batch size of 128. For ImageNet, we train all models for 100 epochs with a batch size of 64.

5.1. COMPARISON WITH A POPULAR SPARSE TRAINING METHOD

Results on CIFAR-10 and CIFAR-100. We first compare CigL and RigL by the expected calibration error (ECE) (Guo et al., 2017), a popular measure of the discrepancy between a model's confidence and its true accuracy, with a lower ECE indicating better confidence calibration and higher reliability. In Figure 2, the pink and blue curves represent CigL and RigL, respectively, and the colored areas represent the 95% confidence intervals. The pink curves are usually lower than the blue curves across the different sparsities (80%, 90%, 95%, 99%), which implies that CigL reduces the ECE and improves the confidence calibration of the produced sparse models. Apart from the ECE value, we also compare CigL and RigL by test accuracy at multiple sparsities (80%, 90%, 95%, 99%). We summarize the results for sparse ResNet-50 in

5.3. COMPARISON WITH OTHER CALIBRATION METHODS

In this section, we compare CigL with existing popular calibration methods, including Mixup (Zhang et al., 2017), temperature scaling (TS) (Guo et al., 2017), and label smoothing (LS) (Szegedy et al., 2016). The testing ECE values are depicted in Figure 4, where the pink and blue polygons represent CigL and the other calibration methods, respectively. CigL usually gives smaller polygons, indicating better confidence calibration.
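For reference, the ECE metric used in these comparisons (Guo et al., 2017) bins predictions by confidence and takes a weighted average of the per-bin gap between accuracy and mean confidence. A minimal implementation sketch (our own, using the common 15-bin equal-width choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: sum over confidence bins of (bin weight) * |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.mean()                     # fraction of samples in this bin
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += weight * gap
    return ece

# perfectly calibrated toy case: 80% correct at confidence 0.8, so ECE ~ 0
conf = np.array([0.8] * 10)
corr = np.array([1] * 8 + [0] * 2)
ece = expected_calibration_error(conf, corr)
```

An over-confident model (e.g., confidence 0.9 with 50% accuracy) yields a large ECE of about 0.4 under the same binning.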

5.4. ABLATION STUDIES

We perform ablation studies to demonstrate the importance of each component of CigL, in which we train sparse networks using CigL without the random masks.

6. CONCLUSION AND DISCUSSION

In this paper, we propose a new sparse training method, CigL, to produce reliable sparse models, which simultaneously maintain or even improve accuracy with only a slight increase in computational and storage burden. CigL utilizes two masks, a deterministic mask and a random mask, which allow the sparse model to better explore the weight space. We then design a weight & mask averaging method to combine multiple sparse weights and random masks into a single model with improved reliability. We prove that CigL can be viewed as a hierarchical variational approximation to a probabilistic deep Gaussian process. Experimental results on multiple benchmark datasets, model architectures, and sparsities show that CigL reduces ECE values by up to 47.8% with comparable or higher accuracy. One phenomenon worth discussing is the double descent in the reliability of sparse training. Nakkiran et al. (2021) first observed the double descent phenomenon in DNNs: as the model size, data size, or training time increases, model performance first improves, then worsens, and then improves again. Consistent with this definition, we treat sparsity and reliability as measures of model size and performance, respectively. Then, as shown in Figure 1(c), as sparsity decreases (model size increases), reliability (model performance) first improves, then worsens, and then improves again. To explain this phenomenon, we divide sparsity into four phases, from left (99.9%) to right (0%), by drawing an analogy between the phases and model accuracy and size. (d) Finally, it reaches a dense deep model with over-confidence issues (moderate level of reliability & high accuracy).
It is observed that at around 95% sparsity, the sparse model can achieve comparable accuracy and high sparsity at the same time, which makes it important in practical applications. However, the ECE value is at the peak of the double-descent curve at this point, implying that the reliability of the sparse model is low. Thus, our CigL smooths the double descent curve and produces reliable models at these practically important high sparsity levels.

A APPENDIX: BACKGROUND

In this section, we briefly summarize CigL, Gaussian processes, and hierarchical variational inference, which will be used to support the main theoretical analysis of this work.

A.1 SPARSE TRAINING: CIGL

We first review our CigL method for the case of a single-hidden-layer neural network (NN). This is done for ease of notation; it is straightforward to generalise to multiple layers (Gal & Ghahramani, 2016). Denote by W_1, W_2 the sparse weight matrices connecting the input layer to the hidden layer and the hidden layer to the output layer, respectively. For the masks controlling the sparse topology, we use M_1, M_2 to denote the deterministic masks for W_1 and W_2, and Z_1, Z_2 to denote the corresponding random masks. These linearly transform the layers' inputs before some element-wise non-linearity σ(·) is applied. Denote by b the biases by which we shift the input of the non-linearity. We assume the model outputs D-dimensional vectors while its input is Q-dimensional, with K hidden units. Thus W_1, M_1, and Z_1 are Q × K matrices, W_2, M_2, and Z_2 are K × D matrices, and b is a K-dimensional vector. A sparse NN model with the two masks outputs

    ŷ = σ(x (Z_1 ⊙ W_1) + b)(Z_2 ⊙ W_2)

given some input x. For the mask updates, the deterministic masks are updated by exploiting the magnitudes of weights and gradients, and the random masks are sampled randomly. For the weight update, we use E to denote the loss function. For regression this is the Euclidean loss

    E = (1 / 2N) Σ_{n=1}^N ‖y_n − ŷ_n‖²₂,

where y_n is the observed response and ŷ_n is the prediction based on input x_n for n = 1, ..., N. For a classification task with D classes, we use the softmax function to map the output ŷ_n to a probability score for each class, p̂_nd = exp(ŷ_nd) / Σ_k exp(ŷ_nk), and the loss function is

    E = −(1 / N) Σ_{n=1}^N log(p̂_{n,c_n}),

where c_n ∈ {1, 2, ..., D} is the true class label for x_n.
During NN optimization, apart from the loss function above, l2 regularisation is often used to improve performance, leading to the minimisation objective

    L_CigL := (1/N) Σ_{i=1}^N E(y_i, ŷ_i) + λ_1 ‖W_1‖²₂ + λ_2 ‖W_2‖²₂ + λ_3 ‖b‖²₂.    (8)

Therefore, during the training process of CigL, we iteratively update the sparse weights W by minimising Eq. (8) and update the deterministic mask based on the magnitudes of weights and gradients.

A.2 GAUSSIAN PROCESS

The Gaussian process (GP) is a popular non-parametric Bayesian method for modeling distributions over functions, applicable to both regression and classification tasks. It performs well in various fields, but it incurs a large computational burden when faced with large amounts of data; variational inference makes GPs scalable to large data. Given data {x_n, y_n}, n = 1, ..., N, with X ∈ R^{N×Q} and Y ∈ R^{N×D}, the task is to estimate an unknown function y = f(x). A GP places a prior over the function space, and we want to fit the posterior distribution over the function space: p(f | X, Y) ∝ p(Y | X, f) p(f). Within a Gaussian process, we usually place a Gaussian prior over the function space, which is equivalent to placing a joint Gaussian distribution over all function values:

    F | X ~ N(0, K(X, X)),    Y | F ~ N(F, τ⁻¹ I_N),    (9)

where τ is a precision parameter and I_N is the identity matrix with dimensions N × N. For classification tasks, we can formulate the model as

    F | X ~ N(0, K(X, X)),    Y | F ~ N(F, τ⁻¹ I_N),    c_n | Y ~ Categorical(exp(y_nd) / Σ_k exp(y_nk)).    (10)

An important aspect of the Gaussian process is the choice of covariance function, which reflects how we believe each pair of inputs x_i and x_j is similar. One widely used choice is the stationary squared exponential covariance function. In addition, some non-stationary covariance functions have been proposed, such as dot-product kernels and more flexible deep network kernels.
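A short numerical sketch of the GP prior in Eq. (9) with the stationary squared-exponential covariance function; the input grid, lengthscale, and precision value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def sq_exp_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential covariance k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

# draw function values F ~ N(0, K(X, X)) and noisy targets Y ~ N(F, tau^-1 I_N)
X = np.linspace(-3, 3, 50)[:, None]
K = sq_exp_kernel(X, X)
tau = 100.0
F = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50))  # jitter for stability
Y = F + rng.normal(scale=tau ** -0.5, size=50)
```

The kernel matrix is symmetric with unit diagonal and (up to numerical jitter) positive semi-definite, which is what licenses the joint Gaussian view of Eq. (9).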

A.3 HIERARCHICAL VARIATIONAL INFERENCE

Variational inference (VI) is a broadly used technique for approximating intractable integrals in Bayesian modeling: it sets up a parameterized family of tractable distributions over the latent variables and then optimizes the parameters so the approximation is close to the posterior. More specifically, suppose w is the set of random variables defining our model. The predictive distribution is

    p(y* | x*, X, Y) = ∫ p(y* | x*, w) p(w | X, Y) dw,

where the posterior p(w | X, Y) is usually intractable. We therefore define a family of tractable approximating variational distributions q(w) to approach the posterior. To find the closest approximating distribution in the family, we minimise the Kullback-Leibler (KL) divergence between q(w) and the posterior p(w | X, Y), which is equivalent to maximising the log evidence lower bound (ELBO) with respect to q(w):

    L_VI := ∫ q(w) log p(Y | X, w) dw − D_KL(q(w) ‖ p(w)).

After obtaining a good approximation q(w), we can update the predictive distribution to p(y* | x*, X, Y) ≈ ∫ p(y* | x*, w) q(w) dw. However, when the posterior is difficult to fit, q(w) can be too limited and inflexible to approach it. In this case, VI cannot capture the posterior dependencies between latent variables, dependencies that both improve the fidelity of the approximation and are sometimes intrinsically meaningful. To address this limitation, hierarchical variational inference (HVI) was proposed, which can capture both posterior dependencies between the latent variables and more complex marginal distributions (Ranganath et al., 2016). More specifically, HVI extends the limited family of VI distributions hierarchically, i.e., by placing a prior on the parameters of the variational likelihood. Suppose VI uses q(w; λ) to approximate the posterior, where λ are the variational parameters to optimise. HVI places a prior on λ and uses the marginal q_HVI(w; θ) = ∫ q(w | λ) q(λ; θ) dλ. The ELBO equivalently becomes

    L_HVI := E_{q_HVI(w; θ)}[log p(x, w) − log q_HVI(w; θ)].
Because $\log q_{\text{HVI}}(\mathbf{w}; \theta)$ is itself intractable, this ELBO is further lower-bounded as
$$\mathcal{L}_{\text{HVI}} \geq \mathbb{E}_{q_{\text{HVI}}(\mathbf{w}; \theta)}\big[\log p(x, \mathbf{w})\big] - \mathbb{E}_{q(\mathbf{w}, \lambda)}\big[\log q(\lambda) + \log q(\mathbf{w} \mid \lambda) - \log r(\lambda \mid \mathbf{w}; \theta)\big],$$
where the auxiliary distribution $r(\lambda \mid \mathbf{w}; \theta)$ is introduced to apply the variational principle.
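To make the ELBO above concrete, the following is a minimal NumPy sketch (ours, not from the paper) for a one-dimensional toy model with prior $p(w) = N(0, 1)$, likelihood $p(x \mid w) = N(w, 1)$, and variational family $q(w) = N(\mu, \sigma^2)$. The function names (`elbo_mc`, `elbo_exact`) and the toy model itself are illustrative assumptions; the point is only that a Monte Carlo estimate of $\mathbb{E}_q[\log p(x \mid w) + \log p(w) - \log q(w)]$ matches the closed-form ELBO.

```python
import numpy as np

def log_normal(x, mean, var):
    # Log-density of N(mean, var) evaluated at x.
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def elbo_exact(x, mu, sigma):
    # Closed-form ELBO for q(w) = N(mu, sigma^2), p(w) = N(0, 1), p(x|w) = N(w, 1).
    s2 = sigma ** 2
    exp_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - mu) ** 2 + s2)  # E_q[log p(x|w)]
    exp_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (mu ** 2 + s2)      # E_q[log p(w)]
    entropy = 0.5 * np.log(2 * np.pi * s2) + 0.5                     # -E_q[log q(w)]
    return exp_lik + exp_prior + entropy

def elbo_mc(x, mu, sigma, n_samples, rng):
    # Monte Carlo estimate: average log p(x|w) + log p(w) - log q(w) over w ~ q.
    w = mu + sigma * rng.standard_normal(n_samples)
    return np.mean(log_normal(x, w, 1.0)
                   + log_normal(w, 0.0, 1.0)
                   - log_normal(w, mu, sigma ** 2))
```

With enough samples the two estimates agree to a few decimal places, which is the sanity check one would run before trusting a stochastic ELBO objective.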

B APPENDIX: CIGL AS A HIERARCHICAL BAYESIAN APPROXIMATION

We show that sparse deep NNs trained with CigL are mathematically equivalent to approximate hierarchical variational inference in the deep Gaussian process (marginalised over its covariance function parameters). For this, we build on previous work (Gal & Ghahramani, 2016) which proved that unit dropout applied before every weight layer is mathematically equivalent to approximate variational inference in the deep Gaussian process. Starting with the full Gaussian process, we develop an approximation that is shown to be equivalent to the sparse NN optimisation objective of CigL (Eq. (8)) with either the Euclidean loss in the case of regression or the softmax loss in the case of classification. Our derivation takes regression as an example; it can be extended to classification following Section 4 of the Appendix of Gal & Ghahramani (2016). This view of CigL allows us to derive new probabilistic results in sparse training.

B.1 A GAUSSIAN PROCESS APPROXIMATION

In this section, we re-parameterise the deep GP model and marginalise over the additional auxiliary random variables, building on Gal & Ghahramani (2016). To define our covariance function, let $\sigma(\cdot)$ be some non-linear activation function, and let
$$K(x, y) = \int p(\mathbf{w})\, p(b)\, \sigma(\mathbf{w}^\top x + b)\, \sigma(\mathbf{w}^\top y + b)\, d\mathbf{w}\, db,$$
where $p(\mathbf{w})$ is a standard multivariate normal distribution in dimension $Q$. We use Monte Carlo integration with $K$ samples to approximate the integral above, obtaining the finite-rank covariance function
$$\hat{K}(x, y) = \frac{1}{K} \sum_{k=1}^{K} \sigma(\mathbf{w}_k^\top x + b_k)\, \sigma(\mathbf{w}_k^\top y + b_k),$$
where $\mathbf{w}_k \sim p(\mathbf{w})$ and $b_k \sim p(b)$. Here $K$ is also the number of hidden units in our single-hidden-layer sparse NN approximation. Using $\hat{K}$ instead of $K$, the generative model becomes
$$\mathbf{w}_k \sim p(\mathbf{w}), \quad b_k \sim p(b), \quad W_1 = [\mathbf{w}_{qk}]_{q=1}^{Q}{}_{k=1}^{K}, \quad \mathbf{b} = [b_k]_{k=1}^{K},$$
$$F \mid X, W_1, \mathbf{b} \sim N(0, \hat{K}(X, X)), \qquad Y \mid F \sim N(F, \tau^{-1} I_N),$$
where $W_1 \in \mathbb{R}^{Q \times K}$ parameterises the covariance function. We obtain the predictive distribution by integrating over $F$, $W_1$, and $\mathbf{b}$:
$$p(Y \mid X) = \int p(Y \mid F)\, p(F \mid W_1, \mathbf{b}, X)\, p(W_1)\, p(\mathbf{b})\, dF\, dW_1\, d\mathbf{b}.$$
Denote the $1 \times K$ row vector $\phi(x, W_1, \mathbf{b}) = \sqrt{\tfrac{1}{K}}\, \sigma(W_1^\top x + \mathbf{b})$ and the $N \times K$ feature matrix $\Phi = [\phi(x_n, W_1, \mathbf{b})]_{n=1}^{N}$. Then $\hat{K}(X, X) = \Phi \Phi^\top$, and the predictive distribution can be rewritten as
$$p(Y \mid X) = \int N(Y; 0, \Phi \Phi^\top + \tau^{-1} I_N)\, p(W_1)\, p(\mathbf{b})\, dW_1\, d\mathbf{b}.$$
The normal distribution over $Y$ inside the integral above can be written as a joint normal distribution over the columns $\mathbf{y}_d$ of the $N \times D$ matrix $Y$ ($d = 1, \dots, D$). For each term in the joint distribution, following Bishop & Nasrabadi (2006), we introduce a $K \times 1$ auxiliary random variable $\mathbf{w}_d \sim N(0, I_K)$:
$$N(\mathbf{y}_d; 0, \Phi \Phi^\top + \tau^{-1} I_N) = \int N(\mathbf{y}_d; \Phi \mathbf{w}_d, \tau^{-1} I_N)\, N(\mathbf{w}_d; 0, I_K)\, d\mathbf{w}_d.$$
Writing $W_2 = [\mathbf{w}_d]_{d=1}^{D} \in \mathbb{R}^{K \times D}$, we obtain the predictive distribution
$$p(Y \mid X) = \int p(Y \mid X, W_1, W_2, \mathbf{b})\, p(W_1)\, p(W_2)\, p(\mathbf{b})\, dW_1\, dW_2\, d\mathbf{b}.$$
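The finite-rank covariance $\hat{K}(X, X) = \Phi \Phi^\top$ above is easy to check numerically. The following is a small NumPy sketch (ours, for illustration; the paper does not prescribe an implementation, and we pick ReLU as the non-linearity $\sigma$): draw $\mathbf{w}_k, b_k$ from standard normals, build the feature matrix $\Phi$, and form the Monte Carlo kernel as a Gram matrix, which is symmetric and positive semi-definite by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, N = 3, 512, 8                      # input dim, hidden units (MC samples), data points
W1 = rng.standard_normal((Q, K))         # columns w_k ~ p(w) = N(0, I_Q)
b = rng.standard_normal(K)               # b_k ~ p(b)
X = rng.standard_normal((N, Q))

# Feature map phi(x) = sigma(x W1 + b) / sqrt(K), with ReLU as sigma.
Phi = np.maximum(X @ W1 + b, 0.0) / np.sqrt(K)

# Finite-rank Monte Carlo covariance: K_hat(X, X) = Phi Phi^T.
K_hat = Phi @ Phi.T
```

Increasing `K` tightens the Monte Carlo approximation of the integral defining $K(x, y)$; the Gram structure guarantees a valid covariance at any `K`.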

B.2 HIERARCHICAL VARIATIONAL INFERENCE IN THE APPROXIMATE MODEL

We next approximate the posterior over these variables with appropriate hierarchical approximating variational distributions. We define a hierarchical variational distribution as
$$q(W_1, W_2, \mathbf{b}) := q(W_1)\, q(W_2)\, q(\mathbf{b}) = \int q(W_1 \mid U_1)\, q(W_2 \mid U_2)\, q(U_1)\, q(U_2)\, q(\mathbf{b})\, dU_1\, dU_2,$$
where $q(W_1)$ is a two-component Gaussian mixture distribution, factorised over $Q$ and $K$:
$$q(W_1 \mid U_1) = \prod_{q=1}^{Q} \prod_{k=1}^{K} q(w_{qk} \mid u_{qk}), \quad q(w_{qk} \mid u_{qk}) = p_1 N(u_{qk}, \sigma^2) + (1 - p_1) N(0, \sigma^2), \quad q(u_{qk}) = N(v_{qk}, \sigma^2),$$
with $p_1 \in [0, 1]$ and $\sigma > 0$. Similarly, we define a hierarchical variational distribution over $W_2$:
$$q(W_2 \mid U_2) = \prod_{k=1}^{K} \prod_{d=1}^{D} q(w_{kd} \mid u_{kd}), \quad q(w_{kd} \mid u_{kd}) = p_2 N(u_{kd}, \sigma^2) + (1 - p_2) N(0, \sigma^2), \quad q(u_{kd}) = N(v_{kd}, \sigma^2).$$
For $\mathbf{b}$, we use a simple Gaussian distribution $q(\mathbf{b}) = N(\mathbf{u}, \sigma^2 I_K)$.
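As a minimal NumPy sketch of sampling from this hierarchy for a single weight (ours, not from the paper; the function name is an illustrative assumption): first draw $u \sim q(u) = N(v, \sigma^2)$, then draw $w$ from the mixture $q(w \mid u) = p\, N(u, \sigma^2) + (1 - p)\, N(0, \sigma^2)$. Because both components share variance $\sigma^2$, the draw reduces to $w = z u + \sigma \epsilon$ with $z \sim \text{Bernoulli}(p)$.

```python
import numpy as np

def sample_hierarchical_q(v, p, sigma, n_samples, rng):
    # Draw w ~ q(w) = ∫ q(w|u) q(u) du for one weight, where
    # q(u) = N(v, sigma^2) and q(w|u) = p N(u, sigma^2) + (1-p) N(0, sigma^2).
    u = v + sigma * rng.standard_normal(n_samples)          # u ~ q(u)
    z = (rng.random(n_samples) < p).astype(float)           # mixture component indicator
    return z * u + sigma * rng.standard_normal(n_samples)   # mean z*u, variance sigma^2
```

Marginally, $\mathbb{E}[w] = p\, v$, which a large-sample average reproduces; this is the spike-and-slab-like structure that the deterministic and random masks of CigL induce.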

B.3 EVALUATING THE LOG EVIDENCE LOWER BOUND FOR REGRESSION

Next we evaluate the log evidence lower bound for the task of regression:
$$\mathcal{L}_{\text{GP-VI}} := \int q(W_1, W_2, \mathbf{b}) \log p(Y \mid X, W_1, W_2, \mathbf{b})\, dW_1\, dW_2\, d\mathbf{b} - D_{\mathrm{KL}}\big(q(W_1, W_2, \mathbf{b}) \,\|\, p(W_1, W_2, \mathbf{b})\big).$$
For regression, since the output dimensions of a multi-output Gaussian process are assumed to be independent, we can rewrite the integrand as a sum:
$$\log p(Y \mid X, W_1, W_2, \mathbf{b}) = \sum_{d=1}^{D} \log N(\mathbf{y}_d; \Phi \mathbf{w}_d, \tau^{-1} I_N) = -\frac{ND}{2} \log(2\pi) + \frac{ND}{2} \log(\tau) - \sum_{d=1}^{D} \frac{\tau}{2} \|\mathbf{y}_d - \Phi \mathbf{w}_d\|_2^2.$$
Denote $\hat{Y} = \Phi W_2$. We can then sum over the rows instead of the columns of $Y$ and write $\sum_{d=1}^{D} \frac{\tau}{2} \|\mathbf{y}_d - \hat{\mathbf{y}}_d\|_2^2 = \sum_{n=1}^{N} \frac{\tau}{2} \|\mathbf{y}_n - \hat{\mathbf{y}}_n\|_2^2$, where $\hat{\mathbf{y}}_n = \phi(x_n, W_1, \mathbf{b}) W_2 = \sqrt{\tfrac{1}{K}}\, \sigma(x_n W_1 + \mathbf{b}) W_2$, leading to
$$\log p(Y \mid X, W_1, W_2, \mathbf{b}) = \sum_{n=1}^{N} \log N(\mathbf{y}_n; \phi(x_n, W_1, \mathbf{b}) W_2, \tau^{-1} I_D).$$
Therefore, the log evidence lower bound becomes
$$\sum_{n=1}^{N} \int q(W_1, W_2, \mathbf{b}) \log p(\mathbf{y}_n \mid x_n, W_1, W_2, \mathbf{b})\, dW_1\, dW_2\, d\mathbf{b} - D_{\mathrm{KL}}\big(q(W_1, W_2, \mathbf{b}) \,\|\, p(W_1, W_2, \mathbf{b})\big).$$
We re-parametrise the integrands in the sum so that they do not depend on $W_1, W_2, \mathbf{b}$ directly, but instead on the standard normal distribution and the Bernoulli distribution. Let $q(\epsilon_1) = N(0, I_{Q \times K})$, $q(z_{1,q,k}) = \text{Bernoulli}(p_1)$, $q(\epsilon_2) = N(0, I_{K \times D})$, $q(z_{2,k,d}) = \text{Bernoulli}(p_2)$, $q(\epsilon) = N(0, I_K)$, $q(\epsilon_3) = N(0, I_{Q \times K})$, and $q(\epsilon_4) = N(0, I_{K \times D})$. Then we can write
$$W_1 = Z_1 \odot (U_1 + \sigma \epsilon_1) + (1 - Z_1) \odot \sigma \epsilon_1, \quad W_2 = Z_2 \odot (U_2 + \sigma \epsilon_2) + (1 - Z_2) \odot \sigma \epsilon_2,$$
$$\mathbf{b} = \mathbf{u} + \sigma \epsilon, \quad U_1 = V_1 + \sigma \epsilon_3, \quad U_2 = V_2 + \sigma \epsilon_4,$$
where $\odot$ denotes element-wise multiplication. Thus, the sum over the integrals becomes
$$\sum_{n=1}^{N} \int q(Z_1, \epsilon_1, Z_2, \epsilon_2, \epsilon, \epsilon_3, \epsilon_4) \log p\big(\mathbf{y}_n \mid x_n, W_1(Z_1, \epsilon_1, \epsilon_3), W_2(Z_2, \epsilon_2, \epsilon_4), \mathbf{b}(\epsilon)\big).$$
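The re-parameterisation $W_1 = Z_1 \odot (U_1 + \sigma \epsilon_1) + (1 - Z_1) \odot \sigma \epsilon_1$ simplifies algebraically to $Z_1 \odot U_1 + \sigma \epsilon_1$, and as $\sigma \to 0$ the realisation collapses to the masked variational means. The NumPy sketch below (ours, for illustration) checks both facts; following $q(u_{qk}) = N(v_{qk}, \sigma^2)$ from the previous subsection, we take $U_1 = V_1 + \sigma \epsilon_3$.

```python
import numpy as np

def reparameterise(U, Z, sigma, eps):
    # W = Z ⊙ (U + σ ε) + (1 − Z) ⊙ σ ε, which simplifies to Z ⊙ U + σ ε.
    return Z * (U + sigma * eps) + (1 - Z) * sigma * eps

rng = np.random.default_rng(0)
Q, K, p1, sigma = 4, 6, 0.5, 1e-8
V1 = rng.standard_normal((Q, K))                 # variational means v_qk
eps3 = rng.standard_normal((Q, K))
U1 = V1 + sigma * eps3                           # U1 = V1 + σ ε3, i.e. u_qk ~ N(v_qk, σ²)
Z1 = (rng.random((Q, K)) < p1).astype(float)     # Z_{1,q,k} ~ Bernoulli(p1)
eps1 = rng.standard_normal((Q, K))
W1 = reparameterise(U1, Z1, sigma, eps1)
```

With tiny $\sigma$, `W1` is numerically indistinguishable from `Z1 * V1`, which is exactly the limit used later when recovering the CigL objective.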
For the first term in $\mathcal{L}_{\text{GP-VI}}$, we estimate each integral using Monte Carlo integration with a distinct single sample to obtain
$$\mathcal{L}_{\text{GP-MC}} := \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid x_n, W_1^n, W_2^n, \mathbf{b}^n) - D_{\mathrm{KL}}\big(q(W_1, W_2, \mathbf{b}) \,\|\, p(W_1, W_2, \mathbf{b})\big).$$
Following Gal & Ghahramani (2016), optimising the stochastic objective $\mathcal{L}_{\text{GP-MC}}$ converges to the same limit as optimising $\mathcal{L}_{\text{GP-VI}}$, which justifies this stochastic approximation. Moving to the second term of $\mathcal{L}_{\text{GP-MC}}$, we use $w$ and $u$ to denote one component of the weights ($W_1$) and of the variational parameters ($U_1$), respectively. Then
$$-D_{\mathrm{KL}}\big(q(w) \,\|\, p(w)\big) = -\int q(w) \log \frac{q(w)}{p(w)}\, dw = \int \Big[\int q(w, u)\, du\Big] \log \frac{p(w)}{\int q(w, u)\, du}\, dw$$
$$= \int q(u) \Big[\int q(w \mid u) \log p(w)\, dw\Big] du - \int \Big[\int q(w, u)\, du\Big] \log \Big[\int q(w, u)\, du\Big] dw. \quad (12)$$
For the first term in Eq. (12), we follow Proposition 1 in Gal & Ghahramani (2016) to approximate
$$\int q(w \mid u) \log p(w)\, dw \approx -\frac{1}{2}\big(p\, u^2 + \sigma^2\big).$$
Then the first term in Eq. (12) is approximately
$$\int q(u) \Big[-\frac{1}{2}\big(p\, u^2 + \sigma^2\big)\Big] du = -\frac{1}{2} \sigma^2 - \frac{p}{2} \int q(u)\, u^2\, du = -\frac{p + 1}{2} \sigma^2 - \frac{p}{2} v^2.$$
For the second term in Eq. (12), we estimate the inner integral $\int q(w, u)\, du$ using Monte Carlo integration with a distinct single sample $\hat{u}$ to obtain
$$\int \Big[\int q(w, u)\, du\Big] \log \Big[\int q(w, u)\, du\Big] dw \approx \int q(w \mid \hat{u}) \log q(w \mid \hat{u})\, dw,$$
and again follow Proposition 1 in Gal & Ghahramani (2016) to approximate
$$\int q(w \mid \hat{u}) \log q(w \mid \hat{u})\, dw \approx -\frac{1}{2}\big(\log \sigma^2 + 1 + \log 2\pi\big) + C.$$
Therefore, we obtain the approximation
$$-D_{\mathrm{KL}}\big(q(w) \,\|\, p(w)\big) \approx -\frac{p + 1}{2} \sigma^2 - \frac{p}{2} v^2 + \frac{1}{2}\big(\log \sigma^2 + 1 + \log 2\pi\big) + C. \quad (13)$$
Summing Eq. (13) over all $QK$ entries of $W_1$, we have
$$D_{\mathrm{KL}}\big(q(W_1) \,\|\, p(W_1)\big) \approx \frac{QK(p_1 + 1)}{2} \sigma^2 - \frac{QK}{2}\big(\log \sigma^2 + 1\big) + \frac{p_1}{2} \sum_{q=1}^{Q} \sum_{k=1}^{K} v_{qk}^2 + C,$$
where $C$ is a constant; $D_{\mathrm{KL}}(q(W_2) \,\|\, p(W_2))$ can be approximated in a similar way.
For $D_{\mathrm{KL}}(q(\mathbf{b}) \,\|\, p(\mathbf{b}))$, the Gaussian-to-Gaussian KL has the closed form
$$D_{\mathrm{KL}}\big(q(\mathbf{b}) \,\|\, p(\mathbf{b})\big) = \frac{1}{2}\big(\mathbf{u}^\top \mathbf{u} + K(\sigma^2 - \log \sigma^2 - 1)\big).$$
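This closed form can be verified numerically. The NumPy sketch below (ours, for illustration) estimates $\mathbb{E}_{q(\mathbf{b})}[\log q(\mathbf{b}) - \log p(\mathbf{b})]$ by sampling $\mathbf{b} \sim q(\mathbf{b}) = N(\mathbf{u}, \sigma^2 I_K)$ and compares it against the formula above.

```python
import numpy as np

def kl_closed_form(u, sigma):
    # D_KL( N(u, σ² I_K) || N(0, I_K) ) = ½ (uᵀu + K(σ² − log σ² − 1)).
    K = u.size
    return 0.5 * (u @ u + K * (sigma ** 2 - np.log(sigma ** 2) - 1))

def kl_monte_carlo(u, sigma, n_samples, rng):
    # Estimate E_{q(b)}[log q(b) − log p(b)] with b ~ q(b) = N(u, σ² I_K).
    K = u.size
    b = u + sigma * rng.standard_normal((n_samples, K))
    log_q = (-0.5 * K * np.log(2 * np.pi * sigma ** 2)
             - 0.5 * ((b - u) ** 2).sum(axis=1) / sigma ** 2)
    log_p = -0.5 * K * np.log(2 * np.pi) - 0.5 * (b ** 2).sum(axis=1)
    return np.mean(log_q - log_p)
```

The agreement between the sampled and analytic values is what licenses dropping the KL term's Monte Carlo noise from the objective.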

B.4 LOG EVIDENCE LOWER BOUND OPTIMISATION FOR CIGL

Next we explain the relation between the above equations and those for CigL. Ignoring the constant terms in $\tau$ and $\sigma$, we obtain the maximisation objective
$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{\tau}{2} \sum_{n=1}^{N} \|\mathbf{y}_n - \hat{\mathbf{y}}_n\|_2^2 - \frac{p_1}{2} \|V_1\|_2^2 - \frac{p_2}{2} \|V_2\|_2^2 - \frac{1}{2} \|\mathbf{u}\|_2^2. \quad (15)$$
We now show the equivalence between the iterative update of $M_1$, $Z_1$, and $U_1$ in CigL and hierarchical variational inference for the deep GP; the update for $M_2$, $Z_2$, and $U_2$ is similar. In the hierarchical variational distribution, the distribution of the sparse weights $W_1$ depends on the three random variables $M_1$, $Z_1$, and $U_1$. Given $Z_1$ and $U_1$, the variational distribution for $M_1$ is
$$q(M_1 \mid U_1) \propto \exp(M_1 \odot |U_1|),$$
where $M_1$ is under a certain sparsity constraint. Thus, we update $M_1$ by choosing the $M_1$ that maximises Eq. (15), which is aligned with the $M_1$ update procedure in CigL. Given $M_1$, we use $V_1$ to approximate $U_1$. Then, letting $\sigma$ tend to zero (Gal & Ghahramani, 2016), the random variable realisations $W_1^n$, $W_2^n$, $\mathbf{b}^n$ become
$$W_1^n \approx \hat{Z}_1 \odot \hat{U}_1, \quad W_2^n \approx \hat{Z}_2 \odot \hat{U}_2, \quad \mathbf{b}^n \approx \mathbf{u},$$
so that $\hat{\mathbf{y}}_n \approx \sqrt{\tfrac{1}{K}}\, \sigma\big(x_n (\hat{Z}_1 \odot \hat{U}_1) + \mathbf{u}\big) (\hat{Z}_2 \odot \hat{U}_2)$. We scale the optimisation objective by the positive constant $\frac{1}{\tau N}$ and obtain
$$\mathcal{L}_{\text{GP-MC}} \propto -\frac{1}{2N} \sum_{n=1}^{N} \|\mathbf{y}_n - \hat{\mathbf{y}}_n\|_2^2 - \frac{p_1}{2 \tau N} \|\hat{U}_1\|_2^2 - \frac{p_2}{2 \tau N} \|\hat{U}_2\|_2^2 - \frac{1}{2 \tau N} \|\mathbf{u}\|_2^2.$$
Thus we recover the CigL objective of Eq. (8). With correct stochastic optimisation scheduling, both will converge to the same limit.
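The scaled objective above is a squared-error data term plus weight-decay terms on the variational parameters. The NumPy sketch below (ours, not the paper's code; the function name and ReLU choice are illustrative assumptions) evaluates it for a single-hidden-layer network with masked weights $\hat{Z}_1 \odot \hat{U}_1$ and $\hat{Z}_2 \odot \hat{U}_2$.

```python
import numpy as np

def cigl_objective(X, Y, U1, U2, Z1, Z2, u, p1, p2, tau):
    # Data-fit term plus scaled weight-decay terms on the variational parameters.
    N, K = X.shape[0], U1.shape[1]
    H = np.maximum(X @ (Z1 * U1) + u, 0.0) / np.sqrt(K)   # hidden layer (ReLU), 1/sqrt(K) scaling
    Y_hat = H @ (Z2 * U2)
    data_term = -np.sum((Y - Y_hat) ** 2) / (2 * N)
    reg = -(p1 * np.sum(U1 ** 2) + p2 * np.sum(U2 ** 2)
            + np.sum(u ** 2)) / (2 * tau * N)
    return data_term + reg

rng = np.random.default_rng(0)
Q, K, D, N, p1, p2, tau = 3, 16, 2, 10, 0.5, 0.5, 1.0
U1, U2 = rng.standard_normal((Q, K)), rng.standard_normal((K, D))
Z1 = (rng.random((Q, K)) < p1).astype(float)
Z2 = (rng.random((K, D)) < p2).astype(float)
u = rng.standard_normal(K)
X = rng.standard_normal((N, Q))

# When Y equals the network output exactly, only the decay terms remain.
Y = (np.maximum(X @ (Z1 * U1) + u, 0.0) / np.sqrt(K)) @ (Z2 * U2)
```

Maximising this objective trades prediction error against shrinkage of $U_1$, $U_2$, and $\mathbf{u}$, with the mixture probabilities $p_1$, $p_2$ setting the decay strengths, mirroring Eq. (15).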

B.5 MASK & WEIGHT AVERAGING FOR PREDICTION

For prediction, we design weight & mask averaging (WMA) to produce the final output model. Specifically, we collect samples of $\{\hat{Z}_1, \hat{U}_1\}$ during the optimisation process, where we use $V_1$ at different epochs to approximate $U_1$. By using weight & mask averaging and letting $\sigma$ tend to zero (Gal & Ghahramani, 2016), the random variables $W_1$, $W_2$, $\mathbf{b}$ can be approximated as
$$W_1 \approx \frac{1}{S} \sum_{s=1}^{S} \hat{Z}_1^{(s)} \odot \hat{U}_1^{(s)}, \quad W_2 \approx \frac{1}{S} \sum_{s=1}^{S} \hat{Z}_2^{(s)} \odot \hat{U}_2^{(s)}, \quad \mathbf{b} \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{u}^{(s)}.$$
WMA can be seen as approximating the mean of the posterior based on samples from the variational distribution $q(W)$ using moment matching, which is justified as follows:
(i) CigL is connected to the Bayesian approach because it uses both deterministic and random masks to explore the weight space. As shown in Equation 3, the design of $Z$ and $M$ results in a hierarchical variational distribution $q(W)$, where the hierarchy expands the approximation family and leads to a better posterior approximation capability. In Equation 3, updating the mask is equivalent to updating $q(W)$ and bringing it closer to the posterior.
(ii) In a similar spirit to weight dropout (Gal & Ghahramani, 2016), WMA can be considered a Bayesian approximation, used to approximate the mean of the posterior by moment matching.
• Weight dropout has been shown to be equivalent to a Bayesian approximation method (Gal & Ghahramani, 2016). After obtaining the final $W$, dropout approximates $\int f(W \odot Z)\, p(Z)\, dZ$ using $f(\mathbb{E}(W \odot Z))$, where $f$ is the neural network and $\mathbb{E}(W \odot Z) = \int W \odot Z\, p(Z)\, dZ$ (Srivastava et al., 2014). This approximates the whole posterior by its first moment (i.e., the mean).
• For our WMA, since we assume a hierarchy, we collect multiple samples of $W$ and $Z$. WMA is then used to approximate the first moment of the posterior, which serves as an approximation of the posterior itself (Srivastava et al., 2014).
• In addition, if we really want the second moment, it is straightforward to obtain an estimate based on samples using moment matching again, similar to Maddox et al. (2019). We do not estimate the second moment because sparse training typically wants a single sparse model in the end, to reduce both computational and memory costs, and we find that using the posterior mean already significantly improves the calibration of sparse training.

C APPENDIX: ADDITIONAL EXPERIMENTAL RESULTS

C.1 STRONGER CORRELATION BETWEEN HIDDEN VARIABLES

Empirically, we find a stronger correlation between $Z$ and $W$ in sparse training. We use CigL to train sparse Wide-ResNet-22-2 on CIFAR-10 at multiple sparsities (0%, 50%, 80%, 90%). Then, we randomly draw five random masks $Z_i$, $i \in \{1, \dots, 5\}$, from a Bernoulli distribution. Using the final sparse weights $W$, we obtain several new sparse models $Z_i \odot W$ and record their test accuracies. We compare the test accuracy of $W$ with the average accuracy of the $Z_i \odot W$ to gauge the correlation: a larger decrease in test accuracy after multiplying by $Z_i$ implies a stronger correlation between $Z$ and $W$. As shown in Figure 6 (a), both the accuracy of $W$ (red curve) and the average accuracy of $Z_i \odot W$ (blue curve) decrease with increasing sparsity, and the decrease in the blue curve is more pronounced. Figure 6 (b) further shows the decrease in test accuracy at each sparsity. The decrease is very small for the dense model or sparse models at low sparsity. However, at high sparsity such as 90%, we observe a larger decrease, which indicates a stronger correlation between $Z$ and $W$ in sparse training.
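The masking step of this probe is straightforward to sketch. The NumPy snippet below (ours, for illustration; the function name is an assumption, and we use a random weight matrix in place of trained weights) draws several Bernoulli masks $Z_i$ and forms the perturbed models $Z_i \odot W$; in the actual experiment each masked model would then be evaluated on the test set.

```python
import numpy as np

def apply_random_masks(W, keep_prob, n_masks, rng):
    # Draw n_masks random masks Z_i ~ Bernoulli(keep_prob) and return Z_i ⊙ W.
    Z = (rng.random((n_masks,) + W.shape) < keep_prob).astype(W.dtype)
    return Z * W

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))                      # stand-in for trained weights
masked = apply_random_masks(W, keep_prob=0.9, n_masks=5, rng=rng)
```

Each `masked[i]` keeps roughly `keep_prob` of the entries of `W`; the accuracy gap between `W` and the masked copies is what Figure 6 measures.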

C.2 RELIABILITY IN SPARSE TRAINING

To get a more comprehensive understanding of the reliability issues in sparse training, we also evaluate the ECE values of the sparse models generated by SET (Mocanu et al., 2018). We find that the sparse model produced by SET is also more over-confident than the dense model. As shown in Table 4, the ECE values of dense ResNet-50 are smaller than those of sparse ResNet-50 on both CIFAR-10 and CIFAR-100. This confirms the reliability issue of sparse training. We further compare our CigL with a recent sparse training baseline, Sup-tickets (Yin et al., 2022), to show the effectiveness of CigL in reducing ECE values. Table 5 shows the change in ECE after using Sup-tickets or our CigL. We can see that Sup-tickets brings only a limited reduction in ECE, while the reduction of our CigL is much larger.

D APPENDIX: ADDITIONAL DISCUSSION

D.1 WEIGHT SPACE EXPLORATION

ITOP (Liu et al., 2021) and DST-EE (Huang et al., 2022) study weight space exploration in sparse training and emphasise its importance. Compared to their studies, our work has two main differences that address their limitations. On the one hand, our work has a different goal from ITOP and DST-EE with respect to encouraging exploration of the weight space: our work aims to better explore the weight space to find more reliable models, while ITOP and DST-EE aim to build models with higher accuracy, ignoring the safety aspects. On the other hand, exploration of the weight space has two aspects, namely "which weight is active" and "what value that weight has". The limitation of ITOP is that, given the mask, the second aspect is not addressed, and the optimisation remains more challenging than dense training due to the pseudo-local optima introduced by the sparsity constraints. To meet this challenge, ITOP increases the iterations between mask updates, leading to an increase in training time. DST-EE mainly targets the first aspect.
In contrast, our work addresses this limitation, as the following discussion shows. The first aspect of weight space exploration is reflected by the ITOP rate, i.e., the percentage of all weights that have ever been selected as active by the mask. The second aspect is reflected by the idea of "reliable exploration" in the ITOP paper. Ideally, reliable exploration should allow a model to find a good direction and jump out of bad local optima. The sparsity constraints introduce pseudo-local optima that are difficult to escape. Our random mask can randomly cut off some directions and force the model to explore others, thus encouraging the model to better explore the weight space and avoid missing the correct direction.
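The ITOP rate mentioned above has a simple operational definition: the fraction of weights activated by at least one mask over the course of training. A minimal NumPy sketch (ours; the function name is an illustrative assumption):

```python
import numpy as np

def itop_rate(mask_history):
    # Fraction of weights selected as active by at least one mask during training.
    ever_active = np.zeros_like(mask_history[0], dtype=bool)
    for mask in mask_history:
        ever_active |= mask.astype(bool)   # union of all masks seen so far
    return float(ever_active.mean())
```

For example, two disjoint masks over four weights, each activating one weight, give an ITOP rate of 0.5, while repeating the same mask leaves the rate at that mask's density.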

D.2 DOUBLE DESCENT IN RELIABILITY

One phenomenon we find worth discussing is the double descent in the reliability of sparse training. We discuss it in Section 6, where we divide the sparsity range into four stages, i.e., poor model, shallow model, sparse deep model, and dense deep model. The four stages are first supported by intuition. In the discussion, we draw analogies between model types such as "shallow models" and "poor models" in terms of model accuracy (expressiveness) and size. Consistent with the previous definition of double descent (Nakkiran et al., 2021; Somepalli et al., 2022), we consider sparsity as a measure of model size. Intuitively, as we gradually reduce the model size (increase the sparsity), we pass through the four stages. The four arguments are also supported by our sparse training experiments on ResNet-50 on CIFAR-100. As shown in Figure 1 (c), we can infer the model type from sparsity and accuracy:
• For 99.7% sparsity, the accuracy of the model is 41.7%, which is similar to a shallow model.
• For 99.9% sparsity, the accuracy of the model is 23.5%, which can be viewed as a poor model.
More detailed and quantitative support for these four arguments is beyond the main scope of this paper and could be a good direction for future research. One potential direction is the use of effective depth as a measure for stage identification.

D.3 WEIGHT & MASK AVERAGING

Without the use of WMA, the analysis in Section 4.2 would describe a non-hierarchical Bayesian method or a poor approximation to a hierarchical Bayesian approach. (i) In the absence of WMA, the algorithm can be viewed as non-hierarchical variational inference. As described in Section 4.3, using the final $W$ for prediction without WMA is equivalent to using weight dropout in RigL. Thus, the analysis in Section 4.2 would be updated in a similar way to Section 3 of Gal & Ghahramani (2016), which shows that weight dropout can be viewed as a non-hierarchical Bayesian approximation. (ii) Without WMA, the algorithm can also be viewed as a poor approximation to hierarchical variational inference.
• If we continue to interpret the algorithm without WMA using the current analysis structure from Section 4.1, then how we generate the final posterior approximation changes.
• In this case, although the algorithm is still a Bayesian approximation, we only use the final $W$ to represent the posterior, which does not effectively capture all the information explored in the weight space and the increased correlation between $Z$ and $W$.
• Therefore, it turns out to be a poor hierarchical approximation, which limits its power.

D.4 MULTI-MASK SPARSE DNNS

Existing multi-mask methods are not designed for improved weight space exploration. Bibikar et al. (2022) consider sparse training in federated learning and investigate the aggregation of multiple masks on edge devices. Xia et al. (2022) utilise multiple masks with different granularities to allow greater flexibility in structured pruning and to improve accuracy. Despite using multiple masks, existing work (Xia et al., 2022; Bibikar et al., 2022) differs significantly from ours: it still relies on deterministic masks, which suffer from the lack of weight space exploration, and considers only the accuracy of sparse models. In addition, their setups are federated learning and pruning, which differ from our setting.

D.5 SPARSE DNNS: PRUNING & SPARSE TRAINING

Although pruning (e.g., Lottery Tickets) and sparse training are related and both produce subnetworks with high accuracy, their goals and discovery procedures are quite different, which leads to significant differences in several important properties, including uncertainty, geometry of the loss surface, and generalization ability. (i) Regarding the goal, Lottery Tickets mainly aim to reduce the inference cost, while sparse training also aims to save resources during the training phase. (ii) Lottery Tickets and sparse training differ in several important properties. As shown in Figure 11 of Chen et al. (2022), the blue and purple bars represent Lottery Tickets and sparse training at a sparsity level of 79%, respectively.
• For uncertainty, sparse training does not improve confidence calibration compared to dense training, while Lottery Tickets allow for improved confidence calibration.
• For the geometry of the loss surface, sparse training leads to larger trace values and cannot locate flat local minima, whereas Lottery Tickets can still locate flat local minima.
• For generalization ability, sparse training provides higher accuracy and improved robustness compared to dense training, while Lottery Tickets provide relatively less improvement.
(iii) The main reason for the different properties is their different discovery procedures. For Lottery Tickets:
• They retrain the weights from the initial training phase after each pruning step, which significantly increases the training time but allows the model more time to explore the weight space.
• They start from a dense model and have low sparsity in the early stages, which reduces the difficulty of weight space exploration caused by the sparsity constraints.
For sparse training:
• It maintains a high level of sparsity throughout the training process, which does not extend the training time to enable more exploration of the weight space.
• In addition, maintaining high sparsity can cut off a large portion of the optimization routes and produce more spurious local minima, thus making training very difficult.
• Chen et al. (2022) show the differences in the properties of Lottery Tickets and sparse training at the 79% sparsity level. The difficulty of training typically increases with sparsity, implying that the difference is likely to be greater at higher sparsity levels.



Figure 1: Reliability diagrams for (a) the dense model and (b) the sparse model. The sparse model is more over-confident than the dense model. (c) The scatter plot of test accuracy (%) and ECE value at different sparsities. From the highly sparse model to the dense model, the ECE value first decreases, then increases, and then decreases again, showing a double descent pattern.

Figure 5: Ablation studies: test accuracy(%) and ECE value comparison between CigL, CigL without random mask (CigL w/o RM), and CigL without weight & mask averaging (CigL w/o WMA) at different sparsities (80%, 90%, 95%, 99%). Compared to (a)-(b) CigL w/o RM and (c)-(d) CigL w/o WMA, CigL more consistently produces sparse models with low ECE values and high accuracy.

(a)  The sparse model starts as a poor model, which is too sparse to learn the data well (low reliability & accuracy). (b) It gradually becomes equivalent to a shallow model that can learn some patterns but is not flexible enough to learn all the data well (high reliability & moderate level of accuracy). (c) Then, it moves to a sparse deep model that can accommodate complex patterns but suffers from poor exploration (low reliability & high accuracy).

Figure 6: (a) Test accuracy of the sparse model W and the newly produced sparse models Z_i ⊙ W. (b) Decrease in test accuracy from the sparse model W to the newly produced sparse models Z_i ⊙ W. At low sparsity, the decrease of the dense or sparse models is small; at high sparsity, the decrease is larger.

Table 2: Testing accuracy (%) comparison between CigL and RigL at different sparsities (80%, 90%, 95%, 99%). Compared to RigL, CigL maintains comparable or higher test accuracy.

The comparison of test accuracy is shown in Table 2. It is observed that CigL tends to bring comparable or higher accuracy than RigL, which demonstrates that CigL can simultaneously maintain or improve accuracy. However, using weight dropout or MC dropout in RigL usually results in lower accuracy.

Testing accuracy (%) comparison between CigL, RigL + weight dropout (W-DP), and RigL + MC dropout (MC-DP) at different sparsities (80%, 90%, 95%, 99%). Compared to RigL, RigL + W-DP, and RigL + MC-DP, CigL maintains comparable or higher test accuracy.

Testing ECE comparison between CigL, RigL + weight dropout (W-DP), and RigL + MC dropout (MC-DP) at different sparsities (80%, 90%, 95%, 99%). Compared to RigL + W-DP and RigL + MC-DP, CigL more consistently achieves a significant reduction in the ECE value of RigL.

We perform ablation studies on CigL without random mask (CigL w/o RM) and CigL without weight & mask averaging (CigL w/o WMA), respectively. In CigL w/o RM, we search for sparse topologies using only the deterministic mask. In CigL w/o WMA, we collect multiple model samples and use prediction averaging during testing. Figures 5(a)-(b) show the effect of random masks on the test accuracy and ECE values, where the blue, green, and pink bars represent RigL, CigL w/o RM, and CigL, respectively. We can see that if we remove the random mask, we still obtain an improvement in accuracy compared to RigL. However, the ECE values do not decrease as much as with CigL, indicating that CigL w/o RM is not as effective as CigL in improving confidence calibration. Figures 5(c)-(d) further show the effect of weight & mask averaging. Without weight & mask averaging, the accuracy decreases and the ECE value increases at high sparsity such as 95% and 99%, demonstrating the importance of weight & mask averaging.

Table 4: ECE values of sparse ResNet-50 on CIFAR-10 and CIFAR-100 produced by SET at different sparsities, including 0%, 50%, 80%, 90%, 95%, 99%.

Table 5: ECE value changes of Sup-tickets and CigL in ResNet-50 on CIFAR-10 and CIFAR-100 at different sparsities, including 80%, 90%, 95%. Table 6 shows the change in ECE after using CigL w/o RM or our CigL. We find that when only WMA is used, the ECE value cannot be effectively reduced; it either increases or decreases only slightly. In contrast, the reduction achieved by our CigL is much larger than that of CigL w/o RM.

Table 6: ECE value changes of CigL w/o RM and CigL in ResNet-50 (CIFAR-100) and Wide-ResNet-22-2 (CIFAR-10).

ACKNOWLEDGMENTS

This research was partially supported by NSF Grant No. NSF CCF-1934904 (TRIPODS).

REPRODUCIBILITY STATEMENT

The implementation code can be found at https://github.com/StevenBoys/CigL. All datasets and the code platform (PyTorch) we use are publicly available.

