SHARPER ANALYSIS OF SPARSELY ACTIVATED WIDE NEURAL NETWORKS WITH TRAINABLE BIASES

Abstract

This work studies the training of one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, in contrast to previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The tantalizing benefit of such initialization is that the neural network provably has a sparse activation pattern before, during, and after training, which enables fast training procedures and therefore reduces the training cost. The first set of results characterizes the convergence of the network's gradient descent dynamics: we provide the width required to ensure that gradient descent drives the training error towards zero at a linear rate. The contribution over previous work is twofold: not only are the biases allowed to be updated by gradient descent under our setting, but a finer analysis also improves the width required to ensure the network's closeness to its NTK. Secondly, we provide the network's generalization bound after training. We present a width-sparsity dependence that yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result via localized Rademacher complexity. As a by-product, if the bias initialization is chosen to be zero, our width requirement improves upon the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK, and the bounds from previous works yield vacuous generalization guarantees, this work further studies the smallest eigenvalue of the limiting NTK.
Surprisingly, while we do not show that trainable biases are necessary, trainable biases help to identify a nice data-dependent region in which a much finer analysis of the NTK's smallest eigenvalue can be conducted, leading to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound. Experiments are provided to support our results.

1. INTRODUCTION

The literature on sparse neural networks dates back to the early work of LeCun et al. (1989), which showed that a fully trained neural network can be pruned while preserving generalization. Recently, training sparse neural networks has received increasing attention since the discovery of the lottery ticket hypothesis (Frankle & Carbin, 2018). That work showed that if we repeatedly train and prune a neural network and then rewind the weights to the initialization, we can find a sparse neural network that can be trained to match the performance of its dense counterpart. However, this method is more a proof of concept and is computationally too expensive for practical purposes. Nonetheless, it has inspired further interest in the machine learning community in developing efficient methods to find the sparsity pattern at initialization such that the performance of the sparse network matches that of the dense network after training (Lee et al., 2018; Wang et al., 2019; Tanaka et al., 2020; Liu & Zenke, 2020; Chen et al., 2021; He et al., 2017; Liu et al., 2021b). On the other hand, instead of trying to find a desired sparsity pattern at initialization, another line of research focuses on inducing the sparsity pattern naturally and then creatively exploiting such sparse structure via high-dimensional geometric data structures, sketching, or even quantum algorithms to speed up per-step gradient descent training (Song et al., 2021a;b; Hu et al., 2022; Gao et al., 2022). In this line of theoretical studies, the sparsity is induced by a shifted ReLU, which is equivalent to initializing the bias of the network's linear layer to some large constant instead of zero and holding the bias fixed throughout the entire training.
By Gaussian concentration, at initialization, the total number of activated neurons (i.e., those for which the ReLU outputs a non-zero value) is sublinear in the total number m of neurons, as long as the bias is initialized to C√(log m) for some appropriate constant C. We call this the sparsity-inducing initialization. If the network is in the NTK regime, each neuron's weight exhibits only a microscopic change after training, and thus the sparsity is preserved throughout the entire training process. Therefore, only a sublinear number of neuron weights need to be updated at each step, which can significantly speed up training. The focus of this work is along the above line of theoretical studies of sparsely trained overparameterized neural networks; we address two main limitations of the aforementioned studies. (1) The bias parameters in the previous works are not trainable, contrary to common practice. (2) The previous works only provide convergence guarantees and lack an analysis of generalization, which is of central interest in deep learning theory. Our study fills these important gaps by providing a comprehensive study of training one-hidden-layer sparsely activated neural networks in the NTK regime with (a) trainable biases incorporated in the analysis; (b) a finer analysis of the convergence; and (c) the first generalization bound for such sparsely activated neural networks after training, with a sharp bound on the restricted smallest eigenvalue of the limiting NTK. We elaborate on our technical contributions as follows:

1. Convergence. Theorem 3.1 provides the width required to ensure that gradient descent drives the training error towards zero at a linear rate. Our convergence result contains two novel ingredients compared to existing studies.
(1) Our analysis handles trainable biases and shows that even though the biases are allowed to move from their initialization, the network's activation remains sparse during the entire training. This relies on our development of a new result showing that the change of the bias also diminishes with an O(1/√m) dependence on the network width m. (2) A finer analysis shows that the network width required to ensure convergence can be much smaller, improving upon the previous result by a factor of Θ(n^{8/3}) under appropriate bias initialization, where n is the sample size. This relies on our novel development of (i) a better characterization of the activation flipping probability via a Gaussian anti-concentration analysis based on the location of the strip, and (ii) a finer analysis of the initial training error.

2. Generalization. Theorem 3.8 studies the generalization of the network after gradient descent training, where we characterize how the network width should depend on the activation sparsity, which leads to a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result via localized Rademacher complexity. In addition, compared with previous works, our result improves the width dependence by a factor of n^{10}. This relies on (i) the use of symmetric initialization and (ii) a finer analysis of the change of the weight matrix in Frobenius norm in Lemma 3.13.

3. Restricted Smallest Eigenvalue. Theorem 3.8 shows that the generalization bound depends heavily on the smallest eigenvalue λ_min of the limiting NTK. However, the previously known worst-case lower bounds on λ_min under data separation have an explicit 1/n² dependence (Oymak & Soltanolkotabi, 2020; Song et al., 2021a), making the generalization bound vacuous.
Instead, our Theorem 3.11 establishes a much sharper lower bound restricted to a data-dependent region, which is sample-size independent. This yields a desirable generalization bound that vanishes as fast as O(1/√n), provided that the label vector lies in this region, which can be achieved with a simple label shift.
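The sparsity-inducing initialization described above can be checked empirically. The sketch below (not the paper's implementation; the values of m, d, and C are illustrative) draws Gaussian neuron weights, sets every bias to C√(log m), and counts how many shifted-ReLU pre-activations are positive for a unit-norm input. Since each pre-activation is a standard Gaussian minus the bias, the expected active fraction is Φ(-C√(log m)) = m^{-C²/2} up to constants, i.e., sublinear in m.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 100_000, 50            # number of neurons, input dimension (illustrative)
C = 1.0                       # bias scale constant (hypothetical choice)
b = C * np.sqrt(np.log(m))    # sparsity-inducing bias initialization

W = rng.standard_normal((m, d))   # neuron weights w_r ~ N(0, I_d)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)            # unit-norm input, so each entry of W @ x ~ N(0, 1)

pre_act = W @ x - b               # shifted-ReLU pre-activations
active = int(np.sum(pre_act > 0)) # neurons with non-zero ReLU output

print(f"{active} of {m} neurons active ({active / m:.4%})")
```

With C = 1 the expected number of active neurons scales like m^{1/2} rather than m, so only a vanishing fraction of neuron weights receives a non-zero gradient at each step.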

1.1. FURTHER RELATED WORKS

Besides the works mentioned in the introduction, another work related to ours is (Liao & Kyrillidis, 2022), which also considers training a one-hidden-layer neural network with sparse activation and studies its convergence. However, different from our work, their sparsity is induced by sampling a random mask at each step of gradient descent, whereas ours is induced by non-zero initialization of the bias terms. Also, their network has no bias term, and they only study training convergence but not generalization. We discuss additional related works here.

