SHARPER ANALYSIS OF SPARSELY ACTIVATED WIDE NEURAL NETWORKS WITH TRAINABLE BIASES

Abstract

This work studies the training of one-hidden-layer overparameterized ReLU networks via gradient descent in the neural tangent kernel (NTK) regime, where, unlike previous works, the networks' biases are trainable and are initialized to some constant rather than zero. The appealing benefit of such an initialization is that the network provably has a sparse activation pattern before, during, and after training, which enables fast training procedures and therefore reduces the training cost. The first set of results characterizes the convergence of the network's gradient descent dynamics: we provide the width required to ensure that gradient descent drives the training error towards zero at a linear rate. The contribution over previous work is twofold: the bias is allowed to be updated by gradient descent in our setting, and a finer analysis improves the width required to ensure the network's closeness to its NTK. Secondly, the network's generalization bound after training is provided. We present a width-sparsity dependence which yields a sparsity-dependent localized Rademacher complexity and a generalization bound matching previous analyses (up to logarithmic factors). To our knowledge, this is the first sparsity-dependent generalization result obtained via localized Rademacher complexity. As a by-product, if the bias initialization is chosen to be zero, the width requirement improves upon the previous bound for shallow networks' generalization. Lastly, since the generalization bound depends on the smallest eigenvalue of the limiting NTK, and the bounds from previous works yield vacuous generalization, this work further studies the least eigenvalue of the limiting NTK. Surprisingly, while we do not show that trainable biases are necessary, the trainable bias helps to identify a nice data-dependent region where a much finer analysis of the NTK's smallest eigenvalue can be conducted; this leads to a much sharper lower bound than the previously known worst-case bound and, consequently, a non-vacuous generalization bound. Experimental results are provided to evaluate our claims.
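To make the sparse-activation mechanism above concrete, the following is a minimal NumPy sketch of the setup under assumptions not stated in the abstract: unit-norm inputs, standard Gaussian first-layer weights, random-sign output weights, and all biases initialized to the same negative constant -B. Under these assumptions each neuron's pre-activation is distributed as N(-B, 1), so only about a Φ(-B) fraction of neurons (decaying like e^{-B²/2}) is active at initialization, which is the sparse activation pattern referred to above.

```python
import numpy as np

# Minimal sketch (assumptions: unit-norm input, standard Gaussian first-layer
# weights, random-sign output weights, all biases initialized to -B).
# With w ~ N(0, I_d) and ||x|| = 1, the pre-activation w^T x - B is N(-B, 1),
# so roughly a Phi(-B) fraction of neurons fires at initialization.

def init_network(m, d, B, rng):
    """One-hidden-layer ReLU network f(x) = sum_r a_r * relu(w_r^T x + b_r)."""
    W = rng.standard_normal((m, d))       # hidden-layer weights
    b = -B * np.ones(m)                   # trainable biases, constant init
    a = rng.choice([-1.0, 1.0], size=m)   # output weights (random signs)
    return W, b, a

def activation_fraction(W, b, x):
    """Fraction of hidden neurons with positive pre-activation on input x."""
    pre = W @ x + b
    return np.mean(pre > 0)

rng = np.random.default_rng(0)
d, m = 32, 10_000
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # unit-norm input

for B in [0.0, 1.0, 2.0, 3.0]:
    W, b, a = init_network(m, d, B, rng)
    print(f"B = {B}: active fraction ~ {activation_fraction(W, b, x):.4f}")
```

With B = 0 about half of the neurons are active, while larger B (e.g., B = 2 or 3) leaves only a few percent or fewer active, illustrating how the constant bias initialization controls the sparsity level.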

1. INTRODUCTION

The literature on sparse neural networks dates back to the early work of LeCun et al. (1989), which showed that a fully-trained neural network can be pruned while preserving generalization. Recently, training sparse neural networks has received increasing attention since the discovery of the lottery ticket hypothesis (Frankle & Carbin, 2018), which showed that if we repeatedly train and prune a neural network and then rewind the weights to their initialization, we can find a sparse neural network that can be trained to match the performance of its dense counterpart. However, this method is more a proof of concept and is computationally too expensive for practical purposes. Nonetheless, it has inspired further interest in the machine learning community in developing efficient methods that find a sparsity pattern at initialization such that the sparse network matches the performance of the dense network after training (Lee et al., 2018; Wang et al., 2019; Tanaka et al., 2020; Liu & Zenke, 2020; Chen et al., 2021; He et al., 2017; Liu et al., 2021b). On the other hand, instead of searching for a desired sparsity pattern at initialization, another line of research focuses on inducing the sparsity pattern naturally and then utilizing such sparse structure, via high-dimensional geometric data structures, sketching, or even quantum algorithms, to speed up per-step gradient descent training (Song et al., 2021a;b; Hu et al., 2022; Gao et al., 2022). In this line of theoretical studies, the sparsity is induced by shifted

