ARE WIDER NETS BETTER GIVEN THE SAME NUMBER OF PARAMETERS?

Abstract

Empirical studies demonstrate that the performance of neural networks improves with an increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: is the observed improvement due to the larger number of parameters, or due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as the model achieves high training accuracy. As a step towards understanding this effect, we analyze these models in the framework of Gaussian Process kernels. We find that the distance between the sparse finite-width model kernel and the infinite-width kernel at initialization is indicative of model performance.



1. INTRODUCTION

Deep neural networks have shown great empirical success in solving a variety of tasks across different application domains. One of the prominent empirical observations about neural networks is that increasing the number of parameters leads to improved performance (Neyshabur et al., 2015; 2019; Hestness et al., 2017; Kaplan et al., 2020). The consequences of this effect for model optimization and generalization have been explored extensively. In the vast majority of these studies, both empirical and theoretical, the number of parameters is increased by increasing the width of the network (Neyshabur et al., 2019; Du et al., 2019; Allen-Zhu et al., 2019). Network width itself, on the other hand, has been of interest in studies analyzing its effect on the dynamics of neural network optimization, e.g. using Neural Tangent Kernels (Jacot et al., 2018; Arora et al., 2019) and Gaussian Process kernels (Wilson et al., 2016; Lee et al., 2017). All studies we know of share the same fundamental issue: when the width is increased, the number of parameters increases as well, so the effect of increasing width cannot be separated from the effect of increasing the number of parameters. How does each of these factors, width and number of parameters, contribute to the improvement in performance? We conduct a principled study addressing this question, proposing and testing methods of increasing network width while keeping the number of parameters constant. Surprisingly, we find scenarios in which most of the performance benefits come from increasing the width.

1.1. OUR CONTRIBUTIONS

In this paper we make the following contributions:

• We propose three candidate methods (illustrated in Figure 2) for increasing network width while keeping the number of parameters constant: (a) Linear bottleneck: substituting each weight matrix with a product of two weight matrices, which limits the rank of the weight matrix. (b) Non-linear bottleneck: narrowing every other layer and widening the rest. (c) Static sparsity: setting some weights to zero using a mask that is randomly chosen at initialization and remains static throughout training.

• We show that performance can be improved by increasing the width without increasing the number of model parameters. We find that test accuracy can be improved using method (a) or (c), while method (b) only degrades performance. However, method (a) suffers from an additional degradation caused by the reparameterization, even before the network is widened. Consequently, we focus on the sparsity method (c), as it leads to the best results and is applicable to any network type.

• We empirically investigate different ways in which random, static sparsity can be distributed among the layers of the network and, based on our findings, propose an algorithm for doing so effectively (Section 2.3).

• We demonstrate that the improvement due to widening (while keeping the number of parameters fixed) holds across standard image datasets and models. Surprisingly, we observe that for ImageNet, increasing the width according to (c) leads to almost identical performance as when the number of weights is allowed to grow with the width (Section 2.3).

• To understand the observed effect theoretically, we study a simplified model and show that the improved performance of a wider, sparse network is correlated with a reduced distance between its Gaussian Process kernel and that of an infinitely wide network. We propose that this reduced kernel distance may explain the observed effect (Section 3).
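To make method (c) concrete, the following is a minimal NumPy sketch (not the paper's actual implementation) of widening a single fully connected layer by a factor f while keeping the number of trainable weights constant via a random, static sparsity mask. The function name and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def widen_with_static_sparsity(in_dim, out_dim, f):
    """Widen a layer's output dimension by factor f, then sparsify its
    weight matrix so the number of trainable weights matches the original
    dense layer (method (c)). The mask is drawn once at initialization
    and never updated during training."""
    wide_out = int(f * out_dim)
    n_dense = in_dim * out_dim   # parameter budget of the original layer
    n_wide = in_dim * wide_out   # entries in the widened weight matrix
    density = n_dense / n_wide   # fraction of weights kept (= 1 / f)
    # Random static mask: True entries are trainable, False entries stay zero.
    mask = rng.random((in_dim, wide_out)) < density
    w = rng.standard_normal((in_dim, wide_out)) * mask
    return w, mask

w, mask = widen_with_static_sparsity(in_dim=64, out_dim=64, f=4)
# During training, gradient updates would be masked the same way,
# e.g. w -= lr * (grad * mask), so masked weights remain exactly zero.
```

Note that the mask here assigns sparsity uniformly at random within one layer; how sparsity should be distributed across the layers of a deep network is treated separately in Section 2.3.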

1.2. RELATED WORK

Our work is similar in nature to the body of work studying the role of overparametrization and width. Neyshabur et al. (2015) observed that increasing the number of hidden units beyond what



Figure 2: Schematic illustration of the methods we use to increase network width while keeping the number of weights constant. Blue polygons represent weight tensors, red stripes represent non-linear activations, and diagonal white stripes denote a sparsified weight tensor. We use f to denote the widening factor.


* Work done while an intern at Blueshift.

