RETHINKING PARAMETER COUNTING: EFFECTIVE DIMENSIONALITY REVISITED

Abstract

Neural networks appear to have mysterious generalization properties when using parameter counting as a proxy for complexity. Indeed, neural networks often have many more parameters than there are data points, yet still provide good generalization performance. Moreover, when we measure generalization as a function of parameters, we see double descent behaviour, where the test error decreases, increases, and then again decreases. We show that many of these properties become understandable when viewed through the lens of effective dimensionality, which measures the dimensionality of the parameter space determined by the data. We relate effective dimensionality to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces, leading to a richer understanding of the interplay between parameters and functions in deep models. We also show that effective dimensionality compares favourably to alternative norm- and flatness-based generalization measures.

1. INTRODUCTION

Parameter counting pervades the narrative in modern deep learning. "One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error" (Dziugaite and Roy, 2017). "Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance" (Zhang et al., 2017). "Increasing the number of parameters of neural networks can give much better prediction accuracy" (Shazeer et al., 2017). "Scale sensitive complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization" (Neyshabur et al., 2018). "We train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model" (Brown et al., 2020). The number of model parameters explicitly appears in many modern generalization measures, such as in Equations 20, 51, 52, 56, 57, 59, and 60 of the recent study by Jiang et al. (2020). Phenomena such as double descent are a consequence of parameter counting. Parameter counting even permeates our language, with expressions such as over-parametrization for more parameters than data points. But parameter counting can be a poor description of model complexity, model flexibility, and inductive biases. One can easily construct degenerate cases, such as predictions being generated by a sum of parameters, where the number of parameters is divorced from the statistical properties of the model. When reasoning about generalization, over-parametrization is beside the point: what matters is how the parameters combine with the functional form of the model.
Indeed, the practical success of convolutional neural networks (CNNs) for image recognition tasks is almost entirely about the inductive biases of convolutional filters, depth, and sparsity, for extracting local similarities and hierarchical representations, rather than flexibility (LeCun et al., 1989; Szegedy et al., 2015). Convolutional neural networks have far fewer parameters than fully connected networks, yet can provide much better generalization. Moreover, width can provide flexibility, but it is depth that has made neural networks distinctive in their generalization abilities.

In this paper, we gain a number of insights into modern neural networks through the lens of effective dimensionality, in place of simple parameter counting. Effective dimensionality, defined by the eigenspectrum of the Hessian of the training loss (equation 2, Section 2), was used by MacKay (1992a) to measure how many directions in the parameter space had been determined in a Bayesian neural network.
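To make the quantity concrete, here is a minimal sketch of MacKay's effective dimensionality, computed from the eigenvalues of the Hessian. It assumes the standard form N_eff(H, z) = Σ_i λ_i / (λ_i + z), where z > 0 is a regularization constant (the prior precision in a Bayesian treatment); equation 2 of the paper is referenced but not reproduced in this section, so the exact form shown is an assumption.

```python
import numpy as np

def effective_dimensionality(hessian, z=1.0):
    """Effective dimensionality from the eigenspectrum of the Hessian.

    Sums lambda_i / (lambda_i + z) over the eigenvalues lambda_i, in the
    spirit of MacKay (1992a): directions with lambda_i >> z are "determined
    by the data" and contribute ~1; directions with lambda_i << z
    contribute ~0. The precise formula here is an illustrative assumption.
    """
    eigvals = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative eigenvalues
    return float(np.sum(eigvals / (eigvals + z)))

# Toy example: a loss whose curvature is concentrated in 2 of 5 directions.
# The raw parameter count is 5, but only ~2 directions are determined.
H = np.diag([100.0, 50.0, 0.0, 0.0, 0.0])
print(effective_dimensionality(H, z=1.0))  # ≈ 1.97, far below the 5 raw parameters
```

The toy Hessian illustrates the paper's central point: a model can have many parameters while the data determine only a few directions in parameter space, and it is the latter count that tracks complexity.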

