DEEP LEARNING IS SINGULAR, AND THAT'S GOOD

Abstract

In singular models, the optimal set of parameters forms an analytic set with singularities, and classical statistical inference cannot be applied to such models. This is significant for deep learning: neural networks are singular, and so "dividing" by the determinant of the Hessian or employing the Laplace approximation is not appropriate. Despite its potential for addressing fundamental issues in deep learning, singular learning theory appears to have made few inroads into the developing canon of deep learning theory. Via a mix of theory and experiment, we present an invitation to singular learning theory as a vehicle for understanding deep learning, and we suggest important future work that would make singular learning theory directly applicable to deep learning as it is performed in practice.

1. INTRODUCTION

It has been understood for close to twenty years that neural networks are singular statistical models (Amari et al., 2003; Watanabe, 2007). This means, in particular, that the set of network weights equivalent to the true model under the Kullback-Leibler divergence forms a real analytic variety, which fails to be an analytic manifold due to the presence of singularities. It has been shown by Sumio Watanabe that the geometry of these singularities controls quantities of interest in statistical learning theory, e.g., the generalisation error. Singular learning theory (Watanabe, 2009) is the study of singular models and requires very different tools from the study of regular statistical models. The breadth of knowledge demanded by singular learning theory (Bayesian statistics, empirical processes and algebraic geometry) is rewarded with profound and surprising results, which reveal that singular models differ from regular models in practically important ways. To illustrate the relevance of singular learning theory to deep learning, each section of this paper illustrates a key takeaway idea.

The real log canonical threshold (RLCT) is the correct way to count the effective number of parameters in a deep neural network (DNN) (Section 4). To every (model, truth, prior) triplet is associated a birational invariant known as the real log canonical threshold. In simple cases, the RLCT can be understood as half the number of directions normal to the set of true parameters. We explain why this count matters more than the curvature of those directions (as measured, for example, by the eigenvalues of the Hessian), laying bare some of the confusion over "flat" minima.

For singular models, the Bayes predictive distribution is superior to MAP and MLE (Section 5). In regular statistical models, the 1) Bayes predictive distribution, 2) maximum a posteriori (MAP) estimator, and 3) maximum likelihood estimator (MLE) have asymptotically equivalent generalisation error (as measured by the Kullback-Leibler divergence). This is not so in singular models. We illustrate in our experiments that "being Bayesian" in even just the final layers improves generalisation over MAP. Our experiments further confirm that the Laplace approximation of the predictive distribution (Smith & Le, 2017; Zhang et al., 2018) is not only theoretically inappropriate but performs poorly.

Simpler true distributions mean lower RLCTs (Section 6). In singular models the RLCT depends on the (model, truth, prior) triplet, whereas in regular models it depends only on the (model, prior) pair. The RLCT increases as the complexity of the true distribution, relative to the supposed model, increases. We verify this experimentally with a simple family of ReLU and SiLU networks.
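The volume-counting intuition behind the RLCT can be illustrated numerically (a minimal sketch of ours, not one of the paper's experiments). If K(w) denotes the KL divergence to the truth as a function of the parameters, then the prior volume of {K(w) < ε} scales as ε^λ (up to log factors), where λ is the RLCT. For a regular two-parameter model with a point truth, λ = d/2 = 1; for a singular zero set such as the union of the two coordinate axes, each axis has one normal direction, giving λ = 1/2. A Monte Carlo fit of the volume exponent recovers this gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
w = rng.uniform(-1.0, 1.0, size=(n, 2))

# K plays the role of the KL divergence to the truth as a function of the parameters.
K_regular = w[:, 0] ** 2 + w[:, 1] ** 2   # truth is the single point (0, 0): lambda = d/2 = 1
K_singular = (w[:, 0] * w[:, 1]) ** 2     # truth is the union of the two axes: lambda = 1/2

def volume_exponent(K, eps_values):
    """Fit lambda in V(eps) = P(K < eps) ~ C * eps^lambda by log-log regression."""
    vols = np.array([np.mean(K < e) for e in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(vols), 1)
    return slope

eps = np.logspace(-4, -2, 10)
lam_reg = volume_exponent(K_regular, eps)
lam_sing = volume_exponent(K_singular, eps)
print(f"regular model:  fitted exponent ~ {lam_reg:.2f} (exact RLCT = 1)")
print(f"singular model: fitted exponent ~ {lam_sing:.2f} (exact RLCT = 1/2)")
```

For the singular case the fitted slope typically lands somewhat below 1/2, because the true volume carries a log(1/ε) correction (the multiplicity m in Watanabe's theory) that a pure power-law fit absorbs into the exponent.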
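The gap between the Bayes predictive distribution and plug-in estimators can already be seen in a two-parameter toy model (a hypothetical sketch under our own choices of model, prior and grid, not one of the paper's experiments). With mean function f(x; a, b) = a·tanh(bx) and true mean zero, the true parameter set is the singular union of the axes {a = 0} ∪ {b = 0}. On a grid, the exact posterior average is tractable, so the Bayes prediction can be compared directly with the MAP plug-in prediction on held-out data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy singular model: mean f(x; a, b) = a * tanh(b * x), unit noise variance.
# The true mean is zero, realised on the whole set {a = 0} union {b = 0}.
n_train, n_test = 20, 10_000
x_train = rng.normal(size=n_train)
y_train = rng.normal(size=n_train)   # data drawn from the true (zero-mean) model
x_test = rng.normal(size=n_test)
y_test = rng.normal(size=n_test)

# Exact grid posterior over (a, b) with a standard normal prior.
grid = np.linspace(-3.0, 3.0, 201)
A, B = np.meshgrid(grid, grid, indexing="ij")
preds = A[..., None] * np.tanh(B[..., None] * x_train)        # (grid, grid, n_train)
logpost = -0.5 * np.sum((y_train - preds) ** 2, axis=-1) - 0.5 * (A ** 2 + B ** 2)
post = np.exp(logpost - logpost.max())
post /= post.sum()

# MAP prediction: plug in the single highest-posterior grid point.
i, j = np.unravel_index(np.argmax(logpost), logpost.shape)
map_pred = grid[i] * np.tanh(grid[j] * x_test)

# Bayes prediction: posterior mean of f, computed as
# sum_j (sum_i post[i, j] * a_i) * tanh(b_j * x).
a_weights = (grid[:, None] * post).sum(axis=0)                # one weight per b grid point
bayes_pred = a_weights @ np.tanh(np.outer(grid, x_test))

map_err = np.mean((y_test - map_pred) ** 2)
bayes_err = np.mean((y_test - bayes_pred) ** 2)
print(f"MAP   test MSE: {map_err:.4f}")
print(f"Bayes test MSE: {bayes_err:.4f}")
```

Because the posterior mass spreads along both singular axes rather than concentrating at a point, the posterior-averaged prediction is shrunk towards the true zero function in a way no single plug-in parameter can imitate; this is the small-scale analogue of the predictive-distribution advantage the experiments in Section 5 demonstrate for networks.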



The code to reproduce all experiments in the paper will be released on GitHub. For now, see the supplementary zip file.

