DEEP LEARNING IS SINGULAR, AND THAT'S GOOD

Abstract

In singular models, the optimal set of parameters forms an analytic set with singularities, and classical statistical inference cannot be applied to such models. This is significant for deep learning: neural networks are singular, so "dividing" by the determinant of the Hessian or employing the Laplace approximation is not appropriate. Despite its potential for addressing fundamental issues in deep learning, singular learning theory appears to have made few inroads into the developing canon of deep learning theory. Via a mix of theory and experiment, we present an invitation to singular learning theory as a vehicle for understanding deep learning, and suggest important future work to make singular learning theory directly applicable to how deep learning is performed in practice.

1. INTRODUCTION

It has been understood for close to twenty years that neural networks are singular statistical models (Amari et al., 2003; Watanabe, 2007). This means, in particular, that the set of network weights equivalent to the true model under the Kullback-Leibler divergence forms a real analytic variety which fails to be an analytic manifold due to the presence of singularities. It has been shown by Sumio Watanabe that the geometry of these singularities controls quantities of interest in statistical learning theory, e.g., the generalisation error. Singular learning theory (Watanabe, 2009) is the study of singular models and requires very different tools from the study of regular statistical models. The breadth of knowledge demanded by singular learning theory (Bayesian statistics, empirical processes and algebraic geometry) is rewarded with profound and surprising results which reveal that singular models differ from regular models in practically important ways. To illustrate the relevance of singular learning theory to deep learning, each section of this paper illustrates a key takeaway idea. The real log canonical threshold (RLCT) is the correct way to count the effective number of parameters in a deep neural network (DNN) (Section 4). To every (model, truth, prior) triplet is associated a birational invariant known as the real log canonical threshold. The RLCT can be understood in simple cases as half the number of normal directions to the set of true parameters. We will explain why this matters more than the curvature of those directions (as measured, for example, by eigenvalues of the Hessian), laying bare some of the confusion over "flat" minima. For singular models, the Bayes predictive distribution is superior to MAP and MLE (Section 5).
In regular statistical models, the 1) Bayes predictive distribution, 2) maximum a posteriori (MAP) estimator, and 3) maximum likelihood estimator (MLE) have asymptotically equivalent generalisation error (as measured by the Kullback-Leibler divergence). This is not so in singular models. We illustrate in our experiments that even "being Bayesian" in just the final layers improves generalisation over MAP. Our experiments further confirm that the Laplace approximation of the predictive distribution (Smith & Le, 2017; Zhang et al., 2018) is not only theoretically inappropriate but performs poorly. Simpler true distribution means lower RLCT (Section 6). In singular models the RLCT depends on the (model, truth, prior) triplet, whereas in regular models it depends only on the (model, prior) pair. The RLCT increases as the complexity of the true distribution relative to the supposed model increases. We verify this experimentally with a simple family of ReLU and SiLU networks.

2. RELATED WORK

In classical learning theory, generalisation is explained by measures of capacity such as the $\ell_2$ norm, Rademacher complexity, and VC dimension (Bousquet et al., 2003). It has become clear, however, that these measures cannot capture the empirical success of DNNs (Zhang et al., 2017). For instance, over-parameterised neural networks can easily fit random labels (Zhang et al., 2017; Du et al., 2018; Allen-Zhu et al., 2019b), indicating that complexity measures such as Rademacher complexity are very large. There is also a slate of work on generalisation bounds in deep learning. Uniform convergence bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur & Li, 2019; Arora et al., 2018) usually cannot provide non-vacuous bounds. Data-dependent bounds (Brutzkus et al., 2018; Li & Liang, 2018; Allen-Zhu et al., 2019a) consider the "classifiability" of the data distribution in generalisation analysis of neural networks. Algorithm-dependent bounds (Daniely, 2017; Arora et al., 2019; Yehudai & Shamir, 2019; Cao & Gu, 2019) consider the relation of Gaussian initialisation and the training dynamics of (stochastic) gradient descent to kernel methods (Jacot et al., 2018). In contrast to many of the aforementioned works, we are interested in estimating the conditional distribution $q(y|x)$. Specifically, we measure the generalisation error of an estimate $\hat{q}_n(y|x)$ in terms of the Kullback-Leibler divergence between $q$ and $\hat{q}_n$, see (8). The next section gives a crash course on singular learning theory. The rest of the paper illustrates the key ideas listed in the introduction. Since we cover much ground in this short note, we review other relevant work along the way, in particular the literature on "flatness", the Laplace approximation in deep learning, etc.

3. SINGULAR LEARNING THEORY

To understand why classical measures of capacity fail to say anything meaningful about DNNs, it is important to distinguish between two different types of statistical models. Recall we are interested in estimating the true (and unknown) conditional distribution $q(y|x)$ with a class of models $\{p(y|x,w) : w \in W\}$ where $W \subset \mathbb{R}^d$ is the parameter space. We say the model is identifiable if the mapping $w \mapsto p(y|x,w)$ is one-to-one. Let $q(x)$ be the distribution of $x$. The Fisher information matrix associated with the model $\{p(y|x,w) : w \in W\}$ is the matrix-valued function on $W$ defined by
$$I(w)_{ij} = \int\!\!\int \frac{\partial}{\partial w_i}[\log p(y|x,w)] \, \frac{\partial}{\partial w_j}[\log p(y|x,w)] \, q(y|x) q(x) \, dx \, dy,$$
if this integral is finite. Following the conventions in Watanabe (2009), we have the following dichotomy of statistical models. A statistical model $p(y|x,w)$ is called regular if it is 1) identifiable and 2) has positive-definite Fisher information matrix. A statistical model is called strictly singular if it is not regular. Let $\varphi(w)$ be a prior on the model parameters $w$. To every (model, truth, prior) triplet we can associate the zeta function
$$\zeta(z) = \int K(w)^z \varphi(w) \, dw, \quad z \in \mathbb{C},$$
where $K(w)$ is the Kullback-Leibler (KL) divergence between the model $p(y|x,w)$ and the true distribution $q(y|x)$:
$$K(w) := \int\!\!\int q(y|x) \log \frac{q(y|x)}{p(y|x,w)} \, q(x) \, dx \, dy. \quad (1)$$
For a (model, truth, prior) triplet $(p(y|x,w), q(y|x), \varphi)$, let $-\lambda$ be the maximum pole of the corresponding zeta function. We call $\lambda$ the real log canonical threshold (RLCT) (Watanabe, 2009) of the (model, truth, prior) triplet. The RLCT is the central quantity of singular learning theory. By Watanabe (2009, Theorem 6.4) the RLCT is equal to $d/2$ in regular statistical models and bounded above by $d/2$ in strictly singular models if realisability holds: letting $W_0 = \{w \in W : p(y|x,w) = q(y|x)\}$ be the set of true parameters, we say $q(y|x)$ is realisable by the model class if $W_0$ is non-empty. The condition of realisability is critical to standard results in singular learning theory.
Modifications to the theory are needed in the case that $q(y|x)$ is not realisable; see the condition called relatively finite variance in Watanabe (2018). Neural networks in singular learning theory. Let $W \subseteq \mathbb{R}^d$ be the space of weights of a neural network of some fixed architecture, and let $f : \mathbb{R}^N \times W \to \mathbb{R}^M$ be the associated function. We shall focus on the regression task and study the model
$$p(y|x,w) = \frac{1}{(2\pi)^{M/2}} \exp\left(-\tfrac{1}{2}\|y - f(x,w)\|^2\right) \quad (2)$$
but singular learning theory can also apply to classification, for instance. It is routine to check (see Appendix A.1) that for feedforward ReLU networks not only is the model strictly singular, but the matrix $I(w)$ is degenerate for all nontrivial weight vectors and the Hessian of $K(w)$ is degenerate at every point of $W_0$. The RLCT plays an important role in model selection. One of the most accessible results in singular learning theory is the work related to the widely applicable Bayesian information criterion (WBIC) (Watanabe, 2013), which we briefly review here for completeness. Let $D_n = \{(x_i, y_i)\}_{i=1}^n$ be a dataset of input-output pairs. Let $L_n(w)$ be the negative log likelihood
$$L_n(w) = -\frac{1}{n} \sum_{i=1}^n \log p(y_i|x_i, w) \quad (3)$$
and let $p(D_n|w) = \exp(-n L_n(w))$. The marginal likelihood of a model $\{p(y|x,w) : w \in W\}$ is given by $p(D_n) = \int_W p(D_n|w) \varphi(w) \, dw$ and can be loosely interpreted as the evidence for the model. Between two models, we should prefer the one with higher model evidence. However, since the marginal likelihood is an intractable integral over the parameter space of the model, one needs to consider some approximation. The well-known Bayesian Information Criterion (BIC) derives from an asymptotic approximation of $-\log p(D_n)$ using the Laplace approximation, leading to $\mathrm{BIC} = n L_n(w_{\mathrm{MLE}}) + \frac{d}{2} \log n$. Since we want the marginal likelihood of the data for a given model to be high, one should almost never adopt a DNN according to the BIC, since in such models $d$ may be very large.
However, this argument contains a serious mathematical error: the Laplace approximation used to derive the BIC only applies to regular statistical models, and DNNs are not regular. The correct criterion for both regular and strictly singular models was shown in Watanabe (2013) to be $n L_n(w_0) + \lambda \log n$ where $w_0 \in W_0$ and $\lambda$ is the RLCT. Since DNNs are highly singular, $\lambda$ may be much smaller than $d/2$ (Section 6), so it is possible for DNNs to have high marginal likelihood, consistent with their empirical success.
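The degeneracy behind this argument is easy to see concretely. The following minimal sketch (our own illustration in plain Python, with a hypothetical one-hidden-unit ReLU "network") checks the positive scale invariance discussed in Appendix A.1: scaling the incoming weight and bias by $\alpha$ and the outgoing weight by $1/\alpha$ leaves the network function, and hence $K(w)$, unchanged, so the model is non-identifiable and the Fisher information is degenerate along this direction.

```python
import random

def relu(z):
    return max(0.0, z)

def f(x, w1, b1, w2, c):
    # toy one-hidden-unit "network": f(x) = w2 * ReLU(w1*x + b1) + c
    return w2 * relu(w1 * x + b1) + c

random.seed(0)
alpha = 3.7  # any alpha > 0 gives the same function
pts = [random.uniform(-5.0, 5.0) for _ in range(1000)]

# original parameters vs. parameters rescaled along the scaling symmetry:
# (w1, b1) -> (alpha*w1, alpha*b1) and w2 -> w2/alpha
orig = [f(x, 1.2, -0.4, 0.8, 0.1) for x in pts]
scaled = [f(x, alpha * 1.2, alpha * -0.4, 0.8 / alpha, 0.1) for x in pts]

max_diff = max(abs(a - b) for a, b in zip(orig, scaled))
print(max_diff)  # zero up to floating point: two distinct parameters, one model
```

Since a one-dimensional curve of parameters realises the same distribution, the set of true parameters is not a point and "dividing" by the determinant of the Hessian is meaningless there.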

4. VOLUME DIMENSION, EFFECTIVE DEGREES OF FREEDOM, AND FLATNESS

Volume codimension. The easiest way to understand the RLCT is as a volume codimension (Watanabe, 2009, Theorem 7.1). Suppose that $W \subseteq \mathbb{R}^d$ and $W_0$ is nonempty, i.e., the true distribution is realisable. We consider a special case in which the KL divergence in a neighborhood of every point $v_0 \in W_0$ has an expression in local coordinates of the form
$$K(w) = \sum_{i=1}^{d'} c_i w_i^2, \quad (4)$$
where the coefficients $c_1, \ldots, c_{d'}$ are positive and $d' \le d$. In this minimally singular case $d'$ counts the normal directions to $W_0$ at $v_0$ and so serves as the effective number of parameters near $v_0$. This number of effective parameters can be computed by an integral. Consider the volume of the set of almost true parameters
$$V(t, v_0) = \int_{K(w) < t} \varphi(w) \, dw$$
where the integral is restricted to a small closed ball around $v_0$. As long as the prior $\varphi(w)$ is non-zero on $W_0$ it does not affect the relevant features of the volume, so we may assume $\varphi$ is constant on the region of integration in the first $d'$ directions and normal in the remaining $d - d'$ directions; up to a constant depending only on $d$ we then have
$$V(t, v_0) \propto \frac{t^{d'/2}}{\sqrt{c_1 \cdots c_{d'}}} \quad (5)$$
and we can extract the exponent of $t$ in this volume via the limit
$$\frac{d'}{2} = \lim_{t \to 0} \frac{\log\left[V(at, v_0) / V(t, v_0)\right]}{\log(a)} \quad (6)$$
for any $a > 0$, $a \neq 1$. We refer to the right hand side of (6) as the volume codimension at $v_0$. The function $K(w)$ has the special form (4) locally with $d' = d$ if the statistical model is regular (and realisable) and with $d' < d$ in some singular models such as reduced rank regression (Appendix A.2). While such a local form does not exist for a singular model in general (in particular for neural networks), nonetheless under natural conditions (Watanabe, 2009, Theorem 7.1) we have $V(t, v_0) = c t^{\lambda} + o(t^{\lambda})$ where $c$ is a constant. We assume that the point RLCT $\lambda$ at $v_0$ (Watanabe, 2009, Definition 2.7) is less than or equal to the RLCT at every point in a sufficiently small neighborhood of $v_0$, so that the multiplicity is $m = 1$; see Section 7.6 of (Watanabe, 2009) for relevant discussion. It follows that the limit on the right hand side of (6) exists and is equal to $\lambda$.
In particular $\lambda = d'/2$ in the minimally singular case. Note that for strictly singular models such as DNNs, $2\lambda$ may not be an integer. This may be disconcerting, but the connection between the RLCT, generalisation error and volume dimension strongly suggests that $2\lambda$ is nonetheless the only geometrically meaningful "count" of the effective number of parameters near $v_0$. RLCT and likelihood vs temperature. Again working with the model in (2), consider the expectation over the posterior at temperature $T$ as defined in (17) of the negative log likelihood (3),
$$E(T) = \mathbb{E}^{1/T}_w[n L_n(w)] = \mathbb{E}^{1/T}_w\left[\frac{1}{2} \sum_{i=1}^n \|y_i - f(x_i, w)\|^2\right] + \frac{nM}{2} \log(2\pi).$$
Note that when $n$ is large, $L_n(v_0) \approx \frac{M}{2} \log(2\pi)$ for any $v_0 \in W_0$, so for $T \approx 0$ the posterior concentrates around the set $W_0$ of true parameters and $E(T) \approx \frac{nM}{2} \log(2\pi)$. Consider the increase $\Delta E = E(T + \Delta T) - E(T)$ corresponding to an increase in temperature $\Delta T$. It can be shown that $\Delta E \approx \lambda \Delta T$, where the reader should see (Watanabe, 2013, Corollary 3) for a precise statement. As the temperature increases, samples taken from the tempered posterior are more distant from $W_0$ and the error $E$ will increase. If $\lambda$ is smaller, then for a given increase in temperature the quantity $E$ increases less: this is one way to understand intuitively why a model with smaller RLCT generalises better from the dataset $D_n$ to the true distribution. Flatness. It is folklore in the deep learning community that flatness of minima is related to generalisation (Hinton & Van Camp, 1993; Hochreiter & Schmidhuber, 1997) and this claim has been revisited in recent years (Chaudhari et al., 2017; Smith & Le, 2017; Jastrzebski et al., 2017; Zhang et al., 2018). In regular models this can be justified using the lower order terms of the asymptotic expansion of the Bayes free energy (Balasubramanian, 1997, §3.1), but the argument breaks down in strictly singular models, since for example the Laplace approximation of Zhang et al. (2018) is invalid.
The point can be understood via an analysis of the version of the idea in (Hochreiter & Schmidhuber, 1997). Their measure of entropy compares the volume of the set of parameters with tolerable error $t_0$ (our almost true parameters) to a standard volume:
$$-\log \frac{V(t_0, v_0)}{t_0^{d/2}} = \frac{d - d'}{2} \log(t_0) + \frac{1}{2} \sum_{i=1}^{d'} \log c_i. \quad (7)$$
Hence in the case $d' = d$ the quantity $-\frac{1}{2} \sum_i \log(c_i)$ is a measure of the entropy of the set of true parameters near $v_0$, a point made for example in Zhang et al. (2018). However, when $d' < d$ this conception of entropy is inappropriate because of the $d - d'$ directions in which $K(w)$ is flat near $v_0$, which introduce the $t_0$ dependence in (7).
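The volume-codimension limit (6) can be checked numerically. The sketch below (our own illustration, not part of the paper's experiments) estimates $V(t)$ by Monte Carlo for two toy KL functions on $[-1,1]^2$ with a uniform prior: a regular one, $K = w_1^2 + w_2^2$, for which $\lambda = d/2 = 1$, and a singular one, $K = w_1^2$, which is flat in one direction so that $\lambda = 1/2$.

```python
import math
import random

random.seed(1)

def volume(K, t, n=200_000):
    """Monte Carlo estimate of V(t) = vol{ w in [-1,1]^2 : K(w) < t }."""
    hits = 0
    for _ in range(n):
        w = (random.uniform(-1, 1), random.uniform(-1, 1))
        if K(w) < t:
            hits += 1
    return 4.0 * hits / n  # the box [-1,1]^2 has volume 4

def lam_hat(K, t=0.01, a=4.0):
    # volume codimension: log( V(a*t) / V(t) ) / log(a), cf. equation (6)
    return math.log(volume(K, a * t) / volume(K, t)) / math.log(a)

K_regular  = lambda w: w[0] ** 2 + w[1] ** 2   # W0 = {0}: lambda = 1
K_singular = lambda w: w[0] ** 2               # W0 = a line: lambda = 1/2

lam_reg = lam_hat(K_regular)
lam_sing = lam_hat(K_singular)
print(lam_reg, lam_sing)
```

The flat direction in the singular case halves the volume codimension even though the curvature $c_1$ in the remaining direction is identical, which is the sense in which the RLCT, not Hessian eigenvalues, is the meaningful parameter count.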

5. GENERALISATION

The generalisation puzzle (Poggio et al., 2018) is one of the central mysteries of deep learning. Theoretical investigation into the matter is an active area of research (Neyshabur et al., 2017). Many of the recent proposals of capacity measures for neural networks are based on the eigenspectrum of the (degenerate) Hessian, e.g., Thomas et al. (2019); Maddox et al. (2020). But this is not appropriate for singular models, and hence not for DNNs. Since we are interested in learning the distribution, our notion of generalisation is slightly different, being measured by the KL divergence. Precise statements regarding the generalisation behavior in singular models can be made using singular learning theory. Let the network weights be denoted $\theta$ rather than $w$ for reasons that will become clear. Recall that in the Bayesian paradigm, prediction proceeds via the so-called Bayes predictive distribution,
$$p(y|x, D_n) = \int p(y|x, \theta) \, p(\theta|D_n) \, d\theta.$$
More commonly encountered in deep learning practice are the MAP and MLE point estimators. While in a regular statistical model the three estimators 1) Bayes predictive distribution, 2) MAP, and 3) MLE have the same leading term in their asymptotic generalisation behavior, the same is not true in singular models. More precisely, let $\hat{q}_n(y|x)$ be some estimate of the true unknown conditional density $q(y|x)$ based on the dataset $D_n$. The generalisation error of the predictor $\hat{q}_n(y|x)$ is
$$G(n) := \mathrm{KL}(q(y|x) \,\|\, \hat{q}_n(y|x)) = \int\!\!\int q(y|x) \log \frac{q(y|x)}{\hat{q}_n(y|x)} \, q(x) \, dy \, dx. \quad (8)$$
To account for sampling variability, we will work with the average generalisation error, $\mathbb{E}_n G(n)$, where $\mathbb{E}_n$ denotes expectation over the dataset $D_n$. By Watanabe (2009, Theorem 1.2 and Theorem 7.2), we have
$$\mathbb{E}_n G(n) = \lambda/n + o(1/n) \quad \text{if } \hat{q}_n \text{ is the Bayes predictive distribution}, \quad (9)$$
where $\lambda$ is the RLCT corresponding to the triplet $(p(y|x,\theta), q(y|x), \varphi(\theta))$. In contrast, we should note that Zhang et al.
(2018) and Smith & Le (2017) rely on the Laplace approximation to explain the generalisation of the Bayes predictive distribution, though both works acknowledge the Laplace approximation is inappropriate. For completeness, a quick sketch of the derivation of (9) is provided in Appendix A.4. Now by (Watanabe, 2009, Theorem 6.4) we have
$$\mathbb{E}_n G(n) = C/n + o(1/n) \quad \text{if } \hat{q}_n \text{ is the MAP or MLE}, \quad (10)$$
where $C$ (different for MAP and MLE) is the maximum of some Gaussian process. For regular models, the MAP, MLE, and the Bayes predictive distribution have the same leading term for $\mathbb{E}_n G(n)$ since $\lambda = C = d/2$. However, in singular models $C$ is generally greater than $\lambda$, meaning we should prefer the Bayes predictive distribution for singular models. That the RLCT has such a simple relationship to the Bayesian generalisation error is remarkable. On the other hand, the practical implications of (9) are limited since the Bayes predictive distribution is intractable. While approximations to the Bayes predictive distribution, say via variational inference, might inherit a similar relationship between generalisation and the (variational) RLCT, serious theoretical developments will be required to rigorously establish this. The challenge comes from the fact that for approximate Bayesian predictive distributions, the free energy and generalisation error may have different learning coefficients $\lambda$. This was well documented in the case of a neural network with one hidden layer (Nakajima & Watanabe, 2007). We set out to investigate whether certain very simple approximations of the Bayes predictive distribution can already demonstrate superiority over point estimators. Suppose the input-target relationship is modeled as in (2) but we write $\theta$ instead of $w$. We set $q(x) = N(0, I_3)$. For now consider the realisable case, $q(y|x) = p(y|x, \theta_0)$ where $\theta_0$ is drawn randomly according to the default initialisation in PyTorch when model (2) is instantiated.
We calculate $\mathbb{E}_n G(n)$ using multiple datasets $D_n$ and a large testing set; see Appendix A.5 for more details. Since $f$ is a hierarchical model, we write it as $f_\theta(\cdot) = h(g(\cdot; v); w)$ with the dimension of $w$ being relatively small. Let $\theta_{\mathrm{MAP}} = (v_{\mathrm{MAP}}, w_{\mathrm{MAP}})$ be the MAP estimate for $\theta$ using batch gradient descent. The idea of our simple approximate Bayesian scheme is to freeze the network weights at the MAP estimate for early layers and perform approximate Bayesian inference for the final layers, e.g., freeze the parameters of $g$ at $v_{\mathrm{MAP}}$ and perform MCMC over $w$. Throughout the experiments, $g : \mathbb{R}^3 \to \mathbb{R}^3$ is a feedforward ReLU block with each hidden layer having 5 hidden units, and $h : \mathbb{R}^3 \to \mathbb{R}^3$ is either $BAx$ or $B\,\mathrm{ReLU}(Ax)$ where $A \in \mathbb{R}^{3 \times r}$, $B \in \mathbb{R}^{r \times 3}$. We set $r = 3$. We shall consider 1 or 5 hidden layers for $g$. To approximate the Bayes predictive distribution, we perform either the Laplace approximation or the NUTS variant of HMC (Hoffman & Gelman, 2014) in the last two layers, i.e., performing inference over $A, B$ in $h(g(\cdot; v_{\mathrm{MAP}}); A, B)$. Note that MCMC is operating in a space of 18 dimensions.

Figure 1: Realisable case with full batch gradient descent for MAP. Average generalisation errors $\mathbb{E}_n G(n)$ are displayed for various approximations of the Bayes predictive distribution. The results of the Laplace approximations are reported in the Appendix and not displayed here because they are higher than other approximation schemes by at least an order of magnitude. Each subplot shows a different combination of hidden layers in $g$ (1 or 5) and activation function in $h$ (ReLU or identity). Note that the y-axis is not shared.

From the outset, we expect the Laplace approximation over $w = (A, B)$ to be invalid since the model is singular. We do, however, expect the last-layer-only Laplace approximation over $B$ to be sound. Next, we expect the MCMC approximation in either the last layer or last two layers to be superior to the Laplace approximations and to the MAP.
We further expect the last-two-layers MCMC to have better generalisation than the last-layer-only MCMC since the former is closer to the Bayes predictive distribution. In summary, we anticipate the following performance order for these five approximate Bayesian schemes (from worst to best): last-two-layers Laplace, last-layer-only Laplace, MAP, last-layer-only MCMC, last-two-layers MCMC. The results displayed in Figure 1 are in line with our stated expectations above, except for the surprise that the last-layer-only MCMC approximation is often superior to the last-two-layers MCMC approximation. This may arise from the fact that MCMC finds the singular setting in the last two layers more challenging. In Figure 1, we clarify the effect of the network architecture by varying the following factors: 1) either 1 or 5 layers in $g$, and 2) ReLU or identity activation in $h$. Table 1 is a companion to Figure 1 and tabulates, for each approximation scheme, the slope of $1/n$ versus $\mathbb{E}_n G(n)$, also known as the learning coefficient. The $R^2$ corresponding to the linear fit is also provided. In Appendix A.5, we also show the corresponding results when 1) the data-generating mechanism and the assumed model do not satisfy the condition of realisability and/or 2) the MAP estimate is obtained via minibatch stochastic gradient descent instead of batch gradient descent.
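The freeze-and-sample scheme can be illustrated in miniature. The sketch below is not the NUTS setup used in our experiments: it stands in for the frozen network body with a fixed feature map $g$, keeps a single Bayesian last-layer weight, and uses random-walk Metropolis instead of NUTS; all names and numerical values are illustrative.

```python
import math
import random

random.seed(2)

# Frozen "body": a fixed feature map g; only the last-layer weight w is Bayesian.
g = lambda x: math.tanh(x)

# Synthetic data from a true last-layer weight w0 = 1.5 with unit Gaussian noise.
w0 = 1.5
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [w0 * g(x) + random.gauss(0, 1) for x in xs]

def neg_log_post(w):
    # Gaussian likelihood plus a broad N(0, 10^2) prior on w
    nll = sum(0.5 * (y - w * g(x)) ** 2 for x, y in zip(xs, ys))
    return nll + 0.5 * (w / 10.0) ** 2

# Random-walk Metropolis over the single last-layer weight.
samples, w = [], 0.0
cur = neg_log_post(w)
for step in range(5000):
    prop = w + random.gauss(0, 0.2)
    cand = neg_log_post(prop)
    if math.log(random.random()) < cur - cand:
        w, cur = prop, cand
    if step >= 1000:        # discard burn-in
        samples.append(w)

post_mean = sum(samples) / len(samples)
# Predictive mean at a new input: average the network output over posterior samples.
x_new = 1.0
pred_mean = sum(s * g(x_new) for s in samples) / len(samples)
print(post_mean, pred_mean)
```

Averaging the network over posterior samples, rather than plugging in a single point estimate, is exactly the step that distinguishes the (approximate) Bayes predictive distribution from MAP in this scheme.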

6. SIMPLE FUNCTIONS AND COMPLEX SINGULARITIES

In singular models the RLCT may vary with the true distribution (in contrast to regular models) and in this section we examine this phenomenon in a simple example. As the true distribution becomes more complicated relative to the supposed model, the singularities of the analytic variety of true parameters should become simpler and hence the RLCT should increase (Watanabe, 2009, §7.6). Our experiments are inspired by (Watanabe, 2009, §7.2) where $\tanh(x)$ networks are considered and the true distribution (associated to the zero network) is held fixed while the number of hidden nodes is increased. Consider the model $p(y|x,w)$ in (2) where
$$f(x, w) = c + \sum_{i=1}^{H} q_i \, \mathrm{ReLU}(\langle w_i, x \rangle + b_i)$$
is a two-layer ReLU network with weight vector $w = (\{w_i\}_{i=1}^H, \{b_i\}_{i=1}^H, \{q_i\}_{i=1}^H, c) \in \mathbb{R}^{4H+1}$ and $w_i \in \mathbb{R}^2$, $b_i \in \mathbb{R}$, $q_i \in \mathbb{R}$ for $1 \le i \le H$. We let $W$ be some compact neighborhood of the origin. Given an integer $3 \le m \le H$ we define a network $s_m \in W$ and $q_m(y|x) := p(y|x, s_m)$ as follows. Let $g \in SO(2)$ stand for rotation by $2\pi/m$ and set $w_1 = \sqrt{g}\,(1, 0)^T$. The components of $s_m$ are the vectors $w_i = g^{i-1} w_1$ for $1 \le i \le m$ and $w_i = 0$ for $i > m$; $b_i = -\tfrac{1}{3}$ and $q_i = 1$ for $1 \le i \le m$ and $b_i = q_i = 0$ for $i > m$; and finally $c = 0$. The factor of $\tfrac{1}{3}$ ensures the relevant parts of the decision boundaries lie within $X = [-1, 1]^2$. We let $q(x)$ be the uniform distribution on $X$ and define $q_m(x, y) = q_m(y|x) q(x)$. The functions $f(x, s_m)$ are graphed in Figure 2. It is intuitively clear that the complexity of these true distributions increases with $m$. We let $\varphi$ be a normal distribution $N(0, 50^2)$ and estimate the RLCTs of the triples $(p, q_m, \varphi)$. We conducted the experiments with $H = 5$, $n = 1000$. For each $m \in \{3, 4, 5\}$, Table 2 shows the estimated RLCT. Algorithm 1 in Appendix A.3 details the estimation procedure, which we base on (Watanabe, 2013, Theorem 4).
As predicted, the RLCT increases with $m$, verifying that in this case the simpler true distributions give rise to more complex singularities. Note that the dimension of $W$ is $d = 21$, so if the model were regular the RLCT would be $10.5$. It can be shown that when $m = H$ the set of true parameters $W_0 \subseteq W$ is a regular submanifold of dimension $m$. If such a model were minimally singular its RLCT would be $\frac{1}{2}((4m + 1) - m) = \frac{1}{2}(3m + 1)$. In the case $m = 5$ we observe an RLCT more than an order of magnitude less than the value $8$ predicted by this formula. So the function $K$ does not behave like a quadratic form near $W_0$. Strictly speaking it is incorrect to speak of the RLCT of a ReLU network because the function $K(w)$ is not necessarily analytic (Example A.4). However, we observe empirically that the predicted linear relationship between $\mathbb{E}^{\beta}_w[n L_n(w)]$ and $1/\beta$ holds in our small ReLU networks (see the $R^2$ values in Table 2) and that the RLCT estimates are close to those for the two-layer SiLU network (Hendrycks & Gimpel, 2016), which is analytic (the SiLU or sigmoid weighted linear unit is $\sigma(x) = x(1 + e^{-\tau x})^{-1}$, which approaches the ReLU as $\tau \to \infty$; we use $\tau = 100.0$ in our experiments). The competitive performance of SiLU on standard benchmarks (Ramachandran et al., 2017) suggests that the non-analyticity of ReLU is probably not fundamental.
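The true networks $s_m$ are easy to construct explicitly. The sketch below builds $f(x, s_m)$ with the starting direction taken as $(1,0)^T$ for simplicity (a global rotation of all the $w_i$ does not affect the symmetry argument) and checks that $f$ is invariant under rotation of the input by $2\pi/m$, since this rotation merely permutes the $m$ active hidden units.

```python
import math

def f_sm(x, m):
    """Two-layer ReLU network f(x, s_m): m unit weight vectors rotated by
    2*pi/m, biases -1/3, second-layer weights 1, output bias c = 0 (the
    remaining H - m hidden units have all parameters zero and contribute
    nothing)."""
    total = 0.0
    for i in range(m):
        theta = 2 * math.pi * i / m
        w = (math.cos(theta), math.sin(theta))
        total += max(0.0, w[0] * x[0] + w[1] * x[1] - 1.0 / 3.0)
    return total

def rot(x, phi):
    c, s = math.cos(phi), math.sin(phi)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

m = 5
x = (0.8, -0.3)
# rotating the input by 2*pi/m permutes the hidden units, so f is invariant
diff = abs(f_sm(rot(x, 2 * math.pi / m), m) - f_sm(x, m))
print(f_sm(x, m), diff)
```

This $m$-fold rotational symmetry is what makes the complexity of $q_m$ grow with $m$: more active units produce more kinks in the decision boundaries inside $X = [-1,1]^2$.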

7. FUTURE DIRECTIONS

Deep neural networks are singular models, and that's good: the presence of singularities is necessary for neural networks with large numbers of parameters to have low generalisation error. Singular learning theory clarifies how classical tools such as the Laplace approximation are not just inappropriate in deep learning on narrow technical grounds: the failure of this approximation and the existence of interesting phenomena like the generalisation puzzle have a common cause, namely the existence of degenerate critical points of the KL function $K(w)$. Singular learning theory is a promising foundation for a mathematical theory of deep learning. However, much remains to be done. The important open problems include: SGD vs the posterior. A number of works (Şimşekli, 2017; Mandt et al., 2017; Smith et al., 2018) suggest that mini-batch SGD may be governed by SDEs that have the posterior distribution as their stationary distribution, and this may go towards understanding why SGD works so well for DNNs. RLCT estimation for large networks. Theoretical RLCTs have been cataloged for small neural networks, albeit at significant effort (Aoyagi & Watanabe, 2005b;a). We believe RLCT estimation in these small networks should be a standard benchmark for any method that purports to approximate the Bayesian posterior of a neural network. No theoretical RLCTs or estimation procedures are known for modern DNNs. Although MCMC provides the gold standard, it does not scale to large networks. The intractability of RLCT estimation for DNNs is not necessarily an obstacle to reaping the insights offered by singular learning theory. For instance, used in the context of model selection, the exact value of the RLCT is not as important as model selection consistency. We also demonstrated the utility of singular learning results such as (9) and (10), which can be exploited even without knowledge of the exact value of the RLCT. Real-world distributions are unrealisable.
The existence of power laws in neural language model training (Hestness et al., 2017; Kaplan et al., 2020) is one of the most remarkable experimental results in deep learning. These power laws may be a sign of interesting new phenomena in singular learning theory when the true distribution is unrealisable.

A APPENDIX

A.1 NEURAL NETWORKS ARE STRICTLY SINGULAR

Many-layered neural networks are strictly singular (Watanabe, 2009, §7.2). The degeneracy of the Hessian in deep learning has certainly been acknowledged, e.g., in Sagun et al. (2016), which recognises that the eigenspectrum is concentrated around zero, and in Pennington & Worah (2018), which deliberately studies the Fisher information matrix of a single-hidden-layer, rather than multilayer, neural network. We first explain how to think about a neural network in the context of singular learning theory. A feedforward network of depth $c$ parametrises a function $f : \mathbb{R}^N \to \mathbb{R}^M$ of the form
$$f = A_c \circ \sigma_{c-1} \circ A_{c-1} \circ \cdots \circ \sigma_1 \circ A_1$$
where the $A_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$ are affine functions and $\sigma_l : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ applies coordinate-wise some fixed nonlinearity $\sigma : \mathbb{R} \to \mathbb{R}$. Let $W$ be a compact subspace of $\mathbb{R}^d$ containing the origin, where $\mathbb{R}^d$ is the space of sequences of affine functions $(A_l)_{l=1}^c$ with coordinates denoted $w_1, \ldots, w_d$, so that $f$ may be viewed as a function $f : \mathbb{R}^N \times W \to \mathbb{R}^M$. We define $p(y|x,w)$ as in (2). We assume the true distribution is realisable, $q(y|x) = p(y|x, w_0)$, and that a distribution $q(x)$ on $\mathbb{R}^N$ is fixed with respect to which $p(x, y) = p(y|x) q(x)$ and $q(x, y) = q(y|x) q(x)$. Given some prior $\varphi(w)$ on $W$ we may apply singular learning theory to the triplet $(p, q, \varphi)$. By straightforward calculations we obtain
$$K(w) = \frac{1}{2} \int \|f(x, w) - f(x, w_0)\|^2 \, q(x) \, dx \quad (11)$$
$$\frac{\partial^2}{\partial w_i \partial w_j} K(w) = \int \Big\langle \frac{\partial f}{\partial w_i}, \frac{\partial f}{\partial w_j} \Big\rangle \, q(x) \, dx + \int \Big\langle f(x, w) - f(x, w_0), \frac{\partial^2 f}{\partial w_i \partial w_j} \Big\rangle \, q(x) \, dx \quad (12)$$
$$I(w)_{ij} = \frac{1}{2^{(M-3)/2} \pi^{(M-2)/2}} \int \Big\langle \frac{\partial f}{\partial w_i}, \frac{\partial f}{\partial w_j} \Big\rangle \, q(x) \, dx \quad (13)$$
where $\langle -, - \rangle$ is the dot product. We assume $q(x)$ is such that these integrals exist. It will be convenient below to introduce another set of coordinates for $W$. Let $w^l_{jk}$ denote the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, and let $b^l_j$ denote the bias of the $j$th neuron in the $l$th layer. Here $1 \le l \le c$ and the input is layer zero.
Let $u^l_j$ and $a^l_j$ denote the value of the $j$th neuron in the $l$th layer before and after activation, respectively. Let $u^l$ and $a^l$ denote the vectors with values $u^l_j$ and $a^l_j$, respectively. Let $d_l$ denote the number of neurons in the $l$th layer. Then
$$u^l_j = \sum_{k=1}^{d_{l-1}} w^l_{jk} a^{l-1}_k + b^l_j, \quad 1 \le l \le c, \ 1 \le j \le d_l, \qquad a^l_j = \sigma(u^l_j), \quad 1 \le l < c, \ 1 \le j \le d_l,$$
with the convention that $a^0 = x$ is the input and $u^c = y$ is the output. In the case where $\sigma = \mathrm{ReLU}$ the partial derivatives $\frac{\partial}{\partial w_j} f$ do not exist on all of $\mathbb{R}^N$. However, given $w \in W$ we let $D(w)$ denote the complement in $\mathbb{R}^N$ of the union over all hidden nodes of the associated decision boundary, that is,
$$\mathbb{R}^N \setminus D(w) = \bigcup_{1 \le l < c} \bigcup_{1 \le j \le d_l} \{x \in \mathbb{R}^N : u^l_j(x) = 0\}.$$
The partial derivative $\frac{\partial}{\partial w_j} f$ exists on the open subset $\{(x, w) : x \in D(w)\}$ of $\mathbb{R}^N \times W$.

Lemma A.1. Suppose $\sigma = \mathrm{ReLU}$ and there are $c > 1$ layers. For any hidden neuron $1 \le j \le d_l$ in layer $l$ with $1 \le l < c$ there is a differential equation
$$\left( \sum_{k=1}^{d_{l-1}} w^l_{jk} \frac{\partial}{\partial w^l_{jk}} + b^l_j \frac{\partial}{\partial b^l_j} - \sum_{i=1}^{d_{l+1}} w^{l+1}_{ij} \frac{\partial}{\partial w^{l+1}_{ij}} \right) f = 0$$
which holds on $D(w)$ for any fixed $w \in W$.

Proof. Without loss of generality assume $M = 1$, to simplify the notation. Let $e_i \in \mathbb{R}^{d_{l+1}}$ denote a unit vector and let $H(x) = \frac{d}{dx} \mathrm{ReLU}(x)$. Writing $\frac{\partial f}{\partial u^{l+1}}$ for a gradient vector,
$$\frac{\partial f}{\partial w^{l+1}_{ij}} = \Big\langle \frac{\partial f}{\partial u^{l+1}}, \frac{\partial u^{l+1}}{\partial w^{l+1}_{ij}} \Big\rangle = \Big\langle \frac{\partial f}{\partial u^{l+1}}, a^l_j e_i \Big\rangle = \frac{\partial f}{\partial u^{l+1}_i} u^l_j H(u^l_j),$$
$$\frac{\partial f}{\partial w^l_{jk}} = \Big\langle \frac{\partial f}{\partial u^{l+1}}, \frac{\partial u^{l+1}}{\partial w^l_{jk}} \Big\rangle = \Big\langle \frac{\partial f}{\partial u^{l+1}}, \sum_{i=1}^{d_{l+1}} w^{l+1}_{ij} a^{l-1}_k H(u^l_j) e_i \Big\rangle = \sum_{i=1}^{d_{l+1}} \frac{\partial f}{\partial u^{l+1}_i} w^{l+1}_{ij} a^{l-1}_k H(u^l_j),$$
$$\frac{\partial f}{\partial b^l_j} = \Big\langle \frac{\partial f}{\partial u^{l+1}}, \frac{\partial u^{l+1}}{\partial b^l_j} \Big\rangle = \Big\langle \frac{\partial f}{\partial u^{l+1}}, \sum_{i=1}^{d_{l+1}} w^{l+1}_{ij} H(u^l_j) e_i \Big\rangle = \sum_{i=1}^{d_{l+1}} \frac{\partial f}{\partial u^{l+1}_i} w^{l+1}_{ij} H(u^l_j).$$
The claim immediately follows.

Lemma A.2. Suppose $\sigma = \mathrm{ReLU}$, $c > 1$ and that $w \in W$ has at least one weight or bias at a hidden node nonzero. Then the matrix $I(w)$ is degenerate, and if $w \in W_0$ then the Hessian of $K$ at $w$ is also degenerate.

Proof. Let $w \in W$ be given, and choose a hidden node where at least one of the incident weights (or bias) is nonzero.
Then Lemma A.1 gives a nontrivial linear dependence relation $\sum_i \lambda_i \frac{\partial}{\partial w_i} f = 0$ as functions on $D(w)$. The rows of $I(w)$ satisfy the same linear dependence relation. At a true parameter the second summand in (12) vanishes, so by the same argument the Hessian is degenerate.

Remark A.3. Lemma A.2 implies that every true parameter for a nontrivial ReLU network is a degenerate critical point of $K$. Hence in the study of nontrivial ReLU networks it is never appropriate to divide by the determinant of the Hessian of $K$ at a true parameter, and in particular Laplace or saddle-point approximations at a true parameter are invalid. The well-known positive scale invariance of ReLU networks (Phuong & Lampert, 2020) is responsible for the linear dependence of Lemma A.1, in the precise sense that the given differential operator is the infinitesimal generator (Boothby, 1986, §IV.3) of the scaling symmetry. However, this is only one source of degeneracy or singularity in ReLU networks. The RLCT is much lower than one would expect on the basis of this symmetry alone (see Section 6).

Example A.4. In general the KL function $K(w)$ for ReLU networks is not analytic. For a minimal counterexample, let $q(x)$ be uniform on $[-N, N]$ and zero outside, and consider
$$K(b) = \int q(x) \left( \mathrm{ReLU}(x - b) - \mathrm{ReLU}(x) \right)^2 dx.$$
It is easy to check that, up to a scalar factor,
$$K(b) = \begin{cases} -\frac{2}{3} b^3 + b^2 N & 0 \le b \le N \\ -\frac{1}{3} b^3 + b^2 N & -N \le b \le 0 \end{cases}$$
so that $K$ is $C^2$ but not $C^3$, let alone analytic.

A.2 REDUCED RANK REGRESSION

For reduced rank regression, the model is
$$p(y|x, w) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \|y - BAx\|^2 \right),$$
where $x \in \mathbb{R}^M$, $y \in \mathbb{R}^N$, $A$ is an $H \times M$ matrix and $B$ is an $N \times H$ matrix; the parameter $w$ denotes the entries of $A$ and $B$, i.e. $w = (A, B)$, and $\sigma > 0$ is a parameter which for the moment is irrelevant. If the true distribution is realisable then there is $w_0 = (A_0, B_0)$ such that $q(y|x) = p(y|x, w_0)$. Without loss of generality assume $q(x)$ is the uniform density.
In this case the KL divergence from $p(y|x, w)$ to $q(y|x)$ is
$$K(w) = \int\!\!\int q(x)\, q(y|x) \log \frac{q(y|x)}{p(y|x, w)}\, dx\, dy = \| BA - B_0 A_0 \|^2 \left( 1 + E(w) \right),$$
where the error $E$ is smooth and $E(w) = O(\| BA - B_0 A_0 \|^2)$ in any region where $\| BA - B_0 A_0 \| < C$, so $K(w)$ is equivalent to $\| BA - B_0 A_0 \|^2$. We write $K(w) = \| BA - B_0 A_0 \|^2$ for simplicity below.

Now assume that $B_0 A_0$ is symmetric and that $B_0$ is square, i.e. $N = H$. Then the zero locus of $K(w)$ is explicitly given as follows:
$$W_0 = \{ (A, B) : \det B \neq 0 \text{ and } A = B^{-1} B_0 A_0 \}.$$
It follows that $W_0$ is globally a graph over $\mathrm{GL}(H; \mathbb{R})$; indeed, the set of pairs $(B^{-1} B_0 A_0, B)$ with $B \in \mathrm{GL}(H; \mathbb{R})$ is exactly $W_0$. Thus $W_0$ is a smooth $H^2$-dimensional submanifold of $\mathbb{R}^{H \times M} \times \mathbb{R}^{H^2}$.

To prove that $W_0$ is minimally singular in the sense of Section 4 it suffices to show that $\mathrm{rank}(D^2_{A,B} K) \ge HM$, where $D^2_{A,B} K$ denotes the Hessian. But as it is no more difficult to do so, we instead find explicit local coordinates $(u, v)$ near an arbitrary point $(A, B) \in W_0$ for which $\{ u = 0 \} = W_0$ and $K(u, v) = a(u, v) \| u \|^2$ in this neighbourhood, where $a$ is a $C^\infty$ function with $a \ge c > 0$ for some $c$. Write $A(v) = (B + v)^{-1} B_0 A_0$. Then $(u, v) \mapsto (A(v) + u, B + v)$ gives local coordinates on $\mathbb{R}^{H \times M} \times \mathbb{R}^{H^2}$ near $(A, B)$, and
$$K(u, v) = \| (B + v)\big( (B + v)^{-1} B_0 A_0 + u \big) - B_0 A_0 \|^2 = \| B_0 A_0 + (B + v) u - B_0 A_0 \|^2 = \| (B + v) u \|^2,$$
so for $v$ sufficiently small (and hence $B + v$ invertible) we can take $a(u, v) = \| (B + v) u \|^2 / \| u \|^2$.
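The identity $K(u, v) = \|(B + v)u\|^2$ is purely algebraic and easy to check numerically. The following sketch (NumPy, with arbitrary illustrative dimensions $H = 3$, $M = 4$) verifies both that points of the form $(B^{-1} B_0 A_0, B)$ lie on the zero locus $W_0$ and the local-coordinate identity:

```python
import numpy as np

rng = np.random.default_rng(0)
H, M = 3, 4                                  # illustrative sizes, with N = H

A0 = rng.normal(size=(H, M))                 # true parameter (A0, B0)
B0 = rng.normal(size=(H, H))

def K(A, B):
    """K(w) = ||BA - B0 A0||^2 (squared Frobenius norm)."""
    return np.sum((B @ A - B0 @ A0) ** 2)

# A generic B is invertible; the point (B^{-1} B0 A0, B) lies on W_0.
B = rng.normal(size=(H, H))
A_on_W0 = np.linalg.inv(B) @ B0 @ A0
print(K(A_on_W0, B))                         # ~ 0 up to floating-point error

# Local coordinates (u, v) |-> (A(v) + u, B + v), A(v) = (B + v)^{-1} B0 A0:
u = rng.normal(size=(H, M))
v = 0.1 * rng.normal(size=(H, H))
A_v = np.linalg.inv(B + v) @ B0 @ A0
lhs = K(A_v + u, B + v)
rhs = np.sum(((B + v) @ u) ** 2)             # ||(B + v) u||^2
print(np.isclose(lhs, rhs))
```

In particular $K$ vanishes exactly on $\{u = 0\}$ and grows quadratically in $u$ in the normal directions, as the local-coordinate argument asserts.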

A.3 RLCT ESTIMATION

In this section we detail the estimation procedure for the RLCT used in Section 6. Let $L_n(w)$ be the negative log likelihood as in (3). Define the posterior distribution at inverse temperature $\beta > 0$ to be
$$p_\beta(w | D_n) = \frac{\prod_{i=1}^n p(y_i | x_i, w)^\beta\, \varphi(w)}{\int_W \prod_{i=1}^n p(y_i | x_i, w)^\beta\, \varphi(w)\, dw} = \frac{p_\beta(D_n | w)\, \varphi(w)}{p_\beta(D_n)}, \qquad (15)$$
where $\varphi$ is the prior distribution on the network weights $w$ and
$$p_\beta(D_n) = \int_W p_\beta(D_n | w)\, \varphi(w)\, dw \qquad (16)$$
is the marginal likelihood of the data at inverse temperature $\beta$. Finally, denote the expectation of a random variable $R(w)$ with respect to the tempered posterior $p_\beta(w | D_n)$ by
$$\mathbb{E}^\beta_w[R(w)] = \int_W R(w)\, p_\beta(w | D_n)\, dw. \qquad (17)$$
In the main text, we drop the superscript in the quantities (14), (15), (16), (17) when $\beta = 1$, e.g., $p(D_n)$ rather than $p_1(D_n)$. Assuming the conditions of Theorem 4 in Watanabe (2013) hold, we have, for $\beta = \beta_0 / \log n$ where $\beta_0$ is a positive constant,
$$\mathbb{E}^\beta_w[n L_n(w)] = n L_n(w_0) + \frac{\lambda}{\beta} + U_n \sqrt{\frac{\lambda}{2\beta}} + O_p(1), \qquad (18)$$
where $U_n$ is a sequence of random variables satisfying $\mathbb{E}_n[U_n] = 0$. In Algorithm 1, we describe an estimation procedure for the RLCT based on the asymptotic result in (18). For the estimates in Table 2 the posterior distribution was approximated using the NUTS variant of Hamiltonian Monte Carlo (Hoffman & Gelman, 2014), where the first 1000 steps were discarded and 20,000 samples were collected. Each $\hat\lambda(D_n)$ estimate in Algorithm 1 was obtained by linear regression on the pairs $\{(1/\beta_i, \mathbb{E}^{\beta_i}_w[n L_n(w)])\}_{i=1}^5$, where the five inverse temperatures $\beta_i$ are centered on the inverse temperature $1/\log(20000)$.
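The regression step of this procedure is straightforward. The following sketch (NumPy) recovers $\lambda$ from idealised, noise-free values of $\mathbb{E}^\beta_w[n L_n(w)]$ generated according to (18); the values of $\lambda$ and $n L_n(w_0)$ below are hypothetical, chosen only for illustration:

```python
import numpy as np

n = 20000
lam = 2.5                 # hypothetical RLCT
nLn_w0 = 1234.0           # hypothetical value of n L_n(w_0)

# Five inverse temperatures centered on 1/log(n), as in the text.
beta_star = 1.0 / np.log(n)
betas = beta_star * np.linspace(0.6, 1.4, 5)

# Idealised posterior expectations E^beta_w[n L_n(w)] = n L_n(w_0) + lam / beta
# (the U_n and O_p(1) terms of (18) are dropped for this noise-free check).
E_vals = nLn_w0 + lam / betas

# Linear regression of E_vals against 1/beta; the slope estimates lambda.
X = np.stack([np.ones_like(betas), 1.0 / betas], axis=1)
(intercept, lam_hat), *_ = np.linalg.lstsq(X, E_vals, rcond=None)
print(lam_hat)            # recovers lam up to floating-point error
```

In practice each $\mathbb{E}^{\beta_i}_w[n L_n(w)]$ is itself a Monte Carlo average over posterior samples, so the fit is noisy and the estimate is averaged over several training sets, as in Algorithm 1.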

A.4 CONNECTION BETWEEN RLCT AND GENERALISATION

For completeness, we sketch the derivation of (9), which gives the asymptotic expansion of the average generalisation error $\mathbb{E}_n G(n)$ of the Bayes predictive distribution in singular models. The exposition is an amalgamation of various works published by Sumio Watanabe, but is mostly based on the textbook (Watanabe, 2009). To understand the connection between the RLCT and $G(n)$, we first define the Bayes free energy as $F(n) = -\log p(D_n)$, whose expectation admits the following asymptotic expansion (Watanabe, 2009):
$$\mathbb{E}_n F(n) = \mathbb{E}_n[n S_n] + \lambda \log n + o(\log n),$$



Footnotes:

- The code to reproduce all experiments in the paper will be released on GitHub; for now, see the zip file.
- This is similar in spirit to Kristiadi et al. (2020), who claim that even "being Bayesian a little bit" fixes overconfidence; they approach this via the Laplace approximation for the final layer of a ReLU network. It is also worth noting that Kristiadi et al. (2020) do not attempt to formalise what it means to "fix overconfidence"; the precise statement should be in terms of $G(n)$.
- Hironaka's resolution of singularities guarantees existence. However, it is difficult to perform the required blowup transformations in high dimensions to obtain the standard form.
- Following Kristiadi et al. (2020), the code for the exact Hessian calculation is borrowed from https://github.com/f-dangel/hbp



Companion to Figure 1. The learning coefficient is the slope of the linear fit of $1/n$ versus $\mathbb{E}_n G(n)$ (no intercept, since the model is realisable). The $R^2$ value gives a sense of the goodness of fit.

Figure 2: Increasingly complicated true distributions q m (x, y) on [-1, 1] 2 × R.

Figure 3: Realisable and minibatch gradient descent for MAP training.

Figure 4: Nonrealisable and full batch gradient descent for MAP training.

Figure 5: Nonrealisable and minibatch gradient descent for MAP training. Missing points on the MAP learning curve are due to estimated probabilities too close to 0.

Figure 6: Realisable and full batch gradient descent for MAP training; average generalisation errors of Laplace approximations of the predictive distribution. The last-two-layers Laplace approximation results in numerical instabilities due to the degenerate Hessian. Any missing points are due to estimated probabilities too close to 0.

Figure 7: Realisable and minibatch gradient descent for MAP training. Details are the same as for Figure 6.

Figure 8: Nonrealisable and full batch gradient descent for MAP training. Details are the same as for Figure 6.

…, $c_1, \dots, c_{d'} > 0$ may depend on $v_0$, and $d'$ may be strictly less than $d$. If the model is regular then this is true with $d' = d$, and if it holds for $d' < d$ then we say that the pair $(p(y|x, w), q(y|x))$ is minimally singular. It follows that the set $W_0 \subseteq W$ of true parameters is a regular submanifold of codimension $d'$ (that is, $W_0$ is a manifold of dimension $d - d'$, where $W$ has dimension $d$). Under this hypothesis there are, near each true parameter $v_0 \in W_0$, exactly $d - d'$ directions in which $v_0$ can be varied without changing the model $p(y|x, w)$ and $d'$ directions in which varying the parameters does change the model. In this sense, there are $d'$ effective parameters near $v_0$.
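In local coordinates, this definition says that after a change of variables near $v_0$ the KL divergence takes a sum-of-squares normal form in the $d'$ normal directions; combined with the description of the RLCT in the introduction as half the number of normal directions to the set of true parameters, a minimally singular pair has (the coordinates $u$ here are local coordinates assumed for this sketch):

```latex
K(u) = c_1 u_1^2 + \cdots + c_{d'} u_{d'}^2, \qquad
W_0 = \{ u_1 = \cdots = u_{d'} = 0 \} \ \text{locally}, \qquad
\lambda = \frac{d'}{2}.
```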

case, which is small enough for us to expect MCMC to perform well. We also implemented the Laplace approximation and NUTS in the last layer only, i.e. performing inference over $B$ in $h_2(h_1(g(\,\cdot\,; v_{\mathrm{MAP}}); A_{\mathrm{MAP}}); B)$. Further implementation details of these approximate Bayesian schemes are found in Appendix A.5.

Table 2: RLCT estimates for ReLU and SiLU networks. We observe the RLCT increasing as $m$ increases, i.e., as the true distribution becomes more "complicated" relative to the supposed model.

Algorithm 1: RLCT estimation via Theorem 4 in Watanabe (2013)

Input: a range of inverse temperatures $\beta$; a set $T$ of training sets, each of size $n$; approximate samples $\{w_1, \dots, w_R\}$ from $p_\beta(w | D_n)$ for each training set $D_n$ and each $\beta$.

for each training set $D_n \in T$ do
    for each $\beta$ in the range of $\beta$'s do
        approximate $\mathbb{E}^\beta_w[n L_n(w)]$ using the samples from $p_\beta(w | D_n)$
    end for
    $\hat\lambda(D_n) \leftarrow$ slope of the linear fit of $\mathbb{E}^\beta_w[n L_n(w)]$ against $1/\beta$
end for

Output: the estimates $\hat\lambda(D_n)$ for $D_n \in T$.

Here the likelihood of the data at inverse temperature $\beta$ is
$$p_\beta(D_n | w) = \prod_{i=1}^n p(y_i | x_i, w)^\beta, \qquad (14)$$
which can also be written $p_\beta(D_n | w) = \exp(-\beta n L_n(w))$; the posterior distribution at inverse temperature $\beta$ is defined in Appendix A.3.

Companion to Figure 3.

Companion to Figure 4. The learning coefficient is the slope of the linear fit of $1/n$ versus $\mathbb{E}_n G(n)$.

Companion to Figure 5. The learning coefficient is the slope of the linear fit of $1/n$ versus $\mathbb{E}_n G(n)$.


where $S_n = -\frac{1}{n} \sum_{i=1}^n \log q(y_i | x_i)$ is the empirical entropy. The expected Bayes generalisation error is related to the Bayes free energy by
$$\mathbb{E}_n G(n) = \mathbb{E}_{n+1} F(n+1) - \mathbb{E}_n F(n) - S,$$
where $S = \mathbb{E}_n[S_n]$ is the entropy of the true distribution. Then for the average generalisation error we have
$$\mathbb{E}_n G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right). \qquad (19)$$
Since models with more complex singularities have smaller RLCTs, this would suggest that the more singular a model is, the better its generalisation (assuming one uses the Bayesian predictive distribution for prediction). In this connection it is interesting to note that simpler (relative to the model) true distributions lead to more singular models (Section 6).
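Heuristically, writing out the free energy expansion at sample sizes $n + 1$ and $n$, subtracting, and removing the entropy contribution (the rigorous argument in Watanabe (2009) controls the error terms, which this back-of-envelope computation simply drops) gives

```latex
\mathbb{E}_n G(n)
  \approx \big[(n+1) S + \lambda \log (n+1)\big]
        - \big[n S + \lambda \log n\big] - S
  = \lambda \log \frac{n+1}{n}
  = \frac{\lambda}{n} + O\!\left(\frac{1}{n^2}\right),
```

which is consistent with (19): the number of parameters never appears, only the RLCT $\lambda$.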

A.5 DETAILS FOR GENERALISATION ERROR EXPERIMENTS

Simulated data. The distribution of $x \in \mathbb{R}^3$ is set to $q(x) = N(0, I_3)$. In the realisable case, $y \in \mathbb{R}^3$ is drawn according to $q(y|x) = p(y|x, \theta_0)$. In the nonrealisable setting, we set $q(y|x) \propto \exp\{ -\| y - h_{w_0}(x) \|^2 / 2 \}$, where $w_0 = (A_0, B_0)$ is drawn according to the PyTorch model initialisation of $h$.

MAP training

The MAP estimator is found via gradient descent on the mean-squared-error loss, using either the full data set or minibatches of size 32. Training was run for 5000 epochs; no form of early stopping was employed.

Calculating the generalisation error. Using a held-out test set $T_n = \{(x_i, y_i)\}_{i=1}^n$, we calculate the average generalisation error via the empirical estimate (20). We assume the held-out test set is large enough that the difference between $\mathbb{E}_n G(n)$ and (20) is negligible, and refer to them interchangeably as the average generalisation error. In our experiments we use $n = 10{,}000$ and 30 draws of the dataset $D_n$ to estimate $\mathbb{E}_n$.

Last layer(s) inference. Without loss of generality, we discuss performing inference over the $w$ parameters of $h$ while freezing the parameters of $g$ at the MAP estimate; the steps easily extend to performing inference over the final layer only of $f = h \circ g$. Let $\tilde{x}_i = g_{v_{\mathrm{MAP}}}(x_i)$ and define a new transformed dataset $\tilde{D}_n = \{(\tilde{x}_i, y_i)\}_{i=1}^n$. We take the prior on $w$ to be standard Gaussian; the posterior over $w$ given $\tilde{D}_n$ is then defined as in (21). Define the following approximation to the Bayesian predictive distribution:
$$p(y | x, D_n) = \int p(y | x, (v_{\mathrm{MAP}}, w))\, p(w | \tilde{D}_n)\, dw.$$
Let $w_1, \dots, w_R$ be approximate samples from $p(w | \tilde{D}_n)$. Then we approximate
$$p(y | x, D_n) \approx \frac{1}{R} \sum_{r=1}^R p(y | x, (v_{\mathrm{MAP}}, w_r)),$$
where $R$ is a large number, set to 1000 in our experiments. We consider the Laplace approximation and the NUTS variant of HMC for drawing samples from $p(w | \tilde{D}_n)$:

- Laplace in the last layer(s). Recall $\theta_{\mathrm{MAP}} = (v_{\mathrm{MAP}}, w_{\mathrm{MAP}})$ is the MAP estimate for $f_\theta$ trained with the data $D_n$. With the Laplace approximation, we draw $w_1, \dots, w_R$ from the Gaussian $N(w_{\mathrm{MAP}}, \Sigma)$, where $\Sigma = (-\nabla^2 \log p(w | \tilde{D}_n)|_{w_{\mathrm{MAP}}})^{-1}$ is the inverse of the Hessian of the negative log posterior evaluated at $w_{\mathrm{MAP}}$.

- MCMC in the last layer(s). We used the NUTS variant of HMC to draw samples from (21), with the first 1000 samples discarded. Our implementation used the pyro package in PyTorch.
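As a concrete sketch of the Laplace step above in the simplest possible setting: when the last layer is linear with unit-variance Gaussian noise and a standard Gaussian prior, the negative log posterior over $w$ is quadratic, so the Laplace approximation is exact and fits in a few lines of NumPy. The feature matrix below stands in for the transformed inputs $\tilde{x}_i = g_{v_{\mathrm{MAP}}}(x_i)$; all dimensions and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, R = 200, 5, 1000

X_tilde = rng.normal(size=(n, d))       # stands in for features g_vMAP(x_i)
w_true = rng.normal(size=d)
y = X_tilde @ w_true + rng.normal(size=n)

# Negative log posterior (up to a constant), with N(0, I) prior, unit noise:
#     0.5 * ||y - X w||^2 + 0.5 * ||w||^2
# Its Hessian is constant in w, so the Laplace approximation is exact here.
H = X_tilde.T @ X_tilde + np.eye(d)     # Hessian of the negative log posterior
w_map = np.linalg.solve(H, X_tilde.T @ y)
Sigma = np.linalg.inv(H)                # Laplace covariance = inverse Hessian

# Draw R samples and average the plug-in predictions, as in the text.
ws = rng.multivariate_normal(w_map, Sigma, size=R)
x_new = rng.normal(size=d)
pred_mean = (ws @ x_new).mean()         # Monte Carlo predictive mean
print(pred_mean, w_map @ x_new)         # close, up to Monte Carlo error
```

For a nonlinear last layer the Hessian is no longer constant and the Laplace approximation is only a local Gaussian fit at $w_{\mathrm{MAP}}$; as Remark A.3 notes, at singular true parameters this approximation breaks down entirely.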

