EFFECTIVE DIMENSION OF MACHINE LEARNING MODELS

Abstract

Making statements about the performance of trained models on tasks involving new data is one of the primary goals of machine learning, i.e., to understand the generalization power of a model. Various capacity measures try to capture this ability, but usually fall short in explaining important characteristics of models that we observe in practice. In this study, we propose the local effective dimension as a capacity measure which seems to correlate well with generalization error on standard data sets. Importantly, we prove that the local effective dimension bounds the generalization error and discuss the aptness of this capacity measure for machine learning models.

1. INTRODUCTION

The essence of successful machine learning lies in the creation of a model that is able to learn from data and apply what it has learned to new, unseen data (Goodfellow et al., 2016). The latter ability is termed the generalization performance of a machine learning model and has proven to be notoriously difficult to predict a priori (Zhang et al., 2021). The relevance of generalization is rather straightforward: if one already has insight on the performance capability of a model class, this will allow for more robust models to be selected for training and deployment. But how does one begin to analyze generalization without physically training models and assessing their performance on new data thereafter? This age-old question has a rich history and is largely addressed through the notion of capacity. Loosely speaking, the capacity of a model relates to its ability to express a variety of functions (Vapnik et al., 1994). The higher a model's capacity, the more functions it is able to fit. In the context of generalization, many capacity measures have been shown to mathematically bound the error a model makes when performing a task on new data, i.e., the generalization error (Vapnik & Chervonenkis, 1971; Liang et al., 2019; Bartlett et al., 2017). Naturally, finding a capacity measure that provides a tight generalization error bound, and in particular, correlates with generalization error across a wide range of experimental setups, will allow us to better understand the generalization performance of machine learning models. Interestingly, over time, proposed capacity measures have differed quite substantially, with tradeoffs apparent among each of the current proposals (Jiang et al., 2019).
The perennial VC dimension has famously been shown to bound the generalization error, but it does not incorporate crucial attributes, such as data potentially coming from a distribution, and it ignores the learning algorithm employed, which inherently reduces the space of models within a model class that an algorithm has access to (Vapnik et al., 1994). Arguably, one of the most promising contenders for capacity that attempts to incorporate these factors is the family of norm-based capacity measures, which regularize the margin distribution of a model by a particular norm that usually depends on the model's trained parameters (Bartlett et al., 2017; Neyshabur et al., 2017b; 2015). Whilst these measures incorporate the distribution of data, as well as the learning algorithm, the drawback is that most depend on the size of the model, which does not necessarily correlate with the generalization error in certain experimental setups (Zhang et al., 2021). To this end, we present the local effective dimension, which attempts to address these issues. By capturing the redundancy of parameters in a model, the local effective dimension is modified from (Berezniuk et al., 2020; Abbas et al., 2021) to incorporate the learning algorithm employed, in addition to being scale invariant and data dependent. The key results from our study can be summarized as follows:

• We prove that the local effective dimension bounds the generalization error of a trained model with finite data (see Theorem 4.1).

• The local effective dimension largely depends on the Fisher information, which is often approximated in practice (Kunstner et al., 2019). We rigorously quantify the sensitivity of the local effective dimension when evaluated with an approximated Fisher information (see Proposition 3.2).

• Lastly, we empirically show that the local effective dimension correlates well with generalization error in various experimental setups using standard data sets. The local effective dimension is found to decrease in line with the generalization error as a network increases in size. Similarly, the measure increases in line with the generalization error when models are trained on randomized training labels.

Table 1: Overview of established capacity measures and desirable properties. The first property is whether the measure can be mathematically related to the generalization error via an upper bound. The second states whether this bound is good in practice, i.e., that the measure correlates with the generalization error in various experimental setups, such as (Zhang et al., 2021). Scale invariance corresponds to the measure being insensitive to inconsequential transformations of the model, such as multiplying a neural network's weights by a constant. Data and training dependence refers to a measure accounting for data drawn from a distribution and the learning algorithm employed. Finite data merely implies that the measure can handle finite data. Lastly, efficient evaluation refers to the possibility of estimating the capacity measure in polynomial time (in the number of data).

2. PRELIMINARIES

In this section, we provide an overview of relevant literature and a concise introduction to generalization error bounds and the Fisher information.

2.1. RELATED WORK

We briefly discuss relevant capacity measures proposed in the literature, but defer to (Jiang et al., 2019) for a more comprehensive overview. Given a model class, Vapnik et al. showed that the VC dimension can provide an upper bound on generalization error (Vapnik et al., 1994). While this was a crucial first step in using capacity to understand generalization, the VC dimension rests on unrealistic assumptions, such as access to infinite data, and ignores factors like training dependence and the fact that data, more reasonably, comes from a distribution (Holden & Niranjan, 1995). The closely related Rademacher complexity relaxes some of the assumptions made on the model class, but still suffers from issues similar to those of the VC dimension (Yin et al., 2019; Wang et al., 2018). Since then, a myriad of capacity measures aiming to circumvent these problems and provide tighter generalization error bounds have been proposed. Margin-based capacity measures stemmed from the work of Vapnik and Chervonenkis in 1974, who pointed out that generalization error bounds based on the VC dimension may be significantly enhanced in the case of linear classifiers that produce large margins. In (Bartlett et al., 1998), it was shown that the phenomenon where boosting models (no matter how large you make them) do not overfit data could also be explained by the large margins these boosting models achieve. Since the grand mystery in modern deep learning can be characterized by the same phenomenon (extremely large overparameterized neural networks that seemingly do not overfit data), it seems natural to try to extend the idea of margin bounds to these model families. Moreover, margin-based approaches allow us to leverage the fact that learning algorithms, like gradient descent, produce classifiers with large margins on training data. Unfortunately, looking at margins in isolation does not say much about the performance of deep neural networks on unseen data.
There have been recent investigations on how to add a normalization such that margin-based measures become informative. Most of these proposals involve the incorporation of the Lipschitz constant of a network, which is commonly upper bounded by the product of the spectral norms of the weight matrices (Bartlett et al., 2017). These normalized margin-based techniques gave rise to norm-based capacity measures, which appear promising; however, it is still unclear how to perform this normalization and, often, the normalization depends on some factor that scales with the size of the model, which is undesirable in the case of deep neural networks (Neyshabur et al., 2015). Another interesting proposal for measuring capacity came about by trying to characterize the local minima achieved by deep networks after training (Keskar et al., 2016; Hochreiter & Schmidhuber, 1997). These so-called sharpness-based measures often depend on the Hessian, which incorporates a notion of curvature at a particular point in the loss landscape. It was believed that flatter minima led to better generalization properties, although this was later shown to be incorrect, as sharpness measures were usually not scale invariant and thus did not correlate well with generalization error in various scenarios (Dinh et al., 2017). This leads us to the purpose of this study, where we introduce and motivate the local effective dimension as a capacity measure. The effective dimension arises from the principle of minimum description length and thus tries to capture existing redundancy in a statistical model (Berezniuk et al., 2020; Cover & Thomas, 2006; Rissanen, 1996). Redundancy has been widely studied in deep learning through techniques like pruning and model compression (Yeom et al., 2021; Molchanov et al., 2019; Wiedemann et al., 2020; Tung & Mori, 2020; Cheng et al., 2018; 2017; Tishby & Zaslavsky, 2015).
Interestingly, attempts to connect redundancy/minimum description length to generalization performance have also been studied in (Hinton & van Camp, 1993; Achille & Soatto, 2018; MacKay, 1992), and the idea was used to compare the capacity of quantum and classical machine learning models in (Abbas et al., 2021). We refine the existing definitions of the effective dimension, which in turn leads us to the creation of a local version that conveniently meets the criteria presented in Table 1.

2.2. GENERALIZATION

Consider a hypothesis class $\mathcal{H}$, a loss function $\ell$, and $n$ data pairs $(x_i, y_i)$ drawn i.i.d. from an unknown joint distribution $p$ on $\mathcal{X} \times \mathcal{Y}$. The expected risk of a hypothesis $h \in \mathcal{H}$ is $R(h) := \mathbb{E}_{(x,y)\sim p}[\ell(h(x), y)]$, whereas the empirical risk on the $n$ observed samples is $R_n(h) := \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$. The difference between the expected and the empirical risk is known as the generalization error gap. This gap gives us an indication as to whether a hypothesis $h \in \mathcal{H}$ will perform well on unseen data, drawn from the unknown joint distribution $p$ (Neyshabur et al., 2017a). Therefore, an upper bound on the quantity

$$\sup_{h \in \mathcal{H}} |R(h) - R_n(h)| \,, \tag{1}$$

which vanishes as $n$ grows large, is of considerable interest. Capacity measures help quantify the expressiveness and power of $\mathcal{H}$. Thus, the quantity in equation 1 is typically bounded by an expression that depends on some notion of capacity (Vapnik et al., 1994).
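To make the gap in equation 1 concrete, here is a minimal sketch that estimates the generalization error gap of a single fixed hypothesis by Monte Carlo; the threshold classifier, data distribution and label-noise level are illustrative assumptions, not part of this paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_loss(y_pred, y_true):
    """0-1 loss; bounded, so risks lie in [0, 1]."""
    return (y_pred != y_true).astype(float)

def h(x, threshold=0.0):
    """Fixed hypothesis: a 1-D threshold classifier (illustrative)."""
    return (x > threshold).astype(int)

def sample(n):
    """Draw n pairs from an assumed joint distribution p(x, y)."""
    x = rng.normal(size=n)
    y = ((x > 0.2) ^ (rng.random(n) < 0.05)).astype(int)  # 5% label noise
    return x, y

# Empirical risk R_n(h) on a small training sample.
x_train, y_train = sample(50)
R_n = zero_one_loss(h(x_train), y_train).mean()

# Monte-Carlo estimate of the expected risk R(h) on a large fresh sample.
x_test, y_test = sample(200_000)
R = zero_one_loss(h(x_test), y_test).mean()

gap = abs(R - R_n)  # generalization error gap for this fixed h
print(f"R_n = {R_n:.3f}, R ~ {R:.3f}, gap ~ {gap:.3f}")
```

For a class $\mathcal{H}$, the quantity in equation 1 controls this gap uniformly over all $h \in \mathcal{H}$ rather than for a single fixed hypothesis.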

2.3. FISHER INFORMATION

The Fisher information has many interdisciplinary interpretations (Frieden, 2004). In machine learning, several capacity measures incorporate the Fisher information in different ways (Liang et al., 2019; Tsuda et al., 2004). It is also a crucial quantity in the effective dimension and is thus briefly introduced here. Consider a parameterized statistical model $p(x,y;\theta) = p(y|x;\theta)p(x)$ which describes the joint relationship between data pairs $(x,y)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$ and $\theta \in \Theta \subseteq \mathbb{R}^d$. The input distribution $p(x)$ is a prior distribution over the data, and the conditional distribution $p(y|x;\theta)$ describes the input-output relation generated by the model for a fixed $\theta \in \Theta$. The full parameter space $\Theta$ forms a Riemannian space which gives rise to a Riemannian metric, namely the Fisher information, which we can represent in matrix form as

$$F(\theta) = \mathbb{E}_{(x,y)\sim p}\left[ \frac{\partial}{\partial\theta}\log p(x,y;\theta) \left( \frac{\partial}{\partial\theta}\log p(x,y;\theta) \right)^{\mathrm{T}} \right] \in \mathbb{R}^{d\times d} \,.$$

By definition, the Fisher information matrix is positive semidefinite and hence its eigenvalues are non-negative. In practical applications where $d$ is typically large, there exist sophisticated techniques to efficiently approximate the Fisher information matrix. This is discussed in Appendix E.3.
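As a concrete illustration of this definition, the Fisher information of a small logistic model $p(y=1|x;\theta) = \sigma(\theta \cdot x)$ can be estimated by averaging outer products of score vectors over samples from the joint distribution. The model and all values below are assumptions for this sketch, not the networks studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sample (x, y) from the joint p(x, y; theta) with prior x ~ N(0, id_d).
n = 20_000
X = rng.normal(size=(n, d))
Y = (rng.random(n) < sigmoid(X @ theta)).astype(float)

# Score vectors d/dtheta log p(y|x; theta) for the logistic model: (y - sigma(theta.x)) x.
scores = (Y - sigmoid(X @ theta))[:, None] * X        # shape (n, d)

# Monte-Carlo estimate of F(theta) = E[score score^T]; a d x d PSD matrix.
F = scores.T @ scores / n

eigvals = np.linalg.eigvalsh(F)
print("Fisher eigenvalues:", eigvals)  # non-negative, as the definition requires
```

Being a Gram matrix of score vectors, the estimate is positive semidefinite by construction, matching the property stated above.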

3. EFFECTIVE DIMENSION

The effective dimension arose from a simple operational question: is it possible to quantify the number of parameters that are truly active in a statistical model? In the case of deep neural networks, it has already been shown that many parameters are inactive, inspiring better design techniques (Han et al., 2015). Measuring parameter activeness can be made mathematically precise with tools from statistics and information theory. In particular, the effective dimension unites the principle of minimum description length with the Kolmogorov complexity of a model (Rissanen, 1996; Cover & Thomas, 2006). We introduce the global effective dimension here and refer the interested reader to (Berezniuk et al., 2020; Abbas et al., 2021) for more details.

3.1. GLOBAL EFFECTIVE DIMENSION

To shorten notation we write

$$\kappa_{n,\gamma} := \frac{\gamma n}{2\pi \log n} \,, \tag{2}$$

for $n \in \mathbb{N}$, which represents the number of data samples available, and a constant $\gamma \in (\frac{2\pi \log n}{n}, 1]$.

Definition 3.1. The global effective dimension of a statistical model $\mathcal{M}_\Theta := \{p(\cdot,\cdot;\theta) : \theta \in \Theta \subset \mathbb{R}^d\}$ with respect to $n \in \mathbb{N}$ and $\gamma \in (\frac{2\pi \log n}{n}, 1]$ is defined as

$$d_{n,\gamma}(\mathcal{M}_\Theta) := \frac{2 \log\left(\frac{1}{V_\Theta} \int_\Theta \sqrt{\det\left(\mathrm{id}_d + \kappa_{n,\gamma} \hat{F}(\theta)\right)} \, d\theta\right)}{\log \kappa_{n,\gamma}} \,, \tag{3}$$

where $V_\Theta := \int_\Theta d\theta \in \mathbb{R}_+$ is the volume of the parameter space and $\kappa_{n,\gamma}$ is defined in equation 2. The matrix $\hat{F}(\theta) \in \mathbb{R}^{d \times d}$ is the normalized Fisher information matrix defined as

$$\hat{F}_{ij}(\theta) := \frac{d \, V_\Theta}{\int_\Theta \mathrm{tr}(F(\theta)) \, d\theta}\, F_{ij}(\theta) \,,$$

where $F(\theta) \in \mathbb{R}^{d \times d}$ denotes the Fisher information matrix of $p(\cdot,\cdot;\theta)$. For conciseness, we simply denote the global effective dimension as $d_{n,\gamma}$.

The global effective dimension converges to the maximal rank of the Fisher information matrix $r := \max_{\theta \in \Theta} r_\theta \in \{1, 2, \ldots, d\}$ in the limit $n \to \infty$, where $r_\theta$ denotes the rank of $F(\theta)$. Thus, it often makes sense to standardize the measure by looking at the normalized effective dimension, denoted by $\bar{d}_{n,\gamma} = d_{n,\gamma}/d$, which gives us the proportion of active parameters relative to the total number of parameters in the model. We prove that the global effective dimension is continuous as a function of the Fisher information matrix. Since the Fisher information is typically approximated in practice, such a statement is relevant to ensure small deviations in the Fisher information do not exacerbate possible deviations in the global effective dimension (see Section 5 for more details).

Proposition 3.2 (Continuity of the effective dimension). Let $n \in \mathbb{N}$, $\gamma \in (\frac{2\pi \log n}{n}, 1]$, and consider two statistical models $\mathcal{M}_\Theta$ and $\mathcal{M}'_\Theta$ with $\Theta \subset \mathbb{R}^d$ and corresponding Fisher information matrices $F$ and $F'$, respectively.
Then,

$$|d_{n,\gamma}(\mathcal{M}_\Theta) - d_{n,\gamma}(\mathcal{M}'_\Theta)| \leq \frac{C_d \left(\frac{1}{\varphi(F)} + \frac{1}{\varphi(F')}\right) \max_{\theta \in \Theta} \|\hat{F}(\theta) - \hat{F}'(\theta)\| + 2\psi(F) + 2\psi(F')}{\log \kappa_{n,\gamma}} \,,$$

where $C_d$ is a dimensional constant, $\kappa_{n,\gamma}$ is defined in equation 2, and

$$\varphi(F) := \frac{1}{V_\Theta} \int_\Theta \sqrt{\det(\hat{F}(\theta))} \, d\theta \,, \qquad \psi(F) := \max\left\{ \log \frac{1}{V_\Theta} \int_\Theta \sqrt{\det(\mathrm{id}_d + \hat{F}(\theta))} \, d\theta \,,\; -\log \frac{1}{V_\Theta} \int_\Theta \sqrt{\det(\hat{F}(\theta))} \, d\theta \right\} \,.$$

The proof is given in Appendix B. Proposition 3.2 is informative for statistical models $\mathcal{M}_\Theta$ and $\mathcal{M}'_\Theta$ with corresponding Fisher information matrices $F$ and $F'$ such that $\varphi(F) > 0$ and $\varphi(F') > 0$, respectively. This restriction is unavoidable: since $\lim_{n\to\infty} d_{n,\gamma}(\mathcal{M}_\Theta) = r = \max_{\theta\in\Theta} \mathrm{rank}(F(\theta))$ and $\lim_{n\to\infty} d_{n,\gamma}(\mathcal{M}'_\Theta) = r' = \max_{\theta\in\Theta} \mathrm{rank}(F'(\theta))$, the effective dimension is not continuous as $n \to \infty$ whenever $r \neq r'$. This is consistent with Proposition 3.2, where in the case $r < d$ or $r' < d$ we have $\varphi(F) = 0$ or $\varphi(F') = 0$, respectively.

Remark 3.3 (Stabilized computation of the effective dimension). For large $d$, sufficiently large $n$, and models with a full-rank Fisher information matrix, the effective dimension is of order $d$. This implies that $\det(\mathrm{id}_d + \kappa_{n,\gamma}\hat{F}(\theta))$ is exponentially large in $d$, which makes direct calculation of the effective dimension via equation 3 numerically challenging when large models are considered. This can be circumvented by rewriting the effective dimension as

$$d_{n,\gamma}(\mathcal{M}_\Theta) = \frac{2}{\log \kappa_{n,\gamma}} \log\left( \frac{1}{V_\Theta} \int_\Theta \exp\left( \frac{1}{2} \log\det\left(\mathrm{id}_d + \kappa_{n,\gamma}\hat{F}(\theta)\right) \right) d\theta \right)$$

and noting that

$$\frac{1}{2}\log\det\left(\mathrm{id}_d + \kappa_{n,\gamma}\hat{F}(\theta)\right) = \frac{1}{2}\mathrm{tr}\log\left(\mathrm{id}_d + \kappa_{n,\gamma}\hat{F}(\theta)\right) = \frac{1}{2}\sum_{i=1}^d \log\left(1 + \kappa_{n,\gamma}\lambda_i(\hat{F}(\theta))\right) =: z(\theta) \,,$$

where $\lambda_i(\hat{F}(\theta))$ denotes the $i$-th eigenvalue of $\hat{F}(\theta)$. The quantity $z(\theta)$ can be computed without any under- or overflow problems for large $n$ and $d$. Choosing $\zeta = \max_{\theta\in\Theta} z(\theta)$ then gives

$$d_{n,\gamma}(\mathcal{M}_\Theta) = \frac{2\zeta}{\log \kappa_{n,\gamma}} + \frac{2}{\log \kappa_{n,\gamma}} \log\left( \frac{1}{V_\Theta} \int_\Theta \exp\left(z(\theta) - \zeta\right) d\theta \right) \,,$$

which is a numerically stable expression for the effective dimension.
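The stabilized formula in Remark 3.3 translates directly into code. In the sketch below, the Fisher matrices are random positive semidefinite stand-ins (not from a trained model) and the integral over $\Theta$ is replaced by a Monte-Carlo average, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 10, 200                                 # parameter dimension, samples over Theta
n, gamma = 10_000, 1.0
kappa = gamma * n / (2 * np.pi * np.log(n))    # kappa_{n,gamma} from equation 2

# Stand-in Fisher matrices F(theta_k), k = 1..m (PSD by construction).
A = rng.normal(size=(m, d, d))
F = A @ A.transpose(0, 2, 1) / d

# Normalize so the average trace equals d, mimicking the definition of F-hat.
F_hat = F * (d / np.mean([np.trace(Fk) for Fk in F]))

# z(theta_k) = 1/2 sum_i log(1 + kappa * lambda_i): stable, no explicit determinant.
lams = np.linalg.eigvalsh(F_hat)               # shape (m, d), all >= 0 up to noise
z = 0.5 * np.log1p(kappa * np.clip(lams, 0.0, None)).sum(axis=1)

# d_{n,gamma} = 2*zeta/log(kappa) + (2/log(kappa)) * log(mean_k exp(z_k - zeta)).
zeta = z.max()
eff_dim = 2 * zeta / np.log(kappa) + 2 * np.log(np.mean(np.exp(z - zeta))) / np.log(kappa)
print(f"effective dimension ~ {eff_dim:.2f} of d = {d}")
```

The shift by $\zeta$ is exactly the log-sum-exp trick; without it, $\exp(z(\theta))$ overflows once $d$ reaches the thousands.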

3.2. LOCAL EFFECTIVE DIMENSION

While the global effective dimension has nice properties, an important aspect to note is that it incorporates the full parameter space $\Theta$. In practice, however, training models with a learning algorithm inherently restricts the space of parameters that a model truly has access to. Once a model is trained, only a fixed parameter set $\theta^\star \in \Theta$ is considered, which is chosen to minimize a certain loss function. This leads us to the introduction of the local effective dimension, which accounts for dependence on the training algorithm. To achieve this, we define an $\varepsilon$-ball around a fixed parameter set $\theta^\star \in \Theta \subset \mathbb{R}^d$ for $\varepsilon > 0$ as

$$B_\varepsilon(\theta^\star) := \{\theta \in \Theta : \|\theta - \theta^\star\| \leq \varepsilon\} \,,$$

with volume $V_\varepsilon := \int_{B_\varepsilon(\theta^\star)} d\theta$.

Definition 3.4. The local effective dimension of a statistical model $\mathcal{M}_\Theta := \{p(\cdot,\cdot;\theta) : \theta \in \Theta\}$ around $\theta^\star \in \Theta$ with respect to $n \in \mathbb{N}$, $\gamma \in (\frac{2\pi\log n}{n}, 1]$, and $\varepsilon > 1/\sqrt{n}$ is defined as

$$d_{n,\gamma}(\mathcal{M}_{B_\varepsilon(\theta^\star)}) := \frac{2\log\left(\frac{1}{V_\varepsilon}\int_{B_\varepsilon(\theta^\star)} \sqrt{\det\left(\mathrm{id}_d + \kappa_{n,\gamma}\hat{F}(\theta)\right)}\, d\theta\right)}{\log\kappa_{n,\gamma}} \,,$$

for $\kappa_{n,\gamma}$ given by equation 2. The matrix $\hat{F}(\theta)\in\mathbb{R}^{d\times d}$ is the normalized Fisher information matrix defined as

$$\hat{F}_{ij}(\theta) := \frac{d\, V_\varepsilon}{\int_{B_\varepsilon(\theta^\star)} \mathrm{tr}(F(\theta))\, d\theta}\, F_{ij}(\theta) \,,$$

where $F(\theta)\in\mathbb{R}^{d\times d}$ denotes the Fisher information matrix of $p(\cdot,\cdot;\theta)$. For ease of notation, we denote the local effective dimension as $d_{n,\gamma,\varepsilon}$.

From Definition 3.4, we immediately see that the local effective dimension is scale invariant, as it depends on the normalized Fisher information matrix, as well as training dependent, since training determines $\theta^\star$. Via its dependence on the Fisher information, the local effective dimension also incorporates an assumed distribution for the data and is built for finite data, as summarized in Table 1. Proposition 3.2 further proves that the local effective dimension is continuous in the Fisher information matrix. The computationally dominant part in evaluating the local effective dimension is the calculation of the Fisher information matrix.
Luckily, this is a well-studied problem with existing proposals for efficient evaluation (Kunstner et al., 2019; Martens & Grosse, 2015) . Since we only require the eigenvalues of the Fisher matrix for the local effective dimension, we can further exploit these Fisher approximations and do not need to store a d × d matrix (see Section E.3 for more details). Additionally, the integral over the -ball can be evaluated efficiently with Monte-Carlo type methods. To complete the criteria from Table 1 , it remains to show that the local effective dimension bounds and correlates with the generalization error, which is illustrated next.
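For the Monte-Carlo evaluation of the $\varepsilon$-ball integral, one needs uniform samples from $B_\varepsilon(\theta^\star)$. Below is a standard sketch of that sampling step; the dimension, radius and $\theta^\star$ are illustrative, and in practice each sampled $\theta$ would then be fed to a Fisher approximation such as K-FAC.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_ball(theta_star, eps, m):
    """Draw m uniform samples from the Euclidean eps-ball around theta_star in R^d."""
    d = theta_star.shape[0]
    v = rng.normal(size=(m, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform directions on the sphere
    r = eps * rng.random(m) ** (1.0 / d)            # radii giving uniform volume density
    return theta_star + r[:, None] * v

theta_star = np.zeros(5)          # stand-in for a trained parameter set theta*
samples = sample_ball(theta_star, eps=0.1, m=1000)
print("max distance from theta*:", np.linalg.norm(samples - theta_star, axis=1).max())
```

The radius transform $r = \varepsilon\, U^{1/d}$ compensates for the fact that most of a high-dimensional ball's volume sits near its surface, so the samples are uniform in volume rather than in radius.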

4. GENERALIZATION AND THE LOCAL EFFECTIVE DIMENSION

Understanding the role of the local effective dimension in the context of generalization requires a rigorous relationship to be established. We do so by proving that the local effective dimension bounds the generalization error.

4.1. GENERALIZATION ERROR BOUND

Consider machine learning models described by stochastic maps, parameterized by some $\theta \in \Theta$, and a loss function $\ell: \mathcal{P}(\mathcal{Y}) \times \mathcal{P}(\mathcal{Y}) \to \mathbb{R}$, where $\mathcal{P}(\mathcal{Y})$ denotes the set of distributions on $\mathcal{Y}$. The following regularity assumptions on the model $\mathcal{M}_\Theta := \{p(\cdot,\cdot;\theta) : \theta \in \Theta\}$ are made: the map $\Theta \ni \theta \mapsto p(\cdot,\cdot;\theta)$ is Lipschitz continuous with constant $M_1$, and the loss function $\ell$ is bounded by $B$ and Lipschitz continuous with constant $M_2$ in its first argument with respect to the total variation distance.

Theorem 4.1. Let $\varepsilon \in (1/\sqrt{n}, 1]$ and $\gamma \in (\frac{2\pi\log n}{n}, 1]$, and let $\theta^\star \in \Theta$ denote the trained parameter set. Under the assumptions above, there exist a dimensional constant $c_d$ and a constant $\Lambda > 0$, depending on the model, such that

$$\mathbb{P}\left( \sup_{\theta \in B_\varepsilon(\theta^\star)} |R(\theta) - R_n(\theta)| \geq \frac{4M\varepsilon}{\sqrt{\kappa_{n,\gamma}}} \right) \leq c_d (1 + \Lambda\varepsilon)^d \cdot \kappa_{n,\gamma}^{\frac{d_{n,\gamma,\varepsilon}}{2}} \exp\left( -\frac{16\pi M^2 \varepsilon^2 \log n}{B^2 \gamma} \right) \,,$$

where $M = M_1 M_2$, $\kappa_{n,\gamma}$ is defined in equation 2, and $d_{n,\gamma,\varepsilon}$ is the local effective dimension $d_{n,\gamma}(\mathcal{M}_{B_\varepsilon(\theta^\star)})$.

Theorem 4.1 assumes that the loss function is Lipschitz continuous. This excludes some popular loss functions, such as the relative entropy. Hence, in Appendix C, we extend Theorem 4.1 to include loss functions that are log-Lipschitz.

4.2. PROOF OF THEOREM 4.1

Let $N_{B_\varepsilon(\theta^\star)}(r)$ denote the number of boxes of side length $r$ required to cover the set $B_\varepsilon(\theta^\star)$, the length being measured with respect to the metric $\hat{F}_{ij}(\theta^\star)$.

Lemma 4.2. Under the assumptions of Theorem 4.1, we have for any $\xi \in (0,1)$

$$\mathbb{P}\left( \sup_{\theta\in B_\varepsilon(\theta^\star)} |R(\theta) - R_n(\theta)| \geq \xi \right) \leq 2\, N_{B_\varepsilon(\theta^\star)}\!\left( \frac{\xi}{4M} \right) \exp\left( -\frac{n\xi^2}{2B^2} \right) \,.$$

Proof. If we replace the full parameter space $\Theta$ by the relevant reduced space $B_\varepsilon(\theta^\star)$, the proof of this lemma follows directly from (Abbas et al., 2021, Lemma 2 in the Supplementary Information) if we set $\alpha = 1$.

Lemma 4.3. Under the assumptions of Theorem 4.1, there exists a dimensional constant $c_d$ such that

$$N_{B_\varepsilon(\theta^\star)}\!\left( \frac{\varepsilon}{\sqrt{\kappa_{n,\gamma}}} \right) \leq c_d (1 + \Lambda\varepsilon)^d \cdot \kappa_{n,\gamma}^{\frac{d_{n,\gamma,\varepsilon}}{2}} \,.$$

Proof. If we choose $\Theta = B_1(\theta^\star)$ instead of $[-1,1]^d$, we can rescale $B_\varepsilon(\theta^\star) \to B_1(\theta^\star)$, $F(\theta) \to \varepsilon^2 F(\tilde{\theta})$, $1/\sqrt{n} \to 1/(\varepsilon\sqrt{n})$, and $r \to r/\varepsilon$. In other words, the number of balls of radius $r$ needed to cover $B_\varepsilon(\theta^\star)$ is equal to the number of balls of radius $r/\varepsilon$ needed to cover $B_1(\theta^\star)$. Then, constants $c_d$ and $\hat{c}_d$ exist such that

$$N_{B_\varepsilon(\theta^\star)}(r) = N_{B_1(\theta^\star)}(r/\varepsilon) \leq \hat{c}_d (1 + c_d\Lambda\varepsilon)^d \frac{1}{V_1} \int_{B_1(\theta^\star)} \sqrt{\det\left(\mathrm{id}_d + \frac{\varepsilon^2}{r^2}\hat{F}(\tilde{\theta})\right)}\, d\tilde{\theta} = \hat{c}_d (1 + c_d\Lambda\varepsilon)^d \frac{1}{V_\varepsilon} \int_{B_\varepsilon(\theta^\star)} \sqrt{\det\left(\mathrm{id}_d + \frac{\varepsilon^2}{r^2}\hat{F}(\theta)\right)}\, d\theta \,.$$

Hence, choosing $(\varepsilon/r)^2 = \kappa_{n,\gamma}$ gives

$$N_{B_\varepsilon(\theta^\star)}\!\left( \frac{\varepsilon}{\sqrt{\kappa_{n,\gamma}}} \right) \leq c_d (1 + c_d\Lambda\varepsilon)^d \cdot \kappa_{n,\gamma}^{\frac{d_{n,\gamma,\varepsilon}}{2}} \,,$$

which proves the assertion of the lemma.

Thanks to Lemmas 4.2 and 4.3, we can deduce Theorem 4.1. For $\xi = 4M\varepsilon/\sqrt{\kappa_{n,\gamma}}$ we find

$$\mathbb{P}\left( \sup_{\theta\in B_\varepsilon(\theta^\star)} |R(\theta)-R_n(\theta)| \geq \frac{4M\varepsilon}{\sqrt{\kappa_{n,\gamma}}} \right) \leq 2\, N_{B_\varepsilon(\theta^\star)}\!\left(\frac{\varepsilon}{\sqrt{\kappa_{n,\gamma}}}\right) \exp\left(-\frac{16\pi M^2\varepsilon^2\log n}{B^2\gamma}\right) \leq 2 c_d (1+\Lambda\varepsilon)^d \cdot \kappa_{n,\gamma}^{\frac{d_{n,\gamma,\varepsilon}}{2}} \exp\left(-\frac{16\pi M^2\varepsilon^2\log n}{B^2\gamma}\right) \,,$$

which completes the proof.

4.3. REMARKS ON THE GENERALIZATION ERROR BOUND

Ideally, the generalization bound from Theorem 4.1 should be non-vacuous. This occurs if the right-hand side is smaller than one, or equivalently, when the logarithm of the right-hand side is negative. Table 2 demonstrates that a choice of $\gamma \in (\frac{2\pi\log n}{n}, 1]$ for which the bound remains non-vacuous is attainable in practical settings, where we set $\gamma = 0.003$, but the bound could become vacuous in deeper regimes. For more details, see Appendix D.
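The sign of the logarithm of the right-hand side of Theorem 4.1 can be checked numerically. In the sketch below, $M$, $B$, $\Lambda$, $c_d$, $\varepsilon$ and the effective dimension are illustrative assumptions, not the values behind Table 2.

```python
import numpy as np

def log_rhs(n, gamma, d, d_eff, eps, M=1.0, B=1.0, Lam=1.0, c_d=1.0):
    """Log of c_d (1+Lam*eps)^d kappa^{d_eff/2} exp(-16 pi M^2 eps^2 log(n) / (B^2 gamma))."""
    kappa = gamma * n / (2 * np.pi * np.log(n))
    return (np.log(c_d) + d * np.log1p(Lam * eps)
            + 0.5 * d_eff * np.log(kappa)
            - 16 * np.pi * M**2 * eps**2 * np.log(n) / (B**2 * gamma))

# A small gamma makes the exponential term dominate, so the bound is non-vacuous...
print(log_rhs(n=60_000, gamma=0.003, d=10**5, d_eff=3_000, eps=1.0))   # negative
# ...while a larger gamma renders the same bound vacuous.
print(log_rhs(n=60_000, gamma=1.0, d=10**5, d_eff=3_000, eps=1.0))     # positive
```

Whether the bound is non-vacuous therefore hinges on $\gamma$ and $\varepsilon$ through the exponential term, which is the reason behind the small $\gamma = 0.003$ setting discussed above.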

5. EMPIRICAL RESULTS

In this section, we perform experiments to verify whether the local effective dimension captures the true behaviour of generalization error in various regimes. We use standard fully-connected feedforward neural networks with two hidden layers and vary the model size by altering the number of neurons in the hidden layers. All training was conducted with batched stochastic gradient descent, with experimental setups identical to those of (Liang et al., 2019). The details can be found in Appendix E. We consider both shallow and deep regimes by training models on the MNIST and CIFAR10 data sets, with the latter requiring far more parameters for the training to converge to zero error. Within these regimes, we conduct two experiments respectively: first, we incrementally increase the model size, train to zero error and calculate the local effective dimension, along with the generalization error; second, we replicate the experiment from (Zhang et al., 2021) by fixing the model size and randomizing the training labels by an increasing proportion, training to zero error and calculating the local effective dimension and generalization error. In all calculations, we perform simulations using the K-FAC approximation of the Fisher information from (Martens & Grosse, 2015). K-FAC crucially allows computation of the local effective dimension in very large parameter spaces, and we further exploit the block structure of this approximation for computation of the eigenvalues of the Fisher information matrix (George, 2021). Regardless of the regime and particular experiment conducted, the local effective dimension seems to move in line with the generalization error. In Figures 1(a) and 2(a), we see this for an increasing model size shown on the horizontal axes (with notably much larger models used to learn the CIFAR10 data set).
As the models get larger, they are able to perform better on the learning task at hand and their generalization error declines accordingly, as does the (normalized) local effective dimension. The error bars represent the standard deviation around the mean of 10 independent training runs. A lower normalized local effective dimension as the model size increases intuitively implies increasing redundancy, as also suggested in (Frankle & Carbin, 2018), and motivates pruning techniques (Karnin, 1990). In Figures 1(b) and 2(b), we fix the model size to d ≈ 10^5 and d ≈ 10^7, respectively. Here, the horizontal axis marks the level at which the labels of the training data have been randomized, increasing from 20% randomization to 100% in increments of 20%. At each point, we train to zero training loss or terminate at 600 epochs and plot the resulting normalized local effective dimension and generalization error. Naturally, the generalization performance worsens as we randomize more labels, since the network is fitting more and more artificially introduced noise. Interestingly, the local effective dimension captures this behaviour too, increasing with the generalization error, indicating that more and more parameters need to become "active" to fit this noise. This result is independent of the regime, deep or shallow.

6. DISCUSSION

Whilst the search for a good capacity measure continues, we believe that the local effective dimension serves as a promising candidate. Besides correlating with the generalization error in different experiments, the local effective dimension incorporates data and training dependence and does not rest on unrealistic assumptions. Its intuitive interpretation as a measure of redundancy in a model, along with the proof of a generalization error bound, suggests that the local effective dimension can explain the performance of machine learning models in various regimes. Investigation into the tightness of the generalization bound, in particular for specific model architectures and in the deep learning regime (where bounds are typically vacuous), would be beneficial in further understanding the local effective dimension's connection to generalization. Additionally, empirical analyses involving bigger models, different data sets and other training techniques/optimizers could shed more light on the practical usefulness of this promising capacity measure.

A BENEFITS OF THE LOCAL EFFECTIVE DIMENSION

When deciding what is a good measure of capacity for a model, in particular for deep neural networks, which are notoriously difficult to understand in a generalization context, it is helpful to check whether the capacity measure satisfies certain criteria, which we highlight in Table 1. The first criterion is whether the measure can be mathematically related to the generalization error via an upper bound. This is the main contribution of our work, where we show that the local effective dimension can indeed bound the generalization error. The question of whether one can obtain tighter generalization bounds using the effective dimension, for specific models, is left for future research. However, there are several interesting pieces of work that could be relevant for this direction, such as (Pennington & Worah, 2018), who investigate the spectrum of the Fisher information for a single-layer neural network with infinite width. Since the effective dimension depends largely on the eigenvalues of the Fisher matrix, this would be a convenient place to start this investigation. Additionally, the work in (Pennington & Worah, 2018) shows that a single linear layer network produces a Fisher information spectrum which converges to a Marchenko-Pastur distribution in the infinite width limit. In this setting, the generalization bound based on the effective dimension reduces to something quite trivial which depends primarily on n, since the number of data determines how many eigenvalues are counted in the local effective dimension. We hope that future studies can improve the bound we present in this work, as we do not yet explore any optimality results. The second criterion asks whether the generalization bound using the capacity measure is actually good in practice, i.e., whether the measure correlates with the generalization error in various experimental setups, such as (Zhang et al., 2021).
Through our numerical experiments in Section 5, we answer this in the affirmative. Another crucial property for capacity is scale invariance, which corresponds to the measure being insensitive to inconsequential transformations of the model, such as multiplying a neural network's weights by a constant. Since the local effective dimension is a function of the Fisher information, which is inherently scale invariant, this requirement is naturally accounted for. A good capacity measure should also account for data and training dependence, i.e. the fact that data is drawn from a distribution and one imposes a learning algorithm. Once again, the Fisher information incorporates the data distribution, and the purpose of the localization of the effective dimension is to account for training dependence. A capacity measure should also be realistic in the sense that it should allow for finite data, which is always the case in practice. The local effective dimension not only allows for finite data, but is structured for this realistic purpose to include the amount of data available as a resolution parameter. This creates a beautiful operational meaning for the local effective dimension that depends on the amount of data one has in practice. Lastly, a capacity measure should be computationally efficient to evaluate (in polynomial time in the number of data). Thanks to various approximations of the Fisher information, this too is possible for the local effective dimension and is explained in Appendix E.3.

B PROOF OF PROPOSITION 3.2

We denote the maximal rank of $F$ and $F'$ by $r$ and $r'$, respectively, and define the function

$$f(t) := \frac{1}{V_\Theta}\int_\Theta \sqrt{\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_t\big)}\, d\theta \,, \quad \text{with } G_t := t\hat{F}(\theta) + (1-t)\hat{F}'(\theta) \,. \tag{6}$$

We consider a modified version of the effective dimension, defined as

$$\tilde{d}_{n,\gamma}(F) := \frac{2\log f(1)}{\log\kappa_{n,\gamma}} + r \quad \text{and} \quad \tilde{d}_{n,\gamma}(F') := \frac{2\log f(0)}{\log\kappa_{n,\gamma}} + r' \,.$$

The triangle inequality then gives

$$|d_{n,\gamma}(F) - d_{n,\gamma}(F')| \leq |d_{n,\gamma}(F) - \tilde{d}_{n,\gamma}(F)| + |\tilde{d}_{n,\gamma}(F) - \tilde{d}_{n,\gamma}(F')| + |\tilde{d}_{n,\gamma}(F') - d_{n,\gamma}(F')| \,. \tag{7}$$

We next bound all three terms. For the first and the last one, recall (Abbas et al., 2021, Supplementary Information, Section 2) that

$$d_{n,\gamma}(F) \leq r + \frac{2}{\log\kappa_{n,\gamma}}\log\left(\frac{1}{V_\Theta}\int_\Theta \sqrt{\det(\mathrm{id}_d + \hat{F}(\theta))}\, d\theta\right) \quad \text{and} \quad d_{n,\gamma}(F) \geq r + \frac{2}{\log\kappa_{n,\gamma}}\log\left(\frac{1}{V_\Theta}\int_\Theta \sqrt{\det(\hat{F}(\theta))}\, d\theta\right) \,.$$

Recalling that

$$\psi(F) = \max\left\{ \log\frac{1}{V_\Theta}\int_\Theta \sqrt{\det(\mathrm{id}_d + \hat{F}(\theta))}\, d\theta \,,\; -\log\frac{1}{V_\Theta}\int_\Theta \sqrt{\det(\hat{F}(\theta))}\, d\theta \right\}$$

gives $|d_{n,\gamma}(F) - \tilde{d}_{n,\gamma}(F)| \leq \frac{2\psi(F)}{\log\kappa_{n,\gamma}}$ and $|d_{n,\gamma}(F') - \tilde{d}_{n,\gamma}(F')| \leq \frac{2\psi(F')}{\log\kappa_{n,\gamma}}$. It thus remains to bound the middle term in equation 7. To do so, note that

$$|\log f(1) - \log f(0)| \leq \int_0^1 \frac{|f'(t)|}{f(t)}\, dt \,. \tag{8}$$

We can bound the numerator of the integrand as

$$|f'(t)| \leq \frac{1}{V_\Theta}\int_\Theta \left| \frac{d}{dt}\sqrt{\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_t\big)} \right| d\theta \leq \frac{1}{V_\Theta}\int_\Theta C_{\theta,d}\, \|\hat{F}(\theta) - \hat{F}'(\theta)\|\, d\theta \leq C_d \max_{\theta\in\Theta} \|\hat{F}(\theta) - \hat{F}'(\theta)\| \,,$$

where the constant $C_d$ depends on $d$, $\|\sqrt{\hat{F}}\|^{d-1}$, and $\|\sqrt{\hat{F}'}\|^{d-1}$. Using the fact that $A \mapsto (\det A)^{1/d}$ is concave on the space of Hermitian positive definite matrices gives

$$\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_t\big) \geq \Big( t\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_1\big)^{1/d} + (1-t)\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_0\big)^{1/d} \Big)^d \geq t^d \det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_1\big) + (1-t)^d \det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + G_0\big) \,.$$

Hence we have $f(t) \geq t^d f(1) + (1-t)^d f(0)$. Combining this with equation 8 gives

$$|\log f(1) - \log f(0)| \leq C_d \max_{\theta\in\Theta}\|\hat{F}(\theta) - \hat{F}'(\theta)\| \int_0^1 \frac{dt}{t^d f(1) + (1-t)^d f(0)} \leq C_d \max_{\theta\in\Theta}\|\hat{F}(\theta) - \hat{F}'(\theta)\| \left( \int_0^{1/2} \frac{dt}{(1-t)^d f(0)} + \int_{1/2}^1 \frac{dt}{t^d f(1)} \right) \leq C_d \max_{\theta\in\Theta}\|\hat{F}(\theta) - \hat{F}'(\theta)\| \left( \frac{1}{f(0)} + \frac{1}{f(1)} \right) \,,$$

where $C_d$ absorbs the dimensional factors. Combining this with $|\tilde{d}_{n,\gamma}(F) - \tilde{d}_{n,\gamma}(F')| \leq \frac{2}{\log\kappa_{n,\gamma}}|\log f(1) - \log f(0)|$ almost completes the proof. The final thing to note is that

$$f(0) = \frac{1}{V_\Theta}\int_\Theta \sqrt{\det\big(\mathrm{id}_d/\sqrt{\kappa_{n,\gamma}} + \hat{F}'(\theta)\big)}\, d\theta \geq \frac{1}{V_\Theta}\int_\Theta \sqrt{\det(\hat{F}'(\theta))}\, d\theta = \varphi(F') \,,$$

and similarly $f(1) \geq \varphi(F)$.

C GENERALIZATION BOUND FOR LOG-LIPSCHITZ LOSS FUNCTIONS

In this appendix we prove a generalization of Theorem 4.1 in which the loss function is assumed to be log-Lipschitz continuous instead of Lipschitz continuous.

Theorem C.1. Consider the same setting as in Theorem 4.1 with $\varepsilon\in(1/\sqrt{n},1]$, but where the loss function is log-Lipschitz continuous with constant $M_2$ in the first argument with respect to the total variation distance. Then
$$P\bigg[\sup_{\theta\in B_\varepsilon(\theta^\star)}\big|R(\theta)-R_n(\theta)\big| \ge \frac{2M}{\sqrt{\kappa_{n,\gamma}}}\log\Big(e+\frac{\sqrt{\kappa_{n,\gamma}}}{M_2}\Big)\bigg] \le c_d\,(1+\Lambda)^d\,\kappa_{n,\gamma}^{d_{n,\gamma,\varepsilon}/2}\exp\bigg(-\frac{2nM^2}{\kappa_{n,\gamma}B^2}\log^2\Big(e+\frac{\sqrt{\kappa_{n,\gamma}}}{M_2}\Big)\bigg)\,,$$
where $M = M_1M_2$ and $\kappa_{n,\gamma}$ is defined in equation 2. To prove Theorem C.1 we need a preparatory lemma.

Lemma C.2. Under the assumptions of Theorem C.1, we have for any $\xi\in(0,1)$
$$P\bigg[\sup_{\theta\in B_\varepsilon(\theta^\star)}\big|R(\theta)-R_n(\theta)\big|\ge\xi\bigg] \le 2\,N_{B_\varepsilon(\theta^\star)}(r)\,\exp\Big(-\frac{n\xi^2}{2B^2}\Big)\,,$$
where $r = r(\xi)$ is defined as the unique value such that $2M_1M_2\,r\log\big(e+\frac{1}{M_2 r}\big) = \xi/2$ (see footnote 9).

Proof. Let $r\in P(X)$ and $q\in P(Y)$ denote the observed input and output distributions, respectively. Then, using the log-Lipschitz assumption on the loss function, we find
$$\big|R(\theta_1)-R(\theta_2)\big| = \Big|E_{r,q}\,\ell\big(p(y|x;\theta_1)r(x),\,q(y)\big) - E_{r,q}\,\ell\big(p(y|x;\theta_2)r(x),\,q(y)\big)\Big| \le E_{r,q}\,\Big|\ell\big(p(y|x;\theta_1)r(x),\,q(y)\big) - \ell\big(p(y|x;\theta_2)r(x),\,q(y)\big)\Big|$$
$$\le M_2\,E_r\bigg[\big\|p(y|x;\theta_1)r(x)-p(y|x;\theta_2)r(x)\big\|_1\,\log\Big(e+\frac{1}{\|p(y|x;\theta_1)r(x)-p(y|x;\theta_2)r(x)\|_1}\Big)\bigg]$$
$$\le M_2\,\big\|p(y|x;\theta_1)-p(y|x;\theta_2)\big\|_\infty\,\log\Big(e+\frac{1}{\|p(y|x;\theta_1)-p(y|x;\theta_2)\|_\infty}\Big) \le M_2M_1\,\|\theta_1-\theta_2\|_\infty\,\log\Big(e+\frac{1}{\|\theta_1-\theta_2\|_\infty}\Big)\,, \qquad (10)$$
where the penultimate step uses that $\mathbb{R}_+\ni x\mapsto x\log(e+1/x)$ is monotone together with Hölder's inequality, and the final step follows from the Lipschitz-continuity assumption on the model. In the same way we see that
$$\big|R_n(\theta_1)-R_n(\theta_2)\big| \le M_2M_1\,\|\theta_1-\theta_2\|_\infty\,\log\Big(e+\frac{1}{\|\theta_1-\theta_2\|_\infty}\Big)\,. \qquad (11)$$
Combining equation 10 with equation 11 gives, for $S(\theta) := R(\theta)-R_n(\theta)$,
$$\big|S(\theta_1)-S(\theta_2)\big| \le 2M_1M_2\,\|\theta_1-\theta_2\|_\infty\,\log\Big(e+\frac{1}{\|\theta_1-\theta_2\|_\infty}\Big)\,. \qquad (12)$$
Assume that $B_\varepsilon(\theta^\star)$ can be covered by $k$ subsets $B_1,\dots,B_k$, i.e. $B_\varepsilon(\theta^\star) = B_1\cup\dots\cup B_k$. Then, for any $\xi>0$,
$$P\bigg[\sup_{\theta\in B_\varepsilon(\theta^\star)}|S(\theta)|\ge\xi\bigg] = P\bigg[\bigcup_{i=1}^k\Big\{\sup_{\theta\in B_i}|S(\theta)|\ge\xi\Big\}\bigg] \le \sum_{i=1}^k P\bigg[\sup_{\theta\in B_i}|S(\theta)|\ge\xi\bigg]\,, \qquad (13)$$
where the inequality is the union bound. Finally, let $k = N(r)$ and let $B_1,\dots,B_k$ be balls of radius $r$ centered at $\theta_1,\dots,\theta_k$ covering $B_\varepsilon(\theta^\star)$. Recalling that by assumption $r = r(\xi)$ satisfies $2M_1M_2\,r\log(e+\frac{1}{M_2r}) = \xi/2$, we find for all $i=1,\dots,k$
$$P\bigg[\sup_{\theta\in B_i}|S(\theta)|\ge\xi\bigg] \le P\Big[|S(\theta_i)|\ge\frac{\xi}{2}\Big]\,. \qquad (14)$$
To see this, note that for $\|\theta-\theta_i\|\le r$, the definition of $r$, the monotonicity of $x\mapsto x\log(e+1/x)$, and inequality 12 imply $|S(\theta)-S(\theta_i)|\le\xi/2$. Hence, if $|S(\theta)|\ge\xi$, it must be that $|S(\theta_i)|\ge\xi/2$, which gives equation 14. To conclude, we apply Hoeffding's inequality, which yields
$$P\Big[|S(\theta_i)|\ge\frac{\xi}{2}\Big] = P\Big[\big|R(\theta_i)-R_n(\theta_i)\big|\ge\frac{\xi}{2}\Big] \le 2\exp\Big(-\frac{n\xi^2}{2B^2}\Big)\,. \qquad (15)$$
Combined with equation 13, we obtain
$$P\bigg[\sup_{\theta\in B_\varepsilon(\theta^\star)}|S(\theta)|\ge\xi\bigg] \le \sum_{i=1}^k P\bigg[\sup_{\theta\in B_i}|S(\theta)|\ge\xi\bigg] \le \sum_{i=1}^k P\Big[|S(\theta_i)|\ge\frac{\xi}{2}\Big] \le 2\,N(r)\exp\Big(-\frac{n\xi^2}{2B^2}\Big)\,,$$
where the second step uses equation 14 and the final step follows from equation 15, recalling that $k = N(r)$.

Proof of Theorem C.1. Choosing
$$\xi := \frac{2M}{\sqrt{\kappa_{n,\gamma}}}\log\Big(e+\frac{\sqrt{\kappa_{n,\gamma}}}{M_2}\Big) \qquad (16)$$
in Lemma C.2 gives
$$P\bigg[\sup_{\theta\in B_\varepsilon(\theta^\star)}\big|R(\theta)-R_n(\theta)\big|\ge\xi\bigg] \le 2\,N_{B_\varepsilon(\theta^\star)}\big(r(\xi)\big)\exp\Big(-\frac{n\xi^2}{2B^2}\Big) \le 2c_d\,(1+\Lambda)^d\,\kappa_{n,\gamma}^{d_{n,\gamma,\varepsilon}/2}\exp\Big(-\frac{n\xi^2}{2B^2}\Big) = 2c_d\,(1+\Lambda)^d\,\kappa_{n,\gamma}^{d_{n,\gamma,\varepsilon}/2}\exp\bigg(-\frac{2nM^2}{\kappa_{n,\gamma}B^2}\log^2\Big(e+\frac{\sqrt{\kappa_{n,\gamma}}}{M_2}\Big)\bigg)\,,$$
where the second step uses Lemma 4.3 and the final step follows from equation 16.
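The Hoeffding step in equation 15 is easy to validate numerically. The sketch below (NumPy; the uniform distribution is an arbitrary stand-in for a bounded loss, and all constants are illustrative) compares the empirical tail probability of the deviation of an empirical mean with the bound $2\exp(-n\xi^2/(2B^2))$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, xi, trials = 1.0, 200, 0.3, 20000

# i.i.d. losses bounded in [-B/2, B/2] with mean zero
samples = rng.uniform(-B / 2, B / 2, size=(trials, n))
deviations = np.abs(samples.mean(axis=1))

empirical = np.mean(deviations >= xi / 2)        # empirical tail at xi/2
hoeffding = 2 * np.exp(-n * xi**2 / (2 * B**2))  # bound from equation 15
assert empirical <= hoeffding
```

The bound is loose for well-concentrated losses, which is consistent with its role as a worst-case estimate over all distributions bounded in $[-B/2, B/2]$.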

D REMARKS ON THE GENERALIZATION ERROR BOUND

Lemma 4.2 implies that lim_{n→∞} P(sup_{θ∈B_ε(θ*)} |R(θ) − R_n(θ)| ≥ ξ) = 0 for ξ ∈ (0, 1). As a result, for the generalization bound in equation 4.1 to be meaningful, its right-hand side must vanish as n → ∞. Whether it does depends on the problem setting, in particular on the parameters ε > 1/√n and γ ∈ (2π log n / n, 1]. There is some flexibility in choosing these parameters, with a "critical" scaling obtained if ε = Ω(1/log n) (see footnote 10). In the case where ε = O(1/n^p) for p < 1/2, the generalization bound becomes vacuous for sufficiently large n, regardless of the choice of the constant γ ∈ (2π log n / n, 1] (see footnote 11). Ideally, the generalization bound from equation 4.1 should be non-vacuous. This occurs if the right-hand side is smaller than one or, equivalently, if the logarithm of the right-hand side is negative. Table 2 demonstrates that a choice of γ ∈ (2π log n / n, 1] keeping the bound non-vacuous is attainable in practical settings; there we set γ = 0.003 and plot the accompanying error bound ξ_n, the local effective dimension, and the logarithm of the right-hand side of equation 4.1, which remains negative for increasing values of n.
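To make the scaling concrete, the quantity κ_{n,γ} = γn/(2π log n) from equation 2 grows almost linearly in n for fixed γ, so the deviation threshold in the bound, which scales like 1/√κ_{n,γ}, shrinks as more data become available. A minimal sketch (illustrative values, with γ = 0.003 as in Table 2):

```python
import numpy as np

def kappa(n, gamma=0.003):
    # kappa_{n,gamma} = gamma * n / (2 * pi * log n), cf. equation 2
    return gamma * n / (2 * np.pi * np.log(n))

# The deviation threshold in the bound scales like 1 / sqrt(kappa_{n,gamma}),
# hence it decreases monotonically in n for any fixed gamma.
ns = [10**4, 10**5, 10**6, 10**7]
thresholds = [1 / np.sqrt(kappa(n)) for n in ns]
assert all(a > b for a, b in zip(thresholds, thresholds[1:]))
```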

E EXPERIMENTAL SETUP

Here, we explain the models and techniques used for the experiments in this study. All models are fully connected feedforward neural networks with leaky ReLU activation functions. We used two hidden layers in all architectures but varied the number of neurons per layer depending on the experiment. Training was performed with stochastic gradient descent with batch size 50. Where the CIFAR10 data set was used, we applied a standard transformation to the data by normalizing and center-cropping. For more details on this transformation, see (Zhang et al., 2021).
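For concreteness, the forward pass of such a network can be sketched as follows (NumPy rather than our actual training code; the width, initialization scheme, and function names are illustrative):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def init_mlp(d_in, width, d_out, rng):
    """Fully connected network with two hidden layers of equal width."""
    dims = [d_in, width, width, d_out]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, k)), np.zeros(k))
            for m, k in zip(dims[:-1], dims[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:   # no activation on the output layer
            x = leaky_relu(x)
    return x

rng = np.random.default_rng(0)
params = init_mlp(784, 100, 10, rng)                  # MNIST-sized input
logits = forward(params, rng.normal(size=(50, 784)))  # batch size 50
assert logits.shape == (50, 10)
```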

E.1 INCREASING MODEL SIZE

In Figures 1(a) and 2(a) we train feedforward neural networks on the MNIST and CIFAR10 data sets, respectively. In both cases, we plot the model size on the x-axis and incrementally increase the number of neurons in both hidden layers, thereby increasing the number of parameters in the model. For MNIST, we do not need to train very large models to achieve zero training error; thus, we vary the number of parameters from 2 × 10^4 to 10^5. For CIFAR10, on the other hand, we train models with parameter counts ranging from 5 × 10^5 to 10^7. For MNIST, the training and test split is 60000 and 10000 images, respectively; for CIFAR10, it is 50000 and 10000. We train every model for 200 epochs and plot the resulting generalization errors, approximated by the test error. We also plot the normalized local effective dimension for every model using the trained parameter set θ*. For this, we use n = 6 × 10^4 (the size of the MNIST training set) and set γ = 1. In both figures, we repeat the entire experiment 10 times with different parameter initializations and plot the average generalization error and average normalized local effective dimension over these 10 trials, with error bars depicting ±1 standard deviation around the mean. As expected, the local effective dimension declines along with the generalization error in both the shallow (MNIST) and deep (CIFAR10) regimes.
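The parameter counts quoted above follow directly from the layer widths. A small helper (illustrative; biases included) shows how the model size scales with the number of neurons per hidden layer:

```python
def mlp_param_count(d_in, width, d_out):
    """Weights and biases of a fully connected net with two equal hidden layers."""
    return (d_in * width + width        # input -> hidden 1
            + width * width + width     # hidden 1 -> hidden 2
            + width * d_out + d_out)    # hidden 2 -> output

# MNIST (784 inputs, 10 classes): modest widths already place the model in
# the 2e4 -- 1e5 parameter range used in Figure 1(a).
assert 2e4 < mlp_param_count(784, 30, 10) < 1e5
# CIFAR10 (3072 inputs): a width of 150 gives roughly 5e5 parameters.
assert 4e5 < mlp_param_count(3072, 150, 10) < 6e5
```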

E.2 RANDOMIZATION EXPERIMENT

In Figures 1(b) and 2(b) we train models on the MNIST and CIFAR10 data sets, respectively. In this experiment, the model sizes are fixed to d ≈ 10^5 for MNIST and d ≈ 10^7 for CIFAR10. What we vary is the proportion of training labels that are replaced with random labels (as originally done in (Zhang et al., 2021)). We begin by randomizing 20% of the training labels, as shown on the x-axis. We train the models to zero training error, or terminate after 600 epochs, and plot the resulting generalization error and effective dimension. Thereafter, we increase the proportion of random labels in increments of 20%, up to 100% randomization, and plot the generalization error and normalized local effective dimension after training each time. This entire process is repeated 10 times with different parameter initializations, and we plot the average generalization error and average normalized local effective dimension over increasing label randomization. Unsurprisingly, the generalization error increases with the level of randomization: the network is essentially learning to fit more and more noise and thus does not generalize well. Interestingly, the local effective dimension moves in line with this trend and also increases with increasing randomization. This can be interpreted as the network requiring more and more parameters to forcefully fit noise that would not naturally occur. Thus, the local effective dimension captures the correct generalization behaviour even in this artificial setup, where most capacity measures fail to explain generalization performance. While in the case of noisy labels deep neural networks seem to overfit regardless of their depth, it is worth mentioning other progress in understanding deep learning through experiments that probe the nature of overfitting.
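The randomization itself is straightforward to implement. A sketch (illustrative; labels are drawn uniformly, so a randomized label may coincide with the true one by chance):

```python
import numpy as np

def randomize_labels(labels, fraction, num_classes, rng):
    """Replace a given fraction of the labels with uniformly random classes."""
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return noisy

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)               # toy labels, all class 0
y_noisy = randomize_labels(y, 0.2, 10, rng)
# at most 20% of positions change; some random draws coincide with class 0
assert (y_noisy != y).sum() <= 200
```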
One particular analysis involves the study of memorization in overparameterized models, which produces a "double descent" curve rather than the traditional U-shaped risk curve (Belkin et al., 2019). The details of the double descent phenomenon are subtle and depend on several interrelated factors, such as the distribution of the data, the optimizer used (i.e., the trained parameters), and the notion of capacity employed. Given that the local effective dimension accounts for these factors through its definition, and that it captures overfitting in an overparameterized regime induced by artificial label noise, we postulate that the local effective dimension would also track the double descent risk curve. Thus far, we have conducted experiments that calculate the local effective dimension after training is complete, but the measure should intuitively provide local information around any set of parameters. One could therefore extend the analysis to the local effective dimension throughout training, which should mirror the double descent phenomenon with its various regimes of under- and overfitting, as in the work of Maddox et al. (2020), who employ a different notion of effective dimension.

E.3 ESTIMATING THE LOCAL EFFECTIVE DIMENSION

In all calculations involving the local effective dimension, two assumptions are made. First, we use a fixed parameter set θ* chosen after training to estimate the local effective dimension, assuming it is a good approximation of the average over samples in an ε-ball around θ*. In other words, we ignore the integral over B_ε(θ*) in Definition 3.4 and simply use the trained parameter set to compute the local effective dimension. In Table 3 we check the sensitivity of the local effective dimension by comparing this "midpoint" approximation to sampling in an ε-ball around θ* and conclude that the approximation is sufficiently close, which helps reduce computational time. We use the more efficient reformulation from Remark 3.3 to calculate the local effective dimension with multiple samples and take ε = 1/√n. The second assumption is that the Fisher information matrix can be approximated by the empirical Fisher information matrix. From (Liao et al., 2018), we acknowledge that this assumption does not always hold. Thus, the continuity statement from Proposition 3.2 becomes relevant to ensure that errors introduced by the empirical Fisher estimate do not propagate strongly into the calculation of the local effective dimension. Building on the empirical Fisher approximation, we further exploit the work of (Martens & Grosse, 2015) and use the Kronecker-Factored Approximate Curvature (K-FAC) Fisher matrix in the estimation of the effective dimension. The K-FAC Fisher allows us to estimate the eigenvalues of the empirical Fisher information much more efficiently for large models. We slightly extend the PyTorch (Paszke et al., 2019) implementation developed in (George, 2021); we refer the interested reader to (Martens & Grosse, 2015) for more details. The K-FAC estimate consists of several block matrices which together form a block-diagonal approximation of the empirical Fisher matrix. The blocks correspond to the hidden layers of the neural network model. Conveniently, each block can be further factorized as a tensor product of two smaller matrices. Thus, to calculate the eigenvalues of the K-FAC Fisher, it suffices to compute the eigenvalues of the block matrices, taking advantage of their tensor decomposition. We extend the PyTorch K-FAC implementation from (George, 2021) with a function that computes all the eigenvalues; the estimation of the local effective dimension then follows from equation 4. Due to these approximations and the code implementation in (George, 2021), the computational bottleneck lies solely with training the model when d is very large. The memory and time overhead of computing the local effective dimension with the K-FAC Fisher is small: it could, for example, run on a laptop with modest memory for a model with d = 10^7 in just a few minutes.
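The computational advantage rests on a standard property of Kronecker products: the eigenvalues of A ⊗ B are exactly the pairwise products of the eigenvalues of A and B. The sketch below verifies this on random symmetric factors (illustrative sizes; in K-FAC the factors would be the layer-wise covariance matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(m):
    a = rng.normal(size=(m, m))
    return a @ a.T  # symmetric positive semidefinite factor

A, B = random_psd(4), random_psd(6)
# Eigenvalues of kron(A, B) are all products lambda_i(A) * mu_j(B), so a
# 24 x 24 block only requires a 4 x 4 and a 6 x 6 eigendecomposition.
eigs_direct = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
eigs_factored = np.sort(np.outer(np.linalg.eigvalsh(A),
                                 np.linalg.eigvalsh(B)).ravel())
assert np.allclose(eigs_direct, eigs_factored)
```

For a layer whose factors have sizes p and q, this reduces an O((pq)³) eigendecomposition to O(p³ + q³), which is what makes estimating the spectrum feasible for models with d ≈ 10^7.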

E.4 ADDITIONAL EXPERIMENTS

To validate the robustness of our results under different optimizers, we conducted the same experiment as in Figure 2 using the ADAM optimizer. As was the case with stochastic gradient descent, the relationship between the generalization error and the normalized local effective dimension still holds (see Figure 3).



Footnotes:
1. A beautiful review of margin bounds is encapsulated in (Anthony & Bartlett, 2009), where lower bounds are proved for certain function classes.
2. We assume that the loss function is Lipschitz continuous, which implies that equation 1 vanishes as n → ∞.
3. A parameter is considered active if it has a sufficiently large influence on the outcome of its statistical model, i.e. varying the parameter changes the model.
4. This proposition also holds for the local effective dimension introduced in Definition 3.4.
5. We say that a function f is log-Lipschitz with constant L if |f(x) − f(y)| ≤ L |x − y| log(e + 1/|x − y|).
6. Recall that by assumption of Theorem 4.1, we have ε > 1/√n.
7. The full-rank assumption on the Fisher information matrix in Theorem 4.1 can be relaxed following the ideas from (Abbas et al., 2021, Remark 2).
8. We include results with the ADAM optimizer in the appendix.
9. With the convention that r = ∞ if M₁ = 0 or M₂ = 0.
10. In this case, the two terms on the right-hand side of equation 4.1 balance each other out and the hyperparameter γ can control the behaviour of the generalization bound as n → ∞.
11. Recall that choosing γ dependent on n would conflict with the geometric interpretation of the effective dimension (Berezniuk et al., 2020; Abbas et al., 2021).



Figure 1: MNIST (a) Normalized local effective dimension and generalization error plotted over different model sizes (standard feedforward networks with two hidden layers and varying number of neurons). The parameter n is fixed to equal the size of the training set, i.e. n = 60000. (b) Normalized local effective dimension and generalization error over different percentages of randomized labels on the training data. Here, the model size is fixed to d ≈ 10 5 .

Figure 2: CIFAR10 (a) Normalized local effective dimension and generalization error plotted over different model sizes. The number of parameters required to train CIFAR10 is far greater than the number of training samples (n = 50000). We observe that the local effective dimension moves in line with the declining generalization error as the model is made larger. (b) Normalized local effective dimension and generalization error over different percentages of randomized labels on the training data. Here, the model size is fixed to d ≈ 10 7 .

Table 2: Evaluation of the generalization bound equation 4.1 for a feedforward neural network trained on MNIST. The model sets d ≈ 10^5, ε = 1/√n, c_{Λ,d} = 2√d, and B = M = 1. Even when setting ε = 1/√n, we can still fix γ = 0.003 such that the generalization bound is non-vacuous, i.e., the RHS of equation 4.1 is ≤ 1. In fact, the log RHS of equation 4.1 is strongly negative, implying that the RHS is virtually zero. Following equation 4.1, the error bound is given by ξ_n = 4M (2π log n / (γn))^{1/2}.

Figure 3: CIFAR10 with ADAM (a) Normalized local effective dimension and generalization error plotted over different model sizes. The number of parameters required to train CIFAR10 is far greater than the number of training samples (n = 50000). We observe that the local effective dimension moves in line with the declining generalization error as the model is made larger, even when training with a different optimizer (ADAM as opposed to SGD in Figure 2). (b) Normalized local effective dimension and generalization error over different percentages of randomized labels on the training data. Here, the model size is fixed to d ≈ 10^7 and training uses the ADAM optimizer.

Theorem 4.1. Let Θ = [−1, 1]^d and consider a statistical model M_Θ := {p(·, ·; θ) : θ ∈ Θ} satisfying equation 5 such that F(θ) has full rank for all θ ∈ Θ, and ‖∇_θ log F(θ)‖ ≤ Λ for some Λ ≥ 0 and all θ ∈ Θ. Furthermore, let ℓ : P(Y) × P(Y) → [−B/2, B/2] for B > 0 be a loss function that is Lipschitz continuous with constant M₂ in the first argument with respect to the total variation distance. Then, there exists a dimensional constant c_d such that for θ* ∈ Θ, n ∈ ℕ,

Table 3: Evaluation of the local effective dimension for a feedforward neural network trained on CIFAR10 with d ≈ 10^7. We plot values of the normalized local effective dimension calculated with an increasing number of samples from an ε-ball around the trained θ*, where ε = 1/√n. The midpoint approximation uses the single θ* after training. For completeness, we include the unnormalized local effective dimension d_{n,γ,ε} and the average of the z(θ) values generated from each sample, as defined in equation 4.

