Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks

Abstract

The Hessian captures important properties of the deep neural network loss landscape. Previous works have observed low rank structure in the Hessians of neural networks. We make several new observations about the top eigenspace of layer-wise Hessians: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low rank matrices when they are reshaped into the same shape as the corresponding weight matrix. Towards formally explaining such structures of the Hessian, we show that the new eigenspace structure can be explained by approximating the Hessian using Kronecker factorization; we also prove the low rank structure for random data at random initialization for over-parametrized two-layer neural nets. Our new understanding can explain why some of these structures become weaker when the network is trained with batch normalization. The Kronecker factorization also leads to better explicit generalization bounds.

1. Introduction

The loss landscape for neural networks is crucial for understanding training and generalization. In this paper we focus on the structure of Hessians, which capture important properties of the loss landscape. For optimization, Hessian information is used explicitly in second order algorithms, and even for gradient-based algorithms properties of the Hessian are often leveraged in analysis (Sra et al., 2012). For generalization, the Hessian captures the local structure of the loss function near a local minimum, which is believed to be related to generalization gaps (Keskar et al., 2017). Several previous results including (Sagun et al., 2018; Papyan, 2018) observed interesting structures in Hessians for neural networks: the Hessian often has around c large eigenvalues, where c is the number of classes. In this paper we ask: Why does the Hessian of neural networks have special structures in its top eigenspace? A rigorous analysis of the Hessian structure would potentially allow us to understand what the top eigenspace of the Hessian depends on (e.g., the weight matrices or data distribution), as well as predict the behavior of the Hessian when the architecture changes. Towards this goal, we first focus on the structure of the top eigenspace of layer-wise Hessians. We observe that the top eigenspaces of Hessians are far from random: models trained with different random initializations still have a large overlap in their top eigenspaces, and the top eigenvectors are close to rank 1 when they are reshaped into the same shape as the corresponding weight matrix. We formalize a conjecture that allows us to understand all these structures using a Kronecker decomposition. We also analyze the Hessian in an over-parametrized two-layer neural network for random data, proving that the output Hessian is approximately rank c - 1 and its top eigenspace can be easily computed based on weight matrices.

Figure 1: Some interesting observations on the structure of layer-wise Hessians.
The eigenspace overlap is defined in Definition 4.1 and the reshape operation is defined in Definition 4.2.

Structure of Top Eigenspace for Hessians: Consider two neural networks trained with different random initializations and potentially different hyper-parameters; their weights are usually nearly orthogonal. One might expect that the top eigenspaces of their layer-wise Hessians are also very different. However, this is surprisingly false: the top eigenspaces of the layer-wise Hessians have a very high overlap, and the overlap peaks at the dimension of the layer's output (see Fig. 1a). Another interesting phenomenon is that if we express a top eigenvector of a layer-wise Hessian as a matrix with the same dimensions as the weight matrix, then the matrix is approximately rank 1. In Fig. 1b we show the singular values of several such reshaped eigenvectors.

Understanding Hessian Structure using Kronecker Factorization: We show that both of these new properties of layer-wise Hessians can be explained by a Kronecker factorization. Under a decoupling conjecture, we can approximate the layer-wise Hessian using the Kronecker product of the output Hessian and the input auto-correlation. This Kronecker approximation directly implies that the eigenvectors of the layer-wise Hessian should be approximately rank 1 when viewed as matrices. Moreover, under stronger assumptions, we can generalize the approximation to the top eigenvalues and eigenvectors of the full Hessian.

Structure of auto-correlation:

The auto-correlation of the input is often very close to a rank 1 matrix. We show that when the input auto-correlation component is approximately rank 1, the layer-wise Hessians indeed have high overlap at the dimension of the layer's output, and the spectrum of the layer-wise Hessian is similar to the spectrum of the output Hessian. In contrast, when the model is trained with batch normalization, the input auto-correlation matrix is much farther from rank 1 and the layer-wise Hessian often does not have the same low rank structure.

Structure of output Hessian: For the output Hessian, we prove that in an over-parametrized two-layer neural network on random data, the Hessian is approximately rank c - 1. Further, we can compute the top c - 1 eigenspace directly from weight matrices. We show that this calculation can be extended to multiple layers and the result has a high overlap with the actual top eigenspace of the Hessian in most settings.

Applications: As a direct application of our results, we show that the Hessian structure can be used to improve the PAC-Bayes bound computed in Dziugaite & Roy (2017).

2. Related Works

Hessian-based analysis for neural networks (NNs): Hessian matrices for NNs reflect the second order information about the loss landscape, which is important in characterizing SGD dynamics (Jastrzebski et al., 2019) and related to generalization (Li et al., 2020), robustness to adversaries (Yao et al., 2018), and interpretation of NNs (Singla et al., 2019). People have empirically observed several interesting phenomena of the Hessian, e.g., the gradient during training converges to the top eigenspace of the Hessian (Gur-Ari et al., 2018; Ghorbani et al., 2019), and the eigenspectrum of the Hessian contains a "spike" with about c - 1 large eigenvalues and a continuous "bulk" (Sagun et al., 2016, 2018; Papyan, 2018).

Understanding structures of Hessian: People have developed different frameworks to explain the low rank structure of the Hessians, including hierarchical clustering of logit gradients (Papyan, 2019, 2020), an independent Gaussian model for logit gradients (Fort & Ganguli, 2019), and the Neural Tangent Kernel (Jacot et al., 2020). These results work in different settings but are not directly comparable (among themselves and with our paper). A distinguishing feature of this work is that we can characterize the top eigenspace of the Hessian directly by the weight matrices of the network.

Theoretical analysis for large eigenvalues of Fisher Information Matrices (FIM): The FIM can be seen as a component of the Hessian matrix for neural networks. Karakida et al. (2019b) showed that the largest c eigenvalues of the FIM for a randomly initialized neural network are much larger than the others. Their results rely on the eigenvalue spectrum analysis in Karakida et al. (2019c,a), which assumes the weights used during forward propagation are drawn independently from the weights used in back propagation (Schoenholz et al., 2017).
Layer-wise Kronecker factorization for training NNs: The idea of approximating the FIM using Kronecker factorizations dates back to Heskes (2000). More recently, Martens & Grosse (2015) proposed Kronecker-factored approximate curvature (K-FAC), which approximates the inverse of the FIM using layer-wise Kronecker products and performs approximate natural gradient descent (NGD) in training NNs. The Kronecker-factored eigenbasis has also been utilized (George et al., 2018). K-FAC has been generalized to convolutional and recurrent NNs (Grosse & Martens, 2016; Martens et al., 2018), Bayesian deep learning (Zhang et al., 2018), and structured pruning (Wang et al., 2019). Unlike these previous works, which focus on accelerating computations, in this paper we use Kronecker factorization to explain the structures of Hessians.

PAC-Bayes generalization bounds: Generalization bounds for neural networks have been established under the PAC-Bayes framework introduced by McAllester (1999). This bound was further tightened by Langford & Seeger (2001), and Catoni (2007) proposed a faster-rate version. For neural networks, Dziugaite & Roy (2017) proposed the first non-vacuous generalization bound, which used the PAC-Bayesian approach with optimization to bound the generalization error for a stochastic neural network. Their bound was then extended to ImageNet scale by Zhou et al. (2019) using compression.

3. Preliminaries and Notations

Basic Notations: In this paper, we generally follow the default notation suggested by Goodfellow et al. (2016). Additionally, for a matrix M, let ||M||_F denote its Frobenius norm and ||M|| denote its spectral norm. For two matrices M ∈ R^{a1×b1}, N ∈ R^{a2×b2}, let M ⊗ N ∈ R^{(a1a2)×(b1b2)} be their Kronecker product, such that [M ⊗ N]_{(i1-1)a2+i2, (j1-1)b2+j2} = M_{i1,j1} N_{i2,j2}.

Neural Networks: We consider classification problems with cross-entropy loss. For a c-class classification problem, we are given a collection of training samples S = {(x_i, y_i)}_{i=1}^N where for all i ∈ [N], (x_i, y_i) ∈ R^d × R^c. We assume S is i.i.d. sampled from the underlying data distribution D. Consider an L-layer fully connected ReLU neural network without skip connections, f_θ : R^d → R^c. With σ(x) = x 1_{x≥0} as the Rectified Linear Unit (ReLU) function, the output of this network is a vector of logits z ∈ R^c computed recursively as

z^(p) := W^(p) x^(p) + b^(p),    (1)
x^(p+1) := σ(z^(p)).    (2)

Here x^(p) and z^(p) are called the input and output of the p-th layer, and we set x^(1) = x and z := f_θ(x) = z^(L). We denote by θ := (w^(1), b^(1), w^(2), b^(2), ..., w^(L), b^(L)) ∈ R^P the parameters of the network. In particular, w^(i) is the flattened i-th layer weight matrix W^(i) and b^(i) is its bias vector. For convolutional networks, a similar analogue is presented in Appendix A.2. For a single input x ∈ R^d with label y and logit output z, let n^(p) and m^(p) be the lengths of x^(p) and z^(p) respectively. For convolutional layers, we take m^(p) to be the number of output channels and n^(p) to be the width of the unfolded input. Note that x^(1) = x and z^(L) = z = f_θ(x). We also denote by p := softmax(z), with p_i = e^{z_i} / Σ_{j=1}^c e^{z_j}, the output confidence.
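The Kronecker index convention above can be sanity-checked numerically. The following minimal numpy sketch (not from the paper; the dimensions are arbitrary) verifies the entry formula, written with 0-based indices:

```python
import numpy as np

# Check [M ⊗ N]_{(i1-1)a2+i2, (j1-1)b2+j2} = M_{i1,j1} N_{i2,j2}, i.e.
# 0-based: K[i1*a2 + i2, j1*b2 + j2] == M[i1, j1] * N[i2, j2].
rng = np.random.default_rng(0)
a1, b1, a2, b2 = 2, 3, 4, 5
M = rng.standard_normal((a1, b1))
N = rng.standard_normal((a2, b2))
K = np.kron(M, N)  # shape (a1*a2, b1*b2)

max_err = max(
    abs(K[i1 * a2 + i2, j1 * b2 + j2] - M[i1, j1] * N[i2, j2])
    for i1 in range(a1) for j1 in range(b1)
    for i2 in range(a2) for j2 in range(b2)
)
print("max index-convention error:", max_err)
```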
With the loss function ℓ(p, y) = -Σ_{i=1}^c y_i log(p_i) ∈ R_+ being the cross-entropy loss between the softmax of the logits z = f_θ(x_i) ∈ R^c and the one-hot label y ∈ R^c, the training process of the neural network optimizes the parameter θ to minimize the empirical training loss

L(θ) = (1/N) Σ_{i=1}^N ℓ(f_θ(x_i), y_i) = E_{(x,y)∈S}[ℓ(z, y)].

Hessians: Fixing the parameter θ, we use H_ℓ(v, x) to denote the Hessian of the loss with respect to some vector v at input x:

H_ℓ(v, x) = ∇²_v ℓ(f_θ(x), y) = ∇²_v ℓ(z, y).

Note that v here can be any vector. For example, the parameter Hessian is H_ℓ(θ, x), where v = θ. The layer-wise weight Hessian of the p-th layer is H_ℓ(w^(p), x). For simplicity, define E as the empirical expectation over the training sample S unless explicitly stated otherwise. We focus on the layer-wise weight Hessians H_L(w^(p)) = E[H_ℓ(w^(p), x)] with respect to the loss, which are the diagonal blocks of the full Hessian H_L(θ) = E[H_ℓ(θ, x)] corresponding to the cross terms between the weight coefficients of the same layer. We define M^(p)_x := H_ℓ(z^(p), x) as the Hessian of the loss with respect to the output z^(p). With the notations defined above, the p-th layer-wise Hessian for a single input is

H_ℓ(w^(p), x) = ∇²_{w^(p)} ℓ(z, y) = M^(p)_x ⊗ (x^(p) x^(p)T).    (5)

It follows that H_L(w^(p)) = E[M^(p)_x ⊗ x^(p) x^(p)T] = E[M ⊗ xx^T]. The subscript x and the superscript (p) will be omitted when there is no confusion, as our analysis primarily focuses on a single layer unless otherwise stated.
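Eq. (5) can be checked numerically for the last layer of a network, where M^(p)_x has the closed form diag(p) - pp^T (derived in Appendix A.1). The sketch below is an illustration with a made-up 3-class linear-softmax layer, not the paper's code; it compares the Kronecker form against a finite-difference Hessian of the cross-entropy loss with respect to the row-major flattening of W:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W, x, y):
    return -np.log(softmax(W @ x)[y])   # cross-entropy with one-hot label y

rng = np.random.default_rng(1)
c, n = 3, 4
W = rng.standard_normal((c, n)); x = rng.standard_normal(n); y = 1

p = softmax(W @ x)
A = np.diag(p) - np.outer(p, p)         # output Hessian M for the last layer
H_kron = np.kron(A, np.outer(x, x))     # predicted layer-wise Hessian, Eq. (5)

# Finite-difference Hessian of the loss w.r.t. vec(W), row-major flattening.
eps = 1e-5
P = c * n
H_fd = np.zeros((P, P))
for i in range(P):
    for j in range(P):
        def perturb(si, sj):
            Wp = W.copy().reshape(-1)
            Wp[i] += si * eps; Wp[j] += sj * eps
            return loss(Wp.reshape(c, n), x, y)
        H_fd[i, j] = (perturb(1, 1) - perturb(1, -1)
                      - perturb(-1, 1) + perturb(-1, -1)) / (4 * eps**2)

err = np.abs(H_fd - H_kron).max()
print("max |H_fd - M ⊗ xx^T| =", err)
```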

4. Kronecker Factorization of Hessian

The fact that the layer-wise Hessian for a single sample can be decomposed into a Kronecker product of two components naturally leads to the following conjecture:

Conjecture (Decoupling Conjecture). The layer-wise Hessian can be approximated by the Kronecker product of the expectations of its two components, that is,

H_L(w^(p)) = E[M ⊗ xx^T] ≈ E[M] ⊗ E[xx^T].

In particular, the top eigenvalues and eigenspace of H_L(w^(p)) are close to those of E[M] ⊗ E[xx^T].

Note that this conjecture is certainly true when M and xx^T are approximately statistically independent. In Section 4.1 and Section 4.2 we will show that this conjecture approximately holds in practice. We then analyze the two components separately in Sections 4.3 and F.3.3. Assuming the decoupling conjecture, we can analyze the layer-wise Hessian by analyzing the two components separately. Note that E[M] is the Hessian of the loss with respect to the layer's output, and E[xx^T] is the auto-correlation matrix of the layer's inputs. For simplicity we call E[M] the output Hessian and E[xx^T] the input auto-correlation. For convolutional layers, we define a similar factorization E[M] ⊗ E[xx^T] for the layer-wise Hessian, but with a different M motivated by Grosse & Martens (2016) (see Appendix A.2). We also note that the off-diagonal blocks of the full Hessian can be decomposed similarly. We can then approximate each block using the Kronecker factorization, and when the input auto-correlation matrices are close to rank 1, this allows us to approximate the eigenvalues and eigenvectors of the full parameter Hessian. The details of this approximation are stated in Appendix B.

Experiment Setup: We conduct experiments on the CIFAR-10, CIFAR-100 (MIT) (Krizhevsky, 2009), and MNIST (CC BY-SA 3.0) (LeCun et al., 1998) datasets as well as their randomly labeled versions, namely MNIST-R and CIFAR10-R. We use the PyTorch (Paszke et al., 2019) framework for all experiments.
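The conjecture can be illustrated on synthetic data (not real network Hessians): by linearity, the approximation is exact when M is identical for every sample, and when M_i is drawn independently of x_i the approximation error shrinks with the sample size. A small numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N = 3, 4, 200

# Exact case: M is the same for every sample, so by linearity
# E[M ⊗ xx^T] = M ⊗ E[xx^T].
G = rng.standard_normal((m, m)); M = G @ G.T       # a fixed PSD "output Hessian"
xs = np.maximum(rng.standard_normal((N, n)), 0)    # nonnegative "ReLU inputs"

H_true = np.mean([np.kron(M, np.outer(x, x)) for x in xs], axis=0)
H_approx = np.kron(M, np.mean([np.outer(x, x) for x in xs], axis=0))
exact_err = np.abs(H_true - H_approx).max()

# Independent case: per-sample M_i independent of x_i; the relative error of
# the decoupled approximation decays as N grows.
Ms = [(lambda A: A @ A.T)(rng.standard_normal((m, m))) for _ in range(N)]
H_true2 = np.mean([np.kron(Mi, np.outer(x, x)) for Mi, x in zip(Ms, xs)], axis=0)
H_approx2 = np.kron(np.mean(Ms, axis=0),
                    np.mean([np.outer(x, x) for x in xs], axis=0))
rel_err = np.linalg.norm(H_true2 - H_approx2) / np.linalg.norm(H_true2)
print(exact_err, rel_err)
```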
We used several different fully connected (fc) networks (an fc network with m hidden layers and n neurons per hidden layer is denoted F-n^m), several variations of LeNet (LeCun et al., 1998), VGG11 (Simonyan & Zisserman, 2015), and ResNet18 (He et al., 2016). More representative results are included in Appendix E. The eigenvalues and eigenvectors of the exact layer-wise Hessians are approximated using a modified Lanczos algorithm (Golmant et al., 2018), which is described in detail in Appendix C. We use "layer:network" to denote a layer of a particular network. For example, conv2:LeNet5 refers to the second convolutional layer in LeNet5.

4.1. Approximation of Layer-wise Hessian and Full Hessian

We compare the top eigenvalues and eigenspaces of the approximated Hessian and the true Hessian. We use the standard definition of subspace overlap to measure the similarity between top eigenspaces.

Definition 4.1 (Subspace Overlap). For k-dimensional subspaces U, V in R^d (d ≥ k) whose basis vectors u_i and v_i are arranged as columns, with φ the size-k vector of canonical angles between U and V, we define the subspace overlap of U and V as

Overlap(U, V) := ||U^T V||_F^2 / k = ||cos φ||_2^2 / k.

Note that when k = 1, the overlap is equivalent to the squared dot product between the two vectors. As shown in Fig. 2, this approximation works reasonably well for the top eigenvalues and eigenspaces of both layer-wise weight Hessians and the full parameter Hessian.
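Definition 4.1 is straightforward to implement. A small numpy sketch (the subspaces here are hypothetical, for illustration only):

```python
import numpy as np

def subspace_overlap(U, V):
    """Definition 4.1: Overlap(U, V) = ||U^T V||_F^2 / k for orthonormal
    bases U, V in R^{d x k}; the mean squared cosine of the canonical angles."""
    k = U.shape[1]
    return np.linalg.norm(U.T @ V, 'fro') ** 2 / k

rng = np.random.default_rng(3)
d, k = 10, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal basis
V = np.eye(d)[:, k:2 * k]                         # orthogonal to eye[:, :k]

print(subspace_overlap(U, U),                     # identical subspaces -> 1
      subspace_overlap(np.eye(d)[:, :k], V))      # orthogonal subspaces -> 0
```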

4.2. Eigenvector Correspondence for Layer-wise Hessians

Suppose the i-th eigenvector of E[xx^T] is v_i and the j-th eigenvector of E[M] is u_j. Then the Kronecker product E[M] ⊗ E[xx^T] has an eigenvector u_j ⊗ v_i. Therefore, if the decoupling conjecture holds, one would expect the top eigenvectors of the layer-wise Hessian to have a clear correspondence with the top eigenvectors of its two components. Since u ⊗ v is just the flattened matrix uv^T, we may naturally define the following reshape operation.

Definition 4.2 (Layer-wise Eigenvector Matricization). Consider a layer with input dimension n and output dimension m. For an eigenvector h ∈ R^{mn} of its layer-wise Hessian, the matricized form of h is Mat(h) ∈ R^{m×n} where Mat(h)_{i,j} = h_{(i-1)n+j}.

More concretely, to demonstrate the correspondence between the eigenvectors of the layer-wise Hessian and the eigenvectors of E[M] and E[xx^T], we introduce "eigenvector correspondence matrices" as shown in Fig. 3.

Definition 4.3 (Eigenvector Correspondence Matrices). Consider a layer-wise Hessian matrix H ∈ R^{mn×mn} with eigenvectors h_1, ..., h_{mn}, and its corresponding auto-correlation matrix E[xx^T] ∈ R^{n×n} with eigenvectors v_1, ..., v_n. The correspondence between v_i and h_j is defined as

Corr(v_i, h_j) := ||Mat(h_j) v_i||_2^2.

For the output Hessian matrix E[M] ∈ R^{m×m} with eigenvectors u_1, ..., u_m, we likewise define the correspondence between u_i and h_j as

Corr(u_i, h_j) := ||Mat(h_j)^T u_i||_2^2.

We then define the eigenvector correspondence matrix between H and E[xx^T] as the n × mn matrix whose (i, j)-th entry is Corr(v_i, h_j), and the eigenvector correspondence matrix between H and E[M] as the m × mn matrix whose (i, j)-th entry is Corr(u_i, h_j). Intuitively, if the (i, j)-th entry of the correspondence matrix is close to 1, then the eigenvector h_j is likely to be the Kronecker product of v_i (or u_i) with some vector.
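The two definitions above can be sketched in a few lines of numpy. For an "ideal" eigenvector h = u ⊗ v with unit u, v (a synthetic example, not a real Hessian eigenvector), matricization recovers uv^T and the correspondence is exactly 1:

```python
import numpy as np

def mat(h, m, n):
    # Definition 4.2: row-major reshape of h in R^{mn} into an m x n matrix,
    # so that mat(u ⊗ v) = u v^T.
    return h.reshape(m, n)

def corr(v, h, m, n):
    # Definition 4.3: correspondence with an eigenvector v of E[xx^T].
    return np.linalg.norm(mat(h, m, n) @ v) ** 2

rng = np.random.default_rng(4)
m, n = 3, 5
u = rng.standard_normal(m); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
h = np.kron(u, v)                  # an "ideal" layer-wise Hessian eigenvector

print(np.allclose(mat(h, m, n), np.outer(u, v)),  # matricization = u v^T
      corr(v, h, m, n))                           # perfect correspondence -> 1
```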
If the decoupling conjecture holds, every eigenvector of the layer-wise Hessian (column of the correspondence matrices) should have a perfect correlation of 1 with exactly one of the v_i and one of the u_i. In Fig. 3 we can see that the correspondence matrices for the true layer-wise Hessian H approximately satisfy this property for the top eigenvectors. The similarity between the correspondence patterns for the true Hessian and the Kronecker-product approximated Hessian Ĥ also verifies the validity of the Kronecker approximation for the dominant eigenspace. From Fig. 3a and Fig. 3c, the top m eigenvectors of the true layer-wise Hessian and the approximated Hessian are all highly correlated with v_1, the first eigenvector of E[xx^T]. From Fig. 3b and Fig. 3d, the correspondence with the E[M] component has a near diagonal pattern for both the true Hessian and the Kronecker approximation. Thus for small i we have h_i ≈ u_i ⊗ v_1.

4.3. Structure of Input Auto-correlation Matrix E[xx^T]

To understand the structure of the auto-correlation matrix, a key observation is that the input x for most layers is the output of a ReLU, hence it is nonnegative. We can decompose the auto-correlation matrix as E[xx^T] = E[x]E[x]^T + E[(x - E[x])(x - E[x])^T]. We denote by Σ_x := E[(x - E[x])(x - E[x])^T] the auto-covariance matrix. As every sample x is nonnegative, the expectation term E[x]E[x]^T has a large norm and usually dominates the covariance matrix Σ_x. We empirically verified this phenomenon on various networks and datasets.

Figure 4: Eigenspectra of the output Hessian E[M] and the layer-wise Hessian H_L(w^(p)); the vertical axes denote the eigenvalues. The similarity between the two eigenspectra is a direct consequence of a low rank E[xx^T] and the decoupling conjecture.

Fig. 4 shows the similarity of the eigenvalue spectra of E[M] and the layer-wise Hessians in different situations, which agrees with our prediction. However, the eigengap only appears at initialization and at the minimum for true labels (Fig. 4a and Fig. 4b), but not at the minimum for random labels (Fig. 4c).
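The dominance of the mean term E[x]E[x]^T can be illustrated with a synthetic nonnegative input distribution (i.i.d. ReLU-of-Gaussian coordinates; a stand-in for real post-ReLU activations, not the paper's data):

```python
import numpy as np

# For nonnegative "post-ReLU" inputs, E[x]E[x]^T makes E[xx^T] close to
# rank 1: its top eigenvalue dominates the rest of the spectrum.
rng = np.random.default_rng(5)
d, N = 50, 20000
x = np.maximum(rng.standard_normal((N, d)), 0)   # i.i.d. ReLU(Gaussian) inputs
C = x.T @ x / N                                  # empirical E[xx^T]
eig = np.sort(np.linalg.eigvalsh(C))[::-1]

mean_part = np.outer(x.mean(0), x.mean(0))       # empirical E[x]E[x]^T
ratio = eig[0] / eig[1]                          # spectral gap of E[xx^T]
dom = np.linalg.norm(mean_part) / np.linalg.norm(C)
print("top/second eigenvalue ratio:", ratio)
print("||E[x]E[x]^T||_F / ||E[xx^T]||_F:", dom)
```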
To understand the structure of E[M] itself, we consider a simplified setting where we have a random two-layer neural network with random data.

Theorem 4.1 (informal). For a two-layer neural network with Gaussian input, at initialization, when the network is large, the output Hessian of the first layer is approximately rank c - 1, and the corresponding top eigenspace is R(W^(2)) \ {W^(2) · 1}, where R(W^(2)) denotes the row space of the weight matrix W^(2) of the second layer.

The formal statement of this theorem and the full proof are in Appendix H. The closed form calculation can be heuristically extended to the case with multiple layers: the top eigenspace of the output Hessian of the k-th layer would be approximately R(S^(k)) \ {S^(k) · 1}, where S^(k) = W^(L) W^(L-1) ··· W^(k+1) and R(S^(k)) is the row space of S^(k). Though our result is only proven for random initialization and random data, we observe that this subspace also has high overlap with the top eigenspace of the output Hessian at the minima of models trained on real datasets. The corresponding empirical results are shown in Appendix H.4.

5. Understanding Structures of Layer-wise Hessian
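The heuristic of Theorem 4.1 can be sketched numerically with synthetic weights. We read "W^(2) · 1" as the summed-rows direction W^(2)T 1 being projected out of the row space; this reading, and the weights below, are our assumptions for illustration, not the paper's code:

```python
import numpy as np

# Predict the top-(c-1) eigenspace of the first layer's output Hessian from
# the second-layer weights alone: the row space of W2 with the direction
# W2^T 1 projected out.
rng = np.random.default_rng(6)
c, m = 10, 64
W2 = rng.standard_normal((c, m))         # hypothetical second-layer weights

ones_dir = W2.T @ np.ones(c)             # summed-rows direction W2^T 1
ones_dir /= np.linalg.norm(ones_dir)
rows = W2.T - np.outer(ones_dir, ones_dir @ W2.T)  # project it out

U, s, _ = np.linalg.svd(rows, full_matrices=False)
basis = U[:, : c - 1]                    # predicted top-(c-1) eigenspace
ratio_s = s[c - 1] / s[0]                # ~0: only c-1 directions survive
orth = np.abs(basis.T @ ones_dir).max()  # basis is orthogonal to W2^T 1
print(basis.shape, ratio_s, orth)
```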

5.1. Eigenspace Overlap of Different Models

Several interesting structures of layer-wise Hessians can be explained using the decoupling conjecture and eigenvector correspondence. Consider models with the same network structure that are trained on the same dataset from different random initializations: despite no obvious similarity between their parameters, we observe surprisingly high overlap between the dominating eigenspaces of their layer-wise Hessians. Fig. 5 includes 4 different variants of LeNet5 trained on CIFAR10, 3 different variants of ResNet18 trained on CIFAR100, and 3 different variants of VGG11 trained on CIFAR100. For each structural variant, 5 models are trained from independent random initializations. We plot the average pairwise overlap between the top eigenspaces of those models' layer-wise Hessians. In each figure, we vary the number of output neurons/channels. For the same structure, the top eigenspaces of different models exhibit a highly non-trivial overlap which peaks near m, the dimension of the layer's output.

As we observed in Section 4.3, the auto-correlation matrix E[xx^T] is approximately E[x]E[x]^T. Thus, if the i-th eigenvector of E[M] is u_i, the i-th eigenvector of the layer-wise Hessian should be close to u_i ⊗ Ê[x], where Ê[x] = E[x]/||E[x]|| is the normalized E[x]. Even though the directions of the u_i can be very different for different models, at rank m these vectors always span the entire space; as a result, the top-m eigenspace of the layer-wise Hessian is close to I_m ⊗ Ê[x]. Now suppose we have two different models with Ê[x]_1 and Ê[x]_2 respectively. Their top-m eigenspaces are close to I_m ⊗ Ê[x]_1 and I_m ⊗ Ê[x]_2 respectively. In this case, it is easy to check that the overlap at m is approximately (Ê[x]_1^T Ê[x]_2)^2. Since Ê[x]_1 and Ê[x]_2 are the same for the input layer and all non-negative for other layers, the inner product between them is large, so the overlap is expected to be high at dimension m.

Our explanation of the overlap relies on two properties: the auto-correlation matrix needs to be very close to rank 1, and the eigenspectrum of the output Hessian should have a heavy tail. While both are true in many shallow networks and in later layers of deeper networks, they are not satisfied in earlier layers of deeper networks. In Appendix F.3 we explain how one can still understand the overlap using correspondence matrices when the above simplified argument does not hold.
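The claim that the overlap at dimension m equals (Ê[x]_1^T Ê[x]_2)^2 is easy to verify numerically. The sketch below uses synthetic normalized nonnegative "mean inputs" (not real activation statistics):

```python
import numpy as np

# If model k's top-m eigenspace is I_m ⊗ x_k (x_k its normalized mean input),
# then Overlap = ||U^T V||_F^2 / m = (x_1^T x_2)^2.
rng = np.random.default_rng(7)
m, n = 4, 30
x1 = np.maximum(rng.standard_normal(n), 0); x1 /= np.linalg.norm(x1)
x2 = np.maximum(rng.standard_normal(n), 0); x2 /= np.linalg.norm(x2)

U = np.kron(np.eye(m), x1.reshape(-1, 1))   # orthonormal basis of I_m ⊗ x1
V = np.kron(np.eye(m), x2.reshape(-1, 1))   # orthonormal basis of I_m ⊗ x2
overlap = np.linalg.norm(U.T @ V, 'fro') ** 2 / m

print(overlap, (x1 @ x2) ** 2)              # the two quantities coincide
```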

5.2. Dominating Eigenvectors of Layer-wise Hessian are Low Rank

A natural corollary of the Kronecker factorization approximation of layer-wise Hessians is that the eigenvectors of the layer-wise Hessians are low rank. Let h_i be the i-th eigenvector of a layer-wise Hessian. The rank of Mat(h_i) can be considered an indicator of the complexity of the eigenvector. Consider the case that h_i is one of the top eigenvectors. From Section 5.1, we have h_i ≈ u_i ⊗ Ê[x]. Thus, Mat(h_i) ≈ u_i Ê[x]^T, which is approximately rank 1. Experiments show that the first singular value of Mat(h_i) divided by its Frobenius norm, ||Mat(h_i)||_2 / ||Mat(h_i)||_F, is usually much larger than 0.5, indicating that the top eigenvectors of the layer-wise Hessians are very close to rank 1. Fig. 22 plots this ratio for i from 1 to 200; the horizontal axes denote the index i of eigenvector h_i, and the vertical axes denote ||Mat(h_i)||_2 / ||Mat(h_i)||_F.
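The rank-1 diagnostic used here is easy to reproduce on synthetic vectors (these are stand-ins, not real Hessian eigenvectors): the ratio is exactly 1 for a Kronecker-product vector and much smaller for a generic vector of the same size.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 20, 30

def top_sv_ratio(h):
    # sigma_1(Mat(h)) / ||Mat(h)||_F: equals 1 iff Mat(h) is rank 1.
    M = h.reshape(m, n)
    return np.linalg.svd(M, compute_uv=False)[0] / np.linalg.norm(M)

h_kron = np.kron(rng.standard_normal(m), rng.standard_normal(n))
h_rand = rng.standard_normal(m * n)

r1 = top_sv_ratio(h_kron)   # exactly rank 1 -> ratio 1
r2 = top_sv_ratio(h_rand)   # generic vector -> ratio well below 1
print(r1, r2)
```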

5.3. Batch Normalization and Zero-mean Input

According to our explanation, the good approximation and the high overlap of top eigenspaces both depend on the low rank structure of E[xx^T]. Moreover, the low rank structure is caused by the fact that E[x]E[x]^T dominates Σ_x in most cases. Therefore, it is natural to conjecture that training with Batch Normalization (BN) (Ioffe & Szegedy, 2015) will change these phenomena, as E[x] will be zero and E[xx^T] = Σ_x for such models. Indeed, as shown in Ghorbani et al. (2019), BN suppresses the outliers in the Hessian eigenspectrum, and Papyan (2020) provided an explanation. We experiment on our networks with BN; the results are shown in Appendix F.4. We found that E[xx^T] is no longer close to rank 1 for models trained with BN. However, E[xx^T] still has a few large eigenvalues. In this case, all the previous structures (c outliers, high eigenspace overlap, low rank eigenvectors) become weaker. The decoupling conjecture itself also becomes less accurate. However, the approximation still gives meaningful information.

6. Tighter PAC-Bayes Bound with Hessian Information

The PAC-Bayes bound is commonly used to derive upper bounds on the generalization error.

Theorem 6.1 (PAC-Bayes Bound). (McAllester, 1999; Langford & Seeger, 2001) Let the hypothesis space H be parametrized by the model parameters. For any prior distribution P on H that is chosen independently of the training set S, and any posterior distribution Q on H whose choice may depend on S, with probability 1 - δ,

D_KL(ê(Q) || e(Q)) ≤ (D_KL(Q || P) + log(|S|/δ)) / (|S| - 1),

where e(Q) is the expected classification error of the posterior over the underlying data distribution and ê(Q) is the classification error of the posterior over the training set. Intuitively, if one can find a posterior distribution Q that has low loss on the training set and is close to the prior P, then the generalization error of Q must be small.
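Turning Theorem 6.1 into a numeric error bound requires inverting the binary KL divergence, which is standard in PAC-Bayes computations. A sketch of that inversion by bisection (the numbers fed in at the end are illustrative, not the paper's results):

```python
import numpy as np

def kl_bern(q, p):
    # Binary KL divergence KL(q || p) between Bernoulli parameters.
    eps = 1e-12
    q = min(max(q, eps), 1 - eps); p = min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def pac_bayes_bound(train_err, kl, n, delta):
    """Invert Theorem 6.1 by bisection: the largest e with
    KL(train_err || e) <= (kl + log(n/delta)) / (n - 1)."""
    budget = (kl + np.log(n / delta)) / (n - 1)
    lo, hi = train_err, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bern(train_err, mid) <= budget:
            lo = mid            # mid still satisfies the inequality
        else:
            hi = mid
    return lo

# Illustrative numbers: 2% stochastic training error, KL(Q||P) = 5000 nats,
# |S| = 55000, delta = 0.035 (confidence 0.965).
b = pac_bayes_bound(train_err=0.02, kl=5000.0, n=55000, delta=0.035)
print("error bound:", b)
```

A smaller KL(Q||P), e.g. from a posterior better aligned with the flat directions of the loss, directly tightens the resulting bound.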
Dziugaite & Roy (2017) use optimization techniques to find an optimal posterior in the family of Gaussians with diagonal covariance. They showed that the bound is nonvacuous for several neural network models. We follow Dziugaite & Roy (2017) and set the prior P to be a multivariate Gaussian whose covariance is a multiple of the identity; it is thus invariant with respect to a change of basis. For the posterior, when the variance in one direction is larger, the distance to the prior decreases; however, this also risks increasing the empirical loss over the posterior. In general, one would expect the variance to be larger along a flatter direction of the loss landscape. However, since the covariance matrix of Q is fixed to be diagonal in Dziugaite & Roy (2017), the search for the optimal deviation happens in the standard basis, whose vectors are not aligned with the local loss landscape. Using the Kronecker factorization as in Eq. (7), we can approximate the layer-wise Hessian's eigenspace. We set Q to be a Gaussian whose covariance is diagonal in the eigenbasis of the layer-wise Hessians. We expect the alignment of sharp and flat directions to result in a better optimized posterior Q and thus a tighter bound on the classification error. We perform the same optimization process as proposed by Dziugaite & Roy (2017). Our algorithm is called Approx Hessian (APPR) when we fix the layer-wise Hessian eigenbasis to the one at the minimum, and Iterative Hessian (ITER) when we update the eigenbasis dynamically with the mean of the Gaussian being optimized. To accelerate Iterative Hessian, we use the generalization of Theorem 4.1 to directly approximate the output Hessian, which is then used to compute the eigenbasis. We call this variant Iterative Hessian with approximated output Hessian (ITER.M). The results of this variant are only slightly worse than Iterative Hessian, which also suggests the approximation of the output Hessian is reasonable.
We used the same datasets, network structures, and experiment settings as in Dziugaite & Roy (2017), with a few adjustments in hyperparameters. We also added T-200 2 used in Section 4. T-600 10 and T-200 2 10 are trained on standard MNIST while all others are trained on MNIST-2 (see Appendix D.1). The results are shown in Table 1, with a confidence level of 0.965. We also plot the final posterior variance s for network T-200 2 10 in Fig. 7; the horizontal axis denotes the eigenbasis ordered by decreasing eigenvalue, and the algorithm abbreviations are the same as in Table 1. For our algorithms APPR, ITER, and ITER.M, we can see that directions associated with larger eigenvalues have smaller variances. This agrees with our presumption that top eigenvectors are aligned with sharper directions and should have smaller variance after optimization. A detailed algorithm description and experiment results are given in Appendix G.

7. Conclusions

In this paper we identified two new surprising structures in the dominating eigenspace of layer-wise Hessians for neural networks. Specifically, the eigenspace overlap reveals a novel similarity between different models. We showed that under a decoupling conjecture, these structures can be explained by a Kronecker factorization. We analyzed each component of the factorization; in particular, we proved that the output Hessian is approximately rank c - 1 for random two-layer neural networks. Our proof gives a simple heuristic formula to estimate the top eigenspace of the Hessian. As a proof of concept, we showed that these structures can be used to find better explicit generalization bounds. Since the dominating eigenspace of the Hessian, which corresponds to the sharpest directions of the loss landscape, plays an important role in both generalization (Keskar et al., 2017; Jiang et al., 2020) and optimization (Gur-Ari et al., 2018), we believe our new understanding can benefit both fields. We hope this work will be a starting point towards formally proving the structures of neural network Hessians.

Limitations and Open Problems: Most of our work focuses on the layer-wise Hessian, except for Section 4.1 and Appendix B. The eigenspace overlap phenomenon depends on properties of the auto-correlation and output Hessian, which are weaker for earlier layers of larger networks. Our theoretical results assume the parameters are random, and only apply to fully-connected networks. The immediate open problems include why the decoupling conjecture is correct and why the output Hessian is low rank in more general networks.

Appendix Roadmap

1. In Appendix A, we provide the detailed derivations of the Hessian for fully-connected and convolutional layers.
2. In Appendix B, we provide a detailed description of the approximation of the dominating eigenvectors of the full Hessian.
3. In Appendix C, we explain how we compute the eigenvalues and eigenvectors of the full and layer-wise Hessians numerically.
4. In Appendix D, we give the detailed experiment setups, including the datasets, network structures, and the training settings we use.
5. In Appendix E, we provide detailed experimental results that are not fully included in the main text.
6. In Appendix F, we provide additional and more general explanations of the phenomena we found.
7. In Appendix G, we give a detailed description of the PAC-Bayes bound that we optimize and the algorithm we use to optimize the bound.
8. In Appendix H, we provide the complete statement and proof of Theorem 4.1.

A Detailed Derivations

A.1 Derivation of Hessian For an input x with label y, we define the Hessian of single input loss with respect to vector v as H (v, x) = ∇ 2 v (f θ (x), y) = ∇ 2 v (z x , y). We define the Hessian of loss with respect to v for the entire training sample as H L (v) = ∇ 2 v L(θ) = N i=1 ∇ 2 v (f θ (x i ), y i ) = N i=1 H (v, x i ) = E [H (v, x)] . ( ) We now derive the Hessian for a fixed input label pair (x, y). Following the definition and notations in Section 3, we also denote output as z = f θ (x). We fix a layer p for the layer-wise Hessian. Here the layer-wise weight Hessian is H (w (p) , x). We also have the output for the layer as z (p) . Since w (p) only appear in the layer but not the subsequent layers, we can consider z = f θ (x) = g θ (z (p) (w, x)) where g θ only contains the layers after the p-th layer and does not depend on w (p) . Thus, using the Hessian Chain rule (Skorski, 2019) , we have H (w (p) , x) = ∂z (p) ∂w (p) T H (z (p) , x) ∂z (p) ∂w (p) + m (p) i=1 ∂ (z, y) ∂z (p) i ∇ 2 w (p) z (p) i , (p) i is the ith entry of z (p) and m (p) is the number of neurons in p-th layer (size of z (p) ). Since p) and w (p) = vec(W (p) ) we have z (p) = W (p) x (p) + b ( ∂z (p) ∂w (p) = I m (p) ⊗ x (p)T . ( ) Since ∂z (p) ∂w (p) does not depend on w (p) , for all i we have ∇ 2 w (p) z (p) i = 0. Thus, H (w (p) , x) = I m (p) ⊗ x (p) H (z (p) , x) I m (p) ⊗ x (p)T . We define M (p) x = H (z (p) , x) as in Section 3 so that H (w (p) , x) = I m (p) ⊗ x (p) M (p) x I m (p) ⊗ x (p)T = M (p) x ⊗ x (p) x (p)T . ( ) We now look into M (p) x = H (z (p) , x). Again we have z = g θ (z (p) ) and can use chain rule here, H (z (p) , x) = ∂z ∂z (p) T H (z, x) ∂z ∂z (p) + c i=1 ∂ (z, y) ∂z i ∇ 2 z (p) z i (18) By letting p := softmax(z) be the output confidence vector, we define the Hessian with respect to output logit z as A x and have A x := H (z, x) = ∇ 2 z l(z, y) = diag(p) -pp T , according to Singla et al. (2019) . 
We also define the Jacobian of $z$ with respect to $z^{(p)}$ (informally, the logit gradient for layer $p$) as $G^{(p)}_x := \partial z / \partial z^{(p)}$. For FC layers with ReLUs, we can treat the ReLU after the $p$-th layer as multiplying $z^{(p)}$ entrywise by the indicator $\mathbb{1}_{z^{(p)} > 0}$. To express this as a matrix multiplication, we collect the indicators into a diagonal matrix

$$D^{(p)} := \mathrm{diag}\left(\mathbb{1}_{z^{(p)} > 0}\right).$$

Thus the input of the next layer is $x^{(p+1)} = D^{(p)} z^{(p)}$. The FC layers can then be viewed as a sequence of matrix multiplications, and the final output is

$$z = W^{(L)} D^{(L-1)} W^{(L-1)} D^{(L-2)} \cdots W^{(p+1)} D^{(p)} z^{(p)}. \quad (21)$$

Thus,

$$G^{(p)}_x = \frac{\partial z}{\partial z^{(p)}} = W^{(L)} D^{(L-1)} W^{(L-1)} D^{(L-2)} \cdots W^{(p+1)} D^{(p)}. \quad (22)$$

Since $G^{(p)}_x$ is independent of $z^{(p)}$, we have

$$\nabla^2_{z^{(p)}} z_i = 0 \quad \text{for all } i. \quad (23)$$

Thus,

$$M^{(p)}_x = H(z^{(p)}, x) = G^{(p)T}_x A_x G^{(p)}_x. \quad (24)$$

Moreover, the Hessian of the loss with respect to the bias term $b^{(p)}$ equals the Hessian with respect to the output $z^{(p)}$ of that layer. We thus have $H(b^{(p)}, x) = M^{(p)}_x = G^{(p)T}_x A_x G^{(p)}_x$. The Hessians of the loss over the entire training sample are simply the empirical expectations of the single-input Hessians:

$$H_L(w^{(p)}) = \mathbb{E}\left[H(w^{(p)}, x)\right] = \mathbb{E}\left[M^{(p)}_x \otimes x^{(p)} x^{(p)T}\right],$$
$$H_L(b^{(p)}) = H_L(z^{(p)}) = \mathbb{E}\left[M^{(p)}_x\right] = \mathbb{E}\left[G^{(p)T}_x A_x G^{(p)}_x\right]. \quad (26)$$

Note that we can further decompose $A_x = Q_x^T Q_x$, where

$$Q_x = \mathrm{diag}(\sqrt{p})\left(I_c - \mathbb{1}_c\, p^T\right), \quad (28)$$

with $\mathbb{1}_c$ the all-ones vector of size $c$, as proved in Papyan (2019). We can further extend the closed-form expression to the off-diagonal blocks and the bias entries to get the full Gauss-Newton term of the Hessian. Let

$$F_x^T = \begin{pmatrix} G^{(1)T}_x \otimes x^{(1)} & G^{(1)T}_x \\ G^{(2)T}_x \otimes x^{(2)} & G^{(2)T}_x \\ \vdots & \vdots \\ G^{(L)T}_x \otimes x^{(L)} & G^{(L)T}_x \end{pmatrix}.$$

The full Hessian is then given by

$$H_L(\theta) = \mathbb{E}\left[F_x^T A_x F_x\right] + \mathbb{E}\left[\sum_{i=1}^c \frac{\partial \ell(z, y)}{\partial z_i} \nabla^2_\theta z_i\right]. \quad (30)$$
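The decomposition $A_x = Q_x^T Q_x$ with $Q_x = \mathrm{diag}(\sqrt{p})(I_c - \mathbb{1}_c p^T)$, and the resulting rank deficiency of $A_x$, can be checked in a few lines (the logits below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
c = 10
z = rng.normal(size=c)                     # stand-in logits
p = np.exp(z - z.max()); p /= p.sum()      # softmax confidence vector

A = np.diag(p) - np.outer(p, p)            # Hessian of CE loss w.r.t. the logits
Q = np.diag(np.sqrt(p)) @ (np.eye(c) - np.outer(np.ones(c), p))

assert np.allclose(Q.T @ Q, A)             # A = Q^T Q
assert np.allclose(A @ np.ones(c), 0)      # A annihilates the all-ones direction
assert np.linalg.matrix_rank(A) == c - 1   # so rank(A) = c - 1
```

The all-ones null direction reflects the shift invariance of softmax: adding a constant to every logit leaves the loss unchanged.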

A.2 Approximating Weight Hessian of Convolutional Layers

The approximation of the weight Hessian of a convolutional layer is a direct extension of the approximation of the Fisher information matrix of convolutional layers by Grosse & Martens (2016). Consider a two-dimensional convolutional layer with $n$ input channels and $m$ output channels. Let its input feature map $\mathcal{X}$ have shape $(n, X_1, X_2)$ and its output feature map $\mathcal{Z}$ have shape $(m, P_1, P_2)$. Let its convolution kernel have size $K_1 \times K_2$. Then the weight $\mathcal{W}$ has shape $(m, n, K_1, K_2)$ and the bias $b$ has shape $(m)$. Let $P$ be the number of patches the convolution kernel slides over; we have $P = P_1 P_2$. Following Dangel et al. (2020), we define $Z \in \mathbb{R}^{m \times P}$ as the reshaped matrix of $\mathcal{Z}$ and $W \in \mathbb{R}^{m \times nK_1K_2}$ as the reshaped matrix of $\mathcal{W}$. Define $B \in \mathbb{R}^{m \times P}$ by broadcasting $b$ to $P$ dimensions. Let $X \in \mathbb{R}^{nK_1K_2 \times P}$ be the unfolded $\mathcal{X}$ with respect to the convolutional layer. The unfold operation (Paszke et al., 2019) is commonly used in computation to express convolution as a matrix operation. After these transformations, we have a linear expression for the $p$-th convolutional layer analogous to that of FC layers:

$$Z^{(p)} = W^{(p)} X^{(p)} + B^{(p)}. \quad (31)$$

We again omit the superscript $(p)$ on dimensions for simplicity, and denote by $z^{(p)}$ the vector form of $Z^{(p)}$, which has size $mP$. Analogously to Eq. (17) for fully connected layers, we have

$$H(w^{(p)}, X) = \left(I_m \otimes X^{(p)}\right) M^{(p)}_x \left(I_m \otimes X^{(p)T}\right), \quad (32)$$

where $M^{(p)}_x = H(z^{(p)}, X)$ is an $mP \times mP$ matrix. Since convolutional layers can also be viewed as linear operations (matrix multiplication with reshaping), just like FC layers and ReLUs, Eq. (23) still holds. Thus we still have $M^{(p)}_x = H(z^{(p)}, X) = G^{(p)T}_x A_x G^{(p)}_x$, where $G^{(p)}_x = \partial z / \partial z^{(p)}$ has dimension $c \times mP$, although it can no longer be decomposed as a product of weight matrices as in the FC case. However, for convolutional layers $X^{(p)}$ is a matrix instead of a vector, so we cannot put Eq. (32) into the form of a Kronecker product as in Eq. (17). Despite this, it is still possible to obtain a Kronecker factorization of the weight Hessian of the form

$$H(w^{(p)}, X) \approx \tilde{M}^{(p)}_x \otimes X^{(p)} X^{(p)T},$$

using a further approximation motivated by Grosse & Martens (2016). Note that $\tilde{M}^{(p)}_x$ must have a different shape ($m \times m$) from $M^{(p)}_x$ ($mP \times mP$), since $H(w^{(p)}, X)$ is $mnK_1K_2 \times mnK_1K_2$ and $X^{(p)} X^{(p)T}$ is $nK_1K_2 \times nK_1K_2$.
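The unfold (im2col) operation that turns the convolution into the matrix product $Z = WX$ can be sketched as follows; this toy stride-1, no-padding implementation with hypothetical shapes verifies the equivalence against a direct convolution:

```python
import numpy as np

def unfold(X, K1, K2):
    """im2col: (n, H, W) feature map -> (n*K1*K2, P) patch matrix, stride 1, no padding."""
    n, H, Wd = X.shape
    P1, P2 = H - K1 + 1, Wd - K2 + 1
    cols = np.empty((n * K1 * K2, P1 * P2))
    for i in range(P1):
        for j in range(P2):
            patch = X[:, i:i + K1, j:j + K2]        # (n, K1, K2) window
            cols[:, i * P2 + j] = patch.flatten()
    return cols

rng = np.random.default_rng(2)
n, m, K1, K2 = 3, 5, 3, 3                           # stand-in channel and kernel sizes
X = rng.normal(size=(n, 8, 8))
W = rng.normal(size=(m, n, K1, K2))

# convolution as a single matrix multiply: Z = W_mat @ unfold(X)
Z = W.reshape(m, -1) @ unfold(X, K1, K2)            # (m, P)

# reference: direct (cross-correlation) convolution
P1, P2 = 8 - K1 + 1, 8 - K2 + 1
Z_ref = np.empty((m, P1 * P2))
for o in range(m):
    for i in range(P1):
        for j in range(P2):
            Z_ref[o, i * P2 + j] = np.sum(W[o] * X[:, i:i + K1, j:j + K2])

assert np.allclose(Z, Z_ref)
```

In PyTorch the same reshaping is provided by `torch.nn.Unfold`; the loop version here is only meant to make the patch-matrix layout explicit.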

Since we can further decompose $A_x = Q_x^T Q_x$, we have

$$M^{(p)}_x = G^{(p)T}_x A_x G^{(p)}_x = \left(Q_x G^{(p)}_x\right)^T \left(Q_x G^{(p)}_x\right).$$

We define $N^{(p)}_x = Q_x G^{(p)}_x$. Here $Q_x$ is $c \times c$ and $G^{(p)}_x$ is $c \times mP$, so $N^{(p)}_x$ is $c \times mP$. We can reshape $N^{(p)}_x$ into a $cP \times m$ matrix $\tilde{N}^{(p)}_x$, and then reduce $M^{(p)}_x$ ($mP \times mP$) to an $m \times m$ matrix as

$$\tilde{M}^{(p)}_x = \frac{1}{P} \tilde{N}^{(p)T}_x \tilde{N}^{(p)}_x.$$

The scalar $1/P$ is a normalization factor, since we squeeze a dimension of size $P$ into size 1. Thus we obtain a Kronecker factorization approximation analogous to the FC case:

$$H_L(w^{(p)}) = \mathbb{E}\left[H(w^{(p)}, X)\right] = \mathbb{E}\left[\left(I_m \otimes X^{(p)}\right) M^{(p)}_x \left(I_m \otimes X^{(p)T}\right)\right] \quad (37)$$
$$\approx \mathbb{E}\left[\tilde{M}^{(p)}_x \otimes X^{(p)} X^{(p)T}\right] \approx \mathbb{E}\left[\tilde{M}^{(p)}_x\right] \otimes \mathbb{E}\left[X^{(p)} X^{(p)T}\right].$$

B Structure of Dominating Eigenvectors of the Full Hessian

Although it is not possible to apply the Kronecker factorization to the full Hessian directly, we can construct an approximation of its top eigenvectors and eigenspace using similar ideas and our findings. In this section, we always keep the superscript $(p)$ on layer-wise matrices and vectors to distinguish them from their full versions. As shown in Appendix A.1, Eq. (30), the full Hessian of a fully connected network is

$$H_L(\theta) = \mathbb{E}\left[F_x^T A_x F_x\right] + \mathbb{E}\left[\sum_{i=1}^c \frac{\partial \ell(z, y)}{\partial z_i} \nabla^2_\theta z_i\right], \quad (39)$$

where

$$F_x^T = \begin{pmatrix} G^{(1)T}_x \otimes x^{(1)} & G^{(1)T}_x \\ G^{(2)T}_x \otimes x^{(2)} & G^{(2)T}_x \\ \vdots & \vdots \\ G^{(L)T}_x \otimes x^{(L)} & G^{(L)T}_x \end{pmatrix}.$$

To simplify the formula, we define the extended input of the $p$-th layer as $\tilde{x}^{(p)} = \binom{x^{(p)}}{1}$, that is, $x^{(p)}$ with a 1 appended. The terms in the Hessian attributed to the bias can then be folded into the Kronecker product with the extended input, and $F_x^T$ simplifies to

$$F_x^T = \begin{pmatrix} G^{(1)T}_x \otimes \tilde{x}^{(1)} \\ G^{(2)T}_x \otimes \tilde{x}^{(2)} \\ \vdots \\ G^{(L)T}_x \otimes \tilde{x}^{(L)} \end{pmatrix}. \quad (42)$$

As discussed in several previous works (Sagun et al., 2016; Papyan, 2018, 2019; Fort & Ganguli, 2019), the full Hessian can be decomposed into the G-term and the H-term. Specifically, the G-term is $\mathbb{E}[F_x^T A_x F_x]$ and the H-term is $\mathbb{E}[\sum_{i=1}^c \frac{\partial \ell(z,y)}{\partial z_i} \nabla^2_\theta z_i]$ in Eq. (39).
Empirically, the G-term usually dominates the H-term, and the top eigenvalues and eigenspace of the Hessian are mainly attributed to the G-term. Since we focus on the top eigenspace, we approximate the full Hessian by the G-term: $H_L(\theta) \approx \mathbb{E}[F_x^T A_x F_x]$. In our approximation of the layer-wise Hessian $H_L(w^{(p)})$ (Eq. (6)), the two Kronecker factors are the layer-wise output Hessian $\mathbb{E}[M^{(p)}_x]$ and the auto-correlation matrix of the input $\mathbb{E}[x^{(p)} x^{(p)T}]$. Although we cannot apply a Kronecker factorization to $\mathbb{E}[F_x^T A_x F_x]$, we can still approximate its eigenspace using the eigenspace of the full output Hessian.

Note that the full output Hessian is not a standard definition. Let $\tilde{m} = \sum_{p=1}^L m^{(p)}$ be the sum of the output dimensions of all layers. We define the full output vector $\tilde{z} \in \mathbb{R}^{\tilde{m}}$ by concatenating all the layer-wise outputs:

$$\tilde{z} := \begin{pmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(L)} \end{pmatrix}.$$

We then define the full output Hessian as the Hessian with respect to $\tilde{z}$. Let the full output Hessian for a single input $x$ be $\tilde{M}_x \in \mathbb{R}^{\tilde{m} \times \tilde{m}}$. Similar to Eq. (26), it can be expressed as

$$\tilde{M}_x := H(\tilde{z}, x) = \tilde{G}_x^T A_x \tilde{G}_x, \quad \text{where} \quad \tilde{G}_x^T = \begin{pmatrix} G^{(1)T}_x \\ G^{(2)T}_x \\ \vdots \\ G^{(L)T}_x \end{pmatrix},$$

similar in structure to Eq. (42). The full output Hessian over the entire training sample is thus $H_L(\tilde{z}) = \mathbb{E}[\tilde{M}_x] = \mathbb{E}[\tilde{G}_x^T A_x \tilde{G}_x]$.

We can then approximate the eigenvectors of the full Hessian $H_L(\theta)$ using the eigenvectors of $\mathbb{E}[\tilde{M}_x]$. Let the $i$-th eigenvector of $H_L(\theta)$ be $v_i$ and that of $\mathbb{E}[\tilde{M}_x]$ be $u_i$. We break $u_i$ into segments corresponding to the layers,

$$u_i = \begin{pmatrix} u^{(1)}_i \\ u^{(2)}_i \\ \vdots \\ u^{(L)}_i \end{pmatrix},$$

where $u^{(p)}_i \in \mathbb{R}^{m^{(p)}}$ for every layer $p$. Motivated by the relation between $\tilde{G}_x$ and $F_x$, the $i$-th eigenvector of $H_L(\theta)$ can then be approximated as follows. Let

$$w_i = \begin{pmatrix} u^{(1)}_i \otimes \mathbb{E}[\tilde{x}^{(1)}] \\ u^{(2)}_i \otimes \mathbb{E}[\tilde{x}^{(2)}] \\ \vdots \\ u^{(L)}_i \otimes \mathbb{E}[\tilde{x}^{(L)}] \end{pmatrix}.$$
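The construction of the approximate eigenvectors $w_i$, concatenating $u^{(p)}_i \otimes \mathbb{E}[\tilde{x}^{(p)}]$ over the layers, normalizing, and orthonormalizing via Gram-Schmidt, can be sketched with random stand-ins for the $u_i$ and the mean inputs (all dimensions below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
dims_out = [6, 5, 4]                 # stand-in m^(p): layer output dimensions
dims_in = [3, 6, 5]                  # stand-in input dimensions (before bias extension)

# random stand-ins for eigenvectors u_i of the full output Hessian and for E[x~^(p)]
m_total = sum(dims_out)
U = np.linalg.qr(rng.normal(size=(m_total, m_total)))[0]    # columns play the role of u_i
mean_inputs = [rng.normal(size=d + 1) for d in dims_in]     # +1 for the appended bias entry

def approx_eigvec(u):
    """w = concat_p( u^(p) kron E[x~^(p)] ), then normalize."""
    segs, start = [], 0
    for mp, ex in zip(dims_out, mean_inputs):
        segs.append(np.kron(u[start:start + mp], ex))
        start += mp
    w = np.concatenate(segs)
    return w / np.linalg.norm(w)

k = 5
Wk = np.stack([approx_eigvec(U[:, i]) for i in range(k)], axis=1)
# Gram-Schmidt (here via QR) gives an orthonormal basis of the approximated eigenspace
basis = np.linalg.qr(Wk)[0]
assert np.allclose(basis.T @ basis, np.eye(k), atol=1e-8)
```

In the actual pipeline the $u_i$ come from the eigendecomposition of $\mathbb{E}[\tilde{M}_x]$ and the mean inputs from averaging layer activations over the training set; only the assembly mechanics are shown here.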
We then have $v_i \approx w_i / \|w_i\|$. We can use the Gram-Schmidt process to obtain basis vectors of the approximated eigenspace. Another justification for this approximation is that the expectation of each layer's input $\mathbb{E}[x^{(p)}]$ dominates its covariance, as shown in Appendix E.1. Thus the approximation is accurate for the top eigenvectors and the top eigenspace. For later eigenvectors the approximation is less accurate, since it discards all information in the covariance of the inputs. We also approximate the eigenvalues with this method. Let the $i$-th eigenvalue of $H_L(\theta)$ be $\lambda_i$ and that of $\mathbb{E}[\tilde{M}_x]$ be $\sigma_i$. We have

$$\lambda_i \approx \sigma_i \|w_i\|^2.$$

Below we show the approximations of the eigenvalues and top eigenspaces obtained with this method. The eigenspace overlap is defined as in Definition 4.1. We experimented on several fully connected networks; the results shown below are for F-200^2 (same as Fig. 2d in the main text), F-200^4, F-600^4, and F-600^8, all with dimension 50.

C Numerical Computation of Hessian Eigenvalues and Eigenvectors

Using the Kronecker factorization approximation, we can approximate all $mn$ eigenvectors of the layer-wise Hessian, and all these calculations can be done directly. However, it is almost prohibitive to calculate the true Hessian explicitly. Thus, we use numerical methods with automatic differentiation (Paszke et al., 2017). We mainly use the package of Golmant et al. (2018) with the Lanczos method in most of the calculations, and the package of Yao et al. (2019) as a reference. For layer-wise Hessians, we modified the Golmant et al. (2018) package. In particular, the package relies on computing the Hessian-vector product $Hv$, where $v$ is a vector of the same size as the parameter $\theta$. To compute eigenvalues and eigenvectors of the layer-wise Hessian at the $p$-th layer, we split $v$ by layer, keep only the part corresponding to the weights of the $p$-th layer, and set all other entries to 0. Note that the dimension does not change.
We denote the resulting vector by $v^{(p)}$ and compute $u = Hv^{(p)}$ using automatic differentiation. We then apply the same masking operation to $u$ to obtain $u^{(p)}$.
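The masked Hessian-vector product described above acts exactly as the diagonal block of the full Hessian corresponding to the chosen layer. The following sketch mimics the masking with an explicit toy matrix (in practice $Hv$ comes from automatic differentiation, never from a materialized $H$) and runs power iteration to extract the top layer-wise eigenpair; the two-layer parameter split is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(4)
# toy symmetric PSD "full Hessian" over a parameter vector split into two layers
d1, d2 = 6, 4
G = rng.normal(size=(d1 + d2, d1 + d2))
H = G @ G.T

def layerwise_hvp(v_p, layer):
    """Hv with v zeroed outside the chosen layer, restricted back to that layer."""
    v = np.zeros(d1 + d2)
    sl = slice(0, d1) if layer == 0 else slice(d1, d1 + d2)
    v[sl] = v_p
    return (H @ v)[sl]          # equals the diagonal block of H acting on v_p

def top_eigpair(layer, iters=2000):
    dim = d1 if layer == 0 else d2
    v = rng.normal(size=dim); v /= np.linalg.norm(v)
    for _ in range(iters):      # power iteration using only HVPs
        u = layerwise_hvp(v, layer)
        v = u / np.linalg.norm(u)
    return v @ layerwise_hvp(v, layer), v

lam, v = top_eigpair(0)
lam_ref = np.linalg.eigvalsh(H[:d1, :d1]).max()
assert np.isclose(lam, lam_ref, rtol=1e-4)
```

The paper uses Lanczos rather than plain power iteration, but both need only the masked HVP oracle, which is why the modification to the Golmant et al. (2018) package suffices.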

D Detailed Experiment Setup D.1 Datasets

We conduct experiments on CIFAR-10, CIFAR-100 (MIT) (Krizhevsky, 2009) (https://www.cs.toronto.edu/~kriz/cifar.html) and MNIST (CC BY-SA 3.0) (LeCun et al., 1998) (http://yann.lecun.com/exdb/mnist/). The datasets are downloaded through torchvision (Paszke et al., 2019) (https://pytorch.org/vision/stable/index.html). We used their default splits into training and test sets. To compare our work on the PAC-Bayes bound with that of Dziugaite & Roy (2017), we created a custom dataset MNIST-2 by setting the label of images 0-4 to 0 and 5-9 to 1. We also created randomly-labeled datasets MNIST-R and CIFAR10-R by randomly labeling the images from the training sets of MNIST and CIFAR10. The dataset information is summarized in the table.

All the datasets (MNIST, CIFAR-10, and CIFAR-100) we used are publicly available. According to their descriptions of contents and collection methods, they should not contain any personal information or offensive content. MNIST is a remix of datasets from the National Institute of Standards and Technology (NIST), which obtained consent for collecting the data. However, we also note that CIFAR-10 and CIFAR-100 are subsets of the dataset 80 Million Tiny Images (Torralba et al., 2008) (http://groups.csail.mit.edu/vision/TinyImages/), which used automatic collection and includes some offensive images.
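The relabelings behind MNIST-2 and the random-label datasets amount to one-line transforms; a sketch with a stand-in label array:

```python
import numpy as np

rng = np.random.default_rng(5)
labels = rng.integers(0, 10, size=1000)        # stand-in for MNIST training labels

# MNIST-2: digits 0-4 -> class 0, digits 5-9 -> class 1
labels_binary = (labels >= 5).astype(int)
assert set(labels_binary.tolist()) <= {0, 1}

# MNIST-R / CIFAR10-R: labels redrawn uniformly at random, independent of the images
labels_random = rng.integers(0, 10, size=labels.shape)
```

The random labels are drawn once and then held fixed for the whole training run, so the network must memorize them.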

D.2 Network Structures

Fully Connected Network: We used several fully connected networks varying in the number of hidden layers and the number of neurons in each hidden layer. The output of every layer except the last is passed through a ReLU before being fed into the subsequent layer. As described in Section 4.1, we denote a fully connected network with m hidden layers of n neurons each by F-n^m. For networks without uniform layer width, we use a sequence of numbers (e.g., a network with three hidden layers, where the first two layers have 200 neurons each and the third has 100 neurons, is denoted F-200^2-100). For example, the structure of F-200^2 is shown in Table 3.

LeNet5: We adopted the LeNet5 structure proposed by LeCun et al. (1998) for MNIST, and slightly modified the input convolutional layer to accommodate inputs from the CIFAR-10 dataset. The standard LeNet5 structure we used in the experiments is shown in Table 4. We further modified the dimensions of fc1 and conv2 to create several variants for the experiment in Section 5.1. For example, we denote the model whose first fully connected layer is adjusted to have 80 neurons by LeNet5-(fc1-80).

Networks with Batch Normalization: In Appendix F.4 we conducted several experiments on the effect of batch normalization on our results. For those experiments, we take the existing structures and add a batch normalization layer to each intermediate output after it passes through the ReLU module. For the Hessian to be well-defined, we fix the running statistics of batch normalization and treat it as a linear layer during inference. We also turn off the learnable parameters γ and β (Ioffe & Szegedy, 2015) for simplicity. For a network structure X, we denote the variant with batch normalization after all hidden layers by X-BN. For example, the detailed structure of LeNet5-BN is shown in Table 5.
Variants of VGG11: To verify that our results apply to larger networks, we trained a number of variants of VGG11 (originally named VGG-A in the paper, but commonly referred to as VGG11) proposed by Simonyan & Zisserman (2015). For simplicity, we removed the dropout regularization of the original network, and we adapted the structure, originally designed for the 3 × 224 × 224 inputs of ImageNet, to the 3 × 32 × 32 inputs of CIFAR-10. Since the original VGG11 network is too large for computing the top eigenspace up to hundreds of dimensions, we reduce the number of output channels of each convolutional layer in the network to 32, 48, 64, 80, and 200. We denote the small variants by VGG11-W32, VGG11-W48, VGG11-W64, VGG11-W80, and VGG11-W200, respectively. We use conv1-conv8 and fc1 to denote the layers of VGG11, where conv1 is closest to the input and fc1 is the classification layer.

Variants of ResNet18:

We also trained a number of variants of ResNet18, proposed by He et al. (2016). As batch normalization changes the low rank structure of the auto-correlation matrix and reduces the overlap, we removed all batch normalization operations. Following the adaptation of ResNet to the CIFAR datasets in https://github.com/kuangliu/pytorch-cifar, we changed the input size to 3 × 32 × 32 and added a 1x1 convolutional layer to each shortcut after the first block. As with VGG11, we reduce the number of output channels of each convolutional layer in the network to 48, 64, and 80, and denote the small variants by ResNet18-W48, ResNet18-W64, and ResNet18-W80, respectively. We use conv1-conv17 and fc1 to denote the layers of the ResNet18 backbone, where conv1 is closest to the input and fc1 is the classification layer. For the 1x1 convolutional layers in the shortcuts, we write sc-conv1-sc-conv3, where sc-conv1 is the convolutional layer on the shortcut of the second ResNet block and sc-conv3 is the convolutional layer on the shortcut of the fourth ResNet block.

D.3 Training Process and Hyperparameter Configuration

For all datasets, we used the default splits into training and test sets. All models (except where explicitly stated otherwise) are trained using batched stochastic gradient descent (SGD) with batch size 128 and a fixed learning rate of 0.01 for 1000 epochs. No momentum or weight decay regularization was used. The loss objective converges by the end of training, so we may assume that the final models are at local minima. For generality, we also used a training scheme with a fixed learning rate of 0.001, and a training scheme with a fixed learning rate of 0.01, momentum 0.9, and weight decay factor 0.0005. Models trained with these settings are explicitly marked; otherwise models are trained with the default scheme above. Following the default initialization scheme of PyTorch (Paszke et al., 2019), the weights of linear and convolutional layers are initialized using the Xavier method (Glorot & Bengio, 2010), and the bias of each layer is initialized to zero.

E Detailed Experimental Results

E.1 Low Rank Structure of the Auto-correlation Matrix E[xx^T]

We briefly discussed the auto-correlation matrix $\mathbb{E}[xx^T]$ being approximately rank 1 in Section 4.3 of the main text. In particular, we claimed that the mean of the layer input dominates the covariance, so that $\mathbb{E}[xx^T] \approx \mathbb{E}[x]\mathbb{E}[x^T]$. In this section we provide additional empirical results supporting that claim. We use two metrics to quantify the quality of this approximation: the squared dot product between the normalized $\mathbb{E}[x]$ and the first eigenvector of $\mathbb{E}[xx^T]$, and the ratio between the first and second eigenvalues of $\mathbb{E}[xx^T]$. Intuitively, if the first quantity is close to 1 and the second quantity is large, then the approximation is accurate.

Formally, for fully connected layers, define $\hat{\mathbb{E}}[x]$ as the normalized expectation of the layer input $x$, namely $\mathbb{E}[x]/\|\mathbb{E}[x]\|$. For convolutional layers, following the notation of Appendix A.2, define $\hat{\mathbb{E}}[x]$ as the first left singular vector of $\mathbb{E}[X]$, so that $\hat{\mathbb{E}}[x] \in \mathbb{R}^{nK_1K_2}$. Abusing notation for simplicity, we use $\mathbb{E}[xx^T]$ to denote the $nK_1K_2 \times nK_1K_2$ matrix $\mathbb{E}[XX^T]$. We consider the squared dot product between $\hat{\mathbb{E}}[x]$ and the first eigenvector $v_1$ of $\mathbb{E}[xx^T]$, namely $(v_1^T \hat{\mathbb{E}}[x])^2$. For the spectral ratio, let $\lambda_1$ and $\lambda_2$ be the first and second eigenvalues of $\mathbb{E}[xx^T]$. By Weyl's inequality,

$$\frac{\lambda_1}{\lambda_2} \ge \frac{\|\mathbb{E}[x]\mathbb{E}[x]^T\| - \|\Sigma_x\|}{\|\Sigma_x\|} = \frac{\|\mathbb{E}[x]\mathbb{E}[x]^T\|}{\|\Sigma_x\|} - 1,$$

where $\Sigma_x$ is the covariance of $x$. Thus the spectral norm of $\mathbb{E}[x]\mathbb{E}[x]^T$ divided by that of $\Sigma_x$ gives a lower bound on $\lambda_1/\lambda_2$. In our experiments, we usually have $\lambda_1/\lambda_2 \ge \|\mathbb{E}[x]\mathbb{E}[x]^T\| / \|\Sigma_x\|$.

Table 6: Squared dot product $(v_1^T \hat{\mathbb{E}}[x])^2$ and spectral ratio $\lambda_1/\lambda_2$ for fully connected layers in a selection of network structures and datasets. We independently trained 5 runs for each instance and computed the mean, minimum, and maximum of the two quantities over all layers (except the first layer, whose input has mean zero) in all runs.
Table 7: Squared dot product $(v_1^T \hat{\mathbb{E}}[x])^2$ and spectral ratio $\lambda_1/\lambda_2$ for convolutional layers in the selection of network structures and datasets in Table 6.
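Both metrics can be computed directly from samples of a layer's input. A minimal numpy sketch, using synthetic non-negative (ReLU-like) activations as a stand-in for real layer inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 20, 5000
# ReLU-like inputs: non-negative, with a mean that is large relative to the spread
x = np.maximum(rng.normal(loc=1.0, scale=0.3, size=(N, n)), 0.0)

autocorr = x.T @ x / N                       # E[x x^T]
mean = x.mean(axis=0)                        # E[x]
cov = autocorr - np.outer(mean, mean)        # Sigma_x

eigvals, eigvecs = np.linalg.eigh(autocorr)  # ascending eigenvalue order
v1 = eigvecs[:, -1]
lam1, lam2 = eigvals[-1], eigvals[-2]

# metric 1: squared dot product between v1 and the normalized mean
alignment = (v1 @ (mean / np.linalg.norm(mean))) ** 2
# metric 2: the Weyl-type lower bound on the spectral ratio from the text
lower_bound = np.linalg.norm(mean) ** 2 / np.linalg.norm(cov, 2) - 1

assert alignment > 0.99                      # first eigenvector ~ normalized mean
assert lam1 / lam2 >= lower_bound            # guaranteed by Weyl's inequality
```

Note that $\|\mathbb{E}[x]\mathbb{E}[x]^T\| = \|\mathbb{E}[x]\|^2$ since the outer product is rank 1, which is what the bound computation uses.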

E.2 Eigenspace Overlap Between Different Models

The nontrivial overlap between the top eigenspaces of layer-wise Hessians of different models is one of our interesting observations, discussed in Section 5.1. Here we provide more related empirical results. Some further verify our claim in Section 5.1, while others appear to challenge it. Both kinds of results are discussed more extensively in Appendix F.

E.2.1 Overlap preserved when varying hyper-parameters:

We first verify that the overlap also exists for a set of models trained with different hyperparameters, using LeNet5 (defined in Table 4) as the network structure. We train 6 models using the default training scheme (SGD, lr=0.01, momentum=0), 5 models using a smaller learning rate (SGD, lr=0.001, momentum=0), and 5 models using a combination of optimization tricks (SGD, lr=0.01, momentum=0.9, weight decay=0.0005). With these 16 models, we compute the pairwise eigenspace overlap of their layer-wise Hessians (120 pairs in total) and plot the averages in Fig. 9; the shaded areas in the figure represent the standard deviation. The pattern of overlap is clearly preserved, and the position of the peak roughly agrees with the output dimension m, demonstrating that the phenomenon is caused by a common structure rather than by similarities in the training process. Note that for fc3 (the final output layer), we do not observe a linear growth starting from 0 as for the other layers. This can be explained by the absence of neuron permutation; the details are discussed together with the reason for the linear growth pattern of the other layers in Appendix F.3.
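For reference, the eigenspace overlap (Definition 4.1) between two models' top-k eigenspaces can be computed as follows. We assume here the normalization Overlap(U, V) = ||U^T V||_F^2 / k, under which two identical k-dimensional eigenspaces have overlap 1; the matrices below are random stand-ins for layer-wise Hessians:

```python
import numpy as np

def top_eigenspace(H, k):
    """Columns are the top-k eigenvectors of a symmetric matrix H."""
    vals, vecs = np.linalg.eigh(H)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def overlap(U, V):
    """Overlap(U, V) = ||U^T V||_F^2 / k for two d x k orthonormal bases."""
    k = U.shape[1]
    return np.linalg.norm(U.T @ V, 'fro') ** 2 / k

rng = np.random.default_rng(7)
d, k = 50, 10
A = rng.normal(size=(d, d)); A = A @ A.T    # stand-in Hessian of model 1
B = rng.normal(size=(d, d)); B = B @ B.T    # stand-in Hessian of model 2

# identical eigenspaces overlap fully; independent ones overlap at roughly k/d
assert np.isclose(overlap(top_eigenspace(A, k), top_eigenspace(A, k)), 1.0)
o = overlap(top_eigenspace(A, k), top_eigenspace(B, k))
assert 0.0 <= o <= 1.0
```

For the unrelated random matrices above, the expected overlap is about k/d, which is the baseline against which the observed peaks should be judged.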

E.2.2 Eigenspace overlap for convolutional layers in large models:

Even though the exact Kronecker factorization of layer-wise Hessians is only well-defined for fully connected layers, we also observe similar nontrivial eigenspace overlap for convolutional layers in larger and deeper networks, including variants of VGG11 and ResNet18 on CIFAR10 and CIFAR100. Some representative results are shown in Fig. 25 and Fig. 11. For each model on each dataset, we independently train 5 models and compute the average pairwise eigenspace overlap; the shaded areas represent the standard deviation. For most convolutional layers, the eigenspace overlap peaks around the dimension equal to the number of output channels of that layer, similar to the layers of LeNet5 in Fig. 9. The eigenspace overlap of the final fully connected layer also behaves like fc3 of LeNet5: it remains roughly constant and then drops after exceeding the dimension of the final output. However, there are also layers whose overlap does not peak around the output dimension (e.g., conv5 in Fig. 10b and conv7 in Fig. 11a). We categorize these "failure cases" below.

E.2.3 Failure Cases

As seen in Fig. 25 and Fig. 11, there is a small portion of layers, usually closer to the input, whose eigenspace overlap does not peak around the output dimension. These layers can be grouped into the following two general cases.

a. Early Peak of Low Overlap

For the layers shown in Fig. 12, the overlap of the dominating eigenspaces is significantly lower than for the other layers, and there is a small peak at very small dimensions. For the layers shown in Fig. 13, the top eigenspaces have a nontrivial overlap, but the peak dimension is larger than the predicted output dimension.

Figure 13: Top eigenspace overlap for layers with a delayed peak.

However, the existence of such failure cases does not undermine the theory of Kronecker factorization approximation. In fact, both kinds of "failure cases" appear because the top Hessian eigenspace is not completely spanned by E[x], and they can be predicted by computing the auto-correlation matrices and the output Hessians. The details are elaborated in Appendix F.3 with the help of correspondence matrices.

E.3 Eigenvector Correspondence

Here we present the correspondence matrices for the fc1, fc2, conv1, and conv2 layers of LeNet5. The top eigenvectors of all layers show a strong correlation with the first eigenvector of E[xx^T] (which is approximately Ê[x]). However, the diagonal pattern in the correspondence matrix with E[M] for fc2 is not as clear as that for fc1.

F Additional Explanations

F.1 Outliers in the Hessian Eigenspectrum

One frequently noted characteristic of the Hessian is the outliers in its eigenvalue spectrum. Sagun et al. (2018) suggest that in most cases there is a gap in the Hessian eigenvalue distribution around the number of classes $c$, where $c = 10$ in our case. A popular theory explaining the gap is the class/logit clustering of the logit gradients (Fort & Ganguli, 2019; Papyan, 2019, 2020). These explanations are consistent with our heuristic formula for the top eigenspace of the output Hessian at initialization: in the two-layer setting we considered, the logit gradients are indeed clustered.

In the layer-wise setting, the clustering claim can be formalized as follows. For each class $i \in [c]$ and logit entry $j \in [c]$, with $Q_x$ defined as in Eq. (28) and $(x, y)$ an input-label pair, let

$$\Delta_{i,j} = \mathbb{E}\left[\left(Q_x \frac{\partial z_x}{\partial w^{(p)}}\right)_j \,\middle|\, y = i\right].$$

Then at initialization, for each logit entry $j$, $\{\Delta_{i,j}\}_{i\in[c]}$ is clustered around the "logit center" $\bar{\Delta}_j := \mathbb{E}_{i\in[c]}[\Delta_{i,j}]$; at the minima, for each class $i$, $\{\Delta_{i,j}\}_{j\in[c]}$ is clustered around the "class center" $\bar{\Delta}_i := \mathbb{E}_{j\in[c]}[\Delta_{i,j}]$. With the decoupling conjectures, we may also consider similar claims for output Hessians, where

$$\Gamma_{i,j} = \mathbb{E}\left[\left(Q_x \frac{\partial z_x}{\partial z^{(p)}_x}\right)_j \,\middle|\, y = i\right].$$

A natural extension of the clustering phenomenon to output Hessians is then as follows: at initialization, for each logit entry $j$, $\{\Gamma_{i,j}\}_{i\in[c]}$ is clustered around $\bar{\Gamma}_j := \mathbb{E}_{i\in[c]}[\Gamma_{i,j}]$; at the minima, for each class $i$, $\{\Gamma_{i,j}\}_{j\in[c]}$ is clustered around $\bar{\Gamma}_i := \mathbb{E}_{j\in[c]}[\Gamma_{i,j}]$. Note that the layer-wise Hessian and the layer-wise output Hessian satisfy

$$H_L(w^{(p)}) = \mathbb{E}_{i,j\in[c]}\left[\Delta_{i,j}^T \Delta_{i,j}\right], \qquad M^{(p)} = \mathbb{E}_{i,j\in[c]}\left[\Gamma_{i,j}^T \Gamma_{i,j}\right].$$

Low-rank Hessian at Random Initialization and Logit Gradient Clustering

We first briefly recapitulate our explanation of the low-rankness of the Hessian at random initialization. In Appendix H, we show that for a two-layer ReLU network with Gaussian random initialization and Gaussian random input, the output Hessian of the first layer $M^{(1)}$ is approximately $\frac{1}{4} W^{(2)T} A W^{(2)}$. We then heuristically extend this approximation to a randomly initialized $L$-layer network: with

$$S^{(p)} = W^{(L)} W^{(L-1)} \cdots W^{(p+1)},$$

the output Hessian of the $p$-th layer can be approximated by

$$M^{(p)} \approx \left(\frac{1}{4}\right)^{L-p} S^{(p)T} A S^{(p)}.$$

Since $A$ has rank exactly $c - 1$ with null space spanned by the all-ones vector, $S^{(p)T} A S^{(p)}$ has rank at most $c - 1$. Thus the output Hessian is approximately rank $c - 1$, and so is the corresponding layer-wise Hessian according to the decoupling conjecture.

Now we discuss the connection between our analysis and the theory of logit gradient clustering. As previously observed by Papyan (2019), for each logit entry $j$, $\{\Delta_{i,j}\}_{i\in[c]}$ is clustered around the logit center $\mathbb{E}_{i\in[c]}[\Delta_{i,j}]$. Similar clustering effects for $\{\Gamma_{i,j}\}_{i\in[c]}$ were also observed empirically in our experiments. Moreover, through the approximation above and the decoupling conjecture, for each logit entry $j$, the cluster centers $\bar{\Gamma}_j$ and $\bar{\Delta}_j$ can be approximated by

$$\bar{\Gamma}_j \approx \left(S^T Q\right)_j, \qquad \bar{\Delta}_j \approx \left(\left(\mathbb{E}[x] \otimes S^T\right)\mathbb{E}[Q]\right)_j.$$

Following Papyan (2019), we used t-SNE (Van der Maaten & Hinton, 2008) to visualize the logit gradients. As seen in Fig. 20, the "logit centers" of the clustering correspond directly to the approximated dominating eigenvectors of the Hessian, which is consistent with our analysis.

Gradient Clustering at Minima. Currently our theory does not explain the low rank structure of the Hessian at the minima. However, we have observed that the class clustering of logit gradients does not universally apply to all models at the minima, even when the models have around $c$ significantly large eigenvalues. As shown in Fig. 21, the class clustering can be very weak while there are still around $c$ significantly large eigenvalues. We conjecture that the class clustering of logit gradients may be a sufficient but not necessary condition for the Hessian to be low rank at minima.

F.2 Low Rank Structure of Hessian Eigenvectors

A natural corollary of the Kronecker factorization approximation of layer-wise Hessians is that the eigenvectors of the layer-wise Hessians are low rank. Let $h_i$ be the $i$-th eigenvector of a layer-wise Hessian. The rank of $\mathrm{Mat}(h_i)$ can be considered an indicator of the complexity of the eigenvector. Consider the case where $h_i$ is one of the top eigenvectors. From Section 5.1, we have $h_i \approx u_i \otimes \hat{\mathbb{E}}[x]$. Thus $\mathrm{Mat}(h_i) \approx u_i \hat{\mathbb{E}}[x]^T$, which is rank 1.
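The rank claim for the heuristic approximation $M^{(p)} \approx (1/4)^{L-p} S^{(p)T} A S^{(p)}$ can be checked directly, since the rank of the right-hand side is bounded by $\mathrm{rank}(A) = c - 1$. A toy numpy sketch with hypothetical layer widths:

```python
import numpy as np

rng = np.random.default_rng(8)
c = 10                                       # number of classes
widths = [30, 40, 50]                        # hypothetical hidden widths of a 3-layer net

p = np.full(c, 1.0 / c)                      # uniform softmax output at initialization
A = np.diag(p) - np.outer(p, p)              # logit Hessian, rank c - 1

# S^(1) = W^(3) W^(2): product of the weight matrices above layer p = 1
W3 = rng.normal(size=(c, widths[2]))
W2 = rng.normal(size=(widths[2], widths[1]))
S = W3 @ W2                                  # shape (c, widths[1])

M = 0.25 ** 2 * S.T @ A @ S                  # heuristic output Hessian, L - p = 2
assert np.linalg.matrix_rank(M) == c - 1     # rank inherited from A
```

Because S has only c rows, sandwiching A between S^T and S can never raise the rank above c - 1, regardless of the layer widths; this is the mechanism behind the roughly c large eigenvalues.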

F.3 Eigenspace Overlap of Different Models

From the experimental results in Appendix E together with Fig. 5, we can see that our approximation and explanation in Section 5.1 of the main text are approximately correct, but may not be accurate for some layers. We now present a more general explanation that addresses why the overlap before rank $m$ grows linearly. We will also explain some exceptional cases shown in Appendix E.2 and possible discrepancies of our approximation.

Let $h_i$ be the $i$-th eigenvector of the layer-wise Hessian $H_L(w^{(p)})$. Under the assumption that the auto-correlation matrix $\mathbb{E}[xx^T]$ is approximately rank 1, i.e., $\mathbb{E}[xx^T] \approx \mathbb{E}[x]\mathbb{E}[x]^T$, for all $i \le m$ we can approximate $h_i$ as $u_i \otimes (\mathbb{E}[x]/\|\mathbb{E}[x]\|)$, where $u_i$ is the $i$-th eigenvector of $\mathbb{E}[M]$. Formally, the trend of the top eigenspace can be characterized by the following theorem. For simplicity of notation, in this section we use the superscript within parentheses to refer to the two models instead of the layer number.

Theorem F.1. Consider two different models with the same network structure trained on the same dataset. Fix the $p$-th hidden layer with input dimension $n$ and output dimension $m$. For the first model, denote its output Hessian by $\mathbb{E}[M]^{(1)}$, with eigenvalues $\tau^{(1)}_1 \ge \tau^{(1)}_2 \ge \cdots \ge \tau^{(1)}_m \ge 0$ and eigenvectors $r^{(1)}_1, \ldots, r^{(1)}_m \in \mathbb{R}^m$; denote its auto-correlation matrix by $\mathbb{E}[xx^T]^{(1)}$, with eigenvalues $\gamma^{(1)}_1 \ge \gamma^{(1)}_2 \ge \cdots \ge \gamma^{(1)}_n \ge 0$ and eigenvectors $t^{(1)}_1, \ldots, t^{(1)}_n \in \mathbb{R}^n$. The variables for the second model are defined identically, changing the 1 in the superscript parentheses to 2.

Assume the Kronecker factorization approximation is accurate, i.e., $H_L(w^{(p)})^{(1)} \approx \mathbb{E}[M]^{(1)} \otimes \mathbb{E}[xx^T]^{(1)}$ and $H_L(w^{(p)})^{(2)} \approx \mathbb{E}[M]^{(2)} \otimes \mathbb{E}[xx^T]^{(2)}$. Also assume the auto-correlation matrices of the two models are sufficiently close to rank 1, in the sense that $\tau^{(1)}_m \gamma^{(1)}_1 > \tau^{(1)}_1 \gamma^{(1)}_2$ and $\tau^{(2)}_m \gamma^{(2)}_1 > \tau^{(2)}_1 \gamma^{(2)}_2$. Then for all $k \le m$, the overlap of the top-$k$ eigenspaces of the layer-wise Hessians $H_L(w^{(p)})^{(1)}$ and $H_L(w^{(p)})^{(2)}$ is approximately $\frac{k}{m}\left(t^{(1)}_1 \cdot t^{(2)}_1\right)^2$. Consequently, the top eigenspace overlap shows a linear growth before it reaches dimension $m$, and the peak at $m$ is approximately $\left(t^{(1)}_1 \cdot t^{(2)}_1\right)^2$.

Proof. Let $h^{(1)}_i$ be the $i$-th eigenvector of the layer-wise Hessian of the first model $H_L(w^{(p)})^{(1)}$, and $h^{(2)}_i$ that of the second model $H_L(w^{(p)})^{(2)}$. Consider the first model. By the Kronecker factorization approximation, since $\tau^{(1)}_m \gamma^{(1)}_1 > \tau^{(1)}_1 \gamma^{(1)}_2$, the top $m$ eigenvalues of the layer-wise Hessian are $\gamma^{(1)}_1 \tau^{(1)}_1, \ldots, \gamma^{(1)}_1 \tau^{(1)}_m$. Consequently, for all $i \le m$ we have $h^{(1)}_i \approx r^{(1)}_i \otimes t^{(1)}_1$. Thus, for any $k \le m$, its top-$k$ eigenspace is $V^{(1)}_k \otimes t^{(1)}_1$, where $V^{(1)}_k \in \mathbb{R}^{m \times k}$ has column vectors $r^{(1)}_1, \ldots, r^{(1)}_k$. Similarly, for the second model we have $h^{(2)}_i \approx r^{(2)}_i \otimes t^{(2)}_1$ and the top-$k$ eigenspace $V^{(2)}_k \otimes t^{(2)}_1$, where $V^{(2)}_k$ has column vectors $r^{(2)}_1, \ldots, r^{(2)}_k$. The eigenspace overlap of the two models at dimension $k$ is thus

$$\mathrm{Overlap}\left(V^{(1)}_k \otimes t^{(1)}_1,\; V^{(2)}_k \otimes t^{(2)}_1\right) = \frac{1}{k}\left\| \left(V^{(1)T}_k V^{(2)}_k\right) \otimes \left(t^{(1)T}_1 t^{(2)}_1\right) \right\|_F^2 = \left(t^{(1)}_1 \cdot t^{(2)}_1\right)^2 \mathrm{Overlap}\left(V^{(1)}_k, V^{(2)}_k\right). \quad (58)$$

Note that for all $i \le m$, $r^{(1)}_i, r^{(2)}_i \in \mathbb{R}^m$, the space corresponding to the output neurons. For hidden layers, the output neurons (channels for convolutional layers) can be arbitrarily permuted to give equivalent models while changing the eigenvectors; for $h^{(1)}_i \approx r^{(1)}_i \otimes t^{(1)}_1$, permuting neurons permutes the entries of $r^{(1)}_i$. Thus we can assume that $r^{(1)}_i$ and $r^{(2)}_i$ of the two models are uncorrelated and hence have an expected squared inner product of $1/m$.

It follows from Definition 4.1 that

$$\mathbb{E}\left[\mathrm{Overlap}\left(V^{(1)}_k, V^{(2)}_k\right)\right] = \frac{1}{k}\sum_{i=1}^k \sum_{j=1}^k \mathbb{E}\left[\left(r^{(1)}_i \cdot r^{(2)}_j\right)^2\right] = \frac{1}{k} \cdot k^2 \cdot \frac{1}{m} = \frac{k}{m},$$

and thus the eigenspace overlap at dimension $k$ is approximately $\frac{k}{m}\left(t^{(1)}_1 \cdot t^{(2)}_1\right)^2$. This explains the peak at dimension $m$ and the linear growth before it.

From our results on auto-correlation matrices in Section 4.3 and Appendix E.1, we have $\hat{\mathbb{E}}[x]^{(1)} \approx t^{(1)}_1$ and $\hat{\mathbb{E}}[x]^{(2)} \approx t^{(2)}_1$, where $\hat{\mathbb{E}}$ denotes the normalized expectation. Hence when $k = m$, the overlap is approximately $\left(\hat{\mathbb{E}}[x]^{(1)} \cdot \hat{\mathbb{E}}[x]^{(2)}\right)^2$. Since $\hat{\mathbb{E}}[x]^{(1)}$ and $\hat{\mathbb{E}}[x]^{(2)}$ are identical for the input layer, the overlap is expected to be very high at dimension $m$ for input layers. For other hidden layers in a ReLU network, $x$ is the output of a ReLU and thus non-negative. Two non-negative vectors $\hat{\mathbb{E}}[x]^{(1)}$ and $\hat{\mathbb{E}}[x]^{(2)}$ still have a relatively large dot product, which contributes to the high overlap peak.
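The key step of the proof, factoring the overlap into $(t^{(1)}_1 \cdot t^{(2)}_1)^2$ times the overlap of the $V_k$ factors, is an exact identity of Kronecker products and can be checked numerically with random orthonormal stand-ins for the eigenvector matrices:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, k = 12, 8, 5

def rand_orth(d, k):
    """Random d x k matrix with orthonormal columns."""
    return np.linalg.qr(rng.normal(size=(d, k)))[0]

V1, V2 = rand_orth(m, k), rand_orth(m, k)              # stand-ins for top-k of E[M]^(1), E[M]^(2)
t1, t2 = rand_orth(n, 1)[:, 0], rand_orth(n, 1)[:, 0]  # stand-ins for top eigvecs of E[xx^T]

def overlap(U, V):
    return np.linalg.norm(U.T @ V, 'fro') ** 2 / U.shape[1]

lhs = overlap(np.kron(V1, t1[:, None]), np.kron(V2, t2[:, None]))
rhs = (t1 @ t2) ** 2 * overlap(V1, V2)
assert np.isclose(lhs, rhs)    # the (t1 . t2)^2 factor splits out exactly
```

This uses (V1 kron t1)^T (V2 kron t2) = (V1^T V2) kron (t1^T t2), so the Frobenius norm factorizes; only the k/m term in the theorem is a statistical approximation, not this identity.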

F.3.1 The Decreasing Overlap After Output Dimension

Consider the $(m+1)$-th eigenvector $h_{m+1}^{(1)}$ of the first model. Following the Kronecker factorization approximation and the assumptions in Theorem F.1, we have $h_{m+1}^{(1)} \approx r_1^{(1)} \otimes t_2^{(1)}$. Since the top $m$ eigenspace of the first model is approximately $I_m \otimes t_1^{(1)}$ and $t_2^{(1)}$ is orthogonal to $t_1^{(1)}$, the eigenvector $h_{m+1}^{(1)}$ is orthogonal to the top $m$ eigenspace of the first model. It also has low overlap with $I_m \otimes t_1^{(2)}$: since $(\hat{\mathbb{E}}[x]^{(1)} \cdot \hat{\mathbb{E}}[x]^{(2)})^2$ is large, $t_1^{(2)}$ is close to $t_1^{(1)}$ and hence nearly orthogonal to $t_2^{(1)}$. Moreover, the remaining eigenvectors of the autocorrelation matrix no longer have the all-positive property of the first eigenvector, and the structure of the covariance $\Sigma_x$ is directly associated with the ordering of the input neurons, which are randomly permuted across different models. The overlap between the other eigenvectors of the autocorrelation matrices of different models is therefore close to random, so the overlap past the top $m$ dimensions decreases until the eigenspaces have sufficiently many basis vectors to make even the random overlap large.

F.3.2 The Output Layer

Note that for the last layer, which satisfies the assumptions in Theorem F.1, the output neurons directly correspond to classes and hence cannot be permuted. In this case, the overlap stays high before dimension $m$: it is approximately $(t_1^{(1)} \cdot t_1^{(2)})^2$ for all dimensions $k \le m$. This is consistent with our observations.

F.3.3 Explaining "Failure Cases" of Eigenspace Overlap

As shown in Fig. 12 and Fig. 13, the nontrivial top eigenspace overlap does not necessarily peak at the output dimension for all layers. Some layers have a low peak at very small dimensions, and others have a peak at a larger dimension. With the more complete analysis provided above, we now explain these two phenomena. The major reason for both is that the assumption that the autocorrelation matrix is sufficiently close to rank 1 is not always satisfied. In particular, following the notation in Theorem F.1, for these exceptional layers we have $\tau_m\gamma_1 < \tau_1\gamma_2$. We first consider the first phenomenon (an early peak of low overlap), taking fc2:F-200 2 (MNIST) as an example. Here Fig. 24a is identical to Fig. 12a, which displays the early peak around $m = 10$. Note that this example shows that the Kronecker factorization can be used to predict when the conditions of Theorem F.1 fail, and also to predict up to which dimension they are satisfied. As shown in Fig. 25, a similar explanation applies to convolutional layers in larger networks. We then consider the second phenomenon (a delayed peak), taking conv2:VGG11-W200 (CIFAR10) as an example. Here Fig. 26a is identical to Fig. 13a, in which the overlap peaks later than the output dimension 200. In this case, the second eigenvalue of the autocorrelation matrix is still not negligible compared to the top eigenvalue. What differentiates this case from the first phenomenon is that the eigenvalues of the output Hessian no longer have a significant peak; instead they have a heavy tail, which is necessary for high overlap. Towards dimension $m$, the top eigenvectors gradually exhibit higher correspondence to later eigenvectors of the input autocorrelation matrix and hence less correspondence to $\hat{\mathbb{E}}[x]$. This eventually results in the delayed and flattened peak. Since the full correspondence matrices are too large to visualize, we plot their first rows up to 400 dimensions in Fig. 26e and Fig. 26f, in which each dot represents the average correlation with $\hat{\mathbb{E}}[x]$ over the 10 nearby eigenvectors. From these figures it is straightforward to see the gradually decreasing correlation with $\hat{\mathbb{E}}[x]$.

F.4 Batch Normalization and Zero-mean Input

In this section, we show results on networks trained with batch normalization (BN) (Ioffe & Szegedy, 2015). For layers after BN, we have $E[x] \approx 0$, so that $E[x]E[x]^T$ no longer dominates $\Sigma_x$ and the low rank structure of $E[xx^T]$ should disappear. Thus, we further expect that the overlap between top eigenspaces of layer-wise Hessians of different models will not have a peak. Table 8 shows the same experiments as in Table 6. The values for each network are the average of 3 different models. It is clear that the high inner product and large spectral ratio both fail to hold here, except for the first layer, where no normalization is applied. Note that we use channel-wise normalization (zero mean for each channel but not zero mean for $x$) for conv1 in LeNet5, so that its spectral ratio is also small. Fig. 27a shows that $E[xx^T]$ is no longer close to rank 1 with BN, as expected, although $E[xx^T]$ still has a few large eigenvalues. This leads to the disappearance of the peak in the top eigenspace overlap of different models, as shown in Fig. 28. The peak still exists in conv1 because BN is not applied to the input. Comparing Fig. 27b and Fig. 27c, we can see that the Kronecker factorization still gives a reasonable approximation of the eigenvector correspondence matrix with $E[xx^T]$, although worse than in the cases without BN (Fig. 3). Fig. 29 compares the eigenvalues and top eigenspaces of the approximated Hessian and the true Hessian for LeNet5 with BN. The approximation using the Kronecker factorization is also worse than in the case without BN (Fig. 2). However, it still gives meaningful information, as the overlap of the top eigenspaces remains highly nontrivial.

G PAC-Bayes Bounds and Optimization

Define $e(\theta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\theta, x)]$ and $\hat{e}(\theta) := \frac{1}{N}\sum_{i=1}^{N}\ell(\theta, x_i)$ as the expected and empirical classification error of $\theta$, respectively. We define the measurable hypothesis space of parameters $\mathcal{H} := \mathbb{R}^P$.
For any probability measure $P$ on $\mathcal{H}$, let $e(P) = \mathbb{E}_{\theta\sim P}\,e(\theta)$, $\hat{e}(P) = \mathbb{E}_{\theta\sim P}\,\hat{e}(\theta)$, and $\breve{e}(P) = \mathbb{E}_{\theta\sim P}\,L(\theta)$. Here $\breve{e}(P)$ serves as a differentiable convex surrogate of $\hat{e}(P)$.

Theorem G.1 (PAC-Bayes bound; McAllester (1999), Langford & Seeger (2001)). For any prior distribution $P$ on $\mathcal{H}$ chosen independently of the training set $S$, and any posterior distribution $Q$ on $\mathcal{H}$ whose choice may depend on $S$, with probability $1 - \delta$,
$$D_{KL}\left(\hat{e}(Q)\,\|\,e(Q)\right) \le \frac{D_{KL}(Q\,\|\,P) + \log\frac{|S|}{\delta}}{|S| - 1}. \qquad (60)$$

Fix constants $b, c \ge 0$ and a random initialization $\theta_0 \in \mathcal{H}$. Dziugaite & Roy (2017) show that when setting $Q = \mathcal{N}(w, \mathrm{diag}(s))$ and $P = \mathcal{N}(\theta_0, \lambda I_P)$, where $w, s \in \mathcal{H}$ and $\lambda = c\exp(-j/b)$ for some $j \in \mathbb{N}$, and solving the optimization problem
$$\min_{w, s, \lambda}\; \breve{e}(Q) + \sqrt{\frac{D_{KL}(Q\,\|\,P) + \log\frac{|S|}{\delta}}{2(|S| - 1)}}, \qquad (61)$$
with initialization $w = \theta$, $s = \theta^2$, one can achieve a nonvacuous PAC-Bayes bound via Eq. (60). To avoid discrete optimization over $j \in \mathbb{N}$, Dziugaite & Roy (2017) replace the bound in Eq. (60) with the $B_{RE}$ term, defined as
$$B_{RE}(w, s, \lambda; \delta) = \frac{D_{KL}(Q\,\|\,P) + 2\log\left(b\log\frac{c}{\lambda}\right) + \log\frac{\pi^2|S|}{6\delta}}{|S| - 1}, \qquad (62)$$
where $Q = \mathcal{N}(w, \mathrm{diag}(s))$ and $P = \mathcal{N}(\theta_0, \lambda I_P)$. The optimization objective actually used in the implementation is thus
$$\min_{w\in\mathbb{R}^P,\, s\in\mathbb{R}^P_+,\, \lambda\in(0,c)}\; \breve{e}(Q) + \sqrt{\tfrac{1}{2}B_{RE}(w, s, \lambda; \delta)}. \qquad (63)$$
Algorithm 1 shows the Iterative Hessian (ITER) PAC-Bayes optimization. If we set $\eta = T$, the algorithm becomes Approximate Hessian (APPR) PAC-Bayes optimization.
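For concreteness, the $B_{RE}$ term is straightforward to evaluate for a diagonal-Gaussian posterior and an isotropic prior via the closed-form Gaussian KL divergence. The sketch below uses hypothetical toy parameter values (the variable names and sizes are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(8)
P_dim = 100                                # toy parameter count
w = rng.standard_normal(P_dim) * 0.1       # posterior mean (hypothetical)
s = np.full(P_dim, 0.01)                   # posterior variances
theta0 = np.zeros(P_dim)                   # prior mean = initialization
lam, b, c_, delta, S = 0.05, 100, 0.1, 0.025, 55000

def kl_gauss(w, s, theta0, lam):
    # KL( N(w, diag(s)) || N(theta0, lam * I) ) in closed form
    return 0.5 * (np.sum(s / lam) + np.sum((w - theta0) ** 2) / lam
                  - len(w) + np.sum(np.log(lam / s)))

def b_re(w, s, lam):
    # B_RE(w, s, lambda; delta) as in Eq. (62)
    kl = kl_gauss(w, s, theta0, lam)
    return (kl + 2 * np.log(b * np.log(c_ / lam))
            + np.log(np.pi ** 2 * S / (6 * delta))) / (S - 1)

val = b_re(w, s, lam)
assert val > 0
# sanity check: the KL term vanishes when the posterior equals the prior
assert np.isclose(kl_gauss(theta0, np.full(P_dim, lam), theta0, lam), 0.0)
```

In the actual algorithm this quantity is minimized jointly with the surrogate loss, with $s$ and $\lambda$ parameterized through $\exp(2\varsigma)$ and $\exp(2\rho)$ to keep them positive.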
Algorithm 1: (Iterative) Hessian PAC-Bayes Optimization
1: procedure PACBAYESOPT($w$; $\delta, b, c, T, \eta, \tau$)
2:   $\varsigma \leftarrow \log[\mathrm{abs}(w)]$, where $s(\varsigma) = \exp(2\varsigma)$
3:   $\rho \leftarrow -3$, where $\lambda(\rho) = \exp(2\rho)$
4:   $R(w, s, \lambda) = \sqrt{\tfrac{1}{2}B_{RE}(w, s, \lambda; \delta)}$ ▷ $B_{RE}$ term
5:   $B(w, s, \lambda, w') = L(w') + R(w, s, \lambda)$ ▷ optimization objective
6:   for $t = 0 \to T - 1$ do ▷ run SGD for $T$ iterations
7:     if $t \bmod \eta = 0$ then
8:       HESSIANCALC($w$)
9:     end if
10:    sample $\xi \sim \mathcal{N}(0, 1)^P$
11:    $w'(w, \varsigma) = w + \mathrm{TOSTANDARD}(\xi \odot \exp(\varsigma))$ ▷ generate noisy parameters for the SNN
12:    $w \leftarrow w - \tau\left[\nabla_w R(w, s, \lambda) + \nabla_{w'}L(w')\right]$
13:    $\varsigma \leftarrow \varsigma - \tau\left[\nabla_\varsigma R(w, s(\varsigma), \lambda) + \mathrm{TOHESSIAN}(\nabla_{w'}L(w')) \odot \xi \odot \exp(\varsigma)\right]$
14:    $\rho \leftarrow \rho - \tau\nabla_\rho R(w, s, \lambda(\rho))$ ▷ gradient descent
15:  end for
16:  return $w, s(\varsigma), \lambda(\rho)$
17: end procedure

In the algorithm, HESSIANCALC($w$) computes Hessian information with respect to the posterior mean $w$ in order to produce the Hessian eigenbasis used for the change of basis. For very small networks we can compute the Hessian explicitly, but this is prohibitive for most common networks. However, an efficient approximate change of basis can be performed using our approximated layer-wise Hessians: we only need the full eigenspaces of $E[M]$ and of $E[xx^T]$ for each layer. For the $p$-th layer, we denote them $U^{(p)}$ and $V^{(p)}$ respectively, with eigenvectors as columns. We can also store the corresponding eigenvalues by taking pairwise products of the eigenvalues of $E[M]$ and $E[xx^T]$. After obtaining the eigenspaces, we can perform the change of basis. Note that the change of basis acts on vectors with the same dimensionality as the parameter vector (or the posterior mean). TOHESSIAN($u$) maps a vector $u$ in the standard basis to the Hessian eigenbasis. We first split $u$ by layer and let $u^{(p)}$ be the sub-vector for the $p$-th layer. We then define $\mathrm{Mat}^{(p)}$ as the reshaping of a vector to the shape of the parameter matrix $W^{(p)}$ of that layer. The new vector $v^{(p)}$ in the Hessian basis is $v^{(p)} = \mathrm{vec}\left(U^{(p)T}\,\mathrm{Mat}^{(p)}(u^{(p)})\,V^{(p)}\right)$, and $v = \mathrm{TOHESSIAN}(u)$ is the concatenation of all the $v^{(p)}$.
TOSTANDARD($v$) maps a vector $v$ in the Hessian eigenbasis back to the standard basis; it is the inverse of TOHESSIAN. We again split $v$ by layer and let $v^{(p)}$ be the sub-vector for the $p$-th layer. The new vector is $u^{(p)} = \mathrm{vec}\left(U^{(p)}\,\mathrm{Mat}^{(p)}(v^{(p)})\,V^{(p)T}\right)$, and $u = \mathrm{TOSTANDARD}(v)$ is the concatenation of all the $u^{(p)}$. After obtaining the optimized $w, s, \lambda$, we compute the final bound using the same Monte Carlo method as Dziugaite & Roy (2017). Note that the prior $P$ is invariant under the change of basis, since its covariance matrix is a multiple of the identity, $\lambda I_P$; thus the KL divergence can be calculated in the Hessian eigenbasis without changing the value of $\lambda$. In Iterative Hessian with approximated output Hessian (ITER.M), we use the approximation of $E[M]$ from Eq. (56) in place of $E[M]$. We generally followed the experimental setting of Dziugaite & Roy (2017). In all the results we present, we first trained the models from a Gaussian random initialization $w_0$ to the initial posterior mean estimate $w$ using SGD (lr = 0.01) with batch size 128 for 1000 epochs. We then optimize the posterior mean and variance with layer-wise Hessian information using Algorithm 1, where $\delta = 0.025$, $b = 100$, and $c = 0.1$. We train for 2000 epochs, with learning rate $\tau$ initialized at 0.001 and decayed by a factor of 0.1 every 400 epochs. For the Approximated Hessian algorithm we set $\eta = 1$; for the Iterative Hessian algorithm we set $\eta = 10$. We also tried decaying $\eta$ on the same schedule as the learning rate (multiplying $\eta$ by 10 every time the learning rate is multiplied by 0.1); the results are similar to those without decay. We used the same Monte Carlo method as Dziugaite & Roy (2017) to calculate the final PAC-Bayes bound, except that we used 50000 iterations instead of 150000, because the extra iterations do not tighten the bound significantly. We use sample frequency 100 and $\delta = 0.01$ as in that paper.
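The layer-wise change of basis performed by TOHESSIAN and TOSTANDARD can be sketched as follows. The layer shapes and eigenbases below are hypothetical stand-ins for the eigenvectors of $E[M]$ and $E[xx^T]$; the point of the check is that the map is orthogonal and invertible, which is what makes the KL computation basis-independent:

```python
import numpy as np

rng = np.random.default_rng(2)

def orth(k):
    # random orthogonal matrix standing in for an eigenbasis
    q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return q

shapes = [(4, 6), (3, 4)]                  # hypothetical layer shapes (m_p, n_p)
U = [orth(m) for m, _ in shapes]           # eigenbasis of E[M] per layer
V = [orth(n) for _, n in shapes]           # eigenbasis of E[xx^T] per layer
sizes = [m * n for m, n in shapes]

def split(u):
    out, pos = [], 0
    for s in sizes:
        out.append(u[pos:pos + s]); pos += s
    return out

def to_hessian(u):
    # v^(p) = vec( U^(p)T Mat(u^(p)) V^(p) ), concatenated over layers
    return np.concatenate([(Up.T @ up.reshape(m, n) @ Vp).ravel()
                           for up, Up, Vp, (m, n)
                           in zip(split(u), U, V, shapes)])

def to_standard(v):
    # inverse map: u^(p) = vec( U^(p) Mat(v^(p)) V^(p)T )
    return np.concatenate([(Up @ vp.reshape(m, n) @ Vp.T).ravel()
                           for vp, Up, Vp, (m, n)
                           in zip(split(v), U, V, shapes)])

u = rng.standard_normal(sum(sizes))
v = to_hessian(u)
assert np.allclose(to_standard(v), u)                       # round trip
assert np.isclose(np.linalg.norm(v), np.linalg.norm(u))     # orthogonality
```

Because $v^{(p)} = (V^{(p)} \otimes U^{(p)})^T u^{(p)}$ up to vectorization convention, the map preserves norms, and an isotropic prior is unchanged by it.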
The complete experimental results are listed in Table 9, where we follow the same naming convention as Dziugaite & Roy (2017). In Table 9, Prev denotes the previous results of Dziugaite & Roy (2017), APPR denotes Approximated Hessian, ITER denotes Iterative Hessian, ITER (D) denotes Iterative Hessian with decaying $\eta$, and ITER.M denotes Iterative Hessian with approximated output Hessian. BASE is the base PAC-Bayes optimization as in the previous paper. We also plot the final posterior variance $s$; Fig. 30 below is for T-200 2 10 . For posterior variances optimized with our algorithms (APPR, ITER, and ITER.M), directions associated with larger eigenvalues have smaller variance. This agrees with our presumption that top eigenvectors are aligned with sharper directions and should have smaller variance after optimization. The effect is more significant and consistent for Iterative Hessian, where the PAC-Bayes bound is also tighter.

H.1.1 Notations

We use $[n]$ to denote the set $\{1, \cdots, n\}$ and $\|M\|$ to denote the spectral norm of a matrix $M$. We use $\langle A, B\rangle$ to denote the Frobenius inner product of two matrices $A$ and $B$, namely $\langle A, B\rangle := \sum_{i,j}A_{i,j}B_{i,j}$. Denote by $\mathrm{tr}(M)$ the trace of a matrix $M$ and by $\mathbf{1}_c$ the all-one vector of dimension $c$ (the subscript may be omitted when clear from context). Furthermore, for notational simplicity, we say "with probability 1 over $W^{(1)}/W^{(2)}$, event $E$ is true" to denote
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim\mathcal{N}(0,\frac{1}{d}I_{nd}),\,W^{(2)}\sim\mathcal{N}(0,\frac{1}{n}I_{cn})}[E] = 1.$$

H.1.2 Problem Setting

Consider a two-layer fully connected ReLU neural network with input dimension $d$, hidden layer dimension $n$, and output dimension $c$, trained with the cross-entropy objective $L$. Let $\sigma$ denote the element-wise ReLU activation, $\sigma(x) = x \odot \mathbf{1}_{x\ge 0}$. Let $W^{(1)} \in \mathbb{R}^{n\times d}$ and $W^{(2)} \in \mathbb{R}^{c\times n}$ denote the weight matrices of the first and second layers, respectively. Let the network input be standard normal, $x \sim \mathcal{N}(0, I_d)$. Denoting the outputs of the first and second layers by $y$ and $z$ respectively, we have $y = \sigma(W^{(1)}x)$ and $z = W^{(2)}y$. Let $p = \mathrm{softmax}(z)$ denote the softmax output of the network, and let $A := \mathrm{diag}(p) - pp^T$. From the previous analysis of the Hessian, the output Hessian of the second layer can be written as $M := E[DW^{(2)T}AW^{(2)}D]$, where $D := \mathrm{diag}(\mathbf{1}_{y\ge 0})$ is the diagonal random matrix representing the ReLU activation pattern after the first layer. We study the network at random Gaussian initialization, in which the entries of both matrices are i.i.d. samples from a standard normal distribution, re-scaled so that each row of $W^{(1)}$ and $W^{(2)}$ has norm 1. Taking $n$ and $d$ to infinity, by concentration of the norm of high-dimensional Gaussian vectors, we may assume that the entries of $W^{(1)}$ are i.i.d. samples from a zero-mean distribution with variance $1/d$ and the entries of $W^{(2)}$ are i.i.d. samples from a zero-mean distribution with variance $1/n$. This initialization is standard in training neural networks. Since our formula for the top eigenspace depends on $W^{(2)}$, throughout this section expectations are conditioned on the values of $W^{(1)}$ and $W^{(2)}$; the expectation is taken only over the input $x \sim \mathcal{N}(0, I_d)$ (due to concentration, taking the expectation over $x$ is similar to having many samples from the input distribution).
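The output Hessian $M$ of this setting can be estimated by Monte Carlo at a small random initialization, and the eigengap $\lambda_c(M) \ll \lambda_{c-1}(M)$ predicted by Theorem H.1 is already visible at moderate width. An illustrative sketch with arbitrary small dimensions (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, c = 400, 200, 5

# rows rescaled to unit norm, as in the problem setting
W1 = rng.standard_normal((n, d)); W1 /= np.linalg.norm(W1, axis=1, keepdims=True)
W2 = rng.standard_normal((c, n)); W2 /= np.linalg.norm(W2, axis=1, keepdims=True)

M = np.zeros((n, n))
samples = 2000
for _ in range(samples):
    x = rng.standard_normal(d)
    y = np.maximum(W1 @ x, 0.0)
    z = W2 @ y
    p = np.exp(z - z.max()); p /= p.sum()
    A = np.diag(p) - np.outer(p, p)          # softmax output Hessian
    mask = (y > 0).astype(float)             # diagonal of D
    B = W2.T @ A @ W2
    M += (B * mask[None, :]) * mask[:, None] # D B D for this sample
M /= samples

eig = np.sort(np.linalg.eigvalsh(M))[::-1]
# the top (c-1) eigenvalues dominate the rest of the spectrum
assert eig[c - 1] / eig[c - 2] < 0.5
```

At finite $n$ and sample size, the ratio does not reach zero, but it is already far below 1, consistent with the limit statement of the theorem.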

H.2 Main Theorem and Proof Sketch

In this section, we give a formal statement of our main theorem and its proof sketch.

Theorem H.1. For all $\epsilon > 0$,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim\mathcal{N}(0,\frac{1}{d}I_{nd}),\,W^{(2)}\sim\mathcal{N}(0,\frac{1}{n}I_{cn})}\left[\frac{\lambda_c(M)}{\lambda_{c-1}(M)} < \epsilon\right] = 1.$$
Moreover, for all $\epsilon > 0$, if we define $S_1$ as the top $(c-1)$ eigenspace of $M$, and $S_2$ as $R(W^{(2)})\setminus\{W^{(2)}\cdot\mathbf{1}\}$ where $R(W^{(2)})$ is the row space of $W^{(2)}$, then
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim\mathcal{N}(0,\frac{1}{d}I_{nd}),\,W^{(2)}\sim\mathcal{N}(0,\frac{1}{n}I_{cn})}\left[\mathrm{Overlap}(S_1, S_2) > 1 - \epsilon\right] = 1.$$

Proof of Theorem H.1. First, let us repeat the expression for the output Hessian: $M := E[DW^{(2)T}AW^{(2)}D]$. In the proof, we first analyze the properties of $D$, $W^{(2)}$, and $A$ separately. Firstly, $D$ is a diagonal matrix with 0/1 entries, and the following lemma shows that its entries are independent when the input dimension tends to infinity.

Lemma 1. When $d \to \infty$, with probability 1 over $W^{(1)}$, the entries of $D$ are independent.

Secondly, since each entry of $W^{(2)}$ is sampled i.i.d. from a spherical Gaussian distribution, this matrix enjoys very nice properties when the network width $n$ goes to infinity. We have the following lemma for $W^{(2)}$.

Lemma 2 (informal). When $n$ is large enough, each row of $W^{(2)}$ has norm very close to 1, and the rows are nearly orthogonal to each other. Moreover, the entries (and the average of the entries) cannot be too large.

These properties (along with other useful properties) are formally stated and proved in Appendix H.3.2. As for the matrix $A$, it is very hard to compute its expectation explicitly because the generation of $A$ involves a softmax, but we can prove some useful properties of $E[A]$, as shown in the following lemma.

Lemma 3. $\tilde{A} := \lim_{n\to\infty}E[A]$ exists and has rank $(c-1)$ with probability 1 over $W^{(2)}$.

This lemma holds because $A$ is positive semi-definite (PSD) and almost always of rank $(c-1)$, and its null space always contains the all-one vector $\mathbf{1}_c$.
Given these properties of the three matrices, we look at the expression for $M$ again to see how to compute the expectation. This expectation is not easy to compute because we condition on $W^{(1)}$ and $W^{(2)}$; under this conditioning, $D$ and $A$ are correlated and hard to decouple. This is where we need the most important observation in our proof: when the input dimension and the network width tend to infinity, $A$ and $D$ can be treated as independent when computing $M$. The two matrices are not actually independent even in the limit: for example, if $D_{ii} = 1$ exactly when the $i$-th entry of the first row of $W^{(2)}$ is positive, then the first logit would be much larger than the rest, which would significantly skew the distribution of $A$. However, since the computation of $M$ only involves finite-degree polynomials of $A$ and $D$, we only need a weaker form of independence, namely that the distribution of $A$ is approximately invariant when conditioning on finitely many entries of $D$, as shown in the following lemma:

Lemma 4. Let $\mathcal{D}_X$ denote the distribution of $X$, and $TV(\mathcal{D}_1, \mathcal{D}_2)$ the total variation distance between $\mathcal{D}_1$ and $\mathcal{D}_2$. Then for all $i, j \in [n]$ and all $\epsilon > 0$,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim\mathcal{N}(0,\frac{1}{d}I_{nd}),\,W^{(2)}\sim\mathcal{N}(0,\frac{1}{n}I_{cn})}\left[TV\left(\mathcal{D}_A, \mathcal{D}_{A|D_{i,i}=D_{j,j}=1}\right) > \epsilon\right] = 0.$$

The intuition is that $A$ is uniquely determined by the output of the second layer $z = W^{(2)}y$, which is $c$-dimensional, while $y$ is $n$-dimensional. Since $D = \mathrm{diag}(\mathbf{1}_{y\ge 0})$, fixing finitely many entries of $D$ is equivalent to fixing the signs of finitely many entries of $y$. When $n$ is large compared to $c$, constraining only finitely many entries of $y$ should not change the distribution of $z$ by much. The formal proof is given in Appendix H.3.4. Given Lemma 4, we can equivalently treat $A$ and $D$ as independent matrices. To formalize this, we need the following definition:

Definition H.1. Let $D'$ be an independent copy of $D$ that is also independent of $A$. Define $M^* := E[D'W^{(2)T}AW^{(2)}D']$.
In other words, $M^*$ has the same expression as $M$ except that $D'$ is assumed independent of $A$. By Lemma 4, $M$ and $M^*$ are essentially the same. Since $D'$ is a diagonal matrix with i.i.d. 0/1 entries, multiplying by $D'$ on both sides of a matrix independently zeroes out each row and the corresponding column with probability $\frac{1}{2}$; hence each diagonal entry is kept with probability $\frac{1}{2}$ while each off-diagonal entry is kept with probability $\frac{1}{4}$. Formally,
$$M^* \approx \frac{1}{4}\left(E\left[W^{(2)T}AW^{(2)}\right] + \mathrm{diag}\left(E\left[W^{(2)T}AW^{(2)}\right]\right)\right).$$
The right-hand side involves two terms: $T_1 := E[W^{(2)T}AW^{(2)}]$ and $T_2 := \mathrm{diag}(E[W^{(2)T}AW^{(2)}])$. We make two observations about them. On the one hand, they have the same trace, because their diagonal entries are exactly the same. On the other hand, $T_1$ is a low rank matrix while $T_2$ is approximately full rank. For $T_1$, since the expectation is taken over $x$ only, we have $T_1 = W^{(2)T}E[A]W^{(2)}$; since $E[A]$ is a rank $(c-1)$ PSD matrix, $T_1$ is also PSD and has rank at most $(c-1)$. $T_2$ is a diagonal matrix whose $i$-th diagonal entry equals the quadratic form $w_i^{(2)T}E[A]w_i^{(2)}$, where $w_i^{(2)}$ is the $i$-th column of $W^{(2)}$ ($i \in [n]$). This term is always positive unless $w_i^{(2)}$ lies in the span of $\mathbf{1}_c$, which happens with probability 0. In fact, due to the random nature of the $w_i^{(2)}$'s, the diagonal entries of $T_2$ do not differ much from one another. To summarize, $T_1$ and $T_2$ are both PSD matrices with the same trace, while $T_1$ is low rank and $T_2$ is approximately full rank. This intuitively indicates that the positive eigenvalues of $T_1$ are significantly larger than those of $T_2$, making the positive eigenvalues of $T_1$ the dominating eigenvalues of $M^*$ and those of $T_2$ the thin but long tail of $M^*$'s eigenvalue spectrum. Now that we know $T_1$ is essentially the only term contributing to the top eigenvalues and eigenspace of $M^*$, we only need to analyze $T_1 = W^{(2)T}E[A]W^{(2)}$. Since the rows of $W^{(2)}$ have norm close to 1 and are almost mutually orthogonal, $T_1$ roughly preserves all the eigenvalues of $E[A]$, and the top $(c-1)$ eigenspace should roughly be the "$W^{(2)}$-rotated" version of $\mathbb{R}^c\setminus\{\mathbf{1}_c\}$, i.e., $R(W^{(2)})\setminus\{W^{(2)}\cdot\mathbf{1}_c\}$. Despite the arguments above, there remain technical difficulties, the biggest of which is that the dimensions of $M$ and $M^*$ become infinite as $n$ goes to infinity. To tackle this, we take an indirect route: we first project these matrices onto the row span of $W^{(2)}$ and show that the projection roughly preserves all the information in these matrices. Formally, we have the following lemma:

Lemma 5. With probability 1 over $W^{(2)}$, $\lim_{n\to\infty}\frac{\|WMW^T\|_F^2}{\|M\|_F^2} = 1$.

We then carry out the analysis in this finite-dimensional span and finish the proof of our main theorem.
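The step expressing $M^*$ through $T_1$ and $T_2$ rests on the identity $E[D'BD'] = \frac{1}{4}(B + \mathrm{diag}(B))$ for a diagonal $D'$ with i.i.d. Bernoulli(1/2) entries and any fixed matrix $B$, since $E[D'_{ii}D'_{jj}]$ is $\frac{1}{2}$ on the diagonal and $\frac{1}{4}$ off it. A quick exhaustive check (illustrative, not code from the paper):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 4
B = rng.standard_normal((n, n))

# exact expectation of D' B D' over all 2^n Bernoulli(1/2) masks
E = np.zeros((n, n))
for mask in itertools.product([0.0, 1.0], repeat=n):
    D = np.diag(mask)
    E += D @ B @ D
E /= 2 ** n

# E[D'_ii D'_jj] = 1/4 for i != j and 1/2 for i == j, hence:
assert np.allclose(E, (B + np.diag(np.diag(B))) / 4)
```

Here the identity is exact; in the proof sketch the "$\approx$" comes from replacing $D$ by its independent copy $D'$ via Lemma 4, not from this averaging step.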

H.3.1 Proof of Lemma 1

We first restate Lemma 1 here:

Lemma 1. When $d \to \infty$, with probability 1 over $W^{(1)}$, the entries of $D$ are independent.

Proof of Lemma 1. Recall that $D := \mathrm{diag}(\mathbf{1}_{y\ge 0})$. The off-diagonal entries of $D$ are always 0 and independent of everything. Each diagonal entry is determined by the corresponding entry of $y$, so we only need to prove that the entries of $y$ are independent, for which we have the following lemma:

Lemma 6. When $d \to \infty$, with probability 1 over $W^{(1)}$, the entries of $y$ are independent.

Proof of Lemma 6. We prove this using the multivariate Lindeberg-Feller CLT. For each $i \in [d]$, let $w_i \in \mathbb{R}^n$ denote the $i$-th column vector of $W^{(1)}$. Let $u_i = w_ix_i$; then
$$u = \sum_{i=1}^d u_i = \sum_{i=1}^d w_ix_i = W^{(1)}x.$$
The $x_i$'s are i.i.d. standard Gaussian, with moments
$$E[x_i] = 0,\quad E[(x_i - E[x_i])^2] = 1,\quad E[(x_i - E[x_i])^4] = 3.$$
It follows that $\mathrm{Var}[u_i] = \mathrm{Var}[w_ix_i] = w_iw_i^T$. Let $V = \sum_{i=1}^d\mathrm{Var}[u_i] = \sum_{i=1}^d w_iw_i^T = W^{(1)}W^{(1)T}$. As $d \to \infty$, from Lemma 9 (with $n$ and $W$ replaced by $d$ and $W^{(1)}$) we have $W^{(1)}W^{(1)T} \to I_n$ in probability, so $\lim_{d\to\infty}V = I_n$. We now verify the Lindeberg condition for the independent random vectors $\{u_1, \ldots, u_d\}$. First observe that the fourth moments of the $u_i$ are sufficiently small:
$$\lim_{d\to\infty}\sum_{i=1}^d E\left[\|u_i\|^4\right] = \lim_{d\to\infty}\sum_{i=1}^d E\left[\left(\sum_{j=1}^n\left(W^{(1)}_{ji}x_i\right)^2\right)^2\right] \le \lim_{d\to\infty}\sum_{i=1}^d E\left[\left(n\max_{j\in[n]}\left(W^{(1)}_{ji}\right)^2x_i^2\right)^2\right] \le \lim_{d\to\infty}n^2\max_{i\in[d],j\in[n]}\left(W^{(1)}_{ji}\right)^4\sum_{i=1}^d E\left[(x_i - E[x_i])^4\right]. \qquad (77)$$
Since $E[(x_i - E[x_i])^4] = 3$ and $\max_{i\in[d],j\in[n]}|W^{(1)}_{ji}| < 2d^{-1/3}$ with probability 1 by Lemma 11, it follows that
$$\lim_{d\to\infty}\sum_{i=1}^d E\left[\|u_i\|^4\right] \le n^2\lim_{d\to\infty}\left(2d^{-1/3}\right)^4\sum_{i=1}^d 3 = 48n^2\lim_{d\to\infty}d^{-4/3}\cdot d = 48n^2\lim_{d\to\infty}d^{-1/3} = 0.$$
For any $\epsilon > 0$, since $\|u_i\| > \epsilon$ in the domain of integration,
$$\lim_{d\to\infty}\sum_{i=1}^d E\left[\|u_i\|^2\mathbf{1}[\|u_i\| > \epsilon]\right] < \lim_{d\to\infty}\sum_{i=1}^d E\left[\|u_i\|^2\frac{\|u_i\|^2}{\epsilon^2}\mathbf{1}[\|u_i\| > \epsilon]\right] \le \frac{1}{\epsilon^2}\lim_{d\to\infty}\sum_{i=1}^d E\left[\|u_i\|^4\right] = 0.$$
As the Lindeberg condition is satisfied and $\lim_{d\to\infty}V = I_n$ in probability, we have
$$\lim_{d\to\infty}u = \lim_{d\to\infty}\sum_{i=1}^d u_i \xrightarrow{D} \mathcal{N}(0, I_n).$$
Since $u$ converges in distribution to $\mathcal{N}(0, I_n)$ with probability 1 over $W^{(1)}$, the entries of $u$ are independent in the limit. Since $y = \sigma(u)$ and ReLU is an entry-wise operator, the entries of $y$ are independent. Since the diagonal entries of $D$ are uniquely determined by the corresponding entries of $y$, when $d \to \infty$, with probability 1 over $W^{(1)}$, the entries of $D$ are independent. This finishes the proof of Lemma 1.
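The conclusion of Lemma 6 — that $u = W^{(1)}x$ is close to $\mathcal{N}(0, I_n)$ once the rows of $W^{(1)}$ are normalized and $d$ is large — can be checked empirically. A sketch with arbitrary finite $d$ and $n$ (so the assertions use loose tolerances):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, N = 2000, 6, 5000

# rows of W^(1) rescaled to unit norm, as in the problem setting
W1 = rng.standard_normal((n, d))
W1 /= np.linalg.norm(W1, axis=1, keepdims=True)

X = rng.standard_normal((N, d))
U = X @ W1.T                       # rows are samples of u = W^(1) x

C = np.cov(U, rowvar=False)        # empirical covariance of u
# population covariance is W^(1) W^(1)T: unit diagonal, O(1/sqrt(d)) off-diagonal
assert np.allclose(np.diag(C), 1.0, atol=0.08)
off = C - np.diag(np.diag(C))
assert np.max(np.abs(off)) < 0.15
```

The off-diagonal entries shrink like $1/\sqrt{d}$ (plus sampling noise), which is the finite-$d$ version of $W^{(1)}W^{(1)T} \to I_n$.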

H.3.2 Proof of Lemma 2

We first restate Lemma 2 here:

Lemma 2 (informal). When $n$ is large enough, each row of $W^{(2)}$ has norm very close to 1, and the rows are nearly orthogonal to each other. Moreover, the entries (and the average of the entries) cannot be too large.

This is not a formal lemma; it serves as the intuition behind the properties of $W^{(2)}$. In this section we formally state and prove the properties we need. For simplicity of notation, we write $W$ for $W^{(2)}$ from now on unless otherwise stated.

Lemma 7. For all $i \in [c]$ and all $\epsilon > 0$, $\lim_{n\to\infty}\Pr\left[\left|\sum_{j=1}^n W_{ij}\right| \ge \epsilon\right] = 0$.

Lemma 8. For all $i \in [c]$ and all $\epsilon > 0$, $\lim_{n\to\infty}\Pr\left[\left|\|W_i\|^2 - 1\right| \ge \epsilon\right] = 0$, where $W_i$ denotes the $i$-th row of $W$.

Lemma 9. For all $i, j \in [c]$ and all $\epsilon > 0$, $\lim_{n\to\infty}\Pr\left[\left|(WW^T)_{i,j} - \delta_{i,j}\right| \ge \epsilon\right] = 0$.

Lemma 10. Let $P_W$ be the projection matrix onto the row span of $W$. Then for all $\epsilon > 0$,
$$\lim_{n\to\infty}\Pr\left[\|W^TW - P_W\|_F^2 > \epsilon\right] = 0.$$

Proof of Lemma 10. Without loss of generality, assume $\epsilon < 1$. Let $W_i$ ($i \in [c]$) be the $i$-th row of $W$, and apply the Gram-Schmidt process to the rows of $W$: assuming $\{\bar{W}_i\}_{i=1}^k$ is the already-orthonormalized basis, set
$$\tilde{W}_{k+1} := W_{k+1} - \sum_{i=1}^k\langle W_{k+1}, \bar{W}_i\rangle\bar{W}_i \quad\text{and}\quad \bar{W}_{k+1} := \frac{\tilde{W}_{k+1}}{\|\tilde{W}_{k+1}\|}.$$
By the definition of the projection matrix, $P_W = \bar{W}^T\bar{W}$. Let $\epsilon' := \epsilon^2/(c^3\cdot 16^{2c+1})$. From Lemma 9 we know that $\forall i, j \in [c]$, $\lim_{n\to\infty}\Pr[|W_iW_j^T - \delta_{i,j}| \ge \epsilon'] = 0$; from Lemma 8 we know that $\forall i \in [c]$, $\lim_{n\to\infty}\Pr[|\|W_i\|^2 - 1| \ge \epsilon'] = 0$. We use induction to bound the difference between $\bar{W}$ and $W$, showing that $\forall i \in [c]$, $\|\bar{W}_i - W_i\| \le 8^i\epsilon'$. For notational simplicity we do not repeat the probability argument and assume $\forall i, j \in [c]$, $|W_iW_j^T - \delta_{i,j}| \le \epsilon'$ and $|\|W_i\|^2 - 1| \le \epsilon'$; these inequalities are used only finitely many times, so a union bound gives the probability result. For $i = 1$, $\bar{W}_1 = W_1/\|W_1\|$ and $|\|W_1\| - 1| \le \epsilon'$, so $\|\bar{W}_1 - W_1\| \le \epsilon'$.

If the inductive hypothesis holds for all $i \le k$, then for $i = k + 1$ we have, for all $j \le k$,
$$|\langle W_i, \bar{W}_j\rangle| \le |\langle W_i, W_j\rangle| + |\langle W_i, \bar{W}_j - W_j\rangle| \le \epsilon' + \|W_i\|\cdot\|\bar{W}_j - W_j\| \le \epsilon' + (1 + \epsilon')\,8^j\epsilon' \le (2^{3j+1} + 1)\epsilon'. \qquad (91)$$
Therefore
$$\|\tilde{W}_i - W_i\| \le \sum_{j\in[k]}|\langle W_i, \bar{W}_j\rangle| \le \sum_{j\in[k]}(2^{3j+1} + 1)\epsilon' \le (2^{3k+2} - 1)\epsilon',$$
and
$$\left|\|\tilde{W}_i\| - 1\right| \le \left|\|W_i\| - 1\right| + \|\tilde{W}_i - W_i\| \le 2^{3k+2}\epsilon'. \qquad (93)$$
Thus
$$\|\bar{W}_i - W_i\| \le \|\bar{W}_i - \tilde{W}_i\| + \|\tilde{W}_i - W_i\| \le \left|\|\tilde{W}_i\| - 1\right| + \|\tilde{W}_i - W_i\| \le 8^{k+1}\epsilon',$$
which finishes the induction and implies $\forall i \in [c]$, $\|\bar{W}_i - W_i\| \le 8^i\epsilon'$. Thus
$$\|\bar{W} - W\|_F^2 = \sum_{i\in[c]}\|\bar{W}_i - W_i\|^2 \le c\cdot 64^c\epsilon'^2.$$
This means that
$$\|W^TW - P_W\|_F = \|W^TW - \bar{W}^T\bar{W}\|_F \le 2\|\bar{W} - W\|_F\|W\|_F + \|\bar{W} - W\|_F^2 \le 2c\sqrt{c}\cdot 8^c\sqrt{\epsilon'} + c\cdot 64^c\epsilon'^2 \le \epsilon.$$

Lemma 11. The largest entry of $W^{(2)}$ is reasonably small with high probability as $n$ goes to infinity; namely,
$$\lim_{n\to\infty}\Pr\left[\max_{i\in[c],j\in[n]}W^{(2)}_{ij} > 2n^{-1/3}\right] = 0. \qquad (97)$$

Proof of Lemma 11. For i.i.d. random variables $x_1, \cdots, x_n \sim \mathcal{N}(0, 1)$, by the concentration inequality for the maximum of Gaussian random variables, for any $t > 0$,
$$\Pr\left[\max_{i=1}^n x_i > \sqrt{2\log(2n)} + t\right] < 2e^{-t^2/2}.$$
Since the $W^{(2)}_{ij}$ are i.i.d. samples from $\mathcal{N}(0, \frac{1}{n})$, after rescaling by $1/\sqrt{n}$ we may substitute the $W^{(2)}_{ij}$ for the $x_j$. It follows that
$$\Pr\left[\max_{i\in[c],j\in[n]}W^{(2)}_{ij} > \frac{\sqrt{2\log(2cn)} + t}{\sqrt{n}}\right] < 2e^{-t^2/2}.$$
Taking $t = n^{1/6}$, with $c$ constant, for large $n$ we have $\sqrt{2\log(2cn)} < n^{1/6}$. Thus for large $n$,
$$\Pr\left[\max_{i\in[c],j\in[n]}W^{(2)}_{ij} > 2n^{-1/3}\right] = \Pr\left[\max_{i\in[c],j\in[n]}W^{(2)}_{ij} > \frac{n^{1/6} + n^{1/6}}{\sqrt{n}}\right] < \Pr\left[\max_{i\in[c],j\in[n]}W^{(2)}_{ij} > \frac{\sqrt{2\log(2cn)} + n^{1/6}}{\sqrt{n}}\right] < 2e^{-n^{1/3}/2}.$$
By the same argument,
$$\Pr\left[\min_{i\in[c],j\in[n]}W^{(2)}_{ij} < -2n^{-1/3}\right] < 2e^{-n^{1/3}/2}.$$
Passing $n$ to infinity completes the proof.
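The bound of Lemma 11 is easy to eyeball numerically: for entries i.i.d. $\mathcal{N}(0, 1/n)$, the maximum grows like $\sqrt{2\log(2cn)}/\sqrt{n}$, which is well below $2n^{-1/3}$ already at moderate $n$. An illustrative sketch with arbitrary $c$ and $n$:

```python
import numpy as np

rng = np.random.default_rng(6)
c, n = 10, 20000

W2 = rng.standard_normal((c, n)) / np.sqrt(n)   # entries i.i.d. N(0, 1/n)

# max of cn Gaussians is about sqrt(2 log(2cn))/sqrt(n) ≈ 0.036 here,
# comfortably below the 2 n^(-1/3) ≈ 0.074 threshold of Lemma 11
assert np.abs(W2).max() < 2 * n ** (-1.0 / 3.0)
```

The $n^{-1/3}$ threshold is deliberately loose; what the proof needs is any polynomial-in-$n$ gap over the $\sqrt{\log n}/\sqrt{n}$ growth of the maximum.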

H.3.3 Proof of Lemma 3

We first restate Lemma 3 here:

Lemma 3. $\tilde{A} := \lim_{n\to\infty}E[A]$ exists and has rank $(c-1)$ with probability 1 over $W^{(2)}$.

Before proving Lemma 3, we need to understand the distribution of $A$. Since $A$ is determined by the vector $z$, it suffices to know the distribution of $z$:

Lemma 12. $\lim_{n\to\infty}z \xrightarrow{d} \mathcal{N}\left(0, \frac{\pi-1}{2\pi}I_c\right)$ with probability 1 over $W^{(2)}$.

Proof of Lemma 12. We prove this using the multivariate Lindeberg-Feller CLT. For each $i \in [n]$, let $w_i \in \mathbb{R}^c$ denote the $i$-th column vector of $W^{(2)}$. Let $v_i = w_i(y_i - E[y_i])$; then
$$z = \sum_{i=1}^n w_iy_i = \sum_{i=1}^n v_i + E[y_i]\sum_{i=1}^n w_i. \qquad (102)$$
From Lemma 6 we know the $y_i$'s are i.i.d. rectified half standard normal, with moments
$$E[y_i] = \frac{1}{\sqrt{2\pi}},\quad E[(y_i - E[y_i])^2] = \frac{\pi-1}{2\pi},\quad E[(y_i - E[y_i])^4] = \frac{6\pi^2 - 10\pi - 3}{4\pi^2} < 1. \qquad (103)$$
It follows that $\mathrm{Var}[v_i] = \mathrm{Var}[w_iy_i] = \frac{\pi-1}{2\pi}w_iw_i^T$. Let $V = \sum_{i=1}^n\mathrm{Var}[v_i] = \frac{\pi-1}{2\pi}W^{(2)}W^{(2)T}$. As $n \to \infty$, from Lemma 9 we have $W^{(2)}W^{(2)T} \to I_c$ in probability, so $\lim_{n\to\infty}V = \frac{\pi-1}{2\pi}I_c$. We now verify the Lindeberg condition for the independent random vectors $\{v_1, \ldots, v_n\}$. First, the fourth moments of the $v_i$'s are sufficiently small:
$$\lim_{n\to\infty}\sum_{i=1}^n E\left[\|v_i\|^4\right] = \lim_{n\to\infty}\sum_{i=1}^n E\left[\left(\sum_{j=1}^c\left(W^{(2)}_{ji}(y_i - E[y_i])\right)^2\right)^2\right] \le \lim_{n\to\infty}c^2\max_{i\in[n],j\in[c]}\left(W^{(2)}_{ji}\right)^4\sum_{i=1}^n E\left[(y_i - E[y_i])^4\right].$$
Since $E[(y_i - E[y_i])^4] < 1$ and $\max_{i\in[n],j\in[c]}|W^{(2)}_{ji}| < 2n^{-1/3}$ with probability 1 by Lemma 11, it follows that
$$\lim_{n\to\infty}\sum_{i=1}^n E\left[\|v_i\|^4\right] \le c^2\lim_{n\to\infty}\left(2n^{-1/3}\right)^4\cdot n = 16c^2\lim_{n\to\infty}n^{-1/3} = 0.$$
For any $\epsilon > 0$, since $\|v_i\| > \epsilon$ in the domain of integration,
$$\lim_{n\to\infty}\sum_{i=1}^n E\left[\|v_i\|^2\mathbf{1}[\|v_i\| > \epsilon]\right] < \frac{1}{\epsilon^2}\lim_{n\to\infty}\sum_{i=1}^n E\left[\|v_i\|^4\right] = 0.$$
As the Lindeberg condition is satisfied and $\lim_{n\to\infty}V = \frac{\pi-1}{2\pi}I_c$, we have $\lim_{n\to\infty}\sum_{i=1}^n v_i \xrightarrow{d} \mathcal{N}\left(0, \frac{\pi-1}{2\pi}I_c\right)$. \qquad (109)

By Lemma 7, $\lim_{n\to\infty}\sum_{i=1}^n w_i = 0$ with probability 1 over $W^{(2)}$; plugging Eq. (109) into Eq. (102) gives $\lim_{n\to\infty}z \xrightarrow{d} \mathcal{N}\left(0, \frac{\pi-1}{2\pi}I_c\right)$.

We can now prove Lemma 3.

Proof of Lemma 3. Each entry of $A$ is a quadratic function of $p$, and $p$ is a continuous function of $z$, so we view $A$ as a function of $z$ and write $A(z)$ when necessary. From Lemma 12, $\lim_{n\to\infty}z$ follows a spherical Gaussian distribution $\mathcal{N}(0, \alpha I_c)$ with probability 1 over $W$, where $\alpha = \frac{\pi-1}{2\pi}$ is an absolute constant. Therefore $\tilde{A} := \lim_{n\to\infty}E[A]$ exists and equals $E[A(\lim_{n\to\infty}z)] = E_{z\sim\mathcal{N}(0,\alpha I_c)}[A(z)]$. For notational simplicity, we omit "with probability 1 over $W$" when there is no confusion. From the definition, $A = \mathrm{diag}(p) - pp^T$ where $p$ is obtained by applying softmax to $z$, so $\sum_{i=1}^c p_i = 1$ and $p_i \in (0, 1)$ for all $i \in [c]$. Therefore, for any such $p$,
$$\mathbf{1}^TA\mathbf{1} = \sum_{i=1}^c\left(p_i - \sum_{j=1}^c p_ip_j\right) = \sum_{i=1}^c(p_i - p_i) = 0,$$
where $\mathbf{1}$ is the all-one vector. Hence $A$ has eigenvalue 0 with eigenvector $\frac{1}{\sqrt{c}}\mathbf{1}$, and so does $E[A]$; thus $E[A]$ has rank at most $(c-1)$. We then analyze the other $(c-1)$ eigenvalues of $\tilde{A}$. Since $A = QQ^T$ where $Q = \mathrm{diag}(\sqrt{p})(I_c - \sqrt{p}\sqrt{p}^T)$, $A$ is always positive semi-definite (PSD), which means $E[A]$ is also PSD. Let the $c$ eigenvalues of $\tilde{A}$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{c-1} \ge \lambda_c = 0$. By definition,
$$\lambda_{c-1} = \min_{v\in S,\|v\|=1}v^T\tilde{A}v \ge E_{z\sim\mathcal{N}(0,\alpha I_c)}\left[\min_{v\in S,\|v\|=1}v^TAv\right],$$
where $S$ is the orthogonal complement of the span of $\mathbf{1}$. $v \in S$ implies $v \perp \mathbf{1}$, i.e., $\sum_{i=1}^c v_i = 0$. Direct computation gives
$$v^TAv = \sum_{i=1}^c v_i^2p_i - \left(\sum_{i=1}^c v_ip_i\right)^2.$$
Define $a, b \in \mathbb{R}^c$ by $a_i := v_i\sqrt{p_i}$ and $b_i := \sqrt{p_i}$; then $\|b\|^2 = \sum_{i=1}^c p_i = 1$ and
$$v^TAv = \|a\|^2 - \langle a, b\rangle^2 = \|a\|^2\|b\|^2 - \langle a, b\rangle^2 = \|a\|^2\|b\|^2\sin^2\theta(a, b),$$
where $\theta(a, b) := \arccos\frac{\langle a,b\rangle}{\|a\|\|b\|}$ is the angle between $a$ and $b$. Define $p_0 := \min_{i\in[c]}p_i$; then $\|a\|^2 = \sum_{i=1}^c v_i^2p_i \ge p_0\|v\|^2 = p_0$. Since $\|b\| = 1$, we have $\sin^2\theta(a, b) = \frac{\|a - \langle a,b\rangle b\|^2}{\|a\|^2}$. Moreover,
$$\|a - \langle a,b\rangle b\|^2 = \sum_{i=1}^c\left(v_i\sqrt{p_i} - \left(\sum_{j=1}^c v_jp_j\right)\sqrt{p_i}\right)^2 = \sum_{i=1}^c p_i\left(v_i - \sum_{j=1}^c v_jp_j\right)^2 \ge p_0\sum_{i=1}^c\left(v_i - \sum_{j=1}^c v_jp_j\right)^2.$$
Define $s := \arg\max_{i\in[c]}v_i$ and $t := \arg\min_{i\in[c]}v_i$; then
$$\sum_{i=1}^c\left(v_i - \sum_{j=1}^c v_jp_j\right)^2 \ge \left(v_s - \sum_{j=1}^c v_jp_j\right)^2 + \left(v_t - \sum_{j=1}^c v_jp_j\right)^2 \ge \frac{(v_s - v_t)^2}{2}.$$
From $\|v\| = 1$ we know $\max_{i\in[c]}|v_i| \ge \frac{1}{\sqrt{c}}$, and since $\sum_{i=1}^c v_i = 0$ we have $v_s > 0 > v_t$; therefore $v_s - v_t > \max_{i\in[c]}|v_i| \ge \frac{1}{\sqrt{c}}$. As a result, $\|a - \langle a,b\rangle b\|^2 \ge p_0\cdot\frac{(v_s - v_t)^2}{2} > \frac{p_0}{2c}$. Moreover, $\|a\|^2 = \sum_{i=1}^c v_i^2p_i \le \sum_{i=1}^c p_i = 1$. Thus $\sin^2\theta(a, b) \ge \frac{p_0}{2c}$, which means $v^TAv \ge p_0\cdot 1\cdot\frac{p_0}{2c} = \frac{p_0^2}{2c}$. Now we analyze the distribution of $p_0$. Since $z$ follows the spherical Gaussian distribution $\mathcal{N}(0, \alpha I_c)$, the entries of $z$ are independent. For each entry $z_i$ ($i \in [c]$), we have $|z_i| < \alpha$ with probability $\beta$, where $\beta \approx 0.68$ is an absolute constant. Therefore, with probability $\beta^c$, all entries satisfy $|z_i| < \alpha$, in which case
$$p_0 = \frac{\exp(\min_{i\in[c]}z_i)}{\sum_{i=1}^c\exp(z_i)} \ge \frac{\exp(-\alpha)}{c\exp(\alpha)}.$$
In all other cases, $p_0 > 0$. Thus
$$\lambda_{c-1} \ge E_{z\sim\mathcal{N}(0,\alpha I_c)}\left[\min_{v\in S,\|v\|=1}v^TAv\right] \ge \beta^c\cdot\frac{1}{2c}\left(\frac{\exp(-\alpha)}{c\exp(\alpha)}\right)^2.$$
The right-hand side is independent of $n$. Therefore $\lambda_{c-1} > 0$, which means $\tilde{A}$ has exactly $(c-1)$ positive eigenvalues and one 0 eigenvalue, and the gap between the smallest positive eigenvalue and 0 is independent of $n$.
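The structural facts about $A = \mathrm{diag}(p) - pp^T$ used in this proof — the all-one vector lies in the null space, a factorization $A = QQ^T$ exists (we write $Q = \mathrm{diag}(\sqrt{p})(I_c - \sqrt{p}\sqrt{p}^T)$, which satisfies it since $I_c - \sqrt{p}\sqrt{p}^T$ is a projection when $\sum_i p_i = 1$), and the rank is $c - 1$ — can be checked directly. An illustrative sketch for one random softmax vector:

```python
import numpy as np

rng = np.random.default_rng(7)
c = 6
z = rng.standard_normal(c)
p = np.exp(z - z.max()); p /= p.sum()          # softmax output, sums to 1

A = np.diag(p) - np.outer(p, p)
ones = np.ones(c)

# the all-one vector is in the null space of A
assert np.allclose(A @ ones, 0.0)

# A = Q Q^T with Q = diag(sqrt(p)) (I - sqrt(p) sqrt(p)^T), hence A is PSD
q = np.sqrt(p)                                  # unit vector since sum(p) = 1
Q = np.diag(q) @ (np.eye(c) - np.outer(q, q))
assert np.allclose(A, Q @ Q.T)

eig = np.sort(np.linalg.eigvalsh(A))
assert eig[0] > -1e-12                          # PSD
assert eig[1] > 1e-8                            # exactly one zero eigenvalue: rank c-1
```

For $p$ in the interior of the simplex (which softmax always produces), the remaining $c - 1$ eigenvalues are strictly positive, matching the $p_0^2/(2c)$ lower bound derived above.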

H.3.4 Proof of Lemma 4

We first restate Lemma 4 here:

Lemma 4. Let $D_X$ denote the distribution of $X$, and let $TV(D_1, D_2)$ denote the total variation distance between $D_1$ and $D_2$. Then for all $i,j\in[n]$ and all $\epsilon > 0$,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim N(0,\frac1d I_{nd}),\,W^{(2)}\sim N(0,\frac1n I_{cn})}\Big[TV\big(D_A,\, D_{A|D_{i,i}=D_{j,j}=1}\big) > \epsilon\Big] = 0.$$

Proof of Lemma 4. The proof of this lemma requires knowledge of the distribution of $A$ conditioned on two diagonal entries of $D$. Since $A$ is uniquely determined by $z$, it is enough to know the distribution of $z$ conditioned on those two entries. We use the following lemma:

Lemma 13. With probability 1 over $W^{(2)}$, for any $i,j\in[n]$ and any $(p,q)\in\{0,1\}^2$, $\lim_{n\to\infty} P(z\,|\,D_{ii}=p, D_{jj}=q) \xrightarrow{d} z$.

Proof of Lemma 13. With $\{w_1,\dots,w_n\}$ and $\{v_1,\dots,v_n\}$ defined as above, the summands contributing to $z$ are independent, and each $v_i$ is only affected by its corresponding $D_{ii}$, so
$$P(z|D_{ii}=p, D_{jj}=q) = z - v_i + P(v_i|D_{ii}=p) - v_j + P(v_j|D_{jj}=q) = z - w_i\big(y_i - P(y_i|D_{ii}=p)\big) - w_j\big(y_j - P(y_j|D_{jj}=q)\big).$$
For any $i\in[n]$, conditioned on $D_{ii}=p$: when $p=0$, $P(y_i|D_{ii}=p) = 0$; when $p=1$, $P(y_i|D_{ii}=p)$ follows a half standard normal distribution (a standard normal truncated at 0). In both cases the conditional distribution of $P(y_i|D_{ii}=p)$, and hence of $y_i - P(y_i|D_{ii}=p)$, has bounded mean and variance. For any $w_i$, by Lemma 11 we have $\|w_i\|^2 \le c\max_{i\in[c],j\in[n]}\big(W^{(2)}_{ij}\big)^2 < 4cn^{-2/3}$ with probability 1 over $W^{(2)}$. Since $\lim_{n\to\infty}\sqrt{4cn^{-2/3}} = 0$, as $n$ goes to infinity we have $w_i\big(y_i - P(y_i|D_{ii}=p)\big) \xrightarrow{d} 0$ with probability 1 over $W^{(2)}$. Therefore
$$P(z|D_{ii}=p, D_{jj}=q) = z - w_i\big(y_i - P(y_i|D_{ii}=p)\big) - w_j\big(y_j - P(y_j|D_{jj}=q)\big) \xrightarrow{d} z. \quad (130)$$

From Lemma 13 we conclude that for all $\epsilon > 0$,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim N(0,\frac1d I_{nd}),\,W^{(2)}\sim N(0,\frac1n I_{cn})}\Big[TV\big(D_z,\, D_{z|D_{i,i}=D_{j,j}=1}\big) > \epsilon\Big] = 0.$$
Since $A$ is uniquely determined by $z$, we have $TV\big(D_z, D_{z|D_{i,i}=D_{j,j}=1}\big) \ge TV\big(D_A, D_{A|D_{i,i}=D_{j,j}=1}\big)$.
Therefore,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim N(0,\frac1d I_{nd}),\,W^{(2)}\sim N(0,\frac1n I_{cn})}\Big[TV\big(D_A,\, D_{A|D_{i,i}=D_{j,j}=1}\big) > \epsilon\Big] = 0.$$
This finishes the proof of Lemma 4.
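The mechanism behind Lemma 13 — conditioning on a single $D_{ii}$ changes $z$ only through the summand $w_i\,\mathrm{relu}(y_i)$, whose size is controlled by $\max_i\|w_i\| \to 0$ — can be illustrated with a quick simulation (our sketch, not from the paper; the scalar-output simplification is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-summand perturbation when y_i is replaced by its conditional (half-normal)
# version, for second-layer weights with i.i.d. N(0, 1/n) entries.
maxima = []
for n in [100, 10000, 1000000]:
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)   # entries of one row of W^(2)
    y = rng.normal(size=n)                          # pre-activations
    y_cond = np.abs(rng.normal(size=n))             # y_i | D_ii = 1 is half-normal
    delta = np.abs(w) * np.abs(y_cond - np.clip(y, 0.0, None))
    maxima.append(delta.max())

print(maxima)
assert maxima[-1] < maxima[0]   # the worst-case perturbation shrinks as n grows
assert maxima[-1] < 0.1
```

The maximal per-summand shift decays roughly like $\sqrt{\log n / n}$, consistent with the $\sqrt{4cn^{-2/3}}\to 0$ bound used in the proof.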

H.3.5 Proof of Lemma 5

We first restate Lemma 5 here:

Lemma 5. With probability 1 over $W^{(2)}$, $\lim_{n\to\infty}\frac{\|WMW^T\|_F^2}{\|M\|_F^2} = 1$.

Proof of Lemma 5. To relate $\|WMW^T\|_F^2$ and $\|M\|_F^2$ we need several intermediate quantities, including ones involving $M^*$, as bridges. To relate $M$ and $M^*$, we need the following lemma, which explains why the weaker sense of independence in Lemma 4 suffices in place of full independence between $A$ and $D$.

Lemma 14. Let $p(A,D)$ be a homogeneous polynomial in the entries of $A$ and $D$ of degree 1 in $A$ and degree 2 in $D$, whose coefficients are bounded in $\ell_1$-norm by an absolute constant. Let $A'$ be an independent copy of $A$. Then for all $\epsilon > 0$,
$$\lim_{n\to\infty}\lim_{d\to\infty}\Pr_{W^{(1)}\sim N(0,\frac1d I_{nd}),\,W^{(2)}\sim N(0,\frac1n I_{cn})}\Big[\big|E[p(A,D)] - E[p(A',D)]\big| > \epsilon\Big] = 0.$$

Proof of Lemma 14. Write $p(A,D) = \sum_{i=1}^m c_i A_{s(i),t(i)} D_{u(i),u(i)} D_{v(i),v(i)}$; by linearity of expectation, $E[p(A,D)] = \sum_{i=1}^m c_i E[A_{s(i),t(i)} D_{u(i),u(i)} D_{v(i),v(i)}]$. Since the entries of $D$ can only be 0 or 1,
$$E\big[A_{s(i),t(i)} D_{u(i),u(i)} D_{v(i),v(i)}\big] = E\big[A_{s(i),t(i)}\,\big|\,D_{u(i),u(i)} = D_{v(i),v(i)} = 1\big]\cdot\Pr\big[D_{u(i),u(i)} = D_{v(i),v(i)} = 1\big],$$
while $E[A'_{s(i),t(i)} D_{u(i),u(i)} D_{v(i),v(i)}] = E[A_{s(i),t(i)}]\cdot\Pr[D_{u(i),u(i)} = D_{v(i),v(i)} = 1]$, since $A'$ is independent of $D$ and has the same marginal distribution as $A$. Assume the coefficients satisfy $\sum_{i=1}^m |c_i| \le \alpha$, and apply Lemma 4 with tolerance $\epsilon' := \epsilon/\alpha$: when $n$ and $d$ go to infinity, with probability 1 over $W^{(1)}$ and $W^{(2)}$ we have $TV(D_A, D_{A|D_{i,i}=D_{j,j}=1}) \le \epsilon'$. Besides, since $p_i, p_j, p_i + p_j \in (0,1)$ for all $i\ne j$, each entry of $A$ (either $p_i - p_i^2$ or $-p_ip_j$) lies in $(-\frac14, \frac14)$. Therefore, when $n$ and $d$ go to infinity, with probability 1 over $W^{(1)}$ and $W^{(2)}$,
$$\Big|E\big[A_{s(i),t(i)}\,\big|\,D_{u(i),u(i)} = D_{v(i),v(i)} = 1\big] - E\big[A_{s(i),t(i)}\big]\Big| \le TV\big(D_A, D_{A|D_{i,i}=D_{j,j}=1}\big)\cdot\Big(\tfrac14 - \big(-\tfrac14\big)\Big) \le \frac{\epsilon'}{2}. \quad (138)$$
Thus,
$$\big|E[p(A,D)] - E[p(A',D)]\big| \le \sum_{i=1}^m |c_i|\cdot\frac{\epsilon'}{2} \le \alpha\cdot\frac{\epsilon}{2\alpha} < \epsilon. \quad (139)$$
This finishes the proof of Lemma 14. With Lemma 4 in hand, we have the following lemmas:

Lemma 15. With probability 1 over $W$, $\lim_{n\to\infty}\frac{\|M\|_F^2}{\|M^*\|_F^2} = 1$.

Proof of Lemma 15.
Let $(D', A')$ be an independent copy of $(D, A)$. Then
$$\|M\|_F^2 = \big\|E[DW^TAWD]\big\|_F^2 = E\big[\langle DW^TAWD,\, D'W^TA'WD'\rangle\big] = E\big[\operatorname{tr}\big(DW^TAWD\,D'W^TA'WD'\big)\big] = E\big[\operatorname{tr}\big(WD'DW^TAWDD'W^TA'\big)\big]. \quad (140)$$
Expressing the term inside the expectation as a polynomial in the entries of $A$, $D$, $A'$, and $D'$, we get
$$\operatorname{tr}\big(WD'DW^TAWDD'W^TA'\big) = \sum_{i=1}^c\big(WD'DW^TAWDD'W^TA'\big)_{i,i} = \sum_{i,j=1}^c\big(WD'DW^TA\big)_{i,j}\big(WDD'W^TA'\big)_{j,i} = \sum_{i,j,k,s=1}^c\sum_{l,t=1}^n W_{i,l}W_{k,l}W_{j,t}W_{s,t}\,A_{k,j}A'_{s,i}\,D_{l,l}D'_{l,l}D_{t,t}D'_{t,t}.$$
Now we bound the $\ell_1$-norm of the coefficients of this polynomial (note that the absolute value of each entry of $A$ is bounded by 1):
$$\sum_{i,j,k,s=1}^c\sum_{l,t=1}^n\big|W_{i,l}W_{k,l}W_{j,t}W_{s,t}\big| = \Big(\sum_{i,k=1}^c\sum_{l=1}^n |W_{i,l}||W_{k,l}|\Big)\Big(\sum_{j,s=1}^c\sum_{t=1}^n |W_{j,t}||W_{s,t}|\Big) \le \Big(\sum_{i,k=1}^c\sum_{l=1}^n \frac{W_{i,l}^2 + W_{k,l}^2}{2}\Big)^2 = \Big(\sum_{i,k=1}^c\frac{\|W_i\|^2 + \|W_k\|^2}{2}\Big)^2 = \big(c\|W\|_F^2\big)^2 = c^2\|W\|_F^4.$$
When $n\to\infty$, we know that $\|W\|_F^2 = O(c)$ with probability 1 over $W$, so the coefficients of this polynomial are $\ell_1$-norm bounded. We know from Lemma 4 that the distribution of $A$ is invariant when conditioned on two entries of $D$. Furthermore, since $A'$ and $D'$ are independent copies of $A$ and $D$, the distribution of $(A, A')$ is likewise invariant when conditioned on two entries of $D$ and two entries of $D'$. Each term in this polynomial is a 4th-order term containing two entries from $D$ and two from $D'$. Combined with Lemma 14, this gives
$$\lim_{n\to\infty}\frac{\|M\|_F^2}{\|M^*\|_F^2} = 1.$$

Lemma 16. For all $i,j\in[c]$, $\lim_{n\to\infty}\big((WMW^T)_{i,j} - (WM^*W^T)_{i,j}\big) = 0$. Thus, $\lim_{n\to\infty}\frac{\|WMW^T\|_F^2}{\|WM^*W^T\|_F^2} = 1$.

Proof of Lemma 16. This proof is very similar to that of Lemma 15.
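The first equality chain in Eq. (140) rests on the identity $\|E[X]\|_F^2 = E[\langle X, X'\rangle]$ for an independent copy $X'$ of a random matrix $X$. A quick Monte Carlo check of this identity (our sketch; the matrix family `base + noise` is an arbitrary stand-in, not the paper's $DW^TAWD$):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200000

# Random matrix family X = base + Gaussian noise; ||E[X]||_F^2 should equal
# E[<X, X'>] for an independent copy X', since E[<X, X'>] = <E[X], E[X']>.
base = np.arange(9.0).reshape(3, 3)
X = base + rng.normal(size=(N, 3, 3))
Xp = base + rng.normal(size=(N, 3, 3))

lhs = np.linalg.norm(X.mean(axis=0)) ** 2      # ||E[X]||_F^2, estimated
rhs = np.mean(np.sum(X * Xp, axis=(1, 2)))     # E[<X, X'>], estimated

print(lhs, rhs)
assert abs(lhs - rhs) < 1.0   # both concentrate around ||base||_F^2 = 204
```

Lemma 14 is then what licenses replacing the coupled copy $A$ by the independent copy $A'$ inside this expectation.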
First, we focus on a single entry of the matrix $WMW^T$ and express it as a polynomial in the entries of $D$:
$$(WMW^T)_{i,j} = E\big[(WDW^TAWDW^T)_{i,j}\big] = E\Big[\sum_{k=1}^c (WDW^TA)_{i,k}(WDW^T)_{k,j}\Big] = E\Big[\sum_{k,s=1}^c\sum_{l,t=1}^n A_{s,k}\,W_{i,l}W_{s,l}W_{k,t}W_{j,t}\,D_{l,l}D_{t,t}\Big].$$
Then we bound the $\ell_1$-norm of the coefficients of this polynomial as follows:
$$\sum_{k,s=1}^c\sum_{l,t=1}^n\big|W_{i,l}W_{s,l}W_{k,t}W_{j,t}\big| = \Big(\sum_{s=1}^c\sum_{l=1}^n |W_{i,l}||W_{s,l}|\Big)\Big(\sum_{k=1}^c\sum_{t=1}^n |W_{k,t}||W_{j,t}|\Big) \le \frac{c\|W_i\|^2 + \|W\|_F^2}{2}\cdot\frac{c\|W_j\|^2 + \|W\|_F^2}{2} \le \big(2c\|W\|_F^2\big)^2 = 4c^2\|W\|_F^4.$$
Similar to Lemma 15, these coefficients are $\ell_1$-norm bounded. Therefore, using Lemma 14, we have with probability 1 over $W$ that for all $i,j\in[c]$, $\lim_{n\to\infty}\big((WMW^T)_{i,j} - (WM^*W^T)_{i,j}\big) = 0$, which indicates that $\lim_{n\to\infty}\frac{\|WMW^T\|_F^2}{\|WM^*W^T\|_F^2} = 1$.

Lemma 17. $\lim_{n\to\infty}\frac{\|WM^*W^T\|_F^2}{\|M^*\|_F^2} = 1$.

Proof of Lemma 17. The proof of this lemma is divided into two parts. In the first part we estimate the Frobenius norm of $M^*$, and in the second part we do the same for $WM^*W^T$. For the second part, note that
$$WM^*W^T = \frac14\Big(E[WW^TAWW^T] + W\operatorname{diag}\big(E[W^TAW]\big)W^T\Big).$$
Similar to Part 1, when $n\to\infty$, with probability 1 we have $\lim_{n\to\infty}\frac{\|E[WW^TAWW^T]\|_F^2}{\|\tilde A\|_F^2} = 1$. Besides, when $n\to\infty$, with probability 1 we have
$$\big\|W\operatorname{diag}\big(E[W^TAW]\big)W^T\big\|_F^2 \le \|W\|_F^2\,\|\tilde A\|^2\sum_{i=1}^n\|w_i\|^4 \le \|\tilde A\|^2\cdot\frac{c^2 + 3c}{n}\,\|W\|_F^2.$$
As a result, with probability 1,
$$\lim_{n\to\infty}\frac{\big\|W\operatorname{diag}(E[W^TAW])W^T\big\|_F^2}{\big\|WW^T\tilde AWW^T\big\|_F^2} = 0.$$
Moreover, $\|M^*\|_F \ge \frac14\big\|E[W^TAW]\big\|_F = \frac14\big\|W^T\tilde AW\big\|_F$. From Eq. (125), we know that the first $c-1$ eigenvalues of $\tilde A$ are lower bounded by the constant $\gamma := \beta^c\cdot\frac{1}{2c}\big(\frac{\exp(-\alpha)}{c\exp(\alpha)}\big)^2$, which is independent of $n$. Then we analyze the eigenvalues of $W^T\tilde AW$: from Lemma 9, setting $\epsilon = \frac12$, we know that the smallest singular value of $W$ is lower bounded by $\frac12$.
Therefore, for any unit vector $v$ in the row span of $W$, we have $v^TW^T\tilde AWv = (Wv)^T\tilde A(Wv) \ge \gamma\|Wv\|^2 \ge \frac{\gamma}{4}$. Thus $\|W^T\tilde AW\|_F \ge \frac{\gamma}{4}$, which is a constant independent of $n$. Besides, since $D$ is a diagonal matrix with 0/1 entries and the absolute value of each entry of $A$ is bounded by 1, we have
$$\|M\|_F = \big\|E[DW^TAWD]\big\|_F \le \big\|E[W^TAW]\big\|_F \le \|W\|_F^2\|A\|_F \le c\|W\|_F^2.$$
From Lemma 8 we know that, with probability 1, $\|W\|_F^2 \le 2c$; therefore $\|M\|_F$ is upper bounded by $2c^2$, which is independent of $n$.

Now we are ready to prove our main theorem. From Lemma 5 we have
$$\lim_{n\to\infty}\frac{\|WMW^T\|_F^2}{\|M\|_F^2} = 1.$$
Then we consider $\|W^TWMW^TW\|_F^2$. Note that
$$\|W^TWMW^TW\|_F^2 = \operatorname{tr}\big(W^TWMW^TWW^TWMW^TW\big) = \operatorname{tr}\big(WW^T\cdot WMW^TWW^TWMW^T\big).$$
From Lemma 9 we know that for all $\epsilon > 0$, $\lim_{n\to\infty}\Pr\big(\|WW^T - I_c\| \ge \epsilon\big) = 0$. For notational simplicity, in this proof we omit the limit and probability qualifiers, which can be handled by a union bound, and directly write $\|WW^T - I_c\| \le \epsilon$. From Kleinman & Athans (1968) we know that for positive semi-definite matrices $A$ and $B$ we have $\lambda_{\min}(A)\operatorname{tr}(B) \le \operatorname{tr}(AB) \le \lambda_{\max}(A)\operatorname{tr}(B)$, so
$$\big|\operatorname{tr}\big(WW^T\cdot WMW^TWW^TWMW^T\big) - \operatorname{tr}\big(WMW^TWW^TWMW^T\big)\big| \le \max\{1 - \lambda_{\min}(WW^T),\,\lambda_{\max}(WW^T) - 1\}\,\operatorname{tr}\big(WMW^TWW^TWMW^T\big) \le \|WW^T - I_c\|\,\operatorname{tr}\big(WMW^TWW^TWMW^T\big) \le \epsilon\,\operatorname{tr}\big(WMW^TWW^TWMW^T\big). \quad (167)$$
Therefore,
$$\big|\|W^TWMW^TW\|_F^2 - \|WMW^T\|_F^2\big| = \big|\operatorname{tr}\big(WW^T\cdot WMW^TWW^TWMW^T\big) - \operatorname{tr}\big(WMW^TWMW^T\big)\big| \le \epsilon\,\operatorname{tr}\big(WMW^TWW^TWMW^T\big) + \epsilon\,\operatorname{tr}\big(WMW^TWMW^T\big) \le \epsilon(1+\epsilon)\operatorname{tr}\big(WMW^TWMW^T\big) + \epsilon\,\operatorname{tr}\big(WMW^TWMW^T\big) \le (2\epsilon + \epsilon^2)\,\operatorname{tr}\big(WMW^TWMW^T\big) = (2\epsilon + \epsilon^2)\,\|WMW^T\|_F^2.$$
For all $\epsilon' > 0$, selecting $\epsilon < \min\{\sqrt{\epsilon'/2},\,\epsilon'/4\}$, we have $\big|\|W^TWMW^TW\|_F^2 - \|WMW^T\|_F^2\big| < \epsilon'\|WMW^T\|_F^2$. In other words,
$$\lim_{n\to\infty}\frac{\|W^TWMW^TW\|_F^2}{\|WMW^T\|_F^2} = 1.$$
Hence we get
$$\lim_{n\to\infty}\frac{\|W^TWMW^TW\|_F^2}{\|M\|_F^2} = 1.$$
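The Kleinman & Athans (1968) trace bound used repeatedly above is easy to verify numerically (our sketch; random PSD matrices as stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)

# For PSD A and B: lambda_min(A) * tr(B) <= tr(A B) <= lambda_max(A) * tr(B).
for _ in range(100):
    G = rng.normal(size=(6, 6))
    H = rng.normal(size=(6, 6))
    A = G @ G.T          # random PSD matrix
    B = H @ H.T          # random PSD matrix
    lam = np.linalg.eigvalsh(A)          # ascending eigenvalues of A
    t = np.trace(A @ B)
    assert lam[0] * np.trace(B) - 1e-8 <= t <= lam[-1] * np.trace(B) + 1e-8

print("trace bounds hold")
```

The bound follows from writing $\operatorname{tr}(AB) = \operatorname{tr}(B^{1/2}AB^{1/2})$ and bounding the conjugated matrix between $\lambda_{\min}(A)B$ and $\lambda_{\max}(A)B$.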
Next, consider the orthogonal projection matrix $P_W$ that projects vectors in $\mathbb{R}^n$ onto the subspace spanned by the rows of $W$, and consider the matrix $P_WMP_W$. Define $\delta := W^TW - P_W$; from Lemma 10 we get $\|\delta\|_F^2 \le \epsilon$. Therefore, using $\|P_W\| \le 1$,
$$\big|\|W^TWMW^TW\|_F - \|P_WMP_W\|_F\big| \le \|P_WM\delta\|_F + \|\delta MP_W\|_F + \|\delta M\delta\|_F \le \|M\|_F\big(2\|\delta\|_F + \|\delta\|_F^2\big) \le \|M\|_F\big(2\sqrt{\epsilon} + \epsilon\big).$$
For all $\epsilon' > 0$, choosing $\epsilon < \min\{(\epsilon')^2/16,\,\epsilon'/2\}$ gives $\big|\|W^TWMW^TW\|_F - \|P_WMP_W\|_F\big| < \epsilon'\|M\|_F$, which means that
$$\lim_{n\to\infty}\frac{\big|\|W^TWMW^TW\|_F - \|P_WMP_W\|_F\big|}{\|W^TWMW^TW\|_F} = \lim_{n\to\infty}\frac{\big|\|W^TWMW^TW\|_F - \|P_WMP_W\|_F\big|}{\|M\|_F} = 0. \quad (174)$$
Thus,
$$\lim_{n\to\infty}\frac{\|P_WMP_W\|_F}{\|M\|_F} = \lim_{n\to\infty}\frac{\|P_WMP_W\|_F}{\|W^TWMW^TW\|_F} = 1.$$
Note that $M$ decomposes into the four blocks $P_WMP_W$, $P_WMP_W^{\perp}$, $P_W^{\perp}MP_W$, and $P_W^{\perp}MP_W^{\perp}$. From Lemma 18 we know that, for large enough $n$, $\|M\|_F$ is lower bounded by some constant independent of $n$. For any $\epsilon > 0$, set $\delta' < \min\{\frac{\epsilon}{8c^2},\,\frac{\sqrt{\epsilon}}{2c}\}$; from Lemma 10 we know that, with probability 1, $\|P_W - W^TW\|_F \le \delta'$. Therefore,
$$\|P_WMP_W - W^TWMW^TW\|_F \le \|P_W - W^TW\|_F^2\,\|M\|_F + 2\|P_W - W^TW\|_F\,\|M\|_F \le (\delta')^2\cdot 2c^2 + 2\delta'\cdot 2c^2 < \epsilon.$$
We will first analyze the second term on the RHS of Eq. (188). From Lemma 9, for all $\epsilon > 0$ we have $\|WW^T - I_c\|_F < \epsilon$ with probability 1, which means $\big|\|WW^T\|_F - \sqrt{c}\big| < \epsilon$ with probability 1; setting $\epsilon = c$, we get $\|WW^T\|_F < 2c$ with probability 1. Note that
$$\big\|W^TW\operatorname{diag}\big(E[W^TAW]\big)W^TW\big\|_F \le \|W^TW\|_F^2\,\big\|\operatorname{diag}(E[W^TAW])\big\|_F = \|WW^T\|_F^2\,\big\|\operatorname{diag}(E[W^TAW])\big\|_F \le 4c^2\,\big\|\operatorname{diag}(E[W^TAW])\big\|_F.$$
Combining this with Eq. (154), we have
$$\lim_{n\to\infty}\frac{\big\|W^TW\operatorname{diag}(E[W^TAW])W^TW\big\|_F}{\|W^T\tilde AW\|_F} = 0.$$
From Eq. (162) we know that $\|W^T\tilde AW\|_F \ge \frac{\gamma}{4}$ with probability 1, so
$$\lim_{n\to\infty}\big\|4W^TWM^*W^TW - W^TWW^T\tilde AWW^TW\big\|_F = 0.$$
Similarly, define $\delta := WW^T - I_c$; then
$$\big\|W^TWW^T\tilde AWW^TW - W^T\tilde AW\big\|_F \le \|W^T\delta\tilde A\delta W\|_F + 2\|W^T\tilde A\delta W\|_F \le \|W\|_F^2\|\delta\|_F^2\|\tilde A\|_F + 2\|W\|_F\|\delta\|_F\|\tilde A\|_F.$$
Set $\epsilon < \min\{\frac{\epsilon'}{8c^2},\,\sqrt{\frac{\epsilon'}{8c^3}}\}$; then from Lemma 9 we know that $\|\delta\|_F < \epsilon$ with probability 1, and from Lemma 8 we have $\|W\|_F \le 2c$ with probability 1.
We also have $\|\tilde A\|_F \le c$, since each entry of $A$ is bounded by 1 in absolute value. Therefore,
$$\big\|W^TWW^T\tilde AWW^TW - W^T\tilde AW\big\|_F \le 4c^2\epsilon^2\cdot c + 2\cdot 2c\cdot\epsilon\cdot c < \frac{\epsilon'}{2} + \frac{\epsilon'}{2} = \epsilon',$$
which means that
$$\lim_{n\to\infty}\big\|W^TWW^T\tilde AWW^TW - W^T\tilde AW\big\|_F = 0.$$
From Eq. (187), Eq. (191), and Eq. (194) we get
$$\lim_{n\to\infty}\Big\|M - \frac14 W^T\tilde AW\Big\|_F = 0.$$
Besides, from Eq. (95) in Lemma 10 we know that for any $\epsilon > 0$, (196) holds, where $\bar W$ is the orthonormalized version of $W$, i.e., the result of running the Gram-Schmidt process on the rows of $W$. Define $\delta := \bar W - W$. For any $\epsilon' > 0$, set $\epsilon = \min\{\frac{\epsilon'}{8c^2},\,\frac{\epsilon'}{2c}\}$; then with probability 1,
$$\|\bar W - W\|_F^2 = \sum_{i\in[c]}\|\bar W_i - W_i\|^2 < \epsilon^2,$$
so that
$$\big\|\bar W^T\tilde A\bar W - W^T\tilde AW\big\|_F \le 2\|\delta\|_F\|\tilde A\|_F\|W\|_F + \|\delta\|_F^2\|\tilde A\|_F \le 4c^2\epsilon + c\epsilon^2 < \epsilon'.$$
Therefore $\lim_{n\to\infty}\|\bar W^T\tilde A\bar W - W^T\tilde AW\|_F = 0$, which implies
$$\lim_{n\to\infty}\Big\|M - \frac14\bar W^T\tilde A\bar W\Big\|_F = 0.$$
From Lemma 3 we know that, with probability 1, $\tilde A$ has rank $c-1$. Since $A\mathbf{1} = 0$ always holds, the top $c-1$ eigenspace of $\tilde A$ is the orthogonal complement of $\mathbf{1}$ in $\mathbb{R}^c$. Since the rows of $\bar W$ have unit norm and are orthogonal to each other, we conclude that $\bar W^T\tilde A\bar W$ has rank $c-1$, and the corresponding eigenspace is the span of $\{\bar W_i\}_{i=1}^c$ with the direction $\bar W^T\mathbf{1}$ removed. Moreover, the minimum positive eigenvalue of $\bar W^T\tilde A\bar W$ is lower bounded by $\gamma$.

As for the top $c-1$ eigenvectors of $M$, define $\delta'' := M - \frac14\bar W^T\tilde A\bar W$, so that $M = \frac14\bar W^T\tilde A\bar W + \delta''$. Let $S_1$ denote the top $c-1$ eigenspace of $M$, and $S_2$ the top $c-1$ eigenspace of $\frac14\bar W^T\tilde A\bar W$. Then from the Davis-Kahan theorem we know that
$$\big\|\sin\Theta(S_1, S_2)\big\|_F \le \frac{\|\delta''\|_F}{\lambda_{c-1}\big(\frac14\bar W^T\tilde A\bar W\big)}.$$
Here $\Theta(S_1, S_2)$ is the $(c-1)\times(c-1)$ diagonal matrix whose $i$-th diagonal entry is the $i$-th canonical angle between $S_1$ and $S_2$. Since $\lim_{n\to\infty}\|\delta''\|_F = 0$ and, with probability 1, $\lambda_{c-1}\big(\frac14\bar W^T\tilde A\bar W\big) \ge \frac{\gamma}{4}$, which is independent of $n$, we have with probability 1 that $\lim_{n\to\infty}\|\sin\Theta(S_1, S_2)\|_F = 0$. This indicates that the top $c-1$ eigenspaces of $M$ and $\frac14\bar W^T\tilde A\bar W$ coincide as $n\to\infty$; hence $M$ has the same top $c-1$ eigenspace as $\bar W^T\tilde A\bar W$. Besides, from Eq.
(95) we know that $\lim_{n\to\infty}\|\bar W - W\|_F = 0$, so the span of $\{\bar W_i\}_{i=1}^c$ with the direction $\bar W^T\mathbf{1}$ removed coincides in the limit with the span of $\{W_i\}_{i=1}^c$ with the direction $W^T\mathbf{1}$ removed. This completes the proof of the theorem.
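The Davis-Kahan step at the end of the proof can be checked numerically. The sketch below (ours, illustrative only; the rank-$(c-1)$ matrix `M0` and perturbation `E` are synthetic stand-ins) verifies that a small symmetric perturbation rotates the top $c-1$ eigenspace by an amount controlled by $\|\delta\|_F$ over the eigengap:

```python
import numpy as np

rng = np.random.default_rng(4)
c, k = 10, 9

# M0: symmetric rank-(c-1) matrix with eigengap 1 below its top-(c-1) eigenspace.
Q = np.linalg.qr(rng.normal(size=(c, c)))[0]
M0 = Q @ np.diag([1.0] * k + [0.0]) @ Q.T
E = rng.normal(size=(c, c))
E = 1e-2 * (E + E.T)          # small symmetric perturbation
M = M0 + E

U0 = np.linalg.eigh(M0)[1][:, -k:]   # top-(c-1) eigenvectors of M0
U1 = np.linalg.eigh(M)[1][:, -k:]    # top-(c-1) eigenvectors of M
s = np.linalg.svd(U0.T @ U1, compute_uv=False)        # principal cosines
sin_theta = np.sqrt(np.clip(1.0 - s**2, 0.0, None))   # principal sines

print(np.linalg.norm(sin_theta), np.linalg.norm(E))
assert np.linalg.norm(sin_theta) <= 2.0 * np.linalg.norm(E)  # ~ ||E||_F / gap
```

Shrinking the perturbation (here $10^{-2}$) shrinks the canonical angles proportionally, mirroring how $\lim_{n\to\infty}\|\delta''\|_F = 0$ forces the eigenspaces of $M$ and $\frac14\bar W^T\tilde A\bar W$ to coincide.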

H.4 Experiment Results

Table 10: Overlap of $R(S^{(k)})\setminus\{S^{(k)}\cdot\mathbf{1}\}$ with the top $(c-1)$-dimensional eigenspace of $E[M^{(k)}]$ for different layers at minima. Note that the overlap can be low for random-label datasets, which do not have a clear eigengap (as in Fig. 4). Understanding how the data can change the behavior of the Hessian is an interesting open problem. Other papers have given alternative explanations that are not directly comparable to ours; however, ours is the only one that gives a closed-form formula for the top eigenspace. In Appendix F.1 we discuss the other explanations in more detail.



Overlap between dominant eigenspaces of the layer-wise Hessian at different minima for fc1:LeNet5 (left), with output dimension 120, and conv11:ResNet18-W64 (right), with output dimension 64. Top 10 singular values of the top 4 eigenvectors of the layer-wise Hessian of fc1:LeNet5 after being reshaped into matrices.

Figure 2: Comparison between the approximated and true layer-wise Hessian of F-200 2 .

Figure 3: Heatmap of Eigenvector Correspondence Matrices for fc1:LeNet5, which has 120 output neurons. Here we take the top left corner of the eigenvector correspondence matrices. Similarities between (a)(c) and (b)(d) respectively verify the decoupling conjecture.

Low Rank Structure of E[M]. Previous works observed a gap in the Hessian eigenspectrum around the number of classes c (where c = 10 in our experiments on CIFAR10 and MNIST). Since E[xx T ] is close to rank 1 and the Kronecker factorization is a good approximation for the top eigenspace, the top eigenvalues of the layer-wise Hessian can be approximated as the top eigenvalues of E[M] multiplied by the first eigenvalue of E[xx T ]. Thus, the top eigenvalues of the Hessians should have the same relative ratios as the top eigenvalues of their corresponding E[M]'s. Therefore, the outliers should also appear in the eigenspectrum of E[M].
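This eigenvalue-scaling claim follows from the fact that the eigenvalues of a Kronecker product are the pairwise products of the factors' eigenvalues. A minimal numerical sketch (ours; `E_M` and `XX` are random stand-ins for the paper's $E[M]$ and $E[xx^T]$, not the actual trained-network matrices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Near rank-1 "covariance" XX with top eigenvalue mu1 = 3.001, and a PSD E_M.
G = rng.normal(size=(5, 5))
E_M = G @ G.T + 0.5 * np.eye(5)               # eigenvalues bounded away from 0
u = rng.normal(size=8)
u /= np.linalg.norm(u)
XX = 3.0 * np.outer(u, u) + 1e-3 * np.eye(8)  # close to rank 1

# Eigenvalues of kron(E_M, XX) are products lambda_i(E_M) * mu_j(XX), so the
# top ones are mu1 times the eigenvalues of E_M.
H = np.kron(E_M, XX)
top_H = np.sort(np.linalg.eigvalsh(H))[::-1][:5]
top_M = np.sort(np.linalg.eigvalsh(E_M))[::-1]

assert np.allclose(top_H, (3.0 + 1e-3) * top_M, rtol=1e-6)
print(top_H / top_M)
```

With a genuinely rank-deficient spectrum in `XX`, the remaining eigenvalues of the Kronecker product are suppressed by the small trailing eigenvalues of E[xx T ], which is why the outlier structure of E[M] survives in the layer-wise Hessian.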

Figure 4: Eigenspectrum of the layer-wise output Hessian E[M ] and the layer-wise weight Hessian H L (w (p) ). The vertical axes denote the eigenvalues. Similarity between the two eigenspectra is a direct consequence of a low rank E[xx T ] and the decoupling conjecture.

Figure 5: Overlap between the top k dominating eigenspace of different independently trained models. The overlap peaks at the output dimension m. The eigenspace overlap is defined in Definition 4.1.

Figure 6: Ratio between top singular value and Frobenius norm of matricized dominating eigenvectors. (LeNet5 on CIFAR10). The horizontal axes denote the index i of eigenvector h i , and the vertical axes denote Mat(h i ) / Mat(h i ) F .

Figure 7: Optimized posterior variance s using different algorithms (fc1:T-200 2 trained on MNIST). The horizontal axis denotes the eigenbasis ordered by decreasing eigenvalues. The abbreviations of the algorithms are the same as in Table 1.

Figure 8: Top 50 eigenvalues and eigenspace approximation for the full Hessian.

Figure 9: Eigenspace overlap of different models of LeNet5 trained with different hyperparameters.

Figure 10: Top eigenspace overlap for variants of VGG11 on CIFAR10 and CIFAR100.

Figure 12: Top eigenspace overlap for layers with an early low peak. Figures in the second row are zoomed-in versions of the figures in the first row.

Correspondence with E[M ].

Figure 14: Eigenvector Correspondence for fc1:LeNet5. (m=120)

Figure 15: Eigenvector Correspondence for fc2:LeNet5. (m=84)

Figure 16: Eigenvector Correspondence for conv1:LeNet5. (m=6)

Figure 17: Eigenvector Correspondence for conv2:LeNet5. (m=16)

Clustering of ∆ with logits at initialization.

Figure 20: Logit clustering behavior of ∆ and Γ at initialization (fc1:T-200 2 )

Clustering of ∆ with class at minimum.

Figure 21: Class clustering behavior of ∆ and Γ at minimum. (fc1:T-200 2 )

Figure 22: Ratio between top singular value and Frobenius norm of matricized dominating eigenvectors. (LeNet5 on CIFAR10). The horizontal axes denote the index i of eigenvector h i , and the vertical axes denote Mat(h i ) / Mat(h i ) F .

Figure 23: Top eigenspace overlap for the final fully connected layer.

Figure 24: Eigenspace overlap, eigenspectrum, and cropped (upper 20 × 20 block) eigenvector correspondence matrices for fc2:F-200 2 (MNIST)

Figure 25: Eigenspace overlap, eigenspectrum, and cropped (upper 20 × 20 block) eigenvector correspondence matrices for conv5:VGG11-W200 (CIFAR10)

Figure 26: Eigenspace overlap, eigenspectrum, and cropped (upper 50 × 50 block) eigenvector correspondence matrices for conv2:VGG11-W200 (CIFAR10)

Fig. 27b shows the eigenvector correspondence matrix of the true Hessian with E[xx T ] for fc1:LeNet5. Because E[xx T ] is no longer close to rank 1, only very few eigenvectors of the layer-wise Hessian have high correspondence with the top eigenvector of E[xx T ], as expected. This directly leads to the weaker top-eigenspace structure observed for networks trained with batch normalization.

Figure 27: Eigenspectrum and Eigenvector correspondence matrices with E[xx T ] for LeNet5-BN.

Top eigenspace overlap between approximated and true layer-wise Hessian.

Figure 29: Comparison between the true and approximated layer-wise Hessians for LeNet5-BN.

except for the added T-200 2 introduced in Section 4. T-600 10 , T-600 2 10 , and T-200 2 10 are trained on standard MNIST with 10 classes; the others are trained on MNIST-2 (see Appendix D.1), in which classes 0-4 and classes 5-9 are combined.

Figure 30: Optimized posterior variance, s. (fc1:T-200 2 , trained on MNIST), the horizontal axis is ordered with decreasing eigenvalues.

Assume that the sum of the $|c_i|$'s is upper bounded by $\alpha$, i.e., $\sum_{i=1}^m |c_i| \le \alpha$. Set the tolerance to $\epsilon/\alpha$, and from Lemma 4

We know from the definition of $M^*$ that $M^* = \frac14\big(E[W^TAW] + \operatorname{diag}(E[W^TAW])\big)$. (146) Define $\tilde A := E[A]$; then $E[W^TAW] = W^TE[A]W = W^T\tilde AW$. (147) From Lemma 9, for all $\epsilon > 0$, with probability 1 we have $\|WW^T - I_c\| \le \epsilon$. Besides, from Kleinman & Athans (1968) we know that for positive semi-definite matrices $A$ and $B$ we have $\lambda_{\min}(A)\operatorname{tr}(B) \le \operatorname{tr}(AB) \le \lambda_{\max}(A)\operatorname{tr}(B)$, so
$$\frac{\operatorname{tr}\big(W^T\tilde AWW^T\tilde AW\big)}{\operatorname{tr}(\tilde A\tilde A)} = \frac{\operatorname{tr}\big(WW^T\tilde AWW^T\tilde A\big)}{\operatorname{tr}(\tilde A\tilde A)} \le \big(\|WW^T - I_c\| + 1\big)\frac{\operatorname{tr}\big(\tilde AWW^T\tilde A\big)}{\operatorname{tr}(\tilde A\tilde A)} = \big(\|WW^T - I_c\| + 1\big)\frac{\operatorname{tr}\big(WW^T\tilde A\tilde A\big)}{\operatorname{tr}(\tilde A\tilde A)} \le \big(\|WW^T - I_c\| + 1\big)^2.$$
Denoting the $i$-th column of $W$ by $w_i$, one can bound $\|\operatorname{diag}(E[W^TAW])\|$ with failure probability at most $e^{-2nc^2}$. (152) Therefore, when $n\to\infty$, the bound on $\|\operatorname{diag}(E[W^TAW])\|$ holds with probability 1. Plugging Eq. (146) into $WM^*W^T$, we get

Combining Part 1 and Part 2 proves this lemma. Combining Lemma 15, Lemma 16, and Lemma 17 directly finishes the proof of Lemma 5.

H.3.6 Proof of Theorem H.1

Proof of Theorem H.1. We can now proceed to the proof of our main theorem. In this proof, we will use bounds on $\|M\|_F$, formalized in the lemma below:

Lemma 18. With probability 1, for large enough $n$, $\|M\|_F$ is both lower and upper bounded by constants independent of $n$.

Proof of Lemma 18. From Lemma 15 we know that $\lim_{n\to\infty}\frac{\|M\|_F}{\|M^*\|_F} = 1$, so we only need to bound $\|M^*\|_F$. Since $M^* = \frac14\big(E[W^TAW] + \operatorname{diag}(E[W^TAW])\big)$ and both $E[W^TAW]$ and $\operatorname{diag}(E[W^TAW])$ are PSD, we have

$$\big|\operatorname{tr}\big(WMW^TWW^TWMW^T\big) - \operatorname{tr}\big(WMW^TWMW^T\big)\big| = \big|\operatorname{tr}\big(WW^T\cdot WMW^TWMW^T\big) - \operatorname{tr}\big(WMW^TWMW^T\big)\big| \le \|WW^T - I_c\|\,\operatorname{tr}\big(WMW^TWMW^T\big) \le \epsilon\,\operatorname{tr}\big(WMW^TWMW^T\big).$$

$$\|M\|_F^2 = \|P_WMP_W\|_F^2 + \|P_WMP_W^{\perp}\|_F^2 + \|P_W^{\perp}MP_W\|_F^2 + \|P_W^{\perp}MP_W^{\perp}\|_F^2.$$

$$M = P_WMP_W + P_WMP_W^{\perp} + P_W^{\perp}MP_W + P_W^{\perp}MP_W^{\perp}.$$

In other words,
$$\lim_{n\to\infty}\big\|P_WMP_W - W^TWMW^TW\big\|_F = 0. \quad (182)$$
Now we conclude that
$$\lim_{n\to\infty}\big\|M - W^TWMW^TW\big\|_F = 0. \quad (183)$$
From Lemma 16 we know that for all $i,j\in[c]$, $\lim_{n\to\infty}\big((WMW^T)_{i,j} - (WM^*W^T)_{i,j}\big) = 0$. Since
$$\big\|W^TWMW^TW - W^TWM^*W^TW\big\|_F \le \|W\|^2\,\big\|WMW^T - WM^*W^T\big\|_F, \quad (185)$$
and Lemma 8 bounds the Frobenius norm of $W$, we know that
$$\lim_{n\to\infty}\big\|W^TWMW^TW - W^TWM^*W^TW\big\|_F = 0. \quad (187)$$
Note that $M^* = \frac14\big(E[W^TAW] + \operatorname{diag}(E[W^TAW])\big)$, so
$$4W^TWM^*W^TW = W^TWW^T\tilde AWW^TW + W^TW\operatorname{diag}\big(E[W^TAW]\big)W^TW. \quad (188)$$

Optimized PAC-Bayes bounds using different methods. T-n m and R-n m represent the network F-n m trained with true and random labels, respectively. TEST ERR. gives the empirical generalization gap. BASE represents the bound given by the algorithm proposed by Dziugaite & Roy (2017). APPR, ITER, and ITER.M represent the bounds given by our algorithms.



Structure of F-200 2 on MNIST

Structure of LeNet5 on CIFAR-10

Structure of LeNet5-BN on CIFAR-10

Squared dot product

Structure of E[xx T ] for BN networks

It is based on Algorithm 1 in Dziugaite & Roy (2017). Our initialization of w differs from Dziugaite & Roy (2017) because we believe the abs(w) they wrote is a typo and log[abs(w)] is what they actually meant. It is more reasonable to initialize the variance s as w^2 instead of exp[2 abs(w)].

Full PAC-Bayes bound optimization results


Proof of Lemma 7. Since each entry of $W$ is initialized independently from $N(0,\frac1n)$, by the Central Limit Theorem we have $\sum_{j=1}^n W_{ij} \sim N(0,1)$. For any $\epsilon > 0$, fix $\epsilon$; by Chebyshev's inequality, $\lim_{n\to\infty}\|w_i\| = 0$ with probability 1.

Proof of Lemma 8. Since each entry of $W$ is initialized independently from $N(0,\frac1n)$, $n\|W\|_F^2$ follows a $\chi^2_{cn}$-distribution. (82) Using the tail bound provided by Lemma 1 in Laurent & Massart (2000), we know that for large enough $n$, with probability 1, $\|W\|_F^2 \le 2c$.

Lemma 9. For all $\epsilon > 0$, $\lim_{n\to\infty}\Pr\big[\|WW^T - I_c\| \ge \epsilon\big] = 0$. Besides, for all $i,j\in[c]$, $\lim_{n\to\infty}\Pr\big[|(WW^T)_{i,j} - \delta_{i,j}| \ge \epsilon\big] = 0$. Here $\delta$ is the Kronecker delta function, i.e., $\delta_{i,j} = 1_{[i=j]}$.

Proof of Lemma 9. Since each entry of $W$ is initialized independently from $N(0,\frac1n)$, we know that $WW^T$ follows the Wishart distribution $W_c(\frac1n I_c, n)$. Using the third tail bound in Theorem 1 of Zhu (2012), for large enough $n$, we get the stated concentration. Moreover, for all $i,j\in[c]$, we have
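The concentration in Lemma 9 is also easy to observe empirically. A minimal sketch (ours, not from the paper) sampling $W$ with i.i.d. $N(0,\frac1n)$ entries and measuring the spectral deviation of $WW^T$ from $I_c$:

```python
import numpy as np

rng = np.random.default_rng(5)
c = 10

# For W in R^{c x n} with i.i.d. N(0, 1/n) entries, W W^T concentrates around
# I_c; the spectral-norm deviation decays on the order of sqrt(c/n).
devs = []
for n in [100, 10000]:
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(c, n))
    devs.append(np.linalg.norm(W @ W.T - np.eye(c), 2))

print(devs)
assert devs[1] < devs[0] / 3   # deviation shrinks markedly from n=100 to n=10000
assert devs[1] < 0.2
```

This is the same Wishart concentration that lets the proofs above treat $WW^T$ as $I_c$ up to an $\epsilon$ handled by a union bound.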

