Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks

Abstract

The Hessian captures important properties of the deep neural network loss landscape. Previous works have observed low-rank structure in the Hessians of neural networks. We make several new observations about the top eigenspace of the layer-wise Hessian: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low-rank matrices when they are reshaped into the same shape as the corresponding weight matrix. Towards formally explaining such structures of the Hessian, we show that the new eigenspace structure can be explained by approximating the Hessian using Kronecker factorization; we also prove the low-rank structure for random data at random initialization for over-parametrized two-layer neural nets. Our new understanding can explain why some of these structures become weaker when the network is trained with batch normalization. The Kronecker factorization also leads to better explicit generalization bounds.

1. Introduction

The loss landscape for neural networks is crucial for understanding training and generalization. In this paper we focus on the structure of Hessians, which capture important properties of the loss landscape. For optimization, Hessian information is used explicitly in second-order algorithms, and even for gradient-based algorithms properties of the Hessian are often leveraged in analysis (Sra et al., 2012). For generalization, the Hessian captures the local structure of the loss function near a local minimum, which is believed to be related to generalization gaps (Keskar et al., 2017). Several previous results, including (Sagun et al., 2018; Papyan, 2018), observed interesting structures in the Hessians of neural networks: the Hessian often has around c large eigenvalues, where c is the number of classes. In this paper we ask: why does the Hessian of neural networks have special structures in its top eigenspace? A rigorous analysis of the Hessian structure would potentially allow us to understand what the top eigenspace of the Hessian depends on (e.g., the weight matrices or data distribution), as well as predict the behavior of the Hessian when the architecture changes. Towards this goal, we first focus on the structure of the top eigenspace of layer-wise Hessians. We observe that the top eigenspace of the Hessian is far from random: models trained with different random initializations still have a large overlap in their top eigenspaces, and the top eigenvectors are close to rank 1 when they are reshaped into the same shape as the corresponding weight matrix. We formalize a conjecture that allows us to understand all these structures using a Kronecker decomposition. We also analyze the Hessian of an over-parametrized two-layer neural network on random data, proving that the output Hessian is approximately rank c - 1 and that its top eigenspace can be easily computed from the weight matrices.
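As a minimal illustration of what "overlap between top eigenspaces" means, the sketch below computes one standard subspace-overlap measure, the normalized Frobenius norm of the product of two orthonormal bases (the precise definition used in our experiments appears in Definition 4.1; the dimensions and random subspaces here are purely illustrative stand-ins).

```python
import numpy as np

def eigenspace_overlap(U, V):
    """Overlap between the k-dimensional subspaces spanned by the
    orthonormal columns of U and V: ||U^T V||_F^2 / k.
    Equals 1 for identical subspaces, roughly k/d for random ones."""
    k = U.shape[1]
    return np.linalg.norm(U.T @ V, ord="fro") ** 2 / k

rng = np.random.default_rng(0)
d, k = 100, 10
# Two random k-dimensional subspaces of R^d, orthonormalized via QR.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
V, _ = np.linalg.qr(rng.standard_normal((d, k)))
print(eigenspace_overlap(U, U))  # 1: identical subspaces
print(eigenspace_overlap(U, V))  # roughly k/d: random subspaces
```

The observation in Fig. 1a is that top eigenspaces of layer-wise Hessians from independently trained models score far above the random baseline k/d under such a measure.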
Structure of Top Eigenspace for Hessians: Consider two neural networks trained with different random initializations and potentially different hyper-parameters; their weights are usually nearly orthogonal. One might expect that the top eigenspaces of their layer-wise Hessians are also very different. However, this is surprisingly false: the top eigenspaces of the layer-wise Hessians have a very high overlap, and the overlap peaks at the dimension of the layer's output (see Fig. 1a). Another interesting phenomenon is that if we reshape a top eigenvector of a layer-wise Hessian into a matrix with the same dimensions as the weight matrix, then the matrix is approximately rank 1. In Fig. 1b we show the singular values of several such reshaped eigenvectors.

Understanding Hessian Structure using Kronecker Factorization: We show that both of these new properties of layer-wise Hessians can be explained by a Kronecker factorization. Under a decoupling conjecture, we can approximate the layer-wise Hessian by the Kronecker product of the output Hessian and the input auto-correlation. This Kronecker approximation directly implies that the eigenvectors of the layer-wise Hessian should be approximately rank 1 when viewed as matrices. Moreover, under stronger assumptions, we can generalize the approximation to the top eigenvalues and eigenvectors of the full Hessian.
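The rank-1 reshaping property is a direct consequence of basic Kronecker-product algebra: eigenvectors of A ⊗ B are Kronecker products of eigenvectors of A and B, and such a vector reshaped into a matrix is an outer product. The sketch below checks this numerically; the symmetric PSD matrices A and B are random proxies for the output Hessian and input auto-correlation (not actual network quantities).

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 5, 8  # illustrative output / input dimensions
# Random symmetric PSD stand-ins for the two Kronecker factors.
A = rng.standard_normal((c, c)); A = A @ A.T   # output-Hessian proxy
B = rng.standard_normal((n, n)); B = B @ B.T   # auto-correlation proxy

H = np.kron(A, B)                 # layer-wise Hessian approximation
eigvals, eigvecs = np.linalg.eigh(H)
top = eigvecs[:, -1]              # top eigenvector of H

# Reshaped to the weight-matrix shape (c x n), the top eigenvector is
# the outer product of eigenvectors of A and B, hence rank 1.
M = top.reshape(c, n)
s = np.linalg.svd(M, compute_uv=False)
ratio = s[0] / s.sum()            # fraction of mass in the first singular value
print(ratio)                      # essentially 1 up to numerical error
```

For a real layer-wise Hessian the factorization only holds approximately, which is why the reshaped eigenvectors in Fig. 1b are approximately, rather than exactly, rank 1.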

Structure of auto-correlation:

The auto-correlation matrix of the input is often very close to rank 1. We show that when the input auto-correlation component is approximately rank 1, the layer-wise Hessians indeed have high overlap that peaks at the dimension of the layer's output, and the spectrum of the layer-wise Hessian is similar to the spectrum of the output Hessian. In contrast, when the model is trained with batch normalization, the input auto-correlation matrix is much farther from rank 1, and the layer-wise Hessian often lacks the same low-rank structure.
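A toy sketch of why centering matters for the rank of the auto-correlation: for non-negative activations with a large common mean (as post-ReLU inputs often have), E[xx^T] is dominated by the rank-1 term (mean)(mean)^T, while centering the data, as batch normalization roughly does, removes that term. The synthetic data below is only a stand-in for real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 50, 10_000
Z = rng.standard_normal((N, d))

def top_spectral_fraction(data):
    """Fraction of the auto-correlation spectrum in its top direction."""
    auto = data.T @ data / len(data)   # empirical E[x x^T]
    s = np.linalg.svd(auto, compute_uv=False)
    return s[0] / s.sum()

# Post-ReLU style activations: non-negative with a large common mean,
# so the rank-1 mean term dominates the auto-correlation.
X = np.maximum(Z + 2.0, 0.0)
# Batch-norm style activations: centered per coordinate, so the
# rank-1 mean term vanishes and the spectrum spreads out.
Y = X - X.mean(axis=0)

print("no BN  :", top_spectral_fraction(X))  # large fraction in one direction
print("with BN:", top_spectral_fraction(Y))  # close to the uniform level 1/d
```

This mirrors the empirical contrast described above: without batch normalization the auto-correlation is near rank 1, and with it the spectrum is much flatter.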



(a) Overlap between dominant eigenspaces of layer-wise Hessians at different minima for fc1:LeNet5 (left, output dimension 120) and conv11:ResNet18-W64 (right, output dimension 64). (b) First 10 singular values of the top 4 eigenvectors of the layer-wise Hessian of fc1:LeNet5 after being reshaped into matrices.

Figure 1: Some interesting observations on the structure of layer-wise Hessians. The eigenspace overlap is defined in Definition 4.1 and the reshape operation is defined in Definition 4.2.

