Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks

Abstract

The Hessian captures important properties of the deep neural network loss landscape. Previous works have observed low rank structure in the Hessians of neural networks. We make several new observations about the top eigenspace of the layer-wise Hessian: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low rank matrices when they are reshaped into the same shape as the corresponding weight matrix. Towards formally explaining such structures of the Hessian, we show that the new eigenspace structure can be explained by approximating the Hessian using Kronecker factorization; we also prove the low rank structure for random data at random initialization for over-parametrized two-layer neural networks. Our new understanding can explain why some of these structures become weaker when the network is trained with batch normalization. The Kronecker factorization also leads to better explicit generalization bounds.
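As a concrete illustration of the reshaping observation above, the following is a minimal sketch (not the authors' code): it computes the layer-wise Hessian of a small two-layer network on random data, reshapes the top eigenvector into the shape of the corresponding weight matrix, and inspects its singular values. The network sizes, random-data setup, and PyTorch-based implementation are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out, n = 10, 8, 3, 64  # hypothetical sizes for illustration
x = torch.randn(n, d_in)
y = torch.randint(0, d_out, (n,))

W1 = torch.randn(d_hidden, d_in)
W2 = torch.randn(d_out, d_hidden)

def loss_fn(w1):
    # Two-layer ReLU network with cross-entropy loss on random data.
    h = torch.relu(x @ w1.t())
    logits = h @ W2.t()
    return torch.nn.functional.cross_entropy(logits, y)

# Layer-wise Hessian of the loss w.r.t. the first layer's weights,
# flattened to a (d_hidden * d_in) x (d_hidden * d_in) matrix.
H = torch.autograd.functional.hessian(loss_fn, W1)
H = H.reshape(d_hidden * d_in, d_hidden * d_in)

# Top eigenvector of the layer-wise Hessian (eigh returns ascending eigenvalues).
eigvals, eigvecs = torch.linalg.eigh(H)
top_vec = eigvecs[:, -1]

# Reshape the eigenvector into the weight-matrix shape; if the observation
# holds, its singular-value spectrum is dominated by the first few values.
V = top_vec.reshape(d_hidden, d_in)
print(torch.linalg.svdvals(V))
```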

1. Introduction

The loss landscape of neural networks is crucial for understanding training and generalization. In this paper we focus on the structure of Hessians, which capture important properties of the loss landscape. For optimization, Hessian information is used explicitly in second-order algorithms, and even for gradient-based algorithms, properties of the Hessian are often leveraged in the analysis (Sra et al., 2012). For generalization, the Hessian captures the local structure of the loss function near a local minimum, which is believed to be related to the generalization gap (Keskar et al., 2017).

