

Abstract

The curvature of the loss provides rich information about the geometry underlying neural networks, with applications in second-order optimisation and Bayesian deep learning. However, accessing curvature information remains a daunting engineering challenge, inaccessible to most practitioners. We therefore provide a software package, the Deep Curvature Suite, which allows easy curvature evaluation for large modern neural networks. Beyond computing a highly accurate moment-matched approximation of the Hessian spectrum using Lanczos, our package provides extensive loss-surface visualisation, calculation of the Hessian variance, and stochastic second-order optimisers. We further address and disprove common misconceptions in the literature about the Lanczos algorithm, namely that it learns eigenvalues from the top down. Using high-dimensional concentration inequalities, we prove that for specific matrices a single random vector suffices for accurate spectral estimation, informing our spectral visualisation method. We showcase our package's practical utility on a series of examples based on realistic modern neural networks, such as VGG-16 and preactivated ResNets, on the CIFAR-10/100 datasets. We further detail three specific use cases enabled by our software: research in stochastic second-order optimisation for deep learning, learning-rate scheduling using known optimality formulae for convex surfaces, and empirical verification of deep learning theory by comparing empirical and theoretically implied spectra.

1. INTRODUCTION

The success of deep neural networks trained with gradient-based optimisers in speech and object recognition (LeCun et al., 2015) has led to an explosion of easy-to-use, high-performance software implementations. Automatic differentiation packages such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017) have become widely adopted. Higher-level packages, such as Keras (Chollet, 2015), allow practitioners to specify their model, dataset and optimiser in a few lines of code, effortlessly achieving state-of-the-art performance. However, software for extracting second-order information, representing the curvature of the loss at a point in weight space, has not kept pace. Researchers aspiring to evaluate curvature information need to implement their own libraries, which are rarely shared or kept up to date. Naive implementations, which rely on full eigendecomposition (cubic cost in the parameter count), are computationally intractable for all but the smallest models. Hence, researchers typically ignore curvature information or use highly optimistic approximations. Examples in the literature include the diagonal elements of the Hessian or of a surrogate matrix (Chaudhari et al., 2016; Dangel et al., 2019), which we show in Appendix E can be very misleading.
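The cubic cost of full eigendecomposition can be sidestepped by iterative methods such as Lanczos, which only require matrix-vector products. The sketch below is a minimal illustration of this idea using SciPy's Lanczos-based `eigsh` on a small synthetic symmetric matrix standing in for a neural network Hessian; in practice the Hessian is never formed, and the matrix-vector product would instead be a Hessian-vector product obtained via automatic differentiation. All names here are illustrative, not part of the package's API.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)

# Synthetic symmetric "Hessian" (stand-in only: for a real network this
# matrix is never materialised).
P = 500
A = rng.standard_normal((P, P))
H = (A + A.T) / 2

# A LinearOperator exposes only the matrix-vector product, mimicking a
# Hessian-vector product computed by automatic differentiation.
hvp = LinearOperator((P, P), matvec=lambda v: H @ v)

# Lanczos (eigsh) recovers the extreme eigenvalues from matrix-vector
# products alone, avoiding the O(P^3) cost of a full eigendecomposition.
top = eigsh(hvp, k=5, which='LA', return_eigenvectors=False)
full = np.linalg.eigvalsh(H)
print(np.allclose(np.sort(top), full[-5:]))  # → True
```

The same pattern scales to models with millions of parameters, since each Lanczos iteration costs only one Hessian-vector product.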

2. MOTIVATION

The curvature of the loss informs us about the local conditioning of the problem, i.e. the ratio of the largest to smallest Hessian eigenvalues, λ1/λP. This determines the rate of convergence of first-order methods and informs the optimal learning rate and momentum coefficient (Nesterov, 2013). Hence, easily accessible curvature information could allow practitioners to scale their learning rates optimally throughout training, instead of relying on expert scheduling; we investigate this using our software in Section 5.2. The research areas where curvature information features most prominently are analyses of the loss surface and Newton-type optimisation methods.
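For the simplest case, gradient descent on a convex quadratic, the classical optimality formulae referred to above are explicit: with extreme Hessian eigenvalues λ1 and λP, the optimal fixed step size is 2/(λ1 + λP) and the guaranteed per-step contraction is (κ − 1)/(κ + 1) with κ = λ1/λP. The toy sketch below checks this numerically on a diagonal quadratic with a hypothetical spectrum of our choosing (the spectrum and all variable names are illustrative assumptions):

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T H w with known positive-definite curvature.
# (Toy stand-in: for real networks lam_1, lam_P come from spectral estimation.)
eigvals = np.linspace(0.1, 10.0, 50)   # hypothetical Hessian spectrum
H = np.diag(eigvals)
lam_1, lam_P = eigvals.max(), eigvals.min()

kappa = lam_1 / lam_P                  # condition number lam_1 / lam_P
lr = 2.0 / (lam_1 + lam_P)             # optimal fixed step size for GD
rate = (kappa - 1) / (kappa + 1)       # guaranteed per-step contraction

w = np.ones(50)
for _ in range(100):
    w = w - lr * (H @ w)               # gradient of the quadratic is H @ w

# The iterate norm contracts at least as fast as rate per step.
print(np.linalg.norm(w) <= rate**100 * np.linalg.norm(np.ones(50)) + 1e-12)  # → True
```

On a real network the surface is only locally quadratic, but the same formulae suggest how estimated extreme eigenvalues could drive learning-rate scheduling, which is the use case explored in Section 5.2.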

