A CHAOS THEORY APPROACH TO UNDERSTAND NEURAL NETWORK OPTIMIZATION

Abstract

Despite the complicated structure of modern deep neural network architectures, they are still optimized with algorithms based on Stochastic Gradient Descent (SGD). However, the reason behind the effectiveness of SGD is not well understood, making its study an active research area. In this paper, we formulate deep neural network optimization as a dynamical system and show that the rigorous theory developed to study chaotic systems can be useful to understand SGD and its variants. In particular, we first observe that the inverse of the instability timescale of SGD optimization, represented by the largest Lyapunov exponent, corresponds to the most negative eigenvalue of the Hessian of the loss. This observation enables us to introduce an efficient method to estimate the largest eigenvalue of the Hessian. Then, we empirically show that, for a large range of learning rates, SGD traverses the loss landscape across regions where the largest eigenvalue of the Hessian is similar to the inverse of the learning rate. This explains why effective learning rates can be found within a large range of values, and shows that SGD implicitly uses the largest eigenvalue of the Hessian while traversing the loss landscape, which sheds some light on its effectiveness relative to more sophisticated second-order methods. We also propose a quasi-Newton method that dynamically estimates an optimal learning rate for the optimization of deep learning models. We demonstrate that our observations and methods are robust across different architectures and loss functions on the CIFAR-10 dataset.

1. INTRODUCTION

An interesting observation from current deep learning research is that classification and regression accuracy gains seem to be achieved from the intricacy of the underlying models rather than from the optimization algorithm used for their training. In fact, the de facto choice of optimization algorithm is still the classic Stochastic Gradient Descent (SGD) algorithm (Robbins & Monro, 1951) with minor modifications (Duchi et al., 2011; Sutskever et al., 2013; Kingma & Ba, 2014). Even though several sophisticated second-order and quasi-Newton methods (Martens, 2010; Martens & Grosse, 2015; Berahas et al., 2019) have been introduced, first-order methods remain popular, and none of them seem to outperform SGD with a carefully tuned learning rate schedule (Hardt et al., 2016). This indicates that SGD (or first-order methods in general) probably has some intrinsic properties that make it effective at optimizing over-parametrized deep neural networks. Despite various attempts to explain this phenomenon (Chaudhari & Soatto, 2018; Keskar et al., 2016; Kleinberg et al., 2018), little is understood about the effectiveness of SGD compared with sophisticated second-order optimization methods. In this paper, we argue that chaos theory (Sprott & Sprott, 2003) is a useful approach to understand neural network optimization based on SGD. The basic idea is to view neural network optimization as a dynamical system in which the SGD update equation maps the space of learnable parameters to itself and describes the evolution of the system over time. Once this evolution is defined, the rich theory developed to study chaotic dynamical systems can be leveraged to analyze and understand SGD and its variants. In essence, chaos theory enables us to study the evolution of the learnable parameters (i.e., the optimization trajectory) in order to understand the training behavior over large time scales (i.e., numbers of iterations).
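To make the dynamical-system view concrete, the sketch below (our own toy example, not code from the paper) writes gradient descent on a quadratic loss as a map F from parameter space to itself, whose repeated application generates the optimization trajectory; the Hessian H and learning rate lr are illustrative choices.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta @ H @ theta, so grad L = H @ theta.
# Gradient descent is then the linear map F(theta) = (I - lr * H) @ theta,
# whose Jacobian (I - lr * H) governs the stability of the trajectory.
H = np.diag([4.0, 1.0])      # Hessian with eigenvalues 4 and 1 (illustrative)
lr = 0.1                     # learning rate (hypothetical choice)

def F(theta):
    """One gradient-descent step viewed as a map on parameter space."""
    return theta - lr * (H @ theta)

theta = np.array([1.0, 1.0])
for _ in range(100):         # evolve the dynamical system for 100 time steps
    theta = F(theta)

print(np.linalg.norm(theta))  # the trajectory contracts toward the minimum
```

For this convex toy problem the map is a contraction whenever |1 - lr * lambda| < 1 for every Hessian eigenvalue lambda; the chaos-theory machinery discussed below analyzes exactly this kind of per-direction expansion or contraction rate.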
In particular, we focus on understanding the influence of the learning rate on the SGD optimization trajectory. First, by observing that the Lyapunov exponent of SGD corresponds to the most negative eigenvalue of the Hessian of the loss, we introduce an efficient and accurate method to estimate the loss curvature. Then, we empirically show that, for a range of learning rate schedules, SGD traverses the optimization landscape across regions where the largest eigenvalue of the Hessian is similar to the inverse of the learning rate. This demonstrates that, at a specific time step, performing an SGD update is similar to performing a quasi-Newton step that considers only the largest eigenvalue of the Hessian of the loss. This, for the first time, sheds some light on the effectiveness of SGD over more sophisticated second-order methods, and corroborates the observation that SGD robustly converges for a variety of learning rate schedules (Sun, 2019). Furthermore, as pointed out by LeCun et al. (1993), the inverse of the estimated curvature can be used as the learning rate when applying SGD to a new dataset or architecture. Hence, we can set up a "feedback" system in which the quasi-Newton optimal learning rate is calculated dynamically from the current largest eigenvalue of the Hessian (i.e., the curvature), and the learning rate is adjusted accordingly during training, allowing a "parameter-free" stochastic gradient descent optimization. Experiments are conducted on the CIFAR-10 dataset to demonstrate that our observations are robust across a variety of models, ranging from a simple linear regression model to modern deep neural network architectures, trained with both cross-entropy and mean squared error loss functions.
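The connection between the Lyapunov exponent and the Hessian spectrum can be illustrated with a toy sketch (our own construction under simplifying assumptions, not the paper's method): near a saddle of a quadratic loss, the per-step expansion rate of a small perturbation under the gradient-descent map equals max |1 - lr * lambda| over the Hessian eigenvalues lambda, which is attained by the most negative eigenvalue; measuring that rate therefore recovers the eigenvalue.

```python
import numpy as np

# Quadratic saddle L(theta) = 0.5 * theta @ H @ theta with lambda_min = -3.
rng = np.random.default_rng(0)
H = np.diag([2.0, -3.0])
lr = 0.01

def step(theta):
    """One gradient-descent step on the quadratic loss."""
    return theta - lr * (H @ theta)

# Because the loss is exactly quadratic, the map's Jacobian (I - lr * H) is the
# same everywhere, so we can measure it by finite differences around the origin.
delta = rng.normal(size=2)
delta /= np.linalg.norm(delta)        # unit perturbation of the trajectory
T, log_growth = 1000, 0.0
for _ in range(T):
    new_delta = step(delta) - step(np.zeros(2))  # Jacobian applied to delta
    growth = np.linalg.norm(new_delta)
    log_growth += np.log(growth)                 # accumulate per-step expansion
    delta = new_delta / growth                   # renormalize (standard Lyapunov trick)

mu = log_growth / T                   # largest Lyapunov exponent of the map
lam_est = (1.0 - np.exp(mu)) / lr     # invert 1 - lr*lambda = e^mu for lambda < 0
print(lam_est)                        # close to -3
```

The renormalization at each step is the standard numerical technique for Lyapunov exponents: it keeps the perturbation from overflowing while the accumulated log-growth still averages to the exponent.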

2. CHAOS THEORY FOR NEURAL NETWORK OPTIMIZATION

In recent years, several papers have used dynamical systems to study theoretical aspects of deep learning optimization (Liu & Theodorou, 2019). Essentially, this is achieved by defining the optimization of deep neural networks as the evolution of parameters over time. In particular, a dynamical system progresses according to a map function that describes how the system evolves in a specific time step. In the case of deep neural network optimization, this map function is defined from the space of parameters into itself. By describing the system evolution using such a map function, it is possible to leverage the mathematical machinery of dynamical systems. For instance, viewing SGD as a discrete approximation of a continuous stochastic differential equation allowed Li et al. (2017) and An et al. (2018) to propose adaptive SGD algorithms. Furthermore, dynamical systems enabled LeCun et al. (1993) to relate the learning rate to the inverse of the local Hessian in a quasi-Newton optimization framework. Our paper also uses dynamical systems to study deep learning optimization but, differently from the methods above, relies on chaos theory. Chaos theory (Sprott & Sprott, 2003) studies the evolution of dynamical systems over large time scales and categorizes systems as chaotic or non-chaotic. Under some simplifying but still general assumptions, chaotic systems are bounded and have a strong dependence on initial conditions. This means that chaotic systems evolving from different starting points within a relatively small region around a particular reference point will diverge exponentially during the evolution process, where the amount of time taken for this divergence to happen is defined as the chaotic timescale. This chaotic timescale imposes a limit on our ability to predict the future state distribution of a dynamical system.
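The exponential divergence and the chaotic timescale can be seen in a classic textbook example (not from the paper): the logistic map x -> 4x(1-x) is chaotic with Lyapunov exponent ln 2, so two starting points separated by eps reach an O(1) separation after roughly log(1/eps)/ln 2 steps, which is the chaotic timescale of this system.

```python
import math

# Two trajectories of the chaotic logistic map, starting eps apart.
f = lambda x: 4.0 * x * (1.0 - x)
eps = 1e-10
x, y = 0.3, 0.3 + eps

t = 0
while abs(x - y) < 0.1 and t < 200:   # evolve until the trajectories separate
    x, y = f(x), f(y)
    t += 1

# Compare the measured divergence time with the predicted chaotic timescale.
predicted = math.log(1.0 / eps) / math.log(2.0)   # ~ 33 steps
print(t, predicted)
```

Beyond a few multiples of this timescale, knowing the initial condition to within eps tells us essentially nothing about the state, which is precisely the predictability limit described above.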
In fact, the distribution of future states that have evolved for more than a few times the chaotic timescale cannot be distinguished from a random distribution, even when the system is fully deterministic. We apply concepts from chaos theory to improve our current understanding of the optimization of deep neural networks. More specifically, we describe how to use standard chaos theory techniques to efficiently calculate the leading (positive and negative) eigenvalues of the Hessian of the loss function. With these eigenvalues we measure, in turn, the curvature of the loss function, which can be used to study the behavior of first-order optimization methods such as SGD (Robbins & Monro, 1951). In particular, with this technique we formulate an explanation for the empirical robustness of SGD to the choice of learning rate and its scheduling function, and we investigate a method (based on a quasi-Newton second-order method) for dynamically finding the optimal learning rate during the optimization of deep neural networks. Such automated and dynamic estimation of the optimal learning rate can remove the significant burden of manually defining learning rate schedules in deep learning optimization.
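A minimal sketch of such a feedback scheme (our own hypothetical version on a toy quadratic loss, not the paper's implementation) estimates the largest Hessian eigenvalue by power iteration on finite-difference Hessian-vector products and sets the learning rate to its inverse at every step; `lambda_max_estimate` and the toy Hessian are illustrative constructions.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([10.0, 1.0, 0.1])          # toy quadratic loss; lambda_max = 10

def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta @ H @ theta."""
    return H @ theta

def lambda_max_estimate(theta, iters=20, h=1e-5):
    """Power iteration using finite-difference Hessian-vector products."""
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        Hv = (grad(theta + h * v) - grad(theta)) / h   # ~ H @ v
        v = Hv / np.linalg.norm(Hv)
    Hv = (grad(theta + h * v) - grad(theta)) / h
    return float(v @ Hv)               # Rayleigh quotient ~ lambda_max

theta = rng.normal(size=3)
for _ in range(1000):
    lr = 1.0 / lambda_max_estimate(theta)   # dynamic quasi-Newton learning rate
    theta = theta - lr * grad(theta)        # gradient step with adapted lr

print(np.linalg.norm(theta))           # converges without hand-tuned schedules
```

With lr = 1/lambda_max, the step is exact along the sharpest curvature direction and stable along all others, which is the quasi-Newton behavior (restricted to the largest eigenvalue) that the paper argues SGD implicitly exhibits.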

