SGD THROUGH THE LENS OF KOLMOGOROV COMPLEXITY

Abstract

We initiate a thorough study of the dynamics of stochastic gradient descent (SGD) under minimal assumptions using the tools of entropy compression. Specifically, we characterize a quantity of interest which we refer to as the accuracy discrepancy. Roughly speaking, this measures the average discrepancy between the model accuracy on batches and on large subsets of the entire dataset. We show that if this quantity is sufficiently large, then SGD finds a model which achieves perfect accuracy on the data in O(1) epochs. Conversely, if the model cannot perfectly fit the data, this quantity must remain below a global threshold, which depends only on the sizes of the dataset and the batch. We use the above framework to lower bound the amount of randomness required to allow (non-stochastic) gradient descent to escape from local minima using perturbations. We show that even if the model is extremely overparameterized, at least a linear (in the size of the dataset) number of random bits is required to guarantee that GD escapes local minima in subexponential time.

1. INTRODUCTION

Stochastic gradient descent (SGD) is at the heart of modern machine learning. However, we still lack a theoretical framework that explains its performance for general, non-convex functions. Current results make significant assumptions regarding the model: global convergence guarantees hold only under specific architectures and activation units, and when models are extremely overparameterized (Du et al., 2019; Allen-Zhu et al., 2019; Zou et al., 2018; Zou and Gu, 2019). In this paper, we take a step back and explore what can be said about SGD under the most minimal assumptions. We only assume that the loss function is differentiable and L-smooth, that the learning rate is sufficiently small, and that models are initialized randomly. Clearly, we cannot prove general convergence to a global minimum under these assumptions. However, we can try to understand the dynamics of SGD: what types of execution patterns can and cannot happen.

Motivating example: Suppose, hypothetically, that for every batch, the accuracy of the model after the gradient descent (GD) step on the batch is 100%, while its accuracy on the set of previously seen batches (including the current batch) remains at 80%. Can this process go on forever? At first glance, this might seem like a possible scenario. However, we show that this cannot be the case: if the above scenario repeats sufficiently often, the model must eventually achieve 100% accuracy on the entire dataset.

To show the above, we identify a quantity of interest which we call the accuracy discrepancy (formally defined in Section 3). Roughly speaking, this is how much the model accuracy on a batch differs from the model accuracy on all previous batches in the epoch. We show that when this quantity (averaged over epochs) is higher than a certain threshold, SGD is guaranteed to converge to 100% accuracy on the dataset within O(1) epochs w.h.p.¹
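For concreteness, the quantity in the motivating example could be tallied as follows. The numbers mirror the hypothetical 100%/80% scenario above; this is the informal, per-epoch version of the definition (Section 3 gives the formal one):

```python
# Hypothetical per-batch accuracies after each GD step in one epoch (100%),
# and accuracies on all batches seen so far, current batch included (80%),
# mirroring the motivating example above.
batch_acc = [1.00] * 10
seen_acc = [0.80] * 10

# Informal accuracy discrepancy for the epoch: the average gap between
# per-batch accuracy and accuracy on the previously seen batches.
discrepancy = sum(b - s for b, s in zip(batch_acc, seen_acc)) / len(batch_acc)
print(f"{discrepancy:.2f}")  # prints 0.20
```

A persistent average gap of this size, by the result above, would force convergence to 100% accuracy on the whole dataset within O(1) epochs.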
We note that this threshold is global: it depends only on the size of the dataset and the size of the batch. In doing so, we provide a sufficient condition for SGD convergence. The above result is especially interesting when applied to weak models that cannot achieve perfect accuracy on the data. Imagine a dataset of size n with random labels, a model with n^0.99 parameters, and a batch of size log n. The above implies that the accuracy discrepancy must eventually drop below the global threshold. In other words, the model cannot consistently make significant progress on batches. This is surprising because even though the model is underparameterized with respect to the entire dataset, it is extremely overparameterized with respect to the batch. We verify this observation experimentally (Appendix B). This holds for a single GD step per batch; if we were to allow many GD steps per batch, would it still be impossible to make significant progress on the batch?

This leads us to consider the role of randomness in (non-stochastic) gradient descent. It is well known that overparameterized models trained using SGD can perfectly fit datasets with random labels (Zhang et al., 2017). It is also known that when models are sufficiently overparameterized (and wide), GD with random initialization converges to a near-global minimum (Du et al., 2019). This leads to an interesting question: how much randomness does GD require to escape local minima efficiently (i.e., in polynomial time)? Clearly, without randomness we could initialize GD next to a local minimum, and it would never escape. But what about the case where we are given an adversarial input and may perturb it (for example, by adding a random vector to it)? How many bits of randomness are required to guarantee that, after the perturbation, GD achieves good accuracy on the input in polynomial time?
In Section 4 we show that if the amount of randomness is sublinear in the size of the dataset, then for any differentiable and L-smooth model class (e.g., a neural network architecture), there are datasets that require exponential running time to achieve any non-trivial accuracy (i.e., better than 1/2 + o(1) for a two-class classification task), even if the model is extremely overparameterized. This result highlights the importance of randomness for the convergence of gradient methods. Specifically, it offers an indication of why SGD converges in certain situations where GD does not. We hope this result opens the door to a principled design of the randomness used in other variants of GD.

Outline of our techniques

We consider batch SGD, where the dataset is shuffled once at the beginning of each epoch and then divided into batches. We do not deal with the generalization abilities of the model; thus, the dataset is always the training set. In each epoch, the algorithm goes over the batches one by one and performs a gradient descent step on each to update the model. This is the "vanilla" version of SGD, without any acceleration or regularization (for a formal definition, see Section 2). For the sake of analysis, we add a termination condition after every GD step: if the accuracy on the entire dataset is 100%, we terminate. Thus, in our case, termination implies 100% accuracy.

To achieve our results, we make use of entropy compression, first considered by Moser and Tardos (2010) to prove a constructive version of the Lovász local lemma. Roughly speaking, the entropy compression argument allows one to bound the running time of a randomized algorithm² by leveraging the fact that a random string of bits (the randomness used by the algorithm) is, w.h.p., computationally incompressible (has high Kolmogorov complexity). If one can show that throughout its execution the algorithm (implicitly) compresses the randomness it uses, then one can bound the number of iterations the algorithm may execute without terminating. To show that the algorithm has such a property, one usually considers the algorithm after it has executed t iterations, and shows that from an "execution log" of the algorithm together with a set of "hints", whose combined size is considerably smaller than the number of random bits used by the algorithm, it is possible to reconstruct all of the random bits used by the algorithm.

We apply this approach to SGD with the added termination condition described above. The randomness we compress is the bits required to represent the random permutation of the data at every epoch.
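A minimal sketch of the analyzed procedure, vanilla shuffled-batch SGD with the added termination check, might look as follows. The logistic model and the separable toy data are illustrative placeholders of ours, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable dataset (margin >= 0.5), so the model can in
# principle reach 100% training accuracy. All choices here are illustrative.
n, d, batch_size = 64, 5, 8
w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)
X = rng.standard_normal((n, d))
y = (X @ w_true > 0).astype(float)
X += 0.5 * np.outer(2 * y - 1, w_true)  # push every point 0.5 off the boundary

def accuracy(w):
    return np.mean(((X @ w) > 0) == y)

def grad(w, idx):
    # Gradient of the (differentiable, L-smooth) logistic loss on one batch.
    p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
    return X[idx].T @ (p - y[idx]) / len(idx)

w = rng.standard_normal(d)
eta = 0.5
done = False
for epoch in range(1000):
    perm = rng.permutation(n)                # fresh shuffle each epoch: these
    for start in range(0, n, batch_size):    # are the random bits the argument
        idx = perm[start:start + batch_size] # later compresses
        w = w - eta * grad(w, idx)           # one GD step per batch
        if accuracy(w) == 1.0:               # added termination condition:
            done = True                      # termination => perfect accuracy
            break
    if done:
        break
```

The only modification relative to vanilla SGD is the accuracy check after each step, which is what lets termination certify 100% accuracy in the analysis.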
So indeed, the longer SGD executes, the more random bits are generated. We show that under our assumptions it is possible to reconstruct these bits efficiently, starting from the dataset X and the model after executing t epochs. The first step in allowing us to reconstruct the random bits of the permutation in each epoch is to show that, under the L-smoothness assumption and with a sufficiently small step size, SGD is reversible. That is, if we are given a model W_{i+1} and a batch B_i such that W_{i+1} results from taking a gradient step with model W_i, where the loss is calculated with respect to B_i, then we can uniquely retrieve W_i using only B_i and W_{i+1}. This means that if we can efficiently encode the batches used in every epoch (i.e., using fewer bits than encoding the entire permutation of the data), we can also retrieve all intermediate models in that epoch (at no additional cost). We prove this claim in Section 2.
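Numerically, this inversion can be sketched as a fixed-point iteration: since W_{i+1} = W_i - η∇L(W_i; B_i) and ηL < 1, the map W ↦ W_{i+1} + η∇L(W; B_i) is a contraction whose unique fixed point is W_i. The least-squares loss below is our toy stand-in for the batch loss, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy L-smooth batch loss: L(w) = 0.5 * ||X w - y||^2, grad = X^T (X w - y).
X = rng.standard_normal((8, 5))
y = rng.standard_normal(8)

def grad(w):
    return X.T @ (X @ w - y)

L_smooth = np.linalg.eigvalsh(X.T @ X).max()  # smoothness constant L
eta = 0.5 / L_smooth                          # step size with eta * L < 1

w_prev = rng.standard_normal(5)
w_next = w_prev - eta * grad(w_prev)          # one forward GD step

# Reverse the step: w_prev solves w = w_next + eta * grad(w), a contraction
# with factor eta * L = 0.5, so fixed-point iteration converges to it.
w = w_next.copy()
for _ in range(200):
    w = w_next + eta * grad(w)

assert np.allclose(w, w_prev, atol=1e-8)      # the step was inverted exactly
```

With contraction factor ηL = 1/2, the reconstruction error shrinks geometrically, so a modest number of iterations recovers W_i to machine precision; this is what makes the batches alone (without the intermediate models) a sufficient encoding.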



¹ With high probability means a probability of at least 1 - 1/n, where n is the size of the dataset.
² We require that the number of random bits used is proportional to the execution time of the algorithm. That is, the algorithm flips coins in every iteration of a loop, rather than just a constant number of times at the beginning of the execution.

