SGD THROUGH THE LENS OF KOLMOGOROV COMPLEXITY

Abstract

We initiate a thorough study of the dynamics of stochastic gradient descent (SGD) under minimal assumptions using the tools of entropy compression. Specifically, we characterize a quantity of interest which we refer to as the accuracy discrepancy. Roughly speaking, this measures the average discrepancy between the model's accuracy on batches and on large subsets of the entire dataset. We show that if this quantity is sufficiently large, then SGD finds a model which achieves perfect accuracy on the data in O(1) epochs. Conversely, if the model cannot perfectly fit the data, this quantity must remain below a global threshold that depends only on the size of the dataset and the size of the batch. We use the above framework to lower bound the amount of randomness required to allow (non-stochastic) gradient descent to escape from local minima using perturbations. We show that even if the model is extremely overparameterized, at least a linear (in the size of the dataset) number of random bits is required to guarantee that GD escapes local minima in subexponential time.

1. INTRODUCTION

Stochastic gradient descent (SGD) is at the heart of modern machine learning. However, we still lack a theoretical framework that explains its performance for general, non-convex functions. Current results make significant assumptions regarding the model: global convergence guarantees only hold for specific architectures and activation units, and when models are extremely overparameterized (Du et al., 2019; Allen-Zhu et al., 2019; Zou et al., 2018; Zou and Gu, 2019). In this paper, we take a step back and explore what can be said about SGD under the most minimal assumptions. We only assume that the loss function is differentiable and L-smooth, that the learning rate is sufficiently small, and that models are initialized randomly. Clearly, we cannot prove general convergence to a global minimum under these assumptions. However, we can try to understand the dynamics of SGD: what types of execution patterns can and cannot happen.

Motivating example: Suppose, hypothetically, that for every batch the accuracy of the model after the gradient descent (GD) step on the batch is 100%, yet its accuracy on the set of previously seen batches (including the current batch) remains at 80%. Can this process go on forever? At first glance, this might seem like a possible scenario. However, we show that this cannot be the case. That is, if the above scenario repeats sufficiently often, the model must eventually achieve 100% accuracy on the entire dataset.

To show the above, we identify a quantity of interest which we call the accuracy discrepancy (formally defined in Section 3). Roughly speaking, this is how much the model accuracy on a batch differs from the model accuracy on all previous batches in the epoch. We show that when this quantity (averaged over epochs) is higher than a certain threshold, we can guarantee that SGD converges to 100% accuracy on the dataset within O(1) epochs w.h.p.¹
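To make the quantity concrete, the per-epoch accuracy discrepancy can be sketched in code. This is a minimal illustrative sketch, not the paper's formal construction: the logistic model, the learning rate, and the names `accuracy` and `epoch_accuracy_discrepancy` are assumptions made for the example.

```python
import numpy as np

def accuracy(w, X, y):
    """Fraction of points the linear classifier sign(X @ w) labels correctly."""
    return float(np.mean((X @ w > 0).astype(int) == y))

def epoch_accuracy_discrepancy(w, X, y, batch_size, lr=0.1):
    """Run one SGD epoch over (X, y) and return the updated weights together
    with the epoch's average accuracy discrepancy: the model's accuracy on
    each batch right after its GD step, minus its accuracy on all batches
    seen so far in the epoch (including the current one)."""
    n = len(y)
    order = np.random.permutation(n)
    discrepancies = []
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # One gradient step on the batch (logistic loss).
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w = w - lr * Xb.T @ (p - yb) / len(yb)
        # Accuracy on the batch vs. accuracy on everything seen so far.
        seen = order[:start + batch_size]
        discrepancies.append(accuracy(w, Xb, yb) - accuracy(w, X[seen], y[seen]))
    return w, float(np.mean(discrepancies))
```

In the motivating example above, every term in `discrepancies` would be 100% - 80% = 0.2; the paper's result says such a large average cannot persist unless the model eventually fits the whole dataset.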
We note that this threshold is global; that is, it only depends on the size of the dataset and the size of the batch. In doing so, we provide a sufficient condition for SGD convergence. The above result is especially interesting when applied to weak models that cannot achieve perfect accuracy on the data. Imagine a dataset of size n with random labels, a model with n^{0.99} parameters, and a batch of size log n. The above implies that the accuracy discrepancy must eventually go below



¹ With high probability means a probability of at least 1 - 1/n, where n is the size of the dataset.

