EFFICIENT ESTIMATORS FOR HEAVY-TAILED MACHINE LEARNING

Abstract

A dramatic improvement in data collection technologies has aided in procuring massive amounts of unstructured and heterogeneous datasets. This has consequently led to a prevalence of heavy-tailed distributions across a broad range of tasks in machine learning. In this work, we perform thorough empirical studies to show that modern machine learning models, such as generative adversarial networks and invertible flow models, are plagued by such ill-behaved distributions during training. To alleviate this problem, we develop a computationally efficient estimator for mean estimation with provable guarantees that can handle such ill-behaved distributions. We provide specific consequences of our theory for supervised learning tasks such as linear regression and generalized linear models. Furthermore, we study the performance of our algorithm on synthetic tasks and real-world experiments and show that our methods convincingly outperform a variety of practical baselines.

1. INTRODUCTION

Existing estimators in machine learning are largely designed for "thin-tailed" data, such as data drawn from a Gaussian distribution. Past work in statistical estimation has given ample evidence that in the absence of such thin tails, classical estimators based on minimizing the empirical error perform poorly (Catoni, 2012; Lugosi et al., 2019). Theoretical guarantees for methods commonly used in machine learning usually place assumptions on the tails of the underlying distributions. For instance, convergence rates proven for a variety of stochastic optimization procedures assume that the distribution of gradients has bounded variance (e.g., Zou et al. (2018)) or, in some cases, is sub-Gaussian (e.g., Li & Orabona (2019)). These guarantees are therefore no longer applicable to heavy-tailed gradient distributions. From a practical point of view, this is a less than desirable state of affairs: heavy-tailed distributions are ubiquitous in a variety of fields, including large-scale biological and financial datasets, among others (Fan et al., 2016; Zhou et al., 2017; Fan et al., 2017). While this may be dismissed as an artifact of those domains, recent work by Simsekli et al. (2019) and Zhang et al. (2019) has found interesting evidence of heavy-tailed distributions in the intermediate outputs of machine learning algorithms, especially during neural network training for supervised learning tasks. Following these empirical analyses, we look for sources of heavy-tailed gradients arising during the training of modern generative-model-based unsupervised learning tasks as well. In our preliminary investigation, we observed that the distribution of gradient norms, i.e., ‖g_t‖_2, is indeed heavy-tailed. These are showcased in Figure 1; Figures 1a and 1b show the distribution of gradient norms obtained while training the generator of a DCGAN (Radford et al., 2015) and Real-NVP (Dinh et al., 2016) on the CIFAR-10 dataset, respectively. These distributions are noticeably heavy-tailed, especially when juxtaposed with those obtained from a Gaussian distribution (Figure 1c).
We discuss the empirical setup further in Section 5.2. Interestingly, in all of the supervised and unsupervised machine learning problems discussed above, we merely need to compute expectations of these varied heavy-tailed random quantities. For instance, mini-batch gradient descent aggregates a batch of gradients, one per sample in the mini-batch. Typically, this aggregation is performed with the sample mean, which is a reasonable choice due to its simplicity as an estimate of the expectation of the random gradient. For heavy-tailed gradient distributions, however, the sample mean is highly sub-optimal, because sample-mean estimates are greatly skewed by samples in the tail. Gradient estimates built from these sub-optimal sample means thus need not point in the right direction, leading to bad solutions, prolonged training time, or both. A critical requirement for training modern machine learning models is therefore a scalable estimator of the mean of a heavy-tailed random vector. Note that such means of sample gradients are computed in every iteration of (stochastic) gradient descent, so the heavy-tailed mean estimation must be extremely scalable, yet come with strong guarantees. Once we have such a scalable heavy-tailed mean estimator, we can simply use it to compute robust gradient estimates (Prasad et al., 2020) and learn generic statistical models. We summarize our contributions as follows:

• We extend recent analyses of heavy-tailed behavior in machine learning, and provide novel empirical evidence of heavy-tailed gradients while training modern generative models such as generative adversarial networks (GANs) and invertible flow models.
• To combat the issue of aggregating gradient samples from a heavy-tailed distribution, we propose a practical and easy-to-implement algorithm for heavy-tailed mean estimation with provable guarantees on the error of the estimate.

• We use the proposed mean estimator to compute robust gradient estimates, which allows us to learn generalized linear models in the heavy-tailed setting, with strong guarantees on the estimation errors.

• Finally, we propose a heuristic approximation of the mean estimation algorithm which scales to random vectors with millions of variables. Accordingly, we use this heuristic to compute robust gradients of large-scale deep learning models with millions of parameters, and show that training with this heuristic outperforms a variety of practical baselines.

Notation and other definitions. Let x be a random vector with mean µ. We say that x has bounded 2k-th moments if, for all v ∈ S^{p−1} (the unit sphere), E[(v^T(x − µ))^{2k}] ≤ C_{2k} (E[(v^T(x − µ))^2])^k. Throughout the paper, we use c, c_1, c_2, . . . , C, C_1, C_2, . . . to denote positive universal constants.
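As a concrete illustration of why robust aggregation matters, the classical median-of-means estimator is one simple, provably robust alternative to the sample mean under heavy tails. The sketch below is illustrative only (it is not the algorithm proposed in this paper); the tail index, block count, and dimensions are arbitrary choices for the demo.

```python
import numpy as np

def median_of_means(samples, n_blocks=10, seed=0):
    """Median-of-means: shuffle the samples, split them into blocks,
    average each block, then take the coordinate-wise median of the
    block means. The median step suppresses blocks corrupted by
    extreme tail samples."""
    rng = np.random.default_rng(seed)
    shuffled = samples[rng.permutation(len(samples))]
    block_means = [b.mean(axis=0) for b in np.array_split(shuffled, n_blocks)]
    return np.median(np.stack(block_means), axis=0)

# Demo: heavy-tailed noise from a centered Pareto/Lomax law with tail
# index 2.1, i.e. barely finite variance (mean of numpy's pareto(a) is
# 1/(a-1) for a > 1, so we subtract it to center the noise).
rng = np.random.default_rng(1)
mu = np.ones(10)
noise = rng.pareto(2.1, size=(20000, 10)) - 1.0 / (2.1 - 1.0)
X = mu + noise
print("sample mean error:     ", np.linalg.norm(X.mean(axis=0) - mu))
print("median-of-means error: ", np.linalg.norm(median_of_means(X) - mu))
```

On data with no tail samples the estimator reduces to the ordinary mean; its advantage appears in the deviation guarantees under heavy tails rather than on any single draw.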

2. EFFICIENT AND PRACTICAL MEAN ESTIMATION

We begin by formalizing the notion of heavy-tailed distributions.

Definition 1 (Heavy-Tailed Distribution (Resnick, 2007)). A non-negative random variable X is called heavy-tailed if the tail probability P(X > t) is asymptotically proportional to t^{−α*}, where α* is a positive constant called the tail index of X.

Intuitively, this definition states that if the tail P(X > t) decreases at a rate slower than e^{−t}, then the distribution is heavy-tailed. An interesting consequence of this definition is the non-existence of higher-order moments. Specifically, if X is a heavy-tailed random variable with tail index α*, one can show that E[X^α] is finite if and only if α < α*. In the recent statistical estimation literature (e.g., Minsker (2015); Hopkins (2018); Lugosi & Mendelson (2019)), heavy-tailed distributions are accordingly defined by the absence of finite higher-order moments.
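The tail index in Definition 1 can be estimated from data; the classical Hill estimator recovers α* from the largest order statistics of a sample. A minimal sketch (the estimator and the choice of k are standard illustrative tools, not part of this paper's method):

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of the tail index alpha* from the k largest order
    statistics: alpha_hat = k / sum_{i<k} log(x_(i) / x_(k)), where x is
    sorted in descending order. For an exact Pareto tail this is the MLE."""
    xs = np.sort(x)[::-1]
    return k / np.sum(np.log(xs[:k] / xs[k]))

rng = np.random.default_rng(0)
# numpy's pareto(a) draws from a Lomax law; shifting by 1 gives a pure
# Pareto sample on [1, inf) with P(X > t) = t^{-a}, i.e. tail index a = 2.
x = 1.0 + rng.pareto(2.0, size=200_000)
print("estimated tail index:", hill_estimator(x, k=2_000))
```

The estimate concentrates around the true index α* = 2 for this sample; choosing k involves the usual bias-variance trade-off (too few top order statistics gives a noisy estimate, too many leaves the tail region).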




Figure 1: Distribution of sampled gradient norms while training (a) DCGAN and (b) Real-NVP on the CIFAR-10 dataset; (c) distribution of norms of Gaussian random vectors; (d) distribution of norms of α-stable random vectors with α = 1.95. X-axis: norm; Y-axis: density.
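The Gaussian-versus-α-stable contrast in Figures 1c and 1d can be reproduced in miniature: even at α = 1.95, close to the Gaussian case α = 2, the norms of α-stable vectors exhibit far more extreme outliers. The sketch below uses scipy's symmetric α-stable sampler; the dimensions and the max-to-median summary statistic are illustrative choices, not the paper's experimental setup.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n, p = 2000, 100

# n vectors in dimension p: light-tailed Gaussian vs. symmetric
# alpha-stable (beta = 0) with alpha = 1.95, whose coordinates have
# polynomial tails P(|x| > t) ~ t^{-1.95}.
gauss = rng.standard_normal((n, p))
stable = levy_stable.rvs(1.95, 0.0, size=(n, p), random_state=0)

g_norms = np.linalg.norm(gauss, axis=1)
s_norms = np.linalg.norm(stable, axis=1)

# Heavy tails show up as a much larger max-to-median ratio of the norms.
print("Gaussian     max/median norm:", g_norms.max() / np.median(g_norms))
print("alpha-stable max/median norm:", s_norms.max() / np.median(s_norms))
```

Plotting histograms of `g_norms` and `s_norms` reproduces the qualitative shape of Figures 1c and 1d: the Gaussian norms concentrate tightly, while the α-stable norms have a long right tail driven by occasional extreme coordinates.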

