DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

Abstract

We present a theoretical framework recasting data augmentation as stochastic optimization for a sequence of time-varying proxy losses. This provides a unified language for understanding techniques commonly thought of as data augmentation, including synthetic noise and label-preserving transformations, as well as more traditional ideas in stochastic optimization such as learning rate and batch size scheduling. We then specialize our framework to study arbitrary augmentations in the context of a simple model (overparameterized linear regression). We extend in this setting the classical Monro-Robbins theorem to include augmentation and obtain rates of convergence, giving conditions on the learning rate and augmentation schedule under which augmented gradient descent converges. Special cases give provably good schedules for augmentation with additive noise, minibatch SGD, and minibatch SGD with noise.

1. INTRODUCTION

Implementing gradient-based optimization in practice requires many choices. These include setting hyperparameters such as the learning rate and batch size, as well as specifying a data augmentation scheme, a popular set of techniques in which data is augmented (i.e. modified) at every step of optimization. Trained model quality is highly sensitive to these choices. In practice they are made using methods ranging from simple grid search to Bayesian optimization and reinforcement learning (Cubuk et al., 2019; 2020; Ho et al., 2019). Such approaches, while effective, are often ad hoc and computationally expensive due to the need to handle scheduling, in which optimization hyperparameters and augmentation choices and strengths change over the course of optimization. These empirical results stand in contrast to theoretically grounded approaches to stochastic optimization, which provide both provable guarantees and reliable intuitions. The most extensive work in this direction builds on the seminal article of Robbins & Monro (1951), which gives provably optimal learning rate schedules for stochastic optimization of strongly convex objectives. While rigorous, these approaches are typically not sufficiently flexible to address the myriad augmentation types and hyperparameter choices beyond learning rates necessary in practice.

This article is a step towards bridging this gap. We provide in §3 a rigorous framework for reinterpreting gradient descent with arbitrary data augmentation as stochastic gradient descent on a time-varying sequence of objectives. This provides a unified language to study traditional stochastic optimization methods such as minibatch SGD together with widely used augmentations such as additive noise (Grandvalet & Canu, 1997), CutOut (DeVries & Taylor, 2017), Mixup (Zhang et al., 2017), and label-preserving transformations (e.g. color jitter and geometric transformations (Simard et al., 2003)).
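To make this reinterpretation concrete: replacing the dataset with a random minibatch at each step is itself a data augmentation scheme, and the gradient of the resulting proxy loss is unbiased for the full-batch gradient. The following is a minimal numerical check of this fact for squared-error linear regression; the dimensions, data, and batch size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 12, 4, 3                      # dataset size, dimension, batch size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad(Xb, yb, w):
    # gradient of the squared-error empirical risk on the (possibly augmented) data
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

full = grad(X, y, w)

# average the minibatch gradient over many random batches
trials = 50_000
est = np.zeros(d)
for _ in range(trials):
    idx = rng.choice(n, size=b, replace=False)
    est += grad(X[idx], y[idx], w)
est /= trials

print(np.max(np.abs(est - full)))  # small: minibatching gives an unbiased proxy gradient
```

In the language of the framework, the proxy loss at each step is the empirical risk on the random minibatch; its expectation over the augmentation randomness recovers the original objective.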
It also opens the door to studying how to schedule and evaluate arbitrary augmentations, an important topic given the recent interest in learned augmentation (Cubuk et al., 2019). Quantitative results in our framework are difficult to obtain in full generality due to the complex interaction between models and augmentations. To illustrate the utility of our approach and better understand specific augmentations, we present in §3 and §5 results about arbitrary augmentations for overparameterized linear regression and specialize to additive noise and minibatch SGD in §4 and §6. While our results apply directly only to simple quadratic losses, they treat very general augmentations. Treating more complex models is left to future work. Our main contributions are:

• In Theorem 5.1, we give sufficient conditions under which gradient descent under any augmentation scheme converges in the setting of overparameterized linear regression. Our result extends classical results of Monro-Robbins type and covers schedules for both learning rate and data augmentation scheme.

• We complement the asymptotic results of Theorem 5.1 with quantitative rates of convergence furnished in Theorem 5.2. These rates depend only on the first few moments of the augmented data distribution, underscoring the flexibility of our framework.

• In §4, we analyze additive input noise, a popular augmentation strategy for increasing model robustness. We recover the known fact that it is equivalent to stochastic optimization with ℓ2-regularization and find criteria in Theorem 4.1 for jointly scheduling the learning rate and noise level to provably recover the minimal norm solution.

• In §6, we analyze minibatch SGD, recovering known results about rates of convergence for SGD (Theorem 6.1) and novel results about SGD with noise (Theorem 6.2).
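The equivalence between additive input noise and ℓ2-regularization can be checked numerically in a small example. The sketch below, with an arbitrary dataset and noise level, compares a Monte Carlo average of the loss on noise-augmented inputs against the closed form (original loss plus a ridge penalty), following the observation going back to Bishop (1995).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 20, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def loss(X, y, w):
    return np.mean((X @ w - y) ** 2)

# Monte Carlo average of the squared-error loss on noise-augmented inputs
trials = 50_000
mc = np.mean([loss(X + sigma * rng.normal(size=X.shape), y, w)
              for _ in range(trials)])

# In expectation, additive Gaussian input noise of level sigma adds
# an l2 penalty sigma^2 * ||w||^2 to the original loss
ridge = loss(X, y, w) + sigma**2 * np.sum(w**2)
print(mc, ridge)  # the two agree up to Monte Carlo error
```

Expanding the per-sample loss ((x + ε)·w − y)² and taking expectations over ε ~ N(0, σ²I) kills the cross term and leaves exactly the extra σ²‖w‖² contribution, which is what the comparison above verifies.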

2. RELATED WORK

In addition to the extensive empirical work on data augmentation cited elsewhere in this article, we briefly catalog other theoretical work on data augmentation and learning rate schedules. The latter were first considered in the seminal work Robbins & Monro (1951). This spawned a vast literature on rates of convergence for GD, SGD, and their variants. We mention only the relatively recent articles Bach & Moulines (2013); Défossez & Bach (2015); Bottou et al. (2018); Smith et al. (2018); Ma et al. (2018) and the references therein. The last of these, namely Ma et al. (2018), finds optimal choices of learning rate and batch size for SGD in the overparameterized linear setting.

A number of articles have also pointed out in various regimes that data augmentation and more general transformations such as feature dropout correspond in part to ℓ2-type regularization on model parameters, features, gradients, and Hessians. The first article of this kind of which we are aware is Bishop (1995), which treats the case of additive Gaussian noise (see §4). More recent work in this direction includes Chapelle et al. (2001); Wager et al. (2013); LeJeune et al. (2019); Liu et al. (2020). There are also several articles investigating optimal choices of ℓ2-regularization for linear models (cf. e.g. Wu et al. (2018); Wu & Xu (2020); Bartlett et al. (2020)). These articles focus directly on the generalization effects of ridge-regularized minima but not on the dynamics of optimization. We also point the reader to Lewkowycz & Gur-Ari (2020), which considers optimal choices for the weight decay coefficient empirically in neural networks and analytically in simple models.

We also refer the reader to a number of recent attempts to characterize the benefits of data augmentation. In Rajput et al. (2019), for example, the authors quantify how much augmented data, produced via additive noise, is needed to learn positive margin classifiers. Chen et al. (2019), in contrast, focuses on the case of data invariant under the action of a group. Using the group action to generate label-preserving augmentations, the authors prove that the variance of any function depending only on the trained model will decrease. This applies in particular to estimators for the trainable parameters themselves. Dao et al. (2019) shows that augmented k-NN classification reduces to a kernel method for augmentations transforming each datapoint to a finite orbit of possibilities. It also gives a second-order expansion for the proxy loss of a kernel method under such augmentations and interprets how each term affects generalization. Finally, the article Wu et al. (2020) considers both label-preserving and noising augmentations, pointing out the conceptually distinct roles such augmentations play.

3. DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

A common task in modern machine learning is the optimization of an empirical risk

    L(W; D) = (1/|D|) Σ_{(x_j, y_j) ∈ D} ℓ(f(x_j; W), y_j),    (3.1)

where f(x; W) is a parameterized model for a dataset D of input-response pairs (x, y) and ℓ is a per-sample loss. Optimizing W by vanilla gradient descent on L corresponds to the update equation

    W_{t+1} = W_t − η_t ∇_W L(W_t; D).

In this context, we define a data augmentation scheme to be any procedure that consists, at every step of optimization, of replacing the dataset D by a randomly augmented variant, which we will denote
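The resulting augmented update can be sketched in a few lines for linear regression with additive input noise as the augmentation. The decay schedules and constants below are illustrative choices made for this example, not the provably good schedules of the theorems; the dataset is synthetic and noiseless so that the minimal training loss is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                          # noiseless targets: zero loss is attainable

def grad(Xt, y, w):
    # gradient of L(w; D_t) = (1/|D_t|) * sum (x_j . w - y_j)^2 on the augmented data
    return 2.0 / len(y) * Xt.T @ (Xt @ w - y)

w = np.zeros(d)
for t in range(2000):
    eta = 0.1 / (1 + 0.01 * t)          # decaying learning-rate schedule
    sigma = 0.5 / (1 + 0.05 * t)        # decaying augmentation strength
    X_aug = X + sigma * rng.normal(size=X.shape)  # additive-noise augmentation of D
    w = w - eta * grad(X_aug, y, w)     # gradient step on the proxy loss

print(np.mean((X @ w - y) ** 2))        # training loss shrinks as both schedules decay
```

Each step is an ordinary gradient update, but on a freshly resampled dataset, so the iteration is exactly SGD on the time-varying sequence of proxy losses L(w; D_t) described above.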

