DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

Abstract

We present a theoretical framework recasting data augmentation as stochastic optimization for a sequence of time-varying proxy losses. This provides a unified language for understanding techniques commonly thought of as data augmentation, including synthetic noise and label-preserving transformations, as well as more traditional ideas in stochastic optimization such as learning rate and batch size scheduling. We then specialize our framework to study arbitrary augmentations in the context of a simple model (overparameterized linear regression). In this setting we extend the classical Robbins-Monro theorem to include augmentation and obtain rates of convergence, giving conditions on the learning rate and augmentation schedule under which augmented gradient descent converges. Special cases give provably good schedules for augmentation with additive noise, minibatch SGD, and minibatch SGD with noise.

1. INTRODUCTION

Implementing gradient-based optimization in practice requires many choices. These include setting hyperparameters such as the learning rate and batch size, as well as specifying a data augmentation scheme, a popular set of techniques in which data is augmented (i.e. modified) at every step of optimization. Trained model quality is highly sensitive to these choices. In practice they are made using methods ranging from simple grid search to Bayesian optimization and reinforcement learning (Cubuk et al., 2019; 2020; Ho et al., 2019). Such approaches, while effective, are often ad hoc and computationally expensive due to the need to handle scheduling, in which optimization hyperparameters and augmentation choices and strengths change over the course of optimization.

These empirical results stand in contrast to theoretically grounded approaches to stochastic optimization, which provide both provable guarantees and reliable intuitions. The most extensive work in this direction builds on the seminal article of Robbins & Monro (1951), which gives provably optimal learning rate schedules for stochastic optimization of strongly convex objectives. While rigorous, these approaches are typically not sufficiently flexible to address the myriad augmentation types and hyperparameter choices beyond learning rates necessary in practice.

This article is a step towards bridging this gap. We provide in §3 a rigorous framework for reinterpreting gradient descent with arbitrary data augmentation as stochastic gradient descent on a time-varying sequence of objectives. This provides a unified language to study traditional stochastic optimization methods such as minibatch SGD together with widely used augmentations such as additive noise (Grandvalet & Canu, 1997), CutOut (DeVries & Taylor, 2017), Mixup (Zhang et al., 2017), and label-preserving transformations (e.g. color jitter and geometric transformations (Simard et al., 2003)).
This framework also opens the door to studying how to schedule and evaluate arbitrary augmentations, an important topic given the recent interest in learned augmentation (Cubuk et al., 2019). Quantitative results in our framework are difficult to obtain in full generality due to the complex interaction between models and augmentations. To illustrate the utility of our approach and better understand specific augmentations, we present in §3 and §5 results about arbitrary augmentations for overparameterized linear regression and specialize to additive noise and minibatch SGD in §4 and §6. While our results apply directly only to simple quadratic losses, they treat very general augmentations. Treating more complex models is left to future work.

Our main contributions are:

• In Theorem 5.1, we give sufficient conditions under which gradient descent under any augmentation scheme converges in the setting of overparameterized linear regression. Our

