DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

Abstract

We present a theoretical framework recasting data augmentation as stochastic optimization for a sequence of time-varying proxy losses. This provides a unified language for understanding techniques commonly thought of as data augmentation, including synthetic noise and label-preserving transformations, as well as more traditional ideas in stochastic optimization such as learning rate and batch size scheduling. We then specialize our framework to study arbitrary augmentations in the context of a simple model: overparameterized linear regression. In this setting, we extend the classical Monro-Robbins theorem to include augmentation and obtain rates of convergence, giving conditions on the learning rate and augmentation schedule under which augmented gradient descent converges. Special cases yield provably good schedules for augmentation with additive noise, for minibatch SGD, and for minibatch SGD with additive noise.

1. INTRODUCTION

Implementing gradient-based optimization in practice requires many choices. These include setting hyperparameters such as the learning rate and batch size, as well as specifying a data augmentation scheme, a popular set of techniques in which data is augmented (i.e. modified) at every step of optimization. Trained model quality is highly sensitive to these choices. In practice they are made using methods ranging from simple grid search to Bayesian optimization and reinforcement learning (Cubuk et al., 2019; 2020; Ho et al., 2019). Such approaches, while effective, are often ad hoc and computationally expensive due to the need to handle scheduling, in which optimization hyperparameters and augmentation choices and strengths change over the course of optimization.

These empirical results stand in contrast to theoretically grounded approaches to stochastic optimization, which provide both provable guarantees and reliable intuitions. The most extensive work in this direction builds on the seminal article of Robbins & Monro (1951), which gives provably optimal learning rate schedules for stochastic optimization of strongly convex objectives. While rigorous, these approaches are typically not flexible enough to address the myriad augmentation types and hyperparameter choices beyond learning rates necessary in practice.

This article is a step towards bridging this gap. We provide in §3 a rigorous framework for reinterpreting gradient descent with arbitrary data augmentation as stochastic gradient descent on a time-varying sequence of objectives. This provides a unified language in which to study traditional stochastic optimization methods such as minibatch SGD together with widely used augmentations such as additive noise (Grandvalet & Canu, 1997), CutOut (DeVries & Taylor, 2017), Mixup (Zhang et al., 2017), and label-preserving transformations (e.g. color jitter and geometric transformations (Simard et al., 2003)).
It also opens the door to studying how to schedule and evaluate arbitrary augmentations, an important topic given the recent interest in learned augmentation (Cubuk et al., 2019). Quantitative results in our framework are difficult to obtain in full generality due to the complex interaction between models and augmentations. To illustrate the utility of our approach and better understand specific augmentations, we present in §3 and §5 results about arbitrary augmentations for overparameterized linear regression and specialize to additive noise and minibatch SGD in §4 and §6. While our results apply directly only to simple quadratic losses, they treat very general augmentations. Treating more complex models is left to future work. Our main contributions are:

• In Theorem 5.1, we give sufficient conditions under which gradient descent under any augmentation scheme converges in the setting of overparameterized linear regression. Our result extends classical results of Monro-Robbins type and covers schedules for both the learning rate and the data augmentation scheme.

• We complement the asymptotic results of Theorem 5.1 with quantitative rates of convergence furnished in Theorem 5.2. These rates depend only on the first few moments of the augmented data distribution, underscoring the flexibility of our framework.

• In §4, we analyze additive input noise, a popular augmentation strategy for increasing model robustness. We recover the known fact that it is equivalent to stochastic optimization with $\ell_2$-regularization and find criteria in Theorem 4.1 for jointly scheduling the learning rate and noise level to provably recover the minimal norm solution.

• In §6, we analyze minibatch SGD, recovering known results about rates of convergence for SGD (Theorem 6.1) and novel results about SGD with noise (Theorem 6.2).

2. RELATED WORK

In addition to the extensive empirical work on data augmentation cited elsewhere in this article, we briefly catalog other theoretical work on data augmentation and learning rate schedules. The latter were first considered in the seminal work of Robbins & Monro (1951). This spawned a vast literature on rates of convergence for GD, SGD, and their variants. We mention only the relatively recent articles Bach & Moulines (2013); Défossez & Bach (2015); Bottou et al. (2018); Smith et al. (2018); Ma et al. (2018) and the references therein. The last of these, namely Ma et al. (2018), finds optimal choices of learning rate and batch size for SGD in the overparametrized linear setting.

A number of articles have also pointed out, in various regimes, that data augmentation and more general transformations such as feature dropout correspond in part to $\ell_2$-type regularization on model parameters, features, gradients, and Hessians. The first article of this kind of which we are aware is Bishop (1995), which treats the case of additive Gaussian noise (see §4). More recent work in this direction includes Chapelle et al. (2001); Wager et al. (2013); LeJeune et al. (2019); Liu et al. (2020). There are also several articles investigating optimal choices of $\ell_2$-regularization for linear models (cf. e.g. Wu et al. (2018); Wu & Xu (2020); Bartlett et al. (2020)). These articles focus directly on the generalization effects of ridge-regularized minima but not on the dynamics of optimization. We also point the reader to Lewkowycz & Gur-Ari (2020), which considers optimal choices for the weight decay coefficient empirically in neural networks and analytically in simple models.

We also refer the reader to a number of recent attempts to characterize the benefits of data augmentation. In Rajput et al. (2019), for example, the authors quantify how much augmented data, produced via additive noise, is needed to learn positive margin classifiers. Chen et al. (2019), in contrast, focuses on the case of data invariant under the action of a group. Using the group action to generate label-preserving augmentations, the authors prove that the variance of any function depending only on the trained model will decrease. This applies in particular to estimators for the trainable parameters themselves. Dao et al. (2019) shows that augmented k-NN classification reduces to a kernel method for augmentations transforming each datapoint to a finite orbit of possibilities. It also gives a second order expansion for the proxy loss of a kernel method under such augmentations and interprets how each term affects generalization. Finally, the article Wu et al. (2020) considers both label-preserving and noising augmentations, pointing out the conceptually distinct roles such augmentations play.

3. DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

A common task in modern machine learning is the optimization of an empirical risk

$$L(W; D) = \frac{1}{|D|} \sum_{(x_j, y_j) \in D} \ell(f(x_j; W), y_j), \qquad (3.1)$$

where $f(x; W)$ is a parameterized model for a dataset $D$ of input-response pairs $(x, y)$ and $\ell$ is a per-sample loss. Optimizing $W$ by vanilla gradient descent on $L$ corresponds to the update equation $W_{t+1} = W_t - \eta_t \nabla_W L(W_t; D)$. In this context, we define a data augmentation scheme to be any procedure that consists, at every step of optimization, of replacing the dataset $D$ by a randomly augmented variant, which we will denote by $D_t$. Typically, $D_t$ is related to $D$ in some way, but our framework does not explicitly constrain the form of this relationship. Instead, certain conditions on this relationship will be required for our main results, Theorems 5.1 and 5.2, to give useful conclusions for a specific augmentation scheme. A data augmentation scheme therefore corresponds to the augmented update equation

$$W_{t+1} = W_t - \eta_t \nabla_W L(W_t; D_t). \qquad (3.2)$$

Since $D_t$ is a stochastic function of $D$, it is natural to view the augmented update rule (3.2) as a form of stochastic optimization for the proxy loss at time $t$

$$L_t(W) := \mathbb{E}_{D_t}\left[L(W; D_t)\right]. \qquad (3.3)$$

The update (3.2) corresponds precisely to stochastic optimization for the time-varying objective $L_t(W)$, in which an unbiased estimate of its gradient is obtained by evaluating the gradient of $L(W; D_t)$ on a single sample $D_t$ drawn from the augmentation distribution. The connection between data augmentation and this proxy loss was introduced for Gaussian noise in Bishop (1995) and in general in Chapelle et al. (2001), but we now consider it in the context of stochastic optimization. Despite being mathematically straightforward, reformulating data augmentation as stochastic optimization provides a unified language for questions about learning rate schedules and general augmentation schemes, including SGD.
In general, such questions can be challenging to answer, and even evaluating the proxy loss $L_t(W)$ may require significant ingenuity. While we will return to more sophisticated models in future work, we henceforth analyze general augmentations in the simple context of overparameterized linear regression. Though there are many ways to perform linear regression, we restrict to augmented gradient descent both to gain intuition about specific augmentations and to understand the effect of augmentation on optimization. We therefore consider optimizing the entries of a weight matrix $W \in \mathbb{R}^{p \times n}$ by gradient descent on

$$L(W; D) = \frac{1}{|D|} \sum_{(x,y) \in D} \|y - Wx\|^2 = \frac{1}{N} \|Y - WX\|_F^2, \qquad (3.4)$$

where our dataset $D$ is summarized by data matrices $X \in \mathbb{R}^{n \times N}$ and $Y \in \mathbb{R}^{p \times N}$, whose $N < n$ columns consist of inputs $x_i \in \mathbb{R}^n$ and associated labels $y_i \in \mathbb{R}^p$. Following this notation, a data augmentation scheme is specified by prescribing at each time step an augmented dataset $D_t$ consisting of modified data matrices $X_t, Y_t$, whose columns we denote by $x_{i,t} \in \mathbb{R}^n$ and $y_{i,t} \in \mathbb{R}^p$. Here, the number of columns in $X_t$ and $Y_t$ (i.e. the number of datapoints in $D_t$) may vary. We now give examples of some commonly used augmentations our framework can address.

• Additive Gaussian noise: This is implemented by setting $X_t = X + \sigma_t \cdot G$ and $Y_t = Y$ for $\sigma_t > 0$ and $G$ a matrix of i.i.d. standard Gaussians. We analyze this in §4.

• Mini-batch SGD: To implement mini-batch SGD with batch size $B_t$, we can take $X_t = XA_t$ and $Y_t = YA_t$, where $A_t \in \mathbb{R}^{N \times B_t}$ has i.i.d. columns containing a single non-zero entry equal to 1 in a position chosen uniformly at random. We analyze this in detail in §6.

• Random projection: This is implemented by $X_t = \Pi_t X$ and $Y_t = Y$, where $\Pi_t$ is an orthogonal projection onto a random subspace. For $\gamma_t = \mathrm{Tr}(\Pi_t)/n$, the proxy loss is

$$L_t(W) = \|Y - \gamma_t WX\|_F^2 + \gamma_t(1 - \gamma_t)\, n^{-1} \mathrm{Tr}(XX^T)\, \|W\|_F^2 + O(n^{-1}),$$

which adds a data-dependent $\ell_2$ penalty and applies a Stein-type shrinkage to the input data.
• Label-preserving transformations: For a 2-D image viewed as a vector $x \in \mathbb{R}^n$, geometric transforms (with pixel interpolation) and other label-preserving transforms such as color jitter take the form of linear maps $\mathbb{R}^n \to \mathbb{R}^n$. We may implement such augmentations in our framework by $X_t = A_t X$ and $Y_t = Y$ for some random transform matrix $A_t$.

• Mixup: To implement Mixup, we can take $X_t = XA_t$ and $Y_t = YA_t$, where $A_t \in \mathbb{R}^{N \times B_t}$ has i.i.d. columns with two random non-zero entries equal to $c_t$ and $1 - c_t$, where the mixing coefficient $c_t$ is drawn from a $\mathrm{Beta}(\alpha_t, \alpha_t)$ distribution for a parameter $\alpha_t$.

Our main technical results, Theorems 5.1 and 5.2, give sufficient conditions on a learning rate schedule $\eta_t$ and a schedule for the statistics of $X_t, Y_t$ under which optimization with augmented gradient descent provably converges. We state these general results in §5. Before doing so, we seek to demonstrate both the utility of our framework and the flavor of our results by focusing on the simple but already informative case of additive Gaussian noise.
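The augmented update (3.2) is easy to simulate directly. The following sketch (in Python/NumPy; all dimensions, constants, and helper names are illustrative, not from the paper) runs augmented gradient descent on the linear model with a pluggable augmentation hook, and sanity-checks that the identity augmentation started from $W_0 = 0$ recovers the minimum norm solution $W_{\min} = YX^T(XX^T)^+$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized setting: N < n, so Y = W X has exact solutions.
n, p, N = 20, 3, 8
X = rng.standard_normal((n, N))   # inputs as columns
Y = rng.standard_normal((p, N))   # labels as columns

def augmented_gd(augment, eta, steps):
    """Augmented update (3.2): at each step, replace the dataset by a fresh
    random augmentation D_t = (X_t, Y_t) and take a gradient step."""
    W = np.zeros((p, n))          # initialization W_0 = 0
    for t in range(steps):
        Xt, Yt = augment(t)
        W = W + (2 * eta(t) / N) * (Yt - W @ Xt) @ Xt.T
    return W

# The identity augmentation recovers plain full-batch gradient descent...
W_gd = augmented_gd(lambda t: (X, Y), lambda t: 0.02, steps=5000)

# ...which, started from W_0 = 0, converges to the minimum norm solution.
W_min = Y @ X.T @ np.linalg.pinv(X @ X.T)
```

The `augment` hook can be swapped for any of the schemes above (noise, minibatching, random projections, Mixup) without changing the optimization loop.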

4. AUGMENTATION WITH ADDITIVE GAUSSIAN NOISE

A common augmentation in practice injects input noise as a regularizer (Grandvalet & Canu, 1997):

$$D_t = \{(x_{i,t}, y_{i,t}),\ i = 1, \ldots, N\}, \qquad x_{i,t} = x_i + \sigma_t g_{i,t}, \qquad y_{i,t} = y_i,$$

where the $g_{i,t}$ are i.i.d. standard Gaussian vectors and $\sigma_t$ is a strength parameter. This section studies such augmentations using our framework. A direct computation reveals that the proxy loss

$$L_t(W) = L_{\sigma_t}(W) := L(W; D) + \sigma_t^2 \|W\|_F^2$$

corresponding to additive Gaussian noise adds an $\ell_2$-penalty to the original loss $L$. This is simple but useful intuition. It also raises the question: what is the optimal relation between the learning rate $\eta_t$ and the augmentation strength $\sigma_t$ (i.e. the $\ell_2$-penalty)? To get a sense of what optimal might mean in this context, observe first that if $\sigma_t = 0$, then directly differentiating the loss $L$ yields the following update rule:

$$W_{t+1} = W_t + \frac{2\eta_t}{N} (Y - W_t X) X^T. \qquad (4.1)$$

The increment $W_{t+1} - W_t$ is therefore contained in the column span

$$V := \text{column span of } XX^T \subseteq \mathbb{R}^n \qquad (4.2)$$

of the model Hessian $XX^T$. Overparameterization implies $V \neq \mathbb{R}^n$. The component $W_{t,\perp}$ of $W_t$ in the orthogonal complement of $V$ thus remains frozen at its initialized value. Geometrically, this means that there are some directions, namely those in the orthogonal complement of $V$, which gradient descent "cannot see." Optimization with appropriate step sizes then yields

$$\lim_{t \to \infty} W_t = W_{0,\perp} + W_{\min}, \qquad W_{\min} := YX^T(XX^T)^+,$$

where $W_{\min}$ is the minimum norm solution of $Y = WX$. The original motivation for introducing the $\ell_2$-regularized losses $L_\sigma$ is that they provide a mechanism to eliminate the component $W_{0,\perp}$ for all initializations, not just the special choice $W_0 = 0$, and they can be used to regularize non-linear models as well. Indeed, for $\sigma > 0$, the loss $L_\sigma$ is strictly convex and has a unique minimum

$$W_\sigma^* := YX^T \left( XX^T + \sigma^2 N \cdot \mathrm{Id}_{n \times n} \right)^{-1},$$

which tends to the minimal norm solution in the weak regularization limit $\lim_{\sigma \to 0} W_\sigma^* = W_{\min}$.
Geometrically, this is reflected in the fact that the $\ell_2$-penalty yields non-trivial gradient updates for the perpendicular component:

$$W_{t+1,\perp} = (1 - 2\eta_t \sigma^2) W_{t,\perp}, \quad \text{so} \quad W_{t,\perp} = \prod_{s=0}^{t-1} (1 - 2\eta_s \sigma^2)\, W_{0,\perp}, \qquad (4.3)$$

which drives this perpendicular component of $W_t$ to zero provided $\sum_{t=1}^\infty \eta_t = \infty$. However, for each positive value of $\sigma$, the $\ell_2$-penalty also modifies the gradient descent updates for $W_{t,\parallel}$, ultimately causing $W_t$ to converge to $W_\sigma^*$, which is not a minimizer of the original loss $L$. This downside of ridge regression motivates jointly scheduling the step size $\eta_t$ and the noise strength $\sigma_t$. We hope that driving $\sigma_t$ to 0 at an appropriate rate can guarantee convergence of $W_t$ to $W_{\min}$. Namely, we want to retain the regularizing effects of $\ell_2$-noise, which force $W_{t,\perp}$ to zero, while mitigating its adverse effects, which prevent $W_\sigma^*$ from minimizing $L$. It turns out that this is indeed possible:

Theorem 4.1 (Special case of Theorem 5.1). Suppose $\sigma_t^2, \eta_t \to 0$ with $\sigma_t^2$ non-increasing and

$$\sum_{t=0}^\infty \eta_t \sigma_t^2 = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 \sigma_t^2 < \infty. \qquad (4.4)$$

Then, $W_t \xrightarrow{p} W_{\min}$. Further, if $\eta_t = \Theta(t^{-x})$ and $\sigma_t^2 = \Theta(t^{-y})$ with $x, y > 0$, $x + y < 1$, and $2x + y > 1$, then for any $\epsilon \in (0, \min\{y, x/2\})$, we have that

$$t^{\min\{y,\, x/2\} - \epsilon} \|W_t - W_{\min}\|_F \xrightarrow{p} 0.$$

Let us give a few comments on Theorem 4.1. First, although it is stated for additive Gaussian noise, an analogous version holds for arbitrary additive noise with bounded moments, with the only change being a constant multiplicative factor in the second condition of (4.4). Second, the fact that convergence in probability $W_t \xrightarrow{p} W_{\min}$ follows from (4.4) is analogous to a Monro-Robbins type theorem (Robbins & Monro, 1951). Indeed, inspecting (4.3), we see that the first condition in (4.4) guarantees that the effective learning rate $\eta_t \sigma_t^2$ in the orthogonal complement of $V$ is sufficiently large that the corresponding component $W_{t,\perp}$ of $W_t$ tends to 0, allowing the result of optimization to be independent of the initial condition $W_0$.
Further, the second condition in (4.4) guarantees that the variance of the gradients, which at time $t$ scales like $\eta_t^2 \sigma_t^2$, is summable. As in the usual Monro-Robbins setup, this means that only a finite amount of noise is injected into the optimization. Further, (4.4) is a direct specialization of the conditions of Theorem 5.1. Third, by optimizing over $x, y$, we see that the fastest rate of convergence guaranteed by Theorem 4.1 is obtained by setting $\eta_t = t^{-2/3 + \epsilon}$ and $\sigma_t^2 = t^{-1/3}$, which results in a $O(t^{-1/3 + \epsilon})$ rate of convergence. It is not evident that this is the best possible rate, however. Finally, although we leave systematic study of augmentation in non-linear models to future work, our framework can be applied beyond linear models and quadratic losses. To see this, note that, as observed for kernels in Dao et al. (2019), augmenting the inputs of a nonlinear feature model corresponds to applying a different augmentation to the outputs of the feature map. To give a concrete example, consider additive noise for small $\sigma_t$. For any sufficiently smooth function $g$, Taylor expansion reveals

$$\mathbb{E}\left[g(x + \sigma_t G)\right] = g(x) + \frac{\sigma_t^2}{2} \Delta g(x) + O(\sigma_t^4),$$

where $\Delta = \sum_i \partial_i^2$ is the Laplacian and $G$ is a standard Gaussian vector. For a general loss of the form (3.1) we have

$$L_t(W) = L(W; D) + \frac{\sigma_t^2}{2|D|} \sum_{(x,y) \in D} \left( \mathrm{Tr}\left[ (\nabla_x f)^T (H_f \ell)\, \nabla_x f \right] + (\nabla_f \ell)^T \Delta_x f \right) + O(\sigma_t^4),$$

where we have written $H_f \ell$ for the Hessian of the convex per-sample loss $\ell$ with respect to $f$, and $\nabla_x, \nabla_f$ for the gradients with respect to $x$ and $f$, respectively. This is consistent with the similar expansion done in the kernel setting by Dao et al. (2019, Section 4). If $\sigma_t$ is small, then the proxy loss $L_t$ will differ significantly from the unaugmented loss $L$ only near the end of training, when we expect $\nabla_f \ell$ to be small and $H_f \ell$ to be positive semi-definite.
Hence, we find heuristically that, neglecting higher order terms in $\sigma_t$, additive noise with small $\sigma_t$ corresponds to an $\ell_2$-regularizer

$$\frac{\sigma_t^2}{2} \mathrm{Tr}\left[ (\nabla_x f)^T (H_f \ell)\, \nabla_x f \right] =: \frac{\sigma_t^2}{2} \|\nabla_x f\|^2_{H_f \ell}$$

on the gradients of $f$ with respect to the natural inner product determined by the Hessian of the loss. This is intuitive, since penalizing the gradients of $f$ is the same as requiring that $f$ be approximately constant in a neighborhood of every datapoint. Note, however, that although the input noise was originally isotropic, the resulting $\ell_2$-penalty is weighted by the loss Hessian and hence need not be isotropic.
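Returning to the joint schedules of Theorem 4.1, a small simulation illustrates the result. The sketch below (illustrative constants and dimensions, not from the paper) runs gradient descent with additive Gaussian noise under power law schedules satisfying (4.4) and checks that, from a generic initialization, the iterates approach $W_{\min}$, including in the perpendicular directions that unaugmented gradient descent leaves frozen:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, N = 10, 2, 4
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))
W_min = Y @ X.T @ np.linalg.pinv(X @ X.T)        # minimum norm solution

# Power law schedules with x = 0.5, y = 0.4: x + y < 1 and 2x + y > 1,
# so both conditions in (4.4) hold; the constant 0.05 keeps early steps stable.
eta = lambda t: 0.05 * (t + 1.0) ** -0.5
sigma2 = lambda t: (t + 1.0) ** -0.4

W0 = rng.standard_normal((p, n))                 # generic start: W_{0,perp} != 0
W = W0.copy()
for t in range(30000):
    Xt = X + np.sqrt(sigma2(t)) * rng.standard_normal((n, N))  # additive noise
    W = W + (2 * eta(t) / N) * (Y - W @ Xt) @ Xt.T

# Without noise, the component of W_0 orthogonal to V would stay frozen.
P = X @ np.linalg.pinv(X)                        # projection onto colspan(X X^T)
frozen_err = np.linalg.norm(W0 @ (np.eye(n) - P))
err = np.linalg.norm(W - W_min)
```

The final error is well below both the initial error and the error that the frozen perpendicular component alone would contribute, consistent with the theorem.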

5. MONRO-ROBBINS THEOREMS FOR GENERAL AUGMENTATION

In this section, we state two general results, Theorems 5.1 and 5.2, which provide sufficient conditions for jointly scheduling learning rates and general augmentation schemes to guarantee convergence of augmented gradient descent in the overparameterized linear model (3.4).

5.1. A GENERAL TIME-VARYING MONRO-ROBBINS THEOREM

Given an augmentation scheme for the model (3.4), the time-$t$ gradient update at learning rate $\eta_t$ is

$$W_{t+1} := W_t + \frac{2\eta_t}{N} (Y_t - W_t X_t) X_t^T, \qquad (5.1)$$

where $D_t = (X_t, Y_t)$ is the augmented dataset at time $t$. The minimum norm minimizer of the corresponding proxy loss $L_t$ (see (3.3)) is

$$W_t^* := \mathbb{E}[Y_t X_t^T]\, \mathbb{E}[X_t X_t^T]^+, \qquad (5.2)$$

where $\mathbb{E}[X_t X_t^T]^+$ denotes the Moore-Penrose pseudo-inverse. In this section we state a rigorous result, Theorem 5.1, giving sufficient conditions on the learning rate $\eta_t$ and the distributions of the augmented matrices $X_t, Y_t$ under which augmented gradient descent converges. In analogy with the case of Gaussian noise, (5.1) shows that $W_{t+1} - W_t$ is contained in the column span of the Hessian $X_t X_t^T$ of the augmented loss and almost surely belongs to the subspace

$$V := \text{column span of } \mathbb{E}[X_t X_t^T] \subseteq \mathbb{R}^n. \qquad (5.3)$$

To ease notation, we assume that $V$ is independent of $t$. This assumption is valid for additive Gaussian noise, random projection, Mixup, SGD, and their combinations. We explain in Remark B.2 how to generalize Theorems 5.1 and 5.2 to the case where $V$ varies with $t$. Let us denote by $Q : \mathbb{R}^n \to \mathbb{R}^n$ the orthogonal projection onto $V$. At time $t$, gradient descent leaves the projection $W_t(\mathrm{Id} - Q)$ of $W_t$ onto the orthogonal complement of $V$ unchanged. In contrast, $\|W_t Q - W_t^*\|_F$ decreases at a rate governed by the smallest positive eigenvalue

$$\lambda_{\min,V}\left(\mathbb{E}[X_t X_t^T]\right) := \lambda_{\min}\left( Q\, \mathbb{E}[X_t X_t^T]\, Q \big|_V \right)$$

of the Hessian of the proxy loss $L_t$, which is obtained by restricting its full Hessian $\mathbb{E}[X_t X_t^T]$ to $V$. Moreover, whether and at what rate $W_t Q - W_t^*$ converges to 0 must depend on how quickly

$$\Xi_t^* := W_{t+1}^* - W_t^* \qquad (5.4)$$

tends to zero. Indeed, $\|\Xi_t^*\|_F$ is the distance between proxy loss optima at consecutive times and hence must tend to zero if $\|W_t Q - W_t^*\|_F$ converges to zero. Theorem 5.1.
Suppose that $V$ is independent of $t$, that the learning rate satisfies $\eta_t \to 0$, and that the proxy optima satisfy

$$\sum_{t=0}^\infty \|\Xi_t^*\|_F < \infty, \qquad (5.5)$$

ensuring the existence of a limit $W_\infty^* := \lim_{t \to \infty} W_t^*$, and that

$$\sum_{t=0}^\infty \eta_t \lambda_{\min,V}\left(\mathbb{E}[X_t X_t^T]\right) = \infty. \qquad (5.6)$$

If either

$$\sum_{t=0}^\infty \eta_t^2\, \mathbb{E}\left[ \|X_t X_t^T - \mathbb{E}[X_t X_t^T]\|_F^2 + \|Y_t X_t^T - \mathbb{E}[Y_t X_t^T]\|_F^2 \right] < \infty \qquad (5.7)$$

or the more refined condition

$$\sum_{t=0}^\infty \eta_t^2\, \mathbb{E}\left[ \|X_t X_t^T - \mathbb{E}[X_t X_t^T]\|_F^2 + \left\| \mathbb{E}[W_t](X_t X_t^T - \mathbb{E}[X_t X_t^T]) - (Y_t X_t^T - \mathbb{E}[Y_t X_t^T]) \right\|_F^2 \right] < \infty \qquad (5.8)$$

holds, then for any initialization $W_0$ we have $W_t Q \xrightarrow{p} W_\infty^*$.

The conditions of Theorem 5.1 can inform the choice of joint schedule for the learning rate and the augmentation scheme applied to gradient descent. If the same augmentation is applied with different strength parameters at each step $t$, such as $\sigma_t$ for Gaussian noise, they impose conditions on the joint schedule of $\eta_t$ and these strength parameters. In the example of Theorem 4.1 for Gaussian noise, the condition that $\sigma_t^2$ is non-increasing implies (5.5), the first condition of (4.4) implies (5.6), and the second condition of (4.4) implies (5.7). In addition to the conditions Theorem 5.1 imposes on $D_t$, the proxy optima $W_t^*$ and their limit $W_\infty^*$ are determined by the distribution of $D_t$. Therefore, for $W_\infty^*$ in Theorem 5.1 to be a desirable set of parameters for the original dataset $D$, the augmented dataset $D_t$ must bear some relation to $D$. When the augmentation procedure is static in $t$, Theorem 5.1 reduces to a standard Monro-Robbins theorem (Robbins & Monro, 1951) for the (static) proxy loss $L_t(W)$. As in that setting, condition (5.6) enforces that the learning trajectory travels far enough to reach an optimum. Condition (5.7) implies the weaker condition (5.8); the second summand in (5.8) is the variance of the gradient of the augmented loss $L(W; D_t)$, meaning (5.8) requires the total variance of the stochastic gradients to be summable.
Condition (5.5) is new; it enforces that the minimizers $W_t^*$ of the proxy losses $L_t(W)$ change slowly enough that the augmented optimization procedure can keep pace. Though it may be surprising that $\mathbb{E}[W_t]$ appears in condition (5.8), it may be interpreted as the gradient descent trajectory for the deterministic sequence of proxy losses $L_t(W)$. Accounting for the dependence on $\mathbb{E}[W_t]$ allows us to give more precise rates using the variance of the stochastic gradient in (5.8); we include both (5.7) and (5.8) to allow a user of our results to analyze $\mathbb{E}[W_t]$ separately and obtain stronger conclusions.
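For concreteness, the proxy optima (5.2) can be computed in closed form for additive Gaussian noise, where $\mathbb{E}[X_t X_t^T] = XX^T + \sigma_t^2 N\, \mathrm{Id}$ and $\mathbb{E}[Y_t X_t^T] = YX^T$. The following sketch (illustrative dimensions, not from the paper) verifies numerically that the resulting ridge optima $W_\sigma^*$ approach $W_{\min}$ as $\sigma \to 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N = 12, 3, 5
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))

def proxy_optimum(sigma2):
    # W*_sigma = E[Y_t X_t^T] (E[X_t X_t^T])^+ for additive Gaussian noise,
    # where E[X_t X_t^T] = X X^T + sigma^2 N Id is invertible for sigma^2 > 0.
    return Y @ X.T @ np.linalg.inv(X @ X.T + sigma2 * N * np.eye(n))

W_min = Y @ X.T @ np.linalg.pinv(X @ X.T)

# The ridge optima approach the minimum norm solution as sigma -> 0.
errs = [np.linalg.norm(proxy_optimum(s2) - W_min) for s2 in (1.0, 1e-2, 1e-8)]
```

Here the drift $\Xi_t^* = W_{t+1}^* - W_t^*$ of Theorem 5.1 is exactly the difference of such ridge optima at consecutive noise levels, so a non-increasing schedule for $\sigma_t^2$ controls condition (5.5).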

5.2. CONVERGENCE RATES AND SCHEDULING FOR DATA AUGMENTATION

A more precise analysis of the proof of Theorem 5.1 allows us to obtain rates of convergence of the projections $W_t Q$ of the weights onto $V$ to the limiting optimum $W_\infty^*$. In particular, when the quantities in Theorem 5.1 have power law decay, we obtain the following result.

Theorem 5.2 (informal; special case of Theorem B.4). Suppose $V$ is independent of $t$, the learning rate satisfies $\eta_t \to 0$, and for some $0 < \alpha < 1 < \beta_1, \beta_2$ and $\gamma > \alpha$ we have

$$\eta_t \lambda_{\min,V}\left(\mathbb{E}[X_t X_t^T]\right) = \Omega(t^{-\alpha}), \qquad \|\Xi_t^*\|_F = O(t^{-\beta_1}), \qquad (5.9)$$

$$\eta_t^2\, \mathbb{E}\left[ \|X_t X_t^T - \mathbb{E}[X_t X_t^T]\|_2^2 \right] = O(t^{-\gamma}), \qquad (5.10)$$

and

$$\eta_t^2\, \mathbb{E}\left[ \left\| \mathbb{E}[W_t](X_t X_t^T - \mathbb{E}[X_t X_t^T]) - (Y_t X_t^T - \mathbb{E}[Y_t X_t^T]) \right\|_F^2 \right] = O(t^{-\beta_2}).$$

Then for any initialization $W_0$ and any $\epsilon > 0$, we have

$$t^{\min\{\beta_1 - 1,\ \frac{\beta_2 - \alpha}{2}\} - \epsilon} \|W_t Q - W_\infty^*\|_F \xrightarrow{p} 0.$$

Theorem 5.2 measures rates in terms of the number of optimization steps $t$, but a different measurement of time, called the intrinsic time of the optimization, will be more suitable for measuring the behavior of optimization quantities. Intrinsic time was introduced for SGD in Smith & Le (2018); Smith et al. (2018), and we now generalize it to our broader setting. For gradient descent on a loss $L$, the intrinsic time is a quantity which increments by $\eta \lambda_{\min}(H)$ for an optimization step with learning rate $\eta$ at a point where $L$ has Hessian $H$. When specialized to our setting, it is given by

$$\tau(t) := \sum_{s=0}^{t-1} \frac{2\eta_s}{N} \lambda_{\min,V}\left(\mathbb{E}[X_s X_s^T]\right). \qquad (5.12)$$

Notice that the intrinsic time of augmented optimization for the sequence of proxy losses $L_s$ appears in Theorems 5.1 and 5.2, which require via condition (5.6) that the intrinsic time tend to infinity as the number of optimization steps grows. Intrinsic time will be a sensible variable in which to measure the behavior of quantities such as the fluctuations of the optimization path

$$f(t) := \mathbb{E}\left[ \|(W_t - \mathbb{E}[W_t]) Q\|_F^2 \right].$$
In the proofs of Theorems 5.1 and 5.2, we show that the fluctuations satisfy an inequality of the form

$$f(t+1) \le f(t)(1 - a(t))^2 + b(t) \qquad (5.13)$$

for $a(t) := \frac{2\eta_t}{N} \lambda_{\min,V}\left(\mathbb{E}[X_t X_t^T]\right)$ and $b(t) := \mathrm{Var}\left[ \|\eta_t \nabla_W L(W_t; D_t)\|_F \right]$, so that $\tau(t) = \sum_{s=0}^{t-1} a(s)$. Iterating the recursion (5.13) shows that

$$f(t) \le f(0) \prod_{s=0}^{t-1} (1 - a(s))^2 + \sum_{s=0}^{t-1} b(s) \prod_{r=s+1}^{t-1} (1 - a(r))^2 \le e^{-2\tau(t)} f(0) + \sum_{s=0}^{t-1} \frac{b(s)}{a(s)} e^{2\tau(s+1) - 2\tau(t)} (\tau(s+1) - \tau(s)).$$

For $\tau := \tau(t)$ and changes of variable $A(\tau)$, $B(\tau)$, and $F(\tau)$ such that $A(\tau(t)) = a(t)$, $B(\tau(t)) = b(t)$, and $F(\tau(t)) = f(t)$, we find by replacing a right Riemann sum by an integral that

$$F(\tau) \lesssim e^{-2\tau} \left( F(0) + \int_0^\tau \frac{B(\sigma)}{A(\sigma)} e^{2\sigma}\, d\sigma \right). \qquad (5.14)$$

In order for the result of optimization to be independent of the starting point, by (5.14) we must have $\tau \to \infty$ to remove the dependence on $F(0)$; this provides one explanation for the appearance of $\tau$ in condition (5.6). Further, (5.14) implies that the fluctuations at a given intrinsic time are bounded by an integral against the function $B(\sigma)/A(\sigma)$, which depends only on the ratio of $B$ and $A$. In the case of minibatch SGD, we compute this ratio in (6.2) and recover the commonly used "linear scaling" rule for the learning rate. In Section 6, we specialize Theorem 5.2 to obtain rates of convergence for specific augmentations. Optimizing the learning rate and augmentation parameter schedules in Theorem 5.2 allows us to derive power law schedules with convergence rate guarantees in these settings.
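The unrolled recursion can be checked numerically. The following one-dimensional caricature (with illustrative schedules for $a$ and $b$, not from the paper) runs (5.13) with equality and verifies the resulting bound expressed in terms of the intrinsic time $\tau(t)$:

```python
import numpy as np

# Caricature of the fluctuation recursion (5.13):
#   f(t+1) <= f(t) (1 - a(t))^2 + b(t),   tau(t) = sum_{s < t} a(s).
T = 500
a = 0.5 / (np.arange(T) + 1.0) ** 0.7      # contraction per step, values in (0, 1)
b = 0.1 / (np.arange(T) + 1.0) ** 1.5      # injected gradient variance

f = np.empty(T + 1)
f[0] = 1.0
for t in range(T):
    f[t + 1] = f[t] * (1 - a[t]) ** 2 + b[t]   # run the recursion with equality

tau = np.concatenate([[0.0], np.cumsum(a)])    # tau[t] = sum_{s=0}^{t-1} a(s)

# Unrolled bound: f(T) <= e^{-2 tau(T)} f(0)
#                 + sum_s (b(s)/a(s)) e^{2 tau(s+1) - 2 tau(T)} (tau(s+1) - tau(s)),
# using (1 - a)^2 <= e^{-2a} for a in [0, 1].
bound = np.exp(-2 * tau[-1]) * f[0] + np.sum(
    (b / a) * np.exp(2 * tau[1:] - 2 * tau[-1]) * (tau[1:] - tau[:-1]))
```

Since $\tau(s+1) - \tau(s) = a(s)$, the sum collapses term by term to the middle expression in the display above, and the bound holds with room to spare once $\tau$ is large.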

6. IMPLICATIONS FOR MINI-BATCH STOCHASTIC GRADIENT DESCENT (SGD)

We now apply our framework to study mini-batch stochastic gradient descent (SGD) with the potential presence of additive noise. Though data augmentation commonly refers to techniques aside from SGD, we will see that our framework handles it uniformly with other augmentations.

6.1. MINI-BATCH SGD

In mini-batch stochastic gradient descent, $D_t$ is obtained by choosing a random subset $\mathcal{B}_t$ of $D$ of prescribed batch size $B_t = |\mathcal{B}_t|$. Each datapoint in $\mathcal{B}_t$ is chosen uniformly with replacement from $D$, and the resulting data matrices $X_t$ and $Y_t$ are scaled so that $L_t(W) = L(W; D)$. Concretely, this means that for the normalizing factor $c_t := \sqrt{N/B_t}$ we have $X_t = c_t X A_t$ and $Y_t = c_t Y A_t$, where $A_t \in \mathbb{R}^{N \times B_t}$ has i.i.d. columns $A_{t,i}$ with a single non-zero entry equal to 1 in a position chosen uniformly at random. In this setting the minimum norm proxy optima for each $t$ are all the same and given by

$$W_t^* = W_\infty^* = YX^T(XX^T)^+,$$

which coincides with the minimum norm optimum for the unaugmented loss. Our main result for standard SGD is the following theorem, whose proof is given in Appendix D.1.

Theorem 6.1. If the learning rate satisfies $\eta_t \to 0$ and

$$\sum_{t=0}^\infty \eta_t = \infty, \qquad (6.1)$$

then for any initialization $W_0$, we have $W_t Q \xrightarrow{p} W_\infty^*$. If further we have that $\eta_t = \Theta(t^{-x})$ with $0 < x < 1$, then for some $C > 0$ we have

$$e^{C t^{1-x}} \|W_t Q - W_\infty^*\|_F \xrightarrow{p} 0.$$

Theorem 6.1 recovers the exponential convergence rate for SGD, which has been extensively studied through both empirical and theoretical means (Bottou et al., 2018; Ma et al., 2018). Because $1 \le B_t \le N$ for all $t$, the batch size does not affect the asymptotic results in Theorem 6.1. In practice, however, the number of optimization steps $t$ is often small enough that $B_t/N$ is of order $t^{-\alpha}$ for some $\alpha > 0$, meaning the choice of $B_t$ can affect rates in this non-asymptotic regime. Though we do not attempt to push our generic analysis to this granularity, this is done in Ma et al. (2018) to derive optimal batch sizes and learning rates in the overparametrized setting. Our proof of Theorem 6.1 shows that

$$\frac{b(t)}{a(t)} \le C \cdot \frac{\eta_t}{B_t}. \qquad (6.2)$$

Thus, keeping $b(t)/a(t)$ fixed as a function of the intrinsic time $\tau$ suggests the "linear scaling" $\eta_t \propto B_t$ used empirically in Goyal et al. (2017) and proposed via a heuristic SDE limit in Smith et al. (2018).
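The sampling matrix $A_t$ and normalization $c_t$ are straightforward to implement. The sketch below (assuming $c_t = \sqrt{N/B_t}$, with illustrative dimensions and a hypothetical helper `sample_batch_matrix`) verifies by Monte Carlo that $\mathbb{E}[X_t X_t^T] = XX^T$ for $X_t = c_t X A_t$, so that minibatching leaves the proxy Hessian unchanged in expectation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, B = 8, 6, 2
X = rng.standard_normal((n, N))

def sample_batch_matrix(N, B, rng):
    """A_t in R^{N x B}: i.i.d. columns, each with a single entry 1 placed
    in a uniformly random row (sampling with replacement)."""
    A = np.zeros((N, B))
    A[rng.integers(0, N, size=B), np.arange(B)] = 1.0
    return A

c = np.sqrt(N / B)   # normalization c_t (assumed sqrt(N/B_t))

# Monte Carlo estimate of E[X_t X_t^T] for X_t = c X A_t.
trials = 10000
acc = np.zeros((n, n))
for _ in range(trials):
    A = sample_batch_matrix(N, B, rng)
    Xt = c * X @ A
    acc += Xt @ Xt.T
mean = acc / trials
```

With this normalization the proxy optima are independent of the batch size, which is why $B_t$ enters only through the fluctuation ratio (6.2).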

6.2. MINI-BATCH SGD WITH ADDITIVE SYNTHETIC NOISE

In addition to handling synthetic noise and SGD separately, our results and framework also cover the hybrid case of mini-batch SGD with batch size $B_t$ and additive noise at level $\sigma_t$. Here, $X_t = c_t(XA_t + \sigma_t G_t)$ and $Y_t = c_t Y A_t$, where $c_t$ and $A_t$ are as in Section 6.1 and $G_t \in \mathbb{R}^{n \times B_t}$ has i.i.d. standard Gaussian entries. The proxy loss is

$$L_t(W) := \frac{1}{N} \mathbb{E}\left[ \|c_t Y A_t - c_t W X A_t - c_t \sigma_t W G_t\|_F^2 \right] = \frac{1}{N} \|Y - WX\|_F^2 + \sigma_t^2 \|W\|_F^2,$$

with ridge minimizer

$$W_t^* = YX^T \left( XX^T + \sigma_t^2 N \cdot \mathrm{Id}_{n \times n} \right)^{-1}.$$

As with synthetic noise, but unlike noiseless SGD, the optima $W_t^*$ converge to the minimal norm interpolant $W_{\min} = YX^T(XX^T)^+$.

Theorem 6.2. Suppose $\sigma_t^2 \to 0$ is decreasing, $\eta_t \to 0$, and for any $C > 0$ we have

$$\sum_{t=0}^\infty \left( \eta_t \sigma_t^2 - C \eta_t^2 \right) = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 \sigma_t^2 < \infty. \qquad (6.3)$$

Then we have $W_t \xrightarrow{p} W_{\min}$. If we further have $\eta_t = \Theta(t^{-x})$ and $\sigma_t^2 = \Theta(t^{-y})$ with $x, y > 0$ and $0 < x + y < 1 < 2x + y$, then for any $\epsilon > 0$ we have

$$t^{\min\{y,\, x/2\} - \epsilon} \|W_t - W_{\min}\|_F \xrightarrow{p} 0.$$

Theorem 6.2 provides an example where our framework can handle the composition of two augmentations, namely additive noise and SGD. It reveals a qualitative difference between SGD with and without additive noise. For polynomially decaying $\eta_t$, the convergence of noiseless SGD in Theorem 6.1 is exponential in $t$, while the bound from Theorem 6.2 is polynomial in $t$. This is unavoidable. Indeed, for components of $W_t$ orthogonal to $\mathrm{colspan}(X)$, convergence requires that $\sum_{t=0}^\infty \eta_t \sigma_t^2 = \infty$ (see (4.3)). This occurs only if $\sigma_t$ has at most power law decay, causing $\|W_t^* - W_{\min}\|_F$ to have at most power law decay as well. Finally, the Monro-Robbins conditions (6.3) are more restrictive than the analogous conditions in the pure noise setting (see (4.4)), as the latter allow for large $\eta_t$ schedules in which $\sum_{t=0}^\infty \eta_t^2$ diverges but $\sum_{t=0}^\infty \eta_t^2 \sigma_t^2$ does not.
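The proxy loss identity for SGD with additive noise can likewise be checked by Monte Carlo. The following sketch (illustrative dimensions, not from the paper; $c_t = \sqrt{N/B_t}$ assumed as above) estimates $\frac{1}{N}\mathbb{E}\|c_t Y A_t - c_t W X A_t - c_t \sigma_t W G_t\|_F^2$ and compares it with the closed form $\frac{1}{N}\|Y - WX\|_F^2 + \sigma_t^2 \|W\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, N, B = 8, 2, 6, 3
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))
W = rng.standard_normal((p, n))
sigma2 = 0.25
c = np.sqrt(N / B)                 # assumed normalization sqrt(N/B_t)

def sample_augmented(rng):
    """Draw D_t for minibatch SGD composed with additive Gaussian noise."""
    A = np.zeros((N, B))
    A[rng.integers(0, N, size=B), np.arange(B)] = 1.0
    G = rng.standard_normal((n, B))
    return c * (X @ A + np.sqrt(sigma2) * G), c * Y @ A

trials = 20000
mc = 0.0
for _ in range(trials):
    Xt, Yt = sample_augmented(rng)
    mc += np.sum((Yt - W @ Xt) ** 2) / N
mc /= trials

closed = np.sum((Y - W @ X) ** 2) / N + sigma2 * np.sum(W ** 2)
```

The cross terms between the batch and the noise vanish in expectation, which is why the two augmentations compose so cleanly in the proxy loss.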

7. DISCUSSION

We have presented a theoretical framework to rigorously analyze the effect of data augmentation. As can be seen in our main results, our framework applies to completely general augmentations and relies only on analyzing the first few moments of the augmented dataset. This allows us to handle augmentations as diverse as additive noise and mini-batch SGD as well as their composition in a uniform manner. We have analyzed some representative examples in detail in this work, but many other commonly used augmentations may be handled similarly: label-preserving transformations (e.g. color jitter, geometric transformations), random projections (DeVries & Taylor, 2017; Park et al., 2019) , and Mixup (Zhang et al., 2017) , among many others. Another line of investigation left to future work is to compare different methods of combining augmentations such as mixing, alternating, or composing, which often improve performance in the empirical literature (Hendrycks et al., 2020) . Though our results provide a rigorous baseline to compare to more complex settings, the restriction of the present work to linear models is of course a significant constraint. In future work, we hope to extend our general analysis to models closer to those used in practice. Most importantly, we intend to consider more complex models such as kernels (including the neural tangent kernel) and neural networks by making similar connections to stochastic optimization. In an orthogonal direction, our analysis currently focuses on the mean square loss for regression, and we aim to extend it to other losses such as the cross-entropy loss. Finally, our study has thus far been restricted to the effect of data augmentation on optimization, and it would be of interest to derive consequences for generalization with more complex models. We hope our framework can provide the theoretical underpinnings for a more principled understanding of the effect and practice of data augmentation.

A.2 ONE-AND TWO-SIDED DECAY

Definition A.1. Let $A_t \in \mathbb{R}^{n\times n}$ be a sequence of independent random non-negative definite matrices with $\sup_t\|A_t\| \le 2$ almost surely, let $B_t \in \mathbb{R}^{p\times n}$ be a sequence of arbitrary matrices, and let $C_t \in \mathbb{R}^{n\times n}$ be a sequence of non-negative definite matrices. We say that the sequence of matrices $X_t \in \mathbb{R}^{p\times n}$ has one-sided decay of type $(\{A_t\},\{B_t\})$ if it satisfies
$$X_{t+1} = X_t(\mathrm{Id} - \mathbb{E}[A_t]) + B_t. \tag{A.3}$$
We say that a sequence of non-negative definite matrices $Z_t \in \mathbb{R}^{n\times n}$ has two-sided decay of type $(\{A_t\},\{C_t\})$ if it satisfies
$$Z_{t+1} = \mathbb{E}\left[(\mathrm{Id}-A_t)Z_t(\mathrm{Id}-A_t)\right] + C_t. \tag{A.4}$$
Intuitively, if a sequence of matrices $X_t$ (resp. $Z_t$) has one-sided decay of type $(\{A_t\},\{B_t\})$ (resp. two-sided decay of type $(\{A_t\},\{C_t\})$), then in those directions $u \in \mathbb{R}^n$ for which $\|A_tu\|$ does not decay too quickly in $t$, we expect that $X_t$ (resp. $Z_t$) will converge to $0$, provided the $B_t$ (resp. $C_t$) are not too large. More formally, let us define
$$V := \bigcup_{t=0}^{\infty}\ker\prod_{s=t}^{\infty}(\mathrm{Id}-\mathbb{E}[A_s]) = \bigcup_{t=0}^{\infty}\left\{u \in \mathbb{R}^n \,\middle|\, \lim_{T\to\infty}\prod_{s=t}^{T}(\mathrm{Id}-\mathbb{E}[A_s])\,u = 0\right\},$$
and let $Q$ be the orthogonal projection onto $V$. It is on the space $V$ that we expect $X_t$, $Z_t$ to tend to zero if they satisfy one- or two-sided decay, and the precise results follow.
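To illustrate the definition (a minimal sketch with hypothetical choices of $\mathbb{E}[A_t]$ and $B_t$, not an example from the paper), one can iterate the one-sided decay recursion (A.3) deterministically and observe decay along $V$ but not along its orthogonal complement:

```python
import numpy as np

n, p = 4, 2
# Hypothetical example: E[A_t] = a_t * P for a fixed orthogonal projection P,
# with a divergent schedule sum a_t = inf, so V = range(P).
P = np.zeros((n, n)); P[0, 0] = P[1, 1] = 1.0        # projects onto span(e0, e1)
a = lambda t: 0.5 / (1 + t) ** 0.5                    # a_t -> 0, sum diverges
B = lambda t: np.ones((p, n)) / (1 + t) ** 2          # sum ||B_t||_F < inf

Xmat = np.ones((p, n))
for t in range(20000):
    Xmat = Xmat @ (np.eye(n) - a(t) * P) + B(t)

# Components in V = range(P) decay; components in the complement persist.
decay_norm = np.linalg.norm(Xmat @ P)
persist_norm = np.linalg.norm(Xmat @ (np.eye(n) - P))
```

This is precisely the dichotomy that Lemmas A.2 and A.3 formalize: convergence holds only after projecting by $Q$.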

A.3 LEMMAS ON CONVERGENCE FOR MATRICES WITH ONE AND TWO-SIDED DECAY

We state here several results that underpin the proofs of our main results. We begin by giving in Lemmas A.2 and A.3 two slight variations of the same simple argument showing that matrices with one- or two-sided decay converge to zero.

Lemma A.2. If a sequence $\{X_t\}$ has one-sided decay of type $(\{A_t\},\{B_t\})$ with
$$\sum_{t=0}^{\infty}\|B_t\|_F < \infty, \tag{A.5}$$
then $\lim_{t\to\infty}X_tQ = 0$.

Proof. For any $\epsilon > 0$, choose $T_1$ so that $\sum_{t=T_1}^{\infty}\|B_t\|_F < \frac{\epsilon}{2}$ and $T_2$ so that for $t > T_2$ we have
$$\left\|\prod_{s=T_1}^{t}(\mathrm{Id}-\mathbb{E}[A_s])\,Q\right\|_2 < \frac{\epsilon}{2}\left(\|X_0\|_F + \sum_{s=0}^{T_1-1}\|B_s\|_F\right)^{-1}.$$
By (A.3), we find that
$$X_{t+1} = X_0\prod_{s=0}^{t}(\mathrm{Id}-\mathbb{E}[A_s]) + \sum_{s=0}^{t}B_s\prod_{r=s+1}^{t}(\mathrm{Id}-\mathbb{E}[A_r]),$$
which implies for $t > T_2$ that
$$\|X_{t+1}Q\|_F \le \|X_0\|_F\left\|\prod_{s=0}^{t}(\mathrm{Id}-\mathbb{E}[A_s])Q\right\|_2 + \sum_{s=0}^{t}\|B_s\|_F\left\|\prod_{r=s+1}^{t}(\mathrm{Id}-\mathbb{E}[A_r])Q\right\|_2. \tag{A.6}$$
Our assumption that $\|A_t\| \le 2$ almost surely implies that for any $T \le t$,
$$\left\|\prod_{s=0}^{t}(\mathrm{Id}-\mathbb{E}[A_s])Q\right\|_2 \le \left\|\prod_{s=0}^{T}(\mathrm{Id}-\mathbb{E}[A_s])Q\right\|_2,$$
since each term in the product has operator norm at most one. Thus, we find
$$\|X_{t+1}Q\|_F \le \left(\|X_0\|_F + \sum_{s=0}^{T_1-1}\|B_s\|_F\right)\left\|\prod_{s=T_1}^{t}(\mathrm{Id}-\mathbb{E}[A_s])Q\right\|_2 + \sum_{s=T_1}^{t}\|B_s\|_F < \epsilon.$$
Taking $t\to\infty$ and then $\epsilon\to 0$ implies that $\lim_{t\to\infty}X_tQ = 0$, as desired.

Lemma A.3. If a sequence $\{Z_t\}$ has two-sided decay of type $(\{A_t\},\{C_t\})$ with
$$\lim_{T\to\infty}\mathbb{E}\left[\left\|\prod_{s=t}^{T}(\mathrm{Id}-A_s)\,Q\right\|_2^2\right] = 0 \quad\text{for all } t \ge 0 \tag{A.7}$$
and
$$\sum_{t=0}^{\infty}\mathrm{Tr}(C_t) < \infty, \tag{A.8}$$
then $\lim_{t\to\infty}Q^TZ_tQ = 0$.

Proof. The proof is essentially identical to that of Lemma A.2. That is, for $\epsilon > 0$, choose $T_1$ so that $\sum_{t=T_1}^{\infty}\mathrm{Tr}(C_t) < \frac{\epsilon}{2}$ and choose $T_2$ by (A.7) so that for $t > T_2$ we have
$$\mathbb{E}\left[\left\|\prod_{s=T_1}^{t}(\mathrm{Id}-A_s)\,Q\right\|_2^2\right] < \frac{\epsilon}{2}\left(\mathrm{Tr}(Z_0) + \sum_{s=0}^{T_1-1}\mathrm{Tr}(C_s)\right)^{-1}.$$
Conjugating (A.4) by $Q$, we have that
$$Q^TZ_{t+1}Q = \mathbb{E}\left[Q^T\prod_{s=0}^{t}(\mathrm{Id}-A_s)^T\,Z_0\prod_{s=0}^{t}(\mathrm{Id}-A_s)\,Q\right] + \sum_{s=0}^{t}\mathbb{E}\left[Q^T\prod_{r=s+1}^{t}(\mathrm{Id}-A_r)^T\,C_s\prod_{r=s+1}^{t}(\mathrm{Id}-A_r)\,Q\right].$$
Our assumption that $\|A_t\| \le 2$ almost surely implies that for any $T \le t$,
$$\left\|\prod_{s=0}^{t}(\mathrm{Id}-A_s)\,Q\right\|_2 \le \left\|\prod_{s=0}^{T}(\mathrm{Id}-A_s)\,Q\right\|_2.$$
For $t > T_2$, taking the trace of both sides, this implies that
$$\mathrm{Tr}(Q^TZ_{t+1}Q) \le \mathrm{Tr}(Z_0)\,\mathbb{E}\left[\left\|\prod_{s=0}^{t}(\mathrm{Id}-A_s)Q\right\|_2^2\right] + \sum_{s=0}^{t}\mathrm{Tr}(C_s)\,\mathbb{E}\left[\left\|\prod_{r=s+1}^{t}(\mathrm{Id}-A_r)Q\right\|_2^2\right] \tag{A.9}$$
$$\le \left(\mathrm{Tr}(Z_0) + \sum_{s=0}^{T_1-1}\mathrm{Tr}(C_s)\right)\mathbb{E}\left[\left\|\prod_{s=T_1}^{t}(\mathrm{Id}-A_s)Q\right\|_2^2\right] + \sum_{s=T_1}^{t}\mathrm{Tr}(C_s) < \epsilon,$$
which implies that $\lim_{t\to\infty}Q^TZ_tQ = 0$.

The preceding lemmas will be used to provide sufficient conditions for augmented gradient descent to converge, as in Theorem B.1 below. Since we are also interested in obtaining rates of convergence, we record here two quantitative refinements of the lemmas above that will be used in the proof of Theorem B.4.

Lemma A.4. Suppose $\{X_t\}$ has one-sided decay of type $(\{A_t\},\{B_t\})$. Assume also that for some $X \ge 0$ and $C > 0$, we have
$$\log\left\|\prod_{r=s}^{t}(\mathrm{Id}-\mathbb{E}[A_r])\,Q\right\|_2 < X - C\int_s^{t+1}r^{-\alpha}\,dr \qquad\text{and}\qquad \|B_t\|_F = O(t^{-\beta})$$
for some $0 < \alpha < 1 < \beta$. Then $\|X_tQ\|_F = O(t^{\alpha-\beta})$.

Proof. Denote $\gamma_{s,t} := \int_s^t r^{-\alpha}\,dr$. By (A.6), we have for some constants $C_1, C_2 > 0$ that
$$\|X_{t+1}Q\|_F < C_1e^{-C\gamma_{1,t+1}} + C_2e^{X}\sum_{s=1}^{t}(1+s)^{-\beta}e^{-C\gamma_{s+1,t+1}}. \tag{A.10}$$
The first term on the right-hand side is exponentially decaying in $t$, since $\gamma_{1,t+1}$ grows polynomially in $t$. To bound the second term, observe that the function $f(s) := C\gamma_{s+1,t+1} + \beta\log(s+1)$, for which the summands are $e^{-f(s)}$, satisfies
$$f'(s) \le 0 \iff C(s+1)^{-\alpha} - \frac{\beta}{1+s} \ge 0 \iff s \ge \left(\frac{\beta}{C}\right)^{1/(1-\alpha)} =: K.$$
Hence, the summands are monotonically increasing for $s$ greater than a fixed constant $K$ depending only on $\alpha$, $\beta$, $C$. Note that
$$\sum_{s=1}^{K}(1+s)^{-\beta}e^{-C\gamma_{s+1,t+1}} \le Ke^{-C\gamma_{K+1,t+1}} \le Ke^{-C't^{1-\alpha}}$$
for some $C'$ depending only on $\alpha$ and $K$, and hence this sum is exponentially decaying in $t$. Further, using an integral comparison, we find
$$\sum_{s=K+1}^{t}(1+s)^{-\beta}e^{-C\gamma_{s+1,t+1}} \le \int_K^t(1+s)^{-\beta}e^{-\frac{C}{1-\alpha}\left((t+1)^{1-\alpha}-(s+1)^{1-\alpha}\right)}\,ds. \tag{A.11}$$
Changing variables using $u = (1+s)^{1-\alpha}/(1-\alpha)$, the last integral takes the form
$$e^{-Cg_t}(1-\alpha)^{-\xi}\int_{g_K}^{g_t}u^{-\xi}e^{Cu}\,du, \qquad g_x := \frac{(1+x)^{1-\alpha}}{1-\alpha}, \quad \xi := \frac{\beta-\alpha}{1-\alpha}.$$
(A.12)

Integrating by parts, we have
$$\int_{g_K}^{g_t}u^{-\xi}e^{Cu}\,du = C^{-1}\left(\xi\int_{g_K}^{g_t}u^{-\xi-1}e^{Cu}\,du + \left(u^{-\xi}e^{Cu}\right)\Big|_{g_K}^{g_t}\right).$$
Further, since on the range $g_K \le u \le g_t$ the integrand is increasing, we have
$$e^{-Cg_t}\,\xi\int_{g_K}^{g_t}u^{-\xi-1}e^{Cu}\,du \le \xi g_t^{-\xi}.$$
Hence, $e^{-Cg_t}$ times the integral in (A.12) is bounded above by
$$O(g_t^{-\xi}) + e^{-Cg_t}\left(u^{-\xi}e^{Cu}\right)\Big|_{g_K}^{g_t} = O(g_t^{-\xi}).$$
Using (A.11) and substituting the previous line into (A.12) yields the estimate
$$\sum_{s=K+1}^{t}(1+s)^{-\beta}e^{-C\gamma_{s+1,t+1}} = O\left((1+t)^{\alpha-\beta}\right),$$
which completes the proof.

Lemma A.5. Suppose $\{Z_t\}$ has two-sided decay of type $(\{A_t\},\{C_t\})$. Assume also that for some $X \ge 0$ and $C > 0$, we have
$$\log\mathbb{E}\left[\left\|\prod_{r=s}^{t}(\mathrm{Id}-A_r)\,Q\right\|_2^2\right] < X - C\int_s^{t+1}r^{-\alpha}\,dr$$
as well as $\mathrm{Tr}(C_t) = O(t^{-\beta})$ for some $0 < \alpha < 1 < \beta$. Then $\mathrm{Tr}(Q^TZ_tQ) = O(t^{\alpha-\beta})$.

Proof. The argument is identical to the proof of Lemma A.4. Indeed, using (A.9) we have that
$$\mathrm{Tr}(Q^TZ_tQ) \le C_1e^{-C\gamma_{1,t+1}} + C_2e^{X}\sum_{s=1}^{t}(1+s)^{-\beta}e^{-C\gamma_{s+1,t+1}}.$$
The right-hand side of this inequality coincides with the expression on the right-hand side of (A.10), which we already bounded by $O(t^{\alpha-\beta})$ in the proof of Lemma A.4.

In what follows, we will use a concentration result for products of matrices from Huang et al. (2020). Let $Y_1,\dots,Y_n \in \mathbb{R}^{N\times N}$ be independent random matrices. Suppose that $\|\mathbb{E}[Y_i]\|_2 \le a_i$ and $\mathbb{E}\left[\|Y_i - \mathbb{E}[Y_i]\|_2^2\right] \le b_i^2a_i^2$ for some $a_1,\dots,a_n$ and $b_1,\dots,b_n$. We will use the following result, which is a specialization of (Huang et al., 2020, Theorem 5.1) to $p = q = 2$.

Theorem A.6 ((Huang et al., 2020, Theorem 5.1)). For $Z_0 \in \mathbb{R}^{N\times n}$, the product $Z_n = Y_nY_{n-1}\cdots Y_1Z_0$ satisfies
$$\mathbb{E}\left[\|Z_n\|_2^2\right] \le e^{\sum_{i=1}^{n}b_i^2}\prod_{i=1}^{n}a_i^2\cdot\|Z_0\|_2^2, \qquad \mathbb{E}\left[\|Z_n - \mathbb{E}[Z_n]\|_2^2\right] \le \left(e^{\sum_{i=1}^{n}b_i^2} - 1\right)\prod_{i=1}^{n}a_i^2\cdot\|Z_0\|_2^2.$$

Finally, we collect two simple analytic lemmas for later use.

Lemma A.7. For any random matrix $M \in \mathbb{R}^{m\times n}$, we have that $\mathbb{E}[\|M\|_2^2] \ge \|\mathbb{E}[M]\|_2^2$.

Proof. We find by Cauchy-Schwarz and the convexity of the spectral norm that
$$\mathbb{E}[\|M\|_2^2] \ge \mathbb{E}[\|M\|_2]^2 \ge \|\mathbb{E}[M]\|_2^2.$$

Lemma A.8.
For bounded $a_t \ge 0$, if we have $\sum_{t=0}^{\infty}a_t = \infty$, then for any $C > 0$ we have
$$\sum_{t=0}^{\infty}a_te^{-C\sum_{s=0}^{t}a_s} < \infty.$$
Proof. Define $b_t := \sum_{s=0}^{t}a_s$, so that
$$S := \sum_{t=0}^{\infty}a_te^{-C\sum_{s=0}^{t}a_s} = \sum_{t=0}^{\infty}(b_t - b_{t-1})e^{-Cb_t} \le \int_0^{\infty}e^{-Cx}\,dx < \infty,$$
where we use that the sum is a right Riemann sum of the decreasing function $e^{-Cx}$, hence bounded above by $\int_0^{\infty}e^{-Cx}\,dx$.

B ANALYSIS OF DATA AUGMENTATION AS STOCHASTIC OPTIMIZATION

In this section, we prove generalizations of our main theoretical results, Theorems 5.1 and 5.2, giving Monro-Robbins type conditions for convergence, together with rates of convergence, for augmented gradient descent in the linear setting.

B.1 MONRO-ROBBINS TYPE RESULTS

To state our general Monro-Robbins type convergence results, let us briefly recall the notation. We consider overparameterized linear regression with loss $L(W;\mathcal{D}) = \frac{1}{N}\|WX - Y\|_F^2$, where the dataset $\mathcal{D}$ of size $N$ consists of data matrices $X, Y$ whose $N$ columns are $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}^p$, with $n > N$. We optimize $L(W;\mathcal{D})$ by augmented gradient descent, which means that at each time $t$ we replace $\mathcal{D} = (X, Y)$ by a random dataset $\mathcal{D}_t = (X_t, Y_t)$. We then take a step
$$W_{t+1} = W_t - \eta_t\nabla_WL(W_t;\mathcal{D}_t)$$
of gradient descent on the resulting randomly augmented loss $L(W;\mathcal{D}_t)$ with learning rate $\eta_t$. Recall that we set $V$ to be the column span of $\mathbb{E}[X_tX_t^T]$ and denoted by $Q$ the orthogonal projection onto $V$. As noted in §5, on $V$ the proxy loss $\mathcal{L}_t = \mathbb{E}[L(W;\mathcal{D}_t)]$ is strictly convex and has a unique minimum, which is
$$W_t^* = \mathbb{E}[Y_tX_t^T]\left(Q\,\mathbb{E}[X_tX_t^T]\,Q\right)^{-1}.$$
The change in these proxy optima from one step of augmented GD to the next is captured by $\Xi_t^* := W_{t+1}^* - W_t^*$. With this notation, we are ready to state Theorem B.1, which gives two different sets of time-varying Monro-Robbins type conditions under which the optimization trajectory $W_t$ converges at large $t$. In Theorem B.4, we refine the analysis to additionally give rates of convergence.

Theorem B.1. Suppose that $V$ is independent of $t$, that the learning rate satisfies $\eta_t \to 0$, that the proxy optima satisfy
$$\sum_{t=0}^{\infty}\|\Xi_t^*\|_F < \infty, \tag{B.1}$$
ensuring the existence of a limit $W_\infty^* := \lim_{t\to\infty}W_t^*$, and that
$$\sum_{t=0}^{\infty}\eta_t\,\lambda_{\min,V}\left(\mathbb{E}[X_tX_t^T]\right) = \infty. \tag{B.2}$$
Then, if either
$$\sum_{t=0}^{\infty}\eta_t^2\,\mathbb{E}\left[\left\|X_tX_t^T - \mathbb{E}[X_tX_t^T]\right\|_F^2 + \left\|Y_tX_t^T - \mathbb{E}[Y_tX_t^T]\right\|_F^2\right] < \infty \tag{B.3}$$
or
$$\sum_{t=0}^{\infty}\eta_t^2\,\mathbb{E}\left[\left\|X_tX_t^T - \mathbb{E}[X_tX_t^T]\right\|_F^2 + \left\|\mathbb{E}[W_t](X_tX_t^T - \mathbb{E}[X_tX_t^T]) - (Y_tX_t^T - \mathbb{E}[Y_tX_t^T])\right\|_F^2\right] < \infty \tag{B.4}$$
holds, then for any initialization $W_0$ we have $W_tQ \xrightarrow{p} W_\infty^*$.

Remark B.2. In the general case, the column span $V$ of $\mathbb{E}[X_tX_t^T]$ may vary with $t$.
This means that some directions in $\mathbb{R}^n$ may have non-zero overlap with $\mathrm{colspan}(\mathbb{E}[X_tX_t^T])$ for only finitely many values of $t$. In this case, only finitely many steps of the optimization would move $W_t$ in such a direction, meaning that we must define a smaller space on which to prove convergence. The correct definition of this subspace turns out to be
$$V := \bigcup_{t=0}^{\infty}\ker\prod_{s=t}^{\infty}\left(\mathrm{Id} - \frac{2\eta_s}{N}\mathbb{E}[X_sX_s^T]\right) = \bigcup_{t=0}^{\infty}\left\{u \in \mathbb{R}^n \,\middle|\, \lim_{T\to\infty}\prod_{s=t}^{T}\left(\mathrm{Id} - \frac{2\eta_s}{N}\mathbb{E}[X_sX_s^T]\right)u = 0\right\}. \tag{B.5}$$
With this re-definition of $V$, and with $Q$ still denoting the orthogonal projection onto $V$, Theorem B.1 holds verbatim and with the same proof. Note that if $\eta_t \to 0$, the column span $\mathrm{colspan}(\mathbb{E}[X_tX_t^T])$ is fixed in $t$, and (B.2) holds, this definition of $V$ reduces to that given in (5.3).

Remark B.3. The condition (B.4) can be written in a more conceptual way as
$$\sum_{t=0}^{\infty}\eta_t^2\left(\mathbb{E}\left[\left\|X_tX_t^T - \mathbb{E}[X_tX_t^T]\right\|_F^2\right] + \mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left((\mathbb{E}[W_t]X_t - Y_t)X_t^T\right)\right]\right) < \infty,$$
where we recognize that $(\mathbb{E}[W_t]X_t - Y_t)X_t^T$ is precisely the stochastic gradient estimate at time $t$ for the proxy loss $\mathcal{L}_t$, evaluated at $\mathbb{E}[W_t]$, which is the location at time $t$ of vanilla GD on $\mathcal{L}_t$, since taking expectations in the GD update equation (5.1) coincides with GD for $\mathcal{L}_t$. Moreover, condition (B.4) actually implies condition (B.3) (see (B.12) below). The reason we state Theorem B.1 with both conditions, however, is that (B.4) makes explicit reference to the average $\mathbb{E}[W_t]$ of the augmented trajectory. Thus, when applying Theorem B.1 with this condition, one must separately estimate the behavior of this quantity.

Theorem B.1 gave conditions on joint learning rate and data augmentation schedules under which augmented optimization is guaranteed to converge. Our next result proves rates for this convergence. Theorem B.4.
Suppose that $\eta_t \to 0$ and that for some $0 < \alpha < 1 < \beta_1, \beta_2$ and $C_1, C_2 > 0$, we have
$$\log\mathbb{E}\left[\left\|\prod_{r=s}^{t}\left(\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right)Q\right\|_2^2\right] < C_1 - C_2\int_s^{t+1}r^{-\alpha}\,dr \tag{B.6}$$
as well as
$$\|\Xi_t^*\|_F = O(t^{-\beta_1}) \tag{B.7}$$
and
$$\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right] = O(t^{-\beta_2}). \tag{B.8}$$
Then, for any initialization $W_0$, we have for any $\epsilon > 0$ that
$$t^{\min\left\{\beta_1-1,\,\frac{\beta_2-\alpha}{2}\right\}-\epsilon}\left\|W_tQ - W_\infty^*\right\|_F \xrightarrow{p} 0.$$

Remark B.5. To reduce Theorem 5.2 to Theorem B.4, we notice that (5.9) and (5.10) mean that Theorem A.6 applies to $Y_t = \mathrm{Id} - \frac{2\eta_t}{N}X_tX_t^T$ with $a_t = 1 - \Omega(t^{-\alpha})$ and $b_t^2 = O(t^{-\gamma})$, thus implying (B.6).

The first step in proving both Theorem B.1 and Theorem B.4 is to obtain recursions for the mean and variance of the difference $W_t - W_t^*$ between the augmented optimization trajectory at time $t$ and the time-$t$ proxy optimum. We then complete the proof of Theorem B.1 in §B.3 and the proof of Theorem B.4 in §B.4.

B.2 RECURSION RELATIONS FOR PARAMETER MOMENTS

The following proposition shows that the difference between the mean augmented dynamics $\mathbb{E}[W_t]$ and the time-$t$ optimum $W_t^*$ satisfies, in the sense of Definition A.1, one-sided decay of type $(\{A_t\},\{B_t\})$ with
$$A_t = \frac{2\eta_t}{N}X_tX_t^T, \qquad B_t = -\Xi_t^*.$$
It also shows that the variance of this difference, which is non-negative definite, satisfies two-sided decay of type $(\{A_t\},\{C_t\})$ with $A_t$ as before and
$$C_t = \frac{4\eta_t^2}{N^2}\,\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right).$$
In terms of the notation of Appendix A.1, we have the following recursions.

Proposition B.6. The quantity $\mathbb{E}[W_t] - W_t^*$ satisfies
$$\mathbb{E}[W_{t+1}] - W_{t+1}^* = \left(\mathbb{E}[W_t] - W_t^*\right)\left(\mathrm{Id} - \frac{2\eta_t}{N}\mathbb{E}[X_tX_t^T]\right) - \Xi_t^*, \tag{B.9}$$
and $Z_t := \mathbb{E}[(W_t - \mathbb{E}[W_t])^T(W_t - \mathbb{E}[W_t])]$ satisfies
$$Z_{t+1} = \mathbb{E}\left[\left(\mathrm{Id} - \frac{2\eta_t}{N}X_tX_t^T\right)Z_t\left(\mathrm{Id} - \frac{2\eta_t}{N}X_tX_t^T\right)\right] + \frac{4\eta_t^2}{N^2}\,\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right). \tag{B.10}$$

Proof. Notice that $\mathbb{E}[X_tX_t^T]u = 0$ if and only if $X_t^Tu = 0$ almost surely, which implies that
$$W_t^*\,\mathbb{E}[X_tX_t^T] = \mathbb{E}[Y_tX_t^T]\,\mathbb{E}[X_tX_t^T]^+\,\mathbb{E}[X_tX_t^T] = \mathbb{E}[Y_tX_t^T].$$
Thus, the learning dynamics (5.1) yield
$$\mathbb{E}[W_{t+1}] = \mathbb{E}[W_t] - \frac{2\eta_t}{N}\left(\mathbb{E}[W_t]\,\mathbb{E}[X_tX_t^T] - \mathbb{E}[Y_tX_t^T]\right) = \mathbb{E}[W_t] - \frac{2\eta_t}{N}\left(\mathbb{E}[W_t] - W_t^*\right)\mathbb{E}[X_tX_t^T].$$
Subtracting $W_{t+1}^*$ from both sides yields (B.9). We now analyze the fluctuations. Writing $\mathrm{Sym}(A) := A + A^T$, we have
$$\mathbb{E}[W_{t+1}]^T\mathbb{E}[W_{t+1}] = \mathbb{E}[W_t]^T\mathbb{E}[W_t] + \frac{2\eta_t}{N}\mathrm{Sym}\left(\mathbb{E}[W_t]^T\mathbb{E}[Y_tX_t^T] - \mathbb{E}[W_t]^T\mathbb{E}[W_t]\mathbb{E}[X_tX_t^T]\right)$$
$$+ \frac{4\eta_t^2}{N^2}\left(\mathbb{E}[X_tX_t^T]\mathbb{E}[W_t]^T\mathbb{E}[W_t]\mathbb{E}[X_tX_t^T] + \mathbb{E}[X_tY_t^T]\mathbb{E}[Y_tX_t^T] - \mathrm{Sym}\left(\mathbb{E}[X_tX_t^T]\mathbb{E}[W_t]^T\mathbb{E}[Y_tX_t^T]\right)\right).$$
Similarly, we have that
$$\mathbb{E}[W_{t+1}^TW_{t+1}] = \mathbb{E}[W_t^TW_t] + \frac{2\eta_t}{N}\mathrm{Sym}\left(\mathbb{E}\left[W_t^TY_tX_t^T - W_t^TW_tX_tX_t^T\right]\right)$$
$$+ \frac{4\eta_t^2}{N^2}\mathbb{E}\left[X_tX_t^TW_t^TW_tX_tX_t^T - \mathrm{Sym}\left(X_tX_t^TW_t^TY_tX_t^T\right) + X_tY_t^TY_tX_t^T\right].$$
Noting that $X_t$ and $Y_t$ are independent of $W_t$ and subtracting yields the desired identity.
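The content of the mean recursion (B.9) can be illustrated numerically (a sketch under hypothetical dimensions, using the closed-form moments of a constant-level noising augmentation, for which $\Xi_t^* = 0$): iterating the recursion drives $\mathbb{E}[W_t]$ to the fixed proxy optimum.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, p = 6, 3, 2
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))
sigma2, eta = 0.2, 0.05   # constant noise level and learning rate (hypothetical)

# Closed-form moments of the noising augmentation X_t = X + sigma * G_t:
M2 = X @ X.T + sigma2 * N * np.eye(n)     # E[X_t X_t^T]
W_star = Y @ X.T @ np.linalg.inv(M2)      # proxy (ridge) optimum; Xi*_t = 0

# Exact mean dynamics: E[W_{t+1}] = E[W_t] - (2 eta / N)(E[W_t] E[X_t X_t^T] - E[Y_t X_t^T]),
# which is (B.9) rearranged.
EW = np.zeros((p, n))
for t in range(4000):
    EW = EW - (2.0 * eta / N) * (EW @ M2 - Y @ X.T)

gap = np.linalg.norm(EW - W_star)
```

Since $\mathrm{Id} - \frac{2\eta}{N}\mathbb{E}[X_tX_t^T]$ is a strict contraction here (the noise makes the proxy loss strictly convex), the gap decays geometrically.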

B.3 PROOF OF THEOREM B.1

First, by Proposition B.6, we see that $\mathbb{E}[W_t] - W_t^*$ has one-sided decay with $A_t = \frac{2\eta_t}{N}X_tX_t^T$ and $B_t = -\Xi_t^*$. Thus, by Lemma A.2 and (B.1), we find that
$$\lim_{t\to\infty}\left(\mathbb{E}[W_t]Q - W_t^*\right) = 0, \tag{B.11}$$
which gives convergence in expectation. For the second moment, by Proposition B.6, we see that $Z_t$ has two-sided decay with $A_t = \frac{2\eta_t}{N}X_tX_t^T$ and $C_t = \frac{4\eta_t^2}{N^2}\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)$. We now verify (A.7) and (A.8) in order to apply Lemma A.3. For (A.7), given any $\epsilon > 0$, notice that
$$\mathbb{E}\left[\|A_s - \mathbb{E}[A_s]\|_F^2\right] = \frac{4\eta_s^2}{N^2}\,\mathbb{E}\left[\|X_sX_s^T - \mathbb{E}[X_sX_s^T]\|_F^2\right],$$
so by either (B.3) or (B.4) we may choose $T_1 > t$ so that $\sum_{s=T_1}^{\infty}\mathbb{E}[\|A_s - \mathbb{E}[A_s]\|_F^2] < \frac{\epsilon}{2}$. Now choose $T_2 > T_1$ so that for $T > T_2$ we have
$$\left\|\prod_{r=T_1}^{T}\mathbb{E}[\mathrm{Id}-A_r]\,Q\right\|_2^2 < \frac{\epsilon}{2}\left(\prod_{s=t}^{T_1-1}\|\mathbb{E}[\mathrm{Id}-A_s]\|_F^2 + \sum_{s=t}^{T_1-1}\mathbb{E}[\|A_s - \mathbb{E}[A_s]\|_F^2]\right)^{-1}.$$
For $T > T_2$, we then have
$$\mathbb{E}\left[\left\|\prod_{s=t}^{T}(\mathrm{Id}-A_s)Q\right\|_2^2\right] \le \left\|\prod_{s=t}^{T}\mathbb{E}[\mathrm{Id}-A_s]\,Q\right\|_2^2 + \sum_{s=t}^{T}\mathbb{E}\left[\left\|\prod_{r=t}^{s}(\mathrm{Id}-A_r)\prod_{r=s+1}^{T}(\mathrm{Id}-\mathbb{E}[A_r])Q\right\|_F^2 - \left\|\prod_{r=t}^{s-1}(\mathrm{Id}-A_r)\prod_{r=s}^{T}(\mathrm{Id}-\mathbb{E}[A_r])Q\right\|_F^2\right]$$
$$= \left\|\prod_{s=t}^{T}\mathbb{E}[\mathrm{Id}-A_s]\,Q\right\|_F^2 + \sum_{s=t}^{T}\mathbb{E}\left[\left\|\prod_{r=t}^{s-1}(\mathrm{Id}-A_r)\,(A_s-\mathbb{E}[A_s])\prod_{r=s+1}^{T}(\mathrm{Id}-\mathbb{E}[A_r])Q\right\|_F^2\right]$$
$$\le \prod_{s=t}^{T_1-1}\|\mathbb{E}[\mathrm{Id}-A_s]\|_F^2\,\left\|\prod_{r=T_1}^{T}\mathbb{E}[\mathrm{Id}-A_r]\,Q\right\|_2^2 + \sum_{s=t}^{T}\mathbb{E}\left[\|A_s-\mathbb{E}[A_s]\|_F^2\right]\left\|\prod_{r=s+1}^{T}\mathbb{E}[\mathrm{Id}-A_r]\,Q\right\|_2^2$$
$$\le \left(\prod_{s=t}^{T_1-1}\|\mathbb{E}[\mathrm{Id}-A_s]\|_F^2 + \sum_{s=t}^{T_1-1}\mathbb{E}[\|A_s-\mathbb{E}[A_s]\|_F^2]\right)\left\|\prod_{r=T_1}^{T}\mathbb{E}[\mathrm{Id}-A_r]\,Q\right\|_2^2 + \sum_{s=T_1}^{T}\mathbb{E}[\|A_s-\mathbb{E}[A_s]\|_F^2] < \epsilon,$$
which implies (A.7). Condition (A.8) follows from either (B.3) or (B.4) and the bounds
$$\mathrm{Tr}(C_t) \le \frac{8\eta_t^2}{N^2}\,\mathbb{E}\left[\left\|\mathbb{E}[W_t](X_tX_t^T - \mathbb{E}[X_tX_t^T])\right\|_F^2 + \left\|Y_tX_t^T - \mathbb{E}[Y_tX_t^T]\right\|_F^2\right] \tag{B.12}$$
$$\le \frac{8\eta_t^2}{N^2}\,\mathbb{E}\left[\|\mathbb{E}[W_t]\|_2^2\,\left\|X_tX_t^T - \mathbb{E}[X_tX_t^T]\right\|_F^2 + \left\|Y_tX_t^T - \mathbb{E}[Y_tX_t^T]\right\|_F^2\right],$$
where in the first inequality we use the fact that $\|M_1 - M_2\|_F^2 \le 2(\|M_1\|_F^2 + \|M_2\|_F^2)$. Furthermore, iterating (B.9) yields
$$\|\mathbb{E}[W_t] - W_t^*\|_F \le \|W_0 - W_0^*\|_F + \sum_{t=0}^{\infty}\|\Xi_t^*\|_F,$$
which combined with (B.12) and either (B.3) or (B.4) therefore implies (A.8). We conclude by Lemma A.3 that
$$\lim_{t\to\infty}Q^TZ_tQ = \lim_{t\to\infty}\mathbb{E}\left[Q^T(W_t - \mathbb{E}[W_t])^T(W_t - \mathbb{E}[W_t])Q\right] = 0. \tag{B.13}$$
Combined with (B.11) and Chebyshev's inequality, this yields $W_tQ \xrightarrow{p} W_\infty^*$, completing the proof.
By Lemma A.7, we have
$$\log\left\|\prod_{r=s}^{t}\left(\mathrm{Id} - \frac{2\eta_r}{N}\mathbb{E}[X_rX_r^T]\right)Q\right\|_2 \le \frac{1}{2}\log\mathbb{E}\left[\left\|\prod_{r=s}^{t}\left(\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right)Q\right\|_2^2\right] < \frac{C_1}{2} - \frac{C_2}{2}\int_s^{t+1}r^{-\alpha}\,dr.$$
Applying Lemma A.4 using this bound and (B.7), we find that $\|\mathbb{E}[W_t]Q - W_t^*\|_F = O(t^{\alpha-\beta_1})$. Moreover, because $\|\Xi_t^*\|_F = O(t^{-\beta_1})$, we also find that $\|W_t^* - W_\infty^*\|_F = O(t^{-\beta_1+1})$, and hence $\|\mathbb{E}[W_t]Q - W_\infty^*\|_F = O(t^{-\beta_1+1})$. Further, by Proposition B.6, $\mathbb{E}[(W_t - \mathbb{E}[W_t])^T(W_t - \mathbb{E}[W_t])]$ has two-sided decay with
$$A_t = \frac{2\eta_t}{N}X_tX_t^T, \qquad C_t = \frac{4\eta_t^2}{N^2}\,\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right).$$
We also find that
$$\|\Xi_t^*\|_F = |\sigma_t^2 - \sigma_{t+1}^2|\,N\left\|YX^T\left(XX^T + \sigma_t^2N\cdot\mathrm{Id}_{n\times n}\right)^{-1}\left(XX^T + \sigma_{t+1}^2N\cdot\mathrm{Id}_{n\times n}\right)^{-1}\right\|_F \le |\sigma_t^2 - \sigma_{t+1}^2|\,N\left\|YX^T[(XX^T)^+]^2\right\|_F.$$
Thus, because $\sigma_t^2$ is decreasing, we see that the hypothesis (5.5) of Theorem 5.1 indeed holds. Further, we note that
$$\sum_{t=0}^{\infty}\eta_t^2\,\mathbb{E}\left[\|X_tX_t^T - \mathbb{E}[X_tX_t^T]\|_F^2 + \|Y_tX_t^T - \mathbb{E}[Y_tX_t^T]\|_F^2\right] = \sum_{t=0}^{\infty}\eta_t^2\sigma_t^2\left(2(n+1)\|X\|_F^2 + N\|Y\|_F^2 + \sigma_t^2Nn(n+1)\right) = O\left(\sum_{t=0}^{\infty}\eta_t^2\sigma_t^2\right),$$
which by (C.1) implies (B.3). Theorem 5.1 and the fact that $\lim_{t\to\infty}W_t^* = W_{\min}$ therefore yield that $W_t \xrightarrow{p} W_{\min}$. For the rate of convergence, we aim to show that if $\eta_t = \Theta(t^{-x})$ and $\sigma_t^2 = \Theta(t^{-y})$ with $x, y > 0$, $x + y < 1$, and $2x + y > 1$, then for any $\epsilon > 0$ we have that
$$t^{\min\{y,\,\frac{1}{2}x\}-\epsilon}\,\|W_t - W_{\min}\|_F \xrightarrow{p} 0.$$
We now check the hypotheses of and apply Theorem B.4. For (B.6), notice that $Y_r = \mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T$ satisfies the hypotheses of Theorem A.6 with
$$a_r = 1 - 2\eta_r\sigma_r^2 \qquad\text{and}\qquad b_r^2 = \frac{\eta_r^2\sigma_r^2}{a_r^2}\left(2(n+1)\|X\|_F^2 + \sigma_r^2Nn(n+1)\right).$$
Thus, by Theorem A.6 and the fact that $\eta_t = \Theta(t^{-x})$ and $\sigma_t^2 = \Theta(t^{-y})$, we find for some $C_1, C_2 > 0$ that
$$\log\mathbb{E}\left[\left\|\prod_{r=s}^{t}\left(\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right)\right\|_2^2\right] \le \sum_{r=s}^{t}b_r^2 + 2\sum_{r=s}^{t}\log(1 - 2\eta_r\sigma_r^2) \le C_1 - C_2\int_s^{t+1}r^{-x-y}\,dr.$$
For (B.7), we find that
$$\|\Xi_t^*\|_F \le |\sigma_t^2 - \sigma_{t+1}^2|\,N\,\|YX^T[(XX^T)^+]^2\|_F = O(t^{-y-1}).$$
Finally, for (B.8), we find that
$$\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right] = O(t^{-2x-y}).$$
Noting finally that $\|W_t^* - W_{\min}\|_F = O(\sigma_t^2) = O(t^{-y})$, we apply Theorem B.4 with $\alpha = x + y$, $\beta_1 = y + 1$, and $\beta_2 = 2x + y$ to obtain the desired estimates. This concludes the proof of Theorem 4.1.

D ANALYSIS OF SGD

This section gives the full analysis of the results for stochastic gradient descent with and without additive synthetic noise presented in Sections 6.1 and 6.2. Let us briefly recall the notation. As before, we consider overparameterized linear regression with loss $L(W;\mathcal{D}) = \frac{1}{N}\|WX - Y\|_F^2$, where the dataset $\mathcal{D}$ of size $N$ consists of data matrices $X, Y$ whose $N$ columns are $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}^p$, with $n > N$. In the setting of minibatch SGD with batch size $B_t$, we take $X_t = c_tXA_t$ and $Y_t = c_tYA_t$ as in (D.1), where $A_t \in \mathbb{R}^{N\times B_t}$ has i.i.d. columns $A_{t,i}$ with a single non-zero entry, equal to $1$, chosen uniformly at random. In this setting the minimum norm optima for each $t$ are the same and given by
$$W_t^* = W_\infty^* = YX^T(XX^T)^+,$$
which coincides with the minimum norm optimum for the unaugmented loss. In the setting of SGD with additive noise at level $\sigma_t$, we take instead $X_t = c_t(XA_t + \sigma_tG_t)$ and $Y_t = c_tYA_t$, where $c_t$ and $A_t$ are as before and $G_t \in \mathbb{R}^{n\times B_t}$ has i.i.d. Gaussian entries. In this setting, the proxy loss is
$$\mathcal{L}_t(W) := \frac{1}{N}\,\mathbb{E}\left\|c_tYA_t - c_tWXA_t - c_t\sigma_tWG_t\right\|_F^2 = \frac{1}{N}\|Y - WX\|_F^2 + \sigma_t^2\|W\|_F^2,$$
which has ridge minimizer $W_t^* = YX^T(XX^T + \sigma_t^2N\cdot\mathrm{Id}_{n\times n})^{-1}$. We begin in §D.1 by treating the case of noiseless SGD. We then carry out the analysis in the presence of noise in §D.2.
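The normalization $c_t = (N/B_t)^{1/2}$ is chosen exactly so that the minibatch proxy loss matches the full-batch loss in expectation; this can be checked by exact enumeration over all minibatches (a sketch with hypothetical small dimensions):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n, N, p, B = 5, 3, 2, 2
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))
W = rng.standard_normal((p, n))
c = np.sqrt(N / B)

# Enumerate all N^B equally likely selection matrices A (each column is a
# standard basis vector of R^N), i.e. all minibatches drawn with replacement.
loss_vals = []
for cols in product(range(N), repeat=B):
    A = np.zeros((N, B))
    for j, i in enumerate(cols):
        A[i, j] = 1.0
    Xt, Yt = c * X @ A, c * Y @ A
    loss_vals.append(np.linalg.norm(W @ Xt - Yt, "fro") ** 2 / N)

proxy_loss = np.mean(loss_vals)                        # E[L(W; D_t)]
full_loss = np.linalg.norm(W @ X - Y, "fro") ** 2 / N  # L(W; D)
```

The exact agreement reflects the fact that $\mathbb{E}[A_tA_t^T] = \frac{B_t}{N}\mathrm{Id}$, so the proxy loss of noiseless SGD is the unaugmented loss itself.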

D.1 PROOF OF THEOREM 6.1

In order to apply Theorems B.1 and B.4, we begin by computing the moments of $A_t$ as follows. Recall the notation $\mathrm{diag}(M)$ from Appendix A.1.

Lemma D.1. For any $Z \in \mathbb{R}^{N\times N}$, we have that
$$\mathbb{E}[A_tA_t^T] = \frac{B_t}{N}\,\mathrm{Id}_{N\times N} \qquad\text{and}\qquad \mathbb{E}[A_tA_t^TZA_tA_t^T] = \frac{B_t}{N}\,\mathrm{diag}(Z) + \frac{B_t(B_t-1)}{N^2}\,Z.$$

Proof. We have that
$$\mathbb{E}[A_tA_t^T] = \sum_{i=1}^{B_t}\mathbb{E}[A_{t,i}A_{t,i}^T] = \frac{B_t}{N}\,\mathrm{Id}_{N\times N}.$$
Similarly, we find that
$$\mathbb{E}[A_tA_t^TZA_tA_t^T] = \sum_{i,j=1}^{B_t}\mathbb{E}[A_{t,i}A_{t,i}^TZA_{t,j}A_{t,j}^T] = \sum_{i=1}^{B_t}\mathbb{E}[A_{t,i}A_{t,i}^TZA_{t,i}A_{t,i}^T] + 2\sum_{1\le i<j\le B_t}\mathbb{E}[A_{t,i}A_{t,i}^TZA_{t,j}A_{t,j}^T] = \frac{B_t}{N}\,\mathrm{diag}(Z) + \frac{B_t(B_t-1)}{N^2}\,Z,$$
which completes the proof.

Let us first check convergence in mean: $\mathbb{E}[W_t]Q \to W_\infty^*$. To see this, note that Lemma D.1 implies
$$\mathbb{E}[Y_tX_t^T] = YX^T, \qquad \mathbb{E}[X_tX_t^T] = XX^T,$$
which yields that
$$W_t^* = YX^T[XX^T]^+ = W_\infty^* \tag{D.2}$$
for all $t$. We now prove convergence. Since all $W_t^*$ are equal to $W_\infty^*$, we find that $\Xi_t^* = 0$. By (B.9) and Lemma D.1 we have
$$\mathbb{E}[W_{t+1}] - W_\infty^* = \left(\mathbb{E}[W_t] - W_\infty^*\right)\left(\mathrm{Id} - \frac{2\eta_t}{N}XX^T\right),$$
which implies, since $\frac{2\eta_t}{N} < \lambda_{\max}(XX^T)^{-1}$ for large $t$, that for some $C > 0$ we have
$$\|\mathbb{E}[W_t]Q - W_\infty^*\|_F \le \|W_0Q - W_\infty^*\|_F\prod_{s=0}^{t-1}\left\|Q - \frac{2\eta_s}{N}XX^T\right\|_2 \le C\,\|W_0Q - W_\infty^*\|_F\exp\left(-\sum_{s=0}^{t-1}\frac{2\eta_s}{N}\lambda_{\min,V}(XX^T)\right). \tag{D.3}$$
From this we readily conclude, using (6.1), the desired convergence in mean $\mathbb{E}[W_t]Q \to W_\infty^*$. Let us now prove that the variance tends to zero. By Proposition B.6, we find that $Z_t = \mathbb{E}[(W_t - \mathbb{E}[W_t])^T(W_t - \mathbb{E}[W_t])]$ has two-sided decay of type $(\{A_t\},\{C_t\})$ with
$$A_t = \frac{2\eta_t}{N}X_tX_t^T, \qquad C_t = \frac{4\eta_t^2}{N^2}\,\mathrm{Id}\bullet\mathrm{Var}\left((\mathbb{E}[W_t]X_t - Y_t)X_t^T\right).$$
To understand the resulting rate of convergence, let us first obtain a bound on $\mathrm{Tr}(C_t)$. To do this, note that for any random matrix $A$, we have
$$\mathrm{Tr}\left(\mathrm{Id}\bullet\mathrm{Var}[A]\right) = \mathrm{Tr}\left(\mathbb{E}\left[A^TA\right] - \mathbb{E}[A]^T\mathbb{E}[A]\right).$$
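The moment formulas of Lemma D.1 can be verified by exact enumeration over all selection matrices (a sketch; the values of $N$, $B$, and the test matrix $Z$ below are hypothetical):

```python
import numpy as np
from itertools import product

N, B = 4, 2
rng = np.random.default_rng(3)
Z = rng.standard_normal((N, N))

# Average A A^T and A A^T Z A A^T over all N^B equally likely selection
# matrices A (columns are standard basis vectors chosen with replacement).
M1 = np.zeros((N, N))
M2 = np.zeros((N, N))
for cols in product(range(N), repeat=B):
    A = np.zeros((N, B))
    for j, i in enumerate(cols):
        A[i, j] = 1.0
    M1 += A @ A.T
    M2 += A @ A.T @ Z @ A @ A.T
M1 /= N ** B
M2 /= N ** B

pred1 = (B / N) * np.eye(N)
pred2 = (B / N) * np.diag(np.diag(Z)) + (B * (B - 1) / N ** 2) * Z
err1 = np.linalg.norm(M1 - pred1)
err2 = np.linalg.norm(M2 - pred2)
```

Both empirical averages match the closed forms to machine precision, since the enumeration computes the expectations exactly.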
Moreover, using the definition (D.1) of the matrix $A_t$ and writing $M_t := \mathbb{E}[W_t]X - Y$, we find
$$\left((\mathbb{E}[W_t]X_t - Y_t)X_t^T\right)^T(\mathbb{E}[W_t]X_t - Y_t)X_t^T = c_t^4\,XA_tA_t^TM_t^TM_tA_tA_t^TX^T$$
as well as
$$\mathbb{E}\left[(\mathbb{E}[W_t]X_t - Y_t)X_t^T\right]^T\mathbb{E}\left[(\mathbb{E}[W_t]X_t - Y_t)X_t^T\right] = c_t^4\,X\,\mathbb{E}[A_tA_t^T]\,M_t^TM_t\,\mathbb{E}[A_tA_t^T]\,X^T.$$
Hence, using the expressions from Lemma D.1 for the moments of $A_t$ and recalling the scaling factor $c_t = (N/B_t)^{1/2}$, we find
$$\mathrm{Tr}(C_t) = \frac{4\eta_t^2}{B_tN}\,\mathrm{Tr}\left(X\left(\mathrm{diag}(M_t^TM_t) - \frac{1}{N}M_t^TM_t\right)X^T\right).$$
Next, writing $\Delta_t := \mathbb{E}[W_t] - W_\infty^*$ and recalling (D.2), we see that $M_t = \Delta_tX$. Thus, applying the estimate (D.3) on the exponential convergence of the mean, we obtain
$$\mathrm{Tr}(C_t) \le \frac{8\eta_t^2}{B_tN}\,\|\Delta_tQ\|_2^2\,\|XX^T\|_2^2 \le C\,\frac{8\eta_t^2}{B_tN}\,\|XX^T\|_2^2\,\|\Delta_0Q\|_F^2\exp\left(-\sum_{s=0}^{t-1}\frac{4\eta_s}{N}\lambda_{\min,V}(XX^T)\right). \tag{D.4}$$
Notice now that $Y_r = Q - A_r$ satisfies the conditions of Theorem A.6 with
$$a_r = 1 - \frac{2\eta_r}{N}\lambda_{\min,V}(XX^T) \qquad\text{and}\qquad b_r^2 = \frac{4\eta_r^2}{B_ra_r^2N}\,\mathrm{Tr}\left(X\,\mathrm{diag}(X^TX)X^T - \frac{1}{N}XX^TXX^T\right).$$
By Theorem A.6 we then obtain for any $t > s > 0$ that
$$\mathbb{E}\left[\left\|\prod_{r=s+1}^{t}(Q - A_r)\right\|_2^2\right] \le e^{\sum_{r=s+1}^{t}b_r^2}\prod_{r=s+1}^{t}\left(1 - \frac{2\eta_r}{N}\lambda_{\min,V}(XX^T)\right)^2. \tag{D.5}$$
By the two-sided decay of $Z_t$, we find by (D.4), (D.5), and (A.9) that
$$\mathbb{E}\left[\|W_tQ - \mathbb{E}[W_t]Q\|_F^2\right] = \mathrm{Tr}(QZ_tQ) \le e^{-\frac{4}{N}\lambda_{\min,V}(XX^T)\sum_{s=0}^{t-1}\eta_s}\,\frac{\|XX^T\|_2^2}{N^2}\,\|\Delta_0Q\|_F^2\,C\sum_{s=0}^{t-1}\frac{8\eta_s^2N}{B_s}\,e^{\frac{4\eta_s}{N}\lambda_{\min,V}(XX^T) + \sum_{r=s+1}^{t}b_r^2}. \tag{D.6}$$
Since $\eta_s \to 0$, we find that $\frac{\eta_sN}{B_s}e^{\frac{4\eta_s}{N}\lambda_{\min,V}(XX^T)}$ is uniformly bounded and that $b_r^2 \le \frac{4\eta_r}{N}\lambda_{\min,V}(XX^T)$ for sufficiently large $r$. We therefore find that for some $C > 0$,
$$\mathbb{E}\left[\|W_tQ - \mathbb{E}[W_t]Q\|_F^2\right] \le C\sum_{s=0}^{t-1}\eta_s\,e^{-\frac{4}{N}\lambda_{\min,V}(XX^T)\sum_{r=0}^{s}\eta_r},$$
hence $\lim_{t\to\infty}\mathbb{E}[\|W_tQ - \mathbb{E}[W_t]Q\|_F^2] = 0$ by Lemma A.8. Combined with the fact that $\mathbb{E}[W_t]Q \to W_\infty^*$, this implies that $W_tQ \xrightarrow{p} W_\infty^*$. To obtain a rate of convergence, observe that by (D.3) and the fact that $\eta_t = \Theta(t^{-x})$, for some $C_1, C_2 > 0$ we have
$$\|\mathbb{E}[W_t]Q - W_\infty^*\|_F \le C_1\exp\left(-C_2t^{1-x}\right).$$
(D.7)

Similarly, by (D.6) and the fact that $\sup_s\frac{\eta_sN}{B_s} < \infty$, for some $C_3, C_4 > 0$ we have
$$\mathbb{E}\left[\|W_tQ - \mathbb{E}[W_t]Q\|_F^2\right] \le C_3\,t^{1-x}\exp\left(-C_4t^{1-x}\right).$$
We conclude by Chebyshev's inequality that for any $a > 0$ we have
$$\mathbb{P}\left(\|W_tQ - W_\infty^*\|_F \ge C_1\exp\left(-C_2t^{1-x}\right) + a\,C_3^{1/2}\,t^{\frac{1-x}{2}}e^{-C_4t^{1-x}/2}\right) \le a^{-2}.$$
Taking $a = t$, we conclude as desired that for some $C > 0$, we have $e^{Ct^{1-x}}\|W_tQ - W_\infty^*\|_F \xrightarrow{p} 0$. This completes the proof of Theorem 6.1.

D.2 PROOF OF THEOREM 6.2

We now complete our analysis of SGD with Gaussian noise. We will directly check that the optimization trajectory $W_t$ converges at large $t$ to the minimal norm interpolant $W_\infty^*$ with the rates claimed in Theorem 6.2, which we deduce from Theorem B.4. To check the hypotheses of that theorem, we will need expressions for the moments of the augmented data, which we record in the following lemma.

Lemma D.2. We have
$$\mathbb{E}[Y_tX_t^T] = YX^T \qquad\text{and}\qquad \mathbb{E}[X_tX_t^T] = XX^T + \sigma_t^2N\,\mathrm{Id}_{n\times n}. \tag{D.8}$$
Moreover,
$$\mathbb{E}[Y_tX_t^TX_tY_t^T] = c_t^4\,\mathbb{E}\left[YA_tA_t^TX^TXA_tA_t^TY^T + \sigma_t^2YA_tG_t^TG_tA_t^TY^T\right] = \frac{N}{B_t}Y\,\mathrm{diag}(X^TX)\,Y^T + \frac{B_t-1}{B_t}YX^TXY^T + \sigma_t^2\frac{nN}{B_t}YY^T,$$
$$\mathbb{E}[Y_tX_t^TX_tX_t^T] = c_t^4\,\mathbb{E}\left[YA_tA_t^TX^TXA_tA_t^TX^T + \sigma_t^2YA_tG_t^TG_tA_t^TX^T + \sigma_t^2YA_tG_t^TXA_tG_t^T + \sigma_t^2YA_tA_t^TX^TG_tG_t^T\right]$$
$$= \frac{N}{B_t}Y\,\mathrm{diag}(X^TX)\,X^T + \frac{B_t-1}{B_t}YX^TXX^T + \sigma_t^2\left(N + \frac{(n+1)N}{B_t}\right)YX^T,$$
and
$$\mathbb{E}[X_tX_t^TX_tX_t^T] = c_t^4\,\mathbb{E}\big[XA_tA_t^TX^TXA_tA_t^TX^T + \sigma_t^2G_tG_t^TXA_tA_t^TX^T + \sigma_t^2XA_tG_t^TG_tA_t^TX^T + \sigma_t^2XA_tA_t^TX^TG_tG_t^T$$
$$+ \sigma_t^2G_tA_t^TX^TG_tA_t^TX^T + \sigma_t^2XA_tG_t^TXA_tG_t^T + \sigma_t^2G_tA_t^TX^TXA_tG_t^T + \sigma_t^4G_tG_t^TG_tG_t^T\big] = \frac{N}{B_t}X\,\mathrm{diag}(X^TX)\,X^T + \frac{B_t-1}{B_t}XX^TXX^T + \cdots$$

Proof. All of these formulas are obtained by direct, if slightly tedious, computation.

With these expressions in hand, we can readily check the conditions of Theorem B.4.
First, we find using the Sherman-Morrison-Woodbury matrix inversion formula that
$$\|\Xi_t^*\|_F = |\sigma_t^2 - \sigma_{t+1}^2|\,N\left\|YX^T\left(XX^T + \sigma_t^2N\cdot\mathrm{Id}_{n\times n}\right)^{-1}\left(XX^T + \sigma_{t+1}^2N\cdot\mathrm{Id}_{n\times n}\right)^{-1}\right\|_F \le N\,|\sigma_t^2 - \sigma_{t+1}^2|\,\left\|YX^T[(XX^T)^+]^2\right\|_F. \tag{D.9}$$
Hence, assuming that $\sigma_t^2 = \Theta(t^{-y})$, we see that condition (B.7) of Theorem B.4 holds with $\beta_1 = y + 1$. Next, let us verify that condition (B.6) holds for an appropriate $\alpha$. For this, we need to bound
$$\log\mathbb{E}\left[\left\|\prod_{r=s}^{t}\left(\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right)\right\|_2^2\right],$$
which we will do using Theorem A.6. In order to apply this result, we find by direct inspection of the formula $\mathbb{E}[X_rX_r^T] = XX^T + \sigma_r^2N\,\mathrm{Id}_{n\times n}$ that
$$\left\|\mathbb{E}\left[\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right]\right\|_2 = 1 - 2\eta_r\sigma_r^2 =: a_r.$$
Moreover, we have
$$\mathbb{E}\left[\left\|\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T - \mathbb{E}\left[\mathrm{Id} - \frac{2\eta_r}{N}X_rX_r^T\right]\right\|_2^2\right] = \frac{4\eta_r^2}{N^2}\,\mathbb{E}\left[\left\|X_rX_r^T - \mathbb{E}[X_rX_r^T]\right\|_2^2\right].$$
Using the exact expressions for the resulting moments from Lemma D.2, we find
$$\frac{4\eta_r^2}{N^2}\,\mathbb{E}\left[\left\|X_rX_r^T - \mathbb{E}[X_rX_r^T]\right\|_2^2\right] \le \frac{4\eta_r^2}{N^2}\left(\frac{1}{B_r}\mathrm{Tr}\left(X(N\,\mathrm{diag}(X^TX) - X^TX)X^T\right) + 2\sigma_r^2\cdots\right).$$
Recall that, in the notation of Theorem 6.2, we have $\eta_r = \Theta(r^{-x})$ and $\sigma_r^2 = \Theta(r^{-y})$. Hence, since under our hypotheses we have $x < 2y$, we conclude that condition (B.6) holds with $\alpha = x + y$. Moreover, exactly as in Proposition B.6, we have
$$\Delta_{t+1} = \Delta_t\left(\mathrm{Id} - \frac{2\eta_t}{N}\mathbb{E}[X_tX_t^T]\right) - \Xi_t^*, \qquad \Delta_t := \mathbb{E}[W_t - W_t^*].$$
Since $\|\Xi_t^*\|_F = O(t^{-y-1})$ and we already saw that $\left\|\mathrm{Id} - \frac{2\eta_t}{N}\mathbb{E}[X_tX_t^T]\right\|_2 = 1 - 2\eta_t\sigma_t^2$, we may use the one-sided decay estimate of Lemma A.4 to conclude that $\|\Delta_t\|_F = O(t^{x-1})$. Finally, it remains to bound $\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right]$. A direct computation using Lemma D.2 shows
$$\mathbb{E}\left[\left\|Y_tX_t^T - \mathbb{E}[Y_tX_t^T]\right\|_F^2\right] = \frac{1}{B_t}\mathrm{Tr}\left(Y(N\,\mathrm{diag}(X^TX) - X^TX)Y^T\right) + \sigma_t^2\frac{nN}{B_t}\,\mathrm{Tr}(YY^T).$$
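The bound (D.9) on $\|\Xi_t^*\|_F$ holds deterministically and can be spot-checked numerically (a sketch with hypothetical data and a hypothetical decreasing schedule for $\sigma_t^2$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, p = 6, 3, 2
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))

def W_star(s2):
    # Ridge proxy optimum W*_t = Y X^T (X X^T + s2 N Id)^{-1}
    return Y @ X.T @ np.linalg.inv(X @ X.T + s2 * N * np.eye(n))

# Right-hand constant ||Y X^T ((X X^T)^+)^2||_F from (D.9).
bound_const = np.linalg.norm(
    Y @ X.T @ np.linalg.matrix_power(np.linalg.pinv(X @ X.T), 2)
)

sched = [0.5 / (1 + t) ** 0.4 for t in range(50)]   # decreasing sigma_t^2
ok = True
for s2a, s2b in zip(sched, sched[1:]):
    xi = np.linalg.norm(W_star(s2a) - W_star(s2b))
    if xi > N * abs(s2a - s2b) * bound_const + 1e-12:
        ok = False
```

The check succeeds because, in the eigenbasis of $XX^T$, each nonzero eigenvalue $\lambda$ contributes a factor $((\lambda + \sigma_t^2N)(\lambda + \sigma_{t+1}^2N))^{-1} \le \lambda^{-2}$, while zero eigendirections are annihilated by $YX^T$.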
Hence, again using Lemma D.2, we find
$$\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right] = \eta_t^2\,\mathrm{Tr}\Bigg[\frac{1}{B_t}\mathbb{E}[W_t]X\left(N\,\mathrm{diag}(X^TX) - X^TX\right)X^T\mathbb{E}[W_t]^T + 2\sigma_t^2\frac{(n+1)N}{B_t}\mathbb{E}[W_t]XX^T\mathbb{E}[W_t]^T$$
$$+ \left(\sigma_t^2\frac{N}{B_t}\mathrm{Tr}(XX^T) + \sigma_t^4N\frac{(n+1)N}{B_t}\right)\mathbb{E}[W_t]\mathbb{E}[W_t]^T\Bigg] - 2\eta_t^2\,\mathrm{Tr}\left[\frac{1}{B_t}Y\left(N\,\mathrm{diag}(X^TX) - X^TX\right)X^T\mathbb{E}[W_t]^T + \sigma_t^2\frac{(n+1)N}{B_t}YX^T\mathbb{E}[W_t]^T\right]$$
$$+ \eta_t^2\,\mathrm{Tr}\left[\frac{1}{B_t}Y\left(N\,\mathrm{diag}(X^TX) - X^TX\right)Y^T + \sigma_t^2\frac{nN}{B_t}YY^T\right].$$
To make sense of this expression, note that $W_\infty^*X = Y$. Hence, we find after some rearrangement that
$$\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right] \le$$



λ min,V (XX T ) and the ratio b(t) a(t) in (5.14) is by (D.4) bounded uniformly for a constant C > 0 by b


is chosen uniformly with replacement from $\mathcal{D}$, and the resulting data matrices $X_t$ and $Y_t$ are scaled so that $\mathcal{L}_t(W) = L(W;\mathcal{D})$. Concretely, this means that for the normalizing factor $c_t := (N/B_t)^{1/2}$ we have
$$X_t = c_tXA_t \qquad\text{and}\qquad Y_t = c_tYA_t. \tag{D.1}$$

$$\le C\eta_t^2\left(\sigma_t^2 + \|\bar{\Delta}_t\|_F^2\right).$$
Finally, we have $\|\bar{\Delta}_t\|_F \le \|\Delta_t\|_F + \|W_t^* - W_\infty^*\|_F = O(t^{x-1}) + \Theta(t^{-y}) = \Theta(t^{-y})$, since we assumed that $x + y < 1$. Therefore, we obtain
$$\eta_t^2\,\mathrm{Tr}\left[\mathrm{Id}\bullet\mathrm{Var}\left(\mathbb{E}[W_t]X_tX_t^T - Y_tX_t^T\right)\right] \le C\eta_t^2\sigma_t^2 = \Theta(t^{-2x-y}),$$
showing that condition (B.8) holds with $\beta_2 = 2x + y$. Applying Theorem B.4 completes the proof.

A ANALYTIC LEMMAS

In this section, we present several basic lemmas concerning convergence for certain matrix-valued recursions that will be needed to establish our main results. For clarity, we first collect some matrix notations used in this section and throughout the paper.

A.1 MATRIX NOTATIONS

Let $M \in \mathbb{R}^{m\times n}$ be a matrix. We denote its Frobenius norm by $\|M\|_F$ and its spectral norm by $\|M\|_2$. If $m = n$, so that $M$ is square, we denote by $\mathrm{diag}(M)$ the diagonal matrix with $\mathrm{diag}(M)_{ii} = M_{ii}$. For matrices $A$, $B$, $C$ of the appropriate shapes, define ...

Applying Lemma A.5 with (B.6) and (B.8), we find that ... By Chebyshev's inequality, for any $x > 0$ we have ... For any $\epsilon > 0$, choosing $x = t^{\delta}$ for small $0 < \delta < \epsilon$, we find as desired that ..., thus completing the proof of Theorem B.4.

C ANALYSIS OF NOISING AUGMENTATIONS

In this section, we give a full analysis of the noising augmentations presented in Section 4. Let us briefly recall the notation. As before, we consider overparameterized linear regression with loss $L(W;\mathcal{D}) = \frac{1}{N}\|WX - Y\|_F^2$, where the dataset $\mathcal{D}$ of size $N$ consists of data matrices $X, Y$ whose $N$ columns are $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}^p$, with $n > N$. We optimize $L(W;\mathcal{D})$ by augmented gradient descent with additive Gaussian noise, which means that at each time $t$ we replace $\mathcal{D} = (X, Y)$ by a random dataset $\mathcal{D}_t = (X_t, Y)$, where the columns $x_{i,t}$ of $X_t$ are obtained by adding Gaussian noise at level $\sigma_t$ to the columns of $X$. We then take a step
$$W_{t+1} = W_t - \eta_t\nabla_WL(W_t;\mathcal{D}_t)$$
of gradient descent on the resulting randomly augmented loss $L(W;\mathcal{D}_t)$ with learning rate $\eta_t$. A direct computation shows that the proxy loss is
$$\mathcal{L}_t(W) = \frac{1}{N}\|Y - WX\|_F^2 + \sigma_t^2\|W\|_F^2,$$
which is strictly convex. Thus, the space $V$ is simply all of $\mathbb{R}^n$. Moreover, the proxy loss has a unique minimum, which is the ridge minimizer $W_t^* = YX^T(XX^T + \sigma_t^2N\cdot\mathrm{Id}_{n\times n})^{-1}$.

C.1 PROOF OF THEOREM 4.1

We first show convergence. For this, we seek to show that if $\sigma_t^2, \eta_t \to 0$ with $\sigma_t^2$ non-increasing and the conditions of Theorem 4.1 hold, then $W_t \xrightarrow{p} W_{\min}$. We will do this by applying Theorem 5.1, so we check that our assumptions imply the hypotheses of that theorem. For Theorem 5.1, ...
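The convergence claimed in Theorem 4.1 can be observed in a small simulation (a sketch, not an experiment from the paper; the dimensions and exponents below are hypothetical choices satisfying $x + y < 1 < 2x + y$): augmented GD with additive Gaussian noise under decaying schedules approaches $W_{\min}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, p = 6, 3, 2
X = rng.standard_normal((n, N))
Y = rng.standard_normal((p, N))
W_min = Y @ X.T @ np.linalg.pinv(X @ X.T)

# Hypothetical power-law schedules: eta_t ~ t^-x, sigma_t^2 ~ t^-y.
x_exp, y_exp = 0.45, 0.35
W = np.zeros((p, n))
err0 = np.linalg.norm(W - W_min)
for t in range(1, 30001):
    eta = 0.02 * t ** -x_exp
    sigma = np.sqrt(t ** -y_exp)
    Xt = X + sigma * rng.standard_normal((n, N))    # noised dataset D_t
    W = W - eta * (2.0 / N) * (W @ Xt - Y) @ Xt.T   # augmented GD step
err = np.linalg.norm(W - W_min)
```

The error shrinks at the polynomial rate the theory predicts (roughly $t^{-\min\{y,\,x/2\}}$), rather than the exponential rate of noiseless full-batch GD.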

