WHY (AND WHEN) DOES LOCAL SGD GENERALIZE BETTER THAN SGD?

Abstract

Local SGD is a communication-efficient variant of SGD for large-scale training, where multiple GPUs perform SGD independently and average the model parameters periodically. It has been recently observed that Local SGD can not only achieve the design goal of reducing the communication overhead but also lead to higher test accuracy than the corresponding SGD baseline (Lin et al., 2020b), though the training regimes in which this happens are still in debate (Ortiz et al., 2021). This paper aims to understand why (and when) Local SGD generalizes better based on Stochastic Differential Equation (SDE) approximation. The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small learning rate regime, showing how noise drives the iterate to drift and diffuse after it has reached close to the manifold of local minima, (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term that can result in a stronger effect of regularization, e.g., a faster reduction of sharpness, and (iii) empirical evidence validating that having a small learning rate and long enough training time enables the generalization improvement over SGD, but removing either of the two conditions leads to no improvement.

1. INTRODUCTION

As deep models have grown larger, training them within reasonable wall-clock time has led to new distributed environments and new variants of gradient-based training. Recall that Stochastic Gradient Descent (SGD) tries to solve min_{θ∈R^d} E_{ξ∼D}[ℓ(θ; ξ)], where θ ∈ R^d is the parameter vector of the model and ℓ(θ; ξ) is the loss function for a data sample ξ drawn from the training distribution D, e.g., the uniform distribution over the training set. SGD with learning rate η and batch size B does the following update at each step, using a batch of B independent samples ξ_{t,1}, ..., ξ_{t,B} ∼ D:

θ_{t+1} ← θ_t − η g_t, where g_t = (1/B) Σ_{i=1}^{B} ∇ℓ(θ_t; ξ_{t,i}). (1)

Parallel SGD tries to improve the wall-clock time when the batch size B is large. It distributes the gradient computation to K ≥ 2 workers, each of which focuses on a local batch of B_loc := B/K samples and computes the average gradient over that local batch; g_t is then obtained by averaging the local gradients over the K workers. However, large-batch training leads to a significant test accuracy drop compared to a small-batch training baseline with the same number of training steps or epochs (Smith et al., 2020; Shallue et al., 2019; Keskar et al., 2017; Jastrzębski et al., 2017). Reducing this generalization gap is the goal of much subsequent research. It has been suggested that the gap arises because larger batches reduce the level of noise in the batch gradient (see Appendix A for more discussion). The Linear Scaling Rule (Krizhevsky, 2014; Goyal et al., 2017; Jastrzębski et al., 2017) tries to fix this by increasing the learning rate in proportion to the batch size. This is found to reduce the generalization gap for (parallel) SGD, but does not entirely eliminate it. To reduce the generalization gap further, Lin et al. (2020b) discovered that a variant of SGD, called Local SGD (Yu et al., 2019; Wang & Joshi, 2019; Zhou & Cong, 2018), can be used as a strong component. Perhaps surprisingly, Local SGD itself is designed not for improving generalization but for reducing the high communication cost of synchronization among the workers, which is another important issue that often bottlenecks large-batch training (Seide et al., 2014; Strom, 2015; Chen et al., 2016; Recht et al., 2011). Instead of averaging the local gradients at every step as in parallel SGD, Local SGD lets the K workers train their models locally and averages the local model parameters whenever they finish H local steps.
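To make the SGD update and its data-parallel variant concrete, here is a minimal NumPy sketch on a toy quadratic loss (the loss, step size, and batch sizes are our illustrative choices, not the paper's setup). It also checks that parallel SGD with equal-sized local batches computes the same update as one big SGD step:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_sample(theta, xi):
    # Toy per-sample gradient for l(theta; xi) = 0.5 * ||theta - xi||^2.
    return theta - xi

def sgd_step(theta, batch, eta):
    # One SGD step: average per-sample gradients over the batch, then descend.
    g = np.mean([grad_sample(theta, xi) for xi in batch], axis=0)
    return theta - eta * g

def parallel_sgd_step(theta, batch, eta, K):
    # Split the global batch of size B into K local batches of size B/K;
    # each worker averages gradients over its local batch, then the local
    # gradients are averaged. Mathematically identical to one big SGD step.
    local_batches = np.array_split(batch, K)
    local_grads = [np.mean([grad_sample(theta, xi) for xi in lb], axis=0)
                   for lb in local_batches]
    return theta - eta * np.mean(local_grads, axis=0)

theta0 = np.zeros(3)
batch = rng.normal(size=(8, 3))          # B = 8 samples, K = 4 workers
seq = sgd_step(theta0, batch, eta=0.1)
par = parallel_sgd_step(theta0, batch, eta=0.1, K=4)
assert np.allclose(seq, par)             # same update, just distributed
```

The equivalence holds exactly only because the local batches have equal size, so the average of local averages equals the global average.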
Here every worker samples a new batch at each local step, and in this paper we focus on the case where all the workers draw samples with or without replacement from the same training set. See Appendix C for the pseudocode. More specifically, Lin et al. (2020b) proposed Post-local SGD, a hybrid method that starts with parallel SGD (mathematically equivalent to Local SGD with H = 1) and switches to Local SGD with H > 1 after a fixed number of steps t_0. They showed through extensive experiments that Post-local SGD significantly outperforms parallel SGD in test accuracy when t_0 is carefully chosen. In Figure 1, we reproduce this phenomenon on both CIFAR-10 and ImageNet. As suggested by the success of Post-local SGD, Local SGD can improve the generalization of SGD by merely adding more local steps (while fixing the other hyperparameters), at least when training starts from a model pre-trained by SGD. But the underlying mechanism is not very clear, and there is also controversy about when this phenomenon can happen (see Section 2.1 for a survey). The current paper tries to understand: Why does Local SGD generalize better? Under what general conditions does this generalization benefit arise? Previous theoretical research on Local SGD is mainly restricted to the convergence rate for minimizing a convex or non-convex objective (see Appendix A for a survey). A related line of works (Stich, 2018; Yu et al., 2019; Khaled et al., 2020) showed that Local SGD has a slower convergence rate than parallel SGD after running the same number of steps/epochs. This convergence result suggests that Local SGD may implicitly regularize the model through insufficient optimization, but it does not explain why parallel SGD with early stopping, which may incur an even higher training loss, still generalizes worse than Post-local SGD. Our Contributions.
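The Post-local SGD schedule described above (average every step until t_0, then only every H local steps) can be sketched as a communication schedule; the helper name and the concrete numbers below are ours:

```python
def sync_schedule(t0, H, total_steps):
    # Steps at which Post-local SGD averages model parameters:
    # every step while t < t0 (parallel SGD, i.e., H = 1),
    # then every H-th local step afterwards (Local SGD, H > 1).
    syncs = []
    t = 0
    while t < total_steps:
        t += 1 if t < t0 else H
        syncs.append(min(t, total_steps))
    return syncs

# With t0 = 4 and H = 3: sync after each of the first 4 steps, then every 3 steps.
assert sync_schedule(t0=4, H=3, total_steps=13) == [1, 2, 3, 4, 7, 10, 13]
```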
In this paper, we provide the first theoretical understanding of why (and when) switching from parallel SGD to Local SGD improves generalization. 1. In Section 2.2, we conduct ablation studies on CIFAR-10 and ImageNet and identify a clean setting where adding local steps to SGD consistently improves generalization: if the learning rate is small and the total number of steps is sufficient, Local SGD eventually generalizes better than the corresponding (parallel) SGD baseline. 2. In Section 3.2, we derive a special SDE that characterizes the long-term behavior of Local SGD in the small learning rate regime, inspired by a previous work (Li et al., 2021b) that proposed this type of SDE for modeling SGD. These SDEs can track the dynamics after the iterate has reached close to a manifold of minima. In this regime, the expected gradient is near zero, but the gradient noise can still drive the iterate to wander around. In contrast to the conventional SDE (3) for

2.2. KEY FACTORS: SMALL LEARNING RATE AND SUFFICIENT TRAINING TIME

All the above papers agree that Post-local/Local SGD improves upon SGD to some extent. However, it is under debate under what conditions the generalization benefit can consistently occur. We now conduct ablation studies to identify the key factors that allow adding local steps to improve the generalization of SGD. We run parallel SGD and Local SGD with the same learning rate η, local batch size B_loc, and number of workers K. We start training from the same initialization and compare their generalization after the same number of epochs. As Post-local SGD can be viewed as Local SGD starting from an SGD-pretrained model, the initial point in our experiments can be either random or a checkpoint of SGD training. See Appendix C for implementation details and Appendix M.2 for more details about the experimental setup. Our first observation is that the generalization benefits can be reproduced on both CIFAR-10 and ImageNet in our setting (see Figure 1). We remark that Post-local SGD and SGD in Lin et al. (2020b); Ortiz et al. (2021) are implemented with accompanying Nesterov momentum terms, and the learning rate also decays a couple of times during training with Local SGD. Nevertheless, our experiments show that neither Nesterov momentum nor learning rate decay is necessary for Local SGD to generalize better than SGD. Our main finding after further ablation studies is summarized below: Finding 2.1. Given a sufficiently small learning rate and a sufficiently long training time, Local SGD exhibits better generalization than SGD, if the number of local steps H per round is tuned properly according to the learning rate. This holds both for training from random initialization and for training from pre-trained models. Now we go through each point of our main finding. See also Appendix F for more plots. (1). Pretraining is not necessary.
In contrast to previous works claiming the benefits of Post-local SGD over Local SGD (Lin et al., 2020b; Ortiz et al., 2021), we observe that Local SGD with random initialization also generalizes significantly better than SGD, as long as the learning rate is small and the training time is sufficiently long (Figure 2(a)). Starting from a pretrained model may shorten the time for this generalization benefit to show up (Figure 2(b)), but it is not necessary. (2). Learning rate should be small. We experiment with a wide range of learning rates and conclude that setting a small learning rate is necessary. The learning rate is 0.32 for Figures 2(a (4). The number of local steps H should be tuned carefully. The number of local steps H has a complex interplay with the learning rate η, but generally speaking, a smaller η needs a larger H to achieve consistent generalization improvement. For CIFAR-10 with a post-local training budget of 250 epochs (see Figure 2(e)), the test accuracy first rises as H increases, and begins to fall once H exceeds some threshold for relatively large η (e.g., η ≥ 0.5), while it keeps growing for smaller η (e.g., η < 0.5). For ImageNet with a post-local training budget of 50 epochs (see Figure 2(f)), the test accuracy first increases and then decreases in H for all learning rates. Reconciling previous works. Our finding can help settle the debate presented in Section 2.1 to a large extent. Simultaneously requiring a small learning rate and sufficient training time poses a trade-off when learning rate decay is used with a limited training budget: switching to Local SGD earlier may mean switching at a large learning rate, while switching later makes the generalization improvement of Local SGD less noticeable due to fewer update steps. It is thus unsurprising that the first-decay switching strategy is not always the best. The need for sufficient training time does not contradict Ortiz et al.
(2021)'s conjecture that Local SGD only has a "short-term" generalization benefit. In their experiments, the generalization improvement usually disappears right after the next learning rate decay (rather than after a fixed amount of time). We suspect that the real reason the improvement vanishes is that the number of local steps H was kept constant, whereas our finding suggests re-tuning H after η changes. In Figure 5(e), we reproduce this phenomenon and show that increasing H after learning rate decay retains the improvement. Generalization performance at the optimal learning rate of SGD. In practice, the learning rate of SGD is usually tuned to achieve the best training loss/validation accuracy within a fixed training budget. Our finding suggests that when the tuned learning rate is small and the training time is sufficient, Local SGD can offer a generalization improvement over SGD. As an example, in our experiments on training from an SGD-pretrained model, the optimal learning rate for SGD is 0.5 on CIFAR-10 (Figure 2(e)) and 0.064 on ImageNet (Figure 2(f)). With the same learning rate as SGD, the test accuracy is improved by 1.1% on CIFAR-10 and 0.3% on ImageNet when using Local SGD with H = 750 and H = 26, respectively. The improvement could be even higher if the learning rate of Local SGD were carefully tuned.

3. THEORETICAL ANALYSIS OF LOCAL SGD: THE SLOW SDE

In this section, we adopt an SDE-based approach to rigorously establish the generalization benefit of Local SGD in a general setting. Below, we first identify the difficulty of adapting the SDE framework to Local SGD. Then, we present our novel SDE characterization of Local SGD around the manifold of minimizers and explain the generalization benefit of Local SGD with our SDE. Notations. We follow the notations in Section 1. We denote by η the learning rate, K the number of workers, B the (global) batch size, B_loc := B/K the local batch size, H the number of local steps, ℓ(θ; ξ) the loss function for a data sample ξ, and D the training distribution. Furthermore, we define L(θ) := E_{ξ∼D}[ℓ(θ; ξ)] as the expected loss and Σ(θ) := Cov_{ξ∼D}[∇ℓ(θ; ξ)] as the noise covariance of the gradients at θ. Let {W_t}_{t≥0} denote the standard Wiener process. For a mapping F : R^d → R^d, denote by ∂F(θ) the Jacobian at θ and by ∂²F(θ) the second order derivative at θ. Furthermore, for any matrix M ∈ R^{d×d}, ∂²F(θ)[M] = Σ_{i∈[d]} ⟨∂²F_i/∂θ², M⟩ e_i, where e_i is the i-th vector of the standard basis. We write ∂²(∇L)(θ)[M] as ∇³L(θ)[M] for short. Local SGD. We use the following formulation of Local SGD for theoretical analysis. See also Appendix C for the pseudocode. Local SGD proceeds in multiple rounds of model averaging, where each round produces a global iterate θ^(s). In the (s+1)-th round, every worker k ∈ [K] starts with its own local copy of the global iterate, θ^(s)_{k,0} ← θ^(s), and performs H local SGD steps:

θ^(s)_{k,t+1} ← θ^(s)_{k,t} − η g^(s)_{k,t}, where g^(s)_{k,t} = (1/B_loc) Σ_{i=1}^{B_loc} ∇ℓ(θ^(s)_{k,t}; ξ^(s)_{k,t,i}), t = 0, ..., H−1. (2)

The local updates on different workers are independent of each other, as there is no communication within a round. After finishing the H local steps, the workers aggregate the resulting local iterates θ^(s)_{k,H} and assign the average to the next global iterate: θ^(s+1) ← (1/K) Σ_{k=1}^{K} θ^(s)_{k,H}.
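The round structure in (2) — H independent local steps per worker followed by parameter averaging — can be sketched on a toy noisy quadratic (the objective, noise model, and hyperparameters are our illustrative choices, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd_round(theta_global, eta, H, K, B_loc, sample_grad):
    # One round of Local SGD: each worker copies the global iterate,
    # runs H independent SGD steps on local batches of size B_loc, and the
    # resulting local iterates are averaged into the next global iterate.
    local_iterates = []
    for k in range(K):
        theta = theta_global.copy()
        for t in range(H):
            g = np.mean([sample_grad(theta) for _ in range(B_loc)], axis=0)
            theta = theta - eta * g
        local_iterates.append(theta)
    return np.mean(local_iterates, axis=0)

# Toy problem: L(theta) = 0.5 ||theta||^2 with additive Gaussian gradient noise.
def sample_grad(theta):
    return theta + rng.normal(scale=0.1, size=theta.shape)

theta = np.ones(2)
for s in range(50):                      # 50 rounds of H = 4 local steps
    theta = local_sgd_round(theta, eta=0.1, H=4, K=2, B_loc=8,
                            sample_grad=sample_grad)
assert np.linalg.norm(theta) < 0.1       # the global iterate nears the minimizer
```

Setting H = 1 in this sketch recovers parallel SGD, matching the equivalence noted in Section 1.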

3.1. DIFFICULTY OF ADAPTING THE SDE FRAMEWORK TO LOCAL SGD

A widely adopted approach to understanding the dynamics of SGD is to approximate it from a continuous perspective with the following SDE (3), which we call the conventional SDE approximation. Below, we discuss why it cannot be directly adopted to characterize the behavior of Local SGD.

dX(t) = −∇L(X) dt + √(η/B) Σ^{1/2}(X) dW_t. (3)

Li et al. (2019a) proved that this SDE is a first-order approximation to SGD, where each discrete step corresponds to a continuous time interval of length η. Several previous works adopt this SDE approximation and connect good generalization to having a large diffusion term √(η/B) Σ^{1/2}(X) dW_t in the SDE (Jastrzębski et al., 2017; Smith et al., 2020), because a suitable amount of noise can be necessary for large-batch training to generalize well (see also Appendix A). According to Finding 2.1, it is tempting to consider the limit η → 0 and see if Local SGD can also be modeled via a variant of the conventional SDE. In this case, the typical time length over which a good SDE approximation error is guaranteed is O(η^{−1}) discrete steps (Li et al., 2019a; 2021a). However, this time scale is too short for a difference to appear between Local SGD and SGD. Indeed, Theorem 3.1 below shows that they closely track each other for O(η^{−1}) steps. Theorem 3.1. Assume that the loss function L is C³-smooth with bounded second and third order derivatives and that ∇ℓ(θ; ξ) is bounded. Let T > 0 be a constant, θ^(s) be the s-th global iterate of Local SGD, and w_t be the t-th iterate of SGD with the same initialization w_0 = θ^(0) and the same η, B_loc, K. Then for any H ≤ T/η and δ = O(poly(η)), it holds with probability at least 1 − δ that for all s ≤ T/(ηH),

∥θ^(s) − w_{sH}∥₂ = O(√(η log(1/(ηδ)))).

We defer the proof to Appendix I. See also Appendix D for Lin et al. (2020b)'s attempt to model Local SGD with multiple conventional SDEs and a discussion of why it does not give much insight.
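The conventional SDE above can be simulated with the Euler–Maruyama scheme; below is a minimal sketch for a one-dimensional quadratic loss with constant noise scale (the discretization step and all concrete values are illustrative assumptions):

```python
import math, random

random.seed(0)

def euler_maruyama(x0, eta, B, sigma, grad_L, T, dt):
    # Simulate dX = -grad_L(X) dt + sqrt(eta / B) * sigma(X) dW_t
    # by Euler-Maruyama: Gaussian increments with variance dt per step.
    x, t = x0, 0.0
    while t < T:
        noise = math.sqrt(eta / B) * sigma(x) * random.gauss(0.0, math.sqrt(dt))
        x = x - grad_L(x) * dt + noise
        t += dt
    return x

# L(x) = 0.5 x^2 with constant noise scale; under the SDE, each discrete SGD
# step corresponds to a continuous time interval of length eta.
xs = [euler_maruyama(x0=1.0, eta=0.01, B=32, sigma=lambda x: 1.0,
                     grad_L=lambda x: x, T=5.0, dt=0.01) for _ in range(200)]
mean = sum(xs) / len(xs)
# The drift pulls E[X(T)] toward 0, leaving only small noise fluctuations.
assert abs(mean) < 0.1
```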

3.2. SDE APPROXIMATION NEAR THE MINIMIZER MANIFOLD

Inspired by a recent paper (Li et al., 2021b), our strategy to overcome the shortcomings of the conventional SDE is to design a new SDE that guarantees a good approximation for O(η^{−2}) discrete steps, much longer than the O(η^{−1}) discrete steps for the conventional SDE. Following their setting, we assume the existence of a manifold Γ consisting only of local minimizers and track the global iterate θ^(s) around Γ after it takes Õ(η^{−1}) steps to approach Γ. Though the expected gradient ∇L is near zero around Γ, the dynamics are still non-trivial because the noise can drive the iterate to move a significant distance in O(η^{−2}) steps. Assumption 3.1. The loss function L(·) and the matrix square root of the noise covariance Σ^{1/2}(·) are C^∞-smooth. Besides, we assume that ∥∇ℓ(θ; ξ)∥₂ is bounded by a constant for all θ and ξ. Assumption 3.2. Γ is a C^∞-smooth, (d − m)-dimensional submanifold of R^d, where every ζ ∈ Γ is a local minimizer of L. For all ζ ∈ Γ, rank(∇²L(ζ)) = m. Additionally, there exists an open neighborhood U of Γ such that Γ = arg min_{θ∈U} L(θ). Assumption 3.3. Γ is a compact manifold. The smoothness assumption on L is generally satisfied when we use smooth activation functions, such as Swish (Ramachandran et al., 2017), softplus, and GeLU (Hendrycks & Gimpel, 2016), which work about as well as ReLU in many circumstances. The existence of a minimizer manifold with rank(∇²L(ζ)) = m has also been a key assumption in Fehrman et al. (2020); Li et al. (2021b); Lyu et al. (2022), where rank(∇²L(ζ)) = m ensures that the Hessian is maximally non-degenerate on the manifold and implies that the tangent space at ζ ∈ Γ equals the null space of ∇²L(ζ). The last assumption is made to prevent the analysis from becoming too technically involved. Our SDE for Local SGD characterizes the training dynamics near Γ.
For ease of presentation, we define the following projection operators Φ and P_ζ for points and differential forms, respectively. Definition 3.1 (Gradient Flow Projection). Fix a point θ_null ∉ Γ. For x ∈ R^d, consider the gradient flow dx(t)/dt = −∇L(x(t)) with x(0) = x. We denote the gradient flow projection of x as Φ(x): Φ(x) := lim_{t→+∞} x(t) if the limit exists and belongs to Γ; otherwise, Φ(x) := θ_null. Definition 3.2. For any ζ ∈ Γ and any differential form A dW_t + b dt in Itô calculus, where A is a matrix and b is a vector, we use P_ζ(A dW_t + b dt) as a shorthand for the differential form ∂Φ(ζ)A dW_t + (∂Φ(ζ)b + (1/2) ∂²Φ(ζ)[AA^⊤]) dt. See Øksendal (2013) for background on Itô calculus. Definition 3.3 (Slow SDE for Local SGD).

dζ(t) = P_ζ( (1/√B) Σ∥^{1/2}(ζ) dW_t − (1/(2B)) ∇³L(ζ)[Σ♢(ζ)] dt − ((K−1)/(2B)) ∇³L(ζ)[Ψ(ζ)] dt ), (4)

where the three terms are referred to as (a) the diffusion term, (b) the drift-I term, and (c) the drift-II term, respectively. Here Σ♢(ζ), Ψ(ζ) ∈ R^{d×d} are defined as

Σ♢(ζ) := Σ_{i,j: (λ_i≠0)∨(λ_j≠0)} (1/(λ_i+λ_j)) ⟨Σ(ζ), v_i v_j^⊤⟩ v_i v_j^⊤, (5)

Ψ(ζ) := Σ_{i,j: (λ_i≠0)∨(λ_j≠0)} (ψ(ηH·(λ_i+λ_j))/(λ_i+λ_j)) ⟨Σ(ζ), v_i v_j^⊤⟩ v_i v_j^⊤, (6)

where {v_i}_{i=1}^{d} is a set of eigenvectors of ∇²L(ζ) that forms an orthonormal eigenbasis, and λ_1, ..., λ_d are the corresponding eigenvalues. Additionally, ψ(x) := (e^{−x} − 1 + x)/x for x ≠ 0 and ψ(0) := 0. The use of P_ζ keeps ζ(t) on the manifold Γ through projection. Σ∥^{1/2}(ζ) introduces a diffusion term to the SDE in the tangent space. The two drift terms involve Σ♢(·) and Ψ(·), which can be intuitively understood as rescaling the entries of the noise covariance in the eigenbasis of the Hessian. In the special case where ∇²L = diag(λ_1, ..., λ_d) ∈ R^{d×d}, we have Σ♢,i,j = Σ_{i,j}/(λ_i+λ_j) and Ψ_{i,j} = (ψ(ηH(λ_i+λ_j))/(λ_i+λ_j)) Σ_{i,j}. ψ(x) is a monotonically increasing function, which goes from 0 to 1 as x goes from 0 to infinity (see Figure 9). We name this SDE the Slow SDE for Local SGD because we will show that each discrete step of Local SGD corresponds to a continuous time interval of η², instead of an interval of η as in the conventional SDE.
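In the diagonal-Hessian special case, these matrices reduce to entrywise rescalings of the noise covariance, which is easy to check numerically. A sketch (eigenvalues and covariance entries are illustrative) verifying that Ψ ≈ 0 for small ηH and Ψ ≈ Σ♢ for large ηH:

```python
import math

def psi(x):
    # psi(x) = (e^{-x} - 1 + x) / x with psi(0) = 0; increases from 0 to 1.
    return 0.0 if x == 0 else (math.exp(-x) - 1.0 + x) / x

def slow_sde_matrices(Sigma, lams, eta, H):
    # Diagonal-Hessian case: Sigma_dia[i][j] = Sigma[i][j] / (lam_i + lam_j),
    # Psi[i][j] = psi(eta*H*(lam_i + lam_j)) / (lam_i + lam_j) * Sigma[i][j],
    # skipping the (i, j) pairs where both eigenvalues vanish.
    d = len(lams)
    Sigma_dia = [[0.0] * d for _ in range(d)]
    Psi = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if lams[i] != 0 or lams[j] != 0:
                s = lams[i] + lams[j]
                Sigma_dia[i][j] = Sigma[i][j] / s
                Psi[i][j] = psi(eta * H * s) / s * Sigma[i][j]
    return Sigma_dia, Psi

lams = [2.0, 0.5]                         # Hessian eigenvalues (illustrative)
Sigma = [[1.0, 0.2], [0.2, 1.0]]          # noise covariance (illustrative)
_, Psi_small = slow_sde_matrices(Sigma, lams, eta=0.01, H=1)       # eta*H small
Sd, Psi_large = slow_sde_matrices(Sigma, lams, eta=0.01, H=10**5)  # eta*H large
assert all(abs(x) < 1e-2 for row in Psi_small for x in row)  # Psi nearly 0
assert abs(Psi_large[0][0] - Sd[0][0]) < 1e-3                # Psi near Sigma_dia
```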
In this sense, our SDE is "slower" than the conventional SDE (and hence can track a longer horizon). This Slow SDE is inspired by Li et al. (2021b). Under nearly the same set of assumptions, they proved that SGD can be tracked by an SDE that is essentially equivalent to (4) with K = 1, namely, without the drift-II term:

dζ(t) = P_ζ( (1/√B) Σ∥^{1/2}(ζ) dW_t − (1/(2B)) ∇³L(ζ)[Σ♢(ζ)] dt ), (7)

which keeps only the diffusion term (a) and the drift-I term (b). We refer to (7) as the Slow SDE for SGD. We remark that the drift-II term in (4) is novel and is the key to separating the generalization behaviors of Local SGD and SGD in theory. We will discuss this point later in Section 3.3. Now we present our SDE approximation theorem for Local SGD. Theorem 3.2. Let Assumptions 3.1 to 3.3 hold. Let T > 0 be a constant and ζ(t) be the solution to (4) with the initial condition ζ(0) = Φ(θ^(0)) ∈ Γ. If H is set to α/η for some constant α > 0, then for any C³-smooth function g(θ),

max_{0 ≤ s ≤ T/(Hη²)} |E[g(Φ(θ^(s)))] − E[g(ζ(sHη²))]| = Õ(η^{0.25}),

where Õ(·) hides log factors and constants that are independent of η but can depend on g(θ). Theorem 3.3. For δ = O(poly(η)), with probability at least 1 − δ, it holds for all O((1/α) log(1/η)) ≤ s ≤ T/(αη) that Φ(θ^(s)) ∈ Γ and

∥θ^(s) − Φ(θ^(s))∥₂ = O(√(αη log(α/(ηδ)))),

where O(·) hides constants independent of η, α, and δ. Theorem 3.2 states that the trajectories of the manifold projection and the solution to the Slow SDE (4) are close to each other in the sense of weak approximation. That is, {Φ(θ^(s))} and {ζ(t)} cannot be distinguished by evaluating test functions from a wide function class, including all polynomials. This measure of closeness between the iterates of stochastic gradient algorithms and their SDE approximations is also adopted by Li et al. (2019a; 2021a); Malladi et al. (2022), but their analyses are for conventional SDEs. Theorem 3.3 further states that the iterate θ^(s) stays close to its manifold projection after the first few rounds.
Published as a conference paper at ICLR 2023

Remark 3.1. To connect to Finding 2.1, we remark that our theorems (1) do not require the model to be pre-trained (as long as the gradient flow starting from θ^(0) converges to Γ); (2) give better bounds for smaller η; and (3) characterize a long training horizon of ∼ η^{−2} steps. The need for tuning H will be discussed in Section 3.3.3. Technical Contribution. The proof technique for Theorem 3.2 is novel and significantly different from the Slow SDE analysis of SGD in Li et al. (2021b). Their analysis uses advanced stochastic calculus and invokes Katzenberger's theorem (Katzenberger, 1991) to show that SGD converges to the Slow SDE in distribution, but provides no quantitative error bounds. Also, due to the local updates and multiple aggregation steps in Local SGD, it is unclear how to extend Katzenberger's theorem to our case. To overcome this difficulty, we develop a new approach to analyzing Slow SDEs, which is based on the method of moments (Li et al., 2019a) and provides the quantitative error bound Õ(η^{0.25}) in weak approximation. See Appendix J for our proof outline. A by-product of our result is the first quantitative approximation bound for the Slow SDE approximation of SGD, which can be obtained by simply setting K = 1.

3.3. INTERPRETATION OF THE SLOW SDES

In this subsection, we compare the Slow SDEs for SGD and Local SGD and provide an important insight into why Local SGD generalizes better than SGD: Local SGD strengthens the drift term in the Slow SDE, which makes the implicit regularization of stochastic gradient noise more effective.

3.3.1. INTERPRETATION OF THE SLOW SDE FOR SGD.

The Slow SDE for SGD (7) consists of the diffusion and drift-I terms. The former injects noise into the dynamics in the tangent space; the latter drives the dynamics to move along the negative gradient of (1/(2B)) ⟨∇²L(ζ), Σ♢(ζ)⟩ projected onto the tangent space, while ignoring the dependency of Σ♢(ζ) on ζ. This can be connected to the class of semi-gradient methods, which compute only a part of the gradient (Mnih et al., 2015; Sutton & Barto, 1998; Brandfonbrener & Bruna, 2020). In this view, the long-term behavior of SGD is similar to a stochastic semi-gradient method minimizing the implicit regularizer (1/(2B)) ⟨∇²L(ζ), Σ♢(ζ)⟩ on the minimizer manifold of the original loss L. Though the semi-gradient method may not perfectly optimize its objective, the above argument reveals that SGD has a deterministic trend toward regions with a smaller magnitude of the Hessian, which is commonly believed to correlate with better generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Neyshabur et al., 2017; Jiang et al., 2020) (see Appendix A for more discussion). In contrast, the diffusion term can be regarded as a random perturbation to this trend, which can impede optimization when the drift-I term is not strong enough. Based on this view, we conjecture that strengthening the drift term of the Slow SDE can help SGD better regularize the model, yielding better generalization performance. More specifically, we propose the following hypothesis, which compares the generalization performance of the following generalized Slow SDEs. Note that the (1/B, 1/(2B))-Slow SDE corresponds to the Slow SDE for SGD (7). Definition 3.4. For κ_1, κ_2 ≥ 0, define the (κ_1, κ_2)-Slow SDE as

dζ(t) = P_ζ( √κ_1 Σ∥^{1/2}(ζ) dW_t − κ_2 ∇³L(ζ)[Σ♢(ζ)] dt ).

Hypothesis 3.1. Starting at a minimizer ζ_0 ∈ Γ, run the (κ_1, κ_2)-Slow SDE and the (κ_1, κ'_2)-Slow SDE respectively for the same amount of time T > 0 and obtain ζ(T), ζ'(T).
If κ_2 > κ'_2, then the expected test accuracy at ζ(T) is better than that at ζ'(T). Due to the No Free Lunch Theorem, we do not claim that our hypothesis is always true, but we do believe that the hypothesis holds when training common neural networks (e.g., ResNets, VGGNets) on standard benchmarks (e.g., CIFAR-10, ImageNet). Example: Training with Label Noise Regularization. To exemplify the generalization benefit of having a larger drift term, we follow a line of theoretical works (Li et al., 2021b; Blanc et al., 2020; Damian et al., 2021) and study the case of training over-parameterized neural nets with label noise regularization. For a C-class classification task, label noise regularization works as follows: every time we draw a sample from the training set, we keep the true label as it is with probability 1 − p, and replace it with any other label with equal probability p/(C−1). When we use the cross-entropy loss, the Slow SDE for SGD turns out to be a simple deterministic gradient flow on Γ (instead of a semi-gradient method) for minimizing the trace of the Hessian:

dζ(t) = −(1/(4B)) ∇_Γ tr(∇²L(ζ)) dt,

where ∇_Γ f stands for the gradient of the function f projected onto the tangent space of Γ. Checking the validity of our hypothesis then reduces to the following question: Is minimizing the trace of the Hessian beneficial to generalization? Many works prove positive results in concrete settings, including the line of works we just mentioned. We refer the readers to Appendix G for further discussion.

3.3.2. INTERPRETATION OF THE SLOW SDE FOR LOCAL SGD.

Based on Hypothesis 3.1, we argue that Local SGD improves generalization by strengthening the drift term of the Slow SDE. First, it can be seen from (4) that the Slow SDE for Local SGD has an additional drift-II term. Similar to the drift-I term of the Slow SDE for SGD, this drift-II term drives the dynamics to move along the negative semi-gradient of ((K−1)/(2B)) ⟨∇²L(ζ), Ψ(ζ)⟩ (with the dependency of Ψ(ζ) on ζ ignored).
Combining it with the implicit regularizer induced by the drift-I term, we see that the long-term behavior of Local SGD is similar to a stochastic semi-gradient method minimizing the implicit regularizer (1/(2B)) ⟨∇²L(ζ), Σ♢(ζ)⟩ + ((K−1)/(2B)) ⟨∇²L(ζ), Ψ(ζ)⟩ on Γ. Comparing the definitions of Σ♢(ζ) (5) and Ψ(ζ) (6), we see that Ψ(ζ) is basically a rescaling of the entries of Σ♢(ζ) in the eigenbasis of the Hessian, where the rescaling factor ψ(ηH·(λ_i+λ_j)) for each entry is between 0 and 1 (see Figure 9 for the plot of ψ). When ηH is small, the rescaling factors are close to ψ(0) = 0, so Ψ(ζ) ≈ 0, leading to almost no additional regularization. On the other hand, when ηH is large, the rescaling factors are close to ψ(+∞) = 1, so Ψ(ζ) ≈ Σ♢(ζ). We can then merge the two implicit regularizers into (K/(2B)) ⟨∇²L(ζ), Σ♢(ζ)⟩, and (4) becomes the (1/B, K/(2B))-Slow SDE, restated below:

dζ(t) = P_ζ( (1/√B) Σ∥^{1/2}(ζ) dW_t − (K/(2B)) ∇³L(ζ)[Σ♢(ζ)] dt ).

From the above argument we know how the Slow SDE of Local SGD (4) changes as ηH transitions from 0 to +∞. Initially, when ηH = 0, (4) is the same as the (1/B, 1/(2B))-Slow SDE for SGD. Then increasing ηH strengthens the drift term of (4). As ηH → +∞, (4) transitions to the (1/B, K/(2B))-Slow SDE, where the drift term becomes K times larger. According to Hypothesis 3.1, the (1/B, K/(2B))-Slow SDE generalizes better than the (1/B, 1/(2B))-Slow SDE, so Local SGD with ηH = +∞ should generalize better than SGD. When ηH is chosen realistically as a finite value, the generalization performance of Local SGD interpolates between these two cases, resulting in worse generalization than ηH = +∞ but still better than SGD.
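The merged regularizer can be read as an effective per-entry drift coefficient that interpolates between the SGD value 1/(2B) and the limit K/(2B) as ηH grows; a numeric sketch under illustrative values of K, B, and the eigenvalue sum:

```python
import math

def psi(x):
    # psi(x) = (e^{-x} - 1 + x) / x with psi(0) = 0.
    return 0.0 if x == 0 else (math.exp(-x) - 1.0 + x) / x

def effective_drift_coeff(K, B, eta, H, lam_sum):
    # Coefficient multiplying the (i, j) noise entry in the merged regularizer:
    # drift-I contributes 1/(2B), drift-II contributes
    # (K - 1)/(2B) * psi(eta * H * (lam_i + lam_j)).
    return (1.0 + (K - 1) * psi(eta * H * lam_sum)) / (2.0 * B)

K, B, lam_sum = 8, 256, 1.0
c_sgd = 1.0 / (2 * B)                    # eta*H -> 0: plain SGD drift
c_inf = K / (2 * B)                      # eta*H -> infinity: K times larger
assert abs(effective_drift_coeff(K, B, 0.01, 1, lam_sum) - c_sgd) < 0.5 * c_sgd
assert abs(effective_drift_coeff(K, B, 0.01, 10**6, lam_sum) - c_inf) < 0.01 * c_inf
coeffs = [effective_drift_coeff(K, B, 0.01, H, lam_sum) for H in (1, 10, 100, 1000)]
assert coeffs == sorted(coeffs)          # drift strengthens monotonically in H
```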

3.3.3. THEORETICAL INSIGHTS INTO TUNING THE NUMBER OF LOCAL STEPS

Based on our Slow SDE approximations, we now discuss how the number of local steps H affects the generalization of Local SGD. When η is small but finite, tuning H offers a trade-off between regularization strength and SDE approximation quality. A larger α := ηH makes the regularization in the SDE stronger (as discussed in Section 3.3.2), but the SDE itself may lose track of Local SGD, as can be seen from the error bound O(√(αη log(α/(ηδ)))) in Theorem 3.3. Therefore, we expect the test accuracy to first increase and then decrease as we gradually increase H. Indeed, we observe in Figures 2(e) and 2(f) that the plot of test accuracy versus H is unimodal for each η. It is thus necessary to tune H for the best generalization. When H is tuned together with other hyperparameters, such as the learning rate η, our Slow SDE approximation recommends setting H to at least Ω(η^{−1}) so that α := ηH does not vanish in the Slow SDE. Since a larger α gives a stronger regularization effect, the optimal H should be set to the largest value such that the Slow SDE does not lose track of Local SGD. Indeed, we empirically observe that when H is tuned optimally, α increases as η decreases, suggesting that the optimal H grows faster than Ω(η^{−1}). See Figure 5(f).
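One practical reading of this discussion: when the learning rate decays, rescale H so that α = ηH does not shrink, rather than keeping H constant across the decay. A sketch of such a rescaling rule (the helper name and concrete numbers are ours, not a prescription from the paper):

```python
def rescale_local_steps(H, eta_old, eta_new):
    # Keep alpha = eta * H fixed across a learning-rate decay: since the
    # drift strength in the Slow SDE depends on eta * H, decaying eta by a
    # factor r while multiplying H by r preserves the regularization strength.
    return max(1, round(H * eta_old / eta_new))

# Decaying eta by 10x suggests roughly 10x more local steps per round.
assert rescale_local_steps(H=8, eta_old=0.32, eta_new=0.032) == 80
assert rescale_local_steps(H=8, eta_old=0.32, eta_new=0.32) == 8
```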

4. CONCLUSIONS

In this paper, we analyze the long-term generalization behavior of Local SGD in the small learning rate regime by deriving the Slow SDE for Local SGD as a generalization of that for SGD (Li et al., 2021b). We attribute the generalization improvement over SGD to the larger drift term in the SDE for Local SGD. Our empirical validation shows that Local SGD indeed brings generalization benefits given a small learning rate and long enough training time. The main limitation of our work is that our analysis does not imply any direct theoretical separation between SGD and Local SGD in test accuracy; establishing such a separation requires a much deeper understanding of the loss landscape and the Slow SDEs and is left for future work. Another direction for future work is to design distributed training methods that provably generalize better than SGD based on the theoretical insights obtained from Slow SDEs.

A ADDITIONAL RELATED WORKS

Optimization aspect of Local SGD. Local SGD is a communication-efficient variant of parallel SGD, where multiple workers perform SGD independently and average the model parameters periodically. Dating back to Mann et al. (2009) and Zinkevich et al. (2010), this strategy has been widely adopted to reduce the communication cost and speed up training, both in data center distributed training (Chen & Huo, 2016; Zhang et al., 2014; Povey et al., 2014; Su & Chen, 2015) and in Federated Learning (McMahan et al., 2017; Kairouz et al., 2021). However, a line of convergence analyses (e.g., Stich, 2018; Yu et al., 2019; Khaled et al., 2020, discussed in Section 1) shows that taking local steps can hurt optimization. The error bound of Local SGD obtained by these works is typically inferior to that of SGD with the same global batch size for a fixed number of iterations/epochs and becomes worse as the number of local steps increases, revealing a trade-off between less communication and better optimization. In this paper, we are interested in the generalization aspect of Local SGD in the homogeneous setting, assuming the training loss can be optimized to a small value. Gradient noise and generalization. The effect of stochastic gradient noise on generalization has been studied from different angles, e.g., changing the order in which different patterns are learned (Li et al., 2019a) and inducing an implicit regularizer in the second-order SDE approximation (Smith et al., 2021; Li et al., 2019a). Our work follows a line of works studying the effect of noise through the lens of sharpness, which has long been believed to be related to generalization (Hochreiter & Schmidhuber, 1997; Neyshabur et al., 2017). Keskar et al. (2017) empirically observed that large-batch training leads to worse generalization and sharper minima than small-batch training. Wu et al. (2018); Hu et al. (2017); Ma & Ying (2021) showed that gradient noise destabilizes the training around sharp minima, and Kleinberg et al. (2018); Zhu et al. (2018); Xie et al. (2021); Ibayashi & Imaizumi (2021) quantitatively characterized how SGD escapes sharp minima.
The most related papers are Blanc et al. (2020); Damian et al. (2021); Li et al. (2021b), which focus on the training dynamics near a manifold of minima and study the effect of noise on sharpness (see also Section 3.2). Though the mathematical definition of sharpness may be vulnerable to the various symmetries in deep neural nets (Dinh et al., 2017), sharpness still appears to be one of the most promising tools for predicting generalization (Jiang et al., 2020; Foret et al., 2021).

Improving generalization in large-batch training. The generalization issue of large-batch (or full-batch) training has been observed as early as Bengio (2012) and LeCun et al. (2012). As mentioned in Section 1, the generalization issue of large-batch training could be due to the lack of a sufficient amount of stochastic noise. To make up for the missing noise in large-batch training, Krizhevsky (2014); Goyal et al. (2017) empirically discovered the Linear Scaling Rule for SGD, which suggests enlarging the learning rate proportionally to the batch size. Jastrzębski et al. (2017) adopted an SDE-based analysis to justify that this scaling rule indeed retains the same amount of noise as small-batch training (see also Section 3.1). However, the SDE approximation may fail if the learning rate is too large (Li et al., 2021a), especially in the early phase of training before the first learning rate decay (Smith et al., 2020). Shallue et al. (2019) demonstrated that the generalization gap between small- and large-batch training can also depend on many other training hyperparameters. Besides enlarging the learning rate, other approaches have also been proposed to reduce the gap, including training longer (Hoffer et al., 2017), learning rate warmup (Goyal et al., 2017), LARS (You et al., 2018), and LAMB (You et al., 2020).
In this paper, we focus on using Local SGD to improve generalization, but adding local steps is a generic training trick that can also be combined with others, e.g., Local LARS (Lin et al., 2020b) , Local Extrap-SGD (Lin et al., 2020a) .

B ADDITIONAL DISCUSSIONS

Connection to the conventional wisdom that the diffusion term matters more. As mentioned in Section 3.1, it is believed in the literature that a large diffusion term in the conventional SDE leads to good generalization. One may think that the diffusion term in the Slow SDE corresponds to that in the conventional SDE, and thus enlarging the diffusion term rather than the drift term should lead to better generalization. However, we note that both the diffusion and drift terms in the Slow SDEs result from the long-term effects of the diffusion term in the conventional SDE (the Slow SDEs become stationary if Σ = 0). This means our view characterizes the role of gradient noise in more detail and therefore goes one step beyond the conventional wisdom.

Slow SDEs for neural nets with modern training techniques. In modern neural net training, it is common to add normalization layers and weight decay (L2-regularization) for better optimization and generalization. However, these techniques lead to violations of our assumptions; e.g., no fixed point of the regularized loss exists (Li et al., 2020; Ahn et al., 2022). Still, a minimizer manifold can be expected to exist for the unregularized loss. Li et al. (2022) noted that the drift and diffusion around the manifold proceed faster in this case, and derived a Slow SDE for SGD that captures O((1/η) log(1/η)) discrete steps instead of O(1/η²). We believe that our analysis can also be extended to this case, and that adding local steps still has the effect of strengthening the drift term.

Discrepancy in sampling schemes. We argue that this discrepancy between theory and experiments on sampling schemes is minor. Though sampling without replacement is standard in practice, most previous works, e.g., Wang & Joshi (2019); Li et al. (2021a); Zhang et al. (2020), analyze sampling with replacement for technical simplicity and still yield meaningful results.
Moreover, even if we change the sampling scheme to with replacement, Local SGD can still improve the generalization of SGD (by merely adding local steps); see Appendix F for the experiments. We believe that the reasons for the better generalization of Local SGD under either sampling scheme are similar, and we leave the analysis for sampling without replacement for future work.

Procedure Sample() (for worker k; sampling without replacement):
    if all samples in the local shard have been used then   // synchronize
        draw a random permutation P of 1, . . . , |D| jointly with the other workers so that the same permutation is shared among all workers ;   // reshuffle the dataset
        Q_j^(k) ← P_{(k-1)N_loc+j} for all 1 ≤ j ≤ N_loc ;   // partition the dataset
        c^(k) ← 0 ;
    end
    for i = 1, . . . , B_loc do
        ξ̃_i ← the Q^(k)_{c^(k)+i}-th data point of D ;   // sample without replacement
        ξ_i ← A(ξ̃_i) ;   // apply data augmentation
    end
    c^(k) ← c^(k) + B_loc ;
    return (ξ_1, . . . , ξ_{B_loc}) ;
end

Algorithm 3: Parallel SGD on K Workers
Input: loss function ℓ(θ; ξ), initial parameter θ_0
Hyperparameters: total number of iterations T, learning rate η, local batch size B_loc
for t = 0, . . . , T-1 do
    for each worker k do in parallel
        (ξ_{k,t,1}, . . . , ξ_{k,t,B_loc}) ← Sample() ;   // sample a local batch
        g_{k,t} ← (1/B_loc) Σ_{i=1}^{B_loc} ∇ℓ(θ_t; ξ_{k,t,i}) ;   // compute the local gradient
    end
    g_t ← (1/K) Σ_{k=1}^{K} g_{k,t} ;   // all-reduce aggregation of local gradients
    θ_{t+1} ← θ_t - η g_t ;
end

In round s of Local SGD, each worker k instead runs H local steps:
for t = 0, . . . , H-1 do
    (ξ^(s)_{k,t,1}, . . . , ξ^(s)_{k,t,B_loc}) ← Sample() ;   // sample a local batch
    g^(s)_{k,t} ← (1/B_loc) Σ_{i=1}^{B_loc} ∇ℓ(θ^(s)_{k,t}; ξ^(s)_{k,t,i}) ;   // compute the local gradient
    θ^(s)_{k,t+1} ← θ^(s)_{k,t} - η g^(s)_{k,t} ;
end
and the global iterate is obtained by model averaging: θ̄^(s+1) ← (1/K) Σ_{k=1}^{K} θ^(s)_{k,H}.

Viewing each worker's H local steps as a short run of SGD with batch size B_loc, each local trajectory can be modeled by its own conventional SDE. The key difference between each of these SDEs and the SDE for SGD (3) is that the former has a larger diffusion term because the workers use batch size B_loc instead of B:

    dX(t) = -∇L(X) dt + √(η/B_loc) Σ^{1/2}(X) dW_t.   (10)

Lin et al. (2020b) then argue that the total amount of "noise" in the training dynamics of Local SGD is larger than that of SGD.
However, it is hard to see whether the total amount of noise is indeed larger, since the model averaging step at the end of each round can reduce the variance in training and may cancel the effect of the larger diffusion terms. More formally, a complete modeling of Local SGD following this idea should view the sequence of global iterates {θ̄^(s)} as a Markov process {X^(s)}. Let P_X(x, B, t) denote the distribution of X(t) in (3) with the initial condition X(0) = x. Then the Markov transition should be X^(s+1) = (1/K) Σ_{k=1}^{K} X^(s)_{k,H}, where X^(s)_{1,H}, . . . , X^(s)_{K,H} are K independent samples from P_X(X^(s), B_loc, Hη), i.e., samples from (10). Consider one round of model averaging. It is true that P_X(X^(s), B_loc, Hη) may have a larger variance than the corresponding SGD baseline P_X(X^(s), B, Hη) because the former has a smaller batch size. However, it is unclear whether X^(s+1) also has a larger variance than P_X(X^(s), B, Hη): since X^(s+1) is the average of K independent samples, we have to compare 1/K times the variance of P_X(X^(s), B_loc, Hη) with the variance of P_X(X^(s), B, Hη), and it is unclear which one is larger. In the special case where Hη is small, P_X(X^(s), B_loc, Hη) is approximately equal to the Gaussian distribution

    N( X^(s) - ηH ∇L(X^(s)),  (η²H/B_loc) Σ(X^(s)) ).

Averaging over K samples then gives N( X^(s) - ηH ∇L(X^(s)), (η²H/B) Σ(X^(s)) ), which is exactly the same as the Gaussian approximation of the SGD baseline. This means there do exist cases where Lin et al. (2020b)'s argument does not give a good separation between Local SGD and SGD. Moreover, we do not gain further insights from this modeling since it is hard to see how model averaging interacts with the SDEs.
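The Gaussian-approximation argument above is easy to check by Monte Carlo in a toy 1-D setting (all constants illustrative): averaging K independent one-round iterates with per-round variance η²Hσ²/B_loc yields variance η²Hσ²/(K·B_loc) = η²Hσ²/B, matching the batch-B baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, H, sigma2 = 0.01, 50, 4.0   # illustrative constants
K, B_loc = 8, 32
B = K * B_loc
n = 200_000                      # number of Monte Carlo trials

# average of K workers, each a sample of the one-round Gaussian with batch B_loc
worker_std = np.sqrt(eta**2 * H * sigma2 / B_loc)
averaged = rng.normal(0.0, worker_std, size=(n, K)).mean(axis=1)

# one sample of the one-round Gaussian for the batch-B SGD baseline
baseline = rng.normal(0.0, np.sqrt(eta**2 * H * sigma2 / B), size=n)

# both empirical variances match the theoretical value η²Hσ²/B
print(averaged.var(), baseline.var(), eta**2 * H * sigma2 / B)
```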

E ADDITIONAL INTERPRETATION OF THE SLOW SDES

E.1 UNDERSTANDING THE DIFFUSION TERM IN THE SLOW SDE

So far, we have discussed why adding local steps enlarges the drift term in the Slow SDE and why enlarging the drift term can benefit generalization. Here we remark that another way to accelerate the corresponding semi-gradient method for minimizing the implicit regularizer is to reduce the diffusion term, so that the trajectory follows the drift term more closely. More formally, we propose the following:

Hypothesis E.1. Starting at a minimizer ζ_0 ∈ Γ, run the (κ_1, κ_2)-Slow SDE and the (κ'_1, κ_2)-Slow SDE respectively for the same amount of time T > 0 and obtain ζ(T), ζ'(T). If Σ_∥ ≢ 0 and κ_1 < κ'_1, then the expected test accuracy at ζ(T) is better than that at ζ'(T).

Figure 3: Test accuracy improves as we increase K with fixed η and H, which reduces the diffusion term while keeping the drift term untouched. (a) CIFAR-10, H = 600 for K > 1 (K = 1, 2, 4, 8, 16, 64). (b) ImageNet, H = 78 for K > 1 (K = 1, 4, 8, 16, 64, 256). See Appendix M.4 for details.

Here we exclude the case of Σ_∥ ≡ 0 because in this case the diffusion term in the Slow SDE is always zero. To verify Hypothesis E.1, we set the product α := ηH to be large, keep H and η fixed, increase the number of workers K, and compare the generalization performance after a fixed number of training steps (but after different numbers of epochs). This case corresponds to the (1/(K B_loc), 1/(2 B_loc))-Slow SDE, so adding more workers reduces the diffusion term. As shown in Figure 3, a higher test accuracy is indeed achieved for larger K.

Implication: Enlarging the learning rate is not equally effective as adding local steps. Given that Local SGD improves generalization by strengthening the drift term, it is natural to wonder whether enlarging the learning rate of SGD would lead to similar improvements. While it is true that enlarging the learning rate effectively increases the drift term, it also increases the diffusion term simultaneously, which can hinder the implicit regularization by Hypothesis E.1. In contrast, adding local steps does not change the diffusion term. As shown in Figure 6(a), even when the learning rate of SGD is increased, SGD still underperforms Local SGD by about 2% in test accuracy. On the other hand, in the special case where Σ_∥ ≡ 0, Hypothesis E.1 does not hold, and enlarging the learning rate by √K results in the same Slow SDE as adding local steps (see Appendix G for the derivation). Then these two actions should produce the same generalization improvement, unless the learning rate is so large that the Slow SDE loses track of the training dynamics.
As an example of such a special case, an experiment with label noise regularization is presented in Figure 8 .
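The intuition behind Hypothesis E.1 can be sketched in one dimension: with the drift fixed, shrinking the diffusion coefficient lets the trajectory follow the semi-gradient flow on the implicit regularizer more closely, ending at a smaller regularizer value. Below, R(ζ) = ½ζ² stands in for the implicit regularizer; this is an illustrative Euler–Maruyama toy, not the paper's Slow SDE, and all constants are assumptions.

```python
import numpy as np

def mean_regularizer(sigma, n_paths=20_000, T=3.0, dt=1e-2, zeta0=1.0, seed=0):
    """Euler-Maruyama for dζ = -R'(ζ) dt + σ dW with R(ζ) = ½ζ²;
    returns the Monte Carlo estimate of E[R(ζ(T))]."""
    rng = np.random.default_rng(seed)
    zeta = np.full(n_paths, float(zeta0))
    for _ in range(int(T / dt)):
        zeta += -zeta * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
    return float(0.5 * np.mean(zeta**2))

print(mean_regularizer(sigma=0.1))   # small diffusion: trajectory hugs the drift flow
print(mean_regularizer(sigma=1.0))   # large diffusion: much larger E[R(ζ(T))]
```

Of course, Hypothesis E.1 is a statement about test accuracy rather than the regularizer value; the simulation only illustrates the mechanism by which a smaller diffusion term strengthens the effective regularization.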

E.2 THE EFFECT OF GLOBAL BATCH SIZE ON GENERALIZATION

In this section, we discuss the effect of the global batch size on the generalization of Local SGD. Given that the computation power of a single worker is limited, we consider the case where the local batch size B_loc is fixed and the global batch size B = K B_loc is tuned by adding or removing workers. This scenario is relevant in practice because one may want to know the maximum parallelism possible for training the neural net without causing generalization degradation. For SGD, previous works have proposed the Linear Scaling Rule (LSR) (Krizhevsky, 2014; Goyal et al., 2017; Jastrzębski et al., 2017): scaling the learning rate η → κη linearly with the global batch size B → κB yields the same conventional SDE (3) under a constant epoch budget, hence leading to almost the same generalization performance as long as the SDE approximation does not fail. We show in Theorem H.1 that the LSR does not change the Slow SDE of SGD either. Experiments in Figure 4 show that the LSR indeed holds nicely when we continue training with small learning rates from the same CIFAR-10 and ImageNet checkpoints as in Figure 2. Here we choose K = 16 and K = 256 as the base settings for CIFAR-10 and ImageNet, respectively, and then tune the learning rate to maximize the test accuracy. As shown in Figures 4(a) and 4(b), the optimal learning rate turns out to be small enough that the LSR can be applied to scale the global batch size with only a minor change in test accuracy. Now, assuming the learning rate is scaled as in the LSR, we study how to tune the number of local steps H of Local SGD for better generalization. A natural choice is to tune H in the base settings and keep α unchanged via the scaling H → H/κ. Then the following SDE, in which the drift-II term is rescaled by a positive factor compared with (4), can be derived (see Theorem H.2).
    dζ(t) = P_ζ[ (1/√B) Σ_∥^{1/2}(ζ) dW_t   (a) diffusion (unchanged)
            - (1/(2B)) ∇³L(ζ)[Σ^♢(ζ)] dt   (b) drift-I (unchanged)
            - ((κK-1)/(2B)) ∇³L(ζ)[Ψ(ζ)] dt ]   (c) drift-II (rescaled).   (13)

Again, when α is large, we can follow the argument in Section 3.3.2 to approximate Ψ(ζ) ≈ Σ^♢(ζ) and obtain the following (1/B, κK/B)-Slow SDE:

    dζ(t) = P_ζ[ (1/√B) Σ_∥^{1/2}(ζ) dW(t) - (κK/(2B)) ∇³L(ζ)[Σ^♢(ζ)] dt ].   (14)

The drift term of the above SDE is always stronger than that of SGD (7) as long as there is more than one worker after the scaling (i.e., κK > 1). As expected from Hypothesis 3.1, we observed in the experiments that the generalization performance of Local SGD is always better than or at least comparable to that of SGD across different batch sizes (see Figures 4(a) and 4(b)). Taking a closer look at the drift term in the Slow SDE (14), we find that it scales linearly with κ. According to Hypothesis 3.1, the SDE is expected to generalize better when adding more workers (κ > 1) and to generalize worse when removing workers (κ < 1). For the latter case, we indeed observed that the test accuracy of Local SGD drops when removing workers. For the case of adding workers, however, we also need to take into account that the LSR specifies a larger learning rate and hence causes a larger SDE approximation error for the same α, which may cancel the generalization improvement brought by strengthening the drift term. In the experiments, we observed that the test accuracy does not rise when adding more workers to the base settings. Since α also controls the regularization strength (Section 3.3.3), it would be beneficial to decrease α for large batch sizes so as to better trade off regularization strength against approximation quality. In Figures 4(c) and 4(d), we plot the optimal value of α for each batch size, and we indeed observed that the optimal α drops as we scale up K.
Conversely, a smaller batch size (and hence a smaller learning rate) allows for using a larger α to enhance regularization while still keeping a low approximation error (Theorem 3.3). The test accuracy curves in Figures 4(a) and 4(b) indeed show that setting a larger α can compensate for the accuracy drop when reducing the batch size.
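The invariance of the conventional dynamics under the LSR can be sanity-checked in closed form on a 1-D quadratic (an illustrative toy, not the paper's setting): for L(θ) = ½λθ² with gradient-noise variance σ²/B, the discrete SGD update θ_{t+1} = (1-ηλ)θ_t - ηz_t has stationary variance ησ²/(Bλ(2-ηλ)), which is unchanged up to an O(η) correction under η → κη, B → κB.

```python
def stationary_var(eta, B, lam=1.0, sigma2=1.0):
    """Stationary variance of θ_{t+1} = (1-ηλ)θ_t - η z_t with Var[z_t] = σ²/B.
    Fixed point of V = (1-ηλ)² V + η² σ²/B."""
    return eta * sigma2 / (B * lam * (2.0 - eta * lam))

base = stationary_var(eta=0.01, B=64)
for kappa in (2, 4, 8):
    scaled = stationary_var(eta=0.01 * kappa, B=64 * kappa)
    print(kappa, base, scaled)   # agree up to an O(η) correction
```

The residual mismatch is the (2-ηλ) factor, which is exactly the discretization effect that makes the SDE approximation, and hence the LSR, degrade at large learning rates.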

F ADDITIONAL EXPERIMENTAL RESULTS

In this section, we present additional experimental results to further verify our findings.

Figure 6: Additional experimental results on CIFAR-10. See Appendix M.3 for details.

SGD generalizes worse even with extensively tuned learning rates. In Figure 6(a), we run SGD from both random initialization and the pre-trained model for another 3,000 epochs with various learning rates and report the test accuracy. None of the SGD runs beat Local SGD with the fixed learning rate η = 0.32. Therefore, the inferior performance of SGD in Figures 2(a) and 2(b) is not due to an improper learning rate, and Local SGD indeed generalizes better.

SGD with larger batch sizes performs no better. In Figure 6(b), we enlarge the batch size of SGD and report the test accuracy for various learning rates. SGD with larger batch sizes performs no better, and none of the SGD runs outperform Local SGD with the fixed learning rate η = 0.32. This result is unsurprising since it is well established in the literature (Jastrzębski et al., 2017; Smith et al., 2020; Keskar et al., 2017) that a larger batch size typically leads to worse generalization. See Appendix A for a survey of empirical and theoretical works on understanding and resolving this phenomenon.

Sampling with or without replacement does not matter. Note that there is a slight discrepancy in sampling schemes between our theoretical and experimental setups: the update rules (1) and (2) assume that data are sampled with replacement, while most experiments use sampling without replacement (Appendix C). To eliminate the effect of this discrepancy, we conduct additional experiments on Post-local SGD using sampling with replacement (see Figure 6).

G SLOW SDES FOR LOCAL SGD WITH LABEL NOISE REGULARIZATION

Theorem G.1. Under label noise regularization, the Slow SDE of Local SGD is

    dζ(t) = -(1/(4B)) ∇_Γ [ tr(∇²L(ζ)) + (K-1) · tr(F(2Hη ∇²L(ζ))) / (2Hη) ] dt,   (15)

where F(x) := ∫₀ˣ ψ(y) dy and is interpreted as a matrix function.
Additionally, ∇_Γ f stands for the gradient of a function f projected onto the tangent space of Γ.

Proof. See Appendix L.

Note that the magnitude of the RHS of (15) becomes larger as H increases. By letting H go to infinity, we further have the following theorem.

Theorem G.2. As the number of local steps H goes to infinity, the Slow SDE of Local SGD with label noise (15) simplifies to

    dζ(t) = -(K/(4B)) ∇_Γ tr(∇²L(ζ)) dt.   (16)

Proof. We obtain the result by simply taking the limit. By L'Hôpital's rule,

    lim_{x→+∞} F(ax)/x = lim_{x→+∞} dF(ax)/dx = lim_{x→+∞} a ψ(ax) = a.

Therefore,

    lim_{H→+∞} tr(F(2Hη ∇²L(ζ))) / (2Hη) = tr(∇²L(ζ)).   (17)

Substituting (17) into (15) yields (16).

As introduced in Section 3.3, the Slow SDE for SGD with label noise regularization has the following form:

    dζ(t) = -(1/(4B)) ∇_Γ tr(∇²L(ζ)) dt,   (18)

which is a deterministic flow that keeps reducing the trace of the Hessian. As the trace of the Hessian can be seen as a measure of the sharpness of the local loss landscape, (18) indicates that SGD with label noise regularization has an implicit bias toward flatter minima, which presumably promotes generalization (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Neyshabur et al., 2017); this type of sharpness-reduction dynamics was first characterized by Blanc et al. (2020). From Theorems G.1 and G.2, we can conclude that Local SGD accelerates the process of sharpness reduction, thereby leading to better generalization. Furthermore, the regularization effect gets stronger for larger H and approaches K times that of SGD as H → ∞. We also conduct experiments on non-augmented CIFAR-10 with label noise regularization to verify our conclusion. As shown in Figure 7, increasing the number of local steps indeed gives better generalization performance.
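The limit in the proof above can be checked numerically once a concrete ψ is plugged in. Purely for illustration, we assume ψ(y) = 1 - e^{-y} (any ψ with ψ(y) → 1 behaves the same; the paper's ψ is defined elsewhere), so F(x) = x - 1 + e^{-x} and F(ax)/x → a as x → +∞, mirroring the L'Hôpital step.

```python
import math

def F(x):
    """F(x) = ∫₀ˣ ψ(y) dy with the illustrative choice ψ(y) = 1 - e^{-y}."""
    return x - 1.0 + math.exp(-x)

a = 0.7   # stands in for an eigenvalue of ∇²L(ζ); illustrative value
for x in (10.0, 100.0, 1000.0):   # x plays the role of 2Hη as H grows
    print(x, F(a * x) / x)        # converges to a = 0.7
```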

G.2 THE EQUIVALENCE OF ENLARGING THE LEARNING RATE AND ADDING LOCAL STEPS

In this subsection, we explain in detail why training with label noise regularization is a special case where enlarging the learning rate of SGD can bring the same generalization benefit as adding local steps. When we scale up the learning rate of SGD as η → κη (while keeping other hyperparameters unchanged), the corresponding Slow SDE is (18) with time horizon κ²T instead of T, since SGD now tracks a continuous time interval of κ²η² per step instead of η². After rescaling the time horizon to T so that SGD again tracks a continuous interval of η² per step, we obtain

    dζ(t) = -(κ²/(4B)) ∇_Γ tr(∇²L(ζ)) dt.   (19)

Let κ = √K in (19) and we obtain the same Slow SDE as (16), which is the Slow SDE for Local SGD with a large number of local steps. In Figure 8, we present experiments verifying that SGD indeed achieves test accuracy comparable to that of Local SGD with a large H if its learning rate is scaled up to √K times that of Local SGD.

H DERIVING THE SLOW SDE AFTER APPLYING THE LSR

In this section, we derive the Slow SDEs for SGD and Local SGD after applying the LSR in Appendix E.2. The results are formally summarized in the following theorems.

Theorem H.1 (Slow SDE for SGD after applying the LSR). Let Assumptions 3.1 to 3.3 hold. Assume that we run SGD with learning rate η' = κη and number of workers K' = κK for some constant κ > 0. Let T > 0 be a constant and ζ(t) be the solution to (7) with the initial condition ζ(0) = Φ(θ_0) ∈ Γ. Then for any C³-smooth function g(θ),

    max_{0 ≤ s ≤ κT/η'²} | E[g(Φ(θ_s))] - E[g(ζ(s η'²/κ))] | = Õ(η'^0.25),

where Õ(·) hides log factors and constants that are independent of η' but can depend on g(θ).

Proof. Replacing B with κB in the original Slow SDE for SGD (7) gives the following Slow SDE:

    dζ(t) = P_ζ[ (1/√(κB)) Σ_∥^{1/2}(ζ) dW_t   (a) diffusion
            - (1/(2κB)) ∇³L(ζ)[Σ^♢(ζ)] dt ]   (b) drift-I.   (20)

Note that the continuous time horizon for (20) is κT instead of T since, after applying the LSR, SGD tracks a continuous time interval of κ²η² per step instead of η², while the total number of steps is scaled down by κ. We can then rescale time to recover (7) on the horizon T.

Theorem H.2 (Slow SDE for Local SGD after applying the LSR). Let Assumptions 3.1 to 3.3 hold. Assume that we run Local SGD with learning rate η' = κη, number of workers K' = κK, and number of local steps H' = α/(κη) for some constants α, κ > 0. Let T > 0 be a constant and ζ(t) be the solution to the following Slow SDE (21) with the initial condition ζ(0) = Φ(θ̄^(0)) ∈ Γ:

    dζ(t) = P_ζ[ (1/√B) Σ_∥^{1/2}(ζ) dW_t   (a) diffusion (unchanged)
            - (1/(2B)) ∇³L(ζ)[Σ^♢(ζ)] dt   (b) drift-I (unchanged)
            - ((κK-1)/(2B)) ∇³L(ζ)[Ψ(ζ)] dt ]   (c) drift-II (rescaled).   (21)

Then for any C³-smooth function g(θ),

    max_{0 ≤ s ≤ κT/(H'η'²)} | E[g(Φ(θ̄^(s)))] - E[g(ζ(s H' η'²/κ))] | = Õ(η'^0.25),

where Õ(·) hides log factors and constants that are independent of η' but can depend on g(θ).

Proof.
Replacing B with κB in the original Slow SDE for Local SGD (4) gives the following Slow SDE:

    dζ(t) = P_ζ[ (1/√(κB)) Σ_∥^{1/2}(ζ) dW_t   (a) diffusion
            - (1/(2κB)) ∇³L(ζ)[Σ^♢(ζ)] dt   (b) drift-I
            - ((κK-1)/(2κB)) ∇³L(ζ)[Ψ(ζ)] dt ]   (c) drift-II.   (22)

Note that the continuous time horizon for (22) is κT instead of T since, after applying the LSR, Local SGD tracks a continuous time interval of κ²η² per step instead of η², while the total number of steps is scaled down by κ. We can then rescale time to recover (21) on the horizon T.

I PROOF OF THEOREM 3.1

This section presents the proof of Theorem 3.1. First, we introduce some notations that will be used throughout this section. For the sequence of Local SGD iterates {θ^(s)_{k,t} : k ∈ [K], 0 ≤ t ≤ H, s ≥ 0}, we introduce an auxiliary sequence {û_t}_{t∈ℕ} consisting of GD iterates started from θ̄^(0):

    û_0 = θ̄^(0),    û_{t+1} ← û_t - η ∇L(û_t).

For convenience, let û^(s)_t := û_{sH+t} and z_{k,sH+t} := z^(s)_{k,t}; we will use û^(s)_t and û_{sH+t}, and z_{k,t} and z_{k,sH+t}, interchangeably. Recall that we have assumed that L is C³-smooth with bounded second- and third-order derivatives. Let ν_2 := sup_{θ∈R^d} ∥∇²L(θ)∥_2 and ν_3 := sup_{θ∈R^d} ∥∇³L(θ)∥_2. Since ∇ℓ(θ; ξ) is bounded, the gradient noise z^(s)_{k,t} is also bounded; we denote by σ_max an upper bound such that ∥z^(s)_{k,t}∥_2 ≤ σ_max holds for all s, k, t.

To prove Theorem 3.1, we will show that both the Local SGD iterates θ̄^(s) and the SGD iterates w_{sH} track the GD iterates û_{sH} closely with high probability. For each worker k, define the following sequence {Ẑ_{k,t} : t ≥ 0}, which will be used in the proof for bounding the overall effect of noise:

    Ẑ_{k,t} = Σ_{τ=0}^{t-1} [ Π_{l=τ+1}^{t-1} (I - η ∇²L(û_l)) ] z_{k,τ},    Ẑ_{k,0} = 0,  ∀k ∈ [K].

The following lemma shows that Ẑ_{k,t} is concentrated around the origin.

Lemma I.1 (Concentration property of {Ẑ_{k,t}}). With probability at least 1 - δ, the following holds simultaneously for all k ∈ [K], 0 ≤ t < ⌊T/η⌋:

    ∥Ẑ_{k,t}∥_2 ≤ Ĉ_1 σ_max √( (2T/η) log(2TK/(δη)) ),

where Ĉ_1 := exp(T ν_2).

Proof.
For each Ẑ_{k,t}, construct a sequence {Ẑ_{k,t,t'}}_{t'=0}^{t}:

    Ẑ_{k,t,t'} := Σ_{τ=0}^{t'-1} [ Π_{l=τ+1}^{t-1} (I - η ∇²L(û_l)) ] z_{k,τ},    Ẑ_{k,t,0} = 0.

Since ∥∇²L(û_l)∥_2 ≤ ν_2 for all l ≥ 0, the following holds for all 0 ≤ τ < t-1 and 0 < t < ⌊T/η⌋:

    ∥ Π_{l=τ+1}^{t-1} (I - η ∇²L(û_l)) ∥_2 ≤ (1 + ν_2 η)^t ≤ exp(T ν_2) = Ĉ_1.

So {Ẑ_{k,t,t'}}_{t'=0}^{t} is a martingale with ∥Ẑ_{k,t,t'} - Ẑ_{k,t,t'-1}∥_2 ≤ Ĉ_1 σ_max. Since Ẑ_{k,t} = Ẑ_{k,t,t}, by the Azuma-Hoeffding inequality,

    P( ∥Ẑ_{k,t}∥_2 ≥ ε' ) ≤ 2 exp( -ε'² / (2t (Ĉ_1 σ_max)²) ).

Taking a union bound over all k ∈ [K] and 0 ≤ t ≤ ⌊T/η⌋, we conclude that with probability at least 1 - δ,

    ∥Ẑ_{k,t}∥_2 ≤ Ĉ_1 σ_max √( (2T/η) log(2TK/(δη)) ),  ∀ 0 ≤ t < ⌊T/η⌋, k ∈ [K].

Published as a conference paper at ICLR 2023

The following lemma states that, with high probability, the Local SGD iterates θ^(s)_{k,t} and θ̄^(s) closely track the gradient descent iterates û_{sH} for ⌊T/(Hη)⌋ rounds.

Lemma I.2. For δ = O(poly(η)), the following inequalities hold with probability at least 1 - δ:

    ∥θ^(s)_{k,t} - û_{sH+t}∥_2 ≤ Ĉ_3 √( η log(1/(ηδ)) ),  ∀k ∈ [K], 0 ≤ s < ⌊T/(Hη)⌋, 0 ≤ t ≤ H,
    ∥θ̄^(s) - û_{sH}∥_2 ≤ Ĉ_3 √( η log(1/(ηδ)) ),  ∀ 0 ≤ s ≤ ⌊T/(Hη)⌋,

where Ĉ_3 is a constant independent of η and H.

Proof. Let Δ^(s)_{k,t} := θ^(s)_{k,t} - û^(s)_t and Δ̄^(s) := θ̄^(s) - û^(s)_0 be the differences between the Local SGD and GD iterates. According to the update rules for θ^(s)_{k,t} and û^(s)_t,

    θ^(s)_{k,t+1} = θ^(s)_{k,t} - η ∇L(θ^(s)_{k,t}) - η z^(s)_{k,t},   (23)
    û^(s)_{t+1} = û^(s)_t - η ∇L(û^(s)_t).   (24)

Subtracting (24) from (23) gives

    Δ^(s)_{k,t+1} = Δ^(s)_{k,t} - η ( ∇L(θ^(s)_{k,t}) - ∇L(û^(s)_t) ) - η z^(s)_{k,t}
                 = (I - η ∇²L(û^(s)_t)) Δ^(s)_{k,t} - η z^(s)_{k,t} + η v^(s)_{k,t},   (25)

where v^(s)_{k,t} is a remainder term with norm ∥v^(s)_{k,t}∥_2 ≤ (ν_3/2) ∥Δ^(s)_{k,t}∥_2². For the s-th round of Local SGD, we can apply (25) t times to obtain the following:

    Δ^(s)_{k,t} = [ Π_{τ=0}^{t-1} (I - η ∇²L(û^(s)_τ)) ] Δ^(s)_{k,0} - η T + η Σ_{τ=0}^{t-1} [ Π_{l=τ+1}^{t-1} (I - η ∇²L(û^(s)_l)) ] v^(s)_{k,τ},   (26)

where T := Σ_{τ=0}^{t-1} [ Π_{l=τ+1}^{t-1} (I - η ∇²L(û^(s)_l)) ] z^(s)_{k,τ}.
Here, T can be expressed in the following form:

    T = Ẑ_{k,sH+t} - [ Π_{l=sH}^{sH+t-1} (I - η ∇²L(û_l)) ] Ẑ_{k,sH}.

Substituting in t = H and averaging over the workers, we derive the following recursion:

    Δ̄^(s+1) = (1/K) Σ_{k∈[K]} Δ^(s)_{k,H}
             = [ Π_{τ=0}^{H-1} (I - η ∇²L(û^(s)_τ)) ] Δ̄^(s)
               - (η/K) Σ_{k∈[K]} Ẑ_{k,(s+1)H}
               + (η/K) Σ_{k∈[K]} [ Π_{l=sH}^{(s+1)H-1} (I - η ∇²L(û_l)) ] Ẑ_{k,sH}
               + (η/K) Σ_{k∈[K]} Σ_{τ=0}^{H-1} [ Π_{l=τ+1}^{H-1} (I - η ∇²L(û^(s)_l)) ] v^(s)_{k,τ}.   (27)

Applying (27) s times yields

    Δ̄^(s) = - (η/K) Σ_{k∈[K]} Ẑ_{k,sH}
             + (η/K) Σ_{r=0}^{s-1} Σ_{τ=0}^{H-1} Σ_{k∈[K]} [ Π_{l=rH+τ+1}^{sH-1} (I - η ∇²L(û_l)) ] v^(r)_{k,τ}.   (28)

Substituting (28) into (26), we have

    Δ^(s)_{k,t} = - (η/K) Σ_{k'∈[K]} Ẑ_{k',sH} - η Ẑ_{k,sH+t} + η [ Π_{l=sH}^{sH+t-1} (I - η ∇²L(û_l)) ] Ẑ_{k,sH}
                  + (η/K) Σ_{r=0}^{s-1} Σ_{τ=0}^{H-1} Σ_{k'∈[K]} [ Π_{l=rH+τ+1}^{sH+t-1} (I - η ∇²L(û_l)) ] v^(r)_{k',τ}
                  + η Σ_{τ=0}^{t-1} [ Π_{l=sH+τ+1}^{sH+t-1} (I - η ∇²L(û_l)) ] v^(s)_{k,τ}.

By the Cauchy-Schwarz and triangle inequalities, we have

    ∥Δ^(s)_{k,t}∥_2 ≤ (η/K) Σ_{k'∈[K]} ∥Ẑ_{k',sH}∥_2 + η ∥Ẑ_{k,sH+t}∥_2 + η Ĉ_1 ∥Ẑ_{k,sH}∥_2
                      + (η Ĉ_1 ν_3 / (2K)) Σ_{r=0}^{s-1} Σ_{τ=0}^{H-1} Σ_{k'∈[K]} ∥Δ^(r)_{k',τ}∥_2²
                      + (η Ĉ_1 ν_3 / 2) Σ_{τ=0}^{t-1} ∥Δ^(s)_{k,τ}∥_2²,   (29)

where Ĉ_1 = exp(ν_2 T). Below we prove by induction that for δ = O(poly(η)), if

    ∥Ẑ_{k,t}∥_2 ≤ Ĉ_1 σ_max √( (2T/η) log(2TK/(ηδ)) ),  ∀ 0 ≤ t < ⌊T/η⌋, k ∈ [K],   (30)

then there exists a constant Ĉ_2 such that for all k ∈ [K], 0 ≤ s < ⌊T/(ηH)⌋ and 0 ≤ t ≤ H,

    ∥Δ^(s)_{k,t}∥_2 ≤ Ĉ_2 √( η log(2TK/(ηδ)) ).   (31)

First, ∥Δ^(0)_{k,0}∥_2 = 0 for all k ∈ [K], and hence (31) holds. Assuming that (31) holds for all Δ^(r)_{k',τ} with k' ∈ [K] and either (0 ≤ r < s, 0 ≤ τ ≤ H) or (r = s, 0 ≤ τ < t), then by (29), the following holds for all k ∈ [K]:

    ∥Δ^(s)_{k,t}∥_2 ≤ 3 Ĉ_1² σ_max √( 2Tη log(2TK/(ηδ)) ) + Ĉ_1 Ĉ_2² T η ν_3 log(2TK/(ηδ)).

Let Ĉ_2 ≥ 6 Ĉ_1² σ_max √(2T). Then for sufficiently small η, (31) holds. By Lemma I.1, (30) holds with probability at least 1 - δ. Furthermore, notice that θ̄^(s) - û_{sH} = (1/K) Σ_{k∈[K]} Δ^(s-1)_{k,H}. Hence we have the lemma.

The iterates of standard SGD can be viewed as the local iterates of a single worker with ⌊T/η⌋ local steps.
Therefore, we can directly apply Lemma I.2 and obtain the following corollary about the SGD iterates w_t.

Corollary I.1. For δ = O(poly(η)), the following holds with probability at least 1 - δ:

    ∥w_{sH} - û_{sH}∥_2 ≤ Ĉ_3 √( η log(1/(ηδ)) ),  ∀ 0 ≤ s ≤ ⌊T/(Hη)⌋,

where Ĉ_3 is the same constant as in Lemma I.2.

Applying Lemma I.2 and Corollary I.1 and taking a union bound, we obtain Theorem 3.1.
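Theorem 3.1 says that SGD and Local SGD both track the GD trajectory with deviation Õ(√η) over Θ(1/η) steps. This can be illustrated on a toy quadratic (an illustrative sketch with assumed constants, not the paper's code). We couple SGD to the averaged per-worker noise; note that on a quadratic, Local SGD after averaging coincides exactly with this coupled SGD run, since the dynamics are linear (∇³L = 0), so the point here is only the tracking of GD.

```python
import numpy as np

def deviations(eta, T=1.0, K=4, H=10, sigma=1.0, d=5, seed=0):
    """Run GD, SGD, and Local SGD from the same init on L(θ) = ½‖θ‖²
    for about T/η steps; return final ℓ2 deviations of SGD and Local SGD from GD."""
    rng = np.random.default_rng(seed)
    rounds = int(T / eta) // H
    u = np.ones(d)                       # GD iterate
    w = np.ones(d)                       # SGD iterate (global batch)
    workers = np.ones((K, d))            # Local SGD workers
    for _ in range(rounds):
        for _ in range(H):
            z = sigma * rng.normal(size=(K, d))        # per-worker gradient noise
            u = u - eta * u                            # GD
            w = w - eta * (w + z.mean(axis=0))         # SGD, coupled to averaged noise
            workers = workers - eta * (workers + z)    # H local steps per worker
        workers = np.tile(workers.mean(axis=0), (K, 1))   # model averaging
    return np.linalg.norm(w - u), np.linalg.norm(workers[0] - u)

for eta in (0.1, 0.01, 0.001):
    print(eta, deviations(eta))   # both deviations shrink roughly like √η
```

Differences between SGD and Local SGD only emerge for non-quadratic losses, which is exactly where the ∇³L-dependent drift terms of the Slow SDEs come into play.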

J PROOF OUTLINE OF MAIN THEOREMS

We adopt the general framework proposed by Li et al. (2019a) to bound the closeness of discrete algorithms and SDE solutions via the method of moments. However, their framework is not directly applicable to our case: it provides approximation guarantees for discrete algorithms with learning rate η over O(η⁻¹) steps, while we want to capture Local SGD over O(η⁻²) steps. To overcome this difficulty, we treat R_grp := ⌊1/(α η^β)⌋ rounds as one "giant step" of Local SGD with an "effective" learning rate η^{1-β}, where β is a constant in (0, 1), and derive recursive formulas to compute the moments of the change in every step, every round, and every R_grp rounds. The formulation of the recursions requires a detailed analysis of the limiting dynamics of the iterate and careful control of approximation errors.

The dynamics of the iterate can be divided into two phases: the approaching phase (Phase 1) and the drift phase (Phase 2). The approaching phase lasts for only O(log(1/η)) rounds, during which the iterate is quickly driven to the minimizer manifold by the negative gradient and ends up within only Õ(√η) of Γ (see Appendix K.5). After that, the iterate enters the drift phase and moves in the tangent space of Γ while staying close to Γ (see Appendix K.6). The closeness of the iterates (local and global) to Γ is summarized in the following theorem.

Theorem J.1 (Closeness of the iterates to Γ). For δ = O(poly(η)), with probability at least 1 - δ, for all O(log(1/η)) ≤ s ≤ ⌊T/(Hη²)⌋,

    Φ(θ̄^(s)) ∈ Γ,    ∥θ̄^(s) - Φ(θ̄^(s))∥_2 = O( √( η log(1/(ηδ)) ) ).

Also, for all O(log(1/η)) ≤ s < ⌊T/(Hη²)⌋, k ∈ [K] and 0 ≤ t ≤ H,

    ∥θ^(s)_{k,t} - Φ(θ̄^(s))∥_2 = O( √( η log(1/(ηδ)) ) ).

Here, O(·) hides constants independent of η and δ. To control the approximation errors, we also provide a high-probability bound for the change of the manifold projection within R_grp rounds.

Theorem J.2 (High-probability bound for the change of the manifold projection).
For δ = O(poly(η)), with probability at least 1 - δ, for all 0 ≤ s ≤ ⌊T/(Hη²)⌋ - R_grp and 0 ≤ r ≤ R_grp,

    Φ(θ̄^(s)), Φ(θ̄^(s+r)) ∈ Γ,    ∥Φ(θ̄^(s+r)) - Φ(θ̄^(s))∥_2 = O( η^{0.5-0.5β} √( log(1/(ηδ)) ) ),

where O(·) hides constants independent of η and δ.

The proofs of Theorems J.1 and J.2 are based on the analysis of the dynamics of the iterate and are presented in Appendix K.7. Utilizing Theorems J.1 and J.2, we move on to estimating the first and second moments of the change of the manifold projection every R_grp rounds. However, the randomness during training might drive the iterate far from the manifold (though with low probability), making the dynamics intractable. To tackle this issue, we construct a well-behaved auxiliary sequence {θ̃^(s)_{k,t}}, which is constrained to a neighborhood of Γ and equals the original sequence {θ^(s)_{k,t}} with high probability (see Definition K.5). We can then formulate recursions for the change of the manifold projection of the auxiliary sequence using the nice properties near Γ. The moment estimates are summarized in Theorem K.2. Finally, based on the moment estimates, we apply the framework of Li et al. (2019a) to show that the manifold projection and the SDE solution are weak approximations of each other in Appendix K.10.

K PROOF DETAILS OF MAIN THEOREMS

The detailed proof is organized as follows. In Appendix K.1, we introduce the notations that will be used throughout the proof. To establish preliminary knowledge, Appendix K.2 provides an explicit expression for the projection operator Φ(·), and Appendix K.3 presents lemmas about gradient descent (GD) and gradient flow (GF). Based on these preliminaries, we construct a nested working zone to characterize the closeness of the iterate to Γ in Appendix K.4. Appendices K.5 to K.10 make up the main body of the proof. Specifically, Appendices K.5 and K.6 analyze the dynamics of the Local SGD iterates in Phases 1 and 2, respectively. Utilizing these analyses, we prove Theorems J.1 and J.2 in Appendix K.7 and Theorem 3.3 in Appendix K.8. Then we derive the estimates of the first and second moments of one "giant step" Φ(θ̄^(s+R_grp)) - Φ(θ̄^(s)) in Appendix K.9. Finally, we prove the approximation theorem, Theorem 3.2, in Appendix K.10.

K.1 ADDITIONAL NOTATIONS

Let R_tot := ⌊T/(Hη²)⌋ be the total number of rounds. Denote by ϕ^(s) the manifold projection of the global iterate at the beginning of round s. Let x^(s)_{k,t} := θ^(s)_{k,t} - ϕ^(s) be the difference between the local iterate and the manifold projection of the global iterate. Also define x̄^(s)_H := (1/K) Σ_{k∈[K]} x^(s)_{k,H} and x̄^(s)_0 := (1/K) Σ_{k∈[K]} x^(s)_{k,0}, the averages of the x^(s)_{k,t}; note that x^(s)_{k,0} = x̄^(s)_0 = θ̄^(s) - ϕ^(s). Finally, since ∇ℓ(θ; ξ) is bounded, the gradient noise z^(s)_{k,t} is also bounded, and we denote by σ_max an upper bound such that ∥z^(s)_{k,t}∥_2 ≤ σ_max for all s, k, t.

We first introduce the notion of μ-PL. We will later show that there exists a neighborhood of the minimizer manifold Γ on which L satisfies μ-PL.

Definition K.1 (Polyak-Łojasiewicz condition). For μ > 0, we say a function L(·) satisfies the μ-Polyak-Łojasiewicz condition (abbreviated as μ-PL) on a set U if

    (1/2) ∥∇L(θ)∥_2² ≥ μ ( L(θ) - inf_{θ'∈U} L(θ') ).

We then introduce the definitions of the ε-ball at a point and the ε-neighborhood of a set. For θ ∈ R^d and ε > 0, B_ε(θ) := {θ' : ∥θ' - θ∥_2 < ε} is the open ε-ball centered at θ. For a set Z ⊆ R^d, Z^ε := ∪_{θ∈Z} B_ε(θ) is the ε-neighborhood of Z.
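Definition K.1 can be made concrete with a strongly convex quadratic, which satisfies μ-PL on all of R^d with μ = λ_min(∇²L): for L(θ) = ½θᵀAθ with A positive definite, ½‖∇L(θ)‖² = ½θᵀA²θ ≥ ½λ_min(A)·θᵀAθ = μ·L(θ). A quick numerical check (illustrative, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
A = M @ M.T + 0.5 * np.eye(6)        # positive definite Hessian of L(θ) = ½ θᵀAθ
mu = np.linalg.eigvalsh(A).min()     # PL constant: smallest eigenvalue of A

def L_val(th):
    return 0.5 * th @ A @ th         # inf over R^d is 0, attained at θ = 0

def grad(th):
    return A @ th

for _ in range(1000):
    th = rng.normal(size=6)
    # ½‖∇L(θ)‖² ≥ μ (L(θ) - inf L), with a tiny slack for floating point
    assert 0.5 * grad(th) @ grad(th) >= mu * L_val(th) - 1e-9
print("μ-PL verified with μ =", mu)
```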

K.2 COMPUTING THE DERIVATIVES OF THE LIMITING MAPPING

In this subsection, we present lemmas that relate the derivatives of the limiting mapping $\Phi(\cdot)$ to the derivatives of the loss function $L(\cdot)$. We first introduce the operator $V_H$.

Definition K.2. For a positive semi-definite symmetric matrix $H \in \mathbb{R}^{d\times d}$, let $\lambda_j, v_j$ be the $j$-th eigenvalue and eigenvector of $H$, where the $v_j$'s form an orthonormal basis of $\mathbb{R}^d$. Define the operator $V_H : \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ as
$$V_H(M) := \sum_{i,j:\,\lambda_i \ne 0 \,\vee\, \lambda_j \ne 0} \frac{1}{\lambda_i + \lambda_j}\,\langle M, v_i v_j^\top\rangle\, v_i v_j^\top, \qquad \forall M \in \mathbb{R}^{d\times d}.$$
Intuitively, this operator projects $M$ onto each basis matrix $v_i v_j^\top$ and sums up the projections with weights $\frac{1}{\lambda_i + \lambda_j}$. Additionally, for $\theta \in \Gamma$, denote by $T_\theta(\Gamma)$ and $T^\perp_\theta(\Gamma)$ the tangent and normal spaces of $\Gamma$ at $\theta$, respectively.

Lemmas K.1 to K.4 are from Li et al. (2021b). We include them to make the paper self-contained.

Lemma K.1 (Lemma C.1 of Li et al. (2021b)). For any $\theta \in \Gamma$ and any $v \in T_\theta(\Gamma)$, it holds that $\nabla^2 L(\theta) v = 0$.

Lemma K.2 (Lemma 4.3 of Li et al. (2021b)). For any $\theta \in \Gamma$, $\partial\Phi(\theta) \in \mathbb{R}^{d\times d}$ is the projection matrix onto the tangent space $T_\theta(\Gamma)$.

Lemma K.3 (Lemma C.4 of Li et al. (2021b)). For any $\theta \in \Gamma$, $u \in \mathbb{R}^d$ and $v \in T_\theta(\Gamma)$, it holds that
$$\partial^2\Phi(\theta)[v, u] = -\partial\Phi(\theta)\,\nabla^3 L(\theta)[v, \nabla^2 L(\theta)^{+} u] - \nabla^2 L(\theta)^{+}\,\nabla^3 L(\theta)[v, \partial\Phi(\theta) u].$$

Lemma K.4 (Lemma C.6 of Li et al. (2021b)). For any $\theta \in \Gamma$ and $\Sigma \in \mathrm{span}\{u u^\top \mid u \in T^\perp_\theta(\Gamma)\}$,
$$\langle \partial^2\Phi(\theta), \Sigma\rangle = -\partial\Phi(\theta)\,\nabla^3 L(\theta)\big[V_{\nabla^2 L(\theta)}(\Sigma)\big].$$

Lemma K.5. For all $\theta \in \Gamma$ and $u, v \in T_\theta(\Gamma)$, it holds that
$$\partial\Phi(\theta)\,\nabla^3 L(\theta)[v u^\top] = 0. \tag{32}$$

Proof. This proof is inspired by Lemma C.4 of Li et al. (2021b). For any $\theta \in \Gamma$, consider a parameterized smooth curve $v(t)$, $t \ge 0$, on $\Gamma$ such that $v(0) = \theta$ and $v'(0) = v$. Let $P_\parallel(t) = \partial\Phi(v(t))$, $P_\perp(t) = I - \partial\Phi(v(t))$ and $H(t) = \nabla^2 L(v(t))$. By Lemmas C.1 and 4.3 in Li et al. (2021b), $H(t) = P_\perp(t) H(t)$. Taking the derivative with respect to $t$ on both sides,
$$H'(t) = P_\perp(t) H'(t) + P'_\perp(t) H(t) \;\Rightarrow\; P_\parallel(t) H'(t) = P'_\perp(t) H(t) = -P'_\parallel(t) H(t).$$
At $t = 0$, we have
$$P_\parallel(0) H'(0) = -P'_\parallel(0) H(0). \tag{33}$$
WLOG let $H(0) = \mathrm{diag}(\lambda_1, \cdots, \lambda_d) \in \mathbb{R}^{d\times d}$, where $\lambda_i = 0$ for all $m < i \le d$. Therefore,
$$P_\perp(0) = \begin{pmatrix} I_m & 0 \\ 0 & 0 \end{pmatrix}, \qquad P_\parallel(0) = \begin{pmatrix} 0 & 0 \\ 0 & I_{d-m} \end{pmatrix}.$$
Decompose $P'_\parallel(0)$, $H(0)$ and $H'(0)$ as follows:
$$P'_\parallel(0) = \begin{pmatrix} P'_{\parallel,11}(0) & P'_{\parallel,12}(0) \\ P'_{\parallel,21}(0) & P'_{\parallel,22}(0) \end{pmatrix}, \quad H(0) = \begin{pmatrix} H_{11}(0) & 0 \\ 0 & 0 \end{pmatrix}, \quad H'(0) = \begin{pmatrix} H'_{11}(0) & H'_{12}(0) \\ H'_{21}(0) & H'_{22}(0) \end{pmatrix}.$$
Substituting the decomposition into (33), we have
$$\begin{pmatrix} 0 & 0 \\ H'_{21}(0) & H'_{22}(0) \end{pmatrix} = -\begin{pmatrix} P'_{\parallel,11}(0) H_{11}(0) & 0 \\ P'_{\parallel,21}(0) H_{11}(0) & 0 \end{pmatrix}.$$
Therefore, $H'_{22}(0) = 0$ and
$$P_\parallel(0) H'(0) = -P'_\parallel(0) H(0) = \begin{pmatrix} 0 & 0 \\ H'_{21}(0) & 0 \end{pmatrix}.$$
Any $u \in T_\theta(\Gamma)$ can be decomposed as $u = [0, u_2]^\top$ where $u_2 \in \mathbb{R}^{d-m}$. With this decomposition, we have $P_\parallel(0) H'(0) u = 0$. Also, note that $H'(0) = \nabla^3 L(\theta)[v]$. Hence, $\partial\Phi(\theta)\,\nabla^3 L(\theta)[v u^\top] = 0$.
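Lemma K.2 can be checked numerically on a toy example (a sketch for intuition, not used in the proof): for the assumed loss $L(\theta) = \frac{1}{4}(\|\theta\|^2 - 1)^2$ in $\mathbb{R}^2$, the gradient $(\|\theta\|^2-1)\theta$ is radial, so $\Gamma$ is the unit circle and the limiting map of gradient flow is $\Phi(\theta) = \theta/\|\theta\|$. A finite-difference Jacobian of $\Phi$ at $\theta \in \Gamma$ should then match $I - \theta\theta^\top$, the orthogonal projection onto $T_\theta(\Gamma)$.

```python
import numpy as np

# Toy loss with minimizer manifold = unit circle; grad L(theta) = (||theta||^2 - 1) theta
# is radial, so gradient flow converges to Phi(theta) = theta / ||theta||.
def Phi(theta):
    return theta / np.linalg.norm(theta)

def jacobian(f, theta, eps=1e-6):
    """Central finite-difference Jacobian of f at theta."""
    d = theta.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        J[:, i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return J

angle = np.pi / 5
theta = np.array([np.cos(angle), np.sin(angle)])  # a point on Gamma
J = jacobian(Phi, theta)
P_tangent = np.eye(2) - np.outer(theta, theta)    # projection onto T_theta(Gamma)

assert np.allclose(J, P_tangent, atol=1e-5)  # Lemma K.2: dPhi = tangent projection
assert np.allclose(J @ J, J, atol=1e-4)      # idempotent, as a projection should be
assert np.allclose(J @ theta, 0.0, atol=1e-5)  # kills the normal direction
print("dPhi(theta) matches the projection onto the tangent space")
```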

K.3 PRELIMINARY LEMMAS FOR GD AND GF

In this subsection, we introduce a few useful preliminary lemmas about gradient descent and gradient flow. Before presenting the lemmas, we fix the notation used in this subsection. Consider gradient descent iterates $\{\hat u_t\}_{t\in\mathbb{N}}$ following the update rule $\hat u_{t+1} = \hat u_t - \eta\nabla L(\hat u_t)$. We first state the descent lemma for gradient descent.

Lemma K.6 (Descent lemma for GD). If $\hat u_t \in U$ and $\eta \le \frac{1}{\rho}$, then
$$\frac{\eta}{2}\|\nabla L(\hat u_t)\|_2^2 \le L(\hat u_t) - L(\hat u_{t+1}), \qquad L(\hat u_{t+1}) - L^* \le (1 - \mu\eta)\,(L(\hat u_t) - L^*).$$

Proof. By $\rho$-smoothness,
$$L(\hat u_{t+1}) \le L(\hat u_t) + \langle\nabla L(\hat u_t), \hat u_{t+1} - \hat u_t\rangle + \frac{\rho}{2}\|\hat u_{t+1} - \hat u_t\|_2^2 = L(\hat u_t) - \eta\Big(1 - \frac{\rho\eta}{2}\Big)\|\nabla L(\hat u_t)\|_2^2 \le L(\hat u_t) - \frac{\eta}{2}\|\nabla L(\hat u_t)\|_2^2.$$
By the definition of $\mu$-PL, we then have $L(\hat u_{t+1}) - L^* \le (1 - \mu\eta)(L(\hat u_t) - L^*)$.

Then we prove the Lipschitzness of the potential function $\Psi(\theta) := \sqrt{L(\theta) - L^*}$.

Lemma K.7 (Lipschitzness of $\Psi(\theta)$). $\Psi(\theta)$ is $\sqrt{2\rho}$-Lipschitz for $\theta \in U$. That is, for any $\theta_1, \theta_2 \in U$, $|\Psi(\theta_1) - \Psi(\theta_2)| \le \sqrt{2\rho}\,\|\theta_1 - \theta_2\|_2$.

Proof. Fix $\theta_1$ and $\theta_2$. Denote by $\theta(t) := (1-t)\theta_1 + t\theta_2$, $t \in [0,1]$, the convex combination of $\theta_1$ and $\theta_2$. Further define $f(t) := \Psi(\theta(t))$. Below we consider two cases.

Case 1. If $f(t) > 0$ for all $t \in (0,1)$, then $f(t)$ is differentiable on $(0,1)$ and
$$|\Psi(\theta_2) - \Psi(\theta_1)| = |f(1) - f(0)| = \left|\int_0^1 f'(t)\,\mathrm{d}t\right| = \left|\int_0^1 \frac{\langle\nabla L(\theta(t)), \theta_2 - \theta_1\rangle}{2\sqrt{L(\theta(t)) - L^*}}\,\mathrm{d}t\right| \le \|\theta_2 - \theta_1\|_2 \int_0^1 \frac{\|\nabla L(\theta(t))\|_2}{2\sqrt{L(\theta(t)) - L^*}}\,\mathrm{d}t.$$
By $\rho$-smoothness of $L$, for all $\theta \in U$, $\|\nabla L(\theta)\|_2^2 \le 2\rho\,(L(\theta) - L^*)$. Since $L(\theta(t)) - L^* > 0$ for all $t \in (0,1)$, $\frac{\|\nabla L(\theta(t))\|_2}{\sqrt{L(\theta(t)) - L^*}} \le \sqrt{2\rho}$. Therefore, $|\Psi(\theta_2) - \Psi(\theta_1)| \le \frac{\sqrt{2\rho}}{2}\|\theta_2 - \theta_1\|_2$.

Case 2. If there exists $t' \in (0,1)$ such that $f(t') = 0$, then since $f(t') = 0$,
$$|\Psi(\theta_2) - \Psi(\theta_1)| = |f(1) - f(0)| = \left|(1-t')\,\frac{f(1)}{1-t'} - t'\,\frac{f(0)}{t'}\right| \le \max\left\{\frac{f(1)}{1-t'},\ \frac{f(0)}{t'}\right\}.$$
Since $\theta(t')$ minimizes $L$ in an open set, $\nabla L(\theta(t')) = 0$. By $\rho$-smoothness of $L$, for all $\theta \in U$,
$$L(\theta) \le L^* + \frac{\rho}{2}\|\theta - \theta(t')\|_2^2 \;\Rightarrow\; \Psi(\theta) \le \sqrt{\frac{\rho}{2}}\,\|\theta - \theta(t')\|_2.$$
Therefore, f (1) ≤ ρ 2 ∥θ 2 -θ(t ′ )∥ 2 = (1 -t ′ ) ρ 2 ∥θ 2 -θ 1 ∥ 2 f (0) ≤ ρ 2 ∥θ 1 -θ(t ′ )∥ 2 = t ′ ρ 2 ∥θ 2 -θ 1 ∥ 2 . Then we have | Ψ(θ 2 ) -Ψ(θ 1 )| ≤ ρ 2 ∥θ 2 -θ 1 ∥ 2 . Combining case 1 and case 2, we conclude the proof. Below we introduce a lemma that relates the movement of one step gradient descent to the change of the potential function. Lemma K.8 (Lemma G.1 in Lyu et al. (2022) ). If ût ∈ U and η ≤ 1/ρ 2 then Ψ( ût ) -Ψ( ût+1 ) ≥ √ 2µ 4 η∥∇L( ût )∥ 2 . Proof. Ψ( ût ) -Ψ( ût+1 ) = L( ût ) -L( ût+1 ) Ψ( ût ) + Ψ( ût+1 ) ≥ L( ût+1 ) -L( ût ) 2 Ψ( ût ) ≥ η(1 -ρ 2 η/2)∥∇L( ût )∥ 2 2 2 Ψ( ût ) , where the two inequalities uses Lemma K.6. By µ-PL, Ψ( ût ) ≤ 1 √ 2µ ∥∇L( ût )∥ 2 . Therefore, we have Ψ( ût ) -Ψ( ût+1 ) ≥ √ 2µ 2 (1 -ηρ/2)η∥∇L( ût )∥ 2 ≥ √ 2µ 4 η∥∇L( ût )∥ 2 . Based on Lemma K.8, we have the following lemma that bounds the movement of GD over multiple steps. Lemma K.9 (Bounding the movement of GD). If û0 is initialized such that ∥ û0 -θ * ∥ 2 ≤ 1 4 µ ρ ϵ ′ , then for all t ≥ 0, ût ∈ B ϵ ′ (θ * ) and ∥ ût -û0 ∥ 2 ≤ 8 µ Ψ( û0 ). Proof. We prove the proposition by induction. When t = 0, it trivially holds. Assume that the proposition holds for ûτ , 0 ≤ τ < t. For step t, since ûτ ∈ B ϵ ′ (θ * ), we apply Lemma K.8 and obtain ∥ ût -û0 ∥ 2 ≤ η t-1 τ =0 ∥∇L( ûτ )∥ 2 ≤ 8 µ Ψ( û0 ) -Ψ( ût ) ≤ 8 µ Ψ( û0 ). Further by ρ-smoothness of L(•), ∥ ût -û0 ∥ 2 ≤ 8 µ Ψ( û0 ) ≤ 2 ρ µ ∥ û0 -θ * ∥ 2 ≤ 1 2 ϵ ′ . Therefore, ∥ ût -θ * ∥ 2 ≤ ∥ ût -û0 ∥ 2 + ∥ û0 -θ * ∥ 2 < ϵ ′ , which concludes the proof. Finally, we introduce a lemma adapted from Thm. D.4 of which bounds the movement of GF. Lyu et al. (2022) . Lemma K.10. Assume that ∥θ 0 -θ * ∥ 2 < µ ρ ϵ ′ . The gradient flow θ(t) = -dL(θ(t)) dt starting at θ 0 converges to a point in U and θ 0 -lim t→+∞ θ(t) 2 ≤ 2 µ L(θ 0 ) -L * ≤ ρ µ ∥θ 0 -θ * ∥ 2 Proof. Let T := inf{t : θ / ∈ U }. Then for all t < T , d dt (L(θ) -L * ) 1/2 = 1 2 (L(θ) -L * ) -1/2 • ∇L(θ), dθ dt = - 1 2 (L(θ) -L * ) -1/2 ∥∇L(θ)∥ 2 ∥ dθ dt ∥ 2 . 
By $\mu$-PL, $\|\nabla L(\theta)\|_2 \ge \sqrt{2\mu(L(\theta) - L^*)}$. Hence,
$$\frac{\mathrm{d}}{\mathrm{d}t}\,(L(\theta) - L^*)^{1/2} \le -\frac{\sqrt{2\mu}}{2}\left\|\frac{\mathrm{d}\theta}{\mathrm{d}t}\right\|_2.$$
Integrating both sides, we have
$$\int_0^T \left\|\frac{\mathrm{d}\theta(\tau)}{\mathrm{d}\tau}\right\|_2 \mathrm{d}\tau \le \frac{2}{\sqrt{2\mu}}\,(L(\theta_0) - L^*)^{1/2} \le \sqrt{\frac{\rho}{\mu}}\,\|\theta_0 - \theta^*\|_2 < \epsilon',$$
where the second inequality uses the $\rho$-smoothness of $L$. Therefore, $T = +\infty$ and $\theta(t)$ converges to some point in $U$.
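The GD lemmas above can be observed numerically on a small example (an illustrative check; the quadratic loss $L(\theta) = \frac{1}{2}\theta^\top A\theta$ and its constants are assumptions of the example, not part of the proof): with $\rho$ the largest and $\mu$ the smallest eigenvalue of $A$ and $\eta \le 1/\rho$, every step satisfies both inequalities of Lemma K.6.

```python
import numpy as np

A = np.diag([2.0, 0.5])   # Hessian: rho = 2.0 (smoothness), mu = 0.5 (PL constant)
rho, mu = 2.0, 0.5
L_star = 0.0
eta = 1.0 / rho           # step size satisfying eta <= 1/rho

def loss(u): return 0.5 * u @ A @ u
def grad(u): return A @ u

u = np.array([3.0, -4.0])
for _ in range(50):
    u_next = u - eta * grad(u)
    # Lemma K.6, first inequality: (eta/2) ||grad L(u_t)||^2 <= L(u_t) - L(u_{t+1})
    assert eta / 2 * grad(u) @ grad(u) <= loss(u) - loss(u_next) + 1e-12
    # Lemma K.6, second inequality: linear convergence under mu-PL
    assert loss(u_next) - L_star <= (1 - mu * eta) * (loss(u) - L_star) + 1e-12
    u = u_next
print("descent lemma and PL contraction verified; final loss =", loss(u))
```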

K.4 CONSTRUCTION OF WORKING ZONES

We construct four nested working zones $(\Gamma^{\epsilon_0}, \Gamma^{\epsilon_1}, \Gamma^{\epsilon_2}, \Gamma^{\epsilon_3})$ in the neighborhood of $\Gamma$. Later we will show that, with high probability after $O(\log\frac{1}{\eta})$ rounds, the local iterates satisfy $\theta^{(s)}_{k,t} \in \Gamma^{\epsilon_2}$ and the global iterates satisfy $\bar\theta^{(s)} \in \Gamma^{\epsilon_0}$. The following lemma lists the properties the working zones should satisfy.

Lemma K.11 (Working zone lemma). There exist constants $\epsilon_0 < \epsilon_1 < \epsilon_2 < \epsilon_3$ such that $(\Gamma^{\epsilon_0}, \Gamma^{\epsilon_1}, \Gamma^{\epsilon_2}, \Gamma^{\epsilon_3})$ satisfy the following properties:
1. $L$ satisfies $\mu$-PL in $\Gamma^{\epsilon_3}$ for some $\mu > 0$.
2. Any gradient flow starting in $\Gamma^{\epsilon_2}$ converges to some point in $\Gamma$. Then, by Falconer (1983), $\Phi(\cdot)$ is $\mathcal{C}^\infty$ in $\Gamma^{\epsilon_2}$.
3. Any $\theta \in \Gamma^{\epsilon_1}$ has an $\epsilon_1$-neighborhood $B_{\epsilon_1}(\theta)$ such that $B_{\epsilon_1}(\theta) \subseteq \Gamma^{\epsilon_2}$.
4. Any gradient descent trajectory starting in $\Gamma^{\epsilon_0}$ with a sufficiently small learning rate stays in $\Gamma^{\epsilon_1}$.

Proof. Let $\bar\theta^{(0)}$ be initialized such that $\Phi(\bar\theta^{(0)}) \in \Gamma$. Let $Z$ be the set of all points on the gradient flow trajectory starting from $\bar\theta^{(0)}$ and $Z^\epsilon$ be the $\epsilon$-neighborhood of $Z$, where $\epsilon$ is a positive constant. Since the gradient flow converges to $\phi^{(0)}$, both $Z$ and $Z^\epsilon$ are bounded.

We now construct the four nested working zones. By Lemma H.3 in Lyu et al. (2022), there exists an $\epsilon_3$-neighborhood of $\Gamma$, $\Gamma^{\epsilon_3}$, on which $L$ satisfies $\mu$-PL for some $\mu > 0$. Let $M$ be the convex hull of $\Gamma^{\epsilon_3} \cup Z^\epsilon$ and $M^{\epsilon_4}$ be the $\epsilon_4$-neighborhood of $M$, where $\epsilon_4$ is a positive constant; then $M^{\epsilon_4}$ is bounded. Define $\rho_2 = \sup_{\theta\in M^{\epsilon_4}} \|\nabla^2 L(\theta)\|_2$ and $\rho_3 = \sup_{\theta\in M^{\epsilon_4}} \|\nabla^3 L(\theta)\|_2$. By Lemma K.10, we can construct an $\epsilon_2$-neighborhood of $\Gamma$ with $\epsilon_2 < \sqrt{\mu/\rho_2}\,\epsilon_3$ such that every gradient flow starting in $\Gamma^{\epsilon_2}$ converges to a point in $\Gamma$. By Falconer (1983), $\Phi(\cdot)$ is $\mathcal{C}^2$ in $\Gamma^{\epsilon_2}$. Define $\nu_1 = \sup_{\theta\in\Gamma^{\epsilon_2}} \|\partial\Phi(\theta)\|_2$ and $\nu_2 = \sup_{\theta\in\Gamma^{\epsilon_2}} \|\partial^2\Phi(\theta)\|_2$. We also construct an $\epsilon_1$-neighborhood of $\Gamma$, $\Gamma^{\epsilon_1}$, with $\epsilon_1 \le \frac{1}{2}\epsilon_2 < \frac{1}{2}\sqrt{\mu/\rho_2}\,\epsilon_3$, such that every $\theta \in \Gamma^{\epsilon_1}$ has an $\epsilon_1$-neighborhood on which $\Phi$ is well defined.

Finally, by Lemma K.9, there exists an $\epsilon_0$-neighborhood of $\Gamma$ with $\epsilon_0 \le \frac{1}{4}\sqrt{\mu/\rho_2}\,\epsilon_1$ such that all gradient descent iterates starting in $\Gamma^{\epsilon_0}$ with $\eta \le \frac{1}{\rho_2}$ stay in $\Gamma^{\epsilon_1}$.

Note that the quantities $Z^\epsilon$, $M^{\epsilon_4}$, $\rho_2$, $\rho_3$, $\nu_1$ and $\nu_2$ defined in this proof will be used throughout the rest of this section. When analyzing the limiting dynamics of Local SGD, we will show that, with high probability after $O(\log\frac{1}{\eta})$ rounds, all $\theta^{(s)}_{k,t}$ stay in $\Gamma^{\epsilon_2}$, $\tilde u^{(s)}_t \in \Gamma^{\epsilon_1}$, and $\bar\theta^{(s)} \in \Gamma^{\epsilon_0}$.

K.5 PHASE 1: ITERATE APPROACHING THE MANIFOLD

The approaching phase can be further divided into two subphases. In the first subphase, $\bar\theta^{(0)}$ is initialized such that $\phi^{(0)} \in \Gamma$. We will show that after a constant number of rounds $s_0$, $\bar\theta^{(s_0)}$ reaches the inner part of $\Gamma^{\epsilon_0}$, in the sense that $\|\bar\theta^{(s_0)} - \phi^{(0)}\|_2 \le c\,\epsilon_0$ with high probability, where $0 < c < 1$ and the constants will be specified later (see Appendix K.5.2). In the second subphase, we show that the iterate reaches within $\tilde O(\sqrt{\eta})$ distance of $\Gamma$ after $O(\log\frac{1}{\eta})$ rounds with high probability (see Appendix K.5.3).

K.5.1 ADDITIONAL NOTATIONS

Consider an auxiliary sequence $\{\tilde u^{(s)}_t\}$ where $\tilde u^{(s)}_0 = \bar\theta^{(s)}$ and $\tilde u^{(s)}_{t+1} = \tilde u^{(s)}_t - \eta\nabla L(\tilde u^{(s)}_t)$ for $0 \le t \le H-1$. Define $\tilde\Delta^{(s)}_{k,t} := \theta^{(s)}_{k,t} - \tilde u^{(s)}_t$ to be the difference between the local iterate and the gradient descent iterate; notice that $\tilde\Delta^{(s)}_{k,0} = 0$ for all $k$ and $s$. Consider a gradient flow $\{u(t)\}_{t\ge 0}$ with initial condition $u(0) = \bar\theta^{(0)}$, which converges to $\phi^{(0)} \in \Gamma$. For simplicity, let $u^{(s)}_t := u(s\alpha + t\eta)$ be the gradient flow after $s$ rounds plus $t$ steps. Let $s_0$ be the smallest number such that $\|u^{(s_0)}_0 - \phi^{(0)}\|_2 \le \frac{1}{4}\sqrt{\mu/\rho_2}\,\epsilon_0$; note that $s_0$ is a constant independent of $\eta$. In this subsection, the minimum loss value $L^*$ of Appendix K.3 corresponds to the loss value on $\Gamma$, i.e., $L^* = L(\phi)$ for all $\phi \in \Gamma$. We also define the following sequence $\{\tilde Z^{(s)}_{k,t}\}_{t=0}^{H}$ that will be used in the proof:
$$\tilde Z^{(s)}_{k,t} := \sum_{\tau=0}^{t-1}\left[\prod_{l=\tau+1}^{t-1}\big(I - \eta\nabla^2 L(\tilde u^{(s)}_l)\big)\right] z^{(s)}_{k,\tau}, \qquad \tilde Z^{(s)}_{k,0} = 0.$$
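The role of the auxiliary GD sequence can be illustrated by a small simulation (a sketch on a toy quadratic with additive Gaussian gradient noise; the loss, noise scale, and constants are assumptions of the example): over one round of $H = \alpha/\eta$ local steps, the deviation $\tilde\Delta_t = \theta_{k,t} - \tilde u_t$ accumulates like $\eta\tilde Z$, i.e., a gap of order $\sqrt\eta$ up to log factors, so it shrinks as the learning rate decreases, in the spirit of Lemma K.14.

```python
import numpy as np

def max_deviation(eta, alpha=1.0, sigma=1.0, seed=0):
    """Max ||theta_t - u_t|| over one round of H = alpha/eta local steps
    on L(theta) = 0.5 ||theta||^2 with additive Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    H = int(alpha / eta)
    theta = u = np.array([1.0, 1.0])       # shared initialization: Delta_0 = 0
    dev = 0.0
    for _ in range(H):
        z = sigma * rng.normal(size=2)
        theta = theta - eta * (theta + z)  # noisy local step (grad L = theta)
        u = u - eta * u                    # auxiliary noiseless GD step
        dev = max(dev, np.linalg.norm(theta - u))
    return dev

dev_large, dev_small = max_deviation(1e-2), max_deviation(1e-4)
print(f"eta=1e-2: {dev_large:.4f}   eta=1e-4: {dev_small:.4f}")
# the deviation shrinks roughly like sqrt(eta) as eta decreases
assert dev_small < dev_large
```

Here $\tilde\Delta_{t+1} = (1-\eta)\tilde\Delta_t - \eta z_t$, whose stationary scale is $\sigma\sqrt{\eta/2}$ per coordinate, matching the $\tilde O(\sqrt\eta)$ behavior in the analysis.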

K.5.2 PROOF FOR SUBPHASE 1

First, we have the following lemma about the concentration of Z(s) k,t . Lemma K.12 (Concentration property of { Z(s) k,t } H t=0 ). Given θ(s) such that ũ(s) t ∈ Γ ϵ3 ∪ Z ϵ for all 0 ≤ t ≤ H, then with probability at least 1 -δ, ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2H log 2HK δ , ∀0 ≤ t ≤ H, k ∈ [K], where C1 := exp(αρ 2 ). Proof. For each Z(s) k,t , construct a sequence { Z(s) k,t,t ′ } t t ′ =0 : Z(s) k,t,t ′ := t ′ -1 τ =0 t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l )) z (s) k,τ , Z(s) k,t,0 = 0. Since ũ(s) t ∈ Γ ϵ3 ∪ Z ϵ , we have ∥∇ 2 L( ũ(s) t )∥ 2 ≤ ρ 2 for all 0 ≤ t ≤ H. Then, for all τ and t, t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l )) 2 ≤ (1 + ρ 2 η) H ≤ exp(αρ 2 ) = C1 . Notice that for all 0 ≤ t ≤ H, { Z(s) k,t,t ′ } t t ′ =0 is a martingale with ∥ Z(s) k,t,t ′ - Z(s) k,t,t ′ -1 ∥ 2 ≤ C1 σ max . By Azuma-Hoeffding's inequality, P(∥ Z(s) k,t ∥ 2 ≥ ϵ ′ ) ≤ 2 exp    -ϵ ′2 2t C1 σ max 2    ≤ 2 exp    -ϵ ′2 2H C1 σ max 2   . Taking a union bound on all k ∈ [K] and 0 ≤ t ≤ H, we can conclude that with probability at least 1 -δ, ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2H log 2HK δ , ∀0 ≤ t ≤ H, k ∈ [K]. The following lemma states that the gradient descent iterates will closely track the gradient flow with the same initial point. Lemma K.13. Denote G := sup t≥0 ∥∇L(u(t))∥ 2 as the upper bound of the gradient on the gradi- ent flow trajectory. If ∥ ũ(s) t -u (s) t ∥ 2 = O( √ η), then for all 0 ≤ t ≤ H, the closeness of ũ(s) t and u (s) t is bounded by ∥ ũ(s) t -u (s) t ∥ 2 ≤ C1 ∥ ũ(s) 0 -u (s) 0 ∥ 2 + C1 ηG, where C1 = exp(αρ 2 ). Proof. We prove by induction that ∥ ũ(s) t -u (s) t ∥ 2 ≤ (1 + ρ 2 η) t ∥ ũ(s) 0 -u (s) 0 ∥ 2 + ρ 2 η 2 G t-1 τ =0 (1 + ρ 2 η) τ . ( ) When t = 0, (34) holds trivially. Assume that (34) holds for 0 ≤ τ ≤ t, then ũ(s) t+1 -u (s) t+1 = ũ(s) t -η∇L( ũ(s) t ) -u t - sα+(t+1)η sα+tη ∇L(u(v))dv = ũ(s) t -u t -η ∇L( ũ(s) t ) -∇L(u (s) t ) - sα+(t+1)η sα+tη ∇L(u (s) t ) -∇L(u(v)) dv. 
By smoothness of L, ∥∇L(u (s) t ) -∇L(u(v))∥ 2 ≤ ρ 2 ∥u (s) t -u(v)∥ 2 ≤ ρ 2 v sα+tη ∥∇L(u(w))∥ 2 dw ≤ ρ 2 ηG. Since ρ 2 2 η 2 G t-1 τ =0 (1 + ρ 2 η) τ ≤ ηG(1 + ρ 2 η) t ≤ exp(αρ 2 )ηG, then ∥ ũ(s) t -u (s) t ∥ 2 = O( √ η), which implies that ũ(s) t ∈ M ϵ4 . Hence, ∥∇L( ũ(s) t ) -L(u (s) t )∥ 2 ≤ ρ 2 ∥ ũ(s) t -u (s) t ∥ 2 . By triangle inequality, ∥ ũ(s) t+1 -u (s) t+1 ∥ 2 ≤ (1 + ρ 2 η)∥ ũ(s) t -u (s) t ∥ 2 + ρ 2 η 2 G ≤ (1 + ρ 2 η) t+1 ∥ ũ(s) t -u (s) t ∥ 2 + ρ 2 η 2 G t τ =0 (1 + ρ 2 η) τ , which concludes the induction step. Appling 1 + ρ 2 η ≤ exp(ρ 2 η), we have the lemma. Utilizing the concentration probability of { Z(s) k,t }, we can obtain the following lemma which implies that the Local SGD iterates will closely track the gradient descent iterates with high probability. Lemma K.14. Given θ(s) such that ũ(s) t ∈ Γ ϵ3 ∪ Z ϵ for all 0 ≤ t ≤ H, then for δ = O(poly(η)), with probability at least 1 -δ, there exists a constant C3 such that ∥θ (s) k,t - ũ(s) t ∥ 2 ≤ C3 η log 1 ηδ , ∀0 ≤ t ≤ H, k ∈ [K], and ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ C3 η log 1 ηδ . Proof. Since ũ(s) t ∈ Γ ϵ3 ∪ Z ϵ for all 0 ≤ t ≤ H, we have ∥∇ 2 L( ũ(s) t )∥ 2 ≤ ρ 2 . According to the update rule for θ (s) k,t and ũ(s) t , θ (s) k,t+1 = θ (s) k,t -η∇L(θ (s) k,t ) -ηz (s) k,t , ũ(s) t+1 = ũ(s) t -η∇L( ũ(s) t ). ( ) Subtracting ( 36) from ( 35) gives ∆(s) k,t+1 = ∆(s) k,t -η(∇L(θ (s) k,t ) -∇L( ũ(s) t )) -ηz (s) k,t = (I -η∇ 2 L( ũ(s) t )) ∆(s) k,t -ηz (s) k,t + η ṽ(s) k,t . Here, ṽ(s) k,t = (1 -β (s) k,t )θ (s) k,t + β (s) k,t ũ(s) k,t , where β (s) k,t ∈ (0, 1) depends on θ (s) k,t and ũ(s) t . Therefore, ∥ṽ (s) k,t ∥ 2 ≤ ρ3 2 ∥ ∆(s) k,t ∥ 2 2 if θ (s) k,t ∈ M ϵ4 . Applying (37) t times, we have ∆(s) k,t = t-1 τ =0 (I -η∇ 2 L( ũ(s) τ )) ∆(s) k,0 -η t-1 τ =0 t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l ))z (s) k,τ + η t-1 τ =0 t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l ))ṽ (s) k,τ . 
By Cauchy-Schwartz inequality, triangle inequality and the definition of Z(s) k,t , if for all 0 ≤ τ ≤ t-1 and k ∈ [K], θ (s) k,τ ∈ M ϵ4 , then we have ∥ ∆(s) k,t ∥ 2 ≤ η∥ Z(s) k,t ∥ 2 + 1 2 ηρ 3 t-1 τ =0 C1 ∥ ∆(s) k,τ ∥ 2 2 . ( ) Applying Lemma K.12 and substituting in the value of H, we have that with probability at least 1 -δ, ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2α η log 2αK ηδ , ∀k ∈ K, 0 ≤ t ≤ H. ( ) Now we show by induction that for δ = O(poly(η)), when (39) holds, there exists a constant C2 > 2σ max √ 2α C1 such that ∥ ∆(s) k,t ∥ 2 ≤ C2 η log 2αK ηδ . When t = 0, ∆(s) k,0 = 0. Assume that ∥ ∆(s) k,τ ∥ 2 ≤ C2 η log 2αK ηδ , for all k ∈ [K], 0 ≤ τ ≤ t -1. Then for all 0 ≤ τ ≤ t -1, θ (s) k,τ ∈ M ϵ4 . Therefore, we can apply (38) and obtain ∥ ∆(s) k,t ∥ 2 ≤ η∥ Z(s) k,t ∥ 2 + 1 2 ηρ 3 t-1 τ =0 C1 ∥ ∆(s) k,τ ∥ 2 2 ≤ C1 σ max 2αη log 2αK ηδ + 1 2 C1 C2 2 σ 2 max αρ 3 η log 2αK ηδ . Given that C2 ≥ 2σ max √ 2α C1 and δ = O(poly(η)), when η is sufficiently small, ∥ ∆(s) k,t ∥ 2 ≤ C2 η log 2αK ηδ . To sum up, for δ = O(poly(η)), with probability at least 1 -δ, ∥ ∆(s) k,t ∥ 2 ≤ C2 η log 2αK ηδ for all k ∈ [K], 0 ≤ t ≤ H. By triangle inequality, ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ 1 K k∈[K] ∥ ∆(s) k,H ∥ 2 ≤ C2 η log 2αK ηδ . The combination of Lemma K.13 and Lemma K.14 leads to the following lemma, which states that the Local SGD iterate will enter Γ ϵ1 after s 0 rounds with high probability. Lemma K.15. Given θ(0) such that Φ( θ(0) ) ∈ Γ, then for δ = O(poly(η)), there exists a positive constant C4 such that with probability at least 1 -δ, ∥ θ(s0) -ϕ (0) ∥ 2 ≤ 1 4 µ ρ 2 ϵ 0 + C4 η log 1 ηδ . Proof. First, we prove by induction that for δ = O(poly(η)), when ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2H log 2HKs 0 δ , ∀0 ≤ t ≤ H, k ∈ [K], 0 ≤ s < s 0 , the closeness of θ(s) and u (s) 0 is bounded by ∥ θ(s) -u (s) 0 ∥ 2 ≤ s l=1 Cl 1 ηG + C3 η log s 0 ηδ , ∀0 ≤ s ≤ s 0 . When s = 0, θ(0) = u (0) 0 . Assume that (41) holds for round s. 
Then by Lemma K.13, for all 0 ≤ t ≤ H, ∥ ũ(s) t -u (s) t ∥ 2 ≤ C1 ∥ ũ(s) 0 -u (s) 0 ∥ 2 + C1 ηG = C1 ∥ θ(s) 0 -u (s) 0 ∥ 2 + C1 ηG ≤ s l=1 Cl+1 1 ηG + C3 η log s 0 ηδ + C1 ηG. Therefore, for sufficiently small η, ũ(s) t ∈ Z ϵ , ∀0 ≤ t ≤ H. Combing the above inequality with Lemma K.14, we have ∥ θ(s+1) -u (s+1) 0 ∥ 2 = ∥ θ(s+1) -u (s) H ∥ 2 ≤ ∥ θ(s+1) - ũ(s) H ∥ 2 + ∥ ũ(s) H -u (s) H ∥ 2 ≤ s+1 l=1 Cl+1 1 ηG + C3 η log s 0 ηδ , which concludes the induction. Therefore, when (40) holds, there exists a positive constant C4 such that ∥ θ(s0) -u (s0) 0 ∥ 2 ≤ C4 η log 1 ηδ . By definition of u (s0) 0 , ∥ θ(s0) -ϕ (0) ∥ 2 ≤ 1 4 µ ρ 2 ϵ 0 + C4 η log 1 ηδ . Finally, according to Lemma K.12, (40) holds with probability at least 1 -δ.

K.5.3 PROOF FOR SUBPHASE 2

In subphase 2, we show that the iterate can reach within Õ( √ η) distance from Γ after O(log 1 η ) rounds with high probability. The following lemma manifests how the potential function Ψ( θ(s) ) evolves after one round. Lemma K.16. Given θ(s) ∈ Γ ϵ0 , for δ = O(poly(η)), with probability at least 1 -δ, θ (s) k,t ∈ Γ ϵ2 , Ψ(θ (s) k,t ) ≤ Ψ( θ(s) ) + C5 η log 1 ηδ , ∀k ∈ [K], 0 ≤ t ≤ H and θ(s+1) ∈ Γ ϵ2 , Ψ( θ(s+1) ) ≤ exp(-αµ/2) Ψ( θ(s) ) + C5 η log 1 ηδ , where C5 is a positive constant. Proof. Since θ(s) ∈ Γ ϵ0 , then for all 0 ≤ t ≤ H, ũ(s) t ∈ Γ ϵ1 by the definition of the working zone. By Lemma K.6, for η ≤ 1 ρ2 , L( ũ(s) t ) -L * ≤ (1 -µη) t L( θ(s) ) -L * ≤ L( θ(s) ) -L * , ∀0 ≤ t ≤ H. Specially, for t = H, L( ũ(s) H ) -L * ≤ (1 -µη) α η L( θ(s) ) -L * ≤ exp(-αµ)(L( θ(s) ) -L * ). Therefore, Ψ( ũ(s) H ) ≤ exp(-αµ/2) Ψ( θ(s) ). According to the proof of Lemma K.14, for δ = O(poly(η)), when ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2α η log 2αK ηδ , ∀k ∈ [K], 0 ≤ t ≤ H, there exists a constant C3 such that ∥θ (s) k,t - ũ(s) t ∥ 2 ≤ C3 η log 1 ηδ , ∀0 ≤ t ≤ H, k ∈ [K], and ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ C3 η log 1 ηδ . Since ũ(s) t ∈ Γ ϵ1 , ∀0 ≤ t ≤ H, θ(s+1) ∈ Γ ϵ2 and θ(s) k,t ∈ Γ ϵ2 , ∀0 ≤ t ≤ H, k ∈ [K]. By Lemma K.7, Ψ(•) is √ 2ρ 2 -Lipschitz in M ϵ4 . Therefore, when (42) holds, there exists a constant C5 := √ 2ρ 2 C3 such that Ψ(θ (s) k,t ) ≤ Ψ( ũ(s) t ) + 2ρ 2 ∥θ (s) k,t - ũ(s) t ∥ 2 ≤ Ψ( θ(s) ) + C5 η log 1 ηδ , Ψ( θ(s+1) ) ≤ Ψ( ũ(s) H ) + 2ρ 2 ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ exp(-αµ/2) Ψ( θ(s) ) + C5 η log 1 ηδ . Finally, by Lemma K.12, (42) holds with probability at least 1 -δ. We are thus led to the following lemma which characterizes the evolution of the potential Ψ( θ(s) ) and Ψ(θ (s) k,t ) over multiple rounds. Lemma K.17. Given ∥ θ(0) -ϕ (0) ∥ 2 ≤ 1 2 µ ρ2 ϵ 0 , for δ = O(poly(η)) and any integer 1 ≤ R ≤ R tot , with probability at least 1 -δ, θ(s) ∈ Γ ϵ0 , Ψ( θ(s) ) ≤ exp(-αµs/2) Ψ( θ(0) ) + 1 1 -exp(-αµ/2) C5 η log R ηδ , ∀0 ≤ s ≤ R. 
( ) Furthermore, θ(s) k,t ∈ Γ ϵ2 , Ψ(θ (s) k,t ) ≤ Ψ( θ(s) ) + C5 η log R ηδ , ∀0 ≤ t ≤ H, 0 ≤ s < R, k ∈ [K]. ( ) Proof. We prove induction that for δ = O(poly(η)), when ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2α η log 2RαK ηδ , ∀k ∈ [K], 0 ≤ t ≤ H, 0 ≤ s < R, then for all 0 ≤ s ≤ R, (43) and ( 44) hold. When s = 0, θ(0) ∈ Γ ϵ0 and (43) trivially holds. By Lemma K.16, (44) holds. Assume that ( 43) and ( 44) hold for round s -1. Then for round s, by Lemma K.16, θ(s) ∈ Γ ϵ2 and Ψ( θ(s) ) ≤ exp(-αµ/2) Ψ( θ(s-1) ) + C5 η log R ηδ ≤ exp(-αµs/2) Ψ( θ(0) ) + 1 1 -exp(-αµ/2) C5 η log R ηδ , where the second inequality comes from the induction hypothesis. By Lemma K.10, ∥ θ(s) -ϕ (s) ∥ 2 ≤ 2 √ 2µ Ψ( θ(s) ) ≤ 2 √ 2µ Ψ( θ(0) ) + 2 √ 2µ(1 -exp(-αµ/2)) C5 η log R ηδ ≤ 1 2 ϵ 0 + 2 √ 2µ(1 -exp(-αµ/2)) C5 η log R ηδ . Here, the last inequality uses Ψ( θ(0) ) ≤ ρ2 2 ∥ θ(s) -ϕ (0) ∥ 2 ≤ 1 2 µ 2 ϵ 0 . Hence, when η is suffi- ciently small, θ(s) ∈ Γ ϵ0 . Still by Lemma K.16, θ(s) k,t ∈ Γ ϵ2 and Ψ(θ (s) k,t ) ≤ Ψ( θ(s) ) + C5 η log R ηδ . Finally, according to Lemma K.12, (45) holds with probability at least 1 -δ. The following corollary is a direct consequence of Lemma K.17 and Lemma K.10. Corollary K.1. Let s 1 := ⌈ 20 αµ log 1 η ⌉. Given ∥ θ(0) -ϕ (0) ∥ 2 ≤ 1 2 µ ρ2 ϵ 0 , for δ = O(poly(η)), with probability at least 1 -δ, Ψ( θ(s1) ) ≤ C6 η log 1 ηδ , ∥ θ(s1) -ϕ (s1) ∥ 2 ≤ C6 η log 1 ηδ , ( ) where C6 is a constant. Proof. Substituting in R = s 1 to Lemma K.17 and applying ∥ θ(s1) -ϕ (s) ∥ 2 ≤ 2 µ Ψ( θ(s1) ) for θ(s1) ∈ Γ ϵ0 , we have the lemma. Finally, we provide a high probability bound for the change of the projection on the manifold after s 1 rounds ∥ϕ (s1) -ϕ (0) ∥ 2 . Lemma K.18. Let s 1 := ⌈ 20 αµ log 1 η ⌉. Given ∥ θ(0) -ϕ (0) ∥ 2 ≤ 1 2 µ ρ2 ϵ 0 . For δ = O(poly(η)), with probability at least 1 -δ, ∥ϕ (s1) -ϕ (0) ∥ 2 ≤ C8 log 1 η η log 1 ηδ . Proof. 
From Lemma K.17, for δ = O(poly(η)), when ∥ Z(s) k,t ∥ 2 ≤ C1 σ max 2α η log 2s 1 αK ηδ , ∀k ∈ [K], 0 ≤ t ≤ H, 0 ≤ s < s 1 , ( ) then θ(s) ∈ Γ ϵ0 , for all 0 ≤ s ≤ s 1 . By the definition of Γ ϵ0 , ũ(s) t ∈ Γ ϵ1 , for all 0 ≤ t ≤ H, 0 ≤ s ≤ s 1 . By triangle inequality, ∥ϕ (s1) -ϕ (0) ∥ 2 can be decomposed as follows. ∥ϕ (s1) -ϕ (0) ∥ 2 ≤ s1-1 s=0 ∥ϕ (s+1) -ϕ (s) ∥ 2 ≤ s1-1 s=0 ∥Φ( ũ(s) H ) -Φ( ũ(s) 0 )∥ 2 + s1-1 s=0 ∥Φ( θ(s+1) ) -Φ( ũ(s) H )∥ 2 . ( ) By Lemma K.14, when (47) hold , then for all 0 ≤ s < s 1 -1, ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ C3 η log s 1 ηδ . This implies that θ(s+1 ) ∈ B ϵ1 ( ũ(s) H ). Since for all θ ∈ Γ ϵ2 , ∥∂Φ(θ)∥ 2 ≤ ν 1 , then Φ(•) is ν 1 - Lipschitz in B ϵ1 ( ũ(s) H ). This gives ∥Φ( θ(s+1) ) -Φ( ũ(s) H )∥ 2 ≤ ν 1 ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ ν 1 C3 η log s 1 ηδ . ( ) Then we analyze ∥ θ(s+1) -ũ(s) H ∥ 2 . By Lemma K.9 and the definition of Γ ϵ0 and Γ ϵ1 , there exists ϕ ∈ Γ such that ũ(s) t ∈ B ϵ1 (ϕ), ∀0 ≤ t ≤ H. Therefore, we can expand Φ( ũ(s) t+1 ) as follows: Φ( ũ(s) t+1 ) = Φ( ũ(s) t -η∇L( ũ(s) t )) = Φ( ũ(s) t ) -η∂Φ( ũ(s) )∇L(u (s) t ) + η 2 2 ∂ 2 Φ( û(s) t )[∇L( ũ(s) t ), ∇L( ũ(s) t )] = Φ( ũ(s) t ) + η 2 2 ∂ 2 Φ c (s) t ũ(s) t + (1 -c (s) t ) ũ(s) t+1 [∇L( ũ(s) t ), ∇L( ũ(s) t )], where c (s) t ∈ (0, 1). Then we have ∥Φ( ũ(s) H ) -Φ( ũ(s) 0 )∥ 2 ≤ η 2 2 H-1 t=0 ∥∂ 2 Φ( c (s) t ũ(s) t + (1 -c (s) t ) ũ(s) t+1 )[∇L( ũ(s) ), ∇L( ũ(s) t )]∥ 2 ≤ η 2 2 ν 2 H-1 t=0 ∥∇L( ũ(s) t )∥ 2 2 . By Lemma K.6, η 2 ∥∇L( ũ(s) t )∥ 2 2 ≤ L( ũ(s) t ) -L( ũ(s) t+1 ). Therefore, ∥Φ( ũ(s) H ) -Φ( ũ(s) 0 )∥ 2 ≤ ην 2 (L( ũ(s) 0 ) -L( ũ(s) H )) ≤ ην 2 [ Ψ( θ(s) )] 2 ≤ ν 2 η 2 exp(-αsµ) Ψ( θ(0) ) + C2 5 η (1 -exp(-αµ/2)) 2 log s 1 ηδ , ( ) where the last inequality uses Cauchy-Schwartz inequality and Lemma K.17. Summing up (50) , we obtain s1-1 s=0 ∥Φ( ũ(s) H ) -Φ( ũ(s) 0 )∥ 2 ≤ ν 2 η 2 Ψ( θ(0) ) s1-1 s=0 exp(-αµs) + s 1 C2 5 η (1 -exp(-αµ/2)) 2 log s 1 ηδ ≤ C7 η log 1 η log 1 ηδ , ( ) where C7 is a constant. 
Substituting (49) and (51) into (48), for sufficiently small $\eta$, we have
$$\|\phi^{(s_1)} - \phi^{(0)}\|_2 \le \nu_1 \tilde C_3 s_1 \sqrt{\eta\log\tfrac{s_1}{\eta\delta}} + \tilde C_7\, \eta\log\tfrac{1}{\eta}\log\tfrac{1}{\eta\delta} \le \tilde C_8 \log\tfrac{1}{\eta}\sqrt{\eta\log\tfrac{1}{\eta\delta}},$$
where $\tilde C_8$ is a constant. Finally, according to Lemma K.12, (47) holds with probability at least $1-\delta$.
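The $\tilde O(\sqrt\eta)$ movement of the manifold projection can be seen in a toy simulation (illustrative only; the loss, the noise model, and the use of the polar angle as the projection coordinate are assumptions of this example): for $L(\theta) = \frac{1}{4}(\|\theta\|^2-1)^2$ in $\mathbb{R}^2$ we have $\Phi(\theta) = \theta/\|\theta\|$, so the polar angle tracks $\phi^{(s)}$, and its drift over $O(\frac{1}{\eta}\log\frac{1}{\eta})$ noisy steps shrinks with $\eta$ roughly like $\sqrt{\eta\log\frac{1}{\eta}}$, in the spirit of Lemma K.18.

```python
import numpy as np

def proj_drift(eta, trials=30, sigma=1.0, seed=0):
    """Mean |angle(theta_T) - angle(theta_0)| after (1/eta) * log(1/eta) noisy GD
    steps on L(theta) = 0.25 * (||theta||^2 - 1)^2, started on the unit circle."""
    rng = np.random.default_rng(seed)
    steps = int(np.log(1.0 / eta) / eta)
    theta = np.tile(np.array([1.0, 0.0]), (trials, 1))  # start at angle 0 on Gamma
    for _ in range(steps):
        r2 = np.sum(theta**2, axis=1, keepdims=True)
        grad = (r2 - 1.0) * theta                        # gradient of the toy loss
        theta = theta - eta * (grad + sigma * rng.normal(size=theta.shape))
    angles = np.arctan2(theta[:, 1], theta[:, 0])        # Phi(theta) = theta/||theta||
    return np.mean(np.abs(angles))

drift_large, drift_small = proj_drift(1e-2), proj_drift(1e-3)
print(f"eta=1e-2: {drift_large:.4f}   eta=1e-3: {drift_small:.4f}")
assert drift_small < drift_large  # projection drift shrinks as eta decreases
```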

K.6 PHASE 2: ITERATES STAYING CLOSE TO MANIFOLD

K.6.1 ADDITIONAL NOTATIONS

In this subsection, we show that the iterates stay close to $\Gamma$ throughout phase 2, i.e., $\|x^{(s)}_{k,t}\|_2$ remains small. We first define the following sequence $\{m^{(s)}_{k,t}\}_{t=0}^{H}$ that will be useful in the proof: $m^{(s)}_{k,t} := \sum_{\tau=0}^{t-1} z^{(s)}_{k,\tau}$, $m^{(s)}_{k,0} = 0$. We also define $P : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ as an extension of $\partial\Phi$:
$$P(\theta) := \begin{cases}\partial\Phi(\theta), & \text{if } \theta \in \Gamma^{\epsilon_2},\\ 0, & \text{otherwise}.\end{cases}$$
Finally, we define a martingale $\{Z^{(s)}_t : s \ge 0,\ 0 \le t \le H\}$:
$$Z^{(s)}_t := \frac{1}{K}\sum_{k\in[K]}\sum_{r=0}^{s-1}\sum_{\tau=0}^{H-1} P(\bar\theta^{(r)})\, z^{(r)}_{k,\tau} + \frac{1}{K}\sum_{k\in[K]}\sum_{\tau=0}^{t-1} P(\bar\theta^{(s)})\, z^{(s)}_{k,\tau}, \qquad Z^{(0)}_0 = 0.$$
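The Azuma-Hoeffding bounds applied to these martingales in the next subsection can be checked empirically (a toy experiment with Rademacher increments standing in for the bounded noise $z^{(s)}_{k,t}$; all parameters are assumptions of the example): the fraction of sample paths whose final sum exceeds $\sigma_{\max}\sqrt{2H\log\frac{2}{\delta}}$ should be at most $\delta$.

```python
import numpy as np

H, delta, sigma_max, trials = 1000, 0.05, 1.0, 5000
rng = np.random.default_rng(0)

# m_H = sum of H bounded martingale increments (Rademacher, |increment| <= sigma_max)
increments = sigma_max * rng.choice([-1.0, 1.0], size=(trials, H))
m_H = np.abs(increments.sum(axis=1))

bound = sigma_max * np.sqrt(2 * H * np.log(2 / delta))  # Azuma-Hoeffding threshold
frac_exceed = np.mean(m_H >= bound)
print(f"threshold {bound:.1f}, empirical tail {frac_exceed:.4f} <= delta = {delta}")
assert frac_exceed <= delta
```

In fact the Gaussian approximation predicts a tail near $0.007$ here, well below $\delta = 0.05$, since Azuma-Hoeffding is not tight.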

K.6.2 PROOF FOR THE HIGH PROBABILITY BOUNDS

A direct application of Azuma-Hoeffding's inequality yields the following lemma.

Lemma K.19 (Concentration property of $m^{(s)}_{k,t}$). With probability at least $1-\delta$, the following holds:
$$\|m^{(s)}_{k,t}\|_2 \le C_9 \sqrt{\tfrac{1}{\eta}\log\tfrac{1}{\eta\delta}}, \qquad \forall 0 \le t \le H,\ k \in [K],\ 0 \le s < R_{\mathrm{grp}},$$
where $C_9$ is a constant.

Proof. Notice that $\|m^{(s)}_{k,t+1} - m^{(s)}_{k,t}\|_2 \le \sigma_{\max}$. Then by Azuma-Hoeffding's inequality,
$$\mathbb{P}\big(\|m^{(s)}_{k,t}\|_2 \ge \epsilon'\big) \le 2\exp\left(-\frac{\epsilon'^2}{2 t \sigma_{\max}^2}\right).$$
Taking a union bound over the $K$ clients, $H$ local steps and $R_{\mathrm{grp}}$ rounds, we obtain that the following holds with probability at least $1-\delta$:
$$\|m^{(s)}_{k,t}\|_2 \le \sigma_{\max}\sqrt{2H\log\tfrac{2KHR_{\mathrm{grp}}}{\delta}}, \qquad \forall 0 \le t \le H,\ k \in [K],\ 0 \le s < R_{\mathrm{grp}}.$$
Substituting in $H = \frac{\alpha}{\eta}$ and $R_{\mathrm{grp}} = \lfloor\frac{1}{\alpha\eta^\beta}\rfloor$ yields the lemma.

Again applying Azuma-Hoeffding's inequality, we have the following lemma about the concentration property of $Z^{(s)}_t$.

Lemma K.20 (Concentration property of $Z^{(s)}_t$). With probability at least $1-\delta$, the following inequality holds:
$$\|Z^{(s)}_H\|_2 \le C_{12}\, \eta^{-0.5-0.5\beta}\sqrt{\log\tfrac{1}{\eta\delta}}, \qquad \forall 0 \le s < R_{\mathrm{grp}}.$$

Proof. Notice that $\|Z^{(s)}_{t+1} - Z^{(s)}_t\|_2 \le \nu_2\sigma_{\max}$ for all $0 \le t \le H-1$, and $\|Z^{(s+1)}_0 - Z^{(s)}_H\|_2 \le \nu_2\sigma_{\max}$. By Azuma-Hoeffding's inequality,
$$\mathbb{P}\big(\|Z^{(s)}_t\|_2 \ge \epsilon'\big) \le 2\exp\left(-\frac{\epsilon'^2}{2(sH+t)\,\nu_2^2\sigma_{\max}^2}\right).$$
Taking a union bound over the $R_{\mathrm{grp}}$ rounds, we obtain that the following holds with probability at least $1-\delta$:
$$\|Z^{(s)}_H\|_2 \le \sigma_{\max}\nu_2\sqrt{2HR_{\mathrm{grp}}\log\tfrac{2R_{\mathrm{grp}}}{\delta}}, \qquad \forall 0 \le s < R_{\mathrm{grp}}.$$
Substituting in $H = \frac{\alpha}{\eta}$ and $R_{\mathrm{grp}} = \lfloor\frac{1}{\alpha\eta^\beta}\rfloor$ yields the lemma.

We proceed to present a direct corollary of Lemma K.17, which bounds the potential function over $R_{\mathrm{grp}}$ rounds.

Lemma K.21. Given $\|\bar\theta^{(0)} - \phi^{(0)}\|_2 \le C'_0\sqrt{\eta\log\tfrac{1}{\eta}}$ where $C'_0$ is a constant, then for $\delta = O(\mathrm{poly}(\eta))$, with probability at least $1-\delta$,
$$\bar\theta^{(s)} \in \Gamma^{\epsilon_0}, \quad \Psi(\bar\theta^{(s)}) \le C'_1\sqrt{\eta\log\tfrac{1}{\eta\delta}}, \qquad \forall 0 \le s < R_{\mathrm{grp}},$$
$$\theta^{(s)}_{k,t} \in \Gamma^{\epsilon_2}, \quad \Psi(\theta^{(s)}_{k,t}) \le C'_1\sqrt{\eta\log\tfrac{1}{\eta\delta}}, \qquad \forall 0 \le s < R_{\mathrm{grp}},\ 0 \le t \le H,\ k \in [K],$$
where $C'_1$ is a constant that can depend on $C'_0$. Furthermore, $\Psi(\bar\theta^{(R_{\mathrm{grp}})}) \le C_{10}\sqrt{\eta\log\tfrac{1}{\eta\delta}}$, where $C_{10}$ is a constant independent of $C'_0$.

Still by the triangle inequality, we have
$$\|\theta^{(s)}_{k,t} - \bar\theta^{(s)}\|_2 \le \eta\sum_{\tau=0}^{t-1}\|\nabla L(\theta^{(s)}_{k,\tau})\|_2 + \eta\|m^{(s)}_{k,t}\|_2.$$
Due to ρ 2 -smoothness of L, when (55) holds, ∥∇L(θ (s) k,τ )∥ 2 ≤ 2ρ 2 Ψ(θ (s) k,τ ) ≤ C 1 2ρ 2 η log 2 ηδ . By Lemma K.19, with probability at least 1 -δ 2 , ∥m k,t ∥ 2 ≤ C9 1 η log 2 ηδ , ∀0 ≤ t ≤ H, k ∈ [K], 0 ≤ s < R grp . Combining ( 59) and ( 60), when ( 55) and ( 56) hold simultaneously, there exists a constant C 3 which can depend on C 0 such that ∥θ (s) k,t -θ(s) ∥ 2 ≤ C 3 η log 1 ηδ , ∀k ∈ [K], 0 ≤ t ≤ H. By triangle inequality, ∥ θ(s+1) -θ(s) ∥ 2 ≤ C 3 η log 1 ηδ . Combining ( 57), ( 58) and ( 61), we complete the proof. Then we provide high probability bounds for the movement of ϕ (s) within R grp rounds. Lemma K.23. Given ∥ θ(0) -ϕ (0) ∥ 2 ≤ C 0 η log 1 η where C 0 is a constant, then for δ = O(poly(η)), with probability at least 1 -δ, ∥ϕ (s) -ϕ (0) ∥ 2 ≤ C 4 η 0.5-0.5β log 1 ηδ , ∀1 ≤ s ≤ R grp . where C 4 is a constant that can depend on C 0 . Proof. By the update rule of Local SGD, θ (s) k,H = θ(s) -η H-1 t=0 ∇L(θ (s) k,t ) -η H-1 t=0 z (s) k,t Averaging among K clients gives θ(s+1) = θ(s) - η K H-1 t=0 k∈[K] ∇L(θ (s) k,t ) - η K H-1 t=0 k∈[K] z (s) k,t . By Lemma K.22, for δ = O(poly(η)), the following holds with probability at least 1 -δ/3, ∥θ (s) k,t -θ(s) ∥ 2 ≤ C 2 η log 3 ηδ , θ k,t ∈ B ϵ0 (ϕ (s) ), ∀0 ≤ s < R grp , 0 ≤ t ≤ H, k ∈ [K], ∥ θ(s+1) -θ(s) ∥ 2 ≤ C 2 η log 3 ηδ , θ(s) , θ(s+1) ∈ B ϵ0 (ϕ (s) ), ∀0 ≤ s < R grp . When ( 62) and ( 63) hold, we can expand Φ( θ(s+1) ) as follows: ϕ (s+1) = ϕ (s) + ∂Φ( θ(s) )( θ(s+1) -θ(s) ) + 1 2 ∂ 2 Φ( θ(s) )[ θ(s+1) -θ(s) , θ(s+1) -θ(s) ] = ϕ (s) - η K H-1 t=0 k∈[K] ∂Φ( θ(s) )∇L(θ (s) k,t ) T (s) 1 - η K ∂Φ( θ(s) ) H-1 t=0 k∈[K] z (s) k,t T (s) 2 + 1 2 ∂ 2 Φ(a (s) θ(s) + (1 -a (s) ) θ(s+1) )[θ (s+1) -θ (s) , θ (s+1) -θ (s) ] T (s) 3 , where a (s) ∈ (0, 1). Telescoping from round 0 to s -1, we have ∥ϕ (s) -ϕ (0) ∥ 2 = s-1 r=0 T (r) 1 + s-1 r=0 T (r) 2 + s-1 r=0 T (r) 3 . From (63), we can bound ∥T 62) and ( 63) hold, we have (s) 3 ∥ 2 by ∥T (s) 3 ∥ 2 ≤ 1 2 ν 2 C 2 2 η log 3 ηδ . 
We proceed to bound ∥T (s) 1 ∥ 2 . When ( ∂Φ( θ(s) )∇L(θ (s) k,t ) = ∂Φ(θ (s) k,t )∇L(θ (s) k,t ) + ∂ 2 Φ( θ(s) k,t )[θ (s) k,t -θ(s) , ∇L(θ (s) k,t )] = ∂ 2 Φ(b (s) k,t θ(s) + (1 -b (s) k,t ) θ(s) k,t )[θ k,t ∈ (0, 1). By Lemma K.17, with probability at least 1 -δ/3, the following holds: ∥∇L(θ (s) k,t )∥ 2 ≤ 2ρ 2 Ψ(θ (s) k,t ) ≤ C 1 2ρ 2 η log 3 ηδ , ∀k ∈ [K], 0 ≤ t ≤ H, 0 ≤ s < R grp . When ( 62), ( 63) and ( 64) hold simultaneously, we have for all 0 ≤ s < R grp , ∥T (s) 1 ∥ 2 ≤ ην 2 K H-1 t=0 ∥θ (s) k,t -θ(s) ∥ 2 ∥∇L(θ (s) k,t )∥ 2 ≤ αν 2 √ 2ρ 2 C 1 C 2 K η log 3 ηδ . Finally, we bound ∥ s-1 r=0 T (r) 2 ∥ 2 . By Lemma K.20, the following inequality holds with probability at least 1 -δ/3: ∥Z (s) H ∥ 2 ≤ C12 η -0.5-0.5β log 3 ηδ , ∀0 ≤ s < R grp . When ( 62), ( 63) and ( 65) hold simultaneously, we have ∥ s r=0 T (r) 2 ∥ 2 = η∥Z (s) H ∥ 2 ≤ C12 η 0.5-0.5β log 3 ηδ , ∀0 ≤ s < R grp Combining the bounds for ∥T (s) 1 ∥ 2 , ∥ s r=0 T (r) 2 ∥ 2 and ∥T (s) 3 ∥ 2 and taking union bound, we obtain that for δ = O(poly(η)), the following inequality holds with probability at least 1 -δ: ∥ϕ (s) -ϕ (0) ∥ 2 ≤ C 4 η 0.5-0.5β log 1 ηδ , ∀1 ≤ s ≤ R grp . where C 4 is a constant that can depend on C 0 . K.7 SUMMARY OF THE DYNAMICS AND PROOF OF THEOREMS J.1 AND J.2 Based on the results in Appendix K.5 and Appendix K.6, we summarize the dynamics of Local SGD iterates and then present the proof of Theorems J.1 and J.2 in this subsection. For convenience, we first introduce the definition of global step and δ-good step. Definition K.3 (Global step). Define I as the index set {(s, t) : s ≥ 0, 0 ≤ t ≤ H} with lexicographical order, which means (s 1 , t 1 ) ⪯ (s 2 , t 2 ) if and only if s 1 < s 2 or (s 1 = s 2 and t 1 ≤ t 2 ). A global step is indexed by (s, t) corresponding to the t-th local step at round s. Definition K.4 (δ-good step). 
In the training process of Local SGD, we say the global step (s, t) ⪯ (R tot , 0) is δ-good if the following inequalities hold: ∥ Z(r) k,τ ∥ 2 ≤ exp(αρ 2 )σ max 2H log 6HR tot K δ , ∀k ∈ [K], (r, τ ) ⪯ (s, t), ∥m (r) k,τ ∥ 2 ≤ σ max 2H log 6KHR tot δ , ∀k ∈ [K], (r, τ ) ⪯ (s, t), ∥Z H ∥ 2 ≤ σ max ν 2 2HR grp log 2R tot δ , ∀0 ≤ r < s. Applying the concentration properties of Z(r) k,τ , m k,τ and Z H (Lemmas K.20, K.19 and K.12) yields the following theorem. Theorem K.1. For δ = O(poly(η)), with probability at least 1 -δ, all global steps (s, t) ⪯ (R tot , 0) are δ-good. In the remainder of this subsection, we use O(•) notation to hide constants independent of δ and η. Below we present a summary of the dynamics of Local SGD when θ( 0) is initialized such that Φ( θ(0) ) ∈ Γ and all global steps are δ-good. Phase 1 lasts for s 0 + s 1 = O(log 1 η ) rounds. At the end of phase 1, the iterate reaches within O( η log 1 ηδ ) from Γ, i.e., ∥ θ(s0+s1) -ϕ (s0+s1) ∥ 2 = O( η log 1 ηδ ). The change of the projection on manifold over s 0 + s 1 rounds, ∥ϕ (s1+s0) -ϕ (0) ∥ 2 , is bounded by O(log 1 η η log 1 ηδ ). After s 0 + s 1 rounds, the dynamic enters phase 2 when the iterates stay close to Γ with θ(s ) ∈ Γ ϵ2 , ∀s 0 + s 1 ≤ s ≤ R tot and θ (s) k,t ∈ Γ ϵ2 , ∀k ∈ [K], (s 0 + s 1 , 0) ⪯ (s, t) ⪯ (R tot , 0). Furthermore, ∥x (s) k,t ∥ 2 and ∥ x(s) H ∥ 2 satisfy the following equations: ∥x (s) k,t ∥ 2 = O( η log 1 ηδ ), ∀k ∈ [K], 0 ≤ t ≤ H, s 0 + s 1 ≤ s < R tot , ∥ x(s) H ∥ 2 = O( η log 1 ηδ ), ∀s 0 + s 1 ≤ s < R tot . Moreover, for s 0 + s 1 ≤ s ≤ R tot -R grp , the change of the manifold projection within R grp rounds can be bounded as follows: ∥ϕ (s+r) -ϕ (s) ∥ 2 = O(η 0.5-0.5β log 1 ηδ ), ∀1 ≤ r ≤ R grp . After combing through the dynamics of Local SGD iterates during the approaching and drift phase, we are ready to present the proof of Theorems J.1 and J.2, which are direct consequences of the lemmas in Appendix K.5 and K.6. Proof of Theorem J.1. 
By Lemmas K.15, K.22 and Corollary K.1, for δ = O(poly(η)), when all global steps are δ-good, θ(s ) ∈ Γ ϵ2 , ∀s 0 + s 1 ≤ s ≤ R tot and θ (s) k,t ∈ Γ ϵ2 , ∀k ∈ [K], (s 0 + s 1 , 0) ⪯ (s, t) ⪯ (R tot , 0) and ∥x (s) k,t ∥ 2 , ∥ x(s) H ∥ 2 satisfy the following equations: ∥x (s) k,t ∥ 2 = O( η log 1 ηδ ), ∀k ∈ [K], 0 ≤ t ≤ H, s 0 + s 1 ≤ s < R tot , ∥ x(s) H ∥ 2 = O( η log 1 ηδ ), ∀s 0 + s 1 ≤ s < R tot . Hence ∥ x(Rtot) 0 ∥ 2 = O( Ψ( θ(Rtot) )) = O(∥ x(Rtot-1) H ∥ 2 ) = O( η log 1 ηδ ) by smoothness of L and Lemma K.10. According to Theorem K.1, with probability at least 1 -δ, all global steps are δ-good, thus completing the proof. Proof of Theorem J.2. By Lemma K.23, for δ = O(poly(η)), when all global steps are δ-good, then ∀s 0 + s 1 ≤ s ≤ R tot -R grp , ∥ϕ (s+r) -ϕ (s) ∥ 2 = Õ(η 0.5-0.5β ), ∀0 ≤ r ≤ R grp . Also, by Lemma K.18, when all global steps are δ-good, the change of projection on manifold over s 0 +s 1 rounds (i.e., Phase 1), ∥ϕ (s0+s1) -ϕ (0) ∥ 2 is bounded by Õ( √ η). According to Theorem K.1, with probability at least 1 -δ, all global steps are δ-good, thus completing the proof. K.8 PROOF OF THEOREM 3.3 In this subsection, we explicitly derive the dependency of the approximation error on α. The proofs are quite similar to those in Appendix K.5 and hence we only state the key proof idea for brevity. With the same method as the proofs in Appendix K.5.2, we can show that with high probability, ∥ θ(s) -ϕ (s) ∥ 2 ≤ 1 2 µ ρ2 after s ′ 0 = O(1) rounds. Below we focus on the dynamics of Local SGD thereafter. We first remind the readers of the definition of { Zs k,t }: Z(s) k,t := t-1 τ =0 t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l )) z (s) k,τ , Z(s) k,0 = 0. We have the following lemma that controls the norm of the matrix product t-1 l=τ +1 (I - η∇ 2 L( ũ(s) l ) ). Lemma K.24. Given θ(s) ∈ Γ ϵ0 , then there exists a positive constant C ′ 3 independent of α such that for all 0 ≤ τ < t ≤ H, t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l )) 2 ≤ C ′ 3 . Proof. 
Since θ(s) ∈ Γ ϵ0 , then ũ(s) t ∈ Γ ϵ1 for all 0 ≤ t ≤ H. We first bound the minimum eigenvalue of ∇ 2 L( ũ(s) t ). Due to the PL condition, by Lemma K.6, for η ≤ 1 ρ2 , L( ũ(s) t ) -L * ≤ (1 -µη) t L( θ(s) ) -L * ≤ exp(-µtη)(L( θ(s) ) -L * ), ∀0 ≤ t ≤ H. Therefore, Ψ( ũ(s) t ) ≤ exp(-µtη/2) Ψ( θ(s) ). Let C ′ 1 = ρ 3 ρ2 µ . By Weyl's inequality, |λ min (∇ 2 L( ũ(s) t ))| = |λ min (∇ 2 L( ũ(s) t )) -λ min (∇ 2 L(Φ( ũ(s) t ))| ≤ ρ 3 ∥∇ 2 L( ũ(s) t ) -∇ 2 L(Φ( ũ(s) t ))∥ 2 ≤ ρ 3 ∥ ũ(s) t -Φ( ũ(s) t )∥ 2 ≤ ρ 3 2 µ exp(-µtη/2) Ψ( θ(s) ) ≤ C ′ 1 exp(-µtη/2)ϵ 0 , where the last two inequalities use Lemmas K.10 and K.7 respectively. Therefore, for all 0 ≤ t ≤ H and 0 ≤ τ ≤ t -1, ∥ t-1 l=τ +1 (I -η∇ 2 L( ũ(s) l ))∥ 2 ≤ t-1 l=τ +1 (1 + η|λ min ∇ 2 L( ũ(s) l )|) ≤ ∞ l=0 (1 + η|λ min ∇ 2 L( ũ(s) l )|) ≤ exp(ηϵ 0 C ′ 1 ∞ l=0 exp(-µlη/2)). For sufficiently small η, there exists a constant C ′ 2 such that ∞ l=0 exp(-µlη/2)) = 1 1 -exp(-µη/2) ≤ C ′ 2 η . Substituting ( 67) into (66), we obtain the lemma. Based on Lemma K.24, we obtain the following lemma about the concentration property of Z(s) k,t , which can be derived in the same way as Lemma K.12. Lemma K.25. Given θ(s) ∈ Γ ϵ0 , then with probability at least 1 -δ, ∥ Z(s) k,t ∥ 2 ≤ C ′ 3 σ max 2α η log 2αK ηδ , ∀0 ≤ t ≤ H, k ∈ [K], where C ′ 3 is defined in Lemma K.24. The following lemma can be derived analogously to Lemma K.14 but the error bound is tighter in terms of its dependency on α. Lemma K.26. Given θ(s) ∈ Γ ϵ1 , then for δ = O(poly(η)), with probability at least 1 -δ, there exists a constant C ′ 4 independent of α such that ∥θ (s) k,t - ũ(s) t ∥ 2 ≤ C ′ 4 αη log α ηδ , ∀0 ≤ t ≤ H, k ∈ [K], and ∥ θ(s+1) - ũ(s) H ∥ 2 ≤ C ′ 4 αη log α ηδ . Then, similar to Lemma K.17, we can show that for δ = O(poly(η)) and simultaneously all s ≥ s ′ 0 + s ′ 1 where s ′ 1 = O( 1 α log 1 η ), it holds with probability at least 1 -δ that ∥ θ(s) -ϕ (s) ∥ 2 = O( αη log α ηδ ) . 
Note that to eliminate the dependency of the second term's denominator on α in (44), we can treat the cases α > c 0 and α < c 0 separately, where c 0 is an arbitrary positive constant independent of α. For the case α < c 0 , we group ⌈c 0 /α⌉ rounds together and repeat the arguments in this subsection to analyze the closeness between the Local SGD and GD iterates as well as the evolution of the loss.

K.9 COMPUTING THE MOMENTS FOR ONE "GIANT STEP"

In this subsection, we compute the first and second moments for the change of manifold projection every R grp rounds of Local SGD. Since the randomness in training might drive the iterate out of the working zone, making the dynamic intractable, we analyze a more well-behaved sequence { θ(s) k,t : (s, t) ⪯ (R tot , 0), k ∈ [K]} which is equal to {θ (s) k,t } with high probability. Specifically, θ(s) k,t equal to θ (s) k,t if the global step (s, t) is η 100 -good and is set as a point ϕ null ∈ Γ otherwise. The formal definition is as follows. Denote by E (s) t the event {global step (s, t) is η 100 -good}. Define a well-behaved sequence θ(s) k,t := θ (s) k,t 1 E (s) t + ϕ null 1 Ē(s) t , which satisfies the following update rule: θ(s) k,t+1 = θ (s) k,t+1 1 E (s) t+1 + ϕ null 1 Ē(s) t+1 (68) = θ(s) k,t -η∇L( θ(s) k,t ) -ηz (s) k,t -1 Ē(s) t+1 ( θ(s) k,t -η∇L( θ(s) k,t ) -ηz (s) k,t ) + 1 Ē(s) t+1 ϕ null :=ê (s) k,t . By Theorem K.1, with probability at least 1 -η 100 , θ(s) k,t = θ (s) k,t , ∀k ∈ [K], (s, t) ⪯ (R tot , 0). Similar to {θ (s) k,t }, we define the following variables with respect to { θ(s) k,t }: θ(s+1) avg := 1 K k∈[K] θ(s) k,H , φ(s) := Φ( θ(s) avg ), x(s) k,t := θ(s) k,t -φ(s) , x(s) avg,0 := θ(s) avg -φ(s) , x(s) avg,H := 1 K k∈[K] x(s) k,H . Notice that x(s) k,0 = x(s) avg,0 for all k ∈ [K]. Finally, we introduce the following mapping Ψ(θ) : Γ → R d×d , which is closely related to Ψ defined in Theorem 3.2. Definition K.6. For θ ∈ Γ, we define the mapping Ψ(θ) : Γ → R d×d : Ψ(θ) = i,j∈[d] ψ(ηH(λ i + λ j )) Σ(θ), v i v ⊤ j v i v ⊤ j , where λ i , v i are the i-th eigenvalue and eigenvector of ∇ 2 L(θ) and v i 's form an orthonormal basis of R d . Additionally, ψ(x) := e -x -1+x x and ψ(0) = 0; see Figure 9 for a plot. Remark K.1. Intuitively, Ψ(θ) rescales the entries of Σ(θ) in the eigenbasis of ∇ 2 L(θ). When ∇ 2 L(θ) = diag(λ 1 , • • • , λ d ) ∈ R d×d , where λ i = 0 for all m < i ≤ d, Ψ(Σ 0 ) i,j = ψ(ηH(λ i + λ j ))Σ 0,i,j . 
Note that Ψ(θ) can also be written as vec(Ψ(θ)) = ψ(ηH(∇ 2 L(θ) ⊕ ∇ 2 L(θ)))vec(Σ(θ)), where ⊕ denotes the Kronecker sum A ⊕ B = A ⊗ I d + I d ⊗ B, vec(•) is the vectorization operator of a matrix and ψ(•) is interpreted as a matrix function. Now we are ready to present the result about the moments of φ(s+Rgrp) -φ(s) . Theorem K.2. For s 0 + s 1 ≤ s ≤ R tot -R grp and 0 < β < 0.5, the first and second moments of φ(s+Rgrp) -φ(s) are as follows: E[ φ(s+Rgrp) -φ(s) | φ(s) , E (s) 0 ] = η 1-β 2B ∂ 2 Φ( φ(s) )[Σ( φ(s) ) + (K -1)Ψ( φ(s) )] + Õ(η 1.5-2β ) + Õ(η), E[( φ(s+Rgrp) -φ(s) )( φ(s+Rgrp) -φ(s) ) ⊤ | φ(s) , E (s) 0 ] = η 1-β B Σ ∥ ( φ(s) ) + Õ(η 1.5-2β ) + Õ(η), where Õ(•) hides log terms and constants independent of η. Remark K.2. By Theorem K.1 and the definition of θ(s) k,t , (70) and (71) still hold when we replace φ(s) with ϕ (s) and replace φ(s+Rgrp) with ϕ (s+Rgrp) . We shall have Theorem K.2 if we prove the following theorem, which directly gives Theorem K.2 with a simple shift of index. For brevity, denote by ∆ φ(s) := φ(s) -φ(0) , Σ 0 := Σ( φ(0) ), Σ 0,∥ := Σ ∥ ( φ(0) ). Theorem K.3. Given ∥ θ(0) avg -φ(0) ∥ 2 = O( η log 1 η ), for 0 < β < 0.5, the first and second moments of ∆ φ(Rgrp) are as follows: 5-1.5β ) + Õ(η). E[∆ φ(Rgrp) ] = η 1-β 2B ∂ 2 Φ( φ(0) )[Σ 0 + (K -1)Ψ( φ(0) )] + Õ(η 1.5-2β ) + Õ(η), E[∆ φ(Rgrp) ∆ φ(Rgrp)⊤ ] = η 1-β B Σ 0,∥ + Õ(η 1. We will prove Theorem K.3 in the remainder of this subsection. For convenience, we introduce more notations that will be used throughout the proof. Let H 0 := ∇ 2 L( φ(0) ). By Assumption 3.2, rank(H 0 ) = m. WLOG, assume H 0 = diag(λ 1 , • • • , λ d ) ∈ R d×d , where λ i = 0 for all m < i ≤ d and λ 1 ≥ λ 2 • • • ≥ λ m . By Lemma K.2, ∂Φ( φ(0) ) is the projection matrix onto the tangent space T φ(0) (Γ) (i.e. the null space of ∇ 2 L( φ(0) )) and therefore, ∂Φ( φ(0) ) = 0 0 0 I d-m . Let P ∥ := ∂Φ( φ(0) ) and P ⊥ := I d -P ∥ . 
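As a concrete illustration, Definition K.6 and the vectorized form above can be cross-checked numerically. The NumPy sketch below (all matrices are random stand-ins for ∇²L and Σ, not quantities from the paper's experiments) computes Ψ by entrywise rescaling in the eigenbasis of the Hessian, and this agrees with applying ψ as a matrix function of ηH(∇²L ⊕ ∇²L) to vec(Σ):

```python
import numpy as np

def psi(x):
    # psi(x) = (exp(-x) - 1 + x) / x, with psi(0) = 0 (removable singularity)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = np.abs(x) > 1e-12
    out[nz] = (np.exp(-x[nz]) - 1.0 + x[nz]) / x[nz]
    return out

def Psi(hessian, Sigma, eta, H_steps):
    # Entrywise rescaling of Sigma in the eigenbasis of the Hessian:
    # in eigen-coordinates, Psi_{ij} = psi(eta * H * (lam_i + lam_j)) * Sigma_{ij}
    lam, V = np.linalg.eigh(hessian)
    S = V.T @ Sigma @ V                                      # Sigma in the eigenbasis
    scale = psi(eta * H_steps * (lam[:, None] + lam[None, :]))
    return V @ (scale * S) @ V.T
```

Here `hessian`, `Sigma`, `eta` and `H_steps` are placeholders for ∇²L(θ), Σ(θ), η and H; the equivalence with the Kronecker-sum form is exact for any symmetric inputs.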
Let Â(s) avg := E[ x(s) avg,H x(s)⊤ avg,H ], q(s) t := E[ x(s) k,t ] and B(s) t := E[ x(s) k,t ∆ φ(s)⊤ ]. The latter two notations are independent of k since θ(s) 1,t , . . . , θ(s) K,t are identically distributed. The following lemma computes the first and second moments of the change of manifold projection every round. Lemma K.27. Given ∥ θ(0) avg -φ(0) ∥ 2 = O( η log 1 η ), for 0 ≤ s < R grp , the first and second moments of φ(s+1) -φ(s) are as follows: E[ φ(s+1) -φ(s) ] = P ∥ q(s) H + ∂ 2 Φ( φ(0) )[ B(s) H ] + 1 2 ∂ 2 Φ( φ(0) )[ Â(s) avg ] + Õ(η 1.5-β ), E[( φ(s+1) -φ(s) )( φ(s+1) -φ(s) ) ⊤ ] = P ∥ Â(s) avg P ∥ + Õ(η 1.5-0.5β ). Proof. By Taylor expansion, we have φ(s+1) = Φ φ(s) + x(s) avg,H = φ(s) + ∂Φ( φ(s) ) x(s) avg,H + 1 2 ∂ 2 Φ( φ(s) )[ x(s) avg,H x(s)⊤ avg,H ] + O(∥ x(s) avg,H ∥ 3 2 ) = φ(s) + ∂Φ( φ(0) + ∆ φ(s) ) x(s) avg,H + 1 2 ∂ 2 Φ( φ(0) + ∆ φ(s) )[ x(s) avg,H x(s)⊤ avg,H ] + O(∥ x(s) avg,H ∥ 3 2 ) = φ(s) + P ∥ x(s) avg,H + ∂ 2 Φ( φ(0) )[ x(s) avg,H ∆ φ(s)⊤ ] + 1 2 ∂ 2 Φ( φ(0) )[ x(s) avg,H x(s)⊤ avg,H ] + O(∥∆ φ(s) ∥ 2 2 ∥ x(s) avg,H ∥ 2 + ∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 + ∥ x(s) avg,H ∥ 3 2 ). Rearrange the terms and we obtain: φ(s+1) -φ(s) = P ∥ x(s) avg,H + ∂ 2 Φ( φ(0) )[ x(s) avg,H ∆ φ(s)⊤ ] + 1 2 ∂ 2 Φ( φ(0) )[ x(s) avg,H x(s)⊤ avg,H ] + O(∥∆ φ(s) ∥ 2 2 ∥ x(s) avg,H ∥ 2 + ∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 + ∥ x(s) avg,H ∥ 3 2 ). Moreover, ( φ(s+1) -φ(s) )( φ(s+1) -φ(s) ) ⊤ = P ∥ x(s) avg,H x(s)⊤ avg,H P ∥ + O(∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 ). Noticing that x(s) k,H ∆ φ(s)⊤ are identically distributed for all k ∈ [K], we have E[ x(s) avg,H ∆ φ(s)⊤ ] = 1 K k∈[K] E[ x(s) k,H ∆ φ(s)⊤ ] = B(s) H . Then taking expectation of both sides of ( 74) gives E[ φ(s+1) -φ(s) ] = P ∥ q(s) H + ∂ 2 Φ( φ(0) )[ B(s) H ] + 1 2 ∂ 2 Φ( φ(0) )[ Â(s) avg ] + O(E[∥∆ φ(s) ∥ 2 2 ∥ x(s) avg,H ∥ 2 ] + E[∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 ] + E[∥ x(s) avg,H ∥ 3 2 ]). 
Again taking expectation of both sides of ( 75) yields E[( φ(s+1) -φ(s) )( φ(s+1) -∆ φ(s)⊤ )] = P ∥ Â(s) avg P ∥ + O(E[∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 ]). By Lemmas K.22 and K.23, the following holds simultaneously with probability at least 1 -η 100 : ∥∆ φ(s) ∥ 2 = Õ(η 0.5-0.5β ), ∥ x(s) avg,H ∥ 2 = Õ(η 0.5 ). Furthermore, since for all k ∈ [K] and (s, t) ⪯ (R tot , 0), θ(s) k,t stays in Γ ϵ2 which is a bounded set, ∥∆ φ(s) ∥ 2 and ∥ x(s) avg,H ∥ 2 are also bounded. Therefore, we have E[∥∆ φ(s) ∥ 2 2 ∥ x(s) avg,H ∥ 2 ] = Õ(η 1.5-β ), E[∥∆ φ(s) ∥ 2 ∥ x(s) avg,H ∥ 2 2 ] = Õ(η 1.5-0.5β ), E[∥ x(s) avg,H ∥ 3 2 ] = Õ(η 1.5 ), which concludes the proof. We compute Â(s) avg , q(s) t and B(s) t by solving a set of recursions, which is formulated in the following lemma. Additionally, define Â(s) t := E[ x(s) k,t x(s)⊤ k,t ] and M (s) t := E[ x(s) k,t x(s) k,l ], (k ̸ = l). Lemma K.28. Given ∥ θ(0) avg -φ(0) ∥ 2 = O( η log 1 η ), for 0 ≤ s < R grp and 0 ≤ t < H, we have the following recursions. q(s) t+1 = q(s) t -ηH 0 q(s) t -η∇ 3 L(ϕ (0) )[ B(s) t ] - η 2 ∇ 3 L(ϕ (0) )[ Â(s) t ] + Õ(η 2.5-β ), Â(s) t+1 = Â(s) t -ηH 0 Â(s) t -η Â(s) t H 0 + η 2 B loc Σ 0 + Õ(η 2.5-0.5β ), M (s) t+1 = M (s) t -ηH 0 M (s) t -η M (s) t H 0 + Õ(η 2.5-0.5β ), B(s) t+1 = (I -ηH 0 ) B(s) t + Õ(η 2.5-β ). Moreover, Â(s) avg = 1 K Â(s) H + (1 - 1 K ) M (s) H , M (s+1) 0 = Â(s+1) 0 = P ⊥ Â(s) avg P ⊥ + O(η 1.5-0.5β ), q(s+1) 0 = P ⊥ q(s) H -∂ 2 Φ(ϕ (0) )[ B(s) H ] - 1 2 ∂ 2 Φ(ϕ (0) )[ Â(s) avg ] + Õ(η 1.5-β ), B(s+1) 0 = P ⊥ B(s) H + P ⊥ Â(s) avg P ∥ + Õ(η 1.5-β ). Proof. We first derive the recursion for q(s) t . Recall the update rule for θ(s) k,t : θ(s) k,t+1 = θ(s) k,t -η∇L( θ(s) k,t ) -ηz (s) k,t + ê(s) k,t . 
Subtracting φ(s) from both sides gives x(s) k,t+1 = x(s) k,t -η∇L( θ(s) k,t ) -ηz (s) k,t + O(∥ê (s) k,t ∥ 2 ) = x(s) k,t -η ∇ 2 L( φ(s) ) x(s) k,t + 1 2 ∇ 3 L( φ(s) )[ x(s) k,t x(s)⊤ k,t ] + O(∥ x(s) k,t ∥ 3 2 ) -ηz (s) k,t + O(∥ê (s) k,t ∥ 2 ) = x(s) k,t -η ∇ 2 L( φ(0) ) + ∇ 3 L( φ(0) )∆ φ(s) + O(∥∆ φ(s) ∥ 2 ) x(s) k,t - η 2 ∇ 3 L( φ(0) ) + O(∥∆ φ(s) ∥ 2 ) [ x(s) k,t x(s)⊤ kt ] -ηz (s) k,t + O(η∥ x(s) k,t ∥ 3 2 + ∥ê (s) k,t ∥ 2 ) = x(s) k,t -ηH 0 x(s) k,t -η∇ 3 L( φ(0) )[ x(s) k,t ∆ φ(s)⊤ ] - η 2 ∇ 3 L( φ(0) )[ x(s) k,t x(s)⊤ k,t ] -ηz (s) k,t + O(η∥ x(s) k,t ∥ 3 2 + η∥∆ φ(s) ∥ 2 ∥ x(s) k,t ∥ 2 2 + η∥∆ φ(s) ∥ 2 2 ∥ x(s) k,t ∥ 2 + ∥ê (s) k,t ∥ 2 ), where the second and third equality perform Taylor expansion. Taking expectation on both sides gives q(s) t+1 = (I -ηH 0 ) q(s) t -η∇ 3 L( φ(0) )[ q(s) t ] - η 2 ∇ 3 L( φ(0) )[ Â(s) t ] + O ηE[∥ x(s) k,t ∥ 3 2 ] + ηE[∥∆ φ(s) ∥ 2 ∥ x(s) k,t ∥ 2 2 ] + ηE[∥∆ φ(s) ∥ 2 2 ∥ x(s) k,t ∥ 2 ] + E[∥ê (s) k,t ∥ 2 ] . By Theorem K.1, with probability at least 1 -η 100 , ê(s) k,t = 0, ∀k ∈ [K], (s, t) ⪯ (R grp , 0). Also notice that both θ(s) k,t and ϕ null belong to the bounded set Γ ϵ2 . Therefore, ∥ê k,t ∥ 2 is bounded and we have E[∥ê Secondly, we derive the recursion for B(s) t . Multiplying both sides of (87) by ∆ φ(s)⊤ and taking expectation, we have B(s) t+1 = (I -ηH 0 ) B(s) t + O(ηE[∥∆ φ(s) ∥ 2 ∥ x(s) k,t ∥ 2 2 + ∥∆ φ(s) ∥ 2 2 ∥ x(s) k,t ∥ 2 + ∥ê (s) k,t ∥ 2 ]). Still by Theorem K.1 and ( 76) to (78), we have (82). Thirdly, we derive the recursion for Â(s) t . By (87), we have Â(s) t+1 = Â(s) t -ηH 0 Â(s) t -η Â(s) t H 0 + η 2 B loc Σ 0 + O(η 2 E[∥∆ φ(s) ∥ 2 + ∥ x(s) k,t ∥ 2 ]) + O(ηE[∥ x(s) k,t ∥ 3 2 + ∥ x(s) k,t ∥ 2 2 ∥∆ φ(s) ∥ 2 + ∥ê (s) k,t ∥ 2 ]) = (I -ηH 0 ) Â(s) t + η 2 B loc Σ 0 + Õ(η 2.5-0.5β ), which establishes (80). Fourthly, we derive the recursion for M (s) t . 
Multiplying both sides of (87) by x(s) l,t+1 and taking expectation, l ̸ = k, we obtain M (s) t+1 = M (s) t -ηH 0 M (s) t -η M (s) t H 0 + O(ηE[∥ x(s) k,t ∥ 2 ∥ x(s) l,t ∥ 2 ∥∆ φ(s) ∥ 2 ]) + O(ηE[∥ x(s) k,t ∥ 2 2 ∥ x(s) l,t ∥ 2 + ∥ê (s) k,t ∥ 2 ]). By a similar argument to the proof of Lemma K.27, we have E[∥ x(s) k,t ∥ 2 2 ∥ x(s) l,t ∥ 2 ] = Õ(η 1.5 ), E[∥ x(s) k,t ∥ 2 ∥ x(s) l,t ∥ 2 ∥∆ φ(s) ∥ 2 ] = Õ(η 1.5-0.5β ), which yields (81). Now we proceed to prove (83) to (86) . By definition of Â(s) avg , Â(s) avg = 1 K 2 E[( k∈[K] x(s) k,H )( k∈[K] x(s) k,H ) ⊤ ] = 1 K 2 k∈[K] E[ x(s) k,H x(s)⊤ k,H ] + 1 K 2 k,l∈[K],k̸ =l E[ x(s) k,H x(s)⊤ l,H ] = 1 K Â(s) H + (1 - 1 K ) M (s) H , which demonstrates (83). Then we derive (84). By definition of x(s+1) avg,0 , x(s+1) avg,0 = φ(s) + x(s) avg,H -Φ( φ(s) + x(s) avg,H ) = φ(s) + x(s) avg,H -φ(s) + ∂Φ( φ(s) ) x(s) avg,H + O(∥ x(s) avg,H ∥ 2 2 ) = x(s) avg,H -P ∥ + O(∥∆ φ(s) ∥ 2 ) x(s) avg,H + O(∥ x(s) avg,H ∥ 2 2 ) = P ⊥ x(s) avg,H + O(∥ x(s) avg,H ∥ 2 2 + ∥ x(s) avg,H ∥ 2 ∥∆ φ(s) ∥ 2 ). Hence, M (s+1) 0 = Â(s+1) 0 = E[ x(s) avg,0 x(s)⊤ avg,0 ] = P ⊥ Â(s) avg P ⊥ + O(E[∥ x(s) avg,H ∥ 3 2 + ∥ x(s) avg,H ∥ 2 2 ∥∆ φ(s) ∥ 2 ]). By ( 76) and ( 78), we obtain (84). By (74), φ(s+1) -φ(s) = P ∥ x(s) avg,H + O(∥ x(s) avg,H ∥ 2 ∥∆ φ(s) ∥ 2 + ∥ x(s) avg,H ∥ 2 2 ). Combining ( 88) and ( 89) gives

E[ x(s) avg,0 ( φ(s+1) -φ(s) ) ⊤ ] = P ⊥ Â(s) avg P ∥ + Õ(η 1.5-0.5β ). Therefore, B(s+1) 0 = E[ x(s+1) avg,0 ∆ φ(s+1)⊤ ] = E[ x(s+1) avg,0 (∆ φ(s) + φ(s+1) -φ(s) ) ⊤ ] = P ⊥ B(s) H + P ⊥ Â(s) avg P ∥ + Õ(η 1.5-β ). Finally, we apply Lemma K.27 to derive (85). q(s+1) 0 = E[ x(s+1) avg,0 ] = E[ x(s) avg,H -( φ(s+1) -φ(s) )] = q(s) H -P ∥ q(s) H -∂ 2 Φ( φ(0) )[ B(s) H ] - 1 2 ∂ 2 Φ( φ(0) )[ Â(s) avg ] + Õ(η 1.5-β ) = P ⊥ q(s) H -∂ 2 Φ( φ(0) )[ B(s) H ] - 1 2 ∂ 2 Φ( φ(0) )[ Â(s) avg ] + Õ(η 1.5-β ), which concludes the proof. With the assumption that the Hessian at φ(0) is diagonal, we have the following corollary that formulates the recursions for each matrix element. Corollary K.2. Given ∥ θ(0) avg -φ(0) ∥ 2 = O( η log 1 η ), for 0 ≤ s < R grp and 0 ≤ t < H, we have the following elementwise recursions.

Â(s) t+1,i,j = (1 -(λ i + λ j )η) Â(s) t,i,j + η 2 B loc Σ 0,i,j + Õ(η 2.5-0.5β ), M (s) t+1,i,j = (1 -(λ i + λ j )η) M (s) t,i,j + Õ(η 2.5-0.5β ), B(s) t+1,i,j = (1 -λ i η) B(s) t,i,j + Õ(η 2.5-β ), Â(s) avg,i,j = 1 K ( Â(s) H,i,j - M (s) H,i,j ) + M (s) H,i,j . Moreover, M (s+1) 0,i,j = Â(s+1) 0,i,j = Â(s) avg,i,j + Õ(η 1.5-0.5β ) for 1 ≤ i ≤ m, 1 ≤ j ≤ m, and Õ(η 1.5-0.5β ) otherwise. Similarly, B(s+1) 0,i,j = B(s) H,i,j + Â(s) avg,i,j + Õ(η 1.5-β ) for 1 ≤ i ≤ m, m < j ≤ d; B(s) H,i,j + Õ(η 1.5-β ) for 1 ≤ i ≤ m, 1 ≤ j ≤ m; and Õ(η 1.5-β ) for m < i ≤ d. Having formulated the recursions, we are ready to solve for the explicit expressions. We will split each matrix into four parts and treat them one by one. Specifically, a matrix M can be split into P ∥ M P ∥ in the tangent space of Γ at φ(0) , P ⊥ M P ⊥ in the normal space, along with P ∥ M P ⊥ and P ⊥ M P ∥ across both spaces. We first compute the elements of P ⊥ Â(s) t P ⊥ and P ⊥ Â(s) avg P ⊥ . Lemma K.29 (General formula for P ⊥ Â(s) t P ⊥ and P ⊥ Â(s) avg P ⊥ ). Let R 0 := ⌈ 10 λmα log 1 η ⌉. Then for 1 ≤ i ≤ m, 1 ≤ j ≤ m and R 0 ≤ s < R grp , Â(s) avg,i,j = 1 (λ i + λ j )KB loc ηΣ 0,i,j + Õ(η 1.5-0.5β ), Â(s) t,i,j = -1 - 1 K (1 -(λ i + λ j )η) t (λ i + λ j )B loc ηΣ 0,i,j + η (λ i + λ j )B loc Σ 0,i,j + Õ(η 1.5-0.5β ). For s < R 0 , Â(s) t,i,j = Õ(η) and Â(s) avg,i,j = Õ(η). Proof. For 1 ≤ i ≤ m, 1 ≤ j ≤ m, we have λ i > 0, λ j > 0. By (90), Â(s) t,i,j = (1 -(λ i + λ j )η) t Â(s) 0,i,j + t-1 τ =0 (1 -(λ i + λ j )η) τ η 2 B loc Σ 0,i,j + Õ( t-1 τ =0 (1 -(λ i + λ j )η) τ η 2.5-0.5β ) = (1 -(λ i + λ j )η) t Â(s) 0,i,j + 1 -(1 -(λ i + λ j )η) t (λ i + λ j )B loc ηΣ 0,i,j + Õ(η 1.5-0.5β ), where the second inequality uses t-1 τ =0 (1 -(λ i + λ j )η) τ = 1-(1-(λi+λj )η) t (λi+λj )η ≤ 1 (λi+λj )η . By (91), M (s) t,i,j = (1 -(λ i + λ j )η) t M (s) 0,i,j + Õ( t-1 τ =0 (1 -(λ i + λ j )η) τ η 2.5-0.5β ) = (1 -(λ i + λ j )η) t Â(s) 0,i,j + Õ(η 1.5-0.5β ), where the second equality uses M (s+1) 0 = Â(s+1) 0 . By ( 93) and ( 94),
Â(s) avg,i,j = 1 -(1 -(λ i + λ j )η) H (λ i + λ j )KB loc ηΣ 0,i,j + (1 -(λ i + λ j )η) H Â(s) 0,i,j + Õ(η 1.5-0.5β ), Â(s+1) 0,i,j = Â(s) avg,i,j + Õ(η 2.5-0.5β ) = 1 -(1 -(λ i + λ j )η) H (λ i + λ j )KB loc ηΣ 0,i,j + (1 -(λ i + λ j )η) H Â(s) 0,i,j + Õ(η 1. Then we obtain Â(s) 0,i,j = (1 -(λ i + λ j )η) sH Â(0) 0,i,j + 1 -(1 -(λ i + λ j )η) H (λ i + λ j )KB loc ηΣ 0,i,j s-1 r=0 (1 -(λ i + λ j )η) rH + Õ(η 1.5-0.5β s-1 r=R0 (1 -(λ i + λ j )η) rH ). Notice that |1 -(λ i + λ j )η| < 1 and (1 -(λ i + λ j )η) H ≤ exp(-(λ i + λ j )ηH) = exp(-(λ i + λ j )α). Therefore, s-1 r=0 (1 -(λ i + λ j )η) rH = 1 -(1 -(λ i + λ j )η) rH 1 -(1 -(λ i + λ j )η) H ≤ 1 1 -exp(-(λ i + λ j )α) . Then we have Â(s) 0,i,j = (1 -(λ i + λ j )η) sH Â(0) 0,i,j + 1 -(1 -(λ i + λ j )η) sH (λ i + λ j )KB loc ηΣ 0,i,j + Õ(η 1.5-0.5β ). Finally, we demonstrate that for s ≥ R 0 , Â(s) 0,i,j and Â(s) avg,i,j is approximately equal to η (λi+λj )KB loc Σ 0,i,j . By (96), when s ≥ R 0 , (1 -(λ i + λ j )η) sH = O(η 10 ), which gives Â(s) avg,i,j = 1 (λ i + λ j )KB loc ηΣ 0,i,j + Õ(η 1.5-0.5β ), A (s) t,i,j = -1 - 1 K (1 -(λ i + λ j )η) t (λ i + λ j )B loc ηΣ 0,i,j + η (λ i + λ j )B loc Σ 0,i,j + Õ(η 1.5-0.5β ). For s < R 0 , since Â(0) 0 = x(s) avg,0 x(s)⊤ avg,0 = Õ(η), we have Â(s) avg,,i,j = Õ(η) and Â(s) t,i,j = Õ(η). Secondly, we compute P ∥ Â(s)  P ∥ ). For 1 ≤ i ≤ m, m < j ≤ d, Â(s) t,i,j = 1 -(1 -λ i η) t λ i B loc ηΣ 0,i,j + Õ(η 1.5-0.5β ), Â(s) avg,i,j = 1 -(1 -λ i η) H λ i KB loc ηΣ 0,i,j + Õ(η 1.5-0.5β ). Proof. Note that for 1 ≤ i ≤ m, m < j ≤ d and λ i > 0, λ j = 0. By ( 90) and (94), Â(s) t,i,j = (1 -λ i η) t Â(s) 0,i,j + 1 -(1 -λ i η) t λ i B loc ηΣ 0,i,j + Õ(η 1.5-0.5β ) = 1 -(1 -λ i η) t λ i B loc ηΣ 0,i,j + Õ(η 1.5-β ). By ( 91) and ( 94), M (s) t,i,j = Õ(η 1.5-0.5β ). Then, 5-0.5β ). Â(s) avg,i,j = 1 -(1 -λ i η) H λ i KB loc ηΣ 0,i,j + Õ(η 1.5-0.5β P ⊥ ). 
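The elementwise recursions in Corollary K.2 are linear with a constant forcing term, so they resolve into the geometric sums used in the lemmas above. A minimal numeric sanity check of this step, with hypothetical values standing in for λ i + λ j , η, Σ 0,i,j and B loc :

```python
def iterate_recursion(a0, lam_sum, eta, sigma, B_loc, t):
    # Iterate a_{t+1} = (1 - lam_sum * eta) * a_t + eta^2 / B_loc * sigma
    # (the leading-order form of the recursion for \hat{A}^{(s)}_{t,i,j}).
    a = a0
    for _ in range(t):
        a = (1.0 - lam_sum * eta) * a + eta ** 2 / B_loc * sigma
    return a

def closed_form(a0, lam_sum, eta, sigma, B_loc, t):
    # Geometric-sum solution: r^t a_0 + (1 - r^t) * eta * sigma / (lam_sum * B_loc)
    r = 1.0 - lam_sum * eta
    return r ** t * a0 + (1.0 - r ** t) / (lam_sum * B_loc) * eta * sigma
```

As t grows the iterate approaches the fixed point η Σ 0,i,j / ((λ i + λ j ) B loc ), matching the stationary value appearing in Lemma K.29 up to the 1/K averaging factor.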
For m < i ≤ d and 1 ≤ j ≤ m, Â(s) t,i,j = 1 -(1 -λ j η) t λ j B loc ηΣ 0,i,j + Õ(η 1.5-0.5β ), Â(s) avg,i,j = 1 -(1 -λ j η) H λ j KB loc ηΣ 0,i,j + Õ(η 1. Finally, we derive the general formula for P ∥ Â(s) t P ∥ and P ∥ Â(s) avg P ∥ . Lemma K.32 (General formula for P ∥ Â(s) t P ∥ and P ∥ Â(s) avg P ∥ ). For m < i ≤ d and m < j ≤ d, Â(s) avg,i,j = Hη 2 KB loc Σ 0,i,j + Õ(η 1.5-0.5β ), Â(s) t,i,j = Â(s) 0,i,j + tη 2 B loc Σ 0,i,j + Õ(η 1.5-0.5β ). Proof. Note that for m < i ≤ d, m < j ≤ d and λ i = λ j = 0. (90) is then simplified as Â(s) t+1,i,j = Â(s) t,i,j + η 2 B loc Σ 0,i,j + Õ(η 2.5-0.5β ). Therefore, Â(s) t,i,j = Â(s) 0,i,j + tη 2 B loc Σ 0,i,j + Õ(η 1.5-0.5β ). According to (91), M (s) t,i,j = Õ(η 1.5-0.5β ) for m < i ≤ d and m < j ≤ d. Combining (91), ( 94) and (97) yields avg,i,j = Hη 2 KB loc Σ 0,i,j + Õ(η 1.5-0.5β ). Now, we move on to compute the general formula for B(s) t . Lemma K.33 (The general formula for P ⊥ B(s ) t P ∥ ). Note that for 1 ≤ i ≤ m and m < j ≤ d, when R 0 := ⌈ 10 λmα log 1 η ⌉ ≤ s < R grp , B(s) t,i,j = (1 -λ i η) t λ i KB loc ηΣ 0,i,j + Õ(η 1.5-β ). For s < R 0 , B(s) t,i,j = Õ(η). Proof. Note that for 1 ≤ i ≤ m, λ i > 0. By (92), B(s) t+1,i,j = (1 -λ i η) B(s) t,i,j + Õ(η 2.5-β ). Hence, B(s) t,i,j = (1 -λ i η) t B(s) 0,i,j + Õ(η 1.5-β ). According to (95), B(s+1) 0,i,j = B(s) H,i,j + Â(s) avg,,i,j + Õ(η 2.5-β ) = (1 -λ i η) H B(s) 0,i,j + Â(s) avg,i,j + Õ(η 1.5-β ). Lemma K.36. The expectation of the change of manifold projection every round is E[ φ(s+1) -φ(s) ] = Hη 2 2B ∂ 2 Φ( φ(0) )[Σ 0 + Ψ( φ(0) )] + Õ(η 1.5-β ), R 0 < s < R grp Õ(η), s ≤ R 0 , where R 0 := ⌈ 10 λmα log 1 η ⌉. Proof. We first compute E[ φ(s+1) -φ(s) ]. By (72), we only need to compute P ∥ q(s) H by relating it to these matrices. Multiplying both sides of (79) by P ∥ gives P ∥ q(s) t+1 = P ∥ q(s) t -ηP ∥ ∇ 3 L( φ(0) )[ B(s) t ] - η 2 P ∥ ∇ 3 L( φ(0) )[ Â(s) t ] + Õ(η 2.5-β ). 
Similarly, according to (85), we have P ∥ q(s+1) 0 = -P ∥ ∂ 2 Φ( φ(0) )[ B(s) H ] - 1 2 P ∥ ∂ 2 Φ( φ(0) )[ Â(s) avg ] + Õ(η 1.5-β ). Combining ( 99) and (100) yields P ∥ q(s) H = - 1 2 P ∥ ∂ 2 Φ( φ(0) )[ Â(s-1) avg ] - η 2 P ∥ ∇ 3 L( φ(0) )[ H-1 t=0 Â(s) t ] -ηP ∥ ∇ 3 L( φ(0) )[ H-1 t=0 B(s) t ] -P ∥ ∂ 2 Φ( φ(0) )[ B(s-1) H ] + Õ(η 1.5-β ). By Lemmas K.29, K.32 and K.30, for s ≤ R 0 = ⌊ 10 λα log 1 η ⌋, Â t = Õ(η), Â(s) avg = Õ(η) and B(s) t = Õ(η). Therefore, E[ φ(s+1) -φ(s) ] = Õ(η). For s > R 0 , Â(s-1) avg = Â(s) avg + Õ(η 1.5-0.5β ). Substituting (101) into (72) gives E[ φ(s+1) -φ(s) ] = 1 2 P ⊥ ∂ 2 Φ( φ(0) )[ Â(s) avg ] + P ⊥ ∂ 2 Φ( φ(0) )[ B(s) H ] T1 T2 -ηP ∥ ∇ 3 L( φ(0) )[ 1 2 H-1 t=0 Â(s) t + H-1 t=0 B(s) t T3 ] + Õ(η 1.5-β ). Below we compute T 1 and T 2 for s > R 0 respectively. By Lemma K.3, P ⊥ ∂ 2 Φ( φ(0) )[P ⊥ Â(s) avg P ∥ ] = P ⊥ ∂ 2 Φ( φ(0) )[P ∥ Â(s) avg P ⊥ ] = 0, P ⊥ ∂ 2 Φ( φ(0) )[P ∥ Â(s) avg P ∥ ] = ∂ 2 Φ( φ(0) )[P ∥ Â(s) avg P ∥ ]. By Lemma K.4, P ⊥ ∂ 2 Φ( φ(0) )[P ⊥ Â(s) avg P ⊥ ] = 0. Therefore, for s > R 0 , P ⊥ ∂ 2 Φ( φ(0) )[ Â(s) avg ] = Hη 2 2KB loc ∂ 2 Φ( φ(0) )Φ[Σ 0,∥ ] + Õ(η 1.5-0.5β ), where we apply Lemma K.32. Similarly, for s > R 0 , P ⊥ ∂ 2 Φ( φ(0) )[ B(s) H ] = ∂ 2 Φ( φ(0) )[P ∥ B(s) H P ∥ ] = Õ(η 1.5-β ), where we apply Lemma K.35. Hence, T 1 = Hη 2 2B ∂ 2 Φ( φ(0) )[Σ 0,∥ ] + Õ(η 1.5-β ). where ψ(•) is interpreted as an elementwise matrix function here. By symmetry of Â(s) t 's and ∇ 3 L( φ(0) ), 1 2 ∇ 3 L( φ(0) ) H-1 t=0 P ⊥ Â(s) t P ∥ + H-1 t=0 P ∥ Â(s) t P ⊥ = ∇ 3 L( φ(0) ) H-1 t=0 P ⊥ Â(s) t P ∥ . Therefore, we only have to evaluate ∇ 3 L( φ(0) ) H-1 t=0 P ⊥ ( Â(s) t + B(s) t )P ∥ + H-1 t=0 P ∥ B(s) t P ⊥ . 
To compute the elements of  ≤ i ≤ m and m < j ≤ d, H-1 t=0 Â(s) t,i,j = H-1 t=0 1 -(1 -λ i η) t λ i B loc ηΣ 0,i,j + Õ(η 0.5-β ) = Hη λ i B loc Σ 0,i,j - 1 -(1 -λ i η) H λ 2 i B loc Σ 0,i,j + Õ(η 0.5-β ) = Hη λ i B loc 1 - 1 -(1 -λ i η) H λ i Hη Σ 0,i,j + Õ(η 0.5-β ) = Hη λ i B loc ψ(λ i Hη)Σ 0,i,j + Õ(η 0.5-β ), H-1 t=0 B(s) t,i,j = H-1 t=0 (1 -λ i η) t λ i KB loc ηΣ 0,i,j + Õ(η 1.5-β ), = 1 -(1 -λ i η) H λ 2 i KB loc Σ 0,i,j + Õ(η 0.5-β ) = Hη λ i KB loc Σ 0,i,j - Hη λ i KB loc 1 - 1 -(1 -λ i η) H λ i Hη Σ 0,i,j + Õ(η 0.5-β ) = Hη λ i KB loc Σ 0,i,j - Hη λ i KB loc ψ(λ i Hη)Σ 0,i,j + Õ(η 0.5-β ). Therefore, the matrix form of H-1 t=0 P ⊥ ( Â(s) t + B(s) t )P ∥ is H-1 t=0 P ⊥ ( Â(s) t + B(s) t )P ∥ = Hη B V H0 Σ 0,⊥,∥ + (K -1)ψ(Σ 0,⊥,∥ ) + Õ(η 0.5-β ), where ψ(•) is interpreted as an elementwise matrix function here. Furthermore, by Lemma K.35,

the sum of B(s) t over 0 ≤ t ≤ H -1 is Õ(η 0.5-β ). Applying Lemma K.3, we have (105). Finally, directly applying Lemma K.5, we have -ηP ∥ ∇ 3 L( φ(0) )[P ∥ T 3 P ∥ ] = 0. Notice that ψ(Σ 0,∥ ) = 0, where ψ(•) operates on each element. Combining (104), (105) and (106), we obtain (103). By (102) and (103), we have (98). Lemma K.37. The second moment of the change of manifold projection every round is E[( φ(s+1) -φ(s) )( φ(s+1) -φ(s) ) ⊤ ] = Hη 2 B Σ 0,∥ + Õ(η 1.5-0.5β ) for R 0 ≤ s < R grp , and Õ(η) for s < R 0 , where R 0 := ⌈ 10 λmα log 1 η ⌉. Proof. Directly apply Lemma K.32 and Lemma K.27 and we have the lemma. With Lemmas K.36 and K.37, we are ready to prove Theorem K.3.
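The function ψ(x) = (e^{-x} - 1 + x)/x enters precisely through the geometric sums over t computed above: at fixed α = ηH, (1 - λη)^H → e^{-λα} as η → 0. A small numeric illustration of this limit (the values of λ, η and H below are arbitrary):

```python
import math

def psi(x):
    # psi(x) = (exp(-x) - 1 + x) / x, with psi(0) = 0
    return (math.exp(-x) - 1.0 + x) / x if x != 0.0 else 0.0

def partial_sum(lam, eta, H):
    # Exact value of sum_{t=0}^{H-1} (eta / lam) * (1 - (1 - lam*eta)**t)
    return sum(eta / lam * (1.0 - (1.0 - lam * eta) ** t) for t in range(H))

def psi_approx(lam, eta, H):
    # Small-eta limit at fixed alpha = eta * H:  (H*eta/lam) * psi(lam*H*eta)
    return H * eta / lam * psi(lam * H * eta)
```

The discrepancy between the exact sum and the ψ-expression shrinks with η at fixed α, consistent with the Õ error terms carried through the lemmas.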

Proof of Theorem K.3. We first derive E[∆ φ(Rgrp) ]. Recall that R grp = ⌊ 1 αη β ⌋ = 1 Hη 1+β + o(1), where 0 < β < 0.5. By Lemma K.36, E[ φ(Rgrp) -φ(0) ] = R0 s=0 E[ φ(s+1) -φ(s) ] + Rgrp-1 s=R0+1 E[ φ(s+1) -φ(s) ] = η 1-β 2B ∂ 2 Φ( φ(0) )[Σ 0 + (K -1)Ψ( φ(0) )] + Õ(η 1.5-2β ) + Õ(η). Then we compute E[∆ φ(Rgrp) ∆ φ(Rgrp)⊤ ]: E[ ( Rgrp-1 s=0 ( φ(s+1) -φ(s) ) )( Rgrp-1 s=0 ( φ(s+1) -φ(s) ) ) ⊤ ] = Rgrp-1 s=0 E[( φ(s+1) -φ(s) )( φ(s+1) -φ(s) ) ⊤ ] + s̸ =s ′ E[( φ(s+1) -φ(s) )]E[( φ(s ′ +1) -φ(s ′ ) ) ⊤ ] = η 1-β B Σ 0,∥ + Õ(η) + Õ(η 1.5-1.5β ), where the last equality uses E[( φ(s+1) -φ(s) )]E[( φ(s ′ +1) -φ(s ′ ) ) ⊤ ] = Õ(η 2 ).

K.10 PROOF OF WEAK APPROXIMATION

We are now in a position to utilize the estimate of moments obtained in previous subsections to prove the closeness of the sequence {ϕ (s) } ⌊T /(Hη 2 )⌋ s=0 and the SDE solution {ζ : t ∈ [0, T ]} in the sense of weak approximation. Recall the SDE that we expect the manifold projection {Φ( θ(s) )} ⌊T /(Hη 2 )⌋ s=0 to track: dζ(t) = P ζ 1 √ B Σ 1 /2 ∥ (ζ)dW t (a) diffusion -1 2B ∇ 3 L(ζ)[ Σ ♢ (ζ)]dt (b) drift-I -K-1 2B ∇ 3 L(ζ)[ Ψ(ζ)]dt (c) drift-II , According to Lemma K.3 and Lemma K.4, the drift term in total can be written as the following form: (b) + (c) = 1 2B ∂ 2 Φ(ζ)[Σ(ζ) + (K -1)Ψ(ζ)]. Then by definition of P ζ , ( 107) is equivalent to the following SDE: dζ(t) = 1 √ B ∂Φ(ζ)Σ 1/2 (ζ)dW t + 1 2B ∂ 2 Φ(ζ) [Σ(ζ) + (K -1)Ψ(ζ)] dt. Therefore, we only have to show that ϕ (s) closely tracks {ζ(t)} satisfying Equation (108). By Lemma K.11, there exists an ϵ 3 neighborhood of Γ, Γ ϵ3 , where Φ(•) is C ∞ -smooth. Due to compactness of Γ, Γ ϵ3 is bounded and the mappings ∂ 2 Φ(•), ∂Φ(•), Σ 1/2 (•), Σ(•) and Ψ(•) are all Lipschitz in Γ ϵ3 . By Kirszbraun theorem, both the drift and diffusion term of (108) can be extended as Lipschitz functions on R d . Therefore, the solution to the extended SDE exists and is unique. We further show that the solution, if initialized as a point on Γ, always stays on the manifold almost surely. As a preparation, we first show that Γ has no boundary. Lemma K.38. Under Assumptions 3.1 to 3.3, Γ has no boundary. Proof. We prove by contradiction. If Γ has boundary ∂Γ, WLOG, for a point p ∈ ∂Γ, let the Hessian at p be diagonal with the form ∇ 2 L(p) = diag(λ 1 , • • • , λ d ) where λ i > 0 for 1 ≤ i ≤ m and λ i = 0 for m < i ≤ d . where b(•) : R d → R d is the drift function and σ(•) : R d×d → R d×d is the diffusion matrix. Denote by P X (x, s, t) the distribution of X(t) with the initial condition X(s) = x.Define ∆(x, n) := X (n+1)ηe -x, where X (n+1)ηe ∼ P X (x, nη e , (n + 1)η e ), which characterizes the update in one step. 
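The weak-approximation argument compares a discrete process with per-step drift b(•) and diffusion σ(•) against the corresponding SDE. As a generic sketch only (not the paper's manifold-constrained construction, which requires access to Φ and Ψ), an Euler–Maruyama discretization of dX = b(X)dt + σ(X)dW reads:

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n_steps, rng):
    # Simulate one sample path of dX = b(X) dt + sigma(X) dW on [0, T].
    # b: R^d -> R^d drift; sigma: R^d -> R^{d x d} diffusion matrix.
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)  # Brownian increment
        x = x + b(x) * dt + sigma(x) @ dW
    return x
```

For the slow SDE (108), one would take b(ζ) = (1/2B) ∂ 2 Φ(ζ)[Σ(ζ) + (K -1)Ψ(ζ)] and σ(ζ) = (1/√B) ∂Φ(ζ)Σ 1/2 (ζ), both of which depend on the manifold projection Φ and are not available in closed form for a general loss.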
In our context, we view the change of manifold projection over R grp := ⌊ 1 αη 1-β ⌋(β ∈ (0, 0.5)) rounds as one "giant step". Hence the ϕ (nRgrp) corresponds to the discrete time random variable x n corresponds to and ζ(t) corresponds to the continuous time random variable X t . According to Theorem K.2, we set η e = η 1-β , b(ζ) = 1 2B ∂ 2 Φ(ζ) [Σ(ζ) + (K -1)Ψ(ζ)] , σ(ζ) = 1 √ B ∂Φ(ζ)Σ 1/2 (ζ). Due to compactness of Γ, b(•) and σ(•) are Lipschitz on Γ. As for the update in one step, ∆(•, •) is defined in our context as: ∆(ϕ, n) := ζ (n+1)ηe -ϕ, where ζ (n+1)ηe ∼ P ζ (ϕ, nη e , (n + 1)η e ) and ϕ ∈ Γ. For convenience, we further define ∆ (n) := φ((n+1)Rgrp) -φ(nRgrp) , ∆(n) := ∆( φ(Rgrp) , n), b (n) : = b( φ(nRgrp) ), σ (n) : = σ( φ(nRgrp) ). We use C g,i to denote constants that can depend on the test function g and independent of η e . The following lemma relates the moments of ∆(ϕ, n) to b(ϕ) and σ(ϕ). Lemma K.40. There exists a positive constant C 0 independent of η e and g such that for all ϕ ∈ Γ, |E[ ∆i (ϕ, n)] -η e b i (ϕ)| ≤ C 0 η 2 e , ∀1 ≤ i ≤ d, |E[ ∆i (ϕ, n) ∆j (x, n)] -η e d l=1 σ i,l (ϕ)σ l,j (ϕ)| ≤ C 0 η 2 e , ∀1 ≤ i, j ≤ d, E 6 s=1 ∆is (ϕ, n) ≤ C 0 η 3 e , ∀1 ≤ i 1 , • • • , i 6 ≤ d. The lemma below states that the expectation of the test function is smooth with respect to the initial value. Proof. For β ∈ (0, 0.5), define γ 1 := 1.5-2β 1-β , γ 2 := 1 1-β , and then 1 < γ 1 < 1.5, 1 < γ 2 < 2. We introduce the following lemma which serves as a key step to control the approximation error. Specifically, this lemma bounds the difference in one step change between the discrete process and the continuous one as well as the product of higher orders. Lemma K.42. If ∥ θ(0) -ϕ (0) ∥ 2 = O( η log 1 η ), then there exist positive constants C 1 and b independent of η e and g such that for all 0 ≤ n < ⌊T /η e ⌋, 1. 
|E[∆ (n) i - ∆(n) i | E (nRgrp) 0 | ≤ C 1 η γ1 e (log 1 ηe ) b + C 1 η γ2 e (log 1 ηe ) b , ∀1 ≤ i ≤ d, |E[∆ (n) i ∆ (n) j - ∆(n) i ∆(n) j | E (nRgrp) 0 | ≤ C 1 η γ1 e (log 1 ηe ) b + C 1 η γ2 e (log 1 ηe ) b , ∀1 ≤ i, j ≤ d. 2. E 6 s=1 ∆ (n) is | E (nRgrp) 0 ≤ C 2 1 η 2γ1 e (log 1 ηe ) 2b , ∀1 ≤ i 1 , • • • , i 6 ≤ d, E 6 s=1 ∆(n) is | E (nRgrp) 0 ≤ C 2 1 η 2γ1 e (log 1 ηe ) 2b , ∀1 ≤ i 1 , • • • , i 6 ≤ d. Proof. According to Appendix K.7, we have E 6 s=1 ∆ (n) is | E (nRgrp) 0 = Õ(η 3-3β ). Since γ 1 < 1.5 and γ 2 < 2, we can utilize Theorem K.3 and conclude that there exist positive constants C 2 and b independent of η e and g such that where C g,1 is a positive constant independent of η and φ(lRgrp) but can depend on g. E[∆ (n) i -η e b (n) i | E (nRgrp) 0 ] ≤ C 2 η γ1 e (log 1 ηe ) b + C 2 η γ2 e (log 1 ηe ) b , ∀1 ≤ i ≤ d, E[∆ (n) i ∆ (n) j -η e d l=1 σ (n) i,l σ (n) l,j | E (nRgrp) 0 ] ≤ C 2 η γ1 e (log 1 ηe ) b + C 2 η γ2 e (log 1 ηe ) b , ∀1 ≤ i, j ≤ d, Proof. By Lemma K.41, u l,n (ϕ) ∈ C 3 for all l and n. That is, there exists K(•) ∈ G such that for all l, n, u l,n (ϕ) and its partial derivatives up to the third order are bounded by K(ϕ). By the law of total expectation and triangle inequality, We first bound A 2 and A 3 . Since φ(lRgrp) ∈ Γ, both φ(lRgrp) + ∆ (l) and φ(lRgrp) + ∆(l) belong to Γ. Due to compactness of Γ and smoothness of u l+1,n (•) on Γ, there exist a positive constant C g,2 such that A 2 + A 3 ≤ C g,2 η 100 . We proceed to bound A 1 . Expanding u l+1,n (•) at φ(lRgrp) and by triangle inequality, A ], for some θ, θ ∈ (0, 1). Since φ(lRgrp) belongs to Γ which is compact, there exists a constant C g,3 such that for all 1 ≤ i, j . Since φ(lRgrp) and φ(lRgrp) + ∆ (l) both belong to Γ which is compact, there exists a constant C g,4 such that for all 1 ≤ i, j, p ≤ d, 0 ≤ l ≤ n -1 and 1 ≤ n ≤ ⌊T /η e ⌋, Proof. For 0 ≤ l ≤ n, define the random variable ζl,n which follows the distribution P ζ ( φ(lRgrp) , l, n) conditioned on φ(lRgrp) . 
Therefore, P( ζn,n = φ(nRgrp) ) = 1 and ζ0,n ∼ ζ nηe . Denote by u(ϕ, s, t) := E ζt∼P ζ (ϕ,s,t) [g(ζ t )] and T l+1,n := u l+1,n ( φ(lRgrp) +∆ (l) , (l+1)η e , nη e )u l+1,n ( φ(lRgrp) + ∆(l) , (l + 1)η e , nη e ). Then there exists positive constant b ′ independent of η and g, and C ′ g which is independent of η but can depend on g such that We can view the random variable pairs {(ϕ (nRgrp+s cls ) , ζ nη 0.75 +s cls αη ) : n = 0, • • • , ⌊T /η 0.75 ⌋} as reference points and then approximate the value of g(ϕ (s) ) and g(ζ(sHη 2 )) with the value at the nearest reference points. By Lemmas K.18 and K.23, for 0 ≤ r ≤ R grp and 0 ≤ s ≤ R tot -r, E[∥ϕ (s+r) -ϕ (s) ∥ 2 ] = Õ(η 0.375 ). Since the values of ϕ (s) and ζ are restricted to a bounded set, g(•) is Lipschitz on that set. Therefore, we have the theorem. And the hessian is: ∇ 2 L(θ) = (1 -p)h ′ (a 1 ) ∂ 2 f 1 ∂θ 2 + ph ′ (a 2 ) C -1 j>1 ∂ 2 f j ∂θ 2 T + (1 -p)h ′′ (a 1 ) ∂f 1 ∂θ ∂f 1 ∂θ ⊤ + ph ′′ (a 2 ) C -1 j>1 ∂f j ∂θ ∂f j (θ) ∂θ ⊤ . Since j∈[C] f i = 1, ∂ 2 f 1 ∂θ 2 = - j>1 ∂ 2 f j ∂θ 2 . ( ) Also, notice that h ′ (x) = -1 x . Therefore, (1 -p)h ′ (a 1 ) = ph ′ (a 2 ) C -1 . ( ) Substituting ( 115) and ( 116) into the expression of T gives T = 0, which simplifies ∇ 2 L(θ) as the following form: ∇ 2 L(θ) = (1 -p)h ′′ (a 1 ) ∂f 1 ∂θ ∂f j (θ) ∂θ ⊤ + ph ′′ (a 2 ) C -1 j>1 ∂f j ∂θ ∂f j (θ) ∂θ ⊤ . Again notice that h ′′ (x) = h ′ (x) for all x ∈ R + . Therefore, ∇ 2 L(θ) = Σ(θ). With the property Σ(θ) = ∇ 2 L(θ), we are ready to prove Theorem G.1. Proof of Theorem G.1. Recall the general form of the slow SDE: dζ(t) = 1 √ B ∂Φ(ζ)Σ 1/2 (ζ)dW (t) + 1 2B ∂ 2 Φ(ζ) [Σ(ζ) + (K -1)Ψ(ζ)] dt, where Ψ is defined in Definition K. By the chain rule, we have (120). Combining (118),( 119) and (120) gives the theorem.

M.5 DETAILS FOR EXPERIMENTS ON THE EFFECT OF GLOBAL BATCH SIZE

CIFAR-10 experiments. The model we use is ResNet-56. We resume from the model obtained in Figure 1(a) at epoch 250 and train for another 250 epochs. The local batch size for all runs is B loc = 128. We first perform a grid search of η for SGD with K = 16 among {0.04, 0.08, 0.16, 0.32, 0.64} and find that the final test accuracy varies little across different learning rates (within 0.1%). We then choose η = 0.32. For the green curve in Figure 4(a), we search for the optimal H for K = 16 and keep α fixed when scaling η with K. For the red curve in Figure 4(a), we search for the optimal H for each K among {6, 12, 60, 120, 300, 750, 1500, 3000, 6000, 12000, 24000} and also make sure that H does not exceed the total number of iterations for 250 epochs. The learning curves for constant and optimal α are visualized in Figures 10(a) and 10(c) respectively. We report the mean and standard deviation over three runs. ImageNet experiments. We start from the model obtained at epoch 100 in Figure 1(b) and train for another 50 epochs. The local batch size for all runs is B loc = 32. We first perform a grid search of η among {0.032, 0.064, 0.16, 0.32} for H = 1 to achieve the best test accuracy and choose η = 0.064. For the orange curve in Figure 4(b), we search H among {2, 4, 6, 13, 26, 52, 78, 156} for K = 256 to achieve the optimal test accuracy and then keep α constant as we scale η with K. To obtain the optimal H for each K, we search among {6240, 7800, 10400, 12480, 15600, 20800, 24960, 31200} for K = 16, {1600, 3120, 4160, 5200, 6240, 7800, 10400} for K = 32, {312, 480, 520, 624, 800, 975, 1040, 1248, 1560, 1950} for K = 64, and {1, 2, 3, 6, 13} for K = 512. The learning curves for constant and optimal α are visualized in Figures 10(b) and 10(d) respectively. We report the mean and standard deviation over three runs. ResNet-56. 
As for the model architecture, we replace the batch normalization layer in Yang's implementation with group normalization such that the training loss is independent of the sampling order. We also use Swish activation (Ramachandran et al., 2017) in place of ReLU to ensure the smoothness of the loss function. We generate the pre-trained model by running label noise SGD with corruption probability p = 0.1 for 500 epochs (6, 000 iterations). We initialize the model by the same strategy introduced in the first paragraph of Appendix M.1. Applying the linear warmup scheme proposed by Goyal et al. (2017) , we gradually ramp up the learning rate η from 0.1 to 3.2 for the first 20 epochs and multiply the learning rate by 0.1 at epoch 250. All subsequent experiments in Figure 7 (a) (a) use learning rate 0.1. The weight decay λ is set as 5 × 10 -4 . Note that adding weight decay in the presence of normalization accelerates the limiting dynamics and will not affect the implicit regularization on the original loss function (Li et al., 2022) . VGG-16. We follow Yang's implementation of the model architecture except that we replace maximum pooling with average pooling and use Swish activation (Ramachandran et al., 2017) to make the training loss smooth. We initialize all weight parameters by Kaiming Normal and all bias parameters as zero. The pre-trained model is obtained by running label noise SGD with total batch size 4096 and corruption probability p = 0.1 for 6000 iterations. We use a linear learning rate warmup from 0.1 to 0.5 in the first 500 iterations. All runs in Figures 7(b ) and 8 resume from the model obtained by SGD with label noise. In Figure 7 (b), we use learning rate η = 0.1. In Figure 8 , we set η = 0.005 for H = 97, 000 and η = 0.01 for SGD (H = 1). The weight decay λ is set as zero.



This generalization improvement is not mentioned explicitly in Ortiz et al. (2021) but can be clearly seen from Figures 7 and 8 in their paper.
https://github.com/bearpaw/pytorch-classification



ImageNet, B = 8192, ResNet-50.

Figure 1: Post-Local SGD (H > 1) generalizes better than SGD (H = 1). We switch to Local SGD at the first learning rate decay (epoch #250) for CIFAR-10 and at the second learning rate decay (epoch #100) for ImageNet. See Appendix M.1 for training details.

CIFAR-10, test accuracy vs. H.

Figure 2: Ablation studies on η, H and training time in the same setting as Figure 1. For (a) and (d), we train from random initialization. For (b), (c), (e) and (f), we start training from the checkpoints saved at the switching time points in Figure 1 (epoch #250 for CIFAR-10 and epoch #100 for ImageNet). See Appendix M.2 for training details.

) and 2(b) and is 0.16 for Figure 2(c). As shown in Figure 2(d), Local SGD encounters optimization difficulty in the first phase, where η is large (η = 3.2), resulting in inferior final test accuracy. Even when training from a pre-trained model, the generalization improvement of Local SGD disappears for large learning rates (e.g., η = 1.6 in Figure 5(d)). In contrast, if a longer training time is allowed, reducing the learning rate of Local SGD does not lead to a test accuracy drop (Figure 5(c)).

(3) Training time should be long enough. To investigate the effect of training time, in Figures 2(b) and 2(c) we extend the training budget of the Post-local SGD experiments in Figure 1 and observe that a longer training time leads to a greater generalization improvement over SGD. On the other hand, Local SGD generalizes worse than SGD in the first few epochs of Figures 2(a) and 2(c); see Figures 5(a) and 5(b) for an enlarged view.

Each worker k sets θ^{(s)}_{k,0} ← θ^{(s)} and does H steps of SGD with local batches. In the t-th local step, the k-th worker draws a local batch of B_loc := B/K independent samples ξ^{(k)}_{t,1}, …, ξ^{(k)}_{t,B_loc} from a shared training distribution D and updates as follows:

3.3.2 LOCAL SGD STRENGTHENS THE DRIFT TERM IN SLOW SDE

C IMPLEMENTATION DETAILS OF PARALLEL SGD, LOCAL SGD AND POST-LOCAL SGD

In this section, we present the formal procedures for parallel SGD, Local SGD and Post-local SGD. Given a training dataset and a data augmentation function, Algorithms 1 and 2 show the implementations of distributed samplers for sampling local batches with and without replacement. Then Algorithms 3 to 5 show the implementations of parallel SGD, Local SGD and Post-local SGD, which can run with either of the samplers.

Sampling with replacement. Our theory analyzes parallel SGD, Local SGD and Post-local SGD when local batches are sampled with replacement (Algorithm 1). That is, local batches consist of IID samples from the same training distribution D, where D serves as an abstraction of the distribution of an augmented sample drawn from the training dataset. The mathematical formulations are given in Section 1.

Sampling without replacement. Slightly different from our theory, we use sampling without replacement (Algorithm 2) in our experiments unless otherwise stated. This sampling scheme is standard in practice: it is used by Goyal et al. (2017) for parallel SGD and by Lin et al. (2020b) and Ortiz et al. (2021) for Post-local/Local SGD. It works as follows. At the beginning of every epoch, the whole training dataset is shuffled and evenly partitioned into K shards. Each worker takes one shard and samples batches from it without replacement. When all workers have passed through their own shards, the next epoch begins and the whole dataset is reshuffled. An alternative view is that the workers always share the same dataset: in each epoch, they perform local steps by sampling batches without replacement until the dataset contains too few samples to form a batch, and then another epoch starts with the dataset reloaded to its initial state.

Algorithm 1: Distributed Sampler on K Workers (Sampling with Replacement)
Require: shared training dataset D̃, data augmentation function A(ξ̃)
Hyperparameters: local batch size B_loc
Function Sample() on worker k:
    Draw B_loc IID samples ξ̃_1, …, ξ̃_{B_loc} from D̃ with replacement;
    ξ_b ← A(ξ̃_b) for all 1 ≤ b ≤ B_loc;   // apply data augmentation
    return (ξ_1, …, ξ_{B_loc});
end

Algorithm 2: Distributed Sampler on K Workers (Sampling without Replacement)
Require: shared training dataset D̃, data augmentation function A(ξ̃)
Hyperparameters: local batch size B_loc
Constant: N_loc := |D̃| / (K·B_loc)   // number of local batches per worker per epoch
Local Variables: c^(k) ← N_loc·B_loc for worker k   // number of samples drawn in this epoch
Function Sample() on worker k:
    if c^(k) = N_loc·B_loc then   // now start a new epoch
        Wait until all the other workers reach this line;
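The two samplers can be sketched in plain Python as follows (a minimal single-process sketch; the class names and the shared-seed detail are ours, not from the paper's code). In a real multi-worker run, all workers must shuffle with a shared seed so that their shards are disjoint; here each sampler object simulates one worker.

```python
import random

class WithReplacementSampler:
    """Algorithm 1: each call draws B_loc IID samples (with replacement)."""
    def __init__(self, dataset, b_loc, augment=lambda x: x):
        self.dataset, self.b_loc, self.augment = dataset, b_loc, augment

    def sample(self):
        batch = [random.choice(self.dataset) for _ in range(self.b_loc)]
        return [self.augment(x) for x in batch]

class WithoutReplacementSampler:
    """Algorithm 2: reshuffle the whole dataset each epoch, shard it across
    K workers, and draw batches from the local shard without replacement."""
    def __init__(self, dataset, k, b_loc, worker_id, seed=0, augment=lambda x: x):
        self.dataset, self.k, self.b_loc = dataset, k, b_loc
        self.worker_id, self.augment = worker_id, augment
        self.rng = random.Random(seed)            # shared seed across workers
        self.n_loc = len(dataset) // (k * b_loc)  # local batches per worker per epoch
        self.cursor = self.n_loc * b_loc          # forces a reshuffle on first call

    def sample(self):
        if self.cursor >= self.n_loc * self.b_loc:  # start a new epoch
            perm = list(range(len(self.dataset)))
            self.rng.shuffle(perm)                  # same permutation on every worker
            lo = self.worker_id * self.n_loc * self.b_loc
            hi = lo + self.n_loc * self.b_loc
            self.shard = [self.dataset[i] for i in perm[lo:hi]]
            self.cursor = 0
        batch = self.shard[self.cursor:self.cursor + self.b_loc]
        self.cursor += self.b_loc
        return [self.augment(x) for x in batch]
```

Within one epoch a worker never repeats a sample, and with the shared seed the K shards partition the (shuffled) dataset, matching the description above.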

    update the model
end

Algorithm 4: Local SGD on K Workers
Input: loss function ℓ(θ; ξ), initial parameter θ^(0)
Hyperparameters: total number of rounds R, number of local steps H per round
Hyperparameters: learning rate η, local batch size B_loc
for s = 0, …, R − 1 do
    for each worker k do in parallel
        θ^(s)_{k,0} ← θ^(s);   // maintain a local copy of the global iterate

All-Reduce aggregation of local iterates
end

Algorithm 5: Post-local SGD on K Workers
Input: loss function ℓ(θ; ξ), initial parameter θ_0
Hyperparameters: total number of iterations T, learning rate η, local batch size B_loc
Hyperparameters: switching time point t_0, number of local steps H per round
Ensure: T − t_0 is a multiple of H
Starting from θ_0, run parallel SGD for t_0 iterations and obtain θ_{t_0};
Starting from θ_{t_0}, run Local SGD for (T − t_0)/H rounds with H local steps per round;
return the final global iterate of Local SGD;

D MODELING LOCAL SGD WITH MULTIPLE CONVENTIONAL SDES

Lin et al. (2020b) tried to informally explain the success of Local SGD by adopting the argument that a larger diffusion term in the conventional SDE leads to better generalization (see Section 3.1 and Appendix A). Basically, they attempted to write multiple SDEs, each of which describes the H-step local training process of one worker in one round (from θ
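As a concrete reference, Algorithm 4 (Local SGD) can be simulated on a single process as below (a minimal sketch under our own naming; the gradient oracle and noise model are illustrative stand-ins, not the paper's experimental setup). Post-local SGD (Algorithm 5) is this loop run with H = 1 for the first t_0 steps and H > 1 afterwards.

```python
def local_sgd(grad, theta0, K, H, R, eta, sample_noise):
    """Simulate Local SGD on one process.

    grad: full-batch gradient function; sample_noise(): per-step scalar
    gradient noise, so a stochastic gradient is grad(theta) + noise.
    """
    theta = list(theta0)
    for _ in range(R):                    # R communication rounds
        locals_ = []
        for _ in range(K):                # each worker starts from the global iterate
            th = list(theta)
            for _ in range(H):            # H local SGD steps
                g = grad(th)
                th = [x - eta * (gi + sample_noise()) for x, gi in zip(th, g)]
            locals_.append(th)
        # all-reduce: average the K local models to form the next global iterate
        theta = [sum(ws) / K for ws in zip(*locals_)]
    return theta
```

For example, on the quadratic loss L(θ) = ½∥θ∥² (so grad is the identity) with no noise, every worker follows the same trajectory and the iterate contracts by (1 − η) per local step.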

Figure 3: Reducing the diffusion term of the Slow SDE for Local SGD leads to better generalization. Test accuracy improves as we increase K with fixed η and H to reduce the diffusion term while keeping the drift term untouched. See Appendix M.4 for details.

(d) ImageNet, start from #100.

Figure 4: For training from CIFAR-10 and ImageNet checkpoints, Local SGD consistently outperforms SGD (H = 1) across different batch sizes B (fixing B_loc and varying K), where the learning rate is scaled by the LSR η ∝ B. Two possible ways of tuning the number of local steps H are considered: (1) tune H for the best test accuracy for K = 16 and K = 256 respectively on CIFAR-10 and ImageNet, then scale H as H ∝ 1/B so that α := ηH is constant; (2) tune H specifically for each K. See Appendix M.5 for training details.

CIFAR-10, start from random.

ImageNet, start from #250.

(c) CIFAR-10, start from #100. (f) ImageNet, optimal α (log scale) vs. η.

Figure 5: Additional experimental results about the effect of the learning rate, training time and the number of local steps. See Appendix M.2 for details.

(b) SGD with larger batch sizes.

Post-local SGD, sampling with replacement.

(c)) and Post-local SGD significantly outperforms SGD.

G DISCUSSIONS ON LOCAL SGD WITH LABEL NOISE REGULARIZATION

G.1 THE SLOW SDE FOR LOCAL SGD WITH LABEL NOISE REGULARIZATION

In this subsection, we present the Slow SDE for Local SGD in the case of label noise regularization and show that Local SGD indeed induces a stronger regularization term, which presumably leads to better generalization.

Theorem G.1 (Slow SDE for Local SGD with label noise regularization). For a C-class classification task with cross-entropy loss, the slow SDE of Local SGD with label noise has the following form:

(b) VGG-16 w/o normalization.

Figure 7: Local SGD with label noise regularization on CIFAR-10 without data augmentation using K = 32, B_loc = 128. A larger number of local steps indeed enables higher test accuracy. For both architectures, we replace ReLU with Swish. See Appendix M.6 for training details.

and Li et al. (2021b) connect minimizing the trace of the Hessian to finding sparse or low-rank solutions when training two-layer linear nets. Damian et al. (2021) empirically show that good generalization correlates with a smaller trace of the Hessian when training ResNets with label noise. Besides, Ma & Ying (2021) connect the trace of the Hessian to the smoothness of the function represented by a deep neural net.

Figure 8: Local SGD with label noise regularization on CIFAR-10 without data augmentation using K = 4, B loc = 128. SGD (H = 1) indeed achieves comparable test accuracy as Local SGD with a large H when we scale up its learning rate to √ K times that of Local SGD. See Appendix M.6 for training details.

among K workers at steps 0 and H. Then for all k ∈ [K], x^{(s)}

Assume that the loss function L(θ) is ρ-smooth and µ-PL in an open, convex neighborhood U of a local minimizer θ * . Denote by L * := L(θ * ) the minimum value for simplicity. Let ϵ ′ be the radius of the open ϵ ′ -ball centered at θ * such that B ϵ ′ (θ * ) ⊆ U . We also define a potential function Ψ(θ) := L(θ) -L * .

∥₂ = Õ(√η) and ∥θ̄^{(s+r)} − θ̄^{(s)}∥₂ = Õ(η^{0.5−0.5β}) for all 0 ≤ r ≤ R_grp with high probability.

K.6.1 ADDITIONAL NOTATIONS

Before presenting the lemmas, we define the following martingale {m

Figure 9: A plot of ψ(x).

∥₂²] = O(η^{100}). Combining this with (76)–(78) yields (79).

(1/η_e)^{2b}, for all 1 ≤ i₁, …, i₆ ≤ d. (113)

Combining (111)–(113) with Lemma K.40 gives the above lemma.

Lemma K.43. For a test function g ∈ C³, let u_{l,n}(ϕ) := u(ϕ, lη_e, nη_e) = E_{ζ_t∼P_ζ(ϕ, lη_e, nη_e)}[g(ζ_t)]. If ∥θ̄^{(0)} − ϕ^{(0)}∥₂ = O(η log(1/η)), then for all 0 ≤ l ≤ n − 1 and 1 ≤ n ≤ ⌊T/η_e⌋,
$$\mathbb{E}\left[u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \Delta^{(l)}) - u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \tilde\Delta^{(l+1)}) \,\middle|\, \bar\varphi^{(lR_{\mathrm{grp}})}\right] \le C_{g,1}(\eta_e^{\gamma_1} + \eta_e^{\gamma_2})\left(\log\tfrac{1}{\eta_e}\right)^{b},$$

E[u_{l+1,n}(φ̄^{(lR_grp)} + Δ^{(l)}) − u_{l+1,n}(φ̄^{(lR_grp)} + Δ̃^{(l)}) | φ̄^{(lR_grp)}] can be bounded by the sum of
$$\underbrace{\mathbb{E}[u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \Delta^{(l)}) - u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \tilde\Delta^{(l)}) \mid \bar\varphi^{(lR_{\mathrm{grp}})}, \bar{\mathcal{E}}_0^{(lR_{\mathrm{grp}})}]}_{A_1},$$
$$\underbrace{\eta^{100}\,\mathbb{E}[|u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \Delta^{(l)})| \mid \bar\varphi^{(lR_{\mathrm{grp}})}, \bar{\mathcal{E}}_0^{(lR_{\mathrm{grp}})}]}_{A_2}, \quad \underbrace{\eta^{100}\,\mathbb{E}[|u_{l+1,n}(\bar\varphi^{(lR_{\mathrm{grp}})} + \tilde\Delta^{(l)})| \mid \bar\varphi^{(lR_{\mathrm{grp}})}, \bar{\mathcal{E}}_0^{(lR_{\mathrm{grp}})}]}_{A_3}.$$

… ∂³u_{l+1,n}/(∂ϕ_i ∂ϕ_j ∂ϕ_p)(φ̄^{(lR_grp)} + θΔ^{(l)}) … ∂³u_{l+1,n}/(∂ϕ_i ∂ϕ_j ∂ϕ_p)(φ̄^{(lR_grp)} + θΔ̃^{(l)}) …

≤ d, 0 ≤ l ≤ n − 1, 1 ≤ n ≤ ⌊T/η_e⌋,
$$\left|\frac{\partial u_{l+1,n}}{\partial\phi_i}(\bar\varphi^{(lR_{\mathrm{grp}})})\right| \le C_{g,3}, \qquad \left|\frac{\partial^2 u_{l+1,n}}{\partial\phi_i\partial\phi_j}(\bar\varphi^{(lR_{\mathrm{grp}})})\right| \le C_{g,3}.$$
By Lemma K.42,
$$B_1 \le d\,C_{g,3}C_1(\eta_e^{\gamma_1} + \eta_e^{\gamma_2})\left(\log\tfrac{1}{\eta_e}\right)^{b}, \qquad B_2 \le \frac{d^2}{2}C_{g,3}C_1(\eta_e^{\gamma_1} + \eta_e^{\gamma_2})\left(\log\tfrac{1}{\eta_e}\right)^{b}.$$
Now we bound the remainders. By the Cauchy–Schwarz inequality, the expectation of the third-derivative terms ∂³u_{l+1,n}/(∂ϕ_i ∂ϕ_j ∂ϕ_p)(φ̄^{(lR_grp)} + θΔ^{(l)}) can be bounded.

Combining the above inequality with Lemma K.42, we have
$$\mathbb{E}\left[\frac{\partial^3 u_{l+1,n}}{\partial\phi_i\partial\phi_j\partial\phi_p}(\bar\varphi^{(lR_{\mathrm{grp}})} + \theta\Delta^{(l)})\,\cdots\right] \le C_{g,4}C_1\eta_e^{\gamma_1}\left(\log\tfrac{1}{\eta_e}\right)^{b}.$$
Hence, for all 1 ≤ n ≤ ⌊T/η_e⌋ and 0 ≤ l ≤ n − 1, |R| ≤ (d³/6)·C_{g,4}C₁η_e^{γ₁}(log(1/η_e))^b. Similarly, we can show that there exists a constant C_{g,5} such that for all 1 ≤ n ≤ ⌊T/η_e⌋ and 0 ≤ l ≤ n − 1, |R̃| ≤ (d³/6)·C_{g,5}C₁η_e^{γ₁}(log(1/η_e))^b. Combining the bounds on A₁ to A₃, we have the lemma.



$$\mathbb{E}[g(\phi^{(nR_{\mathrm{grp}})})] - \mathbb{E}[g(\zeta(n\eta_e))] \le \mathbb{E}[g(\tilde\zeta_{n,n}) - g(\tilde\zeta_{0,n}) \mid \mathcal{E}] \le \sum_{l} \left| u(\bar\varphi^{((l+1)R_{\mathrm{grp}})}, (l+1)\eta_e, n\eta_e) - u(\tilde\zeta_{l,l+1}, (l+1)\eta_e, n\eta_e) \right| + O(\eta^{100}).$$
Noticing that E[T_{l+1,n} | 𝓔₀^{(nR_grp)}] = E[E[T_{l+1,n} | φ̄^{(lR_grp)}, …]], we can apply Lemma K.43 and obtain that for all 0 ≤ n ≤ ⌊T/η_e⌋,
$$\mathbb{E}[g(\phi^{(nR_{\mathrm{grp}})})] - \mathbb{E}[g(\zeta(n\eta_e))] \le n\,C_{g,1}(\eta_e^{\gamma_1} + \eta_e^{\gamma_2})\left(\log\tfrac{1}{\eta_e}\right)^{b} \le T\,C_{g,1}(\eta_e^{\gamma_1-1} + \eta_e^{\gamma_2-1})\left(\log\tfrac{1}{\eta_e}\right)^{b}.$$
Notice that η_e^{γ₁} + η_e^{γ₂} = η^{0.5−β} + η^β, and that T and C_{g,1} are constants independent of η_e. Setting β = 0.25 yields Theorem K.4.

Having established Theorem K.4, we are ready to prove Theorem 3.2.

Proof of Theorem 3.2. Denote by s_cls = s₀ + s₁ = O(log(1/η)) the time at which the global iterate θ̄^{(s)} reaches within Õ(η) of Γ with high probability. Define ζ̃(t) to be the solution to the limiting SDE (108) conditioned on 𝓔₀^{(s_cls)} with ζ̃(0) = ϕ^{(s_cls)}. By Theorem K.4, we have
$$\max_{n=0,\dots,\lfloor T/\eta^{0.75}\rfloor} \mathbb{E}\left[g(\phi^{(nR_{\mathrm{grp}}+s_{\mathrm{cls}})}) - g(\tilde\zeta(n\eta^{0.75})) \,\middle|\, \phi^{(s_{\mathrm{cls}})}, \mathcal{E}_0^{(s_{\mathrm{cls}})}\right] \le C_g\,\eta^{0.25}\left(\log\tfrac{1}{\eta}\right)^{b},$$
where R_grp = ⌊1/(αη^{0.75})⌋. Noticing that (i) g ∈ C³, (ii) b, σ ∈ C^∞, and (iii) ζ(t), ζ̃(t) ∈ Γ for all t ∈ [0, ∞) almost surely, we can conclude that given 𝓔₀^{(s_cls)}, ∥ζ(t) − ζ̃(t)∥₂ = Õ(√η) for all t ∈ [0, T].

$$\max_{n=0,\dots,\lfloor T/\eta^{0.75}\rfloor} \mathbb{E}\left[g(\phi^{(nR_{\mathrm{grp}}+s_{\mathrm{cls}})}) - g(\zeta(n\eta^{0.75}+s_{\mathrm{cls}}H\eta^2))\right] \le C'_g\,\eta^{0.25}\left(\log\tfrac{1}{\eta}\right)^{b'}.$$

to the Hessian ∇²L(θ) for all θ ∈ S*.

Lemma L.1. If f(θ; x_i, ŷ_i) is C²-smooth on ℝ^d for every i ∈ [N] and ŷ_i ∈ [C], and S* ≠ ∅, then for all θ ∈ S*, Σ(θ) = ∇²L(θ).

Proof. Since L(·) is C²-smooth, ∇L(θ) = 0 for all θ ∈ S*. To prove the lemma, it suffices to show that for all i ∈ [N], E[∇ℓ(θ; x_i, ŷ_i)∇ℓ(θ; x_i, ŷ_i)^⊤] = ∇²L(θ). W.l.o.g., let y = 1; then for all θ ∈ S*,
$$f_1(\theta; x) = 1 - p =: a_1, \qquad f_j(\theta; x) = \frac{p}{C-1} =: a_2, \quad \forall j > 1,\ j \in [C].$$
Additionally, let h(x) := −log(x) for x ∈ ℝ₊. The stochastic gradient ∇ℓ(θ; x, ŷ) follows the distribution
$$\nabla\ell(\theta; x, \hat y) = \begin{cases} h'(a_1)\,\dfrac{\partial f_1}{\partial\theta} & \text{w.p. } 1-p, \\[4pt] h'(a_2)\,\dfrac{\partial f_j}{\partial\theta} & \text{w.p. } \dfrac{p}{C-1}, \quad \forall j \in [C],\ j > 1. \end{cases}$$
Then the covariance of the gradient noise is
$$\mathbb{E}[\nabla\ell(\theta; x, \hat y)\nabla\ell(\theta; x, \hat y)^\top] = (1-p)(h'(a_1))^2\,\frac{\partial f_1}{\partial\theta}\frac{\partial f_1}{\partial\theta}^\top + \frac{p\,(h'(a_2))^2}{C-1}\sum_{j>1}\frac{\partial f_j}{\partial\theta}\frac{\partial f_j}{\partial\theta}^\top.$$

Since for ζ ∈ Γ, Σ(ζ) = ∇²L(ζ), we have
$$\partial\Phi(\zeta)\Sigma^{1/2}(\zeta) = 0. \qquad (118)$$
Now we show that
$$\partial^2\Phi(\zeta)[\Sigma(\zeta)] = -\nabla_\Gamma \operatorname{tr}(\nabla^2 L(\zeta)). \qquad (119)$$
Since ∇²L(ζ) = Σ(ζ), we have V_{∇²L(ζ)}[Σ] = ½I. By Lemma K.4,
$$\partial^2\Phi(\zeta)[\Sigma(\zeta)] = -\tfrac{1}{2}\,\partial\Phi(\zeta)\nabla^3 L(\zeta)[I] = -\tfrac{1}{2}\,\nabla_\Gamma \operatorname{tr}(\nabla^2 L(\zeta)).$$
Finally, we show that
$$\partial^2\Phi(\zeta)[\Psi(\zeta)] = -\nabla_\Gamma \frac{1}{2H\eta}\operatorname{tr}\left(F(2H\eta\,\nabla^2 L(\zeta))\right). \qquad (120)$$
Define ψ̂(x) := xψ(x) = e^{−x} − 1 + x. By the definition of Ψ(ζ), when Σ(ζ) = ∇²L(ζ), Ψ(ζ) = ψ̂(2ηH∇²L(ζ)), where ψ̂(·) is interpreted as a matrix function. Since ψ̂(2ηH∇²L(ζ)) ∈ span{uu^⊤ | u ∈ T⊥_ζ(Γ)}, by Lemma K.4,
$$\partial^2\Phi(\zeta)[\Psi(\zeta)] = -\tfrac{1}{2}\,\partial\Phi(\zeta)\,\nabla\operatorname{tr}\hat\psi(2\eta H\,\nabla^2 L(\zeta)).$$
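For intuition, ψ̂(x) = e^{−x} − 1 + x can be applied to a symmetric matrix through its eigendecomposition, which is one way to evaluate Ψ(ζ) = ψ̂(2ηH∇²L(ζ)) numerically (a sketch with a toy diagonal Hessian of our choosing, not from the paper):

```python
import numpy as np

def psi_hat(x):
    # psi_hat(x) = e^{-x} - 1 + x; nonnegative, ~x^2/2 near 0, ~x for large x
    return np.exp(-x) - 1.0 + x

def matrix_psi_hat(A):
    """Apply psi_hat to a symmetric PSD matrix via its eigendecomposition."""
    lam, U = np.linalg.eigh(A)
    return (U * psi_hat(lam)) @ U.T   # U diag(psi_hat(lam)) U^T

# Example: Psi = psi_hat(2*eta*H*Hessian) when Sigma = Hessian
eta, H = 0.1, 10
hess = np.diag([0.0, 0.5, 2.0])       # toy Hessian with a zero (tangent) direction
Psi = matrix_psi_hat(2 * eta * H * hess)
```

Since ψ̂(0) = 0, directions with zero curvature (tangent directions of Γ) contribute nothing to Ψ, consistent with Ψ lying in the span of normal directions.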

M.6 DETAILS FOR EXPERIMENTS ON LABEL NOISE REGULARIZATION

For all label noise experiments, we do not use data augmentation, use sampling with replacement, and set the corruption probability to 0.1. We simulate 32 workers with B = 4096 in Figure 7 and 4 workers with B = 512 in Figure 8. We use ResNet-56 with GroupNorm (8 groups) for Figure 7(a) and VGG-16 without normalization for Figures 7(b) and 8. Below we list the training details for ResNet-56 and VGG-16 respectively.
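The label corruption used in these experiments resamples the label afresh on every access: with probability 1 − p the true label is kept, and otherwise a uniformly random other label is drawn, so each wrong class receives probability p/(C−1). A minimal sketch (the function name is ours):

```python
import random

def corrupt_label(y, num_classes, p, rng):
    """Return y with probability 1-p; otherwise a uniformly random *other* label."""
    if rng.random() < p:
        other = rng.randrange(num_classes - 1)
        return other if other < y else other + 1   # skip the true label y
    return y
```

Because the noisy label is redrawn every epoch, the minimizers of the expected loss place probability 1 − p on the true class and p/(C−1) on each other class, as formalized in Appendix L.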

for an introduction to Itô calculus. Here P ζ equals Φ(ζ +AdW t +bdt)-Φ(ζ) by Itô calculus, which means that P ζ projects an infinitesimal step from ζ, so that ζ after taking the projected step does not leave the manifold Γ. It can be shown by simple calculus that ∂Φ(ζ) equals the projection matrix onto the tangent space of Γ at ζ. We decompose the noise covariance Σ(ζ) for ζ ∈ Γ into two parts: the noise in the tangent space Σ

Interpretation of the Slow SDEs
3.3.1 Interpretation of the Slow SDE for SGD
3.3.2 Local SGD Strengthens the Drift Term in Slow SDE
3.3.3 Theoretical Insights into Tuning the Number of Local Steps
Understanding the Diffusion Term in the Slow SDE
E.2 The Effect of Global Batch Size on Generalization
Additional Notations

Similar to Lemma K.30, we have the following lemma for the general formula of P ∥

∞ and (iii) Γ is compact, we can directly apply Lemma B.3 in Malladi et al. (2022) and Lemma 26 in Li et al. (2019a) to obtain the above lemma.

The following lemma states that the expectation of g(ζ(t)) for g ∈ C³ is smooth with respect to the initial value of the SDE solution.

Lemma K.41. Let s ∈ [0, T], ϕ ∈ Γ and g ∈ C³. For t ∈ [s, T], define u(ϕ, s, t) := E_{ζ_t∼P_ζ(ϕ,s,t)}[g(ζ_t)].

ACKNOWLEDGEMENT AND DISCLOSURE OF FUNDING

The work of Xinran Gu and Longbo Huang is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and Tsinghua Precision Medicine Foundation 10001020109. The work of Kaifeng Lyu and Sanjeev Arora is supported by funding from NSF, ONR, Simons Foundation, DARPA and SRC.


Supplementary Plot: Learning rate should be small. Figure 5(c) shows that reducing the learning rate from 0.32 to 0.064 does not lead to a test accuracy drop for Local SGD on CIFAR-10 if the training time is allowed to be longer and the number of local steps H is set properly. Figure 5(d) presents the case where, with a large learning rate, the generalization improvement of Local SGD disappears even when starting from a pre-trained model.

Supplementary Plot: Reconciling our main finding with Ortiz et al. (2021). In Figure 5(e), the generalization benefit of Local SGD with H = 24 becomes less significant after the learning rate decay at epoch 226, which is consistent with the observation by Ortiz et al. (2021) that the generalization benefit of Local SGD usually disappears after the learning rate decay. But we can preserve the improvement by increasing H to 900. Here, we use Local SGD with momentum.

Supplementary Plot: Optimal α gets larger for smaller η. In Figure 5(f), we summarize the optimal α := ηH that enables the highest test accuracy for each learning rate in Figure 2(f). We can see that the optimal α increases as we decrease the learning rate. The reason is that the approximation error bound O(αη log(α/(ηδ))) in Theorem 3.3 decreases with η, allowing a larger value of α to better regularize the model.

… η into Lemma K.17, for δ = O(poly(η)), with probability at least 1 − δ, (52) and (53) hold, where C₁ is a constant that can depend on C₀. Furthermore, for round … θ̄^{(R_grp)} …, where C̄₉ is a constant independent of C₀; … where C₂ is a constant that can depend on C₀. Furthermore, … where C̄₁₁ is a constant independent of C₀.

Proof. Decomposing x^{(s)}_{k,t} by the triangle inequality, we first bound ∥θ̄^{(s)} − ϕ^{(s)}∥₂. By Lemma K.21, for δ = O(poly(η)), with probability at least 1 − δ, … and …, where C₂ is a constant that may depend on C₀ and C̄₁₀ is a constant independent of C₀.
When (54) and (56) hold, by Lemma K.10, … Then we bound ∥θ … By the update rule, we have … Then we have …, where the second equality uses (96) and the last inequality uses B̄^{(0)} … Therefore, for s < R₀, Â^{(s)}_{avg,i,j} = Õ(η) and therefore B̄^{(s)}_{t,i,j} = Õ(η).

Lemma K.34 (General formula for the elements of P^⊥B̄^{(s)}). … B̄^{(s)}_{t,i,j} = Õ(η^{1.5−β}).

Proof. Note that for 1 ≤ i ≤ m, λ_i > 0. By (92),

B̄^{(s)}_{t,i,j} = (1 − λ_iη)^t B̄^{(s)}_{0,i,j} + Õ(η^{1.5−β}). By (95), …, where the last inequality uses B̄^{(0)} …

Proof. Note that λ_i = 0 for m < i ≤ d. By (92) and (95), …

Having obtained the expressions for B̄^{(s)}_t, Â^{(s)}_t and Â^{(s)}_avg, we now provide explicit expressions for the first and second moments of the per-round change of the manifold projection in the following two lemmas.

We move on to show that … Similar to the way we compute Â^{(s)}_t, Â^{(s)}_avg and B̄^{(s)}_t, we compute T₂ by splitting T₃ into four matrices and then substituting them into the linear operator −ηP^∥∇³L(φ̄^{(0)})[·] one by one. First, we show that …, where ψ(·) is interpreted as an elementwise matrix function here. By Lemmas K.29 and K.34, … Then we simplify T₄. Notice that … Therefore, substituting T₄ back into the expression … Combining the elementwise results, we obtain the following matrix-form expression: … By Lemma K.4, we have (104). Secondly, we show that for s > R₀, …

Denote by x_{i:j} := (x_i, x_{i+1}, …, x_j) (i ≤ j) the (j − i + 1)-dimensional vector formed by the i-th to j-th coordinates of x. Since ∂(∇L(p)) … Therefore, Γ is a closed manifold (i.e., compact and without boundary). Then we have the following lemma stating that Γ is invariant for (108).

Lemma K.39. Let ζ(t) be the solution to (108) with ζ(0) ∈ Γ. Then ζ(t) ∈ Γ for all t ≥ 0; in other words, Γ is invariant for (108).

Proof. According to Filipović (2000) and Du & Duan (2007), for a closed manifold M to be viable for the SDE dX(t) = F(X(t))dt + B(X(t))dW_t, where F : ℝ^d → ℝ^d and B : ℝ^d → ℝ^{d×d} are locally Lipschitz, we only need to verify the following Nagumo-type consistency condition: …, where D[·] is the Jacobian operator and B_j(x) denotes the j-th column of B(x).

In our context, since for ϕ ∈ Γ, ∂Φ(ϕ) is a projection matrix onto T_ϕ(Γ), each column of ∂Φ(ϕ)Σ^{1/2}(ϕ) belongs to T_ϕ(Γ), verifying the second condition. Denote by P^⊥(ϕ) := I_d − ∂Φ(ϕ) the projection onto the normal space of Γ at ϕ.
To verify the first condition, it suffices to show that P^⊥(ϕ)µ(ϕ) = 0. We evaluate Σ_j P^⊥(ϕ)D[B_j(ϕ)]B_j(ϕ) as follows: …, where the last inequality uses Lemma K.3. Again applying Lemma K.3, we have … Combining (109) and (110), we can verify the first condition.

In order to establish Theorem 3.2, it suffices to prove the following theorem, which captures the closeness of ϕ^{(s)} and ζ(t) every R_grp rounds: …, where C_g > 0 is a constant independent of η but can depend on g(·), and b > 0 is a constant independent of η and g(·).

K.10.1 PRELIMINARIES AND ADDITIONAL NOTATIONS

We first introduce a general formulation for stochastic gradient algorithms (SGAs) and then specify the components of this formulation in our context. Consider the following SGA:
$$x_{n+1} = x_n + \eta_e\, h(x_n, \xi_n),$$
where x_n ∈ ℝ^d is the parameter, η_e is the learning rate, and h(·, ·) is the update, which depends on x_n and a random vector ξ_n sampled from some distribution Ξ(x_n). Also consider the following stochastic differential equation (SDE).
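As a toy illustration of the SGA/SDE correspondence (not the paper's SDE), one can discretize a 1-d Ornstein–Uhlenbeck equation dX = −X dt + 0.1 dW with the Euler–Maruyama scheme, which itself has the SGA form x_{n+1} = x_n + η_e h(x_n, ξ_n) with η_e = dt:

```python
import math, random

def euler_maruyama(F, B, x0, dt, n_steps, seed=0):
    """Simulate dX = F(X) dt + B(X) dW with step size dt."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment ~ N(0, dt)
        x = x + F(x) * dt + B(x) * dw
    return x

# Mean-reverting example: starts at 1.0, drifts toward 0 with small fluctuations
x_T = euler_maruyama(lambda x: -x, lambda x: 0.1, x0=1.0, dt=0.01, n_steps=1000)
```

Averaging over many independent runs recovers the SDE's statistics, which is the sense in which an SGA "tracks" its limiting SDE as the step size shrinks.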

L DERIVING THE SLOW SDE FOR LABEL NOISE REGULARIZATION

In this section, we formulate how label noise regularization works and provide a detailed derivation of the theoretical results in Appendix G.

Consider training a model for C-class classification on a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i denotes the input and y_i ∈ [C] denotes the label. Denote by f(θ; x) ∈ Δ^{C−1} the model output on input x with parameter θ, whose j-th coordinate f_j(θ; x) stands for the probability of x belonging to class j. Let ℓ(θ; x, y) be the cross-entropy loss given input x and label y, i.e., ℓ(θ; x, y) = −log f_y(θ; x). Adding label noise means replacing the true label y with a fresh noisy label ŷ every time we access the sample. Specifically, ŷ is set to the true label y with probability 1 − p and to any other label with probability p/(C−1), where p is the fixed corruption probability. The training loss is defined as
$$L(\theta) := \frac{1}{N}\sum_{i=1}^N \mathbb{E}\left[\ell(\theta; x_i, \hat y_i)\right], \qquad (114)$$
where the expectation is taken over the stochasticity of ŷ_i. Notice that given a sample (x, y), the expected loss is (1 − p)ℓ(θ; x, y) + (p/(C−1))Σ_{j≠y}ℓ(θ; x, j). By the property of the cross-entropy loss, (114) attains its global minimum if and only if f_j = p/(C−1) for all j ∈ [C], j ≠ y, and f_y = 1 − p. Due to the large expressiveness of modern deep learning models, there typically exists a set S* of such parameters, and all elements of S* minimize L(θ). Then the manifold Γ is a subset of S*. The following lemma relates the noise covariance Σ(θ) to the Hessian.

M EXPERIMENTAL DETAILS

In this section, we specify the experimental details that are omitted in the main text. Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015). Our code is available at https://github.com/hmgxr128/Local-SGD. Our implementations of ResNet-56 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015) are based on the high-starred repository by Wei Yang², and we use the implementation of ResNet-50 from torchvision 0.3.1. We run all CIFAR-10 experiments with B_loc = 128 on 8 NVIDIA Tesla P100 GPUs, while ImageNet experiments are run on 8 NVIDIA A5000 GPUs with B_loc = 32.
All ImageNet experiments are trained with ResNet-50.

We generally adopt the following training strategies. We do not add any momentum unless otherwise stated. We follow the suggestions of Jia et al. (2018) and do not add weight decay to the bias and the learnable parameters in the normalization layers. For all models with BatchNorm layers, we go through 100 batches of data with batch size B_loc to estimate the running mean and variance before evaluation. Experiments on both datasets follow the standard data augmentation pipeline in He et al. (2016), except for the label noise experiments. Additionally, we use FFCV (Leclerc et al., 2022) to accelerate data loading for ImageNet training. Slightly different from the update rule of Local SGD in Section 1, we use sampling without replacement unless otherwise stated. See Appendix C for implementation details and discussion.

M.1 POST-LOCAL SGD EXPERIMENTS IN SECTION 1

CIFAR-10 experiments. We simulate 32 workers with B = 4096. We follow the linear scaling rule and the linear learning rate warmup strategy suggested by Goyal et al. (2017). We first run 250 epochs of SGD with the learning rate gradually ramping up from 0.1 to 3.2 over the first 50 epochs. Resuming from the model obtained at epoch 250, we run Local SGD with η = 0.32. Note that we conduct a grid search for the initial learning rate among {0.005, 0.01, 0.05, 0.1, 0.15, 0.2} and choose the learning rate with which parallel SGD (H = 1) achieves the best test accuracy; we also make sure that the optimal learning rate resides in the middle of the set. The weight decay λ is set to 5 × 10⁻⁴. As for the initialization scheme, we follow Lin et al. (2020b) and Goyal et al. (2017): we use Kaiming Normal initialization (He et al., 2015) for the weights of convolutional layers and initialize the weights of fully connected layers from a Gaussian distribution with mean zero and standard deviation 0.01. The weights of normalization layers are initialized to one, and all bias parameters are initialized to zero. We report the mean and standard deviation over 5 runs.

ImageNet experiments. We simulate 256 workers with B = 8192. We follow the linear scaling rule and the linear learning rate warmup strategy suggested by Goyal et al. (2017). We first run 100 epochs of SGD where the learning rate linearly ramps up from 0.5 to 16 over the first 5 epochs and then decays by a factor of 0.1 at epoch 50. Resuming from epoch 100, we run Local SGD with η = 0.16. Note that we conduct a grid search for the initial learning rate among {0.05, 0.1, 0.5, 1} and choose the learning rate with which parallel SGD (H = 1) achieves the best test accuracy; we also make sure that the optimal learning rate resides in the middle of the set. The weight decay λ is set to 1 × 10⁻⁴ and we do not add any momentum. The initialization scheme follows the implementation of torchvision 0.3.1.
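The warmup-plus-decay schedule described above can be written as a small helper (a sketch; the function name and argument defaults are ours, instantiated with the CIFAR-10 values: base learning rate 0.1, linear-scaling factor K·B_loc/128 = 32, 50 warmup epochs, decay by 0.1 at epoch 250):

```python
def learning_rate(epoch, base_lr=0.1, scale=32, warmup_epochs=50,
                  decay_epochs=(250,), decay_factor=0.1):
    """Linear Scaling Rule target with linear warmup and step decay."""
    target = base_lr * scale                 # LSR: eta scaled in proportion to batch size
    if epoch < warmup_epochs:
        # ramp linearly from base_lr to target over the warmup epochs
        lr = base_lr + (target - base_lr) * epoch / warmup_epochs
    else:
        lr = target
    for e in decay_epochs:
        if epoch >= e:
            lr *= decay_factor               # multiply by 0.1 at each decay epoch
    return lr
```

For example, the schedule starts at 0.1, reaches 3.2 at epoch 50, and drops to 0.32 at epoch 250, matching the run described above.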
We report the mean and standard deviation over 3 runs. Local SGD with momentum 0.9, where the momentum buffer is kept locally and never averaged. We run SGD with momentum 0.9 for 150 epochs to obtain the pre-trained model, where the learning 

