

Abstract

Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, which involve second derivatives and/or second-order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix AdaGrad) that, along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.

1. Introduction

Second-order methods are among the most powerful algorithms in mathematical optimization. Algorithms in this family often use a preconditioning matrix to transform the gradient before applying each step. Classically, the preconditioner is the matrix of second-order derivatives (i.e., the Hessian) in the context of exact deterministic optimization (e.g., Fletcher, 2013; Lewis & Overton, 2013; Nocedal, 1980). While second-order methods often have significantly better convergence properties than first-order methods, the size of typical problems prohibits their use in practice, as they require quadratic storage and cubic computation time for each gradient update.
Approximate algorithms such as quasi-Newton methods aim to significantly reduce these requirements; nonetheless, they still impose non-trivial memory costs equivalent to storing several copies of the model (and often quadratic computation, as in the popular two-loop recursion (Nocedal, 1980)), which severely limits their use at the immense scale of present-day deep learning. Arguably, one of the greatest challenges of modern optimization is to bridge this gap between theoretical and practical optimization and make second-order methods feasible to implement and deploy at immense scale. Besides the compelling scientific and mathematical developments it may stimulate, this challenge also has clear real-world significance: recent practice in training deep learning models suggests that the utility of common first-order methods is quickly reaching a plateau, in large part because their time-per-step is already negligible (compared to other parts of the computation) and cannot be optimized further; thus, the only way to obtain faster training performance is to drastically reduce the number of update steps. To this end, utilizing second-order methods seems a very natural and promising approach. In this paper we attempt to narrow the gap between theory and practice of second-order methods, focusing on second-order adaptive methods for stochastic optimization. These methods can be thought of as full-matrix analogues of common adaptive algorithms such as AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010) and Adam (Kingma & Ba, 2014): they precondition each gradient with a second-moment matrix, akin to a covariance matrix, that accumulates the outer products of the stochastic gradients. Full-matrix versions are potentially more powerful than first-order methods as they can exploit statistical correlations between (gradients of) different parameters; geometrically,



they can scale and rotate gradients, whereas first-order methods can only scale gradients. However, they suffer from prohibitive runtime and memory costs similar to those of Hessian-based methods. Recent developments in the space of second-order methods, on which we focus in this paper, include the K-FAC (Heskes, 2000; Martens & Grosse, 2015) and Shampoo (Gupta et al., 2018) algorithms, which exploit the structure of deep networks (and more generally, models described by a collection of tensors) to mitigate the space and runtime costs of full-matrix second-order algorithms. These methods approximate each preconditioning matrix using a factored representation that stems from the network structure. However, in very large applications such algorithms are still impractical, due to a number of numerical and infrastructural pitfalls, and are difficult to parallelize.

Contributions. We provide solutions to practical concerns and challenges that arise in implementing and using second-order methods at large scale. Our focus is on the Shampoo algorithm, but most of the challenges we address are relevant to the implementation of many other second-order methods. These include:

• We design and implement a pipelined version of the optimization algorithm, critically exploiting the heterogeneity and computing power of CPU-accelerator coupled architectures;
• We extend Shampoo in a number of ways to make it applicable to a larger range of deep architectures; in particular, the extensions allow Shampoo to be used for training very large layers such as the embedding layers ubiquitous in language and translation models;
• We replace expensive spectral decompositions (e.g., SVD) used for manipulating preconditioners with an efficient and numerically stable iterative method for computing roots of PSD matrices;
• We describe practical challenges and limitations we faced in our design, which we argue could inform the design of next-generation accelerator hardware architectures.

Our distributed implementation demonstrates significant improvements in performance, both in terms of number of steps and often in actual wall-clock time, on some extremely large deep learning tasks:

• Machine translation: we train Transformer models (Vaswani et al., 2017) on the WMT'14 English-to-French translation task (Bojar et al., 2014) in half as many steps compared to the state of the art (well-tuned Adam), resulting in up to a 45% reduction in wall-time.
• Language modeling: we trained BERT (Devlin et al., 2018) in 16% fewer steps and achieved higher masked-LM accuracy compared to the state-of-the-art optimizer (You et al., 2019) at 32K batch size; overall wall-time decreased by 4%, from 3.8 to 3.65 hours. (For this task, our system has not yet been tuned for performance; we discuss several possible optimizations below.)
• Click-through rate (CTR) prediction: we trained the DLRM model (Naumov et al., 2019) on the terabyte Criteo dataset (Criteo Labs, 2015) at 64K batch size in half as many steps as the current state-of-the-art optimizer, with a wall-time reduction of 37.5%. We achieve a new state-of-the-art performance of 80.56% AUC (an improvement of ≈0.3%) on this task. (An improvement of 0.1% is considered significant; see Rong et al., 2020; Wang et al., 2017.)
• Image classification: we achieve the MLPerf target accuracy of 75.9% (Mattson et al., 2019) at 32K batch size on the standard ResNet-50 ImageNet benchmark in 10% fewer steps than the previous state of the art. Here we do not see wall-time gains, mainly because the problem is too small (only a few thousand steps to convergence, which does not allow for amortization of costs). However, we expect that one would be able to better exploit parallelism via improved software and hardware support.

We note that one of our main goals in this work was to demonstrate wall-time speedups with second-order methods implemented on a real-world distributed setup used to train state-of-the-art deep models. In our view, this is important for influencing future hardware accelerator design and runtime software. Indeed, first-order methods have received huge investments in tuning, implementation, platform support and tailored accelerator hardware over the last decade; we believe there are numerous opportunities to improve the per-step time performance of preconditioned methods as well. For example, our results provide a concrete justification for incorporating 64-bit accumulation units in hardware for distributed training, adding larger on-chip memory, better model parallelism and tighter coupling between accelerators and CPUs, all of which would make second-order methods feasible across more domains and models.

Related work.

Classic techniques for addressing the high storage and computation costs of second-order methods mostly belong to the quasi-Newton or trust-region families of algorithms (Conn et al., 2000; Nocedal & Wright, 2006). Traditionally, these methods need nearly-accurate gradients in order to construct useful quadratic approximations and implement reliable line searches, rendering them suitable only for training with very large batch sizes, and resulting in expensive iterations that make the overall algorithm slow compared with stochastic first-order methods (see, e.g., Bollapragada et al., 2018 for a recent account). Hence, our focus in this paper is on adaptive second-order methods, which are directly applicable in a stochastic setting. That said, our effort could be relevant to quasi-Newton and trust-region methods as well: e.g., each iteration of typical trust-region methods amounts to solving a certain generalized eigenvalue problem, which presents numerical difficulties similar in nature to those encountered in matrix root/inverse computations, addressed here. Various approximations to the preconditioning matrix have been proposed in the recent literature (e.g., Gonen & Shalev-Shwartz, 2015; Erdogdu & Montanari, 2015; Agarwal et al., 2016; Xu et al., 2016; Pilanci & Wainwright, 2017). However, so far the only prevalent and pragmatic approximation is the diagonal approximation. Some recent approaches for approximating a full-matrix preconditioner are K-FAC (Martens & Grosse, 2015), Shampoo (Gupta et al., 2018) and GGT (Agarwal et al., 2018). K-FAC uses a factored approximation of the Fisher information matrix as a preconditioner. While our focus in this paper is on Shampoo, we believe that many of the techniques presented here could also be applied to make K-FAC practical at large scale (see Appendix C). GGT uses a clever trick to compute a low-rank approximation to the AdaGrad preconditioner. However, GGT maintains several hundred copies of the gradient in memory, which is too expensive even for mid-sized models.

2. Preliminaries

Adaptive preconditioning methods. First-order methods iteratively update the parameters based solely on gradient information: $w_{t+1} = w_t - \eta_t \bar{g}_t$, where $w_t$ and $\bar{g}_t$ are (column) vectors in $\mathbb{R}^d$. Here $\bar{g}_t$ denotes a linear combination of the current and past gradients $g_1, \ldots, g_t$, where different algorithms use different combinations. Preconditioned methods take the form $w_{t+1} = w_t - P_t \bar{g}_t$, where $P_t$ is a $d \times d$ matrix. Whereas in Newton-type methods this matrix is related to the Hessian matrix of second-order derivatives, adaptive preconditioning is based on gradient-gradient correlations.

The parameters of a deep network are structured as a set of tensors of order two (i.e., matrices), three, or four. For simplicity of presentation we focus on the matrix case; however, our design, analysis, and implementation hold for tensors of arbitrary order. We denote the space of parameters by the matrix $W \in \mathbb{R}^{m \times n}$ and an estimate of its gradient by $G$. Full-matrix AdaGrad flattens $W, G$ to vectors of dimension $mn$; it thus requires $m^2 n^2$ space to store the preconditioner and $m^3 n^3$ time to perform the update. Since $m$ and $n$ are in the thousands in state-of-the-art models, full-matrix preconditioning is impractical. For this reason, both AdaGrad and Adam constrain the preconditioning matrices to be diagonal. Shampoo bridges the gap between full-matrix preconditioning and the diagonal version by approximating the matrices.

The Shampoo algorithm. We describe Shampoo in the context of the Online Convex Optimization (OCO) framework, which generalizes stochastic optimization (see, e.g., Shalev-Shwartz, 2012; Hazan, 2016). In OCO, learning progresses in rounds: on round $t$ the learner receives an input $X_t$ and uses the parameters $W_t$ to form a prediction denoted $\hat{y}_t$. After making the prediction, the true outcome $y_t$ is revealed. The discrepancy between the true and predicted outcomes is assessed by a loss function $\ell$ which takes values in $\mathbb{R}_+$. The learner then uses the discrepancy to update the parameters to $W_{t+1}$ and prepare for the next round. For instance, the input on round $t$ can be an example $x_t \in \mathbb{R}^n$ for which the learner predicts $\hat{y} = f(W_t, x_t)$, where $f : \mathbb{R}^{m \times n} \times \mathbb{R}^n \to \mathbb{R}$ and the loss is a function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ such as $\ell(\hat{y}, y) = (y - \hat{y})^2$ or $\ell(\hat{y}, y) = \log(1 + \exp(-y\hat{y}))$. Stochastic gradient methods use the gradient $G_t = \nabla_W \ell(f(W, x_t), y_t)$; thus $G_t \in \mathbb{R}^{m \times n}$ if the parameters are shaped as a matrix $W \in \mathbb{R}^{m \times n}$.

For matrix-shaped parameters, Shampoo tracks two statistics over the course of its run, $L_t$ and $R_t$, defined as follows:
$$L_t = I_m + \sum_{s=1}^{t} G_s G_s^{\mathsf{T}}; \qquad R_t = I_n + \sum_{s=1}^{t} G_s^{\mathsf{T}} G_s.$$
Note that $L_t \in \mathbb{R}^{m \times m}$ and $R_t \in \mathbb{R}^{n \times n}$. These are used to precondition the gradient and update $W$:
$$W_{t+1} = W_t - \eta \, L_t^{-1/4} G_t R_t^{-1/4}.$$
The primary complexity of Shampoo arises from the computation of $L_t^{-1/4}$ and $R_t^{-1/4}$, which was naively implemented using spectral decompositions (i.e., SVD).

3. Full-Matrix Preconditioning: Challenges

We discuss the main challenges and design choices in the development of the distributed implementation of Shampoo. These largely arose from the fact that modern accelerators are highly optimized for training using first-order optimizers, which have low computational and memory requirements. The Shampoo algorithm is computationally expensive; the extra computation compared to standard first-order methods lies in the following steps:

• Preconditioner statistics computation: $L_t = L_{t-1} + G_t G_t^{\mathsf{T}}$ and $R_t = R_{t-1} + G_t^{\mathsf{T}} G_t$;
• Inverse p'th root computation: $L_t^{-1/4}$ and $R_t^{-1/4}$;
• Preconditioned gradient computation: $L_t^{-1/4} G_t R_t^{-1/4}$.

Preconditioner statistics and preconditioned gradient computations are expensive for large fully-connected as well as embedding layers; we address these below. For other layers we show in Section 5 that they do not add significantly to the runtime of each step. Computing the inverse p'th roots is very slow (as much as 100 times the step time in some cases), and performing these computations without slowing down training was a key challenge in our system.
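To make the update concrete, the following is a minimal single-step sketch in NumPy for a matrix-shaped parameter, using the naive eigendecomposition-based inverse fourth root mentioned above; the function and variable names are ours for illustration, not from the actual implementation.

```python
import numpy as np

def inv_pth_root(mat, p, eps=1e-6):
    """Inverse p-th root of a symmetric PSD matrix via its
    eigendecomposition -- the naive spectral approach noted above."""
    w, v = np.linalg.eigh(mat)
    return (v * np.maximum(w, eps) ** (-1.0 / p)) @ v.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo update: accumulate the two statistics, then
    precondition the gradient on both sides with inverse 4th roots."""
    L = L + G @ G.T                      # L_t = L_{t-1} + G_t G_t^T  (m x m)
    R = R + G.T @ G                      # R_t = R_{t-1} + G_t^T G_t  (n x n)
    W = W - lr * inv_pth_root(L, 4) @ G @ inv_pth_root(R, 4)
    return W, L, R

m, n = 4, 3
rng = np.random.default_rng(0)
W, L, R = np.zeros((m, n)), np.eye(m), np.eye(n)   # L_0 = I_m, R_0 = I_n
for _ in range(5):                                  # a few toy rounds
    G = rng.normal(size=(m, n))
    W, L, R = shampoo_step(W, G, L, R)
```

Even in this toy form, the cost structure is visible: the two statistics updates and the two-sided preconditioning are cubic in the layer dimensions, and the eigendecompositions dominate, which is what the rest of this section sets out to fix.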

3.1. Algorithmic challenges

Modern ML architectures often use very large embedding layers, whose longer dimension can be in the millions. For example, DLRM (Naumov et al., 2019) on Criteo-1Tb uses a vocabulary with ~186 million hash buckets, while in Transformer models (Shazeer et al., 2018) the largest layer can have up to 65536 units per dimension. This makes preconditioning impossible due to $O(d^2)$ memory and $O(d^3)$ computational complexity. We show how to extend Shampoo to overcome these problems; we provide proofs and convergence results in Appendix B.

Large layers. For embedding layers specifically, we extend the Shampoo algorithm to allow the use of only one of the preconditioners, in case both preconditioners are too expensive to compute. Our choice is empirically supported by the experiments shown in Figs. 2b, 3a and 5a, which suggest that there is a benefit from preconditioning one dimension of the large softmax and embedding layers with a minimal increase in time. The following result allows us to choose a subset of the preconditioners:

Lemma 1. Let $G_1, \ldots, G_t \in \mathbb{R}^{m \times n}$ be matrices of rank at most $r$. Let $g_s = \mathrm{vec}(G_s)$ and define $\widehat{H}_t = I_{mn} + \sum_{s=1}^{t} g_s g_s^{\mathsf{T}}$. Let $L_t, R_t$ be defined as above: $L_t = I_m + \sum_{s=1}^{t} G_s G_s^{\mathsf{T}}$, $R_t = I_n + \sum_{s=1}^{t} G_s^{\mathsf{T}} G_s$. Then for any $p, q > 0$ such that $1/p + 1/q = 1$, we have $\widehat{H}_t \preceq r \, L_t^{1/p} \otimes R_t^{1/q}$.

A consequence is that for any $p, q > 0$ such that $1/p + 1/q = 1$, the full AdaGrad preconditioned gradient $\widehat{H}_t^{-1/2} g_t$ is approximated by $(L_t^{1/p} \otimes R_t^{1/q})^{-1/2} g_t$, giving us $\widetilde{G}_t = L_t^{-1/2p} G_t R_t^{-1/2q}$. Now, by choosing $(p, q) = (1, \infty)$ and $(p, q) = (\infty, 1)$ we obtain the simple preconditioned gradients $L_t^{-1/2} G_t$ and $G_t R_t^{-1/2}$. Theorem 3 shows that Lemma 1 can be used to prove a regret bound for this extended Shampoo in the online convex optimization setting, providing intuitive justification for the usefulness of this approximation.
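Lemma 1's Loewner-order claim can be sanity-checked numerically. Below is a small NumPy illustration (our own, for intuition only) for the symmetric choice $p = q = 2$, verifying that $r \, L_t^{1/2} \otimes R_t^{1/2} - \widehat{H}_t$ is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, T = 3, 4, 5
r = min(m, n)                       # rank bound on each G_s

Gs = [rng.normal(size=(m, n)) for _ in range(T)]
gs = [G.reshape(-1) for G in Gs]    # vec() stacks rows, matching numpy's row-major flatten

H = np.eye(m * n) + sum(np.outer(g, g) for g in gs)   # full AdaGrad statistic
L = np.eye(m) + sum(G @ G.T for G in Gs)
R = np.eye(n) + sum(G.T @ G for G in Gs)

def mat_power(A, alpha):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(A)
    return (v * w ** alpha) @ v.T

# Lemma 1 with p = q = 2:  H_t  <=  r * L_t^{1/2} (x) R_t^{1/2}  in the Loewner order,
# i.e. the difference below must be PSD (min eigenvalue >= 0 up to float error).
gap = r * np.kron(mat_power(L, 0.5), mat_power(R, 0.5)) - H
assert np.linalg.eigvalsh(gap).min() >= -1e-8
```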
We further optimize the computation of these preconditioned gradients for embedding layers by taking advantage of the sparse inputs; see details in Appendix D.

Preconditioning blocks from large tensors. In addition to embedding layers, large models occasionally have large fully-connected layers. To reduce the computational cost of computing statistics and preconditioned gradients, we divide the tensor into blocks and treat each block as a separate tensor. Concretely, this entails dividing a tensor $W \in \mathbb{R}^{km \times kn}$ into blocks $W_{1,1}, \ldots, W_{m,n}$ such that $W_{i,j} \in \mathbb{R}^{k \times k}$ for all $i, j$. Shampoo still converges in this case in the convex setting (Theorem 4), showing that the extension is justified.

Lemma 2. Assume that $g_1, \ldots, g_t \in \mathbb{R}^{mk}$ are vectors, and let $g_i = [g_{i,1}, \ldots, g_{i,k}]$ where $g_{i,j} \in \mathbb{R}^m$. Define $\widehat{H}_t = I_{mk} + \sum_{s=1}^{t} g_s g_s^{\mathsf{T}}$, and let $B_t \in \mathbb{R}^{mk \times mk}$ be the block-diagonal matrix with $k$ blocks of size $m \times m$, where the $j$-th block is $B_t^{(j)} = I_m + \sum_{s=1}^{t} g_{s,j} g_{s,j}^{\mathsf{T}}$. Then $\widehat{H}_t \preceq k B_t$.

We performed experiments to study the effect of partitioning intermediate layers into blocks, and observed that partitioning had minimal impact on the quality of the solution while providing faster step times and reduced memory overheads; see Fig. 3b.

Delayed preconditioners. As remarked above, computing the preconditioners is the most expensive computation in every Shampoo step. In Fig. 3c we show that we can compute the preconditioners once every few hundred steps without a significant effect on accuracy, which indicates that the loss-function landscape does not change significantly with each step. We observe a performance/quality tradeoff here; in our experiments we set the frequency of computing preconditioners to the smallest value that does not degrade performance, i.e., the number of training steps that can be completed in the amount of time needed to compute the largest preconditioner. The only way to increase the frequency of computing preconditioners is with better hardware/software support.
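The block-partitioning scheme described above is easy to sketch. The helper below (our own illustration, not the paper's code) splits a $km \times kn$ matrix into $k \times k$ blocks, each of which gets its own small pair of statistics, and compares the resulting preconditioner storage against unblocked Shampoo for the fully-connected layer shape used in Section 5.1.

```python
import numpy as np

def blocks(W, k):
    """Split W of shape (k*mb, k*nb) into a list of k x k blocks,
    each treated as an independent Shampoo parameter."""
    km, kn = W.shape
    assert km % k == 0 and kn % k == 0
    return [W[i:i + k, j:j + k]
            for i in range(0, km, k) for j in range(0, kn, k)]

k = 128
W = np.zeros((512, 2048))            # a fully-connected layer as in Section 5.1
parts = blocks(W, k)                 # 4 x 16 grid of 128 x 128 blocks

# Preconditioner storage: unblocked Shampoo keeps one 512x512 L and one
# 2048x2048 R; the blocked version keeps a pair of k x k statistics per block.
unblocked = 512 ** 2 + 2048 ** 2
blocked = len(parts) * 2 * k ** 2
assert blocked < unblocked
```

The inverse-root cost shrinks even more sharply than storage, since it is cubic in the statistic dimension, and the per-block computations are embarrassingly parallel.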

3.2. Numerical challenges

Inverse p'th roots (where typically $p = 2, 4, 8$) can be computed using SVD, but there are also efficient iterative algorithms, such as the coupled Newton iteration (Guo & Higham, 2006; Iannazzo, 2006), that compute the inverse p'th root via a sequence of matrix-vector and matrix-matrix products, which are highly optimized on modern accelerators. However, our experiments suggest that on real workloads the condition numbers of the $L_t, R_t$ matrices are very large (see Fig. 6 in Appendix E), so both SVD and the coupled iteration must be run in double precision, which is very expensive on accelerators. We applied several further optimizations to speed up the coupled Newton iteration in our implementation; these are described in Appendix E.
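As a concrete illustration, here is a minimal NumPy sketch of the coupled Newton iteration for $A^{-1/p}$. The scaling and damping choices are our own simplifications; the production version with its additional optimizations is the subject of Appendix E.

```python
import numpy as np

def coupled_newton_inv_pth_root(A, p, num_iters=100, ridge=1e-6, tol=1e-12):
    """Coupled Newton iteration for the inverse p-th root of a symmetric
    PSD matrix (Guo & Higham, 2006), using only matrix products.
    Invariant: M_k = X_k^p A, so when M_k -> I we get X_k -> A^{-1/p}."""
    dim = A.shape[0]
    I = np.eye(dim)
    A = A.astype(np.float64) + ridge * I     # run in double precision, as required
    z = 1.0 / np.trace(A)                    # scale eigenvalues of M_0 into (0, 1]
    X, M = z ** (1.0 / p) * I, z * A         # X_0^p A = z A = M_0
    for _ in range(num_iters):
        T = ((p + 1) * I - M) / p
        X = X @ T                            # matrix-matrix products only:
        M = np.linalg.matrix_power(T, p) @ M # well suited to accelerators
        if np.max(np.abs(M - I)) < tol:      # converged: M ~ I, so X ~ A^{-1/p}
            break
    return X
```

Because the statistics are badly conditioned in practice, convergence is slowest for the smallest eigenvalues; the ridge term and double-precision arithmetic are what keep the iteration stable.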

3.3. Infrastructural challenges

Heterogeneous training hardware. Neural network accelerators are custom-designed to run machine learning workloads faster and at lower cost. Accelerator design is trending towards lower-precision (8-bit/16-bit) arithmetic, which satisfies both of these goals on existing benchmarks. Our method demands double-precision arithmetic, as described above, which makes running this computation on accelerators a non-starter; we therefore designed the system to leverage the existing, underutilized CPUs attached to the accelerators (Section 4).

API inflexibility.

Deep learning libraries such as TensorFlow (Abadi et al., 2016) offer APIs for optimizer implementation that are well suited to first-order optimizers and mini-batch training. Our design requires that we interact with the training loop in non-standard ways, which requires framework-level changes. Our Transformer experiments were carried out in the Lingvo (Shen et al., 2019) TensorFlow framework, while BERT-Large, DLRM, and ResNet-50 used the MLPerf v0.7 TensorFlow baselines (Mattson et al., 2019). Experimentation required changes to the training loop, such as gathering statistics at regular intervals, distributing computation across all the CPUs available in the cluster without blocking the TPU training, and updating the preconditioners. We anticipate that this proof of concept for full-matrix preconditioning will encourage the development of more flexible APIs to fully utilize heterogeneous hardware.

4. Distributed System Design

We present the distributed system design of our modified Shampoo algorithm. Our method is designed to run effectively on modern neural network accelerators such as TPUs (Jouppi et al., 2017) or GPUs. We first describe the standard paradigm of data parallelism used in training models on these accelerators (Dean et al., 2012). Parameters are replicated on each core of the accelerator, and each core computes forward and backward propagation on a sub-batch (a subset of a mini-batch, which itself is a small randomly selected subset of the training set) of input examples. The gradients are averaged across all cores via all-reduction to obtain the average gradient for the mini-batch, and each core uses this average gradient to update its copy of the parameters. All-reduction adds a barrier, as all the cores need to synchronize to compute the mini-batch gradient. In Fig. 2b we measure the overhead of each of these steps on a Transformer model (Vaswani et al., 2017) described in the experiments section; we observe that the overheads from all-reduction and weight updates are a minor part (< 5%) of the overall step time.

The overall design of our implementation is illustrated by the timeline in Fig. 1. As discussed in the previous section, the preconditioner computation (inverse p'th root) is expensive and requires double precision, but it needs to run only once every few hundred steps. These observations naturally suggest using the often underutilized CPUs on the machines to which the accelerators, such as GPUs or Cloud TPUs, are attached. CPUs offer double-precision arithmetic, and although they are slower than the accelerators, this makes them a perfect match for the preconditioner computation: the computation is pipelined and runs asynchronously without blocking the training loop, so it adds no extra cost to the training run. Preconditioners need to be computed for every layer of the network, so we distribute the computation across all the CPUs that are part of the training system. As a result, the most expensive step in Shampoo adds almost nothing to the overall training time. Moreover, the computational overhead of the preconditioned gradient computation is independent of the batch size; thus, increasing the batch size allows us to linearly decrease the relative overhead, making Shampoo practical for very large-scale training setups. On smaller problems such as CIFAR-10, we find that our design still yields training-time improvements (Appendix G.3), as preconditioner computations take very little time.
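The pipelined CPU/accelerator split can be mimicked in miniature with a thread pool: the training loop always uses the most recently finished inverse root (possibly a few hundred steps stale) and never blocks on the CPU-side computation. This is a schematic of the design only, not the actual implementation; all names are ours.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def inv_fourth_root(mat):
    """Inverse 4th root in float64, standing in for the CPU-side job."""
    w, v = np.linalg.eigh(mat.astype(np.float64))
    return (v * np.maximum(w, 1e-6) ** -0.25) @ v.T

class AsyncPreconditioner:
    """Kick off an inverse-root computation every `interval` steps and
    swap in the result whenever it is ready; training never waits."""
    def __init__(self, dim, interval=100):
        self.pool = ThreadPoolExecutor(max_workers=1)  # stands in for the CPU hosts
        self.interval = interval
        self.current = np.eye(dim)                     # stale-but-valid preconditioner
        self.pending = None

    def get(self, step, statistic):
        if self.pending is not None and self.pending.done():
            self.current = self.pending.result()       # pipelined swap-in
            self.pending = None
        if step % self.interval == 0 and self.pending is None:
            self.pending = self.pool.submit(inv_fourth_root, statistic.copy())
        return self.current                            # never blocks the step

pre = AsyncPreconditioner(dim=3, interval=10)
L = np.diag([1.0, 16.0, 81.0])                         # a toy left statistic
for step in range(1, 100):
    P = pre.get(step, L)                               # used as L^{-1/4} in the update
pre.pool.shutdown(wait=True)
```

In the real system the "pool" is the set of CPU hosts attached to the accelerators, one job is dispatched per layer, and the swap happens through the framework's variable-update mechanism rather than a Python future.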

5. Experiments

We compare our method against widespread optimization algorithms for training large state-of-the-art deep models for machine translation, language modeling, recommendation systems, and image classification. Details of the experiments are given in Appendix G, and we will open-source our code before publication.

5.1. Machine Translation with a Transformer

We demonstrate the effectiveness of our implementation on the standard machine translation dataset from WMT'14 English to French (en→fr) with 36.3M sentence pairs. We used the state-of-the-art Transformer architecture (Vaswani et al., 2017). This architecture contains 93.3M parameters and consists of 6 layers for its encoder and decoder. Each layer is composed of 512 model dimensions, 2048 hidden dimensions, and 8 attention heads. The model makes use of a sub-word vocabulary that contains 32K word pieces (Schuster & Nakajima, 2012). The experiment was run on 32 cores of a Cloud TPU v3 Pod, and the implementation of the optimizer was carried out in Lingvo (Shen et al., 2019), a sequence-to-sequence modeling framework built on TensorFlow. Our results are shown in Fig. 2a: our algorithm achieves the same accuracy as AdaGrad or Adam in about half as many steps.

Preconditioning of embedding and softmax layers. Following the first extension in Section 3.1, the algorithm preconditions the large layers with only one of the preconditioners ($G_t R_t^{-1/2}$ or $L_t^{-1/2} G_t$) to make them tractable. Fig. 2b shows that the increase in step time is only 6%, while Fig. 3a shows that we can reduce the number of steps to convergence by ≈20%.

Reducing overhead in fully-connected layers. Following the second extension in Section 3.1, we ran two experiments in which we partitioned the fully-connected layers of size [512, 2048] into two blocks of size [512, 1024] and four blocks of size [512, 512]. Our experiments show no drop in quality under this approximation, with a small reduction in runtime (< 3%).

5.2. Transformer-Big

We also ran experiments with a larger Transformer model with 375.4M parameters, consisting of 6 layers for its encoder and decoder. Each layer is composed of 1024 model dimensions, 8192 hidden dimensions, and 16 attention heads. Results are presented in Fig. 4a, where again we see an improvement in the end-to-end wall-clock time. For the softmax, embedding and projection fully-connected layers (with 8192 hidden dimensions) we only make use of the left preconditioner. We note that the step time is dominated by the preconditioned gradient computation, which can be reduced by sub-blocking the layers.

On the overhead of the optimizer. We show the computational and memory complexity of the Shampoo extensions described in Section 3.1 in Table 2 in the appendix. We note that the overhead from computing the statistics, as well as from computing the preconditioned update for a single training step, can be further reduced by increasing the batch size (indeed, these overheads are independent of the batch size), as shown in Fig. 4b, where the overhead drops dramatically from 40% to 19%.

5.3. Ads Click-Through Rate (CTR) Prediction

We trained the Deep Learning Recommendations Model (DLRM) of Naumov et al. (2019) on the terabyte Criteo click logs dataset for the online advertisement click-through rate prediction task (Criteo Labs, 2015). We compared Shampoo against the highly tuned SOTA baseline from the MLPerf v0.7 training benchmarks (Wu et al., 2020). We trained the model with a batch size of 65536 for 64000 steps (1 epoch). We trained a version of the model where Shampoo is applied only to the hidden layers, as well as one where we apply it to all layers. We only tune the learning rate and keep the exact same setup as the baseline. We found that Shampoo achieves the target accuracy of 80.25% in only 30.97K steps, compared to 64K steps for the baseline. Moreover, Shampoo achieves a new state-of-the-art AUC of 80.56% on this task.

5.4. Language Modeling

We trained BERT-Large (the Bidirectional Encoder Representation architecture of Devlin et al., 2018) for the language modeling task on the concatenation of Wikipedia and BooksCorpus, with 2.5B and 800M words respectively. BERT-Large is a large bidirectional transformer model containing 24 transformer blocks with 1024 hidden dimensions and 16 self-attention heads. It has 340M parameters and is set up to jointly optimize two objectives: (a) masked language model (Masked-LM) loss, where the task is to predict masked tokens based on surrounding context, and (b) next sentence prediction (NSP) loss, where the task is to predict whether two given sentences are consecutive in the text. In Fig. 5b we compare our results against the current state of the art in training BERT (You et al., 2019). Models were trained with batch size 16K; in these experiments we replaced the Adam update rule in LAMB, which produces the preconditioned gradient, with Shampoo. Both experiments used the existing well-tuned hyperparameters of the baseline.

5.5. Image Classification

We trained a ResNet-50 model (He et al., 2016) on the ImageNet-2012 dataset (Russakovsky et al., 2015) and compared it against the state-of-the-art baseline using SGD with momentum. We base our experiments on the TensorFlow baseline available from Mattson et al. (2019).

6. Concluding Remarks

We have presented an implementation of a second-order optimizer and demonstrated step-time as well as wall-time improvements on multiple large tasks in different domains; in each case our implementation performed as well as or better than state-of-the-art optimizers specialized for each domain. The main point of our work is to demonstrate that second-order methods implemented on a real-world distributed setup can be used to train state-of-the-art deep models. We hope that this work will influence future hardware accelerator design and runtime software: first-order methods have received large investments in tuning, implementation, platform support and hardware tailored for them, and we believe there are several opportunities to improve the per-step time performance of second-order methods as well:

• Most second-order methods use symmetric matrices, but we have not found support for typing operands as symmetric, which could reduce compute flops and storage by up to 50%.
• Several optimizations that are currently tuned towards first-order methods could be extended to second-order methods. For example, weight-update sharding pattern-matches first-order methods (Xu et al., 2020) and dramatically reduces the time spent in the update step as well as the memory used. This change could also be applied to Shampoo with blocked preconditioners, but we do not have support for it yet, as it requires compiler-level support and is not expressible at the program layer. Currently every core must update all layers, which is quite inefficient.
• Mixed-precision algorithms may work for inverse p'th roots and could help increase the frequency of preconditioner computation.
• Increased memory per chip would allow larger preconditioners.
• Hardware support for high-precision arithmetic in accelerators would allow more frequent preconditioner computation. The benefits of high-precision arithmetic for optimization run counter to the prevailing wisdom in ML,[1] which has led to a focus on low-precision formats such as bfloat16 (Wang & Kanwar, 2019).
• Hardware support for storing/packing and using upper/lower triangular matrices efficiently, as available in LAPACK.

Our hope is that these suggestions could result in innovations that would make second-order methods practical across more domains and models, especially in data-limited regimes where we may not be able to amortize the latency added by the data transfer between the accelerator and the CPU.

[1] For example, Gupta et al. (2015) say "It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary (Bottou & Bousquet, 2007)", and Higham & Pranesh (2019) say "... machine learning provides much of the impetus for the development of half precision arithmetic in hardware ...".

A. Notation

We use lowercase letters to denote scalars and vectors, and uppercase letters to denote matrices. $\|A\|_F$ denotes the Frobenius norm of $A$, i.e., $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$. $A \bullet B$ denotes the Hadamard (element-wise) product of $A$ and $B$, which have the same shape, so $C = A \bullet B \iff C_{ij} = A_{ij} B_{ij}$. We use $\preceq$ to denote the Loewner order: given square symmetric matrices $A, B$, we write $A \preceq B$ iff $B - A$ is positive semidefinite (PSD).

Given a symmetric PSD matrix $A$ and $\alpha \in \mathbb{R}$, $A^\alpha$ is defined as follows: let $A = U D U^{\mathsf{T}}$ be the spectral decomposition of $A$, where $U$ is orthogonal and $D$ is diagonal (with $D_{ii} \ge 0$ as $A$ is PSD); then $A^\alpha = U D^\alpha U^{\mathsf{T}}$, where $D^\alpha$ is the element-wise power, $(D^\alpha)_{ii} = D_{ii}^\alpha$. If $\alpha < 0$, this is defined only for positive definite matrices, where $D_{ii} > 0$.

We use $\mathrm{vec}(A)$ to denote the flattening of the $m \times n$ matrix $A$: if $A$ has rows $a_1, \ldots, a_m$, then $\mathrm{vec}(A)$ is the $mn \times 1$ column vector $\mathrm{vec}(A) = (a_1, \ldots, a_m)^{\mathsf{T}}$. $A \otimes B$ denotes the Kronecker product of two matrices $A$ and $B$, and we will use the identities $(A \otimes B)^\alpha = A^\alpha \otimes B^\alpha$ for $\alpha \in \mathbb{R}$, and $(A \otimes B)\,\mathrm{vec}(C) = \mathrm{vec}(A C B^{\mathsf{T}})$.
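Both Kronecker identities (and the row-stacking vec convention) can be sanity-checked numerically; the snippet below is our own illustration. Note that this vec matches NumPy's default row-major flatten.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(4, 4))
C = rng.normal(size=(3, 4))

vec = lambda M: M.reshape(-1)       # stacks rows, as in the definition above

# (A (x) B) vec(C) = vec(A C B^T)
assert np.allclose(np.kron(A, B) @ vec(C), vec(A @ C @ B.T))

# (A (x) B)^alpha = A^alpha (x) B^alpha, checked for alpha = 1/2 on PSD operands
def mat_power(M, alpha):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(M)
    return (v * w ** alpha) @ v.T

P = A @ A.T + np.eye(3)             # make the operands symmetric PSD
Q = B @ B.T + np.eye(4)
assert np.allclose(mat_power(np.kron(P, Q), 0.5),
                   np.kron(mat_power(P, 0.5), mat_power(Q, 0.5)))
```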

B Deferred Proofs

Proof of Lemma 1.

Lemma 8 in Gupta et al. (2018) shows that $\hat H_t \preceq r\, L_t \otimes I_n$ and $\hat H_t \preceq r\, I_m \otimes R_t$. By using Ando's inequality (Ando et al., 2004), we get
$$\hat H_t \preceq r (L_t \otimes I_n)^{1/p} (I_m \otimes R_t)^{1/q} = r (L_t^{1/p} \otimes I_n)(I_m \otimes R_t^{1/q}) = r\, L_t^{1/p} \otimes R_t^{1/q},$$
which concludes the proof. This lemma immediately allows us to prove a regret bound for Shampoo with extended exponents:

Theorem 3. Assume that the gradients $G_1, \dots, G_T$ are matrices of rank at most $r$, and let $1/p + 1/q = 1$ with $p, q \ge 1$. Then the regret of Shampoo with extended exponents compared to any $W^{\star} \in \mathbb{R}^{m \times n}$ is bounded as follows:
$$\sum_{t=1}^{T} f_t(W_t) - \sum_{t=1}^{T} f_t(W^{\star}) \le \sqrt{2r}\, D\, \mathrm{Tr}\big(L_T^{1/2p}\big)\, \mathrm{Tr}\big(R_T^{1/2q}\big),$$
where $L_T = I_m + \sum_{t=1}^{T} G_t G_t^T$, $R_T = I_n + \sum_{t=1}^{T} G_t^T G_t$, and $D = \max_{t \in [T]} \|W_t - W^{\star}\|_2$.

Proof.

The proof follows the proof of Theorem 7 in Gupta et al. (2018). Let $H_t = L_t^{1/2p} \otimes R_t^{1/2q}$. Then the update rule of the extended Shampoo algorithm is equivalent to $w_{t+1} = w_t - \eta H_t^{-1} g_t$. Since $0 \preceq L_1 \preceq \dots \preceq L_T$ and $0 \preceq R_1 \preceq \dots \preceq R_T$, standard properties of the Kronecker product and the operator monotonicity of $x \mapsto x^{\alpha}$ for $\alpha \le 1$ (an immediate consequence of Ando's inequality) ensure that $0 \preceq H_1 \preceq \dots \preceq H_T$. Following the aforementioned proof, we have the regret bound
$$\sum_{t=1}^{T} f_t(W_t) - \sum_{t=1}^{T} f_t(W^{\star}) \le \frac{D^2}{2\eta} \mathrm{Tr}(H_T) + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_{H_t^*}^2,$$
where $D = \max_t \|W_t - W^{\star}\|_2$. Define $g_t = \mathrm{vec}(G_t)$ and $\hat H_t = (I_{mn} + \sum_{s=1}^{t} g_s g_s^T)^{1/2}$; then Lemma 1 shows that $\hat H_t \preceq \sqrt{r}\, H_t$, using operator monotonicity. Using this inequality twice, along with Equation (6) from the proof of Theorem 7, yields the theorem.

For the blocking extension, let $\tilde H_t = I_{mk} + \sum_{s=1}^{t} g_s g_s^T$, and let $B_t$ be the block-diagonal matrix with blocks $B_t^{(j)} = I_m + \sum_{s=1}^{t} g_{s,j} g_{s,j}^T$. Let $x \in \mathbb{R}^{mk}$ and write $x = [x_1, \dots, x_k]$, where $x_j \in \mathbb{R}^m$. Then
$$x^T \tilde H_t x = \|x\|_2^2 + \sum_{s=1}^{t} (g_s^T x)^2 = \|x\|_2^2 + \sum_{s=1}^{t} \Big( \sum_{j=1}^{k} g_{s,j}^T x_j \Big)^2 \le k \|x\|_2^2 + k \sum_{s=1}^{t} \sum_{j=1}^{k} (g_{s,j}^T x_j)^2 = k \sum_{j=1}^{k} x_j^T \Big( I_m + \sum_{s=1}^{t} g_{s,j} g_{s,j}^T \Big) x_j = k \sum_{j=1}^{k} x_j^T B_t^{(j)} x_j = k\, x^T B_t x.$$
Here we used the inequality $\big(\sum_{j=1}^{k} \alpha_j\big)^2 \le k \sum_{j=1}^{k} \alpha_j^2$, which follows from the convexity of $x \mapsto x^2$ (or from the fact that the variance of a random variable is non-negative). This lemma once again allows us to prove a regret bound, exactly following the proof of the regret bound above:

Theorem 4. Assume that the gradients are $g_1, \dots, g_T \in \mathbb{R}^{mk}$, and let $g_i = [g_{i,1}, \dots, g_{i,k}]$ where $g_{i,j} \in \mathbb{R}^m$. Then the regret of Shampoo with blocking compared to any $w^{\star} \in \mathbb{R}^{mk}$ is bounded as follows:
$$\sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} f_t(w^{\star}) \le \sqrt{2k}\, D \sum_{j=1}^{k} \mathrm{Tr}\Big( \big( I_m + \sum_{t=1}^{T} g_{t,j} g_{t,j}^T \big)^{1/2} \Big).$$
The two regret bounds can be combined to show that Shampoo with both extensions also converges.

C Comparison with K-FAC

K-FAC is a natural-gradient algorithm that approximates the curvature of the loss using the Fisher information matrix:
$$F = \mathbb{E}_{p(x|\theta)}\big[ \nabla \log p(x|\theta)\, \nabla \log p(x|\theta)^T \big] = \mathbb{E}_{p(x|\theta)}\big[ g_{p(x|\theta)}\, g_{p(x|\theta)}^T \big].$$
For a fully connected layer with $W \in \mathbb{R}^{m \times n}$, where $W x = s$, the gradient for the layer $G_t \in \mathbb{R}^{m \times n}$ can be written via the chain rule as $G_t = \nabla_s \ell(s_t, y_t)\, x^T$, and in vectorized form as $\nabla_s \ell(s_t, y_t) \otimes x$. We can then write the Fisher information matrix as:
$$F = \mathbb{E}_{p(x|\theta)}\big[ (\nabla_s \ell(s_t, y_t) \otimes x)(\nabla_s \ell(s_t, y_t) \otimes x)^T \big] = \mathbb{E}_{p(x|\theta)}\big[ (\nabla_s \ell(s_t, y_t) \nabla_s \ell(s_t, y_t)^T) \otimes (x_t x_t^T) \big].$$
Assuming independence between $\nabla_s \ell(s_t, y_t)$ and $x$, K-FAC rewrites the Fisher in tractable form as
$$F \approx \mathbb{E}\big[ \nabla_s \ell(s_t, y_t) \nabla_s \ell(s_t, y_t)^T \big] \otimes \mathbb{E}\big[ x_t x_t^T \big].$$
If we let $D = \mathbb{E}[\nabla_s \ell(s_t, y_t) \nabla_s \ell(s_t, y_t)^T]$ and $X = \mathbb{E}[x_t x_t^T]$, the update rule then becomes $W_{t+1} \approx W_t - \eta\, D^{-1} G_t X^{-1}$.

We note some of the differences and similarities between the two updates. K-FAC preconditioners use an exponent of $-1$ (as the original Fisher is inverted), whereas Shampoo uses $-1/2p$, where $p$ is the rank (order) of the tensor. K-FAC computes statistics based on gradients with labels sampled from the model's predictive distribution (hence requiring strictly more computation), whereas Shampoo relies on the gradient of the mini-batch. Now we can compute each term in the Shampoo preconditioners as:
$$G_t G_t^T = \nabla_s \ell(s_t, y_t)\, x_t^T x_t\, \nabla_s \ell(s_t, y_t)^T = \|x_t\|_2^2\, \nabla_s \ell(s_t, y_t) \nabla_s \ell(s_t, y_t)^T; \quad G_t^T G_t = x_t\, \nabla_s \ell(s_t, y_t)^T \nabla_s \ell(s_t, y_t)\, x_t^T = \|\nabla_s \ell(s_t, y_t)\|_2^2\, x_t x_t^T.$$
Dividing by the scale and taking expectations on both sides:
$$\mathbb{E}\Big[ \frac{G_t G_t^T}{\|x_t\|_2^2} \Big] = \mathbb{E}\big[ \nabla_s \ell(s_t, y_t) \nabla_s \ell(s_t, y_t)^T \big] = D; \quad \mathbb{E}\Big[ \frac{G_t^T G_t}{\|\nabla_s \ell(s_t, y_t)\|_2^2} \Big] = \mathbb{E}\big[ x_t x_t^T \big] = X.$$
This shows that K-FAC preconditioners are closely related to Shampoo preconditioners, especially when one uses the empirical Fisher (Kunstner et al., 2019).
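The rescaling identities relating the Shampoo statistics to the K-FAC factors are easy to verify numerically for a single example. A minimal NumPy check (shapes are our illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 7
x = rng.standard_normal((n, 1))    # layer input
ds = rng.standard_normal((m, 1))   # gradient w.r.t. pre-activations, ∇_s ℓ

G = ds @ x.T                       # layer gradient, m × n

# Shampoo statistics are rescaled K-FAC factors:
#   G G^T = ||x||² ∇_s ℓ ∇_s ℓ^T   and   G^T G = ||∇_s ℓ||² x x^T.
assert np.allclose(G @ G.T, (x.T @ x).item() * (ds @ ds.T))
assert np.allclose(G.T @ G, (ds.T @ ds).item() * (x @ x.T))
```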
The main difficulty in implementing K-FAC on a model is that current optimizer APIs make it difficult to send additional information such as $\|x_t\|_2^2$ and $\|\nabla_s \ell(s_t, y_t)\|_2^2$ to the optimizer, so K-FAC implementations have to register the structure of each layer. Moreover, due to the dependence of K-FAC on the structure of the network, it is difficult to implement standard operators like batch norm, weight norm, layer norm, etc., which are prevalent in the tasks and models we considered. For example, if we write a fully connected layer with weight norm as $s = W x / \|W\|$, then the gradient is
$$G_t = \frac{1}{\|W\|} \nabla_s \ell(s_t, y_t)\, x^T - \frac{\nabla_s \ell(s_t, y_t)^T W x}{\|W\|^3}\, W,$$
so rewriting $\mathbb{E}[\mathrm{vec}(G_t)\, \mathrm{vec}(G_t)^T]$ as a Kronecker product is not an easy task. The similarity between K-FAC and Shampoo preconditioners also allows us to use techniques explored by the K-FAC community for Shampoo. One of the extensions of K-FAC is the E-KFAC algorithm (George et al., 2018), which constructs a better approximation of the Fisher matrix by using the eigenbasis computed from the Kronecker approximation, but rescaling the eigenvalues to match the diagonal of the Fisher matrix in this eigenbasis. This method produces a provably better approximation, and can immediately be applied to Shampoo too with a simple modification. Let $\hat H_t \approx L_t^{1/2} \otimes R_t^{1/2}$, and let the singular value decompositions of the factors be $L_t^{1/2} = U D U^T$ and $R_t^{1/2} = V D' V^T$. Then $L_t^{1/2} \otimes R_t^{1/2} = (U \otimes V)(D \otimes D')(U \otimes V)^T$. The E-KFAC correction replaces $D \otimes D'$ by the optimal diagonal
$$\Lambda = \mathrm{diag}\big( (U \otimes V)^T \hat H_t (U \otimes V) \big) = I + \sum_{s=1}^{t} \mathrm{diag}\big( (U \otimes V)^T \mathrm{vec}(G_s)\, \mathrm{vec}(G_s)^T (U \otimes V) \big) = I + \sum_{s=1}^{t} \mathrm{diag}\big( \mathrm{vec}(U^T G_s V)\, \mathrm{vec}(U^T G_s V)^T \big) = I + \sum_{s=1}^{t} \mathrm{vec}(U^T G_s V)^2.$$
Thus we can approximately compute $\Lambda_{t+1} \approx \Lambda_t + (U^T G_t V)^2$, and the new update becomes
$$W_{t+1} = W_t - \eta_t\, U \big( \Lambda_t^{-1/2} \bullet (U^T G_t V) \big) V^T.$$
This technique does have the disadvantage that it requires computing the singular value decompositions (which we already observed are much slower than coupled Newton iterations), and doubles the number of matrix multiplications in the preconditioned gradient computation. At this time our experiments did not show significant improvements over the standard Shampoo implementation, but we plan to explore this further.
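A minimal sketch of this E-KFAC-style correction for Shampoo, under our own naming and with the statistics formed from an explicit gradient history (illustrative only, not an efficient implementation):

```python
import numpy as np

def ekfac_style_update(W, grads, eta=0.1, eps=1.0):
    """Sketch: diagonal rescaling in the Kronecker eigenbasis, as described above.
    `grads` is the list of gradients G_1..G_t seen so far; names are ours."""
    m, n = W.shape
    L = eps * np.eye(m) + sum(G @ G.T for G in grads)
    R = eps * np.eye(n) + sum(G.T @ G for G in grads)
    # Eigenbases of the Kronecker factors (same eigenvectors as L^{1/2}, R^{1/2}).
    _, U = np.linalg.eigh(L)
    _, V = np.linalg.eigh(R)
    # Optimal diagonal in this basis: Λ = I + Σ_s vec(U^T G_s V)², kept as m × n.
    Lam = np.ones((m, n)) + sum((U.T @ G @ V) ** 2 for G in grads)
    G_t = grads[-1]
    return W - eta * U @ (Lam ** -0.5 * (U.T @ G_t @ V)) @ V.T
```

In practice $U$, $V$, and $\Lambda$ would be maintained incrementally rather than recomputed from the full history.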

D Sparse embedding layers

In modern networks, embedding layers are usually very large, and even computing the left preconditioner as described in Section 3.1 can be prohibitively expensive. However, we can take advantage of the fact that the inputs to the network are very sparse, and use this to reduce the computation significantly. Let an input example to such a network consist of a set of categorical features: each feature, such as user language or user country, takes one out of a set of options, and the output of the embedding layer is the concatenation of the embeddings for each such feature. If the embeddings are of width $d$ and there are $N$ such embeddings, then the embedding layer is $W \in \mathbb{R}^{d \times N}$. The input can be represented as $x \in \mathbb{R}^{N \times m}$, where $m$ is the number of categorical features, and each column is one-hot: if the $k$-th feature is $x(k)$, then $x_{jk} = \delta_{j, x(k)}$. The output of the layer is $y = W x$. Now $G = \nabla_W \ell = \nabla_y \ell\, x^T$, so $G G^T = \nabla_y \ell\, x^T x\, \nabla_y \ell^T$. But $x^T x = I_m$, so $G G^T = \nabla_y \ell\, \nabla_y \ell^T$. Thus we can compute the preconditioner for $W$ by computing it on the output of the embedding layer; since $y$ is of dimension $d \times m$, this computation is $O(d^2 m)$ rather than $O(d^2 N)$. Note that sparse multiplication would also be $O(d^2 m)$, but accelerators usually implement sparse operations by densifying the tensors. If each column of $x$ is multi-hot, as is the case when the features are words and their embeddings are averaged, $x^T x$ is a diagonal matrix whose diagonal entries are functions of the number of ones in each column of $x$; computing $G G^T = \nabla_y \ell\, (x^T x)\, \nabla_y \ell^T$ is still $O(d^2 m) \ll O(d^2 N)$.

E A coupled Newton iteration for inverse p-th roots

The Newton method for solving the matrix equation $X^{-p} - A = 0$ produces the iteration
$$X_{k+1} = \frac{1}{p}\big[ (p+1) X_k - X_k^{p+1} A \big],$$
where we take $X_0 = \frac{1}{c} I$. This iteration satisfies $X_k \to A^{-1/p}$ as $k \to \infty$, but it is not numerically stable.
Introducing the matrix $M_k = X_k^p A$, we get
$$X_{k+1} = X_k\, \frac{(p+1) I - M_k}{p}, \quad X_0 = \frac{1}{c} I,$$
and
$$M_{k+1} = X_{k+1}^p A = \Big( \frac{(p+1) I - M_k}{p} \Big)^p X_k^p A = \Big( \frac{(p+1) I - M_k}{p} \Big)^p M_k, \quad M_0 = \frac{1}{c^p} A,$$
since $X_k$, $M_k$ and $A$ commute with each other. This is the coupled Newton iteration for computing inverse p-th roots, and was shown to be numerically stable in (Guo & Higham, 2006; Iannazzo, 2006). We implemented the following optimizations to the coupled Newton iteration method:
• Warm start: The coupled Newton iteration to compute $G^{-1/p}$ starts with $X = I$, $M = G$ and maintains the invariant $M = X^p G$ while driving $M \to I$, resulting in $X \to G^{-1/p}$. We need to find the p-th roots of a sequence $G_t$, so we instead set $X = G_t^{-1/p}$, $M = X^p G_{t+1}$; since the difference between $G_t$ and $G_{t+1}$ is small, this ensures that $M$ is already close to $I$. In our experiments, warm start improves convergence (by up to 4x fewer steps).
• Projecting top singular values: In practice our $G_t$ matrices have large condition numbers, which sometimes leads to inaccurate results. As a rule of thumb, computing $G^{-1/p}$ leads to a loss of $\log_2(\frac{1}{p} \kappa(G))$ bits of precision (Overton, 2001), where $\kappa(G)$ is the condition number of $G$. However, we also find that there is usually a sharp falloff within the first few singular values, so in order to reduce the condition number, we project away the largest singular values, since these are the least important after taking inverses. We find the top-k singular values $\lambda_1, \dots, \lambda_k$ and their associated singular vectors using a standard iterative method, and replace each with $\lambda_{k+1}$. This produces a better approximation than adding $\epsilon I$ to each $G_t$: the latter can wash out the smallest (and most crucial) singular values; see Fig. 7, where the smallest singular value for a layer can be as small as $10^{-10}$ to $10^{-6}$ during the course of optimization.
• Dynamic tuning of projection: We dynamically tune the number of singular values we need to project in the previous step, by computing the condition number $\kappa(G_t)$ and using it to estimate the smallest singular value of $G_{t+1}$ as $\lambda_{\max}(G_{t+1}) / \kappa(G_t)$. We then keep projecting out singular values of $G_{t+1}$ until we get an acceptable condition number.
• Correcting for projection: If $G_t = \sum_i \lambda_i v_i v_i^T$, then $G_t^{-1/p} = \sum_i \lambda_i^{-1/p} v_i v_i^T$. Projection above means replacing $\lambda_1, \dots, \lambda_k$ by $\lambda_{k+1}$, but since we have already computed the corresponding $v_1, \dots, v_k$, we correct the approximate p-th root by adding $\sum_{i=1}^{k} (\lambda_i^{-1/p} - \lambda_{k+1}^{-1/p}) v_i v_i^T$. This is a small effect, but adding it is a straightforward modification (details deferred to Appendix E).

Algorithm II: Sketch of the Shampoo algorithm
1: parameters: learning rate $\eta_t$, momentum $\beta_1$, $\beta_2$
2: for $t = 1, \dots, T$ do
3:   Receive stochastic gradients $G_t$ for each layer
4:   if $t \% \tau_2 = 0$ then
5:     if $\beta_2 < 1$ then
6:       $L_t \leftarrow \beta_2 L_{t-\tau_2} + (1 - \beta_2)\, G_t G_t^T$
7:       $R_t \leftarrow \beta_2 R_{t-\tau_2} + (1 - \beta_2)\, G_t^T G_t$
8:     else
9:       $L_t \leftarrow L_{t-\tau_2} + G_t G_t^T$
10:      $R_t \leftarrow R_{t-\tau_2} + G_t^T G_t$
11:  $D_t \leftarrow D_{t-1} + G_t \bullet G_t$
12:  $M_t \leftarrow \beta_1 M_{t-1} + (1 - \beta_1)\, D_t^{-1/2} \bullet G_t$
13:  if $t \% \tau_1 = 0$ then
14:    Gather preconditioners $L_{t-\tau_1}^{-1/4}$, $R_{t-\tau_1}^{-1/4}$ from CPUs
15:    Send $L_t$, $R_t$ to CPU host to compute $L_t^{-1/4}$, $R_t^{-1/4}$
16:  if $t > \tau_1$ then
17:    $P_t \leftarrow \beta_1 P_{t-1} + (1 - \beta_1)\, L_t^{-1/4} G_t R_t^{-1/4}$
18:    $\eta_t \leftarrow \eta_0\, \|M_t\|_F / \|P_t\|_F$
19:    $W_t \leftarrow W_{t-1} - \eta_t P_t$
20:  else
21:    $\eta_t \leftarrow \eta_0$
22:    $W_t \leftarrow W_{t-1} - \eta_t M_t$

G Further Details of Experiments

Layer-wise learning rates. As seen in Fig. 7, the step-size scale for each layer is dependent on the operator norm of the preconditioners (the inverse p-th root of the smallest singular value of the statistics matrix). This procedure, termed grafting in (Agarwal et al., 2020), allows us to bootstrap a reasonable learning rate schedule for a specific problem that is well tuned, and to study the effect of preconditioned gradient directions in isolation. The weight matrix $W_t$ is updated as $W_t = W_{t-1} - A_t \hat S_t$, where:
$$D_t = \sum_{s=1}^{t} G_s \bullet G_s; \quad A_t = \eta_0 \big\| D_t^{-1/2} \bullet G_t \big\|_F \ \text{(AdaGrad magnitude)}; \quad \hat S_t = \frac{L_t^{-1/4} G_t R_t^{-1/4}}{\big\| L_t^{-1/4} G_t R_t^{-1/4} \big\|_F} \ \text{(Shampoo direction)}.$$

Table 2: Computational and memory complexity of variants of Shampoo.
  Variant | Computation | Memory
  All preconditioners, $W_t$: $[n, m]$ | $O(n^2 m + m^2 n)$ | $O(n^2 + m^2)$
  Left-only preconditioner, $W_t$: $[n, m]$ | $O(n^2 m)$ | $O(n^2)$
  Block-diagonal preconditioner, block size $b$ | $O(mnb)$ | $O(mn)$
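As a concrete illustration of the coupled iteration with warm start described in Appendix E, here is a minimal NumPy sketch (the initialization $X_0 = \lambda_{\max}^{-1/p} I$ and all names are our choices, not the production implementation, which runs in 64-bit on the host CPU):

```python
import numpy as np

def coupled_newton_inverse_pth_root(A, p, X0=None, num_iters=100, tol=1e-12):
    """Coupled Newton iteration for A^{-1/p}, A symmetric positive definite.
    `X0` allows warm-starting from the root of a nearby matrix."""
    n = A.shape[0]
    I = np.eye(n)
    if X0 is None:
        lam_max = np.linalg.norm(A, 2)          # largest singular value of A
        X0 = (1.0 / lam_max) ** (1.0 / p) * I   # puts the spectrum of M_0 in (0, 1]
    X = X0
    M = np.linalg.matrix_power(X, p) @ A        # invariant: M = X^p A
    for _ in range(num_iters):
        if np.linalg.norm(M - I) < tol:
            break
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M    # assumes X and A (nearly) commute
    return X

# Inverse square root (p = 2) of a well-conditioned PSD matrix.
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)
X = coupled_newton_inverse_pth_root(A, p=2)
assert np.allclose(X @ X @ A, np.eye(6), atol=1e-6)

# Warm start from the previous root when the statistics change slightly.
A2 = A + 0.01 * np.eye(6)
X2 = coupled_newton_inverse_pth_root(A2, p=2, X0=X)
assert np.allclose(X2 @ X2 @ A2, np.eye(6), atol=1e-6)
```

With the warm start, $M$ begins close to $I$, so far fewer iterations are needed than from the cold start.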

G.1 Transformer on WMT'14 en→fr

For all optimizers, we make use of a warmup schedule where the learning rate is increased from 0.0 to $\eta$ over 40k steps. For the smaller Transformer experiments we use a quadratic warmup, and for the larger Transformer experiments we use a linear warmup; we found that quadratic warmup improves all optimizers equally and provides a better log-perplexity. For the Adam optimizer experiments, we use a learning rate decay schedule of the form $\eta_t = \eta \sqrt{d/t}$, following the suggestion of Vaswani et al. (2017). For the smaller Transformer experiments, we tuned the hyperparameters for each algorithm over 100 trials. We took the best settings for the momentum and second-moment parameters, and tuned the learning rates until either the model became unstable or performance did not increase. For Shampoo, we used a per-layer learning rate derived from AdaGrad (see Appendix G for details), and found that for the exact same hyperparameter settings as AdaGrad, Shampoo provides a modest improvement in performance. Moreover, Shampoo allows for larger learning rates than AdaGrad does, as shown in Fig. 4a.
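The warmup-then-decay schedule above can be sketched as follows (function and parameter names are ours; the `d_model = 512` default is only illustrative):

```python
import math

def transformer_lr(step, eta, warmup_steps=40_000, d_model=512,
                   quadratic_warmup=True):
    """Sketch of the schedule described above: ramp from 0 to `eta` over
    `warmup_steps` (linearly or quadratically), then decay as eta*sqrt(d/t)."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return eta * (frac ** 2 if quadratic_warmup else frac)
    return eta * math.sqrt(d_model / step)
```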

G.2 Shampoo on BERT-Large

Our current implementation showed a 14% increase in step time for BERT-Large, nearly wiping out all the gains from the reduced number of steps (16%). We note that, due to the amount of resources it would require to tune BERT, we used Shampoo with the exact same hyperparameters as LAMB, with grafting, to understand the effect of the preconditioner. Moreover, step time can be optimized considerably, as the current implementation is not heavily optimized. For example, larger batch sizes help amortize the preconditioning overhead and reduce the overall wall time to reach the same accuracy. Furthermore, in our current implementation, all TPU cores compute all the preconditioning statistics and the preconditioned gradients, which involves over a hundred 1024 × 1024 matrix multiplications. This repeated work can be avoided by cross-replica sharding of the weight update (Xu et al., 2020), which distributes this computation across cores and should save at least half of the step-time overhead.

G.3 CIFAR-10

We train a ResNet-50 model on CIFAR-10 (Krizhevsky et al., 2009) with 2 cores of Cloud TPU-v2 at batch size 2048. Our baseline achieves 93.45% accuracy at 300 epochs, whereas Shampoo reaches the same accuracy in 143 epochs. We see an overall training time reduction of 42% (1428 seconds to 827 seconds). As this is a smaller problem, the time taken for the preconditioner inverse computation of the largest preconditioning matrix is less than 1 ms on the CPU. We use a total of 8 CPU cores to run these inverses.

G.4 ImageNet

For SGD with momentum, the learning rate is warmed up over the first 5 epochs from 0 to 1.6, followed by 10x drops of the learning rate at 30, 60 and 80 epochs. For LARS, we warm up the learning rate over 20 epochs for the 4K and 16K batch sizes, and over 25 epochs for the 32K batch size, with polynomial decay (p=2) until the end of training. For Shampoo, we use the same layer-wise heuristics and hyperparameters as LARS, with grafting such that the direction is changed to the one computed by Shampoo. We make use of weight decay with value $\lambda_2 = 2 \times 10^{-4}$ and label smoothing of $10^{-1}$.
• Forward pass: Each core independently computes the predictions for each training example in its sub-batch.
• Gradient: The gradient for the sub-batch is computed using the back-propagation algorithm.
• All-reduce: The gradients for the sub-batches from all cores are averaged to compute the gradient for the mini-batch, which is then sent back to each core.
• Preconditioner statistics: The preconditioner statistics for adaptive algorithms are updated; e.g. for AdaGrad we set $H_i := H_i + g_i^2$ for all parameters, while for Shampoo we set $L_i := L_i + G G^T$, etc.
• Preconditioned gradient: The preconditioned gradient is computed; e.g. for AdaGrad we compute $g_i / \sqrt{H_i}$, while for Shampoo we compute $L^{-1/4} G R^{-1/4}$.
• Parameter updates: The parameters are updated using the preconditioned gradients. This step is the same for all algorithms: $W := W - \eta \bar G$, where $\bar G$ is the preconditioned gradient.
Note that the Shampoo computation of the preconditioners $L^{-1/4}$, $R^{-1/4}$ is pipelined on the host CPU, so it does not show up in the step times.
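The per-step phases above can be sketched for a single matrix parameter as follows (a minimal simulation in which the all-reduce is an average over per-core gradients, and the CPU-side inverse roots are stand-ins computed inline via eigendecomposition; all names are ours):

```python
import numpy as np

def inv_root(M, p=4):
    """Inverse p-th root via eigendecomposition; a stand-in for the host-side
    coupled Newton iteration (M must be symmetric positive definite)."""
    w, U = np.linalg.eigh(M)
    return (U * w ** (-1.0 / p)) @ U.T

def training_step(W, L, R, sub_batch_grads, eta=0.1):
    """One step, following the phases listed above."""
    G = np.mean(sub_batch_grads, axis=0)   # all-reduce: average over cores
    L = L + G @ G.T                        # preconditioner statistics
    R = R + G.T @ G
    P = inv_root(L) @ G @ inv_root(R)      # preconditioned gradient
    return W - eta * P, L, R               # parameter update

# Usage: two "cores", one 4 × 3 weight matrix, statistics initialized to I.
rng = np.random.default_rng(4)
W = rng.standard_normal((4, 3))
L, R = np.eye(4), np.eye(3)
grads = rng.standard_normal((2, 4, 3))
W, L, R = training_step(W, L, R, grads)
```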



Figure 1: Timeline illustrating the design of the optimization algorithm. Preconditioner statistics ($L_t$ and $R_t$) are computed at each step by the accelerators. Preconditioners ($L_t^{-1/4}$ and $R_t^{-1/4}$) are only computed every N steps, and this computation is distributed to all available CPU cores.

Figure 3: Impact of Shampoo extensions on WMT'14 en→fr training: (a) preconditioning applied to all layers except embedding and softmax layers, vs. applied to all layers; (b) preconditioning with fully-connected layers partitioned into sub-blocks; (c) varying interval between preconditioner updates.


Figure 5: (a) Shampoo reaches a target AUC of 80.25% in half as many steps, with preconditioning of the embedding layers further improving the results, and achieves a new state-of-the-art AUC of 80.56%; (b) Shampoo converges in ≈ 16% fewer steps, and achieves ≈ 1% higher MLM accuracy than the baseline on BERT-Large.

Figure 6: Benchmarks on computing inverse p-th roots for statistics of varying dimensions (left), and the condition numbers for $L_t$ of a layer in the Transformer model over time (right). We find that the coupled Newton iteration method can effectively utilize the CPUs and give large wall-time improvements compared to SVD (which relies on bidiagonal divide-and-conquer). These were measured on Intel Skylake CPUs, without warm start, which provides an additional speedup of up to 4x by reducing the number of iterations to the solution. Note that since $\sim \log_2(\frac{1}{p} \kappa(L_t))$ bits of precision are lost in computing p-th roots, 64-bit arithmetic becomes necessary.

Figure 7: Minimum (dashed) and maximum (solid) singular values for statistics matrices of the embedding, softmax and intermediate attention query projection layers.

$\eta = 0.0060$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, 6.4k steps
Shampoo (batch 16384): $\eta = 0.0060$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, 6.4k steps, $\lambda_2 = 10^{-2}$, $\tau_1 = 400$, $\tau_2 = 10$, block size 1024
DLRM (32): SGD (batch 65536): $\eta = 0.1$, poly decay (p=2) at 38k steps, 2k steps
Shampoo (batch 65536): $\eta = 0.1$, poly decay (p=2) at 38k steps, 2k steps, $\beta_1 = 0.9$, $\tau_1 = 999$, $\tau_2 = 10$; (w/ embeddings) $\eta_{embd} = 0.31$

Translation: WMT-14 En-Fr, Transformer: ≈ 12 hrs vs. 6.5 hrs
Translation: WMT-14 En-Fr, Transformer-Big: ≈ 47 hrs vs. 29.5 hrs
Language modeling: Wikipedia+Books, BERT-Large: 228 mins vs. 219 mins

G.6 Breakdown of step time (Fig. 2)

Each step of training consists of the following phases, whose times are shown in Fig. 2b.

Figure 2: Results for a Transformer model on WMT'14 en→fr, trained with a batch size of 1536. (a) Test log-perplexity vs. number of steps; Shampoo converges 1.95x faster in steps, while being only ≈ 16% slower per step. This allows the method to attain a particular log-perplexity in 40% less wall-time. (b) Detailed breakdown of the latency of a single step (Appendix G.6): diagonal AdaGrad optimizer 134 ms, Shampoo 145 ms (all layers except the embedding and softmax layers) and 155 ms (all layers). Preconditioner computation is pipelined and distributed over CPUs, thus adding no overhead, and the transfer latency (≈ 100 ms) is amortized over hundreds of steps.

Figure 4: Test log-perplexity of a Transformer-Big model on WMT'14 en→fr. (a) Shampoo converges faster than AdaGrad (≈ 2x faster in steps) and allows larger learning rates; due to the large overhead in step time, this results in only a 30% improvement in wall-time. (b) Larger batch sizes reduce the optimizer overhead from 40% to 19%, resulting in an end-to-end improvement of 41% in wall-time to convergence.

Shampoo achieves state-of-the-art performance of 80.56% AUC (an ≈ 0.3% improvement) on this dataset; note that an improvement of 0.1% is considered significant in this task (see Rong et al., 2020; Wang et al., 2017). Here, preconditioning the embedding layers further reduced the number of steps needed to reach the target accuracy from 39.96K to 30.97K.

where the target criterion is reaching 75.9% accuracy. See results in Table 1; in particular, we find that Shampoo reaches the target accuracy in fewer steps than the current state of the art. Tuning details are in Appendix G.4.

Table 1: Epochs and steps to the MLPerf target accuracy of 75.9% with a ResNet-50.

Let $x \in \mathbb{R}^{mk}$, and write $x = [x_1, x_2, \dots, x_k]$.

Table 2: Computational and memory complexity of variants of Shampoo.

The operator norm of the preconditioners (the inverse p-th root of the smallest singular value of the statistics matrix) has a large spread in its range, which results in optimization instabilities in practice. Moreover, as the statistics as well as the preconditioner computation are amortized across many steps, the norm does not grow at every step. Hence, we rely on a learning rate schedule based on the update directions of a well-tuned first-order optimizer: in our experiments we use diagonal AdaGrad for Transformers in machine translation, as well as for Criteo, and the layer-wise scaling heuristic proposed in the LARS/LAMB optimizers, where each layer's learning rate is set to $\|W_t\|_F / \|G_t\|_F$, for BERT and ResNet training. For example, when used with diagonal AdaGrad: Shampoo is used to determine the direction of the update, and AdaGrad to determine its magnitude.
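Grafting can be sketched as follows: a minimal NumPy example, assuming the inverse-root preconditioners are precomputed (function and variable names are ours):

```python
import numpy as np

def grafted_update(W, G, D_accum, L_inv4, R_inv4, eta0=1.0, eps=1e-12):
    """Grafting as described above: diagonal AdaGrad supplies the step
    magnitude, Shampoo the direction."""
    D_accum = D_accum + G * G                            # AdaGrad accumulator D_t
    A = eta0 * np.linalg.norm(G / np.sqrt(D_accum + eps))  # ||D^{-1/2} ∘ G||_F
    S = L_inv4 @ G @ R_inv4                              # Shampoo update
    S_hat = S / (np.linalg.norm(S) + eps)                # unit-norm direction
    return W - A * S_hat, D_accum

# With identity preconditioners the direction is the normalized gradient,
# and the update's Frobenius norm equals the AdaGrad magnitude.
W_new, D = grafted_update(np.zeros((2, 2)), np.ones((2, 2)),
                          np.zeros((2, 2)), np.eye(2), np.eye(2))
```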

Hyperparameter setup used in our experiments.


Algorithm I: A coupled Newton iteration procedure for computing inverse p-th roots of a PSD matrix, with warm start and singular value projection
 1: procedure MSV(G)
 2:   while $i < n_{step}$ and error $> \epsilon$ do
      ⋮
      return $\lambda$, $v / \|v\|$
12: procedure P(G, $\kappa$ (optional), $\kappa_d$ (optional), $n_{proj}$ (optional))
13:   $i = 0$
      ⋮
16:   return $G + \lambda \Delta$
      ⋮
      return X

F Implementation of Distributed Shampoo

Our implementation of the Shampoo algorithm for fully connected layers is described in Algorithm II. The algorithm can use heavy-ball momentum for its updates, as well as an exponential moving average over the preconditioners, like Adam. The configuration parameter $\tau_1$ denotes the number of steps between subsequent fetches of the latest available preconditioner by the accelerator. $\tau_1$ must be set sufficiently high that there is enough time for the CPU to complete the computation of the preconditioner asynchronously and pipeline it efficiently, but otherwise its setting does not have a significant effect on convergence. The configuration parameter $\tau_2$ (default value 1) determines the frequency of gathering gradient statistics: we update $L_t$, $R_t$ only every $\tau_2$ steps, for efficiency.

F.1 Complexity of Shampoo

We capture the computational and memory complexity of the various schemes for handling large layers described in Section 3.1 in Table 2.
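To make the asymptotics in Table 2 concrete, here is a small back-of-the-envelope computation (the layer shape and block size are our illustrative choices):

```python
# Illustrative operation/memory counts for the variants in Table 2,
# for a layer of size n × m with n = 512, m = 2048 and block size b = 128.
n, m, b = 512, 2048, 128

full_compute  = n**2 * m + m**2 * n   # both preconditioners
full_memory   = n**2 + m**2
left_compute  = n**2 * m              # left-only preconditioner
left_memory   = n**2
block_compute = m * n * b             # block-diagonal, block size b
block_memory  = m * n

# Blocking wins by a factor of (n + m) / b over full preconditioning.
print(full_compute / block_compute)   # → 20.0
```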

