RANDOMIZED AUTOMATIC DIFFERENTIATION

Abstract

The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.

1. INTRODUCTION

Deep neural networks have taken center stage as a powerful way to construct and train massivelyparametric machine learning (ML) models for supervised, unsupervised, and reinforcement learning tasks. There are many reasons for the resurgence of neural networks-large data sets, GPU numerical computing, technical insights into overparameterization, and more-but one major factor has been the development of tools for automatic differentiation (AD) of deep architectures. Tools like PyTorch and TensorFlow provide a computational substrate for rapidly exploring a wide variety of differentiable architectures without performing tedious and error-prone gradient derivations. The flexibility of these tools has enabled a revolution in AI research, but the underlying ideas for reverse-mode AD go back decades. While tools like PyTorch and TensorFlow have received huge dividends from a half-century of AD research, they are also burdened by the baggage of design decisions made in a different computational landscape. The research on AD that led to these ubiquitous deep learning frameworks is focused on the computation of Jacobians that are exact up to numerical precision. However, in modern workflows these Jacobians are used for stochastic optimization. We ask: Why spend resources on exact gradients when we're going to use stochastic optimization? This question is motivated by the surprising realization over the past decade that deep neural network training can be performed almost entirely with first-order stochastic optimization. In fact, empirical evidence supports the hypothesis that the regularizing effect of gradient noise assists model generalization (Keskar et al., 2017; Smith & Le, 2018; Hochreiter & Schmidhuber, 1997) . Stochastic gradient descent variants such as AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2015) form the core of almost all successful optimization techniques for these models, using small subsets of the data to form the noisy gradient estimates. The goals and assumptions of automatic differentiation as performed in classical and modern systems are mismatched with those required by stochastic optimization. Traditional AD computes the derivative or Jacobian of a function accurately to numerical precision. This accuracy is required for many problems in applied mathematics which AD has served, e.g., solving systems of differential equations. But in stochastic optimization we can make do with inaccurate gradients, as long as our estimator is unbiased and has reasonable variance. We ask the same question that motivates mini-batch SGD: why compute an exact gradient if we can get noisy estimates cheaply? By thinking of this question in the context of AD, we can go beyond mini-batch SGD to more general schemes for developing cheap gradient estimators: in this paper, we focus on developing gradient estimators with low memory cost. Although previous research has investigated approximations in the forward or reverse pass of neural networks to reduce computational requirements, here we replace deterministic AD with randomized automatic differentiation (RAD), trading off of computation for variance inside AD routines when imprecise gradient estimates are tolerable, while retaining unbiasedness.

2. AUTOMATIC DIFFERENTIATION

Automatic (or algorithmic) differentiation is a family of techniques for taking a program that computes a differentiable function f : R n → R m , and producing another program that computes the associated derivatives; most often the Jacobian: J [f ] = f : R n → R m×n . (For a comprehensive treatment of AD, see Griewank & Walther (2008) ; for an ML-focused review see Baydin et al. (2018) .) In most machine learning applications, f is a loss function that produces a scalar output, i.e., m = 1, for which the gradient with respect to parameters is desired. AD techniques are contrasted with the method of finite differences, which approximates derivatives numerically using a small but non-zero step size, and also distinguished from symbolic differentiation in which a mathematical expression is processed using standard rules to produce another mathematical expression, although Elliott (2018) argues that the distinction is simply whether or not it is the compiler that manipulates the symbols. There are a variety of approaches to AD: source-code transformation (e.g., Bischof et al. (1992) ; Hascoet & Pascual (2013); van Merrienboer et al. (2018) ), execution tracing (e.g., Walther & Griewank (2009) ; Maclaurin et al.) , manipulation of explicit computational graphs (e.g., Abadi et al. (2016) ; Bergstra et al. (2010) ), and category-theoretic transformations (Elliott, 2018) . AD implementations exist for many different host languages, although they vary in the extent to which they take advantage of native programming patterns, control flow, and language features. Regardless of whether it is constructed at compile-time, run-time, or via an embedded domain-specific language, all AD approaches can be understood as manipulating the linearized computational graph (LCG) to collapse out intermediate variables. Figure 1 shows the LCG for a simple example. These computational graphs are always directed acyclic graphs (DAGs) with vertices as variables. Let the outputs of f be y j , the inputs θ i , and the intermediates z l . AD can be framed as the computation of a partial derivative as a sum over all paths through the LCG DAG (Bauer, 1974) : ∂y j ∂θ i = J θ [f ] j,i = [i→j] (k,l)∈[i→j] ∂z l ∂z k (1) where [i → j] indexes paths from vertex i to vertex j and (k, l) ∈ [i → j] denotes the set of edges in that path. See Figure 1d for an illustration. Although general, this naïve sum over paths does not take advantage of the structure of the problem and so, as in other kinds of graph computations, dynamic programming (DP) provides a better approach. DP collapses substructures of the graph until it becomes bipartite and the remaining edges from inputs to outputs represent exactly the entries of the Jacobian matrix. This is referred to as the Jacobian accumulation problem (Naumann, 2004) and there are a variety of ways to manipulate the graph, including vertex, edge, and face elimination (Griewank & Naumann, 2002) . Forward-mode AD and reverse-mode AD (backpropagation) are special cases of more general dynamic programming strategies to perform this summation; determination of the optimal accumulation schedule is unfortunately NP-complete (Naumann, 2008) . While the above formulation in which each variable is a scalar can represent any computational graph, it can lead to structures that are difficult to reason about. Often we prefer to manipulate vectors and matrices, and we can instead let each intermediate z l represent a d l dimensional vector. In this case, ∂z l/∂z k ∈ R d l ×d k represents the intermediate Jacobian of the operation z k → z l . Note that Equation 1 now expresses the Jacobian of f as a sum over chained matrix products.

3. RANDOMIZING AUTOMATIC DIFFERENTIATION

We introduce techniques that could be used to decrease the resource requirements of AD when used for stochastic optimization. We focus on functions with a scalar output where we are interested in the gradient of the output with respect to some parameters, J θ [f ]. Reverse-mode AD efficiently calculates J θ [f ], but requires the full linearized computational graph to either be stored during the forward pass, or to be recomputed during the backward pass using intermediate variables recorded during the forward pass. For large computational graphs this could provide a large memory burden. The most common technique for reducing the memory requirements of AD is gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016) , which saves memory by adding extra forward pass computations. Checkpointing is effective when the number of "layers" in a computation graph is much larger than the memory required at each layer. We take a different approach; we instead aim to save memory by increasing gradient variance, without extra forward computation. Our main idea is to consider an unbiased estimator Ĵθ [f ] such that E Ĵθ [f ] = J θ [f ] which allows us to save memory required for reverse-mode AD. Our approach is to determine a sparse (but random) linearized computational graph during the forward pass such that reverse-mode AD applied on the sparse graph yields an unbiased estimate of the true gradient. Note that the original computational graph is used for the forward pass, and randomization is used to determine a LCG to use for the backward pass in place of the original computation graph. We may then decrease memory costs by storing the sparse LCG directly or storing intermediate variables required to compute the sparse LCG. In this section we provide general recipes for randomizing AD by sparsifying the LCG. In sections 4 and 5 we apply these recipes to develop specific algorithms for neural networks and linear PDEs which achieve concrete memory savings.

3.1. PATH SAMPLING

Observe that in Bauer's formula each Jacobian entry is expressed as a sum over paths in the LCG. A simple strategy is to sample paths uniformly at random from the computation graph, and form a Monte Carlo estimate of Equation 1. Naïvely this could take multiple passes through the graph. However, multiple paths can be sampled without significant computation overhead by performing a topological sort of the vertices and iterating through vertices, sampling multiple outgoing edges for each. We provide a proof and detailed algorithm in the appendix. Dynamic programming methods such as reverse-mode automatic differentiation can then be applied to the sparsified LCG.

3.2. RANDOM MATRIX INJECTION

In computation graphs consisting of vector operations, the vectorized computation graph is a more compact representation. We introduce an alternative view on sampling paths in this case. A single path in the vectorized computation graph represents many paths in the underlying scalar computation graph. As an example, Figure 2c The number of paths increases exponentially with depth; path sampling gives high variance. Independent paths are common when a loss decomposes over data. Fully interleaved graphs are common with vector operations. We now note that the contribution of the path p = θ → a 1 → b 2 → c 2 → y to the gradient is, ∂y ∂C P 2 ∂C ∂B P 2 ∂B ∂A P 1 ∂A ∂θ where P i = e i e T i (outer product of standard basis vectors). Sampling from {P 1 , P 2 , P 3 } and right multiplying a Jacobian is equivalent to sampling the paths passing through a vertex in the scalar graph. In general, if we have transition B → C in a vectorized computational graph, where B ∈ R d , C ∈ R m , we can insert a random matrix P = d /k k s=1 P s where each P s is sampled uniformly from {P 1 , P 2 , . . . , P d }. With this construction, EP = I d , so E ∂C ∂B P = ∂C ∂B . If we have a matrix chain product, we can use the fact that the expectation of a product of independent random variables is equal to the product of their expectations, so drawing independent random matrices P B , P C would give E ∂y ∂C P C ∂C ∂B P B = ∂y ∂C E [P C ] ∂C ∂B E [P B ] = ∂y ∂C ∂C ∂B Right multiplication by P may be achieved by sampling the intermediate Jacobian: one does not need to actually assemble and multiply the two matrices. For clarity we adopt the notation S P [ ∂C /∂B] = ∂C /∂BP . This is sampling (with replacement) k out of the d vertices represented by B, and only considering paths that pass from those vertices. The important properties of P that enable memory savings with an unbiased approximation are EP = I d and P = RR T , R ∈ R d×k , k < d . (6) We could therefore consider other matrices with the same properties. In our additional experiments in the appendix, we also let R be a random projection matrix of independent Rademacher random variables, a construction common in compressed sensing and randomized dimensionality reduction. In vectorized computational graphs, we can imagine a two-level sampling scheme. We can both sample paths from the computational graph where each vertex on the path corresponds to a vector. We can also sample within each vector path, with sampling performed via matrix injection as above. In many situations the full intermediate Jacobian for a vector operation is unreasonable to store. Consider the operation B → C where B, C ∈ R d . The Jacobian is d × d. Thankfully many common operations are element-wise, leading to a diagonal Jacobian that can be stored as a d-vector. Another common operation is matrix-vector products. Consider Ab = c, ∂c /∂b = A. Although A has many more entries than c or b, in many applications A is either a parameter to be optimized or is easily recomputed. Therefore in our implementations, we do not directly construct and sparsify the Jacobians. We instead sparsify the input vectors or the compact version of the Jacobian in a way that has the same effect. Unfortunately, there are some practical operations such as softmax that do not have a compactly-representable Jacobian and for which this is not possible.

3.3. VARIANCE

The variance incurred by path sampling and random matrix injection will depend on the structure of the LCG. We present two extremes in Figure 2 . In Figure 2a , each path is independent and there are a small number of paths. If we sample a fixed fraction of all paths, variance will be constant in the depth of the graph. In contrast, in Figure 2b , the paths overlap, and the number of paths increases exponentially with depth. Sampling a fixed fraction of all paths would require almost all edges in the graph, and sampling a fixed fraction of vertices at each layer (using random matrix injection, as an example) would lead to exponentially increasing variance with depth. It is thus difficult to apply sampling schemes without knowledge of the underlying graph. Indeed, our initial efforts to apply random matrix injection schemes to neural network graphs resulted in variance exponential with depth of the network, which prevented stochastic optimization from converging. We develop tailored sampling strategies for computation graphs corresponding to problems of common interest, exploiting properties of these graphs to avoid the exploding variance problem.

4. CASE STUDY: NEURAL NETWORKS

We consider neural networks composed of fully connected layers, convolution layers, ReLU nonlinearities, and pooling layers. We take advantage of the important property that many of the intermediate Jacobians can be compactly stored, and the memory required during reverse-mode is often bottlenecked by a few operations. We draw a vectorized computational graph for a typical simple neural network in figure 3 . Although the diagram depicts a dataset of size of 3, mini-batch size of size 1, and 2 hidden layers, we assume the dataset size is N . Our analysis is valid for any number of hidden layers, and also recurrent networks. We are interested in the gradients ∂y /∂W1 and ∂y /∂W2.

4.1. MINIBATCH SGD AS RANDOMIZED AD

At first look, the diagram has a very similar pattern to that of 2a, so that path sampling would be a good fit. Indeed, we could sample B < N paths from W 1 to y, and also B paths from W 2 to y. Each path corresponds to processing a different mini-batch element, and the computations are independent. In empirical risk minimization, the final loss function is an average of the loss over data points. Therefore, the intermediate partials ∂y /∂h2,x for each data point x will be independent of the other data points. As a result, if the same paths are chosen in path sampling for W 1 and W 2 , and if we are only interested in the stochastic gradient (and not the full function evaluation), the computation graph only needs to be evaluated for the data points corresponding to the sampled paths. This exactly corresponds to mini-batching. The paths are visually depicted in Figure 3b . We wish to use our principles to derive a randomization scheme that can be used on top of mini-batch SGD. We ensure our estimator is unbiased as we randomize by applying random matrix injection independently to various intermediate Jacobians. Consider a path corresponding to data point 1. The contribution to the gradient ∂y /∂W1 is

4.2. ALTERNATIVE SGD SCHEMES WITH RANDOMIZED AD

W1 R D×H1 x1 x2 x3 R D h1,1 h1,2 h1,3 R H1 a1,1 a1,2 a1,3 R H1 W2 R H1×H2 h2,1 h2,2 h2,3 R H2 y R × × × ReLU ReLU ReLU × × × (a) Neural network computational graph W1 R D×H1 x1 x2 x3 R D h1,1 h1,2 h1,3 R H1 a1,1 a1,2 a1,3 R H1 W2 R H1×H2 h2,1 h2,2 h2,3 R H2 y R × × × ReLU ReLU ReLU × × × (b) Computational Graph with Mini-batching ∂y ∂h 2,1 ∂h 2,1 ∂a 1,1 ∂a 1,1 ∂h 1,1 ∂h 1,1 ∂W 1 (7) Using random matrix injection to sample every Jacobian would lead to exploding variance. Instead, we analyze each term to see which are memory bottlenecks. ∂y /∂h2,1 is the Jacobian with respect to (typically) the loss. Memory requirements for this Jacobian are independent of depth of the network. The dimension of the classifier is usually smaller (10 -1000) than the other layers (which can have dimension 10, 000 or more in convolutional networks). Therefore, the Jacobian at the output layer is not a memory bottleneck. X W1 H 1 A 1 W2 H 2 L Figure 4: Convnet activation sampling for one minibatch element. X is the image, H is the pre-activation, and A is the activation. A is the output of a ReLU, so we can store the Jacobian ∂A 1/∂H 1 with 1 bit per entry. For X and H we sample spatial elements and compute the Jacobians ∂H 1/∂W 1 and ∂H 2/∂W 2 with the sparse tensors. ∂h2,1 /∂a1,1 is the Jacobian of the hidden layer with respect to the previous layer activation. This can be constructed from W 2 , which must be stored in memory, with memory cost independent of mini-batch size. In convnets, due to weight sharing, the effective dimensionality is much smaller than H 1 × H 2 . In recurrent networks, it is shared across timesteps. Therefore, these are not a memory bottleneck. ∂a1,1 /∂h1,1 contains the Jacobian of the ReLU activation function. This can be compactly stored using 1-bit per entry, as the gradient can only be 1 or 0. Note that this is true for ReLU activations in particular, and not true for general activation functions, although ReLU is widely used in deep learning. For ReLU activations, these partials are not a memory bottleneck. ∂h1,1 /∂W1 contains the memory bottleneck for typical ReLU neural networks. This is the Jacobian of the hidden layer output with respect to W 1 , which, in a multi-layer perceptron, is equal to x 1 . For B data points, this is a B × D dimensional matrix. Accordingly, we choose to sample ∂h1,1 /∂W1, replacing the matrix chain with ∂y ∂h2,1 ∂h2,1 ∂a1,1 ∂a1,1 ∂h1,1 S P W 1 ∂h1,1 ∂W1 . For an arbitrarily deep NN, this can be generalized: ∂y ∂h d,1 ∂h d,1 ∂a d-1,1 ∂a d-1,1 ∂h d-1,1 . . . ∂a 1,1 ∂h 1,1 S P W 1 ∂h 1,1 ∂W 1 , ∂y ∂h d,1 ∂h d,1 ∂a d-1,1 ∂a d-1,1 ∂h d-1,1 . . . ∂a 2,1 ∂h 2,1 S P W 2 ∂h 2,1 ∂W 2 This can be interpreted as sampling activations on the backward pass. This is our proposed alternative SGD scheme for neural networks: along with sampling data points, we can also sample activations, while maintaining an unbiased approximation to the gradient. This does not lead to exploding variance, as along any path from a given neural network parameter to the loss, the sampling operation is only applied to a single Jacobian. Sampling for convolutional networks is visualized in Figure 4 .

4.3. NEURAL NETWORK EXPERIMENTS

We evaluate our proposed RAD method on two feedforward architectures: a small fully connected network trained on MNIST, and a small convolutional network trained on CIFAR-10. We also evaluate our method on an RNN trained on Sequential-MNIST. The exact architectures and the calculations for the associated memory savings from our method are available in the appendix. In Figure 5 we include empirical analysis of gradient noise caused by RAD vs mini-batching. We are mainly interested in the following question: For a fixed memory budget and fixed number of gradient descent iterations, how quickly does our proposed method optimize the training loss compared to standard SGD with a smaller mini-batch? Reducing the mini-batch size will also reduce computational costs, while RAD will only reduce memory costs. Theoretically our method could reduce computational costs slightly, but this is not our focus. We only consider the memory/gradient variance tradeoff while avoiding adding significant overhead on top of vanilla reverse-mode (as is the case for checkpointing). Results are shown in Figure 6 . Our feedforward network full-memory baseline is trained with a minibatch size of 150. For RAD we keep a mini-batch size of 150, and try 2 different configurations. For "same sample", we sample with replacement a 0.1 fraction of activations, and the same activations are sampled for each mini-batch element. For "different sample", we sample a 0.1 fraction of activations, independently for each mini-batch element. Our "reduced batch" experiment is trained without RAD with a mini-batch size of 20 for CIFAR-10 and 22 for MNIST. This achieves similar memory budget as RAD with mini-batch size 150. Details of this calculation and of hyperparameters are in the appendix. For the convolutional and fully connected neural networks, the loss decreases faster using activation sampling, compared to reducing the mini-batch size further to match the memory usage. For the fully connected NN on MNIST, it is important to sample different activations for each mini-batch element, since otherwise only part of the weight vector will get updated with each iteration. For the convolutional NN on CIFAR-10, this is not an issue due to weight tying. As expected, the full memory baseline converges quicker than the low memory versions. For the RNN on Sequential-MNIST, sampling different activations at each time-step matches the performance obtained by reducing the mini-batch size. Figure 5 : We visualize the gradient noise for each stochastic gradient method by computing the full gradient (over all mini-batches in the training set) and computing the mean squared error deviation for the gradient estimated by each method for each layer in the convolutional net. RAD has significantly less variance vs memory than reducing minibatch size. Furthermore, combining RAD with an increased mini-batch size achieves similar variance to baseline 150 mini-batch elements while saving memory.

Fraction of activations

For the feedforward networks we tune the learning rate and 2 regularization parameter separately for each gradient estimator on a randomly held out validation set. We train with the best performing hyperparameters on bootstrapped versions of the full training set to measure variability in training. Details are in the appendix, including plots for train/test accuracy/loss, and a wider range of fraction of activations sampled. All feedforward models are trained with Adam. In the RNN case, we also run baseline, "same sample", "different sample" and "reduced batch" experiments. The "reduced batch" experiment used a mini-batch size of 21, while the others used a mini-batch size of 150. The learning rate was fixed at 10 -4 for all gradient estimators, found via a coarse grid search for the largest learning rate for which optimization did not diverge. Although we did not tune the learning rate separately for each estimator, we still expect that with a fixed learning rate, the lower variance estimators should perform better. When sampling, we sample different activations at each time-step. All recurrent models are trained with SGD without momentum.

5. CASE STUDY: REACTION-DIFFUSION PDE-CONSTRAINED OPTIMIZATION

Our second application is motivated by the observation that many scientific computing problems involve a repeated or iterative computation resulting in a layered computational graph. We may apply RAD to get a stochastic estimate of the gradient by subsampling paths through the computational graph. For certain problems, we can leverage problem structure to develop a low-memory stochastic gradient estimator without exploding variance. To illustrate this possibility we consider the optimization of a linear reaction-diffusion PDE on a square domain with Dirichlet boundary conditions, representing the production and diffusion of neutrons in a fission reactor (McClarren, Published as a conference paper at ICLR 2021 2018). Simulating this process involves solving for a potential φ(x, y, t) varying in two spatial coordinates and in time. The solution obeys the partial differential equation: ∂φ(x, y, t) ∂t = D∇ 2 φ(x, y, t) + C(x, y, t, θ)φ(x, y, t) We discretize the PDE in time and space and solve on a spatial grid using an explicit update rule φ t+1 = M φ t + ∆tC t φ t , where M summarizes the discretization of the PDE in space. The exact form is available in the appendix. The initial condition is φ 0 = sin (πx) sin (πy), with φ = 0 on the boundary of the domain. The loss function is the time-averaged squared error between φ and a time-dependent target, 2πx) sin (πy). The source C is given by a seven-term Fourier series in x and t, with coefficients given by θ ∈ R 7 , where θ is the control parameter to be optimized. Full simulation details are provided in the appendix. L = 1 /T t ||φ t (θ) -φ target t || 2 2 . The target is φ target t = φ 0 + 1 /4 sin (πt) sin (

The gradient is

∂L ∂θ = T t=1 ∂L ∂φt t i=1 t-1 j=i ∂φj+1 ∂φj ∂φi ∂Ci-1 ∂Ci-1 ∂θ . As the reaction-diffusion PDE is linear and explicit, ∂φj+1 /∂φj ∈ R N 2 x ×N 2 x is known and independent of φ. We avoid storing C at each timestep by recomputing C from θ and t. This permits a low-memory stochastic gradient estimate without exploding variance by sampling from ∂L /∂φt ∈ R N 2 x and the diagonal matrix ∂φi /∂Ci-1, replacing ∂L ∂θ with the unbiased estimator T t=1 S P φ t ∂L ∂φ t t i=1 t-1 j=i ∂φ j+1 ∂φ j S P φ i-1 ∂φ i ∂C i-1 ∂C i-1 ∂θ . ( ) This estimator can reduce memory by as much as 99% without harming optimization; see Figure 7b . RAD saves up to 99% of memory without significant slowdown in convergence.

6. RELATED WORK

Approximating gradients and matrix operations Much thought has been given to the approximation of general gradients and Jacobians. We draw inspiration from this literature, although our main objective is designing an unbiased gradient estimator, rather than an approximation with bounded accuracy. Abdel-Khalik et al. (2008) accelerate Jacobian accumulation via random projections, in a similar manner to randomized methods for SVD and matrix multiplication. Choromanski & Sindhwani (2017) recover Jacobians in cases where AD is not available by performing a small number of function evaluations with random input perturbations and leveraging known structure of the Jacobian (such as sparsity and symmetry) via compressed sensing. Other work aims to accelerate neural network training by approximating operations from the forward and/or backward pass. Sun et al. (2017) and Wei et al. (2017) backpropagate sparse gradients, keeping only the top k elements of the adjoint vector. Adelman & Silberstein (2018) approximate matrix multiplications and convolutions in the forward pass of neural nets nets using a columnrow sampling scheme similar to our subsampling scheme. Their method also reduces the computational cost of the backwards pass but changes the objective landscape. Related are invertible and reversible transformations, which remove the need to save intermediate variables on the forward pass, as these can be recomputed on the backward pass. Maclaurin et al. (2015) use this idea for hyperparameter optimization, reversing the dynamics of SGD with momentum to avoid the expense of saving model parameters at each training iteration. Gomez et al. (2017) introduce a reversible ResNet (He et al., 2016) to avoid storing activations. Chen et al. (2018) introduce Neural ODEs, which also have constant memory cost as a function of depth. Limited-memory learning and optimization Memory is a major bottleneck for reverse-mode AD, and much work aims to reduce its footprint. Gradient checkpointing is perhaps the most well known, and has been used for both reverse-mode AD (Griewank & Walther, 2000) with general layerwise computation graphs, and for neural networks (Chen et al., 2016) . In gradient checkpointing, some subset of intermediate variables are saved during function evaluation, and these are used to re-compute downstream variables when required. Gradient checkpointing achieves sublinear memory cost with the number of layers in the computation graph, at the cost of a constant-factor increase in runtime. Stochastic Computation Graphs Our work is connected to the literature on stochastic estimation of gradients of expected values, or of the expected outcome of a stochastic computation graph. The distinguishing feature of this literature (vs. the proposed RAD approach) is that it uses stochastic estimators of an objective value to derive a stochastic gradient estimator, i.e., the forward pass is randomized. Methods such as REINFORCE (Williams, 1992) optimize an expected return while avoiding enumerating the intractably large space of possible outcomes by providing an unbiased stochastic gradient estimator, i.e., by trading computation for variance. This is also true of mini-batch SGD, and methods for training generative models such as contrastive divergence (Hinton, 2002) , and stochastic optimization of evidence lower bounds (Kingma & Welling, 2013) . Recent approaches have taken intractable deterministic computation graphs with special structure, i.e. involving loops or the limits of a series of terms, and developed tractable, unbiased, randomized telescoping series-based estimators for the graph's output, which naturally permit tractable unbiased gradient estimation (Tallec & Ollivier, 2017; Beatson & Adams, 2019; Chen et al., 2019; Luo et al., 2020) .

7. CONCLUSION

We present a framework for randomized automatic differentiation. Using this framework, we construct reduced-memory unbiased estimators for optimization of neural networks and a linear PDE. Future work could develop RAD formulas for new computation graphs, e.g., using randomized rounding to handle arbitrary activation functions and nonlinear transformations, integrating RAD with the adjoint method for PDEs, or exploiting problem-specific sparsity in the Jacobians of physical simulators. The randomized view on AD we introduce may be useful beyond memory savings: we hope it could be a useful tool in developing reduced-computation stochastic gradient methods or achieving tractable optimization of intractable computation graphs. assume they take 1-bit per dimension (non-reduced dimension) to store. The input to the softmax layer is not sampled or projected. In both cases, the average pool and bias gradients does not require saving since the gradient is constant. For MNIST fully connected, this gives (per mini-batch element memory): Baseline: (784 + 300 + 300 + 300 + 10) • 32 bits = 6.776 kBytes RAD 0.1: (79 + 30 + 30 + 30 + 10) • 32 bits + (300 + 300 + 300) • 1 bits = 828.5 bytes which leads to approximately 8x savings per mini-batch element. For CIFAR-10 convolutional, this gives (per mini-batch element memory): Baseline: We trained the CIFAR-10 models for 100, 000 gradient descent iterations with a fixed mini-batch size, sampled with replacement from the training set. We lower the learning rate by 0.6 every 10, 000 iterations. We train with the Adam optimizer. We center the images but do not use data augmentation. The MNIST models were trained similarly, but for 20, 000 iterations, with the learning rate lowered by 0.6 every 2, 000 iterations. We fixed these hyperparameters in the beginning and did not modify them. ( We tune the initial learning rate and 2 weight decay parameters for each experiment reported in the main text for the feedforward networks. For each experiment (project, same sample, different sample, baseline, reduced batch), for both architectures, we generate 20 (weight decay, learning rate) pairs, where each weight decay is from the loguniform distribution over 0.0000001 -0.001 and learning rate from loguniform distribution over 0.00001 -0.01. We then randomly hold out a validation dataset of size 5000 from the CIFAR-10 and MNIST training sets and train each pair on the reduced training dataset and evaluate on the validation set. For each experiment, we select the hyperparameters that give the highest test accuracy. For each experiment, we train each experiment with the best hyperparameters 5 times on separate bootstrapped resamplings of the full training dataset (50, 000 for CIFAR-10 and 60, 000 for MNIST), and evaluate on the test dataset (10, 000 for both). This is to make sure the differences we observe across experiments are not due to variability in training. In the main text we show 3 randomly selected training curves for each experiment. Below we show all 5. All experiments were run on a single NVIDIA K80 or V100 GPU. Training times were reported on a V100.

RNN TRAINING DETAILS

All RNN experiments were trained for 200,000 iterations (mini-batch updates) with a fixed mini-batch size, sampled with replacement from the training set. We used the full MNIST training set of 60,000 images whereby the images were centered. Three repetitions of the same experiment were performed with different seeds. Hyperparameter tuning was not performed due to time constraints. The hidden-to-hidden matrix (W hh ) is initialised with the identity matrix, the input-to-hidden matrix (W ih ) and hidden-to-output (last hidden layer to softmax input) are initialised with a random matrix where each element is drawn independently from a N (0, 0.001) distribution and the biases (b ih , b hh ) are initialised with zero. The model was evaluated on the test set of 10,000 images every 400 iterations and on the entire training set every 4000 iterations. For the "sample", "different sample", "project" and "different project" experiments different activations/random matrices were sampled at every time-step of the unrolled RNN. All experiments were run on a single NVIDIA K80 or V100 GPU. The average running times for each experiment are given in Table 2 . Note that we did not optimise our implementation for speed and so these running times can be reduced significantly. 

IMPLEMENTATION

The code is provided on GitHubfoot_2 . Note that we did not optimize the code for computational efficiency; we only implemented our method as to demonstrate the effect it has on the number of gradient steps to train. Similarly, we did not implement all of the memory optimizations that we account for in our memory calculations; in particular in our implementation we did not take advantage of storing ReLU derivatives with 1-bit or the fact that average pooling has a constant derivative. Although these would have to be implemented in a practical use-case, they are not necessary in this proof of concept. 

APPENDIX C: PATH SAMPLING ALGORITHM AND ANALYSIS

Here we present an algorithm for path sampling and provide a proof that it leads to an unbiased estimate for the gradient. The main idea is to sample edges from the set of outgoing edges for each vertex in topological order, and scale appropriately. Vertices that have no incoming edges sampled can be skipped. Algorithm 1 RMAD with path sampling 1: Inputs: The main storage savings from Algorithm 1 will come from Line 9, where we only consider a vertex if it has an incoming edge that has been sampled. In computational graphs with a large number of independent paths, this will significantly reduce memory required, whether we record intermediate variables and recompute the LCG, or store entries of the LCG directly. To see that path sampling gives an unbiased estimate, we use induction on the vertices in reverse topological order. For every vertex z, we denote z = ∂y ∂z and ẑ as our approximation for z. For our base case, we let ŷ = dy dy = 1, so Eŷ = ȳ. For all other vertices z, we define ẑ = d z (z,v)∈E I v=vi ∂v ∂z v (9) where d z is the out-degree of z, v i is sampled uniformly from the set of successors of z, and I v=vi is an indicator random variable denoting if v = v i . We then have 



Department of Computer Science Department of Astrophysical Sciences https://github.com/PrincetonLIPS/RandomizedAutomaticDifferentiation



Figure 1: Illustration of the basic concepts of the linearized computational graph and Bauer's formula. (a) a simple Python function with intermediate variables; (b) the primal computational graph, a DAG with variables as vertices and flow moving upwards to the output; (c) the linearized computational graph (LCG) in which the edges are labeled with the values of the local derivatives; (d) illustration of the four paths that must be evaluated to compute the Jacobian. (Example from Paul D. Hovland.)

Figure 2: Common computational graph patterns. The graphs may be arbitrarily deep and wide. (a) A small number of independent paths. Path sampling has constant variance with depth. (b)The number of paths increases exponentially with depth; path sampling gives high variance. Independent paths are common when a loss decomposes over data. Fully interleaved graphs are common with vector operations.

Figure 3: NN computation graphs.

Figure 6: Training curves for neural networks. The legend in (c) applies to all plots. For the convolutional and

Figure 7: Reaction-diffusion PDE expt. (b)

Figure10: Full train/test curves and runtime per iteration for various fractions for the "different sample" experiment. Note that the reason 0.8 does not quite converge to baseline in the training curve is because we sample with replacement. This is an implementation detail; our method could be modified to sample without replacement, and at fraction 1.0 would be equivalent to baseline. The weight decay and initial learning rate for the RAD experiments above are all the same as the ones tuned for 0.1 fraction "different sample" experiment. The baseline experiments are tuned for baseline.

V, E) -Computational Graph. d v denotes outdegree, v.succ successor set of vertex v. ) = 0, ∀e ∈ E 8: for v in topological order; synchronous with forward computation do 9:if No incoming edge of v has been sampled then i from [d v ] uniformly.13:Q(v, v.succ[i]) ← Q(v, v.succ[i]) + dv k ∂v.succ[i] ∂v14: Run backpropagation from y to Θ using Q as intermediate partials.15: Output: ∇ Θ y

randomness over the sampling of outgoing edges is independent of v, which must be true because our induction is in reverse topological order. Since by induction we assumed Ev = v,

are vectors with entries a i , b i , c i , ∂C /∂B, ∂B /∂A are 3 × 3 Jacobian matrices for the intermediate operations, ∂y /∂C is 1 × 3, and ∂A /∂θ is 3 × 1.

Peak

3 • 32 • 32 + 16 • 32 • 32 + 32 • 16 • 16 + 32 • 16 • 16 + 32 • 8 • 8 + 10) • 32 bits = 151.59 kBytes

Average running times for RNN experiments on Sequential-MNIST.

FULL TRAINING/TEST CURVES FOR MNIST AND CIFAR-10Training time vs Iterations for SmallFCNet on MNIST Figure 8: Full train/test curves and runtime per iteration. Note that training time is heavily dependent on implementation, which we did not optimize. In terms of FLOPs, "project" should be significantly higher than all the others and "reduced batch" should be significantly smaller. The "baseline", "same sample", and "different sample" should theoretically have approximately the same number of FLOPs.SWEEP OF FRACTIONS SAMPLED FOR MNIST AND CIFAR-10

ACKNOWLEDGEMENTS

The authors would like to thank Haochen Li for early work on this project. We would also like to thank Greg Gundersen, Ari Seff, Daniel Greenidge, and Alan Chung for helpful comments on the manuscript. This work is partially supported by NSF IIS-2007278. 

APPENDIX A: NEURAL NETWORK EXPERIMENTS RANDOM PROJECTIONS FOR RAD

As mentioned in Section 3.2 (around Equation 5) of the main paper, we could also use different matrices P that have the propertiesIn the appendix we report experiments of letting R be a matrix of iid Rademacher random variables, scaled by √ k. P = RR T defined in this way satisfies the properties above. Note that this would lead to additional computation: The Jacobian or input vector would have to be fully computed, and then multiplied by R and stored. In the backward pass, it would have to be multiplied by R T . We report results as the "project" experiment in the full training/test curves in the following sections. We see that it performs competitively with reducing the mini-batch size.

ARCHITECTURES USED

We use three different neural network architectures for our experiments: one fully connected feedforward, one convolutional feedforward, and one recurrent. 

APPENDIX B: REACTION-DIFFUSION PDE

The reaction-diffusion equation is a linear parabolic partial differential equation. In fission reactor analysis, it is called the one-group diffusion equation or one-speed diffusion equation, shown below.Here φ represents the neutron flux, D is a diffusion coefficient, and Cφ and S are source terms related to the local production or removal of neutron flux. In this paper, we solve the one-speed diffusion equation in two spatial dimensions on the unit square with the condition that φ = 0 on the boundary. We assume that D is constant equal to 1 /4, C(x, y, t, θ) is a function of control parameters θ described below, and S is zero. We discretize φ on a regular grid in space and time, which motivates the notation φ → φ t . The grid spacing is ∆x = 1 /32 and the timestep is ∆t = 1 /4096. We simulate from t = 0 to t = 10. We use the explicit forward-time, centered-space (FTCS) method to timestep φ. The timestep is chosen to satisfy the stability criterion, D∆t /(∆x) 2 ≤ 1 4 . In matrix notation, the FTCS update rule can be written φ t+1 = M φ t + ∆tC t φ t , in index notation it can be written as follows:The term Cφ in the one-speed diffusion equation relates to the local production or removal of neutrons due to nuclear interactions. In a real fission reactor, C is a complicated function of the material properties of the reactor and the heights of the control rods. We make the simplifying assumption that C can be described by a 7-term Fourier series in x and t, written below. Physically, this is equivalent to the assumption that the material properties of the reactor are constant in space and time, and the heights of the control rods are sinusoidally varied in x and t. φ 0 is initialized so that the reactor begins in a stable state, the other parameters are initialized from a uniform distribution.C(x, y, t, θ) = θ 0 + θ 1 sin(πt) + θ 2 cos(πt) + θ 3 sin(2πx) sin(πt)+ θ 4 sin(2πx) cos(πt) + θ 5 cos(2πx) sin(πt) + θ 6 cos(2πx) cos(πt)The details of the stochastic gradient estimate and optimization are described in the main text. The Adam optimizer is used. Each experiment of 800 optimization iterations runs in about 4 hours on a GPU.

