Differentiate Everything with a Reversible Embedded Domain-Specific Language

Abstract

Reverse-mode automatic differentiation (AD) suffers from a large space overhead: intermediate computational states must be traced back for back-propagation. The traditional approach to tracing back states, checkpointing, stores intermediate states on a global stack and restores them either by popping the stack or by re-computing. The overhead of stack manipulation and re-computation makes general-purpose (not tensor-based) AD engines unable to meet many industrial needs. Instead of checkpointing, we propose to trace back states with reverse computing, by designing and implementing a reversible programming eDSL in which a program can be executed bi-directionally without implicit stack operations. The absence of implicit stack operations makes the program compatible with existing compiler features, including existing optimization passes and compilation as GPU kernels. We implement AD for sparse matrix operations and several machine learning applications to show that our framework achieves state-of-the-art performance.

1. Introduction

Most of the popular automatic differentiation (AD) tools in the market, such as TensorFlow (Abadi et al., 2015), PyTorch (Paszke et al., 2017), and Flux (Innes et al., 2018), implement reverse-mode AD at the tensor level to meet the needs of machine learning. People in the scientific computing domain later realized the power of these AD tools and used them to solve scientific problems such as seismic inversion (Zhu et al., 2020), variational quantum circuit simulation (Bergholm et al., 2018; Luo et al., 2019) and variational tensor network simulation (Liao et al., 2019; Roberts et al., 2019). To meet the diverse needs of these applications, one sometimes has to define backward rules manually. For example: 1. To differentiate sparse matrix operations used in Hamiltonian engineering (Hao Xie & Wang), people defined backward rules for sparse matrix multiplication and dominant eigensolvers (Golub & Van Loan, 2012). 2. In tensor network algorithms for studying phase transition problems (Liao et al., 2019; Seeger et al., 2017; Wan & Zhang, 2019; Hubig, 2019), people defined backward rules for the singular value decomposition (SVD) and the QR decomposition (Golub & Van Loan, 2012). Instead of defining backward rules manually, one can also use a general-purpose AD (GP-AD) framework like Tapenade (Hascoet & Pascual, 2013), OpenAD (Utke et al., 2008) or Zygote (Innes, 2018; Innes et al., 2019). Researchers have used these tools in practical applications such as bundle adjustment (Shen & Dai, 2018) and earth system simulation (Forget et al., 2015), where differentiating scalar operations is important. However, the power of these tools is often limited by their relatively poor performance. In many practical applications, a program may perform billions of computations, and in each computational step the AD engine may cache some data for back-propagation (Griewank & Walther, 2008).
Frequent caching of data slows down the program significantly, and memory usage can become a bottleneck as well. Implicit caching also makes these frameworks incompatible with kernel functions. To avoid such issues, we need a new GP-AD framework that does not cache automatically for the user. In this paper, we propose to implement reverse-mode AD on a reversible (domain-specific) programming language (Perumalla, 2013; Frank, 2017), where intermediate states can be traced backward without accessing an implicit stack. Reversible programming lets people exploit reversibility to reverse a program. In machine learning, reversibility has been shown to substantially decrease the memory usage of unitary recurrent neural networks (MacKay et al., 2018), normalizing flows (Dinh et al., 2014), hyper-parameter learning (Maclaurin et al., 2015) and residual neural networks (Gomez et al., 2017; Behrmann et al., 2018). Reversible programming makes these applications natural. The power of reversible programming is not limited to such intrinsically reversible applications: any program can be written in a reversible style. Converting an irreversible program to reversible form costs overhead in time and space. Reversible programming provides a flexible time-space trade-off scheme (Bennett, 1989; Levine & Sherman, 1990), different from checkpointing (Griewank, 1992; Griewank & Walther, 2008; Chen et al., 2016), that lets the user handle these overheads explicitly. There have been many prototypes of reversible languages, such as Janus (Lutz, 1986), R (not the popular one) (Frank, 1997), Erlang (Lanese et al., 2018) and the object-oriented ROOPL (Haulund, 2017).
In the past, the primary motivation for studying reversible programming was to support reversible computing devices (Frank & Knight Jr, 1999) such as adiabatic complementary metal-oxide-semiconductor (CMOS) (Koller & Athas, 1992), molecular mechanical computing systems (Merkle et al., 2018) and superconducting systems (Likharev, 1977; Semenov et al., 2003; Takeuchi et al., 2014; 2017); these reversible computing devices are orders of magnitude more energy-efficient. Landauer proved that only when a device does not erase information (i.e., is reversible) can its energy efficiency go beyond the thermodynamic limit (Landauer, 1961; Reeb & Wolf, 2014). However, existing reversible programming languages cannot be used directly in real scientific computing, since most of them lack basic elements like floating-point numbers, arrays, and complex numbers. This motivates us to build a new embedded domain-specific language (eDSL) in Julia (Bezanson et al., 2012; 2017) as a new playground for GP-AD. In this paper, we first compare the time-space trade-offs of optimal checkpointing and optimal reverse computing in Sec. 2. Then we introduce the language design of NiLang in Sec. 3. In Sec. 4, we explain the implementation of automatic differentiation in NiLang. In Sec. 5, we benchmark NiLang's AD against other AD software and explain why it is fast.

2. Reverse computing as an alternative to checkpointing

One can use either checkpointing or reverse computing to trace back the intermediate states of a $T$-step computational process $s_1 = f_1(s_0), s_2 = f_2(s_1), \ldots, s_T = f_T(s_{T-1})$ with run-time memory $S$. In the checkpointing scheme, the program first takes snapshots of states at certain time steps $\mathcal{S} = \{s_a, s_b, \ldots\}$, $1 \le a < b < \ldots \le T$, by running a forward pass. When retrieving a state $s_k$, if $s_k \in \mathcal{S}$, it is returned directly; otherwise, the latest snapshot $s_j \in \mathcal{S}$ with $j < k$ is found and $s_k$ is re-computed from $s_j$.
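The checkpoint-retrieval rule above can be sketched in a few lines. This is a plain-Python illustration, not NiLang code; `fs`, `snapshots` and `retrieve` are hypothetical names introduced here.

```python
# Illustrative sketch of checkpointing retrieval: `fs` is the list of step
# functions f_1..f_T, `snapshots` maps selected step indices to saved states.

def retrieve(k, fs, snapshots):
    """Return s_k: use a snapshot if one exists, otherwise recompute
    from the nearest earlier snapshot."""
    if k in snapshots:
        return snapshots[k]
    j = max(i for i in snapshots if i < k)  # nearest checkpoint before k
    s = snapshots[j]
    for step in range(j + 1, k + 1):        # recompute s_{j+1} .. s_k
        s = fs[step - 1](s)                 # s_t = f_t(s_{t-1})
    return s
```

With snapshots only at steps 0 and 3, retrieving step 5 recomputes two steps from the step-3 snapshot instead of five steps from the beginning.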
In the reverse computing scheme, one first writes the program in a reversible style. Without prior knowledge, a regular program can be transpiled to reversible style by the transformation in Listing 1.

Listing 1: Transpiling a regular program to reversible code without prior knowledge.

s_1 += f_1(s_0)
s_2 += f_2(s_1)
...
s_T += f_T(s_{T-1})

Listing 2: The reverse of Listing 1.

s_T -= f_T(s_{T-1})
...
s_2 -= f_2(s_1)
s_1 -= f_1(s_0)

Then one can visit the states in reversed order by running the reversed program in Listing 2, which erases the computed results from the tail. One may argue that erasing through uncomputing is unnecessary here. This is not true for a general reversible program, because the intermediate states might be mutable and used in other parts of the program. It is easy to see that both checkpointing and reverse computing can trace back states without time overhead, but both suffer from a space overhead linear in time (Table 1): the checkpointing scheme snapshots the output of every step, and the reverse computing scheme allocates extra storage for the output of every step. On the other hand, only checkpointing can achieve zero space overhead, by re-computing everything from the initial state $s_0$ with time complexity $O(T^2)$. The minimum space complexity of reverse computing is $O(S \log(T/S))$ (Bennett, 1989; Levine & Sherman, 1990; Perumalla, 2013), with time complexity $O(T^{1.585})$.

Table 1: T and S are the time and space of the original irreversible program. In the "Reverse computing" case, the reversibility of the original program is not utilized.

The difference in space overheads can be explained by the difference between the optimal checkpointing and optimal reverse computing algorithms. The optimal checkpointing algorithm widely used in AD is the treeverse algorithm in Fig. 1(a). This algorithm partitions the computational process binomially into d sectors.
At the beginning of each sector, the program snapshots the state and pushes it onto a global stack, hence the memory for checkpointing is $dS$. The states in the last sector are retrieved by the space-efficient $O(T^2)$ algorithm above. After that, the last snapshot can be freed and the program gains one more quota of memory. With the freed memory, the second-to-last sector can be further partitioned into two sectors. Likewise, the $l$-th sector is partitioned into $l$ sub-sectors, where $l$ is the sector index counted from the tail. This treeverse algorithm is applied recursively $t$ times until the sector size is 1. The approximate overheads in time and space are $T_c = tT$ and $S_c = dS$, where $T = \eta(t, d)$ holds. By carefully choosing either $t$ or $d$, the overheads in time and space can both be made logarithmic. On the other hand, the optimal time-space trade-off scheme in reverse computing is Bennett's algorithm, illustrated in Fig. 1(b). It evenly partitions the program into $k$ sectors. The program marches forward (P process) for $k$ steps to obtain the final state $s_{k+1}$, then backward (Q process) from the $(k-1)$-th step to erase the intermediate states $s_{1<i<k}$. This is also called the compute-copy-uncompute process. Applying the compute-copy-uncompute process recursively for each sector until each P/Q process contains only one unit computation, the time and space complexities are

$T_r = T \left(\frac{T}{S}\right)^{\frac{\ln(2 - 1/k)}{\ln k}}, \quad S_r = \frac{k-1}{\ln k} S \log \frac{T}{S}.$

Here, the overhead in time is polynomial, which is worse than the treeverse algorithm. The treeverse-like partition does not apply here because the first sweep, which creates the initial checkpoints without introducing any space overhead, is not possible in reversible computing.

Figure 1: (a) The treeverse algorithm for optimal checkpointing (Griewank, 1992); $\eta(\tau, \delta) \equiv \binom{\tau+\delta}{\delta} = \frac{(\tau+\delta)!}{\tau!\,\delta!}$ is the binomial function. (b) Bennett's time-space trade-off scheme for reverse computing (Bennett, 1973; Levine & Sherman, 1990); P and Q are computing and uncomputing processes respectively. The pseudo-code is defined in Listing 3.

Reverse computing does not show an advantage in the above complexity analysis. But we argue that this analysis reflects the worst case, which is very different from practical use cases. First, reverse computing can make use of reversibility to save memory. In Appendix B.2, we show how to implement a unitary matrix multiplication without introducing overheads in space or time. Second, reverse computing does not allocate automatically for the user, so users can optimize memory access patterns for their own devices, such as GPUs. Third, reverse computing is compatible with effectful code, so it fits better with modern languages. In Appendix B.1, we show how to manipulate in-place functions on arrays with NiLang. Fourth, reverse computing can utilize the existing compiler to optimize the code, because it does not introduce global stack operations that harm the purity of functions. Fifth, reverse computing encourages users to think reversibly. In Appendix B.3, we show that reversible thinking can lead the user to a constant-memory, constant-time implementation of a chained multiplication algorithm.
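To make Bennett's recursion concrete, here is a small cost model of the compute-copy-uncompute scheme in plain Python (illustrative only; `bennett_cost` is a hypothetical helper, and the space count follows the recursion sketched above).

```python
# Cost model for Bennett's compute-copy-uncompute scheme on T = k**m steps:
# each level runs k forward (P) and k-1 backward (Q) sub-processes, and keeps
# at most k-1 extra sector boundaries alive at a time.

def bennett_cost(m, k):
    """Return (unit computations, peak stored states) for k**m steps."""
    if m == 0:
        return 1, 1
    t, s = bennett_cost(m - 1, k)
    return (2 * k - 1) * t, s + (k - 1)
```

For k = 2 and T = 2**10 = 1024 steps, the time is (2k-1)**m = 3**10 = 59049, i.e. roughly T**1.585, while the number of live states grows only as (k-1) per recursion level, matching the logarithmic space and polynomial time quoted above.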

3. Language design

NiLang is an embedded domain-specific language (eDSL) built on top of the host language Julia (Bezanson et al., 2012; 2017). Julia is a popular language for scientific programming and machine learning. We choose Julia mainly for speed: although Julia is a highly abstract language, its clever design of type inference and just-in-time compilation gives it C-like speed. Meanwhile, it has rich features for meta-programming; its pattern-matching package MLStyle allows us to define an eDSL in less than 2000 lines. Compared with a regular reversible programming language, NiLang features array operations and rich number systems, including floating-point numbers, complex numbers, fixed-point numbers, and logarithmic numbers. It also implements the compute-copy-uncompute (Bennett, 1973) macro to increase code reusability. Besides the above reversible-hardware-compatible features, it also has some reversible-hardware-incompatible features to meet practical needs. For example, it views the floating-point + and - operations as reversible. It also allows users to extend the instruction set and, sometimes, to insert external statements. These features are not compatible with future reversible hardware. NiLang's compilation process, grammar and operational semantics are described in Appendix G. The source code is also available online; we will put a link here after the anonymous open review session. By the time of writing, the version of NiLang is v0.7.3.

3.1. Reversible functions and instructions

Mathematically, any irreversible mapping y = f(args...) can be trivially transformed to a reversible form y += f(args...) or y ⊻= f(args...) (⊻ is the bit-wise XOR), where y is a pre-emptied variable. But in numeric computing with finite precision, this is not always true. The reversibility of an arithmetic instruction is closely related to the number system. For integer and fixed-point number systems, y += f(args...) and y -= f(args...) are rigorously reversible. For the logarithmic number system and the tropical number system (Speyer & Sturmfels, 2009), y *= f(args...) and y /= f(args...) are reversible (provided the zero element is not introduced). For floating-point numbers, none of the above operations is rigorously reversible. However, for convenience, we ignore the round-off errors in floating-point + and - operations and treat them on an equal footing with fixed-point numbers in the following discussion. In Appendix F, we show that doing so is safe in most cases, provided a careful implementation. Other reversible operations include SWAP, ROT, NEG, etc. The macro @i generates two functions that are reversible to each other, multiplier and ~multiplier, each defining a mapping R^3 → R^3. The ! after a symbol is part of the name, a convention to indicate mutated variables.
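The accumulate and XOR idioms above can be demonstrated outside NiLang. The following is a plain-Python sketch (the function names `plus_eq`, `minus_eq` and `xor_eq` are introduced here for illustration); on integers these pairs invert each other exactly, while for floating point they only invert up to round-off.

```python
# Turning an irreversible map f into a reversible instruction on (y, x).

def plus_eq(f, y, x):   # forward:  y += f(x)
    return y + f(x), x

def minus_eq(f, y, x):  # inverse:  y -= f(x)
    return y - f(x), x

def xor_eq(f, y, x):    # self-inverse on bit patterns: y ⊻= f(x)
    return y ^ f(x), x
```

Applying `plus_eq` and then `minus_eq` with the same `f` and `x` restores `y`; applying `xor_eq` twice restores `y` as well, since XOR is its own inverse.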

3.2. Reversible memory management

A distinct feature of reversible memory management is that the content of a variable must be known when it is deallocated. We denote the allocation of a pre-emptied memory as x ← 0, and its inverse, deallocating a zero-emptied variable, as x → 0. An unknown variable cannot be deallocated, but it can be pushed to a stack and popped out later in the uncomputing stage. If a variable is allocated and deallocated in the local scope, we call it an ancilla. Listing 5 defines the complex-valued accumulative log function.

Listing 5: Reversible complex-valued log function y += log(|x|) + iArg(x).

@i @inline function (:+=)(log)(y!::Complex{T}, x::Complex{T}) where T
    n ← zero(T)
    n += abs(x)
    y!.re += log(n)
    y!.im += angle(x)
    n -= abs(x)
    n → zero(T)
end

Listing 6: Compute-copy-uncompute version of Listing 5.

@i @inline function (:+=)(log)(y!::Complex{T}, x::Complex{T}) where T
    @routine begin
        n ← zero(T)
        n += abs(x)
    end
    y!.re += log(n)
    y!.im += angle(x)
    ~@routine
end

Here, the macro @inline tells the compiler that this function can be inlined. One can input "←" and "→" by typing "\leftarrow[TAB KEY]" and "\rightarrow[TAB KEY]" respectively in a Julia editor or REPL. NiLang does not have immutable structs, so the real part y!.re and imaginary part y!.im of a complex number can be changed directly. It is easy to verify that the bottom two lines in the function body of Listing 5 are the inverse of the top two lines, i.e., the bottom two lines uncompute the top two lines. The motivation for uncomputing is to zero-clear the content of the ancilla n so that it can be deallocated correctly. Compute-copy-uncompute is such a useful design pattern in reversible programming that we created a pair of macros, @routine and ~@routine, for it. One can rewrite the above function as in Listing 6.
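The ancilla discipline of Listing 5 can be mimicked in any language. Below is a plain-Python sketch (not NiLang; `log_accum` is a hypothetical name): the ancilla `n` is computed, used, then uncomputed back to zero, which is exactly the condition for the `n → 0` deallocation to succeed.

```python
import cmath
import math

# Compute-use-uncompute on an ancilla: n must return to zero before
# it can be "deallocated".

def log_accum(y, x):
    n = 0.0                                     # allocate ancilla: n ← 0
    n += abs(x)                                 # compute
    y += complex(math.log(n), cmath.phase(x))   # use: y += log|x| + i·Arg(x)
    n -= abs(x)                                 # uncompute
    assert n == 0.0                             # n → 0 fails if n is dirty
    return y
```

Since `abs(x)` produces the same float both times, the subtraction restores `n` to exactly zero here; Appendix F discusses when round-off makes an ancilla dirty.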

3.3. Reversible control flows

One can define reversible if, for and while statements in a reversible program (Fig. 2). The reversible for statement is similar to the irreversible one, except that after execution the program asserts that the iterator is unchanged. To reverse this statement, one exchanges start and stop and flips the sign of step. Listing 11 computes the Fibonacci number recursively and reversibly.

Listing 11: Computing the Fibonacci number recursively and reversibly.

@i function rrfib(out!, n)
    @invcheckoff if (n >= 1, ~)
        counter ← 0
        counter += n
        while (counter > 1, counter != n)
            rrfib(out!, counter-1)
            counter -= 2
        end
        counter -= n % 2
        counter → 0
    end
    out! += 1
end

Here, out! is an integer initialized to 0 for storing the output. The precondition and postcondition are wrapped into a tuple. In the if statement, the postcondition is the same as the precondition, so we omit the postcondition by inserting a "~" in the second field, meaning "copy the precondition in this field as the postcondition". In the while statement, the postcondition is true only for the initial loop. Once the code is proven correct, one can turn off the reversibility check by adding @invcheckoff before a statement. This removes the reversibility check, making the code faster and compatible with GPU kernels (kernel functions cannot handle exceptions).

4. Reversible automatic differentiation

As in ForwardDiff (Revels et al., 2016), we use the operator overloading technique to differentiate the program efficiently. In the backward pass, we wrap each output variable in a composite type GVar containing an extra gradient field, and feed it into the reversed generic program. Instructions are dispatched to corresponding gradient instructions that update the gradient fields of GVars while uncomputing. By reversing this gradient program, we can obtain the gradient program for the reversed program too.
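The GVar mechanism can be illustrated for a single instruction. The following is a plain-Python sketch (not NiLang's actual implementation; `GVar` here is a minimal stand-in and `grad_muladd` is a hypothetical name): the gradient instruction for y += a*b uncomputes the value while accumulating adjoints.

```python
# A GVar carries a value x and a gradient g; the gradient instruction for
# y += a*b uncomputes y and back-propagates y's gradient to a and b.

class GVar:
    def __init__(self, x, g=0.0):
        self.x, self.g = x, g

def grad_muladd(y, a, b):
    """Reverse of the instruction y += a*b, with gradient accumulation."""
    y.x -= a.x * b.x      # uncompute the forward instruction
    a.g += y.g * b.x      # d(a*b)/da = b
    b.g += y.g * a.x      # d(a*b)/db = a
    return y, a, b
```

Running the forward instruction with a = 2, b = 3 gives y = 6; seeding y.g = 1 and running `grad_muladd` restores y to 0 while leaving a.g = 3 and b.g = 2, the two partial derivatives.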
One can define the adjoint ("adjoint" here means the program for back-propagating gradients) of a primitive instruction as a reversible function on either the function itself or its reverse, since the adjoint of a function's reverse is equivalent to the reverse of the function's adjoint:

$\overline{f} : (x, g_x) \to (y, g_x^T \frac{\partial x}{\partial y}),$    (3)

$\overline{f^{-1}} : (y, g_y) \to (x, g_y^T \frac{\partial y}{\partial x}).$

This can easily be verified by applying the above two mappings consecutively, which turns out to be an identity mapping considering $\frac{\partial y}{\partial x} \frac{\partial x}{\partial y} = 1$. The implementation details are described in Appendix C. In most languages, operator overloading is accompanied by significant overheads from function calls and object allocation and deallocation. But in a language with type inference and just-in-time compilation like Julia, the boundary between the two approaches is vague: the compiler inlines small functions, packs an array of constant-sized immutable objects into contiguous memory, and truncates unnecessary branches automatically.

4.2. Hessians

Combining forward-mode AD and reverse-mode AD is a simple yet efficient way to obtain Hessians. By wrapping the elementary type with Dual, defined in the package ForwardDiff, and feeding it into the gradient program defined in NiLang, one obtains one row/column of the Hessian matrix. We use this approach to compute Hessians in the graph embedding benchmark in Appendix D.2.
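The forward-over-reverse idea can be sketched without either package. The following plain-Python illustration (a minimal `Dual` stand-in, not ForwardDiff's type; `grad_f` is a hand-written gradient for the toy function f(x, y) = x²y) pushes dual numbers through a gradient program, so the derivative components of the gradient give one row of the Hessian.

```python
# Minimal forward-mode dual number: value a, derivative component b.
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        return Dual(self.a + o.a, self.b + o.b)
    def __mul__(self, o):
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)

def grad_f(x, y):
    """Gradient of f(x, y) = x*x*y, written with overloaded arithmetic."""
    two = Dual(2.0)
    return two * x * y, x * x      # (df/dx, df/dy) = (2xy, x^2)

# Seed the x-direction: the .b fields of the gradient are the Hessian row
# (d2f/dx2, d2f/dxdy) = (2y, 2x).
gx, gy = grad_f(Dual(3.0, 1.0), Dual(5.0))
```

At (x, y) = (3, 5), the gradient values are (30, 9) and the dual components are (10, 6), matching 2y and 2x.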

4.3. CUDA kernels

CUDA programming plays a significant role in high-performance computing. In Julia, one can write GPU-compatible functions in native Julia with KernelAbstractions (Besard et al., 2017). Since NiLang does not push variables onto a stack automatically for the user, it is safe to write differentiable GPU kernels with NiLang. We differentiate CUDA kernels with no more than 10 extra lines in the bundle adjustment benchmark in Sec. 5.

5. Benchmarks

We benchmark our framework against state-of-the-art GP-AD frameworks, including the source-code-transformation based Tapenade and Zygote, and the operator-overloading based ForwardDiff and ReverseDiff. Since most tensor-based AD software, such as the famous TensorFlow and PyTorch, is not designed for the use cases in our benchmarks, we do not include those packages, to avoid an unfair comparison. In the following benchmarks, the CPU device is an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, and the GPU device is an NVIDIA Titan V. For the NiLang benchmarks, we have turned the reversibility check off to achieve better performance. We reproduced the benchmarks for the Gaussian mixture model (GMM) and bundle adjustment in Srajer et al. (2018) by re-writing the programs in a reversible style. We show the results in Fig. 3. The Tapenade data is obtained by executing the docker file provided by the original benchmark, which provides a baseline for comparison.

[Figure 3: GMM (panel a, x-axis: number of parameters) and bundle adjustment (panel b) benchmarks; curves include Julia-O, NiLang-O, Tapenade-O, ForwardDiff-J, NiLang-J, Tapenade-J and NiLang-J (GPU).]

NiLang's objective function is ∼2× slower than the normal code due to the uncomputing overhead. In this case, NiLang does not show an advantage over Tapenade in obtaining gradients: the ratios between computing the gradients and the objective function are close. The memory usages are

$S = (2 + d^2)k + 2d + P, \quad S_r = (3 + d^2 + d)k + 2\log_2 k + P,$

where d and k are the size and number of covariance matrices, and $P = \frac{d(d+1)}{2}k + k + dk$ is the size of the parameter space. The memory of the dataset (d × N) is not included because it scales with N. Due to the hardness of estimating peak memory usage, the Tapenade data is missing here. The ForwardDiff memory usage is approximately the original size times the batch size, where the batch size is 12 by default. In the bundle adjustment benchmark, NiLang performs best on CPU.
We also compiled our adjoint program to GPU with no more than 10 lines of code using KernelAbstractions, which provides another ∼200× speed-up. Parallelizing the adjoint code requires that the forward code does not read the same variable simultaneously in different threads, and this requirement is satisfied here. The peak memory of the original program and the reversible program are both equal to the size of the parameter space, because all "allocations" happen in registers in this application. One can find more benchmarks in Appendix D, including differentiating a sparse matrix dot product and obtaining Hessians in the graph embedding application.

6. Discussion

In this work, we demonstrate a new approach to back-propagating through a program, called reverse computing AD, by designing a reversible eDSL, NiLang. NiLang is a powerful tool to differentiate code at the source-code level, and can be directly useful to machine learning research. It can generate efficient backward rules, as exemplified in Appendix E. It can also be used to differentiate reversible neural networks such as normalizing flows (Kobyzev et al., 2019) to save memory, e.g., back-propagating the NICE network (Dinh et al., 2014) with only constant space overhead. NiLang is most useful for solving large-scale scientific problems memory-efficiently. In Liu et al. (2020), people solve the ground state problem of a 28 × 28 square-lattice spin glass by re-writing the quantum simulator with NiLang. There are some challenges in reverse computing AD too.

• The native BLAS and convolution operations in NiLang are not optimized for the memory layout, and are too slow compared with state-of-the-art machine learning libraries. We need better implementations of these functions in the reversible programming context so that it can be more useful for training traditional deep neural networks.

• Although we show some examples of training neural networks on GPU, shared reading of a variable is not allowed.

• NiLang's IR does not have variable analysis. The uncomputing pass is not always necessary for the irreversible host language to deallocate memory; in many cases the host language's variable analysis can figure this out, but it is not guaranteed.

Another interesting issue is how to make use of reversible computing devices to save energy in machine learning. Reversible computing is not always more energy-efficient than irreversible computing. In the time-space trade-off scheme in Sec. 2, we show that the time to uncompute a unit of memory is exponential in n as $Q_n = (2k-1)^n$, and the computing energy also increases exponentially.
On the other hand, the amount of energy to erase a unit of memory is constant. When $(2k-1)^n > 1/\xi$, erasing the memory irreversibly is more energy-efficient, where ξ is the energy ratio between a reversible operation (an instruction or a gate) and its irreversible counterpart.

B.1 Manipulating in-place functions on arrays

Listing 13: In-place affine transformation.

@i function i_affine!(y!::AbstractVector{T}, W::AbstractMatrix{T}, b::AbstractVector{T}, x::AbstractVector{T}) where T
    @safe @assert size(W) == (length(y!), length(x)) && length(b) == length(y!)
    @invcheckoff for j=1:size(W, 2)
        for i=1:size(W, 1)
            @inbounds y![i] += W[i,j]*x[j]
        end
    end
    @invcheckoff for i=1:size(W, 1)
        @inbounds y![i] += b[i]
    end
end

Here, the expression following the @safe macro is an external irreversible statement.
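The reversibility of i_affine! comes entirely from the += accumulation: running the same loops with -= restores y! exactly on exact number systems. A plain-Python sketch of this inverse pair (illustrative only; `affine_fwd` and `affine_rev` are hypothetical names):

```python
# Forward: y[i] += sum_j W[i][j]*x[j], then y[i] += b[i].
def affine_fwd(y, W, b, x):
    for j in range(len(x)):
        for i in range(len(y)):
            y[i] += W[i][j] * x[j]
    for i in range(len(y)):
        y[i] += b[i]

# Reverse: the statements in reverse order with += flipped to -=.
# (Since the additions commute, any order would also restore y exactly
# on integers or fixed-point numbers.)
def affine_rev(y, W, b, x):
    for i in range(len(y)):
        y[i] -= b[i]
    for j in range(len(x)):
        for i in range(len(y)):
            y[i] -= W[i][j] * x[j]
```

Applying `affine_fwd` and then `affine_rev` with the same W, b, x returns y to its initial content without any snapshot of y being taken.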

B.2 Utilizing reversibility

Reverse computing can utilize reversibility to trace back states without extra memory cost. For example, we can define the unitary matrix multiplication used in a type of memory-efficient recurrent neural network (Jing et al., 2016).

Listing 14: Two-level decomposition of a unitary matrix.

@i function i_umm!(x!::AbstractArray, θ)
    M ← size(x!, 1)
    N ← size(x!, 2)
    k ← 0
    @safe @assert length(θ) == M*(M-1)/2
    for l = 1:N
        for j = 1:M
            for i = M-1:-1:j
                INC(k)
                ROT(x![i,l], x![i+1,l], θ[k])
            end
        end
    end
    k → length(θ)
end
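The reason no memory is needed is that each ROT is itself invertible: a Givens rotation by θ is exactly undone by a rotation by -θ. A plain-Python sketch of this (not NiLang's ROT instruction; `rot` is a hypothetical name):

```python
import math

# A Givens rotation on a pair of entries; its inverse is rot(.., -theta),
# so reversing the program needs no saved state.
def rot(a, b, theta):
    c, s = math.cos(theta), math.sin(theta)
    return c * a - s * b, s * a + c * b

rotated = rot(0.6, 0.8, 0.3)       # forward ROT
restored = rot(*rotated, -0.3)     # reverse ROT: no snapshot was taken
# restored recovers (0.6, 0.8) up to floating-point round-off
```

Because the rotation is orthogonal, it also preserves the norm of the pair, which is what makes the decomposed matrix unitary.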

B.3 Encourages reversible thinking

Last but not least, reversible programming encourages users to code in a memory-friendly style. Since allocations in reversible programming are explicit, programmers have the flexibility to control how memory is allocated and which number system to use. For example, to compute the power of a positive fixed-point number and an integer, one can easily write irreversible code as in Listing 15.

Listing 15: A regular power function.

Since the fixed-point number is not reversible under *=, naive checkpointing would require stack operations inside a loop. With reversible thinking, we can convert the fixed-point number to a logarithmic number to utilize the reversibility of *=, as shown in Listing 16. Here, the algorithm to convert a regular fixed-point number to a logarithmic number can be efficient (Turner, 2010).

C Implementation of AD in NiLang

To back-propagate the program, we first reverse the code through source code transformation and then insert the gradient code through operator overloading. If we inline all the functions in Listing 6, the function body would look like Listing 17. The automatically generated inverse program (i.e., (y, x) → (y - log(x), x)) would look like Listing 18.

Listing 17: The inlined function body of Listing 6.

@routine begin
    nsq ← zero(T)
    n ← zero(T)
    nsq += x[i].re ^ 2
    nsq += x[i].im ^ 2
    n += sqrt(nsq)
end
y![i].re += log(n)
y![i].im += atan(x[i].im, x[i].re)
~@routine

Listing 18: The inverse of Listing 17.

@routine begin
    nsq ← zero(T)
    n ← zero(T)
    nsq += x[i].re ^ 2
    nsq += x[i].im ^ 2
    n += sqrt(nsq)
end
y![i].re -= log(n)
y![i].im -= atan(x[i].im, x[i].re)
~@routine

To compute the adjoint of the computational process in Listing 17, one simply inserts the gradient code into its inverse in Listing 18. The resulting inlined code is shown in Listing 19.

Listing 19: Inserting the gradient code into Listing 18; the original computational processes are highlighted in yellow background.
y![i].re.x -= log(n.x)
n.g += y![i].re.g / n.x
y![i].im.x -= atan(x[i].im.x, x[i].re.x)
@zeros T xy2 jac_x jac_y
xy2 += abs2(x[i].re.x)
xy2 += abs2(x[i].im.x)
jac_y += x[i].re.x / xy2
jac_x += (-x[i].im.x) / xy2
x[i].im.g += y![i].im.g * jac_y
x[i].re.g += y![i].im.g * jac_x
jac_x -= (-x[i].im.x) / xy2
jac_y -= x[i].re.x / xy2
xy2 -= abs2(x[i].im.x)
xy2 -= abs2(x[i].re.x)
~@zeros T xy2 jac_x jac_y
~@routine

Here, @zeros TYPE var1 var2... is the macro to allocate multiple variables of the same type. Its inverse, starting with ~@zeros, deallocates zero-emptied variables. In practice, "inserting gradients" is not achieved by source code transformation but by operator overloading: we change the element type to a composite type GVar with two fields, value x and gradient g. By dispatching primitive instructions on this new type, values and gradients are updated simultaneously. Although the code looks much longer, the computing time (with the reversibility check turned off) is not.

Listing 20: Time and allocations to differentiate the complex-valued log.

julia> using NiLang, NiLang.AD, BenchmarkTools

julia> @inline function (ir_log)(x::Complex{T}) where T
           log(abs(x)) + im*angle(x)
       end

julia> @btime ir_log(x) setup=(x=1.0+1.2im);  # native code
  30.097 ns (0 allocations: 0 bytes)

julia> @btime (@instr y += log(x)) setup=(x=1.0+1.2im; y=0.0+0.0im);  # reversible code
  17.542 ns (0 allocations: 0 bytes)

julia> @btime (@instr ~(y += log(x))) setup=(x=GVar(1.0+1.2im, 0.0+0.0im); y=GVar(0.1+0.2im, 1.0+0.0im));  # adjoint code
  25.932 ns (0 allocations: 0 bytes)


The performance is unreasonably good because the generated Julia code is further compiled to LLVM, where it enjoys existing optimization passes. For example, the optimization passes can figure out that, for an irreversible device, uncomputing the local variables n and nsq to zero does not affect the return values, so the uncomputing process is dropped automatically. Unlike checkpointing-based approaches, which invest heavily in optimizing data caching on a global stack, NiLang does not have any optimization pass of its own; instead, it hands itself to the existing optimization passes in Julia. Without accessing a global stack, NiLang's code is quite friendly to these passes. In this case, we also see that the boundary between source code transformation and operator overloading can be vague in Julia, in that the generated code can be very different from how it looks. The adjoint functions for the primitive instructions (:+=)(sqrt) and (:-=)(sqrt) used above can be defined as in Listing 21.

Listing 21: Adjoints for the primitives (:+=)(sqrt) and (:-=)(sqrt).

@i @inline function (:-=)(sqrt)(out!::GVar, x::GVar{T}) where T
    @routine @invcheckoff begin
        @zeros T a b
        a += sqrt(x.

D.2 Graph embedding problem

Graph embedding can be used to find a proper representation for an order parameter (Takahashi & Sandvik, 2020) in condensed matter physics. We want to find the minimum Euclidean space dimension k into which a Petersen graph can be embedded, such that the distances between connected pairs of vertices are l_1 and the distances between disconnected pairs of vertices are l_2, with l_2 > l_1. The Petersen graph is shown in Fig. 5. Let us denote the sets of connected and disconnected vertex pairs as L_1 and L_2, respectively. This problem can be solved variationally with the following loss:

$\mathcal{L} = \mathrm{Var}(\mathrm{dist}(L_1)) + \mathrm{Var}(\mathrm{dist}(L_2))$
$\qquad + \exp\left(\mathrm{relu}\left(\overline{\mathrm{dist}(L_1)} - \overline{\mathrm{dist}(L_2)} + 0.1\right)\right) - 1$    (7)

The first line is a summation of the distance variances in the two sets of vertex pairs, where Var(X) is the variance of the samples in X. The second line is used to guarantee l_2 > l_1, where $\overline{X}$ means taking the average of the samples in X. A reversible implementation can be found in our benchmark repository. We repeat the training for dimension k from 1 to 10. In each training, we fix two of the vertices and optimize the positions of the rest; otherwise, the program would find the trivial solution with overlapping vertices. For k < 5, the loss is always much higher than 0, while for k ≥ 5, we can get a loss close to machine precision with high probability. From the k = 5 solution, it is easy to see that $l_2/l_1 = \sqrt{2}$. An Adam optimizer with learning rate 0.01 (Kingma & Ba) requires ∼2000 training steps. The trust-region Newton's method converges much faster, requiring only ∼20 Hessian computations to reach convergence. Although the training times are comparable, the converged precision of the latter is much better. Since one can combine ForwardDiff and NiLang to obtain Hessians, it is interesting to see how much performance we can get in differentiating the graph embedding program. In Table 3, we show the performance of different implementations, varying the dimension k. The number of parameters is 10k.
As the baseline, (a) shows the time for computing the function call. We have reversible and irreversible implementations, where the reversible program is slower


We first import the norm function from the Julia standard library LinearAlgebra. Zygote's builtin AD engine generates slow code with a memory allocation of 339 KB. Then we write a reversible norm function r_norm with NiLang and port the backward function to Zygote by specifying the backward rule (the function marked with the macro Zygote.@adjoint). Besides the speedup in computing time, the memory allocation also decreases to 23 KB, which equals the sum of the original x and the array used in backpropagation: (1000 × 8 + 1000 × 8 × 2)/1024 ≈ 23. The latter has doubled size because GVar has an extra gradient field.
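The rule being ported is simply ∂‖x‖/∂x = x/‖x‖. A Python sketch (ours, not the Zygote/NiLang code) of such a hand-written backward rule, checked against finite differences:

```python
import math

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def norm_pullback(x, gout):
    """Hand-derived backward rule d||x||/dx = x/||x||, scaled by the
    incoming adjoint gout -- the kind of rule one registers as an adjoint."""
    n = norm(x)
    return [gout * v / n for v in x]

x = [3.0, 4.0]
grad = norm_pullback(x, 1.0)       # analytic gradient

eps = 1e-6                         # finite-difference check of the rule
fd = []
for i in range(len(x)):
    xp = list(x)
    xp[i] += eps
    fd.append((norm(xp) - norm(x)) / eps)
```

A hand-written (or NiLang-generated) rule like this avoids the generic AD machinery entirely, which is where the speedup comes from.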

F A benchmark of round-off error in leapfrog

Running reversible programs with the floating-point number system can introduce round-off errors and make the program not exactly reversible. To quantify the effects, we use the leapfrog integrator to compute the orbits of planets in our solar system as a benchmark. The leapfrog iterations can be represented as below, where G is the gravitational constant and m_j is the mass of the jth planet; x, v and a are the location, velocity and acceleration, respectively. The first value of the velocity is v_{1/2} = a_0 ∆t/2. Since the dynamics of our solar system are symplectic and the leapfrog integrator is time-reversible, the reversible program does not have overheads, and the evolution time can be arbitrarily long with constant memory. We compare the mean error in the final axes of the planets and show the results in Fig. 6. Errors are computed by comparing with results computed with high-precision floating-point numbers. One of the key steps that introduces round-off error is the computation of the acceleration. If it is implemented as in Listing 25, the round-off error does not bring additional effects in the reversible context; hence we see the overlapping lines "(regular)" and "(reversible)" in the figure. This is because, when returning a dirty ancilla (one not exactly zero-cleared due to floating-point round-off error) to the ancilla pool, the small remaining value will be zero-cleared automatically in NiLang. The acceleration function can also be implemented as in Listing 26, where the same variable rb is repeatedly used for compute and uncompute, so the error accumulates on this variable. In both Float64 (double-precision) and Float32 (single-precision) benchmarks, the results show a much lower precision. Hence, simulating reversible programming with floating-point numbers does not necessarily make the results less reliable, as long as one avoids cumulative errors in the implementation.
\[
a_i = G \sum_{j \neq i} \frac{m_j\,(x_j - x_i)}{\lVert x_i - x_j \rVert^3} \tag{8}
\]
\[
v_{i+1/2} = v_{i-1/2} + a_i\,\Delta t \tag{9}
\]
\[
x_{i+1} = x_i + v_{i+1/2}\,\Delta t
\]

Listing 25: Compute the acceleration. Compute and uncompute on the ancilla rc.

```julia
@i function :(+=)(acceleration)(y!::V3{T}, ra::V3{T}, rb::V3{T}, mb::Real, G) where T
    @routine @invcheckoff begin
        @zeros T d anc1 anc2 anc3 anc4
        # (listing truncated in the source)
```
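The time-reversibility that makes the constant-memory simulation possible can be demonstrated with a plain-Python leapfrog step (our sketch; we substitute a toy harmonic force for planetary gravity):

```python
def accel(x):
    # toy harmonic force standing in for the paper's planetary gravity
    return -x

def leapfrog(x, v, dt, n):
    """Kick-drift-kick leapfrog; it is algebraically time-reversible,
    so rerunning it with -dt retraces the trajectory."""
    for _ in range(n):
        v += accel(x) * dt / 2
        x += v * dt
        v += accel(x) * dt / 2
    return x, v

x1, v1 = leapfrog(1.0, 0.0, 0.01, 1000)
xr, vr = leapfrog(x1, v1, -0.01, 1000)   # reverse run recovers the start
```

The reverse run recovers the initial state up to floating-point round-off, which is exactly the error source the benchmark quantifies.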

G.2 Operational Semantics

The following operational semantics for the forward and backward evaluation show how a statement is evaluated and reversed.

\[
\frac{\sigma_P,\ s \Downarrow_p^{-1} \sigma_P'}{\sigma_P,\ \sim s \Downarrow_p \sigma_P'}\ \textsc{Uncompute}
\]
\[
\frac{\sigma_P,\ s_1 \Downarrow_p \sigma_P' \qquad \sigma_P',\ \mathtt{begin}\ s\ \mathtt{end} \Downarrow_p \sigma_P'' \qquad \sigma_P'',\ s_1 \Downarrow_p^{-1} \sigma_P'''}{\sigma_P,\ \mathtt{@routine}\ s_1;\ s;\ \sim\mathtt{@routine} \Downarrow_p \sigma_P'''}\ \textsc{Compute-Copy-Uncompute}
\]
\[
\frac{\sigma_P,\ \mathtt{@routine}\ s_1;\ \sim\mathtt{begin}\ s\ \mathtt{end};\ \sim\mathtt{@routine} \Downarrow_p \sigma_P'}{\sigma_P,\ \mathtt{@routine}\ s_1;\ s;\ \sim\mathtt{@routine} \Downarrow_p^{-1} \sigma_P'}\ \textsc{Compute-Copy-Uncompute}^{-1}
\]
\[
\frac{\sigma_P,\ d_i \Downarrow_{get} v_i \qquad \varnothing[x_1 \to v_1 \cdots x_n \to v_n],\ (x_1 \cdots x_n) = x_f(x_1 \cdots x_n) \Downarrow_e \sigma_{P_0}[x_1 \to v_1' \cdots x_n \to v_n'] \qquad \sigma_{P_{i-1}},\ v_i',\ d_i \Downarrow_{set} \sigma_{P_i}}{\sigma_P,\ x_f(d_1 \cdots d_n) \Downarrow_p \sigma_{P_n}}\ \textsc{Call}
\]
\[
\frac{\sigma_P,\ (\sim x_f)(d_1 \cdots d_n) \Downarrow_p \sigma_P'}{\sigma_P,\ x_f(d_1 \cdots d_n) \Downarrow_p^{-1} \sigma_P'}\ \textsc{Call}^{-1}
\]



this instruction set is extensible. One can define a reversible multiplier in NiLang as in Listing 4.

Listing 4: A reversible multiplier

```julia
julia> using NiLang

julia> @i function multiplier(y!::Real, a::Real, b::Real)
           y! += a * b
       end

julia> multiplier(2, 3, 5)
(17, 3, 5)

julia> (~multiplier)(17, 3, 5)
(2, 3, 5)
```
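The same behavior can be sketched in plain Python (our illustration, mirroring the Listing 4 session): a reversible instruction returns all of its arguments, and its inverse undoes the update.

```python
def multiplier(y, a, b):
    # forward instruction: y! += a * b
    return y + a * b, a, b

def inv_multiplier(y, a, b):
    # inverse instruction: y! -= a * b
    return y - a * b, a, b
```

Running the inverse on the forward output recovers the original arguments, exactly as `(~multiplier)` does in the listing.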

Figure 2: The flow charts for the reversible (a) if statement and (b) while statement. "pre" and "post" represent the precondition and postcondition, respectively. Assertion errors are thrown to the host language instead of being handled in NiLang.

Fig. 2(b) shows the flow chart of the reversible while statement. It also has two condition expressions. Before the loop starts, the program presumes the postcondition is false. After each iteration, the program asserts the postcondition to be true. To reverse this statement, one exchanges the precondition and postcondition and reverses the body statements. The pseudo-code for the forward and backward passes is shown in Listing 9 and Listing 10.
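A concrete Python sketch (ours) of this scheme, using a loop of the form while (x != 0, cnt != 0) to move a count from x to cnt:

```python
def forward(x, cnt):
    """while (x != 0, cnt != 0): the precondition guards the loop; the
    postcondition must be false before and true after each iteration."""
    assert cnt == 0
    while x != 0:
        x -= 1
        cnt += 1
        assert cnt != 0
    return x, cnt

def backward(x, cnt):
    """Reverse: pre/postconditions exchanged, body inverted."""
    assert x == 0
    while cnt != 0:
        cnt -= 1
        x += 1
        assert x != 0
    return x, cnt
```

Because the postcondition becomes true exactly once per iteration, the reversed loop runs the same number of times and restores the inputs.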

Figure 3: Absolute runtimes in seconds for computing the objective (-O) and Jacobians (-J). (a) GMM with 10k data points; the loss function has a single output, hence computing the Jacobian is the same as computing the gradient. The ForwardDiff data is missing because it did not finish within the time limit. The NiLang GPU data is missing because we did not write a GPU kernel for this case. (b) Bundle adjustment.

Figure 4: Peak memory of running the original and the reversible GMM program. The labels are (d, k) pairs.

```julia
function mypower(x::T, n::Int) where T
    y = one(T)
    for i = 1:n
        y *= x
    end
    return y
end
```

Listing 16: A reversible power function.

```julia
@i function mypower(out, x::T, n::Int) where T
    if (x != 0, ~)
        @routine begin
            ly ← one(ULogarithmic{T})
            lx ← one(ULogarithmic{T})
            lx *= convert(x)
            for i = 1:n
                ly *= lx
            end
        end
        out += convert(ly)
        ~@routine
    end
end
```
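The ULogarithmic trick makes multiplication reversible by turning it into addition in log space. A plain-Python sketch of the same idea (our illustration; names are ours, and we restrict to x > 0 for simplicity):

```python
import math

def mypower(out, x, n):
    """out += x^n via reversible log-space accumulation: the ancilla ly is
    built up by additions, copied out, then uncomputed back to ~zero."""
    assert x > 0
    ly = 0.0
    for _ in range(n):
        ly += math.log(x)      # compute: reversible additions
    out += math.exp(ly)        # copy the result out
    for _ in range(n):
        ly -= math.log(x)      # uncompute: ly returns to (nearly) zero
    assert abs(ly) < 1e-9      # the ancilla is clean up to round-off
    return out
```

The uncompute pass leaves only round-off residue on the ancilla, matching the dirty-ancilla discussion in Appendix F.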

We compare the call, uncall, and backward-propagation times for the sparse matrix dot product and matrix multiplication.

Figure 5: The Petersen graph has 10 vertices and 15 edges. We want to find a minimum embedding dimension for it.

Figure 6: Round-off errors in the final axes of planets as a function of the number of time steps. "(regular)" means an irreversible program, "(reversible)" means a reversible program (Listing 25), and "(cumulative)" means a reversible program with the acceleration computed with cumulative errors (Listing 26).

```
Statements           s ::= ∼s | e(d*) | x ← e | x → e
                         | @routine s; s*; ∼@routine
                         | if (e, e) s* else s* end
                         | while (e, e) s* end
                         | for x = e:e:e s* end
                         | begin s* end
Data views           d ::= d.x | d[e] | d |> e | c | x
Reversible functions p ::= @i function x(x*) s* end
```

Here, |> is the pipe operator in Julia, e is a reversible function, and d |> e represents a bijection of d. Function arguments are data views, where a data view is a modifiable memory location: a variable, a field of a data view, an array element of a data view, or a bijection of a data view.
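The notion of a data view can be sketched in Python (our illustration; the names are ours): a modifiable memory location is a getter/setter pair, and field access or indexing composes into such a pair.

```python
class View:
    """A data view: a modifiable memory location as a getter/setter pair."""
    def __init__(self, get, set_):
        self.get, self.set = get, set_

def element(lst, i):
    # view of an array element, like d[e]
    return View(lambda: lst[i], lambda v: lst.__setitem__(i, v))

def field(obj, name):
    # view of a struct field, like d.x
    return View(lambda: getattr(obj, name), lambda v: setattr(obj, name, v))

mem = [1, 2, 3]
d = element(mem, 1)
d.set(d.get() + 40)   # updating through the view mutates the root memory
```

This is why a reversible instruction can "return its outputs into" its arguments: each argument names a writable location, not a copied value.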

Computing the gradient here is similar to forward-mode automatic differentiation, which computes \(\frac{\partial\,[\text{multiple outputs}]}{\partial\,[\text{single input}]}\). Inspired by the Julia package ForwardDiff

Their reversible implementations are shown in Listing 22 and Listing 23. The computing time for backward propagation is approximately 1.5-3 times that of Julia's native forward pass, which is close to the theoretically optimal performance.

Absolute runtimes in seconds for computing the objectives (O) and the backward pass (B) of sparse matrix operations. The matrix size is 1000 × 1000, and the element density is 0.05. The total time for computing gradients can be estimated by summing "O" and "B".

Table 3: Absolute times in seconds for computing the objectives (O), uncalling the objective (U), gradients (G) and Hessians (H) of the graph embedding program. k is the embedding dimension; the number of parameters is 10k.

than the irreversible native Julia program by a factor of ∼2 due to the uncomputing overhead. The reversible program shows the advantage of obtaining gradients when the dimension k ≥ 3. The larger the number of inputs, the more advantage it shows, because the overhead of forward-mode AD is proportional to the input size. The same reason applies to computing Hessians, where the combination of NiLang and ForwardDiff gives the best performance for k ≥ 3.

E Porting NiLang to Zygote

Zygote is a popular machine learning package in Julia. We can port NiLang's automatically generated backward rules to Zygote to accelerate some performance-critical functions. The following example shows how to speed up the backward propagation of norm by ∼50 times.

σ_P[x → v]: the environment with x's value equal to v
σ_P[x → nothing]: the environment with x undefined
σ_P, e ⇓_e v: the Julia expression e under environment σ_P is interpreted as the value v
σ_P, s ⇓_p σ_P′: the evaluation of a statement s under environment σ_P generates the environment σ_P′
σ_P, s ⇓_p^{-1} σ_P′: the reverse evaluation of a statement s under environment σ_P generates the environment σ_P′

\[
\frac{\sigma_P,\ e \Downarrow_e v}{\sigma_P[x \to \mathrm{nothing}],\ x \leftarrow e \Downarrow_p \sigma_P[x \to v]}\ \textsc{Ancilla}
\qquad
\frac{\sigma_P,\ e \Downarrow_e v}{\sigma_P[x \to v],\ x \rightarrow e \Downarrow_p \sigma_P[x \to \mathrm{nothing}]}
\]
\[
\frac{\sigma_P,\ x \leftarrow e \Downarrow_p \sigma_P'}{\sigma_P,\ x \rightarrow e \Downarrow_p^{-1} \sigma_P'}\ \textsc{Ancilla}^{-1}
\qquad
\frac{\sigma_P,\ x \rightarrow e \Downarrow_p \sigma_P'}{\sigma_P,\ x \leftarrow e \Downarrow_p^{-1} \sigma_P'}
\]
\[
\frac{\sigma_P,\ s_1 \Downarrow_p \sigma_P' \qquad \sigma_P',\ \mathtt{begin}\ s_2 \cdots s_n\ \mathtt{end} \Downarrow_p \sigma_P''}{\sigma_P,\ \mathtt{begin}\ s_1 \cdots s_n\ \mathtt{end} \Downarrow_p \sigma_P''}\ \textsc{Block}
\qquad
\sigma_P,\ \mathtt{begin}\ \mathtt{end} \Downarrow_p \sigma_P
\]
\[
\frac{\sigma_P,\ s_n \Downarrow_p^{-1} \sigma_P' \qquad \sigma_P',\ \mathtt{begin}\ s_1 \cdots s_{n-1}\ \mathtt{end} \Downarrow_p^{-1} \sigma_P''}{\sigma_P,\ \mathtt{begin}\ s_1 \cdots s_n\ \mathtt{end} \Downarrow_p^{-1} \sigma_P''}\ \textsc{Block}^{-1}
\qquad
\sigma_P,\ \mathtt{begin}\ \mathtt{end} \Downarrow_p^{-1} \sigma_P
\]
\[
\frac{\begin{array}{c}\sigma_P, e_1 \Downarrow_e n_1 \quad \sigma_P, e_2 \Downarrow_e n_2 \quad \sigma_P, e_3 \Downarrow_e n_3 \quad (n_1 \le n_3) == (n_2 > 0) \\ \sigma_P[x \to n_1],\ s \Downarrow_p \sigma_P' \qquad \sigma_P',\ \mathtt{for}\ x = e_1{+}e_2:e_2:e_3\ s\ \mathtt{end} \Downarrow_p \sigma_P''\end{array}}{\sigma_P,\ \mathtt{for}\ x = e_1:e_2:e_3\ s\ \mathtt{end} \Downarrow_p \sigma_P''}\ \textsc{For}
\]
\[
\frac{\sigma_P, e_1 \Downarrow_e n_1 \quad \sigma_P, e_2 \Downarrow_e n_2 \quad \sigma_P, e_3 \Downarrow_e n_3 \quad (n_1 \le n_3) \ne (n_2 > 0)}{\sigma_P,\ \mathtt{for}\ x = e_1:e_2:e_3\ s\ \mathtt{end} \Downarrow_p \sigma_P}\ \textsc{For-Exit}
\]
\[
\frac{\sigma_P,\ \mathtt{for}\ x = e_3:-e_2:e_1\ \sim\mathtt{begin}\ s\ \mathtt{end}\ \mathtt{end} \Downarrow_p \sigma_P'}{\sigma_P,\ \mathtt{for}\ x = e_1:e_2:e_3\ s\ \mathtt{end} \Downarrow_p^{-1} \sigma_P'}\ \textsc{For}^{-1}
\]
\[
\frac{\sigma_P, e_1 \Downarrow_e \mathrm{true} \quad \sigma_P, s_1 \Downarrow_p \sigma_P' \quad \sigma_P', e_2 \Downarrow_e \mathrm{true}}{\sigma_P,\ \mathtt{if}\ (e_1, e_2)\ s_1\ \mathtt{else}\ s_2\ \mathtt{end} \Downarrow_p \sigma_P'}\ \textsc{If-T}
\qquad
\frac{\sigma_P, e_1 \Downarrow_e \mathrm{false} \quad \sigma_P, s_2 \Downarrow_p \sigma_P' \quad \sigma_P', e_2 \Downarrow_e \mathrm{false}}{\sigma_P,\ \mathtt{if}\ (e_1, e_2)\ s_1\ \mathtt{else}\ s_2\ \mathtt{end} \Downarrow_p \sigma_P'}\ \textsc{If-F}
\]
\[
\frac{\sigma_P,\ \mathtt{if}\ (e_2, e_1)\ \sim\mathtt{begin}\ s_1\ \mathtt{end}\ \mathtt{else}\ \sim\mathtt{begin}\ s_2\ \mathtt{end}\ \mathtt{end} \Downarrow_p \sigma_P'}{\sigma_P,\ \mathtt{if}\ (e_1, e_2)\ s_1\ \mathtt{else}\ s_2\ \mathtt{end} \Downarrow_p^{-1} \sigma_P'}\ \textsc{If}^{-1}
\]
\[
\frac{\sigma_P, e_1 \Downarrow_e \mathrm{true} \quad \sigma_P, e_2 \Downarrow_e \mathrm{false} \quad \sigma_P, s \Downarrow_p \sigma_P' \quad \sigma_P', (e_1, e_2, s) \Downarrow_{loop} \sigma_P''}{\sigma_P,\ \mathtt{while}\ (e_1, e_2)\ s\ \mathtt{end} \Downarrow_p \sigma_P''}\ \textsc{While}
\]
\[
\frac{\sigma_P, e_2 \Downarrow_e \mathrm{true} \quad \sigma_P, e_1 \Downarrow_e \mathrm{true} \quad \sigma_P, s \Downarrow_p \sigma_P' \quad \sigma_P', (e_1, e_2, s) \Downarrow_{loop} \sigma_P''}{\sigma_P,\ (e_1, e_2, s) \Downarrow_{loop} \sigma_P''}\ \textsc{While-Rec}
\qquad
\frac{\sigma_P, e_2 \Downarrow_e \mathrm{true} \quad \sigma_P, e_1 \Downarrow_e \mathrm{false}}{\sigma_P,\ (e_1, e_2, s) \Downarrow_{loop} \sigma_P}\ \textsc{While-Exit}
\]
\[
\frac{\sigma_P,\ \mathtt{while}\ (e_2, e_1)\ \sim\mathtt{begin}\ s\ \mathtt{end}\ \mathtt{end} \Downarrow_p \sigma_P'}{\sigma_P,\ \mathtt{while}\ (e_1, e_2)\ s\ \mathtt{end} \Downarrow_p^{-1} \sigma_P'}\ \textsc{While}^{-1}
\]
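A tiny Python interpreter (our sketch, not NiLang's implementation) can animate the Ancilla and Block rules: inverting a statement swaps allocation with deallocation and += with -=, and reverses blocks.

```python
def invert(stmt):
    """Statement inversion: alloc <-> dealloc, add <-> sub, blocks reversed."""
    op = stmt[0]
    if op == "block":
        return ("block", [invert(s) for s in reversed(stmt[1])])
    flip = {"alloc": "dealloc", "dealloc": "alloc", "add": "sub", "sub": "add"}
    return (flip[op],) + stmt[1:]

def run(env, stmt):
    op = stmt[0]
    if op == "block":             # BLOCK: statements in order
        for s in stmt[1]:
            run(env, s)
    elif op == "alloc":           # ANCILLA: x <- v, x must be undefined
        assert stmt[1] not in env
        env[stmt[1]] = stmt[2]
    elif op == "dealloc":         # ANCILLA inverse: x -> v, x must equal v
        assert env.pop(stmt[1]) == stmt[2]
    elif op == "add":             # y += x
        env[stmt[1]] += env[stmt[2]]
    elif op == "sub":             # y -= x
        env[stmt[1]] -= env[stmt[2]]

prog = ("block", [("alloc", "t", 0), ("add", "t", "x"),
                  ("add", "y", "t"), ("sub", "t", "x"), ("dealloc", "t", 0)])
env = {"x": 3, "y": 10}
run(env, prog)            # forward: y += x through a clean ancilla t
run(env, invert(prog))    # reverse: restores the original environment
```

The deallocation assertion plays the role of NiLang's reversibility check: returning a non-clean ancilla raises an error.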

Acknowledgments

The authors are grateful to the people who helped improve this work and to the funding that sponsored the research. To meet the anonymity criteria, we will add the acknowledgments after the open review session.


A NiLang implementation of Bennett's time-space trade-off algorithm

Listing 12: NiLang implementation of Bennett's time-space trade-off scheme. The input f is a vector of functions, and state is a dictionary. We also added some irreversible external statements (those marked with @safe) to help analyze the program.

B Cases where reverse computing shows an advantage
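The idea behind Bennett's scheme can be sketched in Python (our illustration, not the NiLang code of Listing 12): recursive halving keeps only O(log n) midpoint checkpoints alive at once, trading recomputation for memory; in a truly reversible execution each checkpoint would be uncomputed by running steps backward.

```python
def bennett(step, state, n, live=None):
    """State after n applications of the reversible map `step`, keeping at
    most O(log n) midpoint checkpoints alive at once (a sketch of the
    trade-off; live tracks the current and peak checkpoint counts)."""
    if live is None:
        live = {"now": 0, "peak": 0}
    if n == 0:
        return state
    if n == 1:
        return step(state)
    k = n // 2
    mid = bennett(step, state, k, live)      # compute midpoint checkpoint
    live["now"] += 1                         # checkpoint is retained...
    live["peak"] = max(live["peak"], live["now"])
    out = bennett(step, mid, n - k, live)
    live["now"] -= 1                         # ...then uncomputed / freed
    return out

counter = {"now": 0, "peak": 0}
final = bennett(lambda s: s + 1, 0, 1000, counter)
```

For n = 1000 steps, at most ten checkpoints are simultaneously alive, which is the logarithmic space bound the scheme is known for.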

B.1 Handling effectful code

Reverse computing can handle effectful code with mutable structures and arrays. For example, an affine transformation can be implemented without any overhead.

Listing 22: Reversible sparse matrix dot product.

```julia
using SparseArrays

@i function i_dot(r::T, A::SparseMatrixCSC{T}, B::SparseMatrixCSC{T}) where {T}
    m ← size(A, 1)
    n ← size(A, 2)
    @invcheckoff branch_keeper ← zeros(Bool, 2*m)
    @safe size(B) == (m, n) || throw(DimensionMismatch("matrices must have the same dimensions"))
    @invcheckoff @inbounds for j = 1:n
        # (listing truncated in the source)
```

Listing 23: Reversible sparse matrix multiplication.

```julia
@i function i_mul!(C::StridedVecOrMat, A::AbstractSparseMatrix, B::StridedVector{T},
                   α::Number, β::Number) where T
    @safe size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    @safe size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    # (some lines are truncated in the source)
    @safe error("only β = 1 is supported, got β = $(β).")
    end
    # Here, we disable the reversibility check inside the loop to increase performance
    @invcheckoff for k = 1:size(C, 2)
        @inbounds for col = 1:size(A, 2)
            # (listing truncated in the source)
```

Here, chfield(x, y, z) is a Julia function that returns an object similar to x but with field y modified to value z, and setindex!(x, z, y) is a Julia function that sets the yth element of an array x to the value of z. We do not define primitive instructions like … using the above CALL rule, where prime( , x_f) is a predefined Julia function. To reverse the call, the reverse Julia function ∼prime( , x_f) should also be properly defined. On the other side, any non-primitive NiLang function definition generates two Julia functions, x_f and ∼x_f, so that it can be used in a recursive definition or in other reversible functions. When calling a function, NiLang does not allow the input data views to map to the same memory, i.e., shared read and write are not allowed. If the same variable is used for shared writing, the result might be incorrect. If the same variable is used for both reading and writing, the program becomes irreversible.
For example, one cannot use x -= x to erase a variable, while x.y -= x.z is safe. Even if a variable is only used for shared reading, it can be dangerous in automatic differentiation: the shared read of a variable induces a shared write of its gradient in the adjoint program. For example, y += x * x will not give the correct gradient, but its equivalent forms z ← 0; z += x; y += x * z; z -= x; z → 0 and y += x ^ 2 will.
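The shared-read rule can be checked numerically in Python (our sketch): the adjoint of y += x * z accumulates into the gradients of both reads, and when z aliases x the two accumulated contributions must sum to d(x^2)/dx = 2x.

```python
def pullback_mul_add(x, z, gy):
    """Adjoint of y += x * z: each read accumulates into its gradient."""
    gx, gz = 0.0, 0.0
    gx += z * gy    # contribution from the read of x
    gz += x * gy    # contribution from the read of z
    return gx, gz

x = 3.0
gx, gz = pullback_mul_add(x, x, 1.0)  # z aliases x (shared read)
total = gx + gz                       # shared writes must be summed
```

If the adjoint overwrote instead of accumulating, the aliased case would return x instead of 2x, which is the failure mode the text warns about.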

G.3 Compilation

The compilation of a reversible function to native Julia functions consists of three stages: preprocessing, reversing, and translation, as shown in Fig. 7. In the preprocessing stage, the compiler turns the human input into the reversible NiLang IR: the preprocessor removes syntactic sugar and expands shortcuts to symmetrize the code. In the leftmost code box in Fig. 7, one uses the @routine <stmt> statement to record a statement and ∼@routine to insert the corresponding inverse statement for uncomputing; this macro is expanded in the preprocessing stage. In the reversing stage, based on this symmetrized reversible IR, the compiler generates the reversed statements. In the translation stage, the compiler translates this reversible IR, as well as its inverse, to native Julia code. It adds @assignback before each function call, inserts code for the reversibility check, and handles control flow. The @assignback macro assigns the outputs of a function to its input data views. As a final step, the compiler attaches a return statement that returns all updated input arguments. Now the function is ready to execute on the host language.
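The effect of @assignback can be mimicked in Python (our sketch; the helper names are ours): each reversible "instruction" returns all of its arguments, and a small helper writes the outputs back onto the caller's state.

```python
def plus_eq(y, x):
    # reversible instruction y += x: returns all (possibly updated) arguments
    return y + x, x

def minus_eq(y, x):
    # inverse instruction y -= x
    return y - x, x

def assignback(fn, args, idx):
    """Call fn on the selected arguments and write its outputs back onto
    them, mimicking what the @assignback macro does for data views."""
    out = fn(*(args[i] for i in idx))
    args = list(args)
    for i, v in zip(idx, out):
        args[i] = v
    return tuple(args)

state = (10, 3)                               # (y, x)
state = assignback(plus_eq, state, (0, 1))    # y += x
undone = assignback(minus_eq, state, (0, 1))  # inverse restores the state
```

Writing outputs back onto the inputs is what lets the same calling convention serve both the forward and the reversed program.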

