SIGNATORY: DIFFERENTIABLE COMPUTATIONS OF THE SIGNATURE AND LOGSIGNATURE TRANSFORMS, ON BOTH CPU AND GPU

Abstract

Signatory is a library for calculating the signature and logsignature transforms and related functionality. The focus is on machine learning, and as such it includes features such as CPU parallelism, GPU support, and backpropagation. To our knowledge it is the first GPU-capable library for these operations. Signatory implements features not available in previous libraries, such as efficient precomputation strategies. Furthermore, several novel algorithmic improvements are introduced, producing substantial real-world speedups even on the CPU without parallelism. The library operates as a Python wrapper around C++, and is compatible with the PyTorch ecosystem. It may be installed directly via pip.

1. INTRODUCTION

The signature transform, sometimes referred to as the path signature or simply the signature, is a central object in rough path theory (Lyons, 1998; 2014). It is a transformation on differentiable paths, and may be thought of as loosely analogous to the Fourier transform. However, whilst the Fourier transform extracts information about frequency, treats each channel separately, and is linear, the signature transform extracts information about order and area, explicitly considers combinations of channels, and is in a precise sense 'universally nonlinear' (Bonnier et al., 2019, Proposition A.6). The logsignature transform (Liao et al., 2019) is a related transform that we will also consider. In both cases, by treating sequences of data as continuous paths, the (log)signature transform may be applied to problems with sequential structure, such as time series. Indeed there is a significant body of work using the (log)signature transform in machine learning, with examples ranging from handwriting identification to sepsis prediction; see for example Morrill et al. (2019); Fermanian (2019); Király & Oberhauser (2019); Toth & Oberhauser (2020); Morrill et al. (2020b). Earlier work often used the signature and logsignature transforms as a feature transformation; see Levin et al. (2013); Chevyrev & Kormilitzin (2016); Yang et al. (2016a;b); Kormilitzin et al. (2016); Li et al. (2017); Perez Arribas et al. (2018) for a range of examples. In this context, when training a model on top, it is sufficient to simply preprocess the entire dataset with the signature or logsignature transform and save the result. However, more recent work has focused on embedding the signature and logsignature transforms within neural networks; examples include Bonnier et al. (2019); Liao et al. (2019); Moor et al. (2020); Morrill et al. (2020a); Kidger et al. (2020), among others.
In this context, the signature and logsignature transforms are evaluated many times throughout a training procedure, and as such efficient and differentiable implementations are crucial. Previous libraries (Lyons, 2017; Reizenstein & Graham, 2018) have been CPU-only and single-threaded, and quickly become the major source of slowdown when training and evaluating these networks.

1.1. CONTRIBUTIONS

We introduce Signatory, a CPU- and GPU-capable library for calculating the signature and logsignature transforms and related functionality. To our knowledge it is the first GPU-capable library for these operations. The focus is on machine learning applications. Signatory is significantly faster than previous libraries (whether run on the CPU or the GPU), due to a combination of parallelism and novel algorithmic improvements; the latter include both uniform and asymptotic rate improvements over previous algorithms. Additionally, Signatory provides functionality not available in previous libraries, such as precomputation strategies for efficient querying of the (log)signature transform over arbitrary overlapping intervals. The library integrates with the open source PyTorch ecosystem and runs on Linux or Windows. Documentation, examples, benchmarks and tests form a part of the project. Much of the code is written in terms of C++ primitives, and the CPU implementation utilises OpenMP. The backward operations are handwritten, for both speed and memory efficiency, and do not rely on the autodifferentiation provided by PyTorch. The source code is located at https://github.com/patrick-kidger/signatory, documentation and examples are available at https://signatory.readthedocs.io, and the project may be installed directly via pip. This paper is not a guide to using Signatory; for that we refer to the documentation. It is instead a technical exposition of the library's innovations.

1.2. APPLICATIONS

Signatory has already seen rapid uptake within the signature community. Recent work using Signatory includes Morrill et al. (2020b) and Perez Arribas et al. (2020), who involve signatures in neural differential equations, and Moor et al. (2020) and Min & Ichiba (2020), who study deep signature models (Bonnier et al., 2019). Meanwhile Ni et al. (2020) apply Signatory to hybridise signatures with GANs, and Morrill et al. (2020a) create a generalised framework for the 'signature method'. As a final example, Signatory is now itself a dependency of other libraries (Kidger, 2020).

2. BACKGROUND

We begin with some exposition of the theory of the signature and logsignature transforms, giving definitions first and offering intuition afterwards. See also Reizenstein & Graham (2018) for an introduction focusing on computational concerns, and Lyons et al. (2004) and Hodgkinson et al. (2020) for pedagogical introductions to the motivating theory of rough paths.

2.1. THE SIGNATURE TRANSFORM

Definition 1. Let $\mathbb{R}^{d_1} \otimes \mathbb{R}^{d_2} \otimes \cdots \otimes \mathbb{R}^{d_n}$ denote the space of all real tensors with shape $d_1 \times d_2 \times \cdots \times d_n$. There is a corresponding binary operation $\otimes$, called the tensor product, which maps a tensor of shape $(d_1, \ldots, d_n)$ and a tensor of shape $(e_1, \ldots, e_m)$ to a tensor of shape $(d_1, \ldots, d_n, e_1, \ldots, e_m)$ via $(A_{i_1, \ldots, i_n}, B_{j_1, \ldots, j_m}) \mapsto A_{i_1, \ldots, i_n} B_{j_1, \ldots, j_m}$. For example, when applied to two vectors it reduces to the outer product. Let $(\mathbb{R}^d)^{\otimes k} = \mathbb{R}^d \otimes \cdots \otimes \mathbb{R}^d$ and $v^{\otimes k} = v \otimes \cdots \otimes v$ for $v \in \mathbb{R}^d$, in each case with $k - 1$ many $\otimes$.

Definition 2. Let $N \in \mathbb{N}$. The signature transform to depth $N$ is defined as
$$\mathrm{Sig}^N \colon \{ f \in C([0, 1]; \mathbb{R}^d) \mid f \text{ differentiable} \} \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k},$$
$$\mathrm{Sig}^N(f) = \left( \underset{0 < t_1 < \cdots < t_k < 1}{\int \cdots \int} \frac{\mathrm{d}f}{\mathrm{d}t}(t_1) \otimes \cdots \otimes \frac{\mathrm{d}f}{\mathrm{d}t}(t_k) \, \mathrm{d}t_1 \cdots \mathrm{d}t_k \right)_{1 \leq k \leq N}.$$

Most texts define the signature transform using the notation of stochastic calculus. Here we sacrifice some generality (that is not needed in this context) in favour of more widely-used notation. The signature transform may naturally be extended to sequences of data.

Definition 3. The space of sequences of data over a set $V$ is $\mathcal{S}(V) = \{ x = (x_1, \ldots, x_L) \mid L \in \mathbb{N},\ x_i \in V \text{ for all } i \}$. An interval of $(x_1, \ldots, x_L) \in \mathcal{S}(V)$ is $(x_i, \ldots, x_j) \in \mathcal{S}(V)$ for some $1 \leq i < j \leq L$.

Definition 4. Let $x = (x_1, \ldots, x_L) \in \mathcal{S}(\mathbb{R}^d)$ with $L \geq 2$. Let $f \colon [0, 1] \to \mathbb{R}^d$ be the unique continuous piecewise affine function such that $f(\frac{i-1}{L-1}) = x_i$ for all $i$, and which is affine on the pieces in between. Let $N \in \mathbb{N}$. Then define $\mathrm{Sig}^N(x) = \mathrm{Sig}^N(f)$. In this way we interpret $\mathrm{Sig}^N$ as a map $\mathrm{Sig}^N \colon \mathcal{S}(\mathbb{R}^d) \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$.

Note that the choice of $\frac{i-1}{L-1}$ is unimportant; any $L$ points in $[0, 1]$ would suffice, and in fact the definition is invariant to this choice (Bonnier et al., 2019, Definition A.10).

2.2. THE GROUPLIKE STRUCTURE

With $A_0 = B_0 = 1 \in \mathbb{R}$ on the right hand side, define $\otimes$ by
$$\otimes \colon \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \times \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad (A_1, \ldots, A_N) \otimes (B_1, \ldots, B_N) = \left( \sum_{j=0}^k A_j \otimes B_{k-j} \right)_{1 \leq k \leq N}.$$
Chen's identity (Lyons et al., 2004, Theorem 2.9) states that the image of the signature transform forms a noncommutative group with respect to $\otimes$. That is, given a sequence of data $(x_1, \ldots, x_L) \in \mathcal{S}(\mathbb{R}^d)$ and some $j \in \{2, \ldots, L-1\}$, then
$$\mathrm{Sig}^N((x_1, \ldots, x_L)) = \mathrm{Sig}^N((x_1, \ldots, x_j)) \otimes \mathrm{Sig}^N((x_j, \ldots, x_L)). \quad (2)$$
Furthermore the signature of a sequence of length two may be computed explicitly from the definition. Letting
$$\exp \colon \mathbb{R}^d \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad \exp \colon v \mapsto \left( v, \frac{v^{\otimes 2}}{2!}, \frac{v^{\otimes 3}}{3!}, \ldots, \frac{v^{\otimes N}}{N!} \right),$$
then $\mathrm{Sig}^N((x_1, x_2)) = \exp(x_2 - x_1)$. Together with Chen's identity, this implies that the signature transform may be computed by evaluating
$$\mathrm{Sig}^N((x_1, \ldots, x_L)) = \exp(x_2 - x_1) \otimes \exp(x_3 - x_2) \otimes \cdots \otimes \exp(x_L - x_{L-1}). \quad (3)$$
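Equation (3) lends itself to a direct numerical check at low depth. The following is a minimal sketch in plain Python (depth $N = 2$ only; the helper names are our own, not Signatory's API), computing the signature as a product of exponentials of increments and verifying Chen's identity (2) on a split of the sequence.

```python
import random

def texp(v):
    # exp(v) truncated at depth 2: (v, v (x) v / 2)
    d = len(v)
    return (list(v), [v[i] * v[j] / 2.0 for i in range(d) for j in range(d)])

def tmul(A, B):
    # The product (x) truncated at depth 2, with A_0 = B_0 = 1:
    # level 1 is A_1 + B_1; level 2 is A_2 + B_2 + A_1 (x) B_1
    d = len(A[0])
    return ([a + b for a, b in zip(A[0], B[0])],
            [A[1][i * d + j] + B[1][i * d + j] + A[0][i] * B[0][j]
             for i in range(d) for j in range(d)])

def signature(xs):
    # Sig^2 of a sequence of points, evaluated via equation (3)
    s = texp([b - a for a, b in zip(xs[0], xs[1])])
    for prev, cur in zip(xs[1:], xs[2:]):
        s = tmul(s, texp([b - a for a, b in zip(prev, cur)]))
    return s

random.seed(0)
xs = [[random.random() for _ in range(3)] for _ in range(6)]
full = signature(xs)
# Chen's identity (2): splitting at x_4 and gluing with (x) agrees
glued = tmul(signature(xs[:4]), signature(xs[3:]))
assert all(abs(a - b) < 1e-10 for fa, ga in zip(full, glued)
           for a, b in zip(fa, ga))
```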

2.3. THE LOGSIGNATURE, INVERTED SIGNATURE, AND INVERTED LOGSIGNATURE

The group inverse we denote ${}^{-1}$. Additionally a notion of logarithm may be defined (Liao et al., 2019), where $\log \colon \mathrm{image}(\mathrm{Sig}^N) \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$. This then defines the inverted signature transform, the logsignature transform, and the inverted logsignature transform as
$$\mathrm{InvertSig}^N(x) = \mathrm{Sig}^N(x)^{-1}, \quad \mathrm{LogSig}^N(x) = \log\!\left(\mathrm{Sig}^N(x)\right), \quad \mathrm{InvertLogSig}^N(x) = \log\!\left(\mathrm{Sig}^N(x)^{-1}\right)$$
respectively. We emphasise that the inverted signature and logsignature transforms are not the inverse maps of the signature and logsignature transforms. The logsignature transform extracts the same information as the signature transform, but represents that information in a much more compact way, as $\mathrm{image}(\log)$ is a proper subspace of $\bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$. Its dimension is
$$w(d, N) = \sum_{k=1}^N \frac{1}{k} \sum_{i \mid k} \mu\!\left(\frac{k}{i}\right) d^i,$$
which is known as Witt's formula (Lothaire, 1997); here $\mu$ is the Möbius function.
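Witt's formula is straightforward to evaluate. A small sketch in plain Python (`mobius` and `witt` are our own hypothetical helper names) computing $w(d, N)$:

```python
def mobius(n):
    # Moebius function via trial factorisation
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n has a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result

def witt(d, N):
    # w(d, N) = sum_{k=1}^N (1/k) sum_{i | k} mu(k/i) d^i
    total = 0
    for k in range(1, N + 1):
        total += sum(mobius(k // i) * d ** i
                     for i in range(1, k + 1) if k % i == 0) // k
    return total

# e.g. two letters: a1, a2 at level 1 and the bracket [a1, a2] at level 2
assert witt(2, 2) == 3
assert witt(2, 3) == 5
assert witt(3, 2) == 6
```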

2.4. SIGNATURES IN MACHINE LEARNING

In terms of the tensors used by most machine learning frameworks, the (inverted) signature and logsignature transforms of depth $N$ may both be thought of as consuming a tensor of shape $(b, L, d)$, corresponding to a batch of $b$ sequences of data, each of the form $(x_1, \ldots, x_L)$ for $x_i \in \mathbb{R}^d$. The (inverted) signature transform then produces a tensor of shape $(b, \sum_{k=1}^N d^k)$, whilst the (inverted) logsignature transform produces a tensor of shape $(b, w(d, N))$. We note that these can easily become large, and much research has focused on ameliorating this (Bonnier et al., 2019; Morrill et al., 2020a; Cuchiero et al., 2020). All of these transforms are differentiable with respect to $x$, and so may be backpropagated through. These transforms may thus be thought of as differentiable operations between tensors, of the kind usually performed by machine learning frameworks.

2.5. INTUITION

The (inverted) signature and logsignature transforms all have roughly the same intuition as one another. (They all represent the same information, just in slightly different ways.) Given a sequence of data $(x_1, \ldots, x_L)$, these transforms may be used as binning functions, feature extractors, or nonlinearities, to give summary statistics over the data. These summary statistics describe the way in which the data interacts with dynamical systems (Morrill et al., 2020b). Indeed, we have already linked the signature to the exponential map, which is defined as the solution to a differential equation: $\frac{\mathrm{d}\exp}{\mathrm{d}t}(t) = \exp(t)$. The signature may in fact be defined as the solution of a controlled exponential: $\mathrm{d}\,\mathrm{Sig}^N(f)(t) = \mathrm{Sig}^N(f)(t) \otimes \mathrm{d}f(t)$, so that $\mathrm{Sig}^N(f)$ is the response of a particular dynamical system driven by $f$. The theory here is somewhat involved, and is not an interpretation we shall pursue further here. An equivalent, more straightforward interpretation is arrived at by observing that the terms of the exponential of a scalar, $\exp \colon x \in \mathbb{R} \mapsto (1, x, \frac{1}{2}x^2, \ldots)$, produce (up to scaling factors) every monomial of its input. Classical machine learning takes advantage of this as a feature extractor in polynomial regression. The signature transform is the equivalent operation when the input is a sequence.
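The polynomial-regression analogy is exact in one dimension: for $d = 1$ the signature of a sequence reduces to the truncated exponential $(\Delta, \Delta^2/2!, \ldots, \Delta^N/N!)$ of the total increment $\Delta$, i.e. precisely scaled monomials. A minimal sketch (plain Python, our own helper names) checking this via equation (3):

```python
from math import factorial, isclose

def exp_trunc(z, N):
    # Truncated exponential of a scalar increment: (z, z^2/2!, ..., z^N/N!)
    return [z ** k / factorial(k) for k in range(1, N + 1)]

def mul_trunc(A, B):
    # The product (x) for d = 1: ordinary multiplication of the polynomials
    # (1 + A_1 + ... + A_N)(1 + B_1 + ... + B_N), truncated at degree N
    N = len(A)
    A0, B0 = [1.0] + A, [1.0] + B
    return [sum(A0[i] * B0[k - i] for i in range(k + 1)) for k in range(1, N + 1)]

# Depth-3 signature of the one-dimensional sequence (0.0, 0.3, 0.7, 1.5)
xs, N = [0.0, 0.3, 0.7, 1.5], 3
s = exp_trunc(xs[1] - xs[0], N)
for a, b in zip(xs[1:], xs[2:]):
    s = mul_trunc(s, exp_trunc(b - a, N))

# Only the total increment survives: Sig = (D, D^2/2!, D^3/3!), i.e. monomials
D = xs[-1] - xs[0]
assert all(isclose(sk, D ** k / factorial(k)) for k, sk in enumerate(s, 1))
```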

3. CODE EXAMPLE

Signatory is designed to be Pythonic, offering operations that work just like any other PyTorch operation and that output PyTorch tensors. A brief example is:

    import signatory
    import torch
    batch, stream, channels, depth = 1, 10, 2, 4
    path = torch.rand(batch, stream, channels, requires_grad=True)
    signature = signatory.signature(path, depth)
    signature.sum().backward()

4. ALGORITHMIC IMPROVEMENTS

We present several novel algorithmic improvements for computing signatures and logsignatures.

4.1. FUSED MULTIPLY-EXPONENTIATE

Recall from equation (3) that the signature may be computed by evaluating several exponentials and several $\otimes$. We begin by finding that it is beneficial to compute
$$\bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \times \mathbb{R}^d \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad (A, z) \mapsto A \otimes \exp(z)$$
as a single fused operation. Doing so uses uniformly (over $d$, $N$) fewer scalar multiplications than composing the individual exponential and $\otimes$, and in fact reduces the asymptotic complexity of this operation from $O(N d^N)$ to $O(d^N)$. Furthermore this rate is now optimal, as the result (an element of $\bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$) is itself of size $O(d^N)$. The bulk of a signature computation may then be sped up by writing it in terms of this fused operation: in equation (3), a single exponential is required at the start, followed by a reduction with respect to the fused multiply-exponentiate. This gives substantial real-world speedups; see the benchmarks of Section 6. The fusing is done by expanding
$$A \otimes \exp(z) = \left( \sum_{i=0}^k A_i \otimes \frac{z^{\otimes (k-i)}}{(k-i)!} \right)_{1 \leq k \leq N},$$
at which point the $k$-th term may be computed by a scheme in the style of Horner's method:
$$\sum_{i=0}^k A_i \otimes \frac{z^{\otimes(k-i)}}{(k-i)!} = \left( \cdots \left( \left( \frac{z}{k} + A_1 \right) \otimes \frac{z}{k-1} + A_2 \right) \otimes \cdots \otimes \frac{z}{2} + A_{k-1} \right) \otimes z + A_k.$$
See Appendix A.1 for the mathematics, including proofs of both the asymptotic complexity and the uniformly fewer multiplications.
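The Horner-style scheme can be checked against the unfused computation. Below is a minimal sketch (plain Python; tensors stored as flat lists; all helper names are ours, not Signatory's implementation) comparing the fused multiply-exponentiate with a separate exponential followed by $\otimes$:

```python
import random

def tp(A, B):
    # Outer product of two flattened tensors
    return [a * b for a in A for b in B]

def mul(A, B, N):
    # Truncated tensor-algebra product with A_0 = B_0 = 1;
    # A[k-1] holds the level-k tensor, stored flat.
    out = []
    for k in range(1, N + 1):
        term = [a + b for a, b in zip(A[k - 1], B[k - 1])]   # i = k and i = 0 terms
        for i in range(1, k):
            for j, v in enumerate(tp(A[i - 1], B[k - i - 1])):
                term[j] += v
        out.append(term)
    return out

def exp_t(z, N):
    # Truncated exponential (z, z^{(x)2}/2!, ..., z^{(x)N}/N!)
    out, cur = [list(z)], list(z)
    for k in range(2, N + 1):
        cur = [c / k for c in tp(cur, z)]
        out.append(cur)
    return out

def fused(A, z, N):
    # A (x) exp(z), each level evaluated by the Horner-style scheme
    out = []
    for k in range(1, N + 1):
        acc = [zi / k + a for zi, a in zip(z, A[0])]          # z/k + A_1
        for i in range(2, k + 1):
            zz = [v / (k - i + 1) for v in z]
            acc = [x + a for x, a in zip(tp(acc, zz), A[i - 1])]
        out.append(acc)
    return out

random.seed(1)
d, N = 2, 4
z = [random.random() for _ in range(d)]
A = [[random.random() for _ in range(d ** k)] for k in range(1, N + 1)]
expected = mul(A, exp_t(z, N), N)
got = fused(A, z, N)
assert all(abs(a - b) < 1e-10 for ea, ga in zip(expected, got)
           for a, b in zip(ea, ga))
```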

4.2. IMPROVED PRECOMPUTATION STRATEGIES

Given a sequence of data $x = (x_1, \ldots, x_L)$, it may be desirable to query $\mathrm{Sig}^N((x_i, \ldots, x_j))$ for many different pairs $i, j$. We show that each such query may be answered in just $O(1)$ (in $L$) time and memory, after $O(L)$ precomputation and storage. Previous theoretical work achieved only $O(\log L)$ inference with $O(L \log L)$ precomputation (Chafai & Lyons, 2005). Doing so is surprisingly simple. Precompute $\mathrm{Sig}^N((x_1, \ldots, x_j))$ and $\mathrm{InvertSig}^N((x_1, \ldots, x_j))$ for all $j$. This may be done in only $O(L)$ work, by iteratively computing each signature via
$$\mathrm{Sig}^N((x_1, \ldots, x_j)) = \mathrm{Sig}^N((x_1, \ldots, x_{j-1})) \otimes \mathrm{Sig}^N((x_{j-1}, x_j)), \quad (6)$$
with a similar relation for the inverted signature. Then, at inference time, use the grouplike structure:
$$\mathrm{Sig}^N((x_i, \ldots, x_j)) = \mathrm{InvertSig}^N((x_1, \ldots, x_i)) \otimes \mathrm{Sig}^N((x_1, \ldots, x_j)),$$
followed by a $\log$ if it is a logsignature that is desired. As a single $\otimes$ operation this is $O(1)$ in $L$. We do remark that this should be used with caution, as it may suffer from numerical stability issues when used for large $i, j$.
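The following sketch illustrates the precomputation strategy at depth $N = 2$ (plain Python with our own helper names; Signatory itself implements this in C++). Prefix signatures and inverted prefix signatures are built iteratively as in equation (6), after which any interval is answered with a single product:

```python
import random

def inc(a, b):
    return [bi - ai for ai, bi in zip(a, b)]

def texp(v):
    # exp(v) truncated at depth 2
    d = len(v)
    return (list(v), [v[i] * v[j] / 2.0 for i in range(d) for j in range(d)])

def tmul(A, B):
    # Truncated tensor product at depth 2, with A_0 = B_0 = 1
    d = len(A[0])
    return ([a + b for a, b in zip(A[0], B[0])],
            [A[1][i * d + j] + B[1][i * d + j] + A[0][i] * B[0][j]
             for i in range(d) for j in range(d)])

def sig(xs):
    s = texp(inc(xs[0], xs[1]))
    for p, c in zip(xs[1:], xs[2:]):
        s = tmul(s, texp(inc(p, c)))
    return s

random.seed(2)
xs = [[random.random() for _ in range(2)] for _ in range(8)]
L = len(xs)

# O(L) precomputation via equation (6): prefix signatures and their inverses
prefix = {2: texp(inc(xs[0], xs[1]))}
inv_prefix = {2: texp(inc(xs[1], xs[0]))}
for j in range(3, L + 1):
    prefix[j] = tmul(prefix[j - 1], texp(inc(xs[j - 2], xs[j - 1])))
    inv_prefix[j] = tmul(texp(inc(xs[j - 1], xs[j - 2])), inv_prefix[j - 1])

# O(1) query: Sig((x_i,...,x_j)) = InvertSig((x_1,...,x_i)) (x) Sig((x_1,...,x_j))
i, j = 3, 7
query = tmul(inv_prefix[i], prefix[j])
direct = sig(xs[i - 1:j])
assert all(abs(a - b) < 1e-9 for qa, da in zip(query, direct)
           for a, b in zip(qa, da))
```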

4.3. MORE EFFICIENT LOGSIGNATURE BASIS

The logsignature transform of a path has multiple possible representations, corresponding to different possible bases of the ambient space, which is typically interpreted as a free Lie algebra (Reutenauer, 1993). The Lyndon basis is a typical choice (Reizenstein & Graham, 2018). We show that there exists a more computationally efficient basis. It is mathematically unusual, in that it is not constructed as a Hall basis. But when doing deep learning, the choice of basis is (mostly) unimportant if the next operation is a learnt linear transformation. The Lyndon basis uses Lyndon brackets as its basis elements. Meanwhile our new basis uses basis elements that, when written as a sum of Lyndon brackets and expanded as a sum of words, have precisely one word that is a Lyndon word. This means that the coefficient of each basis element can be found cheaply, by extracting the coefficient of that Lyndon word from the tensor algebra representation of the logsignature. See Appendix A.2 for the full exposition.
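Both bases are indexed by Lyndon words, which may be enumerated with Duval's algorithm. A small sketch (plain Python; the alphabet $\{0, \ldots, d-1\}$ stands in for $a_1, \ldots, a_d$), whose counts per length agree with Witt's formula:

```python
from collections import Counter

def lyndon_words(d, N):
    # Duval's algorithm: all Lyndon words of length <= N over {0, ..., d-1},
    # generated in lexicographic order
    w, out = [-1], []
    while w:
        w[-1] += 1
        out.append(tuple(w))
        m = len(w)
        while len(w) < N:
            w.append(w[len(w) - m])   # extend w periodically
        while w and w[-1] == d - 1:
            w.pop()                   # strip trailing maximal letters
    return out

words = lyndon_words(2, 5)
# One basis element per Lyndon word; counts per length match Witt's formula
assert Counter(len(w) for w in words) == {1: 2, 2: 1, 3: 2, 4: 3, 5: 6}
```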

5. NEW FEATURES

Signatory provides several features not available in previous libraries.

5.1. PARALLELISM

There are two main levels of parallelism. First is naïve parallelism over the batch dimension. Second, we observe that equation (3) takes the form of a noncommutative reduction with respect to the fused multiply-exponentiate. The operation is associative, and so this may be parallelised in the usual way for reductions, by splitting the computation up into chunks. Parallelism on the CPU is implemented with OpenMP. For speed the necessary operations are written in terms of C++ primitives, and then bound into PyTorch.
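The reduction pattern can be illustrated abstractly: any associative (if noncommutative) product may be reduced chunkwise, with each chunk processed independently and the partial results then combined. A sketch using $2 \times 2$ matrix multiplication as a stand-in for the fused multiply-exponentiate:

```python
import random
from functools import reduce

def matmul(A, B):
    # Associative but noncommutative product, standing in for the fused
    # multiply-exponentiate of Section 4.1 (2x2 matrices for brevity)
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

random.seed(3)
terms = [[[random.random() for _ in range(2)] for _ in range(2)]
         for _ in range(12)]

# Sequential reduction, as a single thread would perform it
sequential = reduce(matmul, terms)

# Chunked reduction: each chunk could be handled by a separate thread;
# associativity guarantees that recombining the partial results agrees
chunks = [terms[i:i + 4] for i in range(0, len(terms), 4)]
partials = [reduce(matmul, chunk) for chunk in chunks]
chunked = reduce(matmul, partials)

assert all(abs(a - b) < 1e-9 for ra, rb in zip(sequential, chunked)
           for a, b in zip(ra, rb))
```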

5.2. GPU SUPPORT

An important feature is GPU support, which is provided via the functionality available through LibTorch. It was a deliberate choice not to write CUDA code, for technical reasons to do with distributing the library in a widely-compatible manner; see Appendix B. There are again two levels of parallelism, as described in the previous section. When the (log)signature transform is part of a deep learning model trained on the GPU, then GPU support offers speedups not only from the use of a GPU over the CPU, but also from obviating the need to copy data to and from the GPU.

5.3. BACKPROPAGATION

Crucial for any library used in deep learning is the ability to backpropagate through the provided operations. Signatory provides full support for backpropagation through every provided operation. Previously there was only limited support for backpropagation, through a handful of simple operations, via the iisignature library (Reizenstein & Graham, 2018). The backpropagation computations are handwritten, rather than being generated by autodifferentiation. This improves the speed of the computation by using C++ primitives rather than high-level tensor operations, and furthermore allows for improved memory efficiency, by exploiting a reversibility property of the signature (Reizenstein, 2019, Section 4.9.3). We discuss backpropagation in more detail in Appendix C.

5.4. INVERTED SIGNATURES AND LOGSIGNATURES

Signatory provides the capability to compute inverted signatures and logsignatures, via the optional inverse argument to the signature and logsignature functions. This is primarily a convenience, as $\mathrm{Sig}^N((x_1, \ldots, x_L))^{-1} = \mathrm{Sig}^N((x_L, \ldots, x_1))$.

5.5. EXPLOITING THE GROUPLIKE STRUCTURE

It is often desirable to compute (inverted) (log)signatures over multiple intervals of the same sequence of data. These calculations may jointly be accomplished more efficiently than by evaluating the signature transform over every interval separately. In some cases, if the original data has been discarded and only its signature is now known, exploiting this structure is the only way to perform the computation. Here we detail several notable cases, and how Signatory supports them. In all cases the aim is to provide a flexible set of tools that may be used together, so that wherever possible unnecessary recomputation may be elided. Their use is also discussed in the documentation, including examples.

Combining adjacent intervals. Recall equation (2). If the two signatures on the right hand side of the equation are already known, then the signature of the overall sequence of data may be computed using only a single $\otimes$ operation, without re-iterating over the data. This operation is provided by the multi_signature_combine and signature_combine functions.

Expanding intervals. Given a sequence of data $(x_1, \ldots, x_L) \in \mathcal{S}(\mathbb{R}^d)$, a scenario that is particularly important for its use in Section 4.2 is to compute the signature of expanding intervals of the data,
$$(\mathrm{Sig}^N((x_1, x_2)), \mathrm{Sig}^N((x_1, x_2, x_3)), \ldots, \mathrm{Sig}^N((x_1, \ldots, x_L))).$$
This may be interpreted as a sequence of signatures, that is to say an element of $\mathcal{S}(\bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k})$. By equation (6), this may be done in only $O(L)$ work, with all the earlier signatures available as free byproducts of computing the final element $\mathrm{Sig}^N((x_1, \ldots, x_L))$. This is handled by the optional stream argument to the signature and logsignature functions.

Arbitrary intervals. Given a sequence of data $x = (x_1, \ldots, x_L)$, it may be desirable to query $\mathrm{Sig}^N((x_i, \ldots, x_j))$ for many $i, j$ such that $1 \leq i < j \leq L$, as in Section 4.2. Using the efficient precomputation strategy described there, Signatory provides this capability via the Path class.

Keeping the signature up-to-date. Suppose we have a sequence of data $(x_1, \ldots, x_L) \in \mathcal{S}(\mathbb{R}^d)$ whose signature $\mathrm{Sig}^N((x_1, \ldots, x_L))$ has already been computed. New data subsequently arrives, some $(x_{L+1}, \ldots, x_{L+M}) \in \mathcal{S}(\mathbb{R}^d)$, and we now wish to update our computed signature, for example to compute the sequence of signatures $(\mathrm{Sig}^N((x_1, \ldots, x_{L+1})), \ldots, \mathrm{Sig}^N((x_1, \ldots, x_{L+M})))$. This could be done by computing the signatures over $(x_{L+1}, \ldots, x_{L+M})$ and combining them as above. However, if these signatures (over $(x_{L+1}, \ldots, x_{L+M})$) are not themselves of interest, then this approach may be improved upon, as it exploits only the grouplike structure, and not the fused multiply-exponentiate described in Section 4.1. Computing them in this more efficient way is handled via the basepoint and initial arguments to the signature function, and via the update method of the Path class.
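The 'keeping the signature up-to-date' case can be sketched at depth $N = 2$ (plain Python, our own helper names; Signatory implements this in C++): the stored signature is extended with one step per new increment, without revisiting the old data.

```python
import random

def inc(a, b):
    return [bi - ai for ai, bi in zip(a, b)]

def texp(v):
    # exp(v) truncated at depth 2
    d = len(v)
    return (list(v), [v[i] * v[j] / 2.0 for i in range(d) for j in range(d)])

def tmul(A, B):
    # Truncated tensor product at depth 2, with A_0 = B_0 = 1
    d = len(A[0])
    return ([a + b for a, b in zip(A[0], B[0])],
            [A[1][i * d + j] + B[1][i * d + j] + A[0][i] * B[0][j]
             for i in range(d) for j in range(d)])

def sig(xs):
    s = texp(inc(xs[0], xs[1]))
    for p, c in zip(xs[1:], xs[2:]):
        s = tmul(s, texp(inc(p, c)))
    return s

def update(stored, last_point, new_points):
    # Extend a stored signature with newly arrived points: one step per
    # increment, never revisiting the old data (cf. equation (6))
    for p in new_points:
        stored = tmul(stored, texp(inc(last_point, p)))
        last_point = p
    return stored

random.seed(4)
old = [[random.random() for _ in range(2)] for _ in range(5)]
new = [[random.random() for _ in range(2)] for _ in range(3)]
updated = update(sig(old), old[-1], new)
direct = sig(old + new)
assert all(abs(a - b) < 1e-9 for ua, da in zip(updated, direct)
           for a, b in zip(ua, da))
```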

6. BENCHMARK PERFORMANCE

We are aware of two existing software libraries providing similar functionality: esig (Lyons, 2017) and iisignature (Reizenstein & Graham, 2018). We ran a series of benchmarks against the latest versions of both of these libraries, namely esig 0.6.31 and iisignature 0.24. The computer used was equipped with a Xeon E5-2960 v4 and a Quadro GP100, and was running Ubuntu 18.04 and Python 3.7. In the backward benchmarks, esig is not shown as it is incapable of computing backward operations.

6.1. DIRECT COMPARISON

For esig and iisignature we report run time on the CPU, whilst for Signatory we report run time on the GPU, on the CPU with parallelism, and on the CPU without parallelism. As it is in principle possible to parallelise the alternative libraries using Python's multiprocessing module, the most important comparisons are to Signatory on the CPU without parallelism (representing like-for-like computational resources), and on the GPU (representing the best possible performance). We begin with a benchmark of the forward operation through the signature transform. We consider a batch of 32 sequences of length 128. We then investigate the scaling as we vary either the number of channels (over 2-7) of the input sequences, or the depth (over 2-9) of the signature transform. When varying the number of channels, the depth was fixed at 7; when varying the depth, the number of channels was fixed at 4. Every test case is repeated 50 times and the fastest time taken. See Figure 1; note the logarithmic scale. We observe that iisignature is Signatory's strongest competitor in all cases. Signatory and iisignature are comparable for the very smallest of computations. As the computation increases in size, the CPU implementations of Signatory immediately overtake iisignature, followed by the GPU implementation. For larger computations, Signatory can be orders of magnitude faster. For example, to compute the signature transform with depth and number of channels both equal to 7, iisignature takes 20.9 seconds. In contrast, running on the CPU without parallelism, Signatory takes only 3.8 seconds: a 5.5× speedup. We emphasise that the same computational resources (including the lack of parallelism) were used for both. To see the benefit of a GPU implementation over a CPU implementation, which is the primary motivation for Signatory's existence, we observe that Signatory takes only 0.16 seconds to compute this same operation. Compared to the best previous alternative in iisignature, this represents a 132× speedup.

Next, we consider the backward operation through the signature transform. We vary over multiple inputs as before; see Figure 2. We again observe the same behaviour: iisignature is Signatory's strongest competitor, but is still orders of magnitude slower on anything but the very smallest of problems. For example, to backpropagate through the signature transform with depth and number of channels both equal to 7, Signatory on the CPU without parallelism takes 13.7 seconds, whilst iisignature takes over two minutes (128 seconds) on like-for-like computational resources. Running Signatory on the GPU takes only 0.772 seconds. These represent speedups of 9.4× and 166× respectively. For further speed benchmarks, a discussion of memory usage benchmarks, the precise numerical values of the graphs presented here, and code to reproduce these benchmarks, see Appendix D. We observe the same consistent improvements on these additional benchmarks.

6.2. DEEP LEARNING EXAMPLE

To emphasise the benefit of Signatory to deep learning applications, we consider training a deep signature model (Bonnier et al., 2019) on a toy dataset of geometric Brownian motion samples. The samples have one of two different volatilities, and the task is binary classification. The model sweeps a small feedforward network over the input sequence (to produce a sequence of hidden states), applies the signature transform, and then maps to a binary prediction via a final learnt linear map. The model has learnt parameters prior to the signature transform, so in particular backpropagation through the signature transform is necessary. The signature transform is computed either using Signatory or using iisignature. We train the model on the GPU and plot training loss against wall-clock time. Both models train successfully, but the model using Signatory trains 210 times faster than the one using iisignature. This makes clear how signatures have previously represented the largest computational bottleneck. The improvement of 210× is even larger than the improvements obtained in the previous section; we attribute this to the fact that iisignature necessarily has the additional overhead of copying data from the GPU to the CPU and back again.

7. CONCLUSION

We have introduced Signatory, a library for computing the signature and logsignature transforms and related functionality, with a particular focus on applications to machine learning. Notable contributions are the speed of its operation, its GPU support, the differentiability of every provided operation, and its novel algorithmic innovations.

A FURTHER DETAILS OF ALGORITHMIC IMPROVEMENTS

A.1 FUSED MULTIPLY-EXPONENTIATE

The conventional way to compute a signature is to iterate through the computation described by equation (3): for each new increment, take its exponential, $\otimes$ it onto what has already been computed, and repeat. Our proposed alternative is to fuse the exponential and $\otimes$ into a single operation, and then iteratively perform this fused operation. We now count the number of scalar multiplications required to compute
$$\bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \times \mathbb{R}^d \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad (A, z) \mapsto A \otimes \exp(z)$$
for each approach. We will establish that the fused operation uses fewer multiplications for all $d \geq 1$ and $N \geq 1$, and then demonstrate that it is in fact of a lower asymptotic complexity.

A.1.1 THE CONVENTIONAL WAY

The exponential is defined as
$$\exp \colon \mathbb{R}^d \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad \exp \colon x \mapsto \left( x, \frac{x^{\otimes 2}}{2!}, \frac{x^{\otimes 3}}{3!}, \ldots, \frac{x^{\otimes N}}{N!} \right),$$
see Bonnier et al. (2019, Proposition 15). Note that every tensor in the exponential is symmetric, and so in principle requires less work to compute than its number of elements would suggest. For the purposes of this analysis, to give the benefit of the doubt to the competing method, we shall assume that this is done (although taking advantage of this in practice is actually quite hard (Reizenstein & Graham, 2018, Section 2)). Computing the exponential then takes
$$\sum_{k=2}^N \left( d + \binom{d+k-1}{k} \right)$$
scalar multiplications, using the formula for unordered sampling with replacement (Reizenstein & Graham, 2018, Section 2), under the assumption that each division by a scalar costs the same as a multiplication (which can be arranged by precomputing the reciprocals and then multiplying by them).

Next, we count the number of multiplications needed to perform a single $\otimes$. Let $A, B \in \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$. Let $A = (A_1, \ldots, A_N)$, with $A_i = (A_i^{j_1, \ldots, j_i})_{1 \leq j_1, \ldots, j_i \leq d}$ and every $A_i^{j_1, \ldots, j_i} \in \mathbb{R}$. Additionally let $A_0 = 1$. Similarly for $B$. Then $\otimes$ is defined by
$$\otimes \colon \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \times \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k} \to \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}, \quad (A, B) \mapsto \left( \sum_{i=0}^k A_i \otimes B_{k-i} \right)_{1 \leq k \leq N},$$
where each $A_i \otimes B_{k-i} = (A_i^{j_1, \ldots, j_i} B_{k-i}^{\hat{j}_1, \ldots, \hat{j}_{k-i}})_{1 \leq j_1, \ldots, j_i, \hat{j}_1, \ldots, \hat{j}_{k-i} \leq d}$ is the usual tensor product, the result is thought of as a tensor in $(\mathbb{R}^d)^{\otimes k}$, and the summation is taken in this space. See Bonnier et al. (2019, Definition A.13). To the authors' knowledge there has been no formal analysis of a lower bound on the computational complexity of $\otimes$, and there is no better way to compute it than naïvely following this definition. This, then, requires
$$\sum_{k=1}^N \sum_{i=1}^{k-1} \sum_{j_1, \ldots, j_i = 1}^d \sum_{\hat{j}_1, \ldots, \hat{j}_{k-i} = 1}^d 1 = \sum_{k=1}^N \sum_{i=1}^{k-1} d^k = \sum_{k=1}^N (k-1) d^k$$
scalar multiplications. Thus the overall cost of the conventional way is
$$C(d, N) = \sum_{k=2}^N \left( d + \binom{d+k-1}{k} \right) + \sum_{k=1}^N (k-1) d^k \quad (9)$$
scalar multiplications.

A.1.2 THE FUSED OPERATION

Let $A \in \bigoplus_{k=1}^N (\mathbb{R}^d)^{\otimes k}$ and $z \in \mathbb{R}^d$. Then
$$A \otimes \exp(z) = \left( \sum_{i=0}^k A_i \otimes \frac{z^{\otimes(k-i)}}{(k-i)!} \right)_{1 \leq k \leq N},$$
where the $k$-th term may be computed by a scheme in the style of Horner's method:
$$\sum_{i=0}^k A_i \otimes \frac{z^{\otimes(k-i)}}{(k-i)!} = \left( \cdots \left( \left( \frac{z}{k} + A_1 \right) \otimes \frac{z}{k-1} + A_2 \right) \otimes \cdots \otimes \frac{z}{2} + A_{k-1} \right) \otimes z + A_k. \quad (10)$$
As before, we assume that the reciprocals $\frac{1}{2}, \ldots, \frac{1}{N}$ have been precomputed, so that each division costs the same as a multiplication. Then we begin by computing $z/2, \ldots, z/N$, which takes $d(N-1)$ multiplications. Computing the $k$-th term as in equation (10) then takes $\sum_{i=2}^k d^i$ further multiplications, as the intermediate results are of sizes $d^2, \ldots, d^k$ in turn. The overall cost of the fused operation is therefore
$$F(d, N) = d(N-1) + \sum_{k=1}^N \sum_{i=2}^k d^i. \quad (11)$$
We now compare the two counts. First suppose $d = 1$. Then
$$F(1, N) = (N-1) + \sum_{k=1}^N (k-1) \leq 2(N-1) + \sum_{k=1}^N (k-1) = C(1, N).$$
Now suppose $N = 1$. Then $F(d, 1) = 0 = C(d, 1)$. Now suppose $N = 2$. Then
$$F(d, 2) = d + d^2 \leq d + \binom{d+1}{2} + d^2 = C(d, 2).$$
Now suppose $d \geq 2$ and $N \geq 3$. Then
$$F(d, N) = d(N-1) + \sum_{k=1}^N \sum_{i=2}^k d^i = \frac{d^{N+2} - d^3 - (N-1)d^2 + (N-1)d}{(d-1)^2}, \quad (12)$$
and
$$C(d, N) = \sum_{k=2}^N \left( d + \binom{d+k-1}{k} \right) + \sum_{k=1}^N (k-1)d^k \geq \sum_{k=1}^N (k-1)d^k = \frac{(N-1)d^{N+2} - N d^{N+1} + d^2}{(d-1)^2}. \quad (13)$$
Thus we see that it suffices to show that
$$d^{N+2} - d^3 - (N-1)d^2 + (N-1)d \leq (N-1)d^{N+2} - N d^{N+1} + d^2$$
for $d \geq 2$ and $N \geq 3$. That is,
$$0 \leq d^{N+1}(d(N-2) - N) + d(d^2 + N(d-1) + 1). \quad (14)$$
At this point $d = 2$, $N = 3$ must be handled as a special case, and may be verified by direct evaluation of equation (14). So now assume $d \geq 2$, $N \geq 3$, and that $d = 2$, $N = 3$ do not occur jointly. Then equation (14) is implied by $0 \leq d(N-2) - N$ and $0 \leq d^2 + N(d-1) + 1$. The second condition is trivially true. The first condition rearranges to $N/(N-2) \leq d$, which is true for $d \geq 2$, $N \geq 3$ with $d = 2$, $N = 3$ not jointly true. This establishes the uniform bound $F(d, N) \leq C(d, N)$.

Checking the asymptotic complexity is much more straightforward. Consulting equations (12) and (13) shows that $F(d, N) = O(d^N)$ whilst $C(d, N) = \Omega(N d^N)$. And in fact, as $\binom{d+k-1}{k} \leq d^k$, equation (9) demonstrates that $C(d, N) = O(N d^N)$.
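The operation counts above are easy to check numerically. A sketch in plain Python evaluating the conventional cost $C(d, N)$ and the fused cost $F(d, N)$, and checking $F \leq C$ uniformly:

```python
from math import comb

def C_conv(d, N):
    # Conventional cost: exponential (symmetric entries only) plus a full product
    return (sum(d + comb(d + k - 1, k) for k in range(2, N + 1))
            + sum((k - 1) * d ** k for k in range(1, N + 1)))

def F_fused(d, N):
    # Fused cost: z/2,...,z/N once, then a Horner evaluation per level
    return d * (N - 1) + sum(d ** i
                             for k in range(1, N + 1) for i in range(2, k + 1))

def F_closed(d, N):
    # Closed form of the fused cost, valid for d >= 2
    return (d ** (N + 2) - d ** 3 - (N - 1) * d ** 2 + (N - 1) * d) // (d - 1) ** 2

for d in range(1, 8):
    for N in range(1, 10):
        assert F_fused(d, N) <= C_conv(d, N)   # uniformly fewer multiplications
        if d >= 2:
            assert F_fused(d, N) == F_closed(d, N)

# The gap grows linearly in N: C is Theta(N d^N) whilst F is Theta(d^N)
assert C_conv(7, 9) / F_fused(7, 9) > 5
```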

A.2 LOGSIGNATURE BASES

We move on to describing our new more efficient basis for the logsignature.

A.2.1 WORDS, LYNDON WORDS, AND LYNDON BRACKETS

Let A = {a 1 , . . . , a d } be a set of d letters. Let A +N be the set of all words in these letters, of length between 1 and N inclusive. For example a 1 a 4 ∈ A +N is a word of length two. Impose the order a 1 < a 2 < • • • < a d on A, and extend it to the lexicographic order on words in A +N of the same length as each other, so that for example a 1 a 2 < a 1 a 3 < a 2 a 1 . Then a Lyndon word (Lalonde & Ram, 1995) is a word which comes earlier in lexicographic order than any of its rotations, where rotation corresponds to moving some number of letters from the start of the word to the end of the word. For example a 2 a 2 a 3 a 4 is a Lyndon word, as it is lexicographically earlier than a 2 a 3 a 4 a 2 , a 3 a 4 a 2 a 2 and a 4 a 2 a 2 a 3 . Meanwhile a 2 a 2 is not a Lyndon word, as it is not lexicographically earlier than a 2 a 2 (which is a rotation). Denote by L A +N the set of all Lyndon words of length between 1 and N . Given any Lyndon word w 1 • • • w n with n ≥ 2 and w i ∈ A, we may consider its longest Lyndon suffix; that is, the smallest j > 1 for which w j • • • w n is a Lyndon word. (It is guaranteed to exist as w n alone is a Lyndon word.) It is a fact (Lalonde & Ram, 1995) that w 1 • • • w j-1 is then also a Lyndon word. Given a Lyndon word w, we denote by w b its longest Lyndon suffix, and by w a the corresponding prefix. Considering spans with respect to R, let [ • , • ] : span(A +N ) × span(A +N ) → span(A +N ) be the commutator given by [w, z] = wz -zw, where wz denotes concatenation of words, distributed over the addition, as w and z belong to a span and thus may be linear combinations of words. For example w = 2a 1 a 2 + a 1 and z = a 1 + a 3 gives wz = 2a 1 a 2 a 1 + 2a 1 a 2 a 3 + a 1 a 1 + a 1 a 3 . Then define φ : L A +N → span(A +N ) by φ(w) = w if w is a word of only a single letter, and by φ(w) = [φ(w a ), φ(w b )] otherwise. 
For example, $\phi(a_1 a_2 a_2) = [[a_1, a_2], a_2] = [a_1 a_2 - a_2 a_1, a_2] = a_1 a_2 a_2 - 2a_2 a_1 a_2 + a_2 a_2 a_1$. Now extend $\phi$ by linearity from $\mathcal{L}(A^{+N})$ to $\mathrm{span}(\mathcal{L}(A^{+N}))$, so that $\phi \colon \mathrm{span}(\mathcal{L}(A^{+N})) \to \mathrm{span}(A^{+N})$ is a linear map between finite dimensional real vector spaces, from a lower dimensional space to a higher dimensional space. Next, let $\psi \colon A^{+N} \to \mathrm{span}(\mathcal{L}(A^{+N}))$ be such that $\psi(w) = w$ if $w \in \mathcal{L}(A^{+N})$, and $\psi(w) = 0$ otherwise. Extend $\psi$ by linearity to $\mathrm{span}(A^{+N})$, so that $\psi \colon \mathrm{span}(A^{+N}) \to \mathrm{span}(\mathcal{L}(A^{+N}))$ is a linear map between finite dimensional real vector spaces, from a higher dimensional space to a lower dimensional space.
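These definitions translate directly into code. The following is an illustrative sketch derived from the definitions above, not Signatory's internal representation: words are encoded as tuples of letters $1, \dots, d$, and elements of $\mathrm{span}(A^{+N})$ as dictionaries from words to real coefficients.

```python
from itertools import product

def is_lyndon(word):
    """A word is Lyndon iff it is strictly earlier, lexicographically,
    than every nontrivial rotation of itself."""
    return all(word < word[i:] + word[:i] for i in range(1, len(word)))

def lyndon_words(d, N):
    """All Lyndon words of length 1..N over the alphabet {1, ..., d}."""
    return [w for k in range(1, N + 1)
              for w in product(range(1, d + 1), repeat=k)
              if is_lyndon(w)]

def lyndon_split(w):
    """Split a Lyndon word (length >= 2) as (wa, wb), where wb is its
    longest proper Lyndon suffix and wa the corresponding prefix."""
    for j in range(1, len(w)):
        if is_lyndon(w[j:]):
            return w[:j], w[j:]

def bracket(p, q):
    """Commutator [p, q] = pq - qp, with concatenation distributed over
    linear combinations of words."""
    out = {}
    for wp, cp in p.items():
        for wq, cq in q.items():
            out[wp + wq] = out.get(wp + wq, 0) + cp * cq
            out[wq + wp] = out.get(wq + wp, 0) - cp * cq
    return out

def phi(w):
    """phi on a single Lyndon word, expanded as a combination of words."""
    if len(w) == 1:
        return {w: 1}
    wa, wb = lyndon_split(w)
    return bracket(phi(wa), phi(wb))

def psi(p):
    """psi keeps only the coefficients of Lyndon words."""
    return {w: c for w, c in p.items() if is_lyndon(w)}

# Reproducing the examples above, with letters 1, 2 standing for a1, a2:
print(lyndon_words(2, 3))  # → [(1,), (2,), (1, 2), (1, 1, 2), (1, 2, 2)]
print(phi((1, 2, 2)))      # → {(1, 2, 2): 1, (2, 1, 2): -2, (2, 2, 1): 1}
```

The expansion of $\phi(a_1 a_2 a_2)$ matches the worked example above; the number of Lyndon words of length at most $N$ (here 5, for $d = 2$, $N = 3$) is the dimension of the truncated logsignature.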

A.2.2 A BASIS FOR SIGNATURES

Recall that the signature transform maps between spaces as follows:
$$\mathrm{Sig}^N \colon \mathcal{S}(\mathbb{R}^d) \to \bigoplus_{k=1}^{N} (\mathbb{R}^d)^{\otimes k}.$$
Let $\{e_i \mid 1 \leq i \leq d\}$ be the usual basis for $\mathbb{R}^d$. Then $\{e_{i_1} \otimes \cdots \otimes e_{i_k} \mid 1 \leq i_1, \dots, i_k \leq d\}$ is a basis for $(\mathbb{R}^d)^{\otimes k}$. An arbitrary element of $\bigoplus_{k=1}^{N} (\mathbb{R}^d)^{\otimes k}$ may be written as
$$\left( \sum_{i_1, \dots, i_k = 1}^{d} \alpha_{i_1, \dots, i_k} \, e_{i_1} \otimes \cdots \otimes e_{i_k} \right)_{1 \leq k \leq N} \tag{15}$$
for some $\alpha_{i_1, \dots, i_k}$. Then $A^{+N}$ may be used to represent a basis for $\bigoplus_{k=1}^{N} (\mathbb{R}^d)^{\otimes k}$. Identify $e_{i_1} \otimes \cdots \otimes e_{i_k}$ with $a_{i_1} \cdots a_{i_k}$. Extend linearly, so as to identify expression (15) with the formal sum of words $\sum_{k=1}^{N} \sum_{i_1, \dots, i_k = 1}^{d} \alpha_{i_1, \dots, i_k} \, a_{i_1} \cdots a_{i_k}$. With this identification,
$$\mathrm{span}(A^{+N}) \cong \bigoplus_{k=1}^{N} (\mathbb{R}^d)^{\otimes k}. \tag{16}$$

A.2.3 BASES FOR LOGSIGNATURES

Suppose we have some $x \in \mathcal{S}(\mathbb{R}^d)$. Using the identification in equation (16), we may seek some $\overline{x} \in \mathrm{span}(\mathcal{L}(A^{+N}))$ such that
$$\phi(\overline{x}) = \log\big(\mathrm{Sig}^N(x)\big). \tag{17}$$
This is an overdetermined linear system: as a matrix, $\phi$ is tall and thin. However it turns out that $\mathrm{image}(\log) = \mathrm{image}(\phi)$, and moreover there exists a unique solution (Reizenstein & Graham, 2018). (That it is an overdetermined system is typically the point of the logsignature transform over the signature transform, as it then represents the same information in less space.)

If $\overline{x} = \sum_{\ell \in \mathcal{L}(A^{+N})} \alpha_\ell \, \ell$, with $\alpha_\ell \in \mathbb{R}$, then by linearity $\sum_{\ell \in \mathcal{L}(A^{+N})} \alpha_\ell \, \phi(\ell) = \log\big(\mathrm{Sig}^N(x)\big)$, so that $\phi(\mathcal{L}(A^{+N}))$ is a basis, called the Lyndon basis, of $\mathrm{image}(\log)$. When calculating the logsignature transform on a computer, the collection of $\alpha_\ell$ are a sensible choice for representing the result, and indeed this is what is done by iisignature. See Reizenstein & Graham (2018) for details of this procedure.

However, it turns out that this is unnecessarily expensive. In deep learning, it is typical to apply a learnt linear transformation after a nonlinearity, in which case we largely do not care in what basis we represent the logsignature, and it turns out that we can find a more efficient one. The Lyndon basis exhibits a particular triangularity property (Reutenauer, 1993, Theorem 5.1), (Reizenstein, 2019, Theorem 32), meaning that for all $\ell \in \mathcal{L}(A^{+N})$, $\phi(\ell)$ has coefficient zero for any Lyndon word lexicographically earlier than $\ell$. This property has already been exploited by iisignature to solve equation (17) efficiently, but we can do better: it means that $\psi \circ \phi \colon \mathrm{span}(\mathcal{L}(A^{+N})) \to \mathrm{span}(\mathcal{L}(A^{+N}))$ is a triangular linear map, and so in particular it is invertible and defines a change of basis; it is this alternate basis that we shall use instead. Instead of seeking $\overline{x}$ as in equation (17), we may now instead seek $z \in \mathrm{span}(\mathcal{L}(A^{+N}))$ such that
$$\big(\phi \circ (\psi \circ \phi)^{-1}\big)(z) = \log\big(\mathrm{Sig}^N(x)\big).$$
But now, by simply applying $\psi$ to both sides:
$$z = \psi\big(\log\big(\mathrm{Sig}^N(x)\big)\big).$$
This is now incredibly easy to compute. Once $\log\big(\mathrm{Sig}^N(x)\big)$ has been computed, and interpreted as in equation (16), the operation of $\psi$ is simply to extract the coefficients of all the Lyndon words, and we are done.
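The whole procedure can be demonstrated in a few lines at depth $N = 2$, representing elements of the truncated tensor algebra as dictionaries from words (tuples of letters) to coefficients. This is an illustrative sketch of the mathematics only; Signatory's actual implementation is in C++ and works rather differently.

```python
def mul(p, q, N):
    """Truncated concatenation (tensor) product of two word polynomials."""
    out = {}
    for wp, cp in p.items():
        for wq, cq in q.items():
            if len(wp) + len(wq) <= N:
                w = wp + wq
                out[w] = out.get(w, 0) + cp * cq
    return out

def exp_trunc(a, N):
    """exp(a) = 1 + a + a^2/2 + ..., truncated at depth N; () is the empty word."""
    out, term = {(): 1.0}, {(): 1.0}
    for k in range(1, N + 1):
        term = {w: c / k for w, c in mul(term, a, N).items()}
        for w, c in term.items():
            out[w] = out.get(w, 0) + c
    return out

def log_trunc(g, N):
    """log(1 + x) = x - x^2/2 + ..., truncated at depth N."""
    x = {w: c for w, c in g.items() if w != ()}
    out, term, sign = {}, {(): 1.0}, 1
    for k in range(1, N + 1):
        term = mul(term, x, N)
        for w, c in term.items():
            out[w] = out.get(w, 0) + sign * c / k
        sign = -sign
    return out

# Path increments (1, 0) then (0, 1) in R^2; Chen's identity gives
# Sig = exp(v1) * exp(v2).
N = 2
v1, v2 = {(1,): 1.0}, {(2,): 1.0}
sig = mul(exp_trunc(v1, N), exp_trunc(v2, N), N)
logsig = log_trunc(sig, N)

# psi: keep only the Lyndon-word coefficients.  For d = 2, N = 2 the Lyndon
# words are (1,), (2,) and (1, 2); the last coefficient is the Levy area.
lyndon = [(1,), (2,), (1, 2)]
print([logsig.get(w, 0) for w in lyndon])  # → [1.0, 1.0, 0.5]
```

The full logsignature here also carries coefficients for the non-Lyndon words $a_1 a_1$, $a_2 a_2$ and $a_2 a_1$, but these are determined by the Lyndon-word coefficients, which is exactly why $\psi$ loses no information on $\mathrm{image}(\log)$.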

B LIBTORCH VS CUDA

LibTorch is the C++ equivalent of the PyTorch library. GPU support in Signatory was provided by using the operations provided by LibTorch. It was a deliberate choice not to write custom CUDA kernels, for the following reason. We must choose between distributing source code and distributing precompiled binaries. If we distribute source code, then we rely on users being able to compile CUDA, which is far from guaranteed. Meanwhile, distributing precompiled binaries is unfortunately not feasible on Linux. C/C++ extensions for Python are typically compiled for Linux using the 'manylinux' specification, and indeed PyPI will only host binaries claiming to be compiled according to this specification. Unfortunately, based on our inspection of its build scripts, PyTorch appears not to conform to this specification: it compiles against a later version of CentOS than is supported by manylinux, and then subsequently modifies things so as to seem compatible with the manylinux specification. Unpicking precisely how PyTorch does this, so that we might duplicate the necessary functionality (as we must necessarily remain compatible with PyTorch as well), was judged to be a finicky task full of hard-to-test edge cases, relying on an implementation detail of PyTorch that should not be relied upon and that may not remain stable across future versions.

C BACKPROPAGATION

Backpropagation is calculated in the usual way, mathematically speaking, by treating the signature and logsignature transforms as a composition of differentiable primitives, as discussed in Section 2.2. The backpropagation computations are handwritten, and are not generated by autodifferentiation. This improves the speed of the computation by using C++ primitives, rather than high-level tensors.

C.1 REVERSIBILITY

Moreover, it allows us to exploit a reversibility property of the signature (Reizenstein, 2019). When backpropagating through any forward operation, the forward results are typically stored in memory, as these are used in the backward pass. However, recall the grouplike structure of the signature; in particular this means that
$$\mathrm{Sig}^N((x_1, \dots, x_{L-1})) = \mathrm{Sig}^N((x_1, \dots, x_L)) \otimes \mathrm{Sig}^N((x_{L-1}, x_L))^{-1} = \mathrm{Sig}^N((x_1, \dots, x_L)) \otimes \mathrm{Sig}^N((x_L, x_{L-1})). \tag{18}$$

Consider computing $\mathrm{Sig}^N((x_1, \dots, x_L))$ by iterating through equation (3) from left to right. Reversibility means we do not need to store the intermediate computations $\mathrm{Sig}^N((x_1, \dots, x_j))$: given the final $\mathrm{Sig}^N((x_1, \dots, x_L))$, we can recover the $\mathrm{Sig}^N((x_1, \dots, x_j))$ in the order that they are needed in the backward pass, by repeatedly applying equation (18).

We remark in Section 2.5 that the signature may be interpreted as the solution to a differential equation. This recomputation procedure actually corresponds to the adjoint method for backpropagating through a differential equation, as popularised in machine learning via Chen et al. (2018). Importantly, however, this does not face reconstruction errors in the same way as neural differential equations (Gholami et al., 2019). Because the driving path $f$ is taken to be piecewise affine in Definition 4, the differential equation defining the signature may be solved exactly, without numerical approximations.
Equation (18) uses the same basic operations as the forward operation, and can be computed using the same subroutines, including the fused multiply-exponentiate.
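The reversibility identity of equation (18) can be checked numerically at depth $N = 2$, where the truncated signature is a (level-1 vector, level-2 matrix) pair and Chen's identity gives the product. This is a sketch of the idea only, not Signatory's implementation.

```python
import numpy as np

def chen(a, b):
    """Chen's identity at depth 2: combine the signatures (s1, s2) of two
    adjacent path segments."""
    a1, a2 = a
    b1, b2 = b
    return a1 + b1, a2 + b2 + np.outer(a1, b1)

def segment_sig(v):
    """Depth-2 signature exp(v) of a single linear segment with increment v."""
    return v, np.outer(v, v) / 2

def sig(path):
    """Depth-2 signature of a piecewise linear path given by its points."""
    increments = np.diff(path, axis=0)
    s = segment_sig(increments[0])
    for v in increments[1:]:
        s = chen(s, segment_sig(v))
    return s

rng = np.random.default_rng(0)
path = rng.normal(size=(10, 3))  # 10 points in R^3

full = sig(path)           # Sig((x1, ..., xL))
shorter = sig(path[:-1])   # Sig((x1, ..., x_{L-1}))

# Undo the last step: multiply by Sig((xL, x_{L-1})) = exp(x_{L-1} - x_L).
undone = chen(full, segment_sig(path[-2] - path[-1]))

assert np.allclose(undone[0], shorter[0])
assert np.allclose(undone[1], shorter[1])
```

No intermediate signatures are stored here: the shorter signature is recovered exactly from the final one, which is the property the backward pass exploits.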

C.2 SPEED VERSUS MEMORY TRADE-OFFS

The reversibility procedure just described introduces the additional cost of recomputing the intermediate signatures along the path, rather than just holding them in memory. In principle this recomputation could be avoided by holding partial results in memory. For simplicity we do not offer this as an alternative in Signatory. Signature-based techniques are often applied to long or high-frequency data (Lyons et al., 2014; Morrill et al., 2020b), for which the large size of multiple partially computed signatures can easily become a memory issue. Nonetheless this represents an opportunity for further work.

C.3 PARALLELISM

The use of parallelism in the gradient computation depends upon whether reversibility is used, as just discussed. Consider first the case in which reversibility is not used, and all intermediate results are held in memory. As discussed in Section 5.1, the forward operation may be computed in parallel as a reduction. The computation graph (within the signature computation) then looks like a balanced tree, and so the backward operation through this computation graph may be performed in parallel as well. However, if reversibility is used, then only the final $\mathrm{Sig}^N((x_1, \dots, x_L))$ is held in memory, and the intermediate computations necessary to backpropagate in parallel are not available. As Signatory uses reversibility, backpropagation is not performed in parallel. This represents an opportunity for further work, but practically speaking we expect its impact to be only moderate. Backpropagation is typically performed as part of a training procedure over batches of data; thus the available parallelism may already be saturated by parallelism over the batch, and by the intrinsic parallelism available within each primitive operation.
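The reduction structure of the forward pass can be sketched abstractly. Any associative product (the signature's $\otimes$ via Chen's identity; here plain matrix multiplication is used as a stand-in for brevity) may be combined pairwise in a balanced tree, giving the same result as a left-to-right scan but with only logarithmic sequential depth, so that each level of the tree may run in parallel:

```python
import numpy as np

def tree_reduce(op, items):
    """Reduce items with an associative op by pairwise (balanced-tree)
    combination: O(log L) sequential depth instead of O(L)."""
    items = list(items)
    while len(items) > 1:
        paired = [op(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:       # odd element carries over to the next level
            paired.append(items[-1])
        items = paired
    return items[0]

rng = np.random.default_rng(0)
mats = [rng.normal(size=(4, 4)) for _ in range(13)]

# Left-to-right scan, as in the sequential forward pass.
left_to_right = mats[0]
for m in mats[1:]:
    left_to_right = left_to_right @ m

# Associativity guarantees the tree reduction agrees.
assert np.allclose(tree_reduce(np.matmul, mats), left_to_right)
```

The sequential scan produces each prefix product as a by-product (useful for the backward pass), whereas the tree does not, which is precisely the tension with reversibility described above.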

D FURTHER BENCHMARKS

D.1 CODE FOR REPRODUCIBILITY

The benchmarks may be reproduced with the following code on a Linux system. First we install the necessary packages.

pip install numpy==1.18.0 matplotlib==3.0.3 torch==1.5.0
pip install iisignature==0.24 esig==0.6.31 signatory==1.2.1.1.5.0
git clone https://github.com/patrick-kidger/signatory.git
cd signatory

Note that numpy must be installed before iisignature, and PyTorch must be installed before Signatory. The unusually long version number for Signatory is necessary to specify both the version of Signatory, and the version of PyTorch that it is for. The git clone is necessary as the benchmarking code is not distributed via pip.

Now run

python command.py benchmark -help

for further details on how to run any particular benchmark. For example,

python command.py benchmark -m time -f sigf -t channels -o graph

will reproduce Figure 1a.

D.2 MEMORY BENCHMARKS

Our benchmark scripts offer some limited ability to benchmark memory consumption, via the -m memory flag to the benchmark scripts. The usual approach to such benchmarking, using valgrind's massif, necessarily includes measuring the set-up code. As this includes loading both the Python interpreter and PyTorch, measuring the memory usage of our code becomes tricky. As such we use an alternate method, in which the memory usage is sampled at intervals, using the Python package memory_profiler, which may be installed via pip install memory_profiler. This in turn has the limitation that it may miss a peak in memory usage; for small calculations it may miss the entire calculation. Furthermore, the values reported are inconsistent with those reported in Reizenstein & Graham (2018). Nonetheless, when compared against iisignature using memory_profiler, on larger computations where peaks are less likely to go unobserved, Signatory typically uses an order of magnitude less memory. However, due to the limitations above, we have chosen not to report quantitative memory benchmarks here.

D.3 SIGNATURE TRANSFORM BENCHMARKS

The precise values of the points of Figures 1 and 2 are shown in Tables 1, 2, 3 and 4. For convenience, the ratio between the speed of Signatory and the speed of iisignature is also shown.

D.4 LOGSIGNATURE TRANSFORM BENCHMARKS

See Figure 4 for the graphs of the benchmarks for the logsignature transform. The computer and runtime environment used was as described in Section 6. We observe similar behaviour to the benchmarks for the signature transform: iisignature is slightly faster for some very small computations, but as problem size increases, Signatory swiftly overtakes iisignature, and is orders of magnitude faster for larger computations. The precise values of the points on these graphs are shown in Tables 5, 6, 7 and 8. Times are given in seconds. Also shown is the ratio between the speed of Signatory and the speed of iisignature. A dash indicates that esig does not support that operation.

D.5 SINGLE-ELEMENT-BATCH BENCHMARKS

The benchmarks so far considered were for a batch of samples (of size 32). Whilst this is of particular relevance for training, it is sometimes less relevant for inference. We now repeat all the previous benchmarks (forward and backward through both signature and logsignature, varying both depth and channels), except that the batch dimension is reduced to size 1. See Figures 5 and 6. Numerical values are presented in Tables 9, 10, 11, 12, 13, 14, 15 and 16. Here we see on very small problems that iisignature now outperforms Signatory by about a millisecond, but once again Signatory overtakes iisignature on reasonably-sized problems, and is still orders of magnitude faster on larger problems.

We do not regard the performance on very small single-element problems as a drawback of Signatory. If performing very few very small calculations, then the difference of a millisecond is irrelevant. If performing very many very small calculations, then these can typically be batched together.

Figure 5: Time taken on benchmark computations to compute the specified operation. In all cases the input was a "batch" of 1 sequence, of length 128. For varying channels, the depth was fixed at 7. For varying depths, the channels was fixed at 4. Every test case was repeated 50 times and the fastest time taken. Note that esig is only shown for certain operations as it is incapable of computing large operations or of computing backward operations. Note the logarithmic scale.

Figure 6: Time taken on benchmark computations to compute the specified operation. In all cases the input was a "batch" of 1 sequence, of length 128. For varying channels, the depth was fixed at 7. For varying depths, the channels was fixed at 4. Every test case was repeated 50 times and the fastest time taken. Note that esig is only shown for certain operations as it is incapable of computing large operations or of computing backward operations. Note the logarithmic scale.



Footnotes:

1. And may be extended to paths of bounded variation, or merely finite p-variation (Lyons et al., 2004).
2. Additionally, many texts also include a k = 0 term, which is defined to equal one. We omit this as it does not carry any information, and is therefore irrelevant to the task of machine learning.
3. Most texts use ⊗ rather than the symbol used here to denote this operation, as it may be regarded as a generalisation of the tensor product. That will not be important to us, however, so we use differing notation to aid interpretation.
4. log is actually a bijection. $\mathrm{image}(\mathrm{Sig}^N)$ is some curved manifold in $\bigoplus_{k=1}^{N} (\mathbb{R}^d)^{\otimes k}$, and log is the map that straightens it out into a linear subspace.
5. Note that we start with $\mathrm{Sig}^N((x_1, x_2))$, as two is the shortest a sequence of data can be to define a path; see Definition 4.
6. Subject to nontrivial overhead.



Figure 1: Time taken on benchmark computations to compute the signature transform. esig is only shown for small operations as it is incapable of larger operations. Note the logarithmic scale.

Figure 2: Time taken on benchmark computations to backpropagate through the signature transform. esig is not shown as it is incapable of computing backward operations. Note the logarithmic scale.

Figure 3: Loss against wall-clock time for a deep signature model. Both plots are identical; the second plot uses a log-scaled x-axis.

The fused multiply-exponentiate then involves $d^2 + d^3 + \cdots + d^N$ multiplications. This is because, working from innermost bracket to outermost, the first $\otimes$ produces a $d \times d$ matrix as the outer product of two size-$d$ vectors, and may thus be computed with $d^2$ multiplications; the second $\otimes$ produces a $d \times d \times d$ tensor from a $d \times d$ matrix and a size-$d$ vector, and may thus be computed with $d^3$ multiplications; and so on. Thus the overall cost of a fused multiply-exponentiate is
$$F(d, N) = d(N-1) + \sum_{k=2}^{N} d^k,$$
establishing the uniform bound $F(d, N) \leq C(d, N)$ for all $d \geq 1$ and $N \geq 1$.
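This multiplication count may be tabulated directly. A sketch follows; note that the $d(N-1)$ term for the scalar multiplications in the exponential is our reading of the truncated formula above, and should be treated as an assumption.

```python
def fused_cost(d, N):
    """Multiplications for one fused multiply-exponentiate: d*(N-1) scalar
    multiplications from the exponential series (assumed), plus
    d^2 + ... + d^N from the successive outer products."""
    return d * (N - 1) + sum(d ** k for k in range(2, N + 1))

# e.g. d = 4 channels at depth N = 7, as in the benchmarks:
print(fused_cost(4, 7))  # → 21864
```

The geometric sum dominates, so the cost is essentially $O(d^N)$, the size of the signature itself.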

Figure 4: Time taken on benchmark computations to compute the specified operation. In all cases the input was a batch of 32 sequences of data, each of length 128. For varying channels, the depth was fixed at 7. For varying depths, the channels was fixed at 4. Every test case was repeated 50 times and the fastest time taken. Note that esig is only shown for certain operations as it is incapable of computing large operations or of computing backward operations. Note the logarithmic scale.

(a) Signature forward, varying channels (b) Signature backward, varying channels (c) Signature forward, varying depths (d) Signature backward, varying depths

(a) Logsignature forward, varying channels (b) Logsignature backward, varying channels (c) Logsignature forward, varying depths (d) Logsignature backward, varying depths

Table 1: Signature forward, varying channels. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 2: Signature backward, varying channels. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 3: Signature forward, varying depths. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 4: Signature backward, varying depths. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 5: Logsignature forward, varying channels. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 6: Logsignature backward, varying channels. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 7: Logsignature forward, varying depths. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 8: Logsignature backward, varying depths. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 9: Signature forward, varying channels, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 10: Signature backward, varying channels, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 11: Signature forward, varying depths, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 12: Signature backward, varying depths, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 13: Logsignature forward, varying channels, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 14: Logsignature backward, varying channels, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 15: Logsignature forward, varying depths, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

Table 16: Logsignature backward, varying depths, single-element-batch. Times are given in seconds. A dash indicates that esig does not support that operation.

ACKNOWLEDGEMENTS

This work was supported by the Engineering and Physical Sciences Research Council [EP/L015811/1].

AVAILABILITY

Tests may be found at https://github.com/

