SEQ2TENS: AN EFFICIENT REPRESENTATION OF SEQUENCES BY LOW-RANK TENSOR PROJECTIONS

Abstract

Sequential data such as time series, video, or text can be challenging to analyse as the ordered structure gives rise to complex dependencies. At the heart of this is non-commutativity, in the sense that reordering the elements of a sequence can completely change its meaning. We use a classical mathematical object, the free algebra, to capture this non-commutativity. To address the innate computational complexity of this algebra, we use compositions of low-rank tensor projections. This yields modular and scalable building blocks that give state-of-the-art performance on standard benchmarks such as multivariate time series classification, mortality prediction, and generative models for video. Code and benchmarks are publicly available at

1. INTRODUCTION

A central task of learning is to find representations of the underlying data that efficiently and faithfully capture their structure. In the case of sequential data, one data point consists of a sequence of objects. This is a rich and non-homogeneous class of data that includes classical uni- or multivariate time series (sequences of scalars or vectors), video (sequences of images), and text (sequences of letters). Particular challenges of sequential data are that each sequence entry can itself be a highly structured object and that data sets typically include sequences of different lengths, which makes naive vectorization troublesome.

Contribution. Our main result is a generic method that takes a static feature map for a class of objects (e.g. a feature map for vectors, images, or letters) as input and turns it into a feature map for sequences of arbitrary length of such objects (e.g. a feature map for time series, video, or text). We call this feature map for sequences Seq2Tens for reasons that will become clear; among its attractive properties are that it (i) provides a structured, parsimonious description of sequences, generalizing classical methods for strings, (ii) comes with theoretical guarantees such as universality, and (iii) can be turned into modular and flexible neural network (NN) layers for sequence data. The key ingredient of our approach is to embed the feature space of the static feature map into a larger linear space that forms an algebra (a vector space equipped with a multiplication). The product in this algebra is then used to "stitch together" the static features of the individual sequence entries in a structured way. The construction that makes all of this possible is classical in mathematics and known as the free algebra (over the static feature space).

Outline. Section 2 formalizes the main ideas of Seq2Tens and introduces the free algebra T(V) over a space V as well as the associated product, the so-called tensor convolution product.
Section 3 shows how low-rank (LR) constructions combined with sequence-to-sequence transforms allow one to efficiently use this rich algebraic structure. Section 4 applies the results of Sections 2 and 3 to build modular and scalable NN layers for sequential data. Section 5 demonstrates the flexibility and modularity of this approach on both discriminative and generative benchmarks. Section 6 makes connections with previous work and summarizes this article. In the appendices we provide mathematical background, extensions, and detailed proofs of our theoretical results.

2. CAPTURING ORDER BY NON-COMMUTATIVE MULTIPLICATION

We denote the set of sequences of elements in a set X by

Seq(X) = {x = (x_i)_{i=1,...,L} : x_i ∈ X, L ≥ 1},  (1)

where L ≥ 1 is an arbitrary length. Even if X itself is a linear space, e.g. X = R, Seq(X) is never a linear space, since there is no natural addition of two sequences of different lengths.

Seq2Tens in a nutshell. Given any vector space V we may construct the so-called free algebra T(V) over V. We describe the space T(V) in detail below; for now, the only thing that matters is that T(V) is also a vector space that includes V, and that it carries a non-commutative product which is, in a precise sense, "the most general product" on V. The main idea of Seq2Tens is that any "static feature map" φ : X → V for elements of X can be used to construct a new feature map Φ : Seq(X) → T(V) for sequences in X by using the algebraic structure of T(V): the non-commutative product on T(V) makes it possible to "stitch together" the individual features φ(x_1), ..., φ(x_L) ∈ V ⊂ T(V) of the sequence x in the larger space T(V) by multiplication in T(V). With this we may define the feature map Φ(x) for a sequence x = (x_1, ..., x_L) ∈ Seq(X) as follows:

(i) lift the map φ : X → V to a map ϕ : X → T(V),
(ii) map Seq(X) → Seq(T(V)) by (x_1, ..., x_L) ↦ (ϕ(x_1), ..., ϕ(x_L)),
(iii) map Seq(T(V)) → T(V) by multiplication, (ϕ(x_1), ..., ϕ(x_L)) ↦ ϕ(x_1) • ⋯ • ϕ(x_L).

In a more concise form, we define Φ as

Φ : Seq(X) → T(V),  Φ(x) = ϕ(x_1) • ⋯ • ϕ(x_L),  (2)

where • denotes multiplication in T(V). We refer to the resulting map Φ as the Seq2Tens map, which is short for Sequences-2-Tensors. Why is this construction a good idea? First note that step (i) is always possible since V ⊂ T(V); we discuss the simplest such lift before Theorem 2.1 as well as other choices in Appendix B. Further, if φ, respectively ϕ, provides a faithful representation of objects in X, then there is no loss of information in step (ii).
Finally, since step (iii) uses "the most general product" to multiply ϕ(x_1) • ⋯ • ϕ(x_L), one expects that Φ(x) ∈ T(V) faithfully represents the sequence x as an element of T(V). Indeed, in Theorem 2.1 below we show an even stronger statement: if the static feature map φ : X → V contains enough non-linearities so that non-linear functions from X to R can be approximated as linear functions of φ, then the above construction extends this property to functions of sequences. Put differently, if φ is a universal feature map for X, then Φ is a universal feature map for Seq(X); that is, any non-linear function f(x) of a sequence x can be approximated as a linear functional of Φ(x), f(x) ≈ ⟨ℓ, Φ(x)⟩. We also emphasize that the domain of Φ is the space Seq(X) of sequences of arbitrary (finite) length. The remainder of this section gives more details about steps (i), (ii), (iii) of the construction of Φ.

The free algebra T(V) over a vector space V. Let V be a vector space. We denote by T(V) the set of sequences of tensors indexed by their degree m,

T(V) := {t = (t_m)_{m≥0} | t_m ∈ V^{⊗m}},  (3)

where by convention V^{⊗0} = R. For example, if V = R^d and t = (t_m)_{m≥0} is some element of T(R^d), then its degree m = 1 component is a d-dimensional vector t_1, its degree m = 2 component is a d × d matrix t_2, and its degree m = 3 component is a degree-3 tensor t_3. By defining addition and scalar multiplication as

s + t := (s_m + t_m)_{m≥0},  c · t := (c t_m)_{m≥0},  (4)

the set T(V) becomes a linear space. By identifying v ∈ V with the element (0, v, 0, 0, ...) ∈ T(V) we see that V is a linear subspace of T(V). Moreover, while V is only a linear space, T(V) carries a product that turns it into an algebra. This product is the so-called tensor convolution product, defined for s, t ∈ T(V) as

s • t := ( Σ_{i=0}^{m} s_i ⊗ t_{m-i} )_{m≥0},  (5)

which for s_0 = t_0 = 1 reads (1, s_1 + t_1, s_2 + s_1 ⊗ t_1 + t_2, ...) ∈ T(V). Here ⊗ denotes the usual outer tensor product; e.g. for vectors u = (u_i), v = (v_i) ∈ R^d the outer tensor product u ⊗ v is the d × d matrix (u_i v_j)_{i,j=1,...,d}. We emphasize that, like the outer tensor product ⊗, the tensor convolution product • is non-commutative, i.e. s • t ≠ t • s in general. In a mathematically precise sense, T(V) is the most general algebra that contains V; it is a "free construction". Since T(V) is realized as a series of tensors of increasing degree, the free algebra T(V) is also known as the tensor algebra in the literature. Appendix A contains background on tensors and further examples.

Lifting static feature maps. Step (i) in the construction of Φ requires turning a given feature map φ : X → V into a map ϕ : X → T(V). Throughout the rest of this article we use the lift

ϕ(x) = (1, φ(x), 0, 0, ...) ∈ T(V).  (6)

We discuss other choices in Appendix B, but attractive properties of the lift in equation 6 are that (a) the evaluation of Φ against low-rank tensors becomes a simple recursive formula (Proposition 3.3), (b) it generalizes sequence sub-pattern matching as used in string kernels (Appendix B.3), and (c) despite its simplicity it performs exceedingly well in practice (Section 4).

Extending to sequences of arbitrary length. Steps (i) and (ii) of the construction specify how the map Φ behaves on sequences of length 1, that is, single observations. Step (iii) amounts to the requirement that for any two sequences x = (x_1, ..., x_K), y = (y_1, ..., y_L) ∈ Seq(V), their concatenation z = (x_1, ..., x_K, y_1, ..., y_L) ∈ Seq(V) can be understood in the feature space as (non-commutative) multiplication of their corresponding features,

Φ(z) = Φ(x) • Φ(y).  (7)

In other words, we inductively extend the lift ϕ to sequences of arbitrary length, starting from sequences consisting of a single observation, which is given in equation 2.
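To make the tensor convolution product concrete, here is a minimal numpy sketch for elements of T(R^d) truncated at a finite degree. The function name and the list-of-arrays representation are our own illustrative conventions, not taken from the paper's released code.

```python
import numpy as np

def conv_product(s, t, d, M=2):
    """Tensor convolution product on T(R^d), truncated at degree M.

    s and t are lists [s_0, s_1, ..., s_M]: s_0 a scalar, s_1 a (d,)
    vector, s_2 a (d, d) matrix, and so on.  Implements
    (s • t)_m = sum_{i=0}^{m} s_i ⊗ t_{m-i}  (equation 5), truncated.
    """
    out = []
    for m in range(M + 1):
        acc = np.zeros((d,) * m)
        for i in range(m + 1):
            # np.multiply.outer of a degree-i and a degree-(m-i) tensor
            # gives a degree-m tensor (scalars act as degree 0).
            acc = acc + np.multiply.outer(np.asarray(s[i]), np.asarray(t[m - i]))
        out.append(acc)
    return out
```

For lifts with s_0 = t_0 = 1 this reproduces the expansion above: degree 1 is s_1 + t_1 and degree 2 is s_2 + s_1 ⊗ t_1 + t_2; swapping the arguments changes the degree-2 term to t_1 ⊗ s_1, exhibiting the non-commutativity.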
Repeatedly applying the definition of the tensor convolution product in equation 5 leads to the following explicit formula:

Φ_m(x) = Σ_{1≤i_1<⋯<i_m≤L} x_{i_1} ⊗ ⋯ ⊗ x_{i_m} ∈ V^{⊗m},  Φ(x) = (Φ_m(x))_{m≥0},  (8)

where x = (x_1, ..., x_L) ∈ Seq(V) and the summation is over non-contiguous subsequences of x.

Some intuition: generalized pattern matching. Our derivation of the feature map Φ(x) = (1, Φ_1(x), Φ_2(x), ...) ∈ T(V) was guided by general algebraic principles, but equation 8 provides an intuitive interpretation. It shows that for each m ≥ 1, the entry Φ_m(x) ∈ V^{⊗m} constructs a summary of a long sequence x = (x_1, ..., x_L) ∈ Seq(V) based on subsequences (x_{i_1}, ..., x_{i_m}) of x of length m. It does this by taking the usual outer tensor product x_{i_1} ⊗ ⋯ ⊗ x_{i_m} ∈ V^{⊗m} and summing over all possible such subsequences. This is completely analogous to how string kernels provide a structured description of text by looking at non-contiguous substrings of length m (indeed, Appendix B.3 makes this rigorous). The main difference is that the above construction works for arbitrary sequences and not just sequences of discrete letters. Readers with less mathematical background may simply take this as motivation and regard equation 8 as a definition. However, the algebraic background allows us to prove that Φ is universal; see Theorem 2.1 below.

Universality. A function φ : X → V is said to be universal for X if all continuous functions on X can be approximated as linear functions on the image of φ. One of the most powerful features of neural nets is their universality (Hornik, 1991). A very attractive property of Φ is that it preserves universality: if φ : X → V is universal for X, then Φ : Seq(X) → T(V) is universal for Seq(X). To make this precise, note that V^{⊗m} is a linear space and therefore any ℓ = (ℓ_0, ℓ_1, ..., ℓ_M, 0, 0, ...) ∈ T(V) consisting of M tensors ℓ_m ∈ V^{⊗m} yields a linear functional on T(V); e.g. if V = R^d and we identify ℓ_m in coordinates as ℓ_m = (ℓ_m^{i_1,...,i_m})_{i_1,...,i_m ∈ {1,...,d}}, then

⟨ℓ, t⟩ := Σ_{m=0}^{M} ⟨ℓ_m, t_m⟩ = Σ_{m=0}^{M} Σ_{i_1,...,i_m ∈ {1,...,d}} ℓ_m^{i_1,...,i_m} t_m^{i_1,...,i_m}.  (9)

Thus linear functionals of the feature map Φ are real-valued functions of sequences. Theorem 2.1 below shows that any continuous function f : Seq(X) → R can be arbitrarily well approximated by an ℓ ∈ T(V), f(x) ≈ ⟨ℓ, Φ(x)⟩.

Theorem 2.1. Let φ : X → V be a universal map with a lift that satisfies some mild constraints. Then the following map is universal: Φ : Seq(X) → T(V), x ↦ Φ(x).

A detailed proof and the precise statement of Theorem 2.1 are given in Appendix B.
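Equation 8 and the concatenation property Φ(z) = Φ(x) • Φ(y) can be sanity-checked by brute force on small examples. The sketch below is ours, assumes the identity static map φ(x) = x (so the sequence entries are the features themselves), and truncates at degree M:

```python
import itertools
import numpy as np

def phi(x, M=2):
    """Brute-force Phi(x) = (1, Phi_1(x), ..., Phi_M(x)) of equation 8:
    Phi_m(x) sums x_{i1} ⊗ ... ⊗ x_{im} over all 1 <= i1 < ... < im <= L."""
    L, d = x.shape
    feats = [np.array(1.0)]  # degree-0 component is 1
    for m in range(1, M + 1):
        acc = np.zeros((d,) * m)
        for idx in itertools.combinations(range(L), m):  # non-contiguous subsequences
            term = x[idx[0]]
            for i in idx[1:]:
                term = np.multiply.outer(term, x[i])  # x_{i1} ⊗ ... ⊗ x_{im}
            acc += term
        feats.append(acc)
    return feats
```

For a concatenation z of x and y, the degree-2 component of equation 7 then reads Φ_2(z) = Φ_2(x) + Φ_1(x) ⊗ Φ_1(y) + Φ_2(y), since every length-2 subsequence of z lies either entirely in x, entirely in y, or straddles the two.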

3. APPROXIMATION BY LOW-RANK LINEAR FUNCTIONALS

The combinatorial explosion of tensor coordinates and what to do about it. The universality of Φ suggests the following approach to represent a function f : Seq(X) → R of sequences: first compute Φ(x), then optimize over ℓ (and possibly also the hyperparameters of φ) such that

f(x) ≈ ⟨ℓ, Φ(x)⟩ = Σ_{m=0}^{M} ⟨ℓ_m, Φ_m(x)⟩.  (10)

Unfortunately, tensors suffer from a combinatorial explosion in complexity, in the sense that even just storing Φ_m(x) ∈ V^{⊗m} ⊂ T(V) requires O(dim(V)^m) real numbers. Below we resolve this computational bottleneck as follows: in Proposition 3.3 we show that for a special class of low-rank elements ℓ ∈ T(V), the functional x ↦ ⟨ℓ, Φ(x)⟩ can be computed efficiently in both time and memory. This is somewhat analogous to a kernel trick, since it shows that ⟨ℓ, Φ(x)⟩ can be computed cheaply without ever computing the feature map Φ(x) explicitly. However, Theorem 2.1 guarantees universality under no restriction on ℓ; thus the restriction to rank-1 functionals limits the class of functions f(x) that can be approximated. Nevertheless, by iterating these "low-rank functional" constructions in the form of sequence-to-sequence transformations, this can be ameliorated. We give the details below, but to gain intuition we invite the reader to think of this iteration as analogous to stacking layers in a neural network: each layer is a relatively simple non-linearity (e.g. a sigmoid composed with an affine function), but by composing such layers, complicated functions can be efficiently approximated.

Rank-1 functionals are computationally cheap. Degree m = 2 tensors are matrices, and low-rank (LR) approximations of matrices are widely used in practice (Udell & Townsend, 2019) to address the quadratic complexity. The definition below generalizes the rank of matrices (tensors of degree m = 2) to tensors of any degree m.

Definition 3.1. The rank (also called CP rank (Carroll & Chang, 1970)) of a degree-m tensor t_m ∈ V^{⊗m} is the smallest number r ≥ 0 such that one may write

t_m = Σ_{i=1}^{r} v_i^1 ⊗ ⋯ ⊗ v_i^m,  v_i^1, ..., v_i^m ∈ V.  (11)

We say that t = (t_m)_{m≥0} ∈ T(V) has rank 1 (and degree M) if each t_m ∈ V^{⊗m} is a rank-1 tensor and t_i = 0 for i > M.

Remark 3.2. For x = (x_1, ..., x_L) ∈ Seq(V), the rank r_m ∈ N of Φ_m(x) satisfies r_m ≤ (L choose m), while the rank and degree r, d ∈ N of Φ(x) satisfy r ≤ (L choose K) for K = ⌊L/2⌋ and d ≤ L.

A direct calculation shows that if ℓ is of rank 1, then ⟨ℓ, Φ(x)⟩ can be computed very efficiently by inner product evaluations in V.

Proposition 3.3. Let ℓ = (ℓ_m)_{m≥0} ∈ T(V) be of rank 1 and degree M. If φ is lifted to ϕ as in equation 6, then

⟨ℓ, Φ(x)⟩ = Σ_{m=0}^{M} Σ_{1≤i_1<⋯<i_m≤L} Π_{k=1}^{m} ⟨v_k^m, φ(x_{i_k})⟩,  (12)

where ℓ_m = v_1^m ⊗ ⋯ ⊗ v_m^m ∈ V^{⊗m} with v_i^m ∈ V, for m = 0, ..., M.

Note that the inner sum is taken over all non-contiguous subsequences of x of length m, analogously to m-mers of strings; we make this connection precise in Appendix B.

Low-rank Seq2Tens maps. The composition of a linear map L : T(V) → R^N with Φ can be computed cheaply in parallel using equation 12 when L is specified through a collection of N ∈ N rank-1 elements ℓ^1, ..., ℓ^N ∈ T(V), such that

Φ̃_θ̃(x_1, ..., x_L) := L ∘ Φ(x_1, ..., x_L) = (⟨ℓ^j, Φ(x_1, ..., x_L)⟩)_{j=1}^{N} ∈ R^N.  (13)

We call the resulting map Φ̃_θ̃ : Seq(X) → R^N a low-rank Seq2Tens (LS2T) map of width N and order M, where M ∈ N is the maximal degree of ℓ^1, ..., ℓ^N, i.e. ℓ_i^j = 0 for i > M. The LS2T map is parametrized by (1) the component vectors v_{j,m}^k ∈ V of the rank-1 elements ℓ_m^j = v_{j,m}^1 ⊗ ⋯ ⊗ v_{j,m}^m, and (2) any parameters θ that the static feature map φ_θ : X → V may depend on. We jointly denote these parameters by θ̃ = (θ, ℓ^1, ..., ℓ^N).
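The recursion over time behind the efficient evaluation of equation 12 can be sketched in a few lines of numpy. This is our own illustrative sketch (the paper's released implementation may differ); it assumes ℓ_0 = 1 and takes the sequence after the static map φ has been applied:

```python
import numpy as np

def rank1_functional(x, vs):
    """Evaluate <l, Phi(x)> for a rank-1, degree-M element l, linearly in L.

    x  : (L, d) array, the sequence after applying the static map phi.
    vs : list whose entry for degree m is an (m, d) array with rows
         v_1^m, ..., v_m^m, the factors of l_m = v_1^m ⊗ ... ⊗ v_m^m.
    """
    L, d = x.shape
    total = 1.0  # degree-0 contribution, taking l_0 = 1
    for v in vs:
        m = v.shape[0]
        # a[k] = sum over length-k subsequences of the prefix seen so far
        # of the partial products <v_1^m, x_{i_1}> ... <v_k^m, x_{i_k}>
        a = np.zeros(m + 1)
        a[0] = 1.0  # empty subsequence
        for t in range(L):
            inner = v @ x[t]  # <v_k^m, x_t> for k = 1..m
            for k in range(m, 0, -1):  # descend so a[k-1] holds the t-1 value
                a[k] += a[k - 1] * inner[k - 1]
        total += a[m]
    return total
```

The update a[k] += a[k-1] * ⟨v_k^m, x_t⟩ extends each length-(k-1) subsequence of the prefix by the new timestep, which is exactly the strictly increasing index constraint i_1 < ⋯ < i_m in equation 12.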
In addition, by the subsequent composition of Φ̃_θ̃ with a linear functional R^N → R, we get the following function subspace as the hypothesis class of the LS2T:

H_LS2T = { Σ_{j=1}^{N} α_j ⟨ℓ^j, Φ(x_1, ..., x_L)⟩ | α_j ∈ R } ⊆ H_Φ = { ⟨ℓ, Φ(x_1, ..., x_L)⟩ | ℓ ∈ T(V) }.  (14)

Hence, we acquire an intuitive explanation of the (hyper)parameters: the width N ∈ N of the LS2T specifies the maximal rank of the low-rank linear functionals of Φ that the LS2T can represent, while the span of the rank-1 elements, span(ℓ^1, ..., ℓ^N), determines an N-dimensional subspace of the dual space of T(V) consisting of at most rank-N functionals. Recall now that without rank restrictions on the linear functionals of Seq2Tens features, Theorem 2.1 would guarantee that any real-valued function f : Seq(X) → R could be approximated by f(x) ≈ ⟨ℓ, Φ(x_1, ..., x_L)⟩. As pointed out before, the restriction of the hypothesis class to low-rank linear functionals of Φ(x_1, ..., x_L) limits the class of functions of sequences that can be approximated. To ameliorate this, we use LS2T transforms in a sequence-to-sequence fashion, which allows us to stack such low-rank functionals, significantly recovering expressiveness.

Sequence-to-sequence transforms. We can use LS2Ts to build sequence-to-sequence transformations in the following way: fix the static map φ_θ : X → V parametrized by θ and rank-1 elements such that θ̃ = (θ, ℓ^1, ..., ℓ^N), and apply the resulting LS2T map Φ̃_θ̃ over expanding windows of x:

Seq(X) → Seq(R^N),  x ↦ (Φ̃_θ̃(x_1), Φ̃_θ̃(x_1, x_2), ..., Φ̃_θ̃(x_1, ..., x_L)).  (15)

Note that, due to the recursive nature of our algorithms, computing the expanding-window sequence-to-sequence transform in equation 15 is no more expensive than computing Φ̃_θ̃(x_1, ..., x_L) itself; for further details see Appendices D.2, D.3, D.4.

Deep sequence-to-sequence transforms. Inspired by the empirical successes of deep RNNs (Graves et al., 2013b;a; Sutskever et al., 2014), we iterate the transformation in equation 15 D times:

Seq(X) → Seq(R^{N_1}) → Seq(R^{N_2}) → ⋯ → Seq(R^{N_D}).  (16)

Each of the mappings Seq(R^{N_i}) → Seq(R^{N_{i+1}}) is parametrized by the parameters θ_i of a static feature map φ_{θ_i} and a linear map L_i specified by N_i rank-1 elements of T(V); these parameters are collectively denoted by θ̃_i = (θ_i, ℓ_i^1, ..., ℓ_i^{N_i}). Evaluating the final sequence in Seq(R^{N_D}) at the last observation time t = L, we get the deep LS2T map of depth D,

Φ̃_{θ̃_1, ..., θ̃_D} : Seq(X) → R^{N_D}.  (17)

Making precise how the stacking of such low-rank sequence-to-sequence transformations approximates general functions requires more tools from algebra, and we provide a rigorous quantitative statement in Appendix C. Here, we simply appeal to the analogy with adding depth in neural networks made earlier, and empirically validate it in our experiments in Section 5.
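The expanding-window transform of equation 15 falls out of the same recursion over time sketched for Proposition 3.3: the running state already contains the functional's value on every prefix, so the whole output sequence costs no more than its last entry. A minimal numpy sketch (our own conventions, assuming ℓ_0 = 1 and a sequence already passed through the static map):

```python
import numpy as np

def ls2t_seq2seq(x, Vs):
    """Expanding-window LS2T transform Seq(R^d) -> Seq(R^N) (equation 15).

    x  : (L, d) sequence (assumed already passed through the static map).
    Vs : list of N rank-1 elements; element j is a list of (m, d) arrays,
         the factors of its degree-m components (degree 0 taken as 1).
    Returns an (L, N) array whose i-th row is Phi~(x_1, ..., x_{i+1}).
    """
    L, d = x.shape
    out = np.empty((L, len(Vs)))
    for j, vs in enumerate(Vs):
        # one accumulator per degree: a[k] sums over length-k subsequences
        accs = [np.concatenate(([1.0], np.zeros(v.shape[0]))) for v in vs]
        for t in range(L):
            total = 1.0  # degree-0 term
            for v, a in zip(vs, accs):
                inner = v @ x[t]
                for k in range(v.shape[0], 0, -1):
                    a[k] += a[k - 1] * inner[k - 1]
                total += a[-1]
            out[t, j] = total  # value of the functional on the prefix x_1..x_{t+1}
    return out
```

A deep LS2T in the sense of equation 16 then simply iterates the transform, e.g. `ls2t_seq2seq(ls2t_seq2seq(x, Vs1), Vs2)`, with the factor dimensions of `Vs2` matched to the width of the first layer.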

4. BUILDING NEURAL NETWORKS WITH LS2T LAYERS

The Seq2Tens map Φ built from a static feature map φ is universal if φ is universal (Theorem 2.1). NNs form a flexible class of universal feature maps with strong empirical success for data in X = R^d, and thus make a natural choice for φ. Combined with standard deep learning constructions, the framework of Sections 2 and 3 can build modular and expressive layers for sequence learning.

Neural LS2T layers. The simplest choice among many is to use as static feature map φ : X = R^d → R^h a feedforward network of depth P, φ = φ_P ∘ ⋯ ∘ φ_1, where φ_j(x) = σ(W_j x + b_j) with W_j ∈ R^{h×d}, b_j ∈ R^h. We can then lift this to a map ϕ : R^d → T(R^h) as prescribed in equation 6. Hence, the resulting LS2T layer x ↦ (Φ̃_θ̃(x_1, ..., x_i))_{i=1,...,L} is a sequence-to-sequence transform Seq(R^d) → Seq(R^{N_1}) parametrized by θ̃ = (W_1, b_1, ..., W_P, b_P, ℓ_1^1, ..., ℓ_1^{N_1}).

Bidirectional LS2T layers. The transformation in equation 15 is completely causal in the sense that each step of the output sequence depends only on past information. For generative models it can be beneficial to make the output depend on both past and future information; see Graves et al. (2013a); Baldi et al. (1999); Li & Mandt (2018). Similarly to bidirectional RNNs and LSTMs (Schuster & Paliwal, 1997; Graves & Schmidhuber, 2005), we may achieve this by defining a bidirectional layer

Φ̃^b_{(θ̃_1, θ̃_2)} : Seq(R^d) → Seq(R^{N+N}),  x ↦ (Φ̃_{θ̃_1}(x_1, ..., x_i), Φ̃_{θ̃_2}(x_i, ..., x_L))_{i=1}^{L}.  (18)

The sequential nature is kept intact by making the distinction between what classifies as past (the first N coordinates) and future (the last N coordinates) information. This amounts to having a form of precognition in the model, and has been applied in e.g. dynamics generation (Li & Mandt, 2018), machine translation (Sundermeyer et al., 2014), and speech processing (Graves et al., 2013a).

Convolutions and LS2T.
We propose replacing the time-distributed feedforward layers of the previous paragraph with temporal convolutions (CNNs). Although the theory only requires the preprocessing layer of the LS2T to be a static feature map, we find it beneficial to capture some of the sequential information in the preprocessing layer as well, e.g. using CNNs or RNNs. From a mathematical point of view, CNNs are a straightforward extension, since a CNN with kernel size p ∈ N can be interpreted as a time-distributed feedforward layer applied to the input sequence augmented with p of its lags (see Appendix D.1 for further discussion). In the following, we precede our deep LS2T blocks by one or more CNN layers. Intuitively, CNNs and LS2Ts are similar in that both transformations operate on subsequences of their input sequence. The main difference is that CNNs operate on contiguous subsequences, and therefore capture local, short-range nonlinear interactions between timesteps, while LS2Ts (equation 12) use all non-contiguous subsequences, and hence learn global, long-range interactions in time. The inductive biases of the two types of layers (local/global interactions in time) are thus highly complementary, and we suggest that the improvements observed in the experiments over models containing vanilla CNN blocks are due to this complementarity.
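The wiring of the bidirectional layer in equation 18 can be illustrated with a toy order-1 LS2T, where the expanding-window transform reduces to a cumulative sum of inner products. This is a deliberately simplified sketch of ours (higher orders would use the recursion of Section 3); for order 1 the two directions happen to coincide numerically, since plain sums are order-independent, but the past/future split of the coordinates is exactly that of equation 18:

```python
import numpy as np

def ls2t_order1(x, V):
    """Toy order-1 LS2T over expanding windows: row i is
    (1 + sum_{t<=i} <v_j, x_t>)_{j=1..N} for factor matrix V of shape (N, d)."""
    return 1.0 + np.cumsum(x @ V.T, axis=0)

def bidirectional_ls2t(x, V_fwd, V_bwd):
    """Bidirectional layer in the spirit of equation 18: the first N output
    coordinates are the causal transform of (x_1..x_i); the last N come from
    running the same transform on the reversed sequence, i.e. over (x_i..x_L)."""
    fwd = ls2t_order1(x, V_fwd)              # past information
    bwd = ls2t_order1(x[::-1], V_bwd)[::-1]  # future information
    return np.concatenate([fwd, bwd], axis=1)  # shape (L, N + N)
```

Keeping the forward and backward blocks in separate coordinate slices, rather than mixing them, is what lets downstream layers distinguish past from future information.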

5. EXPERIMENTS

We demonstrate the modularity and flexibility of the LS2T and its variants by applying it to (i) multivariate time series classification, (ii) mortality prediction in healthcare, and (iii) generative modelling of sequential data. In all cases, we take a strong baseline model (FCN and GP-VAE, as detailed below) and upgrade it with LS2T layers. As Theorem 2.1 requires the Seq2Tens layers to be preceded by at least a static feature map, we expect these layers to perform best as an add-on on top of other models, which can however be quite simple, such as a CNN. The additional computation time is negligible (in fact, for FCN it allows us to reduce the number of parameters significantly while retaining performance), but it can yield substantial improvements. This is remarkable, since the original models are already state-of-the-art on well-established (frequentist and Bayesian) benchmarks.

5.1. MULTIVARIATE TIME SERIES CLASSIFICATION

As the first task, we consider multivariate time series classification (TSC) on an archive of benchmark datasets collected by Baydogan (2015). Numerous previous publications report results on this archive.

Published as a conference paper at ICLR 2021

(2) ... are present in the input time series, and picking up on these improves the performance; however, on a few datasets the contrary is true. (3) Lastly, FCN_128-LS2T^3_64 outperforms all baseline methods with high probability (p ≥ 0.8), and hence successfully improves on FCN_128 via its added ability to learn long-range interactions in time. We remark that FCN_64-LS2T^3_64 has over 50% fewer parameters than FCN_128; hence we managed to compress the FCN to a fraction of its original size while, on average, still slightly improving its performance, a nontrivial feat of its own accord.

5.2. MORTALITY PREDICTION

We consider the PHYSIONET2012 challenge dataset (Goldberger et al., 2000) for mortality prediction, a case of medical TSC where the task is to predict the in-hospital mortality of patients after their admission to the ICU. This is a difficult ML task due to missingness in the data, a low signal-to-noise ratio (SNR), and imbalanced class distributions with a prevalence ratio of around 14%. We extend the experiments conducted in Horn et al. (2020), which we also use as very strong baselines. Under the same experimental setting, we train two models: FCN-LS2T as ours and FCN as another baseline. For both models, we conduct a random search over all hyperparameters with 20 samples from a pre-specified search space, and the setting with the best validation performance is used for model evaluation on the test set over 5 independent training runs, exactly as done in Horn et al. (2020). We preprocess the data using the same method as Che et al. (2018, eq. (9)) and additionally handle static features by tiling them along the time axis and adding them as extra coordinates. We also introduce in both models a SpatialDropout1D layer after all CNN and LS2T layers, with the same tunable dropout rate, to mitigate the low SNR of the dataset. We observe that FCN-LS2T takes on average first place according to both ACCURACY and AUPRC, outperforming FCN and all SOTA methods, e.g. TRANSFORMER (Vaswani et al., 2017), GRU-D (Che et al., 2018), and SEFT (Horn et al., 2020), while also being competitive in terms of AUROC. This is very promising, and it suggests that LS2T layers may be particularly well-suited to complex and heterogeneous datasets such as medical time series, since the FCN-LS2T models also significantly improved accuracy on ECG, another medical dataset, in the previous experiment.

5.3. GENERATING SEQUENTIAL DATA

Finally, we demonstrate on sequential data imputation for time series and video that LS2Ts provide good representations of sequences not only in discriminative but also in generative models.

The GP-VAE model. In this experiment, we take as base model the recent GP-VAE (Fortuin et al., 2020), which provides state-of-the-art results for probabilistic sequential data imputation. The GP-VAE is essentially based on the HI-VAE (Nazabal et al., 2018) for handling missing data in variational autoencoders (VAEs) (Kingma & Welling, 2013), adapted to time series by the use of a Gaussian process (GP) prior (Williams & Rasmussen, 2006) across time in the latent sequence space to capture temporal dynamics. Since the GP-VAE is a highly advanced model, its in-depth description is deferred to Appendix E.3. We extend the experiments conducted in Fortuin et al. (2020) and make one simple change to the GP-VAE architecture without changing any other hyperparameters or aspects: we introduce a single bidirectional LS2T layer (B-LS2T) into the encoder network that is used in the amortized representation of the means and covariances of the variational posterior. The B-LS2T layer is preceded by a time-embedding and differencing block, and succeeded by channel flattening and layer normalization, as depicted in Figure 5. The idea behind this experiment is to see whether we can improve the performance of a highly complicated model composed of many interacting submodels by the naive introduction of LS2T layers. Additionally, when comparing GP-VAE to BRITS on Physionet, the authors argue that although BRITS achieves a higher AUROC score, the GP-VAE should not be disregarded, as it fits a generative model to the data and enjoys the usual Bayesian benefits of predicting distributions instead of point predictions.
The results show that by simply adding our layer to the architecture, we managed to elevate the performance of the GP-VAE to the same level while retaining these benefits. We believe the reason for the improvement is a tighter amortization gap in the variational approximation (Cremer et al., 2018), achieved by increasing the expressiveness of the encoder through the LS2T, which allows it to pick up on long-range interactions in time. We provide further discussion in Appendix E.3.

6. RELATED WORK AND SUMMARY

Related Work. The literature on tensor models in ML is vast. Related to our approach we mention pars pro toto Tensor Networks (Cichocki et al., 2016), which use classical LR decompositions such as CP (Carroll & Chang, 1970), Tucker (Tucker, 1966), tensor trains (Oseledets, 2011), and tensor rings (Zhao et al., 2019); further, CNNs have been combined with LR tensor techniques (Cohen et al., 2016; Kossaifi et al., 2017) and extended to RNNs (Khrulkov et al., 2019); Tensor Fusion Networks (Zadeh et al., 2017) and their LR variants (Liu et al., 2018b; Liang et al., 2019; Hou et al., 2019); and tensor-based gait recognition (Tao et al., 2007). Our main contribution to this literature is the use of the free algebra T(V) with its convolution product •, instead of V^{⊗m} with the outer product ⊗ used in the above papers. While it may seem counter-intuitive to work in the larger space T(V), the additional algebraic structure of (T(V), •) is the main reason for the nice properties of Φ (universality, making sequences of arbitrary length comparable, convergence in the continuous-time limit; see Appendix B), which we believe are in turn the main reason for the strong benchmark performance. Stacked LR sequence transforms allow us to exploit this rich algebraic structure with little computational overhead. Another related line of work is path signatures in ML (Lyons, 2014; Chevyrev & Kormilitzin, 2016; Graham, 2013; Bonnier et al., 2019; Toth & Oberhauser, 2020). These arise as a special case of Seq2Tens (Appendix B), and our main contribution to this literature is that Seq2Tens resolves a well-known computational bottleneck: it never needs to compute and store a signature; instead, it directly and efficiently learns the functional of the signature.

Summary. We used a classical non-commutative structure to construct a feature map for sequences of arbitrary length. By stacking sequence transforms we turned this into scalable and modular NN layers for sequence data.
The main novelty is the use of the free algebra T(V) constructed from the static feature space V. While free algebras are classical in mathematics, their use in ML seems novel and underexplored. We would like to re-emphasize that (T(V), •) is not a mysterious abstract space: if you know the outer tensor product ⊗, then you can easily switch to the tensor convolution product • by taking sums of outer tensor products, as defined in equation 5. As our experiments show, the benefits of this algebraic structure are not just theoretical, but can significantly elevate the performance of already strong-performing models.



The proof of Proposition 3.3 is given in Appendix B.1.1. While equation 12 looks expensive, by casting it into a recursive formulation over time it can be computed in O(M^2 · L · d) time and O(M^2 · (L + c)) memory, where d is the cost of an inner product evaluation on V and c is the memory footprint of a vector v ∈ V. This can be further reduced to O(M · L · d) time and O(M · (L + c)) memory by an efficient parametrization of the rank-1 elements ℓ ∈ T(V). We give further details in Appendices D.2, D.3, D.4.

Comparison of FCN-LS2T and FCN on PHYSIONET2012 with the results from Horn et al. (2020).

Performance comparison of GP-VAE (B-LS2T) with the baseline methods. To make the comparison, we re-ran, ceteris paribus, all experiments the authors originally included in their paper (Fortuin et al., 2020): imputation of Healing MNIST, Sprites, and Physionet 2012. The results are in Table 3, which reports the same metrics as used in Fortuin et al. (2020), i.e. negative log-likelihood (NLL, lower is better) and mean squared error (MSE, lower is better) on test sets, and downstream classification performance of a linear classifier (AUROC, higher is better). For all models besides our GP-VAE (B-LS2T), the results were borrowed from Fortuin et al. (2020). We observe that simply adding the B-LS2T layer improved the results in almost all cases, except for Sprites, where the GP-VAE already achieved a very low MSE score.

AVAILABILITY

https://github.com/tgcsaba

Annex

Table 1: Posterior probabilities given by a Bayesian signed-rank test comparing the proposed methods against the baselines. {>}, {<}, {=} refer to the respective events that the row method is better, the column method is better, or that they are equivalent.

archive, which makes it possible to compare against several well-performing competitor methods from the TSC community. These baselines are detailed in Appendix E.1. This archive was also considered in a recent popular survey paper on DL for TSC (Ismail Fawaz et al., 2019), from where we borrow the two best performing models as DL baselines: FCN and ResNet. The FCN is a fully convolutional network which stacks 3 convolutional layers of kernel sizes (8, 5, 3) and filters (128, 256, 128) followed by a global average pooling (GAP) layer, hence employing global parameter sharing. We refer to this model as FCN 128 . The ResNet is a residual network stacking 3 FCN blocks of various widths with skip-connections in between (He et al., 2016) and a final GAP layer.
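The FCN baseline described above can be sketched in PyTorch. This is a minimal sketch based only on the description given here (kernel sizes (8, 5, 3), filters (width, 2·width, width), GAP); details such as padding, batch normalization placement, and the use of logits rather than a softmax layer are assumptions, not taken from Ismail Fawaz et al. (2019).

```python
import torch
from torch import nn

class FCN(nn.Module):
    """FCN baseline sketch: three Conv1D blocks with kernel sizes
    (8, 5, 3) and filters (width, 2*width, width), each followed by
    batch norm and ReLU, then global average pooling over time and a
    linear classification head (returning logits)."""

    def __init__(self, in_channels, n_classes, width=128):
        super().__init__()
        blocks = []
        c = in_channels
        for f, k in zip((width, 2 * width, width), (8, 5, 3)):
            blocks += [nn.Conv1d(c, f, k, padding="same"),
                       nn.BatchNorm1d(f),
                       nn.ReLU()]
            c = f
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(c, n_classes)

    def forward(self, x):             # x: (batch, channels, length)
        h = self.body(x)
        return self.head(h.mean(dim=-1))   # GAP over the time axis
```

Because all layers are convolutional up to the GAP, the same parameters are shared across every position of the sequence, which is the "global parameter sharing" property referred to above.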


The FCN is an interesting model to upgrade with LS2T layers: the LS2T also employs parameter sharing across the sequence length, and, as noted previously, convolutions can only learn local interactions in time, which in particular makes them ill-suited to picking up long-range autocorrelations, exactly where the LS2T can provide improvements. As our models, we consider three simple architectures: (i) LS2T 3 64 stacks 3 LS2T layers of order 2 and width 64; (ii) FCN 64 -LS2T 3 64 precedes the LS2T 3 64 block with an FCN 64 block, a downsized version of FCN 128 ; (iii) FCN 128 -LS2T 3 64 uses the full FCN 128 and follows it by an LS2T 3 64 block as before. Both FCN-LS2T models also employ skip-connections from the input to the LS2T block and from the FCN to the classification layer, allowing the LS2T to see the input directly and the FCN to directly affect the final prediction. These hyperparameters were only hand-tuned on a subset of the datasets; the values we considered were H, N ∈ {32, 64, 128}, M ∈ {2, 3, 4} and D ∈ {1, 2, 3}, where H, N ∈ N are the FCN and LS2T widths, respectively, while M ∈ N is the LS2T order and D ∈ N is the LS2T depth. We also employ techniques such as time-embeddings (Liu et al., 2018a), sequence differencing and batch normalization; see Appendix D.1, and Appendix E.1 for further details on the experiment and Figure 2 therein for a visualization of the architectures.

Results. We trained the models FCN 128 , ResNet, LS2T 3 64 , FCN 64 -LS2T 3 64 and FCN 128 -LS2T 3 64 on each of the 16 datasets 5 times, while results for the other methods were borrowed from the cited publications. In Appendix E.1, Figure 3 depicts box-plots of the accuracy distributions and a CD diagram using the Nemenyi test (Nemenyi, 1963), while Table 7 shows the full list of results.
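The pairwise Bayesian comparisons behind Table 1 can be sketched as follows. For brevity this implements the simpler Bayesian *sign* test rather than the signed-rank test actually used (Benavoli et al. describe both); the rope width, prior strength, and function name are our assumptions for illustration.

```python
import numpy as np

def bayesian_sign_test(acc_a, acc_b, rope=0.01, prior=1.0,
                       n_samples=50_000, seed=0):
    """Simplified Bayesian sign test for comparing two classifiers.

    acc_a, acc_b : per-dataset accuracies of the two classifiers.
    rope : region of practical equivalence; accuracy differences with
        absolute value <= rope count as "the two are equivalent".
    Returns posterior probabilities (p_a_better, p_equivalent,
    p_b_better), estimated by placing a Dirichlet prior on the three
    outcome probabilities and sampling from the posterior.
    """
    diff = np.asarray(acc_a) - np.asarray(acc_b)
    counts = np.array([np.sum(diff > rope),           # A wins
                       np.sum(np.abs(diff) <= rope),  # within rope
                       np.sum(diff < -rope)],         # B wins
                      dtype=float)
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(counts + prior, size=n_samples)
    winner = theta.argmax(axis=1)   # which region is most probable
    return tuple(np.mean(winner == k) for k in range(3))
```

A statement such as "the row method is better with p >= 0.8" in Table 1 then corresponds to the first returned probability exceeding 0.8.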
Since mean-ranks based tests raise some paradoxical issues (Benavoli et al., 2016), it is customary to conduct pairwise comparisons using frequentist (Demšar, 2006) or Bayesian (Benavoli et al., 2017) hypothesis tests. We adopted the Bayesian signed-rank test from Benavoli et al. (2014), the posterior probabilities of which are displayed in Table 1, while the Bayesian posteriors are visualized in Figure 4 in Appendix E.1. The results of the signed-rank test can be summarized as follows: (1) LS2T 3 64 already outperforms some classic TS classifiers with high probability (p ≥ 0.8), but it is not competitive with other DL classifiers. This observation is not surprising, since even the theory requires at least a static feature map to precede the LS2T. (2) FCN 64 -LS2T 3 64 outperforms almost all models with high probability (p ≥ 0.8), except for ResNet (which is still outperformed with p ≥ 0.7), FCN 128 and FCN 128 -LS2T 3 64 . When compared with FCN 128 , the test is unable to decide between the two, which, upon inspection of the individual results in Table 7, can be explained by the fact that on some datasets the benefit of the added LS2T block is high enough to outweigh the loss of flexibility incurred by reducing the width of the FCN; arguably these are the datasets where long-range autocorrelations

