RECURSIVE TIME SERIES DATA AUGMENTATION

Abstract

Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we train models on the available realizations; when data are limited, this often induces severe over-fitting and prevents generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method (RIM). New augmented time series are generated from the original time series using a recursive interpolation function and are then used in training. We perform theoretical analysis to characterize the proposed RIM and to guarantee its performance under certain conditions. We apply RIM to diverse synthetic and real-world time series cases and achieve strong performance over non-augmented data on a variety of learning tasks. Our method is also computationally more efficient and leads to better performance than state-of-the-art time series data augmentation.

1. INTRODUCTION

The recent success of machine learning (ML) algorithms depends on the availability of large amounts of data and prodigious computing power, which in practice are not always available. In real-world applications, it is often impossible to sample indefinitely, and ideally we would like the ML model to make good decisions with a limited number of samples. To overcome these issues, we can exploit additional information, such as structure or invariance in the data, that helps ML algorithms learn efficiently and focus on the features most important for solving the task. In ML, the exploitation of structure in the data has been handled using four different yet complementary approaches: 1) architecture design, 2) transfer learning, 3) data representation, and 4) data augmentation. Our focus in this work is on data augmentation in the context of time series learning. Time series representations do not expose the full information of the underlying dynamical system (Prado, 1998) in a way that ML can easily recognize. For instance, financial time series data contain patterns at various scales that can be learned to improve performance. At a more fundamental level, time series are one-dimensional projections of a hypersurface of data called the phase space of a dynamical system. This projection results in a loss of information regarding the dynamics of the system. However, we can still make inferences about the dynamical system that projects a time series realization. Our approach is to use these inferences to generate additional time series data from the original realization, building richer representations and improving time series pattern identification, which results in better parameters and reduced variance. We show that our methodology is applicable to a variety of ML algorithms. Time series learning problems depend on the observed historical data used for training; we often use a set of time series to train the ML model.
Each element in the set can be viewed as a sample derived from the underlying stochastic dynamical system. However, each historical time series sample is only one particular realization of the underlying stochastic dynamical system that we are trying to learn. Our work focuses on problems where available realizations are limited, but is not restricted to them: our method can be applied to any time series learning task, from stock price prediction, where we often have a single realization, to problems with numerous realizations, such as speech recognition, where many audio clips are available for training. Consider the stock price prediction problem, where the task is to predict or classify the trend of the future price. Ideally, we want our model to perform well by capturing the stochastic dynamics of stock markets. However, we only train the model using a single time series realization or limited historical realizations, and as a result we do not truly capture the characteristic behaviour of the underlying dynamical system. Training on the original data, and hence on one or a few realizations of the underlying dynamical system, usually induces over-fitting. This is not ideal, as we want our model to perform well on the stochastic system instead of just a specific realization of that system.

Contributions. The contributions of our work are as follows:
• We present a time series augmentation technique based on recursive interpolation.
• We provide a theoretical analysis of learning improvement for the proposed time series augmentation method:
  - We show that our recursive augmentation allows us to control by how much the augmented time series trajectory deviates from the original one (Theorem 3.1), and that a natural trade-off is induced when our augmentation deviates considerably from the original time series (Theorem 3.2).
  - We demonstrate that our learning bound depends on the dimension and properties of the time series, as well as the neural network structure (Theorems 3.3 and 3.4).
  - We believe that this work is the first to offer a theoretical ML framework for time series data augmentation with guarantees for variance reduction in the learned model (Theorem 3.5).
• We empirically demonstrate learning improvements using synthetic data as well as real-world time series datasets.

Outline of the paper. Section 2 presents the literature review. Section 3 defines the notations and the problem setting, and provides the main theoretical results. Section 4 describes the experimental results, and Section 5 concludes with a summary and a discussion of future work.

2. RELATED WORK

Augmentation for Computer Vision. In the computer vision context, there are multiple ways to augment image data, such as cropping, rotation, translation, flipping, and noise injection. Among them, the mixup technique proposed in Zhang et al. (2018) is similar to our approach: they train a neural network on convex combinations of pairs of images and their labels. However, simply applying a static technique to dynamic time series data is not appropriate. Chen et al. (2020) showed that data augmentation has an effect similar to an averaging operation over the orbits of a certain group of transformations that keep the data distribution invariant. Other, learning-based augmentation approaches have also been proposed; these approaches are problem dependent and do not offer theoretically guaranteed learning improvement. In addition, the learning-based methods require large amounts of training data.

Augmentation for Reinforcement Learning (RL). Laskin et al. (2020) presented the RL with Augmented Data (RAD) module, which can augment most RL algorithms that use image data. They demonstrated that augmentations such as random translation, random convolutions, cropping, patch cutout, amplitude scaling, and color jitter enable RL algorithms to outperform complex advanced methods on standard benchmarks. Kostrikov et al. (2021) presented a data augmentation method that can be applied to conventional model-free RL algorithms, enabling learning directly from pixels without the need for pre-training or auxiliary losses. This augmentation method improves performance substantially, enabling a Soft Actor-Critic agent to reach advanced capability on the DeepMind control suite, outperforming model-based methods and contrastive learning. Laskin et al. (2020) and Kostrikov et al. (2021) show the benefit of RL augmentation using convolutional neural networks (CNNs) for static data but do not handle dynamic data such as time series.

Ideally, we would like to have access to more data that is representative of the underlying dynamics of the system or the regime under which we operate. However, we cannot randomly add more data, as the added data might not be representative of our stochastic dynamical system. To ensure that we can add meaningful data without disturbing the properties of the original data, we introduce a new approach called the Recursive Interpolation Method: a recursive interpolation of time series used as a tool for generating data augmentations.

3.1. RECURSIVE INTERPOLATION METHOD (RIM)

In our setting, we consider each time series sample as one realization from the underlying dynamical system. The realization consists of features along the time axis. Let d+1 be the dimension of the time series sample and {0, 1, ..., k} be the label set for each sample. Then each sample belongs to R^{d+1} × {0, 1, ..., k}. Let S = {s_0, s_1, ..., s_N} be the collection of time series samples. Let D be a distribution with support [0, 1) and let λ_i be drawn from D independently, denoted λ_i ∼ D, for time i ∈ [1:d]. Let ⃗λ = (λ_1, ..., λ_d) denote the vector of interpolations. For notational simplicity, in the rest of the paper we write ⃗λ ≜ λ, and λ ∼ D means that each component λ_i of λ is sampled independently from D. For a given vector λ ∈ [0,1)^d and a time series sample s ∈ S, we generate an augmented time series sample s_λ as follows. Consider an original time series sample s = (x_0, x_1, ..., x_d, y), with features (x_0, x_1, ..., x_d) ∈ R^{d+1} and label y ∈ {0, 1, ..., k}. For each λ = (λ_1, ..., λ_d) ∈ [0,1)^d (λ_0 is considered to be a dummy value), we define an augmented sample s_λ = (x_{0,λ_0}, x_{1,λ_1}, ..., x_{d,λ_d}, y) such that
\[
x_{0,\lambda_0} = x_0, \qquad x_{i,\lambda_i} = (1-\lambda_i)\,x_i + \lambda_i\, x_{i-1,\lambda_{i-1}} \quad \text{for } i \in [1:d]. \tag{1}
\]
The newly generated augmented sample s_λ has the same label y as the original sample s. Our recursive methodology allows us to generate a new time series realization that preserves the trajectory of the original data within some bound (Theorem 3.1). Notations. Let l(s, θ) be a loss function defined on a sample s and a model parameter θ; similarly, l(s_λ, θ) is the loss defined on the augmented sample using a model parameter θ. We denote by P the distribution on the sample space S, parametrized by θ*.
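The recursion in (Eq. 1) is straightforward to implement. Below is a minimal NumPy sketch; the function name `rim_augment` and the uniform choice for D are our own illustrative assumptions, not fixed by the paper.

```python
import numpy as np

def rim_augment(x, lam):
    """Recursive Interpolation Method (RIM), Eq. (1):
    x_{0,lam_0} = x_0 and, for i >= 1,
    x_{i,lam_i} = (1 - lam_i) * x_i + lam_i * x_{i-1,lam_{i-1}}."""
    x = np.asarray(x, dtype=float)
    x_aug = np.empty_like(x)
    x_aug[0] = x[0]                           # lam_0 is a dummy value
    for i in range(1, len(x)):
        x_aug[i] = (1.0 - lam[i]) * x[i] + lam[i] * x_aug[i - 1]
    return x_aug

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 6.0, 64))         # one original realization
lam = rng.uniform(0.0, 1.0, size=x.size)      # lam_i ~ D = U(0, 1)
x_aug = rim_augment(x, lam)                   # augmented realization
```

The augmented sample keeps the label of the original sample; drawing a fresh λ produces a new realization each time.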
Given a few realizations {s_i}_{i∈[0:N]} from the distribution P, we set
\[
\theta^* = \arg\min_{\theta} \mathbb{E}_{s\sim P}[l(s,\theta)], \qquad \hat{\theta} = \arg\min_{\theta} \frac{1}{N+1}\sum_{i=0}^{N} l(s_i,\theta),
\]
\[
\theta^*_{aug} = \arg\min_{\theta} \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim D}[l(s_\lambda,\theta)]\big], \qquad \hat{\theta}_{aug} = \arg\min_{\theta} \frac{1}{N+1}\sum_{i=0}^{N} \mathbb{E}_{\lambda\sim D}[l(s_{i,\lambda},\theta)], \tag{2}
\]
\[
R_N(l\circ\Theta) = \mathbb{E}_{\epsilon_i\sim E}\Big[\sup_{\theta\in\Theta} \frac{1}{N+1}\sum_{i=0}^{N} \epsilon_i\, l(s_i,\theta)\Big].
\]
Using the notations above, our recursive time series data augmentation minimizes the augmented loss, in which we take the expectation over the augmented sample space. We call it the average augmented loss and denote it by l_aug(s, θ) = E_{λ∼D}[l(s_λ, θ)]. The three most important aspects of our theoretical framework are the characterization of our recursive time series augmentation: the trade-off induced in the learning parameter space (Section 3.2), the learning bound showing the impact of the structural properties of the time series and the neural network on the learning parameters (Section 3.3), and better parameter learning with reduced variance when using augmented samples compared to non-augmented ones (Section 3.4).
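In practice, the inner expectation in the average augmented loss can be approximated by sampling λ. A hedged sketch (the helper name `augmented_loss`, the Monte Carlo budget `n_mc`, and the toy squared loss are our own illustrative assumptions):

```python
import numpy as np

def rim_augment(x, lam):
    # RIM recursion of Eq. (1)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1.0 - lam[i]) * x[i] + lam[i] * out[i - 1]
    return out

def augmented_loss(theta, X, y, loss, sample_lam, n_mc=8):
    """Monte Carlo estimate of the empirical average augmented loss
    (1/(N+1)) * sum_i E_lam[ l(s_{i,lam}, theta) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        for _ in range(n_mc):
            total += loss(rim_augment(xi, sample_lam(len(xi))), yi, theta)
    return total / (len(X) * n_mc)

# Illustrative setup: squared loss of a scalar read-out on the last feature.
rng = np.random.default_rng(1)
X = [rng.standard_normal(16) for _ in range(5)]
y = [2.0 * xi[-1] for xi in X]
loss = lambda feats, label, theta: (theta * feats[-1] - label) ** 2
val = augmented_loss(2.0, X, y, loss, lambda d: rng.uniform(0.0, 1.0, d))
```

With a degenerate sampler that always returns λ = 0, the augmented loss reduces to the plain empirical loss, which is a useful sanity check.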

3.2. LEARNING BOUND CONNECTING ORIGINAL TIME SERIES AND AUGMENTED TIME SERIES

We define the recursively interpolated time series and show that the augmented time series samples can deviate from the original ones. We measure this deviation using a norm distance between the augmented time series and the original one. We show that this distance is bounded, where the bound depends on the characteristics of the time series features (Theorem 3.1). Let
\[
\mathrm{sign}(t) = \begin{cases} 0 & \text{if } t = 0 \\ 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \end{cases}, \qquad
\delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b \end{cases},
\]
and let D be a distribution with support [0, 1).

Theorem 3.1. (Characterization of recursive augmentation) Let n ∈ [0:N]. If λ_n ∼ D and g(λ_n) = (1-λ_n)(1-δ_{0n}) + (1-sign(n)), then the following holds.
(1) \( x_{n,\lambda_n} = \sum_{k=0}^{n} \big(\prod_{i=k+1}^{n} \lambda_i\big) g(\lambda_k)\, x_k \), where λ_j ∼ D for j ≥ 1 and λ_0 is a dummy value.
(2) Let ‖·‖ be a norm, e = E[D], m' = max_{i∈[1:N]} ‖x_i - x_{i-1}‖, and m = max_{i∈[0:N]} ‖x_i‖. Then
\[
\big\| \mathbb{E}_{\lambda_1,\dots,\lambda_n}[x_{n,\lambda_n} - x_n] \big\| \le \min\Big\{ 3em,\ \frac{e}{1-e}\, m',\ N e\, m' \Big\}, \qquad
\mathbb{E}_{\lambda_1,\dots,\lambda_n}\big[\|x_{n,\lambda_n} - x_n\|\big] \le 2\lambda_n m .
\]
Recall that s = (x_0, x_1, ..., x_d, y) and s_λ = (x_{0,λ_0}, x_{1,λ_1}, ..., x_{d,λ_d}, y); we have
\[
\|s - s_\lambda\|_2 = \Big( \sum_{i=0}^{d} (x_i - x_{i,\lambda_i})^2 \Big)^{1/2} \le \sum_{i=0}^{d} |x_i - x_{i,\lambda_i}| = \|s - s_\lambda\|_1 .
\]
Note that we used the l_2 norm for simplicity, but in general any norm satisfies the inequality. Hence, measuring the distance of each feature between the augmented time series sample and the original one tells us how far they deviate from each other.

Theorem 3.2. (Learning bound using characterization of recursive augmentation) Without loss of generality, let l(·,·) ∈ [0,1] be a loss function satisfying a Lipschitz condition, and let S = {s_0, s_1, ..., s_N} be the collection of time series samples. Then with probability at least 1-δ over the samples {s_i}_{i∈[0:N]}, we have
\[
\mathbb{E}_{s\sim P}[l(s, \hat{\theta}_{aug})] - \mathbb{E}_{s\sim P}[l(s, \theta^*)] < 2R_N(l_{aug}\circ\Theta) + 2\sqrt{\frac{\log(2/\delta)}{N+1}} + 2L_{Lip}\, \mathbb{E}_{s\sim P}\mathbb{E}_{\lambda\sim D}[\|s_\lambda - s\|]. \tag{5}
\]
Moreover, we have
\[
R_N(l_{aug}\circ\Theta) \le R_N(l\circ\Theta) + \max_{i\in\{0,\dots,N\}} L_{Lip}\, \mathbb{E}_{\lambda\sim D}[\|s_{i,\lambda} - s_i\|]. \tag{6}
\]
Theorem 3.2 follows the standard argument for regret bounds using Rademacher complexity. The left-hand side of (Eq. 5) is the difference between the generalization error (sometimes simply called the risk) of our estimate and that of the optimal parameters, i.e., the excess risk. Theorem 3.2 tells us how good our parameter estimate is compared to the optimal parameters. Our bound is governed by three terms: 1) the Rademacher complexity using the augmented data, 2) the sample size, and 3) the distance between the original time series samples and the augmented ones. If the distance in (Eq. 5) between time series samples before and after augmentation goes to zero, the learning bound becomes tighter. Another implication of the distance being small is that our recursive augmentation method decreases the Rademacher complexity, as demonstrated by (Eq. 6); a smaller Rademacher complexity produces a tighter bound on the parameter space. On the other hand, if the distance in (Eq. 5 and Eq. 6) becomes large (i.e., the augmented time series samples deviate considerably from the original ones), then we cannot guarantee with high probability that our method will always outperform the non-augmented case. This is a trade-off frequently observed in learning theory.
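The deterministic chain bound behind Theorem 3.1, ‖x_{n,λ_n} − x_n‖ ≤ Σ_{k=1}^{n} (Π_{i=k}^{n} λ_i) ‖x_k − x_{k−1}‖ (Eq. 14 in the appendix), can be checked numerically on random draws. The script below is an illustrative sanity check of ours, not part of the paper's experiments.

```python
import numpy as np

def rim_augment(x, lam):
    # RIM recursion of Eq. (1)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1.0 - lam[i]) * x[i] + lam[i] * out[i - 1]
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal(32).cumsum()          # a random-walk realization
worst = -np.inf                               # largest observed bound violation
for _ in range(100):
    lam = rng.uniform(0.0, 1.0, size=x.size)
    x_aug = rim_augment(x, lam)
    for n in range(1, x.size):
        # chain bound: sum_k (prod_{i=k}^{n} lam_i) * |x_k - x_{k-1}|
        bound = sum(np.prod(lam[k:n + 1]) * abs(x[k] - x[k - 1])
                    for k in range(1, n + 1))
        worst = max(worst, abs(x_aug[n] - x[n]) - bound)
# worst stays <= 0 up to floating-point error: the bound holds pathwise.
```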

3.3. LEARNING BOUND CONNECTING THE STRUCTURAL PROPERTIES OF TIME SERIES AND NEURAL NETWORKS

Theorem 3.3. Let f_θ be a neural network with model parameter θ, ReLU activations, and a sigmoid activation for the last layer, denoted f_θ(x) = σ(g_θ(x)), where g_θ(x) = ∇g_θ^T x + b is the pre-activation signal of the last layer. Then the cross-entropy loss function l(·,·) satisfies the error bound
\[
\|l(s_\lambda, \theta) - l(s, \theta)\| \le \sqrt{d}\,\Big( \|A\|_F + \sum_{i=1}^{d} \|B_i\|_F \Big) \|\nabla g_\theta\|
\]
where
\[
A = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
0 & \frac{\partial x_{1,\lambda_1}}{\partial \lambda_1}\big|_{\lambda=\vec{0}} & \frac{\partial x_{2,\lambda_2}}{\partial \lambda_1}\big|_{\lambda=\vec{0}} & \cdots & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_1}\big|_{\lambda=\vec{0}} \\
0 & 0 & \frac{\partial x_{2,\lambda_2}}{\partial \lambda_2}\big|_{\lambda=\vec{0}} & \cdots & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_2}\big|_{\lambda=\vec{0}} \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & \frac{\partial x_{d,\lambda_d}}{\partial \lambda_d}\big|_{\lambda=\vec{0}}
\end{pmatrix},
\qquad
B_i = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
0 & \frac{\partial^2 x_{1,\lambda_1}}{\partial \lambda_i \partial \lambda_1}\big|_{\lambda=\vec{0}} & \frac{\partial^2 x_{2,\lambda_2}}{\partial \lambda_i \partial \lambda_1}\big|_{\lambda=\vec{0}} & \cdots & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_1}\big|_{\lambda=\vec{0}} \\
0 & 0 & \frac{\partial^2 x_{2,\lambda_2}}{\partial \lambda_i \partial \lambda_2}\big|_{\lambda=\vec{0}} & \cdots & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_2}\big|_{\lambda=\vec{0}} \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & \frac{\partial^2 x_{d,\lambda_d}}{\partial \lambda_i \partial \lambda_d}\big|_{\lambda=\vec{0}}
\end{pmatrix},
\]
⃗0 ∈ R^d denotes the zero column vector, and ‖·‖_F is the Frobenius norm. Theorem 3.3 shows how the structural properties of the time series (A and B_i) and the neural network architecture (∇g_θ) affect the learning process. The interpolation vector λ determines the velocity A and the accelerations {B_i}_{i∈[1:d]} of the features for the augmented sample s_λ.

Theorem 3.4. (Learning bound with structural properties of the time series and neural network) Let l be the cross-entropy loss function with l(·,·) ∈ [0,1]. Then with probability at least 1-δ over the samples {s_i}_{i∈[0:N]}, we have
\[
\mathbb{E}_{s\sim P}[l(s, \hat{\theta}_{aug})] - \mathbb{E}_{s\sim P}[l(s, \theta^*)] < 2R_N(l_{aug}\circ\Theta) + 2\sqrt{\frac{\log(2/\delta)}{N+1}} + 2\sqrt{d}\,\Big( A + \sum_{i=1}^{d} B_i \Big) \|\nabla g_\theta\|. \tag{8}
\]
Moreover, we have
\[
R_N(l_{aug}\circ\Theta) \le R_N(l\circ\Theta) + \sqrt{d}\,\Big( A + \sum_{i=1}^{d} B_i \Big) \|\nabla g_\theta\|. \tag{9}
\]
Theorem 3.4 reveals a trade-off between the dimension of the features of a time series sample, the interpolation vector λ, and the neural network architecture. Here A and {B_i}_{i∈[1:d]} denote bounds on the velocity and accelerations, respectively, computed using the interpolation vector.
The last term of (Eq. 8) is built from the gradient of the pre-activation g_θ with respect to the input, the dimension of the features of a time series sample, and the bounds on the velocity and accelerations induced by the interpolation vector. Theorem 3.4 tells us how good our parameter estimate is compared to the optimal parameters. Our bound is governed by three terms: 1) the Rademacher complexity using the augmented data, 2) the sample size, and 3) a term that depends on the feature dimension d, the structural properties of the time series (A and B_i), and the neural network architecture (∇g_θ). Because our RIM method uses a recursive interpolation between two consecutive features, the orders of magnitude of A and B_i do not change drastically and hence do not blow up the bound. Another implication of A and B_i being small is that our recursive augmentation method decreases the Rademacher complexity, as demonstrated by (Eq. 9); a smaller Rademacher complexity produces a tighter bound on the parameter space.
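The structure of the velocity matrix A can be checked numerically. By Lemma A.1 in the appendix, at λ = ⃗0 every product of λ's vanishes and x_{i−1,λ_{i−1}} reduces to x_{i−1}, so A is diagonal with entries x_{i−1} − x_i. The finite-difference sketch below is our own illustrative check (variable names are assumptions); the recursion is polynomial in λ, so central differences around 0 are well defined.

```python
import numpy as np

def rim_augment(x, lam):
    # RIM recursion of Eq. (1)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1.0 - lam[i]) * x[i] + lam[i] * out[i - 1]
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal(6)        # features x_0, ..., x_d with d = 5
d, h = x.size - 1, 1e-6

# Central finite differences of x_{j,lam_j} w.r.t. lam_i at lam = 0:
# M[i-1, j-1] approximates the (i, j) block entry of A.
M = np.zeros((d, d))
for i in range(1, d + 1):
    lam_p = np.zeros(d + 1); lam_p[i] = +h
    lam_m = np.zeros(d + 1); lam_m[i] = -h
    M[i - 1, :] = (rim_augment(x, lam_p) - rim_augment(x, lam_m))[1:] / (2 * h)

# At lam = 0 the matrix is diagonal with entries x_{i-1} - x_i:
# each feature's "velocity" under the interpolation.
```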

3.4. VARIANCE REDUCTION

Suppose that we observe a set {s_0, s_1, ..., s_N} of N+1 samples from the underlying sample space S. Using our RIM method, we can augment each observed sample s_i with a distribution D, which yields the set of augmented samples {s_{i,λ} | λ ∼ D} for s_i. Under mild assumptions (please refer to Appendix A.5) on the regularity of the loss function and the underlying sample space, we have the following results.

Theorem 3.5. (Asymptotic normality) Assume Θ is open. Then θ̂ and θ̂_aug admit the following Bahadur representation:
\[
\sqrt{N+1}\,(\hat{\theta} - \theta^*) = \frac{1}{\sqrt{N+1}}\, V_{\theta^*}^{-1} \sum_{i=0}^{N} \nabla l(s_i, \theta^*) + o_P(1),
\]
\[
\sqrt{N+1}\,(\hat{\theta}_{aug} - \theta^*) = \frac{1}{\sqrt{N+1}}\, V_{\theta^*}^{-1} \sum_{i=0}^{N} \nabla l_{aug}(s_i, \theta^*) + o_P(1).
\]
Therefore, both θ̂ and θ̂_aug are asymptotically normal:
\[
\sqrt{N+1}\,(\hat{\theta} - \theta^*) \to \mathcal{N}(0, \Sigma_0) \quad \text{and} \quad \sqrt{N+1}\,(\hat{\theta}_{aug} - \theta^*) \to \mathcal{N}(0, \Sigma_{aug}), \tag{11}
\]
where the covariances are given by
\[
\Sigma_0 = V_{\theta^*}^{-1}\, \mathbb{E}_{s\sim P}\big[\nabla l(s,\theta^*)\, \nabla l(s,\theta^*)^T\big]\, V_{\theta^*}^{-1}, \qquad \Sigma_{aug} = \Sigma_0 - \mathbb{E}_{s\sim P}[X X^T], \tag{12}
\]
with X = ∇l(s,θ*) - ∇l_aug(s,θ*). As a consequence, the asymptotic relative efficiency of θ̂_aug compared to θ̂ is RE = tr(Σ_0)/tr(Σ_aug) ≥ 1. (Eq. 11) describes the asymptotic behaviour of the learning parameters, and (Eq. 12) shows that our recursive time series augmentation reduces the variance of the learning parameters.

4. EXPERIMENTS

4.1. DESIGN

We show our results on time series classification tasks with different time series augmentation methods. We compare the improvements in downstream task performance due to different data augmentation methods used to enlarge the training set when training the classifiers. To benchmark our RIM approach, we compare our results with those achieved by the TimeGAN approach (Yoon et al., 2019). We specifically select TimeGAN because it is known to preserve temporal dynamics, thereby maintaining the correlation between variables across time. Furthermore, with the flexibility of the unsupervised GAN framework and the control offered by the supervised training of autoregressive models, comparing our results against TimeGAN provides a rigorous benchmark against a tested state-of-the-art approach. We use a relatively small training set with a large testing set so that it is more challenging for classifiers to generalize and data augmentation is favorable. Note that we apply the data augmentation methods to the two classes separately, since augmentation should preserve the properties of each particular class. We use the different data augmentation methods on the two classes of time series in our training set as follows. RIM can be applied directly to the time series within each class to generate new time series for that class and enlarge the original training set; the generated series are close to the original series in that class according to Theorem 3.1. For the TimeGAN baseline, we train two TimeGANs separately using the time series from each class. Once these two TimeGANs are trained, they are used to generate time series for each class to enlarge the original training set. We consider four tasks: the first two use synthetic datasets generated by solving 1-dimensional ODEs, and the last two use real-world datasets. We compare testing accuracy using the original data, data augmented with RIM, and data augmented with TimeGAN.
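The per-class enlargement described above can be sketched as follows. This is a minimal illustration with assumed names (`enlarge_training_set`, `n_aug` copies per original series); the paper does not fix these details.

```python
import numpy as np

def rim_augment(x, lam):
    # RIM recursion of Eq. (1)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1.0 - lam[i]) * x[i] + lam[i] * out[i - 1]
    return out

def enlarge_training_set(X, y, n_aug, rng):
    """Augment each series within its own class: every RIM copy keeps the
    label of its source series, so class-specific dynamics are preserved."""
    X_out, y_out = list(X), list(y)
    for xi, yi in zip(X, y):
        for _ in range(n_aug):
            lam = rng.uniform(0.0, 1.0, size=len(xi))
            X_out.append(rim_augment(xi, lam))
            y_out.append(yi)
    return X_out, y_out

rng = np.random.default_rng(4)
X = [rng.standard_normal(32) for _ in range(10)]   # 10 original series
y = [0] * 5 + [1] * 5                              # two balanced classes
X_big, y_big = enlarge_training_set(X, y, n_aug=3, rng=rng)
```

Because every augmented copy inherits its source label, class proportions in the enlarged set match the original set.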

4.2. RESULTS

For task 1, we consider solutions to ODEs containing exponential functions; the two classes in our binary classification correspond to two ODEs with different parameters. For task 2, we consider solutions to ODEs with trigonometric functions. In this setting, the ODEs can be thought of as generators of the time series on which we perform classification, and ODEs with different parameters induce different dynamical behaviours in their solutions (time series). For each class, we generate multiple solutions using the corresponding ODE with different initial values. To make the learning tasks harder, we add random noise to the solutions generated by these ODEs. Task 3 (2014) is associated with predicting the pattern of user movements in real-world office environments from time series generated by a Wireless Sensor Network (WSN). The input data contain received signal strength (RSS) measurements between the nodes of a WSN comprising 5 sensors: 4 in the environment and 1 carried by the user. The data were collected during movement of the users and labelled to indicate whether the user's trajectory leads to a change of room or not. In the experiments, we use a subset of the data to form a small training set to challenge our algorithm. We achieve better and more robust test accuracy than TimeGAN and the non-augmented case when using augmented data, as reflected in Figure 4. Since λ is the only parameter used in our RIM augmentation technique, our ablation study focused on the choice of the λ distribution. We tested several λ distributions. Given that we are interested in convex combinations between x_i and x_{i-1,λ_{i-1}}, λ must be restricted between 0 and 1. Two ways to achieve this are: (1) distributing the weights uniformly when sampling λ; (2) concentrating on a specific part of the λ distribution. To address (1), we use U(0, 1), which is the main test-bed for all the experiments in the current main text.
To address (2), we perform studies using the beta distribution, varying its shape parameters to focus on specific parts of the density. We tested Beta(2, 2), Beta(0.5, 0.5), and Beta(2, 5); the resulting plots can be found in Appendix C. In all these cases, we observed improvements from using RIM compared to non-augmented training, both in terms of a higher final testing accuracy and fewer training iterations, solidifying the effectiveness of RIM. For task 4, we use the FordA dataset Feldkamp (2007). This dataset contains time series corresponding to measurements of engine noise captured by a motor sensor. The goal is to detect the presence of a specific issue with the engine by classifying each time series into issue/no-issue classes. We sample 100 time series from FordA to form a small training set to challenge our algorithm and 100 time series for testing. As shown in Figure 5, RIM outperforms TimeGAN and the non-augmented case on test accuracy. From Figure 2 to Figure 5, we can see that RIM achieves superior performance over the non-augmented case. Furthermore, RIM is able to achieve better or comparable performance than TimeGAN on these tasks without the extensive training process associated with GANs. For our experiments, we train the TimeGAN for 2500 epochs (3 hours on Xeon Processors CPU) for synthetic datasets and 5000 epochs (6 hours on Xeon Processors CPU) for real datasets. Visual comparisons of the time series generated by RIM and TimeGAN with the original time series are shown in Appendix B. As expected from Theorem 3.5, RIM has smaller variance and better convergence compared to the non-augmented case across all experiments.
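The λ-distribution ablation amounts to swapping the sampler: U(0, 1) spreads interpolation weights evenly, while beta shape parameters concentrate mass (e.g. Beta(2, 5) keeps λ small, so augmented series stay closer to the originals). A sketch with assumed sampler names:

```python
import numpy as np

rng = np.random.default_rng(5)

samplers = {
    "uniform":    lambda d: rng.uniform(0.0, 1.0, d),  # U(0, 1), the main test-bed
    "beta_2_2":   lambda d: rng.beta(2.0, 2.0, d),     # symmetric, mass near 0.5
    "beta_05_05": lambda d: rng.beta(0.5, 0.5, d),     # mass near the endpoints
    "beta_2_5":   lambda d: rng.beta(2.0, 5.0, d),     # mass near 0: mild augmentation
}

# E[Beta(a, b)] = a / (a + b), so Beta(2, 5) has mean 2/7 < 1/2.
means = {name: s(100_000).mean() for name, s in samplers.items()}
```

Any of these samplers can be passed wherever λ is drawn; all keep λ inside [0, 1] as the convex-combination construction requires.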

4.3. EXTENSION TO OTHER LEARNING TASKS

Section 4.2 demonstrated results for time series classification. In this section, we show that RIM can also be used in other learning tasks, including continuous time series forecasting and RL. In continuous time series forecasting, we generally have one historical realization, which we decompose into smaller components to form a training set of (x, y) pairs, where x contains the data of the previous n time steps and y is the target for the next step. RIM can then be used to generate more time series from the unique realization, enlarging the training set with additional (x, y) pairs. We also used RIM to augment state trajectories in RL tasks (please refer to the pseudo code in Appendix E.2). Preliminary experiments for continuous time series forecasting can be found in Appendix D, and for RL tasks in Appendix E.
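The forecasting setup above can be sketched as follows: decompose the single realization into (x, y) windows, then add windows from RIM copies of the realization. The window length `n` and helper names are our own illustrative choices.

```python
import numpy as np

def rim_augment(x, lam):
    # RIM recursion of Eq. (1)
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = (1.0 - lam[i]) * x[i] + lam[i] * out[i - 1]
    return out

def make_windows(series, n):
    """Decompose one realization into (previous n steps, next step) pairs."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[t:t + n] for t in range(len(series) - n)])
    y = series[n:]
    return X, y

rng = np.random.default_rng(6)
series = np.sin(np.linspace(0.0, 12.0, 200))   # the single historical realization
X, y = make_windows(series, n=16)
for _ in range(4):                             # enlarge with 4 RIM copies
    lam = rng.uniform(0.0, 1.0, size=series.size)
    Xa, ya = make_windows(rim_augment(series, lam), n=16)
    X, y = np.concatenate([X, Xa]), np.concatenate([y, ya])
```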

5. CONCLUSION

We developed the Recursive Interpolation Method (RIM) for time series as a data augmentation technique to learn models accurately with limited data. RIM is simple yet effective, and is supported by a theoretical analysis guaranteeing faster convergence. Theoretically, we proved that RIM guarantees better parameter convergence with reduced variance. Empirically, our methodology outperforms current state-of-the-art approaches on different real-world problem domains and synthetic datasets, obtaining higher accuracy with reduced variance. Because our approach operates on the input time series data, it is invariant to the choice of ML algorithm. The methodology described in this paper can be used to enhance ML solutions to a wide variety of time series learning problems.

A.1 PROOF OF THEOREM 3.1

Recall that
\[
\mathrm{sign}(t) = \begin{cases} 0 & \text{if } t = 0 \\ 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \end{cases}, \qquad
\delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b \end{cases},
\]
and D is a distribution with support [0, 1).

Proof of Theorem 3.1. (1) Note that λ_0 is a dummy value for mathematical convenience and ∏_{i=k+1}^{n} λ_i = 1 if n < k+1. We prove the claim by induction on n. For the base case n = 1,
\[
x_{1,\lambda_1} = g(\lambda_1)\, x_1 + \lambda_1\, g(\lambda_0)\, x_0 = (1-\lambda_1)\, x_1 + \lambda_1\, x_0
\]
since g(λ_0) = 1 and g(λ_1) = 1-λ_1. We now assume that the claim holds for n-1 and prove it for n. By construction of x_{n,λ_n},
\[
x_{n,\lambda_n} = (1-\lambda_n)\, x_n + \lambda_n\, x_{n-1,\lambda_{n-1}}
= (1-\lambda_n)\, x_n + \lambda_n \sum_{k=0}^{n-1} \Big( \prod_{i=k+1}^{n-1} \lambda_i \Big) g(\lambda_k)\, x_k
= \sum_{k=0}^{n} \Big( \prod_{i=k+1}^{n} \lambda_i \Big) g(\lambda_k)\, x_k .
\]
Thus (1) holds by mathematical induction.

(2) Consider the quantity ‖x_{n,λ_n} - x_n‖. By (1), the first bound follows from
\[
\big\| \mathbb{E}[x_{n,\lambda_n} - x_n] \big\|
= \Big\| \mathbb{E}\Big[ \sum_{k=0}^{n} \Big( \prod_{i=k+1}^{n} \lambda_i \Big) g(\lambda_k)\, x_k - x_n \Big] \Big\|
= \Big\| \mathbb{E}\Big[ \sum_{k=0}^{n-1} \Big( \prod_{i=k+1}^{n} \lambda_i \Big) g(\lambda_k)\, x_k - \lambda_n x_n \Big] \Big\|
\]
\[
= \Big\| \sum_{k=1}^{n-1} e^{n-k}(1-e)\, x_k + e^{n-1} x_0 - e\, x_n \Big\|
\le \sum_{k=1}^{n-1} e^{n-k}(1-e)\, \|x_k\| + e^{n-1} \|x_0\| + e\, \|x_n\|
\]
\[
\le \Big( \sum_{k=1}^{n-1} e^{n-k} - \sum_{k=0}^{n-2} e^{n-k} + e^{n-1} + e \Big) m
= (e^{n-1} - e^{n} + 2e)\, m \le 3em,
\]
where e = E[D] and m = max_{i∈[0:n]} ‖x_i‖.
Now we prove the second bound. Since
\[
\|x_{n,\lambda_n} - x_n\| = \big\| x_n - \big( (1-\lambda_n)\, x_n + \lambda_n\, x_{n-1,\lambda_{n-1}} \big) \big\|
= \lambda_n \| x_n - x_{n-1} + x_{n-1} - x_{n-1,\lambda_{n-1}} \|
\le \lambda_n \big( \|x_n - x_{n-1}\| + \|x_{n-1} - x_{n-1,\lambda_{n-1}}\| \big), \tag{13}
\]
recursively applying (Eq. 13) gives
\[
\|x_{n,\lambda_n} - x_n\| \le \sum_{k=1}^{n} \Big( \prod_{i=k}^{n} \lambda_i \Big) \|x_k - x_{k-1}\|. \tag{14}
\]
By Jensen's inequality and (Eq. 14), we obtain the second bound
\[
\big\| \mathbb{E}[x_{n,\lambda_n} - x_n] \big\| \le \mathbb{E}\big[ \|x_{n,\lambda_n} - x_n\| \big]
\le \mathbb{E}\Big[ \sum_{k=1}^{n} \Big( \prod_{i=k}^{n} \lambda_i \Big) \|x_k - x_{k-1}\| \Big]
\le \sum_{k=1}^{n} e^{n-k+1} m' \le \min\Big\{ \frac{e}{1-e}\, m',\ n e\, m' \Big\},
\]
where e = E[D] and m' = max_{i∈[1:n]} ‖x_i - x_{i-1}‖.

Finally, we show that E_{λ_1,...,λ_n}[‖x_{n,λ_n} - x_n‖] ≤ 2λ_n m:
\[
\mathbb{E}_{\lambda_1,\dots,\lambda_n}\big[ \|x_{n,\lambda_n} - x_n\| \big]
= \mathbb{E}\Big[ \Big\| \sum_{k=0}^{n-1} \Big( \prod_{i=k+1}^{n} \lambda_i \Big) g(\lambda_k)\, x_k - \lambda_n x_n \Big\| \Big]
\le \mathbb{E}\Big[ \sum_{k=0}^{n-1} \Big| \prod_{i=k+1}^{n} \lambda_i \Big|\, |g(\lambda_k)|\, \|x_k\| + |\lambda_n|\, \|x_n\| \Big]
\]
\[
\le \big( \lambda_1 \lambda_2 \cdots \lambda_n + \lambda_2 \lambda_3 \cdots \lambda_n (1-\lambda_1) + \lambda_3 \lambda_4 \cdots \lambda_n (1-\lambda_2) + \cdots + \lambda_n (1-\lambda_{n-1}) + \lambda_n \big)\, m = 2\lambda_n m .
\]

Neural Networks with ReLU Activations. With ReLU activation functions, neural networks are piecewise linear functions of the input. Such a neural network g_θ can be written as ∇g_θ^T x + b, where x is an input and ∇g_θ is the gradient of g_θ(x) with respect to x.

Lemma A.1. (Structure of the partial derivative) Let s = (x_0, ..., x_d) be one time series sample and λ ∈ [0,1]^d. Then
\[
\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} =
\begin{cases}
0 & \text{if } i < j \\
x_{i-1,\lambda_{i-1}} - x_i & \text{if } i = j \\
\big( \prod_{k=j+1}^{i} \lambda_k \big) (x_{j-1,\lambda_{j-1}} - x_j) & \text{if } i > j
\end{cases} \tag{17}
\]
Proof. If i < j, then x_{i,λ_i} does not depend on λ_j, hence ∂x_{i,λ_i}/∂λ_j = 0. If i = j, then x_{i,λ_i} = (1-λ_i) x_i + λ_i x_{i-1,λ_{i-1}} = x_i + λ_i (x_{i-1,λ_{i-1}} - x_i), hence ∂x_{i,λ_i}/∂λ_j = x_{i-1,λ_{i-1}} - x_i. If i > j, we proceed by induction on i. For the base case i = j+1, we have x_{j+1,λ_{j+1}} = (1-λ_{j+1}) x_{j+1} + λ_{j+1} x_{j,λ_j} = x_{j+1} + λ_{j+1} (x_{j,λ_j} - x_{j+1}). Since x_{j+1} does not depend on λ_j and ∂x_{j,λ_j}/∂λ_j = x_{j-1,λ_{j-1}} - x_j, we have ∂x_{j+1,λ_{j+1}}/∂λ_j = λ_{j+1} (x_{j-1,λ_{j-1}} - x_j). We now assume that the claim holds for i and prove it for i+1.
Since x_{i+1,λ_{i+1}} = (1-λ_{i+1}) x_{i+1} + λ_{i+1} x_{i,λ_i} = x_{i+1} + λ_{i+1} (x_{i,λ_i} - x_{i+1}) and ∂x_{i,λ_i}/∂λ_j = (∏_{k=j+1}^{i} λ_k)(x_{j-1,λ_{j-1}} - x_j), we get
\[
\frac{\partial x_{i+1,\lambda_{i+1}}}{\partial \lambda_j} = \lambda_{i+1} \Big( \prod_{k=j+1}^{i} \lambda_k \Big) (x_{j-1,\lambda_{j-1}} - x_j) = \Big( \prod_{k=j+1}^{i+1} \lambda_k \Big) (x_{j-1,\lambda_{j-1}} - x_j).
\]
As a consequence of mathematical induction, the conclusion holds.

A.2 PROOF OF THEOREM 3.3

We use neural networks with ReLU activations and a sigmoid activation for the last layer, so f_θ(x) = σ(g_θ(x)), where g_θ(x) = ∇g_θ^T x + b is the pre-activation signal of the last layer. Recall that we denote ⃗λ ≜ λ, and λ ∼ D means that each component λ_i of λ is sampled independently from D.

Proof of Theorem 3.3. Let s_λ = (x_0, x_{1,λ_1}, ..., x_{d,λ_d}, y) be an augmented sample and s the corresponding original sample. We denote the features by x_λ = (x_0, x_{1,λ_1}, ..., x_{d,λ_d}) and the label by y. Define the matrices A and B_i as in the statement of Theorem 3.3, let ⃗0 ∈ R^d denote the zero column vector, and let e_i be the vector whose i-th component is 1 and 0 otherwise. Let l(s_λ, θ) = y log(f_θ(x_λ)) + (1-y) log(1 - f_θ(x_λ)), and denote l_θ(λ) = l(s_λ, θ).
Using a Taylor expansion of the loss l around λ = ⃗0, we have
\[
l_\theta(\lambda) = l_\theta(\vec{0}) + \sum_{i=1}^{d} \frac{\partial l_\theta(\lambda)}{\partial \lambda_i}\Big|_{\lambda=\vec{0}} \lambda_i + \frac{1}{2} \sum_{i,j} \frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_i \partial \lambda_j}\Big|_{\lambda=\vec{0}} \lambda_i \lambda_j + O(\|\lambda\|^2). \tag{20}
\]
Note that
\[
\frac{\partial l_\theta(\lambda)}{\partial x_{i,\lambda_i}}
= y \frac{\partial \log(f_\theta(x_\lambda))}{\partial x_{i,\lambda_i}} + (1-y) \frac{\partial \log(1-f_\theta(x_\lambda))}{\partial x_{i,\lambda_i}}
= \frac{\partial f_\theta(x_\lambda)}{\partial x_{i,\lambda_i}} \cdot \frac{y - f_\theta(x_\lambda)}{(1-f_\theta(x_\lambda))\, f_\theta(x_\lambda)} \tag{21}
\]
and
\[
\frac{\partial f_\theta(x_\lambda)}{\partial x_{i,\lambda_i}}
= \frac{\partial \sigma(g_\theta(x_\lambda))}{\partial x_{i,\lambda_i}}
= \frac{\partial (\nabla g_\theta^T x_\lambda)}{\partial x_{i,\lambda_i}}\, \sigma(g_\theta(x_\lambda)) (1-\sigma(g_\theta(x_\lambda)))
= \nabla g_\theta^T e_i\, \sigma(g_\theta(x_\lambda)) (1-\sigma(g_\theta(x_\lambda))).
\]
Since the i-th feature x_{i,λ_i} of x_λ depends on {λ_1, ..., λ_i}, we have
\[
\frac{\partial l_\theta(\lambda)}{\partial \lambda_j}
= \sum_{i=1}^{d} \frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} \frac{\partial l_\theta(\lambda)}{\partial x_{i,\lambda_i}}
= \sum_{i=j}^{d} \frac{\partial x_{i,\lambda_i}}{\partial \lambda_j}\, \nabla g_\theta^T e_i\, (y - f_\theta(x_\lambda)).
\]
Thus we have
\[
\sum_{j=1}^{d} \frac{\partial l_\theta(\lambda)}{\partial \lambda_j}\Big|_{\lambda=\vec{0}} \lambda_j = (y - f_\theta(x))\, (0, \lambda)^T A\, \nabla g_\theta \tag{24}
\]
and hence, noting that |y - f_θ(x)| ≤ 1,
\[
\Big| \sum_{j=1}^{d} \frac{\partial l_\theta(\lambda)}{\partial \lambda_j}\Big|_{\lambda=\vec{0}} \lambda_j \Big| \le \sqrt{d}\, \|A\|_F\, \|\nabla g_\theta\|. \tag{25}
\]
Now we consider the second partial derivative of the loss function l_θ(λ):
\[
\frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_u \partial \lambda_j}
= \frac{\partial}{\partial \lambda_u} \Big( \sum_{i=j}^{d} \frac{\partial x_{i,\lambda_i}}{\partial \lambda_j}\, \nabla g_\theta^T e_i\, (y - f_\theta(x_\lambda)) \Big)
= \sum_{i=j}^{d} \nabla g_\theta^T e_i \Big[ \frac{\partial^2 x_{i,\lambda_i}}{\partial \lambda_u \partial \lambda_j} (y - f_\theta(x_\lambda))
- \frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} \sum_{k=u}^{d} \frac{\partial x_{k,\lambda_k}}{\partial \lambda_u}\, \nabla g_\theta^T e_k\, f_\theta(x_\lambda)(1 - f_\theta(x_\lambda)) \Big]. \tag{26}
\]
We substitute (Eq. 26) into (Eq. 20) to compute the Taylor expansion of the loss explicitly.
For the first term, we have
$$\sum_{i=1}^{d}\lambda_i \sum_{j=1}^{d}\lambda_j \sum_{l=j}^{d} \nabla g_\theta^T e_l\,\frac{\partial^2 x_{l,\lambda_l}}{\partial \lambda_i \partial \lambda_j}\Big|_{\lambda=\vec 0}(y - f_\theta(x)) = \sum_{i=1}^{d}\lambda_i\,(y - f_\theta(x))\,(0,\lambda)^T B_i \nabla g_\theta, \quad (27)$$
and for the second term,
$$\sum_{j=1}^{d}\lambda_j \sum_{i=j}^{d} \nabla g_\theta^T e_i\,\frac{\partial x_{i,\lambda_i}}{\partial \lambda_j} \sum_{l=1}^{d}\lambda_l \sum_{k=l}^{d} \frac{\partial x_{k,\lambda_k}}{\partial \lambda_l}\,\nabla g_\theta^T e_k\,f_\theta(x_\lambda)(1-f_\theta(x_\lambda))\Big|_{\lambda=\vec 0} = f_\theta(x)(1-f_\theta(x))\,(0,\lambda)^T A \nabla g_\theta\,\nabla g_\theta^T A^T (0,\lambda). \quad (28)$$
Combining (Eq. 27) and (Eq. 28), we have
$$\sum_{i=1}^{d}\sum_{j=1}^{d} \lambda_i \lambda_j \frac{\partial^2 l_\theta(\lambda)}{\partial \lambda_i \partial \lambda_j}\Big|_{\lambda=\vec 0} = \sum_{i=1}^{d}\lambda_i\,(y - f_\theta(x))\,(0,\lambda)^T B_i \nabla g_\theta - f_\theta(x)(1-f_\theta(x))\,(0,\lambda)^T A \nabla g_\theta\,\nabla g_\theta^T A^T (0,\lambda) \le \sum_{i=1}^{d} \sqrt{d}\,\|B_i\|_F\,\|\nabla g_\theta\|, \quad (29)$$
where the last inequality holds because
$$f_\theta(x)(1-f_\theta(x))\,(0,\lambda)^T A \nabla g_\theta(x)\,\nabla g_\theta(x)^T A^T (0,\lambda) \ge 0. \quad (30)$$
By (Eq. 20), (Eq. 25), and (Eq. 29), the conclusion holds.

A.3 PROOF OF THEOREM 3.2

We start with notation used frequently in the proofs:
$$\theta^* = \arg\min_\theta \mathbb{E}_{s\sim P}[l(s,\theta)], \qquad \hat\theta = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N} l(s_i,\theta),$$
$$\theta^*_{aug} = \arg\min_\theta \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]\big], \qquad \hat\theta_{aug} = \arg\min_\theta \frac{1}{N+1}\sum_{i=0}^{N} \mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta)],$$
$$R_N(l \circ \Theta) = \mathbb{E}_{\epsilon_i\sim\mathcal E}\Big[\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l(s_i,\theta)\Big|\Big], \qquad l_{aug}(s,\theta) = \mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]. \quad (32)$$

Assumption 1. The loss function $l$ satisfies the Lipschitz condition with respect to the norm, with Lipschitz constant $L_{Lip}$.

Proof of Theorem 3.2. Decompose
$$\mathbb{E}_{s\sim P}[l(s,\hat\theta_{aug})] - \mathbb{E}_{s\sim P}[l(s,\theta^*)] = u_1 + u_2 + u_3 + u_4 + u_5,$$
where
$$u_1 = \mathbb{E}_{s\sim P}[l(s,\hat\theta_{aug})] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\hat\theta_{aug})]\big],$$
$$u_2 = \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\hat\theta_{aug})]\big] - \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\hat\theta_{aug})],$$
$$u_3 = \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\hat\theta_{aug})] - \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta^*)],$$
$$u_4 = \frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta^*)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta^*)]\big],$$
$$u_5 = \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta^*)]\big] - \mathbb{E}_{s\sim P}[l(s,\theta^*)]. \quad (33)$$
We get
$$u_1 + u_5 \le 2\sup_{\theta\in\Theta}\Big|\mathbb{E}_{s\sim P}[l(s,\theta)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]\big]\Big|, \quad (34)$$
where
$$\Big|\mathbb{E}_{s\sim P}[l(s,\theta)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]\big]\Big| = \Big|\mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s,\theta) - l(s_\lambda,\theta)]\big]\Big| \le L_{Lip}\,\mathbb{E}_{s\sim P}\mathbb{E}_{\lambda\sim\mathcal D}[\|s_\lambda - s\|]. \quad (35)$$
Hence, from (Eq.
34) and (Eq. 35),
$$u_1 + u_5 \le 2 L_{Lip}\,\mathbb{E}_{s\sim P}\mathbb{E}_{\lambda\sim\mathcal D}[\|s_\lambda - s\|]. \quad (36)$$
By McDiarmid's inequality,
$$\mathbb{P}\Big(\frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta^*)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta^*)]\big] \ge t\Big) \le \exp\Big(-\frac{2t^2}{\sum_{i=0}^{N}\big(\frac{1}{N+1}\big)^2}\Big), \quad (37)$$
so $u_4$ satisfies, with probability at least $1-\delta$,
$$u_4 < \sqrt{\frac{\log(1/\delta)}{2(N+1)}}. \quad (38)$$
Moreover, the standard Rademacher-complexity bound applies to $u_2$, so with probability at least $1-\delta$,
$$u_2 \le 2 R_N(l_{aug} \circ \Theta) + 4\sqrt{\frac{2\log(4/\delta)}{N+1}}. \quad (39)$$
By (Eq. 38) and (Eq. 39), with probability at least $1-\delta$,
$$u_2 + u_4 \le 2 R_N(l_{aug} \circ \Theta) + 5\sqrt{\frac{2\log(4/\delta)}{N+1}}, \quad (40)$$
where
$$R_N(l_{aug} \circ \Theta) = \mathbb{E}_{\epsilon_i\sim\mathcal E}\Big[\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\,\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta)]\Big|\Big]$$
and $\epsilon_i$ is a Rademacher variable for all $i \in [N]$. Since $\hat\theta_{aug}$ is an optimal parameter for $\frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta)]$, we have $u_3 \le 0$. (41) From (Eq. 36), (Eq. 40), and (Eq. 41), we conclude that
$$\mathbb{E}_{s\sim P}[l(s,\hat\theta_{aug})] - \mathbb{E}_{s\sim P}[l(s,\theta^*)] < 2 R_N(l_{aug}\circ\Theta) + 5\sqrt{\frac{2\log(4/\delta)}{N+1}} + 2 L_{Lip}\,\mathbb{E}_{s\sim P}\mathbb{E}_{\lambda\sim\mathcal D}[\|s_\lambda - s\|]. \quad (42)$$
Now we prove that
$$R_N(l_{aug}\circ\Theta) \le R_N(l\circ\Theta) + \max_{i\in\{0,\ldots,N\}} L_{Lip}\,\mathbb{E}_{\lambda\sim\mathcal D}[\|s_{i,\lambda} - s_i\|]. \quad (43)$$
Indeed,
$$R_N(l_{aug}\circ\Theta) - R_N(l\circ\Theta) = \mathbb{E}_{\epsilon_i\sim\mathcal E}\Big[\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l_{aug}(s_i,\theta)\Big| - \sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\, l(s_i,\theta)\Big|\Big]$$
$$\le \mathbb{E}_{\epsilon_i\sim\mathcal E}\Big[\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\epsilon_i\big(l_{aug}(s_i,\theta) - l(s_i,\theta)\big)\Big|\Big] \le \sup_{\theta\in\Theta}\frac{1}{N+1}\sum_{i=0}^{N}\big|l_{aug}(s_i,\theta) - l(s_i,\theta)\big|$$
$$= \sup_{\theta\in\Theta}\frac{1}{N+1}\sum_{i=0}^{N}\big|\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta) - l(s_i,\theta)]\big| \le \max_{i\in\{0,\ldots,N\}} L_{Lip}\,\mathbb{E}_{\lambda\sim\mathcal D}[\|s_{i,\lambda} - s_i\|]. \quad (44)$$
Finally, we show that $\mathbb{E}_{\lambda\sim\mathcal D}[\|s_{i,\lambda} - s_i\|] \le 2dem$. Each sample $s$ (or $s_i$) lies in $\mathbb{R}^{d+1}\times\{0,1,\ldots,k\}$, so $s = (x_0, x_1, \ldots, x_d, y)$. By the augmentation method, recursively applying convex combinations, we have $s_\lambda = (x_0, x_{1,\lambda_1}, x_{2,\lambda_2}, \ldots, x_{d,\lambda_d}, y)$. By Theorem 3.1 (Eq. 4), $\|x_{j,\lambda_j} - x_j\| \le 2\lambda_j m$ for each $j$.
Hence
$$\mathbb{E}_{\lambda\sim\mathcal D}[\|s_{i,\lambda} - s_i\|] \le \sum_{j=1}^{d}\mathbb{E}_{\lambda\sim\mathcal D}[\|x_{j,\lambda_j} - x_j\|] \le 2\,\mathbb{E}_{\lambda\sim\mathcal D}[\lambda_1 + \lambda_2 + \cdots + \lambda_d]\,m \le 2mde. \quad (45)$$
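As an empirical sanity check (not part of the proof), the pointwise bound $\|x_{j,\lambda_j} - x_j\| \le 2\lambda_j m$ and the aggregated $\ell_1$ bound behind the inequality above can be verified numerically on scalar series. The helper `rim_augment` below is a hypothetical implementation of the recursion $x_{i,\lambda_i} = (1-\lambda_i)x_i + \lambda_i x_{i-1,\lambda_{i-1}}$, not the authors' code:

```python
import random

def rim_augment(x, lams):
    """RIM recursion: x_{i,lam_i} = (1 - lam_i) x_i + lam_i x_{i-1,lam_{i-1}}, x_0 unchanged."""
    out = [x[0]]
    for i in range(1, len(x)):
        out.append((1 - lams[i - 1]) * x[i] + lams[i - 1] * out[i - 1])
    return out

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(9)]  # x_0, ..., x_d with d = 8
m = max(abs(v) for v in x)
for _ in range(1000):
    lams = [random.random() for _ in range(len(x) - 1)]  # lam_j ~ Uniform[0, 1)
    xa = rim_augment(x, lams)
    # per-coordinate bound: |x_{j,lam_j} - x_j| <= 2 lam_j m
    assert all(abs(a - b) <= 2 * l * m + 1e-12 for a, b, l in zip(xa[1:], x[1:], lams))
    # aggregated (l1) bound: sum_j |x_{j,lam_j} - x_j| <= 2 m sum_j lam_j
    assert sum(abs(a - b) for a, b in zip(xa, x)) <= 2 * m * sum(lams) + 1e-9
```

The per-coordinate bound holds because each augmented value is a convex combination of points bounded in norm by $m$.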

A.4 PROOF OF THEOREM 3.4

Let $\mathcal A = \sup_{s\in S}\|A\|_F$ and $\mathcal B_i = \sup_{s\in S}\|B_i\|_F$. Then for each $s \sim P$ we have $\|A\|_F \le \mathcal A$ and $\|B_i\|_F \le \mathcal B_i$, where $A$ and $B_i$ depend on $s$ for $i \in [d]$; see (Eq. 18) and (Eq. 19).

Proof of Theorem 3.4. The proof is almost the same as that of Theorem 3.2; we describe only the parts that must be replaced. We start with the same decomposition,
$$\mathbb{E}_{s\sim P}[l(s,\hat\theta_{aug})] - \mathbb{E}_{s\sim P}[l(s,\theta^*)] = u_1 + u_2 + u_3 + u_4 + u_5,$$
where $u_1, u_2, u_3, u_4$, and $u_5$ are defined in (Eq. 33). Equation (35) in the proof of Theorem 3.2 is replaced by
$$\Big|\mathbb{E}_{s\sim P}[l(s,\theta)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]\big]\Big| = \Big|\mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s,\theta) - l(s_\lambda,\theta)]\big]\Big| \le \sqrt{d}\Big(\mathcal A + \sum_{i=1}^{d}\mathcal B_i\Big)\|\nabla g_\theta\|. \quad (46)$$
Hence, from (Eq. 34) and (Eq. 46), we have
$$u_1 + u_5 \le 2\sqrt{d}\Big(\mathcal A + \sum_{i=1}^{d}\mathcal B_i\Big)\|\nabla g_\theta\|. \quad (47)$$
The last two lines of (Eq. 44) are replaced by
$$\sup_{\theta\in\Theta}\frac{1}{N+1}\sum_{i=0}^{N}\big|\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta) - l(s_i,\theta)]\big| \le \sqrt{d}\Big(\mathcal A + \sum_{i=1}^{d}\mathcal B_i\Big)\|\nabla g_\theta\|,$$
which Theorem 3.3 guarantees. Thus the conclusion holds.

A.5 PROOF OF THEOREM 3.5

Asymptotic results. Note that the sample space is contained in $\mathbb{R}^{d+1}\times\{0,\ldots,k\}$ and the parameter space satisfies $\Theta \subseteq \mathbb{R}^p$, where $\{0,\ldots,k\}$ is the label set. Under regularity conditions, it is well known that $\hat\theta$ is asymptotically normal with covariance given by the inverse Fisher information matrix. We will see that $\hat\theta_{aug}$ is also asymptotically normal. Suppose that we observe a set $\{s_0, s_1, \ldots, s_N\}$ of $N+1$ samples from the underlying sample space $S$. Using our RIM method, we can augment the observed sample $s_i$ with a distribution $\mathcal D$, which yields the set of augmented samples $\{s_{i,\lambda} \mid \lambda \sim \mathcal D\}$ for $s_i$. We then decompose $\cup_{i=0}^{N}\{s_{i,\lambda} \mid \lambda \sim \mathcal D\}$ into a disjoint union of sets $S_i$ such that $s_i \in S_i$.

Assumption 2 (Disjointness). The sample space $S$ is the disjoint countable union of all possible augmented sample spaces.
Consider the probability space $(S, \mathcal F, P)$, where $\mathcal F$ is a sigma-algebra and $P$ the corresponding probability measure. Let $\mu$ be a measurable function from $S$ to $\mathbb{R}^p$ for some $p \in \mathbb{N}$. For each sample $s = (x, y) \in S$, define $\bar\mu(s) = \mathbb{E}[\mu \mid S_i]$, where $s \in S_i$. Assume that we observe $N+1$ samples $\{s_0, s_1, \ldots, s_N\}$ from the underlying sample space. Then by Assumption 2 the underlying sample space decomposes as $S = \cup_{i=1}^{\infty} S_i$. The map $\bar\mu$ is well defined by Assumption 2 and is measurable. Let $\bar\mu_i$ denote the expectation of $\mu$ over $S_i$.

Lemma A.3 (Effects of the average function). With notation as above, the following hold.

1. The law of total expectation: $\mathbb{E}_{s\sim P}[\mu] = \mathbb{E}_{s\sim P}[\bar\mu]$.

2. The law of total covariance:

$$\mathrm{Cov}_{s\sim P}(\mu) = \mathbb{E}_{s\sim P}[\mathrm{Cov}(\mu \mid \bar\mu)] + \mathrm{Cov}_{s\sim P}(\bar\mu).$$

Proof. From the disjointness of the underlying sample space, each $s \in S$ lies in $S_i$ for some $i \in \mathbb{N}$, so
$$\mathbb{E}_{s\sim P}[\mu] = \mathbb{E}_{s\sim P}\big[\mathbb{E}[\mu \mid S_i]\big] = \mathbb{E}_{s\sim P}\big[\mathbb{E}[\mu \mid \bar\mu = \bar\mu_i]\big] = \mathbb{E}_{s\sim P}[\bar\mu].$$
The law of total expectation and the law of total covariance naturally follow.

Under mild assumptions on a given loss function, we show that the averaged loss function $l_{aug}$ inherits the same properties from the non-augmented loss function, where $l_{aug}(s,\theta) = \mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]$.

Assumption 3 (Regularity of the loss function). For the loss function $l(\cdot,\theta)$, we assume:

1. For the minimizer $\theta^*$ of the population risk and any $\epsilon > 0$, we have $\inf_{\{\|\theta-\theta^*\|\ge\epsilon \,\mid\, \theta\in\Theta\}} \mathbb{E}_{s\sim P}[l(s,\theta)] > \mathbb{E}_{s\sim P}[l(s,\theta^*)]$.
2. For every $\epsilon > 0$, there exists a function $l' \in L^2(P)$ such that for almost every $s$ and for every $\theta_1, \theta_2 \in N(\theta_0,\epsilon)$, we have $|l(s,\theta_1) - l(s,\theta_2)| \le l'(s)\|\theta_1 - \theta_2\|$.
3. A uniform weak law of large numbers holds: $\sup_{\theta\in\Theta}\big|\frac{1}{N+1}\sum_{i=0}^{N} l(s_i,\theta) - \mathbb{E}_{s\sim P}[l(s,\theta)]\big| \to 0$.
4. For each $\theta \in \Theta$, the map $s \mapsto l(s,\theta)$ is measurable.
5. The map $\theta \mapsto l(s,\theta)$ is differentiable at $\theta^*$ for almost every $s$.
6. The map $\theta \mapsto \mathbb{E}_{s\sim P}[l(s,\theta)]$ admits a second-order Taylor expansion at $\theta^*$ with nonsingular second-derivative matrix $V_{\theta^*}$.

Proposition A.4. For the pair $(\hat\theta_{aug}, l_{aug})$, the following property holds. For every $\epsilon > 0$, there exists a function $l'_{aug} \in L^2(P)$ such that for almost every $s$ and for every $\theta_1, \theta_2 \in N(\theta_0,\epsilon)$, we have $|l_{aug}(s,\theta_1) - l_{aug}(s,\theta_2)| \le l'_{aug}(s)\|\theta_1 - \theta_2\|$.

Proof. Since
$$|l_{aug}(s,\theta_1) - l_{aug}(s,\theta_2)| = \big|\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta_1) - l(s_\lambda,\theta_2)]\big| \le \mathbb{E}_{\lambda\sim\mathcal D}\big[|l(s_\lambda,\theta_1) - l(s_\lambda,\theta_2)|\big] \le \mathbb{E}_{\lambda\sim\mathcal D}[l'(s_\lambda)]\,\|\theta_1 - \theta_2\|,$$
the conclusion holds with $l'_{aug} = \mathbb{E}_{\lambda\sim\mathcal D}[l'(s_\lambda)]$.

Proposition A.5. For the pair $(\hat\theta_{aug}, l_{aug})$, the following property holds:
$$\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N} l_{aug}(s_i,\theta) - \mathbb{E}_{s\sim P}[l_{aug}(s,\theta)]\Big| \to 0.$$

Proof.
We have
$$\sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N} l_{aug}(s_i,\theta) - \mathbb{E}_{s\sim P}[l_{aug}(s,\theta)]\Big| = \sup_{\theta\in\Theta}\Big|\frac{1}{N+1}\sum_{i=0}^{N}\mathbb{E}_{\lambda\sim\mathcal D}[l(s_{i,\lambda},\theta)] - \mathbb{E}_{s\sim P}\big[\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]\big]\Big| = o_P(1),$$
where the last equality holds because the underlying sample space $S$ is the disjoint countable union of all possible augmented sample spaces.

Proposition A.6. For the pair $(\hat\theta_{aug}, l_{aug})$, the following property holds. For each $\theta \in \Theta$, the map $s \mapsto l_{aug}(s,\theta)$ is measurable.

Proof. $l_{aug}$ is measurable since $l$ is measurable and $l_{aug}(s,\theta) = \mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta)]$.

Proposition A.7. For the pair $(\hat\theta_{aug}, l_{aug})$, the following property holds. The map $\theta \mapsto l_{aug}(s,\theta)$ is differentiable at $\theta^*$ for almost every $s$.

Proof. We have
$$\lim_{\delta\to 0}\frac{\big|l_{aug}(s,\theta^*+\delta) - l_{aug}(s,\theta^*) - \delta^T \nabla l_{aug}(s,\theta^*)\big|}{\|\delta\|} = \lim_{\delta\to 0}\frac{\big|\mathbb{E}_{\lambda\sim\mathcal D}[l(s_\lambda,\theta^*+\delta) - l(s_\lambda,\theta^*)] - \delta^T\mathbb{E}_{\lambda\sim\mathcal D}[\nabla l(s_\lambda,\theta^*)]\big|}{\|\delta\|} \le \lim_{\delta\to 0}\mathbb{E}_{\lambda\sim\mathcal D}\Big[\frac{\big|l(s_\lambda,\theta^*+\delta) - l(s_\lambda,\theta^*) - \delta^T\nabla l(s_\lambda,\theta^*)\big|}{\|\delta\|}\Big],$$
where the last inequality holds because $l$ is differentiable. Define
$$F_\delta(s) = \mathbb{E}_{\lambda\sim\mathcal D}\Big[\frac{\big|l(s_\lambda,\theta^*+\delta) - l(s_\lambda,\theta^*) - \delta^T\nabla l(s_\lambda,\theta^*)\big|}{\|\delta\|}\Big], \qquad G(s) = \mathbb{E}_{\lambda\sim\mathcal D}\big[|l'(s_\lambda)| + \|\nabla l(s_\lambda,\theta^*)\|\big] \quad \text{for all } s \in S.$$
Since $|F_\delta(s)| \le G(s)$

Proposition A.8. The map $\theta \mapsto \mathbb{E}_{s\sim P}[l_{aug}(s,\theta)]$ admits a second-order Taylor expansion at $\theta^*$ with non-singular second-derivative matrix $V_{\theta^*}$.

Proof. By the law of total expectation (Lemma A.3), we get $\mathbb{E}_{s\sim P}[l_{aug}(s,\theta)] = \mathbb{E}_{s\sim P}[l(s,\theta)]$. Hence the conclusion holds by item 6 of Assumption 3.

Combining Propositions A.4–A.8 with some results shown in Van der Vaart (1998), we prove Theorem 3.5.

Proof of Theorem 3.5. The results for $\hat\theta$ have already been proven in (Van der Vaart, 1998, Theorem 5.23). Propositions A.4–A.8 guarantee that (Van der Vaart, 1998, Theorem 5.23) applies, and $\Sigma_{aug} = \mathrm{Cov}(\nabla l_{aug}(s,\theta^*))$.
Hence we have
$$\Sigma_0 - \Sigma_{aug} = \mathrm{Cov}_{s\sim P}(\mu) - \mathrm{Cov}_{s\sim P}(\bar\mu) = \mathbb{E}_{s\sim P}[\mathrm{Cov}(\mu \mid \bar\mu)] = \mathbb{E}_{s\sim P}\Big[\mathbb{E}\big[(\mu - \mathbb{E}[\mu\mid\bar\mu])(\mu - \mathbb{E}[\mu\mid\bar\mu])^T \,\big|\, \bar\mu\big]\Big] = \mathbb{E}_{s\sim P}\Big[\mathbb{E}\big[(\mu - \bar\mu)(\mu - \bar\mu)^T \,\big|\, \bar\mu\big]\Big]$$
$$= \mathbb{E}_{s\sim P}\Big[\mathbb{E}\big[(\nabla l(s,\theta^*) - \nabla l_{aug}(s,\theta^*))(\nabla l(s,\theta^*) - \nabla l_{aug}(s,\theta^*))^T \,\big|\, \bar\mu\big]\Big] = \mathbb{E}_{s\sim P}\big[\mathbb{E}[X X^T \mid \bar\mu]\big] = \mathbb{E}_{s\sim P}[X X^T].$$
Thus we get $\Sigma_{aug} = \Sigma_0 - \mathbb{E}_{s\sim P}[X X^T]$. Since $\mathrm{tr}(\mathbb{E}_{s\sim P}[X X^T]) \ge 0$, we have $RE = \frac{\mathrm{tr}(\Sigma_0)}{\mathrm{tr}(\Sigma_{aug})} \ge 1$.

B VISUALIZATIONS OF RIM TIMEGAN GENERATED TIME SERIES

This section shows visualizations of time series generated by RIM and TimeGAN. We plot samples of the original time series, RIM-generated time series, and TimeGAN-generated time series from both classes for the synthetic exponential ODE classification task in Figure 6.

D TIME SERIES FORECASTING

In this section, we consider the time series forecasting task, where we use the previous n time steps to predict the next time step. We compare the performance of a regression model trained on the original dataset with that of regression models trained on the RIM-augmented dataset.

Predicting Stock Price Movement. This regression task consists of predicting the next day's SPY500 index Open price from the historical SPY500 index, using data from July 2008 to December 2012 for training and data from January 2013 to March 2014 for testing. The input contains the last 7 days' historical Open, Close, High, Low, and Volume of the SPY500 index. After predicting the next day's Open price, we take a long position if the predicted next-day Open is larger than today's Open, and a short position otherwise. Comparing the test results for the augmented and original cases, we observe that the proportion of profitable trading signals is higher for the augmentation-trained model, as shown in Figure 9. Using these trading signals, we also calculate the trading system's CAGR (Compound Annual Growth Rate), which we observe to be higher for the augmentation-trained model. The test loss plot shows that the MSE of the augmentation-trained model is consistently lower than that of the non-augmentation-trained model.

Predicting Air Quality. The restricted air quality dataset contains 1200 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors. This is a time series regression task where the target is the next time step's CO concentration. The input contains the last six time steps' 10 features, as used in De Vito et al. (2008) and the UCI Machine Learning Repository. Figure 10 shows that the test MSE of the augmentation-trained model remains lower than that of the non-augmentation-trained model throughout the training epochs, validating our claims about the robustness of the approach.
Accordingly, the proportion of correct CO up/down predictions also remains higher for the augmented case.
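As a hedged sketch of the forecasting setup described above (the function name `make_windows` is ours, not the paper's), the sliding-window construction — previous n steps as input, next step as target — can be written as:

```python
def make_windows(series, n):
    """Build (input, target) pairs where the previous n steps predict the next step.

    `series` is a list of per-step values (univariate here for brevity; the paper's
    tasks use multivariate features such as Open/Close/High/Low/Volume).
    """
    X, y = [], []
    for t in range(n, len(series)):
        X.append(series[t - n:t])  # window of the n previous steps
        y.append(series[t])        # next-step target
    return X, y

# Example: 3-step windows over a toy series.
X, y = make_windows([1, 2, 3, 4, 5, 6], n=3)
# X == [[1, 2, 3], [2, 3, 4], [3, 4, 5]], y == [4, 5, 6]
```

Under this setup, RIM augmentation would be applied to the training series before windowing, and the same windows would feed either model.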

E TIME SERIES RL

In this section, we consider a portfolio management task using reinforcement learning. More specifically, we compare the performance of agents (DPG) trained with the original state trajectories (price evolution) against agents trained with RIM-augmented state trajectories.

E.1 DATASET

E.3 POLICY DEPLOYMENT

In our RL experiments, the investment decisions to rebalance the portfolio are made daily, and each input signal is a multidimensional tensor that aggregates historical open, low, high, and close prices and volume. It should be noted that both our training and testing include transaction costs (TC). We used a typical cost of 0.25% to account for the bid-ask spread and market impact, which we believe is a reasonable transaction cost for the portfolio trades.

F HYPERPARAMETERS

We note that the primary objective of the conducted experiments is to show that RIM can improve model performance. Therefore, in all experiments, instead of searching for the optimal set of hyperparameters for the augmented and non-augmented models separately, we compare augmentation-trained and non-augmentation-trained models under the same hyperparameter configuration. We demonstrated in Section 4 that RIM indeed improves model performance under the same hyperparameter configuration. But are these improvements robust to other hyperparameters? To answer this question, we conducted a sensitivity analysis for two supervised learning tasks (indoor user movement classification and air quality regression) and an RL task (portfolio management).

F.1 HYPERPARAMETER SENSITIVITY FOR SUPERVISED TASKS

For the indoor movement classification task, we repeated the experiment of Section 4 with 9 different hyperparameter configurations, as shown in Table 1, and observed that RIM outperforms the non-augmented case in every configuration (smaller mean test loss and higher mean test accuracy), which solidifies our claim that RIM enhances model performance. For the air quality regression task, we again repeated the experiment of Section 4 with 8 different hyperparameter configurations, as shown in Table 2. Here too, we find that RIM outperforms the non-augmented case (smaller mean test MSE and higher mean test accuracy), confirming the same claim.



$$x_{i,\lambda_i} = (1-\lambda_i)x_i + \lambda_i x_{i-1,\lambda_{i-1}}, \qquad x_{0,\lambda_0} = x_0. \quad (1)$$

[Figure 1 residue: original series $X = (0, 2, 1, 4, 2)$ and its RIM augmentation $X_\lambda = (0, 1.6, 1.12, 3.42, 2.28)$ with $\lambda_i = \lambda = 0.8$, via Eq. 1.]

Footnote: we can choose the range of the loss function to be any compact and connected subset of $\mathbb{R}$ under the usual topology.
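To make Eq. 1 concrete, here is a hypothetical implementation (the function name `rim_augment` is ours, not the paper's). Note that the figure's values $(0, 1.6, 1.12, 3.42, 2.28)$ are reproduced by Eq. 1 with $\lambda_i = 0.2$, which suggests the figure's label $\lambda = 0.8$ weights the current observation $x_i$ rather than the previous augmented value:

```python
def rim_augment(x, lams):
    """Eq. 1: x_{i,lam_i} = (1 - lam_i) x_i + lam_i x_{i-1,lam_{i-1}}, with x_0 unchanged."""
    out = [x[0]]
    for i in range(1, len(x)):
        out.append((1 - lams[i - 1]) * x[i] + lams[i - 1] * out[i - 1])
    return out

# Reproduces the figure's series (0, 1.6, 1.12, 3.42, 2.28) up to rounding:
aug = rim_augment([0, 2, 1, 4, 2], [0.2] * 4)
# aug ≈ [0, 1.6, 1.12, 3.424, 2.2848]
```

Because each output is a convex combination of the current observation and the previous augmented value, the augmented series stays within the range of the original one.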



Time Series. There is an exhaustive list of transformations applied to time series that are commonly used as data augmentation (Wen et al., 2020a). Fawaz et al. (2018) described transformations in the time domain such as time warping and time permutation. There are methods that belong to the magnitude domain, such as magnitude warping, Gaussian noise injection, quantization, scaling, and rotation (Wen & Keyes, 2019). Other transformations operate in the frequency and time-frequency domains and are based on the Discrete Fourier Transform (DFT): they apply transformations in the amplitude and phase spectra of the time series and then apply the inverse DFT to generate a new time series signal (Gao et al., 2020). Besides transformations in different domains, there are also more advanced methods, including decomposition-based methods such as Seasonal and Trend decomposition using Loess (STL) and its variants (Cleveland et al., 1990; Wen et al., 2020b), statistical generative models (Kang et al., 2020), and learning-based methods. The learning-based methods can be further divided into embedding-space methods (DeVries & Taylor, 2017) and deep generative models (DGMs) (Esteban et al., 2017; Yoon et al., 2019).

Figure 1: Illustration of RIM.

) with $\mathbb{E}_{\lambda\sim\mathcal D}[\|s_{i,\lambda} - s_i\|] \le 2mde$, where $e = \mathbb{E}[\mathcal D]$, $d+1$ is the dimension of the time series sample, and $m = \max_{i\in[0:N]}\{\|x_i\|\}$.

Figure 2: Time series from two classes on the left plot, Test Accuracy for the exponential synthetic ODE system using a Convolutional Neural Network with kernel size=3, filter=32, batch size=16, using BatchNorm and Adam optimizer. The test accuracy plot indicates the resulting mean ± standard deviation from 10 runs.

Figure 4: Time series from two classes and Test Accuracy for the Indoor User Movement Classification using a Convolutional Neural Network with kernel size=3, filter=32, batch size=16, using BatchNorm and Adam optimizer. The test accuracy plot indicates the resulting mean ± standard deviation from 10 runs.

Figure 5: Time series from two classes and Test Accuracy for the Ford Engine Classification with a Convolutional Neural Network with kernel size=3, filter=32, batch size=16, using BatchNorm and Adam optimizer. The test accuracy plot indicates the resulting mean ± standard deviation from 10 runs.

(16), where $m = \max_{i\in[0:N]}\{\|x_i\|\}$. The first inequality holds by the Minkowski inequality.

A.2 PROOF OF THEOREM 3.3

(45), where $e = \mathbb{E}[\mathcal D]$, $d+1$ is the dimension of the time series sample, and $m = \max_{i\in[0:N]}\{\|x_i\|\}$.

for all $s \in S$, by Lebesgue's dominated convergence theorem,
$$\lim_{\delta\to 0}\frac{\big|l_{aug}(s,\theta^*+\delta) - l_{aug}(s,\theta^*) - \delta^T\nabla l_{aug}(s,\theta^*)\big|}{\|\delta\|} \le \lim_{\delta\to 0}\mathbb{E}_{\lambda\sim\mathcal D}\Big[\frac{\big|l(s_\lambda,\theta^*+\delta) - l(s_\lambda,\theta^*) - \delta^T\nabla l(s_\lambda,\theta^*)\big|}{\|\delta\|}\Big] = \mathbb{E}_{\lambda\sim\mathcal D}\Big[\lim_{\delta\to 0}\frac{\big|l(s_\lambda,\theta^*+\delta) - l(s_\lambda,\theta^*) - \delta^T\nabla l(s_\lambda,\theta^*)\big|}{\|\delta\|}\Big] = 0.$$

Figure 6: Visualization of exponential ODE classification series generated by RIM and TimeGAN against original series (5 each)

Figure 8: Test accuracy over epochs for Indoor dataset with λ sampled from beta distributions with different scales (Left) Beta(0.5, 0.5), (Middle) Beta(2, 2), and (Right) Beta(2, 5).

Figure 9: Profitable trading signals (left), test set CAGR (middle), test set MSE (right) for the SPY500 Dataset using an LSTM model with 2 LSTM layers (200 neurons), 2 dense layers (100 neurons), lr=1e-4, batch size=16. The plots indicate resulting mean ± standard deviation from 10 runs.

Figure 10: Test accuracy (left) and Test MSE (right) for the Air Quality Dataset using an LSTM model with 2 LSTM layers (200 neurons), 2 dense layers (100 neurons), lr=1e-4, and batch size=16. The plots indicate the resulting mean ± standard deviation from 10 runs.

Figure 11: Evolution of stock prices.

Figure 12: Training and testing results for DPG (above) and DDPG (below). The plots indicate the resulting mean ± standard deviation from 20 runs with different seeds.

Aad W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

Qingsong Wen, Liang Sun, Fan Yang, Xiaomin Song, Jingkun Gao, Xue Wang, and Huan Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020a.

Qingsong Wen, Zhe Zhang, Yan Li, and Liang Sun. Fast RobustSTL: Efficient and robust seasonal-trend decomposition for time series with complex patterns. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2203-2213, 2020b.

Tailai Wen and Roy Keyes. Time series anomaly detection using convolutional neural networks and transfer learning. IJCAI Workshop on AI4IoT, 2019.

Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. Time-series generative adversarial networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 5508-5518, 2019.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

can be applied to the pairs $(\theta^*, l)$ and $(\hat\theta_{aug}, l_{aug})$. Therefore $\hat\theta_{aug}$ is asymptotically normal and satisfies (Eq. 10). Let $X = \nabla l(s,\theta^*) - \nabla l_{aug}(s,\theta^*)$. Recall that $\mathrm{Cov}_{s\sim P}(\mu) = \mathbb{E}_{s\sim P}[\mathrm{Cov}(\mu\mid\bar\mu)] + \mathrm{Cov}_{s\sim P}(\bar\mu)$ by Lemma A.3. Let us take $\mu$ and $\bar\mu$ to be $\nabla l(s,\theta^*)$ and $\nabla l_{aug}(s,\theta^*)$, respectively. Then $\Sigma_0 = \mathrm{Cov}(\nabla l(s,\theta^*))$.

Input: observed time series data $S = \{s_0, s_1, \ldots, s_T\}$, where $s_i = (x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ for $i \in [0:T]$.
1: Initialize the policy network parameters $\theta$, the number of epochs $Y$, and a distribution $\mathcal D$ with support $[0, 1)$.
2: for $e = 1$ to $Y$ do
3:   Initialize $\vec\lambda = (\lambda_1, \ldots, \lambda_T)$ with $\lambda_i \sim \mathcal D$  // initialize the interpolation coefficients vector
4:   Augmented Path Simulator: generate an augmented trajectory $S_{\vec\lambda} = \{s_0, s_{1,\lambda_1}, \ldots, s_{T,\lambda_T}\}$, where $s_{i,\lambda_i} = (x_{i,\lambda_i}, y_{i,\lambda_i})$
5:   for $t = 1$ to $T$ do
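A minimal sketch of the Augmented Path Simulator step above (the function name `rim_trajectory` and the default sampler are our assumptions, not the paper's code):

```python
import random

def rim_trajectory(xs, sample_lambda=random.random):
    """Generate one RIM-augmented state trajectory via Eq. 1:
    x_{t,lam_t} = (1 - lam_t) x_t + lam_t x_{t-1,lam_{t-1}}, with lam_t ~ D, support [0, 1)."""
    aug = [xs[0]]  # x_0 is kept as-is
    for t in range(1, len(xs)):
        lam = sample_lambda()  # fresh lam_t each step (redrawn every epoch)
        aug.append((1 - lam) * xs[t] + lam * aug[t - 1])
    return aug

# With lam_t = 0 the augmented trajectory equals the original one.
assert rim_trajectory([1.0, 2.0, 3.0], sample_lambda=lambda: 0.0) == [1.0, 2.0, 3.0]
```

Per the loop above, a fresh $\vec\lambda$ would be drawn at every epoch, so the agent sees a different augmented trajectory each time.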

Table 1: Test loss and test accuracy, reported as mean (standard deviation) over 10 runs of 50 epochs, for varying numbers of filters and kernel sizes on the indoor user movement classification task. Columns: Filters | Kernel Size | Test Loss (RIM) | Test Loss (NoAug) | Test Acc (RIM) | Test Acc (NoAug).

Table 2: Test MSE and accuracy, reported as mean (standard deviation) over 10 runs of 50 epochs, for varying initial epoch, batch size, LSTM layers, and dense layers on the air quality regression task. Columns: Epoch Init | Batch Size | LSTM Layer | Dense Layer | MSE (RIM) | MSE (NoAug) | Acc (RIM) | Acc (NoAug).

Sensitivity analysis for DPG configurations with augmentation

ACKNOWLEDGEMENT

Amine is grateful to Hafida Ines Bouzaouia and Douglas Tweed for their help and constructive comments regarding this work. This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) [grant number 411296449], Canada's federal funding agency for university based research.

Appendix

The code for all the experiments can be found at the following link.

A PROOFS

A.1 PROOF OF THEOREM 3.1

Let sign(t) =

F.2 HYPERPARAMETER SENSITIVITY FOR RL TASK

The hyperparameter space is a hypercube: the more values it contains, the harder it is to explore all possible combinations. To efficiently find the optimal set of hyperparameters, we explored the hyperparameter space using Bayesian optimization (BO) Hutter et al. (2011). Table 3 shows the ranges of the hyperparameter values used during the training and validation phases. The learning rate controls the speed at which the neural network parameters are updated. The window allows the deep RL agents to utilize a range of historical data values to relax the Markov assumption; we allow the use of 2 up to 30 days of historical data. The number of filters and the kernel strides are the hyperparameters of the convolutional neural networks, and it is important to optimize them carefully in order to capture the best feature representations used by the policy networks. Finally, the training and testing sizes may also impact the RL performance, so we consider them as hyperparameters as well.

Table 3 (residue): Parameters | Bounds | Type; Learning rate (lr) 10

The detailed hyperparameter configurations each index refers to are listed in Tables 4 and 5. The horizontal axis shows the cumulative total validation return. The blue line shows the validation performance for DPG without augmentation; the orange line shows the validation performance for DPG using RIM. The worst to best models are ordered from bottom to top.

