RECURSIVE TIME SERIES DATA AUGMENTATION

Abstract

Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we build our model from the available data. Training on the available realizations, where data is limited, often induces severe over-fitting, thereby preventing generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method (RIM). New augmented time series are generated from the original time series using a recursive interpolation function and are then used for training. We perform a theoretical analysis to characterize the proposed RIM and to guarantee its performance under certain conditions. We apply RIM to diverse synthetic and real-world time series cases and achieve strong performance over non-augmented data on a variety of learning tasks. Our method is also computationally more efficient and leads to better performance than state-of-the-art time series data augmentation.

1. INTRODUCTION

The recent success of machine learning (ML) algorithms depends on the availability of large amounts of data and prodigious computing power, which in practice are not always available. In real-world applications, it is often impossible to sample indefinitely, and ideally we would like the ML model to make good decisions with a limited number of samples. To overcome these issues, we can exploit additional information, such as structure or invariance in the data, that helps ML algorithms learn efficiently and focus on the features most important for solving the task. In ML, the exploitation of structure in the data has been handled using four different yet complementary approaches: 1) architecture design, 2) transfer learning, 3) data representation, and 4) data augmentation. Our focus in this work is on data augmentation in the context of time series learning.

Time series representations do not expose the full information of the underlying dynamical system (Prado, 1998) in a way that ML can easily recognize. For instance, financial time series data contain patterns at various scales that can be learned to improve performance. At a more fundamental level, time series are one-dimensional projections of a hypersurface of data called the phase space of a dynamical system. This projection results in a loss of information about the dynamics of the system. However, we can still make inferences about the dynamical system from which a time series realization is projected. Our approach is to use these inferences to generate additional time series data from the original realization, building richer representations and improving time series pattern identification, which results in better parameter estimates and reduced variance. We show that our methodology is applicable to a variety of ML algorithms.

Time series learning problems depend on the observed historical data used for training. We often use a set of time series data to train the ML model. Each element in the set can be viewed as a sample derived from the underlying stochastic dynamical system. However, each historical time series data sample is only one particular realization of that underlying stochastic dynamical system in the real world.
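As a purely illustrative sketch of generating additional series from a single realization, the snippet below recursively blends a series with a one-step-lagged copy of itself. The blending weight `sigma`, the recursion depth, and the lagged-copy operator are our assumptions for illustration only; they are not the paper's exact RIM definition.

```python
import numpy as np

def recursive_interpolation(x, sigma=0.3, depth=3):
    """Hypothetical sketch: at each recursion level, interpolate the
    current series with a one-step-lagged copy of itself, yielding a
    family of augmented series derived from one realization."""
    augmented = [x]
    current = x
    for _ in range(depth):
        lagged = np.roll(current, 1)
        lagged[0] = current[0]  # avoid wrap-around at the boundary
        current = (1 - sigma) * current + sigma * lagged
        augmented.append(current)
    return augmented

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(100))  # synthetic random-walk series
aug = recursive_interpolation(series, sigma=0.3, depth=3)
print(len(aug))  # original plus `depth` augmented series
```

Each augmented series preserves the length and rough shape of the original while smoothing it toward its lagged copy, so a model trained on the enlarged set sees several nearby realizations rather than a single one.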

