A TEMPORAL KERNEL APPROACH FOR DEEP LEARNING WITH CONTINUOUS-TIME INFORMATION

Abstract

Sequential deep learning models such as RNNs, causal CNNs and attention mechanisms do not readily consume continuous-time information. Discretizing the temporal data, as we show, causes inconsistency even for simple continuous-time processes. Current approaches often handle time in a heuristic manner to be consistent with the existing deep learning architectures and implementations. In this paper, we provide a principled way to characterize continuous-time systems using deep learning tools. Notably, the proposed approach applies to all the major deep learning architectures and requires little modification to the implementation. The critical insight is to represent the continuous-time system by composing neural networks with a temporal kernel, where we gain our intuition from recent advances in understanding deep learning through Gaussian processes and the neural tangent kernel. To represent the temporal kernel, we introduce the random feature approach and convert the kernel learning problem to spectral density estimation under reparameterization. We further prove convergence and consistency results even when the temporal kernel is non-stationary and the spectral density is misspecified. The simulations and real-data experiments demonstrate the empirical effectiveness of our temporal kernel approach in a broad range of settings.
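
To give a concrete feel for the random feature idea mentioned above, the following is a minimal sketch and not the implementation used in this paper. It assumes a stationary Gaussian temporal kernel with a single bandwidth `sigma`, whose spectral density is Gaussian, whereas the paper also treats non-stationary kernels. The function name `random_fourier_features`, the feature count `m`, and the example timestamps are hypothetical choices made only for illustration. Frequencies are drawn as `eps / sigma` with fixed standard-normal noise `eps`, which is one simple form of reparameterization: the kernel is then controlled entirely by the parameter `sigma`.

```python
import numpy as np

def random_fourier_features(t, eps, sigma):
    """Map timestamps t (shape [n]) to random features of shape [n, 2m].

    Assumption for illustration: a stationary Gaussian kernel
    k(t, t') = exp(-(t - t')^2 / (2 * sigma^2)), whose spectral density
    is N(0, 1 / sigma^2).
    """
    m = eps.shape[0]
    omega = eps / sigma              # reparameterized frequency samples
    proj = np.outer(t, omega)        # [n, m] matrix of omega_i * t_j
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=1)
    return feats / np.sqrt(m)

rng = np.random.default_rng(0)
m = 5000                             # number of random frequencies (hypothetical)
eps = rng.standard_normal(m)         # fixed noise, independent of sigma
sigma = 1.5                          # kernel bandwidth (hypothetical value)

t = np.array([0.0, 0.3, 1.7, 4.2])   # irregularly spaced timestamps
phi = random_fourier_features(t, eps, sigma)

# The inner product of the features approximates the kernel matrix.
approx = phi @ phi.T
exact = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * sigma ** 2))
max_err = np.max(np.abs(approx - exact))
```

Because the timestamps enter only through `omega * t`, irregular spacing needs no binning, and in a learning setting `sigma` (or a richer parameterization of the spectral density) could be trained by gradient descent while `eps` stays fixed.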

1. INTRODUCTION

Deep learning models have achieved remarkable performance in sequence learning tasks by leveraging the powerful building blocks of recurrent neural networks (RNN) (Mikolov et al., 2010), long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), causal convolutional neural networks (CausalCNN/WaveNet) (Oord et al., 2016) and attention mechanisms (Bahdanau et al., 2014; Vaswani et al., 2017). Their applicability to continuous-time data, on the other hand, is less explored because incorporating time is complicated when the sequence is irregularly sampled (spaced). The widely adopted workaround is to study the discretized counterpart instead, e.g. the temporal data is aggregated into bins and then treated as equally spaced, in the hope of approximating the temporal signal using the sequence information. Perhaps unsurprisingly, as we show in Claim 1, even for a regular temporal sequence the discretization modifies the spectral structure. The gap can only be amplified for irregular data, so discretizing the temporal information will almost always

