CONTINUOUS DEPTH RECURRENT NEURAL DIFFERENTIAL EQUATIONS

Abstract

Recurrent neural networks (RNNs) have brought significant advances in sequence labeling and sequence modeling tasks. However, their effectiveness is limited when the observations in a sequence are irregularly sampled, i.e., when they arrive at irregular time intervals. To address this, continuous-time variants of RNNs were introduced based on neural ordinary differential equations (NODE). They learn a better representation of the data by continuously transforming the hidden states over time, taking into account the time interval between observations. However, their capability is still limited, as they apply a fixed number of discrete transformations (layers) to each input in the sequence to produce the output observation. We address this limitation by proposing RNNs based on differential equations that model continuous transformations over both depth and time to predict the output for a given input in the sequence. Specifically, we propose continuous depth recurrent neural differential equations (CDR-NDE), which generalize RNN models by continuously evolving the hidden states in both the temporal and depth dimensions. CDR-NDE considers a separate differential equation over each of these dimensions and models the evolution in the temporal and depth directions alternately. We also propose the CDR-NDE-heat model, based on partial differential equations, which treats the computation of hidden states as solving a heat equation over time. We demonstrate the effectiveness of the proposed models by comparing them against state-of-the-art RNN models on real-world sequence labeling problems and data.

1. INTRODUCTION

Deep learning models such as ResNets (He et al., 2016) have brought many advances in real-world computer vision applications (Ren et al., 2017; He et al., 2020; Wang et al., 2019). They achieve good generalization performance by addressing the vanishing gradient problem using skip connections. Recently, it was shown that the transformation of hidden representations in a ResNet block resembles the Euler method (Lu et al., 2018; Haber & Ruthotto, 2017) for solving ordinary differential equations (ODEs) with a constant step size. This observation led to new deep learning architectures based on differential equations, such as neural ODEs (NODE) (Chen et al., 2018). NODE performs a continuous transformation of the hidden representation by treating the ResNet operations as an ODE parameterized by a neural network and solving this ODE using numerical methods such as the Euler method and Dopri5 (Kimura, 2009). NODE automates model selection (depth estimation), is parameter efficient, and is more robust to adversarial attacks than a ResNet with a similar architecture (Hanshu et al., 2019). Recurrent neural networks and their variants, such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014), have been successful and effective in modeling time-series and sequence data. However, RNN models are not effective for irregularly sampled time-series data (Rubanova et al., 2019b), where the observations are measured at irregular time intervals. ODE-RNN (Rubanova et al., 2019b) modeled hidden state transformations across time using a NODE, where the transformations of hidden representations depend on the time gap between arrivals, leading to a better hidden state representation. This addressed a drawback of standard RNN models, which perform a single transformation of the hidden representation at each observation time irrespective of the time interval.
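The ResNet–Euler correspondence mentioned above can be illustrated with a minimal sketch (the residual branch `f` here is a toy one-layer network, not the architecture from any of the cited papers): a residual update h ← h + f(h) is exactly one explicit Euler step of dh/dt = f(h) with step size 1.

```python
import numpy as np

def f(h, W):
    # Toy residual branch: a single tanh layer standing in for a ResNet block.
    return np.tanh(W @ h)

def resnet_forward(h, W, num_blocks):
    # Stacked residual blocks: h <- h + f(h), once per block.
    for _ in range(num_blocks):
        h = h + f(h, W)
    return h

def euler_ode_solve(h, W, t_end, num_steps):
    # Explicit Euler for dh/dt = f(h) with step size dt = t_end / num_steps.
    dt = t_end / num_steps
    for _ in range(num_steps):
        h = h + dt * f(h, W)
    return h

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))
h0 = rng.standard_normal(4)

# With dt = 1, Euler on [0, num_blocks] reproduces the residual stack exactly.
assert np.allclose(resnet_forward(h0, W, 3),
                   euler_ode_solve(h0, W, t_end=3.0, num_steps=3))
```

NODE generalizes this picture by replacing the fixed unit step with an adaptive solver, so the "number of layers" becomes a property of the solver rather than the architecture.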
Other continuous recurrent models, such as GRU-ODE (De Brouwer et al., 2019) and ODE-LSTM (Lechner & Hasani, 2020), were proposed to learn better representations of irregular time-series data. When applied to sequence data with input-output elements and their times of occurrence, these models obtain the temporal evolution of the hidden states using a neural ODE. At an observation time, the evolved hidden state is combined with the input at that time, and a discrete number of transformations is applied using a feed-forward neural network to obtain the final hidden representation, which is then used to produce the desired output. Though these models evolve continuously over time, they apply a fixed number of discrete transformations over depth. There are several real-world sequence labeling problems where the sequences, or the input elements within a sequence, can be of different complexities. For instance, consider social media post classification, where posts arrive at irregular time intervals: some posts contain only text, while others contain both text and an image. It would be beneficial to have a recurrent neural network model that accounts for the complexity of each input in a sequence by allowing a varying number of transformations for different inputs. In this work, we propose continuous depth recurrent neural differential equation (CDR-NDE) models, which generalize recurrent NODE models to have continuous transformations over depth in addition to time. Continuous depth allows flexibility in modeling sequence data, with different depths for different elements in a sequence as well as for different sequences. Combining this with continuous-time transformations, as in recurrent neural ODEs, allows greater modeling capability for irregularly sampled sequence data.
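The general pattern shared by these continuous-time recurrent models can be sketched as follows (a simplified illustration, not the exact update rules of GRU-ODE or ODE-LSTM; `hidden_dynamics` and the plain tanh cell are hypothetical stand-ins): the hidden state is evolved by an ODE over each inter-arrival gap, then updated discretely with the new input.

```python
import numpy as np

def hidden_dynamics(h, W):
    # ODE right-hand side governing the hidden state between observations.
    return np.tanh(W @ h)

def evolve(h, W, delta_t, steps=10):
    # Euler-integrate the hidden state over the inter-arrival gap delta_t.
    dt = delta_t / steps
    for _ in range(steps):
        h = h + dt * hidden_dynamics(h, W)
    return h

def rnn_cell(h, x, U, V):
    # Discrete update at an observation time (a plain tanh RNN cell here).
    return np.tanh(U @ h + V @ x)

def ode_rnn(xs, ts, W, U, V, hidden_dim):
    # xs: observations; ts: their (irregular) timestamps.
    h = np.zeros(hidden_dim)
    t_prev = ts[0]
    states = []
    for x, t in zip(xs, ts):
        h = evolve(h, W, t - t_prev)   # continuous evolution over the gap
        h = rnn_cell(h, x, U, V)       # discrete update with the new input
        states.append(h)
        t_prev = t
    return states

rng = np.random.default_rng(0)
W, U = 0.1 * rng.standard_normal((2, 4, 4))
V = 0.1 * rng.standard_normal((4, 2))
xs = rng.standard_normal((3, 2))
ts = [0.0, 0.4, 1.7]                   # irregular timestamps
states = ode_rnn(xs, ts, W, U, V, hidden_dim=4)
```

The limitation discussed above is visible in `rnn_cell`: however long or complex the input, the depth-wise update is a single fixed transformation.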
The proposed continuous depth recurrent neural differential equations (CDR-NDE) model the evolution of the hidden states simultaneously in both the temporal and depth dimensions using differential equations. The continuous transformation of hidden states is modeled as a differential equation with two independent variables, one in the temporal and the other in the depth direction. We also model the evolution of the hidden states using a partial differential equation (PDE) based on the 1D heat equation, leading to the CDR-NDE-heat model. The heat equation is a second-order partial differential equation that models the flow of heat across a rod over time. The proposed CDR-NDE-heat model considers the transformation of hidden states across depth and time using a non-homogeneous heat equation. An advantage of this formulation is that it can incorporate information from the future along with the past in sequence labeling tasks. We exploit the structure of the CDR-NDE-heat model and PDE solvers to develop an efficient way to obtain the hidden states, where all the hidden states at a particular depth can be computed simultaneously. We evaluate the performance of the proposed models on real-world datasets such as person activity recognition (Asuncion & Newman, 2007) and the Walker2d kinematic simulation data (Lechner & Hasani, 2020). Through experiments, we show that the proposed continuous depth recurrent neural differential equation models outperform state-of-the-art recurrent neural networks on all these tasks.
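To make the heat-equation intuition concrete, the following is a minimal finite-difference sketch of a non-homogeneous 1D heat equation, under the assumption (ours, for illustration; not the paper's exact scheme or notation) that depth plays the role of the heat equation's time variable and the sequence-time axis plays the role of the rod. Note that the second-difference Laplacian at each point uses its left and right neighbors, which is why such a model can propagate information from both the past and the future of the sequence.

```python
import numpy as np

def heat_step(u, ds, dt, source):
    # One explicit finite-difference step of the non-homogeneous 1D heat
    # equation du/ds = d^2u/dt^2 + source(u), where "s" marches in the
    # depth direction and "t" indexes positions along the sequence-time
    # axis; all positions are updated simultaneously.
    lap = np.zeros_like(u)
    lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dt**2
    return u + ds * (lap + source(u))

# Toy setup: one scalar hidden value per time point, a tanh source term.
u = np.sin(np.linspace(0.0, np.pi, 32))   # states along the time axis
for _ in range(50):                        # march in the depth direction
    u = heat_step(u, ds=1e-4, dt=np.pi / 31, source=np.tanh)
```

The explicit scheme here is chosen only for readability; the step size `ds` must satisfy the usual stability bound ds/dt² ≤ 1/2 for the diffusion term.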

2. RELATED WORK

RNN models such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014) are the primary choice for fitting high-dimensional time-series and sequence data. For irregular time-series data, traditional LSTM and GRU models are less effective, as they do not consider the varying inter-arrival times. To address this, the standard approach is the augmented-LSTM, which augments the input data with the elapsed time. In GRU-D (Che et al., 2018) and RNN-Decay (Rubanova et al., 2019b), the computed hidden state is multiplied by a decay term proportional to the elapsed time. In other variants such as CT-GRU (Mozer et al., 2017), CT-RNN (Funahashi & Nakamura, 1993), ODE-RNN (Rubanova et al., 2019b), GRU-ODE (De Brouwer et al., 2019), ODE-LSTM (Lechner & Hasani, 2020), and Jump-CNF (Chen et al., 2020), the hidden state is computed as a continuous transformation of intermediate hidden states. CT-LSTM (Mei & Eisner, 2017a) combines an LSTM with a continuous-time neural Hawkes process to model continuous transformations of hidden states: two alternative states are computed at each time step, and the final state is an interpolation of these hidden states, where the interpolation depends on the elapsed time. Phased-LSTM (Neil et al., 2016) models irregularly sampled data using an additional time gate; updates to the cell state and hidden state happen only when the time gate is open, which allows updates at irregular intervals and reduces memory decay, since updates occur only during the short periods when the gate is open. ODE-RNN (Rubanova et al., 2019b) used neural ordinary differential equations over time to model the evolution of the hidden
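The decay mechanism used by the GRU-D/RNN-Decay family described above can be sketched in a few lines (an illustrative exponential form with a hypothetical `decay_rate` parameter; the cited models learn their decay from data):

```python
import numpy as np

def decayed_state(h, delta_t, decay_rate):
    # RNN-Decay-style update (sketch): the hidden state is damped by a
    # factor that shrinks exponentially with the elapsed time delta_t.
    return h * np.exp(-decay_rate * delta_t)

h = np.ones(4)
# A longer gap between observations decays the state more strongly.
h_short = decayed_state(h, delta_t=0.1, decay_rate=1.0)
h_long = decayed_state(h, delta_t=2.0, decay_rate=1.0)
assert np.all(h_long < h_short)
```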

