MULTI-TIME ATTENTION NETWORKS FOR IRREGULARLY SAMPLED TIME SERIES

Abstract

Irregular sampling occurs in many time series modeling applications where it presents a significant challenge to standard deep learning models. This work is motivated by the analysis of physiological time series data in electronic health records, which are sparse, irregularly sampled, and multivariate. In this paper, we propose a new deep learning framework for this setting that we call Multi-Time Attention Networks. Multi-Time Attention Networks learn an embedding of continuous time values and use an attention mechanism to produce a fixed-length representation of a time series containing a variable number of observations. We investigate the performance of this framework on interpolation and classification tasks using multiple datasets. Our results show that the proposed approach performs as well as or better than a range of baseline and recently proposed models while offering significantly faster training times than current state-of-the-art methods.¹

1. INTRODUCTION

Irregularly sampled time series occur in application domains including healthcare, climate science, ecology, astronomy, biology and others. It is well understood that irregular sampling poses a significant challenge to machine learning models, which typically assume fully-observed, fixed-size feature representations (Marlin et al., 2012; Yadav et al., 2018). While recurrent neural networks (RNNs) have been widely used to model such data because of their ability to handle variable length sequences, basic RNNs assume regular spacing between observation times as well as alignment of the time points where observations occur for different variables (i.e., fully-observed vectors). In practice, both of these assumptions can fail to hold for real-world sparse and irregularly observed time series. To respond to these challenges, there has been significant progress over the last decade on building and adapting machine learning models that can better capture the structure of irregularly sampled multivariate time series (Li & Marlin, 2015; 2016; Lipton et al., 2016; Futoma et al., 2017; Che et al., 2018; Shukla & Marlin, 2019; Rubanova et al., 2019).

In this work, we introduce a new model for multivariate, sparse and irregularly sampled time series that we refer to as Multi-Time Attention networks or mTANs. mTANs are fundamentally continuous-time, interpolation-based models. Their primary innovations are the inclusion of a learned continuous-time embedding mechanism coupled with a time attention mechanism that replaces the use of a fixed similarity kernel when forming representations from continuous time inputs. This gives mTANs more representational flexibility than previous interpolation-based models (Shukla & Marlin, 2019). Our approach re-represents an irregularly sampled time series at a fixed set of reference points. The proposed time attention mechanism uses reference time points as queries and the observed time points as keys.
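The query-key mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's full architecture: `time_embedding` stands in for the learned continuous-time embedding (one linear dimension plus sinusoidal dimensions with learnable frequencies and phases), and `mtan_interpolate` re-represents a single univariate series at fixed reference points via scaled dot-product attention; the parameter names `W`, `B`, `t_ref` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_embedding(t, W, B):
    """Map scalar times t (shape [n]) to d-dimensional embeddings.

    Dimension 0 is a learned linear function of time; the remaining
    dimensions are sinusoids with learned frequency W[i] and phase B[i].
    """
    z = np.outer(t, W) + B            # [n, d]
    z[:, 1:] = np.sin(z[:, 1:])       # keep dim 0 linear, rest periodic
    return z

def mtan_interpolate(t_obs, x_obs, t_ref, W, B):
    """Re-represent observations (t_obs, x_obs) at reference times t_ref.

    Reference-time embeddings act as queries and observed-time
    embeddings as keys; the observed values are the attention values.
    """
    q = time_embedding(t_ref, W, B)            # [m, d] queries
    k = time_embedding(t_obs, W, B)            # [n, d] keys
    scores = q @ k.T / np.sqrt(q.shape[1])     # scaled dot-product
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over observations
    return attn @ x_obs                        # [m] values at reference points

d = 8
W, B = rng.normal(size=d), rng.normal(size=d)
t_obs = np.array([0.1, 0.4, 0.9])   # irregular observation times
x_obs = np.array([1.0, -0.5, 2.0])  # observed values
t_ref = np.linspace(0.0, 1.0, 5)    # fixed set of reference points
z = mtan_interpolate(t_obs, x_obs, t_ref, W, B)
print(z.shape)  # (5,)
```

Because the attention weights form a convex combination over the observed values, each reference-point representation lies within the range of the observations, regardless of how irregular the sampling is.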
We propose an encoder-decoder framework for end-to-end learning using an mTAN module to interface with multivariate, sparse and irregularly sampled time series inputs. The encoder takes the irregularly sampled time series as input and produces a fixed-length latent representation over a set of reference points, while the decoder uses the latent representations to produce reconstructions conditioned on the set of observed time points. Learning uses established methods for variational autoencoders (Rezende et al., 2014; Kingma & Welling, 2014). The main contributions of the mTAN model framework are: (1) It provides a flexible approach to modeling multivariate, sparse and irregularly sampled time series data (including irregularly sampled time series of partially observed vectors) by leveraging a time attention mechanism to learn temporal similarity from data instead of using fixed kernels. (2) It uses a temporally distributed latent representation to better capture local structure in time series data. (3) It provides interpolation and classification performance that is as good as or better than current state-of-the-art methods, while providing significantly reduced training times.
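The variational learning setup can be sketched in a few lines. This is a schematic Monte-Carlo ELBO step under standard VAE assumptions (diagonal Gaussian posterior over the latent states at the reference points, standard normal prior, unit-variance Gaussian likelihood); the names `elbo_step` and `decode` are illustrative and the toy decoder stands in for the paper's actual decoder network.

```python
import numpy as np

rng = np.random.default_rng(1)

def elbo_step(mu, logvar, decode, x_obs):
    """One Monte-Carlo evaluation of the evidence lower bound.

    mu, logvar: Gaussian posterior parameters over the latent states
    at the reference points (in the full model, encoder outputs).
    decode: maps a latent sample to predictions at the observed times.
    """
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps      # reparameterization trick
    x_hat = decode(z)
    # Gaussian reconstruction log-likelihood (unit variance, up to a constant)
    rec = -0.5 * np.sum((x_obs - x_hat) ** 2)
    # KL divergence from q(z|x) to the standard normal prior, in closed form
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return rec - kl

mu = rng.normal(size=4)       # latent mean at 4 reference points
logvar = np.zeros(4)
x_obs = rng.normal(size=3)    # values at 3 observed time points
decode = lambda z: z[:3]      # toy decoder: read off the first latents
elbo = elbo_step(mu, logvar, decode, x_obs)
```

Note that the reconstruction term is evaluated only at the observed time points, which is how the framework avoids imputing values for times that were never measured.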

2. RELATED WORK

An irregularly sampled time series is a time series with irregular time intervals between observations. In the multivariate setting, there can also be a lack of alignment across different variables within the same multivariate time series. Finally, when gaps between observation times are large, the time series is also considered to be sparse. Such data occur in electronic health records (Marlin et al., 2012; Yadav et al., 2018), climate science (Schulz & Stattegger, 1997), ecology (Clark & Bjørnstad, 2004), biology (Ruf, 1999), and astronomy (Scargle, 1982). It is well understood that such data cause significant issues for standard supervised machine learning models that typically assume fully observed, fixed-size feature representations (Marlin et al., 2012).

A basic approach to dealing with irregular sampling is fixed temporal discretization. For example, Marlin et al. (2012) and Lipton et al. (2016) discretize continuous-time observations into hour-long bins. This has the advantage of simplicity, but requires ad-hoc handling of bins with more than one observation and results in missing data when bins are empty. The alternative to temporal discretization is to construct models with the ability to directly use an irregularly sampled time series as input. Che et al. (2018) present several methods based on gated recurrent unit networks (GRUs; Chung et al., 2014), including an approach that takes as input a sequence consisting of observed values, missing data indicators, and time intervals since the last observation. Pham et al. (2017) proposed to capture time irregularity by modifying the forget gate of an LSTM (Hochreiter & Schmidhuber, 1997), while Neil et al. (2016) introduced a new time gate that regulates access to the hidden and cell state of the LSTM.
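The discretization-based input representation described above can be made concrete. The sketch below, assuming hour-long bins and a single variable, builds the (value, mask, time-delta) triplet used by the GRU-based approaches: bins average multiple observations (one possible ad-hoc choice), empty bins get a mask of zero, and the delta feature records time since the last observed bin. The helper name `to_bins_and_deltas` is illustrative, not from any of the cited works.

```python
import numpy as np

def to_bins_and_deltas(times, values, bin_width=1.0, n_bins=4):
    """Discretize one irregular series into fixed bins and build the
    (value, mask, delta) triplet for GRU-style models."""
    vals = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for t, v in zip(times, values):
        b = min(int(t // bin_width), n_bins - 1)
        vals[b] += v
        counts[b] += 1
    mask = (counts > 0).astype(float)               # 1 if the bin was observed
    vals = np.divide(vals, np.maximum(counts, 1))   # average within each bin
    delta = np.zeros(n_bins)                        # time since last observation
    for i in range(1, n_bins):
        delta[i] = bin_width if mask[i - 1] else delta[i - 1] + bin_width
    return vals, mask, delta

times = [0.2, 0.3, 2.7]    # two observations in bin 0, one in bin 2
values = [1.0, 3.0, 5.0]
v, m, d = to_bins_and_deltas(times, values)
print(v, m, d)  # bin 0 averages to 2.0; bins 1 and 3 are empty
```

The example makes the drawbacks visible: the two observations in bin 0 are collapsed to a single average, and bins 1 and 3 become missing values that the downstream model must handle.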
While these approaches allow the network to handle event-based sequences with irregularly spaced vector-valued observations, they do not support learning directly from vectors that are partially observed, which commonly occurs in the multivariate setting because of lack of alignment of observation times across different variables. Another line of work has looked at using observations from the future as well as from the past for interpolation. Yoon et al. (2019) and Yoon et al. (2018) presented an approach based on the multi-directional RNN (M-RNN) that can leverage observations from the relative past and future of a given time point. Shukla & Marlin (2019) proposed the interpolation-prediction network framework, consisting of several semi-parametric RBF interpolation layers that interpolate multivariate, sparse, and irregularly sampled input time series against a set of reference time points while taking into account all observed data in a time series. Horn et al. (2020) proposed a set function-based approach for classifying time series with irregularly sampled and unaligned observations.

Chen et al. (2018) proposed a variational auto-encoder model (Kingma & Welling, 2014; Rezende et al., 2014) for continuous-time data based on the use of a neural network decoder combined with a latent ordinary differential equation (ODE) model. They model time series data via a latent continuous-time function that is defined via a neural network representation of its gradient field. Building on this, Rubanova et al. (2019) proposed a latent ODE model that uses an ODE-RNN model as the encoder. ODE-RNNs use neural ODEs to model the hidden state dynamics and an RNN to update the hidden state in the presence of a new observation. De Brouwer et al. (2019) proposed GRU-ODE-Bayes, a continuous-time version of the Gated Recurrent Unit (Chung et al., 2014). Instead of the encoder-decoder architecture where the ODE is decoupled from the input processing, GRU-ODE-Bayes provides a tighter integration by interleaving the ODE and the input processing steps.

Several recent approaches have also used attention mechanisms to model irregularly sampled time series (Song et al., 2018; Tan et al., 2020; Zhang et al., 2019) as well as medical concepts (Peng et al., 2019; Cai et al., 2018). Most of these approaches are similar to Vaswani et al. (2017) in that they replace the positional encoding with an encoding of time and model sequences using self-attention.

¹ Implementation available at: https://github.com/reml-lab/mTAN

