PROBABILISTIC IMPUTATION FOR TIME-SERIES CLASSIFICATION WITH MISSING DATA

Abstract

Multivariate time series data available for real-world applications typically contain a significant amount of missing values. A dominant approach to classification with such missing values is to heuristically impute them with specific values (zero, the mean, values of adjacent time steps) or learnable parameters. However, these simple strategies do not take the data generative process into account, and more importantly, do not effectively capture the uncertainty in prediction arising from the multiple possibilities for the missing values. In this paper, we propose a novel probabilistic framework for classification of multivariate time series data with missing values. Our model consists of two parts: a deep generative model for missing value imputation and a classifier. Extending existing deep generative models to better capture the structure of time-series data, our generative model is trained to impute the missing values in multiple plausible ways, effectively modeling the uncertainty of the imputation. The classifier takes the time series data along with the imputed missing values and classifies the signals, and is trained to capture the predictive uncertainty arising from the multiple possible imputations. Importantly, we show that naïvely combining the generative model and the classifier can result in trivial solutions where the generative model does not produce meaningful imputations. To resolve this, we present a novel regularization technique that encourages the model to produce imputation values that actually help classification. Through extensive experiments on real-world time series data with missing values, we demonstrate the effectiveness of our method.

1. INTRODUCTION

Multivariate time-series data are ubiquitous; many real-world applications ranging from healthcare to stock markets and weather forecasting take multivariate time-series data as inputs. Arguably the biggest challenge in dealing with such data is the presence of missing values, due to the fundamental difficulty of faithfully measuring data at all time steps. The degree of missingness is often severe; in some applications, more than 90% of the data are missing for some features. Therefore, developing an algorithm that can accurately and robustly perform predictions with missing data is considered an important problem. In this paper, we focus on the task of classification, where the primary goal is to classify given multivariate time-series data with missing values. Simply imputing the missing values with heuristically chosen values is considered a strong baseline that is often competitive with, or even better than, more sophisticated methods. For instance, one can fill all the missing values with zero, the mean of the data, or values from previous time steps. GRU-D (Che et al., 2018) proposes a more elaborate imputation algorithm where the missing values are filled with a mixture of the data means and values from previous time steps, with the mixing coefficients learned from the data. While these simple imputation-based methods work surprisingly well (Che et al., 2018; Du et al., 2022), they lack a fundamental mechanism for recovering the missing values, especially the underlying generative process of the given time series data. Dealing with missing data is deeply connected to handling the uncertainty originating from the fact that there may be multiple plausible options for filling in the missing values, so it is natural to analyze such data within a probabilistic framework. There is a rich literature on statistical analysis for missing data, where the primary goal is to understand how the observed and missing data are generated.
In the seminal work of Little and Rubin (2002), three assumptions on the missing data generative process were introduced: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). While MCAR or MAR simplify the modeling and thus make inference easier, they may be unrealistic for real-world applications, because they assume that the missing mechanism is independent of the missing values (MAR) or of both the missing and observed values (MCAR). MNAR, the most generic assumption, allows the missing mechanism to depend on both missing and observed values, so a generative model based on the MNAR assumption should explicitly take the missing mechanism into account. Based on this framework, Mattei and Frellsen (2019) presented deep generative models for missing data under the MAR assumption, which was later extended to MNAR by Ipsen et al. (2021). Combining a deep generative model and a classifier, Ipsen et al. (2022) proposed a hybrid model that can classify missing data with probabilistically imputed values generated under the MAR assumption. Still, in our opinion, there is no satisfactory work combining probabilistic generative models for multivariate time-series data with missing values and classification models, such that the classifier considers the uncertainty in filling in the missing values when making predictions. The aforementioned probabilistic frameworks are either not designed for classification (Mattei and Frellsen, 2019; Ipsen et al., 2021) or, more importantly, not tailored for time series data (Ipsen et al., 2022). A naïve extension of Ipsen et al. (2022) to time series is likely to fail; putting the obvious differences between static and time series data aside, the fundamental difficulty of learning generative models for missing values is that there is no explicit learning signal that could encourage the model to generate "meaningful" missing values.
Since we do not have ground truth for the missing values, in principle the generative model can generate arbitrary values (e.g., zeros) while the combined classifier still successfully classifies the time series data; this is a critical problem overlooked in existing works. To this end, we propose a hybrid model combining a deep generative model for multivariate time series data with a classification model. The generative model part is built under the MNAR assumption and, borrowing the structure of GRU-D (Che et al., 2018), is designed to naturally encode the continuity of multivariate time series data in the generative process. The classifier then takes the missing values generated from the generative model and performs classification; unlike algorithms based on heuristic imputations, it considers multiple feasible options for the missing values and computes predictions based on them. To tackle the difficulty of guiding the generative model to generate "meaningful" missing values, we introduce a novel regularization technique that deliberately erases observed values during training. As a consequence, the classifier is forced to base its classification more on the generated missing values, so the generative model is encouraged to produce missing values that are more advantageous for classification. Using various real-world multivariate time series benchmarks with missing values, we demonstrate that our approach outperforms baselines both in terms of classification accuracy and uncertainty estimates.

2.1. SETTINGS AND NOTATIONS

Let x = [x_1, ..., x_d]^⊤ ∈ R^d be a d-dimensional vector, along with the mask vector s = [s_1, ..., s_d]^⊤ ∈ {0, 1}^d, where s_j = 1 if x_j is observed and s_j = 0 otherwise. Given a mask s, we can split x into the observed part x^obs := {x_j | s_j = 1} and the missing part x^mis := {x_j | s_j = 0}. For a collection of data, the i-th instance is denoted as x_i = [x_{i,1}, ..., x_{i,d}], and s_i, x_i^obs, and x_i^mis are defined similarly. For multivariate time-series data, we denote the vector at the t-th time step as x_t = [x_{t,1}, ..., x_{t,d}] ∈ R^d and the corresponding mask as s_t = [s_{t,1}, ..., s_{t,d}]. The t-th time step of the i-th instance of a collection is denoted as x_{t,i}, which is split into x_{t,i}^obs and x_{t,i}^mis according to s_{t,i}. Following Mattei and Frellsen (2019) and Ipsen et al. (2021), we assume that the joint distribution of an input x and a mask s factorizes as p_{θ,ψ}(x, s) = p_θ(x) p_ψ(s|x). The conditional distribution p_ψ(s|x) plays an important role in describing the missing mechanism. Under the MCAR assumption we have p(s|x) = p(s), under MAR we have p_ψ(s|x) = p_ψ(s|x^obs), and under MNAR we have p_ψ(s|x) = p_ψ(s|x^obs, x^mis). The likelihood for the observed data x^obs is thus computed as p_{θ,ψ}(x^obs, s) = ∫ p_{θ,ψ}(x, s) dx^mis.
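To make the notation concrete, the observed/missing split induced by a mask can be sketched in a few lines. This is a toy NumPy example; the array values and the function name are illustrative, not part of the paper.

```python
import numpy as np

def split_by_mask(x, s):
    """Split x into the observed part x^obs and the indices of the missing
    part x^mis, given the binary mask s (1 = observed, 0 = missing)."""
    s = s.astype(bool)
    return x[s], np.where(~s)[0]

x = np.array([0.5, 1.2, -0.3, 2.0])
s = np.array([1, 0, 1, 0])
x_obs, mis_idx = split_by_mask(x, s)
# x_obs -> [0.5, -0.3]; the missing entries x^mis sit at indices 1 and 3
```

The missing *values* themselves are unknown, so only their positions can be recovered from (x^obs, s); this is exactly what the generative model must fill in.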

2.2. MISSING DATA IMPORTANCE-WEIGHTED AUTOENCODER AND ITS EXTENSIONS

In this section, we briefly review the Missing data Importance-Weighted AutoEncoder (MIWAE) (Mattei and Frellsen, 2019), a deep generative model for missing data, and its extensions to MNAR and supervised settings. Similar to the variational autoencoder (VAE) (Kingma and Welling, 2014), MIWAE assumes that a data point x is generated from a latent representation z, but we only observe x^obs with s generated from the missing data model p_ψ(s|x). MIWAE assumes MAR, so we have p_ψ(s|x) = p_ψ(s|x^obs). The log-likelihood for (x^obs, s) is then computed as

log p_{θ,ψ}(x^obs, s) = log ∫∫ p_ψ(s|x^obs) p_θ(x^obs, x^mis|z) p_θ(z) dz dx^mis = log p_ψ(s|x^obs) + log ∫ p_θ(x^obs|z) p_θ(z) dz,   (1)

where the second term equals log p_θ(x^obs). For missing data imputation, p_ψ(s|x^obs) is not necessary, so we choose to maximize only log p_θ(x^obs). The integral is intractable, so we consider the Importance-Weighted AutoEncoder (IWAE) lower bound (Burda et al., 2015),

log p_θ(x^obs) ≥ E_{z_{1:K}} [ log (1/K) Σ_{k=1}^K p_θ(x^obs|z_k) p_θ(z_k) / q_ϕ(z_k|x^obs) ] =: L_MIWAE(θ, ϕ).   (2)

Here, q_ϕ(z_k|x^obs) for k = 1, ..., K are i.i.d. copies of the variational distribution (encoder) q_ϕ(z|x^obs) approximating the true posterior p_θ(z|x^obs), and E_{z_{1:K}} denotes the expectation w.r.t. ∏_{k=1}^K q_ϕ(z_k|x^obs). K is the number of particles; the lower bound becomes tighter as K increases and converges to log p_θ(x^obs) as K → ∞.

Ipsen et al. (2021) presented not-MIWAE, an extension of MIWAE under the MNAR assumption. The log-likelihood for (x^obs, s) under MNAR is

log p_{θ,ψ}(x^obs, s) = log ∫∫ p_ψ(s|x^obs, x^mis) p_θ(x^obs|z) p_θ(x^mis|z) p_θ(z) dz dx^mis,   (3)

where (x^obs, x^mis) are assumed independent given z. The corresponding IWAE lower bound with the variational distribution q_ϕ(x^mis, z|x^obs) = p_θ(x^mis|z) q_ϕ(z|x^obs) is

L_notMIWAE(θ, ψ, ϕ) := E_{z_{1:K}, x^mis_{1:K}} [ log (1/K) Σ_{k=1}^K p_ψ(s|x^obs, x^mis_k) p_θ(x^obs|z_k) p_θ(z_k) / q_ϕ(z_k|x^obs) ],   (4)

where E_{z_{1:K}, x^mis_{1:K}} denotes the expectation w.r.t. ∏_{k=1}^K p_θ(x^mis_k|z_k) q_ϕ(z_k|x^obs). On the other hand, Ipsen et al. (2022) extended MIWAE to a supervised learning setting, where the goal is to learn the joint distribution of an observed input x^obs, a mask s, and the corresponding label y:

log p_{θ,ψ,λ}(y, x^obs, s) = log ∫∫ p_λ(y|x^obs, x^mis) p_ψ(s|x^obs) p_θ(x^obs|z) p_θ(x^mis|z) p_θ(z) dz dx^mis = log p_ψ(s|x^obs) + log p_{θ,λ}(y, x^obs).   (5)

The term p_ψ(s|x^obs) is irrelevant to the prediction of y, so we choose to maximize log p_{θ,λ}(y, x^obs), which again can be lower-bounded by the IWAE bound with the variational distribution q_ϕ(z, x^mis|x^obs) = p_θ(x^mis|z) q_ϕ(z|x^obs):

L_supMIWAE(θ, λ, ϕ) := E_{z_{1:K}, x^mis_{1:K}} [ log (1/K) Σ_{k=1}^K p_λ(y|x^obs, x^mis_k) p_θ(x^obs|z_k) p_θ(z_k) / q_ϕ(z_k|x^obs) ],   (6)

where E_{z_{1:K}, x^mis_{1:K}} denotes the expectation w.r.t. ∏_{k=1}^K p_θ(x^mis_k|z_k) q_ϕ(z_k|x^obs).
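All of these IWAE-style bounds reduce to a log-mean-exp of importance log-weights. The following sketch checks this on a toy one-dimensional Gaussian model where log p_θ(x^obs) is available in closed form; the model and the encoder here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def log_mean_exp(log_w, axis=-1):
    # numerically stable log((1/K) * sum_k exp(log_w[k]))
    m = np.max(log_w, axis=axis, keepdims=True)
    return (m + np.log(np.mean(np.exp(log_w - m), axis=axis, keepdims=True))).squeeze(axis)

def log_normal(x, mu, var):
    # log density of N(x | mu, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

rng = np.random.default_rng(0)
# Toy model: z ~ N(0,1), x|z ~ N(z,1). The exact posterior is N(x/2, 1/2),
# so using it as the encoder makes every importance weight equal p(x),
# and the bound is exact for any K.
x_obs, K = 1.3, 50
z = rng.normal(x_obs / 2, np.sqrt(0.5), size=K)          # z_k ~ q(z | x^obs)
log_w = (log_normal(x_obs, z, 1.0)                        # log p(x^obs | z_k)
         + log_normal(z, 0.0, 1.0)                        # + log p(z_k)
         - log_normal(z, x_obs / 2, 0.5))                 # - log q(z_k | x^obs)
bound = log_mean_exp(log_w)                               # K-sample IWAE bound
exact = log_normal(x_obs, 0.0, 2.0)                       # log p(x^obs) = log N(1.3 | 0, 2)
```

With an inexact encoder the bound would sit strictly below `exact` and tighten as K grows, which is the behavior the text describes.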

2.3. GRU FOR MULTIVARIATE TIME SERIES DATA AND IMPUTATION METHODS

We briefly review the GRU (Cho et al., 2014), as it is used as a building block for our method. Given a multivariate time series (x_t)_{t=1}^T, a GRU takes the vector of one time step at a time and accumulates the information into a hidden state h_t. Specifically, the forward pass at the t-th time step takes x_t and updates the hidden state as follows:

a_t = σ(W_a x_t + U_a h_{t-1} + b_a),  r_t = σ(W_r x_t + U_r h_{t-1} + b_r),
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}) + b),  h_t = (1 - a_t) ⊙ h_{t-1} + a_t ⊙ h̃_t,   (7)

where ⊙ denotes element-wise multiplication. We also review the heuristic imputation methods described in Che et al. (2018), which are used with GRU-based multivariate time-series classifiers and are common baselines.

• GRU-zero: simply puts zero for all missing values. That is, x̂_{t,j} = s_{t,j} x_{t,j}.
• GRU-mean: imputes the missing values as x̂_{t,j} = s_{t,j} x_{t,j} + (1 - s_{t,j}) x̄_j, where x̄_j = Σ_{i=1}^n Σ_{t=1}^T s_{t,i,j} x_{t,i,j} / Σ_{i=1}^n Σ_{t=1}^T s_{t,i,j} is the empirical mean of the observed values of the j-th feature over a given collection of time series ((x_{t,i})_{t=1}^T)_{i=1}^n.
• GRU-forward: imputes the missing values as x̂_{t,j} = s_{t,j} x_{t,j} + (1 - s_{t,j}) x_{t',j}, where t' is the last time the j-th feature was observed before t.
• GRU-simple: along with the imputed vector x̂_t (obtained by either GRU-mean or GRU-forward), concatenates additional information. Che et al. (2018) proposed to concatenate the mask s_t and the time interval δ_t recording the lengths of the intervals between observed values (see Che et al. (2018) for a precise definition). The concatenated vector [x̂_t, s_t, δ_t] is then fed into the GRU.
• GRU-D: introduces learnable decays for the input x_t and the hidden state h_t as follows:

γ_x = exp(-max(W_{γ_x} δ_t + b_{γ_x}, 0)),  γ_h = exp(-max(W_{γ_h} δ_t + b_{γ_h}, 0)).

Given a vector x_t with mask s_t, GRU-D imputes the missing values as

x̂_{t,j} = s_{t,j} x_{t,j} + (1 - s_{t,j})(γ_{x,t} x_{t',j} + (1 - γ_{x,t}) x̄_j).   (8)

That is, a missing value is imputed as a mixture of the last observed value x_{t',j} and the empirical mean x̄_j, with the mixing coefficient given by the learned decay. The hidden state from the previous time step h_{t-1} is decayed as γ_h ⊙ h_{t-1} and passed through the GRU together with the imputed x̂_t.
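The GRU-D imputation (8) can be sketched as follows. This is a minimal NumPy version for a single series; here the decay γ_x is passed in as a precomputed array rather than produced by the learned network.

```python
import numpy as np

def grud_impute(x, s, x_mean, decay):
    """GRU-D style imputation (8): a missing entry is a decay-weighted mixture
    of the last observed value and the empirical mean.
    x, s, decay: (T, d) arrays; x_mean: (d,). `decay` plays the role of
    gamma_x, which GRU-D computes from the time intervals delta_t."""
    T, d = x.shape
    out = np.empty((T, d), dtype=float)
    last = x_mean.astype(float).copy()   # before the first observation, fall back to the mean
    for t in range(T):
        out[t] = s[t] * x[t] + (1 - s[t]) * (decay[t] * last + (1 - decay[t]) * x_mean)
        last = np.where(s[t] == 1, x[t], last)   # forward-fill the last observed value
    return out

x = np.array([[1.0], [0.0], [3.0]])
s = np.array([[1], [0], [1]])
imputed = grud_impute(x, s, x_mean=np.array([2.0]), decay=np.full((3, 1), 0.5))
# imputed -> [[1.0], [1.5], [3.0]]: the missing step mixes last value 1.0 and mean 2.0
```

Setting `decay` to all ones recovers GRU-forward, and all zeros recovers GRU-mean, which is why GRU-D interpolates between the two heuristics.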

3. METHODS

In this section, we describe our method, a probabilistic framework for multivariate time series data with missing values. Our method is an extension of supMIWAE to time series data under the MNAR assumption, but the actual implementation is not merely a naïve composition of existing models. In Section 3.1, we first present supnotMIWAE, an MNAR version of supMIWAE, with encoder and decoder architectures designed for time series data with missing values. In Section 3.2, we show why sup(not)MIWAE for data with missing values may fail, and propose a novel regularization technique to prevent this.

3.1. SUPNOTMIWAE FOR MULTIVARIATE TIME SERIES DATA

Given multivariate time series data x_{1:T} := (x_t)_{t=1}^T with observed part x^obs_{1:T} and missing part x^mis_{1:T}, a missing mask s_{1:T} := (s_t)_{t=1}^T, and a label y, we assume the following state-space model with latent vectors z_{1:T} := (z_t)_{t=1}^T:

p_{θ,ψ,λ}(y, x^obs_{1:T}, s_{1:T}) = ∫∫ p_λ(y|x^obs_{1:T}, x^mis_{1:T}) p_ψ(s_{1:T}|x_{1:T}) p_θ(x^obs_{1:T}|z_{1:T}) p_θ(x^mis_{1:T}|z_{1:T}) p_θ(z_{1:T}) dx^mis_{1:T} dz_{1:T}.   (9)

Below we describe each component in more detail.

Prior p_θ(z_{1:T}). We assume an autoregressive prior for z_{1:T},

p_θ(z_{1:T}) = N(z_1|0, I) ∏_{t=2}^T N(z_t | μ_pr(z_{1:t-1}), diag(σ²_pr(z_{1:t-1}))),   (10)

where (μ_pr(z_{1:t}), σ_pr(z_{1:t}))_{t=1}^{T-1} are computed as

h_t = GRU_pr(z_t, h_{t-1}),  (μ_pr(z_{1:t}), σ_pr(z_{1:t})) = MLP_pr(h_t).   (11)

Here, GRU_pr(z_t, h_{t-1}) is a GRU cell that takes z_t and the hidden state h_{t-1} and updates it to h_t.

Decoders p_θ(x^obs_{1:T}|z_{1:T}) and p_θ(x^mis_{1:T}|z_{1:T}). The decoder for the observed values is defined in an autoregressive fashion,

p_θ(x^obs_{1:T}|z_{1:T}) = ∏_{t=1}^T N(x^obs_t | μ_dec(z_{1:t}), diag(σ²_dec(z_{1:t}))),   (12)

where (μ_dec(z_{1:t}), σ_dec(z_{1:t}))_{t=1}^T are defined as in (11) using a GRU. The decoder for the missing values p_θ(x^mis_{1:T}|z_{1:T}) shares the same model; that is, the observed-value decoder and the missing-value decoder are the same network.

Missing model p_ψ(s_{1:T}|x_{1:T}). The missing model is simply assumed to be a product of independent Bernoulli distributions over time steps and features,

p_ψ(s_{1:T}|x_{1:T}) = ∏_{t=1}^T ∏_{j=1}^d Bern(s_{t,j} | σ_{mis,t,j}(x_{1:T})),   (13)

where σ_mis(x_{1:T}) is computed as σ_mis(x_{1:T}) = MLP_mis(x_{1:T}).   (14)

Classifier p_λ(y|x^obs_{1:T}, x^mis_{1:T}). We simply use a common GRU-based time series classifier. Let h_T be the hidden state of a GRU after consuming (x^obs_{1:T}, x^mis_{1:T}). Then the conditional distribution is defined as

p_λ(y|x^obs_{1:T}, x^mis_{1:T}) = Categorical(y | Softmax(Linear_cls(h_T))).   (15)
During the forward pass, the classifier takes the observed input x^obs_{1:T} and the missing values generated from the decoder p_θ(x^mis_{1:T}|z_{1:T}). We find it beneficial to adopt the idea of GRU-D: instead of directly feeding the generated missing values x^mis_{1:T}, we feed decayed missing values as follows:

x̄_t := (x^obs_t, x̄^mis_t) where x̄^mis_t ∼ p_θ(x^mis_t|z_{1:T}),   (16)
x̃_{t,j} = s_{t,j} x_{t,j} + (1 - s_{t,j})(γ_{cls,t} x_{t',j} + (1 - γ_{cls,t}) x̄_{t,j}),   (17)

where γ_cls = exp(-max(0, W_cls δ_t + b_cls)) is a learnable decay. We find that this stabilizes learning when the generated missing values x^mis_{1:T} are inaccurate, for instance in the early stages of training. Note also the difference between (17) and the original GRU-D imputation (8): in GRU-D, the last observed values are mixed with the empirical feature means, while ours mixes them with the generated values.

Encoder q_ϕ(z_{1:T}|x^obs_{1:T}). Given the generative model defined as above, we introduce a variational distribution for (x^mis_{1:T}, z_{1:T}) to lower-bound the log-likelihood:

q_{θ,ϕ}(x^mis_{1:T}, z_{1:T}|x^obs_{1:T}) = p_θ(x^mis_{1:T}|z_{1:T}) q_ϕ(z_{1:T}|x^obs_{1:T}).   (18)

Here, the encoder q_ϕ(z_{1:T}|x^obs_{1:T}) is defined as an autoregressive model as before,

q_ϕ(z_{1:T}|x^obs_{1:T}) = ∏_{t=1}^T N(z_t | μ_enc(x^obs_{1:t}), diag(σ²_enc(x^obs_{1:t}))).   (19)

Similar to the decoder, we use a GRU to compute (μ_enc(x^obs_{1:t}), σ_enc(x^obs_{1:t})) for t = 1, ..., T. However, since x_t contains many missing values, rather than feeding only the observed values, we find it beneficial to feed an imputed value x̂_t to the encoder; for this imputation, we adopt GRU-D. To summarize, the encoder parameters are computed from GRU-D outputs, with the inputs imputed using learnable decay values.
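The decayed mixing of (16)-(17) differs from GRU-D only in what the last observed value is mixed with. A minimal NumPy sketch, with the decoder samples passed in as a precomputed array of hypothetical values:

```python
import numpy as np

def decayed_generated_impute(x, s, x_gen, decay):
    """Eq. (17)-style imputation: like GRU-D, but the last observed value is
    mixed with the *generated* value x_gen ~ p_theta(x^mis | z_1:T) instead of
    the empirical mean. All arguments are (T, d) arrays; `decay` plays the
    role of gamma_cls (passed in precomputed here)."""
    T, d = x.shape
    out = np.empty((T, d), dtype=float)
    last = x_gen[0].astype(float).copy()  # before the first observation, use the generated value
    for t in range(T):
        out[t] = s[t] * x[t] + (1 - s[t]) * (decay[t] * last + (1 - decay[t]) * x_gen[t])
        last = np.where(s[t] == 1, x[t], last)
    return out

x = np.array([[1.0], [0.0], [0.0]])
s = np.array([[1], [0], [0]])
x_gen = np.array([[0.5], [2.0], [4.0]])   # hypothetical decoder samples
x_tilde = decayed_generated_impute(x, s, x_gen, decay=np.full((3, 1), 0.5))
# x_tilde -> [[1.0], [1.5], [2.5]]
```

When the decoder is still inaccurate (decay near 1), the imputation falls back to forward filling; as the decoder improves, smaller decays let the generated values dominate, which matches the stabilization argument above.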
Objective. Having defined all the ingredients, the IWAE bound for supnotMIWAE is computed as

log p_{λ,θ,ψ}(y, x^obs_{1:T}, s_{1:T}) ≥ L_supnotMIWAE(λ, θ, ψ, ϕ) := E_{z_{1:K,1:T}, x^mis_{1:K,1:T}} [ log (1/K) Σ_{k=1}^K ω_k ],   (20)

where

ω_k := p_λ(y|x^obs_{1:T}, x^mis_{k,1:T}) p_ψ(s_{1:T}|x^obs_{1:T}, x^mis_{k,1:T}) p_θ(x^obs_{1:T}|z_{k,1:T}) p_θ(z_{k,1:T}) / q_ϕ(z_{k,1:T}|x^obs_{1:T}).   (21)

Here, (q_ϕ(z_{k,1:T}|x^obs_{1:T}) p_θ(x^mis_{k,1:T}|z_{k,1:T}))_{k=1}^K are i.i.d. copies of the variational distribution, and E_{z_{1:K,1:T}, x^mis_{1:K,1:T}} denotes the expectation w.r.t. those i.i.d. copies.

3.2. OBSDROPOUT: REGULARIZING SUPNOTMIWAE FOR BETTER IMPUTATION

The problem with (20) is that there is no clear supervision for the missing values x^mis_{1:T}. Obviously, if we had access to the missing values, the conditional probability p_θ(x^mis_{1:T}|z_{1:T}) would guide the model to correctly impute them. Without such ground-truth values, we can only encourage the model to impute the missing values through indirect criteria. In the objective (20), there are two terms that the model hinges on for this matter.

• The missing model p_ψ(s_{1:T}|x^obs_{1:T}, x^mis_{1:T}): this term encourages the model to reconstruct the missing mask s_t from the imputed values x^mis_t, so in principle, the model should impute the missing values in a way that makes them distinguishable from the observed values. However, in general, the distributions of the observed and missing values are not necessarily different, and more importantly, the model can easily cheat this objective. For instance, consider a trivial case where the model imputes all missing values with zero. The conditional probability p_ψ(s_{1:T}|x^obs_{1:T}, x^mis_{1:T}) can still be maximized by setting σ_mis(x_{t,j}) = 0 whenever x_{t,j} = 0 (unless many observed values satisfy x^obs_{t,j} = 0).

• The classifier p_λ(y|x^obs_{1:T}, x^mis_{1:T}): this term expects the model to generate meaningful imputations that are helpful for classification. However, as shown in prior works (Che et al., 2018), the classifier can achieve decent classification accuracy without meaningful imputations; for instance, it can still classify the signals even when all missing values are imputed with zeros. Hence, in its current form, there is no strong incentive for the model to learn non-trivial imputations that would bring a significant accuracy gain over zero imputation.

To summarize, the objective (20) as it stands is unlikely to produce realistic missing values.
To resolve this, we could introduce a missing model p_ψ(s_{1:T}|x^obs_{1:T}, x^mis_{1:T}) much more elaborate than the simple i.i.d. model we are using, but that may require dataset-specific design. Instead, we present a simple regularization technique that effectively enhances the quality of the imputed values. Our idea is simple: when passing the observed inputs x^obs_{1:T} and the imputed missing values x̄^mis_{1:T} (i.e., imputed via (17)) to the classifier, we deliberately drop some portion of the observed inputs. Without dropping, the classifier may rely heavily on the observed inputs for classification; when some of the observed inputs are dropped out during training, the classifier must focus more on the imputed missing values x̄^mis_{1:T}. As a result, the model is encouraged to generate more "useful" missing values that are beneficial for classification. More specifically, let β be a predefined dropout probability. We construct the input x̃_t to the classifier as follows:

m_{t,j} ∼ Bern(1 - β),  s̃_{t,j} := s_{t,j} m_{t,j},   (22)
x̄_t := (x̄^obs_t, x̄^mis_t) where (x̄^obs_t, x̄^mis_t) ∼ p_θ(x^obs_t, x^mis_t|z_{1:T}),
x̃_{t,j} := s̃_{t,j} x_{t,j} + (1 - s̃_{t,j})(γ_{cls,t} x_{t',j} + (1 - γ_{cls,t}) x̄_{t,j}).   (23)

That is, when an observed value x_{t,j} is dropped out, we substitute a generated value with the decay applied as in (17), so that the classifier focuses more on the values generated by the decoder, as intended. We call this idea ObsDropout, since we drop out the observed values during training. The resulting objective corresponds to an augmented model in which the joint distribution is decomposed as

p_{θ,ψ,λ}(y, x^obs_{1:T}, s_{1:T}, m_{1:T}) = p_λ(y|x^obs_{1:T}, x^mis_{1:T}, m_{1:T}) p_β(m_{1:T}) p_ψ(s_{1:T}|x_{1:T}) p_θ(x^obs_{1:T}|z_{1:T}) p_θ(x^mis_{1:T}|z_{1:T}) p_θ(z_{1:T}).
(24)

Consequently, the IWAE objective is slightly modified as follows:

L'_supnotMIWAE(λ, θ, ψ, ϕ) := E_{z_{1:K,1:T}, x^mis_{1:K,1:T}, m_{1:K,1:T}} [ log (1/K) Σ_{k=1}^K ω_k ],   (25)

where

ω_k := p_λ(y|x^obs_{1:T}, x^mis_{k,1:T}, m_{k,1:T}) p_ψ(s_{1:T}|x^obs_{1:T}, x^mis_{k,1:T}) p_θ(x^obs_{1:T}|z_{k,1:T}) p_θ(z_{k,1:T}) / q_ϕ(z_{k,1:T}|x^obs_{1:T}),

and the expectation is over K i.i.d. copies of the variational distribution

q(z_{1:T}, x^mis_{1:T}, m_{1:T}|x^obs_{1:T}) = q_ϕ(z_{1:T}|x^obs_{1:T}) p_θ(x^mis_{1:T}|z_{1:T}) p_β(m_{1:T}), with p_β(m_{1:T}) := ∏_{t=1}^T ∏_{j=1}^d Bern(m_{t,j} | 1 - β).
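The ObsDropout mask s̃ = s ⊙ m with m_{t,j} ∼ Bern(1 − β) can be sketched in a few lines; this is a toy NumPy version, and the shapes and the value of β are illustrative.

```python
import numpy as np

def obsdropout_mask(s, beta, rng):
    """ObsDropout: thin the observation mask by dropping each observed entry
    independently with probability beta, i.e. s~_{t,j} = s_{t,j} * m_{t,j}
    with m_{t,j} ~ Bern(1 - beta). Dropped entries are then imputed with the
    decayed generated values, exactly like the truly missing ones."""
    m = rng.binomial(1, 1.0 - beta, size=s.shape)
    return s * m

rng = np.random.default_rng(0)
s = np.ones((100, 10), dtype=int)            # fully observed toy mask
s_tilde = obsdropout_mask(s, beta=0.3, rng=rng)
keep_rate = s_tilde.mean()                   # close to 1 - beta = 0.7
```

Note that the dropout is applied only during training; at test time the full mask s is used, analogously to standard dropout.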

3.3. PREDICTION

Similar to supMIWAE, we use Self-Normalized Importance Sampling (SNIS) to approximate the predictive distribution for a new input x^obs_{1:T}. With the model trained with ObsDropout, we have

p(y|x^obs_{1:T}) ≈ (1/S) Σ_{s=1}^S Σ_{k=1}^K ω̄^(s)_k p_λ(y|x^obs_{1:T}, (x^mis)^(s)_{k,1:T}, m_{k,1:T}),   (26)

where (z^(s)_{k,1:T}, (x^mis)^(s)_{k,1:T}, m_{k,1:T}) are drawn i.i.d. from q_ϕ(z_{1:T}|x^obs_{1:T}) p_θ(x^mis_{1:T}|z_{1:T}) p_β(m_{1:T}), and

ω_k := p_θ(x^obs_{1:T}|z_{k,1:T}) p_θ(z_{k,1:T}) / q_ϕ(z_{k,1:T}|x^obs_{1:T}),  ω̄_k := ω_k / Σ_{ℓ=1}^K ω_ℓ.   (27)

4. RELATED WORK

Deep generative models such as MIWAE (Mattei and Frellsen, 2019) provide a useful framework for training deep latent variable models (DLVMs) under missingness. However, they are not directly applicable to time series data because they cannot model the temporal dependency within a series. There has been previous work making deep latent variable models suitable for multivariate time series. For example, Fortuin et al. (2020) proposed a VAE architecture that aims to impute multivariate time series data, using a Gaussian process prior to encode the temporal correlation in the latent space.
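The SNIS prediction of Section 3.3 combines per-sample class probabilities with self-normalized importance weights; a minimal sketch follows, where the weights and class probabilities are made-up numbers.

```python
import numpy as np

def snis_predict(log_w, class_probs):
    """Self-normalized importance sampling: average the per-sample class
    probabilities p_lambda(y | x^obs, x^mis_k) with normalized weights
    w_k / sum_l w_l, where log_w[k] = log p(x^obs|z_k) + log p(z_k)
    - log q(z_k|x^obs). log_w: (K,), class_probs: (K, C)."""
    w = np.exp(log_w - np.max(log_w))   # stable exponentiation
    w_bar = w / w.sum()                 # normalized weights omega-bar_k
    return w_bar @ class_probs

log_w = np.array([0.0, np.log(3.0)])    # unnormalized weights 1 and 3
class_probs = np.array([[0.8, 0.2],
                        [0.2, 0.8]])
p = snis_predict(log_w, class_probs)
# p = 0.25*[0.8, 0.2] + 0.75*[0.2, 0.8] = [0.35, 0.65]
```

Because the weights are normalized per batch of particles, the intractable marginal likelihood cancels, which is what makes SNIS usable here.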

5. EXPERIMENTS

In this section, we demonstrate our method on real-world multivariate time series data with missing values. We compare ours to the baselines on three datasets: PhysioNet 2012 (Silva et al., 2012), MIMIC-III (Johnson et al., 2016), and Human Activity Recognition (Anguita et al., 2013). The PhysioNet 2012 and MIMIC-III datasets contain Electronic Health Records of patients from Intensive Care Units (ICUs). The Human Activity Recognition dataset consists of 3D coordinates of sensors mounted on people performing daily activities such as walking and sitting. See Appendix A for the details of the datasets. For all three datasets, we compare classification accuracy and uncertainty quantification performance. For PhysioNet 2012, we also compare the missing value imputation performance of our method to the baselines. As baselines, we consider GRU classifiers with various imputation methods and a few other deep-neural-network-based methods that are considered competitive in the literature. See Appendix A for a detailed description of the baselines. As uncertainty quantification metrics, we compare cross-entropy (CE, equal to the negative log-likelihood), expected calibration error (ECE), and Brier score (BS). In particular, for PhysioNet 2012 and MIMIC-III, we also consider balanced versions of these metrics (marked with a "b" in front of the metric names), since those datasets are highly imbalanced, so the usual uncertainty quantification metrics may be biased. Please refer to Appendix A for a detailed description of the metrics.

5.1. CLASSIFICATION RESULTS

We summarize the classification results in Table 1, Table 2, and Table 3. In general, ours achieves the best performance among the competing methods both in terms of prediction accuracy and uncertainty quantification. We also provide an ablation study for our model to see the effect of 1) using a time-aware architecture (GRU) for the encoder and decoder of supnotMIWAE, and 2) ObsDropout. The results clearly show that both components play important roles in our model. In Appendix B, we provide further results showing the effect of the dropout rate β on performance.

5.2. IMPUTATION RESULTS

We quantitatively check the imputation performance of our model on the PhysioNet 2012 dataset in Table 4, and visually inspect the imputation quality under different model settings in Fig. 2. Although our model is designed for classification, it achieves the lowest MAE and MRE, outperforming the baseline (SAITS) specifically designed for imputation. In particular, the ablation study on the class supervision term p_λ(y|x_{1:T}) and on ObsDropout suggests that the imputation values generated by our model, trained to better classify the signals, are more "realistic". Fig. 2 highlights the effect of using GRU-based encoders and decoders and of ObsDropout: the values imputed with those techniques form smoother trajectories and better capture the uncertainty in the intervals without observed values.
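MAE and MRE for imputation are typically computed only on held-out observed entries; a minimal sketch of such masked metrics follows (the evaluation protocol details here are assumptions, not necessarily the paper's exact setup).

```python
import numpy as np

def masked_mae_mre(x_true, x_imputed, eval_mask):
    """MAE and MRE restricted to the entries flagged by eval_mask (typically
    observed values that were artificially held out for evaluation)."""
    idx = eval_mask.astype(bool)
    diff = np.abs(x_true - x_imputed)[idx]
    mae = diff.mean()
    mre = diff.sum() / np.abs(x_true[idx]).sum()
    return mae, mre

x_true = np.array([[1.0, 2.0], [4.0, 0.5]])
x_imp = np.array([[1.0, 2.5], [3.0, 0.5]])
mask = np.array([[0, 1], [1, 0]])   # evaluate only the held-out entries
mae, mre = masked_mae_mre(x_true, x_imp, mask)
# held-out errors 0.5 and 1.0 -> MAE = 0.75, MRE = 1.5 / 6.0 = 0.25
```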

6. CONCLUSION

In this paper, we presented a novel probabilistic framework for multivariate time series classification with missing data. Under the MNAR assumption, we first developed a deep generative model suitable for generating missing values in multivariate time series data. We then identified an important drawback of naïvely combining deep generative models with classifiers and proposed a novel regularization technique, ObsDropout, to circumvent it. We demonstrated that our method classifies real-world multivariate time series data more accurately and robustly than existing methods. In this paper, we focused on GRU-based architectures for both the generative model and the classifier; an interesting direction for future work would be extending our method to other architectures such as transformers (Vaswani et al., 2017).

A.1 DATASETS

For all datasets, we standardize the numerical covariates so that each feature has zero mean and unit variance.

PhysioNet 2012 and MIMIC-III. Since there is no fixed rule for preprocessing the PhysioNet 2012 and MIMIC-III databases, researchers usually preprocess the raw data on their own, so there are countless possible forms of the preprocessed dataset, which makes it difficult for practitioners to compare experimental results across works. For comparability, we employ the Python package medical-ts-datasets (Horn et al., 2020), which provides a unified data preprocessing pipeline for the PhysioNet 2012 and MIMIC-III datasets. For both datasets, patients who have more than 1000 time steps or no observed time series data were excluded. Also, discretizing the time steps by hour and aggregating the measurements is a frequently used preprocessing for PhysioNet 2012 in previous work (Rubanova et al., 2019), but this package preserves many more of the original time series measurements than hourly aggregation.
We follow the preprocessing of the medical-ts-datasets library.

UCI Human Activity. For comparability, we preprocess this dataset based on Rubanova et al. (2019). However, we modify some parts of the dataset to fit our implementation: we standardize the dataset, whereas Rubanova et al. (2019) did not, and we add a new variable recording the time index of the last observed value of each data point, to distinguish between missingness and meaningless padding.

A.2 DETAILS FOR CLASSIFICATION EXPERIMENTS

For all experiments, we use five different seeds to conduct experiments.

A.2.1 BASELINE METHODS

• GRU-mean: missing values are simply replaced with the empirical mean of each variable.
• GRU-forward: missing entries are filled with the previously observed value.
• GRU-simple: concatenates the mask s_t and the time interval δ_t along with the imputed vector x̂_t; the concatenated vector [x̂_t, s_t, δ_t] is then fed into the GRU.
• GRU-D: missing values are imputed as a weighted mean of the last observed value x_{t',j} and the empirical mean x̄_j, with learnable weights.
• Phased-LSTM: an LSTM variant designed to handle long sequence inputs by introducing a time gate in the cell, preventing memory decay when useful information is absent for a long time.
• Interpolation-Prediction Networks (IP-Nets): instead of directly imputing missing values, this model employs a semi-parametric interpolation network that produces a regularly spaced representation of irregularly sampled time series data, which is then fed into a prediction network such as a GRU. Since we conduct an online prediction task on the Human Activity dataset, we do not consider IP-Nets as a baseline there, because the model uses future information when interpolating.

A.2.2 TRAINING DETAILS

To conduct the experiments fairly, we keep the number of parameters of every model similar, or at least set our model to have a relatively small number of parameters compared to the baselines. We use the Adam optimizer with learning rate 0.0001 and batch size 128 for all models on the MIMIC-III in-hospital mortality prediction task. For the PhysioNet 2012 experiments, we use Adam with learning rate 0.001 and batch size 128, and likewise for the online prediction task. We employ early stopping for all classification experiments. For the mortality prediction tasks, we set the early stopping patience to 10 epochs and use the area under the precision-recall curve (AUPRC) on the validation data as the stopping criterion. For the Human Activity prediction task, we use validation accuracy as the stopping criterion with a patience of 20 epochs. Since the label imbalance of PhysioNet 2012 and MIMIC-III is extreme, we oversample the mortality class to train models on balanced batches. See Table 8 for detailed hyperparameter settings for the imputation experiment.
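The patience-based early stopping described above can be sketched as a small utility (a generic implementation; the metric values in the usage example are illustrative):

```python
class EarlyStopping:
    """Stop when the validation metric (e.g. AUPRC or accuracy) has not
    improved for `patience` consecutive epochs."""
    def __init__(self, patience=10, mode="max"):
        self.patience, self.mode = patience, mode
        self.best, self.bad_epochs = None, 0

    def step(self, metric):
        improved = (self.best is None
                    or (self.mode == "max" and metric > self.best)
                    or (self.mode == "min" and metric < self.best))
        if improved:
            self.best, self.bad_epochs = metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopping(patience=2, mode="max")
history = [0.50, 0.55, 0.54, 0.53, 0.52]   # validation AUPRC stalls after epoch 2
stops = [stopper.step(m) for m in history]
# stops -> [False, False, False, True, True]
```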

B ADDITIONAL EXPERIMENTS

We conduct numerous ablation experiments to analyze the effect of ObsDropout. On the PhysioNet 2012 dataset, our model with ObsDropout outperforms the alternatives at all dropout rates. On the MIMIC-III dataset, the technique works well at reasonable rates. Although there is room for further analysis of the effect of ObsDropout on predictive performance, these results suffice to show that the technique is generally effective.



https://github.com/ExpectationMax/medical_ts_datasets



Figure 1: An overview of our model with obsdropout.

Researchers have developed deep neural network architectures customized to the multivariate time-series classification task, several of which show competitive empirical performance. Che et al. (2018) modified the GRU architecture to perform supervised learning with sparse covariates by introducing a learnable temporal decay mechanism for the input and hidden state of the GRU. This mechanism has been adopted in subsequent work; for example, Cao et al. (2018) employed temporal decay in the hidden states of their bidirectional-RNN-based model to capture the missing patterns of irregularly sampled time series. Shukla and Marlin (2019) presented a hybrid architecture of an interpolation network and a classifier: the interpolation network returns a fully observed, regularly sampled representation of the original time-series data, and taking this representation as input, even a common deep neural network achieves good predictive performance.
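The temporal decay mechanism of Che et al. (2018) can be written compactly: γ_t = exp(−max(0, w·δ_t + b)), where δ_t is the time gap since the last observation. A minimal numpy sketch (function names ours; w and b would be learned in the actual model):

```python
import numpy as np

def temporal_decay(delta, w, b):
    """GRU-D temporal decay: gamma_t = exp(-max(0, w * delta_t + b)).

    delta: (T, d) time gaps since each variable was last observed.
    w, b: per-variable decay parameters of shape (d,) (learnable in GRU-D).
    Returns decay weights in (0, 1] that shrink toward 0 as the gap grows.
    """
    return np.exp(-np.maximum(0.0, delta * w + b))

def decayed_input(x_last, x_mean, delta, w, b):
    """Decay the last observation toward the empirical mean, as GRU-D does
    for its input imputation: gamma * x_last + (1 - gamma) * x_mean."""
    gamma = temporal_decay(delta, w, b)
    return gamma * x_last + (1.0 - gamma) * x_mean
```

Right after an observation (δ = 0) the model trusts the last value; as the gap grows, the imputed input smoothly reverts to the variable mean.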

Classification performance of baseline methods and ours on Human Activity Recognition dataset.

             ±0.005         0.163 ±0.012   0.019 ±0.006   0.046 ±0.002
GRU-Simple   0.767 ±0.008   0.161 ±0.003   0.015 ±0.003   0.047 ±0.001
GRU-Forward  0.798 ±0.007   0.152 ±0.004   0.020 ±0.003   0.043 ±0.001
GRU-D        0.789 ±0.004   0.150 ±0.005   0.018 ±0.004   0.044 ±0.001
Ours         0.798 ±0.004   0.141 ±0.003   0.005 ±0.001   0.042 ±0.001

Figure 2: Plots of μ_dec(z_{1:t}) and σ²_dec(z_{1:t}). (Left) Our model with an MLP encoder and decoder. (Right) Our model trained with obsdropout at rate 0.5.

and the corresponding mask as s_t = [s_{t,1}, ..., s_{t,d}]. The t-th time step of the i-th instance of a collection is denoted x_{t,i}, which is split into x_{t,i}^obs and x_{t,i}^mis according to s_{t,i}. Following Mattei and Frellsen (2019); Ipsen et al. (

Classification performances of baseline methods and ours on PhysioNet 2012 dataset.

Classification performances of baseline methods and ours on MIMIC-III dataset.

                ±0.008        0.857 ±0.003  0.472 ±0.007  0.492 ±0.016  0.156 ±0.002  0.462 ±0.023  0.221 ±0.018  0.154 ±0.009
w/ MLP enc/dec  0.509 ±0.007  0.857 ±0.003  0.471 ±0.005  0.485 ±0.008  0.155 ±0.002  0.450 ±0.011  0.217 ±0.008  0.149 ±0.004

Imputation performance on PhysioNet 2012 dataset.

Ours             0.391 ±0.002   0.564 ±0.002
w/o supervision  0.400 ±0.002   0.573 ±0.009
w/o obsdropout   0.397 ±0.006   0.573 ±0.009

Statistics of each dataset.

PhysioNet 2012. This dataset contains approximately 12,000 electronic health records of adult patients who were admitted to an intensive care unit (ICU). Each record contains up to 37 time-series variables, including vital signs such as heart rate and temperature. All variables are measured during the first 48 hours of each patient's ICU admission, and the sampling rate varies among variables. After preprocessing, we have 37 features and 11,971 data points. On this dataset, we conduct a missing-data imputation task and a mortality prediction task, which aims to predict the in-hospital mortality of ICU patients using information collected during the first 48 hours in the ICU.

MIMIC-III. The MIMIC-III dataset is a freely accessible and widely used database of de-identified electronic health records of patients who stayed in the ICU of the Beth Israel Deaconess Medical Center from 2001 to 2012. It originally consists of approximately 57,000 records of ICU patients, including variables such as medications, in-hospital mortality, and vital signs. Harutyunyan et al. (2019) defined a variety of benchmark tasks based on a subset of this database; among them, we conduct the binary in-hospital mortality prediction task. After preprocessing, our dataset contains 16 features and 21,107 data points. For this dataset, we conduct a mortality prediction task identical to the PhysioNet 2012 classification task.

UCI Localization Data for Person Activity (UCI Human Activity). This dataset includes records of five people performing usual activities such as walking or sitting. Each person wears sensors on the right ankle, left ankle, belt, and chest; during the activities, the sensors record their positions as three-dimensional coordinates at very short intervals. Each person's activity at a given time point is classified into one of 11 classes and recorded along with the sensor positions. After preprocessing, we have a total of 6,554 time series with 12 features (3-dimensional coordinates of 4 devices). Using this preprocessed data, we conduct an online-prediction task, whose objective is to classify each individual's activity per time point based on the sensor positions.

The number of parameters of baseline models and our model on each dataset.

Effect of obsdropout rate for ours on PhysioNet 2012 dataset.

     ±0.011        0.859 ±0.006  0.488 ±0.032  0.472 ±0.050  0.157 ±0.007  0.445 ±0.057  0.184 ±0.057  0.149 ±0.021
0.1  0.545 ±0.014  0.862 ±0.007  0.473 ±0.023  0.474 ±0.023  0.153 ±0.006  0.437 ±0.023  0.184 ±0.029  0.146 ±0.008
0.2  0.546 ±0.005  0.865 ±0.004  0.480 ±0.025  0.457 ±0.033  0.154 ±0.006  0.416 ±0.037  0.161 ±0.034  0.138 ±0.014
0.3  0.554 ±0.008  0.868 ±0.001  0.464 ±0.006  0.477 ±0.039  0.149 ±0.002  0.434 ±0.056  0.181 ±0.040  0.145 ±0.021
0.4  0.558 ±0.005  0.868 ±0.006  0.470 ±0.014  0.452 ±0.026  0.152 ±0.004  0.401 ±0.035  0.155 ±0.027  0.132 ±0.013
0.5  0.561 ±0.003  0.871 ±0.004  0.462 ±0.016  0.456 ±0.021  0.149 ±0.005  0.400 ±0.025  0.158 ±0.020  0.132 ±0.009
0.6  0.556 ±0.005  0.869 ±0.002  0.458 ±0.008  0.474 ±0.021  0.149 ±0.002  0.425 ±0.027  0.179 ±0.018  0.141 ±0.011
0.7  0.554 ±0.014  0.867 ±0.003  0.462 ±0.013  0.471 ±0.026  0.151 ±0.004  0.425 ±0.029  0.179 ±0.025  0.140 ±0.011
0.8  0.558 ±0.010  0.864 ±0.009  0.472 ±0.031  0.489 ±0.033  0.153 ±0.008  0.455 ±0.032  0.194 ±0.033  0.152 ±0.012
0.9  0.546 ±0.010  0.859 ±0.006  0.469 ±0.015  0.486 ±0.016  0.154 ±0.003  0.453 ±0.023  0.198 ±0.020  0.152 ±0.009

Effect of obsdropout rate for ours on MIMIC-III dataset.

     ±0.008        0.857 ±0.003  0.472 ±0.007  0.492 ±0.016  0.156 ±0.002  0.462 ±0.023  0.221 ±0.018  0.154 ±0.009
0.1  0.519 ±0.007  0.858 ±0.003  0.470 ±0.005  0.490 ±0.007  0.155 ±0.002  0.457 ±0.013  0.217 ±0.010  0.152 ±0.005
0.2  0.519 ±0.006  0.859 ±0.003  0.468 ±0.004  0.501 ±0.009  0.155 ±0.001  0.472 ±0.014  0.228 ±0.009  0.158 ±0.005
0.3  0.518 ±0.004  0.859 ±0.002  0.469 ±0.002  0.494 ±0.008  0.155 ±0.001  0.462 ±0.013  0.220 ±0.009  0.154 ±0.005
0.4  0.518 ±0.004  0.858 ±0.003  0.470 ±0.003  0.493 ±0.015  0.155 ±0.001  0.461 ±0.025  0.220 ±0.017  0.153 ±0.009
0.5  0.511 ±0.010  0.855 ±0.003  0.475 ±0.006  0.488 ±0.010  0.157 ±0.002  0.457 ±0.013  0.217 ±0.013  0.152 ±0.005
0.6  0.504 ±0.010  0.851 ±0.003  0.483 ±0.005  0.501 ±0.011  0.160 ±0.002  0.488 ±0.015  0.229 ±0.010  0.163 ±0.006
0.7  0.480 ±0.036  0.840 ±0.017  0.499 ±0.020  0.497 ±0.010  0.166 ±0.007  0.495 ±0.020  0.236 ±0.024  0.163 ±0.006
0.8  0.461 ±0.032  0.831 ±0.015  0.515 ±0.018  0.490 ±0.006  0.171 ±0.006  0.498 ±0.020  0.235 ±0.029  0.163 ±0.006
0.9  0.419 ±0.024  0.810 ±0.010  0.543 ±0.011  0.496 ±0.029  0.181 ±0.004  0.534 ±0.050  0.268 ±0.030  0.174 ±0.019


Reproducibility statement. Please refer to Appendix A for full experimental details including datasets, models, and evaluation metrics.

• GRU-Mean: n units: 256, dropout: 0.0, recurrent dropout: 0.0
• GRU-Simple: n units: 256, dropout: 0.0, recurrent dropout: 0.0
• GRU-Forward: n units: 256, dropout: 0.0, recurrent dropout: 0.0
• GRU-D: n units: 256, dropout: 0.0, recurrent dropout: 0.0

A.2.4 EVALUATION METRICS

For the classification task, we evaluate all models in terms of both predictive accuracy and predictive uncertainty. We use the area under the precision-recall curve (AUPRC), the area under the receiver operating characteristic curve (AUROC), and accuracy (ACC) to evaluate predictive performance. To measure the uncertainty calibration of the models, we use cross entropy (CE), expected calibration error (ECE), and the Brier score (BS). In addition, we also report balanced versions of the uncertainty metrics due to the severe class imbalance of the datasets.

Balanced metrics. Given a supervised dataset D containing inputs x and corresponding labels y, we simply re-weight each uncertainty metric by the class ratio. Here, D_c is the subset of D that only contains the label y = c.

Accuracy metrics. Accuracy metrics are defined using the following terms, where tp, tn, fn, and fp denote true positives, true negatives, false negatives, and false positives, respectively. • AUROC: area under the receiver operating characteristic (sensitivity) curve.
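The class-balanced re-weighting above can be made concrete for cross entropy: average the metric within each subset D_c, then average over classes so that each class contributes equally. A minimal sketch, assuming predictions are given as an (N, C) array of class probabilities (function name ours):

```python
import numpy as np

def balanced_cross_entropy(probs, labels):
    """Cross entropy averaged within each class subset D_c, then averaged
    over classes, so each class contributes equally to the metric."""
    per_class = []
    for c in np.unique(labels):
        mask = labels == c                                # the subset D_c
        ce_c = -np.mean(np.log(probs[mask, c] + 1e-12))   # CE on D_c only
        per_class.append(ce_c)
    return float(np.mean(per_class))
```

On an imbalanced dataset this differs from the plain CE, which would be dominated by the majority class; the same re-weighting applies to the Brier score and ECE.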

A.3 DETAILS FOR IMPUTATION EXPERIMENTS

We perform the imputation experiments on the test dataset of PhysioNet 2012. We randomly delete 10% of the observed data to test the imputation performance of the models, and we measure the performance over five different seeds.
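The protocol above can be sketched as follows: randomly hide 10% of the observed entries, impute, and score only on the hidden entries. This is a hedged sketch assuming numpy arrays and masks with 1 = observed (function names and the MAE/RMSE scoring helper are ours):

```python
import numpy as np

def make_imputation_split(s, rate=0.1, rng=None):
    """Randomly hide `rate` of the observed entries of mask s (1 = observed).
    Returns the reduced observation mask and the held-out evaluation mask."""
    rng = np.random.default_rng(rng)
    obs_idx = np.flatnonzero(s.ravel())
    n_hide = int(round(rate * len(obs_idx)))
    hidden = rng.choice(obs_idx, size=n_hide, replace=False)
    eval_mask = np.zeros(s.size, dtype=bool)
    eval_mask[hidden] = True
    eval_mask = eval_mask.reshape(s.shape)
    return s.astype(bool) & ~eval_mask, eval_mask

def imputation_scores(x_true, x_imp, eval_mask):
    """MAE and RMSE restricted to the artificially hidden entries."""
    err = x_true[eval_mask] - x_imp[eval_mask]
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))
```

Repeating this with five different seeds for `rng` gives the mean ± std values reported in the imputation tables.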

A.3.1 BASELINE METHODS

• Mean: Replace missing values with the global mean.
• Forward: Impute missing values with the previously observed value.
• GRU-D: Missing values are imputed as a weighted mean of the last observed value x_{t',j} and the mean x̄_j, with learnable weights.
• GP-VAE: A VAE-based probabilistic imputation method proposed by Fortuin et al. (2020). This method employs a GP prior to encode temporal correlation in the latent space.
• SAITS: A self-attention-based imputation method proposed by Du et al. (2022).

