HIDDEN MARKOV MODELS ARE RECURRENT NEURAL NETWORKS: A DISEASE PROGRESSION MODELING AP-PLICATION

Abstract

Hidden Markov models (HMMs) are commonly used for disease progression modeling when the true patient health state is not fully known. Since HMMs may have multiple local optima, performance can be improved by incorporating additional patient covariates to inform estimation. To allow for this, we formulate a special case of recurrent neural networks (RNNs), which we name hidden Markov recurrent neural networks (HMRNNs), and prove that each HMRNN has the same likelihood function as a corresponding discrete-observation HMM. The HMRNN can be combined with any other predictive neural networks that take patient covariate information as input. We show that HMRNN parameter estimates are numerically close to those obtained from via the Baum-Welch algorithm, validating their theoretical equivalence. We then demonstrate how the HMRNN can be combined with other neural networks to improve parameter estimation, using an Alzheimer's disease dataset. The HMRNN's solution improves disease forecasting performance and offers a novel clinical interpretation compared with a standard HMM.

1. INTRODUCTION

Hidden Markov models (HMMs; Baum & Petrie, 1966) are commonly used for modeling disease progression, because they allow researchers to conceptualize complex (and noisy) clinical measurements as originating from a smaller set of latent health states. Each latent health state is characterized by an emission distribution that specifies the probabilities of each measurement/observation given that state. This allows HMMs to explicitly account for uncertainty or measurement error, since the system's true state is not fully observable. Because of their intuitive parameter interpretations and flexibility, HMMs have been used to model biomarker changes in HIV patients (Guihenneuc-Jouyaux et al., 2000) , Alzheimer's disease progression (Liu et al., 2015) , breast cancer screening decisions (Ayer et al., 2012) , and patient response to blood anticoagulants (Nemati et al., 2016) . Researchers may wish to integrate HMMs with other disease progression models and/or data sources. For instance, researchers in Igl et al. (2018) jointly trained parameters for an HMM and a reinforcement learning policy to maximize patient returns. Other researchers have attempted to learn or initialize HMM parameters based on additional sources of patient data (Gupta, 2019; Zhou et al., 2019) . Such modifications typically require multiple estimation steps (e.g., Zhou et al., 2019) or changes to parameter interpretation (e.g., Igl et al., 2018) . This is because the standard algorithm for fitting HMMs, the Baum-Welch algorithm (Baum & Petrie, 1966) , maximizes the likelihood of a data sequence without consideration of additional covariates. We introduce Hidden Markov Recurrent Neural Networks (HMRNNs) -neural networks that mimic the computation of hidden Markov models while allowing for substantial modularity with other predictive networks. Unlike past work combining neural networks and HMMs (e.g., Bridle, 1990) , HMRNNs are designed to maximize the most commonly-used HMM fit criterion -the likelihood of the data given the parameters. In doing so, our primary contributions are as follows: (1) We prove how recurrent neural networks (RNNs) can be formulated to optimize the same likelihood function as HMMs, with parameters that can be interpreted as HMM parameters (section 3); (2) We empirically demonstrate that our model yields statistically similar parameter solutions compared with the Baum-Welch algorithm (section 4.1); (3) We demonstrate our model's utility in a disease progression application, in which combining it with other predictive neural networks improves predictive accuracy and offers unique parameter interpretations not afforded by simple HMMs (section 4.2).

2. RELATED WORK

A small number of studies have attempted to formally model HMMs in a neural network context. Wessels & Omlin (2000) proposes using neural networks to approximate Gaussian emission distributions in HMMs; however, their method requires pre-training of the HMM. Similar to our work, Bridle (1990) demonstrates how HMMs can be reduced to recurrent neural networks for speech recognition, though it requires that neurons be computed via products (rather than sums), which are not commonly used in modern neural networks. This model also maximizes the mutual information between observations and hidden states; this is a commonly used criterion in speech recognition, but less common than likelihood maximization in other domains (e.g., disease progression modeling). Lastly, Bridle (1990) and Wessels & Omlin (2000) present only theoretical justification, with no empirical comparisons with the Baum-Welch algorithm. A limited number of studies have also explored connections between neural networks and Markov models in the healthcare domain. For instance, Nemati et al. (2016) employs a discriminative hidden Markov model to estimate 'hidden states' underlying patients' ICU measurements, though these hidden states are not mathematically equivalent to HMM latent states. Estebanez et al. (2012) compares HMM and neural network effectiveness in training a robotic surgery assistant, while Baucum et al. (2020) proposes a generative neural network for modeling ICU patient health based on the mathematical intuition of the HMM. Although these studies showcase the of value of pairing neural networks and Markov models in the healthcare domain, they differ from our approach of directly formulating HMMs as neural networks, which maintains the interpretability of HMMs while allowing for joint estimation of the HMM with other predictive models. In summary, studies have shown the promise of incorporating elements of HMMs into deep learning tasks, there are no existing methods for optimizing HMM log-likelihood in a neural network context. While past works have also used gradient descent to learn HMM parameters (e.g. Yildirim et al., 2015) , we demonstrate how specifically implementing HMMs as neural networks allows additional data sources (e.g., patient covariates) to steer model estimation to better-fitting solutions. We thus develop the first neural network formulation of an HMM that maximizes the observed data likelihood, employs widely-used neural network operations, and compares favorably to the Baum-Welch algorithm when tested on real-world datasets.

3. METHODS

In this section, we briefly review HMM preliminaries, formally define the HMRNN, and prove that it optimizes the same likelihood function as a corresponding HMM.

3.1. HMM PRELIMINARIES

Formally, an HMM models a system over a given time horizon T , where the system occupies a hidden state x t ∈ S = {1, 2, . . . , k} at any given time point t ∈ {0, 1, . . . , T }; that is, x t = i indicates that the system is in the ith state at time t. For any state x t ∈ S and any time point t ∈ {0, 1, . . . , T }, the system emits an observation according to an emission distribution that is uniquely defined for each state. We consider the case of categorical emission distributions, which are commonly used in healthcare (e.g., Liu et al., 2015; Leon, 2015; Ayer et al., 2012; Stanculescu et al., 2013) . These systems emit one of c distinct observations at each time point; that is, for any time t, we observe y t ∈ O, where |O| = c and O = {1, . . . , c}. Thus, an HMM is uniquely defined by a k-length initial probability vector π, k × k transition matrix P , and k × c emission matrix Ψ. Entry i in the vector π is the probability of starting in state i, row i in the matrix P is the state transition probability distribution from state i, and row i of the matrix Ψ is the emission distribution from state i. HMMs are fit via the Baum-Welch algorithm, which identifies the parameters that (locally) maximize the likelihood of the observed data (Baum & Petrie, 1966) . The likelihood of an observation sequence y is a function of an HMM's initial state distribution (π), transition probability matrix (P ), and emission matrix (Ψ) (Jurafsky & Martin, 2009) . Let diag(Ψ i ) be a k × k diagonal matrix with the ith column of Ψ as its entries -i.e., the probabilities of observation i from each state. We then have Pr(y|π, P , Ψ) = π • diag(Ψ y0 ) • ( T i=1 P • diag(Ψ yi )) • 1 k×1 . (1) The likelihood function can also be expressed in terms of α t (i), the probability of being in state i at time t and having observed {y 0 , ..., y t }. We denote α t as the (row) vector of all α t (i) for i ∈ S, with α t = π • diag(Ψ y0 ) • ( t i=1 P • diag(Ψ yi )) (2) for t ∈ {1, ..., T }, with α 0 = π • diag(Ψ y0 ). Note that equation 3.1 also implies that α t = α t-1 • P • diag(Ψ yt ) for t ∈ {1, ...

, T }, and that

Pr(y|π, P , Ψ) = α T • 1 k×1 . (3)

3.2. DEFINITION OF HIDDEN MARKOV RECURRENT NEURAL NETWORKS (HMRNNS)

An HMRNN is a recurrent neural network whose parameters directly correspond to the initial state, transition, and emission probabilities of an HMM. As such, training an HMRNN optimizes the joint log-likelihood of the N T -length observation sequences given these parameters. Definition 3.1. An HMRNN is a recurrent neural network with trainable parameters π (a k-length stochastic vector), P (a k × k stochastic matrix), and Ψ (a k × c stochastic matrix). It is trained on T + 1 input matrices of size N × c, denoted by Y t for t ∈ {0, 1, . . . , T }, where the n-th row of matrix Y t is a one-hot encoded vector of observation y (n) t for sequence n ∈ {1, . . . , N }. The HMRNN consists of an inner block of hidden layers that is looped T + 1 times (for t ∈ {0, 1, . . . , T }), with each loop containing hidden layers h (t) 1 , h (t) 2 , and h (t) 3 , and a c-length input layer h (t) y through which the input matrix Y t enters the model. The HMRNN has a single output unit o (T ) whose value is the joint negative log-likelihood of the N observation sequences under an HMM with parameters π, P , and Ψ; the summed value of o (T ) across all N observation sequences is also the loss which is minimized through any neural network optimizer (e.g., gradient descent). Layers h (t) 1 , h (t) 2 , h (t) 3 , and o (T ) are defined in the following equations. Note that the block matrix in equation ( 5) is a c × (kc) block matrix of c 1 1×k vectors , arranged diagonally, while the block matrix in equation ( 6) is a (kc) × k row-wise concatenation of c k × k identity matrices. h (t) 1 = π , t = 0, h (t-1) 3 P , t > 0. (4) h (t) 2 = ReLu h (t) 1 [diag(Ψ 1 ) . . . diag(Ψ c )] + Y t 1 1×k . . . 0 1×k . . . . . . . . . 0 1×k . . . 1 1×k -1 n×(kc) (5) h (t) 3 = h (t) 2 [I k . . . I k ] (6) o (T ) = -log(h (T ) 3 1 k×1 ). Fig. 1 outlines the structure of the HMRNN. Intuitively, operations within each recurrent block mimic matrix multiplication by diag(Ψ yt ) (i.e., h (t) 3 = h (t) 1 diag(Ψ yt )), while connections between blocks mimic multiplication by P . In each block, layer h (t) 1 contains k units that represent the joint probability of being in each state 1-k and all observations up to time t -note that this is equivalent to the α values in traditional HMM notation. Layer h (t) 2 , expands each unit in h (t) 1 into c units via connections with weights Ψ i,j for i ∈ {1, . . . , k} and j ∈ {1, . . . , c}, resulting in k • c units; this is equivalent to multiplying h (t) 1 (in row-vector form) by a column-wise concatenation of diag(Ψ j ) Each column in Y t is connected to all units in h (t) 2 that correspond to that column's observation, with connection weights set to 1. A bias of -1 and subsequent ReLu activation are then applied to layer h (t) 2 ; this leaves the units that correspond to Y t unchanged, while all other units (i.e., probabilities for non-occurring observations) are made negative by the -1 bias, then forced to zero through the ReLu activation. Thus, layer h (t) 2 identifies the joint probability of being in each state and observing Y t . Layer h (t) 3 then sums across all c units for each of the k states (all of which are zero except for those corresponding to Y t ), yielding k unitsrepresenting the probabilities of being in each state given all previous observations and the observation at time t, i.e., h 1, k c, 1 1, 1 c, k 1 k 1 k Ψ 1 ,1 Ψ k ,1 Ψ 1 ,c Ψ k ,c h 2 (0) h 1 (0) h 3 (0) 1, k c, 1 1, 1 c, k 1 k 1 k Ψ 1 ,1 Ψ k ,1 Ψ 1 ,c Ψ k ,c h 2 (T) h 1 (T) h 3 (T) P 1,1 P k,k P 1, (t) 3 = h (t) 1 diag(Ψ yt ). We then apply a fully-connected layer of weights to transform h , equivalent to matrix multiplication by P , i.e., h (t+1) 1 = h (t) 3 P , forcing the rows of P to sum to 1. Lastly, activations for h (T ) 3 are summed into unit o (T ) and subject to a negative logarithmic activation function, yielding the negative log-likelihood of the data given the model's parameters. Note that the hidden layers may suffer from underflow for long sequences. This can be addressed by normalizing layer h (t) 3 to sum to 1 at each time point, then simply subtracting the logarithm of the normalization term (i.e., the log-sum of the activations) from network's output -log(o (T ) ). The HMRNN is a special case of an RNN, due to its time-dependent layers and shared weights between time points. Note that its use of recurrent blocks (each of which containing three layers) differs from many RNNs that use only a single layer at each time point; this distinction allows the HMRNN to mimic HMM computations of an HMM at each time point.

3.3. PROOF OF HMM/HMRNN EQUIVALENCE

We now formally establish that the HMRNN's output unit, o (T ) , is the negative log-likelihood of an observation sequence under an HMM with parameters π, P , and Ψ. We prove this for the case of N = 1 and drop notational dependence on n (i.e., we write y (1) t as y t ), though extension to N > 1 is trivial since the log-likelihood of multiple independence sequences is simply the sum of their individual log-likelihoods. We first rely on the following lemma. Lemma 3.1. If all units in h (t) 1 (j) are between 0 and 1 (inclusive), then h (t) 3 = h (t) 1 diag(Ψ yt ). Proof. Let h (t) 1 (j) and h (t) 3 (j) represent the jth units of layer h (t) 1 and h (t) 3 , respectively. Showing h (t) 3 = h (t) 1 diag(Ψ yt ) is equivalent to showing that h (t) 3 (j) = Ψ j,yt h (t) 1 (j) for j ∈ {1, .., k}. We simulate systems with state spaces S = 1, 2, ..., k that begin in state 1, using k = 5, 10, or 20 states. These state sizes are consistent with disease progression HMMs, which often involve less than 10 states (Zhou et al., 2019; Jackson et al., 2003; Sukkar et al., 2012; Martino et al., 2020) . We assume that each state 'corresponds' to one observation, implying the same number of states and observations (c = k). The probability of correctly observing a state (P (y t = x t )) is ψ ii , which is the diagonal of Ψ and is the same for all states. We simulate systems with ψ ii = 0.6, 0.75, and 0.9. We test three variants of the transition probability matrix P . Each is defined by their same-state transition probability p ii , which is the same for all states. For all P the probability of transitioning to higher states increases with state membership; this is known as 'increasing failure rate' and is a common property for Markov processes. As p ii decreases, the rows of P stochastically increase, i.e., lower values of p ii imply a greater chance of moving to higher states. We use values of p ii = 0.4, 0.6, and 0.8, for 27 total simulations (k = {5, 10, 20} × Ψ ii = {0.6, 0.75, 0.9} × p ii = {0.4, 0.6, 0.8}). For each of the 27 simulations, we generate 100 trajectories of length T = 60; this time horizon might practically represent one hour of data collected each minute or two months of data collected each day. Initial state probabilities are fixed at 1 for state 1 and 0 otherwise. Transition parameters are initialized based on the observed number of transitions in each dataset, using each observation as a proxy for its corresponding state. Since transition probabilities are initialized assuming no observation error, the emission matrices are correspondingly initialized using ψ ii = 0.95 (with the remaining 0.05 distributed evenly across all other states). In practice, gradient descent on the HMRNN rarely yields parameter values that are exactly zero. To facilitate comparability between the HMRNN and Baum-Welch, we post-process all HMRNN results with one iteration of the Baum-Welch algorithm, which forces low-probability entries to zero. For Baum-Welch and HMRNN, training ceased when all parameters ceased to change by more than 0.001. For each simulation, we compare Baum Welch's and the HMRNN's average Wasserstein distance between the rows of the estimated and ground truth P and Ψ matrices. This serves as a measure of each method's ability to recover the true data-generating parameters. We also compare the Baum-Welch and HMRNN solutions' log-likelihoods using a separate hold-out set of 100 trajectories. Across all simulations, the average Wasserstein distance between the rows of the true and estimated transition matrices was 0.191 for Baum-Welch and 0.178 for HMRNN (paired t-test p-value of 0.483). For the emission matrices, these distances were 0.160 for Baum-Welch and 0.137 for HMRNN (paired t-test p-value of 0.262). This suggests that Baum-Welch and the HMRNN recovered the ground truth parameters with statistically similar degrees of accuracy. This can be seen in Figure 2 , which presents the average estimated values of p ii and ψ ii under each model. Both models' estimated p ii values are, on average, within 0.05 of the ground truth values, while they tended to estimate ψ ii values of around 0.8 regardless of the true ψ ii . Note that, while Baum-Welch was slightly more accurate at estimating p ii and ψ ii , the overall distance between the ground truth and estimated parameters did not significantly differ between Baum-Welch and the HMRNN. For each simulation, we also compute the log-likelihood of a held-out set of 100 sequences under the Baum-Welch and HMRNN parameters, as a measure of model fit. The average holdout loglikelihoods under the ground truth, , respectively (paired t-test p-value for Baum-Welch/HMRNN difference of 0.440). Thus, Baum-Welch and HMRNN yielded similar degrees of model fit on held-out sequence data.

4.2. ALZHEIMER'S DISEASE SYMPTOM PROGRESSION APPLICATION

We also demonstrate how incorporating additional data into an HMRNN can improve parameter fit and offer novel clinical interpretations. We test our HMRNN on clinical data from n = 426 patients with mild cognitive impairment (MCI), collected over the course of three (n = 91), four (n = 106), or five (n = 229) consecutive annual clinical visits (Initiative). Given MCI patients' heightened risk of Alzheimer's, modeling their symptom progression is of considerable clinical interest (Hirao et al., 2005; Hansson et al., 2010; Rabin et al., 2009) . We analyze patients' overall cognitive functioning based on the Mini Mental Status Exam (MMSE; Folstein et al., 1975) . MMSE scores range from 0 to 30, with scores of 27 -30 indicating no cognitive impairment, 24 -26 indicating borderline cognitive impairment, and 17-23 indicating mild cognitive impairment (Chopra et al., 2007; Monroe & Carter, 2012) . Scores below 17 were infrequent (1.2%) and were treated as scores of 17 for analysis. We use a 3-state latent space S = {0, 1, 2}, with x t = 0 representing 'no cognitive impairment,' x t = 1 representing 'borderline cognitive impairment,' and x t = 2 representing 'mild cognitive impairment.' The observation space is O = {0, 1, 2}, using y t = 0 for scores of 27 -30, y t = 1 for scores of 24 -26, and y t = 2 for scores of 17 -23. To showcase the benefits of the HMRNN's flexibility, we augment the HMRNN through two substantive modifications. First, the initial state probabilities in the augmented HMRNN are predicted from patients' gender, age, degree of temporal lobe atrophy (Hua et al., 2008) , and amyloid-beta 42 levels (Aβ42, a relevant Alzheimer's biomarker (Canuet et al., 2015; Blennow, 2004) ), using a single-layer neural network. Second, at each time point, the probability of being in the most impaired state, h (1) t (2), is used to predict concurrent scores on the Clinical Dementia Rating (CDR, Morris, 1993) , a global assessment of dementia severity, allowing another clinical metric to inform estimation. We use a single connection and sigmoid activation to predict patients' probability of receiving a CDR score above 0.5 (corresponding to 'mild dementia'). We compare HMRNN and Baum-Welch parameter solutions' ability to predict patients' final MMSE score categories from their initial score categories, using 10-fold cross-validation. We evaluate performance using weighted log-loss L (Guerrero-Pena et al., 2018; Stelmach & Chlebus, 2020) , i.e., the log-probability placed on each final MMSE score category averaged across score categories. This metric accounts for class imbalance and rewards models' confidence in their predictions, an important component of medical decision support (Bussone et al., 2015; Lim & Dey, 2010) . We also report p, the average probability placed on patients' final MMSE scores (computed from L). We train all models using a relative log-likelihood tolerance of 0.001% (we do not use parameter convergence since the number of parameters differs between models). Converge runtimes for Baum-Welch and the HMRNN are 2.89 seconds and 15.24 seconds, respectively. Model results appear in Table 1 . Note that the HMRNN's weighted log-loss L is significantly lower than Baum-Welch's (paired t-test p-value= 2.396 × 10 -6 ), implying greater predictive performance. This is supported by Figure 3 , which shows p, the average probability placed on patients' final MMSE scores by score category. Note that error bars represent marginal sampling error and do not represent statistical comparisons between Baum-Welch and HMRNN. Interestingly, the HMRNN yields lower transition probabilities and lower MMSE accuracy (i.e., lower diagonal values of Ψ) than Baum-Welch, suggesting that score changes are more likely attributable to testing error as opposed to true state changes.

5. DISCUSSION

We outline a flexible approach for HMM estimation using neural networks. The HMRNN produces statistically similar solutions to the Baum-Welch algorithm in its standard form, yet can be combined with other neural networks to improve predictive accuracy when additional data is available. In our Alzheimer's case study, augmenting an HMRNN with two predictive networks improves the predictive performance of its parameter solution compared with Baum-Welch. Interestingly, the HMRNN yields a clinically distinct parameter interpretation compared with Baum-Welch, predicting relatively poor diagnostic accuracy for the 'borderline' and 'mild' cognitive impairment states of the MMSE. This suggests that fewer diagnostic categories might improve the MMSE's utility, which is supported by existing MMSE research (e.g., Monroe & Carter, 2012) , and suggests the HMRNN might also be used to improve the interpretability of HMM parameter solutions. In addition to demonstrating the HMRNN's utility in a practical setting, we also make a theoretical contribution by formulating discrete-observation HMMs as a special case of RNNs and by proving coincidence of their likelihood functions. Unlike past approaches, our formulation relies only on matrix multiplication and nonlinear activations, and is designed for generalized use by optimizing for maximum likelihood, which is widely used for HMM training. Future work may formally assess the time complexity of the HMRNN formulation. The HMRNN converges slower than Baum-Welch for the Alzheimer's case study. Yet packages for training neural networks are designed to efficiently handle large datasets, and therefore the HMRNN may converge faster than Baum-Welch for problems with larger sample sizes. Yet since sequence lengths in healthcare are often considerably shorter than in other domains that employ HMMs (e.g., speech analysis), runtimes will likely be reasonable for many healthcare datasets. We also limited our case study to disease progression; future work might explore the HMRNN in other healthcare domains.



Figure 1: Structure of the hidden Markov recurrent neural network (HMRNN). Solid lines indicate learned weights that correspond to HMM parameters; dotted lines indicate weights fixed to 1. The inner block initializes with the initial state probabilities then mimics multiplication by diag(Ψ yt ); connections between blocks mimic multiplication by P .

Figure 2: Estimated p ii (left) and ψ ii (right) under Baum-Welch and HMRNN, shown by ground truth parameter value. Results for each column are averaged across 9 simulations. Dashed lines indicate ground truth p ii (left) and ψ ii (right) values, and error bars indicate 95% confidence intervals (but do not represent tests for significant differences). In line with Theorem 3.1, Baum-Welch and the HMRNN produce near-identical parameter solutions according to the Wasserstein distance metric.

Figure 3: Average probability placed on final MMSE scores, by score category. Recall that the HMRNN's average performance significantly outperforms Baum-Welch (paired t-test p-value= 2.396 × 10 -6 ). As see in the Figure, this effect is consistent across score categories. Error bars indicate 95% confidence intervals, and do not represent tests for significant differences.

Results from Alzheimer's disease case study.

annex

To show this, recall that h (t) 2 contains k×c units, which we index with a tuple (l, m) for l ∈ {1, . . . , c} and m ∈ {1, . . . , k}. The connection matrix between h (t) 1 and h (t) 2 is [ diag(Ψ 1 ) . . . diag(Ψ c )]. Thus, the connection between units h (t) 1 (j) and h (t) 2 (l, m) is Ψ j,l when j = m, and equals 0 otherwise. Also recall that matrix Y t enters the model through a c-length input layer h (t) y , where the jth unit is 1 when y t = j, and equals 0 otherwise. This layer is connected to h (t) 2 by a c × (kc) diagonal block matrix of c (1 × k) row vectors of ones. Thus, the connection between the jth unit of this input layer and unit (l, m) of h (t) 2 is 1 when j = l, and equals 0 otherwise. Lastly, a bias of -1 is added to all units in h (t) 2 , which is then subject to a ReLu activation, resulting in the following expression for each unit in hBecausey (l) is 1 when y t = l, and equals 0 otherwise, then if all units in h1 are between 0 and 1, we haveThe connection matrix between h2 and h3 is a (kc) × k row-wise concatenation of k × k identity matrices; thus, the connection between h (t) 2 (l, m) and h (t) 3 (j) is 1 if j = m, and 0 otherwise. Hence,Thus, h1 diag(Ψ yt ). Theorem 3.1. An HMRNN with parameters π (1 × k stochastic vector), P (k × k stochastic matrix), and Ψ (k ×c stochastic matrix), and with layers defined as in equations (4-7), produces output neuron o (T ) for sequence n ∈ {1, . . . , N } whose value is the negative log-likelihood of a corresponding HMM as defined in equation (1).Proof. Note that, based on Lemma 3.1 and equation 4, h., k} and therefore h (t) 3 = α t . We show the initial condition that hwhich is the logarithm of the HMM likelihood from equation 3.

4. EXPERIMENTS AND RESULTS

In this section, we compare HMRNNs to Baum-Welch through computational experiments with synthetic data and a case study of Alzheimer's disease patients. The experiment in section 4.1 demonstrates that Baum-Welch and and the HMRNN yield similar parameter solutions and predictive accuracy. Section 4.2 demonstrates how augmenting the HMRNN with additional neural networks yields HMM parameters with improved predictive ability on real-world data.

4.1. EMPIRICAL VALIDATION OF HMRNN

We demonstrate that an HMRNN trained via gradient descent yields statistically similar solutions to the Baum-Welch algorithm. We show this with synthetically-generated observations sequences for which the true HMM parameters are known, allowing us to validate Theorem 3.1 empirically.

