HIDDEN MARKOV MODELS ARE RECURRENT NEURAL NETWORKS: A DISEASE PROGRESSION MODELING APPLICATION

Abstract

Hidden Markov models (HMMs) are commonly used for disease progression modeling when the true patient health state is not fully known. Since HMMs may have multiple local optima, performance can be improved by incorporating additional patient covariates to inform estimation. To allow for this, we formulate a special case of recurrent neural networks (RNNs), which we name hidden Markov recurrent neural networks (HMRNNs), and prove that each HMRNN has the same likelihood function as a corresponding discrete-observation HMM. The HMRNN can be combined with any other predictive neural networks that take patient covariate information as input. We show that HMRNN parameter estimates are numerically close to those obtained via the Baum-Welch algorithm, validating their theoretical equivalence. We then demonstrate how the HMRNN can be combined with other neural networks to improve parameter estimation, using an Alzheimer's disease dataset. The HMRNN's solution improves disease forecasting performance and offers a novel clinical interpretation compared with a standard HMM.

1. INTRODUCTION

Hidden Markov models (HMMs; Baum & Petrie, 1966) are commonly used for modeling disease progression, because they allow researchers to conceptualize complex (and noisy) clinical measurements as originating from a smaller set of latent health states. Each latent health state is characterized by an emission distribution that specifies the probability of each measurement/observation given that state. This allows HMMs to explicitly account for uncertainty or measurement error, since the system's true state is not fully observable. Because of their intuitive parameter interpretations and flexibility, HMMs have been used to model biomarker changes in HIV patients (Guihenneuc-Jouyaux et al., 2000), Alzheimer's disease progression (Liu et al., 2015), breast cancer screening decisions (Ayer et al., 2012), and patient response to blood anticoagulants (Nemati et al., 2016). Researchers may wish to integrate HMMs with other disease progression models and/or data sources. For instance, Igl et al. (2018) jointly trained parameters for an HMM and a reinforcement learning policy to maximize patient returns. Other researchers have attempted to learn or initialize HMM parameters based on additional sources of patient data (Gupta, 2019; Zhou et al., 2019). Such modifications typically require multiple estimation steps (e.g., Zhou et al., 2019) or changes to parameter interpretation (e.g., Igl et al., 2018). This is because the standard algorithm for fitting HMMs, the Baum-Welch algorithm (Baum & Petrie, 1966), maximizes the likelihood of a data sequence without consideration of additional covariates. We introduce hidden Markov recurrent neural networks (HMRNNs): neural networks that mimic the computation of hidden Markov models while allowing for substantial modularity with other predictive networks.
Unlike past work combining neural networks and HMMs (e.g., Bridle, 1990), HMRNNs are designed to maximize the most commonly used HMM fit criterion: the likelihood of the data given the parameters. In doing so, our primary contributions are as follows: (1) we prove that recurrent neural networks (RNNs) can be formulated to optimize the same likelihood function as HMMs, with parameters that can be interpreted as HMM parameters (section 3); (2) we empirically demonstrate that our model yields statistically similar parameter solutions to those of the Baum-Welch algorithm (section 4.1); (3) we demonstrate our model's utility in a disease progression application, in which combining it with other predictive neural networks improves predictive accuracy and offers unique parameter interpretations not afforded by simple HMMs (section 4.2).

2. RELATED WORK

A small number of studies have attempted to formally model HMMs in a neural network context. Wessels & Omlin (2000) propose using neural networks to approximate Gaussian emission distributions in HMMs; however, their method requires pre-training of the HMM. Similar to our work, Bridle (1990) demonstrates how HMMs can be reduced to recurrent neural networks for speech recognition, though his formulation requires that neurons be computed via products (rather than sums), which are not commonly used in modern neural networks. This model also maximizes the mutual information between observations and hidden states; this criterion is widely used in speech recognition, but less common than likelihood maximization in other domains (e.g., disease progression modeling). Lastly, Bridle (1990) and Wessels & Omlin (2000) present only theoretical justification, with no empirical comparisons with the Baum-Welch algorithm. A limited number of studies have also explored connections between neural networks and Markov models in the healthcare domain. For instance, Nemati et al. (2016) employ a discriminative hidden Markov model to estimate 'hidden states' underlying patients' ICU measurements, though these hidden states are not mathematically equivalent to HMM latent states. Estebanez et al. (2012) compare HMM and neural network effectiveness in training a robotic surgery assistant, while Baucum et al. (2020) propose a generative neural network for modeling ICU patient health based on the mathematical intuition of the HMM. Although these studies showcase the value of pairing neural networks and Markov models in the healthcare domain, they differ from our approach of directly formulating HMMs as neural networks, which maintains the interpretability of HMMs while allowing for joint estimation of the HMM with other predictive models.
In summary, while studies have shown the promise of incorporating elements of HMMs into deep learning tasks, there are no existing methods for optimizing the HMM log-likelihood in a neural network context. While past works have also used gradient descent to learn HMM parameters (e.g., Yildirim et al., 2015), we demonstrate how specifically implementing HMMs as neural networks allows additional data sources (e.g., patient covariates) to steer model estimation toward better-fitting solutions. We thus develop the first neural network formulation of an HMM that maximizes the observed data likelihood, employs widely used neural network operations, and compares favorably to the Baum-Welch algorithm when tested on real-world datasets.

3. METHODS

In this section, we briefly review HMM preliminaries, formally define the HMRNN, and prove that it optimizes the same likelihood function as a corresponding HMM.

3.1. HMM PRELIMINARIES

Formally, an HMM models a system over a given time horizon T, where the system occupies a hidden state x_t ∈ S = {1, 2, . . . , k} at any given time point t ∈ {0, 1, . . . , T}; that is, x_t = i indicates that the system is in the ith state at time t. For any state x_t ∈ S and any time point t ∈ {0, 1, . . . , T}, the system emits an observation according to an emission distribution that is uniquely defined for each state. We consider the case of categorical emission distributions, which are commonly used in healthcare (e.g., Liu et al., 2015; Leon, 2015; Ayer et al., 2012; Stanculescu et al., 2013). These systems emit one of c distinct observations at each time point; that is, for any time t, we observe y_t ∈ O, where O = {1, . . . , c} and |O| = c. Thus, an HMM is uniquely defined by a k-length initial probability vector π, a k × k transition matrix P, and a k × c emission matrix Ψ. Entry i of the vector π is the probability of starting in state i, row i of the matrix P is the state transition probability distribution from state i, and row i of the matrix Ψ is the emission distribution from state i.
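To make these definitions concrete, the likelihood of an observation sequence under such a categorical HMM can be computed with the standard forward recursion, sketched below in NumPy. The parameter values (k = 2 states, c = 3 observations) are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical categorical HMM with k = 2 hidden states and c = 3 observations.
pi = np.array([0.6, 0.4])                  # k-length initial probability vector
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])                 # k x k transition matrix (rows sum to 1)
Psi = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.3, 0.6]])          # k x c emission matrix (rows sum to 1)

def hmm_likelihood(y, pi, P, Psi):
    """Forward-algorithm likelihood P(y_0, ..., y_T) of an observation sequence y."""
    alpha = pi * Psi[:, y[0]]              # joint probability of y_0 and each hidden state
    for obs in y[1:]:
        alpha = (alpha @ P) * Psi[:, obs]  # propagate one step, then weight by emission
    return alpha.sum()                     # marginalize over the final hidden state

print(hmm_likelihood([0, 2, 1], pi, P, Psi))  # ≈ 0.0315
```

This recursion, a matrix-vector product followed by an elementwise reweighting at each step, is exactly the kind of update a recurrent network layer can express, which is the intuition behind the HMRNN equivalence proved in this section.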

