DEEP REINFORCEMENT LEARNING FOR COST-EFFECTIVE MEDICAL DIAGNOSIS

Abstract

Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score instead of the error rate. However, optimizing the non-concave F1 score is not a classic RL problem, which invalidates standard RL methods. To remedy this issue, we develop a reward shaping approach, leveraging properties of the F1 score and the duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on diverse clinical tasks: ferritin abnormality detection, sepsis mortality prediction, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all tasks, SM-DDPO achieves state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. Core code is available on GitHub.

1. INTRODUCTION

In clinical practice, physicians usually order multiple panels of lab tests on patients, and the interpretation of these tests depends on medical knowledge and clinical experience. Each test panel is associated with a certain financial cost. For lab tests within the same panel, automated instruments provide all tests simultaneously, so eliminating a single lab test without eliminating the entire panel may only lead to a small reduction in laboratory cost (Huck & Lewandrowski, 2014). On the other hand, concurrent lab tests have been shown to exhibit significant correlation with each other, which can be utilized to estimate unmeasured test results (Luo et al., 2016). Thus, utilizing the information redundancy among lab tests is a promising way of optimizing which test panel to order when balancing comprehensiveness and cost-effectiveness. The efficacy of lab test panel optimization can be evaluated by assessing the predictive power of the optimized test panels for supporting diagnosis and predicting patient outcomes.

We investigate the use of reinforcement learning (RL) for lab test panel optimization. Our goal is to dynamically prescribe test panels based on available observations, in order to maximize diagnosis/prediction accuracy while keeping testing at a low cost. It is natural to model sequential test panel selection for prediction/classification as a Markov decision process (MDP). However, applying reinforcement learning to this problem is nontrivial for practical reasons. One practical challenge is that clinical diagnostic data are often highly imbalanced, in some cases with <5% positive cases (Khushi et al., 2021; Li et al., 2010; Rahman & Davis, 2013). In supervised learning, this problem is typically addressed by optimizing accuracy metrics suitable for imbalanced data.
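To illustrate the panel structure described above, the following sketch (with hypothetical panel definitions and test values, not taken from the paper) shows how ordering one panel reveals all of its constituent tests at once, which is the granularity at which testing costs are incurred:

```python
import numpy as np

# Hypothetical panels: ordering a panel reveals all tests in it at once.
PANELS = {
    "basic_metabolic": [0, 1, 2, 3],    # e.g., sodium, potassium, creatinine, glucose
    "complete_blood_count": [4, 5, 6],  # e.g., hemoglobin, platelets, WBC
}

def order_panel(x, mask, panel):
    """Reveal every test in `panel`; unordered tests stay masked at 0."""
    mask = mask.copy()
    mask[PANELS[panel]] = 1
    return x * mask, mask  # observed state and updated mask

x = np.array([140.0, 4.1, 1.2, 5.6, 13.5, 250.0, 7.8])  # full (latent) results
state, mask = order_panel(x, np.zeros(7, dtype=int), "basic_metabolic")
# Only indices 0-3 are observed; the remaining entries of `state` stay 0.
```

The panel names and index groupings here are illustrative; any real deployment would map panels to the tests actually billed together.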
The most prominent metric used by clinicians is the F1 score, i.e., the harmonic mean of a prediction model's recall and precision, which balances type I and type II errors in a single metric. However, the F1 score is not a simple weighted error rate, which makes designing the reward function for RL hard. Another challenge is that, for cost-sensitive diagnostics, one hopes to view this as a multi-objective optimization problem and fully characterize the cost-accuracy tradeoff, rather than finding an ad-hoc solution on the tradeoff curve. In this work, we aim to provide a tractable algorithmic framework which provably identifies the set of all Pareto-front policies and trains efficiently. Our main contributions are summarized as follows:
• We formulate cost-sensitive diagnostics as a multi-objective policy optimization problem. The goal is to find all optimal policies on the Pareto front of the cost-accuracy tradeoff.
• To handle severely imbalanced clinical data, we focus on maximizing the F1 score directly. Note that the F1 score is a nonlinear, nonconvex function of the true positive and true negative rates. It cannot be formulated as a simple sum of cumulative rewards, which invalidates standard RL solutions. We leverage the monotonicity and hidden minimax duality of the optimization problem, showing that the Pareto set can be achieved via a reward shaping approach.
• We propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) method for learning the Pareto solution set from clinical data. Its architecture comprises three modules and can be trained efficiently by combining pretraining, policy update, and model-based RL.
• We apply our approach to real-world clinical datasets. Experiments show that our approach exhibits a good accuracy-cost trade-off on all tasks compared with baselines. Across the experiments, our method achieves state-of-the-art accuracy with up to 80% reduction in cost.
Further, SM-DDPO is able to compute the set of optimal policies corresponding to the entire Pareto front. We also demonstrate that SM-DDPO applies not only to the F1 score but also to alternatives such as the AM score.
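To make the metrics concrete, here is a minimal sketch (ours, not from the paper) of the F1 and AM scores computed from confusion-matrix counts. Note that F1 is invariant to scaling all counts by the same factor, whereas a cumulative per-example reward would scale with the data; this nonlinearity is why F1 maximization is not a standard RL objective:

```python
def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def am_score(tp, fn, tn, fp):
    # Arithmetic mean of true positive rate and true negative rate.
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Doubling every count leaves F1 unchanged, unlike a sum of per-example
# rewards, which would double.
assert f1_score(8, 2, 2) == f1_score(16, 4, 4)
```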

2. RELATED WORK

Reinforcement learning (RL) has been applied in multiple clinical care settings, e.g., to learn optimal treatment strategies for sepsis Komorowski et al. (2018) and to customize antiepilepsy drugs for seizure control Guez et al. (2008); see the survey Yu et al. (2021) for a more comprehensive summary. Guidelines on using RL for optimizing treatments in healthcare have also been proposed, around the topics of variable availability, sample size for policy evaluation, and how to ensure that a learned policy works prospectively as intended Gottesman et al. (2019). However, using RL to simultaneously reduce healthcare costs and improve patient outcomes has been underexplored.

Our problem of cost-sensitive dynamic diagnosis/prediction is closely related to feature selection in supervised learning. Static feature selection methods, in which a common subset of features is selected for all inputs, were extensively discussed in Guyon & Elisseeff (2003); Kohavi & John (1997); Bi et al. (2003); Weston et al. (2003; 2000). Dynamic feature selection methods He et al. (2012); Contardo et al. (2016); Karayev et al. (2013) were then proposed to take the differences between inputs into account: different subsets of features are selected for different inputs, e.g., by defining a certain information value of the features Fahy & Yang (2019); Bilgic & Getoor (2007) or by estimating the gain that acquiring a new feature would yield Chai et al. (2004). Reinforcement learning based approaches Ji & Carin (2007); Trapeznikov & Saligrama (2013); Janisch et al. (2019); Yin et al. (2020); Li & Oliva (2021); Nam et al. (2021) have also been proposed to dynamically select features for prediction/classification. We give a more detailed discussion in Appendix A.

3. PARETO-FRONT PROBLEM FORMULATION

3.1 MARKOV DECISION PROCESS (MDP) MODEL

We model the dynamic diagnosis/prediction process for a new patient as an episodic Markov decision process (MDP) M = (S, A, P, R, γ, ξ). As illustrated in Figure 1, the state of a patient is described by s = x ⊙ M, where x ∈ R^d denotes d medical tests of a patient and M ∈ {0, 1}^d is a binary mask

