DEEP REINFORCEMENT LEARNING FOR COST-EFFECTIVE MEDICAL DIAGNOSIS

Abstract

Dynamic diagnosis is desirable when medical tests are costly or time-consuming. In this work, we use reinforcement learning (RL) to find a dynamic policy that selects lab test panels sequentially based on previous observations, ensuring accurate testing at a low cost. Clinical diagnostic data are often highly imbalanced; therefore, we aim to maximize the F1 score instead of the error rate. However, optimizing the non-concave F1 score is not a classic RL problem, thus invalidating standard RL methods. To remedy this issue, we develop a reward shaping approach, leveraging properties of the F1 score and duality of policy optimization, to provably find the set of all Pareto-optimal policies for budget-constrained F1 score maximization. To handle the combinatorially complex state space, we propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) framework that is compatible with end-to-end training and online learning. SM-DDPO is tested on diverse clinical tasks: ferritin abnormality detection, sepsis mortality prediction, and acute kidney injury diagnosis. Experiments with real-world data validate that SM-DDPO trains efficiently and identifies all Pareto-front solutions. Across all tasks, SM-DDPO achieves state-of-the-art diagnosis accuracy (in some cases higher than conventional methods) with up to 85% reduction in testing cost. Core code is available on GitHub.

1. INTRODUCTION

In clinical practice, physicians usually order multiple panels of lab tests for patients, and their interpretation depends on medical knowledge and clinical experience. Each test panel is associated with a certain financial cost. For lab tests within the same panel, automated instruments provide all results simultaneously, and eliminating a single lab test without eliminating the entire panel may only lead to a small reduction in laboratory cost (Huck & Lewandrowski, 2014). On the other hand, concurrent lab tests have been shown to exhibit significant correlation with each other, which can be utilized to estimate unmeasured test results (Luo et al., 2016). Thus, utilizing the information redundancy among lab tests is a promising way of optimizing which test panel to order when balancing comprehensiveness and cost-effectiveness. The efficacy of lab test panel optimization can be evaluated by assessing the predictive power of the optimized test panels in supporting diagnosis and predicting patient outcomes.

We investigate the use of reinforcement learning (RL) for lab test panel optimization. Our goal is to dynamically prescribe test panels based on available observations, in order to maximize diagnosis/prediction accuracy while keeping testing at a low cost. It is natural to model sequential test panel selection for prediction/classification as a Markov decision process (MDP). However, applying RL to this problem is nontrivial for practical reasons. One practical challenge is that clinical diagnostic data are often highly imbalanced, in some cases with <5% positive cases (Khushi et al., 2021; Li et al., 2010; Rahman & Davis, 2013). In supervised learning, this problem is typically addressed by optimizing accuracy metrics suitable for imbalanced data.
The most prominent metric used by clinicians is the F1 score, i.e., the harmonic mean of a prediction model's recall and precision, which balances type I and type II errors in a single metric. However, the F1 score is not a simple weighted error rate, which makes designing the reward function hard for RL. Another challenge is that, for cost-sensitive diagnostics, one hopes to view this as a multi-objective optimization problem and fully characterize the cost-accuracy tradeoff, rather than finding an ad hoc solution on the tradeoff curve. In this work, we aim to provide a tractable algorithmic framework that provably identifies the set of all Pareto-front policies and trains efficiently. Our main contributions are summarized as follows:

• We formulate cost-sensitive diagnostics as a multi-objective policy optimization problem. The goal is to find all optimal policies on the Pareto front of the cost-accuracy tradeoff.

• To handle severely imbalanced clinical data, we focus on maximizing the F1 score directly. Note that the F1 score is a nonlinear, nonconvex function of the true positive and true negative rates. It cannot be formulated as a simple sum of cumulative rewards, thus invalidating standard RL solutions. We leverage monotonicity and hidden minimax duality of the optimization problem, showing that the Pareto set can be found via a reward shaping approach.

• We propose a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) method for learning the Pareto solution set from clinical data. Its architecture comprises three modules and can be trained efficiently by combining pretraining, policy update, and model-based RL.

• We apply our approach to real-world clinical datasets. Experiments show that our approach exhibits a good accuracy-cost trade-off on all tasks compared with baselines. Across the experiments, our method achieves state-of-the-art accuracy with up to 80% reduction in cost.
Further, SM-DDPO is able to compute the set of optimal policies corresponding to the entire Pareto front. We also demonstrate that SM-DDPO applies not only to the F1 score but also to alternatives such as the AM score.

2. RELATED WORK

Reinforcement learning (RL) has been applied in multiple clinical care settings, e.g., to learn optimal treatment strategies for sepsis (Komorowski et al., 2018) and to customize antiepilepsy drugs for seizure control (Guez et al., 2008); see the survey by Yu et al. (2021) for a more comprehensive summary. Guidelines on using RL for optimizing treatments in healthcare have also been proposed, around the topics of variable availability, sample size for policy evaluation, and how to ensure a learned policy works prospectively as intended (Gottesman et al., 2019). However, using RL to simultaneously reduce healthcare cost and improve patient outcomes has been underexplored. Our problem of cost-sensitive dynamic diagnosis/prediction is closely related to feature selection in supervised learning. Static feature selection methods, where a common subset of features is selected for all inputs, were extensively discussed in Guyon & Elisseeff (2003); Kohavi & John (1997); Bi et al. (2003); Weston et al. (2003; 2000).

3.2. MULTI-OBJECTIVE POLICY OPTIMIZATION FORMULATION

Let π : S → A be the overall policy, a map from the set of states to the set of actions. We optimize π towards two objectives:

• Maximizing prediction accuracy. Due to severely imbalanced data, we choose to maximize the F1 score, denoted by F1(π), as a function of the policy π. The F1 score measures the performance of the diagnosis by accounting for both type I and type II errors, and is defined as

F1(π) = TP(π) / (TP(π) + (1/2)(FP(π) + FN(π))) = 2·TP(π) / (1 + TP(π) − TN(π)),

where TP(π), TN(π), FP(π), FN(π) are the normalized true positive, true negative, false positive and false negative rates, which sum to 1. Remark that TP(π), TN(π), FP(π), FN(π) can each be expressed as a sum of rewards/costs over the MDP's state trajectories. However, F1(π) is nonlinear with respect to the MDP's state-action occupancy measure, so it cannot be expressed as any cumulative sum of rewards.

• Lowering cost. Define the testing cost as Cost(π) = E_π[Σ_{t≥0} Σ_{k∈[D]} c(k) · 1{a_t = k}], where E_π denotes expectation under policy π and c(k) is the cost of panel k.

In this work, we hope to solve for cost-sensitive policies under all possible testing budgets. In other words, we aim to find the cost-sensitive Pareto front, defined as follows.

Definition 3.1 (Cost-F1 Pareto Front of Multi-Objective Policy Optimization). The Pareto front Π* for cost-sensitive dynamic diagnosis/prediction is the set of policies

Π* = ∪_{B>0} argmax_π {F1(π) subject to Cost(π) ≤ B}.   (1)

Finding Π* requires novel solutions beyond standard RL methods. The challenges are twofold: (1) Even in the single-objective case, F1(π) is a nonlinear, non-concave function of TP(π) and TN(π). Although both TP(π) and TN(π) can be formulated as expected sums of rewards in the MDP, the F1 score is not a simple sum of rewards, so standard RL methods do not apply to maximizing such a function. (2) We care about finding the set of all Pareto-optimal policies when there are two conflicting objectives, rather than an ad hoc point on the trade-off curve.
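To see concretely why F1 cannot be a cumulative sum of rewards, the identity F1 = 2·TP / (1 + TP − TN) can be checked numerically: averaging two policies' (TP, TN) rates does not average their F1 scores. A minimal sketch:

```python
def f1_from_rates(tp, tn):
    """F1 score from normalized true-positive and true-negative rates.

    Uses the identity FP + FN = 1 - TP - TN (the four rates sum to 1),
    so F1 = TP / (TP + (FP + FN)/2) = 2*TP / (1 + TP - TN).
    """
    return 2.0 * tp / (1.0 + tp - tn)

# F1 is nonlinear in (TP, TN): a 50/50 mixture of two policies' rates
# does not yield the average of their F1 scores, so F1 cannot be an
# expected cumulative reward (which is linear in the occupancy measure).
a = f1_from_rates(0.10, 0.50)    # hypothetical policy A
b = f1_from_rates(0.40, 0.30)    # hypothetical policy B
mid = f1_from_rates(0.25, 0.40)  # 50/50 mixture of the two rate pairs
assert abs(mid - (a + b) / 2.0) > 1e-6
```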

4. FINDING COST-F 1 PARETO FRONT VIA REWARD SHAPING

The F1 score is a nonlinear and nonconvex function of the true positive, true negative, false positive and false negative rates. It cannot be expressed as a sum of rewards, which invalidates existing RL methods even in the unconstrained case. Despite the non-concavity and nonlinearity of F1, we will leverage the mathematical structure of the Markov decision process and properties of the F1 score to solve problem (1). In this section, we provide an optimization duality analysis and show how to find solutions to problem (1) via reward shaping, i.e., by solving a reshaped cumulative-reward MDP.

Step 1: utilizing monotonicity of the F1 score. To start, note that the F1 score is monotonically increasing in both TP and TN. Suppose that, for a given cost budget B, the optimal policy π*(B) achieves the highest F1 score. Then π*(B) is also optimal for the program

max_π {TN(π) subject to Cost(π) ≤ B, TP(π) ≥ TP(π*(B))},

indicating that the Pareto front is contained in

Π* ⊆ ∪_{B>0, K∈[0,1]} argmax_π {TN(π) subject to Cost(π) ≤ B, TP(π) ≥ K}.   (2)

Step 2: reformulation using occupancy measures. Fix any specific pair (B, K). Consider the equivalent dual linear program form (Zhang et al., 2020) of the policy optimization problem (2). It is stated in terms of the cumulative state-action occupancy measure μ_π, defined as

μ_π(s, a) := E_π[Σ_{t≥0} 1(s_t = s, a_t = a)], ∀s ∈ S, a ∈ A.

Then program (2) is equivalent to

max_μ TN(μ) subject to Cost(μ) ≤ B, TP(μ) ≥ K,
Σ_a μ(s, a) = Σ_{s', a'∈[D]} μ(s', a') P(s|s', a') + ξ(s), ∀s,

where ξ(·) denotes the initial distribution, and TP, TN and Cost are overloaded as (linear) functions of the occupancy measure μ.

Step 3: utilizing hidden minimax duality. The above program can be equivalently reformulated as a max-min program:

max_μ min_{λ≥0, ρ≤0} TN(μ) + λ·(TP(μ) − K) + ρ·(Cost(μ) − B)
subject to Σ_a μ(s, a) = Σ_{s', a'∈[D]} μ(s', a') P(s|s', a') + ξ(s), ∀s.
Note that the max-min objective is linear in λ, ρ and μ. Thus minimax duality holds, and we can swap the min and max to obtain the equivalent form:

min_{λ≥0, ρ≤0} max_μ TN(μ) + λ·(TP(μ) − K) + ρ·(Cost(μ) − B)
subject to Σ_a μ(s, a) = Σ_{s', a'∈[D]} μ(s', a') P(s|s', a') + ξ(s), ∀s.

For any fixed pair (λ, ρ), the inner maximization problem can be rewritten equivalently as an unconstrained policy optimization problem:

max_π TN(π) + λ·TP(π) + ρ·Cost(π).

This is finally a standard cumulative-sum MDP problem with a reshaped reward: reward ρ·c(k) for choosing test panel k, reward λ for a diagnosis action that yields a true positive, and reward 1 for a diagnosis action that yields a true negative. Putting the three steps together, we obtain the following theorem; the full proof can be found in Appendix E.

Theorem 4.1. The Cost-F1 Pareto front defined in (1) is a subset of the collection of all reward-shaped solutions, given by

Π* ⊆ Π̄ := ∪_{λ≥0, ρ≤0} argmax_π {TN(π) + λ·TP(π) + ρ·Cost(π)}.

Thus, to learn the full Pareto front, it suffices to solve a collection of unconstrained policy optimization problems with reshaped cumulative rewards.
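The reshaped reward can be sketched as a per-step reward function; the panel names and prices below are hypothetical placeholders, and sweeping (λ, ρ) traces out a superset of the Cost-F1 Pareto front as stated in Theorem 4.1:

```python
def shaped_reward(action, y, panel_cost, lam, rho):
    """Per-step reshaped reward (sketch of Theorem 4.1's objective).

    lam >= 0 rewards true positives, rho <= 0 penalizes testing cost;
    a true negative earns unit reward. Any other outcome earns 0.
    """
    if action in panel_cost:              # ordering a test panel
        return rho * panel_cost[action]
    if action == "P":                     # terminal diagnosis: positive
        return lam if y == "P" else 0.0   # lam only for a true positive
    if action == "N":                     # terminal diagnosis: negative
        return 1.0 if y == "N" else 0.0   # unit reward for a true negative
    raise ValueError(f"unknown action {action!r}")

costs = {"CBC": 9.0, "CMP": 15.0}  # hypothetical panel prices
assert shaped_reward("CBC", "P", costs, lam=2.0, rho=-0.1) == -0.9
assert shaped_reward("P", "P", costs, lam=2.0, rho=-0.1) == 2.0
assert shaped_reward("N", "P", costs, lam=2.0, rho=-0.1) == 0.0
```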

5. METHOD

In this section, we propose a deep reinforcement learning pipeline for Pareto-optimal dynamic diagnosis policies, using a modular architecture for efficient encoding of partially-observed patient information, policy optimization, and reward learning. Our posterior state encoder must resolve exponentially many possible missingness patterns; we therefore pretrain it on unlabeled augmented data, constructed by repeatedly and randomly masking entries to create additional samples.

5.3. END-TO-END TRAINING VIA SEMI-MODEL-BASED POLICY UPDATE

Training the overall policy with a standard RL algorithm alone (such as Q-learning or policy gradient) would suffer from the complex state and action spaces. To ease the heavy training, we design a semi-model-based modular approach that trains the panel selector and classifier concurrently but in different manners:

• The classifier f_ϕ(·) : R^d → R^2, parameterized by ϕ, maps the posterior encoded state Imp_θ(x ⊙ M) to a probability distribution over labels. It is trained by directly minimizing the cross-entropy loss ℓ_c on the collected data. This differs from typical classification in that the data are collected adaptively by RL, rather than sampled from a prefixed source.

• The panel selector, a network module parameterized by ψ, takes as input

s_emb_{θ,ϕ}(s) = s_emb_{θ,ϕ}(x ⊙ M) = (Imp_θ(x ⊙ M), f_ϕ(Imp_θ(x ⊙ M)), M),

and maps it to a probability distribution over actions. We train the panel selector using proximal policy optimization (PPO) (Schulman et al., 2017), which maximizes a clipped surrogate objective regularized by a squared-error loss of the value function and an entropy bonus. We denote this loss function by ℓ_rl and relegate its expanded form to Appendix D.

• The full algorithm updates the panel selector and classifier concurrently; it is given in Algorithm 1 and visualized in Figure 2. We call it "semi-model-based" because it maintains a running estimate of the classifier (which is part of the reward model of the MDP) while making proximal policy updates.
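The clipping at the heart of the PPO objective used for the panel selector can be illustrated per transition as follows (a minimal sketch of the standard clipping rule, not the stable-baselines3 implementation used in this work):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one transition (sketch).

    ratio is pi_new(a|s) / pi_old(a|s); the clip keeps the update
    proximal by bounding how much a single step can exploit the ratio.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# A large ratio is clipped when the advantage is positive...
assert clipped_surrogate(1.5, 2.0) == 1.2 * 2.0
# ...but the min keeps the pessimistic (unclipped) value when negative.
assert clipped_surrogate(1.5, -2.0) == 1.5 * -2.0
```

In practice this term is averaged over the collected batch and combined with the value-function loss and entropy bonus, as detailed in Appendix D.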
Algorithm 1 Semi-Model-Based Deep Diagnosis Policy Optimization (SM-DDPO)
Initialize: Imp_θ, classifier f_{ϕ(0,0)}, panel/prediction selection policy π_{ψ(0,0)}, numbers of loops L, L1, L2, stepsize η
for i = 0, 1, ..., L do    ▷ End-to-end training outer loop
    Construct the RL environment using the state embedding s_emb_{θ,ϕ(i,0)} defined in (3)
    for j = 1, 2, ..., L1 do    ▷ Policy update inner loop
        Run the RL policy in the environment for T timesteps and save the observations in Q
        Update the panel selection policy: ψ(i,j) = argmax_ψ ℓ_rl(ψ; ψ(i,j−1))
    end for
    Set ψ(i+1,0) = ψ(i,L1)
    for j = 1, 2, ..., L2 do    ▷ Classifier update inner loop
        Sample a minibatch B_j from Q
        Update the classifier: ϕ(i,j) = ϕ(i,j−1) − η · ∇_ϕ (1/|B_j|) Σ_{k∈B_j} ℓ_c(ϕ(i,j−1); (x_k, M_k))
    end for
    Set ϕ(i+1,0) = ϕ(i,L2)
end for
Output: Classifier f_{ϕ(L+1,0)}, policy π_{ψ(L+1,0)}

Such a hybrid RL technique of model learning and policy updates has been used for solving complex games; a notable example is DeepMind's MuZero (Schrittwieser et al., 2020). Further, we remark that Algorithm 1 performs end-to-end training and is thus compatible with on-the-fly learning. The algorithm can start with as little as zero knowledge about the prediction task, and it can keep improving on new incoming patients by querying test panels and finetuning the state encoder.
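The nested loop structure of Algorithm 1 can be sketched as follows; all module updates here are hypothetical stand-ins (simple counters) for the actual PPO and cross-entropy gradient updates:

```python
def sm_ddpo_skeleton(L=3, L1=2, L2=2, T=4):
    """Loop skeleton of Algorithm 1: outer end-to-end loop, inner
    policy-update loop, inner classifier-update loop (stand-ins only)."""
    clf, policy, rollouts_per_outer = {"updates": 0}, {"updates": 0}, []
    for i in range(L):                    # end-to-end outer loop
        buffer = []                       # env rebuilt from current classifier
        for _ in range(L1):               # policy (PPO) inner loop
            buffer += [("obs", i)] * T    # rollout T timesteps into Q
            policy["updates"] += 1        # proximal policy update on l_rl
        for _ in range(L2):               # classifier inner loop
            clf["updates"] += 1           # SGD step on cross-entropy l_c
        rollouts_per_outer.append(len(buffer))
    return clf, policy, rollouts_per_outer

clf, policy, log = sm_ddpo_skeleton()
assert policy["updates"] == 3 * 2 and clf["updates"] == 3 * 2
assert log == [8, 8, 8]   # L1 * T transitions collected per outer loop
```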

6. EXPERIMENTS

We test the method on three clinical tasks using real-world datasets; see Table 1 for a summary. We split each dataset into 3 parts: training set, validation set, and test set. The training data is further split into two disjoint sets, one for pretraining the state encoder and the other for end-to-end RL training. The validation set is used for tuning hyperparameters. During RL training, we sample a random patient and a random subset of test results as initial observations at the beginning of each episode, for sufficient exploration. We evaluate the trained RL policy on patients from the test sets, initialized at a state with zero observed test results, and report F1 score and AUROC.

Machine learning models can predict abnormal ferritin levels using concurrent laboratory measurements routinely collected in primary care (Luo et al., 2016; Kurstjens et al., 2022). These predictive models achieve promising results, e.g., around 0.90 AUC using complete blood count and C-reactive protein (Kurstjens et al., 2022), and around 0.91 AUC using common lab tests (Luo et al., 2016). However, both studies required full observation of all selected predictors without taking financial costs into consideration. We applied our proposed models to a ferritin dataset from a tertiary care hospital (approved by the Institutional Review Board), following the steps described in Luo et al. (2016). Our dataset includes 43,472 patients, of whom 8.9% had ferritin levels below the reference range, which should be considered abnormal. We aim to predict abnormal ferritin results using concurrent lab test results and demographic information.

• Comparisons with baseline models using full observation of data. The results are presented in Table 2. Across all three clinical tasks, our proposed model achieves comparable or even state-of-the-art performance, while significantly reducing financial cost.
On the sepsis dataset, SM-DDPO end2end yielded better results (F1 = 0.562, AUROC = 0.845) than the strongest baseline model, LightGBM (F1 = 0.517, AUROC = 0.845), while saving up to 84% in testing cost. On the ferritin dataset, LightGBM (F1 = 0.627, AUROC = 0.948) performed slightly better than our model (F1 = 0.624, AUROC = 0.928), however at 5x the testing cost. On the AKI dataset, SM-DDPO end2end (F1 = 0.495, AUROC = 0.795) achieved results comparable to the best full-observation model, a 3-layer MLP (F1 = 0.494, AUROC = 0.802), while reducing the testing cost from $591 to $90.

• Comparisons with other test selection strategies. Our proposed SM-DDPO end2end, using an RL-inspired dynamic selection strategy, consistently yielded better performance and required less testing cost across all three datasets, compared to models using fixed or random selection strategies. For the fixed test selection strategy, we first tested classification methods using the two most relevant panels: CBC and CMP. These baselines with reduced testing cost still performed much worse than our approach in both F1 score and AUROC. We also tested another fixed selection (FS) baseline, where we always observe the 2 most frequently selected test panels reported by our approach for all patients, while keeping the other modules the same. Our approach outperformed FS on both F1 score and AUROC while having a similar testing cost. The random selection (RS) baseline selected test panels uniformly at random and performed worse. Q-learning for classification with costly features (CWCF) (Janisch et al., 2019) performed poorly on all three clinical datasets. We believe this is because the model uses the same network for selecting tests and learning rewards; for such imbalanced datasets, this may make the training unstable and difficult to optimize.

• Efficiency and accuracy of end-to-end training. The classifier and panel selector of SM-DDPO end2end are both trained from scratch, using Algorithm 1.
As shown in Table 2, this end-to-end training scheme gives accuracy comparable to policies that rely on a heavily pretrained classifier (SM-DDPO pretrain). Given a brand-new diagnostic/predictive task, our algorithm can thus be trained without prior data or knowledge about the new disease: it can adaptively prescribe lab tests to learn the disease model and test selection policy in an online fashion. End-to-end training is also more data-efficient and runtime-efficient.

• Interpretability. Our algorithm selects test panels that are clinically relevant. For ferritin prediction, it identifies TSAT as one of the most important panels, which is indeed useful for detecting iron deficiency. For AKI prediction, it recommends the serum creatinine level test as an important predictor for 95% of subjects, i.e., current and past serum creatinine is indicative of future AKI, expanding its utility as a biomarker (de Geus et al., 2012).

6.3. TRAINING CURVES

We present the training curves on the AKI dataset in Figure 3; more results are given in Appendix B.

• SM-DDPO learns the disease model. In end-to-end training, the diagnostic classifier is trained from scratch. It maps any partially-observed patient state to a diagnosis/prediction. We evaluate this classifier on static data distributions, in order to eliminate the effect of dynamic test selection and focus on classification quality. Figure 3 shows that the classifier learns to make high-quality predictions, improving over the course of RL training, despite being trained only on data selected by the RL algorithm.

7. SUMMARY

In this work, we develop a Semi-Model-based Deep Diagnosis Policy Optimization (SM-DDPO) method to find optimal cost-sensitive dynamic policies, achieving state-of-the-art performance on real-world clinical datasets with up to 85% reduction in testing cost. In our approach, we use a modular network and a combination of training schemes for more efficient RL.

Trapeznikov & Saligrama (2013) rank features in a prefixed order and only require the agent to decide whether to add the next feature or stop. The prefixed ranking makes the problem easier to solve, but it is suboptimal compared to fully dynamic policies. In this work, we deal with highly imbalanced datasets, i.e., there are far fewer ill patients than healthy patients. To handle such imbalance, we evaluate diagnostic performance with the F1 score, a well-established and widely acknowledged evaluation metric in the medical domain (Perazella, 2015; Sanchez-Pinto & Khemani, 2016). Alternative metrics for imbalanced data, such as the linear AM metric (Natarajan et al., 2018; Menon et al., 2013), can also be handled by our framework, as discussed in Appendix F.
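For reference, the AM metric mentioned above is the arithmetic mean of the true positive rate and the true negative rate; unlike the F1 score, it is linear in these rates, which is why it fits the reward-shaping framework directly. A minimal sketch:

```python
def am_score(tp, fn, tn, fp):
    """AM metric: mean of true-positive rate and true-negative rate,
    computed from raw confusion-matrix counts."""
    tpr = tp / (tp + fn)   # recall on the positive class
    tnr = tn / (tn + fp)   # recall on the negative class
    return (tpr + tnr) / 2.0

# Hypothetical confusion counts for an imbalanced dataset
# (10 positives vs. 90 negatives): AM weights both classes equally.
assert am_score(tp=8, fn=2, tn=85, fp=5) == (0.8 + 85 / 90) / 2
```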

B EXPERIMENTS B.1 DESCRIPTION OF DATASETS

We summarize a detailed description of all lab tests in the three datasets in Tables 3, 4, and 5.

B.2 SUPPLEMENT EXPERIMENT RESULTS

In this section, we present more detailed results expanding on Table 2 in the main text.

Computing Resource

The experiments were conducted in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University, which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. Quest provides access to over 11,800 CPU cores. In our experiments, we deploy each model training job on one CPU, so that multiple configurations can be tested simultaneously. Each training job requires a wall-time of less than 2 hours on a single CPU core.

Hyper-Parameter Settings The hyper-parameters of our model are divided into four parts: parameters for training the imputer, the classifier, the PPO policy, and the end-to-end training parameters. We list all tuned hyper-parameters in Table 7, including both the tuning range and the final selection. The unmentioned parameters for training the EMFlow imputer are set as in the original work (Ma & Ghosh, 2021). The unmentioned parameters for PPO are set to the default values in the Python package of Raffin et al. (2021).

Data Splitting Scheme We split each dataset into 3 parts: training set (75%), validation set (15%), and test set (10%). The training data is further split into two disjoint sets, one for pretraining the state encoder (25%) and the other for end-to-end RL training (50%). The validation set is split in the same way, one part for tuning hyper-parameters of the state encoder (5%) and the other for tuning hyper-parameters of end-to-end RL training (10%). Here, all percentages are calculated w.r.t. the whole dataset size.
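The splitting scheme above can be sketched as follows (a hypothetical illustration with an arbitrary random seed, not the exact split used in the experiments):

```python
import random

def split_indices(n, seed=0):
    """Split n sample indices per the scheme: 25% encoder pretraining,
    50% end-to-end RL, 5% + 10% validation, 10% test (fractions of n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cuts = [int(n * f) for f in (0.25, 0.75, 0.80, 0.90)]
    return {
        "pretrain":    idx[:cuts[0]],          # 25% - state-encoder pretraining
        "rl_train":    idx[cuts[0]:cuts[1]],   # 50% - end-to-end RL training
        "val_encoder": idx[cuts[1]:cuts[2]],   # 5%  - encoder hyperparameters
        "val_rl":      idx[cuts[2]:cuts[3]],   # 10% - RL hyperparameters
        "test":        idx[cuts[3]:],          # 10% - final evaluation
    }

parts = split_indices(1000)
assert sorted(len(v) for v in parts.values()) == [50, 100, 100, 250, 500]
```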

Policy Improvement During RL Training

In Figure 5, we show the performance of the RL policies evaluated on the train, validation and test sets. We note that the three curves closely match one another, confirming the generalizability of the learned dynamic classification policy.

Zoom-in Version of the Cost-F1 Pareto Front Here we present the cost-F1 Pareto front, comparing only to baselines with similar testing cost (Figure 6 is a subset of Figure 4). The yellow curve, the envelope of the 190 solutions, is the final Pareto front obtained on the test set. Compared to baselines with similar cost, our approach achieves a better F1 score under the same cost budget; as already shown in Figure 4, this holds even in comparison with fully observable approaches.

C EMFLOW POSTERIOR STATE ENCODER

The flow model is first updated by minimizing the negative log-likelihood of the current imputations:

L1(θ) = −(1/|B|) Σ_{i∈B} log p_X(x̂_i; θ, μ̂, Σ̂),   (4)

where the NF is parametrized by θ and p_X denotes the likelihood of the current imputation estimate x̂_i under the base distribution. Then, with the current estimate of the normalizing flow, we perform an online version of expectation maximization (EM) in the latent space to update the estimate of the base distribution. The expectation step is

ẑ_i^m = E[z_i^m | ẑ_i^o; μ̂, Σ̂], i ∈ B,   (5)

and the maximization step follows an iterative update scheme to deal with the memory overhead of classical EM:

μ̂^(t) = ρ_t μ̂_batch + (1 − ρ_t) μ̂^(t−1),   (6)
Σ̂^(t) = ρ_t Σ̂_batch + (1 − ρ_t) Σ̂^(t−1),   (7)

where μ̂_batch and Σ̂_batch are estimated using the conditional mean and variance given a batch of data:

μ̂_batch = g_μ(μ̂^(t), Σ̂^(t); {ẑ_i, M_i}_{i∈B}),   (8)
Σ̂_batch = g_Σ(μ̂^(t), Σ̂^(t); {ẑ_i, M_i}_{i∈B}).   (9)

Lastly, we re-optimize the NF via a combination of the log-likelihood of the current imputation and a regularization term measuring the distance of the imputation result to the ground truth:

L2(θ) = −(1/|B|) Σ_{i∈B} [log p_X(x̂_i; θ, μ̂, Σ̂) − α ‖x̂_i − x_i‖²].   (10)

In this work, we used the above "supervised" version of EMFlow, since the ground-truth data are available when training this imputer.
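The online EM maximization step in Eqs. (6)-(7) is an exponentially weighted running average of per-batch moment estimates, which avoids the memory cost of full-batch EM. A minimal sketch on plain Python lists (diagonal covariance for simplicity):

```python
def online_em_update(mu, sigma, mu_batch, sigma_batch, rho):
    """Running-average update of base-distribution moments, Eqs. (6)-(7):
    new = rho * batch_estimate + (1 - rho) * previous_estimate."""
    mu_new = [rho * mb + (1 - rho) * m for m, mb in zip(mu, mu_batch)]
    sigma_new = [rho * sb + (1 - rho) * s for s, sb in zip(sigma, sigma_batch)]
    return mu_new, sigma_new

# One update with step size rho_t = 0.5 on a 2-dimensional latent space.
mu, sigma = online_em_update([0.0, 0.0], [1.0, 1.0],
                             [1.0, 2.0], [2.0, 4.0], rho=0.5)
assert mu == [0.5, 1.0] and sigma == [1.5, 2.5]
```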
Finally, the re-imputation procedure follows the same latent-space steps as the training phase, without updating the parameters. The training and imputation schemes are given in Algorithms 2 and 3, respectively.

Algorithm 2 Training Phase of EMFlow ("supervised" version)
Input: current imputation X̂ = (x̂_1, ..., x̂_n)ᵀ, missing masks M = (M_1, ..., M_n)ᵀ, initial estimates of the base distribution (μ̂^(0), Σ̂^(0)), online EM step-size sequence ρ_1, ρ_2, ...
for t = 1, 2, ..., T_epoch do
    Get a mini-batch X̂_B = {x̂_i}_{i∈B}
    # update the flow model
    Compute L1 in Eq. (4); update θ via gradient descent
    # update the base distribution
    ẑ_i = f_θ^{-1}(x̂_i), i ∈ B
    Impute in the latent space with (μ̂^(t−1), Σ̂^(t−1)) to get {ẑ_i}_{i∈B} via Eq. (5)
    Obtain updated (μ̂^(t), Σ̂^(t)) via Eqs. (6)-(7)
    # update the flow model again
    x̂_i = f_θ(ẑ_i), i ∈ B
    Compute L2 via Eq. (10); update θ via gradient descent
end for

Algorithm 3 Re-imputation Phase of EMFlow
Input: current imputation X̂ = (x̂_1, ..., x̂_n)ᵀ, missing masks M = (M_1, ..., M_n)ᵀ, estimates of the base distribution (μ̂, Σ̂) from the previous training phase
for each mini-batch B do
    ẑ_i = f_θ^{-1}(x̂_i), i ∈ B
    Impute in the latent space with (μ̂, Σ̂) to get {ẑ_i}_{i∈B} via Eq. (5)
    x̃_i = f_θ(ẑ_i), i ∈ B
    x̂_i = x̃_i ⊙ (1 − M_i) + x̂_i ⊙ M_i
end for
Output: {x̂_i}_{i=1}^n

To illustrate the imputation quality of the EMFlow algorithm, we report in Table 8 the root mean square error (RMSE) compared to classical imputation algorithms on UCI datasets, as a reference. We point out that, unlike the conventional usage of imputation, we use EMFlow mainly as a posterior state encoder rather than as an imputer optimized for imputation quality.

Training objective ℓ_c for the classifier module The classifier is trained by minimizing the weighted cross-entropy loss

ℓ_c(ϕ; x, M) = − Σ_{i=1}^{2} w_i log [ exp(f_ϕ(Imp_θ(x ⊙ M))_i) / Σ_{j=1}^{2} exp(f_ϕ(Imp_θ(x ⊙ M))_j) ]

over the dataset Q collected while training the RL policy.
This differs from typical classification in that the data are collected adaptively by RL rather than sampled from a prefixed source. Here, the w_i are class weights, treated as tuning parameters. In this work, we used a 3-layer DNN as the classifier model.

Training objective ℓ_rl for the panel/prediction selector module The panel selector, a network module parameterized by ψ, takes as input s_emb_{θ,ϕ}(s) = s_emb_{θ,ϕ}(x ⊙ M) = (Imp_θ(x ⊙ M), f_ϕ(Imp_θ(x ⊙ M)), M), and maps it to a probability distribution over actions. We train the panel selector using proximal policy optimization (PPO) (Schulman et al., 2017), which maximizes a clipped surrogate objective regularized by a squared-error loss of the value function and an entropy bonus. We denote this loss function by ℓ_rl, defined as the following regularized clipped surrogate loss (Schulman et al., 2017):

ℓ_rl(ψ; ψ_old) := ℓ_CLIP(ψ; ψ_old) − c_1 ℓ_VF(ψ) + c_2 Ent[π_ψ],

where

ℓ_CLIP(ψ; ψ_old) = Ê_t[ min( (π_ψ(a_t|ŝ_t) / π_{ψ_old}(a_t|ŝ_t)) Â_t, clip(π_ψ(a_t|ŝ_t) / π_{ψ_old}(a_t|ŝ_t), 1 − ϵ, 1 + ϵ) Â_t ) ]

denotes the clipped surrogate loss with estimated advantages Â_t, regularized by the squared-error loss of the value function ℓ_VF(ψ) = Ê_t[(V_ψ(ŝ_t) − V_t^targ)²] and an entropy term Ent[π_ψ] over states. Here Ê_t is the empirical average over the collected dataset Q, and ŝ_t = s_emb_{θ,ϕ}(s_t) denotes the state embedding derived above. Note that the policy and value networks share the parameter ψ. A high-level summary is given in Algorithm 4; for more details, please refer to the original paper (Schulman et al., 2017). We use the Python package stable-baselines3 (Raffin et al., 2021) to implement PPO.

E PROOF OF THEOREM 4.1

Before presenting the proof of Theorem 4.1, we reintroduce some key notation; throughout, y denotes the ground-truth label of the patient.
First, we define the length of the MDP episode as τ = min{t ≥ 0 : a_t ∈ {P, N}}, a random variable depending on the policy π. We can then rigorously define the normalized true positive, true negative, false positive and false negative rates as

TP(π) := g(P, P), TN(π) := g(N, N), FP(π) := g(P, N), FN(π) := g(N, P),

where g(i, j) := E_π[1{a_τ = i} · 1{y = j}] denotes the probability that the policy diagnoses class i ∈ {P, N} while the ground-truth label is class j ∈ {P, N}. The testing cost is defined as

Cost(π) = E_π[ Σ_{t=0}^{τ−1} Σ_{k∈[D]} c(k) · 1{a_t = k} ],

where c(k) is the cost of panel k. The cumulative state-action occupancy measure is defined, as in the main text, by μ_π(s, a) := E_π[Σ_{t≥0} 1(s_t = s, a_t = a)].

Proof of Theorem 4.1. We follow the proof framework stated in the main text.

Step 1: utilizing monotonicity of the F1 score. Note that the F1 score is monotonically increasing in both the true positive and true negative rates. For any given cost budget B, suppose the optimal policy π*(B) achieves the highest F1 score; then π*(B) is also optimal for the program

max_π {TN(π) subject to Cost(π) ≤ B, TP(π) ≥ TP(π*(B))}.

Step 2: reformulation using occupancy measures. As in Section 4, the program is rewritten as a linear program over the occupancy measure μ, subject to the flow constraint Σ_a μ(s, a) = Σ_{s', a'∈[D]} μ(s', a') P(s|s', a') + ξ(s) for all s.

Step 3: utilizing hidden minimax duality. The program is equivalently reformulated as a max-min program over (μ, λ, ρ). Since the max-min objective is linear in λ, ρ and μ, minimax duality holds, so we can swap the min and max. For any fixed pair (λ, ρ), the inner maximization problem can then be rewritten back in terms of the policy π as the equivalent unconstrained policy optimization problem max_π TN(π) + λ·TP(π) + ρ·Cost(π).



Footnotes:
1. An alternative to the F1 score is the AM metric, which measures the average of the true positive rate and the true negative rate for imbalanced data (Natarajan et al., 2018; Menon et al., 2013). Our approach directly applies to such linear metrics; please refer to Appendix F for details.
2. The EMFlow imputation method was originally proposed in Ma & Ghosh (2021); it maps the data space to a Gaussian latent space via normalizing flows. We give a more detailed discussion of this method in Appendix C.
3. The forms of the training objectives ℓ_c and ℓ_rl of the classifier and panel selector are given in Appendix D.
4. Detailed data splitting, hyperparameter choices and search ranges are presented in Appendix B.
5. Our code uses the implementation of the PPO algorithm in the package of Raffin et al. (2021).



Dynamic feature selection methods (He et al., 2012; Contardo et al., 2016; Karayev et al., 2013) were then proposed to take the differences between inputs into account: different subsets of features are selected for different inputs, either by defining a certain information value of the features (Fahy & Yang, 2019; Bilgic & Getoor, 2007) or by estimating the gain that acquiring a new feature would yield (Chai et al., 2004). Reinforcement learning based approaches (Ji & Carin, 2007; Trapeznikov & Saligrama, 2013; Janisch et al., 2019; Yin et al., 2020; Li & Oliva, 2021; Nam et al., 2021) have also been proposed to dynamically select features for prediction/classification. We give a more detailed discussion in Appendix A.

3 PARETO-FRONT PROBLEM FORMULATION

The state consists of the patient's test results x together with a binary mask M indicating whether the entries of x are observed or missing. Let there be D test panels, whose union is the set of all d tests. The action set A = {1, 2, ..., D} ⊔ {P, N} contains two sets of actions: observation actions and prediction/diagnosis actions. At each stage, one can either pick an observation action a ∈ {1, 2, ..., D}, i.e., choose a test panel a to observe, which incurs a corresponding observation cost c(a); or one can terminate the episode by directly picking a prediction action a ∈ {P, N}, diagnosing the patient as positive (P) or negative (N). A penalty is generated if the diagnosis does not match the ground truth y. An example of this process in sepsis mortality prediction is illustrated in Figure 1. We consider the initial distribution ξ to be patients with only the demographics panel observed, and set the discount factor γ = 1.
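The episode structure above can be sketched in code. This is a minimal illustration of the MDP, not the paper's implementation: the panel contents, costs, and reward magnitudes below are hypothetical placeholders.

```python
# Minimal sketch of the diagnosis MDP: observation actions reveal test panels
# at a cost; prediction actions ('P'/'N') terminate the episode.
# Panel definitions and costs here are illustrative, not from the paper's datasets.

PANEL_COSTS = {0: 36.0, 1: 12.0}          # observation actions: test panels
DIAGNOSIS_ACTIONS = {"P", "N"}            # terminal prediction actions

class DiagnosisEpisode:
    def __init__(self, x, panels, label):
        self.x = x                        # full test results (hidden from the agent)
        self.panels = panels              # panel index -> indices of tests it reveals
        self.label = label                # ground-truth class, 'P' or 'N'
        self.mask = [0] * len(x)          # which entries of x are observed so far
        self.cost = 0.0
        self.done = False

    def step(self, action):
        if action in DIAGNOSIS_ACTIONS:   # terminate with a diagnosis
            self.done = True
            # reward on a match, penalty on a mismatch (magnitudes illustrative)
            return 1.0 if action == self.label else -1.0
        for i in self.panels[action]:     # reveal all tests in the chosen panel
            self.mask[i] = 1
        self.cost += PANEL_COSTS[action]
        return 0.0

    def observation(self):
        # partially observed state x ⊙ M, with None marking missing entries
        return [xi if m else None for xi, m in zip(self.x, self.mask)]
```

A single episode then alternates panel selections with a final diagnosis, accumulating cost until a prediction action ends it.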

Figure 1: MDP model of dynamic diagnosis: illustration of state-action transitions in one episode.

The Semi-Model-based Deep Diagnostic Policy Optimization (SM-DDPO) framework is illustrated in Figure 2. The complete dynamic testing policy π comprises three modules: (1) a posterior state encoder that maps partially observed patient information to an embedding vector; (2) a state-to-diagnosis/prediction classifier, which can be viewed as a reward function approximator; (3) a test panel selector that outputs an action based on the encoded state. This modular architecture makes RL tractable via a combination of pre-training, policy updates, and model-based RL.
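The three-module decomposition can be sketched as a simple composition. The module internals below are placeholders (the paper uses an EMFlow imputer, a neural classifier, and a PPO-trained selector); only the wiring is intended to match the description above.

```python
# Sketch of composing the three SM-DDPO modules into one dynamic testing policy.
# All module signatures here are assumptions for illustration.

def full_policy(x_obs, encoder, classifier, selector):
    """x_obs: partially observed tests (None = missing).
    encoder: maps x_obs to a dense embedding (posterior state).
    classifier: maps the embedding to P(y = positive), used for the diagnosis.
    selector: maps the embedding to an action: a panel index or 'diagnose'."""
    z = encoder(x_obs)
    action = selector(z)
    if action == "diagnose":
        return "P" if classifier(z) >= 0.5 else "N"
    return action  # a test panel to order next
```

The split is what makes training tractable: the encoder can be pretrained on unlabeled data, the classifier trained with supervised losses, and only the selector needs full policy-gradient updates.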

Figure 2: Dynamic diagnostic policy learning via semi-model-based proximal policy optimization. The full policy π comprises three modules: posterior state encoder, classifier, and panel selector.

5.2 POSTERIOR STATE ENCODER

We borrow the idea of imputation to map the partially observed patient information to a posterior embedding vector. In this work, we consider a flow-based deep imputer named EMFlow. Given the imputer Imp_θ(·) parameterized by θ, the RL agent observes tests x ⊙ M and calculates Imp_θ(x ⊙ M) ∈ R^d as the posterior state encoding. Unlike conventional imputation (Lin & Tsai, 2020; Austin et al., 2021; Osman et al., 2018), our posterior state encoder aims to resolve exponentially many possible missing patterns. We therefore pretrain it on unlabeled augmented data, constructed by repeatedly and randomly masking entries to create additional samples.
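The masking augmentation can be sketched as follows; the function name, copy count, and missingness rate are illustrative choices, not the paper's settings.

```python
# Sketch: pretraining data augmentation for the posterior state encoder.
# Each complete sample is duplicated with random subsets of entries masked out,
# so the imputer is exposed to many different missingness patterns.

import random

def augment_with_masks(samples, copies=4, p_missing=0.5, seed=0):
    rng = random.Random(seed)
    augmented = []
    for x in samples:
        for _ in range(copies):
            # independently keep each entry with probability 1 - p_missing
            mask = [1 if rng.random() > p_missing else 0 for _ in x]
            augmented.append([xi if m else None for xi, m in zip(x, mask)])
    return augmented
```

Pretraining the imputer on such augmented data exposes it to far more missingness patterns than occur in the raw records, which is the point of the augmentation.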

Figure 3: Classifier improvement during RL training on the AKI dataset. Accuracy of the learned classifier is evaluated on static patient distributions with 1) a random missing pattern, where we augment the test data uniformly at random; and 2) the missing pattern induced by the optimal policy's state distribution. During end-to-end RL training, the classifier gradually improves and attains higher accuracy under the second missing pattern.

Figure 4: Cost-F1 Pareto front for maximizing the F1 score on the Ferritin, AKI, and Sepsis datasets.

6.4 COST-F1 PARETO FRONT

Figure 4 illustrates the Pareto fronts learned on all three datasets. We trained optimal policies on 190 MDP instances specified by different value pairs of (λ, ρ) in Theorem 4.1, and present the corresponding F1 score (red) and AUROC (blue) evaluated on the test sets. We identify the Pareto front as the upper envelope of these solutions, shown as the yellow curves in Figure 4. These results present the full tradeoff between testing cost and diagnostic/predictive accuracy. As a corollary, given any cost budget B, one can read off the best testing strategy with the optimal F1 performance directly from Figure 4. We present a zoomed-in version in Appendix B.
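The upper-envelope step can be sketched as a small post-processing routine over the (cost, F1) points produced by the (λ, ρ) grid; the function below is an illustrative implementation, not the paper's code.

```python
# Sketch: extracting the cost-F1 Pareto front as the upper envelope of
# (cost, f1) points, one per solved (lambda, rho) MDP instance.
# A point is dominated if another point has cost <= it AND f1 >= it.

def pareto_front(points):
    """points: list of (cost, f1) tuples. Returns the non-dominated subset,
    sorted by increasing cost."""
    front = []
    # sort by cost ascending; break cost ties by higher f1 first
    for c, f in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or f > front[-1][1]:   # keep only strict F1 improvements
            front.append((c, f))
    return front
```

Sweeping the (λ, ρ) grid and keeping only the envelope is what turns 190 single-policy solutions into the full cost-accuracy trade-off curve.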

Figure 5: RL training curves on the AKI dataset. We evaluate the policy and classifier learned during the end-to-end training phase on the train, validation, and test sets. The matching curves confirm the generalizability of the learned policies.

Figure 6: Cost-F1 Pareto front for maximizing the F1 score on the Ferritin, AKI, and Sepsis datasets.

Proximal Policy Optimization (PPO)
for iteration = 0, 1, ... do
    for actor = 1, 2, ..., N_actor do
        Run policy π_{ψ_old} in the environment for T timesteps and save all observations in Q   ▷ Q is also used to train the classifier module
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize surrogate ℓ_rl w.r.t. ψ, with K epochs and minibatch size M ≤ N_actor · T
    ψ_old ← ψ
end for

E PROOFS OF THEOREM 4.1

For ease of reading, we restate Theorem 4.1 here in the appendix.

Theorem E.1 (Copy of Theorem 4.1). The Cost-F1 Pareto front defined in (1) is a subset of the collection of all reward-shaped solutions, given by

Π* ⊆ Π := ∪_{λ ≥ 0, ρ ≤ 0} argmax_π {TN(π) + λ · TP(π) + ρ · Cost(π)}.

As a natural corollary, we have the following result:

Corollary E.1. The Cost-F1 Pareto front defined in (1) is a subset of the solutions of the MDP model for the dynamic diagnosis process defined in Section 3 with reward function

R(s, a) = ρ · c(a), if a ∈ [D] (choosing test panels); λ · 1{y = P}, if a = P (true positive diagnosis); 1{y = N}, if a = N (true negative diagnosis).

µ(s, a) := E_π[ Σ_{t=0}^{τ-1} 1{s_t = s, a_t = a} ], ∀s ∈ S, a ∈ A, which denotes the expected time spent in state-action pair (s, a) during an episode. There is a one-to-one correspondence between the policy and the occupancy measure, given by π(a|s) = µ(s, a) / Σ_{a' ∈ A} µ(s, a').

max_π TN(π) subject to Cost(π) ≤ B, TP(π) ≥ TP(π*(B)), indicating that the Pareto front of the F1 score is a subset of

Π* ⊆ ∪_{B > 0, B' ∈ [0,1]} argmax_π {TN(π) subject to Cost(π) ≤ B, TP(π) ≥ B'}.   (12)

Step 2: reformulation using occupancy measures. Fix any specific pair (B, B'). Consider the equivalent dual linear program form (Zhang et al., 2020) of the above policy optimization problem in terms of the occupancy measure. Then program (12) is equivalent to

max_µ TN(µ) subject to Cost(µ) ≤ B, TP(µ) ≥ B', Σ_a µ(s, a) = Σ_{s', a' ∈ [D]} µ(s', a') · P(s | s', a') + ξ(s), ∀s,

where ξ(·) denotes the initial distribution, and TP, TN, and Cost are overloaded in terms of the occupancy µ as:

max_µ min_{λ ≥ 0, ρ ≤ 0} TN(µ) + λ · (TP(µ) - B') + ρ · (Cost(µ) - B) subject to Σ_a µ(s, a) = Σ_{s', a' ∈ [D]}

min_{λ ≥ 0, ρ ≤ 0} max_µ TN(µ) + λ · (TP(µ) - B') + ρ · (Cost(µ) - B)

max_π TN(π) + λ · TP(π) + ρ · Cost(π).

This is finally a standard cumulative-sum MDP problem, with reshaped reward: reward ρ · c(t) for the action of choosing test panel t, reward λ for a diagnosis action that yields a true positive, and reward 1 for a true negative, as stated in Corollary E.1:

R(s, a) = ρ · c(a), if a ∈ [D] (choosing test panels); λ · 1{y = P}, if a = P (true positive diagnosis); 1{y = N}, if a = N (true negative diagnosis).
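The reshaped reward is simple enough to write down directly. Below is a sketch of it in code; the panel costs are illustrative, and recall that ρ ≤ 0, so the panel term is a penalty.

```python
# Sketch of the reshaped reward from Corollary E.1.
# lam >= 0 and rho <= 0 are the dual variables from the Lagrangian;
# panel_cost maps a panel index to its cost c(a) (values illustrative).

def shaped_reward(action, label, lam, rho, panel_cost):
    """action: a panel index (int) or a diagnosis 'P'/'N'; label: truth 'P'/'N'."""
    if isinstance(action, int):                # ordering panel a: reward rho * c(a) <= 0
        return rho * panel_cost[action]
    if action == "P":                          # true positive rewarded by lambda
        return lam * (1.0 if label == "P" else 0.0)
    return 1.0 if label == "N" else 0.0        # true negative rewarded by 1
```

Sweeping (λ, ρ) over a grid and solving the resulting standard MDP for each pair is exactly how the set of reward-shaped solutions covering the Pareto front is enumerated.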

Summary statistics of the Ferritin, AKI, and Sepsis datasets. Columns: dataset, # of tests, # of test panels, # of patients, % positive class, # training, # validation, # held-out testing.

These lab tests were ordered through, and can be dynamically selected from, 6 lab test panels, including the basic metabolic panel (BMP, n=9, [estimated national average] cost=$36).

Acute Kidney Injury Prediction. Acute kidney injury (AKI) is commonly encountered in adults in the intensive care unit (ICU), and patients with AKI are at risk for adverse clinical outcomes such as prolonged ICU stays and hospitalization, need for renal replacement therapy, and increased mortality (Kellum & Lameire, 2013). AKI usually develops over the course of a few hours to days, and the efficacy of intervention relies greatly on early identification of deterioration (Kellum & Lameire, 2013). Prior risk prediction models for AKI based on EHR data yielded modest performance.

Sepsis Mortality Prediction. We followed the steps in Shin et al. (2021) to collect 5,783 septic patients from the MIMIC-III dataset (Johnson et al., 2016) according to the Sepsis-3 criteria (Singer et al., 2016). The in-hospital mortality rate of this cohort is 14.5%. We focused on predicting in-hospital mortality for these sepsis patients using demographic information, medical histories, mechanical ventilation status, the first present lab testing results and physiologic measurements within 24 hours of ICU admission, and the Sequential Organ Failure Assessment (SOFA) score.

Model performance, measured by F1 score, area under the ROC curve (AUC), and testing cost, on three real-world clinical datasets. The tested models include logistic regression (LR), random forests (RF), gradient-boosted regression trees (XGBoost (Chen & Guestrin, 2016) and LightGBM (Ke et al., 2017)), a 3-layer multi-layer perceptron, Q-learning for classification with costly features (CWCF) (Janisch et al., 2019), Random Selection (RS), and Fixed Selection (FS). All models were fine-tuned to maximize the F1 score. The model yielding the highest F1 score is in bold; the model requiring the least testing cost is underlined. More detailed results for this table, with more dynamic baselines and standard deviations, are reported in Appendix B.

Haiyan Yin, Yingzhen Li, Sinno Jialin Pan, Cheng Zhang, and Sebastian Tschiatschek. Reinforcement learning with efficient active feature acquisition. arXiv preprint arXiv:2011.00825, 2020.

Summary descriptive statistics of the Ferritin dataset. (N = 43472. Binary variables are reported with positive numbers and percentages. Continuous variables are reported with means and interquartile ranges.) These variables can also be ordered through the basic metabolic panel ($36).

Summary descriptive statistics of the AKI dataset. (N = 23950. Binary variables are reported with positive numbers and percentages. Continuous variables are reported with means and interquartile ranges.)

Summary descriptive statistics of the Sepsis dataset. (N = 5396. Binary variables are reported with positive numbers and percentages. Continuous variables are reported with means and interquartile ranges.)

Hyper-parameter Table

Comparison of imputers on public datasets in RMSE. Each dataset has 20% MCAR missingness. Apart from the time-consuming MissForest imputer, EMFlow outperforms the other imputers on most datasets.
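The evaluation protocol behind this comparison can be sketched as follows. This is an illustrative reconstruction, not the paper's benchmarking code: entries are masked completely at random (MCAR), imputed, and scored by RMSE on the masked entries only.

```python
# Sketch: MCAR masking and RMSE scoring for imputer comparison.
# Function names are hypothetical; an actual imputer (e.g. EMFlow) would
# replace the identity step between masking and scoring.

import math
import random

def mcar_mask(n_rows, n_cols, frac=0.2, seed=0):
    """Pick a fraction of cells uniformly at random to hide (MCAR)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, int(frac * len(cells))))

def rmse_on_masked(truth, imputed, masked):
    """Root-mean-square error computed only over the masked cells."""
    err = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(err) / len(err))
```

Scoring only the masked cells is what isolates imputation quality from the trivially known observed entries.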

Detailed statistics of our results optimizing F1 score.

Detailed statistics of our results optimizing AM metric.

ACKNOWLEDGMENTS

Mengdi Wang acknowledges support from NSF grants DMS-1953686, IIS-2107304, CMMI-1653435, ONR grant 1006977, and C3.AI. Yuan Luo acknowledges support from NIH grants U01TR003528 and R01LM013337. Yikuan Li acknowledges support from AHA grant 23PRE1010660.

APPENDIX

We organize the appendices as follows. Appendix A provides a complete literature review. Appendix B provides the descriptive statistics of all three clinical datasets, followed by the experimental results and detailed training settings. Appendix C describes the EMFlow imputation algorithm. Appendix D gives more details of the proposed end-to-end training approach. Appendix E proves Theorem 4.1. Appendix F discusses the alternative AM metric for evaluating diagnostic performance, as an example of our framework's ability to handle classic linear metrics.

A RELATED WORK

Reinforcement learning (RL) has been applied in multiple clinical care settings, e.g., to learn optimal treatment strategies for sepsis (Komorowski et al., 2018) and to customize antiepilepsy drugs for seizure control (Guez et al., 2008); see the survey by Yu et al. (2021) for a more comprehensive summary. Guidelines on using RL for optimizing treatments in healthcare have also been proposed, around the topics of variable availability, sample size for policy evaluation, and how to ensure a learned policy works prospectively as intended (Gottesman et al., 2019). However, using RL to simultaneously reduce healthcare cost and improve patient outcomes has been underexplored.

Among the tested models, classical methods including LR, RF, XGBoost, and LightGBM are evaluated under both the fully observable case and a fixed-observation case in which the 2 most relevant panels (the CMB and CBCP panels) are chosen. All models were fine-tuned to maximize the F1 score. The model yielding the highest F1 score is in bold; the model requiring the least testing cost is underlined. While the baselines incur much higher testing cost, our approach exhibits comparable performance on F1 score and AUROC at a much lower testing cost. Lastly, we emphasize that our approach yields the whole Pareto front, which represents a complete and rigorous characterization of the accuracy-cost trade-off in our medical diagnostic tasks, whereas the baselines we compare to give only single points on this trade-off curve.

F EXTENSION TO AM METRIC

In this section, we show that our framework directly handles classic linear metrics, defined as linear combinations of (normalized) true positives and true negatives. As an example, we use the AM metric (Natarajan et al., 2018; Menon et al., 2013), a linear metric for imbalanced tasks based on the average of the true positive rate and the true negative rate. We can formally define the cost-AM Pareto front analogously to (1), where AM(π) = (1/2) · (TPR(π) + TNR(π)). Here the true positive rate and true negative rate are defined respectively as TPR(π) = TP(π) / (TP(π) + FN(π)) and TNR(π) = TN(π) / (TN(π) + FP(π)), where TP(π), TN(π), FP(π), FN(π) are the normalized true positive, true negative, false positive, and false negative probabilities that sum up to 1. Let λ > 0 be the known ratio between the number of healthy patients and ill patients. Then TP(π) + FN(π) = 1/(1+λ) and TN(π) + FP(π) = λ/(1+λ), indicating AM(π) = ((1+λ)/2) · (TP(π) + TN(π)/λ). Thus the AM metric is linear in TP(π) and TN(π), so given any fixed budget the problem already reduces to a standard MDP. This is a simpler problem that we can solve directly with our training framework; rigorously, a result parallel to Theorem 4.1 holds. Empirically, Table 9 reports our results on the AM metric against the same baselines discussed in this work, with testing costs similar to those in Table 2 of the paper. Our approach achieves the best AM score while incurring lower testing costs.
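The linearity identity derived above can be verified numerically. The sketch below checks, on illustrative numbers, that the rate-based definition of AM agrees with the linear form in TP and TN when the class ratio λ fixes the marginals.

```python
# Numerical check of the identity AM = ((1 + lam) / 2) * (TP + TN / lam),
# where lam = (# negatives) / (# positives), TP + FN = 1/(1+lam),
# and TN + FP = lam/(1+lam). All quantities are normalized probabilities.

def am_from_rates(tp, fn, tn, fp):
    """AM as the average of true positive rate and true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return 0.5 * (tpr + tnr)

def am_linear(tp, tn, lam):
    """Equivalent linear form of AM in (TP, TN) given class ratio lam."""
    return 0.5 * (1 + lam) * (tp + tn / lam)
```

With λ = 3, e.g. TP = 0.2, FN = 0.05, TN = 0.6, FP = 0.15 satisfy the marginals (0.25 and 0.75), and both expressions agree; this linearity is exactly what lets a fixed-budget AM problem be solved as a standard MDP.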

G PRECISION AND RECALL

We also report detailed statistics for the two sets of optimal solutions that optimize the F1 score and the AM metric, respectively. Comparing Tables 10 and 11 shows that the F1 score balances recall and precision, while the AM metric balances the true positive and true negative rates.
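This contrast can be made concrete by computing both metrics from the same normalized confusion probabilities. The numbers below are illustrative only (a hypothetical imbalanced task with 5% positives), not the paper's results.

```python
# Sketch: F1 vs. AM computed from the same normalized confusion entries
# (TP, FP, FN, TN are probabilities summing to 1; values illustrative).

def precision_recall_f1(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)                       # recall = true positive rate
    return prec, rec, 2 * prec * rec / (prec + rec)

def am(tp, fn, tn, fp):
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return 0.5 * (tpr + tnr)
```

On such imbalanced entries, F1 trades off precision against recall while AM averages the two per-class rates, which is why the two objectives select different operating points.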

