AN EXAMINATION OF PREFERENCE-BASED REINFORCEMENT LEARNING FOR TREATMENT RECOMMENDATION

Abstract

Treatment recommendation is a complex multi-faceted problem with many conflicting objectives, e.g., optimizing the survival rate (or expected lifetime), mitigating negative impacts, reducing financial expenses and time costs, avoiding over-treatment, etc. While this complicates the hand-engineering of a reward function for learning treatment policies, fortunately, qualitative feedback from human experts is readily available and can be easily exploited. Since direct estimation of rewards via inverse reinforcement learning is a challenging task and requires the existence of an optimal human policy, the field of treatment recommendation has recently witnessed the development of the preference-based Reinforcement Learning (PRL) framework, which infers a reward function from only qualitative and imperfect human feedback to ensure that a human expert's preferred policy has a higher expected return over a less preferred policy. In this paper, we first present an open simulation platform to model the progression of two diseases, namely Cancer and Sepsis, and the reactions of the affected individuals to the received treatment. Secondly, we investigate important problems in adopting preference-based RL approaches for treatment recommendation, such as advantages of learning from preference over hand-engineered reward, addressing incomparable policies, reward interpretability, and agent design via simulated experiments. The designed simulation platform and insights obtained for preference-based RL approaches are beneficial for achieving the right trade-off between various human objectives during treatment recommendation.

1. INTRODUCTION

With recent advances in deep learning and open access to large-scale Electronic Health Records (EHRs), Deep Reinforcement Learning (RL) approaches have gained popularity for treatment recommendation (Raghu et al., 2017; Lopez-Martinez et al., 2019). But the success of RL applications often depends crucially on the prior knowledge that goes into the definition of the reward function (Wirth et al., 2017). Treatment recommendation, however, is a multi-faceted problem in which the reward function is hard to engineer and requires quantifying the trade-off between diverse, realistic objectives. For instance, clinicians often aim to optimize the survival rate (or expected lifetime) while mitigating negative impacts of the treatment (Raghu et al., 2017; Lopez-Martinez et al., 2019; Wang et al., 2018). They also keep in mind the patient's financial expenses and time costs when accepting treatment strategies (Faissol et al., 2007; Denton et al., 2009). Moreover, unnecessary treatment or over-treatment needs to be avoided, and certain agreements based on the patient's medical insurance plan need to be followed for an affordable treatment (Nemati et al., 2016). To explicitly reflect human objectives in the reward function, prior work jointly considers multiple objectives weighted linearly to reduce the problem to a single-objective MDP (Faissol et al., 2007; Denton et al., 2009). However, the linearly weighted reward function induces negative interference between objectives, especially when representations are learned using neural networks and shared among different objectives, which goes against the human's actual intentions (Pham et al., 2018; Schaul et al., 2019).
Further, given clinicians' treatment strategies, the intrinsic reward function cannot be inferred accurately with existing inverse reinforcement learning (IRL) methods (Abbeel & Ng, 2004; Ho & Ermon, 2016), since they require access to samples from an optimal policy, which is not guaranteed in reality (Komorowski et al., 2018; Saria, 2018). Fortunately, qualitative feedback reflecting human preferences can be easily obtained and efficiently leveraged to infer reward functions. In this paper, we investigate preference-based Reinforcement Learning (PRL) approaches (Fürnkranz et al., 2012; Cheng et al., 2011; Akrour et al., 2012; Schäfer & Hüllermeier, 2018; Christiano et al., 2017) for treatment recommendation, where the reward is estimated based on preferences over a pair of treatment strategies. Specifically, the reward estimator ensures that, in a policy pair, the policy preferred according to a human's objectives has a higher expected return. However, acceptance of PRL approaches for treatment recommendation requires significant exploration of their practical utility, reliability and interpretability.

Contributions:

In this work, we first present an open simulation platform to investigate the preference-based Reinforcement Learning approaches from the above aspects. The constructed simulator models the dynamic state transitions of different individuals with Cancer or Sepsis and their reactions to the received medication or operation treatment, which enables efficient model training and reliable performance evaluation. Next, we conduct comprehensive simulated experiments to address the following questions: 1) Does the preference-based qualitative feedback really benefit the policy learning compared to handcrafted rewards and other existing treatment recommendation approaches? 2) How to better optimize human's objectives by learning a reward representation which can deal with policies which are incomparable? 3) Is the reward function inferred by PRL interpretable and does it faithfully follow human intentions? and 4) How to design agent types so that resulting policies together with the preference feedback lead to more accurate reward estimation and better treatment outcomes? Our experiments provide useful insights and guidance in developing preference-based RL approaches to realize the right trade-off between human objectives during treatment recommendation.

2. PROBLEM DEFINITION

We cast treatment policy learning as a Markov Decision Process (MDP). At time-step $t$, the state $s_t$ is a vector composed of multiple health-related features, and the action $a_t$ is either a scalar value representing a dosage amount or a boolean value denoting whether to perform an operation. Besides the effects of the conducted actions, features in the state influence each other's progression, which is simulated by the state transition probability function $P(s_{t+1} \mid s_t, a_t)$. The agent aims to learn the optimal policy $\pi^*$ that maximizes the expected return $V^{\pi^*}(s_0) = \max_{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in [0, 1]$ is the discount factor and $r_t$ is the estimated reward based on preference feedback. Given two policies $\pi_m$ and $\pi_n$ starting from the same initial state $s_i$, $\pi_m(s_i) \succ \pi_n(s_i)$ denotes that policy $\pi_m$ is preferred to $\pi_n$ based on the human's objectives. Rather than using a hand-crafted reward function for the MDP, we aim to find a parameterized reward function $r_{\theta_P}$ that approximates the true reward function $r$ underlying the human's preference.

3. SIMULATION PLATFORM DESIGN

3.1. GENERAL CANCER AND DRUG TREATMENT SIMULATION

Following prior work (Fürnkranz et al., 2012), we use the mathematical model proposed by Zhao et al. (2009) to simulate the general cancer evolution and drug treatment effects.

State Transformation: The values of the next tumor size $y_{t+1}$ and the toxicity level $x_{t+1}$ are determined by the current drug amount $d_t$, their current values $y_t$, $x_t$ and their initial values $y_0$, $x_0$:

$$y_{t+1} = \mathrm{ReLU}\big(y_t + [a_1 \cdot \max(x_t, x_0) - b_1 \cdot (d_t - m_1)] \times I(y_t > 0)\big),$$

$$x_{t+1} = \mathrm{ReLU}\big(x_t + a_2 \cdot \max(y_t, y_0) + b_2 \cdot (d_t - m_2)\big),$$

where $I$ is the indicator function, which outputs 1 if the current tumor size $y_t > 0$ and 0 otherwise.
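The state transformation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cancer_step` and the coefficient values in `PLACEHOLDER_PARAMS` are assumptions made for the example and are not taken from this paper, and the ReLU is assumed to wrap the whole right-hand side of each update.

```python
def relu(v):
    """Clip negative values to zero."""
    return max(0.0, v)


def cancer_step(y_t, x_t, d_t, y_0, x_0, a1, b1, m1, a2, b2, m2):
    """One transition of tumor size y and toxicity x under dosage d_t.

    Hypothetical helper: mirrors the two update equations of Section 3.1.
    """
    # Indicator I(y_t > 0): a tumor of size zero does not change any more.
    tumor_present = 1.0 if y_t > 0 else 0.0
    y_next = relu(y_t + (a1 * max(x_t, x_0) - b1 * (d_t - m1)) * tumor_present)
    x_next = relu(x_t + a2 * max(y_t, y_0) + b2 * (d_t - m2))
    return y_next, x_next


# Placeholder coefficients chosen only to make the example runnable
# (an assumption, not values reported in this paper).
PLACEHOLDER_PARAMS = dict(a1=0.1, b1=1.2, m1=0.5, a2=0.15, b2=1.2, m2=0.5)

y1, x1 = cancer_step(y_t=1.0, x_t=1.0, d_t=1.0, y_0=1.0, x_0=1.0,
                     **PLACEHOLDER_PARAMS)
```

With these placeholder values, a full dose shrinks the tumor while raising the toxicity, which is the trade-off the agent has to balance.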

3.2. SEPSIS INFECTION AND BLOOD PURIFICATION SIMULATION

We employ the mathematical model derived by Song et al. (2012) to simulate the acute inflammation process in response to an infection. There are 19 physiological features that govern sepsis dynamics, 8 of which are observable while the remaining 11 are unmeasurable conceptual variables. Whenever a blood purification operation is performed, three components in the circulation are eliminated, i.e., activated neutrophils $N_a$ and the pro- and anti-inflammatory mediators $PI$ and $AI$. Besides the effects of the blood purification operation, the variables influence each other's progression through ordinary differential equations (ODEs).

State Transformation: There are 18 ODEs describing feature interactions and 3 ODEs for the hypothesized mechanism of blood purification. For a simple demonstration, we only list the equations of activated neutrophils ($N_a$) with and without the blood purification operation here.

No operation:

$$\frac{dN_a}{dt} = N_r \frac{PI^n}{h^n_{\text{Nr-Na}} + PI^n} \frac{1}{\tau_{\text{Nr-Na}}} + N_p \frac{PI^n}{h^n_{\text{Np-Na}} + PI^n} \frac{1}{\tau_{\text{Np-Na}}} - \frac{N_a}{\tau_{\text{Na}}} - N_a \frac{PI^n}{h^n_{\text{Na-Ns}} + PI^n} \frac{1}{\tau_{\text{Na-Ns}}},$$

With operation:

$$\frac{dN_{aHA}}{dt} = \frac{dN_a}{dt} - \frac{N_a / N_\infty}{h_{AIHA} + (N_a / N_\infty)},$$

where $N_r$, $N_p$, $N_a$ are resting, primed and activated blood neutrophils respectively, $PI$ is the systemic pro-inflammatory response, $\tau_{\text{Nr-Na}}$, $\tau_{\text{Np-Na}}$, $\tau_{\text{Na-Ns}}$ are constant parameters and $h^n_{\text{Nr-Na}}$, $h^n_{\text{Np-Na}}$, $h^n_{\text{Na-Ns}}$, $h_{AIHA}$ are Hill-equation parameters.
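As a rough illustration, the right-hand side of the activated-neutrophil equation can be written as a small Python function. The dictionary keys, the helper names `hill` and `dNa_dt`, and the reading of each $h^n$ term as a half-saturation constant raised to the power $n$ are assumptions made for this sketch; no parameter values from Song et al. (2012) are reproduced here.

```python
def hill(pi, h, n):
    """Hill-type activation term PI^n / (h^n + PI^n)."""
    return pi ** n / (h ** n + pi ** n)


def dNa_dt(N_r, N_p, N_a, PI, p, purify=False):
    """Rate of change of activated neutrophils N_a.

    `p` is a dict of model constants (hypothetical key names).
    If `purify` is True, the blood purification removal term is subtracted.
    """
    rate = (
        N_r * hill(PI, p["h_NrNa"], p["n"]) / p["tau_NrNa"]    # resting -> activated
        + N_p * hill(PI, p["h_NpNa"], p["n"]) / p["tau_NpNa"]  # primed -> activated
        - N_a / p["tau_Na"]                                    # decay term N_a / tau_Na
        - N_a * hill(PI, p["h_NaNs"], p["n"]) / p["tau_NaNs"]  # outflow via the Na-Ns term
    )
    if purify:
        ratio = N_a / p["N_inf"]
        rate -= ratio / (p["h_AIHA"] + ratio)  # removal by hemoadsorption
    return rate
```

Such a function could be handed to any ODE integrator together with the remaining equations of the model; the purification flag only changes the removal term, so the same code serves both cases.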

4.1. EXISTING TREATMENT RECOMMENDATION APPROACHES

Learning from Hand-crafted Reward: When the optimization objective is to maximize clinical efficacy alone, the reward for intermediate timesteps is either 0 or hand-crafted from indicators of patient health, while the rewards for positive and negative outcomes at terminal timesteps are normally of the same scale but of opposite sign (Raghu et al., 2017; Wang et al., 2018; Nemati et al., 2016). Besides pursuing optimal clinical efficacy, another line of work also includes auxiliary objectives such as mitigating negative impacts or improving health conditions. Most existing work specifies linear scalarization functions based on domain knowledge to project the multi-objective MDP onto a single-objective MDP (Denton et al., 2009; Lopez-Martinez et al., 2019; Zhao et al., 2009). Because the learned policy and the resulting performance are sensitive to the relative values of the manually specified rewards, the reward functions employed in these approaches are difficult for experts to quantify in order to achieve distinct goals in treatment recommendation.

Learning from Human Feedback: Given demonstrations from domain experts, inverse Reinforcement Learning (IRL) methods first seek the reward function that models the intention of the demonstrator and then train RL agents to match the demonstrations (Abbeel & Ng, 2004; Ho & Ermon, 2016). Though explicit quantitative reward signals are no longer needed in IRL settings, learning treatment policies from clinicians' demonstrations is challenging, since optimal demonstrations are difficult for clinicians to provide, while the general treatment regimens in demonstrations can hardly reflect the actual intentions (Gao et al., 2018; Brown et al., 2019).
Fortunately, even non-experts can provide feedback in the form of preferences, which has been used to replace conventional numerical reward signals with relative utility values (Fürnkranz et al., 2012; Cheng et al., 2011; Akrour et al., 2012; Schäfer & Hüllermeier, 2018; Christiano et al., 2017).

4.2. PREFERENCE-BASED REINFORCEMENT LEARNING FRAMEWORK

We display the framework in Algorithm 1 to show the reward and policy learning procedure given preference feedback. First, we aim to learn a reward function from which the preference between two policies can be approximated.

Algorithm 1
    ...
 4: for all $s \in S$ do
 5:     $s^1_0 \leftarrow s$, $s^2_0 \leftarrow s$, $\tau^1 \leftarrow \emptyset$, $\tau^2 \leftarrow \emptyset$
 6:     for $t = 0$ to $T - 1$ do
 7:         $a^1_t \leftarrow \pi(s^1_t; \theta^1_A)$, $s^1_{t+1} \leftarrow \mathrm{SIMULATE}(s^1_t, a^1_t)$, $r^1_{\theta_P,t} \leftarrow \mathrm{REWARD}(s^1_t, a^1_t; \theta_P)$
 8:         $a^2_t \leftarrow \pi(s^2_t; \theta^2_A)$, $s^2_{t+1} \leftarrow \mathrm{SIMULATE}(s^2_t, a^2_t)$, $r^2_{\theta_P,t} \leftarrow \mathrm{REWARD}(s^2_t, a^2_t; \theta_P)$
 9:         $\tau^1 \leftarrow \tau^1 \cup \{(s^1_t, a^1_t, r^1_{\theta_P,t}, s^1_{t+1})\}$, $\tau^2 \leftarrow \tau^2 \cup \{(s^2_t, a^2_t, r^2_{\theta_P,t}, s^2_{t+1})\}$
10:     end for
11:     $\Gamma^1 \leftarrow \Gamma^1 \cup \{\tau^1\}$, $\Gamma^2 \leftarrow \Gamma^2 \cup \{\tau^2\}$
12:     $\mathrm{pre}(\tau^1, \tau^2) \leftarrow \mathrm{EVALUATEPREFERENCE}(\tau^1, \tau^2)$ // Preference feedback from humans
13:     $D \leftarrow D \cup \{(\tau^1, \tau^2, \mathrm{pre}(\tau^1, \tau^2))\}$
14: end for
15: Drawing minibatches $\Gamma^1_n \sim \Gamma^1$, $\Gamma^2_n \sim \Gamma^2$, ...
    ...

Learning from Qualitative Feedback: We denote by $\pi_m(s_i) \succ \pi_n(s_i)$ the case that, given $s_i$, $\pi_m$ is preferred to $\pi_n$. We treat the qualitative feedback learning problem as a classic binary classification task, in which two policies are given and a model learns to approximate the preference between them. We assume that the probability that one policy is preferred to the other is a function of their estimated rewards (explained later in Preference Probability Representation); the parameters $\theta_P$ are then learned by minimizing the cross-entropy loss

$$L(\theta_P) = -\mathbb{E}_{s_i \sim S}\Big[ I\big(\pi_m(s_i) \succ \pi_n(s_i)\big) \log p\big(\pi_m(s_i) \succ \pi_n(s_i); \theta_P\big) + I\big(\pi_n(s_i) \succ \pi_m(s_i)\big) \log p\big(\pi_n(s_i) \succ \pi_m(s_i); \theta_P\big) \Big], \qquad (1)$$

where $I(\cdot \succ \cdot)$ is an indicator function equal to 1 if the first policy is preferred to the second, and 0 otherwise.

Preference Probability Representation: The Bradley-Terry model (Bradley & Terry, 1952) is a widely used probability model for predicting the outcome of a paired comparison: $p(i \succ j) = \frac{p_i}{p_i + p_j}$, where $p_i$ is a positive real-valued score assigned to individual $i$. To compute the probability that $\pi_m$ is preferred to $\pi_n$ given state $s_i$, we employ the implementation introduced in (Agresti & Kateri, 2011):

$$p\big(\pi_m(s_i) \succ \pi_n(s_i)\big) = \frac{\exp R(\pi_m, s_i; \theta_P)}{\exp R(\pi_m, s_i; \theta_P) + \exp R(\pi_n, s_i; \theta_P)},$$

where capital $R$ denotes the expected return of executing a policy from one specific initial state. Given the learned reward, the parameters of the RL agent are updated with either of the following two methods.

Action-based Reward Modification (AbRM): Hand-crafted rewards assigned to different outcomes strongly influence agent performance, even if we keep their ratio and change only their magnitude. Instead of manually designing scalar rewards for different outcomes, we send the preference-based reward $r_{\theta_P}(s_t, a_t)$ to the agent at each time-step.
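The Bradley-Terry preference probability used in this section can be sketched as follows. The helper names `discounted_return` and `preference_probability` are hypothetical, and the step-wise rewards are assumed to be plain floats already produced by the reward estimator; the sketch uses the identity $\exp(R_m) / (\exp(R_m) + \exp(R_n)) = \sigma(R_m - R_n)$ to avoid overflow for large returns.

```python
import math


def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * rewards[t] over a finite trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def preference_probability(return_m, return_n):
    """p(pi_m preferred to pi_n) under the Bradley-Terry model on returns.

    Algebraically equal to exp(R_m) / (exp(R_m) + exp(R_n)), evaluated as a
    logistic function of the return difference for numerical stability.
    """
    diff = return_m - return_n
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Two toy reward sequences standing in for the estimator's outputs.
R_m = discounted_return([0.2, 0.5, 1.0])
R_n = discounted_return([0.1, 0.1, 0.3])
p_m_over_n = preference_probability(R_m, R_n)
```

Because the probability depends only on the difference of the two returns, adding the same constant to every reward leaves the predicted preference unchanged.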

State-based Reward Modification (SbRM): We derive a new state value $h_{\theta_P}$ from $r_{\theta_P}$ to represent how good the state is: $h_{\theta_P}(s_t) = \max_a r_{\theta_P}(s_t, a)$. We further compute the advantage of the current state over the previous one, $h_{\theta_P}(s_t) - h_{\theta_P}(s_{t-1})$, and use it as the instant reward to encourage behaviors that accord with the preference.

Problems to be Resolved: Before applying preference-based RL approaches to treatment recommendation, we need to address the following problems to ensure that the preference-based reward estimation is consistent with the human's objectives and that the well-trained agent provides reliable and interpretable policies to clinicians:
• Does the preference-based qualitative feedback really benefit policy learning compared to handcrafted rewards and other existing treatment recommendation approaches?
• How can the human's objectives be better optimized by learning a reward representation that can deal with incomparable policies?
• Is the reward function inferred by PRL interpretable, and does it faithfully follow human intentions?
• How should agent types be designed so that the resulting policies, together with the preference feedback, lead to more accurate reward estimation and better treatment outcomes?
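The state-based reward modification (SbRM) described in this section can be illustrated with a short sketch. It assumes a discrete action set, so that the maximum over actions is a simple enumeration, and it uses a toy function `toy_reward` as a stand-in for the learned estimator $r_{\theta_P}$; both the stand-in and the helper names are hypothetical.

```python
def state_value(reward_fn, state, actions):
    """h(s) = max over actions of r(s, a)."""
    return max(reward_fn(state, a) for a in actions)


def sbrm_reward(reward_fn, prev_state, state, actions):
    """Instant reward h(s_t) - h(s_{t-1}) sent to the agent under SbRM."""
    return (state_value(reward_fn, state, actions)
            - state_value(reward_fn, prev_state, actions))


def toy_reward(state, action):
    """Toy stand-in for the learned reward estimator (assumption)."""
    return -abs(state - action)


# Four discrete dosage levels, as in the cancer simulation.
DOSAGES = [0.1, 0.4, 0.7, 1.0]

shaped = sbrm_reward(toy_reward, prev_state=2.0, state=1.0, actions=DOSAGES)
```

In this toy case the best achievable reward improves from -1.0 in the previous state to 0.0 in the current one, so the agent receives a positive instant reward for the transition.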

5.1. SETTINGS

Medication Recommendation for General Cancer: In the 6-month simulation, the agent decides the dosage amount each month. 10,000 subjects are randomly sampled for training, 2,000 for validation and 2,000 for testing. We aim to learn optimal policies under three kinds of intentions: 1) maximizing the survival rate to obtain optimal clinical efficacy (CE); 2) additionally mitigating negative effects represented by the sum of the tumor size and the toxicity level after treatment (CE&OF-I); 3) additionally mitigating negative effects represented by two separate health signs, the highest toxicity level during the treatment and the final tumor size (CE&OF-II).

Blood Purification Recommendation for Sepsis: During the 100-hour simulation, the agent decides every 2 hours whether to perform a 2-hour operation. We randomly sample 3,000 subjects for training, and 1,000 for validation and testing. This is a partially observable MDP, and LSTMs are used for agent modeling. The agent learns policies to fulfill two intentions: 1) maximizing the survival rate to obtain optimal clinical efficacy (CE); 2) additionally avoiding overly frequent operations (CE&OF).

5.2. COMPARED APPROACHES

We benchmark the following existing approaches from the treatment recommendation literature:
• Non-learning (Zhao et al., 2009; Fürnkranz et al., 2012): 1) Constant: a static dosage amount is given to all subjects throughout the six months; 2) Random: one of the four dosage options is randomly selected at each time-step; 3) Upper Bound: subjects with Sepsis receive operations throughout the whole simulation period.
• Preference Learning (Fürnkranz et al., 2012): in Preference-Based Policy Iteration (PBPI), one action is preferred to another based on their outcomes after a certain number of simulations. At each step, the dosage with the highest preference is selected.
• Reinforcement Learning with Handcrafted Reward: 1) Single-objective RL (Schulman et al., 2015): the conventional policy gradient approach; it receives +1 for a survival outcome, -1 for death, and 0 for all intermediate steps. 2) Single-objective RL (Ensemble): of two agents, the one with better performance on the validation set is evaluated on the testing set. It is included for fair comparison.
• Reinforcement Learning with Preference-based Reward: to guide RL agent learning, both AbRM and SbRM are trained on preferences determined by the human's intentions. Since the preference-based reward is a non-stationary value approximated by a neural network, we implement the agents with policy gradient, which is robust to changes in the reward function (Ho & Ermon, 2016; Christiano et al., 2017).

5.3. BENCHMARK RESULTS

Medication Recommendation for General Cancer: We evaluate how well the different approaches pursue the treatment goals: maximizing the survival rate (CE) and mitigating negative impacts under CE&OF-I (sum of tumor size and toxicity level) in Table 1, and under CE&OF-II (highest toxicity level and final tumor size) in Table 2 (Appendix). When the Survival Rate is the only metric used to derive the preference between two policies, agents learning from either the action-based (31.52%) or the state-based (30.54%) preference reward save considerably more lives than Single-objective RL (26.96%), where the handcrafted reward penalizes policies with death outcomes. When negative impacts are to be mitigated in addition to saving lives, agents receiving preference-based rewards are able to maintain their clinical efficacy while incurring far fewer negative impacts. When the human's preference is defined as CE&OF-II, the highest toxicity level during the treatment and the final tumor size are two conflicting objectives to minimize. In Table 2, we observe that the preference-based reward guides the agent to policies with the highest survival rates, while one negative impact is reduced and the other increases compared with other approaches.

Blood Purification Recommendation for Sepsis: Bar charts of the performance of the different approaches, evaluated by Survival Rate and Number of Operations, are shown in Fig. 1. When guided by the preference-based reward rather than a manually crafted reward, both AbRM and SbRM achieve a slightly higher Survival Rate, while the average number of operations falls considerably, by 6.79% with AbRM and 14.50% with SbRM. Note that although Multi-objective RL leads to the fewest operations, its Survival Rate drops, which is an undesired trade-off between clinical efficacy optimization and negative impact mitigation.

5.4. EXAMINATION OF PREFERENCE-BASED RL APPROACHES

Advantages of Preference-based Reward over Handcrafted Reward: In Fig. 5 (Appendix), we first show that the sensitivity of policy behaviors to small changes in handcrafted rewards leads to unstable clinical efficacy, even when the relative importance of positive outcomes against negative ones remains unchanged. The three heatmaps of the agent's performance in response to different reward scalars reflect the difficulty of specifying an appropriate reward function that enables policy learning with optimal clinical efficacy during treatment recommendation. In Fig. 6 (Appendix), we further show the difficulty of selecting reward scalars for three factors (survival rate, last tumor size and maximum toxicity level) in the grid-search Multi-objective RL approach so as to appropriately prioritize clinical efficacy over negative impacts. To study whether the preference-based reward estimation can fully capture the human's intentions, we provide the RL agents with linear combinations of the preference-based reward and handcrafted rewards. The results ([...] 3, Appendix) show that learning from the preference-based reward alone is adequate to achieve high clinical efficacy, while its combination with handcrafted rewards distracts the RL agent from optimizing the human's actual intentions and finally leads to inferior performance. To study the generalizability of the reward estimator, we take the well-trained reward model from the 2-hour operation configuration for Sepsis treatment and use it as the pre-trained model for the 4-hour operation experiments. In Fig. 2b and Fig. 9b (Appendix), the reward estimator with knowledge transfer helps the agent learn faster: compared with learning from scratch, a reward estimator with good initialization from a different configuration provides better guidance to the agent.

Addressing Incomparable Policies: Given two policies for one sampled subject, if they have identical performance according to the human's objectives, the two policies are deemed incomparable. Since no clear preference can be drawn between two incomparable policies, the majority of existing work in preference learning disregards them directly (Fürnkranz et al., 2012; Cheng et al., 2011; Akrour et al., 2012; Schäfer & Hüllermeier, 2018; Christiano et al., 2017). Only comparable pairs, with either $\pi_m$ preferred to $\pi_n$ ($I(\pi_m(s_i) \succ \pi_n(s_i)) = 1$) or $\pi_n$ preferred to $\pi_m$ ($I(\pi_n(s_i) \succ \pi_m(s_i)) = 1$), are included in the training set to optimize preference approximation. However, preference learning based on comparable policies alone achieves quite unsatisfactory clinical efficacy in our treatment recommendation tasks. As shown in Fig. 3c for Cancer treatment recommendation, the survival rate (green curve) shows little improvement but large fluctuations during 400 epochs of training. Two reasons are likely to contribute to the failure: 1) a polarized preference (one policy preferred with probability 0.85 and the other with 0.15 in Fig. 3a) is inferred between two incomparable policies, although no such preference label is ever provided in the training set; 2) only around one-fifth of the policy pairs (2,000 comparable pairs from 10,000 sampled subjects) are leveraged in each epoch for the preference model update (green line in Fig. 3b). From this analysis, we find that excluding incomparable pairs from the training set leaves the parameterized model to explore the preference space arbitrarily and to infer a random preference over two policies even though they are incomparable. To avoid arbitrary exploration of the preference space, we handle incomparable pairs with a simple approach: treating both policies of an incomparable pair equally, i.e., $I(\pi_m(s_i) \succ \pi_n(s_i)) = I(\pi_n(s_i) \succ \pi_m(s_i)) = 0.5$. With this small but important augmentation of the preference indicator function, incomparable policies are efficiently utilized for better preference space exploration (preference approaching 0.5 as expected in Fig. 3a), more samples for the preference model update (all 10,000 samples from the training set participate in the loss minimization in Fig. 3b), and much higher clinical efficacy (a survival rate of more than 30% is achieved after the model converges in Fig. 3c).
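The handling of incomparable pairs described above amounts to using soft labels in the cross-entropy loss. A minimal sketch, assuming each training example is already reduced to the two expected returns and a label (1.0 if the first policy is preferred, 0.0 if the second is, 0.5 if the pair is incomparable); the function names are hypothetical:

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_loss(pairs):
    """Mean cross-entropy over (return_m, return_n, label) triples.

    label = 1.0: pi_m preferred; 0.0: pi_n preferred; 0.5: incomparable.
    """
    total = 0.0
    for r_m, r_n, label in pairs:
        log_p_m = -softplus(r_n - r_m)  # log p(pi_m preferred to pi_n)
        log_p_n = -softplus(r_m - r_n)  # log p(pi_n preferred to pi_m)
        total -= label * log_p_m + (1.0 - label) * log_p_n
    return total / len(pairs)


# For an incomparable pair the loss is smallest when both returns agree,
# which pulls the predicted preference towards 0.5.
loss_equal = preference_loss([(0.0, 0.0, 0.5)])
loss_polarized = preference_loss([(1.0, 0.0, 0.5)])
```

In this sketch a polarized prediction on an incomparable pair is penalized more than an even one, so such pairs contribute a useful training signal instead of being discarded.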

Interpretability in Inferred Rewards: To examine whether the preference-based reward matches the human's actual intentions, we visualize the expected return of the SbRM and AbRM agents for Cancer treatment during training, and its relationship with the resulting negative impacts during testing (OF-I), in Fig. 4 and Fig. 7 (Appendix), respectively. From Fig. 4a, we observe that the rising trend of the expected return matches the improving Survival Rate quite well, although the parameters of the reward estimator are updated at the same time. The estimated reward offers a reasonable explanation of policy performance: the higher the expected return, the better the policy. After the model converges, we analyze the distribution of expected returns for policies with different negative impacts. Since penalties or rewards are assigned to policies based on their outcomes only, the conventional Policy Gradient approach treats policies that end in survival but have different negative impacts equally (horizontal blue dots in Fig. 4b). With the preference-based reward, policies resulting in survival can be distinguished from each other: policies with smaller negative impacts have a much higher expected return. In Fig. 4c, policies leading to death have an extremely low expected return (approaching zero), while the expected return of policies with survival outcomes decreases in proportion to the amount of negative impact. As shown in Fig. 2a and Fig. 9a (Appendix), the expected return received by the agent for Sepsis treatment follows the same trend as the Survival Rate: the more lives the agent saves, the higher the expected return.

Influence of Agent Types on Treatment Outcomes: As depicted in Algorithm 1, the studied preference-based RL framework adopts two RL agents controlled by different parameters to infer the reward and learn the policy that optimizes the human's intentions. Here we study the influence of different agent designs on reward approximation and the resulting performance. Specifically, the reward function is estimated to approximate the preference over policies, of which the first policy is performed by one RL agent while the second can be executed by different agent types. The clinical efficacy curves shown in Fig. 8 (Appendix) empirically support the effectiveness of the current design with two different preference-based RL agents.

6. CONCLUSIONS AND FUTURE DIRECTIONS

To obtain optimal treatment policies based on diverse human objectives, we investigate the performance of preference-based Reinforcement Learning approaches, in which higher rewards are automatically estimated and assigned to actions that follow the human's actual intentions underlying the provided preference feedback. While interacting with the developed simulation platform, we resolve critical implementation problems and gain a deeper understanding of how to design preference-based RL approaches, in order to better aid clinicians in treatment decision making. In future work, we will consider more practical aspects of human preferences in adopting treatment strategies: 1) how to efficiently leverage preferences in reward learning if human feedback is limited, and 2) how to fully reflect the human's actual intentions in reward learning if both preference feedback and clinicians' demonstrations are provided.

A APPENDIX

A.1 FIGURES AND TABLES

The preference-based Reinforcement Learning framework is composed of two main modules, Preference-based Reward Learning and Preference-guided Agent Learning. In Preference-based Reward Learning, the reward estimator parameterized by $\theta_P$ delivers step-wise rewards to the two agents parameterized by $\theta^1_A$ and $\theta^2_A$ based on their policy preference. In Preference-guided Agent Learning, the agents update their parameters so as to optimize the clinicians' objectives. The pair of policies performed by the two agents on the sampled subject is stored in the policy pool and leveraged for the parameter update of the reward estimator, with the aim of ensuring a higher expected return for the preferred policy. We list the pseudocode for collaborative learning in Algorithm 1, Preference-based Reward Learning in Algorithm 2, and Preference-guided Agent Learning in Algorithm 3, respectively.

Collaborative Learning: Algorithm 1 illustrates the collaborative learning process between the two modules, which estimates the reward and learns policies for personalized treatment recommendation. In the beginning, the model parameters are randomly initialized (line 1), and the policy pools for the reward estimator and the two agents are created as empty sets (line 2). In each iteration, one subject is sampled from the training set for agent learning (line 3 to 5). At each simulation step, the two agents make decisions based on the current state, and the reward estimator generates the corresponding step-wise reward for each of them (line 6 to 10). The subject's internal state keeps updating until the simulation horizon is reached or the subject dies beforehand, according to the underlying mathematical model. The policy pools of the two agents are augmented with the trajectories of the most recently sampled subject (line 9 and 11), while the policy pool for the reward estimator is also updated (line 13) after computing the ground-truth preference label (line 12). After all samples have been used for policy generation, the reward estimator minimizes the classification loss of policy preference inference with Algorithm 2 (line 16), while the RL agents optimize the expected return with Algorithm 3 (line 17).

Algorithm 2
Require:
    $D_n$: sampled policy pairs in the $n$-th iteration
    $\theta_P$: parameters to update in the reward function
    $\gamma_P$: discount factor on reward
    $\beta$: step size for parameter update
 1: $L \leftarrow 0$
 2: for all $(\tau^1, \tau^2, \mathrm{pre}(\tau^1, \tau^2)) \in D_n$ do
 3:     $R(\tau^1; \theta_P) \leftarrow 0$, $R(\tau^2; \theta_P) \leftarrow 0$
 4:     for all $(s^1_t, a^1_t, r^1_{\theta_P,t}, s^1_{t+1}) \in \tau^1$ do
 5:         $R(\tau^1; \theta_P) \leftarrow R(\tau^1; \theta_P) + \gamma_P^t \, r^1_{\theta_P,t}$
 6:     end for
 7:     for all $(s^2_t, a^2_t, r^2_{\theta_P,t}, s^2_{t+1}) \in \tau^2$ do
 8:         $R(\tau^2; \theta_P) \leftarrow R(\tau^2; \theta_P) + \gamma_P^t \, r^2_{\theta_P,t}$
 9:     end for
10:     Compute $p(\tau^1 \succ \tau^2)$
11:     if $\tau^1 \succ \tau^2$ then
12:         $L \leftarrow L - \log p(\tau^1 \succ \tau^2)$
13:     else if $\tau^2 \succ \tau^1$ then
14:         $L \leftarrow L - \log\big(1 - p(\tau^1 \succ \tau^2)\big)$
15:     else if $\tau^1 \sim \tau^2$ then
16:         $L \leftarrow L - 0.5 \log p(\tau^1 \succ \tau^2) - 0.5 \log\big(1 - p(\tau^1 \succ \tau^2)\big)$
17:     end if
18: end for
19: Update $\theta_P \leftarrow \theta_P - \beta \nabla_{\theta_P} L$
20: return $\theta_P$

Preference-based Reward Learning: Given pairs of policies with corresponding preferences, the reward estimator updates its parameters to maximize the probability that the preferred policy achieves a higher expected return than the other. As shown in Algorithm 2, the discounted expected returns achieved by the two agents are first calculated for each sampled policy pair (line 3 to 9). The probability that $\tau^1$ is preferred to $\tau^2$, which increases with the expected return of $\tau^1$, is then computed as introduced in (Agresti & Kateri, 2011) (line 10). Hence $p(\tau^2 \succ \tau^1)$ is equal to $1 - p(\tau^1 \succ \tau^2)$. The loss value is then computed according to the preference relationship between the two policies, as the negative log-likelihood in Eq. (1) (line 11 to 17). Incomparable policy pairs are also leveraged in reward learning for better preference space exploration (line 15 to 16).

For $1 \le t \le 6$, the survival status is assumed to depend on both the current tumor size $y_t$ and the toxicity level $x_t$. The probability of a patient's death is modeled as follows:

Hazard function: $\lambda(t) = \exp(-4 + y_t + x_t)$,

Cumulative hazard function: $\Delta\Lambda(t) = \int_{t-1}^{t} \lambda(s)\,ds$.

Implementation Details: The action space is discrete, and the dosage amount is selected among 4 options: 0.1, 0.4, 0.7, 1.0 (Fürnkranz et al., 2012). For state initialization, the tumor size and the toxicity level in the 0th month are generated independently from the uniform distribution $U(0, 2)$. The simulation terminates after the 6th month, or earlier if the patient dies.
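The hazard model of the cancer simulation can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the hazard is treated as constant within each month, so that the cumulative hazard over one month equals $\lambda(t)$, and the cumulative hazard is converted to a death probability with the standard survival-analysis relation $1 - \exp(-\Delta\Lambda(t))$. The function names are hypothetical.

```python
import math


def hazard(y_t, x_t):
    """Hazard lambda(t) = exp(-4 + y_t + x_t) from tumor size and toxicity."""
    return math.exp(-4.0 + y_t + x_t)


def death_probability(y_t, x_t, interval=1.0):
    """Probability of death within one interval (one month by default).

    Assumes a constant hazard over the interval, so the cumulative hazard
    is lambda(t) * interval, and applies 1 - exp(-cumulative hazard).
    """
    cumulative_hazard = hazard(y_t, x_t) * interval
    return 1.0 - math.exp(-cumulative_hazard)


# A healthy state versus the worst state of the U(0, 2) initialization.
p_low = death_probability(0.0, 0.0)
p_high = death_probability(2.0, 2.0)
```

Under these assumptions the monthly death probability rises steeply with tumor size and toxicity, which is why a policy has to keep both quantities low to achieve a high survival rate.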

Model Implementation and Training

For the 6-month simulation, we randomly sample 10,000 subjects for training, 2,000 for validation, and 2,000 for testing. The neural networks for all deep learning approaches, including preference learning and reinforcement learning, share a similar structure and hyper-parameters: 2 fully-connected layers, the first followed by a ReLU activation and the second followed by an activation function that differs between approaches. In each epoch, the agent is updated after seeing all the training samples. The learning rate is set to 0.01 and all the networks converge after 400 epochs. For deep RL methods, we set the discount factor $\gamma$ to 1.

A.3.2 BLOOD PURIFICATION RECOMMENDATION FOR SEPSIS

Mathematical Modeling in Simulation Sepsis is initiated by spillover of pathogens into the blood, where the pathogen is allowed to spread throughout the organism in which systemic inflammation takes place (Stojkovic et al., 2016). Motivated by the promising results of blood purification in other critical illness conditions like acute kidney failure (Ronco et al., 2000), blood purification has gained attention as a potentially effective solution for septic subjects (Rimmelé & Kellum, 2011). In blood purification treatment, the patient is connected to an extracorporeal hemoadsorption device that removes harmful particles from the blood and leads the patient towards a healthy state. We employ the mathematical model derived by Song et al. to simulate the acute inflammation process in response to an infection (Song et al., 2012). Both heuristic knowledge about the mechanism underlying infection and real measurements from experiments on CLP-induced septic rats were leveraged for the model design. The distribution of initial physiological features and their interactions are derived from domain knowledge. The initial physiological features that characterize a subject accord with probability distributions based on real experimental measurements for septic rats. The parameters in the transition functions are calibrated so that the generated trajectories closely follow experimentally observed temporal patterns in septic rats. Figure 12 demonstrates the feature interaction network. There are 19 physiological features that govern sepsis dynamics, 8 of which are observable (features above the horizontal dashed line) while the remaining 11 are conceptual variables (features below the horizontal dashed line). When a blood purification operation is performed, three components in the circulation are eliminated (features marked by a red dashed ring), i.e., activated neutrophils $N_a$ and the pro- and anti-inflammatory mediators $PI$ and $AI$.
Besides effects from the blood purification operation, the variables influence each other's progression through ordinary differential equations (ODEs).

State Transition There are 18 ODEs to describe feature interactions and 3 ODEs for the hypothesized mechanism of blood purification. The hypothesized mechanism of action of blood purification is implemented by assuming that the hemoadsorption device eliminates only three components in the circulation during the treatment period: activated neutrophils ($N_a$), pro-inflammatory mediators ($PI$), and anti-inflammatory mediators ($AI$). We only show the transition equations of these three key features with and without operation here; ODEs concerning other features can be found in (Song et al., 2012).

The variable $PI$ stands for the extent of the systemic inflammation and progresses as follows:

\[
\begin{aligned}
\frac{dPI}{dt} ={}& \frac{B/B_\infty}{h_{PI\_B} + B/B_\infty}\left(1 - \frac{D^n}{h^n_{PI\_D} + D^n}\right)\left(1 - \frac{AI^n}{h^n_{PI\_AI} + AI^n}\right)(1 - PI) \\
&+ \left(1 - \frac{B/B_\infty}{h_{PI\_B} + B/B_\infty}\right)\frac{D^n}{h^n_{PI\_D} + D^n}\left(1 - \frac{AI^n}{h^n_{PI\_AI} + AI^n}\right)(1 - PI) \\
&+ \frac{B/B_\infty}{h_{PI\_B} + B/B_\infty}\,\frac{D^n}{h^n_{PI\_D} + D^n}\left(1 - \frac{AI^n}{h^n_{PI\_AI} + AI^n}\right)\cdots
\end{aligned}
\tag{3}
\]

The variable $AI$ describes the level of the anti-inflammation corresponding to systemically acting anti-inflammatory mediators and gets updated as follows:

\[
\begin{aligned}
\frac{dAI}{dt} = \Bigg[\, & \frac{PI^{n_1}}{h^{n_1}_{AI\_PI} + PI^{n_1}}\left(1 - \frac{N_a/N_\infty}{h_{AI\_Na} + N_a/N_\infty}\right)
+ \left(1 - \frac{PI^{n_1}}{h^{n_1}_{AI\_PI} + PI^{n_1}}\right)\frac{N_a/N_\infty}{h_{AI\_Na} + N_a/N_\infty} \\
&+ \frac{PI^{n_2}}{h^{n_2}_{AI\_PI} + PI^{n_2}}\,\frac{N_a/N_\infty}{h_{AI\_Na} + N_a/N_\infty} - AI \Bigg]\frac{1}{\tau_{AI}},
\end{aligned}
\]

\[
AI(t+1) =
\begin{cases}
AI(t) + \dfrac{dAI}{dt}(t) & \text{if no operation is performed,} \\[2ex]
AI(t) + \dfrac{dAI}{dt}(t) - \dfrac{AI}{h_{AIHA} + AI} & \text{otherwise,}
\end{cases}
\]

where the variables $h_{AI\_PI}$, $h_{AI\_Na}$, $\tau_{AI}$ are subject-specific parameters, $N_\infty$ is a predefined upper bound of neutrophils, $h_{AIHA} = 0.3$, $n_1 = 1$ and $n_2 = 3$.

In the transition of $N_a$, $N_r$ is resting blood neutrophils, $N_p$ is blood neutrophils, $N_s$ is neutrophils sequestered in the lung capillaries, the variables $h_{Nr\_Na}$, $h_{Np\_Na}$, $h_{Na\_Ns}$, $\tau_{Nr\_Na}$, $\tau_{Np\_Na}$, $\tau_{Na\_Ns}$ are subject-specific parameters, $h_{NaHA} = 0.3$, and $n = 3$.
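As an illustration of the AI transition above, the following Python sketch implements one update step with and without a blood purification operation. It is a sketch under stated assumptions rather than the simulator's code: it assumes that the decay term $-AI$ is grouped with the three activation terms and scaled by $1/\tau_{AI}$, that one update corresponds to a unit time step, and it applies no clamping. The function names and the parameter values in the usage example are hypothetical; real values are subject-specific.

```python
def hill(x: float, h: float, n: int = 1) -> float:
    """Saturating term x^n / (h^n + x^n) that appears throughout the ODEs."""
    xn = x ** n
    return xn / (h ** n + xn)


def d_ai_dt(ai, pi, na, *, h_ai_pi, h_ai_na, tau_ai, n_inf, n1=1, n2=3):
    """Right-hand side of the AI equation.

    Assumption: the -AI term sits inside the bracket scaled by 1/tau_AI.
    """
    pi_n1 = hill(pi, h_ai_pi, n1)
    pi_n2 = hill(pi, h_ai_pi, n2)
    na_term = hill(na / n_inf, h_ai_na, 1)
    drive = pi_n1 * (1.0 - na_term) + (1.0 - pi_n1) * na_term + pi_n2 * na_term
    return (drive - ai) / tau_ai


def step_ai(ai, pi, na, operate, params, h_aiha=0.3):
    """One transition AI(t) -> AI(t+1); the device removes AI / (h_AIHA + AI)."""
    new_ai = ai + d_ai_dt(ai, pi, na, **params)
    if operate:
        new_ai -= ai / (h_aiha + ai)
    return new_ai


if __name__ == "__main__":
    # Hypothetical subject-specific parameters, for illustration only.
    params = dict(h_ai_pi=0.5, h_ai_na=0.5, tau_ai=2.0, n_inf=1.0)
    print(step_ai(0.3, 0.4, 0.2, operate=False, params=params))
    print(step_ai(0.3, 0.4, 0.2, operate=True, params=params))
```

The only difference between the two branches is the hemoadsorption term, so comparing the two printed values isolates the effect of a single operation on AI.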

Survival Analysis

The survival status of the subject depends only on the value of the systemic pro-inflammatory response $PI$ at the end of the simulation. If the $PI$ value at the last time-step is smaller than the pre-defined threshold 0.5, the subject is assumed to be alive, otherwise dead. Note that after the blood purification process the $PI$ value decreases as time passes, hence one cannot conclude whether the subject is alive at intermediate time-steps. Only after the pre-defined simulation horizon is reached can we confirm which subjects survive with the help of treatment. This differs from the general Cancer Treatment model, where subjects have a probability of dying at intermediate time-steps.

Implementation Details Due to phenotype differences, some subjects survive without any blood purification operation while others die. This is consistent with laboratory experiments where 30% of rats survived till seven days while the remaining died between two to five days after CLP (Zhao et al., 2009). We call the survivor group the Survival Population and the non-survivor group the Death Population. The survival status of the Survival Population is not influenced by blood purification operations. Subjects from the Death Population have the potential to survive if proper treatment policies are delivered. Since we are primarily concerned with the outcomes of subjects from the Death Population, we only sample subjects from the Death Population in this paper to train and evaluate treatment policies. There are a few hyper-parameters that should be set in advance:
1) Simulation step size $\tau$: every $\tau$ time, the simulator updates the internal status of subjects by computing the ODEs with feature values from the last simulation step and the current action.
2) Simulation horizon length $T$: we can evaluate the performance of a policy by checking outcomes of subjects after time $T$.
3) Valid time range $L$ for patients to receive treatment: operations can take place at any time-step ($L = [0, T-1]$) or be constrained to predefined time intervals ($L \subset [0, T-1]$).
4) Frequency of decision-making $f$: subjects can receive operations at each simulation step $\tau$ or less frequently.
5) Duration of each blood purification operation $l$: turning the purification device on and off incurs costs, and it is unrealistic to attach and detach the device from the subject too frequently. Therefore, the purification duration should be pre-defined to rule out overly frequent actions.

To generate testable hypotheses that guide future laboratory experiments (Song et al., 2012; Stojkovic et al., 2016), the simulation of sepsis evolution should be configured so that the generated trajectory closely follows experimentally observed temporal patterns (Song et al., 2012). Further, several constraints can be imposed on the simulation in accordance with previous blood purification studies (Song et al., 2012; Stojkovic et al., 2016). Therefore we use the configuration listed in Table 4 for experiments.
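The five hyper-parameters and the end-of-horizon survival rule can be collected in a small configuration object. The Python sketch below is only illustrative: the class, field, and function names are hypothetical, the numeric values in the usage example are placeholders rather than the values of Table 4, and the frequency $f$ is expressed as a decision interval in hours, which is an assumption about its units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SepsisSimConfig:
    step_size_h: float           # tau: simulation step size (hours)
    horizon_h: float             # T: simulation horizon length (hours)
    treat_window_h: tuple        # L: (start, end) of the valid treatment range
    decision_interval_h: float   # f: time between consecutive decisions
    operation_duration_h: float  # l: duration of one purification operation

    def __post_init__(self):
        # A new decision before the previous operation ends would allow the
        # overly frequent on/off switching that the duration l rules out.
        if self.decision_interval_h < self.operation_duration_h:
            raise ValueError("decision interval shorter than operation duration")

    def decision_times(self) -> list:
        """Time points inside the treatment window at which an action is chosen."""
        start, end = self.treat_window_h
        count = int((end - start) // self.decision_interval_h) + 1
        return [start + k * self.decision_interval_h for k in range(count)]


def is_alive(final_pi: float, threshold: float = 0.5) -> bool:
    """Survival rule: alive if PI at the last time-step is below the threshold."""
    return final_pi < threshold


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cfg = SepsisSimConfig(1.0, 24.0, (0.0, 23.0), 4.0, 4.0)
    print(cfg.decision_times())
    print(is_alive(0.42))
```

Keeping the constraint check inside the configuration object means that an invalid combination of $f$ and $l$ fails at construction time rather than during a simulation run.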

Model Implementation and Training

We randomly sample 3,000 subjects for training, 1,000 for validation, and 1,000 for testing. Implementation details of the deep RL approaches are similar to those described for the Cancer task, except that the backend network is LSTM-based since this is a POMDP. Learning efficient treatment policies for septic subjects is more difficult than for Cancer due to the larger state space and the partially observable environment. Therefore, we adopt the following methods to ensure robust learning: 1) Mini-batch gradient descent with batch size 10,000 is adopted to update the parameters of the reward estimator and the RL agents. 2) The learning rate is 0.01 for the RL agents and 0.001 for the reward estimator. 3) As discussed in the Experiment Section, experience replay makes the estimated reward positively proportional to the Survival Rate. We randomly extract policy pairs from the latest 30,000 samples for model updates.
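The replay mechanism in item 3 can be sketched as a bounded buffer from which policy pairs are drawn at random. This Python sketch assumes that each stored sample is one trajectory and that the two members of a pair must be distinct samples; the class name `PolicyPairBuffer` and its methods are hypothetical.

```python
import random
from collections import deque


class PolicyPairBuffer:
    """Keeps only the most recent samples and draws random pairs from them."""

    def __init__(self, capacity: int = 30_000, seed=None):
        # A deque with maxlen silently drops the oldest sample when full.
        self._samples = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, trajectory) -> None:
        self._samples.append(trajectory)

    def __len__(self) -> int:
        return len(self._samples)

    def sample_pairs(self, num_pairs: int) -> list:
        """Return num_pairs pairs, each made of two distinct stored samples."""
        if len(self._samples) < 2:
            raise ValueError("need at least two stored samples to form a pair")
        pool = list(self._samples)
        return [tuple(self._rng.sample(pool, 2)) for _ in range(num_pairs)]


if __name__ == "__main__":
    buffer = PolicyPairBuffer(capacity=3, seed=0)
    for trajectory_id in range(5):
        buffer.add(trajectory_id)
    print(len(buffer), buffer.sample_pairs(2))
```

Restricting the pool to recent samples keeps the preference pairs close to the policies the agents currently produce, at the cost of discarding older comparisons.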



PREFERENCE-BASED RL FOR TREATMENT RECOMMENDATION
Require: $S$: initial states of sampled subjects; $N$: number of training iterations; $T$: the maximum simulation time to treat each subject
1: Randomly initialize $\theta_P$, $\theta^1_A$, $\theta^2_A$
2: $D = \emptyset$, $\Gamma^1 = \emptyset$, $\Gamma^2 = \emptyset$ // Initialize empty lists to store samples for reward and agent learning
3: for $n = 0$ to $N - 1$ do
4:

Figure 1: Performance for Sepsis blood purification recommendation. (a) Optimize clinical efficacy only

Figure 2: Sepsis treatment strategies recommended by the SbRM agent: (a) clinical efficacy and expected return during training, (b) reward transferability among different configurations. (a) SbRM Training

Figure 3: Cancer treatment recommendation for 10,000 training subjects with the PRL framework. Curves in green describe scenarios when incomparable policies are discarded while curves in blue show cases when comparable policies are preserved and efficiently utilized.

Figure 4: Cancer treatment recommendation: (a) clinical efficacy and expected return during training, and (b, c) expected return of policies ending with different negative impacts during testing.

Figure 5: Performance of RL in different hand-crafted reward designs to optimize clinical efficacy. (a) Random Seed 2001

Figure 10: For Cancer experiments, true expected return of DQN learning from behavioral policies of Policy Gradient and its estimations from different off-policy evaluation methods.

Survival function: $\Delta F(t) = \exp(-\Delta\Lambda(t))$, Death probability: $p_{\text{death}} = 1 - \Delta F(t)$.

$D_n \sim D$
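Given a batch $D_n$ of policy pairs sampled from $D$, the reward update of Algorithm 2 accumulates a preference log-likelihood over the pairs. The Python sketch below illustrates that accumulation under explicit assumptions: the preference probability is taken to be a logistic (Bradley-Terry style) function of the difference in discounted returns, which is one common choice and not necessarily the exact form used in the paper; time indices start at 0; and the per-step rewards are passed in as plain numbers instead of being produced by a parameterized reward network. The function names and preference labels are hypothetical.

```python
import math


def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma^t * r_t over one trajectory (lines 3 to 9), t from 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def preference_probability(return_1: float, return_2: float) -> float:
    """p(tau_1 preferred to tau_2); assumed logistic in the return difference."""
    diff = return_1 - return_2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)  # numerically stable branch for negative differences
    return e / (1.0 + e)


def preference_log_likelihood(pairs, gamma: float = 1.0, eps: float = 1e-12) -> float:
    """Accumulate L over (rewards_1, rewards_2, label) triples (lines 10 to 17).

    label is "first", "second", or "incomparable".
    """
    total = 0.0
    for rewards_1, rewards_2, label in pairs:
        p = preference_probability(
            discounted_return(rewards_1, gamma),
            discounted_return(rewards_2, gamma),
        )
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        if label == "first":
            total += math.log(p)
        elif label == "second":
            total += math.log(1.0 - p)
        elif label == "incomparable":
            total += 0.5 * math.log(p) + 0.5 * math.log(1.0 - p)
        else:
            raise ValueError(f"unknown preference label: {label}")
    return total


if __name__ == "__main__":
    batch = [([1.0, 1.0], [0.0, 0.0], "first"), ([0.5], [0.5], "incomparable")]
    print(preference_log_likelihood(batch))
```

In a full implementation the rewards would depend on $\theta_P$ and the update in line 19 would be obtained by automatic differentiation; with the accumulation written as a log-likelihood, improving the fit corresponds to ascending this quantity or, equivalently, descending its negative.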

Performance for Cancer medication recommendations. The best result per metric is marked in boldface. We present avg ± stdev values for all experiments averaged over 10 independent runs.

Performance for Cancer medication recommendation considering negative impacts from two factors: the tumor size in the end and the ever experienced maximum toxicity.

Evaluating the clinical efficacy (survival rate) achieved by the proposed preference-based RL framework when hand-crafted and preference-based rewards are linear combined with different ratios.

Require: $\Gamma_n$: sampled policies from one agent in the $n$-th iteration; $\alpha$: step size for parameter update; $M$: one of the two reward assignment methods; $L_{\theta_A}$: loss function in any deep RL approach parameterized by agent parameters $\theta_A$
1: $\varepsilon = \emptyset$
2: for all $(s_t, a_t, r_{\theta_P,t}, s_{t+1}) \in \Gamma_n$ do
3:   if $M$ is Action-based Reward Modification then
4:     $r_t \leftarrow r_{\theta_P}(s_t, a_t)$
5:   else if $M$ is State-based Reward Modification then
6:     $r_t \leftarrow h_{\theta_P}(s_t) - h_{\theta_P}(s_{t-1})$
7:   end if
8:   $\varepsilon \leftarrow \varepsilon \cup \{(s_t, a_t, r_t, s_{t+1})\}$
9: end for
10: Update $\theta_A \leftarrow \theta_A - \alpha \nabla_{\theta_A} L_{\theta_A}$

Preference-guided Agent Learning Each agent updates its parameters individually, as Algorithm 3 depicts. The agent receives rewards computed by either Action-based Reward Modification (lines 3 to 4) or State-based Reward Modification (lines 5 to 6). The resulting tuples $(s_t, a_t, r_t, s_{t+1})$ are then used to update the agent model, which can be implemented by any deep Reinforcement Learning approach.
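The two reward assignment methods of Algorithm 3 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: `r_theta` and `h_theta` stand in for the learned reward functions $r_{\theta_P}$ and $h_{\theta_P}$ and are passed as plain callables, the method labels `"action"` and `"state"` are hypothetical, and the first transition under the state-based method receives a reward of 0 because no previous state $s_{t-1}$ exists, which is an assumption not taken from the source.

```python
def assign_rewards(transitions, method, r_theta, h_theta):
    """Build the experience set from (s_t, a_t, s_{t+1}) transitions.

    method == "action": r_t = r_theta(s_t, a_t)             (lines 3 to 4)
    method == "state":  r_t = h_theta(s_t) - h_theta(s_{t-1}) (lines 5 to 6)
    """
    experience = []
    prev_state = None
    for state, action, next_state in transitions:
        if method == "action":
            reward = r_theta(state, action)
        elif method == "state":
            # Assumption: no previous state at the first step, so reward is 0.
            if prev_state is None:
                reward = 0.0
            else:
                reward = h_theta(state) - h_theta(prev_state)
        else:
            raise ValueError(f"unknown reward assignment method: {method}")
        experience.append((state, action, reward, next_state))
        prev_state = state
    return experience


if __name__ == "__main__":
    # Toy stand-ins for the learned reward functions.
    r_theta = lambda s, a: s + a
    h_theta = lambda s: 2.0 * s
    trajectory = [(1.0, 0, 2.0), (2.0, 1, 3.0)]
    print(assign_rewards(trajectory, "action", r_theta, h_theta))
    print(assign_rewards(trajectory, "state", r_theta, h_theta))
```

The returned tuples have the $(s_t, a_t, r_t, s_{t+1})$ layout that a standard deep RL update consumes, so the agent's learning rule does not need to know which assignment method produced the rewards.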

Configurations for Sepsis treatment simulation. h represents hour in the simulation platform.

The variable $N_a$ represents the activated blood neutrophils and transitions in each simulation step as

