AN EXAMINATION OF PREFERENCE-BASED REINFORCEMENT LEARNING FOR TREATMENT RECOMMENDATION

Abstract

Treatment recommendation is a complex multi-faceted problem with many conflicting objectives, e.g., optimizing the survival rate (or expected lifetime), mitigating negative side effects, reducing financial expenses and time costs, avoiding over-treatment, etc. While this complicates the hand-engineering of a reward function for learning treatment policies, qualitative feedback from human experts is fortunately readily available and easily exploited. Since direct estimation of rewards via inverse reinforcement learning is challenging and requires access to an optimal human policy, the field of treatment recommendation has recently witnessed the development of the preference-based Reinforcement Learning (PRL) framework, which infers a reward function from only qualitative and imperfect human feedback by ensuring that a human expert's preferred policy has a higher expected return than a less preferred policy. In this paper, we first present an open simulation platform that models the progression of two diseases, namely Cancer and Sepsis, and the reactions of the affected individuals to the received treatment. Second, through simulated experiments we investigate important problems in adopting preference-based RL approaches for treatment recommendation: the advantages of learning from preferences over hand-engineered rewards, addressing incomparable policies, reward interpretability, and agent design. The designed simulation platform and the insights obtained for preference-based RL approaches are beneficial for achieving the right trade-off between various human objectives during treatment recommendation.

1. INTRODUCTION

With recent advances in deep learning and open access to large-scale Electronic Health Records (EHRs), Deep Reinforcement Learning (RL) approaches have gained popularity for treatment recommendation (Raghu et al., 2017; Lopez-Martinez et al., 2019). However, the success of RL applications often crucially depends on the prior knowledge that goes into the definition of the reward function (Wirth et al., 2017). Treatment recommendation is a multi-faceted problem where the reward function is hard to engineer, as it requires quantifying the trade-off between diverse, realistic objectives. For instance, clinicians often aim to optimize the survival rate (or expected lifetime) while mitigating negative impacts of the treatment (Raghu et al., 2017; Lopez-Martinez et al., 2019; Wang et al., 2018). They must also keep in mind the patient's financial expenses and time costs when accepting treatment strategies (Faissol et al., 2007; Denton et al., 2009). Moreover, unnecessary treatment or over-treatment must be avoided, and certain agreements based on the patient's medical insurance plan also need to be followed for an affordable treatment (Nemati et al., 2016). To explicitly reflect human objectives in the reward function, prior work combines multiple objectives with a linearly weighted sum, reducing the problem to a single-objective MDP (Faissol et al., 2007; Denton et al., 2009). However, a linearly weighted reward function induces negative interference between objectives, especially when representations are learned with neural networks and shared among objectives, which goes against humans' actual intentions (Pham et al., 2018; Schaul et al., 2019).
Further, given clinicians' treatment strategies, the intrinsic reward function cannot be inferred accurately with existing inverse reinforcement learning (IRL) methods (Abbeel & Ng, 2004; Ho & Ermon, 2016), since these require access to samples from an optimal policy, which is not guaranteed in reality (Komorowski et al., 2018; Saria, 2018). Fortunately, qualitative feedback reflecting humans' preferences can be easily obtained and efficiently leveraged to infer reward functions. In this paper, we investigate preference-based Reinforcement Learning (PRL) approaches (Fürnkranz et al., 2012; Cheng et al., 2011; Akrour et al., 2012; Schäfer & Hüllermeier, 2018; Christiano et al., 2017) for treatment recommendation, where the reward is estimated from preferences over pairs of treatment strategies. Specifically, the reward estimator ensures that, within a policy pair, the policy preferred according to a human's objectives has a higher expected return. However, acceptance of PRL approaches for treatment recommendation requires significant exploration of their practical utility, reliability, and interpretability.

Contributions:

In this work, we first present an open simulation platform to investigate preference-based Reinforcement Learning approaches from the above aspects. The constructed simulator models the dynamic state transitions of different individuals with Cancer or Sepsis and their reactions to the received medication or operation treatment, enabling efficient model training and reliable performance evaluation. Next, we conduct comprehensive simulated experiments to address the following questions: 1) Does preference-based qualitative feedback really benefit policy learning compared to handcrafted rewards and other existing treatment recommendation approaches? 2) How can human objectives be better optimized by learning a reward representation that handles incomparable policies? 3) Is the reward function inferred by PRL interpretable, and does it faithfully follow human intentions? 4) How should agent types be designed so that the resulting policies, together with the preference feedback, lead to more accurate reward estimation and better treatment outcomes? Our experiments provide useful insights and guidance for developing preference-based RL approaches that realize the right trade-off between human objectives during treatment recommendation.

2. PROBLEM DEFINITION

We cast treatment policy learning as a Markov Decision Process (MDP). At time-step t, the state s_t is a vector composed of multiple health-related features, and the action a_t is either a scalar value representing a dosage amount or a boolean value denoting whether to perform an operation. Besides the effects of the conducted actions, features in the state influence each other's progression, which is simulated by the state transition probability function P(s_{t+1} | s_t, a_t). The agent aims to learn the optimal policy π* that maximizes the expected return V^{π*}(s_0) = max_π E[ Σ_{t=0}^{∞} γ^t r_t ], where γ ∈ [0, 1] is the discount factor and r_t is the reward estimated from preference feedback. Given two policies π_m and π_n starting from the same initial state s_i, π_m(s_i) ≻ π_n(s_i) denotes that policy π_m is preferred to π_n according to the human's objectives. Rather than using a hand-crafted reward function for the MDP, we aim to find a parameterized reward function r_θ^P that approximates the true reward function r underlying the human's preferences.
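To make this objective concrete, preference-based reward estimation is often implemented with a Bradley-Terry style model (as in Christiano et al., 2017): the probability that one trajectory is preferred is a logistic function of the difference in estimated returns, and the reward parameters are fit by cross-entropy against the preference labels. Below is a minimal sketch with a linear reward r_θ(s, a) = θ·φ(s, a) on synthetic 3-dimensional features; the feature dimension, learning rate, and toy data are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" linear reward, used only to generate synthetic preference labels.
true_w = np.array([1.0, -0.5, 0.3])

def sample_traj(T=10):
    """A trajectory is a T x 3 array of feature vectors phi(s_t, a_t)."""
    return rng.normal(size=(T, 3))

# Collect random trajectory pairs and label which one the simulated expert prefers.
pairs = [(sample_traj(), sample_traj()) for _ in range(300)]
labels = np.array([
    1.0 if a.sum(axis=0) @ true_w > b.sum(axis=0) @ true_w else 0.0
    for a, b in pairs
])

# Bradley-Terry model: P(A preferred over B) = sigmoid(R_theta(A) - R_theta(B)),
# where R_theta is the undiscounted return under the learned linear reward.
w = np.zeros(3)
lr = 0.05
for _ in range(200):
    grad = np.zeros(3)
    for (a, b), y in zip(pairs, labels):
        diff = a.sum(axis=0) - b.sum(axis=0)   # return difference is linear in w
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))  # predicted preference probability
        grad += (p - y) * diff                 # gradient of the cross-entropy loss
    w -= lr * grad / len(pairs)

# The learned reward should align (up to scale) with the hidden true reward.
cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
```

This mirrors the pairwise constraint above: the preferred trajectory's return under r_θ is pushed above that of the less preferred one. In practice the reward model would be a neural network over clinical state-action features rather than a fixed linear map.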

3.1. GENERAL CANCER AND DRUG TREATMENT SIMULATION

Following prior work (Fürnkranz et al., 2012), we use the mathematical model proposed by Zhao et al. (2009) to simulate the general cancer evolution and drug treatment effects.

State Transformation: The values of the next tumor size y_{t+1} and the toxicity level x_{t+1} are determined by the current drug amount d_t, their current values y_t, x_t, and their initial values y_0, x_0:

y_{t+1} = ReLU( y_t + [a_1 · max(x_t, x_0) − b_1 · (d_t − m_1)] × I(y_t > 0) ),
x_{t+1} = ReLU( x_t + a_2 · max(y_t, y_0) + b_2 · (d_t − m_2) ),

where I is the indicator function, which outputs 1 if the current tumor size y_t > 0 and 0 otherwise.

3.2. SEPSIS INFECTION AND BLOOD PURIFICATION SIMULATION

We employ the mathematical model derived by Song et al. (2012) to simulate the acute inflammation process in response to an infection. There are 19 physiological features that govern sepsis dynamics,


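The tumor-toxicity transition of Section 3.1 is straightforward to simulate; the following sketch rolls out a constant-dose policy. The coefficient values (a1, b1, m1, a2, b2, m2), initial condition, and dose are illustrative placeholders, not the calibrated parameters of Zhao et al. (2009):

```python
def relu(v):
    return max(v, 0.0)

def step(y, x, d, y0, x0, a1=0.15, b1=1.2, m1=0.5, a2=0.1, b2=1.2, m2=0.5):
    """One transition of the cancer model: y = tumor size, x = toxicity, d = drug dose."""
    indicator = 1.0 if y > 0 else 0.0  # I(y_t > 0): a cleared tumor does not regrow
    y_next = relu(y + (a1 * max(x, x0) - b1 * (d - m1)) * indicator)
    x_next = relu(x + a2 * max(y, y0) + b2 * (d - m2))
    return y_next, x_next

# Roll out a constant-dose policy from an initial condition.
y, x = y0, x0 = 1.0, 0.2
trajectory = [(y, x)]
for t in range(6):
    y, x = step(y, x, d=0.8, y0=y0, x0=x0)
    trajectory.append((y, x))
```

Under these placeholder values the tumor shrinks while toxicity accumulates, illustrating the survival-versus-side-effect trade-off that the preference feedback must arbitrate.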