AN EXAMINATION OF PREFERENCE-BASED REINFORCEMENT LEARNING FOR TREATMENT RECOMMENDATION

Abstract

Treatment recommendation is a complex, multi-faceted problem with many conflicting objectives, e.g., optimizing the survival rate (or expected lifetime), mitigating negative impacts of treatment, reducing financial expenses and time costs, and avoiding over-treatment. While this complicates the hand-engineering of a reward function for learning treatment policies, qualitative feedback from human experts is readily available and can be easily exploited. Since direct estimation of rewards via inverse reinforcement learning is challenging and requires the existence of an optimal human policy, the field of treatment recommendation has recently witnessed the development of the preference-based Reinforcement Learning (PRL) framework, which infers a reward function from only qualitative and imperfect human feedback, such that a human expert's preferred policy has a higher expected return than a less preferred policy. In this paper, we first present an open simulation platform that models the progression of two diseases, namely Cancer and Sepsis, and the reactions of the affected individuals to the received treatment. Second, through simulated experiments, we investigate important problems in adopting preference-based RL approaches for treatment recommendation, such as the advantages of learning from preferences over hand-engineered rewards, handling incomparable policies, reward interpretability, and agent design. The designed simulation platform and the insights obtained for preference-based RL approaches are beneficial for achieving the right trade-off between various human objectives during treatment recommendation.

1. INTRODUCTION

With recent advances in deep learning and open access to large-scale Electronic Health Records (EHRs), Deep Reinforcement Learning (RL) approaches have gained popularity for treatment recommendation (Raghu et al., 2017; Lopez-Martinez et al., 2019). But the success of RL applications often crucially depends on the prior knowledge that goes into the definition of the reward function (Wirth et al., 2017). However, treatment recommendation is a multi-faceted problem where the reward function is hard to engineer and requires quantifying the trade-off between diverse, realistic objectives. For instance, clinicians often aim to optimize the survival rate (or expected lifetime) while mitigating the negative impacts of treatment (Raghu et al., 2017; Lopez-Martinez et al., 2019; Wang et al., 2018). They must also keep in mind the patient's financial expenses and time costs in accepting a treatment strategy (Faissol et al., 2007; Denton et al., 2009). Moreover, unnecessary treatment or over-treatment needs to be avoided, and certain agreements based on the patient's medical insurance plan also need to be followed for an affordable treatment (Nemati et al., 2016). To explicitly reflect these human objectives in the reward function, prior work jointly considers multiple objectives weighted linearly, i.e., combining objective-specific rewards r_1, ..., r_K into a single scalar reward r = w_1 r_1 + ... + w_K r_K, thereby reducing the problem to a single-objective MDP (Faissol et al., 2007; Denton et al., 2009). However, the linearly weighted reward function induces negative interference between objectives, especially when representations are learned using neural networks and shared among different objectives, which conflicts with the human's actual intentions (Pham et al., 2018; Schaul et al., 2019). Further, given clinicians' treatment strategies, the intrinsic reward function cannot be inferred accurately with existing inverse reinforcement learning (IRL) methods (Abbeel & Ng,

