QUASI-OPTIMAL REINFORCEMENT LEARNING WITH CONTINUOUS ACTIONS

Abstract

Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign dangerously high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support contains only near-optimal actions, shrinking the action-search region for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximation. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real-world dose suggestion application to the Ohio Type 1 diabetes dataset.

1. INTRODUCTION

Learning good strategies in a continuous action space is important for many real-world problems (Lillicrap et al., 2015), including precision medicine, autonomous driving, etc. In particular, when developing a new dynamic regime to guide the use of medical treatments, it is often necessary to decide the optimal dose level (Murphy, 2003; Laber et al., 2014; Chen et al., 2016; Zhou et al., 2021). In infinite-horizon sequential decision-making settings (Luckett et al., 2019; Shi et al., 2021), learning such a dynamic treatment regime falls into the reinforcement learning (RL) framework. Many RL algorithms (Mnih et al., 2013; Silver et al., 2017; Nachum et al., 2017; Chow et al., 2018b; Hessel et al., 2018) have achieved considerable success when the action space is finite. A straightforward way to adapt these methods to continuous domains is to discretize the continuous action space. However, this strategy either incurs a large bias under coarse discretization (Lee et al., 2018a; Cai et al., 2021a;b) or suffers from the curse of dimensionality under a fine grid (Chou et al., 2017). There has been recent progress on model-free reinforcement learning in continuous action spaces that avoids discretization. In policy-based methods (Williams, 1992; Sutton et al., 1999; Silver et al., 2014; Duan et al., 2016), a Gaussian distribution is frequently used to represent the policy, with its mean and variance parameterized via function approximation and updated by policy gradient. In addition, many actor-critic approaches, e.g., soft actor-critic (Haarnoja et al., 2018b), ensemble critic (Fujimoto et al., 2018), and Smoothie (Nachum et al., 2018a), have been developed to improve performance in continuous action spaces. These works likewise model a Gaussian policy for action allocation.
However, there are two less-investigated issues in the aforementioned RL approaches, especially in their applications to healthcare (Fatemi et al., 2021; Yu et al., 2021). First, existing methods that use an infinite-support Gaussian policy as the treatment policy may assign arbitrarily high dose levels, which may potentially harm the patient (Yanase et al., 2020). Hence, these approaches are not reliable in practice due to safety and ethical concerns. It would be more desirable to develop a policy class that identifies the near-optimal (Tang et al., 2020), or at least safe, action regions, and reduce

¶ Correspondence to: Wenzhuo Zhou <wenzhuz3@uci.edu> and Ruoqing Zhu <rqzhu@illinois.edu>
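The safety concern above can be made concrete with a small numerical sketch (all numbers here are hypothetical, not from the paper): because a Gaussian policy has infinite support, it places nonzero probability on dose levels outside any clinically safe interval, whereas a truncated (bounded-support) policy by construction never samples outside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned policy parameters and safe dose interval (illustrative only)
mean_dose, sd = 5.0, 2.0
safe_lo, safe_hi = 0.0, 8.0

# Infinite-support Gaussian policy: estimate the fraction of unsafe doses
doses = rng.normal(mean_dose, sd, size=100_000)
unsafe_frac = np.mean((doses < safe_lo) | (doses > safe_hi))

def truncated_gaussian(rng, mu, sigma, lo, hi, n):
    """Rejection-sample a Gaussian truncated to [lo, hi]: a simple
    bounded-support alternative to the plain Gaussian policy."""
    out = np.empty(0)
    while out.size < n:
        s = rng.normal(mu, sigma, size=n)
        s = s[(s >= lo) & (s <= hi)]
        out = np.concatenate([out, s])[:n]
    return out

safe_doses = truncated_gaussian(rng, mean_dose, sd, safe_lo, safe_hi, 10_000)

print(f"Gaussian policy: {unsafe_frac:.1%} of sampled doses fall outside [0, 8]")
print(f"Truncated policy sample range: [{safe_doses.min():.2f}, {safe_doses.max():.2f}]")
```

Under these illustrative parameters, roughly 7% of Gaussian-sampled doses land outside the safe interval, a nontrivial risk when each action is administered to a patient; restricting the policy's support, as the quasi-optimal policy class does, removes this failure mode entirely.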

