QUASI-OPTIMAL REINFORCEMENT LEARNING WITH CONTINUOUS ACTIONS

Abstract

Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign dangerously high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support contains only near-optimal actions, shrinking the action-search area for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximation. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion application to the Ohio Type 1 diabetes dataset.

1. INTRODUCTION

Learning good strategies in a continuous action space is important for many real-world problems (Lillicrap et al., 2015), including precision medicine, autonomous driving, etc. In particular, when developing a new dynamic regime to guide the use of medical treatments, it is often necessary to decide the optimal dose level (Murphy, 2003; Laber et al., 2014; Chen et al., 2016; Zhou et al., 2021). In infinite-horizon sequential decision-making settings (Luckett et al., 2019; Shi et al., 2021), learning such a dynamic treatment regime falls into a reinforcement learning (RL) framework. Many RL algorithms (Mnih et al., 2013; Silver et al., 2017; Nachum et al., 2017; Chow et al., 2018b; Hessel et al., 2018) have achieved considerable success when the action space is finite. A straightforward approach to adapting these methods to continuous domains is to discretize the continuous action space. However, this strategy either incurs a large bias under coarse discretization (Lee et al., 2018a; Cai et al., 2021a;b) or suffers from the curse of dimensionality (Chou et al., 2017) under a fine grid. There has been recent progress on model-free reinforcement learning in continuous action spaces that avoids discretization. In policy-based methods (Williams, 1992; Sutton et al., 1999; Silver et al., 2014; Duan et al., 2016), a Gaussian distribution is frequently used to represent the policy, with its mean and variance parameterized via function approximation and updated by policy gradient. In addition, many actor-critic approaches, e.g., soft actor-critic (Haarnoja et al., 2018b), ensemble critic (Fujimoto et al., 2018), and Smoothie (Nachum et al., 2018a), have been developed to improve performance in continuous action spaces. These works likewise model a Gaussian policy for action allocation.
However, there are two less-investigated issues in the aforementioned RL approaches, especially for their applications in healthcare (Fatemi et al., 2021; Yu et al., 2021). First, existing methods that use an infinite-support Gaussian policy as the treatment policy may assign arbitrarily high dose levels, which may potentially harm the patient (Yanase et al., 2020). Hence, these approaches are not reliable in practice due to safety and ethical concerns. It would be more desirable to develop a policy class that identifies the near-optimal (Tang et al., 2020), or at least safe, action regions, and reduces the optimal-action search area for reliability and effectiveness. Actions outside the identified region are deemed non-optimal and screened out with zero density in the policy distribution. Second, for many real-world applications, the action space is bounded due to practical constraints. Examples include autonomous driving with a limited steering angle and dose assignment with a budget or safety constraint. In these scenarios, modeling an optimal policy by an infinite-support probability distribution, e.g., a Gaussian policy, inevitably introduces a non-negligible off-support bias, as shown in Figure 2. In consequence, the off-support bias damages the performance of policy learning and results in a biased decision-making procedure. Instead, constructing a policy class with finite but adjustable support is a promising solution.

In this work, we take a substantial step towards solving the aforementioned issues by developing a novel quasi-optimal learning algorithm. Our development hinges upon a novel quasi-optimal Bellman operator and stationarity equation, which is solved via minimizing an unbiased kernel embedding loss.

¶ Correspondence to: Wenzhuo Zhou <wenzhuz3@uci.edu> and Ruoqing Zhu <rqzhu@illinois.edu>
Quasi-optimal learning estimates an implicit stochastic policy distribution whose support region contains only near-optimal actions. In addition, our algorithm overcomes the difficulties of the non-smooth learning issue and the double sampling issue (Baird, 1995), and can be easily optimized using sampled transitions in off-policy scenarios without training instability or divergence. The main contributions of this paper can be summarized as follows:
• We construct a novel Bellman operator and develop a reliable stochastic policy class, which is able to identify quasi-optimal action regions in scenarios with a bounded or unbounded action space. This addresses the shortcomings of existing approaches that rely on modeling an optimal policy with infinite-support distributions.
• We formalize an unbiased learning framework for estimating the designed quasi-optimal policy. Our framework avoids the double sampling issue and can be optimized using sampled transitions, which is beneficial in offline policy optimization tasks.
• We thoroughly investigate the theoretical properties of the quasi-optimal learning algorithm, including the adaptability of the quasi-optimal policy class, the loss consistency, the finite-sample bound for the performance error, and the convergence analysis of the algorithm.
• Empirical analyses are conducted with comprehensive numerical experiments and a real-world case study to evaluate the model's performance in practice.

2. PRELIMINARIES

Notations We first introduce our notation. For two strictly positive sequences {Ψ(m)}_{m≥1} and {Υ(m)}_{m≥1}, the notation {Ψ(m)}_{m≥1} ≲ {Υ(m)}_{m≥1} means that there exists a constant c > 0 such that Ψ(m) ≤ cΥ(m) for all m. ∥·∥_{L_p} and ∥·∥_∞ denote the L_p norm and the supremum norm, respectively. We define the set indicator function 1_{set}(x) = 1 if x ∈ set and 0 otherwise. The notation P_n denotes the empirical measure, i.e., P_n f = (1/n) ∑_{i=1}^n f(X_i). For two sets ℵ_0 and ℵ_1, the notation ℵ_0 \ ℵ_1 denotes the set ℵ_0 excluding the elements of ℵ_1. We write |ℵ_0| for the cardinality of the set ℵ_0. For any Borel set ℵ_2, we denote by σ(ℵ_2) the Borel measure of ℵ_2. We denote a probability simplex over a space F by ∆(F); in particular, ∆_{convex}(F) denotes the convex probability simplex over F. We denote by ⌊·⌋ the floor function, and use O(·) with its standard big-O convention. Background A Markov decision process (MDP) is defined as a tuple ⟨S, A, P, R, γ⟩, where S is the state space, A is the action space, P : S × A → ∆(S) is the unknown transition kernel, R : S × S × A → ℝ is a bounded reward function, and γ ∈ [0, 1) is the discount factor. In this paper, we focus on the scenario of a continuous action space. We assume the offline data consist of n i.i.d. trajectories, i.e., D_{1:n} = {S^1_i, A^1_i, R^1_i, S^2_i, . . . , S^T_i, A^T_i, R^T_i, S^{T+1}_i}_{i=1}^n, where the trajectory length T is assumed to be non-random for simplicity. A policy π is a map from the state space to the action space, π : S → A. The learning goal is to search for an optimal policy π* that maximizes the expected discounted sum of rewards. The value function under a policy π is V^π_t(s) = E_π[∑_{k=1}^∞ γ^{k−1} R_{t+k} | S_t = s], where the expectation E_π is taken assuming the system follows policy π, and the Q-function is defined as Q^π_t(s, a) = E_π[∑_{k=1}^∞ γ^{k−1} R_{t+k} | S_t = s, A_t = a].
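As a concrete illustration of the discounted-return definition above, the quantity ∑_{k≥1} γ^{k−1} R_{t+k} can be computed from a finite reward sequence as follows. This is a minimal sketch; the function name and the truncation to a finite horizon are our own illustration, not part of the paper's method.

```python
def discounted_return(rewards, gamma=0.9):
    """Finite-horizon truncation of sum_{k>=1} gamma^(k-1) * R_{t+k}.

    `rewards[k-1]` plays the role of R_{t+k}; names are illustrative.
    """
    return sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1))

# e.g., with gamma = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
```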
In a time-homogeneous Markov process (Puterman, 2014), V^π_t(s) and Q^π_t(s, a) do not depend on t. The optimal value function V* is the unique fixed point of the Bellman operator B, defined by (BV)(s) := max_a E_{S_{t+1}∼P(s,a)}[R_t + γV(S_{t+1}) | S_t = s, A_t = a]. Then BV*(s) = V*(s) for any s ∈ S. An optimal policy π* can be obtained by taking the greedy action with respect to Q*(s, a), that is, π*(s) = arg max_a Q*(s, a). For the rest of the paper, we use the shorthand E_{s′|s,a} for the conditional expectation E_{s′∼P(s,a)}; and E_{S_t,A_t,S_{t+1}} is short for E_{S_t∼υ, A_t∼π_b(·|S_t), S_{t+1}∼P(S_t,A_t)}, where υ is some fixed distribution and π_b is some behavior policy.
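To make the fixed-point characterization concrete, the sketch below applies the Bellman operator B repeatedly on a small discretized toy MDP; since B is a γ-contraction, the iterates converge to V*, from which a greedy policy is extracted. All specifics here (state/action sizes, the random transition kernel P, the reward table R) are illustrative assumptions, not part of this paper's quasi-optimal method.

```python
import numpy as np

# Toy finite MDP (illustrative only): 3 states, 2 discretized actions.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = dist over s'
R = rng.standard_normal((n_states, n_actions))                    # expected reward r(s, a)

def bellman(V):
    """(BV)(s) = max_a E_{s' ~ P(s,a)}[ r(s, a) + gamma * V(s') ]."""
    return np.max(R + gamma * P @ V, axis=1)

V = np.zeros(n_states)
for _ in range(500):   # contraction: ||B V - V*|| <= gamma * ||V - V*||, so this converges
    V = bellman(V)
pi_star = np.argmax(R + gamma * P @ V, axis=1)  # greedy policy pi*(s) = argmax_a Q*(s, a)
```

Value iteration is only feasible after discretizing the action space, which is exactly the strategy whose bias/dimensionality trade-off motivates the continuous-action treatment in this paper.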

