QUASI-OPTIMAL REINFORCEMENT LEARNING WITH CONTINUOUS ACTIONS

Abstract

Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign dangerously high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support contains only near-optimal actions, and to shrink the action-search region for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximation. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real dose-suggestion application on the Ohio Type 1 diabetes dataset.

1. INTRODUCTION

Learning good strategies in a continuous action space is important for many real-world problems (Lillicrap et al., 2015), including precision medicine and autonomous driving. In particular, when developing a new dynamic regime to guide the use of medical treatments, it is often necessary to decide the optimal dose level (Murphy, 2003; Laber et al., 2014; Chen et al., 2016; Zhou et al., 2021). In infinite-horizon sequential decision-making settings (Luckett et al., 2019; Shi et al., 2021), learning such a dynamic treatment regime falls into a reinforcement learning (RL) framework. Many RL algorithms (Mnih et al., 2013; Silver et al., 2017; Nachum et al., 2017; Chow et al., 2018b; Hessel et al., 2018) have achieved considerable success when the action space is finite. A straightforward approach to adapting these methods to continuous domains is to discretize the continuous action space. However, this strategy either incurs a large bias under coarse discretization (Lee et al., 2018a; Cai et al., 2021a; b) or suffers from the curse of dimensionality (Chou et al., 2017) under a fine grid.

There has been recent progress on model-free reinforcement learning in continuous action spaces without discretization. In policy-based methods (Williams, 1992; Sutton et al., 1999; Silver et al., 2014; Duan et al., 2016), a Gaussian distribution is frequently used to represent the policy distribution, while its mean and variance are parameterized using function approximation and updated via policy gradient descent. In addition, many actor-critic based approaches, e.g., soft actor-critic (Haarnoja et al., 2018b), ensemble critic (Fujimoto et al., 2018) and Smoothie (Nachum et al., 2018a), have been developed to improve the performance in continuous action spaces. These works also aim to model a Gaussian policy for action allocation.
However, there are two less-investigated issues in the aforementioned RL approaches, especially for their applications in healthcare (Fatemi et al., 2021; Yu et al., 2021). First, existing methods that use an infinite-support Gaussian policy as the treatment policy may assign arbitrarily high dose levels, which may harm the patient (Yanase et al., 2020). Hence, these approaches are not reliable in practice due to safety and ethical concerns. It would be more desirable to develop a policy class that identifies the near-optimal (Tang et al., 2020), or at least safe, action regions, and reduces the optimal action search area for reliability and effectiveness. Actions outside the identified region are classified as non-optimal and are screened out with zero density in the policy distribution. Second, for many real-world applications, the action spaces are bounded due to practical constraints. Examples include autonomous driving with a limited steering angle and dose assignment with a budget or safety constraint. In these scenarios, modeling an optimal policy by an infinite-support probability distribution, e.g., a Gaussian policy, would inevitably introduce a non-negligible off-support bias, as shown in Figure 2. Consequently, the off-support bias damages the performance of policy learning and results in a biased decision-making procedure. Constructing a policy class with finite but adjustable support is therefore a desirable solution.

In this work, we take a substantial step towards solving the aforementioned issues by developing a novel quasi-optimal learning algorithm. Our development hinges upon a novel quasi-optimal Bellman operator and stationarity equation, which is solved by minimizing an unbiased kernel embedding loss. Quasi-optimal learning estimates an implicit stochastic policy distribution whose support region contains only near-optimal actions.
In addition, our algorithm overcomes the non-smooth learning issue and the double sampling issue (Baird, 1995), and can be easily optimized using sampled transitions in off-policy scenarios without training instability or divergence. The main contributions of this paper can be summarized as follows:
• We construct a novel Bellman operator and develop a reliable stochastic policy class, which is able to identify quasi-optimal action regions in scenarios with a bounded or unbounded action space. This addresses the shortcomings of existing approaches that rely on modeling an optimal policy with infinite-support distributions.
• We formalize an unbiased learning framework for estimating the designed quasi-optimal policy. Our framework avoids the double sampling issue and can be optimized using sampled transitions, which is beneficial in offline policy optimization tasks.
• We thoroughly investigate the theoretical properties of the quasi-optimal learning algorithm, including the adaptability of the quasi-optimal policy class, the loss consistency, the finite-sample bound for the performance error, and the convergence analysis of the algorithm.
• We conduct empirical analyses with comprehensive numerical experiments and a real-world case study to evaluate the model performance in practice.

2. PRELIMINARIES

Notations We first introduce our notation. For two strictly positive sequences $\{\Psi(m)\}_{m\ge 1}$ and $\{\Upsilon(m)\}_{m\ge 1}$, the notation $\{\Psi(m)\}_{m\ge 1} \lesssim \{\Upsilon(m)\}_{m\ge 1}$ means that there exists a constant $c > 0$ such that $\Psi(m) \le c\,\Upsilon(m)$ for all $m \ge 1$. $\|\cdot\|_{L^p}$ and $\|\cdot\|_{\infty}$ denote the $L^p$ norm and the supremum norm, respectively. We define the set indicator function $\mathbb{1}_{\text{set}}(x) = 1$ if $x \in \text{set}$ and $0$ otherwise. The notation $\mathbb{P}_n$ denotes the empirical measure, i.e., $\mathbb{P}_n = \frac{1}{n}\sum_{i=1}^{n}$. For two sets $\aleph_0$ and $\aleph_1$, the notation $\aleph_0 \setminus \aleph_1$ denotes the set $\aleph_0$ excluding the elements of $\aleph_1$. We write $|\aleph_0|$ for the cardinality of the set $\aleph_0$. For any Borel set $\aleph_2$, we denote by $\sigma(\aleph_2)$ the Borel measure of $\aleph_2$. We denote a probability simplex over a space $\mathcal{F}$ by $\Delta(\mathcal{F})$, and in particular, $\Delta_{\text{convex}}(\mathcal{F})$ indicates the convex probability simplex over $\mathcal{F}$. We denote by $\lfloor\cdot\rfloor$ the floor function, and use $O$ in its conventional (big-O) sense.

Background A Markov decision process (MDP) is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the unknown transition kernel, $R : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a bounded reward function, and $\gamma \in [0, 1)$ is the discount factor. In this paper, we focus on the scenario of a continuous action space. We assume the offline data consist of $n$ i.i.d. trajectories, i.e., $D_{1:n} = \{S_i^1, A_i^1, R_i^1, S_i^2, \ldots, S_i^T, A_i^T, R_i^T, S_i^{T+1}\}_{i=1}^{n}$, where the trajectory length $T$ is assumed to be non-random for simplicity. A policy $\pi$ is a map from the state space to the action space, $\pi : \mathcal{S} \to \mathcal{A}$. The learning goal is to search for an optimal policy $\pi^*$ which maximizes the expected discounted sum of rewards. The value function under a policy $\pi$ is $V_t^{\pi}(s) = E^{\pi}\big[\sum_{k=1}^{\infty} \gamma^{k-1} R^{t+k} \mid S^t = s\big]$, where $E^{\pi}$ is taken by assuming that the system follows the policy $\pi$, and the Q-function is defined as $Q_t^{\pi}(s, a) = E^{\pi}\big[\sum_{k=1}^{\infty} \gamma^{k-1} R^{t+k} \mid S^t = s, A^t = a\big]$.
In a time-homogeneous Markov process (Puterman, 2014), $V_t^{\pi}(s)$ and $Q_t^{\pi}(s, a)$ do not depend on $t$. The optimal value function $V^*$ is the unique fixed point of the Bellman operator $\mathcal{B}$,

$$\mathcal{B}V(s) := \max_{a} E_{S^{t+1} \sim P(s,a)}\big[R^t + \gamma V(S^{t+1}) \mid S^t = s, A^t = a\big].$$

Then $\mathcal{B}V^*(s) = V^*(s)$ for any $s \in \mathcal{S}$. An optimal policy $\pi^*$ can be obtained by taking the greedy action of $Q^*(s, a)$, that is, $\pi^*(s) = \arg\max_a Q^*(s, a)$. For the rest of the paper, we use the short notation $E_{s'|s,a}$ for the conditional expectation $E_{s' \sim P(s,a)}$, and $E_{S^t, A^t, S^{t+1}}$ is short for $E_{S^t \sim \upsilon,\, A^t \sim \pi_b(\cdot|S^t),\, S^{t+1} \sim P(S^t, A^t)}$, where $\upsilon$ is some fixed distribution and $\pi_b$ is some behavior policy.
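The fixed-point property $\mathcal{B}V^* = V^*$ and the greedy policy can be illustrated on a toy problem. The following is a minimal sketch, assuming a small finite MDP with randomly generated transition and reward arrays; the names `bellman_backup` and `value_iteration`, the array layout, and the finite action set are assumptions made purely for illustration and are not part of the proposed method.

```python
import numpy as np

def bellman_backup(V, P, R, gamma):
    """One application of the Bellman operator on a finite toy MDP.

    P[s, a, t] is the probability of moving from s to t under action a,
    R[s, a, t] is the reward of that transition (hypothetical layout).
    """
    # Q[s, a] = sum_t P[s, a, t] * (R[s, a, t] + gamma * V[t])
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q.max(axis=1), Q

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the backup until successive iterates differ by less than tol."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new, _ = bellman_backup(V, P, R, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

rng = np.random.default_rng(0)
n_states, n_actions = 3, 4
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize rows into distributions
R = rng.random((n_states, n_actions, n_states))
gamma = 0.9

V_star = value_iteration(P, R, gamma)
BV, Q_star = bellman_backup(V_star, P, R, gamma)
greedy = Q_star.argmax(axis=1)             # greedy action per state
```

Because the backup is a $\gamma$-contraction in the supremum norm, the iteration converges from any starting point; the hard maximum in the backup is the non-smooth component that the quasi-optimal operator is designed to replace.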

3. METHODOLOGY

To start with, we revisit the Bellman optimality equation from a policy-explicit view,

$$\mathcal{B}V^*(s) := \max_{\pi} E_{a \sim \pi(\cdot|s),\, S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V^*(S^{t+1})\big] = V^*(s). \tag{1}$$

To obtain the optimal policy $\pi^*$ and value function $V^*$, a natural idea is to minimize the discrepancy between the two sides of the equation under an $L^2$ loss. Unfortunately, there are several major challenges when it comes to optimization:
(1) Non-smoothness: the Bellman operator involves a non-smooth hard-max operator, which leads to training instability;
(2) Policy class: as discussed in Section 1, it is necessary to induce an optimal policy class whose support consists of quasi-optimal sub-regions for reliability, and which avoids the off-support bias in Figure 2;
(3) Double sampling: obtaining an unbiased sample approximation of the unknown conditional expectation $E_{S^{t+1}|s,a}$ requires double sampling, which is usually infeasible in real-world environments;
(4) Off-policy data: it is not easy to incorporate off-policy data when directly minimizing the Bellman error.
To address these issues, we propose a quasi-optimal counterpart of the Bellman equation (1).

3.1. QUASI-OPTIMAL BELLMAN OPERATOR

In this subsection, we aim to tackle the first two challenges. We propose a quasi-optimal counterpart of the Bellman operator $\mathcal{B}$ that circumvents the non-smoothness obstacle and, at the same time, induces a novel policy class which can identify quasi-optimal sub-regions in continuous action spaces. We leverage the Legendre-Fenchel transform (Hiriart-Urruty & Lemaréchal, 2012) on the Bellman operator $\mathcal{B}$. For a convex probability simplex $\Delta_{\text{convex}}(\mathcal{A})$ and a strongly convex and continuous proximity function $\text{prox}(\pi) : \Delta_{\text{convex}}(\mathcal{A}) \to \mathbb{R}$, the Fenchel transform counterpart of $\mathcal{B}$ is defined as

$$\mathcal{B}_{\mu} V^*_{\mu}(s) = \max_{\pi \in \Delta_{\text{convex}}(\mathcal{A})} \int_{a \in \mathcal{A}} \big[Q^*_{\mu}(s, a)\,\pi(a|s) + \mu\,\text{prox}(\pi(a|s))\big]\, da,$$

where $Q^*_{\mu}(s, a) = E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V^*_{\mu}(S^{t+1})\big]$, and $V^*_{\mu}(s)$ is the unique fixed point of the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$. Note that, besides the smoothing purpose, we are also interested in constructing a stochastic optimal policy class that can screen out the non-optimal and sub-optimal actions. Therefore, we further define a special prox function class motivated by the rationale of the q-logarithm as

$$\text{prox}(x) = \log_q(x) := \frac{x\,(1 - x^{q-1})}{q-1},$$

where $\int_{a \in \mathcal{A}} \text{prox}(\pi(a|s))\, da = \frac{1}{q-1}\big(1 - \int_{a \in \mathcal{A}} \pi^q(a|s)\, da\big)$ essentially generalizes Shannon's entropy (Martins et al., 2020). In this paper, we focus on the setting $q = 2$.

Assumption 3.1. For any policy distribution $\pi \in \Delta_{\text{convex}}(\mathcal{A})$, its density is bounded above by a constant, i.e., $\pi(\cdot|s) \le C$ for all $s \in \mathcal{S}$.

This assumption avoids extreme cases where a stochastic policy distribution degenerates to a deterministic one. In the following, we show several nice properties of the proposed Bellman operator.

Proximal Approximation The operator $\mathcal{B}_{\mu}$ is a proximal approximation to $\mathcal{B}$. This delivers two messages: first, the approximation bias is upper bounded; second, the operator $\mathcal{B}_{\mu}$ is a smoothed substitute for $\mathcal{B}$. In particular, Theorem 3.1 demonstrates that the approximation bias vanishes as $\mu$ becomes small enough.
In addition, the operator $\mathcal{B}_{\mu}$ has a differentiable and analytical form (3), which justifies that $\mathcal{B}_{\mu}$ is a smoothed counterpart of $\mathcal{B}$; see Corollary S.1 in the Appendix for details.

Theorem 3.1 (Proximal bias). Under Assumption 3.1, for any $s \in \mathcal{S}$ and value function $V$, $\mathcal{B}_{\mu}V(s) - \mathcal{B}V(s) \in [\mu(1 - C), \mu]$.

The analytical form of $\mathcal{B}_{\mu}$ is

$$\mathcal{B}_{\mu} V^*_{\mu}(s) = \mu - \frac{1}{4\mu}\left[\frac{\big(\int_{a' \in W_s} Q^*_{\mu}(s, a')\, da' - 2\mu\big)^2}{\sigma(W_s)} - \int_{a \in W_s} {Q^*_{\mu}}^2(s, a)\, da\right], \tag{3}$$

where $W_s$ denotes the support of $\pi^*_{\mu}$ in (4) for a given state $s$.

Quasi-optimal Support Region In addition to the proximal approximation property, another unique and important property of $\mathcal{B}_{\mu}$ is that it induces a policy $\pi^*_{\mu}$ whose support region contains all the actions with action-value higher than a certain threshold. The induced policy $\pi^*_{\mu}$ is bridged from the oracle Q-function:

$$\pi^*_{\mu}(a|s) = \frac{Q^*_{\mu}(s, a)}{2\mu} - \frac{\int_{a \in W_s} Q^*_{\mu}(s, a)\, da}{2\mu\,\sigma(W_s)} + \frac{1}{\sigma(W_s)}, \tag{4}$$

where the support of $\pi^*_{\mu}$ is $W_s := \bigcup_{a \in \mathcal{A}} a\,\mathbb{1}_{\text{screening set}}(a)$ with

$$\text{screening set} := \Big\{a \in \mathcal{A} : \int_{a' \in M_s(a)} Q^*_{\mu}(s, a')\, da' - \sigma(M_s(a))\,Q^*_{\mu}(s, a) > 2\mu\Big\}, \tag{5}$$

$$M_s(a) := \bigcup_{a' \in \mathcal{A}} a'\,\mathbb{1}_{\{Q^*_{\mu}(s, a') > Q^*_{\mu}(s, a)\}}(a'). \tag{6}$$

Figure 1: An illustrative example of the quasi-optimal sub-regions. In the left panel, the lowest admissible action-value corresponds to the horizontal red dashed line, and the integral difference is the shaded pink area, which equals $2\mu$. As shown in the right panel, when $\mu$ decreases, the pink area shrinks, and the quasi-optimal sub-regions become narrower.

This mechanism allows us to identify multiple sub-regions of the action space that contain only near-optimal actions, and to weed out the sub-optimal and non-optimal regions. Note that the identified sub-regions need not be connected in general, which is beneficial when the true Q-function has multiple modes.
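The thresholding behind (3) and (4) can be sketched numerically for $q = 2$. The following is a minimal illustration, assuming the action space is discretized on a uniform grid and that the Q-values at one state are given; the function name `quasi_optimal_policy`, the bimodal toy Q-function, and the sort-based threshold search are assumptions of this sketch rather than the paper's implementation. The threshold $\tau$ plays the role of the lowest admissible action-value in Figure 1: the area of $Q$ above $\tau$ equals $2\mu$.

```python
import numpy as np

def quasi_optimal_policy(q, da, mu):
    """Density on a uniform action grid: pi = max(q - tau, 0) / (2 * mu).

    tau is chosen so that sum(max(q - tau, 0)) * da == 2 * mu, which makes
    the density integrate to one (discrete analogue of Eq. (4)).
    """
    q_sorted = np.sort(q)[::-1]                 # Q-values, largest first
    csum = np.cumsum(q_sorted)
    k = np.arange(1, q.size + 1)
    tau_k = (csum - 2.0 * mu / da) / k          # candidate threshold for top-k support
    k_star = k[q_sorted > tau_k][-1]            # largest consistent support size
    tau = tau_k[k_star - 1]
    pi = np.maximum(q - tau, 0.0) / (2.0 * mu)
    return pi, tau

# Toy bimodal Q-function at a single state (illustrative values only).
actions = np.linspace(-3.0, 3.0, 601)
da = actions[1] - actions[0]
q = np.maximum(1.0 - (actions + 1.5) ** 2, 0.8 - (actions - 1.5) ** 2)
mu = 0.2

pi, tau = quasi_optimal_policy(q, da, mu)
support = pi > 0

# Regularized objective evaluated directly at pi ...
direct = np.sum(q * pi) * da + mu * (1.0 - np.sum(pi ** 2) * da)
# ... and through the analytical form (3), restricted to the support.
int_q = np.sum(q[support]) * da
int_q2 = np.sum(q[support] ** 2) * da
sigma = np.sum(support) * da
closed = mu - ((int_q - 2.0 * mu) ** 2 / sigma - int_q2) / (4.0 * mu)
```

In this toy example the support splits into two disjoint intervals around the two modes, while actions between the modes receive zero density; decreasing `mu` raises the threshold and narrows the support.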
The screening set in (5) indicates that the threshold parameter $\mu$ not only controls the degree of smoothness, but also determines how the quasi-optimal region behaves and controls the screening intensity, as shown in Figure 1.

3.2. q-GAUSSIAN POLICY DISTRIBUTION

Figure 2: An illustrative example of a bounded action space and the q-Gaussian policy distribution. The Gaussian policy assigns non-zero probability density to all actions, even those outside the boundary of the true action space. This causes the off-support bias. In contrast, the q-Gaussian policy relieves such off-support bias thanks to the boundedness of the quasi-optimal region.

In this section, we bridge the induced policy distribution $\pi^*_{\mu}$ to an explainable q-Gaussian distribution. The q-Gaussian distribution is less prone to heavy tails, which makes it widely used in practice to model the effect of external stochasticity (d'Onofrio, 2013). In continuous action problems, e.g., medical dose suggestion, the q-Gaussian distribution is a more suitable choice than the Gaussian distribution for policy modeling, since it can filter out non-optimal and risky dose levels, i.e., dosages that are too high or too low. Motivated by the fact that the induced policy $\pi^*_{\mu}$ is able to identify quasi-optimal support sub-regions, and that the q-Gaussian policy distribution can realize bounded support, we conjecture that the q-Gaussian policy distribution can be recovered from the induced policy $\pi^*_{\mu}$. Indeed, the q-Gaussian policy distribution is a special case of the induced policy if $Q^*_{\mu}(s, a)$ is a concave quadratic function with respect to the action $a$. We illustrate this phenomenon in Theorem 3.2. Theorem 3.2.
Suppose $Q^*_{\mu}(s, a)$ is a concave quadratic function over $a \in \mathcal{A}$, i.e., $Q^*_{\mu}(s, a) = -\alpha_1(s)a^2 + \alpha_2(s)a + \alpha_3(s) := Q^N_{\mu}(s, a)$, where $\alpha_1(s), \alpha_2(s), \alpha_3(s)$ are functions over $s \in \mathcal{S}$ and $\alpha_1(s) > 0$ for all $s$. Then the induced policy distribution $\pi^*_{\mu}(\cdot|s)$ follows a q-Gaussian distribution with density function

$$\pi^*_{\mu}(a|s) = \left[\frac{3}{2}\left(\frac{\alpha_1(s)}{12\mu}\right)^{\frac{1}{3}} - \frac{\alpha_1(s)}{2\mu}\left(a - \frac{\alpha_2(s)}{2\alpha_1(s)}\right)^2\right]_+ := \pi^N_{\mu}(a|s),$$

and a closed-form quasi-optimal support region

$$W_s = \left[\frac{\alpha_2(s) - (12\alpha_1^2(s)\mu)^{\frac{1}{3}}}{2\alpha_1(s)},\ \frac{\alpha_2(s) + (12\alpha_1^2(s)\mu)^{\frac{1}{3}}}{2\alpha_1(s)}\right] := W^N_s.$$

The policy distribution $\pi^N_{\mu}(\cdot|s)$ behaves as an affine transformation of the standard q-Gaussian distribution with mean $\frac{\alpha_2(s)}{2\alpha_1(s)}$, where the maximum action-value is attained, i.e., $\frac{\alpha_2(s)}{2\alpha_1(s)} = \arg\max_{a \in \mathcal{A}} Q^N_{\mu}(s, a)$. Note that the width of the quasi-optimal region is $\frac{(12\alpha_1^2(s)\mu)^{1/3}}{\alpha_1(s)}$, which is determined by the threshold parameter $\mu$. The actions within the region $\mathbb{R} \setminus W^N_s$ are classified as non-optimal and are assigned zero probability density. For a small $\mu$, i.e., strong screening intensity, a narrow region is identified as quasi-optimal, which yields a relatively conservative action recommendation. In contrast, with a large $\mu$, more actions are included in the support. In the extreme case, $W^N_s$ degenerates to $\mathbb{R}$ as $\mu \to \infty$. In Theorem 4.1 of Section 4, we formally investigate how the magnitude of $\mu$ affects the induced policy distribution.

So far, we have obtained closed-form representations for the general policy $\pi^*_{\mu}(\cdot|s)$ and the q-Gaussian policy $\pi^N_{\mu}$. However, how to estimate the policy remains open. As indicated by the challenges in Section 3, we need to address the double sampling issue and utilize off-policy data in optimization. Neither challenge can be easily solved by minimizing the Bellman error. Fortunately, the kernel embedding helps us bypass these difficulties.
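The closed form for a concave quadratic Q-function can be checked numerically. The sketch below is illustrative only: the function name `q_gaussian_policy` and the coefficient values are assumptions, and the center of the density is taken to be the maximizer of $-\alpha_1 a^2 + \alpha_2 a + \alpha_3$, namely $\alpha_2/(2\alpha_1)$, which is also the midpoint of the support $W^N_s$. The density is obtained by thresholding the quadratic so that it integrates to one over its support.

```python
import numpy as np

def q_gaussian_policy(a, alpha1, alpha2, mu):
    """Density and support induced by Q(a) = -alpha1*a**2 + alpha2*a + alpha3.

    alpha3 does not affect the policy, so it is omitted. Requires alpha1 > 0.
    """
    center = alpha2 / (2.0 * alpha1)                        # argmax of the quadratic
    half_width = (12.0 * alpha1 ** 2 * mu) ** (1.0 / 3.0) / (2.0 * alpha1)
    peak = 1.5 * (alpha1 / (12.0 * mu)) ** (1.0 / 3.0)      # density at the center
    density = np.maximum(peak - alpha1 / (2.0 * mu) * (a - center) ** 2, 0.0)
    return density, (center - half_width, center + half_width)

# Illustrative coefficients at a single state (hypothetical values).
alpha1, alpha2, mu = 2.0, 1.0, 0.1
a = np.linspace(-2.0, 2.0, 40001)
da = a[1] - a[0]

density, (lo, hi) = q_gaussian_policy(a, alpha1, alpha2, mu)
total_mass = np.sum(density) * da       # Riemann-sum approximation of the integral
width = hi - lo
mode = a[np.argmax(density)]
```

With these values the support is a bounded interval around the mode, so actions far from the maximizer receive exactly zero density; increasing `mu` widens the interval at the rate $\mu^{1/3}$.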

3.3. KERNEL EMBEDDING ON QUASI-OPTIMAL ERROR

In this subsection, we introduce the quasi-optimal learning framework for solving for the induced policy $\pi^*_{\mu}$. First, we establish a stationarity equation in Theorem 3.3, which helps to incorporate off-policy data. Then we leverage the idea of kernel embedding (Gretton et al., 2012) to obtain an unbiased empirical loss without the double sampling issue.

Theorem 3.3 (Stationarity equation). Let $V^*_{\mu}$ be a fixed point of the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$, and let $\pi^*_{\mu}$ be the induced policy in (4). For any $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $\mu \in (0, \infty)$, the pair $(V^*_{\mu}, \pi^*_{\mu})$ satisfies the following equation:

$$E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V_{\mu}(S^{t+1})\big] - \mu\,\text{prox}^{\bullet}(\pi_{\mu}(a|s)) - \eta(s) + \varpi(s, a) = V_{\mu}(s). \tag{9}$$

Here $\text{prox}^{\bullet}(x) = 2x - 1$, and $\eta(s) : \mathcal{S} \to [-\mu C, 0]$ and $\varpi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ are Lagrange multipliers satisfying $\varpi(s, a) \cdot \pi_{\mu}(a|s) = 0$. We call the discrepancy between the two sides of (9) the "quasi-optimal error".

Equation (9) connects the quasi-optimal value function $V^*_{\mu}$ and the policy function $\pi^*_{\mu}$ at any arbitrary state-action pair. This provides an easy way to incorporate off-policy data, i.e., state-action pairs sampled from the state-action visitation under the behavior policy, without adjusting for the distribution mismatch.

Min-max Optimization One way to solve equation (9) is to minimize the quasi-optimal error under an $L^2$ loss function. Unfortunately, the double sampling issue would still appear if the unknown $E_{S^{t+1}|s,a}[R(S^{t+1}, s, a) + \gamma V_{\mu}(S^{t+1})]$ in the quasi-optimal error were replaced by its one-sample bootstrapping counterpart $R^t + \gamma V_{\mu}(S^{t+1})$. Alternatively, inspired by the average Bellman error (Jiang et al., 2017), we propose to minimize a weighted average quasi-optimal error, so that the unwanted conditional variance of the bootstrapping counterpart under the $L^2$ loss vanishes.
We define the loss $L(V_{\mu}, \pi_{\mu}, \eta, \varpi, u)$ as

$$E_{S^t, A^t, S^{t+1}}\Big[u(S^t, A^t) \cdot \big(G_{V_{\mu}, \pi_{\mu}}(S^t, A^t, S^{t+1}) - \eta(S^t) + \varpi(S^t, A^t) - V_{\mu}(S^t)\big)\Big],$$

where $G_{V_{\mu}, \pi_{\mu}}(s, a, s') := R(s', s, a) + \gamma V_{\mu}(s') - \mu\,\text{prox}^{\bullet}(\pi_{\mu}(a|s))$ and $u(\cdot) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a bounded function in the $L^2$ space $L^2(C_0) := \{u \in L^2 : \|u\|_{L^2} \le C_0\}$. Essentially, the weight function $u$ fits the discrepancy in (9) and up-weights the sample points with large quasi-optimal errors. As $L(V^*_{\mu}, \pi^*_{\mu}, \eta, \varpi, u) = 0$ holds for any function $u$, this leads to a minimax optimization:

$$\min_{V_{\mu}, \pi_{\mu}, \eta, \varpi}\ \max_{u \in L^2(C_0)} L^2(V_{\mu}, \pi_{\mu}, \eta, \varpi, u). \tag{10}$$

Algorithm 1 Quasi-optimal Learning in Continuous Action Spaces
1: Input the observed transition data $\{(S^t_i, A^t_i, R^t_i, S^{t+1}_i) : t = 1, \ldots, T\}_{i=1}^{n}$.
2: Initialize the parameters of interest $(\theta, \xi) = (\theta_0, \xi_0)$, the mini-batch size $n_0$, the learning rate $\alpha_0$, the prox parameter $\mu$, the kernel bandwidth $bw_0$, and the stopping criterion $\varepsilon$.
3: For iterations $j = 1$ to $k$
4:   Randomly sample a mini-batch $\{(S^t_i, A^t_i, R^t_i, S^{t+1}_i) : t = 1, \ldots, T\}_{i=1}^{n_0}$.
5:   Decay the learning rate $\alpha_j = O(j^{-1/2})$.
6:   Compute stochastic gradients with respect to $\theta$ and $\xi$: $\widehat{\nabla}_{\theta} = \mathbb{P}_{n_0}\nabla_{\theta}\widehat{L}_U$ and $\widehat{\nabla}_{\xi} = \mathbb{P}_{n_0}\nabla_{\xi}\widehat{L}_U$.
7:   Update the parameters of interest as $\theta_j \leftarrow \theta_{j-1} - \alpha_j\widehat{\nabla}_{\theta}$, $\xi_j \leftarrow \xi_{j-1} - \alpha_j\widehat{\nabla}_{\xi}$.
8:   Stop if $\|(\theta_j, \xi_j) - (\theta_{j-1}, \xi_{j-1})\| \le \varepsilon$.
9: Return $\widehat{\theta} \leftarrow \theta_j$, $\widehat{\xi} \leftarrow \xi_j$.

Kernel Representation Solving the minimax optimization problem (10) is unstable, and it is also intractable due to the difficulty of representing $u$ in the $L^2$ space. Fortunately, we identify a continuity invariance between the reward function and the optimal weight function $u^*(\cdot)$ (see Theorem S.2 in the Appendix). The optimal $u^*(\cdot)$ is continuous as long as the reward function is continuous, which is widely satisfied in real-world applications.
For a positive definite kernel $K$, a bounded reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{\text{RKHS}}(C_0) := \{u \in \mathcal{H}_{\text{RKHS}} : \|u\|_K \le C_0\}$ has a diminishing approximation error to any continuous function class as $C_0 \to \infty$ (Bach, 2017). This, together with the continuity invariance, provides a basis for representing the weight function in a bounded RKHS. This kernel representation further leads to a closed form of the inner maximizer (Gretton et al., 2012). The detailed derivation is provided in Theorem S.3 in the Appendix. Upon this, the minimax optimization reduces to minimizing only the loss

$$L_U = E_{S^t, \tilde{S}^t, A^t, \tilde{A}^t, S^{t+1}, \tilde{S}^{t+1}}\big[\Lambda_{V_{\mu}, \pi_{\mu}}(S^t, A^t, S^{t+1})\, K(S^t, A^t; \tilde{S}^t, \tilde{A}^t)\, \Lambda_{V_{\mu}, \pi_{\mu}}(\tilde{S}^t, \tilde{A}^t, \tilde{S}^{t+1})\big],$$

where $\Lambda_{V_{\mu}, \pi_{\mu}}(s, a, s') := G_{V_{\mu}, \pi_{\mu}}(s, a, s') - \eta(s) + \varpi(s, a) - V_{\mu}(s)$ and $(\tilde{S}^t, \tilde{A}^t, \tilde{S}^{t+1})$ is an independent copy of the transition pair $(S^t, A^t, S^{t+1})$. Observe that the loss $L_U$ is symmetric and kernel represented. This motivates us to use an unbiased U-statistic estimator to obtain the sample loss. Given the observed data $D_{1:n}$, with $n$ trajectories of length $T$, we can use a trajectory-based U-statistic estimator to capture the within-trajectory loss, so that the total loss can be aggregated as the empirical mean of $n$ i.i.d. within-trajectory losses:

$$\min_{V_{\mu}, \pi_{\mu}, \eta, \varpi} \widehat{L}_U = \mathbb{P}_n \binom{T}{2}^{-1} \sum_{1 \le j \ne k \le T} \big[\Lambda_{V_{\mu}, \pi_{\mu}}(S^j_i, A^j_i, S^{j+1}_i)\, K(S^j_i, A^j_i; S^k_i, A^k_i)\, \Lambda_{V_{\mu}, \pi_{\mu}}(S^k_i, A^k_i, S^{k+1}_i)\big] \tag{12}$$

subject to $\varpi(s, a) \ge 0$, $\pi_{\mu}(a|s) \cdot \varpi(s, a) = 0$ and $\eta(s) \in [-\mu C, 0]$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$. The sample loss $\widehat{L}_U$ is unbiased and consistent for the population loss $L_U$. The consistency is justified in Theorem 4.2 by examining the tail behavior of $\widehat{L}_U$. In essence, solving (12) is a computationally intensive non-linear programming problem. Alternatively, we convert the constrained problem into an unconstrained problem by restricting the Lagrange multipliers.
Thus, it can be solved by an unconstrained true gradient algorithm, i.e., Algorithm 1, under the function approximation $(V_{\mu}, \pi_{\mu}, \eta, \varpi) = (V^{\theta}_{\mu}, \pi^{\theta}_{\mu}, \eta^{\xi}, \varpi^{\theta})$; see the Appendix for details.
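The kernel-embedded sample loss can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel, the array layout (one row per trajectory), and the normalization by the number of ordered pairs $T(T-1)$ are all choices made for this sketch; a different constant normalization only rescales the loss. The residual follows the definition of $\Lambda$ with $\text{prox}^{\bullet}(x) = 2x - 1$.

```python
import numpy as np

def quasi_optimal_residual(r, v_s, v_next, pi_sa, eta_s, varpi_sa, gamma, mu):
    """Per-transition quasi-optimal error Lambda(s, a, s')."""
    prox_dot = 2.0 * pi_sa - 1.0
    return r + gamma * v_next - mu * prox_dot - eta_s + varpi_sa - v_s

def gaussian_kernel(x, bandwidth):
    """Gram matrix of a Gaussian kernel on stacked (state, action) features."""
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def u_statistic_loss(residuals, features, bandwidth):
    """Average over trajectories of the within-trajectory U-statistic.

    residuals: array of shape (n, T); features: array of shape (n, T, d).
    Pairs with j == k are excluded, which is what removes the
    double-sampling bias of a squared one-sample residual.
    """
    n, T = residuals.shape
    total = 0.0
    for i in range(n):
        K = gaussian_kernel(features[i], bandwidth)
        np.fill_diagonal(K, 0.0)                     # drop the j == k terms
        total += residuals[i] @ K @ residuals[i] / (T * (T - 1))
    return total / n

# Synthetic inputs purely for illustration.
rng = np.random.default_rng(0)
n, T, d = 4, 6, 2
features = rng.normal(size=(n, T, d))
residuals = rng.normal(size=(n, T))
loss = u_statistic_loss(residuals, features, bandwidth=1.0)
```

In a full implementation the residuals would be differentiable functions of the parameters $(\theta, \xi)$, and the gradient of this loss would drive the updates in Algorithm 1.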

4. THEORY

In this section, we study the theoretical properties of the proposed method. First, we study some general properties of the proposed quasi-optimal Bellman operator, given in Propositions S.1 and S.2 of the Appendix. In Theorem 4.1, we characterize the effect of the magnitude of the prox parameter $\mu$ on the induced optimal policy distribution. Moreover, a non-asymptotic concentration bound is established in Theorem 4.2, showing the consistency and measuring the rate of convergence of $\widehat{L}_U$ to $L_U$. Further, the overall performance error of the algorithm is given in Theorem 4.3, where the performance error is decomposed into four sources. Finally, we show that the proposed quasi-optimal learning is a convergent algorithm. Before we present the theoretical results, we introduce some assumptions on the boundedness of the MDP and the properties of the sample trajectories, respectively.

Assumption 4.1. The reward function $R(s', s, a)$ is uniformly bounded, i.e., $\|R(\cdot)\|_{\infty} \le R_{\max}$.

Assumption 4.2. Suppose $\{S^t, A^t\}_{t \ge 1}$ is a strictly stationary and exponentially $\beta$-mixing sequence with a mixing coefficient $\beta(m) \lesssim \exp(-\delta_1 m)$ for $m \ge 1$. We further assume that the behavior policy $\pi_b$, which is used to collect the offline data $D_{1:n}$, satisfies $\min_{a \in \mathcal{A}, s \in \mathcal{S}} \pi_b(a|s) > 0$.

Theorem 4.1 (Policy Adaptability). Under Assumption 4.1, for all $s \in \mathcal{S}$, the quasi-optimal policy distribution $\pi^*_{\mu}(\cdot|s)$ degenerates to a uniform distribution over $\Delta(\mathcal{A})$ as $\mu \to \infty$, and $\pi^*_{\mu}(\cdot|s)$ concentrates on a point mass as $\mu \to 0$ and $C \to \infty$.

Theorem 4.1 formally characterizes the effect of $\mu$ on $\pi^*_{\mu}(\cdot|s)$. In the extreme case $\mu \to 0$, $C \to \infty$, only the action maximizing $Q^*_{\mu}(s, a)$ is included in the quasi-optimal region. In the following, we establish a non-asymptotic concentration inequality for the empirical loss in the non-i.i.d. case. Theorem 4.2.
For any µ ∈ (0, ∞) and ϵ > 0, under Assumptions 4.1-4.2, we have ϵ-divergence of | L U -L U | bounded in probability, i.e., P(| L U -L U | > ϵ) ≤ C 1 exp - ϵ 2 T -C 2 ϵM 2 max √ T M 2 max + ( ϵ 2 - C2M 2 max √ T ) log T log log(T ) + C 3 exp -nϵ 2 M 4 max , where C 1 , C 2 and C 3 are some constants depending on δ 1 respectively, and M max = 4 1-γ R max +µC. Theorem 4.2 implies that L U is a consistent estimator to L U , and thus avoiding the double sampling issue. Note that the concentration bound is sharper than the bound established in Chakrabortty & Kuchibhotla (2018) since we utilize a novel temporal correlatedness structure to decompose the U-statistic. We now analyze the performance error between the finite sample learner and true solution, which can be decomposed into four source errors. Theorem 4.3. Under Assumption 4.1-4.2, let V θ1,k µ be the optimizer from Algorithm 1 and V * is the optimal value function and κ min be the smallest eigenvalue corresponding to an orthonormal basis of L 2 (S × A) space. With probability 1 -δ, the performance error is upper bounded by ∥ V θ1,k µ -V * ∥ 2 L 2 ≤ C 4 κ min (1 -γ) 2   C 5 D P-dim log 8C4 δ n + 2 ∆ δ1 ∨ 1 ∆ C 6 ⌊T /2⌋   generalization error + C 7 µ 2 (C + |1 -C| ∨ 1) 2 (1 -γ) 2 proximal bias + C 8 V θ1 µ -V θ1,k µ 2 L 2 optimization error +ϵ approximation error , where ∆ = DP-dim log⌊T /2⌋ 2 + log( e δ ) + log + C5C D P-dim 6 2 , D P-dim = P-dim(Θ 1 ) + P-dim(Θ 2 ) + P-dim(Ξ 1 ) + P-dim(Ξ 2 ), and C 4 , ..., C 8 are some constants. Here P-dim(•) denotes the pseudodimension operator (Györfi, 2010) , and Θ 1 , Θ 2 , Ξ 1 and Ξ 2 are function spaces for V µ , π µ , ϖ and η, respectively. The ϵ approximation error is from parametrization (V θ1 µ , π θ2 µ , ϖ ξ1 , η ξ2 ) on (V µ , π µ , ϖ, η). The above sample complexity bound gives an insight into the performance error of the proposed algorithm. 
The generalization error is $\varepsilon_{\text{gerr}} = O(1/\sqrt{T})$ if $n$ is of the same order as $T$, the proximal bias is $\varepsilon_{\text{prox}} = O(\mu^2)$, and the optimization error is $\varepsilon_{\text{optim}} = O(1/k)$ for $k$ iterations. Although the prox function introduces a proximal bias in the quasi-optimal Bellman operator $\mathcal{B}_{\mu}$, it leads to a smoothed approximation of $\mathcal{B}$. There is a trade-off between the proximal bias and the approximation error. As $\mu$ increases, the proximal bias grows but the approximation error decreases, since the true function space becomes smoother and easier to approximate. On the other hand, a small $\mu$ leads to a small proximal bias but a relatively large approximation error. Theorem 4.4. Suppose L U in Algorithm 1 is differentiable, but not necessarily convex, and its gradient ∇ L U (θ, ξ) is M L -Lipschitz and Var( ∇θ + ∇ξ ) ≤ σ 2 0 . And suppose that the learning rate {α j } are set to α j = min 2 M L , Λ σ0 √ j for some Λ ≥ 0 and ε is sufficient small. Let k = k with P( k = j) = αj (2-M L αj ) k j=1 (αj (2-M L αj )) for j = 1, . . . , k ⋄ . Then, if ( θ, ξ) is the optimization solution and (θ 1 , ξ 1 ) is the first step solution, we have ∇ L U ( θ, ξ) 2 L 2 ≤ 2M L L U (θ 1 , ξ 1 ) -min θ,ξ L U (θ, ξ) M L k ⋄ + σ 0 M L Λ √ k ⋄ + Λσ 0 M L √ k ⋄ , Theorem 4.4 implies that the quasi-optimal learning algorithm converges to a stationary point at a sub-linear rate $O(1/\sqrt{k^{\diamond}})$ even if the empirical loss is non-convex. This property serves as a basis for applying non-linear function approximation with convergence guarantees. Theorem 4.4 is adapted from Corollary 2.2 in Ghadimi & Lan (2013) under a decaying learning rate and a Euclidean stopping criterion. The convergence of Algorithm 1 benefits from our unbiased stochastic gradient estimator.

5. RELATED WORKS

In this work, we propose a provably convergent and sample-efficient off-policy optimization algorithm. Our learning algorithm is trained in a fully offline fashion, without any further online interaction with the environment. This connects our work to offline RL algorithms (Lange et al., 2012; Levine et al., 2020). Due to space limitations, we defer the discussion on offline RL to the Appendix. Algorithmically, our work is related to entropy-regularized reinforcement learning algorithms (Rawlik et al., 2012; Haarnoja et al., 2017), but these works are fundamentally different from ours. Our formulation is motivated by constructing a proximal counterpart of the Bellman operator, which serves as a basis for the subsequent quasi-optimal learning algorithm. Besides, the major drawback of the existing algorithms (Lee et al., 2018b; Chow et al., 2018b; Vieillard et al., 2020) is the lack of theoretical guarantees when combined with function approximation. It is not clear whether these algorithms are convergent, generalizable, and consistent. In contrast, our algorithm is thoroughly examined on both theoretical and empirical fronts. Nachum et al. (2017) and Chow et al. (2018b) exploit a stationarity condition analogous to that in Theorem 3.3 and minimize an upper bound of the error, which is biased and encounters the double sampling issue. In contrast, our work leverages the kernel embedding to bypass the double sampling issue, and is provably consistent. Unlike our algorithm, the algorithms for continuous control problems, e.g., (Haarnoja et al., 2018b; Nachum et al., 2018b; Lee et al., 2019), do not check the policy optimality, but separately model a pre-specified policy class. This may introduce an additional bias if the pre-specified policy class is misspecified. Our approach exemplifies more recent efforts that aim to learn an optimal policy with continuous actions (Lillicrap et al., 2015).
One of our key innovations is to develop a policy class that can identify quasi-optimal sub-regions and whose induced policy has a closed form in terms of the value function. This distinguishes our work from approaches such as (Silver et al., 2014; Mnih et al., 2016; Kumar et al., 2019; 2020). These methods typically require prior knowledge to determine a pre-specified policy class and commonly use the Gaussian family of distributions, and therefore face the risk of off-support bias. Our work is also relevant to safe/risk-sensitive RL. When the risk measure is defined based on the reward, e.g., the quantile of the return, it draws connections to our algorithm. Given the potential application scenarios, quasi-optimal learning is also related to RL in the healthcare domain and to the trade-offs between safety and optimality. Tang et al. (2020) construct set-valued policies of near-optimal actions, allowing interaction between the clinician and the decision support system. However, their method is not applicable in a fully offline setting. Fatemi et al. (2021) assess regions of risk and identify treatments to avoid in a safety-critical environment. Nevertheless, a near-optimal regret guarantee is vacuous in their framework. Due to page limits, we provide a detailed discussion on safe and healthcare RL in the Appendix.

6. EXPERIMENTS

In this section, we evaluate our proposed method on synthetic and real environments. We compare our method with state-of-the-art baselines, including DDPG (Lillicrap et al., 2015), SAC (Haarnoja et al., 2018a), BEAR (Kumar et al., 2019), Greedy-GQ (Ertefaie & Strawderman, 2018), and V-Learning (Luckett et al., 2019). We also compare against two safe RL algorithms, CQL (Kumar et al., 2020) and IQN (Dabney et al., 2018a), for a comprehensive comparison from the safe RL point of view.

Synthetic Data

The four environments are simulated to mimic real environments for continuous treatment applications. In Environments I and II, we consider a bounded action space to evaluate the potential of quasi-optimal learning for addressing off-support bias. Environment III is designed to mimic a safety-critical environment by incorporating the notion of safety into the reward function (Jia et al., 2020), i.e., the optimal dosage is unique, and a high dosage leads to excessive toxicity while a lower dosage is ineffective (Zang et al., 2014). This is helpful for examining safety performance. In Environment IV, all methods are implemented and compared in a more complex environment. The detailed discussion of the experiment designs and settings is deferred to Section D in the Appendix. Figure 3 shows that our proposed method outperforms the competing methods with a relatively small variance. This mainly benefits from identifying the quasi-optimal region, which guarantees that the suggested action is near-optimal and hence improves performance. In comparison, SAC and BEAR use a Gaussian policy and assign non-negligible positive densities to all actions, even non-optimal ones, which damages model performance. Meanwhile, even though the safe RL methods (i.e., CQL and IQN) show better performance and smaller variance than the non-safe methods, their performance is still negatively affected by assigning non-zero densities to non-optimal actions. In addition, in Environments I and II with bounded action support, the competing methods are affected by an off-support bias, which lowers their discounted return. In Environments III and IV, the performance gains of the proposed method come mainly from accurately recovering the quasi-optimal regions. Also, note that our algorithm achieves stable performance in small-sample settings, which we attribute to the smoothness and optimization-friendliness of our objective. This is promising, as limited data is common in medical applications.
Additional experimental results, including a safety criterion (the distribution of the discounted sum of rewards) and a sensitivity analysis of µ, are provided in the Appendix.

Real Data: An Ohio Type 1 Diabetes Case Study

The Ohio Type 1 Diabetes (OhioT1DM) dataset (Marling & Bunescu, 2020) contains 2 cohorts of patients with type 1 diabetes; each patient has 8 weeks of life-event data, including health status measurements and insulin injection dosages. Clinicians are interested in adjusting insulin injection dose levels (Marling & Bunescu, 2020; Bao et al., 2011) based on a patient's health status, so as to keep the glucose level within a certain range for safe dose suggestions. As each individual has dramatically distinct glucose dynamics, we follow Zhu et al. (2020) and regard each patient's data as an independent dataset, and the data from each day as a trajectory. The state variables are health status measurements, and the action space is a bounded insulin dose range. The glycemic index is used as the reward function to measure the quality of a dose suggestion. See the Appendix for more details of the experimental setup. As shown in Table 1, the proposed method achieves the best performance for almost all patients. The proposed method mitigates the off-support bias in this bounded dosage space and outperforms the competing methods. This finding is consistent with the results on the synthetic data and demonstrates the potential of our method in continuous action spaces. Besides model performance, we illustrate the safety guarantee of quasi-optimal learning with additional experimental results and analyses in the Appendix.

Table 1: The discounted return for the policy improvement based on 50 repeated experiments.
Patient ID  Proposed    DDPG        SAC         BEAR        Greedy-GQ   VL          CQL         IQN
540         18.6 ± 0.6  14.1 ± 2.3  14.2 ± 1.2  13.7 ± 0.9  15.5 ± 2.4  14.1 ± 2.4  17.0 ± 0.9  18.2 ± 0.9
544         11.0 ± 0.7  7.5 ± 1.5   7.5 ± 2.5   5.9 ± 0.8   6.3 ± 2.9   8.1 ± 2.9   9.3 ± 1.0   9.8 ± 1.0
552         6.3 ± 0.4   4.8 ± 0.5   5.7 ± 1.0   3.6 ± 0.6   4.1 ± 1.8   5.2 ± 1.3   6.7 ± 0.7   6.1 ± 0.8
567         29.9 ± 1.5  30.0 ± 2.0  27.[...]
[...]       11.6 ± 0.6  8.4 ± 0.9   9.3 ± 0.7   8.4 ± 0.7   9.2 ± 1.5   8.8 ± 1.9   9.4 ± 0.7   9.9 ± 0.8
570         25.0 ± 0.8  24.5 ± 1.4  26.1 ± 0.8  25.8 ± 0.8  22.8 ± 1.6  22.6 ± 1.5  25.8 ± 0.9  25.9 ± 0.8
575         15.5 ± 1.0  10.4 ± 1.3  8.8 ± 1.4   10.2 ± 1.0  5.7 ± 2.8   8.5 ± 2.3   12.6 ± 0.9  12.7 ± 1.2
588         18.6 ± 0.7  14.2 ± 1.3  13.5 ± 1.5  12.0 ± 0.9  10.0 ± 3.1  8.6 ± 2.3   15.7 ± 0.8  15.9 ± 1.3
591         15.4 ± 1.0  12.3 ± 0.6  11.9 ± 0.6  12.8 ± 0.7  10.7 ± 1.7  10.5 ± 2.6  14.9 ± 0.6  15.2 ± 0.7

7. CONCLUSIONS

We introduce a novel quasi-optimal learning algorithm for continuous action allocation, which is particularly useful for determining the dose level when developing medical treatment regimes. The quasi-optimal learning algorithm is provably convergent in off-policy settings, and a PAC bound is provided to analyze its sample complexity. The promising results raise some interesting directions for future work, including extending the framework to online settings that interact with the environment.

8. ACKNOWLEDGEMENTS

The authors are grateful to the anonymous reviewers and the area chair for valuable comments and suggestions. Ruoqing Zhu's research is partially supported by a grant from the National Science Foundation, DMS-2210657.

9. REPRODUCIBILITY STATEMENT

We provide code to reproduce all the experiments, together with guidelines for accessing the Ohio Type 1 Diabetes dataset, at https://github.com/liyuhan529/Quasi-optimal-Learning. The experimental details are provided in the Appendix for reproducibility. All proofs of the main theorems and additional theorems are included in the Appendix.

10. ETHICS STATEMENT

The proposed method aims at finding an optimal policy in a continuous action space, with a special focus on medical applications. We believe the proposed work has potential applications in sequential treatment dose suggestion, e.g., managing diabetes through insulin injection. We acknowledge that the proposed method needs additional validation experiments in controlled settings before practical use in medical applications, in order to avoid undue risks. The proposed algorithm should be used neither as a stand-alone tool nor as a replacement for human experts.

"Supplementary Materials for Quasi-optimal Reinforcement Learning with Continuous Actions"

A ADDITIONAL RELATED WORKS

We discuss additional related works in this section.

Safe RL. Safe reinforcement learning (safe RL) aims at finding an optimal policy while ensuring safety (García & Fernández, 2015). In the safe RL framework, the definition of safety and its guarantees vary with the specific purpose of the learning task. In our view, there are three main lines of work in safe RL:
• Safe exploration: ensuring safe action allocations in the exploration process by incorporating prior knowledge, which often arises in online RL settings (Pham et al., 2018).
• Safety constraints: finding an optimal policy that satisfies external, user-specified safety constraints (Chow et al., 2018a; Gu et al., 2022).
• Risk sensitivity and conservatism: finding a policy that maximizes the infinite-horizon cumulative discounted reward while incorporating a notion of risk (Morimura et al., 2010; Mavrin et al., 2019), e.g., value at risk (quantile), percentile performance, chance, or the variance of the return.
In medical applications, specifying explicit constraints is typically hard to realize in practice (Vincent, 2014). Alternatively, the notion of safety is usually incorporated in the design of the reward function, where high-risk actions lead to significantly lower rewards (Raghu et al., 2017; Jia et al., 2020). For this reason, our quasi-optimal learning is closely related to the risk-sensitive RL framework, which aims to control value at risk to ensure safety. Examples include maintaining the discounted return above a certain threshold (Tamar et al., 2015), reducing the variability of performance by avoiding extremely low performance (Ma et al., 2020), and maximizing a robust performance criterion, e.g., a quantile of the discounted return (Dabney et al., 2018b). Commonly used algorithms in risk-sensitive RL include conservative Q-learning (CQL; Kumar et al., 2020) and the implicit quantile network (IQN; Dabney et al., 2018a).
CQL learns a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value, and thus avoids selecting high-risk actions with overestimated action values. IQN models the full quantile function of the state-action return distribution and yields risk-sensitive policies. For a more comprehensive empirical study, we compare the proposed algorithm with these two safe RL baselines, conduct additional numerical experiments, and analyze the results from the safety point of view.

RL in healthcare. Reinforcement learning has a wide variety of applications in healthcare (Yu et al., 2021). Some recent works aim to address safety issues when applying RL to healthcare domains. Tang et al. (2020) consider identifying set-valued policies with near-optimal actions, which allows expert knowledge from clinicians to be incorporated to assist in decision making. Following the same rationale as our proposed quasi-optimal region, Tang et al. (2020) also use the value function to threshold a near-optimal action set. However, this method is developed only for discrete action spaces, and it is still not directly applicable in fully offline settings. Fatemi et al. (2021) consider identifying high-risk states in data-constrained offline settings by training two separate Q-functions that model the probability of negative outcomes and positive outcomes, respectively. They aim to identify treatments in proportion to their chance of leading to dead-ends, and attain safety by excluding these treatments from consideration. However, as they aim to identify possible "dead-ends" of a state space and treatments, there exists a trade-off between safety and optimality. In particular, a gap remains with respect to optimal treatment allocation. Other interesting works on RL for healthcare include (Murphy, 2003; Shi et al., 2018; 2022). We refer the reader to Levine et al. (2020) for a more comprehensive discussion of offline RL.
Among the aforementioned lines of work, ours is most closely related to Bellman residual minimization. These methods learn the value function by solving a nested optimization problem, where the function spaces used for the inner and outer optimization must be the same. From the perspective of the coupled optimization, their inner optimization plays a role similar to the inner maximization of our min-max framework. In addition to the fundamental difference in derivation, our min-max optimization can be reduced to a single minimization problem with the aid of the kernel representation, whereas they have to solve an unstable minimax optimization problem. Most importantly, our quasi-optimal learning framework provides a practical way to learn a reliable policy in a continuous action space via quasi-optimal region identification. To the best of our knowledge, no existing RL algorithm achieves this.

B TECHNICAL PROOFS

$$
B_\mu V^*_\mu(s) = \mu\left\{1 - \int_{a\in W_{s,1}}\left[\left(\frac{\int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_{s,1})} - \frac{1}{\sigma(W_{s,1})}\right)^2 - \left(\frac{Q^*_\mu(s,a)}{2\mu}\right)^2\right]da\right\} + \frac{C\sigma(W_{s,1})\int_{a\in W_{s,2}} Q^*_\mu(s,a)\,da - C\sigma(W_{s,2})\int_{a\in W_{s,1}} Q^*_\mu(s,a)\,da}{2\sigma(W_{s,1})} - \frac{\mu C^2\sigma(W_{s,2})\big(\sigma(W_{s,2}) + \sigma(W_{s,1})\big)}{\sigma(W_{s,1})}, \tag{13}
$$

where $W_{s,1}$ refers to the set $\{a \in \mathcal{A} : C > \pi^*_\mu(a|s) > 0\}$ and $W_{s,2}$ refers to the set $\{a \in \mathcal{A} : \pi^*_\mu(a|s) = C\}$.

Proof: The proof mainly consists of checking the KKT conditions of the maximization. The Lagrangian function of the RHS of (2) can be expressed as follows:

$$
L(\pi, \eta, \varpi_1, \varpi_2) = E_{a\sim\pi(\cdot|s)}\big[Q_\mu(s,a) + \mu\,\mathrm{prox}(\pi(a|s))\big] - \eta(s)\left(\int_{a\in\mathcal{A}} \pi(a|s)\,da - 1\right) + \varpi_1(s,a)\,\pi(a|s) - \varpi_2(s,a)\big(\pi(a|s) - C\big).
$$

The following KKT conditions are necessary for the maximizer $\pi^*_\mu$:
• Primal feasibility: $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,da - 1 = 0$, $-\pi^*_\mu(a|s) \le 0$, $\pi^*_\mu(a|s) \le C$.
• Dual feasibility: $\varpi_1(s,a) \ge 0$, $\varpi_2(s,a) \ge 0$.
• Complementary slackness: $\varpi_1(s,a)\,\pi^*_\mu(a|s) = 0$, $\varpi_2(s,a)\big(\pi^*_\mu(a|s) - C\big) = 0$.
• Stationarity: $Q^*_\mu(s,a) + \mu\big(1 - 2\pi^*_\mu(a|s)\big) - \eta(s) + \varpi_1(s,a) - \varpi_2(s,a) = 0$.
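From the stationarity condition, we obtain

$$
\pi^*_\mu(a|s) = \frac{1}{2} - \frac{1}{2\mu}\big[\eta(s) - Q^*_\mu(s,a) - \varpi_1(s,a) + \varpi_2(s,a)\big].
$$

Combining this with the complementary slackness condition:
• If $\pi^*_\mu(a|s) = 0$, then $\varpi_1(s,a) \ge 0$ and $\varpi_2(s,a) = 0$, thus $Q^*_\mu(s,a) \le \eta(s) - \mu$.
• If $C > \pi^*_\mu(a|s) > 0$, then $\varpi_1(s,a) = \varpi_2(s,a) = 0$, thus $\eta(s) - \mu + 2\mu C > Q^*_\mu(s,a) > \eta(s) - \mu$.
• If $\pi^*_\mu(a|s) = C$, then $\varpi_1(s,a) = 0$ and $\varpi_2(s,a) \ge 0$, thus $Q^*_\mu(s,a) \ge \eta(s) - \mu + 2\mu C$.

Therefore, $\pi^*_\mu(a|s)$ can be expressed as

$$
\pi^*_\mu(a|s) =
\begin{cases}
0 & \text{if } Q^*_\mu(s,a) \le \eta(s) - \mu, \\
\dfrac{1}{2} - \dfrac{1}{2\mu}\big(\eta(s) - Q^*_\mu(s,a)\big) & \text{if } \eta(s) - \mu + 2\mu C > Q^*_\mu(s,a) > \eta(s) - \mu, \\
C & \text{if } Q^*_\mu(s,a) \ge \eta(s) - \mu + 2\mu C.
\end{cases} \tag{14}
$$

Meanwhile, since $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,da = 1$, we can show that $\eta(s)$ has the closed form

$$
\eta(s) = \mu + \frac{\int_{a\in W_{s,1}} Q^*_\mu(s,a)\,da - 2\mu + 2\mu C\sigma(W_{s,2})}{\sigma(W_{s,1})},
$$

where $W_{s,1}$ refers to the set $\{a \in \mathcal{A} : C > \pi^*_\mu(a|s) > 0\}$, $W_{s,2}$ refers to the set $\{a \in \mathcal{A} : \pi^*_\mu(a|s) = C\}$, and $\sigma(W_{s,1})$, $\sigma(W_{s,2})$ refer to the interval lengths of the corresponding sets. Substituting $\eta(s)$ back into (14), we then have

$$
\pi^*_\mu(a|s) =
\begin{cases}
0 & \text{if } Q^*_\mu(s,a) \le \eta(s) - \mu, \\
\dfrac{Q^*_\mu(s,a)}{2\mu} - \dfrac{\int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_{s,1})} + \dfrac{1 - C\sigma(W_{s,2})}{\sigma(W_{s,1})} & \text{if } \eta(s) - \mu + 2\mu C > Q^*_\mu(s,a) > \eta(s) - \mu, \\
C & \text{if } Q^*_\mu(s,a) \ge \eta(s) - \mu + 2\mu C.
\end{cases} \tag{15}
$$

Finally, plugging the closed form of $\pi^*_\mu(a|s)$ into (2), after some algebra we have

$$
B_\mu V^*_\mu(s) = \mu\left\{1 - \int_{a\in W_{s,1}}\left[\left(\frac{\int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_{s,1})} - \frac{1}{\sigma(W_{s,1})}\right)^2 - \left(\frac{Q^*_\mu(s,a)}{2\mu}\right)^2\right]da\right\} + \frac{C\sigma(W_{s,1})\int_{a\in W_{s,2}} Q^*_\mu(s,a)\,da - C\sigma(W_{s,2})\int_{a\in W_{s,1}} Q^*_\mu(s,a)\,da}{2\sigma(W_{s,1})} - \frac{\mu C^2\sigma(W_{s,2})\big(\sigma(W_{s,2}) + \sigma(W_{s,1})\big)}{\sigma(W_{s,1})}.
$$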
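The piecewise closed form of the policy derived in this proof suggests a simple numerical recipe on a discretized action space. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `quasi_optimal_policy`, the uniform midpoint grid, and the bisection search for the multiplier η(s) are all illustrative choices, and the quadratic Q-function in the usage lines is hypothetical.

```python
import numpy as np


def quasi_optimal_policy(q, h, mu, C, n_iter=200):
    """Grid version of the piecewise closed form for pi*_mu(.|s).

    q  : Q-values on a uniform action grid (1-D array)
    h  : grid spacing, so integrals become h * sum(...)
    mu : threshold parameter; C : upper bound on the density
    """
    if C * h * len(q) < 1.0:
        raise ValueError("C is too small: no density bounded by C integrates to 1")

    def density(eta):
        # 0 below eta - mu, linear in q in between, capped at C above
        return np.clip((q - eta + mu) / (2.0 * mu), 0.0, C)

    lo = q.min() - mu - 2.0 * mu * C   # density == C everywhere, mass >= 1
    hi = q.max() + mu                  # density == 0 everywhere, mass == 0
    for _ in range(n_iter):            # total mass is non-increasing in eta
        eta = 0.5 * (lo + hi)
        if density(eta).sum() * h > 1.0:
            lo = eta
        else:
            hi = eta
    eta = 0.5 * (lo + hi)
    return density(eta), eta


# Illustrative use with a hypothetical quadratic Q on the action space [0, 1].
n = 2000
actions = (np.arange(n) + 0.5) / n
q_values = -4.0 * (actions - 0.3) ** 2
pi, eta = quasi_optimal_policy(q_values, h=1.0 / n, mu=0.05, C=10.0)
support = actions[pi > 0]
print(support.min(), support.max(), pi.sum() / n)
```

Actions whose Q-value falls below η(s) − µ receive exactly zero density, so the printed support endpoints delimit the grid approximation of the quasi-optimal region for this toy Q-function.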

B.1.2 PROOF OF COROLLARY S.1

Corollary S.1. When $\sigma(W_{s,2}) = 0$, we denote $W_{s,1}$ as $W_s$, and the closed form in (13) can be simplified as

$$
B_\mu V^*_\mu(s) = \mu - \frac{1}{4\mu}\left[\frac{\left(\int_{a'\in W_s} Q^*_\mu(s,a')\,da' - 2\mu\right)^2}{\sigma(W_s)} - \int_{a\in W_s} Q^{*2}_\mu(s,a)\,da\right].
$$

Proof: Plugging $\sigma(W_{s,2}) = 0$ into (13) yields the result.
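For readers who want the substitution spelled out, the following worked sketch expands the step. The shorthand $A_s$ and $\sigma_s$ is introduced here only for brevity and does not appear in the original; the first line uses that both terms carrying $W_{s,2}$ vanish when $\sigma(W_{s,2}) = 0$, assuming $Q^*_\mu$ is bounded.

```latex
% Shorthand (assumption, for illustration only):
%   A_s = \int_{a \in W_s} Q^*_\mu(s,a)\,da, \qquad \sigma_s = \sigma(W_s).
\begin{align*}
B_\mu V^*_\mu(s)
 &= \mu\left\{1 - \int_{a\in W_s}\left[\left(\frac{A_s}{2\mu\sigma_s} - \frac{1}{\sigma_s}\right)^2
      - \left(\frac{Q^*_\mu(s,a)}{2\mu}\right)^2\right]da\right\} \\
 &= \mu - \mu\,\sigma_s\,\frac{(A_s - 2\mu)^2}{4\mu^2\sigma_s^2}
      + \frac{1}{4\mu}\int_{a\in W_s} Q^{*2}_\mu(s,a)\,da \\
 &= \mu - \frac{1}{4\mu}\left[\frac{(A_s - 2\mu)^2}{\sigma_s}
      - \int_{a\in W_s} Q^{*2}_\mu(s,a)\,da\right].
\end{align*}
```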

B.1.3 PROOF OF THEOREM 3.1

Proof of Theorem 3.1: For any generic value function $V(s)$ and the corresponding generic Q-function $Q(s,a)$, we first build the lower bound:

$$
B_\mu V(s) = \max_{\pi\in\Delta_{\mathrm{convex}}(\mathcal{A})} E_{a\sim\pi(\cdot|s)}\big[Q(s,a) + \mu(1 - \pi(a|s))\big] \ge \max_{\pi\in\Delta_{\mathrm{convex}}(\mathcal{A})} E_{a\sim\pi(\cdot|s)}\big[Q(s,a) + \mu - \mu C\big] = BV(s) + \mu(1 - C).
$$

For the upper bound:

$$
B_\mu V(s) = \max_{\pi\in\Delta_{\mathrm{convex}}(\mathcal{A})} E_{a\sim\pi(\cdot|s)}\big[Q(s,a) + \mu(1 - \pi(a|s))\big] \le \max_{\pi\in\Delta_{\mathrm{convex}}(\mathcal{A})} E_{a\sim\pi(\cdot|s)}\big[Q(s,a) + \mu\big] = BV(s) + \mu.
$$

Therefore, we have $B_\mu V(s) - BV(s) \in [\mu(1 - C), \mu]$.

B.1.4 PROOF OF THEOREM 3.2

Proof of Theorem 3.2: Suppose $Q^*_\mu(s,a) = -\alpha_1(s)a^2 + \alpha_2(s)a + \alpha_3(s)$ with $\alpha_1(s) > 0$. For this theorem, we assume that the density does not reach its boundary value $C$, and we abbreviate $\alpha_i(s)$ as $\alpha_i$ for $i = 1, 2, 3$. By Equation (4), we have

$$
\pi^*_\mu(a|s) = \left[\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}\right]_+ .
$$

We first find the support set of $\pi^*_\mu(a|s)$. Since $Q^*_\mu$ attains its maximum at $y = \frac{\alpha_2}{2\alpha_1}$, by the symmetry of the quadratic function the support set should be of the form $W_s = [y - l, y + l]$ with $l > 0$. Additionally, the boundary points of the support set should solve

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)} = 0
$$

with respect to $a$. Thus, we can find the boundary points of the support set by solving the following equation with respect to $l$:

$$
-\alpha_1(y \pm l)^2 + \alpha_2(y \pm l) + \alpha_3 = \frac{1}{2l}\int_{y-l}^{y+l}\big(-\alpha_1 a^2 + \alpha_2 a + \alpha_3\big)\,da - \frac{\mu}{l}.
$$

It turns out that $l = \frac{(12\alpha_1^2\mu)^{1/3}}{2\alpha_1}$. Thus, the support set has the closed form

$$
W_s = \left\{a : a \in \left[\frac{\alpha_2 - (12\alpha_1^2\mu)^{1/3}}{2\alpha_1},\ \frac{\alpha_2 + (12\alpha_1^2\mu)^{1/3}}{2\alpha_1}\right]\right\}.
$$

Therefore $\sigma(W_s) = \frac{(12\alpha_1^2\mu)^{1/3}}{\alpha_1}$, and

$$
\frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} = -\frac{(12\alpha_1^2\mu)^{2/3} - 3\alpha_2^2}{24\mu\alpha_1} + \frac{\alpha_3}{2\mu}.
$$

Plugging this result into the closed form of $\pi^*_\mu(a|s)$, we obtain the probability density function

$$
\pi^*_\mu(a|s) = \left[\frac{3}{2}\left(\frac{\alpha_1}{12\mu}\right)^{1/3} - \frac{\alpha_1}{2\mu}\left(a - \frac{\alpha_2}{2\alpha_1}\right)^2\right]_+ .
$$

The resulting distribution $\pi^*_\mu(a|s)$ is exactly of the form of a q-Gaussian distribution with $q = 0$, $\beta = \frac{\alpha_1}{2\mu}$, centered at $\frac{\alpha_2}{2\alpha_1}$.
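The closed form obtained for the quadratic case can be checked numerically. The sketch below is illustrative only: the function name `quadratic_case` and the coefficient values are assumptions, and the cap C is assumed never to bind, as in the proof. It evaluates the half-width $l = (12\alpha_1^2\mu)^{1/3}/(2\alpha_1)$ and the truncated-parabola density, then integrates the density on a fine grid.

```python
import numpy as np


def quadratic_case(alpha1, alpha2, mu):
    """Support and density of pi*_mu when Q = -alpha1*a^2 + alpha2*a + alpha3.

    Follows the closed form in the proof above; alpha3 cancels out.
    Assumes alpha1 > 0 and that the density never reaches the cap C.
    """
    center = alpha2 / (2.0 * alpha1)
    half_width = (12.0 * alpha1**2 * mu) ** (1.0 / 3.0) / (2.0 * alpha1)
    peak = 1.5 * (alpha1 / (12.0 * mu)) ** (1.0 / 3.0)

    def density(a):
        # truncated parabola: positive inside the support, zero outside
        return np.maximum(peak - alpha1 / (2.0 * mu) * (a - center) ** 2, 0.0)

    return center, half_width, density


# Hypothetical coefficients chosen only for illustration.
center, half_width, density = quadratic_case(alpha1=4.0, alpha2=2.4, mu=0.05)
n = 200_000
step = 4.0 * half_width / n
grid = center - 2.0 * half_width + (np.arange(n) + 0.5) * step
total_mass = density(grid).sum() * step
print(center, half_width, total_mass)
```

The density vanishes at `center ± half_width` and its numerical integral is close to one, which matches the normalization that determines $l$ in the proof.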
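B.2 PROOFS ON QUASI-OPTIMAL STATIONARITY EQUATION

B.2.1 PROOF OF THEOREM 3.3

Proof of Theorem 3.3: By the stationarity condition from Theorem S.1, we have

$$
Q^*_\mu(s,a) + \mu\big(1 - 2\pi^*_\mu(a|s)\big) - \eta(s) + \varpi_1(s,a) - \varpi_2(s,a) = 0.
$$

Therefore, by the definition of $Q^*_\mu(s,a)$, we have

$$
E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a)\big] + \gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1})\big] + \mu\big(1 - 2\pi^*_\mu(a|s)\big) - \eta(s) + \varpi_1(s,a) - \varpi_2(s,a) = 0. \tag{16}
$$

Notice that $E_{S^{t+1}|s,a}[R(S^{t+1}, s, a)] = r(s,a)$. Taking the expectation of both sides of (16) with respect to $a$ under the policy distribution $\pi^*_\mu(a|s)$ gives

$$
0 = E_{a\sim\pi^*_\mu(\cdot|s)}\Big[r(s,a) + \gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1})\big] + \mu\big(1 - 2\pi^*_\mu(a|s)\big) - \eta(s) + \varpi_1(s,a) - \varpi_2(s,a)\Big],
$$

$$
0 = \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\Big[r(s,a) + \gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1})\big] + \mu\big(1 - 2\pi^*_\mu(a|s)\big) - \eta(s) + \varpi_1(s,a) - \varpi_2(s,a)\Big]\,da.
$$

By the proximal Bellman optimality equation, $B_\mu V^*_\mu(s) = V^*_\mu(s)$, where $V^*_\mu(s)$ is the fixed point of $B_\mu$. With the explicit definition of $V^*_\mu$, we observe that

$$
\begin{aligned}
0 &= \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\Big[r(s,a) + \gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1})\big] + \mu\big(1 - \pi^*_\mu(a|s)\big)\Big]\,da - \int_{a\in\mathcal{A}} \mu\,\pi^{*2}_\mu(a|s)\,da \\
&\quad - \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\eta(s)\,da + \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_1(s,a)\,da - \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_2(s,a)\,da \\
&= V^*_\mu(s) - \int_{a\in\mathcal{A}} \mu\,\pi^{*2}_\mu(a|s)\,da - \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\eta(s)\,da + \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_1(s,a)\,da - \int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_2(s,a)\,da.
\end{aligned}
$$

Meanwhile, $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\eta(s)\,da = \eta(s)\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,da = \eta(s)$ by the property of a density, while $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_1(s,a)\,da = 0$ and $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,\varpi_2(s,a)\,da = C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da$ by complementary slackness. We further have

$$
V^*_\mu(s) - \mu\int_{a\in\mathcal{A}} \pi^{*2}_\mu(a|s)\,da - \eta(s) - C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da = 0.
$$

Since $0 \le \pi^*_\mu(a|s) \le C$, we have $\mu\int_{a\in\mathcal{A}} \pi^{*2}_\mu(a|s)\,da = \mu E_{a\sim\pi^*_\mu(\cdot|s)}\big[\pi^*_\mu(a|s)\big] \in [0, \mu C]$. Therefore,

$$
\tilde\eta(s) := \eta(s) - V^*_\mu(s) \in \left[-\mu C - C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da,\ -C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da\right].
$$

The stationarity condition can be reformulated as

$$
E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V^*_\mu(S^{t+1})\big] - \mu\,\mathrm{prox}^{\bullet}\big(\pi^*_\mu(a|s)\big) - \tilde\eta(s) + \varpi_1(s,a) - \varpi_2(s,a) - V^*_\mu(s) = 0. \tag{17}
$$

Clearly, $(\pi^*_\mu, V^*_\mu)$ is a solution of the above equation for some $\tilde\eta(s)$, $\varpi_1(s,a)$, and $\varpi_2(s,a)$ such that $\varpi_1(s,a) \ge 0$, $\varpi_2(s,a) \ge 0$, $\varpi_1(s,a)\cdot\pi_\mu(a|s) = 0$, $\varpi_2(s,a)\cdot\big(C - \pi_\mu(a|s)\big) = 0$, and $\tilde\eta(s) \in \big[-\mu C - C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da,\ -C\int_{a\in\mathcal{A}} \varpi_2(s,a)\,da\big]$. When $\sigma(W_{s,2}) = 0$, we have $\varpi_2(s,a) = 0$. Plugging this into equation (17) and denoting $W_{s,1}$ as $W_s$, we obtain the exact form of (9).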

B.3 PROOFS ON KERNEL REPRESENTATION

B.3.1 PROOF OF THEOREM S.2

Theorem S.2. We define the optimal weight function as u * = arg max u∈L 2 (C0) L 2 (V µ , π µ , η, ϖ, u). Let C(S × A) be all continuous functions on S × A. For any (s, a) ∈ S × A and s ′ ∈ S, the optimal weight function u * (S t , A t ) ∈ L 2 (C 0 ) ∩ C(S × A) and is unique if the reward function R(s ′ , s, a) and the transition kernel P(s ′ |s, a) are continuous over (s, a). Proof: Denote u = G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t ) -V µ (S t ). It follows from the definition of L 2 (V µ , π µ , η, ϖ, u), we have that min Vµ,πµ,η,ϖ max u L 2 (V µ , π µ , η, ϖ, u) = min Vµ,πµ,η,ϖ max u E S t ,A t ,S t+1 G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) u(S t , A t ) 2 = min Vµ,πµ,η,ϖ max u G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) , u(S t , A t ) 2 = min Vµ,πµ,η,ϖ G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) , √ C 0 u ∥ u∥ L 2 ) 2 = min Vµ,πµ,η,ϖ G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) , G Vµ,πµ (S t , A t , S t+1 ) -η(S t )+ ϖ(S t , A t )) -V µ (S t ) • C 0 u ∥ u∥ L 2 , u ∥ u∥ L 2 = min Vµ,πµ,η,ϖ G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) , G Vµ,πµ (S t , A t , S t+1 ) -η(S t )+ ϖ(S t , A t )) -V µ (S t ) = min Vµ,πµ,η,ϖ E S t ,A t C 0 G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) 2 , where the third equality is obtained by maximization condition of the inner product between u and G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t ) -V µ (S t ) is that the two terms should have the same direction; the fourth equality is obtained by the equality condition of the Cauchy-Schwartz inequality. where the last equality holds because of the maximization of inner product between ũ and E S t ,A t G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(A t |S t ) -V µ (S t ) K(• ; S t , A t )] should have the same direction. 
Then we have, min Vµ,πµ,η,ϖ E S t ,A t G Vµ,πµ S t , A t , S t+1 ) -η S t + ϖ A t | S t -V µ S t • K • ; S t , A t , C 0 ũ/∥ u∥ HRKHS 2 HRKHS = min V , µ πµ,η,ϖ E S t ,A t G Vµ,πµ S t , A t , S t+1 -η S t + ϖ A t | S t -V µ S t • K(• ; S t , A t ) , E S t ,A t G Vµ,πµ S t , A t , S t+1 ) -η S t + ϖ A t | S t -V µ S t • K(• ; S t , A t ) • ũ ∥ u∥ HRKHS , C 0 u ∥ u∥ HRKHS HRKHS = min Vµ,πµ,η,ϖ E S t ,A t G Vµ,πµ S t , A t , S t+1 -η S t + ϖ A t | S t -V µ S t • K • ; S t , A t , C 0 E S t , A t G Vµ,πµ S t , A t , S t+1 -η S t + ϖ( A t | S t ) -V µ ( S t ) • K • ; S t , A t HRKHS , where the first equality is by the equality condition of Cauchy-Schwarz inequality, i.e. ũ/∥ũ∥ HRKHS is linear dependent of E S t ,A t G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(A t |S t ) -V µ (S t ) K(• ; S t , A t )] . Then, by the reproducing property of K(S t , A t ; St , Ãt ), we have min Vµ,πµ,η,ϖ max u∈H C 0 K L 2 (V µ , π µ , η, ϖ, u) = min Vµ,πµ,η,ϖ E S t , S t ,A t , A t G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) C 0 K S t , A t ; • , K S t , A t ; • HRKHS G Vµ,πµ ( S t , A t , St+1 ) -η( S t ) + ϖ( A t | S t ) -V µ ( S t ) = min Vµ,πµ,η,ϖ E S t , S t ,A t , A t C 0 G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) K S t , A t ; S t , A t G Vµ,πµ ( S t , A t , S t+1 ) -η( S t ) + ϖ( A t | S t ) -V µ ( S t ) = min Vµ,πµ,η,ϖ E S t , S t ,A t , A t ,S t+1 , S t+1 C 0 G Vµ,πµ (S t , A t , S t+1 ) -η(S t ) + ϖ(S t , A t )) -V µ (S t ) K S t , A t ; S t , A t G Vµ,πµ ( S t , A t , S t+1 ) -η( S t ) + ϖ( A t | S t ) -V µ ( S t ) . Thus, we finish the proof. Proposition S.1. The quasi-optimal Bellman operator B µ is γ-contractive with respect to the supreme norm over S. That is ∥B µ V -B µ V ′ ∥ ∞ ≤ γ∥V -V ′ ∥ ∞ , for any generic value functions {V, V ′ : S → R}. 
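Proposition S.1 justifies that there exists a unique fixed point of $B_\mu$, i.e., $V^*_\mu$, indicating that the quasi-optimal value function $V^*_\mu$ and the induced policy $\pi^*_\mu$ are well defined and unique.

Proof: By the definition of $B_\mu$, the explicit form corresponding to $V$ is as follows:

$$
B_\mu V(s) = \max_\pi E_{a\sim\pi(\cdot|s)}\Big[E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V(S^{t+1})\big] + \mu\,\mathrm{prox}^{\bullet}(\pi(a|s))\Big].
$$

For any two value functions $V$ and $V'$, we have

$$
\begin{aligned}
\|B_\mu V - B_\mu V'\|_\infty
&= \Big\|\max_{\pi_1} E_{a\sim\pi_1(\cdot|s)}\Big[E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V(S^{t+1})\big] + \mu\,\mathrm{prox}^{\bullet}(\pi_1(a|s))\Big] \\
&\qquad - \max_{\pi_2} E_{a\sim\pi_2(\cdot|s)}\Big[E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V'(S^{t+1})\big] + \mu\,\mathrm{prox}^{\bullet}(\pi_2(a|s))\Big]\Big\|_\infty \\
&\le \Big\|\max_{\pi}\Big\{E_{a\sim\pi(\cdot|s)}\Big[E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V(S^{t+1})\big] + \mu\,\mathrm{prox}^{\bullet}(\pi(a|s))\Big] \\
&\qquad - E_{a\sim\pi(\cdot|s)}\Big[E_{S^{t+1}|s,a}\big[R(S^{t+1}, s, a) + \gamma V'(S^{t+1})\big] + \mu\,\mathrm{prox}^{\bullet}(\pi(a|s))\Big]\Big\}\Big\|_\infty \\
&= \Big\|\max_{\pi}\ \gamma\,E_{a\sim\pi(\cdot|s),\,S^{t+1}|s,a}\big[V(S^{t+1}) - V'(S^{t+1})\big]\Big\|_\infty \\
&\le \gamma\,\|V - V'\|_\infty .
\end{aligned}
$$

B.4.2 PROOF OF PROPOSITION S.2

Proposition S.2. For any $s \in \mathcal{S}$, the performance error between $V^*_\mu(s)$ and $V^*(s)$ satisfies

$$
\|V^*_\mu - V^*\|_\infty \le \frac{\mu\cdot\max\{|1 - C|, 1\}}{1 - \gamma},
$$

where $C$ is the upper bound for the induced policy $\pi_\mu$.

Proof of Proposition S.2:

$$
\|V^*_\mu - V^*\|_\infty = \|B_\mu V^*_\mu - BV^*\|_\infty \le \|B_\mu V^*_\mu - B_\mu V^*\|_\infty + \|B_\mu V^* - BV^*\|_\infty .
$$

Notice that $\|B_\mu V^*_\mu - B_\mu V^*\|_\infty \le \gamma\|V^*_\mu - V^*\|_\infty$ by Proposition S.1, and $\|B_\mu V^* - BV^*\|_\infty \le \mu\cdot\max\{|1 - C|, 1\}$ by Theorem 3.1. Therefore, $(1 - \gamma)\|V^*_\mu - V^*\|_\infty \le \mu\cdot\max\{|1 - C|, 1\}$, which completes the proof.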
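The contraction property in Proposition S.1 can be illustrated numerically on a small, randomly generated MDP. This is a sketch under stated assumptions rather than the authors' code: the function name `prox_bellman`, the array shapes, the uniform action grid with spacing `h`, and the bisection search for the multiplier are all hypothetical, and the regularizer is taken to be $\mu(1 - \pi(a|s))$ as in the proof of Theorem 3.1.

```python
import numpy as np


def prox_bellman(V, r, P, gamma, mu, C, h, n_iter=200):
    """Grid version of the quasi-optimal Bellman operator B_mu.

    V: (S,) values, r: (S, A) rewards, P: (S, A, S) transition probabilities,
    h: spacing of the uniform action grid. All shapes are illustrative choices.
    """
    Q = r + gamma * (P @ V)                      # Q[s, a]
    out = np.empty(len(V))
    for s, q in enumerate(Q):
        lo, hi = q.min() - mu - 2 * mu * C, q.max() + mu
        for _ in range(n_iter):                  # bisection on the multiplier eta
            eta = 0.5 * (lo + hi)
            pi = np.clip((q - eta + mu) / (2 * mu), 0.0, C)
            if pi.sum() * h > 1.0:
                lo = eta
            else:
                hi = eta
        # value of the maximizing density: E_pi[q + mu * (1 - pi)]
        out[s] = h * np.sum(pi * (q + mu * (1.0 - pi)))
    return out


rng = np.random.default_rng(0)
S, A, gamma, mu, C = 4, 50, 0.9, 0.5, 5.0
r = rng.uniform(-1.0, 1.0, size=(S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)               # rows become probability vectors
V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(prox_bellman(V1, r, P, gamma, mu, C, 1 / A)
                    - prox_bellman(V2, r, P, gamma, mu, C, 1 / A)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs, rhs)
```

On the grid, the inner maximization is a concave program whose KKT solution is the clipped-linear density, so the printed left-hand side should not exceed the right-hand side up to numerical tolerance.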

B.5 PROOF OF THEOREM 4.1

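Proof of Theorem 4.1: We first prove that when $\mu \to \infty$, $\pi^*_\mu$ degenerates to the uniform distribution over $\mathcal{A}$. By (4), we only need to prove that, for arbitrarily small $\epsilon > 0$,

$$
\left|\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)} - \frac{1}{\sigma(\mathcal{A})}\right| < \epsilon.
$$

Lower bound:

$$
\begin{aligned}
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}
&\ge \frac{Q^*_\mu(s,a)}{2\mu} - \frac{\sigma(W_s)\max_{a'} Q^*_\mu(s,a')}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)} \qquad (18) \\
&\ge \frac{Q^*_\mu(s,a)}{2\mu} - \frac{\max_{a'} Q^*_\mu(s,a')}{2\mu} + \frac{1}{\sigma(\mathcal{A})}.
\end{aligned}
$$

Thus, we aim to prove that

$$
\frac{Q^*_\mu(s,a) - \max_{a'} Q^*_\mu(s,a')}{2\mu} \to 0.
$$

Let $V^*$ be the unique fixed point of (1), and let $H_{\max} = \max_\pi H(\pi)$, where $H(\pi) = E_{a\sim\pi(\cdot|s)}[1 - \pi(a|s)]$. Let $r(s,a) := E_{S^{t+1}|s,a}[R(S^{t+1}, s, a)]$. By the definition of $Q^*_\mu$, we have

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1})\big]}{2\mu} = \frac{r(s,a)}{2\mu},
$$

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\gamma E_{S^{t+1}|s,a}\big[V^*_\mu(S^{t+1}) - V^*(S^{t+1})\big]}{2\mu} - \frac{\gamma E_{S^{t+1}|s,a}\big[V^*(S^{t+1})\big]}{2\mu} = \frac{r(s,a)}{2\mu}.
$$

Therefore,

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\mu\gamma H_{\max}}{2(1-\gamma)} \le \frac{r(s,a)}{2\mu} + \frac{\gamma E_{s'|s,a}\big[V^*(s')\big]}{2\mu},
$$

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\mu\gamma H_{\max}}{2(1-\gamma)} \le \frac{R_{\max}}{2(1-\gamma)\mu}. \tag{20}
$$

Meanwhile, from another perspective, the proximal Bellman operator (2) can be treated as a new MDP with the immediate reward $r(s,a) + \mu H(\pi(\cdot|s))$ for given $s, a$. We combine this with the fact that

$$
\frac{\gamma\mu H_{\max}}{1-\gamma} = \max_\pi E_\pi\left[\sum_{t=2}^{\infty} \gamma^{t-1}\big(\mu - \mu\pi(A^t|S^t)\big)\,\Big|\, S^1 = s, A^1 = a\right].
$$

Let $\pi_H = \arg\max_\pi H(\pi(a|s))$; then

$$
\begin{aligned}
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\mu\gamma H_{\max}}{2(1-\gamma)}
&= \frac{Q^*_\mu(s,a)}{2\mu} - \max_\pi E_\pi\left[\sum_{t=2}^{\infty} \gamma^{t-1}\big(\mu - \mu\pi(A^t|S^t)\big)\,\Big|\, S^1 = s, A^1 = a\right] \\
&\ge \frac{Q^{\pi_H}_\mu(s,a)}{2\mu} - E_{\pi_H}\left[\sum_{t=2}^{\infty} \gamma^{t-1}\big(\mu - \mu\pi_H(A^t|S^t)\big)\,\Big|\, S^1 = s, A^1 = a\right] \\
&= E_{\pi_H}\left[\sum_{t=1}^{\infty} \gamma^{t-1}\frac{r(S^t, A^t)}{2\mu}\,\Big|\, S^1 = s, A^1 = a\right] \ge -\frac{R_{\max}}{2(1-\gamma)\mu}. \qquad (21)
\end{aligned}
$$

Based on (20) and (21), we have

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\max_{a'} Q^*_\mu(s,a')}{2\mu} = \frac{Q^*_\mu(s,a)}{2\mu} - \frac{\gamma H_{\max}}{2(1-\gamma)} + \frac{\gamma H_{\max}}{2(1-\gamma)} - \frac{\max_{a'} Q^*_\mu(s,a')}{2\mu} \ge -\frac{R_{\max}}{(1-\gamma)\mu}.
$$

Similarly, we also have

$$
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\max_{a'} Q^*_\mu(s,a')}{2\mu} \le \frac{R_{\max}}{(1-\gamma)\mu}. \tag{23}
$$

Therefore, the lower bound approaches $\frac{1}{\sigma(\mathcal{A})}$.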
For the upper bound, we have $\int_{a\in\mathcal{A}} \pi^*_\mu(a|s)\,da = 1$, thus

$$
\int_{a\in\mathcal{A}} \left[\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a'\in W_s} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}\right]_+ da \ge \int_{a\in\mathcal{A}} \left(\frac{\min_{a''} Q^*_\mu(s,a'')}{2\mu} - \frac{\int_{a'\in W_s} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}\right) da,
$$

$$
\frac{1}{\sigma(\mathcal{A})} \ge \frac{\min_{a''} Q^*_\mu(s,a'')}{2\mu} - \frac{\int_{a'\in W_s} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}.
$$

By (23), we then have

$$
\begin{aligned}
\frac{Q^*_\mu(s,a)}{2\mu} - \frac{\int_{a\in W_s} Q^*_\mu(s,a)\,da}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)}
&= \frac{Q^*_\mu(s,a)}{2\mu} - \frac{\max_{a''} Q^*_\mu(s,a'')}{2\mu} + \frac{\max_{a''} Q^*_\mu(s,a'')}{2\mu} - \frac{\int_{a'\in W_s} Q^*_\mu(s,a')\,da'}{2\mu\sigma(W_s)} + \frac{1}{\sigma(W_s)} \qquad (24) \\
&\le \frac{1}{\sigma(\mathcal{A})} + \frac{R_{\max}}{(1-\gamma)\mu}.
\end{aligned}
$$

Therefore, by the lower bound and the upper bound, we conclude that $\pi_\mu(a|s)$ converges to the uniform distribution on $\mathcal{A}$ as $\mu \to \infty$.

For the case $\mu \to 0$, we prove that $\pi_\mu$ converges to a uniform distribution whose support set has length $\frac{1}{C}$. Therefore, when $C \to \infty$, it converges to a point mass. According to (15), we only need to prove that $\sigma(W_{s,1}) \to 0$ as $\mu \to 0$. Meanwhile, by Theorem S.1, $a \in W_{s,1}$ if

$$
\sigma(W_{s,1})\,Q^*_\mu(s,a) - \int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da' - 2\mu + 2\mu C\sigma(W_{s,2}) \in \big(0,\ 2\mu C\sigma(W_{s,1})\big).
$$

As $\mu \to 0$, the interval $\big(0, 2\mu C\sigma(W_{s,1})\big)$ shrinks to $0$. Thus, by the squeeze theorem, we have

$$
\sigma(W_{s,1})\,Q^*_\mu(s,a) - \int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da' - 2\mu + 2\mu C\sigma(W_{s,2}) \to 0 \quad \text{as } \mu \to 0,
$$

which is equivalent to $\sigma(W_{s,1})\,Q^*_\mu(s,a) - \int_{a'\in W_{s,1}} Q^*_\mu(s,a')\,da' \to 0$ for all $a \in W_{s,1}$. Therefore, $W_{s,1}$ can only include actions $a$ with the same value of $Q^*_\mu(s,a)$, which should form a set of isolated points rather than an interval. Thus, $\sigma(W_{s,1}) = 0$, and $\pi^*_\mu(a|s)$ converges to a uniform distribution with interval length $\frac{1}{C}$.

B.6 PROOF OF LEMMA S.1

Before we prove the main result, we first provide a helper lemma for studying the boundedness of the symmetric kernel in the U-statistic.

Lemma S.1. Under Assumption 1, for any $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $\mu \in (0, \infty)$, we have that

$$
\sup_{s\in\mathcal{S},\,a\in\mathcal{A}} \big|G_{V_\mu,\pi_\mu}(s, a, s') - \eta(s) + \varpi(s,a) - V_\mu(s)\big| \le M_{\max},
$$

where $M_{\max} = \frac{4}{1-\gamma}R_{\max} + \mu C$.
Proof of Lemma S.1: G Vµ,πµ (s, a, s ′ ) -η(s) + ϖ(s, a) -V µ (s) =R(s ′ , s, a) + γV µ (s ′ ) + µ -2µπ µ (a|s) -η(s) + ϖ(s, a) -V µ (s) ≤R max + µ + µC + γV µ (s ′ ) -V µ (s) -2µπ µ (a|s) + ϖ(s, a) . By checking the KKT conditions, we can further simplify the term (a). Specifically, 1. If π µ = 0, then ϖ ≥ 0. By the stationarity equation ( 9), we have (a) = ϖ(s, a) = η(s) -Q µ (s, a) -µ + V µ (s) ≤ R max + γ R max -µH 1 -γ -µ + R max + µH 1 -γ H := E a∼πµ(•|s) (1 -π µ (a|s)) ≤ 2 1 -γ R max -µ + µH ≤ 2 1 -γ R max . 2. If π µ ∈ (0, C], then ϖ = 0 (a) = -2µπ µ (a|s) < 0. Therefore, G πµ (s, a, s ′ ) -η(s) + ϖ(s, a) -V µ (s) ≤R max + µ + µC + γV µ (s ′ ) -V µ (s) + 2 1 -γ R max ≤R max + µ + µC + γ R max + µH 1 -γ - -R max + µH 1 -γ + 2 1 -γ R max ≤ 4 1 -γ R max + µC + µ -µH ≤ 4 1 -γ R max + µC. Thus, we gain the upper bound. For the lower bound, the same technique is applied, and we can also gain that G Vµ,πµ (s, a, s ′ ) -η(s) + ϖ(s, a) -V µ (s) ≥ - 4 1 -γ R max -µC. Therefore, this completes the proof. B.7 PROOF OF THEOREM 4.2 Proof of Theorem 4.2: We first define an operator P from G Vµ,πµ (S k , A k , S k+1 ) to G Vµ,πµ (S k , A k , S k+1 ) -η(S k ) + ϖ(S k , A k ) to simplify the expression, such that PG Vµ,πµ (S k , A k , S k+1 ) := G Vµ,πµ (S k , A k , S k+1 ) -η(S k ) + ϖ(A k |S k ), We further define several other notations U T := T 2 -1 1≤j̸ =k≤T K(S j , A j ; S k , A k ){PG Vµ,πµ (S j , A j , S j+1 ) -V µ (S j )}• {PG Vµ,πµ (S k , A k , A k+1 ) -V µ (S k )} K S t , A t , S t+1 ; S t , A t , S t+1 := K S t , A t ; S t , A t PG Vµ,πµ S t , A t , S t+1 -V µ S t PG Vµ,πµ S t , A t , S t+1 -V µ S t . Let the expectation with respect to stationary trajectory and i.i.d training set as E T and E respectively. For any finite threshold parameter µ < ∞ and any ϵ > 0, we have P L U -L U > ϵ = P L U -E (U T ) + E (U T ) -L U > ϵ ≤ P L U -E (U T ) > ϵ 2 (i) + P |E (U T ) -L U | > ϵ 2 (ii) . 
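The statistic $U_T$ defined above can be evaluated directly from a trajectory once the residuals are available. The sketch below is a hedged illustration, not the authors' implementation: the function name `kernel_u_statistic`, the Gaussian bandwidth parameter, and the normalization by the number of ordered pairs $T(T-1)$ (so that the result is an average) are assumptions made here. The input `delta[j]` stands for the residual $PG_{V_\mu,\pi_\mu}(S^j, A^j, S^{j+1}) - V_\mu(S^j)$ and `x[j]` for the concatenated state-action pair $(S^j, A^j)$.

```python
import numpy as np


def kernel_u_statistic(x, delta, bandwidth=1.0):
    """Average of K(x_j, x_k) * delta_j * delta_k over all pairs j != k,
    with a Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    delta = np.asarray(delta, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(delta), -1)
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dist / (2.0 * bandwidth**2))
    full = delta @ K @ delta                  # includes the j == k terms
    diag = np.sum(np.diag(K) * delta**2)      # remove them: K(x, x) = 1
    T = len(delta)
    return (full - diag) / (T * (T - 1))


# Hypothetical residuals and state-action pairs, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(6, 2))
delta_demo = rng.normal(size=6)
print(kernel_u_statistic(x_demo, delta_demo))
```

Because the Gaussian kernel is bounded by 1, each summand is bounded by the product of two residual magnitudes, which is the role Lemma S.1 plays when Hoeffding-type inequalities are applied in the proof.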
For (i), since the Gaussian kernel satisfy that |K(•; •)| ≤ 1, then by Lemma S.1, we have K s, a, s ′ ; s, ã, s′ ≤ M 2 max , for any s, s, a, ã. By Hoeffding's inequality, we have (i) ≤ 2 exp - nϵ 2 2M 4 max . ( ) For the term (ii), the expectation of U T as E T (U T ) can be calculated as follows: E T (U T ) = T 2 -1 1≤j̸ =k≤T E T K(S j , A j ; S k , A k ){PG Vµ,πµ (S j , A j , S j+1 ) -V µ (S j )}• {PG Vµ,πµ (S k , A k , S k+1 ) -V µ (S j )} . If with-in trajectory samples are independent, then it is obvious that E T (U T ) = E T K S t , A t , S t+1 ; S t , A t , S t+1 := U * . However, for weakly dependent data, dependency may introduce an additional bias term E T (U T )-U * , thus we further decompose the term (ii) as (ii) = P(|E (U T ) -E [E T (U T )]| (iii) + | E [E T (U T )] -EU ⋆ ) | (iv) > ϵ 2 ). For the term (iii), we follow a similar idea to use a novel decomposition of the variance term of U-statistic from Han (2018) . The idea is to break down the summation of U-statistic into numerous parts, where the current time is affected by randomness, and the historical time will be canceled out after conditioning on the future. As | K(• ; •)| is bounded by M 2 max , under the mixing condition of Assumption 4.2, the exponential inequality from Merlevède et al. (2009) can be applied to to bound each decomposition part. Then we follow the Theorem 3.1 from Han (2018) that for any ϵ 0 , P(|E (U T ) -E [E T (U T )]| > ϵ 0 ) ≤ 2 exp - M 4 max T ϵ 2 0 C ′ 1 + M 2 max log log(4T ) log T T ϵ 0 C ′ 1 -1 , where C ′ 1 is some constant. Then, we proceed to bound the term (iv). 
By Hoeffding decomposition of kernel function K S t , A t , S t+1 ; St , Ãt , St+1 , there exist kernel functions K1 (S t , A t , S t+1 ) and K2 S t , A t , S t+1 ; St , Ãt , St+1 such that K1 (s, a, s ′ ) = E T K s, a, s ′ ; S t , A t , S t+1 -U * , K2 s, a, s ′ ; s, a, s ′ = K (s, a, s ′ ; s, a, s ′ ) -K1 (s, a, s ′ ) -K1 s, a, s ′ -U * , and E T K1 (S t , A t , S t+1 ) = 0, E T K2 S t , A t , S t+1 ; St , Ãt , St+1 = 0. Then by Hoeffding decomposition of U T , we have U T = U * + 2 n T t=1 K1 (S t , A t , S t+1 ) + U K2 . Taking the expectation from both sides: E T [U T ] = U * + 2 n T k=1 E T K1 (S t , A t , S t+1 ) + E T [U K2 ] = U * + E T [U K2 ] Therefore, by Lyapunov inequality, we can bound the bias term |E T [U T ] -U ⋆ | = E T U K2 ≤ E T U K2 ≤ E T U 2 K2 = 1≤h1≤l1≤T,1≤h2≤l2≤T E T K2 (S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 ) • K2 (S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ) 4 T 2 (T -1) 2 . ( ) We proceed by the discussing the relationship between h 1 , h 2 , l 1 , l 2 . Case 1.1: If 1 ≤ h 1 ≤ h 2 ≤ l 1 ≤ l 2 ≤ T and l 2 -l 1 ≤ h 1 -h 2 . Under the mixing condition assumption, and by Generalized Correlation inequality in Lemma 2 of, we have E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤4 M 2r max 1/r β 1/s (h 2 -h 1 ) , where 1/r + 1/s = 1, s > -1. Case 1.2: If 1 ≤ h 1 ≤ h 2 ≤ l 1 ≤ l 2 ≤ T and h 1 -h 2 ≤ l 2 -l 1 . Similar as Case 1.1, we have E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤4 M 2r max 1/r β 1/s (l 2 -l 1 ) . 
Combine Case 1.1 and Case 1.2, we apply the bounded inequalities (2.17-2.21) from Yoshihara (1976) , and have the following result 1≤h1≤h2≤l1≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤ l2-l1≤h2-h1 1≤h1≤h2≤l1≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 • K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 + h2-h1≤l2-l2 1≤h1≤h2≤l1≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤ M 2 max T 2 T j=1 (j + 1)β 1/s (j) = O M 2 max T 3-τ , where τ = 2 s+1 -2 1-δ1 1 δ1-1 1 + 1 s+1 . ( ) Case 2: If 1 ≤ h 1 ≤ l 1 ≤ h 2 ≤ l 2 ≤ T . Using similar technique as Case 1.1 and 1.2, we have 1≤h1≤l1≤h2≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤ l2-h2≤l1-h1 1≤h1≤l1≤h2≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 • K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 + l1-h1≤l2-h2 1≤h1≤l1≤h1≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 = O M 2 max T 3-τ Case 3: If 1 ≤ h 1 ≤ l 1 ≤ T and 1 ≤ h 2 = l 2 ≤ T . Following the same technique, we have 1≤h2=l2≤T 1≤h1≤l1≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 • K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤ 1≤h1=l1≤T 1≤h2=l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 + 2 1≤h1<l1≤T 1≤h2=l2≤T E T K2 S h1 , A h1 , S h1+1 ; § l1 , A l1 , S l1+1 • K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 ≤ U 2 max T 2 + M 2 max T 2 T j=1 β 1/s (j) = O M 2 max T 2 . Case 4: If 1 ≤ h 1 = l 1 ≤ T and 1 ≤ h 2 ≤ l 2 ≤ T . Using the same technique, we can obtain the same rate as follows: 1≤h1=l1≤T 1≤h2≤l2≤T E T K2 S h1 , A h1 , S h1+1 ; S l1 , A l1 , S l1+1 • K2 S h2 , A h2 , S h2+1 ; S l2 , A l2 , S l2+1 = O M 2 max T 2 . Combine Case 1-4 with the equation ( 28), we conclude that |EU T -U * | ≤ C ′ 0 M 2 max T -1+τ a.s. 
We further use the continuous mapping theorem to conclude that E[E T (U T )] -EU * ≤ C ′ 0 M 2 max T -1+τ 2 a.s., where τ is defined in (29) and C ′ 0 is a constant. As τ > 0, we have T -1+τ 2 < T -1 2 . Combine ( 27) and ( 30), for sufficiently large T , we have (ii) = P (|E (U T ) -E [E T (U T )]| + | E [E T (U T )] -EU ⋆ ) |> ϵ 2 ≤ 2 exp - T C ′ 1 ϵ/2 -C ′ 0 M 2 max T -(1+τ )/2 2 M 4 max + M 2 max ϵ/2 -C ′ 0 M 2 max T -(1+τ )/2 log T log log 4T = 2 exp - T C ′ 1 ϵ 2 /4 -T c 1 ϵC ′ 0 M 2 max T -(1+τ )/2 + T C ′ 1 C ′ 0 2 M 4 max T -(1+τ ) M 4 max + M 2 max ϵ/2 -C ′ 0 M 2 max T -(1+τ )/2 log T log log 4T = 2 exp - T c 1 ϵ 2 /4 -T T -(1+τ )/2 c 1 ϵC ′ 0 M 2 max + c 1 C ′ 0 2 M 4 max T -τ M 4 max + M 2 max ϵ/2 -C ′ 0 M 2 max T -(1+τ )/2 log T log log 4T Then by the monotonicity of exp(•), T T -(1+τ )/2 C ′ 1 ϵC ′ 0 M 2 max -T -τ C ′ 1 C ′ 0 2 M 4 max -T C ′ 1 ϵ 2 /4 M 4 max + log T log log 4T M 2 max ϵ/2 -T -(1 + τ )/2 log T log log 4T C ′ 0 M 4 max ≤ - T C ′ 1 ϵ 2 /4 -T 1/2 C ′ 1 ϵC ′ 0 M 2 max + T -τ C ′ 1 C ′ 0 2 M 4 max M 4 max + log T log log 4T M 2 max ϵ/2 -T -1/2 log T log log 4T C ′ 0 M 4 max ≤ - cC ′ 1 ϵ 2 T /4 -C ′ 0 C ′ 1 ϵM 2 max √ T M 2 max ϵ/2 -C ′ 0 M 2 max / √ T log T log log 4T + M 4 max ( ) where U K2 := U K2 (V θ1 µ , π θ2 µ , η ξ1 , ϖ ξ2 ) is defined similarly as in the proof of Theorem 4.2. The details of the decomposition can be seen in the proof of Theorem 4.2. The term ∆ 2 can be immediately decomposed as follows 1 T 0 T0 t=1 K X t , X (T0+t) -E T 1 T 0 T0 t=1 K X t , X = sup T0+t) which itself is a two-dimensional stationary sequences under mixing condition. Note that the last term is the expectation of the suprema of the empirical process (V θ 1 µ ,π θ 2 µ ,η ξ 1 ,ϖ ξ 2 ) 1 T 0 T0 t=1 Ḡ X t -E T 1 T 0 T0 t=1 Ḡ X t , where X t = X t , X 1/T 0 T0 t=1 Ḡ( X t ) -E T [1/T 0 T0 t=1 Ḡ( X t ) ] on the space Ḡθ,ξ . 
The distance in Ḡθ,ξ can be bounded by the following, N min{(2µ max + 4)M max , 1}ε, Ḡθ,ξ , { X t } T0 t=1 ≤ N Cε, Θ 1 , {D i } n i=1 N Cε, Θ 2 , {D i } n i=1 N Cε, Ξ 1 , {D i } n i=1 N Cϵ, Ξ 2 , {D i } n i=1 =e 4 (D Θ1 + 1) (D Θ2 + 1) (D Ξ1 + 1) (D Ξ2 + 1) 2eM max Cϵ DΘ 1 +DΘ 2 +DΞ 1 +DΞ 2 which implies N ϵ 16 , Ḡθ,ξ , { X t } T0 t=1 ≤e 4 (D Θ1 + 1) (D Θ2 + 1) (D Ξ1 + 1) (D Ξ2 + 1) 64( max +32)U 2 max e Cϵ DΘ 1 +DΘ 2 +DΞ 1 +DΞ 2 :=C 3 1 ϵ D Ḡθ,ξ where D Ḡθ,ξ = D Θ1 + D Θ2 + D Ξ1 + D Ξ2 . First, without loss of generality, let T 0 = 2m T0 k T0 for appropriate positive integers m T0 k T0 as in (Yu, 1994) . Follow Lemma 5 in Antos et al. (2008) , we obtain that P( sup (V θ 1 µ ,π θ 2 µ ,η ξ 1 ,ϖ ξ 2 ) 1 T 0 T0 t=1 Ḡ X t -E T 1 T 0 T0 t=1 Ḡ X t ≥ ϵ 2 ) ≤ C 3 1 ϵ D Ḡθ,ξ exp -4C 4 m T0 ϵ 2 + 2m T0 β(k T0 ) where C 4 = 1 2 1 8M 2 max 2 . If D Ḡ ≥ 2, and let β(m) ≲ exp (-δ 1 m) , T ≥ 1, m T = C 4 T 0 ϵ 2 /δ 1 1 2 , m T0 = T 0 / (2k T0 ), where D Ḡθ,ξ ≥ 2, C 3 , C 4 , δ 1 , we apply Lemma 14 in Antos et al. (2008) , then 2m T0 β k T 0 + C 1 1 ϵ D Ḡθ,ξ exp -4C 2 m T0 ϵ 2 ≤ δ and we have, with probability 1 -δ, ∆ 1 2 ≤ 2∆(∆/δ 1 ∨ 1) C 4 T 0 =⇒ ∆ 1 2 ≤ 2∆(∆/δ 1 ∨ 1) C 4 ⌊T /2⌋ where /2 , D Ḡθ,ξ = P-dim(Θ 1 ) + P-dim(Θ 2 ) + P-dim(Ξ 1 ) + P-dim(Ξ 2 ), and C 1 , ..., C 5 are some constants. Adapt the notations for the constants number from Theorem 4.2. By some algebra, we conclude that , D P-dim = P-dim(Θ 1 ) + P-dim(Θ 2 ) + P-dim(Ξ 1 ) + P-dim(Ξ 2 ), and C 4 , ..., C 8 are some constants. 
B.9 PROOF OF THEOREM 4.4

$\Delta = (D_{\bar{G}_{\theta,\xi}}/2)\log T_0 + \log(e/\delta) + \log^{+}\big(C_3 C_4^{D_{\bar{G}_{\theta,\xi}}/2}\big) \implies \Delta = (D_{\bar{G}_{\theta,\xi}}/2)\log(T/2) + \log(e/\delta) + \log^{+}\big(C_3 C_4^{D_{\bar{G}_{\theta,\xi}}/2}\big)$

Now, we conclude that ∥ V θ1,k µ -V * ∥ 2 L 2 ≤ C 1 κ min (1 -γ) 2   C 3 D log 8C1 δ n + 2∆(∆/δ 1 ∨ 1) C 4 ⌊T /2⌋   + C 2 µ 2 (C + |1 -C| ∨ 1) 2 (1 -γ) 2 + C 5 V θ1 µ -V θ1,k ∥ V θ1,k µ -V π * ∥ 2 L 2 ≤ C 4 κ min (1 -γ) 2   C 5 D P-dim log 8C4 δ n + 2 ∆ δ1 ∨ 1 ∆ C 6 ⌊T /2⌋   generalization error + C 7 µ 2 (C + |1 -C| ∨ 1) 2 (1 -γ) 2 proximal bias + C 8 V θ1 µ -V θ1,k

We note that SGD converges globally to a stationary point at a sublinear rate in the convex case. However, this result does not typically hold in the non-convex setting. The intuition behind the proof is that our quasi-optimal algorithm can be regarded as a special case of the randomized stochastic descent (RSD) algorithm for solving the non-convex minimization problem. The convergence analysis of the randomized stochastic descent algorithm has been established in Corollary 2.2 of Ghadimi & Lan (2013). That is, RSD provably converges to a stationary point. Following Theorem 3 in Drori & Shamir (2020), an unbiased SGD algorithm, i.e., the quasi-optimal

D.3 ADDITIONAL EXPERIMENT RESULTS

D.3.1 SENSITIVITY ANALYSES

To validate the cross-validation procedure in practice and to analyze the effect of µ on model performance, we conduct sensitivity analyses with respect to µ. The results are summarized in Figure 4. They confirm that the cross-validation procedure indeed selects a proper µ, one that maximizes the discounted return.

D.3.2 DISTRIBUTION EVALUATION CRITERION

To measure safety performance, we evaluate the distribution of the Monte Carlo discounted sum of rewards over roll-out trajectories (Dabney et al., 2018a), instead of its empirical mean, i.e., the discounted return. In particular, we generate 100 trajectories under the learned policy and record the discounted sum of rewards of each trajectory. We then draw the density plots in Figure 5 for all four environments. As shown in Figure 5, the distribution under quasi-optimal learning has a thinner left tail, in line with the two safe RL algorithms IQN and CQL. This indicates a lower chance of entering a low-reward trajectory, i.e., one damaged by high-risk actions. In contrast, the non-safe RL approach SAC spreads more mass over both extremes; hence, SAC may enter a low-reward trajectory with higher probability (a heavier left tail) than quasi-optimal learning and the two safe RL baselines. This supports the claim that quasi-optimal learning avoids risky actions as the two safe RL baselines do.
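The evaluation described above (per-trajectory discounted sums, then a look at the left tail rather than the mean) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper names `discounted_sum` and `empirical_quantile` are hypothetical, and using a low empirical quantile as a left-tail summary is an assumption made here for illustration; the paper itself inspects density plots.

```python
import math


def discounted_sum(rewards, gamma=0.9):
    """Discounted sum of rewards for a single roll-out trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def empirical_quantile(values, q):
    """Nearest-rank empirical quantile for 0 < q <= 1 (illustrative tail summary)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def left_tail_summary(trajectories, gamma=0.9, q=0.1):
    """Return (mean discounted return, q-th quantile of per-trajectory sums)."""
    sums = [discounted_sum(rewards, gamma) for rewards in trajectories]
    return sum(sums) / len(sums), empirical_quantile(sums, q)
```

Two policies with the same mean discounted return can differ markedly in the low quantile, which is the kind of difference the density plots are meant to expose.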

D.3.3 MODEL PERFORMANCE ON LARGE DATASET

We evaluate the model performance in large-sample scenarios (10,000 transition pairs; n = 100, T = 100) for all four environments. The results are presented in Figure 6. The deep RL baseline methods show some improvement in performance and some variance reduction with more training samples. Nevertheless, quasi-optimal learning still outperforms all competing methods, as shown in Figure 6.

D.3.4 SAFE TRANSITIONS AND LEARNED POLICY DISTRIBUTION

We illustrate the safety of the proposed method by evaluating the proportion of safe transitions, i.e., transitions from a fixed current state to a safe next state. The goal of the OhioT1M case study is to maintain the glucose level in a safe range. A safe state in this study is a state in which the glucose level is within 80-140 mg/dL. The reward function is the index of glycemic control (Rodbard, 2009),

$$R_i^t = -\frac{\mathbb{1}(S_{i,1}^t > 140)\,(140 - S_{i,1}^t)^{1.1} + \mathbb{1}(S_{i,1}^t < 80)\,(S_{i,1}^t - 80)^{2}}{30},$$

where $S_{i,1}^t$ is the glucose level of patient i at decision stage t. This reward favors the safe range and penalizes the risky scenario in which the glucose level falls outside 80-140 mg/dL. The evaluation procedure is as follows. In the offline OhioT1M dataset, we pick the observed states that transited to risky states, i.e., states outside the safe glucose range. On these states, we calculate the proportion of safe transitions, where the next states are sampled from the transition kernel under the learned policy. The transition kernel is estimated by maximum likelihood from the offline dataset. We summarize the safe proportions over 1000 sampled transitions in Figure 7. As shown, quasi-optimal learning achieves a safe proportion of 82.2%, compared with 67.3% for the safe RL baseline IQN and 44.6% for the non-safe RL baseline SAC.
From these results, we conclude that quasi-optimal learning enjoys a better safety guarantee when applied to the medical domain.
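The glycemic reward and the safe-transition proportion described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the hyperglycemic penalty is computed on the absolute deviation |140 - S|, which is an assumption made here because a negative base raised to the power 1.1 is not real-valued. The next-state glucose levels are assumed to have been sampled already from an estimated transition kernel.

```python
def glycemic_reward(glucose, low=80.0, high=140.0):
    """Index-of-glycemic-control style reward; zero inside [low, high].

    Assumption: the hyperglycemic term uses |high - glucose| ** 1.1.
    """
    penalty = 0.0
    if glucose > high:
        penalty += abs(high - glucose) ** 1.1
    if glucose < low:
        penalty += (glucose - low) ** 2
    return -penalty / 30.0


def safe_proportion(next_glucose_levels, low=80.0, high=140.0):
    """Fraction of sampled next states whose glucose level lies in [low, high]."""
    if not next_glucose_levels:
        raise ValueError("need at least one sampled transition")
    safe = sum(1 for g in next_glucose_levels if low <= g <= high)
    return safe / len(next_glucose_levels)
```

Because the hypoglycemic term is quadratic while the hyperglycemic term has exponent 1.1, a deviation of the same size below 80 mg/dL is penalized far more heavily than one above 140 mg/dL.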



Figure 3: The boxplot of the discounted return over 50 repeated experiments.

on Constructing Quasi-Optimal Bellman Operator
B.1.1 Proof of Theorem S.1
B.1.2 Proof of Corollary S.1
B.1.3 Proof of Theorem 3.1
B.1.4 Proof of Theorem 3.2
B.2 Proofs on Quasi-Optimal Stationarity Equation
B.2.1 Proof of Theorem 3.3
B.3 Proofs on Kernel Representation
B.3.1 Proof of Theorem S.2
B.3.2 Proof of Theorem S.3
B.4 Proofs on Generic Properties of Quasi-optimal Bellman Operator
B.4.1 Proof of Proposition S.1
B.4.2 Proof of Proposition S.2
B.5 Proof of Theorem 4.1
B.6 Proof of Lemma S.1
B.7 Proof of Theorem 4.2
B.8 Proof of Theorem 4.3
B.9 Proof of Theorem 4.4
of Simulation Settings and Real Data Analysis
D.2 Additional Experiment Details
D.3 Additional Experiment Results
D.3.1 Sensitivity Analyses
D.3.2 Distribution Evaluation Criterion

PROOFS ON GENERIC PROPERTIES OF QUASI-OPTIMAL BELLMAN OPERATOR B.4.1 PROOF OF PROPOSITION S.1

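$+\ \underbrace{\epsilon}_{\text{approximation error}}$, where $\Delta = (D_{\bar{G}_{\theta,\xi}}/2)\log(\lfloor T/2 \rfloor) + \log(e/\delta) + \log^{+}\big(C_3 C_4^{D_{\bar{G}_{\theta,\xi}}}\big)$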

Figure 4: The sensitivity analyses of µ over 50 repeated experiments

Figure 5: The distribution of Monte-Carlo discounted sum of rewards over 50 repeated experiments.

Figure 6: The boxplot of discounted return over 30 repeated experiments with sample size N = 100, T = 100.

Theorem S.1. Assume the induced policy has density function π * µ (a|s) ≤ C for all a, s, where C is a given constant. Then the proximal Bellman operator B µ in equation (2) has a closed form equivalent:

The mean running time in seconds of each method over 50 experiment runs in Environment II. The synthetic experiments are conducted on a single 2.3 GHz Dual-Core Intel Core i5 CPU


D.3.3 Model Performance on Large Dataset
D.3.4 Safe Transitions and Learned Policy Distribution


This finding indicates that there exists a closed-form solution for the optimal weight function $u^*$, namely $u^*(s,a) = G_{V^*_\mu,\pi^*_\mu}(s,a,s') - \eta(s) + \varpi(s,a) - V^*_\mu(s)$, which equals $u$ when $(V_\mu, \pi_\mu) = (V^*_\mu, \pi^*_\mu)$. Notice that, for a given µ, $W_s$ is fully determined by $Q^*_\mu(s,a)$; thus, by Equations (3) and (4), $\pi^*_\mu(a|s)$ and $V^*_\mu(s)$ are continuous in $Q^*_\mu(s,a)$. Additionally, by the complementary slackness and stationarity conditions in Theorem S.1, and since $V^*_\mu$ and $\pi^*_\mu$ can be represented as functions of $Q^*_\mu(s,a)$, the Lagrange multiplier term $-\eta(s) + \varpi(s,a)$ can also be represented as a function of $Q^*_\mu(s,a)$ and is continuous in $Q^*_\mu(s,a)$. As $\pi^*_\mu(a|s)$, $V^*_\mu(s)$ and $-\eta(s) + \varpi(s,a)$ are all continuous in $Q^*_\mu(s,a)$, we only need to prove that $Q^*_\mu(s,a)$ is continuous in $(s,a)$. By the stationarity equation in Theorem 3.3, $E_{s'|s,a}[R(s',s,a)] = g(Q^*_\mu(s,a))$. Since the reward function $R(s',s,a)$ and the transition kernel $P(s'|s,a)$ are continuous in $(s,a)$ by assumption, $E_{s'|s,a}[R(s',s,a)]$ is continuous at any $(s,a)$, and hence so is $Q^*_\mu(s,a)$. Therefore, the optimal weight function $u^*(s,a)$ is continuous at any state-action pair $(s,a)$.

B.3.2 PROOF OF THEOREM S.3

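Theorem S.3. Suppose $u^* \in \mathcal{H}^{C_0}_K$ is reproduced by a universal kernel $K(\cdot,\cdot)$. Then the minimax optimization problem (10) can be decoupled into a single-stage minimization problem:

$$\min_{V_\mu,\pi_\mu,\eta,\varpi} E\Big[\big(G_{V_\mu,\pi_\mu}(S^t,A^t,S^{t+1}) - \eta(S^t) + \varpi(A^t|S^t) - V_\mu(S^t)\big) \cdot C_0 K(S^t,A^t;\tilde S^t,\tilde A^t)\,\big(G_{V_\mu,\pi_\mu}(\tilde S^t,\tilde A^t,\tilde S^{t+1}) - \eta(\tilde S^t) + \varpi(\tilde A^t|\tilde S^t) - V_\mu(\tilde S^t)\big)\Big],$$

where $(\tilde S^t,\tilde A^t,\tilde S^{t+1})$ is an independent copy of the transition pair $(S^t,A^t,S^{t+1})$.

Proof: Let $\tilde u = E_{S^t,A^t}\big[\big(G_{V_\mu,\pi_\mu}(S^t,A^t,S^{t+1}) - \eta(S^t) + \varpi(A^t|S^t) - V_\mu(S^t)\big) K(\cdot,\{S^t,A^t\})\big]$, and define the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_{RKHS}}$ in $\mathcal{H}^{C_0}_K$. It follows from the definition of $L(V_\mu,\pi_\mu,\eta,\varpi,u)$ and the kernel reproducing property that

$\min_{V_\mu,\pi_\mu,\eta,\varpi} \max_u L^2(V_\mu,\pi_\mu,\eta,\varpi,u)$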

$= \min_{V_\mu,\pi_\mu,\eta,\varpi} \max_u \Big(E_{S^t,A^t}\big[\big(G_{V_\mu,\pi_\mu}(S^t,A^t,S^{t+1}) - \eta(S^t) + \varpi(A^t|S^t) - V_\mu(S^t)\big)\,u(S^t,A^t)\big]\Big)^2$

where $C'_1$ is a constant. Combining (26) and (32), we simplify the terms and then

where $C_1, C_2, C_3$ are some constants depending on $\delta_1$, respectively, and

B.8 PROOF OF THEOREM 4.3

Proof of Theorem 4.3. To bound the performance error, we first decompose it as

where the first term is the optimization error and the last term is the approximation error. Then we proceed to bound

where $V^{\pi_\mu}_\mu$ satisfies the stationarity equation (9) and $V^*$ is the unique fixed point of $\mathcal{B}$. First, we bound $\Delta_1$. Following a similar kernel reproducing property and an eigendecomposition argument in the spirit of Bertsekas (1997); Sutton & Barto (2018); Zhou et al. (2022), we have

and the auxiliary functions $\eta_{\xi_1}(s) \in [-C\mu, 0]$ for any $s \in \mathcal{S}$, then

where $C_5$ and $C_6$ are some constants, and

Now, we have the remainder term $\Delta_2$ to bound. We first bound $\Delta_2^1$. For any $s \in \mathcal{S}$, we have that

and since $E_{a \sim \pi_\mu(\cdot|s)}[\mu\,\pi_\mu(a|s)] \le \mu$, we have

For the lower bound, as

so similarly, we conclude that

It follows from the definition of the proximal Bellman operator $\mathcal{B}_\mu$ and the monotonicity of the Bellman operator that

, where $i \in \mathbb{Z}_+$. And for any initial value function, e.g.,

Therefore, the following inequality holds:

We repeatedly apply a similar procedure, without loss of generality. We first show one step:

Then we apply $\mathcal{B}_\mu$ infinitely many times and obtain

Combining with the inequalities (33)-(34), we immediately have

Next, by Proposition S.2, we have

Now, we need to bound the excess risk. The excess risk can be decomposed into approximation error and estimation error, i.e.,

where $\Delta_{approx}$ is the approximation error and $\Delta_{est}$ is the estimation error. The approximation error is assumed to be zero in our proof for simplicity.
First, we bound the estimation error.

where $\eta_{\xi_1}, \varpi_{\xi_2}$ are Lagrange multipliers satisfying minimal Bayes risk associ-

Observe that the randomness of

can be decomposed into two parts: one comes from the n i.i.d. trajectories, and the other comes from the dependent transitions within each trajectory. For each single trajectory, we define the quantity

, where $E_T$ denotes the expectation over a single stationary trajectory and $E$ denotes the expectation over the i.i.d. trajectory random variable $D_1$, respectively. Without loss of generality, we assume $C_0 = 1$. The U-statistic approximation for $E_T(U^\star)$ is as follows:

Then the uniform process is bounded by

where P is the empirical measure with respect to $D_{i:n} = \{D_i\}_{i=1}^n$, and we simply denote it as $P_n$ in the following proof. The last term is the bound for the uniform process with respect to the sum over trajectories. In this sense, it is necessary to bound

since the trajectories $\{D_i\}_{i=1}^n$ are i.i.d. Now, we proceed to bound $\Delta_1$. $\Delta_1$ can be re-expressed as the empirical process of $\{D_i\}_{i=1}^n$ with respect to the probability space $(\Omega^N, \mathcal{F}^N, P)$ equipped with the empirical measure $P_n$ such that

where $G(V^{\theta_1}_\mu, \pi^{\theta_2}_\mu, \eta_{\xi_1}, \varpi_{\xi_2}; D_i)$ is the random function associated with the random variable $D_i$. To bound $\Delta_1$ via Pollard's tail inequality (Pollard, 2012), we need to calculate the covering number $N(\epsilon, \mathcal{F}_{\theta,\xi}, \{D_i\}_{i=1}^n)$, where the function space is the composite space

Next, we proceed to bound the distance in the composite space $\mathcal{F}_{\theta,\xi}$. In particular, let

two arbitrary functions, then the empirical norm distance with respect to $\{D_i\}_{i=1}^n$ between the two functions can be upper bounded by

where $M_{\max,1} = 2M_{\max}$. Therefore, as the proximal parameter satisfies $0 \le \mu \le \mu_{\max} < \infty$, for any $\varepsilon > 0$ the metric entropy $\log N(\cdot)$ can be bounded in terms of the separate metric entropies of $(\Theta_1, \Theta_2, \Xi_1, \Xi_2)$.
Denote $\min(2(\mu_{\max} + 4)M_{\max}, 1)$ as C, then

To bound these factors, we first introduce the idea of pseudo-dimension, that is, for any set $\mathcal{X}$, any points $x_{1:N} \in \mathcal{X}^N$, any class $\mathcal{F}$ of functions on $\mathcal{X}$ taking values in $[0, C]$ with pseudo-dimension $D_{\mathcal{F}} < \infty$ and any $\epsilon > 0$, we have

Therefore, we have

where $C_1 = e^4 (D_{\Theta_1} + 1)(D_{\Theta_2} + 1)(D_{\Xi_1} + 1)(D_{\Xi_2} + 1)$, the "effective" pseudo-dimension.

Then we apply Pollard's tail inequality: for any $n \ge 32/\epsilon^2$, we have

With probability $1 - \delta$, minimizing the right-hand side with respect to u and plugging in the minimizer, we have

where $C_2 = 8C_1$. Therefore, we conclude that, with probability $1 - \delta$, we have

Next, we proceed to bound $\Delta_2$. To simplify the notation, we denote the U-statistic kernel as

By Hoeffding's decomposition of the kernel function $K(S^t, A^t; \tilde S^t, \tilde A^t)$, there exist kernel functions $K_1(S^t, A^t)$ and $K_2(S^t, A^t; \tilde S^t, \tilde A^t)$ such that $E_T K_1(S^t, A^t) = 0$ and $E_T K_2(s, a; S^t, A^t) = 0$. The U-statistic $U_T$ can be decomposed into

algorithm with diminishing learning rate and evaluated on Euclidean distance. Therefore, it suffices to show that the gradient of the loss is unbiased. Now we show that the gradient is unbiased, as follows

We conclude that the gradient estimator is unbiased. Following Theorem 3 in Drori & Shamir (2020), under the conditions stated in Theorem 4.4, we adapt Corollary 2.2 to our quasi-optimal algorithm, which completes the proof.
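The U-statistic form of the kernel loss that the proofs above analyze (a product of two residuals weighted by a kernel, averaged over distinct pairs of transitions within a trajectory) can be sketched as follows. This is an illustrative sketch under stated assumptions: $C_0 = 1$ as assumed in the proof, a Gaussian kernel with a hypothetical `bandwidth` parameter, and `residuals[j]` standing in for the quantity $G - \eta + \varpi - V$ evaluated at the j-th transition. The function names are not taken from the authors' code.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on state-action vectors; values lie in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_u_statistic(points, residuals, bandwidth=1.0):
    """Average of e_j * K(x_j, x_k) * e_k over ordered pairs j != k.

    points:    list of state-action vectors x_j = (S_j, A_j)
    residuals: list of scalar residuals e_j (stand-in for G - eta + varpi - V)
    """
    n_points = len(points)
    if n_points < 2 or n_points != len(residuals):
        raise ValueError("need at least two points and matching residuals")
    total = 0.0
    for j in range(n_points):
        for k in range(n_points):
            if j != k:
                weight = gaussian_kernel(points[j], points[k], bandwidth)
                total += residuals[j] * weight * residuals[k]
    return total / (n_points * (n_points - 1))
```

Excluding the diagonal terms j = k is what makes this a U-statistic rather than a V-statistic; the double loop costs O(T^2) kernel evaluations per trajectory of length T.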

C PRACTICAL IMPLEMENTATION

In practice, $\{V^*_\mu, \pi^*_\mu, \eta, \varpi\}$ needs to be parameterized. Note, however, that $V^*_\mu$ and $\pi^*_\mu$ are both associated with $Q^*_\mu$ through the closed-form expressions (3) and (4). Thus, we propose to represent $(V^*_\mu, \pi^*_\mu)$ by modeling $Q^*_\mu$. Additionally, by modeling $Q^*_\mu$ as a quadratic function, the induced policy follows a q-Gaussian distribution. Therefore, we model the coefficients associated with the quadratic form as linear combinations of the basis function $\varphi(s)$ such that

, where $\varphi(s) = [\varphi_1(s), \varphi_2(s), ..., \varphi_m(s)]^T$ is the m-dimensional basis function, and $\theta = [\theta_1, \theta_2, \theta_3]^T$ is the 3m-dimensional parameter we need to estimate. The advantage of this parametrization is that it reduces the parameter space.

To solve the constrained optimization problem, we propose a computationally efficient algorithm that transforms the original constrained optimization problem into an unconstrained minimization problem. Specifically, we impose restrictions on the representation of the Lagrangian multipliers $(\eta(s), \varpi(s,a))$ so that they satisfy their constraints automatically. Although such re-parametrization may sacrifice model flexibility, it gains a great computational advantage, as the unconstrained optimization problem is much simpler. To be specific, we parametrize $\varpi$ as $\varpi(s, a; \theta) = \max(0, -$

Therefore, $\varpi(S^t, A^t) \ge 0$ and $\pi^*_\mu(A^t|S^t) \cdot \varpi(S^t, A^t) = 0$ are automatically satisfied. Also, by specifying the expression of the Lagrangian multipliers, $\varpi(s, a)$ shares the same set of parameters $\theta$ as $(V^*_\mu, \pi^*_\mu)$. We also define

where $b_0$ is the sigmoid's midpoint and $k_0$ is the logistic growth rate. By flipping the sigmoid function to parametrize $\eta(s; \xi)$, the constraint $\eta(s) \in [-\mu C, 0]$ is also automatically satisfied.
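The re-parametrization idea above can be sketched as follows. The exact formulas are not recoverable from this section, so everything below is a hypothetical instance rather than the authors' parametrization: `quadratic_q` assumes the three coefficient blocks multiply $a^2$, $a$ and $1$ respectively, and `eta_flipped_sigmoid` assumes that a scalar score `z` (for example a linear function of the basis features with parameters $\xi$) is passed through a negated, scaled logistic function with midpoint `b0` and growth rate `k0`.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_q(phi_s, action, theta1, theta2, theta3):
    """Hypothetical quadratic-in-action Q model with basis-dependent coefficients."""
    return (dot(theta1, phi_s) * action ** 2
            + dot(theta2, phi_s) * action
            + dot(theta3, phi_s))


def eta_flipped_sigmoid(z, mu, c, b0=0.0, k0=1.0):
    """Map a real score z into [-mu * c, 0] with a flipped logistic function."""
    x = k0 * (z - b0)
    # Numerically stable logistic function for both signs of x.
    if x >= 0:
        sig = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        sig = e / (1.0 + e)
    return -mu * c * sig
```

The point of the construction is that any real-valued score yields a multiplier inside the required interval, so the optimizer can run unconstrained over $\xi$.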

D EXPERIMENT DETAILS AND ADDITIONAL RESULTS

For reproducibility, we provide the code for all experiments and the guideline for accessing the Ohio Type 1 Diabetes dataset at the anonymous GitHub link https://anonymous.4open.science/r/Quasi-optimal-Learning-with-Continuous-Treatments-9B88.

D.1 DETAILS OF SIMULATION SETTINGS AND REAL DATA ANALYSIS

The details of the data generative model of each environment in Section 6 are stated below.

Environment I: We consider a bounded action space $\mathcal{A} = [0, 1]$ and a 2-dimensional state space. $A_i^t \overset{iid}{\sim} \mathrm{Unif}(0, 1)$, the state transition function is defined as

$\overset{iid}{\sim} N(0, 0.5^2)$, and the reward function is defined as

Environment II: We consider a bounded action space $\mathcal{A} = [0, 1]$ and a 2-dimensional state space. $A_i^t \overset{iid}{\sim} \mathrm{Unif}(0, 1)$, the state transition function is defined as

$\sim N(0, 0.5^2)$, and $R_i^t = 0.25(S_{i,1}^{t+1})^3 + 2S_{i,1}^{t+1} + 0.5(S_{i,2}^{t+1})^3 + S_{i,2}^{t+1} + 0.25(2A_i^t - 1)$.

Environment III: We consider an unbounded action space $\mathcal{A} = (-\infty, \infty)$ and an 8-dimensional state space. We sample actions uniformly from a bounded space, $A_i^t \overset{iid}{\sim} \mathrm{Unif}(-100, 100)$, while the learned policy is allowed to select actions on $\mathbb{R}$. The state transition function is defined as $S_i^{t+1} \sim N(\mu_i^{t+1}, \Sigma)$, where $\Sigma$ is a pre-specified covariance matrix, and

for j = 1, 2, 3, 4,

for j = 5, 6, 7, 8.

Environment IV: This environment shares the same transition kernel as Environment III; the only difference is the reward function, which here is

For all four environments, we consider different sample sizes, where the number of trajectories is $n \in \{25, 50\}$ and the length of each trajectory is $T \in \{24, 36\}$. The discount factor γ is set to 0.9.
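Of the pieces above, the Environment II reward and the uniform behavior policy are stated completely, so they can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the state transition is deliberately not implemented because its formula is not given in this section.

```python
import random


def env2_reward(next_state, action):
    """Environment II reward as a function of the next state (s1, s2) and the action."""
    s1, s2 = next_state
    return (0.25 * s1 ** 3 + 2.0 * s1
            + 0.5 * s2 ** 3 + s2
            + 0.25 * (2.0 * action - 1.0))


def sample_behavior_action(rng):
    """Behavior policy for Environments I and II: actions drawn from Unif(0, 1)."""
    return rng.uniform(0.0, 1.0)


def sample_behavior_actions(n_trajectories, horizon, seed=0):
    """Draw an n-by-T table of behavior actions, e.g. n in {25, 50}, T in {24, 36}."""
    rng = random.Random(seed)
    return [[sample_behavior_action(rng) for _ in range(horizon)]
            for _ in range(n_trajectories)]
```

Note that the action enters this reward only through the term 0.25(2A - 1) and, implicitly, through its effect on the next state.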

Motivation of Synthetic Experiment Design:

We aim to test the performance of our proposed method on bounded and unbounded continuous action spaces with unimodal and multimodal reward functions. The motivation for testing the proposed method in a bounded action space is to check whether it can handle the off-support bias, as illustrated in Figure 2. The reason for considering a multimodal synthetic environment is to evaluate whether the quasi-optimal policy class (the q-Gaussian policy class) works in a relatively complex situation. In particular, since the q-Gaussian policy distribution is unimodal, it is necessary to test whether the q-Gaussian policy still works and is robust when the optimal policy might be multimodal.

We summarize the synthetic experiments as follows:

Environment I:
• Setting: Bounded action space and unimodal reward function.
• Purpose: To evaluate whether quasi-optimal learning works in a scenario where it might suffer from the off-support bias issue because the continuous action space is bounded.

Environment II:
• Setting: Bounded action space and multimodal reward function.
• Purpose: In addition to the purpose in Environment I, we aim to apply quasi-optimal learning in a more challenging environment. This also evaluates the robustness of the unimodal q-Gaussian policy when the true optimal policy follows a multimodal probability distribution.

Environment III:
• Setting: High-dimensional state space and well-separated reward function. The well-separated reward function is designed so that selecting non-optimal or sub-optimal actions greatly damages the rewards and increases the risk.
• Purpose: To evaluate the reliability/safety of quasi-optimal learning. We aim to examine whether quasi-optimal learning performs well in this scenario, as we expect it to identify the quasi-optimal sub-regions and avoid choosing the non-optimal/sub-optimal actions that greatly damage the performance.

Environment IV:
• Setting: High-dimensional state space and complex well-separated reward function.
• Purpose: In addition to the purpose in Environment III, we aim to evaluate quasi-optimal learning in a more complex environment, which makes recovering the quasi-optimal regions much harder for the proposed method. Indeed, a more complex reward structure makes the value function harder to learn and thus the quasi-optimal regions harder to identify.

Ohio Type 1 Diabetes Dataset: For individuals in the first cohort, we treat glucose level, carbohydrate intake, and acceleration level as state variables, i.e., $S_{i,1}^t$, $S_{i,2}^t$ and $S_{i,3}^t$. For individuals in the second cohort, heart rate is used instead of acceleration level as $S_{i,3}^t$. The reward function is defined as

D.2 ADDITIONAL EXPERIMENT DETAILS

In our implementation, since the objective function $L_U$ may not be convex with respect to $(\theta, \xi)$, we determine the initial point by randomly generating 200 initial values for all parameters and selecting the one with the smallest objective function value.

For the discretization-based methods, i.e., Greedy-GQ and V-learning, we discretize the original action space into 20 bins in the synthetic experiments and 14 bins in the real data analysis. The number of bins is chosen by analyzing the distribution of actions and the scale of rewards: too few bins cannot approximate the whole dynamic accurately, while too many bins may damage the performance of these methods. We use radial basis functions to approximate the value functions for these two methods, following the recommendation of the original implementations (Ertefaie & Strawderman, 2018; Luckett et al., 2019).

For the deep-RL-based continuous control methods, i.e., DDPG, SAC, BEAR, CQL and IQN, we base our implementation mainly on a well-known offline deep reinforcement learning library (Seno & Imai, 2021). For the general optimization and function approximation settings, we use a multi-layer perceptron (MLP) with 2 hidden layers, each with 32 nodes. We set the batch size to 64 and use the ReLU activation function. In addition to the summary provided below, the initial learning rate is chosen from the set $\{3 \times 10^{-4}, 1 \times 10^{-4}, 3 \times 10^{-5}\}$. We use Adam (Kingma & Ba, 2014) as the optimizer for learning the neural network parameters. We set the discount factor to γ = 0.9 for all experiments.

To evaluate the policy obtained from the proposed method in the synthetic experiments, we generate 100 independent trajectories, each of length 100, based on the learned policy. We use rejection sampling (Robert et al., 1999) to sample each action from the induced density $\pi_\mu(a|s)$ and calculate the discounted sum of rewards for each trajectory.
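The rejection-sampling step mentioned above can be sketched as follows for a density with bounded support. This is a generic sketch, not the authors' sampler: it assumes a uniform proposal on `[low, high]` and a known upper bound `density_max` on the density, and the compactly supported quadratic density `quadratic_bump_density` (with hypothetical `center` and `half_width` parameters) is only a stand-in for an induced policy density $\pi_\mu(a|s)$.

```python
import random


def quadratic_bump_density(a, center=0.5, half_width=0.25):
    """Illustrative density proportional to 1 - ((a - center) / half_width)^2 on its support."""
    z = (a - center) / half_width
    return max(0.0, 0.75 / half_width * (1.0 - z * z))


def rejection_sample(density, low, high, density_max, rng, max_tries=100000):
    """Draw one sample from `density` on [low, high] using a uniform proposal.

    Requires density(a) <= density_max for all a in [low, high].
    """
    for _ in range(max_tries):
        candidate = rng.uniform(low, high)
        threshold = rng.uniform(0.0, density_max)
        # Strict inequality: points where the density is zero are never accepted.
        if threshold < density(candidate):
            return candidate
    raise RuntimeError("rejection sampling failed; check density_max and support")


def sample_actions(n_samples, seed=0):
    """Sample actions from the illustrative density over the action space [0, 1]."""
    rng = random.Random(seed)
    peak = quadratic_bump_density(0.5)  # maximum of the illustrative density
    return [rejection_sample(quadratic_bump_density, 0.0, 1.0, peak, rng)
            for _ in range(n_samples)]
```

Because the density vanishes outside its support, every accepted action lies inside the support region, which mirrors the property that a quasi-optimal policy never proposes actions outside its near-optimal set.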
We compare the discounted return of each method. The boxplot of the synthetic experiment results, based on 50 runs, is presented in Figure 3.

For the real data analysis, since the data-generating process is unknown, we follow Luckett et al. (2020) and use the Monte Carlo approximation of the estimated V-function at the initial state of each trajectory to evaluate the performance of each method. To better evaluate the stability and performance of each method, we randomly select 10 or 20 trajectories from each individual among the available trajectories, repeat this 50 times, and apply all methods to the selected data. The baseline refers to the observed discounted return. The mean and standard deviation of the improvements in the Monte Carlo discounted returns are presented in Table 1.

We report all hyperparameters used in training and additional experiment results in this section. The value of µ is selected from the set {0.01, 0.05, 0.1, 0.2, 0.3, 0.5}. We select µ by cross-validation for each experiment; specifically, we select the µ with the largest fitted V-function value on the initial states of each trajectory, i.e., $\mathbb{P}_n V_\mu(S_i^1) - (1 - \gamma)^{-1}\mu$, which mitigates the effect of the threshold parameter µ. In our implementation, we set C = 5 for all synthetic experiments and the real data analysis, and check that the induced policy $\pi_\mu$ never reaches the boundary value.

We set the learning rate for the j-th iteration to $\alpha_j = \alpha_0 / (1 + d\sqrt{j})$, where $\alpha_0$ is the learning rate of the initial iteration and d is the decay rate of the learning rate. When n = 25, we set the batch size to 5, and when n = 50, we set the batch size to 7. We use the $L_2$ distance between successive parameter iterates as the stopping criterion for the SGD algorithm. The µ selected for each experiment, along with the learning rates and their decay rates, are shown in Tables 2, 3 and 4.

The Learned Policy Distribution: Here, we illustrate the validity of the quasi-optimal policy distribution on a fixed state.
In the OhioT1M dataset, we select a patient state with a glucose level of 217 mg/dL, which corresponds to moderate hyperglycemia. For this state, we draw a density plot in Figure 8 of the policy distributions learned by quasi-optimal learning, IQN, and SAC. Figure 8 shows that quasi-optimal learning identifies the support region [3.15, 6.19]. As the patient is under moderate hyperglycemia, a moderate insulin dosage, i.e., one in [3.15, 6.19], works well to bring the glucose level into the safe range. Meanwhile, it avoids lowering the patient's glucose level too far and causing hypoglycemia. In comparison, SAC is risky, as it has a non-negligible probability of assigning a dosage that is too low or too high. The policy learned by the safe RL algorithm IQN tends to avoid extreme dosages, but it has wider support than the one learned by quasi-optimal learning. In terms of both efficiency and safety, quasi-optimal learning therefore has certain advantages over IQN in this case.

