CONVERGENCE RATE OF PRIMAL-DUAL APPROACH TO CONSTRAINED REINFORCEMENT LEARNING WITH SOFTMAX POLICY

Abstract

In this paper, we consider a primal-dual approach to solving constrained reinforcement learning (RL) problems, which we formulate under the constrained Markov decision process (CMDP) framework. We propose a primal-dual policy gradient (PD-PG) algorithm with a softmax policy. Although constrained RL involves a non-concave maximization problem over the policy parameter space, we show that, for both exact policy gradient and model-free learning, the proposed PD-PG needs an iteration complexity of O(ε^{-2}) to achieve an ε-optimal policy in both constraint and reward performance. Such an iteration complexity outperforms or matches most constrained RL algorithms. For learning with the exact policy gradient, the main challenge is to show that the probability the policy assigns to the optimal action stays bounded away from zero, uniformly over both the state space and the iteration index. For model-free learning, since we consider the discounted infinite-horizon setting and a simulator cannot roll out an infinite-horizon sequence, one of the main challenges lies in designing unbiased value function estimators from finite-horizon trajectories. We adopt unbiased estimators whose finite horizons are drawn from a geometric distribution, which is the key technique behind our theoretical results for model-free learning.

Under review as a conference paper at ICLR 2023

Theorem 2 shows that PD-PG with the exact policy gradient needs the iteration complexity in (1) to obtain O(ε)-optimality, where c is the infimum, over states and iterations, of the probability that the softmax policy assigns to the optimal action; c is a positive scalar independent of both the time step t and the state space S. One of the main challenges in obtaining the complexity (1) is to show that c is bounded away from 0; see Proposition 2. From Table 1, the proposed PD-PG attains an iteration complexity of O(ε^{-2}), which is comparable to extensive constrained RL algorithms.
Model-Free Constrained RL. In Section 4, we propose a sample-based PD-PG that uses only empirical data to learn a safe policy. The sample-based PD-PG needs the complexity in (2) to obtain O(ε)-optimality, where m is the number of constraints. The iteration complexity (2) outperforms or matches extensive existing state-of-the-art constrained RL algorithms; see Table 1. Since this work considers the discounted infinite-horizon CMDP and a simulator cannot roll out an infinite-horizon sequence, the main challenge lies in designing unbiased value function estimators from finite-horizon trajectories. In Section 4.2, following Paternain (2018, Chapter 6), we introduce unbiased estimators whose finite horizons are drawn from a geometric distribution, which plays a critical role in obtaining the iteration complexity of sample-based PD-PG. Finally, in Section 4.6, we illustrate an iteration complexity trade-off between PD-PG and NPD-PG (Ding et al., 2020): we analyze the trade-off between the distribution mismatch coefficient ‖d^{ρ0}_{π*}/ρ0‖_∞ (which appears in the proposed PD-PG) and the Moore-Penrose pseudo-inverse of the Fisher information matrix, F^†(θ) (which appears in NPD-PG (Ding et al., 2020)).

RELATED WORK

1. INTRODUCTION

Reinforcement learning (RL) has achieved significant success in many fields (e.g., Silver et al., 2017; Vinyals et al., 2019; OpenAI, 2019). However, most RL algorithms improve performance under the assumption that an agent is free to explore any behavior, which may be detrimental. For example, a robot agent should avoid playing actions that irrevocably harm its hardware (Deisenroth et al., 2013). Thus, it is important to consider safe exploration, known as constrained RL (or safe RL), which is usually formulated as a constrained Markov decision process (CMDP) (Altman, 1999). The primal-dual approach (Altman, 1999; Bertsekas, 2014) is a fundamental way to solve CMDP problems. Recently, the primal-dual method has also been extended to policy gradient (e.g., Tessler et al., 2019; Petsagkourakis et al., 2020; Xu et al., 2021). However, most previous work focuses on natural policy gradient (NPG) (Kakade, 2002) for constrained RL (e.g., Ding et al., 2020; Xu et al., 2021; Zeng et al., 2021), and little is known about the vanilla policy gradient (Sutton et al., 2000) with the primal-dual approach to constrained RL, which raises the following foundational theoretical questions: (i) how can the primal-dual vanilla policy gradient method be applied to constrained RL, both with exact gradient information and with model-free learning? (ii) how fast does the primal-dual vanilla policy gradient converge to the optimal policy? (iii) what is the sample complexity of the primal-dual policy gradient? These questions are the focus of this paper; we mainly consider the softmax policy for the discounted infinite-horizon CMDP with finite action and state spaces.

1.1. MAIN CONTRIBUTIONS

Constrained RL with Exact Policy Gradient. In Section 3, we propose a primal-dual policy gradient (PD-PG) algorithm, which improves reward performance via gradient ascent on the primal policy parameters and enforces safe exploration via projected gradient descent on the dual variables.

Type | Algorithm | Iteration complexity
Value-Based | OPDOP (Ding et al., 2021, Theorem 1) | O(|S|^2|A| / ((1-γ)^4 ε^2))
Policy-Based | CRPO⁷ (Xu et al., 2021, Theorem 1) | O(|S||A| / ((1-γ)^7 ε^4))
Value-Based | UCBVI-γ² (He et al., 2021, Theorem 4.3) | O(|S||A| / ((1-γ)^3 ε^2))
Policy-Based | On-Line NPD-PG⁸ (Zeng et al., 2021, Theorem 1) | O(|S|^6|A|^6 / ((1-γ)^{12} ε^6))
Policy-Based | NPD-PG³ (Ding et al., 2020, Theorem 1) | O(1 / ((1-γ)^4 ε^2))
Policy-Based | NPD-PG³ (Ding et al., 2020, Theorem 4) | O(|S|^2|A|^2 / ((1-γ)^4 ε^2))
Policy-Based | PD-PG (This Work, Algorithm 1) | O(|S| / ((1-γ)^4 ε^2))
Policy-Based | PD-PG (This Work, Algorithm 3) | O(|S|^2|A| / ((1-γ)^4 ε^2))

Table 1: Typical exact-gradient and model-free state-of-the-art algorithms for constrained RL.

2. PRELIMINARIES

Constrained Reinforcement Learning. Constrained RL is often formulated as a constrained Markov decision process (CMDP), which is a standard Markov decision process (MDP) M augmented with an additional constraint set C. An MDP is a tuple M = (S, A, P, r, ρ0, γ). Here S is the state space and A is the action space. P(s'|s, a) is the probability of the state transitioning from s to s' after playing a. r(s'|s, a) denotes the reward the agent observes when the state transitions from s to s' after it plays a, and it is bounded as |r(·)| ≤ 1. ρ0(·) : S → [0, 1] is the initial state distribution and γ ∈ (0, 1). A policy π(a|s) denotes the probability of playing a in state s, and Π_S denotes the set of all stationary policies. P_π(s'|s) denotes the one-step state transition probability from s to s' under π. Let T = {s_t, a_t, r_{t+1}}_{t≥0} ∼ π be a trajectory generated by π, where s_0 ∼ ρ0(·), a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t), and r_{t+1} = r(s_{t+1}|s_t, a_t). Let d^{s0}_π(s) = (1-γ) Σ_{t=0}^∞ γ^t P_π(s_t = s|s_0) be the discounted state distribution of the Markov chain (starting at s_0) induced by policy π, and let
d^{ρ0}_π(s) = E_{s0∼ρ0(·)}[d^{s0}_π(s)]. (4)
The state value function is V_π(s) = E_π[Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s]. The constraint set is C = {(c_i, b_i)}_{i=1}^m, where each c_i : S × A → R is a cost function with |c_i(·)| ≤ 1 and each b_i is a cost limit. We define cost value functions V^{c_i}_π(s) = E_π[Σ_{t=0}^∞ γ^t c_i(s_t, a_t) | s_0 = s], and cost action-value functions Q^{c_i}_π and cost advantage functions A^{c_i}_π in analogy to V_π, Q_π, and A_π, with c_i replacing r, for i ∈ {1, 2, ..., m}.

¹ According to Bai et al. (2021), (Kalagarla et al., 2021, Theorem 1) involves a constant C bounded by |S|.
² UCBVI-γ matches the lower bound Ω(|S||A| / ((1-γ)^3 ε^2)) for MDPs (Lattimore & Hutter, 2012; Azar et al., 2013).
³ Theorem 1 of Ding et al. (2020) shows a convergence rate independent of S and A. Notice that in Theorem 4 of Ding et al. (2020), |S|^2|A|^2 samples are necessary for the two outer loops.
⁴ Bai et al. (2021) claim that CSPDA needs O(|S||A| / ((1-γ)^4 ε^2)), but the inner loop of their Algorithm 1 needs an additional generative model that requires O(|S||A| log(|S||A|) / ((1-γ)^3 ε^2)) samples (Agarwal et al., 2020a, Chapter 2).
⁵ We show this iteration complexity according to a recent work (Bai et al., 2021). Since Wei et al. (2021) study the finite-horizon CMDP, we believe their Triple-Q needs at least O(|S|^2|A|^2 / ε^5).
⁶ The worst case of constraint violation shown in (Miryoosefi & Jin, 2021) reaches O(|S|^2|A| / ((1-γ)^4 ε^2)) if the number of constraint functions is larger than |S|.
⁷ Notice that an inner loop with K_in = O(T / ((1-γ)|S||A|)) iterations is needed (Xu et al., 2021, Theorem 3).
⁸ We show the iteration complexity after some simple algebra based on (Zeng et al., 2021, Lemmas 8-9).

Strong Duality. Let λ ∈ R^m with λ ⪰ 0. The Lagrangian L(π, λ) is defined as
L(π, λ) = J(π) + λ^T (b - c(π)). (5)
Its associated dual function is L_D(λ) = max_{π∈Π_S} L(π, λ), and the optimal dual variable is λ* = argmin_{λ⪰0} L_D(λ).
Assumption 1 (Slater Condition). There exists a vector ξ ≺ 0 and a policy π ∈ Π_S such that
c(π) - b ⪯ ξ. (7)
The Slater condition (Slater, 1950) is mild in practice (otherwise, we can simply increase the constraint vector b by a tiny amount), and it is a standard assumption for CMDPs in previous work (Chow et al., 2018; Paternain et al., 2019a;b; Le et al., 2019; Ding et al., 2020; Ying et al., 2021). Slater's condition and the convexity of the policy class Π_S ensure that strong duality holds, so we can formulate problem (3) in the following strong-duality form.
Theorem 1 (Strong Duality (Altman, 1999)). Let the stationary policy space Π_S be a convex set. Under Assumption 1, the CMDP problem (3) shares the same optimal value as the following min-max and max-min problems:
J(π*) = min_{λ⪰0} max_{π∈Π_S} L(π, λ) = max_{π∈Π_S} min_{λ⪰0} L(π, λ). (8)
Policy Gradient with Softmax Policy.
In this paper, we mainly consider the softmax policy:
π_θ(a|s) = exp{θ_{s,a}} / Σ_{ã∈A} exp{θ_{s,ã}}, ∀ (s, a) ∈ S × A, (9)
where θ ∈ R^{|S|×|A|} and each θ[s, a] := θ_{s,a}. Finally, we define two additional notations:
a^c_{π_θ}(s, a) = (A^{c_1}_{π_θ}(s, a), A^{c_2}_{π_θ}(s, a), ..., A^{c_m}_{π_θ}(s, a))^T, A_{π_θ}(s, a, λ) = A_{π_θ}(s, a) - λ^T a^c_{π_θ}(s, a).
Proposition 1. Let π_θ be the softmax policy (9). The gradients of J(π_θ) and C_i(π_θ) with respect to θ are:
∂J(π_θ)/∂θ_{s,a} = (1/(1-γ)) d^{ρ0}_{π_θ}(s) π_θ(a|s) A_{π_θ}(s, a), ∂C_i(π_θ)/∂θ_{s,a} = (1/(1-γ)) d^{ρ0}_{π_θ}(s) π_θ(a|s) A^{c_i}_{π_θ}(s, a). (10)
Then the gradients of L(π_θ, λ) with respect to θ and λ are:
∂L(π_θ, λ)/∂θ_{s,a} = (1/(1-γ)) d^{ρ0}_{π_θ}(s) π_θ(a|s) A_{π_θ}(s, a, λ), ∂L(π_θ, λ)/∂λ = b - c(π_θ).

3. PRIMAL-DUAL POLICY GRADIENT METHOD

According to the strong duality shown in Theorem 1, to solve the constrained RL problem (3) we only need to solve the equivalent unconstrained problem (8). We define the primal-dual approach as follows:
λ_{t+1} ← {λ_t - η∇_λ L(π_{θ_t}, λ_t)}_+, θ_{t+1} ← θ_t + η∇_θ L(π_{θ_t}, λ_t), (12)
where the elements of ∇_θ L(π_{θ_t}, λ_t) and ∇_λ L(π_{θ_t}, λ_t) are shown in Proposition 1, {·}_+ denotes the positive-part operator (i.e., {x}_+ = 0 if x ≤ 0 and {x}_+ = x otherwise), and η > 0 is the step size. The complete primal-dual approach is shown in Algorithm 1, where we introduce the notation G(θ, λ) ∈ R^{|S|×|A|} for the matrix version of ∂L(π_θ, λ)/∂θ_{s,a}, i.e., each entry is defined as:
G(θ, λ)[s, a] = (1/(1-γ)) d^{ρ0}_{π_θ}(s) π_θ(a|s) A_{π_θ}(s, a, λ).
Before we show the convergence rate of Algorithm 1, we assume that the initial state distribution ρ0(·) used in the gradient updates is bounded away from zero.
Assumption 2 (Sufficient Exploration). The initial state distribution ρ0(·) satisfies
ρ_min := min_{s∈S} ρ0(s) > 0. (13)
Algorithm 1 Primal-Dual Policy Gradient (PD-PG)
Initialization: step size η, θ_0 = 0, λ_0 = 0;
for t = 0, 1, ..., T-1 do
  G(θ_t, λ_t)[s, a] = (1/(1-γ)) d^{ρ0}_{π_{θ_t}}(s) π_{θ_t}(a|s) A_{π_{θ_t}}(s, a, λ_t);
  λ_{t+1} ← {λ_t - η(b - c(π_{θ_t}))}_+;
  θ_{t+1} ← θ_t + ηG(θ_t, λ_t);
end for

Assumption 2 has been adopted by Agarwal et al. (2020b); Mei et al. (2020); Ying et al. (2021); it requires the initial distribution ρ0(·) to lie in the interior of the probability simplex Δ°(S) := {p : p_s > 0, Σ_{s∈S} p_s = 1}. The condition (13) ensures "sufficient exploration", meaning that for any policy π ∈ Π_S, the distribution d^{ρ0}_π(s) stays positive over the whole state space S. Additionally, Assumption 2 is necessary for the global optimality of policy gradient methods. Concretely, Mei et al. (2020) have shown that there exists an MDP with min_{s∈S} ρ0(s) = 0 and a parameter θ such that θ is a stationary point of J(π_θ) while π_θ is not an optimal policy. To shorten the expressions, we define some additional notation. Recall ξ := (ξ_1, ξ_2, ..., ξ_m)^T ≺ 0 from Assumption 1, i.e., ξ_i < 0 for i = 1, 2, ..., m; let ι := min_{1≤i≤m}{-ξ_i} and ℓ := 1 + 2m/((1-γ)^2 ι^2). The distribution mismatch coefficient is defined as ‖d^{ρ0}_{π*}/ρ0‖_∞ := max_{s∈S} d^{ρ0}_{π*}(s)/ρ0(s). Finally, let π_t(a|s) := π_{θ_t}(a|s) and π_t := π_{θ_t}. From here on (including all remaining results), we fix a deterministic optimal policy π*(·|s) and denote its action by a*(s), i.e., π*(a*(s)|s) = 1 and π*(a|s) = 0 for a ≠ a*(s).
Proposition 2. Under Assumption 2, updating π_t according to Algorithm 1, we obtain
c := inf_{s∈S, t≥1} π_t(a*(s)|s) > 0. (15)
We provide the proof in Appendix D.1 (see Lemma 15). According to (15), c is a positive scalar independent of both the time step t and the state space S.
Theorem 2. Under Assumptions 1-2, π_θ is the softmax policy defined in (9).
Let the time step T satisfy
T ≥ max{ 1/(1-γ)^2, D_2^2 |S| log|A| / ( ((1-γ)^2 + 2m/ι^2)^2 ρ_min^2 D_1 ) }, (16)
where D_1 and D_2 are positive scalars that will be specified later. The sequence {λ_t, θ_t}_{t=0}^{T-1} is generated by Algorithm 1. Let
η = ‖d^{ρ0}_{π*}/ρ0‖_∞ √( |S| log|A| / (C(1-γ)T) ), β = ‖d^{ρ0}_{π*}/ρ0‖_∞ √( 4|S| log|A| / ((1-γ)^3 ι^2 c) ),
where C is a positive scalar that will be specified later. Then for all i ∈ {1, 2, ..., m}, π_t := π_{θ_t} satisfies
min_{t<T} {J(π*) - J(π_t)} ≤ 4 ‖d^{ρ0}_{π*}/ρ0‖_∞ √( |S| log|A| / (c(1-γ)^4 T) ), (17)
min_{t<T} {C_i(π_t) - b_i}_+ ≤ (4 ‖d^{ρ0}_{π*}/ρ0‖_∞ / (β - ‖λ*‖_∞)) √( |S| log|A| / (c(1-γ)^4 T) ). (18)
Remark 1. Lemma 3 (see Appendix B.3) shows the boundedness of λ*: ‖λ*‖_∞ ≤ 2/((1-γ)ι). Furthermore, according to the discussion in Remark 3 (see Appendix D.2), the inequality β > 2/((1-γ)ι) always holds. Thus β > ‖λ*‖_∞, which implies the bounds (17) and (18) are well-defined.
Remark 2. Theorem 2 implies that Algorithm 1 needs an iteration complexity of
O( ‖d^{ρ0}_{π*}/ρ0‖_∞^2 |S| log|A| / (c(1-γ)^4 ε^2) ) (19)
to obtain O(ε)-optimality. The iteration complexity (19) is a function of the tolerance ε, and it matches the best known policy gradient methods for CMDPs (Ding et al., 2020): both NPD-PG (Ding et al., 2020) and the proposed PD-PG share a complexity of O(ε^{-2}).
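For a small tabular CMDP with known dynamics, one PD-PG iteration of Algorithm 1 can be sketched directly from Proposition 1: compute d^{ρ0}_{π_θ} and the reward/cost advantages exactly via linear solves, form G(θ, λ), then take a projected dual step and a primal ascent step. The helper names and the single-constraint, expected-reward setup below are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (|S|, |A|)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def exact_quantities(P, g, pi, gamma, rho):
    """Exact value V, advantage A, and discounted distribution d^{rho}_pi
    for a signal g (reward or cost table, shape (|S|, |A|));
    P has shape (|S|, |A|, |S|)."""
    S = g.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)                  # induced chain
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * g).sum(axis=1))
    Q = g + gamma * P @ V
    d = (1.0 - gamma) * rho @ np.linalg.inv(np.eye(S) - gamma * P_pi)
    return V, Q - V[:, None], d

def pdpg_step(theta, lam, P, r, c, b, gamma, rho, eta):
    """One iteration of Algorithm 1 with exact gradients, one constraint."""
    pi = softmax_policy(theta)
    _, A_r, d = exact_quantities(P, r, pi, gamma, rho)
    V_c, A_c, _ = exact_quantities(P, c, pi, gamma, rho)
    G = d[:, None] * pi * (A_r - lam * A_c) / (1.0 - gamma)  # Proposition 1
    lam_next = max(0.0, lam - eta * (b - rho @ V_c))         # projected dual descent
    return theta + eta * G, lam_next                         # primal ascent

# Tiny random CMDP demo: a few exact PD-PG iterations.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r, c = rng.random((S, A)), rng.random((S, A))
rho = np.ones(S) / S
theta, lam = np.zeros((S, A)), 0.0
for _ in range(50):
    theta, lam = pdpg_step(theta, lam, P, r, c, b=5.0, gamma=gamma, rho=rho, eta=0.1)
```

With λ = 0, G reduces to the exact softmax policy gradient of J(π_θ) and can be checked against finite differences of J(θ) = Σ_s ρ0(s) V_{π_θ}(s).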

4. PRIMAL-DUAL METHOD TO SOLVE MODEL-FREE CONSTRAINED RL

The main difficulty in implementing a model-free algorithm lies in designing an efficient policy gradient estimator for the discounted infinite-horizon MDP: naive sampling-based policy optimization would require trajectories with an infinite horizon, which is impossible in practical simulation. In Sections 4.2-4.3, we present unbiased value function and policy gradient estimators that use finite-horizon trajectories, which form the foundation of our sample-based algorithm. The proposed algorithm and its convergence analysis appear in Sections 4.4-4.5.

4.1. DILEMMA IN MONTE-CARLO ROLLOUT

Recall Proposition 1: to obtain unbiased estimators of ∂J(π_θ)/∂θ_{s,a} and ∂C_i(π_θ)/∂θ_{s,a}, it is necessary to satisfy the following two conditions:
• (C1): draw the state-action pair (s, a) according to (s, a) ∼ (d^{ρ0}_{π_θ}(·), π_θ(·|s));
• (C2): obtain unbiased estimators of the advantage functions A_{π_θ}(s, a) and A^{c_i}_{π_θ}(s, a).
However, we cannot compute d^{ρ0}_{π_θ}(s) = E_{s0∼ρ0(·)}[(1-γ) Σ_{t=0}^∞ γ^t P_{π_θ}(s_t = s|s_0)] exactly in model-free RL, since the transition probability P_{π_θ}(s_t = s|s_0) is unknown. Additionally, the Monte-Carlo rollout is a theoretically possible but intractable sampling-based approach to obtain an unbiased estimator of A_{π_θ}(s, a), since it requires running infinite-horizon trajectories to estimate the value functions. For example, let {(s_t, a_t, r(s_t, a_t))}_{t≥0} ∼ π_θ start from (s_0, a_0) = (s, a); then
Q̂(s, a) = Σ_{t=0}^∞ γ^t r(s_t, a_t) (20)
is an unbiased estimator of Q_{π_θ}(s, a). Despite the unbiasedness of Q̂(s, a), the Monte-Carlo rollout (20) requires an infinite number of steps, which is impossible in practice.

4.2. UNBIASED VALUE FUNCTION ESTIMATOR WITH FINITE HORIZON TRAJECTORY

Both conditions (C1) and (C2) can be implemented via a geometric random horizon during simulation (Paternain, 2018, Chapter 6). We now present the insight behind this process; it requires familiarity with the geometric distribution Geo(·), see Appendix E.1. Recall the Monte-Carlo rollout (20): if γ ≈ 0, the infinite series Q̂(s, a) in (20) prioritizes present reward information, so when γ is very small we do not need the agent to evolve for long to collect future rewards. On the contrary, if γ ≈ 1, we need the agent to look far into the future for reward information. The geometric distribution provides a way to formalize this idea (Paternain, 2018). Concretely, let τ ∼ Geo(1-γ) and roll out a finite-horizon trajectory D_τ = {s_t, a_t, r(s_t, a_t)}_{t=0:τ} ∼ π, where the initial state-action pair is (s_0, a_0) = (s, a). Then we define an estimator of Q_π(s, a) as the sum of rewards along the trajectory D_τ:
Q̂_π(s, a) = Σ_{t=0}^τ r(s_t, a_t). (21)
This Q̂_π(s, a) unbiasedly estimates Q_π(s, a) for each (s, a). The procedure (21) extends to the cost functions if we replace r(·) by c_i(·), and to V_π(s) and V^{c_i}_π(s) if D_τ starts from s. Due to space limitations, we provide the implementation details for estimating Q_π and V_π in Algorithm 4 (denoted EstQ(π, g, s, a)) and Algorithm 5 (denoted EstV(π, g, s)), see Appendix E.2, where g = r(·) or g = c_i(·). The next Propositions 3-4 show that Algorithm 4 and Algorithm 5 output unbiased estimators of the value functions; we provide the proofs in Appendix E.3.
Proposition 3 (Unbiasedness of Algorithm 4). Let Q̂_π(s, a) = EstQ(π, r, s, a) and Q̂^{c_i}_π(s, a) = EstQ(π, c_i, s, a); then the following holds: E[Q̂_π(s, a)] = Q_π(s, a) and E[Q̂^{c_i}_π(s, a)] = Q^{c_i}_π(s, a).
Proposition 4 (Unbiasedness of Algorithm 5).
Let V̂_π(s) = EstV(π, r, s) and V̂^{c_i}_π(s) = EstV(π, c_i, s); then the following holds: E[V̂_π(s)] = V_π(s) and E[V̂^{c_i}_π(s)] = V^{c_i}_π(s).

Algorithm 2 PG(π, g, s, a): Estimate Policy Gradient
1: First Rollout: Q̂(s, a) = EstQ(π_θ, g, s, a); let τ ∼ Geo(1-γ) denote its terminal time;
2: Second Rollout: Q̂(s_τ, a_τ) = EstQ(π_θ, g, s_τ, a_τ); let τ' ∼ Geo(1-γ) denote its terminal time; collect the trajectory {(s'_j, a'_j, g(s'_j, a'_j))}_{j=0:τ'}, where the initial pair is (s'_0, a'_0) = (s_τ, a_τ);
3: Output: Ĝ(s, a) = (1/(1-γ)) Q̂(s_τ, a_τ) ∂log π_θ(a_τ|s_τ)/∂θ_{s,a}
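The geometric-horizon rollout (21) that underlies EstQ (and serves as the sub-routine of Algorithm 2 above) can be sketched in a few lines. The `step` callback and function name below are our own, hypothetical interface, not the paper's Algorithm 4: draw τ ∼ Geo(1-γ) and return the undiscounted sum of g over the first τ+1 steps. Unbiasedness holds because P(τ ≥ t) = γ^t replaces the explicit discount factor:

```python
import numpy as np

def est_q(step, g, s, a, gamma, rng):
    """Estimate Q^g_pi(s, a) from one finite rollout of geometric length.
    `step(s, a)` samples the next pair (s', a') from the environment and
    the policy; `g(s, a)` is the reward or cost signal."""
    tau = rng.geometric(1.0 - gamma) - 1   # tau ~ Geo(1-gamma) on {0, 1, ...}
    total = g(s, a)                        # t = 0 term
    for _ in range(tau):                   # t = 1, ..., tau, undiscounted
        s, a = step(s, a)
        total += g(s, a)
    return total
```

Since P(τ ≥ t) = γ^t, we get E[Σ_{t=0}^τ g_t] = Σ_t γ^t E[g_t] = Q^g_π(s, a); note that NumPy's geometric sampler is supported on {1, 2, ...}, hence the `- 1` shift.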

4.3. UNBIASED POLICY GRADIENT ESTIMATOR

Now we introduce an unbiased policy gradient estimator, which involves two rollouts. First Rollout: we perform a rollout with respect to π_θ according to Algorithm 4,
Q̂_{π_θ}(s, a) = EstQ(π_θ, r, s, a), (22)
and we use τ ∼ Geo(1-γ) to denote the finite terminal time of the rollout (22). Second Rollout: we perform a rollout from the last state-action pair (s_τ, a_τ) according to Algorithm 4,
Q̂_{π_θ}(s_τ, a_τ) = EstQ(π_θ, r, s_τ, a_τ). (23)
Let τ' ∼ Geo(1-γ) be the terminal time of the rollout (23), and denote the trajectory as D = {(s'_j, a'_j, r(s'_j, a'_j))}_{j=0:τ'}, where the initial state-action pair is (s'_0, a'_0) = (s_τ, a_τ). Output: let Ĝ_{π_θ}(s, a) be the estimator
Ĝ_{π_θ}(s, a) = (1/(1-γ)) Q̂_{π_θ}(s_τ, a_τ) ∂log π_θ(a_τ|s_τ)/∂θ_{s,a}. (24)
Since the return objective J(π_θ) and the cost functions C_i(π_θ) share a similar structure, all the estimators (22)-(24) extend to C_i(π_θ) if we replace r by c_i. We summarize this finite-horizon policy gradient estimator in Algorithm 2 and denote it as PG(π, g, s, a).
Theorem 3. Let π_θ be the softmax policy (9), and let Ĝ_{π_θ}(s, a) = PG(π_θ, r, s, a) and Ĝ^{c_i}_{π_θ}(s, a) = PG(π_θ, c_i, s, a). Then Ĝ_{π_θ}(s, a) and Ĝ^{c_i}_{π_θ}(s, a) satisfy
E[Ĝ_{π_θ}(s, a)] = ∂J(π_θ)/∂θ_{s,a}, E[Ĝ^2_{π_θ}(s, a)] ≤ 4/(1-γ)^3;
E[Ĝ^{c_i}_{π_θ}(s, a)] = ∂C_i(π_θ)/∂θ_{s,a}, E[(Ĝ^{c_i}_{π_θ}(s, a))^2] ≤ 4/(1-γ)^3.
Theorem 3 guarantees the unbiasedness and bounded second moment of the estimator PG(π, g, s, a), which is the foundation of our theoretical results. We provide its proof in Appendix E.4.
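The two-rollout construction (22)-(24) can be sketched as follows. For brevity we return the whole score-scaled matrix, whose (s, a) entry corresponds to the output of PG(π, g, s, a), and we let the caller choose the initial pair; the callback names `sample_next` and `score` are hypothetical, not the paper's interface:

```python
import numpy as np

def pg_estimate(sample_next, score, g, s0, a0, gamma, rng):
    """Two geometric rollouts: the first locates (s_tau, a_tau); the second,
    independent one forms hat-Q(s_tau, a_tau); the output scales the score
    matrix d(log pi(a_tau | s_tau))/d(theta) by hat-Q / (1 - gamma)."""
    s, a = s0, a0
    for _ in range(rng.geometric(1.0 - gamma) - 1):   # first rollout
        s, a = sample_next(s, a)
    q_hat, s2, a2 = g(s, a), s, a                     # second rollout
    for _ in range(rng.geometric(1.0 - gamma) - 1):
        s2, a2 = sample_next(s2, a2)
        q_hat += g(s2, a2)
    return q_hat * score(s, a) / (1.0 - gamma)
```

When the initial pair is drawn as (s0, a0) ∼ (ρ0, π_θ), the stopped state s_τ is distributed as d^{ρ0}_{π_θ}, so averaging many such estimates recovers ∂J(π_θ)/∂θ in expectation, mirroring Theorem 3.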

4.4. MODEL-FREE ALGORITHM DERIVATION

We show the model-free PD-PG in Algorithm 3. To obtain a stochastic primal-dual implementation, the iteration (12) implies that we need to estimate ∂L(π_θ, λ)/∂λ and ∂L(π_θ, λ)/∂θ.
Estimator for ∂L(π_θ, λ)/∂λ. We obtain estimators of the cost value functions: V̂^{c_i}_{π_θ}(s) = EstV(π_θ, c_i, s). Furthermore, let Ĉ_i(π_θ) = Σ_{s∈S} ρ0(s) V̂^{c_i}_{π_θ}(s) and ĉ(π_θ) = (Ĉ_1(π_θ), Ĉ_2(π_θ), ..., Ĉ_m(π_θ))^T. Then, according to Proposition 4, b - ĉ(π_θ) is an unbiased estimator of ∂L(π_θ, λ)/∂λ, i.e., for any given policy parameter θ, the following holds:
E[b - ĉ(π_θ)] = ∂L(π_θ, λ)/∂λ. (25)

Algorithm 3 Primal-Dual Approach to Model-Free Safe RL
1: Initialization: step size η, θ_0 = 0, and Lagrange multiplier λ_0 = 0;
2: for t = 0, 1, 2, ..., T-1 do
3:   # Estimate ∂L(π_{θ_t}, λ_t)/∂λ_t.
4:   ∇̂_{λ_t} L(π_{θ_t}, λ_t) = b - ĉ(π_{θ_t});
5:   # Estimate ∂L(π_{θ_t}, λ_t)/∂θ_t. Let Ĝ(π_{θ_t}, λ_t)[s, a] = Ĝ_{π_{θ_t}}(s, a) - λ_t^T ĝ^c_{π_{θ_t}}(s, a); calculate ∇̂_{θ_t} L(π_{θ_t}, λ_t) = Ĝ(π_{θ_t}, λ_t);
6:   # Primal-Dual Update for Parameters. λ_{t+1} = {λ_t - η ∇̂_{λ_t} L(π_{θ_t}, λ_t)}_+; θ_{t+1} = θ_t + η ∇̂_{θ_t} L(π_{θ_t}, λ_t);
7: end for

Estimator for ∂L(π_θ, λ)/∂θ. According to Algorithm 2, we obtain the policy gradient estimators of ∂J(π_θ)/∂θ_{s,a} and ∂C_i(π_θ)/∂θ_{s,a} as
Ĝ_{π_θ}(s, a) = PG(π_θ, r, s, a), Ĝ^{c_i}_{π_θ}(s, a) = PG(π_θ, c_i, s, a). (26)
Let ĝ^c_{π_θ}(s, a) = (Ĝ^{c_1}_{π_θ}(s, a), Ĝ^{c_2}_{π_θ}(s, a), ..., Ĝ^{c_m}_{π_θ}(s, a))^T collect all the policy gradient estimators of the cost value functions. Let the matrix Ĝ(π_θ, λ) ∈ R^{|S|×|A|} have entries
Ĝ(π_θ, λ)[s, a] = Ĝ_{π_θ}(s, a) - λ^T ĝ^c_{π_θ}(s, a). (27)
Then, according to Theorem 3, Ĝ(π_θ, λ) is an unbiased estimator of ∂L(π_θ, λ)/∂θ.
Stochastic Primal-Dual Iteration. We rewrite the iteration (12) in the following stochastic version:
λ_{t+1} ← {λ_t - η(b - ĉ(π_{θ_t}))}_+, θ_{t+1} ← θ_t + η Ĝ(π_{θ_t}, λ_t),
where ĉ(π_{θ_t}) and Ĝ(π_{θ_t}, λ_t) are computed according to the estimators (25) and (27).
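Algorithm 3's loop reduces to a few lines once the estimators are abstracted away. In this sketch, `est_cost` and `est_grad` stand in for the unbiased estimators built from Algorithms 2, 4 and 5; to keep the demo self-contained we exercise the loop on a toy concave objective with one linear constraint rather than an MDP (the function names and the toy problem are our own assumptions):

```python
import numpy as np

def stochastic_pdpg(est_cost, est_grad, theta0, lam0, b, eta, T):
    """Stochastic primal-dual loop: projected dual descent, primal ascent."""
    theta, lam = np.array(theta0, float), np.array(lam0, float)
    for _ in range(T):
        lam = np.maximum(0.0, lam - eta * (b - est_cost(theta)))  # dual step
        theta = theta + eta * est_grad(theta, lam)                # primal step
    return theta, lam

# Toy instance: maximize J(x) = -(x - 2)^2 subject to c(x) = x <= 1.
# Optimum: x* = 1 with multiplier lambda* = 2. Noisy oracles imitate
# the sampled estimates used by Algorithm 3.
rng = np.random.default_rng(0)
noisy_cost = lambda x: x + 0.1 * rng.standard_normal()
noisy_grad = lambda x, lam: -2.0 * (x - 2.0) - lam + 0.1 * rng.standard_normal()
x, lam = stochastic_pdpg(noisy_cost, noisy_grad, 0.0, 0.0, b=1.0, eta=0.01, T=5000)
# x settles near 1 and lam near 2, up to stochastic jitter.
```

The projection {·}_+ is what keeps the multiplier nonnegative throughout, exactly as in the dual update of (12).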

4.5. CONVERGENCE RATE

For each time t, the estimator ĉ(π_t) (see Lines 3-4 of Algorithm 3) involves m trajectories, and the estimator Ĝ(π_{θ_t}, λ_t) (see Line 5 of Algorithm 3) involves 2|S||A| + m trajectories. We use D_t to collect all those 2|S||A| + 2m trajectories:
D_t = {T_{t,i}}_{i=1}^{2|S||A|+2m}. (29)
According to the rollout rules in Algorithms 4, 5, and 2, the 2|S||A| + 2m trajectories in D_t are independent of each other.
Theorem 4. Under Assumptions 1-2, π_θ is the softmax policy (9). The time step T obeys a fixed lower bound similar to (16). With the initial λ_0 = 0 and θ_0 = 0, the parameter sequence {λ_t, θ_t}_{t=0}^{T-1} is generated according to Algorithm 3. Let η and β satisfy
η = ‖d^{ρ0}_{π*}/ρ0‖_∞ √( |S| log|A| / (C(1-γ)T) ), β = ‖d^{ρ0}_{π*}/ρ0‖_∞ √( 4|S| log|A| / ((1-γ)^3 ι^2 c) ),
where C is a positive scalar that will be specified later. Then for all i ∈ {1, 2, ..., m}, π_t := π_{θ_t} satisfies
E[ min_{t<T} {J(π*) - J(π_t)} ] ≤ 4 ‖d^{ρ0}_{π*}/ρ0‖_∞ √( |S| log|A| / (c(1-γ)^4 T) ),
E[ min_{t<T} {C_i(π_t) - b_i}_+ ] ≤ (4 ‖d^{ρ0}_{π*}/ρ0‖_∞ / (β - ‖λ*‖_∞)) √( |S| log|A| / (c(1-γ)^4 T) ),
where E[·] is short for E_{D_0:D_{T-1}}[·], the expectation with respect to the randomness of the trajectories {D_t}_{t=0}^{T-1}. The details of the proof are in Appendix F.2; the unbiased policy gradient estimator and the independence of the samples in D_t play a critical role in obtaining Theorem 4. According to (29), we need 2|S||A| + 2m trajectories per iteration to form the policy gradient estimator, so Theorem 4 implies that Algorithm 3 needs a total complexity of
O( ‖d^{ρ0}_{π*}/ρ0‖_∞^2 |S|(|S||A| + m) log|A| / (c(1-γ)^4 ε^2) ) (30)
to obtain O(ε)-optimality. The complexity (30) matches the best known algorithm, NPD-PG (Ding et al., 2020): NPD-PG and the proposed PD-PG share a complexity of O(ε^{-2}), which is better than CRPO (Xu et al., 2021) with O(ε^{-4}) and on-line NPD-PG (Zeng et al., 2021) with O(ε^{-6}).
4.6. COMMENT ON (DING ET AL., 2020): A TRADE-OFF BETWEEN PD-PG AND NPD-PG

The initialization θ_0 = 0 implies that the initial policy is uniform, i.e., π_0(a*(s)|s) = |A|^{-1}. Then c is upper bounded as
c = inf_{s∈S, t≥1} π_t(a*(s)|s) ≤ |A|^{-1},
which implies the iteration complexity (30) of the proposed PD-PG is lower bounded by
Ω( ‖d^{ρ0}_{π*}/ρ0‖_∞^2 |S|^2|A|^2 / ((1-γ)^4 ε^2) ) (31)
to obtain O(ε)-optimality, where we omit the constant m (the number of constraints) and the notation hides polylogarithmic factors in the input parameters. According to (Ding et al., 2020, Theorem 4), the complexity of NPD-PG is upper bounded by
O( |S|^2|A|^2 / ((1-γ)^4 ε^2) ). (32)
Although the proposed PD-PG shares the same state-action-independent iteration complexity of O(ε^{-2}) with NPD-PG, the state-action-dependent complexities (31) and (32) suggest that PD-PG can be slower than NPD-PG (Ding et al., 2020), since PD-PG's complexity heavily depends on the initial distribution ρ0(·). Concretely, if ρ0(·) is near 0 at some state, the distribution mismatch coefficient can be very large, which is indeed detrimental for PD-PG when searching for a safe policy. In this sense, it also demonstrates the necessity of Assumption 2. However, although the upper bound (32) for NPD-PG does not contain the distribution mismatch coefficient, NPD-PG (Ding et al., 2020) requires the additional computation of the Moore-Penrose pseudo-inverse F^†(θ), where F(θ) is the Fisher information matrix:
F(θ) = E_{s∼d^{ρ0}_{π_θ}(·), a∼π_θ(·|s)}[ ∇log π_θ(a|s) (∇log π_θ(a|s))^T ].
Thus, there exists a hidden trade-off between PD-PG and NPD-PG. Finally, we emphasize that the trade-off is hidden because the notation O(·) suppresses some information about the MDP and the policy space.

5. CONCLUSION

This work proposes the PD-PG algorithm, a Lagrangian-based policy gradient method, to solve constrained reinforcement learning problems. Although the maximization objective is non-concave and the minimization is non-convex over the parameter space, we show that for both exact policy gradient and model-free learning, PD-PG converges to the optimal solution at a sublinear rate in both the reward objective and the safety constraints. Since we consider the discounted infinite-horizon CMDP, we design unbiased estimators with finite-horizon trajectories, which plays a critical role in obtaining the iteration complexity of sample-based PD-PG. Additionally, we show that PD-PG needs a complexity of O(ε^{-2}) to obtain O(ε)-optimality, which is comparable to the state-of-the-art constrained RL algorithms available in the literature.

A.3 STATE DISTRIBUTION

P_π(s'|s_0): the single-step state transition probability from s_0 to s' by executing π.
P_π(s_t|s_0): the probability of visiting the state s_t after t time steps from the initial state s_0 by executing π.
P_π: the state transition probability matrix, whose (s, s')-th component is P_π[s, s'] = Σ_{a∈A} π(a|s)P(s'|s, a) := P_π(s'|s).
1_m: 1_m ∈ R^m, the vector whose elements are all 1, i.e., 1_m = (1, 1, ..., 1)^T.
0_m: 0_m ∈ R^m, the all-zero vector.
d^{s0}_π(s), d^{ρ0}_π(s): the discounted state distribution of the Markov chain (starting at s_0) induced by π, d^{s0}_π(s) = (1-γ) Σ_{t=0}^∞ γ^t P_π(s_t = s|s_0), and d^{ρ0}_π(s) = E_{s0∼ρ0(·)}[d^{s0}_π(s)].
‖d^{ρ0}_{π*}/ρ0‖_∞: the distribution mismatch coefficient, ‖d^{ρ0}_{π*}/ρ0‖_∞ := max_{s∈S} d^{ρ0}_{π*}(s)/ρ0(s). Due to Assumption 2, the ratio d^{ρ0}_{π*}(s)/ρ0(s) is well-defined.

A.4 STATE, STATE-ACTION AND COST VALUE FUNCTIONS

V_π(s): state value function V_π(s) = E_π[Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s].
Q_π(s, a): state-action value function Q_π(s, a) = E_π[Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a].
A_π(s, a): advantage function A_π(s, a) = Q_π(s, a) - V_π(s).
b: the vector of cost limits b = (b_1, b_2, ..., b_m)^T.
V^{c_i}_π(s): state cost value function V^{c_i}_π(s) = E_π[Σ_{t=0}^∞ γ^t c_i(s_t, a_t) | s_0 = s].
Q^{c_i}_π(s, a): state-action cost value function Q^{c_i}_π(s, a) = E_π[Σ_{t=0}^∞ γ^t c_i(s_t, a_t) | s_0 = s, a_0 = a].
A^{c_i}_π(s, a): cost advantage function A^{c_i}_π(s, a) = Q^{c_i}_π(s, a) - V^{c_i}_π(s).
C_i(π): the expected cost value C_i(π) = E_{s∼ρ0(·)}[V^{c_i}_π(s)].
c(π): the vector of all expected cost values: c(π) = (C_1(π), C_2(π), ..., C_m(π))^T.
Π_C: the feasible policy set, Π_C = ∩_{i=1}^m {π ∈ Π_S : C_i(π) ≤ b_i}.

A.5 PARAMETERS

λ: Lagrange multiplier, λ ∈ R^m.
λ*: optimal dual variable λ* = argmin_{λ⪰0} max_{π∈Π_S} {J(π) + λ^T(b - c(π))}.

A.6 CONSTANTS

ρ_min: sufficient exploration constant ρ_min = min_{s∈S} ρ0(s).
m: the dimension of the vector b, i.e., the number of constraints in (3).
ξ: a component-wise negative vector defined in Assumption 1: ξ := (ξ_1, ξ_2, ..., ξ_m)^T ≺ 0, with each ξ_i < 0, i = 1, 2, ..., m.
ι: a positive constant ι := min_{1≤i≤m}{-ξ_i}.
ℓ: a positive constant ℓ := 1 + 2m/((1-γ)^2 ι^2).

We use P_π(s_t = s|s_0) to denote the probability of visiting s after t time steps from the initial state s_0 by executing π. In particular, if t = 0 and s ≠ s_0, then P_π(s_t = s|s_0) = 0. For any initial state s_0 ∼ ρ0(·), the Chapman-Kolmogorov relation holds:
P_π(s_t = s|s_0) = Σ_{s'∈S} P_π(s_t = s|s_{t-1} = s') P_π(s_{t-1} = s'|s_0).
In this paper, we also use P^{(t)}_π to denote the t-step transition probability, i.e., P^{(t)}_π(s'|s) = P_π(s_t = s'|s_0 = s). Recall that d^{s0}_π(s) denotes the normalized discounted distribution of the future state s encountered when starting at s_0 and executing π, d^{s0}_π(s) = (1-γ) Σ_{t=0}^∞ γ^t P_π(s_t = s|s_0).
Furthermore, since s_0 ∼ ρ0(·), we define d^{ρ0}_π(s) = E_{s0∼ρ0(·)}[d^{s0}_π(s)] = Σ_{s0∈S} ρ0(s_0) d^{s0}_π(s) as the discounted state visitation distribution under the initial distribution ρ0(·).

B.2 PERFORMANCE DIFFERENCE LEMMA

Lemma 1 (Performance Difference (Kakade & Langford, 2002)). For any policies π and π', s_0 ∼ ρ0(·), and each i = 1, 2, ..., m, the following performance (or cost) difference holds:
J(π) - J(π') = (1/(1-γ)) E_{s∼d^{ρ0}_π(·), a∼π(·|s)}[ A_{π'}(s, a) ],
C_i(π) - C_i(π') = (1/(1-γ)) E_{s∼d^{ρ0}_π(·), a∼π(·|s)}[ A^{c_i}_{π'}(s, a) ].
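Lemma 1 holds exactly on any finite MDP, so it can be verified numerically. The sketch below assumes expected rewards r(s, a) (an assumption for simplicity; the paper defines rewards as r(s'|s, a)) and uses our own helper names:

```python
import numpy as np

def values_and_dist(P, r, pi, gamma, rho):
    """Exact value V, advantage A, and d^{rho}_pi via linear solves."""
    S = r.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r).sum(axis=1))
    Q = r + gamma * P @ V
    d = (1.0 - gamma) * rho @ np.linalg.inv(np.eye(S) - gamma * P_pi)
    return V, Q - V[:, None], d

# Check J(pi) - J(pi') = E_{s ~ d_pi, a ~ pi}[A_{pi'}(s, a)] / (1 - gamma)
# on a random MDP with two random policies.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
rho = np.ones(S) / S
pi1 = rng.random((S, A)); pi1 /= pi1.sum(axis=1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)
V1, _, d1 = values_and_dist(P, r, pi1, gamma, rho)
V2, A2, _ = values_and_dist(P, r, pi2, gamma, rho)
lhs = rho @ (V1 - V2)
rhs = (d1[:, None] * pi1 * A2).sum() / (1.0 - gamma)
# lhs and rhs agree up to numerical precision.
```

Note that the advantage on the right-hand side belongs to π', while the state distribution and the action distribution both belong to π, which is exactly the asymmetry stated in the lemma.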

B.3 BASIC FACTS

In this section, we present some basic facts that will be used later; these results are adapted from Ding et al. (2020), and we extend them to vector-valued constraints. Recall L_D(λ) = max_{π∈Π_S} L(π, λ). For a given scalar c ∈ R, we define Γ(c) := {λ ⪰ 0 : L_D(λ) ≤ c}. The following inequality always holds:
max_{π∈Π_S} min_{λ⪰0} L(π, λ) ≤ min_{λ⪰0} max_{π∈Π_S} L(π, λ) = min_{λ⪰0} L_D(λ). (38)
If c < max_{π∈Π_S} min_{λ⪰0} L(π, λ) = J(π*), then Γ(c) = ∅; we assume c ≥ J(π*) throughout this paper.
Lemma 2. Recall the optimal dual variable λ* = argmin_{λ⪰0} L_D(λ). Let c = J(π*); then the following holds:
λ* ∈ Γ(c). (39)
Proof. Choose c = J(π*) and let λ̃ ∈ Γ(c). Then, by the definition of Γ(c), we achieve
L_D(λ̃) = max_{π∈Π_S} L(π, λ̃) ≤ J(π*) = max_{π∈Π_S} min_{λ⪰0} L(π, λ), (40)
where the last equality uses (8). Minimizing the left-hand side of (40) over λ̃ ⪰ 0 and combining with (38), the vector λ̃ satisfies max_{π∈Π_S} min_{λ⪰0} L(π, λ) = min_{λ⪰0} max_{π∈Π_S} L(π, λ), which implies λ̃ = λ*, and further λ* ∈ Γ(J(π*)). (41)
With the result of Lemma 2, we denote the set of all optimal dual variables as Γ*, i.e., Γ* := {λ : λ ∈ argmin_{λ⪰0} L_D(λ)} = Γ(J(π*)).
Lemma 3. Consider a policy π satisfying Assumption 1 and let λ ∈ Γ(c). Then the following holds:
c - J(π) ≥ -λ^T ξ. (43)
Furthermore, the optimal dual variable λ* is bounded as
‖λ*‖_∞ ≤ (1/ι)(J(π*) - J(π)) ≤ 2/((1-γ)ι).
Proof. Let λ ∈ Γ(c), and recall Assumption 1 and L_D(λ) = max_{π∈Π_S} L(π, λ). Then
c ≥ L_D(λ) ≥ J(π) + λ^T(b - c(π)) ≥ J(π) - λ^T ξ,
where the last inequality uses (7), which implies c - J(π) ≥ -λ^T ξ. Furthermore, according to Lemma 2, if c = J(π*), then for each λ* ∈ Γ(J(π*)), Eq. (43) implies
J(π*) - J(π) ≥ -λ*^T ξ = λ*^T(-ξ). (44)
Recall from Assumption 1 that ξ ≺ 0, i.e., ξ := (ξ_1, ξ_2, ..., ξ_m)^T ≺ 0, which implies each ξ_i < 0, i = 1, 2, ..., m. Let ι := min_{1≤i≤m}{-ξ_i}; then ι is a positive scalar and -ξ ⪰ ι1_m. Let λ* = (λ*_1, λ*_2, ..., λ*_m)^T; since λ* ⪰ 0, each λ*_i ≥ 0.
Furthermore, λ ∞ = max{λ * 1 , λ * 2 , • • • , λ * m } , according to Eq.( 44), we achieve J(π ) -J(π) ≥ ιλ 1 m = ι m i=1 λ * i ≥ ι λ ∞ , which implies λ ∞ ≤ 1 ι (J(π ) -J(π)) ≤ 2 (1 -γ)ι . ( ) Lemma 4. Let ϕ > λ ∞ , and for any policy π such that J(π ) -J(π) + ϕ1 m {c(π) -b} + ≤ δ, then Recall λ = arg min λ 0 L D (λ), then according to Theorem 1, we achieve 1 m {c(π) -b} + < δ ϕ -λ ∞ . ( ) Proof. Let v(ω) = max L(π, λ ) ≤ max π∈ΠS L(π, λ ) = L D (λ ) = J(π ) = v(0), ∀π ∈ Π S . Then, for each π such that π ∈ {π ∈ Π S : b -c(π) ω}, the following holds v(0) -λ ω (49) ≥ L(π, λ ) -λ ω = J(π) + λ (b -c(π)) -λ ω ≥ J(π), where the last Eq.( 50) holds: since b -c(π) ω, then λ (b -c(π)) -λ ω ≥ 0. Let's maximize the right-hand of Eq.( 50) with respect to π over the space {π ∈ Π S : bc(π) ω}, then we achieve v(0) -λ ω ≥ v(ω). Furthermore, if we choose ω := -{c(π) -b} + , then J(π) ≤ J(π ) = v(0) ≤ v(ω), where the last inequality holds since {π : b -c(π) 0} ⊂ {π : b -c(π) ω}. Finally, considering the results from ( 51) to (53), we have J(π) -J(π ) (53) ≤ v(ω) -J(π ) = v(ω) -v(0) (51) ≤ -λ ω. ( ) Consider the condition (47), δ (47) ≥ J(π ) -J(π) + ϕ1 m {c(π) -b} + (54) ≥ λ ω + ϕ1 m {c(π) -b} + (52) ≥ (ϕ1 m -λ ) {c(π) -b} + > (ϕ -λ ∞ )1 m {c(π) -b} + , where the last inequality holds: since ϕ > λ ∞ , then the following equation always holds ϕ1 m -λ (ϕ -λ ∞ )1 m . Eq.( 55) implies 1 m {c(π) -b} + < δ ϕ -λ ∞ . C POLICY GRADIENT W.R.T. OBJECTIVE AND COST VALUE FUNCTION Although some similar results with respect to Proposition 1 have appeared in Agarwal et al. (2020b) ; Mei et al. (2020) ; Lan (2021) , we also need to provide the details since we will use some key details later. Before we show Proposition 1, we need to calculate ∂π θ (a |s ) ∂θ s,a , which plays a critical role to proof Proposition 1. Lemma 5. 
Let π θ be the softmax policy defined in (9), then ∂π θ (a |s ) ∂θ s,a = ∂ ∂θ s,a exp{θ s ,a } ã∈A exp{θ s ,ã } = ∂ ∂θ s,a exp{θ s ,a }( ã∈A exp{θ s ,ã }) -exp{θ s ,a } ∂ ∂θ s,a ( ã∈A exp{θ s ,ã }) ( ã∈A exp{θ s ,ã }) 2 =                      exp{θ s,a }( ã∈A exp{θ s,ã }) -(exp{θ s,a }) 2 ( ã∈A exp{θ s,ã }) 2 , if s = s and a = a; - exp{θ s,a } exp{θ s,a } ( ã∈A exp{θ s,ã }) 2 , if s = s and a = a; 0, if s = s or a = a; =              π θ (a|s) -(π θ (a|s)) 2 , if s = s and a = a; -π θ (a |s)π θ (a|s) if s = s and a = a; 0. if s = s or a = a. Proposition 1. Under the softmax policy parameterization (9), the gradient of the objective J(π θ ) and cost C i (π θ ) with respect to θ is ∂J(π θ ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A π θ (s, a), ∂C i (π θ ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A ci π θ (s, a). Furthermore, let a c π θ (s, a) = A c1 π θ (s, a), A c2 π θ (s, a), • • • , A cm π θ (s, a) , A π θ (s, a, λ) = A π θ (s, a) -λ a c π θ (s, a), then the gradients of L(π θ , λ) with respect to θ, λ are: ∂L(π θ , λ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A π θ (s, a, λ), ∂L(π θ , λ) ∂λ = b -c(π θ ). ( ) Proof. Since J(π θ ) = E s∼d ρ 0 π θ (•) [V π θ (s)], to derive the gradient ∂J(π θ ) ∂θ s,a , we only need to show ∂V π θ (s 0 ) ∂θ s,a . According to the relationship between V π θ (s) and Q π θ (s, a): V π θ (s 0 ) = a ∈A π θ (a |s 0 )Q π θ (s 0 , a ), then we have ∂V π θ (s 0 ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + π θ (a |s 0 ) ∂Q π θ (s 0 , a ) ∂θ s,a . Due to the equation Q π θ (s, a) = s ∈S P(s |s, a)r(s |s, a) + γ s ∈S P(s |s, a)V π θ (s ), we achieve the gradient of Q π θ with respect to θ as follows, ∂Q π θ (s 0 , a ) ∂θ s,a = γ s ∈S P(s |s 0 , a ) ∂V π θ (s ) ∂θ s,a . 
Taking Eq.( 60) to Eq.( 59), we have ∂V π θ (s 0 ) ∂θ s,a = a ∈A   ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γπ θ (a |s 0 ) s ∈S P(s |s 0 , a ) ∂V π θ (s ) ∂θ s,a   = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ s ∈S   a ∈A π θ (a |s 0 )P(s |s 0 , a )   =Pπ θ (s1=s |s0) ∂V π θ (s ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ s ∈S P π θ (s 1 = s |s 0 ) ∂V π θ (s ) ∂θ s,a , which implies for each t ∈ N + , the following equation holds: ∂V π θ (s ) ∂θ s,a = a ∈A ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ s ∈S P π θ (s t+1 = s |s t = s ) ∂V π θ (s ) ∂θ s,a . Considering Eq.( 62) with the case t = 1, we write Eq.( 61) as follows, ∂V π θ (s 0 ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ s ∈S P π θ (s 1 = s |s 0 )   a ∈A ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ s ∈S P π θ (s 2 = s |s 1 = s ) ∂V π θ (s ) ∂θ s,a   . (63) According to Eq.( 34), the following equation holds s ∈S P π θ (s 1 = s |s 0 )P π θ (s 2 = s |s 1 = s ) = P π θ (s 2 = s |s 0 ), and taking it to Eq.( 63), we achieve the gradient ∂V π θ (s 0 ) ∂θ s,a as follows, ∂V π θ (s 0 ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ s ∈S P π θ (s 1 = s |s 0 ) a ∈A ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ 2 s ∈S s ∈S P π θ (s 1 = s |s 0 )P π θ (s 2 = s |s 1 = s ) ∂V π θ (s ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ a ∈A s ∈S P π θ (s 1 = s |s 0 ) ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ 2 s ∈S P π θ (s 2 = s |s 0 ) ∂V π θ (s ) ∂θ s,a . 
Furthermore, according to (64), we analyze ∂V π θ (s 0 ) ∂θ s,a as follows, ∂V π θ (s 0 ) ∂θ s,a = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + γ a ∈A s ∈S P π θ (s 1 = s |s 0 ) ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ 2 s ∈S P π θ (s 2 = s |s 0 )   a ∈A ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ s ∈S P π θ (s 3 = s |s 2 = s ) ∂V π θ (s ) ∂θ s,a   = a ∈A ∂π θ (a |s 0 ) ∂θ s,a Q π θ (s 0 , a ) + a ∈A s ∈S γP π θ (s 1 = s |s 0 ) + γ 2 P π θ (s 2 = s |s 0 ) ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) + γ 3 s ∈S P π θ (s 3 = s |s 0 ) ∂ ∂θ s,a V π θ (s ) = • • • • • • = a ∈A s ∈S ∞ t=0 γ t P π θ (s t = s |s 0 ) ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) (65) = 1 1 -γ a ∈A s ∈S d s0 π θ (s ) ∂π θ (a |s ) ∂θ s,a Q π θ (s , a ) = 1 1 -γ a ∈A d s0 π θ (s) ∂π θ (a |s) ∂θ s,a Q π θ (s, a ) (66) = 1 1 -γ d s0 π θ (s)π θ (a|s)   Q π θ (s, a) - a ∈A π θ (a |s)Q π θ (s, a )   (67) = 1 1 -γ d s0 π θ (s)π θ (a|s) (Q π θ (s, a) -V π θ (s)) = 1 1 -γ d s0 π θ (s)π θ (a|s)A π θ (s, a), where Eq.( 66) holds since if s = s, then ∂π θ (a |s ) ∂θ s,a = 0; Eq.( 67) holds due to Eq.( 57). Finally, since J(π θ ) = E s0∼d ρ 0 π θ (•) [V π θ (s 0 )], then ∂J(π θ ) ∂θ s,a = E s0∼ρ0(•) 1 1 -γ d s0 π θ (s)π θ (a|s)A π θ (s, a) = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A π θ (s, a). Similarly, ∂C i (π θ ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A ci π θ (s, a). This concludes the proof of Proposition 1. Let a c π θ (s, a) = A c1 π θ (s, a), A c2 π θ (s, a), • • • , A cm π θ (s, a) , then ∂c(π θ ) ∂θ s,a = ∂C 1 (π θ ) ∂θ s,a , ∂C 2 (π θ ) ∂θ s,a , • • • , ∂C m (π θ ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s) A c1 π θ (s, a), A c2 π θ (s, a), • • • , A cm π θ (s, a) = 1 1 -γ d ρ0 π θ (s)π θ (a|s)a c π θ (s, a). With the result in Proposition 1, it is easy to calculate the gradient of L(π θ , λ) (5) with respect to θ, λ, formally, we present it in the following Proposition 5. Proposition 5. 
Consider the Lagrange multiplier function L(π θ , λ) (5), its gradient with respect to θ, λ is: ∂ ∂θ s,a L(π θ , λ) = 1 1 -γ d ρ0 π θ (s)π θ (a|s) A π θ (s, a) -λ a c π θ (s, a) , ∂ ∂λ L(π θ , λ) = b -c(π θ ). ( ) We consider the following update rule with respect to λ and θ: λ t ← λ t-1 -η ∂ ∂λ L(π θt-1 , λ t-1 ) + ; ( ) θ t ← θ t-1 + η ∂ ∂θ L(π θt-1 , λ t-1 ), where each (s, a)-component update rule of θ t is defined as follows, θ t [s, a] := θ (t) s,a ← θ (t-1) s,a + η ∂ ∂θ s,a L(π θt-1 , λ t-1 ) (73) = θ (t-1) s,a + η 1 -γ d ρ0 πt-1 (s)π θt-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) , where θ (t) s,a = θ t [s, a]. To short the expression, we use the following notations: π t (a|s) := π θt (a|s) = exp θ (t) s,a ã∈A exp θ (t) s,ã , and π t := π θt . (75)
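Proposition 1's component-wise gradient formula can be sanity-checked against central finite differences on a toy MDP. A minimal sketch, assuming a randomly generated instance (all sizes and the random seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, gamma = 3, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
rho0 = np.ones(S) / S
theta = rng.normal(size=(S, A))

def softmax_policy(th):
    z = np.exp(th - th.max(axis=1, keepdims=True))  # stable softmax over actions
    return z / z.sum(axis=1, keepdims=True)

def J_and_grad(th):
    pi = softmax_policy(th)
    P_pi = np.einsum("sap,sa->sp", P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, np.einsum("sa,sa->s", r, pi))
    Q = r + gamma * P @ V
    Adv = Q - V[:, None]
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    # Proposition 1: dJ/dtheta[s, a] = d(s) * pi(a|s) * A(s, a) / (1 - gamma)
    return rho0 @ V, d[:, None] * pi * Adv / (1 - gamma)

_, grad = J_and_grad(theta)
# Central finite-difference check of every (s, a) component.
eps, max_err = 1e-5, 0.0
for s in range(S):
    for a in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps; tm[s, a] -= eps
        num = (J_and_grad(tp)[0] - J_and_grad(tm)[0]) / (2 * eps)
        max_err = max(max_err, abs(num - grad[s, a]))
print(max_err)  # small (finite-difference accuracy)
```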

D PROOF OF THEOREM 2

In this section, we provide the proof details of Theorem 2. The argument is fairly technical, so we proceed in two parts: Section D.1 collects the necessary intermediate lemmas, and Section D.2 presents the proof of Theorem 2 itself.
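Before turning to the lemmas, the exact-gradient updates (71)-(74) can be exercised end-to-end on a toy CMDP with a single cost constraint. This is only an illustrative sketch: the step size eta, the threshold b, and the random instance are arbitrary choices, not the tuned constants of Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, eta = 4, 3, 0.9, 0.1
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))                 # reward in [0, 1)
c1 = rng.random((S, A))                # single cost function (m = 1)
b = np.array([4.5])                    # illustrative threshold, not from the paper
rho0 = np.ones(S) / S

def softmax_policy(th):
    z = np.exp(th - th.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value_and_adv(pi, f):
    """Exact value rho0^T V_f^pi and advantage A_f^pi for a per-step function f."""
    P_pi = np.einsum("sap,sa->sp", P, pi)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, np.einsum("sa,sa->s", f, pi))
    Q = f + gamma * P @ V
    return rho0 @ V, Q - V[:, None]

def d_rho0(pi):
    P_pi = np.einsum("sap,sa->sp", P, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)

theta, lam = np.zeros((S, A)), np.zeros(1)
for t in range(300):
    pi = softmax_policy(theta)
    J, A_r = value_and_adv(pi, r)
    C, A_c = value_and_adv(pi, c1)
    d = d_rho0(pi)
    # Primal ascent (73)-(74): theta[s,a] += eta/(1-gamma) d(s) pi(a|s) (A - lam^T A_c)
    grad_theta = d[:, None] * pi * (A_r - lam[0] * A_c) / (1 - gamma)
    # Dual projected step (71): lam <- {lam - eta * (b - c(pi))}_+
    lam = np.maximum(lam - eta * (b - C), 0.0)
    theta = theta + eta * grad_theta

pi = softmax_policy(theta)
J, _ = value_and_adv(pi, r)
C, _ = value_and_adv(pi, c1)
print(J, C, lam)
```

The dual variable stays nonnegative by construction of the projection; note that the theorem's guarantees concern the best iterate under the specific eta and beta chosen in Section D.2, not the last iterate of this sketch.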

D.1 AUXILIARY LEMMAS

Lemma 6. The sequence {λ t , θ t } t≥0 is generated by ( 71)-( 72)/(74), and the softmax policy π t := π θt is defined as (75), then update rule with respect to π t equals to π t (a|s) = π t-1 (a|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s)A (t-1) (s, a) Z t-1 (s) , ( ) where A (t-1) (s, a) = A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) , and Z t-1 (s) = ã∈A π t-1 (ã|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã) . Proof. According to the update rules (74), we calculate policy π t (a|s) (75) as follows, π t (a|s) = π θt (a|s) = exp θ (t) s,a ã∈A exp θ (t) s,ã = exp θ (t-1) s,a + η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) ã∈A exp θ (t-1) s,a + η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) = exp θ (t-1) s,a exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) ã∈A exp θ (t-1) s,ã exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã) =π t-1 (a|s) ã∈A exp θ (t-1) s,ã exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) ã∈A exp θ (t-1) s,ã exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã) (77) =π t-1 (a|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) ã∈A   exp θ (t-1) s,ã ã∈A exp θ (t-1) s,ã exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã)   =π t-1 (a|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s) A πt-1 (s, a) -λ t-1 a c πt-1 (s, a) ã∈A π t-1 (ã|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã) , where Eq.( 77) holds since we use the following definition π t-1 (a|s) = exp θ (t-1) s,a ã∈A exp θ (t-1) s,ã . To short the expression, we introduce two notations: Z t-1 (s) = ã∈A π t-1 (ã|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (ã|s) A πt-1 (s, ã) -λ t-1 a c πt-1 (s, ã) , A (t-1) (s, a) = A πt-1 (s, a) -λ t-1 a c πt-1 (s, a), we rewrite (78) as follows π t (a|s) = π t-1 (a|s) exp η 1 -γ d ρ0 πt-1 (s)π t-1 (a|s)A (t-1) (s, a) Z t-1 (s) . 
( ) This concludes the proof of Lemma 6. Before we further discussions, we need to define some a notations: H t (s) and a distance D[• •] between function p(•) and function q(•) over the action space A, H t (s) = a∈A π t+1 (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) , D[p(•) q(•)] = a∈A p(a) log p(a) q(a) . Note that if p(•), q(•) is reduced to probability distributions, then D(• •) is reduced to Kullback-Leibler divergence between the two distributions p(•), q(•): D[p(•) q(•)] = E x∼p(•) log p(x) q(x) := KL[p(•) q(•)]. Lemma 7. The sequence {λ t , θ t } t≥0 is generated by ( 71)-( 72)/(74), and the softmax policy π t := π θt is defined as (75), the π t and λ t satisfy the following equation c(π) = (C 1 (π), C 2 (π), • • • , C m (π)) , a c π (s, a) = (A c1 π (s, a), A c2 π (s, a), • • • , A cm π (s, a) ) . Furthermore, applying Lemma 1 again, we rewrite (85) as follows J(π ) -J(π t ) = 1 η s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η s∈S a∈A d ρ0 π (s)π (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + λ t (c(π ) -c(π t )), ( ) where π is the optimal policy of primal problem (4), i.e., π = arg max π∈Π C J(π). Proof. According to Lemma 1, we calculate the performance difference between J(π ) and J(π t ) as follows, J(π ) -J(π t ) = 1 1 -γ E s∼d ρ 0 π (•),a∼π (•|s) [A πt (s, a)] = 1 1 -γ s∈S a∈A d ρ0 π (s)π (a|s)A πt (s, a). Recall Eq.( 80) and Eq.( 81), we write it as follows, log π t+1 (a|s) π t (a|s) Z t (s) = η 1 -γ d ρ0 πt (s)π t (a|s) A πt (s, a) -λ t a c πt (s, a) . Then, we rewrite the term A πt (s, a) as follows, A πt (s, a) = 1 -γ η • 1 d ρ0 πt (s)π t (a|s) • log π t+1 (a|s) π t (a|s) Z t (s) + λ t a c πt (s, a). 
( ) Taking the results (84) into (83), we rewrite the performance difference between J(π ) and J(π t ) as follows, J(π ) -J(π t ) = 1 η s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) • log π t+1 (a|s) π t (a|s) Z t (s) + 1 1 -γ s∈S a∈A d ρ0 π (s)π (a|s)λ t a c πt (s, a) = 1 η s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η s∈S a∈A d ρ0 π (s)π (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + 1 1 -γ s∈S a∈A d ρ0 π (s)π (a|s)λ t a c πt (s, a). ( ) Recall the vectors c(π), a c π (s, a) defined in previous sections, where c(π) = (C 1 (π), C 2 (π), • • • , C m (π)) , a c π (s, a) = (A c1 π (s, a), A c2 π (s, a), • • • , A cm π (s, a)) . Furthermore, let us apply Lemma 1 again, we obtain c(π ) -c(π t ) = 1 1 -γ s∈S a∈A d ρ0 π (s)π (a|s)a c πt (s, a), which implies, J(π ) -J(π t ) = 1 η s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η s∈S a∈A d ρ0 π (s)π (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + λ t (c(π ) -c(π t )). This concludes the proof of the result (82). Similarly, we obtain the performance difference between J(π t+1 ) and J(π t ). Lemma 8. The performance difference between J(π t+1 ) and J(π t ) satisfies the following equation J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) (86) = 1 η s∈S d ρ0 πt+1 (s) d ρ0 πt (s) H t (s) + 1 η s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s) d ρ0 πt (s)π t (a|s) log Z t (s). Proof. According to Lemma 1, we calculate the performance difference between J(π t+1 ) and J(π t ) as follows, J(π t+1 ) -J(π t ) = 1 1 -γ E s∼d ρ 0 π t+1 (•),a∼πt+1(•|s) [A πt (s, a)] (87) = 1 1 -γ s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s)A πt (s, a). Recall Eq.( 80) and Eq.( 81), we have A πt (s, a) = 1 -γ η • 1 d ρ0 πt (s)π t (a|s) • log π t+1 (a|s) π t (a|s) Z t (s) + λ t a c πt (s, a). 
Taking ( 89) to (88), we have J(π t+1 ) -J(π t ) = 1 η s∈S d ρ0 πt+1 (s) d ρ0 πt (s) a∈A π t+1 (a|s) π t (a|s) • log π t+1 (a|s) π t (a|s) + π t+1 (a|s) π t (a|s) • log Z t (s) + 1 1 -γ s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s)λ t a c πt (s, a) = - 1 η s∈S d ρ0 πt+1 (s) d ρ0 πt (s) H t (s) + 1 η s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + λ t (c(π t+1 ) -c(π t )). This concludes the proof of the result (86). Lemma 9. The sequence {λ t , θ t } t≥0 is generated by ( 71)-( 72)/(74), and the softmax policy π t := π θt is defined as (75), then the performance difference between J(π t+1 ) and J(π t ) satisfies J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) ≥ 1 η s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s) d ρ0 πt (s)π t (a|s) log Z t (s), and log Z t (s) ≥ 0. Proof  According to Eq.( 86), and due to positivity of the terms d ρ0 πt (s) and d ρ0 πt+1 (s), we have J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) ≥ 1 η s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s) d ρ0 πt (s)π t (a|s) log Z t (s). Now, we need to show log Z t (s) ≥ 0. In fact, the following holds log Z t (s) = log ã∈A π t (ã|s) exp η 1 -γ d ρ0 πt (s)π t (ã|s) A πt (s, ã) -λ t a c πt (s, ã) ≥ ã∈A π t (ã|s) log exp η 1 -γ d ρ0 πt (s)π t (ã|s) A πt (s, ã) -λ t a c πt (s, ã) (92) = η 1 -γ ã∈A π t (ã|s)d ρ0 πt (s)π t (ã|s) A πt (s, ã) -λ t a c πt (s, ã) ≥ min a∈A {π t (a|s)} • η 1 -γ ã∈A d ρ0 πt (s)π t (ã|s) A πt (s, ã) -λ t a c πt (s, ã) ≥ min a∈A {π t (a|s)} • η 1 -γ d ρ0 πt (s) ã∈A π t (ã|s)A πt (s, ã) - ã∈A π t (ã|s)λ t a c πt (s, ã) = 0, where Eq.( 92) holds due to the Jensen's inequality, and the last equation ( 93) holds since: ã∈A π t (ã|s)A πt (s, ã) = 0, ã∈A π t (ã|s)A ci πt (s, ã) = 0, i = 1, 2, • • • , m, ã∈A π t (ã|s)a c πt (s, ã) =        ã∈A π t (ã|s)A c1 πt (s, ã) (94) = 0 , ã∈A π t (ã|s)A c2 πt (s, ã) =0 , • • • , ã∈A π t (ã|s)A cm πt (s, ã) =0        =0 m , the notation 0 m ∈ R m×1 denotes a vector with the elements are all zero. Lemma 10. 
The term s∈S log Z t (s) is bounded as follows, s∈S log Z t (s) ≤ η (1 -γ)ρ min J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) . Proof. Since d ρ0 πt (s) ≤ 1, π t (a|s) ≤ 1 always holds, recall Eq.( 90) in Lemma 9, we achieve the following inequality: J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) ≥ 1 η s∈S a∈A d ρ0 πt+1 (s)π t+1 (a|s) log Z t (s). According to Eq.( 143), we rewrite Eq.( 95) as follows J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) ≥ 1 -γ η s∈S a∈A ρ 0 (s)π t+1 (a|s) log Z t (s) (96) = 1 -γ η s∈S ρ 0 (s) log Z t (s) a∈A π t+1 (a|s) = 1 -γ η s∈S ρ 0 (s) log Z t (s). Then, under Assumption 2, for each s ∈ S, ρ min ≤ ρ 0 (s), result (97 ) implies η (1 -γ)ρ min J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) ≥ s∈S log Z t (s). Lemma 11. Let π θ and π θ be softmax policy, then the following holds V π θ (s) -V π θ (s) - ∂V π θ (s) ∂θ , θ -θ ≤ 4 (1 -γ) 3 θ -θ 2 2 . Proof. See (Mei et al., 2020, Lemma 7) . Let β = 4 (1 -γ) 3 , and if θ = θ + 1 β ∂V π θ (s) ∂θ , then we obtain V π θ (s) -V π θ (s) ≤ - 1 2β ∂V π θ (s) ∂θ 2 2 . ( ) Let q c π (s, a) ∈ R m is defined as follows, q c π (s, a) = (Q c2 π (s, a), Q c2 π (s, a), • • • , Q cm π (s, a)), v c π (s) = (V c2 π (s, a), V c2 π (s), • • • , V cm π (s)) Q π θ (s, a, λ) = Q π θ (s, a) -λ q c π θ (s), V π θ (s, λ) = V π θ (s) -λ v c π θ (s). Furthermore, we define ∆ r (s) = max π∈Π C {Q π (s, a)} - max π∈Π C ,a =a (s) {Q π (s, a)} (100) = Q (s, a (s)) - max π∈Π C ,a =a (s) {Q π (s, a)} > 0. Similarly, we define ∆ ci (s) = min π∈Π C {Q ci π (s, a)} - min π∈Π C ,a =a (s) {Q ci π (s, a)} (102) = Q ci (s, a (s)) - min π∈Π C ,a =a (s) {Q ci π (s, a)} > 0. ( ) We define ∆ (s) as follows ∆ (s) = ∆ r (s) + λ (∆ c1 (s), ∆ c2 (s), • • • , ∆ cm (s)) . 
( ) Recall the notation θ s,a = θ[s, a], for all (s, a) ∈ S × A, and we define Θ 1 (s) = θ : ∂L(π θ , λ) ∂θ[s, a (s)] ≥ ∂L(π θ , λ) ∂θ[s, a] , ∀ a = a , Θ 2 (s) =      θ : Q π θ (s, a (s)) ≥ Q (s, a (s)) - 1 2 ∆ r -Q ci π θ (s, a (s)) ≥ -Q ci (s, a (s)) - 1 2 ∆ ci , i = 1, 2, • • • , m      , Θ 3 (s) = m i=1      θ t : V π θ t (s) ≥ Q π θ t (s, a (s)) - 1 2 ∆ r -V ci π θ t (s) ≥ -Q ci π θ t (s, a (s)) - 1 2 ∆ ci for t ≥ 0 is large enough      , Θ c (s) = θ : π θ (a (s)|s) ≥ c(s) c(s) + 1 , c(s) + 1 = |A| (1 -γ)∆ (s) . ( ) Lemma 12. Let θ t ∈ Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s), then θ t+1 ∈ Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s). Proof. θ t+1 ∈ Θ 2 (s) Since θ t ∈ Θ 3 (s), we obtain the following equation Q πt (s, a (s), λ) ≥ Q (s, a (s), λ) - 1 2 ∆ (s), where π t is short for π θt , and Q (s, a, λ) = max π∈Π C Q π (s, a, λ). Furthermore, we obtain Q πt+1 (s, a (s), λ) -Q πt (s, a (s), λ) = γ s ∈S P(s |s, a (s)) V πt+1 (s , λ) -V πt (s , λ) . ( ) According to Lemma 11, Eq.( 99), we know Q πt+1 (s, a (s), λ) -Q πt (s, a (s), λ) ≥ 0 ≥ - 1 2 ∆ (s), which implies θ t+1 ∈ Θ 2 (s). θ t+1 ∈ Θ 3 (s) For any a = a (s), we know Q πt (s, a (s), λ) -Q πt (s, a, λ) (111) =Q πt (s, a (s), λ) -Q (s, a (s), λ) + Q (s, a (s), λ) -Q πt (s, a, λ) (112) ≥ - 1 2 ∆ (s) + Q (s, a (s), λ) -Q (s, a) + Q (s, a) -Q πt (s, a, λ) (113) ≥ - 1 2 ∆ (s) + Q (s, a (s), λ) -max a =a (s) Q (s, a) + Q (s, a) -Q πt (s, a, λ) (100),( 102),(104) = - 1 2 ∆ (s) + ∆ (s) + γ s ∈S P(s |s, a (s)) V πt+1 (s , λ) -V πt (s , λ) (115) ≥ 1 2 ∆ (s). Similarly, we obtain Q πt+1 (s, a (s), λ) -Q πt+1 (s, a, λ) ≥ 1 2 ∆ (s), which implies θ t+1 ∈ Θ 3 (s). θ t+1 ∈ Θ 1 (s) According to Proposition 1, ∂L(π θ , λ) ∂θ s,a = 1 1 -γ d ρ0 π θ (s)π θ (a|s)A π θ (s, a, λ), if θ t ∈ Θ 1 (s), i.e., ∂L(π , λ t ) ∂θ t [s, a (s)] ≥ ∂L(π θt , λ t ) ∂θ t [s, a] , we obtain: for a = a π t (a (s)|s)A πt (s, a (s), λ) ≥ π t (a|s)A πt (s, a, λ), where π t is short for π θt . Case (i): π t (a (s)|s) ≥ π t (a|s). 
If π t (a (s)|s) ≥ π t (a|s), according to softmax parameterization (9), we obtain θ t [s, a (s)] ≥ θ t [s, a] Recall the update rule of Algorithm 1, we know θ t+1 [s, a (s)] = θ t [s, a (s)] + η ∂L(π θt , λ t ) ∂θ t [s, a (s)] (105),(118) ≥ θ t [s, a] + η ∂L(π θt , λ t ) ∂θ t [s, a] = θ t+1 [s, a], which implies π t+1 (a (s)|s) = exp{θ t [s, a (s)]} a∈A exp{θ t [s, a]} ≥ exp{θ t [s, a]} a∈A exp{θ t [s, a]} = π t+1 (a|s). Recall (117), we obtain A πt+1 (s, a (s), λ) ≥ A πt+1 (s, a, λ). Eq.( 119)-Eq.(120) implies ∂L(π θt+1 , λ t+1 ) ∂θ t+1 [s, a (s)] ≥ ∂L(π θt+1 , λ t+1 ) ∂θ t+1 [s, a] , which implies θ t+1 ∈ Θ 1 (s). Case (ii): π t (a (s)|s) < π t (a|s). According to π t (a (s)|s)A πt (s, a (s), λ) ≥ π t (a|s)A πt (s, a, λ), we obtain π t (a (s)|s) (Q πt (s, a (s), λ) -V πt (s, λ)) ≥ π t (a|s) (Q πt (s, a, λ) -V πt (s, λ)) (121) = π t (a|s) (Q πt (s, a (s), λ) -V πt (s, λ) + Q πt (s, a, λ) -Q πt (s, a (s), λ)) , i.e., 1 - π t (a (s)|s) π t (a|s) (Q πt (s, a (s), λ) -V πt (s, λ)) (123) = (1 -exp {θ t [s, a (s)] -θ t [s, a]}) (Q πt (s, a (s), λ) -V πt (s, λ)) (124) ≤Q πt (s, a (s), λ) -Q πt (s, a, λ). ( ) Recall the update rule of Algorithm 1, we know θ t+1 [s, a (s)] = θ t [s, a (s)] + η ∂L(π θt , λ t ) ∂θ t [s, a (s)] , θ t+1 [s, a] = θ t [s, a] + η ∂L(π θt , λ t ) ∂θ t [s, a] . ( ) Since θ t ∈ Θ 1 (s), i.e., ∂L(π θt , λ t ) ∂θ t [s, a (s)] ≥ ∂L(π θt , λ t ) ∂θ t [s, a] , Eq.( 126) implies θ t+1 [s, a (s)] -θ t+1 [s, a] ≥ θ t [s, a (s)] -θ t [s, a]. Since we consider π t (a (s)|s) < π t (a|s), then 1 -exp{θ t [s, a (s)] -θ t [s, a]} = 1 - π t (a (s)|s) π t (a|s) > 0, which implies (1 -exp{θ t+1 [s, a (s)] -θ t+1 [s, a]}) Q πt+1 (s, a (s), λ) -V πt+1 (s, λ) ≤Q πt+1 (s, a (s), λ) -Q πt+1 (s, a, λ). Rearranging it, we obtain π t+1 (a (s)|s)A πt+1 (s, a (s), λ) ≥ π t+1 (a|s)A πt+1 (s, a, λ), which is ∂L(π θt+1 , λ t+1 ) ∂θ t+1 [s, a (s)] ≥ ∂L(π θt+1 , λ t+1 ) ∂θ t+1 [s, a] , which implies θ t+1 ∈ Θ 1 (s). Lemma 13. 
Let θ t ∈ Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s), then π t+1 (a (s)|s) ≥ π t (a (s)|s). Proof. Recall ∂L(π θt , λ t ) ∂θ t [s, a (s)] ≥ ∂L(π θt , λ t ) ∂θ t [s, a] , we obtain π t+1 (a (s)|s) = exp {θ t+1 [s, a (s)]} a∈A exp {θ t+1 [s, a]} = exp θ t [s, a (s)] + η ∂L(π θt , λ t ) ∂θ t [s, a (s)] a∈A exp θ t (s, a) + η ∂L(π θt , λ t ) ∂θ t [s, a] ≥ exp θ t [s, a (s)] + η ∂L(π θt , λ t ) ∂θ t [s, a (s)] a∈A exp θ t (s, a) + η ∂L(π θt , λ t ) ∂θ t [s, a (s)] (129) = exp {θ t (s, a (s))} a∈A exp {θ t (s, a)} = π t (a (s)|s). ( ) Lemma 14. Θ c (s) ∩ Θ 2 (s) ∩ Θ 3 (s) ⊂ Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s) Proof. Let θ ∈ Θ c (s) ∩ Θ 2 (s) ∩ Θ 3 (s) , we consider the two following cases: Case (i):π θ (a (s)|s) ≥ max a =a (s) π θ (a|s). ∂L(π θt , λ t ) ∂θ t [s, a (s)] = 1 1 -γ d ρ0 π θ (s)π θ (a (s)|s)A π θ (s, a (s), λ) (106),( 107) > 1 1 -γ d ρ0 π θ (s)π θ (a|s)A π θ (s, a, λ) = ∂L(π θt , λ t ) ∂θ t [s, a] , where the last equation holds since the same analysis from ( 111)-( 116), we have Q πt (s, a (s), λ) -Q πt (s, a, λ) ≥ 1 2 ∆ (s). Case (ii):π θ (a (s)|s) ≥ max a<a (s) π θ (a|s), which is impossible, since if this case hold, we obtain π θ (a (s)|s) + π θ (a|s) > 2c(s) c(s) + 1 > 1. Lemma 15. Under Assumption 2, updating π t according to Algorithm 1, we obtain c =: inf s∈S,t≥1 {π t (a (s)|s)} > 0. Proof. According to (Agarwal et al., 2021 , Lemma E.2-Lemma12), we know π t (a (s)|s) → 1, which implies there exists T 1 (s) ≥ 1, such that π θ T 1 (s) (a (s)|s) ≥ c(s) c(s) + 1 . Furthermore, since Q π θ t (s, a (s)) → Q (s, a (s)), as t → ∞, then there exists T 2 (s) ≥ 1, s.t Q π θ T 2 (s) (s, a (s)) ≥ Q (s, a (s)) - 1 2 ∆ (s). Finally, since Q π θ t (s, a (s)) → V (s), and V π θ t (s) → V (s), as t → ∞, then there exists T 3 (s) ≥ 1, such that ∀t ≥ T 3 (s), Q π θ t (s, a (s)) -V π θ t (s) ≤ 1 2 ∆ (s). Define T 0 (s) = max{T 1 (s), T 2 (s), T 3 (s)}, then we obtain θ T0(s) ∈ Θ c (s) ∩ Θ 2 (s) ∩ Θ 3 (s), θ T0(s) ∈ Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s). 
According to Lemma 12-14" i.e., if θ t ∈ Θ 1 (s)∩Θ 2 (s)∩Θ 3 (s), then θ t+1 ∈ Θ 1 (s)∩Θ 2 (s)∩Θ 3 (s), and the policy π θt (a (s)|s) is increasing in the space Θ 1 (s) ∩ Θ 2 (s) ∩ Θ 3 (s), we have inf t≥0 π θt (a (s)|s) = min 1≤t≤T0(s) π θt (a (s)|s). T 0 (s) only depends on initialization and c(s), which only depends on the CMDP and state s. π θt (a (s)|s) > 0. Lemma 16. For any fixed T > 0, let θ 0 = 0, λ 0 = 0. The sequence {λ t , θ t } t≥0 is generated by ( 71)-( 72)/(74), and the softmax policy π t := π θt is defined as (75 ). Let χ = 1 (1-γ)c d ρ 0 π ρ0 ∞ , C 1 = m (1-γ) 2 1 + 1 1-γ . Then J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 . ( ) Proof. According to Lemma 7, we obtain J(π ) -J(π t ) = 1 η s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η s∈S a∈A d ρ0 π (s)π (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + λ t (c(π ) -c(π t )), and summing (134) as t ranges from 0 to T -1, we have J(π ) - 1 T T -1 t=0 J(π t ) = 1 ηT T -1 t=0 s∈S d ρ0 π (s) d ρ0 πt (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 ηT T -1 t=0 s∈S a∈A d ρ0 π (s)π (a|s) d ρ0 πt (s)π t (a|s) log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 η(1 -γ)T T -1 t=0 s∈S d ρ0 π (s) ρ 0 (s) a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η(1 -γ)T T -1 t=0 s∈S a∈A d ρ0 π (s)π (a|s) ρ 0 (s)π t (a|s) log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 η(1 -γ)T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S a∈A π (a|s) π t (a|s) log π t+1 (a|s) π t (a|s) + 1 η(1 -γ)T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S a∈A π (a|s) π t (a|s) log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t )) (137) ≤ 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S a∈A π (a|s) log π t+1 (a|s) π t (a|s) + 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t )) (139) = 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S KL [π (•|s) π t (•|s)] -KL [π (•|s) π t+1 (•|s)] (140) + 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t 
)) = 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] -KL [π (•|s) π T (•|s)] + 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S log Z t (s) + 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] + 1 T T -1 t=0 λ t (c(π ) -c(π t )) + 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ T -1 t=0 s∈S log Z t (s), where Eq.( 135) holds since: for any Markov stationary policy π, d ρ0 π (s) = E s0∼ρ0(•) [d s0 π (s)] = E s0∼ρ0(•) (1 -γ) ∞ t=0 γ t P π (s t = s|s 0 ) ≥ E s0∼ρ0(•) [(1 -γ)P π (s 0 = s|s 0 )] = (1 -γ)ρ 0 (s); Eq.( 137) holds since we use d ρ0 π ρ 0 ∞ to denote the distribution mismatch coefficient, i.e., d ρ0 π ρ 0 ∞ := max s∈S d ρ0 π (s) ρ 0 (s) , Due to the Assumption 2, the term d ρ0 π (s) ρ 0 (s) is well-defined; Eq.( 138) holds since π is a deterministic optimal policy, and we denote it as π (a (s)|s) = 1, otherwise, i.e.,if a = a (s), π (a|s) = 0. Eq.( 141) holds since we omit the term KL [π (•|s) π T (•|s)]. Taking Eq.( 98) into Eq.( 142), we achieve J(π ) - 1 T T -1 t=0 J(π t ) ≤ 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] + 1 T T -1 t=0 λ t (c(π ) -c(π t )) + 1 ρ min (1 -γ)c 2 T d ρ0 π ρ 0 ∞ T -1 t=0 J(π t+1 ) -J(π t ) -λ t (c(π t+1 ) -c(π t )) = 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] + 1 T T -1 t=0 λ t (c(π ) -c(π t )) + 1 ρ min (1 -γ)c 2 T d ρ0 π ρ 0 ∞ J(π T ) -J(π 0 ) + T -1 t=0 λ t (c(π t ) -c(π t+1 ))) . Now, we need to bound the term 1 T T -1 t=0 λ t (c(π t ) -c(π t+1 )) in Eq.( 147), our proof is adaptive to (Ding et al., 2020) . 1 T T -1 t=0 λ t (c(π t -c(π t+1 ))) = 1 T T -1 t=0 λ t (c(π t+1 ) -c(π t )) = 1 T T -1 t=0 λ t+1 c(π t+1 -λ t c(π t )) + 1 T T -1 t=0 λ t -λ t+1 c(π t+1 ) = 1 T λ T c(π T ) - X X X X λ 0 c(π 0 ) + 1 T T -1 t=0 λ t -λ t+1 c(π t+1 ) (148) ≤ 1 T λ T 2 c(π T ) 2 + 1 T T -1 t=0 λ t -λ t+1 2 c(π t+1 ) 2 , ( ) where Eq.( 148) holds since the initial value λ 0 = 0, and Eq.( 149) due to Cauchy-Schwarz inequality. 
Recall the update rule ( 71) with respect to λ: λ t+1 ← λ t -η ∂ ∂λ L(π t , λ t ) + = {λ t -η(b -c(π t ))} + , which implies λ t+1 -λ t 2 ≤ η b -c(π t ) 2 ≤ η ( b 2 + c(π t ) 2 ) . Now, we need to bound c(π t ) 2 = m i=1 |C i (π t )| 2 = m i=1 E s0∼ρ0(•) [V ci πt (s 0 )] 2 = m i=1 E s0∼ρ0(•),st∼P (t) π t (•|s0),at∼πt(•|st) ∞ t=0 γ t c i (s t , a t ) s 0 2 ≤ √ m 1 1 -γ , ( ) where last equation holds since the cost function c i (•) is bounded by 1. Recall b = (b 1 , b 2 , • • • , b m ) , let b max := max{b 1 , b 2 , • • • , b m }, then, according to (150), we achieve λ t+1 -λ t 2 ≤ η b -c(π t ) 2 ≤ η √ m b max + 1 1 -γ . ( ) Furthermore, λ T 2 = T -1 t=0 (λ t+1 -λ t ) + λ 0 2 ≤ η √ m b max + 1 1 -γ T. ( ) Taking Eq.( 151), Eq.( 152) and Eq.( 153) to Eq.( 149), we have 1 T T -1 t=0 λ t (c(π t ) -c(π t+1 )) ≤ 2ηm 1 1 -γ b max + 1 1 -γ . ( ) Finally, recall the result (147), and taking (154) to it, we obtain the following equation, J(π ) - 1 T T -1 t=0 J(π t ) ≤ 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] + 1 T T -1 t=0 λ t (c(π ) -c(π t )) + 1 ρ min (1 -γ)c 2 d ρ0 π ρ 0 ∞ 1 T (J(π T ) -J(π 0 )) - 1 T T -1 t=0 λ t (c(π t+1 ) -c(π t )) = 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ s∈S KL [π (•|s) π 0 (•|s)] + 1 T T -1 t=0 λ t (c(π ) -c(π t )) + 1 ρ min (1 -γ)c 2 d ρ0 π ρ 0 ∞ 1 T (J(π T ) -J(π 0 )) + 1 T T -1 t=0 λ t (c(π t ) -c(π t+1 )) ≤ 1 η(1 -γ)c T d ρ0 π ρ 0 ∞ |S| log |A| + 1 T T -1 t=0 λ t (c(π ) -c(π t )) (155) + 1 ρ min (1 -γ)c 2 d ρ0 π ρ 0 ∞ 2 T • 1 1 -γ + 2ηm 1 1 -γ b max + 1 1 -γ , ( ) where Eq.( 155) holds since: according to the initial value λ 0 = 0, then the initial policy π 0 is reduced to uniform distribution, and the probability of each action a is π 0 (a|s) = 1 |A| , KL [π (•|s) π 0 (•|s)] = a∈A π (•|s) log π (a|s) π 0 (a|s) = a∈A π (•|s) log (|A|π (a|s)) ≤ log |A|, which implies s∈S KL [π (•|s) π 0 (•|s)] ≤ |S| log |A|; Eq.( 156) holds since: for any policy π, the objective J(π) satisfies J(π) ≤ 1 1 -γ , and the result (154) shows the boundedness of 1 T T -1 
t=0 λ t (c(π t ) -c(π t+1 )) ≤ 2ηm 1 1 -γ b max + 1 1 -γ . Finally, let χ = 1 (1 -γ)c d ρ0 π ρ 0 ∞ , C 1 = m (1 -γ) 2 b max + 1 1 -γ , ( ) then we rewrite Eq.( 156) as follows J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ χ|S| log |A| η + 2χ ρ min (1 -γ) 2 • 1 T + 2ηχC 1 , This concludes the proof of (133).
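The mirror-descent view used throughout this subsection rests on the identity of Lemma 6: the additive update (74) in the softmax parameters is exactly a multiplicative-weights update on the policy with normalizer Z_t(s). A quick numerical check, where the matrix g stands in for the scaled increment (eta/(1-gamma)) d(s) pi(a|s) A(s, a); the identity holds for any increment:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A = 3, 4
theta = rng.normal(size=(S, A))
# g[s, a] stands in for eta/(1-gamma) * d(s) * pi(a|s) * A^{(t-1)}(s, a);
# Lemma 6's identity is purely algebraic and holds for any such increment.
g = rng.normal(size=(S, A))

def softmax_policy(th):
    z = np.exp(th - th.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

pi_prev = softmax_policy(theta)

# Additive update in parameter space: theta <- theta + g.
pi_add = softmax_policy(theta + g)

# Multiplicative update of Lemma 6: pi <- pi * exp(g) / Z(s).
Z = (pi_prev * np.exp(g)).sum(axis=1, keepdims=True)
pi_mult = pi_prev * np.exp(g) / Z

print(np.abs(pi_add - pi_mult).max())  # agrees up to floating point
```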

D.2 DETAILS FOR PROOF OF THEOREM 2

Theorem 2 Under Assumption 1-2, π θ is the softmax policy defined in ( 9). The time-step T satisfies T ≥ max F (1 -γ) 2 , F ((1 -γ) 2 + 2m/ι 2 ) 2 , where F := D 2 2 |S| log |A|ρ 2 min D1 , D 1 and D 2 are positive scalars will be special later. The initial λ 0 = 0, θ 0 = 0, the parameter sequence {λ t , θ t } T -1 t=0 is generated according to Algorithm 1. Let η, β satisfy η = d ρ0 π ρ 0 ∞ |S| log |A| C(1 -γ)T , β = d ρ0 π ρ 0 ∞ 4|S| log |A| (1 -γ) 3 ι 2 c , where C is a positive scalar will be special later. Then for all i ∈ {1, 2, • • • , m}, π t := π θt satisfies min t<T {J(π ) -J(π t )} ≤ 4 d ρ0 π ρ 0 ∞ |S| log |A| c (1 -γ) 4 T , ( ) min t<T {C i (π t ) -b i } + ≤ 4 d ρ 0 π ρ0 ∞ β -λ ∞ |S| log |A| c (1 -γ) 4 T . ( ) In this section, we show the details for the proof of Theorem 2. The proof contains two key steps: bounding the optimality gap and bounding the constraint violation. Finally, we summary the hyperparameter setting for us to obtain the results presented in Theorem 2. Bounding the Optimality Gap. Recall the dual update ( 71)-( 72), we have λ T 2 2 = T -1 t=0 λ t+1 2 2 -λ t 2 2 (71) = T -1 t=0   λ t -η ∂ ∂λ L(π t , λ t ) + 2 2 -λ t 2 2   (70) = T -1 t=0 {λ t -η (b -c(π t ))} + 2 2 -λ t 2 2 ≤ T -1 t=0 λ t -η (b -c(π t )) 2 2 -λ t 2 2 = T -1 t=0 η 2 b -c(π t ) 2 2 + 2ηλ t (c(π t ) -b) ≤ T -1 t=0 η 2 m b max + 1 1 -γ 2 + 2ηλ t (c(π t ) -c(π )) (161) =η 2 m b max + 1 1 -γ 2 T + 2η T -1 t=0 λ t (c(π t ) -c(π )) , ( ) where Eq.( 161) holds since: c(π ) b, then λ t (c(π ) -b) ≤ 0, and λ t (c(π t ) -b) = λ t (c(π t ) -c(π ) + c(π ) -b) ≤ λ t (c(π t ) -c(π )) ; (163) Eq.(152) implies b -c(π t ) 2 ≤ √ m b max + 1 1 -γ . ( ) Combining the results ( 163) and ( 164), we obtain Eq.( 161). Since λ T 2 2 ≥ 0, Eq.( 162) implies the following boundedness w.r.t. λ t (c(π )) -c(π t ): 1 2 ηm b max + 1 1 -γ 2 ≥ 1 T T -1 t=0 λ t (c(π )) -c(π t ) . 
( ) Recall Lemma 16, we have J(π ) - 1 T T t=0 J(π t ) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2χηC 1 + 1 T T -1 t=0 λ t (c(π ) -c(π t )) (165) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2χηC 1 + 1 2 ηm b max + 1 1 -γ 2 = 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + η 2χC 1 + 1 2 m b max + 1 1 -γ 2 , (166) which implies min t<T {J(π ) -J(π t )} ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + η 2χC 1 + 1 2 m b max + 1 1 -γ 2 . (167) Furthermore, let η = 1 T χ|S| log |A| 2χC 1 + 1 2 m b max + 1 1-γ 2 = d ρ0 π ρ 0 ∞ |S| log |A| (1 -γ)C 1 T , ( ) where the positive scalar C is defined as follows, C = c 2χC 1 + 1 2 m b max + 1 1 -γ 2 < +∞. ( ) Then we achieve the optimal gap as follows min t<T {J(π ) -J(π t )} ≤ χ|S| log |A| T 2χC 1 + 1 2 m b max + 1 1 -γ 2 + 2χ ρ min (1 -γ) 2 T (170) = χ|S| log |A| T (M 1 + 1)2χC 1 + 2χ ρ min (1 -γ) 2 T , where the constant M 1 is special as follows: 2χC 1 M 1 = 1 2 m b max + 1 1-γ 2 , i.e., M 1 = (1 -γ)c 3 d ρ0 π ρ 0 -1 ∞ b max + 1 1-γ 1 . Recall χ and C 1 defined in Eq.( 157), χ = 1 (1 -γ)c d ρ0 π ρ 0 ∞ , C 1 = m (1 -γ) 2 b max + 1 1 -γ , taking them into Eq.( 171), we rewrite Eq.( 171) as follows min t<T {J(π ) -J(π t )} ≤ 1 (1 -γ) 2 d ρ0 π ρ 0 ∞ |S| log |A| T • (M 1 + 1)2m (c ) 2 • b max + 1 1 -γ + 2 c 1 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ = 1 (1 -γ) 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ , where D 1 and D 2 are two positive constants defined as follows D 1 := (M 1 + 1)2m (c ) 2 b max + 1 1 -γ < +∞, D := 2 c < +∞. (173) Finally, let 1 T ≥ D 2 2 (1 -γ) 2 |S| log |A|ρ 2 min D 1 , ( ) which implies min t<T {J(π ) -J(π t )} ≤ 2 (1 -γ) 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T . ( ) Bounding the Constraint Violation. We consider the parameter λ 2 ∈ [0, β], where the positive scalar β will be special later. According to the update rule (71) w.r.t. 
parameter λ, we obtain the following the equation, λ t+1 -λ 2 2 = {λ t -η(b -c(π t ))} + -λ 2 2 ≤ λ t -η(b -c(π t )) -λ 2 2 = λ t -λ 2 2 -2η(λ t -λ) (b -c(π t )) + η 2 b -c(π t ) 2 2 (152) ≤ λ t -λ 2 2 -2η(λ t -λ) (b -c(π t )) + η 2 m b max + 1 1 -γ 2 , which is equal to λ t+1 -λ 2 2 -λ t -λ 2 2 ≤ -2η(λ t -λ) (b -c(π t )) + η 2 m b max + 1 1 -γ 2 . (176) Summing Eq.( 176) from t = 0 to T -1, we achieve the following equation 0 ≤ λ T -λ 2 2 ≤ λ 0 -λ 2 2 -2η T -1 t=0 (λ t -λ) (b -c(π t )) + T η 2 m b max + 1 1 -γ 2 , which implies 1 T T -1 t=0 (λ t -λ) (b -c(π t )) ≤ 1 2ηT λ 0 -λ 2 2 + η 2 m b max + 1 1 -γ 2 . ( ) Due to c(π ) b, and λ t 0, then the following equation holds, - 1 T T -1 t=0 λ t (c(π ) -c(π t )) = - 1 T T -1 t=0 λ t (c(π ) -b + b -c(π t )) ≥ - 1 T T -1 t=0 λ t (b -c(π t )). ( ) Recall Lemma 16, J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 , 1 where we obtain the term T (174) by solving the inequality: 1 (1 -γ) 2 d ρ 0 π ρ0 ∞ |S| log |A|D1 T ≥ D2 ρmin(1 -γ) 3 T d ρ 0 π ρ0 ∞ . and taking Eq.( 178) into above equation, we achieve J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (b -c(π t )) 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 . We rewrite Eq.( 179) as follows, J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (b -c(π t )) =J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 (λ t -λ) (b -c(π t )) - 1 T T -1 t=0 λ (b -c(π t )) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 . Let λ = ( λ1 , λ2 , • • • , λm ) , taking the result (177) into Eq.( 180), we have J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ (b -c(π t )) = J(π ) - 1 T T -1 t=0 J(π t ) + 1 T T -1 t=0 λ (c(π t ) -b) (181) =J(π ) - 1 T T -1 t=0 J(π t ) + m i=1 λi 1 T T -1 t=0 (C i (π t ) -b i ) (182) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 + 1 2ηT λ 0 -λ 2 2 + η 2 m b max + 1 1 -γ 2 . ( ) For any policy π, the objective function J(π) is a linear function in an occupancy measure induced by such policy π. 
Since the set of occupancy measures is convex and compact, the average of occupancy measures is another occupancy measure that yields a policy, which implies there exists a policy πt such that 1 T T -1 t=0 J(π t ) = J(π t ), 1 T T -1 t=0 (C i (π t ) -b i ) = C i (π t ) -b i . Furthermore, let λi =        β = χ|S| log |A| 2 (1 -γ)ι , if T -1 t=0 (C i (π t ) -b i ) ≥ 0, 0, if T -1 t=0 (C i (π t ) -b i ) < 0, recall η defined in (168), i.e., η = 1 T χ|S| log |A| 2χC 1 + 1 2 m b max + 1 1-γ 2 , then we rewrite Eq.( 182) as follows, J(π ) - 1 T T -1 t=0 J(π t ) m i=1 λi 1 T T -1 t=0 (C i (π t ) -b i ) (184) = J(π ) -J(π t ) + m i=1 λi (C i (π t ) -b i ) (185) = J(π ) -J(π t ) + β m i=1 {C i (π t ) -b i } + =J(π ) -J(π t ) + β1 m {c(π t ) -b} + the vector version of Eq.( 186) (183) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 + 1 2ηT λ 0 -λ 2 2 + η 2 m b max + 1 1 -γ 2 (187) (185) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 + m 2ηT β 2 + η 2 m b max + 1 1 -γ 2 (188) (168) = χ|S| log |A| T 2χC 1 + 1 2 m b max + 1 1 -γ 2 + 2χ ρ min (1 -γ) 2 T + m 2ηT β 2 (189) (168) = 1 + 2m (1 -γ) 2 ι 2 χ|S| log |A| T 2χC 1 + 1 2 m b max + 1 1 -γ 2 + 2χ ρ min (1 -γ) 2 T (190) (170),(172) = 1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ , where Eq.( 188) holds since: by the definition of λ in Eq.( 185), and initial λ 0 = 0, we have λ 0 -λ 2 2 = λ 2 2 ≤ mβ 2 ; Eq.( 190) holds since we replace the term m 2ηT β 2 as follows: recall β = χ|S| log |A| 2 (1 -γ)ι defined in (185) we have m 2ηT β 2 = m 2 β 2 2χC 1 + 1 2 m b max + 1 1-γ 2 T χ|S| log |A| = 2m (1 -γ) 2 ι 2 χ|S| log |A| T 2χC 1 + 1 2 m b max + 1 1 -γ 2 . 
Finally, let δ := 1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ , then the results (191) can be represented simply as follows, J(π ) -J(π t ) + β1 m {c(π t ) -b} + ≤ δ; recall Lemma 3 that reveals the boundedness of λ , the definition of β = χ|S| log |A| 2 (1 -γ)ι implies β > λ ∞ , then applying Lemma 4, we have 1 m {c(π t ) -b} + < δ β -λ ∞ . ( ) Since m {c(π) -b} + = m i=1 {C i (π t ) -b i } + and each {C i (π t ) -b i } + ≥ 0, then we have {c(π t ) -b} + δ β -λ ∞ 1 m . Eq.( 194) implies for each i ∈ {1, 2, • • • , m}: {C i (π t ) -b i } + ≤ δ β -λ ∞ , i.e., {C i (π t ) -b i } + (184) = 1 T T -1 t=0 (C i (π t ) -b i ) + ≤ δ β -λ ∞ , which implies the Best-Case Constraint Violation as follows: for each i ∈ {1, 2, • • • , m}, we have min t<T {C i (π t ) -b i } + (196) ≤ 1 β -λ ∞ 1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ . Furthermore, let 2 T ≥ D 2 2 ((1 -γ) 2 + 2m/ι 2 ) 2 ρ 2 min D 1 |S| log |A| , then we obtain min t<T {C i (π t ) -b i } + ≤ 2 β -λ ∞ 1 + 2m (1 -γ) 2 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T = 2 1 + 2m (1 -γ) 2 ι 2 β -λ ∞ d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T . ( ) Summarizing the Conclusion under Special Hyper-Parameter Setting. Finally, recall the condition for the term T in ( 174), (197), we conclude if the time-step T satisfies T ≥ max 1 (1 -γ) 2 , 1 ((1 -γ) 2 + 2m/ι 2 ) 2 • D 2 2 |S| log |A|ρ 2 min D 1 , the step-size η defined in (168) satisfies η = d ρ0 π ρ 0 ∞ |S| log |A| (1 -γ)C 1 T , and the constant term β satisfies β (185) = χ|S| log |A| 2 (1 -γ)ι ( ) (157) = 1 (1 -γ)c d ρ0 π ρ 0 ∞ |S| log |A| 2 (1 -γ)ι := d ρ0 π ρ 0 ∞ D|S| log |A| (1 -γ) 3 ι 2 , 2 where we obtain the term T (197) by solving the inequality: 1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ 0 π ρ0 ∞ |S| log |A|D1 T ≥ D2 ρmin(1 -γ) 3 T d ρ 0 π ρ0 ∞ . where we define the constant D as follows D 1 := 4 c . 
Then, according to ( 175) and ( 198), the following holds min t<T {J(π ) -J(π t )} ≤ 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T , min t<T {C i (π t ) -b i } + ≤ 2 β -λ ∞ 1 + 2m (1 -γ) 2 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T , where each i ∈ {1, 2, • • • , m}. This concludes the proof of Theorem 2. Remark 3. Recall Eq.( 143): d ρ0 π (s) ≥ (1 -γ)ρ 0 (s), which implies d ρ0 π ρ 0 ∞ ≥ (1 -γ). Thus, β ≥ 1 c |S| log |A| 2 (1 -γ)ι > 2 (1 -γ)ι . Lemma 3 shows that λ ∞ ≤ 2 (1 -γ)ι . Thus β > λ ∞ . E PROOF OF THEOREM 3 E.1 GEOMETRIC DISTRIBUTION Before we show the details of the proof, we introduce some basic notations about geometric distribution Geo(γ), which is defined as the following discrete probability distributions: the probability distribution of the number τ of failures before the first success, supported on the set {0, 1, 2, • • • }, i.e., P(τ = t) = (1 -γ) t γ, ∈ (0, 1), t = 0, 1, 2, 3, • • • . To understand Geo(γ) (202) clearly, we list the distribution column of the distribution Geo(γ) in the following Table 2 . Table 2 : Distribution column of the distribution Geo(γ).  τ 0 1 2 3 4 • • • • • • t • • • • • • γ (1 -γ)γ (1 -γ) 2 γ (1 -γ) 3 γ (1 -γ) 4 γ • • • • • • (1 -γ) t γ • • • • • • E.2 ROLLOUT WITH FINITE HORIZONS Algorithm 4 EstQ(π, g, s, a): Estimate Q Value Function (τ = t) = (1 -γ)γ t ; 4: for t = 0, 1, 2, • • • , τ -1 do 5: Collect reward (or cost) g(s t , a t ) and add to estimate: Q(s, a) ← Q(s, a) + g(s t , a t );

6: Simulate the next state and next action: s_{t+1} ∼ P(·|s_t, a_t); a_{t+1} ∼ π(·|s_{t+1});
7: end for
8: Collect the last reward (or cost) g(s_τ, a_τ) and add it to the estimate: Q(s, a) ← Q(s, a) + g(s_τ, a_τ).
9: Output: Q(s, a).

Algorithm 5 EstV(π, g, s): Estimate V Value Function
1: Input: Policy π to be evaluated; reward function or cost function g(·, ·); state s;
2: Initialization: V(s) = 0, s_0 = s, a_0 ∼ π(·|s_0);
3: Draw an integer τ from a geometric distribution with parameter (1 − γ): P(τ = t) = (1 − γ)γ^t;
4: for t = 0, 1, 2, ..., τ − 1 do
5: Collect the reward (or cost) g(s_t, a_t) and add it to the estimate: V(s) ← V(s) + g(s_t, a_t);
6: Simulate the next state and next action: s_{t+1} ∼ P(·|s_t, a_t); a_{t+1} ∼ π(·|s_{t+1});
7: end for
8: Collect the last reward (or cost) g(s_τ, a_τ) and add it to the estimate: V(s) ← V(s) + g(s_τ, a_τ).
9: Output: V(s).
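Algorithms 4-5 amount to summing rewards over a geometrically distributed horizon. A minimal tabular sketch; the interfaces (transition tensor `P[s, a] -> distribution over s'`, policy table `pi[s] -> distribution over a`, reward table `g[s, a]`) are hypothetical, not the paper's code:

```python
import numpy as np

def est_q(P, pi, g, s, a, gamma, rng):
    """EstQ(pi, g, s, a): sum rewards along a rollout whose horizon tau is drawn
    from Geo(1 - gamma), i.e. P(tau = t) = (1 - gamma) * gamma**t. The last
    reward g(s_tau, a_tau) is included, so the sum runs over t = 0, ..., tau."""
    tau = rng.geometric(1.0 - gamma) - 1  # numpy counts trials; we count failures
    q = 0.0
    for _ in range(tau + 1):
        q += g[s, a]
        s = rng.choice(len(P), p=P[s, a])
        a = rng.choice(P.shape[1], p=pi[s])
    return q

def est_v(P, pi, g, s, gamma, rng):
    """EstV(pi, g, s): same rollout, but starting from a_0 ~ pi(.|s_0)."""
    a = rng.choice(P.shape[1], p=pi[s])
    return est_q(P, pi, g, s, a, gamma, rng)
```

Note the off-by-one: `rng.geometric(p)` is supported on {1, 2, ...} (number of trials), while the paper's τ counts failures before the first success, hence the `- 1`.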

E.3 PROOF OF PROPOSITION 3

We need the following Lemma 17 to show Proposition 3. Lemma 17 (Dominated Convergence Theorem). Let {X n } n≥0 be a random variable sequence, and X n → X almost surely, as n → ∞. Furthermore, if |X n | ≤ Y for all n, and E[Y ] ≤ ∞, then E[X n ] → E[X], as n → ∞. Proof. See (Durrett, 2019, Theorem 1.6.7 ). Proposition 3. The output of Algorithm 4 (also see Algorithm 4) is an unbiased estimator of Q π (s, a) or Q c π (s, a), i.e., let Q π (s, a) = EstQ(π, r, s, a), Q ci π (s, a) = EstQ(π, c i , s, a), then the following holds E[ Q π (s, a)] = Q π (s, a), Q ci π (s, a)] = Q ci π (s, a). Proof. This proof is adaptive to Paternain (2018) ; Zhang et al. (2020) . Without losing generality, we only need to show the case of g(•, •) = r(•, •), i.e., E[ Q π (s, a)] = Q π (s, a). According to the iteration from Algorithm 4, we obtain the estimator of Q π (s, a) as follows, Q π (s, a) = τ t=0 r(s t , a t ), (s 0 , a 0 ) = (s, a), τ ∼ Geo(1 -γ). We consider the expectation of ( 205): E[ Q π (s, a)] = E τ t=0 t , a t ) π, s 0 = s, a 0 = a (206) = E ∞ t=0 I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a (207) = ∞ t=0 E I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a , where Eq.( 207) holds since: we have substituted ∞ for the τ via the indicator function I{•} such that the summand for t > τ is vanished; Eq.( 208) holds since we use the dominated convergence theorem (see Lemma 17): let X n = n t=0 I {t ≤ τ } r(s t , a t ), then we obtain |X n | = n t=0 I {t ≤ τ } r(s t , a t ) ≤ n t=0 I {t ≤ τ } := Y n , and recall τ ∼ Geo(1 -γ), then we obtain P(t ≤ τ ) = ∞ τ =t γ τ (1 -γ) = γ t , E[Y n ] = E n t=0 I {t ≤ τ } = n t=0 P(t ≤ τ ) (210) = n t=0 γ t ≤ 1 1 -γ ; ( ) according to the results ( 209) and ( 211), and applying Lemma 17, we obtain E ∞ t=0 I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a = ∞ t=0 E I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a , i.e., we have checked exchange condition for the sum and the expectation in the previous expression from Eq.( 207) to Eq.( 208). 
Furthermore, we consider the result (208) as follows, E[ Q π (s, a)] = ∞ t=0 E I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a = ∞ t=0 E E τ ∼Geo(1-γ) I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a (212) = ∞ t=0 E τ ∼Geo(1-γ) I {t ≤ τ } r(s t , a t ) π, s 0 = s, a 0 = a (213) = ∞ t=0 E γ t r(s t , a t ) π, s 0 = s, a 0 = a (214) = Q π (s, a), where Eq(212) holds due to the double expectation formula: X|Y ] , which implies that we find the expected value of X by conditioning it on another random variable Y ; E[X] = E Y [E X [ Eq.( 213) holds since: the horizon τ is drawn independently of the MDP sequence {s t , a t , r(s t , a t )}; Eq.( 214) holds since:τ ∼ Geo(1 -γ), then we obtain E τ ∼Geo(1-γ) I {t ≤ τ } = P(t ≤ τ ) = ∞ τ =t γ τ (1 -γ) = γ t . This concludes the result E[ Q π (s, a)] = Q π (s, a). If we replace all the term r(s t , a t ) to c i (s t , a t ) from Eq.( 205)-Eq.( 215), we obtain E[ Q ci π (s, a)] = Q ci π (s, a).
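The unbiasedness argument can be sanity-checked numerically: on a tiny MDP, the mean of the geometric-horizon sum should match the discounted value computed exactly from the Bellman equation. A self-contained sketch with made-up numbers (a two-state deterministic chain, not an example from the paper):

```python
import numpy as np

# Two-state chain: state 0 goes to 1, state 1 goes to 0; a single action.
gamma = 0.8
r = np.array([1.0, 0.0])          # reward depends only on the state here
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # deterministic transitions

# Exact value: solve Q = r + gamma * P Q.
Q_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

# Geometric-horizon estimator: draw tau ~ Geo(1-gamma), sum r(s_0), ..., r(s_tau).
rng = np.random.default_rng(0)
def rollout(s):
    tau = rng.geometric(1.0 - gamma) - 1   # failures before first success
    total = 0.0
    for _ in range(tau + 1):
        total += r[s]
        s = rng.choice(2, p=P[s])
    return total

Q_mc = np.mean([rollout(0) for _ in range(40000)])
```

The key identity from the proof appears here as P(t ≤ τ) = γ^t, so E[Σ_{t≤τ} r(s_t)] = Σ_t γ^t r(s_t), which is the discounted value.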

E.4 PROOF OF THEOREM 3

Since the return objective J(π_θ) and the cost functions C_i(π_θ) share the same structure, Eq. (22)-(24) extend to C_i(π_θ) by replacing r with c_i. In this section, we therefore only present the case of the reward objective J(π_θ) in Theorem 3. Before giving the details, we provide some intuition for the unbiased estimators.

Rollout Algorithm

We rollout a policy evaluation with respect to π θ according to Algorithm 4, Q π θ (s, a) = EstQ(π θ , r, s, a), we use τ ∼ Geo(1 -γ) to denote the terminal time of the horizon of the rollout (216). Furthermore, let G π θ (s, a) be an estimator defined as follows, G π θ (s, a) = 1 1 -γ Q π θ (s τ , a τ ) ∂ log π θ (a τ |s τ ) ∂θ s,a , where we obtain Q π θ (s τ , a τ ) according to Algorithm 4, Q π θ (s τ , a τ ) = EstQ(π θ , r, s τ , a τ ). Let τ ∼ Geo(1 -γ) be the terminal time of the horizon of the rollout (218), and we denote the rollout trajectory as follows, D = s j , a j , r(s j , a j ) τ j=0 , where initial state-action pair (s 0 , a 0 ) = (s τ , a τ ). Then, we rewrite the value Q π θ (s τ , a τ ) (218) as follows, Q π θ (s τ , a τ ) = τ j=0 r(s j , a j ), s 0 , a 0 = (s τ , a τ ), τ ∼ Geo(1 -γ). ( ) Algorithm 6 EstPG(π, g, s, a): Estimate Policy Gradient 1: Input: A policy π θ with given parameter θ, (s, a) ∈ S × A; 2: Policy Evaluation Rollout for (s, a)-Pair: Q π θ (s, a) = EstQ(π θ , r, s, a), and τ ∼ Geo(1-γ) denotes the terminal time of the horizon of such a policy evaluation rollout; 3: Policy Evaluation Rollout for (s τ , a τ )-Pair: Q π θ (s τ , a τ ) = EstQ(π θ , r, s τ , a τ ), and τ ∼ Geo(1 -γ) denotes the terminal time of the horizon of such a policy evaluation rollout; 4: Collect the trajectory D = (s j , a j , r(s j , a j ) j=0:τ , where initial state-action pair (s 0 , a 0 ) = (s τ , a τ ); 5: Output: G π θ (s, a) defined as follows, G π θ (s, a) = 1 1 -γ Q π θ (s τ , a τ ) ∂ log π θ (a τ |s τ ) ∂θ s,a = 1 1 -γ τ j=0 r(s j , a j ) ∂ ∂θ s,a log π θ (a τ |s τ ). Taking Eq.( 219) to (217), we obtain the expression of G π θ (s, a) (217) as follows, G π θ (s, a) = 1 1 -γ τ j=0 r(s j , a j ) ∂ ∂θ s,a log π θ (a τ |s τ ), where s 0 , a 0 = (s τ , a τ ), τ ∼ Geo(1 -γ). ( ) Remark 4. 
Algorithm 6 performs two rollouts, whose random horizons τ and τ' are the sources of randomness of the estimator G_{π_θ}(s, a) in (220); hence G_{π_θ}(s, a) is a random variable with respect to both τ and τ'.
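The two nested rollouts of Algorithm 6 can be sketched as follows; a minimal tabular sketch of EstPG with a softmax table policy (all interfaces are hypothetical, and the sketch returns the full score-scaled table rather than a single (s, a)-coordinate):

```python
import numpy as np

def softmax_pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def est_q_rollout(P, theta, r, s, a, gamma, rng):
    """Geometric-horizon rollout (Algorithm 4): returns the Q estimate and the
    final pair (s_tau, a_tau) at which the last reward was collected."""
    tau = rng.geometric(1.0 - gamma) - 1   # failures before first success
    q = 0.0
    for _ in range(tau):                   # t = 0, ..., tau - 1
        q += r[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])
        a = rng.choice(P.shape[1], p=softmax_pi(theta, s))
    q += r[s, a]                           # last reward, at (s_tau, a_tau)
    return q, s, a

def est_pg(P, theta, r, s, a, gamma, rng):
    """EstPG (Algorithm 6): roll out to (s_tau, a_tau), estimate Q there with a
    second independent rollout, and scale by the softmax score function."""
    _, s_tau, a_tau = est_q_rollout(P, theta, r, s, a, gamma, rng)
    q_hat, _, _ = est_q_rollout(P, theta, r, s_tau, a_tau, gamma, rng)
    score = np.zeros_like(theta)           # d log pi(a_tau|s_tau) / d theta
    score[s_tau] = -softmax_pi(theta, s_tau)
    score[s_tau, a_tau] += 1.0
    return q_hat * score / (1.0 - gamma)
```

The 1/(1−γ) factor compensates for sampling the pair (s_τ, a_τ) with geometric weight, which is what makes the estimator unbiased for the policy gradient.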

Unbiasedness Analysis

We consider the expectation of G π θ (s, a) (217) as follows, for any given θ, E G π θ (s, a) =E τ,τ   1 1 -γ τ j=0 r(s j , a j ) ∂ log π θ (a τ |s τ ) ∂θ s,a   ( ) (220) = E τ        1 1 -γ E τ     τ j=0 r(s j , a j )   ∂ log π θ (a τ |s τ ) ∂θ s,a s 0 , a 0 = (s τ , a τ )   :=E1        , ( ) where E τ,τ [•] is short for the expectation over the randomness from the variables τ ∼ Geo(1γ), τ ∼ Geo(1 -γ), similarly, E τ [•] denotes the expectation over the randomness from the trajectory τ ∼ Geo(1 -γ). As a similar analysis from (207) to (215), we consider the term E 1 (223) as follows, E 1 (223) = E τ ∼Geo(1-γ)     τ j=0 r(s j , a j )   ∂ log π θ (a τ |s τ ) ∂θ s,a s 0 , a 0 = (s τ , a τ )   = E τ ∼Geo(1-γ)   ∞ j=0 I{j ≤ τ }r(s j , a j ) s 0 , a 0 = (s τ , a τ )   ∂ log π θ (a τ |s τ ) ∂θ s,a = Q π θ (s τ , a τ ) ∂ log π θ (a τ |s τ ) ∂θ s,a . According to Eq.( 223) and Eq.( 224), we obtain E G π θ (s, a) = 1 1 -γ E τ Q π θ (s τ , a τ ) ∂ log π θ (a τ |s τ ) ∂θ s,a = 1 1 -γ E τ,st,at ∞ t=0 I{t = τ }Q π θ (s t , a t ) ∂ log π θ (a t |s t ) ∂θ s,a , where E τ,st,at  0 if s t = s or a t = a, which implies ∂ log π θ (a t |s t ) ∂θ s,a =              1 -π θ (a t |s t ) if s t = s and a t = a -π θ (a|s) if s t = s and a t = a 0 if s t = s or a t = a. 
According to the result (229), it is similar to ( 209)-( 211), it is easy to check that Eq.( 226) satisfies the condition of dominated convergence theorem (see Lemma 17), thus we rewrite Eq.( 226) as follows, E G π θ (s, a) = ∞ t=0 1 1 -γ E τ,st,at I{t = τ }Q π θ (s t , a t ) ∂ log π θ (a t |s t ) ∂θ s,a = ∞ t=0 1 1 -γ E τ ∼Geo(1-γ) I{t = τ } E st,at Q π θ (s t , a t ) ∂ log π θ (a t |s t ) ∂θ s,a = ∞ t=0 γ t E st∼P (t) π θ (•|s0),at∼π θ (•|st) Q π θ (s t , a t ) ∂ log π θ (a t |s t ) ∂θ s,a = ∞ t=0 γ t   s ∈S P π θ (s t = s |s 0 ) a ∈A π θ (a |s ) Q π θ (s , a ) ∂ log π θ (a |s ) ∂θ s,a   (233) = s ∈S a ∈A ∞ t=0 γ t P π θ (s t = s |s 0 )π θ (a |s )Q π θ (s , a ) ∂ log π θ (a |s ) ∂θ s,a , where Eq.( 232) holds since E τ ∼Geo(1-γ) I{t = τ } = P(t = τ ) = γ t (1 -γ); where Eq.( 234) holds since we use the dominated convergence theorem (see Lemma 17). We should notice that the last Eq.( 234) share the same expression in the previous in Eq.( 65), and following the same analysis from Eq.( 65)-( 68), we obtain the unbiasedness of G π θ (s, a).
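The case analysis for ∂log π_θ(a_t|s_t)/∂θ_{s,a} used in this argument (equal to 1 − π_θ(a|s) when (s_t, a_t) = (s, a), to −π_θ(a|s) when s_t = s but a_t ≠ a, and to 0 otherwise) can be verified against a finite-difference check; a small sketch with made-up table sizes:

```python
import numpy as np

def softmax_pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def log_pi_grad(theta, s_t, a_t, s, a):
    """Closed form of d log pi_theta(a_t|s_t) / d theta[s, a] for softmax."""
    if s_t != s:
        return 0.0
    p = softmax_pi(theta, s)
    return (1.0 - p[a]) if a_t == a else -p[a]

# Finite-difference check on random parameters (hypothetical sizes: 3 states, 4 actions).
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
eps = 1e-6
for (s_t, a_t, s, a) in [(1, 2, 1, 2), (1, 2, 1, 0), (1, 2, 0, 3)]:
    tp, tm = theta.copy(), theta.copy()
    tp[s, a] += eps
    tm[s, a] -= eps
    fd = (np.log(softmax_pi(tp, s_t)[a_t]) - np.log(softmax_pi(tm, s_t)[a_t])) / (2 * eps)
    assert abs(fd - log_pi_grad(theta, s_t, a_t, s, a)) < 1e-6
```

Note that the gradients over a fixed row sum to zero (1 − π(a_t|s) minus the remaining probabilities), which is also why the score is bounded by 2 in the boundedness analysis.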

Boundedness Analysis

Recall the expectation of G π θ (s, a) (217) as follows, for any given θ, E ( G π θ (s, a)) 2 =E τ,τ      1 1 -γ τ j=0 r(s j , a j ) ∂ log π θ (a τ |s τ ) ∂θ s,a   2    (235) ≤ 4 (1 -γ) 2 E τ      τ j=0 r(s j , a j )   2    (236) ≤ 4 (1 -γ) 2 E τ ∼Geo(1-γ) τ 2 ≤ 4 (1 -γ) 3 , where Eq.( 236) holds since: Eq.( 229) implies the boundedness of ∂ log π θ (a t |s t ) ∂θ s,a ≤ 2; Eq.( 237) holds since: recall τ ∼ Geo(1 -γ) can be expressed as follows, Table 3 : Distribution column of the distribution τ ∼ Geo(1 -γ). τ 0 1 2 3 4 • • • • • • t • • • • • • 1 -γ (1 -γ)γ γ 2 (1 -γ) γ 3 (1 -γ) γ 4 (1 -γ) • • • • • • γ t (1 -γ) • • • • • • which implies the distribution of (τ ) 2 can be presented as follows, Table 4 : Distribution column of the distribution (τ ) 2 . (τ ) 2 0 1 4 9 16 • • • • • • t 2 • • • • • • 1 -γ (1 -γ)γ γ 4 (1 -γ) γ 9 (1 -γ) γ 16 (1 -γ) • • • • • • γ t 2 (1 -γ) • • • • • • thus E τ ∼Geo(1-γ) τ 2 ≤ E τ ∼Geo(1-γ) τ = 1 1 -γ . F PROOF OF THEOREM 4 In this section, we provide the necessary proof details of Theorem 4. It is very technical to achieve the result of Theorem 4, we outline some necessary intermediate results in Section F.1 where we provide some basic lemmas, and the proof of Theorem 4 is shown in Section F.2. Theorem 4 Under Assumption 1-2, π θ is the softmax policy defined in (9). The time-step T shares a fixed low bound similar to (16). The initial λ 0 = 0, θ 0 = 0, the parameter sequence {λ t , θ t } T -1 t=0 is generated according to Algorithm 3. Let η, β satisfy η = d ρ0 π ρ 0 ∞ |S| log |A| C (1 -γ)T , β = d ρ0 π ρ 0 ∞ 4|S| log |A| (1 -γ) 3 ι 2 c , where C is a positive scalar will be special later. 
Then for all i ∈ {1, 2, • • • , m}, π t := π θt satisfies E min t<T {J(π ) -J(π t )} ≤ 4 d ρ0 π ρ 0 ∞ |S| log |A| c (1 -γ) 4 T , E min t<T {C i (π t ) -b i } + ≤ 4 d ρ 0 π ρ0 ∞ β -λ ∞ |S| log |A| c (1 -γ) 4 T , where the notation E[•] is short for E D0:D T -1 [•] that denotes the expectation with respect to the randomness over the trajectories {D t } T -1 t=0 . F.1 AUXILIARY LEMMA Recall Algorithm 3, at current time t, we obtain cost value estimator ĉ(π θt ) according to (25), ĉ(π θt ) = C 1 (π θt ), C 2 (π θt ), • • • , C m (π θt ) , where each C i (π θt ) is a rollout estimator according to: C i (π θt ) = E s∼ρ0(•) V ci π θ t (s) = s∈S ρ 0 (s) V ci π θ t (s), V ci π θ t (s) = EstV(π θt , c i , s). According to Proposition 4, after some simple algebra, we obtain the unbiasedness of ĉ(π θt ): E [ĉ(π θt )] = c(π θt ). Now, we provide the boundedness of ĉ(π θt ) as follows, ĉ(π θt )(π θt ) 2 2 (239) ≤ m i=1 C i (π θt ) 2 , which implies we need to bound each | C i (π θt )|, where i ∈ {1, 2, •, m}. Recall the definition of C i (π θt ) in ( 240), we expand it as follows, C i (π θt ) (240) = E s∼ρ0(•) V ci π θ t (s) = s∈S ρ 0 (s) V ci π θ t (s). According to the iteration from Algorithm 5, we rewrite V ci π θ t (s) as follows, V ci π θ t (s) = τ t=0 r(s t , a t ), s 0 = s, τ ∼ Geo(1 -γ). Now, we bound the expectation of | C i (π θt )| 2 as follows, E C i (π θt ) 2 (245) = E τ ∼Geo(1-γ)   s∈S ρ 0 (s) τ t=0 r(s t , a t ) 2   ≤E τ ∼Geo(1-γ)   s∈S τ t=0 r(s t , a t ) 2   (246) ≤|S|E τ ∼Geo(1-γ) τ 2 (238) ≤ |S| 1 -γ . ( ) Collect the results ( 242), ( 243) and ( 247), we achieve the next Lemma 18. Lemma 18. Let π θt be the softmax policy defined in (9). For each parameter θ t , let C i (π θt ), ĉ(π θt ) be the estimator of cost value function defined in ( 240) and ( 239): C i (π θt ) = E s∼ρ0(•) V ci π θ t (s) = s∈S ρ 0 (s) V ci π θ t (s), ĉ(π θt ) = C 1 (π θt ), C 2 (π θt ), • • • , C m (π θt ) , where V ci π θ t (s) is defined in Eq.( 241). 
Then ĉ(π θt ) is an unbiased and bounded of cost value function, i.e., E [ĉ(π θt )] = c(π θt ), E C i (π θt ) 2 ≤ |S| 1 -γ , E ĉ(π θt ) 2 2 ≤ m|S| 1 -γ . Recall Algorithm 3, we obtain the unbiased estimator as follows, ∇ λt L(π θt , λ t ) = b -ĉ(π θt ), E ∇ λt L(π θt , λ t ) = E[b -ĉ(π θt )] = ∂L(π θt , λ t ) ∂λ t . According to the estimators ( 26) and ( 26), we obtain the policy gradient estimators , i.e., for each (s, a) ∈ S × A, G π θ t (s, a) = PG(π θt , r, s, a), G ci π θ t (s, a) = PG(π θt , c i , s, a), i = 1, 2, • • • , m . Furthermore, let the vector g c π θ t (s, a) ∈ R m collect all the policy gradient estimators of cost value function, i.e., g c π θ t (s, a) = G c1 π θ t (s, a), • • • , G cm π θ t (s, a) . Let G(π θt , λ t ) ∈ R |S|×|A| , each (s, a)-element is defined as follows, G(π θt , λ t )[s, a] = G π θ t (s, a) -λ t g c π θ t (s, a), then we obtain the policy gradient estimator of ∂L(π θ t ,λt) ∂θt : ∇ θt L(π θt , λ t ) = G(π θt , λ t ), E G(π θt , λ t ) = ∂L(π θt , λ t ) ∂θ t . Collect the results (254), and (256), we achieve the next Lemma 19. Lemma 19. Let π θt be the softmax policy defined in (9). For each parameter θ t , let C i (π θt ), ĉ(π θt ) be the estimator of cost value function defined in ( 240) and ( 239). Then, the following holds Recall Algorithm 3, we obtain the unbiased estimator as follows, E[b -ĉ(π θt )] = ∂L(π θt , λ t ) ∂λ t . ( ) Let the policy gradient G(π θt , λ t ) be defined in (255), then E G(π θt , λ t ) = ∂L(π θt , λ t ) ∂θ t . F.2 PROOF OF THEOREM 4 In this section, we show the details for the proof of Theorem 4. The proof contains two key steps: bounding the optimality gap and bounding the constraint violation. Finally, we summary the hyperparameter setting for us to obtain the results presented in Theorem 4. 
We rewrite the iteration (12) as the following stochastic version, λ t+1 = {λ t -η(b -ĉ(π θt )} + , θ t+1 = θ t + η G(π θt , λ t ), where we calculate ĉ(π θt ) and G(π θt , λ t ) according to ( 253) and (256). To short the expression, as before, we introduce the following notations: π t (a|s) := π θt (a|s) = exp θ (t) s,a ã∈A exp θ (t) s,ã , and π t := π θt . For each time t, we notice the estimator ĉ(π t ) in the inner loop (see Line 3) involves m trajectories, and estimator G(π θt , λ t ) (see Line 5) involves (2|S||A| + m) trajectories. We use D t to collect all those (2|S||A| + 2m) trajectories, D t = {T t,i } . According to rollout rule in Algorithm 4, Algorithm 5, and Algorithm 2, those (2|S||A| + 2m) trajectories among D t are independent with each other. Bounding the Optimality Gap. Lemma 20. The average term - 1 T T -1 t=0 λ t (c(π t ) -c(π )) is bounded as follows, -E D0:D T -1 1 T T -1 t=0 λ t (c(π t ) -c(π )) ≤ η 2 b 2 max + m|S| 1 -γ , where E D0:D T -1 [•] denotes the expectation with respect to the randomness over the trajectories {D t } T -1 t=0 . Proof. According to the dual update (259), we have λ T 2 2 = T -1 t=0 λ t+1 2 2 -λ t 2 2 (259) = T -1 t=0 {λ t -η(b -ĉ(π t ))} + 2 2 -λ t 2 2 ≤ T -1 t=0 λ t -η(b -ĉ(π t )) 2 2 -λ t 2 2 = T -1 t=0 η 2 b -ĉ(π t ) 2 2 + 2ηλ t (ĉ(π t ) -b) = T -1 t=0 η 2 b -ĉ(π t ) 2 2 + 2ηλ t (ĉ(π t ) -b) ≤ T -1 t=0 η 2 b -ĉ(π t ) 2 2 + 2ηλ t (ĉ(π t ) -c(π t )) + 2ηλ t (c(π t ) -c(π )) , where Eq.( 262) holds since we express (ĉ(π t ) -b) as follows, (ĉ(π t ) -b) = ĉ(π t ) -c(π t ) + c(π ) -b + c(π t ) -c(π ) , and the fact λ t 0 and c(π ) b, then λ t (c(π ) -b) ≤ 0 implies λ t (ĉ(π t ) -b) ≤ λ t (ĉ(π t ) -c(π t )) + λ t (c(π t ) -c(π )). For each given θ t-1 , the estimator ĉ(π θt ) is independent of λ t , and λ t is independent of (ĉ(π t )c(π t )). Thus according to (242), for each time t, the next equation holds E λ t (ĉ(π t ) -c(π t )) = 0, where t ∈ {0, 1, 2, • • • , T -1}. 
Furthermore, we consider the expectation of T -1 t=0 λ t (ĉ(π t ) -c(π t )) over the trajectory {D t } T -1 t=0 as follows, E D0:D T -1 T -1 t=1 λ t (ĉ(π t ) -c(π t )) =E D0:a T -2 (s)     T -2 t=0 λ t (ĉ(π t ) -c(π t )) + E D T -1 λ T -1 (ĉ(π T -1 ) -c(π T -1 )) (265) = 0     (266) =E D0:a T -2 (s) T -2 t=1 λ t (ĉ(π t ) -c(π t )) , where Eq.( 266) holds since the term T -2 t=0 λ t (ĉ(π t ) -c(π t )) is independent of the trajectories D T -1 , which implies E D0:D T -1 T -1 t=1 λ t (ĉ(π t ) -c(π t )) =E D0:a T -2 (s) T -2 t=1 λ t (ĉ(π t ) -c(π t )) + E D T -1 λ T -1 (ĉ(π T -1 ) -c(π T -1 )) . Let us expand recurrently according to the mathematical induction, we achieve E D0:D T -1 T -1 t=1 λ t (ĉ(π t ) -c(π t )) = 0. ( ) Recall the result (262), which implies T -1 t=0 η 2 b -ĉ(π t ) 2 2 + 2ηλ t (ĉ(π t ) -c(π t )) + 2ηλ t (c(π t ) -c(π )) ≥ 0. Recall the result (268), and we consider to take expectation of (269) over the trajectory {D t } T -1 t=0 , then we achieve the next equation E D0:D T -1 T -1 t=0 η 2 b -ĉ(π t ) 2 2 + 2ηλ t (ĉ(π t ) -c(π t )) + 2ηλ t (c(π t ) -c(π )) ≥ 0, rewriting (270), we obtain the following equation, -E D0:D T -1 1 T T -1 t=0 λ t (c(π t ) -c(π )) ≤E D0:D T -1 η 2T T -1 t=0 b -ĉ(π t ) (271) ≤E D0:D T -1 η 2T T -1 t=0 b 2 2 + ĉ(π t ) 2 2 ≤ 1 2 ηb 2 max + η 2T E D0:D T -1 T -1 t=0 ĉ(π t ) 2 2 (272) ≤ 1 2 ηb 2 max + η 2 m|S| 1 -γ = η 2 b 2 max + m|S| 1 -γ , where last Eq.( 273) holds since E D0:D T -1 T -1 t=0 ĉ(π t ) 2 2 =E D0:a T -2 (s) T -2 t=0 ĉ(π t ) 2 2 + E D T -1 [ ĉ(π T -1 ) 2 2 ] ( ) (252) ≤ E D0:a T -2 (s) T -2 t=0 ĉ(π t ) 2 2 + m|S| 1 -γ (275) • • • ≤ m|S| 1 -γ T, where Eq.( 274) holds since the term T -2 t=0 ĉ(π t ) 2 2 is independent of D T -1 ; Eq.( 276) holds since we expand recurrently according to (275) by the mathematical induction; Taking the result (276) into (272), we achieve the result (273). Lemma 21. 
The optimal gap is bounded as follows, E D0:D T -1 min t<T {J(π ) -J(π t )} ≤ 2 (1 -γ) 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T , where the positive scalar D 1 will be special later. Proof. Due to the unbiasedness of Q π θ (s, a), V π θ (s) according to Algorithm 4 and Algorithm 5, we achieve a similar result as (133) in Lemma 16, but we need to consider the over the trajectory {D t } T -1 t=0 . Concretely, we replace the terms with respect to expectation by the corresponding estimators, then we have E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ t (c(π ) -c(π t )) ≤ 1 T χ|S| log |A| η + 2χ ρ min (1 -γ) 2 + 2ηχC 1 . Furthermore, taking Eq.( 273) (we have also presented it in Lemma 20) in to Eq.( 277), we obtain E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) ≤ χ|S| log |A| ηT + 2χ ρ min (1 -γ) 2 T + η 2 4χC 1 + b 2 max + m|S| 1 -γ . ( ) The above result (278) implies E D0:D T -1 min t<T {J(π ) -J(π t )} ≤ χ|S| log |A| ηT + 2χ ρ min (1 -γ) 2 T + η 2 4χC 1 + b 2 max + m|S| 1 -γ . ( ) It is similar to the analysis of result (167), if the next condition (280) holds, then above Eq.( 279) obtains the optimal gap (that is shown in Eq.( 283)) for the difference over {J(π ) -J(π t )} T -1 t=0 . χ|S| log |A| ηT = η 2 4χC 1 + b 2 max + m|S| 1 -γ , which implies η = χ|S| log |A| 2χC 1 + 1 2 b 2 max + mC 2 max 2(1-γ) 2 T = d ρ0 π ρ 0 ∞ |S| log |A| (1 -γ)C 1 T , where the positive scalar C is defined as follows, C = c 2χC 1 + 1 2 b 2 max + m|S| 1 -γ < +∞. 
( ) Taking the step-size ( 281) into (279), we obtain the optimal gap as follows, E D0:D T -1 min t<T {J(π ) -J(π t )} ≤ χ|S| log |A| T 2χC 1 + 1 2 b 2 max + m|S| 1 -γ + 2χ ρ min (1 -γ) 2 T (284) = 1 (1 -γ)c d ρ0 π ρ 0 ∞ |S| log |A| T 2C 1 M 1 + 1 + 2χ ρ min (1 -γ) 2 T ( ) (157) = 1 (1 -γ)c d ρ0 π ρ 0 ∞ 2|S| log |A| T m (1 -γ) 2 b max + 1 1 -γ M 1 + 1 + 1 (1 -γ)c d ρ0 π ρ 0 ∞ 2 ρ min (1 -γ) 2 T (286) = 1 (1 -γ)c d ρ0 π ρ 0 ∞ |S| log |A| T D 1 + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ , where Eq.( 285) holds since we choose the constant M 1 satisfies 2χC 1 M 1 = 1 2 b 2 max + m|S| 1 -γ , i.e., M 1 (157) = (1 -γ)c 3 d ρ0 π ρ 0 -1 ∞ b max + 1 1 -γ -1 b 2 max 4m + |S| 4(1 -γ) 2 ; (288) the constants D 1 and D 2 in (287) are defined as follows, D 1 := 2m (1 -γ) 2 b max + 1 1 -γ M 1 + 1 < +∞, D 2 := 2 c < +∞. Finally, it is similar to the same analysis of (175), we conclude that if T ≥ (D 2 ) 2 (1 -γ) 2 |S| log |A|ρ 2 min D 1 , which implies E D0:D T -1 min t<T {J(π ) -J(π t )} ≤ 2 (1 -γ) 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T . ( ) Bounding the Constraint Violation. Lemma 22. For any given λ 2 ∈ [0, β], where the positive scalar β will be special later, the term 1 T T -1 t=0 (λ t -λ) (b -c(π t )) is bounded as follows, E D0:D T -1 1 T T -1 t=0 (λ t -λ) (b -c(π t )) ≤ 1 2ηT λ 2 2 + η 2 b 2 max + m|S| 1 -γ . ( ) Proof. We consider the parameter λ 2 ∈ [0, β], where the positive scalar β will be special later. According to the update rule (259), we have λ t+1 -λ 2 2 = {λ t -η(b -ĉ(π t ))} + -λ 2 2 ≤ λ t -η(b -ĉ(π t )) -λ 2 2 = λ t -λ 2 2 -2η(λ t -λ) (b -ĉ(π t )) + η 2 b -ĉ(π t ) 2 2 ≤ λ t -λ 2 2 -2η(λ t -λ) (b -ĉ(π t )) + η 2 b 2 max + m|S| 1 -γ , ( ) where the last Eq.( 293) holds the next result (294) holds, which is contained in the previous Eq.( 273), b -ĉ(π t ) 2 2 ≤ b 2 2 + ĉ(π t ) 2 2 ≤ b 2 max + m|S| 1 -γ . 
We rewrite Eq.( 293) as follows, which implies 1 T T -1 t=0 (λ t -λ) (b -ĉ(π t )) ≤ 1 2ηT Z Z λ 0 -λ 2 2 + η 2 b 2 max + m|S| 1 -γ = 1 2ηT λ 2 2 + η 2 b 2 max + m|S| 2(1 -γ) . ( ) According to (242), taking expectation on Eq.( 296), we obtain E D0:D T -1 1 T T -1 t=0 (λ t -λ) (b -c(π t )) ≤ 1 2ηT λ 2 2 + η 2 b 2 max + m|S| 1 -γ , ( ) where we use the fact E[c t ] = c(π t ) (242), and λ t is independent of ĉ(π t ) for a given θ t-1 . Lemma 23. The constraint violation is bounded as follows, E D0:D T -1 min t<T {C i (π t ) -b i } + ≤ 2 β -λ ∞ 1 + 2m (1 -γ) 2 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T = 2 1 + 2m (1 -γ) 2 ι 2 β -λ ∞ d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T . Taking (297) into (277), we obtain E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) + 1 T T -1 t=0 λ t (b -c(π )) - 1 T T -1 t=0 λ (b -c(π t )) ≤ 1 T χ|S| log |A| η + 1 2η λ 2 2 + 2χ ρ min (1 -γ) 2 + η 2χC 1 + b 2 max 2 + m|S| 2(1 -γ) . Due to c(π ) b, and λ t 0, then we have E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) + 1 T T -1 t=0 λ t (b -c(π )) - 1 T T -1 t=0 λ (b -c(π t )) ≥E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) - 1 T T -1 t=0 λ (b -c(π t )) . ( ) It is similar to the previous proof from Eq.( 186) to Eq.( 191), we need to show the boundedness of the expectation E D0:D T -1 J(π ) -J(π t ) + m i=1 λi (C i (π t ) -b i ) defined in (301), which is a fundamental result for us to show the boundedness of the constraint violation. 
Recall c(π t ) = (C 1 (π t ), C 2 (π t ), • • • , C m (π t )) , and λ = λ1 , λ2 , • • • , λm , combining the result ( 298) and ( 299), and taking consideration to the step-size η defined in (281), we have  E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) + 1 T T -1 t=0 λ (c(π t ) -b) =E D0:D T -1 J(π ) - 1 T T -1 t=0 J(π t ) + m i=1 λi 1 T T -1 t=0 (C i (π t ) -b i ) (307) ≤ 1 T χ|S| log |A| η + m 2η β2 + 2χ ρ min (1 -γ) 2 + η 2χC 1 + b 2 max 2 + m|S| 2(1 -γ) ( ) (281) = χ|S| log |A| T 2χC 1 + 1 2 b 2 max + m|S| 2(1 -γ) + m 2ηT β2 + 2χ ρ min (1 -γ) 2 T ( ) (307),(287) = 1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞ , where Eq.( 301) holds due to the same reason as (184); Eq.( 302) holds since we set the parameter λi as follows Finally, it is similar to the proof of (196), we have λi =        β = χ|S| log |A| 2 (1 -γ)ι , if T -1 t=0 (C i (π t ) -b i ) ≥ 0, 0, if T -1 t=0 (C i (π t ) -b i ) < 0. E D0:D T -1 min t<T {C i (π t ) -b i } + (309) ≤ 1 β -λ ∞   1 (1 -γ) 2 + 2m (1 -γ) 4 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 T + D 2 ρ min (1 -γ) 3 T d ρ0 π ρ 0 ∞   . Furthermore, let T ≥ (D 2 ) 2 ((1 -γ) 2 + 2m/ι 2 ) 2 ρ 2 min D 1 |S| log |A| , then we obtain E D0:D T -1 min t<T {C i (π t ) -b i } + ≤ 2 β -λ ∞ 1 + 2m (1 -γ) 2 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T = 2 1 + 2m (1 -γ) 2 ι 2 β -λ ∞ d ρ0 π ρ 0 ∞ |S| log |A|D 1 (1 -γ) 4 T . ( ) Summarizing the Conclusion under Special Hyper-Parameter Setting. Finally, recall the condition for the term T in ( 290), (310), we conclude if the time-step T satisfies  T ≥ max 1 (1 -γ) β (307) = χ|S| log |A| 2 (1 -γ)ι ( ) (157) = 1 (1 -γ)c d ρ0 π ρ 0 ∞ |S| log |A| 2 (1 -γ)ι := d ρ0 π ρ 0 ∞ D |S| log |A| (1 -γ) 3 ι 2 , ( ) where we define the constant D as follows D := 4 c . 
Then, according to (175) and ( 198), the following holds E D0:D T -1 min t<T {J(π ) -J(π t )} ≤ 2 d ρ0 π ρ 0 ∞ |S| log |A|D (1 -γ) 4 T , E D0:D T -1 min t<T {C i (π t ) -b i } + ≤ 2 β -λ ∞ 1 + 2m (1 -γ) 2 ι 2 d ρ0 π ρ 0 ∞ |S| log |A|D (1 -γ) 4 T , where each i ∈ {1, 2, • • • , m}. This concludes the proof of Theorem 4.
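Putting the stochastic estimators and the primal-dual updates together, the loop analyzed in this section has the following overall shape. This is a coarse end-to-end toy sketch (2-state/2-action CMDP with made-up tables, single-sample estimators, and a simplified one-pair gradient step), not the paper's exact Algorithm 3:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta, b = 2, 2, 0.9, 0.05, np.array([2.0])
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transition kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))                 # toy reward table
c = rng.uniform(size=(1, S, A))              # one toy cost table (m = 1)
rho0 = np.array([0.5, 0.5])

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def est_q(theta, g, s, a):
    """Geometric-horizon estimator of Q(s, a) under g (Algorithm 4)."""
    tau = rng.geometric(1.0 - gamma) - 1
    q = 0.0
    for _ in range(tau + 1):
        q += g[s, a]
        s = rng.choice(S, p=P[s, a])
        a = rng.choice(A, p=pi(theta, s))
    return q

theta, lam = np.zeros((S, A)), np.zeros(1)
for _ in range(100):
    # Single-sample estimates of c(pi_t) and of a (coarse) Lagrangian gradient.
    s0 = rng.choice(S, p=rho0)
    a0 = rng.choice(A, p=pi(theta, s0))
    c_hat = np.array([est_q(theta, c[0], s0, a0)])
    q_hat = est_q(theta, r - lam[0] * c[0], s0, a0)
    grad = np.zeros_like(theta)
    grad[s0] = -pi(theta, s0) * q_hat
    grad[s0, a0] += q_hat
    lam = np.maximum(lam - eta * (b - c_hat), 0.0)   # projected dual step
    theta = theta + eta * grad / (1.0 - gamma)       # primal ascent step
```

The dual step only needs the unbiased cost estimate ĉ(π_t), and the primal step only needs an unbiased Lagrangian gradient, which is exactly the structure exploited in Lemmas 18-19.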



Let $V_\pi(s)=\mathbb{E}_\pi[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\,|\,s_0=s]$ be the state value function. The state-action value function is $Q_\pi(s,a)=\mathbb{E}_\pi[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\,|\,s_0=s,a_0=a]$, and the advantage function is $A_\pi(s,a)=Q_\pi(s,a)-V_\pi(s)$. Finally, we define the objective function $J(\pi)=\mathbb{E}_{s\sim\rho_0(\cdot)}[V_\pi(s)]$. CMDP extends MDP with an additional constraint set $C=\{(c_i,b_i)\}_{i=1}^{m}$

Similarly, let $V^{c_i}_{\pi}(s)=\mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty}\gamma^t c_i(s_t,a_t)\mid s_0=s\big]$. Furthermore, we define the expected cost $C_i(\pi)=\mathbb{E}_{s\sim\rho_0(\cdot)}[V^{c_i}_{\pi}(s)]$. The feasible policy set $\Pi_C$ is defined as follows: $\Pi_C=\cap_{i=1}^{m}\{\pi\in\Pi_S : C_i(\pi)\le b_i\}$. The goal of safe RL is to search for a policy $\pi$ that satisfies
$$
\max_{\pi\in\Pi_S} J(\pi),\quad\text{such that } c(\pi)\preceq b, \tag{3}
$$
where the vector $c(\pi)=(C_1(\pi),C_2(\pi),\cdots,C_m(\pi))^\top$ and $b=(b_1,b_2,\cdots,b_m)^\top$. If the constrained policy optimization problem (3) admits a solution, we denote it as:
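A standard way to read the primal-dual approach to problem (3) is through its Lagrangian; the saddle-point sketch below is the textbook formulation (the paper's PD-PG updates are a stochastic instantiation of this scheme, whose exact step rules are given elsewhere in the paper):

```latex
\mathcal{L}(\pi,\lambda) \;=\; J(\pi) \;+\; \lambda^{\top}\bigl(b - c(\pi)\bigr),
\qquad \lambda \succeq 0_m,
\qquad
\max_{\pi\in\Pi_S}\;\min_{\lambda\succeq 0_m}\;\mathcal{L}(\pi,\lambda).
```

Here the inner minimization over $\lambda\succeq 0_m$ recovers problem (3): whenever some constraint $C_i(\pi)>b_i$, the corresponding $\lambda_i$ can be driven to $+\infty$, so only feasible policies survive the outer maximization.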

and each element of vector $0_m$ is $0$, i.e., $0_m=(0,0,\cdots,0)^\top$.
$a\preceq b$ : It denotes component-wise order, i.e., $a[j]\le b[j]$, $j\in[m]$.
$a\prec b$ : It denotes strict component-wise order, i.e., $a[j]<b[j]$, $j\in[m]$.
A.2 MARKOV DECISION PROCESS
$S$ : The set of states.
$\Delta^o(S)$ : The interior of the probability simplex over the state space $S$, $\Delta^o(S):=\big\{p \mid p(s)>0 \text{ for all } s\in S,\ \sum_{s\in S}p(s)=1\big\}$.
$A$ : The set of actions.
$P(s'|s,a)$ : The probability of the state transition from $s$ to $s'$ under playing the action $a$.
$r(\cdot)$ : The reward function $r(\cdot):S\times S\times A\to\mathbb{R}$, bounded by $|r(\cdot)|\le 1$.
$\rho_0$ : $\rho_0(\cdot):S\to[0,1]$ is the initial state distribution.
$\gamma$ : The discount factor, $\gamma\in(0,1)$.

$J(\pi)$ : The objective function.
$C$ : The constraint set $C=\{(c_i,b_i)\}_{i=1}^{m}$, where the $b_i$ are the limit values, and each cost function $c_i:S\times A\to\mathbb{R}$ is bounded by $|c_i(\cdot)|\le 1$.
$b$ : The vector that stores all the limit values: $b=(b_1,b_2,\cdots,b_m)^\top$.

$D_1$, $D_2$ : Two positive constants that are defined in Eq. (173).
B PRELIMINARIES AND AUXILIARY LEMMA
B.1 STATE DISTRIBUTION
We use $P_\pi\in\mathbb{R}^{|S|\times|S|}$ to denote the state transition matrix induced by executing $\pi$; its components are
$$
P_\pi[s,s']=\sum_{a\in A}\pi(a|s)P(s'|s,a):=P_\pi(s'|s),\qquad s,s'\in S,
$$
which denotes the one-step state transition probability from $s$ to $s'$.
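As an illustration, $P_\pi$ and the induced discounted state-visitation distribution $d^{\rho_0}_\pi=(1-\gamma)\,\rho_0^\top(I-\gamma P_\pi)^{-1}$ (the quantity whose ratio to $\rho_0$ appears in the bounds above) can be formed in a few lines; the MDP below is a random hypothetical example, not taken from the paper:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, for illustration only.
rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, :] sums to 1
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, :] sums to 1
rho0 = np.full(nS, 1.0 / nS)                    # uniform initial distribution

# One-step transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sap->sp', pi, P)

# Discounted state-visitation distribution:
#   d_pi = (1 - gamma) * rho0^T (I - gamma * P_pi)^{-1},
# computed here via one linear solve on the transposed system.
d_pi = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)
```

Since $(I-\gamma P_\pi)^{-1}=\sum_{k\ge 0}\gamma^k P_\pi^k$ and each $P_\pi^k$ is row-stochastic, $d^{\rho_0}_\pi$ is a genuine probability distribution over states.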

The perturbation function is defined as $v(\omega):=\max_{\pi\in\Pi_S}\{J(\pi)\ :\ b-c(\pi)\succeq\omega\}$; according to Paternain et al. (2019b); Ding et al. (2020), $v(\omega)$ is concave. We notice $J(\pi^\star)=v(0)$.



1: Input: Policy $\pi$; reward function or cost function $g(\cdot,\cdot)$; state-action pair $(s,a)$;
2: Initialization: $\hat{Q}(s,a)=0$, $(s_0,a_0)=(s,a)$;
3: Draw an integer $\tau$ from a geometric distribution with parameter $(1-\gamma)$: $\mathbb{P}(\tau=j)=(1-\gamma)\gamma^{j}$ for $j=0,1,2,\cdots$;
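A sketch of this geometric-horizon estimator in code (the tiny MDP, the function name, and the undiscounted-sum variant of the estimator are our own illustrative choices; the paper's algorithm may differ in details): since $\mathbb{P}(\tau\ge t)=\gamma^t$, the undiscounted sum of $g$ over a trajectory of random length $\tau+1$ has expectation $\sum_t\gamma^t\,\mathbb{E}[g(s_t,a_t)]=Q^{\pi}(s,a)$, so the estimator is unbiased even though it only uses a finite-horizon rollout.

```python
import numpy as np

def unbiased_q_estimate(P, g, pi, s, a, gamma, rng):
    """One unbiased estimate of Q^pi(s, a) via a geometric horizon (sketch).

    Draw tau with P(tau = j) = (1 - gamma) * gamma**j, roll out tau + 1
    steps, and return the UNDISCOUNTED sum of g along the trajectory.
    """
    tau = rng.geometric(1.0 - gamma) - 1   # numpy's geometric starts at 1
    q_hat, s_t, a_t = 0.0, s, a
    for _ in range(tau + 1):
        q_hat += g[s_t, a_t]
        s_t = rng.choice(len(P), p=P[s_t, a_t])     # s_{t+1} ~ P(.|s_t, a_t)
        a_t = rng.choice(pi.shape[1], p=pi[s_t])    # a_{t+1} ~ pi(.|s_{t+1})
    return q_hat

# Tiny hypothetical MDP to compare the estimator against the exact Q.
rng = np.random.default_rng(1)
nS, nA, gamma = 2, 2, 0.8
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
g = rng.uniform(0.0, 1.0, size=(nS, nA))
pi = rng.dirichlet(np.ones(nA), size=nS)

# Exact Q via the Bellman linear system, for comparison.
P_pi = np.einsum('sa,sap->sp', pi, P)
r_pi = np.einsum('sa,sa->s', pi, g)
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
Q = g + gamma * np.einsum('sap,p->sa', P, V)

estimates = [unbiased_q_estimate(P, g, pi, 0, 0, gamma, rng) for _ in range(20000)]
```

Averaging many independent estimates should concentrate around the exact $Q^{\pi}(s,a)$, which is the property the model-free analysis relies on.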

Summing the inequality
$$
\|\lambda_{t+1}-\lambda\|^2\le\|\lambda_t-\lambda\|^2-2\eta\,(\lambda_t-\lambda)^\top\big(b-\hat c(\pi_t)\big)+\eta^2\Big(b_{\max}^2+\frac{m|S|}{1-\gamma}\Big)
$$
from $t=0$ to $T-1$, we achieve the following inequality:
$$
0\le\|\lambda_T-\lambda\|^2\le\|\lambda_0-\lambda\|^2-2\eta\sum_{t=0}^{T-1}(\lambda_t-\lambda)^\top\big(b-\hat c(\pi_t)\big)+T\eta^2\Big(b_{\max}^2+\frac{m|S|}{1-\gamma}\Big),
$$
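The per-step inequality being summed here typically follows from non-expansiveness of the projection onto the nonnegative orthant in the dual update $\lambda_{t+1}=[\lambda_t-\eta(b-\hat c(\pi_t))]_+$; a sketch of the single-step bound (standard dual-ascent reasoning, reconstructed rather than quoted from the paper):

```latex
\|\lambda_{t+1}-\lambda\|^2
= \bigl\|\,[\lambda_t - \eta\,(b - \hat c(\pi_t))]_+ - \lambda \bigr\|^2
\le \|\lambda_t - \eta\,(b - \hat c(\pi_t)) - \lambda\|^2
= \|\lambda_t-\lambda\|^2
  - 2\eta\,(\lambda_t-\lambda)^{\top}\bigl(b - \hat c(\pi_t)\bigr)
  + \eta^2 \bigl\|b - \hat c(\pi_t)\bigr\|^2 .
```

The projection step is non-expansive because $\lambda\succeq 0_m$ lies in the convex set being projected onto; the last squared-norm term is then bounded by the variance-style constant appearing in the summed inequality.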

$$
\mathbb{E}_{D_0:D_{T-1}}\Big[J(\pi^\star)-J(\pi_t)+\sum_{i=1}^{m}\hat\lambda_i\big(C_i(\pi_t)-b_i\big)\Big]
\overset{(301)}{=}\mathbb{E}_{D_0:D_{T-1}}\Big[J(\pi^\star)-J(\pi_t)+\beta\,\mathbf{1}_m^\top\big\{c(\pi_t)-b\big\}_+\Big]
$$

$$
\mathbb{E}_{D_0:D_{T-1}}\Big[J(\pi^\star)-J(\pi_t)+\beta\,\mathbf{1}_m^\top\big\{c(\pi_t)-b\big\}_+\Big]\le\delta. \tag{308}
$$

$\mathbb{E}[\cdot]$ is short for the expectation over the randomness of the variables $\tau\sim\mathrm{Geo}(1-\gamma)$, $s_t\sim P(\cdot|s_{t-1},a_{t-1})$, and $a_t\sim\pi_\theta(\cdot|s_t)$, and the partial derivative of the softmax policy satisfies
$$
\frac{\partial\pi_\theta(a_t|s_t)}{\partial\theta(s,a)}=
\begin{cases}
\pi_\theta(a_t|s_t)-\big(\pi_\theta(a_t|s_t)\big)^2, & \text{if } s_t=s \text{ and } a_t=a,\\
-\pi_\theta(a_t|s_t)\,\pi_\theta(a|s_t), & \text{if } s_t=s \text{ and } a_t\neq a,\\
0, & \text{otherwise}.
\end{cases}
$$
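The two cases above are the usual softmax Jacobian $\partial\pi_k/\partial\theta_j=\pi_k(\mathbf{1}[k=j]-\pi_j)$ restricted to one state; a quick finite-difference check on a single state (the logit values are arbitrary illustrative choices):

```python
import numpy as np

# One state with three actions; theta holds the logits for that state.
theta = np.array([0.3, -1.2, 0.7])
pi = np.exp(theta) / np.exp(theta).sum()

# Closed form: d pi_k / d theta_j = pi_k * (1[k == j] - pi_j),
# i.e. pi_k - pi_k**2 on the diagonal and -pi_k * pi_j off it.
jac = np.diag(pi) - np.outer(pi, pi)

# Finite-difference comparison against the closed form.
eps = 1e-6
jac_fd = np.zeros((3, 3))
for j in range(3):
    th = theta.copy()
    th[j] += eps
    pi_shift = np.exp(th) / np.exp(th).sum()
    jac_fd[:, j] = (pi_shift - pi) / eps
```

Each column of the Jacobian sums to zero, reflecting that perturbing one logit cannot change the total probability mass.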

