ACHIEVING SUB-LINEAR REGRET IN INFINITE HORIZON AVERAGE REWARD CONSTRAINED MDP WITH LINEAR FUNCTION APPROXIMATION

Abstract

We study the infinite horizon average reward constrained Markov Decision Process (CMDP). In contrast to existing works, which are model-based and restricted to finite state spaces, we consider the model-free linear CMDP setup. We first propose a computationally inefficient algorithm and show that Õ(√(d³T)) regret and constraint violation can be achieved, where T is the number of interactions and d is the dimension of the feature mapping. We also propose an efficient variant based on a primal-dual adaptation of the LSVI-UCB algorithm and show that Õ((dT)^{3/4}) regret and constraint violation can be achieved. This improves the known regret bound of Õ(T^{5/6}) for finite state-space model-free constrained RL, which was obtained under a stronger assumption than ours. Finally, we develop an efficient policy-based algorithm via a novel adaptation of the MDP-EXP2 algorithm to the primal-dual setup, with Õ(√T) regret and even a zero constraint violation bound, under a stronger set of assumptions.

We consider an infinite horizon constrained MDP, denoted by (S, A, P, r, g), where S is the state space, A is the action space, P is the transition probability measure, and r and g are the reward and utility functions respectively. We assume that S is a measurable space with a possibly infinite number of elements, and A is a finite action set. P(·|x, a) is the transition probability kernel which gives the probability of reaching a state when action a is taken at state x. We also write P as p to simplify notation; p satisfies ∫_X p(dx′|x, a) = 1 (following the integral notation of Hernández-Lerma (2012)).

1. INTRODUCTION

In many standard applications of reinforcement learning (RL) (e.g., autonomous vehicles), the agent needs to satisfy certain constraints (e.g., safety, fairness). These problems can be formulated as a constrained Markov Decision Process (CMDP) in which the agent must ensure that the average utility exceeds a certain threshold (or, for a cost, stays below a threshold). While CMDPs with finite state spaces have been studied, those studies do not extend to large state spaces. RL with value function approximation has demonstrated empirical success for large-scale RL applications using deep neural networks. However, the theoretical understanding of constrained RL with value function approximation is quite limited. Recently, Ghosh et al. (2022) made some progress towards understanding constrained RL for linear MDPs in the episodic setting. In particular, Ghosh et al. (2022) developed a primal-dual adaptation of LSVI-UCB and showed Õ(√(d³T)) regret and violation, where d is the dimension of the feature space and T is the number of interactions. Importantly, the above bounds are independent of the cardinality of the state space. However, the infinite-horizon model fits many real-world applications (e.g., stock-market investment, routing decisions) better than the finite-horizon setting. Compared to the discounted-reward model, maximizing the long-term average reward under a long-term average utility constraint also has the advantage that the transient behavior of the learner does not matter (Wei et al., 2020). Recently, a model-based RL algorithm for infinite-horizon average reward CMDPs has been proposed (Chen et al., 2022). However, it considers the tabular setup. Further, the model-based approach requires large memory to store the model parameters, and it is computationally hard to extend model-based approaches to infinite state spaces such as linear MDPs (Wei et al., 2020).
Model-free RL algorithms are more popular because of their ease of implementation and their computational and storage efficiency, particularly for large state spaces. However, model-free learning in the infinite horizon average reward setup is even more challenging. For example, it is still unknown whether a computationally efficient model-free algorithm can achieve Õ(√T) regret even in the unconstrained tabular setup (Wei et al., 2020) for a weakly communicating MDP. To the best of our knowledge, Wei et al. (2022) is the only paper to study model-free algorithms for CMDPs in the infinite horizon average reward setup. In particular, they consider the finite-state tabular setting, and their regret scales polynomially with the number of states. Thus, the result is not useful for large-scale RL applications where the number of states can even be infinite. To summarize, little is known about the performance of model-free algorithms for CMDPs beyond the tabular setting under the infinite-horizon average reward criterion, even in the case of linear CMDPs. Motivated by this, we ask the following question: Can we achieve provably sample-efficient and model-free exploration for CMDPs beyond tabular settings in the infinite horizon average reward setting?

Contribution. To answer the above question, we consider CMDPs with linear function approximation, where the transition dynamics, the utility function, and the reward function can be represented as linear functions of some known feature mapping. Our main contributions are as follows.

• We propose an algorithm (Algorithm 1) which achieves Õ(√(d³T)) regret and constraint violation bounds with high probability when the optimal policy belongs to a smooth function class (Definition 1). This is the first result showing that Õ(√T) regret and violation are achievable for linear CMDPs in the infinite-horizon regime using model-free RL.
Achieving a uniform concentration bound for the individual value functions turns out to be challenging, and we need to rely on the smoothness of the policy, unlike in the unconstrained case. The algorithm relies on an optimizer that returns the parameters of the state-action bias functions by solving a constrained optimization problem.

• We also propose an efficient variant and show that Õ((dT)^{3/4}) regret and violation bounds can be achieved. This is the first result that provides sub-linear regret and violation guarantees under only Assumption 1 for linear CMDPs using a computationally efficient algorithm. The idea is to consider a finite-horizon episodic setup by dividing the entire horizon T into T/H episodes of H steps each. We then invoke the primal-dual adaptation of the LSVI-UCB algorithm proposed in Ghosh et al. (2022) to learn a good policy for the finite-horizon setting, carefully crafting the constraint for the episodic case. Finally, we bound the gap between the infinite-horizon average and the finite-horizon result to obtain the final bound.

• We also propose an algorithm which can be implemented efficiently and achieves Õ(√T) regret and Õ(√T) constraint violation under a stronger set of assumptions (similar to the ones made in Wei et al. (2021a) for the unconstrained setup). We further show that one can achieve zero constraint violation for large enough (still finite) T while maintaining the same order of regret.

• We attain our bounds without estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. To the best of our knowledge, these sub-linear bounds are the first results for model-free (or model-based) online RL algorithms for infinite-horizon average reward CMDPs with function approximation. Wei et al. (2022) propose a model-free algorithm in the tabular setting which achieves Õ(T^{5/6}) regret.
Since the linear MDP contains the tabular setting, our result improves the existing results. Further, we show that we can achieve zero constraint violation while maintaining the same order of regret under the same set of assumptions as Wei et al. (2022). We relegate Related Work to Appendix A.

2. PROBLEM FORMULATION

The functions r : S × A → [0, 1] and g : S × A → [0, 1] are assumed to be deterministic; our results readily extend to settings where r and g are random. The process starts at state x_1. At each step t ∈ [T], the agent observes state x_t ∈ S, picks an action a_t ∈ A, receives a reward r(x_t, a_t) and a utility g(x_t, a_t), and the MDP evolves to x_{t+1} drawn from P(·|x_t, a_t). In this paper, we consider the challenging scenario where the agent only observes the bandit information r(x_t, a_t) and g(x_t, a_t) at the visited state-action pair (x_t, a_t). The policy space of an agent is ∆(A|S) = {{π_t(·|·)} : π_t(·|x) ∈ ∆(A), ∀x ∈ S, t ∈ [T]}, where ∆(A) is the probability simplex over the action space. For any x_t ∈ S, π_t(a_t|x_t) denotes the probability that action a_t ∈ A is taken at step t when the state is x_t. We denote the space of stationary policies by ∆_s(A|S), for which π(·|x_t) is independent of t. Let J^π_r(x) and J^π_g(x) denote the expected average reward and average utility, respectively, starting from state x when the agent selects actions using the stationary policy π ∈ ∆_s(A|S):

J^π_⋄(x) = lim_{T→∞} (1/T) E_π[ Σ_{i=1}^T ⋄(x_i, a_i) | x_1 = x ], for ⋄ = r, g,

where the expectation E_π is taken with respect to the policy π and the transition probability kernel P. From Puterman (2014), there exists a state-action bias function q^π_⋄ satisfying Bellman's equation for ⋄ = r, g:

q^π_⋄(x, a) + J^π_⋄(x) = ⋄(x, a) + E_{x′∼p(·|x,a)}[v^π_⋄(x′)], where v^π_⋄(x) = Σ_a π(a|x) q^π_⋄(x, a) = ⟨π(·|x), q^π_⋄(x, ·)⟩_A.
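As a concrete check of the average-reward Bellman equation above, the following minimal sketch solves for the gain J^π and bias v^π of a hypothetical two-state, two-action MDP and verifies the identity numerically (all transition, reward, and policy numbers are illustrative):

```python
import numpy as np

# Numerical check of q^pi(x,a) + J^pi = r(x,a) + E_{x'~p(.|x,a)}[v^pi(x')]
# on a hypothetical 2-state / 2-action MDP.

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[x, a, x']: transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.5, 0.2]])    # reward r(x, a)
pi = np.array([[0.6, 0.4], [0.5, 0.5]])   # stationary policy pi(a|x)

# Policy-induced transition matrix and one-step reward.
P_pi = np.einsum('xa,xay->xy', pi, P)
r_pi = np.einsum('xa,xa->x', pi, r)

# Stationary distribution nu: left fixed point of P_pi.
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1))])
nu /= nu.sum()

J = nu @ r_pi                             # average reward J^pi

# Bias function: solve (I - P_pi) v = r_pi - J, pinning v[0] = 0
# (the bias is unique only up to an additive constant).
A = np.eye(2) - P_pi
A[0] = [1.0, 0.0]
b = r_pi - J
b[0] = 0.0
v = np.linalg.solve(A, b)

q = r - J + np.einsum('xay,y->xa', P, v)  # state-action bias q^pi

# Bellman consistency: v(x) = <pi(.|x), q(x,.)>
assert np.allclose(np.einsum('xa,xa->x', pi, q), v)
```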
q^π_⋄ and v^π_⋄ are the analogues of the Q-function and value function in the finite-horizon and discounted infinite-horizon scenarios.

The Problem: We are interested in solving

maximize_π J^π_r(x) subject to J^π_g(x) ≥ b. (2)

Let π* be an optimal stationary solution of the above problem. Note that both J^π_r and J^π_g depend on the initial state x. Also note that our approach extends to multiple constraints and to constraints of the form J^π_g(x) ≤ b. Unlike the episodic setup, where sub-linear regret is always possible, it is known that even in the unconstrained setup a necessary condition for sub-linear regret is that the optimal policy has a long-term average reward and utility that are independent of the initial state (Bartlett and Tewari, 2012). Hence, we assume the following:

Assumption 1. There exist J*_r, J*_g and a stationary policy π* such that they solve (2) and the following holds for all ⋄ = r, g, x ∈ S, and a ∈ A:

J*_⋄ + q*_⋄(x, a) = ⋄(x, a) + E_{x′∼P(·|x,a)}[v*_⋄(x′)], v*_⋄(x) = Σ_a π*(a|x) q*_⋄(x, a). (3)

We call J*_r the optimal gain. We denote the span of v*_⋄ by sp(v*_⋄) = sup_x v*_⋄(x) − inf_x v*_⋄(x) for ⋄ = r, g. For finite state and action spaces (the tabular case), the weakly communicating MDP is the broadest class studied for regret minimization in the literature, and it is known to satisfy Assumption 1. Assumption 1 is an adaptation of Bellman's optimality equation for the unconstrained version (Wei et al., 2021a) to the CMDP. Note that, unlike in the unconstrained case, the optimal policy here might not be greedy with respect to q*_r, as the greedy policy might not be feasible.
Bellman's equation with state-independent average reward and average utility: A stationary policy π satisfies Bellman's equation with state-independent average reward and average utility if there exist measurable functions v^π_⋄ : S → R and q^π_⋄ : S × A → R such that for all x ∈ S, a ∈ A, and ⋄ = r, g,

J^π_⋄ + q^π_⋄(x, a) = ⋄(x, a) + E_{x′∼P(·|x,a)}[v^π_⋄(x′)], v^π_⋄(x) = Σ_a π(a|x) q^π_⋄(x, a). (4)

Wei et al. (2022) assume that every stationary policy (not only the optimal policy) satisfies Bellman's equation with state-independent average reward and average utility. In Section 3 we consider the setup where only Assumption 1 holds, which is weaker than the assumption of Wei et al. (2022). In Section 4 we consider a set of assumptions which entails that every stationary policy satisfies (4).

Learning Metric: The agent is unaware of the underlying environment and seeks to learn the optimal policy. Hence, the agent may take actions according to a non-stationary policy π_t at state x_t. We are interested in minimizing the following metrics:

Regret(T) = Σ_{t=1}^T (J*_r − r(x_t, a_t)), Violation(T) = Tb − Σ_{t=1}^T g(x_t, a_t).

The regret is the difference between the total reward obtained by the best stationary policy and the reward obtained by the learner. The violation measures how far the total utility accumulated by the learner falls short of Tb. Similar metrics are employed in the model-based infinite horizon average reward setup (Chen et al., 2022) and in the finite horizon setup (Ding et al., 2021).

Linear Function Approximation. To handle a possibly large number of states, we consider the following linear MDP structure.

Assumption 2. The CMDP is a linear MDP with feature map ϕ : S × A → R^d if there exist d unknown signed measures μ = {μ_1, . . .
, μ_d} over S such that for any (x, a, x′) ∈ S × A × S,

P(x′|x, a) = ⟨ϕ(x, a), μ(x′)⟩, (6)

and there exist vectors θ_r, θ_g ∈ R^d such that for any (x, a) ∈ S × A,

r(x, a) = ⟨ϕ(x, a), θ_r⟩, g(x, a) = ⟨ϕ(x, a), θ_g⟩.

Assumption 2 adapts the definition of linear MDP in Jin et al. (2020) to the constrained case. By the above definition, the transition model, the reward, and the utility functions are linear in the feature map ϕ. We remark that, despite being linear, P(·|x, a) can still have infinite degrees of freedom since μ(·) is unknown. Note that the tabular MDP is a special case of the linear MDP (Jin et al., 2020). A similar constrained linear MDP model is considered in Ghosh et al. (2022) (for the episodic setup). Note that Ding et al. (2021); Zhou et al. (2021) studied another related concept, known as the linear kernel MDP, in the episodic setup. In a linear kernel MDP, the transition probability is given by P(x′|x, a) = ⟨ψ(x′, x, a), θ⟩. In general, linear MDPs and linear kernel MDPs are two different classes of MDPs (Zhou et al., 2021). Under a linear CMDP, we can represent the q*_⋄ function as linear in the feature space.

Lemma 1 (Wei et al. (2021a)). Under Assumptions 1 and 2, there exists a fixed w*_⋄ ∈ R^d such that q*_⋄(x, a) = ϕ(x, a)^T w*_⋄ for ⋄ = r, g and all (x, a) ∈ S × A, with ||w*_⋄|| ≤ √d(2 + sp(v*_⋄)).
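Since the tabular MDP is a special case of Assumption 2, a small sketch can make the representation (6) concrete: with one-hot features ϕ(x, a) = e_{(x,a)} ∈ R^{|S||A|}, the measures μ(x′) and vectors θ_r, θ_g below realize the linear structure exactly (all transition and reward numbers are illustrative):

```python
import numpy as np

# A tabular CMDP embedded as a linear CMDP with one-hot features.
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[x, a, x']: random rows
r = rng.uniform(0.0, 1.0, size=(S, A))       # r(x, a) in [0, 1]
g = rng.uniform(0.0, 1.0, size=(S, A))       # g(x, a) in [0, 1]

d = S * A
phi = np.eye(d).reshape(S, A, d)             # phi(x, a): one-hot feature
mu = P.reshape(d, S)                         # mu[i, x'] = P(x'|x_i, a_i)
theta_r, theta_g = r.reshape(d), g.reshape(d)

# Verify P(x'|x,a) = <phi(x,a), mu(x')>, r = <phi, theta_r>, g = <phi, theta_g>.
assert np.allclose(np.einsum('xad,dy->xay', phi, mu), P)
assert np.allclose(phi @ theta_r, r)
assert np.allclose(phi @ theta_g, g)
```

Here d = |S||A|, so nothing is gained for the tabular case; the point of the linear CMDP is that d can be far smaller than |S||A|.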

3.1. FIXED-POINT OPTIMIZATION WITH OPTIMISM

We present our first algorithm (Algorithm 1), which achieves Õ(sp(v*_r)√(d³T)) regret and Õ(sp(v*_g)√(d³T)) constraint violation. Note that for the unconstrained tabular version, Ω(√T) is the lower bound for regret (Auer et al., 2008). Hence, our bound is (nearly) optimal in terms of regret. Further, we show that we can attain zero violation while maintaining the same order of regret under an additional assumption (Remark 1). In the following, we describe our algorithm and point out the key modifications we have made over the unconstrained version.

From Lemma 1 and (3), we have

ϕ(x, a)^T w*_⋄ = ⋄(x, a) − J*_⋄ + ∫_X v*_⋄(x′) p(dx′|x, a). (7)

Hence, the natural idea is to build an estimator w_{⋄,t} of w*_⋄ from the data. However, we know neither J*_⋄, p, r, nor g. Further, we also need to find π in order to obtain v*_⋄ (cf. (3)). In the following, we describe how we address these challenges. First, in order to handle the unknown p and ⋄, we use the sample points up to time t to estimate w_{⋄,t}. In particular, if for a moment we assume that J*_⋄ and π are known, then we can fit w_{⋄,t} such that ϕ(x_τ, a_τ)^T w_{⋄,t} ≈ ⋄(x_τ, a_τ) − J*_⋄ + v_{⋄,t}(x_{τ+1}). Solving a regularized least-squares problem with regularization term λ||w_{⋄,t}||² gives the natural estimate

w_{⋄,t} = (Λ_t)^{−1} Σ_{τ=1}^{t−1} ϕ(x_τ, a_τ)(⋄(x_τ, a_τ) − J*_⋄ + v_{⋄,t}(x_{τ+1})),

where Λ_t = λI + Σ_{τ<t} ϕ(x_τ, a_τ)ϕ(x_τ, a_τ)^T is the empirical covariance matrix. Now, we replace v_{⋄,t}(x_{τ+1}) with the expression v_{⋄,t}(x_{τ+1}) = ⟨π(·|x_{τ+1}), q_{⋄,t}(x_{τ+1}, ·)⟩_A, which, together with q_{⋄,t}(·, ·) = ϕ(·, ·)^T w_{⋄,t}, gives rise to a fixed-point equation. In order to incorporate uncertainty, we introduce a slack variable b_{⋄,t} with bounded quadratic form ||b_{⋄,t}||_{Λ_t^{−1}} ≤ β_⋄ (β_⋄ is a parameter, for ⋄ = r, g). It ensures that w*_⋄ is captured by w_{⋄,t} together with b_{⋄,t} with high probability. Now, we discuss how we handle the unknown J*_⋄ and π.
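The regularized least-squares estimate above can be sketched as follows. Note that in Algorithm 1 the gain J and the value targets are decision variables of the optimization problem P1, not known quantities; here they are folded into placeholder regression targets to isolate the ridge-regression step:

```python
import numpy as np

# Sketch of the estimate w_t = Lambda_t^{-1} sum_tau phi_tau * target_tau,
# with Lambda_t = lambda*I + sum_tau phi_tau phi_tau^T (targets are placeholders).

def ridge_estimate(phis, targets, lam=1.0):
    """phis: (t-1, d) feature rows; targets: (t-1,) regression targets."""
    d = phis.shape[1]
    Lambda = lam * np.eye(d) + phis.T @ phis   # empirical covariance matrix
    return np.linalg.solve(Lambda, phis.T @ targets), Lambda

rng = np.random.default_rng(1)
d, t = 4, 500
w_star = rng.normal(size=d) / np.sqrt(d)       # "true" parameter (illustrative)
phis = rng.normal(size=(t, d)) / np.sqrt(d)
targets = phis @ w_star + 0.01 * rng.normal(size=t)   # noisy linear targets

w_hat, Lambda = ridge_estimate(phis, targets)
assert np.linalg.norm(w_hat - w_star) < 0.5    # estimate is close to w_star
```

The slack variable b_{⋄,t} of Algorithm 1 would then account for the gap between w_hat and w_star inside the confidence ellipsoid defined by Λ_t.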
We introduce J_{⋄,t} and π_t as variables and design an optimization problem P1 which maximizes J_{r,t} under the constraint J_{g,t} ≥ b. We also restrict π to a class of soft-max policies Π (Definition 1) and search for π_t over Π.

Definition 1. We define the policy class Π as the collection of soft-max policies Π = {π(·|·) ∝ exp(ϕ(·, ·)^T ζ) : ζ ∈ R^d, ||ζ|| ≤ L}.

We replace J*_⋄ in (7) with J_{⋄,t}. Further, we add the slack variable b_{⋄,t} to w_{⋄,t} in (7) to obtain (8) in Algorithm 1. This resembles "optimism under uncertainty", as the optimal (w*_⋄, b*_⋄, J*_⋄, π*) for ⋄ = r, g are feasible for the optimization problem with high probability (Lemma 2) if π* ∈ Π. Hence, it naturally entails that J_{r,t} ≥ J*_r. Wei et al. (2021a) also proposed an optimization problem with a fixed-point equation (their Algorithm 1) to obtain w*_r in the unconstrained setup. In the following, we point out the main differences from the Algorithm 1 of Wei et al. (2021a) and explain why the policy space Π is necessary. First, we have the additional constraint J_{g,t} ≥ b since we consider a CMDP. Second, and more importantly, in the unconstrained case (Wei et al., 2021a) the policy is greedy with respect to q_r. Since the greedy policy may not be feasible, in the CMDP we need to search over policies. However, we restrict the policy to the function class Π. The main reason for introducing Π is that the uniform concentration bound (a key step in proving both regret and violation for the model-free approach) for the value function class v_⋄ cannot be achieved unless the policy has smoothness properties such as Lipschitz continuity. In particular, we need to show that the log ϵ-covering number for the class of v_⋄ scales at most as O(log(T)). In the unconstrained case, one only needs a cover for v_r, and with a greedy policy an ϵ-cover of the class of q-functions suffices.
In contrast, we need ϵ-covers for both v_r and v_g; hence, such a trick does not apply. In order to obtain the ϵ-covering number for v_⋄, we use the smoothness of the soft-max to show that the log ϵ-covering number for the class of value functions scales at most as O(log(T)) (Lemma 7). In Appendix F.6 we show that our analysis extends to any class of policies satisfying Lemma 8. Also note that we update the parameters only when the determinant of Λ_t doubles compared to Λ_{s_{t−1}}, where s_{t−1} is the time step of the most recent update before time t. This can happen at most O(d log T) times. As in Wei et al. (2021a), our algorithm does not admit an efficient implementation due to the complicated nature of the optimization problem. However, we can show that, with high probability, the optimal solution and parameters are feasible for P1 by a careful choice of the parameter β_⋄ and using the fact that ||w*_⋄|| ≤ √d(2 + sp(v*_⋄)).

Lemma 2. With probability 1 − 2δ, Algorithm 1 ensures that J_{r,t} ≥ J*_r if π* ∈ Π.

Using Lemma 2, we obtain the regret and violation bounds for Algorithm 1 in the following theorem.

Theorem 1. With probability 1 − 5δ, if the optimal policy π* ∈ Π, then

Regret(T) ≤ O(sp(v*_r) log(T/δ) √(d³T)), Violation(T) ≤ O(sp(v*_g) log(T/δ) √(d³T)).

This is the first result which shows that Õ(√T) regret and violation are achievable for infinite horizon average reward linear CMDPs under Assumption 1 using a model-free approach. Recently, Chen et al. (2022) obtained Õ(√T) regret and violation in the model-based tabular setup for weakly communicating MDPs. The proposed algorithm (their Algorithm 4) in Chen et al. (2022) is also computationally inefficient, and it requires knowledge of sp(v*_⋄) for ⋄ = r, g. For unconstrained MDPs, the lower bound on regret is Ω(√T) (proved for the tabular setup, Auer et al. (2008)). Hence, the regret order is optimal with respect to T.
We also show that it is possible to achieve zero violation for large T under an additional set of assumptions (Remark 1).

Algorithm 1 Model-Free Fixed-Point Algorithm for Long-Term Average Reward and Constraint

1: Initialization: β_⋄ = 2d(2 + sp(v*_⋄)) log(32Ld(2 + sp(v*_⋄))T/δ) for ⋄ = r, g.
2: for t = 1, . . . , T do
3:   if t = 1 or det(Λ_t) ≥ 2 det(Λ_{s_{t−1}}) then
4:     Set s_t = t.
5:     Obtain w_{⋄,t} for ⋄ = r, g as the solution of the following optimization problem P1:

max_{w_{r,t}, w_{g,t}, b_{r,t}, b_{g,t}, π_t, J_{r,t}, J_{g,t}} J_{r,t}
s.t. w_{⋄,t} = (Λ_t)^{−1} ( Σ_{τ=1}^{t−1} ϕ(x_τ, a_τ)(⋄(x_τ, a_τ) − J_{⋄,t} + v_{⋄,t}(x_{τ+1})) + b_{⋄,t} ), ⋄ = r, g (8)
J_{g,t} ≥ b, q_{⋄,t}(x, a) = ϕ(x, a)^T w_{⋄,t}, v_{⋄,t}(x) = ⟨π_t(·|x), q_{⋄,t}(x, ·)⟩, π_t ∈ Π,
||b_{⋄,t}||_{Λ_t^{−1}} ≤ β_⋄, ||w_{⋄,t}|| ≤ (2 + sp(v*_⋄))√d

6:   Take action a_t ∼ π_t(·|x_t) and observe x_{t+1}.
7:   Update Λ_{t+1} = Λ_t + ϕ(x_t, a_t)ϕ(x_t, a_t)^T.
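The lazy-update rule in lines 3-4 can be sketched numerically. The random features below are illustrative, and the (expensive) re-solve of P1 is represented only by a counter; the point is that the number of updates grows like O(d log T) rather than T:

```python
import numpy as np

# Determinant-doubling update schedule: recompute the solution of P1 only
# when det(Lambda_t) has doubled since the last update.
rng = np.random.default_rng(3)
d, T, lam = 5, 2000, 1.0
Lambda = lam * np.eye(d)
last_det = np.linalg.det(Lambda)
updates = 0

for t in range(T):
    phi = rng.normal(size=d) / np.sqrt(d)   # feature of (x_t, a_t), illustrative
    Lambda += np.outer(phi, phi)
    det = np.linalg.det(Lambda)
    if det >= 2.0 * last_det:               # doubling trigger: re-solve P1 here
        last_det = det
        updates += 1

assert updates <= 10 * d * np.log(T)        # far fewer than T updates
```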

3.2. ALGORITHM BASED ON EPISODIC PRIMAL-DUAL ALGORITHM

We now present another optimism-based algorithm which can be implemented efficiently, albeit with sub-optimal regret and violation guarantees, under Slater's condition (Assumption 3).

Assumption 3. There exist π, v^π, q^π, J^π_r, and J^π_g such that (4) is satisfied and J^π_g ≥ b + γ with γ > 0.

The above assumption states that there exists a stationary policy that is strictly feasible. Strict feasibility is a common assumption both in the infinite horizon average reward setting (Wei et al., 2022; Chen et al., 2022) and in the finite horizon episodic CMDP setting (Efroni et al., 2020; Ding et al., 2021). We only need to know γ rather than the strictly feasible policy itself. In Algorithm 1, the inefficiency arises because we do not know how to efficiently solve a fixed-point equation and search over policies to solve the optimization problem. We eliminate these inefficiencies by considering a finite-horizon problem. In particular, we divide the entire horizon T into T/H episodes of H rounds each and employ the finite-horizon primal-dual optimistic LSVI-based algorithm proposed in Ghosh et al. (2022). However, since the original problem is still infinite horizon, we need to bound the gap due to the finite-horizon truncation, which we describe next.

3.2.1. OUR APPROACH

We replace the time index t with the combination of an episode index k and a step index h: t = (k − 1)H + h. The k-th episode starts at time (k − 1)H + 1. The agent chooses the policy π^k_h for step h ∈ [H] of episode k ∈ [K]. Note that in the finite-horizon setting, the policy need not be stationary, even though the optimal policy for the infinite horizon problem is stationary. As mentioned, the idea is to obtain a policy with a good performance guarantee in the episodic setup and then bound the gap between the infinite horizon and episodic setups. We now introduce notation specific to the episodic setting:

V^π_{⋄,h}(x) = E_π[ Σ_{i=h}^H ⋄(x_i, a_i) | x_h = x ], Q^π_{⋄,h}(x, a) = E_π[ Σ_{i=h}^H ⋄(x_i, a_i) | x_h = x, a_h = a ].

Since J^π_g ≥ b, the natural choice of constraint in the episodic CMDP would be V^π_{g,1} ≥ Hb. However, in order to compare the best policy in the episodic setup with the optimal policy of the infinite horizon problem, we need to ensure that the optimal stationary policy for the infinite horizon problem is feasible in the episodic setup. If instead we require V^π_{g,1} ≥ Hb − sp(v*_g), we can conclude that the optimal stationary policy π* is also feasible for the episodic setup using the following lemma:

Lemma 3. If J^π_⋄(s) = J^π_⋄ for ⋄ = r, g, any state s, and a stationary policy π, then |V^π_{⋄,h} − (H − h + 1)J^π_⋄| ≤ sp(v^π_⋄).

Further, in order to apply the primal-dual algorithm, we also need a strictly feasible policy for the episodic setup. Again using Lemma 3 and Assumption 3, we conclude that there exists a strictly feasible policy for the episodic setup if V^π_{g,1} ≥ Hb + γ − sp(v^π_g). Thus, we consider the constraint V^π_{g,1} ≥ Hb − κ, where κ = max{sp(v*_g), sp(v^π_g) − γ}.

Lagrangian: We consider the composite value function V^{π,Y}_h(x) = V^π_{r,h}(x) + Y V^π_{g,h}(x). The following is proved in Ding et al. (2021).

Lemma 4 (Boundedness of Y*).
The optimal dual variable satisfies Y* ≤ (V^{π*}_{r,1}(x_1) − V^π_{r,1}(x_1))/(Hγ) ≤ H/(Hγ) = 1/γ, where π* is the optimal solution of the episodic CMDP and π is the strictly feasible policy.

Definition 2. We set ξ = 2/γ; ξ is used to truncate the dual variable.
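Lemma 3 above can be checked numerically on a small instance. The sketch below computes finite-horizon values by backward induction on a hypothetical two-state MDP (all numbers illustrative) and verifies the span bound at every step h:

```python
import numpy as np

# Check |V^pi_h(x) - (H-h+1) J^pi| <= sp(v^pi) when J^pi is state-independent.
P_pi = np.array([[0.62, 0.38], [0.4, 0.6]])   # policy-induced transitions
r_pi = np.array([0.6, 0.35])                  # policy-induced rewards

# Stationary distribution, gain J, and bias v (with v[0] pinned to 0).
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1))])
nu /= nu.sum()
J = nu @ r_pi
A = np.eye(2) - P_pi
A[0] = [1.0, 0.0]
b = r_pi - J
b[0] = 0.0
v = np.linalg.solve(A, b)
span = v.max() - v.min()                      # sp(v^pi)

# Finite-horizon values by backward induction: V_h = r_pi + P_pi V_{h+1}.
H = 50
V_h = np.zeros(2)
for h in range(H, 0, -1):
    V_h = r_pi + P_pi @ V_h
    gap = np.abs(V_h - (H - h + 1) * J)
    assert np.all(gap <= span + 1e-10)        # Lemma 3 bound holds at every h
```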

3.2.2. ALGORITHM

We now describe our proposed Algorithm 2 (Appendix D). First, note that the finite-horizon state-action value function in a linear MDP can be represented in the linear form

Q^π_{⋄,h}(x, a) = ϕ(x, a)^T w^π_{⋄,h} = ⋄(x, a) + ∫_X V^π_{⋄,h+1}(x′) p(dx′|x, a).

Hence, the natural estimates of the parameters w^k_{r,h}, w^k_{g,h} are obtained by solving the regularized least-squares problem

w^k_{⋄,h} ← argmin_{w∈R^d} Σ_{τ=1}^{k−1} Σ_{h′} [⋄(x^τ_{h′}, a^τ_{h′}) + V^k_{⋄,h+1}(x^τ_{h′+1}) − w^T ϕ(x^τ_{h′}, a^τ_{h′})]² + λ||w||²_2,

Q^k_{⋄,h}(x, a) = ⟨w^k_{⋄,h}, ϕ(x, a)⟩ + β||ϕ(x, a)||_{Λ_k^{−1}},

where β||ϕ(x, a)||_{Λ_k^{−1}} is the bonus term. Since we consider a finite horizon, we can set V^k_{⋄,h} recursively starting from V^k_{⋄,H+1} = 0. Thus, we do not need to solve a fixed-point equation, unlike in Algorithm 1. Another difference from Algorithm 1 is that we introduce a pointwise bonus term for optimism. Finally, we use a primal-dual adaptation, unlike in Algorithm 1. The policy is based on the soft-max: at step h, π^k_h(a|x) is computed as the soft-max over the composite Q-function vector {Q^k_{r,h}(x, a) + Y_k Q^k_{g,h}(x, a)}_{a∈A}, where Y_k is the Lagrange multiplier. The computation of the policy is also efficient, unlike in Algorithm 1, as we do not need to search over the policy space. Note that with the greedy policy with respect to the composite Q-function vector, one fails to show that the log ϵ-covering number of the individual value function class is at most O(log(T)) (Ghosh et al., 2022). Also note that Ghosh et al. (2022) maintain a different covariance matrix Λ^k_h at each step h, whereas we maintain a single Λ^k for all h since the transition probability and reward are the same across steps. The last part (Step 15) of Algorithm 2 is a dual update: the dual variable increases if the estimated utility value function does not satisfy the constraint.
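Two ingredients of Algorithm 2 described above can be sketched in isolation: the optimistic Q-estimate with the bonus β||ϕ(x, a)||_{Λ_k^{−1}}, and the truncated dual update of Step 15. All inputs below (β, η, ξ, the covariance, the value estimates) are illustrative placeholders, not the paper's tuned constants:

```python
import numpy as np

def optimistic_q(phi_xa, w, Lambda_inv, beta):
    """Q(x,a) = <w, phi(x,a)> + beta * sqrt(phi^T Lambda^{-1} phi)."""
    bonus = beta * np.sqrt(phi_xa @ Lambda_inv @ phi_xa)
    return phi_xa @ w + bonus

def dual_update(Y, eta, constraint_level, V_g_est, xi):
    """Increase Y when the estimated utility value function misses the
    (relaxed) constraint; truncate the dual variable to [0, xi]."""
    return float(np.clip(Y + eta * (constraint_level - V_g_est), 0.0, xi))

rng = np.random.default_rng(4)
d = 4
w = rng.normal(size=d)
phi_xa = rng.normal(size=d)
Lambda_inv = np.linalg.inv(2.0 * np.eye(d))   # toy covariance: Lambda_k = 2I

q_opt = optimistic_q(phi_xa, w, Lambda_inv, beta=1.0)
assert q_opt >= phi_xa @ w            # the bonus only adds optimism

Y = dual_update(Y=0.5, eta=0.1, constraint_level=3.0, V_g_est=2.0, xi=2.0)
assert 0.0 <= Y <= 2.0 and Y > 0.5    # constraint missed, so Y increases
```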

3.2.3. MAIN RESULTS

Theorem 2. Under Assumptions 1, 2, and 3, with probability 1 − 5δ, the regret and constraint violation satisfy

Regret(T) ≤ O((1 + sp(v*_r))(dT)^{3/4} ι), Violation(T) ≤ (2(1 + ξ)/ξ) Õ((1 + κ)(dT)^{3/4} ι),

where ι = log(log(|A|)ξ2dT/δ). Unlike in Theorem 1, here the optimal policy does not need to belong to Π. Wei et al. (2022) proved a regret bound of Õ(T^{5/6}) for a model-free RL algorithm in the tabular regime under a much stronger assumption than ours. Since the linear CMDP contains the tabular setup, our approach improves the upper bound for the tabular case as well. We show that it is possible to achieve zero violation while maintaining the same order of regret under an additional assumption (Remark 1). Note that the regret is also Õ((dT)^{3/4}) for the unconstrained linear MDP (Wei et al., 2021a) for the efficiently implementable variant of the algorithm.

Published as a conference paper at ICLR 2023

Analysis:

The overall regret can be decomposed as

Regret(T) = Σ_{k=1}^{T/H} (HJ*_r − V^{π*}_{r,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π_k}_{r,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} r(x_t, a_t)).

The third term on the right-hand side can be bounded using the Azuma-Hoeffding inequality. Hence, we focus on the first two terms. The second term is exactly the regret of the episodic scenario, with one subtle difference: we need to show that the optimal policy for the original infinite horizon problem is also feasible in the episodic problem (see Appendix G.1), which we do using the fact that we have relaxed the constraint by κ. We show that this term is upper bounded by Õ(√(d³TH²)). Note that compared to Ghosh et al. (2022) we gain a √H improvement, as in our case the transition probability is independent of the step h, unlike in Ghosh et al. (2022). One may then wonder why not make H very small. However, since our original problem concerns the average reward over an infinite horizon, we need to argue that the best finite-horizon policy also performs well under the infinite-horizon criterion. The first term characterizes this gap: using Lemma 3, the sub-optimality gap with respect to the best finite-horizon policy is bounded by T·sp(v*_r)/H. Hence, if we set H too small, the gap increases. Setting H = T^{1/4}/d^{3/4} balances the upper bounds on the first two terms of the right-hand side. As T increases, H also increases.
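The horizon choice can be verified by a quick numeric sanity check: with H = T^{1/4}/d^{3/4}, the truncation gap ~ T/H and the episodic regret ~ √(d³TH²) = H√(d³T) coincide (constants and sp(v*) suppressed), matching the (dT)^{3/4} rate of Theorem 2:

```python
import numpy as np

d, T = 5, 10**8
H = T**0.25 / d**0.75

gap_term = T / H                         # first term: finite-horizon truncation gap
episodic_term = H * np.sqrt(d**3 * T)    # second term: episodic regret bound

# Both terms scale as d^{3/4} T^{3/4}.
target = d**0.75 * T**0.75
assert np.isclose(gap_term, target) and np.isclose(episodic_term, target)
```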

Constraint Violation:

The violation can be decomposed as

Violation(T) = Σ_{k=1}^{T/H} (Hb − κ − V^{π_k}_{g,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π_k}_{g,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} g(x_t, a_t)) + (T/H)κ.

We upper bound the first term by Õ(√(d³H²T)). The second term can be bounded by the Azuma-Hoeffding inequality. The third term is the error incurred by introducing κ. We obtain the final constraint violation bound by substituting H = T^{1/4}/d^{3/4}.

4. POLICY-BASED ALGORITHM

Even though Algorithm 2 is computationally efficient, one still needs to compute V^k_{⋄,h+1} for every encountered state and to evaluate (Λ^k)^{−1}. We now show that it is possible to achieve Õ(√T) regret and constraint violation under a different and relaxed set of assumptions using a computationally less intensive algorithm. We propose a policy-based algorithm towards this end. First, we state the assumptions; then we describe the algorithm and the results.

Assumption 4. There exists a constant t_mix ≥ 1 such that for any policy π and any distributions ν_1, ν_2 ∈ ∆_X over the state space,

||P_π ν_1 − P_π ν_2||_{TV} ≤ e^{−1/t_mix} ||ν_1 − ν_2||_{TV},

where (P_π ν)(x′) = ∫_X Σ_a π(a|x) P(x′|x, a) dν(x) and TV denotes the total variation distance.

Ergodic MDPs satisfy the above assumption (Hao et al., 2021). With the above assumption, we can conclude the following, as shown in the unconstrained setup (Wei et al., 2021a).

Lemma 5. Under Assumption 4, any stationary policy π satisfies (4).

Hence, under Assumption 4, the gain is constant, and every stationary policy indeed satisfies (4).

Assumption 5. Let λ_min(·) be the minimum eigenvalue of a matrix. There exists σ > 0 such that for any π,

λ_min( ∫_X Σ_a π(a|x) ϕ(x, a)ϕ(x, a)^T dν_π(x) ) ≥ σ.

The assumption intuitively guarantees that every policy is explorative in the feature space. Similar assumptions are also made in the unconstrained case (Wei et al., 2021a; Abbasi-Yadkori et al., 2019). As in Wei et al. (2021a), we can relax this assumption to the setting where the above holds for one known policy instead of all policies.

Lagrangian: J^π_r + Y J^π_g is the Lagrangian for the stationary policy π, where Y is the dual variable. Under Assumption 3, the optimal dual variable satisfies Y* ≤ (J^{π*}_r(x_1) − J^π_r(x_1))/γ ≤ 1/γ ≤ ξ/2.
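The mixing condition of Assumption 4 can be illustrated on a finite ergodic chain. For a two-state policy-induced transition matrix (illustrative numbers), one step contracts the total variation distance by a factor e^{−1/t_mix} for some t_mix ≥ 1:

```python
import numpy as np

P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])              # policy-induced transition matrix

def tv(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(p - q).sum()

nu1 = np.array([1.0, 0.0])                 # two point-mass initial distributions
nu2 = np.array([0.0, 1.0])

# One-step contraction factor of P_pi on this pair.
ratio = tv(nu1 @ P_pi, nu2 @ P_pi) / tv(nu1, nu2)
assert ratio < 1.0                         # the chain mixes

t_mix = -1.0 / np.log(ratio)               # smallest t_mix with ratio <= e^{-1/t_mix}
assert t_mix >= 1.0                        # consistent with Assumption 4
```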

4.1. ALGORITHM

We consider a primal-dual adaptation of MDP-EXP2 in Algorithm 3 (Appendix E). The algorithm proceeds in epochs, where each epoch consists of B = Õ(d·t_mix/σ) steps. Within each epoch, we collect B/(2N) trajectories, where N = Õ(t_mix). The policy update for the k-th epoch is equivalent to online mirror descent (OMD), i.e.,

π_k(·|·) = argmax_π ⟨π, q^{k−1}_r(·, ·) + Y_{k−1} q^{k−1}_g(·, ·)⟩ − (1/α) D(π || π_{k−1}),

where q^{k−1}_⋄(x, a) = (w^{k−1}_⋄)^T ϕ(x, a). We consider a composite q-function, which is the sum of q_r and q_g (scaled by the dual variable). We estimate w^k_⋄ from the sample average reward R_{k,m} and sample average utility G_{k,m}, utilizing Lemma 5 and the fact that q^π_⋄ is linear in the feature space. Since we consider a primal-dual adaptation of the MDP-EXP2 algorithm of Wei et al. (2021a), we maintain two different types of sample averages for the utility: G_{k,m} is used to estimate q^{π_k}_g + N J^{π_k}_g, while Ĵ_k is used to estimate J^{π_k}_g. The dual update is based on the value of Ĵ_k. Naturally, the analysis also differs significantly (see Appendix H.1).
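With D the KL divergence, the OMD update above has the closed multiplicative-weights form π_k(a|x) ∝ π_{k−1}(a|x) exp(α(q_r(x, a) + Y q_g(x, a))). The sketch below applies one such update at a fixed state; the q-values, step size, and dual variable are illustrative (in Algorithm 3 the q's come from the linear estimates (w^{k−1}_⋄)^T ϕ(x, a)):

```python
import numpy as np

def omd_update(pi_prev, q_r, q_g, Y, alpha):
    """One mirror-descent step on the composite q-function at a fixed state."""
    weights = pi_prev * np.exp(alpha * (q_r + Y * q_g))
    return weights / weights.sum()

pi0 = np.array([1/3, 1/3, 1/3])     # previous policy at this state
q_r = np.array([1.0, 0.0, 0.5])
q_g = np.array([0.0, 1.0, 0.5])

pi1 = omd_update(pi0, q_r, q_g, Y=0.0, alpha=1.0)
assert abs(pi1.sum() - 1.0) < 1e-12
assert pi1[0] == pi1.max()          # with Y = 0, mass shifts toward high q_r

pi2 = omd_update(pi0, q_r, q_g, Y=5.0, alpha=1.0)
assert pi2[1] == pi2.max()          # a large dual variable favors high q_g
```

This is the standard EXP2-style update; its cost per state is O(|A|), which is what makes the algorithm computationally light compared to Algorithm 2.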

4.2. MAIN RESULTS

Theorem 3. Under Assumptions 2, 3, 4, and 5, Algorithm 3 ensures that with probability 1-4δ,

Regret(T) ≤ (1+ξ) Õ((1/σ) √(t³_mix T)),  Violation(T) ≤ (2(1+ξ)/ξ) Õ((1/σ) √(t³_mix T)).

Wei et al. (2021a) show that Õ(√T) regret is achieved in expectation under Assumptions 4 and 5 in the unconstrained setup. We show that such a regret can be achieved even with high probability, along with Õ(√T) violation. Assumption 2 can be relaxed to scenarios where q^π_⋄ is linear even though the underlying MDP is not linear, similar to the unconstrained setup (Assumption 4 in Wei et al. (2021a)). Hence, Õ(√T) regret and violation can be achieved for any linear q^π even if the underlying MDP is not linear. In the model-based tabular CMDP, Chen et al. (2022) show that the regret also scales as Õ(√T) in the ergodic MDP; the regret there also scales with t_mix and ξ (the order of t_mix is worse there). Note that the dependence on d is implicit, as 1/σ = Ω(d). Algorithm 3 needs to know t_mix and σ. As in the unconstrained setup (Wei et al., 2021a), we can modify our algorithm to handle unknown t_mix and σ with a slightly worse regret.

Remark 1. We can reduce the violation to 0 while sacrificing the regret only slightly (the order is the same) under the assumptions of Section 4 (Appendix H.5). The idea is the following: we consider a tightened optimization problem in which we add ϵ to the right-hand side of the constraint in (2). We then bound the difference between the optimal gains of the tightened and the original problems as a function of ϵ. The regret and violation bounds for the tightened problem follow from our analysis, since it is a CMDP with b + ϵ in place of b. Hence, by choosing ϵ carefully, we can show that the tightened problem is strictly feasible and achieves zero violation for large enough T (Lemma 36).
For the assumptions in Sections 3.1 and 3.2, we need an additional assumption (namely, that any stationary policy satisfies Bellman's equation (cf. (4))) to obtain zero constraint violation (Appendix H.5). The idea is similar to the one described in the preceding paragraph; the additional assumption lets us bound the difference between the tightened and the original problems.
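The arithmetic behind Remark 1 can be made concrete. In the sketch below, C is an assumed constant in front of the Õ(√T) violation bound; running the algorithm on the tightened CMDP (constraint level b + ϵ) earns ϵ of slack on each of the T steps:

```python
import math

# Assumed constant of the Violation(T) <= C*sqrt(T) bound (illustrative only).
C = 10.0

def violation_bound(T, eps):
    """Violation bound for the original constraint after running the
    algorithm on the eps-tightened problem: C*sqrt(T) - eps*T."""
    return C * math.sqrt(T) - eps * T

# Choosing eps = Theta(1/sqrt(T)) keeps the regret order unchanged while
# driving the violation bound to zero for large enough T.
T = 10**6
eps = C / math.sqrt(T)
print(violation_bound(T, eps) <= 0)
```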

Organization of Appendix

We provide related work in Appendix A. In Appendix B, we provide some motivating examples. We summarize our results in Appendix C under various sets of Assumptions. We describe Algorithm 2 in Appendix D. We formally describe Algorithm 3 in Appendix E. We prove the results of Section 3.1 in Appendix F. We prove the results of Section 3.2 in Appendix G. We prove the results of Section 4 in Appendix H. We show how Algorithm 3 can achieve zero constraint violation while maintaining the same order on the regret bound (with respect to T ) in Appendix H.5. In Appendix H.5 we also show how Algorithms 1 and 2 achieve zero violation while maintaining the same order on regret (with respect to T ) with an additional assumption. We show numerical experiments in Appendix I.

A RELATED WORK

Model-free RL algorithms have been proposed (Xu et al., 2021; Ding et al., 2020; Bai et al., 2021) to solve CMDP in the discounted infinite horizon setting. All of these works consider an easier setting compared to standard RL in that they assume access to a simulator (Koenig and Simmons, 1993) (a.k.a. a generative model (Azar et al., 2012) ), which is a strong oracle that allows the agent to query arbitrary state-action pairs and return the reward and the next state, hence greatly alleviating the intrinsic difficulty of exploration in RL. Wei et al. (2021b) proposed a 'triple-Q' algorithm that does not require a 'simulator'. However, it only considered the tabular setting and the episodic set-up. To develop online sample-efficient algorithms for CMDPs, prior works have largely resorted to the finite horizon setup (Efroni et al., 2020; Brantley et al., 2020; Kalagarla et al., 2020; Liu et al., 2021) . Model-based RL in the episodic linear kernel CMDP is considered in Ding et al. (2021) . Model-based infinite horizon average reward CMDP for tabular setup is also considered in Zheng and Ratliff (2020) ; Singh et al. (2020) .

B MOTIVATING EXAMPLES

First, we provide some examples of constrained MDPs.

• Consider an intelligent agent taking actions to optimize the power consumption of a household. It seeks to minimize the overall cost while maintaining a minimum level of satisfaction (e.g., maintaining a certain temperature, keeping the electric vehicles charged). Here the reward can be cast as the negative of the cost of power consumption, while the utility can be modeled as the satisfaction the user gets.

• As another example, consider a sensor network where sensor nodes sample and send information to a server (fusion center) for processing. These sensor nodes must also satisfy an energy constraint, as they have limited battery capacity. Such a decision process can be modeled as a CMDP where (i) the reward depends on the nature of the information and on whether the information is successfully received, and (ii) the cost corresponds to sampling and transmitting information.

We now provide some examples of CMDPs that might run for a long time, for which the average reward CMDP is the ideal modeling choice.

• Consider the sensor network example above. The sensor node takes decisions continually, and thus an infinite horizon average reward CMDP is a natural choice.

• As another example, consider a server scheduling jobs to different machines. The scheduler seeks to minimize the job completion time while maintaining a uniform queue length (i.e., the number of jobs waiting to be processed) across the machines, which can only process jobs sequentially. The server takes decisions continually; thus, it maximizes the average reward while keeping the average queue length below a certain threshold. This can be cast as an average reward CMDP.

• Another example is a controller in an autonomous system that takes decisions in order to maximize the average reward (objective) while trying to maintain system stability. We can model the system as an average reward CMDP where there is a utility of 1 whenever the system remains in the safe region. The goal of the controller is to maximize the average reward while the system stays in one of the safe states for at least a 1-ϵ fraction of the time for a desired ϵ > 0, i.e., while the average utility is at least 1-ϵ.

Finally, we argue why we choose the function approximation setup. In many of the examples above, the state space is continuous or at least very large. For example, the battery state of a sensor is continuous; similarly, the queue length can be very large. Function approximation is generally used to approximate the Q-function or the policy in such large state spaces. We consider linear function approximation in our setup as a first step towards handling large state spaces.

C SUMMARY OF OUR RESULTS IN A TABLE

We summarize our results in Table 1.

Table 1: Regret and constraint violations on linear MDPs for our proposed algorithms.

ALGORITHM      REGRET           VIOLATIONS       ASSUMPTIONS
ALGORITHM 1    Õ(√(d³T))        Õ(√(d³T))        Assumption 2 + Assumption 1
ALGORITHM 2    Õ((dT)^{3/4})    Õ((dT)^{3/4})    Assumption 2 + Assumptions 1 and 3
ALGORITHM 3    Õ(√T)            Õ(√T)            Assumptions 3, 4, and 5

D ALGORITHM 2

1: Initialization: Y^1 = 0, w_{⋄,h} = 0, H = T^{1/4} d^{3/4}, K = T/H, α = √(log(|A|)/(KH²)) / (2(1+ξ+H)), η = ξ/√(KH²), β = C₁ dH √(log(2 log(|A|) ξ dT/δ)).
2: for episodes k = 1, ..., K do
3:   Receive the initial state x^k_1.
4:   Λ^k ← Σ_{τ=1}^{k-1} Σ_h ϕ(x^τ_h, a^τ_h) ϕ(x^τ_h, a^τ_h)^T + λI
5:   for step h = H, H-1, ..., 1 do
6:     w^k_{r,h} ← (Λ^k)^{-1} [Σ_{τ=1}^{k-1} Σ_{h'=1}^H ϕ(x^τ_{h'}, a^τ_{h'}) [r(x^τ_{h'}, a^τ_{h'}) + V^k_{r,h+1}(x^τ_{h'+1})]]
7:     w^k_{g,h} ← (Λ^k)^{-1} [Σ_{τ=1}^{k-1} Σ_{h'=1}^H ϕ(x^τ_{h'}, a^τ_{h'}) [g(x^τ_{h'}, a^τ_{h'}) + V^k_{g,h+1}(x^τ_{h'+1})]]
8:     Q^k_{r,h}(·,·) ← min{⟨w^k_{r,h}, ϕ(·,·)⟩ + β (ϕ(·,·)^T (Λ^k)^{-1} ϕ(·,·))^{1/2}, H}
9:     Q^k_{g,h}(·,·) ← min{⟨w^k_{g,h}, ϕ(·,·)⟩ + β (ϕ(·,·)^T (Λ^k)^{-1} ϕ(·,·))^{1/2}, H}
10:    π_{h,k}(a|·) = exp(α(Q^k_{r,h}(·,a) + Y^k Q^k_{g,h}(·,a))) / Σ_{a'} exp(α(Q^k_{r,h}(·,a') + Y^k Q^k_{g,h}(·,a')))
11:    V^k_{r,h}(·) = Σ_a π_{h,k}(a|·) Q^k_{r,h}(·,a),  V^k_{g,h}(·) = Σ_a π_{h,k}(a|·) Q^k_{g,h}(·,a)
12:   for step h = 1, ..., H do
13:     Compute Q^k_{r,h}(x^k_h, a), Q^k_{g,h}(x^k_h, a), π_{h,k}(a|x^k_h) for all a.
14:     Take action a^k_h ∼ π_{h,k}(·|x^k_h) and observe x^k_{h+1}.
15:   Y^{k+1} = max{min{Y^k + η(b - V^k_{g,1}(x^k_1)), ξ}, 0}

E ALGORITHM 3

Here, we describe Algorithm 3.

1: Initialization: Y_1 = 0, N = 8 t_mix log(T), w_{⋄,1} = 0, B = 32N log(dT)/σ, α = min{1/((1+ξ)√(T t_mix)), σ/(24N(1+ξ))}, η = ξ/√(T/B).
2: for epochs k = 1, ..., T/B do
3:   Define the policy π_k(a|x) ∝ exp(α ϕ(x,a)^T Σ_{j=1}^{k-1} (w_{r,j} + Y_j w_{g,j})), and initialize the count J̃_k = 0.
4:   Execute π_k in the entire epoch.
5:   for t = (k-1)B + 1, ..., kB do
6:     Take action a_t ∼ π_k(·|x_t); observe r(x_t, a_t), g(x_t, a_t), and x_{t+1}.
7:     if t ≥ (k-1)B + N + 1 then
8:       Accumulate the utility: J̃_k = J̃_k + g(x_t, a_t).
9:   for m = 1, ..., B/2N do
10:    Define τ_{k,m} = (k-1)B + 2N(m-1) + N + 1 as the start of the m-th trajectory.
11:    Compute R_{k,m} = Σ_{t=τ_{k,m}}^{τ_{k,m}+N-1} r(x_t, a_t) and G_{k,m} = Σ_{t=τ_{k,m}}^{τ_{k,m}+N-1} g(x_t, a_t).
12:  Compute M_k = Σ_{m=1}^{B/2N} Σ_a π_k(a|x_{τ_{k,m}}) ϕ(x_{τ_{k,m}}, a) ϕ(x_{τ_{k,m}}, a)^T
13:  if λ_min(M_k) ≥ Bσ/(24N) then
14:    w_{r,k} = M_k^{-1} Σ_{m=1}^{B/2N} ϕ(x_{τ_{k,m}}, a_{τ_{k,m}}) R_{k,m}
15:    w_{g,k} = M_k^{-1} Σ_{m=1}^{B/2N} ϕ(x_{τ_{k,m}}, a_{τ_{k,m}}) G_{k,m}
16:  else
17:    w_{r,k} = w_{g,k} = 0
18:  Ĵ_k = J̃_k/(B - N)
19:  Y_{k+1} = max{min{Y_k + η(b - Ĵ_k), ξ}, 0}

F PROOF OF THE RESULT OF SECTION 3.1

Notations: Without loss of generality, we assume that the first element of ϕ(·,·) is 1; hence ϕ(·,·)^T e_1 = 1. Also without loss of generality, sup_x |v*_⋄(x)| ≤ (1/2) sp(v*_⋄), ||ϕ(x,a)||_2 ≤ √2 for all (x,a) ∈ S × A, ||µ(S)||_2 ≤ √d, and ||θ_⋄||_2 ≤ √d for ⋄ = r, g.

F.1 PROOF-SKETCH OF THEOREM 1

Note from (3) that, for ⋄ = r, g,

Σ_t (J*_⋄ - ⋄(x_t, a_t)) = Σ_t [E_{x'∼p(·|x_t,a_t)}[v*_⋄(x')] - q*_⋄(x_t, a_t)].

We then show the following.

Lemma 6. With probability 1-δ,

Σ_t (q_{s_t,⋄}(x_t, a_t) - q*_⋄(x_t, a_t)) ≤ Σ_{t=1}^T (J*_⋄ - J_{s_t,⋄}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,⋄}(x') - v*_⋄(x')] + O(β_⋄ √(dT log T)),   (15)

where s_t is the last time the optimization problem is solved before time t, and (J_{s_t,⋄}, b_{s_t,⋄}, w_{s_t,⋄}, π_{s_t}) are the solutions, which also specify q_{s_t,⋄} and v_{s_t,⋄}.

The key step in proving the above lemma is to show that

|| Σ_{τ=1}^{s_t-1} ϕ(x_τ, a_τ) (v_{s_t,⋄}(x_{τ+1}) - E_{x'∼p(·|x_τ,a_τ)}[v_{s_t,⋄}(x')]) ||_{Λ^{-1}_{s_t}}   (16)

is upper bounded by O(log T) with high probability. To this end, value-aware uniform concentration is required to handle the dependence between v_{s_t,⋄} and the samples {x_τ}_{τ=1}^{s_t-1}, which renders the standard self-normalized inequality infeasible in the model-free setting.
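The objects these concentration arguments revolve around are regularized least-squares estimates of the form w = Λ^{-1} Σ_τ ϕ_τ · target_τ, with Λ = λI + Σ_τ ϕ_τ ϕ_τ^T, together with the ellipsoidal bonus β √(ϕ^T Λ^{-1} ϕ). A minimal self-contained sketch (hypothetical 2-dimensional features and targets, not the paper's construction):

```python
import random

random.seed(1)
lam, d = 1.0, 2   # ridge parameter and feature dimension (illustrative)

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

def inv2(M):
    """Inverse of a 2x2 matrix (enough for this sketch)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

Lambda = [[lam, 0.0], [0.0, lam]]   # Lambda = lam * I initially
b_vec = [0.0, 0.0]                  # accumulates phi_t * target_t
bonuses = []
phi_query = [0.5, 0.5]              # fixed query point for the bonus
for t in range(200):
    phi_t = [random.uniform(-1, 1) for _ in range(d)]
    target = random.uniform(0, 1)   # stand-in for r + v(x')
    for i in range(d):
        for j in range(d):
            Lambda[i][j] += phi_t[i] * phi_t[j]
        b_vec[i] += phi_t[i] * target
    Li = inv2(Lambda)
    w = mat_vec(Li, b_vec)          # regularized least-squares estimate
    Lq = mat_vec(Li, phi_query)
    quad = sum(phi_query[i] * Lq[i] for i in range(d))
    bonuses.append(quad ** 0.5)     # sqrt(phi^T Lambda^{-1} phi)

# Lambda only grows in the PSD order, so the bonus at a fixed query point decays.
print(bonuses[-1] < bonuses[0])
```

The decaying bonus is exactly the mechanism by which the elliptical-potential argument (sum of ||ϕ_t||_{Λ_t^{-1}} terms) yields the O(√(dT log T)) contribution in Lemma 6.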
The general idea is to fix a function class V_⋄ in advance and then show that each possible value function v_{s_t,⋄} produced by our algorithm lies in this class, which has polynomial log ϵ-covering number. We first show that the class of functions q_{s_t,⋄} has log ϵ-covering number at most O(d log(1 + sp(v*_⋄)/ϵ)) (Lemma 7). Then, using the smoothness of the soft-max, we show that the policy class has log ϵ-covering number at most d log(1 + 16L/ϵ). Combining the above, we compute the ϵ-covering number of the class V_⋄ and show that (16) is upper bounded by O(log T) with high probability. The detailed proof is in Lemma 7.

We divide the proof of Theorem 1 into two parts: first we bound the regret, and subsequently we bound the violation.

Proof of Regret: From Lemma 2, with probability 1-2δ, (w*_r, w*_g, J*_g, J*_r, π*) is a feasible solution of the optimization problem. Since we maximize J_r, we have J_{s_t,r} ≥ J*_r with probability 1-2δ. Further, from Lemma 6, with probability 1-δ,

Σ_t (q_{s_t,r}(x_t,a_t) - q*_r(x_t,a_t)) ≤ Σ_t (J*_r - J_{s_t,r}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,r}(x') - v*_r(x')] + O(β_r √(dT log T)).

Hence, by the union bound, with probability 1-3δ,

Σ_t (q_{s_t,r}(x_t,a_t) - q*_r(x_t,a_t)) ≤ Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,r}(x') - v*_r(x')] + O(β_r √(dT log T)).   (17)

By rearranging,

Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v*_r(x')] - q*_r(x_t,a_t))
≤ Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v_{s_t,r}(x')] - q_{s_t,r}(x_t,a_t)) + O(β_r √(dT log T))
= Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v_{s_t,r}(x')] - v_{s_t,r}(x_t)) + Σ_{t=1}^T (v_{s_t,r}(x_t) - q_{s_t,r}(x_t,a_t)) + O(β_r √(dT log T)).

We define F_{t,2} as the σ-algebra generated by [(x_τ, a_τ)]_{τ=1}^t ∪ x_{t+1}. Then E[v_{s_t,⋄}(x_t) - q_{s_t,⋄}(x_t,a_t) | F_{t-1,2}] = 0. Thus, v_{s_t,⋄}(x_t) - q_{s_t,⋄}(x_t,a_t) is a martingale difference.
Since ||w_{s_t,r}|| ≤ √d (2 + sp(v*_r)) (from the optimization problem in Algorithm 1), the sum Σ_t (v_{s_t,r}(x_t) - q_{s_t,r}(x_t,a_t)) is upper bounded by O(β_r √(T log(T/δ))) with probability 1-δ. Thus, by the union bound, with probability 1-4δ, we have

Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v*_r(x')] - q*_r(x_t,a_t)) ≤ Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v_{s_t,r}(x')] - v_{s_t,r}(x_t)) + O(β_r √(dT log(T/δ))).

Notice that every time the algorithm updates (i.e., s_t ≠ s_{t-1}), it holds that det(Λ_t) = det(Λ_{s_t}) ≥ 2 det(Λ_{s_{t-1}}). Since det(Λ_{T+1})/det(Λ_1) ≤ (1 + T/λ)^d, this cannot happen more than d log₂(1+T) = O(d log T) times. Thus, with probability 1-4δ,

Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v*_r(x')] - q*_r(x_t,a_t)) ≤ Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v_{s_{t+1},r}(x')] - v_{s_{t+1},r}(x_{t+1})) + O(β_r √(dT log(T/δ)) + β_r d log T).   (19)

The first term on the right-hand side of (19) can be bounded by O(sp(v*_r) √(T log(T/δ))) with probability 1-δ via the Azuma-Hoeffding inequality. Hence, with probability 1-5δ,

Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v*_r(x')] - q*_r(x_t,a_t)) ≤ O(β_r √(dT log(T/δ)) + β_r d log T).

The result then follows from the definition of β_r and (14).

Proof of Constraint Violation: From Lemma 6, with probability 1-δ,

Σ_t (q_{s_t,g}(x_t,a_t) - q*_g(x_t,a_t)) ≤ O(β_g √(dT log T)) + Σ_{t=1}^T (J*_g - J_{s_t,g}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,g}(x') - v*_g(x')].

Similar to (19), we obtain, with probability 1-3δ,

Σ_{t=1}^T (E_{x'∼p(·|x_t,a_t)}[v*_g(x')] - q*_g(x_t,a_t)) ≤ O(β_g √(dT log(T/δ)) + β_g d log T) + Σ_{t=1}^T (J*_g - J_{s_t,g}).   (20)

Hence, rearranging (20) and using (14), we obtain

Σ_{t=1}^T (J*_g - g(x_t,a_t)) - Σ_{t=1}^T (J*_g - J_{s_t,g}) ≤ O(β_g √(dT log(T/δ)) + β_g d log T),

i.e., Σ_{t=1}^T (J_{s_t,g} - g(x_t,a_t)) ≤ O(β_g √(dT log(T/δ)) + β_g d log T).   (21)

Note that we cannot conclude that J_{s_t,g} ≥ J*_g, unlike for the reward term. Nevertheless, since J_{s_t,g} is feasible for the optimization problem P1 in Algorithm 1, we have J_{s_t,g} ≥ b.
Hence, from (21), with probability 1-3δ,

Σ_{t=1}^T (b - g(x_t,a_t)) ≤ Σ_{t=1}^T (J_{s_t,g} - g(x_t,a_t)) ≤ O(β_g √(dT log(T/δ)) + β_g d log(T/δ)).

Hence, the result follows.

F.2 PROOF OF LEMMA 2

For ⋄ = r, g,

w*_⋄ = (Λ_t)^{-1} Σ_{τ=1}^{t-1} ϕ(x_τ,a_τ) ϕ(x_τ,a_τ)^T w*_⋄ + λ(Λ_t)^{-1} w*_⋄
     = (Λ_t)^{-1} Σ_{τ=1}^{t-1} ϕ(x_τ,a_τ) (⋄(x_τ,a_τ) - J*_⋄ + E_{x'∼p(·|x_τ,a_τ)}[v*_⋄(x')]) + λ(Λ_t)^{-1} w*_⋄
     = (Λ_t)^{-1} Σ_{τ=1}^{t-1} ϕ(x_τ,a_τ) (⋄(x_τ,a_τ) - J*_⋄ + v*_⋄(x_{τ+1})) + λ(Λ_t)^{-1} w*_⋄ + ϵ*_{t,⋄},

where

ϵ*_{t,⋄} = (Λ_t)^{-1} Σ_{τ=1}^{t-1} ϕ(x_τ,a_τ) (E_{x'∼p(·|x_τ,a_τ)}[v*_⋄(x')] - v*_⋄(x_{τ+1})).   (23)

Let us denote ε*_{τ,⋄} = E_{x'∼p(·|x_τ,a_τ)}[v*_⋄(x')] - v*_⋄(x_{τ+1}), which is a martingale difference. Then, from the self-normalized concentration result (Theorem 4), with probability 1-2δ,

||ϵ*_{t,⋄}||_{Λ_t} ≤ sp(v*_⋄) √( log( det(Λ_t)^{1/2} det(Λ_0)^{-1/2} / δ ) ).   (24)

Now, det(Λ_t) ≤ (λ+t)^d and det(Λ_0) ≥ λ^d. Thus,

||ϵ*_{t,⋄}||_{Λ_t} ≤ sp(v*_⋄) √( d log((1+T)/δ) ).

On the other hand, note that λ||(Λ_t)^{-1} w*_⋄||_{Λ_t} ≤ β_⋄/2. Hence, by selecting b_{⋄,t} = λ(Λ_t)^{-1} w*_⋄ + ϵ*_{t,⋄} (which is upper bounded by β_⋄) and using J*_g ≥ b, we have the result.

F.3 PROOF OF LEMMA 6

Proof. For notational simplicity, we denote s_t = s. From the definition of w_{s,⋄},

w_{s,⋄} - w*_⋄ = Λ_s^{-1} Σ_{τ=1}^{s-1} ϕ(x_τ,a_τ)(⋄(x_τ,a_τ) - J_{s,⋄} + v_{s,⋄}(x_{τ+1})) + b_{s,⋄}
  - Λ_s^{-1} Σ_{τ=1}^{s-1} ϕ(x_τ,a_τ)(⋄(x_τ,a_τ) - J*_⋄ + E_{x'∼p(·|x_τ,a_τ)}[v*_⋄(x')]) - λΛ_s^{-1} w*_⋄
= Λ_s^{-1} Σ_{τ=1}^{s-1} ϕ(x_τ,a_τ)(J*_⋄ - J_{s,⋄} + E_{x'∼p(·|x_τ,a_τ)}[v_{s,⋄}(x') - v*_⋄(x')]) + ϵ_{s,⋄} + b_{s,⋄} - λΛ_s^{-1} w*_⋄,

where ϵ_{s,⋄} is defined in (23). Hence,

w_{s,⋄} - w*_⋄ = Λ_s^{-1} Σ_{τ=1}^{s-1} ϕ(x_τ,a_τ)ϕ(x_τ,a_τ)^T ( (J*_⋄ - J_{s,⋄}) e_1 + ∫_X (v_{s,⋄}(x') - v*_⋄(x')) dµ(x') ) + ϵ_{s,⋄} + b_{s,⋄} - λΛ_s^{-1} w*_⋄
= (J*_⋄ - J_{s,⋄}) e_1 + ∫_X (v_{s,⋄}(x') - v*_⋄(x')) dµ(x') + ϵ_{s,⋄} + b_{s,⋄}
  - λΛ_s^{-1} ( (J*_⋄ - J_{s,⋄}) e_1 + ∫_X (v_{s,⋄}(x') - v*_⋄(x')) dµ(x') ) - λΛ_s^{-1} w*_⋄.   (26)

Therefore,

q_{s,⋄}(x_t,a_t) - q*_⋄(x_t,a_t) = ϕ(x_t,a_t)^T (w_{s,⋄} - w*_⋄)
≤ (J*_⋄ - J_{s,⋄}) + E_{x'∼p(·|x_t,a_t)}[v_{s,⋄}(x') - v*_⋄(x')] + ϕ(x_t,a_t)^T (ϵ_{s,⋄} + b_{s,⋄} + λΛ_s^{-1} u_{s,⋄}),   (27)

where

u_{s,⋄} = -( (J*_⋄ - J_{s,⋄}) e_1 + ∫_X (v_{s,⋄}(x') - v*_⋄(x')) dµ(x') ) - w*_⋄.   (28)

We now bound the third term on the right-hand side, which will prove the lemma. From Lemma 21, with probability 1-δ,

||ϵ_{s,⋄}||_{Λ_s} = || Σ_{τ=1}^{s-1} ϕ_τ ε_{⋄,τ} ||_{Λ_s^{-1}} ≤ 4(2 + sp(v*_⋄)) √d √( (d/2) log((s+λ)/λ) + log(N_{ϵ,⋄}/δ) ) + √(4 s² ϵ²/λ).   (29)

We now compute N_{ϵ,⋄}, the ϵ-covering number of the value function class of v_{s,⋄}.

Lemma 7. The ϵ-covering number of the class V_⋄ = {v_⋄ | v_⋄(x) = ⟨π(·|x), q_⋄(x,·)⟩_A, π ∈ Π, q_⋄(·,·) = w_⋄^T ϕ(·,·), ||w_⋄|| ≤ (2 + sp(v*_⋄))√d} is upper bounded as

log N_{ϵ,⋄} ≤ d log(1 + 2(2 + sp(v*_⋄))√(2d)/ϵ') + d log(1 + 16L/ϵ'),   (30)

where ϵ' = ϵ/(1 + (2 + sp(v*_⋄))√(2d)).

Proof. The proof is in Appendix F.4.

Hence, from (29), putting ϵ = 1/T, we obtain

||ϵ_{s,⋄}||_{Λ_s} ≤ O(β_⋄)   (31)

with probability at least 1-δ. Now,

||λΛ_s^{-1} u_{s,⋄}||_{Λ_s} = λ||u_{s,⋄}||_{Λ_s^{-1}} ≤ √λ ||u_{s,⋄}|| ≤ O(1 + (2 + sp(v*_⋄))√d) = O(β_⋄).   (32)

Further, ||b_{s,⋄}||_{Λ_s} ≤ β_⋄.
Hence,

ϕ(x_t,a_t)^T (ϵ_{s,⋄} + b_{s,⋄} + λΛ_s^{-1} u_{s,⋄}) ≤ ||ϕ(x_t,a_t)||_{Λ_s^{-1}} ||ϵ_{s,⋄} + b_{s,⋄} + λΛ_s^{-1} u_{s,⋄}||_{Λ_s} ≤ 2||ϕ(x_t,a_t)||_{Λ_t^{-1}} ||ϵ_{s,⋄} + b_{s,⋄} + λΛ_s^{-1} u_{s,⋄}||_{Λ_s}.

Thus,

Σ_{t=1}^T (q_{s_t,⋄}(x_t,a_t) - q*_⋄(x_t,a_t))
≤ Σ_{t=1}^T (J*_⋄ - J_{s_t,⋄}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,⋄}(x') - v*_⋄(x')] + O(β_⋄ Σ_{t=1}^T ||ϕ(x_t,a_t)||_{Λ_t^{-1}})
≤ Σ_{t=1}^T (J*_⋄ - J_{s_t,⋄}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,⋄}(x') - v*_⋄(x')] + O( β_⋄ √T √( Σ_{t=1}^T ||ϕ(x_t,a_t)||²_{Λ_t^{-1}} ) )
≤ Σ_{t=1}^T (J*_⋄ - J_{s_t,⋄}) + Σ_{t=1}^T E_{x'∼p(·|x_t,a_t)}[v_{s_t,⋄}(x') - v*_⋄(x')] + O(β_⋄ √(dT log T)),

where the second inequality follows from the Cauchy-Schwarz inequality, and the last inequality follows from Σ_{t=1}^T ||ϕ(x_t,a_t)||²_{Λ_t^{-1}} ≤ 2 log(det(Λ_{T+1})/det(Λ_1)) and log(det(Λ_{T+1})) ≤ d log(λ+T).

F.4 PROOF OF LEMMA 7

Proof. ||w_{s,⋄}|| ≤ (2 + sp(v*_⋄))√d. By Lemma 22, the ϵ-covering number of the set of w_{s,⋄} is

(1 + 2(2 + sp(v*_⋄))√d/ϵ)^d.

Since q_{s,⋄}(·,·) = ϕ(·,·)^T w_{s,⋄} and ||ϕ|| ≤ √2, the ϵ-covering number of the class of state-action bias functions q_{s,⋄}, with respect to the ∞-norm, is

(1 + 2(2 + sp(v*_⋄))√(2d)/ϵ)^d.   (34)

For the policy-parameter class, note that ||ζ|| ≤ L; hence, the ϵ-covering number of the parameter class is (1 + 2L/ϵ)^d by Lemma 22. We also use the following result (Lemma 8), which shows that the 1-norm of the difference between two soft-max policies is small when their parameters are close. Then, for any x,

ṽ_{s,⋄}(x) - v_{s,⋄}(x) = ⟨π̃, q̃_{s,⋄}⟩ - ⟨π, q_{s,⋄}⟩ = ⟨π̃ - π, q_{s,⋄}⟩ + ⟨π̃, q̃_{s,⋄} - q_{s,⋄}⟩ ≤ ϵ(2 + sp(v*_⋄))√(2d) + ϵ.   (37)

Thus, taking ϵ' = ϵ/(1 + (2 + sp(v*_⋄))√(2d)), we have

log N_{ϵ,⋄} ≤ d log(1 + 2(2 + sp(v*_⋄))√(2d)/ϵ') + d log(1 + 16L/ϵ').

F.5 PROOF OF LEMMA 8

Proof.
Dividing π(a|x) by π̃(a|x) yields

π(a|x)/π̃(a|x) = (e^{⟨ϕ(x,a),ζ⟩}/e^{⟨ϕ(x,a),ζ̃⟩}) × (Σ_{a'} e^{⟨ϕ(x,a'),ζ̃⟩} / Σ_{a'} e^{⟨ϕ(x,a'),ζ⟩})
= e^{⟨ϕ(x,a),ζ-ζ̃⟩} × Σ_{a'} e^{⟨ϕ(x,a'),ζ̃-ζ⟩} e^{⟨ϕ(x,a'),ζ⟩} / Σ_{ā} e^{⟨ϕ(x,ā),ζ⟩}
= e^{⟨ϕ(x,a),ζ-ζ̃⟩} Σ_{a'} π(a'|x) e^{⟨ϕ(x,a'),ζ̃-ζ⟩}.

Since ||ϕ|| ≤ 1 and the exponential is strictly increasing,

π(a|x)/π̃(a|x) ≤ e^{||ζ-ζ̃||} Σ_{a'} π(a'|x) e^{||ζ-ζ̃||} ≤ e^{2||ζ-ζ̃||} ≤ 1 + 4||ζ - ζ̃||,

where the first inequality uses Σ_{a'} π(a'|x) = 1, and the last inequality uses e^x ≤ 1 + 2x for x ∈ (0, 0.5). Hence,

π(a|x) - π̃(a|x) ≤ 4π̃(a|x) ||ζ - ζ̃||.

Applying the same argument with the roles of ζ and ζ̃ reversed, we conclude

π̃(a|x) - π(a|x) ≤ 4π(a|x) ||ζ - ζ̃||.

Obviously, the soft-max policy is one such policy which satisfies the condition of Definition 3.
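The soft-max Lipschitz bound of Lemma 8 can be checked numerically. The features and parameters below are hypothetical, chosen with ||ϕ(x,a)|| ≤ 1 and a small parameter gap so that the condition 2||ζ - ζ̃|| < 0.5 holds:

```python
import math
import random

random.seed(0)
d, n_actions = 3, 4

def soft_max(zeta, feats):
    """Soft-max policy pi(a) proportional to exp(<phi(x,a), zeta>)."""
    scores = [sum(f * z for f, z in zip(feats[a], zeta))
              for a in range(n_actions)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical unit-norm features (so ||phi|| <= 1, as in the lemma).
feats = []
for a in range(n_actions):
    v = [random.uniform(-1, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    feats.append([x / norm for x in v])

zeta = [0.3, -0.2, 0.5]
delta = [0.01, -0.02, 0.015]
zeta2 = [z + dz for z, dz in zip(zeta, delta)]
gap = math.sqrt(sum(dz * dz for dz in delta))   # ||zeta - zeta2||

pi1, pi2 = soft_max(zeta, feats), soft_max(zeta2, feats)
# Lemma 8's conclusion: |pi(a|x) - pi~(a|x)| <= 4 * pi~(a|x) * ||zeta - zeta~||.
ok = all(abs(p1 - p2) <= 4 * p2 * gap for p1, p2 in zip(pi1, pi2))
print(ok)
```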

G PROOF OF THE RESULTS OF SECTION 3.2

Organization: In Appendix G.1, we provide the proof sketch, and we subsequently prove the various parts in Appendices G.2 to G.9. We formally prove Theorem 2 in Appendix G.10.

Notations for this section: Throughout this section, we denote by Q^k_{r,h}, Q^k_{g,h}, w^k_{r,h}, w^k_{g,h}, Λ^k the Q-values and the parameter values estimated at episode k. V^k_{j,h}(·) = ⟨π_{h,k}(·|·), Q^k_{j,h}(·,·)⟩_A, where π_{h,k}(·|x) is the soft-max policy based on the composite Q-function Q^k_{r,h} + Y^k Q^k_{g,h} at the k-th episode. To simplify the presentation, we denote ϕ^k_h = ϕ(x^k_h, a^k_h). Without loss of generality, we assume ||ϕ(x,a)||_2 ≤ 1 for all (x,a) ∈ S × A, ||µ(S)||_2 ≤ √d, and ||θ_j||_2 ≤ √d for j = r, g.

G.1 PROOF SKETCH OF THEOREM 2

We reiterate the decomposition of regret.

Regret(T) = Σ_{k=1}^{T/H} (H J*_r - V^{π*}_{r,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π_k}_{r,1}(x^k_1) - Σ_{t=(k-1)H+1}^{kH} r(x_t, a_t)).   (44)

The first term is bounded by (T/H) sp(v*_r) via Lemma 3, which we prove in Appendix G.3. The third term is bounded using the Azuma-Hoeffding inequality; its upper bound is O(√(TH ι)) with probability 1-δ. We bound the second term in Appendices G.4-G.8. In the following, we give a rough idea of how to bound the second term.

To bound the second term, we follow Ghosh et al. (2022), which bounds the regret in the episodic setup. However, there are subtle differences. Ghosh et al. (2022) bound the regret with respect to the optimal solution of the episodic problem, i.e., they bound

Σ_{k=1}^K (V^{π̄*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)),

where π̄* is the optimal solution of the episodic problem (cf. (46)). Instead, we are interested in bounding

Σ_k (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)).

Note that the optimal solution π̄* of the episodic problem can be different from π*, the optimal solution of the infinite horizon average reward CMDP. Since we relax the constraint by subtracting κ from Hb, Lemma 3 guarantees that π* is also feasible for the episodic problem for any initial state x^k_1. Hence, V^{π̄*}_{r,1}(x^k_1) - V^{π*}_{r,1}(x^k_1) ≥ 0, and thus

Σ_k (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)) ≤ Σ_k (V^{π̄*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)).   (45)

Now, we invoke the analysis of Ghosh et al. (2022) to bound the right-hand side of (45). Since we relax the constraint, there is an additional κ term in the violation of each episode; summed over all episodes, this term grows as (T/H)κ. Since H = O(T^{1/4}), we can bound this additional term by O(T^{3/4}).

We now bound the right-hand side of (45). For completeness, we first state the episodic CMDP on which the algorithm is learning:

maximize_π V^π_{r,1}(x^k_1) subject to V^π_{g,1}(x^k_1) ≥ Hb - κ.   (46)

Similar to Ghosh et al. (2022), we first establish the following decomposition, which upper bounds the sum of the regret and the violation (Lemma 9).

Lemma 9. For any Y ∈ [0, ξ], we have

Σ_{k=1}^K (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)) + Y Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1))
≤ Y²/(2η) + ηH²K/2
+ Σ_{k=1}^K [ V^{π*}_{r,1}(x^k_1) + Y^k V^{π*}_{g,1}(x^k_1) - V^k_{r,1}(x^k_1) - Y^k V^k_{g,1}(x^k_1) ]   (= T1)
+ Σ_{k=1}^K [ V^k_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1) ] + Y Σ_{k=1}^K [ V^k_{g,1}(x^k_1) - V^{π_k}_{g,1}(x^k_1) ].   (= T2)   (47)

Proof. See Appendix G.4.

This will serve as the basis when applying optimization tools to bound the violation as well. Note that by taking Y = 0, we can bound the regret. T1 is similar to the optimism term in the unconstrained case, with the difference that we now have two value functions weighted by the dual variable Y^k. Similarly, T2 is similar to the prediction error term, with an additional weight Y. Since the first term in the above inequality can easily be bounded with a proper choice of η, we are left to bound T1 and T2. To bound them, we use the following result.

Lemma 10. There exists a constant C₂ such that, for any fixed δ ∈ (0,1), if we let E be the event that

|| Σ_{τ=1}^{k-1} Σ_{h'=1}^H ϕ^τ_{h'} [V^k_{j,h+1}(x^τ_{h'+1}) - PV^k_{j,h+1}(x^τ_{h'}, a^τ_{h'})] ||_{(Λ^k)^{-1}} ≤ C₂ dH √χ   (48)

for all j ∈ {r, g} and all (k, h) ∈ [K] × [H], where χ = log[2(C₁+1) ξ log(|A|) dT/δ], then Pr(E) ≥ 1 - 2δ.

Proof. Please see the proof of Lemma 8 in Ghosh et al. (2022).

The above lemma entails that, for all (k, h) ∈ [K] × [H], with high probability,

|| Σ_{τ=1}^{k-1} Σ_h ϕ(x^τ_h, a^τ_h) [V^k_{⋄,h+1}(x^τ_{h+1}) - PV^k_{⋄,h+1}(x^τ_h, a^τ_h)] ||_{(Λ^k)^{-1}}

is upper bounded by a lower-order term (e.g., O(d√(log T))).

Similar to Algorithm 1, the general idea is to fix a function class V_{⋄,h} in advance and then show that each possible value function V^k_{⋄,h} in our algorithm lies in this class, whose log-covering number scales with log T. Similar to Ghosh et al. (2022), we introduce the soft-max policy and define the corresponding function classes to show that the value function class has log ϵ-covering number scaling with log T. We first define the following class of Q-functions for ⋄ = r, g:

Q_⋄ = {Q_⋄ | Q_⋄(·,·) = min{⟨w_⋄, ϕ(·,·)⟩ + β √(ϕ(·,·)^T (Λ_h)^{-1} ϕ(·,·)), H}}.

Then, we define the value function class

V_⋄ = {V_⋄ | V_⋄(·) = Σ_a π(a|·) Q_⋄(·,a); Q_⋄ ∈ Q_⋄, π ∈ Π_e},

where Π_e is given by the following class:

Π_e = {π | π(a|·) = SOFT-MAX^α_a (Q_r(·,·) + Y Q_g(·,·)); ∀a ∈ A, Q_r ∈ Q_r, Q_g ∈ Q_g, Y ∈ [0, ξ]}.

As described in Ghosh et al. (2022), a greedy algorithm with respect to the composite Q-function fails to show that the log ϵ-covering number of the individual V_⋄ scales at most as O(log T), since the greedy policy is not Lipschitz. Using the above, we show the following.

Lemma 11. There exists an absolute constant C₁ such that for β = C₁ dH √ι, ι = log(2 log(|A|) ξ dT/δ), and for any fixed policy π, on the event E defined in Lemma 10, we have for ⋄ = r, g

⟨ϕ(x,a), w^k_{⋄,h}⟩ - Q^π_{⋄,h}(x,a) = P(V^k_{⋄,h+1} - V^π_{⋄,h+1})(x,a) + Δ^k(x,a)

for some Δ^k(x,a) satisfying |Δ^k(x,a)| ≤ β √(ϕ(x,a)^T (Λ^k)^{-1} ϕ(x,a)).

Proof. See Appendix G.5.

The above lemma bounds the difference between the value function maintained by Algorithm 2 (without the bonus term) and the value function of any policy, for both the reward and the utility value functions. We bound this difference by the expected difference at the next step plus an error term, and the result shows that this error term can be upper bounded by the bonus term with high probability. Using the above, we can easily show the following.

Lemma 12. With probability 1-2δ (on the event E),

Q^π_{r,h}(x,a) + Y^k Q^π_{g,h}(x,a) ≤ Q^k_{r,h}(x,a) + Y^k Q^k_{g,h}(x,a) + P(V^{π,Y^k}_{h+1} - V^k_{h+1})(x,a).

Proof. See Appendix G.6.

Using the above and the property of the soft-max (Lemma 19), we obtain

Lemma 13. With probability 1-2δ, T1 ≤ H log(|A|)/α.

Proof. See Appendix G.7.

Further, from Lemma 12, we obtain the following.

Lemma 14. On the event E defined in Lemma 10, we have

V^k_{⋄,1}(x^k_1) - V^{π_k}_{⋄,1}(x^k_1) ≤ Σ_{h=1}^H (D^k_{⋄,h,1} + D^k_{⋄,h,2}) + Σ_{h=1}^H 2β √(ϕ(x^k_h, a^k_h)^T (Λ^k)^{-1} ϕ(x^k_h, a^k_h)),

where

D^k_{⋄,h,1} = ⟨Q^k_{⋄,h}(x^k_h, ·) - Q^{π_k}_{⋄,h}(x^k_h, ·), π_{h,k}(·|x^k_h)⟩ - (Q^k_{⋄,h}(x^k_h, a^k_h) - Q^{π_k}_{⋄,h}(x^k_h, a^k_h)),
D^k_{⋄,h,2} = P(V^k_{⋄,h+1} - V^{π_k}_{⋄,h+1})(x^k_h, a^k_h) - (V^k_{⋄,h+1} - V^{π_k}_{⋄,h+1})(x^k_{h+1}).

Proof. See Appendix G.8.

Using the above result, the Azuma-Hoeffding inequality, and plugging in the value of β, we obtain the following.

Lemma 15. With probability 1-4δ, T2 ≤ (Y+1) O(√(d³ T H² ι²)).

Proof. See Appendix G.9.

Combining all the pieces: In (47), we set Y = 0 and η = ξ/√(KH²). Then, by combining (45) and Lemmas 13 and 15, we bound the second term in (44) with probability 1-4δ. The final result follows from the union bound (Appendix G.10).

Constraint Violation:

The violation term can be decomposed as

Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1)) + Σ_{k=1}^K (V^{π_k}_{g,1}(x^k_1) - Σ_{t=(k-1)H+1}^{kH} g(x_t, a_t)) + Kκ,   (53)

where K = T/H. Note that the third term equals (T/H)κ. The second term can be bounded by Õ(√(TH ι)) with probability 1-δ using the Azuma-Hoeffding inequality. We now describe how to bound the first term. From Lemma 9,

Σ_{k=1}^K (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)) + Y Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1)) ≤ Y²/(2η) + ηH²K/2 + T1 + T2,   (54)

with T1 and T2 as in (47). Now, using η = ξ/√(KH²), Y ≤ ξ, and Lemmas 13 and 15, we obtain with probability 1-4δ

Σ_{k=1}^K (V^{π*}_{r,1}(x^k_1) - V^{π_k}_{r,1}(x^k_1)) + Y Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1)) ≤ O(√(d³ ξ² T ι²)).   (55)

Now, from the result of convex optimization (Corollary 1), we obtain that with probability 1-3δ,

Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1)) ≤ (2(1+ξ)/ξ) O(√(d³ T ι²)).

Thus, we obtain the bound for the first term. The final result follows from the bounds on the individual terms in (53).

G.2 PRELIMINARY RESULTS

Lemma 16. Under Assumption 2, for any fixed policy π, let w^π_{j,h} be the corresponding weights such that Q^π_{j,h}(x,a) = ⟨ϕ(x,a), w^π_{j,h}⟩ for j ∈ {r, g}. Then, for all h ∈ [H],

||w^π_{j,h}|| ≤ 2H√d.   (56)

Proof. From the linearity of the action-value function, we have

Q^π_{j,h}(x,a) = j(x,a) + PV^π_{j,h+1}(x,a) = ⟨ϕ(x,a), θ_j⟩ + ∫_S V^π_{j,h+1}(x') ⟨ϕ(x,a), dµ(x')⟩ = ⟨ϕ(x,a), w^π_{j,h}⟩,

where w^π_{j,h} = θ_j + ∫_S V^π_{j,h+1}(x') dµ(x'). Now, ||θ_j|| ≤ √d and ||∫_S V^π_{j,h+1}(x') dµ(x')|| ≤ H√d. Thus, the result follows.

Lemma 17. For any (k, h), the weight w^k_{j,h} satisfies

||w^k_{j,h}|| ≤ 2H √(dkH/λ).   (58)

Proof. For any vector v ∈ R^d, we have

|v^T w^k_{j,h}| = | v^T (Λ^k)^{-1} Σ_{τ=1}^{k-1} Σ_{h'=1}^H ϕ(x^τ_{h'}, a^τ_{h'}) ( j(x^τ_{h'}, a^τ_{h'}) + Σ_a π_{h+1,k}(a|x^τ_{h'+1}) Q^k_{j,h+1}(x^τ_{h'+1}, a) ) |,   (59)

where π_{h,k}(·|x) is the soft-max policy. Note that Q^k_{j,h+1}(x,a) ≤ H for any (x,a). Hence, from (59),

|v^T w^k_{j,h}| ≤ Σ_{τ=1}^{k-1} Σ_{h'} |v^T (Λ^k)^{-1} ϕ^τ_{h'}| · 2H
≤ √( Σ_{τ=1}^{k-1} Σ_{h'} v^T (Λ^k)^{-1} v ) · √( Σ_{τ=1}^{k-1} Σ_{h'} (ϕ^τ_{h'})^T (Λ^k)^{-1} ϕ^τ_{h'} ) · 2H
≤ 2H ||v|| √(dkH)/√λ,   (60)

where we use Observation 1 to bound the second factor. Since ||w^k_{j,h}|| = max_{v: ||v||=1} |v^T w^k_{j,h}|, the result follows.

G.3 PROOF OF LEMMA 3

Proof. We prove the result for ⋄ = r; the proof for ⋄ = g is similar. For any x and h ∈ [H],

V^π_{r,h}(x) - (H - h + 1) J^π_r = E[ Σ_{h'=h}^H (r(x_{h'}, a_{h'}) - J^π_r) | π, x_h = x ]
= E[ Σ_{h'=h}^H (q^π_r(x_{h'}, a_{h'}) - Pv^π_r(x_{h'}, a_{h'})) | π, x_h = x ]
= E[ Σ_{h'=h}^H (v^π_r(x_{h'}) - v^π_r(x_{h'+1})) | π, x_h = x ]
= v^π_r(x) - E[v^π_r(x_{H+1})].

Hence, |V^π_{r,h}(x) - (H - h + 1) J^π_r| ≤ sp(v^π_r), since |v^π_r| ≤ (1/2) sp(v^π_r).

G.4 PROOF OF LEMMA 9

We first state and prove the following result, which is similar to one proved in Ding et al. (2021).

Lemma 18. For any Y ∈ [0, ξ],

Σ_{k=1}^K (Y - Y^k)(Hb - κ - V^k_{g,1}(x^k_1)) ≤ Y²/(2η) + ηH²K/2.   (61)

Proof.

|Y^{k+1} - Y|² = |Proj_{[0,ξ]}(Y^k + η(Hb - κ - V^k_{g,1}(x^k_1))) - Proj_{[0,ξ]}(Y)|²
≤ (Y^k + η(Hb - κ - V^k_{g,1}(x^k_1)) - Y)²
≤ (Y^k - Y)² + η²H² + 2η(Y^k - Y)(Hb - κ - V^k_{g,1}(x^k_1)).   (62)

Summing over k, we obtain

0 ≤ |Y^{K+1} - Y|² ≤ |Y^1 - Y|² + 2η Σ_{k=1}^K (Y^k - Y)(Hb - κ - V^k_{g,1}(x^k_1)) + η²H²K,

and hence

Σ_{k=1}^K (Y - Y^k)(Hb - κ - V^k_{g,1}(x^k_1)) ≤ |Y^1 - Y|²/(2η) + ηH²K/2.   (63)

Since Y^1 = 0, we have the result.

Now, we prove Lemma 9.

Proof. Note that

Y Σ_{k=1}^K (Hb - κ - V^{π_k}_{g,1}(x^k_1))
= Σ_{k=1}^K [ (Y - Y^k)(Hb - κ - V^k_{g,1}(x^k_1)) + Y^k(Hb - κ) - Y^k V^k_{g,1}(x^k_1) + Y(V^k_{g,1}(x^k_1) - V^{π_k}_{g,1}(x^k_1)) ]
≤ Y²/(2η) + ηH²K/2 + Σ_{k=1}^K ( Y^k(Hb - κ) + (Y - Y^k) V^k_{g,1}(x^k_1) - Y V^{π_k}_{g,1}(x^k_1) )
≤ Y²/(2η) + ηH²K/2 + Σ_{k=1}^K ( Y^k V^{π*}_{g,1}(x^k_1) - Y^k V^k_{g,1}(x^k_1) + Y V^k_{g,1}(x^k_1) - Y V^{π_k}_{g,1}(x^k_1) ),

where the first inequality follows from Lemma 18, and the second inequality follows from the fact that V^{π*}_{g,1}(x^k_1) ≥ Hb - κ. Hence, the result simply follows from the above inequality.
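The inequality of Lemma 18 is the standard projected-gradient (dual ascent) bound and holds pathwise for any sequence bounded by H. A small numerical check, with hypothetical constants for ξ, H, and K:

```python
import random

random.seed(0)
xi, H, K = 2.0, 5.0, 1000
eta = xi / (K ** 0.5)       # step size, scaling as in the paper's eta choice

# Stand-in for the per-episode terms Hb - kappa - V^k_{g,1}(x^k_1), each in [-H, H].
c = [random.uniform(-H, H) for _ in range(K)]

# Projected dual ascent: Y_{k+1} = Proj_{[0, xi]}(Y_k + eta * c_k).
Y_iter, traj = 0.0, []
for ck in c:
    traj.append(Y_iter)
    Y_iter = max(min(Y_iter + eta * ck, xi), 0.0)

def bound_holds(Y):
    """Check sum_k (Y - Y_k) c_k <= Y^2/(2 eta) + eta H^2 K / 2."""
    lhs = sum((Y - Yk) * ck for Yk, ck in zip(traj, c))
    return lhs <= Y * Y / (2 * eta) + eta * H * H * K / 2

# The bound must hold simultaneously for every comparator Y in [0, xi].
print(all(bound_holds(Y) for Y in [0.0, xi / 2, xi]))
```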

G.5 PROOF OF LEMMA 11

Proof. We only prove the claim for ⋄ = r; the proof for ⋄ = g is similar. Note that Q^π_{r,h}(x,a) = ⟨ϕ(x,a), w^π_{r,h}⟩ = r_h(x,a) + PV^π_{r,h+1}(x,a). We can write w^π_{r,h} = (Λ^k)⁻¹Λ^k w^π_{r,h}, where Λ^k = λI + Σ_{τ=1}^{k−1} Σ_{h'} ϕ(x^τ_{h'}, a^τ_{h'})ϕ(x^τ_{h'}, a^τ_{h'})ᵀ (see Algorithm 1, line 3). Hence,
w^π_{r,h} = (Λ^k)⁻¹(λI + Σ_{τ=1}^{k−1} Σ_{h'} ϕ(x^τ_{h'}, a^τ_{h'})ϕ(x^τ_{h'}, a^τ_{h'})ᵀ) w^π_{r,h}.
From the definition of w^π_{r,h}, ϕ(x^τ_{h'}, a^τ_{h'})ᵀ w^π_{r,h} = r_h(x^τ_{h'}, a^τ_{h'}) + PV^π_{r,h+1}(x^τ_{h'}, a^τ_{h'}) (from Lemma 4). Hence,
w^π_{r,h} = λ(Λ^k)⁻¹ w^π_{r,h} + (Λ^k)⁻¹ Σ_{τ=1}^{k−1} Σ_{h'} ϕ^τ_{h'}(r_h(x^τ_{h'}, a^τ_{h'}) + PV^π_{r,h+1}(x^τ_{h'}, a^τ_{h'})).
Therefore,
w^k_{r,h} − w^π_{r,h} = (Λ^k)⁻¹ Σ_{τ=1}^{k−1} Σ_{h'=1}^H ϕ^τ_{h'}[r(x^τ_{h'}, a^τ_{h'}) + V^k_{r,h+1}(x^τ_{h'+1})] − w^π_{r,h}
= −λ(Λ^k)⁻¹ w^π_{r,h} + (Λ^k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}[V^k_{r,h+1}(x^τ_{h'+1}) − PV^k_{r,h+1}(x^τ_{h'}, a^τ_{h'})] + (Λ^k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}[PV^k_{r,h+1}(x^τ_{h'}, a^τ_{h'}) − PV^π_{r,h+1}(x^τ_{h'}, a^τ_{h'})].  (64)
We now bound each term on the right-hand side of (64); call them q_1, q_2, and q_3 respectively. First, note that
|⟨ϕ(x,a), q_1⟩| = λ|⟨ϕ(x,a), (Λ^k)⁻¹ w^π_{r,h}⟩| ≤ √λ ||w^π_{r,h}|| √(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)).  (65)
Second, from Lemma 10, on the event E we have
|⟨ϕ(x,a), q_2⟩| ≤ C dH√χ √(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)),  (66)
where χ = log(2(C_1 + 1) log(|A|) ξ d T / p). Third,
⟨ϕ(x,a), q_3⟩ = ⟨ϕ(x,a), (Λ^k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}[P(V^k_{r,h+1} − V^π_{r,h+1})(x^τ_{h'}, a^τ_{h'})]⟩
= ⟨ϕ(x,a), (Λ^k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}(ϕ^τ_{h'})ᵀ ∫ (V^k_{r,h+1} − V^π_{r,h+1})(x') dµ(x')⟩
= ⟨ϕ(x,a), ∫ (V^k_{r,h+1} − V^π_{r,h+1})(x') dµ(x')⟩ − λ⟨ϕ(x,a), (Λ^k)⁻¹ ∫ (V^k_{r,h+1} − V^π_{r,h+1})(x') dµ(x')⟩.  (67)
The last term in (67) can be bounded as
|λ⟨ϕ(x,a), (Λ^k)⁻¹ ∫ (V^k_{r,h+1} − V^π_{r,h+1})(x') dµ(x')⟩| ≤ 2H√(dλ) √(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)),  (68)
since ||∫ (V^k_{r,h+1} − V^π_{r,h+1})(x') dµ(x')||_2 ≤ 2H√d, as ||µ(S)|| ≤ √d.
The first term in (67) is equal to P(V^k_{r,h+1} − V^π_{r,h+1})(x,a).  (69)
Note that ⟨ϕ(x,a), w^k_{r,h}⟩ − Q^π_{r,h}(x,a) = ⟨ϕ(x,a), w^k_{r,h} − w^π_{r,h}⟩ = ⟨ϕ(x,a), q_1 + q_2 + q_3⟩. Since λ = 1, we have from (65), (66), (68), and (69) that
|⟨ϕ(x,a), w^k_{r,h}⟩ − Q^π_{r,h}(x,a) − P(V^k_{r,h+1} − V^π_{r,h+1})(x,a)| ≤ C_3 dH√χ √(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a))  (70)
for some constant C_3 which is independent of C_1. Finally, note that
C_3 √χ = C_3 √(ι + log(C_1 + 1)) ≤ C_1 √ι,  (71)
where ι = log(2 log(|A|) ξ d T / p). The last inequality follows from the fact that ι ∈ [log 2, ∞), as |A| ≥ 2, and C_3 is independent of C_1: since √((ι + log(C_1 + 1))/ι) is decreasing in ι, we can always pick C_1 such that C_3 √(log 2 + log(C_1 + 1)) ≤ C_1 √(log 2), which satisfies (71) for all values of ι ∈ [log 2, ∞).

G.6 PROOF OF LEMMA 12

Proof. From Lemma 11, we have with probability 1 − 2δ,
Q^π_{r,h}(x,a) + Y_k Q^π_{g,h}(x,a) ≤ ⟨ϕ(x,a), w^k_{r,h}⟩ + Y_k⟨ϕ(x,a), w^k_{g,h}⟩ + (1 + Y_k)β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)) + P(V^π_{r,h+1} + Y_k V^π_{g,h+1} − V^k_{r,h+1} − Y_k V^k_{g,h+1})(x,a)
= Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a) + P(V^{π,Y_k}_{h+1} − V^k_{h+1})(x,a).

G.7 PROOF OF LEMMA 13

First, we state and prove a supporting result which bounds the gap between the value functions corresponding to the greedy policy and the soft-max policy at a given step. We show that this gap can be controlled by the parameter α. In the following, V̄^k_h denotes the greedy value function of Definition 4.
Definition 4. V̄^k_h(·) = max_a [Q^k_{r,h}(·,a) + Y_k Q^k_{g,h}(·,a)]; that is, V̄^k_h is the value function corresponding to the greedy policy with respect to the composite Q-function.
Lemma 19. V̄^k_h(x) − V^k_h(x) ≤ log|A|/α.
Proof. Note that V^k_h(x) = Σ_a π_{h,k}(a|x)[Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)], where
π_{h,k}(a|x) = exp(α[Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)]) / Σ_{a'} exp(α[Q^k_{r,h}(x,a') + Y_k Q^k_{g,h}(x,a')]).  (73)
Denote a_x = argmax_a [Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)]. Now, recall from Definition 4 that V̄^k_h(x) = Q^k_{r,h}(x,a_x) + Y_k Q^k_{g,h}(x,a_x). Then,
V̄^k_h(x) − V^k_h(x) = [Q^k_{r,h}(x,a_x) + Y_k Q^k_{g,h}(x,a_x)] − Σ_a π_{h,k}(a|x)[Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)]
≤ (1/α) log(Σ_a exp(α(Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)))) − Σ_a π_{h,k}(a|x)[Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a)]
≤ log(|A|)/α,
where the last inequality follows from Proposition 1 in Pan et al. (2019).
We are now ready to prove Lemma 13.
Proof. We prove the lemma by induction, starting at step H. Note that Q^k_{⋄,H+1} = 0 = Q^π_{⋄,H+1}.
Under the event E described in Lemma 10 and from Lemma 11, we have for ⋄ = r, g,
|⟨ϕ(x,a), w^k_{⋄,H}⟩ − Q^π_{⋄,H}(x,a)| ≤ β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)).
Hence, for any (x,a),
Q^π_{⋄,H}(x,a) ≤ min{⟨ϕ(x,a), w^k_{⋄,H}⟩ + β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)), H} = Q^k_{⋄,H}(x,a).
Hence, from the definition of V̄^k_h,
V̄^k_H(x) = max_a [Q^k_{r,H}(x,a) + Y_k Q^k_{g,H}(x,a)] ≥ Σ_a π(a|x)[Q^π_{r,H}(x,a) + Y_k Q^π_{g,H}(x,a)] = V^{π,Y_k}_H(x)
for any policy π; in particular, it holds for π*. Hence, from Lemma 19, we have
V^{π*,Y_k}_H(x) − V^k_H(x) ≤ log(|A|)/α.
Now, suppose the claim holds up to step h + 1 and consider step h. Since it holds up to step h + 1, for any policy π,
P(V^{π,Y_k}_{h+1} − V^k_{h+1})(x,a) ≤ (H − h) log(|A|)/α.
From (50) in Lemma 12 and the above result, we have for any (x,a),
Q^π_{r,h}(x,a) + Y_k Q^π_{g,h}(x,a) ≤ Q^k_{r,h}(x,a) + Y_k Q^k_{g,h}(x,a) + (H − h) log(|A|)/α.
Hence, V^{π,Y_k}_h(x) ≤ V̄^k_h(x) + (H − h) log(|A|)/α. Now, again from Lemma 19, V̄^k_h(x) − V^k_h(x) ≤ log(|A|)/α. Thus,
V^{π,Y_k}_h(x) − V^k_h(x) ≤ (H − h + 1) log(|A|)/α.  (80)
Since this is true for any policy π, it is true for π*. From the definition of V^{π,Y_k}, we have
V^{π*}_{r,h}(x) + Y_k V^{π*}_{g,h}(x) − (V^k_{r,h}(x) + Y_k V^k_{g,h}(x)) ≤ (H − h + 1) log(|A|)/α.
Hence, the result follows by summing over k ∈ [K] and taking h = 1.
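The greedy-versus-soft-max gap of Lemma 19 can be verified directly: for any score vector q, max_a q(a) − E_{a∼softmax(αq)}[q(a)] ≤ log|A|/α. A small sketch with random scores (the magnitudes are illustrative):

```python
import numpy as np

# Check of Lemma 19: for pi = softmax(alpha * q),
#   max_a q(a) - sum_a pi(a) q(a) <= log|A| / alpha.
rng = np.random.default_rng(3)
A, alpha = 6, 2.5
max_gap = 0.0
for _ in range(1000):
    q = rng.uniform(-5, 5, size=A)             # composite scores Q_r + Y_k * Q_g
    z = alpha * q
    pi = np.exp(z - z.max()); pi /= pi.sum()   # numerically stable softmax
    max_gap = max(max_gap, q.max() - pi @ q)
print(max_gap, "<=", np.log(A) / alpha)
```

The bound follows from max_a q(a) ≤ (1/α)·logsumexp(αq) and the fact that the entropy of the soft-max distribution is at most log|A|, which is the chain used in the proof.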

G.8 PROOF OF LEMMA 14

Proof. By Lemma 11, for any (x, a, h, k),
⟨w^k_{⋄,h}, ϕ(x,a)⟩ + β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)) − Q^{π_k}_{⋄,h}(x,a) ≤ P(V^k_{⋄,h+1} − V^{π_k}_{⋄,h+1})(x,a) + 2β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)).
Thus,
Q^k_{⋄,h}(x,a) − Q^{π_k}_{⋄,h}(x,a) ≤ P(V^k_{⋄,h+1} − V^{π_k}_{⋄,h+1})(x,a) + 2β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)),
that is,
P(V^k_{⋄,h+1} − V^{π_k}_{⋄,h+1})(x,a) + 2β√(ϕ(x,a)ᵀ(Λ^k)⁻¹ϕ(x,a)) − (Q^k_{⋄,h}(x,a) − Q^{π_k}_{⋄,h}(x,a)) ≥ 0.  (83)
Since V^k_{⋄,h}(x) = Σ_a π_{h,k}(a|x)Q^k_{⋄,h}(x,a) and V^{π_k}_{⋄,h}(x) = Σ_a π_{h,k}(a|x)Q^{π_k}_{⋄,h}(x,a), where π_{h,k}(·|x) = SOFT-MAX_α(Q^k_{r,h} + Y_k Q^k_{g,h}), from (83) we obtain
V^k_{⋄,h}(x^k_h) − V^{π_k}_{⋄,h}(x^k_h) = Σ_a π_{h,k}(a|x^k_h)[Q^k_{⋄,h}(x^k_h, a) − Q^{π_k}_{⋄,h}(x^k_h, a)]
≤ Σ_a π_{h,k}(a|x^k_h)[Q^k_{⋄,h}(x^k_h, a) − Q^{π_k}_{⋄,h}(x^k_h, a)] + 2β√(ϕ(x^k_h, a^k_h)ᵀ(Λ^k)⁻¹ϕ(x^k_h, a^k_h)) + P(V^k_{⋄,h+1} − V^{π_k}_{⋄,h+1})(x^k_h, a^k_h) − (Q^k_{⋄,h}(x^k_h, a^k_h) − Q^{π_k}_{⋄,h}(x^k_h, a^k_h)).  (84)
Thus, from (84), with D^k_{⋄,h,1} and D^k_{⋄,h,2} as defined in (52),
V^k_{⋄,h}(x^k_h) − V^{π_k}_{⋄,h}(x^k_h) ≤ D^k_{⋄,h,1} + D^k_{⋄,h,2} + [V^k_{⋄,h+1} − V^{π_k}_{⋄,h+1}](x^k_{h+1}) + 2β√(ϕ(x^k_h, a^k_h)ᵀ(Λ^k)⁻¹ϕ(x^k_h, a^k_h)).
Hence, by iterating recursively,
V^k_{⋄,1}(x_1) − V^{π_k}_{⋄,1}(x_1) ≤ Σ_{h=1}^H (D^k_{⋄,h,1} + D^k_{⋄,h,2}) + Σ_{h=1}^H 2β√(ϕ(x^k_h, a^k_h)ᵀ(Λ^k)⁻¹ϕ(x^k_h, a^k_h)).
The result follows.

G.9 PROOF OF LEMMA 15

Proof. Note from Lemma 14 that, with probability 1 − 2δ,
Σ_{k=1}^K [V^k_{⋄,1}(x^k_1) − V^{π_k}_{⋄,1}(x^k_1)] ≤ Σ_{k=1}^K Σ_{h=1}^H (D^k_{⋄,h,1} + D^k_{⋄,h,2}) + Σ_{k=1}^K Σ_{h=1}^H 2β√(ϕ(x^k_h, a^k_h)ᵀ(Λ^k)⁻¹ϕ(x^k_h, a^k_h)).

Published as a conference paper at ICLR 2023

We now bound the individual terms. First, we show that the first term corresponds to a martingale difference sequence. For any (k,h) ∈ [K] × [H], we define F^k_{h,1} as the σ-algebra generated by the state-action sequences, rewards, and constraint values {(x^τ_i, a^τ_i)}_{(τ,i)∈[k−1]×[H]} ∪ {(x^k_i, a^k_i)}_{i∈[h]}. Similarly, we define F^k_{h,2} as the σ-algebra generated by {(x^τ_i, a^τ_i)}_{(τ,i)∈[k−1]×[H]} ∪ {(x^k_i, a^k_i)}_{i∈[h]} ∪ {x^k_{h+1}}, where x^k_{H+1} is a null state for any k ∈ [K]. The sequence of σ-algebras {F^k_{h,m}}_{(k,h,m)∈[K]×[H]×[2]}, ordered by the time index
t(k, h, m) = 2(k − 1)H + 2(h − 1) + m,  (88)
forms a filtration: F^k_{h,m} ⊂ F^{k'}_{h',m'} whenever t(k,h,m) ≤ t(k',h',m'). Note from the definitions in (52) that D^k_{⋄,h,1} ∈ F^k_{h,1} and D^k_{⋄,h,2} ∈ F^k_{h,2}. Thus, for any (k,h) ∈ [K] × [H],
E[D^k_{⋄,h,1} | F^k_{h−1,2}] = 0,  E[D^k_{⋄,h,2} | F^k_{h,1}] = 0.  (89)
Notice that t(k, 0, 2) = t(k − 1, H, 2) = 2(k − 1)H. Clearly, F^k_{0,2} = F^{k−1}_{H,2} for any k ≥ 2; let F^1_{0,2} be trivial. We define a martingale
M^k_{⋄,h,m} = Σ_{τ=1}^{k−1} Σ_{i=1}^H (D^τ_{⋄,i,1} + D^τ_{⋄,i,2}) + Σ_{i=1}^{h−1} (D^k_{⋄,i,1} + D^k_{⋄,i,2}) + Σ_{l=1}^m D^k_{⋄,h,l} = Σ_{(τ,i,l)∈[K]×[H]×[2]: t(τ,i,l) ≤ t(k,h,m)} D^τ_{⋄,i,l},
where t(k,h,m) = 2(k − 1)H + 2(h − 1) + m is the time index. Clearly, this martingale is adapted to the filtration {F^k_{h,m}}_{(k,h,m)∈[K]×[H]×[2]}, and in particular
Σ_{k=1}^K Σ_{h=1}^H (D^k_{⋄,h,1} + D^k_{⋄,h,2}) = M^K_{⋄,H,2}.
The martingale differences satisfy |D^k_{⋄,h,1}|, |D^k_{⋄,h,2}| ≤ 2H, so each increment is bounded by 4H. From the Azuma–Hoeffding inequality,
Pr(M^K_{⋄,H,2} > s) ≤ 2 exp(−s²/(16TH²)),
so with probability at least 1 − 2δ, for both ⋄ = r, g,
M^K_{⋄,H,2} ≤ √(16TH² log(2/δ)).
Now, we bound the second term. Note that the minimum eigenvalue of Λ^k is at least λ = 1 for all (k,h) ∈ [K] × [H]. By Lemma 20,
Σ_{k=1}^K Σ_h (ϕ^k_h)ᵀ(Λ^k)⁻¹ϕ^k_h ≤ 2 log(det(Λ^{K+1})/det(Λ^0)).  (94)
Moreover, note that ||Λ^{K+1}|| = ||Σ_{τ=1}^K Σ_h ϕ^τ_h(ϕ^τ_h)ᵀ + λI|| ≤ λ + T; hence,
Σ_{k=1}^K Σ_h (ϕ^k_h)ᵀ(Λ^k)⁻¹ϕ^k_h ≤ 2d log((λ + T)/λ) ≤ 2dι.
Now, by the Cauchy–Schwarz inequality, we have
Σ_{k=1}^K Σ_{h=1}^H √((ϕ^k_h)ᵀ(Λ^k)⁻¹ϕ^k_h) ≤ √(KH) [Σ_{k=1}^K Σ_{h=1}^H (ϕ^k_h)ᵀ(Λ^k)⁻¹ϕ^k_h]^{1/2} ≤ √(2dTι).  (96)
Note that β = C_1 dH√ι. Thus, we have with probability 1 − 4δ,
Σ_{k=1}^K [V^k_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1)] + Y Σ_{k=1}^K [V^k_{g,1}(x^k_1) − V^{π_k}_{g,1}(x^k_1)] ≤ (Y + 1)[√(2TH² log(4/p)) + C_4 √(d³H²Tι²)].
Hence, the result follows.
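The elliptical potential bound used for the second term (Σ_k Σ_h (ϕ^k_h)ᵀ(Λ^k)⁻¹ϕ^k_h ≤ 2d log((λ+T)/λ)) is easy to check empirically. The sketch below uses the standard per-step form of the lemma with an illustrative random feature stream.

```python
import numpy as np

# Check of the elliptical potential bound: with ||phi_t|| <= 1, lam = 1, and
# Lambda_t = lam*I + sum_{s<t} phi_s phi_s^T,
#   sum_t phi_t^T Lambda_t^{-1} phi_t <= 2 * d * log((lam + T) / lam).
rng = np.random.default_rng(4)
d, T, lam = 8, 500, 1.0
Lam = lam * np.eye(d)
total = 0.0
for _ in range(T):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))     # enforce ||phi|| <= 1
    total += phi @ np.linalg.solve(Lam, phi)
    Lam += np.outer(phi, phi)
bound = 2 * d * np.log((lam + T) / lam)
print(total, "<=", bound)
```

The factor 2 relies on each summand being at most 1, which holds here because λ = 1 and the features have norm at most 1, exactly as in the proof.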
G.10 PROOF OF THEOREM 2

Note from Lemma 9, Lemma 13, and Lemma 15 that, with probability 1 − 4δ,
Σ_{k=1}^K (V^{π*}_{r,1}(x_1) − V^{π_k}_{r,1}(x_1)) + Y Σ_{k=1}^K (Hb − κ − V^{π_k}_{g,1}(x^k_1)) ≤ Y²/(2η) + ηH²K/2 + HK log|A|/α + Õ((Y + 1)√(d³H²Tι²)).  (98)
Setting Y = 0 in (98), we have
Σ_{k=1}^K (V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1)) ≤ ηH²K/2 + HK log|A|/α + O(√(d³H²Tι²)).  (99)
By noting that η = ξ/√(KH²) and α = (√(KH) log|A|)/(2(1 + ξ + H)), we have
Σ_{k=1}^K (V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1)) ≤ ξ√(KH²)/2 + 2(1 + ξ + H)√(KH) + O(√(d³TH²ι²)) = Õ(√(d³TH²)),  (100)
where the last equality follows from the fact that KH = T. Now, observe the third term in (47). By the Azuma–Hoeffding inequality (since E[V^{π_k}_{r,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} r(x_t, a_t)] = 0 with respect to the filtration F^{k−1}), with probability 1 − δ,
Σ_{k=1}^{T/H} (V^{π_k}_{r,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} r(x_t, a_t)) ≤ H√((T/H) log(2/δ)) ≤ O(√(THι)).
Now, using Lemma 3, (45), (99), and (100) in (44), we obtain with probability 1 − 5δ,
Regret(T) ≤ (T/H) sp(v*_r) + Õ(√(d³TH²)) + Õ(√(TH)).
Since H = T^{1/4}/d^{3/4}, we obtain with probability 1 − 5δ,
Regret(T) ≤ T^{3/4} d^{3/4} sp(v*_r) + Õ((dT)^{3/4}) = (1 + sp(v*_r)) Õ((dT)^{3/4}).
Now, we bound the violation. By noting that η = ξ/√(KH²) and α = (√(KH) log|A|)/(2(1 + ξ + H)), we have from (98),
Σ_{k=1}^K (V^{π*}_{r,1}(x_1) − V^{π_k}_{r,1}(x^k_1)) + Y Σ_{k=1}^K (Hb − κ − V^{π_k}_{g,1}(x^k_1)) ≤ ξ√(KH²) + 2(1 + ξ + H)√(KH) + O((ξ + 1)√(d³H²Tι²)).
From the convex optimization result (Corollary 1), we obtain
Σ_{k=1}^K (Hb − κ − V^{π_k}_{g,1}(x^k_1)) ≤ (2(1 + ξ)/ξ) Õ(√(d³H²T)).  (104)
Also note, from the Azuma–Hoeffding inequality, with probability 1 − δ,
Σ_{k=1}^{T/H} (V^{π_k}_{g,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} g(x_t, a_t)) ≤ H√((T/H) log(2/δ)).  (105)
Hence, combining (104) and (105), we obtain with probability 1 − 5δ,
Σ_{k=1}^K (Hb − κ − V^{π_k}_{g,1}(x^k_1)) + Σ_{k=1}^{T/H} (V^{π_k}_{g,1}(x^k_1) − Σ_{t=(k−1)H+1}^{kH} g(x_t, a_t)) ≤ Õ(((1 + ξ)/ξ)√(d³TH²)).
Therefore,
Violation(T) ≤ (T/H)κ + Õ(((1 + ξ)/ξ)√(d³TH²)) = (1 + κ) Õ(((1 + ξ)/ξ)(dT)^{3/4}),
where the last equality follows from the fact that H = T^{1/4}/d^{3/4}.

G.11 DIFFERENCES IN ANALYSIS FOR ALGORITHMS 1 AND 2

Note that we also need to show that the log ϵ-covering number of the value-function class scales as log(T) for Algorithm 1 as well. However, since Algorithm 1 and Algorithm 2 differ, the analyses also differ. In particular, Algorithm 2 is a primal-dual type algorithm, so the policy is based on the combined state-action value function. Thus, we need to show that the log ϵ-covering number of each individual value function scales at most with log(T) even though the policy is based on the composite Q-function. In Algorithm 1, by contrast, we directly search for the policy that solves a constrained optimization problem. In Algorithm 2 the policy is itself a function of Q_⋄ and the dual variable Y_k; thus, we obtain the result using only the properties of the function class of Q_⋄ and the boundedness of Y_k. For Algorithm 1, on the other hand, a separate policy function class is defined, and we obtain the result by exploiting the smoothness of that function class. Thus, our analysis works for any policy class with the smoothness property (Appendix F.6), unlike for Algorithm 2. For Algorithm 1 we need the optimal policy to belong to Π. In contrast, in Algorithm 2 we consider an unconstrained version of the episodic CMDP setup: a carefully crafted soft-max policy enables us to bound the gap with respect to the optimal policy in the episodic setup, so we do not need to search over a feasible policy space as in Algorithm 1.

G.12 SUPPORTING RESULTS

The following result is shown in Abbasi-Yadkori et al. (2011) and in Lemma D.2 in Jin et al. (2020).
Lemma 20. Let {ϕ^k_h} be a sequence in R^d satisfying sup ||ϕ^k_h|| ≤ 1, and define Λ_K = Λ_0 + Σ_{k=1}^K Σ_h ϕ^k_h(ϕ^k_h)ᵀ. If the smallest eigenvalue of Λ_0 is at least 1, then
log(det(Λ_K)/det(Λ_0)) ≤ Σ_{k=1}^K Σ_h (ϕ^k_h)ᵀ(Λ_{k−1})⁻¹ϕ^k_h ≤ 2 log(det(Λ_K)/det(Λ_0)).
Moreover, let {ϕ_t}_{t=1}^∞ be an R^d-valued stochastic process with ϕ_t ∈ F_{t−1}, let Λ_0 ∈ R^{d×d} be positive definite, and for t = kH let Λ_t = Λ_0 + Σ_{j=1}^k Σ_h ϕ^j_h(ϕ^j_h)ᵀ. Then, for any δ > 0, with probability at least 1 − δ,
||Σ_{s=1}^t ϕ_s ϵ_s||²_{Λ_t⁻¹} ≤ 2σ² log(det(Λ_t)^{1/2} det(Λ_0)^{−1/2}/δ).
The next result provides a uniform concentration bound in terms of the ϵ-covering number of a function class; the covering number of a Euclidean ball itself is characterized in Lemma 5.2 in Vershynin (2010).
Lemma 21. Let {x_t}_{t=1}^∞ be a stochastic process on a state space X with corresponding filtration {F_t}, let {ϕ_t} be an R^d-valued stochastic process with ϕ_t ∈ F_{t−1} and ||ϕ_t|| ≤ 1, let Λ_t = λI + Σ_{s=1}^{t−1} ϕ_s ϕ_sᵀ, and let V be an arbitrary set of functions on X with N_ϵ its ϵ-covering number with respect to dist(V, V') = sup_x |V(x) − V'(x)| for some fixed ϵ > 0. Then, for any δ > 0, with probability 1 − δ, for all t and all v ∈ V such that sup_x |v(x)| ≤ H, we have
||Σ_{s=1}^{t−1} ϕ_s(v(x_s) − E[v(x_s) | F_{s−1}])||²_{Λ_t⁻¹} ≤ 4H²[(d/2) log((t + λ)/λ) + log(N_ϵ/δ)] + 8t²ϵ²/λ.  (110)
Observation 1. Let Λ_k = Σ_{τ=1}^{k−1} Σ_{h'} ϕ^τ_{h'}(ϕ^τ_{h'})ᵀ + λI, where λ > 0 and ϕ^τ_{h'} ∈ R^d. Then
Σ_{τ=1}^{k−1} Σ_{h'} (ϕ^τ_{h'})ᵀ(Λ_k)⁻¹ϕ^τ_{h'} ≤ d.
Proof. We have Σ_{τ,h'} (ϕ^τ_{h'})ᵀ(Λ_k)⁻¹ϕ^τ_{h'} = Σ_{τ,h'} tr((ϕ^τ_{h'})ᵀ(Λ_k)⁻¹ϕ^τ_{h'}) = tr((Λ_k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}(ϕ^τ_{h'})ᵀ). Given the eigenvalue decomposition Σ_{τ,h'} ϕ^τ_{h'}(ϕ^τ_{h'})ᵀ = U diag(λ_1, …, λ_d) Uᵀ, we have Λ_k = U diag(λ_1 + λ, …, λ_d + λ) Uᵀ. Hence,
tr((Λ_k)⁻¹ Σ_{τ,h'} ϕ^τ_{h'}(ϕ^τ_{h'})ᵀ) = Σ_{i=1}^d λ_i/(λ_i + λ) ≤ d.
We have also used the following optimization result, which is proved in Lemma 9 in Ding et al. (2021).
Lemma 23. Let Y* be the optimal dual variable of the episodic CMDP (46) and let C ≥ 2Y*. If
V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1) + C(b − V^{π_k}_{g,1}(x^k_1)) ≤ δ_k,  (111)
then [b − V^{π_k}_{g,1}(x^k_1)]_+ ≤ 2δ_k/C.  (112)
Corollary 1. Let Y* be the optimal dual variable and C ≥ 2Y*. If
Σ_{k=1}^K [V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1) + C(b − V^{π_k}_{g,1}(x^k_1))] ≤ δ,  (113)
then
Σ_{k=1}^K (b − V^{π_k}_{g,1}(x^k_1)) ≤ 2δ/C.  (114)
Proof. Let δ_k = V^{π*}_{r,1}(x^k_1) − V^{π_k}_{r,1}(x^k_1) + C(b − V^{π_k}_{g,1}(x^k_1)). Then, from Lemma 23,
(b − V^{π_k}_{g,1}(x^k_1)) ≤ [b − V^{π_k}_{g,1}(x^k_1)]_+ ≤ 2δ_k/C.
By summing over k, we obtain
Σ_{k=1}^K (b − V^{π_k}_{g,1}(x^k_1)) ≤ Σ_k 2δ_k/C ≤ 2δ/C,
where the last inequality follows from the fact that Σ_k δ_k ≤ δ.
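Observation 1 is a purely linear-algebraic fact and is easy to check numerically; the random features and sizes below are illustrative.

```python
import numpy as np

# Check of Observation 1: for Lambda = sum_i phi_i phi_i^T + lam*I with lam > 0,
#   sum_i phi_i^T Lambda^{-1} phi_i = tr(Lambda^{-1} sum_i phi_i phi_i^T)
#                                   = sum_j lam_j / (lam_j + lam) <= d.
rng = np.random.default_rng(5)
d, n, lam = 6, 300, 0.1
Phi = rng.normal(size=(n, d))
G = Phi.T @ Phi                          # sum_i phi_i phi_i^T
Lam = G + lam * np.eye(d)
total = sum(p @ np.linalg.solve(Lam, p) for p in Phi)
eig_sum = float(np.sum(np.linalg.eigvalsh(G) / (np.linalg.eigvalsh(G) + lam)))
print(total, eig_sum, "<=", d)
```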

H PROOF OF RESULTS IN SECTION 4

Organization: We first provide a proof sketch of Theorem 3 (Appendix H.1), followed by the detailed proof. We show how to achieve zero constraint violation by tweaking the algorithm in Appendix H.5.
Notation for this section: As in Appendix A of Wei et al. (2021a), we assume that ||ϕ(x,a)||_2 ≤ 1. Without loss of generality, we assume there exists a vector e_1 such that ϕ(x,a)ᵀe_1 = 1 for every (x,a). Moreover, ||µ(S)||_2 ≤ √d and ||θ_j||_2 ≤ √d for j = r, g. Also, for every policy π, q^π(x,a) can be written as ϕ(x,a)ᵀw^π with ||w^π|| ≤ 6t_mix√d (by Lemma 34). We denote q^k_j(x,a) = (w^k_j)ᵀϕ(x,a).

H.1 PROOF SKETCH OF THEOREM 3

Regret can be decomposed as
Regret(T) = Σ_{k=1}^K B[J*_r − J^{π_k}_r] + Σ_k Σ_{t=(k−1)B+1}^{kB} (J^{π_k}_r − r(x_t, a_t)).
In the following, we bound each of the two terms on the right-hand side. Unlike the unconstrained setup of Wei et al. (2021a), here the decision depends on both q^k_r and q^k_g, so the analysis also differs. In particular, in order to bound the first term, we bound Σ_{k=1}^{T/B} B[J*_r − J^{π_k}_r] + BY(b − J^{π_k}_g). Specifically, since J*_g ≥ b, we obtain for any Y ∈ [0, ξ],
Σ_{k=1}^{T/B} B[J*_r − J^{π_k}_r] + BY(b − J^{π_k}_g) ≤ B Σ_k [J*_r + Y_k J*_g − J^{π_k}_r − Y_k J^{π_k}_g]  (=: T_1)
+ Σ_k (Y − Y_k)B(Ĵ_k − J^{π_k}_g)  (=: T_2)
+ B Σ_k (Y − Y_k)(b − Ĵ_k)  (=: T_3).
Note that since the above expression is true for any Y ∈ [0, ξ], we can recover the bound on the regret by setting Y = 0. For T_1, we employ the value-difference lemma. This step may seem similar to the one considered for the unconstrained version (Wei et al., 2021a), which bounds Σ_k (J*_r − J^{π_k}_r); however, there are two important differences. First, we have an additional constraint term scaled by the dual variable Y_k, which also changes over time. Second, we provide a high-probability bound instead of the expectation bound in Wei et al. (2021a). We utilize the one-step descent result of OMD from Lemma 3.3 in Cai et al. (2020), the Azuma–Hoeffding inequality, and the boundedness of ||w^k_⋄|| using Assumption 6 to bound this term. The final result is shown in Lemma 24.
Lemma 24. With probability 1 − δ, T_1 ≤ BK·O(α(1 + ξ)²N²/σ²) + (B/α) log(|A|).
Proof. See Appendix H.1.1.
Note that in the unconstrained setup the term T_2 does not arise. For T_2, we rely on the fact that E_k(Ĵ_k − J^{π_k}_g) is small, since by the way we collect samples, Ĵ_k is close to an unbiased estimator of J^{π_k}_g (Lemma 33). Using the fact that |Y − Y_k| ≤ ξ, we bound T_2 as follows.
Lemma 25. With probability 1 − δ, T_2 ≤ ξ√T + ξ√(TB log(2/δ)).
Proof. See Appendix H.1.2.
Also note that Σ_a (π*(a|x) − π_k(a|x)) N(J^{π_k}_r + Y_k J^{π_k}_g) = 0, since J^{π_k}_⋄ is constant. From Lemma 15 in Wei et al. (2021a):
Lemma 28. Let E[·|τ_{k,m}] denote the expectation conditioned on (x_{τ_{k,m}}, a_{τ_{k,m}}) and all the history before τ_{k,m}. Then
|E[R_{k,m}|τ_{k,m}] − (q^{π_k}_r(x_{τ_{k,m}}, a_{τ_{k,m}}) + N J^{π_k}_r)| ≤ 1/T⁷,
|E[G_{k,m}|τ_{k,m}] − (q^{π_k}_g(x_{τ_{k,m}}, a_{τ_{k,m}}) + N J^{π_k}_g)| ≤ 1/T⁷.
Thus, R_{k,m} and G_{k,m} are close to unbiased estimators of q^{π_k}_r + NJ^{π_k}_r and q^{π_k}_g + NJ^{π_k}_g, respectively. Also, from Lemma 16 in Wei et al. (2021a):
Lemma 29. For ⋄ = r, g,
||E_k[w^k_⋄] − (w^{π_k}_⋄ + NJ^{π_k}_⋄ e_1)|| ≤ 1/T²,  (129)
where E_k is the expectation conditioned on all history before epoch k.
Now, we are ready to prove Lemma 24.
Proof. From Lemma 27, we express T_1 in (117) as
B Σ_k Σ_a (π*(a|x) − π_k(a|x))(q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a) + NJ^{π_k}_r + Y_k NJ^{π_k}_g)
= B Σ_k Σ_a (π*(a|x) − π_k(a|x))((q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a) + NJ^{π_k}_r + Y_k NJ^{π_k}_g) − E_k[q^k_r(x,a) + Y_k q^k_g(x,a)])  (=: T_4)
+ B Σ_k Σ_a (π*(a|x) − π_k(a|x))(E_k[q^k_r(x,a) + Y_k q^k_g(x,a)] − (q^k_r(x,a) + Y_k q^k_g(x,a)))  (=: T_5)
+ B Σ_k Σ_a (π*(a|x) − π_k(a|x))(q^k_r(x,a) + Y_k q^k_g(x,a))  (=: T_6).
Bounding T_4: The bound on T_4 is similar to the unconstrained case of Wei et al. (2021a). The only difference is that, since we have an extra term corresponding to Y_k q^k_g, there is an additional scaling factor of (1 + ξ). The details are as follows. Note that q^{π_k}_⋄(x,a) = (w^{π_k}_⋄)ᵀϕ(x,a), J^{π_k}_⋄ = J^{π_k}_⋄ e_1ᵀϕ(x,a), and q^k_⋄(x,a) = (w^k_⋄)ᵀϕ(x,a).
From Lemma 29, we obtain
|E_k[w^k_⋄]ᵀϕ(x,a) − (w^{π_k}_⋄ + NJ^{π_k}_⋄ e_1)ᵀϕ(x,a)| ≤ ||E_k[w^k_⋄] − (w^{π_k}_⋄ + NJ^{π_k}_⋄ e_1)|| ||ϕ(x,a)|| ≤ 1/T².
Hence,
Σ_a (π*(a|x) − π_k(a|x))((q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a) + NJ^{π_k}_r + Y_k NJ^{π_k}_g) − E_k[q^k_r(x,a) + Y_k q^k_g(x,a)])
≤ ||π*(·|x) − π_k(·|x)||_1 ||(q^{π_k}_r + Y_k q^{π_k}_g + NJ^{π_k}_r + Y_k NJ^{π_k}_g) − E_k[q^k_r + Y_k q^k_g]||_∞ ≤ 2(1 + ξ)/T².
Hence, T_4 ≤ 2(1 + ξ)/T.
Bounding T_5: Since we provide a high-probability bound rather than the expected regret bound of Wei et al. (2021a), we need to bound T_5, whereas such a term did not arise in Wei et al. (2021a). T_5 is a martingale difference sequence with respect to the filtration F_{k−1}, which contains all the history up to epoch k. Moreover, |q^k_r(x,a) + Y_k q^k_g(x,a)| ≤ (1 + ξ)N/σ (from Lemma 31). Hence, from the Azuma–Hoeffding inequality, we obtain with probability 1 − δ,
Σ_k Σ_a (π*(a|x) − π_k(a|x))(E_k[q^k_r(x,a) + Y_k q^k_g(x,a)] − (q^k_r(x,a) + Y_k q^k_g(x,a))) ≤ O(2(1 + ξ)(N/σ)√((2T/B) log(2/δ))).
Thus, multiplying the above by B, we obtain T_5 ≤ O(2(1 + ξ)(N/σ)√(TB log(2/δ))) with probability 1 − δ.
Bounding T_6: In the unconstrained case, Wei et al. (2021a) bounds E[Σ_k Σ_a (π*(a|x) − π_k(a|x)) q^k_r(x,a)] (Lemma 19 in Wei et al. (2021a)). T_6 may seem similar to the above expression; however, there are subtle differences. First, we do not have the expectation, since we need a high-probability bound. Second, we have the additional term Y_k q^k_g, where Y_k changes over k. Thus, we need techniques different from those of Wei et al. (2021a). In particular, we rely on the analysis of OMD and the one-step descent result from Lemma 3.3 in Cai et al. (2020) to bound T_6. Specifically, Lemma 3.3 in Cai et al. (2020) yields
⟨π* − π_{k−1}, q^{k−1}_r + Y_{k−1} q^{k−1}_g⟩ ≤ (α/2)||q^{k−1}_r + Y_{k−1} q^{k−1}_g||²_∞ + (1/α)(D(π*|π_{k−1}) − D(π*|π_k)).
Now, by the construction of w^{k−1}_r and w^{k−1}_g, we have ||q^{k−1}_r + Y_{k−1} q^{k−1}_g||²_∞ = O((1 + ξ)²N²/σ²) (Lemma 31). Finally, by telescoping over k and using the fact that D(·|·) is non-negative, we have the result; the details can be found in Lemma 30. Specifically, we use the result of Lemma 30 to bound T_6:
T_6 ≤ T·O(α(1 + ξ)²N²/σ²) + (B/α) log(|A|),
since KB = T. Hence, we have with probability 1 − δ,
B Σ_{k=1}^K Σ_a (π*(a|x) − π_k(a|x))(q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a)) ≤ 2(1 + ξ)/T + T·O(α(1 + ξ)²N²/σ²) + (B/α) log(|A|) + O((1 + ξ)(N/σ)√(TB log(2/δ))).  (135)
Since α = min{σ/((1 + ξ)√T t_mix), σ/(24(1 + ξ)N)}, N = 8t_mix log(T), and B = 32N log(dT)σ⁻¹, we then have
B Σ_k Σ_a (π*(a|x) − π_k(a|x))(q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a)) ≤ Õ(((1 + ξ)/σ)√(T t³_mix)),
and, integrating over the initial-state distribution,
Σ_k B ∫_X Σ_a (π*(a|x) − π_k(a|x))(q^{π_k}_r(x,a) + Y_k q^{π_k}_g(x,a)) dν^{π*}(x) ≤ Õ(((1 + ξ)/σ)√(T t³_mix)),
where the last inequality follows from the dominated convergence theorem and ∫_X dν^{π*}(x) = 1. Hence, we have with probability 1 − δ,
T_1 ≤ Õ(((1 + ξ)/σ)√(T t³_mix)).
For the second term, since |π_k(a|x) − π_{k−1}(a|x)| ≤ O(α((1 + ξ)N/σ) π_{k−1}(a|x)) (by Lemma 32), Lemma 7 in Wei et al. (2020) gives
v^{π_k}_r(x) − v^{π_{k−1}}_r(x) = O(α(1 + ξ)N³/σ).
The third term is again a martingale sequence.
Hence, we have with probability 1 − δ,
Σ_k Σ_{t=(k−1)B+1}^{kB} [v^{π_k}_r(x_{t+1}) − E_{x_{t+1}∼P(·|x_t,a_t)} v^{π_k}_r(x_{t+1})] ≤ 6t_mix √(2T log(2/δ)).
Hence, we have with probability 1 − 2δ,
Σ_k Σ_{t=(k−1)B+1}^{kB} (J^{π_k}_r − r(x_t, a_t)) ≤ (T/B)·O(α(1 + ξ)N³/σ) + 12t_mix √(2T log(2/δ)).  (142)
Now, using the value of α, we obtain with probability 1 − 2δ,
Σ_k Σ_{t=(k−1)B+1}^{kB} (J^{π_k}_r − r(x_t, a_t)) ≤ Õ(√(T t³_mix)).

H.3 CONSTRAINT VIOLATION BOUND

From (120) and the values of α, N, and B, we have with probability 1 − 2δ,
Σ_k B(J*_r − J^{π_k}_r) + Y B(b − J^{π_k}_g) ≤ Õ(((1 + ξ)/σ)√(T t³_mix)).  (144)
Now, from Corollary 4,
Σ_k B(b − J^{π_k}_g) ≤ Õ((2(1 + ξ)/(ξσ))√(T t³_mix)).
On the other hand, similar to Lemma 26, we can also show the following.
Corollary 2. With probability 1 − 2δ, Σ_k Σ_{t=(k−1)B+1}^{kB} (J^{π_k}_g − g(x_t, a_t)) ≤ Õ(√(T t³_mix)).
Hence, we obtain with probability 1 − 4δ,
Violation(T) ≤ Õ((3(1 + ξ)/(ξσ))√(T t³_mix)).

H.4 SUPPORTING RESULTS

Similar to Lemma 18, we can show the following.
Corollary 3. For any Y ∈ [0, ξ],
Σ_k (Y − Y_k)(b − Ĵ_k) ≤ Y²/(2η) + ηK/2.  (148)
Lemma 30. Σ_{k=1}^K ⟨π* − π_k, q^k_r + Y_k q^k_g⟩ ≤ K·O(α(1 + ξ)²N²/σ²) + (1/α) log(|A|).
Proof. From the composition of the algorithm (12), we obtain
π_k = argmax_π ⟨π, q^{k−1}_r + Y_{k−1} q^{k−1}_g⟩ − (1/α)D(π|π_{k−1}).
From Lemma 3.3 in Cai et al. (2020),
⟨π* − π_{k−1}, q^{k−1}_r + Y_{k−1} q^{k−1}_g⟩ ≤ (α/2)||q^{k−1}_r + Y_{k−1} q^{k−1}_g||²_∞ + (1/α)(D(π*|π_{k−1}) − D(π*|π_k)).  (151)
Now, from Lemma 31, ||q^{k−1}_r + Y_{k−1} q^{k−1}_g||²_∞ ≤ O((1 + ξ)²N²/σ²). Now, by summing (151) from k = 2 to K + 1 and shifting k − 1 to k, we obtain
Σ_{k=1}^K ⟨π* − π_k, q^k_r + Y_k q^k_g⟩ ≤ K·O(α(1 + ξ)²N²/σ²) + (1/α)(D(π*|π_1) − D(π*|π_{K+1})).  (152)
By Pinsker's inequality, −D(π*|π_{K+1}) ≤ −||π* − π_{K+1}||²_1 ≤ 0. On the other hand, since π_1 is uniformly random, D(π*|π_1) ≤ log(|A|). Hence, from (152),
Σ_{k=1}^K ⟨π* − π_k, q^k_r + Y_k q^k_g⟩ ≤ K·O(α(1 + ξ)²N²/σ²) + (1/α) log(|A|).
Lemma 31. |ϕ(x,a)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)| ≤ O((1 + ξ)N/σ).  (155)
Proof. We have
|ϕ(x,a)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)| ≤ ||ϕ(x,a)|| ||w^{k−1}_r + Y_{k−1} w^{k−1}_g||.
Note that, from the construction of the parameters, ||w^{k−1}_⋄|| = O(N/σ). Hence, we have
|ϕ(x,a)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)| ≤ O((1 + ξ)N/σ).
Lemma 32. For any a, x,
|π_k(a|x) − π_{k−1}(a|x)| ≤ π_{k−1}(a|x)·O(αN(1 + ξ)/σ).
Proof. We have
π_k(a|x) − π_{k−1}(a|x) = π_{k−1}(a|x) exp(αϕ(x,a)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)) / Σ_b π_{k−1}(b|x) exp(αϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)) − π_{k−1}(a|x)
≤ π_{k−1}(a|x) exp(αϕ(x,a)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)) / (Σ_b π_{k−1}(b|x) exp(α min_b ϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g))) − π_{k−1}(a|x)
≤ π_{k−1}(a|x)(exp(2α max_b |ϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)|) − 1).
Note that α max_b |ϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)| ≤ 1 as long as α ≤ σ/(24(1 + ξ)N). Combining this with the fact that exp(2x) − 1 ≤ 8x for x ∈ [0, 1],
exp(2α max_b |ϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)|) − 1 ≤ 8α max_b |ϕ(x,b)ᵀ(w^{k−1}_r + Y_{k−1} w^{k−1}_g)| = O(α(1 + ξ)N/σ).
Hence, the result follows.
Lemma 33. For any k, |E_k[Ĵ_k] − J^{π_k}_g| ≤ 2/T⁷.
Proof. Let P_{π_k}(x_t | x_1) denote the distribution of the state at time t within epoch k under π_k, given that the epoch starts at state x_1. Then,
|E_k[Ĵ_k] − J^{π_k}_g| = |(1/(B − N)) Σ_{t=N+1}^B [∫_X Σ_a π_k(a|x) g(x,a) dP_{π_k}(x | x_1) − ∫_X Σ_a π_k(a|x) g(x,a) dν^{π_k}(x)]|
≤ (1/(B − N)) Σ_{t=N+1}^B 2||P_{π_k}(· | x_1) − ν^{π_k}||_TV ≤ (1/(B − N)) Σ_{t=N+1}^B 2e^{−N/t_mix} ≤ 2/T⁷,
where the second-to-last inequality follows from Assumption 4 and the last from the definition of N.
Similar to Corollary 1, we can show the following.
Corollary 4. For C ≥ 2Y*, if
Σ_{k=1}^K [(J^{π*}_r − J^{π_k}_r) + C(b − J^{π_k}_g)] ≤ δ,  (163)
then Σ_k (b − J^{π_k}_g) ≤ 2δ/C.  (164)
Finally, we state some results which are proved in Wei et al. (2021a) (Lemmas 6 and 14, respectively).
Lemma 34.
For any stationary policy π and any x, |v^π(x)| ≤ 4t_mix, and |q^π(x,a)| ≤ 6t_mix for any (x,a).
Lemma 35. q^π(x,a) = (w^π + J^π e_1)ᵀϕ(x,a).

H.5 ZERO VIOLATION

We show that by considering an ϵ-tighter optimization problem, and by carefully choosing ϵ, one can obtain zero constraint violation while maintaining the same order of regret. First, we introduce some notation. Consider the ϵ-tighter optimization problem
maximize J^π_r subject to J^π_g ≥ b + ϵ,  (166)
where b + ϵ replaces b in the constraint. If ϵ ≤ γ/2, then Slater's condition is satisfied, as the strictly feasible policy of Assumption 3 remains strictly feasible for the ϵ-tighter CMDP. The optimal dual variable is bounded by Y* ≤ 2/γ for this tighter optimization problem. Since we need ξ ≥ 2Y*, we set ξ = 4/γ. We denote the optimal policy of the ϵ-tighter optimization problem by π^{ϵ,*}, and J^{π^{ϵ,*}}_r denotes the optimal gain of the tighter optimization problem (166). In Lemma 36 we take
ϵ = min{C_6 (2(1 + ξ)/(σξ)) √(T t³_mix log((dT)²/δ)) / T, γ/2}
for some absolute constant C_6. Note that when Õ((2(1 + ξ)/(σξ))√(T t³_mix))/T ≤ γ/2, the violation becomes 0. Hence, for some constant C_6, we obtain zero violation when C_6 (2(1 + ξ)/(σξ)) √(T t³_mix log((dT)²/δ)) / T ≤ γ/2. Also note that, by plugging in this value of ϵ, the upper bound on the regret is still Õ(√T). Hence, even for large enough (still finite) T, the violation is 0 while the regret bound remains Õ(√T).
Proof: First, we prove the upper bound on the regret. The regret can be decomposed as
Regret(T) = Σ_{t=1}^T (J^{π*}_r − J^{π^{ϵ,*}}_r) + Σ_{t=1}^T (J^{π^{ϵ,*}}_r − r(x_t, a_t)).  (168)
The first term can be bounded with the help of the following lemma (the proof for the finite-state case is in Wei et al. (2022) and for the episodic setup in Wei et al. (2021b); we provide the proof at the end of this section).
Lemma 37. If each stationary policy π induces a stationary state-action distribution ν^π(x,a), every stationary policy satisfies (4), and π^{ϵ,*} is the optimal solution of (166), then
J^{π*}_r − J^{π^{ϵ,*}}_r ≤ ϵ/γ.
Under the uniform mixing assumption (Assumption 4), the conditions of the lemma can be verified. Since the tighter optimization problem is also a CMDP, the second term on the right-hand side of (168) is essentially the regret of the tighter CMDP. Hence, from Theorem 3 and Lemma 37, we obtain the regret bound stated in Lemma 36.
Constraint Violation: Again applying Theorem 3 to the tighter optimization problem (166), we obtain the corresponding violation bound with respect to the tightened constraint b + ϵ. We can obtain zero violation for Theorem 1 in a similar way, while maintaining the same order of regret, by picking ϵ = O(1/√T) in the tighter CMDP.
Here, we show the proof of Lemma 37.
Proof. Let ν^π(x,a) be the stationary state-action occupancy distribution corresponding to the stationary policy π. By assumption, every policy induces a stationary distribution. Hence, J^π_r = ∫_{x,a} r(x,a) dν^π(x,a) and J^π_g = ∫_{x,a} g(x,a) dν^π(x,a). Let π̄ denote the strictly feasible policy guaranteed by Assumption 3, with J^{π̄}_g ≥ b + γ, and consider the state-action distribution ν^ϵ(·,·) = (1 − ϵ/γ)ν^{π*}(·,·) + (ϵ/γ)ν^{π̄}(·,·). There exists a policy which induces the state-action distribution ν^ϵ (Altman, 1999). Now,
∫_{x,a} g(x,a) dν^ϵ(x,a) = (1 − ϵ/γ) ∫_{x,a} g(x,a) dν^{π*}(x,a) + (ϵ/γ) ∫_{x,a} g(x,a) dν^{π̄}(x,a) ≥ (1 − ϵ/γ)b + (ϵ/γ)(b + γ) = b + ϵ.  (173)
Thus, such a state-action occupancy measure is feasible for the tighter optimization problem (166). Further,
∫_{x,a} r(x,a) dν^ϵ(x,a) = (1 − ϵ/γ) ∫_{x,a} r(x,a) dν^{π*}(x,a) + (ϵ/γ) ∫_{x,a} r(x,a) dν^{π̄}(x,a) ≥ (1 − ϵ/γ)J*_r.  (174)
Since the state-action measure ν^ϵ(x,a) is feasible, J^{π^{ϵ,*}}_r ≥ ∫_{x,a} r(x,a) dν^ϵ(x,a). Hence, from (174), we have J^{π^{ϵ,*}}_r − J*_r ≥ −(ϵ/γ)J*_r ≥ −ϵ/γ, since J*_r ≤ 1. Hence, the result follows.
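The mixture argument in the proof of Lemma 37 can be illustrated on a toy finite CMDP with made-up occupancy measures: mixing the optimal occupancy with a Slater-feasible one restores feasibility for the tightened constraint b + ϵ while losing at most ϵ/γ reward. All quantities below are illustrative assumptions.

```python
import numpy as np

# Toy illustration of Lemma 37's mixture argument.  nu_star is an (assumed)
# optimal occupancy with utility exactly b; nu_bar is Slater-feasible with
# slack gamma.  nu_eps = (1 - eps/gamma)*nu_star + (eps/gamma)*nu_bar is
# feasible for the tightened constraint b + eps and loses <= eps/gamma reward.
rng = np.random.default_rng(6)
n = 12                                     # number of (x, a) pairs
r = rng.random(n)                          # rewards in [0, 1]
b, gamma, eps = 0.7, 0.2, 0.05             # eps <= gamma/2

def random_occupancy():
    p = rng.random(n)
    return p / p.sum()

nu_star, nu_bar = random_occupancy(), random_occupancy()
g = rng.random(n)
g = g + (b - g @ nu_star)                   # shift so that g @ nu_star == b
g = g + max(0.0, (b + gamma) - g @ nu_bar)  # ensure g @ nu_bar >= b + gamma

nu_eps = (1 - eps / gamma) * nu_star + (eps / gamma) * nu_bar
J_star, J_eps = r @ nu_star, r @ nu_eps
print(g @ nu_eps, ">=", b + eps, "and", J_star - J_eps, "<=", eps / gamma)
```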

I EXPERIMENTS

We evaluate Algorithm 3 on a model similar to the one described in Chen et al. (2022); Singh et al. (2020), in which a wireless node continuously transmits packets. The node contains a queue with maximum capacity 9. At time t, the node chooses a transmission power a_t ∈ {0.1, 0.9} as its action; higher transmission power results in a higher probability of successful transmission. The number of packets arriving at the node at time t is Y_t ∈ {0, 1, 2, 3}, with corresponding probabilities (0.65, 0.2, 0.1, 0.05). The channel reliability is p_r = 0.9, i.e., each attempted transmission succeeds with probability 0.9. The dynamics of the queue are
Q_{t+1} = min{9, max{Q_t + Y_t − D_t, 0}},
where D_t is 1 with probability a_t·p_r and 0 otherwise. The goal is to maintain a short queue length with small transmission power. At time t, the node receives reward 1 − a_t, which decreases as a_t increases. On the other hand, it receives a larger utility when the queue is shorter; specifically, the utility is g(Q_t) = 1 − 0.1Q_t. We seek to keep the average utility at least 0.7, i.e., the average queue length at most 3. Compared to Chen et al. (2022), we use a different utility function. First, we ensure that the utility is bounded between 0 and 1. Further, Chen et al. (2022) treats the constraint as a cost, whereas we treat it as a utility; hence, unlike Chen et al. (2022), we assign higher utility to lower queue states. Further, our algorithm is model-free, unlike that of Chen et al. (2022). Note that the setup can be represented in tabular form. Since the linear MDP class contains tabular MDPs, the feature representation is simple: ϕ(s,a) = e_{s,a}, where e_{s,a} is 1 at the state-action pair (s,a) and 0 otherwise. The dimension of the feature space is |S||A|. We run Algorithm 3 for 5 × 10⁵ time steps.
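The environment described above is easy to reproduce. The sketch below simulates the queue under a fixed threshold policy and reports the empirical average reward and utility; the threshold policy is an illustrative stand-in, since Algorithm 3 would learn the policy.

```python
import random

# Simulation of the wireless-node queue environment from the experiments:
# Q_{t+1} = min(9, max(Q_t + Y_t - D_t, 0)), actions a in {0.1, 0.9},
# channel reliability p_r = 0.9, reward 1 - a, utility 1 - 0.1*Q.
rng = random.Random(0)
ARRIVAL_VALUES, ARRIVAL_PROBS = (0, 1, 2, 3), (0.65, 0.2, 0.1, 0.05)
P_R, CAPACITY = 0.9, 9

def step(q, a):
    y = rng.choices(ARRIVAL_VALUES, ARRIVAL_PROBS)[0]   # packet arrivals Y_t
    d = 1 if rng.random() < a * P_R else 0              # successful departure D_t
    q_next = min(CAPACITY, max(q + y - d, 0))
    return q_next, 1.0 - a, 1.0 - 0.1 * q               # (state, reward, utility)

def simulate(policy, horizon=100_000):
    q, tot_r, tot_g = 0, 0.0, 0.0
    for _ in range(horizon):
        a = policy(q)
        q, r, g = step(q, a)
        tot_r += r
        tot_g += g
    return tot_r / horizon, tot_g / horizon

# Illustrative threshold policy: high power only when the queue builds up.
avg_r, avg_g = simulate(lambda q: 0.9 if q >= 2 else 0.1)
print(f"average reward {avg_r:.3f}, average utility {avg_g:.3f}")
```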
The parameters we used are the following: B = 200, N = 10, α = 1/√T, and η = 10/√(T/B). Note that we do not know t_mix or σ; yet we achieve sub-linear regret and a violation which approaches zero. We used Bσ/(24N) = 0.05, which we obtained by tuning. Hence, our algorithm can work well even if we do not know the MDP-dependent parameters. In Fig. 1 we plot the average regret and average violation as functions of T for ϵ = 0.01 (the constraint tightening introduced in Remark 1). As predicted by our theory, our algorithm achieves sub-linear regret and sub-linear violation: both the average regret and the average violation decrease to 0 as T increases. Initially, the regret decreases fast since we start with an infeasible policy (the violation increases, Fig. 3). Subsequently, the regret increases as our algorithm seeks to reduce the violation by making the dual variable large (the policy becomes sub-optimal), and the average violation decreases rapidly (Fig. 2). Eventually, the regret decreases again when T ≈ 1 × 10^5, as our algorithm finds the dual variable which balances the regret and the violation. In this regime, the average violation slowly approaches 0 as T grows. In Fig. 2 we plot the average regret and average violation as functions of T for ϵ = 0.1. Again, the regret and violation grow sub-linearly, as the average regret and average violation decrease to 0. Since ϵ is larger, from Fig. 3 we observe that the violation indeed eventually becomes 0. From Fig. 3, we also observe that the violation starts decreasing after T ≈ 0.5 × 10^5, as opposed to T ≈ 10^5 for the smaller value of ϵ. The variation of the regret is similar to Fig. 1: the regret first decreases rapidly and then increases until T ≈ 0.5 × 10^5. The regret eventually decreases steadily as T increases further. Note that the regret starts to decrease much earlier (at T ≈ 0.5 × 10^5) in this scenario compared to the scenario with smaller ϵ.
Intuitively, since ϵ is larger, the dual variable increases at a faster rate, which helps in finding a feasible solution more quickly. The regret then starts decreasing once we find the dual variable which balances the regret and the violation. We would like to point out a difference from Chen et al. (2022): the violation in Chen et al. (2022) oscillates and never approaches 0, whereas in our evaluation the violation eventually approaches 0. This is because we use a learning rate η (which is O(1/√T)) in the dual-update step, whereas in Chen et al. (2022) the learning rate is 1; hence the dual variable in Chen et al. (2022) oscillates more than in our approach. Further, Chen et al. (2022) only guarantees constant violation, whereas we can achieve zero violation.
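The role of the small learning rate in the dual-update step can be illustrated with a minimal sketch. The projected update and the epoch-wise utility estimates below are our own illustrative stand-ins, not the paper's exact update; the dual variable rises while the constraint is violated and settles once the utility estimate clears the tightened threshold.

```python
import numpy as np

def dual_update(lmbda, avg_utility, b, eps, eta, xi):
    """Projected subgradient step on the dual variable for the constraint
    avg_utility >= b + eps; lmbda is kept in the interval [0, xi]."""
    return float(np.clip(lmbda + eta * (b + eps - avg_utility), 0.0, xi))

# Illustration: improving utility estimates pull lambda back down.
lmbda, b, eps, xi = 0.0, 0.7, 0.01, 4.0
eta = 10 / np.sqrt(500)                  # small, O(1/sqrt(K))-style step size
utils = [0.5, 0.6, 0.68, 0.72, 0.75]     # hypothetical per-epoch estimates
for u in utils:
    lmbda = dual_update(lmbda, u, b, eps, eta, xi)
```

With a learning rate of 1, the same trajectory would overshoot and oscillate, which is the behavior contrasted with Chen et al. (2022) above.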



EXPERIMENTAL RESULTS

We conduct numerical experiments to validate our theory on an environment similar to Chen et al. (2022). We implement Algorithm 3 and observe that regret and violation indeed grow sub-linearly even without the knowledge of t_mix and σ. Please see Appendix I for details.

CONCLUSION AND FUTURE WORK

In this work, we provide model-free RL algorithms under two different sets of assumptions. To the best of our knowledge, ours is the first work on the average reward constrained linear MDP. Our results improve the regret and violation bounds even for the finite horizon tabular model-free constrained RL setup. Whether we can develop a computationally efficient algorithm that achieves optimal regret under Assumption 1 constitutes a future research question. Whether we can extend the analysis to non-linear function approximation is also an open question.



Lemma 8. If π, π̃ ∈ Π, where π is parameterized by ζ and π̃ is parameterized by ζ̃, then for any x ∈ X,

Σ_a |π(a|x) - π̃(a|x)| ≤ 8||ζ - ζ̃|| (35)

if ||ζ - ζ̃|| ≤ 1/2.

Proof. See Appendix F.5.

The value function class v_{s,⋄} is parameterized by (ζ, w_{s,⋄}). Consider ζ̃ such that ||ζ̃ - ζ|| ≤ ϵ/8; then from Lemma 8, for every state x,

Σ_a |π(a|x) - π̃(a|x)| ≤ ϵ. (36)

Further, consider w̃_{s,⋄} such that ||w_{s,⋄} - w̃_{s,⋄}|| ≤ ϵ; then max_{x,a} |q_{s,⋄}(x, a) - q̃_{s,⋄}(x, a)| ≤ ϵ, where q̃_{s,⋄} is parameterized by w̃_{s,⋄}. Now, ṽ_{s,⋄} = ⟨π̃, q̃_{s,⋄}⟩ and v_{s,⋄} = ⟨π, q_{s,⋄}⟩.

We only use Lemma 8 to show Lemma 7 for the class of bias value functions. Thus, our analysis can be extended to any policy class parameterized by ζ such that the following holds.

Definition 3. For any x,

Σ_a |π(a|x) - π̃(a|x)| ≤ R||ζ - ζ̃||

holds whenever ||ζ - ζ̃|| ≤ L_1, where R is a constant and L_1 > 0.
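Definition 3 can be sanity-checked for a concrete policy class. The sketch below (our illustration, not the paper's parameterization) uses a softmax policy over unit-norm features, for which the ℓ1 distance between policies is at most 2||ζ - ζ̃||, comfortably within Lemma 8's constant of 8.

```python
import numpy as np

rng = np.random.default_rng(0)
d, A = 5, 4                                         # feature dim, num actions
phi = rng.uniform(-1, 1, size=(A, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi(x, a)|| <= 1

def softmax_policy(zeta):
    """Boltzmann policy pi(a|x) proportional to exp(phi(x, a) . zeta)."""
    logits = phi @ zeta
    p = np.exp(logits - logits.max())
    return p / p.sum()

R = 8.0                                             # Lemma 8's constant
for _ in range(1000):
    zeta = rng.normal(size=d)
    delta = rng.normal(size=d)
    delta *= rng.uniform(0, 0.5) / np.linalg.norm(delta)  # ||delta|| <= 1/2
    tv = np.abs(softmax_policy(zeta) - softmax_policy(zeta + delta)).sum()
    assert tv <= R * np.linalg.norm(delta) + 1e-12
```

The softmax Jacobian has entries p_a(δ_ab - p_b), whose absolute sum is at most 2; combined with ||ϕ(x, a)|| ≤ 1, this gives the 2-Lipschitz bound checked above.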

Σ_{l=1} λ_l/(λ_l + d) ≤ d.

Lemma 22. [Covering Number of Euclidean Ball] For any ϵ > 0, the ϵ-covering number of the Euclidean ball in R^d with radius R is upper bounded by (1 + 2R/ϵ)^d.
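Lemma 22's bound can be verified numerically in low dimension by building an explicit cover. The sketch below (ours, for illustration) uses an axis-aligned grid with spacing 2ϵ/√d, so that each grid cell has diameter 2ϵ and is ϵ-covered by its center.

```python
import numpy as np
from itertools import product

d, R, eps = 2, 1.0, 0.2
bound = (1 + 2 * R / eps) ** d          # Lemma 22's upper bound

# Explicit cover: grid with spacing 2*eps/sqrt(d); keep centers near the ball.
s = 2 * eps / np.sqrt(d)
ticks = np.arange(-R, R + s, s)
centers = np.array([c for c in product(ticks, repeat=d)
                    if np.linalg.norm(c) <= R + eps])
assert len(centers) <= bound

# Sanity check: every sampled point of the ball is within eps of some center.
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, d))
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)
pts *= rng.uniform(0, R, size=(500, 1))
dists = np.min(np.linalg.norm(pts[:, None, :] - centers[None], axis=2), axis=1)
assert dists.max() <= eps + 1e-9
```

For d = 2, R = 1, ϵ = 0.2 the grid uses far fewer centers than the bound (1 + 2R/ϵ)^2 = 121, consistent with the lemma being a (loose) upper bound.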

|A|)^α. Lemma 31. For any x and a, |ϕ(x, a)^⊤ (w_r^{k-1}

Now, we are ready to state the result for the set of assumptions in Section 4.

Lemma 36. In Algorithm 3, replace b with b + ϵ, and ξ with 4/γ. Then, under Assumptions 2, 3, 4, and 5, with probability 1 - 4δ,

Regret(T) ≤ Õ(((1 + ξ)/σ) √(T t_mix^3)),

Σ_{t=1}^{T} (b + ϵ - g(x_t, a_t)) ≤ Õ((2(1 + ξ)/(σξ)) √(T t_mix^3)),

and hence the violation with respect to the original threshold b satisfies Violation(T) ≤ Õ((2(1 + ξ)/(σξ)) √(T t_mix^3)) - Tϵ.

Figure 1: The plot for average regret (Regret(T) divided by the number of steps) and average constraint violation (Violation(T)/T) as a function of T (x-axis) for ϵ = 0.01. The x-axis is in the order of 10^5. Each plot is an average of 5 trials.

Under Assumptions 2, 3, 4, and 5, we can reduce the violation to 0 for large enough T (Appendix H.5) while maintaining the same order of regret with respect to T.
+ We can reduce the violation to 0 for Algorithms 1 and 2 if we assume that any stationary policy satisfies (4) (Appendix H.5).

Model Free Primal-Dual Algorithm for Long-term Average Reward in Linear MDP

Theorem 4. [Concentration of Self-Normalized Processes, Abbasi-Yadkori et al. (2011)] Let {ϵ_t}_{t=1}^∞ be a real-valued stochastic process with corresponding filtration {F_t}_{t=0}^∞. Let ϵ_t | F_{t-1} be zero-mean and σ-sub-Gaussian, i.e., E[ϵ_t | F_{t-1}] = 0 and, for all ζ ∈ R, E[e^{ζϵ_t} | F_{t-1}] ≤ e^{ζ²σ²/2}.
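The sub-Gaussian condition in Theorem 4 can be illustrated empirically: by Hoeffding's lemma, a zero-mean variable bounded in [-c, c] is c-sub-Gaussian, so its moment generating function sits below the Gaussian envelope e^{ζ²c²/2}. The snippet below is an illustration of the definition, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 1.0                               # noise bounded in [-c, c]
samples = rng.uniform(-c, c, size=200_000)
samples -= samples.mean()             # enforce (empirical) zero mean

# Empirical MGF stays below the sub-Gaussian envelope exp(zeta^2 * c^2 / 2).
for zeta in [-2.0, -0.5, 0.5, 2.0]:
    mgf = np.exp(zeta * samples).mean()
    assert mgf <= np.exp(zeta**2 * c**2 / 2) + 1e-3
```

Here c plays the role of σ in the theorem statement; heavier-tailed noise would violate the envelope for large ζ.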

ACKNOWLEDGMENTS

This work has been partly supported by NSF grants NSF AI Institute (AI-EDGE) 2112471, CNS-2106933, 2007231, CNS-1955535, and CNS-1901057, and in part by Army Research Office under Grant W911NF-21-1-0244. Xingyu Zhou is supported in part by NSF CNS-2153220.


In order to bound T_3, we utilize the result from the dual-domain analysis (Corollary 3). We show that T_3 ≤ ξ√(TB). Hence, we have with probability 1 - 2δ,

In order to bound the second term in (117), we need to show that the policy changes slowly between two epochs. Here, we use the value of the hyper-parameter α and the upper bound of the dual variable to bound the term.

Lemma 26. With probability 1 - 2δ,

Hence, combining the above, setting Y = 0, and replacing (120) in (117), we obtain with probability

Now, replacing the values of α and B, we obtain

Constraint Violation: Similar to the regret step, we decompose the violation in the following form

The second term can be upper bounded in the same way as the second term in (117), with probability 1 - 2δ. For the first term, we observe

Now, from (120), we have with probability

Finally, by replacing B and applying Corollary 4, we obtain with probability

Hence, the result follows.

H.1.1 PROOF OF LEMMA 24

We first state Lemmas 27, 28, and 29, which will be useful for proving Lemma 24.

Lemma 27. Using the value difference lemma (Lemma 15 in Wei et al. (2020)), we have

where ν_{π*}(x) is the stationary distribution corresponding to the stationary policy π*.

H.1.2 PROOF OF LEMMA 25

Proof. We can decompose

Since K = T/B, with probability 1 - δ,

From Corollary 3 and η = ξ T /B, we have

where recall that T_3 is the third term in (117).

H.2 PROOF OF LEMMA 26

Proof. Note that

The first term is a martingale sequence. Hence, with probability 1 - δ, we have

Thus, the result follows.

Now, we show that zero violation can be attained under the set of assumptions in Theorem 2 and one additional assumption:

Assumption 6. Every stationary policy π satisfies (4) and induces a stationary state-action distribution ν_π(x, a).

Under the uniform mixture assumption (Assumption 4), one can show that Assumption 6 holds. Note that Wei et al. (2022), which studies the infinite horizon average reward CMDP for the finite state space setup, inherently assumes the above in their work. We are now ready to state and prove the main result.

Lemma 38. Replace b with b + ϵ, and ξ with 4/γ; then, under Assumptions 1, 2, 3, and 6,

where ϵ = min{C_7 (1 + κ)(dT)^{3/4} ι/T, γ/2} for an absolute constant C_7, and ι = log(2 log(|A|)dT/δ).

Note that when Õ((1 + κ)(dT)^{3/4})/T ≤ γ/2, the violation becomes 0. Hence, there exists some constant C_7 such that when C_7 (1 + κ)(dT)^{3/4} ι/T ≤ γ/2, the violation becomes 0. Hence, for large enough (still finite) T, the violation indeed becomes 0. On the other hand, by plugging in the value of ϵ, we note that the regret bound is still Õ((dT)^{3/4}) (in fact, it is Õ((1 + κ + sp(v*_r))(dT)^{3/4})).

Proof. First, we prove the upper bound on the regret. The regret can be decomposed as follows.

The first term is bounded by Lemma 37, since we assume that every stationary policy satisfies (4). Hence,

Since the tighter optimization problem is also a CMDP, we note that the second term on the right-hand side of (171) is essentially the regret of the tighter CMDP. Hence, from Theorem 2 and Lemma 37, we obtain the expression of the regret bound in Lemma 38.

Constraint Violation: Again applying Theorem 2 to the tighter optimization problem (166), we obtain

