SAFE EXPLORATION INCURS NEARLY NO ADDITIONAL SAMPLE COMPLEXITY FOR REWARD-FREE RL

Abstract

Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art of their constraint-free counterparts up to constant factors, proving that the safety constraint hardly increases the sample complexity of RF-RL.

1. INTRODUCTION

Reward-free reinforcement learning (RF-RL) is an RL paradigm under which a learning agent first explores an unknown environment without any reward signal in the exploration phase, and then utilizes the gathered information to obtain a near-optimal policy for any reward function during the planning phase. Since it was formally introduced in Jin et al. (2020b), RF-RL has attracted increasing attention in the research community (Kaufmann et al., 2021; Zhang et al., 2020; 2021; Wang et al., 2020; Modi et al., 2021). It is particularly attractive for applications where many reward functions may be of interest, such as multi-objective RL (Miryoosefi & Jin, 2021), or where the reward function is not specified by the environment but handcrafted to incentivize some desired behavior of the RL agent (Jin et al., 2020b). The ability of RF-RL to identify a near-optimal policy in response to an arbitrary reward function relies on the fact that the agent is allowed to explore any action during exploration. However, in practice, unrestricted exploration is often unrealistic or even harmful. In order to build safe, responsible and reliable artificial intelligence (AI), the RL agent often has to abide by certain application-dependent constraints, even during the exploration phase. Two motivating applications are as follows.
• Autonomous driving. In order to learn a near-optimal driving strategy, an RL agent needs to try various actions at different states through exploration. While RF-RL is an appealing approach since the reward function is difficult to specify, it is of critical importance for the RL agent to take safe actions (even during exploration) in order to avoid catastrophic consequences.
• Cellular network optimization. The operation of a cellular network needs to take a diverse corpus of key performance indicators into consideration, which makes RF-RL a plausible solution.
Meanwhile, the exploration also needs to meet certain system requirements, such as limits on power consumption.
While meeting these constraints throughout the learning process is a pressing need for the broad adoption of RL in real-world applications, it is impossible to accomplish if no other information is provided, since the learner has little knowledge of the underlying MDP at the beginning of the learning process and will inevitably take undesirable actions (in hindsight) and violate the constraints. On the other hand, in various engineering applications, there often exist either rule-based (e.g., autonomous driving) or human expert-guided (e.g., cellular network optimization) solutions that ensure safe operation of the system. A natural question is: is it possible to leverage such existing safe solutions to ensure safety throughout the learning process? If so, how would the safe exploration requirement affect the corresponding RF-RL performance in terms of the sample complexity of exploration and the optimality and safety guarantees of the obtained policy in planning?
To answer these questions, in this work we introduce a new safe RF-RL framework. In the proposed safe RF-RL framework, the agent does not receive any reward information in the exploration phase, but is aware of a cost function associated with actions at a given state. We require that the cumulative cost in each episode stay below a given threshold during exploration, with the aid of a pre-existing safe baseline policy π_0. The ultimate learning goal of safe RF-RL is to find a safe and near-optimal policy for any given reward and cost functions after exploration.
Main contributions. We summarize our main contributions as follows.
• First, we introduce a novel safe RF-RL framework that imposes safety constraints during both exploration and planning of RF-RL, which may have implications in various applications.
• Second, we propose a unified safe exploration strategy coined SWEET that leverages the prior knowledge of a safe baseline policy π_0. SWEET admits general model estimation and safe exploration policy construction modules, and can thus accommodate various MDP structures and different algorithmic designs. Under the assumption that the approximation error function is concave and continuous in the policy space, SWEET is guaranteed to achieve zero constraint violation during exploration, and to output a near-optimal safe policy for any given reward function and safety constraint under some assumptions in planning, both with high probability.
• Third, in order to facilitate the specific design of the approximation error function and ensure its concavity, we introduce a novel definition of truncated value functions. It relies on a new clipping method that avoids underestimation of the approximation error captured by the corresponding value function, while ensuring the concavity of the resulting value function.
• Finally, we particularize the SWEET framework for both tabular and low-rank MDPs, and propose Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms inherit the optimality guarantee during planning, and the safety guarantees in both exploration and planning. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art of their constraint-free counterparts up to constant factors, proving that the safety constraint incurs nearly no additional sample complexity for RF-RL.

2.1. EPISODIC MARKOV DECISION PROCESSES

We consider episodic Markov decision processes (MDPs) of the form M = (S, A, P, H, s_1), where S is the state space, A is the finite action space, H is the number of time steps in each episode, P = {P_h}_{h=1}^H is a collection of transition kernels, and P_h(s_{h+1} | s_h, a_h) denotes the transition probability from the state-action pair (s_h, a_h) at step h to state s_{h+1} at the next step. Without loss of generality, we assume that in each episode of the MDP the initial state is fixed at s_1. In addition, an MDP may be equipped with certain specified utility functions u = {u_h}_{h=1}^H, where we assume u_h : S × A → [0, 1] is a deterministic function for ease of exposition.
A Markov policy π is a set of mappings {π_h : S → ∆(A)}_{h=1}^H, where ∆(A) is the set of all possible distributions over the action space A. In particular, π_h(a|s) denotes the probability of selecting action a in state s at time step h. We denote the set of all Markov policies by X. For an agent adopting policy π in an MDP M, at each step h ∈ [H], where [H] := {1, . . . , H}, she observes state s_h ∈ S and takes an action a_h ∈ A according to π, after which the environment transits to the next state s_{h+1} with probability P_h(s_{h+1} | s_h, a_h). The episode ends after H steps, and we use a virtual state s_{H+1} to denote the terminal state at step H + 1.
We use E_{P,π} to denote the expectation over the distribution induced by the transition kernel P and policy π. Let Q^π_{h,P,u}(s_h, a_h) and V^π_{h,P,u}(s_h) be the corresponding action-value function and value function at step h, respectively, for a given collection of utility functions u. Then,
V^π_{h,P,u}(s_h) := E_{P,π}[ Σ_{h'=h}^{H} u_{h'}(s_{h'}, a_{h'}) | s_h ],  Q^π_{h,P,u}(s_h, a_h) := E_{P,π}[ Σ_{h'=h}^{H} u_{h'}(s_{h'}, a_{h'}) | s_h, a_h ].
We also use the shorthand V^π_{P,u} for V^π_{1,P,u}(s_1) due to the fixed initial state, and write P_h f(s_h, a_h) = E_{s_{h+1} ∼ P_h(·|s_h,a_h)}[ f(s_{h+1}) ] for any function f : S → R.
We further assume that the utility functions are normalized such that, for any trajectory generated under any policy, the cumulative utility over one episode is bounded by 1, i.e., Σ_{h=1}^{H} u_h(s_h, a_h) ≤ 1.
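As a concrete illustration of the definitions above, the value functions V^π_{h,P,u} of a tabular episodic MDP can be computed by backward induction. The following Python sketch is purely illustrative (the array-based representation of P, u and π is our own convention, not part of the paper):

```python
import numpy as np

def evaluate_policy(P, u, pi):
    """Backward-induction policy evaluation in a tabular episodic MDP.

    P:  list of H arrays, P[h][s, a, s2] = transition probability at step h.
    u:  list of H arrays, u[h][s, a]     = utility at step h.
    pi: list of H arrays, pi[h][s, a]    = probability of action a in state s.
    Returns V with V[h][s] = V^pi_{h,P,u}(s) for h = 0..H (0-indexed steps).
    """
    H = len(P)
    S, _ = u[0].shape
    V = [np.zeros(S) for _ in range(H + 1)]   # V[H] is the terminal step (zero)
    for h in reversed(range(H)):
        Q = u[h] + P[h] @ V[h + 1]            # Q[s, a] = u_h(s, a) + (P_h V_{h+1})(s, a)
        V[h] = np.sum(pi[h] * Q, axis=1)      # V[s] = E_{a ~ pi_h(.|s)} Q[s, a]
    return V
```

With normalized utilities, the returned V[0][s_1] lies in [0, 1], matching the normalization assumption above.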

2.2. SAFE REWARD-FREE REINFORCEMENT LEARNING

The safe policy considered in this work is formally defined as follows.
Definition 1. Given an MDP M* = (S, A, P*, H, s_1), a set of cost functions c = {c_h}_{h=1}^H and τ ∈ (0, 1], a policy π is (c, τ)-safe if V^π_{P*,c} ≤ τ.
Based on this definition of (c, τ)-safe policies, we now elaborate on the proposed safe RF-RL framework, which contains two phases. In the first phase of "exploration", the agent is required to efficiently explore the unknown environment without reward signals, while not violating a predefined safety constraint (c, τ) in any episode of this phase. Let π^{(n)} be the policy implemented in the n-th episode of exploration. Then the agent's exploration should satisfy the safety constraint in every episode with high probability, namely,
P( V^{π^{(n)}}_{P*,c} ≤ τ, ∀n ∈ [N] ) ≥ 1 − δ,    (1)
where δ ∈ (0, 1) and N is the total number of episodes in the exploration phase. Note that the agent is only given the set of cost functions c but not the reward r in this phase. This is reasonable for many RL applications, where the purpose of exploration is not to maximize a certain reward but to learn the environment, while the safety constraint needs to be satisfied throughout the learning process. In the second phase of "planning", the agent is given an arbitrary set of reward functions r* and a new safety constraint (c*, τ*). Without further exploration, she is required to learn an ϵ-optimal policy π̂ with respect to the given reward r*, subject to the safety constraint (c*, τ*).
Definition 2. Given an MDP M* = (S, A, P*, H, s_1), reward functions r*, cost functions c* and τ* ∈ (0, 1], π̂ is an ϵ-optimal (c*, τ*)-safe policy if
V^{π*}_{P*,r*} − V^{π̂}_{P*,r*} ≤ ϵ  and  V^{π̂}_{P*,c*} ≤ τ*,    (2)
where ϵ ∈ (0, 1) and π* is the policy satisfying π* = argmax_π V^π_{P*,r*} s.t. V^π_{P*,c*} ≤ τ*.
The design goal of safe RF-RL algorithms is three-fold: 1) to collect as few sample trajectories as possible, 2) to satisfy the safety constraint (c, τ) in the exploration phase, and 3) to obtain an ϵ-optimal (c*, τ*)-safe policy for any given reward r* and constraint (c*, τ*) in the planning phase. We note that it is impossible to ensure zero constraint violation with high probability during exploration if the agent starts with no information about the system. Therefore, we assume that a safe baseline policy is available to the learning agent during exploration. Besides, we also assume that the constrained MDP always has feasible solutions, both during exploration and planning.
Assumption 1 (Feasibility). The agent has knowledge of a baseline policy π_0 and κ ∈ (0, τ) such that V^{π_0}_{P*,c} ≤ τ − κ. Besides, for any given constraint (c, τ) in the exploration or planning phases, the safety margin, defined as ∆(c, τ) := τ − min_π V^π_{P*,c}, is bounded away from zero, i.e., ∆(c, τ) ≥ ∆_min > 0.
We remark that assuming the existence of a safe baseline policy is reasonable in practice. Many engineering applications already have existing solutions deployed and verified to be safe, although their reward performance may not be near-optimal. Such solutions can naturally serve as the baseline for safe RF-RL. Additionally, there are practical ways to construct safe baseline policies, e.g., via imitation learning on expert demonstrations, or via policy gradient algorithms that reduce the cost value function below the required safety threshold. This assumption is also widely adopted in the safe RL literature (see Section 6 for more discussion).

3. THE SWEET FRAMEWORK

Compared with constraint-free RF-RL, the additional safety requirements during both exploration and planning bring two main challenges to the design of safe RF-RL algorithms. First, obtaining an ϵ-optimal policy for any given reward during planning requires all actions to be sufficiently covered in the exploration phase. In particular, uniform action selection is one of the enablers of reward-free exploration when the state space is undesirably large (Agarwal et al., 2020; Uehara et al., 2021; Modi et al., 2021). On the other hand, the predefined safety constraint (c, τ) may preclude the agent from taking certain actions in exploration, which may affect the estimation accuracy of the environment and degrade the optimality of the output policy in planning. This dilemma requires a novel design to balance safety and state-action space coverage during exploration. Second, there may exist a safety constraint mismatch between exploration and planning. Intuitively, the information obtained under a given constraint (c, τ) during exploration may not provide enough coverage for the optimal policy under a different constraint (c*, τ*) during planning. Designing a safe exploration algorithm that handles such constraint mismatch is non-trivial.
In this section, we introduce a unified framework for safe reward-free exploration, termed SWEET. We will show that the general framework achieves the second and third design objectives, i.e., safe exploration, and ϵ-optimality and (c*, τ*)-safety of the output policy in planning. The first design objective, i.e., low sample complexity of exploration, depends on the underlying MDP structure and will be investigated in Section 4 and Section 5 for tabular MDPs and low-rank MDPs, respectively.

3.1. ALGORITHM DESIGN

The SWEET framework relies on several key design components, namely, the (ϵ_0, t)-greedy policy, the approximation error function, and the empirical safe policy set, as elaborated below.
Definition 3 ((ϵ_0, t)-greedy policy). Given ϵ_0 ∈ (0, 1) and t ∈ {0, 1, · · · , H}, π' is an (ϵ_0, t)-greedy version of π if there exists H ⊂ [H] with |H| = t such that π'_h = π_h for all h ∉ H, and
π'_h(a|s) = (1 − ϵ_0) π_h(a|s) + ϵ_0/|A|,  ∀h ∈ H, s ∈ S, a ∈ A.
Essentially, under an (ϵ_0, t)-greedy version of a given policy π, the agent follows policy π except at t out of the H steps, at which, with probability ϵ_0, she takes an action uniformly at random from the action space A. One critical property of the (ϵ_0, t)-greedy policy is that the difference between the value functions under the (ϵ_0, t)-greedy policy and its original policy is bounded by ϵ_0 t for any normalized utility function (see Lemma 1 in Appendix A).
The approximation error function U(P̂, π) measures the uncertainty in the model estimate P̂ under a policy π. Specifically, for a given MDP M*, U(P̂, π) upper bounds the value function difference under P̂ and P*, i.e., U(P̂, π) ≥ max_u |V^π_{P̂,u} − V^π_{P*,u}|, where u ranges over normalized utility functions.
The empirical safe policy set, which is critical for constructing safe exploration policies, is defined as
C_{P̂,U}(κ̄, ϵ_0, t) = {π_0},  if V^{π_0}_{P̂,c} + U(P̂, π_0) ≥ τ − ϵ_0 t − κ̄;
C_{P̂,U}(κ̄, ϵ_0, t) = { π : V^π_{P̂,c} + U(P̂, π) ≤ τ − ϵ_0 t },  otherwise,    (3)
where κ̄, ϵ_0 and t are constants satisfying τ − ϵ_0 t − κ̄ > τ − κ, i.e., ϵ_0 t + κ̄ < κ. The intuition behind this construction can be explained as follows (Liu et al., 2021): if V^{π_0}_{P̂,c} + U(P̂, π_0) ≥ τ − ϵ_0 t − κ̄, it indicates that P̂ is not yet sufficiently accurate, so the empirical safe policy set only contains the safe baseline policy π_0.
On the other hand, if V^{π_0}_{P̂,c} + U(P̂, π_0) < τ − ϵ_0 t − κ̄, which happens when U(P̂, π_0) is sufficiently small, it indicates that P̂ is sufficiently accurate on π_0. In this case, we relax the constraint on V^π_{P̂,c} + U(P̂, π) from τ − ϵ_0 t − κ̄ to τ − ϵ_0 t in order to include π_0 and other policies in the empirical safe policy set. Since V^π_{P̂,c} + U(P̂, π) is an upper bound on the true cost value V^π_{P*,c} for any π, it ensures that V^π_{P*,c} ≤ τ − ϵ_0 t for all π included in C_{P̂,U}. Moreover, all (ϵ_0, t)-greedy versions of such policies satisfy the safety constraint (c, τ).
With these components in place, SWEET proceeds as follows. At the beginning of each episode, the agent executes a set of behavior policies, which are (ϵ_0, t)-greedy versions of a reference policy π_r obtained in the previous episode; in the first episode, the reference policy is π_0. The general construction of π_r is elaborated below. From the trajectories generated under the behavior policies, the agent updates the estimated model P̂ and the corresponding approximation error function U(P̂, ·). The agent then seeks a reference policy π_r that maximizes the approximation error U(P̂, π) within the constructed empirical safe policy set. Intuitively, U(P̂, π) upper bounds a certain distance between the distributions over trajectories induced by π under P̂ and under P*. Therefore, π_r induces a distribution that captures the most uncertainty in P̂, and choosing π_r reduces the uncertainty in P̂ in a greedy fashion. If U(P̂, π_r) falls below a termination threshold T defined in SWEET, the estimated model P̂ is sufficiently accurate for the planning task, and the exploration phase terminates. Otherwise, the agent continues to the next episode with the new π_r. After termination, SWEET enters the planning phase and receives arbitrary reward functions r* and a safety constraint (c*, τ*).
The agent then utilizes P̂ to compute a policy π̂ that maximizes V^π_{P̂,r*} subject to the empirical safety constraint V^π_{P̂,c*} + U(P̂, π) ≤ τ*. The details of SWEET are given in Algorithm 1.
Algorithm 1 SWEET (Safe reWard-frEe ExploraTion)
1: Input: reference policy π_r = π_0, uncertainty function U, ϵ_0, t, κ̄ and T.
2: // Exploration:
3: while TRUE do
4:   Construct a set of (ϵ_0, t)-greedy policies of π_r (see Definition 3) and use them to collect data;
5:   Model estimation: update P̂ using the collected data;
6:   Obtain π_r = argmax_{π ∈ C_{P̂,U}(κ̄,ϵ_0,t)} U(P̂, π), where C_{P̂,U}(κ̄, ϵ_0, t) is defined in Equation (3);
7:   if V^{π_0}_{P̂,c} + U(P̂, π_0) ≤ τ − ϵ_0 t − κ̄ and U(P̂, π_r) ≤ T then
8:     Output P̂; break;
9:   end if
10: end while
11: // Planning:
12: Receive reward function r* and safety constraint (c*, τ*).
13: Output: π̂ = argmax_π V^π_{P̂,r*} s.t. V^π_{P̂,c*} + U(P̂, π) ≤ τ*.
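The exploration loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative skeleton only: the callables collect, estimate, U, V_cost and argmax_safe_U are placeholders of our own naming for the instantiation-specific modules (Sections 4 and 5), not part of the framework's specification.

```python
def sweet_loop(pi0, collect, estimate, U, V_cost, argmax_safe_U,
               eps0, t, kappa_bar, tau, T, max_iters=1000):
    """Schematic exploration phase of SWEET (Algorithm 1, lines 3-10).

    collect(pi, eps0, t)        -> data from (eps0, t)-greedy rollouts of pi
    estimate(D)                 -> model estimate P_hat from dataset D
    U(P_hat, pi)                -> approximation error function
    V_cost(P_hat, pi)           -> estimated cost value of pi under P_hat
    argmax_safe_U(P_hat, ...)   -> argmax of U over the empirical safe set (line 6)
    Assumes termination happens within max_iters iterations.
    """
    pi_r, D = pi0, []
    for _ in range(max_iters):
        D += collect(pi_r, eps0, t)                       # line 4
        P_hat = estimate(D)                               # line 5
        pi_r = argmax_safe_U(P_hat, kappa_bar, eps0, t)   # line 6
        baseline_accurate = (V_cost(P_hat, pi0) + U(P_hat, pi0)
                             <= tau - eps0 * t - kappa_bar)
        if baseline_accurate and U(P_hat, pi_r) <= T:     # line 7
            return P_hat                                  # line 8
    return P_hat
```

A toy instantiation where the uncertainty decays as 1/n terminates once both conditions on line 7 hold, mirroring the termination test of Algorithm 1.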

3.2. THEORETICAL ANALYSIS

Before presenting the theoretical guarantee for SWEET, we first introduce the notions of mixture policies and equivalent policies, and characterize concavity over the Markov policy space.
Definition 4. Given two Markov policies π, π' ∈ X, we use γπ ⊕ (1 − γ)π' to denote the mixture policy that uses π with probability γ and uses π' with probability 1 − γ during an episode.
Definition 5. Given an MDP M, two policies, including mixture policies, are equivalent if they induce the same marginal distribution over any state-action pair (s, a) at any step h ∈ [H].
By Theorem 6.1 in Altman (1999), for any mixture policy γπ ⊕ (1 − γ)π', there exists an equivalent Markov policy π_γ(π, π') ∈ X. For ease of presentation, in the following we simply write π_γ when the meaning is clear from the context. The Markov policy space X is therefore equipped with an abstract convexity by mapping each mixture policy to its equivalent Markov policy in X. With this convexity, we can define concave functions on X as follows.
Definition 6. A function f : X → [0, 1] is concave and continuous on the Markov policy space X if, for any π, π' ∈ X and γ ∈ [0, 1], f(π_γ) ≥ γf(π) + (1 − γ)f(π'), and f(π_γ) is continuous in γ ∈ [0, 1].
With Definition 6, we have the following result for SWEET.
Theorem 1 (ϵ-optimality and safety guarantee of SWEET). Given an MDP M* and a model estimate P̂, assume that U(P̂, π) is concave and continuous over the Markov policy space X, that |V^π_{P*,u} − V^π_{P̂,u}| ≤ U(P̂, π) for any normalized utility u and policy π, and that Assumption 1 holds. Let ϵ_0, t and κ̄ be constants satisfying ϵ_0 t + κ̄ < κ. Let
Ū = min{ ϵ/2, ∆_min/2, ϵ∆_min/5, (τ − ϵ_0 t)/4, κ̄(∆(c,τ) − ϵ_0 t − κ̄)/(4(∆(c,τ) − ϵ_0 t)) },
and let T ≤ (∆(c, τ) − ϵ_0 t)Ū/2 be the termination threshold of SWEET. If SWEET terminates in finitely many episodes, then the following statements hold: (i) the exploration phase is safe; (ii) the output π̂ of SWEET in the planning phase is an ϵ-optimal (c*, τ*)-safe policy.
The detailed proof of Theorem 1 is deferred to Appendix A; we highlight the main idea here. While the construction of C_{P̂,U}(κ̄, ϵ_0, t) ensures safe exploration, the ability of SWEET to find an ϵ-optimal (c*, τ*)-safe policy in planning relies on the concavity and continuity of U(P̂, ·). Note that when SWEET terminates, the approximation error U(P̂, π) is only guaranteed to be upper bounded by T for policies within C_{P̂,U}(κ̄, ϵ_0, t). Due to a possibly different constraint in planning, it is desirable to have U(P̂, π) sufficiently small under any π, so that the agent can achieve the learning goal in planning with the estimated model P̂.
Let π̄ = argmax_π U(P̂, π), let π_γ be the equivalent Markov policy of γπ̄ ⊕ (1 − γ)π_0, and set f(γ) = U(P̂, π_γ) and g(γ) = V^{π_γ}_{P̂,c}. Then f is concave and g is linear in γ by Theorem 6.1 in Altman (1999). Let F(γ) = f(γ) + g(γ). The definition of C_{P̂,U}(κ̄, ϵ_0, t) ensures that F(0) ≤ τ − ϵ_0 t − κ̄, and that F(γ) ≤ τ − ϵ_0 t whenever π_γ lies in C_{P̂,U}. It suffices to consider the case F(1) > τ − ϵ_0 t, in which both g and f increase with γ. The concavity of f and linearity of g then ensure that
(f(γ) − f(0)) / (f(1) − f(0)) ≥ (F(γ) − F(0)) / (F(1) − F(0)),
as illustrated in Figure 1. Let π_{γ_0} be the policy at which F(γ_0) = τ − ϵ_0 t. Then F(γ_0) − F(0) ≥ κ̄. Combining this with the fact that F(1) − F(0) ≤ 2, we have
f(1) ≤ f(0) + (f(γ_0) − f(0)) · 2/κ̄ ≤ 2T/κ̄,
which provides an upper bound on U(P̂, π) for any π.
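The equivalent-Markov-policy construction of Altman (1999, Theorem 6.1) invoked above admits a concrete tabular recipe: mix the state-action occupancy measures of the two policies, then renormalize per state. The sketch below is illustrative (function names and array shapes are our own conventions); the assertion-checkable consequence is that the value of π_γ equals the γ-mixture of the two values, since values are linear in occupancy measures.

```python
import numpy as np

def occupancy(P, pi, s1=0):
    """d[h][s, a]: probability of visiting (s, a) at step h under Markov policy pi."""
    H = len(P)
    d = []
    mu = np.zeros(pi[0].shape[0])
    mu[s1] = 1.0                                 # fixed initial state s_1
    for h in range(H):
        dh = mu[:, None] * pi[h]                 # d_h(s, a) = mu_h(s) * pi_h(a|s)
        d.append(dh)
        mu = np.einsum('sa,saz->z', dh, P[h])    # mu_{h+1}(z) = sum_{s,a} d_h(s,a) P_h(z|s,a)
    return d

def equivalent_markov(P, pi, pi2, gamma, s1=0):
    """Markov policy equivalent to the mixture gamma*pi (+) (1-gamma)*pi2:
    mix the occupancy measures, then renormalize state by state."""
    d1, d2 = occupancy(P, pi, s1), occupancy(P, pi2, s1)
    out = []
    for h in range(len(P)):
        dm = gamma * d1[h] + (1 - gamma) * d2[h]
        denom = dm.sum(axis=1, keepdims=True)
        # unreachable states get an arbitrary (uniform) action distribution
        probs = np.where(denom > 0, dm / np.maximum(denom, 1e-12), 1.0 / dm.shape[1])
        out.append(probs)
    return out
```

This also makes the linearity of g(γ) in the proof sketch tangible: the cost value of π_γ is exactly γ V^π_{P,c} + (1 − γ) V^{π'}_{P,c}.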

3.3. TRUNCATED VALUE FUNCTION

Theorem 1 highlights the importance of a concave and continuous approximation error function on the Markov policy space. In the following, we introduce a prototype function, coined the truncated value function, which is concave and continuous on the Markov policy space and can be used to construct the approximation error functions for both tabular and low-rank MDPs.
Definition 7 (Truncated value function). Given an MDP M, α > 0, and a set of (un-normalized) utility functions u, the truncated value function V̄^{α,π}_{P,u} = V̄^{α,π}_{1,P,u}(s_1) : X → R is defined recursively by
Q̄^{α,π}_{h,P,u}(s_h, a_h) = u_h(s_h, a_h) + α P_h V̄^{α,π}_{h+1,P,u}(s_h, a_h),
V̄^{α,π}_{h,P,u}(s_h) = min{ 1, E_π[ Q̄^{α,π}_{h,P,u}(s_h, a_h) ] },
where V̄^{α,π}_{H+1,P,u}(s_{H+1}) = 0, and we omit the superscript α for simplicity when α = 1.
It is worth noting that the clipping is applied to the value function rather than to the action-value function, the latter being more conventional in the existing literature. This new method is critical for achieving the superior sample complexity in safe RF-RL, as will be elaborated later. Meanwhile, it preserves the desired concavity and continuity (see Lemma 3 in Appendix A), which ensures the safety guarantees for both exploration and planning.
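A minimal tabular sketch of Definition 7 follows, highlighting that the clipping min{1, ·} acts on V rather than on Q (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def truncated_value(P, u, pi, alpha=1.0, s1=0):
    """Truncated value function of Definition 7 via backward recursion.

    The clipping min{1, .} is applied to the value function V at each step;
    the action-value Q is deliberately left unclipped.
    """
    H = len(P)
    S, _ = u[0].shape
    V = np.zeros(S)                                     # V^{alpha,pi}_{H+1} = 0
    for h in reversed(range(H)):
        Q = u[h] + alpha * (P[h] @ V)                   # Q-bar: no clipping here
        V = np.minimum(1.0, np.sum(pi[h] * Q, axis=1))  # clip V-bar at 1
    return V[s1]
```

With large (un-normalized) utilities the recursion saturates at 1, while for small utilities it reduces to ordinary policy evaluation, illustrating why the truncation never underestimates small approximation errors.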

4.1. ALGORITHM DESIGN

In tabular MDPs, the state space S and action space A are both finite (with sizes S and A, respectively). We instantiate the model estimation and exploration policy construction modules of the SWEET framework, and specify the approximation error function and parameter selection, as follows. The details of Tabular-SWEET are shown in Algorithm 2 in Appendix B.
Model estimation. In each episode n, the agent uses π^{(n−1)}, the reference policy derived from the previous episode n − 1, to collect a trajectory {s^{(n)}_1, a^{(n)}_1, . . . , s^{(n)}_H, a^{(n)}_H}. Here we set both ϵ_0 and t to 0, i.e., the behavior policy is essentially a (0, 0)-greedy version of π^{(n−1)}. We note that although π^{(n−1)} is a greedy policy, the uncertainty captured by the approximation error function U(P̂, π) guides the agent to explore the uncertain state-action pairs and obtain sufficient coverage of the entire space. The agent then adds the new data triples {s^{(n)}_h, a^{(n)}_h, s^{(n)}_{h+1}}_{h=1}^H to a maintained dataset D. Let
N^{(n)}_h(s_h, a_h) = Σ_{m=1}^n 1{s^{(m)}_h = s_h, a^{(m)}_h = a_h},  N^{(n)}_h(s_h, a_h, s_{h+1}) = Σ_{m=1}^n 1{s^{(m)}_h = s_h, a^{(m)}_h = a_h, s^{(m)}_{h+1} = s_{h+1}}
be the visitation counters. The agent estimates P̂^{(n)}_h(s_{h+1} | s_h, a_h) as N^{(n)}_h(s_h, a_h, s_{h+1}) / N^{(n)}_h(s_h, a_h) if N^{(n)}_h(s_h, a_h) ≥ 1, and as 1/S otherwise.
Approximation error function. Inspired by Ménard et al. (2021), we adopt an uncertainty-driven virtual reward function b^{(n)}_h(s_h, a_h) = β_0 H / N^{(n)}_h(s_h, a_h) to guide the exploration, where β_0 is a fixed parameter. Let α_H = 1 + 1/H. The approximation error function is then specified as U^{(n)}(π) := 4 V̄^{α_H,π}_{P̂^{(n)}, b^{(n)}}. By Lemma 3, U^{(n)}(π) is concave and continuous in π. Moreover, as shown in Lemma 8 in Appendix B, we have |V^π_{P*,u} − V^π_{P̂^{(n)},u}| ≤ U^{(n)}(π) for any normalized utility u, i.e., U^{(n)}(π) is a valid upper bound on the estimation error of the corresponding value function.
The required properties of U in Theorem 1 are thus satisfied.
Exploration policy. To guarantee that the exploration is safe, we set κ̄ = κ/2 and construct the empirical safe policy set C^{(n)} := C_{P̂^{(n)},U^{(n)}}(κ/2, 0, 0) (Equation (3)). The algorithm then finds the policy π^{(n)} to be used in the next episode, namely the policy in the safe set C^{(n)} that maximizes the truncated value function V̄^{α_H,π}_{P̂^{(n)}, b^{(n)}}. The exploration phase stops at episode n_ϵ when U^{(n_ϵ)}(π^{(n_ϵ)}) ≤ T. The algorithm then utilizes the model learned in episode n_ϵ to design an ϵ-optimal policy with respect to any given reward r* and safety constraint (c*, τ*).

4.2. THEORETICAL ANALYSIS

The theoretical guarantee of Tabular-SWEET is characterized in the theorem below, whose proof can be found in Appendix B.
Theorem 2 (Sample complexity of Tabular-SWEET). Given ϵ, δ ∈ (0, 1) and a safety constraint (c, τ), under Assumption 1, let Ū = min{ϵ/2, ∆_min/2, ϵ∆_min/5, τ/4, κ/16}, and let T = ∆(c, τ)Ū/2 be the termination threshold of Tabular-SWEET. Then, with probability at least 1 − δ, Tabular-SWEET achieves the learning objective of safe reward-free exploration (Equations (1) and (2)), and the number of trajectories collected in the exploration phase is at most
Õ( HSA(S + log(1/δ)) / (∆(c,τ)² Ū²) + HSA(S + log(1/δ)) / κ² ).
We discuss several possible scenarios and the corresponding selections of Ū as follows.
• Constraint-free RF-RL. In this case ∆(c, τ) = ∆_min = κ = 1 and c = 0. Thus Ū = Θ(ϵ) and the sample complexity is Õ(HS²A/ϵ²), which matches the state of the art (Ménard et al., 2021).
• Constraint-free planning. If only safe exploration is required, we set Ū = Θ(min{ϵ, κ}), and the sample complexity scales as Õ( HS²A/∆(c,τ)² · (1/ϵ² + 1/κ²) ). The blow-up factor 1/∆(c,τ)² depends on the safety margin, and the impact of the baseline policy only appears in the ϵ-independent term.
• Constraint mismatch between exploration and planning. In this case we set Ū = Θ(ϵ∆_min), and the sample complexity is at most Õ( HS²A/∆(c,τ)² · (1/(ϵ²∆_min²) + 1/κ²) ).

5.1. LOW-RANK MDP

In this section, we present another SWEET variant, for low-rank MDPs.
Definition 8 (Low-rank MDP (Jiang et al., 2017; Agarwal et al., 2020; Uehara et al., 2021)). An MDP M is a low-rank MDP of dimension d ∈ N if, for each h ∈ [H], the transition kernel P_h admits a d-dimensional decomposition, i.e., there exist two features ϕ_h : S × A → R^d and μ_h : S → R^d such that P_h(s_{h+1} | s_h, a_h) = ⟨ϕ_h(s_h, a_h), μ_h(s_{h+1})⟩ for all s_h, s_{h+1} ∈ S and a_h ∈ A.
Let ϕ* = {ϕ*_h}_{h∈[H]} and μ* = {μ*_h}_{h∈[H]} be the features of P*. Then ∥ϕ*_h(s, a)∥_2 ≤ 1 and ∥∫ μ*_h(s) g(s) ds∥_2 ≤ √d for all (s, a) ∈ S × A and all g : S → [0, 1]. Unlike linear MDPs (Wang et al., 2020; Jin et al., 2020b), a low-rank MDP does not assume that the features ϕ are known a priori. This lack of knowledge of the features in fact invokes a nonlinear structure, which makes it impossible to learn the model in polynomial time without any assumption on the features ϕ and μ. We hence adopt the following conventional assumption (Jiang et al., 2017; Agarwal et al., 2020; Uehara et al., 2021) from recent studies on low-rank MDPs.
Assumption 2 (Realizability). The learning agent can access a finite model class (Φ, Ψ) that contains the true model, i.e., (ϕ*, μ*) ∈ Φ × Ψ, where ⟨ϕ*_h(s_h, a_h), μ*_h(s_{h+1})⟩ = P*_h(s_{h+1} | s_h, a_h).
We note that the finite model class assumption can be relaxed to infinite classes with bounded statistical complexity (Sun et al., 2019; Agarwal et al., 2020). We then present the following standard oracle as a computational abstraction, which is commonly adopted in the literature (Agarwal et al., 2020; Uehara et al., 2021).
Definition 9 (MLE oracle). Given the model class (Φ, Ψ) and a dataset D of triples (s_h, a_h, s_{h+1}), the MLE oracle MLE(D) takes D as input and returns the estimators
(ϕ̂_h, μ̂_h) = MLE(D) = argmax_{ϕ_h ∈ Φ, μ_h ∈ Ψ} Σ_{(s_h, a_h, s_{h+1}) ∈ D} log⟨ϕ_h(s_h, a_h), μ_h(s_{h+1})⟩.
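Over a finite model class, the MLE oracle of Definition 9 reduces to an argmax of the log-likelihood over the candidates. The sketch below is illustrative; representing each candidate model (ϕ_h, μ_h) as a single callable returning ⟨ϕ_h(s, a), μ_h(s')⟩ is a simplification we assume here.

```python
import numpy as np

def mle_oracle(model_class, dataset):
    """MLE over a finite model class (Definition 9, schematic).

    model_class: list of callables p(s, a, s_next) giving the transition
                 probability <phi_h(s, a), mu_h(s_next)> of each candidate.
    dataset:     list of (s_h, a_h, s_{h+1}) triples.
    Returns the index of the log-likelihood maximizer.
    """
    def log_lik(p):
        # floor probabilities to keep log finite for misspecified candidates
        return sum(np.log(max(p(s, a, sn), 1e-300)) for (s, a, sn) in dataset)
    return max(range(len(model_class)), key=lambda i: log_lik(model_class[i]))
```

For a realizable class (Assumption 2), the true model is among the candidates, and standard MLE concentration arguments bound the error of the selected model in the data distribution.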

5.2. ALGORITHM DESIGN

The instantiated SWEET algorithm, termed Low-rank-SWEET, can be found in Algorithm 3 in Appendix C. It proceeds as follows. In each iteration n of the exploration phase, the agent samples H trajectories, indexed by {(n, h)}_{h=1}^H. During the (n, h)-th episode, the agent executes an (ϵ_0, 2)-greedy version of the reference policy π^{(n−1)}, where ϵ_0 = κ/6 and the ϵ_0-greedy action selection only takes place at time steps h and h − 1. Denote the trajectory collected in episode (n, h) by {s^{(n,h)}_1, a^{(n,h)}_1, . . . , s^{(n,h)}_H, a^{(n,h)}_H}. The agent maintains a dataset D_h for each time step h, updated via D^{(n)}_h ← D^{(n−1)}_h ∪ {s^{(n,h)}_h, a^{(n,h)}_h, s^{(n,h)}_{h+1}}. Note that both s^{(n,h)}_h and a^{(n,h)}_h are affected by the ϵ_0-greedy action selection.
Model estimation. The agent obtains the model estimate P̂^{(n)} through the MLE oracle: (ϕ̂^{(n)}_h, μ̂^{(n)}_h) = MLE(D_h) and P̂^{(n)}_h(s_{h+1} | s_h, a_h) = ⟨ϕ̂^{(n)}_h(s_h, a_h), μ̂^{(n)}_h(s_{h+1})⟩.
Approximation error function. The algorithm also uses the estimated representation ϕ̂^{(n)}_h to update the empirical covariance matrix
Û^{(n)}_h = Σ_{m=1}^n ϕ̂^{(n)}_h(s^{(m,h+1)}_h, a^{(m,h+1)}_h) ϕ̂^{(n)}_h(s^{(m,h+1)}_h, a^{(m,h+1)}_h)^⊤ + λ_n I.
It is worth noting that only a^{(m,h+1)}_h is affected by the ϵ_0-greedy action selection, which differs from the dataset augmentation step. Next, the agent uses ϕ̂^{(n)}_h and Û^{(n)}_h to derive an exploration-driven virtual reward function b̂^{(n)}_h(s, a) = α ∥ϕ̂^{(n)}_h(s, a)∥_{(Û^{(n)}_h)^{-1}}, where ∥x∥_A := √(x^⊤ A x) and α is a pre-determined parameter. As shown in Lemma 14 in Appendix C, the approximation error can be bounded by the truncated value function (with factor α = 1) up to an additive term, i.e.,
|V^π_{P*,u} − V^π_{P̂^{(n)},u}| ≤ V̄^π_{P̂^{(n)}, b̂^{(n)}} + Ãζ/n =: U^{(n)}_L(π),
where "L" stands for "low-rank".
Exploration policy. Following SWEET, we choose κ̄ = κ/3 so that κ̄ + ϵ_0 t < κ.
Then, the algorithm defines the empirical safe policy set as C (n) L := C P (n) ,U (n) L (κ/3, κ/6, 2). It then finds a reference policy π (n) in C (n) L that maximizes U π L (π), which is used for exploration at the next iteration.
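The covariance update and elliptical bonus above can be sketched in a few lines (a hedged illustration; the function and variable names are ours):

```python
import numpy as np

def elliptical_bonus(feats, query, alpha=1.0, lam=1.0):
    """Exploration-driven virtual reward b(s,a) = alpha * ||phi(s,a)||_{U^{-1}},
    where U = sum_m phi_m phi_m^T + lam * I is the empirical covariance
    built from the features of previously collected episodes.

    feats: (n, d) array of estimated features phi_hat(s_h, a_h).
    query: (d,) feature vector of the state-action pair to score.
    """
    n, d = feats.shape
    U = feats.T @ feats + lam * np.eye(d)
    # ||x||_A = sqrt(x^T A x) with A = U^{-1}; solve avoids forming U^{-1}.
    return alpha * float(np.sqrt(query @ np.linalg.solve(U, query)))
```

The bonus shrinks along directions the collected data has already covered, so maximizing it steers exploration toward directions that remain uncertain.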

5.3. THEORETICAL ANALYSIS

We summarize the results of Low-rank-SWEET in Theorem 3, and defer the proof to Appendix C.

Theorem 3 (Sample complexity of Low-rank-SWEET). Given ϵ, δ ∈ (0, 1) and safety constraint (c, τ), let U = min{ϵ/2, ∆_min/2, ϵ∆_min/5, τ/6, κ/24}, and let T = ∆(c, τ)U/3 be the termination condition of Low-rank-SWEET. Then, under Assumptions 1 and 2, with probability at least 1 - δ, Low-rank-SWEET achieves the learning objective of safe reward-free exploration (Equations (1) and (2)), and the number of trajectories collected in the exploration phase is at most Õ( H³d⁴A² log(1/δ)/(κ²∆(c,τ)²U²) + H³d⁴A² log(1/δ)/κ⁴ ).

Remark 1. For the constraint-free scenario, we set ∆(c, τ) = ∆_min = κ = 1, U = Θ(ϵ), and c to be zero. Then, the sample complexity scales as Õ(H³d⁴A²/ϵ²), which outperforms the best known sample complexity of RF-RL (Agarwal et al., 2020; Modi et al., 2021) and even that of reward-known RL with computational feasibility (Uehara et al., 2021), all for low-rank MDPs.
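For concreteness, the termination parameters in Theorem 3 can be computed as follows (a small sketch; the function name and argument order are ours, and `gap` stands for ∆(c, τ)):

```python
def lowrank_sweet_thresholds(eps, delta_min, tau, kappa, gap):
    """U = min{eps/2, delta_min/2, eps*delta_min/5, tau/6, kappa/24}
    and the termination threshold T = Delta(c, tau) * U / 3."""
    U = min(eps / 2, delta_min / 2, eps * delta_min / 5, tau / 6, kappa / 24)
    return U, gap * U / 3
```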

6. RELATED WORKS

Reward-free reinforcement learning. Reward-free exploration is formally introduced by Jin et al. (2020a) for tabular MDPs, where an algorithm called RF-RL-Explore is proposed, which achieves Õ(H³S²A/ϵ²) sample complexity. The result is then improved to Õ(H²S²A/ϵ²) by Kaufmann et al. (2021). By leveraging an empirical Bernstein inequality, RF-Express (Ménard et al., 2021) further improves the sample complexity. Agarwal et al. (2020) studies low-rank MDPs and proposes FLAMBE, whose learning objective can be translated to a reward-free learning goal with sample complexity Õ(H²²d⁷A⁹/ϵ¹⁰). Subsequently, Modi et al. (2021) proposes a model-free algorithm MOFFLE for low-nonnegative-rank MDPs, for which the sample complexity scales as Õ(H⁵A⁵d³_{LV}/(ϵ²η)), where d_LV denotes the non-negative rank of the transition kernel. Recently, Chen et al. (2022) studies RF-RL with more general function approximation, but their result scales as Õ(H⁶d³A/ϵ²) when specialized to low-rank MDPs, and cannot recover our upper bound.

Safe reinforcement learning. Safe RL is often cast in the constrained MDP (CMDP) framework (Altman, 1999), under which the learning agent must satisfy a set of constraints (Efroni et al., 2020; Turchetta et al., 2020; Zheng & Ratliff, 2020; Qiu et al., 2020; Ding et al., 2020; Kalagarla et al., 2020; Liu et al., 2021; Wei et al., 2022; Ghosh et al., 2022). However, most constraints considered in existing works only require the cumulative expected cost over a horizon to fall below a certain threshold, which is less stringent than the episode-wise constraint imposed in this work. Other forms of constraints, such as minimizing the variance (Tamar et al., 2012) or, more generally, maximizing some utility function (Ding et al., 2021), have also been investigated. Amani et al. (2021) studies safe RL with linear function approximation, where the constraint is defined through an (unknown) linear cost function over state-action pairs.
In particular, Miryoosefi & Jin (2021) utilizes a reward-free oracle to solve CMDPs, which, however, does not provide any safety guarantee for the exploration phase. Assuming the availability of a safe baseline policy, Zheng & Ratliff (2020) considers a known MDP with unknown reward and cost functions and presents C-UCRL, which achieves regret Õ(N^{3/4}) with zero constraint violation, where N is the number of episodes. Liu et al. (2021) improves this result by proposing OptPess-LP, which achieves a regret of Õ(H²√(S³AN)/κ), where κ is the gap between the cost value of the baseline policy and the constraint boundary. A safe baseline policy has been widely utilized in conservative RL as well, which is a special case of safe RL in which the constraint is defined in terms of the total expected reward staying above a threshold (Garcelon et al., 2020; Yang et al., 2021).

7. CONCLUSION

We proposed a novel safe RF-RL framework in which safety constraints are imposed during both the exploration and planning phases of RF-RL. A unified algorithmic framework called SWEET was developed, which leverages an existing baseline policy to guide safe exploration. Leveraging a concave approximation error function, SWEET achieves zero constraint violation in exploration and provably produces a near-optimal safe policy for any given reward function and any safety constraint satisfying the feasibility assumption in planning. We also instantiated SWEET for both tabular and low-rank MDPs, resulting in Tabular-SWEET and Low-rank-SWEET. The sample complexities of both algorithms match or outperform the state of the art in their constraint-free counterparts, proving that the safety constraint does not fundamentally impact the sample complexity of RF-RL.

A PROOF OF THEOREM 1 AND PROPERTIES OF THE TRUNCATED VALUE FUNCTION

In this section, we first prove two supporting lemmas in Appendix A.1, which are useful for the proof of Theorem 1. Then, we provide the proof for Theorem 1 in Appendix A.2 and the proof for an important concavity property of the truncated value function in Appendix A.3, which are essential for instantiating Theorem 1 for Tabular-SWEET and Low-rank-SWEET.

A.1 SUPPORTING LEMMAS

We first show that the value function under an (ϵ₀, t)-greedy policy deviates from that under the original policy by at most ϵ₀t.

Lemma 1. Let π′ be an (ϵ₀, t)-greedy version of policy π. Then, for an MDP with transition kernel P and normalized utility function u, we must have |V^{π′}_{P,u} - V^π_{P,u}| ≤ ϵ₀t.

Proof. First, we prove the statement for the case t = 1. Assume policy π′ deviates from policy π at step h, and denote ρ^π_h(s_h) as the marginal distribution induced by π under the transition kernel P. Let π_U = (π_1, ..., π_{h-1}, U, π_{h+1}, ..., π_H), where U is the uniform policy over the action space, i.e., U(a_h|s_h) = 1/|A|. Consider the equivalent Markov policy of the mixture ϵ₀π_U ⊕ (1 - ϵ₀)π, denoted by π^{ϵ₀}. By Lemma 16, we have

π′_h(a_h|s_h) = ϵ₀/|A| + (1 - ϵ₀)π(a_h|s_h) = [ϵ₀ρ^{π_U}_h(s_h)π_{U,h}(a_h|s_h) + (1 - ϵ₀)ρ^π_h(s_h)π(a_h|s_h)] / [ϵ₀ρ^{π_U}_h(s_h) + (1 - ϵ₀)ρ^π_h(s_h)] = π^{ϵ₀}_h(a_h|s_h),

where the second equality follows from the fact that ρ^{π_U}_h(s_h) = ρ^π_h(s_h), since the first h-1 policies of π and π_U are the same. For any h′ ≥ h+1, we have

π^{ϵ₀}_{h′}(a_{h′}|s_{h′}) = [ϵ₀ρ^{π_U}_{h′}(s_{h′})π_{h′}(a_{h′}|s_{h′}) + (1 - ϵ₀)ρ^π_{h′}(s_{h′})π_{h′}(a_{h′}|s_{h′})] / [ϵ₀ρ^{π_U}_{h′}(s_{h′}) + (1 - ϵ₀)ρ^π_{h′}(s_{h′})] = π_{h′}(a_{h′}|s_{h′}) = π′_{h′}(a_{h′}|s_{h′}),

where the last equality is due to the definition of π′. Therefore, π′ = π^{ϵ₀}, which further yields V^{π′}_{P,u} = V^{ϵ₀π_U ⊕ (1-ϵ₀)π}_{P,u} = ϵ₀V^{π_U}_{P,u} + (1 - ϵ₀)V^π_{P,u}. Since the value function is upper bounded by 1 under a normalized utility function u, we immediately obtain |V^{π′}_{P,u} - V^π_{P,u}| ≤ ϵ₀.

For the general case where π′ differs from π at the steps in H ⊂ [H] with |H| = t ≤ H, consider a sequence of subsets {H_i}_{i=1}^t such that H_i ⊂ H_{i+1}, |H_{i+1}| - |H_i| = 1, and H_t = H. Then, we can define a sequence of policies {π^i}_{i=1}^t such that π^i is the (ϵ₀, i)-greedy version of π that deviates from π at the steps in H_i. By the definition of the (ϵ₀, t)-greedy policy in Definition 3, π^{i+1} is an (ϵ₀, 1)-greedy version of policy π^i. Thus, by induction, we conclude that

|V^{π′}_{P,u} - V^π_{P,u}| ≤ Σ_{i=1}^t |V^{π^i}_{P,u} - V^{π^{i-1}}_{P,u}| ≤ ϵ₀t,

where we denote π′ as π^t and π as π^0.

The following lemma is critical for handling the constraint mismatch not only between the exploration phase and the planning phase, but also between the constraint adopted for the construction of the empirical safe policy set used in exploration and the true constraint V^π_{P*,c} ≤ τ.

Lemma 2. Consider a set X on which a convex combination is defined through γx ⊕ (1 - γ)y ∈ X, where x, y ∈ X and γ ∈ [0, 1]. Let f, g : X → [0, 1] be two functions on X such that f is concave and g is convex, i.e., f(γx ⊕ (1 - γ)y) ≥ γf(x) + (1 - γ)f(y) and g(γx ⊕ (1 - γ)y) ≤ γg(x) + (1 - γ)g(y). We further assume that both f and g are continuous w.r.t. γ ∈ [0, 1]. Define an optimization problem (P) as follows:

(P): max_{x∈X} f(x) s.t. g(x) + f(x) ≤ τ, τ ∈ (0, 1].

Assume there exists a strictly feasible solution x₀ ∈ X such that g(x₀) + f(x₀) ≤ τ - κ, where κ ∈ (0, τ), and denote x* as an optimal solution to (P). If the optimal value of (P) is strictly less than κ, i.e., f(x*) < κ, then max_{x∈X} f(x) ≤ 2f(x*)/κ.

Proof. Let x₁ = argmax_{x∈X} f(x), and let x_γ = γx₁ ⊕ (1 - γ)x₀ ∈ X be the convex combination of x₀ and x₁. If x₁ satisfies the constraint in (P), the result is trivial. Therefore, it suffices to consider the case g(x₁) + f(x₁) > τ.

We first show that g(x₁) ≥ g(x₀) through contradiction. Assume g(x₁) < g(x₀). Then, we have

f(x₁) > τ - g(x₁) = τ - κ - g(x₁) + κ ≥(i) g(x₀) + f(x₀) - g(x₁) + κ >(ii) κ,

where (i) follows because x₀ is a strictly feasible solution, and (ii) follows from the assumption that f(x₀) ∈ [0, 1] and g(x₀) > g(x₁). Note that f(x₀) < κ, and f(x_γ) is a continuous function with respect to γ ∈ [0, 1]. Thus, we can choose γ₁ ∈ (0, 1) such that

f(x_{γ₁}) = κ ∈ [f(x₀), f(x₁)]. (7)

In addition, by the convexity of g and the assumption that g(x₁) < g(x₀), we have

g(x_{γ₁}) ≤ γ₁g(x₁) + (1 - γ₁)g(x₀) < g(x₀). (8)

Combining Equations (7) and (8), we have f(x_{γ₁}) + g(x_{γ₁}) ≤ g(x₀) + κ ≤ g(x₀) + f(x₀) + κ ≤ τ, which implies that x_{γ₁} is a feasible solution of the optimization problem (P). Thus, by the optimality of x*, we have κ > f(x*) ≥ f(x_{γ₁}) = κ, which is a contradiction. Therefore, g(x₀) ≤ g(x₁).

Then, let γ₀ be the solution to the following equation:

γ₀(g(x₁) + f(x₁)) + (1 - γ₀)(g(x₀) + f(x₀)) = τ. (9)

Since g(x₀) + f(x₀) ≤ τ - κ and f, g ∈ [0, 1], we have τ ≤ 2γ₀ + τ - κ, which implies γ₀ ≥ κ/2. Since f is concave and continuous w.r.t. γ, there exists γ* ≤ γ₀ such that

f(x_{γ*}) = γ₀f(x₁) + (1 - γ₀)f(x₀). (10)

On the other hand, due to the convexity of g, we have

g(x_{γ*}) ≤ γ*g(x₁) + (1 - γ*)g(x₀) ≤(i) γ₀g(x₁) + (1 - γ₀)g(x₀), (11)

where (i) follows from the fact that γ* ≤ γ₀ and g(x₁) ≥ g(x₀). Combining Equations (10) and (11), we have

g(x_{γ*}) + f(x_{γ*}) ≤ γ₀(g(x₁) + f(x₁)) + (1 - γ₀)(g(x₀) + f(x₀)) = τ,

which indicates that x_{γ*} is a feasible solution of the optimization problem (P). Thus, by the optimality of x*, Equation (10), and γ₀ ≥ κ/2, we conclude that

max_{x∈X} f(x) = γ₀f(x₁)/γ₀ ≤ (γ₀f(x₁) + (1 - γ₀)f(x₀))/γ₀ = f(x_{γ*})/γ₀ ≤ 2f(x*)/κ.
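Lemma 1 can be sanity-checked numerically by exact dynamic programming on a small random tabular MDP (a sketch; the random instance and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, eps0 = 3, 2, 4, 0.2

# Random tabular MDP with a normalized utility (total utility <= 1).
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] -> dist over s'
u = rng.random((H, S, A)) / H                   # per-step utility in [0, 1/H]
pi = rng.dirichlet(np.ones(A), size=(H, S))     # random Markov policy

def value(pi, greedy_steps=()):
    """V^pi at the initial state s = 0; at the steps in greedy_steps the
    policy is mixed with the uniform policy with weight eps0, i.e. an
    (eps0, t)-greedy version of pi with t = len(greedy_steps)."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        pih = pi[h] if h not in greedy_steps else (1 - eps0) * pi[h] + eps0 / A
        Q = u[h] + P[h] @ V                     # Q[s, a] = u + E[V(s')]
        V = (pih * Q).sum(axis=1)
    return V[0]

# Lemma 1: |V^{pi'} - V^pi| <= eps0 * t for the (eps0, t)-greedy version.
gap = abs(value(pi, greedy_steps=(1, 3)) - value(pi))
```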

A.2 PROOF OF THEOREM 1

We first formally restate Theorem 1 below and then provide its proof.

Theorem 4 (Restatement of Theorem 1). Given an MDP M* and a model estimate P̂, assume that U(P̂, π) is concave and continuous over the Markov policy space X, and that |V^π_{P*,u} - V^π_{P̂,u}| ≤ U(P̂, π) for any normalized utility u and policy π. Let ϵ₀, t and κ̃ be constants that satisfy ϵ₀t + κ̃ < κ. Let U = min{ϵ/2, ∆_min/2, ϵ∆_min/5, (τ - ϵ₀t)/4, κ̃(∆(c,τ) - ϵ₀t - κ̃)/(4(∆(c,τ) - ϵ₀t))}, and let T ≤ (∆(c,τ) - ϵ₀t)U/2 be the termination condition of SWEET. If SWEET terminates in finitely many episodes, then the following statements hold: (i) the exploration phase is safe (see Equation (1)); (ii) the output π̂ of SWEET in the planning phase is an ϵ-optimal (c*, τ*)-safe policy (see Equation (2)).

Proof. The proof consists of three steps: Step 1 shows that the exploration phase of SWEET is safe; Step 2 shows that SWEET can find an ϵ-optimal policy in the planning phase for any given reward without a constraint requirement; and Step 3 shows that SWEET can find an ϵ-optimal policy in the planning phase for any given reward and under any constraint (c*, τ*). We next provide the details of each step.

Step 1. This step shows that the exploration phase of SWEET is safe. Note that the exploration policy, denoted by π_b, is an (ϵ₀, t)-greedy version of the reference policy π_r, where π_r is a solution to the optimization problem max_{π ∈ C_{P̂,U}(κ̃,ϵ₀,t)} U(P̂, π), with

C_{P̂,U}(κ̃, ϵ₀, t) = {π⁰}, if V^{π⁰}_{P̂,c} + U(P̂, π⁰) ≥ τ - ϵ₀t - κ̃; and C_{P̂,U}(κ̃, ϵ₀, t) = {π : V^π_{P̂,c} + U(P̂, π) ≤ τ - ϵ₀t}, otherwise.

If π_r = π⁰, then by Lemma 1, we have V^{π_b}_{P*,c} ≤ V^{π⁰}_{P*,c} + ϵ₀t ≤ τ - κ + ϵ₀t < τ, where the last inequality is due to the condition ϵ₀t + κ̃ < κ. If π_r ≠ π⁰, then V^{π_r}_{P̂,c} + U(P̂, π_r) ≤ τ - ϵ₀t. By Lemma 1, we have V^{π_b}_{P*,c} ≤ V^{π_r}_{P*,c} + ϵ₀t ≤(i) V^{π_r}_{P̂,c} + U(P̂, π_r) + ϵ₀t ≤ τ - ϵ₀t + ϵ₀t = τ, where (i) follows from the definition of U(P̂, π).
Therefore, the exploration phase is safe.

Step 2: This step shows that SWEET can find an ϵ-optimal policy in the planning phase for any given reward r* in the constraint-free setting (τ* = ∞), i.e., when the planning phase imposes no constraint requirement. Consider the Markov policy space X with the convex combination defined by the mixture policy γπ ⊕ (1-γ)π′, and let g(π) = V^π_{P̂,c}. Let π_r be the reference policy when the termination condition is satisfied. Then, by the properties of U(P̂, π) and the termination condition of SWEET, the following statements hold:

• g(π) is convex (in fact linear) and U(π) is concave on X; moreover, both are continuous.
• The baseline policy π⁰ ∈ X satisfies g(π⁰) + U(P̂, π⁰) ≤ τ - ϵ₀t - κ̃.
• π_r = argmax_π U(P̂, π) s.t. g(π) + U(P̂, π) ≤ τ - ϵ₀t. Moreover, U(P̂, π_r) ≤ T < κ̃.

Applying Lemma 2 with the baseline policy π⁰, we have

max_π U(π) ≤ (2/κ̃)U(π_r) ≤ 2T/κ̃ := x₁. (13)

Let π_min = argmin_π V^π_{P*,c}. By the definition of U and Assumption 1, we have

g(π_min) + U(π_min) ≤ V^{π_min}_{P*,c} + 2U(π_min) ≤ V^{π_min}_{P*,c} + 4T/κ̃ = τ - ϵ₀t - (τ - ϵ₀t - V^{π_min}_{P*,c} - 4T/κ̃).

We again apply Lemma 2, with the feasible solution fixed as the policy π_min, to conclude that

max_π U(π) ≤(i) 2T/(τ - ϵ₀t - V^{π_min}_{P*,c} - 4T/κ̃) =(ii) 2T/(∆(c,τ) - ϵ₀t - 4T/κ̃) := x₂,

where (i) follows from Equation (13), and (ii) follows from the definition of ∆(c,τ). Continuing this process, we obtain a sequence {x_n} with the recursive formula x_{n+1} = 2T/(∆(c,τ) - ϵ₀t - 2x_n), and max_π U(π) ≤ inf{x_n}_{n=1}^∞. Denote ∆̄_c = ∆(c,τ) - ϵ₀t. Then, T ≤ ∆̄_c U/2 < κ̃(∆̄_c - κ̃)/4, which implies that x₁ ≤ (∆̄_c + √(∆̄_c² - 16T))/4. Then, based on Lemma 20, {x_n} converges to (∆̄_c - √(∆̄_c² - 16T))/4. Therefore, we conclude that

max_π U(π) ≤ (∆̄_c - √(∆̄_c² - 16T))/4 = 4T/(∆̄_c + √(∆̄_c² - 16T)) ≤ 2U∆̄_c/(∆̄_c + √(∆̄_c² - 16T)) ≤ U.

Let π̃ = argmax_π V^π_{P*,r*}, and recall that in this setting the planning phase outputs π̂ = argmax_π V^π_{P̂_ϵ,r*}. By the definition of U(π), we can bound the suboptimality gap of π̂ as follows:

V^{π̃}_{P*,r*} - V^{π̂}_{P*,r*} = (V^{π̃}_{P*,r*} - V^{π̃}_{P̂_ϵ,r*}) + (V^{π̃}_{P̂_ϵ,r*} - V^{π̂}_{P̂_ϵ,r*}) + (V^{π̂}_{P̂_ϵ,r*} - V^{π̂}_{P*,r*}) ≤(i) U(π̃) + U(π̂) ≤ 2 max_π U(π) ≤ 2U ≤ ϵ,

where (i) follows from the optimality of π̂ under P̂_ϵ, i.e., V^{π̃}_{P̂_ϵ,r*} ≤ V^{π̂}_{P̂_ϵ,r*}.

Step 3: This step shows that SWEET can find an ϵ-optimal policy in the planning phase for any given reward r* and under any constraint (c*, τ*). Let g₀(π) = V^π_{P̂,c*}. Recall that

π* = argmax_π V^π_{P*,r*} s.t. V^π_{P*,c*} ≤ τ*, and π̂ = argmax_π V^π_{P̂_ϵ,r*} s.t. g₀(π) + U(π) ≤ τ*.

If g₀(π*) + U(π*) ≤ τ*, then by the optimality of π̂, we immediately have V^{π̂}_{P̂_ϵ,r*} ≥ V^{π*}_{P̂_ϵ,r*}. If g₀(π*) + U(π*) > τ*, then by the definition of U and Equation (14), we have

τ* < g₀(π*) + U(π*) ≤ V^{π*}_{P*,c*} + 2U(π*) ≤ τ* + 2U. (15)

Let π̄ = argmin_π V^π_{P*,c*}. By Equation (14), we have

g₀(π̄) + U(π̄) ≤ V^{π̄}_{P*,c*} + 2U(π̄) ≤ V^{π̄}_{P*,c*} + 2U < τ*. (16)

Let π_γ be the Markov policy equivalent to the mixture policy γπ* ⊕ (1-γ)π̄ under the estimated model P̂. Let ∆_{c*} = ∆(c*, τ*) = τ* - V^{π̄}_{P*,c*} and γ = (∆_{c*} - 3U)/∆_{c*}. By the linearity of g₀ and Equation (14), we have

g₀(π_γ) + U(π_γ) ≤ γg₀(π*) + (1-γ)g₀(π̄) + U ≤(i) γ(τ* + 2U) + (1-γ)(V^{π̄}_{P*,c*} + 2U) + U = γ∆_{c*} + 3U + V^{π̄}_{P*,c*} = ∆_{c*} + V^{π̄}_{P*,c*} = τ*,

where (i) follows from Equations (15) and (16). This implies that π_γ is a feasible solution of the optimization problem solved in the planning phase.
By the optimality of π̂ and the linearity of V^π_{P̂_ϵ,r*}, we have

V^{π̂}_{P̂_ϵ,r*} ≥ V^{π_γ}_{P̂_ϵ,r*} ≥ γV^{π*}_{P̂_ϵ,r*} =(i) V^{π*}_{P̂_ϵ,r*} - (3U/∆_{c*})V^{π*}_{P̂_ϵ,r*} ≥ V^{π*}_{P̂_ϵ,r*} - 3U/∆_min, (17)

where (i) follows because γ = (∆_{c*} - 3U)/∆_{c*}, and the last inequality follows from the normalization condition and Assumption 1. Recall that U ≤ ∆_min ϵ/5. Therefore, the suboptimality gap under π̂ can be bounded as follows:

V^{π*}_{P*,r*} - V^{π̂}_{P*,r*} ≤ (V^{π*}_{P̂_ϵ,r*} + U(π*)) - V^{π̂}_{P̂_ϵ,r*} + U(π̂) ≤(i) 3U/∆_min + 2U ≤ ϵ,

where (i) follows from Equation (17).
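The recursion x_{n+1} = 2T/(∆̄_c - 2x_n) used in Step 2 converges to the smaller root of the fixed-point equation 2x² - ∆̄_c x + 2T = 0, which can be checked numerically (a sketch; all names are ours):

```python
import math

def recursion_limit(T, delta_bar, n_iter=200):
    """Iterate x_{n+1} = 2T / (delta_bar - 2 x_n) starting from
    x_1 = 2T / delta_bar and return the (numerical) limit."""
    x = 2 * T / delta_bar
    for _ in range(n_iter):
        x = 2 * T / (delta_bar - 2 * x)
    return x

# The fixed-point equation 2x^2 - delta_bar * x + 2T = 0 has stable root
# (delta_bar - sqrt(delta_bar**2 - 16 * T)) / 4, the limit of the sequence.
```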

A.3 PROOF OF CONCAVITY OF THE TRUNCATED VALUE FUNCTION

In this subsection, we show that the truncated value function defined in Equation (4) is concave and continuous on the Markov policy space X. These properties are crucial for instantiating our theorem for Tabular-SWEET and Low-rank-SWEET.

Lemma 3 (Concavity of the truncated value function). Let π_γ be the equivalent Markov policy of γπ ⊕ (1-γ)π′ under a transition model P. Then, V̄^{π_γ}_{P,u} ≥ γV̄^π_{P,u} + (1-γ)V̄^{π′}_{P,u}. In addition, V̄^{π_γ}_{P,u} is continuous w.r.t. γ ∈ [0,1]. Moreover, if the utility function u satisfies the normalization condition, then the equality holds, i.e., V̄^{π_γ}_{P,u} = γV̄^π_{P,u} + (1-γ)V̄^{π′}_{P,u}.

Proof. Recall that the truncated value function is defined recursively as

Q̄^π_{h,P,u}(s_h, a_h) = u(s_h, a_h) + α P_h V̄^π_{h+1,P,u}(s_h, a_h), V̄^π_{h,P,u}(s_h) = min{1, E_π[Q̄^π_{h,P,u}(s_h, a_h)]},

with V̄^{α,π}_{H+1,P,u}(s_{H+1}) = 0. The continuity of the truncated value function is straightforward, since it is a composition of H continuous functions. Therefore, we focus on the concavity in the following analysis. By Lemma 16, the following equality holds for any utility function u and time step h:

E_{P,π_γ}[u_h(s_h, a_h)] = γE_{P,π}[u_h(s_h, a_h)] + (1-γ)E_{P,π′}[u_h(s_h, a_h)]. (18)

We then prove the claim by induction. First, we note that when h = H+1,

γE_{P,π}[V̄^π_{H+1,P,u}(s_{H+1})] + (1-γ)E_{P,π′}[V̄^{π′}_{H+1,P,u}(s_{H+1})] = 0 ≤ min{1, E_{P,π_γ}[Q̄^{π_γ}_{H+1,P,u}(s_{H+1}, a_{H+1})]}.

Assume the claim holds for step h+1, i.e.,

γE_{P,π}[V̄^π_{h+1,P,u}(s_{h+1})] + (1-γ)E_{P,π′}[V̄^{π′}_{h+1,P,u}(s_{h+1})] ≤ min{1, E_{P,π_γ}[Q̄^{π_γ}_{h+1,P,u}(s_{h+1}, a_{h+1})]}.

Then, for step h, by Jensen's inequality, we have

γE_{P,π}[V̄^π_{h,P,u}(s_h)] + (1-γ)E_{P,π′}[V̄^{π′}_{h,P,u}(s_h)]
≤ min{1, γE_{P,π}[Q̄^π_{h,P,u}(s_h, a_h)] + (1-γ)E_{P,π′}[Q̄^{π′}_{h,P,u}(s_h, a_h)]}
= min{1, γE_{P,π}[u(s_h, a_h)] + (1-γ)E_{P,π′}[u(s_h, a_h)] + αγE_{P,π}[V̄^π_{h+1,P,u}(s_{h+1})] + α(1-γ)E_{P,π′}[V̄^{π′}_{h+1,P,u}(s_{h+1})]}
≤(i) min{1, E_{P,π_γ}[u(s_h, a_h)] + αE_{P,π_γ}[Q̄^{π_γ}_{h+1,P,u}(s_{h+1}, a_{h+1})]}
= min{1, E_{P,π_γ}[Q̄^{π_γ}_{h,P,u}(s_h, a_h)]},

where (i) follows from Equation (18) and the induction hypothesis. Therefore, at step h = 1, we have

γV̄^π_{P,u} + (1-γ)V̄^{π′}_{P,u} ≤ min{1, E_{π_γ}[Q̄^{π_γ}_{P,u}(s_1, a_1)]} = V̄^{π_γ}_{P,u}.

If u satisfies the normalization condition, then Q^π_{h,P,u}(s_h, a_h) ≤ 1 holds for any π and h. By the definition of the truncated value function, we then have Q̄^π_{h,P,u}(s_h, a_h) = Q^π_{h,P,u}(s_h, a_h) and V̄^π_{h,P,u}(s_h) = V^π_{h,P,u}(s_h), which implies that the truncated value function and the true value function are identical. By Lemma 16, π_γ induces the same marginal probability over any state-action pair as the mixture policy γπ ⊕ (1-γ)π′. Therefore, when u is normalized,

γV̄^π_{P,u} + (1-γ)V̄^{π′}_{P,u} = γV^π_{P,u} + (1-γ)V^{π′}_{P,u} = V^{π_γ}_{P,u} = V̄^{π_γ}_{P,u},

which completes the proof.
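The truncated recursion analyzed above can be implemented directly in the tabular case (a sketch, assuming the array layout below; all names are ours):

```python
import numpy as np

def truncated_value(P, u, pi, alpha=1.0):
    """Truncated value function: Qbar_h = u_h + alpha * P_h Vbar_{h+1},
    Vbar_h = min(1, E_pi[Qbar_h]), with Vbar_{H+1} = 0.

    P: (H, S, A, S) transition kernel, u: (H, S, A) utility,
    pi: (H, S, A) Markov policy; returns Vbar_1 for every initial state."""
    H, S, A, _ = P.shape
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = u[h] + alpha * (P[h] @ V)           # Qbar_h[s, a]
        V = np.minimum(1.0, (pi[h] * Q).sum(axis=1))
    return V
```

When u is normalized so that every Q-value stays below 1, the truncation never binds and the recursion returns the ordinary value function, matching the equality case of Lemma 3.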

B ANALYSIS OF TABULAR-SWEET

In this section, we first elaborate on the Tabular-SWEET algorithm in Appendix B.1. To analyze this algorithm, we then provide several supporting lemmas in Appendix B.2, and finally prove Theorem 2 in Appendix B.3.

B.1 THE TABULAR-SWEET ALGORITHM

We first specify the parameters in Tabular-SWEET; the details of Tabular-SWEET are shown in Algorithm 2. Let U = min{ϵ/2, ∆_min/2, ϵ∆_min/5, τ/4, κ/16}, and let T = ∆(c,τ)U/2 be the termination condition of Tabular-SWEET. Let the maximum iteration number N be the solution of the following equation:

N = 2¹⁰e³·30²·βHSA log(N+1)/(∆(c,τ)²U²) + 2¹⁵e³·βHSA log(N+1)/κ², (19)

where β = log(2SAH/δ) + S log(e(1+N)). Recall that the estimated model is computed by

P̂_h^{(n)}(s_{h+1}|s_h, a_h) = N_h^{(n)}(s_h, a_h, s_{h+1})/N_h^{(n)}(s_h, a_h), if N_h^{(n)}(s_h, a_h) > 1, and P̂_h^{(n)}(s_{h+1}|s_h, a_h) = 1/S, otherwise, (20)

where N_h^{(n)}(s_h, a_h) and N_h^{(n)}(s_h, a_h, s_{h+1}) denote the numbers of visits of (s_h, a_h) and (s_h, a_h, s_{h+1}) up to the n-th episode, respectively. Then, the exploration-driven virtual reward is defined as

b̂_h^{(n)}(s_h, a_h) = β₀H/N_h^{(n)}(s_h, a_h), where β₀ = 8β. (21)

The approximation error bound is a concave function of the truncated value function, defined as U^{(n)}(π) = 4√(V̄^{α_H,π}_{P̂^{(n)}, b̂^{(n)}}). Since ϵ₀ = t = 0 and κ̃ = κ/2, the safety set C^{(n)} is given by

C^{(n)} = {π⁰}, if V^{π⁰}_{P̂^{(n)},c} + U^{(n)}(π⁰) ≥ τ - κ/2, and C^{(n)} = {π : V^π_{P̂^{(n)},c} + U^{(n)}(π) ≤ τ}, otherwise. (22)

Algorithm 2 Tabular-SWEET
1: Input: baseline policy π⁰, dataset D = ∅, constants τ, κ, α_H = (H+1)/H, T = ∆(c,τ)U/2.
2: // Exploration:
3: for n = 1, ..., N do
4:   Use π^{(n-1)} to collect {s_1^{(n)}, ..., a_H^{(n)}}; D ← D ∪ {(s_h^{(n)}, a_h^{(n)}, s_{h+1}^{(n)})}_{h=1}^H;
5:   Estimate P̂^{(n)}; update b̂_h^{(n)}, U^{(n)}(π) and the empirical safe policy set C^{(n)} (Equations (20) to (22));
6:   Solve π^{(n)} = argmax_{π∈C^{(n)}} U^{(n)}(π);
7:   if |C^{(n)}| > 1 and U^{(n)}(π^{(n)}) ≤ T then
8:     n_ϵ, P̂_ϵ, b̂_ϵ ← n, P̂^{(n)}, b̂^{(n)}; break;
9:   end if
10: end for
11: // Planning:
12: Receive reward function r* and safety constraint (c*, τ*);
13: Output: π̂ = argmax_π V^π_{P̂_ϵ,r*} s.t. V^π_{P̂_ϵ,c*} + 4√(V̄^{α_H,π}_{P̂_ϵ, b̂_ϵ}) ≤ τ*.
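The count-based estimate and bonus of Equations (20) and (21) can be sketched as follows (our own illustration; we treat every pair with at least one visit as estimated from counts, and floor the count at 1 so unvisited pairs get the largest bonus):

```python
import numpy as np

def tabular_estimate(transitions, S, A, H, beta0):
    """Count-based model estimate Phat_h(s'|s,a) = N_h(s,a,s') / N_h(s,a)
    for visited pairs (uniform 1/S otherwise) and exploration bonus
    b_h(s,a) = beta0 * H / N_h(s,a).

    transitions: iterable of (h, s, a, s_next) index tuples.
    """
    N = np.zeros((H, S, A))
    Nsas = np.zeros((H, S, A, S))
    for (h, s, a, sn) in transitions:
        N[h, s, a] += 1
        Nsas[h, s, a, sn] += 1
    Phat = np.full((H, S, A, S), 1.0 / S)     # default: uniform over s'
    visited = N >= 1
    Phat[visited] = Nsas[visited] / N[visited][:, None]
    bonus = beta0 * H / np.maximum(N, 1.0)
    return Phat, bonus
```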

B.2 SUPPORTING LEMMAS

First, denote Var_{P_h}[V^π_{h+1,P′,u}](s_h, a_h) as the variance of the value function V^π_{h+1,P′,u}(s_{h+1}), where s_{h+1} follows the distribution P_h(·|s_h, a_h), i.e.,

Var_{P_h}[V^π_{h+1,P′,u}](s_h, a_h) = E_{P_h}[(V^π_{h+1,P′,u}(s_{h+1}) - P_h V^π_{h+1,P′,u}(s_h, a_h))² | s_h, a_h]. (23)

Then, we have the following lemma.

Lemma 4 (Lemma 3 in Ménard et al. (2021)). Let ρ^π_{*,h}(s_h, a_h) be the marginal probability over the state-action pair (s_h, a_h) induced by policy π under the true environment P*. Suppose the utility function u satisfies the normalization condition. Denote

E₀ = {∀n, h, s_h, a_h : D_KL(P̂_h^{(n)}(·|s_h, a_h) ∥ P*_h(·|s_h, a_h)) ≤ β/N_h^{(n)}(s_h, a_h)},
E₁ = {∀n, h, s_h, a_h : N_h^{(n)}(s_h, a_h) ≥ (1/2) Σ_{m=0}^{n-1} ρ^{π^{(m)}}_{*,h}(s_h, a_h) - β₁},
E₂ = {∀n, h, s_h, a_h : |(P̂_h^{(n)} - P*_h)V^π_{h+1,P*,u}(s_h, a_h)| ≤ √(2β₂ Var_{P*_h}[V_{h+1,P*,u}](s_h, a_h)/N_h^{(n)}(s_h, a_h)) + 3β₂/N_h^{(n)}(s_h, a_h)},

where β = log(3SAH/δ) + S log(8e(1+N)), β₁ = log(3SAH/δ), and β₂ = log(3SAH/δ) + log(8e(1+N)). Note that β ≥ β₂ ≥ β₁. Let E = E₀ ∩ E₁ ∩ E₂. Then, we have P[E] ≥ 1 - δ.

The following lemma shows the relationship between the visitation counts N_h^{(n)}(s_h, a_h) and the pseudo-counts Σ_{m=0}^{n-1} ρ^{π^{(m)}}_{*,h}(s_h, a_h), where ρ^π_{*,h}(s_h, a_h) is the marginal distribution over (s_h, a_h) induced by policy π under the true model P*.

Lemma 5 (Lemma 7 in Kaufmann et al. (2021), Lemma 8 in Ménard et al. (2021)). On the event E, we have

min{β/N_h^{(n)}(s_h, a_h), 1} ≤ 4β/max{Σ_{m=0}^{n-1} ρ^{π^{(m)}}_{*,h}(s_h, a_h), 1}.

In addition, we generalize Lemma 7 in Ménard et al. (2021) from deterministic policies to randomized policies. This is important for safe RL, as the optimal policy in constrained RL is possibly randomized.

Lemma 6 (Law of total variance with randomized policy).
Given model P , policy π, and normalized utility function u, define another utility function σ h (s h , a h ) as σ h (s h , a h ) = Var P h V π h+1,P,u (s h , a h ). Then, for any Markov policy π and h ∈ [H], the following bound holds: E π      Q π h,P,u (s h , a h ) - h ′ ≥h u(s h ′ , a h ′ )   2 s h , a h    ≥ Q π h,P,σ (s h , a h ). ( ) In particular, when h = 1, we have 1 ≥ E π      Q π P,u (s 1 , a 1 ) - h≥1 u(s h , a h )   2 s 1    ≥ E π Q π P,σ (s 1 , a 1 )|s 1 = h≥1 s h ,a h ρ π h (s h , a h )σ h (s h , a h ) = h≥1 s h ,a h ρ π h (s h , a h )Var P h V π h+1,P,u (s h , a h ), where ρ π h (s h , a h ) is the marginal distribution over state-action pair (s h , a h ) induced by policy π under model P . Proof. First, we note that the statement is trivial for h = H + 1 since all Q-value functions are 0. Then, we prove the result through induction. Assume that at time step h + 1, E π      Q π h+1,P,u (s h+1 , a h+1 ) - h ′ ≥h+1 u(s h ′ , a h ′ )   2 s h+1 , a h+1    ≥ Q π h+1,P,σ (s h+1 , a h+1 ). Then, at time step h, the LHS of Equation ( 24) can be computed as follows. E π      Q π h,P,u (s h , a h ) - h ′ ≥h u(s h ′ , a h ′ )   2 s h , a h    = E π      P h V π h+1 (s h , a h ) - h ′ ≥h+1 u(s h ′ , a h ′ ) + Q π h+1,P,u (s h+1 , a h+1 ) -Q π h+1,P,u (s h+1 , a h+1 )   2 s h , a h    = E π      Q π h+1,P,u (s h+1 , a h+1 ) - h ′ ≥h+1 u(s h ′ , a h ′ )   2 s h , a h    + E π Q π h+1,P,u (s h+1 , a h+1 ) -P h V π h+1 (s h , a h ) 2 s h , a h + 2E π      Q π h+1,P,u (s h+1 , a h+1 ) - h ′ ≥h+1 u(s h ′ , a h ′ )   Q π h+1,P,u (s h+1 , a h+1 ) -P h V π h+1 (s h , a h ) s h , a h    . The term within the expectation in the third term equals 0 if we further condition it on s h+1 , a h+1 , indicating that the third term is 0. 
Therefore, from the assumption, we have E π      Q π h,P,u (s h , a h ) - h ′ ≥h u(s h ′ , a h ′ )   2 s h , a h    ≥ E π Q h+1,P,σ (s h+1 , a h+1 ) s h , a h + E π   E a h+1 ∼π Q π h+1,P,u (s h+1 , a h+1 ) -P h V π h+1 (s h , a h ) 2 s h+1 s h , a h   (i) ≥ P h V h+1,P,σ (s h , a h ) + E π V π h+1,P,u (s h+1 ) -P h V π h+1 (s h , a h ) 2 s h , a h = σ h (s h , a h ) + P h V h+1,P,σ (s h , a h ) = Q π h,P,σ (s h , a h ) , where (i) follows from Jensen's inequality. Thus, Equation ( 24) holds for all step h, and the proof is completed. The following lemma is the key to ensure that Tabular-SWEET satisfies the termination condition. Lemma 7. On the event E, the summation of V α H ,π (n) P n , b(n) over any subset N ⊂ [N ] scales in the order of log |N | , i.e. n∈N V α H ,π (n) P (n) , b(n) ≤ 64e 3 βHSA log(1 + |N |). Proof. First, similar to the truncated value function, we extend the definitions of value function to incorporate the additional factor α H . Specifically, ∀h ∈ [H], Q α H ,π h,P,u = u(s h , a h ) + α H P h V α H ,π h+1,P,u , V α H ,π h,P,u = E π Q α H ,π h,P,u , and V α H ,π H+1,P,u = 0. We then examine the difference between the truncated Q-value function defined with respect to model P (n) and the Q-value function defined with respect to model P * . Qα H ,π h, P (n) , b(n) (s h , a h ) -Q α H ,π h,P * , b(n) (s h , a h ) = α H P (n) V α H ,π h+1, P (n) , b(n) (s h , a h ) -α H P * h V α H ,π h+1,P * , b(n) (s h , a h ) = α H P (n) h -P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) + α H P * h V α H ,π h+1, P (n) , b(n) -V α H ,π h+1,P * , b(n) (s h , a h ). By Lemma 10 in Ménard et al. (2021) , we bound the first term as follows. 
P (n) h -P * h V α H ,π h, P (n) , b(n) (s h , a h ) ≤ min    1, 2Var P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) β N (n) h (s h , a h ) + 2β 3N (n) h (s h , a h )    (i) ≤ Var P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) H + min 1, (2 + H/2)β 3N (n) h (s h , a h ) (ii) ≤ P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) H + min b(n) h (s h , a h )/8, 1 , where (i) follows from √ 2AB ≤ A/H + BH/2, and (ii) is due to the truncated value function is at most 1 and Var(X) ≤ E[X] if X ∈ [0, 1]. Therefore, by combining the above two inequalities and taking expectation, we have, V α H ,π h, P (n) , b(n) (s h ) ≤ E π Qα H ,π h, P (n) , b(n) (s h , a h )|s h ≤ E π min b(n) h (s h , a h ), 1 s h + α H E π min b(n) h (s h , a h )/8, 1 s h + E π (α H + α H H )P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) s h ≤ E π 2 min b(n) h (s h , a h ), 1 + 1 + 3 H P * h V α H ,π h+1, P (n) , b(n) (s h , a h ) s h . Telescoping the above inequality from h = 1 to H and defining b (n) h (s h , a h ) = min b(n) h (s h , a h ), 1 , we get V α H ,π P (n) , b(n) ≤ V 1+3/H,π P * ,2b (n) ≤ 2e 3 V π P * , b(n) . Therefore, if ρ π (n) * ,h (s h , a h ) is the marginal distribution over state-action pairs induced by exploration policy π (n) under the true model P * , we have n∈N V α H ,π (n) P (n) , b(n) ≤ n∈N 2e 3 V π (n) P * ,b n ≤ 2e 3 n∈N H h=1 E P * ,π (n)   min 8Hβ N (n) (s h , a h ) , 1   = 2e 3 n∈N H h=1 s h ,a h ρ π (n) * ,h (s h , a h ) min 8Hβ N (n) (s h , a h ) , 1 (i) ≤ 2e 3 H h=1 s h ,a h n∈N ρ π (n) * ,h (s h , a h ) 8Hβ max 1, n-1 m=0 ρ πm * ,h (s h , a h ) ≤ 16e 3 Hβ H h=1 s h ,a h n∈N ρ π (n) * ,h (s h , a h ) max 1, m∈N ,m<n ρ πm * ,h (s h , a h ) (ii) ≤ 64e 3 Hβ H h=1 s h ,a h log   1 + n∈N ρ π (n) * ,h (s h , a h )   (iii) ≤ 64e 3 βHSA log(1 + |N |), where (i) is due to Lemma 5, (ii) follows from Lemma 18, and (iii) follows the fact that ρ π(m) * ,h (s h , a h ) ≤ 1. Therefore, n∈N V α H ,π (n) P (n) , b(n) ≤ 64e 3 βHSA log(1 + |N |).

B.3 PROOF OF THEOREM 2

Theorem 5 (Complete version of Theorem 2). Given ϵ, δ ∈ (0, 1) and safety constraint (c, τ), let U = min{ϵ/2, ∆_min/2, ϵ∆_min/5, τ/4, κ/16}, and let T = ∆(c,τ)U/2 be the termination condition of Tabular-SWEET. Then, with probability at least 1 - δ, Tabular-SWEET achieves the learning objective of safe reward-free exploration (Equations (1) and (2)), and the number of trajectories collected in the exploration phase is at most

O( βHSAι/(∆(c,τ)²U²) + βHSAι/κ² ),

where ι = log(βHSA/(∆(c,τ)²U²) + βHSA/κ²), and β = log(2SAH/δ) + S log(e(1+N)).

Proof. The proof of Theorem 2 mainly instantiates Theorem 1 by verifying that (a) U^{(n)}(π) = 4√(V̄^{α_H,π}_{P̂^{(n)}, b̂^{(n)}}) is a valid approximation error bound for V^π_{P̂^{(n)},u}, and (b) Tabular-SWEET satisfies the termination condition within N episodes. The proof consists of three steps, with the first two steps verifying the above two conditions and the last step characterizing the sample complexity.

Step 1: This step establishes the following lemma, which shows that U^{(n)}(π) = 4√(V̄^{α_H,π}_{P̂^{(n)}, b̂^{(n)}}) is a valid approximation error bound.

Lemma 8. With α_H = 1 + 1/H as defined in Tabular-SWEET (Algorithm 2), on the event E, for any policy π and any normalized utility function u, |V^π_{P̂^{(n)},u} - V^π_{P*,u}| ≤ 4√(V̄^{α_H,π}_{P̂^{(n)}, b̂^{(n)}}).

Proof. Recall that b̂_h^{(n)} = β₀H/N_h^{(n)}(s_h, a_h), where β₀ = 8β. Define the utility function u^v as

u^v_h(s_h, a_h) = √( Var_{P̂_h^{(n)}}[V^π_{h+1,P̂^{(n)},u}](s_h, a_h) · min{8β/N_h^{(n)}(s_h, a_h), 1}/H ).

Following

Step 1 of Lemma 1 in Ménard et al. (2021) , we get V π P (n) ,u -V π P * ,u ≤ V α H ,π P (n) ,u v + V α H ,π P (n) , b(n) . Next, we aim to show that V α H ,π P (n) ,u v ≤ e V α H ,π P (n) , b(n) . For that, let ρπ (s h , a h ) be the marginal distribution over state-action pair (s h , a h ) induced by model P (n) and policy π. Note that the truncated value function is a lower bound of the corresponding value function. Thus, we can expand V α H ,π P (n) ,u v as follows: V α H ,π P (n) ,u v = H h=1 s h ,a h α h-1 H ρπ (s h , a h )u v h (s h , a h ) (i) ≤ e H h=1 s h ,a h ρπ (s h , a h ) Var P (n) h V π h+1, P (n) ,u (s h , a h ) min 8β N (n) h (s h , a h ) , 1 H (ii) ≤ e H h=1 s h ,a h ρπ (s h , a h )Var P (n) h V π h+1, P (n) ,u (s h , a h ) H h=1 s h ,a h ρπ (s h , a h ) min 8β N (n) h (s h , a h ) , 1 H , where (i) follows from the fact that (1 + 1/H) H ≤ e and (ii) follows from Cauchy-Schwarz inequality. Note that in contrast to the optimistic policy, π could be a randomized policy in general. By Lemma 6, we have H h=1 s h ,a h ρπ (s h , a h )Var P (n) h V π h+1, P (n) ,u (s h , a h ) ≤ 1. Meanwhile, if we define u b h (s h , a h ) = min 8β N (n) h (s h ,a h ) , 1 H , which is obviously a normalized utility function, then, we have H h=1 s h ,a h ρπ (s h , a h ) min 8β N (n) h (s h , a h ) , 1 H = V π P (n) ,u b ≤ V α H ,π P (n) , b(n) , where the last inequality follows from the facts that u b h (s h , a h ) ≤ b(n) h (s h , a h ) and α H > 1. Thus, we have Equation ( 26) established. Combining Equations ( 25) and ( 26), we conclude that V π P (n) ,u -V π P * ,u ≤ e V α H ,π P (n) , b(n) + V α H ,π P (n) , b(n) (i) ≤ (1 + e) V α H ,π P (n) , b(n) ≤ 4 V α H ,π P (n) , b(n) , where (i) is due to the fact that the truncated value function is at most 1. Step 2: This step establishes the following lemma, which shows that Tabular-SWEET will terminate within N episodes. Lemma 9. 
On the event $\mathcal E$, there exists $n_\epsilon \in [N]$ such that $|\mathcal C^{(n_\epsilon)}| > 1$ and $V^{\alpha_H,\pi^{(n_\epsilon)}}_{\widehat P^{(n_\epsilon)},\widehat b^{(n_\epsilon)}} \le T^2/16$, where $N$ is defined in Equation (19) and $T$ is defined in Tabular-SWEET (Algorithm 2).

Proof. Denote $\mathcal N_0 = \{n \in [N] : \pi^{(n)} = \pi_0\}$. We first prove that $\mathcal N_0$ is finite. Note that for all $n \in \mathcal N_0$,
$$V^{\pi_0}_{\widehat P^{(n)},c} + 4\sqrt{V^{\alpha_H,\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}}} \ge \tau - \kappa/2.$$
By Lemmas 7 and 8, we have
$$|\mathcal N_0|\,\kappa/2 \le \sum_{n\in\mathcal N_0}\Big(V^{\pi_0}_{\widehat P^{(n)},c} + 4\sqrt{V^{\alpha_H,\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}}} - V^{\pi_0}_{P^*,c}\Big) \le \sum_{n\in\mathcal N_0} 8\sqrt{V^{\alpha_H,\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}}} \le 64\sqrt{e^3|\mathcal N_0|\beta HSA\log(1+N)},$$
where the last inequality is due to the Cauchy-Schwarz inequality. Therefore, $|\mathcal N_0| \le \frac{2^{14}e^3\beta HSA\log(N+1)}{\kappa^2}$.

Then, we prove Lemma 9 by contradiction. Assume $V^{\alpha_H,\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} > T^2/16$ for all $n \in [N]\setminus\mathcal N_0$. According to Lemma 7, we have
$$(N - |\mathcal N_0|)T^2/16 < \sum_{n\in[N]\setminus\mathcal N_0} V^{\alpha_H,\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} \le 64e^3\beta HSA\log(N - |\mathcal N_0| + 1),$$
which implies that
$$N < \frac{2^{10}e^3\beta HSA\log(N+1)}{T^2} + \frac{2^{14}e^3\beta HSA\log(N+1)}{\kappa^2}.$$
This contradicts the condition that $N = \frac{2^{10}e^3\beta HSA\log(N+1)}{T^2} + \frac{2^{14}e^3\beta HSA\log(N+1)}{\kappa^2}$. Therefore, by noting that $U^{(n)}(\pi) = 4\sqrt{V^{\alpha_H,\pi}_{\widehat P^{(n)},\widehat b^{(n)}}}$, there exists $n_\epsilon \in [N]$ at which the exploration phase of Tabular-SWEET terminates.

Step 3: This step analyzes the sample complexity as follows. On the event $\mathcal E$, since $T = \Delta(c,\tau)U/2$, by Lemma 9, the sample complexity is at most
$$N = \frac{2^8\cdot30^2 e^3\beta HSA\log(N+1)}{\Delta(c,\tau)^2U^2} + \frac{2^{15}e^3\beta HSA\log(N+1)}{\kappa^2}.$$
Note that $n = c_0\log(c_1 n)$ implies $n \le 2c_0\log(c_0c_1)$. Thus,
$$N = O\Big(\frac{\beta HSA\,\iota}{\Delta(c,\tau)^2U^2} + \frac{\beta HSA\,\iota}{\kappa^2}\Big), \quad \text{where } \iota = \log\Big(\frac{\beta HSA}{\Delta(c,\tau)^2U^2} + \frac{\beta HSA}{\kappa^2}\Big).$$
Therefore, Tabular-SWEET terminates in finitely many episodes. Besides, $U(\pi) = 4\sqrt{V^{\alpha_H,\pi}_{\widehat P_\epsilon,\widehat b_\epsilon}}$. On the event $\mathcal E$, by Lemma 3, Lemma 8, and the concavity of $\sqrt{x}$, $U(\pi)$ is an approximation error function under $\widehat P_\epsilon$, and is concave and continuous on $\mathcal X$. We further note that $\frac{(\kappa/2)(\Delta(c,\tau)-\kappa/2)}{4\Delta(c,\tau)} \ge \frac{\kappa}{16}$ due to the condition $\Delta(c,\tau) \ge \kappa$, which indicates that $T = \Delta(c,\tau)U/2$ satisfies the requirement in Theorem 4.
Therefore, by Theorem 4, we conclude that with probability at least $1-\delta$, the exploration phase of Tabular-SWEET is safe and $\widehat\pi$ is an $\epsilon$-optimal policy subject to the safety constraint $(c^*,\tau^*)$.
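As a quick numerical sanity check of the implicit bound invoked in Step 3 above (that $n = c_0\log(c_1 n)$ implies $n \le 2c_0\log(c_0c_1)$), the following Python sketch solves the fixed-point equation by iteration and compares the solution against the closed-form bound. The constants are arbitrary test values; this is illustrative only and not part of the proof.

```python
import math

def implicit_n(c0, c1, iters=200):
    """Solve n = c0 * log(c1 * n) by fixed-point iteration.

    The iteration map has slope c0/n, so it converges once the
    iterates exceed c0, which holds for the starting point below.
    """
    n = c0 * c1  # starting point; any value > c0 works here
    for _ in range(iters):
        n = c0 * math.log(c1 * n)
    return n

checks = []
for c0 in (5.0, 50.0, 1e4):
    for c1 in (2.0, 100.0, 1e6):
        n_star = implicit_n(c0, c1)
        # residual of the implicit equation should be ~0 at the fixed point
        assert abs(n_star - c0 * math.log(c1 * n_star)) < 1e-6 * n_star
        checks.append(n_star <= 2 * c0 * math.log(c0 * c1))

assert all(checks)
```

The bound holds for every tested pair, consistent with the claim used to remove the implicit dependence of $N$ on $\log(N+1)$.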

C ANALYSIS OF LOW-RANK-SWEET

In this section, we first elaborate on the Low-rank-SWEET algorithm in Appendix C.1. We then provide several supporting lemmas in Appendix C.2, and finally prove Theorem 3 in Appendix C.3.

C.1 THE LOW-RANK-SWEET ALGORITHM

We first specify the parameters adopted in Low-rank-SWEET, which is presented in Algorithm 3. Let $U = \min\{\epsilon/2, \Delta_{\min}/2, \epsilon\Delta_{\min}/5, \tau/6, \kappa/24\}$, and let $T = \Delta(c,\tau)U/3$ be the termination threshold of Low-rank-SWEET. Recall that we set $\epsilon_0 = \kappa/6$, $t = 2$, and $\tilde\kappa = \kappa/3$. We define the maximum number of iterations $N$ as
$$N = \frac{2^{10}\beta_3H^2d^4\tilde A^2\zeta^2}{T^2} + \frac{2^{12}\cdot3^2\,\beta_3H^2d^4\tilde A^2\zeta^2}{\kappa^2},$$
where $\zeta = \log(2|\Phi||\Psi|NH/\delta)$ and $\beta_3$ is defined in Lemma 10. Besides, we set $\tilde A = A/\epsilon_0$ and $\widehat\alpha = 5\sqrt{\beta_3\zeta(\tilde A + d^2)}$. For ease of exposition, we introduce the $(\epsilon_0,t)$-greedy version of a policy $\pi$, denoted $G^{\epsilon_0}_{\mathcal H}\pi$, as follows:
$$G^{\epsilon_0}_{\mathcal H}\pi(a_h|s_h) = \begin{cases}\dfrac{\epsilon_0}{|\mathcal A|} + (1-\epsilon_0)\pi(a_h|s_h), & \text{if } h \in \mathcal H,\\[4pt] \pi(a_h|s_h), & \text{if } h \notin \mathcal H,\end{cases}$$
where $|\mathcal H| = t$. Intuitively, at each time step $h \in \mathcal H$, $G^{\epsilon_0}_{\mathcal H}\pi$ follows $\pi$ with probability $1-\epsilon_0$ and selects an action uniformly at random with probability $\epsilon_0$. We also define $\Pi_n = \mathrm{Unif}\{\pi^{(m)}\}_{m=0}^{n-1}$, where $\mathrm{Unif}(\mathcal X_0)$ is the mixture policy that uniformly chooses one policy from the policy set $\mathcal X_0 \subset \mathcal X$. We use $G^{\epsilon_0}_{\mathcal H}\Pi_n$ to denote the $(\epsilon_0,|\mathcal H|)$-greedy version of $\Pi_n$.
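To make the $(\epsilon_0,t)$-greedy construction concrete, the following Python sketch samples actions from $G^{\epsilon_0}_{\mathcal H}\pi$ for a finite action set. All names here are hypothetical, and the base policy is simplified to a deterministic sampler; the sketch only illustrates the mixture $\epsilon_0/|\mathcal A| + (1-\epsilon_0)\pi(a_h|s_h)$ at the designated time steps.

```python
import random

def greedy_mix_action(pi, eps0, greedy_steps, h, state, actions):
    """Sample an action from the (eps0, t)-greedy version of policy `pi`.

    At steps in `greedy_steps`, with probability eps0 the action is drawn
    uniformly from `actions`; otherwise (and at all other steps) the base
    policy is followed.
    """
    if h in greedy_steps and random.random() < eps0:
        return random.choice(actions)  # uniform exploration
    return pi(state, h)                # follow the base policy

# Hypothetical base policy over actions {0, 1}: always picks action 0.
base_pi = lambda state, h: 0

random.seed(0)
# Mix uniform exploration into step h=1 only, with eps0 = 0.5.
samples = [greedy_mix_action(base_pi, 0.5, {1}, 1, None, [0, 1])
           for _ in range(10_000)]
# The empirical frequency of action 1 should be close to eps0/|A| = 0.25,
# matching the mixture probability eps0/|A| + (1 - eps0) * pi(1|s) with pi(1|s) = 0.
frac_one = sum(samples) / len(samples)
```

With ten thousand samples, the empirical frequency concentrates around the mixture probability $0.25$, which is exactly the $\epsilon_0/|\mathcal A|$ term of the definition since the base policy never selects action 1.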

C.2 SUPPORTING LEMMAS

We first characterize the following high-probability event.

Lemma 10. Denote $f^{(n)}_h(s_h,a_h) = \|\widehat P^{(n)}_h(\cdot|s_h,a_h) - P^*_h(\cdot|s_h,a_h)\|_1$ and
$$U^{(n)}_{h,\phi} = n\,\mathbb E_{s_h\sim(P^*,\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[\phi(s_h,a_h)\phi(s_h,a_h)^\top\big] + \lambda I,$$
where $\lambda = \beta_3 d\log(2NH|\Phi|/\delta)$ and $\beta_3 = O(1)$. Define the events $\mathcal E_0$ and $\mathcal E_1$ as
$$\mathcal E_0 = \Big\{\forall n\in[N], h\in[H]:\ \mathbb E_{s_h\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[f^{(n)}_h(s_h,a_h)^2\big] \le \zeta/n\Big\},$$
$$\mathcal E_1 = \Big\{\forall n\in[N], h\in[H], s\in\mathcal S, a\in\mathcal A:\ \tfrac15\big\|\widehat\phi^{(n)}_{h-1}(s,a)\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}} \le \big\|\widehat\phi^{(n)}_{h-1}(s,a)\big\|_{(\widehat U^{(n)}_{h-1})^{-1}} \le 3\big\|\widehat\phi^{(n)}_{h-1}(s,a)\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}}\Big\},$$
where $\zeta = \log(2|\Phi||\Psi|NH/\delta)$. Denote $\mathcal E := \mathcal E_0\cap\mathcal E_1$. Then, $\mathbb P[\mathcal E] \ge 1-\delta$.

Proof. By Corollary 2 in Appendix D, we have $\mathbb P[\mathcal E_0] \ge 1-\delta/2$. Further, by Lemma 39 in Zanette et al. (2020a) (in the version with a fixed $\phi$) and Lemma 11 in Uehara et al. (2021), we have $\mathbb P[\mathcal E_1] \ge 1-\delta/2$. Therefore, $\mathbb P[\mathcal E] \ge 1-\delta$.

Algorithm 3 Low-rank-SWEET (recovered steps)
1: Input: constants $\epsilon_0 = \kappa/6$, $\tilde A = A/\epsilon_0$, and termination threshold $T$. Define the exploration-driven reward function $\widehat b^{(n)}_h(\cdot,\cdot) = \min\big\{\widehat\alpha\|\widehat\phi^{(n)}_h(\cdot,\cdot)\|_{(\widehat U^{(n)}_h)^{-1}}, 1\big\}$.
11: Define $U^{(n)}(\pi) = \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}$ and
$$\mathcal C^{(n)}_L = \begin{cases}\{\pi_0\}, & \text{if } V^{\pi_0}_{\widehat P,c} + U^{(n)}(\pi_0) \ge \tau - 2\kappa/3,\\ \big\{\pi : V^{\pi}_{\widehat P,c} + U^{(n)}(\pi) \le \tau - \kappa/3\big\}, & \text{otherwise}.\end{cases}\qquad(30)$$
12: Solve $\pi^{(n)} = \arg\max_{\pi\in\mathcal C^{(n)}_L} U^{(n)}(\pi)$, where $\mathcal C^{(n)}_L$ is defined in Equation (30).
13: if $|\mathcal C^{(n)}_L| > 1$ and $U^{(n)}(\pi^{(n)}) \le T$ then
14: $(n_\epsilon, \widehat P_\epsilon, \widehat b_\epsilon) \leftarrow (n, \widehat P^{(n)}, \widehat b^{(n)})$; break

Based on Lemma 10, we can bound the exploration-driven reward in Low-rank-SWEET as follows.

Corollary 1. On the event $\mathcal E$, the following inequality holds for any $n\in[N]$, $h\in[H]$, $s_h\in\mathcal S$, $a_h\in\mathcal A$:
$$\min\Big\{\frac{\widehat\alpha}{5}\big\|\widehat\phi^{(n)}_h(s_h,a_h)\big\|_{(U^{(n)}_{h,\widehat\phi})^{-1}}, 1\Big\} \le \widehat b^{(n)}_h(s_h,a_h) \le 3\widehat\alpha\big\|\widehat\phi^{(n)}_h(s_h,a_h)\big\|_{(U^{(n)}_{h,\widehat\phi})^{-1}},$$
where $\widehat\alpha = 5\sqrt{\beta_3\zeta(\tilde A + d^2)}$.

Proof. Recall $\widehat b^{(n)}_h(s_h,a_h) = \min\big\{\widehat\alpha\|\widehat\phi^{(n)}_h(s,a)\|_{(\widehat U^{(n)}_h)^{-1}}, 1\big\}$. Applying Lemma 10 immediately yields the result.

The following lemma summarizes Lemmas 12 and 13 in Uehara et al. (2021) and generalizes them to $\epsilon_0$-greedy policies. We provide the proof for completeness.

Lemma 11.
Let $\bar P_{h-1} = \langle\bar\phi_{h-1},\bar\mu_{h-1}\rangle$ be a low-rank MDP model, and let $\Pi$ be an arbitrary, possibly mixture, policy. Define the expected Gram matrix
$$M_{h-1,\bar\phi} = \lambda I + n\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\big[\bar\phi_{h-1}(s_{h-1},a_{h-1})\bar\phi_{h-1}(s_{h-1},a_{h-1})^\top\big].$$
Further, let $\bar f_{h-1}(s_{h-1},a_{h-1})$ be the total variation distance between $P^*_{h-1}$ and $\bar P_{h-1}$ at time step $h-1$. Suppose $g:\mathcal S\times\mathcal A\to\mathbb R$ is bounded by $B\in(0,\infty)$, i.e., $\|g\|_\infty\le B$. Then, for $h\ge2$ and any policy $\pi_h$,
$$\mathbb E_{s_h\sim\bar P_{h-1},\,a_h\sim\pi_h}\big[g(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \big\|\bar\phi_{h-1}(s_{h-1},a_{h-1})\big\|_{(M_{h-1,\bar\phi})^{-1}}\sqrt{n\tilde A\,\mathbb E_{s_h\sim(P^*,\Pi),\,a_h\sim G^{\epsilon_0}_h\Pi}\big[g^2(s_h,a_h)\big] + \lambda dB^2 + nB^2\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\big[\bar f_{h-1}(s_{h-1},a_{h-1})^2\big]}.$$

Proof. We first derive the following bound:
$$\mathbb E_{s_h\sim\bar P_{h-1},\,a_h\sim\pi_h}\big[g(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] = \int_{s_h}\sum_{a_h}g(s_h,a_h)\pi_h(a_h|s_h)\big\langle\bar\phi_{h-1}(s_{h-1},a_{h-1}),\bar\mu_{h-1}(s_h)\big\rangle\,ds_h \le \big\|\bar\phi_{h-1}(s_{h-1},a_{h-1})\big\|_{(M_{h-1,\bar\phi})^{-1}}\Big\|\int_{s_h}\sum_{a_h}g(s_h,a_h)\pi_h(a_h|s_h)\bar\mu_{h-1}(s_h)\,ds_h\Big\|_{M_{h-1,\bar\phi}},$$
where the inequality follows from the Cauchy-Schwarz inequality. We further expand the second term on the right-hand side as follows:
$$\Big\|\int_{s_h}\sum_{a_h}g\,\pi_h\,\bar\mu_{h-1}(s_h)\,ds_h\Big\|^2_{M_{h-1,\bar\phi}} \overset{(i)}{\le} n\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\bigg[\Big(\int_{s_h}\sum_{a_h}g(s_h,a_h)\pi_h(a_h|s_h)\bar\mu_{h-1}(s_h)^\top\bar\phi_{h-1}(s_{h-1},a_{h-1})\,ds_h\Big)^2\bigg] + \lambda dB^2$$
$$= n\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\bigg[\Big(\mathbb E_{s_h\sim\bar P_{h-1},\,a_h\sim\pi_h}\big[g(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big]\Big)^2\bigg] + \lambda dB^2$$
$$\overset{(ii)}{\le} 2n\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\bigg[\mathbb E_{s_h\sim P^*_{h-1},\,a_h\sim\pi_h}\big[g(s_h,a_h)^2\,\big|\,s_{h-1},a_{h-1}\big]\bigg] + \lambda dB^2 + 2nB^2\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\big[\bar f_{h-1}(s_{h-1},a_{h-1})^2\big]$$
$$\overset{(iii)}{\le} n\tilde A\,\mathbb E_{s_h\sim(P^*,\Pi),\,a_h\sim G^{\epsilon_0}_h\Pi}\big[g^2(s_h,a_h)\big] + \lambda dB^2 + nB^2\,\mathbb E_{s_{h-1}\sim(P^*,\Pi),\,a_{h-1}\sim\Pi}\big[\bar f_{h-1}(s_{h-1},a_{h-1})^2\big],$$
where (i) follows from the assumption that $\|g\|_\infty\le B$, (ii) follows from Jensen's inequality and the fact that $\bar f_{h-1}$ is the total variation distance between $P^*_{h-1}$ and $\bar P_{h-1}$ at time step $h-1$, and (iii) holds since $G^{\epsilon_0}_h\Pi(\cdot|s_h) \ge \epsilon_0/A = 1/\tilde A$, which implies $\pi_h(\cdot|s_h) \le 1 \le \tilde A\,G^{\epsilon_0}_h\Pi(\cdot|s_h)$.
This finishes the proof.

Based on Lemma 11, we summarize three useful inequalities in the following lemma, which bridges the total variation $f^{(n)}_h$ and the exploration-driven reward $\widehat b^{(n)}_h$.

Lemma 12. Define $W^{(n)}_{h,\phi} = n\,\mathbb E_{s_h\sim(P^*,\Pi_n),\,a_h\sim\Pi_n}\big[\phi(s_h,a_h)\phi(s_h,a_h)^\top\big] + \lambda I$, where $\lambda = \beta_3 d\log(2NH|\Phi|/\delta)$. On the event $\mathcal E$, the following inequalities hold for any iteration $n$. When $h\ge2$,
$$\mathbb E_{s_h\sim\widehat P^{(n)}_{h-1},\,a_h\sim\pi}\big[f^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \alpha\big\|\widehat\phi^{(n)}_{h-1}(s_{h-1},a_{h-1})\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}},\qquad(34)$$
$$\mathbb E_{s_h\sim P^*_{h-1},\,a_h\sim\pi}\big[f^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \alpha\big\|\phi^*_{h-1}(s_{h-1},a_{h-1})\big\|_{(U^{(n)}_{h-1,\phi^*})^{-1}},\qquad(35)$$
$$\mathbb E_{s_h\sim P^*_{h-1},\,a_h\sim\pi}\big[\widehat b^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \gamma\big\|\phi^*_{h-1}(s_{h-1},a_{h-1})\big\|_{(W^{(n)}_{h-1,\phi^*})^{-1}},\qquad(36)$$
where $\alpha = \sqrt{\beta_3\zeta(\tilde A+d^2)}$ and $\gamma = 45\sqrt{\beta_3\zeta\tilde Ad(\tilde A+d^2)}$. When $h=1$,
$$\mathbb E_{a_1\sim\pi}\big[f^{(n)}_1(s_1,a_1)\big] \le \sqrt{\tilde A\zeta/n}, \qquad \mathbb E_{a_1\sim\pi}\big[\widehat b^{(n)}_1(s_1,a_1)\big] \le 15\alpha\sqrt{d\tilde A/n}.\qquad(37)$$

Proof. We start by showing Equation (34). On the event $\mathcal E$, we have
$$\mathbb E_{s_h\sim\widehat P^{(n)}_{h-1},\,a_h\sim\pi}\big[f^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \overset{(i)}{\le} \big\|\widehat\phi^{(n)}_{h-1}\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}}\sqrt{n\tilde A\,\mathbb E_{s_h\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[f^{(n)}_h(s_h,a_h)^2\big] + \lambda d + n\,\mathbb E_{s_{h-1}\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_{h-1}\sim G^{\epsilon_0}_{h-1}\Pi_n}\big[f^{(n)}_{h-1}(s_{h-1},a_{h-1})^2\big]}$$
$$\overset{(ii)}{\le} \big\|\widehat\phi^{(n)}_{h-1}\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}}\sqrt{n\tilde A\,\mathbb E_{s_h\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[f^{(n)}_h(s_h,a_h)^2\big] + \lambda d + n\tilde A\,\mathbb E_{s_{h-1}\sim(P^*,G^{\epsilon_0}_{h-2}\Pi_n),\,a_{h-1}\sim G^{\epsilon_0}_{h-1}\Pi_n}\big[f^{(n)}_{h-1}(s_{h-1},a_{h-1})^2\big]}$$
$$\overset{(iii)}{\le} \big\|\widehat\phi^{(n)}_{h-1}\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}}\sqrt{2\zeta\tilde A + \beta_3\zeta d^2} \le \alpha\big\|\widehat\phi^{(n)}_{h-1}\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}},$$
where (i) follows from Lemma 11 and the fact that $f^{(n)}_h(s_h,a_h)\le1$, (ii) follows from importance sampling at time step $h-2$, and (iii) follows from Lemma 10. Equation (35) follows from arguments similar to the above.
To obtain Equation (36), we first apply Lemma 11 and obtain
$$\mathbb E_{s_h\sim P^*_{h-1},\,a_h\sim\pi^{(n)}}\big[\widehat b^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \big\|\phi^*_{h-1}(s_{h-1},a_{h-1})\big\|_{(W^{(n)}_{h-1,\phi^*})^{-1}}\sqrt{n\tilde A\,\mathbb E_{s_h\sim(P^*,\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[\big(\widehat b^{(n)}_h(s_h,a_h)\big)^2\big] + \lambda d},$$
where we use the fact that $\widehat b^{(n)}_h(s_h,a_h)\le1$. We further bound the expectation term as follows:
$$n\,\mathbb E_{s_h\sim(P^*,\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[\big(\widehat b^{(n)}_h\big)^2\big] \le n\,\mathbb E\Big[\widehat\alpha^2\big\|\widehat\phi^{(n)}_h\big\|^2_{(\widehat U^{(n)}_h)^{-1}}\Big] \overset{(i)}{\le} 9\widehat\alpha^2\,n\,\mathbb E\Big[\big\|\widehat\phi^{(n)}_h\big\|^2_{(U^{(n)}_{h,\widehat\phi})^{-1}}\Big]$$
$$= 9\widehat\alpha^2\,\mathrm{tr}\Bigg(n\,\mathbb E\big[\widehat\phi^{(n)}_h(\widehat\phi^{(n)}_h)^\top\big]\Big(n\,\mathbb E\big[\widehat\phi^{(n)}_h(\widehat\phi^{(n)}_h)^\top\big] + \lambda I\Big)^{-1}\Bigg) \le 9\widehat\alpha^2\,\mathrm{tr}(I) = 9\widehat\alpha^2 d,$$
where (i) follows from Lemma 10, and $\mathrm{tr}(A)$ denotes the trace of a matrix $A$. Thus,
$$\mathbb E_{s_h\sim P^*_{h-1},\,a_h\sim\pi}\big[\widehat b^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1},a_{h-1}\big] \le \big\|\phi^*_{h-1}\big\|_{(W^{(n)}_{h-1,\phi^*})^{-1}}\sqrt{9\tilde A\widehat\alpha^2d + \lambda d} \le \gamma\big\|\phi^*_{h-1}\big\|_{(W^{(n)}_{h-1,\phi^*})^{-1}}.$$
In addition, for $h=1$, we have
$$\mathbb E_{a_1\sim\pi^{(n)}}\big[f^{(n)}_1(s_1,a_1)\big] \overset{(i)}{\le} \sqrt{\tilde A\,\mathbb E_{a_1\sim G^{\epsilon_0}_1\Pi_n}\big[f^{(n)}_1(s_1,a_1)^2\big]} \le \sqrt{\tilde A\zeta/n},$$
$$\mathbb E_{a_1\sim\pi^{(n)}}\big[\widehat b^{(n)}_1(s_1,a_1)\big] \overset{(ii)}{\le} \widehat\alpha\sqrt{\tilde A\,\mathbb E_{a_1\sim G^{\epsilon_0}_1\Pi_n}\big[\|\widehat\phi^{(n)}_1(s_1,a_1)\|^2_{(\widehat U^{(n)}_1)^{-1}}\big]} \le 3\widehat\alpha\sqrt{\tilde A\,\mathbb E_{a_1\sim G^{\epsilon_0}_1\Pi_n}\big[\|\widehat\phi^{(n)}_1(s_1,a_1)\|^2_{(U^{(n)}_{1,\widehat\phi})^{-1}}\big]} \le 3\widehat\alpha\sqrt{\tilde Ad/n} = 15\alpha\sqrt{d\tilde A/n},$$
where both (i) and (ii) follow from Jensen's inequality and importance sampling.

The following lemma is key to ensuring that Low-rank-SWEET terminates within finitely many episodes.

Lemma 13. On the event $\mathcal E$, the sum of the truncated value functions $\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}}$ under the exploration policies $\{\pi^{(n)}\}_{n\in\mathcal N}$ is sublinear in $|\mathcal N|$ for any $\mathcal N\subset[N]$; specifically,
$$\sum_{n\in\mathcal N}\Big(\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}\Big) \le 32\zeta Hd^2\tilde A\sqrt{\beta_3|\mathcal N|}.$$

Proof. Note that $\widehat V^{\pi}_{h,\widehat P^{(n)},\widehat b^{(n)}} \le 1$ holds for any policy $\pi$ and $h\in[H]$.
We first bound the difference between the truncated value functions under $\widehat P^{(n)}$ and $P^*$:
$$\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} - \widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}} \le \mathbb E_{\pi^{(n)}}\Big[\widehat P^{(n)}_1\widehat V^{\pi^{(n)}}_{2,\widehat P^{(n)},\widehat b^{(n)}}(s_1,a_1) - P^*_1\widehat V^{\pi^{(n)}}_{2,P^*,\widehat b^{(n)}}(s_1,a_1)\Big]$$
$$= \mathbb E_{\pi^{(n)}}\Big[\big(\widehat P^{(n)}_1 - P^*_1\big)\widehat V^{\pi^{(n)}}_{2,\widehat P^{(n)},\widehat b^{(n)}}(s_1,a_1) + P^*_1\big(\widehat V^{\pi^{(n)}}_{2,\widehat P^{(n)},\widehat b^{(n)}} - \widehat V^{\pi^{(n)}}_{2,P^*,\widehat b^{(n)}}\big)(s_1,a_1)\Big]$$
$$\le \mathbb E_{\pi^{(n)}}\big[f^{(n)}_1(s_1,a_1)\big] + \mathbb E_{\pi^{(n)}}\Big[P^*_1\big(\widehat V^{\pi^{(n)}}_{2,\widehat P^{(n)},\widehat b^{(n)}} - \widehat V^{\pi^{(n)}}_{2,P^*,\widehat b^{(n)}}\big)(s_1,a_1)\Big] \le \cdots \le \mathbb E_{P^*,\pi^{(n)}}\Big[\sum_{h=1}^Hf^{(n)}_h(s_h,a_h)\Big] = V^{\pi^{(n)}}_{P^*,f^{(n)}},$$
which implies $\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} \le \widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}} + V^{\pi^{(n)}}_{P^*,f^{(n)}}$. Applying Equations (36) and (37), we bound $\widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}}$:
$$\widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}} \le \sum_{h=1}^H\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim\pi^{(n)}}\big[\widehat b^{(n)}_h(s_h,a_h)\big] \le \sum_{h=2}^H\mathbb E_{s_{h-1}\sim(P^*,\pi^{(n)}),\,a_{h-1}\sim\pi^{(n)}}\Big[\gamma\big\|\phi^*_{h-1}(s_{h-1},a_{h-1})\big\|_{(W^{(n)}_{h-1,\phi^*})^{-1}}\Big] + 15\alpha\sqrt{\frac{d\tilde A}{n}}$$
$$\le \sum_{h=1}^H\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim\pi^{(n)}}\Big[\gamma\big\|\phi^*_h(s_h,a_h)\big\|_{(W^{(n)}_{h,\phi^*})^{-1}}\Big] + 15\alpha\sqrt{\frac{d\tilde A}{n}}.$$
Similarly, by Equations (35) and (37), we obtain
$$V^{\pi^{(n)}}_{P^*,f^{(n)}} = \sum_{h=1}^H\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim\pi^{(n)}}\big[f^{(n)}_h(s_h,a_h)\big] \le \sum_{h=1}^H\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim\pi^{(n)}}\Big[\alpha\big\|\phi^*_h(s_h,a_h)\big\|_{(U^{(n)}_{h,\phi^*})^{-1}}\Big] + \sqrt{\frac{\zeta\tilde A}{n}}.$$
Then, summing $\widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}} + V^{\pi^{(n)}}_{P^*,f^{(n)}}$ over $n\in\mathcal N$, we have
$$\sum_{n\in\mathcal N}\Big(\widehat V^{\pi^{(n)}}_{P^*,\widehat b^{(n)}} + V^{\pi^{(n)}}_{P^*,f^{(n)}} + \sqrt{\tilde A\zeta/n}\Big) \le \sum_{n\in\mathcal N}15\alpha\sqrt{\frac{d\tilde A}{n}} + 2\sum_{n\in\mathcal N}\sqrt{\frac{\tilde A\zeta}{n}} + \gamma\sum_{n\in\mathcal N}\sum_{h=1}^H\mathbb E\Big[\big\|\phi^*_h\big\|_{(W^{(n)}_{h,\phi^*})^{-1}}\Big] + \alpha\sum_{n\in\mathcal N}\sum_{h=1}^H\mathbb E\Big[\big\|\phi^*_h\big\|_{(U^{(n)}_{h,\phi^*})^{-1}}\Big]$$
$$\overset{(i)}{\le} 17\alpha\sqrt{\zeta d\tilde A|\mathcal N|} + \gamma\sum_{h=1}^H\sqrt{|\mathcal N|\sum_{n\in\mathcal N}\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim\pi^{(n)}}\Big[\big\|\phi^*_h\big\|^2_{(W^{(n)}_{h,\phi^*})^{-1}}\Big]} + \alpha\sum_{h=1}^H\sqrt{\tilde A|\mathcal N|\sum_{n\in\mathcal N}\mathbb E_{s_h\sim(P^*,\pi^{(n)}),\,a_h\sim G^{\epsilon_0}_h\pi^{(n)}}\Big[\big\|\phi^*_h\big\|^2_{(U^{(n)}_{h,\phi^*})^{-1}}\Big]}$$
$$\overset{(ii)}{\le} 17\zeta\sqrt{2\beta_3d\tilde A(\tilde A+d^2)|\mathcal N|} + 45H\zeta\sqrt{\beta_3\tilde Ad^2(\tilde A+d^2)|\mathcal N|} + H\zeta\sqrt{\beta_3d\tilde A(\tilde A+d^2)|\mathcal N|} \le 32\zeta Hd\sqrt{\beta_3\tilde A(d^2+\tilde A)|\mathcal N|} \le 32\zeta Hd^2\tilde A\sqrt{\beta_3|\mathcal N|},$$
where (i) follows from the Cauchy-Schwarz inequality and importance sampling, and (ii) follows from Lemma 19. Hence, the statement of Lemma 13 is verified.

C.3 PROOF OF THEOREM 3

Theorem 6 (Restatement of Theorem 3). Given $\epsilon,\delta\in(0,1)$ and a safety constraint $(c,\tau)$, let $U = \min\{\epsilon/2, \Delta_{\min}/2, \epsilon\Delta_{\min}/5, \tau/6, \kappa/24\}$, and let $T = \Delta(c,\tau)U/3$ be the termination threshold of Low-rank-SWEET. Then, with probability at least $1-\delta$, Low-rank-SWEET achieves the learning objective of safe reward-free exploration (Equations (1) and (2)), and the number of trajectories collected in the exploration phase is at most
$$O\Big(\frac{H^3d^4A^2\iota}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{H^3d^4A^2\iota}{\kappa^4}\Big), \quad \text{where } \iota = \log^2\Big(\Big(\frac{H^2d^4A^2}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{H^2d^4A^2}{\kappa^4}\Big)\frac{|\Phi||\Psi|H}{\delta}\Big).$$

Proof. The proof of Theorem 3 mainly instantiates Theorem 1 by verifying that (a) $U^{(n)}(\pi) = \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}$ is a valid approximation error bound for $V^{\pi}_{\widehat P^{(n)},u}$, and (b) Low-rank-SWEET satisfies the termination condition within $N$ iterations. The proof consists of three steps: the first two verify the above two conditions, and the last characterizes the sample complexity.

Step 1: This step establishes that $U^{(n)}(\pi) = \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}$ is a valid approximation error bound in Low-rank-SWEET.

Lemma 14. For all $n\in[N]$, any policy $\pi$ and any normalized utility function $u$, on the event $\mathcal E$, we have
$$\widehat V^{\pi}_{\widehat P^{(n)},u} - V^{\pi}_{P^*,u} \le \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}.$$

Proof. We first show that $\widehat V^{\pi}_{\widehat P^{(n)},u} - V^{\pi}_{P^*,u} \le \widehat V^{\pi}_{\widehat P^{(n)},f^{(n)}}$. Recall the definitions of the truncated value functions $\widehat V^{\pi}_{h,\widehat P^{(n)},u}(s_h)$ and $\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h)$:
$$\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h) = u_h(s_h,a_h) + \widehat P^{(n)}_h\widehat V^{\pi}_{h+1,\widehat P^{(n)},u}(s_h,a_h), \qquad \widehat V^{\pi}_{h,\widehat P^{(n)},u}(s_h) = \min\Big\{1,\ \mathbb E_\pi\big[\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h)\big]\Big\}.$$
We develop the proof by induction. For the base case $h = H+1$, we have
$$\widehat V^{\pi}_{H+1,\widehat P^{(n)},u}(s_{H+1}) - V^{\pi}_{H+1,P^*,u}(s_{H+1}) = 0 = \widehat V^{\pi}_{H+1,\widehat P^{(n)},f^{(n)}}(s_{H+1}).$$
Assume that $\widehat V^{\pi}_{h+1,\widehat P^{(n)},u}(s_{h+1}) - V^{\pi}_{h+1,P^*,u}(s_{h+1}) \le \widehat V^{\pi}_{h+1,\widehat P^{(n)},f^{(n)}}(s_{h+1})$ holds for any $s_{h+1}$.
Then, from the Bellman equation, we have
$$\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h) - Q^{\pi}_{h,P^*,u}(s_h,a_h) = \widehat P^{(n)}_h\widehat V^{\pi}_{h+1,\widehat P^{(n)},u}(s_h,a_h) - P^*_hV^{\pi}_{h+1,P^*,u}(s_h,a_h)$$
$$= \big(\widehat P^{(n)}_h - P^*_h\big)V^{\pi}_{h+1,P^*,u}(s_h,a_h) + \widehat P^{(n)}_h\big(\widehat V^{\pi}_{h+1,\widehat P^{(n)},u} - V^{\pi}_{h+1,P^*,u}\big)(s_h,a_h)$$
$$\overset{(i)}{\le} f^{(n)}_h(s_h,a_h) + \widehat P^{(n)}_h\big(\widehat V^{\pi}_{h+1,\widehat P^{(n)},u} - V^{\pi}_{h+1,P^*,u}\big)(s_h,a_h) \overset{(ii)}{\le} f^{(n)}_h(s_h,a_h) + \widehat P^{(n)}_h\widehat V^{\pi}_{h+1,\widehat P^{(n)},f^{(n)}}(s_h,a_h) = \widehat Q^{\pi}_{h,\widehat P^{(n)},f^{(n)}}(s_h,a_h),\qquad(38)$$
where (i) follows from $\|\widehat P^{(n)}_h(\cdot|s_h,a_h) - P^*_h(\cdot|s_h,a_h)\|_1 = f^{(n)}_h(s_h,a_h)$ and the assumption that $u$ is normalized, and (ii) follows from the induction hypothesis. Then, by the definition of $\widehat V^{\pi}_{h,\widehat P^{(n)},u}(s_h)$, we have
$$\widehat V^{\pi}_{h,\widehat P^{(n)},u}(s_h) - V^{\pi}_{h,P^*,u}(s_h) = \min\Big\{1 - V^{\pi}_{h,P^*,u}(s_h),\ \mathbb E_\pi\big[\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h)\big] - \mathbb E_\pi\big[Q^{\pi}_{h,P^*,u}(s_h,a_h)\big]\Big\}$$
$$\overset{(i)}{\le} \min\Big\{1,\ \mathbb E_\pi\big[\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h) - Q^{\pi}_{h,P^*,u}(s_h,a_h)\big]\Big\} \overset{(ii)}{\le} \min\Big\{1,\ \mathbb E_\pi\big[\widehat Q^{\pi}_{h,\widehat P^{(n)},f^{(n)}}(s_h,a_h)\big]\Big\} = \widehat V^{\pi}_{h,\widehat P^{(n)},f^{(n)}}(s_h),$$
where (i) follows because $\widehat Q^{\pi}_{h,\widehat P^{(n)},u}(s_h,a_h) - Q^{\pi}_{h,P^*,u}(s_h,a_h) > -1$, and (ii) follows from Equation (38). Therefore, by induction, we have $\widehat V^{\pi}_{\widehat P^{(n)},u} - V^{\pi}_{P^*,u} \le \widehat V^{\pi}_{\widehat P^{(n)},f^{(n)}}$.

Next, we show that $\widehat V^{\pi}_{\widehat P^{(n)},f^{(n)}} \le \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n}$. By Equation (34) and the fact that the total variation distance is upper bounded by 1, on the event $\mathcal E$, we have
$$\mathbb E_{\widehat P^{(n)},\pi}\big[f^{(n)}_h(s_h,a_h)\,\big|\,s_{h-1}\big] \le \mathbb E_\pi\Big[\min\big\{\alpha\big\|\widehat\phi^{(n)}_{h-1}\big\|_{(U^{(n)}_{h-1,\widehat\phi})^{-1}},1\big\}\Big], \quad \forall h\ge2.\qquad(39)$$
Similarly, when $h=1$,
$$\mathbb E_{a_1\sim\pi}\big[f^{(n)}_1(s_1,a_1)\big] \le \sqrt{\tilde A\,\mathbb E_{a_1\sim G^{\epsilon_0}_1\Pi_n}\big[f^{(n)}_1(s_1,a_1)^2\big]} \le \sqrt{\tilde A\zeta/n}.$$
Based on Corollary 1, Equation (39), and $\widehat\alpha = 5\alpha$, we have
$$\mathbb E_\pi\big[\widehat b^{(n)}_h(s_h,a_h)\,\big|\,s_h\big] \ge \mathbb E_\pi\Big[\min\big\{\alpha\big\|\widehat\phi^{(n)}_h\big\|_{(U^{(n)}_{h,\widehat\phi})^{-1}},1\big\}\Big] \ge \mathbb E_{\widehat P^{(n)},\pi}\big[f^{(n)}_{h+1}(s_{h+1},a_{h+1})\,\big|\,s_h\big].\qquad(41)$$
For the base case $h = H$, we have
$$\mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{H,\widehat P^{(n)},f^{(n)}}(s_H)\,\big|\,s_{H-1}\big] = \mathbb E_{\widehat P^{(n)},\pi}\big[f^{(n)}_H(s_H,a_H)\,\big|\,s_{H-1}\big] \le \mathbb E_\pi\big[\widehat b^{(n)}_{H-1}(s_{H-1},a_{H-1})\,\big|\,s_{H-1}\big] \le \min\Big\{1,\ \mathbb E_\pi\big[\widehat Q^{\pi}_{H-1,\widehat P^{(n)},\widehat b^{(n)}}(s_{H-1},a_{H-1})\,\big|\,s_{H-1}\big]\Big\} = \widehat V^{\pi}_{H-1,\widehat P^{(n)},\widehat b^{(n)}}(s_{H-1}).$$
Assume that $\mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{h+1,\widehat P^{(n)},f^{(n)}}(s_{h+1})\,\big|\,s_h\big] \le \widehat V^{\pi}_{h,\widehat P^{(n)},\widehat b^{(n)}}(s_h)$ holds at step $h+1$. Then, by Jensen's inequality, we obtain
$$\mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{h,\widehat P^{(n)},f^{(n)}}(s_h)\,\big|\,s_{h-1}\big] \le \min\Big\{1,\ \mathbb E_{\widehat P^{(n)},\pi}\big[f^{(n)}_h(s_h,a_h) + \widehat P^{(n)}_h\widehat V^{\pi}_{h+1,\widehat P^{(n)},f^{(n)}}(s_h,a_h)\,\big|\,s_{h-1}\big]\Big\}$$
$$\overset{(i)}{\le} \min\Big\{1,\ \mathbb E_\pi\big[\widehat b^{(n)}_{h-1}(s_{h-1},a_{h-1})\big] + \mathbb E_{\widehat P^{(n)},\pi}\Big[\mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{h+1,\widehat P^{(n)},f^{(n)}}(s_{h+1})\,\big|\,s_h\big]\,\Big|\,s_{h-1}\Big]\Big\}$$
$$\overset{(ii)}{\le} \min\Big\{1,\ \mathbb E_\pi\big[\widehat b^{(n)}_{h-1}(s_{h-1},a_{h-1})\big] + \mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{h,\widehat P^{(n)},\widehat b^{(n)}}(s_h)\,\big|\,s_{h-1}\big]\Big\} = \min\Big\{1,\ \mathbb E_\pi\big[\widehat Q^{\pi}_{h-1,\widehat P^{(n)},\widehat b^{(n)}}(s_{h-1},a_{h-1})\big]\Big\} = \widehat V^{\pi}_{h-1,\widehat P^{(n)},\widehat b^{(n)}}(s_{h-1}),$$
where (i) follows from Equation (41), and (ii) is due to the induction hypothesis. By induction, we conclude that
$$\widehat V^{\pi}_{\widehat P^{(n)},f^{(n)}} = \mathbb E_\pi\Big[f^{(n)}_1(s_1,a_1) + \mathbb E_{\widehat P^{(n)},\pi}\big[\widehat V^{\pi}_{2,\widehat P^{(n)},f^{(n)}}(s_2)\,\big|\,s_1\big]\Big] \le \sqrt{\tilde A\zeta/n} + \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}}.$$
Combining the two parts above, we conclude that $\widehat V^{\pi}_{\widehat P^{(n)},u} - V^{\pi}_{P^*,u} \le \sqrt{\tilde A\zeta/n} + \widehat V^{\pi}_{\widehat P^{(n)},\widehat b^{(n)}}$.

Step 2: This step shows that Low-rank-SWEET terminates within finitely many iterations.

Lemma 15. On the event $\mathcal E$, there exists $n_\epsilon\in[N]$ such that $\widehat V^{\pi^{(n_\epsilon)}}_{\widehat P^{(n_\epsilon)},\widehat b^{(n_\epsilon)}} + \sqrt{\tilde A\zeta/n_\epsilon} \le T$, where $N$ is defined in Equation (27).

Proof. Let $\mathcal N_0 = \{n : |\mathcal C^{(n)}_L| = 1\}$. We first show that $\mathcal N_0$ is a finite set. Note that $n\in\mathcal N_0$ implies that $V^{\pi_0}_{\widehat P^{(n)},c} + \widehat V^{\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n} > \tau - 2\kappa/3$ and $\pi^{(n)} = \pi_0$. Then, we have
$$|\mathcal N_0|\,\kappa/3 < \sum_{n\in\mathcal N_0}\Big(V^{\pi_0}_{\widehat P^{(n)},c} + \widehat V^{\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\frac{\tilde A\zeta}{n}} - V^{\pi_0}_{P^*,c}\Big) \overset{(i)}{\le} \sum_{n\in\mathcal N_0}2\Big(\widehat V^{\pi_0}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\frac{\tilde A\zeta}{n}}\Big) \overset{(ii)}{\le} 64\zeta Hd^2\tilde A\sqrt{\beta_3|\mathcal N_0|},$$
where (i) is due to Lemma 14 and (ii) follows from Lemma 13. Therefore, we have $|\mathcal N_0| \le \frac{2^{12}\cdot3^2\beta_3H^2d^4\tilde A^2\zeta^2}{\kappa^2}$. Next, we prove the existence of $n_\epsilon$ by contradiction. Assume
$$\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\tilde A\zeta/n} > T, \quad \forall n\in[N]\setminus\mathcal N_0.$$
By Lemma 13, we have
$$(N - |\mathcal N_0|)T < \sum_{n\in[N]\setminus\mathcal N_0}\Big(\widehat V^{\pi^{(n)}}_{\widehat P^{(n)},\widehat b^{(n)}} + \sqrt{\frac{\tilde A\zeta}{n}}\Big) \le 32\zeta Hd^2\tilde A\sqrt{\beta_3N},$$
which implies
$$N < |\mathcal N_0| + \frac{2^{10}\beta_3H^2d^4\tilde A^2\zeta^2}{T^2} \le \frac{2^{10}\beta_3H^2d^4\tilde A^2\zeta^2}{T^2} + \frac{2^{12}\cdot3^2\beta_3H^2d^4\tilde A^2\zeta^2}{\kappa^2}.$$
This contradicts the fact that $N = \frac{2^{10}\beta_3H^2d^4\tilde A^2\zeta^2}{T^2} + \frac{2^{12}\cdot3^2\beta_3H^2d^4\tilde A^2\zeta^2}{\kappa^2}$.

Step 3: This step analyzes the sample complexity of Low-rank-SWEET as follows. On the event $\mathcal E$, since $T = \Delta(c,\tau)U/3$ and $\tilde A = A/\epsilon_0 = 6A/\kappa$, by Lemma 15, the number of iterations is at most
$$N = \frac{2^{12}\cdot3^4\beta_3H^2d^4A^2\zeta^2}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{2^{14}\cdot3^4\beta_3H^2d^4A^2\zeta^2}{\kappa^4}.$$
Note that $n = c_0\log^2(c_1n)$ implies $n \le 4c_0\log^2(c_0c_1)$. Thus,
$$N = O\Big(\frac{H^2d^4A^2\iota}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{H^2d^4A^2\iota}{\kappa^4}\Big), \quad \text{where } \iota = \log^2\Big(\Big(\frac{H^2d^4A^2}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{H^2d^4A^2}{\kappa^4}\Big)\frac{|\Phi||\Psi|H}{\delta}\Big).$$
Since each iteration consists of $H$ episodes, the sample complexity is at most
$$O\Big(\frac{H^3d^4A^2\iota}{\kappa^2\Delta(c,\tau)^2U^2} + \frac{H^3d^4A^2\iota}{\kappa^4}\Big).$$
Thus, Low-rank-SWEET terminates in finitely many episodes. Note that $U(\pi) = \widehat V^{\pi}_{\widehat P_\epsilon,\widehat b_\epsilon} + \sqrt{\tilde A\zeta/n_\epsilon}$. On the event $\mathcal E$, by Lemma 3 and Lemma 14, $U(\pi)$ is an approximation error function under $\widehat P_\epsilon$, and is concave and continuous on $\mathcal X$. We further note that $\frac{(\kappa/3)(\Delta(c,\tau)-2\kappa/3)}{4(\Delta(c,\tau)-\kappa/3)} \ge \frac{\kappa}{24}$ and $\Delta(c,\tau)-\kappa/3 \ge 2\Delta(c,\tau)/3$ due to the condition $\Delta(c,\tau)\ge\kappa$, which implies that $T = \Delta(c,\tau)U/3$ satisfies the requirement in Theorem 4. Therefore, by Theorem 4, we conclude that with probability at least $1-\delta$, the exploration phase of Low-rank-SWEET is safe and $\widehat\pi$ is an $\epsilon$-optimal policy subject to the constraint $(c^*,\tau^*)$.
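The implicit bound $n = c_0\log^2(c_1n) \Rightarrow n \le 4c_0\log^2(c_0c_1)$ used in Step 3 can be checked numerically in the same way as its tabular counterpart. The sketch below assumes $c_0c_1$ is moderately large, as in the regime of the theorem where both factors are large polynomials; the specific constants are arbitrary test values, and the check is illustrative only.

```python
import math

def implicit_n_logsq(c0, c1, iters=300):
    """Solve n = c0 * log(c1 * n)**2 by fixed-point iteration.

    The map has slope 2*c0*log(c1*n)/n, which is below 1 near the
    fixed point for the (moderately large) constants tested below.
    """
    n = c0 * c1  # starting point
    for _ in range(iters):
        n = c0 * math.log(c1 * n) ** 2
    return n

checks = []
for c0, c1 in ((10.0, 100.0), (100.0, 100.0), (1e3, 1e6)):
    n_star = implicit_n_logsq(c0, c1)
    # residual of the implicit equation should be ~0 at the fixed point
    assert abs(n_star - c0 * math.log(c1 * n_star) ** 2) < 1e-6 * n_star
    checks.append(n_star <= 4 * c0 * math.log(c0 * c1) ** 2)

assert all(checks)
```

The squared logarithm is what produces the $\log^2$ factor in $\iota$, compared with the single logarithm in the tabular bound.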

D AUXILIARY LEMMAS

We first provide the following property of a mixture policy and its equivalent Markov policy for completeness.

Lemma 16 (Theorem 6.1 in Altman (1999)). Given a model $P$, any Markov policies $\pi,\pi'\in\mathcal X$, and $\gamma\in[0,1]$, there exists a Markov policy $\pi_\gamma\in\mathcal X$ equivalent to the mixture policy $\gamma\pi\oplus(1-\gamma)\pi'$. Let $\rho^{\pi}_h(s_h)$ and $\rho^{\pi}_h(s_h,a_h)$ be the marginal distributions over states and state-action pairs induced by $\pi$ under $P$, respectively. Then, the following statements hold:
• $\rho^{\pi_\gamma}_h(s_h) = \gamma\rho^{\pi}_h(s_h) + (1-\gamma)\rho^{\pi'}_h(s_h)$;
• $\rho^{\pi_\gamma}_h(s_h,a_h) = \gamma\rho^{\pi}_h(s_h,a_h) + (1-\gamma)\rho^{\pi'}_h(s_h,a_h)$;
• $\pi_\gamma(a_h|s_h) = \rho^{\pi_\gamma}_h(s_h,a_h)/\rho^{\pi_\gamma}_h(s_h)$;
• $V^{\pi_\gamma}_{P,u} = \gamma V^{\pi}_{P,u} + (1-\gamma)V^{\pi'}_{P,u}$ for any utility function $u$.

Next, we present the estimation error of the MLE in the $n$-th iteration at step $h$, given state $s$ and action $a$, in terms of the total variation distance, i.e., $f^{(n)}_h(s,a) = \|\widehat P^{(n)}_h(\cdot|s,a) - P^*_h(\cdot|s,a)\|_1$. By Theorem 21 in Agarwal et al. (2020), we can guarantee that, under all exploration policies, the estimation error is bounded with high probability.

Lemma 17 (MLE guarantee). Given $\delta\in(0,1)$, the following inequality holds for any $n\in[N]$ and $h\in[H]$ with probability at least $1-\delta/2$:
$$n\,\mathbb E_{s_h\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[f^{(n)}_h(s_h,a_h)^2\big] \le \zeta, \quad \text{where } \zeta := \log(2|\Phi||\Psi|NH/\delta).$$

Dividing both sides of the inequality in Lemma 17 by $n$, we obtain the following corollary, which is used intensively in the analysis.

Corollary 2. Given $\delta\in(0,1)$, the following inequality holds for any $n,h\ge1$ with probability at least $1-\delta/2$:
$$\mathbb E_{s_h\sim(P^*,G^{\epsilon_0}_{h-1}\Pi_n),\,a_h\sim G^{\epsilon_0}_h\Pi_n}\big[f^{(n)}_h(s_h,a_h)^2\big] \le \zeta/n,$$
where $\Pi_n$ and $G^{\epsilon_0}_h\Pi_n$ are defined in Equation (29) and Equation (28), respectively.

Then, we present two critical lemmas which ensure that the sums of the approximation errors grow sublinearly in Tabular-SWEET and Low-rank-SWEET.

Lemma 18 (Lemma 9 in Ménard et al. (2021)). Suppose $\{a_n\}_{n=0}^\infty$ is a sequence with $a_n\in[0,1]$ for all $n$. Let $S_n = \max\{1, \sum_{m=0}^n a_m\}$.
Then, the following inequality holds:

Finally, the following lemma is used in Theorem 4.

Lemma 20. Given $a,b>0$, define a positive sequence $\{x_n\}_{n\ge1}$ recursively by $x_{n+1} = \frac{b}{a-x_n}$. If $a^2>4b$ and $x_1\in\big[\frac{a-\sqrt{a^2-4b}}{2}, \frac{a+\sqrt{a^2-4b}}{2}\big)$, then $\{x_n\}$ converges to $\frac{a-\sqrt{a^2-4b}}{2}$.

Proof. Step 1. We first show that $x_n\in\big[\frac{a-\sqrt{a^2-4b}}{2}, \frac{a+\sqrt{a^2-4b}}{2}\big)$ for all $n$. This is true for $n=1$ by assumption. Assume that $x_{n-1}\in\big[\frac{a-\sqrt{a^2-4b}}{2}, \frac{a+\sqrt{a^2-4b}}{2}\big)$. Then, with simple algebra, we can show that
$$x_n = \frac{b}{a-x_{n-1}} \in \Big[\frac{a-\sqrt{a^2-4b}}{2}, \frac{a+\sqrt{a^2-4b}}{2}\Big).$$

Step 2. We show that $\{x_n\}$ is a non-increasing sequence.

Indeed, from Step 1, we have
$$|a - 2x_n| \le \sqrt{a^2-4b} \;\Rightarrow\; a^2 - 4ax_n + 4x_n^2 \le a^2 - 4b \;\Rightarrow\; ax_n - x_n^2 \ge b \;\Rightarrow\; x_n \ge \frac{b}{a-x_n} = x_{n+1}.$$
Therefore, $x_{n+1}\le x_n$ holds for all $n\ge1$. Combining Steps 1 and 2, we conclude that the sequence $\{x_n\}$ has a limit, denoted by $x^*$. By the recursive formula, $x^*$ must be a solution of the equation $x^* = \frac{b}{a-x^*}$. Since $x^*\le x_1 < \frac{a+\sqrt{a^2-4b}}{2}$, solving this equation yields
$$\lim_{n\to\infty}x_n = x^* = \frac{a-\sqrt{a^2-4b}}{2}.$$
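The convergence claimed in Lemma 20 is easy to verify numerically. The following Python sketch iterates $x_{n+1} = b/(a-x_n)$ from several starting points in the stated interval and checks that the iterates approach the smaller root $(a-\sqrt{a^2-4b})/2$; the values $a=3$, $b=1$ are arbitrary test values satisfying $a^2 > 4b$.

```python
import math

def iterate_recursion(a, b, x1, n_iters=200):
    """Iterate x_{n+1} = b / (a - x_n) starting from x1."""
    x = x1
    for _ in range(n_iters):
        x = b / (a - x)
    return x

a, b = 3.0, 1.0                             # a^2 = 9 > 4b = 4
lo = (a - math.sqrt(a * a - 4 * b)) / 2     # smaller root: the claimed limit
hi = (a + math.sqrt(a * a - 4 * b)) / 2     # larger root: excluded endpoint

# Start at the lower endpoint, the midpoint, and just below the upper endpoint.
for x1 in (lo, (lo + hi) / 2, hi - 1e-6):
    x_inf = iterate_recursion(a, b, x1)
    assert abs(x_inf - lo) < 1e-8
```

Note that the larger root is a repelling fixed point (the map's slope there exceeds 1), which is why the interval in the lemma excludes it: starting even slightly below it, the iterates drift down to the smaller, attracting root.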



The bound is adapted from the original result by normalizing the reward function.



Figure 1: Illustration of the proof.




Lemma 19 (Elliptical potential lemma; Lemma B.3 in He et al. (2021)). Consider a sequence of $d\times d$ positive semidefinite matrices $X_1,\dots,X_N$ with $\mathrm{tr}(X_n)\le1$ for all $n\in[N]$. Define $M_0 = \lambda_0I$ and $M_n = M_{n-1} + X_n$. Then, for any $\mathcal N\subset[N]$,
$$\sum_{n\in\mathcal N}\mathrm{tr}\big(X_nM_{n-1}^{-1}\big) \le 2d\log\Big(1 + \frac{|\mathcal N|}{d\lambda_0}\Big).$$
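As a numerical illustration of Lemma 19 (not part of the analysis), the following NumPy sketch accumulates rank-one matrices $X_n = vv^\top$ with unit trace and checks the elliptical potential bound; the dimension, horizon, and $\lambda_0$ are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lam0 = 4, 500, 1.0

M = lam0 * np.eye(d)
lhs = 0.0
for _ in range(N):
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)          # unit vector, so X_n = v v^T has tr(X_n) = 1
    X = np.outer(v, v)
    lhs += np.trace(X @ np.linalg.inv(M))   # accumulate tr(X_n M_{n-1}^{-1})
    M = M + X

rhs = 2 * d * np.log(1 + N / (d * lam0))
assert lhs <= rhs
```

Each summand $\mathrm{tr}(X_nM_{n-1}^{-1}) = v^\top M_{n-1}^{-1}v$ shrinks as directions are revisited, which is the mechanism behind the logarithmic growth of the total.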



ACKNOWLEDGMENTS

The work of R. Huang and J. Yang was supported by the U.S. National Science Foundation under the grant CNS-2003131. The work of Y. Liang was supported in part by the U.S. National Science Foundation under the grant RINGS-2148253.

