SAFE EXPLORATION INCURS NEARLY NO ADDITIONAL SAMPLE COMPLEXITY FOR REWARD-FREE RL

Abstract

Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art of their constraint-free counterparts up to constant factors, proving that the safety constraint hardly increases the sample complexity of RF-RL.

1. INTRODUCTION

Reward-free reinforcement learning (RF-RL) is an RL paradigm under which a learning agent first explores an unknown environment without any reward signal in the exploration phase, and then utilizes the gathered information to obtain a near-optimal policy for any reward function during the planning phase. Since it was formally introduced in Jin et al. (2020b), RF-RL has attracted increasing attention in the research community (Kaufmann et al., 2021; Zhang et al., 2020; 2021; Wang et al., 2020; Modi et al., 2021). It is particularly attractive for applications where many reward functions may be of interest, such as multi-objective RL (Miryoosefi & Jin, 2021), or where the reward function is not specified by the environment but handcrafted in order to incentivize some desired behavior of the RL agent (Jin et al., 2020b). The ability of RF-RL to identify a near-optimal policy in response to an arbitrary reward function relies on the fact that the agent is allowed to explore any action during exploration. However, in practice, unrestricted exploration is often unrealistic or even harmful. In order to build safe, responsible and reliable artificial intelligence (AI), the RL agent often has to abide by certain application-dependent constraints, even during the exploration phase. Two motivating applications are provided as follows.

• Autonomous driving. In order to learn a near-optimal driving strategy, an RL agent needs to try various actions at different states through exploration. While RF-RL is an appealing approach when the reward function is difficult to specify, it is of critical importance for the RL agent to take safe actions (even during exploration) in order to avoid catastrophic consequences.

• Cellular network optimization. The operation of a cellular network needs to take a diverse corpus of key performance indicators into consideration, which makes RF-RL a plausible solution.
Meanwhile, the exploration also needs to meet certain system requirements, such as power consumption. While meeting these constraints throughout the learning process is a pressing need for the broad adoption of RL in real-world applications, it is impossible to accomplish if no other information is provided, since the learner has little knowledge of the underlying MDP at the beginning of the learning process and will inevitably take undesirable actions (in hindsight) and violate the constraints. On the other hand, in various engineering applications, there often exist either rule-based (e.g., autonomous driving) or human expert-guided (e.g., cellular network optimization) solutions to ensure safe operation of the system. One natural question is: is it possible to leverage such existing safe solutions to ensure safety throughout the learning process? If so, how would the safe exploration requirement affect the corresponding RF-RL performance, in terms of the sample complexity of exploration and the optimality and safety guarantees of the obtained policy in planning?

To answer these questions, we introduce a new safe RF-RL framework. In the proposed framework, the agent does not receive any reward information in the exploration phase, but is aware of a cost function associated with actions at a given state. We require that the cumulative cost in each episode stays below a given threshold during exploration, with the aid of a pre-existing safe baseline policy $\pi_0$. The ultimate learning goal of safe RF-RL is to find a safe and near-optimal policy for any given reward and cost functions after exploration.

Main contributions. We summarize our main contributions as follows.

• First, we introduce a novel safe RF-RL framework that imposes safety constraints during both exploration and planning of RF-RL, which may have implications in various applications.
• Second, we propose a unified safe exploration strategy coined SWEET that leverages the prior knowledge of a safe baseline policy $\pi_0$. SWEET admits general model estimation and safe exploration policy construction modules, and can thus accommodate various MDP structures and different algorithmic designs. Under the assumption that the approximation error function is concave and continuous in the policy space, SWEET is guaranteed to achieve zero constraint violation during exploration, and to output a near-optimal safe policy for any given reward function and safety constraint under some assumptions in planning, both with high probability.

• Third, in order to facilitate the specific design of the approximation error function and ensure its concavity, we introduce a novel definition of truncated value functions. It relies on a new clipping method to avoid underestimating the approximation error captured by the corresponding value function, and ensures the concavity of the resulting value function.

• Finally, we particularize the SWEET algorithm for both tabular and low-rank MDPs, and propose Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms inherit the optimality guarantee during planning, and the safety guarantees in both exploration and planning. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art of their constraint-free counterparts up to constant factors, proving that the safety constraint incurs nearly no additional sample complexity for RF-RL.
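The per-episode safety requirement described above can be written formally as follows. Note that the cost symbol $c_h$ and threshold symbol $\tau$ are chosen here for illustration; the paper's exact notation is introduced later.

```latex
% Safe exploration (notation illustrative): in every exploration episode k,
% the expected cumulative cost of the deployed policy \pi^k must stay below
% a known threshold \tau, which the baseline policy \pi_0 is assumed to satisfy:
\forall k:\quad
\mathbb{E}_{\pi^k}\Bigl[\textstyle\sum_{h=1}^{H} c_h(s_h, a_h)\Bigr] \le \tau,
\qquad\text{with}\qquad
\mathbb{E}_{\pi_0}\Bigl[\textstyle\sum_{h=1}^{H} c_h(s_h, a_h)\Bigr] \le \tau.
```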

2.1. EPISODIC MARKOV DECISION PROCESSES

We consider episodic Markov decision processes (MDPs) of the form $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, H, s_1)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the finite action space, $H$ is the number of time steps in each episode, $P = \{P_h\}_{h=1}^H$ is a collection of transition kernels, and $P_h(s_{h+1} \mid s_h, a_h)$ denotes the transition probability from the state-action pair $(s_h, a_h)$ at step $h$ to state $s_{h+1}$ at the next step. Without loss of generality, we assume that in each episode of the MDP, the initial state is fixed at $s_1$. In addition, an MDP may be equipped with certain specified utility functions $u = \{u_h\}_{h=1}^H$, where we assume each $u_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is a deterministic function for ease of exposition. A Markov policy $\pi$ is a set of mappings $\{\pi_h : \mathcal{S} \to \Delta(\mathcal{A})\}_{h=1}^H$, where $\Delta(\mathcal{A})$ is the set of all possible distributions over the action space $\mathcal{A}$. In particular, $\pi_h(a \mid s)$ denotes the probability of selecting action $a$ in state $s$ at time step $h$. We denote the set of all Markov policies by $\mathcal{X}$. For an agent adopting policy $\pi$ in an MDP $\mathcal{M}$, at each step $h \in [H]$, where $[H] := \{1, \ldots, H\}$, she observes state $s_h \in \mathcal{S}$ and takes an action $a_h \in \mathcal{A}$ according to $\pi$, after which the environment transits to the next state $s_{h+1}$
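The definitions above can be sketched in code. The following is a minimal tabular instance of an episodic MDP and a rollout under a Markov policy; the toy dimensions, the random transition kernels, and all names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Toy sizes (illustrative): |S| states, |A| actions, horizon H.
S, A, H = 4, 3, 5
rng = np.random.default_rng(0)

# Transition kernels P = {P_h}: P[h][s, a] is a distribution over next
# states, i.e., P_h(. | s, a). Drawn at random here purely for illustration.
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(H)]

# A Markov policy pi = {pi_h}: pi[h][s] is a distribution over actions,
# i.e., pi_h(. | s). Here: the uniform policy.
pi = [np.full((S, A), 1.0 / A) for _ in range(H)]

def rollout(P, pi, s1=0):
    """Sample one episode of length H; the initial state is fixed at s1."""
    s, trajectory = s1, []
    for h in range(H):
        a = rng.choice(A, p=pi[h][s])         # a_h ~ pi_h(. | s_h)
        s_next = rng.choice(S, p=P[h][s, a])  # s_{h+1} ~ P_h(. | s_h, a_h)
        trajectory.append((s, a))
        s = s_next
    return trajectory

traj = rollout(P, pi)  # list of H state-action pairs (s_h, a_h)
```

Note that the episode terminates after exactly $H$ steps, matching the finite-horizon setting above.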

