SAFE EXPLORATION INCURS NEARLY NO ADDITIONAL SAMPLE COMPLEXITY FOR REWARD-FREE RL

Abstract

Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback information. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity required to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity for RF-RL.

1. INTRODUCTION

Reward-free reinforcement learning (RF-RL) is an RL paradigm under which a learning agent first explores an unknown environment without any reward signal in the exploration phase, and then utilizes the gathered information to obtain a near-optimal policy for any reward function during the planning phase. Since being formally introduced in Jin et al. (2020b), RF-RL has attracted increased attention in the research community (Kaufmann et al., 2021; Zhang et al., 2020; 2021; Wang et al., 2020; Modi et al., 2021). It is particularly attractive for applications where many reward functions may be of interest, such as multi-objective RL (Miryoosefi & Jin, 2021), or where the reward function is not specified by the environment but handcrafted in order to incentivize some desired behavior of the RL agent (Jin et al., 2020b). The ability of RF-RL to identify a near-optimal policy in response to an arbitrary reward function relies on the fact that the agent is allowed to explore any action during exploration. However, in practice, unrestricted exploration is often unrealistic or even harmful. In order to build safe, responsible and reliable artificial intelligence (AI), the RL agent often has to abide by certain application-dependent constraints, even during the exploration phase. Two motivating applications are provided as follows.

• Autonomous driving. In order to learn a near-optimal driving strategy, an RL agent needs to try various actions at different states through exploration. While RF-RL is an appealing approach as the reward function is difficult to specify, it is of critical importance for the RL agent to take safe actions (even during exploration) in order to avoid catastrophic consequences.
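To make the two-phase structure concrete, the following is a minimal toy sketch (not the SWEET algorithms of this paper, and with no safety constraint) of reward-free RL on a small tabular MDP: the exploration phase collects transitions without any reward signal to estimate the dynamics, and the planning phase then computes a policy for an arbitrary reward function revealed only afterwards. All names and parameters below are illustrative assumptions.

```python
import numpy as np

# Toy two-phase reward-free RL loop on a random tabular MDP.
# This is an illustrative sketch only, not the paper's method.

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5                                 # states, actions, horizon
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true transition kernel, shape (S, A, S)

# --- Phase 1: reward-free exploration (here: uniformly random actions) ---
counts = np.zeros((S, A, S))
for _ in range(2000):                             # exploration episodes
    s = 0
    for _ in range(H):
        a = rng.integers(A)
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
# Smoothed empirical model (Laplace smoothing avoids division by zero).
P_hat = (counts + 1) / (counts.sum(axis=2, keepdims=True) + S)

# --- Phase 2: planning for a reward specified only after exploration ---
def plan(P, r, H):
    """Finite-horizon value iteration; returns greedy policy and V_1 at s = 0."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + P @ V                             # Q[s, a] = r[s, a] + E[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi, V[0]

reward = rng.random((S, A))                       # arbitrary reward function
pi_hat, v_hat = plan(P_hat, reward, H)            # plan on the learned model
pi_star, v_star = plan(P_true, reward, H)         # oracle plan on the true model
print(abs(v_star - v_hat))                        # model-estimation gap
```

The same estimated model `P_hat` can be reused to plan for any number of reward functions, which is the appeal of the reward-free setting; the sample-complexity question studied in this paper is how many exploration trajectories are needed so that this gap is small for every reward, once a safety constraint also binds during exploration.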

