ACCELERATING SAFE REINFORCEMENT LEARNING WITH CONSTRAINT-MISMATCHED POLICIES

Abstract

We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data, or even a heuristic, while the constraints might encode safety, fairness, or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand and is not guaranteed to satisfy the specified constraints. The key challenge therefore lies in effectively leveraging the baseline policy for faster learning while still ensuring that the constraints are minimally violated. To reconcile these potentially competing aspects, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze the convergence of our algorithm theoretically and provide a finite-time guarantee. In our empirical experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art methods, achieving 10 times fewer constraint violations and 40% higher reward on average.¹

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved superior performance in several domains such as games (Mnih et al., 2013; Silver et al., 2016) and robotic control (Levine et al., 2016; Rajeswaran et al., 2017). However, in these complex applications, learning policies from scratch often requires tremendous amounts of time and computational power. To alleviate this issue, one would like to leverage a baseline policy available from a teacher or a previous task. However, the baseline policy may be sub-optimal for the new application and is not guaranteed to produce actions that satisfy given constraints on safety, fairness, or other costs. For instance, when you drive an unfamiliar vehicle, you do so cautiously to ensure safety, while at the same time you adapt your driving technique to the vehicle's characteristics to improve your 'driving reward'. In effect, you (as the agent) gradually adapt a baseline policy (i.e., prior driving skill) to avoid violating the constraints (e.g., safety) while improving your driving reward (e.g., travel time, fuel efficiency). This problem is challenging because directly leveraging the baseline policy, as in DAgger (Ross et al., 2011) or GAIL (Ho & Ermon, 2016), may result in policies that violate the constraints, since the baseline is not guaranteed to satisfy them. To ensure constraint satisfaction, prior work either adds a hyper-parameter-weighted copy of the imitation learning (IL) objective (i.e., imitating the baseline policy) to the RL objective (Rajeswaran et al., 2017; Gao et al., 2018; Hester et al., 2018), or pre-trains a policy with the baseline policy and then fine-tunes it through RL (Mülling et al., 2013; Chernova & Thomaz, 2014). Both approaches incur the cost of tuning weights to satisfy the cost constraint and do not ensure constraint satisfaction during training.
In this work, to learn from the baseline policy while satisfying constraints, we propose an iterative algorithm that performs policy updates in three stages. The first stage updates the policy to maximize expected reward using trust region policy optimization (e.g., TRPO (Schulman et al., 2015)). This can, however, result in an intermediate policy that is too far from the baseline policy and that may not satisfy the constraints. The second stage performs a projection in policy space to control the distance between the current policy and the baseline policy. This distance is updated each episode depending on the reward improvement and constraint satisfaction, allowing the learning algorithm to explore without being overly restricted by the (potentially constraint-violating) baseline policy (Rajeswaran et al., 2017). This also enables the baseline policy to influence the learning without the computational burden of learning a cost function for the baseline policy (Kwon et al., 2020). The third stage ensures constraint satisfaction at every iteration by performing a projection onto the set of policies that satisfy the given constraints. This projection enables recovery from infeasible (i.e., constraint-violating) states (e.g., due to approximation errors) and eliminates the need to tune weights for auxiliary cost objective functions (Tessler et al., 2018). We call our algorithm Safe Policy Adaptation with Constrained Exploration (SPACE). This paper's contributions are two-fold. We first analyze our proposed SPACE algorithm and provide a finite-time guarantee for its convergence.
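As a rough illustration of the three-stage update, the following sketch works on plain parameter vectors, using Euclidean balls in place of the KL-divergence trust regions used in the paper and a hand-linearized cost constraint; the function name, arguments, and toy geometry are our own assumptions, not the paper's implementation.

```python
import numpy as np

def space_step(theta, reward_grad, theta_B, delta, eps, a, b):
    """One illustrative SPACE-style update on a parameter vector.

    theta       : current policy parameters
    reward_grad : gradient of expected reward at theta
    theta_B     : baseline policy parameters
    delta       : step size for the reward-improvement stage
    eps         : allowed Euclidean distance to the baseline policy
    a, b        : linearized cost constraint, a @ theta <= b
    """
    # Stage 1: reward improvement (stands in for a TRPO trust-region step).
    theta1 = theta + delta * reward_grad

    # Stage 2: project onto the ball of radius eps around the baseline policy.
    diff = theta1 - theta_B
    dist = np.linalg.norm(diff)
    theta2 = theta_B + diff * (eps / dist) if dist > eps else theta1

    # Stage 3: project onto the half-space {theta : a @ theta <= b}.
    violation = a @ theta2 - b
    if violation > 0:
        theta2 = theta2 - violation * a / (a @ a)
    return theta2

# Toy usage: baseline at [0.5, 0], radius 1, constraint theta[0] <= 0.3.
theta_new = space_step(np.zeros(2), np.array([2.0, 2.0]),
                       np.array([0.5, 0.0]), delta=1.0, eps=1.0,
                       a=np.array([1.0, 0.0]), b=0.3)
```

In the full algorithm, the radius of the stage-2 ball (`eps` here) is not fixed: it is adapted each episode based on observed reward improvement and constraint satisfaction, loosening the pull toward the baseline once it stops helping.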
We also analyze how controlling the distance between the learned policy at iteration k and the baseline policy ensures both feasibility of the optimization problem and safe exploration by the learning agent. Second, we empirically compare SPACE with state-of-the-art algorithms on five different control tasks, including two MuJoCo environments with safety constraints from Achiam et al. (2017), two challenging traffic management tasks with fairness constraints from Vinitsky et al. (2018), and one human-demonstration driving task with safety constraints from Brockman et al. (2016). In all tasks, SPACE outperforms the state-of-the-art safe RL algorithm, projection-based constrained policy optimization (PCPO) (Yang et al., 2020), averaging 40% more reward with 10 times fewer cost constraint violations. This shows that SPACE leverages the baseline policy to achieve better learning efficiency while satisfying the cost constraint.

2. RELATED WORK

Policy optimization with constraints. Learning constraint-satisfying policies has been explored in the context of safe RL (Garcia & Fernandez, 2015; Hasanbeig et al., 2020; Junges et al., 2016; Jansen et al., 2020). Prior work either uses a conditional-gradient type of approach (Achiam et al., 2017), adds a weighted copy of the cost objective to the reward function (Tessler et al., 2018; Chow et al., 2019; Fujimoto et al., 2019; Stooke et al., 2020), adds a safety layer to the policy (Dalal et al., 2018; Avni et al., 2019), or handles chance constraints (Fu & Prashanth L, 2018; Zheng & Ratliff, 2020). Perhaps the closest work to ours is Projection-based Constrained Policy Optimization (PCPO) (Yang et al., 2020), which also uses projections in policy space to ensure constraint satisfaction. However, PCPO cannot safely exploit prior information (in the form of a baseline policy), which makes it sample-inefficient. In addition, SPACE dynamically adapts the allowed distance between the learned and baseline policies, whereas PCPO has no such mechanism; this adaptation is important for learning effectively and safely from the baseline policy. Furthermore, we provide a safety guarantee that ensures the feasibility of the optimization problem, which PCPO does not. Merely adding an IL objective to the reward objective of PCPO does not yield efficient learning, as our experiments show. This analysis moves constrained RL closer to practical use in real applications.

¹ Code is available at: https://sites.google.com/view/spacealgo.

Figure 1: (a) Update procedures for SPACE. Step 1 (green) improves the reward in the trust region. Step 2 (blue) projects the policy onto a region around the baseline policy πB. Step 3 (red) projects the policy onto the constraint set. (b) Illustration of the case where πB is outside the constraint set. (c) Illustration of the case where πB is inside the constraint set. The highest reward is achieved at the yellow star.

Policy optimization with an initial safe set. Wachi & Sui (2020), Sui et al. (2015), and Turchetta et al. (2016) assume that an initial safe set is given, and the agent explores the environment and verifies the safety function starting from this set; no baseline policy is involved. In contrast, we assume the agent is given a baseline policy. Both assumptions are reasonable, as each provides an initial understanding of the environment.

