ACCELERATING SAFE REINFORCEMENT LEARNING WITH CONSTRAINT-MISMATCHED POLICIES

Abstract

We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data, or even a heuristic, while the constraints might encode safety, fairness, or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand, and is not guaranteed to satisfy the specified constraints. The key challenge therefore lies in effectively leveraging the baseline policy for faster learning, while still ensuring that the constraints are minimally violated. To reconcile these potentially competing aspects, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze the convergence of our algorithm theoretically and provide a finite-time guarantee. In our empirical experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art methods, achieving 10 times fewer constraint violations and 40% higher reward on average.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved superior performance in several domains such as games (Mnih et al., 2013; Silver et al., 2016) and robotic control (Levine et al., 2016; Rajeswaran et al., 2017). However, in these complex applications, learning policies from scratch often requires tremendous amounts of time and computation power. To alleviate this issue, one would like to leverage a baseline policy available from a teacher or a previous task. However, the baseline policy may be sub-optimal for the new application and is not guaranteed to produce actions that satisfy given constraints on safety, fairness, or other costs. For instance, when you drive an unfamiliar vehicle, you do so cautiously to ensure safety, while at the same time you adapt your driving technique to the vehicle's characteristics to improve your 'driving reward'. In effect, you (as the agent) gradually adapt a baseline policy (i.e., prior driving skill) to avoid violating the constraints (e.g., safety) while improving your driving reward (e.g., travel time, fuel efficiency). This problem is challenging because directly leveraging the baseline policy, as in DAGGER (Ross et al., 2011) or GAIL (Ho & Ermon, 2016), may result in policies that violate the constraints, since the baseline is not guaranteed to satisfy them. To ensure constraint satisfaction, prior work either adds a hyper-parameter-weighted copy of the imitation learning (IL) objective (i.e., imitating the baseline policy) to the RL objective (Rajeswaran et al., 2017; Gao et al., 2018; Hester et al., 2018), or pre-trains a policy with the baseline policy and then fine-tunes it through RL (Mülling et al., 2013; Chernova & Thomaz, 2014). Both approaches incur the cost of tuning the imitation weight to satisfy the cost constraint, and neither ensures constraint satisfaction during training.
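As a preview, the alternating three-stage update proposed here (a reward-improvement step, a pull toward the baseline policy, and a projection onto the constraint-satisfying set) can be sketched on a toy parameter vector. This is only an illustrative sketch, not the paper's algorithm: Euclidean distances stand in for KL-divergence trust regions, the reward gradient is a fixed placeholder, and the constraint set is modeled as a simple half-space.

```python
import numpy as np

def reward_step(theta, reward_grad, lr=0.1):
    # Stage 1: ascend an estimated reward gradient (a crude stand-in
    # for a TRPO reward-improvement step).
    return theta + lr * reward_grad(theta)

def pull_toward_baseline(theta, theta_base, radius):
    # Stage 2: project theta back into a Euclidean ball around the
    # baseline policy's parameters (stand-in for a KL trust region).
    diff = theta - theta_base
    dist = np.linalg.norm(diff)
    return theta if dist <= radius else theta_base + radius * diff / dist

def project_onto_constraint(theta, a, b):
    # Stage 3: Euclidean projection onto the half-space {x : a.x <= b},
    # a toy model of the constraint-satisfying set of policies.
    violation = a @ theta - b
    return theta if violation <= 0 else theta - violation * a / (a @ a)

# Toy problem: the reward pushes theta toward (+inf, +inf), the
# baseline sits at (0.5, 0), and the constraint demands theta[0] <= 0.8.
theta = np.zeros(2)
theta_base = np.array([0.5, 0.0])
a, b = np.array([1.0, 0.0]), 0.8
for _ in range(50):
    theta = reward_step(theta, lambda t: np.ones(2))
    theta = pull_toward_baseline(theta, theta_base, radius=1.0)
    theta = project_onto_constraint(theta, a, b)
```

Because the baseline parameters themselves satisfy the toy constraint, the half-space projection is non-expansive with respect to them, so each round ends with an iterate that is both inside the trust ball and constraint-satisfying.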
In this work, to learn from the baseline policy while satisfying constraints, we propose an iterative algorithm that performs policy updates in three stages. The first step updates the policy to maximize expected reward using trust region policy optimization (e.g., TRPO (Schulman et al., 2015)). This may, however, yield an intermediate policy that strays too far from the baseline policy and does not satisfy the constraints. The second step performs a projection in policy space to control



1 Code is available at: https://sites.google.com/view/spacealgo.

