ACCELERATING SAFE REINFORCEMENT LEARNING WITH CONSTRAINT-MISMATCHED POLICIES

Abstract

We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the controlled system must satisfy. The baseline policy can arise from a teacher agent, demonstration data, or even a heuristic, while the constraints might encode safety, fairness, or other application-specific requirements. Importantly, the baseline policy may be sub-optimal for the task at hand, and is not guaranteed to satisfy the specified constraints. The key challenge therefore lies in effectively leveraging the baseline policy for faster learning, while still ensuring that the constraints are minimally violated. To reconcile these potentially competing aspects, we propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint-satisfying set. We analyze the convergence of our algorithm theoretically and provide a finite-time guarantee. In our empirical experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art methods, achieving 10 times fewer constraint violations and 40% higher reward on average.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved superior performance in several domains such as games (Mnih et al., 2013; Silver et al., 2016) and robotic control (Levine et al., 2016; Rajeswaran et al., 2017). However, in these complex applications, learning policies from scratch often requires tremendous amounts of time and computational power. To alleviate this issue, one would like to leverage a baseline policy available from a teacher or a previous task. However, the baseline policy may be sub-optimal for the new application and is not guaranteed to produce actions that satisfy given constraints on safety, fairness, or other costs. For instance, when you drive an unfamiliar vehicle, you do so cautiously to ensure safety, while at the same time adapting your driving technique to the vehicle's characteristics to improve your 'driving reward'. In effect, you (as the agent) gradually adapt a baseline policy (i.e., prior driving skill) to avoid violating the constraints (e.g., safety) while improving your driving reward (e.g., travel time, fuel efficiency). This problem is challenging because directly leveraging the baseline policy, as in DAGGER (Ross et al., 2011) or GAIL (Ho & Ermon, 2016), may result in policies that violate the constraints, since the baseline is not guaranteed to satisfy them. To ensure constraint satisfaction, prior work either adds a hyper-parameter-weighted copy of the imitation learning (IL) objective (i.e., imitating the baseline policy) to the RL objective (Rajeswaran et al., 2017; Gao et al., 2018; Hester et al., 2018), or pre-trains a policy with the baseline policy and then fine-tunes it through RL (Mülling et al., 2013; Chernova & Thomaz, 2014). Both approaches incur the cost of weight tuning to satisfy the cost constraint, and neither ensures constraint satisfaction during training.
In this work, to learn from the baseline policy while satisfying constraints, we propose an iterative algorithm that performs policy updates in three stages. The first step updates the policy to maximize expected reward using trust region policy optimization (e.g., TRPO (Schulman et al., 2015)). This can, however, result in a new intermediate policy that is too far from the baseline policy and that may not satisfy the constraints. The second step performs a projection in policy space to control the distance between the current policy and the baseline policy. This distance is updated each episode depending on the reward improvement and constraint satisfaction, allowing the learning algorithm to explore without being overly restricted by the (potentially constraint-violating) baseline policy (Rajeswaran et al., 2017). This also enables the baseline policy to influence learning without the computational burden of learning a cost function for the baseline policy (Kwon et al., 2020). The third step ensures constraint satisfaction at every iteration by projecting onto the set of policies that satisfy the given constraints. This ensures recovery from infeasible (i.e., constraint-violating) states (e.g., due to approximation errors), and eliminates the need to tune weights for auxiliary cost objective functions (Tessler et al., 2018). We call our algorithm Safe Policy Adaptation with Constrained Exploration (SPACE).

Figure 1: (a) Step 1 (green) improves the reward in the trust region. Step 2 (blue) projects the policy onto a region around the baseline policy π_B. Step 3 (red) projects the policy onto the constraint set. (b) Illustrating when π_B is outside the constraint set. (c) Illustrating when π_B is inside the constraint set. The highest reward is achieved at the yellow star.

This paper's contributions are two-fold. We first analyze the proposed SPACE algorithm and provide a finite-time guarantee for its convergence.
We also provide an analysis of controlling the distance between the learned policy at iteration k and the baseline policy, to ensure both feasibility of the optimization problem and safe exploration by the learning agent. Second, we empirically compare SPACE with state-of-the-art algorithms on five different control tasks, including two Mujoco environments with safety constraints from Achiam et al. (2017), two challenging traffic management tasks with fairness constraints from Vinitsky et al. (2018), and one human-demonstration driving task with safety constraints from Brockman et al. (2016). In all tasks, SPACE outperforms the state-of-the-art safe RL algorithm, projection-based constrained policy optimization (PCPO) (Yang et al., 2020), averaging 40% more reward with 10 times fewer cost constraint violations. This shows that SPACE leverages the baseline policy to achieve better learning efficiency while satisfying the cost constraint.

2. RELATED WORK

Policy optimization with constraints. Learning constraint-satisfying policies has been explored in the context of safe RL (Garcia & Fernandez, 2015; Hasanbeig et al., 2020; Junges et al., 2016; Jansen et al., 2020). Prior work either uses a conditional-gradient type of approach (Achiam et al., 2017), adds a weighted copy of the cost objective to the reward function (Tessler et al., 2018; Chow et al., 2019; Fujimoto et al., 2019; Stooke et al., 2020), adds a safety layer to the policy (Dalal et al., 2018; Avni et al., 2019), or handles chance constraints (Fu & Prashanth L, 2018; Zheng & Ratliff, 2020). Perhaps the closest work to ours is Projection-based Constrained Policy Optimization (PCPO) (Yang et al., 2020), which also uses projections in policy space to ensure constraint satisfaction. However, PCPO cannot safely exploit prior information (through a baseline policy), which makes it sample-inefficient. In addition, SPACE dynamically adjusts the distance between the learned and baseline policies, while PCPO does not; this adjustment is important for effectively and safely learning from the baseline policy. Furthermore, we provide a safety guarantee that ensures the feasibility of the optimization problem, which PCPO does not. Merely adding an IL objective to PCPO's reward objective does not learn efficiently, as our experiments show. This analysis is a step toward the practical use of RL in real applications.

Policy optimization with an initial safe set. Wachi & Sui (2020); Sui et al. (2015); Turchetta et al. (2016) assume that an initial safe set is given; the agent explores the environment and verifies the safety function starting from this set. No baseline policy is involved. In contrast, we assume the agent is given a baseline policy.
Both assumptions are reasonable as they provide an initial understanding of the environment.

Table 1: Properties and approaches of related algorithms, compared along the axes of maximizing reward, constraint-cost satisfaction, on-policy imitation, off-policy imitation, learning with expert/optimal demonstrations, and learning with suboptimal demonstrations.

DAGGER (Ross et al., 2011): optimize the IL objective
DAPG (Rajeswaran et al., 2017): optimize the RL and IL objectives jointly
NAC (Gao et al., 2018): pre-train with IL and fine-tune with RL
IC-GAIL (Wu et al., 2019): generative adversarial IL
CPO (Achiam et al., 2017): safe RL solved by a line search
RCPO (Tessler et al., 2018): safe RL solved by a Lagrangian approach
PCPO (Yang et al., 2020): safe RL solved by a projection
PID-RL (Stooke et al., 2020): safe RL solved by PID control
SPACE (ours): safe RL aided by baseline policies, with analysis

Leveraging baseline policies for RL. Prior work has used baseline policies to provide initial information to RL algorithms in order to reduce or avoid undesirable situations. This is done either by initializing the policy with the baseline policy (Driessens & Džeroski, 2004; Smart & Kaelbling, 2000; Koppejan & Whiteson, 2011; Abbeel et al., 2010; Gao et al., 2018; Le et al., 2019; Vecerik et al., 2017; Jaques et al., 2019), or by providing a teacher's advice to the agent (Garcia & Fernández, 2012; Quintía Vidal et al., 2013; Abel et al., 2017; Zhang et al., 2019). However, such works often assume that the baseline policy is constraint-satisfying (Sun et al., 2018; Balakrishna et al., 2019). In contrast, our SPACE algorithm safely leverages the baseline policy without requiring it to satisfy the specified constraints. Pathak et al. (2015); Bartocci et al. (2011) also modify existing known models (policies) based on new conditions in the context of formal methods. In contrast, we solve this problem using projections in the policy space.

Learning from logged demonstration data. To effectively learn from demonstration data produced by the baseline policy, Wu et al. (2019); Brown et al. (2019); Kwon et al. (2020) assess the demonstration data by either predicting its cost in the new task using generative adversarial networks (GANs) (Goodfellow et al., 2014), or directly learning the cost function of the demonstration data. However, these approaches require a large number of training samples from the new task. In addition, the learned cost function is not guaranteed to recover the true one, which may drive the agent into undesirable situations.
In contrast, SPACE controls the distance between the learned and baseline policies to ensure reward improvement and constraint satisfaction. Table 1 catalogs related work on safe RL and IL. We can differentiate these methods on several key axes: (1) whether they optimize reward and/or ensure cost satisfaction, (2) whether they use on-policy methods (i.e., query the baseline policy) or off-policy methods (i.e., learn from a batch of demonstration data), (3) whether the baseline policy or demonstration data is optimal for the agent's own objective, and (4) the technical approach of the work.

3. PROBLEM FORMULATION

We frame our problem as a constrained Markov Decision Process (CMDP) (Altman, 1999), defined as a tuple ⟨S, A, T, R, C⟩. Here S is the set of states, A is the set of actions, and T specifies the conditional probability T(s′|s, a) that the next state is s′ given the current state s and action a. In addition, R : S × A → R is a reward function, and C : S × A → R is a constraint cost function. The reward function encodes the benefit of using action a in state s, while the cost function encodes the corresponding constraint violation penalty. A policy is a map from states to probability distributions on A: in state s the selected action is drawn from the distribution π(s). The state then transits from s to s′ according to the state transition distribution T(s′|s, a). In doing so, a reward R(s, a) is received and a constraint cost C(s, a) is incurred, as outlined above. Let γ ∈ (0, 1) denote a discount factor, and τ denote the trajectory τ = (s_0, a_0, s_1, ...) induced by a policy π. Normally, we seek a policy π that maximizes the cumulative discounted reward

J_R(π) := E_{τ∼π}[Σ_{t=0}^∞ γ^t R(s_t, a_t)],  (1)

while keeping the cumulative discounted cost below h_C:

J_C(π) := E_{τ∼π}[Σ_{t=0}^∞ γ^t C(s_t, a_t)] ≤ h_C.  (2)

Here we consider an additional objective. We are provided with a baseline policy π_B, and at each state s we measure the divergence between π(s) and π_B(s). For example, this could be the KL-divergence D(s) := D_KL(π(s) ‖ π_B(s)). We then seek a policy that maximizes Eq. (1), satisfies Eq. (2), and ensures the discounted divergence between the learned and baseline policies is below h_D:

J_D(π) := E_{τ∼π}[Σ_{t=0}^∞ γ^t D(s_t)] ≤ h_D.  (3)

We do not assume that the baseline policy satisfies the cost constraint. Hence we allow h_D to be adjusted during the learning of π to allow for reward improvement and constraint satisfaction. Let μ_t(·|π) denote the state distribution at time t under policy π.
The discounted state distribution induced by π is defined to be d_π(s) := (1 − γ) Σ_{t=0}^∞ γ^t μ_t(s|π). Now bring in the reward advantage function (Kakade & Langford, 2002), defined by A^π_R(s, a) := Q^π_R(s, a) − V^π_R(s), where V^π_R(s) := E_{τ∼π}[Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s] is the expected reward from state s under policy π, and Q^π_R(s, a) := E_{τ∼π}[Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a] is the expected reward from state s with initial action a, thereafter following policy π. These definitions allow us to express the reward performance of one policy π′ in terms of another π: J_R(π′) − J_R(π) = (1/(1 − γ)) E_{s∼d_{π′}, a∼π′}[A^π_R(s, a)]. Similarly, we can define A^π_D(s, a), Q^π_D(s, a) and V^π_D(s) for the divergence cost, and A^π_C(s, a), Q^π_C(s, a) and V^π_C(s) for the constraint cost.
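For concreteness, the three discounted objectives J_R, J_C, and J_D can be estimated from sampled trajectories by Monte Carlo. The sketch below is illustrative only; the trajectory format and function names are our own, not from the paper:

```python
def discounted_sum(values, gamma):
    """Compute sum_t gamma^t * values[t] along one trajectory."""
    return sum(gamma ** t * v for t, v in enumerate(values))

def estimate_objectives(trajectories, gamma):
    """Monte Carlo estimates of J_R, J_C, J_D.

    Each trajectory is a list of (reward, cost, divergence) tuples, where
    divergence stands for D_KL(pi(s_t) || pi_B(s_t)) at the visited state.
    """
    n = len(trajectories)
    J_R = sum(discounted_sum([r for r, _, _ in tau], gamma) for tau in trajectories) / n
    J_C = sum(discounted_sum([c for _, c, _ in tau], gamma) for tau in trajectories) / n
    J_D = sum(discounted_sum([d for _, _, d in tau], gamma) for tau in trajectories) / n
    return J_R, J_C, J_D
```

A policy estimate is then considered feasible when J_C ≤ h_C and J_D ≤ h_D hold for these sample averages.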

4. SAFE POLICY ADAPTATION WITH CONSTRAINED EXPLORATION (SPACE)

We now describe the proposed iterative three-step algorithm illustrated in Fig. 1. In what follows, π_k denotes the learned policy after iteration k, and M denotes a distance measure between policies. For example, M might be the 2-norm of the difference of policy parameters, or some average over states of the KL-divergence of the action distributions.

Step 1. We perform one step of trust region policy optimization (Schulman et al., 2015). This maximizes the reward advantage function A^π_R(s, a) over a KL-divergence neighborhood of π_k:

π_{k+1/3} = argmax_π E_{s∼d_{π_k}, a∼π}[A^{π_k}_R(s, a)]  s.t.  E_{s∼d_{π_k}}[D_KL(π(s) ‖ π_k(s))] ≤ δ.  (4)

Step 2. We project π_{k+1/3} onto a region around π_B controlled by h^k_D, minimizing M:

π_{k+2/3} = argmin_π M(π, π_{k+1/3})  s.t.  J_D(π_k) + (1/(1 − γ)) E_{s∼d_{π_k}, a∼π}[A^{π_k}_D(s)] ≤ h^k_D.  (5)

Step 3. We project π_{k+2/3} onto the set of policies satisfying the cost constraint, minimizing M:

π_{k+1} = argmin_π M(π, π_{k+2/3})  s.t.  J_C(π_k) + (1/(1 − γ)) E_{s∼d_{π_k}, a∼π}[A^{π_k}_C(s, a)] ≤ h_C.  (6)

Controlling h^k_D in Step 2. We select h^0_D to be small and gradually increase h^k_D at each iteration to expand the region around π_B. Specifically, we make h^{k+1}_D > h^k_D if:
(a) J_C(π_k) > J_C(π_{k−1}): this increase ensures a nonempty intersection between the region around π_B and the cost constraint set (feasibility). See Fig. 1(b).
(b) J_R(π_k) < J_R(π_{k−1}): this increase gives the next policy more freedom to improve the reward and the cost constraint performance (exploration). See Fig. 1(c).

It remains to determine how to set the new value of h^{k+1}_D. Let U_1 denote the set of policies satisfying the cost constraint, and U^k_2 denote the set of policies in the region around π_B controlled by h^k_D. Then we have the following Lemma.

Lemma 4.1 (Updating h_D). If at step k + 1: h^{k+1}_D ≥ O((J_C(π_k) − h_C)²) + h^k_D, then U_1 ∩ U^{k+1}_2 ≠ ∅ (feasibility) and U^{k+1}_2 ∩ ∂U_1 ≠ ∅ (exploration).

Proof.
See the supplementary material. Lemma 4.1 ensures that the boundaries of the region around π B determined by h D and the set of policies satisfying the cost constraint intersect. Note that h D will become large enough to guarantee feasibility during training. This allows the learning algorithm to explore policies within the cost constraint set while still learning from the baseline policy.
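The h^k_D control rule above, together with the lower bound of Lemma 4.1, can be sketched as follows. This is an illustrative sketch; `kappa` is a hypothetical scaling constant standing in for the unspecified constant hidden inside the O(·) notation:

```python
def update_hD(hD, JC_k, JC_prev, JR_k, JR_prev, hC, kappa=1.0):
    """Grow the region around pi_B when either trigger fires.

    Triggers: (a) constraint cost increased (feasibility), or
              (b) reward decreased (exploration).
    kappa is an assumed constant for the O((J_C(pi_k) - h_C)^2) term.
    """
    if JC_k > JC_prev or JR_k < JR_prev:
        # h_D^{k+1} >= kappa * (J_C(pi_k) - h_C)^2 + h_D^k  (Lemma 4.1)
        return kappa * (JC_k - hC) ** 2 + hD
    return hD
```

The rule only ever enlarges the region, matching the paper's prescription of starting with a small h^0_D and increasing it when needed.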

5. A THEORETICAL ANALYSIS OF SPACE

We implement a policy as a neural network with fixed architecture, parameterized by θ ∈ R^n. We then learn a policy from the achievable set {π(·|θ) : θ ∈ R^n} by iteratively updating θ. Let θ_k and π_k := π(·|θ_k) denote the parameter value and corresponding policy at step k. In this setting, it is impractical to solve the policy updates in Eq. (4), (5) and (6) exactly. Hence we approximate the reward function and constraints with first-order Taylor expansions, and the KL-divergence with a second-order Taylor expansion. We will need the following derivatives:
(1) g_k := ∇_θ E_{s∼d_{π_k}, a∼π}[A^{π_k}_R(s, a)],
(2) a_k := ∇_θ E_{s∼d_{π_k}, a∼π}[A^{π_k}_D(s)],
(3) c_k := ∇_θ E_{s∼d_{π_k}, a∼π}[A^{π_k}_C(s, a)], and
(4) F_k := ∇²_θ E_{s∼d_{π_k}}[D_KL(π(s) ‖ π_k(s))].
Each of these derivatives is taken w.r.t. the neural network parameters and evaluated at θ_k. We also define b_k := J_D(π_k) − h^k_D, d_k := J_C(π_k) − h_C, and u_k := √(2δ / (g_k^T F_k^{-1} g_k)), and let L = I for the 2-norm projection and L = F_k for the KL-divergence projection.

Step 1. Approximating Eq. (4) yields

θ_{k+1/3} = argmax_θ g_k^T (θ − θ_k)  s.t.  (1/2)(θ − θ_k)^T F_k (θ − θ_k) ≤ δ.  (7)

Steps 2 and 3. Approximating Eq. (5) and (6) similarly yields

θ_{k+2/3} = argmin_θ (1/2)(θ − θ_{k+1/3})^T L (θ − θ_{k+1/3})  s.t.  a_k^T (θ − θ_k) + b_k ≤ 0,  (8)
θ_{k+1} = argmin_θ (1/2)(θ − θ_{k+2/3})^T L (θ − θ_{k+2/3})  s.t.  c_k^T (θ − θ_k) + d_k ≤ 0.  (9)

Solving these convex programs (see the supplementary material for the derivation) gives, for each policy update,

θ_{k+1} = θ_k + u_k F_k^{-1} g_k − max(0, (u_k a_k^T F_k^{-1} g_k + b_k) / (a_k^T L^{-1} a_k)) L^{-1} a_k − max(0, (u_k c_k^T F_k^{-1} g_k + d_k) / (c_k^T L^{-1} c_k)) L^{-1} c_k.  (10)

Algorithm 1 shows the corresponding pseudocode.
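The closed-form update of Eq. (10) admits a direct NumPy transcription for the KL-divergence projection (L = F_k). This is a sketch, not the authors' implementation: it forms F_k^{-1} explicitly for clarity, whereas a practical implementation would use conjugate gradient:

```python
import numpy as np

def space_update(theta, g, a, c, F, b, d, delta):
    """One SPACE parameter update (Eq. 10) with KL-divergence projections (L = F)."""
    Finv = np.linalg.inv(F)                       # illustration only; use CG at scale
    u = np.sqrt(2.0 * delta / (g @ Finv @ g))     # trust-region step size u_k
    theta_new = theta + u * (Finv @ g)            # Step 1: reward-improving step
    # Step 2: project toward the region around pi_B (active only if linearized
    # divergence constraint is violated, via the max(0, .) multiplier)
    lam_a = max(0.0, (u * (a @ Finv @ g) + b) / (a @ Finv @ a))
    theta_new = theta_new - lam_a * (Finv @ a)
    # Step 3: project onto the cost constraint set, same multiplier structure
    lam_c = max(0.0, (u * (c @ Finv @ g) + d) / (c @ Finv @ c))
    return theta_new - lam_c * (Finv @ c)
```

When both linearized constraints are already satisfied (b_k, d_k sufficiently negative), both multipliers vanish and the update reduces to the plain TRPO step θ_k + u_k F_k^{-1} g_k.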
Convergence analysis. We consider the following simplified problem to provide a finite-time guarantee for SPACE:

min_{θ ∈ C_1 ∩ C_2} f(θ),  (11)

where f : R^n → R is twice continuously differentiable at every point in an open set X ⊆ R^n, and C_1 ⊆ X and C_2 ⊆ X are compact convex sets with C_1 ∩ C_2 ≠ ∅. The function f is the negative reward function of our CMDP, and the two constraint sets represent the cost constraint set and the region around π_B. For a vector x, let ‖x‖ denote the Euclidean norm. For a matrix M, let ‖M‖ denote the induced matrix 2-norm, and σ_i(M) denote the i-th largest singular value of M.

Assumption 1. We assume: (1.1) the gradient ∇f is L-Lipschitz continuous over the open set X; (1.2) for some constant G, ‖∇f(θ)‖ ≤ G; (1.3) for some constant H, diam(C_1) ≤ H and diam(C_2) ≤ H.

Assumptions (1.1) and (1.2) ensure that the gradient cannot change too rapidly and that its norm cannot be too large. Assumption (1.3) implies that at every iteration the diameter of the region around π_B is bounded above by H. We will need the concept of an ε-first order stationary point (Mokhtari et al., 2018). For ε > 0, we say that θ* ∈ C_1 ∩ C_2 is an ε-first order stationary point (ε-FOSP) of Problem (11) under the KL-divergence projection if

∇f(θ*)^T (θ − θ*) ≥ −ε,  ∀θ ∈ C_1 ∩ C_2.  (12)

Similarly, under the 2-norm projection, θ* ∈ C_1 ∩ C_2 is an ε-FOSP of (11) if

∇f(θ*)^T F* (θ − θ*) ≥ −ε,  ∀θ ∈ C_1 ∩ C_2,  (13)

where F* := ∇²_θ E_{s∼d_{π*}}[D_KL(π(s) ‖ π*(s))]. Notice that SPACE converges to distinct stationary points under the two possible projections (see the supplementary material). With these assumptions, we have the following Theorem.

Theorem 5.1 (Finite-Time Guarantee of SPACE). Under the KL-divergence projection, there exists a sequence {η_k} such that SPACE converges to an ε-FOSP in at most O(ε^{-2}) iterations. Moreover, at step k + 1,

f(θ_{k+1}) ≤ f(θ_k) − Lε² / (2(G + Hσ_1(F_k)/η_k)²).  (14)

Similarly, under the 2-norm projection, there exists a sequence {η_k} such that SPACE converges to an ε-FOSP in at most O(ε^{-2}) iterations. Moreover, at step k + 1,

f(θ_{k+1}) ≤ f(θ_k) − Lε² / (2(Gσ_1(F_k^{-1}) + H/η_k)²).  (15)

Proof. The proof and the sequence {η_k} are given in the supplementary material.

Algorithm 1: SPACE
  Initialize a policy π_0 = π(·|θ_0) and a trajectory buffer B
  for k = 0, 1, 2, ... do
    Run π_k = π(·|θ_k) and store trajectories in B
    Compute g, a, c, F, b and d using B
    Obtain θ_{k+1} using the update in Eq. (10)
    if J_C(π_k) > J_C(π_{k−1}) or J_R(π_k) < J_R(π_{k−1}) then
      Update h^{k+1}_D using Lemma 4.1
    Empty B

We now make several observations about Theorem 5.1. (1) The smaller H is, the greater the decrease in the objective. This observation supports the idea of starting with a small value for h_D and increasing it only when needed. (2) Under the KL-divergence projection, the effect of σ_1(F_k) is negligible. This is because in this case η_k is determined by the KL-divergence between two consecutive updated policies (see the supplementary material), which implies that η_k is proportional to σ_1(F_k); hence σ_1(F_k) does not play a major role in decreasing the objective value. (3) Under the 2-norm projection, the smaller σ_1(F_k^{-1}) (i.e., the larger σ_n(F_k)), the greater the decrease in the objective. This is because a large σ_n(F_k) means a large curvature of f in all directions, which implies that the 2-norm distance between the pre-projection and post-projection points is small, leading to a small deviation from the reward improvement direction after the projections (see the supplementary material for a visualization).

6. EXPERIMENTS

Figure 2: (a) Gather task: the agent is rewarded for gathering green apples, but is constrained to collect a limited number of red apples (Achiam et al., 2017).
(b) Circle task: the agent is rewarded for moving in a specified wide circle, but is constrained to stay within a safe region smaller than the radius of the circle (Achiam et al., 2017). (c) Grid task: the agent controls the traffic lights in a grid road network and is rewarded for high throughput, but is constrained to let lights stay red for at most 7 consecutive seconds (Vinitsky et al., 2018). (d) Bottleneck task: the agent controls a set of autonomous vehicles (shown in red) in a traffic merge situation and is rewarded for achieving high throughput, but is constrained to ensure that human-driven vehicles (shown in white) have low speed for no more than 10 seconds (Vinitsky et al., 2018). (e) Car-racing task: the agent controls an autonomous vehicle on a race track and is rewarded for driving through as many tiles as possible, but is constrained to use the brakes at most 5 times to encourage a smooth ride (Brockman et al., 2016). (f) A human player plays car-racing with demonstration data logged.

f-PCPO also adds the divergence objective to the reward function, but follows it with a PCPO update. (3) Dynamic-point Constrained Policy Optimization (d-CPO). The d-CPO update solves the f-CPO problem with a stateful weight λ_{k+1} = (λ_k)^β, where 0 < β < 1. This is inspired by Rajeswaran et al. (2017), who use the same weight-scheduling method to adjust λ_k. (4) Dynamic-point PCPO (d-PCPO). The d-PCPO update solves the f-PCPO problem with the same stateful λ_{k+1} = (λ_k)^β, where 0 < β < 1, and uses the same method as d-CPO to learn from π_B. For all experiments and algorithms, the fixed weight is set to 1.

Baseline policies π_B. To test whether SPACE can safely and efficiently leverage the baseline policy, we used three variants of baseline policies: (1) a suboptimal π_B^cost with J_C(π_B^cost) ≈ 0, (2) a suboptimal π_B^reward with J_C(π_B^reward) > h_C, and (3) π_B^near with J_C(π_B^near) ≈ h_C (i.e.
, the baseline policy has the same cost constraint level as the agent). Note that these π_B have different degrees of constraint satisfaction. This is to examine (1) whether SPACE can achieve better learning efficiency given a constraint-satisfying π_B (i.e., π_B^cost and π_B^near), and (2) whether SPACE can safely learn from a constraint-violating π_B (i.e., π_B^reward). In addition, in the car-racing task we pre-train a π_B using an off-policy algorithm (DDPG (Lillicrap et al., 2016)), which directly learns from human demonstration data (Fig. 2(f)). This suboptimal human baseline policy is denoted by π_B^human. This is to simulate that π_B comes from a heuristic or human demonstration data, which are far from optimal. The detailed implementation of updating h_D can be found in Appendix E.1.

Overall performance. The learning curves of the discounted reward J_R(π), the undiscounted constraint cost J_C(π), and the undiscounted divergence cost J_D(π) over policy updates are shown for all tested algorithms and tasks in Fig. 3. We use π_B^near in the bottleneck and grid tasks, and π_B^human in the car-racing task. Note that π_B^human is highly suboptimal for the agent (i.e., J_R(π_B^human) is small): its reward is only around 5, as shown in the plot, and it does not solve the task at hand. Overall, we observe that (1) SPACE achieves at least 2 times faster cost constraint satisfaction in all cases, even when learning from π_B^human, (2) SPACE achieves at least 10% more reward in the bottleneck and car-racing tasks compared to the best baseline, and (3) SPACE is the only algorithm that satisfies the cost constraints in all cases. In contrast, the PCPO agent cannot robustly satisfy the cost constraints. This is because the baseline policy provides additional information about the safety constraints: for example, when using π_B^cost, which is optimized only for the safety constraints specified by the environment, the agent trained with SPACE satisfies the constraints more reliably. This shows that using the baseline policy can also help improve the cost performance. In the car-racing task, we observe that J_D(π) under SPACE decreases over the initial iterations but increases toward the end. This implies that the learned policy is guided by the baseline policy π_B^human in the beginning, but relies on less supervision later. In addition, in the grid task we observe that the final reward of SPACE is lower than that of the baseline algorithms. This is because SPACE converges to a policy inside the cost constraint set, whereas the baseline algorithms do not find constraint-satisfying policies. These observations show that SPACE can robustly ensure constraint satisfaction while achieving fast learning aided by a potentially suboptimal baseline policy.

f-CPO and f-PCPO. f-CPO and f-PCPO fail to improve the reward and incur more cost violations. Most likely this is due to persistent supervision from the baseline policy, which need not satisfy the cost constraints nor have high reward. For example, in the car-racing task we observe that the value of the divergence cost decreases throughout training.
This implies that the learned policy stays too close to the sub-optimal baseline policy, which degrades reward performance. We observe that SPACE robustly satisfies the cost constraints in all cases, even when learning from π_B^reward. In addition, learning guided by π_B^reward achieves faster reward improvement in the initial iterations, because J_R(π_B^reward) > J_R(π_B^cost), as seen in the reward plot. Conversely, learning guided by π_B^cost achieves faster reward improvement in later iterations, because by starting in the interior of the cost constraint set (i.e., J_C(π_B^cost) ≈ 0 ≤ h_C) the agent can safely exploit the baseline policy. These results show that SPACE enables fast convergence to a constraint-satisfying policy, even if π_B does not itself meet the constraints or optimize the reward.
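As an aside, the stateful weight schedule λ_{k+1} = (λ_k)^β (0 < β < 1) used by the d-CPO and d-PCPO baselines satisfies λ_k = λ_0^(β^k), so λ_k converges to 1 regardless of the initial weight. A minimal sketch of the recurrence:

```python
def anneal_weights(lam0, beta, num_iters):
    """Generate the d-CPO/d-PCPO weight schedule lam_{k+1} = lam_k ** beta.

    Since lam_k = lam0 ** (beta ** k) and 0 < beta < 1, lam_k -> 1 as k grows.
    """
    lam = lam0
    schedule = [lam]
    for _ in range(num_iters):
        lam = lam ** beta
        schedule.append(lam)
    return schedule
```

For instance, with lam0 = 0.5 and beta = 0.9 the schedule increases monotonically toward 1.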

7. CONCLUSION

We address the problem of learning constraint-satisfying policies given a baseline policy from either a teacher agent or demonstration data. The proposed algorithm effectively learns from the potentially suboptimal baseline policy without violating the constraints. SPACE achieves superior reward and cost performance compared with state-of-the-art approaches (i.e., PCPO). We further analyze the convergence of SPACE and provide an effective approach to controlling the distance between the learned and baseline policies. Future work will consider learning system dynamics to account for the uncertainty of the environment and hence enable safe learning in real applications.

Supplementary Material for Accelerating Safe Reinforcement Learning with Constraint-mismatched Policies

Outline. The supplementary material is organized as follows. Section A discusses the impact of the proposed algorithm. Section B details the proof of the h_D update in Lemma 4.1. Section C derives the analytical solution to SPACE in Eq. (10). Section D gives the proof of the finite-time guarantee of SPACE in Theorem 5.1 and discusses the difference between the KL-divergence and 2-norm projections. Section E collects additional experimental results that provide a detailed examination of the proposed algorithm compared to the baselines. These include:
• evaluation of the discounted reward versus the cumulative undiscounted constraint cost, to demonstrate that SPACE achieves better reward performance with fewer cost constraint violations,
• evaluation of SPACE guided by baseline policies with different J_C(π_B), to demonstrate that SPACE safely learns from baseline policies that need not satisfy the cost constraint,
• ablation studies using a fixed h_D in SPACE, to demonstrate the importance of the dynamic h_D for improving reward and cost performance,
• comparison of SPACE with other annealing approaches, to demonstrate that SPACE exploits the baseline policy effectively,
• comparison of SPACE under the KL-divergence and 2-norm projections, to demonstrate that they converge to different stationary points,
• evaluation of different initial values h^0_D, to demonstrate that the choice of initial value does not drastically affect the performance of SPACE.
Section E also details the environment parameters, the architectures of the policies, the computational cost, the infrastructure for computation, and instructions for executing the code. Section F provides the procedure for obtaining a baseline human policy. Finally, we complete the Machine Learning Reproducibility Checklist in Section G.

A IMPACT OF SPACE

Many autonomous systems such as self-driving cars and autonomous robots are complex. To deal with this complexity, researchers increasingly use reinforcement learning in conjunction with imitation learning to design control policies. The more we can learn from a previous policy (e.g., human demonstration, previous applications), the fewer resources (e.g., time, energy, engineering effort, cost) we need to learn a new policy. The proposed algorithm could be applied in many fields where learning a policy can take advantage of prior applications while providing assurances regarding fairness, safety, or other costs. For example, in a dialogue system where an agent converses with a human, the agent should safely learn from human preferences while avoiding biased or offensive responses. In the self-driving car domain, where an agent learns a driving policy, the agent should safely learn from human drivers while avoiding crashes. Moreover, in the personalized robotic assistant setting, where an agent learns from human demonstration, the agent should carefully imitate humans without damaging itself or causing harm to nearby humans. These examples highlight the potential impact of the proposed algorithm for accelerating safe reinforcement learning by adapting prior knowledge, which can open the door to advances in lifelong learning and the adaptation of agents to different contexts. One deficiency of the proposed algorithm is that the agent may still violate the cost constraints while learning control policies. This is because any learning-based system needs to experiment with various actions to find a constraint-satisfying policy. Even if the agent does not violate the safety constraints during the learning phase, any change or perturbation of the environment that was not envisioned at the time of programming or training may lead to a catastrophic failure at run-time.
These systems cannot guarantee that sensor inputs will not induce undesirable consequences, nor can they adapt and maintain safety in situations in which new objectives arise. This creates serious concerns in safety-critical applications such as self-driving vehicles and personalized chatbot systems.

[Figure 5: panels (a) and (b) illustrate the cost constraint set, the region around π_B, the reward-optimal policy, and π_boundary.]

This raises several questions: What human-agent communication is needed to bring humans into the loop to increase safety guarantees for the autonomous system? How can trust and safety constraints be incorporated into the planning and control processes? How can one effectively identify unsafe plans of the baseline policy? We believe this paper will encourage future work on rigorous design and analysis tools for continual safety assurance in conjunction with using baseline policies from previous applications.

B PROOF OF UPDATING h_D IN LEMMA 4.1

Proof. Based on Theorem 1 in Achiam et al. (2017), for any two policies π' and π we have

J_C(π') − J_C(π) ≥ (1/(1−γ)) E_{s∼d^π, a∼π'}[ A_C^π(s, a) − (2γ ε_C^{π'}/(1−γ)) √(½ D_KL(π'(s)‖π(s))) ]

⇒ (2γ ε_C^{π'}/(1−γ)^2) E_{s∼d^π}[ √(½ D_KL(π'(s)‖π(s))) ] ≥ −J_C(π') + J_C(π) + (1/(1−γ)) E_{s∼d^π, a∼π'}[A_C^π(s, a)]

⇒ (2γ ε_C^{π'}/(1−γ)^2) E_{s∼d^π}[ √(½ D_KL(π'(s)‖π(s))) ] ≥ −J_C(π') + J_C(π)

⇒ (√2 γ ε_C^{π'}/(1−γ)^2) E_{s∼d^π}[ √(D_KL(π'(s)‖π(s))) ] ≥ −J_C(π') + J_C(π)

⇒ E_{s∼d^π}[ D_KL(π'(s)‖π(s)) ] ≥ (1−γ)^4 (−J_C(π') + J_C(π))^2 / (2γ^2 (ε_C^{π'})^2).   (16)

The fourth inequality follows from Jensen's inequality. We then define ϕ(π(s)) := Σ_i π(a^{(i)}|s) log π(a^{(i)}|s). By the three-point lemma (Chen & Teboulle, 1993), for any three policies π, π', and π'' we have

E_{s∼d^π}[ D_KL(π'(s)‖π''(s)) ] = E_{s∼d^π}[ D_KL(π'(s)‖π(s)) ] + E_{s∼d^π}[ D_KL(π(s)‖π''(s)) ] − E_{s∼d^π}[ (∇ϕ(π''(s)) − ∇ϕ(π(s)))^T (π'(s) − π(s)) ].   (17)
Let π_boundary denote a policy satisfying J_C(π_boundary) = h_C (i.e., π_boundary lies on the boundary of the set of policies which satisfy the cost constraint J_C(π) ≤ h_C). Let π' = π_boundary, π'' = π_B, and π = π_k in Eq. (16) and Eq. (17) (this is illustrated in Fig. 5). Then we have

E_{s∼d^{π_k}}[ D_KL(π_boundary(s)‖π_B(s)) ] − E_{s∼d^{π_k}}[ D_KL(π_k(s)‖π_B(s)) ]
= E_{s∼d^{π_k}}[ D_KL(π_boundary(s)‖π_k(s)) ] − E_{s∼d^{π_k}}[ (∇ϕ(π_B(s)) − ∇ϕ(π_k(s)))^T (π_boundary(s) − π_k(s)) ]
≥ (1−γ)^4 (−J_C(π_boundary) + J_C(π_k))^2 / (2γ^2 (ε_C^{π'})^2) − E_{s∼d^{π_k}}[ (∇ϕ(π_B(s)) − ∇ϕ(π_k(s)))^T (π_boundary(s) − π_k(s)) ]
= (1−γ)^4 (−h_C + J_C(π_k))^2 / (2γ^2 (ε_C^{π'})^2) − E_{s∼d^{π_k}}[ (∇ϕ(π_B(s)) − ∇ϕ(π_k(s)))^T (π_boundary(s) − π_k(s)) ]
= O((−h_C + J_C(π_k))^2),   (18)

where J_C(π_boundary) = h_C. For the first case in Fig. 5(a), we would like to have U_1 ∩ U_2^{k+1} ≠ ∅ (feasibility). For the second case in Fig. 5(b), we would like to have U_2^{k+1} ∩ ∂U_1 ≠ ∅ (exploration). This implies that the policy at step k+1 is π_boundary, which satisfies U_1 ∩ U_2^{k+1} ≠ ∅ and U_2^{k+1} ∩ ∂U_1 ≠ ∅. Now let h_D^{k+1} := E_{s∼d^{π_k}}[ D_KL(π_boundary(s)‖π_B(s)) ] and h_D^k := E_{s∼d^{π_k}}[ D_KL(π_k(s)‖π_B(s)) ]. Then Eq. (18) implies

h_D^{k+1} ≥ O((−h_C + J_C(π_k))^2) + h_D^k.

Lemma 4.1 theoretically ensures that h_D is large enough to guarantee feasibility and exploration of the agent. Note that we do not provide guarantees for finding an optimal policy; this would require additional assumptions on the objective function (e.g., convexity). In addition, the goal of this paper is to understand and analyze how to effectively exploit a baseline policy in constrained RL. Without such an analysis, we would not be confident in deploying SPACE in real applications. Furthermore, the question of safely using baseline policies has practical potential and is less studied by prior work (Achiam et al., 2017; Chow et al., 2019; Tessler et al., 2018; Yang et al., 2020).

C PROOF OF ANALYTICAL SOLUTION TO SPACE IN EQ. (10)

We first approximate the three stages of SPACE as follows.

Step 1. Approximating Eq. (4) yields

θ^{k+1/3} = argmax_θ g_k^T (θ − θ_k)  s.t. ½ (θ − θ_k)^T F_k (θ − θ_k) ≤ δ.   (19)

Steps 2 and 3. Approximating Eq. (5) and Eq. (6) similarly yields

θ^{k+2/3} = argmin_θ ½ (θ − θ^{k+1/3})^T L (θ − θ^{k+1/3})  s.t. a_k^T (θ − θ_k) + b_k ≤ 0,   (20)
θ^{k+1} = argmin_θ ½ (θ − θ^{k+2/3})^T L (θ − θ^{k+2/3})  s.t. c_k^T (θ − θ_k) + d_k ≤ 0,   (21)

where L = I for the 2-norm projection and L = F_k for the KL-divergence projection.

Proof. For the first problem in Eq. (19), since F_k is the Fisher information matrix, it is positive semi-definite. Hence it is a convex program with a quadratic inequality constraint. If the primal problem has a feasible point, then Slater's condition is satisfied and strong duality holds. Let θ* and λ* denote the solutions to the primal and dual problems, respectively. In addition, the primal objective function is continuously differentiable. Hence the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for the optimality of θ* and λ*. We form the Lagrangian:

L(θ, λ) = −g_k^T (θ − θ_k) + λ (½ (θ − θ_k)^T F_k (θ − θ_k) − δ).

The KKT conditions are:

−g_k + λ* F_k θ* − λ* F_k θ_k = 0    (∇_θ L(θ*, λ*) = 0)   (22)
½ (θ* − θ_k)^T F_k (θ* − θ_k) − δ = 0    (∇_λ L(θ*, λ*) = 0)   (23)
½ (θ* − θ_k)^T F_k (θ* − θ_k) − δ ≤ 0    (primal constraint)   (24)
λ* ≥ 0    (dual constraint)   (25)
λ* (½ (θ* − θ_k)^T F_k (θ* − θ_k) − δ) = 0    (complementary slackness)   (26)

By Eq. (22), we have θ* = θ_k + (1/λ*) F_k^{−1} g_k. Plugging this into Eq. (23), we have λ* = √(g_k^T F_k^{−1} g_k / (2δ)). Hence we obtain the solution

θ^{k+1/3} = θ* = θ_k + √(2δ / (g_k^T F_k^{−1} g_k)) F_k^{−1} g_k,   (27)

which also satisfies Eq. (24), Eq. (25), and Eq. (26). For the second problem in Eq. (20), we follow the same procedure to form the Lagrangian:

L(θ, λ) = ½ (θ − θ^{k+1/3})^T L (θ − θ^{k+1/3}) + λ (a_k^T (θ − θ_k) + b_k).
The KKT conditions are:

L θ* − L θ^{k+1/3} + λ* a_k = 0    (∇_θ L(θ*, λ*) = 0)   (28)
a_k^T (θ* − θ_k) + b_k = 0    (∇_λ L(θ*, λ*) = 0)   (29)
a_k^T (θ* − θ_k) + b_k ≤ 0    (primal constraint)   (30)
λ* ≥ 0    (dual constraint)   (31)
λ* (a_k^T (θ* − θ_k) + b_k) = 0    (complementary slackness)   (32)

By Eq. (28), we have θ* = θ^{k+1/3} − λ* L^{−1} a_k. Plugging this into Eq. (29) and Eq. (31), we have λ* = max(0, (a_k^T (θ^{k+1/3} − θ_k) + b_k) / (a_k^T L^{−1} a_k)). Hence we obtain the solution

θ^{k+2/3} = θ* = θ^{k+1/3} − max(0, (a_k^T (θ^{k+1/3} − θ_k) + b_k) / (a_k^T L^{−1} a_k)) L^{−1} a_k,   (33)

which also satisfies Eq. (30) and Eq. (32). For the third problem in Eq. (21), instead of projecting π^{k+2/3}, the intermediate policy obtained in the second step, we project the policy π^{k+1/3} onto the cost constraint. This allows us to compute the projection without much additional computational cost. We follow the same procedure as in the first and second problems to form the Lagrangian:

L(θ, λ) = ½ (θ − θ^{k+1/3})^T L (θ − θ^{k+1/3}) + λ (c_k^T (θ − θ_k) + d_k).

The KKT conditions are:

L θ* − L θ^{k+1/3} + λ* c_k = 0    (∇_θ L(θ*, λ*) = 0)   (34)
c_k^T (θ* − θ_k) + d_k = 0    (∇_λ L(θ*, λ*) = 0)   (35)
c_k^T (θ* − θ_k) + d_k ≤ 0    (primal constraint)   (36)
λ* ≥ 0    (dual constraint)   (37)
λ* (c_k^T (θ* − θ_k) + d_k) = 0    (complementary slackness)   (38)

By Eq. (34), we have θ* = θ^{k+1/3} − λ* L^{−1} c_k. Plugging this into Eq. (35) and Eq. (37), we have λ* = max(0, (c_k^T (θ^{k+1/3} − θ_k) + d_k) / (c_k^T L^{−1} c_k)). Hence we obtain the solution

θ^{k+1} = θ* = θ^{k+1/3} − max(0, (c_k^T (θ^{k+1/3} − θ_k) + d_k) / (c_k^T L^{−1} c_k)) L^{−1} c_k.   (39)

Combining Eq. (27), Eq. (33) and Eq. (39), we have

θ^{k+1} = θ_k + √(2δ / (g_k^T F_k^{−1} g_k)) F_k^{−1} g_k
  − max(0, (√(2δ / (g_k^T F_k^{−1} g_k)) a_k^T F_k^{−1} g_k + b_k) / (a_k^T L^{−1} a_k)) L^{−1} a_k
  − max(0, (√(2δ / (g_k^T F_k^{−1} g_k)) c_k^T F_k^{−1} g_k + d_k) / (c_k^T L^{−1} c_k)) L^{−1} c_k.
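The combined update above is straightforward to evaluate once g_k, F_k and the two linearized constraints (a_k, b_k) and (c_k, d_k) are available. Below is a minimal numerical sketch of this closed form, assuming small dense matrices and our own function name `space_update`; the actual implementation works on neural network parameters and would use conjugate-gradient solves instead of explicit inverses.

```python
import numpy as np

def space_update(theta, g, F, a, b, c, d, delta, L=None):
    """One SPACE parameter update via Eq. (10): trust-region reward step,
    then two (possibly inactive) halfspace projections under the L-norm."""
    if L is None:
        L = np.eye(len(theta))  # 2-norm projection; pass L=F for the KL projection

    # Step 1: natural-gradient reward step within the KL trust region of radius delta
    Finv_g = np.linalg.solve(F, g)
    step = np.sqrt(2.0 * delta / (g @ Finv_g))
    theta_r = theta + step * Finv_g                      # theta^{k+1/3}

    # Step 2: project onto {theta : a^T (theta - theta_k) + b <= 0}
    Linv_a = np.linalg.solve(L, a)
    lam_a = max(0.0, (a @ (theta_r - theta) + b) / (a @ Linv_a))

    # Step 3: project theta^{k+1/3} onto {theta : c^T (theta - theta_k) + d <= 0}
    Linv_c = np.linalg.solve(L, c)
    lam_c = max(0.0, (c @ (theta_r - theta) + d) / (c @ Linv_c))

    return theta_r - lam_a * Linv_a - lam_c * Linv_c     # theta^{k+1}
```

Note that the `max(0, ·)` terms vanish whenever the post-reward-step point already satisfies the corresponding linearized constraint, in which case the projection is a no-op.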

D PROOF OF FINITE-TIME GUARANTEE OF SPACE IN THEOREM 5.1

We now describe the reason for choosing two variants of ε-FOSP under the two possible projections. Let η_R^k denote the step size for the reward, η_D^k the step size for the divergence cost, and η_C^k the step size for the constraint cost. Without loss of generality, under the KL-divergence projection, at step k+1 SPACE performs

θ_{k+1} = θ_k + η_R^k F_k^{−1} g_k − η_D^k F_k^{−1} a_k − η_C^k F_k^{−1} c_k.

Similarly, under the 2-norm projection, at step k+1 SPACE performs

θ_{k+1} = θ_k + η_R^k F_k^{−1} g_k − η_D^k a_k − η_C^k c_k.

With this definition, we have the following lemma.

Lemma D.1 (Stationary Points of SPACE). Under the KL-divergence projection, SPACE converges to a stationary point θ* satisfying η_R* g* = η_D* a* + η_C* c*. Under the 2-norm projection, SPACE converges to a stationary point θ* satisfying η_R* g* = F* (η_D* a* + η_C* c*).

Proof. Under the KL-divergence projection, by the definition of a stationary point we have

θ* = θ* + η_R* F*^{−1} g* − η_D* F*^{−1} a* − η_C* F*^{−1} c*
⇒ η_R* F*^{−1} g* = η_D* F*^{−1} a* + η_C* F*^{−1} c*
⇒ η_R* g* = η_D* a* + η_C* c*.

Under the 2-norm projection, by the definition of a stationary point we have

θ* = θ* + η_R* F*^{−1} g* − η_D* a* − η_C* c*
⇒ η_R* F*^{−1} g* = η_D* a* + η_C* c*
⇒ η_R* g* = F* (η_D* a* + η_C* c*).

Lemma D.2 (Optimality Condition of the Projection). Let C be a convex set and let P_C^L(θ) := argmin_{θ'∈C} ‖θ' − θ‖_L^2 denote the projection of θ onto C under the norm induced by a positive-definite matrix L. Then θ* = P_C^L(θ) if and only if

(θ' − θ*)^T L (θ − θ*) ≤ 0, ∀θ' ∈ C.

Proof. (⇒) Let θ* = P_C^L(θ) for a given θ, let θ' ∈ C be such that θ' ≠ θ*, and let α ∈ (0, 1). Since C is convex, θ* + α(θ' − θ*) ∈ C, and hence

‖θ − θ*‖_L^2 ≤ ‖θ − θ* − α(θ' − θ*)‖_L^2 = ‖θ − θ*‖_L^2 + α^2 ‖θ' − θ*‖_L^2 − 2α (θ' − θ*)^T L (θ − θ*)
⇒ (θ' − θ*)^T L (θ − θ*) ≤ (α/2) ‖θ' − θ*‖_L^2.   (40)

Since the right-hand side of Eq. (40) can be made arbitrarily small by letting α → 0, we have (θ' − θ*)^T L (θ − θ*) ≤ 0 for all θ' ∈ C.

(⇐) Let θ* ∈ C be such that (θ' − θ*)^T L (θ − θ*) ≤ 0 for all θ' ∈ C. We show that θ* must be the optimal solution. Let θ' ∈ C with θ' ≠ θ*.
Then we have

‖θ − θ'‖_L^2 − ‖θ − θ*‖_L^2 = ‖θ − θ* + θ* − θ'‖_L^2 − ‖θ − θ*‖_L^2
= ‖θ − θ*‖_L^2 + ‖θ* − θ'‖_L^2 − 2 (θ' − θ*)^T L (θ − θ*) − ‖θ − θ*‖_L^2 > 0
⇒ ‖θ − θ'‖_L^2 > ‖θ − θ*‖_L^2.

Hence, θ* is the optimal solution to the projection problem, i.e., θ* = P_C^L(θ).

We now prove Theorem 5.1. Without loss of generality, on each learning episode SPACE performs the reward update followed by the alternation of two projections onto the constraint sets (the region around π_B and the cost constraint set):

θ^{k+1/3} = θ_k − η_k F^{−1} ∇f(θ_k),  θ^{k+2/3} = P_{C_2}(θ^{k+1/3}),  θ^{k+1} = P_{C_1}(θ^{k+2/3}),  if θ_k ∈ C_2,
θ^{k+1/3} = θ_k − η_k F^{−1} ∇f(θ_k),  θ^{k+2/3} = P_{C_1}(θ^{k+1/3}),  θ^{k+1} = P_{C_2}(θ^{k+2/3}),  if θ_k ∈ C_1,

where η_k is the step size at step k.

Proof. We show that SPACE under the KL-divergence projection converges to an ε-FOSP. Based on Lemma D.2 under the KL-divergence projection, setting θ = θ_k − η_k F_k^{−1} ∇f(θ_k), θ* = θ^{k+2/3}, and θ' = θ_k, we have

(θ_k − θ^{k+2/3})^T F_k (θ_k − η_k F_k^{−1} ∇f(θ_k) − θ^{k+2/3}) ≤ 0
⇒ ∇f(θ_k)^T (θ^{k+2/3} − θ_k) ≤ −(1/η_k) (θ^{k+2/3} − θ_k)^T F_k (θ^{k+2/3} − θ_k).   (41)

Based on the L-Lipschitz continuity of the gradient of f and Eq. (41), we have

f(θ^{k+2/3}) ≤ f(θ_k) + ∇f(θ_k)^T (θ^{k+2/3} − θ_k) + (L/2) ‖θ^{k+2/3} − θ_k‖^2
≤ f(θ_k) − (1/η_k) (θ^{k+2/3} − θ_k)^T F_k (θ^{k+2/3} − θ_k) + (L/2) ‖θ^{k+2/3} − θ_k‖^2
= f(θ_k) − (L/2) ‖θ^{k+2/3} − θ_k‖^2 − ∇f(θ^{k+2/3})^T (θ^{k+1} − θ^{k+2/3}) − (L/2) ‖θ^{k+1} − θ^{k+2/3}‖^2.

Comments on Assumption 1.3. In the paper, we assume that the diameters of both the cost constraint set (C_1) and the region around π_B (C_2) are bounded above by H. This implies that, given a small value of h_D, the convergence speed is determined by how large the constraint set is, which enables our analysis of the algorithm. In practice, we agree that this assumption is strong and leave relaxing it to future work.

Interpretation of Theorem 5.1. We now provide a visualization in Fig. 6 under the two possible projections.
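Lemma D.2's variational characterization can be checked numerically for a simple halfspace constraint. The sketch below uses an illustrative weighting matrix and halfspace of our own choosing (not from the paper): it computes the closed-form L-weighted projection and verifies that every point inside the constraint set satisfies the optimality condition.

```python
import numpy as np

def project_halfspace(theta, a, b, L):
    """L-weighted projection of theta onto the halfspace {x : a^T x <= b},
    i.e., argmin_x ||x - theta||_L^2  s.t.  a^T x <= b (cf. Lemma D.2)."""
    Linv_a = np.linalg.solve(L, a)
    lam = max(0.0, (a @ theta - b) / (a @ Linv_a))
    return theta - lam * Linv_a

# Verify the condition (theta' - theta*)^T L (theta - theta*) <= 0 for theta' in C
L = np.diag([2.0, 1.0])               # positive-definite weighting (e.g., a Fisher matrix)
a, b = np.array([1.0, 1.0]), 1.0      # halfspace a^T x <= b
theta = np.array([2.0, 2.0])          # point outside the halfspace
theta_star = project_halfspace(theta, a, b, L)

rng = np.random.default_rng(0)
for _ in range(1000):
    theta_p = rng.normal(size=2) * 5.0
    if a @ theta_p <= b:              # only sample points inside the constraint set
        assert (theta_p - theta_star) @ L @ (theta - theta_star) <= 1e-9
```

With L = I this reduces to the familiar Euclidean projection; with L = F_k it is the (local) KL-divergence projection used in SPACE.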
For each projection, we consider two possible Fisher information matrices; please see the caption for more detail. In Fig. 6(a) we observe that since both the reward improvement and projection steps use the KL-divergence, the two resulting update points with different σ_1(F_k) are similar. In addition, under the 2-norm projection, the larger σ_n(F_k) is, the greater the decrease in the objective. This is because a large σ_n(F_k) implies a large curvature of f in all directions. Intuitively, this makes the learning algorithm confident about where to update the policy to decrease the objective value substantially. Geometrically, a large σ_n(F_k) makes the 2-norm distance between the pre-projection and post-projection points small, leading to a small deviation from the reward improvement direction. This is illustrated in Fig. 6(b): since F_k determines the curvature of f and the 2-norm projection is used, the updated point with a larger σ_n(F_k) (red dot) achieves greater improvement of the objective value. These observations imply that the spectrum of the Fisher information matrix does not play a major role in SPACE under the KL-divergence projection, whereas it affects the decrease of the objective value in SPACE under the 2-norm projection. Hence we choose either the KL-divergence or the 2-norm projection depending on the task to achieve better performance.

E.1 IMPLEMENTATION DETAILS

Mujoco Tasks (Achiam et al., 2017). In the point circle and ant circle tasks, the reward and cost functions are

R(s) = v^T [−x_2; x_1] / (1 + | ‖[x_1; x_2]‖ − d |),    C(s) = 1[|x_1| > x_lim],

where x_1 and x_2 are the coordinates in the plane, v is the velocity of the agent, and d, x_lim are environmental parameters that specify the safe area. The agent is rewarded for moving fast along a wide circle of radius d, but is constrained to stay within a safe region smaller than the circle in the x_1-coordinate, i.e., x_lim ≤ d. For the point agent, we use d = 5 and x_lim = 2.5; for the ant agent, we use d = 5 and x_lim = 1. The environment is illustrated in Fig. 7.

In the point gather task, the agent receives a reward of +10 for gathering green apples and a cost of 1 for gathering red apples. Two green apples and eight red apples are placed in the environment at the beginning. In the ant gather task, the agent receives a reward of +10 for gathering green apples and a cost of 1 for gathering red apples. The agent also receives a reward of −10 for falling down, to encourage smooth movement. Eight green apples and eight red apples are placed in the environment at the beginning. For the point and ant agents, the state space consists of the positions, orientations, velocities, joint angles, and the external forces applied to the torso. The action space is the force applied to the joints.

Traffic Management Tasks (Vinitsky et al., 2018). In the grid task, the state space, action space, reward function, and cost function are as follows. (1) States: the speed, distance to the intersection, and edge number of each vehicle. The edges of the grid are uniquely numbered, so the travel direction can be inferred. For the traffic lights, we return 0 or 1 corresponding to green or red for each light, a number in [0, t_switch] indicating how long until a switch can occur, and 0 or 1 indicating whether the light is currently yellow.
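The circle-task reward and cost stated above can be sketched in a few lines. This is our own illustrative helper (hypothetical name `circle_reward_cost`), not the environment's actual implementation:

```python
import numpy as np

def circle_reward_cost(pos, vel, d=5.0, x_lim=2.5):
    """Point-circle task reward/cost as described above.

    pos = [x1, x2]: planar coordinates; vel: planar velocity.
    The agent is rewarded for fast counter-clockwise motion along a circle
    of radius d, and pays cost 1 whenever |x1| exceeds x_lim (x_lim <= d).
    """
    x1, x2 = pos
    # Tangential velocity, discounted by distance from the circle of radius d
    reward = vel @ np.array([-x2, x1]) / (1.0 + abs(np.linalg.norm(pos) - d))
    cost = float(abs(x1) > x_lim)
    return reward, cost
```

For example, an agent at (5, 0) moving with velocity (0, 1) is exactly on the reward circle but outside the safe band |x_1| ≤ 2.5, so it collects full tangential reward while incurring a cost of 1.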
Finally, we return the average density and velocity of each edge. (2) Actions: a list of numbers a ∈ [−1, 1]^n, where n is the number of traffic lights. If a_i > 0, traffic light i switches; otherwise no action is taken. (3) Reward: the objective of the agent is to achieve high speeds. The reward function is

R(s) = max(‖v_target‖ − ‖v_target − v‖, 0) / ‖v_target‖,

where v_target is an arbitrary large target velocity (broadcast over all vehicles) used to encourage high speeds, and v ∈ R^k contains the velocities of the k vehicles in the network. (4) Cost: the objective is to let lights stay red for at most 7 consecutive seconds. The cost function is

C(s) = Σ_{i=1}^{n} 1[t_{i,red} > 7],

where t_{i,red} is the consecutive time that light i has been red.

In the bottleneck task, the state space, action space, reward function, and cost function are as follows. (1) States: the mean positions and velocities of human drivers for each lane of each edge segment, the mean positions and velocities of the autonomous vehicles on each segment, and the outflow of the system in vehicles per hour over the last 5 seconds. (2) Actions: for a given edge segment and a given lane, the action shifts the maximum speed of all the autonomous vehicles in the segment from its current value. By shifting the max speed to higher or lower values, the system indirectly controls the velocity of the autonomous vehicles. (3) Reward: the objective of the agent is to maximize the outflow of the whole traffic system. The reward function is

R(s_t) = Σ_{i=t−5/Δt}^{t} n_exit(i) / ((5/Δt) · n_lane · 500),

where n_exit(i) is the number of vehicles that exit the system at time step i and n_lane is the number of lanes. (4) Cost: the objective is to keep human drivers from having low speed for more than 10 seconds. The cost function is

C(s) = Σ_{i=1}^{n_human} 1[t_{i,low} > 10],

where n_human is the number of human drivers and t_{i,low} is the consecutive time that the velocity of human driver i has been below 5 m/s.
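The grid-task reward and cost above can be sketched directly. These are our own illustrative helpers (hypothetical names `grid_reward`, `grid_cost`), assuming the normalized desired-velocity form of the reward:

```python
import numpy as np

def grid_reward(v, v_target):
    """Normalized closeness of vehicle speeds v (shape (k,)) to the scalar
    target speed v_target: 1 when all vehicles move at v_target, 0 when
    the shortfall is as large as the target itself."""
    target = v_target * np.ones_like(v)
    shortfall = np.linalg.norm(target) - np.linalg.norm(target - v)
    return max(shortfall, 0.0) / np.linalg.norm(target)

def grid_cost(t_red, limit=7.0):
    """Number of traffic lights whose consecutive red time t_red[i]
    exceeds `limit` seconds."""
    return sum(1 for t in t_red if t > limit)
```

For instance, if all vehicles travel exactly at the target speed the reward is 1, and a state in which two of three lights have been red for more than 7 seconds incurs a cost of 2.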
For more information, please refer to Vinitsky et al. (2018).

Car-racing Task. In the car-racing task, the state space, action space, reward function, and cost function are as follows. (1) States: a high-dimensional space where each state is a 96 × 96 × 3 tensor of raw pixels, with each pixel in the range [0, 255]. (2) Actions: the agent has 12 actions in total: a ∈ A = {(a_steer, a_gas, a_brake) | a_steer ∈ {−1, 0, 1}, a_gas ∈ {0, 1}, a_brake ∈ {0, 0.2}}, where a_steer is the steering angle, a_gas is the amount of gas applied, and a_brake is the amount of brake applied. (3) Reward: in each episode, we randomly generate the track. The episode terminates if the agent reaches the maximum number of steps or traverses over 95% of the track. The track is discretized into 281 tiles; the agent receives a reward of 1000/281 for each tile visited. To encourage driving efficiency, the agent receives a penalty of −1 per time step. (4) Cost: the cost constrains the accumulated number of brakes to encourage a smooth ride.

Architectures and Parameters. For the gather and circle tasks we test two distinct agents: a point mass (S ⊆ R^9, A ⊆ R^2) and an ant robot (S ⊆ R^{32}, A ⊆ R^8). The agent in the grid task has S ⊆ R^{156}, A ⊆ R^4, and the agent in the bottleneck task has S ⊆ R^{141}, A ⊆ R^{20}. Finally, the agent in the car-racing task has S ⊆ R^{96×96×3}, A ⊆ R^3. For the simulations in the gather and circle tasks, we use a neural network with two hidden layers of sizes (64, 32) to represent Gaussian policies, and we use the KL-divergence projection. For the simulations in the grid and bottleneck tasks, we use neural networks with two hidden layers of sizes (16, 16) and (50, 25), respectively, to represent Gaussian policies, and we use the 2-norm projection.
For the simulation in the car-racing task, we use a convolutional neural network with two convolutional operators of sizes 24 and 12 followed by dense layers of sizes (32, 16) to represent a Gaussian policy, and we use the KL-divergence projection. The choice of projection depends on the task itself; we report the best performance of the two projections. We use tanh as the activation function for all the neural network policies. In the experiments, since the step size is small, we reuse the Fisher information matrix of the reward improvement step in the KL-divergence projection step to reduce the computational cost. We use the GAE-λ approach (Schulman et al., 2016) to estimate A_R^π(s, a), A_C^π(s, a), and A_D^π(s). For the simulations in the gather, circle, and car-racing tasks, we use neural network baselines with the same architecture and activation functions as the policy networks. For the simulations in the grid and bottleneck tasks, we use linear baselines. The hyperparameters of all algorithms and tasks are listed in Table 2. We conduct the experiments on three separate machines: machine A has an Intel Core i7-4770HQ CPU, machine B has an Intel Core i7-6850K CPU, and machine C has an Intel Xeon X5675 CPU. We report real time (i.e., wall-clock time) in seconds for one policy update for all tested algorithms and tasks in Table 3. We observe that SPACE has the same computational time as the other baselines. For the most intensive task, i.e., the car-racing task, the memory usage is 6.28 GB. The experiments are implemented in rllab (Duan et al., 2016), a tool for developing RL algorithms. We provide a link to the code: https://sites.google.com/view/spacealgo.

Comments on the rationale behind when to increase h_D. The update method of h_D is empirically designed to ensure that the value of the cost does not increase (i.e., J_C(π_k) ≤ J_C(π_{k−1})) and the reward keeps improving (i.e., J_R(π_k) ≥ J_R(π_{k−1})) after learning from π_B.
Lemma 4.1 theoretically ensures h D is large enough to guarantee feasibility and exploration of the agent. 

Implementation of h_D updates. Lemma 4.1 suggests the update

h_D^{k+1} = c · (J_C(π_k) − h_C)^2 + h_D^k  if J_C(π_k) > J_C(π_{k−1}) or J_R(π_k) < J_R(π_{k−1}) at step k,

where c is a constant. We now provide the practical implementation. For each policy update, we check the above conditions. If either condition holds, we increase h_D^{k+1} by setting the constant to 10, i.e., h_D^{k+1} = 10 · (J_C(π_k) − h_C)^2 + h_D^k. In practice, we find that the performance of SPACE is not affected by the selection of this constant. Note that we could still compute the exact value of h_D^{k+1} as shown in the proof of Lemma 4.1; however, this incurs additional computational cost.

Comments on learning from multiple baseline policies π_B. In our setting, we use a single π_B, which allows us to perform the theoretical analysis. One possible idea for learning from multiple baseline policies is to compute the distance to each π_B and then select the one with the minimum distance to perform the update. This ensures that the update for the reward in the first step is less affected by π_B, and our analysis can be extended accordingly. We leave developing this to future work.

Comments on refining the PCPO agent's policy (Yang et al., 2020) directly. Fine-tuning the pre-trained policy directly might result in lower reward and more cost violations. This is because the pre-trained policy has low entropy and does not explore. We empirically observe that the agent pre-trained with the baseline policy yields less reward in the new task (i.e., with a different cost constraint threshold h_C), as illustrated in Section E.2. In contrast, the SPACE agent simultaneously learns from the baseline policy while ensuring the policy entropy is high enough to explore the environment.

Comments on the feasibility of obtaining safe baseline policies. In many real-world applications such as drones, we can obtain baseline policies modeled from first-principles physics, pre-train baseline policies in a constrained and safe environment, or use rule-based baseline policies.
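The practical h_D update rule described above can be sketched in a few lines. The function and variable names here are our own (a minimal sketch, not the paper's implementation):

```python
def update_h_D(h_D, J_C_hist, J_R_hist, h_C, scale=10.0):
    """Increase h_D (the allowed divergence from the baseline policy) when
    progress stalls: triggered if the cost increased or the reward decreased
    between the last two policy updates, with the constant set to 10."""
    if J_C_hist[-1] > J_C_hist[-2] or J_R_hist[-1] < J_R_hist[-2]:
        return scale * (J_C_hist[-1] - h_C) ** 2 + h_D
    return h_D
```

When neither trigger fires, h_D is left unchanged, so the agent keeps learning within the current region around π_B.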
Importantly, we do not assume the baseline has to be a "safe policy": it can be a heuristic that ignores safety constraints. This is one of the main motivations for our algorithm: to utilize priors from a baseline which may be unsafe, while guaranteeing the safety of the newly learned policy according to the provided constraints.

Instructions for Reproducibility. We now provide instructions for reproducing the results. First, install the libraries for python3 such as numpy and scipy. To run the Mujoco experiments, obtain the license from https://www.roboti.us/license.html. To run the traffic management experiments, install the FLOW simulator from https://flow.readthedocs.io/en/latest/. To run the car-racing experiments, install OpenAI Gym from https://github.com/openai/gym. Our implementation is based on the environment from Achiam et al. (2017); please download the code from https://github.com/jachiam/cpo. The code is based on rllab (Duan et al., 2016); install the relevant packages such as theano (http://deeplearning.net/software/theano/). Then, download the SPACE code from https://sites.google.com/view/spaceneurips and place the code in the designated folder as instructed by the Readme.txt in the main folder. Finally, go to the example folder and execute the code using the python command.

E.2 EXPERIMENT RESULTS

Baseline policies. We pre-train the baseline policies using a safe RL algorithm. Here we also consider three types of baseline policies: (1) suboptimal π_B^cost

The Discounted Reward vs. the Cumulative Undiscounted Constraint Cost (see Fig. 8). To show that SPACE achieves higher reward under the same number of cost constraint violations (i.e., learns a constraint-satisfying policy without heavily violating the cost constraint), we examine the discounted reward versus the cumulative undiscounted constraint cost. The learning curves of the discounted reward versus the cumulative undiscounted constraint cost are shown for all tested algorithms and tasks in Fig. 8. We observe that in these tasks, for the same value of the reward, SPACE outperforms the baselines significantly with fewer cost constraint violations. For example, in the car-racing task SPACE achieves 3 times fewer cost constraint violations at a reward value of 40 compared to the best baseline, PCPO. This implies that SPACE effectively leverages the baseline policy while ensuring constraint satisfaction. In contrast, without the supervision of the baseline policy, PCPO requires many more constraint violations to achieve the same reward performance as SPACE. In addition, although the fixed-point and dynamic-point approaches use the supervision of the baseline policy, the lack of a projection step makes them less efficient at learning a constraint-satisfying policy.

Comparison of Baseline Policies (see Fig. 9). To examine whether SPACE can safely learn from a baseline policy which need not satisfy the cost constraint, we consider two baseline policies: π_B^cost and π_B^reward. The learning curves of the undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost with the two possible baselines over policy updates are shown for all tested algorithms and tasks in Fig. 9.
We observe that in the point gather and point circle tasks, the initial values of the cost are larger than h_C (i.e., J_C(π_0) > h_C). Using π_B^cost allows the learning algorithm to quickly satisfy the cost constraint without performing extensive projections onto the cost constraint set. For example, in the point circle task we observe that learning guided by π_B^cost quickly satisfies the cost constraint. In addition, we observe that in the ant gather and ant circle tasks, the initial values of the cost are smaller than h_C (i.e., J_C(π_0) < h_C). Intuitively, we would expect that using π_B^reward allows the agent to quickly improve the reward, since the agent already satisfies the cost constraint at the beginning. In the ant gather task, we observe that SPACE guided by π_B^reward does improve the reward more quickly, at around iteration 200. However, we observe that the agents guided by both baseline policies achieve the same final reward performance in the ant gather and ant circle tasks. The reason is that using a dynamic h_D allows the agent to move away from the baseline policy, which makes the baseline policy less influential in the end. As a result, the reward improvement mostly comes from the reward improvement step of SPACE if the agent starts in the interior of the cost constraint set (i.e., J_C(π_0) ≤ h_C).

Fixed h_D (see Fig. 10). To understand the effect of using a dynamic h_D^k when learning from a sub-optimal baseline policy, we compare the performance of SPACE with and without adjusting h_D. The learning curves of the undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates are shown for all tested algorithms and tasks in Fig. 10. We observe that SPACE with a fixed h_D converges to less reward. For example, in the ant circle task SPACE with the dynamic h_D achieves 2.3 times more reward.
The value of the divergence cost in the ant circle task shows that moving away from the baseline policy achieves more reward. This implies that the baseline policy in the ant circle task is highly sub-optimal for the agent. In addition, we observe that in some tasks the dynamic h_D does not have much effect on the reward performance. For example, in the point gather task SPACE achieves the same reward performance, and the values of the divergence cost decrease throughout training. These observations imply that the update scheme of h_D is critical for some tasks.

Comparison of SPACE vs. d-CPO, d-PCPO and the Pre-training Approach (see Fig. 11). To show that SPACE is effective in using the supervision of the baseline policy, we compare the performance of SPACE to the dynamic-point and pre-training approaches. In the pre-training approach, the agent first performs the trust region update with the objective function being the divergence cost. Once the agent has the same reward performance as the baseline policy (i.e., J_R(π_k) ≈ J_R(π_B) for some k), the agent performs the trust region update with the objective function being the reward function. The learning curves of the undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates are shown for all tested algorithms and tasks in Fig. 11. We observe that SPACE achieves better reward performance compared to the pre-training approach in all tasks. For example, in the point circle, ant gather and ant circle tasks, the pre-training approach seldom improves the reward, although it does satisfy the cost constraint. This implies that the baseline policies in these tasks are highly sub-optimal in terms of reward performance. In contrast, SPACE prevents the agent from converging to a poor policy.
In addition, we observe that in the point gather task the pre-training approach achieves the same reward performance as the baseline policy, whereas SPACE achieves better reward performance than the baseline policy. The pre-training approach does not keep improving the reward after learning from the baseline policy. This is because after pre-training with the baseline policy, the entropy of the learned policy is small, which prevents the agent from trying new actions that may lead to better reward performance. This implies that the pre-training approach may hinder the exploration of the learning agent in the new environment. Furthermore, in the car-racing task we observe that the pre-training approach achieves the same reward performance as SPACE but improves the reward slowly. For d-CPO and d-PCPO, in the point and ant tasks we observe that both approaches have comparable or slightly better reward and cost performance compared to SPACE. However, in the car-racing task we observe that d-CPO cannot improve the reward due to a slow update procedure for satisfying the cost constraint, whereas d-PCPO has better reward performance. These observations imply that the projection steps in SPACE allow the learning agent to effectively and robustly learn from the baseline policy.

Comparison of SPACE under the KL-divergence and 2-norm Projections (see Fig. 12). Theorem 5.1 shows that under the KL-divergence and 2-norm projections, SPACE converges to different stationary points. To demonstrate the difference between these two projections, Fig. 12 shows the learning curves of the undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates for all tested algorithms and tasks. In the Mujoco tasks, we observe that SPACE under the KL-divergence projection achieves higher reward. For instance, in the point gather task the final reward is 25% higher under the same cost constraint satisfaction.
In contrast, in the traffic management tasks, we observe that SPACE under the 2-norm projection achieves better cost constraint satisfaction. For instance, in the grid task SPACE under the 2-norm projection achieves a lower reward but better cost constraint satisfaction. In addition, in the bottleneck task SPACE under the 2-norm projection achieves both more reward and better cost constraint satisfaction. These observations imply that SPACE converges to different stationary points under the two possible projections depending on the task.

Initial h_D^0 (see Fig. 13). To understand the effect of the initial value h_D^0, we test SPACE with three different initial values: h_D^0 = 1, h_D^0 = 5, and h_D^0 = 25 in the ant circle and car-racing tasks. The learning curves of the undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates are shown for all tested algorithms and tasks in Fig. 13. In both tasks, we observe that the initial value h_D^0 does not affect the reward and cost performance significantly (i.e., the means of the learning curves lie roughly within one standard deviation across initializations). In addition, the values of the divergence cost for the three choices of h_D^0 are similar throughout training. These observations imply that the update scheme of h_D^k in SPACE is robust to the choice of the initial value h_D^0. However, in the car-racing task we observe that the learning curves with a smaller h_D^0 tend to have higher variance. For example, the standard deviation for h_D^0 = 1 in the reward plot is 6 times larger than that for h_D^0 = 25. This implies that SPACE may suffer reward performance degradation when using a smaller initial value of h_D^0. One possible reason is that when the distance between the learned and baseline policies is large, using a small value of h_D^0 results in an inaccurate projection (i.e., due to approximation errors), causing the policy to follow a zigzag path.
We leave improving this to future work.
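To make the difference between the two projections concrete, here is a minimal sketch (not the paper's implementation) of projecting a parameter vector onto a linearized constraint half-space under a metric L, where L = F (a stand-in Fisher information matrix) corresponds to the KL-divergence projection and L = I to the 2-norm projection. All matrices and constraint values below are invented toy numbers:

```python
import numpy as np

def project(theta, a, b, L):
    """Project theta onto the half-space {x : a @ x <= b} under the
    metric induced by the positive-definite matrix L, i.e. minimize
    (x - theta).T @ L @ (x - theta) subject to a @ x <= b."""
    violation = a @ theta - b
    if violation <= 0:
        return theta.copy()          # already feasible: nothing to do
    L_inv_a = np.linalg.solve(L, a)  # L^{-1} a
    return theta - (violation / (a @ L_inv_a)) * L_inv_a

# Toy 2-D example with a linearized cost constraint a @ theta <= b.
theta = np.array([2.0, 1.0])
a, b = np.array([1.0, 1.0]), 1.0

F = np.array([[4.0, 0.0],            # stand-in Fisher information matrix
              [0.0, 1.0]])
theta_kl = project(theta, a, b, F)        # KL-divergence projection (L = F)
theta_l2 = project(theta, a, b, np.eye(2))  # 2-norm projection (L = I)
```

With these toy values both projected points lie on the constraint boundary, yet they differ, mirroring the statement of Theorem 5.1 that the two projections converge to different stationary points.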

F HUMAN POLICIES

We now describe the procedure for collecting human demonstration data in the car-racing task. A player uses the right, left, up, and down keys to control the direction, acceleration, and brake of the car. The human demonstration data contain the display of the game (i.e., the observed state), the actions, and the rewards. We collect 20 minutes of demonstration data. The human player is instructed to stay in the lane but does not know the cost constraint. This allows us to test whether SPACE can safely learn from a baseline policy which need not satisfy the cost constraints. We then use an off-policy algorithm (DDPG) trained on the demonstration data to obtain the baseline human policy. Since the learned baseline human policy does not interact with the environment, its reward performance cannot exceed the human performance. Fig. 14 shows the procedure.

Implementation Details of DDPG. We use DDPG as our off-policy algorithm. We use a convolutional neural network with two convolutional operators of size 24 and 12 followed by dense layers of size (32, 16) to represent a Gaussian policy. The Q function shares the same architecture as the policy. The learning rates of the policy and the Q function are set to 10^-4 and 10^-3, respectively.

G THE MACHINE LEARNING REPRODUCIBILITY CHECKLIST (VERSION 1.2, MAR. 27 2019)

For all models and algorithms presented, indicate if you include^2:
• A clear description of the mathematical setting, algorithm, and/or model:
  - Yes, please see the problem formulation in Section 3, the update procedure for SPACE in Section 5, and the architecture of the policy in Section E.1.
• An analysis of the complexity (time, space, sample size) of any algorithm:
  - Yes, please see the implementation details in Section E.1.
• A link to a downloadable source code, with specification of all dependencies, including external libraries:
  - Yes, please see the implementation details in Section E.1.
For any theoretical claim, check if you include:
• A statement of the result:
  - Yes, please see Section 4 and Section 5.
• A clear explanation of any assumptions:
  - Yes, please see Section 4 and Section 5.
• A complete proof of the claim:
  - Yes, please see Section B, Section C, and Section D.

For all figures and tables that present empirical results, indicate if you include:
• A complete description of the data collection process, including sample size:
  - Yes, please see Section E.1 for the implementation details.
• A link to a downloadable version of the dataset or simulation environment:
  - Yes, please see Section E.1 for the simulation environment.
• An explanation of any data that were excluded, description of any pre-processing step:
  - Not applicable, since the data come from simulated environments.
• An explanation of how samples were allocated for training / validation / testing:
  - Not applicable. The complete trajectories (i.e., data) are used for training; there is no validation set. Testing is performed in the form of online learning.
• The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results:
  - Yes, we randomly select five random seeds; please see Section E.1 for the implementation details.
• The exact number of evaluation runs:
  - Yes, please see Section E.1 for the implementation details.
• A description of how experiments were run:
  - Yes, please see Section E.1 for the implementation details.
• A clear definition of the specific measure or statistics used to report results:
  - Yes, please see Section 6.
• Clearly defined error bars:
  - Yes, please see Section 6.
• A description of results with central tendency (e.g., mean) and variation (e.g., stddev):
  - Yes, please see Section 6.

2 Here is a link to the list: https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf
• A description of the computing infrastructure used:
  - Yes, please see Section E.1 for the implementation details.

Figure 13: The undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates for the tested algorithms and tasks. The solid line is the mean and the shaded area is the standard deviation over 5 runs. We observe that the initial value h_D^0 does not significantly affect the reward and cost performance. (Best viewed in color.)

Figure 14: Procedure for obtaining a baseline human policy. We ask a human to play the car-racing game; he/she does not know the cost constraint. The trajectories (i.e., display of the game, the action, and the reward) are then stored. A human policy is obtained by training an off-policy algorithm (DDPG) on the trajectories.
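The data-collection pipeline in Fig. 14 amounts to logging (state, action, reward) transitions and then training an off-policy learner purely from the resulting buffer. A minimal, hypothetical sketch of that logging step (the class and field names are ours, not the paper's):

```python
import random
from collections import namedtuple

# One logged step of the human player: game display, key press, reward.
Transition = namedtuple("Transition", "state action reward next_state done")

class DemonstrationBuffer:
    """Stores human demonstration transitions so that an off-policy
    algorithm (e.g. DDPG) can be trained from them without any further
    environment interaction."""

    def __init__(self, seed=0):
        self._storage = []
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform minibatch sampling, as in standard off-policy training.
        return self._rng.sample(self._storage, batch_size)

    def __len__(self):
        return len(self._storage)
```

Because the learner only ever samples from this fixed buffer, its reward can at best match the demonstrator's, which is exactly the limitation noted above for the baseline human policy.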



Code is available at: https://sites.google.com/view/spacealgo.



Figure 1: (a) Update procedures for SPACE. Step 1 (green) improves the reward in the trust region. Step 2

Figure 3: The discounted reward, the undiscounted constraint cost, and the undiscounted divergence cost over policy updates for the tested algorithms and tasks. The solid line is the mean and the shaded area is the standard deviation over 5 runs. The baseline policies in the grid and bottleneck tasks are π_B^near, and the baseline policy in the car-racing task is π_B^human. The black dashed line is the cost constraint threshold h_C. Overall, we observe that SPACE is the only algorithm that satisfies the constraints while achieving superior reward performance in all cases. Although π_B^human has a substantially lower reward, SPACE can still learn to improve the reward. (We show only the results for these tasks as representative cases, since they are the more challenging ones. Please see the Appendix for more results. Best viewed in color.)

d-CPO and d-PCPO. d-CPO and d-PCPO improve the reward slowly and have more cost violations, since they do not use a projection to learn quickly from π_B. For example, in the car-racing task J_D(π) for d-CPO and d-PCPO remains high compared to SPACE throughout training. This suggests that simply adding an IL objective to the reward objective is susceptible to a suboptimal π_B. Hence these approaches require a good selection of the initial values of λ^k and β.

Suboptimal π_B^cost and π_B^reward. The learning curves of J_C(π) and J_R(π) over policy updates are shown for the point gather task in Fig. 4. We use two suboptimal baseline policies, π_B^cost and π_B^reward, and h_C is set to 0.5. Neither baseline policy solves the task at hand.
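The structural difference called out here can be sketched in a few lines; the function names and numeric values below are illustrative only, not taken from the paper:

```python
def weighted_il_objective(j_r, j_d, lam):
    """d-CPO / d-PCPO style: fold the divergence-to-baseline cost J_D
    into the reward objective with a hand-tuned weight lam. With a
    suboptimal baseline, a large J_D keeps dragging the objective down
    unless lam is scheduled carefully."""
    return j_r - lam * j_d

def space_step_feasible(j_d, h_d):
    """SPACE style: J_D enters only as a hard constraint J_D <= h_D^k
    enforced by a projection step, so no reward/imitation trade-off
    weight has to be tuned."""
    return j_d <= h_d

# With a suboptimal baseline, J_D stays high and the weighted
# objective is penalized even when the reward itself is good ...
penalized = weighted_il_objective(j_r=10.0, j_d=8.0, lam=1.0)  # -> 2.0
# ... whereas SPACE only asks whether the current policy lies inside
# the divergence trust region for this iteration.
inside = space_step_feasible(j_d=0.4, h_d=0.5)                 # -> True
```

This is why the weighted-IL approaches are sensitive to the initial values of λ^k and β, while the projection-based update is not.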

Figure 4: Learning from a suboptimal π_B. The undiscounted constraint cost and the discounted reward over policy updates for the point gather task. The solid line is the mean and the shaded area is the standard deviation over 5 runs. The black dashed line is the cost constraint threshold h_C. We observe that SPACE satisfies the cost constraints even when learning from a suboptimal π_B. (Best viewed in color.)

Figure 5: (a) Illustration of the case when π_B is outside the cost constraint set. (b) Illustration of the case when π_B is inside the cost constraint set. π_boundary is the policy with J_C(π_boundary) = h_C. We aim to bound h_D^{k+1}

Lemma D.1 motivates the need for defining two variants of FOSP. Before proving Theorem 5.1, we need the following lemma. Define the projection P_C^L(θ) := argmin_{θ'∈C} (θ − θ')^T L (θ − θ'), with L = F^k under the KL-divergence projection and L = I under the 2-norm projection. Lemma D.2 (Contraction of Projection (Yang et al., 2020)). For any θ, θ* = P_C^L(θ) if and only if (θ − θ*)^T L (θ' − θ*) ≤ 0 for all θ' ∈ C.

Figure 6: Update procedures for SPACE under the KL-divergence and 2-norm projections with two possible Fisher information matrices. A lower objective value is achieved in the darker green area. Red and orange ellipses are F^k's with two different spectra of singular values; red and orange dots are the resulting updated points under these two spectra of F^k. (a) The red ellipse has a smaller σ_1(F^k) and the orange ellipse has a larger σ_1(F^k); both ellipses have the same σ_n(F^k). The two resulting θ^{k+2/3} are similar. (b) The red ellipse has a larger σ_n(F^k) and the orange ellipse has a smaller σ_n(F^k); both ellipses have the same σ_1(F^k). The θ^{k+2/3} with the larger σ_n(F^k) (red dot) achieves a greater decrease in the objective value.

Figure 7: The environment of the circle task (adapted from Achiam et al. (2017)). The agent receives the maximum reward while staying in the safe area by following the red dashed line path.

Updating h_D^k. Lemma 4.1 shows that h_D^{k+1} should be increased at least by O

Figure 8: The discounted reward vs. the cumulative undiscounted constraint cost over policy updates for the tested algorithms and tasks. The solid line is the mean over 5 runs. SPACE achieves the same reward performance with fewer cost constraint violations in all cases. (Best viewed in color.)

with J_C(π_B^cost) ≈ 0, (2) a suboptimal π_B^reward with J_C(π_B^reward) > h_C, and (3) π_B^near with J_C(π_B^near) ≈ h_C. Note that these π_B have different degrees of constraint satisfaction.

Figure 9: The undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates for the tested algorithms and tasks. The solid line is the mean and the shaded area is the standard deviation over 5 runs. SPACE ensures cost constraint satisfaction guided by the baseline policy which need not satisfy the cost constraint. (Best viewed in color.)

Figure 11: The undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates for the tested algorithms and tasks. The solid line is the mean and the shaded area is the standard deviation over 5 runs. SPACE outperforms d-CPO, d-PCPO and the pre-training approach in terms of the efficiency of the reward improvement and cost constraint satisfaction. (Best viewed in color.)

Figure 12: The undiscounted constraint cost, the discounted reward, and the undiscounted divergence cost over policy updates for the tested algorithms and tasks. The solid line is the mean and the shaded area is the standard deviation over 5 runs. SPACE converges to different stationary points under the two possible projections. (Best viewed in color.)

Comparison to prior work.

Parameters used in all tasks. (PC: point circle, PG: point gather, AC: ant circle, AG: ant gather, Gr: grid, BN: bottleneck, and CR: car-racing tasks)




where the equality follows by setting δ (i.e., the size of the trust region) such that. Again, based on Lemma D.2, for θ ∈ C_2 we have, where in the last two inequalities we use the property of the norm. Before reaching an ε-FOSP, Eq. (43) implies that. Based on Eq. (42) and Eq. (44), we have. Based on the L-Lipschitz continuity of gradients, for the projection onto the constraint set C_1 we have. Combining Eq. (45) with Eq. (46), we have. Hence it takes O(ε^{-2}) iterations to reach an ε-FOSP.

SPACE under the 2-norm projection converges to an ε-FOSP. Based on Lemma D.2 under the 2-norm projection, and setting. Based on the L-Lipschitz continuity of gradients and Eq. (48), we have, where, which represents the difference between the gradient and the natural gradient, and the equality follows by setting δ (i.e., the size of the trust region) such that. Again, based on Lemma D.2, for θ ∈ C_2 we have, where in the last two inequalities we use the property of the norm. Before reaching an ε-FOSP, Eq. (50) implies that. Based on Eq. (49) and Eq. (51), we have. Based on the L-Lipschitz continuity of gradients, for the projection onto the constraint set C_1 we have. Combining Eq. (52) with Eq. (53), we have. Hence it takes O(ε^{-2}) iterations to reach an ε-FOSP.

and the pre-training approach has more cost constraint violations than SPACE. This implies that jointly using reinforcement learning and the supervision of the baseline policy achieves better reward and cost performance.

