INVERSE CONSTRAINED REINFORCEMENT LEARNING

Abstract

Standard reinforcement learning (RL) algorithms train agents to maximize given reward functions. However, many real-world applications of RL require agents to also satisfy certain constraints which may, for example, be motivated by safety concerns. Constrained RL algorithms approach this problem by training agents to maximize given reward functions while respecting explicitly defined constraints. However, in many cases, manually designing accurate constraints is a challenging task. In this work, given a reward function and a set of demonstrations from an expert that maximizes this reward function while respecting unknown constraints, we propose a framework to learn the most likely constraints that the expert respects. We then train agents to maximize the given reward function subject to the learned constraints. Previous works in this regard have mainly been restricted to tabular settings or to specific types of constraints, or they assume knowledge of the transition dynamics of the environment. In contrast, we empirically show that our framework is able to learn arbitrary Markovian constraints in high dimensions in a model-free setting.

1. INTRODUCTION

Reward functions are a critical component in reinforcement learning settings. As such, it is important that reward functions are designed accurately and are well-aligned with the intentions of the human designer. This is known as agent (or value) alignment (see, e.g., Leike et al. (2018; 2017); Amodei et al. (2016)). Misspecified rewards can lead to unwanted and unsafe situations (see, e.g., Amodei & Clark (2016)). However, designing accurate reward functions remains a challenging task. Human designers, for example, tend to prefer simple reward functions that agree well with their intuition and are easily interpretable. For example, a human designer might choose a reward function that encourages an RL agent driving a car to minimize its traveling time to a certain destination. Clearly, such a reward function makes sense in the case of a human driver since inter-human communication is contextualized within a framework of unwritten and unspoken constraints, often colloquially termed as 'common-sense'. That is, while a human driver will try to minimize their traveling time, they will be careful not to break traffic rules, take actions that endanger passersby, and so on. However, we cannot assume such behaviors from RL agents since they are not imbued with common-sense constraints. Constrained reinforcement learning provides a natural framework for maximizing a reward function subject to some constraints (we refer the reader to Ray et al. (2019) for a brief overview of the field). However, in many cases, these constraints are hard to specify explicitly in the form of mathematical functions. One way to address this issue is to automatically extract constraints by observing the behavior of a constraint-abiding agent. Consider, for example, the cartoon in Figure 1. Agents start at the bottom-left corner and are rewarded according to how quickly they reach the goal at the bottom-right corner.
However, what this reward scheme misses out is that in the real world the lower bridge is occupied by a lion which attacks any agents attempting to pass through it. Therefore, agents that are naïvely trained to maximize the reward function will end up performing poorly in the real world. If, on the other hand, the agent had observed that the expert (in Figure 1(a)) actually performed suboptimally with respect to the stipulated reward scheme by taking a longer route to the goal, it could have concluded that (for some unknown reason) the lower bridge must be avoided and consequently would have not been eaten by the lion! Scobee & Sastry (2020) formalize this intuition by casting the problem of recovering constraints in the maximum entropy framework for inverse RL (IRL) (Ziebart et al., 2008) and propose a greedy algorithm to infer the smallest number of constraints that best explain the expert behavior. However, Scobee & Sastry (2020) has two major limitations: it assumes (1) a tabular (discrete) setting, and (2) knowledge of the environment's transition dynamics. In this work, we aim to address both of these issues by instead learning a constraint function through a sample-based approximation of the objective function of Scobee & Sastry. Consequently, our approach is model-free, admits continuous states and actions, and can learn arbitrary Markovian constraints[1]. Further, we empirically show that it scales well to high dimensions. Typical inverse RL methods only make use of expert demonstrations and do not assume any knowledge about the reward function at all. However, most reward functions can be expressed in the form "do this task while not doing these other things" where other things are generally constraints that a designer wants to impose on an RL agent. The main task ("do this") is often quite easy to encode in the form of a simple nominal reward function.
In this work, we focus on learning the constraint part ("do not do that") from provided expert demonstrations and using it in conjunction with the nominal reward function to train RL agents. In this perspective, our work can be seen as a principled way to inculcate prior knowledge about the agent's task in IRL. This is a key advantage over other IRL methods, which also often end up making assumptions about the agent's task in the form of regularizers such as in Finn et al. (2016). The main contributions of our work are as follows:
• We formulate the problem of inferring constraints from a set of expert demonstrations as a learning problem, which allows it to be used in continuous settings. To the best of our knowledge, this is the first work in this regard.
• We eliminate the need to assume, as Scobee & Sastry do, knowledge of the environment's transition dynamics.
• We demonstrate the ability of our method to train constraint-abiding agents in high dimensions and show that it can also be used to prevent reward hacking.

2. BACKGROUND

2.1. UNCONSTRAINED RL

A finite-horizon Markov Decision Process (MDP) M is a tuple (S, A, p, r, γ, T), where S is a set of states, A is a set of actions, p : S × A × S → [0, 1] is the transition probability function (where p(s′|s, a) denotes the probability of transitioning to state s′ from state s by taking action a), r : S × A → R is the reward function, γ is the discount factor and T is the time horizon. A trajectory τ = {s_1, a_1, . . . , s_T, a_T} denotes a sequence of state-action pairs such that s_{t+1} ∼ p(·|s_t, a_t). A policy π : S → P(A) is a map from states to probability distributions over actions, with π(a|s) denoting the probability of taking action a in state s. We will sometimes abuse notation to write π(s, a) to mean the joint probability of visiting state s and taking action a under the policy π, and similarly π(τ) to mean the probability of the trajectory τ under the policy π. Define r(τ) = Σ_{t=1}^{T} γ^t r(s_t, a_t) to be the total discounted reward of a trajectory. Forward RL algorithms try to find an optimal policy π* that maximizes the expected total discounted reward J(π) = E_{τ∼π}[r(τ)]. On the other hand, given a set of trajectories sampled from the optimal (also referred to as expert) policy π*, inverse RL (IRL) algorithms aim to recover the reward function r, which can then be used to learn the optimal policy π* via some forward RL algorithm.
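For concreteness, the discounted return of a trajectory defined above can be computed as in this minimal sketch (function and variable names are our own illustrations, not from the paper):

```python
# Discounted return r(tau) = sum_{t=1}^{T} gamma^t * r(s_t, a_t) of a
# trajectory tau = [(s_1, a_1), ..., (s_T, a_T)].
def discounted_return(tau, reward_fn, gamma):
    # enumerate starts at t = 0, so gamma ** (t + 1) matches the t = 1 .. T sum
    return sum(gamma ** (t + 1) * reward_fn(s, a) for t, (s, a) in enumerate(tau))

# Toy example: reward 1.0 per step, gamma = 0.5, T = 3
tau = [(0, 0), (1, 0), (2, 0)]
print(discounted_return(tau, lambda s, a: 1.0, 0.5))  # 0.5 + 0.25 + 0.125 = 0.875
```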

2.2. CONSTRAINED RL

While normal (unconstrained) RL tries to find a policy that maximizes J(π), constrained RL instead focuses on finding a policy that maximizes J(π) while respecting explicitly-defined constraints. A popular framework in this regard is the one presented in Altman (1999), which introduces the notion of a constrained MDP (CMDP). A CMDP M^c is a simple MDP augmented with a cost function c : S × A → R and a budget α ≥ 0. Define c(τ) = Σ_{t=1}^{T} γ^t c(s_t, a_t) to be the total discounted cost of the trajectory τ and J^c(π) = E_{τ∼π}[c(τ)] to be the expected total discounted cost. The forward constrained RL problem is to find the policy π*_c that maximizes J(π) subject to J^c(π) ≤ α. In this work, given a set D of trajectories sampled from π*_c, the corresponding unconstrained MDP M (i.e., M^c without the cost function c) and a budget α, we are interested in recovering a cost function which, when augmented with M, has an optimal policy that generates the same set of trajectories as in D. We call this the inverse constrained reinforcement learning (ICRL) problem. If the budget α is strictly greater than 0, then the cost function c defines soft constraints over all possible state-action pairs. In other words, a policy is allowed to visit states and take actions that have non-zero costs as long as the expected total discounted cost remains less than α. If, however, α is 0, then the cost function translates into hard constraints over all state-action pairs that have a non-zero cost associated with them. A policy can thus never visit these state-action pairs. In this work, we restrict ourselves to this hard constraint setting. Note that this is not particularly restrictive since, for example, safety constraints are often hard constraints, as are constraints imposed by physical laws. Since we restrict ourselves to hard constraints, we can rewrite the constrained RL problem as follows: define C = {(s, a) | c(s, a) ≠ 0} to be the constraint set induced by c.
The forward constrained RL problem is to find the optimal constrained policy π*_C that maximizes J(π) subject to π*_C(s, a) = 0 ∀(s, a) ∈ C. The inverse constrained RL problem is to recover the constraint set C from trajectories sampled from π*_C. Finally, we will refer to our unconstrained MDP as the nominal MDP hereinafter. The nominal MDP represents the nominal environment (simulator) in which we train our agent.
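Under the hard-constraint view, feasibility of a trajectory reduces to a simple membership check against the constraint set C; a minimal sketch (all names hypothetical):

```python
# Indicator 1_{M^C}(tau): 1 iff the trajectory tau never visits a
# state-action pair that belongs to the constraint set C.
def is_feasible(tau, C):
    return all((s, a) not in C for (s, a) in tau)

C = {(3, "down")}                       # hypothetical constrained pair
tau_ok = [(0, "right"), (1, "right")]
tau_bad = [(0, "right"), (3, "down")]
print(is_feasible(tau_ok, C), is_feasible(tau_bad, C))  # True False
```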

3.1. MAXIMUM LIKELIHOOD CONSTRAINT INFERENCE

We take Scobee & Sastry as our starting point. Suppose that we have a set of trajectories D = {τ^(i)}_{i=1}^N sampled from an expert π*_C navigating in a constrained MDP M^{C*}, where C* denotes the (true) constraint set. Furthermore, we are also given the corresponding nominal MDP M[2]. Our goal is to recover a constraint set which, when augmented with M, results in a CMDP that has an optimal policy that respects the same set of constraints as π*_C does. Scobee & Sastry pose this as a maximum likelihood problem. That is, if we let p_M denote probabilities given that we are considering MDP M and assume a uniform prior on all constraint sets, then we can choose C* according to C* ← arg max_C p_M(D|C). Under the maximum entropy (MaxEnt) model presented in Ziebart et al. (2008), the probability of a trajectory under a deterministic MDP M can be modelled as

π_M(τ) = exp(βr(τ)) 1_M(τ) / Z_M,

where Z_M = ∫ exp(βr(τ)) 1_M(τ) dτ is the partition function, β ∈ [0, ∞) is a parameter describing how close the agent is to the optimal distribution (as β → ∞ the agent becomes a perfect optimizer and as β → 0 the agent simply takes random actions) and 1_M is an indicator function that is 1 for trajectories feasible under the MDP M and 0 otherwise. Assume that all trajectories in D are i.i.d. and sampled from the MaxEnt distribution. We have

p(D|C) = (1 / (Z_{M^C})^N) Π_{i=1}^N exp(βr(τ^(i))) 1_{M^C}(τ^(i)).

Note that 1_{M^C}(τ^(i)) is 0 for all trajectories that contain any state-action pair that belongs to C. To maximize this, Scobee & Sastry propose a greedy strategy wherein they start with an empty constraint set and incrementally add state-action pairs that result in the maximal increase in p(D|C).
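On a finite set of candidate trajectories, this MaxEnt distribution with the feasibility indicator can be computed directly; a small numeric sketch (the rewards and feasibility flags below are made up):

```python
import numpy as np

def maxent_probs(rewards, feasible, beta):
    """pi(tau) proportional to exp(beta * r(tau)) * 1_C(tau), normalized by Z."""
    w = np.exp(beta * np.asarray(rewards, dtype=float)) * np.asarray(feasible, dtype=float)
    return w / w.sum()

rewards = [1.0, 2.0, 3.0]   # r(tau) for three candidate trajectories
feasible = [1, 1, 0]        # the highest-reward trajectory violates C
p = maxent_probs(rewards, feasible, beta=1.0)
print(p)  # all mass on the two feasible trajectories, more on the higher reward
```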

3.2. SAMPLE-BASED APPROXIMATION

Since we are interested in more realistic settings where the state and action spaces can be continuous, considering all possible state-action pairs individually usually becomes intractable. Instead, we propose a learning-based approach wherein we try to approximate 1_{M^C}(τ) using a neural network. Consider the log-likelihood

L(C) = (1/N) log p(D|C) = (1/N) Σ_{i=1}^N [βr(τ^(i)) + log 1_{M^C}(τ^(i))] − log Z_{M^C}
     = (1/N) Σ_{i=1}^N [βr(τ^(i)) + log 1_{M^C}(τ^(i))] − log ∫ exp(βr(τ)) 1_{M^C}(τ) dτ.   (4)

Note that 1_{M^C}(τ) = Π_{t=1}^T 1_{M^C}(s_t, a_t), i.e., the indicator factorizes over the state-action pairs in τ. We therefore approximate the per-step indicator with a neural network ζθ : S × A → [0, 1] with parameters θ, and define ζθ(τ) := Π_{t=1}^T ζθ(s_t, a_t). Substituting gives

L(C) ≈ L(θ) = (1/N) Σ_{i=1}^N [βr(τ^(i)) + log ζθ(τ^(i))] − log ∫ exp(βr(τ)) ζθ(τ) dτ.   (5)

Let M^{ζ̄θ} denote the MDP obtained after augmenting M with the cost function ζ̄θ := 1 − ζθ[3], and π_{M^{ζ̄θ}} denote the corresponding MaxEnt policy. Taking gradients of (5) with respect to θ gives us (see Appendix A.1 for the derivation)

∇_θ L(θ) = (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − E_{τ∼π_{M^{ζ̄θ}}}[∇_θ log ζθ(τ)].   (6)

Using a sample-based approximation for the right-hand term, we can rewrite the gradient as

∇_θ L(θ) ≈ (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − (1/M) Σ_{j=1}^M ∇_θ log ζθ(τ̂^(j)),   (7)

where the τ̂^(j) are sampled from π_{M^{ζ̄θ}} (discussed in the next section). Notice that making ∇_θL(θ) zero essentially requires matching the expected gradient of log ζθ under the expert (left-hand term) and nominal (right-hand term) trajectories. For brevity, we will write π_{M^{ζ̄θ}} as πθ from now onwards. We choose ζθ to be a neural network with parameters θ and a sigmoid at the output, and train it via gradient ascent using the expression for the gradient given above. In practice, since we have a limited amount of data, ζθ, parameterized as a neural network, will tend to overfit. To mitigate this, we incorporate the following regularizer into our objective function:

R(θ) = δ Σ_{τ∼{D,S}} [ζθ(τ) − 1],   (8)

where S denotes the set of trajectories sampled from πθ and δ ∈ [0, 1) is a fixed constant.
R incentivizes ζ θ to predict values close to 1, thus encouraging ζ θ to choose the smallest number of constraints that best explain the expert data.
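As a rough illustration of this update, the following sketch uses our own toy setup: a linear-logistic ζθ over two-dimensional features, plain NumPy instead of a deep network, and synthetic "expert" and "nominal" feature samples. It performs gradient ascent matching the expected gradient of log ζθ under the two sample sets, plus the regularizer; it is not the paper's implementation.

```python
import numpy as np

def zeta(theta, x):
    """zeta_theta(s, a) = sigmoid(theta . x) for rows x of feature vectors."""
    return 1.0 / (1.0 + np.exp(-x @ theta))

def grad_log_zeta(theta, x):
    """Rows are d/dtheta log zeta_theta(x_i) = (1 - zeta(x_i)) * x_i."""
    return (1.0 - zeta(theta, x))[:, None] * x

def icrl_grad(theta, expert_x, nominal_x, delta=0.1):
    # (7): match expected grad of log zeta under expert vs. nominal samples
    g = grad_log_zeta(theta, expert_x).mean(0) - grad_log_zeta(theta, nominal_x).mean(0)
    # (8): regularizer delta * sum(zeta - 1) pushes predictions toward 1
    allx = np.vstack([expert_x, nominal_x])
    z = zeta(theta, allx)
    g += delta * ((z * (1.0 - z))[:, None] * allx).mean(0)
    return g

rng = np.random.default_rng(0)
expert_x = rng.normal(+1.0, 0.3, size=(64, 2))    # features the expert visits
nominal_x = rng.normal(-1.0, 0.3, size=(64, 2))   # features only the nominal agent visits
theta = np.zeros(2)
for _ in range(200):
    theta += 0.5 * icrl_grad(theta, expert_x, nominal_x)   # gradient ascent
print(zeta(theta, expert_x).mean(), zeta(theta, nominal_x).mean())
```

At convergence, ζθ is close to 1 on the states the expert visits and close to 0 on the states only the nominal agent visits, i.e., the latter are inferred to be constrained.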

3.3. FORWARD STEP

To evaluate (7) we need to sample from πθ. Recall that πθ needs to maximize J(π) subject to πθ(s, a) = 0 ∀(s, a) ∈ Z, where Z is the constraint set induced by ζ̄θ. However, since ζθ outputs continuous values in the range (0, 1), we instead solve the soft constrained RL version, wherein we try to find a policy π that maximizes J(π) subject to E_{τ∼π}[ζ̄θ(τ)] ≤ α. In our experiments, we set α to a very small value. Note that if α is strictly set to 0, our optimization program will have an empty feasible set (since ζθ, being a sigmoid output, is strictly less than 1, ζ̄θ is strictly positive for every trajectory). Please refer to Appendix A.5 for some more discussion on α. We represent our policy as a neural network with parameters φ and train it by solving the following equivalent unconstrained min-max problem on the Lagrangian of our objective function:

min_{λ≥0} max_φ L_F(φ, λ) = J(π_φ) + (1/β) H(π_φ) − λ(E_{τ∼π_φ}[ζ̄θ(τ)] − α),   (9)

by gradient ascent on φ (via the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017)) and gradient descent on the Lagrange multiplier λ. Note that we also add the entropy H(π_φ) = −E_{τ∼π_φ}[log π_φ(τ)] of π_φ to our objective function. Maximizing the entropy ensures that we recover the MaxEnt distribution in (2) at convergence (see Appendix A.3 for proof).
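The min-max scheme can be illustrated on a deliberately tiny problem of our own (a softmax "policy" over two fixed trajectories, one high-reward but costly and one safe; the entropy bonus and PPO are omitted): gradient ascent on φ, dual descent-style updates on λ.

```python
import numpy as np

r = np.array([1.0, 0.0])      # rewards of the two trajectories
c = np.array([1.0, 0.0])      # zeta_bar costs (1 = constrained, 0 = free)
alpha = 0.05                  # small budget, as in the paper
phi, lam = np.zeros(2), 0.0
for _ in range(2000):
    p = np.exp(phi) / np.exp(phi).sum()           # softmax policy pi_phi
    adv = r - lam * c                             # Lagrangian "reward"
    phi += 0.1 * p * (adv - p @ adv)              # policy gradient of E[adv]
    lam = max(0.0, lam + 0.1 * (p @ c - alpha))   # dual update on the constraint
p = np.exp(phi) / np.exp(phi).sum()
print(p, p @ c)  # probability mass moves off the costly trajectory
```

As λ grows while the expected cost exceeds α, the effective reward of the costly trajectory turns negative and the policy shifts its mass onto the feasible one.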

3.4. INCORPORATING IMPORTANCE SAMPLING

Running the forward step in each iteration is typically very time-consuming. To mitigate this problem, instead of approximating the expectation in (6) with samples from πθ, we approximate it with samples from an older policy πθ̄, where θ̄ were the parameters of ζ at some previous iteration. We therefore only need to learn a new policy after a fixed number of iterations. To correct for the bias introduced in the approximation because of using a different sampling distribution, we add importance sampling weights to our expression for the gradient. In this case, the importance sampling weights can be shown to be given by (see Appendix A.2 for the derivation)

s(τ) = ζθ(τ) / ζθ̄(τ).   (10)

The gradient can thus be rewritten as

∇_θ L(θ) ≈ (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − (1 / Σ_{τ∼πθ̄} s(τ)) Σ_{τ∼πθ̄} s(τ) ∇_θ log ζθ(τ).   (11)

Algorithm 1 summarizes our training procedure.
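A small sketch of how the weights in (10) and their self-normalization might be computed, using the fact that ζ(τ) is the product of per-step outputs (the per-step values below are made up):

```python
import numpy as np

def traj_zeta(per_step):
    """zeta(tau) = prod_t zeta(s_t, a_t) over the steps of one trajectory."""
    return np.prod(per_step)

def is_weights(zeta_new, zeta_old):
    """s(tau) = zeta_theta(tau) / zeta_thetabar(tau), per sampled trajectory."""
    return np.array([traj_zeta(n) / traj_zeta(o) for n, o in zip(zeta_new, zeta_old)])

# Two sampled trajectories, two steps each: per-step zeta under the current
# parameters theta and under the older parameters thetabar.
zeta_new = [np.array([0.9, 0.8]), np.array([0.5, 0.4])]
zeta_old = [np.array([0.9, 0.9]), np.array([0.9, 0.8])]
s = is_weights(zeta_new, zeta_old)
s_norm = s / s.sum()   # self-normalization, as in (11)
print(np.round(s, 4))  # [0.8889 0.2778]
```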

4. EXPERIMENTS

Learning a single constraint: Consider the TwoBridges environment in Figure 1. The agent starts at the bottom-left corner and can take one of the following 4 actions at each step: right, left, up, down. The reward in the left half is negative and in the right half is positive (and proportional to how close the agent is to the goal). This incentivizes the agent to cross over into the right half as quickly as possible, which is obviously via the lower bridge. However, since the lower bridge is occupied by a lion, the expert agent avoids it, whereas the nominal agent does not. Learning multiple constraints: For this experiment we design the ThreeBridges environment shown in Figure 2(a). The agent starts at either the bottom or top-left corner with equal probability and, as in the TwoBridges case, is incentivized to cross over into the right half as quickly as possible. The expert on the other hand always takes the middle bridge, since both the upper and lower bridges are actually constrained. Preventing reward hacking: Figure 2(b) shows the LapGridWorld environment[4] which we use to test if our method can prevent reward hacking. The agent is intended to sail clockwise around the track. Each time it drives onto a golden dollar tile, it gets a reward of 3. However, the nominal agent "cheats" by stepping back and forth on one dollar tile, rather than going around the track, and ends up getting more reward than the expert (which goes around the track, as intended). Scaling to high dimensions: For this experiment, we use a simulated robot called HalfCheetah from OpenAI Gym (Brockman et al., 2016). The state and action spaces are of 18 and 6 dimensions respectively. The robot can move both in forward and backward directions and is rewarded proportionally to the distance it covers. For the constrained environment, shown in Figure 2(c), we add a solid wall at a distance of 5 units from the origin to prevent the robot from moving forwards.
Consequently, the expert always moves backwards, whereas the nominal agent moves in both directions. Transferring constraints: In many cases, constraints are actually part of the environment and are the same for different agents (for example, all vehicles must adhere to the same speed limit). In such instances, it is useful to first learn the constraints using only one agent and then transfer them onto other agents. As a proof of concept, we transfer the constraints learnt on the HalfCheetah agent from the previous paragraph onto a Walker2d agent. Note that in this case ζθ must only be fed a subset of the state and action spaces that is common across all agents; we therefore train ζθ only on these shared features.

Figures 3 and 4 show the results of these experiments. The rewards shown are the actual rewards that the agent gets in the constrained (true) environment. In the case of TwoBridges and ThreeBridges, we add solid obstacles on the constrained bridges to prevent the agent from passing through them. In the constrained LapGridWorld environment, we award the agent 12 points whenever it completes a lap (rather than awarding 3 points each time it lands on a golden dollar tile, as in the nominal environment). Finally, in the case of HalfCheetah and Walker2d, we terminate the episode whenever the agent moves beyond the wall. The bottom row in Figure 3 shows the average number of constraints that the agent violates per trajectory when rolled out in the nominal environment. (For LapGridWorld this is the number of times the agent attempts to move in the anti-clockwise direction.) For the Walker2d experiment, we observed 0 constraint violations throughout training. This is because, in practice, ζθ usually acts conservatively compared to the true constraint function (by also constraining state-action pairs that are close to the truly constrained ones). Additional details on these experiments, including hyperparameters, can be found in Appendix A.4.
As can be seen, over the course of training, the agent's true reward increases and its number of constraint violations go down.

5. ABLATION STUDIES

We conduct experiments on the TwoBridges environment to answer the following questions: (a) Can we learn constraints even when we have only one expert rollout? (b) Does importance sampling speed up convergence? (c) Does the regularization term in (8) encourage ζθ to choose a minimal set of constraints? To answer (a) we run our algorithm for different numbers of expert rollouts. As shown in Figure 5(a), we are able to achieve expert performance even with only one expert rollout. For (c) we run experiments with different values of δ, which controls the extent of regularization (see (8)). Figure 6 shows the results. Note that as δ increases, ζθ constrains fewer state-action pairs. Also note that when δ = 1, ζθ fails to constrain any state.

6. RELATED WORK

Forward Constrained RL: Several approaches have been proposed in the literature to solve the forward constrained RL problem in the context of CMDPs (Altman, 1999). Achiam et al. (2017) analytically solve trust region policy optimization problems at each policy update to enforce the constraints. Chow et al. (2018) use a Lyapunov approach and also provide theoretical guarantees. Le et al. (2019) propose an algorithm for cases when there are multiple constraints. Finally, a well-known approach centers around rewriting the constrained RL problem as an equivalent unconstrained min-max problem by using Lagrange multipliers (Zhang et al., 2019; Tessler et al., 2019; Bhatnagar, 2010) (see Section 3.3 for further details). Constraint Inference: Previous work on inferring constraints from expert demonstrations has either focused on inferring specific types of constraints such as geometric (D'Arpino & Shah, 2017; Subramani et al., 2018), sequential (Pardowitz et al., 2005) or convex (Miryoosefi et al., 2019) constraints, or is restricted to tabular settings (Scobee & Sastry, 2020; Chou et al., 2018), or assumes knowledge of transition dynamics (Chou et al., 2020). Preference Learning: Constraint inference also links to preference learning, which aims to extract user preferences (constraints imposed by an expert on itself, in our case) from different sources such as ratings (Daniel et al., 2014), comparisons (Christiano et al., 2017; Sadigh et al., 2017), human reinforcement signals (MacGlashan et al., 2017) or the initial state of the agent's environment (Shah et al., 2019). Preference learning also includes inverse RL, which aims to recover an agent's reward function by using its trajectories. To deal with the inherent ill-posedness of this problem, inverse RL algorithms often incorporate regularizers (Ho & Ermon, 2016; Finn et al., 2016) or assume a prior distribution over the reward function (Jeon et al., 2018; Michini & How, 2012; Ramachandran & Amir, 2007).

7. CONCLUSION AND FUTURE WORK

We have presented a method to learn constraints from an expert's demonstrations. Unlike previous works, our method both learns arbitrary constraints and can be used in continuous settings. While we consider our method to be an important first step towards learning arbitrary constraints in real-world continuous settings, there is still considerable room for improvement. For example, as is the case with Scobee & Sastry, our formulation is based on (2), which only holds for deterministic MDPs. Secondly, we only consider hard constraints. Lastly, one very interesting extension of this method would be to learn constraints from logged data only, in an offline way, to facilitate safe RL in settings where it is difficult to even build nominal simulators, as is the case for plant controllers.

A.3 RATIONALE FOR (9)

Consider a constrained MDP M^C as defined in Section 2.2. We are interested in recovering the policy

π_{M^C}(τ) = exp(βr(τ)) 1_C(τ) / Z_{M^C},   (15)

where Z_{M^C} = ∫ exp(βr(τ)) 1_C(τ) dτ is the partition function and 1_C is an indicator function that is 0 if τ visits any pair in C and 1 otherwise.

Lemma: The Boltzmann policy π_B(τ) = exp(βr(τ))/Z maximizes L(π) = E_{τ∼π}[r(τ)] + (1/β) H(π), where H(π) denotes the entropy of π.

Proof: Note that the KL-divergence D_KL between a policy π and π_B can be written as

D_KL(π||π_B) = E_{τ∼π}[log π(τ) − log π_B(τ)]
             = E_{τ∼π}[log π(τ) − βr(τ) + log Z]
             = −E_{τ∼π}[βr(τ)] − H(π) + log Z
             = −βL(π) + log Z.

Since log Z is constant, minimizing D_KL(π||π_B) is equivalent to maximizing L(π). Also, we know that D_KL(π||π_B) is minimized when π = π_B. Therefore, π_B maximizes L.

Proposition: The policy in (15) is a solution of the min-max problem in (17). From the Lemma we know that the solution of the inner maximization is

π(τ, λ) = g(τ, λ) / ∫ g(τ′, λ) dτ′, where g(τ, λ) = exp(β(r(τ) − λ(ζ̄θ(τ) − α))).

To find π*(τ) = min_λ π(τ, λ), note that:
1. When ζ̄θ(τ) ≤ α, then λ* = 0 minimizes π. In this case g(τ, λ*) = exp(βr(τ)).
2. When ζ̄θ(τ) > α, then λ* → ∞ minimizes π. In this case g(τ, λ*) = 0.
We can combine both of these cases by writing

π*(τ) = exp(βr(τ)) 1_{ζ̄θ}(τ) / ∫ exp(βr(τ′)) 1_{ζ̄θ}(τ′) dτ′,

where 1_{ζ̄θ}(τ) is 1 if ζ̄θ(τ) ≤ α and 0 otherwise. (Note that the denominator is greater than 0 as long as we have at least one τ for which ζ̄θ(τ) ≤ α, i.e., we have at least one feasible solution.) QED.
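The Lemma can be checked numerically on a small finite trajectory set of our own: no competing distribution beats the Boltzmann policy on L(π), and the optimal value equals (1/β) log Z (which follows by substituting π_B into L).

```python
import numpy as np

rng = np.random.default_rng(0)
r, beta = np.array([0.3, 1.0, 2.0]), 2.0            # made-up trajectory rewards
boltz = np.exp(beta * r) / np.exp(beta * r).sum()   # Boltzmann policy pi_B

def objective(p):
    """L(pi) = E[r] + (1/beta) * H(pi) on a finite trajectory set."""
    return p @ r + (1.0 / beta) * -(p @ np.log(p))

best = objective(boltz)
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))          # random competing distribution
    assert objective(p) <= best + 1e-9     # none beats the Boltzmann policy
print(best)
```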

A.4 EXPERIMENTAL SETTINGS

We used W&B (Biewald, 2020) to manage our experiments and conduct sweeps on hyperparameters. We used Adam (Kingma & Ba, 2015) to optimize all of our networks. All important hyperparameters are listed in Table 1 . Details on the environments can be found below.

A.4.1 TWOBRIDGES

In this environment, the agent's state consists of its (x, y) position coordinates. Agents start at (0, 0) and the goal is at (20, 0). Episodes terminate when the agent is within one unit circle of the goal or when the number of timesteps exceeds 200. The bottom-left corners of the bridges are at (4, 5) and (4, 14). Each bridge is 4 units long and 1 unit wide. Agents can take one of the following actions: right, left, up and down. Each action moves the agent 0.7 units in the respective direction. Agents attempting to move outside the 20 × 20 simulator or into the water (in between the bridges) end up in the same position and receive a reward of -2. The reward in the regions left of the bridges is fixed to -1 and on and to the right of the bridges is equal to 10/d, where d is the Euclidean distance of the agent to the goal. Additionally, the agent's reward is scaled by 20 if it is to the right of the bridges and at or below the lower bridge (i.e. y < 6). Finally, the reward within one unit circle of the goal is fixed to 250.
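For illustration, the dynamics described above might be sketched as follows (our own reconstruction from the text; the paper's actual simulator may differ in details such as the exact extent of the water region):

```python
import math

STEP = 0.7
ACTIONS = {"right": (STEP, 0), "left": (-STEP, 0), "up": (0, STEP), "down": (0, -STEP)}
BRIDGES = [(4, 5), (4, 14)]   # bottom-left corners; each 4 long, 1 wide

def on_bridge(x, y):
    return any(bx <= x <= bx + 4 and by <= y <= by + 1 for bx, by in BRIDGES)

def step(x, y, action):
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    in_water = 4 <= nx <= 8 and not on_bridge(nx, ny)   # the gap the bridges span
    if not (0 <= nx <= 20 and 0 <= ny <= 20) or in_water:
        return (x, y), -2.0                             # blocked move
    d = math.hypot(nx - 20, ny - 0)                     # distance to the goal
    if d <= 1:
        return (nx, ny), 250.0
    reward = -1.0 if nx < 4 else 10.0 / d
    if nx > 8 and ny < 6:                               # right of bridges, low route
        reward *= 20
    return (nx, ny), reward
```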



Footnotes:
[1] Markovian constraints are of the form c(τ) = Σ_{t=1}^T c(s_t, a_t), i.e., the constraint function is independent of the past states and actions in the trajectory.
[2] Availability of a transition dynamics model of the nominal MDP is not necessary.
[3] Note that since we are assuming that α is 0, we can assign any non-zero (positive) cost to state-action pairs that we want to constrain. Here 1 − ζθ assigns a cost of 1 to all such pairs.
[4] This environment is similar to the boat race environment in Leike et al. (2017).



Figure 1: The TwoBridges environment. (a) The expert avoids the lion and takes the upper bridge. (b) Since the nominal policy is simply trained to get to the goal as quickly as possible, it instead takes the lower bridge. (c) Our method, on the other hand, is able to learn that the lower bridge should be avoided, and consequently our policy takes the upper bridge.

Algorithm 1: ICRL with Importance Sampling
Input: expert trajectories D, iterations N, number of backward iterations B.
Initialize θ and φ randomly.
for i = 1, . . . , N do
    Learn π_φ by solving (9) using the current ζθ.
    for j = 1, . . . , B do
        Sample a set of trajectories S = {τ^(k)}_{k=1}^M using π_φ.
        Compute importance sampling weights s(τ^(k)) using (10) for k = 1, . . . , M.
        Use S and D to update θ via SGD using the gradient in (11).
    end
end
return π_φ
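The loop structure of Algorithm 1 can be sketched as the following runnable skeleton; the PPO forward step and the SGD backward step are stubbed out with hypothetical placeholders, so only the control flow is meaningful here.

```python
import numpy as np

def forward_step(theta):
    """Stub for the forward step: solve (9) under the current zeta_theta and
    return a sampler for the resulting policy pi_phi (dummy trajectories here)."""
    return lambda m: [np.zeros(3) for _ in range(m)]

def backward_step(theta, expert_D, S, weights):
    """Stub for one SGD update of theta using the gradient in (11) (no-op here)."""
    return theta

def icrl(expert_D, n_iters=3, n_backward=2, m=4):
    theta = np.zeros(3)                       # parameters of zeta_theta
    for _ in range(n_iters):
        sample = forward_step(theta)          # learn pi_phi (outer loop)
        for _ in range(n_backward):           # B backward iterations
            S = sample(m)                     # trajectories from the older policy
            w = np.ones(len(S))               # importance weights s(tau), eq. (10)
            theta = backward_step(theta, expert_D, S, w)
    return theta

theta = icrl(expert_D=[np.zeros(3)])
print(theta.shape)  # (3,)
```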

Figure 2: The environments used in the experiments. Note that nominal agents are not aware of the constraints shown.

Figure 3: Performance of agents during training on their respective constrained (true) environments. All plots were smoothed and averaged over 3 seeds.

Figure 4: Walker2d

For (b) we run experiments with and without importance sampling for different values of B (the number of backward iterations); Figures 5(b) and 5(c) show the results. (The panels of Figure 5 show, respectively: (a) varying the number of expert rollouts, (b) training with importance sampling, and (c) training without importance sampling.) Using B > 1 without importance sampling resulted in failures in the training procedure, indicated by the incomplete plots in Figure 5(c).

Figure 5: Ablation study results. All plots were smoothed and (except in (c)) averaged over 3 seeds. Incomplete plots indicate a failure in the training procedure.

Figure 6: Heatmap of ζ θ for each of the four possible actions in TwoBridges (in clockwise direction from top-left): right, left, down, up.

min_{λ≥0} max_φ J(π_φ) + (1/β) H(π_φ) − λ(E_{τ∼π_φ}[ζ̄θ(τ)] − α).   (17)

Proof: Let us rewrite the inner optimization problem as max_π E_{τ∼π}[r(τ) − λ(ζ̄θ(τ) − α)] + (1/β) H(π).

Hyperparameters for various environments

A APPENDIX

A.1 GRADIENT OF LOG LIKELIHOOD

The gradient of (5) is

∇_θ L(θ) = (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − ∇_θ log ∫ exp(βr(τ)) ζθ(τ) dτ
         = (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − (1/Z_θ) ∫ exp(βr(τ)) ζθ(τ) ∇_θ log ζθ(τ) dτ
         = (1/N) Σ_{i=1}^N ∇_θ log ζθ(τ^(i)) − E_{τ∼πθ}[∇_θ log ζθ(τ)],

where the second line follows from the identity ∇_θ ζθ(τ) ≡ ζθ(τ) ∇_θ log ζθ(τ) and the last line from the MaxEnt assumption, i.e., πθ(τ) = exp(βr(τ)) ζθ(τ)/Z_θ.

A.2 DERIVING THE IMPORTANCE SAMPLING WEIGHTS

Suppose that at some iteration of our training procedure we are interested in approximating the gradient of the log of the partition function, ∇_θ log Z_θ (where θ are the current parameters of our classifier), using an older policy π_{ζθ̄} (where θ̄ were the parameters of the classifier which induced the constraint set that this policy respects). We can do so by noting that

∇_θ log Z_θ = E_{τ∼π_{ζθ}}[∇_θ log ζθ(τ)]
            = E_{τ∼π_{ζθ̄}}[(π_{ζθ}(τ)/π_{ζθ̄}(τ)) ∇_θ log ζθ(τ)]
            = (Z_θ̄/Z_θ) E_{τ∼π_{ζθ̄}}[(ζθ(τ)/ζθ̄(τ)) ∇_θ log ζθ(τ)],

where the last line follows from our MaxEnt assumption, i.e., π_{ζθ̄}(τ) = exp(βr(τ)) ζθ̄(τ)/Z_θ̄. Therefore, up to a normalization constant (which we estimate by self-normalizing the weights, as in (11)), the importance sampling weights are s(τ) = ζθ(τ)/ζθ̄(τ).

A.4.2 THREEBRIDGES

This is similar to the TwoBridges environment except that we now have three bridges. The bottom-left corners of the bridges are at (4, 1), (4, 9) and (4, 17.5). The middle bridge is 4 units long and 2 units wide while the other two bridges are 4 units long and 1.5 units wide. Agents attempting to move outside the simulator or into the water receive a reward of -2. The reward in regions left of the bridges is fixed to -5 and on and to the right of the bridges is 200/d, where d is the Euclidean distance of the agent to the goal. The reward within one unit circle of the goal is fixed to 250. Finally, agents randomly start at either the bottom-left or top-left corner with equal probability.

A.4.3 LAPGRIDWORLD

Here, agents move on an 11 × 11 grid by taking either clockwise or anti-clockwise actions. The agent is awarded a reward of 3 each time it moves onto a tile with a dollar (see Figure 2). The agent's state is the index of the grid tile it is on.

A.4.4 HALFCHEETAH AND WALKER2D

The original reward schemes for HalfCheetah and Walker2d in OpenAI Gym (Brockman et al., 2016) reward the agents proportionally to the distance they cover in the forward direction. We modify this and instead simply reward the agents according to the amount of distance they cover, irrespective of the direction they move in.
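A sketch of this direction-agnostic reward (assuming the agent's x-displacement per control step is available; names are illustrative):

```python
# Reward proportional to |distance covered|, regardless of direction,
# so backward motion is rewarded the same as forward motion.
def direction_agnostic_reward(x_before, x_after, dt=1.0):
    return abs(x_after - x_before) / dt

print(direction_agnostic_reward(0.0, -0.3))  # backward motion is also rewarded
```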

A.5 ON THE BUDGET SIZE

As noted in Section 3.3, we set the budget size α to a very small value, typically around 0.01. α controls the extent to which the agent respects the constraints imposed by ζ θ . In this section, we study the effect of α on the agent's performance. We train an agent using our algorithm on the TwoBridges environment for different values of α. Figure 7 shows the results. As can be seen, the agent's performance drops at higher values of α. For example when α is 1, the agent fails to achieve any meaningful reward. 

