INVERSE CONSTRAINED REINFORCEMENT LEARNING

Abstract

Standard reinforcement learning (RL) algorithms train agents to maximize given reward functions. However, many real-world applications of RL require agents to also satisfy certain constraints which may, for example, be motivated by safety concerns. Constrained RL algorithms approach this problem by training agents to maximize given reward functions while respecting explicitly defined constraints. However, in many cases, manually designing accurate constraints is a challenging task. In this work, given a reward function and a set of demonstrations from an expert that maximizes this reward function while respecting unknown constraints, we propose a framework to learn the most likely constraints that the expert respects. We then train agents to maximize the given reward function subject to the learned constraints. Previous works in this regard have mainly been restricted to tabular settings or to specific types of constraints, or have assumed knowledge of the transition dynamics of the environment. In contrast, we empirically show that our framework is able to learn arbitrary Markovian constraints in high dimensions in a model-free setting.

1. INTRODUCTION

Reward functions are a critical component in reinforcement learning settings. As such, it is important that reward functions are designed accurately and are well-aligned with the intentions of the human designer. This is known as agent (or value) alignment (see, e.g., Leike et al. (2018; 2017); Amodei et al. (2016)). Misspecified rewards can lead to unwanted and unsafe situations (see, e.g., Amodei & Clark (2016)). However, designing accurate reward functions remains a challenging task. Human designers, for example, tend to prefer simple reward functions that agree well with their intuition and are easily interpretable. For example, a human designer might choose a reward function that encourages an RL agent driving a car to minimize its traveling time to a certain destination. Clearly, such a reward function makes sense in the case of a human driver since inter-human communication is contextualized within a framework of unwritten and unspoken constraints, often colloquially termed 'common-sense'. That is, while a human driver will try to minimize their traveling time, they will be careful not to break traffic rules, take actions that endanger passersby, and so on. However, we cannot assume such behaviors from RL agents since they are not imbued with common-sense constraints.

Constrained reinforcement learning provides a natural framework for maximizing a reward function subject to some constraints (we refer the reader to Ray et al. (2019) for a brief overview of the field). However, in many cases, these constraints are hard to specify explicitly in the form of mathematical functions. One way to address this issue is to automatically extract constraints by observing the behavior of a constraint-abiding agent. Consider, for example, the cartoon in Figure 1. Agents start at the bottom-left corner and are rewarded according to how quickly they reach the goal at the bottom-right corner.
However, what this reward scheme misses is that in the real world the lower bridge is occupied by a lion which attacks any agents attempting to pass through it. Therefore, agents that are naïvely trained to maximize the reward function will end up performing poorly in the real world. If, on the other hand, the agent had observed that the expert (in Figure 1(a)) actually performed suboptimally with respect to the stipulated reward scheme by taking a longer route to the goal, it could have concluded that (for some unknown reason) the lower bridge must be avoided and consequently would not have been eaten by the lion! Scobee & Sastry (2020) formalize this intuition by casting the problem of recovering constraints in the maximum entropy framework for inverse RL (IRL) (Ziebart et al., 2008) and propose a greedy algorithm to infer the smallest number of constraints that best explain the expert behavior. However, Scobee & Sastry (2020) has two major limitations: it assumes (1) a tabular (discrete) setting, and (2) knowledge of the environment's transition dynamics. In this work, we aim to address both of these issues by instead learning a constraint function through a sample-based approximation of the objective function of Scobee & Sastry. Consequently, our approach is model-free, admits continuous states and actions, and can learn arbitrary Markovian constraints¹. Further, we empirically show that it scales well to high dimensions.

Typical inverse RL methods only make use of expert demonstrations and do not assume any knowledge about the reward function at all. However, most reward functions can be expressed in the form "do this task while not doing these other things", where the other things are generally constraints that a designer wants to impose on an RL agent. The main task ("do this") is often quite easy to encode in the form of a simple nominal reward function.
In this work, we focus on learning the constraint part ("do not do that") from provided expert demonstrations and using it in conjunction with the nominal reward function to train RL agents. In this perspective, our work can be seen as a principled way to incorporate prior knowledge about the agent's task in IRL. This is a key advantage over other IRL methods, which also often end up making assumptions about the agent's task in the form of regularizers such as in Finn et al. (2016). The main contributions of our work are as follows:

• We formulate the problem of inferring constraints from a set of expert demonstrations as a learning problem, which allows it to be used in continuous settings. To the best of our knowledge, this is the first work in this regard.

• We eliminate the need to assume, as Scobee & Sastry do, knowledge of the environment's transition dynamics.

• We demonstrate the ability of our method to train constraint-abiding agents in high dimensions and show that it can also be used to prevent reward hacking.
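The core idea of learning a constraint function from demonstrations can be illustrated with a deliberately simplified sketch. Here the constraint is modeled as a parametric "feasibility" function fit as a logistic classifier that separates expert state-action pairs from those of a nominal (reward-only) policy; the toy data, the two-dimensional features, the logistic model, and the training loop are all our own illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 2-D state-action features.
# Expert pairs cluster around x = -1 (they avoid the forbidden region);
# pairs from the nominal, reward-only policy cluster around x = +1.
expert = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(200, 2))
nominal = rng.normal(loc=[+1.0, 0.0], scale=0.5, size=(200, 2))

X = np.vstack([expert, nominal])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = expert-like (feasible)

# Logistic feasibility model zeta(s, a) = sigmoid(w . phi(s, a) + b),
# trained by plain gradient descent on the binary cross-entropy loss.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y) / len(y)
    grad_b = float(np.mean(p - y))
    w -= 1.0 * grad_w
    b -= 1.0 * grad_b

def feasibility(sa):
    """Learned score in [0, 1]; low values flag constrained behavior."""
    return float(1.0 / (1.0 + np.exp(-(np.array(sa) @ w + b))))

# Expert-like pairs score high; nominal-only pairs score low.
print(feasibility([-1.0, 0.0]), feasibility([+1.0, 0.0]))
```

In the full method, such a learned function would enter a sample-based maximum-entropy IRL objective and the policy would then be retrained on the nominal reward subject to the learned constraint; the classifier above only conveys the model-free, sample-based flavor of the approach.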

¹ Markovian constraints are of the form c(τ) = Σ_{t=1}^{T} c(s_t, a_t), i.e., the constraint function is independent of the past states and actions in the trajectory.
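The Markovian property above means the trajectory-level constraint decomposes into per-step terms, so it can be evaluated without any memory of earlier states or actions. A minimal sketch, in which the grid coordinates and the `in_lower_bridge` forbidden-region predicate are illustrative assumptions rather than details from the paper:

```python
def step_constraint(state, action):
    """Per-step term c(s_t, a_t): 1.0 if the state lies in a forbidden region."""
    x, y = state
    in_lower_bridge = 3 <= x <= 6 and y == 0  # assumed forbidden cells
    return 1.0 if in_lower_bridge else 0.0

def trajectory_constraint(trajectory):
    """Markovian constraint c(tau) = sum over t of c(s_t, a_t)."""
    return sum(step_constraint(s, a) for s, a in trajectory)

# A trajectory crossing the lower bridge accumulates positive constraint
# cost; one taking the upper bridge does not.
lower = [((x, 0), "right") for x in range(8)]
upper = [((x, 2), "right") for x in range(8)]
print(trajectory_constraint(lower), trajectory_constraint(upper))
```

A non-Markovian constraint (e.g. "never revisit a state") could not be written this way, since its per-step value would depend on the trajectory's history.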



Figure 1: The TwoBridges environment. (a) The expert avoids the lion and takes the upper bridge. (b) Since the nominal policy is simply trained to get to the goal as quickly as possible, it instead takes the lower bridge. (c) Our method, on the other hand, is able to learn that the lower bridge should be avoided, and consequently our policy takes the upper bridge.

UNCONSTRAINED RL

A finite-horizon Markov Decision Process (MDP) M is a tuple (S, A, p, r, γ, T), where S ⊆ ℝ^{|S|} is a set of states, A ⊆ ℝ^{|A|} is a set of actions, p : S × A × S → [0, 1] is the transition probability function (where p(s′|s, a) denotes the probability of transitioning to state s′ from state s by taking

