LEARNING SOFT CONSTRAINTS FROM CONSTRAINED EXPERT DEMONSTRATIONS

Abstract

Inverse reinforcement learning (IRL) methods assume that the expert data is generated by an agent optimizing some reward function. However, in many settings, the agent may optimize a reward function subject to some constraints, where the constraints induce behaviors that may otherwise be difficult to express with a reward function alone. We consider the setting where the reward function is given and the constraints are unknown, and propose a method that recovers these constraints from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints that the agent satisfies on average per episode. In IRL fashion, our method solves this problem by iteratively adjusting the constraint function through a constrained optimization procedure, until the agent behavior matches the expert behavior. We demonstrate our approach on synthetic environments, robotics environments, and real-world highway driving scenarios.

1. INTRODUCTION

Inverse reinforcement learning (IRL) (Ng et al., 2000; Russell, 1998) refers to the problem of learning a reward function given observed optimal or near-optimal behavior. However, in many setups, expert actions may result from a policy that inherently optimizes a reward function subject to certain constraints. While IRL methods are able to learn a reward function that explains the expert demonstrations well, many tasks also require knowing the constraints. Constraints can often provide a more interpretable representation of behavior than the reward function alone (Chou et al., 2018). In fact, constraints can represent safety requirements more strictly than reward functions, and are therefore especially useful in safety-critical applications (Chou et al., 2021; Scobee & Sastry, 2019).

Inverse constraint learning (ICL) may therefore be defined as the process of extracting the constraint function(s) associated with given optimal (or near-optimal) expert data, where we assume that the reward function is available. Notably, some prior work (Chou et al., 2018; 2020; 2021; Scobee & Sastry, 2019; Malik et al., 2021) has tackled this problem by learning hard constraints (i.e., functions that indicate which state-action pairs are allowed). We propose a novel method for ICL (for simplicity, our method is also called ICL) that learns cumulative soft constraints from expert demonstrations while assuming that the reward function is known.

The difference between hard constraints and soft constraints can be illustrated as follows. Suppose that, in an environment, we need to obey the constraint "do not use more than 3 units of energy". As a hard constraint, we typically wish to ensure that the constraint is satisfied for every individual trajectory. A soft constraint, in contrast, need not be satisfied in every trajectory, but only in expectation.
This is equivalent to the specification "on average across all trajectories, do not use more than 3 units of energy". With soft constraints, there may be certain trajectories in which the constraint is violated, but in expectation it is satisfied.

To formulate our method, we adopt the framework of constrained Markov decision processes (CMDPs) (Altman, 1999), in which an agent seeks to maximize expected cumulative reward subject to bounds on the expected cumulative value of constraint functions. While previous work in constrained RL focuses on finding an optimal policy that respects known constraints, we seek to learn the constraints from expert demonstrations. We adopt an approach similar to IRL, but the goal is to learn the constraint functions instead of the reward function.

Contributions. Our contributions can be summarized as follows: (a) We propose a novel formulation and method for ICL. Our approach works with arbitrary state-action spaces (including continuous state-action spaces) and can learn arbitrary constraint functions represented by flexible neural networks. To the best of our knowledge, our method is the first to learn cumulative soft constraints (such constraints can account for noise in sensor measurements and possible violations in expert demonstrations) bounded in expectation, as in constrained MDPs. (b) We demonstrate our approach by learning constraint functions in various synthetic environments, robotics environments, and real-world highway driving scenarios.

The paper is structured as follows. Section 2 provides background on IRL and ICL. Section 3 summarizes previous work on constraint learning. Section 4 describes our new technique for learning cumulative soft constraints from expert demonstrations. Section 5 demonstrates the approach on synthetic environments and discusses the results (additional results are provided in Appendix B). Finally, Section 6 concludes by discussing limitations and future work.
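The hard/soft distinction above can be made concrete with a small numerical check. The following is only an illustrative sketch: the trajectories, the per-step energy costs, and the threshold of 3 units are hypothetical values chosen to match the running example.

```python
# Hard vs. soft (expected) cumulative constraints, illustrated on
# hypothetical trajectories. Each trajectory is a list of per-step
# energy costs c(s_t, a_t); the episodic threshold is beta = 3.
beta = 3.0
trajectories = [
    [1.0, 1.0, 0.5],   # total 2.5 -> within the budget
    [2.0, 2.0, 0.0],   # total 4.0 -> exceeds the budget
    [0.5, 1.0, 1.0],   # total 2.5 -> within the budget
]

totals = [sum(traj) for traj in trajectories]

# Hard constraint: every individual trajectory must stay within beta.
hard_ok = all(t <= beta for t in totals)

# Soft constraint: only the average over trajectories (an empirical
# expectation) must stay within beta.
soft_ok = sum(totals) / len(totals) <= beta

print(hard_ok)  # False: one trajectory uses 4.0 > 3 units of energy
print(soft_ok)  # True: the average is (2.5 + 4.0 + 2.5) / 3 = 3.0 <= 3
```

The second trajectory violates the hard constraint, yet the set of trajectories still satisfies the soft constraint in expectation, which is exactly the behavior the proposed method aims to recover.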

2. BACKGROUND

Markov Decision Process (MDP). An MDP is defined as a tuple $(S, A, p, \mu, r, \gamma)$, where $S$ is the state space, $A$ is the action space, $p(\cdot \mid s, a)$ gives the transition probabilities over next states given the current state $s$ and current action $a$, $r : S \times A \to \mathbb{R}$ is the reward function, $\mu : S \to [0, 1]$ is the initial state distribution, and $\gamma$ is the discount factor. The behavior of an agent in this MDP can be represented by a stochastic policy $\pi : S \times A \to [0, 1]$, a mapping from each state to a probability distribution over actions. A constrained MDP (CMDP) augments the MDP with a constraint function $c : S \times A \to \mathbb{R}$ and an episodic constraint threshold $\beta$.

Reinforcement learning and constrained RL. The objective of any standard RL procedure (control) is to obtain a policy that maximizes the (infinite-horizon) expected long-term discounted reward (Sutton & Barto, 2018):

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{s_0 \sim \mu(\cdot),\, a_t \sim \pi(\cdot \mid s_t),\, s_{t+1} \sim p(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right] =: J^{\pi}_{\mu}(r)$$

Similarly, in constrained RL, the expectation of each cumulative constraint function $c_i$ must additionally not exceed its associated threshold $\beta_i$:

$$\pi^* = \arg\max_{\pi} \; J^{\pi}_{\mu}(r) \quad \text{such that} \quad J^{\pi}_{\mu}(c_i) \le \beta_i \;\; \forall i$$

For simplicity, in this work we consider constrained RL with only one constraint function.

Inverse reinforcement learning (IRL) and inverse constraint learning (ICL). IRL performs the inverse operation of reinforcement learning: given access to a dataset $D = \{\tau_j\}_{j=1}^{N} = \{\{(s_t, a_t)\}_{t=1}^{M_j}\}_{j=1}^{N}$ sampled using an optimal or near-optimal policy $\pi^*$, the goal is to obtain a reward function $r$ that best explains the dataset. By "best explanation", we mean that if we perform the RL procedure using $r$, then the obtained policy captures the behavior demonstrated in $D$ as closely as possible.
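The objectives $J^{\pi}_{\mu}(r)$ and $J^{\pi}_{\mu}(c)$ above can be estimated by Monte-Carlo averaging over sampled trajectories. The sketch below illustrates this; the per-step reward and constraint values and the threshold are hypothetical placeholders, not outputs of any particular environment.

```python
# Monte-Carlo estimate of J^pi_mu(f), where f is either the reward r
# or a constraint function c, from sampled trajectories. A policy is
# feasible in the CMDP if the estimated constraint value stays below
# the threshold beta.
def discounted_return(values, gamma):
    """sum_t gamma^t * values[t] for one trajectory."""
    return sum(gamma**t * v for t, v in enumerate(values))

def estimate_J(trajectory_values, gamma):
    """Average discounted return over sampled trajectories,
    approximating the expectation over s0 ~ mu and the policy pi."""
    returns = [discounted_return(vals, gamma) for vals in trajectory_values]
    return sum(returns) / len(returns)

# Hypothetical per-step rewards r(s_t, a_t) and constraint values
# c(s_t, a_t) for two sampled trajectories.
gamma = 0.9
rewards = [[1.0, 1.0, 1.0], [0.5, 1.0, 0.5]]
costs   = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.0]]
beta = 2.0

J_r = estimate_J(rewards, gamma)  # estimate of J^pi_mu(r)
J_c = estimate_J(costs, gamma)    # estimate of J^pi_mu(c)
feasible = J_c <= beta            # CMDP feasibility check
```

In practice such estimates come from rollouts of the current policy; here they only serve to make the notation $J^{\pi}_{\mu}(\cdot)$ concrete.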
In the same way, given access to a dataset $D$ (as in IRL), sampled using an optimal or near-optimal policy $\pi^*$ (which respects some constraints $c_i$ while maximizing a known reward $r$), the goal of ICL is to obtain the constraint functions $c_i$ that best explain the dataset; that is, if we perform the constrained RL procedure using $r$ and $c_i \; \forall i$, then the obtained policy captures the behavior demonstrated in $D$.

Setup. Similar to prior work (Chou et al., 2020), we learn only the constraints, not the reward function. In general, it is difficult to say whether a demonstrated behavior is obeying a constraint, maximizing a reward, or doing both. So, for simplicity, we assume the (nominal) reward is given, and
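At a high level, the abstract describes ICL as iteratively adjusting the constraint function via constrained optimization until the agent's behavior matches the expert's. The loop below is only a schematic sketch of that alternation, not the paper's actual algorithm: `constrained_rl`, `update_constraint`, and `behavior_gap` are hypothetical placeholders standing in for the forward constrained-RL solver, the constraint adjustment step, and a measure of agent/expert mismatch.

```python
# Schematic ICL alternation (a sketch, not the paper's algorithm):
# (1) solve constrained RL under the current constraint estimate, then
# (2) adjust the constraint using the gap between the learned policy's
# behavior and the expert demonstrations, until they match.
def icl_loop(reward, expert_data, constrained_rl, update_constraint,
             behavior_gap, n_iters=10, tol=1e-3):
    constraint = None  # initial guess: no constraint
    for _ in range(n_iters):
        policy = constrained_rl(reward, constraint)   # forward step
        if behavior_gap(policy, expert_data) < tol:   # agent matches expert
            break
        constraint = update_constraint(constraint, policy, expert_data)
    return constraint
```

Section 4 gives the actual formulation; this sketch only conveys the IRL-style outer loop over constraint candidates.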

