BENCHMARKING CONSTRAINT INFERENCE IN INVERSE REINFORCEMENT LEARNING

Abstract

When deploying Reinforcement Learning (RL) agents in a physical system, we must ensure that these agents are aware of the underlying constraints. In many real-world problems, however, the constraints are hard to specify mathematically and unknown to the RL agents. To tackle these issues, Inverse Constrained Reinforcement Learning (ICRL) empirically estimates constraints from expert demonstrations. As an emerging research topic, ICRL lacks common benchmarks, and previous works tested algorithms in hand-crafted environments with manually generated expert demonstrations. In this paper, we construct an ICRL benchmark in the context of RL application domains, including robot control and autonomous driving. For each environment, we design relevant constraints and train expert agents to generate demonstration data. In addition, unlike existing baselines that learn a "point estimate" of the constraint, we propose a variational ICRL method that models a posterior distribution over candidate constraints. We conduct extensive experiments on these algorithms under our benchmark and show how it can facilitate studying important research challenges for ICRL. The benchmark, including instructions for reproducing ICRL algorithms, is available at https://github.com/Guiliang/ICRL-benchmarks-public.

1. INTRODUCTION

Constrained Reinforcement Learning (CRL) typically learns a policy under some known or predefined constraints (Liu et al., 2021). This setting, however, is unrealistic in many real-world problems, since it is difficult to specify the exact constraints an agent should follow, especially when these constraints are time-varying, context-dependent, and inherent to experts' own experience. Further, such information may not be completely revealed to the agent. For example, human drivers determine an implicit speed limit and a minimum gap to other cars based on traffic conditions, rules of the road, weather, and social norms. To derive a driving policy that matches human performance, an autonomous agent must infer these constraints from expert demonstrations. An important approach to recovering the underlying constraints is Inverse Constrained Reinforcement Learning (ICRL) (Malik et al., 2021). ICRL infers a constraint function that approximates the constraints respected by expert demonstrations, typically by alternating between updating an imitating policy and updating the constraint function. Figure 1 summarizes the main procedure of ICRL.

As an emerging research topic, ICRL has no common datasets or benchmarks for evaluation. Existing validation methods rely heavily on the Safety Gym (Ray et al., 2019) environments, which has some important drawbacks: 1) These environments are designed for control rather than constraint inference. To fill this gap, previous works often pick some environments and add external constraints to them. Striving for simplicity, many of the selected environments are deterministic with discretized state and action spaces (Scobee & Sastry, 2020; McPherson et al., 2021; Glazier et al., 2021; Papadimitriou et al., 2021; Gaurav et al., 2022). Generalizing model performance from these simple environments to practical applications is difficult.
2) ICRL algorithms require expert demonstrations that respect the added constraints, while general RL environments do not include such data, so previous works often generate the expert data manually. However, without carefully fine-tuning the generator, it is often unclear how the quality of expert trajectories influences the performance of ICRL algorithms.

In this paper, we propose a benchmark for evaluating ICRL algorithms. This benchmark includes a rich collection of testbeds spanning virtual, realistic, and discretized environments. The virtual environments are based on MuJoCo (Todorov et al., 2012), but we update some of these robot control tasks by adding location constraints and modifying dynamics functions. The realistic environments are constructed from a highway vehicle tracking dataset (Krajewski et al., 2018), so they suitably reflect what happens in a realistic driving scenario, where we consider constraints on car velocities and distances. The discretized environments are based on grid-worlds for visualizing the recovered constraints (see Appendix B). To generate the demonstration datasets for these environments, we extend Proximal Policy Optimization (PPO) (Schulman et al., 2017) and policy iteration (Sutton & Barto, 2018) by incorporating ground-truth constraints into the optimization with Lagrange multipliers. We empirically demonstrate the performance of the expert models trained by these methods and describe how the expert demonstrations are generated.

For ease of comparison, our benchmark includes ICRL baselines. Existing baselines learn the constraint function that is most likely to differentiate expert trajectories from generated ones. However, this point estimate (i.e., a single constraint estimate) may be inaccurate. A more conceptually satisfying method accounts for all possibilities of the learned constraint by modeling its posterior distribution.
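The expert-training scheme above (a constraint-aware objective optimized with Lagrange multipliers) can be illustrated with a minimal dual-ascent sketch; the function names, learning rate, cost budget, and per-iteration cost values below are hypothetical illustrations, not the benchmark's actual implementation:

```python
def penalized_reward(reward, cost, lam):
    """Fold a ground-truth constraint cost into the reward via a Lagrange
    multiplier, so an unconstrained learner (e.g., PPO) can be trained
    on the penalized objective."""
    return reward - lam * cost

def update_multiplier(lam, avg_episode_cost, budget, lr=0.05):
    """Dual ascent on the multiplier: lambda grows while the cost budget
    is violated and shrinks (down to zero) once it is respected."""
    return max(0.0, lam + lr * (avg_episode_cost - budget))

# Toy illustration with hypothetical per-iteration average episode costs:
# costs above the budget of 1.0 push lambda up, costs below pull it down.
lam = 0.0
for avg_cost in [1.5, 1.2, 0.8]:
    lam = update_multiplier(lam, avg_cost, budget=1.0)
```

In practice the multiplier update is interleaved with policy-gradient steps on the penalized reward, so the expert converges to high return while keeping its expected discounted cost under the bound.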
To extend this Bayesian approach to the tasks in our benchmark, we propose a Variational Inverse Constrained Reinforcement Learning (VICRL) algorithm that can efficiently infer constraints in environments with high-dimensional, continuous state spaces. Beyond these regular evaluations, our benchmark can facilitate answering a series of important research questions by studying how well ICRL algorithms perform 1) when the expert demonstrations may violate constraints (Section 4.3); 2) under stochastic environments (Section 4.4); 3) under environments with multiple constraints (Section 5.2); and 4) when recovering the exact least-constraining constraint (Appendix B.2).
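To give a rough sense of what a posterior over constraints buys compared to a point estimate, one could model the feasibility of a state-action pair as a Beta-distributed random variable; everything below (the Beta parameterization, names, and numbers) is an illustrative assumption, not VICRL's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_feasibility(alpha, beta, n_samples=1000):
    """Draw candidate constraint values phi(s, a) ~ Beta(alpha, beta).
    phi near 1 means (s, a) is likely permitted; near 0, likely forbidden.
    In a learned model, alpha and beta would be network outputs."""
    return rng.beta(alpha, beta, size=n_samples)

# A state-action pair the expert visits often: posterior mass near "feasible",
# but with residual spread quantifying uncertainty a point estimate discards.
samples = sample_feasibility(alpha=8.0, beta=2.0)
posterior_mean = samples.mean()  # close to alpha / (alpha + beta) = 0.8
posterior_std = samples.std()    # remaining uncertainty about the constraint
```

The spread lets a learner distinguish "confidently feasible" regions from ones it has simply never observed, which a single constraint estimate cannot express.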

2. BACKGROUND

In this section, we introduce Inverse Constrained Reinforcement Learning (ICRL), which alternately solves a forward Constrained Reinforcement Learning (CRL) problem and an inverse constraint-inference problem (see Figure 1).

2.1. CONSTRAINED REINFORCEMENT LEARNING

Constrained Reinforcement Learning (CRL) is based on Constrained Markov Decision Processes (CMDPs) $\mathcal{M}^c$, defined by a tuple $(\mathcal{S}, \mathcal{A}, p_R, p_T, \{(p_{C_i}, \epsilon_i)\}_{\forall i}, \gamma, T)$ where: 1) $\mathcal{S}$ and $\mathcal{A}$ denote the spaces of states and actions; 2) $p_T(s'|s,a)$ and $p_R(r|s,a)$ define the transition and reward distributions; 3) $p_{C_i}(c|s,a)$ denotes a stochastic constraint function with an associated bound $\epsilon_i$, where $i$ indicates the index of a constraint and the cost $c \in [0, \infty]$; 4) $\gamma \in [0,1)$ is the discount factor and $T$ is the planning horizon. Based on CMDPs, we define a trajectory $\tau = [s_0, a_0, \dots, a_{T-1}, s_T]$ with $p(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t|s_t)\, p_T(s_{t+1}|s_t, a_t)$. To learn a policy under CMDPs, CRL agents commonly consider the following optimization problem.

Cumulative Constraints. We consider a CRL problem that finds a policy $\pi$ maximizing expected discounted rewards under a set of cumulative soft constraints:
$$\arg\max_{\pi} \; \mathbb{E}_{p_R, p_T, \pi}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big] + \frac{1}{\beta}\mathcal{H}(\pi) \quad \text{s.t.} \quad \mathbb{E}_{p_{C_i}, p_T, \pi}\Big[\sum_{t=0}^{T} \gamma^t c_i(s_t, a_t)\Big] \le \epsilon_i \quad \forall i \in [0, I] \tag{1}$$
where $\mathcal{H}(\pi)$ denotes the policy entropy weighted by $\frac{1}{\beta}$. This formulation is useful given an infinite horizon ($T = \infty$), where the constraints consist of bounds on the expectation of cumulative constraint values. In practice, we commonly use this setting to define soft constraints, since the agent can recover from an undesirable movement (corresponding to a high cost $c_i(s_t, a_t)$) as long as the discounted additive cost stays below the threshold $\epsilon_i$.
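Problem (1) is typically handled via a Lagrangian relaxation, which also underlies the multiplier-based expert training mentioned earlier; one standard form, written as a sketch consistent with the notation above rather than necessarily the exact objective used here, is:

```latex
\min_{\lambda_i \ge 0} \max_{\pi} \;
\mathcal{L}(\pi, \lambda) =
\mathbb{E}_{p_R, p_T, \pi}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big]
+ \frac{1}{\beta}\mathcal{H}(\pi)
- \sum_{i=0}^{I} \lambda_i \Big(
\mathbb{E}_{p_{C_i}, p_T, \pi}\Big[\sum_{t=0}^{T} \gamma^t c_i(s_t, a_t)\Big]
- \epsilon_i \Big)
```

Alternating gradient ascent on $\pi$ with descent on each $\lambda_i$ raises a multiplier while its constraint is violated and drives it toward zero once the expected discounted cost falls below $\epsilon_i$.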



Figure 1: The flowchart of ICRL.

