BENCHMARKING CONSTRAINT INFERENCE IN INVERSE REINFORCEMENT LEARNING

Abstract

When deploying Reinforcement Learning (RL) agents in a physical system, we must ensure that these agents are aware of the underlying constraints. In many real-world problems, however, the constraints are hard to specify mathematically and are unknown to the RL agents. To tackle these issues, Inverse Constrained Reinforcement Learning (ICRL) empirically estimates constraints from expert demonstrations. As an emerging research topic, ICRL does not have common benchmarks, and previous works tested algorithms in hand-crafted environments with manually-generated expert demonstrations. In this paper, we construct an ICRL benchmark covering representative RL application domains, including robot control and autonomous driving. For each environment, we design relevant constraints and train expert agents to generate demonstration data. Moreover, unlike existing baselines that learn a "point estimate" of the constraint, we propose a variational ICRL method that models a posterior distribution over candidate constraints. We conduct extensive experiments on these algorithms under our benchmark and show how they facilitate studying important research challenges for ICRL. The benchmark, including instructions for reproducing ICRL algorithms, is available at https://github.com/Guiliang/ICRL-benchmarks-public.

1. INTRODUCTION

Constrained Reinforcement Learning (CRL) typically learns a policy under known or predefined constraints (Liu et al., 2021). This setting, however, is unrealistic in many real-world problems, since it is difficult to specify the exact constraints that an agent should follow, especially when these constraints are time-varying, context-dependent, and inherent to experts' own experience. Moreover, such information may not be completely revealed to the agent. For example, human drivers tend to determine an implicit speed limit and a minimum gap to other cars based on traffic conditions, rules of the road, weather, and social norms. To derive a driving policy that matches human performance, an autonomous agent must infer these constraints from expert demonstrations. An important approach to recovering the underlying constraints is Inverse Constrained Reinforcement Learning (ICRL) (Malik et al., 2021). ICRL infers a constraint function that approximates the constraints respected by expert demonstrations, typically by alternating between updating an imitating policy and updating the constraint function. Figure 1 summarizes the main procedure of ICRL; a minimal code sketch of this alternation follows the figure.

As an emerging research topic, ICRL does not have common datasets and benchmarks for evaluation. Existing validation methods rely heavily on the Safety Gym (Ray et al., 2019) environments, which have important drawbacks: 1) These environments are designed for control rather than constraint inference. To fill this gap, previous works often pick some environments and add external constraints to them. Striving for simplicity, many of the selected environments are deterministic with discretized state and action spaces (Scobee & Sastry, 2020; McPherson et al., 2021; Glazier et al., 2021; Papadimitriou et al., 2021; Gaurav et al., 2022). Generalizing model performance from these simple environments to practical applications is difficult.



Figure 1: The flowchart of ICRL.
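To make the alternation in Figure 1 concrete, the following is a minimal PyTorch sketch of an ICRL training loop in the spirit of Malik et al. (2021). All names (ConstraintNet, update_policy) are our own illustrative assumptions rather than the benchmark's actual API, and random tensors stand in for real environment rollouts.

```python
# Minimal sketch of the ICRL alternation (illustrative only; names and
# shapes are assumptions, not the benchmark's API).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, BATCH = 8, 2, 256


class ConstraintNet(nn.Module):
    """phi(s, a) in (0, 1): estimated probability that (s, a) is feasible."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def update_policy(phi: ConstraintNet):
    """Placeholder for the forward CRL step: train the imitating policy
    under the current constraint estimate (e.g., PPO with a Lagrangian
    penalty on 1 - phi) and roll it out. Random tensors stand in for the
    resulting state-action batch."""
    return torch.randn(BATCH, OBS_DIM), torch.randn(BATCH, ACT_DIM)


phi = ConstraintNet()
opt = torch.optim.Adam(phi.parameters(), lr=3e-4)
# Expert demonstrations (random placeholders here).
expert_obs, expert_act = torch.randn(BATCH, OBS_DIM), torch.randn(BATCH, ACT_DIM)

for iteration in range(10):
    # 1) Forward step: update the imitating policy against the current
    #    constraint estimate and collect fresh rollouts.
    policy_obs, policy_act = update_policy(phi)
    # 2) Inverse step: update phi so expert visits look feasible
    #    (phi -> 1) while the imitating policy's constraint-violating
    #    visits look infeasible (phi -> 0).
    eps = 1e-8
    loss = (-torch.log(phi(expert_obs, expert_act) + eps).mean()
            - torch.log(1.0 - phi(policy_obs, policy_act) + eps).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch the constraint function is a point estimate; the variational method proposed in this paper would instead maintain a distribution over such constraint networks.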

