CAUSAL IMITATION LEARNING VIA INVERSE REINFORCEMENT LEARNING

Abstract

One of the most common ways children learn when unfamiliar with the environment is by mimicking adults. Imitation learning concerns an imitator learning to behave in an unknown environment from an expert's demonstrations; reward signals remain latent to the imitator. This paper studies imitation learning through causal lenses and extends the analysis and tools developed for behavior cloning (Zhang, Kumor, & Bareinboim, 2020) to inverse reinforcement learning. First, we propose novel graphical conditions that allow the imitator to learn a policy performing as well as the expert's behavior policy, even when the imitator's and the expert's state-action spaces disagree and unobserved confounders (UCs) are present. When provided with parametric knowledge about the unknown reward function, such a policy may outperform the expert's. Also, our method is easily extensible and allows one to leverage existing IRL algorithms even when UCs are present, including the multiplicative-weights algorithm (MWAL) (Syed & Schapire, 2008) and generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). Finally, we validate our framework through simulations on real-world and synthetic data.

1. INTRODUCTION

Reinforcement Learning (RL) has been deployed and shown to perform extremely well in highly complex environments in the past decades (Sutton & Barto, 1998; Mnih et al., 2013; Silver et al., 2016; Berner et al., 2019). One of the critical assumptions behind many classical RL algorithms is that the reward signal is fully observed and the reward function can be well-specified. In many real-world applications, however, it might be impractical to design a suitable reward function that evaluates each and every scenario (Randløv & Alstrøm, 1998; Ng et al., 1999). For example, in the context of human driving, it is challenging to design a precise reward function, and experimenting in the environment could be ill-advised; still, watching expert drivers operate is usually feasible. In machine learning, the imitation learning (IL) paradigm investigates the problem of how an agent should behave and learn in an environment with an unknown reward function by observing demonstrations from a human expert (Argall et al., 2009; Billard et al., 2008; Hussein et al., 2017; Osa et al., 2018). There are two major learning modalities that implement IL: behavioral cloning (BC) (Widrow, 1964; Pomerleau, 1989; Muller et al., 2006; Mülling et al., 2013; Mahler & Goldberg, 2017) and inverse reinforcement learning (IRL) (Ng et al., 2000; Ziebart et al., 2008; Ho & Ermon, 2016; Fu et al., 2017). BC methods directly mimic the expert's behavior policy by learning a mapping from observed states to the expert's actions via supervised learning. Alternatively, IRL methods first learn a potential reward function under which the expert's behavior policy is optimal. The imitator then obtains a policy by employing standard RL methods to maximize the learned reward function. Under some common assumptions, both BC and IRL are able to obtain policies that achieve the expert's performance (Kumor et al., 2021; Swamy et al., 2021).
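The contrast between the two modalities can be sketched on a toy one-shot problem. The states, actions, demonstration data, and the small reward class below are hypothetical, chosen only for illustration:

```python
# Illustrative contrast of BC and IRL on a toy one-shot problem with
# binary states and actions. All numbers here are hypothetical.
import itertools
from collections import Counter

# Expert demonstrations: (state z, action x) pairs.
demos = [(0, 1), (0, 1), (1, 0), (1, 0), (0, 1)]

# --- Behavioral cloning: supervised fit, state -> most frequent action.
def bc_policy(demos):
    by_state = {}
    for z, x in demos:
        by_state.setdefault(z, Counter())[x] += 1
    return {z: cnt.most_common(1)[0][0] for z, cnt in by_state.items()}

# --- IRL (sketch): search a small reward class for a function under
# which the expert's empirical rule is optimal, then act greedily.
def irl_policy(demos, states=(0, 1), actions=(0, 1)):
    expert = bc_policy(demos)  # expert's empirical decision rule
    # Candidate rewards: r(z, x) = 1 iff x == g(z), one per rule g.
    for g in itertools.product(actions, repeat=len(states)):
        r = lambda z, x, g=g: 1.0 if x == g[z] else 0.0
        if all(r(z, expert[z]) >= r(z, a) for z in states for a in actions):
            # Expert is optimal under r; maximize r (exact greedy step,
            # since the problem is one-shot).
            return {z: max(actions, key=lambda a: r(z, a)) for z in states}

print(bc_policy(demos))   # {0: 1, 1: 0}
print(irl_policy(demos))  # {0: 1, 1: 0}, recovering the same rule
```

In this noiseless toy case both routes recover the expert's rule; the two examples further below illustrate when IRL can do strictly better, and when confounding breaks it.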
Moreover, when additional parametric knowledge about the reward function is provided, IRL may produce a policy that outperforms the expert's in the underlying environment (Syed & Schapire, 2008; Li et al., 2017; Yu et al., 2020). For concreteness, consider the learning scenario depicted in Fig. 1a, describing trajectories of human-driven cars collected by drones flying over highways (Krajewski et al., 2018; Etesami & Geiger, 2020). Using such data, we want to learn a policy X ← π(Z) deciding on the acceleration (action) X based on the velocities and locations Z of surrounding cars. A typical IRL imitator solves a minimax problem min_π max_{f_Y} E[f_Y(X, Z)] − E[f_Y(X, Z) | do(π)]. The inner step "guesses" a reward function being optimized by the expert, while the outer step learns a policy maximizing the guessed reward function. Applying these steps leads to a policy π*: X ← ¬Z with the expected reward E[Y | do(π*)] = 1, which outperforms the sub-optimal expert. Despite the performance guarantees provided by existing imitation methods, both BC and IRL rely on the assumption that the expert's input observations match those available to the imitator. More recently, an emerging line of research under the rubric of causal imitation learning augments the imitation paradigm to account for environments consisting of arbitrary causal mechanisms and for the aforementioned mismatch between the expert's and imitator's sensory capabilities (de Haan et al., 2019; Zhang et al., 2020; Etesami & Geiger, 2020; Kumor et al., 2021). Closest to our work, Zhang et al. (2020) and Kumor et al. (2021) derived graphical criteria that completely characterize when and how BC can lead to successful imitation even when the agents perceive reality differently.
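As a minimal sketch of this minimax step, consider a hypothetical binary instance of the Fig. 1a setting. The concrete numbers below (true reward Y = X ⊕ Z, uniform Z, an expert who picks X = ¬Z only 70% of the time) are our own illustrative assumptions, not the paper's exact instance; enumerating the extreme points of the reward class suffices because the objective is linear in f_Y:

```python
# Minimax IRL by exhaustive enumeration on a hypothetical binary
# instance: Z uniform, latent reward Y = X XOR Z, and a sub-optimal
# expert playing X = NOT Z only 70% of the time.
import itertools

P_Z = {0: 0.5, 1: 0.5}
P_X_given_Z = {z: {1 - z: 0.7, z: 0.3} for z in (0, 1)}  # expert behavior

def obs_reward(f):
    # E[f(X, Z)] under the expert's observational distribution.
    return sum(P_Z[z] * P_X_given_Z[z][x] * f[(x, z)]
               for z in (0, 1) for x in (0, 1))

def int_reward(f, pi):
    # E[f(X, Z) | do(pi)] for a deterministic policy pi: Z -> X.
    return sum(P_Z[z] * f[(pi[z], z)] for z in (0, 1))

def worst_case_gap(pi):
    # Inner step: max_f E[f(X,Z)] - E[f(X,Z) | do(pi)].
    # Extreme points of [0,1]^4 suffice since the objective is linear.
    gap = float("-inf")
    for vals in itertools.product((0.0, 1.0), repeat=4):
        f = dict(zip([(x, z) for x in (0, 1) for z in (0, 1)], vals))
        gap = max(gap, obs_reward(f) - int_reward(f, pi))
    return gap

# Outer step: pick the policy minimizing the worst-case reward gap.
policies = [{0: a, 1: b} for a in (0, 1) for b in (0, 1)]
pi_star = min(policies, key=worst_case_gap)
print(pi_star)  # {0: 1, 1: 0}, i.e. X <- NOT Z

true_reward = lambda pi: sum(P_Z[z] * (pi[z] ^ z) for z in (0, 1))
print(true_reward(pi_star))  # 1.0, above the expert's 0.7
```

Here the minimax imitator recovers X ← ¬Z and earns expected reward 1, strictly above the sub-optimal expert's 0.7, matching the qualitative claim in the text.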
Still, it is unclear how to perform IRL-type training if some of the expert's observed states remain latent to the imitator, which leads to the presence of unobserved confounders (UCs) in the expert's demonstrations. Perhaps surprisingly, naively applying IRL methods when UCs are present does not necessarily lead to satisfactory performance, even when the expert itself behaves optimally. To witness, we now modify the previous highway driving scenario to demonstrate the challenges of UCs. In reality, the covariates Z (i.e., velocities and locations) are also affected by the car horns U1 of surrounding vehicles and the wind condition U2. However, due to the drones' perspective (recording from the top), such critical information (i.e., U1, U2) is not captured by the camera and thus remains unobserved. Fig. 1b graphically describes this modified learning setting. More specifically, consider an instance where Z ← U1 ⊕ U2 and Y ← ¬X ⊕ Z ⊕ U2; ⊕ is the exclusive-or operator, and the values of U1 and U2 are drawn uniformly over {0, 1}. An expert driver, being able to hear the car horn U1, follows a behavior policy X ← U1 and achieves the optimal performance E[Y] = 1. Meanwhile, observe that the observational expectation E[Y | x, z] = 1, which is consistent with any constant reward function f_Y(x, z) = α (where α > 0). Solving min_π max_{f_Y} E[f_Y(X, Z)] − E[f_Y(X, Z) | do(π)] therefore leads to an IRL policy π* with expected reward E[Y | do(π*)] = 0.5, far below the expert's optimal performance E[Y] = 1. A question that naturally arises is: under what conditions can an IRL procedure perform well when UCs are present and there is a mismatch between the perceptions of the two agents? In this paper, we answer this question and, more broadly, investigate the challenge of performing IRL through causal lenses. In particular, our contributions are summarized as follows. (1) We provide a novel, causal formulation of the inverse reinforcement learning problem.
This formulation allows one to formally study and understand the conditions under which an IRL policy is learnable, including in settings where UCs cannot be ruled out a priori. (2) We derive a new graphical condition for deciding whether an imitating policy can be computed from the available data and knowledge; this condition provides a robust generalization of current IRL algorithms, including GAIL (Ho & Ermon, 2016) and MWAL (Syed & Schapire, 2008), to non-Markovian settings. (3) Finally, we move beyond this graphical condition and develop an effective IRL algorithm for structural causal models (Pearl, 2000) with




Figure 1: Causal diagrams where X represents an action (shaded red) and Y represents a latent reward (shaded blue). Input covariates of the policy scope S are shaded in light red.
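The confounded example of Fig. 1b (Z ← U1 ⊕ U2, Y ← ¬X ⊕ Z ⊕ U2, with U1, U2 uniform over {0, 1}) is small enough to verify exactly by enumerating the four equally likely (U1, U2) worlds; the sketch below checks the claims made in the text:

```python
# Exact check of the confounded driving example: Z = U1 XOR U2 and
# Y = NOT X XOR Z XOR U2, with U1, U2 uniform over {0, 1}. The expert
# hears the horn U1 and plays X = U1; the imitator only sees Z.
from itertools import product

def Y(x, z, u2):
    return (1 - x) ^ z ^ u2

# Expert: X <- U1. Average over the four equally likely (u1, u2) worlds.
expert = sum(Y(u1, u1 ^ u2, u2) for u1, u2 in product((0, 1), repeat=2)) / 4
print(expert)  # 1.0: the expert is optimal

# Any policy X <- pi(Z) that ignores U1 earns only 0.5.
for pi in [lambda z: 0, lambda z: 1, lambda z: z, lambda z: 1 - z]:
    val = sum(Y(pi(u1 ^ u2), u1 ^ u2, u2)
              for u1, u2 in product((0, 1), repeat=2)) / 4
    print(val)  # 0.5 in every case

# Observationally, Y = 1 in every world under the expert's policy, so
# the demonstrations cannot distinguish the true reward from any
# constant reward function -- the source of the naive IRL failure.
assert all(Y(u1, u1 ^ u2, u2) == 1 for u1, u2 in product((0, 1), repeat=2))
```

The enumeration confirms all three facts used in the text: the expert attains E[Y] = 1, every Z-based policy attains only E[Y | do(π)] = 0.5, and the observational data are consistent with a constant reward.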

