CAUSAL IMITATION LEARNING VIA INVERSE REINFORCEMENT LEARNING

Abstract

One of the most common ways children learn in an unfamiliar environment is by mimicking adults. Imitation learning concerns an imitator learning to behave in an unknown environment from an expert's demonstrations; reward signals remain latent to the imitator. This paper studies imitation learning through causal lenses and extends the analysis and tools developed for behavior cloning (Zhang, Kumor, Bareinboim, 2020) to inverse reinforcement learning (IRL). First, we propose novel graphical conditions that allow the imitator to learn a policy performing as well as the expert's behavior policy, even when the imitator's and the expert's state-action spaces disagree and unobserved confounders (UCs) are present. When provided with parametric knowledge about the unknown reward function, such a policy may outperform the expert's. Also, our method is easily extensible and allows one to leverage existing IRL algorithms even when UCs are present, including the multiplicative-weights algorithm (MWAL) (Syed & Schapire, 2008) and generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). Finally, we validate our framework by simulations using real-world and synthetic data.

* Equal contribution.
* Otherwise, one could always simplify the diagram G and project out the other latent variables L \ {Y} using the projection algorithm (Tian, 2002, Sec. 4.5), without affecting the identifiability of the target query E[Y | do(π)].

1. INTRODUCTION

Reinforcement Learning (RL) has been deployed and shown to perform extremely well in highly complex environments in the past decades (Sutton & Barto, 1998; Mnih et al., 2013; Silver et al., 2016; Berner et al., 2019). One of the critical assumptions behind many classical RL algorithms is that the reward signal is fully observed and the reward function can be well specified. In many real-world applications, however, it might be impractical to design a suitable reward function that evaluates each and every scenario (Randløv & Alstrøm, 1998; Ng et al., 1999). For example, in the context of human driving, it is challenging to design a precise reward function, and experimenting in the environment could be ill-advised; still, watching expert drivers operate is usually feasible. In machine learning, the imitation learning (IL) paradigm investigates how an agent should behave and learn in an environment with an unknown reward function by observing demonstrations from a human expert (Argall et al., 2009; Billard et al., 2008; Hussein et al., 2017; Osa et al., 2018).

There are two major learning modalities that implement IL: behavioral cloning (BC) (Widrow, 1964; Pomerleau, 1989; Muller et al., 2006; Mülling et al., 2013; Mahler & Goldberg, 2017) and inverse reinforcement learning (IRL) (Ng et al., 2000; Ziebart et al., 2008; Ho & Ermon, 2016; Fu et al., 2017). BC methods directly mimic the expert's behavior policy by learning a mapping from observed states to the expert's action via supervised learning. Alternatively, IRL methods first learn a potential reward function under which the expert's behavior policy is optimal. The imitator then obtains a policy by employing standard RL methods to maximize the learned reward function. Under some common assumptions, both BC and IRL are able to obtain policies that achieve the expert's performance (Kumor et al., 2021; Swamy et al., 2021).
Moreover, when additional parametric knowledge about the reward function is provided, IRL may produce a policy that outperforms the expert's in the underlying environment (Syed & Schapire, 2008; Li et al., 2017; Yu et al., 2020).

For concreteness, consider the learning scenario depicted in Fig. 1a, describing trajectories of human-driven cars collected by drones flying over highways (Krajewski et al., 2018; Etesami & Geiger, 2020). Using such data, we want to learn a policy X ← π(Z) deciding on the acceleration (action) X ∈ {0, 1} of the demonstrator car based on the velocities and locations Z of surrounding cars. The driving performance is measured by a latent reward signal Y. Consider an instance where Y ← (1 − X)Z + X(1 − Z) and values of Z are drawn uniformly over {0, 1}. A human expert generates demonstrations following a behavior policy such that P(X = 1 | Z = 0) = 0.6 and P(X = 0 | Z = 1) = 0.4. Evaluating the expert's performance gives E[Y] = P(X = 1, Z = 0) + P(X = 0, Z = 1) = 0.5.

Figure 1: Causal diagrams (a)-(d) where X represents an action (shaded red) and Y represents a latent reward (shaded blue). Input covariates of the policy scope S are shaded in light red.

Now we apply standard IRL algorithms to learn a policy X ← π(Z) so that the imitator's driving performance, denoted by E[Y | do(π)], is at least as good as the expert's performance E[Y]. Detailed derivations of the IRL policy are shown in (Ruan et al., 2023, Appendix A). Note that E[Y | z, x] = x + z − 2xz belongs to a family of reward functions f_Y(x, z) = αx + βz − γxz, where 0 < α < γ. A typical IRL imitator solves a minimax problem min_π max_{f_Y} E[f_Y(X, Z)] − E[f_Y(X, Z) | do(π)]. The inner step "guesses" a reward function being optimized by the expert, while the outer step learns a policy maximizing the learned reward function.
Applying these steps leads to a policy π*: X ← ¬Z with the expected reward E[Y | do(π*)] = 1, which outperforms the sub-optimal expert.

Despite the performance guarantees provided by existing imitation methods, both BC and IRL rely on the assumption that the expert's input observations match those available to the imitator. More recently, an emerging line of research under the rubric of causal imitation learning augments the imitation paradigm to account for environments consisting of arbitrary causal mechanisms and the aforementioned mismatch between the expert's and the imitator's sensory capabilities (de Haan et al., 2019; Zhang et al., 2020; Etesami & Geiger, 2020; Kumor et al., 2021). Closest to our work, Zhang et al. (2020) and Kumor et al. (2021) derived graphical criteria that completely characterize when and how BC could lead to successful imitation even when the agents perceive reality differently. Still, it is unclear how to perform IRL-type training if some of the expert's observed states remain latent to the imitator, which leads to the presence of unobserved confounders (UCs) in the expert's demonstrations. Perhaps surprisingly, naively applying IRL methods when UCs are present does not necessarily lead to satisfactory performance, even when the expert itself behaves optimally.

To see this, we now modify the previous highway driving scenario to demonstrate the challenges of UCs. In reality, covariates Z (i.e., velocities and locations) are also affected by the car horn U_1 of surrounding vehicles and the wind condition U_2. However, due to the different perspective of drones (recording from the top), such critical information (i.e., U_1, U_2) is not recorded by the camera and thus remains unobserved. Fig. 1b graphically describes this modified learning setting. More specifically, consider an instance where Z ← U_1 ⊕ U_2 and Y ← ¬X ⊕ Z ⊕ U_2, where ⊕ is the exclusive-or operator, and values of U_1 and U_2 are drawn uniformly over {0, 1}.
An expert driver, being able to hear the car horn U_1, follows a behavior policy X ← U_1 and achieves the optimal performance E[Y] = 1. Meanwhile, observe that E[Y | z, x] = 1 belongs to a family of reward functions f_Y(x, z) = α (where α > 0). Solving min_π max_{f_Y} E[f_Y(X, Z)] − E[f_Y(X, Z) | do(π)] leads to an IRL policy π* with expected reward E[Y | do(π*)] = 0.5, which is far from the expert's optimal performance E[Y] = 1.

A question that naturally arises is: under what conditions can an IRL procedure perform well when UCs are present and there is a mismatch between the perception of the two agents? In this paper, we answer this question and, more broadly, investigate the challenge of performing IRL through causal lenses. In particular, our contributions are summarized as follows. (1) We provide a novel, causal formulation of the inverse reinforcement learning problem. This formulation allows one to formally study and understand the conditions under which an IRL policy is learnable, including in settings where UCs cannot be ruled out a priori. (2) We derive a new graphical condition for deciding whether an imitating policy can be computed from the available data and knowledge, which provides a robust generalization of current IRL algorithms to non-Markovian settings, including GAIL (Ho & Ermon, 2016) and MWAL (Syed & Schapire, 2008). (3) Finally, we move beyond this graphical condition and develop an effective IRL algorithm for structural causal models (Pearl, 2000) with arbitrary causal relationships. Due to space constraints, all proofs are provided in (Ruan et al., 2023, Appendix B). For a more detailed survey on imitation learning and causal inference, we refer readers to (Ruan et al., 2023, Appendix E).
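The two highway-driving instances above are small enough to check by exact enumeration. The following is a minimal sketch, assuming the structural equations exactly as stated in the text; the function names `instance1_reward` and `instance2_reward` are illustrative and not taken from the paper.

```python
from itertools import product

def instance1_reward(p_x1_given_z):
    """E[Y] in the first instance: Z ~ Uniform{0,1}, Y = (1-X)Z + X(1-Z).

    p_x1_given_z[z] is the probability of choosing X = 1 when Z = z.
    """
    total = 0.0
    for z, x in product((0, 1), repeat=2):
        p_x = p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]
        total += 0.5 * p_x * ((1 - x) * z + x * (1 - z))
    return total

def instance2_reward(choose_x):
    """E[Y] in the second instance: Z = U1 xor U2, Y = (not X) xor Z xor U2.

    choose_x(u1, z) returns the action; the imitator may only use z.
    """
    total = 0.0
    for u1, u2 in product((0, 1), repeat=2):
        z = u1 ^ u2
        x = choose_x(u1, z)
        total += 0.25 * ((1 - x) ^ z ^ u2)
    return total

# Instance 1 (no confounding): sub-optimal expert vs. the policy X <- not Z.
expert1 = instance1_reward({0: 0.6, 1: 0.6})
negate_z = instance1_reward({0: 1.0, 1: 0.0})

# Instance 2 (confounded): the expert sees U1, the imitator only sees Z.
expert2 = instance2_reward(lambda u1, z: u1)
imitators2 = [
    instance2_reward(lambda u1, z, a=a, b=b: a if z == 0 else b)
    for a, b in product((0, 1), repeat=2)  # all deterministic policies pi(Z)
]
```

In the second instance every deterministic policy over Z attains the same expected reward, and randomized policies are mixtures of these, so no policy restricted to Z can recover the expert's performance there.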

1.1. PRELIMINARIES

We use capital letters to denote random variables (X) and small letters for their values (x). D_X represents the domain of X and P_X the space of probability distributions over D_X. For a set X, let |X| denote its dimension. The probability distribution over variables X is denoted by P(X). Similarly, P(Y | X) represents a set of conditional distributions P(Y | X = x) for all realizations x. We use the abbreviation P(x) for the probability P(X = x); similarly, P(y | x) abbreviates P(Y = y | X = x). Finally, the indicator function 1{Z = z} returns 1 if Z = z holds true and 0 otherwise.

The basic semantic framework of our analysis rests on structural causal models (SCMs) (Pearl, 2000, Ch. 7). An SCM M is a tuple ⟨U, V, F, P(U)⟩ with V the set of endogenous variables and U the set of exogenous variables. F is a set of structural functions such that for each f_V ∈ F, V ← f_V(pa_V, u_V), with PA_V ⊆ V and U_V ⊆ U. Values of U are drawn from an exogenous distribution P(U), inducing a distribution P(V) over the endogenous variables V. Since the learner can observe only a subset of endogenous variables, we split V into a partition O ∪ L where variables O ⊆ V are observed and L = V \ O remain latent to the learner. The marginal distribution P(O) is thus referred to as the observational distribution.

An atomic intervention on a subset X ⊆ V, denoted by do(x), is an operation where values of X are set to constants x, replacing the functions f_X = {f_X : ∀X ∈ X} that would normally determine their values. For an SCM M, let M_x be the submodel of M induced by intervention do(x). For a set Y ⊆ V, the interventional distribution P(Y | do(x)) induced by do(x) is defined as the distribution over Y in the submodel M_x, i.e., P_M(Y | do(x)) ≜ P_{M_x}(Y). We leave M implicit when it is obvious from the context.

Each SCM M is associated with a causal diagram G, which is a directed acyclic graph where (e.g., see Fig. 1) solid nodes represent observed variables O, dashed nodes represent latent variables L, and arrows represent the arguments PA_V of each function f_V ∈ F. Exogenous variables U are not explicitly shown; a bi-directed arrow between nodes V_i and V_j indicates the presence of an unobserved confounder (UC) affecting both V_i and V_j. We will use family abbreviations to represent graphical relationships such as parents, children, descendants, and ancestors. For example, the set of parent nodes of X in G is denoted by pa(X)_G = ∪_{X∈X} pa(X)_G; ch, de and an are similarly defined. Capitalized versions Pa, Ch, De, An include the argument as well, e.g., Pa(X)_G = pa(X)_G ∪ X. For a subset X ⊆ V, the subgraph obtained from G with edges outgoing from X (respectively, incoming into X) removed is written as $G_{\underline{X}}$ (respectively, $G_{\overline{X}}$). G[X] is the subgraph of G containing only nodes X and edges among them. A path from a node X to a node Y in G is a sequence of edges that does not include a particular node more than once. Two sets of nodes X, Y are said to be d-separated by a third set Z in a DAG G, denoted by (X ⊥⊥ Y | Z)_G, if every path from nodes in X to nodes in Y is "blocked" by nodes in Z. The criterion of blockage follows (Pearl, 2000, Def. 1.2.3). For a more detailed survey on SCMs, we refer readers to (Pearl, 2000; Bareinboim et al., 2022).
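To make the difference between conditioning and an atomic intervention concrete, the sketch below enumerates a toy SCM with binary variables. The helper `distribution` and the toy model (X ← U, Y ← X ⊕ U) are illustrative assumptions, not objects defined in the paper; the submodel M_x is obtained simply by overriding the structural function of X with a constant.

```python
from itertools import product

def distribution(functions, order, p_u, do=None):
    """Joint distribution over endogenous variables by exact enumeration.

    functions: name -> f(v, u), the structural function of each variable
    order:     endogenous variables in a topological order of the diagram
    p_u:       exogenous name -> P(U = 1), independent binary variables
    do:        optional name -> constant, an atomic intervention do(x)
    """
    do = do or {}
    names = sorted(p_u)
    dist = {}
    for bits in product((0, 1), repeat=len(names)):
        u = dict(zip(names, bits))
        weight = 1.0
        for name in names:
            weight *= p_u[name] if u[name] == 1 else 1.0 - p_u[name]
        v = {}
        for var in order:
            # In the submodel M_x the function of an intervened variable is replaced.
            v[var] = do[var] if var in do else functions[var](v, u)
        key = tuple(v[var] for var in order)
        dist[key] = dist.get(key, 0.0) + weight
    return dist

# Toy SCM (an assumption for illustration): U confounds X and Y.
functions = {"X": lambda v, u: u["U"], "Y": lambda v, u: v["X"] ^ u["U"]}
order = ["X", "Y"]
p_u = {"U": 0.5}

observational = distribution(functions, order, p_u)
interventional = distribution(functions, order, p_u, do={"X": 1})

p_x1 = sum(p for (x, _), p in observational.items() if x == 1)
p_y1_given_x1 = observational.get((1, 1), 0.0) / p_x1   # P(Y = 1 | X = 1)
p_y1_do_x1 = interventional.get((1, 1), 0.0)            # P(Y = 1 | do(X = 1))
```

The two quantities differ in this toy model because the confounder U affects both X and Y, which is the situation the bi-directed arrows in the causal diagrams encode.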

2. CAUSAL INVERSE REINFORCEMENT LEARNING

We investigate the sequential decision-making setting concerning a set of actions X, a series of covariates Z, and a latent reward Y in an SCM M. An expert (e.g., a physician or a driver), operating in SCM M, selects actions following a behavior policy, which is the collection of structural functions f_X = {f_X | X ∈ X}. The expert's performance is evaluated as the expected reward E[Y]. On the other hand, a learning agent (i.e., the imitator) intervenes on actions X following an ordering X_1 ≺ ··· ≺ X_n; each action X_i is associated with a set of features PA*_i ⊆ O \ {X_i}. A policy π over actions X is a sequence of decision rules π = {π_1, ..., π_n}. Each decision rule π_i(X_i | Z_i) is a probability distribution over an action X_i ∈ X, conditioning on values of a set of covariates Z_i ⊆ PA*_i. Such policies π are also referred to as dynamic treatment regimes (Murphy et al., 2001; Chakraborty & Murphy, 2014), which generalize personalized medicine to time-varying treatment settings in healthcare, in which treatment is repeatedly tailored to a patient's dynamic state.

A policy intervention on actions X following a policy π, denoted by do(π), entails a submodel M_π from an SCM M where the structural functions f_X associated with X (i.e., the expert's behavior policy) are replaced with decision rules X_i ∼ π_i(X_i | Z_i) for every X_i ∈ X. A critical assumption throughout this paper is that submodel M_π does not contain any cycles. Similarly, the interventional distribution P(V | do(π)) induced by policy π is defined as the joint distribution over V in M_π.

Throughout this paper, detailed parametrizations of the underlying SCM M are assumed to be unknown to the agent. Instead, the agent has access to the following input: (1) a causal diagram G associated with M, and (2) the expert's demonstrations, summarized as the observational distribution P(O). The goal of the agent is to output an imitating policy π* that achieves the expert's performance.

Definition 1.
For an SCM M = ⟨U, V, F, P(U)⟩, an imitating policy π* is a policy such that its expected reward is lower bounded by the expert's reward, i.e., E_M[Y | do(π*)] ≥ E_M[Y].

In words, the right-hand side is the expert's performance that the agent wants to achieve, while the left-hand side is the real reward experienced by the agent. The challenge in imitation learning arises from the fact that the reward Y is not specified and latent, i.e., Y ∉ O. This precludes approaches that identify E[Y | do(π)] directly from the demonstration data (e.g., through the do- or soft-do-calculus (Pearl, 2000; Correa & Bareinboim, 2020)).

There exist methods in the literature for finding an imitating policy in Def. 1. Before describing their details, we first introduce some necessary concepts. For any policy π, we summarize its associated state-action domain using a sequence of pairs of variables called a policy scope S.

Definition 2 (Lee & Bareinboim (2020)). For an SCM M, a policy scope S (for short, scope) over actions X is a sequence of tuples {⟨X_i, Z_i⟩}_{i=1}^n where Z_i ⊆ PA*_i for every X_i ∈ X.

We will consistently use π ∼ S to denote a policy π associated with scope S. For example, consider a policy scope S = {⟨X_1, {Z_1}⟩, ⟨X_2, {Z_2}⟩} over actions X_1, X_2 in Fig. 1c. A policy π ∼ S is a sequence of distributions π = {π_1(X_1 | Z_1), π_2(X_2 | Z_2)}.

For a policy scope S = {⟨X_i, Z_i⟩}_{i=1}^n, let G^(i), i = 1, ..., n, denote a manipulated graph obtained from G by the following steps: for all j = i + 1, ..., n, (1) remove arrows coming into every action X_j; and (2) add direct arrows from nodes in Z_j to X_j. Formally, the sequential π-backdoor criterion is defined as:

Definition 3 (Kumor et al. (2021)). Given a causal diagram G, a policy scope S = {⟨X_i, Z_i⟩}_{i=1}^n is said to satisfy the sequential π-backdoor criterion in G (for short, π-backdoor admissible) if at each X_i ∈ X, one of the following conditions holds: (1) X_i is not an ancestor of Y in G^(i), i.e., X_i ∉ An(Y)_{G^(i)}; or (2) Z_i blocks all backdoor paths from X_i to Y in G^(i), i.e., (Y ⊥⊥ X_i | Z_i) in $G^{(i)}_{\underline{X_i}}$.

Kumor et al. (2021) showed that whenever a π-backdoor admissible scope S is available, one could learn an imitating policy π* ∼ S by setting π*_i(x_i | z_i) = P(x_i | z_i) for every action X_i ∈ X. For instance, consider the causal diagram G in Fig. 1c. Scope S = {⟨X_1, {Z_1}⟩, ⟨X_2, {Z_2}⟩} is π-backdoor admissible since (X_1 ⊥⊥ Y | Z_1) and (X_2 ⊥⊥ Y | Z_2) hold in G, which is a super graph containing both manipulated G^(1) and G^(2). An imitating policy π* = {π*_1, π*_2} is thus obtainable by setting π*_1(X_1 | Z_1) = P(X_1 | Z_1) and π*_2(X_2 | Z_2) = P(X_2 | Z_2).

While impressive, a caveat of their results is that the performance of the imitator is restricted by that of the expert, i.e., E[Y | do(π*)] = E[Y]. In other words, causal BC provides an efficient way to mimic the expert's performance. If the expert's behavior is far from optimal, the same will hold for the learning agent.
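The construction of the manipulated graph G^(i) used by the sequential π-backdoor criterion can be sketched in a few lines. The edge-set representation, the function name `manipulated_graph`, and the two-stage diagram below are assumptions made for illustration (the diagram is hypothetical, not a transcription of Fig. 1c); bi-directed arrows touching a later action are treated as arrows coming into it.

```python
def manipulated_graph(directed, bidirected, scope, i):
    """Edges of G^(i) for scope = [(X_1, Z_1), ..., (X_n, Z_n)], with 1-indexed i."""
    directed = set(directed)
    bidirected = {frozenset(edge) for edge in bidirected}
    for action, covariates in scope[i:]:  # actions X_{i+1}, ..., X_n
        # (1) remove every arrow coming into the action, directed or bi-directed
        directed = {(u, v) for (u, v) in directed if v != action}
        bidirected = {edge for edge in bidirected if action not in edge}
        # (2) add direct arrows from the action's covariates to the action
        directed |= {(z, action) for z in covariates}
    return directed, bidirected

# Hypothetical two-stage diagram, chosen only to exercise the function.
directed = {("Z1", "X1"), ("X1", "Z2"), ("Z2", "X2"), ("Z2", "Y"), ("X2", "Y")}
bidirected = {("Z1", "Y"), ("Z2", "X2")}
scope = [("X1", {"Z1"}), ("X2", {"Z2"})]

g1 = manipulated_graph(directed, bidirected, scope, 1)  # X2 is manipulated
g2 = manipulated_graph(directed, bidirected, scope, 2)  # nothing follows X2
```

In this hypothetical diagram, G^(2) coincides with G, while G^(1) differs from G only by the removed bi-directed arrow between Z2 and X2. Checking the d-separation conditions of Def. 3 on these graphs would require a separate d-separation routine, which is omitted here.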

2.1. MINIMAL SEQUENTIAL BACKDOOR CRITERION

To circumvent this issue, we take a somewhat different approach to causal imitation by incorporating the principle of inverse reinforcement learning (IRL). Following the game-theoretic approach (Syed & Schapire, 2008), we formulate the problem as learning to play a two-player zero-sum game in which the agent chooses a policy and nature chooses an SCM instance. A key property of this algorithm is that it allows us to incorporate prior parametric knowledge about the latent reward signal. When such knowledge is informative, our algorithm is able to obtain a policy that could significantly outperform the expert with respect to the unknown causal environment, while at the same time being guaranteed to be no worse.

Formally, let ℳ = {∀M | G_M = G, P_M(O) = P(O)} denote the set of SCMs compatible with both the causal diagram G and the observational distribution P(O). Fix a policy scope S. Now consider the optimization problem defined as follows:

$$\nu^* = \min_{\pi \sim S} \max_{M \in \mathcal{M}} E_M[Y] - E_M[Y \mid \mathrm{do}(\pi)]. \qquad (1)$$

The inner maximization in the above equation can be viewed as a causal IRL step where we attempt to "guess" a worst-case SCM M compatible with G and P(O) that prioritizes the expert's policy. That is, the gap in performance between the expert's and the imitator's policies is maximized. Meanwhile, since the expert's reward E_M[Y] is not affected by the imitator's policy π, the outer minimization is equivalent to a planning step that finds a policy π* optimizing the learned SCM M. Obviously, the solution π* is an imitating policy if gap ν* = 0. In cases where the expert is sub-optimal, i.e., E_M[Y] < E_M[Y | do(π)] for some policies π, we may have ν* < 0. That is, the policy π* will dominate the expert's policy f_X regardless of parametrizations of SCM M in the worst-case scenario. In other words, π* to some extent ignores the sub-optimal expert, and instead exploits prior knowledge about the underlying model.
Despite the clear semantics in terms of causal models, the optimization problem in Eq. (1) requires the learner to search over all possible SCMs compatible with the causal diagram G and observational distribution P(O). In principle, this entails a quite challenging search since one has access neither to the parametric forms of the underlying structural functions F nor to the exogenous distribution P(U). It is not clear how existing optimization procedures can be used. In this paper, we develop novel methods to circumvent this issue, thus leading to effective imitating policies.

Our first algorithm relies on a refinement of the sequential π-backdoor, based on the concept of minimality. A subscope S′ of a policy scope S = {⟨X_i, Z_i⟩}_{i=1}^n, denoted by S′ ⊆ S, is a sequence {⟨X_i, Z′_i⟩}_{i=1}^n where Z′_i ⊆ Z_i for every X_i ∈ X. A proper subscope S′ ⊂ S is a subscope of S other than S itself. The minimal π-backdoor admissible scope is defined as follows.

Definition 4. Given a causal diagram G, a π-backdoor admissible scope S is said to be minimal if there exists no proper subscope S′ ⊂ S satisfying the sequential π-backdoor in G.

Theorem 1. Given a causal diagram G, if there exists a minimal π-backdoor admissible scope S = {⟨X_i, Z_i⟩}_{i=1}^n in G, consider the following conditions:
1. Let effective actions X* = X ∩ An(Y)_{G_S} and effective covariates Z* = ∪_{X_i ∈ X*} Z_i;
2. For i = 1, ..., n + 1, let X*_{<i} = {∀X_j ∈ X* | j < i} and Z*_{<i} = ∪_{X_j ∈ X*_{<i}} Z_j.
Then, for any policy π ∼ S, the expected reward E[Y | do(π)] is computable from P(O, Y) as:

$$E[Y \mid \mathrm{do}(\pi)] = \sum_{x^*, z^*} E[Y \mid x^*, z^*] \, \rho_\pi(x^*, z^*) \qquad (2)$$

where the occupancy measure $\rho_\pi(x^*, z^*) = \prod_{X_i \in X^*} P(z_i \mid x^*_{<i}, z^*_{<i}) \, \pi_i(x_i \mid z_i)$.

To illustrate, consider again the causal diagram G in Fig. 1c; the manipulated diagram G^(2) = G and G^(1) is obtained from G by removing Z_2 ↔ X_2.
While scope S_1 = {⟨X_1, {Z_1}⟩, ⟨X_2, {Z_2}⟩} satisfies the sequential π-backdoor, it is not minimal since (X_1 ⊥⊥ Y) holds in $G^{(1)}_{\underline{X_1}}$. On the other hand, S_2 = {⟨X_1, ∅⟩, ⟨X_2, {Z_2}⟩} is minimal π-backdoor admissible since (X_2 ⊥⊥ Y | Z_2) holds true in $G^{(2)}_{\underline{X_2}}$, and the covariate set {Z_2} is minimal due to the presence of the backdoor path X_2 ← Z_2 → Y.

Let us focus on the minimal π-backdoor admissible scope S_2. Note that G_{S_2} is a subgraph obtained from G by removing the bi-directed arrow Z_2 ↔ X_2. We must have effective actions X* = {X_1, X_2} and effective covariates Z* = {Z_2}. Therefore, Z*_{<1} = Z*_{<2} = ∅ and Z*_{<3} = {Z_2}. For any policy π ∼ S_2, Thm. 1 implies

$$E[Y \mid \mathrm{do}(\pi)] = \sum_{x_1, x_2, z_2} E[Y \mid x_1, x_2, z_2] \, P(z_2 \mid x_1) \, \pi_2(x_2 \mid z_2) \, \pi_1(x_1).$$

On the other hand, the same result in Thm. 1 does not necessarily hold for a non-minimal π-backdoor admissible scope. For instance, consider again the non-minimal scope S_1 = {⟨X_1, {Z_1}⟩, ⟨X_2, {Z_2}⟩}. The expected reward E[Y | do(π)] of a policy π ∼ S_1 is not computable from Eq. (2), and is ultimately not identifiable from distribution P(O, Y) in G (Tian, 2008).
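The identification formula of Thm. 1 for the scope S_2 = {⟨X_1, ∅⟩, ⟨X_2, {Z_2}⟩} can be evaluated numerically once its ingredients are tabulated. The sketch below assumes binary variables and uses made-up tables for P(z_2 | x_1), the policy, and E[Y | x_1, x_2, z_2]; none of these numbers come from the paper.

```python
import numpy as np

# Hypothetical tables for binary X1, X2, Z2 (illustrative values only).
p_z2_given_x1 = np.array([[0.7, 0.3], [0.2, 0.8]])   # indexed [x1, z2]
pi1 = np.array([0.4, 0.6])                           # pi_1(x1), no covariates
pi2 = np.array([[0.9, 0.1], [0.25, 0.75]])           # pi_2(x2 | z2), indexed [z2, x2]

# Hypothetical reward table E[Y | x1, x2, z2] = 1 if x2 == z2 else 0.
reward = np.zeros((2, 2, 2))                         # indexed [x1, x2, z2]
reward[:, 0, 0] = 1.0
reward[:, 1, 1] = 1.0

def occupancy(pi1, pi2, p_z2_given_x1):
    """rho_pi(x1, x2, z2) = P(z2 | x1) * pi_2(x2 | z2) * pi_1(x1)."""
    return np.einsum("a,ac,cb->abc", pi1, p_z2_given_x1, pi2)

def expected_reward(reward, rho):
    """Sum over (x1, x2, z2) of E[Y | x1, x2, z2] * rho_pi(x1, x2, z2)."""
    return float(np.sum(reward * rho))

rho_pi = occupancy(pi1, pi2, p_z2_given_x1)
value = expected_reward(reward, rho_pi)
```

The occupancy measure is a proper distribution over (x_1, x_2, z_2), so the expected reward is a weighted average of the tabulated conditional rewards. In practice the conditional distribution P(z_2 | x_1) would be estimated from the expert's demonstrations rather than specified by hand.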

2.2. IMITATION VIA INVERSE REINFORCEMENT LEARNING

Once a minimal π-backdoor admissible scope S is found, there exist effective procedures to solve for an imitating policy in Eq. (1). Let R be a hypothesis class containing all expected rewards E_M[Y | x*, z*] compatible with candidate SCMs M ∈ ℳ, i.e., R = {E_M[Y | x*, z*] | ∀M ∈ ℳ}. Applying the identification formula in Thm. 1 reduces the optimization problem in Eq. (1) as follows:

$$\nu^* = \min_{\pi \sim S} \max_{r \in \mathcal{R}} \sum_{x^*, z^*} r(x^*, z^*) \left( \rho(x^*, z^*) - \rho_\pi(x^*, z^*) \right) \qquad (3)$$

where the expert's occupancy measure ρ(x*, z*) = P(x*, z*) and the agent's occupancy measure ρ_π(x*, z*) is given by Eq. (2). The above minimax problem is solvable using standard IRL algorithms. The identification result in Thm. 1 ensures that the learned policy applies to any SCM compatible with the causal diagram and the observational data, and is thus robust to the unobserved confounding bias in the expert's demonstrations. Henceforth, we will consistently refer to Eq. (3) as the canonical equation of causal IRL. In this paper, we solve for an imitating policy π* in Eq. (3) using state-of-the-art IRL algorithms, provided with common choices of parametric reward functions. These algorithms include the multiplicative-weights algorithm (MWAL) (Syed & Schapire, 2008) and generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016). We refer readers to Algs. 3 and 4 in (Ruan et al., 2023, Appendix C) for more discussions on the pseudo-code and implementation details.

Causal MWAL. Abbeel & Ng (2004) and Syed & Schapire (2008) study IRL in Markov decision processes where the reward function r(x*, z*) is a linear combination of k-length feature vectors ϕ(x*, z*). Particularly, let r(x*, z*) = w · ϕ(x*, z*) for a coefficient vector w contained in a convex set S_k = {w ∈ ℝ^k | ‖w‖_1 = 1 and w ⪰ 0}. Let ϕ^(i) be the i-th component of feature vector ϕ and let deterministic policies with scope S be ordered by π^(1), ..., π^(n).
The canonical equation in Eq. (3) is reducible to a two-person zero-sum matrix game under linearity.

Proposition 1. For a hypothesis class R = {r = w · ϕ | w ∈ S_k}, the solution ν* of the canonical equation in Eq. (3) is obtainable by solving the following minimax problem:

$$\nu^* = \min_{\pi \sim S} \max_{w \in S_k} w^\top G \pi, \qquad (4)$$

where G is a k × n matrix given by $G(i, j) = \sum_{x^*, z^*} \phi^{(i)}(x^*, z^*) \left( \rho(x^*, z^*) - \rho_{\pi^{(j)}}(x^*, z^*) \right)$.

There exist effective multiplicative weights algorithms for solving the matrix game in Eq. (4), including MW (Freund & Schapire, 1999) and MWAL (Syed & Schapire, 2008).

Causal GAIL. Ho & Ermon (2016) introduce the GAIL algorithm for learning an imitating policy in Markov decision processes with a general family of non-linear reward functions. In particular, r(x*, z*) takes values in the real space ℝ, i.e., r ∈ ℝ^{X*×Z*} where ℝ^{X*×Z*} = {r : D_{X*} × D_{Z*} → ℝ}. The complexity of the reward function r is penalized by a convex regularization function ψ(r), i.e.,

$$\nu^* = \min_{\pi \sim S} \max_{r \in \mathbb{R}^{X^* \times Z^*}} \sum_{x^*, z^*} r(x^*, z^*) \left( \rho(x^*, z^*) - \rho_\pi(x^*, z^*) \right) - \psi(r). \qquad (5)$$

Henceforth, we will consistently refer to Eq. (5) as the penalized canonical equation of causal IRL. It is often preferable to solve its conjugate form. Formally,

Proposition 2. For a hypothesis class R = {r : D_{X*} × D_{Z*} → ℝ} regularized by ψ, the solution ν* of the penalized canonical equation in Eq. (5) is obtainable by solving the following problem:

$$\nu^* = \min_{\pi \sim S} \psi^*(\rho - \rho_\pi) \qquad (6)$$

where ψ* is the conjugate function of ψ, given by $\psi^*(a) = \max_{r \in \mathbb{R}^{X^* \times Z^*}} a^\top r - \psi(r)$.

Eq. (6) seeks a policy π which minimizes the divergence of the occupancy measures between the imitator and the expert, as measured by the function ψ*. The computational framework of generative adversarial networks (Goodfellow et al., 2014) provides an effective approach to solve such a matching problem, e.g., the GAIL algorithm (Ho & Ermon, 2016).
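A matrix game of the form in Eq. (4) can be approximated with a generic multiplicative-weights (Hedge) scheme. The sketch below is not MWAL itself and is not the paper's implementation: it is an illustrative variant in which the policy player keeps exponential weights over the n columns while the reward player best-responds with a vertex of the simplex S_k. The function name `solve_matrix_game`, the step size, and the test matrix are assumptions.

```python
import numpy as np

def solve_matrix_game(G, rounds=2000):
    """Approximate min over mixtures p of max_i (G p)_i for a k x n matrix G.

    Returns the averaged mixture over columns and its worst-case value.
    """
    G = np.asarray(G, dtype=float)
    k, n = G.shape
    span = max(G.max() - G.min(), 1e-12)          # range of the payoffs
    eta = np.sqrt(8.0 * np.log(n) / rounds) / span
    log_weights = np.zeros(n)
    average = np.zeros(n)
    for _ in range(rounds):
        p = np.exp(log_weights - log_weights.max())
        p /= p.sum()
        average += p / rounds
        # The maximum of w^T G p over the simplex is attained at a vertex of S_k.
        worst_row = int(np.argmax(G @ p))
        # Columns that do badly against the current worst-case reward lose weight.
        log_weights -= eta * G[worst_row]
    return average, float(np.max(G @ average))

# Hypothetical 2 x 2 game with value 0 (matching pennies), for illustration only.
mixture, gap = solve_matrix_game([[1.0, -1.0], [-1.0, 1.0]])
```

Here a column of G plays the role of a deterministic policy π^(j) and the returned mixture is a randomized policy over them. MWAL, by contrast, maintains the weights on the reward coefficients w and obtains the policy by planning.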

3. CAUSAL IMITATION WITHOUT SEQUENTIAL BACKDOOR

In this section, we investigate causal IRL beyond the condition of minimal sequential π-backdoor. [...] (Pearl, 2000, Def. 3.2.4).

Definition 5 (Identifiability). Given a causal diagram G and a policy π ∼ S, the expected reward E[Y | do(π)] is said to be identifiable from distribution P(O, Y) in G if E[Y | do(π)] [...]

$$E[Y \mid \mathrm{do}(\pi)] = \sum_{x^*, z^*} E[Y \mid x^*, z^*] \, \rho_\pi(x^*, z^*) \qquad (7)$$

where subsets X* ⊆ X, Z* ⊆ O \ X; and the imitator's occupancy measure ρ_π(x*, z*) is a function of the observational distribution P(O) and policy π.

Figure 2: Frontdoor. (a) G; (b) G_S.

Thm. 2 suggests a general procedure to learn an imitating policy via causal IRL. Whenever an identifiable scope S is found, the identification formula in Eq. (7) permits one to reduce the optimization problem in Eq. (1) to the canonical equation in Eq. (3). One could thus obtain an imitating policy π ∼ S by solving Eq. (3) where the expert's occupancy measure ρ(x*, z*) = P(x*, z*) and the imitator's occupancy measure ρ_π(x*, z*) is given by Eq. (7).

As an example, consider the frontdoor diagram described in Fig. 2a and a policy scope S = {⟨X, ∅⟩}. The expected reward E[Y | do(π)] = Σ_{x′} E[Y | do(x′)] π(x′), and E[Y | do(x′)] is identifiable from P(X, Y, Z) using the frontdoor adjustment formula (Pearl, 2000, Thm. 3.3.4). The expected reward E[Y | do(π)] of any policy π(X) could be written as:

$$E[Y \mid \mathrm{do}(\pi)] = \sum_{z, x} E[Y \mid x, z] \, P(x) \sum_{x'} P(z \mid x') \, \pi(x'). \qquad (8)$$

Let occupancy measures ρ(x, z) = P(x, z) and ρ_π(x, z) = P(x) Σ_{x′} P(z | x′) π(x′). We could thus learn an imitating policy in the frontdoor diagram by solving the canonical equation given by:

$$\nu^* = \min_{\pi \sim S} \max_{r \in \mathcal{R}} \sum_{x, z} r(x, z) \left( \rho(x, z) - \rho_\pi(x, z) \right),$$

where R is a hypothesis class of the reward function r(x, z) ≜ E[Y | x, z]. The solution π*(X) is an imitating policy performing at least as well as the expert's behavior policy if the gap ν* ≤ 0.

Next, we will describe how to obtain the identification formula in Eq. (7) provided with an identifiable scope S. Without loss of generality, we will assume that the reward Y is the only endogenous variable that is latent in the causal diagram G, i.e., V = O ∪ {Y}.* We will utilize a special type of clustering of nodes in the causal diagram G, called the confounded component (for short, c-component).

Definition 6 (C-component (Tian & Pearl, 2002)). For a causal diagram G, a subset C ⊆ V is a c-component if any pair V_i, V_j ∈ C is connected by a bi-directed path in G.

For instance, the frontdoor diagram in Fig. 2a contains two c-components C_1 = {X, Y} and C_2 = {Z}. We will utilize a sound and complete procedure IDENTIFY (Tian, 2002; 2008; Zhang et al., 2020, Appendix B). Recall that G_S is the causal diagram of submodel M_π induced by policy π ∼ S. Fig. 2b shows the diagram G_S obtained from the frontdoor graph G and scope S = {⟨X, ∅⟩} described in Fig. 2a. Let Z_Y = An(Y)_{G_S} be the ancestors of Y in G_S. Our next result shows that IDENTIFY(G, Y, S) is ensured to find an identification formula of the form in Eq. (7) when it is identifiable.

Lemma 1. Given a causal diagram G, a policy scope S is identifiable from P(O, Y) in G if and only if IDENTIFY(G, Y, S) ≠ "FAIL". Moreover, IDENTIFY(G, Y, S) returns an identification formula of the form in Eq. (7) where X* = Pa(C_Y) ∩ X and Z* = Pa(C_Y) \ ({Y} ∪ X); and C_Y is the c-component containing reward Y in the subgraph G[An(Z_Y)].

For example, for the frontdoor diagram G in Fig. 2a, the manipulated diagram G_S with scope S = {⟨X, ∅⟩} is described in Fig. 2b. Since Z_Y = An(Y)_{G_S} = {X, Z, Y}, C_Y is thus given by {X, Y}. Lem. 1 implies that X* = Pa({X, Y}) ∩ {X} = {X} and Z* = Pa({X, Y}) \ {X, Y} = {Z}. Applying IDENTIFY(G, Y, {⟨X, ∅⟩}) returns the frontdoor adjustment formula in Eq. (8).
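The frontdoor occupancy measure ρ_π(x, z) = P(x) Σ_{x′} P(z | x′) π(x′) can be sanity-checked against a fully specified SCM. The sketch below assumes a hypothetical binary frontdoor model (latent U → X, X → Z, and Z, U → Y) with made-up parameters; the formula only consumes quantities of the observational distribution P(X, Z, Y), while the ground truth is computed inside the model using the latent U.

```python
import numpy as np

# Hypothetical frontdoor SCM with binary variables (illustrative parameters).
p_u = np.array([0.5, 0.5])                          # P(u), latent confounder
p_x_given_u = np.array([[0.8, 0.2], [0.2, 0.8]])    # indexed [u, x]
p_z_given_x = np.array([[0.9, 0.1], [0.1, 0.9]])    # indexed [x, z]
y_mean = np.array([[0.1, 0.5], [0.6, 0.9]])         # E[Y | z, u], indexed [z, u]

# Observational quantities, i.e. functionals of P(X, Z, Y) only.
p_ux = p_u[:, None] * p_x_given_u                   # P(u, x)
p_x = p_ux.sum(axis=0)                              # P(x)
# E[Y | x, z] = sum_u P(u | x) E[Y | z, u], since Z is independent of U given X here.
reward_xz = np.einsum("ux,zu->xz", p_ux / p_x, y_mean)

def true_reward(pi):
    """Ground-truth E[Y | do(pi)], computed inside the SCM with the latent U."""
    p_z = pi @ p_z_given_x                          # P(z | do(pi))
    return float(np.einsum("z,u,zu->", p_z, p_u, y_mean))

def frontdoor_reward(pi):
    """E[Y | do(pi)] via rho_pi(x, z) = P(x) * sum_x' P(z | x') pi(x')."""
    rho_pi = p_x[:, None] * (pi @ p_z_given_x)[None, :]
    return float(np.sum(reward_xz * rho_pi))

expert_reward = float(np.sum(reward_xz * p_x[:, None] * p_z_given_x))  # E[Y]
policies = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.3, 0.7])]
gaps = [abs(frontdoor_reward(pi) - true_reward(pi)) for pi in policies]
```

For each candidate policy the two computations agree, which is what makes the canonical equation solvable from demonstrations alone in this diagram. Comparing `frontdoor_reward(pi)` with `expert_reward` then indicates whether a given policy is imitating.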

3.1. SEARCHING FOR IDENTIFIABLE POLICY SCOPES

The remainder of this section describes an effective algorithm to find identifiable policy scopes S had the latent reward signal Y been observed. Let 𝕊 denote the collection of all policy scopes S that are identifiable from distribution P(O, Y) in the causal diagram G. Our algorithm LISTIDSCOPE, described in Alg. 1, enumerates elements in 𝕊. It takes as input a causal diagram G, a reward signal Y, and subsets L = ∅ and R = ∪_{i=1}^n PA*_i. More specifically, LISTIDSCOPE maintains two scopes S_l ⊆ S_r (Step 2). It performs a backtracking search to find identifiable scopes S in G such that S_l ⊆ S ⊆ S_r. It aborts a branch when either (1) all subscopes of S_r are identifiable (Step 3); or (2) all scopes containing S_l are non-identifiable (Step 6). The following lemma supports our aborting criterion.

Lemma 2. Given a causal diagram G, for policy scopes S′ ⊆ S, S′ is identifiable from distribution P(O, Y) in G if S is identifiable from P(O, Y) in G.

Algorithm 1: LISTIDSCOPE
1: Input: G, Y and subsets L ⊆ R. Output: a set of identifiable policy scopes S.
2: Let scopes S_r = {⟨X_i, R ∩ PA*_i⟩}_{i=1}^n and S_l = {⟨X_i, L ∩ PA*_i⟩}_{i=1}^n.
3: if IDENTIFY(G, Y, S_r) ≠ "FAIL" then
4:   Output S_r.
5: end if
6: if IDENTIFY(G, Y, S_l) ≠ "FAIL" then
7:   Pick an arbitrary V ∈ R \ L.

At Step 7, LISTIDSCOPE picks an arbitrary variable V that is included in input covariates R but not in L. It then recursively returns all identifiable policy scopes S in G: the first recursive call returns scopes taking V as an input for some actions X_i ∈ X, and the second call returns all scopes that do not consider V when selecting values for all actions X. We say a policy π is associated with a collection of policy scopes 𝕊, denoted by π ∼ 𝕊, if there exists S ∈ 𝕊 so that π ∼ S. It is possible to show that LISTIDSCOPE produces a collection of identifiable scopes that is sufficient for the imitation task.
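The backtracking search described above can be sketched as a recursive generator. This is a sketch under stated assumptions rather than the paper's Alg. 1: the `identify` argument is a hypothetical boolean oracle standing in for the test IDENTIFY(G, Y, S) ≠ "FAIL", the two recursive calls follow the prose description (one with V added to L, one with V removed from R), and the search returns as soon as S_r is identifiable. The toy oracle at the end is invented for illustration.

```python
def list_id_scopes(identify, candidates, lower, upper):
    """Yield identifiable scopes S with S_l <= S <= S_r by backtracking.

    identify:   hypothetical oracle, scope -> bool (not the real IDENTIFY)
    candidates: action -> set of admissible covariates PA*_i
    lower, upper: covariate sets L and R with L a subset of R
    """
    scope_r = {x: frozenset(upper & pa) for x, pa in candidates.items()}
    scope_l = {x: frozenset(lower & pa) for x, pa in candidates.items()}
    if identify(scope_r):
        yield scope_r   # by Lemma 2, every subscope of S_r is identifiable too
        return
    if not identify(scope_l):
        return          # by Lemma 2, no scope containing S_l is identifiable
    # S_r failed and S_l succeeded, so the two differ and upper - lower is non-empty.
    v = min(upper - lower)  # any choice works; min keeps the order deterministic
    yield from list_id_scopes(identify, candidates, lower | {v}, upper)
    yield from list_id_scopes(identify, candidates, lower, upper - {v})

# Toy oracle (an assumption): a scope is identifiable unless it uses covariate "B".
def toy_identify(scope):
    return all("B" not in covariates for covariates in scope.values())

candidates = {"X": {"A", "B"}}
found = list(list_id_scopes(toy_identify, candidates, frozenset(), frozenset({"A", "B"})))
```

The toy oracle is monotone in the sense of Lemma 2 (dropping covariates never breaks identifiability), which is what justifies both pruning rules. With a real identification procedure in place of `toy_identify`, the same recursion would apply unchanged.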

4. EXPERIMENTS

In this section, we demonstrate our framework on various imitation learning tasks, ranging from synthetic causal models to real-world datasets, including highway driving (Krajewski et al., 2018) and images (LeCun, 1998). We find that our approach is able to incorporate parametric knowledge about the reward function and achieve effective imitating policies across different causal diagrams. For all experiments, we evaluate our proposed Causal-IRL based on the canonical equation formulation in Eq. (3). As baselines, we also include: (1) standard BC mimicking the expert's nominal behavior policy; (2) standard IRL utilizing all observed covariates preceding every X_i ∈ X while being blind to causal relationships in the underlying model; and (3) Causal-BC (Zhang et al., 2020; Kumor et al., 2021), which learns an imitating policy with the sequential π-backdoor criterion. We refer readers to (Ruan et al., 2023, Appendix D) for additional experiments and further discussion of the experimental setup.

Backdoor Consider an SCM instance compatible with Fig. 1c including binary observed variables Z 1 , X

Highway Driving We consider a learning scenario where the agent learns a driving policy from the observed trajectories of a human expert. The causal diagram of this example is provided in (Ruan et al., 2023, Appendix D, Fig. 4) [Y | X_1, X_2, Z] is a monotone function via reward augmentation (Li et al., 2017). Simulation results are shown in Fig. 3b. We found that Causal-IRL performs the best among all strategies. Causal-BC is able to achieve the expert's performance. BC and IRL perform the worst among all and fail to obtain an imitating policy.

MNIST Digits Consider again the frontdoor diagram in Fig. 2a. To evaluate the performance of our proposed approach in high-dimensional domains, we now replace variable Z with sampled images drawn from the MNIST digits dataset (LeCun, 1998). The reward Y is decided by a linear function taking Z and an unobserved confounder U_{X,Y} as input.
Causal-IRL formulates the imitation problem as a two-person zero-sum game through the frontdoor adjustment described in Eq. (9), which can be solved by the MW algorithm (Freund & Schapire, 1999; Syed & Schapire, 2008). As shown in Fig. 3c, simulation results reveal that Causal-IRL outperforms Causal-BC and BC, while IRL performs the worst among all the algorithms.

Infinite MDPUC To demonstrate our proposed framework in the sequential decision-making setting with an infinite horizon, we consider a generalized Markov decision process incorporating unobserved confounders (Ruan & Di, 2022), called the MDPUC (Zhang & Bareinboim, 2022). This sequential model simulates real-world driving dynamics. By exploiting the Markov property over time steps, we are able to decompose the causal diagram over the infinite horizon into a collection of sub-graphs, one for each time step i = 1, 2, . . . . Fig. 1d shows the causal diagram spanning time steps i = 1, 2, 3. For comparison, BC and IRL still utilize the stationary policy {⟨X_i, {Z_i}⟩}. By applying Thm. 1 at each time step, we obtain a π-backdoor admissible policy scope {⟨X_i, {Z_i, X_{i-1}, Z_{i-1}}⟩} for Causal-IRL and Causal-BC. Simulation results are shown in Fig. 3d. By inspection, Causal-IRL performs the best and achieves the expert's performance.
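The MNIST experiment relies on the multiplicative-weights (MW) algorithm to solve a two-person zero-sum game. The Python sketch below illustrates that generic technique on a small matrix game. It is not the MWAL procedure of Syed & Schapire (2008) and not the implementation used in the experiments. The loss matrix, the step size `eta`, the number of rounds, and the best-responding column player are all illustrative assumptions.

```python
import math

def mw_zero_sum(loss, rounds=2000, eta=0.05):
    """Approximately solve a zero-sum matrix game with multiplicative weights.

    loss[i][j] in [0, 1] is the row player's loss when the row player picks i
    and the column player picks j. The row player runs MW, and the column
    player best-responds to the current mixed strategy in every round.
    Returns the time-averaged row strategy and the average loss incurred.
    """
    n, m = len(loss), len(loss[0])
    p = [1.0 / n] * n          # current mixed strategy of the row player
    avg = [0.0] * n            # running average of the row strategies
    total = 0.0
    for _ in range(rounds):
        # Expected loss of the current strategy against each column.
        col_loss = [sum(p[i] * loss[i][j] for i in range(n)) for j in range(m)]
        j = max(range(m), key=col_loss.__getitem__)   # adversarial best response
        total += col_loss[j]
        for i in range(n):
            avg[i] += p[i] / rounds
        # Exponential down-weighting of rows that suffered loss, then renormalize.
        w = [p[i] * math.exp(-eta * loss[i][j]) for i in range(n)]
        s = sum(w)
        p = [x / s for x in w]
    return avg, total / rounds

# Matching pennies as a toy game: its value is 0.5 at the uniform strategy.
strategy, value = mw_zero_sum([[1.0, 0.0], [0.0, 1.0]])
print(strategy, value)
```

With losses in [0, 1], the standard regret bound for this update keeps the average loss within ln(n)/(eta·T) + eta/8 of the game value, so the averaged strategy is approximately minimax-optimal for the row player.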

5. CONCLUSION

This paper investigates imitation learning via inverse reinforcement learning (IRL) in the semantical framework of structural causal models. The goal is to find an effective imitating policy that performs at least as well as the expert's behavior policy from combinations of demonstration data, qualitative knowledge of the data-generating mechanisms represented as a causal diagram, and quantitative knowledge about the reward function. We provide a graphical criterion (Thm. 1) based on the sequential backdoor, which allows one to obtain an imitating policy by solving a canonical optimization equation of causal IRL. Such a canonical formulation addresses the challenge posed by unobserved confounders (UCs), and is solvable by leveraging standard IRL algorithms (Props. 1 and 2). Finally, we move beyond the backdoor criterion and show that the canonical equation is achievable whenever expected rewards of policies are identifiable had the reward also been observed (Thms. 2 and 3).

ETHICS STATEMENT

This paper investigates the theoretical framework of causal inverse RL from the natural trajectories of an expert demonstrator, even when the reward signal is unobserved. Input covariates used by the expert to determine the original values of the action are unknown, introducing unobserved confounding bias in demonstration data. Our framework may apply to various real-world fields, including autonomous vehicle development, industrial automation, and chronic disease management. A positive impact of this work is that we discuss the potential risk of training IRL policies from demonstrations in the presence of unobserved confounding (UC). Our formulation of causal IRL is inherently robust against confounding bias. For example, solving the causal IRL problem in Eq. (1) requires the imitator to learn an effective policy that maximizes the reward in a worst-case causal model where the performance gap between the expert and imitator is the largest possible. More broadly, automated decision systems using causal inference methods prioritize safety and robustness during their decision-making processes. Such requirements are increasingly essential since black-box AI systems are prevalent, and our understanding of their potential implications is still limited.

REPRODUCIBILITY STATEMENT

The complete proofs of all theoretical results presented in this paper, including Thms. 1 and 2, are provided in (Ruan et al., 2023, Appendix B). Details on the implementation of the proposed algorithms are included in (Ruan et al., 2023, Appendix C). Finally, (Ruan et al., 2023, Appendix D) provides a detailed description of the experimental setup. Readers can find all appendices as part of the supplementary text after the "References" section. We provide references to all existing datasets used in experiments, including HIGHD (Krajewski et al., 2018) and MNIST (LeCun, 1998). Other experiments are synthetic and do not introduce any new assets. Source code for all experiments and simulations is released in the complete technical report (Ruan et al., 2023).



Zhang et al. (2020); Kumor et al. (2021) provide a graphical condition that is sufficient for learning an imitating policy via behavioral cloning (BC) provided with a causal diagram G. For a policy scope S = {⟨X_i, Z_i⟩}_{i=1}^n

Observe that the key to the reduction of the canonical causal IRL equation in Eq. (3) lies in the identification of expected rewards E[Y | do(π)] had the latent reward Y been observed. Next we will study general conditions under which E[Y | do(π)] is uniquely discernible from distribution P (O, Y ) in the causal diagram G, called the identifiability of causal effects

for identifying causal effects E[Y | do(π)] of an arbitrary policy π ∼ S. Particularly, IDENTIFY takes as input the causal diagram G, a reward Y , and a policy scope S. It returns an identification formula for E[Y | do(π)] from P (O, Y ) if expected rewards of all policies π ∼ S are identifiable. Otherwise, IDENTIFY(G, Y, S) = "FAIL". Details of IDENTIFY are shown in

Y, L ∪ {V }, R).

Y, L, R \ {V }). 11: end if

Theorem 3. For a causal diagram G and a reward Y, LISTIDSCOPE(G, Y, ∅, ⋃_{i=1}^n PA*_i) enumerates a subset S* ⊆ S so that for any π ∼ S, there is π* ∼ S* where E[Y | do(π)] = E[Y | do(π*)].

Moreover, LISTIDSCOPE outputs identifiable policy scopes with a polynomial delay. This follows from the observation that LISTIDSCOPE searches over a tree of policy scopes with height at most |⋃_{i=1}^n PA*_i| and IDENTIFY(G, Y, S) terminates in a polynomial number of steps w.r.t. the size of diagram G.

Figure 3: Simulation results (a, b, c, d) for our experiments, where y-axis represents the expected reward of learned policies in the actual causal model; the grey dashed line denotes the expert's reward.

We say a policy scope S is identifiable (from P (O, Y ) in G) if for all policies π ∼ S, the corresponding expected rewards E[Y | do(π)] are identifiable from P (O, Y ) in G. Our next result shows that whenever an identifiable policy scope S is found, one could always reduce the causal IRL problem to the canonical optimization equation in Eq. (3). Theorem 2. Given a causal diagram G, a policy scope S is identifiable from P (O, Y ) in G if and only if for any policy π ∼ S, the expected reward E[Y | do(π)] is computable from P (O, Y ) as

1 , Z 2 , X 2 , Y ∈ {0, 1}. Causal-BC utilizes a sequential π-backdoor admissible scope {⟨X 1 , {Z 1 }⟩, ⟨X 2 , {Z 2 }⟩}; while Causal-IRL utilizes the scope {⟨X 1 , ∅⟩, ⟨X 2 , {Z

where X_1 is the acceleration of the ego vehicle at the previous step; Z_1 is both the longitudinal and lateral historical accelerations of the ego vehicle two steps ago; X_2 is the velocity of the ego vehicle; Z_2 is the velocity of the preceding vehicle; W indicates the information from surrounding vehicles. Values of X_1, X_2, Z_1, Z_2 are drawn from a real-world driving dataset, HighD (Krajewski et al., 2018). The reward Y is decided by a non-linear function f_Y(X_2, Z_2, U_Y). Both Causal-IRL and Causal-BC utilize the scope {⟨X_1, ∅⟩, ⟨X_2, {Z_2}⟩}. Causal-IRL also exploits the additional knowledge that the expected reward E

ACKNOWLEDGEMENTS

This research was supported in part by the NSF, ONR, AFOSR, DoE, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

