LATENT HIERARCHICAL IMITATION LEARNING FOR STOCHASTIC ENVIRONMENTS

Abstract

Many applications of imitation learning require the agent to avoid mode collapse and mirror the full distribution of observed behaviours. Existing methods targeting this distributional realism typically rely on hierarchical policies conditioned on sampled types that model agent-internal features such as persona, goal, or strategy. However, these methods are often inappropriate for stochastic environments, where internal and external factors of influence on the observed agent trajectories must be disentangled, and only internal factors should be encoded in the agent type to remain robust to changing environment conditions. We formalize this challenge as distribution shift in the conditional distribution of agent types under environmental stochasticity, in addition to the familiar covariate shift in state visitations. We propose Robust Type Conditioning (RTC), which eliminates these shifts with adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.

1. INTRODUCTION

Learning to imitate behaviour is crucial when reward design is infeasible (Amodei et al., 2016; Hadfield-Menell et al., 2017; Fu et al., 2018; Everitt et al., 2021), for overcoming hard exploration problems (Rajeswaran et al., 2017; Zhu et al., 2018), and for realistic modelling of dynamical systems with multiple interacting agents (Farmer and Foley, 2009). Such systems, including games, driving simulations, and agent-based economic models, often have known state transition functions, but require accurate agents to be realistic. For example, in driving simulations, which are crucial for accelerating the development of autonomous vehicles (Suo et al., 2021; Igl et al., 2022), faithful reactions of all road users are paramount. Furthermore, it is not enough to mimic a single mode in the data; instead, agents must reproduce the full distribution of behaviours to avoid sim2real gaps in modelled systems (Grover et al., 2018; Liang et al., 2020), under-explored solutions in complex tasks (Vinyals et al., 2019), and suboptimal policies in games requiring mixed strategies (Nash Jr, 1950). Current imitation learning (IL) methods fall short of such distributional realism: they fail to match all modes in the data. The required stochastic policy cannot be recovered from a fixed reward function, and adversarial methods, while aiming to match the full distribution in principle, are known to be prone to mode collapse in practice (Wang et al., 2017; Lucic et al., 2018; Creswell et al., 2018). Furthermore, progress on distributional realism is hindered by a lack of suitable IL benchmarks, with most relying on unimodal data and only evaluating task performance as measured by rewards, but not mode coverage. By contrast, many applications require distributional realism in addition to good task performance. For example, accurately evaluating the safety of autonomous vehicles in simulation relies on distributionally realistic agents.
Consequently, our goal is to improve distributional realism while maintaining strong task performance. To mitigate mode collapse in complex environments, previous work uses hierarchical policies in an autoencoder framework (Wang et al., 2017; Suo et al., 2021; Igl et al., 2022). During training, an encoder infers latent variables from observed trajectories and the agent, conditioned on those latent variables, strives to imitate the original trajectory. At test time, a prior distribution proposes distributionally realistic latent values, without requiring access to privileged future information. We refer to this latent vector as an agent's inferred type since it expresses intrinsic characteristics of the agent that yield the multimodal behaviour. Depending on the environment, the type could, for example, represent the agent's persona, belief, goal, or strategy. However, these hierarchical methods rely on either manually designed type representations (Igl et al., 2022) or the strong assumption that all stochasticity in the environment can be controlled by the agent (Wang et al., 2017; Suo et al., 2021). Unfortunately, this assumption is violated in most realistic scenarios. For example, in the case of driving simulations, trajectories depend not only on the agent's type, expressing its driving style and intent, but also on external factors such as the behaviour of other road users. Crucially, despite being inferred from future trajectories during training, agent types must be independent of these external factors to avoid leaking information about future events outside the agent's control, which in turn can impair generalization at test time under changed, and ex-ante unknown, environmental conditions. In other words, the challenge in learning hierarchical policies using IL in stochastic environments is to disentangle the internal and external factors of influence on the trajectories and only encode the former into the type.
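The hierarchical setup above can be summarised in a minimal sketch. All names here (encode_type, prior_sample) and the stand-in linear/tanh computations are ours for illustration only, not the paper's architecture; in practice each component would be a learned neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
TYPE_DIM = 2  # dimensionality of the latent agent type z (our choice)

def encode_type(trajectory):
    """Training time: infer the latent type z from the full observed trajectory.
    Stand-in: average the state features and project to TYPE_DIM."""
    states = np.array([s for s, _ in trajectory])
    return np.tanh(states.mean(axis=0)[:TYPE_DIM])

def prior_sample():
    """Test time: sample a distributionally realistic type from the prior,
    with no access to privileged future information."""
    return rng.standard_normal(TYPE_DIM)

def policy(state, z):
    """Type-conditioned policy pi_theta(a | s, z); stand-in scoring over 2 actions."""
    score = state[:TYPE_DIM] @ z
    return int(np.argmax([score, -score]))

# Training reconstructs the demonstration under the inferred type;
# at test time the type comes from the prior instead.
demo = [(np.array([1.0, 0.5, 0.2]), 0), (np.array([0.8, 0.4, 0.1]), 1)]
z_train = encode_type(demo)
z_test = prior_sample()
actions = [policy(s, z_train) for s, _ in demo]
assert len(actions) == len(demo) and z_test.shape == (TYPE_DIM,)
```

The key structural point is the asymmetry: the encoder sees the whole trajectory (including the future), while the prior does not, which is precisely what makes information leakage into z possible.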
Consider the example of an expert approaching an intersection at the same time as another car. The expert passes if the other car brakes and yields to it otherwise. To reconstruct the scene easily, a naively trained latent model could encode not only the agent's intended direction (an internal decision) but also whether to yield, which depends on the other car (an external factor). This is catastrophic at test time when the latent, and hence the yielding decision, is sampled independently of the other car's behaviour. In contrast, if only the expert's intent were encoded in the latent, the policy would learn to react appropriately to external factors. In this paper, we identify these subtle challenges arising under stochastic environments and formulate them as a new form of distribution shift for hierarchical policies. Unlike the familiar covariate shift in the state distribution (Ross et al., 2011), this conditional type shift occurs in the distribution of the inferred latent type. It greatly reduces performance by yielding causally confused agents that rely on the latent type for information about external factors, instead of inferring them from the latest environment observation. We propose Robust Type Conditioning (RTC) to eliminate this distribution shift and avoid causally confused agents through a coupled adversarial training objective under randomly sampled types. We do not require access to an expert, counterfactuals, or manually specified type labels for trajectories. Experimentally, we show the need for improved distributional realism due to mode collapse in state-of-the-art imitation learning techniques such as GAIL (Ho and Ermon, 2016). Furthermore, we show that naively trained hierarchical models with inferred types improve distributional realism, but exhibit poor task performance in stochastic environments. By contrast, RTC can maintain good task performance in stochastic environments while improving distributional realism and mode coverage.
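The intersection example can be made quantitative with a toy simulation. The probabilities below are of our choosing purely for illustration: the other car brakes in half the scenes, and a "leaky" latent that memorised the yielding decision during training is resampled independently at test time.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000  # number of simulated intersection scenes

# External factor: the other car brakes with probability 0.5.
other_brakes = rng.random(N) < 0.5

# Expert behaviour: pass exactly when the other car brakes -> no conflicts.
expert_pass = other_brakes

# Leaky latent: at training time it encoded the yielding decision itself;
# at test time it is sampled independently of the other car.
leaky_pass = rng.random(N) < 0.5
conflict_leaky = np.mean(leaky_pass & ~other_brakes)

# Reactive policy: the type encodes only intent; yielding is read off the
# current observation of the other car, as the expert does.
reactive_pass = other_brakes
conflict_reactive = np.mean(reactive_pass & ~other_brakes)

assert conflict_reactive == 0.0
assert conflict_leaky > 0.2  # roughly a quarter of scenes end in conflict
```

Even though both policies reconstruct training scenes perfectly, the leaky latent produces conflicts in roughly 25% of test scenes, which is exactly the conditional type shift described above.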
We evaluate RTC on the illustrative Double Goal Problem as well as the large-scale Waymo Open Motion Dataset (Ettinger et al., 2021) of real driving behaviour.

2. BACKGROUND

We are given a dataset $\mathcal{D} = \{\tau_i\}_{i=1}^N$ of $N$ trajectories $\tau_i = (s_0^{(i)}, a_0^{(i)}, \ldots, s_T^{(i)})$, drawn from $p(\tau)$ of one or more experts interacting with a stochastic environment $p(s_{t+1} \mid s_t, a_t)$, where $s_t \in \mathcal{S}$ are states and $a_t \in \mathcal{A}$ are actions. Our goal is to learn a policy $\pi_\theta(a_t \mid s_t)$ that matches $p(\tau)$ when replacing the unknown expert and generating rollouts $\hat{\tau} \sim \hat{p}(\hat{\tau}) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(\hat{a}_t \mid \hat{s}_t)\, p(\hat{s}_{t+1} \mid \hat{s}_t, \hat{a}_t)$ from the initial states $s_0 \sim p(s_0)$. We simplify notation and write $\hat{\tau} \sim \pi_\theta(\hat{\tau})$ and $\tau \sim \mathcal{D}(\tau)$ to indicate rollouts generated by the policy or drawn from the data, respectively. Expectations $\mathbb{E}_{\tau \sim \mathcal{D}}$ and $\mathbb{E}_{\hat{\tau} \sim \pi_\theta}$ are taken over all pairs $(s_t, a_t) \in \tau$ and $(\hat{s}_t, \hat{a}_t) \in \hat{\tau}$. Previous work (e.g., Ross et al., 2011; Ho and Ermon, 2016) shows that a core challenge of learning from demonstration is reducing or eliminating the covariate shift in the state-visitation frequencies $p(s)$ caused by accumulating errors when using $\pi_\theta$. Unfortunately, Behavioural Cloning (BC), a simple supervised training objective optimising $\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log \pi_\theta(a_t \mid s_t)\right]$, is not robust to it. To overcome covariate shift, generative
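The BC objective admits a compact sketch. This is a minimal illustration of the maximum-likelihood loss $\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[\log \pi_\theta(a_t \mid s_t)]$ for a hypothetical tabular softmax policy, not an implementation from the paper; `bc_loss` and the toy data are our own.

```python
import numpy as np

def bc_loss(logits, trajectories):
    """Mean negative log-likelihood of demonstrated (s, a) pairs under a
    tabular softmax policy: logits has shape (|S|, |A|)."""
    total, count = 0.0, 0
    for traj in trajectories:
        for s, a in traj:
            z = logits[s] - logits[s].max()          # stabilised log-softmax
            log_probs = z - np.log(np.exp(z).sum())
            total -= log_probs[a]
            count += 1
    return total / count

# Toy check: a policy concentrated on the demonstrated actions has lower loss
# than a uniform policy. State 0 demonstrates action 0, state 1 action 1.
demos = [[(0, 0), (1, 1)]]
good = np.array([[5.0, 0.0], [0.0, 5.0]])
uniform = np.zeros((2, 2))
assert bc_loss(good, demos) < bc_loss(uniform, demos)
```

Note that minimising this loss only matches the conditional $\pi_\theta(a_t \mid s_t)$ on states visited by the expert; it does nothing to control the covariate shift in $p(s)$ discussed above.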

