LATENT HIERARCHICAL IMITATION LEARNING FOR STOCHASTIC ENVIRONMENTS

Abstract

Many applications of imitation learning require the agent to avoid mode collapse and mirror the full distribution of observed behaviours. Existing methods that address this distributional realism typically rely on hierarchical policies conditioned on sampled types that model agent-internal features such as persona, goal, or strategy. However, these methods are often inappropriate for stochastic environments, where internal and external influences on the observed agent trajectories must be disentangled: only internal factors should be encoded in the agent type, so that the policy remains robust to changing environment conditions. We formalize this challenge as distribution shift in the conditional distribution of agent types under environmental stochasticity, in addition to the familiar covariate shift in state visitations. We propose Robust Type Conditioning (RTC), which eliminates these shifts through adversarial training under randomly sampled types. Experiments on two domains, including the large-scale Waymo Open Motion Dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.

1. INTRODUCTION

Learning to imitate behaviour is crucial when reward design is infeasible (Amodei et al., 2016; Hadfield-Menell et al., 2017; Fu et al., 2018; Everitt et al., 2021), for overcoming hard exploration problems (Rajeswaran et al., 2017; Zhu et al., 2018), and for realistic modelling of dynamical systems with multiple interacting agents (Farmer and Foley, 2009). Such systems, including games, driving simulations, and agent-based economic models, often have known state transition functions, but require accurate agent models to be realistic. For example, in driving simulations, which are crucial for accelerating the development of autonomous vehicles (Suo et al., 2021; Igl et al., 2022), faithful reactions of all road users are paramount. Furthermore, it is not enough to mimic a single mode in the data; instead, agents must reproduce the full distribution of behaviours to avoid sim2real gaps in modelled systems (Grover et al., 2018; Liang et al., 2020), under-explored solutions in complex tasks (Vinyals et al., 2019), and suboptimal policies in games requiring mixed strategies (Nash Jr, 1950).

Current imitation learning (IL) methods fall short of such distributional realism because they fail to match all modes in the data: the required stochastic policy cannot be recovered from a fixed reward function, and adversarial methods, while aiming to match the full distribution in principle, are prone to mode collapse in practice (Wang et al., 2017; Lucic et al., 2018; Creswell et al., 2018). Progress on distributional realism is further hindered by a lack of suitable IL benchmarks: most rely on unimodal data and evaluate only task performance as measured by rewards, not mode coverage. By contrast, many applications require distributional realism in addition to good task performance. For example, accurately evaluating the safety of autonomous vehicles in simulation relies on distributionally realistic agents.
Consequently, our goal is to improve distributional realism while maintaining strong task performance. To mitigate mode collapse in complex environments, previous work uses hierarchical policies in an autoencoder framework (Wang et al., 2017; Suo et al., 2021; Igl et al., 2022). During training, an encoder infers
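The autoencoder-style hierarchical setup referenced above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the linear "encoder" and "policy", their weights, and all dimensions are assumed placeholders for learned networks. The key structure is that the encoder infers a latent type z from a whole demonstration trajectory, and the policy conditions on both the current state and z, so different types produce different behaviour modes.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, TYPE_DIM, ACTION_DIM = 4, 2, 2

# Hypothetical linear "encoder": infers a latent type z from a trajectory.
# In practice this would be a learned sequence model (e.g. an RNN).
W_enc = rng.normal(size=(STATE_DIM, TYPE_DIM))

def encode_type(trajectory):
    # Mean-pool the states, then project into the latent type space.
    return trajectory.mean(axis=0) @ W_enc

# Hypothetical type-conditioned policy: the action depends on the current
# state AND the type z, so sampling different z yields different modes.
W_pol = rng.normal(size=(STATE_DIM + TYPE_DIM, ACTION_DIM))

def policy(state, z):
    return np.tanh(np.concatenate([state, z]) @ W_pol)

# Training-time usage (autoencoder-style): z is inferred from the demo.
demo = rng.normal(size=(10, STATE_DIM))  # toy trajectory of 10 states
z_inferred = encode_type(demo)
action_train = policy(demo[0], z_inferred)

# Test-time usage: no demonstration is available, so z is instead sampled
# from a prior; each sample selects one behaviour mode.
z_sampled = rng.normal(size=TYPE_DIM)
action_test = policy(demo[0], z_sampled)
```

Under this view, the failure mode the paper targets arises when environmental (external) stochasticity leaks into z, so that types inferred at training time follow a different conditional distribution than types sampled at test time.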

