DYNAMICS-AWARE SKILL GENERATION FROM BEHAVIOURALLY DIVERSE DEMONSTRATIONS Anonymous

Abstract

Learning from demonstrations (LfD) provides a data-efficient way for a robot to learn a task by observing humans performing it, without the need for an explicit reward function. However, in many real-world scenarios (e.g., driving a car), humans often perform the same task in different ways, motivated not only by the primary objective of the task (e.g., reaching the destination safely) but also by their individual preferences (e.g., different driving styles), leading to a multimodal distribution of demonstrations. In this work, we consider a learning-from-state-only-demonstrations setup, where the reward function for the common objective of the task is known to the learning agent, but the individual preferences that cause the variations in the demonstrations are not. We introduce an imitation-guided Reinforcement Learning (RL) framework that formulates policy optimisation as a constrained RL problem, learning a diverse set of policies that perform the task under the different constraints imposed by the preferences. We then propose an algorithm called LfBD and show that it builds a parameterised solution space that captures the different behaviour patterns present in the demonstrations. In this solution space, a set of policies can be learned to produce behaviours that not only capture the modes of the demonstrations but also go beyond them.

1. INTRODUCTION

Learning from demonstrations (LfD) (Schaal, 1996) provides an alternative to Reinforcement Learning (RL): an agent learns a policy by observing how humans perform similar tasks. However, in many real-world scenarios (e.g., driving a car), humans often perform the same task in different ways. Their behaviours are influenced not only by the primary objective of the task (e.g., reaching the destination safely) but also by their individual preferences or expertise (e.g., different driving styles) (Fürnkranz & Hüllermeier, 2010; Babes et al., 2011). In other words, all the underlying policies maximise the same task reward, but under different constraints imposed by individual preferences. This leads to a multimodal distribution of demonstrations, where each mode represents a distinct behaviour. Given multimodal demonstrations, typical LfD methods such as Behaviour Cloning and Generative Adversarial Imitation Learning either learn a policy that converges to one of the modes, resulting in mode-seeking behaviour, or exhibit mean-seeking behaviour by averaging across the modes (Ke et al., 2020; Ghasemipour et al., 2020; Zhang et al., 2020). While the former still recovers a subset of the solutions, the latter may produce unintended behaviour (see Fig. 1). Moreover, neither approach can learn policies that cover the behaviours of a wide range of individuals. The problem becomes even more challenging when the demonstrations contain only state observations without the corresponding actions. In such situations, supervised or unsupervised learning approaches cannot be applied directly to find a policy, and the agent must interact with the environment or with a simulator (Torabi et al., 2019). Being able to learn a diverse set of policies from demonstrations is often desirable to serve the requirements of a wide range of individual users.
For instance, every self-driving car can have a driving policy, selected from the diverse set of pre-trained policies, that matches its user's preferences. Many recent works show the advantages of maintaining a diverse set of policies, for instance, rapid damage adaptation in robotics (Kaushik et al., 2020; Chatzilygeroudis et al., 2018; Cully et al., 2015) and safe sim-to-real policy transfer (Kaushik et al., 2022).
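To make the mean-seeking failure mode discussed above concrete, the following minimal sketch (hypothetical one-dimensional numbers, not from the paper) shows why Behaviour Cloning with a mean-squared-error loss averages across the modes of a bimodal action distribution:

```python
import numpy as np

# Hypothetical example: at the same state, half of the demonstrators
# steer left (action -1.0) and half steer right (action +1.0),
# giving a bimodal distribution of demonstrated actions.
actions = np.array([-1.0] * 50 + [1.0] * 50)

# Behaviour Cloning with an MSE loss fits the conditional mean of the
# demonstrated actions; for this state that is simply the sample mean.
bc_action = actions.mean()

# The result lies between the two modes and matches neither
# demonstrator: the car steers straight into whatever the experts
# were steering around.
print(bc_action)  # 0.0, an action no expert ever took
```

A mode-seeking objective (e.g., an adversarial one) would instead commit to one of the two modes, recovering a valid but single behaviour, which is why neither objective alone covers the full range of demonstrated preferences.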

