DYNAMICS-AWARE SKILL GENERATION FROM BEHAVIOURALLY DIVERSE DEMONSTRATIONS

Anonymous authors

Abstract

Learning from demonstrations (LfD) provides a data-efficient way for a robot to learn a task by observing humans performing it, without the need for an explicit reward function. However, in many real-world scenarios (e.g., driving a car), humans often perform the same task in different ways, motivated not only by the primary objective of the task (e.g., reaching the destination safely) but also by their individual preferences (e.g., different driving styles), leading to a multimodal distribution of demonstrations. In this work, we consider a learning-from-state-only-demonstrations setup, where the reward function for the common objective of the task is known to the learning agent, but the individual preferences that cause the variations in the demonstrations are unknown. We introduce an imitation-guided Reinforcement Learning (RL) framework that formulates policy optimisation as a constrained RL problem, learning a diverse set of policies that perform the task under the different constraints imposed by the preferences. We then propose an algorithm called LfBD and show that we can build a parameterised solution space that captures the different behaviour patterns present in the demonstrations. In this solution space, a set of policies can be learned that not only captures the demonstrated modes but also goes beyond the provided demonstrations.

1. INTRODUCTION

Learning from demonstrations (LfD) (Schaal, 1996) provides an alternative to Reinforcement Learning (RL): an agent learns a policy by observing how humans perform similar tasks. However, in many real-world scenarios (e.g., driving a car), humans often perform the same task in different ways. Their behaviours are influenced not only by the primary objective of the task (e.g., reaching the destination safely) but also by their individual preferences or expertise (e.g., different driving styles) (Fürnkranz & Hüllermeier, 2010; Babes et al., 2011). In other words, all the underlying policies maximise the same task reward, but under different constraints imposed by individual preferences. This leads to a multimodal distribution of demonstrations, where each mode represents a distinct behaviour. Given multimodal demonstrations, typical LfD methods, such as Behaviour Cloning and Generative Adversarial Imitation Learning, either learn a policy that converges to one of the modes, resulting in mode-seeking behaviour, or exhibit mean-seeking behaviour by averaging across different modes (Ke et al., 2020; Ghasemipour et al., 2020; Zhang et al., 2020). While the former still recovers a subset of solutions, the latter may produce unseen and potentially unsafe behaviour (see Fig. 1). Furthermore, none of these approaches can learn policies that cover the behaviours of a wide range of individuals. The problem becomes even more challenging when the demonstrations contain only state observations, without the corresponding actions. In such situations, supervised or unsupervised learning approaches cannot be applied directly to find a policy, and the agent must interact with the environment or with a simulator (Torabi et al., 2019). Being able to learn a diverse set of policies from demonstrations is often desirable to serve the requirements of a wide range of individual users.
For instance, every self-driving car could use a driving policy, selected from a diverse set of pre-trained policies, that matches the preferences of its user. Many recent works show the advantages of having a diverse set of policies, for instance, rapid damage adaptation in robotics (Kaushik et al., 2020; Chatzilygeroudis et al., 2018; Cully et al., 2015) and safe sim-to-real policy transfer (Kaushik et al., 2022). In this work, we consider a specific setup of LfD, known as Imitation Learning from Observations alone (ILfO) (Sun et al., 2019), where the learning agent only has access to state observations without the corresponding actions. We propose a new framework that combines Reinforcement Learning with ILfO to address the issues of learning from multimodal, state-only demonstrations, especially with a small set of unlabelled demonstrations. Unlike most LfD methods, our goal is not just to learn how to perform a task or how to mimic humans, but rather how to perform a task in all possible ways, as shown in Fig. 1D. Thus, we focus on applications where a high-level task reward function can be defined easily, but the preference components that cause the diverse behaviours cannot be explicitly defined. These include tasks such as autonomous driving or robotic manipulation, where the agent needs to mimic a human's behaviour pattern (i.e., preference) while reaching the intended goal (i.e., task reward). In practice, defining a high-level task reward can be straightforward; for autonomous driving, it can be a function of the distance to a target location and a penalty for collisions. However, defining the preference component is far from easy, as it may be impossible to find a mathematical expression for each individual's preferences.
Thus, we formulate demonstration-guided multimodal policy generation as a constrained optimisation problem, where multimodal behaviours result from optimising policies for a given task reward function under different preference constraints. As contributions, we first propose a new imitation-guided RL framework and an algorithm called Learning from Behaviourally diverse Demonstrations (LfBD) to solve the problem of policy generation from multimodal (state-only) demonstrations. We then propose a novel projection function that captures preferences as state-region visitations. This projection function allows us to build a parameterised solution space that allocates policies such that they satisfy different preference constraints. We show that our approach generates multimodal solutions beyond the provided demonstrations, i.e., the resulting solutions also include interpolations between the provided demonstrations. Furthermore, our method allows different types of post-hoc policy search in the solution space: 1) given a (state-only) demonstration, find the closest policy capable of generating it; 2) search for policies in the solution space that have a high or low likelihood under the provided demonstrations (i.e., that are similar or dissimilar to them); 3) find solutions that satisfy different constraints.
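As a toy illustration of this constrained formulation, the sketch below (all functions and numbers are hypothetical, not the paper's actual objective) maximises a 1-D task reward r(θ) subject to a preference constraint c(θ) ≤ ε, using primal-dual gradient updates on the Lagrangian L(θ, λ) = r(θ) - λ(c(θ) - ε):

```python
import numpy as np

# Illustrative 1-D "policy": theta is the single policy parameter.
# The task reward peaks at theta = 3, but a (hand-coded, stand-in)
# preference constrains the policy to stay near theta = 1.
def r(theta): return -(theta - 3.0) ** 2   # task reward
def c(theta): return (theta - 1.0) ** 2    # preference cost
eps = 0.25                                 # constraint level: c(theta) <= eps

theta, lam = 0.0, 0.0
alpha, beta = 0.01, 0.1                    # primal / dual step sizes
for _ in range(5000):
    grad_r = -2.0 * (theta - 3.0)
    grad_c = 2.0 * (theta - 1.0)
    theta += alpha * (grad_r - lam * grad_c)        # ascend Lagrangian in theta
    lam = max(0.0, lam + beta * (c(theta) - eps))   # dual ascent on lambda

print(round(theta, 2))  # settles at the constraint boundary, near 1.5
```

The reward alone would drive θ to 3, but the active preference constraint caps it at the boundary θ = 1.5; different constraints would yield different policies for the same task reward, which is the mechanism behind the multimodal behaviours above.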

2.1. IMITATION LEARNING AS DIVERGENCE MINIMISATION

In this section, we discuss why current Imitation Learning (IL) methods are incapable of dealing with multimodal demonstration distributions, especially when only a small set of demonstrations is available.



Figure 1: Given the demonstrations from several individuals in A, the mean-seeking policy produces unseen behaviour that is unsafe, as shown in B, while the mode-seeking policy recovers only one mode, as shown in C. We propose a new framework that recovers all the possible solution modes, as shown in D. The example is inspired by (Ke et al., 2020).

Zhang et al. (2020), Ke et al. (2020), and Ghasemipour et al. (2020) have shown that current Imitation Learning (IL) methods can be derived as a family of f-divergence minimisation methods, in which the divergence between the state-action distributions of the expert, p_πexp(s, a), and the learner, p_π(s, a), is minimised.
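To make the f-divergence view concrete, the snippet below (with hypothetical, hand-picked occupancy measures) evaluates D_f(P||Q) = Σ_x q(x) f(p(x)/q(x)) for the two KL instances on a four-point state-action space. The forward KL favours a learner that spreads mass over the expert's modes, while the reverse KL favours one that commits to a single mode:

```python
import numpy as np

# Hypothetical discrete occupancy measures over four (s, a) pairs.
p_exp  = np.array([0.45, 0.45, 0.05, 0.05])  # expert: two dominant modes
q_mode = np.array([0.88, 0.04, 0.04, 0.04])  # mode-seeking learner
q_mean = np.array([0.25, 0.25, 0.25, 0.25])  # mean-seeking (averaging) learner

def f_div(p, q, f):
    """Generic f-divergence D_f(P||Q) = sum_x q(x) * f(p(x)/q(x))."""
    return float(np.sum(q * f(p / q)))

kl  = lambda p, q: f_div(p, q, lambda t: t * np.log(t))  # f(t)=t log t -> KL(P||Q)
rkl = lambda p, q: f_div(p, q, lambda t: -np.log(t))     # f(t)=-log t  -> KL(Q||P)

# Forward KL penalises missing expert mass, so the averaging learner scores better.
print(kl(p_exp, q_mean) < kl(p_exp, q_mode))   # True
# Reverse KL penalises learner mass where the expert has none, so the
# mode-seeking learner scores better.
print(rkl(p_exp, q_mode) < rkl(p_exp, q_mean)) # True
```

Which failure mode an IL method exhibits on multimodal demonstrations thus follows directly from which generator f (and hence which divergence direction) its objective implicitly minimises.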

