LEARNING FROM DEMONSTRATION WITH WEAKLY SUPERVISED DISENTANGLEMENT

Abstract

Robotic manipulation tasks, such as wiping with a soft sponge, require control from multiple rich sensory modalities. Human-robot interaction aimed at teaching robots is difficult in this setting, as there is potential for mismatch between human and machine comprehension of the rich data streams. We treat the task of interpretable learning from demonstration as an optimisation problem over a probabilistic generative model. To account for the high dimensionality of the data, a high-capacity neural network is chosen to represent the model. The latent variables in this model are explicitly aligned with high-level notions and concepts that are manifested in a set of demonstrations. We show that such alignment is best achieved through the use of labels from the end user, in an appropriately restricted vocabulary, in contrast to the conventional approach of the designer picking a prior over the latent variables. Our approach is evaluated in the context of two table-top manipulation tasks performed by a PR2 robot: dabbing liquids with a sponge (forcefully pressing a sponge and moving it along a surface) and pouring between different containers. The robot provides visual information, arm joint positions and arm joint efforts. Videos of the tasks and the data are available in the supplementary materials at: https://sites.google.com/view/weak-label-lfd.

1. INTRODUCTION

Learning from Demonstration (LfD) (Argall et al., 2009) is a commonly used paradigm in which a potentially imperfect demonstrator teaches a robot how to perform a particular task in its environment. Most often this is achieved through a combination of kinaesthetic teaching and supervised learning, i.e. imitation learning (Ross et al., 2011). However, such approaches do not allow elaborations and corrections from the demonstrator to be seamlessly incorporated. As a result, new demonstrations are required whenever the demonstrator changes the task specification or the agent changes its context - typical scenarios in interactive task learning (Laird et al., 2017). Such problems arise mainly because the demonstrator and the agent reason about the world using notions and mechanisms at different levels of abstraction. An LfD setup which can accommodate abstract user specifications requires establishing a mapping from the high-level notions humans use - e.g. spatial concepts, different ways of applying force - to the low-level perceptive and control signals robot agents utilise - e.g. joint angles, efforts and camera images. With this in place, any constraints or elaborations from the human operator must be mapped to behaviour on the agent's side that is consistent with the semantics of the operator's desires. Concretely, we need to be able to ground (Vogt, 2002; Harnad, 1990) the specifications and symbols used by the operator in the actions and observations of the agent. The actions and observations of a robot agent are often high-dimensional - high-DoF kinematic chains, high image resolution, etc. - making the symbol grounding problem non-trivial. However, the concepts we need to ground lie on a much lower-dimensional manifold, embedded in the high-dimensional data space (Fefferman et al., 2016).
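To make the alignment idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: a toy linear encoder/decoder stands in for the high-capacity neural model, and a weak user label ("soft press") is used to pull one designated latent dimension toward the labelled concept, alongside a reconstruction objective. All names, dimensions and the linear maps are hypothetical; the real model is a deep probabilistic generative network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one demonstration frame: 7 joint efforts + a flattened 4x4 image.
x = rng.normal(size=23)

# Hypothetical linear encoder/decoder to a 3-D latent space
# (the paper uses a deep generative model instead).
W_enc = rng.normal(scale=0.1, size=(3, 23))
W_dec = rng.normal(scale=0.1, size=(23, 3))

z = W_enc @ x      # low-dimensional latent code
x_hat = W_dec @ z  # reconstruction of the high-dimensional observation

# Weak supervision: the user labels this frame "soft press" (encoded here as 0.0
# in a restricted press-style vocabulary); latent dim 0 is designated to carry
# that concept, so it is pulled toward the label.
label = 0.0
recon_loss = np.mean((x - x_hat) ** 2)
align_loss = (z[0] - label) ** 2

# A learning procedure would minimise this combined objective over the encoder
# and decoder parameters.
total_loss = recon_loss + align_loss
```

The key design point is that the label constrains only the designated latent dimension, leaving the remaining dimensions free to capture unlabelled variation in the demonstrations.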
For example, the concept of pressing softly against a surface manifests itself in a data stream associated with the 7 DoF real-valued space of joint efforts, spread across multiple time steps. However, the essence of what differentiates one type of soft press from another nearby concept can be summarised conceptually using a lower-dimensional abstract space. The focus of this work is finding a nonlinear mapping (represented as a high-capacity neural model) between such a

