LEARNING FROM DEMONSTRATION WITH WEAKLY SUPERVISED DISENTANGLEMENT

Abstract

Robotic manipulation tasks, such as wiping with a soft sponge, require control from multiple rich sensory modalities. Human-robot interaction aimed at teaching robots is difficult in this setting, as there is potential for mismatch between human and machine comprehension of the rich data streams. We treat the task of interpretable learning from demonstration as an optimisation problem over a probabilistic generative model. To account for the high dimensionality of the data, a high-capacity neural network is chosen to represent the model. The latent variables in this model are explicitly aligned with high-level notions and concepts that are manifested in a set of demonstrations. We show that such alignment is best achieved through the use of labels from the end user, drawn from an appropriately restricted vocabulary, in contrast to the conventional approach of the designer picking a prior over the latent variables. Our approach is evaluated in the context of two table-top robot manipulation tasks performed by a PR2 robot: dabbing liquids with a sponge (forcefully pressing a sponge and moving it along a surface) and pouring between different containers. The robot provides visual information, arm joint positions and arm joint efforts. Videos of the tasks and data are available in the supplementary materials at: https://sites.google.com/view/weak-label-lfd.

1. INTRODUCTION

Learning from Demonstration (LfD) (Argall et al., 2009) is a commonly used paradigm in which a potentially imperfect demonstrator teaches a robot how to perform a particular task in its environment. Most often this is achieved through a combination of kinaesthetic teaching and supervised learning, i.e. imitation learning (Ross et al., 2011). However, such approaches do not allow elaborations and corrections from the demonstrator to be seamlessly incorporated. As a result, new demonstrations are required whenever the demonstrator changes the task specification or the agent changes its context, both typical scenarios in interactive task learning (Laird et al., 2017). Such problems arise mainly because the demonstrator and the agent reason about the world using notions and mechanisms at different levels of abstraction. An LfD setup that can accommodate abstract user specifications requires establishing a mapping from the high-level notions humans use (e.g. spatial concepts, different ways of applying force) to the low-level perceptive and control signals robot agents utilise (e.g. joint angles, efforts and camera images). With this in place, any constraints or elaborations from the human operator must be mapped to behaviour on the agent's side that is consistent with the semantics of the operator's desires. Concretely, we need to be able to ground (Vogt, 2002; Harnad, 1990) the specifications and symbols used by the operator in the actions and observations of the agent. The actions and observations of a robot agent are often high-dimensional (high-DoF kinematic chains, high image resolution, etc.), making the symbol grounding problem non-trivial. However, the concepts we need to ground lie on a much lower-dimensional manifold, embedded in the high-dimensional data space (Fefferman et al., 2016).
For example, the concept of pressing softly against a surface manifests itself in a data stream associated with the 7-DoF real-valued space of joint efforts, spread across multiple time steps. However, the essence of what differentiates one type of soft press from another nearby concept can be summarised using a lower-dimensional abstract space. The focus of this work is finding a nonlinear mapping (represented as a high-capacity neural model) between such a low-dimensional manifold and the high-dimensional ambient space of cross-modal data. Moreover, we show that, apart from finding such a mapping, we can also shape and refine the low-dimensional manifold by imposing specific biases and structures on the neural model's architecture and training regime. In this paper, we propose a framework that allows human operators to teach a PR2 robot about different spatial, temporal and force-related aspects of a robotic manipulation task, using tabletop dabbing and pouring as our main examples. These serve as concrete representative tasks that incorporate key issues specific to robotics (e.g. continuous actions, conditional switching dynamics, forceful interactions, and discrete categorisations of these); numerous other applications require the very same capability. Our main contributions are:

• A learning method which incorporates information from multiple high-dimensional modalities (vision, joint angles, joint efforts) to instill a disentangled low-dimensional manifold (Locatello et al., 2019). By using weak expert labels during the optimisation process, the manifold is aligned with the human demonstrators' 'common sense' notions in a natural and controlled way, without the need for separate post-hoc interpretation.

• A dataset of subjective concepts grounded in multi-modal demonstrations, which we release. Using this, we evaluate whether discrete or continuous latent variables, both shaped by the discrete user labels, better capture the demonstrated continuous notions.
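One simple way the weak-label shaping of the latent manifold could be realised is sketched below. This is an illustrative construction, not the paper's exact objective: each label group supervises a disjoint slice of the latent vector through a per-group linear classifier, whose cross-entropy is added to the training loss. All group names, sizes and the slicing scheme are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a 6-dim latent c and two label groups
# (e.g. {left, centre, right} for position, {soft, hard} for force),
# each tied to its own slice of c. Purely illustrative.
GROUPS = {"position": (slice(0, 3), 3), "force": (slice(3, 5), 2)}
W = {g: rng.normal(0, 0.1, (sl.stop - sl.start, n)) for g, (sl, n) in GROUPS.items()}

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def weak_label_loss(c, labels):
    """Sum of per-group cross-entropies of linear classifiers on latent slices."""
    loss = 0.0
    for g, (sl, n) in GROUPS.items():
        probs = softmax(c[sl] @ W[g])
        loss -= np.log(probs[labels[g]] + 1e-12)
    return loss

c = rng.standard_normal(6)              # latent from the recognition network
labels = {"position": 0, "force": 1}    # weak labels from the demonstrator
print(weak_label_loss(c, labels))       # a positive scalar cross-entropy
```

Because the classifier for each group only sees its own slice of c, gradients from a given label group flow into a dedicated subset of latent coordinates, which is one plausible route to the per-concept alignment described above.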

2. IMAGE-CONDITIONED TRAJECTORY & LABEL MODEL

In the task we use to demonstrate our ideas, which is representative of a broad class of robotic manipulation tasks, we want to control where and how a robot performs an action through a user specification defined by a set of coarse labels, e.g. "press softly and slowly behind the cube in the image". In this context, let x denote a K × T dimensional trajectory for K robot joints over a fixed time length T; let y denote a set of discrete labels semantically grouped in N label groups G = {g_1, . . . , g_N} (equivalent to a multi-label classification problem); and let i denote an RGB image¹. The labels y describe qualitative properties of x on its own and of x with respect to i, e.g. left dab, right dab, hard dab, soft dab, etc. We aim to model the distribution of demonstrated robot-arm trajectories x and corresponding user labels y, conditioned on a visual environment context i. This problem is equivalent to that of structured output representation (Walker et al., 2016; Sohn et al., 2015; Bhattacharyya et al., 2018): finding a one-to-many mapping from i to {x, y}, since one image can be part of the generation of many robot trajectories and labels. For this we use a conditional generative model whose latent variables c = {c_s, c_e, c_u} can accommodate the aforementioned mapping; see Figure 2. The meaning behind the different types of latent variables (c_s, c_e and c_u) is elaborated in Section 3.
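In heavily simplified form, the conditional generative model described above resembles a conditional VAE over trajectories. The following numpy sketch illustrates the basic mechanics under assumed dimensions (7 joints, 50 timesteps, a 32-dim image code, and a single 6-dim latent in place of the paper's structured c = {c_s, c_e, c_u}); all sizes and network shapes are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: K joints, T timesteps, image code i, latent c.
K, T, DIM_I, DIM_C = 7, 50, 32, 6

def linear(din, dout):
    return rng.normal(0, 0.1, (din, dout)), np.zeros(dout)

# Recognition network q(c | x, i): outputs mean and log-variance of c.
W_q, b_q = linear(K * T + DIM_I, 2 * DIM_C)
# Decoder p(x | c, i): maps latent + image code back to a trajectory.
W_d, b_d = linear(DIM_C + DIM_I, K * T)

def encode(x, i):
    h = np.concatenate([x.ravel(), i]) @ W_q + b_q
    return h[:DIM_C], h[DIM_C:]          # mean, log-variance

def reparameterise(mu, logvar):
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(DIM_C)

def decode(c, i):
    return (np.concatenate([c, i]) @ W_d + b_d).reshape(K, T)

def elbo(x, i):
    mu, logvar = encode(x, i)
    c = reparameterise(mu, logvar)
    x_hat = decode(c, i)
    recon = -0.5 * np.sum((x - x_hat) ** 2)                   # Gaussian log-lik. (up to const.)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))   # KL(q || N(0, I))
    return recon - kl

x = rng.standard_normal((K, T))   # a demonstrated trajectory (dummy data)
i = rng.standard_normal(DIM_I)    # image code from the jointly trained encoder
print(elbo(x, i))
```

The one-to-many mapping from i to {x, y} arises because sampling different c from the prior and decoding with the same i yields different trajectories; label heads on c (omitted here) would produce y.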



¹ What i actually represents is a lower-dimensional encoding of the original RGB image I. The parameters of the image encoder are optimised jointly with the parameters of the recognition and decoder networks.



Figure 1: User demos through teleoperation and a variety of modalities (left) are used to fit a common low-level disentangled manifold (middle) which contributes in an interpretable way to the generative process for new robot behaviour (right).

