WASSERSTEIN GRADIENT FLOWS FOR OPTIMIZING GMM-BASED POLICIES

Anonymous authors
Paper under double-blind review

Abstract

Robots often rely on a repertoire of previously-learned motion policies for performing tasks of diverse complexities. When facing unseen task conditions or when new task requirements arise, robots must adapt their motion policies accordingly. In this context, policy optimization is the de facto paradigm to adapt robot policies as a function of task-specific objectives. Most commonly-used motion policies carry particular structures that are often overlooked in policy optimization algorithms. We instead propose to leverage the structure of probabilistic policies by casting the policy optimization as an optimal transport problem. Specifically, we focus on robot motion policies that build on Gaussian mixture models (GMMs) and formulate the policy optimization as a Wasserstein gradient flow over the space of GMMs. This naturally allows us to constrain the policy updates via the L2-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Furthermore, we leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization. We evaluate our approach on common robotic settings: reaching motions, collision-avoidance behaviors, and multi-goal tasks. Our results show that our method outperforms common policy optimization baselines by achieving higher task success rates and lower-variance solutions.
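The L2-Wasserstein distance between Gaussians mentioned above has a well-known closed form (the Bures-Wasserstein distance): W2²(N(μ1, Σ1), N(μ2, Σ2)) = ‖μ1 − μ2‖² + tr(Σ1 + Σ2 − 2(Σ2^{1/2} Σ1 Σ2^{1/2})^{1/2}). The following is a minimal illustrative sketch of this formula, not the paper's implementation; the function name `gaussian_w2` is ours, and SciPy is assumed for the matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm


def gaussian_w2(mu1, cov1, mu2, cov2):
    """Squared Bures-Wasserstein (L2-Wasserstein) distance between two Gaussians.

    W2^2 = ||mu1 - mu2||^2 + tr(cov1 + cov2 - 2 (cov2^{1/2} cov1 cov2^{1/2})^{1/2})
    """
    sqrt_cov2 = sqrtm(cov2)
    # sqrtm may return a complex array with negligible imaginary part
    cross = np.real(sqrtm(sqrt_cov2 @ cov1 @ sqrt_cov2))
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + cov_term)
```

For two 1-D Gaussians this reduces to (μ1 − μ2)² + (σ1 − σ2)², so e.g. N(0, 1) vs. N(2, 1) gives a squared distance of 4. Extending this per-component distance to full GMMs requires solving an optimal transport problem over the mixture weights, which is part of what the paper's formulation addresses.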

1. INTRODUCTION

One of the main premises about autonomous robots is their ability to successfully perform a large range of tasks in unstructured environments. This demands that robots adapt their task models according to environment changes, and consequently adjust their actions to perform successfully under unseen conditions (Peters et al., 2016). In general, robotic tasks, e.g. picking or inserting an object, are usually executed by composing previously-learned skills (Schaal et al., 2003), each represented by a motion policy. Therefore, in order to perform successfully under new settings, the robot should adapt its motion policies according to the new task requirements and conditions. Research on methods for robot motion policy adaptation is vast (Kober et al., 2013; Chatzilygeroudis et al., 2020), with approaches mainly building on black-box optimizers (Stulp & Sigaud, 2012), end-to-end deep reinforcement learning (Ibarz et al., 2021), and policy search (Deisenroth et al., 2013). Regardless of the optimization method, most approaches rely on policy structure-unaware adaptation strategies. However, several motion policy models (e.g., dynamic movement primitives (DMPs) (Ijspeert et al., 2013), Gaussian mixture models (GMMs) (Calinon et al., 2007), probabilistic movement primitives (ProMPs) (Paraschos et al., 2018), and neural networks (Bahl et al., 2020), among others) carry specific physical or probabilistic structures that should not be ignored. First, these policy models are often learned from demonstrations in an initial learning phase (Schaal et al., 2003), thus the policy structure already encapsulates relevant prior information about the skill. Second, structure-unaware adaptation strategies optimize the policy parameters disregarding the special characteristics of the policy model (e.g., a DMP represents a second-order dynamical system).
In this regard, we hypothesize that the policy structure may be leveraged to better control the adaptation strategy via policy structure-aware gradients and trust regions. Our main idea is to design a policy optimization strategy that explicitly builds on a particular policy structure. Specifically, we focus on GMM policies, which have been widely used to learn motion skills from human demonstrations (Calinon et al., 2007; Cederborg et al., 2010; Calinon, 2016;  

