WASSERSTEIN GRADIENT FLOWS FOR OPTIMIZING GMM-BASED POLICIES

Anonymous authors
Paper under double-blind review

Abstract

Robots often rely on a repertoire of previously-learned motion policies for performing tasks of diverse complexities. When facing unseen task conditions or when new task requirements arise, robots must adapt their motion policies accordingly. In this context, policy optimization is the de facto paradigm to adapt robot policies as a function of task-specific objectives. Most commonly-used motion policies carry particular structures that are often overlooked in policy optimization algorithms. We instead propose to leverage the structure of probabilistic policies by casting the policy optimization as an optimal transport problem. Specifically, we focus on robot motion policies that build on Gaussian mixture models (GMMs) and formulate the policy optimization as a Wasserstein gradient flow over the space of GMMs. This naturally allows us to constrain the policy updates via the L2-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Furthermore, we leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization. We evaluate our approach on common robotic settings: reaching motions, collision-avoidance behaviors, and multi-goal tasks. Our results show that our method outperforms common policy optimization baselines in terms of task success rate while providing low-variance solutions.

1 INTRODUCTION

One of the main premises about autonomous robots is their ability to successfully perform a large range of tasks in unstructured environments. This demands that robots adapt their task models according to environment changes, and consequently adjust their actions to successfully perform under unseen conditions (Peters et al., 2016). In general, robotic tasks, e.g., picking or inserting an object, are usually executed by composing previously-learned skills (Schaal et al., 2003), each represented by a motion policy. Therefore, in order to successfully perform under new settings, the robot should adapt its motion policies according to the new task requirements and conditions. Research on methods for robot motion policy adaptation is vast (Kober et al., 2013; Chatzilygeroudis et al., 2020), with approaches mainly building on black-box optimizers (Stulp & Sigaud, 2012), end-to-end deep reinforcement learning (Ibarz et al., 2021), and policy search (Deisenroth et al., 2013). Regardless of the optimization method, most approaches rely on policy structure-unaware adaptation strategies. However, several motion policy models (e.g., dynamic movement primitives (DMPs) (Ijspeert et al., 2013), Gaussian mixture models (GMMs) (Calinon et al., 2007), probabilistic movement primitives (ProMPs) (Paraschos et al., 2018), and neural networks (Bahl et al., 2020), among others) carry specific physical or probabilistic structures that should not be ignored. First, these policy models are often learned from demonstrations in an initial learning phase (Schaal et al., 2003), thus the policy structure already encapsulates relevant prior information about the skill. Second, structure-unaware adaptation strategies optimize the policy parameters while disregarding the special characteristics of the policy model (e.g., a DMP represents a second-order dynamical system).
In this regard, we hypothesize that the policy structure may be leveraged to better control the adaptation strategy via policy structure-aware gradients and trust regions. Our main idea is to design a policy optimization strategy that explicitly builds on a particular policy structure. Specifically, we focus on GMM policies, which have been widely used to learn motion skills from human demonstrations (Calinon et al., 2007; Cederborg et al., 2010; Calinon, 2016; Jaquier et al., 2019). GMMs provide a simple yet expressive representation for learning a large variety of robot skills: stable dynamic motions (Khansari-Zadeh & Billard, 2011; Ravichandar et al., 2017; Figueroa & Billard, 2018), collaborative behaviors (Ewerton et al., 2015; Rozo et al., 2016), and contact-rich manipulation (Lin et al., 2012; Abu-Dakka et al., 2018), among others. Often, skills learned from demonstrations need to be refined (due to imperfect data) or adapted to comply with new task requirements. In this context, existing adaptation strategies for GMM policies either build a kernel method on top of the original skill model (Huang et al., 2019) or leverage reinforcement learning (RL) to adapt the policy itself (Arenz et al., 2020; Nematollahi et al., 2022). However, none of these techniques explicitly considers the structure of the GMM policy.

Unlike the aforementioned approaches, we propose a policy optimization technique that explicitly considers the underlying GMM structure. To do so, we exploit optimal transport theory (Santambrogio, 2015; Peyré & Cuturi, 2019), which allows us to view the set of GMM policies as a particular space of probability distributions GMM_d. Specifically, we leverage the idea of Chen et al. (2019) and Delon & Desolneux (2020) to view a GMM as a discrete measure (a set of Dirac masses) on the space of Gaussian distributions G(R^d), which is endowed with a Wasserstein distance (see § 2). This allows us to formulate the policy optimization as a Wasserstein gradient flow (WGF) over the space of GMMs (as illustrated in Fig. 1 and explained in § 3), where the policy updates are naturally guaranteed to be GMMs. Moreover, we take advantage of the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of a GMM policy via Riemannian optimization. We evaluate our approach over a set of different GMM policies featuring common robot skills: reaching motions, collision-avoidance behaviors, and multi-goal tasks (see § 4). Our results show that our method outperforms common policy optimization baselines in terms of task success rate while providing low-variance solutions.

Related Work: Richemond & Maginnis (2017) pioneered the idea of understanding policy optimization through the lens of optimal transport. They interpreted policy iteration as a gradient flow by leveraging the implicit Euler scheme under a Wasserstein distance (see § 2), considering only 1-step return settings. They observed that the resulting policy optimization resembles the gradient flow of the Fokker-Planck equation (JKO scheme) (Jordan et al., 1996). In a similar spirit, Zhang et al. (2018) proposed to use WGFs to formulate policy optimization as a sequence of policy updates traveling along a gradient flow on the space of probability distributions until convergence. To solve the WGF problem, the authors proposed a particle-based algorithm to approximate continuous density functions and subsequently derived the gradients for particle updates based on the JKO scheme. Although Zhang et al. (2018) considered general parametric policies, their method assumed a distribution over the policy parameters and did not consider a specific policy structure, which partially motivated their particle-based approximation. Recently, Mokrov et al. (2021) tackled the computational burden of particle methods by leveraging input-convex neural networks to approximate the WGF computation. They reformulated the well-known JKO optimization (Jordan et al., 1996) over probability measures as an optimization over convex functions. Yet, this work remains a general solution for WGF computation and does not address its use for policy optimization problems. Aside from optimal transport approaches, Arenz et al. (2020) proposed a trust-region variational inference method for GMMs to approximate multimodal distributions. Although not originally designed for policy optimization, the authors elucidated a connection to learning GMMs of policy parameters in black-box RL. However, their method cannot be directly applied to our GMM policy adaptation setting, nor does it consider the GMM structure from an optimal transport perspective. Nematollahi et al. (2022) proposed SAC-GMM, a hybrid model that employs the well-known SAC algorithm (Haarnoja et al., 2018) to refine dynamic skills encoded by GMMs. The SAC policy was designed to learn residuals on a single vectorized stack of GMM parameters, thus fully disregarding the GMM structure and the geometric constraints of its parameters. Finally, two recent

Figure 1: Illustration of our policy structure-aware adaptation of GMM policies. Policy updates follow a Wasserstein gradient flow on the manifold of GMM policies GMM_d.
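To make the discrete-measure view concrete, the following sketch (our own illustration, not code from the paper) shows how the Wasserstein-type distance between two GMMs of Delon & Desolneux (2020) reduces to a small linear program: the ground cost between mixture components is the closed-form Bures-Wasserstein (L2-Wasserstein) distance between Gaussians, and the coupling is optimized over the component weights. All function and variable names are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def bures_wasserstein_sq(m0, S0, m1, S1):
    """Squared W2 distance between Gaussians N(m0, S0) and N(m1, S1):
    ||m0 - m1||^2 + tr(S0 + S1 - 2 (S0^{1/2} S1 S0^{1/2})^{1/2})."""
    root = np.real(sqrtm(S0))
    cross = np.real(sqrtm(root @ S1 @ root))
    gap = np.trace(S0 + S1 - 2.0 * cross)
    return float(np.sum((m0 - m1) ** 2) + max(gap, 0.0))  # clip tiny negatives

def mw2_sq(w0, comps0, w1, comps1):
    """Squared mixture-level distance: optimal transport between the component
    weights w0, w1 with Bures-Wasserstein ground cost (a K0 x K1 linear program)."""
    K0, K1 = len(w0), len(w1)
    cost = np.array([[bures_wasserstein_sq(m0, S0, m1, S1)
                      for (m1, S1) in comps1] for (m0, S0) in comps0])
    A_eq, b_eq = [], []
    for i in range(K0):                      # row marginals: sum_j T_ij = w0[i]
        row = np.zeros(K0 * K1)
        row[i * K1:(i + 1) * K1] = 1.0
        A_eq.append(row); b_eq.append(w0[i])
    for j in range(K1):                      # column marginals: sum_i T_ij = w1[j]
        col = np.zeros(K0 * K1)
        col[j::K1] = 1.0
        A_eq.append(col); b_eq.append(w1[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (K0 * K1), method="highs")
    return float(res.fun)

# Two 2-component GMMs that differ only in one component's mean:
gmm_a = ([0.5, 0.5], [(np.zeros(2), np.eye(2)), (np.array([3.0, 0.0]), np.eye(2))])
gmm_b = ([0.5, 0.5], [(np.zeros(2), np.eye(2)), (np.array([3.0, 1.0]), np.eye(2))])
print(mw2_sq(*gmm_a, *gmm_b))  # ≈ 0.5: half the mass moves a squared distance of 1
```

A solver of this kind only yields the distance itself; the paper's policy updates additionally require gradient-flow steps on the Bures-Wasserstein manifold, which this sketch does not cover.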

