WASSERSTEIN GRADIENT FLOWS FOR OPTIMIZING GMM-BASED POLICIES Anonymous authors Paper under double-blind review

Abstract

Robots often rely on a repertoire of previously-learned motion policies for performing tasks of diverse complexities. When facing unseen task conditions or when new task requirements arise, robots must adapt their motion policies accordingly. In this context, policy optimization is the de facto paradigm to adapt robot policies as a function of task-specific objectives. Most commonly-used motion policies carry particular structures that are often overlooked in policy optimization algorithms. We instead propose to leverage the structure of probabilistic policies by casting the policy optimization as an optimal transport problem. Specifically, we focus on robot motion policies that build on Gaussian mixture models (GMMs) and formulate the policy optimization as a Wasserstein gradient flow over the space of GMMs. This naturally allows us to constrain the policy updates via the $L_2$-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Furthermore, we leverage the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of the GMM policy via Riemannian optimization. We evaluate our approach on common robotic settings: reaching motions, collision-avoidance behaviors, and multi-goal tasks. Our results show that our method outperforms common policy optimization baselines in terms of task success rate and low-variance solutions.

1. INTRODUCTION

One of the main premises about autonomous robots is their ability to successfully perform a large range of tasks in unstructured environments. This demands robots to adapt their task models according to environment changes, and consequently to adjust their actions to successfully perform under unseen conditions (Peters et al., 2016) . In general, robotic tasks, e.g. picking or inserting an object, are usually executed by composing previously-learned skills (Schaal et al., 2003) , each represented by a motion policy. Therefore, in order to successfully perform under new settings, the robot should adapt its motion policies according to the new task requirements and conditions. Research on methods for robot motion policy adaptation is vast (Kober et al., 2013; Chatzilygeroudis et al., 2020) , with approaches mainly building on black-box optimizers (Stulp & Sigaud, 2012) , end-to-end deep reinforcement learning (Ibarz et al., 2021) , and policy search (Deisenroth et al., 2013) . Regardless of the optimization method, most approaches rely on policy structure-unaware adaptation strategies. However, several motion policy models (e.g, dynamic movement primitives (DMP) (Ijspeert et al., 2013) , Gaussian mixture models (GMM) (Calinon et al., 2007) , probabilistic movement primitives (ProMPs) (Paraschos et al., 2018) , and neural networks (Bahl et al., 2020) , among others), carry specific physical or probabilistic structures that should not be ignored. First, these policy models are often learned from demonstrations in a starting learning phase (Schaal et al., 2003) , thus the policy structure already encapsulates relevant prior information about the skill. Second, structure-unaware adaptation strategies optimize the policy parameters disregarding the special characteristics of the policy model (e.g., a DMP represents a second-order dynamical system). 
In this regard, we hypothesize that the policy structure may be leveraged to better control the adaptation strategy via policy structure-aware gradients and trust regions. Our main idea is to design a policy optimization strategy that explicitly builds on a particular policy structure. Specifically, we focus on GMM policies, which have been widely used to learn motion skills from human demonstrations (Calinon et al., 2007; Cederborg et al., 2010; Calinon, 2016; Jaquier et al., 2019). GMMs provide a simple but expressive enough representation for learning a large variety of robot skills: stable dynamic motions (Khansari-Zadeh & Billard, 2011; Ravichandar et al., 2017; Figueroa & Billard, 2018), collaborative behaviors (Ewerton et al., 2015; Rozo et al., 2016), and contact-rich manipulation (Lin et al., 2012; Abu-Dakka et al., 2018), among others. Often, skills learned from demonstrations need to be refined, due to imperfect data, or adapted to comply with new task requirements. In this context, existing adaptation strategies for GMM policies either build a kernel method on top of the original skill model (Huang et al., 2019), or leverage reinforcement learning (RL) to adapt the policy itself (Arenz et al., 2020; Nematollahi et al., 2022). However, none of these techniques explicitly considered the structure of the GMM policy.

Figure 1: Illustration of our policy structure-aware adaptation of GMM policies. Policy updates follow a Wasserstein gradient flow on the manifold of GMM policies $\mathrm{GMM}_d$.

Unlike the aforementioned approaches, we propose a policy optimization technique that explicitly considers the underlying GMM structure. To do so, we exploit optimal transport theory (Santambrogio, 2015; Peyré & Cuturi, 2019), which allows us to view the set of GMM policies as a particular space of probability distributions $\mathrm{GMM}_d$. Specifically, we leverage the idea of Chen et al.
(2019) and Delon & Desolneux (2020) to view a GMM as a set of discrete measures (dirac masses) on the space of Gaussian distributions G(R d ), which is endowed with a Wasserstein distance (see § 2). This allows us to formulate the policy optimization as a Wasserstein gradient flow (WGF) over the space of GMMs (as illustrated in Fig. 1 and explained in §3), where the policy updates are naturally guaranteed to be GMMs. Moreover, we take advantage of the geometry of the Bures-Wasserstein manifold to optimize the Gaussian distributions of a GMM policy via Riemannian optimization. We evaluate our approach over a set of different GMM policies featuring common robot skills: Reaching motions, collision-avoidance behaviors and multi-goal tasks (see § 4). Our results show that our method outperforms common policy optimization baselines in terms of task success rate while providing low-variance solutions. Related Work : Richemond & Maginnis (2017) pioneered the idea of understanding policy optimization through the lens of optimal transport. They interpreted the policy iteration as gradient flows by leveraging the implicit Euler scheme under a Wasserstein distance (see § 2), considering only 1-step return settings. They observed that the resulting policy optimization resembles the gradient flow of the Fokker-Planck equation (JKO scheme) (Jordan et al., 1996) . In a similar spirit, Zhang et al. (2018) proposed to use WGFs to formulate policy optimization as a sequence of policy updates traveling along a gradient flow on the space of probability distributions until convergence. To solve the WGF problem, the authors proposed a particle-based algorithm to approximate continuous density functions and subsequently derived the gradients for particle updates based on the JKO scheme. Although Zhang et al. 
(2018) considered general parametric policies, their method assumed a distribution over the policy parameters and did not consider a specific policy structure, which partially motivated their particle-based approximation. Recently, Mokrov et al. (2021) tackled the computational burden of particle methods by leveraging input-convex neural networks to approximate the WGF computation. They reformulated the well-known JKO optimization (Jordan et al., 1996) over probability measures as an optimization over convex functions. Yet, this work remains a general solution for WGF computation and did not address its use for policy optimization problems. Aside from optimal transport approaches, Arenz et al. (2020) proposed a trust-region variational inference for GMMs to approximate multimodal distributions. Although not originally designed for policy optimization, the authors elucidated a connection to learning GMMs of policy parameters in black-box RL. However, their method cannot directly be applied to our GMM policy adaptation setting, nor does it consider the GMM structure from an optimal transport perspective. Nematollahi et al. (2022) proposed SAC-GMM, a hybrid model that employs the well-known SAC algorithm (Haarnoja et al., 2018) to refine dynamic skills encoded by GMMs. The SAC policy was designed to learn residuals on a single vectorized stack of GMM parameters, thus fully disregarding the GMM structure and the geometric constraints of its parameters. Finally, two recent works share our idea of leveraging geometry in policy optimization. First, a Riemannian proximal policy optimization for GMMs was proposed by Wang et al. (2020), where the geometry induced by the GMM parameters was considered in the optimization via Riemannian gradients, similarly to our method. The policy optimization was regularized by a Wasserstein distance to control the exploration-exploitation trade-off.
However, their method did not formulate the policy optimization as an optimal transport problem, i.e. the policy updates do not follow a WGF, as in our approach, but it employed instead a classical non-convex optimization. Second, Moskovitz et al. (2021) employed the Wasserstein natural gradient to exploit the local geometry induced by the Wasserstein regularization of behavioral policy optimization (Pacchiano et al., 2020) . In contrast, our method exploits the geometry induced by the structure of the space of GMM policies via the Bures-Wasserstein manifold, which naturally guarantees that policy updates stay on GMM d .

2.1. WASSERSTEIN GRADIENT FLOWS

In Euclidean space, a gradient flow is a smooth curve $x : \mathbb{R} \to \mathbb{R}^d$ that satisfies the ordinary differential equation (ODE) $\dot{x}(t) = -\nabla L(x(t))$ for a given loss function $L : \mathbb{R}^d \to \mathbb{R}$ and starting point $x_0$ at $t = 0$ (Santambrogio, 2015; 2017). A solution can be found straightforwardly by forward discretization, leading to the well-known explicit Euler update scheme $x^\tau_{k+1} = x^\tau_k - \lambda \nabla L(x^\tau_k)$, where $\lambda$ denotes the learning rate and $x^\tau$ indicates a discretization of the curve $x(t)$ with discretization parameter $\tau$. Alternatively, we can use a backward discretization, which leads to the following implicit Euler scheme

$$x^\tau_{k+1} = \arg\min_{x} \left( \frac{\|x - x^\tau_k\|^2}{2\tau} + L(x) \right). \tag{1}$$

Eq. 1 is sometimes referred to as the Minimizing Movement Scheme and can be used as an alternative characterization of a gradient flow. This characterization is particularly interesting when we need to extend the concept of gradient flows to general (non-Euclidean) metric settings, since there is no notion of $\nabla L$ in these cases (Santambrogio, 2015; Ambrosio et al., 2005). Eq. 1 does not involve any gradients and can be expressed using only metric quantities. In this work, we are particularly interested in gradient flows in the $L_2$-Wasserstein space, defined as the set of probability measures $\mathcal{P}(X)$ on a separable Banach space $X$ (Panaretos & Zemel, 2020) and endowed with the $L_2$-Wasserstein distance $W_2$ defined as

$$W_2(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int_{X \times X} \|x_1 - x_2\|^2 \, d\gamma(x_1, x_2) \right)^{1/2}, \tag{2}$$

where $\mu, \nu \in \mathcal{P}(X)$ and $\gamma \in \mathcal{P}(X^2)$ is a coupling whose two marginals are $\mu$ and $\nu$. A Generalized Minimizing Movement scheme characterizing gradient flows in the Wasserstein space can be written in analogy to Eq. 1 as:

$$\pi^\tau_{k+1} = \arg\min_{\pi} \left( \frac{W_2^2(\pi, \pi^\tau_k)}{2\tau} + L(\pi) \right), \tag{3}$$

where $L$ is a functional to be minimized on the Wasserstein space and $\pi_k \in \mathcal{P}(X)$. In the following, we will omit the superscript $\tau$ for notational convenience.
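To make the difference between the two discretizations concrete, the following minimal sketch compares one explicit and one implicit Euler step in Euclidean space. It assumes a quadratic loss $L(x) = \frac{1}{2} x^\top A x$ (chosen only because the minimizing movement problem then has a closed-form solution); the function names and the matrix `A` are illustrative and not part of the method described here.

```python
import numpy as np


def explicit_euler_step(x, grad, lam):
    """Forward discretization: x_{k+1} = x_k - lam * grad L(x_k)."""
    return x - lam * grad(x)


def implicit_euler_step_quadratic(x, A, tau):
    """Backward discretization (minimizing movement step) for L(y) = 0.5 * y^T A y.

    The minimizer of ||y - x||^2 / (2 * tau) + 0.5 * y^T A y satisfies
    (I + tau * A) y = x, so the step reduces to a linear solve.
    """
    return np.linalg.solve(np.eye(len(x)) + tau * A, x)


# Illustrative quadratic loss (assumption): A is symmetric positive definite.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad = lambda x: A @ x

x_exp = np.array([1.0, -2.0])
x_imp = np.array([1.0, -2.0])
for _ in range(200):
    x_exp = explicit_euler_step(x_exp, grad, lam=0.1)
    x_imp = implicit_euler_step_quadratic(x_imp, A, tau=0.1)

print(np.linalg.norm(x_exp), np.linalg.norm(x_imp))  # both iterates approach the minimizer at 0
```

Note that the implicit step never evaluates a gradient of the iterate it starts from; it only solves a proximal problem, which is the property that carries over to metric spaces.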

2.2. REINFORCEMENT LEARNING AS WASSERSTEIN GRADIENT FLOWS

Our view of the policy structure-aware optimization builds on the approach outlined by Richemond & Maginnis (2017), which in turn is based on the JKO scheme of Jordan et al. (1996). They proposed a formulation of 1-step RL problems in terms of Wasserstein gradient flows. In particular, they studied the evolution of a policy $\pi$ under the influence of a free energy functional $J$ of the form:

$$J(\pi) = K_r(\pi) + \beta H(\pi) = \int_{\mathcal{A}} d\pi(a|s)\, r(s, a) - \beta \int_{\mathcal{A}} d\pi(a|s) \log(\pi(a|s)), \tag{4}$$

where $K_r(\pi)$ denotes the inner energy of the system, here determined by the reward $r(s, a)$. Moreover, $H(\pi)$ is the entropy of the policy $\pi(a|s)$, with $s$ and $a$ denoting the state and action, respectively. Thus Eq. 4 can be recognized as the usual objective in 1-step RL settings with entropy regularization. It is well known that the evolution of probability densities under a free energy of this form is properly described by a PDE known as the Fokker-Planck equation. Richemond & Maginnis (2017) exploited the result of Jordan et al. (1996), which stated that this evolution can be interpreted as the gradient flow of the functional $J$ in Wasserstein space. This flow is characterized by the following minimizing movement scheme

$$\pi_{k+1} = \arg\min_{\pi} \left( \frac{W_2^2(\pi, \pi_k)}{2\tau} - J(\pi) \right), \tag{5}$$

which naturally provides iterative updates for the policy $\pi$. While Richemond & Maginnis (2017) considered a 1-step bandit setting, we extend this approach to full multi-step RL problems and learn policies for long-horizon tasks.

2.3. THE L 2 -WASSERSTEIN DISTANCE BETWEEN GAUSSIAN MIXTURE MODELS (GMMS)

In this paper, we consider policies $\pi(x)$ that build on a GMM structure, i.e.,

$$\pi(x) = \sum_{i=1}^{N} \omega_i \, \mathcal{N}(x; \mu_i, \Sigma_i),$$

where $\mathcal{N}$ denotes a multivariate Gaussian distribution with mean $\mu_i$ and covariance matrix $\Sigma_i$, and $\omega_i$ are the weights of the $N$ individual Gaussian components, which are subject to $\sum_i \omega_i = 1$. In the following, we will write $\hat{\mu}$, $\hat{\Sigma}$ and $\hat{\omega}$ to denote the stacked means, covariance matrices and weights of the $N$ components. Therefore, we do not consider WGFs on the full manifold of probability distributions (Wasserstein space) $\mathcal{P}(\mathbb{R}^d)$ but rather focus on WGFs evolving on the submanifold of GMMs, that is $\mathrm{GMM}_d \subset \mathcal{P}(\mathbb{R}^d)$. Following Chen et al. (2019) and Delon & Desolneux (2020), we can approximately describe this submanifold as a discrete distribution over the space of Gaussian distributions equipped with the Wasserstein metric. This in turn can be identified with the Bures-Wasserstein manifold, which is the product manifold $\mathbb{R}^d \times \mathcal{S}^d_{++}$, where $\mathcal{S}^d_{++}$ denotes the Riemannian manifold of $d$-dimensional symmetric positive definite matrices. The corresponding approximated Wasserstein distance between two GMMs $\pi_1, \pi_2$ is given by

$$W_2^2\big(\pi_1(x), \pi_2(x)\big) = \min_{P \in U(\omega_1, \omega_2)} \sum_{i,j}^{N} P_{ij} \, W_2^2\big(\mathcal{N}_1(x; \mu_i, \Sigma_i), \mathcal{N}_2(x; \mu_j, \Sigma_j)\big), \tag{6}$$

where $U(\omega_1, \omega_2) = \{P \in \mathbb{R}_+^{N \times N} \mid P 1_N = \omega_1, \; P^\top 1_N = \omega_2\}$ with $1_N$ denoting an $N$-dimensional vector of ones. The Wasserstein distance between two Gaussian distributions in Eq. 6 can be computed analytically as follows

$$W_2^2\big(\mathcal{N}_1(x; \mu_i, \Sigma_i), \mathcal{N}_2(x; \mu_j, \Sigma_j)\big) = \|\mu_i - \mu_j\|^2 + \mathrm{tr}\Big(\Sigma_i + \Sigma_j - 2\big(\Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2}\big)^{1/2}\Big). \tag{7}$$
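The closed-form distance between two Gaussian components, and the pairwise cost matrix that enters the discrete transport problem over the mixture weights, can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments; the helper names `spd_sqrt`, `gaussian_w2_squared` and `pairwise_cost` are hypothetical, and the matrix square root is computed by eigendecomposition under the assumption that all covariances are symmetric positive (semi-)definite.

```python
import numpy as np


def spd_sqrt(S):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(S)
    # Clip tiny negative eigenvalues caused by round-off before taking the root.
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def gaussian_w2_squared(mu_i, Sigma_i, mu_j, Sigma_j):
    """Squared L2-Wasserstein distance between N(mu_i, Sigma_i) and N(mu_j, Sigma_j)."""
    root_i = spd_sqrt(Sigma_i)
    cross = spd_sqrt(root_i @ Sigma_j @ root_i)
    mean_term = np.sum((mu_i - mu_j) ** 2)
    cov_term = np.trace(Sigma_i + Sigma_j - 2.0 * cross)
    return float(mean_term + cov_term)


def pairwise_cost(mus1, Sigmas1, mus2, Sigmas2):
    """Cost matrix C[i, j] = W_2^2 between component i of GMM 1 and component j of GMM 2."""
    return np.array([
        [gaussian_w2_squared(m1, S1, m2, S2) for m2, S2 in zip(mus2, Sigmas2)]
        for m1, S1 in zip(mus1, Sigmas1)
    ])


# Example: isotropic covariances, where the distance can be checked by hand.
d = gaussian_w2_squared(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), 4.0 * np.eye(2))
print(d)  # 25 from the means plus tr(I + 4I - 2 * 2I) = 2 from the covariances
```

The GMM distance is then the optimal transport cost between the two weight vectors under this cost matrix, which is a small linear program over $N \times N$ transport plans.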

2.4. LEARNING GMM POLICIES FROM DEMONSTRATIONS

A popular approach in RL, particularly in the robotics domain, to reduce the number of policy rollouts in the environment is to warm-start the policy with a set of demonstrations provided by an expert. In this work we choose to represent our policy via a GMM. We assume that demonstrations are provided as a set of trajectories $\tau$ of state-action pairs $\tau = \{(s_0, a_0), (s_1, a_1), \ldots, (s_T, a_T)\}$. To initialize our policy, we first use the Expectation-Maximization (EM) algorithm to fit a GMM, in the joint state-action space, to the demonstrations. This results in a mixture distribution

$$\pi(s, a) = \sum_{i=1}^{N} \omega_i \, \mathcal{N}\big([s \;\; a]^\top; \mu_i, \Sigma_i\big),$$

from which a policy can be obtained by conditioning on the state $s$, as follows

$$\pi(a|s) = \frac{\pi(s, a)}{\int \pi(s, a)\, da}. \tag{8}$$

In the context of GMMs, this is also known as Gaussian Mixture Regression (GMR) (Ghahramani & Jordan, 1994). The resulting conditional distribution is another GMM on action space with state-dependent parameters, given by

$$\pi(a_t|s_t) = \sum_{i=1}^{N} \omega_i(s_t) \, \mathcal{N}\big(a_t; \mu^a_i(s_t), \Sigma^a_i\big). \tag{9}$$

Details on the computation of Eq. 9 from the original GMM are given in App. A.1.
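For readers without access to the appendix, the conditioning step can be sketched with the standard Gaussian conditioning identities. This is an illustrative NumPy sketch, not the appendix derivation itself: it assumes each joint mean and covariance is ordered as state block first, action block second, and the names `condition_gmm` and `ds` (state dimension) are hypothetical.

```python
import numpy as np


def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(mu, Sigma) evaluated at x."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2.0 * np.pi) ** len(mu) * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)


def condition_gmm(s, weights, mus, Sigmas, ds):
    """Condition a joint state-action GMM on the state s (Gaussian mixture regression).

    Returns the state-dependent weights, conditional means and conditional
    covariances of the resulting GMM over actions.
    """
    cond_w, cond_mu, cond_Sigma = [], [], []
    for w, mu, Sigma in zip(weights, mus, Sigmas):
        mu_s, mu_a = mu[:ds], mu[ds:]
        S_ss, S_sa = Sigma[:ds, :ds], Sigma[:ds, ds:]
        S_as, S_aa = Sigma[ds:, :ds], Sigma[ds:, ds:]
        gain = S_as @ np.linalg.inv(S_ss)
        # Responsibility of this component for the observed state.
        cond_w.append(w * gaussian_pdf(s, mu_s, S_ss))
        cond_mu.append(mu_a + gain @ (s - mu_s))
        # The conditional covariance does not depend on s.
        cond_Sigma.append(S_aa - gain @ S_sa)
    cond_w = np.array(cond_w)
    return cond_w / cond_w.sum(), np.array(cond_mu), np.array(cond_Sigma)


# Example: two components over a 1D state and a 1D action.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
w, mu_a, Sigma_a = condition_gmm(
    np.array([0.0]),
    weights=[0.5, 0.5],
    mus=[np.array([0.0, 0.0]), np.array([4.0, 1.0])],
    Sigmas=[Sigma, Sigma],
    ds=1,
)
print(w)  # the component centred at s = 0 receives almost all of the weight
```

Sampling an action then amounts to drawing a component index from the state-dependent weights and sampling from the corresponding conditional Gaussian.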

3. WASSERSTEIN GRADIENT FLOWS FOR GMM POLICY OPTIMIZATION

In this work, we focus on multi-step RL tasks for policy adaptation. We consider a finite-horizon Markov Decision Process (MDP) with continuous state and action spaces $\mathcal{S} \subseteq \mathbb{R}^n$ and $\mathcal{A} \subseteq \mathbb{R}^m$, transition and reward functions $p(s_{t+1}|s_t, a_t)$ and $r(s_t, a_t)$, initial state distribution $\rho(s_0)$ and a discount factor $\gamma$. Further, we assume to have an initial policy $\pi(a_t|s_t)$, which is to be adapted by optimizing some objective function $K_r(\pi)$. As stated in § 1, this problem arises in robot learning settings where a policy learned via imitation learning (e.g., learning from demonstrations, LfD) needs to be adapted to new objectives or unseen environmental conditions. To promote exploration and avoid premature convergence to suboptimal policies, we leverage maximum entropy RL (Eysenbach & Levine, 2022) by adding an entropy term $H(\pi)$ to the objective. Thus, the overall objective has the form of a free energy functional (resembling Eq. 4) and can be written as

$$J(\pi) = K_r(\pi) + \beta H(\pi), \tag{10}$$

where $\beta$ is a hyperparameter and $K_r(\pi)$ corresponds to the usual cumulative return

$$K_r(\pi) = \mathbb{E}_{\tau}\Big[\sum_t \gamma^t r(s_t, a_t)\Big] = \int \prod_t ds_0 \, ds_t \, da_t \, \rho(s_0)\, \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t) \sum_t \gamma^t r(s_t, a_t). \tag{11}$$

The evolution of the policy $\pi(a_t|s_t)$ over the course of the optimization can be described as a flow of a probability distribution in Wasserstein space. This formulation comes with three major benefits: (i) We directly leverage the Wasserstein metric properties for describing the evolution of probability distributions; (ii) We exploit the $L_2$-Wasserstein distance to constrain the policy updates, which is important to guarantee stability in policy optimization (Schulman et al., 2015; 2017; Otto et al., 2021); (iii) By constraining to specific submanifolds of the Wasserstein space, in this case GMMs, we can impose additional structural properties on the policy optimization. Since our objective in Eq. 10 has the form of the free energy functional studied by Richemond & Maginnis (2017) and Jordan et al.
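In practice the cumulative return is estimated from rollouts. The sketch below shows a plain Monte Carlo estimate of the discounted return and of the free energy objective. It is an assumption-laden illustration: the function names are hypothetical, and the entropy is passed in as an externally supplied estimate because the entropy of a GMM has no closed form.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Discounted sum of rewards sum_t gamma^t r_t for a single rollout."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def estimate_objective(reward_trajectories, gamma, beta=0.0, entropy_estimate=0.0):
    """Monte Carlo estimate of J = K_r + beta * H from a batch of rollouts.

    `entropy_estimate` is assumed to be computed elsewhere (e.g. by sampling),
    since this sketch only covers the return term.
    """
    k_r = np.mean([discounted_return(r, gamma) for r in reward_trajectories])
    return float(k_r + beta * entropy_estimate)


# Example with two short rollouts of per-step rewards.
rollouts = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
print(estimate_objective(rollouts, gamma=0.5))  # mean of 1.75 and 1.0
```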
(1996) , we can leverage the iterative updates scheme of Eq. 5 to formulate the evolution of our policy iteration under the flow generated by Eq. 10. As mentioned previously, we focus on the special case of GMM policies and therefore restrict the Wasserstein gradient flow to the submanifold of GMM distributions GMM d . We refer the interested reader to App. A.3, where we provide the explicit form of J(π) of Eq. 10 for the GMM case.

3.1. POLICY OPTIMIZATION

To begin with, we leverage the approximation that describes the GMM submanifold as a discrete distribution over the space of Gaussian distributions $G(\mathbb{R}^d)$, equipped with the Wasserstein metric (Chen et al., 2019). Consequently, our policy optimization problem naturally splits into an optimization over the $(N-1)$-dimensional simplex and an optimization on the $N$-fold product of the Bures-Wasserstein manifold ($\mathrm{BW}^N$), i.e. the product manifold $(\mathbb{R}^d \times \mathcal{S}^d_{++})^N$. The former corresponds to the GMM weights while the latter applies to the set of Gaussian distribution parameters. Note that the identification with the $\mathrm{BW}^N$ manifold allows us to perform the optimization directly on the parameter space. This comes with several benefits: (i) We can leverage the well-known analytic solution of the Wasserstein distance between two Gaussian distributions in Eq. 6, greatly reducing the computational complexity of the policy optimization. (ii) As Chen et al. (2019) show, we can guarantee that the policy optimized via Eq. 6 remains a GMM. (iii) Unlike the full Wasserstein space, the resulting product manifold is a true Riemannian manifold such that we can leverage the machinery of Riemannian optimization. Importantly, working in the parameter space allows us to apply an explicit Euler scheme instead of the implicit formulation of Eq. 3. According to the above-mentioned split, we formulate the policy optimization as an EM-like two-step procedure that alternates between the Gaussian parameters (i.e. means and covariance matrices) and the GMM weights. To optimize the former, we propose to leverage the Riemannian structure of the BW manifold to reformulate the updates as a forward discretization, similarly to Chen & Li (2020). In other words, by embedding the Gaussian components of the GMM policy in a Riemannian manifold, the Wasserstein gradient flow in the implicit form of Eq.
5 can be approximated by an explicit Euler update scheme according to the BW metric (further details are provided in App. A.4). This allows us to leverage the expressions of the Riemannian gradient and exponential map of the BW manifold (Malagò et al., 2018; Han et al., 2021). Thus, the optimization boils down to Riemannian gradient descent where the gradient is defined w.r.t. the Bures-Wasserstein metric. In particular, we use the expressions for the Riemannian gradient, metric and exponential map given by Han et al. (2021). Formally, the resulting updates for the Gaussian parameters of the GMM follow the Riemannian gradient descent scheme given by:

$$\hat{\mu}_{k+1} = R_{\hat{\mu}_k}\big(\lambda \cdot \mathrm{grad}_{\hat{\mu}} J(\pi_k)\big), \quad \text{and} \quad \hat{\Sigma}_{k+1} = R_{\hat{\Sigma}_k}\big(\lambda \cdot \mathrm{grad}_{\hat{\Sigma}} J(\pi_k)\big), \tag{12}$$

where $\mathrm{grad}$ denotes the Riemannian gradient w.r.t. the Bures-Wasserstein metric, and $R_x : T_x\mathcal{M} \to \mathcal{M}$ denotes the retraction operator, which maps a point on the tangent space $T_x\mathcal{M}$ back to the manifold $\mathcal{M} \equiv \mathrm{BW}$ (Boumal, 2022). Moreover, $\lambda$ is a learning rate and $\pi_k \overset{\mathrm{def}}{=} \pi(\hat{\mu}_k, \hat{\Sigma}_k, \hat{\omega}_k)$. The Euclidean gradients of $J(\pi)$ required for computing $\mathrm{grad}$ can be obtained using a likelihood ratio estimator (a.k.a. score function estimator or REINFORCE) (Williams, 2004) and are provided in App. A.3. Concerning the GMM weights, we first reparameterize them as $\omega_j = \frac{\exp \eta_j}{\sum_{k=1}^{N} \exp \eta_k}$ and optimize w.r.t. the new parameters $\eta_j$, $j = 1 \ldots N$, which unlike $\omega$ are unconstrained. For this optimization we employ the implicit Euler scheme:

$$\hat{\eta}_{k+1} = \arg\min_{\eta} \left( \frac{W_2^2(\pi_{k+1}(\eta), \pi_k)}{2\tau} - J(\pi_{k+1}(\eta)) \right), \tag{13}$$

where $\pi_{k+1}(\eta) \overset{\mathrm{def}}{=} \pi(\hat{\mu}_{k+1}, \hat{\Sigma}_{k+1}, \eta)$. We minimize Eq. 13 by gradient descent w.r.t. $\eta$ as follows:

$$\hat{\eta}_{k+1} = \hat{\eta}_k - \lambda \nabla_{\eta} \left( \frac{W_2^2(\pi_{k+1}(\eta), \pi_k)}{2\tau} - J(\pi_{k+1}(\eta)) \right). \tag{14}$$

The gradient of $J(\pi)$ can be obtained analytically using a likelihood ratio estimator. For the Wasserstein term, we first compute the gradient w.r.t. the weights via the Sinkhorn algorithm (Cuturi & Doucet, 2014), from which the gradient w.r.t. $\eta$ can then be obtained via the chain rule.
Note that we have to rely on the Sinkhorn algorithm here since there is no analytic solution available for the Wasserstein distance between discrete distributions, unlike the above case of the Gaussian components. Consequently, we cannot compute the corresponding gradients in closed form.
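The explicit Riemannian step on the covariance part of the Bures-Wasserstein manifold can be sketched as follows. This is a minimal illustration rather than the Pymanopt-based implementation: it assumes the convention in which the Riemannian gradient is $2(G\Sigma + \Sigma G)$ for a symmetric Euclidean gradient $G$ and the exponential map is $\mathrm{Exp}_\Sigma(X) = \Sigma + X + L\Sigma L$ with $L\Sigma + \Sigma L = X$, under which a descent step collapses to $(I - 2\lambda G)\,\Sigma\,(I - 2\lambda G)$. The sketch minimizes a toy loss (an assumption made for illustration), whereas the policy update uses the gradient of $J$; the function names are hypothetical.

```python
import numpy as np


def bw_riemannian_grad(Sigma, egrad):
    """Riemannian gradient under the assumed BW convention: 2 * (G Sigma + Sigma G)."""
    return 2.0 * (egrad @ Sigma + Sigma @ egrad)


def bw_gradient_step(Sigma, egrad, lam):
    """One descent step Exp_Sigma(-lam * grad) in closed form.

    With L = -2 * lam * G solving the Lyapunov equation for -lam * grad,
    the exponential map reduces to (I - 2 lam G) Sigma (I - 2 lam G).
    The result stays positive definite as long as I - 2 lam G is invertible.
    """
    M = np.eye(Sigma.shape[0]) - 2.0 * lam * egrad
    new_Sigma = M @ Sigma @ M
    return 0.5 * (new_Sigma + new_Sigma.T)  # remove round-off asymmetry


# Toy loss (assumption): f(Sigma) = tr(Sigma) - log det(Sigma), minimized at Sigma = I,
# with Euclidean gradient G = I - Sigma^{-1}.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
for _ in range(200):
    G = np.eye(2) - np.linalg.inv(Sigma)
    Sigma = bw_gradient_step(Sigma, G, lam=0.1)

print(Sigma)  # approaches the identity matrix
```

The mean part of each component lives in $\mathbb{R}^d$, so under the same assumptions its update is an ordinary Euclidean gradient step.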

3.2. IMPLEMENTATION PIPELINE

To carry out the policy optimization, we proceed as in the usual on-policy RL scheme: We first roll out the current policy to collect samples of state-action-reward tuples. Then, we use the collected interaction trajectories to compute a sample-based estimate of the functional $K_r(\pi)$ and its gradients w.r.t. the policy parameters, as explained in § 3.1. An optimization step consists of alternating between optimizing the Gaussian parameters using Eq. 12, and updating the weights via Eq. 14. For the optimization of the Gaussian parameters we leverage Pymanopt (Townsend et al., 2016) for Riemannian optimization. We extended this library by implementing the Bures-Wasserstein manifold based on the expressions provided by Han et al. (2021) (see App. A.2 for details). Furthermore, we added a custom line-search routine that accounts for a constraint on the Wasserstein distance between the old and the optimized GMM, as to our knowledge such a search method does not exist in out-of-the-box optimizers. The details of this custom line-search can be found in Algorithm 2 in App. A.5. Regarding the optimization of the GMM weights, we use POT (Flamary et al., 2021), a Python library for optimal transport, from which we obtain the quantities required for computing the gradients of the Wasserstein distance w.r.t. the weights in Eq. 14. The full policy optimization finishes when either the objective stops improving or the Wasserstein distance between the old and optimized GMMs exceeds a predefined threshold, which is chosen experimentally. Afterwards, fresh rollouts are performed with the updated policy and the aforementioned two-step procedure starts over. This optimization loop is repeated until a task-dependent success criterion has been fulfilled. We summarize the proposed optimization in Algorithm 1.
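The weight-related quantities can be illustrated without any optimal transport library. The sketch below is a plain NumPy version of the softmax reparameterization and of Sinkhorn iterations for the entropy-regularized transport problem between two weight vectors; it is not POT's API, and the regularization strength `reg`, the iteration count and the function names are illustrative assumptions. The cost matrix is assumed to hold the pairwise squared Wasserstein distances between Gaussian components.

```python
import numpy as np


def softmax(eta):
    """Map unconstrained parameters eta to weights on the simplex."""
    z = np.exp(eta - np.max(eta))  # shift for numerical stability
    return z / z.sum()


def sinkhorn_plan(w1, w2, cost, reg=0.05, n_iters=2000):
    """Entropy-regularized transport plan between weight vectors w1 and w2.

    Alternately rescales rows and columns of the Gibbs kernel exp(-cost / reg)
    so that the plan's marginals match w1 and w2. Assumes strictly positive weights.
    """
    K = np.exp(-cost / reg)
    u = np.ones_like(w1)
    for _ in range(n_iters):
        v = w2 / (K.T @ u)
        u = w1 / (K @ v)
    return u[:, None] * K * v[None, :]


def entropic_gmm_w2_squared(w1, w2, cost, reg=0.05):
    """Approximate GMM distance: transport cost of the Sinkhorn plan."""
    P = sinkhorn_plan(w1, w2, cost, reg)
    return float(np.sum(P * cost)), P


# Example: two components whose pairwise cost is 1 off the diagonal.
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
w_old = softmax(np.zeros(2))          # uniform weights
w_new = np.array([0.25, 0.75])
dist, plan = entropic_gmm_w2_squared(w_old, w_new, cost)
print(dist)  # close to 0.25: a quarter of the mass has to change component
```

Smaller values of `reg` bring the result closer to the unregularized distance at the price of slower convergence and a higher risk of numerical underflow in the kernel.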

4. EXPERIMENTS

We tested our approach in three different robotic settings: a reaching skill, a collision-free trajectory tracking, and a multiple-goal task. All the tasks are represented in a 2D operational space. The robot motion policies were initially learned from human demonstrations collected on a simple Python graphical interface. We assumed we were given $M$ demonstrations, each of which contained $T_m$ data points, for a dataset of $N = \sum_m T_m$ total observations $\tau = \{(s_t, a_t)\}_{t=1}^{N}$. The state $s$ and action $a$ correspond to the robot end-effector position $x \in \mathbb{R}^2$ and velocity $\dot{x} \in \mathbb{R}^2$. The GMM models were trained via classical Expectation-Maximization. The policy rollout consists of sampling a velocity action $a_t \sim \pi(a_t|s_t)$ using Eq. 9, and subsequently commanding the robot via a Cartesian velocity controller at a frequency of 100 Hz. For all the experiments, we used the Robotics Toolbox for Python (Corke & Haviland, 2021) to simulate the robotic environments.

Algorithm 1: GMM Policy Optimization via Wasserstein Gradient Flows
Input: initial policy $\pi(a|s)$
while goal not reached do
    Roll out policy $\pi(a|s)$ in the environment for $M$ episodes to collect interaction trajectories $\tau = \{(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_T, a_T, r_T)\}_{m=1}^{M}$
    repeat
        Update Gaussian component parameters $\hat{\mu}, \hat{\Sigma}$ using Riemannian optimization (Eq. 12), where $\lambda_{ls}$ is determined via line-search (see § 3.2)
        Update GMM weights $\hat{\omega}$ via gradient descent on the free energy objective (Eq. 10), using Eq. 14
    until converged
end while

To show the importance of accounting for the policy structure in RL settings, we compared our method against two structure-unaware baselines: PPO (Schulman et al., 2017) and SAC-GMM (Nematollahi et al., 2022). As PPO was not originally designed to directly optimize the parameters of a previously-learned GMM policy, we designed the policy actions to represent (usually small) corrections to the GMM parameters, i.e.
$a = [\Delta\hat{\omega} \;\; \Delta\hat{\mu} \;\; \Delta\mathrm{vec}(\hat{\Sigma})]$, following the same methodology as SAC-GMM (Nematollahi et al., 2022). The PPO and SAC implementations correspond to the code provided by Stable-Baselines3 (Raffin et al., 2021), whose policies are parametrized by MLP networks. During policy optimization, we sample an action from the MLP policy that is then used to update the GMM parameters by adding the computed corrections to the current parameters. We then proceed as described earlier, namely, the updated GMM is used to compute the velocity action via Eq. 9. For comparison purposes, we report statistical results for the three considered settings over 5 runs for the task success rate and solution variance. We tuned the baselines separately for each task using Optuna (Akiba et al., 2019). In addition, to assess the importance of our Riemannian formulation, we performed an ablation where we used the implicit scheme based on Euclidean gradient descent instead of the explicit optimization on the Bures-Wasserstein manifold (see App. A.6.2). Last but not least, we tested our approach on a 3D version of the collision-free task performed by a 7-DoF Franka Emika Panda robot in a virtual environment, as reported in App. A.6.3.

4.1. TASKS DESCRIPTION

Reaching Task: This experiment consists of: (1) learning an initial GMM policy such that the robot end-effector reaches a target by following an L-shape trajectory from its initial position, and (2) adapting the GMM policy to reach a new target located midway and above the previously-learned L-shape trajectories. The initial policy, shown in Fig. 2-left and Fig. 12a, was learned from 12 demonstrations and encoded by a 7-component GMM. To adapt the policy, we defined a dense reward as a function of the position error between the robot end-effector and the new target. We also added a sparse penalty term that punishes rollouts leading to significantly divergent trajectories. Convergence is achieved when a minimum average position error w.r.t. the target, computed over an episode, is reached.

Figure 2: Initial GMM policies for the reaching (left), collision-avoidance (middle) and multiple-goal (right) tasks, projected on the 2D Cartesian position space. The end-effector trajectory resulting from the initial GMM policy is shown as dark blue lines. Red circles in the collision-avoidance task represent the obstacles (middle). The different targets of the multiple-goal task (right) are depicted as red stars.

Collision-avoidance Task: This task consists of: (1) learning an initial GMM policy of a linear reaching motion, and (2) adapting the GMM policy to reach a new horizontally-translated target while avoiding collisions with two spherical obstacles located midway between the initial robot position and the new target. The initial GMM policy was learned from 10 human demonstrations and represented by a 3-component GMM, as shown in Fig. 2-middle and Fig. 12b. For policy optimization, we defined a sparse reward as a function of the position error between the robot end-effector position and the target at the end of the rollout.
We also included two sparse penalty terms: the first one punishes rollouts leading to collisions with the obstacles, for which the rollout is stopped; the second term penalizes rollouts with significantly divergent trajectories. Convergence is determined by a minimum average position error w.r.t the target computed over an episode. Multiple-goal Task: This setting involves: (1) learning an initial GMM policy where the robot end-effector reaches two different targets (i.e., task goals) starting from the same initial position, and (2) adapting the initial policy to reach a new target located close to one of the previous task goals. The intended adapted behavior should make the robot go through the most relevant GMM components according to the new target location. The initial GMM policy was learned from 12 demonstrations and encoded by a 6-component GMM, as shown in Fig. 2 -right and Fig. 12c . To optimize the initial GMM policy, we specified a sparse reward based on the position error between the robot end-effector position and the chosen target at the end of the rollout. Similar to the previous experiments, we added a sparse penalty term to penalize rollouts generating significantly divergent trajectories. Again, the policy optimization converges when the average position error w.r.t the chosen target reaches a minimum threshold.

4.2. RESULTS ANALYSIS

The reaching task tested our method's ability to adapt a previously-learned reaching skill to a new goal, located at (6.0, -6.5) (cf. Fig. 12a-left). Achieving this required adapting the Gaussian parameters of mainly the last four GMM components, while the other ones remained unchanged. We compared all methods in terms of the success rate over environment steps, where the success rate is defined as the fraction of rollouts that reach the new goal. Figure 3-left shows that our method achieved a success rate of 1 after approximately 70,000 environment interactions. Although PPO was also able to complete the task reliably, it required many more environment steps (cf. Fig. 5-left). In sharp contrast, SAC did not show any improvement. These observations underline the importance of some kind of trust region or constraint on the policy updates, which allowed both our method and PPO to reach good success rates. Furthermore, this experiment showed that our method is much more sample-efficient in adapting the GMM parameters, which we attribute to the fact that our method explicitly takes the GMM structure into account in the formulation of the optimization. In the collision-avoidance task, we tested whether our method was able to adapt a trajectory tracking skill in order to avoid collisions with newly added obstacles. These were placed in such a way that the robot was forced to move its end-effector through a narrow path between the obstacles (cf. Fig. 2-middle). While the reaching task could be adapted by mainly varying the means of the GMM components, this task also demands adapting the covariance of the second GMM component. Figure 3-middle shows that our method solved this task reliably after comparatively few environment interactions. Although PPO also achieved a success rate of 1, it took 6 times more environment steps than our method. SAC only reached an average success rate of 0.8, and with high variance (cf. Fig. 5-middle).
These results again show the importance of the constraints on the policy updates. The large discrepancy in the required environment steps between our method and PPO further emphasizes the importance of taking the GMM structure into account in the policy optimization.

While the previous two tasks were accomplished by adapting mostly the Gaussian parameters of the GMM, the multiple-goal task requires adapting the GMM weights. The initial skill comprised reaching motions to two different goals, and an execution of this skill results in reaching one of them, depending on the sampling noise (cf. Fig. 12c). The easiest way to adapt the policy to reach only one of the two goals is to reduce the GMM weights of the components belonging to the undesired motion and correspondingly increase the weights of the other components. As shown in Fig. 3-right, our method again quickly achieved a success rate of 1. PPO required substantially more environment steps, while SAC was not able to solve the task.

In Fig. 4 we report the success rate variance over 5 runs at a fixed time step, which corresponded to the step at which the first method achieved a success rate of 1, thus prioritizing sample efficiency. The plots show that our method exhibits a very low solution variance. Both baselines varied largely, except for the reaching task, where all SAC runs collapsed to a success rate of 0. These results show that our method, despite showing large variance at the start, was able to quickly reduce the variance and converge reliably to a good success rate. We also provide similar plots of solution variance in Fig. 6, where we report the results for each method using its own convergence time step.

5. CONCLUSIONS AND FUTURE WORK

We presented a novel method for GMM policy optimization, which leverages optimal transport theory to formulate the policy optimization as a Wasserstein gradient flow on the manifold of GMMs. Our formulation explicitly accounts for the GMM structure in the optimization and furthermore enables us to naturally constrain the policy updates by the $L_2$-Wasserstein distance between GMMs to enhance the stability of the policy optimization process. Moreover, the embedding of the Gaussian components of the GMM policy in the Bures-Wasserstein manifold greatly reduced the computational cost of the policy optimization. Experiments on three robotic tasks provided strong evidence of the importance of our policy-structure-aware optimization against approaches that disregard the GMM structure.

A possible limitation of our method is that each optimization loop involves running the Sinkhorn algorithm, which is computationally expensive. This might be improved by employing recent advances on initializing the Sinkhorn algorithm (Thornton & Cuturi, 2022). Also, we observed an intricate interplay between the optimization of the GMM weights and the Gaussian parameters, which sometimes resulted in one update hampering the other. In future work we plan to address the latter problem by using separate adaptive learning rates for weights and Gaussian parameters. Another possibility would be to reformulate the approach as a fully dynamical, particle-based optimization on the Bures-Wasserstein manifold, where both the locations and weights of the particles are updated using Wasserstein Fisher-Rao gradient flows (Chizat et al., 2015; Chizat, 2019; Liero et al., 2018). Finally, it would be interesting to combine our method with an actor-critic formulation and to replace the multi-step cumulative reward by a trained Q-function.

A APPENDIX

A.1 DETAILS ON GAUSSIAN MIXTURE REGRESSION (GMR)

In GMR we start from a GMM in state-action space, $\pi(s, a) = \sum_{i=1}^{N} \omega_i \, \mathcal{N}\big([s \; a]^T; \mu_i, \Sigma_i\big)$, from which a policy, i.e. a probability distribution on the action space, can be obtained by conditioning on the state, as follows

$$\pi(a|s) = \frac{\pi(s, a)}{\int \pi(s, a)\, da}.$$

The resulting conditional distribution is another GMM on the action space, with state-dependent parameters, given by:

$$\pi(a_t|s_t) = \sum_{i=1}^{N} \omega_i(s_t)\, \mathcal{N}\big(a_t; \mu^a_i(s_t), \tilde{\Sigma}^a_i\big),$$

with

$$\mu^a_i(s_t) = \mu^a_i + \Sigma^{as}_i (\Sigma^s_i)^{-1} (s_t - \mu^s_i), \qquad \tilde{\Sigma}^a_i = \Sigma^a_i - \Sigma^{as}_i (\Sigma^s_i)^{-1} \Sigma^{sa}_i, \qquad \omega_i(s_t) = \frac{\omega_i \, \mathcal{N}(s_t; \mu^s_i, \Sigma^s_i)}{\sum_{k=1}^{N} \omega_k \, \mathcal{N}(s_t; \mu^s_k, \Sigma^s_k)}.$$

Note that we have split the GMM parameters $\mu_i$ and $\Sigma_i$ into their state and action components according to

$$\mu_i = \begin{pmatrix} \mu^s_i \\ \mu^a_i \end{pmatrix}, \qquad \Sigma_i = \begin{pmatrix} \Sigma^s_i & \Sigma^{sa}_i \\ \Sigma^{as}_i & \Sigma^a_i \end{pmatrix}.$$
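The conditioning step can be illustrated with a short NumPy sketch. It is only a minimal example under stated assumptions: the function names `gaussian_pdf` and `gmr_condition`, the argument `ds` (state dimension), and the toy two-component GMM are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(x; mean, cov)."""
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def gmr_condition(s, weights, means, covs, ds):
    """Condition a joint GMM over [s, a] on the state s (first ds dimensions).

    Returns the state-dependent weights, conditional means and conditional
    covariances of the resulting GMM over the action a.
    """
    cond_w, cond_mu, cond_sigma = [], [], []
    for w, mu, sigma in zip(weights, means, covs):
        mu_s, mu_a = mu[:ds], mu[ds:]
        sig_s, sig_sa = sigma[:ds, :ds], sigma[:ds, ds:]
        sig_as, sig_a = sigma[ds:, :ds], sigma[ds:, ds:]
        gain = sig_as @ np.linalg.inv(sig_s)          # Sigma^{as} (Sigma^{s})^{-1}
        cond_mu.append(mu_a + gain @ (s - mu_s))      # conditional mean
        cond_sigma.append(sig_a - gain @ sig_sa)      # conditional covariance
        cond_w.append(w * gaussian_pdf(s, mu_s, sig_s))  # unnormalized weight
    cond_w = np.array(cond_w)
    return cond_w / cond_w.sum(), np.array(cond_mu), np.array(cond_sigma)

# Toy example (hypothetical values): 1D state, 1D action, two components.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 1.0], [4.0, -1.0]])
covs = np.array([[[1.0, 0.5], [0.5, 1.0]],
                 [[1.0, -0.3], [-0.3, 1.0]]])
w_s, mu_a, sigma_a = gmr_condition(np.array([0.0]), weights, means, covs, ds=1)
```

In this toy case the query state coincides with the state mean of the first component, so that component receives the larger conditional weight and its conditional mean equals its action mean.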

A.2 RIEMANNIAN GRADIENTS AND RETRACTIONS

For completeness we give here the explicit expressions of the Riemannian gradients and the retractions used in § 3.1. As the mean vectors are assumed to lie in Euclidean space, their Riemannian gradients coincide with the Euclidean gradients and no retraction is required, so Eq. 12 reduces to the well-known Euclidean gradient descent

$$\mu_{k+1} = \mu_k + \nabla_{\mu} J(\pi_k),$$

where $\nabla_{\mu}$ denotes the Euclidean gradient w.r.t. $\mu$. For the covariance matrices we use the gradient and retraction w.r.t. the Bures-Wasserstein manifold, taken from (Malagò et al., 2018; Han et al., 2021). The gradient is given by

$$\mathrm{grad}_{\Sigma} J(\pi_k) = 4 \{\nabla_{\Sigma} J(\pi_k)\, \Sigma\}_S,$$

where again $\nabla_{\Sigma}$ denotes the Euclidean gradient w.r.t. $\Sigma$ and $\{X\}_S = \frac{X + X^T}{2}$. Furthermore, the retraction is given by

$$R_{\Sigma_k}(X) = \Sigma_k + X + \mathcal{L}_{\Sigma_k}[X]\, \Sigma_k\, \mathcal{L}_{\Sigma_k}[X],$$

where $\mathcal{L}_{\Sigma_k}[X]$ is the Lyapunov operator, defined as the solution to the matrix linear system $\mathcal{L}_{\Sigma_k}[X]\, \Sigma_k + \Sigma_k\, \mathcal{L}_{\Sigma_k}[X] = X$.

A.3 EXPRESSIONS OF THE FREE FUNCTIONAL J(π) AND ITS EUCLIDEAN GRADIENTS

For completeness' sake, we provide here the explicit expressions of the Euclidean gradients of the objective $J(\pi)$ w.r.t. the parameters of the GMM, which are used in the construction of the Riemannian gradients. Using the policy gradient theorem, we obtain the gradient of Eq. 11 w.r.t. a parameter $\xi$, and we specialize the objective $J(\pi)$ to the GMM policies that are the focus of this work. By inserting Eq. 25 into Eq. 24 we obtain for the individual parameters of the GMM

$$
\begin{aligned}
\nabla_{\mu_l} J(\pi) &= \mathbb{E}_{\tau}\Bigg[\sum_t \Bigg(\frac{\omega_l\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)\, \Sigma_l^{-1}\big((s_t, a_t) - \mu_l\big)}{\sum_j \omega_j\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)} - \frac{\omega_l \int da\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)\, \Sigma_l^{-1}\big((s_t, a_t) - \mu_l\big)}{\sum_j \omega_j \int da\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)}\Bigg) \sum_{t' > t} r(s_t, a_t)\Bigg] \\
&= \mathbb{E}_{\tau}\Bigg[\sum_t \Bigg(\frac{\omega_l\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)\, \Sigma_l^{-1}\big((s_t, a_t) - \mu_l\big)}{\sum_j \omega_j\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)} - \delta_s \frac{\omega_l\, \mathcal{N}(s_t; \mu_{l,s}, \Sigma_{l,ss})\, \Sigma_{l,ss}^{-1}(s_t - \mu_{l,s})}{\sum_j \omega_j\, \mathcal{N}(s_t; \mu_{j,s}, \Sigma_{j,ss})}\Bigg) \sum_{t' > t} r(s_t, a_t)\Bigg] \\
&= \mathbb{E}_{\tau}\Bigg[\sum_t \Big(\zeta_{l,s_t,a_t}\, \Sigma_l^{-1}\big((s_t, a_t) - \mu_l\big) - \delta_s\, \zeta_{l,s_t}\, \Sigma_{l,ss}^{-1}(s_t - \mu_{l,s})\Big) \sum_{t' > t} r(s_t, a_t)\Bigg].
\end{aligned}
\tag{27}
$$

Here $\delta_s \in \{0, 1\}$ indicates which terms the gradient acts on. In this case, the gradient acts on the state components and is absent for the action dimensions.

$$
\begin{aligned}
\nabla_{\Sigma_l} J(\pi) &= \mathbb{E}_{\tau}\Bigg[\sum_t \Bigg(-\frac{1}{2} \frac{\omega_l\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)\, \Sigma_l^{-1}\Big(\mathbb{1} - \big((s_t, a_t) - \mu_l\big)\big((s_t, a_t) - \mu_l\big)^T \Sigma_l^{-1}\Big)}{\sum_j \omega_j\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)} \\
&\qquad\qquad + \frac{1}{2} \frac{\omega_l \int da\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)\, \Sigma_l^{-1}\Big(\mathbb{1} - \big((s_t, a_t) - \mu_l\big)\big((s_t, a_t) - \mu_l\big)^T \Sigma_l^{-1}\Big)}{\sum_j \omega_j \int da\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)}\Bigg) \sum_{t' > t} r(s_t, a_t)\Bigg] \\
&= \mathbb{E}_{\tau}\Bigg[\sum_t \Bigg(-\frac{\zeta_{l,s_t,a_t}}{2}\, \Sigma_l^{-1}\Big(\mathbb{1} - \big((s_t, a_t) - \mu_l\big)\big((s_t, a_t) - \mu_l\big)^T \Sigma_l^{-1}\Big) + \delta_s \frac{\zeta_{l,s_t}}{2}\, \Sigma_{l,ss}^{-1}\Big(\mathbb{1} - (s_t - \mu_{l,s})(s_t - \mu_{l,s})^T \Sigma_{l,ss}^{-1}\Big)\Bigg) \sum_{t' > t} r(s_t, a_t)\Bigg].
\end{aligned}
\tag{28}
$$
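As an illustration of the mean gradient above, the following NumPy sketch computes the responsibilities $\zeta_{l,s_t,a_t}$ and $\zeta_{l,s_t}$ and the per-sample factor that multiplies the reward term. It assumes that this factor is the gradient of $\log \pi(a_t|s_t)$ w.r.t. $\mu_l$, with $\delta_s$ selecting the state block; all function names and the toy values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(x; mean, cov)."""
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def responsibilities(x, weights, means, covs, ds):
    """Joint and state-marginal responsibilities for x = [s, a]."""
    joint = np.array([w * gaussian_pdf(x, m, c)
                      for w, m, c in zip(weights, means, covs)])
    marg = np.array([w * gaussian_pdf(x[:ds], m[:ds], c[:ds, :ds])
                     for w, m, c in zip(weights, means, covs)])
    return joint / joint.sum(), marg / marg.sum()

def log_policy(x, weights, means, covs, ds):
    """log pi(a|s) = log pi(s, a) - log pi(s) for a joint GMM."""
    joint = sum(w * gaussian_pdf(x, m, c)
                for w, m, c in zip(weights, means, covs))
    marg = sum(w * gaussian_pdf(x[:ds], m[:ds], c[:ds, :ds])
               for w, m, c in zip(weights, means, covs))
    return np.log(joint) - np.log(marg)

def score_mean(x, l, weights, means, covs, ds):
    """Per-sample factor of the gradient w.r.t. mu_l (before the reward sum)."""
    zeta_sa, zeta_s = responsibilities(x, weights, means, covs, ds)
    grad = zeta_sa[l] * np.linalg.solve(covs[l], x - means[l])
    # The marginal term only acts on the state components (delta_s = 1 there).
    grad[:ds] -= zeta_s[l] * np.linalg.solve(covs[l][:ds, :ds],
                                             x[:ds] - means[l][:ds])
    return grad

# Toy example (hypothetical values): 2D state, 1D action, two components.
ds = 2
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.5, 1.0], [1.0, -0.5, 0.0]])
covs = np.array([[[1.0, 0.2, 0.1], [0.2, 1.5, 0.3], [0.1, 0.3, 0.8]],
                 [[0.9, -0.1, 0.2], [-0.1, 1.1, 0.0], [0.2, 0.0, 1.2]]])
x = np.array([0.3, 0.1, 0.7])
g0 = score_mean(x, 0, weights, means, covs, ds)
```

A finite-difference check of `log_policy` w.r.t. the mean of component 0 is a simple way to validate `score_mean` on such toy data.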
This approximation can be obtained by considering a first-order approximation of the geodesic on the BW manifold. As the exponential map (a.k.a. the retraction) is defined via the geodesic, the retraction operator in Eq. 40 turns into a simple addition operation under a first-order approximation, leading to Eq. 39. Notice that such an approximation does not guarantee that the updated parameters $\theta$ stay on the manifold, except for the cases in which $\theta \in \mathbb{R}^d$. In our case, we leverage the retraction and Riemannian gradients of (Malagò et al., 2018; Han et al., 2021), which allow us to apply the exact Riemannian gradient descent of Eq. 40. This avoids relying on first-order approximations, and in turn we can guarantee that the updates of the Gaussian distribution parameters always lie on the product manifold $(\mathbb{R}^d \times \mathcal{S}^d_{++})^N$.
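The difference between the plain additive update and the retraction can be checked numerically. The sketch below is a minimal illustration, assuming the Bures-Wasserstein retraction $R_{\Sigma}(X) = \Sigma + X + L \Sigma L$ with $L$ solving $L\Sigma + \Sigma L = X$ (the convention of Han et al., 2021); the helper names and the toy matrices are hypothetical. Since $R_{\Sigma}(X) = (\mathbb{1} + L)\,\Sigma\,(\mathbb{1} + L)$, the retracted matrix is positive semi-definite by construction, whereas $\Sigma + X$ may be indefinite.

```python
import numpy as np

def sym(x):
    """Symmetric part {X}_S = (X + X^T) / 2."""
    return 0.5 * (x + x.T)

def bw_riemannian_grad(egrad, sigma):
    """Bures-Wasserstein Riemannian gradient 4 {egrad @ sigma}_S."""
    return 4.0 * sym(egrad @ sigma)

def lyapunov(sigma, x):
    """Solve L @ sigma + sigma @ L = x for symmetric x via eigendecomposition."""
    lam, u = np.linalg.eigh(sigma)
    x_rot = u.T @ x @ u
    l_rot = x_rot / (lam[:, None] + lam[None, :])
    return u @ l_rot @ u.T

def bw_retraction(sigma, x):
    """Retraction sigma + x + L sigma L on the Bures-Wasserstein manifold."""
    l = lyapunov(sigma, x)
    return sigma + x + l @ sigma @ l

# Toy example (hypothetical values): a large step in a symmetric direction.
sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
step = np.array([[-3.0, 0.1], [0.1, 0.2]])

naive = sigma + step                    # first-order (additive) update
retracted = bw_retraction(sigma, step)  # update via the retraction

print("min eigenvalue, additive update:", np.linalg.eigvalsh(naive).min())
print("min eigenvalue, retraction:     ", np.linalg.eigvalsh(retracted).min())
```

In this example the additive update produces a matrix with a negative eigenvalue, while the retracted matrix remains symmetric positive definite.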

A.5 ADDITIONAL DETAILS ON THE IMPLEMENTATION

We extended Pymanopt (Townsend et al., 2016) by adding a custom line-search routine that accounts for a constraint on the Wasserstein distance between the old and the optimized GMMs. The details of this line-search can be found in Algorithm 2.

Fig. 6 shows the variance of the success rate for the three methods at their time step of convergence for all three robotic tasks. Concerning SAC, which did not converge after the maximum number of environment steps used for training, we chose the last time step. Specifically, we chose the following time steps for PPO, SAC and WGF, respectively: reaching task (280000, 400000, 80000), collision-avoidance task (275000, 300000, 90000), multiple-goal task (130000, 200000, 95000). These plots show that PPO may also reach a low-variance success rate over the five runs at the time step of convergence, at the cost of a prohibitively large number of steps. SAC showed huge variance in all tasks, apart from the reaching task, where all runs collapsed to a success rate of 0.

Algorithm 2 Constrained line-search. The constraint function $c(x_0, \cdot)$ is arbitrary in general. We use the $L_2$-Wasserstein distance between two points on the manifold of GMMs as constraint.
Input: point $x_0$ on the manifold, descent direction $d$, initial step size $\lambda_0$, decrement $\alpha$, constraint $c(x_0, \cdot)$, maximum allowed value for constraint $c_{max}$, minimum step size $\lambda_{min}$
Output: step size $s$, updated point on manifold $x$
1: $x = x_0 + \lambda_0 \cdot d$, $\lambda = \lambda_0$
2: while $c(x_0, x) > c$
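A constrained backtracking line-search of this kind can be sketched in a few lines of Python. This is not the Pymanopt extension itself but a hedged illustration: the function name, the multiplicative use of the decrement, the fallback to $x_0$ when the step size drops below the minimum, and the Euclidean stand-ins for the retraction and for the Wasserstein constraint are all assumptions, since only the first steps of Algorithm 2 are available here.

```python
import numpy as np

def constrained_line_search(x0, d, step0, decrement, constraint, c_max,
                            step_min, retract=lambda x, v: x + v):
    """Shrink the step until constraint(x0, x) <= c_max.

    Assumptions (not confirmed by the source): the step is multiplied by
    `decrement` in each iteration, and the search falls back to x0 with a
    zero step when the step size drops below `step_min`.
    """
    step = step0
    x = retract(x0, step * d)
    while constraint(x0, x) > c_max:
        step *= decrement
        if step < step_min:
            return 0.0, x0  # no admissible step found
        x = retract(x0, step * d)
    return step, x

# Toy usage: Euclidean distance stands in for the L2-Wasserstein constraint.
x0 = np.zeros(2)
d = np.array([1.0, 0.0])
step, x_new = constrained_line_search(
    x0, d, step0=1.0, decrement=0.5,
    constraint=lambda a, b: float(np.linalg.norm(a - b)),
    c_max=0.3, step_min=1e-3,
)
```

On a manifold, the `retract` argument would be replaced by the manifold's retraction and `constraint` by the distance between the old and the candidate GMM.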

A.6.2 ADDITIONAL ABLATIONS

In order to assess the influence of leveraging a Riemannian optimization approach on the Bures-Wasserstein manifold, we conducted an ablation of our method by eliminating the Riemannian formulation. Instead of the explicit Euler scheme update in Eq. 12, which corresponds to Riemannian gradient descent w.r.t. the Bures-Wasserstein metric, we use the implicit Euler scheme

$$\mu_{k+1} = \arg\min_{\mu} \frac{W_2^2(\pi_k(\mu), \pi_k)}{2\tau} - J(\pi_k(\mu)), \qquad \Sigma_{k+1} = \arg\min_{\Sigma} \frac{W_2^2(\pi_k(\Sigma), \pi_k)}{2\tau} - J(\pi_k(\Sigma)).$$

To guarantee that the updated covariance matrices do not leave the manifold of symmetric positive definite matrices, we parameterize them in terms of Cholesky factors. The results obtained with this non-Riemannian version of our method are shown in Fig. 7 in direct comparison to our method and in Fig. 8 for an extended range. The results clearly show that the non-Riemannian method struggles to reach a success rate of 1 for the reaching task and the collision-avoidance task. Furthermore, we observe a high variance over different runs in the same settings (see Fig. 9 and Fig. 10). We attribute this to the fact that our method takes exact gradient steps in the direction of steepest descent w.r.t. the underlying BW metric, whereas the implicit scheme only approximates this direction. For this reason the non-Riemannian method is much noisier, which in turn leads to the aforementioned high variance. Nevertheless, the multiple-goal task constitutes an exception. Here we observed a similar performance for our approach and the ablated method. The reason for this is that the optimization of this task is mainly dominated by the weight updates, which are identical for both methods. This result is therefore expected and confirms the correctness of our ablation strategy.

A.6.3 ADDITIONAL EXPERIMENT WITH 7-DOF ROBOTIC MANIPULATOR

We carried out an additional experiment to show that our method can be employed on tasks performed by off-the-shelf robotic manipulators (e.g. a 7-DoF Franka Emika Panda robot). Specifically, we extended the collision-avoidance task described in § 4 to a 3D environment (i.e. the state $s = x \in \mathbb{R}^3$ and the action $a = \dot{x} \in \mathbb{R}^3$). The initial 3-component GMM policy was trained using 10 human demonstrations featuring linear reaching 3D trajectories. For policy optimization, we used a sparse reward defined as a function of the position error between the robot end-effector position and the target at the end of the rollout. Moreover, two sparse penalty terms were added to punish collisions with obstacles and divergent trajectories. Similarly to the planar task reported in the main paper, we tested whether our method was able to adapt a trajectory tracking skill in order to avoid collisions with newly added obstacles. This means that the robot end-effector needed to pass through a narrow path between two spherical obstacles. The robot end-effector pose was controlled using a full-pose Cartesian velocity controller at a frequency of 100 Hz, where the end-effector orientation was kept constant. Figure 11 shows that our method reached a success rate of 1.0 very quickly, taking approximately 20000 environment steps. Moreover, the solution variance of our method was also very low, which is consistent with our observations concerning the performance of our policy optimization on the three planar tasks analyzed in the main paper.

A.6.4 INITIAL GMM POLICIES

For the sake of completeness, Fig. 12 provides 2D projections of the initial GMM policies learned from demonstrations for the three robotic settings considered in the main paper: the reaching motion skill, the collision-free trajectory tracking, and the multiple-goal task. Figure 12 also provides the demonstration data used to train the initial policies. Note that these models are then adapted according to the policy optimization approach introduced in § 3.2.



Wasserstein space is not a true Riemannian manifold, but it can be equipped with a Riemannian structure and formal calculus on this manifold (Otto, 2001), which has been made rigorous by (Ambrosio et al., 2005).



Figure 2: The three tested robotic settings: a reaching skill (left), a collision-free trajectory tracking (middle), and a multiple-goal task (right). The robot color goes from light gray to black to show the evolution of the task reproduction. Green Gaussian components depict the initial GMM policy, projected on the 2D Cartesian position space. The end-effector trajectory resulting from the initial GMM policy is shown in dark blue lines. Red circles in the collision-avoidance task represent the obstacles (middle). The different targets of the multiple-goal task (right) are depicted as red stars.

Figure 3: Success rate of our method (WGF) and the baselines on the reaching (left), the collision-avoidance (middle) and the multiple-goal tasks (right). The shaded area depicts the standard deviation over 5 runs.

Fig. 5 shows the convergence curves for the two baselines as in Fig. 3 of the main paper; however, we extended the horizontal axis up to the maximum number of environment steps used for training.

Figure 5: The success rate of the two baselines on the reaching task (left), the collision-avoidance task (middle) and the multiple-goal task (right). The shaded area indicates the standard deviation over 5 runs.

Figure 6: Variance of the success rate over the 5 runs for our method (WGF) and the two baselines on the reaching task (left), the collision-avoidance task (middle) and the multiple-goal task (right). The violin plots are overlaid with box plots, quartile lines and a swarm plot, where dots indicate the success rates of individual runs. The time steps at which we determined the variance are for PPO, SAC and WGF for the three tasks from left to right: (280000, 400000, 80000), (275000, 300000, 90000), (130000, 200000, 95000).

Figure 7: The success rate of our method and an ablated version not using the Bures-Wasserstein formulation, for the reaching task (left), the collision-avoidance task (middle) and the multiple-goal task (right). The shaded area indicates the standard deviation over 5 runs.

Figure 8: Extended plot of the success rate of an ablated version of our method not using the Bures-Wasserstein-based formulation, for the reaching task (left), the collision-avoidance task (middle) and the multiple-goal task (right). The shaded area indicates the standard deviation over 5 runs.

Figure 12: Green Gaussian components represent the initial GMM policy learned from demonstrations, projected on the Cartesian position (left) and velocity (right) spaces. The recorded position and velocity data are depicted as black dots.

$$\prod_t ds_0\, ds_t\, da_t\, \rho(s_0)\, \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

$$\prod_t ds_0\, ds_t\, da_t\, \rho(s_0) \sum_i \omega_i(s_t)\, \mathcal{N}\big(a_t; \mu_i(s_t), \Sigma_i(s_t)\big)\, p(s_{t+1}|s_t, a_t)$$


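$$\nabla_{\omega_l} J(\pi) = \mathbb{E}_{\tau}\Bigg[\sum_t \Bigg(\frac{\mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)}{\sum_j \omega_j\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)} - \frac{\int da\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)}{\sum_j \omega_j \int da\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)}\Bigg) \sum_{t' > t} r(s_t, a_t)\Bigg].$$

Note that we introduced the responsibilities $\zeta_{l,s_t,a_t}$ and $\zeta_{l,s_t}$, which are defined as follows,

$$\zeta_{l,s_t,a_t} = \frac{\omega_l\, \mathcal{N}(s_t, a_t; \mu_l, \Sigma_l)}{\sum_j \omega_j\, \mathcal{N}(s_t, a_t; \mu_j, \Sigma_j)},$$

and

$$\zeta_{l,s_t} = \frac{\omega_l\, \mathcal{N}(s_t; \mu_{l,s}, \Sigma_{l,ss})}{\sum_j \omega_j\, \mathcal{N}(s_t; \mu_{j,s}, \Sigma_{j,ss})}. \tag{31}$$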

A.4 RELATION BETWEEN FORWARD AND BACKWARD DISCRETIZATION IN THE BURES-WASSERSTEIN METRIC

In this section we outline the relation between the implicit and explicit optimization schemes w.r.t. the Bures-Wasserstein metric, which is leveraged to formulate our policy optimization in § 3. We closely follow Chen & Li (2020). For the sake of simplicity, we group the Gaussian parameters $\mu$ and $\Sigma$ into a single parameter vector $\theta$. Furthermore, we restrict our explanation to a single Gaussian component, which is possible without loss of generality, as each of the $N$ components lives on its own manifold $\mathbb{R}^d \times \mathcal{S}^d_{++}$. The Riemannian gradient w.r.t. the Gaussian parameters $\theta$, $\mathrm{grad}_{\theta} J(\pi(\theta))$, satisfies by definition

$$g_{\theta}\big(\mathrm{grad}_{\theta} J(\pi(\theta)), \xi\big) = \nabla_{\theta} J(\pi(\theta))^T \xi,$$

where $\nabla_{\theta}$ denotes the Euclidean gradient, $\xi$ is an arbitrary vector on the tangent space $T_{\theta}\mathcal{M}$, and $g_{\theta}$ is the Riemannian metric tensor, defining the inner product on $T_{\theta}\mathcal{M}$. The Riemannian metric $g_{\theta}$ can be written as

$$g_{\theta}(\zeta, \xi) = \zeta^T G_W(\theta)\, \xi,$$

with two arbitrary tangent vectors $\zeta, \xi$, and $G_W(\theta)$ being a positive definite matrix. Moreover, note that the Wasserstein distance $W_2^2\big(\mathcal{N}(\theta), \mathcal{N}(\theta + \Delta\theta)\big)$, where $\Delta\theta$ denotes a small perturbation in the Gaussian parameters $\theta$, can be expressed in terms of $G_W(\theta)$ for $\Delta\theta \to 0$. Similarly, we can approximate the objective evaluated at $\theta + \Delta\theta$ via the Taylor theorem. With this, we can approximate the implicit update, from which we obtain the update equation for $\theta$ (Eq. 39). Note that Eq. 39 in turn corresponds to an approximation of the exact Riemannian gradient descent (Eq. 40).

Figure 10: Variance of the success rate over 5 runs for our method (WGF) and the ablated method (non-BW) on the reaching task (left), the collision-avoidance task (middle) and the multiple-goal task (right). The violin plots are overlaid with box plots, quartile lines and a swarm plot, where dots indicate the success rates of individual runs. The time steps at which we determined the variance are (80000, 400000), (90000, 200000), (85000, 90000).
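The argument of Section A.4 can be sketched as the short derivation below. It is a reconstruction under stated assumptions rather than the original equations: it assumes the normalization $W_2^2 \approx \Delta\theta^T G_W(\theta)\, \Delta\theta$ for the local expansion of the Wasserstein distance (a factor of 1/2 in this expansion would only rescale the step size $\tau$), a first-order Taylor expansion of $J$, and the implicit scheme with objective $W_2^2 / (2\tau) - J$.

```latex
% Assumed local expansions for a small perturbation \Delta\theta:
W_2^2\big(\mathcal{N}(\theta), \mathcal{N}(\theta + \Delta\theta)\big)
  \approx \Delta\theta^{T} G_W(\theta)\, \Delta\theta,
\qquad
J(\theta + \Delta\theta) \approx J(\theta) + \nabla_\theta J(\theta)^{T} \Delta\theta .

% Implicit (backward) step with both expansions inserted:
\Delta\theta^{*}
  = \arg\min_{\Delta\theta}\;
    \frac{\Delta\theta^{T} G_W(\theta_k)\, \Delta\theta}{2\tau}
    - \nabla_\theta J(\theta_k)^{T} \Delta\theta
\;\Longrightarrow\;
\Delta\theta^{*} = \tau\, G_W(\theta_k)^{-1}\, \nabla_\theta J(\theta_k).

% The metric g_\theta(\zeta, \xi) = \zeta^T G_W(\theta) \xi and the definition
% of the Riemannian gradient give grad J = G_W^{-1} \nabla J, hence:
\mathrm{grad}_\theta J(\theta) = G_W(\theta)^{-1}\, \nabla_\theta J(\theta)
\;\Longrightarrow\;
\theta_{k+1} = \theta_k + \tau\, \mathrm{grad}_\theta J(\theta_k).
```

Under these assumptions the implicit step coincides, to first order, with an additive step along the Riemannian gradient, which is the update that the retraction-based scheme replaces by a step that stays on the manifold.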

