PROBABILISTIC MIXTURE-OF-EXPERTS FOR EFFICIENT DEEP REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) has recently been applied successfully to a variety of problems, typically with a unimodal policy representation. However, capturing the decomposable and hierarchical structure within a complex task can be essential for further improving learning efficiency and performance, which naturally leads to a multimodal policy or a mixture-of-experts (MOE). To the best of our knowledge, present general-purpose DRL algorithms do not deploy MOE methods as policy function approximators, due to the lack of differentiability or of an explicit probabilistic representation. In this work, we propose a differentiable probabilistic mixture-of-experts (PMOE) embedded in an end-to-end training scheme for generic off-policy and on-policy algorithms with stochastic policies, e.g., Soft Actor-Critic (SAC) and Proximal Policy Optimisation (PPO). Experimental results demonstrate the advantage of our method over unimodal policies, three different MOE methods, and an option-framework method, based on two types of DRL algorithms. We also demonstrate the distinguishable primitives learned with PMOE in different environments.

1. INTRODUCTION

The mixture-of-experts (MOE) method (Jacobs et al., 1991a) has been shown to improve the generalisation ability of reinforcement learning (RL) agents (Hausknecht & Stone, 2016a; Peng et al., 2016; Neumann et al.). Among these methods, Gaussian Mixture Models (GMMs) are a promising way to model multimodal policies in RL (Peng et al., 2019; Akrour et al., 2020), in which distinguishable experts, or so-called primitives, are learned. Distinguishable experts can propose several solutions for a task and cover a larger exploration space, which can potentially lead to better task performance and sample efficiency than a unimodal counterpart (Bishop, 2007). A multimodal policy can be learned by various methods, such as a two-stage training approach (Peng et al., 2019), a specific clustering method (Akrour et al., 2020), or a specially parameterised action design (Hausknecht & Stone, 2016b). However, these methods are limited: either they are not applicable to complicated scenarios such as high-dimensional continuous control tasks, or the training algorithms are too complex for general use. To the best of our knowledge, present general-purpose DRL algorithms do not deploy MOE to model multimodal policies, mainly due to the lack of differentiability or of an explicit probabilistic representation. In policy-gradient algorithms (Sutton et al., 1999a), the performance objective is not differentiable with respect to the policy parameters when expert selection is a discrete sampling step. This non-differentiability also hinders learning a deep neural network policy, making the combination of MOE and DRL non-trivial. In this paper, we propose a probabilistic framework that tackles the non-differentiability problem while retaining the mixture distribution assumption, and we use the GMM to model multimodal policies.
Once the non-differentiability problem is solved, our training method can be combined with policy-gradient algorithms by simply setting the number of experts (mixtures) to be greater than one. Our contributions can be summarised as follows:

• We analyse the non-differentiability problem of approximating the policy with a GMM in DRL and its associated drawbacks.

• We propose an end-to-end training method that obtains the primitives with probabilities in a frequentist manner, solving the non-differentiability problem.

• Our experiments show that the proposed method achieves better task performance and sample efficiency by exploring a larger behaviour space, especially in complicated continuous control tasks, compared with unimodal RL algorithms, three different MOE methods, and an option-framework method.
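To illustrate the differentiability issue discussed above: the density of a Gaussian mixture policy is a smooth function of all its parameters, even though drawing the expert index is a discrete, non-differentiable step. The following is a minimal NumPy sketch of this distinction for a 1-D action; all names and the toy setting are our own illustration, not the paper's PMOE implementation:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def gmm_log_prob(action, weights, means, stds):
    """Log-density of a 1-D action under a Gaussian mixture policy.

    log pi(a|s) = log sum_k w_k N(a | mu_k, sigma_k).
    This quantity is smooth in (weights, means, stds), which is what
    end-to-end gradient training of a mixture policy relies on.
    """
    log_comp = (-0.5 * ((action - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2 * np.pi))
    return logsumexp(np.log(weights) + log_comp)

def sample_gmm(weights, means, stds, rng):
    """Ancestral sampling: the index draw below is the
    non-differentiable step a naive MOE policy gets stuck on."""
    k = rng.choice(len(weights), p=weights)
    return rng.normal(means[k], stds[k])
```

For an equal-weight mixture of two identical standard Gaussians, `gmm_log_prob(0.0, ...)` recovers the standard normal log-density at 0, since the mixture collapses to a single Gaussian.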

2. RELATED WORK

Hierarchical Policies There are two main related hierarchical policy structures. The feudal schema (Dayan & Hinton, 1992) has two types of agents: managers and workers. The managers first make high-level decisions, then the workers take low-level actions according to these decisions. The options framework (Sutton et al., 1999b) has an upper-level agent (policy-over-options) that decides when the lower-level agents (sub-policies) start and terminate. Early work focused on discovering temporal abstractions autonomously, mostly in discrete action and state spaces (McGovern & Barto, 2001; Menache et al., 2002; Simsek & Barto, 2008; Silver & Ciosek, 2012). More recently, Mankowitz et al. (2016) propose a method that assumes the initiation sets and termination functions have particular structures; Kulkarni et al. (2016) use intrinsic and extrinsic rewards to learn sub-policies and the policy-over-options; Bacon et al. (2017) train sub-policies and the policy-over-options end-to-end with a deep termination function; Vezhnevets et al. (2017) generalise the feudal schema to continuous action spaces and use an embedding operation to address the non-differentiability problem; Peng et al. (2016) introduce a mixture of actor-critic experts to learn terrain-adaptive dynamic locomotion skills; and Peng et al. (2019) replace the additive mixture-of-experts composition with a multiplicative one.

Mixture-of-Experts The MOE method has been widely studied as a probabilistic modelling approach (Tresp, 2000; Yuan & Neubauer, 2008; Luo & Sun, 2017). MOE can also be combined with RL (Doya et al., 2002; Neumann et al.; Peng et al., 2016; Hausknecht & Stone, 2016a; Peng et al., 2019), in which the policies are modelled as probabilistic mixture models and each expert aims to learn a distinguishable policy.

Policy-based RL Policy-based RL aims to find the optimal policy that maximises the expected return through gradient updates. Among the various algorithms, actor-critic methods are often employed (Barto et al., 1983; Sutton & Barto, 1998). Off-policy algorithms (O'Donoghue et al., 2016; Lillicrap et al., 2016; Gu et al., 2017; Tuomas et al., 2018) are more sample efficient than on-policy ones (Peters & Schaal, 2008; Schulman et al., 2017; Mnih et al., 2016; Gruslys et al., 2017). However, the learned policies are still unimodal.

3.1. NOTATION

The model-free RL problem can be formulated as a Markov Decision Process (MDP), denoted as a tuple (S, A, P, r), where S and A are the continuous state and action spaces, respectively. The agent observes state s_t ∈ S and takes an action a_t ∈ A at time step t. The environment emits a reward r : S × A → [r_min, r_max] and transitions to a new state s_{t+1} according to the transition probabilities P : S × S × A → [0, ∞). In deep reinforcement learning algorithms, the Q-value function Q(s_t, a_t) describes the expected return after taking action a_t in state s_t. The Q-value can be computed iteratively by applying the Bellman backup:

Q(s_t, a_t) ← E_{s_{t+1}∼P} [ r(s_t, a_t) + γ E_{a_{t+1}∼π} [ Q(s_{t+1}, a_{t+1}) ] ].   (1)

Our goal is to find the policy that maximises the expected return:

π_{Θ*}(a_t|s_t) = arg max_{π_Θ(a_t|s_t)} E_{a_t∼π_Θ(a_t|s_t)} [ Q(s_t, a_t) ].   (2)
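The Bellman backup of Eq. (1) can be sketched on a toy tabular MDP; the two states, two actions, rewards, and transition probabilities below are illustrative inventions for exposition, not part of the paper's setting:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions. All quantities are illustrative.
gamma = 0.9
r = np.array([[1.0, 0.0],      # r[s, a]: reward for action a in state s
              [0.0, 2.0]])
# P[s, a, s']: transition probabilities, each row sums to 1
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
# Uniform stochastic policy pi(a | s)
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])

Q = np.zeros((2, 2))
for _ in range(500):
    # V(s') = E_{a'~pi}[ Q(s', a') ]
    V = (pi * Q).sum(axis=1)
    # Bellman backup, Eq. (1):
    # Q(s,a) <- r(s,a) + gamma * E_{s'~P}[ V(s') ]
    Q = r + gamma * P @ V
```

Because the backup is a gamma-contraction, repeated application converges to the fixed point Q^pi, i.e. the final `Q` satisfies Eq. (1) with equality.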

