PROBABILISTIC MIXTURE-OF-EXPERTS FOR EFFICIENT DEEP REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (DRL) has recently solved a wide range of problems, typically with a unimodal policy representation. However, capturing the decomposable and hierarchical structure within a complex task can be essential for further improving learning efficiency and performance, and may call for a multimodal policy or a mixture-of-experts (MOE). To the best of our knowledge, present general-purpose DRL algorithms do not deploy MOE methods as policy function approximators, either because of the lack of differentiability or because of the absence of an explicit probabilistic representation. In this work, we propose a differentiable probabilistic mixture-of-experts (PMOE), embedded in an end-to-end training scheme, for generic off-policy and on-policy algorithms that use stochastic policies, e.g., Soft Actor-Critic (SAC) and Proximal Policy Optimisation (PPO). Experimental results demonstrate the advantage of our method over unimodal policies, three different MOE methods, and an option-framework method, on two types of DRL algorithms. We also demonstrate the distinguishable primitives learned with PMOE in different environments.

1. INTRODUCTION

The mixture-of-experts (MOE) method (Jacobs et al., 1991a) has been shown to improve the generalisation ability of reinforcement learning (RL) agents (Hausknecht & Stone, 2016a; Peng et al., 2016; Neumann et al.). Among these methods, Gaussian Mixture Models (GMMs) are a promising way to model multimodal policies in RL (Peng et al., 2019; Akrour et al., 2020), in which distinguishable experts, or so-called primitives, are learned. Distinguishable experts can propose several solutions to a task and explore a larger region of the solution space, which can potentially lead to better task performance and sample efficiency than a unimodal counterpart (Bishop, 2007). A multimodal policy can be learned by various methods, such as a two-stage training approach (Peng et al., 2019), a specific clustering method (Akrour et al., 2020), or a specially parameterised action design (Hausknecht & Stone, 2016b). However, these methods are limited: they are either inapplicable to complicated scenarios such as high-dimensional continuous control tasks, or their training procedures are too complex for general use. To the best of our knowledge, present general-purpose DRL algorithms do not deploy MOE to model a multimodal policy, mainly because of the lack of differentiability or the absence of an explicit probabilistic representation. In policy-gradient algorithms (Sutton et al., 1999a), the gradient of the performance with respect to the policy parameters is then non-differentiable, because sampling the discrete component index of the mixture blocks gradient flow. This non-differentiability also hinders learning a deep neural network policy, making the combination of MOE and DRL non-trivial. In this paper, we propose a probabilistic framework that tackles the non-differentiability problem while keeping the mixture-distribution assumption, and we continue to use a GMM to model the multimodal policy.
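To make the non-differentiability concrete, the following minimal NumPy sketch (an illustration, not the paper's PMOE method; all function names are ours) shows a GMM policy: the mixture log-density is a smooth, differentiable function of the parameters, but drawing an action requires first sampling a discrete component index, a step through which gradients cannot flow and which standard reparameterisation does not cover.

```python
import numpy as np

def gmm_sample(weights, means, stds, rng):
    """Sample an action from a diagonal-Gaussian mixture policy.

    The discrete choice of component k (rng.choice) is the
    non-differentiable step: gradients cannot propagate through it,
    which is why a GMM policy head breaks naive reparameterised
    policy-gradient training.
    """
    k = rng.choice(len(weights), p=weights)  # discrete, non-differentiable
    return means[k] + stds[k] * rng.standard_normal(means.shape[-1])

def gmm_log_prob(action, weights, means, stds):
    """Exact mixture log-density: log sum_k w_k N(a | mu_k, sigma_k^2).

    Unlike sampling, this quantity is a smooth function of the
    mixture parameters, so likelihood-based objectives remain usable.
    """
    # Per-component diagonal-Gaussian log-densities, shape (K,).
    log_comp = -0.5 * np.sum(
        ((action - means) / stds) ** 2 + 2.0 * np.log(stds) + np.log(2.0 * np.pi),
        axis=-1,
    )
    # Numerically stable log-sum-exp over weighted components.
    z = log_comp + np.log(weights)
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))
```

For a single-component mixture this reduces to the usual Gaussian log-density, which gives a quick sanity check on the implementation.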
Once the non-differentiability problem is solved, our training method can be combined with policy-gradient algorithms simply by setting the number of experts (mixture components) to be greater than one. The contributions can be summarised as follows:

• We analyse the non-differentiability problem of approximating the policy as a GMM in DRL and its associated drawbacks.

• We propose an end-to-end training method that obtains primitives with probabilities in a frequentist manner, solving the non-differentiability problem.
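To illustrate why "setting the number of experts greater than one" is the only interface change, the sketch below (our own hypothetical layout, assuming a single linear output layer and a K-component diagonal GMM; not the paper's architecture) shows a policy head that emits mixture weights, means, and standard deviations, and that reduces to the standard unimodal Gaussian head used by SAC and PPO when K = 1.

```python
import numpy as np

def gmm_policy_head(h, params, num_experts, action_dim):
    """Map a feature vector h to GMM policy parameters.

    Output layout: [K mixture logits | K*D means | K*D log-stds].
    With num_experts = 1 this is exactly the usual unimodal
    Gaussian policy head, so the same network shape drops into
    SAC- or PPO-style training unchanged.
    """
    W, b = params  # one hypothetical linear layer, for illustration
    out = h @ W + b
    K, D = num_experts, action_dim
    logits = out[:K]
    means = out[K:K * (1 + D)].reshape(K, D)
    log_stds = out[K * (1 + D):].reshape(K, D)
    # Softmax over logits gives valid mixture weights.
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w, means, np.exp(log_stds)
```

The design choice here is that the number of experts is a plain hyperparameter of the output layer, so switching between a unimodal and a multimodal policy requires no change to the surrounding training loop.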

