ADAPTIVE DISCRETIZATION FOR CONTINUOUS CONTROL USING PARTICLE FILTERING POLICY NETWORK

Anonymous authors
Paper under double-blind review

Abstract

Controlling the movements of highly articulated agents and robots has been a long-standing challenge for model-free deep reinforcement learning. In this paper, we propose a simple yet general framework for improving the performance of policy gradient algorithms by discretizing the continuous action space. Instead of using a fixed set of predetermined atomic actions, we exploit particle filtering to adaptively discretize actions during training and track the posterior policy represented as a mixture distribution. The resulting policy can replace the original continuous policy of any given policy gradient algorithm without changing its underlying model architecture. We demonstrate the applicability of our approach to state-of-the-art on-policy and off-policy baselines in challenging control tasks. Baselines using our particle-based policies achieve better final performance and faster convergence than corresponding continuous implementations and implementations that rely on fixed discretization schemes.

1. INTRODUCTION

In the last few years, impressive results have been obtained by deep reinforcement learning (DRL) on both physical and simulated articulated agents for a wide range of motor tasks that involve learning controls in high-dimensional continuous action spaces (Lillicrap et al., 2015; Levine et al., 2016; Heess et al., 2017; Haarnoja et al., 2018c; Rajeswaran et al., 2018; Tan et al., 2018; Peng et al., 2018; 2020). Many methods have been proposed to improve the performance of DRL for continuous control problems, e.g., distributed training (Mnih et al., 2016; Espeholt et al., 2018), hierarchical learning (Daniel et al., 2012; Haarnoja et al., 2018a), and maximum entropy regularization (Haarnoja et al., 2017; Liu et al., 2017; Haarnoja et al., 2018b). Most of these works, though, focus on learning mechanisms layered on top of the basic distribution that defines the action policy, where a Gaussian policy, possibly combined with a squashing function, is the most common choice for continuous action spaces. However, the unimodal form of Gaussian distributions can struggle with a multi-modal reward landscape during optimization and prematurely commit to suboptimal actions (Daniel et al., 2012; Haarnoja et al., 2017). To address the unimodality issue of Gaussian policies, prior work has explored more expressive distributions, a simple solution being to discretize the action space and use categorical distributions as multi-modal action policies (Andrychowicz et al., 2020; Jaśkowski et al., 2018; Tang & Agrawal, 2019). However, categorical distributions cannot be directly extended to many off-policy frameworks, as their sampling process is not reparameterizable. Importantly, the performance of action-space discretization depends heavily on the choice of discrete atomic actions, which are usually picked uniformly due to a lack of prior knowledge.
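To make the fixed discretization scheme discussed above concrete, the following sketch (illustrative names; not code from this paper) builds a uniform grid of atomic actions per dimension and samples from an independent categorical distribution over those atoms for each dimension. The sampling step at the end is exactly the operation that is not reparameterizable: the drawn index does not admit a pathwise gradient with respect to the logits.

```python
import numpy as np

def uniform_atoms(low, high, n_atoms):
    """Fixed, uniformly spaced atomic actions for one action dimension."""
    return np.linspace(low, high, n_atoms)

def categorical_sample(logits, rng):
    """Sample an atom index from softmax(logits).

    This draw is not reparameterizable: the sampled index is a discrete
    choice, so gradients cannot flow through it back to the logits.
    """
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return rng.choice(len(logits), p=p)

rng = np.random.default_rng(0)
atoms = uniform_atoms(-1.0, 1.0, 11)   # 11 uniform atoms in [-1, 1]
logits = np.zeros((4, 11))             # 4-D action space, one categorical per dim
action = np.array([atoms[categorical_sample(l, rng)] for l in logits])
```

With uniform logits, each dimension simply picks one of the 11 grid points at random; training would shape the logits toward high-reward atoms, but the atom locations themselves stay fixed, which is the limitation the adaptive, particle-based discretization proposed here is meant to address.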
On the surface, increasing the resolution of the discretized action space makes finer control possible. In practice, however, this can be detrimental to optimization during training, since the variance of the policy gradient grows with the number of atomic actions (Tang & Agrawal, 2019). Our work also focuses on action policies defined by an expressive, multi-modal distribution. Instead of selecting fixed samples from the continuous action space, though, we exploit a particle-based approach to sample the action space dynamically during training and track the policy, represented as a mixture distribution with state-independent components. We refer to the resulting policy network as the Particle Filtering Policy Network (PFPN). We evaluate PFPN on state-of-the-art on-policy and off-policy baselines using high-dimensional tasks from the PyBullet Roboschool en-

