ADAPTIVE DISCRETIZATION FOR CONTINUOUS CONTROL USING PARTICLE FILTERING POLICY NETWORK

Anonymous authors
Paper under double-blind review

Abstract

Controlling the movements of highly articulated agents and robots has been a long-standing challenge for model-free deep reinforcement learning. In this paper, we propose a simple yet general framework for improving the performance of policy gradient algorithms by discretizing the continuous action space. Instead of using a fixed set of predetermined atomic actions, we exploit particle filtering to adaptively discretize actions during training and track the posterior policy, represented as a mixture distribution. The resulting policy can replace the original continuous policy of any given policy gradient algorithm without changing its underlying model architecture. We demonstrate the applicability of our approach to state-of-the-art on-policy and off-policy baselines in challenging control tasks. Baselines using our particle-based policies achieve better final performance and faster convergence than the corresponding continuous implementations and than implementations that rely on fixed discretization schemes.

1. INTRODUCTION

In the last few years, impressive results have been obtained by deep reinforcement learning (DRL) on both physical and simulated articulated agents for a wide range of motor tasks that involve learning controls in high-dimensional continuous action spaces (Lillicrap et al., 2015; Levine et al., 2016; Heess et al., 2017; Haarnoja et al., 2018c; Rajeswaran et al., 2018; Tan et al., 2018; Peng et al., 2018; 2020). Many methods have been proposed to improve the performance of DRL on continuous control problems, e.g., distributed training (Mnih et al., 2016; Espeholt et al., 2018), hierarchical learning (Daniel et al., 2012; Haarnoja et al., 2018a), and maximum entropy regularization (Haarnoja et al., 2017; Liu et al., 2017; Haarnoja et al., 2018b). Most such works, though, focus on learning mechanisms layered on top of the basic distribution that defines the action policy, where a Gaussian policy, possibly combined with a squashing function, is the most common choice for dealing with continuous action spaces.

However, the unimodal form of Gaussian distributions can experience difficulties when facing a multi-modal reward landscape during optimization and prematurely commit to suboptimal actions (Daniel et al., 2012; Haarnoja et al., 2017). To address the unimodality issue of Gaussian policies, prior work has explored distributions more expressive than Gaussians, with a simple solution being to discretize the action space and use categorical distributions as multi-modal action policies (Andrychowicz et al., 2020; Jaśkowski et al., 2018; Tang & Agrawal, 2019). However, categorical distributions cannot be directly extended to many off-policy frameworks, as their sampling process is not reparameterizable. Importantly, the performance of action space discretization depends heavily on the choice of discrete atomic actions, which are usually picked uniformly due to the lack of prior knowledge.
On the surface, increasing the resolution of the discretized action space can make fine control more attainable. In practice, however, this can be detrimental to the optimization during training, since the policy gradient variance increases with the number of atomic actions (Tang & Agrawal, 2019). Our work also focuses on action policies defined by an expressive, multi-modal distribution. Instead of selecting fixed samples from the continuous action space, though, we exploit a particle-based approach to sample the action space dynamically during training and track the policy, represented as a mixture distribution with state-independent components. We refer to the resulting policy network as Particle Filtering Policy Network (PFPN). We evaluate PFPN on state-of-the-art on-policy and off-policy baselines using high-dimensional tasks from the PyBullet Roboschool environments (Coumans & Bai, 2016–2019) and the more challenging DeepMimic framework (Peng et al., 2018). Our experiments show that baselines using PFPN exhibit better overall performance and/or speed of convergence and lead to more robust agent control, as compared to uniform discretization and to corresponding implementations with Gaussian policies.

Main Contributions. Overall, we make the following contributions. We propose PFPN as a general framework for providing expressive action policies in continuous action spaces. PFPN uses state-independent particles to represent atomic actions and optimizes their placement to meet the fine-control demands of continuous control problems. We introduce a reparameterization trick that makes PFPN applicable to both on-policy and off-policy policy gradient methods. PFPN outperforms unimodal Gaussian policies and the uniform discretization scheme, and is more sample-efficient and stable across different training trials. In addition, it leads to high-quality motion and generates controls that are more robust to external perturbations.
Our work does not change the underlying model architecture or learning mechanisms of policy gradient algorithms and thus can be applied to most commonly used policy gradient algorithms.
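To make the idea concrete, the following is a minimal, hypothetical sketch of a particle-based mixture policy for a single action dimension. It is not the paper's implementation: the particles here are (location, scale) pairs shared across states, the `weights` method is a stub standing in for a policy network's softmax head, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class ParticlePolicy:
    """Sketch of a mixture policy with K state-independent particles.

    Each particle k is a pair (mu_k, sigma_k) in the 1-D action space; the
    (stubbed) network maps a state to mixture weights w_k(s).  The policy is
    the mixture  sum_k w_k(s) * N(mu_k, sigma_k^2).
    """

    def __init__(self, n_particles=5, low=-1.0, high=1.0):
        # Initialize particle locations uniformly; during training they would
        # be optimized jointly with the network, adapting the discretization.
        self.mu = np.linspace(low, high, n_particles)
        self.sigma = np.full(n_particles, 0.1)

    def weights(self, state):
        # Stand-in for the policy network's softmax output over particles.
        logits = np.tanh(state) * np.arange(len(self.mu))
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def sample(self, state):
        # Sample a particle index, then perturb around its location.
        k = rng.choice(len(self.mu), p=self.weights(state))
        return self.mu[k] + self.sigma[k] * rng.standard_normal()

    def log_prob(self, state, a):
        # Log-density of the full mixture at action a.
        w = self.weights(state)
        dens = w * np.exp(-0.5 * ((a - self.mu) / self.sigma) ** 2) / (
            self.sigma * np.sqrt(2.0 * np.pi))
        return np.log(dens.sum())

pi = ParticlePolicy()
a = pi.sample(0.3)
lp = pi.log_prob(0.3, a)
```

Because each component has nonzero variance, the mixture assigns positive density everywhere between particles, which is what distinguishes this scheme from a plain categorical distribution over fixed atomic actions.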

2. BACKGROUND

We consider a standard reinforcement learning setup where, given a time horizon H, the trajectory τ = (s_1, a_1, …, s_H, a_H) is obtained through a transition model M(s_{t+1}|s_t, a_t) and a parameterized action policy π_θ(a_t|s_t), with s_t ∈ R^n and a_t ∈ R^m denoting the state and the action taken at time step t, respectively. The goal of learning is to optimize θ to maximize the expected cumulative reward:

    J(θ) = E_{τ∼p_θ(τ)}[r(τ)] = ∫ p_θ(τ) r(τ) dτ.    (1)

Here, p_θ(τ) denotes the state-action visitation distribution for the trajectory τ induced by the transition model M and the action policy π_θ with parameters θ, and r(τ) = Σ_t r(s_t, a_t), where r(s_t, a_t) ∈ R is the reward received at time step t. We can maximize J(θ) by adjusting the policy parameters θ through gradient ascent, where the gradient of the expected reward is given by the policy gradient theorem (Sutton et al., 2000), i.e.

    ∇_θ J(θ) = E_{τ∼π_θ(·|s_t)}[A_t ∇_θ log π_θ(a_t|s_t) | s_t],    (2)

where A_t ∈ R denotes an estimate of the reward term r(τ). In DRL, the estimator of A_t often relies on a separate network (critic) that is updated in tandem with the policy network (actor). This gives rise to a family of policy gradient algorithms known as actor-critic.

On-Policy and Off-Policy Actor-Critics. In on-policy learning, the update policy is also the behavior policy, based on which a trajectory is obtained to estimate A_t. Common on-policy actor-critic algorithms, including A3C (Mnih et al., 2016) and PPO (Schulman et al., 2017), directly employ Equation 2 for optimization. In off-policy learning, the policy can be updated without knowledge of a whole trajectory. This results in more sample-efficient approaches, as samples are reusable. While algorithms such as Retrace (Munos et al., 2016) and PCL (Nachum et al., 2017) rely on Equation 2, many off-policy algorithms exploit a critic network to estimate A_t given a state-action pair (Q- or soft Q-value).
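The score-function estimator behind Equation 2 can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's method: it uses a state-independent categorical policy π_θ = softmax(θ) so that ∇_θ log π_θ(a) has the closed form one_hot(a) − π_θ, and averages A_t ∇_θ log π_θ(a_t) over a batch of sampled actions and advantage estimates.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_gradient(theta, actions, advantages):
    """Monte-Carlo estimate of Eq. 2 for a toy categorical policy.

    theta      -- logits of pi_theta = softmax(theta)
    actions    -- sampled action indices a_t
    advantages -- corresponding estimates A_t
    """
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a, A in zip(actions, advantages):
        # grad_theta log softmax(theta)[a] = one_hot(a) - pi
        g = -pi.copy()
        g[a] += 1.0
        grad += A * g
    return grad / len(actions)

theta = np.zeros(3)
g = policy_gradient(theta, actions=[0, 1, 0], advantages=[1.0, -0.5, 1.0])
```

Ascending this gradient raises the probability of actions with positive advantage (here, action 0) and lowers that of actions with negative advantage, which is exactly the update direction Equation 2 prescribes.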
Common off-policy actor-critic methods include DDPG (Lillicrap et al., 2015), SAC (Haarnoja et al., 2018b;d) and their variants (Haarnoja et al., 2017; Fujimoto et al., 2018). These methods perform optimization to maximize a state-action value Q(s_t, a_t). In order to update the policy network with parameters θ, they require the action policy to be reparameterizable, such that the sampled action a_t can be rewritten as a function differentiable with respect to the parameters θ, and the optimization can be done through the gradient ∇_{a_t} Q(s_t, a_t) ∇_θ a_t.

Policy Representation. Given a multi-dimensional continuous action space, the most common choice in current DRL baselines is to model the policy π_θ as a multivariate Gaussian distribution with independent components for each action dimension (DDPG, SAC and their variants typically use a Gaussian with a monotonic squashing function to stabilize training). For simplicity, let us consider a simple case with a single action dimension and define the action policy as π_θ(·|s_t) := N(µ_θ(s_t), σ²_θ(s_t)). Then, we can obtain log π_θ(a_t|s_t) ∝ −(a_t − µ_θ(s_t))². Given a sampled action a_t and the estimate of cumulative rewards A_t, the optimization process based on the above expression can be imagined as that of shifting µ_θ(s_t) towards the direction of a_t if A_t is higher

