ADDRESSING HIGH-DIMENSIONAL CONTINUOUS ACTION SPACE VIA DECOMPOSED DISCRETE POLICY-CRITIC

Abstract

Reinforcement learning (RL) methods for discrete action spaces, such as DQN, are widely used in tasks such as Atari games. However, they encounter difficulties in continuous control tasks, since discretizing a continuous action space incurs the curse of dimensionality. To tackle continuous control tasks via discretized actions, we propose a decomposed discrete policy-critic (D2PC) architecture, inspired by multi-agent RL (MARL), which associates a discrete policy with each action dimension while leveraging a single critic network to provide a shared evaluation. Building on D2PC, we develop the soft stochastic D2PC (SD2PC) and deterministic D2PC (D3PC) methods, with discrete stochastic and deterministic policies respectively, which achieve training performance comparable or superior to that of continuous actor-critic methods. Additionally, we design a mechanism that allows D3PC to interact with continuous actor-critic methods, yielding the Q-policy-critic (QPC) algorithm, which inherits the training efficiency of discrete RL and the near-optimal final performance of continuous RL algorithms. Substantial experimental results on several continuous control benchmarks validate our claims.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) is a class of machine learning methods that train models through interactions with the environment. RL can be divided into discrete RL, whose action space contains finitely many actions, and continuous RL, whose action space contains infinitely many actions. Actor-critic (Konda & Tsitsiklis, 1999), a structure that couples a policy network (the actor) with a state-action value network (the critic), is commonly used to address continuous RL tasks such as continuous control, but it has been shown to be fragile and sensitive to hyperparameters (Haarnoja et al., 2018a). Employing discrete RL algorithms for continuous control via action discretization is a feasible way to improve training efficiency (Pazis & Lagoudakis, 2009), since discrete RL algorithms, including Q-learning (Watkins & Dayan, 1992) and deep Q-networks (DQN) (Mnih et al., 2013; 2015), have low complexity and exhibit stable training behavior. But for continuous control tasks with high-dimensional continuous action spaces (Levine et al., 2016), naive action discretization leads to the dimension explosion problem: if the continuous action space has $M$ dimensions and each dimension is discretized into $N$ bins, the discrete RL algorithm faces $N^M$ joint actions.

A possible remedy is to discretize each action dimension independently, so that the total number of discrete actions, $MN$, grows linearly rather than exponentially in the dimension $M$. Nonetheless, this procedure can suffer from low sample efficiency, because only on-policy iterations can be used to collaboratively learn the relationships among the policies of different action dimensions. To combine a discretized action space with off-policy iterations in a continuous RL algorithm, we propose a decomposed discrete policy-critic (D2PC) architecture inspired by multi-agent actor-critic methods: each action dimension is viewed as an independent agent and assigned a one-dimensional discrete policy, which is optimized against a centralized critic network. Building on D2PC, we propose two algorithms that effectively address continuous control tasks via discrete policies and experience replay. The first, soft D2PC (SD2PC), performs maximum-entropy value iterations and optimizes a softmax stochastic policy for each action dimension; the second, deterministic D2PC (D3PC), assigns each action dimension an independent Q-function that fits the critic's value function by supervised learning. Experiments show that both algorithms train efficiently and markedly outperform state-of-the-art actor-critic algorithms such as twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) and soft actor-critic (SAC) (Haarnoja et al., 2018a).

Although D2PC-based algorithms train fast, we observe in experiments that they may fall short of the best final performance, i.e., they can get stuck in locally optimal solutions. Since D2PC and standard actor-critic methods share the same critic structure, this motivates us to combine the best of both by developing the Q-policy-critic (QPC) algorithm. QPC uses a discrete-continuous hybrid policy, co-trains a continuous actor along with D2PC, and dynamically exploits the discrete and continuous actors to improve the critic network. Substantial numerical tests on continuous control tasks demonstrate that QPC achieves significant improvements in convergence, stability, and rewards over both D3PC and continuous actor-critic methods.
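To make the decomposition concrete, the following PyTorch sketch illustrates the idea behind the D2PC architecture. It is our own illustrative reconstruction rather than the authors' implementation: the class names, layer sizes, and the choice of $M = 6$ dimensions with $N = 11$ bins are hypothetical.

```python
import torch
import torch.nn as nn

class DecomposedDiscretePolicy(nn.Module):
    """One independent N-way categorical policy per action dimension."""
    def __init__(self, state_dim, action_dims=6, bins=11, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One softmax head per action dimension: M heads of N logits each.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, bins) for _ in range(action_dims)]
        )
        # Map bin indices back to continuous values in [-1, 1].
        self.register_buffer("bin_values", torch.linspace(-1.0, 1.0, bins))

    def forward(self, state):
        h = self.trunk(state)
        # Logits for all dimensions, shape (batch, M, N).
        return torch.stack([head(h) for head in self.heads], dim=1)

    def sample_action(self, state):
        logits = self.forward(state)
        idx = torch.distributions.Categorical(logits=logits).sample()  # (batch, M)
        return self.bin_values[idx]  # continuous joint action, (batch, M)

class CentralizedCritic(nn.Module):
    """A single Q(s, a) network shared by all per-dimension policies."""
    def __init__(self, state_dim, action_dims=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dims, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

With $M = 6$ and $N = 11$, the policy outputs only $MN = 66$ logits in total, instead of parameterizing $N^M = 11^6 \approx 1.8 \times 10^6$ joint actions as a naive joint discretization would.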

2. RELATED WORK

As pointed out in Tavakoli et al. (2018), discrete RL algorithms can address continuous action spaces by treating the multi-dimensional continuous action space as a fully cooperative multi-agent RL problem across dimensions, that is, by viewing each action dimension as an independent agent. Several attempts have been made to split a multi-dimensional continuous policy into single-dimensional discrete policies. For example, Metz et al. (2017) proposed a next-step prediction model that learns each dimension's discrete Q-value sequentially; Tang & Agrawal (2020) integrated single-dimensional discrete policies with an ordinal parameterization that encodes the natural ordering between discrete actions; see also Jaśkowski et al. (2018) and Andrychowicz et al. (2020). It is worth remarking that these methods either cannot exploit experience replay or cannot deal with high-dimensional continuous action spaces, rendering them typically less effective than actor-critic methods for continuous control. Our proposed D2PC structure can, in some sense, be seen as a variant of multi-agent actor-critic, relying on a centralized critic network to optimize each action dimension's discrete policy.

Since the seminal contribution of deterministic policy gradients (Silver et al., 2014), off-policy actor-critic methods have arguably become the most effective way to deal with continuous control tasks, thanks to their structure suited to continuous actions and their higher data efficiency compared with on-policy algorithms, e.g., proximal policy optimization (PPO) (Schulman et al., 2017) and trust-region policy optimization (Schulman et al., 2015; Wu et al., 2017), among many others. Off-policy actor-critic methods can be grouped into two categories: deterministic algorithms, including the celebrated DDPG (Lillicrap et al., 2015), D4PG (Barth-Maron et al., 2018), and TD3; and stochastic algorithms such as SQL (Haarnoja et al., 2017) and SAC (Haarnoja et al., 2018a). DDPG uses the Bellman equation (Bellman, 1966) with temporal-difference learning to iteratively update the critic network, minimizing the loss function

$$J_Q(\theta_Q) = \Big[ Q\big(s(t), a(t); \theta_Q\big) - \Big( r_t + \gamma\, Q\big(s(t+1), \mu(s(t+1); \theta_{\mu'}); \theta_{Q'}\big) \Big) \Big]^2 \tag{1}$$

where $s(t)$ denotes the state, $a(t)$ the action, $r_t$ the reward, and $\gamma$ the discount factor; $\theta_Q$ and $\theta_{Q'}$ denote the parameters of the critic network and the target critic network, while $\theta_\mu$ and $\theta_{\mu'}$ denote the parameters of the current and target deterministic policy networks. The critic network in DDPG provides evaluations of actions, which are used to optimize the actor's policy by minimizing the loss function

$$J_\mu(\theta_\mu) = -Q\big(s(t), \mu(s(t); \theta_\mu); \theta_Q\big).$$

For SAC, with a stochastic policy $\pi(\cdot \mid s(t); \theta_\pi)$, Haarnoja et al. (2018a) developed soft RL, which introduces an entropy term into the value and policy iterations. SAC's critic network is trained by minimizing the loss function

$$J_Q(\theta_Q) = \Big[ Q\big(s(t), a(t); \theta_Q\big) - \Big( r_t + \gamma \big( Q(s(t+1), a(t+1); \theta_{Q'}) - \alpha \log \pi(a(t+1) \mid s(t+1); \theta_\pi) \big) \Big) \Big]^2 \tag{2}$$

where $a(t+1) \sim \pi(\cdot \mid s(t+1); \theta_\pi)$ and $\alpha$ is the entropy temperature.
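To make these updates concrete, here is a minimal PyTorch sketch of the loss computations in Eqs. (1) and (2). It is an illustrative fragment rather than a reference implementation: the `critic`, `actor`, and `policy` modules, the `batch` layout, and the hyperparameter values are hypothetical.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(critic, target_critic, actor, target_actor, batch, gamma=0.99):
    """DDPG critic loss (Eq. 1) and actor loss, given a replay-buffer
    batch of tensors (s, a, r, s_next)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        # Bootstrapped TD target: r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})).
        q_target = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), q_target)  # Eq. (1)
    actor_loss = -critic(s, actor(s)).mean()          # J_mu(theta_mu)
    return critic_loss, actor_loss

def soft_td_target(target_critic, policy, s_next, r, gamma=0.99, alpha=0.2):
    """Soft (maximum-entropy) TD target used in SAC's critic loss (Eq. 2);
    `policy` is assumed to return an action and its log-probability."""
    with torch.no_grad():
        a_next, log_prob = policy(s_next)
        return r + gamma * (target_critic(s_next, a_next) - alpha * log_prob)
```

In practice, the target parameters $\theta_{Q'}$ and $\theta_{\mu'}$ are maintained as slowly moving (Polyak) averages of the online parameters, which stabilizes the bootstrapped targets.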




