ADDRESSING HIGH-DIMENSIONAL CONTINUOUS ACTION SPACE VIA DECOMPOSED DISCRETE POLICY-CRITIC

Abstract

Reinforcement learning (RL) methods for discrete action spaces, such as DQN, are widely used in tasks like Atari games. However, they encounter difficulties in continuous control tasks, since discretizing a continuous action space incurs the curse of dimensionality. To tackle continuous control tasks via discretized actions, we propose a decomposed discrete policy-critic (D2PC) architecture, which is inspired by multi-agent RL (MARL) and associates a discrete policy with each action dimension, while leveraging a single critic network to provide a shared evaluation. Building on D2PC, we develop soft stochastic D2PC (SD2PC) and deterministic D2PC (D3PC) methods, based on discrete stochastic and deterministic policies respectively, which achieve training performance comparable or even superior to continuous actor-critic methods. Additionally, we design a mechanism that allows D3PC to interact with continuous actor-critic methods, yielding the Q-policy-critic (QPC) algorithm, which inherits both the training efficiency of discrete RL and the near-optimal final performance of continuous RL algorithms. Extensive experimental results on several continuous-control benchmark tasks validate our claims.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) is a class of machine learning methods that train models through interactions with the environment. RL can be divided into discrete RL, whose action space contains a finite number of actions, and continuous RL, whose action space contains infinitely many actions. Actor-critic methods (Konda & Tsitsiklis, 1999), which pair a policy network with a critic that learns a state-action value function, are commonly used to address continuous RL tasks such as continuous control, but they have been shown to be fragile and sensitive to hyperparameters (Haarnoja et al., 2018a).

Employing discrete RL algorithms for continuous control via action discretization is a feasible way to improve training efficiency (Pazis & Lagoudakis, 2009), since discrete RL algorithms, including Q-learning (Watkins & Dayan, 1992) and the deep Q-network (DQN) (Mnih et al., 2013; 2015), have low complexity and exhibit stable training behavior. For continuous control tasks with high-dimensional action spaces (Levine et al., 2016), however, naive action discretization leads to the dimension explosion problem: if the continuous action space has M dimensions and each dimension is discretized into N bins, the discrete RL algorithm must handle N^M joint actions. A possible remedy is to discretize each action dimension independently, so that the total number of discrete actions, MN, grows linearly rather than exponentially in the dimension M. Nonetheless, this procedure may suffer from low sample efficiency, because only on-policy iterations can be used to collaboratively learn the relationship among the policies of different action dimensions. To combine a discrete action space with off-policy iterations in a continuous RL algorithm, we propose a decomposed discrete policy-critic (D2PC) architecture inspired by multi-agent actor-critic methods: each action dimension is viewed as an independent agent and assigned a one-dimensional discrete policy, which is optimized with reference to a centralized critic network. A minimal illustrative sketch of this decomposition is given below.
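The following PyTorch sketch illustrates the decomposition under stated assumptions; it is not the paper's implementation, and all names (DecomposedDiscretePolicy, SharedCritic, state_dim, act_dims, n_bins) are hypothetical. The policy outputs MN logits in total, one categorical head per action dimension, while a single critic evaluates the assembled joint action, mirroring the centralized-critic idea described above.

```python
import torch
import torch.nn as nn


class DecomposedDiscretePolicy(nn.Module):
    """One small categorical policy head per action dimension.

    With act_dims dimensions and n_bins bins each, the network outputs
    act_dims * n_bins logits (linear growth) rather than parameterizing
    a single categorical over n_bins ** act_dims joint actions.
    """

    def __init__(self, state_dim, act_dims, n_bins, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Independent logits for each action dimension.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_bins) for _ in range(act_dims)]
        )

    def forward(self, state):
        h = self.trunk(state)
        # One categorical distribution per action dimension.
        return [torch.distributions.Categorical(logits=head(h)) for head in self.heads]


class SharedCritic(nn.Module):
    """Single critic Q(s, a) evaluating the joint action."""

    def __init__(self, state_dim, act_dims, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dims, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


if __name__ == "__main__":
    state_dim, act_dims, n_bins = 8, 4, 11
    policy = DecomposedDiscretePolicy(state_dim, act_dims, n_bins)
    critic = SharedCritic(state_dim, act_dims)

    s = torch.randn(2, state_dim)                        # batch of 2 states
    dists = policy(s)                                    # act_dims categoricals
    bins = torch.stack([d.sample() for d in dists], -1)  # discrete bin indices
    a = bins.float() / (n_bins - 1) * 2.0 - 1.0          # map bins to [-1, 1]
    print(critic(s, a).shape)                            # torch.Size([2, 1])
```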

