NEURAL DISCRETE REINFORCEMENT LEARNING

Abstract

Designing effective action spaces for complex environments is a fundamental and challenging problem in reinforcement learning (RL). Recent works have revealed that naive RL algorithms equipped with well-designed, handcrafted discrete action spaces can achieve promising results even on high-dimensional continuous or hybrid decision-making problems. However, elaborately designing such action spaces requires comprehensive domain knowledge. In this paper, we systematically analyze the advantages of discretization for different action spaces and then propose a unified framework, Neural Discrete Reinforcement Learning (NDRL), to automatically learn how to effectively discretize almost arbitrary action spaces. Specifically, we propose the Action Discretization Variational AutoEncoder (AD-VAE), an action representation learning method that learns compact latent action spaces while maintaining the essential properties of the original environments, such as boundary actions and the relationships between different action dimensions. Moreover, we uncover a key issue: the parallel optimization of the AD-VAE and online RL agents is often unstable. To address it, we further design several techniques to adapt RL agents to the learned action representations, including latent action remapping and ensemble Q-learning. Quantitative experiments and visualization results demonstrate the efficiency and stability of our proposed framework for complex action spaces in various environments.

1. INTRODUCTION

Recent advances in reinforcement learning have yielded many promising research achievements (Vinyals et al., 2019; Berner et al., 2019; Schrittwieser et al., 2019). However, the complexity of action spaces still prevents us from directly applying advanced RL algorithms to real-world scenarios, such as high-dimensional continuous control in robot manipulation (Lillicrap et al., 2016) and structured hybrid decision-making in strategy games (Kanervisto et al., 2022). To handle these issues, some existing works elaborately design particular reinforcement learning methods for the original complex action spaces: deterministic policy gradient methods (Lillicrap et al., 2016) are tailored to continuous control, while other works (e.g., Dadashi et al., 2022) build prior sets of discrete actions from expert demonstrations and then deploy RL agents on these fixed discrete action sets. To preserve the necessary attributes of environments, all the above discretization techniques require related domain knowledge to discard redundant information about actions, which means that they are unsuitable for environments with arbitrary action spaces. In this paper, we focus on how to learn unified discrete action representations from scratch, without any domain knowledge. Based on previous analyses and our investigations (as shown in Figure 1), we summarize the following advantages of discretization for complex action spaces:

• Unified action discretization provides a powerful and general approach to reinforcement learning in complex action spaces. It is equivalent to splitting the entire pipeline into two parts: (1) representation learning and (2) decision-making. The former focuses on the intrinsic properties and data distributions of the action space, transforming various action spaces into standard discrete action sets, while the latter only needs to solve the core decision-making problem.
• Effective discretization can improve sample efficiency by reducing the overhead of repeating sub-optimal, useless, or semantically similar actions. The RL agent can explore and exploit only the necessary subsets of the original action space during training.

Based on these observations, we propose the AD-VAE together with dedicated designs to improve its capability of capturing the relationships between different action dimensions and boundary action values. Furthermore, we find a core issue in the parallel optimization of the AD-VAE and RL agents: the online updates of the AD-VAE may change the semantics of latent actions (i.e., the decision space becomes non-stationary), resulting in severe data staleness and Q-value over-estimation. To solve this problem, we introduce latent action remapping and ensemble Q-learning. Concretely, we apply the classic DQN as an instance of our framework, named Action Discretization Q-learning (ADQ), which can be deployed in most complex action spaces. Compared with pioneering works (Chandak et al., 2019a; Zhou et al., 2020; Dadashi et al., 2022), to the best of our knowledge, our proposed framework is the first online RL paradigm capable of operating in discrete action spaces learned from different continuous and hybrid decision-making environments. To demonstrate the efficiency and stability of our NDRL framework and the AD-VAE method, we evaluate them on the classic continuous control benchmark MuJoCo (Todorov et al., 2012), showing that ADQ can achieve excellent performance in high-dimensional continuous spaces even with a small number of actions. To evaluate generality, we test it on the hybrid action environments Gym Hybrid (thomashirtz, 2021), HardMove from HyAR, and GoBigger (Zhang, 2021). The results show that ADQ outperforms current state-of-the-art hybrid action algorithms in both sample efficiency and final performance. Besides, we also conduct a series of ablation experiments and interpret more details of NDRL by visualizing the latent space.
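To make the decision-making half of this two-part pipeline concrete, the following sketch shows a DQN-style agent that selects among K discrete latent actions and relies on a decoder to map the chosen code back to the original continuous action space. This is only an illustration, not the paper's implementation: the sizes, the linear Q-function, and the random codebook standing in for a trained AD-VAE decoder are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder sizes: K latent actions, D-dim continuous env action, OBS-dim state.
K, D, OBS = 16, 6, 8

# Stand-in for a trained AD-VAE decoder: here, a fixed random codebook mapping
# each discrete latent action to one continuous environment action.
codebook = rng.uniform(-1.0, 1.0, size=(K, D))

def decode(latent_id: int) -> np.ndarray:
    """Map a discrete latent action back to the original action space."""
    return codebook[latent_id]

# Stand-in for a DQN head over the K latent actions: a linear Q-function.
W = rng.normal(size=(OBS, K))

def q_values(obs: np.ndarray) -> np.ndarray:
    return obs @ W

def act(obs: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Epsilon-greedy selection over latent actions, then decode for the env."""
    if rng.random() < epsilon:
        latent_id = int(rng.integers(K))
    else:
        latent_id = int(np.argmax(q_values(obs)))
    return decode(latent_id)

obs = rng.normal(size=OBS)
env_action = act(obs, epsilon=0.0)
assert env_action.shape == (D,)
```

The point of the split is visible here: the agent's learning problem is a standard K-way discrete choice, while everything specific to the original action space is confined to `decode`.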

2. RELATED WORK

Action Discretization Discretization and continuity are like the relationship between 0 and 1 in the binary world. All things, including time and space, are continuous, but for the convenience of cognition we discretize them; only then do we have measures such as hours, minutes, and meters. In RL, learning directly on a high-dimensional continuous action space may make exploration difficult due to the uncountable set of actions. In addition, Bjorck et al. (2021) argue that the nonlinear function saturation caused by unstable network parameterization leads to the well-known high-variance problem. The most straightforward solution is discretization; however, it usually suffers from the curse of dimensionality. To alleviate this problem, many assumptions about the action space have been proposed. For example, Tang & Agrawal (2020) verify the feasibility of discretizing the action space in on-policy optimization by utilizing a factorized distribution across action dimensions. In Dadashi et al. (2022), the authors propose to circumvent the curse of dimensionality by learning a set of plausible discrete actions from expert demonstrations. We argue that this algorithm can naturally be seen as a special case of our proposed framework.
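The curse of dimensionality and the factorization idea of Tang & Agrawal (2020) can be sketched with a short numeric example; the sizes below are illustrative assumptions, not values from any of the cited works.

```python
import numpy as np

# With D action dimensions each discretized into K bins, enumerating joint
# actions gives K**D choices, while a factorized policy only needs D
# independent K-way distributions, i.e. D * K logits.
D, K = 6, 11  # illustrative: a 6-dim continuous action, 11 bins per dimension

joint_actions = K ** D     # 1_771_561 joint atomic actions
factorized_logits = D * K  # 66 logits

# A factorized policy samples each dimension from its own categorical.
rng = np.random.default_rng(0)
logits = rng.normal(size=(D, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
bins = np.array([rng.choice(K, p=probs[d]) for d in range(D)])

# Map bin indices back to evenly spaced values in [-1, 1].
action = -1.0 + 2.0 * bins / (K - 1)
assert joint_actions == 1_771_561 and action.shape == (D,)
```

The factorization trades away the ability to represent arbitrary correlations between dimensions within a single step, which is precisely the kind of structure that learned action representations aim to recover.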



Complex action spaces lead to extensive challenges in the design of policy optimization (Xiong et al., 2018b), the efficiency of exploration (Seyde et al., 2021b), and the behaviour stability of learned agents (Bester et al., 2019). Deterministic policy gradient methods (Lillicrap et al., 2016; Fujimoto et al., 2018) are designed to handle continuous control problems, and Xiong et al. (2018b); Fan et al. (2019b) propose techniques to extract the relationships between different action dimensions, which is important in hybrid action spaces. However, these designs often suffer from low exploration efficiency and unstable training, due to the infinite action spaces and the interference between different sub-actions (Bester et al., 2019), respectively. Action space shaping (Kanervisto et al., 2020) is another way to tackle these problems. In particular, many RL applications (Kanervisto et al., 2022; Wei et al., 2022) design specific action discretization mechanisms to simplify the decision-making space, leading to promising performance improvements, but this requires intensive investigation of the corresponding environments. Moreover, combining many manually discretized sub-actions results in an exponential explosion of the number of actions, which is incompatible with large action spaces. Recently, some works propose to learn abstract action representations to boost RL training. HyAR (Li et al., 2021) designs a special training scheme with a VAE (Kingma & Welling, 2014) to map the original hybrid action space to a continuous latent action space. Some other methods (Dadashi et al., 2022; Shafiullah et al., 2022; Jiang et al.) build prior sets of discrete actions from expert demonstrations and then deploy RL agents on these fixed discrete action sets.
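The latent-embedding idea behind HyAR-style methods can be sketched as follows: a hybrid action (a discrete action type plus continuous parameters) is flattened and pushed through an encoder into a single latent vector, and a decoder maps latent vectors back to hybrid actions. In this toy sketch, fixed random linear maps stand in for the trained VAE networks, and all sizes and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hybrid action: one of T discrete types plus P continuous params.
T, P, LATENT = 3, 2, 4

# Stand-ins for a trained VAE encoder/decoder pair (random linear maps here);
# a hybrid action is flattened to a (T + P)-dim vector before encoding.
W_enc = rng.normal(size=(T + P, LATENT))
W_dec = rng.normal(size=(LATENT, T + P))

def encode(action_type: int, params: np.ndarray) -> np.ndarray:
    """Embed a (type, params) hybrid action into a single latent vector."""
    one_hot = np.eye(T)[action_type]
    return np.concatenate([one_hot, params]) @ W_enc

def decode(z: np.ndarray) -> tuple[int, np.ndarray]:
    """Recover a hybrid action: argmax over the type slice, rest as params."""
    out = z @ W_dec
    return int(np.argmax(out[:T])), out[T:]

z = encode(1, np.array([0.5, -0.2]))
action_type, params = decode(z)
assert z.shape == (LATENT,) and params.shape == (P,)
```

Once such an embedding exists, the RL agent can act in the homogeneous latent space instead of juggling heterogeneous discrete and continuous sub-actions.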

