FACTORED ACTION SPACES IN DEEP REINFORCEMENT LEARNING

Abstract

Very large action spaces constitute a critical challenge for deep Reinforcement Learning (RL) algorithms. An existing approach consists of splitting the action space into smaller components and choosing actions in each dimension either independently or sequentially. This approach led to astonishing results in the StarCraft and Dota 2 games; however, it remains underexploited and understudied. In this paper, we name this approach Factored Actions Reinforcement Learning (FARL) and study both its theoretical impact and practical use. Notably, we provide a theoretical analysis of FARL on the Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC) algorithms and evaluate the resulting agents on different classes of problems. We show that FARL is a versatile and efficient approach to combinatorial and continuous control problems.

1. INTRODUCTION

In many decision making problems, especially combinatorial ones, the search space can be extremely large. Learning from scratch in this setting can be hard, if not impossible. Using deep neural networks helps deal with very large state spaces, but the issue remains when the action space or the horizon required to solve a problem is too large, which is often the case in real-world settings. Several approaches tackle the problem of long horizon tasks, such as learning compositional neural programs (Pierrot et al., 2019), hierarchical policies (Levy et al., 2018) or options (Bacon et al., 2017). However, the large action space problem is not as well covered. The main approach consists in factorizing the action space into a Cartesian product of smaller subspaces. We call it Factored Actions Reinforcement Learning (FARL). In FARL, the agent must return a sequence of actions at each time step instead of a single one. This approach has been applied successfully to obtain astonishing results in games like StarCraft (Jaderberg et al., 2019), Dota 2 (Berner et al., 2019) or for neural program generation (Li et al., 2020). There have also been several attempts to use factored action spaces with DQN, PPO and AlphaZero to solve continuous action problems by discretizing actions and specifying one dimension at a time (Metz et al., 2017; Grill et al., 2020; Tang & Agrawal, 2020). The resulting algorithms outperformed several native continuous action algorithms on MuJoCo benchmarks. While this approach has been successfully applied in practice, a deeper analysis of the consequences of such a formulation on the RL problem is missing. In this paper, we highlight two different ways to factorize the policy and study their theoretical impact. We discuss the pros and cons of both approaches and illustrate them with practical applications. We extend two state-of-the-art agents, PPO and SAC, to work with both factorization methods.
To highlight the generality of the approach, we apply these algorithms to diverse domains, from large sequential decision problems with discrete actions to challenging continuous control problems and hybrid domains mixing discrete decisions and continuous parameterization of these decisions. We illustrate the method on three benchmarks chosen for the different difficulties they raise and highlight the benefits of using factored actions.
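To give a concrete sense of why factoring matters, the following minimal Python sketch (an illustration, not code from the paper) contrasts the number of outputs a single flat policy head would need against a factored one; the sub-space sizes are hypothetical.

```python
import math

# With n sub-spaces of m choices each, a flat softmax head must score
# m ** n joint actions, while a factored policy only needs n heads of
# m logits each. The sizes below are hypothetical.
def flat_action_count(sizes):
    """Number of joint actions a single flat softmax head would score."""
    return math.prod(sizes)

def factored_output_count(sizes):
    """Total number of logits across one head per sub-action space."""
    return sum(sizes)

sizes = [10] * 6  # e.g. six sub-actions with 10 choices each
print(flat_action_count(sizes))      # 1000000 joint actions
print(factored_output_count(sizes))  # 60 logits
```

The gap grows exponentially with the number of sub-action dimensions, which is why the flat formulation becomes intractable for games like StarCraft.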

2. RELATED WORK

A large part of the reinforcement learning literature covers the long time horizon problem with approaches based on options (Bacon et al., 2017; Vezhnevets et al., 2017), compositionality (Pierrot et al., 2019; 2020) or, more generally, Hierarchical Reinforcement Learning (Levy et al., 2017; 2018; Yang et al., 2018; Nachum et al., 2018a; b). However, there have been fewer attempts to deal with large action spaces. In many real life problems, especially standard combinatorial or operations research problems, the number of entities and the size of instances can be very large, leading to action spaces which may contain thousands of actions. Some prior works have focused on factorizing the action space into binary sub-spaces and using generalized value functions (Pazis & Parr, 2011). A similar approach leveraged Error-Correcting Output Code classifiers (ECOCs) (Dietterich & Bakiri, 1994) to factorize the action space and allow for parallel training of a sub-policy for each action sub-space (Dulac-Arnold et al., 2012). More recently, Dulac-Arnold et al. (2015) proposed to leverage prior information about the actions to embed them into a continuous action space in which the agent can generalize. A concurrent approach is to learn what not to learn (Zahavy et al., 2018): the authors train an action elimination network to eliminate sub-optimal actions, thus reducing the number of possible actions for the RL agent. In another approach, the action space is factored into a Cartesian product of n discrete sub-action spaces. In Parameterized actions RL, also called Hybrid RL, actions are factored into sequences that correspond to the choice of an action in a discrete action space of size m followed by the choice of the intensity of this action in a continuous action space (Hausknecht & Stone, 2015; Masson et al., 2016; Fan et al., 2019; Delalleau et al., 2019). In other problems, the action space exhibits a natural factorization, as in Dota 2 or StarCraft.
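The parameterized ("hybrid") action setting described above can be sketched in a few lines of Python. This is a hypothetical illustration, not code from any of the cited works: the discrete probabilities and the Gaussian parameters stand in for learned policy outputs.

```python
import random

# Hybrid action sketch: first sample a discrete action among m choices,
# then sample a continuous intensity for that action from a Gaussian
# whose parameters depend on the chosen discrete action.
def sample_hybrid(discrete_probs, means, stds, rng):
    """Return (discrete action index, continuous intensity)."""
    k = rng.choices(range(len(discrete_probs)), weights=discrete_probs)[0]
    intensity = rng.gauss(means[k], stds[k])
    return k, intensity

rng = random.Random(0)
# Two discrete actions; hypothetical intensity distributions per action.
k, x = sample_hybrid([0.2, 0.8], [0.0, 1.0], [0.1, 0.1], rng)
```

In a learned agent, `discrete_probs`, `means` and `stds` would be produced by a policy network conditioned on the state.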
Indeed, one must first choose a macro-action, such as selecting a building or a unit, and then a sequence of micro-actions, such as creating a specific unit at a specific position. In such a factorization, the autoregressive property is essential, as the selection of an action must be conditioned on the previously selected actions in the sequence. For both games, factorizing the action space and selecting sequences of autoregressive actions instead of single discrete actions has been shown to be crucial (Berner et al., 2019; Vinyals et al., 2019). However, neither of these works sufficiently highlights this aspect nor proposes a proper formalisation. As far as we know, the only work that establishes a proper FARL framework is Metz et al. (2017) with the model called Sequential DQN (SDQN). They build on sequential models that have been proposed outside the RL literature; notably, these models are a natural fit for language modelling (Bengio et al., 2003; Sutskever et al., 2014). Metz et al. (2017) extend the DQN algorithm (Mnih et al., 2013) to the sequential setting and present this approach as an alternative way to handle continuous action spaces such as robotic control. Here, we go beyond Q-learning approaches and propose general formulations to extend any actor-critic RL agent to the FARL setting. We illustrate this framework on two examples: we extend both the Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Soft Actor Critic (SAC) (Haarnoja et al., 2018) algorithms to the sequential setting. We also highlight the flexibility and generality of the FARL approach by using it on a broad class of problems. We show results on robotic control MuJoCo benchmarks as in Metz et al. (2017) to demonstrate the relevance of our derivations, and we also successfully apply factored PPO and SAC to parameterized and multi-agent problems.
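The autoregressive selection described above can be sketched as follows. This is an illustrative sketch, not the implementation used in those works: `logits_fn` is a hypothetical stand-in for a neural network that maps (state, previously chosen sub-actions, sub-space index) to logits over the i-th sub-action space.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_autoregressive(state, sub_sizes, logits_fn, rng):
    """Pick sub-actions one at a time, each conditioned on earlier choices."""
    chosen = []
    for i, size in enumerate(sub_sizes):
        # The logits for sub-space i depend on the sub-actions chosen so far,
        # which is exactly the autoregressive conditioning described above.
        probs = softmax(logits_fn(state, tuple(chosen), i))
        chosen.append(rng.choices(range(size), weights=probs)[0])
    return tuple(chosen)

# Toy stand-in for a policy network: strongly prefer sub-action 0 everywhere.
toy_logits = lambda s, prev, i: [5.0, -5.0, -5.0]
action = sample_autoregressive(None, [3, 3], toy_logits, random.Random(0))
```

Replacing the conditioning on `chosen` with a constant yields the independent factorization discussed later; the autoregressive variant is what Dota 2 and StarCraft agents rely on.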

3. FACTORED ACTION SPACES

In this section, we introduce notations for Markov Decision Problems with factored action spaces. We consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma, \rho_0)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\mathcal{T} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ the transition function, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ the reward function, $\gamma \leq 1$ the discount factor and $\rho_0$ the initial state distribution. We assume that the state space is continuous and that the MDP is fully observable, so observations equal states. In this paper, we assume that the action space is factored, i.e. it can be expressed as a Cartesian product of $n$ discrete action sub-spaces: $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$ where $\mathcal{A}_i$ is a discrete action space of size $n_i$. We aim to learn a parameterized stochastic policy $\pi_\theta : \mathcal{A} \times \mathcal{S} \to [0, 1]$, where $\pi(a|s)$ is the probability of choosing action $a$ in state $s$. The objective function to maximise is $J(\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$ where $\tau$ is a trajectory obtained from $\pi_\theta$ starting from state $s_0 \sim \rho_0$ and $r_t$ is the reward obtained along this trajectory at time $t$. We define the Q-value for policy $\pi$, $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, as $Q^\pi(s, a) = \mathbb{E}_\tau \left[ \sum_t \gamma^t r_t \right]$, where $\tau$ is a trajectory obtained from $\pi_\theta$ starting from state $s$ and performing initial action $a$. We define the V-value $V^\pi : \mathcal{S} \to \mathbb{R}$ as $V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a)$. The policy is factored into a product of $n$ joint distributions to handle the factored action space. We consider two settings.
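Under these definitions, a factored policy with independent per-dimension distributions has joint probability equal to the product of the per-dimension probabilities, and $V^\pi(s)$ can be computed by exact enumeration when the sub-spaces are small. A minimal Python sketch with hypothetical probability vectors and Q-values:

```python
import itertools
import math

# pi(a|s) = prod_i pi_i(a_i|s) for an independent factorization;
# V(s) = sum over all joint actions a of pi(a|s) * Q(s, a).
def joint_prob(sub_probs, action):
    """Probability of a joint action under independent sub-distributions."""
    return math.prod(p[a] for p, a in zip(sub_probs, action))

def v_value(sub_probs, q_fn):
    """V(s) by enumerating the Cartesian product of sub-action spaces."""
    ranges = [range(len(p)) for p in sub_probs]
    return sum(joint_prob(sub_probs, a) * q_fn(a)
               for a in itertools.product(*ranges))

# Hypothetical numbers: two sub-spaces of size 2, fixed Q-values.
sub_probs = [[0.5, 0.5], [0.25, 0.75]]
q = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
print(v_value(sub_probs, q.__getitem__))  # 2.75
```

In the autoregressive setting, `joint_prob` would instead chain conditional probabilities $\pi_i(a_i|s, a_1, \ldots, a_{i-1})$; the V-value definition is unchanged.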

