FACTORED ACTION SPACES IN DEEP REINFORCEMENT LEARNING

Abstract

Very large action spaces constitute a critical challenge for deep Reinforcement Learning (RL) algorithms. An existing approach consists of splitting the action space into smaller components and choosing actions in each dimension either independently or sequentially. This approach led to astonishing results in the StarCraft and Dota 2 games; however, it remains underexploited and understudied. In this paper, we name this approach Factored Actions Reinforcement Learning (FARL) and study both its theoretical impact and practical use. Notably, we provide a theoretical analysis of FARL applied to the Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC) algorithms and evaluate the resulting agents on different classes of problems. We show that FARL is a very versatile and efficient approach to combinatorial and continuous control problems.

1. INTRODUCTION

In many decision-making problems, especially combinatorial ones, the search space can be extremely large. Learning from scratch in this setting can be hard, if not impossible. Using deep neural networks helps deal with very large state spaces, but the issue remains when the action space or the horizon required to solve a problem is too large, which is often the case in real-world settings. Several approaches tackle the problem of long-horizon tasks, such as learning compositional neural programs (Pierrot et al., 2019), hierarchical policies (Levy et al., 2018) or options (Bacon et al., 2017). However, the large action space problem is not as well covered. The main approach consists of factorizing the action space into a Cartesian product of smaller subspaces. We call it Factored Actions Reinforcement Learning (FARL). In FARL, the agent must return a sequence of actions at each time step instead of a single one. This approach has been applied successfully to obtain astonishing results in games such as StarCraft (Jaderberg et al., 2019) and Dota 2 (Berner et al., 2019), and for neural program generation (Li et al., 2020). There have also been several attempts to use factored action spaces with DQN, PPO and AlphaZero to solve continuous action problems by discretizing actions and specifying one dimension at a time (Metz et al., 2017; Grill et al., 2020; Tang & Agrawal, 2020). The resulting algorithms outperformed several native continuous action algorithms on MuJoCo benchmarks. While this approach has been successfully applied in practice, a deeper analysis of the consequences of such a formulation on the RL problem is missing. In this paper, we highlight two different ways to factorize the policy and study their theoretical impact. We discuss the pros and cons of both approaches and illustrate them with practical applications. We extend two state-of-the-art agents, PPO and SAC, to work with both factorization methods.
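The two factorization schemes above can be illustrated with a minimal sketch. The code below is not from the paper: the sub-action sizes, function names, and the random stand-ins for policy-network logits are all illustrative assumptions. It contrasts the independent factorization, where each dimension has its own distribution, with the sequential (autoregressive) one, where each sub-action is conditioned on the previously chosen ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-dimensional factored action space A = A1 x A2 x A3.
# 4 * 5 * 3 = 60 joint actions, but only 4 + 5 + 3 = 12 logits are needed.
sub_sizes = [4, 5, 3]

def softmax(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def independent_sample(logits_per_dim):
    """Independent factorization: pi(a|s) = prod_i pi_i(a_i|s).
    Each dimension is sampled from its own softmax, ignoring the others."""
    action, log_prob = [], 0.0
    for logits in logits_per_dim:
        p = softmax(logits)
        a = rng.choice(len(p), p=p)
        action.append(int(a))
        log_prob += np.log(p[a])
    return action, log_prob

def sequential_sample(logit_fn):
    """Sequential factorization: pi(a|s) = prod_i pi_i(a_i|s, a_1..a_{i-1}).
    `logit_fn(prefix, i)` stands for a network head conditioned on the
    sub-actions already chosen."""
    action, log_prob = [], 0.0
    for i, n in enumerate(sub_sizes):
        p = softmax(logit_fn(action, i))  # conditioned on the prefix
        a = rng.choice(n, p=p)
        action.append(int(a))
        log_prob += np.log(p[a])
    return action, log_prob

# Toy stand-ins for policy-network outputs.
indep_logits = [rng.normal(size=n) for n in sub_sizes]
a, lp = independent_sample(indep_logits)
b, lq = sequential_sample(lambda prefix, i: rng.normal(size=sub_sizes[i]))
print(a, b)  # two 3-component factored actions
```

Both variants need only a sum of sub-action logits rather than one output per joint action, which is the source of the scalability gains discussed above; the sequential variant additionally lets later dimensions depend on earlier choices.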
To highlight the generality of the approach, we apply these algorithms to diverse domains, from large sequential decision problems with discrete actions to challenging continuous control problems and hybrid domains mixing discrete decisions and continuous parameterization of these decisions. We illustrate the method on three benchmarks chosen for the different difficulties they raise and highlight the benefits of using factored actions.
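For the hybrid setting mentioned above, a factored action combines a discrete decision with a continuous parameterization of that decision. The sketch below is a hypothetical illustration, not the paper's implementation: the Gaussian parameterization and all inputs are assumed stand-ins for policy-network outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hybrid_action(disc_logits, mean, log_std):
    """Hybrid factored action: pi(k, x | s) = pi_d(k|s) * N(x; mu_k, sigma_k),
    i.e. a discrete decision k plus a continuous parameter x for it."""
    p = np.exp(disc_logits - disc_logits.max())
    p /= p.sum()
    k = int(rng.choice(len(p), p=p))              # which discrete decision
    sigma = np.exp(log_std[k])
    x = rng.normal(mean[k], sigma)                # its continuous parameter
    # log pi(k, x | s) = log pi_d(k|s) + log N(x; mu_k, sigma_k)
    log_prob = (np.log(p[k])
                - 0.5 * ((x - mean[k]) / sigma) ** 2
                - log_std[k] - 0.5 * np.log(2 * np.pi))
    return (k, x), log_prob

(k, x), lp = sample_hybrid_action(
    disc_logits=np.array([0.3, -0.1, 1.2]),   # toy values
    mean=np.array([0.0, 1.0, -1.0]),
    log_std=np.array([-1.0, -1.0, -1.0]),
)
```

The joint log-probability factors into a categorical term and a Gaussian term, so the same policy-gradient machinery applies to both components.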

2. RELATED WORK

A large part of the reinforcement learning literature covers the long time horizon problem with approaches based on options (Bacon et al., 2017; Vezhnevets et al., 2017), compositionality (Pierrot et al., 2019; 2020) or more generally Hierarchical Reinforcement Learning (Levy et al., 2017; 2018;

