FACTORED ACTION SPACES IN DEEP REINFORCEMENT LEARNING

Abstract

Very large action spaces constitute a critical challenge for deep Reinforcement Learning (RL) algorithms. An existing approach consists in splitting the action space into smaller components and choosing actions in each dimension either independently or sequentially. This approach led to astonishing results for the StarCraft and Dota 2 games; however, it remains underexploited and understudied. In this paper, we name this approach Factored Actions Reinforcement Learning (FARL) and study both its theoretical impact and practical use. Notably, we provide a theoretical analysis of FARL applied to the Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC) algorithms and evaluate these agents on different classes of problems. We show that FARL is a very versatile and efficient approach to combinatorial and continuous control problems.

1. INTRODUCTION

In many decision making problems, especially combinatorial problems, the search space can be extremely large. Learning from scratch in this setting can be hard, if not sometimes impossible. Using deep neural networks helps deal with very large state spaces, but the issue remains when the action space or the horizon required to solve a problem is too large, which is often the case in many real-world settings. Several approaches tackle the problem of long horizon tasks, like learning compositional neural programs (Pierrot et al., 2019), hierarchical policies (Levy et al., 2018) or options (Bacon et al., 2017). However, the large action space problem is not as well covered. The main approach consists in factorizing the action space into a Cartesian product of smaller subspaces. We call it Factored Actions Reinforcement Learning (FARL). In FARL, the agent must return a sequence of actions at each time step instead of a single one. This approach has been applied successfully to obtain astonishing results in games like StarCraft (Jaderberg et al., 2019) and Dota 2 (Berner et al., 2019), or for neural program generation (Li et al., 2020). There have also been several attempts to use factored action spaces with DQN, PPO and AlphaZero to solve continuous action problems by discretizing actions and specifying one dimension at a time (Metz et al., 2017; Grill et al., 2020; Tang & Agrawal, 2020). The resulting algorithms outperformed several native continuous action algorithms on MUJOCO benchmarks. While this approach has been successfully applied in practice, a deeper analysis of the consequences of such a formulation on the RL problem is missing. In this paper, we highlight two different ways to factorize the policy and study their theoretical impact. We discuss the pros and cons of both approaches and illustrate them with practical applications. We extend two state-of-the-art agents, PPO and SAC, to work with both factorization methods.
To highlight the generality of the approach, we apply these algorithms to diverse domains, from large sequential decision problems with discrete actions to challenging continuous control problems and hybrid domains mixing discrete decisions and continuous parameterization of these decisions. We illustrate the method on three benchmarks chosen for the different difficulties they raise and highlight the benefits of using factored actions.

2. RELATED WORK

A large part of the reinforcement learning literature covers the long time horizon problem with approaches based on options (Bacon et al., 2017; Vezhnevets et al., 2017), compositionality (Pierrot et al., 2019; 2020) or, more generally, Hierarchical Reinforcement Learning (Levy et al., 2017; 2018; Yang et al., 2018; Nachum et al., 2018a; b). However, there have been fewer attempts to deal with large action spaces. In many real life problems, especially in standard combinatorial or operations research problems, the number of entities and the size of instances can be very large, leading to action spaces which may contain thousands of actions. Some prior works have focused on factorizing the action space into binary sub-spaces and using generalized value functions (Pazis & Parr, 2011). A similar approach leveraged Error-Correcting Output Code classifiers (ECOCs) (Dietterich & Bakiri, 1994) to factorize the action space and allow for parallel training of a sub-policy for each action sub-space (Dulac-Arnold et al., 2012). More recently, Dulac-Arnold et al. (2015) proposed to leverage prior information about the actions to embed them into a continuous action space in which the agent can generalize. A concurrent approach is to learn what not to learn (Zahavy et al., 2018): the authors train an action elimination network to eliminate sub-optimal actions, thus reducing the number of possible actions for the RL agent. In another approach, the action space is factored into a Cartesian product of n discrete sub-action spaces. In Parameterized actions RL, also called Hybrid RL, actions are factored into sequences that correspond to the choice of an action in a discrete action space of size m and then the choice of the intensity of this action in a continuous action space (Hausknecht & Stone, 2015; Masson et al., 2016; Fan et al., 2019; Delalleau et al., 2019). In other problems, the action space exhibits a natural factorization, as in Dota 2 or StarCraft.
Indeed, one must first choose a macro-action such as selecting a building or a unit and then a sequence of micro-actions such as creating a specific unit at a specific position. In such a factorization, the autoregressive property is essential, as the selection of an action must be conditioned on the previously selected actions in the sequence. For both games, factorizing the action space and selecting sequences of autoregressive actions instead of single discrete actions has been shown to be crucial (Berner et al., 2019; Vinyals et al., 2019). However, neither of these works sufficiently highlights this aspect nor proposes a proper formalisation. As far as we know, the only work that establishes a proper FARL framework is Metz et al. (2017) with the model called Sequential DQN (SDQN). They build on existing sequential models proposed outside the RL literature; notably, these models are a natural fit for language modelling (Bengio et al., 2003; Sutskever et al., 2014). Metz et al. (2017) extend the DQN algorithm (Mnih et al., 2013) to the sequential setting and present this approach as an alternative way to handle continuous action spaces such as robotic control. Here, we go beyond Q-learning approaches and propose general formulations to extend any actor-critic RL agent to the FARL setting. We illustrate this framework on two examples: we extend both the Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Soft Actor Critic (SAC) (Haarnoja et al., 2018) algorithms to the sequential setting. We also highlight the flexibility and generality of the FARL approach by using it on a broad class of problems. We show results on robotic control MUJOCO benchmarks as in Metz et al. (2017) to demonstrate the relevance of our derivations, and we also successfully apply factored PPO and SAC to parameterized and multi-agent problems.

3. FACTORED ACTION SPACES

In this section, we introduce notations for Markov Decision Problems with factored action spaces. We consider a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma, \rho_0)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\mathcal{T} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ the transition function, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ the reward function, $\gamma \leq 1$ a discount factor and $\rho_0$ the initial state distribution. We assume that the state space is continuous and that the MDP is fully observable, so observations equal states. In this paper, we assume that the action space is factored, i.e. it can be expressed as a Cartesian product of $n$ discrete action sub-spaces: $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_n$ where $\mathcal{A}_i$ is a discrete action space of size $n_i$. We aim to learn a parameterized stochastic policy $\pi_\theta : \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $\pi(a|s)$ is the probability of choosing action $a$ in state $s$. The objective function to maximise is $J(\theta) = \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ where $\tau$ is a trajectory obtained from $\pi_\theta$ starting from state $s_0 \sim \rho_0$ and $r_t$ is the reward obtained along this trajectory at time $t$. We define the Q-value for policy $\pi$, $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, as $Q^\pi(s, a) = \mathbb{E}_\tau\left[\sum_t \gamma^t r_t\right]$, where $\tau$ is a trajectory obtained from $\pi_\theta$ starting from state $s$ and performing initial action $a$. We define the V-value $V^\pi : \mathcal{S} \to \mathbb{R}$ as $V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a)$. The policy is factored into a product of $n$ distributions to handle the factored action space. We consider two settings.
Independent Factorization. A first setting corresponds to problems in which the action components can be chosen independently from each other, only taking into account the environment state $s$. In this case, we decompose the policy $\pi_\theta$ into $n$ sub-policies $\pi^i_\theta$, each defining a distribution over $\mathcal{A}_i$, such that $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$, $\pi_\theta(a|s) = \prod_{i=1}^{n} \pi^i_\theta(a^i|s)$, where $a^i$ is the $i$-th component of action $a$. Each sub-policy $i$ returns probabilities over the possible actions $a^i \in \mathcal{A}_i$. In this setting, to sample an action from $\pi_\theta$, we sample the sub-actions from the sub-distributions $\pi^i_\theta$ in parallel.
Autoregressive Factorization. In this second setting, actions are assumed ordered and dependent. For instance, the choice of action $a^2$ depends on the value of the action $a^1$ chosen by policy $\pi^1$. To account for this, we impose the autoregressive property, i.e. intermediate actions $a^i$ are selected conditionally on the previous action choices $a^1, \dots, a^{i-1}$. More formally, the probability of choosing action $a^i$ is $\pi^i_\theta(a^i | s, a^1, \dots, a^{i-1})$. As in Metz et al. (2017), we introduce sub-state spaces $\mathcal{U}^i = \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_i$, with $\mathcal{U}^0 = \mathcal{S}$, and associated sub-states $u^i_t \in \mathcal{U}^i$, which contain the information of the environment state $s_t$ and all the sub-actions selected so far. We decompose the policy into $n$ sub-policies $\pi^i$, $i \in [1, n]$, where $\pi^i$ defines a distribution over $\mathcal{A}_i$ conditioned on a sub-state in $\mathcal{U}^{i-1}$, such that $\forall (s_t, a_t) \in \mathcal{S} \times \mathcal{A}$, $\pi_\theta(a_t|s_t) = \prod_{i=1}^{n} \pi^i_\theta(a^i_t | u^{i-1}_t)$. In this setting the sub-actions cannot be sampled in parallel. To sample an action from $\pi_\theta$, we sample each sub-action sequentially, starting from the first and conditioning each sub-policy on the previously sampled actions.
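As a rough illustration (not code from the paper), the two sampling schemes above can be sketched in a few lines. The toy sub-policies `pi1` and `pi2` below are made-up distributions, used only to show how the autoregressive scheme conditions each head on the previous choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_independent(sub_probs):
    """Sample each sub-action from its own distribution; joint prob is the product."""
    action, prob = [], 1.0
    for p in sub_probs:                      # one distribution per action dimension
        a = rng.choice(len(p), p=p)
        action.append(int(a))
        prob *= p[a]
    return action, prob

def sample_autoregressive(state, sub_policies):
    """Sample sub-actions sequentially; each sub-policy sees the choices made so far."""
    action, prob = [], 1.0
    for pi in sub_policies:                  # pi: (state, previous sub-actions) -> probabilities
        p = pi(state, tuple(action))
        a = rng.choice(len(p), p=p)
        action.append(int(a))
        prob *= p[a]
    return action, prob

# Toy autoregressive sub-policies: the second head deterministically copies the first choice.
pi1 = lambda s, prev: np.array([0.5, 0.5])
pi2 = lambda s, prev: np.array([1.0, 0.0]) if prev[0] == 0 else np.array([0.0, 1.0])

a, p = sample_autoregressive(None, [pi1, pi2])
assert a[1] == a[0] and abs(p - 0.5) < 1e-12  # a^2 copies a^1, joint prob = 0.5 * 1.0
```

The independent scheme could sample all dimensions in parallel; the autoregressive one cannot, since each head's input depends on the previous samples.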

4. PROPERTIES OF FACTORED POLICIES

In this section, we discuss the differences between both factorization methods from a theoretical point of view and the impact on their use in practice. We study in particular how the factorization choice affects the expression of the policy entropy and the Kullback-Leibler divergence between two factored policies. Entropies and KL divergences between policies are used in many RL algorithms as regularization terms, in order to favor exploration during policy improvement or to prevent policy updates from modifying the current policy too much. When the policy is factored into independent sub-policies, we show that these quantities can easily be computed as the sum of the same quantities computed over the sub-policies. It is not as simple when the policy is autoregressive, but in this case, when computed over actions sampled from the current policy, the sum of the sub-entropies or sub-KL divergences has the global entropy or global KL divergence as its expected value. The proofs of all propositions are in Appendix B.

4.1. SHANNON ENTROPY

The Shannon entropy of a policy is used in several RL algorithms to regularize the policy improvement step or to favor exploration. It is defined as $H(\pi(.|s)) = -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)$.
Proposition 1.1. When the policy is factored into independent sub-policies, its Shannon entropy can be computed as the sum of the Shannon entropies of the sub-policies: $H(\pi(.|s)) = \sum_{i=1}^{n} H(\pi^i(.|s))$ with $H(\pi^i(.|s)) = -\sum_{a^i \in \mathcal{A}_i} \pi^i(a^i|s) \log \pi^i(a^i|s)$. (1)
In this setting, the $n$ Shannon entropies can be computed independently in parallel and then summed to obtain the global policy entropy.
Proposition 1.2. When the policy is autoregressive, we have: $H(\pi(.|s)) = \mathbb{E}_{a \sim \pi(.|s)}\left[\sum_{i=1}^{n} H(\pi^i(.|u^{i-1}))\right]$.
This result gives us a simple way to estimate the Shannon entropy of $\pi$. In practice, updates are performed on batches of transitions originating from different states, so the quantity used for regularization is an estimate of $\mathbb{E}_s[H(\pi(.|s))]$. Using the above proposition, we know that it equals $\mathbb{E}_s\left[\mathbb{E}_{a \sim \pi(.|s)}\left[\sum_{i=1}^{n} H(\pi^i(.|u^{i-1}))\right]\right] = \mathbb{E}_{s, a \sim \pi(.|s)}\left[\sum_{i=1}^{n} H(\pi^i(.|u^{i-1}))\right]$. Therefore, by using $\sum_{i=1}^{n} H(\pi^i(.|u^{i-1}))$ instead of $H(\pi(.|s))$ for each transition of the batch, we are actually estimating the same quantity. However, it must be noted that the estimate is correct only if the sequences of actions are sampled according to the current policy $\pi(.|s)$. This does not cause any issue in an on-policy context, but when using a replay buffer (and thus transitions obtained with past versions of the policy), the sequences of actions must be resampled with the current policy for the estimate to remain correct.
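Proposition 1.1 can be checked numerically on a toy independent factorization (the probabilities below are arbitrary; this is an illustrative sketch, not the paper's code):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 * log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Two independent sub-policies over 3 and 2 sub-actions.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.6, 0.4])

# Joint distribution over the 6 factored actions is the outer product.
joint = np.outer(p1, p2).ravel()

# Proposition 1.1: the joint entropy equals the sum of the sub-entropies.
assert np.isclose(entropy(joint), entropy(p1) + entropy(p2))
```

Computing the two small sub-entropies and summing avoids ever materializing the joint distribution, which is the whole point when the product space is huge.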

4.2. KULLBACK-LEIBLER DIVERGENCE

The KL divergence between two policies is also often used in RL, either as a regularization term (Schulman et al., 2017) or as a loss function (Grill et al., 2020). The KL divergence between two policies $\pi(.|s)$ and $\mu(.|s)$ is defined as $KL[\pi(.|s) \| \mu(.|s)] = -\sum_{a \in \mathcal{A}} \pi(a|s) \log \frac{\mu(a|s)}{\pi(a|s)}$.
Proposition 2.1. When the two policies are factored into independent sub-policies, their KL divergence can be computed as the sum of the KL divergences between the sub-policies: $KL[\pi(.|s) \| \mu(.|s)] = \sum_{i=1}^{n} KL[\pi^i(.|s) \| \mu^i(.|s)]$. In this setting, the $n$ KL divergences can be computed independently in parallel and then summed to obtain the final KL divergence.
Proposition 2.2. When the two policies are autoregressive, we have: $KL[\pi(.|s) \| \mu(.|s)] = \mathbb{E}_{a \sim \pi(.|s)}\left[\sum_{i=1}^{n} KL[\pi^i(.|u^{i-1}) \| \mu^i(.|u^{i-1})]\right]$. In this setting, similarly to the Shannon entropy, we use this result to form an estimate in which the enriched states $u^i$ are computed from a sequence of actions sampled according to $\pi(.|s)$. Importantly, the sequence of actions used to sequentially compute the sub-distributions must be the same for $\pi$ and $\mu$.
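As with the entropy, Proposition 2.1 can be verified on a toy example (arbitrary probabilities, illustrative only):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with common support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two pairs of independent sub-policies.
p1, q1 = np.array([0.2, 0.8]), np.array([0.5, 0.5])
p2, q2 = np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.3, 0.4])

# Joint distributions over the 6 factored actions.
joint_p = np.outer(p1, p2).ravel()
joint_q = np.outer(q1, q2).ravel()

# Proposition 2.1: KL between independent factored policies is the sum of sub-KLs.
assert np.isclose(kl(joint_p, joint_q), kl(p1, q1) + kl(p2, q2))
```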

5. FACTORED AGENTS

In this section, we highlight the impact of factorizing the action space and of the chosen factorization approach for the policy. In particular, we study two state-of-the-art algorithms, PPO (on-policy) and SAC (off-policy), and show how to adapt them to both factorization settings. We provide guidelines and practical tips to make the factorization work in practice. The relative performance of these algorithms is evaluated in Section 6.

5.1. POLICY ARCHITECTURE AND ACTION SAMPLING

We consider a stochastic policy network $\pi_\theta$ taking an environment state $s \in \mathcal{S}$ and returning probabilities over the actions $a \in \mathcal{A}$. When the sub-policies are independent distributions, the policy network is composed of $n$ heads, each taking either the state directly or an embedding of the state computed by a neural network shared between all heads. The $i$-th head returns probabilities over the $i$-th action component $a^i \in \mathcal{A}_i$. The action components are sampled independently and the probability of the resulting action is computed as the product of the component probabilities. As the action components are independent, their sampling can be performed in parallel. When the policy is autoregressive, the policy network is also composed of $n$ heads. However, the $i$-th head takes as input not only the state or an embedding but also an encoding of the first $i-1$ selected action components. This encoding can be engineered or computed through a recurrent model. Thus, the $i$-th head returns probabilities over the $i$-th action component $a^i \in \mathcal{A}_i$ conditioned on the choice of the first $i-1$ action components $(a^1, \dots, a^{i-1})$. Therefore, the action components must be sampled sequentially. As above, the resulting action probability is computed as the product of its component probabilities.
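A minimal numpy sketch of the autoregressive head wiring described above. The linear layers and the one-hot encoding of previous choices are illustrative assumptions, not the paper's exact architecture (which may use a recurrent model for the encoding):

```python
import numpy as np

rng = np.random.default_rng(1)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

class AutoregressivePolicy:
    """Linear sketch: head i sees the state embedding plus one-hot codes of a^1..a^{i-1}."""
    def __init__(self, state_dim, sizes, embed_dim=8):
        self.sizes = sizes
        self.W_embed = rng.normal(size=(embed_dim, state_dim))
        self.heads = []
        in_dim = embed_dim
        for n_i in sizes:
            self.heads.append(rng.normal(scale=0.1, size=(n_i, in_dim)))
            in_dim += n_i                      # the next head also sees this sub-action's one-hot

    def sample(self, s):
        x = np.tanh(self.W_embed @ s)          # shared state embedding
        action, log_prob = [], 0.0
        for W, n_i in zip(self.heads, self.sizes):
            p = softmax(W @ x)
            a = rng.choice(n_i, p=p)
            action.append(int(a))
            log_prob += np.log(p[a])           # joint log-prob = sum of component log-probs
            x = np.concatenate([x, np.eye(n_i)[a]])  # condition later heads on this choice
        return action, log_prob

pi = AutoregressivePolicy(state_dim=4, sizes=[3, 2])
a, lp = pi.sample(np.ones(4))
assert len(a) == 2 and lp < 0.0
```

The independent variant would simply drop the concatenation step, so all heads read the same embedding and can be evaluated in parallel.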

5.2. PROXIMAL POLICY OPTIMIZATION

The Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) is an on-policy algorithm using a stochastic policy $\pi_\theta$ and a value function $V_\theta$. It replaces the policy gradient with a surrogate loss to constrain the size of policy updates. The policy parameters are updated so as to maximise this surrogate. An entropy term $H(\pi_\theta(.|s))$ and a Kullback-Leibler divergence between the current policy distribution and its version before the update can be added as regularization terms to further stabilize training. More formally, the policy parameters are updated by gradient ascent so as to maximise the following expression: $L_\pi(\theta) = \min\left(r(\theta)\hat{A},\ \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}\right) + \lambda H(\pi_\theta(.|s)) - \beta KL(\pi_{\theta_{old}}(.|s) \| \pi_\theta(.|s))$, where $\hat{A}$ is an estimate of the advantage computed from the target network $V_\theta$ using the Generalized Advantage Estimation (GAE) method (Schulman et al., 2015), and $r(\theta)$ denotes the policy ratio $r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$ with $\theta_{old}$ the parameters before the update. The value network is trained to minimize the mean squared error between its prediction and an estimate of the return, which can be computed from the advantage estimate. More formally, the value function parameters are updated to minimize: $L_V(\theta) = \left(V_\theta(s) - (\hat{A} + V_{\theta_{old}}(s))\right)^2$. As the value network depends only on states and not on actions, its architecture and update rule are not impacted by the factorization of the action space. The factorization impacts only the policy architecture and its update expression. More precisely: (a) the policy architecture and the way actions are sampled are modified as explained in Section 5.1; (b) both $\pi_{\theta_{old}}(a|s)$ and $\pi_\theta(a|s)$ are computed by multiplying the probabilities given by the sub-distributions over the action components; (c) the Shannon entropy and KL divergence terms are computed as explained in Sections 4.1 and 4.2.
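Point (b) amounts to summing log-probabilities across heads before exponentiating the ratio. A per-sample sketch of the clipped surrogate under a factored policy, assuming a fixed clipping parameter $\epsilon = 0.2$ (the entropy and KL terms are omitted for brevity):

```python
import numpy as np

def factored_ppo_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate for one transition, where each log-prob is the SUM of
    sub-policy log-probs: the joint probability is the product of the components."""
    ratio = np.exp(np.sum(logp_new) - np.sum(logp_old))
    return min(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# Two sub-actions; the new policy makes each chosen component twice as likely.
logp_old = np.log([0.25, 0.4])
logp_new = np.log([0.5, 0.8])
# Joint ratio is (0.5 * 0.8) / (0.25 * 0.4) = 4, clipped to 1.2 for a positive advantage.
assert np.isclose(factored_ppo_surrogate(logp_new, logp_old, adv=1.0), 1.2)
```

Note that clipping the joint ratio (as here) is not the same as clipping each sub-ratio separately; which choice the reference implementations make is an implementation detail.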

5.3. SOFT ACTOR CRITIC

As we illustrated with PPO, factorizing the action space of an RL algorithm that uses the V-value function is easy. However, things get more complicated with a Q-value function, as actions are involved in the critic. We give a concrete example with the Soft Actor Critic (SAC) algorithm (Haarnoja et al., 2018). SAC learns a stochastic policy $\pi^*$ maximizing both the returns and the policy entropy. It maximises sums of augmented rewards $r^{sac}_t = r_t + \alpha H(\pi(.|s_t))$, where $\alpha$ is a temperature parameter controlling the importance of entropy versus reward, and thereby the level of stochasticity of the policy. SAC trains a policy network $\pi_\phi$ for control and a soft Q-function network $Q_\theta$, and relies on a replay buffer $\mathcal{D}$. While the original SAC paper proposes derivations to handle the continuous action setting, the algorithm has since been extended to handle discrete action spaces (Christodoulou, 2019). We use the discrete action derivations as a basis to derive its factored versions. SAC relies on soft policy iteration, which alternates between policy evaluation and policy improvement. To extend this algorithm to factored action spaces, we parameterize the policy as explained in Section 5.1.

CRITIC ARCHITECTURE

A naive parameterization of the critic would use a Q-value function taking a state-action pair as input and returning a scalar value. As there are $n_1 \times \dots \times n_n$ possible actions, training a Q-value under this form quickly becomes intractable as the action space grows. To avoid this, we consider a sub Q-value for each action dimension. In the independent factorization setting, the action components are selected independently, so we consider $n$ independent sub Q-value functions $Q^i$ which take a state $s$ and return one Q-value per possible action in $\mathcal{A}_i$. The Q-value function $Q^i$ estimates the average return from state $s$ given the sub-action chosen by policy $\pi^i$, regardless of the other sub-actions chosen, i.e. as if the other sub-actions were chosen by the environment. By maintaining these $n$ independent Q-value functions, one can perform $n$ independent policy evaluation steps in parallel, as well as $n$ independent policy improvement steps in parallel. We also consider $n$ independent temperature parameters $\alpha^i$ to control the entropy versus reward trade-off for each action dimension. In the autoregressive setting, the action chosen by a sub-policy $\pi^i$ is conditioned on the $i-1$ previously chosen action components $(a^1, \dots, a^{i-1})$. To give a proper meaning to separate Q-values in this framework, we reuse the formulation proposed by Metz et al. (2017) and consider two MDPs: the top MDP corresponds to the MDP at hand, in which the action space is factored; the bottom MDP is an extended version of the top one. Between two top MDP states $s_t$ and $s_{t+1}$, we consider the $n$ intermediate states $u^i_t$. For the value functions to be equal in both MDPs, we set $r = 0$ and $\gamma = 1$ on each intermediate state. In this formulation, the $i$-th Q-value function $Q^i$, taking a sub-state in $\mathcal{U}^{i-1}$ and returning one value per action in $\mathcal{A}_i$, can be interpreted as an estimate of the average return from an intermediate state $u^{i-1}$ given the action choice of policy $\pi^i$.
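In practice, the intermediate states $u^i_t$ fed to the sub Q-values must be materialized somehow. One illustrative encoding (an assumption, not necessarily the paper's) concatenates the state vector with one-hot codes of the already chosen components:

```python
import numpy as np

def sub_state(s, sub_actions, sizes):
    """Build u^i = (s, a^1, ..., a^i) as a flat vector, one-hot encoding each sub-action."""
    parts = [np.asarray(s, dtype=float)]
    for a, n in zip(sub_actions, sizes):
        parts.append(np.eye(n)[a])             # one-hot code of sub-action a in a space of size n
    return np.concatenate(parts)

s = np.zeros(4)
u1 = sub_state(s, [2], sizes=[3])              # u^1 in U^1 = S x A_1
assert u1.shape == (4 + 3,) and u1[4 + 2] == 1.0
```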
We also introduce an extra Q-value function, dubbed the "up Q-value", $Q^U_\theta : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, that estimates the return of the policy in the top MDP. We consider $n$ temperature parameters $\alpha^i$, one per action dimension in the bottom MDP, as well as a global temperature parameter $\alpha$ to control the entropy versus reward trade-off in the top MDP. Below we detail the impact of both factorization approaches on the soft policy evaluation step, as this is where most changes are necessary. See Appendix A for the detailed derivations of the policy improvement step and the automatic entropy adjustment step.

SOFT POLICY EVALUATION

To evaluate policy $\pi$, Haarnoja et al. (2018) introduced the soft state value function defined as $V(s_t) = \mathbb{E}_{a_t \sim \pi_\phi(.|s_t)}\left[Q_\theta(s_t, a_t) - \alpha \log \pi_\phi(a_t|s_t)\right]$. To compute $V(s_t)$, one must compute an expectation over the distribution $\pi_\phi(.|s_t)$. In the continuous action setting, this expectation is estimated over a Gaussian using the reparameterization trick, which reduces the variance of the estimate. In the discrete action setting, this expectation can be computed exactly through an inner product, leading to a lower-variance estimate of the soft value; see Appendix A.1 for more details. Note that, while in the continuous action setting the Q-value network takes a state and an action and returns a scalar value, its discrete version takes a state and returns a scalar value for each possible action. Finally, the soft Q-value network is trained to minimize the soft Bellman residual: $J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\left(Q_\theta(s_t, a_t) - (r_t + \gamma V(s_{t+1}))\right)^2\right]$. (5)
Independent Factorization setting. When the sub-policies are independent distributions, the parameters of the $n$ sub Q-value functions are updated so as to minimize the soft Bellman residual in Equation (5), where $Q_\theta$ is replaced by $Q^i_\theta$ and the soft value term $V(s_{t+1})$ is replaced by $V^i(s_{t+1})$, computed as $V^i(s_t) = \pi^i_\phi(.|s_t)^T \left[Q^i_\theta(s_t, .) - \alpha^i \log \pi^i_\phi(.|s_t)\right]$, (6) where $Q^i_\theta(s_t, .)$ stands for the vector of Q-values over $\mathcal{A}_i$ given state $s_t$. As these equations are independent from each other, all the updates can be performed in parallel.
Autoregressive Factorization setting. When the policy is autoregressive, the first $n-1$ sub Q-value functions are updated so as to minimize the soft Bellman residual in Equation (5) with $r = 0$ and $\gamma = 1$, where $Q_\theta(s_t, a_t)$ is replaced by $Q^i_\theta(u^{i-1}_t, a^i_t)$ and the soft value term $V(s_{t+1})$ is replaced by $V^{i+1}(u^i_t)$.
This term is computed using Equation (6), where states $s_t \in \mathcal{S}$ are replaced by sub-states $u^{i-1}_t \in \mathcal{U}^{i-1}$; see Appendix A.2 for more details. The up Q-value function is updated using Equations (4) and (5), where the expectation over the distribution $\pi_\phi(.|s_t)$ is replaced by an estimate using one action sampled from $\pi_\phi(.|s_t)$. The last sub Q-value is updated to enforce equality between values in the top and bottom MDPs: $J_{Q^n}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\left(Q^n_\theta(u^{n-1}_t, a^n_t) - Q^U_\theta(s_t, a_t)\right)^2\right]$. (7) On the one hand, training the up Q-value overcomes the credit assignment problem induced by the zero reward in the bottom MDP. However, as mentioned before, relying on this Q-value alone does not work in practice. On the other hand, training the sub Q-values enables computing a better estimate of the expectation over the distribution $\pi_\phi(.|s_t)$.
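The exact inner-product form of the discrete soft value used above can be sketched with toy numbers (an illustrative check, not the paper's code):

```python
import numpy as np

def soft_value(pi, q, alpha):
    """Exact soft value for a discrete sub-policy: V = pi . (Q - alpha * log pi)."""
    pi, q = np.asarray(pi, dtype=float), np.asarray(q, dtype=float)
    return float(pi @ (q - alpha * np.log(pi)))

pi = np.array([0.5, 0.5])
q = np.array([1.0, 1.0])
# With equal Q-values, V = 1 + alpha * H(pi) = 1 + alpha * log 2.
assert np.isclose(soft_value(pi, q, alpha=0.1), 1.0 + 0.1 * np.log(2))
```

Because the expectation is computed exactly rather than sampled, this estimator has no sampling variance over the action dimension, which is the advantage of the discrete derivation.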

6. EXPERIMENTAL STUDY

We illustrate the efficiency of FARL on three different use cases. First, we show that factored action spaces are well suited to problems in which actions are composed of a discrete part and a continuous part, i.e. parameterized or hybrid action spaces. Second, we show that the autoregressive property can be leveraged in multi-agent problems; we use PPO in the challenging Google football environment (Kurach et al., 2019). Finally, we evaluate our agents in discretized MUJOCO environments and show that independent factored policies match continuous action ones even in high dimensions and can learn even with millions of possible actions. See Appendix C.1 for a summary of the experimental setting and additional experimental details.

6.1. PARAMETERIZED ACTION SPACES

In this section, we highlight the benefits of autoregressive factorization in parameterized action spaces. We use the gym PLATFORM benchmark, introduced in Masson et al. (2016); Bester et al. (2019), in which an agent must solve a platform game by selecting at each step a discrete action among hop, run or leap, as well as the continuous intensity of this action. In the original problem formulation, the agent must return one discrete action as well as three continuous actions lying in different ranges; only the continuous action corresponding to the discrete choice is applied. The environment contains three platforms as well as stochastic enemies to be avoided. The observation is a vector of 9 features. The return is the achievement percentage of the game: 100 corresponds to completion, i.e. the agent reached the extreme left of the platform. Through autoregressive factorization, the three continuous action spaces are transformed into a single discrete space containing $m$ actions corresponding to discrete bins. The agent first chooses a discrete action and then autoregressively selects among the $m$ bins. The selected bin is converted into a continuous value depending on the discrete choice. Thus, there are $3m$ actions. With this transformation, both factored agents reach completion in a few time steps, see Figure 1c. We also observe that Factored PPO (FPPO) is more stable and has a lower inter-seed variance but plateaus at a score of 90%, while Factored SAC (FSAC) shows more variability but is sometimes able to reach 100%.
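The bin-to-intensity conversion described above can be sketched as follows. The intensity ranges in `ranges` are hypothetical placeholders, NOT the benchmark's true ranges:

```python
def bin_to_continuous(discrete_choice, bin_index, ranges, m=11):
    """Map an autoregressively chosen bin to a continuous intensity, using the
    range associated with the chosen discrete action."""
    low, high = ranges[discrete_choice]
    return low + (high - low) * bin_index / (m - 1)

# Hypothetical intensity ranges for hop / run / leap (illustrative values only).
ranges = {0: (0.0, 30.0), 1: (0.0, 720.0), 2: (0.0, 430.0)}
assert bin_to_continuous(1, 0, ranges) == 0.0      # lowest bin -> lower bound
assert bin_to_continuous(1, 10, ranges) == 720.0   # highest bin -> upper bound
```

Since only the parameter matching the discrete choice is applied, the autoregressive agent never has to output the two unused continuous values, shrinking the effective action space to 3m discrete choices.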

6.2. MULTI-AGENT BENCHMARK: GOOGLE FOOTBALL

In this section, we show that autoregressive factorization also performs well in multi-agent problems. We tested FPPO in the Google football environment, where Kurach et al. (2019) have shown good performance using IMPALA. Like the authors, we conducted our experiments with the 3 versus 1 Keeper scenario from the Football Academy, where we control a team of 3 players who must score against an opponent keeper. Three types of observations exist in this environment. While the authors observed that their algorithms perform better with minimap picture inputs, we performed well from raw vectors of features, as can be seen in Figure 1b. In the original study, a single neural architecture is shared between all players. Each agent receives a vector of general features as well as a one-hot code corresponding to its number and chooses a discrete action among 19. Thus, the total number of possible joint actions is $19^3 = 6859$. Through autoregressivity, we consider a single agent that receives the global features and returns a sequence of discrete actions, one per player to be controlled. Therefore, instead of making one choice among 6859 actions, our agent makes three action choices among 19 at each time step. We show that through this approach, PPO outperforms IMPALA with fewer compute resources. After 50M steps, it reaches an average number of goals of 0.91 ± 0.04, while IMPALA reaches 0.86 ± 0.08 as reported in Kurach et al. (2019). FPPO was trained for only 2 days on 4 CPU cores, while IMPALA was trained with 150 CPU cores. The 4 trained agents played 10,000 episodes. Results are averaged over 4 seeds. To demonstrate the impact of autoregressivity in this setting, we also performed an ablation study in which we kept the same architecture and hyperparameters but removed the autoregressive property. In this case, the PPO policy returns 3 actions, each depending on the environment observation but independent from each other.
Figure 1a shows that, without autoregressivity, the ablation can still learn some behavior but quickly plateaus to a poor local optimum while FPPO finds an optimal strategy. 

6.3. MUJOCO BENCHMARKS

Finally, we evaluate our factored agents on four well-documented MUJOCO benchmarks. We discretize each of the $n$ continuous action dimensions into $m$ bins, resulting in $m^n$ actions. We use the independent factorization approach, as no inter-correlation between action components is needed on these benchmarks for the continuous versions of SAC and PPO: these algorithms sample actions from Gaussian distributions with diagonal covariance matrices. We chose $m = 11$ bins for these benchmarks but observed a low impact of this value on performance. Results are reported in Figure 2. We confirm the results from Tang & Agrawal (2020) for FPPO and demonstrate that FSAC obtains performance comparable to its continuous version despite the discretization. Notably, FSAC performs well on factored Humanoid-v2, which has about $10^{17}$ possible actions, thus demonstrating the scalability of independent action factorization.
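The discretization is a uniform grid per dimension; a small sketch (assuming action bounds of [-1, 1], which is the common MUJOCO convention but should be read as an assumption here):

```python
# Independent discretization: n action dimensions, m bins each -> m**n joint actions.
def grid_to_action(bins, m=11, low=-1.0, high=1.0):
    """Convert one bin index per dimension back to a continuous action vector."""
    return [low + (high - low) * b / (m - 1) for b in bins]

assert grid_to_action([0, 5, 10]) == [-1.0, 0.0, 1.0]

# Humanoid-v2 has 17 action dimensions, so 11 bins per dimension give
# 11**17 ~ 5e17 joint actions, i.e. on the order of 10**17.
assert 11 ** 17 > 10 ** 17
```

With independent factorization, the agent only ever handles n softmaxes of size m each, so this combinatorial count never has to be enumerated.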

7. CONCLUSION

Factorizing action spaces leads to impressive results in several RL domains but remains underexploited and understudied. In this work, we studied two factorization methods and highlighted both their theoretical impact on update equations and their practical use. We derived practical expressions used either to compute or to estimate Shannon entropies and Kullback-Leibler divergences. Notably, we showed that action space factorization is well suited to many problems and that these approaches can scale to large numbers of actions. We used the theoretical study to adapt PPO and SAC, two state-of-the-art agents, to both factorization settings and demonstrated their performance on several benchmarks. We believe that, from these derivations and the implementation tips we provided, most existing RL agents can be easily adapted to factored action spaces.

B PROPERTIES OF FACTORED POLICIES -PROOFS

B.1 SHANNON ENTROPY

PROOF OF PROPOSITION 1.1

When the policy $\pi$ is factored into independent sub-policies, all partial actions $a^i$ depend only on the state $s$. In this case, the entropy of the policy is by definition the joint entropy of the random variables associated with each partial action. Furthermore, this joint entropy is equal to the sum of the entropies of each sub-policy, due to the additivity of entropy for independent random variables:

$H(\pi(.|s)) = -\sum_{a = (a^1, \dots, a^n) \in \mathcal{A}_1 \times \dots \times \mathcal{A}_n} \pi(a|s) \log \pi(a|s) = -\sum_{(a^1, \dots, a^n) \in \mathcal{A}_1 \times \dots \times \mathcal{A}_n} \prod_{i=1}^{n} \pi^i(a^i|s) \log\left(\prod_{i=1}^{n} \pi^i(a^i|s)\right) = \sum_{i=1}^{n} H(\pi^i(.|s)).$

Remark: see (Gray, 2011) for a proof of the additivity of the Shannon entropy for independent random variables. In the case of two discrete independent random variables $X$ and $Y$, it can be demonstrated as follows:

$H(X, Y) = -\sum_{x, y} P(x)P(y) \log\left(P(x)P(y)\right) = -\sum_{x} P(x) \log P(x) - \sum_{y} P(y) \log P(y) = H(X) + H(Y).$

PROOF OF PROPOSITION 1.2

When the discrete random variables $X$ and $Y$ are not necessarily independent, the more general statement is that the joint entropy is equal to the entropy of $X$ plus the conditional entropy $H(Y|X) = -\sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)}$: that is, $H(X, Y) = H(X) + H(Y|X)$. The conditional entropy can be expressed as an expected value:

$H(X, Y) = H(X) - \sum_{x, y} P(x) \frac{P(x, y)}{P(x)} \log \frac{P(x, y)}{P(x)} = H(X) - \sum_{x} P(x) \sum_{y} P(y|x) \log P(y|x) = H(X) + \mathbb{E}_x\left[H(Y|x)\right].$

For three random variables: $H(X, Y, Z) = H(X) + \mathbb{E}_x[H(Y|x)] + \mathbb{E}_{x, y}[H(Z|x, y)]$, and this generalizes further. Applying it to an autoregressive policy $\pi(a|s) = \prod_{i=1}^{n} \pi^i(a^i|u^{i-1})$ yields the following equation:

$H(\pi(.|s)) = H(\pi^1(.|s)) + \mathbb{E}_{u^1}\left[H(\pi^2(.|u^1))\right] + \mathbb{E}_{u^2}\left[H(\pi^3(.|u^2))\right] + \dots + \mathbb{E}_{u^{n-1}}\left[H(\pi^n(.|u^{n-1}))\right],$

which can be written: $H(\pi(.|s)) = \mathbb{E}_{a \sim \pi(.|s)}\left[\sum_{i=1}^{n} H(\pi^i(.|u^{i-1}))\right].$
B.2 KULLBACK-LEIBLER DIVERGENCE

PROOF OF PROPOSITION 2.1

Let us consider four independent discrete random variables X, Y, X′ and Y′, with X and X′ sharing the same support and having probability mass functions p and p′ respectively, and Y and Y′ sharing the same support and having probability mass functions q and q′ respectively. We denote by H(X, Y||X′, Y′) the cross-entropy between the joint distributions of (X, Y) and (X′, Y′):

H(X, Y||X′, Y′) = - Σ_{x,y} p(x)q(y) log (p′(x)q′(y)).

We have:

H(X, Y||X′, Y′) = - Σ_{x,y} p(x)q(y) log p′(x) - Σ_{x,y} p(x)q(y) log q′(y)
                = - Σ_x p(x) log p′(x) - Σ_y q(y) log q′(y).

Therefore: H(X, Y||X′, Y′) = H(X||X′) + H(Y||Y′). The Kullback-Leibler divergence, cross-entropy and entropy are linked by the following formula: KL[Z_1||Z_2] = H(Z_1||Z_2) - H(Z_1). Using the additivity of entropy for independent random variables:

KL[X, Y||X′, Y′] = H(X, Y||X′, Y′) - H(X, Y)
                 = H(X||X′) + H(Y||Y′) - H(X) - H(Y)
                 = KL[X||X′] + KL[Y||Y′].

This can be generalized to joint distributions of more than two independent discrete random variables. Applied to policies over factored action spaces, assuming that π and μ are two policies such that ∀(s, a) ∈ S × A, π(a|s) = Π_{i=1}^n π_i(a_i|s) and μ(a|s) = Π_{i=1}^n μ_i(a_i|s), it results in:

KL[π(.|s)||μ(.|s)] = Σ_{i=1}^n KL[π_i(.|s)||μ_i(.|s)].
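Proposition 2.1 can be verified the same way. The sketch below (illustrative, not code from the paper) checks that the KL divergence between two independently factored distributions is the sum of the per-dimension divergences.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions
    # with identical support.
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
# Independent factored policies: pi = pi1 x pi2 and mu = mu1 x mu2.
pi1, mu1 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
pi2, mu2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))

joint_kl = kl(np.outer(pi1, pi2).ravel(), np.outer(mu1, mu2).ravel())
assert np.isclose(joint_kl, kl(pi1, mu1) + kl(pi2, mu2))
```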

PROOF OF PROPOSITION 2.2

As in the proof of the previous proposition, we consider four discrete random variables X, X′, Y and Y′. X and X′ have the same support and are independent, and Y and Y′ have the same support and are independent, but this time X and Y are not independent, and X′ and Y′ are not independent. We respectively denote the joint probability mass functions of (X, Y) and (X′, Y′) by p and q. Without ambiguity, we also denote by p and q the marginalizations over y: p(x) = Σ_y p(x, y) and q(x) = Σ_y q(x, y). Let us consider the cross-entropy between the joint probability distributions of (X, Y) and (X′, Y′):

H(X, Y||X′, Y′) = - Σ_{x,y} p(x, y) log q(x, y)
= - Σ_{x,y} p(x, y) log (q(x, y)/q(x)) - Σ_{x,y} p(x, y) log q(x)
= - Σ_x p(x) Σ_y p(y|x) log q(y|x) - Σ_x p(x) log q(x)
= H(X||X′) + E_{x∼p} [H(Y|x||Y′|x)].

Using the equality KL[Z_1||Z_2] = H(Z_1||Z_2) - H(Z_1), and the equality derived in the proof of Proposition 1.2, we get:

KL[X, Y||X′, Y′] = H(X||X′) + E_{x∼p} [H(Y|x||Y′|x)] - H(X, Y)
= H(X||X′) - H(X) + E_{x∼p} [H(Y|x||Y′|x)] - E_{x∼p} [H(Y|x)]
= KL[X||X′] + E_{x∼p} [KL(Y|x||Y′|x)].

Similarly, for random variables X, Y, Z and X′, Y′, Z′, we obtain:

KL[X, Y, Z||X′, Y′, Z′] = KL[X||X′] + E_{x∼p} [KL(Y|x||Y′|x)] + E_{(x,y)∼p} [KL(Z|x, y||Z′|x, y)].

Again, the equation can be generalized to joint distributions of n discrete random variables. In the context of autoregressive policies, assuming that π and μ are two policies such that ∀(s, a) ∈ S × A, π(a|s) = Π_{i=1}^n π_i(a_i|u^{i-1}) and μ(a|s) = Π_{i=1}^n μ_i(a_i|u^{i-1}), it results in:

KL[π(.|s)||μ(.|s)] = E_{a∼π(.|s)} [ Σ_{i=1}^n KL(π_i(.|u^{i-1})||μ_i(.|u^{i-1})) ].
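The chain rule for KL divergences in Proposition 2.2 can also be checked numerically on a two-dimensional autoregressive factorization. This is an illustrative sketch, not code from the paper:

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions.
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
# Autoregressive factorization over two dimensions: pi(a) = pi1(a1) pi2(a2|a1).
pi1 = rng.dirichlet(np.ones(3))
pi2 = rng.dirichlet(np.ones(4), size=3)       # pi2[x] = pi2(.|a1=x)
mu1 = rng.dirichlet(np.ones(3))
mu2 = rng.dirichlet(np.ones(4), size=3)

joint_pi = pi1[:, None] * pi2
joint_mu = mu1[:, None] * mu2

# KL[pi||mu] = KL[pi1||mu1] + E_{a1~pi1}[ KL[pi2(.|a1)||mu2(.|a1)] ]
chain = kl(pi1, mu1) + sum(pi1[x] * kl(pi2[x], mu2[x]) for x in range(3))
assert np.isclose(kl(joint_pi.ravel(), joint_mu.ravel()), chain)
```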

C ADDITIONAL EXPERIMENTAL RESULTS

C.1 EXPERIMENTAL SETTING SUMMARY

C.2 IMPACT OF THE ENTROPY TARGET

We study in Figure 3 the impact of the coefficient β, which defines the entropy target, on agent performance. This parameter balances exploration and exploitation: when its value tends to 1, α is tuned so as to maintain the policy entropy near the maximum entropy, while when its value tends to 0, the policy becomes almost deterministic. We observe that on Humanoid-v2 this value must remain small to ensure convergence.



The entropy term is not specified in the original paper, but can often be found in available implementations such as in RLlib or Spinning Up.



Figure 1: Autoregressive factorization assessment. Figures a) and b) correspond to Factored PPO performance in the Google Football 3 vs 1 keeper environment. Figure c) compares FPPO and FSAC in the gym PLATFORM environment.

Figure 2: Independent factorization assessment. FSAC (in blue) vs FPPO (in purple) in factored Mujoco environments. We use independent factorization for both agents.



Figure 3: Study of the impact of the entropy target in FSAC on the Humanoid-v2 environment. Each run is averaged over 4 seeds. In this example, all dimensions share the same target H̄_i = β log 17. Each run shown corresponds to a different value of the parameter β.


APPENDIX

A FACTORED SOFT ACTOR CRITIC: ADDITIONAL DETAILS

A.1 EXPECTATION COMPUTATIONS

Several expressions minimised in SAC rely on an expectation over actions sampled according to the policy π_θ. In its original version, SAC considers continuous action spaces and assumes a squashed Gaussian policy distribution. Actions are sampled from a Gaussian distribution parameterized by a mean vector μ_θ(s) and a diagonal covariance matrix σ_θ(s), both returned by a neural network with weights θ. The actions are then scaled by a tanh function. More formally, a = tanh(e) where e ∼ N(μ_θ(s), σ_θ(s)). This expression can be rewritten as a = tanh(σ_θ(s)z + μ_θ(s)) where z ∼ N(0, 1). This is called the reparametrization trick: instead of sampling from a distribution that depends on the neural network weights, we sample from a standard normal distribution and apply a linear scaling. This trick makes it possible to rewrite the expectation E_{a∼π_θ(.|s)}, which depends on θ, as an expectation E_{z∼N(0,1)}, and reduces the variance of the expectation estimate, thus allowing it to be computed in practice.

When the action space is discrete, a common choice for the policy distribution is a categorical distribution over actions. In this case, the policy neural network outputs a softmax over the possible actions. This parametrization enables the exact expectation over actions to be computed, without having to rely on an estimate. Indeed, this expectation reduces to a simple inner product:

E_{a∼π_θ(.|s)} [f(a, s)] = π_θ(.|s) · f(., s),

where f : S × A → R is a scalar function and f(., s) = [f(a_1, s), . . . , f(a_n, s)] is the vector containing the values of f for each possible action a. Using this expression reduces the variance of both the actor and critic losses. We use it to construct the loss expressions in the factored action settings.
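The variance reduction from the inner-product form can be seen in a few lines. The sketch below (illustrative; the names `logits` and `f` are placeholders for a policy head and, e.g., a soft Q-value term) compares the exact expectation to a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions = 5
logits = rng.normal(size=n_actions)
pi = np.exp(logits) / np.exp(logits).sum()   # softmax policy over actions
f = rng.normal(size=n_actions)               # placeholder for f(a, s)

exact = pi @ f                               # zero-variance inner product
samples = rng.choice(n_actions, size=100_000, p=pi)
mc = f[samples].mean()                       # noisy Monte Carlo estimate
assert abs(exact - mc) < 0.05
```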

A.2 SOFT POLICY ITERATION

We demonstrated that in the independent factorization setting, we can use n sub-policies and n sub-Q-values, one per action dimension, and update them independently in parallel, as if we were solving n parallel MDPs, one per dimension. While this introduces some instability, as the reward obtained by one sub-agent depends on the other agents' decisions, we observed that this strategy works well in practice. We hypothesize that this performance comes from the fact that the behavior of the other agents changes slowly enough for one agent to improve its own behavior as if the MDP were stationary.

In the autoregressive factorization setting, such a strategy cannot work, as the agents' choices are conditioned on the choices of other agents. In this situation, as explained in Section 5.3, we consider two MDPs: the top MDP, which corresponds to the MDP at hand in which the action space is factored, and the bottom MDP, which is an extended version of the top one. Between two top MDP states s_t and s_{t+1}, we consider the n intermediate states u^i_t. We apply to these intermediate states a discount factor equal to 1 and a zero reward, so as to ensure value equality on the states shared between both MDPs. In this case, the sub-Q-value functions estimate the Q-values on bottom states. We also consider a Q-value function Q^U_θ estimating returns on top states. Both Q-value types are complementary: the top Q-value suffers from the large number of possible actions; the sub-Q-values do not have this issue, as they rely on a smaller number of actions, i.e. the number of actions per dimension, but suffer from the credit assignment problem induced by the extra zero rewards. We observed that training the top Q-value alone does not work; however, adding these sub-Q-values enables learning.
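The value-equality construction can be illustrated with a small return computation (a sketch of the convention described above, not code from the paper): inserting n-1 intermediate transitions with zero reward and discount 1 before each top transition leaves the discounted return of the top MDP unchanged.

```python
import numpy as np

gamma = 0.99
top_rewards = [1.0, 2.0, 3.0]       # rewards along a top MDP trajectory
n = 4                               # action dimensions per top step

# Bottom MDP: each top transition is split into n sub-steps; the n-1
# intermediate transitions get reward 0 and discount 1, the last one
# carries the top reward and the top discount factor.
bottom = []
for r in top_rewards:
    bottom += [(0.0, 1.0)] * (n - 1) + [(r, gamma)]

top_return = sum(gamma**t * r for t, r in enumerate(top_rewards))
ret, disc = 0.0, 1.0
for r, g in bottom:
    ret += disc * r
    disc *= g
assert np.isclose(ret, top_return)
```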
On one side, we train the top Q-value so as to minimize soft Bellman residuals in the top MDP, see (5); on the other, we train all sub-Q-values, except the last one, to minimize soft Bellman residuals in the bottom MDP. Finally, the last sub-Q-value is updated to enforce equality between values in the top and bottom MDPs. Note that in all previous expressions, the expectation over actions is computed exactly as an inner product, as explained in (8).

A.3 SOFT POLICY IMPROVEMENT

The policy improvement step updates the policy π_φ by minimizing the usual SAC cost function. As before, the expectation approximated by a Monte Carlo estimate in the continuous action setting is replaced by the exact expectation in the discrete action setting. When the sub-policies are independent distributions, the parameters of the n sub-policies are updated independently so as to minimize the analogous per-dimension loss. When the policy is autoregressive, the n sub-policies are updated so as to minimize the same loss, in which states s_t ∈ S are replaced by sub-states u^{i-1}_t ∈ U^{i-1}. Each sub-policy is updated from its corresponding sub-Q-value function.

A.4 AUTOMATING ENTROPY ADJUSTMENT

The entropy adjustment coefficient α can either be fixed as a hyper-parameter or updated online so as to ensure that the policy entropy does not go below an entropy target value H̄. Haarnoja et al. (2018) show that α can be updated at each iteration so as to minimize:

J(α) = E_{a_t∼π_t} [-α log π_t(a_t|s_t) - α H̄].

In this work, we express the entropy target as a fraction β ∈ [0, 1] of the maximum entropy, i.e. the entropy of the uniform distribution over A, which we denote H_u. We give expressions of H_u for different action spaces in Table 1.

Table 1: Entropies of uniform distributions.

Action space Entropy of uniform distribution

When the factorization is independent, we simply minimize n independent losses, one for each sub-entropy coefficient α_i:

J(α_i) = E_{a_i∼π_i} [-α_i log π_i(a_i|s_t) - α_i H̄_i], where H̄_i = β log(n_i).

When the factorization is autoregressive, we minimize the same n losses, where states s_t are simply replaced by sub-states u^{i-1}_t. We also optimize the entropy coefficient α for the top MDP by minimizing expression (12), where the entropy target is computed as H̄ = β Σ_{i=1}^n log(n_i). We found in practice that the choice of the parameter β has the greatest impact on performance. We show an example on the Humanoid-v2 benchmark in Section C.2.
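The per-dimension temperature update can be sketched as follows. This is an illustrative NumPy example with made-up values; in practice π_i, α_i and the learning rate come from the learner, and the expectation is computed over a batch of states.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(4)
beta, n_i = 0.5, 11
target = beta * np.log(n_i)              # per-dimension target H_i = beta log(n_i)

pi_i = rng.dirichlet(np.ones(n_i))       # current sub-policy at some state (illustrative)
alpha_i = 0.2

# J(alpha_i) = E_{a_i ~ pi_i}[ -alpha_i (log pi_i(a_i|s) + target) ]
loss = np.sum(pi_i * (-alpha_i * (np.log(pi_i) + target)))
# Its gradient w.r.t. alpha_i reduces to H(pi_i) - target: gradient descent
# lowers alpha_i when the sub-policy entropy exceeds its target, and raises
# it when the sub-policy is too deterministic.
grad = entropy(pi_i) - target
alpha_i = max(alpha_i - 0.01 * grad, 0.0)
```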

