FACTOREDRL: LEVERAGING FACTORED GRAPHS FOR DEEP REINFORCEMENT LEARNING

Abstract

We propose a simple class of deep reinforcement learning (RL) methods, called FactoredRL, that can leverage factored environment structures to improve the sample efficiency of existing model-based and model-free RL algorithms. In tabular and linear approximation settings, the factored Markov decision process literature has shown exponential improvements in sample efficiency by leveraging factored environment structures. We extend this to deep RL algorithms that use neural networks. For model-based algorithms, we use the factored structure to inform the state transition network architecture, and for model-free algorithms, we use the factored structure to inform the Q-network or the policy network architecture. We demonstrate that doing this significantly improves sample efficiency in both discrete and continuous state-action space settings.

1. INTRODUCTION

In many domains, the structure of the Markov Decision Process (MDP) is known at the time of problem formulation. For example, in inventory management, we know the structure of the state transition: how inventory flows from a vendor, to a warehouse, to a customer (Giannoccaro & Pontrandolfo, 2002; Oroojlooyjadid et al., 2017). In portfolio management, we know that a certain asset changes only when the agent buys or sells the corresponding item (Jiang et al., 2017). Similar structural information is available in vehicle routing, robotics, computing, and many other domains. Our work stems from the observation that we can exploit the known structure of a given MDP to learn a good policy. We build on the Factored MDP literature (Boutilier et al., 1995; Osband & Van Roy, 2014; Kearns & Singh, 2002; Cui & Khardon, 2016), and propose a factored graph to represent known relationships between states, actions, and rewards in a given problem. We use the factored graph to inform the structure of the neural networks used in deep reinforcement learning (RL) algorithms, thereby improving their sample efficiency. We give literature references and example factored graphs for real-world applications in Appendix A.

Consider a motivational example in which the goal of the agent is to balance multiple independent cartpoles simultaneously, with each cartpole defined as in OpenAI Gym (G. Brockman & Zaremba, 2016). The agent can take a 'left' or 'right' action on each cartpole, and the state includes the position and velocity of each cart and each pole. We refer to this as the Multi-CartPole problem. Both model-based and model-free algorithms treat the state-action space as a single entity, which makes exploration combinatorially complex. As a consequence, the sample efficiency of RL algorithms degrades exponentially with the number of cartpoles, despite the problem remaining conceptually simple for a human. By allowing the agent access to the problem's factored structure (i.e.
each action affects only one cartpole), we bypass the need to learn about each action's relationship with the entire state, and instead only need to learn about each action's relationship with its single, related cartpole. We show how to integrate knowledge of the factored graph into both model-based and model-free deep RL algorithms, and thereby improve sample efficiency. In all cases, we first write down a factored graph as an adjacency matrix representing the relationships between states, actions, and rewards. From this adjacency matrix, we then define a Factored Neural Network (Factored NN), which uses input and output masking to reflect the structure of the factored graph. Finally, we show how to integrate this Factored NN into existing deep RL algorithms. For model-based algorithms, we use the Factored NN to learn decomposed state transitions, and then integrate this state transition model with Monte Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006). For model-free algorithms, we use the Factored NN to learn a decomposed Q-function, and then integrate it with DQN (Mnih et al., 2015); we also use the Factored NN to learn a decomposed policy function, and then integrate it with PPO (Schulman et al., 2017). In all three cases, we demonstrate empirically that these FactoredRL methods (Factored MCTS, DQN, and PPO) achieve better sample efficiency than their vanilla implementations on a range of environments.
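As an illustrative sketch of the masking idea (a minimal construction of our own for exposition, not the paper's exact implementation), the adjacency matrix can be applied elementwise to a layer's weights so that each output factor is computed only from the inputs it depends on in the factored graph:

```python
import numpy as np

def factored_forward(x, adj, weights):
    """Compute each output factor from only the inputs it depends on.

    x:       (d_in,) input vector (concatenated state and action dims)
    adj:     (d_out, d_in) binary mask; adj[i, j] = 1 iff output i
             may depend on input j in the factored graph
    weights: (d_out, d_in) weight matrix of a linear layer
    """
    masked_w = weights * adj  # zero out edges absent from the factored graph
    return masked_w @ x       # each output sees only its parent inputs

# Toy example: two independent "cartpoles", where next-state factor i
# depends only on the two input dims belonging to cartpole i.
adj = np.array([[1, 1, 0, 0],
                [0, 0, 1, 1]], dtype=float)
w = np.ones((2, 4))
x = np.array([1.0, 2.0, 3.0, 4.0])
print(factored_forward(x, adj, w))  # [3. 7.]
```

Because the mask is applied multiplicatively, gradients with respect to masked-out weights are zero during training, so the factorization constraint is preserved without changing the underlying network architecture.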

2. RELATED WORK

Several methods have been proposed in the Factored MDP literature that exploit the structural information of a problem. Kearns & Koller (1999) propose a method to conduct model-based RL with a Dynamic Bayesian Network (DBN) (Dean & Kanazawa, 1989) and learn its parameters based on an extension of the Explicit Explore or Exploit (E^3) algorithm (Kearns & Singh, 2002). Guestrin et al. (2003) propose a linear-program-based and a dynamic-program-based algorithm to learn linear value functions in Factored MDPs, and extend them to multi-agent settings (Guestrin et al., 2002). They exploit the context-specific and additive structures in Factored MDPs, which capture the locality of influence of specific states and actions. We use the same structures in our proposed algorithms. Cui & Khardon (2016) propose a symbolic representation of Factored MDPs. Osband & Van Roy (2014) propose posterior sampling and upper-confidence-bound based algorithms and prove that they are near-optimal. They show that the sample efficiency of these algorithms scales polynomially with the number of parameters that encode the Factored MDP, which may be exponentially smaller than the full state-action space. Xu & Tewari (2020) extend the results to non-episodic settings, and Lattimore et al. (2016) show similar results for contextual bandits. The algorithms proposed in these prior works assume a tabular (Cui et al., 2015; Geißer et al.) or linear setting (Guestrin et al., 2003), or require symbolic expressions (Cui & Khardon, 2016). We extend these ideas to deep RL algorithms by incorporating the structural information in the neural network. Li & Czarnecki (2019) propose a factored DQN algorithm for urban driving applications. Our proposed algorithms are similar, but we extend the ideas to model-based algorithms like MCTS (Kocsis & Szepesvári, 2006) and model-free on-policy algorithms like PPO (Schulman et al., 2017).
We also evaluate our algorithms on a variety of environments encompassing discrete and continuous state-action spaces. The Factored NN we propose is closely related to Graph Neural Networks (Scarselli et al., 2008; Zhou et al., 2018), deep learning methods that operate on graph-structured data and have been applied to domains such as network analysis (Kipf & Welling, 2016), molecule design (Liu et al., 2018), and computer vision (Xu et al., 2018). Instead of explicitly embedding the neighbors of all the nodes with neural networks, we use a single neural network with masking. NerveNet (Wang et al., 2018) addresses the expressiveness of structure in an MDP, similar to our work. They focus on robotics applications and demonstrate state-action factorization with PPO. In our work, we additionally demonstrate state transition and state-reward factorization in MCTS and DQN, respectively. Moreover, they propose imposing structure with Graph Neural Networks, whereas we propose using input and output masking without modifying the neural architecture. Working Memory Graphs (Loynd et al., 2020) use Transformer networks for modeling both factored observations and dependencies across time steps. However, they only evaluate their method in a grid world with a single discrete action. In contrast, we demonstrate our methods on multiple environments and algorithms with factorization in state transition, state-action, and state-reward relationships. In addition, our factored network is a simple extension of the existing network used to solve a problem, whereas they impose a complex network architecture. Action masking has been used effectively to improve RL performance in multiple works (Williams & Zweig, 2016; Williams et al., 2017; Vinyals et al., 2017). We use a similar trick when applying our Factored NN to policy networks in model-free RL. However, we use both an action mask and a state mask to incorporate factored structure in policy networks.
Our state transition networks for model-based RL also impose masks on both input and output, corresponding to the current state-action pair and the next state, respectively. Wu et al. (2018) introduce an action-dependent baseline in actor-critic algorithms, where a separate advantage function is learned for each action. Their method also exploits

