FACTOREDRL: LEVERAGING FACTORED GRAPHS FOR DEEP REINFORCEMENT LEARNING

Abstract

We propose a simple class of deep reinforcement learning (RL) methods, called FactoredRL, that can leverage factored environment structures to improve the sample efficiency of existing model-based and model-free RL algorithms. In tabular and linear approximation settings, the factored Markov decision process literature has shown exponential improvements in sample efficiency by leveraging factored environment structures. We extend this to deep RL algorithms that use neural networks. For model-based algorithms, we use the factored structure to inform the state transition network architecture; for model-free algorithms, we use the factored structure to inform the Q network or the policy network architecture. We demonstrate that incorporating this structure significantly improves sample efficiency in both discrete and continuous state-action space settings.

1. INTRODUCTION

In many domains, the structure of the Markov Decision Process (MDP) is known at the time of problem formulation. For example, in inventory management, we know the structure of the state transition: how inventory flows from a vendor, to a warehouse, to a customer (Giannoccaro & Pontrandolfo, 2002; Oroojlooyjadid et al., 2017). In portfolio management, we know that a given asset changes only when the agent buys or sells the corresponding item (Jiang et al., 2017). Similar structural information is available in vehicle routing, robotics, computing, and many other domains. Our work stems from the observation that we can exploit the known structure of a given MDP to learn a good policy. We build on the Factored MDP literature (Boutilier et al., 1995; Osband & Van Roy, 2014; Kearns & Singh, 2002; Cui & Khardon, 2016), and propose a factored graph to represent known relationships between states, actions, and rewards in a given problem. We use the factored graph to inform the structure of the neural networks used in deep reinforcement learning (RL) algorithms, thereby improving their sample efficiency. We give literature references and example factored graphs for real-world applications in Appendix A.

Consider a motivational example in which the goal of the agent is to balance multiple independent cartpoles simultaneously, with each cartpole defined as per OpenAI gym (G. Brockman & Zaremba, 2016). The agent can take a 'left' or 'right' action on each cartpole, and the state includes the position and velocity of each cart and each pole. We refer to this as the Multi-CartPole problem. Both model-based and model-free algorithms treat the state-action space as a single entity, which makes exploration combinatorially complex. As a consequence, the sample efficiency of RL algorithms degrades exponentially with the number of cartpoles, despite the problem remaining conceptually simple for a human. By giving the agent access to the problem's factored structure (i.e., each action affects only one cartpole), we bypass the need to learn about each action's relationship with the entire state, and instead only need to learn about each action's relationship with its single, related cartpole.

We show how to integrate knowledge of the factored graph into both model-based and model-free deep RL algorithms, and thereby improve sample efficiency. In all cases, we first write down the factored graph as an adjacency matrix representing the relationships between state, action, and reward. From this adjacency matrix, we then define a Factored Neural Network (Factored NN), which uses input and output masking to reflect the structure of the factored graph.
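To make the construction concrete, the sketch below builds a block-diagonal adjacency matrix for a two-cartpole instance of Multi-CartPole and applies it as a weight mask on a single linear layer, so each output depends only on the inputs its factored graph allows. The matrix layout, masking scheme, and helper names (`multi_cartpole_adjacency`, `FactoredLinear`) are our illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def multi_cartpole_adjacency(n_carts, state_dim_per_cart=4):
    """Adjacency over (states + actions) -> next states.
    Assumption: cartpole k's next state depends only on cartpole k's
    current state and on action a_k (the factored structure described
    in the text)."""
    n_states = n_carts * state_dim_per_cart
    A = np.zeros((n_states + n_carts, n_states))
    for k in range(n_carts):
        block = slice(k * state_dim_per_cart, (k + 1) * state_dim_per_cart)
        A[block, block] = 1.0        # cart k's state -> cart k's next state
        A[n_states + k, block] = 1.0  # action a_k -> cart k's next state
    return A

class FactoredLinear:
    """A linear layer whose weights are elementwise-masked by the
    adjacency matrix, implementing the input/output masking idea."""
    def __init__(self, adjacency, rng=None):
        rng = rng or np.random.default_rng(0)
        self.mask = adjacency
        self.W = rng.normal(scale=0.1, size=adjacency.shape)

    def __call__(self, x):
        return x @ (self.W * self.mask)

A = multi_cartpole_adjacency(n_carts=2)
layer = FactoredLinear(A)

# Perturb only the second cartpole's action (last input dimension) and
# check which next-state predictions change.
base = layer(np.zeros(A.shape[0]))
x = np.zeros(A.shape[0])
x[-1] = 1.0
changed = np.nonzero(layer(x) - base)[0]
```

Because the mask zeroes every weight outside a cartpole's block, only the second cartpole's four state outputs (indices 4-7 here) can respond to action a_1; the rest are structurally independent of it.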

