PROVABLE RICH OBSERVATION REINFORCEMENT LEARNING WITH COMBINATORIAL LATENT STATES

Abstract

We propose a novel setting for reinforcement learning that combines two common real-world difficulties: the presence of observations (such as camera images) and factored states (such as the locations of objects). In our setting, the agent receives observations generated stochastically from a latent factored state. These observations are rich enough to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of the state space is exponential in the number of latent factors. We create a learning algorithm FactoRL (Fact-o-Rel) for this setting which uses noise-contrastive learning to identify latent structures in emission processes and discover a factorized state space. We derive sample complexity guarantees for FactoRL which depend polynomially on the number of factors and only very weakly on the size of the observation space. We also provide a guarantee of polynomial time complexity when given access to an efficient planning algorithm.

1. INTRODUCTION

Most reinforcement learning (RL) algorithms scale polynomially with the size of the state space, which is inadequate for many real-world applications. Consider for example a simple navigation task in a room with furniture where the set of furniture pieces and their locations change from episode to episode. If we crudely approximate the room as a 10 × 10 grid and consider each element in the grid to contain a single bit of information about the presence of furniture, then we end up with a state space of size 2^100, as each element of the grid can be filled independently of the others. This is intractable for RL algorithms that depend polynomially on the size of the state space. The notion of factorization allows tractable solutions to be developed. For the above example, the room can be considered a state with 100 factors, where the next value of each factor depends on just a few other parent factors and the action taken by the agent. Learning in factored Markov Decision Processes (MDPs) has been studied extensively (Kearns & Koller, 1999; Guestrin et al., 2003; Osband & Van Roy, 2014), with tractable solutions scaling linearly in the number of factors and exponentially in the number of parent factors whenever planning can be done efficiently. However, factorization alone is inadequate since the agent may not have access to the underlying factored state space, instead only receiving a rich observation of the world. In our room example, the agent may have access to an image of the room taken from a megapixel camera instead of the grid representation. Naively treating each pixel of the image as a factor suggests there are over a million factors and a prohibitively large number of parent factors for each pixel. Counterintuitively, thinking of the observation as the state in this way leads to the conclusion that problems become harder as the camera resolution increases or other sensors are added.
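The gap between the flat and factored views in the room example can be made concrete with a back-of-the-envelope calculation. The sketch below contrasts the size of the flat joint state space with the number of parameters a factored transition model needs; the parent count (3) and action count (4) are hypothetical illustrative choices, not values from the paper.

```python
NUM_FACTORS = 100          # 10 x 10 grid, one bit per cell
PARENTS_PER_FACTOR = 3     # hypothetical: each factor has 3 parent factors
NUM_ACTIONS = 4            # hypothetical action count

# Flat (unfactored) view: every joint assignment of the 100 bits is a
# distinct state, so algorithms polynomial in |S| are hopeless.
flat_state_space = 2 ** NUM_FACTORS

# Factored view: the transition model needs one conditional probability
# table per factor, indexed by the action and the joint values of that
# factor's parents. Its size is linear in the number of factors and
# exponential only in the (small) number of parents.
table_entries_per_factor = NUM_ACTIONS * 2 ** PARENTS_PER_FACTOR
factored_model_size = NUM_FACTORS * table_entries_per_factor

print(flat_state_space)     # ~1.27e30 joint states
print(factored_model_size)  # 100 * 4 * 8 = 3200 table entries
```

The factored model's 3,200 entries versus roughly 10^30 joint states is precisely the tractability gap that motivates exploiting latent factored structure.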
It is entirely possible that these pixels (or more generally, observation atoms) are generated by a small number of latent factors, each with a small number of parent factors. This motivates us to ask: can we achieve PAC RL guarantees that depend polynomially on the number of latent factors and very weakly (e.g., logarithmically) on the size of the observation space? Recent work has addressed this for a rich-observation setting with a non-factored latent state space when certain supervised learning problems are tractable (Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020). However, addressing the rich-observation setting with a latent factored state space has remained elusive. Specifically, ignoring the factored structure in the latent space or treating observation atoms as factors yields intractable solutions.

