NON-MARKOVIAN PREDICTIVE CODING FOR PLANNING IN LATENT SPACE

Anonymous

Abstract

High-dimensional observations are a major challenge in the application of model-based reinforcement learning (MBRL) to real-world environments. In order to handle high-dimensional sensory inputs, existing MBRL approaches use representation learning to map high-dimensional observations into a lower-dimensional latent space that is more amenable to dynamics estimation and planning. Crucially, the task-relevance and predictability of the learned representations play critical roles in the success of planning in latent space. In this work, we present Non-Markovian Predictive Coding (NMPC), an information-theoretic approach for planning from high-dimensional observations with two key properties: 1) it formulates a mutual information objective that prioritizes the encoding of task-relevant components of the environment; and 2) it employs a recurrent neural network capable of modeling non-Markovian latent dynamics. To demonstrate NMPC's ability to prioritize task-relevant information, we evaluate our new model on a challenging modification of standard DMControl tasks in which the DMControl background is replaced with natural videos, containing complex information that is irrelevant to the planning task. Our experiments show that NMPC is superior to existing methods in the challenging complex-background setting while remaining competitive with current state-of-the-art MBRL models in the standard setting.

1. INTRODUCTION

Learning to control from high-dimensional observations has been made possible by advances in reinforcement learning (RL) and deep learning. These advances have enabled notable successes such as solving video games (Mnih et al., 2015; Lample & Chaplot, 2017) and continuous control problems (Lillicrap et al., 2016) from pixels. However, it is well known that performing RL directly in the high-dimensional observation space is sample-inefficient and may require a large amount of training data (Lake et al., 2017). This is a critical problem, especially for real-world applications. Recent model-based RL works (Kaiser et al., 2020; Ha & Schmidhuber, 2018; Hafner et al., 2019; Zhang et al., 2019; Hafner et al., 2020) proposed to tackle this problem by learning a world model in a latent space and then applying RL algorithms within that latent world model. The existing MBRL methods that learn a latent world model typically do so via reconstruction-based objectives, which are likely to encode task-irrelevant information, such as the background. In this work, we take inspiration from the success of contrastive learning and propose Non-Markovian Predictive Coding (NMPC), a novel information-theoretic approach for planning from pixels. In contrast to reconstruction, NMPC formulates a mutual information (MI) objective to learn the latent space for control. This objective circumvents the need to reconstruct and prioritizes the encoding of task-relevant components of the environment, making NMPC more robust when dealing with complicated observations. Our primary contributions are as follows:

• We propose Non-Markovian Predictive Coding (NMPC), a novel information-theoretic approach to learning latent world models for planning from high-dimensional observations, and theoretically analyze its ability to prioritize the encoding of task-relevant information.
• We show experimentally that NMPC outperforms the state-of-the-art model when dealing with complex environments dominated by task-irrelevant information, while remaining competitive on standard DeepMind control (DMControl) tasks. Additionally, we conduct detailed ablation analyses to study the empirical importance of the components in NMPC.

2. BACKGROUND

The motivation and design of our model are largely based on two previous works (Shu et al., 2020; Hafner et al., 2020). In this section, we briefly review the key concepts of each.

Shu et al. (2020) proposed PC3, an information-theoretic approach that uses contrastive predictive coding (CPC) to learn a latent space amenable to locally-linear control. Specifically, they present the theory of predictive suboptimality to motivate a CPC objective between the latent states of two consecutive time steps, instead of CPC between the frame and its corresponding state. Moreover, they use the latent dynamics F as the variational device in the lower bound

\ell_{\text{cpc}}(E, F) = \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K} \ln \frac{F\big(E(o_{t+1}^{(i)}) \mid E(o_{t}^{(i)}), a_{t}^{(i)}\big)}{\frac{1}{K}\sum_{j=1}^{K} F\big(E(o_{t+1}^{(i)}) \mid E(o_{t}^{(j)}), a_{t}^{(j)}\big)}\right]. \quad (1)

This particular design of the critic has two benefits: it circumvents the instantiation of an auxiliary critic, and it takes advantage of the property that an optimal critic is the true latent dynamics. However, the authors also show that this objective does not ensure the learning of a latent dynamics F that is consistent with the true latent dynamics, and they therefore introduce a consistency loss to ensure that the latent dynamics model F indeed approximates the true latent dynamics:

\ell_{\text{cons}}(E, F) = \mathbb{E}_{p(o_{t+1}, o_t, a_t)}\big[\ln F\big(E(o_{t+1}) \mid E(o_t), a_t\big)\big]. \quad (2)

Since PC3 only tackles the problem from an optimal control perspective, it is not readily applicable to RL problems. Indeed, PC3 requires a depiction of the goal image in order to perform control, as well as the ability to teleport to random locations of the state space to collect data, both of which are impractical in many problems. On the other hand, Dreamer (Hafner et al., 2020) achieves state-of-the-art performance on many RL tasks, but learns the latent space using a reconstruction-based objective.
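To make the CPC objective in Eq. (1) concrete, the sketch below scores each encoded next state against the dynamics predictions from every sample in the batch and contrasts the matched pair against the batch average. The isotropic Gaussian standing in for the dynamics F, the function names, and the batch layout are illustrative assumptions, not PC3's actual implementation:

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def gaussian_log_density(x, mean, sigma=1.0):
    """Log-density of an isotropic Gaussian; stands in for the dynamics F."""
    d = x.shape[-1]
    return (-0.5 * np.sum((x - mean) ** 2, axis=-1) / sigma**2
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

def cpc_loss(s_next, mu_pred):
    """Negative CPC lower bound with the latent dynamics as critic.

    s_next:  (K, d) encoded next states E(o_{t+1}^{(i)}).
    mu_pred: (K, d) dynamics predictions computed from E(o_t^{(j)}), a_t^{(j)}.
    Entry (i, j) scores s_next[i] under the prediction from batch sample j.
    """
    K = s_next.shape[0]
    # log F(s_{t+1}^{(i)} | s_t^{(j)}, a_t^{(j)}) for every pair (i, j)
    log_f = gaussian_log_density(s_next[:, None, :], mu_pred[None, :, :])
    positive = np.diag(log_f)                         # matched pairs, i = j
    log_batch_avg = logsumexp(log_f, axis=1) - np.log(K)
    return -np.mean(positive - log_batch_avg)
```

Because the positive score is one term inside the denominator's sum, each log-ratio is at most ln K, so the loss is bounded below by -ln K.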
While the authors included a demonstration of a contrastive approach that yielded inferior performance to their reconstruction-based approach, their contrastive model applied CPC between the frame and its corresponding state, as opposed to between latent states across time steps. In this paper, we present Non-Markovian Predictive Coding (NMPC), a novel latent world model that leverages the concepts in PC3 and applies them to the Dreamer paradigm and the RL setting. Motivated by PC3, we formulate a mutual information objective between historical and future latent states to learn the latent space. We additionally take advantage of the recurrent model in Dreamer to model more complicated dynamics than those considered in PC3. The use of recurrent dynamics also allows us to extend Eqs. (1) and (2) to the non-Markovian setting, which was not considered in PC3.
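The non-Markovian extension can be pictured as folding the history (s_{<t}, a_{<t}) into a recurrent hidden state h_t, so that the dynamics/critic conditions on h_t rather than on s_t alone. A minimal sketch, where the tanh cell and weight shapes are illustrative choices and not the paper's architecture:

```python
import numpy as np

def rnn_step(h, s, a, Wh, Ws, Wa):
    """One recurrence: fold the latest (state, action) into the history summary."""
    return np.tanh(h @ Wh + s @ Ws + a @ Wa)

def summarize_history(states, actions, Wh, Ws, Wa):
    """Return hidden states h_1..h_T, where h_t summarizes (s_{<=t}, a_{<=t}).

    A conditional density F(s_{t+1} | h_t) built on these summaries is
    non-Markovian in the latent states, unlike F(s_{t+1} | s_t, a_t).
    """
    h = np.zeros(Wh.shape[0])
    hidden = []
    for s, a in zip(states, actions):
        h = rnn_step(h, s, a, Wh, Ws, Wa)
        hidden.append(h)
    return np.stack(hidden)
```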

3. NON-MARKOVIAN PREDICTIVE CODING FOR PLANNING FROM PIXELS

To plan in an unknown environment, we need to model the environment dynamics from experience. We do so by iteratively collecting new data and using those data to train the world model. In this section, we present the proposed latent world model, its components and objective functions, and discuss practical considerations for implementing the method.
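The alternation between data collection and model fitting described above can be sketched as follows; the `ToyModel` class, `collect_episode` helper, and `step_fn` interface are hypothetical placeholders for illustration, not the paper's implementation:

```python
import random

class ToyModel:
    """Stand-in world model: records how much data it has been fit on."""
    def __init__(self):
        self.fit_sizes = []

    def act(self, obs):
        # Placeholder policy; a real agent would plan in latent space.
        return random.choice([-1.0, 1.0])

    def fit(self, buffer):
        # Pretend to train; just record the amount of experience seen.
        self.fit_sizes.append(len(buffer))

def collect_episode(step_fn, model, horizon=5):
    """Roll out the current policy for `horizon` steps in the environment."""
    obs, episode = 0.0, []
    for _ in range(horizon):
        action = model.act(obs)
        obs, reward = step_fn(obs, action)
        episode.append((obs, action, reward))
    return episode

def train_world_model(step_fn, model, num_iters=3):
    """Alternate between collecting new data and refitting the world model."""
    buffer = []
    for _ in range(num_iters):
        buffer.append(collect_episode(step_fn, model))
        model.fit(buffer)
    return model
```

Each iteration grows the replay buffer with one fresh episode before the model is refit on all experience collected so far.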

3.1. NON-MARKOVIAN PREDICTIVE CODING

We aim to learn a latent dynamics model for planning. To do that, we define an encoder E to embed high-dimensional observations into a latent space, a latent dynamics F to model the world in this space, and a reward function, as follows:

Encoder: E(o_t) = s_t
Latent dynamics: F(s_t \mid s_{<t}, a_{<t}) = p(s_t \mid s_{<t}, a_{<t})
Reward function: R(r_t \mid s_t) = p(r_t \mid s_t) \quad (3)

in which t is the discrete time step, \{o_t, a_t, r_t\}_{t=1}^{T} are data sequences with image observations o_t, continuous action vectors a_t, and scalar rewards r_t, and s_t denotes the latent state at time t. To handle potentially non-Markovian environment dynamics, we model the transition dynamics using a recurrent

