

Abstract

Value Iteration Networks (VINs) have emerged as a popular method for performing implicit planning within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics. This approach comes with several limitations, however: the model is not explicitly incentivised to perform meaningful planning computations, the underlying state space is assumed to be discrete, and the Markov decision process (MDP) is assumed fixed and known. We propose eXecuted Latent Value Iteration Networks (XLVINs), which combine recent developments across contrastive self-supervised learning, graph representation learning and neural algorithmic reasoning to alleviate all of the above limitations, successfully deploying VIN-style models on generic environments. XLVINs match the performance of VIN-like models when the underlying MDP is discrete, fixed and known, and provide significant improvements over model-free baselines across three general MDP setups.

1. INTRODUCTION

Planning is an important aspect of reinforcement learning (RL) algorithms, and planning algorithms are usually characterised by explicit modelling of the environment. Recently, several approaches have explored implicit planning (also called model-free planning) (Tamar et al., 2016; Oh et al., 2017; Racanière et al., 2017; Silver et al., 2017; Niu et al., 2018; Guez et al., 2018; 2019). Instead of training explicit environment models and leveraging planning algorithms, such approaches propose inductive biases in the policy function that enable planning to emerge, while training the policy in a model-free manner. A notable example of this line of research is the value iteration network (VIN), which observes that the value iteration (VI) algorithm on a grid-world can be understood as a convolution of state values and transition probabilities followed by max-pooling, inspiring the use of a CNN-based VI module (Tamar et al., 2016). Generalized value iteration networks (GVINs), based on graph kernels (Yanardag & Vishwanathan, 2015), lift the assumption that the environment is a grid-world and allow planning on irregular discrete state spaces (Niu et al., 2018), such as graphs. While such models can learn to perform VI, they are in no way constrained or explicitly incentivised to do so. Policies including such planning modules might exploit their capacity for different purposes, potentially finding ways to overfit the training data instead of learning how to plan. Further, both VINs and GVINs assume discrete state spaces, incurring a loss of information for problems with naturally continuous state spaces. Finally, and most importantly, both approaches require the graph specifying the underlying Markov decision process (MDP) to be known in advance, and are inapplicable if it is too large to be stored in memory or otherwise inaccessible.
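As a toy illustration of the observation underlying VINs, the sketch below (our own illustrative code, not from the paper; the grid layout, rewards and discount factor are assumptions) runs value iteration on a small deterministic grid-world. Each Bellman backup gathers neighbouring state values through a local stencil (the "convolution") and then maximises over actions (the "max-pooling"):

```python
# Minimal sketch: value iteration on a grid-world, written to expose the
# conv + max-pool structure that motivates the VIN architecture.
# Grid, rewards, goal position and discount are illustrative assumptions.

def value_iteration(rewards, n_iters=50, gamma=0.9):
    """rewards: 2-D list of per-cell rewards; returns converged values V."""
    h, w = len(rewards), len(rewards[0])
    V = [[0.0] * w for _ in range(h)]
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    for _ in range(n_iters):
        new_V = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                # "Convolution": gather neighbour values via local offsets.
                q = []
                for di, dj in actions:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        q.append(rewards[ni][nj] + gamma * V[ni][nj])
                    else:  # moving off-grid keeps the agent in place
                        q.append(rewards[i][j] + gamma * V[i][j])
                # "Max-pooling": maximise the Q-values over actions.
                new_V[i][j] = max(q)
        V = new_V
    return V

# 3x3 grid with a reward of +1 in the bottom-right (goal) cell.
rewards = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
V = value_iteration(rewards)
```

A VIN replaces this hand-coded backup with learned convolutional filters, so the same conv + max-pool iteration is applied as a differentiable module inside the policy network.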
In this paper, we propose the eXecuted Latent Value Iteration Network (XLVIN), an implicit planning policy network which embodies the computation of VIN-like models while addressing all of the above issues. As a result, we are able to seamlessly run XLVINs with minimal configuration changes on discrete-action environments, from MDPs with known structure (such as grid-worlds), through pixel-based ones (such as Atari), all the way to fully continuous-state environments, consistently outperforming or matching baseline models which lack XLVIN's inductive biases. To achieve this, we unify recent concepts from several areas of representation learning:

• Using contrastive self-supervised representation learning, we are able to meaningfully infer the dynamics of the MDP, even when it is not provided. In particular, we leverage the work of Kipf et al. (2020); van der Pol et al. (2020), which uses the TransE model (Bordes et al.,
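To give a flavour of the TransE-style contrastive objective referenced above, the following sketch (our own illustrative code with toy embeddings; the function names and all numeric values are assumptions, not the authors' implementation) scores a transition (s, a, s') by how closely the translated embedding z_s + t_a lands on z_s', with a hinge loss that pushes a negative sample at least a margin further away:

```python
# Minimal sketch of a TransE-style contrastive transition loss:
# a transition (s, a, s') is embedded so that z_s + t_a ~ z_s',
# while a negative state embedding must be at least `margin` further away.
# All embeddings below are toy values chosen for illustration.

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def transe_hinge(z_s, t_a, z_next, z_neg, margin=1.0):
    """Hinge loss: positive energy + margin must not exceed negative energy."""
    translated = [x + y for x, y in zip(z_s, t_a)]
    pos = l2(translated, z_next)  # ||z_s + t_a - z_s'||, should be small
    neg = l2(translated, z_neg)   # distance to a corrupted (negative) state
    return max(0.0, pos - neg + margin)

# Toy case 1: the true next-state embedding lies exactly at z_s + t_a,
# and the negative is far away, so the loss vanishes.
z_s, t_a = [0.0, 0.0], [1.0, 0.0]
loss_good = transe_hinge(z_s, t_a, z_next=[1.0, 0.0], z_neg=[5.0, 5.0])

# Toy case 2: the translation misses the true next state while the
# negative sits exactly at z_s + t_a, so the loss is positive.
loss_bad = transe_hinge(z_s, t_a, z_next=[3.0, 3.0], z_neg=[1.0, 0.0])
```

In practice the embeddings are produced by a learned encoder over observations and the loss is minimised over batches of sampled transitions and negatives; the toy vectors above only demonstrate the shape of the objective.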

