

Abstract

Value Iteration Networks (VINs) have emerged as a popular method to perform implicit planning within deep reinforcement learning, enabling performance improvements on tasks requiring long-range reasoning and understanding of environment dynamics. This came with several limitations, however: the model is not explicitly incentivised to perform meaningful planning computations, the underlying state space is assumed to be discrete, and the Markov decision process (MDP) is assumed fixed and known. We propose eXecuted Latent Value Iteration Networks (XLVINs), which combine recent developments across contrastive self-supervised learning, graph representation learning and neural algorithmic reasoning to alleviate all of the above limitations, successfully deploying VIN-style models on generic environments. XLVINs match the performance of VIN-like models when the underlying MDP is discrete, fixed and known, and provide significant improvements over model-free baselines across three general MDP setups.

1. INTRODUCTION

Planning is an important aspect of reinforcement learning (RL) algorithms, and planning algorithms are usually characterised by explicit modelling of the environment. Recently, several approaches explore implicit planning (also called model-free planning) (Tamar et al., 2016; Oh et al., 2017; Racanière et al., 2017; Silver et al., 2017; Niu et al., 2018; Guez et al., 2018; 2019). Instead of training explicit environment models and leveraging planning algorithms, such approaches propose inductive biases in the policy function that enable planning to emerge, while training the policy in a model-free manner. A notable example of this line of research is the value iteration network (VIN), which observes that the value iteration (VI) algorithm on a grid-world can be understood as a convolution of state values and transition probabilities followed by max-pooling, inspiring the use of a CNN-based VI module (Tamar et al., 2016). Generalized value iteration networks (GVINs), based on graph kernels (Yanardag & Vishwanathan, 2015), lift the assumption that the environment is a grid-world and allow planning on irregular discrete state spaces (Niu et al., 2018), such as graphs. While such models can learn to perform VI, they are in no way constrained or explicitly incentivised to do so. Policies including such planning modules might exploit their capacity for different purposes, potentially finding ways to overfit the training data instead of learning how to plan. Further, both VINs and GVINs assume discrete state spaces, incurring a loss of information on problems with naturally continuous state spaces. Finally, and most importantly, both approaches require the graph specifying the underlying Markov decision process (MDP) to be known in advance, and are inapplicable if it is too large to be stored in memory, or otherwise inaccessible.
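The convolution-plus-max-pooling view of VI on a grid-world can be made concrete with a minimal NumPy sketch (not the VIN implementation itself). Here the transition model is hand-coded as per-action 3×3 kernels, each Q-channel is a local "convolution" of the padded value map with one kernel, and the new value map max-pools over the action channels; the function name and the deterministic kernels are illustrative assumptions.

```python
import numpy as np

def value_iteration_conv(reward, kernels, gamma=0.9, iters=80):
    """VI on a grid-world as convolution + max-pooling.

    reward:  (H, W) per-cell immediate rewards
    kernels: (A, 3, 3) per-action transition kernels over the
             3x3 neighbourhood of each cell (rows sum to 1)
    """
    H, W = reward.shape
    V = np.zeros((H, W))
    for _ in range(iters):
        Vp = np.pad(V, 1, mode="edge")  # clamp values at the borders
        # Gather each cell's 3x3 neighbourhood: shape (H, W, 3, 3)
        neigh = np.stack([np.stack([Vp[i:i + H, j:j + W]
                                    for j in range(3)], axis=-1)
                          for i in range(3)], axis=-2)
        # Q[a] = R + gamma * (action kernel "convolved" with values)
        Q = reward[None] + gamma * np.einsum("hwij,aij->ahw", neigh, kernels)
        V = Q.max(axis=0)  # max-pool over the action channels
    return V
```

With a single rewarding cell and deterministic stay/up/down/left/right kernels, the converged values decay geometrically with distance from the reward, as expected of VI.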
In this paper, we propose the eXecuted Latent Value Iteration Network (XLVIN), an implicit planning policy network which embodies the computation of VIN-like models while addressing all of the above issues. As a result, we are able to seamlessly run XLVINs with minimal configuration changes on discrete-action environments, from MDPs with known structure (such as grid-worlds), through pixel-based ones (such as Atari), all the way to fully continuous-state environments, consistently outperforming or matching baseline models which lack XLVIN's inductive biases. To achieve this, we unify recent concepts from several areas of representation learning:

• Using contrastive self-supervised representation learning, we are able to meaningfully infer the dynamics of the MDP, even when it is not provided. In particular, we leverage the work of Kipf et al. (2020) and van der Pol et al. (2020), which uses the TransE model (Bordes et al., 2013) to embed states and actions into vector spaces, in such a way that the effect of action embeddings on the state embeddings is consistent with the true environment dynamics.

• By applying recent advances in graph representation learning (Battaglia et al., 2018; Bronstein et al., 2017; Hamilton et al., 2017), we design a message passing architecture (Gilmer et al., 2017) which can traverse our partially-inferred MDP without imposing strong structural constraints (i.e., our model is not restricted to grid-worlds).

• We better align our planning module with VI by leveraging recent advances in neural algorithm execution, which has shown that GNNs can learn polynomial-time graph algorithms (Xu et al., 2019; Veličković et al., 2019; Yan et al., 2020; Georgiev & Lió, 2020; Veličković et al., 2020) by supervising them to structure their problem-solving process according to a target algorithm. Relevantly, it was shown that GNNs are capable of executing value iteration in supervised learning settings (Deac et al., 2020).
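As a rough illustration of the message-passing idea above, a single step over a rollout graph of latent states might look as follows. The ReLU messages, the elementwise-max aggregation (chosen to mirror the max over actions in VI) and the weight shapes are our own illustrative assumptions, not the paper's exact executor.

```python
import numpy as np

def message_passing_step(h, edges, W_msg, W_upd):
    """One message-passing step over a partially-inferred latent MDP.

    h:     (N, d) node features (latent state embeddings)
    edges: list of (src, dst) pairs in the rollout graph
    W_msg: (2d, d) message weights; W_upd: (2d, d) update weights
    """
    N, d = h.shape
    msgs = np.full((N, d), -np.inf)
    for s, t in edges:
        # ReLU message computed from the endpoint features
        m = np.maximum(0.0, np.concatenate([h[s], h[t]]) @ W_msg)
        msgs[t] = np.maximum(msgs[t], m)  # max aggregation, as in VI
    msgs[np.isinf(msgs).any(axis=1)] = 0.0  # nodes with no incoming edges
    # Update each node from its own features and aggregated messages
    return np.maximum(0.0, np.concatenate([h, msgs], axis=1) @ W_upd)
```

No grid structure is assumed: the edge list can describe an arbitrary graph, which is the property the bullet point relies on.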
To the best of our knowledge, our work represents the first implicit planning architecture powered by concepts from neural algorithmic reasoning, expanding the application space of VIN-like models. While we focus our discussion on VINs and GVINs, which we directly generalise and with which we share key design concepts (such as the VI-based differentiable planning module), there are other lines of research to which our approach can be linked. Significant work has been done on representation learning in RL (Givan et al., 2003; Ferns et al., 2004; 2011; Jaderberg et al., 2017; Ha & Schmidhuber, 2018b; Gelada et al., 2019), often exploiting observed state similarities. Among work on planning in latent spaces (Oh et al., 2017; Farquhar et al., 2018; Hafner et al., 2019; van der Pol et al., 2020), Value Prediction Networks (Oh et al., 2017) and TreeQN (Farquhar et al., 2018) explore ideas similar to ours, with important differences: they use explicit planning algorithms, while XLVINs perform fully implicit planning in the latent space. However, due to the way in which value estimates are represented, the policy network is capable of melding both model-free and model-based cues robustly. Furthermore, while our VI executor provides a representation that aligns with the predictive needs of value iteration, it can also incorporate additional information if this benefits the performance of the model.

2. BACKGROUND

Value iteration Value iteration is a successive approximation method for finding the optimal value function of a discounted Markov decision process (MDP) as the fixed point of the so-called Bellman optimality operator (Puterman, 2014). A discounted MDP is a tuple (S, A, R, P, γ), where s ∈ S are states, a ∈ A are actions, R : S × A → ℝ is a reward function, P : S × A → Dist(S) is a transition function such that P(s′|s, a) is the conditional probability of transitioning to state s′ when the agent executes action a in state s, and γ ∈ [0, 1] is a discount factor which trades off the relevance of immediate and future rewards. In the infinite-horizon discounted setting, an agent sequentially chooses actions according to a stationary Markov policy π : S × A → [0, 1], such that π(a|s) is a conditional probability distribution over actions given a state. The return is defined as G_t = ∑_{k=0}^{∞} γ^k R(s_{t+k}, a_{t+k}). Value functions V^π(s) = E_π[G_t | s_t = s] and Q^π(s, a) = E_π[G_t | s_t = s, a_t = a] represent the expected return induced by a policy in an MDP when conditioned on a state or a state-action pair, respectively. In the infinite-horizon discounted setting, we know that there exists an optimal stationary Markov policy π* such that, for any policy π, it holds that V^{π*}(s) ≥ V^π(s) for all s ∈ S. Furthermore, such an optimal policy can be made deterministic and greedy with respect to the optimal values. Therefore, in order to find π*, it suffices to find the unique optimal value function V as the fixed point of the Bellman optimality operator. Value iteration is in fact the instantiation of the method of successive approximation for finding the fixed point of a contractive operator. The optimal value function V is such a fixed point and satisfies the Bellman optimality equations (Bellman, 1966):

V(s) = max_{a ∈ A} [ R(s, a) + γ ∑_{s′ ∈ S} P(s′|s, a) V(s′) ].   (1)

TransE The TransE (Bordes et al., 2013) loss for embedding objects and relations can be adapted to RL. State embeddings are obtained by an encoder z : S → ℝ^k, and the effect of an action in a given state is modelled by a translation model T : ℝ^k × A → ℝ^k. Specifically, T(z(s), a) is a
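Equation (1) can be iterated directly to its fixed point on a tabular MDP. Below is a minimal NumPy sketch of this successive approximation; the array conventions (R indexed by state and action, P as per-action transition matrices) are our own illustrative choices.

```python
import numpy as np

def value_iteration(R, P, gamma, tol=1e-8):
    """Tabular value iteration: repeat the Bellman optimality update
    V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ].

    R: (S, A) rewards; P: (A, S, S) with P[a, s, s'] = P(s'|s, a);
    gamma in [0, 1).
    """
    V = np.zeros(R.shape[0])
    while True:
        # Expected next-state values for every (state, action) pair
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)  # greedy maximisation over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the Bellman operator is a γ-contraction, the loop converges geometrically to the unique optimal value function from any initialisation.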

