PLANNING FROM PIXELS USING INVERSE DYNAMICS MODELS

Abstract

Learning task-agnostic dynamics models in high-dimensional observation spaces can be challenging for model-based RL agents. We propose a novel way to learn latent world models by learning to predict sequences of future actions conditioned on task completion. These task-conditioned models adaptively focus modeling capacity on task-relevant dynamics, while simultaneously serving as an effective heuristic for planning with sparse rewards. We evaluate our method on challenging visual goal completion tasks and show a substantial increase in performance compared to prior model-free approaches.

1. INTRODUCTION

Deep reinforcement learning has proven to be a powerful and effective framework for solving a diversity of challenging decision-making problems (Silver et al., 2017a; Berner et al., 2019). However, these algorithms are typically trained to maximize a single reward function, ignoring information that is not directly relevant to the task at hand. This way of learning is in stark contrast to how humans learn (Tenenbaum, 2018). Without being prompted by a specific task, humans can still explore their environment, practice achieving imaginary goals, and in so doing learn about the dynamics of the environment. When subsequently presented with a novel task, humans can utilize this learned knowledge to bootstrap learning, a property we would like our artificial agents to have.

In this work, we investigate one way to bridge this gap by learning world models (Ha & Schmidhuber, 2018) that enable the realization of previously unseen tasks. By modeling the task-agnostic dynamics of an environment, an agent can make predictions about how its own actions may affect the environment state without the need for additional samples from the environment. Prior work has shown that by using powerful function approximators to model environment dynamics, training an agent entirely within its own world model can result in large gains in sample efficiency (Ha & Schmidhuber, 2018).

However, learning world models that are both accurate and general has largely remained elusive, with these models experiencing many performance issues in the multi-task setting. The main reason for poor performance is the so-called planning horizon dilemma (Wang et al., 2019): accurately modeling dynamics over a long horizon is necessary to accurately estimate rewards, but performance is often poor when planning over long sequences due to the accumulation of errors.
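To make the accumulation of errors concrete, consider a minimal sketch; the linear dynamics and the slightly biased "learned" model below are invented purely for illustration, but they show how even a small per-step modeling error compounds as the planning horizon grows.

```python
# True dynamics: s' = 0.9 * s + a.  A "learned" model with a small bias.
def true_step(s, a):
    return 0.9 * s + a

def model_step(s, a):
    return 0.92 * s + a  # small per-step modeling error

def rollout_error(horizon, s0=1.0, a=0.5):
    """Gap between the model's open-loop prediction and reality."""
    s_true, s_model = s0, s0
    for _ in range(horizon):
        s_true = true_step(s_true, a)
        s_model = model_step(s_model, a)
    return abs(s_model - s_true)

# The prediction error grows with the planning horizon.
print(rollout_error(1) < rollout_error(10) < rollout_error(50))
```

This is the planning horizon dilemma in miniature: a model that looks accurate one step ahead can still be badly wrong after fifty.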
These modeling errors are especially prevalent in high-dimensional observation spaces, where loss functions that operate on pixels may focus model capacity on task-irrelevant features (Kaiser et al., 2020). Recent work (Hafner et al., 2020; Schrittwieser et al., 2019) has attempted to side-step these issues by learning a world model in a latent space and propagating gradients over multiple time-steps. While these methods are able to learn accurate world models and achieve high performance on benchmark tasks, their representations are usually trained with task-specific information such as rewards, encouraging the model to focus on tracking task-relevant features but compromising its ability to generalize to new tasks.

In this work, we propose to learn powerful, latent world models that can predict environment dynamics when planning for a distribution of tasks. The main contributions of our paper are three-fold: (1) we propose to learn a latent world model conditioned on a goal; (2) we train our latent representation to model inverse dynamics, i.e., sequences of actions that take the agent from one state to another, rather than training it to capture information about reward; and (3) we show that by combining our inverse dynamics model with a prior over action sequences, we can quickly construct plans that maximize the probability of reaching a goal state.

We evaluate our world model on a diverse distribution of challenging visual goals in Atari games and the DeepMind Control Suite (Tassa et al., 2018) to assess both its accuracy and sample efficiency. We find that when planning in our latent world model, our agent outperforms prior model-free methods across most tasks, while providing an order of magnitude better sample efficiency on some tasks.
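The planning step described above, combining a goal-conditioned inverse dynamics model with a prior over action sequences, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the chain environment, the stand-in inverse-dynamics likelihood, and the uniform action prior are all invented for exposition; in practice, learned networks would supply both log-probability terms.

```python
import math
import random

random.seed(0)

# Toy chain environment: the state is an integer, actions move -1 or +1.
ACTIONS = [-1, +1]

def rollout(state, actions):
    """Deterministic toy dynamics, used only to define the stand-in model."""
    for a in actions:
        state += a
    return state

def log_prior(actions):
    """Task-agnostic prior over action sequences (uniform here)."""
    return len(actions) * math.log(1.0 / len(ACTIONS))

def log_inverse_dynamics(actions, start, goal):
    """Stand-in for a learned goal-conditioned inverse-dynamics model:
    sequences that end nearer the goal are assigned higher likelihood."""
    reached = rollout(start, actions)
    return -10.0 * abs(reached - goal)

def plan(start, goal, horizon=4, n_candidates=64):
    """Sample candidate action sequences and keep the one maximizing
    log p(a | s0, g) - log p(a | s0)."""
    best_score, best_seq = -math.inf, None
    for _ in range(n_candidates):
        seq = [random.choice(ACTIONS) for _ in range(horizon)]
        score = log_inverse_dynamics(seq, start, goal) - log_prior(seq)
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq

plan_seq = plan(start=0, goal=2)
print(rollout(0, plan_seq))  # executing the chosen plan reaches the goal
```

With a uniform prior the correction term is constant, but with a learned prior it down-weights sequences the goal-conditioned model rates highly only because they are common, not because they reach the goal.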

2. RELATED WORK

Model-based RL has typically focused on learning powerful forward dynamics models, which are trained to predict the next state given the current state and action. In works such as Kaiser et al. (2020), these models are trained to predict the next state in observation space, often by minimizing L2 distance. While the performance of these algorithms in the low-data regime is often strong, they can struggle to reach the asymptotic performance of model-free methods (Hafner et al., 2020). An alternative approach is to learn a forward model in a latent space, which may be able to avoid modeling irrelevant features and better optimize for long-term consistency. These latent spaces can be trained to maximize mutual information with the observations (Hafner et al., 2020; 2019) or even task-specific quantities like the reward, value, or policy (Schrittwieser et al., 2019). Using a learned forward model, there are several ways that an agent could create a policy.

While forward dynamics models map a state and action to the next state, an inverse dynamics model maps two subsequent states to an action. Inverse dynamics models have been used in various ways in sequential decision making. In exploration, inverse dynamics serves as a way to learn representations of the controllable aspects of the state (Pathak et al., 2017). In imitation learning, inverse dynamics models can be used to map a sequence of states to the actions needed to imitate the trajectory (Pavse et al., 2019). Christiano et al. (2016) use inverse dynamics models to translate actions taken in a simulated environment to the real world.

Recently, there has been an emergence of work (e.g., Ghosh et al., 2020; Schmidhuber, 2019; Srivastava et al., 2019) highlighting the relationship between imitation learning and reinforcement learning. Specifically, rather than learn to map states and actions to reward, as is typical in reinforcement learning, Srivastava et al.
(2019) train a model to predict actions given a state and an outcome, which could be the amount of reward the agent is to collect within a certain amount of time. Ghosh et al. (2020) use a similar idea, predicting actions conditioned on an initial state, a goal state, and the amount of time left to achieve the goal. As explored in Appendix A.1, these methods are perhaps the nearest neighbors to our algorithm. In our paper, we tackle a visual goal-completion task due to its generality and its ability to generate tasks with no domain knowledge. Reinforcement learning with multiple goals has been studied



Figure 1: The network architecture for the inverse dynamics model used in GLAMOR. ResNets are used to encode state features and an LSTM predicts the action sequence.
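As a toy illustration of the input/output signature such a model learns: the real architecture uses ResNet encoders over pixels and an LSTM over latent features, whereas the exact arithmetic inverse below is an invented stand-in for that learned mapping in a trivially invertible world.

```python
# Toy deterministic world: forward dynamics f(s, a) = s + a.
def forward_model(state, action):
    """Forward dynamics: (state, action) -> next state."""
    return state + action

def inverse_model(state, next_state):
    """Inverse dynamics: (state, next state) -> action.
    In GLAMOR this mapping is learned (ResNet encoders + LSTM);
    here it is exact because the toy dynamics are invertible."""
    return next_state - state

# Consistency check: the inverse model recovers each action taken
# along a trajectory generated by the forward model.
s, actions = 0, [1, 1, -1, 1]
recovered = []
for a in actions:
    s_next = forward_model(s, a)
    recovered.append(inverse_model(s, s_next))
    s = s_next
print(recovered)  # equals the original action sequence [1, 1, -1, 1]
```

The autoregressive LSTM in the figure generalizes this single-step signature to whole sequences: given the encoded current state and goal state, it predicts the actions connecting them one at a time.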

