A TEACHER-STUDENT FRAMEWORK TO DISTILL FUTURE TRAJECTORIES

Abstract

By learning to predict trajectories of dynamical systems, model-based methods can make extensive use of all observations from past experience. However, due to partial observability, stochasticity, compounding errors, and irrelevant dynamics, training to predict observations explicitly often results in poor models. Model-free techniques try to side-step the problem by learning to predict values directly. While breaking the explicit dependency on future observations can result in strong performance, this usually comes at the cost of low sample efficiency, as the abundant information about the dynamics contained in future observations goes unused. Here we take a step back from both approaches: Instead of hand-designing how trajectories should be incorporated, a teacher network learns to extract relevant information from the trajectories and to distill it into target activations which guide a student model that can only observe the present. The teacher is trained with meta-gradients to maximize the student's performance on a validation set. Our approach performs well on tasks that are difficult for model-free and model-based methods, and we study the role of every component through ablation studies.

1. INTRODUCTION

The ability to learn models of the world has long been argued to be important for intelligent agents. An open and actively researched question is how to learn world models at the right level of abstraction. This paper argues, as others have before, that model-based and model-free methods lie on a spectrum in which the advantages and disadvantages of either approach can be traded off against each other, and that there is an optimal compromise for every task. Predicting future observations allows extensive use of all observations from previous experience during training, and swift transfer to a new reward if the learned model is accurate. However, due to partial observability, stochasticity, irrelevant dynamics, and compounding errors in planning, model-based methods tend to be outperformed asymptotically (Pong et al., 2018; Chua et al., 2018). On the other end of the spectrum, purely model-free methods use the scalar reward as the only source of learning signal. By avoiding the potentially impossible task of explicitly modeling the environment, model-free methods can often achieve substantially better performance in complex environments (Vinyals et al., 2019; OpenAI et al., 2019). However, this comes at the cost of extreme sample inefficiency, as only predicting rewards throws away useful information contained in the sequences of future observations. What is the right way to incorporate information from trajectories that are associated with the inputs? In this paper we take a step back: Instead of trying to answer this question ourselves by hand-designing what information should be taken into consideration and how, we let a model learn how to make use of the data. Depending on what works well within the setting, the model should learn if and how to learn from the trajectories available at training time.
We adopt a teacher-student setting: a teacher network learns to extract relevant information from the trajectories, and distills it into target activations to guide a student network.¹ A sketch of our approach can be found in Figure 1, next to prototypical computational graphs used to integrate trajectory information in most model-free and model-based methods. Future trajectories can be seen as a form of privileged information (Vapnik and Vashist, 2009), i.e., data that is available at training time and provides additional information, but is not available at test time.
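To make the teacher-student setting concrete, the following is a minimal JAX sketch, not the LDT implementation itself. Everything specific in it is an assumption made for illustration: the function names (`student`, `teacher`, `train_loss`, `student_step`, `meta_loss`, `make_batch`), the single-layer networks with tanh activations, the squared-error losses, the use of a single unrolled SGD step for the student, and the random toy data. It only shows the general pattern described above: the teacher sees the future trajectory and emits target activations, the student sees only the present observation, and the teacher is updated with the gradient of the student's validation loss taken through the student's update.

```python
import jax
import jax.numpy as jnp


def student(s_params, x):
    """Student: observes only the present observation x."""
    h = jnp.tanh(x @ s_params["W1"])  # hidden activations, shape (batch, hidden)
    y_hat = h @ s_params["w2"]        # predicted return, shape (batch,)
    return h, y_hat


def teacher(t_params, traj):
    """Teacher: maps the flattened future trajectory to target activations."""
    return jnp.tanh(traj @ t_params["V"])  # shape (batch, hidden)


def train_loss(s_params, t_params, x, traj, y):
    """Student loss: return regression plus matching the teacher's targets."""
    h, y_hat = student(s_params, x)
    target = teacher(t_params, traj)
    return jnp.mean((y_hat - y) ** 2) + jnp.mean((h - target) ** 2)


def student_step(s_params, t_params, batch, lr=0.1):
    """One SGD step on the student, kept differentiable w.r.t. t_params."""
    grads = jax.grad(train_loss)(s_params, t_params, *batch)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, s_params, grads)


def meta_loss(t_params, s_params, train_batch, val_batch):
    """Validation loss of the updated student, viewed as a function of the teacher."""
    new_s = student_step(s_params, t_params, train_batch)
    x_val, _, y_val = val_batch  # the trajectory is not used at validation time
    _, y_hat = student(new_s, x_val)
    return jnp.mean((y_hat - y_val) ** 2)


def make_batch(key, batch=8, obs_dim=3, n=5):
    """Random toy data (assumption): the return is the sum of trajectory entries."""
    kx, kt = jax.random.split(key)
    x = jax.random.normal(kx, (batch, obs_dim))
    traj = jax.random.normal(kt, (batch, n * obs_dim))
    y = traj.sum(axis=1)
    return x, traj, y


k1, k2, k3, k4, k5 = jax.random.split(jax.random.PRNGKey(0), 5)
s_params = {
    "W1": 0.1 * jax.random.normal(k1, (3, 4)),
    "w2": 0.1 * jax.random.normal(k2, (4,)),
}
t_params = {"V": 0.1 * jax.random.normal(k3, (15, 4))}
train_batch, val_batch = make_batch(k4), make_batch(k5)

# Meta-gradient: derivative of the validation loss w.r.t. the teacher's
# parameters, obtained by differentiating through the student's SGD step.
meta_grads = jax.grad(meta_loss)(t_params, s_params, train_batch, val_batch)
new_t_params = jax.tree_util.tree_map(
    lambda p, g: p - 0.01 * g, t_params, meta_grads
)
```

In this sketch the teacher receives no loss of its own. Its only training signal is how much the targets it produces help the student generalize, which is why the validation batch enters `meta_loss` without its trajectory.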

Contributions

The main contribution of this paper is the proposal of a generic method to extract relevant signal from privileged information, specifically trajectories of future observations. We present an instantiation of this approach called Learning to Distill Trajectories (LDT) and an empirical analysis of it. 

2. RELATED WORK

Efficiently making use of signal from trajectories is an actively researched topic. The technique of bootstrapping in TD-learning (Sutton, 1988) uses future observations to reduce the variance of value function approximations. However, in its basic form, bootstrapping provides learning signal only through a scalar bottleneck, potentially missing out on rich additional sources of learning signal. Another approach to extract additional training signal from observations is the framework of Generalized Value Functions (Sutton et al., 2011), which has been argued to be able to bridge the gap between model-free and model-based methods as well. A similar interpretation can be given to the technique of successor representations (Dayan, 1993). A number of methods have been proposed that try to leverage the strengths of both model-free and model-based methods, among them Racanière et al. (2017), who learn generative models of the environment and fuse predicted rollouts with a model-free network path. In a different line of research, Silver et al. (2017) and Oh et al. (2017) show that value prediction can be improved by incorporating dynamical structure and planning computation into the function approximators. Guez et al. (2019) investigate to what extent agents can learn implicit dynamics models which allow them to solve planning tasks effectively, using only model-free methods. Similarly to LDT, those models can learn their own utility-based state abstractions and can even be temporally abstract to some extent. One difference between these approaches and LDT is that they use reward as their only learning signal, without making direct use of future observations when training the predictor.
The meta-gradient approach presented in this paper can be used more generally for problems in the framework of learning using privileged information (LUPI; Vapnik and Vashist, 2009; Lopez-Paz et al., 2016), where privileged information is additional context about the data that is available at training time but not at test time. Hindsight information such as the trajectories in a value-prediction task falls into this category. There are a variety of representation learning approaches which can learn to extract learning signal from trajectories. Jaderberg et al. (2016) demonstrate that the performance of RL agents can be



¹ Note that the term distillation is often used in the context of "distilling a large model into a smaller one" (Hinton et al., 2015), but in this context we talk about distilling a trajectory into vectors used as target activations.



Figure 1: Comparison of architectures. The data generator is a Markov reward process (no actions) with an episode length of $n$. $x$ denotes the initial observation. $y = \sum_i y_i$ is the $n$-step return (no bootstrapping). $x^* = (x^*_1, x^*_2, \ldots, x^*_n)$ is the trajectory of observations (privileged data). Model activations and predictions are displayed boxed. Losses are displayed as red lines. Solid edges denote learned functions. Dotted edges denote fixed functions.

