ON THE FEASIBILITY OF CROSS-TASK TRANSFER WITH MODEL-BASED REINFORCEMENT LEARNING

Abstract

Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world, and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, in stark contrast to humans, who rely heavily on world understanding and visual cues when learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. By combining offline multi-task pretraining with online cross-task finetuning, we achieve substantial improvements over a baseline trained from scratch; we improve the mean performance of the model-based algorithm EfficientZero by 23%, and by as much as 71% in some instances.

1. INTRODUCTION

Reinforcement Learning (RL) has achieved great feats across a wide range of areas, most notably game-playing (Mnih et al., 2013; Silver et al., 2016; Berner et al., 2019; Cobbe et al., 2020). However, traditional RL algorithms often suffer from poor sample-efficiency and require millions (or even billions) of environment interactions to solve tasks, especially when learning from high-dimensional observations such as images. This is in stark contrast to humans, who have a remarkable ability to quickly learn new skills despite very limited exposure (Dubey et al., 2018). In an effort to reliably benchmark and improve the sample-efficiency of image-based RL across a variety of problems, the Arcade Learning Environment (ALE; Bellemare et al., 2013) has become a long-standing challenge for RL. This task suite has given rise to numerous successful and increasingly sample-efficient algorithms (Mnih et al., 2013; Badia et al., 2020; Kaiser et al., 2020; Schrittwieser et al., 2020; Kostrikov et al., 2021; Hafner et al., 2021; Ye et al., 2021), notably most of which are model-based, i.e., they learn a model of the environment (Ha & Schmidhuber, 2018). Most recently, EfficientZero (Ye et al., 2021), a model-based RL algorithm, has demonstrated impressive sample-efficiency, surpassing human-level performance with as little as 2 hours of real-time game play in select Atari 2600 games from the ALE. This achievement is attributed, in part, to the algorithm concurrently learning an internal model of the environment from interaction, and using the learned model to imagine (simulate) further interactions for planning and policy improvement, thus reducing reliance on real environment interactions for skill acquisition. However, current RL algorithms, including EfficientZero, are still predominantly assumed to learn both perception, model, and skills
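To make the imagined-rollout idea concrete, the following is a deliberately simplified sketch: a learned dynamics model is updated from real transitions and then used to generate synthetic trajectories that supplement real environment interactions. All names here (`ToyWorldModel`, `imagine`, the tabular transition store) are illustrative only; EfficientZero's actual model is a neural latent-dynamics model used inside MCTS planning, not a lookup table.

```python
class ToyWorldModel:
    """Toy learned dynamics model: predicts (next_state, reward)
    from (state, action) using transitions observed in the real
    environment. Purely illustrative, not the paper's method."""

    def __init__(self):
        # (state, action) -> (next_state, reward), learned from real data
        self.transitions = {}

    def update(self, state, action, next_state, reward):
        """Fit the model on one real environment transition."""
        self.transitions[(state, action)] = (next_state, reward)

    def imagine(self, state, policy, horizon):
        """Roll out up to `horizon` steps entirely inside the model,
        producing imagined transitions for policy improvement."""
        trajectory = []
        for _ in range(horizon):
            action = policy(state)
            if (state, action) not in self.transitions:
                break  # model has no prediction for this pair
            next_state, reward = self.transitions[(state, action)]
            trajectory.append((state, action, reward, next_state))
            state = next_state
        return trajectory


if __name__ == "__main__":
    model = ToyWorldModel()
    # Two real transitions on a small chain environment.
    model.update(0, 0, 1, 1.0)
    model.update(1, 0, 2, 1.0)
    # Imagined rollout: costs zero real environment interactions.
    imagined = model.imagine(0, policy=lambda s: 0, horizon=5)
    print(imagined)
```

The key point the sketch illustrates is the asymmetry in cost: `update` consumes scarce real interactions, while `imagine` can be called arbitrarily often to generate additional training signal from the model alone.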



Figure 1. Atari100k score, normalized by mean EfficientZero performance at 100k environment steps across 10 games. Mean of 5 seeds; shaded area indicates 95% CIs.

