ON THE FEASIBILITY OF CROSS-TASK TRANSFER WITH MODEL-BASED REINFORCEMENT LEARNING

Abstract

Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, and stands in stark contrast to humans, who rely heavily on world understanding and visual cues when learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. Through offline multi-task pretraining and online cross-task finetuning, we achieve substantial improvements over a baseline trained from scratch: we improve the mean performance of the model-based algorithm EfficientZero by 23%, and by as much as 71% in some instances.

1. INTRODUCTION

Reinforcement Learning (RL) has achieved great feats across a wide range of areas, most notably game-playing (Mnih et al., 2013; Silver et al., 2016; Berner et al., 2019; Cobbe et al., 2020). However, traditional RL algorithms often suffer from poor sample-efficiency and require millions (or even billions) of environment interactions to solve tasks, especially when learning from high-dimensional observations such as images. This is in stark contrast to humans, who have a remarkable ability to quickly learn new skills despite very limited exposure (Dubey et al., 2018). In an effort to reliably benchmark and improve the sample-efficiency of image-based RL across a variety of problems, the Arcade Learning Environment (ALE; Bellemare et al. (2013)) has become a long-standing challenge for RL. This task suite has given rise to numerous successful and increasingly sample-efficient algorithms (Mnih et al., 2013; Badia et al., 2020; Kaiser et al., 2020; Schrittwieser et al., 2020; Kostrikov et al., 2021; Hafner et al., 2021; Ye et al., 2021), most of which are notably model-based, i.e., they learn a model of the environment (Ha & Schmidhuber, 2018). Most recently, EfficientZero (Ye et al., 2021), a model-based RL algorithm, has demonstrated impressive sample-efficiency, surpassing human-level performance with as little as 2 hours of real-time game play in select Atari 2600 games from the ALE. This achievement is attributed, in part, to the algorithm concurrently learning an internal model of the environment from interaction, and using the learned model to imagine (simulate) further interactions for planning and policy improvement, thus reducing reliance on real environment interactions for skill acquisition.
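The core idea of supplementing real interactions with imagined ones can be illustrated with a toy sketch. Names such as `ToyWorldModel` and `imagined_return` are illustrative assumptions; EfficientZero's actual model is a learned neural network used inside Monte-Carlo tree search, not a lookup table:

```python
class ToyWorldModel:
    """A minimal stand-in for a learned dynamics model: predicts the next
    state and reward for a (state, action) pair. Here the 'model' is simply
    a lookup table fit from observed real transitions."""

    def __init__(self):
        self.transitions = {}  # (state, action) -> (next_state, reward)

    def update(self, state, action, next_state, reward):
        # Fit the model from a real environment interaction.
        self.transitions[(state, action)] = (next_state, reward)

    def imagine(self, state, action):
        # Predict without touching the real environment; fall back to a
        # zero-reward self-loop for unseen (state, action) pairs.
        return self.transitions.get((state, action), (state, 0.0))


def imagined_return(model, state, policy, horizon):
    """Roll the learned model forward for `horizon` steps and sum rewards.
    Planning over such imagined rollouts is what reduces the number of
    real environment interactions needed for policy improvement."""
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model.imagine(state, action)
        total += reward
    return total
```

A planner would score candidate action sequences by their imagined return and act on the best one, querying the real environment only to collect fresh training data for the model.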
However, current RL algorithms, including EfficientZero, are still predominantly assumed to learn perception, model, and skills tabula rasa (from scratch) for each new task. Conversely, humans rely heavily on prior knowledge and visual cues when learning new skills: a study found that human players easily identify visual cues about game mechanics when exposed to a new game, and that human performance is severely degraded if such cues are removed or conflict with prior experiences (Dubey et al., 2018). In related areas such as computer vision and natural language processing, unsupervised, self-supervised, and supervised pretraining on large-scale datasets (Devlin et al., 2019; Brown et al., 2020; Li et al., 2022; Radford et al., 2021; Chowdhery et al., 2022) has emerged as a powerful framework for solving numerous downstream tasks with few samples (Alayrac et al., 2022). This pretraining paradigm has recently been extended to visuo-motor control in various forms, e.g., by leveraging frozen (no finetuning) pretrained representations (Xiao et al., 2022; Parisi et al., 2022) or by finetuning in a supervised setting (Reed et al., 2022; Lee et al., 2022). However, the success of finetuning for online RL has mostly been limited to same-task initialization of model-free policies from offline datasets (Wang et al., 2022; Zheng et al., 2022), or adapting policies to novel instances of a given task (Mishra et al., 2017; Julian et al., 2020; Hansen et al., 2021a), with prior work citing high-variance objectives and catastrophic forgetting as the main obstacles to finetuning representations with RL (Bodnar et al., 2020; Xiao et al., 2022). In this work, we explore whether such positive transfer can be induced with current model-based RL algorithms in an online RL setting, and across markedly distinct tasks. Specifically, we seek to answer the following question: when and how can a model-based RL algorithm such as EfficientZero benefit from pretraining on a diverse set of tasks? We base our experiments on the ALE due to its cues that are easily identifiable to humans despite great diversity in tasks, and identify two key ingredients for model-based adaptation, cross-task finetuning and task alignment, that improve sample-efficiency substantially compared to models learned tabula rasa.
In comparison, a naïve treatment of the finetuning procedure as commonly used in supervised learning (Pan & Yang, 2010; Doersch et al., 2015; He et al., 2020; Reed et al., 2022; Lee et al., 2022) is found to be unsuccessful or outright harmful in an RL context. Based on our findings, we propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models using auxiliary data from other tasks (see Figure 2). Concretely, our framework consists of two stages: (i) offline multi-task pretraining of a world model on an offline dataset from m diverse tasks, and (ii) an online finetuning stage in which the world model is jointly finetuned on the target task and the m offline tasks. By leveraging offline data both in pretraining and finetuning, XTRA overcomes the challenge of catastrophic forgetting. To prevent harmful interference from certain offline tasks, we adaptively re-weight gradient contributions in an unsupervised manner based on similarity to the target task. We evaluate our method and a set of strong baselines extensively across 14 Atari 2600 games from the Atari100k benchmark (Kaiser et al., 2020), which requires algorithms to be extremely sample-efficient. From Figure 1 and Table 1, we observe that XTRA improves sample-efficiency substantially across most tasks, improving mean and median performance of EfficientZero by 23% and 25%, respectively.
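One plausible instantiation of such similarity-based re-weighting is sketched below. The function name `reweight_gradients` and the clipped-cosine heuristic are illustrative assumptions, not the exact XTRA mechanism; gradients are plain Python lists for clarity:

```python
def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def reweight_gradients(target_grad, offline_grads):
    """Down-weight offline tasks whose gradients conflict with the target
    task. Weights are cosine similarities clipped at zero, so a task
    pulling in the opposite direction contributes nothing to the update."""
    weights = [max(0.0, cosine(g, target_grad)) for g in offline_grads]
    combined = list(target_grad)
    for w, g in zip(weights, offline_grads):
        combined = [c + w * gi for c, gi in zip(combined, g)]
    return weights, combined
```

Under this sketch, an offline task whose gradient aligns with the target task contributes fully, while an opposing one is silently dropped, which is one way to realize "preventing harmful interference" without task labels or supervision.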

2. BACKGROUND

Problem setting. We model image-based agent-environment interaction as an episodic Partially Observable Markov Decision Process (POMDP; Kaelbling et al. (1998)) defined by the tuple M = (S, A, T, R, Ω, O, γ), where S is the set of (hidden) states, A the set of actions, T the transition function, R the reward function, Ω the set of image observations, O the observation function, and γ ∈ [0, 1) a discount factor. The goal is to learn a policy that maximizes the expected discounted return.
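The components of such a tuple can be made concrete as a small container type; the field names below are illustrative, not notation from the paper:

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class POMDP:
    """Container for an episodic POMDP tuple. The agent never sees the
    hidden state directly; it only receives observations via `observe`
    (e.g., image frames in the ALE)."""
    states: List[Any]                       # S: underlying (hidden) states
    actions: List[Any]                      # A: available actions
    transition: Callable[[Any, Any], Any]   # T(s, a) -> next state
    reward: Callable[[Any, Any], float]     # R(s, a) -> scalar reward
    observe: Callable[[Any], Any]           # observation of state s
    discount: float                         # gamma in [0, 1)
```

A toy instance makes the partial observability explicit: the policy conditions on `observe(s)` (a rendered frame), not on `s` itself.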



Figure 1. Atari100k score, normalized by mean EfficientZero performance at 100k environment steps across 10 games. Mean of 5 seeds; shaded area indicates 95% CIs.

Figure 2. Model-Based Cross-Task Transfer (XTRA): a sample-efficient online RL framework with scalable pretraining and finetuning of learned world models using auxiliary data from offline tasks.

