INVESTIGATING MULTI-TASK PRETRAINING AND GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

Deep reinforcement learning (RL) has achieved remarkable successes in complex single-task settings. However, designing RL agents that can learn multiple tasks and leverage prior experience to quickly adapt to a related new task remains challenging. Despite previous attempts to improve on these areas, our understanding of multi-task training and generalization in RL remains limited. To fill this gap, we investigate the generalization capabilities of a popular actor-critic method, IMPALA (Espeholt et al., 2018). Specifically, we build on previous work that has advocated for the use of modes and difficulties of Atari 2600 games as a challenging benchmark for transfer learning in RL (Farebrother et al., 2018; Rusu et al., 2022). We do so by pretraining an agent on multiple variants of the same Atari game before fine-tuning it on the remaining never-before-seen variants. This protocol simplifies the multi-task pretraining phase by limiting negative interference between tasks and allows us to better understand the dynamics of multi-task training and generalization. We find that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better. Surprisingly, we also observe that this advantage can still be present after fine-tuning for 200M environment frames, not only in zero-shot transfer, highlighting the lasting effect of a good learned representation. We also find that, even though small networks have remained popular for solving Atari 2600 games, increasing the capacity of the value and policy networks is critical to achieving good performance as we increase the number of pretraining modes and difficulties. Overall, our findings emphasize key points that are essential for efficient multi-task training and generalization in reinforcement learning.

1. INTRODUCTION

Deep RL has achieved remarkable results in recent years, from surpassing human-level performance on challenging games (Silver et al., 2017; Berner et al., 2019; Vinyals et al., 2019) to learning complex control policies that can be deployed in the real world (Levine et al., 2016; Bellemare et al., 2020). However, these successes were attained with specialized agents trained to solve a single task, with every new task requiring a new policy trained from scratch. On the other hand, high-capacity models trained on large amounts of data have shown remarkable generalization abilities in other deep learning domains such as vision and NLP (Brown et al., 2020; He et al., 2022). Such models can solve multiple tasks simultaneously (Chowdhery et al., 2022), show emergent capabilities (Srivastava et al., 2022), and quickly adapt to unseen but related tasks by leveraging previously acquired knowledge (Brown et al., 2020). While recent work has succeeded in learning such broadly generalizing policies using supervised learning (Reed et al., 2022; Lee et al., 2022), despite many attempts, deep RL agents have not been able to achieve the same kind of broad generalization. RL policies pretrained on multiple tasks are often unable to leverage information about previous tasks to accelerate learning on related tasks (Kirk et al., 2021). Furthermore, multi-task agents are reported to perform worse on individual tasks than a single-task agent trained on each task alone (Espeholt et al., 2018). The limited performance of multi-task policies is believed to be due to negative interference between tasks (Schaul et al., 2019). Recent work has tried to address this issue: Hessel et al. (2019) propose a reward rescaling scheme that normalizes the effect of each task on the learning dynamics, and Guo et al. (2020) introduce a self-supervised learning objective to improve representation learning during multi-task training.
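The reward rescaling idea above can be illustrated with a minimal per-task return-normalization sketch in the spirit of PopArt (Hessel et al., 2019); the class, task names, and constants below are our own illustrative choices, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): per-task return normalization
# in the spirit of PopArt (Hessel et al., 2019). Each task tracks running
# statistics of its own returns, and value targets are normalized by them so
# that no single high-reward task dominates the shared network's gradients.
import math


class TaskNormalizer:
    """Tracks a running mean/std of one task's returns via an EMA."""

    def __init__(self, beta=0.99, eps=1e-4):
        self.beta = beta      # EMA decay for the running statistics
        self.eps = eps        # lower bound on the scale, for stability
        self.mean = 0.0       # running first moment of returns
        self.mean_sq = 0.0    # running second moment of returns

    def update(self, g):
        """Fold one observed return `g` into the running statistics."""
        self.mean = self.beta * self.mean + (1.0 - self.beta) * g
        self.mean_sq = self.beta * self.mean_sq + (1.0 - self.beta) * g * g

    @property
    def std(self):
        var = max(self.mean_sq - self.mean ** 2, 0.0)
        return max(math.sqrt(var), self.eps)

    def normalize(self, g):
        """Map a raw return to a roughly zero-mean, unit-scale target."""
        return (g - self.mean) / self.std


# Usage: one normalizer per task. Despite returns of very different raw
# magnitudes, the normalized targets end up on a comparable scale.
norms = {task: TaskNormalizer() for task in ("pong", "breakout")}
for g in [1.0, -1.0, 1.0]:          # small-scale returns
    norms["pong"].update(g)
for g in [400.0, 350.0, 420.0]:     # much larger-scale returns
    norms["breakout"].update(g)

print(abs(norms["pong"].normalize(1.0)) < 10.0)        # → True
print(abs(norms["breakout"].normalize(400.0)) < 10.0)  # → True
```

The full PopArt method additionally rescales the value head's weights when the statistics change, so that the network's outputs are preserved; that step is omitted here for brevity.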
Other works have shown that multi-task pretraining can lead to better representations that are useful for downstream tasks when the policy only needs to generalize to a new reward (Borsa et al., 2016; Yang et al., 2020; Sodhani et al., 2021). Nevertheless, in complex multi-task problems, such as the Arcade Learning Environment (ALE; Bellemare et al., 2013), generalization of pretrained policies to unseen tasks remains an unsolved problem. In this paper, we propose to take a closer look at multi-task RL pretraining and generalization on the ALE, one of the most widely used deep RL benchmarks. Though one might hope that an agent could benefit from multi-task training to learn some high-level concepts such as affordances (Khetarpal et al., 2020) or contingency awareness (Bellemare et al., 2012), the lack of common ground between tasks makes joint training on multiple Atari games a difficult problem. To circumvent these issues, we make use of the modes and difficulties of Atari 2600 games (Figure 1), which we call variants (Machado et al., 2018). These variants were developed by game designers to make each game progressively more challenging for human players. Previous work has advocated for their use to study generalization in reinforcement learning (Farebrother et al., 2018; Rusu et al., 2022), though these works did not go beyond pretraining on a single task. Notably, Farebrother et al. (2018) argued that the representation learned by a DQN agent (Mnih et al., 2013) after pretraining is brittle and does not transfer well to variants of the same game. They find that during pretraining, the representation tends to overfit to the task it is being trained on, decreasing its generalization capabilities. Our goal in this work is to revisit these results in a new light, leveraging the advances in algorithms, architectures, and computation that have happened since.
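The pretrain/fine-tune protocol over variants can be sketched as follows. The helper function, mode/difficulty counts, and split sizes below are hypothetical choices for illustration, not the paper's actual experimental setup; in the ALE, each variant corresponds to a (mode, difficulty) pair exposed by the emulator.

```python
# Illustrative sketch of the evaluation protocol: partition a game's
# (mode, difficulty) variants into a pretraining set and a held-out set
# used only for fine-tuning. Counts and split sizes are hypothetical.
import random


def split_variants(modes, difficulties, n_pretrain, seed=0):
    """Partition all (mode, difficulty) variants into pretrain/held-out sets."""
    variants = [(m, d) for m in modes for d in difficulties]
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(variants)
    return variants[:n_pretrain], variants[n_pretrain:]


# E.g. a game with 4 modes and 2 difficulties has 8 variants; pretrain on 5
# of them and fine-tune on the 3 never-before-seen ones.
pretrain, held_out = split_variants(range(4), range(2), n_pretrain=5)
print(len(pretrain), len(held_out))       # → 5 3
assert not set(pretrain) & set(held_out)  # no leakage between the two phases
```

Keeping the split disjoint is what makes the fine-tuning variants genuinely "never-before-seen", so zero-shot and fine-tuning results measure generalization rather than memorization.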
We use a more efficient algorithm, IMPALA (Espeholt et al., 2018), instead of DQN; much larger networks than the decade-old 3-layer convolutional neural networks used by DQN (Mnih et al., 2013); and we pretrain our agents on multiple variants of a game as opposed to just a single one. By limiting the pretraining tasks to different variants of the same game, we facilitate multi-task transfer between variants, which in turn allows us to study under what conditions RL algorithms are able to generalize better when training on multiple tasks. Our contributions are as follows:

• We find that pretrained policies can achieve zero-shot transfer on variants of the same game. If we then fine-tune these policies using interactions in the unseen variant, the fine-tuned policies generalize quite well and learn significantly faster than a randomly initialized policy.

• We observe that a good representation can be learned by pretraining on a relatively small number of modes. Fine-tuning performance from these representations improves as we increase the amount of pretraining data, as opposed to overfitting on the pretraining tasks.

• Finally, we demonstrate that it is possible to train high-capacity networks such as residual networks (He et al., 2016), with tens of millions of parameters, using online RL. We find that the increased representational capacity of such networks is essential to reach peak performance in the multi-task regime.
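For context, the V-trace off-policy correction at the heart of IMPALA (Espeholt et al., 2018) can be sketched in a few lines. This is a minimal single-trajectory version for illustration only; the function name and argument layout are our own, and a real implementation would be batched and run on accelerators.

```python
# Minimal pure-Python sketch of V-trace value targets (Espeholt et al., 2018),
# the off-policy correction that lets IMPALA's learner consume trajectories
# generated by slightly stale actor policies.


def vtrace_targets(values, rewards, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory.

    values:  V(x_0)..V(x_T)  (length T + 1; the last entry bootstraps)
    rewards: r_0..r_{T-1}    (length T)
    rhos:    pi/mu importance ratios per step (length T)
    """
    T = len(rewards)
    vs = [0.0] * (T + 1)
    vs[T] = values[T]                    # bootstrap from the final value
    for t in reversed(range(T)):
        rho = min(rho_bar, rhos[t])      # clipped importance weight
        c = min(c_bar, rhos[t])          # clipped trace-cutting weight
        delta = rho * (rewards[t] + gamma * values[t + 1] - values[t])
        vs[t] = values[t] + delta + gamma * c * (vs[t + 1] - values[t + 1])
    return vs[:-1]                       # targets for V(x_0)..V(x_{T-1})


# Sanity check: on-policy (all ratios 1) with gamma=1 and zero value
# estimates, V-trace reduces to plain n-step returns.
targets = vtrace_targets(values=[0.0, 0.0, 0.0], rewards=[1.0, 2.0],
                         rhos=[1.0, 1.0], gamma=1.0)
print(targets)  # → [3.0, 2.0]
```

The clipped weights are what keep the targets stable when actor and learner policies diverge: `rho` bounds the size of each temporal-difference correction, while `c` controls how far corrections propagate backwards along the trajectory.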

