POLICY-INDUCED SELF-SUPERVISION IMPROVES REPRESENTATION FINETUNING IN VISUAL RL

Abstract

We study how to transfer representations pretrained on source tasks to target tasks in RL from visual percepts. We analyze two popular approaches: freezing or finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when the pretrained representations perfectly capture the information required to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation. Building on these insights, we propose a self-supervised objective that clusters representations according to the policy they induce, as opposed to traditional representation similarity measures, which are policy-agnostic (e.g., Euclidean norm, cosine similarity). Together with freezing the bottom layers, this objective results in significantly better representations than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.

1. INTRODUCTION

Learning representations via pretraining is a staple of modern transfer learning. Typically, a feature encoder is pretrained on one or a few source task(s). Then it is either frozen (i.e., the encoder stays fixed) or finetuned (i.e., the encoder's parameters are updated) when solving a new downstream task (Yosinski et al., 2014). While whether to freeze or finetune is application-specific, finetuning generally outperforms freezing when there are sufficient (labeled) data and compute. This pretrain-then-transfer recipe has led to many success stories in vision (Razavian et al., 2014; Chen et al., 2021), speech (Amodei et al., 2016; Akbari et al., 2021), and NLP (Brown et al., 2020; Chowdhery et al., 2022). For reinforcement learning (RL), however, finetuning is a costly option as the learning agent needs to collect its own data specific to the downstream task. Moreover, when the source tasks are very different from the downstream task, the first few updates of finetuning can destroy the representations learned on the source tasks, cancelling all potential benefits of transferring from pretraining. For those reasons, practitioners often choose to freeze representations, thus preventing finetuning altogether. But representation freezing has its own shortcomings, especially pronounced in visual RL, where the (visual) feature encoder can be pretrained on existing image datasets such as ImageNet (Deng et al., 2009) or even collections of web images. Such generic but easier-to-annotate datasets are not constructed with downstream (control) tasks in mind, and the pretraining does not necessarily capture important attributes used to solve those tasks. For example, the downstream embodied AI task of navigating around household items (Savva et al., 2019; Kolve et al., 2017) requires knowing the precise size of the objects in the scene.
Yet this information is not required when pretraining on visual object categorization tasks, resulting in what is called negative transfer, where a frozen representation hurts downstream performance. More seriously, even when the (visual) representation needed for the downstream task is known a priori, it is unclear whether learning it from the source tasks and then freezing it should be preferred to finetuning, as shown in Figure 1. On the left two plots, freezing representations (Frozen) underperforms learning representations using only downstream data (De Novo). On the right two plots, we observe the opposite outcome. Finetuning representations (Finetuned) performs well overall, but fails to unequivocally outperform freezing on the rightmost plots.

Contributions When should we freeze representations, when do they require finetuning, and why? This paper answers those questions through several empirical studies on visual RL tasks, ranging from simple game and robotic tasks (Tachet des Combes et al., 2018; Tassa et al., 2018) to photo-realistic Habitat domains (Savva et al., 2019). Our studies highlight two properties of finetuned representations which improve learnability: first, they are more consistent in clustering states according to the actions they induce on the downstream task; second, they are more robust to noisy state observations. Inspired by these empirical findings, we propose PiSCO, a representation finetuning objective which encourages representations to be robust and consistent with respect to the actions they induce.
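The consistency idea above can be illustrated as a self-supervised objective: two views of the same observation should induce (approximately) the same action distribution under the current policy head. Below is a minimal numpy sketch under that interpretation, assuming a discrete action space; the toy linear encoder, policy head, and Gaussian perturbation are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    # KL(p || q) over the last axis (the action dimension).
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

# Hypothetical stand-ins for the feature encoder and policy head.
W_enc = rng.normal(size=(16, 8))  # encoder: 16-dim observation -> 8-dim feature
W_pi = rng.normal(size=(8, 4))    # policy head over 4 discrete actions

def policy(obs):
    # Action distribution induced by the representation of `obs`.
    return softmax(np.tanh(obs @ W_enc) @ W_pi)

def policy_consistency_loss(obs, noise_scale=0.1):
    # Two views of the same observation: clean and perturbed.
    view_a = obs
    view_b = obs + noise_scale * rng.normal(size=obs.shape)
    p_a, p_b = policy(view_a), policy(view_b)
    # Symmetric KL: the representation is "consistent" when both
    # views induce the same action distribution.
    return 0.5 * (kl(p_a, p_b) + kl(p_b, p_a)).mean()

obs = rng.normal(size=(32, 16))   # a batch of 32 toy observations
loss = policy_consistency_loss(obs)
```

Minimizing such a loss compares representations through the policy they induce, rather than through a policy-agnostic metric such as Euclidean distance between feature vectors.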
We also show that visual feature encoders first compute task-agnostic information and then refine this information to be task-specific (i.e., predictive of rewards and/or dynamics), a well-known lesson from computer vision (Yosinski et al., 2014) but, to the best of our knowledge, never demonstrated for RL. We suspect that finetuning with RL destroys the task-agnostic (and readily transferable) information found in the lower layers of the feature encoder, thus cancelling the benefits of transfer. To retain this information, we show how to identify transferable layers, and propose to freeze those layers while adapting the remaining ones with PiSCO. This combination yields excellent results on all testbeds, outperforming both representation freezing and finetuning.
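Mechanically, freezing a subset of layers amounts to masking their gradient updates while the remaining layers continue to adapt. The sketch below shows this in plain numpy; the layer names and the choice of split point are hypothetical, and in practice a deep-learning framework's parameter groups (or per-parameter gradient flags) would serve the same role.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy 3-layer encoder; names and the freeze/adapt split are illustrative.
params = {
    "bottom": rng.normal(size=(16, 16)),  # task-agnostic: keep frozen
    "middle": rng.normal(size=(16, 16)),  # task-specific: adapt
    "head":   rng.normal(size=(16, 4)),   # task-specific: adapt
}
frozen = {"bottom"}

def apply_update(params, grads, lr=1e-2):
    # Gradient step that skips frozen layers entirely.
    return {
        name: w if name in frozen else w - lr * grads[name]
        for name, w in params.items()
    }

# Dummy gradients standing in for those of a finetuning objective.
grads = {name: np.ones_like(w) for name, w in params.items()}
new_params = apply_update(params, grads)
```

After the update, `new_params["bottom"]` is identical to its pretrained value, while the middle and head layers have moved, preserving the transferable bottom-layer features regardless of how noisy the early finetuning gradients are.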

2. RELATED WORKS AND BACKGROUND

Learning representations for visual reinforcement learning (RL) has come a long way since its early days. Testbeds have evolved from simple video games (Koutník et al., 2013; Mnih et al., 2015) to self-driving simulators (Shah et al., 2018; Dosovitskiy et al., 2017), realistic robotics engines (Tassa et al., 2018; Makoviychuk et al., 2021), and embodied AI platforms (Savva et al., 2019; Kolve et al., 2017). Visual RL algorithms have similarly progressed, and can now match human efficiency and performance on the simpler video games (Hafner et al., 2021; Ye et al., 2021) and control complex simulated robots in a handful of hours (Yarats et al., 2022; Laskin et al., 2020; Hansen et al., 2022). In spite of this success, visual representations remain challenging to transfer in RL (Lazaric, 2012). Prior work shows that learned representations can be surprisingly brittle and fail to generalize to minor changes in pixel observations (Witty et al., 2021). This perspective is often studied under the umbrella of generalization (Zhang et al., 2018; Cobbe et al., 2020; Packer et al., 2018) or adversarial RL (Pinto et al., 2017; Khalifa et al., 2020). In part, those issues arise from our limited understanding of what can be transferred in RL and how to learn it. Others have argued for a plethora of representation pretraining objectives, ranging from capturing policy values (Lehnert et al., 2020; Liu et al., 2021) and summarizing states (Mazoure et al., 2020; Schwarzer et al., 2021; Abel et al., 2016; Littman & Sutton, 2001) to disentangled representations (Higgins et al., 2017) and bisimulation metrics (Zhang et al., 2021; Castro & Precup, 2010). Those objectives can also be aggregated in the hope of learning more generic representations (Gelada et al., 2019; Yang & Nachum, 2021). Nonetheless, it remains unclear which of these methods should be preferred to learn generic, transferable representations.



Figure 1: When should we freeze or finetune pretrained representations in visual RL? Reward and success weighted by path length (SPL) transfer curves on MSR Jump, DeepMind Control, and Habitat tasks. Freezing pretrained representations can underperform no pretraining at all (Figures 1a and 1b) or outperform it (Figures 1c and 1d). Finetuning representations is always competitive, but fails to significantly outperform freezing on visually complex domains (Figures 1c and 1d). Solid lines indicate mean over 5 random seeds, shades denote 95% confidence interval. See Section 3.2 for details.

