POLICY-INDUCED SELF-SUPERVISION IMPROVES REPRESENTATION FINETUNING IN VISUAL RL

Abstract

We study how to transfer representations pretrained on source tasks to target tasks in RL from visual percepts. We analyze two popular approaches: freezing and finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when the pretrained representations perfectly capture the information needed to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation. Building on these insights, we propose a self-supervised objective that clusters representations according to the policy they induce, as opposed to traditional, policy-agnostic representation similarity measures (e.g., Euclidean distance, cosine similarity). Together with freezing the bottom layers, this objective yields significantly better representations than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.
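As a concrete illustration of the distinction drawn above, the sketch below contrasts a policy-agnostic similarity (cosine) with a policy-induced one. The `policy_head` module and the choice of a symmetrized KL divergence between induced action distributions are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def cosine_sim(z1, z2):
    # Policy-agnostic: compares embedding vectors directly.
    return F.cosine_similarity(z1, z2, dim=-1)

def policy_induced_sim(policy_head, z1, z2):
    # Policy-induced: two representations count as similar if they
    # induce similar action distributions under the current policy.
    # A symmetrized KL between the induced policies is used here as
    # an illustrative choice; higher return value = more similar.
    logp1 = F.log_softmax(policy_head(z1), dim=-1)
    logp2 = F.log_softmax(policy_head(z2), dim=-1)
    kl_12 = F.kl_div(logp2, logp1, log_target=True, reduction="none").sum(-1)
    kl_21 = F.kl_div(logp1, logp2, log_target=True, reduction="none").sum(-1)
    return -0.5 * (kl_12 + kl_21)

# Toy usage: batched embeddings and a hypothetical discrete policy head.
policy_head = torch.nn.Linear(64, 6)            # 6 actions, illustrative
z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
print(policy_induced_sim(policy_head, z1, z2).shape)  # torch.Size([8])
```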

1. INTRODUCTION

Learning representations via pretraining is a staple of modern transfer learning. Typically, a feature encoder is pretrained on one or a few source task(s). It is then either frozen (i.e., the encoder stays fixed) or finetuned (i.e., the encoder's parameters are updated) when solving a new downstream task (Yosinski et al., 2014). While the choice between freezing and finetuning is application-specific, finetuning generally outperforms freezing when sufficient (labeled) data and compute are available. This pretrain-then-transfer recipe has led to many success stories in vision (Razavian et al., 2014; Chen et al., 2021), speech (Amodei et al., 2016; Akbari et al., 2021), and NLP (Brown et al., 2020; Chowdhery et al., 2022).

For reinforcement learning (RL), however, finetuning is costly because the learning agent must collect its own data on the downstream task. Moreover, when the source tasks differ substantially from the downstream task, the first few finetuning updates can destroy the representations learned on the source tasks, canceling the potential benefits of pretraining. For these reasons, practitioners often choose to freeze representations, thus preventing any finetuning. But representation freezing has its own shortcomings, which are especially pronounced in visual RL, where the visual feature encoder can be pretrained on existing image datasets such as ImageNet (Deng et al., 2009) or even collections of web images. Such generic but easier-to-annotate datasets are not constructed with downstream (control) tasks in mind, so pretraining does not necessarily capture the attributes needed to solve those tasks. For example, the downstream embodied AI task of navigating around household items (Savva et al., 2019; Kolve et al., 2017) requires knowing the precise size of the objects in the scene, yet this information is not required when pretraining on visual object categorization tasks. The result is negative transfer, where a frozen representation hurts downstream performance.

More seriously, even when the visual representation needed for the downstream task is known a priori, it is unclear whether learning it on the source tasks and then freezing it should be preferred to finetuning, as shown in Figure 1. On the left two plots, freezing representations (Frozen) underperforms learning representations using only downstream data (De Novo). On the right two plots, we observe the opposite outcome. Finetuning representations (Finetuned) performs well overall, but fails to unequivocally outperform freezing on the two rightmost plots.
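For readers who prefer code, the three baselines differ only in how the encoder is initialized and whether its parameters receive gradients on the downstream task. Below is a minimal PyTorch sketch; the architecture, the 84x84 input resolution, and the `pretrained_weights` stand-in are illustrative assumptions.

```python
import torch.nn as nn

def make_encoder():
    # Hypothetical conv encoder for 84x84 RGB observations (illustrative).
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 20 * 20, 256),
    )

# Stand-in for weights obtained by pretraining on source tasks.
pretrained_weights = make_encoder().state_dict()

# De Novo: random initialization, trained only on downstream data.
de_novo = make_encoder()

# Frozen: pretrained weights, never updated on the downstream task.
frozen = make_encoder()
frozen.load_state_dict(pretrained_weights)
for p in frozen.parameters():
    p.requires_grad = False

# Finetuned: pretrained weights, updated with downstream data.
finetuned = make_encoder()
finetuned.load_state_dict(pretrained_weights)
```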

