LIGHT-WEIGHT PROBING OF UNSUPERVISED REPRESENTATIONS FOR REINFORCEMENT LEARNING

Abstract

Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is computationally intensive and has high-variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with actual downstream control performance on the Atari100k benchmark. This provides a better method for exploring the space of pretraining algorithms without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective. Code will be released upon acceptance.

1. INTRODUCTION

Learning visual representations is a critical step towards solving many kinds of tasks, from supervised tasks such as image classification or object detection, to reinforcement learning (RL). Ever since the early successes of deep reinforcement learning (Mnih et al., 2015), neural networks have been widely adopted to solve pixel-based reinforcement learning tasks such as arcade games (Bellemare et al., 2013), physical continuous control (Todorov et al., 2012; Tassa et al., 2018), and complex video games (Synnaeve et al., 2018; Oh et al., 2016). However, learning deep representations directly from rewards is challenging, since this learning signal is often noisy, sparse, and delayed. With ongoing progress in unsupervised visual representation learning for vision tasks (Zbontar et al., 2021; Chen et al., 2020a;b; Grill et al., 2020; Caron et al., 2020; 2021), recent efforts have likewise applied self-supervised techniques and ideas to improve representation learning for RL. Some promising approaches include supplementing the RL loss with self-supervised objectives (Laskin et al., 2020; Schwarzer et al., 2021a), or first pre-training the representations on a corpus of trajectories (Schwarzer et al., 2021b; Stooke et al., 2021). However, the diversity of the settings considered, as well as of the self-supervised methods used, makes it difficult to identify the core principles of successful self-supervised methods in RL. Moreover, estimating the performance of RL algorithms is notoriously challenging (Henderson et al., 2018; Agarwal et al., 2021): it often requires repeating the same experiment with different random seeds, and most online RL methods require a high CPU-to-GPU ratio, which is inefficient for typical research compute clusters. This hinders systematic exploration of the many design choices that characterize SSL methods.
In this paper, we strive to provide a reliable and lightweight evaluation scheme for unsupervised visual representations in the context of RL. Inspired by the vision community, we propose to evaluate the representations using linear probing, i.e., by training a linear prediction head on top of frozen features. We devise two probing tasks that we deem widely applicable: predicting the reward in a given state, and predicting the action that would be taken by a fixed policy (for example, that of an expert) in a given state. We stress that these probing tasks are only used as a means of evaluation. Because
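As a rough sketch of this protocol (the function names, hyperparameters, and use of NumPy here are illustrative assumptions, not the paper's actual implementation), the two probes reduce to fitting linear heads on frozen encoder features: ridge regression for the scalar reward target, and a softmax classifier for the expert's discrete action.

```python
import numpy as np

def probe_reward(features, rewards, l2=1e-2):
    """Reward probe: closed-form ridge regression of per-state reward
    on frozen features. Returns the linear head and training MSE."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ rewards)
    mse = float(np.mean((X @ W - rewards) ** 2))
    return W, mse

def probe_action(features, actions, n_actions, lr=0.1, steps=500):
    """Action probe: linear softmax classifier predicting the fixed
    policy's action, trained by full-batch gradient descent on the
    cross-entropy loss. Returns the head and training accuracy."""
    X = np.hstack([features, np.ones((len(features), 1))])
    W = np.zeros((X.shape[1], n_actions))
    Y = np.eye(n_actions)[actions]  # one-hot targets
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - Y) / len(X)  # softmax cross-entropy gradient
    acc = float(np.mean((X @ W).argmax(axis=1) == actions))
    return W, acc
```

Because the encoder is frozen and only these linear heads are trained, a probe run costs a fraction of a full RL evaluation; held-out probe error then serves as the proxy score for the representation.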

